In the world of Big Data, your first program isn't "Hello, World!"; it's Word Count, because it's easy to understand and conceptualize. In Java MapReduce, Word Count takes about four pages of code. In Scala/Spark, it can be done in a single line.
val changeFile = sc.textFile("/opt/spark/CHANGES.txt")
val changeFileLower = changeFile.map(_.toLowerCase)
val changeFlatMap = changeFileLower.flatMap("[a-z]+".r findAllIn _)
val changeMR = changeFlatMap.map(word => (word,1)).reduceByKey(_ + _)
changeMR.take(10)
Array[(String, Int)] = Array((actors,1), (decline,3), (findtaskfromlist,1), (biswal,4), (greater,2), (runner,1), (counted,1), (order,15), (logwarning,1), (clientbase,1))
So what's going on in this Word Count example?
- Create an RDD from the Spark CHANGES.txt file.
- Convert every line to lowercase so that the same word isn't counted separately in different cases.
- Use a regular expression inside flatMap to split each line into words, keeping only runs of the letters a-z.
- Perform the actual MapReduce: map each word to a (word,1) tuple, then let reduceByKey sum the counts, producing a final (word,count) tuple for each distinct word (see the sketch after this list).
- Finally, the take action kicks off evaluation (Spark is a lazy evaluator, so none of the transformations have run yet) and returns the first 10 results.
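To see the map and reduceByKey steps in isolation, here is a minimal sketch that uses a tiny in-memory collection instead of a file; the sample words and the variable names (words, pairs, counts) are purely illustrative. Nothing actually executes until the collect action on the last line, which is the lazy evaluation mentioned above.
val words = sc.parallelize(Seq("order", "actors", "order", "decline", "order"))
val pairs = words.map(word => (word, 1))   // (order,1), (actors,1), (order,1), ...
val counts = pairs.reduceByKey(_ + _)      // sums the 1s for each distinct word
counts.collect()
// e.g. Array((actors,1), (decline,1), (order,3)) -- element order may vary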
Because we're using a functional, object-oriented language (Scala), you can actually write the entire Word Count program in a single line if you really want to. Spark evaluates the chained version in exactly the same way; it's just more difficult for humans to read.
val changeFile = sc.textFile("/opt/spark/CHANGES.txt").map(_.toLowerCase).flatMap("[a-z]+".r findAllIn _).map(word => (word,1)).reduceByKey(_ + _).take(10)
changeFile: Array[(String, Int)] = Array((actors,1), (decline,3), (findtaskfromlist,1), (biswal,4), (greater,2), (runner,1), (counted,1), (order,15), (logwarning,1), (clientbase,1))
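If you like the chained style but want to keep it readable, one option is to put one transformation per line; wrapping the whole expression in parentheses lets spark-shell keep reading until the chain is complete. A sketch of that layout follows (the topCounts name is just illustrative):
// Same chain as above, split one transformation per line for readability.
val topCounts = (sc.textFile("/opt/spark/CHANGES.txt")
  .map(_.toLowerCase)
  .flatMap("[a-z]+".r findAllIn _)
  .map(word => (word, 1))
  .reduceByKey(_ + _)
  .take(10))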