spark, scala

Spark Word Count

In the world of Big Data, your first program isn't "Hello, World!", it's Word Count, because it's easy to understand and conceptualize. In Java MapReduce, this would take about 4 pages of code. In Scala/Spark, it can be done in a single line.

val changeFile = sc.textFile("/opt/spark/CHANGES.txt")
val changeFileLower =
val changeFlatMap = changeFileLower.flatMap("[a-z]+".r findAllIn _)
val changeMR = => (word,1)).reduceByKey(_ + _)

Array[(String, Int)] = Array((actors,1), (decline,3), (findtaskfromlist,1), (biswal,4), (greater,2), (runner,1), (counted,1), (order,15), (logwarning,1), (clientbase,1))

So what's going on in this word count sample of code?

  1. Create a variable pointing to the Spark CHANGES.txt file.
  2. Convert all lines to lower case for consistency's sake.
  3. Perform a regex to strip out all characters except a-z within a flatMap.
  4. Perform the actual MapReduce by taking each word, and creating a (word,1) tuple. Then the reduceByKey function will perform the reduction, resulting in a final (word,count) tuple.
  5. Finally, the take action kicks off the evaluation process (since Spark is a lazy evaluator, none of the data transforms were performed yet), and returns the first 10 results.

Because we're using a functional object oriented language (Scala), you can actually write the entire Word Count program in a single line, if you really wanted to. The code is interpreted by the optimizer in exactly the same way, but it's more difficult for humans to read.

val changeFile = sc.textFile("/opt/spark/CHANGES.txt").map(_.toLowerCase).flatMap("[a-z]+".r findAllIn _).map(word => (word,1)).reduceByKey(_ + _).take(10)

changeFile: Array[(String, Int)] = Array((actors,1), (decline,3), (findtaskfromlist,1), (biswal,4), (greater,2), (runner,1), (counted,1), (order,15), (logwarning,1), (clientbase,1))
Author image

About James Conner

Scuba dive master, wildlife photographer, anthropologist, programmer, electronics tinkerer and big data expert.