Spark Word Count

In the world of Big Data, your first program isn't "Hello, World!", it's Word Count, because it's easy to understand and conceptualize. In Java MapReduce, this would take about 4 pages of code. In Scala/Spark, it can be done in a single line.

val changeFile = sc.textFile("/opt/spark/CHANGES.txt")
val changeFileLower = changeFile.map(_.toLowerCase)
val changeFlatMap = changeFileLower.flatMap("[a-z]+".r findAllIn _)
val changeMR = changeFlatMap.map(word => (word,1)).reduceByKey(_ + _)
changeMR.take(10)

Array[(String, Int)] = Array((actors,1), (decline,3), (findtaskfromlist,1), (biswal,4), (greater,2), (runner,1), (counted,1), (order,15), (logwarning,1), (clientbase,1))

So what's going on in this word count sample of code?

Create a variable pointing to the Spark CHANGES.txt file.
Convert all lines to lower case for consistency's sake.
Perform a regex to strip out all characters except a-z within a flatMap.
Perform the actual MapReduce by taking each word, and creating a (word,1) tuple. Then the reduceByKey function will perform the reduction, resulting in a final (word,count) tuple.
Finally, the take action kicks off the evaluation process (since Spark is a lazy evaluator, none of the data transforms were performed yet), and returns the first 10 results.

Because we're using a functional object oriented language (Scala), you can actually write the entire Word Count program in a single line, if you really wanted to. The code is interpreted by the optimizer in exactly the same way, but it's more difficult for humans to read.

val changeFile = sc.textFile("/opt/spark/CHANGES.txt").map(_.toLowerCase).flatMap("[a-z]+".r findAllIn _).map(word => (word,1)).reduceByKey(_ + _).take(10)

changeFile: Array[(String, Int)] = Array((actors,1), (decline,3), (findtaskfromlist,1), (biswal,4), (greater,2), (runner,1), (counted,1), (order,15), (logwarning,1), (clientbase,1))

Spark Word Count

About James Conner

Create an Apache Spark cluster using Raspberry Pi 2 Nodes

Controlling NVIDIA GPU Fans on a headless Ubuntu system

Google's 'Coral' Edge TPU Dev Boards

Google's 'Coral' Edge TPU Accelerator

Spark Word Count

About James Conner

Subscribe to My Areas of Expertise

Create an Apache Spark cluster using Raspberry Pi 2 Nodes

Controlling NVIDIA GPU Fans on a headless Ubuntu system

Google's 'Coral' Edge TPU Dev Boards

Google's 'Coral' Edge TPU Accelerator