Create an Apache Spark cluster using Raspberry Pi 2 Nodes

James Conner

07 Jul 2015 • 6 min read

Apache Spark on Raspberry Pi Cluster

Everyone talks about Spark in the context of "Big Data", but I've been having fun with it on a tiny platform: Raspberry Pi 2 (RP2) micro computers! Though the specs of the RP2 are modest, you can easily run Spark on a single node ... but where's the fun in that? Creating a small cluster out of multiple RP2 nodes and running distributed is far more entertaining!

The steps I'll cover in this guide are:

My Cluster Hardware List
Set Up the RP2 OS / Firmware
Install Spark
Configure Spark
Start Spark
Spark Web Service
Run the Spark Shell
Run a Distributed Word Count

1. My Cluster Hardware List

4 x Raspberry Pi 2 (RP2)
4 x Edimax Wireless N USB Adapter
4 x Samsung 64GB Micro SDXC
1 x Amazon 7-port USB2 Hub
1 x 5" USB-A to Micro USB-B Cables
1 x Stacked Acrylic Case
1 x NAS/CIFS Mount Point for Data

Notes about the Hardware List

Raspberry Pi 2: Because this cluster runs Spark, I chose the Raspberry Pi 2 over the earlier models. The RP2 has 1GB of RAM, whereas the earlier models have 256MB or 512MB.
Edimax Wireless N USB Adapter: This is an optional device. I preferred to use wireless so my cluster could be mobile. If you prefer not to use the wireless adapter, the RP2 comes with an integrated 10/100 Mbps Ethernet (RJ-45) port, but then you would need a network switch or hub.
Samsung 64GB Micro SDXC: You can use a 32GB card instead, but if you use your cluster in earnest, you may discover that 32GB isn't enough. At the time, the Samsung 64GB was the cheapest SDXC card available.
Amazon 7-port USB2 Hub: The hub is only needed to act as a power distribution unit. No data is transferred over USB, so for this purpose there is no functional difference between a USB2 and a USB3 hub.
Stacked Acrylic Case: Another optional component, but it does make it much easier to have all of the RP2 nodes contained within a single stacked case.

2. Set Up the Raspberry Pi 2 Operating System / Firmware

While I'm sure the NOOBS installer works just fine, I prefer the old fashioned method of writing the Raspbian image to the SD card via a tool like Win32DiskImager on Windows, or dd on Linux/OSX. Check out the Raspbian Installation Guide for the specifics regarding your OS.

After you get the image written to the SD card, hook up a keyboard and monitor, then log in to continue with the Raspberry Pi configuration wizard (raspi-config). You'll definitely want to expand the filesystem, change the hostname and enable the SSH daemon. One thing you won't be able to do in this wizard is set up the wireless adapter; you'll have to do that after the reboot. Once you've rebooted, check out the Raspbian Wireless Configuration guide to complete the network setup. After confirming that you can connect via SSH, you can disconnect the keyboard and monitor.

After you install the OS and go through the configuration steps, it's generally a good idea to update and upgrade your OS, as well as your firmware. Fortunately, this is easily achieved with a few simple commands which can be found in the Raspbian Update Guide.

Raspbian OS Update/Upgrade Commands

sudo apt-get update
sudo apt-get upgrade

Raspberry Pi 2 Firmware Update Command

sudo rpi-update

3. Install Spark

Installing Spark is a very simple task; it comes as a compressed tarball, which is ready to go after being unzipped. Personally, I like to perform a few extra steps after unzipping the tarball, just to follow best practices in a Linux environment. As an additional note, since most of these commands have to be run against all nodes in the cluster, one of the easiest ways to accomplish this is to use a terminal program capable of sending input to multiple servers at once. There are a lot of programs that can do this, but my personal favorite is MobaXterm in MultiExec mode.
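
If you don't have a multi-exec terminal handy, a plain SSH loop can fan a command out to every node. The following is only a sketch: `run_on_nodes`, `NODES` and `DRY_RUN` are names made up for illustration, the host names are the placeholders used later in this guide, and it assumes you can already SSH into each node from wherever you run it.

```shell
# Hypothetical helper: run one command on every node over SSH.
# NODES and DRY_RUN are illustrative names, not Spark or Raspbian settings.
NODES="${NODES:-host1 host2 host3 host4}"

run_on_nodes() {
    for node in $NODES; do
        if [ "${DRY_RUN:-0}" = "1" ]; then
            # Only print what would be run, without touching the network.
            echo "ssh $node $*"
        else
            ssh "$node" "$@"
        fi
    done
}

# Preview the commands before running them for real.
DRY_RUN=1 run_on_nodes sudo apt-get update
```

Drop the `DRY_RUN=1` prefix to execute for real. Commands that stop to ask for input (a sudo password prompt, for example) don't work well this way, so the loop suits non-interactive commands best.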

a. We'll start off by creating a user that will run the two types of Spark daemons: master and slaves. This ensures that there's a separation of roles in the Linux environment, in case of a potential security vulnerability. Naturally, after creating the user, you need to give it a password.

sudo groupadd -g 5000 spark
sudo useradd -g 5000 -u 5000 -m spark
sudo passwd spark

b. For the sake of simplicity, we'll create an SSH key for the spark user so that the master can start the daemons on the other nodes without prompting for passwords. On the node that will be your master, run the following commands as the spark user (leave the passphrase empty if you want fully passwordless logins):

ssh-keygen -C "spark@raspi" -b 2048 -t rsa
ssh-copy-id spark@host2
ssh-copy-id spark@host3
ssh-copy-id spark@host4
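
Before moving on, it can be worth confirming that key-based login really works, since a leftover password prompt will get in the way when the start scripts run later. This is a minimal sketch, not part of Spark: `check_ssh` is a hypothetical helper name, and `SSH_CMD` is just a hook so the function can be exercised without a network. `BatchMode` and `ConnectTimeout` are standard OpenSSH client options.

```shell
# Hypothetical check: confirm key-based login works for the spark user.
# SSH_CMD is an illustrative override; it defaults to the real ssh client.
check_ssh() {
    for node in "$@"; do
        # BatchMode=yes makes ssh fail instead of prompting for a password.
        if ${SSH_CMD:-ssh} -o BatchMode=yes -o ConnectTimeout=5 "spark@$node" true 2>/dev/null; then
            echo "$node: ok"
        else
            echo "$node: FAILED"
        fi
    done
}

# Usage, as the spark user on the master:
#   check_ssh host2 host3 host4
```

Any node reported as FAILED still needs its key copied (or has a network problem).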

c. Next, let's grab a copy of the Spark tarball from the Spark download site using a simple wget command. Extract the files to /opt/, create a link pointing /opt/spark to the specific Spark directory, and change the owner and group to the newly created spark user/group. The current version at the time of this post is Spark 1.4.0.

wget http://d3kbcqa49mib13.cloudfront.net/spark-1.4.0-bin-hadoop2.6.tgz
sudo tar -zxf spark-1.4.0-bin-hadoop2.6.tgz -C /opt/
sudo ln -s /opt/spark-1.4.0-bin-hadoop2.6 /opt/spark
sudo chown -R spark:spark /opt/spark-1.4.0-bin-hadoop2.6
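
The point of the /opt/spark link is that every later path in this guide stays the same if you install a different Spark version; you only repoint the link. Here is a small demonstration of that idea in a scratch directory, so nothing under /opt is touched. The `spark-X.Y.Z` directory is a placeholder, not a real release name.

```shell
# Demonstration in a scratch directory; nothing under /opt is touched.
demo=$(mktemp -d)
mkdir "$demo/spark-1.4.0-bin-hadoop2.6" "$demo/spark-X.Y.Z"   # X.Y.Z is a placeholder

# Initial install: the stable path points at the 1.4.0 directory.
ln -s "$demo/spark-1.4.0-bin-hadoop2.6" "$demo/spark"
readlink "$demo/spark"

# Later upgrade: -f replaces the existing link, -n stops ln from
# following the old link and creating the new one inside its target.
ln -sfn "$demo/spark-X.Y.Z" "$demo/spark"
readlink "$demo/spark"
```

On the real cluster the same `ln -sfn` form would need sudo, just like the original `ln -s` command.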

4. Configure Spark

We'll keep the Spark configuration simple to get up and running quickly. There are a lot of options, but for our purpose, we just need to declare the master node, the memory size to be used by the slaves, and the list of slaves. The commands in this section should be run on the master node as the spark user.

a. Make a copy of the /opt/spark/conf/spark-env.sh.template file to /opt/spark/conf/spark-env.sh

cp /opt/spark/conf/spark-env.sh.template /opt/spark/conf/spark-env.sh

b. Edit the /opt/spark/conf/spark-env.sh file using your preferred editor (vi and pico are installed in Raspbian by default), and at the bottom of the file, add the parameter to declare the master node.

SPARK_MASTER_IP=<yourIP>

c. In the same /opt/spark/conf/spark-env.sh file, again at the bottom, add the parameter to declare the memory size to be used by the slaves|workers.

SPARK_WORKER_MEMORY=768m
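
Steps b and c add these lines by hand. If you end up re-running the setup, a small helper that sets each value exactly once keeps the file tidy. This is a sketch under a few assumptions: `set_env_var` is a hypothetical name, it relies on GNU sed's `-i` option (which Raspbian ships), and the IP address below is a documentation-range example rather than a real node.

```shell
# Hypothetical helper: set KEY=VALUE in a shell-style config file exactly once.
set_env_var() {
    local file=$1 key=$2 value=$3
    if grep -q "^${key}=" "$file" 2>/dev/null; then
        # Replace the existing assignment in place (GNU sed).
        sed -i "s|^${key}=.*|${key}=${value}|" "$file"
    else
        echo "${key}=${value}" >> "$file"
    fi
}

# Try it on a scratch file rather than the real spark-env.sh.
conf=$(mktemp)
set_env_var "$conf" SPARK_MASTER_IP 192.0.2.10    # example IP, substitute your own
set_env_var "$conf" SPARK_WORKER_MEMORY 512m
set_env_var "$conf" SPARK_WORKER_MEMORY 768m      # second call overwrites, no duplicate
cat "$conf"
```

Pointing the helper at /opt/spark/conf/spark-env.sh instead of the scratch file would apply the same two settings used in this guide.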

d. Create the /opt/spark/conf/slaves file using your preferred editor (vi and pico are installed in Raspbian by default). In the file, just put the hostnames of all your RP2 nodes. Example:

host1
host2
host3
host4

e. Secure Copy the files that were just modified/created to all nodes in the cluster:

for i in $(cat /opt/spark/conf/slaves); do scp /opt/spark/conf/slaves "$i":/opt/spark/conf/slaves; done
for i in $(cat /opt/spark/conf/slaves); do scp /opt/spark/conf/spark-env.sh "$i":/opt/spark/conf/spark-env.sh; done
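
If your slaves file picks up comments or blank lines over time, a plain `cat` loop will treat them as host names. A slightly more defensive version might look like the sketch below. `push_conf` and `DRY_RUN` are hypothetical names introduced for illustration; the function only assumes one host name per line.

```shell
# Hypothetical helper: copy one config file to every host listed in a slaves file.
push_conf() {
    local slaves_file=$1 conf_file=$2 host
    # Drop comments and blank lines so they are not treated as host names.
    grep -v -e '^[[:space:]]*#' -e '^[[:space:]]*$' "$slaves_file" |
    while read -r host; do
        if [ "${DRY_RUN:-0}" = "1" ]; then
            # Only print the copy command that would be run.
            echo "scp $conf_file $host:$conf_file"
        else
            # </dev/null keeps scp from consuming the rest of the host list.
            scp "$conf_file" "$host:$conf_file" </dev/null
        fi
    done
}

# Usage, as the spark user on the master:
#   DRY_RUN=1 push_conf /opt/spark/conf/slaves /opt/spark/conf/spark-env.sh
#   push_conf /opt/spark/conf/slaves /opt/spark/conf/spark-env.sh
```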

5. Start Spark

There are a couple of ways that Spark can be started, but I prefer to start the master and slave processes independently. To do this, you'll want to be on the master node as the spark user.

/opt/spark/sbin/start-master.sh

After starting up the master, I like to check out the log files, just to make sure there's nothing unusual or out of place.

less /opt/spark/logs/spark-*Master*.out

If everything looks good, proceed to starting up the slaves|workers.

/opt/spark/sbin/start-slaves.sh

And again, checking the logs is generally a good idea, but since there is one slave|worker spawned per node, you'll have to read the log on each node separately.

less /opt/spark/logs/spark-*Worker*.out
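
Paging through a log on every node gets tedious, so a quick filter for the lines that usually matter can help. The sketch below is an assumption-laden shortcut rather than a Spark tool: `scan_logs` is a hypothetical name, and it simply looks for the strings ERROR, WARN and Exception, which is a rough heuristic.

```shell
# Hypothetical helper: print suspicious lines from one or more log files.
scan_logs() {
    # -H prefixes each match with its file name; -E enables the alternation.
    grep -H -E 'ERROR|WARN|Exception' "$@" || echo "no warnings or errors found"
}

# Usage on the local node:
#   scan_logs /opt/spark/logs/spark-*.out
# Usage against another node (host2 is a placeholder):
#   ssh spark@host2 "grep -H -E 'ERROR|WARN|Exception' /opt/spark/logs/spark-*Worker*.out"
```

A clean result only means none of those strings appeared, so it's still worth reading a full log at least once.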

6. Spark Web Service

You'll probably want to take a look at the Spark Web Interface ... this will tell you which slaves|workers are connected to the master, the active applications as well as the application history.

Open up a web browser and point to http://<master node>:8080. Be sure to verify that the workers you're expecting are logged into the master. If you don't see 'em, you'll need to start troubleshooting.
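
If you'd rather check from a terminal, the same information can be counted from the command line. Treat the following as a sketch built on assumptions: it presumes the master web UI serves a JSON status page at /json and that each worker entry carries a "state" field with the value ALIVE, both of which you should verify against your Spark version. `count_alive_workers` is a hypothetical helper name.

```shell
# Hypothetical helper: count workers reported as ALIVE in the master's JSON status.
# Reads JSON on stdin; the "state" field name is an assumption to verify.
count_alive_workers() {
    grep -o '"state"[[:space:]]*:[[:space:]]*"ALIVE"' | wc -l | tr -d ' '
}

# Usage (needs curl and a reachable master):
#   curl -s "http://<master node>:8080/json" | count_alive_workers
```

With the four-node layout in this guide you would hope to see 4; anything less means a worker didn't register.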

7. Run the Spark Shell

Finally, we get to the Spark shell! This is where it comes down to personal preference and what equipment you have available. Personally, I like running the Spark shell on a completely separate machine ... a Windows laptop, a Linux VM, etc. The reason for this is that I don't want the Spark driver taking away memory from the worker processes. Wherever you choose to execute the Spark shell, the command is the same:

/opt/spark/bin/spark-shell --master spark://<master node>:7077

8. Run a Distributed Word Count

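Just to make sure everything is working as planned, try a distributed word count in Spark. Because the example reads a local filesystem path, the file needs to exist at that same path on every worker node and on the machine running the shell; /opt/spark/CHANGES.txt is present on each RP2 node because Spark was extracted to the same location everywhere.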

val changeFile = sc.textFile("/opt/spark/CHANGES.txt")
changeFile: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[7] at textFile at :21

val changeFileLower = changeFile.map(_.toLowerCase)
changeFileLower: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[8] at map at :23

val changeFlatMap = changeFileLower.flatMap("[a-z]+".r findAllIn _)
changeFlatMap: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[9] at flatMap at :25

val changeMR = changeFlatMap.map(word => (word,1)).reduceByKey(_ + _)
changeMR: org.apache.spark.rdd.RDD[(String, Int)] = ShuffledRDD[11] at reduceByKey at :27

changeMR.take(10)
Array[(String, Int)] = Array((actors,1), (decline,3), (findtaskfromlist,1), (biswal,4), (greater,2), (runner,1), (counted,1), (order,15), (logwarning,1), (clientbase,1))
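
To sanity-check the numbers coming back from the cluster, you can run the same three steps (lowercase, extract runs of a-z, count) locally with standard Unix tools. This is a sketch for cross-checking only: `word_count` is a hypothetical function name, and it handles ASCII letters only, so counts could differ from Spark's on text containing non-ASCII characters.

```shell
# Local word count mirroring the Spark steps: lowercase, extract [a-z]+ runs, count.
word_count() {
    LC_ALL=C tr 'A-Z' 'a-z' < "$1" |
        LC_ALL=C grep -oE '[a-z]+' |
        LC_ALL=C sort | uniq -c | sort -k1,1nr -k2
}

# Usage: most frequent words first, then look up a word seen in the Spark output.
#   word_count /opt/spark/CHANGES.txt | head
#   word_count /opt/spark/CHANGES.txt | grep -w order
```

If a word's local count matches what `changeMR` reports, the distributed job is reading and reducing the whole file.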

Congratulations! You're now running Spark on a Raspberry Pi cluster!