Everyone talks about Spark in the context of "Big Data", but I've been having fun with it on a tiny platform: Raspberry Pi 2 (RP2) micro computers! Though the specs of the RP2 are modest, you can easily run Spark on a single node ... but where's the fun in that? Creating a small cluster out of multiple RP2 nodes and running distributed is far more entertaining!
The steps I'll cover in this guide are:
- My Cluster Hardware List
- Set Up the RP2 OS / Firmware
- Install Spark
- Configure Spark
- Start Spark
- Spark Web Service
- Run the Spark Shell
- Run a Distributed Word Count
1. My Cluster Hardware List
4 x Raspberry Pi 2 (RP2)
4 x Edimax Wireless N USB Adapter
4 x Samsung 64GB Micro SDXC
1 x Amazon 7-port USB2 Hub
1 x 5" USB-A to Micro USB-B Cables
1 x Stacked Acrylic Case
1 x NAS/CIFS Mount Point for Data
Notes about the Hardware List
Raspberry Pi 2: Since I planned to run Spark on this cluster, I chose the Raspberry Pi 2 over the earlier models. The RP2 has 1GB of RAM, compared with 256MB or 512MB on the earlier models.
Edimax Wireless N USB Adapter: This is an optional device. I preferred to use wireless so my cluster could be mobile. If you prefer not to use the wireless adapter, the RP2 comes with an integrated 10/100Mb RJ-45 port, but then you would need a network switch or hub.
Samsung 64GB Micro SDXC: You can use a 32GB card instead, but if you use your cluster in earnest, you may discover that 32GB isn't enough. At the time, the Samsung 64GB was the cheapest SDXC card available.
Amazon 7-port USB2 Hub: The hub is only needed to act as a power distribution unit. No data is transferred via USB, so there is no functional difference between a USB2 and a USB3 hub here.
Stacked Acrylic Case: Another optional component, but it does make it much easier to have all of the RP2 nodes contained within a single stacked case.
2. Set Up the Raspberry Pi 2 Operating System / Firmware
While I'm sure the NOOBS installer works just fine, I prefer the old-fashioned method of writing the Raspbian image to the SD card via a tool like Win32DiskImager on Windows, or dd on Linux/OSX. Check out the Raspbian Installation Guide for the specifics regarding your OS.
After you get the image written to the SD card, hook up a keyboard and monitor and log in to continue with the Raspberry Pi configuration wizard (raspi-config). You'll definitely want to expand the filesystem, change the hostname and enable the SSH daemon. One thing you won't be able to do in this wizard is set up the wireless adapter; you'll have to do that after the reboot. Once you've rebooted, check out the Raspbian Wireless Configuration guide to complete the network setup. Once that's done and you've confirmed you can connect via SSH, you can disconnect the keyboard and monitor.
After you install the OS and go through the configuration steps, it's generally a good idea to update and upgrade your OS, as well as your firmware. Fortunately, this is easily achieved with a few simple commands which can be found in the Raspbian Update Guide.
Raspbian OS Update/Upgrade Commands
sudo apt-get update
sudo apt-get upgrade
Raspberry Pi 2 Firmware Update Command
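The firmware command itself is missing from this copy of the post. A minimal sketch, assuming the tool is rpi-update (a firmware updater commonly used on Raspbian; the Raspbian Update Guide remains the authority on the exact command). The update_firmware wrapper is a made-up name; it only exists so the step is skipped cleanly on a machine where rpi-update isn't installed.

```shell
# Assumption: rpi-update is the firmware updater in use; the original
# command is not shown in the text above.
update_firmware() {
  if command -v rpi-update >/dev/null 2>&1; then
    # A reboot is needed afterwards for the new firmware to load.
    sudo rpi-update
  else
    echo "rpi-update not found; skipping firmware update"
  fi
}

# Usage (on each node):
#   update_firmware
```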
3. Install Spark
Installing Spark is a very simple task; it comes as a compressed tarball, which is ready to go after being unzipped. Personally, I like to perform a few extra steps after unzipping the tarball, just to follow best practices in a Linux environment. As an additional note, since most of these commands have to be run against all nodes in the cluster, one of the easiest ways to accomplish this is to use a terminal program capable of sending input to multiple servers at once. There are a lot of programs that can do this, but my personal favorite is MobaXterm in MultiExec mode.
- a. We'll start off by creating a user that will run the two types of Spark daemons: master and slaves. This ensures that there's a separation of roles in the Linux environment, in case of a potential security vulnerability. Naturally, after creating the user, you need to give it a password.
sudo groupadd -g 5000 spark
sudo useradd -g 5000 -u 5000 -m spark
sudo passwd spark
- b. For the sake of simplicity, we'll want to create an SSH key for the spark user so that when the daemons are created, you don't have to use passwords. On the node that will be your master, run the following commands:
ssh-keygen -C "spark@raspi" -b 2048 -t rsa
ssh-copy-id spark@host2
ssh-copy-id spark@host3
ssh-copy-id spark@host4
- c. Next let's grab a copy of the Spark tarball from the Spark download site using a simple wget command. Extract the files to /opt/, create a link pointing /opt/spark to the specific Spark directory, and change the ownership:group to the newly created spark user/group. The current version at the time of this post is Spark 1.4.0.
wget http://d3kbcqa49mib13.cloudfront.net/spark-1.4.0-bin-hadoop2.6.tgz
sudo tar -zxf spark-1.4.0-bin-hadoop2.6.tgz -C /opt/
sudo ln -s /opt/spark-1.4.0-bin-hadoop2.6 /opt/spark
sudo chown -R spark:spark /opt/spark-1.4.0-bin-hadoop2.6
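Before moving on, it can be worth confirming that the key from step b really allows password-free logins. A minimal sketch, assuming the worker hostnames host2 through host4 used in step b; check_ssh is a hypothetical helper name, and BatchMode makes ssh fail rather than prompt when a password would be needed.

```shell
# check_ssh HOST...: report whether each host accepts key-based login
# for the spark user. BatchMode=yes disables password prompts, so a
# missing key shows up as a failure instead of a hang.
check_ssh() {
  rc=0
  for host in "$@"; do
    if ssh -o BatchMode=yes -o ConnectTimeout=5 "spark@$host" true; then
      echo "$host: ok"
    else
      echo "$host: FAILED"
      rc=1
    fi
  done
  return $rc
}

# Usage (as the spark user on the master):
#   check_ssh host2 host3 host4
```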
4. Configure Spark
We'll keep the Spark configuration simple to get up and running quickly. There are a lot of options, but for our purpose, we just need to declare the master node, the memory size to be used by the slaves, and the list of slaves. The commands in this section should be run on the master node as the spark user.
- a. Make a copy of the spark-env.sh.template file, named spark-env.sh:
cp /opt/spark/conf/spark-env.sh.template /opt/spark/conf/spark-env.sh
- b. Edit the /opt/spark/conf/spark-env.sh file using your preferred editor (vi and pico are installed in Raspbian by default), and at the bottom of the file, add the parameter to declare the master node.
- c. Edit the /opt/spark/conf/spark-env.sh file again, and at the bottom of the file, add the parameter to declare the memory size to be used by the slaves|workers.
- d. Create the /opt/spark/conf/slaves file using your preferred editor (vi and pico are installed in Raspbian by default). In the file, just put the hostnames of all your RP2 nodes, one per line. Example:
host1
host2
host3
host4
- e. Secure copy the files that were just modified/created to all nodes in the cluster:
for i in `cat /opt/spark/conf/slaves` ; do scp /opt/spark/conf/slaves $i:/opt/spark/conf/slaves ; done
for i in `cat /opt/spark/conf/slaves` ; do scp /opt/spark/conf/spark-env.sh $i:/opt/spark/conf/spark-env.sh ; done
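The parameter lines for steps b and c are missing from this copy of the post. As a sketch, assuming the Spark 1.x standalone variable names SPARK_MASTER_IP and SPARK_WORKER_MEMORY, a master named host1, and 512m per worker (a guess that leaves roughly half of the RP2's 1GB for the OS, not necessarily the author's setting), steps b through d could be scripted like this. write_spark_conf is a hypothetical helper.

```shell
# write_spark_conf CONF_DIR MASTER WORKER_MEM HOST...
# Appends the two settings to spark-env.sh and writes one hostname per
# line to the slaves file. All values passed in are examples.
write_spark_conf() {
  conf_dir=$1 master=$2 worker_mem=$3
  shift 3
  {
    echo "SPARK_MASTER_IP=$master"          # step b: declare the master
    echo "SPARK_WORKER_MEMORY=$worker_mem"  # step c: memory per worker
  } >> "$conf_dir/spark-env.sh"
  printf '%s\n' "$@" > "$conf_dir/slaves"   # step d: list of nodes
}

# Usage (as the spark user on the master):
#   write_spark_conf /opt/spark/conf host1 512m host1 host2 host3 host4
```

Run it once only: the settings are appended, so a second run would duplicate the two lines in spark-env.sh.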
5. Start Spark
There are a couple of ways that Spark can be started, but I prefer to start the master and slave processes independently. To do this, you'll want to be on the master node as the spark user.
After starting up the master, I like to check out the log files, just to make sure there's nothing unusual or out of place.
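Neither command survives in this copy of the post. Spark's standalone mode ships a start-master.sh script under sbin/, so a minimal sketch of the two steps, assuming the /opt/spark link from step 3 and the default log directory, might be the following. start_master_and_check is a made-up wrapper name.

```shell
# Start the master daemon, then show the end of its log. Assumption:
# logs go to the default $SPARK_HOME/logs directory, in a file whose
# name contains the daemon class (org.apache.spark.deploy.master.Master).
start_master_and_check() {
  home=${SPARK_HOME:-/opt/spark}
  "$home/sbin/start-master.sh" || return 1
  tail -n 20 "$home"/logs/*master.Master*.out
}

# Usage (as the spark user on the master node):
#   start_master_and_check
```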
If everything looks good, progress to starting up the slaves|workers.
And again, checking the logs is generally a good idea, but since there is one slave|worker spawned per node, you'll have to read the log on each node separately.
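The worker commands are also missing here. A sketch under the same assumptions (the /opt/spark path on every node, default log locations, and the passwordless SSH set up in step 3), using Spark's sbin/start-slaves.sh and then reading each node's worker log over SSH. start_workers_and_check is a hypothetical helper.

```shell
# Start a worker on every host in conf/slaves, then print the tail of
# each node's worker log over SSH, one node at a time.
start_workers_and_check() {
  home=${SPARK_HOME:-/opt/spark}
  "$home/sbin/start-slaves.sh" || return 1
  # Skip blank lines and comments in the slaves file.
  for host in $(grep -v -e '^#' -e '^$' "$home/conf/slaves"); do
    echo "== $host =="
    # The glob is quoted locally so it expands on the remote node.
    ssh "$host" "tail -n 5 $home/logs/*worker.Worker*.out"
  done
}

# Usage (as the spark user on the master node):
#   start_workers_and_check
```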
6. Spark Web Service
You'll probably want to take a look at the Spark Web Interface ... this will tell you which slaves|workers are connected to the master, the active applications as well as the application history.
Open up a web browser and point it to http://<master node>:8080. Be sure to verify that the workers you're expecting are registered with the master. If you don't see 'em, you'll need to start troubleshooting.
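If you're working headless, a quick reachability check from the command line can come first. A minimal sketch, assuming curl is installed and the master UI is on port 8080 as above; ui_reachable is a hypothetical helper, and it only confirms that the page answers, not that the workers are listed.

```shell
# ui_reachable HOST: succeed if the master web UI answers on port 8080.
# -f turns HTTP errors into a non-zero exit code; the page is discarded.
ui_reachable() {
  curl -fsS -o /dev/null --max-time 5 "http://$1:8080/"
}

# Usage:
#   ui_reachable host1 && echo "master UI is up"
```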
7. Run the Spark Shell
Finally, we get to the Spark shell! This is where it comes down to personal preference and what equipment you have available. Personally, I like running the Spark shell on a completely separate machine ... a Windows laptop, a Linux VM, etc. The reason for this is that I don't want the Spark driver taking away memory from the worker processes. Regardless of where you choose to execute the Spark shell, the command is the same:
/opt/spark/bin/spark-shell --master spark://<master node>:7077
8. Run a Distributed Word Count
Just to make sure everything is working as planned, try a Distributed Word Count in Spark.
val changeFile = sc.textFile("/opt/spark/CHANGES.txt")
changeFile: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD at textFile at <console>:21

val changeFileLower = changeFile.map(_.toLowerCase)
changeFileLower: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD at map at <console>:23

val changeFlatMap = changeFileLower.flatMap("[a-z]+".r findAllIn _)
changeFlatMap: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD at flatMap at <console>:25

val changeMR = changeFlatMap.map(word => (word,1)).reduceByKey(_ + _)
changeMR: org.apache.spark.rdd.RDD[(String, Int)] = ShuffledRDD at reduceByKey at <console>:27

changeMR.take(10)
Array[(String, Int)] = Array((actors,1), (decline,3), (findtaskfromlist,1), (biswal,4), (greater,2), (runner,1), (counted,1), (order,15), (logwarning,1), (clientbase,1))
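As a sanity check on the distributed result, the same count can be reproduced locally with standard shell tools. A sketch, assuming the same CHANGES.txt file is available on the machine you run it on; local_word_count is a made-up helper name, and for plain ASCII text its counts should line up with the pairs returned by changeMR.

```shell
# local_word_count FILE: lowercase the text, pull out runs of a-z, and
# count each word, mirroring the map / flatMap / reduceByKey steps.
# Prints one "word count" pair per line, sorted by word.
local_word_count() {
  tr 'A-Z' 'a-z' < "$1" | grep -oE '[a-z]+' | sort | uniq -c | awk '{print $2, $1}'
}

# Usage: look up a single word and compare it with the Spark output.
#   local_word_count /opt/spark/CHANGES.txt | grep '^order '
```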
Congratulations! You're now running Spark on a Raspberry Pi cluster!