Scala is a fun language which gives you all the power of Java, the simplicity of Python, and the expressiveness of functional programming.
Though there are a variety of IDE options when working with Scala (IntelliJ and Atom being among my personal favorites), I enjoy using Jupyter for interactive data science with Scala/Spark.
It's actually fairly easy to set up Scala and remote Spark clusters in Jupyter notebooks these days. Without further ado, let's get to making the magic happen.
Note: This assumes you already have a functioning Spark cluster and that Jupyter or JupyterLab is already installed.
Install Spark software on the same machine as Jupyter and configure it for cluster connectivity
This is the pre-work step ... some of it you may already have completed, but it doesn't hurt to walk through the steps and verify things are set up as expected.
Typically, I install Spark in /opt/spark/<version> and create a symbolic link at /opt/spark/latest to point to the new version. That way my environment variables, such as SPARK_HOME, never have to change.
wget http://apache.cs.utah.edu/spark/spark-2.3.1/spark-2.3.1-bin-hadoop2.7.tgz
mkdir -p /opt/spark/2.3.1
tar zxpvf spark-2.3.1-bin-hadoop2.7.tgz --directory /opt/spark/2.3.1 --strip-components=1
ln -s /opt/spark/2.3.1 /opt/spark/latest
Next I create the SPARK_HOME variable in /etc/bashrc and point it to /opt/spark/latest.
echo "export SPARK_HOME=/opt/spark/latest" >> /etc/bashrc
Then I configure the spark-defaults.conf file in /opt/spark/latest/conf/ so that the master variable is set to the correct Spark master URL or setting. This can be slightly confusing, depending on how your Spark cluster is configured. If the cluster has been installed in standalone mode (in other words, not running on top of Hadoop), you would use a line that looks like "spark.master spark://server:7077".
cp /opt/spark/latest/conf/spark-defaults.conf.template /opt/spark/latest/conf/spark-defaults.conf
echo "spark.master spark://server:7077" >> /opt/spark/latest/conf/spark-defaults.conf
It's probably more common that people are running Spark clusters on Hadoop, so the line you would use is "spark.master yarn". In this case, Spark needs to know where the master is actually located, and that information is contained within the Hadoop XML configuration files. You have to download the XML files from the Hadoop cluster, stash them in a directory (I personally use /opt/hadoop/conf), and then edit your /etc/bashrc file to add an environment variable, HADOOP_CONF_DIR, pointing to that location.
mkdir -p /opt/hadoop/conf
scp -r hadoop@cluster:/etc/hadoop/conf/* /opt/hadoop/conf/
echo "export HADOOP_CONF_DIR=/opt/hadoop/conf" >> /etc/bashrc
cp /opt/spark/latest/conf/spark-defaults.conf.template /opt/spark/latest/conf/spark-defaults.conf
echo "spark.master yarn" >> /opt/spark/latest/conf/spark-defaults.conf
Install the Apache Toree Jupyter Python package
Nothing too complex about this ... just run pip to install the Toree Python package.
pip install toree
Install the Apache Toree Jupyter kernel
Now for the final step, installing the Toree kernel. A couple of things to note here: First, you have to source your bashrc file, since the installation references an environment variable you created earlier (SPARK_HOME). Second, the default naming of Toree kernels uses a prefix of "Apache Toree", which I don't really care for, so I prefer to use "Spark". This means my Jupyter kernel name will be "Spark - Scala" instead of "Apache Toree - Scala".
source /etc/bashrc
jupyter toree install --spark_home=$SPARK_HOME --kernel_name="Spark"
Start your Jupyter notebook
At this point, everything is complete. If you start your Jupyter notebook by running a command such as jupyter notebook or jupyter lab, you'll notice that you have a new kernel option available called Spark - Scala.
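To confirm the kernel is actually talking to your cluster, open a new notebook with the Spark - Scala kernel and run a quick first cell. Here's a minimal sketch; Toree pre-creates sc (the SparkContext) for you, though the exact bindings can vary a bit across Toree and Spark versions:
// First cell in a new "Spark - Scala" notebook; sc is provided by the Toree kernel.
println(s"Spark ${sc.version}, master: ${sc.master}")
// A small word-count job to prove work is actually running on the cluster.
val counts = sc.parallelize(Seq("spark", "scala", "jupyter", "spark"))
  .map(word => (word, 1))
  .reduceByKey(_ + _)
  .collect()
counts.foreach(println)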