Scala is a fun language which gives you all the power of Java, the simplicity of Python, and the expressiveness of functional programming.
Though there are a variety of IDE options when working with Scala (IntelliJ and Atom being among my personal favorites), I enjoy using Jupyter for interactive data science with Scala/Spark.
It's actually fairly easy to set up Scala and remote Spark clusters in Jupyter notebooks these days. Without further ado, let's get to making the magic happen.
Note: This assumes you already have a functioning Spark cluster and that Jupyter or JupyterLab is already installed.
Install Spark software on the same machine as Jupyter and configure it for cluster connectivity
This is the pre-work step ... some of it you may already have completed, but it doesn't hurt to walk through the steps and verify things are set up as expected.
Typically, I install Spark in /opt/spark/<version> and create a symbolic link at /opt/spark/latest to point to the new version. That way my environment variables, such as SPARK_HOME, never have to change.
wget http://apache.cs.utah.edu/spark/spark-2.3.1/spark-2.3.1-bin-hadoop2.7.tgz
mkdir -p /opt/spark/2.3.1
tar zxpvf spark-2.3.1-bin-hadoop2.7.tgz --directory /opt/spark/2.3.1 --strip-components=1
ln -s /opt/spark/2.3.1 /opt/spark/latest
Next I create the SPARK_HOME variable in /etc/bashrc and point it to /opt/spark/latest.
echo "export SPARK_HOME=/opt/spark/latest" >> /etc/bashrc
Then I configure the spark-defaults.conf file in /opt/spark/latest/conf/ so that the master variable is set to the correct Spark master URL or setting. This can be slightly confusing, depending on how your Spark cluster is configured. If the cluster has been installed in standalone mode (in other words, not running on top of Hadoop), you would use a line that looks like "spark.master spark://server:7077".
cp /opt/spark/latest/conf/spark-defaults.conf.template /opt/spark/latest/conf/spark-defaults.conf
echo "spark.master spark://server:7077" >> /opt/spark/latest/conf/spark-defaults.conf
It's probably more common that people are running Spark clusters on Hadoop, so the line you would use is "spark.master yarn". In this case, Spark needs to know where the master is actually located, and that information is contained within the Hadoop XML configuration files. You have to download the XML files from the Hadoop cluster, stash them in a directory (I personally use /opt/hadoop/conf), and then edit your /etc/bashrc file to add an environment variable, HADOOP_CONF_DIR, pointing to that location.
mkdir -p /opt/hadoop/conf
scp -r hadoop@cluster:/etc/hadoop/conf/* /opt/hadoop/conf/
echo "export HADOOP_CONF_DIR=/opt/hadoop/conf" >> /etc/bashrc
cp /opt/spark/latest/conf/spark-defaults.conf.template /opt/spark/latest/conf/spark-defaults.conf
echo "spark.master yarn" >> /opt/spark/latest/conf/spark-defaults.conf
Install the Apache Toree Jupyter Python package
Nothing too complex about this ... just run pip to install the Toree Python package.
pip install toree
Install the Apache Toree Jupyter kernel
Now for the final step, installing the Toree kernel. A couple of things to note here: First, you have to source your bashrc file, since the installation references an environment variable you created earlier (SPARK_HOME). Second, the default naming of Toree kernels uses a prefix of "Apache Toree", which I don't really care for, so I prefer to use "Spark". This means my Jupyter kernel name will be "Spark - Scala" instead of "Apache Toree - Scala".
source /etc/bashrc
jupyter toree install --spark_home=$SPARK_HOME --kernel_name="Spark"
Start your Jupyter notebook
At this point, everything is complete. If you start your Jupyter notebook by running a command such as jupyter notebook or jupyter lab, you'll notice that you have a new kernel option available called Spark - Scala.
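To confirm the kernel is actually talking to your cluster, open a new notebook with the Spark - Scala kernel and run a quick first cell. Here's a minimal sketch; Toree pre-creates sc (the SparkContext) for you, though the exact bindings can vary a bit across Toree and Spark versions:
// First cell in a new "Spark - Scala" notebook; sc is provided by the Toree kernel.
println(s"Spark ${sc.version}, master: ${sc.master}")
// A small word-count job to prove work is actually running on the cluster.
val counts = sc.parallelize(Seq("spark", "scala", "jupyter", "spark"))
  .map(word => (word, 1))
  .reduceByKey(_ + _)
  .collect()
counts.foreach(println)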