It's actually fairly easy to set up Scala and remote Spark clusters in Jupyter notebooks these days. Without further ado, let's get to making the magic happen.
Install Spark software on the same machine as Jupyter and configure it for cluster connectivity
This is the pre-work step ... some of it you may already have completed, but it doesn't hurt to walk through the steps and just verify things are set up as expected.
Typically, I install Spark in /opt/spark/<version> and create a symbolic link for /opt/spark/latest to point to the new version. That way my environment variables such as SPARK_HOME never have to change.
wget http://apache.cs.utah.edu/spark/spark-2.3.1/spark-2.3.1-bin-hadoop2.7.tgz
mkdir -p /opt/spark/2.3.1
tar zxpvf spark-2.3.1-bin-hadoop2.7.tgz --strip-components=1 --directory /opt/spark/2.3.1
ln -s /opt/spark/2.3.1 /opt/spark/latest
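One thing to watch here: the --strip-components=1 flag drops the tarball's top-level spark-2.3.1-bin-hadoop2.7 directory so the files land directly in /opt/spark/2.3.1; without it, SPARK_HOME would end up pointing one level too high. A quick way to sanity-check the layout is to ask spark-submit for its version:

/opt/spark/latest/bin/spark-submit --version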
Next I create the SPARK_HOME variable in /etc/bashrc and point it to /opt/spark/latest:
echo "export SPARK_HOME=/opt/spark/latest" >> /etc/bashrc
Then I configure the spark-defaults.conf file in /opt/spark/latest/conf/ so that the master variable is set to the correct Spark master URL or setting. This can be slightly confusing, depending on how your Spark cluster is configured. If the cluster has been installed in standalone mode (in other words, not running on top of Hadoop), you would use a line that looks like "spark.master spark://server:7077".
cp /opt/spark/latest/conf/spark-defaults.conf.template /opt/spark/latest/conf/spark-defaults.conf
echo "spark.master spark://server:7077" >> /opt/spark/latest/conf/spark-defaults.conf
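A quick smoke test doesn't hurt at this point: since spark.master is now set in spark-defaults.conf, launching spark-shell with no arguments should connect to the standalone master (substitute your real master host for "server", of course). Once inside the shell, evaluating sc.master should echo back the URL you configured.

/opt/spark/latest/bin/spark-shell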
It's probably more common that people are using Spark clusters on Hadoop, so the line you would use would be "spark.master yarn". In this case, Spark needs to know where the master is actually located, and that information is contained within the Hadoop XML configuration files. You have to download the XML files from the Hadoop cluster, stash them in a directory (I personally use /opt/hadoop/conf), and then edit your /etc/bashrc file to add an environment variable, HADOOP_CONF_DIR, pointing to that location.
mkdir -p /opt/hadoop/conf
scp -r hadoop@cluster:/etc/hadoop/conf/* /opt/hadoop/conf/
echo "export HADOOP_CONF_DIR=/opt/hadoop/conf" >> /etc/bashrc
cp /opt/spark/latest/conf/spark-defaults.conf.template /opt/spark/latest/conf/spark-defaults.conf
echo "spark.master yarn" >> /opt/spark/latest/conf/spark-defaults.conf
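The same sanity check works for YARN: after sourcing /etc/bashrc so that HADOOP_CONF_DIR is set, a plain spark-shell launch should negotiate an application with the YARN ResourceManager, and sc.master should report "yarn".

source /etc/bashrc
/opt/spark/latest/bin/spark-shell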
Install the Apache Toree Jupyter Python package
Nothing too complex about this ... just run pip to install the Toree Python package.
pip install toree
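If you want to confirm pip did its job, the package registers a new "toree" subcommand under jupyter, so asking it for help is a quick check:

jupyter toree install --help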
Install the Apache Toree Jupyter kernel
Now for the final step, installing the Toree kernel. A couple of things to note here: First, you have to source your bashrc file, since the installation references an environment variable you created earlier (SPARK_HOME). Second, the default naming of Toree kernels uses a prefix of "Apache Toree", which I don't really care for, so I prefer to use "Spark". This means my Jupyter kernel name will be "Spark - Scala" instead of "Apache Toree - Scala".
source /etc/bashrc
jupyter toree install --spark_home=$SPARK_HOME --kernel_name="Spark"
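You can confirm the kernel registered correctly by listing the installed kernelspecs; you should see a Scala entry whose id is derived from the kernel name (something like spark_scala, though the exact id may vary by Toree version):

jupyter kernelspec list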
Start your Jupyter notebook
At this point, everything is complete. If you start your Jupyter notebook by running a command such as jupyter notebook or jupyter lab, you'll notice that you have a new kernel option available called Spark - Scala.
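As a final sanity check, open a notebook with the new kernel and run a trivial job; Toree pre-creates the SparkContext as sc, so something like sc.parallelize(1 to 100).sum() should come back with 5050.0 from the cluster. If you'd rather test from a terminal instead of the browser, you can attach a console to the kernel directly (this assumes the jupyter-console package is installed, and that the kernel id is spark_scala ... use whatever id jupyter kernelspec list actually reported):

jupyter console --kernel spark_scala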