My Areas of Expertise

Using Spark, Scala and XGBoost On The Titanic Dataset from Kaggle

The Titanic: Machine Learning from Disaster [https://www.kaggle.com/c/titanic] competition on Kaggle [https://www.kaggle.com/] is an excellent resource for anyone wanting to dive into Machine Learning [https://en.wikipedia.org/wiki/Machine_learning]. There are forums [https://www.kaggle.com/c/titanic/discussion] where you

Family of Elephants in the Mara

Copyright: James Conner
Camera: Canon EOS 5D Mark III
Lens: Canon 100-400L F4 with 1.4x Internal Extender
Stats: 560mm, ƒ/5.6, 1/1600s, ISO 640
Taken: March 5, 2015

List All Additional Jars Loaded in Spark

Once in a while, you need to verify the versions of the jars that have been loaded into your Spark session. Fortunately, there's a relatively easy way to do this: the listJars method. As you can see from the example below, the listJars method shows all jars loaded
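A minimal sketch of the call, assuming a local SparkSession: `listJars()` is a method on `SparkContext` and returns the paths of the jars that were added to the session (for example through `--jars`, `spark.jars`, or `addJar`). The object name and the `jarFileName` helper are hypothetical and only there to make the version in each file name easier to read.

```scala
import org.apache.spark.sql.SparkSession

object ListLoadedJars {
  // Hypothetical helper: keep only the file name, which usually carries the version
  def jarFileName(path: String): String =
    path.substring(path.lastIndexOf('/') + 1)

  def main(args: Array[String]): Unit = {
    // Assumption: a local session; in spark-shell the `spark` value already exists
    val spark = SparkSession.builder()
      .appName("list-jars-sketch")
      .master("local[*]")
      .getOrCreate()

    // listJars() returns the paths of jars added to this SparkContext
    val jars: Seq[String] = spark.sparkContext.listJars()

    println(s"${jars.size} jar(s) added to this session")
    jars.map(jarFileName).sorted.foreach(println)

    spark.stop()
  }
}
```

In spark-shell the same check is a one-liner: `spark.sparkContext.listJars.foreach(println)`.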

Spark Vector of Vectors

I recently ran into a problem creating a features vector for a machine learning project. If the number of features in your dataframe is too large, the job will fail during Catalyst code generation because the constant pool of the generated class grows past the JVM limit of 65,535 (0xFFFF) entries.
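The excerpt does not show the post's actual fix, so the following is only a sketch of the idea the title suggests: assemble the feature columns in smaller groups, then assemble those partial vectors into one final vector. `VectorAssembler` accepts vector columns as inputs, which is what makes the second step possible. The object name, the `features_part_N` column names, and the chunk size are assumptions, and whether chunking avoids the limit in a given job depends on the Spark version and the column count.

```scala
import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.col

object ChunkedAssembler {
  // Split the feature column names into groups of at most chunkSize
  def chunkColumns(cols: Seq[String], chunkSize: Int): Seq[Seq[String]] =
    cols.grouped(chunkSize).toList

  // Assemble each group into a partial vector, then assemble the partial vectors
  def assembleInChunks(df: DataFrame, featureCols: Seq[String],
                       chunkSize: Int, outputCol: String): DataFrame = {
    val chunks = chunkColumns(featureCols, chunkSize)
    val partNames = chunks.indices.map(i => s"features_part_$i") // hypothetical names

    val withParts = chunks.zip(partNames).foldLeft(df) { case (acc, (cols, name)) =>
      new VectorAssembler()
        .setInputCols(cols.toArray)
        .setOutputCol(name)
        .transform(acc)
    }

    new VectorAssembler()
      .setInputCols(partNames.toArray)
      .setOutputCol(outputCol)
      .transform(withParts)
      .drop(partNames: _*)
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("chunked-assembler-sketch")
      .master("local[*]")
      .getOrCreate()

    // Toy data: ten numeric columns f0..f9 derived from the row id
    val featureCols = (0 until 10).map(i => s"f$i")
    val df = spark.range(3).select(
      featureCols.zipWithIndex.map { case (c, i) => (col("id") + i).cast("double").as(c) }: _*
    )

    assembleInChunks(df, featureCols, chunkSize = 4, outputCol = "features")
      .select("features")
      .show(truncate = false)

    spark.stop()
  }
}
```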

Joining Spark DataFrames Without Duplicate or Ambiguous Column Names

When performing joins in Spark, one question keeps coming up: when joining multiple dataframes, how do you prevent ambiguous column name errors?

1) Let's start off by preparing a couple of simple example dataframes

// Create first example dataframe
val firstDF = spark.createDataFrame(Seq(
  (1, 1, 2, 3, 8,
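One common way to avoid the ambiguity, sketched here with hypothetical dataframes rather than the post's own `firstDF`, is to pass the join key as a sequence of column names. Joining on a column expression keeps the key column from both sides, while `join(right, Seq("id"))` emits the key only once. The object name, the `joinOnId` helper, and the column names are assumptions made for illustration.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

object JoinWithoutDuplicates {
  // Join on the shared column name so the result carries a single "id" column
  def joinOnId(left: DataFrame, right: DataFrame): DataFrame =
    left.join(right, Seq("id"))

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("join-sketch")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    // Hypothetical example data
    val left = Seq((1, "a"), (2, "b")).toDF("id", "left_value")
    val right = Seq((1, "x"), (3, "y")).toDF("id", "right_value")

    // Expression join: both "id" columns survive, so select("id") would be ambiguous
    val ambiguous = left.join(right, left("id") === right("id"))
    println(ambiguous.columns.mkString(", ")) // id, left_value, id, right_value

    // Name-based join: the key column appears once
    val clean = joinOnId(left, right)
    println(clean.columns.mkString(", ")) // id, left_value, right_value

    spark.stop()
  }
}
```

When the two sides share non-key column names as well, renaming them before the join (for example with `withColumnRenamed`) keeps every output column unambiguous.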