spark, scala, ML, Machine Learning

Convert Spark Vectors to DataFrame Columns

Vectors are typically required for Machine Learning tasks, but are otherwise not commonly used. Sometimes you end up with an assembled Vector that you just want to disassemble into its individual component columns so you can do some Spark SQL work, for example. Fortunately, there's an easy answer for that ...

The VectorDisassembler code can be downloaded as a jar. Once downloaded, it needs to be added to your spark-shell or spark-submit command using the --jars option. For example:

spark-shell --master=spark://datasci:7077 --jars /opt/jars/VectorDisassembler-0.1.jar

Once you've got the VectorDisassembler jar loaded in your shell/submit command, let's get going!

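// Import the required classes
import org.apache.spark.ml.feature.VectorAssembler
// VectorDisassembler must also be imported, from whichever package the downloaded jar defines it in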

// Create a simple dataset of 3 columns
val dataset = (spark.createDataFrame(
    Seq((0, 1.2, 1.3), (1, 2.2, 2.3), (2, 3.2, 3.3))
    ).toDF("id", "val1", "val2"))

// Next we'll use the Vector Assembler to create our vector datatype from the val1 and val2 columns
val assembler = (new VectorAssembler()
    .setInputCols(Array("val1", "val2"))
    .setOutputCol("vectorCol"))

// Create the intermediate dataframe.  Calling transform on the assembler adds the vectorCol column to the dataset
val intermediate = assembler.transform(dataset)

// Take a look at the intermediate dataframe.  The val1 and val2 columns are redundant, so let's do something about that
intermediate.show()
//| id|val1|val2|vectorCol|
//|  0| 1.2| 1.3|[1.2,1.3]|
//|  1| 2.2| 2.3|[2.2,2.3]|
//|  2| 3.2| 3.3|[3.2,3.3]|

// Tidy up the dataframe by dropping original columns.  Show the output to prove the columns have been dropped.
val output = intermediate.drop("val1","val2")
output.show()
//| id|vectorCol|
//|  0|[1.2,1.3]|
//|  1|[2.2,2.3]|
//|  2|[3.2,3.3]|

// Now that the data is in the right format, let's disassemble it
// The disassembler, like the assembler, extends the Transformer class
// This means it's built the same way as the assembler.
val disassembler = new VectorDisassembler().setInputCol("vectorCol")

// Execute the disassembler transformer against the output dataframe, and show the results.
disassembler.transform(output).show()
//| id|vectorCol|val1|val2|
//|  0|[1.2,1.3]| 1.2| 1.3|
//|  1|[2.2,2.3]| 2.2| 2.3|
//|  2|[3.2,3.3]| 3.2| 3.3|

And there you go ... the output is a dataframe which has the original values recreated by disassembling the vector column!
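As a cross-check on that result, the same disassembly can be sketched with Spark's built-in vector_to_array function instead of the VectorDisassembler jar. This is only a minimal sketch under some assumptions: it needs Spark 3.0 or later, the names session, vectors, arr and rebuilt are made up for illustration, and the output column names and vector positions are supplied by hand.

```scala
import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.ml.functions.vector_to_array // assumption: Spark 3.0 or later
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

// In spark-shell this simply returns the session that already exists
val session = SparkSession.builder().master("local[*]").appName("vector-to-columns").getOrCreate()

// Rebuild the same id + vectorCol dataframe used in the walkthrough
val vectors = (new VectorAssembler()
    .setInputCols(Array("val1", "val2"))
    .setOutputCol("vectorCol")
    .transform(session.createDataFrame(
        Seq((0, 1.2, 1.3), (1, 2.2, 2.3), (2, 3.2, 3.3))
        ).toDF("id", "val1", "val2"))
    .drop("val1", "val2"))

// Convert the vector to an array column, then pull each element out by position
val rebuilt = (vectors
    .withColumn("arr", vector_to_array(col("vectorCol")))
    .select(
        col("id"),
        col("vectorCol"),
        col("arr").getItem(0).as("val1"),   // position 0 of the vector
        col("arr").getItem(1).as("val2")))  // position 1 of the vector

rebuilt.show()

// The scalar columns can now be used from plain Spark SQL
rebuilt.createOrReplaceTempView("rebuilt")
session.sql("SELECT id, val1 + val2 AS total FROM rebuilt").show()
```

The hard-coded positions and names are the weak point of this approach: it only works when you already know how long the vector is and what each slot means.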

About James Conner

Scuba dive master, wildlife photographer, anthropologist, programmer, electronics tinkerer and big data expert.