Convert Spark Vectors to DataFrame Columns

Vectors are typically required for Machine Learning tasks, but are otherwise not commonly used. Sometimes you end up with an assembled Vector that you just want to disassemble into its individual component columns so you can do some Spark SQL work, for example. Fortunately, there's an easy answer for that ...

The VectorDisassembler code can be downloaded from github.com. Once downloaded, the jar needs to be added to your spark-shell or spark-submit command using the --jars option. For example:

spark-shell --master=spark://datasci:7077 --jars /opt/jars/VectorDisassembler-0.1.jar

Once you've got the VectorDisassembler jar loaded in your shell/submit command, let's get going!

// Import the required classes
import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.ml.feature.VectorDisassembler
import org.apache.spark.ml.linalg.Vectors

// Create a simple dataset of 3 columns
val dataset = (spark.createDataFrame(
    Seq((0, 1.2, 1.3), (1, 2.2, 2.3), (2, 3.2, 3.3))
    ).toDF("id", "val1", "val2"))

// Next we'll use the Vector Assembler to create our vector datatype from the val1 and val2 columns
val assembler = (new VectorAssembler()
    .setInputCols(Array("val1", "val2"))
    .setOutputCol("vectorCol"))

// Create the intermediate dataframe.  Calling transform() on the assembler appends the vectorCol column to the dataset
val intermediate = assembler.transform(dataset)

// Take a look at the intermediate dataframe.  The val1 and val2 columns are redundant, so let's do something about that
intermediate.show()
//+---+----+----+---------+
//| id|val1|val2|vectorCol|
//+---+----+----+---------+
//|  0| 1.2| 1.3|[1.2,1.3]|
//|  1| 2.2| 2.3|[2.2,2.3]|
//|  2| 3.2| 3.3|[3.2,3.3]|
//+---+----+----+---------+


// Tidy up the dataframe by dropping original columns.  Show the output to prove the columns have been dropped.
val output = intermediate.drop("val1","val2")
output.show()
//+---+---------+
//| id|vectorCol|
//+---+---------+
//|  0|[1.2,1.3]|
//|  1|[2.2,2.3]|
//|  2|[3.2,3.3]|
//+---+---------+


// Now that the data is in the right format, let's disassemble it
// The disassembler, like the assembler, extends the Transformer class
// This means it's built the same way as the assembler.
val disassembler = new VectorDisassembler().setInputCol("vectorCol")


// Execute the disassembler transformer against the output dataframe, and show the results.
disassembler.transform(output).show()
//+---+---------+----+----+
//| id|vectorCol|val1|val2|
//+---+---------+----+----+
//|  0|[1.2,1.3]| 1.2| 1.3|
//|  1|[2.2,2.3]| 2.2| 2.3|
//|  2|[3.2,3.3]| 3.2| 3.3|
//+---+---------+----+----+

And there you go ... the output is a dataframe which has the original values recreated by disassembling the vector column!
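
If you can't add an extra jar to your cluster, the same result can be approximated with built-in Spark functions. What follows is only a rough sketch, not how VectorDisassembler itself is implemented: it assumes Spark 3.0 or later (for vector_to_array), and it assumes the column names can be read back from the ML attribute metadata that VectorAssembler attaches to the vector column. The vectorCol_N fallback names and the names, elements and disassembled values are made up for this illustration.

```scala
// Sketch only: disassemble a vector column without the VectorDisassembler jar.
// Assumes Spark 3.0+ (vector_to_array does not exist in earlier releases).
import org.apache.spark.ml.attribute.AttributeGroup
import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.ml.functions.vector_to_array
import org.apache.spark.ml.linalg.Vector
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

// In spark-shell this simply returns the session that already exists
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("disassemble-sketch")
  .getOrCreate()

// Same sample data as above
val dataset = spark.createDataFrame(
  Seq((0, 1.2, 1.3), (1, 2.2, 2.3), (2, 3.2, 3.3))
).toDF("id", "val1", "val2")

// Assemble the vector and drop the original columns, as in the walkthrough
val assembled = new VectorAssembler()
  .setInputCols(Array("val1", "val2"))
  .setOutputCol("vectorCol")
  .transform(dataset)
  .drop("val1", "val2")

// Read the per-element names from the vector column's ML attribute metadata.
// If a vector carries no metadata, fall back to hypothetical vectorCol_N names
// sized from the first row.
val group = AttributeGroup.fromStructField(assembled.schema("vectorCol"))
val names: Array[String] = group.attributes match {
  case Some(attrs) =>
    attrs.zipWithIndex.map { case (attr, i) => attr.name.getOrElse(s"vectorCol_$i") }
  case None =>
    val size = assembled.head.getAs[Vector]("vectorCol").size
    Array.tabulate(size)(i => s"vectorCol_$i")
}

// Convert the vector to an array column, then pull each element into its own column
val elements = vector_to_array(col("vectorCol"))
val disassembled = names.zipWithIndex.foldLeft(assembled) {
  case (df, (name, i)) => df.withColumn(name, elements.getItem(i))
}

disassembled.show()
```

With the sample data above, this should give you the id and vectorCol columns followed by val1 and val2, since those are the names the assembler recorded. For vectors that never went through a VectorAssembler, expect the fallback names instead.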

About James Conner
