Joining Spark DataFrames Without Duplicate or Ambiguous Column Names

When performing joins in Spark, one question keeps coming up: When joining multiple dataframes, how do you prevent ambiguous column name errors?

1) Let's start off by preparing a couple of simple example dataframes

// Create first example dataframe
val firstDF = spark.createDataFrame(Seq(
  (1, 1, 2, 3, 8, 4, 5),
  (2, 4, 3, 8, 7, 9, 8),
  (3, 6, 1, 9, 2, 3, 6),
  (4, 7, 8, 6, 9, 4, 5),
  (5, 9, 2, 7, 8, 7, 3),
  (6, 1, 1, 4, 2, 8, 4)
)).toDF("uid", "col1", "col2", "col3", "col4", "col5", "col6")

// Display the dataframe
firstDF.show()

// Create second example dataframe
val secondDF = spark.createDataFrame(Seq(
  (1, 3, 2, 0, 4, 2, 8),
  (2, 3, 3, 2, 6, 5, 4),
  (3, 8, 5, 1, 2, 3, 5),
  (4, 9, 8, 2, 4, 9, 2),
  (5, 3, 4, 8, 0, 6, 2),
  (6, 3, 9, 8, 8, 9, 3)
)).toDF("uid", "colA", "colB", "colC", "colD", "colE", "colF")

// Display the dataframe
secondDF.show()

2) Next let's perform a simple inner join.

// Perform simple inner join on uid column
val joinedDF = firstDF.join(secondDF)

// Show the dataframe schema
joinedDF.printSchema

// Display the dataframe
joinedDF.select("uid","col1","colA")

Results:

joinedDF.printSchema
root
 |-- uid: integer (nullable = false)
 |-- col1: integer (nullable = false)
 |-- col2: integer (nullable = false)
 |-- col3: integer (nullable = false)
 |-- col4: integer (nullable = false)
 |-- col5: integer (nullable = false)
 |-- col6: integer (nullable = false)
 |-- uid: integer (nullable = false)
 |-- colA: integer (nullable = false)
 |-- colB: integer (nullable = false)
 |-- colC: integer (nullable = false)
 |-- colD: integer (nullable = false)
 |-- colE: integer (nullable = false)
 |-- colF: integer (nullable = false)


joinedDF.select("uid","col1","colA")
org.apache.spark.sql.AnalysisException: Reference 'uid' is ambiguous, could be: uid#298, uid#337.;

In the schema, notice that there are two "uid" columns, which is what causes the "ambiguous column error" in the following select statement.

3) And finally let's perform a join that removes the ambiguous column error

// Perform inner join on uid column
val joinedDF = firstDF.join(secondDF, Seq("uid"))

// Show the dataframe schema
joinedDF.printSchema

// Display the dataframe
joinedDF.show()

// Display the dataframe
joinedDF.select("uid","col1","colA")

Results:

joinedDF.printSchema
root
 |-- uid: integer (nullable = false)
 |-- col1: integer (nullable = false)
 |-- col2: integer (nullable = false)
 |-- col3: integer (nullable = false)
 |-- col4: integer (nullable = false)
 |-- col5: integer (nullable = false)
 |-- col6: integer (nullable = false)
 |-- colA: integer (nullable = false)
 |-- colB: integer (nullable = false)
 |-- colC: integer (nullable = false)
 |-- colD: integer (nullable = false)
 |-- colE: integer (nullable = false)
 |-- colF: integer (nullable = false)


joinedDF.show()
+---+----+----+----+----+----+----+----+----+----+----+----+----+
|uid|col1|col2|col3|col4|col5|col6|colA|colB|colC|colD|colE|colF|
+---+----+----+----+----+----+----+----+----+----+----+----+----+
|  1|   1|   2|   3|   8|   4|   5|   3|   2|   0|   4|   2|   8|
|  2|   4|   3|   8|   7|   9|   8|   3|   3|   2|   6|   5|   4|
|  3|   6|   1|   9|   2|   3|   6|   8|   5|   1|   2|   3|   5|
|  4|   7|   8|   6|   9|   4|   5|   9|   8|   2|   4|   9|   2|
|  5|   9|   2|   7|   8|   7|   3|   3|   4|   8|   0|   6|   2|
|  6|   1|   1|   4|   2|   8|   4|   3|   9|   8|   8|   9|   3|
+---+----+----+----+----+----+----+----+----+----+----+----+----+


joinedDF.select("uid","col1","colA").show()
+---+----+----+
|uid|col1|colA|
+---+----+----+
|  1|   1|   3|
|  2|   4|   3|
|  3|   6|   8|
|  4|   7|   9|
|  5|   9|   3|
|  6|   1|   3|
+---+----+----+

Notice the lack of duplicated "uid" columns (yay)! Now if you perform a select using the "uid" column, there is no issue with "ambiguous column name" errors.

Joining Spark DataFrames Without Duplicate or Ambiguous Column Names

1) Let's start off by preparing a couple of simple example dataframes

2) Next let's perform a simple inner join.

Results:

3) And finally let's perform a join that removes the ambiguous column error

Results:

About James Conner

Selecting Dynamic Columns In Spark DataFrames (aka Excluding Columns)

Spark Vector of Vectors

Controlling NVIDIA GPU Fans on a headless Ubuntu system

Google's 'Coral' Edge TPU Dev Boards

Google's 'Coral' Edge TPU Accelerator

Joining Spark DataFrames Without Duplicate or Ambiguous Column Names

1) Let's start off by preparing a couple of simple example dataframes

2) Next let's perform a simple inner join.

Results:

3) And finally let's perform a join that removes the ambiguous column error

Results:

About James Conner

Subscribe to My Areas of Expertise

Selecting Dynamic Columns In Spark DataFrames (aka Excluding Columns)

Spark Vector of Vectors

Controlling NVIDIA GPU Fans on a headless Ubuntu system

Google's 'Coral' Edge TPU Dev Boards

Google's 'Coral' Edge TPU Accelerator