spark, scala

Joining Spark DataFrames Without Duplicate or Ambiguous Column Names

When performing joins in Spark, one question keeps coming up: When joining multiple dataframes, how do you prevent ambiguous column name errors?

1) Let's start off by preparing a couple of simple example dataframes

// Create first example dataframe
val firstDF = spark.createDataFrame(Seq(
  (1, 1, 2, 3, 8, 4, 5),
  (2, 4, 3, 8, 7, 9, 8),
  (3, 6, 1, 9, 2, 3, 6),
  (4, 7, 8, 6, 9, 4, 5),
  (5, 9, 2, 7, 8, 7, 3),
  (6, 1, 1, 4, 2, 8, 4)
)).toDF("uid", "col1", "col2", "col3", "col4", "col5", "col6")

// Display the dataframe
firstDF.show()

// Create second example dataframe
val secondDF = spark.createDataFrame(Seq(
  (1, 3, 2, 0, 4, 2, 8),
  (2, 3, 3, 2, 6, 5, 4),
  (3, 8, 5, 1, 2, 3, 5),
  (4, 9, 8, 2, 4, 9, 2),
  (5, 3, 4, 8, 0, 6, 2),
  (6, 3, 9, 8, 8, 9, 3)
)).toDF("uid", "colA", "colB", "colC", "colD", "colE", "colF")

// Display the dataframe
secondDF.show()



2) Next let's perform a simple inner join.

// Perform simple inner join on uid column
val joinedDF = firstDF.join(secondDF)

// Show the dataframe schema
joinedDF.printSchema

// Display the dataframe
joinedDF.select("uid","col1","colA")

Results:

joinedDF.printSchema
root
 |-- uid: integer (nullable = false)
 |-- col1: integer (nullable = false)
 |-- col2: integer (nullable = false)
 |-- col3: integer (nullable = false)
 |-- col4: integer (nullable = false)
 |-- col5: integer (nullable = false)
 |-- col6: integer (nullable = false)
 |-- uid: integer (nullable = false)
 |-- colA: integer (nullable = false)
 |-- colB: integer (nullable = false)
 |-- colC: integer (nullable = false)
 |-- colD: integer (nullable = false)
 |-- colE: integer (nullable = false)
 |-- colF: integer (nullable = false)

joinedDF.select("uid","col1","colA") org.apache.spark.sql.AnalysisException: Reference 'uid' is ambiguous, could be: uid#298, uid#337.;

In the schema, notice that there are two "uid" columns, which is what causes the "ambiguous column error" in the following select statement.



3) And finally let's perform a join that removes the ambiguous column error

// Perform inner join on uid column
val joinedDF = firstDF.join(secondDF, Seq("uid"))

// Show the dataframe schema
joinedDF.printSchema

// Display the dataframe
joinedDF.show()

// Display the dataframe
joinedDF.select("uid","col1","colA")

Results:

joinedDF.printSchema
root
 |-- uid: integer (nullable = false)
 |-- col1: integer (nullable = false)
 |-- col2: integer (nullable = false)
 |-- col3: integer (nullable = false)
 |-- col4: integer (nullable = false)
 |-- col5: integer (nullable = false)
 |-- col6: integer (nullable = false)
 |-- colA: integer (nullable = false)
 |-- colB: integer (nullable = false)
 |-- colC: integer (nullable = false)
 |-- colD: integer (nullable = false)
 |-- colE: integer (nullable = false)
 |-- colF: integer (nullable = false)

joinedDF.show() +---+----+----+----+----+----+----+----+----+----+----+----+----+ |uid|col1|col2|col3|col4|col5|col6|colA|colB|colC|colD|colE|colF| +---+----+----+----+----+----+----+----+----+----+----+----+----+ | 1| 1| 2| 3| 8| 4| 5| 3| 2| 0| 4| 2| 8| | 2| 4| 3| 8| 7| 9| 8| 3| 3| 2| 6| 5| 4| | 3| 6| 1| 9| 2| 3| 6| 8| 5| 1| 2| 3| 5| | 4| 7| 8| 6| 9| 4| 5| 9| 8| 2| 4| 9| 2| | 5| 9| 2| 7| 8| 7| 3| 3| 4| 8| 0| 6| 2| | 6| 1| 1| 4| 2| 8| 4| 3| 9| 8| 8| 9| 3| +---+----+----+----+----+----+----+----+----+----+----+----+----+
joinedDF.select("uid","col1","colA").show() +---+----+----+ |uid|col1|colA| +---+----+----+ | 1| 1| 3| | 2| 4| 3| | 3| 6| 8| | 4| 7| 9| | 5| 9| 3| | 6| 1| 3| +---+----+----+

Notice the lack of duplicated "uid" columns (yay)! Now if you perform a select using the "uid" column, there is no issue with "ambiguous column name" errors.
Author image

About James Conner

Scuba dive master, wildlife photographer, anthropologist, programmer, electronics tinkerer and big data expert.