When performing joins in Spark, one question keeps coming up: When joining multiple dataframes, how do you prevent ambiguous column name errors?
1) Let's start off by preparing a couple of simple example dataframes
// Create first example dataframe
val firstDF = spark.createDataFrame(Seq(
(1, 1, 2, 3, 8, 4, 5),
(2, 4, 3, 8, 7, 9, 8),
(3, 6, 1, 9, 2, 3, 6),
(4, 7, 8, 6, 9, 4, 5),
(5, 9, 2, 7, 8, 7, 3),
(6, 1, 1, 4, 2, 8, 4)
)).toDF("uid", "col1", "col2", "col3", "col4", "col5", "col6")
// Display the dataframe
firstDF.show()
// Create second example dataframe
val secondDF = spark.createDataFrame(Seq(
(1, 3, 2, 0, 4, 2, 8),
(2, 3, 3, 2, 6, 5, 4),
(3, 8, 5, 1, 2, 3, 5),
(4, 9, 8, 2, 4, 9, 2),
(5, 3, 4, 8, 0, 6, 2),
(6, 3, 9, 8, 8, 9, 3)
)).toDF("uid", "colA", "colB", "colC", "colD", "colE", "colF")
// Display the dataframe
secondDF.show()
2) Next let's perform a simple inner join.
// Perform simple inner join on uid column
val joinedDF = firstDF.join(secondDF)
// Show the dataframe schema
joinedDF.printSchema
// Display the dataframe
joinedDF.select("uid","col1","colA")
Results:
joinedDF.printSchema
root
|-- uid: integer (nullable = false)
|-- col1: integer (nullable = false)
|-- col2: integer (nullable = false)
|-- col3: integer (nullable = false)
|-- col4: integer (nullable = false)
|-- col5: integer (nullable = false)
|-- col6: integer (nullable = false)
|-- uid: integer (nullable = false)
|-- colA: integer (nullable = false)
|-- colB: integer (nullable = false)
|-- colC: integer (nullable = false)
|-- colD: integer (nullable = false)
|-- colE: integer (nullable = false)
|-- colF: integer (nullable = false)
joinedDF.select("uid","col1","colA")
org.apache.spark.sql.AnalysisException: Reference 'uid' is ambiguous, could be: uid#298, uid#337.;
In the schema, notice that there are two "uid" columns, which is what causes the "ambiguous column error" in the following select statement.
3) And finally let's perform a join that removes the ambiguous column error
// Perform inner join on uid column
val joinedDF = firstDF.join(secondDF, Seq("uid"))
// Show the dataframe schema
joinedDF.printSchema
// Display the dataframe
joinedDF.show()
// Display the dataframe
joinedDF.select("uid","col1","colA")
Results:
joinedDF.printSchema
root
|-- uid: integer (nullable = false)
|-- col1: integer (nullable = false)
|-- col2: integer (nullable = false)
|-- col3: integer (nullable = false)
|-- col4: integer (nullable = false)
|-- col5: integer (nullable = false)
|-- col6: integer (nullable = false)
|-- colA: integer (nullable = false)
|-- colB: integer (nullable = false)
|-- colC: integer (nullable = false)
|-- colD: integer (nullable = false)
|-- colE: integer (nullable = false)
|-- colF: integer (nullable = false)
joinedDF.show()
+---+----+----+----+----+----+----+----+----+----+----+----+----+
|uid|col1|col2|col3|col4|col5|col6|colA|colB|colC|colD|colE|colF|
+---+----+----+----+----+----+----+----+----+----+----+----+----+
| 1| 1| 2| 3| 8| 4| 5| 3| 2| 0| 4| 2| 8|
| 2| 4| 3| 8| 7| 9| 8| 3| 3| 2| 6| 5| 4|
| 3| 6| 1| 9| 2| 3| 6| 8| 5| 1| 2| 3| 5|
| 4| 7| 8| 6| 9| 4| 5| 9| 8| 2| 4| 9| 2|
| 5| 9| 2| 7| 8| 7| 3| 3| 4| 8| 0| 6| 2|
| 6| 1| 1| 4| 2| 8| 4| 3| 9| 8| 8| 9| 3|
+---+----+----+----+----+----+----+----+----+----+----+----+----+
joinedDF.select("uid","col1","colA").show()
+---+----+----+
|uid|col1|colA|
+---+----+----+
| 1| 1| 3|
| 2| 4| 3|
| 3| 6| 8|
| 4| 7| 9|
| 5| 9| 3|
| 6| 1| 3|
+---+----+----+
Notice the lack of duplicated "uid" columns (yay)! Now if you perform a select using the "uid" column, there is no issue with "ambiguous column name" errors.