Dataframe multiple joins with same dataframe not able to resolve correct join columns

classic Classic list List threaded Threaded
2 messages Options
Reply | Threaded
Open this post in threaded view
|

Dataframe multiple joins with same dataframe not able to resolve correct join columns

Nirav Patel
I am trying to joind df1 with df2 and result of which to again with df2.

df is a common dataframe.

val df3 = df1
      .join(df2,
          df1("PARTICIPANT_ID") === df2("PARTICIPANT_ID") and 
          df1("BUSINESS_ID") === df2("BUSINESS_ID"))
      .drop(df1("BUSINESS_ID")) //dropping duplicates
      .drop(df1("PARTICIPANT_ID")) //dropping duplicates
      .select("EMPLOYEE_ID",...)

val df4 = df3
      .join(df2,
          df3("EMPLOYEE_ID") === df2("EMPLOYEE_ID") and 
          df3("BUSINESS_ID") === df2("BUSINESS_ID"))
      .drop(df2("BUSINESS_ID")) //dropping duplicates
      .drop(df2("EMPLOYEE_ID")) //dropping duplicates
      .select(...)


I am getting following warning and most likely its an Cartesian join which is not what I want.
14:30:32.193 12262 [main] WARN   org.apache.spark.sql.Column - Constructing trivially true equals predicate, 'EMPLOYEE_ID#83 = EMPLOYEE_ID#83'. Perhaps you need to use aliases.

14:30:32.195 12264 [main] WARN   org.apache.spark.sql.Column - Constructing trivially true equals predicate, 'BUSINESS_ID#312 = BUSINESS_ID#312'. Perhaps you need to use aliases.

As you can see,  one of my Join predicate is converted to "(EMPLOYEE_ID#83 = EMPLOYEE_ID#83)"  I think this should be okay because they should still be columns from different dataframe (df3 and df2).

Just want to confirm that this warning is harmless in this scenario.

Problem is similar to this one:

 



What's New with Xactly

        
Reply | Threaded
Open this post in threaded view
|

Re: Dataframe multiple joins with same dataframe not able to resolve correct join columns

Ben White
Sounds like the same root cause as SPARK-14948 or SPARK-10925.

A workaround is to "clone" df3 like this:

val df3clone = df3.toDF(df.schema.fieldNames:_*)

Then use df3clone in place of df3 in the second join.



On Wed, Jul 11, 2018 at 2:52 PM Nirav Patel <[hidden email]> wrote:
I am trying to joind df1 with df2 and result of which to again with df2.

df is a common dataframe.

val df3 = df1
      .join(df2,
          df1("PARTICIPANT_ID") === df2("PARTICIPANT_ID") and 
          df1("BUSINESS_ID") === df2("BUSINESS_ID"))
      .drop(df1("BUSINESS_ID")) //dropping duplicates
      .drop(df1("PARTICIPANT_ID")) //dropping duplicates
      .select("EMPLOYEE_ID",...)

val df4 = df3
      .join(df2,
          df3("EMPLOYEE_ID") === df2("EMPLOYEE_ID") and 
          df3("BUSINESS_ID") === df2("BUSINESS_ID"))
      .drop(df2("BUSINESS_ID")) //dropping duplicates
      .drop(df2("EMPLOYEE_ID")) //dropping duplicates
      .select(...)


I am getting following warning and most likely its an Cartesian join which is not what I want.
14:30:32.193 12262 [main] WARN   org.apache.spark.sql.Column - Constructing trivially true equals predicate, 'EMPLOYEE_ID#83 = EMPLOYEE_ID#83'. Perhaps you need to use aliases.

14:30:32.195 12264 [main] WARN   org.apache.spark.sql.Column - Constructing trivially true equals predicate, 'BUSINESS_ID#312 = BUSINESS_ID#312'. Perhaps you need to use aliases.

As you can see,  one of my Join predicate is converted to "(EMPLOYEE_ID#83 = EMPLOYEE_ID#83)"  I think this should be okay because they should still be columns from different dataframe (df3 and df2).

Just want to confirm that this warning is harmless in this scenario.

Problem is similar to this one:

 



What's New with Xactly