Re: Inner join with the table itself


Re: Inner join with the table itself

Jacek Laskowski
Hi Michael,

-dev +user

What's the query? How do you "fool Spark"?


On Mon, Jan 15, 2018 at 10:23 AM, Michael Shtelma <[hidden email]> wrote:
Hi all,

If I try joining the table with itself on the join columns, I get
the following error:
"Join condition is missing or trivial. Use the CROSS JOIN syntax to
allow cartesian products between these relations.;"

This is not true: my join is neither trivial nor a real cross
join. I am providing a join condition and expect to get maybe a
couple of joined rows for each row in the original table.

There is a workaround for this, which involves renaming all the
columns in the source data frame and only afterwards proceeding with
the join. This lets us fool Spark.

Now I am wondering whether there is a better way to get rid of this
problem. I do not like the idea of renaming the columns because it
makes it really difficult to keep track of the column names in the
result data frames.
Is it possible to deactivate this check?

Thanks,
Michael




Re: Inner join with the table itself

Gengliang Wang
Hi Michael,

You can use `Explain` to see how your query is optimized:
https://docs.databricks.com/spark/latest/spark-sql/language-manual/explain.html
I believe your query is an actual cross join, which is usually very slow to execute.

To get rid of this error, you can set `spark.sql.crossJoin.enabled` to true.
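For example (a minimal sketch for the Spark shell; df1, df2 and the
join condition are placeholders, not your actual tables):

// Allow the planner to produce cartesian products (checked in Spark 2.x)
spark.conf.set("spark.sql.crossJoin.enabled", "true")

// Print the parsed, analyzed, optimized and physical plans of the join
df1.join(df2, df1("a") === df2("b")).explain(true)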






Re: Inner join with the table itself

smikesh
Hi Jacek & Gengliang,

Let's take a look at the following query:

val pos = spark.read.parquet(prefix + "POSITION.parquet")
pos.createOrReplaceTempView("POSITION")
spark.sql("""SELECT POSITION.POSITION_ID FROM POSITION POSITION
  JOIN POSITION POSITION1 ON POSITION.POSITION_ID0 = POSITION1.POSITION_ID""").collect()

This query works for me right now on Spark 2.2.

Now we can try implementing the same logic with DataFrame API:

pos.join(pos, pos("POSITION_ID0")===pos("POSITION_ID")).collect()

I am getting the following error:

"Join condition is missing or trivial.

Use the CROSS JOIN syntax to allow cartesian products between these relations.;"

I have tried using the alias function, but without success:

val pos2 = pos.alias("P2")
pos.join(pos2, pos("POSITION_ID0")===pos2("POSITION_ID")).collect()

This also leads to the same error.
Am I missing something about the usage of alias?

Now let's rename the columns:

val pos3 = pos.toDF(pos.columns.map(_ + "_2"): _*)
pos.join(pos3, pos("POSITION_ID0")===pos3("POSITION_ID_2")).collect()

It works!

There is one more really odd thing about all this: a colleague of
mine managed to get the same exception ("Join condition is missing or
trivial") with the original SQL query as well, but I think he was
using empty tables.

Thanks,
Michael





Re: Inner join with the table itself

Jacek Laskowski
Hi Michael,

scala> spark.version
res0: String = 2.4.0-SNAPSHOT

scala> val r1 = spark.range(1)
r1: org.apache.spark.sql.Dataset[Long] = [id: bigint]

scala> r1.as("left").join(r1.as("right")).filter($"left.id" === $"right.id").show
+---+---+
| id| id|
+---+---+
|  0|  0|
+---+---+

Am I missing something? When aliasing a table, use the alias
identifier inside the column references (as in $"left.id" above).
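Applied to your POSITION example, that could look like the following
(a sketch, not tested against your data):

// Reference the columns through the aliases so the two sides of the
// self-join resolve to two distinct relations
pos.as("p1")
  .join(pos.as("p2"), $"p1.POSITION_ID0" === $"p2.POSITION_ID")
  .select($"p1.POSITION_ID")
  .collect()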





Re: Inner join with the table itself

smikesh
Hi Jacek,

Thank you for the workaround.
It really does work this way:

pos.as("p1").join(pos.as("p2")).filter($"p1.POSITION_ID0" === $"p2.POSITION_ID")

I have checked that this way I get the same execution plan as for
the join with renamed columns.
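For reference, this is roughly how the two plans can be compared (a
sketch; pos3 is the renamed data frame from my earlier mail):

// Both variants should produce the same optimized and physical plans
pos.as("p1").join(pos.as("p2"))
  .filter($"p1.POSITION_ID0" === $"p2.POSITION_ID").explain(true)
pos.join(pos3, pos("POSITION_ID0") === pos3("POSITION_ID_2")).explain(true)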

Best,
Michael


