Spark SQL check timestamp with other table and update a column.


anbutech
Hi Team,

I want to update col3 in table1 whenever col1 from table2 is less than col1 in
table1, for each record in table1. I'm not getting the correct
output.

Table 1:
col1|col2|col3
2020-11-17T20:50:57.777+0000|1|null

Table 2:
col1|col2|col3
2020-11-17T21:19:06.508+0000|1|win
2020-11-17T20:49:06.244+0000|1|win
2020-11-17T20:19:13.484+0000|1|Win

sql tried:

select a.col1,a.col2.coalesce(a.col3,b.col3) as col3
from table1 a left table2 b
on (a.col2=b.col2) and (b.col1 < b.col1)

output:
I'm getting the following output.

2020-11-17T20:50:57.777+0000|1|Win    2020-11-17T21:19:06.508+0000|1|win
2020-11-17T20:50:57.777+0000|1|Win    2020-11-17T20:49:06.244+0000|1|win

I'm looking for only the second record in the output.

The issue here is that I'm getting one additional record whenever col1 from table2 is
less than col1 in table1 when using the above query.

expected output:

2020-11-17T20:50:57.777+0000|1|Win 2020-11-17T20:49:06.244+0000|1|win

How do we achieve this correctly? I have many records like this.

Thanks



--
Sent from: http://apache-spark-user-list.1001560.n3.nabble.com/

---------------------------------------------------------------------
To unsubscribe e-mail: [hidden email]


Re: Spark SQL check timestamp with other table and update a column.

Khalid Mammadov

Hi,

I am not sure whether you were writing pseudo-code or the real thing, but there were a few issues in the SQL.

I have reproduced your example in the Spark REPL and it all worked as expected; the result is the one you need.

Please see the full code below:

## Spark 3.0.0

>>> a = spark.read.csv("tab1", sep="|", header=True, schema="col1 TIMESTAMP, col2 INT, col3 STRING")
>>> a.printSchema
<bound method DataFrame.printSchema of DataFrame[col1: timestamp, col2: int, col3: string]>

>>> a.show()
+--------------------+----+----+
|                col1|col2|col3|
+--------------------+----+----+
|2020-11-17 20:50:...|   1|null|
+--------------------+----+----+

>>> b = spark.read.csv("tab2", sep="|", header=True, schema="col1 TIMESTAMP, col2 INT, col3 STRING")
>>> b.printSchema
<bound method DataFrame.printSchema of DataFrame[col1: timestamp, col2: int, col3: string]>

>>> b.show()
+--------------------+----+----+
|                col1|col2|col3|
+--------------------+----+----+
|2020-11-17 21:19:...|   1| win|
|2020-11-17 20:49:...|   1| win|
|2020-11-17 20:19:...|   1| Win|
+--------------------+----+----+

>>> a.createOrReplaceTempView("table1")
>>> b.createOrReplaceTempView("table2")

>>> res = spark.sql("""
... select a.col1,a.col2, b.col1, b.col2, coalesce(a.col3,b.col3) as col3
... from table1 a join table2 b
... on (a.col2=b.col2) and (a.col1 < b.col1)
... """)
>>> res.show()
+--------------------+----+--------------------+----+----+
|                col1|col2|                col1|col2|col3|
+--------------------+----+--------------------+----+----+
|2020-11-17 20:50:...|   1|2020-11-17 21:19:...|   1| win|
+--------------------+----+--------------------+----+----+

Regards,

Khalid

On 19/11/2020 05:13, anbutech wrote:
select a.col1,a.col2.coalesce(a.col3,b.col3) as col3 
from table1 a left table2 b
on (a.col2=b.col2) and (b.col1 < b.col1)