Need help with SparkSQL Query

Nikhil Goyal
Hi guys,

I have a dataframe of type Record (id: Long, timestamp: Long, isValid: Boolean, .... other metrics)

Schema looks like this:
root
 |-- id: long (nullable = true)
 |-- timestamp: long (nullable = true)
 |-- isValid: boolean (nullable = true)
.....

I need to find the earliest valid record per id. In the RDD world I could group by 'id' and take the earliest record, but I am not sure how to do this in SQL. Since I am working in PySpark, I cannot really use the Dataset API for this.

One thing I can do is group by 'id', find the earliest timestamp, and then join back with the original dataframe to pull in the full record (all the metrics), roughly as sketched below.
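
Something along these lines (calling the dataframe record; the helper names are just for illustration, and the isValid filter reflects the "valid record" requirement):

from pyspark.sql import functions as F

# keep only valid records, per the requirement
valid = record.filter(F.col('isValid'))

# earliest valid timestamp per id
earliest_ts = valid.groupBy('id').agg(F.min('timestamp').alias('timestamp'))

# join back on (id, timestamp) to recover all the metrics
result = earliest_ts.join(valid, on=['id', 'timestamp'])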

Or I can combine everything into a single column, implement a UDAF in Scala, and use it from PySpark.

Neither solution seems straightforward. Is there a simpler way to do this?

Thanks
Nikhil

Re: Need help with SparkSQL Query

Patrick McCarthy-2
Untested, but something like the below should work:

from pyspark.sql import functions as F
from pyspark.sql import window as W

(record
 .withColumn('ts_rank',
             F.dense_rank().over(W.Window.partitionBy('id').orderBy('timestamp')))
 .filter(F.col('ts_rank') == 1)  # keep only the earliest timestamp per id
 .drop('ts_rank')
)
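
(Note: dense_rank keeps every row tied on the earliest timestamp; row_number() would keep exactly one per id. A .filter(F.col('isValid')) before ranking would restrict this to valid records, per the original question.)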



Re: Need help with SparkSQL Query

ramannanda9@gmail.com
In reply to this post by Nikhil Goyal
You can use analytic (window) functions in Spark SQL.

Something like: select * from (select *, row_number() over (partition by id order by timestamp) as rn from root) t where rn = 1
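
From PySpark that can be run against a temp view, e.g. something like this (the view name is arbitrary, spark is the SparkSession, and the isValid filter is added to match the "earliest valid record" requirement):

record.createOrReplaceTempView("records")

earliest = spark.sql("""
    select * from (
        select *,
               row_number() over (partition by id order by timestamp) as rn
        from records
        where isValid  -- keep only valid records before ranking
    ) t
    where rn = 1
""").drop("rn")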


--
Regards,
Ramandeep Singh
http://orastack.com
+13474792296
[hidden email]