Do we support excluding the CURRENT ROW in PARTITION BY windowing functions?

mathewwicks
Here is an example to illustrate my question.

In this toy example, we collect, for each user, a list of the other products that user has bought, and append it as a new column. (Note that we also filter on an arbitrary column, 'good_bad'.)

I would like to know whether Spark supports excluding the CURRENT ROW from the OVER (PARTITION BY xxx) windowing function.

For example, transaction 1 would have `other_purchases = [prod2, prod3]` rather than `other_purchases = [prod1, prod2, prod3]`.

------------------- Code Below -------------------
df = spark.createDataFrame([
    (1, "user1", "prod1", "good"),
    (2, "user1", "prod2", "good"),
    (3, "user1", "prod3", "good"),
    (4, "user2", "prod3", "bad"),
    (5, "user2", "prod4", "good"),
    (6, "user2", "prod5", "good")],
    ("trans_id", "user_id", "prod_id", "good_bad")
)
df.show()

df = df.selectExpr(
    "trans_id",
    "user_id",
    "COLLECT_LIST(CASE WHEN good_bad == 'good' THEN prod_id END) OVER(PARTITION BY user_id) AS other_purchases"
)
df.show()
----------------------------------------------------

Here is a stackoverflow link: https://stackoverflow.com/questions/43180723/spark-sql-excluding-the-current-row-in-partition-by-windowing-functions