Mean over window with minimum number of rows


Hi all,
Before I go the route of rolling my own UDAF:

I'm computing a "last 5" rolling mean, so I have the following window defined:
Window.partitionBy(person).orderBy(timestamp).rowsBetween(-4, Window.currentRow)
Then I calculate the mean over that window.

Within each partition, I'd like the first 4 rows to return null/NaN, because there aren't enough rows yet to form a true "last 5." This is the behavior I get in pandas with a rolling mean. Instead, Spark calculates the mean of however many rows happen to fall in the frame, even if there is only 1 row.
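For concreteness, here is the pandas behavior I'm describing, as a small self-contained sketch (the values are made up; `rolling` defaults `min_periods` to the window size, which is exactly the semantics I want):

```python
import pandas as pd

s = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])

# With window=5 and the default min_periods=5, the first 4 entries
# come back NaN; only once 5 rows are available is a mean emitted.
m = s.rolling(window=5).mean()
```

Here `m` is NaN for the first four positions, then 3.0 (mean of 1..5) and 4.0 (mean of 2..6).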

Is there a simple way to do this already in Spark? It seems like a common need, so I wonder if I'm missing something.