Hi dear spark community !
I want to create a lib which generates features for potentially very large datasets, so I believe Spark could be a nice tool for that.

Let me explain what I need to do:

Each file 'F' of my dataset is composed of at least:
- an id (string or int)
- a timestamp (or a long value)
- a value (generally a double)

I want my tool to:
- compute aggregate functions for many pairs 'instant + duration'

===> FOR EXAMPLE: compute, for the instant 't = 2001-01-01', aggregate functions for data between 't-1month and t' and between 't-12months and t-9months', and this FOR EACH ID!
(aggregate functions such as min/max/count/distinct/last/mode/kurtosis... or even user defined!)

My constraints:
- I don't want to compute aggregates for each tuple of 'F' ---> I want to provide a list of pairs 'instant + duration' (potentially large)
- My 'window' defined by the duration may be really large (but may contain only a few values...)
- I may have many ids...
- I may have many timestamps...

========================================================

Let me describe this with an example, to see if Spark (Spark Streaming?) may help me do that:

Let's imagine that I have all my data in a DB or a file with the following columns:

id | timestamp(ms) | value
A  | 1000000       | 100
A  | 1000500       | 66
B  | 1000000       | 100
B  | 1000010       | 50
B  | 1000020       | 200
B  | 2500000       | 500

(The timestamp is a long value, so as to be able to express dates in ms from 0000-01-01 to today.)

I want to compute operations such as min, max, average, last on the value column, for these pairs:
-> instant = 1000500 / [-1000ms, 0] (i.e. aggregate data between t-1000ms and t)
-> instant = 1333333 / [-5000ms, -2500ms] (i.e. aggregate data between t-5000ms and t-2500ms)

And this will produce this kind of output:

id | timestamp(ms) | min_value | max_value | avg_value | last_value
-------------------------------------------------------------------
A  | 1000500       | min...    | max...    | avg...    | last...
B  | 1000500       | min...    | max...    | avg...    | last...
A  | 1333333       | min...    | max...    | avg...    | last...
B  | 1333333       | min...    | max...    | avg...    | last...

Do you think we can do this efficiently with Spark and/or Spark Streaming, and do you have an idea on "how"?
(I have tested some solutions but I'm not really satisfied ATM...)

Thanks a lot Community :)
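
To pin down what the requested output would contain for the sample data, here is a minimal plain-Python reference of the semantics (not Spark code, and not meant to scale). It rests on assumptions the post does not state: both window bounds are inclusive, "last" means the value with the greatest timestamp inside the window, and an empty window yields no statistics. The names ROWS, WINDOWS and aggregate_windows are made up for this illustration.

```python
from collections import defaultdict

# Sample rows from the post: (id, timestamp_ms, value)
ROWS = [
    ("A", 1000000, 100.0),
    ("A", 1000500, 66.0),
    ("B", 1000000, 100.0),
    ("B", 1000010, 50.0),
    ("B", 1000020, 200.0),
    ("B", 2500000, 500.0),
]

# Requested windows from the post: (instant, start_offset_ms, end_offset_ms)
WINDOWS = [
    (1000500, -1000, 0),
    (1333333, -5000, -2500),
]


def aggregate_windows(rows, windows):
    """Return {(id, instant): stats dict, or None when the window is empty}."""
    by_id = defaultdict(list)
    for row_id, ts, value in rows:
        by_id[row_id].append((ts, value))

    result = {}
    for instant, start_off, end_off in windows:
        lo, hi = instant + start_off, instant + end_off
        for row_id, points in by_id.items():
            # Assumption: both window bounds are inclusive.
            selected = sorted((ts, v) for ts, v in points if lo <= ts <= hi)
            if not selected:
                result[(row_id, instant)] = None
                continue
            values = [v for _, v in selected]
            result[(row_id, instant)] = {
                "min": min(values),
                "max": max(values),
                "avg": sum(values) / len(values),
                # Assumption: "last" is the value at the latest timestamp.
                "last": selected[-1][1],
            }
    return result
```

Under these assumptions, the second pair (instant 1333333, window 1328333 to 1330833) selects no rows for either id in the sample data, so those two output rows would carry nulls rather than values.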
Look at custom UDAFs (user-defined aggregate functions).
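
Assuming this reply refers to user-defined aggregate functions (UDAFs), the idea can be sketched in plain Python as follows. This is not Spark API: it only mirrors the initialize / update / merge / evaluate shape that Spark's Scala UserDefinedAggregateFunction class uses, and the class name MinMaxLast and the buffer layout are made up for the illustration.

```python
from functools import reduce


class MinMaxLast:
    """Mergeable aggregator; buffer is (min, max, last_ts, last_value) or None."""

    def initialize(self):
        # Empty buffer: nothing seen yet.
        return None

    def update(self, buf, ts, value):
        # Fold one (timestamp, value) row into the buffer.
        if buf is None:
            return (value, value, ts, value)
        mn, mx, last_ts, last_v = buf
        if ts >= last_ts:
            last_ts, last_v = ts, value
        return (min(mn, value), max(mx, value), last_ts, last_v)

    def merge(self, a, b):
        # Combine two partial buffers, e.g. from two partitions.
        if a is None:
            return b
        if b is None:
            return a
        last_ts, last_v = (a[2], a[3]) if a[2] > b[2] else (b[2], b[3])
        return (min(a[0], b[0]), max(a[1], b[1]), last_ts, last_v)

    def evaluate(self, buf):
        # Turn the final buffer into the reported result.
        if buf is None:
            return None
        return {"min": buf[0], "max": buf[1], "last": buf[3]}


def run_partition(agg, rows):
    """Aggregate one partition of (timestamp, value) rows into a buffer."""
    return reduce(lambda buf, row: agg.update(buf, *row), rows, agg.initialize())
```

The merge step is what lets partial results be computed independently on separate partitions and combined afterwards, which matters when one id's data is spread over many partitions.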
Also, the RDD StatCounter will already compute most of your desired metrics, as will df.describe: https://databricks.com/blog/2015/06/02/statistical-and-mathematical-functions-with-dataframes-in-spark.html
