Fwd: Feature Generation for Large datasets composed of many time series

Previous Topic Next Topic
 
classic Classic list List threaded Threaded
4 messages Options
Reply | Threaded
Open this post in threaded view
|

Fwd: Feature Generation for Large datasets composed of many time series

julio.cesare
Hi dear spark community !

I want to create a lib which generates features for potentially very
large datasets, so I believe spark could be a nice tool for that.
Let me explain what I need to do :

Each file 'F' of my dataset is composed of at least :
- an id ( string or int )
- a timestamp ( or a long value )
- a value ( generaly a double )

I want my tool to :
- compute aggregate function for many pairs 'instants + duration'
===> FOR EXAMPLE :
===== compute for the instant 't = 2001-01-01' aggregate functions for
data between 't-1month and t' and 't-12months and t-9months' and this,
FOR EACH ID !
( aggregate functions such as
min/max/count/distinct/last/mode/kurtosis... or even user defined ! )

My constraints :
- I don't want to compute aggregate for each tuple of 'F'
---> I want to provide a list of couples 'instants + duration' (
potentially large )
- My 'window' defined by the duration may be really large ( but may
contain only a few values... )
- I may have many id...
- I may have many timestamps...

========================================================
========================================================
========================================================

Let me describe this with some kind of example to see if SPARK ( SPARK
STREAMING ? ) may help me to do that :

Let's imagine that I have all my data in a DB or a file with the
following columns :
id | timestamp(ms) | value
A | 1000000 |  100
A | 1000500 |  66
B | 1000000 |  100
B | 1000010 |  50
B | 1000020 |  200
B | 2500000 |  500

( The timestamp is a long value, so as to be able to express date in ms
from 0000-01-01..... to today )

I want to compute operations such as min, max, average, last on the
value column, for a these couples :
-> instant = 1000500 / [-1000ms, 0 ] ( i.e. : aggregate data between [
t-1000ms and t ]
-> instant = 1333333 / [-5000ms, -2500 ] ( i.e. : aggregate data between
[ t-5000ms and t-2500ms ]


And this will produce this kind of output :

id | timestamp(ms) | min_value | max_value | avg_value | last_value
-------------------------------------------------------------------
A | 1000500        | min...    | max....   | avg....   | last....
B | 1000500        | min...    | max....   | avg....   | last....
A | 1333333        | min...    | max....   | avg....   | last....
B | 1333333        | min...    | max....   | avg....   | last....



Do you think we can do this efficiently with spark and/or spark
streaming, and do you have an idea on "how" ?
( I have tested some solutions but I'm not really satisfied ATM... )


Thanks a lot Community :)

---------------------------------------------------------------------
To unsubscribe e-mail: [hidden email]

Reply | Threaded
Open this post in threaded view
|

Feature generation / aggregate functions / timeseries

julio.cesare
Hi dear spark community !

I want to create a lib which generates features for potentially very
large datasets, so I believe spark could be a nice tool for that.
Let me explain what I need to do :

Each file 'F' of my dataset is composed of at least :
- an id ( string or int )
- a timestamp ( or a long value )
- a value ( generaly a double )

I want my tool to :
- compute aggregate function for many pairs 'instants + duration'
===> FOR EXAMPLE :
===== compute for the instant 't = 2001-01-01' aggregate functions for
data between 't-1month and t' and 't-12months and t-9months' and this,
FOR EACH ID !
( aggregate functions such as
min/max/count/distinct/last/mode/kurtosis... or even user defined ! )

My constraints :
- I don't want to compute aggregate for each tuple of 'F'
---> I want to provide a list of couples 'instants + duration' (
potentially large )
- My 'window' defined by the duration may be really large ( but may
contain only a few values... )
- I may have many id...
- I may have many timestamps...

========================================================
========================================================
========================================================

Let me describe this with some kind of example to see if SPARK ( SPARK
STREAMING ? ) may help me to do that :

Let's imagine that I have all my data in a DB or a file with the
following columns :
id | timestamp(ms) | value
A | 1000000 |  100
A | 1000500 |  66
B | 1000000 |  100
B | 1000010 |  50
B | 1000020 |  200
B | 2500000 |  500

( The timestamp is a long value, so as to be able to express date in ms
from 0000-01-01..... to today )

I want to compute operations such as min, max, average, last on the
value column, for a these couples :
-> instant = 1000500 / [-1000ms, 0 ] ( i.e. : aggregate data between [
t-1000ms and t ]
-> instant = 1333333 / [-5000ms, -2500 ] ( i.e. : aggregate data between
[ t-5000ms and t-2500ms ]


And this will produce this kind of output :

id | timestamp(ms) | min_value | max_value | avg_value | last_value
-------------------------------------------------------------------
A | 1000500        | min...    | max....   | avg....   | last....
B | 1000500        | min...    | max....   | avg....   | last....
A | 1333333        | min...    | max....   | avg....   | last....
B | 1333333        | min...    | max....   | avg....   | last....



Do you think we can do this efficiently with spark and/or spark
streaming, and do you have an idea on "how" ?
( I have tested some solutions but I'm not really satisfied ATM... )


Thanks a lot Community :)

---------------------------------------------------------------------
To unsubscribe e-mail: [hidden email]

Reply | Threaded
Open this post in threaded view
|

Re: Feature generation / aggregate functions / timeseries

geoHeil
Look at custom UADF functions.
<[hidden email]> schrieb am Do. 14. Dez. 2017 um 09:31:
Hi dear spark community !

I want to create a lib which generates features for potentially very
large datasets, so I believe spark could be a nice tool for that.
Let me explain what I need to do :

Each file 'F' of my dataset is composed of at least :
- an id ( string or int )
- a timestamp ( or a long value )
- a value ( generaly a double )

I want my tool to :
- compute aggregate function for many pairs 'instants + duration'
===> FOR EXAMPLE :
===== compute for the instant 't = 2001-01-01' aggregate functions for
data between 't-1month and t' and 't-12months and t-9months' and this,
FOR EACH ID !
( aggregate functions such as
min/max/count/distinct/last/mode/kurtosis... or even user defined ! )

My constraints :
- I don't want to compute aggregate for each tuple of 'F'
---> I want to provide a list of couples 'instants + duration' (
potentially large )
- My 'window' defined by the duration may be really large ( but may
contain only a few values... )
- I may have many id...
- I may have many timestamps...

========================================================
========================================================
========================================================

Let me describe this with some kind of example to see if SPARK ( SPARK
STREAMING ? ) may help me to do that :

Let's imagine that I have all my data in a DB or a file with the
following columns :
id | timestamp(ms) | value
A | 1000000 |  100
A | 1000500 |  66
B | 1000000 |  100
B | 1000010 |  50
B | 1000020 |  200
B | 2500000 |  500

( The timestamp is a long value, so as to be able to express date in ms
from 0000-01-01..... to today )

I want to compute operations such as min, max, average, last on the
value column, for a these couples :
-> instant = 1000500 / [-1000ms, 0 ] ( i.e. : aggregate data between [
t-1000ms and t ]
-> instant = 1333333 / [-5000ms, -2500 ] ( i.e. : aggregate data between
[ t-5000ms and t-2500ms ]


And this will produce this kind of output :

id | timestamp(ms) | min_value | max_value | avg_value | last_value
-------------------------------------------------------------------
A | 1000500        | min...    | max....   | avg....   | last....
B | 1000500        | min...    | max....   | avg....   | last....
A | 1333333        | min...    | max....   | avg....   | last....
B | 1333333        | min...    | max....   | avg....   | last....



Do you think we can do this efficiently with spark and/or spark
streaming, and do you have an idea on "how" ?
( I have tested some solutions but I'm not really satisfied ATM... )


Thanks a lot Community :)

---------------------------------------------------------------------
To unsubscribe e-mail: [hidden email]

Reply | Threaded
Open this post in threaded view
|

Re: Feature generation / aggregate functions / timeseries

geoHeil
Also the rdd stat counter will already conpute most of your desired metrics as well as df.describe https://databricks.com/blog/2015/06/02/statistical-and-mathematical-functions-with-dataframes-in-spark.html
Georg Heiler <[hidden email]> schrieb am Do. 14. Dez. 2017 um 19:40:
Look at custom UADF functions
<[hidden email]> schrieb am Do. 14. Dez. 2017 um 09:31:
Hi dear spark community !

I want to create a lib which generates features for potentially very
large datasets, so I believe spark could be a nice tool for that.
Let me explain what I need to do :

Each file 'F' of my dataset is composed of at least :
- an id ( string or int )
- a timestamp ( or a long value )
- a value ( generaly a double )

I want my tool to :
- compute aggregate function for many pairs 'instants + duration'
===> FOR EXAMPLE :
===== compute for the instant 't = 2001-01-01' aggregate functions for
data between 't-1month and t' and 't-12months and t-9months' and this,
FOR EACH ID !
( aggregate functions such as
min/max/count/distinct/last/mode/kurtosis... or even user defined ! )

My constraints :
- I don't want to compute aggregate for each tuple of 'F'
---> I want to provide a list of couples 'instants + duration' (
potentially large )
- My 'window' defined by the duration may be really large ( but may
contain only a few values... )
- I may have many id...
- I may have many timestamps...

========================================================
========================================================
========================================================

Let me describe this with some kind of example to see if SPARK ( SPARK
STREAMING ? ) may help me to do that :

Let's imagine that I have all my data in a DB or a file with the
following columns :
id | timestamp(ms) | value
A | 1000000 |  100
A | 1000500 |  66
B | 1000000 |  100
B | 1000010 |  50
B | 1000020 |  200
B | 2500000 |  500

( The timestamp is a long value, so as to be able to express date in ms
from 0000-01-01..... to today )

I want to compute operations such as min, max, average, last on the
value column, for a these couples :
-> instant = 1000500 / [-1000ms, 0 ] ( i.e. : aggregate data between [
t-1000ms and t ]
-> instant = 1333333 / [-5000ms, -2500 ] ( i.e. : aggregate data between
[ t-5000ms and t-2500ms ]


And this will produce this kind of output :

id | timestamp(ms) | min_value | max_value | avg_value | last_value
-------------------------------------------------------------------
A | 1000500        | min...    | max....   | avg....   | last....
B | 1000500        | min...    | max....   | avg....   | last....
A | 1333333        | min...    | max....   | avg....   | last....
B | 1333333        | min...    | max....   | avg....   | last....



Do you think we can do this efficiently with spark and/or spark
streaming, and do you have an idea on "how" ?
( I have tested some solutions but I'm not really satisfied ATM... )


Thanks a lot Community :)

---------------------------------------------------------------------
To unsubscribe e-mail: [hidden email]