Several Aggregations on a window function

Previous Topic Next Topic
 
classic Classic list List threaded Threaded
6 messages Options
Reply | Threaded
Open this post in threaded view
|

Several Aggregations on a window function

jchamp
Hi Spark Community members !

I want to do several ( from 1 to 10) aggregate functions using window functions on something like 100 columns.

Instead of doing several pass on the data to compute each aggregate function, is there a way to do this efficiently ?



Currently it seems that doing

val tw =
Window
.orderBy("date")
.partitionBy("id")
.rangeBetween(-8035200000L, 0)
and then 
x
.withColumn("agg1", max("col").over(tw))
.withColumn("agg2", min("col").over(tw))
.withColumn("aggX", avg("col").over(tw))

Is not really efficient :/ 
It seems that it iterates on the whole column for each aggregation ? Am I right ?

Is there a way to compute all the required operations on a columns with a single pass ? 
Event better, to compute all the required operations on ALL columns with a single pass ?

Thx for your Future[Answers]

Julien





--


Julien CHAMP — Data Scientist


Web : www.tellmeplus.com  Email : [hidden email]

Phone  : <a href="tel:0689350189" value="+33689350189" class="gmail_msg" target="_blank">06 89 35 01 89  — LinkedIn here

TellMePlus S.A — Predictive Objects

Paris : 7 rue des Pommerots, 78400 Chatou
Montpellier : 51 impasse des églantiers, 34980 St Clément de Rivière



Ce message peut contenir des informations confidentielles ou couvertes par le secret professionnel, à l’intention de son destinataire. Si vous n’en êtes pas le destinataire, merci de contacter l’expéditeur et d’en supprimer toute copie.
This email may contain confidential and/or privileged information for the intended recipient. If you are not the intended recipient, please contact the sender and delete all copies.


Reply | Threaded
Open this post in threaded view
|

Re: Several Aggregations on a window function

jchamp
May be I should consider something like impala ?

Le ven. 15 déc. 2017 à 11:32, Julien CHAMP <[hidden email]> a écrit :
Hi Spark Community members !

I want to do several ( from 1 to 10) aggregate functions using window functions on something like 100 columns.

Instead of doing several pass on the data to compute each aggregate function, is there a way to do this efficiently ?



Currently it seems that doing

val tw =
Window
.orderBy("date")
.partitionBy("id")
.rangeBetween(-8035200000L, 0)
and then 
x
.withColumn("agg1", max("col").over(tw))
.withColumn("agg2", min("col").over(tw))
.withColumn("aggX", avg("col").over(tw))

Is not really efficient :/ 
It seems that it iterates on the whole column for each aggregation ? Am I right ?

Is there a way to compute all the required operations on a columns with a single pass ? 
Event better, to compute all the required operations on ALL columns with a single pass ?

Thx for your Future[Answers]

Julien





--


Julien CHAMP — Data Scientist


Web : www.tellmeplus.com  Email : [hidden email]

Phone  : <a href="tel:0689350189" value="+33689350189" class="m_6073364623645303948gmail_msg" target="_blank">06 89 35 01 89  — LinkedIn here

TellMePlus S.A — Predictive Objects

Paris : 7 rue des Pommerots, 78400 Chatou
Montpellier : 51 impasse des églantiers, 34980 St Clément de Rivière

--


Julien CHAMP — Data Scientist


Web : www.tellmeplus.com  Email : [hidden email]

Phone  : <a href="tel:0689350189" value="+33689350189" class="gmail_msg" target="_blank">06 89 35 01 89  — LinkedIn here

TellMePlus S.A — Predictive Objects

Paris : 7 rue des Pommerots, 78400 Chatou
Montpellier : 51 impasse des églantiers, 34980 St Clément de Rivière



Ce message peut contenir des informations confidentielles ou couvertes par le secret professionnel, à l’intention de son destinataire. Si vous n’en êtes pas le destinataire, merci de contacter l’expéditeur et d’en supprimer toute copie.
This email may contain confidential and/or privileged information for the intended recipient. If you are not the intended recipient, please contact the sender and delete all copies.


Reply | Threaded
Open this post in threaded view
|

Re: Several Aggregations on a window function

jchamp
I've been looking for several solutions but I can't find something efficient to compute many window function efficiently ( optimized computation or efficient parallelism ) 
Am I the only one interested by this ? 


Regards,

Julien

Le ven. 15 déc. 2017 à 21:34, Julien CHAMP <[hidden email]> a écrit :
May be I should consider something like impala ?

Le ven. 15 déc. 2017 à 11:32, Julien CHAMP <[hidden email]> a écrit :
Hi Spark Community members !

I want to do several ( from 1 to 10) aggregate functions using window functions on something like 100 columns.

Instead of doing several pass on the data to compute each aggregate function, is there a way to do this efficiently ?



Currently it seems that doing

val tw =
Window
.orderBy("date")
.partitionBy("id")
.rangeBetween(-8035200000L, 0)
and then 
x
.withColumn("agg1", max("col").over(tw))
.withColumn("agg2", min("col").over(tw))
.withColumn("aggX", avg("col").over(tw))

Is not really efficient :/ 
It seems that it iterates on the whole column for each aggregation ? Am I right ?

Is there a way to compute all the required operations on a columns with a single pass ? 
Event better, to compute all the required operations on ALL columns with a single pass ?

Thx for your Future[Answers]

Julien





--


Julien CHAMP — Data Scientist


Web : www.tellmeplus.com  Email : [hidden email]

Phone  : <a href="tel:0689350189" value="+33689350189" class="m_-4072636311788037491m_6073364623645303948gmail_msg" target="_blank">06 89 35 01 89  — LinkedIn here

TellMePlus S.A — Predictive Objects

Paris : 7 rue des Pommerots, 78400 Chatou
Montpellier : 51 impasse des églantiers, 34980 St Clément de Rivière

--


Julien CHAMP — Data Scientist


Web : www.tellmeplus.com  Email : [hidden email]

Phone  : <a href="tel:0689350189" value="+33689350189" class="m_-4072636311788037491gmail_msg" target="_blank">06 89 35 01 89  — LinkedIn here

TellMePlus S.A — Predictive Objects

Paris : 7 rue des Pommerots, 78400 Chatou
Montpellier : 51 impasse des églantiers, 34980 St Clément de Rivière

--


Julien CHAMP — Data Scientist


Web : www.tellmeplus.com  Email : [hidden email]

Phone  : <a href="tel:0689350189" value="+33689350189" class="gmail_msg" target="_blank">06 89 35 01 89  — LinkedIn here

TellMePlus S.A — Predictive Objects

Paris : 7 rue des Pommerots, 78400 Chatou
Montpellier : 51 impasse des églantiers, 34980 St Clément de Rivière



Ce message peut contenir des informations confidentielles ou couvertes par le secret professionnel, à l’intention de son destinataire. Si vous n’en êtes pas le destinataire, merci de contacter l’expéditeur et d’en supprimer toute copie.
This email may contain confidential and/or privileged information for the intended recipient. If you are not the intended recipient, please contact the sender and delete all copies.


Reply | Threaded
Open this post in threaded view
|

Re: Several Aggregations on a window function

Anastasios Zouzias
Hi Julien,

I am not sure if my answer applies on the streaming part of your question. However, in batch processing, if you want to perform multiple aggregations over an RDD with a single pass, a common approach is to use multiple aggregators (a.k.a. tuple monoids), see below an example from algebird:

https://github.com/twitter/scalding/wiki/Aggregation-using-Algebird-Aggregators#composing-aggregators.

Best,
Anastasios

On Mon, Dec 18, 2017 at 10:38 AM, Julien CHAMP <[hidden email]> wrote:
I've been looking for several solutions but I can't find something efficient to compute many window function efficiently ( optimized computation or efficient parallelism ) 
Am I the only one interested by this ? 


Regards,

Julien

Le ven. 15 déc. 2017 à 21:34, Julien CHAMP <[hidden email]> a écrit :
May be I should consider something like impala ?

Le ven. 15 déc. 2017 à 11:32, Julien CHAMP <[hidden email]> a écrit :
Hi Spark Community members !

I want to do several ( from 1 to 10) aggregate functions using window functions on something like 100 columns.

Instead of doing several pass on the data to compute each aggregate function, is there a way to do this efficiently ?



Currently it seems that doing

val tw =
Window
.orderBy("date")
.partitionBy("id")
.rangeBetween(-8035200000L, 0)
and then 
x
.withColumn("agg1", max("col").over(tw))
.withColumn("agg2", min("col").over(tw))
.withColumn("aggX", avg("col").over(tw))

Is not really efficient :/ 
It seems that it iterates on the whole column for each aggregation ? Am I right ?

Is there a way to compute all the required operations on a columns with a single pass ? 
Event better, to compute all the required operations on ALL columns with a single pass ?

Thx for your Future[Answers]

Julien





--


Julien CHAMP — Data Scientist


Web : www.tellmeplus.com  Email : [hidden email]

Phone  : <a href="tel:0689350189" value="+33689350189" class="m_-8453865844858151894m_-4072636311788037491m_6073364623645303948gmail_msg" target="_blank">06 89 35 01 89  — LinkedIn here

TellMePlus S.A — Predictive Objects

Paris : 7 rue des Pommerots, 78400 Chatou
Montpellier : 51 impasse des églantiers, 34980 St Clément de Rivière

--


Julien CHAMP — Data Scientist


Web : www.tellmeplus.com  Email : [hidden email]

Phone  : <a href="tel:0689350189" value="+33689350189" class="m_-8453865844858151894m_-4072636311788037491gmail_msg" target="_blank">06 89 35 01 89  — LinkedIn here

TellMePlus S.A — Predictive Objects

Paris : 7 rue des Pommerots, 78400 Chatou
Montpellier : 51 impasse des églantiers, 34980 St Clément de Rivière

--


Julien CHAMP — Data Scientist


Web : www.tellmeplus.com  Email : [hidden email]

Phone  : <a href="tel:0689350189" value="+33689350189" class="m_-8453865844858151894gmail_msg" target="_blank">06 89 35 01 89  — LinkedIn here

TellMePlus S.A — Predictive Objects

Paris : 7 rue des Pommerots, 78400 Chatou
Montpellier : 51 impasse des églantiers, 34980 St Clément de Rivière



Ce message peut contenir des informations confidentielles ou couvertes par le secret professionnel, à l’intention de son destinataire. Si vous n’en êtes pas le destinataire, merci de contacter l’expéditeur et d’en supprimer toute copie.
This email may contain confidential and/or privileged information for the intended recipient. If you are not the intended recipient, please contact the sender and delete all copies.





--
-- Anastasios Zouzias
Reply | Threaded
Open this post in threaded view
|

Re: Several Aggregations on a window function

jchamp
It seems interesting, however scalding seems to require be used outside of spark ?


Le lun. 18 déc. 2017 à 17:15, Anastasios Zouzias <[hidden email]> a écrit :
Hi Julien,

I am not sure if my answer applies on the streaming part of your question. However, in batch processing, if you want to perform multiple aggregations over an RDD with a single pass, a common approach is to use multiple aggregators (a.k.a. tuple monoids), see below an example from algebird:

https://github.com/twitter/scalding/wiki/Aggregation-using-Algebird-Aggregators#composing-aggregators.

Best,
Anastasios

On Mon, Dec 18, 2017 at 10:38 AM, Julien CHAMP <[hidden email]> wrote:
I've been looking for several solutions but I can't find something efficient to compute many window function efficiently ( optimized computation or efficient parallelism ) 
Am I the only one interested by this ? 


Regards,

Julien

Le ven. 15 déc. 2017 à 21:34, Julien CHAMP <[hidden email]> a écrit :
May be I should consider something like impala ?

Le ven. 15 déc. 2017 à 11:32, Julien CHAMP <[hidden email]> a écrit :
Hi Spark Community members !

I want to do several ( from 1 to 10) aggregate functions using window functions on something like 100 columns.

Instead of doing several pass on the data to compute each aggregate function, is there a way to do this efficiently ?



Currently it seems that doing

val tw =
Window
.orderBy("date")
.partitionBy("id")
.rangeBetween(-8035200000L, 0)
and then 
x
.withColumn("agg1", max("col").over(tw))
.withColumn("agg2", min("col").over(tw))
.withColumn("aggX", avg("col").over(tw))

Is not really efficient :/ 
It seems that it iterates on the whole column for each aggregation ? Am I right ?

Is there a way to compute all the required operations on a columns with a single pass ? 
Event better, to compute all the required operations on ALL columns with a single pass ?

Thx for your Future[Answers]

Julien





--


Julien CHAMP — Data Scientist


Web : www.tellmeplus.com  Email : [hidden email]

Phone  : <a href="tel:0689350189" value="+33689350189" class="m_4698123063955538990m_-8453865844858151894m_-4072636311788037491m_6073364623645303948gmail_msg" target="_blank">06 89 35 01 89  — LinkedIn here

TellMePlus S.A — Predictive Objects

Paris : 7 rue des Pommerots, 78400 Chatou
Montpellier : 51 impasse des églantiers, 34980 St Clément de Rivière

--


Julien CHAMP — Data Scientist


Web : www.tellmeplus.com  Email : [hidden email]

Phone  : <a href="tel:0689350189" value="+33689350189" class="m_4698123063955538990m_-8453865844858151894m_-4072636311788037491gmail_msg" target="_blank">06 89 35 01 89  — LinkedIn here

TellMePlus S.A — Predictive Objects

Paris : 7 rue des Pommerots, 78400 Chatou
Montpellier : 51 impasse des églantiers, 34980 St Clément de Rivière

--


Julien CHAMP — Data Scientist


Web : www.tellmeplus.com  Email : [hidden email]

Phone  : <a href="tel:0689350189" value="+33689350189" class="m_4698123063955538990m_-8453865844858151894gmail_msg" target="_blank">06 89 35 01 89  — LinkedIn here

TellMePlus S.A — Predictive Objects

Paris : 7 rue des Pommerots, 78400 Chatou
Montpellier : 51 impasse des églantiers, 34980 St Clément de Rivière



Ce message peut contenir des informations confidentielles ou couvertes par le secret professionnel, à l’intention de son destinataire. Si vous n’en êtes pas le destinataire, merci de contacter l’expéditeur et d’en supprimer toute copie.
This email may contain confidential and/or privileged information for the intended recipient. If you are not the intended recipient, please contact the sender and delete all copies.





--
-- Anastasios Zouzias
--


Julien CHAMP — Data Scientist


Web : www.tellmeplus.com  Email : [hidden email]

Phone  : <a href="tel:0689350189" value="+33689350189" class="gmail_msg" target="_blank">06 89 35 01 89  — LinkedIn here

TellMePlus S.A — Predictive Objects

Paris : 7 rue des Pommerots, 78400 Chatou
Montpellier : 51 impasse des églantiers, 34980 St Clément de Rivière



Ce message peut contenir des informations confidentielles ou couvertes par le secret professionnel, à l’intention de son destinataire. Si vous n’en êtes pas le destinataire, merci de contacter l’expéditeur et d’en supprimer toute copie.
This email may contain confidential and/or privileged information for the intended recipient. If you are not the intended recipient, please contact the sender and delete all copies.


Reply | Threaded
Open this post in threaded view
|

Re: Several Aggregations on a window function

Anastasios Zouzias
Hi,

You can use https://twitter.github.io/algebird/ which provides an implementation of interesting Monoids and ways to combine them to tuples (or products) of Monoids. Of course, you are not bound to use the algebird library but it might be helpful to bootstrap.
 


On Mon, Dec 18, 2017 at 7:18 PM, Julien CHAMP <[hidden email]> wrote:
It seems interesting, however scalding seems to require be used outside of spark ?


Le lun. 18 déc. 2017 à 17:15, Anastasios Zouzias <[hidden email]> a écrit :
Hi Julien,

I am not sure if my answer applies on the streaming part of your question. However, in batch processing, if you want to perform multiple aggregations over an RDD with a single pass, a common approach is to use multiple aggregators (a.k.a. tuple monoids), see below an example from algebird:

https://github.com/twitter/scalding/wiki/Aggregation-using-Algebird-Aggregators#composing-aggregators.

Best,
Anastasios

On Mon, Dec 18, 2017 at 10:38 AM, Julien CHAMP <[hidden email]> wrote:
I've been looking for several solutions but I can't find something efficient to compute many window function efficiently ( optimized computation or efficient parallelism ) 
Am I the only one interested by this ? 


Regards,

Julien

Le ven. 15 déc. 2017 à 21:34, Julien CHAMP <[hidden email]> a écrit :
May be I should consider something like impala ?

Le ven. 15 déc. 2017 à 11:32, Julien CHAMP <[hidden email]> a écrit :
Hi Spark Community members !

I want to do several ( from 1 to 10) aggregate functions using window functions on something like 100 columns.

Instead of doing several pass on the data to compute each aggregate function, is there a way to do this efficiently ?



Currently it seems that doing

val tw =
Window
.orderBy("date")
.partitionBy("id")
.rangeBetween(-8035200000L, 0)
and then 
x
.withColumn("agg1", max("col").over(tw))
.withColumn("agg2", min("col").over(tw))
.withColumn("aggX", avg("col").over(tw))

Is not really efficient :/ 
It seems that it iterates on the whole column for each aggregation ? Am I right ?

Is there a way to compute all the required operations on a columns with a single pass ? 
Event better, to compute all the required operations on ALL columns with a single pass ?

Thx for your Future[Answers]

Julien





--


Julien CHAMP — Data Scientist


Web : www.tellmeplus.com  Email : [hidden email]

Phone  : <a href="tel:0689350189" value="+33689350189" class="m_3239994684120405066m_4698123063955538990m_-8453865844858151894m_-4072636311788037491m_6073364623645303948gmail_msg" target="_blank">06 89 35 01 89  — LinkedIn here

TellMePlus S.A — Predictive Objects

Paris : 7 rue des Pommerots, 78400 Chatou
Montpellier : 51 impasse des églantiers, 34980 St Clément de Rivière

--


Julien CHAMP — Data Scientist


Web : www.tellmeplus.com  Email : [hidden email]

Phone  : <a href="tel:0689350189" value="+33689350189" class="m_3239994684120405066m_4698123063955538990m_-8453865844858151894m_-4072636311788037491gmail_msg" target="_blank">06 89 35 01 89  — LinkedIn here

TellMePlus S.A — Predictive Objects

Paris : 7 rue des Pommerots, 78400 Chatou
Montpellier : 51 impasse des églantiers, 34980 St Clément de Rivière

--


Julien CHAMP — Data Scientist


Web : www.tellmeplus.com  Email : [hidden email]

Phone  : <a href="tel:0689350189" value="+33689350189" class="m_3239994684120405066m_4698123063955538990m_-8453865844858151894gmail_msg" target="_blank">06 89 35 01 89  — LinkedIn here

TellMePlus S.A — Predictive Objects

Paris : 7 rue des Pommerots, 78400 Chatou
Montpellier : 51 impasse des églantiers, 34980 St Clément de Rivière



Ce message peut contenir des informations confidentielles ou couvertes par le secret professionnel, à l’intention de son destinataire. Si vous n’en êtes pas le destinataire, merci de contacter l’expéditeur et d’en supprimer toute copie.
This email may contain confidential and/or privileged information for the intended recipient. If you are not the intended recipient, please contact the sender and delete all copies.





--
-- Anastasios Zouzias
--


Julien CHAMP — Data Scientist


Web : www.tellmeplus.com  Email : [hidden email]

Phone  : <a href="tel:0689350189" value="+33689350189" class="m_3239994684120405066gmail_msg" target="_blank">06 89 35 01 89  — LinkedIn here

TellMePlus S.A — Predictive Objects

Paris : 7 rue des Pommerots, 78400 Chatou
Montpellier : 51 impasse des églantiers, 34980 St Clément de Rivière



Ce message peut contenir des informations confidentielles ou couvertes par le secret professionnel, à l’intention de son destinataire. Si vous n’en êtes pas le destinataire, merci de contacter l’expéditeur et d’en supprimer toute copie.
This email may contain confidential and/or privileged information for the intended recipient. If you are not the intended recipient, please contact the sender and delete all copies.





--
-- Anastasios Zouzias