Pyspark How to groupBy -> fit

Pyspark How to groupBy -> fit

Riccardo Ferrari
Hi list,

I am looking for an efficient solution to apply a training pipeline to each group of a DataFrame.groupBy.

This is very easy with a pandas UDF (i.e. groupBy().apply()), but I am not able to find the equivalent for a Spark pipeline.

The ultimate goal is to fit multiple models, one per group of data.
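
For reference, a minimal sketch of the pandas UDF route mentioned above (Spark 3.x applyInPandas); the column names ("group", "x", "y"), the parquet path, and the scikit-learn model are only illustrative assumptions:

    from pyspark.sql import SparkSession
    from pyspark.sql.types import StructType, StructField, StringType, DoubleType
    import pandas as pd
    from sklearn.linear_model import LinearRegression

    spark = SparkSession.builder.getOrCreate()

    # One output row per group: the group key plus the fitted parameters.
    result_schema = StructType([
        StructField("group", StringType()),
        StructField("coef", DoubleType()),
        StructField("intercept", DoubleType()),
    ])

    def fit_group(pdf: pd.DataFrame) -> pd.DataFrame:
        # pdf holds all rows of one group as a pandas DataFrame, so any
        # single-node library can fit the per-group model here.
        model = LinearRegression().fit(pdf[["x"]], pdf["y"])
        return pd.DataFrame({
            "group": [pdf["group"].iloc[0]],
            "coef": [float(model.coef_[0])],
            "intercept": [float(model.intercept_)],
        })

    df = spark.read.parquet("/path/to/data")  # hypothetical source with group, x, y
    per_group_models = df.groupBy("group").applyInPandas(fit_group, schema=result_schema)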

Thanks,

Re: Pyspark How to groupBy -> fit

srowen
If you mean you want to train N models in parallel, you wouldn't be able to do that with a groupBy first. You apply logic to the result of a groupBy with Spark, but you can't use Spark within Spark. You can run N Spark jobs in parallel from the driver, but each job would have to read the subset of data it is meant to model separately.

A pandas UDF is a fine solution here; I assume it implies your groups aren't that big, so there may be no need for a Spark pipeline.
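
A rough sketch of that "N Spark jobs in parallel on the driver" idea, assuming a hypothetical "group" column, one feature column "x", a label "y", and a simple Spark ML pipeline:

    from concurrent.futures import ThreadPoolExecutor
    from pyspark.sql import SparkSession
    from pyspark.ml import Pipeline
    from pyspark.ml.feature import VectorAssembler
    from pyspark.ml.regression import LinearRegression

    spark = SparkSession.builder.getOrCreate()
    df = spark.read.parquet("/path/to/data")  # hypothetical source

    def fit_one(group_value):
        # Build a fresh pipeline per call so each thread fits its own instance.
        pipeline = Pipeline(stages=[
            VectorAssembler(inputCols=["x"], outputCol="features"),
            LinearRegression(featuresCol="features", labelCol="y"),
        ])
        subset = df.filter(df["group"] == group_value)
        return group_value, pipeline.fit(subset)

    group_values = [r["group"] for r in df.select("group").distinct().collect()]

    # Threads on the driver submit the fits concurrently; Spark schedules
    # the resulting jobs side by side on the cluster.
    with ThreadPoolExecutor(max_workers=4) as pool:
        models = dict(pool.map(fit_one, group_values))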


Re: Pyspark How to groupBy -> fit

Mich Talebzadeh
I guess one drawback is that the data may not fit into pandas DataFrames, since those hold everything in RAM. And if you are going to run multiple parallel jobs, a single machine may not be viable?

Re: Pyspark How to groupBy -> fit

Riccardo Ferrari
In reply to this post by srowen
Thanks for the answers.

I am trying to avoid reading the same data multiple times (once per model).

One approach I can think of is filtering on the column I want to split on and training each model separately. I was hoping to find a more elegant approach.
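
A small sketch of that filtering approach (sequential), assuming df and pipeline are defined as in the earlier hypothetical sketch:

    # Collect the distinct group keys, then filter and fit one model per key.
    models = {}
    for row in df.select("group").distinct().collect():
        subset = df.filter(df["group"] == row["group"])
        models[row["group"]] = pipeline.fit(subset)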



Re: Pyspark How to groupBy -> fit

srowen
Yep, that's one approach. It may not really re-read the data N times; for example, if the filtering aligns with the partitioning, each fit reads only its own subset. You can also cache the input first to avoid incurring the I/O N times.
But again, I wonder whether you are at a scale that really needs distributed training.
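
Continuing the same hypothetical example, caching the input once keeps the per-group filters from re-scanning the source:

    # Materialize the input a single time; the per-group filters then read
    # from the cached data instead of hitting the source N times.
    df = spark.read.parquet("/path/to/data").cache()
    df.count()  # force the cache to be populated before the fits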
