Use Arrow instead of Pickle without pandas_udf

Previous Topic Next Topic
 
classic Classic list List threaded Threaded
5 messages Options
Reply | Threaded
Open this post in threaded view
|

Use Arrow instead of Pickle without pandas_udf

Hichame El Khalfi

Hi There,


Is there a way to use Arrow format instead of Pickle but without using pandas_udf ?


Thank for your help,


Hichame

Reply | Threaded
Open this post in threaded view
|

Re: Use Arrow instead of Pickle without pandas_udf

Holden Karau
Not currently. What's the problem with pandas_udf for your use case?

On Wed, Jul 25, 2018 at 1:27 PM, Hichame El Khalfi <[hidden email]> wrote:

Hi There,


Is there a way to use Arrow format instead of Pickle but without using pandas_udf ?


Thank for your help,


Hichame




--
Reply | Threaded
Open this post in threaded view
|

Re: Use Arrow instead of Pickle without pandas_udf

Hichame El Khalfi
Hey Holden,
Thanks for your reply,

We currently using a python function that produces a Row(TS=LongType(), bin=BinaryType()).
We use this function like this dataframe.rdd.map(my_function).toDF().write.parquet()

To reuse it in pandas_udf, we changes the return type to StructType(StructField(Long), StructField(BinaryType).

1)But we face an issue that StructType is not supported by pandas_udf.

So I was wondering to still continue to reuse dataftame.rdd.map but get an improvement in serialization by using ArrowFormat instead of Pickle.

Sent: July 25, 2018 4:41 PM
Subject: Re: Use Arrow instead of Pickle without pandas_udf

Not currently. What's the problem with pandas_udf for your use case?

On Wed, Jul 25, 2018 at 1:27 PM, Hichame El Khalfi <[hidden email]> wrote:

Hi There,


Is there a way to use Arrow format instead of Pickle but without using pandas_udf ?


Thank for your help,


Hichame




--
Reply | Threaded
Open this post in threaded view
|

Re: Use Arrow instead of Pickle without pandas_udf

Bryan Cutler
Here is a link to the JIRA for adding StructType support for scalar pandas_udf https://issues.apache.org/jira/browse/SPARK-24579


On Wed, Jul 25, 2018 at 3:36 PM, Hichame El Khalfi <[hidden email]> wrote:
Hey Holden,
Thanks for your reply,

We currently using a python function that produces a Row(TS=LongType(), bin=BinaryType()).
We use this function like this dataframe.rdd.map(my_function).toDF().write.parquet()

To reuse it in pandas_udf, we changes the return type to StructType(StructField(Long), StructField(BinaryType).

1)But we face an issue that StructType is not supported by pandas_udf.

So I was wondering to still continue to reuse dataftame.rdd.map but get an improvement in serialization by using ArrowFormat instead of Pickle.

Sent: July 25, 2018 4:41 PM
Subject: Re: Use Arrow instead of Pickle without pandas_udf

Not currently. What's the problem with pandas_udf for your use case?

On Wed, Jul 25, 2018 at 1:27 PM, Hichame El Khalfi <[hidden email]> wrote:

Hi There,


Is there a way to use Arrow format instead of Pickle but without using pandas_udf ?


Thank for your help,


Hichame




--

Reply | Threaded
Open this post in threaded view
|

Re: Use Arrow instead of Pickle without pandas_udf

Hichame El Khalfi
Thanks Bryan for the pointer +1

Hichame

Sent: July 30, 2018 6:40 PM
Subject: Re: Use Arrow instead of Pickle without pandas_udf

Here is a link to the JIRA for adding StructType support for scalar pandas_udf https://issues.apache.org/jira/browse/SPARK-24579


On Wed, Jul 25, 2018 at 3:36 PM, Hichame El Khalfi <[hidden email]> wrote:
Hey Holden,
Thanks for your reply,

We currently using a python function that produces a Row(TS=LongType(), bin=BinaryType()).
We use this function like this dataframe.rdd.map(my_function).toDF().write.parquet()

To reuse it in pandas_udf, we changes the return type to StructType(StructField(Long), StructField(BinaryType).

1)But we face an issue that StructType is not supported by pandas_udf.

So I was wondering to still continue to reuse dataftame.rdd.map but get an improvement in serialization by using ArrowFormat instead of Pickle.

Sent: July 25, 2018 4:41 PM
Subject: Re: Use Arrow instead of Pickle without pandas_udf

Not currently. What's the problem with pandas_udf for your use case?

On Wed, Jul 25, 2018 at 1:27 PM, Hichame El Khalfi <[hidden email]> wrote:

Hi There,


Is there a way to use Arrow format instead of Pickle but without using pandas_udf ?


Thank for your help,


Hichame




--