question about pyarrow.Table to pyspark.DataFrame conversion


question about pyarrow.Table to pyspark.DataFrame conversion

Artem Kozhevnikov
I wonder if there's a recommended way to convert an in-memory pyarrow.Table (or pyarrow.RecordBatch) to a pyspark.DataFrame without going through pandas?
My motivation is converting nested data (like List[int]) that has an efficient representation in pyarrow but not in pandas (I don't want to go through a Python list of ints ...).

Thanks in advance!
Artem



Re: question about pyarrow.Table to pyspark.DataFrame conversion

Bryan Cutler
Hi Artem,

I don't believe this is currently possible, but it could be a great addition to PySpark since this would offer a convenient and efficient way to parallelize nested column data. I created the JIRA https://issues.apache.org/jira/browse/SPARK-29040 for this.

On Tue, Aug 27, 2019 at 7:55 PM Artem Kozhevnikov <[hidden email]> wrote:
I wonder if there's a recommended way to convert an in-memory pyarrow.Table (or pyarrow.RecordBatch) to a pyspark.DataFrame without going through pandas?
My motivation is converting nested data (like List[int]) that has an efficient representation in pyarrow but not in pandas (I don't want to go through a Python list of ints ...).

Thanks in advance!
Artem



Re: question about pyarrow.Table to pyspark.DataFrame conversion

shouheng
Hi Bryan,

I came across SPARK-29040 (https://issues.apache.org/jira/browse/SPARK-29040) and I'm very excited that others are looking for such a feature as well. It would be tremendously useful if we could implement it.

Currently, my workaround is to serialize the `pyarrow.Table` to a Parquet file and then have Spark read that Parquet file back. I avoided going through `pd.DataFrame`, as Artem mentioned above.

Do you think this ticket has a chance to get prioritized?

Thank you very much.

Best,
Shouheng


