Arrow RecordBatches/Pandas Dataframes to (Arrow enabled) Spark Dataframe conversion in streaming fashion


Tanveer Ahmad - EWI

Hi all,


I need some help regarding Arrow RecordBatches/Pandas Dataframes to (Arrow enabled) Spark Dataframe conversions.

The example in [1] explains very well how to convert a single Pandas DataFrame to a Spark DataFrame.


But in my case, some external applications are generating Arrow RecordBatches in my PySpark application in a streaming fashion. Each time I receive an Arrow RecordBatch, I want to append it to a Spark DataFrame. Is it possible to create a Spark DataFrame from the first Arrow RecordBatch and then keep appending the other incoming Arrow RecordBatches to that Spark DataFrame as they arrive? Thanks!


I saw another example [2] in which all the Arrow RecordBatches are converted to a Spark DataFrame at once, but my case is a little different from that.


[1] https://spark.apache.org/docs/latest/sql-pyspark-pandas-with-arrow.html

Re: Arrow RecordBatches/Pandas Dataframes to (Arrow enabled) Spark Dataframe conversion in streaming fashion

Jorge Machado-2
Hey, from what I know you can try to union them: df.union(df2)

Not sure if this is what you need 
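
A minimal sketch of the union approach, assuming each incoming Arrow RecordBatch is first converted to pandas and then to a Spark DataFrame as in [1]. The helper names (batch_to_spark_df, append_batch, append_all, demo) and the sample columns are made up for illustration. The config key shown is the Spark 3.x name; Spark 2.x used spark.sql.execution.arrow.enabled.

```python
from functools import reduce


def batch_to_spark_df(spark, batch):
    """Convert one pyarrow.RecordBatch to a Spark DataFrame via pandas."""
    return spark.createDataFrame(batch.to_pandas())


def append_batch(spark, df, batch):
    """Return a new DataFrame with `batch` appended; `df` is None at the start."""
    new_df = batch_to_spark_df(spark, batch)
    # Spark DataFrames are immutable: union() returns a new DataFrame
    # and matches columns by position, not by name.
    return new_df if df is None else df.union(new_df)


def append_all(spark, batches):
    """Fold an iterable of RecordBatches into one DataFrame (None if empty)."""
    return reduce(lambda df, b: append_batch(spark, df, b), batches, None)


def demo():
    """Hypothetical usage; requires pyarrow and pyspark to be installed."""
    import pyarrow as pa
    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder.master("local[1]")
        # Assumed Spark 3.x key for Arrow-based pandas conversion.
        .config("spark.sql.execution.arrow.pyspark.enabled", "true")
        .getOrCreate()
    )
    # Stand-in for the batches an external application would deliver.
    batches = [
        pa.RecordBatch.from_arrays(
            [pa.array([i, i + 1]), pa.array(["a", "b"])], names=["id", "val"]
        )
        for i in range(3)
    ]
    df = append_all(spark, batches)
    df.show()
    spark.stop()
```

Calling demo() runs the sketch end to end. Each union only extends the query plan, so a very long stream of batches will produce a deep lineage; whether that matters depends on the workload.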

On 25. May 2020, at 13:53, Tanveer Ahmad - EWI <[hidden email]> wrote:


Re: Arrow RecordBatches/Pandas Dataframes to (Arrow enabled) Spark Dataframe conversion in streaming fashion

Tanveer Ahmad - EWI

Hi Jorge,


Thank you. This union function is a better alternative for my work.


Regards,
Tanveer Ahmad



From: Jorge Machado <[hidden email]>
Sent: Monday, May 25, 2020 3:56:04 PM
To: Tanveer Ahmad - EWI
Cc: Spark Group
Subject: Re: Arrow RecordBatches/Pandas Dataframes to (Arrow enabled) Spark Dataframe conversion in streaming fashion
 