Apache Arrow support for Apache Spark

Subash Prabakar
Hi Team,

I have two questions regarding Arrow and Spark integration:

1. I am joining two huge tables (1 PB each). Will there be a large performance gain if I use the Arrow format before shuffling? Will the serialization/deserialization cost improve significantly?

2. Can we store the final data in Arrow format on HDFS and read it back in another Spark application? If so, how could I do that?
Note: the dataset is transient; the split into separate applications is only for easier management. Although Spark handles resiliency internally, the applications use different languages (Java and Python in our case).

Thanks,
Subash

Re: Apache Arrow support for Apache Spark

Chris Teoh
1. I'd also consider how you're structuring the data before applying the join; doing the join naively could be expensive, so a bit of data preparation may be necessary to improve join performance. Try to get a baseline as well. Arrow would help improve it.
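
For what it's worth, one way to "structure the data first" on the Spark side is to bucket both tables on the join key, so the join itself doesn't have to shuffle either 1 PB input. A minimal PySpark sketch, assuming a hypothetical join key "id" and made-up paths/table names:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("join-prep").getOrCreate()

    left = spark.read.parquet("hdfs:///data/left")    # hypothetical paths
    right = spark.read.parquet("hdfs:///data/right")

    # Persist both sides bucketed (and sorted) on the join key; a later join
    # between the bucketed tables can use a sort-merge join on co-located
    # buckets instead of shuffling both full inputs at join time.
    for df, name in [(left, "left_b"), (right, "right_b")]:
        (df.write
           .bucketBy(1024, "id")   # bucket count is a tuning knob
           .sortBy("id")
           .mode("overwrite")
           .saveAsTable(name))

    joined = spark.table("left_b").join(spark.table("right_b"), "id")

Note that within Spark itself, Arrow is mainly used to speed up JVM<->Python data exchange (pandas UDFs, toPandas), enabled with spark.sql.execution.arrow.enabled (spark.sql.execution.arrow.pyspark.enabled on Spark 3.x); it doesn't change the shuffle serialization path.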

2. Try storing it back as Parquet, but laid out in a way the next application can take advantage of predicate pushdown.
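
A sketch of what that could look like, assuming the downstream application filters on a date column ("event_date" and the paths below are made-up names): partition the output on that column, so the reader gets partition pruning on top of Parquet's row-group predicate pushdown.

    # Writer side (first application):
    (joined.write
           .partitionBy("event_date")        # hypothetical filter column
           .mode("overwrite")
           .parquet("hdfs:///data/joined"))

    # Reader side (second application, works the same from Java or Python):
    # only matching partition directories and row groups get scanned.
    df = (spark.read.parquet("hdfs:///data/joined")
               .filter("event_date = '2020-02-17'"))

Sorting within files on a secondary filter column also tightens the per-row-group min/max statistics that Parquet's predicate pushdown relies on.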


