Use Java in a Grouped Map pandas UDF to avoid SerDe


Use Java in a Grouped Map pandas UDF to avoid SerDe

Lian Jiang
Hi,

I am using a PySpark Grouped Map pandas UDF (https://spark.apache.org/docs/latest/sql-pyspark-pandas-with-arrow.html). Functionally it works great. However, serialization/deserialization (SerDe) between the JVM and Python causes a significant performance hit. To optimize this UDF, could I do either of the following:

1. Use a Java UDF to completely replace the Python Grouped Map pandas UDF.
2. Have the Python Grouped Map pandas UDF call a Java function internally.

Which way is more promising, and how? Thanks for any pointers.
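For concreteness, a minimal sketch of the kind of grouped-map UDF being discussed. The column names and the per-group logic are illustrative, not taken from the job in question; the Spark wiring is shown only as a comment, and the group function is demonstrated locally with plain pandas:

```python
import pandas as pd

# Plain pandas function applied once per group; this is the kind of function
# a Grouped Map pandas UDF executes on the Python side.
def subtract_mean(pdf: pd.DataFrame) -> pd.DataFrame:
    return pdf.assign(v=pdf["v"] - pdf["v"].mean())

# With a live SparkSession (Spark 3.x API), the wiring would look like:
#   df.groupBy("id").applyInPandas(subtract_mean, schema="id long, v double")

# Local demonstration of the per-group logic, without Spark:
pdf = pd.DataFrame({"id": [1, 1, 2], "v": [1.0, 3.0, 5.0]})
out = pdf.groupby("id", group_keys=False).apply(subtract_mean)
print(out["v"].tolist())  # [-1.0, 1.0, 0.0]
```

Every invocation crosses the JVM/Python boundary once per group, which is where the SerDe cost discussed in this thread arises.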

Thanks
Lian




Re: Use Java in a Grouped Map pandas UDF to avoid SerDe

Lian Jiang
Please ignore this question. https://kontext.tech/column/spark/370/improve-pyspark-performance-using-pandas-udf-with-apache-arrow shows that pandas UDFs should already avoid JVM<->Python SerDe by maintaining a single copy of the data in memory via Arrow. spark.sql.execution.arrow.enabled is false by default, and I think I simply missed enabling it. Thanks. Regards.
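For reference, an illustrative config fragment for enabling the flag. Note the rename: Spark 2.x uses spark.sql.execution.arrow.enabled, while Spark 3.x uses the spark.sql.execution.arrow.pyspark.* names that appear later in this thread:

```
# spark-defaults.conf (illustrative fragment)
# Spark 2.x name:
spark.sql.execution.arrow.enabled true
# Spark 3.x name:
spark.sql.execution.arrow.pyspark.enabled true
```

The same keys can also be passed via --conf on spark-submit or through SparkSession.builder.config.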


Re: Use Java in a Grouped Map pandas UDF to avoid SerDe

Lian Jiang
Hi,

I used these settings but did not see an obvious improvement (190 minutes reduced to 170 minutes):

        spark.sql.execution.arrow.pyspark.enabled: True
        spark.sql.execution.arrow.pyspark.fallback.enabled: True

This job makes heavy use of pandas UDFs and runs on a 30-node xlarge EMR cluster. Any idea why the performance improvement is so small after enabling Arrow? Could anything else be missing? Thanks.


Re: Use Java in a Grouped Map pandas UDF to avoid SerDe

Evgeniy Ignatiev

Note: forwarding to the list; I incorrectly hit "Reply" first instead of "Reply List".

Hello,

Does your code run without fallback mode enabled? Arrow vectorization might simply not be getting applied; check whether you still observe "javaToPython" stages in the Spark UI. Also, is the data perhaps skewed (partitions too large, so data parallelism can't be fully utilized), or is the logic simply too heavyweight, so that using a pandas UDF doesn't improve performance much?

Best regards,
Evgenii Ignatev.
