
PySpark Serialization/Deserialization (Pickling) Overhead

PySpark Serialization/Deserialization (Pickling) Overhead

Yeoul Na

Hi all,

I am trying to analyze PySpark's performance overhead. People often say PySpark is slower than Scala because of serialization/deserialization overhead. I tried the example in this post: https://0x0fff.com/spark-dataframes-are-faster-arent-they/. That article, like many others, says the straightforward Python implementation is the slowest because of serialization/deserialization.

However, when I actually looked at the logs in the Web UI, PySpark's serialization and deserialization times did not seem to be any larger than Scala's. The main contributor was "Executor Computing Time", so we cannot be sure whether the slowdown comes from serialization or simply from Python code being slower than Scala code.

So my question is: does "Task Deserialization Time" in the Spark Web UI actually include PySpark's serialization/deserialization time? If not, how can I measure the serialization/deserialization overhead?
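
For concreteness, the kind of comparison I mean is roughly the following (a minimal sketch along the lines of that post, not the exact code from it):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("pickling-overhead").getOrCreate()
    sc = spark.sparkContext

    # DataFrame version: the aggregation is executed inside the JVM,
    # so rows are not pickled to and from the Python workers.
    df = spark.range(0, 10000000)
    df.groupBy((df.id % 10).alias("bucket")).count().collect()

    # "Straightforward" Python version: every record is pickled to a
    # Python worker, processed by the lambdas, and pickled back; this
    # round trip is the overhead usually blamed for PySpark being slower.
    (sc.parallelize(range(10000000))
       .map(lambda x: (x % 10, 1))
       .reduceByKey(lambda a, b: a + b)
       .collect())

Both jobs do the same aggregation, but only the second one shows the slowdown, and the Web UI metrics do not tell me how much of it is pickling.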

Thanks,
Yeoul

Re: PySpark Serialization/Deserialization (Pickling) Overhead

rok
My guess is that the serialization times shown in the UI cover the Java side only. To get a feel for the Python pickling/unpickling, use the show_profiles() method of the SparkContext instance: http://spark.apache.org/docs/latest/api/python/pyspark.html#pyspark.SparkContext.show_profiles

That will show you how much of the execution time is spent in the cPickle load() and dump() calls.
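
Something along these lines (a minimal sketch; note that the Python profiler has to be enabled via the spark.python.profile setting before the SparkContext is created, otherwise show_profiles() has nothing to report):

    from pyspark import SparkConf, SparkContext

    # Profiling must be turned on before the context is created.
    conf = SparkConf().set("spark.python.profile", "true")
    sc = SparkContext(conf=conf)

    # Run something that pushes data through the Python workers.
    rdd = (sc.parallelize(range(1000000))
             .map(lambda x: (x % 10, 1))
             .reduceByKey(lambda a, b: a + b))
    rdd.collect()

    # Prints a cProfile report per RDD; compare the time spent in the
    # pickle/cPickle load() and dump() calls with the time spent in
    # your own lambdas.
    sc.show_profiles()

There is also sc.dump_profiles(path) if you'd rather write the reports out to a directory instead of printing them.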

Hope that helps,

Rok
