Creating DataFrame with the implicit localSeqToDatasetHolder has bad performance


msinton
Hi,

Using Scala with Spark 2.3.0 (also reproduced on 2.2.0):
I've come across two main ways to create a DataFrame from a sequence.
The more common:
  spark.sparkContext.parallelize(0 until 100000).toDF("value")  // good
and the less common (but still prevalent):
  (0 until 100000).toDF("value")  // bad
The latter results in much worse performance (for example, in df.agg(mean("value")).collect()). I don't know whether it is a bug, or a misunderstanding on my part, that these two are equivalent.
The latter appears to use the implicit method localSeqToDatasetHolder, while the former uses rddToDatasetHolder.
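To make the distinction concrete, here is a toy sketch (hypothetical names, not Spark's real implementation) of how two implicit conversions can silently select different plans for the same-looking expression, depending only on the static type of the receiver:

```scala
import scala.language.implicitConversions

// Stand-ins for Spark's RDD[A] and DatasetHolder[A]
final case class FakeRDD[A](data: Seq[A])
final case class Holder[A](plan: String)

object Conversions {
  // analogous to rddToDatasetHolder: wraps a reference to distributed data
  implicit def rddToHolder[A](rdd: FakeRDD[A]): Holder[A] =
    Holder("ExternalRDDScan")

  // analogous to localSeqToDatasetHolder: wraps the local collection itself
  implicit def seqToHolder[A](s: Seq[A]): Holder[A] =
    Holder("LocalTableScan")
}

import Conversions._

val viaRdd: Holder[Int] = FakeRDD(0 until 100000)  // picks rddToHolder
val viaSeq: Holder[Int] = (0 until 100000)         // Range is a Seq, picks seqToHolder

println(viaRdd.plan)  // ExternalRDDScan
println(viaSeq.plan)  // LocalTableScan
```

The point is that nothing at the call site of `.toDF` signals which conversion fired; only the plan (or `df.explain()`) reveals it.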

The difference in the physical plans is that the good one looks like:

*(1) SerializeFromObject [input[0, int, false] AS value#2]
+- Scan ExternalRDDScan[obj#1]

The bad one looks like:

LocalTableScan [value#1]

Even if this is not a bug, it would be great to learn more about what is going on here and why I see such a huge performance difference. I've tried to find resources that would help me understand this, but I've struggled to get anywhere. (Looking at the source code I can follow how these plans are generated, but I haven't found the why.)

Many thanks,
Matt


Sent from the Apache Spark User List mailing list archive at Nabble.com.
Re: Creating DataFrame with the implicit localSeqToDatasetHolder has bad performance

msinton
I think I understand that in the second case the DataFrame is created as a
local object, so it lives in the memory of the driver and is serialized as
part of the Task that gets sent to each executor.
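A rough way to see why that would hurt, using plain JVM serialization (an assumption here, as a stand-in for Spark's actual task serializer; the names and the partition-descriptor string below are hypothetical): a payload that embeds the whole local collection is orders of magnitude larger than one that only carries a reference to a partition of distributed data.

```scala
import java.io.{ByteArrayOutputStream, ObjectOutputStream}

// Measure the Java-serialized size of an object, as a crude proxy for
// what shipping it inside every task would cost.
def serializedSize(obj: AnyRef): Int = {
  val bos = new ByteArrayOutputStream()
  val oos = new ObjectOutputStream(bos)
  oos.writeObject(obj)
  oos.close()
  bos.toByteArray.length
}

val localRows: Array[Int] = (0 until 100000).toArray  // the data, held on the driver
val partitionRef: String  = "rdd-42-partition-0"      // hypothetical partition descriptor

println(serializedSize(localRows))    // hundreds of kilobytes
println(serializedSize(partitionRef)) // tens of bytes
```

If something like the former is paid per task, the LocalTableScan path would plausibly show exactly the slowdown described above.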

That said, I think the implicit conversion here is something that others could
also misunderstand. Maybe it would be better if it were not part of
spark.implicits? Or at least a note/warning could be added to the developer
guides.



