Creating DataFrame with the implicit localSeqToDatasetHolder has bad performance
Using Scala, spark version 2.3.0 (also 2.2.0):
I've come across two main ways to create a DataFrame from a sequence.
The more common:
spark.sparkContext.parallelize(0 until 100000).toDF("value")  // good
and the less common (but still prevalent):
(0 until 100000).toDF("value")  // bad
The latter results in much worse performance (for example, in df.agg(mean("value")).collect()). I don't know whether this is a bug, or a misunderstanding on my part that these two are equivalent?
The latter appears to use the implicit method localSeqToDatasetHolder while the former uses rddToDatasetHolder.
The physical plans differ. The good one looks like:
*(1) SerializeFromObject [input[0, int, false] AS value#2]
+- Scan ExternalRDDScan[obj#1]
The bad one looks like:
Even if this is not a bug, it would be great to learn more about what is going on here and why I see such a huge performance difference. I've tried to find resources that would help me understand this better, but I've struggled to get anywhere. (Looking at the source code I can follow how these plans are generated, but I haven't found the why.)
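For anyone who wants to reproduce the comparison, here is a minimal sketch. It assumes a local SparkSession; the only difference between the two DataFrames is which implicit conversion resolves on .toDF:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.mean

object PlanComparison {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("localSeqToDatasetHolder-comparison")
      .getOrCreate()
    import spark.implicits._

    // Picks up rddToDatasetHolder: the data lives in a distributed RDD,
    // and the plan starts with an ExternalRDDScan.
    val viaRdd = spark.sparkContext.parallelize(0 until 100000).toDF("value")
    viaRdd.explain()

    // Picks up localSeqToDatasetHolder: the whole sequence is materialized
    // as a local relation on the driver.
    val viaSeq = (0 until 100000).toDF("value")
    viaSeq.explain()

    // Run the same aggregation on both to compare wall-clock performance.
    viaRdd.agg(mean("value")).collect()
    viaSeq.agg(mean("value")).collect()

    spark.stop()
  }
}
```

Running this prints both physical plans via explain(), so the ExternalRDDScan can be seen directly.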
Sent from the Apache Spark User List mailing list archive at Nabble.com.
Re: Creating DataFrame with the implicit localSeqToDatasetHolder has bad performance
I think I understand that in the second case the DataFrame is created as a local object, so it lives in the memory of the driver and is serialized as part of the task that gets sent to each executor.
Though I think the implicit conversion here is something that others could also misunderstand. Maybe it would be better if it were not part of spark.implicits? Or at least a warning could be added to the developer documentation.