Huge difference in speed between pyspark and scalaspark


Steven Van Ingelgem



Hello all,

 

 

We noticed a HUGE difference between using pyspark and spark in scala.

Pyspark runs:

  • on my work computer in +- 350 seconds
  • on my home computer in +- 130 seconds (Windows defender enabled)
  • on my home computer in +- 105 seconds (Windows defender disabled)
  • on my home computer as Scala code in +- 7 seconds

 

The script:

def setUp(self):
    self.left = self.parallelize([
        ('Wim', 46),
        ('Klaas', 18)
    ]).toDF('name: string, age: int')

    self.right = self.parallelize([
        ('Jiri', 25),
        ('Tomasz', 23)
    ]).toDF('name: string, age: int')

def test_simple_union(self):
    sut = self.left.union(self.right)

    self.assertDatasetEquals(sut, self.parallelize([
            ('Wim', 46),
            ('Klaas', 18),
            ('Jiri', 25),
            ('Tomasz', 23)
        ]).toDF('name: string, age: int')
    )


RE: Huge difference in speed between pyspark and scalaspark

Steven Van Ingelgem


 

Hello all,

 

 

We noticed a HUGE difference between using pyspark and spark in scala.

Pyspark runs:

  • on my work computer in +- 350 seconds
  • on my home computer in +- 130 seconds (Windows defender enabled)
  • on my home computer in +- 105 seconds (Windows defender disabled)
  • on my home computer as Scala code in +- 7 seconds

 

What we already investigated:

  • Memory is correct (and enough) in both Scala & pyspark: 1G was defined, and it never reached the garbage-collection limit (which was also 1G).
  • There are a lot of threads in the Spark process; is this normal? Unknown… There are 300 more threads in the Spark session under pyspark than under Scala Spark.
  • Scala code timings are consistent on any platform used (Windows, Linux, Mac, WSL).
  • Disabling the antivirus makes a bit of a difference, but not enough to justify the huge differences in timings.
  • Under Debian WSL on the same Windows hosts, the pyspark code runs in the same time as the Scala code.
  • Timings for pyspark on Linux/Mac/WSL are similar to the timings for Scala (so virtualized it’s also around 7 s, but not virtualized it’s 130 s!).

 

 

What else could we investigate to figure out where the difference in time comes from?

And/or has someone encountered this same behaviour and could point us to a possible solution?

Thanks,

Steven

The pyspark script:

The spark session is created via:

SparkSession \
    .builder \
    .appName('Testing') \
    .config('spark.driver.extraJavaOptions', '-Xms1g') \
    .getOrCreate()

 

This is the part of the unittest:

def setUp(self):
    self.left = self.parallelize([
        ('Wim', 46),
        ('Klaas', 18)
    ]).toDF('name: string, age: int')

    self.right = self.parallelize([
        ('Jiri', 25),
        ('Tomasz', 23)
    ]).toDF('name: string, age: int')

def test_simple_union(self):
    sut = self.left.union(self.right)

    self.assertDatasetEquals(sut, self.parallelize([
            ('Wim', 46),
            ('Klaas', 18),
            ('Jiri', 25),
            ('Tomasz', 23)
        ]).toDF('name: string, age: int')
    )

 

VisualVM from pyspark script:

 

The same Scala script:

class GlowPerformanceSpec extends SparkFunSuite
                                 with Matchers
                                 with DatasetComparer {

  test("test simple union") {
    val data = testData()
    val sut = data._1.union(data._2)
    assertDatasetEquality(sut,
                          parallelize(Seq(("Wim", 46),
                                          ("Klaas", 18),
                                          ("Jiri", 25),
                                          ("Tomasz", 23)),
                                      "name",
                                      "age"))
  }

  private def testData(): (DataFrame, DataFrame) = {
    val left = parallelize(Seq(("Wim", 46),
                               ("Klaas", 18)),
                           "name",
                           "age")
    val right = parallelize(Seq(("Jiri", 25),
                                ("Tomasz", 23)),
                            "name",
                            "age")
    (left, right)
  }

  import scala.reflect.runtime.universe._

  private def parallelize[T <: Product : ClassTag : TypeTag](data: Seq[T],
                                                             colNames: String*): DataFrame = {
    import spark.implicits._
    spark.sparkContext.parallelize(data).toDF(colNames: _*)
  }
}

 

VisualVM from scala:


Re: Huge difference in speed between pyspark and scalaspark

Russell Spitzer
Well, right off the bat, the Python version is going to have to be slower because the dataset is being parallelized from Python types.

While DataFrames are usually identical in performance between different languages, the beginning of your script builds the initial data outside of DataFrames. This move from native language objects to Java objects is going to be basically instantaneous for Java, but will carry a performance penalty in Python.

I haven't looked at this code recently, but the worst-case scenario is that you would end up serializing the Python objects to the executors, and each executor would then start a local Python process and the serialization would have to happen there. The best-case scenario is that it would do this conversion on the driver prior to parallelizing, so at least you wouldn't have to spin up Python interpreters on the executors. I can't remember which it actually is in the code. Either way this is a big hit compared to Java, which can just encode the Java objects to Spark internal rows immediately.

Although that is expensive, I wouldn't imagine it would be as expensive as the results you are seeing (350 seconds is pretty ridiculous), but one way to eliminate this difference between the code samples would be to have your script read its data from a CSV file (or another data source). If you avoid calling parallelize you will at least be able to remove the serialization costs for Python. Then the two code samples would be much closer to identical. I would try doing that before looking into anything else.
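A minimal sketch of what that CSV-based variant could look like in PySpark (the file names and the DDL schema string here are assumptions, not something from this thread):

from pyspark.sql import SparkSession

# Reading the fixtures from CSV keeps the rows on the JVM side, so no Python
# objects need to be serialized just to build the DataFrames.
spark = SparkSession.builder.appName('Testing').getOrCreate()

schema = 'name string, age int'                      # assumed schema
left = spark.read.csv('test_left.csv', header=True, schema=schema)
right = spark.read.csv('test_right.csv', header=True, schema=schema)

sut = left.union(right)
sut.show()   # an action, so the union actually runs

With the data coming from files, the Scala and Python tests exercise much the same DataFrame code path.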


Re: Huge difference in speed between pyspark and scalaspark

maasg
In reply to this post by Steven Van Ingelgem
Steven,

I'm not sure what the goals of the comparison are.

The reason behind the difference is that `parallelize` is an RDD-based operation. When called from pySpark, the Python-JVM call incurs additional overhead to transfer that array.
These early-days differences[1] between the Scala and the Python APIs are no longer an issue when you use the SparkSQL-based Dataframe/Dataset API.

Given that the source of your data is unlikely to be synthetically-generated in-memory records, a more interesting test would be to load the same dataset (text, CSV, Parquet, ...) from both Spark/Scala and PySpark.
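A minimal sketch of the DataFrame-API route for the same tiny fixtures, using createDataFrame with an explicit schema instead of an explicit sc.parallelize(...).toDF(...); whether this avoids the RDD path internally depends on the Spark version, so this only illustrates the API shape:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('Testing').getOrCreate()

# createDataFrame with a DDL schema string, instead of calling
# sc.parallelize(...).toDF(...) yourself.
schema = 'name string, age int'
left = spark.createDataFrame([('Wim', 46), ('Klaas', 18)], schema)
right = spark.createDataFrame([('Jiri', 25), ('Tomasz', 23)], schema)

left.union(right).show()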

I hope this helps. 

Kind regards,
- Gerard.





Re: Huge difference in speed between pyspark and scalaspark

sriramb12
The pyspark and Scala versions, along with the Spark version, are also needed to conclude anything. Pyspark had a performance lag in the early days in comparison to Scala, which is native for Spark.

Also: the video referenced is 5 years old.


--
-Sriram

Re: Huge difference in speed between pyspark and scalaspark

Mich Talebzadeh
In reply to this post by maasg
This comparison comes up time and time again. Spark is written in Scala and provides APIs in Scala, Java, Python, and R.

However, its primary focus has been on Scala. In generic terms this means that Python, Java etc. are add-ons, and I suspect that if you look under the bonnet they interface with Scala.

Hence that would be a driver for Spark on Scala being fastest. The bottom line is that it is what it is: if you are going to use Python, then expect that behaviour to materialise.

HTH,


Dr Mich Talebzadeh

LinkedIn: https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw

http://talebzadehmich.wordpress.com



Re: Huge difference in speed between pyspark and scalaspark

srowen
In reply to this post by sriramb12
That code is barely doing anything; there is no action. I can't imagine it taking 350s. Is this really all that is run?
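A minimal, self-contained illustration of that point about actions, assuming a local PySpark session; only an action such as count() or show() triggers an actual job:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('Testing').getOrCreate()

schema = 'name string, age int'
left = spark.createDataFrame([('Wim', 46), ('Klaas', 18)], schema)
right = spark.createDataFrame([('Jiri', 25), ('Tomasz', 23)], schema)

sut = left.union(right)   # transformation only: no Spark job runs yet
print(sut.count())        # count() is an action and triggers the actual work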


Re: Huge difference in speed between pyspark and scalaspark

Wim Van Leuven
Yes, it is... That's why it is strange. But never mind the 350 secs; that is not relevant. That's probably the security layer...

What is relevant is that the same pyspark code takes 135 s on a plain Win10 PC vs only 8 secs on WSL Debian on the same Win10 PC, also 8 on a Linux server and also 8 on a Mac. That difference is the question.

That's why we added the same implementation in Scala Spark, which also takes just 8 secs on Win10. The difference is just insane... That's not a simple ser/deser of these few strings, right?

Thoughts?
-wim




Re: Huge difference in speed between pyspark and scalaspark

Russell Spitzer
It could also be the cost of spinning up interpreters, but there is an easy way to find out. Remove the serialization from the equation.


RE: Huge difference in speed between pyspark and scalaspark

Steven Van Ingelgem


 

Wow!

I changed the simple test I ran yesterday to a CSV version:

 


def setUp(self):
    self.left = self.spark.read.format("csv").options(header='true', inferSchema='true').load("test_left.csv")
    self.right = self.spark.read.format("csv").options(header='true', inferSchema='true').load("test_right.csv")

def test_simple_union(self):
    sut = self.left.union(self.right)

    self.assertDatasetEquals(sut, self.spark.read.format("csv").options(header='true', inferSchema='true').load("expected.csv"))

 

 

  • Scala parallelize: 19.956 s
  • Python parallelize: 61.184 s
  • Scala CSV: 18.394 s
  • Python CSV: reported as 3.747 s by unittest (but the total system time was 16.291 s, so I assume 12.544 s goes to spinning up Spark; see the sketch below)
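A minimal sketch of one way to report that start-up time separately, assuming a plain unittest harness (the assertDatasetEquals helper from the thread is replaced by a simple count check here):

import time
import unittest

from pyspark.sql import SparkSession


class SimpleUnionTest(unittest.TestCase):
    @classmethod
    def setUpClass(cls):
        # Create the session once per class and time it, so JVM/session
        # start-up is not mixed into the time of each individual test.
        started = time.time()
        cls.spark = SparkSession.builder.appName('Testing').getOrCreate()
        print(f'SparkSession start-up: {time.time() - started:.1f}s')

    @classmethod
    def tearDownClass(cls):
        cls.spark.stop()

    def test_simple_union(self):
        schema = 'name string, age int'
        left = self.spark.createDataFrame([('Wim', 46), ('Klaas', 18)], schema)
        right = self.spark.createDataFrame([('Jiri', 25), ('Tomasz', 23)], schema)
        self.assertEqual(left.union(right).count(), 4)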

 

What I don’t understand, though, is why the parallelize version in Scala is that much faster, whereas the CSV versions are similar.

What exactly makes this huge difference between parallelize & CSV in pyspark, but not between parallelize & CSV in Scala?

 

 

Thanks!

 

 


Re: Huge difference in speed between pyspark and scalaspark

Russell Spitzer
Like I wrote before: when you mix in Python, Spark needs a Python process to deal with Python objects, which means you end up booting up Python processes for every executor. My guess would be that on your system Python takes a while to start up, so before any work can be done you are waiting on that slow startup. My guess is that with a much larger dataset the difference would not be as great, since it would amortize the Python startup costs.

The serialization costs are also non-trivial, but if you don't use any Python lambdas I don't think that cost would scale that dramatically with your data size. Check out some of the original DataFrame articles for more perf comparisons and explanations.
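A minimal sketch of one way to test that amortization argument: generate a larger DataFrame entirely on the JVM side with spark.range, so the fixed Python start-up cost becomes small relative to the actual work (the row count is an arbitrary choice):

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName('Testing').getOrCreate()

# spark.range produces the rows on the JVM side; no Python rows are serialized.
big = spark.range(10_000_000).withColumn('age', (F.col('id') % 80).cast('int'))

# If fixed start-up costs dominate, the PySpark/Scala gap should shrink
# as the amount of real work grows.
print(big.union(big).count())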
