Spark 2.4.4 having worse performance than 2.4.2 when running the same code [pyspark][sql]


Spark 2.4.4 having worse performance than 2.4.2 when running the same code [pyspark][sql]

Kalin Stoyanov
Hi all,

First of all, let me say that I am pretty new to Spark, so this could be entirely my fault somehow...
I noticed this when I was running a job on an Amazon EMR cluster with Spark 2.4.4, and it finished more slowly than when I had run it locally (on Spark 2.4.1). I checked the event logs, and the one from the newer version had more stages.
Then I did a comparison in the same environment: I created two versions of the same cluster whose only difference was the EMR release, and hence the Spark version(?) - the first was emr-5.24.1 with Spark 2.4.2, and the second emr-5.28.0 with Spark 2.4.4. Sure enough, the same thing happened - the newer version had more stages and took almost twice as long to finish.
So I am pretty much at a loss here - could it be that the cause is not Spark itself, but some difference introduced in the EMR releases? At the moment I can't think of any other explanation besides it being a bug...
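(To see where the extra stages come from, one option is to compare the query plans produced under each version. A minimal PySpark sketch - the input path and DataFrame below are illustrative stand-ins, not the actual job:)

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
print(spark.version)  # confirm which Spark build is actually running

# Illustrative stand-in for the real input; diff this output between the
# two clusters - extra Exchange (shuffle) operators in the physical plan
# correspond to extra stages in the UI.
df = spark.read.csv("s3://kgs-s3/input/", header=True)
df.explain(True)  # prints the logical and physical plans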

Here are the two event logs:
https://drive.google.com/drive/folders/12pNc5uqhHtCoeCO3nHS3eQ3X7cFzUAQL?usp=sharing
and my code is here:
https://github.com/kgskgs/stars-spark3d

I ran it like this on the clusters (after uploading the scripts to S3):
spark-submit --deploy-mode cluster --py-files s3://kgs-s3/scripts/utils.py,s3://kgs-s3/scripts/interactions.py,s3://kgs-s3/scripts/schemas.py --name sim100_dt100_spark242 s3://kgs-s3/scripts/main.py 100 100 --outputDir s3://kgs-s3/output/ --inputDir s3://kgs-s3/input/

I was considering submitting a bug report, but the contributing guide says it's better to ask here first - so, any ideas on what's going on? Maybe I am missing something?

Regards,
Kalin

Re: Spark 2.4.4 having worse performance than 2.4.2 when running the same code [pyspark][sql]

Xiao Li
EMR has its own fork of Spark, called the EMR runtime. It is not Apache Spark. You might need to talk with them instead of posting questions in the Apache Spark community.

Cheers,

Xiao


Re: Spark 2.4.4 having worse performance than 2.4.2 when running the same code [pyspark][sql]

Kalin Stoyanov
Hi Xiao,

Thanks, I didn't know that. This https://aws.amazon.com/about-aws/whats-new/2019/11/announcing-emr-runtime-for-apache-spark/ implies that their fork is not used in EMR 5.27. I tried that release and it has the same issue. But then again, in their article they were comparing EMR 5.27 vs 5.16, so I can't be sure... Maybe I'll try getting the latest version of Spark locally and do the comparison that way.

Regards,
Kalin


Re: Spark 2.4.4 having worse performance than 2.4.2 when running the same code [pyspark][sql]

Gourav Sengupta
In reply to this post by Kalin Stoyanov
Hi,

I am pretty sure that AWS released EMR 5.28.1, with some bug fixes, the day before yesterday.

Also, please ensure that you are using s3:// rather than s3a:// or anything like that.
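(For example - the path is an illustrative stand-in; on EMR the s3:// scheme is handled by the EMRFS connector:)

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.read.csv("s3://kgs-s3/input/")  # note s3://, not s3a:// or s3n://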

On another note, Xiao is not entirely right in saying that issues seen on EMR should not be posted here: a large group of users runs Spark on Databricks, GCP, Azure, native installations, and of course EMR and Glue. I have always found that the Apache Spark community takes care of each other and answers questions for the largest possible user base, just as I did now. I think that only Matei Zaharia could make such a sweeping call on what this entire community is about.


Thanks and Regards,
Gourav Sengupta 


Re: Spark 2.4.4 having worse performance than 2.4.2 when running the same code [pyspark][sql]

Xiao Li
In reply to this post by Kalin Stoyanov
If you can confirm that this is caused by Apache Spark, feel free to open a JIRA. I do not expect your queries to hit such a major performance regression in any release. Also, please try the 3.0 preview releases.
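(For a quick local comparison - a minimal sketch, with the app name illustrative and the version being whatever build you download - the same job can be pointed at a local master:)

from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .master("local[*]")             # run locally on all cores
         .appName("sim100_dt100_local")  # illustrative name
         .getOrCreate())
print(spark.version)  # e.g. 2.4.4 vs 3.0.0-preview2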

Thanks,

Xiao  


Re: Spark 2.4.4 having worse performance than 2.4.2 when running the same code [pyspark][sql]

Kalin Stoyanov
Hi all,

@Enrico, I've added just the SQL query pages (plus JS dependencies etc.) to the Google Drive - https://drive.google.com/drive/folders/12pNc5uqhHtCoeCO3nHS3eQ3X7cFzUAQL?usp=sharing
That is what you had in mind, right? They are indeed different. (For some reason the graphs get drawn twice after I saved the pages off the history server, but that shouldn't matter.)

@Gourav Thanks, but EMR 5.28.1 is not appearing for me when creating a cluster, so I can't check that for now; also, I am using just s3://.

@Xiao, Yes, I will try to run this locally as well, but installing new versions of Spark won't be very quick or easy for me, so I won't be doing it right away.

Regards,
Kalin



Re: Spark 2.4.4 having worse performance than 2.4.2 when running the same code [pyspark][sql]

Gourav Sengupta
Hi Xiao,

that is the right attitude, thanks a ton :)

Hi Kalin,
https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-release-5x.html#emr-5281-relnotes The latest EMR version should be available right out of the box; perhaps you can raise a quick AWS ticket to find out whether its release is being delayed in your region. The release notes do mention that it fixes a few Spark compatibility issues. Also, working with the latest version of Spark takes less than 10 seconds once you have downloaded and unzipped the release from Apache Spark. Besides that, I am almost always sure that starting the Spark session in EMR using the following statement is going to give the same performance and predictability. As Xiao mentions, it might be better to first isolate the cause and replicate it before raising issues.

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

Thanks and Regards,
Gourav Sengupta 
