Spark 3.0 using S3 taking long time for some set of TPC DS Queries


Spark 3.0 using S3 taking long time for some set of TPC DS Queries

Rao, Abhishek (Nokia - IN/Bangalore)

Hi All,

 

We’re comparing the performance of Spark querying data on HDFS against Spark querying data on S3 (with Ceph Object Store providing the S3 storage), using the standard TPC DS queries. We observe that Spark 3.0 with S3 takes significantly longer than Spark 3.0 with HDFS for a subset of the queries.

We also ran similar queries with Spark 2.4.5 against S3, and for this set of queries Spark 2.4.5 takes less time than Spark 3.0, which looks very strange.

Below are the details of the 9 queries for which Spark 3.0 takes more than 5 times as long on S3 as on Hadoop (HDFS).

 

Environment Details:

  • Spark running on Kubernetes
  • TPC DS Scale Factor: 500 GB
  • Hadoop 3.x
  • Same CPU and memory used for all executions

 

Query | Spark 3.0 with S3 (s) | Spark 3.0 with Hadoop (s) | Spark 2.4.5 with S3 (s) | Spark 3.0 HDFS vs S3 (Factor) | Spark 2.4.5 S3 vs Spark 3.0 S3 (Factor) | Table involved
------+-----------------------+---------------------------+-------------------------+-------------------------------+-----------------------------------------+---------------
9     | 880.129               | 106.109                   | 147.65                  | 8.294574                      | 5.960914                                | store_sales
44    | 129.618               | 23.747                    | 103.916                 | 5.458289                      | 1.247334                                | store_sales
58    | 142.113               | 20.996                    | 33.936                  | 6.768575                      | 4.187677                                | store_sales
62    | 32.519                | 5.425                     | 14.809                  | 5.994286                      | 2.195894                                | web_sales
76    | 138.765               | 20.73                     | 49.892                  | 6.693922                      | 2.781308                                | store_sales
88    | 475.824               | 48.2                      | 94.382                  | 9.871867                      | 5.04147                                 | store_sales
90    | 53.896                | 6.804                     | 18.11                   | 7.921223                      | 2.976035                                | web_sales
94    | 241.172               | 43.49                     | 81.181                  | 5.545459                      | 2.970794                                | web_sales
96    | 67.059                | 10.396                    | 15.993                  | 6.450462                      | 4.193022                                | store_sales
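
As a quick check on the two factor columns, the sketch below recomputes them from the timings for Query 9 and Query 88. It assumes (this is inferred from the numbers, not stated in the post) that the HDFS-vs-S3 factor is the Spark 3.0 S3 time divided by the Spark 3.0 HDFS time, and that the 2.4.5-vs-3.0 factor is the Spark 3.0 S3 time divided by the Spark 2.4.5 S3 time. The `Timing` case class is a hypothetical helper introduced only for this illustration.

```scala
// Hypothetical helper type; timings (in seconds) are copied from the table above.
final case class Timing(query: Int, spark3S3: Double, spark3Hdfs: Double, spark245S3: Double) {
  // Assumed definition: how many times longer Spark 3.0 takes on S3 than on HDFS.
  def hdfsVsS3: Double = spark3S3 / spark3Hdfs
  // Assumed definition: how many times longer Spark 3.0 on S3 takes than Spark 2.4.5 on S3.
  def spark245VsSpark3: Double = spark3S3 / spark245S3
}

val timings = Seq(
  Timing(9, 880.129, 106.109, 147.65),
  Timing(88, 475.824, 48.2, 94.382)
)

// Print both factors per query so they can be compared with the table.
timings.foreach { t =>
  println(f"Query ${t.query}%d: HDFS vs S3 = ${t.hdfsVsS3}%.3f, 2.4.5 vs 3.0 = ${t.spark245VsSpark3}%.3f")
}
```

Under these assumed definitions the recomputed values agree with the factor columns to the precision shown in the table.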

 

When we analysed this further, we saw that all of these queries operate on either the store_sales or the web_sales table, and that Spark 3 with S3 seems to download much more data from storage than Spark 3 with Hadoop or Spark 2.4.5 with S3, which results in longer query completion times. I’m attaching screenshots of the driver UI for one such instance (Query 9) for reference.

I have also attached the Spark configuration (Spark 3.0) used for these tests.

We’re not sure why Spark 3.0 on S3 behaves this way. Any input on what we might be missing?

 

Thanks and Regards,

Abhishek

 



---------------------------------------------------------------------
To unsubscribe e-mail: [hidden email]

Spark 3.0.0 Configuration.txt (6K) Download Attachment
Query 9 Spark 3.0.0 with Hadoop.PNG (379K) Download Attachment
Query 9 Spark 2.4.5 With S3.PNG (217K) Download Attachment
Query 9 Spark 3.0.0 With S3.PNG (229K) Download Attachment

RE: Spark 3.0 using S3 taking long time for some set of TPC DS Queries

Luca Canali

Hi Abhishek,

 

Just a few ideas/comments on the topic:

 

When benchmarking/testing, I find it useful to collect a more complete view of resource usage and Spark metrics, beyond just measuring query elapsed time. Something like this:

https://github.com/cerndb/spark-dashboard

 

I’d rather not use dynamic allocation when benchmarking if possible, as it adds a layer of complexity when examining results.
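
One way to follow this advice is to pin the executor footprint explicitly. The sketch below is a minimal illustration, assuming the settings are passed as `--conf` flags to `spark-submit`. The property names are standard Spark settings, but the executor count, cores and memory values are placeholders, not the values used in this thread.

```scala
// Settings that fix the executor footprint for a benchmark run.
// The values are placeholders; only the property names are standard Spark settings.
val benchmarkConf: Seq[(String, String)] = Seq(
  "spark.dynamicAllocation.enabled" -> "false", // executors are not added or removed mid-run
  "spark.executor.instances"        -> "8",     // placeholder
  "spark.executor.cores"            -> "4",     // placeholder
  "spark.executor.memory"           -> "16g"    // placeholder
)

// Render the settings as spark-submit flags.
val confFlags: String =
  benchmarkConf.map { case (key, value) => s"--conf $key=$value" }.mkString(" ")

println(confFlags)
```

Keeping the same fixed settings for every run makes the elapsed times easier to compare across storage backends and Spark versions.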

 

If you suspect that reading from S3 vs. HDFS plays an important role in the performance you observe, you may want to drill down on that with a simple micro-benchmark, for example something like this (for Spark 3.0):

 

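// Read the whole table and write it to the "noop" sink, which runs the full read but discards the output
val df = spark.read.parquet("/TPCDS/tpcds_1500/store_sales")

df.write.format("noop").mode("overwrite").save()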
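
To compare the two storage backends with this micro-benchmark, the read can be wrapped in a small timing helper. The sketch below is plain Scala: `timed` is a hypothetical helper, and the stand-in workload is there only so the example runs outside Spark.

```scala
// Hypothetical helper: runs `block` once and returns its result
// together with the elapsed wall-clock time in seconds.
def timed[A](block: => A): (A, Double) = {
  val start = System.nanoTime()
  val result = block
  val seconds = (System.nanoTime() - start) / 1e9
  (result, seconds)
}

// Stand-in workload so the sketch runs without a cluster.
val (sum, seconds) = timed { (1L to 1000000L).sum }
println(f"sum = $sum%d, elapsed = $seconds%.3f s")
```

In spark-shell, the same helper could wrap the read above once per storage path, for example `timed { spark.read.parquet(path).write.format("noop").mode("overwrite").save() }`, with `path` pointing first at the HDFS copy and then at the s3a copy of the table. Running each variant more than once helps separate warm-up effects from steady-state read time.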

 

Best,

Luca

 

From: Rao, Abhishek (Nokia - IN/Bangalore) <[hidden email]>
Sent: Monday, August 24, 2020 13:50
To: [hidden email]
Subject: Spark 3.0 using S3 taking long time for some set of TPC DS Queries

 


RE: Spark 3.0 using S3 taking long time for some set of TPC DS Queries

Rao, Abhishek (Nokia - IN/Bangalore)

Hi Luca,

 

Thanks for sharing the feedback. We'll include these recommendations in our tests. However, we feel the issue we're seeing right now is due to the difference in the amount of data downloaded from storage by the executors. In the case of S3, the executors download almost 50 GB of data, whereas in the case of HDFS it is only 4.5 GB.

Any idea why there is such a difference?
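
A back-of-the-envelope check of this claim is sketched below. It assumes the 50 GB and 4.5 GB figures refer to the Query 9 runs in the table (the message does not say which query they come from) and that each figure is the total input read during the run.

```scala
// Figures from the thread; pairing them with the Query 9 timings is an assumption.
val s3InputGb = 50.0      // "almost 50 GB" read from S3
val hdfsInputGb = 4.5     // read from HDFS
val s3Seconds = 880.129   // Query 9, Spark 3.0 with S3
val hdfsSeconds = 106.109 // Query 9, Spark 3.0 with HDFS

val inputRatio = s3InputGb / hdfsInputGb // about 11.1x more data read
val timeRatio = s3Seconds / hdfsSeconds  // about 8.3x longer

// Average data rate over the whole query, in MB/s (1 GB taken as 1000 MB).
val s3RateMbPerSec = s3InputGb * 1000 / s3Seconds       // about 57 MB/s
val hdfsRateMbPerSec = hdfsInputGb * 1000 / hdfsSeconds // about 42 MB/s

println(f"input ratio $inputRatio%.1f, time ratio $timeRatio%.1f")
println(f"S3 $s3RateMbPerSec%.0f MB/s, HDFS $hdfsRateMbPerSec%.0f MB/s")
```

Under these assumptions the S3 run is not slower per byte. It reads roughly eleven times as much data, which is consistent with the extra input volume, rather than raw S3 throughput, accounting for most of the gap.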

 

 

Thanks and Regards,

Abhishek

 

From: Luca Canali <[hidden email]>
Sent: Monday, August 24, 2020 7:18 PM
To: Rao, Abhishek (Nokia - IN/Bangalore) <[hidden email]>
Cc: [hidden email]
Subject: RE: Spark 3.0 using S3 taking long time for some set of TPC DS Queries

 


Re: Spark 3.0 using S3 taking long time for some set of TPC DS Queries

Gourav Sengupta
In reply to this post by Rao, Abhishek (Nokia - IN/Bangalore)
Hi,

Are you using s3a rather than EMRFS? In that case, these results do not make sense to me.

Regards,
Gourav Sengupta


RE: Spark 3.0 using S3 taking long time for some set of TPC DS Queries

Rao, Abhishek (Nokia - IN/Bangalore)

Hi Gourav,

 

Yes. We’re using s3a.
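
For readers comparing setups, the sketch below lists the kind of s3a settings that are often reviewed when reading Parquet from a non-AWS S3 endpoint such as Ceph. It is an illustration only: the property names are standard Hadoop S3A options (passed through Spark with the `spark.hadoop.` prefix), but the endpoint URL and all values are placeholders, and nothing here is taken from the attached configuration file.

```scala
// Placeholder values; only the property names are standard Hadoop S3A options.
val s3aConf: Seq[(String, String)] = Seq(
  "spark.hadoop.fs.s3a.endpoint"                   -> "http://ceph-rgw.example.local:8080", // hypothetical endpoint
  "spark.hadoop.fs.s3a.path.style.access"          -> "true",   // commonly needed for non-AWS endpoints
  "spark.hadoop.fs.s3a.connection.ssl.enabled"     -> "false",  // matches the plain-http placeholder endpoint
  "spark.hadoop.fs.s3a.experimental.input.fadvise" -> "random"  // seek policy aimed at columnar formats
)

// Render the settings as spark-submit flags.
val s3aFlags: String =
  s3aConf.map { case (key, value) => s"--conf $key=$value" }.mkString(" ")

println(s3aFlags)
```

The input seek policy may be worth a look in a case like this one, because a policy tuned for sequential reads can fetch more bytes than a columnar reader actually needs. Whether it played any part here is not established by the thread.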

 

Thanks and Regards,

Abhishek

 

From: Gourav Sengupta <[hidden email]>
Sent: Wednesday, August 26, 2020 1:18 PM
To: Rao, Abhishek (Nokia - IN/Bangalore) <[hidden email]>
Cc: [hidden email]
Subject: Re: Spark 3.0 using S3 taking long time for some set of TPC DS Queries

 


Re: Spark 3.0 using S3 taking long time for some set of TPC DS Queries

Gourav Sengupta
Hi,

So the results do not make sense.


Regards,
Gourav

On Wed, Aug 26, 2020 at 9:04 AM Rao, Abhishek (Nokia - IN/Bangalore) <[hidden email]> wrote:

Hi Gourav,

 

Yes. We’re using s3a.

 

Thanks and Regards,

Abhishek

 


RE: Spark 3.0 using S3 taking long time for some set of TPC DS Queries

Rao, Abhishek (Nokia - IN/Bangalore)

Yeah… I’m not sure whether I’m missing some configuration that is causing this issue. Any suggestions?

 

Thanks and Regards,

Abhishek

 

From: Gourav Sengupta <[hidden email]>
Sent: Wednesday, August 26, 2020 2:35 PM
To: Rao, Abhishek (Nokia - IN/Bangalore) <[hidden email]>
Cc: [hidden email]
Subject: Re: Spark 3.0 using S3 taking long time for some set of TPC DS Queries

 


Re: Spark 3.0 using S3 taking long time for some set of TPC DS Queries

Gourav Sengupta
Hi
Can you try using EMRFS?
Your study looks good. Best of luck.

Regards 
Gourav 

On Wed, 26 Aug 2020, 12:37 Rao, Abhishek (Nokia - IN/Bangalore), <[hidden email]> wrote:

Yeah… I’m not sure whether I’m missing some configuration that is causing this issue. Any suggestions?

 

Thanks and Regards,

Abhishek

 


RE: Spark 3.0 using S3 taking long time for some set of TPC DS Queries

Rao, Abhishek (Nokia - IN/Bangalore)

Hi All,

 

We regenerated the TPC DS data on S3, and after regeneration the queries run faster: the execution time is now comparable with the execution time on HDFS with Spark 3.0.0.

So maybe there was some issue when the TPC DS data was generated the first time, which caused the discrepancy we were seeing in query execution time on S3 with Spark 3.0.0.
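
A check along the following lines can help spot such a problem before benchmarking, by comparing the two copies of a table. This is a minimal sketch for spark-shell, assuming both copies are stored as Parquet. The two paths are hypothetical, and `TableStats` and `describe` are illustrative helpers, not something used in this thread.

```scala
import org.apache.hadoop.fs.Path
import org.apache.spark.sql.SparkSession

// Illustrative summary of one copy of a table.
final case class TableStats(rows: Long, files: Int, bytes: Long)

def describe(session: SparkSession, location: String): TableStats = {
  val df = session.read.parquet(location)
  val path = new Path(location)
  val fs = path.getFileSystem(session.sparkContext.hadoopConfiguration)
  TableStats(
    rows = df.count(),                           // logical row count
    files = df.inputFiles.length,                // number of data files Spark will scan
    bytes = fs.getContentSummary(path).getLength // total bytes stored under the directory
  )
}

// Reuses the active session when run in spark-shell.
val session = SparkSession.builder().getOrCreate()

// Hypothetical locations of the two copies of the same table.
val onHdfs = describe(session, "hdfs:///TPCDS/tpcds_500/store_sales")
val onS3 = describe(session, "s3a://tpcds/tpcds_500/store_sales")
println(s"HDFS: $onHdfs")
println(s"S3:   $onS3")
```

Matching row counts with very different file counts or byte totals would suggest that the copies were written with a different layout (for example different file sizes, compression, or partitioning), which can change how much data a scan has to read.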

 

Thanks and Regards,

Abhishek

 

From: Gourav Sengupta <[hidden email]>
Sent: Wednesday, August 26, 2020 5:49 PM
To: Rao, Abhishek (Nokia - IN/Bangalore) <[hidden email]>
Cc: user <[hidden email]>
Subject: Re: Spark 3.0 using S3 taking long time for some set of TPC DS Queries

 
