performance


performance

Yann Luppo
Hi,

I have what I hope is a simple question: what's a typical approach to diagnosing performance issues on a Spark cluster?
We've already followed all the pertinent parts of this document: http://spark.incubator.apache.org/docs/latest/tuning.html
But we still seem to have issues. More specifically, we have a leftOuterJoin followed by a flatMap and then a collect that runs a bit long, roughly like the sketch below.
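For concreteness, here is roughly the shape of the job, with made-up names and toy data rather than our real code (spark-shell style, so sc already exists):

```scala
val events = sc.parallelize(Seq(("a", 1), ("b", 2)))     // the large side: RDD[(K, V)]
val lookup = sc.parallelize(Seq(("a", "x")))             // the side we join against: RDD[(K, W)]

val joined  = events.leftOuterJoin(lookup)               // yields (K, (V, Option[W]))
val records = joined.flatMap { case (k, (v, w)) => Seq(k + ":" + v + ":" + w.getOrElse("-")) }
val results = records.collect()                          // pulls everything back to the driver
```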

How would I go about determining the bottleneck operation(s)?
Is our leftOuterJoin taking a long time?
Is the function we pass to the flatMap not optimized?

Thanks,
Yann

Re: performance

Andrew Ash
My first thought on hearing that you're calling collect is that taking all the data back to the driver is intensive on the network.  Try checking the basic systems stuff on the machines to get a sense of what's being heavily used:

disk IO
CPU
network

Any kind of distributed system monitoring framework should be able to handle these sorts of things.
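If you don't actually need the full result set on the driver, a couple of alternatives keep most of that data off the network (a sketch, using the same kind of hypothetical RDD as in the first post):

```scala
val sample = records.take(100)                        // a small sample instead of collect()
val total  = records.count()                          // or just a count, if that's all you need

// If you do need the full output, write it out in parallel from the workers instead.
records.saveAsTextFile("hdfs:///tmp/records-output")  // path is illustrative
```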

Cheers!
Andrew


Re: performance

Evan R. Sparks
On this note, the Ganglia web front end that runs on the master (assuming you're launching with the EC2 scripts) is great for this.

Also, a common technique for diagnosing "which step is slow" is to call '.cache' and then '.count' on the RDD after each step. This forces the RDD to be materialized, which sidesteps the lazy evaluation that can otherwise make this kind of diagnosis hard.
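Something like this, with made-up RDD names; time each count separately:

```scala
// events: RDD[(K, V)], lookup: RDD[(K, W)], expand: (K, V, Option[W]) => Seq[R] are hypothetical.
val joined = events.leftOuterJoin(lookup).cache()
joined.count()                                   // materializes the join -- time this step

val expanded = joined.flatMap { case (k, (v, w)) => expand(k, v, w) }.cache()
expanded.count()                                 // materializes the flatMap -- time this step

val results = expanded.collect()                 // finally, the cost of shipping results to the driver
```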

- Evan

Re: performance

Yann Luppo
Thank you guys, that was really helpful in identifying the slow step, which in our case is the leftOuterJoin.
I'm checking with our admins to see if we have some sort of distributed system monitoring in place, which I'm sure we do. 

Now just out of curiosity, what would be the rule of thumb or general guideline for the number of partitions and the number of reducers?
Should it be some kind of factor of the number of cores available? Of nodes available? Should the number of partitions match the number of reducers or at least be some multiple of it for better performance?

Thanks,
Yann

Re: performance

Matei Zaharia
Typically you want 2-3 partitions per CPU core to get good load balancing. How big is the data you’re transferring in this case? And have you looked at the machines to see whether they’re spending lots of time on IO, CPU, etc? (Use top or dstat on each machine for this). For large datasets with larger numbers of tasks, one option we added in 0.8.1 that helps a lot is consolidating shuffle files (see http://spark.incubator.apache.org/releases/spark-release-0-8-1.html). However, another common problem is just serialization taking a lot of time, which you’ll notice if the application is CPU-heavy, and which you can fix using Kryo.
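For reference, a rough sketch of those settings as they looked around 0.8.x; the property names are taken from that release's docs, so double-check them against your version, and the sizes here are placeholders:

```scala
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._   // pair-RDD operations such as leftOuterJoin

object TuningSketch {
  def main(args: Array[String]): Unit = {
    // Spark 0.8.x read these as Java system properties, set before the SparkContext is created.
    System.setProperty("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
    System.setProperty("spark.shuffle.consolidateFiles", "true")  // shuffle-file consolidation (0.8.1+)
    // For custom classes you would also point spark.kryo.registrator at your own registrator.

    val sc = new SparkContext("local[4]", "tuning-sketch")

    // Rule of thumb: 2-3 partitions per CPU core across the cluster.
    val numCores      = 4                 // placeholder -- use your cluster's total core count
    val numPartitions = numCores * 3

    val events = sc.parallelize(Seq(("a", 1), ("b", 2)))
    val lookup = sc.parallelize(Seq(("a", "x")))

    // Pass the partition count explicitly to the shuffle-producing operation.
    val joined = events.leftOuterJoin(lookup, numPartitions)
    println(joined.count())

    sc.stop()
  }
}
```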

Matei

Re: performance

Stoney Vintson
Here is a good post on Linux performance monitoring tools; look for the cheat sheet image toward the bottom of the post:
http://www.joyent.com/blog/linux-performance-analysis-and-tools-brendan-gregg-s-talk-at-scale-11x


Re: performance

Evan R. Sparks
If your left outer join is slow and one of the tables is relatively small, you could consider broadcasting the smaller table and doing a join like in slide 11 of this presentation:

If both tables are big, there has been some work on IndexedRDDs that might help speed things up, but this feature hasn't made it into Spark master yet: https://github.com/mesos/spark/pull/848
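Roughly, the broadcast approach looks like this: a sketch with made-up RDD names, assuming the small side fits comfortably in memory on the driver and workers:

```scala
// events: RDD[(K, V)] is the big side, lookup: RDD[(K, W)] is the small side (hypothetical names).
val smallMap = lookup.collect().toMap        // pull the small table back to the driver once
val smallBc  = sc.broadcast(smallMap)        // ship it to every worker with the job

// Map-side equivalent of events.leftOuterJoin(lookup): no shuffle, each partition joins locally.
val joined = events.map { case (k, v) => (k, (v, smallBc.value.get(k))) }
```

The trade-off is that the whole small table has to fit in memory on each node, but you avoid the shuffle entirely.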

