Interleaving stages

Interleaving stages

David Thomas
I have a spark application that has the below structure:

while(...) { // 10-100k iterations
  rdd.map(...).collect
}

Basically, I have an RDD and I need to query it multiple times.

Now when I run this, Spark creates a new stage for each iteration (each stage having multiple tasks). What I find is that each stage takes about 1 second to execute, and most of that time is spent scheduling the tasks. Since a stage is not submitted until the previous stage has completed, this loop takes a long time to finish. So my question is: is there a way to interleave multiple stage executions? Any other suggestions to improve the above query pattern?

Re: Interleaving stages

Mark Hamstra
With so little information about what your code is actually doing, what you have shared looks like an anti-pattern to me. Doing many collect actions should be avoided if at all possible, since each one forces a lot of network communication to materialize the results back in the driver process, and that network communication severely constrains performance.



Re: Interleaving stages

David Thomas
Is there a way I can queue several stages at once?


Re: Interleaving stages

Guillaume Pitel
In reply to this post by Mark Hamstra
Whatever you want to do, if you really have to do it that way, don't use Spark. And the answer to your question is: Spark automatically "interleaves" stages that can be interleaved.

That said, I do not believe you really want to do that. You should probably just do a filter + map, or a flatMap. But explain what you're trying to achieve and we can recommend a better approach.
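For instance, many per-query passes over the same RDD can often be collapsed into a single distributed pass. This is only a hypothetical sketch: `predicates` and the `(queryIndex, element)` pair structure are assumptions, not something from the original post:

```scala
// Instead of one driver-side collect per query:
//   for (p <- predicates) { rdd.filter(p).collect() }
// tag each element with the indices of the queries it matches, in one pass:
val predicates: Seq[Int => Boolean] = Seq(_ % 2 == 0, _ > 100)  // assumed queries
val tagged = rdd.flatMap { x =>
  predicates.zipWithIndex.collect { case (p, i) if p(x) => (i, x) }
}
val resultsByQuery = tagged.groupByKey()  // one job instead of one job per query
```

This turns 10-100k scheduled stages into a single job, at the cost of shuffling the matched elements once.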

Guillaume
--
eXenSa
Guillaume PITEL, Président
+33(0)6 25 48 86 80

eXenSa S.A.S.
41, rue Périer - 92120 Montrouge - FRANCE
Tel +33(0)1 84 16 36 77 / Fax +33(0)9 72 28 37 05

Re: Interleaving stages

Prashant Sharma
I am not sure, maybe Mark can correct me, but you could try the AsyncRDDFunctions (check the API docs for details). As I understand it, they let you submit many jobs and then receive the results asynchronously.
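A minimal sketch of what that might look like, assuming Spark's async actions (`collectAsync` returns a `FutureAction`, which is a Scala `Future`); the per-iteration function `f` and the iteration count are hypothetical:

```scala
import scala.concurrent.Await
import scala.concurrent.duration.Duration
import org.apache.spark.SparkContext._  // implicit conversion to async RDD actions

// Submit every job without blocking; `f` stands in for the real per-iteration logic.
val futures = (1 to iterations).map { i =>
  rdd.map(x => f(i, x)).collectAsync()  // returns a FutureAction, does not block
}
// The jobs can now be scheduled concurrently; gather the results afterwards.
val results = futures.map(fa => Await.result(fa, Duration.Inf))
```

Note that this only overlaps scheduling and execution; it does not remove the cost of collecting results back to the driver on every iteration.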



--
Prashant

Re: Interleaving stages

David Thomas
In reply to this post by Guillaume Pitel
Here is some example code that is bundled with Spark:

for (i <- 1 to ITERATIONS) {
  println("On iteration " + i)
  val gradient = points.map { p =>
    (1 / (1 + exp(-p.y * (w dot p.x))) - 1) * p.y * p.x
  }.reduce(_ + _)
  w -= gradient
}

As you can see, an action is called on every iteration. I guess this is the same pattern I have in my code. So why do you think this is not the right thing to do with Spark?


Re: Interleaving stages

Evan R. Sparks
The difference between this code and the code you showed is that here there is a .reduce() whose result is the size of ONE element of the RDD, while your code calls .collect(), whose result is the size of your entire dataset.

It would help if you provided a more complete example.
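To make the size difference concrete, here is an illustrative sketch (the numbers are made up for illustration):

```scala
val nums = sc.parallelize(1L to 1000000L)
val sum  = nums.reduce(_ + _)  // ships a single Long back to the driver
val all  = nums.collect()      // ships all 1,000,000 elements back to the driver
```

Both are actions that launch a job, but only collect() scales the driver-side transfer with the dataset size.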

