Re: Why does sortByKey launch cluster job?


Re: Why does sortByKey launch cluster job?

Andrew Ash
Hi Josh,

I just ran into this again myself and noticed that the source hasn't changed since we discussed it in December. Should I file an official bug in Jira?

Andrew


On Tue, Dec 10, 2013 at 8:34 AM, Josh Rosen <[hidden email]> wrote:
I wonder whether making RangePartitioner.rangeBounds into a lazy val would fix this (https://github.com/apache/incubator-spark/blob/6169fe14a140146602fb07cfcd13eee6efad98f9/core/src/main/scala/org/apache/spark/Partitioner.scala#L95). We'd need to make sure that rangeBounds is never evaluated before an action is performed. This could be tricky because it's used in the RangePartitioner.equals() method. Maybe it's sufficient to just compare the number of partitions, the ids of the RDDs used to create the RangePartitioner, and the sort ordering. This still supports the case where I range-partition one RDD and pass the same partitioner to a different RDD. It breaks support for the case where two range partitioners created on different RDDs happen to have the same rangeBounds, but that seems unlikely to really harm performance, since range partitioners are probably rarely equal by chance.
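
[Editor's sketch] The idea above (compute the bounds lazily and compare partitioners by partition count, source RDD id, and sort ordering rather than by the bounds themselves) can be illustrated with a toy model in plain Python. This is not Spark code: ToyRDD, LazyRangePartitioner, collect_sample, and jobs_run are hypothetical names invented for the illustration, and functools.cached_property stands in for a Scala lazy val.

```python
from functools import cached_property


class ToyRDD:
    """Hypothetical stand-in for an RDD; counts the jobs it has launched."""

    def __init__(self, rdd_id, keys):
        self.id = rdd_id
        self._keys = list(keys)
        self.jobs_run = 0

    def count(self):
        self.jobs_run += 1  # an action: launches a job
        return len(self._keys)

    def collect_sample(self, max_size):
        self.jobs_run += 1  # collecting the sample launches another job
        step = max(1, len(self._keys) // max_size)
        return self._keys[::step]  # deterministic every-nth "sample"


class LazyRangePartitioner:
    """Toy partitioner whose bounds are computed only on first access."""

    def __init__(self, partitions, rdd, ascending=True):
        self.partitions = partitions
        self.rdd = rdd
        self.ascending = ascending  # only used for equality in this sketch

    @cached_property
    def range_bounds(self):
        # Runs at most once, the first time the bounds are actually needed.
        if self.partitions <= 1 or self.rdd.count() == 0:
            return []
        sample = sorted(self.rdd.collect_sample(self.partitions * 20))
        n = len(sample)
        return [sample[i * n // self.partitions] for i in range(1, self.partitions)]

    def _identity(self):
        # Equality deliberately avoids range_bounds, so comparing two
        # partitioners never forces the sampling jobs.
        return (self.partitions, self.rdd.id, self.ascending)

    def __eq__(self, other):
        if not isinstance(other, LazyRangePartitioner):
            return NotImplemented
        return self._identity() == other._identity()

    def __hash__(self):
        return hash(self._identity())
```

In this model, constructing or comparing partitioners launches no jobs; the two jobs run only when range_bounds is first read. The trade-off mentioned above also shows up: two partitioners built on different RDDs compare unequal even when their bounds would coincide.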


On Tue, Dec 10, 2013 at 8:18 AM, Ryan Prenger <[hidden email]> wrote:
Thanks for the responses!  I agree that b seems like it would be better.  I could imagine optimizations that could be made if a filter call came after the sortByKey that would make the initial partitioning sub-optimal.  Plus this way, it's a pain to use in the REPL.

Cheers,

Ryan


On Tue, Dec 10, 2013 at 7:06 AM, Andrew Ash <[hidden email]> wrote:
Since sortByKey() invokes those right now, we should either a) change the documentation to note that it kicks off actions or b) change the method to execute those things lazily.

Personally I'd prefer b but don't know how difficult that would be.


On Tue, Dec 10, 2013 at 1:52 AM, Jason Lenderman <[hidden email]> wrote:
Hey Ryan,

The sortByKey method creates a RangePartitioner (see Partitioner.scala), and the initialization code of the RangePartitioner invokes the count action and collects a sample of the RDD, each of which launches a job.


Jason
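
[Editor's sketch] A toy model in plain Python of why merely constructing the partitioner launches jobs. This is not Spark code: ToyRDD, EagerRangePartitioner, collect_sample, and jobs_run are hypothetical names, and the bounds calculation is a simplified assumption rather than the logic in Partitioner.scala.

```python
class ToyRDD:
    """Hypothetical stand-in for an RDD; counts the jobs it has launched."""

    def __init__(self, rdd_id, keys):
        self.id = rdd_id
        self._keys = list(keys)
        self.jobs_run = 0

    def count(self):
        self.jobs_run += 1  # an action: launches a job
        return len(self._keys)

    def collect_sample(self, max_size):
        self.jobs_run += 1  # collecting the sample launches another job
        step = max(1, len(self._keys) // max_size)
        return self._keys[::step]  # deterministic every-nth "sample"


class EagerRangePartitioner:
    """Toy partitioner that computes its bounds in the constructor."""

    def __init__(self, partitions, rdd):
        self.partitions = partitions
        # Eager initialization: the jobs run here, before any action
        # has been requested on the sorted result.
        self.range_bounds = self._compute_bounds(rdd)

    def _compute_bounds(self, rdd):
        if self.partitions <= 1 or rdd.count() == 0:
            return []
        sample = sorted(rdd.collect_sample(self.partitions * 20))
        n = len(sample)
        # Pick partitions - 1 evenly spaced keys as partition boundaries.
        return [sample[i * n // self.partitions] for i in range(1, self.partitions)]
```

Creating EagerRangePartitioner(4, ToyRDD(1, range(100))) already bumps jobs_run to 2, which mirrors the behaviour reported in the original question: the work happens when the sort is defined, not when its result is consumed.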

 


On Mon, Dec 9, 2013 at 7:01 PM, Ryan Prenger <[hidden email]> wrote:
sortByKey is listed as a data transformation, not an action, yet it launches a job.  This doesn't seem to square with the documentation.

Ryan






Re: Why does sortByKey launch cluster job?

Aaron Davidson
Feel free to always file official bugs in Jira, as long as they're not already there!


On Tue, Jan 7, 2014 at 9:47 PM, Andrew Ash <[hidden email]> wrote:
Hi Josh,

I just ran into this again myself and noticed that the source hasn't changed since we discussed it in December. Should I file an official bug in Jira?

Andrew



Re: Why does sortByKey launch cluster job?

Andrew Ash
And at the moment we should use the atlassian.net Jira instance, not the apache.org one?  The apache one looks empty.



On Wed, Jan 8, 2014 at 9:04 AM, Aaron Davidson <[hidden email]> wrote:
Feel free to always file official bugs in Jira, as long as they're not already there!



Re: Why does sortByKey launch cluster job?

Andrew Ash


On Wed, Jan 8, 2014 at 9:56 AM, Andrew Ash <[hidden email]> wrote:
And at the moment we should use the atlassian.net Jira instance, not the apache.org one?  The apache one looks empty.


