Spark process locality

vinay Bajaj
Hi

It would be very helpful if anyone could elaborate on spark.locality.wait and the locality levels (process-local, node-local, rack-local, and then any). What is the best configuration I can achieve by tuning this wait, and what is the difference between process-local and node-local?

Thanks
Vinay Bajaj


Re: Spark process locality

Mayur Rustagi
Process-local means the data is cached in the same JVM as the task; node-local means it is cached on the same machine but not in the same JVM (in another executor process, perhaps). Tuning the wait depends on your system configuration (memory vs. disk vs. network). I frankly never had to modify it. Can you share the use case that requires you to change it?
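
For reference, a minimal sketch of how these waits can be set through SparkConf (the app name and master URL below are placeholders; the keys, including the per-level spark.locality.wait.process/node/rack variants, are the standard configuration names, with values in milliseconds):

import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("locality-demo")                 // placeholder
  .setMaster("spark://master:7077")            // placeholder
  // How long the scheduler waits for a free slot at each locality level
  // before falling back to the next one (process -> node -> rack -> any).
  .set("spark.locality.wait", "3000")          // global default, in ms
  .set("spark.locality.wait.process", "3000")  // PROCESS_LOCAL -> NODE_LOCAL
  .set("spark.locality.wait.node", "3000")     // NODE_LOCAL -> RACK_LOCAL
  .set("spark.locality.wait.rack", "3000")     // RACK_LOCAL -> ANY

val sc = new SparkContext(conf)

Setting the waits higher favours locality at the cost of scheduling delay; setting them to 0 schedules tasks immediately wherever a slot is free.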



Re: Spark process locality

Patrick Wendell
I think these are fairly well explained in the user docs. Was there something unclear that maybe we could update?

http://spark.incubator.apache.org/docs/latest/configuration.html


Re: Spark process locality

vinay Bajaj
Hi Mayur

I am trying to analyse Apache logs that contain traffic details, basically to compute statistics on data points such as total views from each country and unique URLs. I have one cluster running with 4 workers and one master (240 GB of space and 96 cores in total). I was trying a few things to make it faster, and got stuck on these locality types.

Regards
Vinay Bajaj
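
For context, a minimal sketch of this kind of job (the HDFS path and field positions are hypothetical, a real Apache log would need a proper parser, and an existing SparkContext sc is assumed):

import org.apache.spark.SparkContext._  // pair-RDD operations such as reduceByKey

// Hypothetical layout: whitespace-separated fields with the country code
// at index 2 and the requested URL at index 6.
val fields = sc.textFile("hdfs:///logs/access.log").map(_.split(" "))

val viewsPerCountry = fields.map(f => (f(2), 1)).reduceByKey(_ + _)
val uniqueUrls = fields.map(f => f(6)).distinct().count()

viewsPerCountry.collect().foreach(println)
println("unique URLs: " + uniqueUrls)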


Re: Spark process locality

Mayur Rustagi
It's highly likely that locality will not become a bottleneck, since Spark tries to schedule tasks where the data is cached. Two things might help:
1. Make sure you have enough memory to cache the whole dataset as an RDD; keep in mind the cached RDD can be larger than the raw text, because Java objects carry overhead.
2. Try increasing the replication factor of the data, so that it is available on more workers and is faster to cache on workers that don't already have it (in the non-local cases). See the sketch after this message.

Regards
Mayur
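
One way to apply suggestion 2 inside Spark itself is a replicated storage level. A minimal sketch, assuming an existing SparkContext sc and a hypothetical input path:

import org.apache.spark.storage.StorageLevel

val logs = sc.textFile("hdfs:///logs/access.log")  // hypothetical path
// MEMORY_ONLY_2 keeps each cached partition on two nodes, so more tasks
// can run process- or node-local instead of fetching over the network.
logs.persist(StorageLevel.MEMORY_ONLY_2)

(If the suggestion was about HDFS block replication instead, that is set on the file itself, e.g. hdfs dfs -setrep.)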




Re: Spark process locality

vinay Bajaj
Hi Mayur,

Thanks a lot for very quick reply.

I have a few questions regarding RDDs:
1) How do I find out RDD placement per machine, i.e. which RDD data is cached at what location?
2) How do I find out the total space taken by each RDD created by my program?
3) Does enabling compression on RDDs help?

Thanks,
Vinay





Re: Spark process locality

Mayur Rustagi
You can find that in the Storage tab of the Spark web UI (on the driver, port 4040 by default): it lists each cached RDD, its size, and which executors hold its partitions.
Compression will certainly help!
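
On question 3: spark.rdd.compress compresses serialized cached partitions, so it only takes effect together with a serialized storage level, trading extra CPU for memory. A minimal sketch (app name, master, and path are placeholders):

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel

val conf = new SparkConf()
  .setAppName("compressed-cache")      // placeholder
  .setMaster("local[2]")               // placeholder
  .set("spark.rdd.compress", "true")   // compress serialized RDD partitions

val sc = new SparkContext(conf)

// Compression applies only to *_SER levels such as MEMORY_ONLY_SER.
sc.textFile("hdfs:///logs/access.log").persist(StorageLevel.MEMORY_ONLY_SER)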




Re: Spark process locality

dachuan
Mayur, is there any way to pin each of an RDD's partitions to a particular node?

The input data is usually stored in HDFS and has its own preferred locations, but I am curious whether we can force the RDD's partitions to be placed on chosen nodes, regardless of where the input is stored.

thanks.


--
Dachuan Huang
Cellphone: 614-390-7234
2015 Neil Avenue
Ohio State University
Columbus, Ohio
U.S.A.
43210

Re: Spark process locality

Mayur Rustagi
No, you cannot force an RDD onto a particular node.
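
One narrow aside: for a local collection, SparkContext.makeRDD has an overload that takes per-element location preferences (hostnames), creating one partition per element. These are hints to the scheduler rather than guarantees, and they do not apply to RDDs loaded from storage. A minimal sketch, assuming an existing SparkContext sc and hypothetical hostnames:

// Each element carries a list of preferred hosts; the scheduler tries to
// run the corresponding task there, but placement is not forced.
val hinted = sc.makeRDD(Seq(
  ("record-1", Seq("worker1.example.com")),
  ("record-2", Seq("worker2.example.com"))
))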


