Will JVM be reused?

Archit Thakur
A JVM reuse doubt.
Let's say I have a job with 5 stages. Each stage has 10 tasks (10 partitions), and each task applies 3 transformations.
My cluster has 4 machines (1 master, 3 workers). How many JVMs will be launched?
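Concretely, each stage has roughly this shape (a made-up sketch, not my actual code; sc is assumed to be the JavaSparkContext):

import scala.Tuple2;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;

// 10 partitions => 10 tasks per stage
JavaRDD<String> lines = sc.textFile("hdfs:///input", 10);
JavaPairRDD<String, Integer> counts = lines
        .map(String::trim)                  // transformation 1 (same task,
        .filter(s -> !s.isEmpty())          // transformation 2  same stage)
        .mapToPair(s -> new Tuple2<>(s, 1)) // transformation 3
        .reduceByKey(Integer::sum);         // shuffle => the next stage begins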

There are 1 master daemon and 3 worker daemons. Is the total:

JVMs = 1 + 3 + 10*3*5 (the 10 tasks of a stage run in parallel across the 3 machines, but the transformations run sequentially, launching a new JVM for every transformation in every stage), or
1 + 3 + 5*10 (the 10 tasks run in parallel across the 3 machines, but each stage runs in a fresh set of JVMs), or
1 + 3 + 5*3 (a JVM is reused across partitions on a single machine, but each stage runs in a fresh set of JVMs), or
1 + 3 + 3 (one JVM per worker in any case), or
none of the above?

Thx,
Archit_Thakur.


Re: Will JVM be reused?

Roshan Nair

Hi Archit,

I believe it's the last case: 1+3+3.

From what I've seen, it's one JVM per worker per Spark application.

You will have multiple threads within a worker JVM working on different partitions concurrently. The number of partitions a worker handles at once appears to be determined by the number of cores you've set the worker (or app) to use.
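
For instance (a sketch, with made-up numbers): in standalone mode, SPARK_WORKER_CORES in conf/spark-env.sh controls how many cores each worker offers, e.g.

export SPARK_WORKER_CORES=4   # this worker runs up to 4 tasks concurrently

and a single application can be capped cluster-wide with the spark.cores.max property (e.g. System.setProperty("spark.cores.max", "6") before creating the SparkContext).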

If each stage ran in a fresh set of JVMs, Spark would have to save every RDD to disk and reload it into memory between stages, which is why it doesn't do that.

Roshan



Re: Will JVM be reused?

Roshan Nair

I missed this. It's actually 1+3+3+1, the last being the JVM in which your driver runs.
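
For the example above, that works out to 1 (master daemon) + 3 (worker daemons) + 3 (one JVM per worker for the application) + 1 (driver) = 8 JVMs.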

Roshan



Re: Will JVM be reused?

Archit Thakur
Yeah, I believed that too.

"The last being the JVM in which your driver runs"? Isn't that among the 3 worker daemons we have already counted?





Re: Will JVM be reused?

Archit Thakur
Oh, you meant the main driver. Yes, correct.




Re: Will JVM be reused?

Roshan Nair

The driver JVM is the one in which you create the SparkContext and launch your job. It's different from the master and worker daemons.
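
For instance (a minimal sketch; the master URL and app name are made up), whatever process runs this main() is the driver JVM:

import java.util.Arrays;
import org.apache.spark.api.java.JavaSparkContext;

public class MyDriver {
    // The JVM executing this main() is the driver: it creates the
    // SparkContext, builds the RDD graph, and ships tasks to the
    // per-application JVMs on the workers.
    public static void main(String[] args) {
        JavaSparkContext sc =
                new JavaSparkContext("spark://master:7077", "MyDriver");
        long n = sc.parallelize(Arrays.asList(1, 2, 3, 4)).count();
        System.out.println("count = " + n); // the action's result comes back here
        sc.stop();
    }
}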

Roshan




Re: Will JVM be reused?

Archit Thakur
Yeah, I got that. Thanks.




Re: Will JVM be reused?

Archit Thakur
I'm actually facing a general problem, which seems to be related to how many JVMs get launched.
In my map function I read a file and fill a Map from it.
Since the data is static and the map function is called for every record of the RDD, I want to read the file only once. So I kept the Map as a static field (in Java), so that at least within a single JVM I don't have to do the I/O more than once. But keeping it static gives me an NPE, and sometimes an exception from somewhere deep inside (it seems Spark serializes things here and can't load the static members). Not keeping it static runs successfully.

I know I could do it by reading the file on the master and then broadcasting it, but there is a reason I want to do it this way.
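
Concretely, what I'm after is a lazily initialized per-JVM holder along these lines (a sketch; the class name, field, and file path are made up, and the file has to be readable from every worker):

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.HashMap;
import java.util.Map;

// Each JVM that calls get() loads the file at most once. Static fields
// are not shipped with the serialized task closure; they are rebuilt
// lazily inside whichever JVM runs the task.
public class LookupHolder {
    private static volatile Map<String, String> lookup; // one copy per JVM

    public static Map<String, String> get() throws IOException {
        if (lookup == null) {
            synchronized (LookupHolder.class) {
                if (lookup == null) { // double-checked locking
                    Map<String, String> m = new HashMap<>();
                    // made-up path; must exist on every worker machine
                    for (String line : Files.readAllLines(Paths.get("/data/lookup.tsv"))) {
                        String[] parts = line.split("\t", 2);
                        if (parts.length == 2) {
                            m.put(parts[0], parts[1]);
                        }
                    }
                    lookup = m;
                }
            }
        }
        return lookup;
    }
}

The map function would then call LookupHolder.get() instead of touching a pre-filled static field, so the load happens on first use inside each worker-side JVM rather than on the driver.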



