Why don't I see my spark jobs running in parallel in Cassandra/Spark DSE cluster?


Cassa L
Hi,
I have a Spark job with the following use case:
RDD1 and RDD2 are read from Cassandra tables. Each RDD then goes through a transformation, after which I count the transformed data.

The code looks somewhat like this:

RDD1 = JavaFunctions.cassandraTable(...)
RDD2 = JavaFunctions.cassandraTable(...)
RDD3 = RDD1.flatMap(...)
RDD4 = RDD2.flatMap(...)

RDD3.count()
RDD4.count()

In the Spark UI I see the count() jobs being run one after another. How do I make them run in parallel? I also looked at a related discussion from Cloudera, but it does not show how to run driver-side actions in parallel. Do I just add an Executor and run them in threads?


(UI snapshot attached.)


Thanks.
LCassa
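[Editor's note: the question "Do I just add an Executor and run them in threads?" points at the usual pattern here: Spark actions submitted from separate driver threads become separate jobs that can run concurrently. Below is a minimal, hedged sketch of just the threading pattern using the plain JDK `ExecutorService`; the two callables are stand-ins for the real `RDD3.count()` and `RDD4.count()` actions, and the class and method names are illustrative, not from the thread.]

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class ParallelCounts {

    // Submit two independent "count" tasks to a thread pool so they can
    // run concurrently instead of one after the other.
    public static long[] countInParallel(Callable<Long> count1,
                                         Callable<Long> count2) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(2);
        try {
            Future<Long> f1 = pool.submit(count1); // e.g. () -> RDD3.count()
            Future<Long> f2 = pool.submit(count2); // e.g. () -> RDD4.count()
            return new long[] { f1.get(), f2.get() }; // block until both finish
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) throws Exception {
        // Stand-in callables; with Spark these would wrap the real actions.
        long[] counts = countInParallel(() -> 3L, () -> 2L);
        System.out.println(counts[0] + " " + counts[1]); // prints "3 2"
    }
}
```

With Spark, each `Future` corresponds to a separate job, so the cluster scheduler can interleave their tasks if executors have free slots.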

Re: Why don't I see my spark jobs running in parallel in Cassandra/Spark DSE cluster?

Jörn Franke
Do you use YARN? Then you need to configure the queues with the right scheduler and method.


Re: Why don't I see my spark jobs running in parallel in Cassandra/Spark DSE cluster?

Cassa L
In reply to this post by Jörn Franke
No, I don't use YARN. This is the standalone Spark that comes with the DataStax Enterprise version of Cassandra.
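[Editor's note: in standalone mode, too, jobs submitted from separate driver threads can run concurrently within one application; Spark's fair scheduler can be enabled so concurrent jobs share executors rather than queuing FIFO. A hedged sketch, assuming the setting goes in spark-defaults.conf (it can also be set on the SparkConf):]

```
spark.scheduler.mode  FAIR
```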



Re: Why don't I see my spark jobs running in parallel in Cassandra/Spark DSE cluster?

Thakrar, Jayesh

What you have written is sequential code, and hence you get sequential processing.

Also, Spark/Scala are not parallel programming languages. But even if they were, statements execute sequentially unless you exploit the parallel/concurrent execution features.

Anyway, see if this works:

val (RDD1, RDD2) = (JavaFunctions.cassandraTable(...), JavaFunctions.cassandraTable(...))
val (RDD3, RDD4) = (RDD1.flatMap(..), RDD2.flatMap(..))

I am hoping that, since Spark is based on Scala, the behavior below will apply:

scala> var x = 0
x: Int = 0

scala> val (a, b) = (x + 1, x + 1)
a: Int = 1
b: Int = 1
