Using same rdd from two threads

jelmer
Hi,

I have a piece of code in which an RDD is created in a main method.
It then does work on this RDD from 2 different threads running in parallel.

When running this code as part of a test with a local master, it will sometimes make Spark hang (1 task will never complete).

If I make a copy of the RDD, the job will complete fine.

I suspect it's a bad idea to use the same RDD from two threads, but I could not find any documentation on the subject.

Should it be possible to do this, and if not, can anyone point me to documentation pointing out that this is not on the table?

--jelmer

Re: Using same rdd from two threads

srowen
RDDs are immutable, and Spark itself is thread-safe. This should be fine. Something else is going on in your code.

On Fri, Jan 22, 2021 at 7:59 AM jelmer <[hidden email]> wrote:
Hi,

I have a piece of code in which an RDD is created in a main method.
It then does work on this RDD from 2 different threads running in parallel.

When running this code as part of a test with a local master, it will sometimes make Spark hang (1 task will never complete).

If I make a copy of the RDD, the job will complete fine.

I suspect it's a bad idea to use the same RDD from two threads, but I could not find any documentation on the subject.

Should it be possible to do this, and if not, can anyone point me to documentation pointing out that this is not on the table?

--jelmer

Re: Using same rdd from two threads

jelmer
Well it is now...

The RDD had a repartition call on it.

When I removed the repartition, it would work.
When I did not remove the repartition but called rdd.partitions.length on it, it would also work!

I looked into the partitions method, and some instance variables get initialized in it, so saying RDDs are immutable is only true on a "logical" level.
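
To illustrate what I mean by "logical" immutability, here is a toy sketch in plain Python. It is not Spark code: LazyDataset, init_count and run_in_two_threads are made-up names, and the class only imitates the general pattern of an object whose visible contents never change but whose internal cache is filled in on first access. Forcing that first access on the main thread, before the worker threads start, is roughly what calling rdd.partitions.length did for me (assuming the call happens before the RDD is handed to the threads).

```python
import threading


class LazyDataset:
    """Toy stand-in (hypothetical, not a Spark class): contents are fixed,
    but the partition list is computed and cached on first access."""

    def __init__(self, items, num_partitions):
        self._items = tuple(items)
        self._num_partitions = num_partitions
        self._partitions = None  # internal state, mutated on first access
        self.init_count = 0      # how many times the lazy initialization ran

    @property
    def partitions(self):
        # Check-then-act without a lock: two threads arriving here at the
        # same time could both run the initialization.
        if self._partitions is None:
            self.init_count += 1
            self._partitions = [
                self._items[i::self._num_partitions]
                for i in range(self._num_partitions)
            ]
        return self._partitions


def run_in_two_threads(dataset):
    """Read the dataset from two threads and collect the element counts."""
    results = []

    def work():
        results.append(sum(len(p) for p in dataset.partitions))

    threads = [threading.Thread(target=work) for _ in range(2)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results


dataset = LazyDataset(range(10), num_partitions=3)

# Workaround: trigger the lazy initialization once on the main thread,
# so the worker threads only ever read the already-cached value.
_ = len(dataset.partitions)

results = run_in_two_threads(dataset)
```

With the forced access in place, the initialization runs exactly once and both threads see the same cached partitions. Without it, whether the initialization runs once or twice depends on thread timing, which would fit the "sometimes" in my original report.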

And it looks like this change fixed it.

But since we're using an old version, that does not really help.

On Fri, 22 Jan 2021 at 15:34, Sean Owen <[hidden email]> wrote:
RDDs are immutable, and Spark itself is thread-safe. This should be fine. Something else is going on in your code.