How to Spawn Child Thread or Sub-jobs in a Spark Session

Artemis User
We have a Spark job that produces a result data frame, say DF-1, at the
end of the pipeline (i.e. Proc-1).  From DF-1, we need to create two or
more data frames, say DF-2 and DF-3, via additional SQL or ML processes,
i.e. Proc-2 and Proc-3.  Ideally, we would like to run Proc-2 and
Proc-3 in parallel, since they can be executed independently: DF-1 is
immutable, and DF-2 and DF-3 are mutually exclusive.

Does Spark have built-in APIs to support spawning sub-jobs within a
single session?  If multi-threading is needed, what are the common best
practices in this case?

Thanks in advance for your help!

-- ND

Re: How to Spawn Child Thread or Sub-jobs in a Spark Session

Raghavendra Ganesh
There should be no need to parallelize the DF-2 and DF-3 computations explicitly. Spark generates the execution plans and can decide what runs in parallel; ideally, you should see them running in parallel in the Spark UI.

You should cache DF-1 if possible (in memory, on disk, or both); otherwise, computing DF-2 and DF-3 may each trigger a recomputation of DF-1.
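To make this concrete, here is a minimal PySpark sketch, not a definitive recipe. It persists DF-1 once and then starts the two downstream computations from separate driver threads. The threads matter because an action such as count() or collect() blocks the thread that calls it, so two actions issued from one thread run one after the other; Spark's job scheduling documentation describes concurrent jobs within one application as jobs submitted from separate threads. Everything named here is an assumption for illustration: the helper run_in_parallel, the application name, the local[*] master, and the toy stand-ins for Proc-1, Proc-2 and Proc-3.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Any, Callable, Dict


def run_in_parallel(tasks: Dict[str, Callable[[], Any]]) -> Dict[str, Any]:
    """Run each zero-argument task in its own driver thread; return results by name."""
    if not tasks:
        return {}
    with ThreadPoolExecutor(max_workers=len(tasks)) as pool:
        futures = {name: pool.submit(task) for name, task in tasks.items()}
        # result() blocks until the task finishes and re-raises any exception it threw.
        return {name: future.result() for name, future in futures.items()}


def main() -> None:
    # Imported here so the helper above stays usable without PySpark installed.
    from pyspark import StorageLevel
    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = (
        SparkSession.builder
        .master("local[*]")             # assumption: local run for the sketch
        .appName("parallel-sub-jobs")   # hypothetical application name
        .getOrCreate()
    )

    # Toy stand-in for Proc-1; in the thread this is the end of a longer pipeline.
    df1 = spark.range(1_000_000).withColumn("bucket", F.col("id") % 10)
    df1.persist(StorageLevel.MEMORY_AND_DISK)
    df1.count()  # materialize the cache once, before the threads start

    results = run_in_parallel({
        # Toy stand-ins for Proc-2 and Proc-3. Each ends in an action,
        # so each thread submits its own Spark job.
        "proc2": lambda: df1.groupBy("bucket").count().collect(),
        "proc3": lambda: df1.filter(F.col("id") % 2 == 0).count(),
    })
    print(len(results["proc2"]), results["proc3"])

    df1.unpersist()
    spark.stop()

# Call main() from an environment where PySpark is installed to try it.
```

How much the two jobs really overlap still depends on free executor resources. The default scheduler within an application is FIFO, so a first job that occupies every core delays the second; setting spark.scheduler.mode to FAIR is the documented way to share resources between concurrent jobs.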

--
Raghavendra


(In reply to the message from Artemis User of Sat, Dec 5, 2020, 12:31 AM.)