How to avoid long-running jobs blocking short-running jobs


How to avoid long-running jobs blocking short-running jobs

conner
Hi,

I use a Spark cluster to run ETL jobs and analysis computations on the data
after the ETL stage.
The ETL jobs can keep running for several hours, while an analysis computation is
a short-running job that finishes in a few seconds.
The dilemma I am trapped in is that my application runs in a single JVM and
can't be a cluster application, so there is currently just one SparkContext in my
application. When the ETL jobs are running,
they occupy all the resources, including the worker executors, for so long that
all my analysis computation jobs are blocked.

My idea is to find a good way to divide the Spark cluster resources into
two parts: one part for analysis computation jobs, the other for
ETL jobs. If the ETL part is free, I can also allocate analysis
computation jobs to it.
So I am looking for a middleware that can support two SparkContexts and that
can be embedded in my application. I did some research on the third-party
project Spark Job Server. It can divide Spark resources by launching another
JVM that runs a SparkContext with specific resources.
These operations are invisible to the upper layer, so that would be a good solution
for me. But that project runs in a single JVM and only supports a REST
API; I can't accept transferring the data over TCP again,
which is too slow for me. I want to get the result from the Spark cluster
and pass it to the view layer for display.
Can anybody give me some good suggestions? I would be very grateful.

Re: How to avoid long-running jobs blocking short-running jobs

Nicolas Paris-2
On Sat, Nov 03, 2018 at 02:04:01AM -0700, conner wrote:
> My solution is to find a good way to divide the spark cluster resource
> into two.

What about YARN and its queue management system?
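As a rough sketch of that approach, the ETL work and the analysis work would run as two separate Spark applications, each submitted to its own YARN queue; the queue names and settings below are assumptions, not anything already configured in this thread:

import org.apache.spark.sql.SparkSession

// Minimal sketch: the long ETL work and the short analysis work run as two
// separate Spark applications, each submitted to its own YARN queue.
// The queue names ("etl", "analysis") are assumptions; they have to exist in
// your YARN scheduler configuration first.
object EtlOnYarnQueue {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("etl-pipeline")
      .config("spark.yarn.queue", "etl")   // standard Spark-on-YARN property
      .getOrCreate()

    // ... ETL logic goes here ...

    spark.stop()
    // The analysis application would do the same with
    // .config("spark.yarn.queue", "analysis"), so YARN can cap how much of
    // the cluster the ETL queue may hold and keep slots free for analysis.
  }
}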

--
nicolas


Re: How to avoid long-running jobs blocking short-running jobs

Jörn Franke
In reply to this post by conner
Hi,

What does your Spark deployment architecture look like? Standalone? YARN? Mesos? Kubernetes? Those come with resource managers (not middleware) that let you implement the kind of scenario you want to achieve.

In any case, you can try the FairScheduler in any of those setups.
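As a rough illustration of the fair scheduler with a single SparkContext (the pool names and the toy jobs below are made-up placeholders, not something from this thread):

import org.apache.spark.sql.SparkSession

// Sketch of Spark's FAIR scheduling mode inside one application/SparkContext.
object FairSchedulerSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("fair-scheduler-sketch")
      .config("spark.scheduler.mode", "FAIR")   // default is FIFO
      .getOrCreate()
    val sc = spark.sparkContext

    // Long ETL work submitted from one thread, routed to an "etl" pool.
    new Thread(() => {
      sc.setLocalProperty("spark.scheduler.pool", "etl")
      sc.parallelize(1 to 10000000).map(_ * 2).count()   // stand-in for ETL
    }).start()

    // Short analysis queries submitted from another thread in their own pool,
    // so they get task slots instead of waiting for the ETL job to finish.
    new Thread(() => {
      sc.setLocalProperty("spark.scheduler.pool", "analysis")
      println(sc.parallelize(1 to 100).sum())            // stand-in for a query
    }).start()
  }
}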

Best regards


Fwd: How to avoid long-running jobs blocking short-running jobs

onmstester onmstester-2
In reply to this post by conner
You could use two separate pools with different weights for the ETL jobs and the rest of the jobs: if the ETL pool's weight is about 1 and the other pool's weight is 1000, then any time a short job comes in it is allocated nearly all of the resources. Details:
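A minimal sketch of wiring that up, assuming FAIR mode and an allocation file that defines the two pools (the file path, pool names, and exact weights here are illustrative assumptions):

import org.apache.spark.sql.SparkSession

// Sketch: FAIR mode plus an allocation file that defines the two pools, e.g.
//   <allocations>
//     <pool name="etl">      <weight>1</weight>    </pool>
//     <pool name="analysis"> <weight>1000</weight> </pool>
//   </allocations>
object WeightedPoolsSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("weighted-pools")
      .config("spark.scheduler.mode", "FAIR")
      .config("spark.scheduler.allocation.file", "/path/to/fairscheduler.xml") // assumed path
      .getOrCreate()
    val sc = spark.sparkContext

    // Any thread that submits the short jobs selects the heavily weighted pool
    // before calling an action, so those jobs get almost all free task slots.
    sc.setLocalProperty("spark.scheduler.pool", "analysis")

    // Threads that drive the ETL work would set the pool to "etl" instead.
  }
}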

Sent using Zoho Mail




RE: How to avoid long-running jobs blocking short-running jobs

Taylor Cox
In reply to this post by conner
Hi Conner,

What is preventing you from using a cluster model?
I wonder if Docker containers could help you here?
A quick internet search yielded Mist: https://github.com/Hydrospheredata/mist
Could that be useful?

Taylor
