Spark and Oozie

Spark and Oozie

Dennis Suhari

Dear experts,

I am using Spark to process data from HDFS (Hadoop). These Spark applications are data pipelines, data wrangling jobs, and machine learning applications. Spark submits its jobs via YARN, and this works well.
For scheduling I am now trying to use Apache Oozie, but I am seeing a performance impact: a Spark job that took 44 seconds when submitted via the CLI now takes nearly 3 minutes.

Have you had similar experiences using Oozie to schedule Spark application jobs? What alternative workflow tools are you using to schedule Spark jobs on Hadoop?


Br,

Dennis

Sent from my iPhone

---------------------------------------------------------------------
To unsubscribe e-mail: [hidden email]


Re: Spark and Oozie

Bartek Dobija
Hi Dennis, 

Oozie jobs shouldn't take that long in a well-configured cluster. Oozie allocates its own resources in YARN, which may require fine-tuning. Check whether YARN grants resources to the Oozie job immediately; if it does not, that may be one of the reasons, and you can change the job priorities in the YARN scheduler configuration.

Alternatively, take a look at the Apache Airflow project, which is a good alternative to Oozie.

Regards,
Bartek 

On Fri, Jul 19, 2019, 09:09 Dennis Suhari <[hidden email]> wrote:

Dear experts,

I am using Spark to process data from HDFS (Hadoop). These Spark applications are data pipelines, data wrangling jobs, and machine learning applications. Spark submits its jobs via YARN, and this works well.
For scheduling I am now trying to use Apache Oozie, but I am seeing a performance impact: a Spark job that took 44 seconds when submitted via the CLI now takes nearly 3 minutes.

Have you had similar experiences using Oozie to schedule Spark application jobs? What alternative workflow tools are you using to schedule Spark jobs on Hadoop?


Br,

Dennis



Re: Spark and Oozie

William Shen
Dennis, do you know what's taking the additional time? Is it the Spark job itself, or Oozie waiting for an allocation from YARN? Do you have a resource contention issue in YARN?
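
One way to narrow this down is to compare the timestamps YARN records for the two applications involved: the Oozie launcher and the Spark application it submits. The sketch below is only an illustration, not something measured on Dennis's cluster. It assumes you have already copied the `startedTime` and `finishedTime` fields (epoch milliseconds) for both applications, for example from the ResourceManager REST endpoint `/ws/v1/cluster/apps`. The function name `timing_breakdown` and the sample numbers are made up for the example.

```python
from typing import Dict


def timing_breakdown(launcher: Dict[str, int], spark: Dict[str, int]) -> Dict[str, float]:
    """Split the end-to-end time of an Oozie-launched Spark job into phases.

    Both arguments hold 'startedTime' and 'finishedTime' in epoch
    milliseconds, the unit the YARN ResourceManager reports.
    """
    ms = 1000.0
    return {
        # Time from launcher start until the Spark application starts
        "wait_before_spark_s": (spark["startedTime"] - launcher["startedTime"]) / ms,
        # Time the Spark application itself was running
        "spark_runtime_s": (spark["finishedTime"] - spark["startedTime"]) / ms,
        # Time the launcher kept running after Spark had finished
        "launcher_tail_s": (launcher["finishedTime"] - spark["finishedTime"]) / ms,
        # End-to-end time as seen from the launcher
        "total_s": (launcher["finishedTime"] - launcher["startedTime"]) / ms,
    }


# Hypothetical sample values, NOT taken from the cluster discussed in this thread.
T0 = 1_563_500_000_000
launcher_app = {"startedTime": T0, "finishedTime": T0 + 170_000}
spark_app = {"startedTime": T0 + 70_000, "finishedTime": T0 + 114_000}

for phase, seconds in timing_breakdown(launcher_app, spark_app).items():
    print(f"{phase}: {seconds:.1f}")
```

If most of the extra time shows up before the Spark application starts, the delay is more likely in the launcher or in waiting for YARN containers than in the Spark job itself.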

On Fri, Jul 19, 2019 at 12:24 AM Bartek Dobija <[hidden email]> wrote:
Hi Dennis, 

Oozie jobs shouldn't take that long in a well-configured cluster. Oozie allocates its own resources in YARN, which may require fine-tuning. Check whether YARN grants resources to the Oozie job immediately; if it does not, that may be one of the reasons, and you can change the job priorities in the YARN scheduler configuration.

Alternatively, take a look at the Apache Airflow project, which is a good alternative to Oozie.

Regards,
Bartek 



Re: Spark and Oozie

Dennis Suhari
Hi William,

Because it is the only job running, I don't think it is resource contention. We have configured the Capacity Scheduler, which means we are using YARN queues. As it is the only job, I can't see how it would be waiting in the queue.

Br,

Dennis

Sent from my iPhone

On Jul 20, 2019, at 01:48, William Shen <[hidden email]> wrote:

Dennis, do you know what's taking the additional time? Is it the Spark job itself, or Oozie waiting for an allocation from YARN? Do you have a resource contention issue in YARN?
