Recommended pipeline automation tool? Oozie?

Recommended pipeline automation tool? Oozie?

k.tham
I'm just wondering what's the general recommendation for data pipeline automation.

Say, I want to run Spark Job A, then B, then invoke script C, then do D, and if D fails, do E, and if Job A fails, send email F, etc...

It looks like Oozie might be the best choice. But I'd like some advice/suggestions.

Thanks!

Re: Recommended pipeline automation tool? Oozie?

Paul Brown

We use Luigi for this purpose. (Our pipelines typically run on AWS (no EMR), backed by S3, using combinations of Python jobs, non-Spark Java/Scala, and Spark. We run Spark jobs by connecting drivers/clients to the master, and those are what Luigi invokes.)


[hidden email] | Multifarious, Inc. | http://mult.ifario.us/


Re: Recommended pipeline automation tool? Oozie?

Andrei
I used both Oozie and Luigi, but found them inflexible and still overcomplicated, especially in the presence of Spark.

Oozie has a fixed list of building blocks, which is pretty limiting. For example, you can launch a Hive query, but Impala, Shark/Spark SQL, etc. are out of scope (of course, you can always write a wrapper as a Java or shell action, but does it really need to be so complicated?). Another issue with Oozie is passing variables between actions. There's an Oozie context that is suitable for passing key-value pairs (both strings) between actions, but for more complex objects (say, a FileInputStream that should be closed only at the last step) you have to do some advanced kung fu.

Luigi, on the other hand, has its niche: complicated dataflows with many tasks that depend on each other. Basically, there are tasks (where you define computations) and targets (something that can "exist" - a file on disk, an entry in ZooKeeper, etc.). You ask Luigi to produce some target, and it creates a plan for achieving it. Luigi really shines when your workflow fits this model, but one step away and you are in trouble. For example, consider a simple pipeline: run an MR job and output temporary data, run another MR job and output final data, then clean up the temporary data. You could make a Clean target that depends on an MRJob2 target, which in turn depends on MRJob1, right? Not so easy. How do you check that the Clean task has been achieved? If you just test whether the temporary directory is empty, you catch both cases - when all tasks are done and when they haven't even started yet. Luigi does let you put all three actions - MRJob1, MRJob2, Clean - in a single run() method, but that ruins the entire idea.

And of course, both of these frameworks are optimized for standard MapReduce jobs, which is probably not what you want on a Spark mailing list :)

Experience with these frameworks, however, gave me some insights about typical data pipelines:

1. Pipelines are mostly linear. Oozie, Luigi and a number of other frameworks allow branching, but most pipelines actually consist of moving data from a source to a destination, with possibly some transformations in between (I'd be glad to hear about use cases where you really need branching).
2. Transactional logic is important. Either everything runs, or nothing does; otherwise it's really easy to get into an inconsistent state.
3. Extensibility is important. You never know what you will need in a week or two.

So eventually I decided that it is much easier to create your own pipeline than to try to adapt your code to an existing framework. My latest pipeline incarnation simply consists of a list of steps that are started sequentially. Each step is a class with at least these methods:

 * run() - launch this step
 * fail() - what to do if this step fails
 * finalize() - (optional) what to do when all steps are done

For example, if you want to add the ability to run Spark jobs, you just create a SparkStep and configure it with the required code. If you want a Hive query, you create a HiveStep and configure it with Hive connection settings. I use a YAML file to configure the steps, and a Context (basically a Map[String, Any]) to pass variables between them. I also use a configurable Reporter, available to all steps, to report progress.
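
To make that concrete, here is a rough Scala sketch of such a step abstraction. The names (Step, Context, Reporter, SparkStep, Pipeline) just follow the description above and are otherwise made up - this illustrates the idea rather than any existing framework, and the YAML loading and Hive step are left out:

    import scala.collection.mutable
    import scala.util.{Failure, Success, Try}
    import org.apache.spark.SparkContext

    // Shared state passed between steps (basically a Map[String, Any]).
    class Context {
      private val values = mutable.Map.empty[String, Any]
      def put(key: String, value: Any): Unit = values.update(key, value)
      def get[T](key: String): T = values(key).asInstanceOf[T]
    }

    // Progress reporting available to all steps.
    trait Reporter {
      def report(msg: String): Unit
    }

    trait Step {
      def run(ctx: Context, reporter: Reporter): Unit       // launch this step
      def fail(ctx: Context, error: Throwable): Unit = ()   // what to do if this step fails
      def finish(ctx: Context): Unit = ()                   // the optional "finalize": runs when all steps are done
    }

    // Example step type: runs arbitrary Spark code against a SparkContext stored in the Context.
    class SparkStep(name: String)(body: SparkContext => Unit) extends Step {
      def run(ctx: Context, reporter: Reporter): Unit = {
        reporter.report(s"running Spark step: $name")
        body(ctx.get[SparkContext]("sc"))
      }
    }

    object Pipeline {
      // Start steps sequentially; on the first failure call that step's fail() and stop.
      def run(steps: Seq[Step], ctx: Context, reporter: Reporter): Unit = {
        var failed = false
        for (step <- steps if !failed) {
          Try(step.run(ctx, reporter)) match {
            case Success(_) => ()
            case Failure(e) =>
              failed = true
              step.fail(ctx, e)
          }
        }
        if (!failed) steps.foreach(_.finish(ctx))   // finalize only if everything succeeded
      }
    }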

Hopefully this gives you some insight into the best pipeline for your specific case.



Re: Recommended pipeline automation tool? Oozie?

MLnick
You may look into the new Azkaban - which, while being quite heavyweight, is actually quite pleasant to use once set up.

You can run Spark jobs (spark-submit) using Azkaban shell commands and pass parameters between jobs. It supports dependencies, simple DAGs and scheduling with retries.

I'm digging deeper and it may be worthwhile extending it with a Spark job type...

It's probably best for mixed Hadoop / Spark clusters...

Re: Recommended pipeline automation tool? Oozie?

明风
We used Azkaban for a short time and suffered a lot. In the end we almost rewrote it completely. I really don't recommend it.


Re: Recommended pipeline automation tool? Oozie?

MLnick
Did you use "old" azkaban or azkaban 2.5? It has been completely rewritten.

Not saying it is the best but I found it way better than oozie for example.


Re: Recommended pipeline automation tool? Oozie?

Wei Tan
In reply to this post by k.tham
Just curious: how about using Scala to drive the workflow? I guess if you use other tools (Oozie, etc.) you lose the advantage of reading from an RDD -- you have to read from HDFS.
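
For what it's worth, a minimal sketch of what that could look like, assuming Jobs A and B can live in one driver program (Spark 1.x API). The paths, the filter/count logic and sendEmail are placeholders for whatever Job A, Job B and "email F" really are in the original question:

    import scala.util.{Failure, Success, Try}
    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.SparkContext._   // pair-RDD functions (reduceByKey) in Spark 1.x

    object Workflow {
      // Placeholder: wire this up to whatever alerting you actually use.
      def sendEmail(subject: String, body: String): Unit =
        println(s"EMAIL: $subject - $body")

      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("pipeline"))
        try {
          // "Job A": load and clean the data, and materialise it so its failures surface here.
          val jobA = Try {
            val cleaned = sc.textFile("hdfs:///input/raw").filter(_.nonEmpty).cache()
            cleaned.count()   // force evaluation
            cleaned
          }
          jobA match {
            case Failure(e) =>
              sendEmail("Job A failed", e.getMessage)   // "if Job A fails, send email F"
            case Success(cleaned) =>
              // "Job B" reuses Job A's cached RDD directly -- no intermediate files needed.
              val counts = cleaned.map(line => (line.split(",")(0), 1L)).reduceByKey(_ + _)
              counts.saveAsTextFile("hdfs:///output/counts")
              cleaned.unpersist()
          }
        } finally {
          sc.stop()
        }
      }
    }

The nice part is that Job B consumes Job A's cached RDD directly, which is exactly the advantage over HDFS round-trips mentioned above.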

Best regards,
Wei

---------------------------------
Wei Tan, PhD
Research Staff Member
IBM T. J. Watson Research Center
http://researcher.ibm.com/person/us-wtan



Re: Recommended pipeline automation tool? Oozie?

Li Pu
I like the idea of using Scala to drive the workflow. Spark already comes with a scheduler, so why not write a plugin to schedule other types of tasks (copy a file, send an email, etc.)? Scala could handle any logic required by the pipeline, and passing objects (including RDDs) between tasks is also easier. I don't know if this is an overuse of the Spark scheduler, but it sounds like a good tool. The only issue would be releasing resources that are no longer used after intermediate steps.
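
On the resource-release point, a tiny self-contained sketch (hypothetical paths, Spark 1.x API) of sharing one cached RDD between two downstream tasks and then unpersisting it once neither needs it:

    import org.apache.spark.{SparkConf, SparkContext}

    object ShareAndRelease {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("share-and-release"))
        // One intermediate RDD used by two downstream "tasks".
        val shared = sc.textFile("hdfs:///input/events").cache()
        val total = shared.count()
        shared.filter(_.contains("ERROR")).saveAsTextFile("hdfs:///output/errors")
        shared.unpersist()   // release the intermediate resource once both tasks are done
        println(s"processed $total events")
        sc.stop()
      }
    }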

--
Li
@vrilleup

Re: Recommended pipeline automation tool? Oozie?

Dean Wampler
In reply to this post by Andrei
If you're already using Scala for Spark programming and you hate Oozie XML as much as I do ;), you might check out Scoozie, a Scala DSL for Oozie: https://github.com/klout/scoozie


--
Dean Wampler, Ph.D.

registerAsTable can't be compiled

junius
In reply to this post by 明风
Hello,
I'm writing code to practice Spark SQL with the latest Spark version,
but I get the following compilation error; it seems the implicit conversion
from RDD to SchemaRDD doesn't work. If anybody can help me fix it,
thanks a lot.

value registerAsTable is not a member of
org.apache.spark.rdd.RDD[org.apache.spark.examples.mySparkExamples.Record]

Junius Zhou
b.r

Re: registerAsTable can't be compiled

Michael Armbrust
Can you provide the code? Is Record a case class, and is it defined as a top-level object? Also, have you done "import sqlContext._"?
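
For reference, a minimal sketch of the pattern those questions point at, assuming Spark 1.0.x; the Record fields, table name and data here are made up:

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.sql.SQLContext

    // Record should be a top-level case class (not defined inside a method),
    // as asked above, so the conversion to SchemaRDD can reflect on its fields.
    case class Record(key: Int, value: String)

    object SqlExample {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("sql-example"))
        val sqlContext = new SQLContext(sc)
        import sqlContext._   // brings createSchemaRDD into scope: RDD[Record] -> SchemaRDD

        val records = sc.parallelize(1 to 10).map(i => Record(i, s"val_$i"))
        records.registerAsTable("records")   // compiles once the implicit conversion is in scope
        sql("SELECT key, value FROM records WHERE key < 5").collect().foreach(println)
        sc.stop()
      }
    }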

