Apache Spark and Airflow connection

Uğur Sopaoğlu
I have a Docker-based cluster. In this cluster, I try to schedule Spark jobs with Airflow; Airflow and Spark run in separate containers. However, I cannot run a Spark job through Airflow.

The code below is my Airflow script:

from airflow import DAG
from airflow.contrib.operators.spark_submit_operator import SparkSubmitOperator
from datetime import datetime, timedelta


args = {'owner': 'airflow', 'start_date': datetime(2018, 7, 31)}

dag = DAG('spark_example_new', default_args=args, schedule_interval="@once")

# Submit SimpleSpark.jar through the connection configured as spark_default.
operator = SparkSubmitOperator(
    task_id='spark_submit_job',
    conn_id='spark_default',
    java_class='Main',
    application='/SimpleSpark.jar',
    name='airflow-spark-example',
    dag=dag,
)

I also configured spark_default in the Airflow UI:

[Screenshot: spark_default connection settings in the Airflow UI]
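
For reference, here is a rough sketch of what I set, expressed in code instead of the UI. The host "spark-master" and port 7077 are placeholders for my container's host name and port; the actual values may differ.

from airflow import settings
from airflow.models import Connection

# Placeholder values: "spark-master" is the Spark master container's host
# name and 7077 its port; adjust to match the actual cluster.
spark_conn = Connection(
    conn_id='spark_default',
    conn_type='spark',
    host='spark://spark-master',
    port=7077,
)

# Register the connection in the Airflow metadata database.
session = settings.Session()
session.add(spark_conn)
session.commit()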


However, it produces the following error:

[Errno 2] No such file or directory: 'spark-submit': 'spark-submit'
I think Airflow tries to run the Spark job on its own. How can I configure it so that the Spark code runs on the Spark master?
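
As far as I understand, SparkSubmitOperator shells out to a local spark-submit binary, so I also ran a small sanity check inside the Airflow container (not part of the DAG) to see whether the binary is available there at all:

import shutil

# SparkSubmitOperator ultimately calls spark-submit as a subprocess, so the
# binary must be on PATH in the container where the task runs.
if shutil.which('spark-submit') is None:
    print('spark-submit is not on PATH in this container')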

--
Uğur Sopaoğlu