Let multiple jobs share one rdd?

Gang Li
Hi all,

There are three jobs that all start from the same first RDD. Can that first
RDD be computed only once, with the subsequent operations then running in
parallel?

<http://apache-spark-user-list.1001560.n3.nabble.com/file/t11001/1600954813075-image.png>

My code is as follows:

import threading

from pyspark import StorageLevel

# `spark` is an existing SparkSession.

sqls = ["""
        INSERT OVERWRITE TABLE `spark_input_test3` PARTITION (dt='20200917')
        SELECT id, plan_id, status, retry_times, start_time, schedule_datekey
        FROM temp_table WHERE status = 3""",
        """
        INSERT OVERWRITE TABLE `spark_input_test4` PARTITION (dt='20200917')
        SELECT id, cur_inst_id, status, update_time, schedule_time, task_name
        FROM temp_table WHERE schedule_time > '2020-09-01 00:00:00'
        """]

def multi_thread():
    # Shared source query; the result is persisted so it can be reused
    # by both INSERT OVERWRITE jobs.
    sql = """SELECT id, plan_id, status, retry_times, start_time,
        schedule_datekey, task_name, update_time, schedule_time,
        cur_inst_id, scheduler_id
        FROM table
        WHERE dt < '20200801'"""
    data = spark.sql(sql).persist(StorageLevel.MEMORY_AND_DISK)
    threads = []
    for i in range(2):
        try:
            # One thread per INSERT statement, both reading the shared data.
            t = threading.Thread(target=insert_overwrite_thread,
                                 args=(sqls[i], data))
            t.start()
            threads.append(t)
        except Exception as x:
            print(x)
    for t in threads:
        t.join()

def insert_overwrite_thread(sql, data):
    # Register the persisted DataFrame so the INSERT statements can
    # reference it as temp_table.
    data.createOrReplaceTempView('temp_table')
    spark.sql(sql)



Since Spark evaluates lazily, the persisted first RDD still ends up being
computed multiple times when the jobs are submitted in parallel.
I would like to ask whether there are other ways to avoid this. Thanks!
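
For context on why that happens: persist() is itself lazy, so the cache is
only filled by the first action that touches the DataFrame, and threads that
start before any action has run will each trigger the full source query. A
minimal sketch of one possible workaround, assuming the same spark session
and query as in multi_thread() above, is to run an eager action before
starting the threads:

    # Sketch only: materialize the shared result once, before the threads start.
    data = spark.sql(sql).persist(StorageLevel.MEMORY_AND_DISK)
    data.count()  # eager action: runs the source query once and fills the cache
    data.createOrReplaceTempView('temp_table')  # register the shared view once

    # Each thread then only needs to run spark.sql(insert_sql); both inserts
    # scan the cached data instead of recomputing the source query.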

Cheers,
Gang Li




Re: Let multiple jobs share one rdd?

Khalid Mammadov
Perhaps you can use Global Temp Views?

https://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.DataFrame.createGlobalTempView
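
A minimal sketch of how that might be wired in, assuming the same spark
session, source query, and insert statements as in the post above (global
temp views are referenced through the global_temp database):

    # Sketch, reusing the names from the original post.
    data = spark.sql(sql).persist(StorageLevel.MEMORY_AND_DISK)
    data.createGlobalTempView('temp_table')  # visible across sessions as global_temp.temp_table

    spark.sql("""
        INSERT OVERWRITE TABLE `spark_input_test3` PARTITION (dt='20200917')
        SELECT id, plan_id, status, retry_times, start_time, schedule_datekey
        FROM global_temp.temp_table WHERE status = 3
    """)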


