Spark Views Functioning

Kushagra Deep
Hi all,

I just wanted to know: when we call createOrReplaceTempView on a Spark Dataset, where does the view reside? Does all the data come to the driver, where the view is created? Or does each executor hold a part of the view (based on the data it has), so that when we query the view, the query runs on each executor's portion of the data?



Re: Spark Views Functioning

Mich Talebzadeh

As a first guess, where do you think this view is created in a distributed environment?

The whole purpose is fast access to this temporary storage (shared among the executors in this job), and that storage is only materialised after an action is performed.

scala> val sales = spark.read.format("jdbc").options(
     |        Map("url" -> _ORACLEserver,
     |        "dbtable" -> "(SELECT * FROM sh.sales)",
     |        "user" -> _username,
     |        "password" -> _password)).load
sales: org.apache.spark.sql.DataFrame = [PROD_ID: decimal(38,10), CUST_ID: decimal(38,10) ... 5 more fields]

scala> sales.createOrReplaceTempView("sales")

scala> spark.sql("select count(1) from sales").show
+--------+
|count(1)|
+--------+
|  918843|
+--------+

HTH




Disclaimer: Use it at your own risk. Any and all responsibility for any loss, damage or destruction of data or any other property which may arise from relying on this email's technical content is explicitly disclaimed. The author will in no case be liable for any monetary damages arising from such loss, damage or destruction.

 





Re: Spark Views Functioning

srowen
Views are simply bookkeeping about how the query is executed, like a DataFrame. There is no data or result to store; it's just how to run a query. The views exist on the driver. The query executes like any other, on the cluster.
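One way to see this for yourself (a sketch in a spark-shell session; the DataFrame and view names here are illustrative, not from the thread):

scala> val df = spark.range(10).toDF("n")

scala> df.createOrReplaceTempView("tmp_n")

scala> // The view is just an entry in the driver-side session catalog;
scala> // temp views appear with isTemporary = true and no database:
scala> spark.catalog.listTables.show

scala> // Querying the view produces the same plan as querying the DataFrame directly:
scala> spark.sql("select * from tmp_n").explain

Nothing is cached or shipped to executors by createOrReplaceTempView itself; data only moves when an action (count, show, write, ...) triggers execution on the cluster.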




Re: Spark Views Functioning

Mich Talebzadeh
My view is that temporary views (createOrReplaceTempView, or its predecessor registerTempTable) are created in driver memory. The DAG states:

scala> val sales = spark.read.format("jdbc").options(
     |        Map("url" -> _ORACLEserver,
     |        "dbtable" -> "(SELECT * FROM sh.sales)",
     |        "user" -> _username,
     |        "password" -> _password)).load
sales: org.apache.spark.sql.DataFrame = [PROD_ID: decimal(38,10), CUST_ID: decimal(38,10) ... 5 more fields]

scala> sales.createOrReplaceTempView("sales")


Execute CreateViewCommand
== Physical Plan ==
Execute CreateViewCommand (1)
   +- CreateViewCommand (2)
         +- LogicalRelation (3)


(1) Execute CreateViewCommand
Output: []

(2) CreateViewCommand
Arguments: `tmp`, false, true, LocalTempView

(3) LogicalRelation
Arguments: JDBCRelation((SELECT * FROM sh.sales)) [numPartitions=1], [PROD_ID#24, CUST_ID#25, TIME_ID#26, CHANNEL_ID#27, PROMO_ID#28, QUANTITY_SOLD#29, AMOUNT_SOLD#30], false


So behind the scenes you are still working on the DataFrame itself?
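One way to check that (a sketch, continuing the same spark-shell session as above; output elided):

scala> // Querying the view compiles to the same JDBC scan as querying the DataFrame:
scala> spark.sql("SELECT * FROM sales").explain
scala> sales.explain

scala> // Dropping the view removes only the name from the session catalog;
scala> // the sales DataFrame itself is unaffected and can still be queried:
scala> spark.catalog.dropTempView("sales")

In other words, the view is just a name bound to the DataFrame's logical plan in the driver-side catalog; queries against it execute on the cluster like any other.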


