[Spark SQL]: Can complex Oracle views be created using Spark SQL


[Spark SQL]: Can complex Oracle views be created using Spark SQL

Gaurav Singh
Hi Team, 

We have lots of complex Oracle views (containing multiple tables, joins, analytical and aggregate functions, subqueries, etc.) and we are wondering if Spark can help us execute those views faster.

We also want to know whether those complex views can be implemented using Spark SQL.

Thanks and regards, 
Gaurav Singh 
+91 8600852256 


Re: [Spark SQL]: Can complex Oracle views be created using Spark SQL

Mich Talebzadeh
Hi Gaurav,

What version of Spark will you be using?

Have you tried a simple example of reading one of the views through a JDBC connection to Oracle yourself?

HTH




Re: [Spark SQL]: Can complex Oracle views be created using Spark SQL

Mich Talebzadeh
Hi,

I did some investigation on this and created a DataFrame on top of the underlying view in the Oracle database.

Let's assume that our Oracle view is just a normal view, as opposed to a materialized view, something like the one below, where both sales and costs are FACT tables:

CREATE OR REPLACE FORCE EDITIONABLE VIEW "SH"."PROFITS" ("CHANNEL_ID", "CUST_ID", "PROD_ID", "PROMO_ID", "TIME_ID", "UNIT_COST", "UNIT_PRICE", "AMOUNT_SOLD", "QUANTITY_SOLD", "TOTAL_COST") AS 
  SELECT
        s.channel_id,
        s.cust_id,
        s.prod_id,
        s.promo_id,
        s.time_id,
        c.unit_cost,
        c.unit_price,
        s.amount_sold,
        s.quantity_sold,
        c.unit_cost * s.quantity_sold TOTAL_COST
 FROM   costs c, sales s
 WHERE c.prod_id = s.prod_id
   AND c.time_id = s.time_id
   AND c.channel_id = s.channel_id
   AND c.promo_id = s.promo_id;

So it is a pretty simple view with a join on the sales and costs tables.

You would typically access this view in Spark with:

scala> val df = spark.read.format("jdbc").options(
     |        Map("url" -> _ORACLEserver,
     |        "dbtable" -> "(SELECT * FROM sh.profits)",  // dbtable could be on a view or any valid sql
     |        "user" -> _username,
     |        "password" -> _password)).load
df: org.apache.spark.sql.DataFrame = [CHANNEL_ID: decimal(38,10), CUST_ID: decimal(38,10) ... 8 more fields]

scala>

scala> df.printSchema()
root
 |-- CHANNEL_ID: decimal(38,10) (nullable = true)
 |-- CUST_ID: decimal(38,10) (nullable = true)
 |-- PROD_ID: decimal(38,10) (nullable = true)
 |-- PROMO_ID: decimal(38,10) (nullable = true)
 |-- TIME_ID: timestamp (nullable = true)
 |-- UNIT_COST: decimal(10,2) (nullable = true)
 |-- UNIT_PRICE: decimal(10,2) (nullable = true)
 |-- AMOUNT_SOLD: decimal(10,2) (nullable = true)
 |-- QUANTITY_SOLD: decimal(10,2) (nullable = true)
 |-- TOTAL_COST: decimal(38,10) (nullable = true)
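
As an aside, Oracle NUMBER columns without an explicit precision surface in Spark as decimal(38,10), as you can see for the ID columns above. If narrower types suit your downstream processing better, you can cast after the read. A minimal sketch (column names as above; the target types are just an assumption):

import org.apache.spark.sql.functions.col

// Cast the surrogate-key columns from decimal(38,10) to long (assumed to fit).
val typed = df
  .withColumn("CHANNEL_ID", col("CHANNEL_ID").cast("long"))
  .withColumn("CUST_ID", col("CUST_ID").cast("long"))
  .withColumn("PROD_ID", col("PROD_ID").cast("long"))
  .withColumn("PROMO_ID", col("PROMO_ID").cast("long"))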

If you run this, all Spark does is fetch the result set from Oracle; the optimisation happens within Oracle itself, and the results are returned to Spark.

However, if you use the underlying Oracle tables themselves (create DataFrames on top of them, here the costs and sales tables) and run the SQL code in Spark itself, you will get a more performant result.

scala> val sales = spark.read.format("jdbc").options(
     |        Map("url" -> _ORACLEserver,
     |        "dbtable" -> "(SELECT * FROM sh.sales)",
     |        "user" -> _username,
     |        "password" -> _password)).load
sales: org.apache.spark.sql.DataFrame = [PROD_ID: decimal(38,10), CUST_ID: decimal(38,10) ... 5 more fields]

scala> sales.createOrReplaceTempView("sales")

scala> val costs = spark.read.format("jdbc").options(
     |        Map("url" -> _ORACLEserver,
     |        "dbtable" -> "(SELECT * FROM sh.costs)",
     |        "user" -> _username,
     |        "password" -> _password)).load
costs: org.apache.spark.sql.DataFrame = [PROD_ID: decimal(38,10), TIME_ID: timestamp ... 4 more fields]

scala> costs.createOrReplaceTempView("costs")

scala> var sqltext = """
     | SELECT
     |         s.channel_id,
     |         s.cust_id,
     |         s.prod_id,
     |         s.promo_id,
     |         s.time_id,
     |         c.unit_cost,
     |         c.unit_price,
     |         s.amount_sold,
     |         s.quantity_sold,
     |         c.unit_cost * s.quantity_sold TOTAL_COST
     |  FROM   costs c, sales s
     |  WHERE c.prod_id = s.prod_id
     |    AND c.time_id = s.time_id
     |    AND c.channel_id = s.channel_id
     |    AND c.promo_id = s.promo_id
     | """
sqltext: String =
"
SELECT
        s.channel_id,
        s.cust_id,
        s.prod_id,
        s.promo_id,
        s.time_id,
        c.unit_cost,
        c.unit_price,
        s.amount_sold,
        s.quantity_sold,
        c.unit_cost * s.quantity_sold TOTAL_COST
 FROM   costs c, sales s
 WHERE c.prod_id = s.prod_id
   AND c.time_id = s.time_id
   AND c.channel_id = s.channel_id
   AND c.promo_id = s.promo_id
"

Then you can look at what the Spark optimiser is doing:

scala> spark.sql(sqltext).explain()
== Physical Plan ==
*(5) Project [channel_id#27, cust_id#25, prod_id#24, promo_id#28, time_id#26, unit_cost#42, unit_price#43, amount_sold#30, quantity_sold#29, CheckOverflow((promote_precision(unit_cost#42) * promote_precision(quantity_sold#29)), DecimalType(21,4), true) AS TOTAL_COST#50]
+- *(5) SortMergeJoin [prod_id#38, time_id#39, channel_id#41, promo_id#40], [prod_id#24, time_id#26, channel_id#27, promo_id#28], Inner
   :- *(2) Sort [prod_id#38 ASC NULLS FIRST, time_id#39 ASC NULLS FIRST, channel_id#41 ASC NULLS FIRST, promo_id#40 ASC NULLS FIRST], false, 0
   :  +- Exchange hashpartitioning(prod_id#38, time_id#39, channel_id#41, promo_id#40, 200), ENSURE_REQUIREMENTS, [id=#37]
   :     +- *(1) Scan JDBCRelation((SELECT * FROM sh.costs)) [numPartitions=1] [PROD_ID#38,TIME_ID#39,PROMO_ID#40,CHANNEL_ID#41,UNIT_COST#42,UNIT_PRICE#43] PushedFilters: [*IsNotNull(PROD_ID), *IsNotNull(TIME_ID), *IsNotNull(CHANNEL_ID), *IsNotNull(PROMO_ID)], ReadSchema: struct<PROD_ID:decimal(38,10),TIME_ID:timestamp,PROMO_ID:decimal(38,10),CHANNEL_ID:decimal(38,10)...
   +- *(4) Sort [prod_id#24 ASC NULLS FIRST, time_id#26 ASC NULLS FIRST, channel_id#27 ASC NULLS FIRST, promo_id#28 ASC NULLS FIRST], false, 0
      +- Exchange hashpartitioning(prod_id#24, time_id#26, channel_id#27, promo_id#28, 200), ENSURE_REQUIREMENTS, [id=#43]
         +- *(3) Scan JDBCRelation((SELECT * FROM sh.sales)) [numPartitions=1] [PROD_ID#24,CUST_ID#25,TIME_ID#26,CHANNEL_ID#27,PROMO_ID#28,QUANTITY_SOLD#29,AMOUNT_SOLD#30] PushedFilters: [*IsNotNull(PROD_ID), *IsNotNull(TIME_ID), *IsNotNull(CHANNEL_ID), *IsNotNull(PROMO_ID)], ReadSchema: struct<PROD_ID:decimal(38,10),CUST_ID:decimal(38,10),TIME_ID:timestamp,CHANNEL_ID:decimal(38,10),...
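
Note that both JDBC scans above show numPartitions=1, i.e. each table is pulled over a single connection. If the tables are large you can parallelise the read by partitioning on a numeric column. A sketch only; the partition column and the bounds below are placeholders and should be chosen from the actual data distribution:

// Hypothetical partitioned read of sh.sales: PROD_ID and the bounds are placeholders.
val salesParallel = spark.read.format("jdbc").options(
       Map("url" -> _ORACLEserver,
       "dbtable" -> "(SELECT * FROM sh.sales)",
       "user" -> _username,
       "password" -> _password,
       "partitionColumn" -> "PROD_ID",
       "lowerBound" -> "1",
       "upperBound" -> "200",
       "numPartitions" -> "8")).load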


So in summary, leave the data where it is (in Oracle) and do your SQL work within Spark. Your mileage may vary, as some Oracle-specific code may not be possible in Spark SQL.
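
On the original question about analytical and aggregate functions: Spark SQL supports window functions, so most Oracle analytic clauses have a direct equivalent. A small sketch against the sales temp view registered above (the columns chosen are just for illustration):

// Running total and rank per customer, analogous to Oracle analytic functions.
val windowed = spark.sql("""
  SELECT cust_id,
         time_id,
         amount_sold,
         SUM(amount_sold) OVER (PARTITION BY cust_id ORDER BY time_id) AS running_total,
         RANK() OVER (PARTITION BY cust_id ORDER BY amount_sold DESC) AS sale_rank
  FROM   sales
""")
windowed.show(5)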


HTH

