[SPARK SQL] Difference between 'Hive on spark' and Spark SQL


luby
Hi all,

We are starting to migrate our data to the Hadoop platform, hoping to use 'Big Data' technologies to
improve our business.

We are new to this area and would appreciate your help.

Currently all our data is stored in Hive, and some complicated SQL queries are run daily.

We want to improve the performance of these queries and have two options at hand:
a. Turn on the 'Hive on Spark' feature and run the HQL queries, or
b. Run those queries with Spark SQL.

What is the difference between these options?

Another question:
Hive has a setting, 'hive.optimize.ppd', that enables the 'predicate pushdown' query optimization.
Is there an equivalent option in Spark SQL, or does the same setting also work for Spark SQL?

Thanks in advance,

Boying



   

Re: [SPARK SQL] Difference between 'Hive on spark' and Spark SQL

Jörn Franke
If you already have a lot of queries, it makes sense to look at Hive (in a recent version) + Tez + LLAP, with all tables in ORC format, partitioned and sorted on the filter columns. That would be the easiest way and can improve performance significantly.
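
To illustrate why sorting on the filter columns matters, here is a minimal toy sketch in plain Python. It is not ORC or Hive code: the Stripe class, the stripe size, and the day-number column are assumptions made up for the illustration. It only models the idea that a reader keeping min/max statistics per block of rows can skip any block whose range cannot contain the filter value.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Stripe:
    """Toy stand-in for a block of rows; holds only the filter column."""
    values: List[int]

    @property
    def min_value(self) -> int:
        return min(self.values)

    @property
    def max_value(self) -> int:
        return max(self.values)


def make_stripes(values: List[int], stripe_size: int) -> List[Stripe]:
    """Split the column into consecutive blocks of stripe_size rows."""
    return [
        Stripe(values[i:i + stripe_size])
        for i in range(0, len(values), stripe_size)
    ]


def stripes_to_read(stripes: List[Stripe], wanted: int) -> int:
    """Count the stripes whose [min, max] range could contain `wanted`."""
    return sum(1 for s in stripes if s.min_value <= wanted <= s.max_value)


# Hypothetical filter column (a day number), written in interleaved order:
# the values 1..8 repeated four times, 32 rows in total.
unsorted_values = [day for _ in range(4) for day in range(1, 9)]
sorted_values = sorted(unsorted_values)

# Filter "day = 3" against four stripes of eight rows each.
unsorted_hits = stripes_to_read(make_stripes(unsorted_values, 8), wanted=3)
sorted_hits = stripes_to_read(make_stripes(sorted_values, 8), wanted=3)
```

With the interleaved data every stripe spans the whole value range, so all four have to be read; after sorting, only one of the four stripes can contain the value.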

If you want to use Spark, e.g. because you want to use additional features and it could become part of your strategy, which would justify the investment:
* Hive on Spark: I don't think it is used as much as the combination above. I am also not sure whether it supports recent Spark versions and all Hive features. It also would not really let you use Spark features beyond what Hive offers. Basically, you just set a different execution engine in Hive and run the queries as you do now.
* Spark SQL (spark.sql): you would have to write all your Hive queries as Spark queries and potentially integrate or rewrite Hive UDFs. Given that you can use HiveContext to execute queries, rewriting them may not require much effort. Predicate pushdown is available in Spark. You have to write Spark programs to execute the queries. There are some servers that you can connect to and send SQL queries, but their maturity varies.
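
A minimal sketch of the Spark SQL route in PySpark, under stated assumptions: the table `sales`, its columns, and the application name are hypothetical placeholders, and in Spark 2.x a SparkSession with Hive support takes the place of HiveContext. As far as I know, Spark SQL's optimizer performs predicate pushdown on its own and does not read 'hive.optimize.ppd'; the two `spark.sql.*.filterPushdown` settings below control pushdown into the ORC and Parquet readers, and their defaults depend on the Spark version.

```python
# Settings that control filter pushdown into the ORC and Parquet readers.
PUSHDOWN_CONF = {
    "spark.sql.orc.filterPushdown": "true",
    "spark.sql.parquet.filterPushdown": "true",
}

# Hypothetical daily query; table and column names are placeholders.
DAILY_QUERY = """
SELECT sale_date, SUM(amount) AS total
FROM sales
WHERE sale_date = '2018-12-20'
GROUP BY sale_date
"""


def build_session(app_name="daily-hive-queries"):
    """Create a SparkSession that can read tables from the Hive metastore."""
    # Imported lazily so this module loads even where pyspark is missing.
    from pyspark.sql import SparkSession

    builder = SparkSession.builder.appName(app_name).enableHiveSupport()
    for key, value in PUSHDOWN_CONF.items():
        builder = builder.config(key, value)
    return builder.getOrCreate()


def run_daily_query(spark, query=DAILY_QUERY):
    """Run the query through Spark SQL and return the resulting DataFrame."""
    df = spark.sql(query)
    df.explain()  # prints the physical plan, where pushed filters are listed
    return df
```

Typical use would be `spark = build_session()` followed by `run_daily_query(spark).show()`, submitted as a Spark job. Many existing HQL statements can be passed to `spark.sql` unchanged, but each one still needs to be checked.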

In the end, you have to assess all your queries and investigate whether they can be executed with either of these options.

(In reply to the message sent by [hidden email] on 20.12.2018 at 08:17.)