Hive using Spark engine vs native spark with hive integration.


Hive using Spark engine vs native spark with hive integration.

Manu Jacob

Hi All,

Not sure if I should ask this question on the Spark community or the Hive community.

We have a set of Hive scripts that run on EMR (Tez engine). We would like to experiment with moving some of them onto Spark. We are planning to experiment with two options:

  1. Use the current HQL code as-is, with the execution engine set to Spark (Hive on Spark).
  2. Write pure Spark code in Scala/Python using Spark SQL with Hive integration.
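For reference, option 1 amounts to a one-line change at the top of the existing scripts. This is only a sketch, and it assumes the Hive build on the cluster actually ships the Spark execution engine, which not every distribution does:

```sql
-- Option 1 sketch: the HQL body stays unchanged; only the engine is switched.
SET hive.execution.engine=spark;
-- ... existing HQL statements follow as-is ...
```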

The first approach helps us transition to Spark quickly, but I am not sure it is the best approach in terms of performance. I could not find any reasonable comparison of these two approaches. It looks like writing pure Spark code gives us more control to add logic and also to tune some of the performance features, for example caching/eviction.

Any advice on this is much appreciated.

Thanks,

-Manu


Re: Hive using Spark engine vs native spark with hive integration.

rmartine
My 2 cents is that this is a complicated question, since I'm not confident that Spark is 100% compatible with Hive in terms of query language. I have an unanswered question on this list about exactly that.


One important thing to check is whether you are using objects that are supported in both Hive and Spark. One example is the lack of support for materialized views in Spark: https://issues.apache.org/jira/browse/SPARK-29038

With that being said, I'd recommend going with option 2, as this will force your code to use what Spark offers.

Hope that helps.

On Tue, Oct 6, 2020 at 1:14 PM Manu Jacob <[hidden email]> wrote:


--

Ricardo Martinelli De Oliveira

Data Engineer, AI CoE

Red Hat Brazil

Av. Brigadeiro Faria Lima, 3900

8th floor

[hidden email]    T: +551135426125
M: +5511970696531


Re: Hive using Spark engine vs native spark with hive integration.

Patrick McCarthy-2
In reply to this post by Manu Jacob
I think a lot will depend on what the scripts do. I've seen some legacy Hive scripts that were written in an awkward way (e.g. lots of subqueries, nested explodes) because, pre-Spark, that was the only way to express certain logic. For fairly straightforward operations I expect Catalyst would reduce both to similar plans.

On Tue, Oct 6, 2020 at 12:07 PM Manu Jacob <[hidden email]> wrote:


--

Patrick McCarthy 

Senior Data Scientist, Machine Learning Engineering

Dstillery

470 Park Ave South, 17th Floor, NYC 10016