Spark hive build and connectivity


Spark hive build and connectivity

ravishankar
Hello all,
I am trying to understand how the Spark SQL integration with Hive works. Whenever I build Spark with the -Phive -Phive-thriftserver options, I see that it is packaged with hive-2.3.7*.jars and spark-hive*.jars, yet the documentation claims that Spark can talk to different versions of Hive. If that is the case, what should I do if I have Hive 3.2.1 running on my instance and I want my Spark application to talk to that Hive cluster?

Does this mean I have to build Spark against Hive 3.2.1, or, as the documentation suggests, is it enough to just point Spark at the metastore jars via spark-defaults.conf?

Should I add my Hive 3.2.1 lib directory to SPARK_DIST_CLASSPATH as well? Will there be conflicts between the Hive 2.3.7 jars and the Hive 3.2.1 jars in that case?
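
For reference, the spark-defaults.conf entries in question would look roughly like the sketch below; the jars path is a placeholder, and the version has to be one the Spark documentation lists as supported.

spark.sql.hive.metastore.version   3.1.2
spark.sql.hive.metastore.jars      /opt/hive/lib/*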


Thanks ! 

Re: Spark hive build and connectivity

Mich Talebzadeh
Hi Ravi,

What exactly are you trying to do?

Do you want to enhance Spark SQL, or do you want to run Hive on the Spark engine?

HTH







Re: Spark hive build and connectivity

ravishankar
Hello Mich,
I am just trying to access Hive tables in my Hive 3.2.1 cluster from Spark. Basically, I want my Spark jobs to be able to access these Hive tables, and I want to understand how Spark jobs interact with Hive to do that.

- I see that whenever I build Spark with Hive support (-Phive -Phive-thriftserver), it gets built with the Hive 2.3.7 jars. So will it be OK to access tables created using my Hive 3.2.1 cluster?
- Do I have to add the Hive 3.2.1 jars to Spark's SPARK_DIST_CLASSPATH?




Re: Spark hive build and connectivity

Artemis User
In reply to this post by ravishankar
By default, Spark builds with Hive 2.3.7, according to the Spark build doc. If you want to replace it with a different Hive jar, you need to change the Maven pom.xml file.

-- ND


Re: Spark hive build and connectivity

Mich Talebzadeh
In reply to this post by ravishankar
Hi,

To access Hive tables, Spark uses its native API as below (the default), where you have set up:

ls -ltr $SPARK_HOME/conf
hive-site.xml -> /data6/hduser/hive-3.0.0/conf/hive-site.xml

// Create a HiveContext from the existing SparkContext (Spark 1.x style API;
// SparkSession with enableHiveSupport() is the Spark 2+ equivalent)
val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc)
hiveContext.sql("use ilayer")                              // switch to the target database
val account_table = hiveContext.table("joint_accounts")    // account_table is a DataFrame

Or you can access any version of Hive on any host using a JDBC connection.

Example using the Cloudera drivers (the only ones that work, I think):

driver: com.cloudera.hive.jdbc41.HS2Driver

Connection URL: jdbc:hive2://rhes75:10099  ## Hive thrift server port
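
As a rough sketch (not verified here), the same driver and URL could also be used from Spark itself through the generic JDBC data source, assuming the Cloudera driver jar is on the driver and executor classpath; the host, port, database and table below are the ones from the example above.

import java.util.Properties
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("hive-jdbc-read").getOrCreate()

val props = new Properties()
props.setProperty("driver", "com.cloudera.hive.jdbc41.HS2Driver")  // HiveServer2 JDBC driver class

// Read a Hive table through HiveServer2 over JDBC
val accounts = spark.read.jdbc("jdbc:hive2://rhes75:10099", "ilayer.joint_accounts", props)
accounts.show(5)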

HTH




Re: Spark hive build and connectivity

Kimahriman
I have always been a little confused about the different Hive-version integration as well. To expand on this question, we have a Hive 3.1.1 metastore that we can successfully interact with using the -Phive profile with Hive 2.3.7. We do not use the Hive 3.1.1 jars anywhere in our Spark applications. Are we just lucky that the 2.3.7 jars are compatible with the 3.1.1 metastore for our use cases? Or are the `spark.sql.hive.metastore.jars` only used if you are making a direct JDBC connection and acting as the metastore?

Also, FWIW, the documentation only claims compatibility up to Hive version 3.1.2. Not sure if there are any breaking changes in 3.2 and beyond.




Re: Spark hive build and connectivity

ravishankar
Thanks! I have a very similar setup. I have built Spark with -Phive, which includes the hive-2.3.7 jars, spark-hive* jars and some hadoop-common* jars.

At runtime, I set SPARK_DIST_CLASSPATH=$(hadoop classpath)

and set spark.sql.hive.metastore.version and spark.sql.hive.metastore.jars, pointing the latter at $HIVE_HOME/lib/*.

With this, I am able to read from and write to Hive successfully from my Spark jobs. So my question and doubt is the same as yours: is it just working by chance? How and when does Spark use the hive-2.3.7* jars as opposed to the metastore jars?

What if my Hive tables use custom SerDes and functions from my Hive 3.x cluster? How will Spark be able to use them at runtime? Hope someone has a clear understanding of how Spark works with Hive.
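
For what it's worth, a minimal sketch of how extra SerDe/UDF jars are typically made visible to a Hive-enabled Spark session is shown below; the jar path, function name, class name and column name are hypothetical placeholders.

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("hive-serde-udf-test")
  .enableHiveSupport()
  .getOrCreate()

// Make the custom SerDe/UDF jar visible to this session (placeholder path)
spark.sql("ADD JAR /path/to/custom-serde.jar")

// Register a Hive UDF by its implementation class (placeholder class name)
spark.sql("CREATE TEMPORARY FUNCTION my_upper AS 'com.example.hive.MyUpperUDF'")

// Use it like any other function (placeholder column name)
spark.sql("SELECT my_upper(account_name) FROM ilayer.joint_accounts").show(5)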


Re: Spark hive build and connectivity

hopefulnick
In reply to this post by ravishankar
For compatibility, it's recommended to either:
- use a compatible version of Hive, or
- build Spark without Hive and configure Hive to use Spark.

Here is the way to build Spark with a custom Hive; it worked for me and I hope it is helpful to you: Hive on Spark
