Connecting to Hive on-premise from Spark in Cloud using JDBC driver for Hive

Mich Talebzadeh
Hi,

This is for information.

I have seen some threads on Stack Overflow about issues accessing Hive remotely, that is, without relying on locality (Spark and Hive on the same Hadoop cluster) or on hive-site.xml under $SPARK_HOME/conf.

The co-located setup works fine. However, challenges arise when accessing Hive on-premise from the Cloud through PySpark etc.

I worked on it a while back and today had to revisit it. The problem appears to be that the classic JDBC drivers shipped with Hive treat Hive tables like a CSV file and return only the header. The telltale sign is that df.printSchema() displays <table_name>.<column_name> rather than the column name only.
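
A quick way to spot this symptom is to check whether the column names carry the table-name prefix. The helper below is only a minimal sketch: the function name and the sample column names are made up for illustration, and it inspects a plain list of names such as the one df.columns returns.

```python
def has_table_prefix(columns, table_name):
    """Return True if every column name looks like '<table_name>.<column_name>'."""
    prefix = table_name + "."
    return len(columns) > 0 and all(c.startswith(prefix) for c in columns)


# Hypothetical column lists, as df.columns might return them
print(has_table_prefix(["house.id", "house.price"], "house"))  # True
print(has_table_prefix(["id", "price"], "house"))              # False
```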

I tried all sorts of vendor drivers, and the only one that seems to work is the Cloudera-supplied driver for Hive. The driver class is com.cloudera.hive.jdbc41.HS2Driver, and you need a Hive connection URL, in my case jdbc:hive2://<HOST_NAME>:10099/default

The jar file is called HiveJDBC41.jar and all you need to do is to put it under $SPARK_HOME/jars, nothing else.

To read the table:

    jdbcHive = spark.read. \
    format("jdbc"). \
    option("url", config['hiveVariables']['hive_url']). \
    option("dbtable", fullyQualifiedTableName). \
    option("user", config['hiveVariables']['hive_user']). \
    option("password", config['hiveVariables']['hive_password']). \
    option("driver", config['hiveVariables']['hive_driver']). \
    option("fetchsize", "1000"). \
    load()

fullyQualifiedTableName is composed of <hive_database>.<hive_table>. I have provided a generous fetchsize; anything >= 20 should do.
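
For reference, the read above expects a config dictionary along the following lines. This is a sketch under assumptions: the key names are taken from the snippet, the driver class and URL pattern from this post, and everything in angle brackets is a placeholder. The helper fully_qualified_table_name is a hypothetical name introduced here for illustration.

```python
# Hypothetical layout of the config dictionary used by the read above.
# Host, user, password, database and table names are placeholders.
config = {
    "hiveVariables": {
        "hive_driver": "com.cloudera.hive.jdbc41.HS2Driver",
        "hive_url": "jdbc:hive2://<HOST_NAME>:10099/default",
        "hive_user": "<hive_user>",
        "hive_password": "<hive_password>",
        "fetchsize": "1000",
    }
}


def fully_qualified_table_name(database, table):
    """Build the '<hive_database>.<hive_table>' string passed as dbtable."""
    return f"{database}.{table}"


fullyQualifiedTableName = fully_qualified_table_name("<hive_database>", "<hive_table>")
print(fullyQualifiedTableName)
```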

HTH



LinkedIn  https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw

 



Disclaimer: Use it at your own risk. Any and all responsibility for any loss, damage or destruction of data or any other property which may arise from relying on this email's technical content is explicitly disclaimed. The author will in no case be liable for any monetary damages arising from such loss, damage or destruction.

 

Re: Connecting to Hive on-premise from Spark in Cloud using JDBC driver for Hive

badrinath patchikolla
Hi Mich,

Is there any way to connect to Hive over JDBC with Kerberos
authentication from Spark JDBC?

https://docs.cloudera.com/runtime/7.2.2/securing-hive/topics/hive_remote_data_access.html


Thanks,
Badrinath



--
Sent from: http://apache-spark-user-list.1001560.n3.nabble.com/


Re: Connecting to Hive on-premise from Spark in Cloud using JDBC driver for Hive

Mich Talebzadeh
Hi Badrinath,

This is a very valid question.

Getting a ticket before being authorised is clearly not going to work here, as authentication of that nature applies to an environment where Hive and Spark co-exist. So the question becomes how we can authenticate a connection to the remote Hive with beeline.

If you look at the Spark connection syntax, we have:

import sys

# spark (the SparkSession) and config are assumed to be defined elsewhere
def loadTableFromHiveJDBC(dataFrame, tableName):
    # dataFrame is not used in the body; kept so the signature is unchanged
    try:
        house_df = spark.read. \
            format("jdbc"). \
            option("url", config['hiveVariables']['hive_url']). \
            option("dbtable", tableName). \
            option("user", config['hiveVariables']['hive_user']). \
            option("password", config['hiveVariables']['hive_password']). \
            option("driver", config['hiveVariables']['hive_driver']). \
            option("fetchsize", config['hiveVariables']['fetchsize']). \
            load()
        return house_df
    except Exception as e:
        # Report the error and stop the job
        print(f"{e}, quitting")
        sys.exit(1)

With Kerberos, we would no longer require option("user", ...) and option("password", ...).

In its simplest form, the above is equivalent to the Thrift connection below:

 beeline -u jdbc:hive2://HOST:PORT/default -d org.apache.hive.jdbc.HiveDriver -n hduser -p hduser

In other words:

beeline -u <URL> -d <driver> -n <username> -p <password>

The link you provided states:

beeline -u "jdbc:hive2://HOST:PORT/default;principal=hive/[hidden email]"


So we need to carry the authentication through hive_url, and it needs to be valid in the remote environment where Hive is running!
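
As an untested sketch of that idea, the helper below assembles the Spark JDBC options as a plain dictionary: when a principal is given it is appended to the URL in the ";principal=..." form from the linked page and the user/password options are dropped. The function name hive_jdbc_options, the sample values and the principal placeholder are all assumptions made for illustration, and vendor drivers may expect different URL property names.

```python
def hive_jdbc_options(hive_vars, table_name, principal=None):
    """Assemble Spark JDBC reader options as a plain dict.

    With a principal, it is appended to the URL and user/password are left out;
    without one, the user/password options are kept.
    """
    options = {
        "url": hive_vars["hive_url"],
        "dbtable": table_name,
        "driver": hive_vars["hive_driver"],
        "fetchsize": hive_vars["fetchsize"],
    }
    if principal:
        options["url"] = f"{hive_vars['hive_url']};principal={principal}"
    else:
        options["user"] = hive_vars["hive_user"]
        options["password"] = hive_vars["hive_password"]
    return options


# Placeholder values only; the principal below is hypothetical
hive_vars = {
    "hive_url": "jdbc:hive2://HOST:PORT/default",
    "hive_driver": "org.apache.hive.jdbc.HiveDriver",
    "hive_user": "hduser",
    "hive_password": "hduser",
    "fetchsize": "1000",
}
opts = hive_jdbc_options(hive_vars, "test.house", principal="hive/<host>@<REALM>")
print(opts["url"])
# With a SparkSession available: spark.read.format("jdbc").options(**opts).load()
```

Whether the connection actually authenticates depends on the remote Hive environment and on the driver in use, so treat this as a starting point for testing rather than a confirmed recipe.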


Do you have a Kerberized Hive that you can test this against, assuming you have all the details for the principal?


Thanks



On Thu, 28 Jan 2021 at 13:30, badrinath patchikolla <[hidden email]> wrote:
Hi Mich,

Is there any possible way to connect Hive JDBC through Kerberos
Authentication Type in Spark JDBC?

https://docs.cloudera.com/runtime/7.2.2/securing-hive/topics/hive_remote_data_access.html.


Thanks,
Badrinath