Accessing Hive Database (On Hadoop) using Spark


Accessing Hive Database (On Hadoop) using Spark

Rishikesh Gawade
Hello there. I am a newbie in the world of Spark, and I have been working on a Spark project using Java.
I have configured Hive and Spark to run on Hadoop.
So far I have created a Hive database (with a Derby-backed metastore) on HDFS at the warehouse location /user/hive/warehouse; the database is named spam (stored as spam.db at that location).
I have been trying to read the tables in this database from Spark to create RDDs/DataFrames.
Could anybody please guide me on how I can achieve this?
I used the following statements in my Java code:

import org.apache.spark.sql.SparkSession;

SparkSession spark = SparkSession
        .builder()
        .appName("Java Spark Hive Example")
        .master("yarn")
        .config("spark.sql.warehouse.dir", "/user/hive/warehouse")
        .enableHiveSupport()
        .getOrCreate();

spark.sql("USE spam");
spark.sql("SELECT * FROM spamdataset").show();
After this, I built the project with Maven (mvn clean package -DskipTests), and a JAR was generated.

I then tried running the project via the spark-submit CLI:

spark-submit --class com.adbms.SpamFilter --master yarn ~/IdeaProjects/mlproject/target/mlproject-1.0-SNAPSHOT.jar

and got the following error:

Exception in thread "main" org.apache.spark.sql.catalyst.analysis.NoSuchDatabaseException: Database 'spam' not found;
at org.apache.spark.sql.catalyst.catalog.SessionCatalog.org$apache$spark$sql$catalyst$catalog$SessionCatalog$$requireDbExists(SessionCatalog.scala:174)
at org.apache.spark.sql.catalyst.catalog.SessionCatalog.setCurrentDatabase(SessionCatalog.scala:256)
at org.apache.spark.sql.execution.command.SetDatabaseCommand.run(databases.scala:59)
at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:70)
at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:68)
at org.apache.spark.sql.execution.command.ExecutedCommandExec.executeCollect(commands.scala:79)
at org.apache.spark.sql.Dataset$$anonfun$6.apply(Dataset.scala:190)
at org.apache.spark.sql.Dataset$$anonfun$6.apply(Dataset.scala:190)
at org.apache.spark.sql.Dataset$$anonfun$52.apply(Dataset.scala:3253)
at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:77)
at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3252)
at org.apache.spark.sql.Dataset.<init>(Dataset.scala:190)
at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:75)
at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:638)
at com.adbms.SpamFilter.main(SpamFilter.java:54)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.spark.deploy.JavaMainApplication.start(SparkApplication.scala:52)
at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:879)
at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:197)
at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:227)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:136)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)


Could you please take a look at this and, if anything is wrong, suggest the right way to read Hive tables on Hadoop from Spark using Java? A link to a webpage with relevant information would also be appreciated.
Thanks in advance.
Regards, 
Rishikesh Gawade

Re: Accessing Hive Database (On Hadoop) using Spark

Nicolas Paris
Hi

It sounds like your configuration files are not set up correctly. What does

spark.sql("SHOW DATABASES").show();

output? If you only see the default database, the investigation in this thread should help:
https://stackoverflow.com/questions/47257680/unable-to-get-existing-hive-tables-from-hivecontext-using-spark
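
For context, a common cause of this symptom is that Spark cannot find hive-site.xml on its classpath, so enableHiveSupport() falls back to a fresh, empty local Derby metastore that contains only the default database. The sketch below is a minimal standalone diagnostic; the class name CatalogCheck is chosen here purely for illustration:

import org.apache.spark.sql.SparkSession;

public class CatalogCheck {
    public static void main(String[] args) {
        SparkSession spark = SparkSession
                .builder()
                .appName("Catalog Check")
                .enableHiveSupport()
                .getOrCreate();

        // If this prints only "default", Spark is using its own fresh
        // local metastore rather than the Hive metastore holding 'spam'.
        spark.sql("SHOW DATABASES").show();

        spark.stop();
    }
}

If that turns out to be the case, the usual fix is to copy hive-site.xml into $SPARK_HOME/conf (or ship it with spark-submit via --files /path/to/hive-site.xml) so that Spark connects to the same metastore Hive uses. Note that spark.sql.warehouse.dir only controls where new databases are created; it does not point Spark at an existing metastore.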

