Spark 3 connect to Hive 1.2

Spark 3 connect to Hive 1.2

Ashika Umanga
Greetings,

Our standalone Spark 3 cluster is trying to connect to a Hadoop 2.6 cluster running Hive Server 1.2 (/usr/hdp/2.6.2.0-205/hive/lib/hive-service-1.2.1000.2.6.2.0-205.jar)

import org.apache.spark.sql.functions._
import java.sql.Timestamp

val df1 = spark.createDataFrame(
      Seq(
        ("id1", "v2", "notshared", Timestamp.valueOf("2019-09-13 10:00:00"), false, 1, "2019-09-13"),
        ("id2", "v3", "notshared", Timestamp.valueOf("2019-09-13 09:00:00"), false, 2, "2019-09-13"),
        ("id2", "v4", "notshared", Timestamp.valueOf("2019-09-14 11:00:00"), false, 3, "2019-09-14"),
        ("id2", "v5", "notshared", Timestamp.valueOf("2019-09-14 13:00:00"), false, 4, "2019-09-14"),
        ("id3", "v4", "notshared", Timestamp.valueOf("2019-09-14 17:00:00"), false, 5, "2019-09-14"),
        ("id4", "v1", "notshared", Timestamp.valueOf("2019-09-15 19:00:00"), false, 6, "2019-09-15"))).toDF("user_id", "col2", "pidd", "land_ts", "deleted","offset", "partition")

df1.write.mode("overwrite").saveAsTable("db.spark3_test")

When running the above code, it throws this error:

org.apache.spark.sql.AnalysisException: org.apache.hadoop.hive.ql.metadata.HiveException: Unable to fetch table spark_27686. Invalid method name: 'get_table_req';
  at org.apache.spark.sql.hive.HiveExternalCatalog.withClient(HiveExternalCatalog.scala:110)
  at


I assume this is because Spark 3 ships "hive-metastore-2.3.7.jar". To work with Hive Server 1.2, can I use "hive-metastore-1.2.1.spark2.jar" from the Spark 2.4 distribution? Do I need any other dependencies?
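
A hedged aside on the configuration route: rather than copying single jars between distributions, Spark can load a separate Hive client from a classpath given in spark.sql.hive.metastore.jars, which must hold Hive 1.2.1 and its dependencies, not just the one metastore jar. A sketch, where /opt/hive-1.2.1/lib is a placeholder for wherever those jars actually live:

spark-shell \
  --conf spark.sql.hive.metastore.version=1.2.1 \
  --conf spark.sql.hive.metastore.jars="/opt/hive-1.2.1/lib/*"
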
Re: Spark 3 connect to Hive 1.2

Ashika Umanga
Thank you. But I think this document doesn't address Spark 3.0 (Databricks Runtime 7.0?).

When I ran spark-shell as explained in the document:

spark-shell --deploy-mode client --conf spark.sql.hive.metastore.version=1.2.1 --conf spark.sql.hive.metastore.jars="builtin"

It throws this error:

java.lang.IllegalArgumentException: Builtin jars can only be used when hive execution version == hive metastore version. Execution: 2.3.7 != Metastore: 1.2.1. Specify a valid path to the correct hive jars using spark.sql.hive.metastore.jars or change spark.sql.hive.metastore.version to 2.3.7.

  at org.apache.spark.sql.hive.HiveUtils$.newClientForMetadata(HiveUtils.scala:345)
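
The exception itself names the two remaining options when the 1.2 metastore has to stay as-is: point spark.sql.hive.metastore.jars at a local classpath of Hive 1.2.1 client jars (as sketched above), or set it to "maven" so Spark downloads matching client jars at startup, which needs network access on the driver. A sketch of the Maven variant:

spark-shell --deploy-mode client \
  --conf spark.sql.hive.metastore.version=1.2.1 \
  --conf spark.sql.hive.metastore.jars=maven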


Apache Spark + Python + PySpark + Koalas

Suat Toksöz
Hi everyone, I want to ask for guidance on my log analyzer platform idea. I have an Elasticsearch system that collects logs from different platforms and creates alerts. The system writes the alerts to an index on ES. The alerts are also stored in a folder as JSON (multi-line format).

The Goals:
  1. Read the JSON folder or the ES index as a stream (pick up new entries within 5 minutes)
  2. Select only the alerts I want to work on (alert.id = 100, status = true, ...)
  3. Create a DataFrame + window over a 10-minute period
  4. Run a query on that DataFrame, grouping by IP (if the same IP gets 3 alerts, show me the result)
  5. All the coding should be in Python

The idea is something like this; my question is how I should approach this task. Which technologies should I use?

Can Apache Spark + Python + PySpark + Koalas handle this?
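
Structured Streaming in PySpark covers steps 1-4 directly; Koalas (the pandas API on Spark) targets pandas-style batch work and is likely not needed here. A minimal sketch under assumptions: the folder path and the alert schema with its column names (alert_id, status, ip, event_ts) are all hypothetical placeholders.

from pyspark.sql import SparkSession, functions as F
from pyspark.sql.types import (StructType, StructField, StringType,
                               IntegerType, BooleanType, TimestampType)

spark = SparkSession.builder.appName("alert-analyzer").getOrCreate()

# Hypothetical alert schema -- adjust to the real JSON documents.
schema = StructType([
    StructField("alert_id", IntegerType()),
    StructField("status", BooleanType()),
    StructField("ip", StringType()),
    StructField("event_ts", TimestampType()),
])

# 1. Stream new multi-line JSON files from the alerts folder.
alerts = (spark.readStream
          .schema(schema)
          .option("multiLine", "true")
          .json("/data/alerts"))  # placeholder path

# 2. Keep only the alerts of interest.
wanted = alerts.filter((F.col("alert_id") == 100) & F.col("status"))

# 3 + 4. Count alerts per IP over 10-minute windows; keep IPs with >= 3 hits.
hot_ips = (wanted
           .withWatermark("event_ts", "5 minutes")
           .groupBy(F.window("event_ts", "10 minutes"), "ip")
           .count()
           .filter(F.col("count") >= 3))

# Check for new files every 5 minutes and print matching IPs to the console.
query = (hot_ips.writeStream
         .outputMode("update")
         .trigger(processingTime="5 minutes")
         .format("console")
         .start())
query.awaitTermination()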

Best regards,

Suat Toksoz