Spark 2.4 and Hive 2.3 - Performance issue with concurrent hive DDL queries

Nirav Patel

Hi,

I am trying to run thousands of Parquet partition updates against different Hive tables in parallel from my client application. The application uses Spark SQL in local mode, with Hive support enabled, to submit the Hive queries. We run Spark in local mode because the operations are all simple DDL queries, so we don't want to tie up cluster resources for them.

The Hive config property set on the Spark session is:

hive.metastore.uris=thrift://hivebox:9083
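
For reference, the session is built roughly like this (simplified; the app name is illustrative):

import org.apache.spark.sql.SparkSession

// Local-mode session with Hive support, pointing at the remote metastore
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("partition-updater")   // illustrative name
  .config("hive.metastore.uris", "thrift://hivebox:9083")
  .enableHiveSupport()
  .getOrCreate()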

An example of the SQL we want to execute in parallel:

spark.sql(" ALTER TABLE mytable PARTITION (a=3, b=3) SET LOCATION '/newdata/mytable/a=3/b=3/part.parquet")


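The fan-out from my client looks roughly like this (simplified; table names and partition values are made up):

import java.util.concurrent.ForkJoinPool
import scala.concurrent.{Await, ExecutionContext, Future}
import scala.concurrent.duration.Duration

// Dedicated fork-join pool used to submit the DDL statements in parallel
implicit val ec: ExecutionContext =
  ExecutionContext.fromExecutorService(new ForkJoinPool(32))

val updates = Seq(("mytable", 3, 3), ("mytable2", 1, 7))  // really thousands of these

val futures = updates.map { case (tbl, a, b) =>
  Future {
    spark.sql(s"ALTER TABLE $tbl PARTITION (a=$a, b=$b) " +
              s"SET LOCATION '/newdata/$tbl/a=$a/b=$b/part.parquet'")
  }
}
Await.result(Future.sequence(futures), Duration.Inf)
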
I can see the queries being submitted from different threads of my fork-join pool, but no matter how I tweak the pool size the operation doesn't scale. So I started watching the Hive metastore logs and noticed that a single thread is doing all the writes:

2020-01-29T16:27:15,638  INFO [pool-6-thread-163] metastore.HiveMetaStore: 163: source:10.250.70.14 get_table : db=mydb tbl=mytable1
2020-01-29T16:27:15,638  INFO [pool-6-thread-163] HiveMetaStore.audit: ugi=mycomp   ip=10.250.70.14 cmd=source:10.250.70.14 get_table : db=mydb tbl=mytable1    
2020-01-29T16:27:15,653  INFO [pool-6-thread-163] metastore.HiveMetaStore: 163: source:10.250.70.14 get_database: mydb
2020-01-29T16:27:15,653  INFO [pool-6-thread-163] HiveMetaStore.audit: ugi=mycomp   ip=10.250.70.14 cmd=source:10.250.70.14 get_database: mydb  
2020-01-29T16:27:15,655  INFO [pool-6-thread-163] metastore.HiveMetaStore: 163: source:10.250.70.14 get_table : db=mydb tbl=mytable2
2020-01-29T16:27:15,656  INFO [pool-6-thread-163] HiveMetaStore.audit: ugi=mycomp   ip=10.250.70.14 cmd=source:10.250.70.14 get_table : db=mydb tbl=mytable2    
2020-01-29T16:27:15,670  INFO [pool-6-thread-163] metastore.HiveMetaStore: 163: source:10.250.70.14 get_database: mydb
2020-01-29T16:27:15,670  INFO [pool-6-thread-163] HiveMetaStore.audit: ugi=mycomp   ip=10.250.70.14 cmd=source:10.250.70.14 get_database: mydb  
2020-01-29T16:27:15,672  INFO [pool-6-thread-163] metastore.HiveMetaStore: 163: source:10.250.70.14 get_table : db=mydb tbl=mytable3
2020-01-29T16:27:15,672  INFO [pool-6-thread-163] HiveMetaStore.audit: ugi=mycomp   ip=10.250.70.14 cmd=source:10.250.70.14 get_table : db=mydb tbl=mytable3

All actions are performed by the same thread, pool-6-thread-163. I have scanned hundreds of lines and it is always that one thread. I also don't see much logging in the hiveserver.log file.

Is the handling bound to the client IP? Every log record shows source:10.250.70.14, which would make sense since I am submitting all the jobs from a single machine. If that's the case, how do I scale this? Am I missing some configuration, or is there an issue with how Hive handles connections from the Spark client?

I know a workaround could be to run my application on the cluster, in which case the queries would be submitted from different client machines (the worker nodes), but we really just want to use Spark in local mode.

Thanks,

Nirav