Hive on Spark is not populating correct records

Vikash Pareek
This post has NOT been accepted by the mailing list yet.
Hi,

I'm not sure whether this is the right place to discuss this issue.

I am running the following Hive query multiple times, once with Hive on Spark as the execution engine and once with Hive on MapReduce:
SELECT COUNT(DISTINCT t1.region, t1.amount)
FROM my_db.my_table1 t1
LEFT OUTER JOIN my_db.my_table2 t2
  ON (t1.id = t2.id
      AND t1.name = t2.name)
With Hive on Spark: the result (count) was different on every execution.
With Hive on MapReduce: the result (count) was the same on every execution.

It seems that Hive on Spark behaves differently on each execution and does not return the correct result.
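
For anyone trying to reproduce this, here is a minimal sketch of the comparison, assuming the engine is switched per session through the standard hive.execution.engine property and that the right-hand table is my_db.my_table2 (the spelling of the database name is an assumption). The last statement is a suggested cross-check, not part of the original test: a LEFT OUTER JOIN keeps every row of the left table, and the DISTINCT is over left-table columns only, so the join-free count should match the joined count on a correctly behaving engine.

```sql
-- Run the query on Hive on Spark (repeat several times and compare the counts).
SET hive.execution.engine=spark;
SELECT COUNT(DISTINCT t1.region, t1.amount)
FROM my_db.my_table1 t1
LEFT OUTER JOIN my_db.my_table2 t2
  ON (t1.id = t2.id AND t1.name = t2.name);

-- Run the identical query on Hive on MapReduce.
SET hive.execution.engine=mr;
SELECT COUNT(DISTINCT t1.region, t1.amount)
FROM my_db.my_table1 t1
LEFT OUTER JOIN my_db.my_table2 t2
  ON (t1.id = t2.id AND t1.name = t2.name);

-- Cross-check without the join: only left-table columns are counted,
-- and a left outer join never drops left-table rows.
SELECT COUNT(DISTINCT region, amount)
FROM my_db.my_table1;
```

If the join-free count is stable on both engines while the joined count varies only on Spark, that would point at the join or shuffle stage rather than at the aggregation alone.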

The data volumes are as follows:
my_table1 (left): 30 million records
my_table2 (right): 85 million records

-- Thanks
Vikash

Re: Hive on Spark is not populating correct records

Vikash Pareek
This post has NOT been accepted by the mailing list yet.
After a lot of experiments, I have concluded that this is probably a bug in Cloudera's Hive on Spark.
Hive on Spark does not produce consistent output for aggregate functions.

Hopefully, it will be fixed in the next release.

Re: Hive on Spark is not populating correct records

abhimadav
This post has NOT been accepted by the mailing list yet.
Hey Vikash,

Could you please share the Cloudera version you were using? Also, is Cloudera tracking this issue in a JIRA?

Thanks
Abhishek