Saving Spark run stats and run watermark

Manjunath Shetty H
Hi All,

We want to save stats for each Spark batch run (start time, end time, ID, etc.) and a watermark (the last processed timestamp from an external data source).

We have tried Hive JDBC, but it is very slow because of the MR jobs it triggers. We can't save to normal Hive tables, as that would create lots of small files in HDFS.

Please suggest the recommended way to do this. Any pointers will be helpful.

Thanks and regards
Manjunath

Re: Saving Spark run stats and run watermark

Manjunath Shetty H
Thanks for the suggestion, Netanel.

Sorry for not giving enough information; I am specifically looking for something inside the Hadoop ecosystem.


-
Manjunath

From: Netanel Malka <[hidden email]>
Sent: Wednesday, March 18, 2020 5:26 PM
To: Manjunath Shetty H <[hidden email]>
Subject: Re: Saving Spark run stats and run watermark
 
You can try using an RDBMS like PostgreSQL or MySQL.
I would use a regular table.
Spark has a built-in integration for that:
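
A minimal sketch of the regular-table idea, not taken from the thread: one row per batch run, holding the run ID, start and end times, and the watermark. The table name `batch_runs`, its columns, and the helper functions are all hypothetical, and Python's built-in `sqlite3` stands in for PostgreSQL or MySQL only so that the example runs without a database server.

```python
import sqlite3
from datetime import datetime, timezone

# Hypothetical schema: one row per Spark batch run.
# sqlite3 is only a stand-in here for an RDBMS such as PostgreSQL or MySQL.
DDL = """
CREATE TABLE IF NOT EXISTS batch_runs (
    run_id    TEXT PRIMARY KEY,
    start_ts  TEXT NOT NULL,
    end_ts    TEXT,
    watermark TEXT
)
"""


def _now():
    # ISO-8601 UTC timestamp stored as text.
    return datetime.now(timezone.utc).isoformat()


def open_store(path=":memory:"):
    """Open the stats store and create the table if it is missing."""
    conn = sqlite3.connect(path)
    conn.execute(DDL)
    return conn


def last_watermark(conn):
    """Return the highest watermark among finished runs, or None on the first run.

    Watermarks are assumed to be ISO-8601 strings in one consistent format,
    so that text ordering matches time ordering.
    """
    row = conn.execute(
        "SELECT MAX(watermark) FROM batch_runs WHERE end_ts IS NOT NULL"
    ).fetchone()
    return row[0]


def start_run(conn, run_id):
    """Record that a batch run has started."""
    conn.execute(
        "INSERT INTO batch_runs (run_id, start_ts) VALUES (?, ?)",
        (run_id, _now()),
    )
    conn.commit()


def finish_run(conn, run_id, watermark):
    """Record the end time and the last processed timestamp of a run."""
    conn.execute(
        "UPDATE batch_runs SET end_ts = ?, watermark = ? WHERE run_id = ?",
        (_now(), watermark, run_id),
    )
    conn.commit()
```

A batch job would call `last_watermark` before reading from the source, `start_run` when it begins, and `finish_run` with the new watermark once the batch has been processed. Because each run adds or updates a single row, this avoids the small-files problem described for Hive tables.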


On Wed, Mar 18, 2020, 13:03 Manjunath Shetty H <[hidden email]> wrote:
Hi All,

We want to save stats for each Spark batch run (start time, end time, ID, etc.) and a watermark (the last processed timestamp from an external data source).

We have tried Hive JDBC, but it is very slow because of the MR jobs it triggers. We can't save to normal Hive tables, as that would create lots of small files in HDFS.

Please suggest the recommended way to do this. Any pointers will be helpful.

Thanks and regards
Manjunath