How to collect Spark dataframe write metrics

How to collect Spark dataframe write metrics

Manjunath Shetty H
Hi all,

Basically, my use case is to validate the DataFrame row count before and after writing to HDFS. Is this even a good practice? Or should I rely on Spark for guaranteed writes?

If it is a good practice to follow, then how do I get the DataFrame-level write metrics?

Any pointers would be helpful.


Thanks and Regards
Manjunath 

Re: How to collect Spark dataframe write metrics

Zohar Stiro
Hi, 

to get DataFrame-level write metrics, you can take a look at the following trait:
and a basic implementation example:

and here is an example of how it is used in FileStreamSink:

- about the good practice - it depends on your use case, but generally speaking I would not do it - at least not for checking your logic or checking that Spark is working correctly.
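
[Not from the thread - a minimal sketch of one way to collect write-side record counts, using Spark's public SparkListener API instead of the internal stats-tracker trait. It assumes the write is the only job producing output metrics while the listener is registered; the class and path names are hypothetical.]

```scala
import java.util.concurrent.atomic.LongAdder

import org.apache.spark.scheduler.{SparkListener, SparkListenerTaskEnd}

// Sums OutputMetrics.recordsWritten across all completed tasks.
class RecordsWrittenListener extends SparkListener {
  val recordsWritten = new LongAdder

  override def onTaskEnd(taskEnd: SparkListenerTaskEnd): Unit = {
    val metrics = taskEnd.taskMetrics
    if (metrics != null) {
      recordsWritten.add(metrics.outputMetrics.recordsWritten)
    }
  }
}

// Usage sketch (assumes an existing SparkSession `spark` and DataFrame `df`):
// val listener = new RecordsWrittenListener
// spark.sparkContext.addSparkListener(listener)
// val expected = df.count()  // note: this triggers an extra job
// df.write.parquet("hdfs:///tmp/out")  // hypothetical path
// spark.sparkContext.removeSparkListener(listener)
// assert(listener.recordsWritten.sum() == expected)
```

Note that concurrent jobs in the same application would also feed the listener, so in a shared application you would need to filter by job or stage ID.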

On Sun, Mar 1, 2020 at 14:32, Manjunath Shetty H <[hidden email]> wrote:
Hi all,

Basically, my use case is to validate the DataFrame row count before and after writing to HDFS. Is this even a good practice? Or should I rely on Spark for guaranteed writes?

If it is a good practice to follow, then how do I get the DataFrame-level write metrics?

Any pointers would be helpful.


Thanks and Regards
Manjunath 

Re: How to collect Spark dataframe write metrics

Manjunath Shetty H
Thanks Zohar,

Will try that.


-
Manjunath

From: Zohar Stiro <[hidden email]>
Sent: Tuesday, March 3, 2020 1:49 PM
To: Manjunath Shetty H <[hidden email]>
Cc: user <[hidden email]>
Subject: Re: How to collect Spark dataframe write metrics
 
Hi, 

to get DataFrame-level write metrics, you can take a look at the following trait:
and a basic implementation example:

and here is an example of how it is used in FileStreamSink:

- about the good practice - it depends on your use case, but generally speaking I would not do it - at least not for checking your logic or checking that Spark is working correctly.

On Sun, Mar 1, 2020 at 14:32, Manjunath Shetty H <[hidden email]> wrote:
Hi all,

Basically, my use case is to validate the DataFrame row count before and after writing to HDFS. Is this even a good practice? Or should I rely on Spark for guaranteed writes?

If it is a good practice to follow, then how do I get the DataFrame-level write metrics?

Any pointers would be helpful.


Thanks and Regards
Manjunath