Standard practices for building dashboards for Spark-processed data

Standard practices for building dashboards for Spark-processed data

Aniruddha P Tekade
Hello,

I am trying to build a data pipeline that uses Spark Structured Streaming with the Delta Lake project and runs on Kubernetes. Because of this, I get my output files only in Parquet format. Since I am asked to use Prometheus and Grafana
for building the dashboard for this pipeline, I run another small Spark job to convert the output into JSON so that I can get it into Grafana. Although I can see that this step is redundant, given the importance of the Delta Lake project I cannot write my data directly as JSON. Therefore I need some help/guidelines/opinions about moving forward from here.
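
For reference, the conversion job is essentially the following (a minimal sketch; the bucket paths are placeholders, not my actual locations):

import org.apache.spark.sql.SparkSession

object ParquetToJson {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("parquet-to-json")
      .getOrCreate()

    // Read the Parquet files written by the streaming job (placeholder path).
    val df = spark.read.parquet("s3a://my-bucket/stream-output/")

    // Rewrite the same rows as line-delimited JSON (placeholder path).
    df.write
      .mode("overwrite")
      .json("s3a://my-bucket/json-output/")

    spark.stop()
  }
}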

I would appreciate it if Spark users could share some practices to follow with respect to the following questions -
  1. Since I cannot get direct JSON output from Spark Structured Streaming, is there a better way to convert Parquet into JSON? Or should I keep only Parquet?
  2. Will I need to write a custom exporter for Prometheus so that Grafana can read the time-series data?
  3. Is there a better dashboard alternative to Grafana for this requirement?
  4. Since the pipeline is going to run on Kubernetes, I am trying to avoid InfluxDB as the time-series database and to move ahead with Prometheus. Is this approach correct?
Thanks,
Ani
-----------
Re: Standard practices for building dashboards for Spark-processed data

Roland Johann
Hi Ani,

Prometheus is not well suited for ingesting explicit time-series data; its purpose is technical monitoring. If you want to monitor your Spark jobs with Prometheus, you can publish the metrics so that Prometheus can scrape them. What you are probably looking for is a time-series database that you can push metrics to.

Look for an alternative to Grafana only if you find that Grafana is not well suited to your use case regarding visualization.

As said earlier, at a quick glance it sounds like you should look for an alternative to Prometheus instead.

For time series you can look at TimescaleDB or InfluxDB. Other databases, such as plain SQL databases or Cassandra, lack up/downsampling capabilities, which can lead to large query responses and force the client to post-process the data.
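
As a rough illustration: because TimescaleDB speaks the PostgreSQL wire protocol, Spark can push aggregated metrics into it through the ordinary JDBC sink (a sketch only; the host, table, and credentials are placeholders, and the PostgreSQL JDBC driver has to be on the classpath):

import org.apache.spark.sql.{DataFrame, SaveMode}

// metricsDf is assumed to hold one row per (timestamp, metric, value).
def writeToTimescale(metricsDf: DataFrame): Unit = {
  metricsDf.write
    .format("jdbc")
    .option("url", "jdbc:postgresql://timescaledb:5432/metrics") // placeholder host/database
    .option("dbtable", "pipeline_metrics")                       // placeholder table
    .option("user", "spark")                                     // placeholder credentials
    .option("password", "secret")
    .option("driver", "org.postgresql.Driver")
    .mode(SaveMode.Append)
    .save()
}

On the TimescaleDB side the target table would typically be converted into a hypertable so that time-based queries and downsampling stay fast.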

Kind regards,

--
Roland Johann
Software Developer/Data Engineer

phenetic GmbH
Lütticher Straße 10, 50674 Köln, Germany

Mobile: +49 172 365 26 46
Mail: [hidden email]
Web: phenetic.io

Commercial register: Amtsgericht Köln (HRB 92595)
Managing directors: Roland Johann, Uwe Reimann
Re: Standard practices for building dashboards for Spark-processed data

Aniruddha P Tekade
Hi Roland, 

Thank you for your reply; that's quite helpful. I think I should try InfluxDB then. But I am curious: in the case of Prometheus, would writing a custom exporter be a good choice and serve the purpose efficiently? Grafana is not something I want to drop.
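
Something along these lines is what I had in mind, using the Prometheus Java simpleclient (a minimal sketch assuming the simpleclient and simpleclient_httpserver artifacts are on the classpath; the metric name and port are made up):

import io.prometheus.client.Gauge
import io.prometheus.client.exporter.HTTPServer

object PipelineExporter {
  // Hypothetical gauge tracking the latest value the pipeline produced.
  private val recordsProcessed: Gauge = Gauge.build()
    .name("pipeline_records_processed")
    .help("Records processed by the streaming pipeline.")
    .register()

  def main(args: Array[String]): Unit = {
    // Expose /metrics on port 9091 for Prometheus to scrape.
    new HTTPServer(9091)

    // The job would update the gauge as it processes batches.
    recordsProcessed.set(42.0)

    Thread.currentThread().join() // keep the exporter alive
  }
}

Though, as you pointed out, Prometheus would only sample whatever value the gauge holds at scrape time, so this fits monitoring-style metrics rather than backfilled event data.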

Best,
Aniruddha
-----------


Re: Standard practices for building dashboards for Spark-processed data

Breno Arosa
I have been using Athena/Presto to read the Parquet files in the data lake; if you are already saving data to S3, I think this is the easiest option. Then I use Redash or Metabase to build dashboards (they have different limitations); both are very intuitive to use and easy to set up with Docker.
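
For this to work well, the Spark job just needs to keep writing Parquet to S3, ideally partitioned by date so Athena/Presto can prune partitions (a rough sketch; the bucket, prefix, and the event_time column are placeholders):

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{col, to_date}

// df is the job's output; event_time is a hypothetical timestamp
// column used only to derive the partition key.
def writeForAthena(df: DataFrame): Unit = {
  df.withColumn("dt", to_date(col("event_time")))
    .write
    .mode("append")
    .partitionBy("dt")
    .parquet("s3a://my-bucket/events/") // placeholder location
}

Then you define an external table in Athena over that S3 prefix, partitioned by dt, and Redash/Metabase can query it through their Athena data sources.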


