Spark Structured Streaming from GCS files

Spark Structured Streaming from GCS files

Gowrishankar Sunder
Hi,
   We have a use case to stream files from GCS time-partitioned folders and run structured streaming queries on top of them. I have detailed the use case and requirements in this Stackoverflow question, but at a high level the problems I am facing are listed below, and I would like guidance on the best approach to use.
  • Custom source APIs for Structured Streaming are undergoing major changes (including the new Table API support), and the documentation does not go into much detail on building custom sources. I was wondering whether the current APIs are expected to remain stable through the targeted 3.2 release, and whether there are examples of how to use them for my use case.
  • The default FileStream source looks up a static glob path, which might not scale when the job runs for days across many time partitions. But it has some really useful file-handling features - it supports all the major source formats (Avro, Parquet, JSON, etc.), takes care of compression, and splits large files into sub-tasks - all of which I would have to implement again with the custom source APIs as they stand. I was wondering whether I can still make use of it while solving the scaling issue of globbing time-partitioned files (a rough sketch of the glob approach I mean is just after this list).
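To make the globbing approach concrete, here is a minimal sketch of the kind of query I have in mind with the built-in file source (the bucket name, schema and folder depth are placeholders, not our real layout):

from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, TimestampType

spark = SparkSession.builder.appName("gcs-file-stream").getOrCreate()

# Streaming file sources need the schema declared up front
schema = StructType([
    StructField("event_time", TimestampType()),
    StructField("payload", StringType()),
])

# A single static glob over a yyyy/mm/dd/hh/mm layout; listing this glob on
# every micro-batch is the part that slows down as partitions accumulate.
# The gs:// scheme works via the GCS connector (preinstalled on Dataproc).
stream = (spark.readStream
          .format("parquet")                  # json/csv work the same way; avro needs spark-avro
          .schema(schema)
          .option("maxFilesPerTrigger", 500)  # cap the number of files picked up per batch
          .load("gs://my-bucket/events/*/*/*/*/*/"))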
Thanks



Re: Spark Structured Streaming from GCS files

Mich Talebzadeh

Hi,

I looked at the stackoverflow reference.

The first question that comes to my mind is how you are populating these GCS buckets. Are you shifting data from on-prem, landing it in the buckets, and creating a new folder at the given interval?

Where will you be running your Spark Structured Streaming? On Dataproc?

HTH


LinkedIn  https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw

Disclaimer: Use it at your own risk. Any and all responsibility for any loss, damage or destruction of data or any other property which may arise from relying on this email's technical content is explicitly disclaimed. The author will in no case be liable for any monetary damages arising from such loss, damage or destruction.


Re: Spark Structured Streaming from GCS files

Gowrishankar Sunder
Our online services running in GCP collect data from our clients and write it to GCS under time-partitioned folders like yyyy/mm/dd/hh/mm (based on the current time) or similar. We need these files to be processed in real time by Spark. As for the runtime, we plan to run it on either Dataproc or K8s.
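To give a sense of the query side, we expect something fairly standard, with the checkpoint kept on GCS so the job can be restarted on Dataproc or K8s (the output and checkpoint paths below are just placeholders):

# Continuing from the file-source sketch in my first message
query = (stream.writeStream
         .format("parquet")
         .option("path", "gs://my-bucket/output/")
         .option("checkpointLocation", "gs://my-bucket/checkpoints/gcs-file-stream/")
         .trigger(processingTime="1 minute")
         .outputMode("append")
         .start())

query.awaitTermination()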

- Gowrishankar Sunder

