[Structured Streaming] Multiple sources best practice/recommendation

[Structured Streaming] Multiple sources best practice/recommendation

JG Perrin

Hi,

I have different files being dumped on S3, and I want to ingest them and join them.

Which sounds better to you: one “directory” for all files, or one per file format?

If I have one directory for all, can I get some metadata about each file, like its name?

If I use multiple directories, how can I have multiple “listeners”?

Thanks,

jg



Re: [Structured Streaming] Multiple sources best practice/recommendation

Michael Armbrust
I would probably suggest that you partition by format (though you can get the file name from the built-in function input_file_name()). You can load multiple streams from different directories and union them together, as long as the schema is the same after parsing. Otherwise, you can just run multiple streams on the same cluster.
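
For illustration, here is a minimal PySpark sketch of the approach described above: one directory per format, each row tagged with its file name, and the streams unioned once they share a schema. It assumes Spark 2.3 or later (for DDL schema strings and `unionByName`). The bucket paths, the schema, the `source_file` column name, and the helper function names are all hypothetical placeholders, not anything from this thread.

```python
from functools import reduce
from typing import Dict

# Hypothetical layout: one S3 prefix per file format (placeholder paths).
SOURCES: Dict[str, str] = {
    "csv": "s3a://example-bucket/landing/csv/",
    "json": "s3a://example-bucket/landing/json/",
}

# Hypothetical schema that every format is assumed to parse to.
# Streaming file sources need the schema up front.
COMMON_SCHEMA = "id STRING, amount DOUBLE, event_time TIMESTAMP"


def read_source(spark, fmt: str, path: str):
    """Open one file stream and tag each row with the file it came from."""
    # Imported here so the rest of the module loads without PySpark installed.
    from pyspark.sql.functions import input_file_name

    reader = spark.readStream.format(fmt).schema(COMMON_SCHEMA)
    if fmt == "csv":
        reader = reader.option("header", "true")
    return reader.load(path).withColumn("source_file", input_file_name())


def union_all(streams):
    """Union DataFrames by column name; only valid when the schemas match."""
    return reduce(lambda left, right: left.unionByName(right), streams)


def start_ingest(spark, sources: Dict[str, str] = SOURCES):
    """Start a single streaming query over the union of all sources."""
    streams = [read_source(spark, fmt, path) for fmt, path in sources.items()]
    return union_all(streams).writeStream.format("console").start()
```

Calling `start_ingest(spark)` with an existing `SparkSession` returns a `StreamingQuery`; `query.awaitTermination()` then blocks until it stops. If the schemas differ after parsing, the union does not apply. In that case one option is to call `writeStream...start()` once per source, giving each query its own checkpoint location, and then wait on all of them with `spark.streams.awaitAnyTermination()`.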

On Wed, Sep 13, 2017 at 7:56 AM, JG Perrin <[hidden email]> wrote:

Hi,

I have different files being dumped on S3, and I want to ingest them and join them.

Which sounds better to you: one “directory” for all files, or one per file format?

If I have one directory for all, can I get some metadata about each file, like its name?

If I use multiple directories, how can I have multiple “listeners”?

Thanks,

jg

