REST Structured Streaming Sink

8 messages

REST Structured Streaming Sink

Sam Elamin
Hi All,


We ingest data from a lot of RESTful APIs into our lake, and I'm wondering whether it is at all possible to create a REST sink in Structured Streaming.

For now I'm only focusing on RESTful services that expose an incremental ID, so my sink can just poll for new data and then ingest it.

I can't seem to find a connector that does this, and my gut instinct tells me it's probably because it isn't possible for some completely obvious reason that I'm missing.

I know some RESTful APIs obfuscate their IDs into hashed strings, and that could be a problem, but since I'm planning to focus only on numerical IDs that simply get incremented, I don't think I'll face that issue.


Can anyone let me know if this sounds like a daft idea? Will I need something like Kafka or Kinesis as a buffer for redundancy, or am I overthinking this?


I would love to bounce ideas around with people who run Structured Streaming jobs in production.


Kind regards
Sam
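The incremental-ID polling idea above can be sketched in a few lines. This is a minimal, Spark-free sketch; `fetch_since` is a hypothetical stand-in for the actual REST call (e.g. `GET /items?since_id=<last_id>`), not a real API:

```python
# Sketch of incremental-ID polling: remember the highest ID seen so far,
# ask the API only for records after it, then advance the cursor.
# `fetch_since` is a hypothetical stand-in for the real REST call.

def poll_new_records(fetch_since, last_id):
    """Run one poll cycle; return (new_records, new_last_id)."""
    records = fetch_since(last_id)      # e.g. GET /items?since_id=<last_id>
    if not records:
        return [], last_id              # nothing new; cursor unchanged
    new_last_id = max(r["id"] for r in records)
    return records, new_last_id
```

In a real job the cursor (`last_id`) would need to be checkpointed durably so a restart does not re-ingest or skip records.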


Re: REST Structured Streaming Sink

Jungtaek Lim-2
I'd guess the method, query parameters, headers, and payload would all differ for almost every use case, which makes this hard to generalize: the implementation would have to be quite complicated to be flexible enough.

I'm not aware of any custom sink implementing REST, so your best bet would be to simply implement your own with foreachBatch, but someone might jump in and provide a pointer if there is something in the Spark ecosystem.

Thanks,
Jungtaek Lim (HeartSaVioR)
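The foreachBatch suggestion boils down to a per-micro-batch callback that pushes each row to the API. Here is a minimal, Spark-free sketch of such a handler; `post_row` is a hypothetical stand-in for the HTTP POST, and the Spark wiring shown in the comment is the usual `writeStream.foreachBatch` pattern:

```python
# Sketch of a foreachBatch-style handler: Spark calls it once per
# micro-batch with that batch's rows and an epoch id. `post_row` is a
# hypothetical stand-in for the HTTP POST to the REST endpoint.

def rest_batch_writer(post_row):
    def handle(rows, epoch_id):
        sent = 0
        for row in rows:
            post_row(row)       # one request per row; bulk/batch as needed
            sent += 1
        return sent             # returned for testing; Spark ignores it
    return handle

# With Spark this would be wired up roughly as:
#   query = df.writeStream.foreachBatch(rest_batch_writer(post_row)).start()
```

Note that foreachBatch gives at-least-once delivery by default, so the REST endpoint (or the handler, keyed on `epoch_id`) would need to handle replayed batches idempotently.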



Re: REST Structured Streaming Sink

Holden Karau
I think adding something like this (if it doesn't already exist) could help make Structured Streaming easier to use; foreachBatch is not the best API.

--
Books (Learning Spark, High Performance Spark, etc.): https://amzn.to/2MaRAG9 
Re: REST Structured Streaming Sink

Burak Yavuz-2
I'm not sure having a built-in sink that lets you DDoS servers is the best idea either. foreachWriter is typically used for such use cases, not foreachBatch. It's also pretty hard to guarantee exactly-once delivery, rate limiting, etc.

Best,
Burak
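The foreachWriter pointer refers to Spark's per-partition open/process/close contract. A minimal, Spark-free sketch of that lifecycle follows; `send` is a hypothetical stand-in for the HTTP call, and the class name is illustrative, not an existing API:

```python
# Sketch of the ForeachWriter open/process/close lifecycle that Spark
# drives once per partition per epoch. `send` is a hypothetical stand-in
# for the HTTP call; a real writer would create an HTTP session in open()
# and flush/close it in close().

class RestForeachWriter:
    def __init__(self, send):
        self.send = send
        self.opened = False

    def open(self, partition_id, epoch_id):
        self.opened = True          # e.g. create the HTTP session here
        return True                 # True tells Spark to process this partition

    def process(self, row):
        assert self.opened          # Spark guarantees open() ran first
        self.send(row)

    def close(self, error):
        self.opened = False         # e.g. close the session / flush buffers
```

Returning `False` from `open()` (for example, when `(partition_id, epoch_id)` was already committed) is how a writer can skip replayed partitions to approximate exactly-once behavior.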

Re: REST Structured Streaming Sink

Holden Karau


On Wed, Jul 1, 2020 at 6:13 PM Burak Yavuz <[hidden email]> wrote:
> I'm not sure having a built-in sink that allows you to DDOS servers is the best idea either.
Do you think it would be used accidentally? If so, we could ship it with default per-server rate limits that people would have to explicitly tune.
> foreachWriter is typically used for such use cases, not foreachBatch. It's also pretty hard to guarantee exactly-once, rate limiting, etc.
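The "default per-server rate limits" idea could be as simple as a token bucket in front of the HTTP client. A minimal sketch, with purely illustrative names and defaults (this is not an existing Spark API):

```python
import time

# Minimal token-bucket limiter: allow at most `rate` requests per second,
# with bursts up to `burst`. The `clock` parameter is injectable so the
# refill logic can be tested deterministically.

class TokenBucket:
    def __init__(self, rate, burst, clock=time.monotonic):
        self.rate = float(rate)        # tokens added per second
        self.capacity = float(burst)   # maximum bucket size
        self.tokens = float(burst)     # start full
        self.clock = clock
        self.last = clock()

    def try_acquire(self):
        """Consume one token if available; return whether a request may go out."""
        now = self.clock()
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False
```

A sink wrapper would call `try_acquire()` (or a blocking variant) before each request, so a misconfigured job throttles itself instead of hammering the target server.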

Re: REST Structured Streaming Sink

Andrew Melo
On Wed, Jul 1, 2020 at 8:13 PM Burak Yavuz <[hidden email]> wrote:
>
> I'm not sure having a built-in sink that allows you to DDOS servers is the best idea either. foreachWriter is typically used for such use cases, not foreachBatch. It's also pretty hard to guarantee exactly-once, rate limiting, etc.

If you control the machines and can run arbitrary code, you can DDoS whatever you want. What's the difference between this proposal and writing a UDF that opens 1,000 connections to a target machine?



Re: REST Structured Streaming Sink

Burak Yavuz-2
Well, the difference is that a technical user writes the UDF, whereas a non-technical user may use this built-in thing, misconfigure it, and shoot themselves in the foot.

Re: REST Structured Streaming Sink

Sam Elamin
Hi Folks,

Great discussion! I will take rate limiting into account and make it configurable for the HTTP requests as well.

I was wondering if there is anything I might have missed that would make this technically impossible, or at least difficult enough not to warrant the effort. Also, would this be useful to people?

My thinking, from a business perspective, is: why make people wait until the next scheduled batch run for data that is already available from an API? You could run a job every minute or hour, but that in itself sounds like a streaming use case.

Thoughts?

Regards
Sam
