Custom Data Source for getting data from REST-based services
Need your thoughts/inputs on a custom Data Source for accessing REST-based services in parallel using Spark.
Often, batch-oriented business applications have to call a target REST service a large number of times, each time with a different set of parameter (key/value) pairs.
Example use cases include:
- Getting results/predictions from Machine Learning/NLP systems
- Accessing utility APIs (like address validation) in bulk for thousands of inputs
- Ingesting data from systems that support only parametric data queries (say, for time-series data)
- Indexing data into search systems
- Web crawling
- Accessing business applications that do not support bulk download
- Others ....
Typically, for these use cases, the number of times the service is called (with varying parameters/data) can be high. So people use or develop a parallel processing framework (in their language of choice) to call the APIs in parallel. But it is typically hard to make such a framework run in a distributed manner across multiple machines.
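To make the single-machine pattern concrete, here is a minimal Python sketch of calling one endpoint once per parameter set, in parallel, using only the standard library. The function and names are hypothetical illustrations, not part of the proposed data source; the `fetch` callable is injected so the parallel-calling strategy stays separate from the HTTP details (a real `fetch` might wrap `urllib.request` or the `requests` library).

```python
from concurrent.futures import ThreadPoolExecutor

def call_api_parallel(param_sets, fetch, max_workers=8):
    """Call `fetch` once per parameter set in parallel; preserve input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(zip(param_sets, pool.map(fetch, param_sets)))

# Stand-in for a real HTTP call (e.g. an address-validation API);
# a real fetch would issue the request and parse the JSON response.
def fake_fetch(params):
    return {"echo": params["q"], "status": 200}

results = call_api_parallel([{"q": i} for i in range(5)], fake_fetch, max_workers=2)
```

This only scales to one machine's threads, which is exactly the limitation the proposed Spark data source would remove by distributing the parameter sets across executors.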
The interface goes like this:
- Inputs: the REST API endpoint URL; the input data in a temporary Spark table (the name of the table has to be passed); the HTTP method (GET, POST, PUT, or DELETE); a userid/password (for sites that need authentication); connection parameters (connection timeout, read timeout); and a parameter to call the target REST API only once per input (useful for services you have to pay for, or that have a daily/hourly limit).
- Output: a DataFrame with Rows of Struct, where the Struct holds the output returned by the target API.
Any thoughts/inputs on this?
a) Would this be useful for the applications/use cases you develop?
b) What do you typically use to address this type of need?
c) What else should be considered to make this framework more generic/useful?
P.S. I found this resource (https://www.alibabacloud.com/forum/read-474) where a similar requirement is discussed and a solution is proposed. I am not sure what the status of that proposal is. However, I found a few gaps in it that need to be addressed:
a) The proposal covers calling the REST API for one set of key/value parameters. In the approach above, one can call the same REST API multiple times with different sets of values for the keys.
b) There should be an option to call the REST API only once for a given set of key/value parameters. This is important because one often has to pay for accessing a REST API, and there may also be a per-day/per-hour limit.
c) It does not support calling a REST service that uses POST or other HTTP methods.
d) Results in other formats (like XML or CSV) cannot be handled.
Re: Custom Data Source for getting data from REST-based services
This would be quite a useful addition to the Spark family; this is a use case that comes up more often than it is talked about:
* Getting third-party mapping data (geo coordinates)
* Accessing database data through REST
* Downloading data from a bulk-data API service
It would be really useful to be able to interact with the application layer through a REST API and send data over to it (the case of POST requests, which you already mentioned).
I have a few follow-up thoughts:
1) What are your thoughts on the case where a REST API returns more complex nested JSON data? Will this seamlessly map to a DataFrame, given that DataFrames are flatter in structure?
2) How can this DataFrame be kept in a distributed cache on the Spark workers so that it remains available, to encourage re-use of slow-changing data? (Does broadcast work on a DataFrame?) This is related to your b).
3) The last case in my mind is how this can be extended for streaming: controlling the frequency of the REST API call and performing a join of two DataFrames, one slow-moving (maybe a lookup table in a database accessed over REST) and one a fast-moving event stream.
1. Yes, this should be able to handle any complex JSON structure returned by the target REST API. Essentially, what it returns is Rows of that complex structure. One can then use Spark SQL to flatten it further using functions like inline, explode, etc.
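As a plain-Python illustration (not the actual Spark API) of what that flattening step does, here is a hypothetical `explode_field` helper that mimics Spark SQL's `explode`: each element of an array field becomes its own row, with the other fields repeated.

```python
# Plain-Python illustration of Spark SQL's `explode` on one nested field:
# every element of the array-valued field becomes its own flat row.
def explode_field(rows, field):
    out = []
    for row in rows:
        for item in row[field]:
            flat = {k: v for k, v in row.items() if k != field}
            flat[field] = item
            out.append(flat)
    return out

# A nested API response: each logical record carries an array.
api_rows = [{"id": 1, "tags": ["a", "b"]}, {"id": 2, "tags": ["c"]}]
flat_rows = explode_field(api_rows, "tags")
```

In Spark itself one would express the same thing declaratively, e.g. with `explode` in a `select`, rather than looping over rows.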
2. In my current implementation I have provided an option called "callStrictlyOnce". This ensures that the REST API is called only once for each set of parameter values, and the result is persisted/cached for later use.
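The "callStrictlyOnce" semantics can be sketched as simple memoization keyed on the parameter set. This is a hypothetical illustration of the idea, not the actual implementation; a real version would persist the cache (e.g. as a Spark table) rather than keep it in memory.

```python
# Hypothetical sketch of "callStrictlyOnce": the underlying (possibly paid)
# API call happens at most once per distinct set of parameter values;
# repeated requests for the same parameters are served from the cache.
class OnceOnlyCaller:
    def __init__(self, fetch):
        self._fetch = fetch
        self._cache = {}
        self.calls = 0  # number of actual API invocations made

    def __call__(self, params):
        key = frozenset(params.items())  # order-independent cache key
        if key not in self._cache:
            self.calls += 1
            self._cache[key] = self._fetch(params)
        return self._cache[key]

caller = OnceOnlyCaller(lambda p: {"result": p["id"] * 10})
outputs = [caller({"id": i}) for i in [1, 2, 1, 2, 3]]
```

With three distinct parameter sets among five requests, only three real calls are made, which is what keeps the cost and rate-limit usage bounded.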
3. I'm not sure what exactly you have in mind regarding extending this to Spark Streaming. As it stands, this cannot be used as a Spark Streaming receiver, because it does not implement the interfaces required for a custom streaming receiver. But you can use it within your Spark Streaming application as a regular Data Source, to merge with the data you are receiving from the streaming source.