spark optimized pagination

spark optimized pagination

onmstester
Hi,
I'm using Spark on top of Cassandra as the backend CRUD layer of a RESTful application.
Most of the REST APIs retrieve a huge amount of data from Cassandra and do a lot of aggregation on it in Spark, which takes a few seconds.

Problem: sometimes the result is such a big list that the client browser throws a "stop script" warning, so we have to paginate the result server-side. But it would be very annoying for the user to wait several seconds on every page for the Cassandra-plus-Spark processing.

Current dummy solution: for now I'm thinking of assigning a UUID to each request, sent back and forth between server and client. The first time a REST API is invoked, the result is saved in a temp table, and subsequent similar requests (requests for the next pages) fetch the result from the temp table instead of going through the usual flow (retrieve from Cassandra + aggregate in Spark), which takes some time. When the memory limit is reached, the oldest results are deleted.
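For illustration, here is a minimal sketch of that dummy solution in Scala, assuming the Spark DataFrame API; the cache map, eviction limit, and method names are hypothetical application code, not any existing Spark facility:

// Sketch of the UUID-keyed result cache described above. Everything named
// here (PagedResults, resultCache, maxEntries) is hypothetical app code.
import java.util.UUID
import scala.collection.mutable
import org.apache.spark.sql.{DataFrame, Row}

object PagedResults {
  private val maxEntries = 100
  // Insertion-ordered, so the oldest result is evicted first on memory limit.
  private val resultCache = mutable.LinkedHashMap.empty[String, DataFrame]

  // First request: run the Cassandra read + Spark aggregation once,
  // persist the result, and return a UUID for the follow-up page requests.
  def start(aggregate: => DataFrame): String = synchronized {
    if (resultCache.size >= maxEntries) {
      val (oldestId, oldestDf) = resultCache.head
      oldestDf.unpersist()
      resultCache -= oldestId
    }
    val id = UUID.randomUUID().toString
    resultCache(id) = aggregate.persist() // materialized on the first action
    id
  }

  // Next-page requests: slice one page out of the cached DataFrame
  // instead of re-running the Cassandra + Spark pipeline.
  def page(id: String, pageNo: Int, pageSize: Int): Array[Row] = {
    val df = synchronized(resultCache(id))
    val from = pageNo.toLong * pageSize
    df.rdd.zipWithIndex()
      .filter { case (_, i) => i >= from && i < from + pageSize }
      .map(_._1)
      .collect()
  }
}

Each page call still runs a small Spark job over the persisted data, so this trades the repeated Cassandra read and aggregation for executor memory.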

Is there any built-in, clean caching strategy in Spark to handle such scenarios?

Sent using Zoho Mail




Re: spark optimized pagination

Deepak Goel
I think your requirement is that of an OLTP system. Spark and Cassandra are more suitable for batch-style jobs (they can be used for OLTP, but there will be a performance hit).



Deepak
"The greatness of a nation can be judged by the way its animals are treated. Please consider stopping the cruelty by becoming a Vegan"


"Plant a Tree, Go Green"



Re: spark optimized pagination

theikkila
In reply to this post by onmstester
So you are now providing the data on demand through Spark?

I suggest you change your API to query Cassandra directly and store the results from Spark back there. That way you only have to process the whole dataset once, and Cassandra is well suited to that kind of workload.
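
For concreteness, a minimal sketch of that write-back step, assuming the spark-cassandra-connector is on the classpath; the keyspace, table, and column names are hypothetical:

// Sketch: precompute the aggregation once in Spark and write it back to
// Cassandra, so the REST layer can page over a plain Cassandra table.
// Keyspace "app" and tables "events"/"precomputed_results" are hypothetical.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("precompute-aggregates")
  .config("spark.cassandra.connection.host", "127.0.0.1")
  .getOrCreate()

val raw = spark.read
  .format("org.apache.spark.sql.cassandra")
  .options(Map("keyspace" -> "app", "table" -> "events"))
  .load()

// The expensive aggregation runs once per batch, not once per page request.
val aggregated = raw.groupBy("user_id").count()

aggregated.write
  .format("org.apache.spark.sql.cassandra")
  .options(Map("keyspace" -> "app", "table" -> "precomputed_results"))
  .mode("append")
  .save()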

-T


Re: spark optimized pagination

vaquar khan
Spark is a processing engine, not storage or a cache. You can dump your results back to Cassandra; if you still see latency, you can put a cache in front and dump the Spark results into it.

In short, the answer is no: Spark doesn't provide any API that gives you cache-like storage.

Reading millions of records directly from the dataset will mean a big delay in the response.
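
The paging itself can then happen in the REST layer with the Cassandra driver rather than in Spark. A sketch using the DataStax Java driver (3.x API) and its built-in paging state; the keyspace, table, and column names are hypothetical:

// Sketch: page over the precomputed table using the DataStax Java driver's
// paging state, so no result has to be cached in Spark at all.
// Keyspace "app", table "precomputed_results", and its columns are hypothetical.
import com.datastax.driver.core.{Cluster, PagingState, SimpleStatement}

val cluster = Cluster.builder().addContactPoint("127.0.0.1").build()
val session = cluster.connect("app")
val pageSize = 50

// First page: fetch at most pageSize rows.
val first = new SimpleStatement("SELECT user_id, cnt FROM precomputed_results")
first.setFetchSize(pageSize)
val rs = session.execute(first)

// Read only what is already fetched, so the driver doesn't pull the next page.
val pageRows = (0 until rs.getAvailableWithoutFetching).map(_ => rs.one())

// Opaque cursor (null on the last page); its string form can be round-tripped
// through the client in place of a hand-rolled UUID scheme.
val cursor: PagingState = rs.getExecutionInfo.getPagingState

// Next page: resume exactly where the previous one ended.
val next = new SimpleStatement("SELECT user_id, cnt FROM precomputed_results")
next.setFetchSize(pageSize)
next.setPagingState(cursor)
val rs2 = session.execute(next)

Note that the paging state is forward-only, which fits "next page" navigation; jumping to an arbitrary page would still need a different scheme.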

Regards,
Vaquar khan
