How to use groupByKey() in spark structured streaming without aggregates


Is there a way to use the groupByKey() function in Spark Structured Streaming
without applying any aggregates?

I have a scenario like the one below, where I would like to group items by a
key without applying any aggregates.

Sample incoming data:

device_id, timestamp, value
device_1, 2018-09-28T15:49:57.2420418+00:00, 10
device_1, 2018-09-28T15:50:57.2420418+00:00, 11
device_2, 2018-09-28T15:49:57.2420418+00:00, 20
device_2, 2018-09-28T15:50:57.2420418+00:00, 21
device_3, 2018-09-28T15:49:57.2420418+00:00, 10

I would like to apply groupByKey on the field "device_id", so that I get
output like the following.

device_1, [{timestamp: 2018-09-28T15:49:57.2420418+00:00, value: 10}, {timestamp: 2018-09-28T15:50:57.2420418+00:00, value: 11}]
device_2, [{timestamp: 2018-09-28T15:49:57.2420418+00:00, value: 20}, {timestamp: 2018-09-28T15:50:57.2420418+00:00, value: 21}]
device_3, [{timestamp: 2018-09-28T15:49:57.2420418+00:00, value: 10}]
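In batch terms, the transformation I am after is equivalent to the following plain-Scala sketch (the Reading case class and field names are illustrative, chosen to mirror the sample data above):

```scala
// Hypothetical row type mirroring the sample data above.
case class Reading(deviceId: String, timestamp: String, value: Int)

val rows = Seq(
  Reading("device_1", "2018-09-28T15:49:57.2420418+00:00", 10),
  Reading("device_1", "2018-09-28T15:50:57.2420418+00:00", 11),
  Reading("device_2", "2018-09-28T15:49:57.2420418+00:00", 20),
  Reading("device_2", "2018-09-28T15:50:57.2420418+00:00", 21),
  Reading("device_3", "2018-09-28T15:49:57.2420418+00:00", 10)
)

// Group on deviceId and keep each row's (timestamp, value) pair intact --
// no aggregate is applied, the per-key rows are simply collected.
val grouped: Map[String, Seq[(String, Int)]] =
  rows.groupBy(_.deviceId)
      .map { case (id, rs) => id -> rs.map(r => (r.timestamp, r.value)) }
```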

I have also tried using collect_list() in the aggregate expression of
groupByKey, but that takes noticeably longer to process the datasets.

Also, since we are aggregating, we can only use the 'Complete' or 'Update'
output modes, but 'Append' mode looks more suitable for our use case.

I have also looked at the groupByKey(numPartitions) and reduceByKey()
functions available in the Direct DStream API, which give results in the form
of (String, Iterable[(String, Int)]) without doing any aggregates.

Is there something similar available in Structured Streaming?
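For what it's worth, the closest analogue I have found in the typed Dataset API is groupByKey followed by the mapGroups/flatMapGroups family; for a streaming query in Append mode the stateful variant flatMapGroupsWithState appears to be the intended route. Below, the per-group logic is written as a plain, standalone Scala function so it can be exercised without a SparkSession; the streaming wiring is only sketched in a comment, and the Event type and names are illustrative assumptions:

```scala
// Hypothetical event type matching the sample schema.
case class Event(deviceId: String, timestamp: String, value: Int)

// Per-group formatting: turn one key's rows into the
// "device_id -> [(timestamp, value), ...]" shape, with no aggregate applied.
def formatGroup(deviceId: String, events: Iterator[Event]): (String, Seq[(String, Int)]) =
  deviceId -> events.map(e => (e.timestamp, e.value)).toSeq

// Streaming wiring sketch (not runnable here without a SparkSession;
// flatMapGroupsWithState is from the spark-sql KeyValueGroupedDataset API):
//   events.groupByKey(_.deviceId)
//         .flatMapGroupsWithState(OutputMode.Append, GroupStateTimeout.NoTimeout) {
//           (key: String, rows: Iterator[Event], state: GroupState[Unit]) =>
//             Iterator(formatGroup(key, rows))
//         }

val out = formatGroup("device_1", Iterator(
  Event("device_1", "2018-09-28T15:49:57.2420418+00:00", 10),
  Event("device_1", "2018-09-28T15:50:57.2420418+00:00", 11)
))
```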
