(PySpark) Nested lists after reduceByKey

I'm sure this is something very simple but I didn't find anything related to this.

My code is simple:
    stream = stream.map(mapper)
    stream = stream.reduceByKey(reducer)

Nothing extraordinary. Yet the output looks like this:
  key1  value1
  key2  [value2, value3]
  key3  [[value4, value5], value6]

And so on. Sometimes I get a flat value (when there is only a single one for the key). Sometimes I get nested lists that can be really deep (on my simple test data they were 3 levels deep).
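For what it's worth, this is exactly the shape you get when the reduce function pairs its two arguments into a list, e.g. `lambda a, b: [a, b]` (an assumption on my part, since the reducer isn't shown). A plain-Python simulation of pairwise reduction:

```python
from functools import reduce

# Hypothetical reducer that pairs its two arguments instead of
# concatenating lists -- an assumption about the original code.
pair_reducer = lambda a, b: [a, b]

# Two values give one flat pair; three values give a nested list --
# the same pattern as in the output above.
print(reduce(pair_reducer, ['value2', 'value3']))
# ['value2', 'value3']
print(reduce(pair_reducer, ['value4', 'value5', 'value6']))
# [['value4', 'value5'], 'value6']
```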

I tried searching through the sources for something like 'flat', but found only the flatMap method, which (as I understand it) is not what I need.

I don't know why those lists are nested. My guess is that the values were handled by different processes (workers?) and then joined together without flattening.

Of course, I can write Python code that will unfold and flatten such a list. But I believe this is not a normal situation; I'd expect almost everybody to need flat output.
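The unfolding helper is short, for what it's worth; here is a sketch of a recursive flatten (`flatten` is my own name, not a PySpark API):

```python
def flatten(value):
    """Recursively unfold arbitrarily nested lists into one flat list."""
    if not isinstance(value, list):
        return [value]  # a bare, non-list value becomes a one-item list
    result = []
    for item in value:
        result.extend(flatten(item))  # recurse into nested lists at any depth
    return result

print(flatten([['value4', 'value5'], 'value6']))
# ['value4', 'value5', 'value6']
```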

itertools.chain stops unfolding at the first non-iterable value it finds. In other words, it still needs some extra coding (see the previous paragraph).
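To illustrate that limitation: itertools.chain.from_iterable flattens exactly one level and raises on a bare non-iterable element, so it can't unfold mixed, arbitrarily nested values on its own:

```python
from itertools import chain

# One level of nesting is fine:
print(list(chain.from_iterable([[1, 2], [3]])))
# [1, 2, 3]

# But a bare non-iterable element makes it fail:
try:
    list(chain.from_iterable([[1, 2], 3]))
except TypeError:
    print("TypeError: chain needs every element to be iterable")
```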

So: how do I flatten the list using PySpark's native methods?
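For reference, the usual pattern (a sketch, not tested against the original code) is to make the reduce step associative over lists: emit a one-element list from the mapper and concatenate in the reducer, roughly `stream.map(lambda kv: (kv[0], [kv[1]])).reduceByKey(lambda a, b: a + b)`. Simulated in plain Python, list concatenation never nests, no matter how the partial results are grouped:

```python
from functools import reduce

# What the mapper would emit: each value wrapped in a one-element list.
values = [['value4'], ['value5'], ['value6']]

# Concatenation is associative, so partial results combined on
# different workers in any order still come out flat.
print(reduce(lambda a, b: a + b, values))
# ['value4', 'value5', 'value6']
```

Alternatively, groupByKey collects all values for a key into one sequence, though it shuffles more data than a reduceByKey that pre-combines per partition.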