(PySpark) Nested lists after reduceByKey

andyN
I'm sure this is something very simple, but I couldn't find anything related to it.

My code is simple:
    ...
    stream = stream.map(mapper)
    stream = stream.reduceByKey(reducer)
    ...

Nothing extraordinary. Yet the output looks like this:
  ...
  key1  value1
  key2  [value2, value3]
  key3  [[value4, value5], value6]
  ...

And so on. Sometimes I get a flat value (when a key has a single value), and sometimes nested lists that can get really deep (three levels on my simple test data).

I tried searching through the sources for something like 'flat', but found only the flatMap method, which (as I understand it) is not what I need.

I don't know why those lists are nested. My guess is that the values were handled by different processes (workers?) and then combined without flattening.
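
For what it's worth, here is a self-contained local sketch that reproduces that exact shape of output (the lambda below is a stand-in, not my actual reducer): since reduceByKey applies the reducer pairwise, a reducer that pairs its two arguments into a list never runs for a single-value key, wraps two values once, and nests deeper as more values arrive:

    from pyspark import SparkContext

    sc = SparkContext("local[2]", "nested-lists-repro")

    pairs = sc.parallelize([
        ("key1", "value1"),
        ("key2", "value2"), ("key2", "value3"),
        ("key3", "value4"), ("key3", "value5"), ("key3", "value6"),
    ])

    # Pairing reducer: applied pairwise, so the nesting depth depends on
    # how many values a key has and on the order of the partial reduces.
    reducer = lambda a, b: [a, b]

    print(pairs.reduceByKey(reducer).collect())
    # Something like (exact nesting may vary with partitioning):
    # [('key1', 'value1'),
    #  ('key2', ['value2', 'value3']),
    #  ('key3', [['value4', 'value5'], 'value6'])]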

Of course, I can write code in Python that unfolds and flattens such a list. But I believe this is not a normal situation; I think almost everybody needs flat output.
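
Something along these lines is what I mean; a quick sketch (flatten is just a name I made up, and it assumes the nested values are plain lists, not strings or other iterables):

    def flatten(value):
        # Recursively unfold arbitrarily nested lists into one flat list.
        if not isinstance(value, list):
            return [value]
        result = []
        for item in value:
            result.extend(flatten(item))
        return result

    print(flatten([['value4', 'value5'], 'value6']))
    # ['value4', 'value5', 'value6']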

itertools.chain doesn't quite work either: it only flattens one level and trips over the first bare, non-iterable value it meets. In other words, it still needs some extra coding (see the previous paragraph).
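
To illustrate, with made-up integer values standing in for my real ones:

    from itertools import chain

    nested = [[1, 2], 3]                 # a sublist plus a bare value
    try:
        print(list(chain(*nested)))      # tries iter(3) after yielding 1, 2
    except TypeError as e:
        print(e)                         # 'int' object is not iterable

    # Even with lists only, chain flattens a single level:
    print(list(chain(*[[1, 2], [3, [4, 5]]])))   # [1, 2, 3, [4, 5]]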

So, how do I flatten the lists using PySpark's native methods?

Thanks