Distributed streaming quantiles with PySpark

Uri Laserson
Hi everyone,

I implemented a version of distributed streaming quantiles for PySpark.  It uses a count-min sketch approach.  You can find the code here:


Thought it might be of interest...

Uri

--
Uri Laserson, PhD
Data Scientist, Cloudera
Twitter/GitHub: @laserson
+1 617 910 0447
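
For readers who want a concrete picture of the idea, here is a minimal sketch of mergeable quantile estimation built on count-min sketches. It is not the code linked in the post. It assumes one common construction (one count-min sketch per dyadic level over a bounded integer range), and all names (`CountMin`, `QuantileSketch`, `build`) and parameters are hypothetical. Because sketches merge by adding their tables, each partition can be summarized independently and the results combined; in PySpark that would presumably look like `rdd.mapPartitions(lambda it: [build(it)]).reduce(QuantileSketch.merge)`, though the example below runs without Spark.

```python
import random
from functools import reduce

PRIME = (1 << 61) - 1  # Mersenne prime for the (a*x + b) mod p hash family


class CountMin:
    """Count-min sketch over non-negative integer keys."""

    def __init__(self, width=2048, depth=4, seed=0):
        rng = random.Random(seed)  # fixed seed so every worker uses the same hashes
        self.width = width
        self.coeffs = [(rng.randrange(1, PRIME), rng.randrange(PRIME))
                       for _ in range(depth)]
        self.table = [[0] * width for _ in range(depth)]

    def _cols(self, key):
        return [((a * key + b) % PRIME) % self.width for a, b in self.coeffs]

    def add(self, key, count=1):
        for row, col in zip(self.table, self._cols(key)):
            row[col] += count

    def estimate(self, key):
        # Collisions only ever add to a cell, so the minimum is an upper bound.
        return min(row[col] for row, col in zip(self.table, self._cols(key)))

    def merge(self, other):
        # Sketches built with the same seed merge by element-wise addition.
        for mine, theirs in zip(self.table, other.table):
            for i, c in enumerate(theirs):
                mine[i] += c
        return self


class QuantileSketch:
    """Quantiles over integers in [0, 2**bits), one CountMin per dyadic level."""

    def __init__(self, bits=10, width=2048, depth=4, seed=0):
        self.bits = bits
        self.n = 0
        self.levels = [CountMin(width, depth, seed + lvl)
                       for lvl in range(bits + 1)]

    def add(self, value):
        self.n += 1
        for lvl, cms in enumerate(self.levels):
            cms.add(value >> lvl)  # index of the dyadic interval at this level
        return self

    def merge(self, other):
        self.n += other.n
        for mine, theirs in zip(self.levels, other.levels):
            mine.merge(theirs)
        return self

    def rank(self, value):
        """Estimated number of items <= value (never an underestimate)."""
        upper = value + 1  # cover [0, upper) with at most one interval per level
        total = 0
        for lvl, cms in enumerate(self.levels):
            if (upper >> lvl) & 1:
                total += cms.estimate((upper >> lvl) - 1)
        return total

    def quantile(self, q):
        """Binary search for the smallest value whose estimated rank >= q * n."""
        lo, hi = 0, (1 << self.bits) - 1
        target = q * self.n
        while lo < hi:
            mid = (lo + hi) // 2
            if self.rank(mid) >= target:
                hi = mid
            else:
                lo = mid + 1
        return lo


def build(values, **kwargs):
    """Summarize one partition of values into a sketch."""
    sketch = QuantileSketch(**kwargs)
    for v in values:
        sketch.add(v)
    return sketch


# Two "partitions" summarized separately, then merged.
partitions = [range(0, 400), range(400, 1000)]
merged = reduce(QuantileSketch.merge, (build(p) for p in partitions))
median = merged.quantile(0.5)
```

Since count-min estimates only err upward, the estimated rank is never below the true rank, so the returned quantile can be slightly low but never too high. Wider tables shrink that error at the cost of memory.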

Re: Distributed streaming quantiles with PySpark

MLnick
Thanks Uri. I came across that and took a quick look; it seems interesting.

On a related note, it would be quite cool to have a port of Algebird to Python (or at least count-min, top-k and HLL, perhaps a Bloom filter), with monoid-style structures for use in PySpark...
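
To make "monoid-style" concrete, here is a rough sketch of what such a structure could look like in Python, using a Bloom filter as the example. This is not an existing port: the class and its `zero`, `create` and `plus` methods are hypothetical names, loosely following the zero/plus vocabulary of a monoid. The point is that an associative `plus` with an identity element lets partial results be combined in any grouping, which is what a distributed fold needs. With an RDD this would presumably be `rdd.map(empty.create).fold(empty, BloomFilter.plus)`; the example below uses plain `reduce` so it runs without Spark.

```python
import hashlib
from functools import reduce


class BloomFilter:
    """Immutable Bloom filter: `plus` is bitwise OR, `zero` is the empty filter."""

    def __init__(self, num_bits=1024, num_hashes=3, bits=0):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bits  # the bit array, stored as a Python int

    def _positions(self, item):
        # hashlib digests are stable across processes, unlike the built-in
        # hash() for strings, which matters when workers must agree.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def zero(self):
        """Identity element: a filter containing nothing."""
        return BloomFilter(self.num_bits, self.num_hashes)

    def create(self, item):
        """Filter containing exactly one item."""
        bits = 0
        for pos in self._positions(item):
            bits |= 1 << pos
        return BloomFilter(self.num_bits, self.num_hashes, bits)

    def plus(self, other):
        """Associative, commutative combine: union of the two filters."""
        return BloomFilter(self.num_bits, self.num_hashes,
                           self.bits | other.bits)

    def __contains__(self, item):
        return all((self.bits >> pos) & 1 for pos in self._positions(item))


empty = BloomFilter()
words = ["spark", "python", "algebird", "sketch"]
singles = [empty.create(w) for w in words]

# Combine two halves separately, as two partitions would, then merge them.
left = reduce(BloomFilter.plus, singles[:2], empty)
right = reduce(BloomFilter.plus, singles[2:], empty)
combined = left.plus(right)
```

Count-min and HyperLogLog fit the same shape, with element-wise addition and element-wise maximum, respectively, in place of the bitwise OR.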


On Sat, Feb 1, 2014 at 2:34 AM, Uri Laserson <[hidden email]> wrote:

Hi everyone,

I implemented a version of distributed streaming quantiles for PySpark.  It uses a count-min sketch approach.  You can find the code here:


Thought it might be of interest...

Uri
