> Hi there,
> Could someone that attended the sprint send a rough summary ?
> I'd be particularly interested about the tested approaches, those that
> didn't work, those that seem promising and what the next steps could be ...
We started with a general discussion on PySpark. It naturally features
a Python wrapper to the MLlib Scala distributed machine learning
library that is optimized to work on Spark. However, Python users
might still want to leverage existing numpy / scipy tools for some of
their machine learning tasks.
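For instance, here is a minimal sketch of the MLlib route from Python
(assuming a running SparkContext `sc` and the LabeledPoint-based
Python API; the exact names vary across Spark versions):

    from pyspark.mllib.classification import LogisticRegressionWithSGD
    from pyspark.mllib.regression import LabeledPoint

    # Each RDD element is a LabeledPoint (label + feature vector);
    # the distributed training itself happens on the Scala side.
    points = sc.parallelize([
        LabeledPoint(0.0, [0.0, 1.0]),
        LabeledPoint(1.0, [1.0, 0.0]),
    ])
    model = LogisticRegressionWithSGD.train(points, iterations=10)
    print(model.predict([1.0, 0.0]))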
The main difficulty in using numpy-aware tools efficiently is that
Spark presents the data to the workers as an iterator over a large,
possibly cluster-partitioned collection of elements called an RDD. If
used naively, one would load individual rows (1D numpy arrays) as
elements of an RDD to represent the content of a 2D data matrix. This
is not efficient, both because of the communication overhead between
the Scala and Python workers and because it prevents efficient BLAS
operations that involve several rows at a time, such as DGEMM calls
via numpy.dot for instance.
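To make the inefficiency concrete, this is what the naive
representation looks like (hypothetical snippet, again assuming a
running SparkContext `sc`):

    import numpy as np

    X = np.random.randn(100000, 50)      # the 2D data matrix

    # Naive: one RDD element per row, i.e. 100000 tiny 1D arrays that
    # each have to be serialized between the Scala and Python workers.
    rows = sc.parallelize(list(X))

    # numpy operations now run row by row instead of as a single
    # BLAS DGEMM call on a contiguous 2D block.
    w = np.random.randn(50)
    predictions = rows.map(lambda row: np.dot(row, w)).collect()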
So this first issue was tackled by writing a block_rdd helper function
to concatenate a bunch of rows (e.g. 1D numpy arrays or lists of
Python dicts) into chunked 2D numpy arrays or pandas DataFrames.
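A minimal sketch of the idea behind such a helper, reusing the `rows`
RDD from the snippet above (the actual block_rdd signature may
differ):

    import numpy as np

    def block_rdd(rdd, block_size=1000):
        """Group the rows of each partition into 2D numpy blocks."""
        def to_blocks(iterator):
            block = []
            for row in iterator:
                block.append(row)
                if len(block) == block_size:
                    yield np.vstack(block)
                    block = []
            if block:
                yield np.vstack(block)
        return rdd.mapPartitions(to_blocks)

    blocks = block_rdd(rows, block_size=1000)
    # Each element is now a (<= 1000, n_features) array, so BLAS-backed
    # operations such as np.dot(block, w) work on many rows at once.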
This blocked representation makes it possible to train linear models
incrementally in a much more efficient way, as done in . Model
averaging is done via a reduction, as sketched below.
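Here is a hedged sketch of what that can look like with scikit-learn's
SGDClassifier; `blocked_xy` is a hypothetical RDD of (X_block,
y_block) tuples built with the blocking helper, and the details are
assumptions rather than the exact sprint code:

    import numpy as np
    from sklearn.linear_model import SGDClassifier

    classes = np.array([0, 1])

    def fit_partition(blocks):
        # One model per partition, updated block by block.
        clf = SGDClassifier()
        for X_block, y_block in blocks:
            clf.partial_fit(X_block, y_block, classes=classes)
        yield clf

    def average_models(a, b):
        # Average the linear model parameters; note that repeated
        # pairwise averaging in a reduction only approximates a
        # uniform mean over all partitions.
        a.coef_ = (a.coef_ + b.coef_) / 2.0
        a.intercept_ = (a.intercept_ + b.intercept_) / 2.0
        return a

    model = blocked_xy.mapPartitions(fit_partition).reduce(average_models)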
We also discussed how we could make it easier to plot the distribution
of data stored in an RDD, and came up with the idea of computing
histograms on the Spark side while exposing them through the same API
as the numpy.histogram function.
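A minimal sketch of that idea, computing per-partition histograms with
shared bin edges and summing the counts (an assumed helper, not
necessarily the sprint implementation):

    import numpy as np

    def histogram(rdd, bins=10, range=None):
        """numpy.histogram-like API for an RDD of scalars."""
        if range is None:
            # Unlike numpy, the data is distributed, so the
            # data-dependent default range costs extra passes.
            range = (rdd.reduce(min), rdd.reduce(max))
        edges = np.linspace(range[0], range[1], bins + 1)

        def partition_hist(iterator):
            # Count within each partition using the shared edges ...
            counts, _ = np.histogram(list(iterator), bins=edges)
            yield counts

        # ... then sum the per-partition counts elementwise.
        return rdd.mapPartitions(partition_hist).reduce(np.add), edges

As with numpy.histogram, the returned (counts, edges) pair can then be
plotted on the driver, e.g. with matplotlib's bar or step functions.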
Have a look at the tests for basic usage examples of all of the above.
There is also some high-level discussion of the scope of the project in .