Incremental (online) machine learning algorithms on ML

Lucas Chagas
Hi,

After searching the machine learning library for streaming algorithms, I
found two that fit the criteria: Streaming Linear Regression
(https://spark.apache.org/docs/latest/mllib-linear-methods.html#streaming-linear-regression)
and Streaming K-Means
(https://spark.apache.org/docs/latest/mllib-clustering.html#streaming-k-means).

However, both use the RDD-based MLlib API rather than the DataFrame-based
ML API; are there any plans to bring them to ML?

Also, is there any technical reason why there are so few incremental
algorithms in the machine learning library? There is only one algorithm
each for regression and clustering, and nothing for classification,
dimensionality reduction or feature extraction.

If there is a reason, how were those two algorithms implemented? If
there isn't, what is the general consensus on adding new online machine
learning algorithms?
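
To make concrete what "incremental" means in this question, here is a
minimal sketch of a mini-batch k-means update, modelled on the update rule
described in the Streaming K-Means documentation linked above. It is plain
Python, not Spark code: the function name `update_centers`, the `decay`
parameter name, and the way cluster weights are carried forward are
assumptions of this sketch.

```python
from typing import List, Tuple

Vector = List[float]


def squared_distance(a: Vector, b: Vector) -> float:
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def update_centers(
    centers: List[Vector],
    weights: List[float],
    batch: List[Vector],
    decay: float = 1.0,
) -> Tuple[List[Vector], List[float]]:
    """Fold one mini-batch into the running cluster centers.

    Per cluster: c_new = (c * n * decay + sum of assigned points) / (n * decay + m),
    where n is the weight accumulated so far and m is the number of batch
    points assigned to that cluster. decay=1.0 weighs all history equally;
    decay=0.0 forgets everything before the current batch.
    """
    dim = len(centers[0])
    sums = [[0.0] * dim for _ in centers]
    counts = [0] * len(centers)

    # Assign each point in the batch to its nearest current center.
    for point in batch:
        nearest = min(
            range(len(centers)),
            key=lambda k: squared_distance(point, centers[k]),
        )
        counts[nearest] += 1
        for d in range(dim):
            sums[nearest][d] += point[d]

    new_centers: List[Vector] = []
    new_weights: List[float] = []
    for k, center in enumerate(centers):
        old = weights[k] * decay      # discounted weight of past data
        total = old + counts[k]       # weight after seeing this batch
        if total == 0:
            # No history and no new points: leave the center unchanged.
            new_centers.append(list(center))
            new_weights.append(0.0)
            continue
        new_centers.append(
            [(center[d] * old + sums[k][d]) / total for d in range(dim)]
        )
        new_weights.append(total)
    return new_centers, new_weights
```

Calling `update_centers` once per arriving batch and feeding its outputs
back in as the next call's `centers` and `weights` gives a model that never
needs to revisit old data, which is the property the question is about.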

Regards,
Lucas Chagas

---------------------------------------------------------------------
To unsubscribe e-mail: [hidden email]

Re: Incremental (online) machine learning algorithms on ML

Stephen Boesch
There are several high bars to getting a new algorithm adopted.  

* It must be deemed widely useful to the community by the MLlib committers/shepherds. Algorithms contributed by larger companies, after they have demonstrated usefulness at scale for use cases likely to be encountered by many other companies, stand a better chance.
* Bandwidth for considering new algorithms is quite limited: there has been a dearth of new ones accepted since early 2015, so prioritization is a challenge.
* The code must meet high quality standards, especially with respect to testability, maintainability, computational performance, and scalability.
* The chosen algorithms and options should be well documented and include comparisons and tradeoffs with the state of the art described in relevant papers. These questions are typically asked during design and code reviews, e.g. "Did you consider the approach shown here?"
* There is also luck and timing involved. The review might start in a given month, but then reviewers become busy or higher priorities intervene, and it is unclear when reviewing will continue.
* Once all of the above is complete, there are still intricacies in integrating with a particular Spark release.

On Mon, Aug 5, 2019 at 05:58, chagas <[hidden email]> wrote:
Hi,

After searching the machine learning library for streaming algorithms, I
found two that fit the criteria: Streaming Linear Regression
(https://spark.apache.org/docs/latest/mllib-linear-methods.html#streaming-linear-regression)
and Streaming K-Means
(https://spark.apache.org/docs/latest/mllib-clustering.html#streaming-k-means).

However, both use the RDD-based API MLlib instead of the DataFrame-based
API ML; are there any plans for bringing them both to ML?

Also, is there any technical reason why there are so few incremental
algorithms on the machine learning library? There's only 1 algorithm for
regression and clustering each, with nothing for classification,
dimensionality reduction or feature extraction.

If there is a reason, how were those two algorithms implemented? If
there isn't, what is the general consensus on adding new online machine
learning algorithms?

Regards,
Lucas Chagas
