computeStats() in MLUtils will cause NaN (not a number) error

yinxusen
Hi all,

I have been testing Lasso and ridge regression in MLlib recently, and both produce Double.NaN values, while the other classification and regression methods work well.

I eventually found that Lasso and RidgeRegression call the computeStats() function to compute the mean and SD (standard deviation) used to normalize the input data. However, some of the returned SDs are zero, so normalization hits 0.0 / 0.0, which evaluates to NaN.
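To illustrate the failure, here is a minimal sketch in plain Scala (no Spark). The names NaNDemo, columnStats and standardize are hypothetical helpers that only mimic mean/SD normalization on a single column; they are not the MLlib code.

```scala
object NaNDemo {
  // Population mean and standard deviation of one feature column.
  def columnStats(xs: Array[Double]): (Double, Double) = {
    val n = xs.length.toDouble
    val mean = xs.sum / n
    val variance = xs.map(x => (x - mean) * (x - mean)).sum / n
    (mean, math.sqrt(variance))
  }

  // Naive standardization: (x - mean) / sd, with no guard for sd == 0.
  def standardize(xs: Array[Double]): Array[Double] = {
    val (mean, sd) = columnStats(xs)
    xs.map(x => (x - mean) / sd)
  }

  def main(args: Array[String]): Unit = {
    // An all-zero column has mean 0.0 and SD 0.0,
    // so every entry becomes 0.0 / 0.0.
    val emptyColumn = Array(0.0, 0.0, 0.0)
    println(standardize(emptyColumn).mkString(", "))
  }
}
```

On the JVM, dividing a Double zero by a Double zero does not throw; it quietly yields NaN, which then propagates through every later arithmetic step.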

How about setting the result directly to zero if both the divisor and the dividend are zero, and adding a smoothing factor (e.g. 1.0e-10) if the divisor alone is zero? Or does anyone have a better idea?
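A rough sketch of that rule in plain Scala, reading the second case as "the divisor (the SD) is zero but the dividend is not", which is an interpretation. SafeNormalize, safeDivide and standardize are hypothetical names, and the Epsilon value is just the example figure from this post.

```scala
object SafeNormalize {
  // Smoothing factor; 1.0e-10 is the illustrative value from the post.
  val Epsilon: Double = 1.0e-10

  // 0 / 0 maps to 0; a zero divisor with a non-zero dividend is smoothed.
  def safeDivide(dividend: Double, divisor: Double): Double =
    if (divisor == 0.0) {
      if (dividend == 0.0) 0.0 else dividend / (divisor + Epsilon)
    } else {
      dividend / divisor
    }

  // Standardize one column given its precomputed mean and SD.
  def standardize(xs: Array[Double], mean: Double, sd: Double): Array[Double] =
    xs.map(x => safeDivide(x - mean, sd))
}
```

For a column whose SD is exactly zero, every entry usually equals the mean, so the first branch is the one that matters in practice and the column is mapped to all zeros.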

Thanks!

Re: computeStats() in MLUtils will cause NaN (not a number) error

Xiangrui Meng
It happens when there are empty columns. Adding a very small smoothing
factor should help. Btw, I noticed that the variance computation there
is not numerically stable; it should use the stable method implemented
for RDD[Double]. -Xiangrui
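As a hedged illustration of what "stable" can mean here: the textbook one-pass formula E[x^2] - E[x]^2 subtracts two large, nearly equal numbers and can lose most of its precision, while a Welford-style running update avoids that. The sketch below is plain Scala with hypothetical names (StableVariance, Acc); it is not the MLlib or Spark source. In Spark itself, calling stats() on an RDD[Double] returns a StatCounter that exposes mean and variance.

```scala
object StableVariance {
  // Running count, mean, and sum of squared deviations from the mean (m2).
  final case class Acc(n: Long = 0L, mean: Double = 0.0, m2: Double = 0.0) {
    // Welford-style update with one new observation.
    def add(x: Double): Acc = {
      val n1 = n + 1
      val delta = x - mean
      val mean1 = mean + delta / n1
      Acc(n1, mean1, m2 + delta * (x - mean1))
    }

    // Population variance; 0.0 for an empty accumulator.
    def variance: Double = if (n == 0L) 0.0 else m2 / n
  }

  def variance(xs: Iterable[Double]): Double =
    xs.foldLeft(Acc())((acc, x) => acc.add(x)).variance

  // Textbook formula E[x^2] - E[x]^2, shown only for comparison;
  // it is prone to cancellation when the mean is large relative to the spread.
  def naiveVariance(xs: Seq[Double]): Double = {
    val n = xs.size.toDouble
    val mean = xs.sum / n
    xs.map(x => x * x).sum / n - mean * mean
  }
}
```

Comparing the two functions on values such as 1.0e9 + 1, 1.0e9 + 2, 1.0e9 + 3, 1.0e9 + 4 is a quick way to see the difference: the true population variance is 1.25 regardless of the offset.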

On Tue, Jan 28, 2014 at 5:22 AM, yinxusen <[hidden email]> wrote:

> Hi all,
>
> These days I test Lasso and ridge regression in MLlib, and I find an error
> of Double.Nan. While other classification and regression methods do very
> well.
>
> Finally I find that Lasso and RidgeRegression call computeStats() function
> to compute mean and SD (standard deviation) for normalizing input data.
> However, some returned SDs are zeroes. So when encountering 0.0 / 0.0, there
> will be a Nan error.
>
> How about setting directly to zero if both the divisor and dividend are
> zeroes, and adding a smoothing factor (e.g. 1.0e-10) if the dividend alone
> is zero? Or anyone have better ideas ?
>
> Thanks !
>
>
>
> --
> View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/computeStats-in-MLUtils-will-cause-Nan-not-a-number-error-tp980.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.

Re: computeStats() in MLUtils will cause NaN (not a number) error

yinxusen
Yep, thanks Xiangrui. That was my fault: I wrote a naive function to transform my sparse input into a dense one so that I could use the MLlib interface, and I forgot to remove the all-zero columns. It's really a pitfall.
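For anyone hitting the same pitfall, a minimal sketch of dropping all-zero columns from dense rows before training. It is plain Scala on local collections, with hypothetical names (DropEmptyColumns, nonEmptyColumns, project), and it assumes every row has the same length.

```scala
object DropEmptyColumns {
  // Indices of columns that contain at least one non-zero entry.
  def nonEmptyColumns(rows: Seq[Array[Double]]): Array[Int] = {
    val numCols = if (rows.isEmpty) 0 else rows.head.length
    (0 until numCols).filter(j => rows.exists(r => r(j) != 0.0)).toArray
  }

  // Keep only the given columns in every row, preserving their order.
  def project(rows: Seq[Array[Double]], keep: Array[Int]): Seq[Array[Double]] =
    rows.map(r => keep.map(j => r(j)))
}
```

The same column indices would need to be applied to any data scored later. Note that this only removes all-zero columns; a constant non-zero column also has an SD of zero.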


2014-01-29 Xiangrui Meng <[hidden email]>:
> It happens when there are empty columns. Adding a very small smoothing
> factor should help. Btw, I notice that the computation of variance
> there is not stable, which should use the stable method implemented in
> RDD[Double]. -Xiangrui

--
Best Regards
-----------------------------------
Xusen Yin    尹绪森
Beijing Key Laboratory of Intelligent Telecommunications Software and Multimedia
Beijing University of Posts & Telecommunications
Intel Labs China