MLlib solving linear regression with sparse inputs


im281
This post has NOT been accepted by the mailing list yet.
I want to solve the linear regression problem using Spark with huge matrices:

Ax = b

using least squares:

x = Inverse(A-transpose * A) * A-transpose * b

The A matrix is a large sparse matrix (as is the b vector).
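
For reference, a short derivation of where the least-squares formula above comes from. It assumes A has full column rank, so that A-transpose * A is invertible; the thread does not state this.

```latex
\min_{x} \|Ax - b\|_2^2
\;\Longrightarrow\;
\nabla_x \|Ax - b\|_2^2 = 2A^{\top}(Ax - b) = 0
\;\Longrightarrow\;
A^{\top}A\,x = A^{\top}b
\;\Longrightarrow\;
x = (A^{\top}A)^{-1}A^{\top}b
```

The third expression (the normal equations) is a linear system in x, so it can be solved without ever forming the inverse explicitly.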

I have pondered several solutions to the Ax = b problem including:

1) Solving the problem above directly: compute A-transpose * A, take its inverse, multiply the inverse by A-transpose, and multiply the result by b, which gives the solution vector x

2) Using an iterative solver (no need to take the inverse)
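
As a rough illustration of option 2, here is a minimal plain-Java sketch of an iterative least-squares solver (CGLS, i.e. conjugate gradient applied to the normal equations) over a matrix stored as row, column, value triples. It is not Spark or MLlib code: the class and method names (`CglsSketch`, `solve`, and so on) are made up for this example, and everything runs on a single machine. It only needs products with A and A-transpose, so A-transpose * A is never formed or inverted.

```java
import java.util.Arrays;

// Hypothetical single-machine sketch; names are illustrative, not an MLlib API.
class CglsSketch {
    // y = A * x, where entry k of A is (rows[k], cols[k], vals[k]).
    static double[] multiply(int[] rows, int[] cols, double[] vals, double[] x, int numRows) {
        double[] y = new double[numRows];
        for (int k = 0; k < vals.length; k++) y[rows[k]] += vals[k] * x[cols[k]];
        return y;
    }

    // y = A-transpose * r, using the same triples.
    static double[] multiplyTranspose(int[] rows, int[] cols, double[] vals, double[] r, int numCols) {
        double[] y = new double[numCols];
        for (int k = 0; k < vals.length; k++) y[cols[k]] += vals[k] * r[rows[k]];
        return y;
    }

    static double dot(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) sum += a[i] * b[i];
        return sum;
    }

    // Minimizes ||Ax - b||^2 starting from x = 0.
    // Stops after maxIter iterations or when ||A-transpose * residual||^2 <= tol.
    static double[] solve(int[] rows, int[] cols, double[] vals, double[] b,
                          int numCols, int maxIter, double tol) {
        int numRows = b.length;
        double[] x = new double[numCols];
        double[] r = b.clone();                                        // residual b - A x
        double[] s = multiplyTranspose(rows, cols, vals, r, numCols);  // A-transpose * r
        double[] p = s.clone();                                        // search direction
        double gamma = dot(s, s);
        for (int it = 0; it < maxIter && gamma > tol; it++) {
            double[] q = multiply(rows, cols, vals, p, numRows);
            double alpha = gamma / dot(q, q);
            for (int j = 0; j < numCols; j++) x[j] += alpha * p[j];
            for (int i = 0; i < numRows; i++) r[i] -= alpha * q[i];
            s = multiplyTranspose(rows, cols, vals, r, numCols);
            double gammaNew = dot(s, s);
            double beta = gammaNew / gamma;
            for (int j = 0; j < numCols; j++) p[j] = s[j] + beta * p[j];
            gamma = gammaNew;
        }
        return x;
    }

    public static void main(String[] args) {
        // A = [[1, 0], [0, 2], [1, 1]] in coordinate form, b = A * (1, 1).
        int[] rows = {0, 1, 2, 2};
        int[] cols = {0, 1, 0, 1};
        double[] vals = {1.0, 2.0, 1.0, 1.0};
        double[] b = {1.0, 2.0, 2.0};
        System.out.println(Arrays.toString(solve(rows, cols, vals, b, 2, 100, 1e-18)));
    }
}
```

On a cluster, the two matrix-vector products are the parts that would have to be distributed; that porting step is not shown here.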

My question is:

What is the best way to solve this problem using the MLlib libraries, in Java, with RDDs and Spark?

Is there any code as an example? Has anyone done this?





The code to take in data represented as a coordinate matrix and perform the transposition and multiplication is shown below, but I would still need to take the inverse if I use this strategy:

// Imports needed by this snippet; sc is assumed to be an existing JavaSparkContext and file the input path
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.function.Function;
import org.apache.spark.mllib.linalg.distributed.BlockMatrix;
import org.apache.spark.mllib.linalg.distributed.CoordinateMatrix;
import org.apache.spark.mllib.linalg.distributed.MatrixEntry;

// Read coordinate matrix from text or database
JavaRDD<String> fileA = sc.textFile(file);

// Map each "row,column,value" line (sparse matrix in coordinate form) to a MatrixEntry
JavaRDD<MatrixEntry> matrixA = fileA.map(new Function<String, MatrixEntry>() {
    public MatrixEntry call(String x) {
        String[] fields = x.split(",");
        long i = Long.parseLong(fields[0]);
        long j = Long.parseLong(fields[1]);
        double value = Double.parseDouble(fields[2]);
        return new MatrixEntry(i, j, value);
    }
});

// Coordinate matrix from sparse data
CoordinateMatrix cooMatrixA = new CoordinateMatrix(matrixA.rdd());

// Create block matrix
BlockMatrix matA = cooMatrixA.toBlockMatrix();

// A-transpose * A as a block matrix (square)
BlockMatrix ata = matA.transpose().multiply(matA);

// Print A, its transpose, and A-transpose * A.
// toLocalMatrix() collects the whole matrix to the driver, so this only works for small inputs.
System.out.println(matA.toLocalMatrix().toString());
System.out.println(matA.transpose().toLocalMatrix().toString());
System.out.println(ata.toLocalMatrix().toString());

// Entries of A-transpose * A in coordinate form
JavaRDD<MatrixEntry> entries = ata.toCoordinateMatrix().entries().toJavaRDD();


Re: mLIb solving linear regression with sparse inputs

Robineast
Is there any reason why you can’t use the built-in linear regression? See, e.g., http://spark.apache.org/docs/latest/ml-classification-regression.html#regression or http://spark.apache.org/docs/latest/mllib-linear-methods.html#linear-least-squares-lasso-and-ridge-regression

-------------------------------------------------------------------------------
Robin East
Spark GraphX in Action Michael Malak and Robin East
Manning Publications Co.





On 3 Nov 2016, at 16:08, im281 [via Apache Spark User List] <[hidden email]> wrote:

I want to solve the linear regression problem using Spark with huge matrices: [...]

Re: mLIb solving linear regression with sparse inputs

im281
I would like to use it. But how do I do the following:
1) Read sparse data (from text or a database)
2) Pass the sparse data to the LinearRegression class?

For example:

Sparse matrix A
row, column, value
0,0,.42
0,1,.28
0,2,.89
1,0,.83
1,1,.34
1,2,.42
2,0,.23
3,0,.42
3,1,.98
3,2,.88
4,0,.23
4,1,.36
4,2,.97

Sparse vector b
row, column, value
0,2,.89
1,2,.42
3,2,.88
4,2,.97

Solve Ax = b???


Re: mLIb solving linear regression with sparse inputs

Robineast
Here’s a way of creating sparse vectors in MLlib:

import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.rdd.RDD

val rdd = sc.textFile("A.txt").map(line => line.split(",")).
     map(ary => (ary(0).toInt, ary(1).toInt, ary(2).toDouble))

val pairRdd: RDD[(Int, (Int, Int, Double))] = rdd.map(el => (el._1, el))

val create = (first: (Int, Int, Double)) => (Array(first._2), Array(first._3))
val combine = (head: (Array[Int], Array[Double]), tail: (Int, Int, Double)) => (head._1 :+ tail._2, head._2 :+ tail._3)
val merge = (a: (Array[Int], Array[Double]), b: (Array[Int], Array[Double])) => (a._1 ++ b._1, a._2 ++ b._2)

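// combineByKey does not guarantee column order, and a sparse vector needs strictly increasing indices,
// so pass (index, value) pairs: this overload of Vectors.sparse sorts them by index
val A = pairRdd.combineByKey(create, combine, merge).map(el => Vectors.sparse(3, el._2._1.zip(el._2._2).toSeq))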

If you have a separate file of b’s then you would need to manipulate this slightly to join the b’s to the A RDD and then create LabeledPoints. I guess there is a way of doing this using the newer ML interfaces but it’s not particularly obvious to me how.

One point: in the example you give, the b’s are exactly the same as column 2 of the A matrix. I presume this is just a quickly hacked-together example, because that would give a trivial result.
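
For readers who want the same grouping idea in Java, the following is a minimal plain-Java sketch of what the Scala snippet above does: collect row, column, value triples by row into sorted index and value arrays, and attach the matching b entry as the label. It deliberately uses only `java.util`, not Spark; `SparseRowsSketch`, `LabeledSparseRow` and `build` are hypothetical names, and treating a row with no b entry as label 0 is an assumption based on b being described as sparse.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical single-machine analogue of the combineByKey step; not an MLlib API.
class SparseRowsSketch {
    /** One sparse row of A (strictly increasing column indices) together with its label b_i. */
    static final class LabeledSparseRow {
        final int[] indices;
        final double[] values;
        final double label;

        LabeledSparseRow(int[] indices, double[] values, double label) {
            this.indices = indices;
            this.values = values;
            this.label = label;
        }
    }

    /**
     * Groups "row,column,value" lines by row and pairs each row with b[row].
     * Assumption: a row missing from b gets label 0.0.
     */
    static Map<Integer, LabeledSparseRow> build(List<String> aLines, Map<Integer, Double> b) {
        // row -> (column -> value); the inner TreeMap keeps columns sorted.
        Map<Integer, TreeMap<Integer, Double>> byRow = new TreeMap<>();
        for (String line : aLines) {
            String[] fields = line.split(",");
            int row = Integer.parseInt(fields[0].trim());
            int col = Integer.parseInt(fields[1].trim());
            double value = Double.parseDouble(fields[2].trim());
            byRow.computeIfAbsent(row, k -> new TreeMap<>()).put(col, value);
        }

        Map<Integer, LabeledSparseRow> result = new TreeMap<>();
        for (Map.Entry<Integer, TreeMap<Integer, Double>> rowEntry : byRow.entrySet()) {
            TreeMap<Integer, Double> columns = rowEntry.getValue();
            int[] indices = new int[columns.size()];
            double[] values = new double[columns.size()];
            int k = 0;
            for (Map.Entry<Integer, Double> c : columns.entrySet()) {
                indices[k] = c.getKey();
                values[k] = c.getValue();
                k++;
            }
            double label = b.getOrDefault(rowEntry.getKey(), 0.0);
            result.put(rowEntry.getKey(), new LabeledSparseRow(indices, values, label));
        }
        return result;
    }
}
```

Each `LabeledSparseRow` holds the same information a labeled point with a sparse feature vector would: a label plus the non-zero columns of one row of A.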

-------------------------------------------------------------------------------
Robin East
Spark GraphX in Action Michael Malak and Robin East
Manning Publications Co.





On 3 Nov 2016, at 18:12, im281 [via Apache Spark User List] <[hidden email]> wrote:

I would like to use it. But how do I do the following: [...]

Re: mLIb solving linear regression with sparse inputs

im281

Thank you! Would you happen to have this code in Java?
This is extremely helpful!
Iman




On Sun, Nov 6, 2016 at 3:35 AM -0800, "Robineast [via Apache Spark User List]" <[hidden email]> wrote:

Here’s a way of creating sparse vectors in MLlib: [...]

Re: mLIb solving linear regression with sparse inputs

im281
In reply to this post by Robineast
Hi Robin,
It looks like the linear regression model takes in a dataset, not a matrix? It would be helpful for this example if you could set up the whole problem end to end, using one of the columns of the matrix as b. So A is a sparse matrix and b is a sparse vector.
Best regards.
Iman
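
On the dataset-versus-matrix question: each row of A together with its entry of b can be read as one (label, features) example, which is one way to see how a regression dataset encodes Ax = b. The following is a minimal plain-Java sketch of the whole small problem end to end, using columns 0 and 1 of the example matrix from this thread as A and column 2 as b. It is not the MLlib LinearRegression API; `NormalEquationsSketch` and its methods are names invented here, dense arrays are used for brevity, missing entries are assumed to be 0, and there is no intercept term.

```java
import java.util.Arrays;

// Hypothetical single-machine sketch; names are illustrative, not an MLlib API.
class NormalEquationsSketch {
    /** Solves the n-by-n system M x = v by Gaussian elimination with partial pivoting. */
    static double[] solveLinearSystem(double[][] m, double[] v) {
        int n = v.length;
        double[][] a = new double[n][n + 1];           // augmented matrix [M | v]
        for (int i = 0; i < n; i++) {
            System.arraycopy(m[i], 0, a[i], 0, n);
            a[i][n] = v[i];
        }
        for (int col = 0; col < n; col++) {
            int pivot = col;
            for (int r = col + 1; r < n; r++) {
                if (Math.abs(a[r][col]) > Math.abs(a[pivot][col])) pivot = r;
            }
            if (Math.abs(a[pivot][col]) < 1e-12) {
                throw new IllegalArgumentException("A-transpose * A is singular or nearly so");
            }
            double[] tmp = a[col]; a[col] = a[pivot]; a[pivot] = tmp;
            for (int r = col + 1; r < n; r++) {
                double factor = a[r][col] / a[col][col];
                for (int c = col; c <= n; c++) a[r][c] -= factor * a[col][c];
            }
        }
        double[] x = new double[n];
        for (int i = n - 1; i >= 0; i--) {             // back substitution
            double sum = a[i][n];
            for (int c = i + 1; c < n; c++) sum -= a[i][c] * x[c];
            x[i] = sum / a[i][i];
        }
        return x;
    }

    /** Accumulates A-transpose * A and A-transpose * b one row at a time, then solves for x. */
    static double[] leastSquares(double[][] rows, double[] b) {
        int n = rows[0].length;
        double[][] ata = new double[n][n];
        double[] atb = new double[n];
        for (int i = 0; i < rows.length; i++) {
            for (int j = 0; j < n; j++) {
                if (rows[i][j] == 0.0) continue;       // zero entries contribute nothing
                atb[j] += rows[i][j] * b[i];
                for (int k = 0; k < n; k++) ata[j][k] += rows[i][j] * rows[i][k];
            }
        }
        return solveLinearSystem(ata, atb);
    }

    public static void main(String[] args) {
        // Columns 0 and 1 of the example matrix; row 2 has no column-1 entry, so it is 0.
        double[][] features = {{.42, .28}, {.83, .34}, {.23, 0}, {.42, .98}, {.23, .36}};
        // Column 2 of the example matrix used as b; row 2 has no entry, so it is 0.
        double[] b = {.89, .42, 0, .88, .97};
        System.out.println(Arrays.toString(leastSquares(features, b)));
    }
}
```

Because A-transpose * A and A-transpose * b are sums of per-row contributions, they are small (n by n and n by 1) whenever the number of columns n is small, however many rows A has.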

On Sun, Nov 6, 2016 at 6:43 AM <[hidden email]> wrote:

Thank you! Would you happen to have this code in Java? [...]

Re: mLIb solving linear regression with sparse inputs

im281
In reply to this post by Robineast
And in Java as well. Thanks again!
Iman

On Sun, Nov 6, 2016 at 8:28 AM Iman Mohtashemi <[hidden email]> wrote:
Hi Robin, it looks like the linear regression model takes in a dataset, not a matrix? [...]

Re: mLIb solving linear regression with sparse inputs

Robineast
In reply to this post by im281
Well, I did eventually write this code in Java, and it was very long! See
https://github.com/insidedctm/sparse-linear-regression



-----
Robin East
Spark GraphX in Action Michael Malak and Robin East
Manning Publications Co.
http://www.manning.com/books/spark-graphx-in-action
