

This post has NOT been accepted by the mailing list yet.
I want to solve the linear regression problem using Spark with huge matrices:
Ax = b
using least squares:
x = Inverse(Atranspose * A) * Atranspose * b
The A matrix is a large sparse matrix (as is the b vector).
I have pondered several solutions to the Ax = b problem including:
1) solving directly as above: transpose A, multiply Atranspose by A, take the inverse, then multiply by Atranspose and by b, which gives the solution vector x
2) iterative solver (no need to take the inverse)
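(For option 1, the heavy lifting reduces to forming Atranspose*A and Atranspose*b and then solving that small square system; solving is generally preferable to explicitly inverting. A minimal plain-Java sketch of that arithmetic, no Spark involved, on a made-up dense 3x2 example:)

```java
import java.util.Arrays;

public class NormalEquations {
    // Solve the n x n system M y = v by Gaussian elimination with partial pivoting.
    static double[] solve(double[][] M, double[] v) {
        int n = v.length;
        for (int col = 0; col < n; col++) {
            // pick the row with the largest pivot and swap it into place
            int piv = col;
            for (int r = col + 1; r < n; r++)
                if (Math.abs(M[r][col]) > Math.abs(M[piv][col])) piv = r;
            double[] tmpRow = M[col]; M[col] = M[piv]; M[piv] = tmpRow;
            double tmpVal = v[col]; v[col] = v[piv]; v[piv] = tmpVal;
            // eliminate entries below the pivot
            for (int r = col + 1; r < n; r++) {
                double f = M[r][col] / M[col][col];
                for (int c = col; c < n; c++) M[r][c] -= f * M[col][c];
                v[r] -= f * v[col];
            }
        }
        // back substitution
        double[] x = new double[n];
        for (int r = n - 1; r >= 0; r--) {
            double s = v[r];
            for (int c = r + 1; c < n; c++) s -= M[r][c] * x[c];
            x[r] = s / M[r][r];
        }
        return x;
    }

    public static void main(String[] args) {
        // Tiny overdetermined system: A is 3x2, b has length 3.
        double[][] A = {{1, 1}, {1, 2}, {1, 3}};
        double[] b = {1, 2, 3};
        int m = A.length, n = A[0].length;
        // Form AtA (n x n) and Atb (n) in one pass over the rows of A.
        double[][] AtA = new double[n][n];
        double[] Atb = new double[n];
        for (int i = 0; i < m; i++)
            for (int j = 0; j < n; j++) {
                Atb[j] += A[i][j] * b[i];
                for (int k = 0; k < n; k++) AtA[j][k] += A[i][j] * A[i][k];
            }
        System.out.println(Arrays.toString(solve(AtA, Atb))); // prints [0.0, 1.0]
    }
}
```

The key point is that AtA is only n x n (n = number of columns), so once it and Atb are computed in a distributed fashion, the final solve can be done locally when n is modest.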
My question is:
What is the best way to solve this problem using the MLlib libraries, in Java, with RDDs and Spark?
Is there any code as an example? Has anyone done this?
The code to take in data represented as a coordinate matrix and perform the transposition and multiplication is shown below, but with this strategy I still need to take the inverse:
//Read coordinate matrix from text or database
JavaRDD<String> fileA = sc.textFile(file);
//map text file with coordinate data (sparse matrix) to JavaRDD<MatrixEntry>
JavaRDD<MatrixEntry> matrixA = fileA.map(new Function<String, MatrixEntry>() {
    public MatrixEntry call(String x) {
        String[] indexValue = x.split(",");
        long i = Long.parseLong(indexValue[0]);
        long j = Long.parseLong(indexValue[1]);
        double value = Double.parseDouble(indexValue[2]);
        return new MatrixEntry(i, j, value);
    }
});
//coordinate matrix from sparse data
CoordinateMatrix cooMatrixA = new CoordinateMatrix(matrixA.rdd());
//create block matrix
BlockMatrix matA = cooMatrixA.toBlockMatrix();
//create block matrix after matrix multiplication (square matrix)
BlockMatrix ata = matA.transpose().multiply(matA);
//print out the original matrix (collected as a local matrix)
System.out.println(matA.toLocalMatrix().toString());
//print out its transpose
System.out.println(matA.transpose().toLocalMatrix().toString());
//print out the square matrix AtransposeA (after multiplication)
System.out.println(ata.toLocalMatrix().toString());
JavaRDD<MatrixEntry> entries = ata.toCoordinateMatrix().entries().toJavaRDD();
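(The missing piece in this strategy is the inverse. As far as I can tell, MLlib's distributed matrix types don't expose one, but when A has few columns, AtransposeA is only n x n, so one option is to collect its coordinate entries to the driver, densify, and solve locally. A plain-Java sketch of the densify step; the Entry record here is a stand-in for MLlib's MatrixEntry, not the real class:)

```java
import java.util.List;

public class Densify {
    // Stand-in for org.apache.spark.mllib.linalg.distributed.MatrixEntry.
    record Entry(long i, long j, double value) {}

    // Pack sparse (i, j, value) triplets of an n x n matrix into a dense array;
    // positions never mentioned in the triplets stay 0.0.
    static double[][] toDense(List<Entry> entries, int n) {
        double[][] m = new double[n][n];
        for (Entry e : entries) m[(int) e.i()][(int) e.j()] = e.value();
        return m;
    }

    public static void main(String[] args) {
        // In Spark this list would come from ata.toCoordinateMatrix().entries().toJavaRDD().collect().
        List<Entry> entries = List.of(new Entry(0, 0, 3.0), new Entry(0, 1, 6.0),
                                      new Entry(1, 0, 6.0), new Entry(1, 1, 14.0));
        double[][] m = toDense(entries, 2);
        System.out.println(m[1][1]); // prints 14.0
    }
}
```

The resulting dense array can then be fed to any local solver (Gaussian elimination, a LAPACK binding, etc.) together with the collected Atranspose*b vector.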


Any reason why you can’t use the built-in linear regression, e.g. http://spark.apache.org/docs/latest/ml-classification-regression.html#regression or http://spark.apache.org/docs/latest/mllib-linear-methods.html#linear-least-squares-lasso-and-ridge-regression?
Robin East, Spark GraphX in Action (Michael Malak and Robin East, Manning Publications Co.)
On 3 Nov 2016, at 16:08, im281 [via Apache Spark User List] <[hidden email]> wrote:
> [quoted text of the original post trimmed]


I would like to use it, but how do I do the following:
1) read sparse data (from text or a database)
2) pass the sparse data to the LinearRegression class?
For example:
Sparse matrix A
row, column, value
0,0,.42
0,1,.28
0,2,.89
1,0,.83
1,1,.34
1,2,.42
2,0,.23
3,0,.42
3,1,.98
3,2,.88
4,0,.23
4,1,.36
4,2,.97
Sparse vector b
row, column, value
0,2,.89
1,2,.42
3,2,.88
4,2,.97
How would I then solve Ax = b?


Here’s a way of creating sparse vectors in MLlib:
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.rdd.RDD
val rdd = sc.textFile("A.txt").map(line => line.split(",")).map(ary => (ary(0).toInt, ary(1).toInt, ary(2).toDouble))
val pairRdd: RDD[(Int, (Int, Int, Double))] = rdd.map(el => (el._1, el))
val create = (first: (Int, Int, Double)) => (Array(first._2), Array(first._3))
val combine = (head: (Array[Int], Array[Double]), tail: (Int, Int, Double)) => (head._1 :+ tail._2, head._2 :+ tail._3)
val merge = (a: (Array[Int], Array[Double]), b: (Array[Int], Array[Double])) => (a._1 ++ b._1, a._2 ++ b._2)
val A = pairRdd.combineByKey(create, combine, merge).map(el => Vectors.sparse(3, el._2._1, el._2._2))
If you have a separate file of b’s then you would need to manipulate this slightly to join the b’s to the A RDD and then create LabeledPoints. I guess there is a way of doing this using the newer ML interfaces but it’s not particularly obvious to me how.
One point: in the example you give, the b’s are exactly the same as column 2 of the A matrix. I presume this is just a quickly hacked-together example, because that would give a trivial result.
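(For anyone wanting the same grouping in Java: a rough plain-collections sketch of the create/combine/merge idea above, producing the (indices, values) pair that Vectors.sparse expects per row. In Spark the equivalent would go through JavaPairRDD.combineByKey; the class and helper names here are made up for illustration.)

```java
import java.util.Arrays;
import java.util.Map;
import java.util.TreeMap;

public class SparseRows {
    // Group "row,col,value" triplet lines by row into sorted (column -> value) maps;
    // each map yields the (indices, values) arrays a sparse vector is built from.
    static Map<Integer, TreeMap<Integer, Double>> groupByRow(String[] lines) {
        Map<Integer, TreeMap<Integer, Double>> rows = new TreeMap<>();
        for (String line : lines) {
            String[] p = line.split(",");
            rows.computeIfAbsent(Integer.parseInt(p[0]), k -> new TreeMap<>())
                .put(Integer.parseInt(p[1]), Double.parseDouble(p[2]));
        }
        return rows;
    }

    public static void main(String[] args) {
        // Triplet lines as they would come from A.txt.
        String[] lines = {"0,0,.42", "0,1,.28", "0,2,.89", "1,0,.83"};
        for (Map.Entry<Integer, TreeMap<Integer, Double>> row : groupByRow(lines).entrySet()) {
            int[] indices = row.getValue().keySet().stream().mapToInt(Integer::intValue).toArray();
            double[] values = row.getValue().values().stream().mapToDouble(Double::doubleValue).toArray();
            // In Spark, each row would become Vectors.sparse(3, indices, values).
            System.out.println(row.getKey() + " -> " + Arrays.toString(indices) + " " + Arrays.toString(values));
        }
    }
}
```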
On 3 Nov 2016, at 18:12, im281 [via Apache Spark User List] <[hidden email]> wrote:
> [quoted text trimmed]


Thank you! Would you happen to have this code in Java?
This is extremely helpful!
Iman
On Sun, Nov 6, 2016 at 3:35 AM -0800, "Robineast [via Apache Spark User List]" <[hidden email]> wrote:
> [quoted text trimmed]


Hi Robin,
It looks like the linear regression model takes in a dataset, not a matrix? It would be helpful for this example if you could set the whole problem up end to end, using one of the columns of the matrix as b, so that A is a sparse matrix and b is a sparse vector.
Best regards,
Iman


Also in Java as well, please. Thanks again!
On Sun, Nov 6, 2016 at 8:28 AM Iman Mohtashemi <[hidden email]> wrote:
> [quoted text trimmed]

