shared variable and ALS in mllib

shared variable and ALS in mllib

Nan Zhu
Hi, all

I have a question about sharing a variable among tasks; neither broadcast variables nor accumulators seem to solve my problem.

My dataset is a set of text files named 1.txt through 20000.txt.

Each file contains the users' ratings for one product. The product ID is given on the first line of the file ("1:" ... "20000:").

The remaining lines are ratings in the form "userid, rating".

I want to parse the input files with Spark and pass the result to the ALS implementation in MLlib.

ALS requires an RDD of Rating objects, where Rating is a 3-tuple (user, product, rating).

My problem is that a text file can be split across partitions, so a task that receives a later partition of a file never sees the first line (e.g. "1:") and cannot tell which product its ratings belong to.

How can I resolve this, other than running a script that rewrites the files so that the product ID is appended to each line?

Best,

-- 
Nan Zhu


Re: shared variable and ALS in mllib

Jason Dai
If you assign each file to a standalone partition, then you can generate the Rating RDD using something like the following:

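files.mapPartitions { part =>
   // The first line of the file is the product ID, e.g. "1:"
   val product = part.next().trim.stripSuffix(":").toInt
   part.map { line =>
      val fields = line.split(",")   // each remaining line is "userid, rating"
      (fields(0).trim.toInt, product, fields(1).trim.toDouble)
   }
}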

Thanks,
-Jason
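
As a rough illustration of the per-partition parsing suggested above, here is a minimal sketch in plain Scala that runs without Spark. The `Rating` case class is only a stand-in for MLlib's Rating, and `RatingParser` and `parseFile` are hypothetical names introduced for this example. It assumes the iterator holds exactly one file: a "<productId>:" header followed by "userid, rating" lines.

```scala
// Stand-in for MLlib's Rating so the sketch runs without Spark (assumption).
final case class Rating(user: Int, product: Int, rating: Double)

object RatingParser {
  // Parses the lines of ONE file: a "<productId>:" header line,
  // followed by "userid, rating" lines. Blank lines are skipped.
  def parseFile(lines: Iterator[String]): Iterator[Rating] =
    if (!lines.hasNext) Iterator.empty // empty partition: nothing to emit
    else {
      val product = lines.next().trim.stripSuffix(":").toInt
      lines.map(_.trim).filter(_.nonEmpty).map { line =>
        val fields = line.split(",")
        Rating(fields(0).trim.toInt, product, fields(1).trim.toDouble)
      }
    }
}
```

If `files` really does hold one whole file per partition, something like `files.mapPartitions(RatingParser.parseFile)` should then produce the ratings. If a file is split across partitions, the header lookup is wrong for every partition after the first.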


(In reply to Nan Zhu's message of Tue, Jan 7, 2014 at 1:17 AM.)



Re: shared variable and ALS in mllib

Nan Zhu
Thanks, Jason. Yes, that's true, but how do I accomplish the first step?

sc.textFile() does not seem to have a parameter for assigning each file to its own partition.

My files are stored on S3.

Best,

-- 
Nan Zhu
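
One possible way around the "one file per partition" requirement, offered as a sketch rather than a confirmed answer from this thread: newer Spark releases provide `SparkContext.wholeTextFiles`, which returns (path, content) pairs with each file read whole, so the header line always travels with its ratings. It may not be available in the Spark version discussed here, and it assumes each file fits in memory. The parsing part below is plain Scala and runs without Spark; `Rating` is a stand-in for MLlib's class, and `WholeFileParser`, `parseContent`, and the S3 path in the comment are hypothetical.

```scala
// Stand-in for MLlib's Rating so the sketch runs without Spark (assumption).
final case class Rating(user: Int, product: Int, rating: Double)

object WholeFileParser {
  // Parses the full text of one file: "<productId>:" on the first
  // non-blank line, then "userid, rating" on each following line.
  def parseContent(content: String): Seq[Rating] = {
    val lines = content.split("\n").iterator.map(_.trim).filter(_.nonEmpty)
    if (!lines.hasNext) Seq.empty
    else {
      val product = lines.next().stripSuffix(":").toInt
      lines.map { line =>
        val fields = line.split(",")
        Rating(fields(0).trim.toInt, product, fields(1).trim.toDouble)
      }.toList
    }
  }
}

// Hypothetical Spark usage (placeholder bucket/path; needs a SparkContext `sc`):
// val ratings = sc.wholeTextFiles("s3n://my-bucket/ratings/")
//   .flatMap { case (_, content) => WholeFileParser.parseContent(content) }
```

Because each record is a whole file, it no longer matters how Spark assigns files to partitions.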

(In reply to Jason Dai's message of Monday, January 6, 2014 at 11:27 PM.)