Using String Dataset for Logistic Regression

Using String Dataset for Logistic Regression

praveshjain1991
I have been trying to use logistic regression (LR) through Spark's Java API. I used the dataset that ships with Spark for training and testing.

Now I want to use it on another dataset that contains string values along with numbers. Is there any way to do this?

I am attaching the dataset that I want to use.

Thanks and Regards,

Attachment: Test.data

Re: Using String Dataset for Logistic Regression

Xiangrui Meng
It depends on how you want to use the string features. For the day of
the week, you can replace it with 6 binary features indicating
Mon/Tue/Wed/Thu/Fri/Sat (Sunday is then the row where all six are zero). -Xiangrui
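
A minimal sketch of that day-of-week encoding, in plain Python rather than the Spark API. The function name `encode_day` and the three-letter day strings are assumptions made for illustration; the actual dataset may spell the days differently.

```python
# Six indicator features; Sunday is the implicit baseline (all zeros).
DAYS = ["Mon", "Tue", "Wed", "Thu", "Fri", "Sat"]

def encode_day(day):
    """Return six 0.0/1.0 features for a day-of-week string (hypothetical helper)."""
    if day != "Sun" and day not in DAYS:
        raise ValueError("unknown day: %r" % day)
    return [1.0 if day == d else 0.0 for d in DAYS]

print(encode_day("Wed"))  # [0.0, 0.0, 1.0, 0.0, 0.0, 0.0]
print(encode_day("Sun"))  # [0.0, 0.0, 0.0, 0.0, 0.0, 0.0]
```

Six features are enough for seven days because the seventh day is recovered as "none of the above".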

On Fri, May 9, 2014 at 5:31 AM, praveshjain1991
<[hidden email]> wrote:

> I have been trying to use LR in Spark's Java API. I used the dataset given
> along with Spark for the training and testing purposes.
>
> Now I want to use it on another dataset that contains string values along
> with numbers. Is there any way to do this?
>
> I am attaching the dataset that I want to use.
>
> Thanks and Regards, Test.data
> <http://apache-spark-user-list.1001560.n3.nabble.com/file/n5523/Test.data>
>
>
>
> --
> View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Using-String-Dataset-for-Logistic-Regression-tp5523.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.

Re: Using String Dataset for Logistic Regression

DB Tsai-2
You could also use dummy coding to convert a categorical feature into
numeric features.

http://en.wikipedia.org/wiki/Categorical_variable#Dummy_coding
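
As a rough illustration of dummy coding for rows that mix strings and numbers, here is a plain-Python sketch. The helper names (`fit_levels`, `dummy_code`) and the sample rows are made up for this example and are not part of Spark; the first level in sorted order is arbitrarily taken as the reference level.

```python
def fit_levels(values):
    """Collect the distinct category levels in sorted order."""
    return sorted(set(values))

def dummy_code(value, levels):
    """Encode one value as len(levels) - 1 indicators; levels[0] is the reference."""
    if value not in levels:
        raise ValueError("unseen level: %r" % value)
    return [1.0 if value == level else 0.0 for level in levels[1:]]

# Hypothetical rows: (colour as a string, a numeric measurement).
rows = [("red", 1.5), ("green", 0.2), ("blue", 3.0)]
levels = fit_levels(colour for colour, _ in rows)  # ['blue', 'green', 'red']

# Dummy-coded string column followed by the untouched numeric column.
features = [dummy_code(colour, levels) + [x] for colour, x in rows]
print(features)  # [[0.0, 1.0, 1.5], [1.0, 0.0, 0.2], [0.0, 0.0, 3.0]]
```

The resulting rows are purely numeric, so they can be handed to any trainer that expects real-valued feature vectors.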

Sincerely,

DB Tsai
-------------------------------------------------------
My Blog: https://www.dbtsai.com
LinkedIn: https://www.linkedin.com/in/dbtsai


On Wed, May 14, 2014 at 10:37 PM, Xiangrui Meng <[hidden email]> wrote:

> It depends on how you want to use the string features. For the day of
> the week, you can replace it with 6 binary features indicating
> Mon/Tue/Wed/Thu/Fri/Sat. -Xiangrui

Re: Using String Dataset for Logistic Regression

praveshjain1991
In reply to this post by Xiangrui Meng
Thank you for your reply.

So I take it that there is no direct way to use string datasets with LR in Spark.

-Pravesh

Re: Using String Dataset for Logistic Regression

Brian Gawalt
Pravesh,

Correct. The logistic regression engine is set up for classification tasks: it takes feature vectors (arrays of real-valued numbers), each with a class label, and learns a linear combination of those features that separates the classes. As the commenters above have mentioned, there are many different ways to turn string data into feature vectors.
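
To make "learning a linear combination of features" concrete, here is a toy sketch in plain Python. It uses batch gradient descent on the log loss, which is an assumption chosen for brevity and not necessarily how Spark's implementation optimizes; the function names, step count, learning rate, and data are all invented for illustration.

```python
import math

def sigmoid(z):
    """Logistic function, written to avoid overflow for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def score(weights, bias, x):
    """Linear combination of the features plus a bias term."""
    return sum(w * xi for w, xi in zip(weights, x)) + bias

def train_logistic(points, labels, steps=2000, lr=0.5):
    """Fit weights and bias by batch gradient descent on the log loss."""
    weights = [0.0] * len(points[0])
    bias = 0.0
    n = len(points)
    for _ in range(steps):
        grad_w = [0.0] * len(weights)
        grad_b = 0.0
        for x, y in zip(points, labels):
            error = sigmoid(score(weights, bias, x)) - y
            for j, xi in enumerate(x):
                grad_w[j] += error * xi
            grad_b += error
        weights = [w - lr * g / n for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / n
    return weights, bias

def predict(weights, bias, x):
    """Return class 1 if the predicted probability is at least 0.5, else 0."""
    return 1 if sigmoid(score(weights, bias, x)) >= 0.5 else 0

# Toy labelled feature vectors: the label equals the first feature.
points = [[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0]]
labels = [0, 0, 1, 1]
weights, bias = train_logistic(points, labels)
print([predict(weights, bias, x) for x in points])
```

The trainer only ever sees numbers, which is why strings have to be featurized first.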

For instance, if you're classifying documents as, say, spam or valid email, you may want to start with a bag-of-words model (http://en.wikipedia.org/wiki/Bag-of-words_model) or the rescaled variant TF-IDF (http://en.wikipedia.org/wiki/Tf%E2%80%93idf). You'd turn a single document into a single high-dimensional, sparse vector whose element j encodes the number of appearances of term j. You may also want to experiment with featurizing on bigrams, trigrams, etc.
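
A small sketch of the bag-of-words idea, again in plain Python with invented helper names and sample documents. It tokenizes on whitespace only (a simplifying assumption), stores each vector sparsely as a dict from column index to count, and leaves out the TF-IDF rescaling.

```python
from collections import Counter

def tokenize(text):
    """Lowercase and split on whitespace (a deliberately naive tokenizer)."""
    return text.lower().split()

def ngrams(tokens, n):
    """Join each run of n consecutive tokens into one term."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def build_vocabulary(documents, n=1):
    """Map every distinct n-gram to a column index, in first-seen order."""
    vocabulary = {}
    for doc in documents:
        for term in ngrams(tokenize(doc), n):
            vocabulary.setdefault(term, len(vocabulary))
    return vocabulary

def bag_of_words(text, vocabulary, n=1):
    """Return a sparse vector {column index: count}; unknown terms are skipped."""
    counts = Counter(ngrams(tokenize(text), n))
    return {vocabulary[t]: float(c) for t, c in counts.items() if t in vocabulary}

docs = ["buy cheap pills now", "meeting notes for now"]
vocab = build_vocabulary(docs)
print(bag_of_words("buy now buy", vocab))  # {0: 2.0, 3: 1.0}

# The same helpers featurize on bigrams by passing n=2.
bigram_vocab = build_vocabulary(docs, n=2)
print(bag_of_words("buy cheap pills", bigram_vocab, n=2))  # {0: 1.0, 1: 1.0}
```

Only the non-zero entries are stored, which matters once the vocabulary grows to many thousands of terms.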

Or suppose you're just trying to tell "English language tweets" from "non-English language tweets". In that case the bag of words might be overkill: you could instead try featurizing on just the counts of each pair of consecutive characters. E.g., the first element counts "aa" appearances, the second "ab", and so on through "zy" and "zz". Those will be smaller feature vectors that capture less information, but that's probably sufficient for the simpler task, and you'll be able to fit the model with less data than a whole-word-based model would need.
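
The character-pair featurization could look roughly like this in plain Python. The function name is hypothetical, and the choice to lowercase the text and ignore any pair containing a character outside a-z is an assumption made for the sketch.

```python
import string

ALPHABET = string.ascii_lowercase  # 26 letters, so 26 * 26 = 676 features

def char_bigram_counts(text):
    """Count consecutive letter pairs: index 0 is "aa", 1 is "ab", ..., 675 is "zz"."""
    counts = [0.0] * (len(ALPHABET) ** 2)
    lowered = text.lower()
    for first, second in zip(lowered, lowered[1:]):
        # Pairs that include spaces, digits, or non a-z characters are skipped.
        if first in ALPHABET and second in ALPHABET:
            index = ALPHABET.index(first) * len(ALPHABET) + ALPHABET.index(second)
            counts[index] += 1.0
    return counts

vector = char_bigram_counts("abba aa")
print(len(vector))                      # 676
print(vector[0], vector[1], vector[26]) # counts for "aa", "ab", "ba"
```

Every tweet maps to the same fixed length of 676, regardless of how many distinct words it contains.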

Different applications are going to need more or less context from your strings -- whole words? n-grams? just characters? treat them as ENUMs as in the days of week example? -- so it might not make sense for Spark to come with "a direct way" to turn a string attribute into a vector for use in logistic regression. You'll have to settle on the featurization approach that's right for your domain before you try training the logistic regression classifier on your labelled feature vectors.

Best,
-Brian


Re: Using String Dataset for Logistic Regression

praveshjain1991
Thank you for your replies. I've now been using integer datasets but ran into another issue.

http://apache-spark-user-list.1001560.n3.nabble.com/Spark-Streaming-not-processing-file-with-particular-number-of-entries-td6694.html

Any ideas?

--
Thanks

Re: Using String Dataset for Logistic Regression

Wush Wu

Dear all,

Does Spark support sparse matrices/vectors for LR now?

Best,
Wush

On 2014/6/2 at 3:19 PM, "praveshjain1991" <[hidden email]> wrote:
> Thank you for your replies. I've now been using integer datasets but ran into
> another issue.
>
> http://apache-spark-user-list.1001560.n3.nabble.com/Spark-Streaming-not-processing-file-with-particular-number-of-entries-td6694.html
>
> Any ideas?
>
> --
> Thanks

Re: Using String Dataset for Logistic Regression

praveshjain1991
I am not sure. I have just been using some numerical datasets.

Re: Using String Dataset for Logistic Regression

Xiangrui Meng
Yes. MLlib 1.0 supports sparse input data for linear methods. -Xiangrui
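
For readers unfamiliar with sparse input, the sketch below shows the general idea in plain Python: store only the non-zero entries as (size, indices, values) and compute the linear score from those alone. This is an illustration of the concept with made-up helper names, not the MLlib API.

```python
def to_sparse(dense):
    """Convert a dense list into (size, indices, values), keeping only non-zeros."""
    indices = [i for i, v in enumerate(dense) if v != 0.0]
    values = [dense[i] for i in indices]
    return len(dense), indices, values

def sparse_dot(sparse, weights):
    """Dot product that touches only the stored (non-zero) entries."""
    size, indices, values = sparse
    if size != len(weights):
        raise ValueError("dimension mismatch")
    return sum(weights[i] * v for i, v in zip(indices, values))

x = to_sparse([0.0, 2.0, 0.0, 0.0, 1.0])
print(x)                                            # (5, [1, 4], [2.0, 1.0])
print(sparse_dot(x, [1.0, 1.0, 1.0, 1.0, 3.0]))     # 5.0
```

For one-hot or bag-of-words features, where most entries are zero, this keeps both memory use and the cost of each dot product proportional to the number of non-zeros.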

On Mon, Jun 2, 2014 at 11:36 PM, praveshjain1991
<[hidden email]> wrote:
> I am not sure. I have just been using some numerical datasets.