StringIndexer on several columns in a DataFrame with Scala


Md. Rezaul Karim
Hi All,

There are several categorical columns in my dataset as follows:
[Inline image 1, not available]

How can I transform the values in each categorical column into numeric values using StringIndexer, so that the resulting DataFrame can be fed into VectorAssembler to generate a feature vector?

A naive approach would be to use a separate StringIndexer for each categorical column. But that sounds ridiculous, I know.

A possible workaround in PySpark is to collect several StringIndexers in a list and use a Pipeline to execute them all, as follows:
from pyspark.ml import Pipeline
from pyspark.ml.feature import StringIndexer
indexers = [StringIndexer(inputCol=column, outputCol=column+"_index").fit(df) for column in list(set(df.columns)-set(['date'])) ]
pipeline = Pipeline(stages=indexers)
df_r = pipeline.fit(df).transform(df)
df_r.show()
How can I do the same in Scala? I tried the following:

val featureCol = trainingDF.columns
var indexers: Array[StringIndexer] = null

for (colName <- featureCol) {
  val index = new StringIndexer()
    .setInputCol(colName)
    .setOutputCol(colName + "_indexed")
    //.fit(trainDF)
  indexers = indexers :+ index
}

val pipeline = new Pipeline()
  .setStages(indexers)
val newDF = pipeline.fit(trainingDF).transform(trainingDF)
newDF.show()
However, I am getting a NullPointerException at
for (colName <- featureCol)

I am sure I am doing something wrong. Any suggestions?



Regards,
_________________________________
Md. Rezaul Karim, BSc, MSc
Researcher, INSIGHT Centre for Data Analytics
National University of Ireland, Galway
IDA Business Park, Dangan, Galway, Ireland

Re: StringIndexer on several columns in a DataFrame with Scala

MLnick
For now, you must follow this approach of constructing a pipeline consisting of a StringIndexer for each categorical column. See https://issues.apache.org/jira/browse/SPARK-11215 for the related JIRA to allow multiple columns for StringIndexer, which is being worked on currently.

The reason you're seeing an NPE is:
var indexers: Array[StringIndexer] = null
and then you're trying to append an element to something that is null.

Try this instead:

var indexers: Array[StringIndexer] = Array()
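
To make the difference between the two initial values concrete, here is a minimal sketch in plain Scala. It uses String as a stand-in element type so it runs without Spark on the classpath; the object name NullVsEmpty and the helper appendAll are hypothetical names chosen for illustration only.

```scala
import scala.util.Try

object NullVsEmpty {
  // Appends every item to the starting array, one at a time, mirroring the
  // `indexers = indexers :+ index` loop from the question.
  // String is a stand-in for StringIndexer (assumption made for this sketch).
  def appendAll(start: Array[String], items: Seq[String]): Array[String] = {
    var acc = start
    for (item <- items) {
      acc = acc :+ item // :+ builds a new array with item appended
    }
    acc
  }

  // Starting from an empty array works as expected.
  def fromEmpty: Array[String] = appendAll(Array.empty[String], Seq("a", "b"))

  // Starting from null fails on the first append, because :+ has to read the
  // existing array; Try captures the resulting exception instead of crashing.
  def fromNull: Try[Array[String]] = Try(appendAll(null, Seq("a", "b")))
}
```

Calling NullVsEmpty.fromEmpty yields an array holding both items, while NullVsEmpty.fromNull comes back as a Failure.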

But even better is a more functional approach:

val indexers = featureCol.map { colName =>
  new StringIndexer().setInputCol(colName).setOutputCol(colName + "_indexed")
}
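
Putting the pieces together, the whole flow from the question (one StringIndexer per categorical column, then a VectorAssembler) might look roughly like the sketch below. It is only a sketch under assumptions: the object name IndexAllColumns, the helpers buildPipeline and indexAndAssemble, the parameter categoricalCols, and the output column name "features" are hypothetical, and the caller is assumed to pass only the columns that really are categorical.

```scala
import org.apache.spark.ml.{Pipeline, PipelineStage}
import org.apache.spark.ml.feature.{StringIndexer, VectorAssembler}
import org.apache.spark.sql.DataFrame

object IndexAllColumns {
  // Builds a pipeline with one StringIndexer per given column, followed by a
  // VectorAssembler that combines the indexed columns into one vector column.
  def buildPipeline(categoricalCols: Array[String]): Pipeline = {
    val indexers: Array[PipelineStage] = categoricalCols.map { colName =>
      val indexer: PipelineStage = new StringIndexer()
        .setInputCol(colName)
        .setOutputCol(colName + "_indexed")
      indexer
    }

    // "features" is an assumed output column name, not taken from the thread.
    val assembler = new VectorAssembler()
      .setInputCols(categoricalCols.map(_ + "_indexed"))
      .setOutputCol("features")

    new Pipeline().setStages(indexers :+ assembler)
  }

  // Fits the pipeline on df and returns df with the added columns.
  def indexAndAssemble(df: DataFrame, categoricalCols: Array[String]): DataFrame =
    buildPipeline(categoricalCols).fit(df).transform(df)
}
```

With the DataFrame from the question this could be called as IndexAllColumns.indexAndAssemble(trainingDF, categoricalCols), where categoricalCols lists the categorical column names. As far as I know, Spark 3.0 and later also let a single StringIndexer take several columns through setInputCols and setOutputCols, which would replace the per-column loop.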

[In reply to the message from Md. Rezaul Karim of Fri, 27 Oct 2017 at 22:29; quoted text trimmed]

Re: StringIndexer on several columns in a DataFrame with Scala

Weichen Xu
Yes, I am working on this. Sorry for the delay; I will try to submit a PR ASAP. Thanks!

[In reply to the message from Nick Pentreath of Mon, Oct 30, 2017 at 5:19 PM; quoted text trimmed]


Re: StringIndexer on several columns in a DataFrame with Scala

Md. Rezaul Karim
Hi Nick,

Both approaches worked and I realized my silly mistake too. Thank you so much. 

@Xu, thanks for the update.





Best,

[In reply to the message from Weichen Xu of 30 October 2017 at 10:40; quoted text trimmed]