Please Help with DecisionTree/FeatureIndexer

Previous Topic Next Topic
 
classic Classic list List threaded Threaded
6 messages Options
Reply | Threaded
Open this post in threaded view
|

Please Help with DecisionTree/FeatureIndexer

Marco Mistroni
HI all
 i am trying to run a sample decision tree, following examples here (for Mllib)

https://spark.apache.org/docs/latest/ml-classification-regression.html#decision-tree-classifier

the example seems to use  a Vectorindexer, however i am missing something.
How does the featureIndexer knows which columns are features?
Isnt' there something missing?  or the featuresIndexer is able to figure out by itself
which columns of teh DAtaFrame are features?

val labelIndexer = new StringIndexer()
  .setInputCol("label")
  .setOutputCol("indexedLabel")
  .fit(data)
// Automatically identify categorical features, and index them.
val featureIndexer = new VectorIndexer()
  .setInputCol("features")
  .setOutputCol("indexedFeatures")
  .setMaxCategories(4) // features with > 4 distinct values are treated as continuous.
  .fit(data)

Using this code i am getting back this exception

Exception in thread "main" java.lang.IllegalArgumentException: Field "features" does not exist.
at org.apache.spark.sql.types.StructType$$anonfun$apply$1.apply(StructType.scala:266)
at org.apache.spark.sql.types.StructType$$anonfun$apply$1.apply(StructType.scala:266)
at scala.collection.MapLike$class.getOrElse(MapLike.scala:128)
at scala.collection.AbstractMap.getOrElse(Map.scala:59)
at org.apache.spark.sql.types.StructType.apply(StructType.scala:265)
at org.apache.spark.ml.util.SchemaUtils$.checkColumnType(SchemaUtils.scala:40)
at org.apache.spark.ml.feature.VectorIndexer.transformSchema(VectorIndexer.scala:141)
at org.apache.spark.ml.PipelineStage.transformSchema(Pipeline.scala:74)
at org.apache.spark.ml.feature.VectorIndexer.fit(VectorIndexer.scala:118)

what am i missing?

w/kindest regarsd
 marco
Reply | Threaded
Open this post in threaded view
|

Re: Please Help with DecisionTree/FeatureIndexer

Weichen Xu
Hi, Marco,

val data = spark.read.format("libsvm").load("data/mllib/sample_libsvm_data.txt")

The data now include a feature column with name "features",

val featureIndexer = new VectorIndexer()
  .setInputCol("features")   <------ Here specify the "features" column to index.
  .setOutputCol("indexedFeatures")  

Thanks.

On Sat, Dec 16, 2017 at 6:26 AM, Marco Mistroni <[hidden email]> wrote:
HI all
 i am trying to run a sample decision tree, following examples here (for Mllib)

https://spark.apache.org/docs/latest/ml-classification-regression.html#decision-tree-classifier

the example seems to use  a Vectorindexer, however i am missing something.
How does the featureIndexer knows which columns are features?
Isnt' there something missing?  or the featuresIndexer is able to figure out by itself
which columns of teh DAtaFrame are features?

val labelIndexer = new StringIndexer()
  .setInputCol("label")
  .setOutputCol("indexedLabel")
  .fit(data)
// Automatically identify categorical features, and index them.
val featureIndexer = new VectorIndexer()
  .setInputCol("features")
  .setOutputCol("indexedFeatures")
  .setMaxCategories(4) // features with > 4 distinct values are treated as continuous.
  .fit(data)

Using this code i am getting back this exception

Exception in thread "main" java.lang.IllegalArgumentException: Field "features" does not exist.
at org.apache.spark.sql.types.StructType$$anonfun$apply$1.apply(StructType.scala:266)
at org.apache.spark.sql.types.StructType$$anonfun$apply$1.apply(StructType.scala:266)
at scala.collection.MapLike$class.getOrElse(MapLike.scala:128)
at scala.collection.AbstractMap.getOrElse(Map.scala:59)
at org.apache.spark.sql.types.StructType.apply(StructType.scala:265)
at org.apache.spark.ml.util.SchemaUtils$.checkColumnType(SchemaUtils.scala:40)
at org.apache.spark.ml.feature.VectorIndexer.transformSchema(VectorIndexer.scala:141)
at org.apache.spark.ml.PipelineStage.transformSchema(Pipeline.scala:74)
at org.apache.spark.ml.feature.VectorIndexer.fit(VectorIndexer.scala:118)

what am i missing?

w/kindest regarsd
 marco

Reply | Threaded
Open this post in threaded view
|

Re: Please Help with DecisionTree/FeatureIndexer

Marco Mistroni
Hello Wei
 Thanks, i should have c hecked the data
My data has this format
|col1|col2|col3|label|

so it looks like i cannot use VectorIndexer directly (it accepts a Vector column).
I am guessing what i should do is something like this (given i have few categorical features)

val assembler = new VectorAssembler().
      setInputCols(inputData.columns.filter(_ != "Label")).
      setOutputCol("features")

    val transformedData = assembler.transform(inputData) 
     
   
    val featureIndexer =     
      new VectorIndexer()
      .setInputCol("features")
      .setOutputCol("indexedFeatures")
      .setMaxCategories(5) // features with > 4 distinct values are treated as continuous.
      .fit(transformedData)
   
?
Apologies for the basic question btu last time i worked on an ML project i was using Spark 1.x

kr
 marco









On Dec 16, 2017 1:24 PM, "Weichen Xu" <[hidden email]> wrote:
Hi, Marco,

val data = spark.read.format("libsvm").load("data/mllib/sample_libsvm_data.txt")

The data now include a feature column with name "features",

val featureIndexer = new VectorIndexer()
  .setInputCol("features")   <------ Here specify the "features" column to index.
  .setOutputCol("indexedFeatures")  

Thanks.

On Sat, Dec 16, 2017 at 6:26 AM, Marco Mistroni <[hidden email]> wrote:
HI all
 i am trying to run a sample decision tree, following examples here (for Mllib)

https://spark.apache.org/docs/latest/ml-classification-regression.html#decision-tree-classifier

the example seems to use  a Vectorindexer, however i am missing something.
How does the featureIndexer knows which columns are features?
Isnt' there something missing?  or the featuresIndexer is able to figure out by itself
which columns of teh DAtaFrame are features?

val labelIndexer = new StringIndexer()
  .setInputCol("label")
  .setOutputCol("indexedLabel")
  .fit(data)
// Automatically identify categorical features, and index them.
val featureIndexer = new VectorIndexer()
  .setInputCol("features")
  .setOutputCol("indexedFeatures")
  .setMaxCategories(4) // features with > 4 distinct values are treated as continuous.
  .fit(data)

Using this code i am getting back this exception

Exception in thread "main" java.lang.IllegalArgumentException: Field "features" does not exist.
at org.apache.spark.sql.types.StructType$$anonfun$apply$1.apply(StructType.scala:266)
at org.apache.spark.sql.types.StructType$$anonfun$apply$1.apply(StructType.scala:266)
at scala.collection.MapLike$class.getOrElse(MapLike.scala:128)
at scala.collection.AbstractMap.getOrElse(Map.scala:59)
at org.apache.spark.sql.types.StructType.apply(StructType.scala:265)
at org.apache.spark.ml.util.SchemaUtils$.checkColumnType(SchemaUtils.scala:40)
at org.apache.spark.ml.feature.VectorIndexer.transformSchema(VectorIndexer.scala:141)
at org.apache.spark.ml.PipelineStage.transformSchema(Pipeline.scala:74)
at org.apache.spark.ml.feature.VectorIndexer.fit(VectorIndexer.scala:118)

what am i missing?

w/kindest regarsd
 marco

Reply | Threaded
Open this post in threaded view
|

Re: Please Help with DecisionTree/FeatureIndexer

Weichen Xu
Hi Marco,

Yes you can apply `VectorAssembler` first in the pipeline to assemble multiple features column.

Thanks.

On Sun, Dec 17, 2017 at 6:33 AM, Marco Mistroni <[hidden email]> wrote:
Hello Wei
 Thanks, i should have c hecked the data
My data has this format
|col1|col2|col3|label|

so it looks like i cannot use VectorIndexer directly (it accepts a Vector column).
I am guessing what i should do is something like this (given i have few categorical features)

val assembler = new VectorAssembler().
      setInputCols(inputData.columns.filter(_ != "Label")).
      setOutputCol("features")

    val transformedData = assembler.transform(inputData) 
     
   
    val featureIndexer =     
      new VectorIndexer()
      .setInputCol("features")
      .setOutputCol("indexedFeatures")
      .setMaxCategories(5) // features with > 4 distinct values are treated as continuous.
      .fit(transformedData)
   
?
Apologies for the basic question btu last time i worked on an ML project i was using Spark 1.x

kr
 marco









On Dec 16, 2017 1:24 PM, "Weichen Xu" <[hidden email]> wrote:
Hi, Marco,

val data = spark.read.format("libsvm").load("data/mllib/sample_libsvm_data.txt")

The data now include a feature column with name "features",

val featureIndexer = new VectorIndexer()
  .setInputCol("features")   <------ Here specify the "features" column to index.
  .setOutputCol("indexedFeatures")  

Thanks.

On Sat, Dec 16, 2017 at 6:26 AM, Marco Mistroni <[hidden email]> wrote:
HI all
 i am trying to run a sample decision tree, following examples here (for Mllib)

https://spark.apache.org/docs/latest/ml-classification-regression.html#decision-tree-classifier

the example seems to use  a Vectorindexer, however i am missing something.
How does the featureIndexer knows which columns are features?
Isnt' there something missing?  or the featuresIndexer is able to figure out by itself
which columns of teh DAtaFrame are features?

val labelIndexer = new StringIndexer()
  .setInputCol("label")
  .setOutputCol("indexedLabel")
  .fit(data)
// Automatically identify categorical features, and index them.
val featureIndexer = new VectorIndexer()
  .setInputCol("features")
  .setOutputCol("indexedFeatures")
  .setMaxCategories(4) // features with > 4 distinct values are treated as continuous.
  .fit(data)

Using this code i am getting back this exception

Exception in thread "main" java.lang.IllegalArgumentException: Field "features" does not exist.
at org.apache.spark.sql.types.StructType$$anonfun$apply$1.apply(StructType.scala:266)
at org.apache.spark.sql.types.StructType$$anonfun$apply$1.apply(StructType.scala:266)
at scala.collection.MapLike$class.getOrElse(MapLike.scala:128)
at scala.collection.AbstractMap.getOrElse(Map.scala:59)
at org.apache.spark.sql.types.StructType.apply(StructType.scala:265)
at org.apache.spark.ml.util.SchemaUtils$.checkColumnType(SchemaUtils.scala:40)
at org.apache.spark.ml.feature.VectorIndexer.transformSchema(VectorIndexer.scala:141)
at org.apache.spark.ml.PipelineStage.transformSchema(Pipeline.scala:74)
at org.apache.spark.ml.feature.VectorIndexer.fit(VectorIndexer.scala:118)

what am i missing?

w/kindest regarsd
 marco


Reply | Threaded
Open this post in threaded view
|

Re: Please Help with DecisionTree/FeatureIndexer

Weichen Xu
Hi Marco,

If you add assembler at the first of the pipeline, like:
```
 val pipeline = new Pipeline()
      .setStages(Array(assembler, labelIndexer, featureIndexer, dt, labelConverter))
```

Which error do you got ?

I think it can work fine if the `assembler` added into pipeline.

Thanks.

On Tue, Dec 19, 2017 at 6:08 AM, Marco Mistroni <[hidden email]> wrote:
Hello Weichen
 sorry to bother you again with my ML issue... but i feel you have more experience than i do in this and perhaps you can suggest me  if i am following the correct steps, as i seem to get confused by different examples on Decision Treees

So, as a starting point i have this dataframe

[BI-RADS, Age, Shape, Margin,Density,Severity]

The label is 'Severity' and all others are features
I am following these steps and i was wondering if you can advise if i am doing the correct thing , as i am unable to add the assembler at the beginning of the pipeilne, resorting instead to the following code
<inputData is the original DataFrame>

    val assembler = new VectorAssembler().
      setInputCols(inputData.columns.filter(_ != "Severity")).
      setOutputCol("features")
   
    val data = assembler.transform(inputData) 
   
    val labelIndexer = new StringIndexer()
      .setInputCol("Severity")
      .setOutputCol("indexedLabel")
      .fit(data)

    val featureIndexer =     
      new VectorIndexer()
      .setInputCol("features")
      .setOutputCol("indexedFeatures")
      .setMaxCategories(5) // features with > 4 distinct values are treated as continuous.
      .fit(data)
   
    val Array(trainingData, testData) = data.randomSplit(Array(0.8, 0.2))
    // Train a DecisionTree model.
    val dt = new DecisionTreeClassifier()
      .setLabelCol("indexedLabel")
      .setFeaturesCol("indexedFeatures")

    // Convert indexed labels back to original labels.
      val labelConverter = new IndexToString()
      .setInputCol("prediction")
      .setOutputCol("predictedLabel")
      .setLabels(labelIndexer.labels)

    // Chain indexers and tree in a Pipeline.
    val pipeline = new Pipeline()
      .setStages(Array(labelIndexer, featureIndexer, dt, labelConverter))

    trainingData.cache()
    testData.cache() 
     
     
    // Train model. This also runs the indexers.
    val model = pipeline.fit(trainingData)

    // Make predictions.
    val predictions = model.transform(testData)

    // Select example rows to display.
    predictions.select("predictedLabel", "indexedLabel", "indexedFeatures").show(5)

    // Select (prediction, true label) and compute test error.
    val evaluator = new MulticlassClassificationEvaluator()
      .setLabelCol("indexedLabel")
      .setPredictionCol("prediction")
      .setMetricName("accuracy")
    val accuracy = evaluator.evaluate(predictions)
    println("Test Error = " + (1.0 - accuracy))

Could you advise if this is the proper way to follow when using an Assembler?
I was unable to add the Assembler at the beginning of the pipeline... it seems it dint get invoked as , at the moment of calling the FeatureIndexer, the column 'features' was not found

this is not urgent, i'll appreciate ifyou can give me your comments
kind regards
 marco






On Sun, Dec 17, 2017 at 2:48 AM, Weichen Xu <[hidden email]> wrote:
Hi Marco,

Yes you can apply `VectorAssembler` first in the pipeline to assemble multiple features column.

Thanks.

On Sun, Dec 17, 2017 at 6:33 AM, Marco Mistroni <[hidden email]> wrote:
Hello Wei
 Thanks, i should have c hecked the data
My data has this format
|col1|col2|col3|label|

so it looks like i cannot use VectorIndexer directly (it accepts a Vector column).
I am guessing what i should do is something like this (given i have few categorical features)

val assembler = new VectorAssembler().
      setInputCols(inputData.columns.filter(_ != "Label")).
      setOutputCol("features")

    val transformedData = assembler.transform(inputData) 
     
   
    val featureIndexer =     
      new VectorIndexer()
      .setInputCol("features")
      .setOutputCol("indexedFeatures")
      .setMaxCategories(5) // features with > 4 distinct values are treated as continuous.
      .fit(transformedData)
   
?
Apologies for the basic question btu last time i worked on an ML project i was using Spark 1.x

kr
 marco









On Dec 16, 2017 1:24 PM, "Weichen Xu" <[hidden email]> wrote:
Hi, Marco,

val data = spark.read.format("libsvm").load("data/mllib/sample_libsvm_data.txt")

The data now include a feature column with name "features",

val featureIndexer = new VectorIndexer()
  .setInputCol("features")   <------ Here specify the "features" column to index.
  .setOutputCol("indexedFeatures")  

Thanks.

On Sat, Dec 16, 2017 at 6:26 AM, Marco Mistroni <[hidden email]> wrote:
HI all
 i am trying to run a sample decision tree, following examples here (for Mllib)

https://spark.apache.org/docs/latest/ml-classification-regression.html#decision-tree-classifier

the example seems to use  a Vectorindexer, however i am missing something.
How does the featureIndexer knows which columns are features?
Isnt' there something missing?  or the featuresIndexer is able to figure out by itself
which columns of teh DAtaFrame are features?

val labelIndexer = new StringIndexer()
  .setInputCol("label")
  .setOutputCol("indexedLabel")
  .fit(data)
// Automatically identify categorical features, and index them.
val featureIndexer = new VectorIndexer()
  .setInputCol("features")
  .setOutputCol("indexedFeatures")
  .setMaxCategories(4) // features with > 4 distinct values are treated as continuous.
  .fit(data)

Using this code i am getting back this exception

Exception in thread "main" java.lang.IllegalArgumentException: Field "features" does not exist.
at org.apache.spark.sql.types.StructType$$anonfun$apply$1.apply(StructType.scala:266)
at org.apache.spark.sql.types.StructType$$anonfun$apply$1.apply(StructType.scala:266)
at scala.collection.MapLike$class.getOrElse(MapLike.scala:128)
at scala.collection.AbstractMap.getOrElse(Map.scala:59)
at org.apache.spark.sql.types.StructType.apply(StructType.scala:265)
at org.apache.spark.ml.util.SchemaUtils$.checkColumnType(SchemaUtils.scala:40)
at org.apache.spark.ml.feature.VectorIndexer.transformSchema(VectorIndexer.scala:141)
at org.apache.spark.ml.PipelineStage.transformSchema(Pipeline.scala:74)
at org.apache.spark.ml.feature.VectorIndexer.fit(VectorIndexer.scala:118)

what am i missing?

w/kindest regarsd
 marco




Reply | Threaded
Open this post in threaded view
|

Re: Please Help with DecisionTree/FeatureIndexer

Weichen Xu
Hi, Marco

Do not call any single fit/transform by your self. You only need to call `pipeline.fit`/`pipelineModel.transform`. Like following:

    val assembler = new VectorAssembler().
      setInputCols(inputData.columns.filter(_ != "Severity")).
      setOutputCol("features")
    
    val data = assembler.transform(inputData)  
    
    val labelIndexer = new StringIndexer()
      .setInputCol("Severity")
      .setOutputCol("indexedLabel")

    val featureIndexer =      
      new VectorIndexer()
      .setInputCol("features")
      .setOutputCol("indexedFeatures")
      .setMaxCategories(5) // features with > 4 distinct values are treated as continuous.

    
    val Array(trainingData, testData) = data.randomSplit(Array(0.8, 0.2))
    // Train a DecisionTree model.
    val dt = new DecisionTreeClassifier()
      .setLabelCol("indexedLabel")
      .setFeaturesCol("indexedFeatures")

    // Convert indexed labels back to original labels.
      val labelConverter = new IndexToString()
      .setInputCol("prediction")
      .setOutputCol("predictedLabel")
      .setLabels(labelIndexer.labels)

    // Chain indexers and tree in a Pipeline.
    val pipeline = new Pipeline()
      .setStages(Array(assembler, labelIndexer, featureIndexer, dt, labelConverter))

    trainingData.cache()
    testData.cache()  
      
      
    // Train model. This also runs the indexers.
    val model = pipeline.fit(trainingData)

    // Make predictions.
    val predictions = model.transform(testData)


Thanks.

On Wed, Dec 20, 2017 at 5:26 AM, Marco Mistroni <[hidden email]> wrote:
Hello Weichen
 i will try it out and le tyou know
But, if i add assembler to the pipeline, do i still have to call Assembler.transform  and XXXIndexer.fit() ?
kind regards
 Marco

On Tue, Dec 19, 2017 at 2:45 AM, Weichen Xu <[hidden email]> wrote:
Hi Marco,

If you add assembler at the first of the pipeline, like:
```
 val pipeline = new Pipeline()
      .setStages(Array(assembler, labelIndexer, featureIndexer, dt, labelConverter))
```

Which error do you got ?

I think it can work fine if the `assembler` added into pipeline.

Thanks.

On Tue, Dec 19, 2017 at 6:08 AM, Marco Mistroni <[hidden email]> wrote:
Hello Weichen
 sorry to bother you again with my ML issue... but i feel you have more experience than i do in this and perhaps you can suggest me  if i am following the correct steps, as i seem to get confused by different examples on Decision Treees

So, as a starting point i have this dataframe

[BI-RADS, Age, Shape, Margin,Density,Severity]

The label is 'Severity' and all others are features
I am following these steps and i was wondering if you can advise if i am doing the correct thing , as i am unable to add the assembler at the beginning of the pipeilne, resorting instead to the following code
<inputData is the original DataFrame>

    val assembler = new VectorAssembler().
      setInputCols(inputData.columns.filter(_ != "Severity")).
      setOutputCol("features")
   
    val data = assembler.transform(inputData) 
   
    val labelIndexer = new StringIndexer()
      .setInputCol("Severity")
      .setOutputCol("indexedLabel")
      .fit(data)

    val featureIndexer =     
      new VectorIndexer()
      .setInputCol("features")
      .setOutputCol("indexedFeatures")
      .setMaxCategories(5) // features with > 4 distinct values are treated as continuous.
      .fit(data)
   
    val Array(trainingData, testData) = data.randomSplit(Array(0.8, 0.2))
    // Train a DecisionTree model.
    val dt = new DecisionTreeClassifier()
      .setLabelCol("indexedLabel")
      .setFeaturesCol("indexedFeatures")

    // Convert indexed labels back to original labels.
      val labelConverter = new IndexToString()
      .setInputCol("prediction")
      .setOutputCol("predictedLabel")
      .setLabels(labelIndexer.labels)

    // Chain indexers and tree in a Pipeline.
    val pipeline = new Pipeline()
      .setStages(Array(labelIndexer, featureIndexer, dt, labelConverter))

    trainingData.cache()
    testData.cache() 
     
     
    // Train model. This also runs the indexers.
    val model = pipeline.fit(trainingData)

    // Make predictions.
    val predictions = model.transform(testData)

    // Select example rows to display.
    predictions.select("predictedLabel", "indexedLabel", "indexedFeatures").show(5)

    // Select (prediction, true label) and compute test error.
    val evaluator = new MulticlassClassificationEvaluator()
      .setLabelCol("indexedLabel")
      .setPredictionCol("prediction")
      .setMetricName("accuracy")
    val accuracy = evaluator.evaluate(predictions)
    println("Test Error = " + (1.0 - accuracy))

Could you advise if this is the proper way to follow when using an Assembler?
I was unable to add the Assembler at the beginning of the pipeline... it seems it dint get invoked as , at the moment of calling the FeatureIndexer, the column 'features' was not found

this is not urgent, i'll appreciate ifyou can give me your comments
kind regards
 marco






On Sun, Dec 17, 2017 at 2:48 AM, Weichen Xu <[hidden email]> wrote:
Hi Marco,

Yes you can apply `VectorAssembler` first in the pipeline to assemble multiple features column.

Thanks.

On Sun, Dec 17, 2017 at 6:33 AM, Marco Mistroni <[hidden email]> wrote:
Hello Wei
 Thanks, i should have c hecked the data
My data has this format
|col1|col2|col3|label|

so it looks like i cannot use VectorIndexer directly (it accepts a Vector column).
I am guessing what i should do is something like this (given i have few categorical features)

val assembler = new VectorAssembler().
      setInputCols(inputData.columns.filter(_ != "Label")).
      setOutputCol("features")

    val transformedData = assembler.transform(inputData) 
     
   
    val featureIndexer =     
      new VectorIndexer()
      .setInputCol("features")
      .setOutputCol("indexedFeatures")
      .setMaxCategories(5) // features with > 4 distinct values are treated as continuous.
      .fit(transformedData)
   
?
Apologies for the basic question btu last time i worked on an ML project i was using Spark 1.x

kr
 marco









On Dec 16, 2017 1:24 PM, "Weichen Xu" <[hidden email]> wrote:
Hi, Marco,

val data = spark.read.format("libsvm").load("data/mllib/sample_libsvm_data.txt")

The data now include a feature column with name "features",

val featureIndexer = new VectorIndexer()
  .setInputCol("features")   <------ Here specify the "features" column to index.
  .setOutputCol("indexedFeatures")  

Thanks.

On Sat, Dec 16, 2017 at 6:26 AM, Marco Mistroni <[hidden email]> wrote:
HI all
 i am trying to run a sample decision tree, following examples here (for Mllib)

https://spark.apache.org/docs/latest/ml-classification-regression.html#decision-tree-classifier

the example seems to use  a Vectorindexer, however i am missing something.
How does the featureIndexer knows which columns are features?
Isnt' there something missing?  or the featuresIndexer is able to figure out by itself
which columns of teh DAtaFrame are features?

val labelIndexer = new StringIndexer()
  .setInputCol("label")
  .setOutputCol("indexedLabel")
  .fit(data)
// Automatically identify categorical features, and index them.
val featureIndexer = new VectorIndexer()
  .setInputCol("features")
  .setOutputCol("indexedFeatures")
  .setMaxCategories(4) // features with > 4 distinct values are treated as continuous.
  .fit(data)

Using this code i am getting back this exception

Exception in thread "main" java.lang.IllegalArgumentException: Field "features" does not exist.
at org.apache.spark.sql.types.StructType$$anonfun$apply$1.apply(StructType.scala:266)
at org.apache.spark.sql.types.StructType$$anonfun$apply$1.apply(StructType.scala:266)
at scala.collection.MapLike$class.getOrElse(MapLike.scala:128)
at scala.collection.AbstractMap.getOrElse(Map.scala:59)
at org.apache.spark.sql.types.StructType.apply(StructType.scala:265)
at org.apache.spark.ml.util.SchemaUtils$.checkColumnType(SchemaUtils.scala:40)
at org.apache.spark.ml.feature.VectorIndexer.transformSchema(VectorIndexer.scala:141)
at org.apache.spark.ml.PipelineStage.transformSchema(Pipeline.scala:74)
at org.apache.spark.ml.feature.VectorIndexer.fit(VectorIndexer.scala:118)

what am i missing?

w/kindest regarsd
 marco