PySpark ML: Get best set of parameters from TrainValidationSplit

PySpark ML: Get best set of parameters from TrainValidationSplit

Aakash Basu-2
Hi,

I am running a Random Forest model on a dataset, tuning its hyperparameters with Spark's ParamGridBuilder and TrainValidationSplit.

Can anyone tell me how to get the best values for all four parameters?

I used:

model.bestModel()
model.metrics()

But neither of them seems to work.


Below is the code chunk:
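from pyspark.ml.evaluation import MulticlassClassificationEvaluator
from pyspark.ml.tuning import ParamGridBuilder, TrainValidationSplit

# rf (the random forest estimator) and pipeline (the Pipeline containing it)
# are defined earlier and are not shown here.
paramGrid = ParamGridBuilder() \
    .addGrid(rf.numTrees, [50, 100, 150, 200]) \
    .addGrid(rf.maxDepth, [5, 10, 15, 20]) \
    .addGrid(rf.minInfoGain, [0.001, 0.01, 0.1, 0.6]) \
    .addGrid(rf.minInstancesPerNode, [5, 15, 30, 50, 100]) \
    .build()

tvs = TrainValidationSplit(estimator=pipeline,
                           estimatorParamMaps=paramGrid,
                           evaluator=MulticlassClassificationEvaluator(),
                           # 80% of the data will be used for training, 20% for validation.
                           trainRatio=0.8)

# Fit every parameter combination in the grid and keep the best model.
model = tvs.fit(trainingData)

# Score the held-out test set with the best model.
predictions = model.transform(testData)

evaluator = MulticlassClassificationEvaluator(
    labelCol="label", predictionCol="prediction", metricName="accuracy")
accuracy = evaluator.evaluate(predictions)
print("Accuracy = %g" % accuracy)
print("Test Error = %g" % (1.0 - accuracy))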

Any help?


Thanks,
Aakash.

Re: PySpark ML: Get best set of parameters from TrainValidationSplit

Bryan Cutler
Hi Aakash,

First you will want to get the random forest model stage from the best pipeline model. For example, if RF is the first stage:

rfModel = model.bestModel.stages[0]

Then you can check the values of the params you tuned like this:

rfModel.getNumTrees
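
A complementary way to see all four tuned values at once is to pair each entry of the parameter grid with its validation score and pick the best one. The sketch below is only an illustration: best_param_map is a hypothetical helper (not part of PySpark), and the demo data stands in for the real grid. It assumes that model.validationMetrics is available in your Spark version and that it is ordered the same way as paramGrid.

```python
def best_param_map(param_maps, metrics, larger_is_better=True):
    """Return (param_map, metric) for the best-scoring grid entry.

    Hypothetical helper, not a PySpark API. param_maps and metrics are
    assumed to be parallel sequences, as paramGrid and
    model.validationMetrics are expected to be after tvs.fit().
    """
    if len(param_maps) != len(metrics):
        raise ValueError("param_maps and metrics must have the same length")
    pick = max if larger_is_better else min
    best_index = pick(range(len(metrics)), key=lambda i: metrics[i])
    return param_maps[best_index], metrics[best_index]


# Stand-in data for illustration only; with Spark you would pass
# paramGrid and model.validationMetrics instead.
demo_maps = [
    {"numTrees": 50, "maxDepth": 5},
    {"numTrees": 100, "maxDepth": 10},
    {"numTrees": 150, "maxDepth": 15},
]
demo_metrics = [0.81, 0.87, 0.84]

best_map, best_metric = best_param_map(demo_maps, demo_metrics)
print(best_map, best_metric)
```

With the objects from the original post, the call would look like best_param_map(paramGrid, model.validationMetrics, tvs.getEvaluator().isLargerBetter()). The keys of each map in paramGrid are Param objects, so something like {p.name: v for p, v in best_map.items()} should give readable names for numTrees, maxDepth, minInfoGain and minInstancesPerNode.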

In reply to Aakash Basu's message of Mon, Apr 16, 2018 at 7:52 AM.