Ask about Pyspark ML interaction

Ask about Pyspark ML interaction

Du, Yi
Hi,

How are you doing?

Please allow me to introduce myself first. I am Yi Du, and I work at a mortgage insurance company, ‘Arch Capital Group’, in its Washington, DC office in the US. I found your profile under the Spark repo on GitHub and would like to ask you about one particular coding issue in Spark ML. I have read the Spark documentation and also asked on Stack Overflow, but I still have no clue.

 

I am using PySpark and Spark ML to build models. My predictors are categorical variables, and I would like the model to include interactions between two of them.

I tried to follow the example here: https://spark.apache.org/docs/latest/ml-features#interaction to create the interaction between two categorical variables.

Here is my code snippet:

 

```python
from pyspark.ml.feature import Interaction, OneHotEncoder, StringIndexer

# trs_data is the training DataFrame (defined elsewhere).
# Step 1: index the two categorical columns.
stringIndexer = StringIndexer(inputCols=['fico_group','ltv_group'], outputCols=['fico_groupIndex1','ltv_groupIndex1'], stringOrderType='frequencyAsc')
trs_data_index = stringIndexer.fit(trs_data).transform(trs_data)

# Step 2: interact the two indexed columns.
interaction = Interaction(inputCols=['fico_groupIndex1','ltv_groupIndex1'], outputCol="interactedCol")
trs_data_interacted_temp = interaction.transform(trs_data_index)

# Step 3: one-hot encode the interaction column.
encoder = OneHotEncoder(inputCols=['interactedCol'], outputCols=['interactedColVec'])
trs_data_interacted = encoder.fit(trs_data_interacted_temp).transform(trs_data_interacted_temp)
```

 

Basically, I index ‘fico_group’ and ‘ltv_group’ first, interact them, and then use OneHotEncoder to create the final column ‘interactedColVec’.

However, the results are not what I expected. ‘fico_group’ has 5 levels and so does ‘ltv_group’, so there are 5*5 = 25 combinations. In the model estimates, one level should be treated as the base, so I expected to see 25-1 = 24 interactions. With the code above, however, I get 25 interactions in the model estimates.

 

This is my post on Stack Overflow: https://stackoverflow.com/questions/64602060/add-interaction-term-to-ml

I am not sure whether I have articulated my question clearly, but I would really appreciate your help, or a pointer to someone who knows about this.

Again, thank you very much for your help.

Best,

Yi

 





The information contained in this e-mail message may be privileged and confidential information and is intended only for the use of the individual and/or entity identified in the alias address of this message. If the reader of this message is not the intended recipient, or an employee or agent responsible to deliver it to the intended recipient, you are hereby requested not to distribute or copy this communication. If you have received this communication in error, please notify us immediately by telephone or return e-mail and delete the original message from your system.

Re: Ask about Pyspark ML interaction

srowen
I think you have this flipped around - you want to one-hot encode, then compute interactions. As it is you are treating the product of {0,1,2,3,4} x {0,1,2,3,4} as if it's a categorical index. That doesn't have nearly 25 possible values and probably is not what you intend.
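
To make the point about the product concrete, here is a minimal plain-Python sketch (no Spark involved). It assumes both indexed columns take the values 0 through 4, as an indexer would assign for 5 levels, and simply enumerates the products of the two indices. The names `levels`, `products`, and `distinct_products` are made up for illustration, and this only illustrates the arithmetic in the reply, not Spark's internal behavior.

```python
from itertools import product

# Assumption: each indexed column takes the values 0 .. 4 (5 levels).
levels = range(5)

# Multiply every pair of indices, as a plain numeric product of the two columns would.
products = [a * b for a, b in product(levels, levels)]

# Many of the 25 pairs collapse onto the same product (e.g. 1*4 == 2*2 == 4*1).
distinct_products = sorted(set(products))

print(len(products))           # number of (fico, ltv) index pairs
print(distinct_products)       # the distinct product values
print(len(distinct_products))  # how many distinct values there are
```

There are 25 index pairs but only 10 distinct products, so a raw product of indices cannot act as a one-to-one categorical index for the 25 combinations.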

On Mon, Nov 9, 2020 at 7:53 AM Du, Yi <[hidden email]> wrote:

RE: Ask about Pyspark ML interaction

Du, Yi

Do you mean I need to index them, then one-hot encode them, and then interact them?

I tried both ways:

Index -> interact -> onehotencode: it gave me 25 combinations.

Index -> onehotencode -> interact: it gave me 16 combinations.

Neither of them gave me the expected 24 combinations. Did I miss something?

 

Thanks,
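
One way the counts 25 and 16 could arise, sketched in plain Python rather than Spark: if each 5-level variable is expanded to a full 5-element indicator vector, the pairwise products of the two vectors have 5*5 = 25 entries; if one level is dropped from each variable first (OneHotEncoder's `dropLast` parameter defaults to True), each vector has 4 elements and the products have 4*4 = 16 entries. The helper names `one_hot` and `interact` are hypothetical, and this is an assumption about where the numbers come from, not a trace of what Spark does internally.

```python
def one_hot(index, num_levels, drop_last):
    """Indicator vector for `index`; optionally drop the last level as the base."""
    size = num_levels - 1 if drop_last else num_levels
    return [1.0 if i == index else 0.0 for i in range(size)]


def interact(u, v):
    """All pairwise products of two vectors, flattened into one vector."""
    return [a * b for a in u for b in v]


NUM_LEVELS = 5  # both fico_group and ltv_group have 5 levels

# Full indicators for each variable: 5 * 5 = 25 interaction columns.
full = interact(one_hot(2, NUM_LEVELS, False), one_hot(3, NUM_LEVELS, False))

# One base level dropped per variable: 4 * 4 = 16 interaction columns.
reduced = interact(one_hot(2, NUM_LEVELS, True), one_hot(3, NUM_LEVELS, True))

print(len(full), len(reduced))
```

Under these assumptions neither ordering yields 24: dropping one base level per variable leaves (5-1)*(5-1) = 16 interaction columns, whereas 24 would correspond to dropping a single cell out of the full 25, which neither ordering does.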

 

From: Sean Owen [mailto:[hidden email]]
Sent: Monday, November 9, 2020 9:58 AM
To: Du, Yi <[hidden email]>
Cc: [hidden email]
Subject: Re: Ask about Pyspark ML interaction

 
