Fwd: Recover RFormula Column Names

Andrew Redd

Hi All!

I'm performing an econometric analysis over several billion rows of data and would like to use the PySpark (Spark ML) implementation of linear regression. In the example below I'm trying to interact hour-of-day and month-of-year indicators. The StringIndexer documentation tells you what it does when one-hot encoding string/factor columns (i.e. dropping the most/least common value, or the first/last when sorted alphabetically), but it doesn't tell you how to recover your coefficient names. This feels like such a general case that I must be missing something. How can I get my column names back after the regression so I can map them to coefficient values? Do I need to rebuild the RFormula logic myself if this isn't already implemented? I would be happy to use a different Spark language (Scala, Java, etc.) if it is implemented there.

Thanks in advance

Andrew

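    import pandas as pd
    from pyspark.ml.feature import RFormula
    from pyspark.ml.regression import LinearRegression

    # The enclosing function was not included in the post; the name below is a placeholder.
    def fit_regression(regression_input):
        rform = RFormula(formula="log_outcome ~ log_treatment + hour_of_day + month_of_year + hour_of_day:month_of_year + additional_column",
                         featuresCol="features",
                         labelCol="label")

        rform_regression_input = rform.fit(regression_input).transform(regression_input)

        lr = LinearRegression(featuresCol='features',
                              labelCol='label',
                              solver='normal')

        lr_model = lr.fit(rform_regression_input)
        # The intercept goes last, matching the order of the summary statistics below.
        coefs = [*lr_model.coefficients, lr_model.intercept]

        return pd.DataFrame(
            {"pvalues": lr_model.summary.pValues,
             "tvalues": lr_model.summary.tValues,
             "std_errs": lr_model.summary.coefficientStandardErrors,
             "coefs": coefs}
        )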


Re: Recover RFormula Column Names

Alessandro Solimando
Hello Andrew,
a few years ago I had the same need, and I found this Stack Overflow answer to be the way to go.

Here is an extract of my (Scala) code (which was doing other things on top). I have removed the irrelevant parts without testing the result, so it might not work out of the box, but it should help you get started:

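   import org.apache.spark.sql.DataFrame

   // Builds a lookup from vector index to feature name, using the ML attribute
   // metadata attached to the features column.
   private def getEncodedVectorLookupTable(df: DataFrame,
                                           featuresColName: String): Map[Long, String] = {
     val meta = df.select(featuresColName)
       .schema.fields.head.metadata
       .getMetadata("ml_attr")
       .getMetadata("attrs")

     /* REFLECTION START */
     // Read the private "map" field to list the attribute groups that are present.
     val field = meta.getClass.getDeclaredField("map")
     field.setAccessible(true)
     val keys = field.get(meta).asInstanceOf[Map[String, Any]].keySet
     field.setAccessible(false)
     /* REFLECTION END */

     // Each group holds an array of entries carrying "idx" and "name".
     keys.flatMap(
       meta.getMetadataArray(_)
         .map(m => m.getLong("idx") -> m.getString("name"))
     ).toMap
   }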

It looks like there is now some support for achieving this, but I have never tried it: https://spark.apache.org/docs/latest/api/java/org/apache/spark/ml/r/RWrapperUtils.html

Best regards,
Alessandro

On Mon, 28 Oct 2019 at 21:01, Andrew Redd <[hidden email]> wrote:


Re: Recover RFormula Column Names

Andrew Redd
Thanks Alessandro!

That did the trick. All of the indices and interactions are in the metadata. I also wanted to confirm that this solution works in PySpark, as the metadata is carried over.
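
For anyone landing on this thread later, here is a minimal sketch of the same lookup in plain Python. It assumes the features column metadata (what `df.schema["features"].metadata` returns in PySpark) is a dict with an `ml_attr` -> `attrs` entry that groups attributes by type, each carrying `idx` and `name`. The function name and the example values are made up for illustration; the dict at the bottom is a hand-written stand-in, not real RFormula output.

```python
from typing import Any, Dict, List


def feature_names_from_metadata(metadata: Dict[str, Any]) -> List[str]:
    """Return feature names ordered by their position in the assembled vector."""
    attrs = metadata["ml_attr"]["attrs"]
    # attrs groups entries by attribute type (for example "numeric" or "binary");
    # each entry carries the vector index ("idx") and the generated name ("name").
    index_to_name = {
        entry["idx"]: entry["name"]
        for group in attrs.values()
        for entry in group
    }
    return [index_to_name[i] for i in sorted(index_to_name)]


# Hand-written stand-in for df.schema["features"].metadata (hypothetical values).
example_metadata = {
    "ml_attr": {
        "attrs": {
            "binary": [
                {"idx": 1, "name": "hour_of_day_0"},
                {"idx": 2, "name": "hour_of_day_1"},
            ],
            "numeric": [{"idx": 0, "name": "log_treatment"}],
        },
        "num_attrs": 3,
    }
}

names = feature_names_from_metadata(example_metadata)
print(names)
```

Sorting by `idx` matters because the groups are keyed by attribute type, so iterating over them does not follow the order of the vector.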

Andrew

On Tue, Oct 29, 2019 at 5:26 AM Alessandro Solimando <[hidden email]> wrote:

Re: Recover RFormula Column Names

Alessandro Solimando
Glad to hear that Andrew.

While looking for the aforementioned Stack Overflow answer I stumbled upon a similar one for PySpark. It works, and since it is in Python you are also spared the "reflection" part.
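
Once an index-to-name lookup is available, pairing it with the fitted values could look roughly like the sketch below. It is plain Python and assumes the intercept is reported last, as in Andrew's `coefs` list; the function name, the "(Intercept)" label, and the numbers are placeholders chosen for illustration.

```python
from typing import Dict, List, Sequence, Tuple


def label_coefficients(index_to_name: Dict[int, str],
                       coefficients: Sequence[float],
                       intercept: float) -> List[Tuple[str, float]]:
    """Pair each coefficient with its feature name and append the intercept last."""
    if len(index_to_name) != len(coefficients):
        raise ValueError("lookup and coefficient vector differ in length")
    labeled = [(index_to_name[i], float(coefficients[i]))
               for i in range(len(coefficients))]
    # "(Intercept)" is an arbitrary label chosen for this sketch.
    labeled.append(("(Intercept)", float(intercept)))
    return labeled


# Hypothetical lookup and fitted values, for illustration only.
lookup = {0: "log_treatment", 1: "hour_of_day_0", 2: "hour_of_day_1"}
rows = label_coefficients(lookup, [0.5, -1.25, 2.0], 3.0)
for name, value in rows:
    print(name, value)
```

The length check is there to catch a lookup taken from a different DataFrame than the one the model was fitted on.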

If you happen to try RWrapperUtils, it would be great to hear your feedback!

Best regards,
Alessandro

On Tue, 29 Oct 2019 at 13:49, Andrew Redd <[hidden email]> wrote: