Pivot Column ordering in spark

classic Classic list List threaded Threaded
1 message Options
Reply | Threaded
Open this post in threaded view

Pivot Column ordering in spark

Manohar Rao

I am doing a pivot transformation on an input dataset

Following input schema
 |-- c_salutation: string (nullable = true)
 |-- c_preferred_cust_flag: string (nullable = true)
 |-- integer_type_col: integer (nullable = false)
 |-- long_type_col: long (nullable = false)
 |-- string_type_col: string (nullable = true)
 |-- decimal_type_col: decimal(38,0) (nullable = true)

 My pivot column is c_preferred_cust_flag , pivot values is "Y","N","R"
  and group by column is c_salutation

 I am using the api   pivot(String pivotColumn,
                                      java.util.List<Object> values) 
on  RelationalGroupedDataset

My aggregation functions after this pivot is 
count(`string_type_col`) ,sum(`string_type_col`) ,sum(`integer_type_col`) ,avg(`integer_type_col`) 
,sum(`long_type_col`) ,avg(`long_type_col`) ,avg(`decimal_type_col`)

My output dataset schema after the groupby.pivot.agg() 
 |-- c_salutation: string (nullable = true)
 |-- Y_count(`string_type_col`): long (nullable = true)
 |-- Y_sum(CAST(`string_type_col` AS DOUBLE)): double (nullable = true)
 |-- Y_sum(CAST(`integer_type_col` AS BIGINT)): long (nullable = true)
 |-- Y_avg(CAST(`integer_type_col` AS BIGINT)): double (nullable = true)
 |-- Y_sum(`long_type_col`): long (nullable = true)
 |-- Y_avg(`long_type_col`): double (nullable = true)
 |-- Y_avg(`decimal_type_col`): decimal(38,4) (nullable = true)
 |-- N_count(`string_type_col`): long (nullable = true)
 |-- N_sum(CAST(`string_type_col` AS DOUBLE)): double (nullable = true)
 |-- N_sum(CAST(`integer_type_col` AS BIGINT)): long (nullable = true)
 |-- N_avg(CAST(`integer_type_col` AS BIGINT)): double (nullable = true)
 |-- N_sum(`long_type_col`): long (nullable = true)
 |-- N_avg(`long_type_col`): double (nullable = true)
 |-- N_avg(`decimal_type_col`): decimal(38,4) (nullable = true)
 |-- R_count(`string_type_col`): long (nullable = true)
 |-- R_sum(CAST(`string_type_col` AS DOUBLE)): double (nullable = true)
 |-- R_sum(CAST(`integer_type_col` AS BIGINT)): long (nullable = true)
 |-- R_avg(CAST(`integer_type_col` AS BIGINT)): double (nullable = true)
 |-- R_sum(`long_type_col`): long (nullable = true)
 |-- R_avg(`long_type_col`): double (nullable = true)
 |-- R_avg(`decimal_type_col`): decimal(38,4) (nullable = true)

 My requirement is:
to rename the system generated column names such as Y_count(`string_type_col`), N_avg(`decimal_type_col`)
 etc to a user defined name based on a mapping. I
I need to be able to do this programatically given a mapping of the form: 
(pivotvalue + aggregationfunction) --> (requiredcolumnname)

 My question is :
 Can i rely on the order of the output columns generated?
    The order looks to confirm to this pattern

 Is this order standard across spark versions 2+ . ?
 Is this subject to change or not reliable from a user point of view. ?

 If not reliable , is there another way by which I can logically/programatically
  identify that a column such  as R_sum(CAST(`integer_type_col` AS BIGINT))  
 corresponds to the input pivot value  "R" and aggregation function of sum(`integer_type_col`)