FW: Pyspark: set Orc Stripe.size on dataframe writer issue


Somasundara, Ashwin

Hello Group

 

I am having trouble setting the stripe size, row index stride, and index creation on an ORC file using PySpark. I am getting approximately 2,000 stripes for a 1.2 GB file, whereas I expect only about 5 stripes with the 256 MB stripe size setting.
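The expected figure of about 5 stripes follows from simple arithmetic. The sketch below is only a rough check: it assumes the on-disk file size roughly tracks the configured stripe size, which need not hold once compression is involved. The function name expected_stripes is illustrative and not part of any Spark or ORC API.

```python
import math

# Figures taken from the post; the GB-to-bytes conversion is an assumption.
file_size_bytes = int(1.2 * 1024**3)   # about 1.2 GB
stripe_size_bytes = 268435456          # 256 MB, the value passed as orc.stripe.size
observed_stripes = 2000                # approximate count reported above


def expected_stripes(file_size, stripe_size):
    """Rough stripe count if every stripe filled up to the configured size."""
    return math.ceil(file_size / stripe_size)


# Expected stripe count under the assumption above.
print(expected_stripes(file_size_bytes, stripe_size_bytes))

# Average on-disk bytes per stripe implied by the observed count.
print(file_size_bytes // observed_stripes)
```

With roughly 2,000 stripes, each stripe averages well under 1 MB on disk, which suggests the 256 MB setting is not reaching the ORC writer at all.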

 

I tried the options below:

 

1. Set options on the DataFrame writer. The compression setting passed through .option worked, but none of the other .option settings took effect. I looked into the .option method of the DataFrameWriter class, and it appears to handle only compression, not the stripe size, index, or stride.

 

(df
    .repartition(custom field)
    .sortWithinPartitions(custom field, sort field 1, sort field 2)
    .write.format("orc")
    .option("compression", "zlib")            # only this option worked
    .option("preserveSortOrder", "true")
    .option("orc.stripe.size", "268435456")   # 256 MB
    .option("orc.row.index.stride", "true")
    .option("orc.create.index", "true")
    .save(s3 location))

 

 

2. Created an empty Hive table with the above ORC settings and loaded data into it from Spark using the saveAsTable and insertInto methods. The resulting table had more stripes than anticipated.

 

(df
    .repartition(custom field)
    .sortWithinPartitions(custom field, sort field 1, sort field 2)
    .write.format("orc")
    .mode("append")
    .saveAsTable(hive tablename))    # also tried .insertInto(hive table name)

 

 

For both options, I enabled the configs below:

 

spark.sql("set spark.sql.orc.impl=native")
spark.sql("set spark.sql.orc.enabled=true")
spark.sql("set spark.sql.orc.cache.stripe.details.size=268435456")

 

Please let me know if I am missing any code, DataFrame writer methods, or Spark session configs that would give the desired results.