Spark read parquet with unnamed index



Lord, Jesse

When reading a Parquet file created from a pandas DataFrame with an unnamed index, Spark creates a column named “__index_level_0__”, since Spark DataFrames do not support row indexing. This looks like a bug to me: as a Spark user, I would expect unnamed index columns to be dropped on read. It might be intended behavior, though.

 

import pandas as pd
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Build a pandas DataFrame with the default (unnamed) index and write it out.
pandas_frame = pd.DataFrame({'str_col': ['a', 'b'], 'num_col': [1, 2]})
pandas_frame.to_parquet('test.parquet')

# Read the same file back with Spark and display it.
spark_frame = spark.read.parquet('test.parquet')
spark_frame.show()

 

+-------+-------+-----------------+
|num_col|str_col|__index_level_0__|
+-------+-------+-----------------+
|      1|      a|                0|
|      2|      b|                1|
+-------+-------+-----------------+
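
For anyone hitting the same thing, here is a rough sketch of a reader-side workaround. It assumes the extra columns always follow the "__index_level_<n>__" naming seen in the output above. The helper names unnamed_index_columns and drop_unnamed_index are made up for illustration and are not part of Spark or pandas.

```python
import re

# Assumption: unnamed pandas index levels are serialized as "__index_level_<n>__",
# matching the column name seen in the Spark output above.
_INDEX_COLUMN = re.compile(r"^__index_level_\d+__$")


def unnamed_index_columns(columns):
    """Return the column names that look like serialized unnamed index levels."""
    return [name for name in columns if _INDEX_COLUMN.match(name)]


def drop_unnamed_index(frame):
    """Hypothetical helper: drop those columns from a Spark DataFrame.

    Only relies on `frame.columns` and `frame.drop(*names)`.
    """
    to_drop = unnamed_index_columns(frame.columns)
    # Return the frame unchanged when there is nothing to drop.
    return frame.drop(*to_drop) if to_drop else frame


# Usage with the frame from the example above:
#   spark_frame = drop_unnamed_index(spark.read.parquet('test.parquet'))
```

On the writing side, passing index=False to pandas_frame.to_parquet should keep the index out of the file in the first place, but that only helps when you control how the file is produced.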

 

Thanks,

Jesse