How to read the schema of a partitioned dataframe without listing all the partitions ?


How to read the schema of a partitioned dataframe without listing all the partitions ?

Walid LEZZAR
Hi,

I have a Parquet dataset on S3 partitioned by day, covering 2 years of data (about 1000 partitions). When I just want to know the schema of this dataset, without even asking for a single row of data, Spark tries to list all the partitions and nested partitions. That makes it very slow just to build the dataframe object in Zeppelin.

Is there a way to avoid that? Is there a way to tell Spark: "hey, just read a single partition, take its schema, and consider it the schema of the whole dataframe"? (I don't care about schema merging; it's off, by the way.)

Thanks.
Walid.

Re: How to read the schema of a partitioned dataframe without listing all the partitions ?

ayan guha
You can specify the first partition folder directly and read just that.

On Fri, 27 Apr 2018 at 9:42 pm, Walid LEZZAR <[hidden email]> wrote:
--
Best Regards,
Ayan Guha

Re: How to read the schema of a partitioned dataframe without listing all the partitions ?

Yong Zhang
In reply to this post by Walid LEZZAR

What version of Spark are you using?


You can search for "spark.sql.parquet.mergeSchema" on https://spark.apache.org/docs/latest/sql-programming-guide.html


Starting from Spark 1.5, the default is already "false", which means Spark shouldn't scan all the parquet files to generate the schema.


Yong

Re: How to read the schema of a partitioned dataframe without listing all the partitions ?

Walid LEZZAR
I’m using Spark 2.3 with schema merging set to false. Indeed, I don’t think Spark is reading any files, but it tries to list them all one by one, and that’s super slow on S3!

Pointing at a single partition manually is not an option: it requires knowing the partitioning scheme in advance to build the path, and also Spark doesn’t include the partitioning column in that case.

On 27 Apr 2018 at 16:07, Yong Zhang <[hidden email]> wrote:
