[PySpark 2.3+] Reading parquet entire path vs a set of file paths


[PySpark 2.3+] Reading parquet entire path vs a set of file paths

rishishah.star
Hi All,

I use the following to read a set of parquet files when they are scattered across many partitions.

paths = ['p1', 'p2', ... 'p10000']
df = spark.read.parquet(*paths)

The above method feels like it reads those files sequentially rather than parallelizing the read operation. Is that correct?
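
For what it's worth, one way I can think of to sanity-check this (a rough sketch only, reusing the placeholder 'paths' and 'consolidated_path' from this message) is to compare the number of planned input partitions for both reads, either in the Spark UI or via the underlying RDD:

# Sketch for sanity-checking parallelism; both reads should plan
# roughly one task per file split, so similar partition counts would
# suggest the scan itself is parallelized either way, and the extra
# time with many paths may be in the file listing step.
df_many = spark.read.parquet(*paths)
df_single = spark.read.parquet('consolidated_path')
print(df_many.rdd.getNumPartitions())
print(df_single.rdd.getNumPartitions())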

If I put all these files under a single path and read it as below, it works faster:

path = 'consolidated_path'
df = spark.read.parquet(path)

Is my observation correct? If so, is there a way to optimize reads from multiple/specific paths?
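
One workaround I have been considering (a rough sketch only; it assumes the scattered directories are partition folders such as date=... under a common root, which may not match the real layout) is to read the common root and filter on the partition column so Spark can prune directories, or to pass an explicit basePath when only specific paths are needed:

# Sketch only; assumes a partitioned layout such as
#   consolidated_path/date=2020-06-01/part-*.parquet
# Reading the root and filtering on the partition column lets Spark
# prune partitions at planning time instead of listing every path.
df = (spark.read
      .parquet('consolidated_path')
      .where("date BETWEEN '2020-06-01' AND '2020-06-30'"))

# If specific paths must be passed, setting basePath keeps the
# partition column(s) in the resulting DataFrame.
df = (spark.read
      .option('basePath', 'consolidated_path')
      .parquet('consolidated_path/date=2020-06-01',
               'consolidated_path/date=2020-06-02'))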

--
Regards,

Rishi Shah

Re: [PySpark 2.3+] Reading parquet entire path vs a set of file paths

rishishah.star
Hi All,

Just following up on my earlier message to see if anyone has any suggestions. I appreciate your help in advance.

Thanks,
Rishi



--
Regards,

Rishi Shah