[pyspark] Read multiple files in parallel into a single dataframe

Shuporno Choudhury
Hi,

I want to read multiple files in parallel into a single dataframe. The files have random names that don't conform to any pattern (so I can't use a wildcard), and they can be in different directories.
If I provide the file names as a list to the dataframe reader, it reads them sequentially:
    df = spark.read.format('csv').load(['/path/to/file1.csv.gz', '/path/to/file2.csv.gz', '/path/to/file3.csv.gz'])
What can I do to read the files in parallel?
I noticed that Spark reads files in parallel when given a directory location directly. How can that be extended to multiple arbitrarily named files?
For example, if my system has 4 cores, how can I make Spark read 4 files at a time?

Please suggest.
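
For concreteness, a minimal runnable sketch of the setup described above; the paths, the app name, and the local 4-core master are hypothetical stand-ins, not details from the original post:

from pyspark.sql import SparkSession

# local[4] runs Spark with 4 worker threads, matching the 4-core example.
spark = SparkSession.builder.master('local[4]').appName('multi-file-read').getOrCreate()

# Hypothetical files with unrelated names in different directories.
paths = [
    '/data/a/file1.csv.gz',
    '/data/b/export_final.csv.gz',
    '/data/c/random_name.csv.gz',
]
df = spark.read.format('csv').load(paths)

# gzip is not splittable, so each .csv.gz file becomes a single
# partition (and a single task); expect one partition per file here.
print(df.rdd.getNumPartitions())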
Re: [pyspark] Read multiple files in parallel into a single dataframe

Irving Duran
I could be wrong, but I think you can use a wildcard:

df = spark.read.format('csv').load('/path/to/file*.csv.gz')

Thank You,

Irving Duran
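
If the names truly share no common pattern, one possible extension of the wildcard approach: Spark resolves paths through Hadoop-style globs, which also support {a,b} alternation, and load() accepts a list that mixes glob patterns with plain paths. A minimal sketch, with hypothetical paths and assuming an existing SparkSession named spark:

# {a,b} matches a fixed set of otherwise unrelated names, and the list
# can mix glob patterns with plain paths across directories.
df = spark.read.format('csv').load([
    '/path/to/{file1,unrelated_name}.csv.gz',  # alternation within one directory
    '/another/dir/*.csv.gz',                   # wildcard in a second directory
    '/third/dir/exact_file.csv.gz',            # plain path
])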

