How to skip nonexistent files when reading files with Spark?


JF Chen
Hi Everyone
I ran into a tricky problem recently. I am trying to read some file paths generated by another method. The paths are given as wildcard patterns in a list, like ['/data/*/12', '/data/*/13'].
In practice, if a wildcard does not match any existing path, Spark throws an exception, "pyspark.sql.utils.AnalysisException: 'Path does not exist: ...'", and the program stops.
I would like Spark to simply ignore and skip these nonexistent paths and keep running. I have tried the Python HdfsCLI API to check whether a path exists first, but it does not support wildcards.

Any good ideas to solve my problem? Thanks~

Regards,
Junfeng Chen

Re: How to skip nonexistent files when reading files with Spark?

Thakrar, Jayesh

You can probably do some preprocessing/checking of the paths before you attempt to read them via Spark.

Whether the filesystem is local or HDFS, you can check for existence and other details with the "FileSystem.globStatus" method from the Hadoop API.
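
For example, something along these lines, which reaches the Hadoop FileSystem through PySpark's JVM gateway, could filter out the unmatched patterns up front. This is only a rough sketch, not tested code: it assumes an existing SparkSession named spark, and the JSON reader at the end is just a placeholder for whatever format you actually read.

    # Expand each wildcard pattern with Hadoop's globStatus and keep only the
    # matches, so Spark is never asked to read a path that does not exist.
    patterns = ['/data/*/12', '/data/*/13']

    jvm = spark.sparkContext._jvm
    jconf = spark.sparkContext._jsc.hadoopConfiguration()
    fs = jvm.org.apache.hadoop.fs.FileSystem.get(jconf)

    existing = []
    for pattern in patterns:
        statuses = fs.globStatus(jvm.org.apache.hadoop.fs.Path(pattern))
        if statuses:  # None for a missing literal path, empty for an unmatched glob
            existing.extend(str(status.getPath()) for status in statuses)

    if existing:
        df = spark.read.json(existing)  # placeholder reader; use your actual format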

 



Re: How to skip nonexistent files when reading files with Spark?

JF Chen
Thanks, Thakrar,

I have tried checking whether a path exists before reading it, but the HdfsCLI Python package does not seem to support wildcards. "FileSystem.globStatus" is a Java API, while I am using Python via Livy.... Do you know of any Python API that implements the same function?


Regards,
Junfeng Chen



Re: How to skip nonexistent files when reading files with Spark?

ayan guha
A relatively naive solution would be:

0. Create a dummy blank dataframe.
1. Loop through the list of paths.
2. Try to create a dataframe from the path. If it succeeds, union it cumulatively.
3. If it errors, just ignore it or handle it as you wish.

At the end of the loop, just use the unioned df. This should not add any noticeable overhead, since declaring dataframes and unioning them is cheap as long as you do not call any action inside the loop.
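
A rough, untested sketch of that loop (it collects the per-path dataframes in a list instead of starting from a dummy empty dataframe; the path list, the spark session and the JSON reader are placeholders):

    from functools import reduce
    from pyspark.sql.utils import AnalysisException

    paths = ['/data/*/12', '/data/*/13']

    dfs = []
    for p in paths:
        try:
            dfs.append(spark.read.json(p))  # placeholder reader; use your format
        except AnalysisException as e:
            # typically "Path does not exist: ..." -- skip this pattern and go on
            print("skipping {}: {}".format(p, e))

    if dfs:
        # union requires compatible schemas across all of the dataframes
        result = reduce(lambda a, b: a.union(b), dfs)

Nothing runs until you call an action on result, so the loop itself stays cheap.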

Best
Ayan


--
Best Regards,
Ayan Guha

Re: How to skip nonexistent files when reading files with Spark?

JF Chen
Thanks ayan,

I have also tried this method. The trickiest part is that the dataframe union method requires the same schema, while the schema of my files varies.


Regards,
Junfeng Chen



Re: How to skip nonexistent files when reading files with Spark?

Thakrar, Jayesh

Junfeng,

 

I would suggest preprocessing/validating the paths in plain Python (and not Spark) before you try to fetch data.

I am not familiar with Python Hadoop libraries, but see if this helps - http://crs4.github.io/pydoop/tutorial/hdfs_api.html
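
For illustration only, a rough, untested sketch of such a check in plain Python; it assumes pydoop's hdfs.ls and hdfs.path.exists functions and that each pattern contains a single '*' standing for one whole path component, as in '/data/*/12':

    import fnmatch
    import pydoop.hdfs as hdfs

    def expand_pattern(pattern):
        """Expand a pattern like '/data/*/12' into the concrete paths that exist."""
        prefix, star, suffix = pattern.partition('*')   # '/data/', '*', '/12'
        if not star:                                     # no wildcard at all
            return [pattern] if hdfs.path.exists(pattern) else []
        base = prefix.rstrip('/') or '/'
        if not hdfs.path.exists(base):
            return []
        matches = []
        for child in hdfs.ls(base):                      # entries under '/data'
            candidate = child.rstrip('/') + suffix       # e.g. '.../data/x/12'
            # match against '*<pattern>' because ls may return full hdfs:// URIs
            if fnmatch.fnmatch(candidate, '*' + pattern) and hdfs.path.exists(candidate):
                matches.append(candidate)
        return matches

    patterns = ['/data/*/12', '/data/*/13']
    existing = [p for pat in patterns for p in expand_pattern(pat)]

The surviving paths can then be passed to spark.read as a plain list, so the nonexistent patterns are never handed to Spark at all.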

 

Best,

Jayesh

 



Re: How to skip nonexistent files when reading files with Spark?

JF Chen
Thanks Thakrar~


Regards,
Junfeng Chen
