Continue reading dataframe from file despite errors

Continue reading dataframe from file despite errors

jeffsaremi

I'm using a statement like the following to load my dataframe from some text file.

Upon encountering the first error, the whole thing throws an exception and processing stops.

I'd like to continue loading even if that results in zero rows in my dataframe. How can I do that?
Thanks


spark.read.schema(SomeSchema).option("sep", "\t").format("csv").load("somepath")



Re: Continue reading dataframe from file despite errors

jeffsaremi

I should have added some of the exception to be clear:

17/09/12 14:14:17 ERROR TaskSetManager: Task 0 in stage 15.0 failed 1 times; aborting job
org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 15.0 failed 1 times, most recent failure: Lost task 0.0 in stage 15.0 (TID 15, localhost, executor driver): java.lang.NumberFormatException: For input string: "south carolina"
        at java.lang.NumberFormatException.forInputString(NumberFormatException.java:65)
        at java.lang.Integer.parseInt(Integer.java:580)
        at java.lang.Integer.parseInt(Integer.java:615)
        at scala.collection.immutable.StringLike$class.toInt(StringLike.scala:272)
        at scala.collection.immutable.StringOps.toInt(StringOps.scala:29)
        at org.apache.spark.sql.execution.datasources.csv.CSVTypeCast$.castTo(CSVInferSchema.scala:250)



Re: Continue reading dataframe from file despite errors

Suresh Thalamati
Try the CSV option mode = "DROPMALFORMED"; that might skip the error records.
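
For example, a minimal sketch with that mode added to the reader from the first post (SomeSchema and somepath are the placeholders from the original snippet):

val df = spark.read
  .schema(SomeSchema)
  .option("sep", "\t")
  .option("mode", "DROPMALFORMED") // silently skip rows that fail to parse against the schema
  .format("csv")
  .load("somepath")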



Re: Continue reading dataframe from file despite errors

jeffsaremi

Thanks, Suresh. It worked nicely.
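
For anyone finding this later, the working load presumably ends up looking something like this sketch (same placeholder names as before); a count afterwards shows how many rows survived the drop:

val df = spark.read
  .schema(SomeSchema)
  .option("sep", "\t")
  .option("mode", "DROPMALFORMED")
  .format("csv")
  .load("somepath")
// zero is possible here if every input row is malformed
println(s"rows kept: ${df.count()}")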

