CSV parser - is there a way to find malformed csv record

CSV parser - is there a way to find malformed csv record

Nirav Patel
I am getting `RuntimeException: Malformed CSV record` while parsing CSV records and attaching a schema at the same time. Most likely some fields contain additional commas or JSON data that are not escaped properly. Is there a way the CSV parser can tell me which record is malformed?


This is what I am using:

    val df2 = sparkSession.read
      .option("inferSchema", true)
      .option("multiLine", true)
      .schema(headerDF.schema) // this only works without column mismatch
      .csv(dataPath)

Thanks



Re: CSV parser - is there a way to find malformed csv record

Shuporno Choudhury
Hi,
There is a way to obtain these malformed/rejected records. Rejection can happen not only because of a column-count mismatch but also when a field's value does not match the data type declared in the schema.
To obtain the rejected records, you can do the following:
1. Add an extra column (e.g. CorruptRecCol) of type StringType() to your schema.
2. In the DataFrame reader, set the mode to 'PERMISSIVE' and pass CorruptRecCol as the columnNameOfCorruptRecord option.
3. The column CorruptRecCol will contain the complete raw record if it is malformed/corrupted. On the other hand, it will be null if the record is valid. So you can use a filter (CorruptRecCol IS NOT NULL) to obtain the malformed/corrupted records.
You can use any column name to hold the invalid records; I have used CorruptRecCol just as an example.
These options work the same way in PySpark and in Java/Scala; a Scala sketch follows below.
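
A minimal Scala sketch of these steps, reusing sparkSession, headerDF, and dataPath from the original post (the column name CorruptRecCol comes from this reply; the cache() call is an extra precaution, since Spark 2.3+ disallows queries over the raw CSV that reference only the corrupt-record column):

    import org.apache.spark.sql.types.{StringType, StructField, StructType}

    // Step 1: extend the schema with a StringType column to hold raw corrupt records
    val schemaWithCorrupt = StructType(
      headerDF.schema.fields :+ StructField("CorruptRecCol", StringType, nullable = true))

    // Step 2: read in PERMISSIVE mode and route bad records into CorruptRecCol
    val df2 = sparkSession.read
      .option("multiLine", true)
      .option("mode", "PERMISSIVE")
      .option("columnNameOfCorruptRecord", "CorruptRecCol")
      .schema(schemaWithCorrupt)
      .csv(dataPath)

    // Step 3: valid rows leave CorruptRecCol null; malformed rows carry the raw text
    df2.cache()
    val badRecords = df2.filter(df2("CorruptRecCol").isNotNull)
    badRecords.show(truncate = false)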




--
Thanks,
Shuporno Choudhury

RE: CSV parser - is there a way to find malformed csv record

Taylor Cox
In reply to this post by Nirav Patel

Hey Nirav,


Here’s an idea:


Suppose your file.csv has N records, one per line. Read the CSV line by line (without Spark) and attempt to parse each line. If a record is malformed, catch the exception and rethrow it with the line number. That should show you where the problematic record(s) can be found. A rough sketch of this is below.
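
A rough Scala sketch of this idea (the file path, the expected column count, and the use of the univocity CsvParser that Spark bundles are all assumptions). Because it reads one line at a time, it won't handle records that legitimately span lines under the multiLine option:

    import scala.io.Source
    import scala.util.{Failure, Success, Try}
    import com.univocity.parsers.csv.{CsvParser, CsvParserSettings}

    val expectedColumns = 10  // assumption: set this to your schema's width
    val parser = new CsvParser(new CsvParserSettings())

    val source = Source.fromFile("file.csv")  // hypothetical path
    try {
      source.getLines().zipWithIndex.foreach { case (line, idx) =>
        Try(parser.parseLine(line)) match {
          case Failure(e) =>
            println(s"Line ${idx + 1} failed to parse: ${e.getMessage}")
          case Success(row) if row == null || row.length != expectedColumns =>
            println(s"Line ${idx + 1} is malformed: $line")
          case _ => // line parsed cleanly
        }
      }
    } finally source.close()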



Re: CSV parser - is there a way to find malformed csv record

Nirav Patel
Thanks, Shuporno. That mode worked. I found a couple of records with quotes inside quotes that needed to be escaped.
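
A minimal sketch of that fix, assuming the embedded quotes are doubled ("") in the RFC 4180 style: setting Spark's escape character to the quote character (the default escape is backslash) lets such records parse cleanly. The option value is an assumption, not something confirmed in the thread:

    val df3 = sparkSession.read
      .option("multiLine", true)
      .option("escape", "\"")  // assumption: treat "" inside a quoted field as a literal quote
      .schema(headerDF.schema)
      .csv(dataPath)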


