CSV parsing issue

elango vaidyanathan

Hi team,

I am loading a CSV file. One column contains a JSON value, and I am unable to parse that column properly. The details are below. Could you please take a look?

 

val df1 = spark.read
  .option("inferSchema", "true")
  .option("header", "true")
  .option("quote", "\"")
  .option("escape", "\"")
  .csv("/FileStore/tables/sample_file_structure.csv")

 

sample data:
----------------
column1,column2,column3
123456789,"{   "moveId" : "123456789",   "dob" : null,   "username" : "abcdef",   "language" : "en" }",11
123456789,"{   "moveId" : "123456789",   "dob" : null,   "username" : "ghi, jkl",   "language" : "en" }",12
123456789,"{   "moveId" : "123456789",   "dob" : null,   "username" : "mno, pqr",   "language" : "en" }",13

 

output:
-----------
+---------+--------------------+---------------+
|  column1|             column2|        column3|
+---------+--------------------+---------------+
|123456789|  "{ "moveId" : "...|   "dob" : null|
|123456789|  "{ "moveId" : "...|   "dob" : null|
+---------+--------------------+---------------+

 


Thanks,
Elango

Re: CSV parsing issue

srowen
Your data doesn't escape double-quotes.
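
For reference, here is a minimal sketch of what an escaped field would look like, assuming the file is meant to follow the common CSV convention (RFC 4180) in which a double quote inside a quoted field is written as two double quotes. That is the convention the `quote` and `escape` options in the read call expect when both are set to `"`. The names `CsvQuoteEscaping` and `quoteField` are hypothetical and exist only for this illustration.

```scala
object CsvQuoteEscaping {
  // Wrap a raw field value in double quotes and double every embedded quote,
  // so a CSV parser using " as both quote and escape character can read it back.
  def quoteField(raw: String): String =
    "\"" + raw.replace("\"", "\"\"") + "\""

  def main(args: Array[String]): Unit = {
    // Shortened JSON value, modeled on the sample data in the question.
    val json = """{ "moveId" : "123456789", "username" : "ghi, jkl" }"""
    val line = Seq("123456789", quoteField(json), "12").mkString(",")
    // The inner quotes come out doubled, so the comma in "ghi, jkl"
    // stays inside the quoted field instead of splitting the column.
    println(line)
  }
}
```

A file written this way should load as three columns with the options already used in the question.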

Re: CSV parsing issue

elango vaidyanathan
Is there any way I can handle it in code?

Thanks,
Elango

Re: CSV parsing issue

srowen
I don't think so; that data is inherently ambiguous and incorrectly formatted. If you know something about the structure, maybe you can rewrite the middle column manually to escape the inner quotes and then reparse.
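
A rough sketch of such a rewrite, under assumptions that are not stated in the thread: every line holds exactly one record with three columns, the first and last columns never contain a comma, and the middle column is wrapped in one pair of outer quotes. With those assumptions the middle column is everything between the first and the last comma on the line. `RepairCsvLine` and `repairLine` are hypothetical names, and any line that does not match the pattern (the header, for example) is returned unchanged.

```scala
object RepairCsvLine {
  // Rewrite one raw line so the quotes inside the middle column are doubled.
  def repairLine(line: String): String = {
    val first = line.indexOf(',')
    val last = line.lastIndexOf(',')
    if (first < 0 || last <= first) return line // fewer than three columns

    val middle = line.substring(first + 1, last).trim
    val wrapped =
      middle.length >= 2 && middle.startsWith("\"") && middle.endsWith("\"")
    if (!wrapped) return line // e.g. the header row

    // Strip the outer quotes, double the inner ones, then wrap again.
    val inner = middle.substring(1, middle.length - 1)
    val escaped = inner.replace("\"", "\"\"")
    line.substring(0, first + 1) + "\"" + escaped + "\"" + line.substring(last)
  }

  def main(args: Array[String]): Unit = {
    val raw = """123456789,"{ "moveId" : "123456789", "username" : "ghi, jkl" }",12"""
    println(repairLine(raw))
  }
}
```

In Spark, the repaired lines could then be reparsed by reading the file with `spark.read.textFile`, mapping `repairLine` over it (with `spark.implicits._` imported), and passing the resulting `Dataset[String]` to `csv` with the same `quote` and `escape` options. The sketch does nothing for records that break the assumptions, such as two records sharing one line.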

Re: CSV parsing issue

elango vaidyanathan

Thanks Sean, got it.

Thanks,
Elango
