CSV data source : Garbled Japanese text and handling multilines


CSV data source : Garbled Japanese text and handling multilines

Ashika Umagiliya

In my Spark job (Spark 2.4.1), I am reading CSV files from S3. These files contain Japanese characters. They can also contain the ^M character (\u000D), so I need to parse them as multiline records.

First I used the following code to read the CSV files:

implicit class DataFrameReadImplicits(dataFrameReader: DataFrameReader) {
  def readTeradataCSV(schema: StructType, s3Path: String): DataFrame = {
    dataFrameReader
      .option("delimiter", "\u0001")
      .option("header", "false")
      .option("inferSchema", "false")
      .option("multiLine", "true")
      .option("encoding", "UTF-8")
      .option("charset", "UTF-8")
      .schema(schema)
      .csv(s3Path)
  }
}

But when I read a DataFrame using this method, all the Japanese characters are garbled.

After some testing I found that if I read the same S3 file using "spark.sparkContext.textFile(path)", the Japanese characters are decoded properly.

So I tried this instead:

implicit class SparkSessionImplicits(spark: SparkSession) {
  def readTeradataCSV(schema: StructType, s3Path: String): DataFrame = {
    import spark.implicits._
    spark.read
      .option("delimiter", "\u0001")
      .option("header", "false")
      .option("inferSchema", "false")
      .option("multiLine", "true")
      .schema(schema)
      .csv(spark.sparkContext.textFile(s3Path)
        .map(str => str.replaceAll("\u000D", " "))
        .toDS())
  }
}

Now the encoding issue is fixed. However, multiline handling still doesn't work properly: records are broken at the ^M character even though I replace it with str.replaceAll("\u000D", " "). Presumably textFile has already split the input into lines before my map runs, since Hadoop's line reader treats \u000D (CR) itself as a line terminator.

Any tips on how to read Japanese characters correctly with the first method, or handle multiline records with the second?
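[Editor's note: not part of the original thread.] One possible workaround, sketched here under the assumption that each individual file is small enough to hold in executor memory (the names SparkSessionCsvImplicits and readTeradataCSVWhole are hypothetical): spark.sparkContext.wholeTextFiles reads each file as a single UTF-8 string, so an embedded \u000D never gets a chance to split a record before it can be stripped.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.types.StructType

implicit class SparkSessionCsvImplicits(spark: SparkSession) {
  // Each file arrives as one whole string, so an embedded \u000D cannot
  // split a record; strip it, then split on the real \n record boundary.
  // Caveat: every file must fit in memory on a single executor.
  def readTeradataCSVWhole(schema: StructType, s3Path: String): DataFrame = {
    import spark.implicits._
    val records = spark.sparkContext
      .wholeTextFiles(s3Path)              // RDD[(path, full file content)]
      .flatMap { case (_, content) =>
        content.replace("\u000D", " ").split("\n")
      }
    spark.read
      .option("delimiter", "\u0001")
      .option("header", "false")
      .option("inferSchema", "false")
      .schema(schema)
      .csv(records.toDS())
  }
}
```

This trades parallelism within a file for correctness, which may be acceptable when the input is many small-to-medium files rather than a few huge ones.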


Re: CSV data source : Garbled Japanese text and handling multilines

ZHANG Wei
May I ask what the CSV file's encoding is? It can be checked with the `file` command.
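[Editor's note: not part of the original thread.] For reference, a quick way to run that check; the sample path and text below are placeholders, not the poster's actual file:

```shell
# Write a small UTF-8 sample containing Japanese text, then inspect it.
printf '日本語テスト\n' > /tmp/sample_utf8.csv
# -i reports the MIME type and charset (e.g. "text/plain; charset=utf-8").
file -i /tmp/sample_utf8.csv
```

If `file` reports something other than UTF-8 (for example Shift_JIS or EUC-JP, both common for Japanese data), the `encoding` option passed to the CSV reader would need to match that charset.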

--
Cheers,
-z

On Tue, 19 May 2020 09:24:24 +0900
Ashika Umagiliya <[hidden email]> wrote:

> In my Spark job (spark 2.4.1) , I am reading CSV files on S3.These files
> contain Japanese characters.Also they can have ^M character (u000D) so I
> need to parse them as multiline.
>
> First I used following code to read CSV files:
>
> implicit class DataFrameReadImplicits (dataFrameReader: DataFrameReader) {
>      def readTeradataCSV(schema: StructType, s3Path: String) : DataFrame = {
>
>         dataFrameReader.option("delimiter", "\u0001")
>           .option("header", "false")
>           .option("inferSchema", "false")
>           .option("multiLine","true")
>           .option("encoding", "UTF-8")
>           .option("charset", "UTF-8")
>           .schema(schema)
>           .csv(s3Path)
>      }
>   }
>
> But when I read DF using this method all the Japanese characters are garbled.
>
> After doing some tests I found out that If I read the same S3 file
> using *"spark.sparkContext.textFile(path)"* Japanese characters
> encoded properly.
>
> So I tried this way :
>
> implicit class SparkSessionImplicits (spark : SparkSession) {
>     def readTeradataCSV(schema: StructType, s3Path: String) = {
>       import spark.sqlContext.implicits._
>       spark.read.option("delimiter", "\u0001")
>         .option("header", "false")
>         .option("inferSchema", "false")
>         .option("multiLine","true")
>         .schema(schema)
>         .csv(spark.sparkContext.textFile(s3Path).map(str =>
> str.replaceAll("\u000D"," ")).toDS())
>     }
>   }
>
> Now the encoding issue is fixed.However multilines doesn't work
> properly and lines are broken near ^M character , even though I tried
> to replace ^M using *str.replaceAll("\u000D"," ")*
>
> Any tips on how to read Japanese characters using first method, or
> handle multi-lines using the second method ?
