Implementing .zip file codec


Implementing .zip file codec

This post has NOT been accepted by the mailing list yet.

I am able to read and write .gz files through spark-csv using the available codecs, and I get the expected result. But when I try to read and write a .zip file, Spark gives unexpected results like wV�J�.f�T n .
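The garbled characters are what one would expect if the raw bytes of the archive were decoded as text. A minimal sketch in plain Java (no Spark) that illustrates this, assuming Java 9 or later for `readAllBytes`; the class name `ZipVsText`, its methods, and the entry name `data.csv` are hypothetical and used only for illustration:

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

// Hypothetical helper class, for illustration only.
class ZipVsText {
    // Packs the given text into an in-memory zip archive with a single entry.
    static byte[] zip(String entryName, String text) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            try (ZipOutputStream out = new ZipOutputStream(bytes)) {
                out.putNextEntry(new ZipEntry(entryName));
                out.write(text.getBytes(StandardCharsets.UTF_8));
                out.closeEntry();
            }
            return bytes.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Decodes the first entry of a zip archive back to text.
    static String unzipFirstEntry(byte[] archive) {
        try (ZipInputStream in = new ZipInputStream(new ByteArrayInputStream(archive))) {
            if (in.getNextEntry() == null) {
                return "";
            }
            // readAllBytes stops at the end of the current entry.
            return new String(in.readAllBytes(), StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        byte[] archive = zip("data.csv", "id,name\n1,a\n");
        // The raw archive starts with the zip signature bytes 'P' 'K', not with CSV text.
        System.out.println(new String(archive, 0, 2, StandardCharsets.US_ASCII));
        // Going through ZipInputStream recovers the original text.
        System.out.println(unzipFirstEntry(archive));
    }
}
```

A .gz file is a single compressed stream, whereas a .zip file is an archive with per-entry headers, so the bytes have to go through a zip-aware reader before they can be parsed as CSV.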

I have visited, but didn't find any compression codec for .zip file.

I searched Stack Overflow but did not find a satisfactory answer.

I have also tried solution from

But my requirement is to read and write .zip files the same way we read CSV files, by specifying a codec:

     dataframe.write().option("codec", "customzipcodec").format("customfomat").save("outputpath")

Please share more information if anyone has faced the same issue or has a solution.

Re: Implementing .zip file codec

Spark does not support reading zip files directly, because a zip archive cannot be split and distributed across tasks.

Read it with the java.util.zip.ZipInputStream API and build an RDD from the contents (4 GB limit):

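import java.util.zip.ZipInputStream
import scala.io.Source
import org.apache.spark.input.PortableDataStream

val zipPath = "s3://...."

// Each archive is read whole by a single task; only its first entry is decoded.
val rdd = sc.binaryFiles(zipPath).flatMap { case (_: String, file: PortableDataStream) =>
  val zipStream = new ZipInputStream(file.open())
  val entry = zipStream.getNextEntry
  // An archive without entries contributes no lines.
  if (entry == null) Iterator.empty
  else Source.fromInputStream(zipStream, "ISO_8859_1").getLines
}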
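The snippet above calls getNextEntry once, so it reads only the first entry of each archive. A minimal sketch of walking every entry, written in plain Java so it can run without Spark; the class `ZipLines`, its methods, and the sample entry names are hypothetical, and the ISO-8859-1 charset is carried over from the snippet above:

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

// Hypothetical helper class, for illustration only.
class ZipLines {
    // Returns the text lines of every file entry in the archive, in archive order.
    static List<String> readAllLines(InputStream raw) {
        List<String> lines = new ArrayList<>();
        try (ZipInputStream zip = new ZipInputStream(raw)) {
            ZipEntry entry;
            while ((entry = zip.getNextEntry()) != null) {
                if (entry.isDirectory()) {
                    continue;
                }
                // A fresh reader per entry; it is deliberately not closed,
                // because closing it would close the whole archive stream.
                BufferedReader reader = new BufferedReader(
                        new InputStreamReader(zip, StandardCharsets.ISO_8859_1));
                String line;
                while ((line = reader.readLine()) != null) {
                    lines.add(line);
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return lines;
    }

    // Builds a small two-entry archive in memory for the demo.
    static byte[] sampleArchive() {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ZipOutputStream out = new ZipOutputStream(bytes)) {
            out.putNextEntry(new ZipEntry("a.csv"));
            out.write("1,a\n2,b\n".getBytes(StandardCharsets.ISO_8859_1));
            out.closeEntry();
            out.putNextEntry(new ZipEntry("b.csv"));
            out.write("3,c\n".getBytes(StandardCharsets.ISO_8859_1));
            out.closeEntry();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    public static void main(String[] args) {
        System.out.println(readAllLines(new ByteArrayInputStream(sampleArchive())));
    }
}
```

In a Spark job, the stream returned by PortableDataStream.open() could presumably be passed to a method like `readAllLines` inside the flatMap. Note that this sketch collects all lines of an archive in memory, which may not suit large archives.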


If the zip file is larger than 4 GB, use
