Implementing .zip file codec


Implementing .zip file codec

hemant
Hi,

I am able to read and write .gz files through spark-csv using the available codecs, and I get the expected results. But when I try to read and write a .zip file, Spark gives unexpected output such as wV�J�.f�T n .


I have looked at https://github.com/apache/hadoop/tree/trunk/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/io/compress, but didn't find any compression codec for .zip files.

I searched on Stack Overflow but didn't find a satisfactory answer.

I have also tried the solution from http://stackoverflow.com/questions/28569788/how-to-open-stream-zip-files-through-spark

But my requirement is to read and write .zip files the same way we read CSV files, by specifying a codec.
Ex: sc.read.option("","").schema("userdefinedschema").format("customfomat").load("abc.zip")

     dataframe.write().option("codec", "customzipcodec").format("customfomat").save("outputpath")

Please share more information if anyone has faced the same issue or has a solution for it.

Re: Implementing .zip file codec

mytramesh
Spark doesn't support reading zip files directly, since a zip archive is not a
distributable (splittable) file.

Read it using the java.util.zip.ZipInputStream API and build an RDD from the result (4 GB limit):

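import java.util.zip.ZipInputStream
import scala.io.Source
import org.apache.spark.input.PortableDataStream

val zipPath = "s3://.... ABC.zip"

// binaryFiles yields one (path, stream) pair per file, so each archive is read by a single task
val rdd = sc.binaryFiles(zipPath).flatMap { case (_, stream: PortableDataStream) =>
  val zipStream = new ZipInputStream(stream.open())
  // Position the stream at the first entry; any further entries in the archive are not read
  zipStream.getNextEntry
  // Decode the entry's decompressed bytes as Latin-1 and emit one element per line
  Source.fromInputStream(zipStream, "ISO_8859_1").getLines()
}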
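
If the archive holds more than one file, the snippet above only returns the lines of the first entry. A minimal sketch of reading every entry follows. The object ZipLines and the method allEntryLines are hypothetical names made up for illustration, not part of Spark or the JDK, and the Latin-1 default is simply carried over from the snippet above.

```scala
import java.io.InputStream
import java.util.zip.ZipInputStream
import scala.io.Source

object ZipLines {
  // Hypothetical helper (not a Spark API): lazily yields the lines of every
  // entry in a zip archive, one entry after another.
  def allEntryLines(in: InputStream, charset: String = "ISO_8859_1"): Iterator[String] = {
    val zipStream = new ZipInputStream(in)
    Iterator
      .continually(zipStream.getNextEntry) // advance to the next entry on demand
      .takeWhile(_ != null)                // getNextEntry returns null after the last entry
      // The stream reports end-of-input at the end of each entry, so every
      // Source only sees the decompressed bytes of the current entry.
      .flatMap(_ => Source.fromInputStream(zipStream, charset).getLines())
  }
}
```

With Spark it would probably be used as sc.binaryFiles(zipPath).flatMap { case (_, stream) => ZipLines.allEntryLines(stream.open()) }. Each archive is still processed by a single task, so this does not make the read parallel within one zip file.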


If the zip file is larger than 4 GB, use ZipArchiveInputStream from Apache Commons Compress instead:
import org.apache.commons.compress.archivers.zip.ZipArchiveInputStream
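
A rough sketch of what that swap could look like, assuming the commons-compress jar is on the classpath of the driver and executors. The object LargeZipLines and the method allEntryLines are hypothetical names used only for illustration; the only library calls are the ZipArchiveInputStream constructor and getNextEntry.

```scala
import java.io.InputStream
import org.apache.commons.compress.archivers.zip.ZipArchiveInputStream
import scala.io.Source

object LargeZipLines {
  // Hypothetical helper: reads the lines of each entry in turn
  // through the Commons Compress stream class.
  def allEntryLines(in: InputStream, charset: String = "ISO_8859_1"): Iterator[String] = {
    val zipStream = new ZipArchiveInputStream(in)
    Iterator
      .continually(zipStream.getNextEntry) // null once the archive is exhausted
      .takeWhile(_ != null)
      .flatMap(_ => Source.fromInputStream(zipStream, charset).getLines())
  }
}
```

Inside the flatMap over sc.binaryFiles it would be called as LargeZipLines.allEntryLines(stream.open()), where stream is the PortableDataStream of the pair.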



--
Sent from: http://apache-spark-user-list.1001560.n3.nabble.com/
