How to read a snappy-compressed text file?


innowireless TaeYun Kim

Hi,

 

This may be a newbie question: how do I read a snappy-compressed text file in Spark?

 

The OS is Windows 7.

Currently, I’ve done the following steps:

 

1. Built Hadoop 2.4.0 with the snappy option.

The ‘hadoop checknative’ command displays the following line:

snappy: true D:\hadoop-2.4.0\bin\snappy.dll

So I assume Hadoop can do snappy compression.

BTW, snappy.dll was copied from the snappy64.dll file in snappy-windows-1.1.1.8.

 

2. Added the following property to both core-site.xml and yarn-site.xml.

<property>

<name>io.compression.codecs</name>

<value>org.apache.hadoop.io.compress.SnappyCodec</value>

</property>

 

3. Added the following environment variable.

SPARK_LIBRARY_PATH=D:\hadoop-2.4.0\bin

Since I use IntelliJ, the above line was added to the Environment variables section of the Run Configuration.
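
For step 3, one quick sanity check is to print what the JVM launched by IntelliJ actually sees. The following is only a minimal sketch: `NativePathCheck` and `containsDir` are hypothetical names (not part of Spark or Hadoop), and it assumes that native libraries are resolved through the `java.library.path` system property, which the environment variable may or may not influence when the application is started from an IDE rather than from the Spark scripts.

```java
import java.io.File;
import java.util.Arrays;

// Hypothetical helper, not part of Spark or Hadoop.
final class NativePathCheck {

    // Returns true if `dir` is one of the entries of a path list such as
    // java.library.path (entries are separated by File.pathSeparator).
    static boolean containsDir(String pathList, String dir) {
        if (pathList == null || dir == null) {
            return false;
        }
        File wanted = new File(dir);
        return Arrays.stream(pathList.split(File.pathSeparator))
                .filter(entry -> !entry.isEmpty())
                .anyMatch(entry -> new File(entry).equals(wanted));
    }

    public static void main(String[] args) {
        // Value set in the Run Configuration (null if it was not passed through).
        String env = System.getenv("SPARK_LIBRARY_PATH");
        // Path list the JVM searches for native libraries.
        String libPath = System.getProperty("java.library.path");

        System.out.println("SPARK_LIBRARY_PATH = " + env);
        System.out.println("java.library.path  = " + libPath);
        System.out.println("directory is on java.library.path: " + containsDir(libPath, env));
    }
}
```

If the last line prints false, passing the directory explicitly with the JVM option -Djava.library.path in the Run Configuration may be worth trying.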

 

4. Compressed the input text file with snzip.exe, which is included in snappy-windows-1.1.1.8.
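
One thing that may be worth checking for step 4 is which container format the compressed file uses. The sketch below tests whether a file begins with the stream identifier of the Snappy framing format (chunk type 0xff, length 6, then the bytes "sNaPpY"). It rests on two assumptions that are not confirmed anywhere in this post: that snzip writes the framing format by default, and that Hadoop's SnappyCodec expects its own block layout rather than the framing format. `SnappyFramingCheck` is a hypothetical helper name.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Arrays;

// Hypothetical helper, not part of snappy, Hadoop, or Spark.
final class SnappyFramingCheck {

    // Stream identifier chunk of the Snappy framing format:
    // chunk type 0xff, 3-byte little-endian length (6), then ASCII "sNaPpY".
    private static final byte[] STREAM_ID = {
        (byte) 0xff, 0x06, 0x00, 0x00, 's', 'N', 'a', 'P', 'p', 'Y'
    };

    // Returns true if the given leading bytes start with the stream identifier.
    static boolean startsWithStreamId(byte[] head) {
        return head.length >= STREAM_ID.length
                && Arrays.equals(Arrays.copyOf(head, STREAM_ID.length), STREAM_ID);
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: SnappyFramingCheck <file>");
            return;
        }
        byte[] head = new byte[STREAM_ID.length];
        int read = 0;
        try (InputStream in = Files.newInputStream(Paths.get(args[0]))) {
            // Read up to STREAM_ID.length bytes; stop early on end of file.
            while (read < head.length) {
                int n = in.read(head, read, head.length - read);
                if (n < 0) {
                    break;
                }
                read += n;
            }
        }
        boolean framed = startsWithStreamId(Arrays.copyOf(head, read));
        System.out.println(framed
                ? "File starts with the Snappy framing-format stream identifier."
                : "File does not start with the Snappy framing-format stream identifier.");
    }
}
```

If the file turns out to be in the framing format, that alone could explain a reader that understands only a different snappy container producing garbage or failing.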

 

5. Wrote the code.

sc.textFile(compressed_file_name)  // no other argument
  .map(…)

 

Now, when I run my Spark application, the results are as follows:

 

1. The string ‘snappy’ does not appear anywhere in the DEBUG log.

The most relevant log lines are as follows:

14/06/12 18:57:55 DEBUG NativeCodeLoader: Trying to load the custom-built native-hadoop library...

14/06/12 18:57:55 DEBUG NativeCodeLoader: Loaded the native-hadoop library

2. The application fails. The log is as follows:

14/06/12 18:57:57 WARN: int from string failed for: [(some binary characters)]

 

So apparently sc.textFile() does not recognize the file format and reads the file as-is, so the map function receives garbage.
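
A possible explanation for the file being read as-is: to my understanding, Hadoop's text input path picks a decompression codec from the file name suffix (via CompressionCodecFactory), and a file whose suffix matches no registered codec is treated as plain text. The sketch below is a simplified imitation of that lookup in plain Java, not Hadoop's actual implementation. `SuffixCodecLookup` and the file names are hypothetical, and the suffix table lists only the default extensions I am aware of.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Simplified imitation of a suffix-based codec lookup; not Hadoop code.
final class SuffixCodecLookup {

    // Default file suffix -> codec class name (assumed defaults).
    private static final Map<String, String> CODEC_BY_SUFFIX = new LinkedHashMap<>();
    static {
        CODEC_BY_SUFFIX.put(".snappy", "org.apache.hadoop.io.compress.SnappyCodec");
        CODEC_BY_SUFFIX.put(".gz", "org.apache.hadoop.io.compress.GzipCodec");
        CODEC_BY_SUFFIX.put(".bz2", "org.apache.hadoop.io.compress.BZip2Codec");
    }

    // Returns the codec class name for the file, or null when no suffix
    // matches (null here stands for "read the bytes as plain text").
    static String codecFor(String fileName) {
        for (Map.Entry<String, String> entry : CODEC_BY_SUFFIX.entrySet()) {
            if (fileName.endsWith(entry.getKey())) {
                return entry.getValue();
            }
        }
        return null;
    }

    public static void main(String[] args) {
        // Hypothetical file names, for illustration only.
        for (String name : new String[] {"input.txt.snappy", "input.txt.sz", "input.txt"}) {
            String codec = codecFor(name);
            System.out.println(name + " -> " + (codec == null ? "no codec, read as plain text" : codec));
        }
    }
}
```

Under this assumption, the name of the compressed file matters as much as the codec configuration: a file that does not end in the codec's expected suffix would never be routed to SnappyCodec, which would also be consistent with ‘snappy’ never appearing in the log.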

 

How can I fix this?

 

Thanks.