This may be a newbie question: how do I read a snappy-compressed text file in Spark?
The OS is Windows 7.
So far, I have done the following:
1. Built Hadoop 2.4.0 with the snappy option.
The ‘hadoop checknative’ command displays the following line:
snappy: true D:\hadoop-2.4.0\bin\snappy.dll
So I assume Hadoop can do snappy compression.
BTW, snappy.dll was copied from the snappy64.dll file in snappy-windows-22.214.171.124.
2. Added the following configurations to both core-site.xml and yarn-site.xml.
3. Added the following environment variable.
Since I use IntelliJ, I added the above line to the Environment variables section of the Run Configuration.
4. Compressed the input text file with snzip.exe, which is included in snappy-windows-126.96.36.199.
5. Wrote the code.
sc.textFile(compressed_file_name) // no other argument
  .map(…)
Now, when I run my Spark application, the results are as follows:
1. The string ‘snappy’ does not appear anywhere in the DEBUG log.
The most relevant logs are as follows:
14/06/12 18:57:55 DEBUG NativeCodeLoader: Trying to load the custom-built native-hadoop library...
14/06/12 18:57:55 DEBUG NativeCodeLoader: Loaded the native-hadoop library
2. The application fails. The log is as follows:
14/06/12 18:57:57 WARN: int from string failed for: [(some binary characters)]
So apparently sc.textFile() does not recognize the file format and reads it as-is, so the map function receives garbage.
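To make the suspected failure mode concrete: as far as I understand it, Hadoop's text input path picks a decompression codec from the file name suffix (for example ‘.snappy’ for SnappyCodec) and reads the file unchanged when no suffix matches. The sketch below is only a simplified stand-in for that lookup, not Hadoop's actual code. The class name CodecLookupSketch, the method codecFor, and the sample file names are hypothetical; only the codec class names and their default suffixes come from Hadoop.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical, simplified stand-in for suffix-based codec selection.
// This is an illustration under assumptions, not Hadoop's implementation.
public class CodecLookupSketch {
    // File name suffix -> Hadoop codec class name (default extensions).
    private static final Map<String, String> CODECS_BY_SUFFIX = new LinkedHashMap<>();
    static {
        CODECS_BY_SUFFIX.put(".gz", "org.apache.hadoop.io.compress.GzipCodec");
        CODECS_BY_SUFFIX.put(".bz2", "org.apache.hadoop.io.compress.BZip2Codec");
        CODECS_BY_SUFFIX.put(".snappy", "org.apache.hadoop.io.compress.SnappyCodec");
    }

    // Returns the codec class name whose suffix matches the file name,
    // or null when nothing matches (the file would then be read as-is).
    public static String codecFor(String fileName) {
        for (Map.Entry<String, String> entry : CODECS_BY_SUFFIX.entrySet()) {
            if (fileName.endsWith(entry.getKey())) {
                return entry.getValue();
            }
        }
        return null;
    }

    public static void main(String[] args) {
        // Sample names are made up for illustration.
        for (String name : new String[] {"input.txt.snappy", "input.txt"}) {
            String codec = codecFor(name);
            System.out.println(name + " -> " + (codec == null ? "no codec, read as-is" : codec));
        }
    }
}
```

If this is roughly how the lookup works, a compressed file whose name carries no recognized suffix would reach the map function as raw bytes, which would match the log above.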
How can I fix this?