Reading Hadoop Archive from Spark

To Quoc Cuong
Hello,

After archiving Parquet files into a HAR (Hadoop Archive), the archive has the following layout:

foo.har/_masterindex // stores hashes and offsets

foo.har/_index // stores file statuses

foo.har/part-[0..n] // stores the actual parquet files, concatenated sequentially
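Conceptually, the layout above works like a tiny read-only filesystem: the index maps each archived path to a slice (part file, offset, length) of a part file, and reading an archived file means seeking into that slice. A toy sketch of that idea (the real `_index` is a line-based text format; the names and byte payloads here are made up):

```python
# Toy model of a HAR lookup (illustrative only; NOT the real on-disk format).
# The index conceptually maps archived path -> (part file, offset, length);
# reading an archived file means slicing that range out of the part file.

part0 = b"AAAA-parquet-1-bytes" + b"BBBB-parquet-2-bytes"

index = {
    "foo.har/tintin_milou.parquet":  ("part-0", 0, 20),
    "foo.har/tintin_milou2.parquet": ("part-0", 20, 20),
}

def read_archived(path):
    part, offset, length = index[path]
    assert part == "part-0"  # this toy archive has a single part file
    return part0[offset:offset + length]

print(read_archived("foo.har/tintin_milou2.parquet"))  # b"BBBB-parquet-2-bytes"
```

This is why listing and extracting individual files through the `har://` filesystem works: the index knows where each file's bytes live inside part-0, even though part-0 itself is just one big blob.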

So we can try to access a parquet file inside the HAR like this:

spark.read.parquet("hdfs:///user/cyber/dataset/HARFolder/foo.har/")

or, as a second way:

spark.read.parquet("hdfs:///user/cyber/dataset/HARFolder/foo.har/part-0")

If the HAR file contains only one parquet, we can read it entirely. But if the HAR file contains more than one parquet, we can read only the first one; the other parquets inside the HAR are ignored. This is because the archiving process treats the parquet files as opaque binary files and simply appends them one after another into the part-0 file. So only the header of the first parquet file lines up with the start of part-0; the headers of the other parquet files end up somewhere in the middle of part-0. When we use spark.read.parquet("hdfs:///foo.har/part-0"), Spark scans only the header of part-0, which is also the header of the first parquet, and skips the rest.
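The effect of this concatenation can be shown without Hadoop at all. Real Parquet files begin and end with the 4-byte magic `PAR1`; the payloads below are fake stand-ins, just to show where each file's markers land once two files are appended into one blob:

```python
# Sketch: why appending whole files into a single part-0 hides all but one.
# The payloads are fake stand-ins for Parquet data; only the b"PAR1" magic
# markers mimic the real format.

def concat(files):
    """Mimic the archiver appending whole files into one part-0 blob."""
    return b"".join(files)

file_a = b"PAR1" + b"<data of tintin_milou>" + b"PAR1"
file_b = b"PAR1" + b"<data of tintin_milou2>" + b"PAR1"

part0 = concat([file_a, file_b])

# A reader that trusts the container's boundaries sees file_a's magic at
# offset 0 ...
print(part0.startswith(b"PAR1"))          # True
# ... while file_b's magic sits at an interior offset the reader never probes:
print(part0.index(b"PAR1", len(file_a)))  # equals len(file_a), mid-blob
```

Nothing in part-0 itself marks where one archived file ends and the next begins; that knowledge lives only in the `_index` file, which a plain parquet read of part-0 never consults.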

For example, if foo.har contains only tintin_milou.parquet, we can read it successfully. But if foo2.har contains two parquets (tintin_milou.parquet and tintin_milou2.parquet), we can read only tintin_milou.parquet and fail to read tintin_milou2.parquet. Furthermore, if foo3.har contains two parquets with different schemas (like tintin_milou.parquet and cdr.parquet), we cannot read either of them.

We CANNOT access the original parquets in either of these two ways:

spark.read.parquet("hdfs:///user/cyber/dataset/HARFolder/foo2.har/tintin_milou.parquet")
spark.read.parquet("har:///user/cyber/dataset/HARFolder/foo2.har/tintin_milou.parquet")
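For reference, the second form relies on Hadoop's `HarFileSystem`, which answers to the `har://` scheme and keys off the `.har` suffix in the path. A small hypothetical helper (not part of Spark or Hadoop, pure Python, illustrative only) that rewrites the first `hdfs://` form into the second `har://` form, assuming a default-filesystem path with no authority component:

```python
# Hypothetical helper: rewrite an hdfs:// path that points inside a .har
# archive into the equivalent har:// URI. Only handles authority-less URIs
# like hdfs:///path (a har:// URI for a non-default filesystem would also
# need to encode the underlying host).

def to_har_uri(hdfs_path: str) -> str:
    scheme, sep, rest = hdfs_path.partition("://")
    if not sep or ".har" not in rest:
        raise ValueError(f"not a path inside a HAR archive: {hdfs_path}")
    return "har://" + rest

print(to_har_uri(
    "hdfs:///user/cyber/dataset/HARFolder/foo2.har/tintin_milou.parquet"))
# har:///user/cyber/dataset/HARFolder/foo2.har/tintin_milou.parquet
```

Even with the URI rewritten correctly, Spark still needs the `har` filesystem registered on its classpath for the second form to resolve at all.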

even though we can see the original parquets with hadoop:

hadoop fs -ls har:///user/cyber/dataset/HARFolder/foo2.har

Output (assuming tintin_milou.parquet and tintin_milou2.parquet were archived into foo2.har):

har:///user/cyber/dataset/HARFolder/foo2.har/tintin_milou.parquet
har:///user/cyber/dataset/HARFolder/foo2.har/tintin_milou2.parquet

So, does anyone know how to read multiple parquet files inside a HAR with Spark?
Thanks