If the HAR file contains only one parquet file, we can read it entirely. But if the HAR file contains more than one, we can read only the first parquet file; the other parquet files inside the HAR are ignored. This is because the archiving process treats parquet files as opaque binary files and simply appends them, one after another, into the part-0 file. As a result, only the header of the first parquet file sits at the very start of part-0; the headers of the other parquet files end up somewhere in the middle of part-0. So when we call spark.read.parquet("hdfs:///foo.har/part-0"), Spark scans only the header of part-0, which is also the header of the first parquet file, and skips the rest.
For example, if foo.har contains only tintin_milou.parquet, we can read it successfully. But if foo2.har contains two parquet files (tintin_milou.parquet and tintin_milou2.parquet), we can read only tintin_milou.parquet and fail to read tintin_milou2.parquet. Furthermore, if foo3.har contains two parquet files with different schemas (say, tintin_milou.parquet and cdr.parquet), we cannot read either of them.
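The layout described above can be sketched without Hadoop at all. The snippet below fakes two parquet payloads (their contents are placeholders; only the 4-byte "PAR1" magic that real parquet files start and end with is accurate) and concatenates them the way the archiver builds part-0:

```python
# Hypothetical stand-ins for two parquet files: real parquet files begin and
# end with the magic bytes b"PAR1"; the payloads here are just placeholders.
MAGIC = b"PAR1"

parquet1 = MAGIC + b"<row groups of tintin_milou>" + b"<footer>" + MAGIC
parquet2 = MAGIC + b"<row groups of tintin_milou2>" + b"<footer>" + MAGIC

# The archiver treats each parquet as an opaque binary and simply appends them.
part0 = parquet1 + parquet2

# part-0 begins with the first parquet's header magic, so a reader that only
# inspects the head of part-0 sees a valid start-of-file...
assert part0[:4] == MAGIC

# ...but the second parquet's header magic is buried mid-file, at an offset
# nothing in part-0 points a parquet reader to.
assert part0[len(parquet1):len(parquet1) + 4] == MAGIC
print("second parquet's header offset inside part-0:", len(parquet1))
```

This is why reading part-0 as a single parquet file surfaces only the first archived file: the bytes of the second file are physically present, but no parquet-level metadata at the start of part-0 refers to them.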
We CANNOT access the original parquet files in either of these two ways: