How to store 10M records in HDFS to speed up further filtering?


MoTao
Hi all,

I have 10M (ID, BINARY) records, and each BINARY value is about 5 MB on average.
In a daily job, I need to select the BINARY values for roughly 10K IDs given in an ID list.
How should I store the whole data set to make this filtering faster?

I'm using the DataFrame API in Spark 2.0.0 and I've tried both a row-based format (Avro) and a column-based format (ORC).
However, both of them end up scanning almost ALL of the records, which makes the filtering stage very slow.
The filtering code looks like this:
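For reference, the data set is produced by a plain DataFrame write; a minimal sketch of it (toy rows and a placeholder path, not my real ingest job) is:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("store-binary-records").getOrCreate()
import spark.implicits._

// Toy rows standing in for the real data; each real BINARY is ~5 MB.
val records = Seq(
  ("id-000001", Array.fill[Byte](16)(0.toByte)),
  ("id-000002", Array.fill[Byte](16)(1.toByte))
).toDF("ID", "BINARY")

records.write.mode("overwrite").orc("/path/to/whole/data")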

import org.apache.spark.sql.functions.udf
import spark.implicits._                         // for the $"..." column syntax

val IDSet: Set[String] = ...                     // the ~10K IDs to keep
val checkID = udf { ID: String => IDSet(ID) }    // true if the row's ID is in the set
spark.read.orc("/path/to/whole/data")
  .filter(checkID($"ID"))
  .select($"ID", $"BINARY")
  .write...
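
If it helps, the same selection can also be written without the UDF, e.g. with isin, though I don't know whether that changes how much of the data actually gets scanned (this sketch reuses spark and IDSet from above, and the output path is a placeholder):

import org.apache.spark.sql.functions.col

// Same selection expressed as a plain column predicate instead of a UDF.
val ids = IDSet.toSeq
spark.read.orc("/path/to/whole/data")
  .filter(col("ID").isin(ids: _*))
  .select(col("ID"), col("BINARY"))
  .write.mode("overwrite").orc("/path/to/filtered/data")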

Thanks for any advice!