[pyspark 2.3+] Dedupe records

[pyspark 2.3+] Dedupe records

rishishah.star
Hi All,

I have around 100B records, into which I get new, update, and delete records. Updates/deletes are not that frequent. I would like to get some advice on the points below:

1) Should I use RDD + reduceByKey or a DataFrame window operation for data of this size? Which one would outperform the other? Which is more reliable and lower maintenance? (A sketch of the two approaches I'm comparing is below this list.)
2) Also, how would you suggest we do incremental deduplication? Currently we do full processing once a week and no dedupe during weekdays to avoid heavy processing. However, I would like to explore an incremental dedupe option and weigh the pros/cons.
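
For context, the two approaches I'm comparing look roughly like this (record_id and updated_at are just placeholders for our business key and event-time column, and df is the input DataFrame):

    from pyspark.sql import functions as F
    from pyspark.sql.window import Window

    # DataFrame window approach: keep only the latest version of each record.
    w = Window.partitionBy("record_id").orderBy(F.col("updated_at").desc())
    deduped = (df
        .withColumn("rn", F.row_number().over(w))
        .filter(F.col("rn") == 1)
        .drop("rn"))

    # RDD approach: reduceByKey, keeping the newest record per key.
    latest = (df.rdd
        .map(lambda row: (row["record_id"], row))
        .reduceByKey(lambda a, b: a if a["updated_at"] >= b["updated_at"] else b)
        .values())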

Any input is highly appreciated! 

--
Regards,

Rishi Shah
Re: [pyspark 2.3+] Dedupe records

Sonal Goyal
Hi Rishi,

1. DataFrames are RDDs under the cover. If you have unstructured data, or you know something about the data that lets you optimize the computation yourself, you can go with RDDs. Otherwise DataFrames, which are optimized by Spark SQL, should be fine.
2. For incremental deduplication, I guess you can hash your data based on some particular values and then only compare the new records against the ones which have the same hash. That should reduce the number of comparisons drastically, provided you can come up with a good indexing/hashing scheme for your dataset. A rough sketch is below.
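
Something like this, assuming incoming and existing DataFrames and a few made-up columns (name, email_domain, zip) that are stable enough to block on:

    from pyspark.sql import functions as F

    # Hypothetical blocking key: hash a few stable columns so that candidate
    # duplicates land in the same bucket.
    def with_block(df):
        return df.withColumn(
            "block",
            F.sha2(F.concat_ws("||", "name", "email_domain", "zip"), 256))

    existing_b = with_block(existing)
    incoming_b = with_block(incoming)

    # Only records that share a block need the detailed comparison.
    candidates = incoming_b.join(existing_b, "block", "left_semi")

    # Everything else is definitely new and can be appended as-is.
    brand_new = incoming_b.join(existing_b, "block", "left_anti")

How you then compare the candidate pairs (exact or fuzzy matching) depends on your data and the indexing scheme you choose.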

Thanks,
Sonal
Nube Technologies

Re: [pyspark 2.3+] Dedupe records

Molotch
In reply to this post by rishishah.star
The performant way would be to partition your dataset into reasonably small
chunks and use a bloom filter to see if the entity might be in your set
before you make a lookup.

Check the Bloom filter; if the entity might be in the set, rely on partition
pruning to read and backfill the relevant partition. If the entity isn't in
the set, just save it as new data (rough sketch of the flow below).
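
Roughly like this, as a sketch. might_contain stands in for whatever Bloom filter you maintain (as far as I know PySpark 2.3 doesn't expose Spark's BloomFilter on DataFrames, so you'd wrap a Python filter library in a UDF or do that bit on the Scala side); record_id, event_date, table_path and the broadcast variable are made-up names:

    from pyspark.sql import functions as F
    from pyspark.sql.types import BooleanType

    # Placeholder membership test around a broadcast Bloom filter --
    # hypothetical, not a real Spark API; plug in your own filter here.
    def might_contain(key):
        return key in bloom_filter_bc.value

    might_contain_udf = F.udf(might_contain, BooleanType())

    flagged = incoming.withColumn("maybe_seen", might_contain_udf("record_id"))

    # A Bloom filter has no false negatives, so these rows are definitely new:
    (flagged.filter(~F.col("maybe_seen")).drop("maybe_seen")
        .write.mode("append").partitionBy("event_date").parquet(table_path))

    # Possibly seen before: read only the affected partitions (partition
    # pruning) and backfill them after reconciling with these rows.
    maybe_seen = flagged.filter(F.col("maybe_seen")).drop("maybe_seen")

From there, backfilling the affected partitions is the same dedupe-and-rewrite you'd do in a full run, just scoped to far less data.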

Sooner or later you would probably want to compact the appended partitions
to reduce the number of small files.

Delta Lake gives you update and compaction semantics, unless you want to do it
manually; a minimal merge sketch is below.
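
With Delta the backfill becomes a MERGE, roughly like this (path and key column invented for the example):

    from delta.tables import DeltaTable

    target = DeltaTable.forPath(spark, "/data/records")   # made-up path

    (target.alias("t")
        .merge(incoming.alias("s"), "t.record_id = s.record_id")
        .whenMatchedUpdateAll()       # updated records overwrite the old version
        .whenNotMatchedInsertAll()    # genuinely new records get appended
        .execute())

Deletes can be handled with a whenMatchedDelete(condition=...) clause if the incoming records carry a delete flag.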

Since 2.4.0 Spark is also able to prune buckets, but as far as I know
there's no way to backfill a single bucket. If there were, the combination of
partition and bucket pruning could dramatically limit the amount of data you
need to read/write from/to disk.

RDD vs DataFrame: I'm not sure exactly how and when Tungsten can be used
with RDDs, if at all. Because of that I always try to use DataFrames and the
built-in functions as long as possible, just to get the sweet off-heap
allocation and the "expressions to byte code" thing along with the Catalyst
optimizations. That will probably do more for your performance than anything
else. The memory overhead of JVM objects and GC runs can be brutal on your
performance and memory usage, depending on your dataset and use case. A toy
comparison of a Python UDF vs a built-in function is below.
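
To make that concrete, the difference is basically this (toy example, df and its name column made up):

    from pyspark.sql import functions as F
    from pyspark.sql.types import StringType

    # Python UDF: every row is shipped to a Python worker; Catalyst/Tungsten
    # can't see inside it.
    to_upper = F.udf(lambda s: s.upper() if s else None, StringType())
    slow = df.withColumn("name_u", to_upper("name"))

    # Built-in expression: compiled by Catalyst and run on Tungsten's binary
    # row format, no Python round trip.
    fast = df.withColumn("name_u", F.upper("name"))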


br,

molotch




Re: [pyspark 2.3+] Dedupe records

Anwar AliKhan
In reply to this post by Sonal Goyal

What is meant by "DataFrames are RDDs under the cover"?

What is meant by deduplication?


Please send your bio-data, history, and past commercial projects.

The Wali Ahad has agreed to release 300 million USD for a new machine learning research
project to centralize government facilities and find better ways to offer citizen services with artificial intelligence technologies.

I have been tasked with finding talented artificial intelligence experts.


Shukran


