What is the recommended way to store records that don't meet a filter?


Yeikel

Community,

 

Given a dataset ds, what is the recommended way to store the records that don't meet a filter?

 

For example:

 

val ds = Seq(1,2,3,4).toDS

 

val f = (i: Integer) => i < 2

 

val filtered = ds.filter(f(_))

 

I understand I can do this:

 

val filterNotMet = ds.filter(!f(_))

 

But unless I am missing something, I believe this means that Spark will iterate over the data and apply the filter twice, which sounds like unnecessary overhead to me. Please clarify if this is not the case.
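
If that is indeed the case, one workaround I can think of (just a sketch, and I am not sure it is the recommended approach; cached/met/notMet are made-up names) is to cache the dataset, so that at least the underlying data is only computed once even though each branch still runs its own filter pass over the cached data:

// Sketch only: cache ds so its lineage is not recomputed for each branch
val cached = ds.cache()
val met    = cached.filter(f(_))   // records that meet the filter
val notMet = cached.filter(!f(_))  // records that don't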

 

Another option I can think of is to do something like this :

 

val fudf = udf(f)

 

val applyFilterUDF = ds.withColumn("filtered", fudf($"value"))

 

val filteredUDF = applyFilterUDF.where(applyFilterUDF("filtered") === true)

 

val filterNotMetUDF = applyFilterUDF.where(applyFilterUDF("filtered") === false)

 

But from the plans, I can't really tell whether either approach is better:

 

scala> filtered.explain

== Physical Plan ==

*(1) Filter <function1>.apply$mcZI$sp

+- LocalTableScan [value#149]

 

scala> applyFilterUDF.explain

== Physical Plan ==

LocalTableScan [value#149, filtered#153]

 

scala> filterNotMet.explain

== Physical Plan ==

*(1) Filter <function1>.apply$mcZI$sp

+- LocalTableScan [value#149]

 

scala> filterNotMetUDF.explain

== Physical Plan ==

*(1) Project [value#62, UDF(value#62) AS filtered#97]

+- *(1) Filter (UDF(value#62) = false)

   +- LocalTableScan [value#62]
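
In case it clarifies what I am after with the UDF version: I assume I would also need to cache the intermediate DataFrame so the "filtered" column is only computed once before splitting on it (again just a sketch, and annotated/metUDF/notMetUDF are made-up names):

// Sketch: compute the predicate column once, cache it, then split on it
val annotated = ds.withColumn("filtered", fudf($"value")).cache()
val metUDF    = annotated.where($"filtered" === true)
val notMetUDF = annotated.where($"filtered" === false)

But I am not sure whether relying on caching here is considered good practice, or whether there is a better pattern for splitting a dataset by a predicate.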

 

 

Thank you.