I need to run some queries on a huge amount of input records. The input rate is 100K records/second.
A record is like (key1, key2, value) and the application should report occurrences of key1 == something && key2 == somethingElse.
The problem is that there are too many filters in my query: more than 3 thousand pairs of key1 and key2 need to be filtered.
I was simply putting 1 million records in a temp table each time and running a SQL query with spark-sql on the temp table:
select * from mytemptable where (key1 = something and key2 = somethingElse) or (key1 = someOtherThing and key2 = someAnotherThing) or ... (3 thousand ORs!)
And I hit a StackOverflowError at ATNConfigSet.java line 178.
So I have two options, IMHO:
1. Put all the key1/key2 filter pairs in another temp table and do a join between the two temp tables.
2. Use Spark Streaming, which I'm not familiar with, and I don't know whether it could handle 3K filters.
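To make option 1 concrete, here is a minimal plain-Python sketch of the semantics that a join against a small filter table (a broadcast join, in Spark terms) would compute; all data and names are illustrative, not from the actual application:

```python
# Sketch of option 1: treat the ~3000 (key1, key2) filter pairs as a small
# lookup set and "join" the records against it. In Spark this would be a
# broadcast join; plain Python shows the semantics. Names are illustrative.

filter_pairs = {("a", "x"), ("b", "y"), ("c", "z")}  # stands in for the ~3000 pairs

records = [
    ("a", "x", 10),
    ("a", "q", 20),   # pair not in the filter set -> dropped
    ("b", "y", 30),
]

# Set membership is O(1) per record, instead of evaluating 3000 OR
# conditions per record (which is also what blew up the SQL parser).
matched = [r for r in records if (r[0], r[1]) in filter_pairs]
```

The same idea in Spark avoids building a giant WHERE clause entirely: the filter pairs become a (broadcast) DataFrame and the planner turns the join into a hash lookup.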
Which way do you suggest? What is the best solution for my problem performance-wise?
Thanks in advance
I am not familiar with ATNConfigSet, but some thoughts that might help.
How many distinct key1 (resp. key2) values do you have? Are these values reasonably stable over time?
Are these records ingested in real-time or are they loaded from a datastore?
In the latter case the DB might be able to perform the filtering efficiently, especially if equipped with a proper index over key1/key2 (or a composite one).
In that case the filter push-down could be very effective (I didn't get whether you just need to count or do something more with the matching records).
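A minimal sketch of that idea using sqlite (chosen only because it is self-contained; table and column names are illustrative): a composite index over (key1, key2) plus a join against a filter table lets the engine do indexed probes instead of a 3000-way OR:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE records (key1 TEXT, key2 TEXT, value INTEGER)")
cur.execute("CREATE INDEX idx_k1k2 ON records (key1, key2)")  # composite index
cur.execute("CREATE TABLE filters (key1 TEXT, key2 TEXT)")

cur.executemany("INSERT INTO records VALUES (?, ?, ?)",
                [("a", "x", 10), ("a", "q", 20), ("b", "y", 30)])
cur.executemany("INSERT INTO filters VALUES (?, ?)",
                [("a", "x"), ("b", "y")])

# The join replaces the huge OR list; the composite index makes each
# (key1, key2) probe cheap.
rows = cur.execute("""
    SELECT r.key1, r.key2, r.value
    FROM records r
    JOIN filters f ON r.key1 = f.key1 AND r.key2 = f.key2
    ORDER BY r.value
""").fetchall()
```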
Alternatively, you could try to group by (key1, key2) and then filter (again, it depends on the kind of output you have in mind).
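As a sketch of the group-by-then-filter approach (plain Python standing in for a Spark groupBy; the data is made up), counting occurrences per pair first and keeping only the wanted pairs afterwards:

```python
from collections import Counter

records = [("a", "x", 10), ("a", "x", 5), ("b", "y", 30), ("c", "z", 7)]
wanted = {("a", "x"), ("b", "y")}  # stands in for the ~3000 filter pairs

# Count occurrences per (key1, key2), then keep only the wanted pairs.
# The filter now runs over ~|distinct pairs| groups, not over every record.
counts = Counter((k1, k2) for k1, k2, _ in records)
filtered = {pair: n for pair, n in counts.items() if pair in wanted}
```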
If the datastore/stream is distributed and supports partitioning, you could partition your records by either key1 or key2 (or key1+key2), so they are already "separated" and can be consumed more efficiently (e.g., the groupby could then be local to a single partition).
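A small sketch of what partitioning by key buys you (a deterministic hash routes every record with the same (key1, key2) to the same partition, so each group-by can stay local; partition count and hashing scheme are illustrative):

```python
import hashlib

N_PARTITIONS = 4

def partition_of(key1, key2, n=N_PARTITIONS):
    # Python's built-in hash() is salted per process, so use a
    # deterministic hash for reproducible routing.
    h = hashlib.md5(f"{key1}|{key2}".encode()).digest()
    return h[0] % n

records = [("a", "x", 10), ("b", "y", 30), ("a", "x", 5)]
partitions = [[] for _ in range(N_PARTITIONS)]
for k1, k2, v in records:
    partitions[partition_of(k1, k2)].append((k1, k2, v))
# All ("a", "x") records end up in the same partition, so counting/filtering
# that pair never needs data from another partition.
```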
On 15 May 2018 at 08:32, onmstester onmstester <[hidden email]> wrote:
From the information you provided I would tackle this as a batch problem, because that way you have access to more sophisticated techniques and more flexibility (maybe HDFS and a Spark job, but also consider a datastore offering good indexes for the data types and values of your keys, so you can benefit from filter push-down).
I personally use streaming only when real-time ingestion is needed.
On 15 May 2018 at 09:11, onmstester onmstester <[hidden email]> wrote:
3000 filters don’t look like something reasonable. This is very difficult to test and verify, and impossible to maintain.
Could it be that your filters are really another table that you should join with?
The example is a little too artificial to understand the underlying business case. Can you provide a more realistic one?
Maybe a bloom filter or something similar would make sense for you? Basically, you want to know whether a key pair is in a given set of pairs.
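For illustration, a tiny Bloom filter over (key1, key2) pairs might look like this (sizes, hash count, and data are arbitrary; a real deployment would use Spark's or a library's implementation):

```python
import hashlib

class BloomFilter:
    """Minimal Bloom filter for (key1, key2) membership tests.
    Can return false positives, never false negatives."""

    def __init__(self, size=8192, n_hashes=3):
        self.size = size
        self.n_hashes = n_hashes
        self.bits = bytearray(size)

    def _positions(self, key1, key2):
        # Derive n_hashes independent bit positions from the pair.
        for i in range(self.n_hashes):
            h = hashlib.md5(f"{i}|{key1}|{key2}".encode()).digest()
            yield int.from_bytes(h[:4], "big") % self.size

    def add(self, key1, key2):
        for p in self._positions(key1, key2):
            self.bits[p] = 1

    def might_contain(self, key1, key2):
        return all(self.bits[p] for p in self._positions(key1, key2))

bf = BloomFilter()
bf.add("a", "x")
bf.add("b", "y")
```

A definite "no" from the filter lets you drop most of the 100K records/second cheaply, and only the (rare) "maybe" answers need the exact check against the full pair set.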
On 15. May 2018, at 11:48, Alessandro Solimando <[hidden email]> wrote: