spark sql StackOverflow

4 messages

spark sql StackOverflow

onmstester onmstester
Hi,

I need to run some queries over a huge number of input records; the input rate is about 100K records/second.
A record looks like (key1, key2, value), and the application should report occurrences where key1 = something AND key2 = somethingElse.
The problem is that there are too many filters in my query: more than 3,000 pairs of key1 and key2 have to be filtered for.
I was simply putting 1 million records at a time into a temp table and running a SQL query on it with Spark SQL:
select * from mytemptable where (key1 = something AND key2 = somethingElse) OR (key1 = someOtherThing AND key2 = someAnotherThing) OR ... (3,000 ORs!)
And I hit a StackOverflowError at ATNConfigSet.java line 178 (inside the ANTLR parser that Spark SQL uses).

So I have two options, IMHO:
1. Put all the key1/key2 filter pairs in another temp table and join the two temp tables.
2. Use Spark Streaming, which I'm not familiar with; I don't know whether it could handle 3K filters.

Which way do you suggest? What is the best solution for my problem, performance-wise?

Thanks in advance
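For what it's worth, the semantics of option 1 (joining against a table of filter pairs instead of writing 3,000 OR predicates) can be sketched outside Spark in plain Python; the names and data here are made up for illustration. One set-membership test per record replaces the giant WHERE clause:

```python
# Sketch of option 1's semantics (hypothetical data): keep the filter
# pairs in a separate collection and test membership per record,
# instead of one WHERE clause with 3,000 ORs.
filter_pairs = {("a", "x"), ("b", "y"), ("c", "z")}  # would hold ~3,000 pairs

records = [
    ("a", "x", 10),
    ("a", "w", 20),
    ("b", "y", 30),
]

# Rough equivalent of: SELECT * FROM records JOIN filter_pairs USING (key1, key2)
matching = [r for r in records if (r[0], r[1]) in filter_pairs]
```

In Spark SQL the same idea would be an inner join between the record table and a small pair table (3,000 rows is tiny, so a broadcast join should apply), which also avoids blowing up the SQL parser.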


Re: spark sql StackOverflow

Alessandro Solimando
Hi,
I am not familiar with ATNConfigSet, but some thoughts that might help.

How many distinct key1 (resp. key2) values do you have? Are these values reasonably stable over time?

Are these records ingested in real-time or are they loaded from a datastore?

In the latter case, the DB might be able to perform the filtering efficiently, especially if it is equipped with a proper index over key1/key2 (or a composite one).

In that case filter push-down could be very effective (I didn't get whether you just need to count the matching records or do something more with them).

Alternatively, you could try to group by (key1,key2), and then filter (it again depends on the kind of output you have in mind).

If the datastore/stream is distributed and supports partitioning, you could partition your records by either key1 or key2 (or key1+key2), so they are already "separated" and can be consumed more efficiently (e.g., the groupby could then be local to a single partition).
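The group-by-then-filter idea above can be sketched in plain Python (data and pair names are hypothetical): aggregate occurrences per (key1, key2) first, then restrict the much smaller aggregate to the pairs of interest:

```python
from collections import Counter

# Hypothetical input records (key1, key2, value) and pairs of interest.
records = [("a", "x", 1), ("a", "x", 2), ("b", "y", 3), ("c", "z", 4)]
wanted = {("a", "x"), ("b", "y")}

# Group by (key1, key2) first...
counts = Counter((k1, k2) for k1, k2, _ in records)

# ...then filter the (far smaller) aggregate down to the wanted pairs.
report = {pair: n for pair, n in counts.items() if pair in wanted}
```

If records are partitioned by key1 (or key1+key2), each partition can compute its counts locally before the final filter, which is the point about the group-by staying local to a partition.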

Best regards,
Alessandro

On 15 May 2018 at 08:32, onmstester onmstester <[hidden email]> wrote:



Re: spark sql StackOverflow

Alessandro Solimando
From the information you provided, I would tackle this as a batch problem, because that way you have access to more sophisticated techniques and more flexibility. Maybe HDFS and a Spark job; but also think about a datastore offering good indexes for the data types and values of your keys, so you can benefit from filter push-down.

I personally use streaming only when real-time ingestion is needed.

Hth,
Alessandro

On 15 May 2018 at 09:11, onmstester onmstester <[hidden email]> wrote:

How many distinct key1 (resp. key2) values do you have? Are these values reasonably stable over time?

Fewer than 10 thousand, and these filters would change every 2-3 days. They would be written to, and loaded from, a database.

Are these records ingested in real-time or are they loaded from a datastore?

The records would be loaded from text files that are copied into a directory over and over.

Are you suggesting that I don't need to use Spark Streaming?

Sent using Zoho Mail



---- On Tue, 15 May 2018 11:26:42 +0430 Alessandro Solimando <[hidden email]> wrote ----





Re: spark sql StackOverflow

Jörn Franke
3,000 filters do not look like something reasonable; that is very difficult to test and verify, and nearly impossible to maintain.
Could it be that your filters are really another table that you should join with?
The example is a little too artificial to understand the underlying business case. Can you provide a more realistic one?

Maybe a Bloom filter or something similar would make sense for you? Basically, you want to know whether a key pair is in a given set of pairs.
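To make the Bloom filter suggestion concrete, here is a minimal plain-Python sketch (bit-array size and hash count are illustrative, not tuned): membership tests can give false positives but never false negatives, so it works as a cheap pre-filter in front of an exact lookup or join.

```python
import hashlib

class BloomFilter:
    """Tiny illustrative Bloom filter; not production-tuned."""

    def __init__(self, num_bits=8192, num_hashes=4):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8)

    def _positions(self, item):
        # Derive k bit positions from a salted SHA-256 of the item's repr.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, item):
        # False => definitely absent; True => present or (rarely) a false positive.
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))

# Load the ~3,000 (key1, key2) pairs once (hypothetical examples):
bf = BloomFilter()
bf.add(("k1a", "k2a"))
bf.add(("k1b", "k2b"))
```

Records for which might_contain returns False can be discarded immediately; the occasional false positive is then eliminated by an exact set lookup or join on the real pair table.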

On 15. May 2018, at 11:48, Alessandro Solimando <[hidden email]> wrote:

From the information you provided I would tackle this as a batch problem, because this way you have access to more sophisticated techniques and you have more flexibility (maybe HDFS and a SparkJob, but also think about a datastore offering good indexes for the kind of data types and values you have for your keys, and benefit from filter push-downs).

I personally use streaming only when real-time ingestion is needed.

Hth,
Alessandro

On 15 May 2018 at 09:11, onmstester onmstester <[hidden email]> wrote:

How many distinct key1 (resp. key2) values do you have? Are these values reasonably stable over time?

less than 10 thousands and this filters would change each 2-3 days. They would be written and loaded from a database

Are these records ingested in real-time or are they loaded from a datastore?

records would be loaded from some text files that would be copied in some directory over and over

Are you suggesting that i dont need to use spark-streaming?

Sent using Zoho Mail



---- On Tue, 15 May 2018 11:26:42 +0430 Alessandro Solimando <[hidden email]> wrote ----

Hi,
I am not familiar with ATNConfigSet, but some thoughts that might help.

How many distinct key1 (resp. key2) values do you have? Are these values reasonably stable over time?

Are these records ingested in real-time or are they loaded from a datastore?

If the latter case the DB might be able to efficiently perform the filtering, especially if equipped with a proper index over key1/key2 (or a composite one).

In such case the filter push-down could be very effective (I didn't get if you just need to count or do something more with the matching record).

Alternatively, you could try to group by (key1,key2), and then filter (it again depends on the kind of output you have in mind).

If the datastore/stream is distributed and supports partitioning, you could partition your records by either key1 or key2 (or key1+key2), so they are already "separated" and can be consumed more efficiently (e.g., the groupby could then be local to a single partition).

Best regards,
Alessandro

On 15 May 2018 at 08:32, onmstester onmstester <[hidden email]> wrote:


Hi,

I need to run some queries on huge amount input records. Input rate for records are 100K/seconds.
A record is like (key1,key2,value) and the application should report occurances of kye1 = something && key2 == somethingElse.
The problem is there are too many filters in my query: more than 3 thousands pair of key1 and key2 should be filtered.
I was simply puting 1 millions of records in a temptable each time and running a query sql using spark-sql on temp table:
select * from mytemptable where (kye1 = something && key2 == somethingElse) or (kye1 = someOtherthing && key2 == someAnotherThing) or ...(3thousands or!!!)
And i encounter StackOverFlow at ATNConfigSet.java line 178.

So i have two options IMHO:
1. Either put all key1 and key2 filter pairs in another temp table and do a join between  two temp table
2. Or use spark-stream that i'm not familiar with and i don't know if it could handle 3K of filters.

Which way do you suggest? what is the best solution for my problem 'performance-wise'?

Thanks in advance