cache table vs. parquet table performance


cache table vs. parquet table performance

Tomas Bartalos
Hello,

I'm using the Spark Thrift server and I'm searching for the best-performing solution to query a hot set of data. I'm processing records with a nested structure containing subtypes and arrays; one record takes up several KB.

I tried to make some improvement with CACHE TABLE:

CACHE TABLE event_jan_01 AS SELECT * FROM events WHERE day_registered = 20190102;


If I understood correctly, the data should be stored in an in-memory columnar format with storage level MEMORY_AND_DISK, so data that doesn't fit in memory is spilled to disk (I assume also in columnar format?).
I cached 1 day of data (1 M records) and, according to the Spark UI Storage tab, none of the data was cached in memory and everything was spilled to disk. The size of the cached data was 5.7 GB.
Typical queries took ~20 sec.
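
As far as I understand, how much stays in memory is bounded by the executors' storage memory (spark.executor.memory together with spark.memory.fraction), which is fixed at submit time. The in-memory columnar cache itself only exposes a couple of session-level knobs reachable from the Thrift server; a minimal sketch of checking them and rebuilding the cache (the values shown are just the Spark 2.4 defaults, I haven't verified that changing them helps here):

-- current settings; these keys exist in Spark 2.x, the values shown are the defaults
SET spark.sql.inMemoryColumnarStorage.compressed;   -- true
SET spark.sql.inMemoryColumnarStorage.batchSize;    -- 10000
-- the cache has to be dropped and rebuilt for any change to take effect
UNCACHE TABLE event_jan_01;
CACHE TABLE event_jan_01 AS SELECT * FROM events WHERE day_registered = 20190102;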

Then I tried to store the data in Parquet format:

CREATE TABLE event_jan_01_par USING parquet LOCATION "/tmp/events/jan/02"
AS SELECT * FROM event_jan_01;


The whole Parquet table took up only 178 MB, and typical queries took 5-10 sec.
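
As a sketch of an alternative layout (the table name and path below are made up for illustration), the whole events table could also be written out partitioned by day, so each hot day gets pruned at the file level without creating one table per day:

CREATE TABLE events_by_day
USING parquet
PARTITIONED BY (day_registered)
LOCATION "/tmp/events/by_day"
AS SELECT * FROM events;

-- a filter on the partition column only reads that day's files
SELECT count(*) FROM events_by_day WHERE day_registered = 20190102;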

Is it possible to tune Spark to spill the cached data in Parquet format?
Why was the whole cached table spilled to disk, with nothing staying in memory?

Spark version: 2.4.0

Best regards,
Tomas


Re: cache table vs. parquet table performance

Jiaan Geng
Hi, Tomas.
Thanks for your question, it gave me something to think about. Caching usually works best for smaller data sets; I think caching large data consumes too much memory or disk space.
Spilling the cached data in Parquet format might be a good improvement.



Re: Re: cache table vs. parquet table performance

Jiaan Geng
So I think caching large data is not a best practice.



Re: cache table vs. parquet table performance

tnist
In reply to this post by Jiaan Geng
Hi Tomas,

Have you considered using something like https://www.alluxio.org/ for your cache? It seems like a possible solution for what you're trying to do.
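
As a rough sketch (the master host and path are made up, and it assumes the Alluxio client jar is on Spark's classpath), the Parquet table could simply point at an Alluxio location so repeated reads are served from Alluxio's memory tier:

CREATE TABLE event_jan_01_alluxio USING parquet
LOCATION "alluxio://alluxio-master:19998/events/jan/02"
AS SELECT * FROM event_jan_01;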

-Todd



Re: cache table vs. parquet table performance

Jörn Franke
In reply to this post by Tomas Bartalos
I believe the in-memory solution misses the storage indexes that Parquet/ORC have.

The in-memory solution is more suitable if you frequently iterate over the whole data set.
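
A small sketch of what those indexes buy (the event_id column is just a made-up example): with filter pushdown enabled, which is the default in Spark 2.x, the Parquet reader can skip row groups whose min/max statistics exclude the value, and the pushed predicate is visible in the plan:

-- on by default in Spark 2.x; shown only to make the setting explicit
SET spark.sql.parquet.filterPushdown=true;
-- look for PushedFilters in the FileScan parquet node of the physical plan
EXPLAIN SELECT * FROM event_jan_01_par WHERE event_id = 42;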
