SparkSQL read Hive transactional table


SparkSQL read Hive transactional table

daily

Hi,

I use the HCatalog Streaming Mutation API to write data to a Hive transactional table, and then use SparkSQL to read the data back from that table. I get the correct result.
However, SparkSQL takes much longer to read the Hive ORC bucketed transactional table, because it reads all of the columns rather than only the columns referenced in the SQL.
My question is: why does SparkSQL read all columns of a Hive ORC bucketed transactional table instead of only the columns involved in the SQL, and is it possible to make SparkSQL read only the columns involved in the SQL?

 

For example:
Hive Table:
create table dbtest.t_a1 (t0 VARCHAR(36),t1 string,t2 double,t5 int ,t6 int) partitioned by(sd string,st string) clustered by(t0) into 10 buckets stored as orc TBLPROPERTIES ('transactional'='true');

create table dbtest.t_a2 (t0 VARCHAR(36),t1 string,t2 double,t5 int ,t6 int) partitioned by(sd string,st string) clustered by(t0) into 10 buckets stored as orc TBLPROPERTIES ('transactional'='false');

SparkSQL:
select sum(t1),sum(t2) from dbtest.t_a1 group by t0;
select sum(t1),sum(t2) from dbtest.t_a2 group by t0;

SparkSQL stage input size:

dbtest.t_a1 = 113.9 GB
dbtest.t_a2 = 96.5 MB
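
To see whether column pruning is actually applied, one option is to compare the physical plans of the two queries. A minimal sketch in SparkSQL, run against the same tables as above (the expectation, not a guarantee, is that the non-transactional table is read through Spark's native file scan with a pruned read schema, while the transactional table may go through the generic Hive table scan):

-- Sketch: compare the physical plans. For dbtest.t_a2 the scan should list
-- only t0, t1 and t2 in its read schema; if dbtest.t_a1 is read through the
-- generic Hive table scan, no such pruning may be visible and all columns
-- (plus the ACID row metadata) end up being read.
explain extended select sum(t1),sum(t2) from dbtest.t_a1 group by t0;
explain extended select sum(t1),sum(t2) from dbtest.t_a2 group by t0;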

 

Best regards.

 



Re: SparkSQL read Hive transactional table

Gourav Sengupta
Hi,

Can I please ask which versions of Hive and Spark you are using?

Regards,
Gourav Sengupta


Re: SparkSQL read Hive transactional table

daily
Hi,

Spark version: 2.3.0
Hive version: 2.1.0

Best regards.
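
As an aside, the Hive version that matters on the Spark side is the metastore client version Spark is configured with, which can differ from the Hive service version. A small sketch to check it from a SparkSQL session (spark.sql.hive.metastore.version is a standard Spark option; in Spark 2.3 it defaults to the built-in 1.2.1 client unless overridden):

-- Sketch: show the metastore client version Spark is configured to use.
SET spark.sql.hive.metastore.version;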



Re: SparkSQL read Hive transactional table

Gourav Sengupta
Hi, 

I think ORC read performance has been improved in the latest versions. Any chance you could use a more recent version?

Regards,
Gourav Sengupta
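
If an upgrade is an option, it may also be worth checking which ORC code path Spark picks for these tables. A hedged sketch of the relevant session settings (standard Spark SQL options as of 2.3; note that a transactional table with uncompacted delta files may still not be readable through the native ORC path, in which case Spark falls back to the Hive reader and the pruning benefit is lost):

-- Sketch: prefer Spark's native, vectorized ORC reader (Spark 2.3+).
-- These settings help for ordinary ORC tables; an ACID table with delta
-- files may still require the Hive reader (or a major compaction) to be
-- read correctly.
SET spark.sql.orc.impl=native;
SET spark.sql.orc.filterPushdown=true;
SET spark.sql.hive.convertMetastoreOrc=true;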
