Spark SQL reads all leaf directories on a partitioned Hive table


Spark SQL reads all leaf directories on a partitioned Hive table

Hao Ren
Hi,
I am using Spark SQL 2.3.3 to read a Hive table partitioned by day, hour, platform, request_status and is_sampled. The underlying data is stored in Parquet format on HDFS.
Here is the SQL query to read just one partition.

```
spark.sql("""
SELECT rtb_platform_id, SUM(e_cpm)
FROM raw_logs.fact_request
WHERE day = '2019-08-01'
AND hour = '00'
AND platform = 'US'
AND request_status = '3'
AND is_sampled = 1
GROUP BY rtb_platform_id
""").show
```

However, from the Spark web UI, the stage description shows:

```
Listing leaf files and directories for 201616 paths:
viewfs://root/user/bilogs/logs/fact_request/day=2018-08-01/hour=11/platform=AS/request_status=0/is_sampled=0, ...
```

It seems the job is listing and reading all of the partitions of the table, so the query takes far too long for just one partition. One workaround is to use the `spark.read.parquet` API to read the Parquet files directly, since Spark's partition discovery is aware of partitioned directory layouts.
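For reference, here is a rough sketch of that workaround in spark-shell (the base path is taken from the UI log excerpt above and is only illustrative):

```scala
// Sketch only: read the single partition's directory directly with the
// DataFrame reader, bypassing the Hive metastore. The base path is assumed
// from the listing shown in the web UI and may need adjusting.
import org.apache.spark.sql.functions.sum

val base = "viewfs://root/user/bilogs/logs/fact_request"

val onePartition = spark.read
  .option("basePath", base) // keep the partition columns in the schema
  .parquet(s"$base/day=2019-08-01/hour=00/platform=US/request_status=3/is_sampled=1")

onePartition
  .groupBy("rtb_platform_id")
  .agg(sum("e_cpm"))
  .show()
```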

Still, I would like to know whether there is a way to leverage this partition awareness through Hive when using the `spark.sql` API.

Any help is highly appreciated!

Thank you.

--
Hao Ren

Re: Spark SQL reads all leaf directories on a partitioned Hive table

Gourav Sengupta
Hi,

Just out of curiosity, did you start the Spark session using enableHiveSupport()?

Or are you creating the table using Spark?


Regards,
Gourav


Fwd: Spark SQL reads all leaf directories on a partitioned Hive table

Hao Ren


---------- Forwarded message ---------
From: Hao Ren <[hidden email]>
Date: Thu, Aug 8, 2019 at 4:15 PM
Subject: Re: Spark SQL reads all leaf directories on a partitioned Hive table
To: Gourav Sengupta <[hidden email]>


Hi Gourav,

I am using enableHiveSupport().
The table was not created by Spark; it already exists in Hive. All I did was read it with a SQL query in Spark.
FYI, I put hive-site.xml in the spark/conf/ directory to make sure that Spark can access Hive.
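For completeness, this is roughly how a Hive-enabled session is built (spark-shell does the equivalent automatically when hive-site.xml is on the classpath; the app name below is only illustrative):

```scala
// Sketch only: an explicitly built Hive-enabled SparkSession.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("fact_request-query")  // illustrative name, not from this thread
  .enableHiveSupport()            // use the Hive metastore as the catalog
  .getOrCreate()
```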

Hao



--
Hao Ren

Software Engineer in Machine Learning @ Criteo

Paris, France

Re: Spark SQL reads all leaf directories on a partitioned Hive table

Mich Talebzadeh
You also need the other config files as well, via soft links (check with ls -l):

cd $SPARK_HOME/conf

hive-site.xml -> ${HIVE_HOME}/conf/hive-site.xml
core-site.xml -> ${HADOOP_HOME}/etc/hadoop/core-site.xml
hdfs-site.xml -> ${HADOOP_HOME}/etc/hadoop/hdfs-site.xml


Dr Mich Talebzadeh

LinkedIn https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw

http://talebzadehmich.wordpress.com

Disclaimer: Use it at your own risk. Any and all responsibility for any loss, damage or destruction of data or any other property which may arise from relying on this email's technical content is explicitly disclaimed. The author will in no case be liable for any monetary damages arising from such loss, damage or destruction.




Re: Spark SQL reads all leaf directories on a partitioned Hive table

Hao Ren
Hi Mich,

Thank you for your reply.
To be clearer about the environment: I am using spark-shell to run the query.
Actually, the query works even without core-site.xml and hdfs-site.xml under $SPARK_HOME/conf.
My problem is efficiency: all of the partitions were scanned instead of only the one in question during the execution of the Spark SQL query.
This is why this simple query takes so much time.
I would like to know how to improve this so that only the specific partition in question is read.

Feel free to ask more questions if I am not clear.

Best regards,
Hao



--
Hao Ren

Software Engineer in Machine Learning @ Criteo

Paris, France

Re: Spark SQL reads all leaf directories on a partitioned Hive table

Subash Prabakar
I had a similar issue reading an external Parquet table. In my case I had a permission issue in one partition, so I added a filter to exclude that partition, but Spark still did not prune it. Then I read that, in order for Spark to be aware of all the partitions, it first lists all the folders and updates its own catalog, and only then is the SQL applied on top of that, instead of going through the existing Hive SerDe. This behavior applies only to Parquet files.

Hive metastore Parquet table conversion

When reading from and writing to Hive metastore Parquet tables, Spark SQL will try to use its own Parquet support instead of the Hive SerDe for better performance. This behavior is controlled by the spark.sql.hive.convertMetastoreParquet configuration, and is turned on by default.


Reference: 

Set the above property to false. It should work.
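For reference, a quick sketch of setting it, either inside the session or when launching the shell:

```scala
// Sketch: disable the built-in conversion of Hive metastore Parquet tables
// so the read goes through the Hive SerDe path, as suggested above.
spark.conf.set("spark.sql.hive.convertMetastoreParquet", "false")

// Or at launch time:
//   spark-shell --conf spark.sql.hive.convertMetastoreParquet=false
```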

If anyone has a better explanation, please let me know; I have the same question: why does only Parquet have this problem?

Thanks
Subash


Re: Spark SQL reads all leaf directories on a partitioned Hive table

Hao Ren
Thank you, Subash. It works!

--
Hao Ren

Software Engineer in Machine Learning @ Criteo

Paris, France