Persist Dataframe to HDFS considering HDFS Block Size.

Persist Dataframe to HDFS considering HDFS Block Size.

Shivam Sharma
Hi All,

I want to persist a DataFrame to HDFS. Basically, I am inserting data into a Hive table using Spark. Currently, at the time of writing to the Hive table I have set the total shuffle partitions to 400, so 400 files are being created, which does not take the HDFS block size into account at all. How can I tell Spark to persist according to HDFS blocks?

We have something like this in Hive which solves this problem:
set hive.merge.sparkfiles=true;
set hive.merge.smallfiles.avgsize=2048000000;
set hive.merge.size.per.task=4096000000;
Thanks

--
Shivam Sharma
Indian Institute Of Information Technology, Design and Manufacturing Jabalpur
Mobile No- (+91) 8882114744

Re: Persist Dataframe to HDFS considering HDFS Block Size.

Felix Cheung
You can call coalesce to combine partitions..
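
For illustration, a minimal sketch of that suggestion (assuming spark is an existing SparkSession, e.g. in spark-shell; the table names and the target of 8 output files are hypothetical, roughly 1 GB of data with a 128 MB block size):

// Read the data to be inserted (hypothetical source table).
val result = spark.table("staging_table")
// Coalesce so each output file is roughly one HDFS block, then append into the Hive table.
result.coalesce(8).write.insertInto("target_table")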

 


Re: Persist Dataframe to HDFS considering HDFS Block Size.

Hichame El Khalfi
You can do this in 2 passes (not one):
A) Save your dataset into HDFS with what you have.
B) Calculate the number of partitions, n = (size of your dataset) / (HDFS block size).
Then run a simple Spark job to read the data back and repartition it based on 'n'.

Hichame
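
A rough sketch of that two-pass approach, assuming the first pass already wrote Parquet to a hypothetical path /tmp/staging_output and that spark is an existing SparkSession (e.g. in spark-shell):

import org.apache.hadoop.fs.{FileSystem, Path}

val stagingPath = new Path("/tmp/staging_output")  // hypothetical first-pass output
val fs = FileSystem.get(spark.sparkContext.hadoopConfiguration)
// Total size of the staged dataset and the default HDFS block size (typically 128 MB).
val totalBytes = fs.getContentSummary(stagingPath).getLength
val blockSize = fs.getDefaultBlockSize(stagingPath)
val n = math.max(1, math.ceil(totalBytes.toDouble / blockSize).toInt)

// Second pass: read the staged data back and rewrite it as roughly block-sized files.
spark.read.parquet("/tmp/staging_output")
  .repartition(n)
  .write.mode("overwrite")
  .parquet("/tmp/final_output")  // hypothetical final location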


Re: Persist Dataframe to HDFS considering HDFS Block Size.

Shivam Sharma
Don't we have any property for this?

One more quick question: if the files created by Spark are smaller than the HDFS block size, will the rest of the block space become unavailable and remain unutilized, or will it be shared with other files?

--
Shivam Sharma
Indian Institute Of Information Technology, Design and Manufacturing Jabalpur
Mobile No- (+91) 8882114744

Re: Persist Dataframe to HDFS considering HDFS Block Size.

Arnaud LARROQUE
Hi Shivam,

In the end, a file takes up only its own space, regardless of the block size. So if your file is just a few kilobytes, it will occupy only those few kilobytes.
But I've noticed that while the file is being written, a block is allocated and the NameNode considers the whole block size to be used. I had this problem when writing a dataset that was partitioned into far too many files!
But as soon as the file was written, the NameNode seems to know its true size and drops the "default block size" accounting.

Arnaud 
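
If it helps to verify this, a small sketch using the Hadoop FileSystem API (the table directory is hypothetical; spark is an existing SparkSession): it prints each file's actual length next to the block size it was created with.

import org.apache.hadoop.fs.{FileSystem, Path}

val dir = new Path("/user/hive/warehouse/target_table")  // hypothetical table location
val fs = FileSystem.get(spark.sparkContext.hadoopConfiguration)
fs.listStatus(dir).filter(_.isFile).foreach { st =>
  // getLen is the space the file actually occupies; getBlockSize is only the block size it was written with.
  println(s"${st.getPath.getName}: length=${st.getLen} bytes, blockSize=${st.getBlockSize} bytes")
}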


Re: Persist Dataframe to HDFS considering HDFS Block Size.

Shivam Sharma
Thanks Arnaud

--
Shivam Sharma
Indian Institute Of Information Technology, Design and Manufacturing Jabalpur
Mobile No- (+91) 8882114744