input split size

Larryliu
What is the default input split size, and how can I change it?

Re: input split size

Andrew Ash
When reading out of HDFS it's the HDFS block size.

On Fri, Oct 17, 2014 at 5:27 PM, Larry Liu <[hidden email]> wrote:
What is the default input split size, and how can I change it?

Re: input split size

Larryliu
Thanks, Andrew. What about reading from the local file system?

On Fri, Oct 17, 2014 at 5:38 PM, Andrew Ash <[hidden email]> wrote:
When reading out of HDFS it's the HDFS block size.

Re: input split size

Ilya Ganelin

Also - if you're doing a text file read you can pass the number of resulting partitions as the second argument.
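
As a rough illustration of passing that second argument from PySpark, here is a minimal sketch. It assumes `sc` is a `pyspark.SparkContext`; the helper name and the path in the usage comment are hypothetical and only there for the example.

```python
def read_text_with_min_partitions(sc, path, min_partitions):
    """Read a text file, asking for at least `min_partitions` partitions.

    `sc` is assumed to be a pyspark.SparkContext. The second argument of
    textFile is a lower bound (a hint), not an exact partition count.
    """
    rdd = sc.textFile(path, minPartitions=min_partitions)
    # getNumPartitions() reports how many partitions Spark actually created.
    return rdd, rdd.getNumPartitions()


# Hypothetical usage (needs a running SparkContext and a real input path):
#   rdd, n = read_text_with_min_partitions(sc, "hdfs:///tmp/example.txt", 8)
#   print("asked for at least 8 partitions, got", n)
```

The returned count can be higher than the number requested, and for input that cannot be split it can be lower, so it is worth checking rather than assuming.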

On Oct 17, 2014 9:05 PM, "Larry Liu" <[hidden email]> wrote:
Thanks, Andrew. What about reading from the local file system?

Re: input split size

Mayur Rustagi
Does it retain the order if it's pulling from the HDFS blocks? That is,
if file1 => a, b, c (partitions in order)
and I read it as 2 partitions, will it map to (ab, c) or (a, bc), or can it also be (a, cb)?


On Sat, Oct 18, 2014 at 9:09 AM, Ilya Ganelin <[hidden email]> wrote:

Also - if you're doing a text file read you can pass the number of resulting partitions as the second argument.

Re: input split size

Aaron Davidson
The "minPartitions" argument of textFile/hadoopFile cannot decrease the number of splits below the physical number of blocks/files. So if you have 3 HDFS blocks, asking for 2 minPartitions will still give you 3 partitions (hence the "min"). It can, however, convert a file with fewer HDFS blocks into more partitions (so you could ask for and get 4 partitions), assuming the blocks are "splittable". HDFS blocks are usually splittable, but if the file is compressed with something like bzip2, it would not be.
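
The arithmetic behind this can be sketched in a few lines of Python. This is a simplified model of the split-size rule that, as far as I know, Hadoop's `FileInputFormat` (the old `mapred` API that `textFile` goes through) applies: split size = max(minimum size, min(total size / requested splits, block size)). The function and parameter names are my own, it handles a single non-empty file only, and the real implementation computes the goal size over all input files together.

```python
MB = 1024 * 1024
# Assumed slack factor: a final chunk up to 10% larger than the split size
# is kept as one split instead of being cut again.
SPLIT_SLOP = 1.1


def split_sizes(file_size, block_size, requested_splits, min_size=1, splittable=True):
    """Return the byte length of each input split for one non-empty file."""
    if not splittable:
        # A non-splittable file (e.g. some compressed formats) is one split.
        return [file_size]
    min_size = max(min_size, 1)
    goal_size = file_size // max(requested_splits, 1)
    # The block size caps the split size unless min_size pushes it higher.
    split_size = max(min_size, min(goal_size, block_size))
    sizes = []
    remaining = file_size
    while remaining / split_size > SPLIT_SLOP:
        sizes.append(split_size)
        remaining -= split_size
    if remaining:
        sizes.append(remaining)
    return sizes
```

With a 384 MB file on 128 MB blocks, `split_sizes(384 * MB, 128 * MB, 2)` yields 3 splits and `split_sizes(384 * MB, 128 * MB, 4)` yields 4, matching the description above. Raising `min_size` above the block size (in Hadoop this corresponds to a minimum-split-size property, which I believe is `mapreduce.input.fileinputformat.split.minsize` in Hadoop 2) is the one way in this model to get fewer, larger splits than blocks.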

If you wish to combine splits from a larger file, you can use RDD#coalesce. With shuffle=false, this will simply concatenate partitions, but it does not provide any ordering guarantees (it uses an algorithm which attempts to coalesce co-located partitions, to maintain locality information). 

coalesce() with shuffle=true causes all of the elements to be shuffled around randomly into new partitions, which is an expensive operation but gives a roughly uniform distribution of data across the new partitions.
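
To make the difference between the two modes concrete, here is a toy model in plain Python. It is not Spark's implementation: the real coalescer takes data locality into account, and grouping contiguous ranges is only my assumption for the case where no locality information is available. Partitions are modelled as lists, and all names are hypothetical.

```python
import random


def coalesce_no_shuffle(partitions, num_target):
    """Merge whole partitions into groups; no element leaves its partition's group."""
    n = len(partitions)
    # Without a shuffle the partition count can only shrink.
    num_target = min(num_target, n)
    merged = []
    for i in range(num_target):
        start = (i * n) // num_target
        end = ((i + 1) * n) // num_target
        group = []
        for part in partitions[start:end]:
            group.extend(part)
        merged.append(group)
    return merged


def coalesce_with_shuffle(partitions, num_target, seed=0):
    """Deal every element round-robin into num_target new partitions."""
    out = [[] for _ in range(num_target)]
    for index, part in enumerate(partitions):
        # Each source partition starts dealing at its own pseudo-random target.
        position = random.Random(seed + index).randrange(num_target)
        for item in part:
            out[position % num_target].append(item)
            position += 1
    return out
```

In this toy model, three partitions a, b, c coalesced to two without a shuffle come out as (a) and (b, c), with whole partitions kept together. In Spark itself the grouping depends on where the partitions live, so that particular result should not be relied on. The shuffled variant moves every element individually, which is why it costs more and why the resulting partitions end up close to equal in size.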

On Sat, Oct 18, 2014 at 10:47 AM, Mayur Rustagi <[hidden email]> wrote:
Does it retain the order if it's pulling from the HDFS blocks? That is,
if file1 => a, b, c (partitions in order)
and I read it as 2 partitions, will it map to (ab, c) or (a, bc), or can it also be (a, cb)?


Re: input split size

Nick Chammas
Side note: I thought bzip2 was splittable. Perhaps you meant gzip?

On Saturday, October 18, 2014, Aaron Davidson <[hidden email]> wrote:
The "minPartitions" argument of textFile/hadoopFile cannot decrease the number of splits below the physical number of blocks/files. So if you have 3 HDFS blocks, asking for 2 minPartitions will still give you 3 partitions (hence the "min"). It can, however, convert a file with fewer HDFS blocks into more partitions (so you could ask for and get 4 partitions), assuming the blocks are "splittable". HDFS blocks are usually splittable, but if the file is compressed with something like bzip2, it would not be.

If you wish to combine splits from a larger file, you can use RDD#coalesce. With shuffle=false, this will simply concatenate partitions, but it does not provide any ordering guarantees (it uses an algorithm which attempts to coalesce co-located partitions, to maintain locality information).

coalesce() with shuffle=true causes all of the elements to be shuffled around randomly into new partitions, which is an expensive operation but gives a roughly uniform distribution of data across the new partitions.
