Reading from HDFS by increasing split size

Reading from HDFS by increasing split size

Kanagha Kumar
Hi,

I'm trying to read a 60 GB HDFS file using Spark's textFile("hdfs_file_path", minPartitions).

How can I control the number of tasks by increasing the split size? With the default split size of 250 MB, a large number of tasks are created, but I would like a specific number of tasks to be created while reading from HDFS itself, instead of using repartition() etc. afterwards.

Any suggestions would be helpful!

Thanks
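
[Editor's note: for scale, with ~250 MB splits a 60 GB file comes out to roughly 240 tasks. A minimal sketch of the read in question is below (class name and path are hypothetical); note that the minPartitions argument is only a hint for the minimum number of partitions, so it can add partitions but cannot on its own merge HDFS blocks into fewer, larger splits.]

    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.JavaRDD;
    import org.apache.spark.api.java.JavaSparkContext;

    public class BaselineRead {
        public static void main(String[] args) {
            JavaSparkContext jsc =
                new JavaSparkContext(new SparkConf().setAppName("baseline-read"));

            // minPartitions = 13 only asks for *at least* 13 partitions; the
            // actual count is driven by the input split size (~250 MB here).
            JavaRDD<String> lines = jsc.textFile("hdfs:///path/to/60gb_file", 13);
            System.out.println("partitions = " + lines.getNumPartitions());
        }
    }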

Re: Reading from HDFS by increasing split size

Jörn Franke
Write your own input format/datasource or split the file yourself beforehand (not recommended).

Re: Reading from HDFS by increasing split size

ayan guha
I have not tested this, but you should be able to pass any MapReduce-style conf through to the underlying Hadoop config. Essentially, you should be able to control the split behaviour the same way you would in a MapReduce program, since Spark uses the same input format.
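
[Editor's note: an untested sketch of that idea follows; the 2 GB split size and the path are assumptions, not values from this thread. It sets the standard Hadoop split-size properties on the Hadoop Configuration that Spark hands to the input format.]

    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.JavaRDD;
    import org.apache.spark.api.java.JavaSparkContext;

    public class SplitSizeDemo {
        public static void main(String[] args) {
            JavaSparkContext jsc =
                new JavaSparkContext(new SparkConf().setAppName("split-size-demo"));

            // Target ~2 GB per split, i.e. roughly 30 tasks for a 60 GB file.
            // Raising the minimum split size is what lets a single split cover
            // more than one HDFS block; the maximum is set to match.
            long splitSize = 2L * 1024 * 1024 * 1024;
            jsc.hadoopConfiguration()
               .setLong("mapreduce.input.fileinputformat.split.minsize", splitSize);
            jsc.hadoopConfiguration()
               .setLong("mapreduce.input.fileinputformat.split.maxsize", splitSize);

            JavaRDD<String> lines = jsc.textFile("hdfs:///path/to/60gb_file");
            System.out.println("partitions = " + lines.getNumPartitions());
        }
    }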

--
Best Regards,
Ayan Guha

Re: Reading from HDFS by increasing split size

Kanagha Kumar
Thanks for the inputs!!

I passed in spark.mapred.max.split.size and spark.mapred.min.split.size set to the split size I wanted, but it didn't take any effect.
I also tried passing in spark.dfs.block.size, with all of the parameters set to the same value.

JavaSparkContext.fromSparkContext(spark.sparkContext()).textFile(hdfsPath, 13);

Is there any other param that needs to be set as well?

Thanks

Re: Reading from HDFS by increasing split size

Jörn Franke
In reply to this post by Kanagha Kumar
Maybe you need to set the parameters for the mapreduce API and not the mapred API. I do not remember off-hand how they differ, but the Hadoop web pages should tell you ;-)
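
[Editor's note: for reference, and to the best of my recollection of the Hadoop 2.x names, the old mapred-era keys correspond to the new mapreduce-era keys as follows (dfs.block.size was likewise renamed):]

    mapred.min.split.size  ->  mapreduce.input.fileinputformat.split.minsize
    mapred.max.split.size  ->  mapreduce.input.fileinputformat.split.maxsize
    dfs.block.size         ->  dfs.blocksize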

Re: Reading from HDFS by increasing split size

Kanagha Kumar
Thanks Ayan!

Finally it worked!! Thanks a lot everyone for the inputs!

Once I prefixed the params with "spark.hadoop.", I see the number of tasks getting reduced.

I'm setting the following params:

--conf spark.hadoop.dfs.block.size

--conf spark.hadoop.mapreduce.input.fileinputformat.split.minsize

--conf spark.hadoop.mapreduce.input.fileinputformat.split.maxsize
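
[Editor's note: putting those flags together, a full invocation might look like the sketch below; the 2 GB value (2147483648 bytes), the class name, the jar name, and the path are hypothetical, and only the property names come from this thread. With min and max split size both at ~2 GB, a 60 GB file should come out to roughly 30 read tasks.]

    spark-submit \
      --class com.example.ReadHdfsFile \
      --conf spark.hadoop.dfs.block.size=2147483648 \
      --conf spark.hadoop.mapreduce.input.fileinputformat.split.minsize=2147483648 \
      --conf spark.hadoop.mapreduce.input.fileinputformat.split.maxsize=2147483648 \
      my-app.jar hdfs:///path/to/60gb_file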

