Where to put "local" data files?

Where to put "local" data files?

XiaoboGu
Hi,

We are going to deploy a standalone-mode cluster. We know Spark can read local data files into RDDs, but where should we put the data files: on the server from which we submit our application, or on the server where the master service runs?

Regards,

Xiaobo Gu

Re: Where to put "local" data files?

Andrew Ash
Hi Xiaobo,

I would recommend putting the files into an HDFS cluster on the same machines instead, if possible. If you're concerned about duplicating the data, you can set the replication factor to 1 so you don't use more space than before.

In my experience with Spark around 0.7.0 or so, when reading from a local file with sc.textFile("file:///..."), you had to have that file at that exact path on every Spark worker machine.
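
As a quick sketch of the difference (the master URL, NameNode address, and file paths below are placeholders, not anything specific to your cluster):

import org.apache.spark.SparkContext

val sc = new SparkContext("spark://master:7077", "LocalVsHdfs")

// Local path: this exact file must already exist on every worker machine.
val localRdd = sc.textFile("file:///data/input.txt")

// HDFS path: workers fetch blocks from the DataNodes, so no per-node copies.
val hdfsRdd = sc.textFile("hdfs://namenode:9000/data/input.txt")

println(localRdd.count())
println(hdfsRdd.count())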

Cheers,
Andrew


Re: Where to put "local" data files?

XiaoboGu
Hi Andrew,

Thanks for your reply. I have another question about using HDFS: when running HDFS and the standalone-mode Spark cluster on the same machines, will the Spark workers read only the data stored on the same server, to avoid transferring data over the network?

Xiaobo Gu

Re: Where to put "local" data files?

Andrew Ash
Yes, it will. This is called data locality, and it works by matching the hostname of the Spark worker with the hostname of the HDFS DataNode holding the block.
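
If you want to check this yourself from the shell, here's a minimal sketch (the HDFS URL is a placeholder): each partition's preferred locations should come back as your DataNode hostnames, and the Stages page in the web UI will show NODE_LOCAL for tasks that ran on the node holding the block.

val rdd = sc.textFile("hdfs://namenode:9000/data/input.txt")

// Hosts Spark would prefer for each partition; with co-located HDFS
// DataNodes these should be your worker hostnames.
rdd.partitions.foreach { p =>
  println("partition " + p.index + ": " + rdd.preferredLocations(p).mkString(", "))
}

Spark waits briefly for a locality-preserving slot before scheduling a task elsewhere; that wait is tunable with spark.locality.wait.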

