[Spark Core] makeRDD() preferredLocations do not appear to be considered

Previous Topic Next Topic
 
classic Classic list List threaded Threaded
2 messages Options
Reply | Threaded
Open this post in threaded view
|

[Spark Core] makeRDD() preferredLocations do not appear to be considered

Tom Scott
Hi Guys,

  I asked this in stack overflow here: https://stackoverflow.com/questions/63535720/why-would-preferredlocations-not-be-enforced-on-an-empty-spark-cluster but am hoping there is further help here.

  I have a 4 node standalone cluster with workers named worker1, worker2 and worker3 and a master on which I am running spark-shell. Given the following example:
-----------------------------------------------------------------------------------------------------------------
import scala.collection.mutable

val someData = mutable.ArrayBuffer[(String, Seq[String])]()

someData += ("1" -> Seq("worker1"))
someData += ("2" -> Seq("worker2"))
someData += ("3" -> Seq("worker3"))

val someRdd = sc.makeRDD(someData)

someRdd.map(i=>i + ":" + java.net.InetAddress.getLocalHost().getHostName()).collect().foreach(println)
-----------------------------------------------------------------------------------------------------------------

The cluster is completely clean with nothing else executing so I would expect to see output:

1:worker1
2:worker2
3:worker3

but in fact the output is undefined and i see things like:

scala> someRdd.map(i=>i + ":" + java.net.InetAddress.getLocalHost().getHostName()).collect().foreach(println)
1:worker3
2:worker1
3:worker2

scala> someRdd.map(i=>i + ":" + java.net.InetAddress.getLocalHost().getHostName()).collect().foreach(println)
1:worker2
2:worker3
3:worker1

Am I doing this wrong or is this expected behaviour?

Thanks

  Tom

Reply | Threaded
Open this post in threaded view
|

Re: [Spark Core] makeRDD() preferredLocations do not appear to be considered

Tom Scott
It turned out the issue was with my environment not Spark. Just in case anyone else is experiencing this the problem was that the Spark workers did not use the machine hostname by default. Setting the following environment variable on each worker rectified it: SPARK_LOCAL_HOSTNAME: "worker1" etc.

On Tue, Sep 8, 2020 at 10:11 PM Tom Scott <[hidden email]> wrote:
Hi Guys,

  I asked this in stack overflow here: https://stackoverflow.com/questions/63535720/why-would-preferredlocations-not-be-enforced-on-an-empty-spark-cluster but am hoping there is further help here.

  I have a 4 node standalone cluster with workers named worker1, worker2 and worker3 and a master on which I am running spark-shell. Given the following example:
-----------------------------------------------------------------------------------------------------------------
import scala.collection.mutable

val someData = mutable.ArrayBuffer[(String, Seq[String])]()

someData += ("1" -> Seq("worker1"))
someData += ("2" -> Seq("worker2"))
someData += ("3" -> Seq("worker3"))

val someRdd = sc.makeRDD(someData)

someRdd.map(i=>i + ":" + java.net.InetAddress.getLocalHost().getHostName()).collect().foreach(println)
-----------------------------------------------------------------------------------------------------------------

The cluster is completely clean with nothing else executing so I would expect to see output:

1:worker1
2:worker2
3:worker3

but in fact the output is undefined and i see things like:

scala> someRdd.map(i=>i + ":" + java.net.InetAddress.getLocalHost().getHostName()).collect().foreach(println)
1:worker3
2:worker1
3:worker2

scala> someRdd.map(i=>i + ":" + java.net.InetAddress.getLocalHost().getHostName()).collect().foreach(println)
1:worker2
2:worker3
3:worker1

Am I doing this wrong or is this expected behaviour?

Thanks

  Tom