Inconsistent behavior when running spark on top of tachyon on top of HDFS HA

Inconsistent behavior when running spark on top of tachyon on top of HDFS HA

elyast
Hi,

I don't understand why counting a file fails the first time, but when I run it a second time it gives the correct result.

(For reference: stanley is an HDFS nameservice, not a real host, and the hdfs-site.xml config is on the classpath.)
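
For readers unfamiliar with HDFS HA: a logical nameservice like stanley resolves only when the client-side HA properties are available, normally from hdfs-site.xml. Roughly, the client needs the equivalent of the following (a sketch, expressed programmatically in spark-shell terms; the nn1/nn2 ids and the second namenode host hadoop-ha-2 are assumptions, since only hadoop-ha-1 appears in this thread):

import org.apache.hadoop.conf.Configuration

// Client-side keys HDFS HA needs in order to resolve "stanley"; in
// practice these live in hdfs-site.xml rather than being set in code.
val conf = new Configuration()
conf.set("dfs.nameservices", "stanley")
conf.set("dfs.ha.namenodes.stanley", "nn1,nn2")
conf.set("dfs.namenode.rpc-address.stanley.nn1", "hadoop-ha-1:8020")
conf.set("dfs.namenode.rpc-address.stanley.nn2", "hadoop-ha-2:8020") // assumed second namenode
conf.set("dfs.client.failover.proxy.provider.stanley",
  "org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider")

Without these keys the DFS client treats stanley as an ordinary hostname, which is exactly the UnknownHostException in the log below.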

Below is the full log from the spark shell:

14/02/15 03:04:22 INFO : initialize(tachyon://hadoop-ha-1:19998/tmp/proxy.txt, Configuration: core-default.xml, core-site.xml, mapred-default.xml, mapred-site.xml, hdfs-default.xml, hdfs-site.xml). Connecting to Tachyon: tachyon://hadoop-ha-1:19998/tmp/proxy.txt
14/02/15 03:04:22 INFO : Trying to connect master @ hadoop-ha-1/14.255.247.81:19998
14/02/15 03:04:22 INFO : User registered at the master hadoop-ha-1/14.255.247.81:19998 got UserId 15
14/02/15 03:04:22 INFO : Trying to get local worker host : hadoop-ha-1
14/02/15 03:04:22 INFO : No local worker on hadoop-ha-1
14/02/15 03:04:22 INFO : Connecting remote worker @ hadoop-worker-6/14.255.247.53:29998
14/02/15 03:04:22 INFO : tachyon://hadoop-ha-1:19998 tachyon://hadoop-ha-1:19998 hdfs://stanley
14/02/15 03:04:22 INFO : getFileStatus(/tmp/proxy.txt): HDFS Path: hdfs://stanley/tmp/proxy.txt TPath: tachyon://hadoop-ha-1:19998/tmp/proxy.txt
14/02/15 03:04:22 INFO mapred.FileInputFormat: Total input paths to process : 1
...
14/02/15 03:04:23 WARN scheduler.TaskSetManager: Loss was due to java.lang.IllegalArgumentException
java.lang.IllegalArgumentException: java.net.UnknownHostException: stanley
    at org.apache.hadoop.security.SecurityUtil.buildTokenService(SecurityUtil.java:414)
    at org.apache.hadoop.hdfs.NameNodeProxies.createNonHAProxy(NameNodeProxies.java:164)
    at org.apache.hadoop.hdfs.NameNodeProxies.createProxy(NameNodeProxies.java:129)
    at org.apache.hadoop.hdfs.DFSClient.<init>(DFSClient.java:448)
    at org.apache.hadoop.hdfs.DFSClient.<init>(DFSClient.java:410)
    at org.apache.hadoop.hdfs.DistributedFileSystem.initialize(DistributedFileSystem.java:128)
    at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2308)
    at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:87)
    at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2342)
    at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2324)
    at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:351)
    at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:163)
    at org.apache.hadoop.mapred.JobConf.getWorkingDirectory(JobConf.java:587)
    at org.apache.hadoop.mapred.FileInputFormat.setInputPaths(FileInputFormat.java:315)
    at org.apache.hadoop.mapred.FileInputFormat.setInputPaths(FileInputFormat.java:288)
    at org.apache.spark.SparkContext$$anonfun$15.apply(SparkContext.scala:391)
    at org.apache.spark.SparkContext$$anonfun$15.apply(SparkContext.scala:391)
    at org.apache.spark.rdd.HadoopRDD$$anonfun$getJobConf$1.apply(HadoopRDD.scala:111)
    at org.apache.spark.rdd.HadoopRDD$$anonfun$getJobConf$1.apply(HadoopRDD.scala:111)
    at scala.Option.map(Option.scala:145)
...
org.apache.spark.SparkException: Job aborted: Task 0.0:0 failed 4 times (most recent failure: Exception failure: java.lang.IllegalArgumentException: java.net.UnknownHostException: stanley)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$abortStage$1.apply(DAGScheduler.scala:1028)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$abortStage$1.apply(DAGScheduler.scala:1026)
    at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
    at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
    at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$abortStage(DAGScheduler.scala:1026)

scala> val s = sc.textFile("tachyon://hadoop-ha-1:19998/tmp/proxy.txt")
14/02/15 03:04:27 INFO storage.MemoryStore: ensureFreeSpace(45012) called with curMem=80405, maxMem=309225062
14/02/15 03:04:27 INFO storage.MemoryStore: Block broadcast_1 stored as values to memory (estimated size 44.0 KB, free 294.8 MB)
s: org.apache.spark.rdd.RDD[String] = MappedRDD[3] at textFile at <console>:12

scala> s.count()
14/02/15 03:04:29 INFO : getFileStatus(/tmp/proxy.txt): HDFS Path: hdfs://stanley/tmp/proxy.txt TPath: tachyon://hadoop-ha-1:19998/tmp/proxy.txt
14/02/15 03:04:29 INFO mapred.FileInputFormat: Total input paths to process : 1
14/02/15 03:04:29 INFO spark.SparkContext: Starting job: count at <console>:15
14/02/15 03:04:29 INFO scheduler.DAGScheduler: Got job 1 (count at <console>:15) with 2 output partitions (allowLocal=false)

14/02/15 03:04:29 INFO spark.SparkContext: Job finished: count at <console>:15, took 0.466730364 s
res1: Long = 5
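
A hint from the stack trace: NameNodeProxies.createNonHAProxy is only reached when the client configuration has no HA failover proxy provider for stanley, i.e. when hdfs-site.xml is not visible to the JVM running the task. One way to probe this from the spark shell (a minimal sketch, assuming a live SparkContext sc):

// Run a trivial task on a few partitions and report what each executor
// sees for dfs.nameservices; "null" means hdfs-site.xml is missing from
// that executor's classpath even if the driver has it.
import org.apache.hadoop.conf.Configuration
val seen = sc.parallelize(1 to 4, 4)
  .map(_ => String.valueOf(new Configuration().get("dfs.nameservices")))
  .collect().distinct
println(seen.mkString(", ")) // "stanley" everywhere means the executors are configured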

Re: Inconsistent behavior when running spark on top of tachyon on top of HDFS HA

elyast
I have also checked that Spark behaves the same way when talking directly to HDFS HA, with no Tachyon involved:

val s = sc.textFile("hdfs://stanley/tmp/proxy.txt")
s.count()
(fails the first time with UnknownHostException)

(it is important to reassign the variable before retrying)
val s = sc.textFile("hdfs://stanley/tmp/proxy.txt")
s.count()

res: 5

So I guess Tachyon is not the guilty one here.
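
That conclusion fits the logs: the getFileStatus line in the first post shows the Tachyon path being mapped straight onto hdfs://stanley/tmp/proxy.txt, so both routes end in the same HDFS client. Any URI whose authority is a logical nameservice needs the HA client config to resolve it, while a raw host:port URI does not (a sketch, assuming hadoop-ha-1:8020 is the active namenode's RPC address):

// Two ways to name the same file; only the first needs the HA client
// config (dfs.nameservices and friends) to resolve its authority.
val viaNameservice = sc.textFile("hdfs://stanley/tmp/proxy.txt")          // logical nameservice
val viaHostPort    = sc.textFile("hdfs://hadoop-ha-1:8020/tmp/proxy.txt") // plain DNS + RPC port

Note that the host:port form bypasses HA failover entirely, so it is a diagnostic rather than a fix.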

Any help appreciated

Re: Inconsistent behavior when running spark on top of tachyon on top of HDFS HA

elyast
Hi,

A further update on the case: Spark in local mode works fine; however, when run through Mesos, the previously described behaviour occurs.

My spark-env.sh:

export MESOS_NATIVE_LIBRARY=/usr/local/lib/libmesos.so
export SPARK_JAVA_OPTS="
-Dspark.serializer=org.apache.spark.serializer.KryoSerializer
-Dspark.local.dir=/mnt/usr/spark/tmp
-Dspark.mesos.coarse=True
-Dspark.executor.memory=512m
-Dspark.ui.port=8775
-Dspark.scheduler.mode=FAIR
-Dspark.logConf=true
"
export SPARK_EXECUTOR_URI=hdfs://stanley/tmp/spark-0.9.0-2.0.0-mr1-cdh4.5.0.tgz
export MASTER=zk://hadoop-zoo-1:2181,hadoop-zoo-2:2181,hadoop-zoo-3:2181/mesos
export JAVA_HOME="/usr/lib/jvm/java-7-openjdk-amd64/jre"
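
Worth noting for this setup: with Mesos, each executor is unpacked from SPARK_EXECUTOR_URI on the slave, so it sees only the Hadoop configuration that ships inside that tarball or already sits on the slave; the driver's classpath does not travel with the job. A quick probe of what the executors actually have (a minimal sketch, assuming a live SparkContext sc):

// Ask each executor whether hdfs-site.xml is on its classpath at all;
// "missing" on any executor would explain the UnknownHostException there.
val confLocations = sc.parallelize(1 to 4, 4).map { _ =>
  val url = Thread.currentThread().getContextClassLoader.getResource("hdfs-site.xml")
  if (url == null) "missing" else url.toString
}.collect().distinct
confLocations.foreach(println)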

Re: Inconsistent behavior when running spark on top of tachyon on top of HDFS HA

elyast
Hi,

It is getting even more awkward. I have removed the references to stanley from spark-env.sh and from the command line (the last remaining one is in hdfs-site.xml, which is placed in spark.home/conf):

scala> val s = sc.textFile("hdfs://hadoop-ha-1:8020/tmp/proxy.txt")
scala> s.count()

and the same exception:

14/02/15 07:24:29 WARN TaskSetManager: Loss was due to java.lang.IllegalArgumentException
java.lang.IllegalArgumentException: java.net.UnknownHostException: stanley
        at org.apache.hadoop.security.SecurityUtil.buildTokenService(SecurityUtil.java:414)

Any ideas, guys?

Thanks in advance

Update

I removed core-site.xml and left just hdfs-site.xml, and that seems to have improved the situation:
(val s = sc.textFile("hdfs://hadoop-ha-1:8020/tmp/proxy.txt"); s.count() // works every time)

However, when I talk to the stanley nameservice it is still failing:
(val s = sc.textFile("hdfs://stanley/tmp/proxy.txt"); s.count() // fails the first time, then works)

That would fit the stack trace above: FileInputFormat.setInputPaths calls JobConf.getWorkingDirectory, which instantiates the default filesystem via FileSystem.get(conf), so with core-site.xml on the classpath every job touched fs.defaultFS (presumably hdfs://stanley) no matter which input URI was given.


Re: Inconsistent behavior when running spark on top of tachyon on top of HDFS HA

elyast
Hi,

Mystery solved: it seems that core-site.xml and hdfs-site.xml need to be distributed across the Mesos slaves.

Everything works as expected.
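
For anyone hitting the same problem: once both files are on every slave (or inside the SPARK_EXECUTOR_URI tarball), a first-attempt check like the following should pass (a sketch reusing the paths and the row count from this thread):

// Both the nameservice URI and the Tachyon URI should now succeed on the
// first attempt, with no UnknownHostException from any executor.
val viaHdfs    = sc.textFile("hdfs://stanley/tmp/proxy.txt").count()
val viaTachyon = sc.textFile("tachyon://hadoop-ha-1:19998/tmp/proxy.txt").count()
assert(viaHdfs == 5 && viaTachyon == 5)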