File write ownership conflicts for Spark 0.8.1 in YARN modes
When a Spark job on YARN writes to HDFS, either via RDD.saveAsTextFile(hdfsPath:
String) or directly through an FSDataOutputStream,
the UID of the process writing the file often conflicts with the
UID of the user running the Spark job (the appUser). As a result,
the owner of the newly written files or directories is the YARN
process, not the appUser, which creates file-access problems for
downstream processes. Depending on the cluster setup, this conflict
often results in write-permission errors that kill the job.
In contrast, when one runs an equivalent YARN MapReduce2 job with
the usual "hadoop jar JARFILE INPUT OUTPUT args..." submission
method, all application output files are owned by the appUser, the
UID running the job.
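The underlying rule is that the filesystem assigns ownership from the UID of the process doing the write, not from the user who submitted the job. A minimal local sketch of that rule (java.nio standing in for HDFS; on YARN the writing JVM runs as the "yarn" UID, so the file comes out owned by "yarn"):

```scala
import java.nio.file.Files

object OwnerDemo {
  def main(args: Array[String]): Unit = {
    // The file's owner is whoever runs this JVM. In a YARN container
    // that is the NodeManager's "yarn" UID, not the submitting appUser.
    val f = Files.createTempFile("owner-demo-", ".txt")
    Files.write(f, "written by this process\n".getBytes("UTF-8"))
    println("owner = " + Files.getOwner(f).getName)
    Files.delete(f)
  }
}
```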
This occurs in the following environments:
- CDH5.0.0-beta-1, or plain vanilla Apache Hadoop 2.2.0
- a small CDH5-beta-1 cluster, or a single-CPU YARN/HDFS setup
although there are subtle differences in the ownership of
directories, of temporary files written to the output directories,
and (when not halted by permission errors) of the final output.
There are some workarounds, such as changing the system umask to
000 or changing the permissions on the output destination
directories to 0777, but some environments do not allow any of
them. And none of the workarounds solves the downstream-processing
problems that result from the incorrect UIDs.
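For reference, the 0777 workaround amounts to loosening the mode bits on the output directory (on HDFS, via hadoop fs -chmod); the same permission change can be sketched locally with Java NIO (the directory here is just a hypothetical stand-in for the HDFS output path):

```scala
import java.nio.file.{Files, Path}
import java.nio.file.attribute.PosixFilePermissions

object ChmodWorkaround {
  // Open a directory up to rwxrwxrwx (0777) so a writer running under a
  // different UID -- e.g. "yarn" -- can still create files inside it.
  def openUp(dir: Path): String = {
    Files.setPosixFilePermissions(dir, PosixFilePermissions.fromString("rwxrwxrwx"))
    PosixFilePermissions.toString(Files.getPosixFilePermissions(dir))
  }

  def main(args: Array[String]): Unit = {
    val out = Files.createTempDirectory("spark-output-")
    println(openUp(out))  // rwxrwxrwx
  }
}
```

This makes the write succeed, but as noted above it does nothing about the files still being owned by the wrong UID afterwards.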
I've scanned any number of configuration instructions, and I've
dug into the code and the various Spark and YARN scripts used by
the yarn-standalone and yarn-client modes, but to no avail.
Any help would be appreciated!
Here are a few details...
In yarn-standalone mode, output directories and their contents are
owned by the YARN UID, not by the appUser UID.
In the following examples, "SparkTest1" was run in yarn-standalone
mode and "TestQueue4" was the output of a YARN MapReduce job. The
appUser UID is "klmarkey", and the YARN UID is "yarn".
In yarn-client mode, output directories are owned by the appUser,
but contents written by the worker nodes are owned by the YARN UID.
In the following example, an attempt to write via RDD.saveAsTextFile
failed when the Spark worker nodes tried to write temporary results:
14/01/30 22:39:37 WARN ClusterTaskSetManager: Loss was due to
Permission denied: user=yarn, access=WRITE, inode="/user/klmarkey/output/SparkClient1/values/column-0000/_temporary/0":klmarkey:klmarkey:drwxr-xr-x at
Finally, "klmarkey" is reported as the appUser in all of the logs,
yet Java system properties report "user.name" to be "yarn" when
writing with FSDataOutputStream (I can't instrument saveAsTextFile
in the same way).
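The instrumentation mentioned above is just a system-property lookup; a sketch of what was printed from inside the write path:

```scala
object JvmUserCheck {
  // Inside a YARN container this returned "yarn", even though the job
  // was submitted (and logged) as appUser "klmarkey".
  def jvmUser(): String = System.getProperty("user.name")

  def main(args: Array[String]): Unit =
    println("user.name = " + jvmUser())
}
```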