File write ownership conflicts for Spark 0.8.1 in YARN modes

Kevin Markey
When a Spark job on YARN writes to HDFS using RDD.saveAsTextFile(hdfsPath: String), or writes to HDFS directly through an FSDataOutputStream, the UID of the process writing the file often conflicts with the UID of the user running the Spark job (the appUser).  As a result, the owner of the newly written files or directories is the YARN process UID, not the appUser, which creates file-access problems for downstream processes.  Depending on the cluster setup, the conflict often surfaces as write-permission errors that kill the job.
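For concreteness, here is a minimal sketch of the two write paths in question. It assumes Spark 0.8.1's (master, appName) SparkContext constructor; the paths and RDD contents are placeholders, not my actual job:

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.spark.SparkContext

val sc = new SparkContext("yarn-standalone", "SparkTest1")

// Write path 1: RDD.saveAsTextFile. The part files land in HDFS
// owned by the "yarn" UID rather than by the appUser.
sc.parallelize(Seq("a", "b", "c"))
  .saveAsTextFile("hdfs:///user/klmarkey/output/SparkTest1/summary")

// Write path 2: a direct FSDataOutputStream write. Same ownership problem.
val fs = FileSystem.get(new Configuration())
val out = fs.create(new Path("/user/klmarkey/output/SparkTest1/summary/summary.txt"))
out.writeBytes("summary contents\n")
out.close()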

In contrast, when one runs an equivalent YARN MapReduce2 job, all application output files are owned by the appUser, that is, by the UID that submitted the job with the usual "hadoop jar JARFILE INPUT OUTPUT args..." submission method.

This occurs in the following environments:
  • Spark 0.8.1
  • CDH5.0.0-beta-1 or plain vanilla Apache Hadoop 2.2.0
  • A small CDH5-beta-1 cluster or a single-CPU Yarn/HDFS pseudocluster
  • In yarn-standalone or yarn-client mode, although there are subtle differences in ownership of directories, temporary files written to the output directories, and (when not halted by permission errors) final output result files.

There are some workarounds, such as changing the system umask to 000 or setting the output destination directories to mode 0777 (sketched below), but some environments do not permit such permissive settings, and none of the workarounds solves the downstream processing problems that the incorrect ownership causes.
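For reference, here is the equivalent of those two workarounds expressed through the Hadoop API. This is only a sketch of applying them programmatically; in practice one would set the umask in the cluster configuration and chmod the directory, and the output path is a placeholder:

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.hadoop.fs.permission.{FsAction, FsPermission}

val conf = new Configuration()
// Workaround 1: umask 000, so new files and directories come out world-writable.
conf.set("fs.permissions.umask-mode", "000")

val fs = FileSystem.get(conf)
// Workaround 2: open the destination directory to everyone (mode 0777).
fs.setPermission(new Path("/user/klmarkey/output"),
  new FsPermission(FsAction.ALL, FsAction.ALL, FsAction.ALL))

Neither changes who owns the files, so the downstream UID problem remains.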

I've scanned any number of configuration instructions, and I've jumped into the code and the various Spark and Yarn scripts used by yarn-standalone and yarn-client modes, but to no avail!

Any help would be appreciated!!!

Here are a few details...

In yarn-standalone mode, output directories and their contents are owned by the YARN UID, not by the appUser UID.

In the following listings, "SparkTest1" was produced by a Spark job run in yarn-standalone mode, and "TestQueue4" by a YARN MapReduce2 job. The appUser UID is "klmarkey", and the YARN UID is "yarn":

# Spark yarn-standalone job
drwxrwxrwx   - klmarkey klmarkey          0 2014-01-16 16:09 output

drwxrwxrwx   - yarn     klmarkey          0 2014-01-16 16:09 output/SparkTest1
drwxrwxrwx   - yarn     klmarkey          0 2014-01-16 16:09 output/SparkTest1/summary
-rw-r--r--   3 yarn     klmarkey       4658 2014-01-16 16:09 output/SparkTest1/summary/summary.txt
# Yarn MapReduce job
drwxr-xr-x   - klmarkey klmarkey          0 2014-01-06 15:47 output/TestQueue4
drwxr-xr-x   - klmarkey klmarkey          0 2014-01-06 15:40 output/TestQueue4/maxfreq
drwxr-xr-x   - klmarkey klmarkey          0 2014-01-06 15:38 output/TestQueue4/maxfreq/0000
-rw-r--r--   3 klmarkey klmarkey          0 2014-01-06 15:38 output/TestQueue4/maxfreq/0000/_SUCCESS
-rw-r--r--   3 klmarkey klmarkey         65 2014-01-06 15:38 output/TestQueue4/maxfreq/0000/part-r-00000

In yarn-client mode, output directories are owned by the appUser, but the contents written by the worker nodes are owned by the YARN UID.  In the following example, a write by RDD.saveAsTextFile failed when the Spark worker nodes attempted to write temporary results into output/SparkClient1/values/column-0000/_temporary/0:

# Spark yarn-client job (failed; see error message below)
drwxrwxrwx   - klmarkey klmarkey          0 2014-01-30 22:58 output

drwxr-xr-x   - klmarkey klmarkey          0 2014-01-30 22:39 output/SparkClient1
drwxr-xr-x   - klmarkey klmarkey          0 2014-01-30 22:39 output/SparkClient1/values
drwxr-xr-x   - klmarkey klmarkey          0 2014-01-30 22:39 output/SparkClient1/values/column-0000
drwxr-xr-x   - klmarkey klmarkey          0 2014-01-30 22:39 output/SparkClient1/values/column-0000/_temporary
drwxr-xr-x   - klmarkey klmarkey          0 2014-01-30 22:39 output/SparkClient1/values/column-0000/_temporary/0

Here is the error message:

14/01/30 22:39:37 WARN ClusterTaskSetManager: Loss was due to org.apache.hadoop.security.AccessControlException
org.apache.hadoop.security.AccessControlException: Permission denied: user=yarn, access=WRITE, inode="/user/klmarkey/output/SparkClient1/values/column-0000/_temporary/0":klmarkey:klmarkey:drwxr-xr-x
    at org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.check(FSPermissionChecker.java:234)
    at org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.check(FSPermissionChecker.java:214)
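In other words, the task is running as "yarn", and "yarn" has no write access to the _temporary directory, which was created as "klmarkey" with mode drwxr-xr-x.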

Finally, "klmarkey" is reported as the appUser in all of the logs for these runs, yet when I instrument the code around the FSDataOutputStream write, the Java system property "user.name" reports "yarn" (I can't instrument saveAsTextFile in the same way).
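The instrumentation was nothing fancier than this, placed immediately before the FSDataOutputStream write (a sketch; the UserGroupInformation line is an extra comparison I am suggesting here, not part of my original instrumentation, and I make no claim about what it prints):

import org.apache.hadoop.security.UserGroupInformation

// Logged from the process doing the FSDataOutputStream write.
println("user.name = " + System.getProperty("user.name"))  // reports "yarn", not "klmarkey"
// The Hadoop-level view of the current user, for comparison.
println("UGI user  = " + UserGroupInformation.getCurrentUser().getShortUserName())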

Thanks.
Kevin Markey