Hi - I had originally posted this as a bug (SPARK-22528) but given my uncertainty, it was suggested that I send it to the mailing list instead...
We are using Azure Data Lake (ADL) to store our event logs. This worked fine in 2.1.x, but in 2.2.0 the underlying files are no longer visible to the history server, even though we are using the same service principal that was used to write the logs. I tracked it down to the call to SparkHadoopUtil.checkAccessPermission() in "FsHistoryProvider" (the call was added for v2.2.0).
From what I can tell, it preemptively checks the permissions on the files and skips the ones it thinks are not readable. The problem is that it's using a check that appears to be specific to HDFS, so even though the files are definitely readable, it skips over them. Also, "FsHistoryProvider" is the only place this code is used.
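
To make the suspected behavior concrete, here is a minimal, self-contained sketch of an HDFS-style owner/group/other read check. It is a simplified model, not the actual Spark code: the class `PermissionCheckSketch`, the `FileInfo` type, and all user, group, and objectId values are hypothetical. It assumes ADL reports the service principal's objectId as the file owner, while the "current user" is the local OS user.

```java
import java.util.Arrays;
import java.util.List;

public class PermissionCheckSketch {
    // Hypothetical stand-in for the metadata a FileSystem reports for a file:
    // owner, group, and one read bit each for user/group/other.
    static final class FileInfo {
        final String owner;
        final String group;
        final boolean userRead;
        final boolean groupRead;
        final boolean otherRead;

        FileInfo(String owner, String group,
                 boolean userRead, boolean groupRead, boolean otherRead) {
            this.owner = owner;
            this.group = group;
            this.userRead = userRead;
            this.groupRead = groupRead;
            this.otherRead = otherRead;
        }
    }

    // POSIX-style check: pick exactly one of the user/group/other bits
    // by comparing the caller's name and groups with the reported owner/group.
    static boolean isReadable(FileInfo f, String currentUser, List<String> currentGroups) {
        if (currentUser.equals(f.owner)) {
            return f.userRead;
        } else if (currentGroups.contains(f.group)) {
            return f.groupRead;
        } else {
            return f.otherRead;
        }
    }

    public static void main(String[] args) {
        // Assumed values: the owner is reported as an objectId, not an OS user name.
        String objectId = "00000000-0000-0000-0000-000000000000";
        List<String> localGroups = Arrays.asList("hadoop");

        FileInfo ownerOnly = new FileInfo(objectId, "loggroup", true, false, false);
        FileInfo worldReadable = new FileInfo(objectId, "loggroup", true, false, true);

        // Local OS user "spark" does not match the reported owner: file is skipped.
        System.out.println(isReadable(ownerOnly, "spark", localGroups));      // false
        // Current user overridden to the objectId: the owner bit applies.
        System.out.println(isReadable(ownerOnly, objectId, localGroups));     // true
        // World-readable file: the "other" bit lets any user through.
        System.out.println(isReadable(worldReadable, "spark", localGroups));  // true
    }
}
```

If the real check behaves along these lines, the file would be skipped whenever the locally resolved user name differs from the owner string the store reports, regardless of whether the credentials used by the FileSystem can actually read it.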
I was able to work around it by either:
* setting the permissions for the files on ADL to world readable
* or setting HADOOP_PROXY_USER to the objectId of the Azure service principal which owns the files
Neither of these workarounds is acceptable for our environment. That said, I am not sure how this should be addressed:
* Is this an issue with the Azure/Hadoop bindings not complying with the Hadoop FileSystem interface/contract in some way?
* Is this an issue with "checkAccessPermission()" not really accounting for all of the possible FileSystem implementations?
My gut tells me it's the latter, because SparkHadoopUtil.checkAccessPermission() gets its "currentUser" info from outside of the FileSystem class. It doesn't make sense to me that an instance of FileSystem would affect a global context, since there could be many FileSystem instances in a given app.
That said, I know ADL is not heavily used at this time, so I wonder if anyone is seeing this with S3 as well? Maybe not, since S3 permissions are always reported as world-readable (I think), which would cause checkAccessPermission() to succeed.