binaryFiles() on directory full of directories

Christopher Piggott
I have a top-level directory in HDFS that contains nothing but subdirectories (no actual files). Each of those subdirectories contains a mix of files and further subdirectories:

        /topdir/dir1/(lots of files)
        /topdir/dir2/(lots of files)
        /topdir/dir2/subdir/(lots of files)

I noticed something strange:

spark.sparkContext.binaryFiles("hdfs://*", 32*8)
    .filter { case (fileName, contents) => fileName.endsWith(".xyz") }
    .map { case (fileName, contents) => 1}

fails with an ArrayIndexOutOfBoundsException, but if I specify it as:

  spark.sparkContext.binaryFiles("hdfs://*/*", 32*8)

it works.

I played around a little and found I could get the first attempt to work if I just put one regular file in /topdir.

This is with Spark 2.2.1.

Is this known behavior?
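
In case it helps anyone hitting the same thing: one possible workaround (a sketch only, assuming a `spark` session and the directory layout above) is to sidestep Spark's own directory expansion entirely by enumerating the files with the Hadoop FileSystem API, whose listFiles call supports recursion, and handing binaryFiles an explicit comma-separated path list. The "/topdir" path and ".xyz" filter are just the examples from this post.

```scala
import org.apache.hadoop.fs.{FileSystem, Path}

// Recursively list every file under /topdir ourselves, so binaryFiles()
// never has to walk a directory tree on its own.
val fs = FileSystem.get(spark.sparkContext.hadoopConfiguration)
val files = fs.listFiles(new Path("/topdir"), /* recursive = */ true)

// Drain the RemoteIterator into a comma-separated path string,
// keeping only the .xyz files we actually care about.
val paths = Iterator
  .continually(files)
  .takeWhile(_.hasNext)
  .map(_.next().getPath.toString)
  .filter(_.endsWith(".xyz"))
  .mkString(",")

// FileInputFormat accepts comma-separated paths, so this reads them all.
val rdd = spark.sparkContext.binaryFiles(paths, 32 * 8)
```

Since the filtering happens on the driver before the RDD is built, this also avoids shipping non-matching files to the executors at all.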