Error while reading hive tables with tmp/hidden files inside partitions


Dhrubajyoti Hati
Hi,

Is there any way to discard files starting with a dot (.) or ending in .tmp inside the Hive partitions while reading from a Hive table using the spark.read.table method?

I tried using PathFilters, but they didn't work. I am using spark-submit and passing my Python (PySpark) file containing the source code.

spark.sparkContext._jsc.hadoopConfiguration().set("mapreduce.input.pathFilter.class", "com.abc.hadoop.utility.TmpFileFilter")

// Scala filter class (compiled separately and available on the job's classpath):
import org.apache.hadoop.fs.{Path, PathFilter}

class TmpFileFilter extends PathFilter {
  override def accept(path: Path): Boolean = !path.getName.endsWith(".tmp")
}
Still, in the detailed logs I can see that the .tmp files are being considered:
20/04/22 12:58:44 DEBUG MapRFileSystem: getMapRFileStatus maprfs:///a/hour=05/host=abc/FlumeData.1587559137715
20/04/22 12:58:44 DEBUG MapRFileSystem: getMapRFileStatus maprfs:///a/hour=05/host=abc/FlumeData.1587556815621
20/04/22 12:58:44 DEBUG MapRFileSystem: getMapRFileStatus maprfs:///a/hour=05/host=abc/.FlumeData.1587560277337.tmp


Is there any way to discard the .tmp files or the hidden files (file names starting with a dot or an underscore) in Hive partitions while reading from Spark?
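One possible fallback, assuming the .tmp files can still be parsed by the table's input format (so the read itself does not fail), would be to drop rows by their source file after reading. A rough sketch in PySpark; the table name is a placeholder:

from pyspark.sql.functions import col, input_file_name

df = spark.read.table("mydb.mytable")  # placeholder table name

cleaned = (df
    .withColumn("_src_file", input_file_name())
    # keep only rows whose source file does not end in ".tmp"
    .filter(~col("_src_file").rlike(r"\.tmp$"))
    # ...and whose file name does not start with "." or "_"
    .filter(~col("_src_file").rlike(r"/[._][^/]*$"))
    .drop("_src_file"))

This still lists and scans the hidden files, so it only helps when they are readable; it does not avoid the extra I/O.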

Regards,

Dhrubajyoti Hati.
Mob No: 9886428028/9652029028

Re: Error while reading hive tables with tmp/hidden files inside partitions

Dhrubajyoti Hati
Just wondering if anyone could help me out on this.

Thank you!

Regards,

Dhrubajyoti Hati.
Re: Error while reading hive tables with tmp/hidden files inside partitions

Wenchen Fan
This looks like a bug: the path filter doesn't work for Hive table reads. Can you open a JIRA ticket?
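As an interim measure, note that Spark's built-in file sources skip files whose names start with "." or "_" when listing a directory, so reading the partition directories directly (bypassing the Hive catalog) would avoid these files, provided the underlying storage format is known. A rough sketch with a placeholder root path and an assumed Parquet format:

# Placeholder path and format; hour/host become partition columns
# inferred from the directory layout under the root path.
df = spark.read.format("parquet").load("maprfs:///a/")

The trade-off is losing the Hive metastore definition, so the format (and, for text-like data, the schema) has to be supplied by hand.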

Re: Error while reading hive tables with tmp/hidden files inside partitions

Dhrubajyoti Hati
FYI, we are using Spark 2.2.0. Should the fix be present in this Spark version? I wanted to check before opening a JIRA ticket.

Regards,

Dhrubajyoti Hati.
Re: Error while reading hive tables with tmp/hidden files inside partitions

Wenchen Fan
Yeah, please report the bug against a supported Spark version, like 2.4.
