DataSourceReader and SupportsPushDownFilters for Short types


Hugh Hyndman
Hi,

This is my first message to the Apache Spark digest. 

In a custom data source reader I am implementing, I noticed that I do not receive pushdown filters for data types such as ShortType, ByteType, and BooleanType. I do receive filters for these types: IntegerType, LongType, FloatType, DoubleType, DateType, and TimestampType.

I receive the filters via the pushFilters override, which is part of the SupportsPushDownFilters trait.

The following spark-shell session demonstrates the behaviour. Note that I placed

filters.foreach { println }

at the top of my pushFilters override.
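
For context, here is a minimal sketch of the relevant part of my reader (the class name and constructor are simplified stand-ins for my actual code, and the partition planning is elided since it is not relevant to the question):

import java.util.{List => JList}

import org.apache.spark.sql.Row
import org.apache.spark.sql.sources.Filter
import org.apache.spark.sql.sources.v2.reader.{DataReaderFactory, DataSourceReader, SupportsPushDownFilters}
import org.apache.spark.sql.types.StructType

class MyReader(schema: StructType) extends DataSourceReader with SupportsPushDownFilters {

  override def readSchema(): StructType = schema

  // Spark calls this with the filters it considers pushable; print them for
  // debugging and hand them all back, so Spark still evaluates every filter itself.
  override def pushFilters(filters: Array[Filter]): Array[Filter] = {
    filters.foreach { println }
    filters
  }

  // Nothing is actually pushed in this sketch.
  override def pushedFilters(): Array[Filter] = Array.empty

  override def createDataReaderFactories(): JList[DataReaderFactory[Row]] =
    ???  // partition planning elided
}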

Spark context Web UI available at http://192.168.20.114:4040
Spark context available as 'sc' (master = local[4], app id = local-1533638354996).
Spark session available as 'spark'.
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 2.3.1
      /_/

Using Scala version 2.11.8 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_91)
Type in expressions to have them evaluated.
Type :help for more information.

scala> val df = spark.read.format("myds").
  option("port", 5000).
  option("host", "localhost").
  option("numPartitions", "1").
  option("function","spartanQuery").load()
df: org.apache.spark.sql.DataFrame = [col1: smallint, col2: bigint ... 1 more field]

scala> df.schema
res0: org.apache.spark.sql.types.StructType = StructType(StructField(col1,ShortType,false), 
StructField(col2,LongType,false), StructField(col3,DoubleType,false))

scala> df.createOrReplaceTempView("dft")

scala> spark.sql("select * from dft").show
+----+----+-------------------+
|col1|col2|               col3|
+----+----+-------------------+
|   0|   0|0.36884380085393786|
|   1| 100| 0.5903338296338916|
|   2| 200|0.19292618241161108|
|   3| 300|0.27678317041136324|
|   4| 400|0.35814784304238856|
|   5| 500|0.19823945895768702|
|   6| 600| 0.4533605803735554|
|   7| 700|0.15147616644389927|
|   8| 800|0.09802114544436336|
|   9| 900|0.31370880361646414|
+----+----+-------------------+


scala> spark.sql("select * from dft where col1<5 and col2<500 and col3<.5").show
LessThan(col2,500)
LessThan(col3,0.5)
+----+----+-------------------+
|col1|col2|               col3|
+----+----+-------------------+
|   0|   0|0.36884380085393786|
|   2| 200|0.19292618241161108|
|   3| 300|0.27678317041136324|
|   4| 400|0.35814784304238856|
+----+----+-------------------+

Note the absence of a filter for col1 (ShortType). Is this a shortcoming in Spark, or am I doing something wrong? If it is a shortcoming, could the Spark team consider adding some of the other data types to the list of potential pushdown filters?
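
One experiment that might narrow this down (I have not verified the output): keep the comparison entirely in smallint, so that no implicit cast lands on col1. My suspicion is that col1 < 5 is analyzed as cast(col1 as int) < 5 to match the integer literal, and that the cast wrapping the column is what stops the filter from being translated. For example:

scala> spark.sql("select * from dft where col1 < cast(5 as smallint) and col2 < 500 and col3 < .5").show

If LessThan(col1,5) then appears in pushFilters, the literal's type, rather than ShortType itself, would seem to be the issue.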

Thank you