SPARK SQL: returns null for a column, while HIVE query returns data for the same column


ARAVIND ARUMUGHAM Sethurathnam

I have a Hive table created on top of S3 data in Parquet format, partitioned by a single column named eventdate.

1) When queried through Hive, it returns data for a column named "headertime", which is present in the schema of both the table and the files.

select headertime from dbName.test_bug where eventdate=20180510 limit 10

2) From a Scala notebook, directly loading the files for a particular partition also works:

val session = org.apache.spark.sql.SparkSession.builder
  .appName("searchRequests")
  .enableHiveSupport()
  .getOrCreate()

val searchRequest = session.sqlContext.read.parquet("s3n://bucketName/module/search_request/eventDate=20180510")

searchRequest.createOrReplaceTempView("SearchRequest")

val exploreDF = session.sql("select headertime from SearchRequest where SearchRequestHeaderDate='2018-05-10' limit 100")

exploreDF.show(20)

This also displays the values for the column "headertime".

3) But when using Spark SQL to query the Hive table directly, as below,

val exploreDF = session.sql(
  "select headertime from dbName.test_bug where eventdate=20180510 limit 100")
exploreDF.show(20)

it always returns null.

I opened the Parquet file and can see that the column headertime is present with values, but I am not sure why Spark SQL is unable to read the values for that column.

It would be helpful if someone could point out where Spark SQL gets the schema from; I was expecting it to behave similarly to the Hive query.
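One plausible explanation (an assumption, not confirmed in the thread): for a Hive Parquet table, Spark 2.x takes the table schema from the Hive metastore, which stores column names lowercased, but reads the files with its own Parquet reader, which matches columns against the file footer by name, case-sensitively. If the files actually store the column in camelCase (say headerTime, as the eventDate=... path suggests), the lowercase lookup finds no matching field and every value comes back null, while Hive (case-insensitive matching) and a direct parquet read (which uses the file's own schema) both succeed. A minimal plain-Scala sketch of the matching logic, with hypothetical column names:

```scala
// Hypothetical column names: the metastore lowercases them, while the
// file footer keeps the original camelCase casing.
val metastoreCols = Seq("headertime", "eventdate")
val parquetCols   = Seq("headerTime", "eventDate")

// Case-sensitive matching (what Spark's native Parquet reader effectively
// does here) finds no corresponding field, so the column reads as null:
val sensitive = metastoreCols.filter(parquetCols.contains)

// Case-insensitive matching (what Hive does) resolves both columns:
val insensitive =
  metastoreCols.filter(m => parquetCols.exists(_.equalsIgnoreCase(m)))

println(sensitive)    // List()
println(insensitive)  // List(headertime, eventdate)
```

If this is what is happening, it would explain why the direct file read in (2) works while the table query in (3) returns nulls for the same rows.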


Re: SPARK SQL: returns null for a column, while HIVE query returns data for the same column

ARAVIND ARUMUGHAM Sethurathnam
  • This column was added in later partitions and is not present in earlier ones.

  • I assume partition pruning should load only the particular partition I am specifying when using Spark SQL?
  • (Spark version 2.2)
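Given that the column exists only in later partitions and the file paths use camelCase, a few Spark 2.2 knobs may be worth trying. This is a hedged sketch of assumptions to test, not a confirmed fix; the table and bucket names are the ones from the original post:

```scala
// Ask Spark to infer a case-preserving schema from the files instead of
// trusting only the lowercased metastore schema. INFER_AND_SAVE is the
// 2.2 default, but it may have been overridden, or the previously saved
// inferred schema may be stale:
session.conf.set("spark.sql.hive.caseSensitiveInferenceMode", "INFER_AND_SAVE")
session.sql("REFRESH TABLE dbName.test_bug")

// For direct file reads, merge the footers across partitions so a column
// that was added only in later partitions still appears in the unified schema:
val merged = session.read
  .option("mergeSchema", "true")
  .parquet("s3n://bucketName/module/search_request")
```

REFRESH TABLE clears Spark's cached metadata for the table, so any newly inferred schema takes effect on the next query.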

On Fri, May 11, 2018 at 2:24 PM, ARAVIND ARUMUGHAM Sethurathnam <[hidden email]> wrote:

--
Wealth is not money. Wealth is relationships with people.