How to disable pushdown predicate in spark 2.x query

How to disable pushdown predicate in spark 2.x query

Mohit
Hi All,

I am trying to read a table from a relational database using Spark 2.x.

I am using code like the following:

sparkSession.read().jdbc(url, table, connectionProperties).select("SELECT_COLUMN").where(whereClause);


Now, what's happening is that the SQL query Spark actually runs against the relational DB is:

SELECT column, (where_clause_columns) FROM table WHERE SELECT_COLUMN IS NOT NULL;

And I guess it is applying the where-clause filter only after fetching all the data from the DB where SELECT_COLUMN IS NOT NULL.

I searched about it and found out this is because of predicate pushdown. Is there a way to load the data into a DataFrame using a specific query instead?

I found a solution: if we provide the actual query instead of the table name in the following code, it should run that query exactly:

table = "select SELECT_COLUMN from table  "+ whereClause;
sparkContext.read().jdbc(url, table , connectionProperties).select('SELECT_COLUMN').where(whereClause);
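
Put together, a runnable sketch of that approach might look like the following. The URL, credentials, table name, and predicate are all placeholders, not values from my setup:

import java.util.Properties;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class JdbcSubqueryRead {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("jdbc-subquery-read")
                .getOrCreate();

        Properties connectionProperties = new Properties();
        connectionProperties.put("user", "dbuser");      // placeholder
        connectionProperties.put("password", "dbpass");  // placeholder

        // The parenthesized subquery (with an alias) is sent to the database
        // as-is, so the filtering happens on the DB side, not in Spark.
        String table = "(SELECT SELECT_COLUMN FROM my_table"
                + " WHERE SELECT_COLUMN IS NOT NULL) AS subquery";

        Dataset<Row> df = spark.read().jdbc(
                "jdbc:postgresql://dbhost:5432/mydb",    // placeholder URL
                table,
                connectionProperties);
        df.show();
    }
}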


Does the above seem like a good solution?


Regards,
Mohit

Re: How to disable pushdown predicate in spark 2.x query

Xiao Li

Just turn off the JDBC option pushDownPredicate, which was introduced in Spark 2.4. https://spark.apache.org/docs/latest/sql-data-sources-jdbc.html
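
For example, a minimal sketch using the DataFrameReader options API. The URL, credentials, table, and filter column below are placeholders:

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class JdbcNoPushdown {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("jdbc-no-pushdown")
                .getOrCreate();

        Dataset<Row> df = spark.read()
                .format("jdbc")
                .option("url", "jdbc:postgresql://dbhost:5432/mydb") // placeholder
                .option("dbtable", "my_table")                       // placeholder
                .option("user", "dbuser")                            // placeholder
                .option("password", "dbpass")                        // placeholder
                .option("pushDownPredicate", "false")                // Spark 2.4+
                .load();

        // With pushdown disabled, this filter is applied by Spark after the
        // rows are fetched, rather than being compiled into the JDBC query.
        df.where("SELECT_COLUMN IS NOT NULL").show();
    }
}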

Xiao
