Does Apache Spark take into account JDBC indexes / statistics when optimizing queries?

lucas.gary@gmail.com
IE:  If my JDBC table has an index on it, will the optimizer consider that when pushing predicates down?

I noticed in a query like this:

df = spark.hiveContext.read.jdbc(
  url=jdbc_url,
  table="schema.table",
  column="id",                    # partition column for parallel reads
  lowerBound=lower_bound_id,
  upperBound=upper_bound_id,
  numPartitions=numberPartitions  # max number of parallel reads
)
df.registerTempTable("df")

filtered_df = spark.hiveContext.sql("""
    SELECT
        * 
    FROM
        df
    WHERE
        type = 'type' 
        AND action = 'action' 
        AND audited_changes LIKE '---\ncompany_id:\n- %'
""")
filtered_df.registerTempTable("filtered_df")


The queries sent to the DB look like this:
"Select fields from schema.table where type='type' and action='action' and id > lower_bound and id <= upper_bound"

And then it does the like ( LIKE '---\ncompany_id:\n- %') in memory, which is great!

However I'm wondering why it chooses that optimization.  In this case there aren't any indexes on any of these except ID.  

So, does spark take into account JDBC indexes in it's query plan where it can?  

Thanks!

Gary Lucas

Re: Does Apache Spark take into account JDBC indexes / statistics when optimizing queries?

Mich Talebzadeh
Sorry, what do you mean by "my JDBC table has an index on it"? Where are you reading the data from?

I assume you are referring to the "id" column of the table that you are reading through the JDBC connection.

Then you are creating a temp table called "df". That temp table is created in temporary workspace and does not have any index. The index on "id" is used when doing the parallel reads into RDDs, not when querying the temp table.
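To make that concrete, here is a rough sketch in plain Python of how the lowerBound/upperBound/numPartitions settings become per-partition WHERE clauses, each sent to MySQL as a separate query. This is an approximation, not Spark's actual code (the real logic is Scala, in JDBCRelation.columnPartition, and handles remainders and edge cases with more care):

def partition_clauses(column, lower_bound, upper_bound, num_partitions):
    # Evenly sized stride between the bounds.
    stride = (upper_bound - lower_bound) // num_partitions
    clauses = []
    current = lower_bound
    for i in range(num_partitions):
        if i == 0:
            # First partition also sweeps up rows below the lower bound
            # and NULLs, so no rows are silently dropped.
            clauses.append("%s < %d OR %s IS NULL" % (column, current + stride, column))
        elif i == num_partitions - 1:
            # Last partition takes everything above the previous boundary.
            clauses.append("%s >= %d" % (column, current))
        else:
            clauses.append("%s >= %d AND %s < %d" % (column, current, column, current + stride))
        current += stride
    return clauses

for clause in partition_clauses("id", 0, 1000, 4):
    print(clause)
# id < 250 OR id IS NULL
# id >= 250 AND id < 500
# id >= 500 AND id < 750
# id >= 750

An index on "id" lets MySQL answer each of these range scans cheaply, which is exactly where it helps.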

HTH

Dr Mich Talebzadeh

 

LinkedIn: https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
http://talebzadehmich.wordpress.com

Disclaimer: Use it at your own risk. Any and all responsibility for any loss, damage or destruction of data or any other property which may arise from relying on this email's technical content is explicitly disclaimed. The author will in no case be liable for any monetary damages arising from such loss, damage or destruction.

 



Re: Does Apache Spark take into account JDBC indexes / statistics when optimizing queries?

lucas.gary@gmail.com
If the underlying table(s) have indexes on them, does Spark use those indexes to optimize the query?

I.e. if a table in my JDBC data source (MySQL in this case) had several indexes and my query was filtering on one of the indexed fields, would Spark know to push that predicate to the database, or is the predicate push-down ignorant of the underlying storage layer's details?

Apologies if that still doesn't adequately explain my question.

Gary Lucas


Re: Does Apache Spark take into account JDBC indexes / statistics when optimizing queries?

Mich Talebzadeh
Remember, your indexes live in the RDBMS, in this case MySQL. When you read from that table you have an 'id' column, which I assume is an integer, and Spark makes parallel connections to the table through JDBC. You can see the multiple threads if you query MySQL. You specified numPartitions, but MySQL will decide how many parallel threads it can actually handle.

So data is read into Spark as RDDs, and you can see that through the Spark GUI (port 4040 by default). Then you create a DataFrame (DF) and register it as a temp table. The temp table will not have any indexes; this is happening in Spark space, not MySQL. Once you start querying and collecting data, Spark will try to cache the data in memory, which you can again watch through the Spark GUI. You can inspect the optimizer with the explain() function; you will see that no index is used.

Spark uses distributed in-memory data to optimize the work; it does not use any index. In an RDBMS, an index is an ordered set of one or more columns stored on disk in B-tree format to speed up queries where needed. A Spark temp table does not follow that method. So, in summary, your temp table will benefit from more executors and memory if you want to improve query performance.
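For example, a minimal sketch of checking this with explain(), assuming the "df" temp table from earlier in the thread (the exact plan text varies by Spark version):

filtered_df = spark.hiveContext.sql("""
    SELECT * FROM df
    WHERE type = 'type' AND action = 'action'
""")
filtered_df.explain()
# In the physical plan, look at the JDBC scan line for something like:
#   PushedFilters: [EqualTo(type,type), EqualTo(action,action)]
# Predicates listed there are sent to MySQL; anything appearing only in a
# Filter operator above the scan is evaluated in Spark's memory.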

HTH

Dr Mich Talebzadeh

 


Re: Does Apache Spark take into account JDBC indexes / statistics when optimizing queries?

lucas.gary@gmail.com
OK, so when Spark is forming queries it's ignorant of the underlying storage layer's indexes.

If there is an index on a table, Spark doesn't take that into account when doing predicate push-down in optimization. In that case, why does Spark push two of my conditions (where fieldx = 'action') to the database but then do the LIKE in memory? Is it simply that LIKEs are done in memory and straightforward equalities are pushed to the storage layer?

"remember your indexes are in RDBMS"

Exactly what I'm asking about: when Spark issues the query via the JDBC reader, is that query ignorant of the underlying indexes or not? How does Spark determine which predicates to perform in the RDD and which to execute in the storage layer? I guess I should just dig out the JDBC DataFrame reader code and see if I can make sense of it? Or is the predicate push-down stage independent of the readers?
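For reference, in Spark 2.x that translation lives in org.apache.spark.sql.execution.datasources.jdbc.JDBCRDD: its compileFilter method turns each Catalyst filter it recognizes into a SQL fragment, and anything it can't translate stays behind for Spark to evaluate. A rough Python sketch of the idea, not Spark's actual code:

def compile_filter(f):
    """Return a SQL fragment for a (column, op, value) predicate,
    or None if it can't be expressed for the remote database."""
    column, op, value = f
    if op in ("=", ">", ">=", "<", "<="):
        return "%s %s '%s'" % (column, op, value)
    # Unrecognized predicates (e.g. this LIKE pattern) return None
    # and are evaluated in Spark's memory instead.
    return None

filters = [("type", "=", "type"),
           ("action", "=", "action"),
           ("audited_changes", "LIKE", "---\\ncompany_id:\\n- %")]
pushed = [compile_filter(f) for f in filters if compile_filter(f)]
kept = [f for f in filters if compile_filter(f) is None]
print("WHERE " + " AND ".join(pushed))   # sent to the database
print("Evaluated in memory:", kept)      # applied by Spark

So the decision is about whether the predicate can be translated for the remote database at all, not about what indexes exist there.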

Thanks for helping me form a more accurate question!

Gary

 


Re: Does Apache Spark take into account JDBC indexes / statistics when optimizing queries?

Mich Talebzadeh
Regarding this snippet, Gary:

filtered_df = spark.hiveContext.sql("""
    SELECT
        * 
    FROM
        df
    WHERE
        type = 'type' 
        AND action = 'action' 
        AND audited_changes LIKE '---\ncompany_id:\n- %'
""")
filtered_df.registerTempTable("filtered_df")


You are using HQL to read data from your temporary table "df" and then creating another temporary table on a subset of that temp table.

What is the purpose of that?

Once you are within Spark itself, the data has already been read in. Granted, the indexes in the RDBMS help when reading the data through the JDBC connection, but they do not play any role later when running the SQL in HQL.
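To illustrate the point (a sketch assuming the "df" from earlier in the thread):

df.cache()    # materialize in Spark memory on the first action
df.count()    # triggers the JDBC read (this is where MySQL's indexes can help)
# From here on, queries against "df" are served from Spark's in-memory
# partitions; MySQL and its indexes are no longer involved.
spark.hiveContext.sql("SELECT count(*) FROM df WHERE action = 'action'").show()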

Does that make sense?
 

Dr Mich Talebzadeh

 


Re: Does Apache Spark take into account JDBC indexes / statistics when optimizing queries?

lucas.gary@gmail.com
Right, that makes sense and I understood that.

The thing I'm wondering about (and I think the answer is 'no' at this stage):

When the optimizer is running and pushing predicates down, does it take into account indexing and other storage-layer strategies in determining which predicates are processed in memory and which are pushed to storage?

Thanks!

Gary Lucas

