spark session jdbc performance


spark session jdbc performance

Madhire, Naveen

Hi,

I am trying to fetch data from an Oracle DB using a subquery and am experiencing a lot of performance issues.

Below is the query I am using (Spark 2.0.2):

val df = spark_session.read.format("jdbc")
  .option("driver", "oracle.jdbc.OracleDriver")
  .option("url", jdbc_url)
  .option("user", user)
  .option("password", pwd)
  .option("dbtable", "subquery")
  .option("partitionColumn", "id")   // primary key column, uniformly distributed
  .option("lowerBound", "1")
  .option("upperBound", "500000")
  .option("numPartitions", 30)
  .load()

The above read should run with 30 partitions, but when I look at the UI it is only using 1 partition to run the query.

Can anyone tell me if I am missing anything, or whether I need to do anything else to tune the performance of the query?

Thanks



The information contained in this e-mail is confidential and/or proprietary to Capital One and/or its affiliates and may only be used solely in performance of work or services for Capital One. The information transmitted herewith is intended only for use by the individual or entity to which it is addressed. If the reader of this message is not the intended recipient, you are hereby notified that any review, retransmission, dissemination, distribution, copying or other use of, or taking of any action in reliance upon this information is strictly prohibited. If you have received this communication in error, please contact the sender and delete the material from your computer.
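For reference, the four options in the post drive Spark's range partitioning roughly as follows. This is a simplified Python sketch (the thread's code is Scala), loosely modeled on Spark's JDBCRelation.columnPartition; the function name and exact stride arithmetic are illustrative, not Spark's actual source:

```python
def column_partition(column, lower, upper, num_partitions):
    """Return one WHERE predicate per partition covering the whole column range."""
    stride = (upper - lower) // num_partitions  # integer stride per partition
    predicates = []
    current = lower
    for i in range(num_partitions):
        # First partition is open below, last is open above, so no rows are lost
        lower_clause = f"{column} >= {current}" if i > 0 else None
        current += stride
        upper_clause = f"{column} < {current}" if i < num_partitions - 1 else None
        if lower_clause and upper_clause:
            predicates.append(f"{lower_clause} AND {upper_clause}")
        else:
            predicates.append(lower_clause or upper_clause)
    return predicates

preds = column_partition("id", 1, 500000, 30)
# 30 predicates, e.g. "id < 16667", "id >= 16667 AND id < 33333", ...
```

When all four options are supplied, Spark should issue one such predicated query per partition, which is why seeing a single partition in the UI is surprising here.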


Re: spark session jdbc performance

lucas.gary@gmail.com
Did you check the query plan / check the UI?  

That code looks same to me.  Maybe you've only configured for one executor?

Gary

On Oct 24, 2017 2:55 PM, "Naveen Madhire" <[hidden email]> wrote:

Hi,

 

I am trying to fetch data from Oracle DB using a subquery and experiencing lot of performance issues.

 

Below is the query I am using,

 

Using Spark 2.0.2

 

val df = spark_session.read.format("jdbc")
.option("driver","oracle.jdbc.OracleDriver")
.option("url", jdbc_url)
   .option("user", user)
   .option("password", pwd)
   .option("dbtable", "subquery")
   .option("partitionColumn", "id")  //primary key column uniformly distributed
   .option("lowerBound", "1")
   .option("upperBound", "500000")
.option("numPartitions", 30)
.load()

 

The above query is running using the 30 partitions, but when I see the UI it is only using 1 partiton to run the query.

 

Can anyone tell if I am missing anything or do I need to anything else to tune the performance of the query.

 Thanks


Re: spark session jdbc performance

lucas.gary@gmail.com
Sorry, I meant to say: "That code looks SANE to me"

Assuming the query plan shows the read partitioned as expected, then you've likely only configured one executor.  Very easy to check in the UI.

Gary Lucas
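To illustrate the executor point: even with 30 read partitions, a single executor runs at most its configured number of cores in concurrent tasks, so the stage completes in sequential "waves". A hypothetical back-of-the-envelope sketch (all numbers are made up, not from the original post):

```python
import math

def task_waves(num_tasks, num_executors, cores_per_executor):
    # Total concurrent task slots = executors * cores per executor;
    # the stage needs ceil(tasks / slots) waves to finish.
    slots = num_executors * cores_per_executor
    return math.ceil(num_tasks / slots)

print(task_waves(30, 1, 2))   # one small executor: 15 sequential waves
print(task_waves(30, 3, 10))  # enough slots: all partitions run in parallel
```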

On 24 October 2017 at 16:09, [hidden email] <[hidden email]> wrote:



Re: spark session jdbc performance

Srinivasa Reddy Tatiredidgari
In reply to this post by Naveen Madhire
Hi, is the subquery a user-defined SQL statement or a table name in the DB?
If it is a user-defined SQL statement, make sure your partition column is in the main SELECT clause.
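In other words, when dbtable holds a user-defined SQL statement rather than a table name, Spark expects a parenthesized derived table with an alias, and the generated partition predicates ("id >= ... AND id < ...") can only bind to a column the subquery actually selects. A hypothetical sketch of building such a value (table and column names invented for illustration):

```python
def as_dbtable(select_list, from_clause, alias="t"):
    """Wrap a query as an aliased derived table usable as the dbtable option."""
    # The partition column must survive into the outer SELECT list,
    # otherwise the per-partition WHERE predicates have nothing to reference.
    assert "id" in select_list, "partition column must be selected"
    return f"(SELECT {select_list} FROM {from_clause}) {alias}"

dbtable = as_dbtable("id, name, amount", "orders WHERE amount > 0")
# -> "(SELECT id, name, amount FROM orders WHERE amount > 0) t"
```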


On Wed, Oct 25, 2017 at 3:25, Naveen Madhire



Re: spark session jdbc performance

Gourav Sengupta
In reply to this post by Naveen Madhire
Hi Naveen,

I do not think it is prudent to use the PK as the partitionColumn; that is too many partitions for any system to handle. The numPartitions option behaves very differently in the JDBC case.

Please keep me updated on how things go.


Regards,
Gourav Sengupta

On Tue, Oct 24, 2017 at 10:54 PM, Naveen Madhire <[hidden email]> wrote:

Hi,

 

I am trying to fetch data from Oracle DB using a subquery and experiencing lot of performance issues.

 

Below is the query I am using,

 

Using Spark 2.0.2

 

val df = spark_session.read.format("jdbc")
.option("driver","oracle.jdbc.OracleDriver")
.option("url", jdbc_url)
   .option("user", user)
   .option("password", pwd)
   .option("dbtable", "subquery")
   .option("partitionColumn", "id")  //primary key column uniformly distributed
   .option("lowerBound", "1")
   .option("upperBound", "500000")
.option("numPartitions", 30)
.load()

 

The above query is running using the 30 partitions, but when I see the UI it is only using 1 partiton to run the query.

 

Can anyone tell if I am missing anything or do I need to anything else to tune the performance of the query.

 Thanks



Re: spark session jdbc performance

lucas.gary@gmail.com
Gourav, I'm assuming you misread the code.  It's 30 partitions, which isn't a ridiculous value.  Maybe you misread the upperBound for the partitions?  (That would be ridiculous)

Why not use the PK as the partition column?  Obviously it depends on the downstream queries.  If you're going to be performing joins (which I assume is the case) then partitioning on the join column would be advisable, but what about the case where the join column would be heavily skewed?

Thanks!

Gary
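To illustrate the skew concern: equal-width ranges over a skewed column produce very unequal partition sizes, leaving one hot partition doing almost all the work. A small hypothetical simulation (the value distribution below is made up for demonstration):

```python
from collections import Counter

def rows_per_partition(values, lower, upper, num_partitions):
    """Count rows landing in each equal-width range partition."""
    stride = (upper - lower) / num_partitions
    counts = Counter()
    for v in values:
        bucket = min(int((v - lower) / stride), num_partitions - 1)
        counts[bucket] += 1
    return counts

# 90% of rows crammed into the bottom tenth of the id range
skewed = list(range(1, 901)) + list(range(9000, 9100))
counts = rows_per_partition(skewed, 1, 10000, 10)
# partition 0 holds 900 of the 1000 rows; most partitions sit idle
```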

On 24 October 2017 at 23:41, Gourav Sengupta <[hidden email]> wrote:




Re: spark session jdbc performance

Gourav Sengupta
Hi Lucas,

So if I am assuming things, can you please explain why the UI is showing only one partition running the query?


Regards,
Gourav Sengupta

On Wed, Oct 25, 2017 at 6:03 PM, [hidden email] <[hidden email]> wrote:





Re: spark session jdbc performance

lucas.gary@gmail.com
Are we actually seeing the UI show only one partition running the query?  The original poster hasn't replied yet.

My assumption is that there's only one executor configured/deployed.  But we only know what the OP stated, which wasn't enough to be sure of anything.

Why are you suggesting that partitioning on the PK isn't prudent? And did you mean to say that 30 partitions were far too many for any system to handle?  (I'm assuming you misread the original code.)

Gary

On 25 October 2017 at 13:21, Gourav Sengupta <[hidden email]> wrote:






Re: spark session jdbc performance

Gourav Sengupta
Hi Naveen,

Can you please copy and paste the lines from your original email again, and perhaps then Lucas can go through it completely and kindly stop thinking that others are responding by assuming things?

On the other hand, please do let me know how things are going; there was another post on this a few weeks back, and several other users find this issue equally interesting to resolve.

I might just have the solution for this.

Regards,
Gourav Sengupta

On Wed, Oct 25, 2017 at 9:26 PM, [hidden email] <[hidden email]> wrote: