Question on using pseudo columns in spark jdbc options

Previous Topic Next Topic
 
classic Classic list List threaded Threaded
3 messages Options
Reply | Threaded
Open this post in threaded view
|

Question on using pseudo columns in spark jdbc options

☼ R Nair (रविशंकर नायर)
Hi all,

I am using a query to fetch data from MYSQL as follows:

var df = spark.read.
format("jdbc").
option("url", "jdbc:mysql://10.0.0.192:3306/retail_db").
option("driver" ,"com.mysql.jdbc.Driver").
option("user", "retail_dba").
option("password", "cloudera").
option("dbtable", "orders").
option("partitionColumn", "order_id").
option("lowerBound", "1").
option("upperBound", "68883").
option("numPartitions", "4").
load()

Question is, can I use a pseudo column (like ROWNUM in Oracle or RRN(employeeno) in DB2) in option where I specify the "partitionColumn" ? If not, can we specify a partition column which is not a primary key ?

Best,
Ravion



Reply | Threaded
Open this post in threaded view
|

Re: Question on using pseudo columns in spark jdbc options

Tomasz Dudek
Hey Ravion,

yes, you can obviously specify other column than a primary key. Be aware though, that if the key range is not spread evenly (for example in your code, if there's a "gap" in primary keys and no row has id between 0 and 17220) some of the executors may not assist in loading data (because "SELECT * FROM orders WHERE order_id IS BETWEEN 0 AND 17220 will return an empty result). I think you might want to repartition afterwards to ensure that df is evenly distributed(<--- could somebody confirm my last sentence? I don't want to mislead and I am not sure).

The first question - could you just check and provide us the answer? :)

Cheers,
Tomasz

2017-12-03 7:39 GMT+01:00 ☼ R Nair (रविशंकर नायर) <[hidden email]>:
Hi all,

I am using a query to fetch data from MYSQL as follows:

var df = spark.read.
format("jdbc").
option("url", "jdbc:mysql://10.0.0.192:3306/retail_db").
option("driver" ,"com.mysql.jdbc.Driver").
option("user", "retail_dba").
option("password", "cloudera").
option("dbtable", "orders").
option("partitionColumn", "order_id").
option("lowerBound", "1").
option("upperBound", "68883").
option("numPartitions", "4").
load()

Question is, can I use a pseudo column (like ROWNUM in Oracle or RRN(employeeno) in DB2) in option where I specify the "partitionColumn" ? If not, can we specify a partition column which is not a primary key ?

Best,
Ravion




Reply | Threaded
Open this post in threaded view
|

Re: Question on using pseudo columns in spark jdbc options

☼ R Nair (रविशंकर नायर)
It works perfectly. You can use pseudo columns like ROWNUM in Oracle and RRN in DB2. To avoid skewing you can apply the great coalesce function...Spark is sparkling..

Best,


On Thu, Dec 7, 2017 at 2:20 PM, Tomasz Dudek <[hidden email]> wrote:
Hey Ravion,

yes, you can obviously specify other column than a primary key. Be aware though, that if the key range is not spread evenly (for example in your code, if there's a "gap" in primary keys and no row has id between 0 and 17220) some of the executors may not assist in loading data (because "SELECT * FROM orders WHERE order_id IS BETWEEN 0 AND 17220 will return an empty result). I think you might want to repartition afterwards to ensure that df is evenly distributed(<--- could somebody confirm my last sentence? I don't want to mislead and I am not sure).

The first question - could you just check and provide us the answer? :)

Cheers,
Tomasz

2017-12-03 7:39 GMT+01:00 ☼ R Nair (रविशंकर नायर) <[hidden email]>:
Hi all,

I am using a query to fetch data from MYSQL as follows:

var df = spark.read.
format("jdbc").
option("url", "jdbc:mysql://10.0.0.192:3306/retail_db").
option("driver" ,"com.mysql.jdbc.Driver").
option("user", "retail_dba").
option("password", "cloudera").
option("dbtable", "orders").
option("partitionColumn", "order_id").
option("lowerBound", "1").
option("upperBound", "68883").
option("numPartitions", "4").
load()

Question is, can I use a pseudo column (like ROWNUM in Oracle or RRN(employeeno) in DB2) in option where I specify the "partitionColumn" ? If not, can we specify a partition column which is not a primary key ?

Best,
Ravion







--