Accessing Teradata DW data from Spark

Accessing Teradata DW data from Spark

Mich Talebzadeh
Using JDBC drivers, much like accessing Oracle data, one can utilise the power of Spark on Teradata.

I have seen such connections described in a few articles, which suggests the process is pretty mature.

My question is whether anyone has done this work, and how performance in Spark compares with running the same code on Teradata itself. For example, in Oracle one can force parallel reads by using numPartitions:

val s = HiveContext.read.format("jdbc").options(
       Map("url" -> _ORACLEserver,
       "dbtable" -> "(SELECT ID FROM scratchpad.dummy4)",
       "partitionColumn" -> "ID",           // numeric column the reads are split on
       "lowerBound" -> minID.toString,      // the options map takes string values
       "upperBound" -> maxID.toString,
       "numPartitions" -> "5",              // five concurrent JDBC connections
       "user" -> _username,
       "password" -> _password)).load()

As both Oracle and Teradata are data warehouses, this approach may work for Teradata too. The intention is to read from Teradata initially as a tactical measure and to use Hadoop/Hive/Spark as the strategic platform.
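
As a rough, untested sketch, the equivalent read against Teradata might look like the following, assuming the Teradata JDBC driver jar (terajdbc4.jar) is on the Spark classpath; the host here is a placeholder, and note the alias on the derived table, which Teradata requires:

val t = HiveContext.read.format("jdbc").options(
       Map("url" -> "jdbc:teradata://<host>/DATABASE=scratchpad",   // placeholder host
       "driver" -> "com.teradata.jdbc.TeraDriver",                  // Teradata JDBC driver class
       "dbtable" -> "(SELECT ID FROM scratchpad.dummy4) t",         // derived table alias
       "partitionColumn" -> "ID",
       "lowerBound" -> minID.toString,
       "upperBound" -> maxID.toString,
       "numPartitions" -> "5",
       "user" -> _username,
       "password" -> _password)).load()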

Obviously the underlying tables will differ when reading from Hive compared to Teradata. However, the SQL to fetch, slice and dice the data will be much the same.

Let me know your thoughts

Thanks



LinkedIn  https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw

Disclaimer: Use it at your own risk. Any and all responsibility for any loss, damage or destruction of data or any other property which may arise from relying on this email's technical content is explicitly disclaimed. The author will in no case be liable for any monetary damages arising from such loss, damage or destruction.

Re: Accessing Teradata DW data from Spark

Gourav Sengupta
Hi,
Partitioning works with Teradata, but your user may have core and memory restrictions, so please do adjust the number of queries hitting Teradata in parallel according to the partitions used in your query (see the sketch below).

I am able to extract data from on-premise Teradata to S3 in 3 hours this way, whereas a Teradata export followed by an upload to S3 and conversion to Parquet takes 10 hours.
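
For illustration, a minimal sketch of the landing step; it assumes a DataFrame df obtained via a JDBC read like the one earlier in the thread, an s3a-configured cluster, and placeholder bucket and path names:

// df is assumed to come from a Teradata JDBC read as shown above.
// Read-side parallelism (and hence the number of concurrent Teradata
// sessions) is fixed by numPartitions on that read; coalesce here only
// controls how many Parquet files land on S3.
df.coalesce(8)
  .write
  .mode("overwrite")
  .parquet("s3a://<bucket>/teradata/dummy4/")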



Regards 
