Is Spark suited for this use case?

Is Spark suited for this use case?

Saravanan Thirumalai
We are an investment firm. We have an MDM platform in Oracle hosted at a vendor location and use Oracle GoldenGate to replicate data to our data center for reporting needs.
Our data is not big data (total size 6 TB, including 2 TB of archive data). Moreover, it doesn't get updated often: one nightly load (around 50 MB) plus some correction transactions during the day (<10 MB). We don't have external users, so the data doesn't grow in real time the way e-commerce data does.

When we replicate data from source to target, we transfer it through files. So if there are DML operations (corrections) on a source table during the day, the corresponding file would have roughly 100 lines of table data that need to be loaded into the target database. Because of the low data volume we built this in Informatica, and it completes in 2-5 minutes. Can Spark be used in this case, or would it be technological overkill?
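For concreteness, here is a minimal sketch (not from the original post) of what such a correction load could look like in Spark, assuming the daily delta file is a CSV with a header row and the target database is reachable over JDBC. The file path, JDBC URL, credentials, and table names below are placeholders:

    import org.apache.spark.sql.SparkSession

    object ApplyCorrections {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder()
          .appName("apply-ogg-corrections")
          .getOrCreate()

        // Read the small (~100-line) correction file produced by the replication process.
        val corrections = spark.read
          .option("header", "true")
          .option("inferSchema", "true")
          .csv("/data/ogg/corrections/ACCOUNTS_20171016.csv")

        // Append the rows to a staging table; the merge into the reporting table
        // would then be done inside the target database.
        corrections.write
          .format("jdbc")
          .option("url", "jdbc:oracle:thin:@//target-db:1521/REPORTING")
          .option("dbtable", "STAGING.ACCOUNT_CORRECTIONS")
          .option("user", "etl_user")
          .option("password", sys.env.getOrElse("DB_PASSWORD", ""))
          .mode("append")
          .save()

        spark.stop()
      }
    }

As the replies below note, for a file of roughly 100 rows the Spark cluster itself adds far more overhead than the load it performs.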



RE: Is Spark suited for this use case?

van den Heever, Christian CC

Hi,

 

We basically have the same scenario, but worldwide and with bigger datasets: we use OGG → local files → Sqoop into Hadoop.

By all means you can have Spark read the Oracle tables and then apply changes to the data that can't be done in a Sqoop query, e.g. fraud detection on transaction records.
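As a hedged illustration of that point (not from the original message), a Spark job could read an Oracle table over JDBC and apply a rule of that kind; the connection details, table, and column names below are assumptions:

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions._

    val spark = SparkSession.builder().appName("ogg-postprocess").getOrCreate()

    // Read the replicated table straight from Oracle via Spark's JDBC source.
    val txns = spark.read
      .format("jdbc")
      .option("url", "jdbc:oracle:thin:@//source-db:1521/MDM")
      .option("dbtable", "MDM.TRANSACTIONS")
      .option("user", "reporting_user")
      .option("password", sys.env.getOrElse("DB_PASSWORD", ""))
      .load()

    // Illustrative rule: flag transactions far above the per-account average,
    // the kind of logic that does not fit into a plain Sqoop extraction query.
    val stats = txns.groupBy("account_id").agg(avg("amount").as("avg_amount"))
    val flagged = txns.join(stats, "account_id")
      .withColumn("suspicious", col("amount") > col("avg_amount") * lit(10))

    flagged.filter(col("suspicious")).show()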

 

But sometimes the simplest way is the best. Unless you need extra processing or more capability, I would advise against adding another hop.

I would rather move away from files, since OGG can do both file-based and direct table loading, and then use Sqoop for the rest.

 

Simpler is better.

 

Hope this helps.

C.

 



Re: Is Spark suited for this use case?

Jörn Franke
Hi,

What is the motivation behind your question? Save costs?

You seem to be happy with how the functional and non-functional requirements are met, so the only remaining drivers could be cost or a need for future innovation.

Best regards



Re: Is Spark suited for this use case?

Jean Georges Perrin
In reply to this post by Saravanan Thirumalai
I have seen a similar scenario where we loaded data from an RDBMS into a NoSQL database… Spark made sense for velocity and parallel processing (and the cost of licenses :) ).
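A rough sketch of that pattern, assuming (purely for illustration, since the post does not name the target store) a Cassandra sink via the spark-cassandra-connector package; all hosts, keyspaces, and table names are placeholders:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      .appName("rdbms-to-nosql")
      .config("spark.cassandra.connection.host", "cassandra-host")
      .getOrCreate()

    // Pull the relational table over JDBC.
    val customers = spark.read
      .format("jdbc")
      .option("url", "jdbc:oracle:thin:@//source-db:1521/MDM")
      .option("dbtable", "MDM.CUSTOMERS")
      .option("user", "reporting_user")
      .option("password", sys.env.getOrElse("DB_PASSWORD", ""))
      .load()

    // Writes run in parallel across partitions, which is where Spark's velocity helps.
    customers.write
      .format("org.apache.spark.sql.cassandra")
      .option("keyspace", "reporting")
      .option("table", "customers")
      .mode("append")
      .save()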
 



Re: Is Spark suited for this use case?

Gourav Sengupta
In reply to this post by van den Heever, Christian CC
Hi Saravanan,

Spark may be free, but making it run with the same level of performance, consistency, and reliability will show you that Spark, Hadoop, or anything else is essentially not free. With Informatica you pay for the licensing and have almost no headaches as far as stability, upgrades, and reliability are concerned.

If you want to deliver the same with Spark, the costs will start escalating, since you will have to go with Spark vendors.

As with everything else, my best analogy is: don't use a fork to drink soup. Spark works wonderfully at huge data scale, but it cannot read Oracle binary log files and does not provide change data capture capability. For your use case, with such low volumes and with Informatica and GoldenGate already in place, I think you are already using the optimal solutions.

Of course I am presuming that you DO NOT replicate your entire 6 TB MDM platform every day and instead just use CDC to transfer data to your data center for reporting purposes.

In case you are interested in super-fast reporting using AWS Redshift, please do let me know; I have delivered several end-to-end hybrid data warehouse solutions and will be happy to help you with the same.


Regards,
Gourav Sengupta
