Separating the storage layer from the compute layer with Spark when data warehouses offer ML capabilities
This is a general question about what an optimal design looks like.
Many cloud-native data warehouses, such as Google BigQuery (BQ) or Oracle Autonomous Data Warehouse (ADW), now offer ML capabilities, with models built and run inside the warehouse itself. This is great because it lets people with SQL knowledge, who are not necessarily data scientists, build and run models.
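As a hedged illustration of what "ML with SQL alone" looks like, BigQuery ML trains a model with a single CREATE MODEL statement. The dataset, table, and column names below are made up for the sketch, and since running it needs GCP credentials, the statement is only constructed here, not executed:

```python
# Hypothetical BigQuery ML example: a SQL user trains a logistic regression
# model directly in the warehouse. Dataset/table/label names are assumptions.
create_model_sql = """
CREATE OR REPLACE MODEL `my_dataset.churn_model`
OPTIONS (model_type = 'logistic_reg',
         input_label_cols = ['churned']) AS
SELECT * FROM `my_dataset.customer_features`
"""

# With credentials configured, this would be submitted via the official client:
# from google.cloud import bigquery
# bigquery.Client().query(create_model_sql).result()
```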
However, I see some limitations when the data warehouse itself provides both the storage and the model-building capabilities. The fundamental issue arises when you want to scale this up across multiple sources (on premises or in other data warehouses), support concurrent users, enrich the data, and still store what is needed (data or model results) in the warehouse itself.
This is where Spark comes into play. It can connect to multiple sources over JDBC, combine data from those sources within Spark itself, and perform in-memory enrichment and computation at the compute layer. Additionally, and perhaps more importantly, you can scale the compute layer up and down (including dedicated clusters) to match demand without adversely impacting the storage and model-building layer.
In summary, I cannot see how one can rely on the storage layer alone to
read data from multiple sources
combine storage and compute at scale
avoid concurrency bottlenecks in a meaningful way
I would be interested to hear other views on this.