Spark on the cloud deployments


Spark on the cloud deployments

Stephane Verlet
Hello,

We have been using Spark on an on-premises cluster for several years and
are looking at moving to a cloud deployment.

I was wondering what your current favorite cloud setup is. Plain
AWS / Azure, or something on top like Databricks?

This would support an on-demand reporting application, so usage would be
sporadic with spikes during the day. The current deployment is Spark with
Hive data.

Thanks for sharing

Stephane



---------------------------------------------------------------------
To unsubscribe e-mail: [hidden email]


Re: Spark on the cloud deployments

Lalwani, Jayesh
AWS has two offerings built on top of Spark: EMR and Glue. You can, of course, also spin up your own EC2 instances and deploy Spark on them. These three options let you trade off flexibility against infrastructure management. EC2 gives you the most flexibility, because it is basically a bunch of nodes and you can configure Spark any way you want. The con is that you need to manage your EC2 instances. EMR is a step up: you still manage your EC2 instances, but you don't need to manage Spark. With Glue, you don't need to manage infrastructure at all: Glue is serverless (for you).
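The trade-off described above can be written down as a small table in code. This is only an illustrative sketch: the `Option` class, the `management_burden` score, and the helper names are hypothetical, and the flags encode nothing beyond what the message states (who manages the instances and who manages Spark).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Option:
    """One way of running Spark on AWS, as described in the message."""
    name: str
    you_manage_instances: bool  # do you look after the EC2 nodes?
    you_manage_spark: bool      # do you install and configure Spark?


# Flags taken from the message; nothing else is assumed about the services.
OPTIONS = [
    Option("EC2 (self-managed Spark)", True, True),
    Option("EMR", True, False),
    Option("Glue", False, False),
]


def management_burden(option: Option) -> int:
    """Hypothetical score: number of layers (0 to 2) you manage yourself."""
    return int(option.you_manage_instances) + int(option.you_manage_spark)


def least_managed(options):
    """Return the option with the smallest management burden."""
    return min(options, key=management_burden)


if __name__ == "__main__":
    # List options from most flexible (most to manage) to least.
    for opt in sorted(OPTIONS, key=management_burden, reverse=True):
        print(f"{opt.name}: you manage {management_burden(opt)} of 2 layers")
```

Flexibility runs in the opposite direction of the score: the more layers you manage yourself, the more freedom you have to configure Spark.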

Besides those, you have other choices. For example, if your usage is spiky, you could implement this in Kinesis. Or you could have your reporting application query Athena.
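As a rough illustration of the Athena route, a reporting application could submit a query and wait for it to finish along these lines. This is a minimal sketch, not a recommended implementation: the function name `run_report_query`, the database name, and the S3 output location are placeholders, and `athena` is assumed to be a boto3 Athena client (`boto3.client("athena")`) passed in by the caller.

```python
import time

# Query states after which Athena will not change the status again.
TERMINAL_STATES = {"SUCCEEDED", "FAILED", "CANCELLED"}


def run_report_query(athena, sql, database, output_location, poll_seconds=1.0):
    """Submit `sql` to Athena and block until it reaches a terminal state.

    Returns a (query_execution_id, final_state) tuple.
    """
    started = athena.start_query_execution(
        QueryString=sql,
        QueryExecutionContext={"Database": database},
        ResultConfiguration={"OutputLocation": output_location},
    )
    query_id = started["QueryExecutionId"]

    while True:
        status = athena.get_query_execution(QueryExecutionId=query_id)
        state = status["QueryExecution"]["Status"]["State"]
        if state in TERMINAL_STATES:
            return query_id, state
        time.sleep(poll_seconds)


# Example call (placeholder names, requires boto3 and AWS credentials):
#   import boto3
#   athena = boto3.client("athena")
#   run_report_query(athena, "SELECT 1", "reports", "s3://example-bucket/athena-results/")
```

Passing the client in as a parameter keeps the function easy to exercise with a stand-in object; a real application would also want a timeout and error handling for the FAILED and CANCELLED states.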

On 2/24/21, 10:25 AM, "Stephane Verlet" <[hidden email]> wrote:

    Hello,

    We have been using Spark on an on-premises cluster for several years and
    are looking at moving to a cloud deployment.

    I was wondering what your current favorite cloud setup is. Plain
    AWS / Azure, or something on top like Databricks?

    This would support an on-demand reporting application, so usage would be
    sporadic with spikes during the day. The current deployment is Spark with
    Hive data.

    Thanks for sharing

    Stephane



Re: Spark on the cloud deployments

Mich Talebzadeh

Hi Stephane,


If you are currently on-premises, then you should also consider Google Cloud Platform (GCP). As a practitioner, I see a number of customers migrating from other providers to GCP.


Databricks on GCP will be available (if I am correct) in April this year. GCP already offers Google Compute Engine as IaaS, which supports Spark with YARN. In addition, there are cost-saving 'preemptible instances' that can run Spark on affordable tin boxes, so to speak. GCP also offers BigQuery as a data warehouse (DW) with ML models built in, so there is a fair bit of 'either/or' choice here. There is also the question of the migration path from GCP artifacts to Databricks. Will Databricks provide all these as a service? For example, BigQuery is a fully managed serverless warehouse. Will Lakehouse provide the same in GCP? Besides ML, BigQuery provides functions and procedures similar to Oracle's PL/SQL, so some are migrating from classic on-premises Oracle to BigQuery.


However, neither BigQuery nor Compute Engine is cheap. Personally, I believe the cloud landscape is getting congested, and unless there is a clear motivation to move from one provider to another, many will choose to stay where they are. If you are already using Spark on a private cloud, then the journey to GCP should be pretty smooth. As ever, your mileage will vary. You may also decide to go for a best-of-breed multi-cloud mixture.


HTH,


Mich



 


