Spark based Data Warehouse

ashish rawat
Hello Everyone,

I was trying to understand whether anyone here has tried a data warehouse solution using S3 and Spark SQL. Out of the multiple possible options (Redshift, Presto, Hive, etc.), we were planning to go with Spark SQL for our aggregation and processing requirements.

If anyone has tried it out, I would like to understand the following:
  1. Are Spark SQL and UDFs able to handle all the workloads?
  2. What user interface did you provide for data scientists, data engineers and analysts?
  3. What are the challenges in running concurrent queries, by many users, over Spark SQL? Considering Spark still does not provide spill to disk in many scenarios, are there frequent query failures when executing concurrent queries?
  4. Are there any open source implementations which provide something similar?

Regards,
Ashish

Re: Spark based Data Warehouse

Deepak Sharma
I am looking for a similar solution, more aligned to a data science group.
The concern I have is about supporting complex aggregations at runtime.

Thanks
Deepak

On Nov 12, 2017 12:51, "ashish rawat" <[hidden email]> wrote:

Re: Spark based Data Warehouse

Jörn Franke
What do you mean by all possible workloads?
You cannot prepare any system to do all possible processing.

We do not know the requirements of your data scientists now or in the future, so it is difficult to say. How do they work currently without the new solution? Do they all work on the same data? I bet you will receive a lot of private emails trying to sell you a solution that solves everything; with the information you provided, it is impossible to say.

As with every system: have incremental releases, but have them in short time frames. Do not engineer a big system that you will deliver in 2 years. In the cloud you have the perfect opportunity to scale both feature-wise and infrastructure-wise.

The challenge with concurrent queries is configuring the scheduler correctly (e.g. the fair scheduler) so that no single query takes all the resources and long-running queries do not starve.
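
To make the scheduler point concrete, here is a minimal sketch, assuming the scheduler meant is Spark's in-application fair scheduler (YARN's Fair Scheduler is configured separately). It only generates an allocation file in the pool format Spark reads. The pool names, weights and file path are made up for illustration.

```python
import xml.etree.ElementTree as ET


def build_allocation_xml(pools):
    """Render a fair scheduler allocation file for the given pools.

    pools maps a pool name to a (weight, min_share) tuple.
    """
    root = ET.Element("allocations")
    for name, (weight, min_share) in pools.items():
        pool = ET.SubElement(root, "pool", {"name": name})
        ET.SubElement(pool, "schedulingMode").text = "FAIR"
        ET.SubElement(pool, "weight").text = str(weight)
        ET.SubElement(pool, "minShare").text = str(min_share)
    return ET.tostring(root, encoding="unicode")


# Hypothetical pools: ad-hoc analyst queries and heavier ETL jobs.
POOLS = {"adhoc": (1, 2), "etl": (3, 4)}
ALLOCATION_XML = build_allocation_xml(POOLS)

# Settings that point the application at the file (the path is a placeholder).
SPARK_CONF = {
    "spark.scheduler.mode": "FAIR",
    "spark.scheduler.allocation.file": "/path/to/fairscheduler.xml",
}

print(ALLOCATION_XML)
```

Each thread that submits queries would then pick its pool with sc.setLocalProperty("spark.scheduler.pool", "adhoc"). Note that these pools only arbitrate between jobs inside one Spark application, not between separate applications.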

User interfaces: what could help are notebooks (Jupyter etc) but you may need to train your data scientists. Some may know or prefer other tools.

On 12. Nov 2017, at 08:32, Deepak Sharma <[hidden email]> wrote:


Re: Spark based Data Warehouse

Phillip Henry
Agree with Jorn. The answer is: it depends.

In the past, I've worked with data scientists who are happy to use the Spark CLI. Again, the answer is "it depends" (in this case, on the skills of your customers).

Regarding sharing resources, different teams were limited to their own queue so they could not hog all the resources. However, people within a team had to do some horse trading if they had a particularly intensive job to run. I did feel that this was an area that could be improved. It may have been by now; I've just not looked into it for a while.

BTW I'm not sure what you mean by "Spark still does not provide spill to disk" as the FAQ says "Spark's operators spill data to disk if it does not fit in memory" (http://spark.apache.org/faq.html). So, your data will not normally cause OutOfMemoryErrors (certain terms and conditions may apply).

My 2 cents.

Phillip



On Sun, Nov 12, 2017 at 9:14 AM, Jörn Franke <[hidden email]> wrote:

Re: Spark based Data Warehouse

ashish rawat
Thanks Jorn and Phillip. My question was specifically for anyone who has tried creating a system using Spark SQL as a data warehouse. I was trying to check whether someone has tried it and can help with the kinds of workloads which worked and the ones which had problems.

Regarding spill to disk, I might be wrong, but not all functionality of Spark spills to disk. So it still doesn't provide DB-like reliability in execution. In the case of DBs, queries get slow but they don't fail or go out of memory, specifically in concurrent user scenarios.

Regards,
Ashish 

On Nov 12, 2017 3:02 PM, "Phillip Henry" <[hidden email]> wrote:

Re: Spark based Data Warehouse

Phillip Henry
Hi, Ashish.

You are correct in saying that not *all* functionality of Spark is spill-to-disk but I am not sure how this pertains to a "concurrent user scenario". Each executor will run in its own JVM and is therefore isolated from others. That is, if the JVM of one user dies, this should not affect another user who is running their own jobs in their own JVMs. The amount of resources used by a user can be controlled by the resource manager.

AFAIK, you configure something like YARN to limit the number of cores and the amount of memory in the cluster a certain user or group is allowed to use for their job. This is obviously quite a coarse-grained approach as (to my knowledge) IO is not throttled. I believe people generally use something like Apache Ambari to keep an eye on network and disk usage to mitigate problems in a shared cluster.
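
As a rough illustration of the per-job side of this, the sketch below assembles a spark-submit command that targets a YARN queue and caps executor count, cores and memory. The queue name, script name and sizes are hypothetical. The queue's own limits would still be defined in the YARN scheduler configuration, which is not shown here.

```python
import shlex


def spark_submit_command(app, queue, num_executors, executor_cores, executor_memory):
    """Build a spark-submit argument list for a job bound to one YARN queue."""
    return [
        "spark-submit",
        "--master", "yarn",
        "--queue", queue,                        # YARN queue the job is charged to
        "--num-executors", str(num_executors),   # executors requested for this job
        "--executor-cores", str(executor_cores),
        "--executor-memory", executor_memory,
        app,
    ]


# Hypothetical job for an "analysts" queue.
cmd = spark_submit_command("warehouse_job.py", "analysts", 4, 2, "4g")
print(shlex.join(cmd))
```

Building the argument list in one place makes it easier to enforce team-wide defaults before handing the command to a scheduler or wrapper script.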

If the user has badly designed their query, it may very well fail with OOMEs, but this can happen irrespective of whether one user or many are using the cluster at a given moment in time.

Does this help?

Regards,

Phillip


On Sun, Nov 12, 2017 at 5:50 PM, ashish rawat <[hidden email]> wrote:

Re: Spark based Data Warehouse

Gourav Sengupta
Dear Ashish,
what you are asking for involves at least a few weeks of dedicated understanding of your use case, and then it takes at least 3 to 4 months to even propose a solution. You can even build a fantastic data warehouse just using C++. The matter depends on lots of conditions. I just think that your approach and question need a lot of modification.

Regards,
Gourav

On Sun, Nov 12, 2017 at 6:19 PM, Phillip Henry <[hidden email]> wrote:

Re: Spark based Data Warehouse

Vadim Semenov
It's actually quite simple to answer.

> 1. Are Spark SQL and UDFs able to handle all the workloads?
Yes

> 2. What user interface did you provide for data scientists, data engineers and analysts?
Home-grown platform, EMR, Zeppelin

> What are the challenges in running concurrent queries, by many users, over Spark SQL? Considering Spark still does not provide spill to disk in many scenarios, are there frequent query failures when executing concurrent queries?
You can run separate Spark Contexts, so jobs will be isolated

> Are there any open source implementations which provide something similar?
Yes, many.


On Sun, Nov 12, 2017 at 1:47 PM, Gourav Sengupta <[hidden email]> wrote:

Re: Spark based Data Warehouse

Patrick Alwell

Alcon,

You can most certainly do this. I’ve done benchmarking with Spark SQL and the TPC-DS queries using S3 as the filesystem.

Zeppelin and Livy Server work well for the dashboarding and concurrent query issues: https://hortonworks.com/blog/livy-a-rest-interface-for-apache-spark/

Livy Server will allow you to create multiple Spark contexts via REST: https://livy.incubator.apache.org/
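
A minimal sketch of what those REST calls could look like, using only the Python standard library. It builds the requests without sending them. The host and port (8998 is Livy's usual default), the session id and the statement text are assumptions for illustration, and whether a `spark` variable exists inside the session depends on the Livy and Spark versions in use.

```python
import json
import urllib.request

LIVY_URL = "http://localhost:8998"  # assumed Livy endpoint


def livy_post(path, payload):
    """Build (but do not send) a JSON POST request against the Livy server."""
    return urllib.request.Request(
        url=f"{LIVY_URL}{path}",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# One session per user gives each user a separate Spark context.
create_session = livy_post("/sessions", {"kind": "pyspark"})


def run_statement(session_id, code):
    """Build the request that submits a code snippet to an existing session."""
    return livy_post(f"/sessions/{session_id}/statements", {"code": code})


# Session id 0 is a placeholder; the real id comes from the create response.
stmt = run_statement(0, "spark.sql('SELECT 1').show()")

# To actually send a request: urllib.request.urlopen(create_session)
```

The session id in the second call would normally be read from the JSON returned by the first call, and the statement result is fetched by polling the statement's URL.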

 

If you are looking for broad SQL functionality I’d recommend instantiating a Hive context. And Spark is able to spill to disk: https://spark.apache.org/faq.html

There are multiple companies running Spark within their data warehouse solutions: https://ibmdatawarehousing.wordpress.com/2016/10/12/steinbach_dashdb_local_spark/

Edmunds used Spark to allow business analysts to point Spark to files in S3 and infer schema: https://www.youtube.com/watch?v=gsR1ljgZLq0

I recommend running some benchmarks and testing query scenarios for your end users, but it sounds like you’ll be using it for exploratory analysis. Spark is great for this.

-Pat

From: Vadim Semenov <[hidden email]>
Date: Sunday, November 12, 2017 at 1:06 PM
To: Gourav Sengupta <[hidden email]>
Cc: Phillip Henry <[hidden email]>, ashish rawat <[hidden email]>, Jörn Franke <[hidden email]>, Deepak Sharma <[hidden email]>, spark users <[hidden email]>
Subject: Re: Spark based Data Warehouse

 

It's actually quite simple to answer

 

> 1. Is Spark SQL and UDF, able to handle all the workloads?

Yes

 

> 2. What user interface did you provide for data scientist, data engineers and analysts

Home-grown platform, EMR, Zeppelin

 

> What are the challenges in running concurrent queries, by many users, over Spark SQL? Considering Spark still does not provide spill to disk, in many scenarios, are there frequent query failures when executing concurrent queries

You can run separate Spark Contexts, so jobs will be isolated

 

> Are there any open source implementations, which provide something similar?

Yes, many.

 

 

On Sun, Nov 12, 2017 at 1:47 PM, Gourav Sengupta <[hidden email]> wrote:

Dear Ashish,

what you are asking for involves at least a few weeks of dedicated understanding of your used case and then it takes at least 3 to 4 months to even propose a solution. You can even build a fantastic data warehouse just using C++. The matter depends on lots of conditions. I just think that your approach and question needs a lot of modification.

 

Regards,

Gourav

 

On Sun, Nov 12, 2017 at 6:19 PM, Phillip Henry <[hidden email]> wrote:

Hi, Ashish.

You are correct in saying that not *all* functionality of Spark is spill-to-disk but I am not sure how this pertains to a "concurrent user scenario". Each executor will run in its own JVM and is therefore isolated from others. That is, if the JVM of one user dies, this should not effect another user who is running their own jobs in their own JVMs. The amount of resources used by a user can be controlled by the resource manager.

AFAIK, you configure something like YARN to limit the number of cores and the amount of memory in the cluster a certain user or group is allowed to use for their job. This is obviously quite a coarse-grained approach as (to my knowledge) IO is not throttled. I believe people generally use something like Apache Ambari to keep an eye on network and disk usage to mitigate problems in a shared cluster.

If the user has badly designed their query, it may very well fail with OOMEs but this can happen irrespective of whether one user or many is using the cluster at a given moment in time.

 

Does this help?

Regards,

Phillip

 

On Sun, Nov 12, 2017 at 5:50 PM, ashish rawat <[hidden email]> wrote:

Thanks Jorn and Phillip. My question was specifically for anyone who has tried creating a system using Spark SQL as a data warehouse. I was trying to check whether someone has tried it and can help with the kinds of workloads which worked and the ones which had problems.

Regarding spill to disk, I might be wrong, but not all functionality of Spark spills to disk. So it still doesn't provide DB-like reliability in execution. In the case of DBs, queries get slow but they don't fail or go out of memory, specifically in concurrent user scenarios.

 

Regards,

Ashish 

 

On Nov 12, 2017 3:02 PM, "Phillip Henry" <[hidden email]> wrote:

Agree with Jorn. The answer is: it depends.

 

In the past, I've worked with data scientists who are happy to use the Spark CLI. Again, the answer is "it depends" (in this case, on the skills of your customers).

Regarding sharing resources, different teams were limited to their own queue so they could not hog all the resources. However, people within a team had to do some horse trading if they had a particularly intensive job to run. I did feel that this was an area that could be improved. It may have been by now; I've just not looked into it for a while.

BTW I'm not sure what you mean by "Spark still does not provide spill to disk" as the FAQ says "Spark's operators spill data to disk if it does not fit in memory" (http://spark.apache.org/faq.html). So, your data will not normally cause OutOfMemoryErrors (certain terms and conditions may apply).

My 2 cents.

Phillip

 

 

On Sun, Nov 12, 2017 at 9:14 AM, Jörn Franke <[hidden email]> wrote:

What do you mean by all possible workloads?

You cannot prepare any system to do all possible processing.

 

We do not know the requirements of your data scientists now or in the future, so it is difficult to say. How do they work currently without the new solution? Do they all work on the same data? I bet you will receive a lot of private emails from people trying to sell a solution that solves everything. With the information you provided, this is impossible to say.

Then, as with every system: have incremental releases, but have them in short time frames. Do not engineer a big system that you will deliver in 2 years. In the cloud you have the perfect possibility to scale feature-wise but also infrastructure-wise.

The challenge with concurrent queries is the right definition of the scheduler (e.g. the fair scheduler), so that no single query takes all the resources and long-running queries do not starve.
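
As a rough illustration of the scheduler point, here is a minimal sketch that generates a pool definition file, assuming "fairscheduler" refers to Spark's in-application fair scheduler pools (YARN has its own fair scheduler, which is configured separately). The pool names, weights, minimum shares and the file path are made-up values for illustration only; the XML element names and the two `spark.scheduler.*` keys are the ones Spark documents for job scheduling.

```python
import xml.etree.ElementTree as ET


def build_fair_scheduler_xml(pools):
    """Build the XML text of a Spark fair scheduler allocation file.

    `pools` maps a pool name to a (weight, min_share) tuple.
    """
    root = ET.Element("allocations")
    for name, (weight, min_share) in pools.items():
        pool = ET.SubElement(root, "pool", {"name": name})
        ET.SubElement(pool, "schedulingMode").text = "FAIR"
        ET.SubElement(pool, "weight").text = str(weight)
        ET.SubElement(pool, "minShare").text = str(min_share)
    return ET.tostring(root, encoding="unicode")


# Hypothetical pools: interactive analysts get twice the weight of batch jobs.
allocation_xml = build_fair_scheduler_xml({"analysts": (2, 4), "batch": (1, 2)})

# Settings an application would pass to Spark to enable the pools.
spark_conf = {
    "spark.scheduler.mode": "FAIR",
    # Assumed location; write allocation_xml to wherever your deployment expects it.
    "spark.scheduler.allocation.file": "/path/to/fairscheduler.xml",
}

print(allocation_xml)
```

Within a running application, each thread that submits queries would then pick its pool with `sc.setLocalProperty("spark.scheduler.pool", "analysts")`. Note that these pools only arbitrate between jobs inside one Spark application; sharing between separate applications is the cluster manager's job.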

 

User interfaces: what could help are notebooks (Jupyter etc) but you may need to train your data scientists. Some may know or prefer other tools.


On 12. Nov 2017, at 08:32, Deepak Sharma <[hidden email]> wrote:

I am looking for a similar solution, more aligned to the data scientist group.

The concern I have is about supporting complex aggregations at runtime.

 

Thanks

Deepak

 

On Nov 12, 2017 12:51, "ashish rawat" <[hidden email]> wrote:

Hello Everyone,

 

I was trying to understand if anyone here has tried a data warehouse solution using S3 and Spark SQL. Out of multiple possible options (redshift, presto, hive etc), we were planning to go with Spark SQL, for our aggregates and processing requirements.

 

If anyone has tried it out, I would like to understand the following:

  1. Are Spark SQL and UDFs able to handle all the workloads?
  2. What user interface did you provide for data scientists, data engineers and analysts?
  3. What are the challenges in running concurrent queries, by many users, over Spark SQL? Considering that Spark still does not provide spill to disk in many scenarios, are there frequent query failures when executing concurrent queries?
  4. Are there any open source implementations which provide something similar?

 

Regards,

Ashish

 

 

 

 

 


Re: Spark based Data Warehouse

ashish rawat
Thanks, everyone. I am still not clear on the right way to support multiple users running concurrent queries with Spark. Is it through multiple Spark contexts, or through Livy (which creates only a single Spark context)?

Also, what kind of isolation is possible with Spark SQL? If one user fires a big query, then would that choke all other queries in the cluster?

Regards,
Ashish

On Mon, Nov 13, 2017 at 3:10 AM, Patrick Alwell <[hidden email]> wrote:

Alcon,

 

You can most certainly do this. I’ve done benchmarking with Spark SQL and the TPCDS queries using S3 as the filesystem.

 

Zeppelin and Livy Server work well for the dashboarding and concurrent query issues: https://hortonworks.com/blog/livy-a-rest-interface-for-apache-spark/

 

Livy Server will allow you to create multiple spark contexts via REST: https://livy.incubator.apache.org/

 

If you are looking for broad SQL functionality, I'd recommend instantiating a Hive context. And Spark is able to spill to disk: https://spark.apache.org/faq.html

 

There are multiple companies running spark within their data warehouse solutions: https://ibmdatawarehousing.wordpress.com/2016/10/12/steinbach_dashdb_local_spark/

 

Edmunds used Spark to allow business analysts to point Spark to files in S3 and infer schema: https://www.youtube.com/watch?v=gsR1ljgZLq0

 

I recommend running some benchmarks and testing query scenarios for your end users, but it sounds like you'll be using it for exploratory analysis. Spark is great for this.

 

-Pat

 

 

From: Vadim Semenov <[hidden email]>
Date: Sunday, November 12, 2017 at 1:06 PM
To: Gourav Sengupta <[hidden email]>
Cc: Phillip Henry <[hidden email]>, ashish rawat <[hidden email]>, Jörn Franke <[hidden email]>, Deepak Sharma <[hidden email]>, spark users <[hidden email]>
Subject: Re: Spark based Data Warehouse

 

It's actually quite simple to answer

 

> 1. Is Spark SQL and UDF, able to handle all the workloads?

Yes

 

> 2. What user interface did you provide for data scientist, data engineers and analysts

Home-grown platform, EMR, Zeppelin

 

> What are the challenges in running concurrent queries, by many users, over Spark SQL? Considering Spark still does not provide spill to disk, in many scenarios, are there frequent query failures when executing concurrent queries

You can run separate Spark Contexts, so jobs will be isolated

 

> Are there any open source implementations, which provide something similar?

Yes, many.

 

 


Re: Spark based Data Warehouse

Deepak Sharma
Even if you have only 1 user, it's still possible to execute non-blocking, long-running queries.
The best way is to have different users, with pre-assigned resources, run their queries.

HTH

Thanks
Deepak 

On Nov 13, 2017 23:56, "ashish rawat" <[hidden email]> wrote:
Thanks, everyone. I am still not clear on the right way to support multiple users running concurrent queries with Spark. Is it through multiple Spark contexts, or through Livy (which creates only a single Spark context)?

Also, what kind of isolation is possible with Spark SQL? If one user fires a big query, then would that choke all other queries in the cluster?

Regards,
Ashish


Re: Spark based Data Warehouse

Sky Yin
In reply to this post by ashish rawat
We are running Spark on AWS EMR as a data warehouse. All data are in S3 and metadata in the Hive metastore.

We have internal tools to create Jupyter notebooks on the dev cluster. I guess you could use Zeppelin instead, or Livy?

We run Genie as a job server for the prod cluster, so users have to submit their queries through Genie. For better resource utilization, we rely on YARN dynamic allocation to balance the load of multiple jobs/queries in Spark.
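
For readers who have not used dynamic allocation, the following is a minimal sketch of the settings involved, not the configuration of the cluster described above. The property names are standard Spark configuration keys; the executor limits, the idle timeout and the script name `query_job.py` are assumptions chosen for illustration.

```python
def spark_submit_args(conf):
    """Turn a dict of Spark settings into spark-submit --conf arguments."""
    args = []
    for key, value in sorted(conf.items()):
        args += ["--conf", f"{key}={value}"]
    return args


# Standard Spark properties; the numeric limits are illustrative assumptions.
dynamic_allocation_conf = {
    "spark.dynamicAllocation.enabled": "true",
    # The external shuffle service keeps shuffle files available
    # after an idle executor has been released.
    "spark.shuffle.service.enabled": "true",
    "spark.dynamicAllocation.minExecutors": "1",
    "spark.dynamicAllocation.maxExecutors": "20",
    "spark.dynamicAllocation.executorIdleTimeout": "60s",
}

# query_job.py is a hypothetical application script.
command = (
    ["spark-submit", "--master", "yarn"]
    + spark_submit_args(dynamic_allocation_conf)
    + ["query_job.py"]
)

print(" ".join(command))
```

With settings like these, an application asks YARN for more executors while tasks are queued and hands idle ones back, so capacity can move between users' jobs. The upper bound matters in a shared cluster, since it caps how much one large query can take.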

Hope this helps.

On Sat, Nov 11, 2017 at 11:21 PM ashish rawat <[hidden email]> wrote:
Hello Everyone,

I was trying to understand if anyone here has tried a data warehouse solution using S3 and Spark SQL. Out of multiple possible options (redshift, presto, hive etc), we were planning to go with Spark SQL, for our aggregates and processing requirements.

If anyone has tried it out, would like to understand the following:
  1. Is Spark SQL and UDF, able to handle all the workloads?
  2. What user interface did you provide for data scientist, data engineers and analysts
  3. What are the challenges in running concurrent queries, by many users, over Spark SQL? Considering Spark still does not provide spill to disk, in many scenarios, are there frequent query failures when executing concurrent queries
  4. Are there any open source implementations, which provide something similar?

Regards,
Ashish

Re: Spark based Data Warehouse

ashish rawat
Thanks Sky Yin. This really helps. 

On Nov 14, 2017 12:11 AM, "Sky Yin" <[hidden email]> wrote:
We are running Spark on AWS EMR as a data warehouse. All data are in S3 and metadata in the Hive metastore.

We have internal tools to create Jupyter notebooks on the dev cluster. I guess you could use Zeppelin instead, or Livy?

We run Genie as a job server for the prod cluster, so users have to submit their queries through Genie. For better resource utilization, we rely on YARN dynamic allocation to balance the load of multiple jobs/queries in Spark.

Hope this helps.


Re: Spark based Data Warehouse

Affan Syed
Another option that we are trying internally is to use Mesos for isolating different jobs or groups. Within a single group, using Livy to create different Spark contexts also works.
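
To make the Livy part concrete, here is a minimal sketch of the two REST calls involved: one to create a session (each session gets its own Spark context) and one to run a statement in it. It only builds the requests and does not send them. The host name, queue name and resource sizes are hypothetical; the endpoints and the JSON field names follow Livy's REST API as I understand it, so check them against the Livy version you run.

```python
import json
from urllib.request import Request

# Hypothetical host; 8998 is Livy's default port.
LIVY_URL = "http://livy.example.com:8998"


def new_session_request(queue, executor_memory="4g", num_executors=2):
    """Build (but do not send) a POST /sessions request for one user's session."""
    body = {
        "kind": "spark",
        "queue": queue,  # YARN queue the session is submitted to
        "executorMemory": executor_memory,
        "numExecutors": num_executors,
    }
    return Request(
        f"{LIVY_URL}/sessions",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def statement_request(session_id, code):
    """Build a POST /sessions/{id}/statements request that runs `code` in a session."""
    return Request(
        f"{LIVY_URL}/sessions/{session_id}/statements",
        data=json.dumps({"code": code}).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


session_req = new_session_request("team_a")
query_req = statement_request(0, 'spark.sql("SELECT 1").show()')
print(session_req.get_method(), session_req.full_url)
print(query_req.get_method(), query_req.full_url)
```

Passing either request to `urllib.request.urlopen` would submit it to a real server. Giving each user or team its own session is what provides the isolation between Spark contexts; the `queue` field is one way to tie a session to a resource-manager queue when Livy runs on YARN.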

- Affan

On Tue, Nov 14, 2017 at 8:43 AM, ashish rawat <[hidden email]> wrote:
Thanks Sky Yin. This really helps. 


Re: Spark based Data Warehouse

lucas.gary@gmail.com
Hi Ashish, bear in mind that EMR has some additional tooling available that smooths out some S3 problems that you may (almost certainly will) encounter.

We are using Spark with S3, not on EMR, and have encountered issues with file consistency. You can deal with it, but be aware that it's additional technical debt that you'll need to own. We didn't want to own an HDFS cluster, so we consider it worthwhile.

Here are some additional resources:  The video is Steve Loughran talking about S3.

For the record, we use S3 heavily but tend to drop our processed data into databases so it can be more easily consumed by visualization tools.

Good luck!

Gary Lucas

On 13 November 2017 at 20:04, Affan Syed <[hidden email]> wrote:
Another option that we are trying internally is to use Mesos for isolating different jobs or groups. Within a single group, using Livy to create different Spark contexts also works.

- Affan


Re: Spark based Data Warehouse

ashish rawat
Thanks, everyone, for your suggestions. Do any of you take care of automatically scaling your underlying Spark clusters up and down on AWS?

On Nov 14, 2017 10:46 AM, "[hidden email]" <[hidden email]> wrote:
Hi Ashish, bear in mind that EMR has some additional tooling available that smooths out some S3 problems that you may (almost certainly will) encounter.

We are using Spark with S3, not on EMR, and have encountered issues with file consistency. You can deal with it, but be aware that it's additional technical debt that you'll need to own. We didn't want to own an HDFS cluster, so we consider it worthwhile.

Here are some additional resources:  The video is Steve Loughran talking about S3.

For the record, we use S3 heavily but tend to drop our processed data into databases so it can be more easily consumed by visualization tools.

Good luck!

Gary Lucas


Re: Spark based Data Warehouse

lucas.gary@gmail.com
We are using Spark on Kubernetes on AWS (it's a long story), and it does work. It's still on the raw side, but we've been pretty successful.

We configured our cluster primarily with Kube-AWS and auto scaling groups. There are gotchas there, but so far we've been quite successful.

Gary Lucas

On 17 November 2017 at 22:20, ashish rawat <[hidden email]> wrote:
Thanks, everyone, for your suggestions. Do any of you take care of automatically scaling your underlying Spark clusters up and down on AWS?
