Using Google Cloud Storage for Spark big data

Using Google Cloud Storage for Spark big data

Aureliano Buendia
Hi,

Google has published a new connector for Hadoop: Google Cloud Storage, which is an equivalent of Amazon S3:

googlecloudplatform.blogspot.com/2014/04/google-bigquery-and-datastore-connectors-for-hadoop.html

How can Spark be configured to use this connector?

Re: Using Google Cloud Storage for Spark big data

Vincent.Heuschling
Hi,
At the last Google Cloud Platform event there was a demonstration of Spark clusters running on top of Google Cloud Platform. It seems they were using Spark with Hadoop on GCS. We also had difficulties using GCS with Spark, but we hope that Google will release their stuff.
Regards,
Vincent



Re: Using Google Cloud Storage for Spark big data

Andras Nemeth
In reply to this post by Aureliano Buendia
Hello!

On Wed, Apr 16, 2014 at 7:59 PM, Aureliano Buendia <[hidden email]> wrote:
Hi,

Google has published a new connector for Hadoop: Google Cloud Storage, which is an equivalent of Amazon S3:

googlecloudplatform.blogspot.com/2014/04/google-bigquery-and-datastore-connectors-for-hadoop.html

That announcement is actually about Cloud Datastore, not Cloud Storage (yes, the naming is quite confusing ;) ). Google has, however, had a Cloud Storage connector available for a while, which is also linked from your article.

How can Spark be configured to use this connector?
It can, but in a somewhat hacky way. The problem is that, for some reason, Google does not publish the library jar on its own; you get it installed as part of a Hadoop on Google Cloud installation. So the official way (which we did not try) would be to set up a Hadoop on Google Cloud installation and run Spark on top of that.

The other option, which we did try and which works fine for us, is to grab the jar directly from https://storage.googleapis.com/hadoop-lib/gcs/gcs-connector-1.2.4.jar and make sure it is shipped to your workers (e.g. with setJars on the SparkConf you use to create your SparkContext). Then create a core-site.xml file and make sure it is on the classpath of both your driver and your cluster (e.g. by bundling it into one of the jars you send with setJars above), with this content (with the YOUR_* placeholders replaced):
<configuration>
  <property>
    <name>fs.gs.impl</name>
    <value>com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem</value>
  </property>
  <property>
    <name>fs.gs.project.id</name>
    <value>YOUR_PROJECT_ID</value>
  </property>
  <property>
    <name>fs.gs.system.bucket</name>
    <value>YOUR_FAVORITE_BUCKET</value>
  </property>
</configuration>
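
To make the wiring concrete, here is a minimal Scala sketch of the driver-side setup described above. It is not from the original post: the jar path, master URL and application name are made-up placeholders, and setting the properties on sc.hadoopConfiguration is shown as an alternative to shipping core-site.xml.

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.SparkContext._  // pair-RDD implicits (needed on older Spark versions)

// Hypothetical local path to the downloaded connector jar.
val gcsJar = "/path/to/gcs-connector-1.2.4.jar"

val conf = new SparkConf()
  .setMaster("spark://master-host:7077")  // assumed standalone master URL
  .setAppName("gcs-example")              // assumed application name
  .setJars(Seq(gcsJar))                   // ship the connector jar to the workers

val sc = new SparkContext(conf)

// Alternative to putting core-site.xml on the classpath: set the same
// properties programmatically on the Hadoop configuration Spark uses.
sc.hadoopConfiguration.set("fs.gs.impl",
  "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem")
sc.hadoopConfiguration.set("fs.gs.project.id", "YOUR_PROJECT_ID")
sc.hadoopConfiguration.set("fs.gs.system.bucket", "YOUR_FAVORITE_BUCKET")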

From this point on you can simply use gs://... filenames to read/write data on Cloud Storage.
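
For example, continuing from the sketch above, a small word count against a hypothetical bucket (the bucket and file names are made up):

// gs:// paths behave like any other Hadoop filesystem path.
val lines = sc.textFile("gs://YOUR_FAVORITE_BUCKET/input/events.txt")
val counts = lines
  .flatMap(_.split("\\s+"))
  .map(word => (word, 1))
  .reduceByKey(_ + _)
counts.saveAsTextFile("gs://YOUR_FAVORITE_BUCKET/output/word-counts")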

Note that you should run your cluster and driver program on Google Compute Engine for this to work as is. It is probably possible to configure access from outside GCE as well, but we did not try that.

Hope this helps,
Andras


 


Re: Using Google Cloud Storage for Spark big data

Aureliano Buendia
Thanks, Andras. What approach did you use to set up a Spark cluster on Google Compute Engine? Currently there is no production-ready official equivalent of spark-ec2 for GCE. Did you roll your own?



Re: Using Google Cloud Storage for Spark big data

Mayur Rustagi
Okay, I just commented on another thread :)
I have a script that I use internally. I can give it out, but I will need some support from you to fix bugs etc. Let me know if you are interested.





Re: Using Google Cloud Storage for Spark big data

Andras Nemeth
We don't have anything fancy. It's basically a very thin layer of Google specifics on top of a standalone cluster. We created two disk snapshots, one for the master and one for the workers. The snapshots contain initialization scripts, so the master/worker daemons start on boot. If I want a cluster, I just create a new instance (with a fixed name) from the master snapshot. Once it is up, I start as many slave instances as I need from the slave snapshot. By the time the machines are up, the cluster is ready to use.

Andras






Re: Using Google Cloud Storage for Spark big data

Aureliano Buendia



This sounds a lot simpler than the existing spark-ec2 script. Does the Google Compute Engine API make this easier to do than the EC2 API does? Does your setup do everything spark-ec2 does?

Also, any plans to make this open source?
 



Re: Using Google Cloud Storage for Spark big data

Akhil
Hi Aureliano,

You might want to check out this script: https://github.com/sigmoidanalytics/spark_gce
Let me know if you need any help with it.

Thanks
Best Regards




Re: Using Google Cloud Storage for Spark big data

datalicious-dan
In reply to this post by Aureliano Buendia
We have implemented a setup that connects Spark to Google Cloud Storage.

Connecting Apache Spark to Google Cloud Storage

We needed it because our DataCollector product sits on Google Cloud Platform, and shipping everything back to Hadoop was counterproductive.

Shortly we will post an update with a basic tutorial/use case for running an instance with Spark on GCS.