Spark + MongoDB


Spark + MongoDB

Sampo Niskanen
Hi,

We're starting to build an analytics framework for our wellness service.  While our data is not yet Big, we'd like to use a framework that will scale as needed, and Spark seems to be the best around.

I'm new to Hadoop and Spark, and I'm having difficulty figuring out how to use Spark in connection with MongoDB. Apparently I should be able to use the mongo-hadoop connector (https://github.com/mongodb/mongo-hadoop) with Spark as well, but I haven't figured out how.

I've run through the Spark tutorials and have been able to set up a single-machine Hadoop system with the MongoDB connector as instructed at
and

Could someone give some instructions or pointers on how to configure and use the mongo-hadoop connector with Spark?  I haven't been able to find any documentation about this.


Thanks.


Best regards,
   Sampo N.



Re: Spark + MongoDB

Tathagata Das
I walked through the example in the second link you gave. The Treasury Yield example referred to there is here. Note the InputFormat and OutputFormat used in the job configuration: these classes specify how data is read from and written to MongoDB. You should be able to use the same InputFormat and OutputFormat classes in Spark as well. To save an RDD to MongoDB, use yourRDD.saveAsHadoopFile(...) with the output format class, and to read from MongoDB, use sparkContext.hadoopFile(...) with the input format class.
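
To make this concrete, here is a rough Java sketch of that pattern. It is only a sketch: it assumes the old-API classes in the connector's com.mongodb.hadoop.mapred package, and the MongoDB URIs and collection names are placeholders, not something from the original posts.

    import org.apache.hadoop.mapred.JobConf;
    import org.apache.spark.api.java.JavaPairRDD;
    import org.apache.spark.api.java.JavaSparkContext;
    import org.bson.BSONObject;
    import com.mongodb.hadoop.mapred.MongoInputFormat;
    import com.mongodb.hadoop.mapred.MongoOutputFormat;

    public class MongoSparkSketch {
        public static void main(String[] args) {
            JavaSparkContext sc = new JavaSparkContext("local", "Mongo Example");

            // The connector reads the source and target collections from these
            // properties; the URIs below are placeholders.
            JobConf conf = new JobConf();
            conf.set("mongo.input.uri", "mongodb://localhost:27017/mydb.input");
            conf.set("mongo.output.uri", "mongodb://localhost:27017/mydb.output");

            // Read: each document becomes an (id, BSONObject) pair.
            JavaPairRDD<Object, BSONObject> docs =
                sc.hadoopRDD(conf, MongoInputFormat.class, Object.class, BSONObject.class);

            // ... transform docs as needed ...

            // Write: the path argument is ignored by the Mongo output format;
            // the target collection comes from mongo.output.uri.
            docs.saveAsHadoopFile("file:///unused", Object.class, BSONObject.class,
                    MongoOutputFormat.class, conf);

            sc.stop();
        }
    }

As the rest of this thread shows, whether the old-API calls above compile depends on which MongoInputFormat variant your connector build provides; the new-API route (sc.newAPIHadoopRDD, discussed below) is what ultimately worked here.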

TD



Re: Spark + MongoDB

Sampo Niskanen
Hi,

Thanks for the pointer.  However, I'm still unable to generate the RDD using MongoInputFormat.  I'm trying to add the mongo-hadoop connector to the Java SimpleApp in the quickstart at http://spark.incubator.apache.org/docs/latest/quick-start.html


The mongo-hadoop connector contains two versions of MongoInputFormat, one extending org.apache.hadoop.mapreduce.InputFormat<Object, BSONObject>, the other extending org.apache.hadoop.mapred.InputFormat<Object, BSONObject>.  Neither of them is accepted by the compiler, and I'm unsure why:

        JavaSparkContext sc = new JavaSparkContext("local", "Simple App");
        sc.hadoopRDD(job, com.mongodb.hadoop.mapred.MongoInputFormat.class, Object.class, BSONObject.class);
        sc.hadoopRDD(job, com.mongodb.hadoop.MongoInputFormat.class, Object.class, BSONObject.class);

Eclipse gives the following error for both of the latter two lines:

Bound mismatch: The generic method hadoopRDD(JobConf, Class<F>, Class<K>, Class<V>) of type JavaSparkContext is not applicable for the arguments (JobConf, Class<MongoInputFormat>, Class<Object>, Class<BSONObject>). The inferred type MongoInputFormat is not a valid substitute for the bounded parameter <F extends InputFormat<K,V>>


I'm using Spark 0.9.0. Might this be caused by a conflict of Hadoop versions? I downloaded the mongo-hadoop connector for Hadoop 2.2. I haven't figured out how to select which Hadoop version Spark uses when it is pulled in as a dependency from an sbt file. (The sbt file is the one described in the quickstart.)


Thanks for any help.


Best regards,
   Sampo N.




Re: Spark + MongoDB

Tathagata Das
Can you try using the sc.newAPIHadoop* methods (e.g. sc.newAPIHadoopRDD)?
There are two kinds of classes because the Hadoop input and output format API underwent a significant change a few years ago (the old org.apache.hadoop.mapred API versus the newer org.apache.hadoop.mapreduce API).
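
To illustrate, here is a minimal read-side sketch using the mapreduce-package MongoInputFormat; the input URI and collection name are placeholders, not from the original posts.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.spark.api.java.JavaPairRDD;
    import org.apache.spark.api.java.JavaSparkContext;
    import org.bson.BSONObject;
    import com.mongodb.hadoop.MongoInputFormat;   // the new-API (mapreduce) variant

    public class MongoReadSketch {
        public static void main(String[] args) {
            JavaSparkContext sc = new JavaSparkContext("local", "Simple App");

            // The collection to read is taken from this property; the URI is a placeholder.
            Configuration config = new Configuration();
            config.set("mongo.input.uri", "mongodb://localhost:27017/mydb.mycollection");

            // newAPIHadoopRDD expects an org.apache.hadoop.mapreduce.InputFormat,
            // which is the class com.mongodb.hadoop.MongoInputFormat extends.
            JavaPairRDD<Object, BSONObject> documents =
                sc.newAPIHadoopRDD(config, MongoInputFormat.class, Object.class, BSONObject.class);

            System.out.println("Documents read: " + documents.count());
            sc.stop();
        }
    }

The key point is to pair the mapreduce-package MongoInputFormat with newAPIHadoopRDD rather than hadoopRDD, which is bound to the old mapred API.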

TD



Re: Spark + MongoDB

sonamjain01
Hey, thanks Tathagata!
I was facing a similar problem while reading from MongoDB in Spark.
Changing sc.hadoopRDD(conf, MongoInputFormat.class, Object.class, BSONObject.class) to
sc.newAPIHadoopRDD(conf, MongoInputFormat.class, Object.class, BSONObject.class) solved the problem.
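
For completeness, the write side follows the same pattern with the new-API output format. Again only a sketch: the output URI and the results RDD below are placeholders, not from the original posts.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.spark.api.java.JavaPairRDD;
    import org.bson.BSONObject;
    import com.mongodb.hadoop.MongoOutputFormat;  // the new-API (mapreduce) variant

    public class MongoWriteSketch {
        // 'results' is assumed to be a JavaPairRDD<Object, BSONObject> produced earlier.
        static void writeToMongo(JavaPairRDD<Object, BSONObject> results) {
            Configuration outputConfig = new Configuration();
            outputConfig.set("mongo.output.uri", "mongodb://localhost:27017/mydb.output");

            // The path argument is required by the API but not used by the Mongo
            // output format; the target collection comes from mongo.output.uri.
            results.saveAsNewAPIHadoopFile("file:///unused",
                    Object.class, BSONObject.class, MongoOutputFormat.class, outputConfig);
        }
    }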

Re: Spark + MongoDB

Sampo Niskanen
In reply to this post by Tathagata Das
Hi,

Since getting Spark and MongoDB to work together was not very obvious (at least to me), I wrote a tutorial about it on my blog, with an example application:

Hope it's of use to someone else as well.


Cheers,

    Sampo Niskanen
    Lead developer / Wellmo

    [hidden email]
    +358 40 820 5291
 




Re: Spark + MongoDB

Matei Zaharia
Very cool, thanks for writing this. I’ll link it from our website.

Matei
