Spark RDD + HBase: adoption trend


Marco Firrincieli
Hi, my name is Marco and I'm one of the developers behind https://github.com/unicredit/hbase-rdd,
a project we are currently reviewing for various reasons.

We were basically wondering if RDD "is still a thing" nowadays (we see lots of usage of DataFrames and Datasets), and we're not sure how much of the community still works with RDDs.

Also, for lack of time, we have mostly worked with Cloudera-flavored Hadoop/HBase & Spark versions. We had hoped the community would then help us organize the project in a more "generic" way, but that didn't happen.

So I figured I would ask here what the gut feeling of the Spark community is, so that we can better define the future of our little library.

Thanks

-Marco

Re: Spark RDD + HBase: adoption trend

Jacek Laskowski
Hi Marco,

IMHO RDD is only for very sophisticated use cases that very few Spark devs would be capable of. I consider the RDD API a sort of Spark assembler, and most Spark devs should stick to the Dataset API.

Speaking of HBase, see https://github.com/GoogleCloudPlatform/java-docs-samples/tree/master/bigtable/spark where you can find a demo that I worked on last year and made sure that:

"Apache HBase™ Spark Connector implements the DataSource API for Apache HBase and allows executing relational queries on data stored in Cloud Bigtable."

That makes hbase-rdd even more obsolete, but not necessarily unusable (I am too little skilled in the HBase space to comment on this).

I think you should consider merging your hbase-rdd project with the official Apache HBase™ Spark Connector at https://github.com/apache/hbase-connectors/tree/master/spark (which seems to lack active development IMHO).

In reply to Marco Firrincieli's message of Wed, Jan 20, 2021, 2:44 PM.

Re: Spark RDD + HBase: adoption trend

srowen
In reply to this post by Marco Firrincieli
RDDs are still relevant in a few ways - there is no Dataset in Python, for example, so RDD is still the 'typed' API there. They still underpin DataFrames. And of course the RDD API is still there because there's probably still a lot of code out there that uses it. Occasionally it's still useful to drop into that API for certain operations.

If that's a connector to read data from HBase - you probably do want to return DataFrames ideally.
Unless you're relying on very specific APIs from very specific versions, I wouldn't think a distro's Spark or HBase is much different?
