Metadata Management

Vasu Gourabathina
All:

This may be off topic for Spark, but I'm sure several of you have dealt with some form of this in your big data implementations, so I wanted to reach out.

As part of the data lake and data processing (with Spark, for example), we might end up with the files in several different forms (after cleanup, enrichment, etc.).

To make this data available for exploration by analysts and data scientists, how should we manage the metadata?
  - Creating a metadata repository
  - Making the schemas available to users, so they can use them to create Hive tables, query the data with Presto, etc.
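
To make the second point concrete, here is a minimal sketch of what I have in mind, not a real tool: a repository entry that holds a dataset's schema, and a helper that renders a Hive `CREATE EXTERNAL TABLE` statement from it. The names `Column`, `DatasetEntry` and `hive_ddl`, the table name, and the S3 path are all hypothetical, made up for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Column:
    name: str
    hive_type: str  # Hive type name, e.g. "string", "bigint", "double"


@dataclass
class DatasetEntry:
    """One record in a hypothetical metadata repository."""
    table: str
    location: str      # where the files live (assumed path)
    file_format: str   # e.g. "PARQUET" or "ORC"
    columns: List[Column]


def hive_ddl(entry: DatasetEntry) -> str:
    """Render a CREATE EXTERNAL TABLE statement from a repository entry."""
    cols = ",\n".join(f"  `{c.name}` {c.hive_type}" for c in entry.columns)
    return (
        f"CREATE EXTERNAL TABLE IF NOT EXISTS `{entry.table}` (\n"
        f"{cols}\n"
        f")\n"
        f"STORED AS {entry.file_format}\n"
        f"LOCATION '{entry.location}'"
    )


# Hypothetical entry for an enriched dataset produced by a Spark job.
orders = DatasetEntry(
    table="orders_enriched",
    location="s3://example-bucket/lake/orders_enriched/",
    file_format="PARQUET",
    columns=[
        Column("order_id", "bigint"),
        Column("customer", "string"),
        Column("total", "double"),
    ],
)

print(hive_ddl(orders))
```

An analyst could then paste the generated statement into Hive without asking an engineer for the schema. A managed catalog would presumably replace the hand-written entry.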

Can you recommend some patterns (or tools) to help manage the metadata? We are trying to reduce the dependency on engineers and make the analysts and scientists as self-sufficient as possible.

Azure and AWS Glue Data Catalog seem to address this. Any input on these two?

I appreciate your help in advance.

Thanks,
Vasu.

Re: Metadata Management

Szuromi Tamás
Hi Vasu,


Cheers 
Tamas 

On 2017. Oct 19., Thu at 23:22, Vasu Gourabathina <[hidden email]> wrote: