OPERATING SYSTEMSOS Linux

Data Discovery at Databricks with Amundsen

Databricks used to use a static manually maintained wiki page for internal data exploration. We will discuss how we leverage Amundsen, an open source data discovery tool from Linux Foundation AI & Data, to improve productivity with trust by surfacing the most relevant dataset and SQL analytics dashboard with its important information programmatically at Databricks internally.

We will also talk about how we integrate Amundsen with Databricks world class infrastructure to surface metadata including:

Surface the most popular tables used within Databricks
Support fuzzy search and facet search for dataset- Surface rich metadata on datasets:
Lineage information (downstream table, upstream table, downstream jobs, downstream users)
Dataset owner
Dataset frequent users
Delta extend metadata (e.g change history)
ETL job that generates the dataset
Column stats on numeric type columns
Dashboards that use the given dataset
Use Databricks data tab to show the sample data
Surface metadata on dashboards including: create time, last update time, tables used, etc
Last but not least, we will discuss how we incorporate internal user feedback and provide the same discovery productivity improvements for Databricks customers in the future.

Connect with us:
Website: https://databricks.com
Facebook: https://www.facebook.com/databricksinc
Twitter: https://twitter.com/databricks
LinkedIn: https://www.linkedin.com/company/databricks
Instagram: https://www.instagram.com/databricksinc/

source

by Databricks

linux foundation

Leave a Reply

Your email address will not be published. Required fields are marked *