Solving Data Discovery Challenges at Lyft with Amundsen, an Open-source Metadata Platform
Amundsen is a data discovery and metadata platform that originated at Lyft and was recently donated to Linux Foundation AI. Since it was open-sourced, Amundsen has been used and extended by many companies in our community.
In short, Amundsen is built on 3 key pillars:
1. Augmented Data Graph: Amundsen uses a graph database (Neo4j by default) under the hood to store relationships between various data assets (tables, dashboards, protobuf events, etc.). What’s unique to Amundsen is that we bring all related metadata (usage, last updated, watermark, stats, etc.) into this graph. For example, we also treat people as first-class data assets: there’s a graph node for each person in the organization that connects to other nodes (such as tables and dashboards). This helps with problems such as onboarding, by answering questions like “which tables do my team members use most frequently?”
2. Intuitive User Experience: Amundsen strives to deliver data discovery relevant to the user by running PageRank using data from access logs to power search ranking, similar to how Google ranks web pages on the internet.
3. Centralized Metadata from different sources: Amundsen gathers metadata from various sources (Hive, Presto, Airflow, etc.) and exposes it in one central place. The right place to store all this metadata is a work in progress. It also provides data lineage across these sources, so users can understand how data assets are connected.
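To make the first two pillars concrete, here is a minimal in-memory sketch in Python. It is not Amundsen's code: Amundsen stores its graph in a database such as Neo4j, and the sample data, function names, and the plain power-iteration PageRank below are all assumptions chosen for illustration. The sketch treats people and tables as nodes, links them by usage, answers "which tables does a teammate use?", and ranks tables by a PageRank score computed over the usage edges.

```python
from collections import defaultdict

# Hypothetical (person, table) usage pairs, as might be derived from access logs.
# All names are illustrative and are not taken from Amundsen.
USAGE = [
    ("alice", "rides"),
    ("alice", "drivers"),
    ("bob", "rides"),
    ("carol", "rides"),
    ("carol", "payments"),
]


def build_graph(usage):
    """Build an undirected person<->table adjacency map from usage pairs."""
    graph = defaultdict(set)
    for person, table in usage:
        graph[("person", person)].add(("table", table))
        graph[("table", table)].add(("person", person))
    return graph


def tables_used_by(graph, person):
    """Follow edges from a person node to the tables that person uses."""
    neighbors = graph.get(("person", person), ())
    return sorted(name for kind, name in neighbors if kind == "table")


def pagerank(graph, damping=0.85, iterations=50):
    """Power-iteration PageRank; every node here has at least one edge."""
    nodes = list(graph)
    rank = {node: 1.0 / len(nodes) for node in nodes}
    for _ in range(iterations):
        new_rank = {node: (1.0 - damping) / len(nodes) for node in nodes}
        for node in nodes:
            # Split this node's damped rank evenly among its neighbors.
            share = damping * rank[node] / len(graph[node])
            for neighbor in graph[node]:
                new_rank[neighbor] += share
        rank = new_rank
    return rank


def rank_tables(graph):
    """Return table names ordered by PageRank score, highest first."""
    scores = pagerank(graph)
    tables = [node for node in graph if node[0] == "table"]
    ordered = sorted(tables, key=lambda node: scores[node], reverse=True)
    return [name for _, name in ordered]


if __name__ == "__main__":
    graph = build_graph(USAGE)
    print(tables_used_by(graph, "alice"))
    print(rank_tables(graph))
```

In this toy data the most widely used table ranks first, which is the intuition behind using access logs to order search results. A real deployment would run equivalent queries against the graph database rather than a Python dictionary.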
In this talk, we will discuss what a data discovery experience would look like in an ideal world and what Lyft has done to make that possible. Then we will deep dive into Amundsen’s architecture and discuss how it achieves the three design pillars above. More importantly, we will discuss how Amundsen can be customized and extended to fit other companies’ data ecosystems. Lastly, we will close with the future roadmap of the project, what problems remain unsolved, and how we can work together to solve them.
About:
Databricks provides a unified data analytics platform, powered by Apache Spark™, that accelerates innovation by unifying data science, engineering and business.
Read more here: https://databricks.com/product/unified-data-analytics-platform
See all the previous Summit sessions: https://databricks.com/sparkaisummit/north-america/sessions
by Databricks