Building Reproducible ML Processes with an Open Source Stack – Einat Orr, Treeverse

June 28, 2024

Machine learning experiments consist of Data + Code + Environment. While MLflow Projects are a great way to ensure the reproducibility of data science code, they cannot ensure the reproducibility of the input data used by that code. In this talk, we’ll go over the trifecta required for truly reproducible experiments: Code (Kubeflow and Git), Data (MinIO + lakeFS) and Environment (infrastructure-as-code). The talk will include a hands-on code demonstration of reproducing an experiment while ensuring we use the exact same input data, code and processing environment as a previous run. We will demonstrate programmatic ways to tie all the moving parts together: from creating commits that snapshot the input data, to tagging and traversing the history of both code and data in tandem.
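As a rough illustration of what tying code, data and environment together can mean in practice, here is a minimal sketch in plain Python. It is not the demo from the talk and calls no lakeFS, Git or Kubeflow API. The `RunManifest` class, its field names and the helper functions are hypothetical, and the identifiers in the usage note are placeholders. The idea is only that each run records one identifier per part of the trifecta, so two runs can be compared or one run reproduced.

```python
import hashlib
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class RunManifest:
    """Hypothetical record of the three identifiers that pin one experiment run."""

    code_commit: str         # e.g. the Git commit SHA of the experiment code
    data_commit: str         # e.g. the ID of the commit that snapshots the input data
    environment_digest: str  # e.g. a container image digest for the environment

    def fingerprint(self) -> str:
        """Return a stable SHA-256 hash over all three identifiers."""
        payload = json.dumps(asdict(self), sort_keys=True).encode("utf-8")
        return hashlib.sha256(payload).hexdigest()


def same_inputs(a: RunManifest, b: RunManifest) -> bool:
    """True when both runs used identical code, data and environment."""
    return a.fingerprint() == b.fingerprint()


def drifted_parts(a: RunManifest, b: RunManifest) -> list:
    """List the names of the fields that differ between two runs."""
    other = asdict(b)
    return [name for name, value in asdict(a).items() if other[name] != value]
```

For example, with placeholder identifiers, `drifted_parts(RunManifest("c1", "d1", "e1"), RunManifest("c1", "d2", "e1"))` reports that only the data changed between the two runs. In a real setup the fingerprint could also serve as a shared tag name applied to both the code and the data history, though that naming scheme is an assumption of this sketch.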


by The Linux Foundation
