What is Lineage Tracking in Machine Learning and Why You Need It

November 28, 2022

Emmanuel Turlay

Founder, CEO

As ML engineers, we often run hundreds of training jobs every week with different inputs, configurations, and code changes. We deploy models to production, investigate regressions, and more. All these tasks consume many different assets (e.g. datasets, configurations, code) and produce many more (e.g. models, metrics, inferences).

How do we keep track of the source-outcome relationships between all these assets? How do we know what hyperparameters were used in the 7th job we launched before the weekend? How do we know exactly what data points were used to train the model currently being used in production?

Enter Lineage Tracking.

Lineage Tracking is the systematic tracking of all assets consumed and produced by each step in an ML pipeline, in order to keep a source of truth of the lineage graph between them.

Sure, we can do this manually: keep track of jobs in spreadsheets and notebooks, adopt a naming convention to make sure we know the origin of every model... But we all know this is tedious and impractical.

Nobody likes using spreadsheets and notebooks for bookkeeping.

Instead, Lineage Tracking must come out-of-the-box with your ML platform.

Why do I need Lineage Tracking?

Here are five reasons why Lineage Tracking is a must-have, each of which would be a sufficient reason in itself.

Bookkeeping, experiment tracking

As you iterate on your code (e.g. training code, data processing, evaluation), you need to keep track of every single execution so that you have a record of all the options and sets of parameters you tried. This is a basic tenet of good scientific research.

Imagine you start 5 training jobs with different input datasets, configurations, and hyperparameters, and one of them clearly outperforms the others. Did you write down ahead of time which is which? Maybe, but are you confident you logged enough information? Lineage Tracking does this for you without having to think about it.
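To make the idea concrete, here is a minimal sketch of automatic bookkeeping in plain Python. It is not how any particular platform implements it: the `tracked` decorator, the in-memory `RUN_LOG`, and the toy `train` function with its made-up accuracy formula are all hypothetical names introduced for illustration. The point is only that inputs and outputs can be captured at call time, so nobody has to write them down by hand.

```python
import functools
import inspect
from typing import Any, Callable, Dict, List

# Hypothetical in-memory log; a real platform would persist this to a metadata store.
RUN_LOG: List[Dict[str, Any]] = []


def tracked(func: Callable[..., Any]) -> Callable[..., Any]:
    """Record the bound arguments and return value of every call to `func`."""
    signature = inspect.signature(func)

    @functools.wraps(func)
    def wrapper(*args: Any, **kwargs: Any) -> Any:
        bound = signature.bind(*args, **kwargs)
        bound.apply_defaults()  # also capture parameters left at their defaults
        output = func(*args, **kwargs)
        RUN_LOG.append(
            {"step": func.__name__, "inputs": dict(bound.arguments), "output": output}
        )
        return output

    return wrapper


@tracked
def train(dataset: str, learning_rate: float = 0.01, epochs: int = 10) -> Dict[str, float]:
    # Placeholder for real training code; the metric is made up for illustration.
    return {"accuracy": 0.5 + learning_rate * epochs}


# Launch a few jobs without writing anything down by hand.
for lr in (0.001, 0.01, 0.02):
    train("dataset-v1", learning_rate=lr)

# Later: which inputs produced the best run?
best = max(RUN_LOG, key=lambda run: run["output"]["accuracy"])
print(best["inputs"])
```

Because the decorator binds arguments against the function signature, parameters left at their default values (here `epochs`) are recorded too, which is exactly the kind of detail that tends to be missing from manual notes.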

Debugging

It is impossible to debug an unexpected outcome without knowing the input parameters of the system. If your model starts generating incorrect inferences in production, you will need to roll back to a prior version and perform root cause analysis. To do so, you need to find the training job that produced the model, its evaluation metrics, the input datasets, configurations, etc., all of which are tracked by Lineage Tracking.
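As a rough illustration of what such an investigation looks like once lineage is recorded, the sketch below walks a lineage graph backwards from a model to everything it depends on. The graph contents and asset names (`model:v7`, `job:train-42`, etc.) are invented for this example, and the dictionary stands in for whatever metadata store a real platform would query.

```python
from typing import Dict, List, Set

# Hypothetical lineage graph: each asset maps to the assets it was produced from.
PRODUCED_FROM: Dict[str, List[str]] = {
    "model:v7": ["job:train-42"],
    "metrics:v7": ["job:train-42"],
    "job:train-42": ["dataset:train-2022-11", "config:lr-0.01", "code:abc123"],
    "dataset:train-2022-11": ["job:preprocess-17"],
    "job:preprocess-17": ["dataset:raw-2022-11", "code:abc123"],
}


def upstream(asset: str) -> Set[str]:
    """Return every asset that `asset` depends on, directly or transitively."""
    seen: Set[str] = set()
    stack = [asset]
    while stack:
        current = stack.pop()
        for parent in PRODUCED_FROM.get(current, []):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


# Everything that went into the model currently in production.
for name in sorted(upstream("model:v7")):
    print(name)
```

The traversal returns the training job, its configuration and code reference, the training dataset, and the preprocessing job and raw data behind it, which is the set of candidates to inspect during root cause analysis.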

Traceability

Traceability is a requirement for any supply chain. In software, CI/CD pipelines automate testing and deployments, and always tie a particular code asset back to a version control reference (e.g. a Git commit SHA), which in turn is associated with an author.

ML pipelines are the supply chains for trained models. Therefore, they also need to provide a high level of guarantees around traceability.

If this traceability is not inherently guaranteed by the underlying platform, it is likely that it will be incomplete, unusable, or even simply absent.

Compliance

In many cases, ML models are used in mission-critical scenarios, for example in autonomous vehicles, fraud detection, or healthcare. In such cases, failures can have dramatic effects on users, including potentially fatal consequences.

Arguably, models deployed to production without Lineage Tracking are simply a liability, with potential legal consequences.

Reproducibility

Much is said about the lack of explainability in Machine Learning. The first step towards explainability is the ability to reproduce the outcome of a particular job, in order to iterate and investigate. Without rigorous Lineage Tracking, it is virtually impossible to know exactly what code, data, and configuration were used to produce a particular model.
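One simple way to check whether two runs used exactly the same code, data, and configuration is to reduce those three references to a single fingerprint. The sketch below is one possible approach, not a prescribed method: the function name `run_fingerprint`, the example commit SHA, and the dataset digest are all hypothetical, and the digest is assumed to have been computed elsewhere from the dataset contents.

```python
import hashlib
import json
from typing import Any, Dict


def run_fingerprint(code_sha: str, dataset_digest: str, config: Dict[str, Any]) -> str:
    """Hash the code reference, data reference, and configuration of a run."""
    # Canonical JSON (sorted keys, fixed separators) so key order does not matter.
    canonical = json.dumps(
        {"code": code_sha, "data": dataset_digest, "config": config},
        sort_keys=True,
        separators=(",", ":"),
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


# Hypothetical references for two runs that differ only in learning rate.
run_a = run_fingerprint("abc123", "sha256:feed", {"learning_rate": 0.01, "epochs": 10, "seed": 0})
run_b = run_fingerprint("abc123", "sha256:feed", {"epochs": 10, "seed": 0, "learning_rate": 0.01})
run_c = run_fingerprint("abc123", "sha256:feed", {"learning_rate": 0.02, "epochs": 10, "seed": 0})

print(run_a == run_b)  # same inputs, different key order
print(run_a == run_c)  # different learning rate
```

Matching fingerprints only show that the recorded inputs were identical. Bit-for-bit reproduction of a model can still depend on factors such as hardware and library versions, which is why those belong in the tracked metadata as well.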

What needs to be tracked?

The following assets must be tracked for all executions of the training pipeline:

Code: Git commit SHA, container image reference for data processing, training, evaluation code, etc.
Configurations: all parameters of all pipeline steps (e.g. learning rate, number of epochs, normalization parameters, cut thresholds, etc.)
Model hyperparameters
Input data: exact references to the training and testing datasets
Annotations: what ground truth labels were used
Resources used: what machines were used (GPU types, CPU count, memory, storage, etc.)
Ownership: who ran the job, who wrote the code, who deployed the model?
Model: the model itself needs to be registered and tracked

This is obviously a lot of information, which makes it difficult to track manually or in an ad-hoc manner. Instead, it should be tracked automatically by your ML platform.
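For a sense of what a single tracked execution might look like as data, here is a sketch of a record covering the items listed above. The `LineageRecord` class, its field names, and all example values (image name, bucket paths, owner) are assumptions made for illustration, not the schema of any specific platform. The only external command used is `git rev-parse HEAD`, which prints the current commit SHA.

```python
import json
import subprocess
from dataclasses import asdict, dataclass
from typing import Any, Dict, List, Optional


def current_git_sha() -> Optional[str]:
    """Return the current Git commit SHA, or None outside a Git checkout."""
    try:
        result = subprocess.run(
            ["git", "rev-parse", "HEAD"], capture_output=True, text=True, check=True
        )
    except (OSError, subprocess.CalledProcessError):
        return None
    return result.stdout.strip()


@dataclass
class LineageRecord:
    """Hypothetical schema for one execution of a training pipeline."""

    job_id: str
    owner: str                      # who ran the job
    git_sha: Optional[str]          # code version
    container_image: Optional[str]  # runtime environment
    config: Dict[str, Any]          # parameters and hyperparameters
    input_datasets: List[str]       # exact dataset references
    annotations: Optional[str]      # ground truth label version
    resources: Dict[str, Any]       # machines used
    output_model: Optional[str] = None

    def to_json(self) -> str:
        return json.dumps(asdict(self), sort_keys=True, indent=2)


record = LineageRecord(
    job_id="train-42",
    owner="jane.doe",
    git_sha=current_git_sha(),
    container_image="registry.example.com/train:1.4.0",
    config={"learning_rate": 0.01, "epochs": 10},
    input_datasets=["s3://example-bucket/train/2022-11"],
    annotations="labels-v3",
    resources={"gpu_type": "A100", "gpu_count": 1, "memory_gb": 64},
    output_model="s3://example-bucket/models/v7",
)
print(record.to_json())
```

Even this small record has nine fields per job, which gives a feel for why filling it in by hand for hundreds of jobs a week does not scale.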

Lineage Tracking with Sematic

Sematic is the open-source Continuous Machine Learning platform. It lets you define arbitrary end-to-end ML pipelines using only Python, and without requiring any infrastructure skills.

By default, Sematic guarantees the highest level of traceability and surfaces all lineage information in its UI.

Sematic tracks:

Code: Git information, container images
Configurations, input data, hyperparameters, annotations, model: all inputs and outputs of all pipeline steps are serialized, persisted, and tracked in a metadata store
Resources used: all requested resources are described and stored alongside the job's metadata
Ownership: person or system submitting the job

This high level of tracking enables Sematic to guarantee reproducibility of your pipeline. You can trigger a re-run of any pipeline directly from the UI.

Check out Sematic, join our Discord server, and give us a star on GitHub!
