
What is Lineage Tracking in Machine Learning and Why You Need It

We explain what Lineage Tracking is and why it is a requirement for any production-grade ML system.

November 28, 2022

As ML engineers, we often run hundreds of training jobs every week with different inputs, configurations, and code changes. We deploy models to production, investigate regressions, and so on. All these tasks consume many different assets (e.g. datasets, configurations, code) and produce many more (e.g. models, metrics, inferences).

How do we keep track of the source-to-outcome relationships between all these assets? How do we know what hyperparameters were used in the 7th job we launched before the weekend? How do we know exactly what data points were used to train the model currently being used in production?

Enter Lineage Tracking.

Lineage Tracking is the systematic tracking of all assets consumed and produced by each step in an ML pipeline, in order to keep a source of truth of the lineage graph between them.
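To make the idea of a lineage graph concrete, here is a minimal sketch in Python. It is an illustration only: the `LineageGraph` class, the step names, and the asset names are all hypothetical, and a real platform would persist this graph rather than hold it in memory.

```python
from dataclasses import dataclass, field


@dataclass
class LineageGraph:
    # Maps each produced asset to (producing step, consumed assets).
    producers: dict = field(default_factory=dict)

    def record(self, step, inputs, outputs):
        """Record that `step` consumed `inputs` and produced `outputs`."""
        for output in outputs:
            self.producers[output] = (step, tuple(inputs))

    def upstream(self, asset):
        """Return every asset that `asset` transitively depends on."""
        seen = set()
        stack = [asset]
        while stack:
            current = stack.pop()
            if current not in self.producers:
                continue  # a raw input with no recorded producer
            _, inputs = self.producers[current]
            for parent in inputs:
                if parent not in seen:
                    seen.add(parent)
                    stack.append(parent)
        return seen


# Hypothetical two-step pipeline: preprocessing, then training.
graph = LineageGraph()
graph.record("preprocess", ["raw_data_v3", "preprocess_config"], ["train_set"])
graph.record("train", ["train_set", "hyperparameters"], ["model_v7"])

# Everything that went into model_v7, directly or indirectly.
ancestors = graph.upstream("model_v7")
```

Walking the graph upstream from a model is what answers questions like "what data was this model trained on?".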

Sure, we could do this manually: keep track of jobs in spreadsheets and notebooks, and adopt a naming convention to make sure we know the origin of every model. But we all know this is tedious, error-prone, and impractical.

Nobody likes using spreadsheets and notebooks for bookkeeping.

Instead, Lineage Tracking must come out-of-the-box with your ML platform.

Why do I need Lineage Tracking?

Here are five reasons why Lineage Tracking is a must-have, each of which would be a sufficient reason in itself.

Bookkeeping, experiment tracking

As you iterate on your code (e.g. training code, data processing, evaluation, etc.), you need to keep track of every single execution, if only to keep a record of all the options and sets of parameters you tried. This is a basic tenet of good scientific research.

Imagine you start 5 training jobs with different input datasets, configurations, and hyperparameters, and one of them clearly outperforms the others. Did you write down ahead of time which is which? Maybe, but are you confident you logged enough information? Lineage Tracking does this for you without having to think about it.

Debugging

It is impossible to debug an unexpected outcome without knowing the input parameters of the system. If your model starts generating incorrect inferences in production, you will need to roll back to a prior version, and do Root Cause Analysis. To do so, you need to find the training job that produced the model, its evaluation metrics, the input datasets, configurations, etc., all of which are tracked by Lineage Tracking.

Traceability

Traceability is a requirement for any supply chain. In software, CI/CD pipelines automate testing and deployments, and always tie a particular code asset back to a version control reference (e.g. Git commit sha), which in turn is associated with an author.

ML pipelines are the supply chains for trained models. Therefore, they also need to provide a high level of guarantees around traceability.

If this traceability is not inherently guaranteed by the underlying platform, it is likely that it will be incomplete, unusable, or even simply absent.

Compliance

ML models are often used in mission-critical scenarios such as autonomous vehicles, fraud detection, and healthcare. In such cases, failures can have dramatic effects on users, including potentially fatal consequences.

Arguably, models deployed to production without Lineage Tracking are simply a liability, with potential legal consequences.

Reproducibility

Much is said about the lack of explainability in Machine Learning. The first step towards explainability is the ability to reproduce the outcome of a particular job, in order to iterate and investigate. Without rigorous Lineage Tracking, it is virtually impossible to know exactly what code, data, and configuration were used to produce a particular model.

What needs to be tracked?

The following assets must be tracked for all executions of the training pipeline:

  • Code: Git commit sha, container image reference for data processing, training, evaluation code, etc.
  • Configurations: all parameters of all pipeline steps (e.g. learning rate, number of epochs, normalization parameters, cut thresholds, etc.)
  • Model hyperparameters
  • Input data: exact references to the training and testing datasets
  • Annotations: what ground truth labels were used
  • Resources used: what machines were used (GPU types, CPU count, memory, storage, etc.)
  • Ownership: who ran the job, who wrote the code, who deployed the model
  • Model: the model itself needs to be registered and tracked

This is obviously a lot of information, which makes it difficult to track it manually or in an ad-hoc manner. Instead, these should be tracked automatically by your ML platform.
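As a rough sketch of what one tracked execution could look like, the record below bundles the assets listed above into a single structure and derives a fingerprint from it. The `RunRecord` class, its field names, and all example values are hypothetical, chosen for illustration; they are not the schema of any particular platform.

```python
import hashlib
import json
from dataclasses import asdict, dataclass, replace


@dataclass(frozen=True)
class RunRecord:
    git_commit: str          # code version
    container_image: str     # image the job ran in
    config: dict             # pipeline step parameters
    hyperparameters: dict    # model hyperparameters
    dataset_refs: tuple      # exact references to training/testing data
    annotation_ref: str      # which ground truth labels were used
    resources: dict          # machines requested for the job
    owner: str               # who ran the job
    model_ref: str           # where the resulting model is registered

    def fingerprint(self) -> str:
        """Stable hash of the record: same fields, same fingerprint."""
        payload = json.dumps(asdict(self), sort_keys=True)
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()


# All values below are made up for the example.
record = RunRecord(
    git_commit="abc1234",
    container_image="registry.example.com/train:2022-11-28",
    config={"normalize": True, "cut_threshold": 0.5},
    hyperparameters={"learning_rate": 0.01, "epochs": 10},
    dataset_refs=("datasets/train/v3", "datasets/test/v3"),
    annotation_ref="labels/v2",
    resources={"gpu_type": "A100", "gpu_count": 1, "memory_gb": 64},
    owner="alice",
    model_ref="models/classifier/7",
)

# Changing a single hyperparameter yields a different fingerprint.
variant = replace(record, hyperparameters={"learning_rate": 0.02, "epochs": 10})
```

Comparing fingerprints is a cheap way to tell whether two jobs were run with exactly the same code, data, and configuration.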

Lineage Tracking with Sematic

Sematic is the open-source Continuous Machine Learning platform. It lets you define arbitrary end-to-end ML pipelines using only Python, and without requiring any infrastructure skills.

By default, Sematic guarantees the highest level of traceability and surfaces all lineage information in its UI.

Sematic tracks:

  • Code: Git information, container images
  • Configurations, input data, hyperparameters, annotations, model: all inputs and outputs of all pipeline steps are serialized, persisted, and tracked in a metadata store
  • Resources used: all requested resources are described and stored alongside the job's metadata
  • Ownership: person or system submitting the job

This high level of tracking enables Sematic to guarantee reproducibility of your pipeline. You can trigger a re-run of any pipeline directly from the UI.
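The general idea of serializing and persisting the inputs and outputs of every pipeline step can be sketched with a plain Python decorator. To be clear, this is not Sematic's API or implementation: the `tracked` decorator, the in-memory `METADATA_STORE`, and the `train` function are hypothetical stand-ins that only illustrate the pattern.

```python
import functools
import inspect
import json

# Stand-in for a real metadata store (a database in practice).
METADATA_STORE = []


def tracked(func):
    """Record the serialized inputs and output of every call to `func`."""

    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        # Resolve positional, keyword, and default arguments by name.
        bound = inspect.signature(func).bind(*args, **kwargs)
        bound.apply_defaults()
        output = func(*args, **kwargs)
        METADATA_STORE.append(
            {
                "step": func.__name__,
                "inputs": json.dumps(dict(bound.arguments), sort_keys=True),
                "output": json.dumps(output, sort_keys=True),
            }
        )
        return output

    return wrapper


@tracked
def train(learning_rate: float, epochs: int = 3) -> dict:
    # Placeholder for real training; returns a made-up metric.
    return {"accuracy": 0.9, "epochs": epochs}


result = train(0.01)
```

Because the tracking lives in the decorator, the author of `train` does not have to remember to log anything, which is the point of having lineage tracking built into the platform.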

Check out Sematic, join our Discord server, and give us a star on GitHub!
