
What Is Lineage Tracking in Machine Learning and Why You Need It

November 28, 2022
Emmanuel Turlay
Founder, CEO

As ML engineers, we often run hundreds of training jobs every week with different inputs, configurations, and code changes. We also deploy models to production, investigate regressions, and so on. All these tasks consume many different assets (e.g. datasets, configurations, code) and produce many more (e.g. models, metrics, inferences).

How do we keep track of which sources led to which outcomes across all these assets? How do we know what hyperparameters were used in the 7th job we launched before the weekend? How do we know exactly what data points were used to train the model currently being used in production?

Enter Lineage Tracking.

Lineage Tracking is the systematic tracking of all assets consumed and produced by each step in an ML pipeline, in order to keep a source of truth of the lineage graph between them.

Sure, we can do this manually: keep track of jobs in spreadsheets and notebooks, or adopt a naming convention to make sure we know the origin of every model. But we all know this is tedious and impractical.

Nobody likes using spreadsheets and notebooks for bookkeeping.

Instead, Lineage Tracking must come out-of-the-box with your ML platform.

Why do I need Lineage Tracking?

Here are five reasons why Lineage Tracking is a must-have, each of which would be a sufficient reason in itself.

Bookkeeping, experiment tracking

As you iterate on your code (e.g. training code, data processing, evaluation, etc.), you need to keep track of every single execution so that you have a record of all the options and sets of parameters you tried. This is a basic tenet of good scientific research.

Imagine you start 5 training jobs with different input datasets, configurations, and hyperparameters, and one of them clearly outperforms the others. Did you write down ahead of time which is which? Maybe, but are you confident you logged enough information? Lineage Tracking does this for you without having to think about it.

Debugging

It is impossible to debug an unexpected outcome without knowing the input parameters of the system. If your model starts generating incorrect inferences in production, you will need to roll back to a prior version and perform a root cause analysis. To do so, you need to find the training job that produced the model, its evaluation metrics, the input datasets, configurations, etc., all of which are tracked by Lineage Tracking.
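As a minimal sketch of what such an investigation looks like, suppose the lineage graph is stored as a mapping from each asset to the assets it was produced from. The asset names and the `PRODUCED_FROM` structure below are hypothetical and only serve to illustrate the idea; a real platform would query its metadata store instead.

```python
from __future__ import annotations

from collections import deque

# Hypothetical lineage graph: each asset maps to the assets it was produced from.
PRODUCED_FROM = {
    "model:v7": ["job:train-42"],
    "job:train-42": ["dataset:train-2022-11", "config:lr-0.01", "code:abc123"],
    "metrics:eval-42": ["model:v7", "dataset:test-2022-11"],
    "dataset:train-2022-11": ["job:preprocess-9"],
    "job:preprocess-9": ["dataset:raw-2022-11", "code:abc123"],
}


def upstream(asset: str, graph: dict[str, list[str]]) -> list[str]:
    """Return every ancestor of `asset`, breadth-first, without duplicates."""
    seen: set[str] = set()
    order: list[str] = []
    queue = deque(graph.get(asset, []))
    while queue:
        node = queue.popleft()
        if node in seen:
            continue  # an asset can be reachable through several paths
        seen.add(node)
        order.append(node)
        queue.extend(graph.get(node, []))
    return order


# Everything that contributed to the model under investigation.
ancestors = upstream("model:v7", PRODUCED_FROM)
```

Starting from the faulty model, the traversal surfaces the training job, its dataset, configuration and code version, and the preprocessing job further upstream.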

Traceability

Traceability is a requirement for any supply chain. In software, CI/CD pipelines automate testing and deployments, and always tie a particular code asset back to a version control reference (e.g. a Git commit SHA), which in turn is associated with an author.

ML pipelines are the supply chains for trained models. Therefore, they also need to provide a high level of guarantees around traceability.

If this traceability is not inherently guaranteed by the underlying platform, it is likely that it will be incomplete, unusable, or even simply absent.

Compliance

In many cases, ML models are used in mission-critical scenarios, for example in autonomous vehicles, fraud detection, or healthcare. In such cases, failures can have dramatic effects on users, including potentially fatal consequences.

Arguably, models deployed to production without Lineage Tracking are simply a liability, with potential legal consequences.

Reproducibility

Much is said about the lack of explainability in Machine Learning. The first step towards explainability is the ability to reproduce the outcome of a particular job, in order to iterate and investigate. Without rigorous Lineage Tracking, it is virtually impossible to know exactly what code, data, and configuration were used to produce a particular model.

What needs to be tracked?

The following assets must be tracked for all executions of the training pipeline:

  • Code: Git commit SHA, container image reference for data processing, training, evaluation code, etc.
  • Configurations: all parameters of all pipeline steps (e.g. learning rate, number of epochs, normalization parameters, cut thresholds, etc.)
  • Model hyperparameters
  • Input data: exact references to the training and testing datasets
  • Annotations: what ground truth labels were used
  • Resources used: what machines were used (GPU types, CPU count, memory, storage, etc.)
  • Ownership: who ran the job, who wrote the code, who deployed the model?
  • Model: the model itself needs to be registered and tracked

This is obviously a lot of information, which makes it difficult to track it manually or in an ad-hoc manner. Instead, these should be tracked automatically by your ML platform.
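To make the list above concrete, here is a minimal sketch of what one lineage record per pipeline execution might contain. The `LineageRecord` class, its field names, and the example values are all illustrative assumptions, not the schema of any particular platform.

```python
from __future__ import annotations

import hashlib
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class LineageRecord:
    """One record per pipeline execution; all field names are illustrative."""

    git_commit_sha: str       # Code
    container_image: str      # Code
    configuration: dict       # Parameters of all pipeline steps
    hyperparameters: dict     # Model hyperparameters
    train_dataset_uri: str    # Input data
    test_dataset_uri: str     # Input data
    annotations_version: str  # Ground truth labels
    resources: dict           # Machines used
    submitted_by: str         # Ownership
    model_uri: str            # The registered model

    def fingerprint(self) -> str:
        """Stable SHA-256 hash of the record, useful to spot identical runs."""
        payload = json.dumps(asdict(self), sort_keys=True).encode("utf-8")
        return hashlib.sha256(payload).hexdigest()


# Hypothetical example values.
record = LineageRecord(
    git_commit_sha="abc123",
    container_image="registry.example.com/train:1.0",
    configuration={"normalize": True, "cut_threshold": 0.5},
    hyperparameters={"learning_rate": 0.01, "epochs": 10},
    train_dataset_uri="s3://example-bucket/train/2022-11",
    test_dataset_uri="s3://example-bucket/test/2022-11",
    annotations_version="labels-v3",
    resources={"gpu_type": "A100", "cpu_count": 8, "memory_gb": 64},
    submitted_by="alice",
    model_uri="s3://example-bucket/models/v7",
)
```

Two executions with identical code, data, and configuration produce the same fingerprint, while any change (a different learning rate, for instance) produces a different one.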

Lineage Tracking with Sematic

Sematic is the open-source Continuous Machine Learning platform. It lets you define arbitrary end-to-end ML pipelines using only Python, and without requiring any infrastructure skills.

By default, Sematic guarantees the highest level of traceability and surfaces all lineage information in its UI.

Sematic tracks:

  • Code: Git information, container images
  • Configurations, input data, hyperparameters, annotations, model: all inputs and outputs of all pipeline steps are serialized, persisted, and tracked in a metadata store
  • Resources used: all requested resources are described and stored alongside the job's metadata
  • Ownership: person or system submitting the job
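The general idea of tracking every step's inputs and outputs automatically can be sketched with a plain Python decorator. This is not Sematic's API: the `tracked` decorator, the in-memory `METADATA_STORE`, and the example steps below are hypothetical stand-ins, and JSON is assumed as the serialization format purely for simplicity.

```python
from __future__ import annotations

import functools
import inspect
import json
import uuid

# In-memory stand-in for a metadata store (hypothetical).
METADATA_STORE: list[dict] = []


def tracked(func):
    """Record the serialized inputs and output of every call to `func`."""
    signature = inspect.signature(func)

    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        # Resolve positional, keyword, and default arguments by name.
        bound = signature.bind(*args, **kwargs)
        bound.apply_defaults()
        output = func(*args, **kwargs)
        METADATA_STORE.append({
            "run_id": uuid.uuid4().hex,
            "step": func.__name__,
            "inputs": json.dumps(dict(bound.arguments), sort_keys=True),
            "output": json.dumps(output, sort_keys=True),
        })
        return output

    return wrapper


@tracked
def normalize(values: list[float], scale: float = 1.0) -> list[float]:
    peak = max(values)
    return [scale * v / peak for v in values]


@tracked
def mean(values: list[float]) -> float:
    return sum(values) / len(values)


# Running the two steps leaves one record per step in the store.
result = mean(normalize([1.0, 2.0, 4.0]))
```

Because the bookkeeping lives in the decorator rather than in each step, the author of a step does not have to remember to log anything, which is the property the article argues a platform should provide.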

This high level of tracking enables Sematic to guarantee reproducibility of your pipeline. You can trigger a re-run of any pipeline directly from the UI.

Check out Sematic, join our Discord server, and give us a star on GitHub!
