What is Lineage Tracking in Machine Learning and Why You Need It
We explain what Lineage Tracking is and why it is a requirement for any production-grade ML system.
As ML engineers, we often run hundreds of training jobs every week with different inputs, configurations, and code changes; we deploy models to production; we investigate regressions. All these tasks consume many different assets (e.g. datasets, configurations, code) and produce many more (e.g. models, metrics, inferences).
How do we keep track of the source-to-outcome relationships between all these assets? How do we know what hyperparameters were used in the 7th job we launched before the weekend? How do we know exactly which data points were used to train the model currently being used in production?
Enter Lineage Tracking.
Lineage Tracking is the systematic tracking of all assets consumed and produced by each step in an ML pipeline, in order to keep a source of truth of the lineage graph between them.
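To make the idea of a lineage graph concrete, here is a minimal sketch in plain Python: a directed graph mapping each produced asset to the assets it was derived from, with a query for everything upstream of a given model. All asset names are hypothetical examples, not any particular platform's format.

```python
# A lineage graph maps each produced asset to the assets it was derived from.
# All asset names below are hypothetical examples.
lineage = {
    "dataset:v2": ["raw_data:2023-01-10", "preprocessing_config:v5"],
    "model:v7": ["dataset:v2", "training_config:v3", "code:commit-abc123"],
    "eval_report:v7": ["model:v7", "dataset:v2"],
}

def upstream(asset: str) -> set:
    """Recursively collect every asset that contributed to `asset`."""
    parents = lineage.get(asset, [])
    result = set(parents)
    for parent in parents:
        result |= upstream(parent)
    return result

# Which assets went into the production model?
print(upstream("model:v7"))
```

With this source of truth in place, answering "what data trained the model in production?" is a single graph traversal instead of an archaeology project.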
Sure, we can do this manually: keep track of jobs in spreadsheets and notebooks, adopt a naming convention to make sure we know the origin of every model... But we all know this is tedious and impractical.
Instead, Lineage Tracking must come out-of-the-box with your ML platform.
Why do I need Lineage Tracking?
Here are five reasons why Lineage Tracking is a must-have, each of which would be a sufficient reason in itself.
Bookkeeping, experiment tracking
As you iterate on your code (e.g. training code, data processing, evaluation, etc.), you need to keep a record of every single execution, if only to know which options and sets of parameters you have already tried. This is a basic tenet of good scientific research.
Imagine you start 5 training jobs with different input datasets, configurations, and hyperparameters, and one of them clearly outperforms the others. Did you write down ahead of time which is which? Maybe, but are you confident you logged enough information? Lineage Tracking does this for you without having to think about it.
Debugging and Root Cause Analysis
It is impossible to debug an unexpected outcome without knowing the inputs of the system. If your model starts generating incorrect inferences in production, you will need to roll back to a prior version and perform Root Cause Analysis. To do so, you need to find the training job that produced the model, its evaluation metrics, the input datasets, configurations, etc., all of which are tracked by Lineage Tracking.
Traceability
Traceability is a requirement for any supply chain. In software, CI/CD pipelines automate testing and deployments, and always tie a particular code asset back to a version control reference (e.g. a Git commit SHA), which in turn is associated with an author.
ML pipelines are the supply chains for trained models. Therefore, they also need to provide a high level of guarantees around traceability.
If this traceability is not inherently guaranteed by the underlying platform, it is likely that it will be incomplete, unusable, or even simply absent.
Safety and liability
In many cases, ML models are used in mission-critical scenarios: autonomous vehicles, fraud detection, healthcare, etc. In such cases, failures can have dramatic effects on users, up to and including fatal consequences.
Arguably, models deployed to production without Lineage Tracking are simply a liability, with potential legal consequences.
Reproducibility and explainability
Much is said about the lack of explainability in Machine Learning. The first step towards explainability is the ability to reproduce the outcome of a particular job, in order to iterate and investigate. Without rigorous Lineage Tracking, it is virtually impossible to know exactly what code, data, and configuration were used to produce a particular model.
What needs to be tracked?
The following assets must be tracked for all executions of the training pipeline:
- Code: Git commit SHA, container image references for the data processing, training, evaluation code, etc.
- Configurations: all parameters of all pipeline steps (e.g. learning rate, number of epochs, normalization parameters, cut thresholds, etc.)
- Model hyperparameters
- Input data: exact references to the training and testing datasets
- Annotations: what ground truth labels were used
- Resources used: what machines were used (GPU types, CPU count, memory, storage, etc.)
- Ownership: who ran the job, who wrote the code, who deployed the model?
- Model: the model itself needs to be registered and tracked
This is obviously a lot of information, which makes tracking it manually or in an ad-hoc manner impractical. Instead, these assets should be tracked automatically by your ML platform.
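What "automatic" means here can be sketched in a few lines of plain Python: a decorator that records every step's inputs and outputs into a metadata store, so the author never has to think about logging. This is a simplified illustration, not any particular platform's API; the store, the step, and all names are hypothetical.

```python
import functools
import uuid

# Hypothetical in-memory metadata store; a real platform would persist
# these records to a database behind an API.
metadata_store: list = []

def tracked(func):
    """Record a pipeline step's inputs and outputs on every execution."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        output = func(*args, **kwargs)
        metadata_store.append({
            "run_id": str(uuid.uuid4()),
            "step": func.__name__,
            "inputs": {"args": args, "kwargs": kwargs},
            "output": output,
        })
        return output
    return wrapper

@tracked
def train(dataset: str, learning_rate: float) -> str:
    # Stand-in for a real training step; returns a model reference.
    return f"model(trained_on={dataset}, lr={learning_rate})"

model = train("dataset:v2", learning_rate=0.01)
# The metadata store now knows exactly which inputs produced `model`,
# without the author writing any logging code.
```

Because the capture happens in the platform layer rather than in user code, it cannot be forgotten, which is precisely the guarantee manual spreadsheets cannot offer.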
Lineage Tracking with Sematic
Sematic is the open-source Continuous Machine Learning platform. It lets you define arbitrary end-to-end ML pipelines using only Python, without requiring any infrastructure skills.
By default, Sematic guarantees the highest level of traceability and surfaces all lineage information in its UI:
- Code: Git information, container images
- Configurations, input data, hyperparameters, annotations, model: all inputs and outputs of all pipeline steps are serialized, persisted, and tracked in a metadata store
- Resources used: all requested resources are described and stored alongside the job's metadata
- Ownership: person or system submitting the job
This high level of tracking enables Sematic to guarantee reproducibility of your pipeline. You can trigger a re-run of any pipeline directly from the UI.