Observability for Machine Learning: what is it and what are the benefits
Observability is a term usually associated with DevOps and Infrastructure engineering. It typically refers to the ability to "observe" a workload (e.g. a web server, a cluster of virtual machines, a set of micro-services, etc.) by having access to its logs, application traces, resource usage metrics, etc. It enables fast reaction to and deep forensic analysis of incidents such as bad deploys, traffic spikes, attacks, etc.
In the pre-cloud era, engineers would "observe" workloads by simply logging into the corresponding machines and using low-level shell commands. Cloud workloads are not as easy to access since they are often behind numerous layers of security and permission controls, and, let's face it, shell commands are not the most practical tool when managing hundreds of VMs.
Nowadays, engineers use services such as Grafana, Datadog, New Relic, and more to integrate powerful observability tools directly at the heart of their workloads and monitor them from highly usable UIs.
Observability for Machine Learning
Just like any other workloads, ML processes need to be observable to enable inspection, debugging, optimization, and forensic analysis.
Observability of the training pipeline
End-to-end training pipelines interact with numerous third-party services (e.g. data warehouses, map/reduce clusters, databases, GPU clusters, etc.), can last for minutes to days, and can cost thousands of dollars (data transfers, GPU compute, etc.). Therefore, it is crucial that they are equipped with observability tools.
As your pipeline steps execute, they leverage many third-party dependencies (e.g. Pandas, NumPy, TensorFlow, PyTorch, XGBoost), all of which output valuable information such as warnings, debug data, and so on. The same goes for connections to third-party services (e.g. data warehouses, databases). All these logs are very useful for debugging and inspecting your pipeline.
When a pipeline runs in the cloud, these messages are typically logged to the standard output of the container in which the pipeline step is running. These containers typically live only for the duration of the pipeline step and are wound down after completion, preventing any post-hoc analysis.
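For example, a pipeline step can route its own logs and third-party warnings to stdout so that a container log collector can capture them. This is a minimal sketch; the function name and log format are illustrative, not part of any framework:

```python
import logging
import sys

def configure_step_logging(level: int = logging.INFO) -> logging.Logger:
    """Route all log records (including warnings) to stdout in a parseable format.

    Hypothetical helper: container log collectors read stdout, so everything
    the step and its dependencies emit ends up in one collectable stream.
    """
    handler = logging.StreamHandler(sys.stdout)
    handler.setFormatter(logging.Formatter(
        "%(asctime)s %(levelname)s %(name)s: %(message)s"
    ))
    root = logging.getLogger()
    root.handlers = [handler]      # replace any default handlers
    root.setLevel(level)
    logging.captureWarnings(True)  # route warnings.warn(...) into logging
    return root
```

Calling this once at the start of a step means library warnings (e.g. from Pandas or PyTorch) are timestamped and collected alongside your own log lines.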
If your Infrastructure team has set up log collection for the cluster on which your pipeline runs, you might be able to track down said logs in e.g. Grafana or Humio.
Sematic surfaces your pipeline logs directly in the UI. No need to forage through cloud storage for container logs or to learn your Infra team's complex tools. It's all right there.
Failures and exceptions
Failures are frequent for long end-to-end pipelines. They can be due to bugs in your code, unavailable third-party services, insufficient resources (e.g. out-of-memory errors), network failures, etc.
When those occur, they typically raise an exception in your pipeline code (e.g. a Python exception). Accessing those exceptions quickly is key to a fast turnaround time and a smooth development cycle.
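As an illustration, transient failures (network blips, briefly unavailable services) can be retried with backoff before the exception is allowed to surface. This is a hedged sketch; the decorator name, exception types, and delays are assumptions to tune for your own dependencies:

```python
import time
from functools import wraps

def retry_transient(max_attempts: int = 3, base_delay: float = 1.0,
                    transient: tuple = (ConnectionError, TimeoutError)):
    """Retry a step on transient errors with exponential backoff (sketch)."""
    def decorator(fn):
        @wraps(fn)
        def wrapper(*args, **kwargs):
            for attempt in range(1, max_attempts + 1):
                try:
                    return fn(*args, **kwargs)
                except transient:
                    if attempt == max_attempts:
                        raise  # let the final failure surface to the UI
                    # back off: 1x, 2x, 4x, ... the base delay
                    time.sleep(base_delay * 2 ** (attempt - 1))
        return wrapper
    return decorator
```

Non-transient exceptions (actual bugs) are not caught at all, so they fail fast and remain visible with their original traceback.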
Sematic surfaces exceptions and failures directly in the UI so that you don't need to look for them in container logs or third-party exception trackers.
Resource usage
ML pipelines can be hungry beasts. Data processing can require a lot of memory to load large datasets, and neural nets need expensive GPUs to run large matrix multiplications.
It is not unusual for those GPUs to sit idle for large fractions of your run time because of I/O bottlenecks (e.g. streaming training data from cloud storage). You are then effectively paying a lot of money for hardware you are not fully utilizing.
These inefficiencies can be detected and diagnosed with the right tooling. You should be able to visualize resource usage over time for all your pipeline steps: memory usage, CPU and GPU utilization, network throughput, and ephemeral storage usage.
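One way to obtain such a time series is to sample a metric probe on a background thread while the step runs. In the sketch below, `read_metric` is a stand-in for a real probe (e.g. a psutil memory reading or an nvidia-smi GPU utilization query), neither of which is bundled here:

```python
import threading
import time

class ResourceSampler:
    """Record (timestamp, value) samples of a metric while a step runs (sketch)."""

    def __init__(self, read_metric, interval: float = 1.0):
        self._read_metric = read_metric  # callable returning a number
        self._interval = interval        # seconds between samples
        self._stop = threading.Event()
        self.samples = []                # list of (timestamp, value) pairs

    def _run(self):
        while not self._stop.is_set():
            self.samples.append((time.time(), self._read_metric()))
            self._stop.wait(self._interval)  # sleep, but wake early on stop

    def __enter__(self):
        self._thread = threading.Thread(target=self._run, daemon=True)
        self._thread.start()
        return self

    def __exit__(self, *exc):
        self._stop.set()
        self._thread.join()
```

Wrapping a step body in `with ResourceSampler(...)` yields a time series you can plot to spot, for example, a GPU sitting idle while data loads.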
Sematic enables resource monitoring through its Grafana integration. You can attach a Grafana panel of your choosing to the UI, making it possible to view GPU/network/memory usage directly in Sematic.
In the future, Sematic will develop more native resource usage monitoring tools for greater visibility into your pipelines.
Cost monitoring
When training complex ML models on large datasets, the bill can quickly get out of hand. Data transfers, GPU clusters, expensive VMs, and third-party services: it all adds up.
Having granular visibility into cloud spend is the first step toward optimizing it. Ideally, every developer would know the cost of each of their pipeline executions, and leadership would be able to break down costs per team and per model. This cost breakdown can help establish resource usage policies so that higher-priority workloads execute swiftly while lower-priority jobs are queued for off-hours execution.
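As a back-of-the-envelope illustration, per-step cost can be attributed as wall-clock duration times the hourly rate of the instance the step ran on. The instance names and rates below are made up for the example, not real cloud prices:

```python
# Hypothetical instance hourly rates (USD) — illustrative values only.
HOURLY_RATES_USD = {
    "cpu-large": 0.40,
    "gpu-a100": 3.50,
}

def step_cost(duration_seconds: float, instance_type: str) -> float:
    """Cost of one step: duration (hours) times the instance hourly rate."""
    return duration_seconds / 3600 * HOURLY_RATES_USD[instance_type]

def pipeline_cost(steps) -> float:
    """Total cost; steps is an iterable of (duration_seconds, instance_type)."""
    return sum(step_cost(d, i) for d, i in steps)
```

Summing the same records grouped by team or model gives the leadership-level breakdown described above.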
Observability of the inference server
In addition to monitoring your training pipelines, it is also important to monitor your inference server.
If your model serves live inferences, it is expected to sustain a certain load below a certain latency, as per its Service Level Agreement (SLA). It is important to monitor latency over time in order to alert on-call engineers in case of deviations. As with training pipelines, monitoring resource usage (memory, GPU, etc.) is also important to avert outages.
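A simple way to operationalize this is to compute a high percentile of observed latencies over a window and flag an SLA breach. The sketch below uses a nearest-rank percentile; the 250 ms default threshold is purely illustrative:

```python
import math

def percentile(values, pct: float) -> float:
    """Nearest-rank percentile, pct in [0, 100]."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]

def breaches_sla(latencies_ms, p99_threshold_ms: float = 250.0) -> bool:
    """True when the p99 latency of the window exceeds the SLA threshold."""
    return percentile(latencies_ms, 99) > p99_threshold_ms
```

Running this over a sliding window of recent requests gives a signal that can page on-call engineers when the tail latency degrades.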
In addition to resource usage, it is a good idea to persist the inferences themselves and establish a proxy metric to evaluate model performance over time (e.g. to detect drift). This can inform when the model should be retrained on fresh data.
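For instance, a crude drift proxy can flag when the mean of recent inference scores moves more than a few baseline standard deviations away from the baseline mean. Real systems often use richer tests (population stability index, KS test); this threshold rule is only a sketch:

```python
import statistics

def score_drift(baseline, recent, k: float = 3.0) -> bool:
    """Flag drift when recent scores' mean deviates > k baseline stdevs (sketch)."""
    mu = statistics.mean(baseline)
    sigma = statistics.stdev(baseline)
    return abs(statistics.mean(recent) - mu) > k * sigma
```

A `True` result is a hint, not a verdict: it can trigger a closer look at the persisted inferences and, if confirmed, a retraining run on fresh data.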
How Sematic enables pipeline observability
Sematic is the open-source Continuous Machine Learning platform. It enables ML teams to develop and automate end-to-end training pipelines.
Sematic abstracts away infrastructure and surfaces the information you need to quickly resolve issues and iterate. It provides observability into your ML pipelines by surfacing logs and exceptions in the UI, and by letting you embed a Grafana panel for resource usage monitoring.