Getting started with Sematic in 5 minutes
Sematic is an open-source development platform for Machine Learning (ML) and Data Science. It enables users to quickly build end-to-end ML pipelines to execute on their local machine or in their cloud environment.
With integrations such as PyTorch, Kubernetes, Bazel, Snowflake, and more, it is designed to support arbitrarily complex pipelines of Python-defined business logic running on heterogeneous compute.
Pipeline steps can notably include:
- Data processing – Apache Spark jobs, Google Dataflow jobs, or other map/reduce jobs
- Model training and evaluation – PyTorch, TensorFlow, XGBoost, scikit-learn, etc.
- Metrics extraction – extract aggregate metrics from model inferences or feature datasets
- Hyperparameter tuning – iterate on configurations and trigger training jobs
- Post to third-party APIs – post labeling requests, JIRA tickets, Slack messages, etc.
- Arbitrary Python logic – really anything that can be implemented in Python.
Sematic currently supports Python 3.8, 3.9, and 3.10 on Linux and macOS. If you are on Windows, you can run Sematic in Windows Subsystem for Linux (WSL).
Sematic comes with these components:
- A lightweight Python SDK to define dynamic pipelines of arbitrary complexity
- An execution backend to orchestrate pipelines locally or in a Kubernetes cluster
- A Command Line Interface to interact with Sematic
- A web dashboard
It also includes advanced features such as run caching, fault tolerance, function retries, reruns, and more.
Sematic is most useful when deployed in your cloud infrastructure, but it can also be used entirely locally with no infrastructure required.
It can be installed using the Python package installer.
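Assuming Python and pip are available on your machine:

```shell
pip install sematic
```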
After installation, launch the Sematic web dashboard on your local machine with:
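```shell
sematic start
```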
This starts the metadata server and opens the web dashboard in your browser at http://127.0.0.1:5001. To stop the server, simply type:
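```shell
sematic stop
```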
Functions are the fundamental building blocks of work in your pipelines, and they can be nested arbitrarily, much like conventional Python functions. All of the business logic for your pipeline steps lives in these functions.
A Sematic Function's inputs and outputs are serialized and tracked in the database, and its execution state is monitored as well. In the Dashboard, Sematic Functions are shown as Runs.
Consider this Sematic function:
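```python
import sematic


@sematic.func
def add(a: int, b: int) -> int:
    # Decorating a regular Python function with @sematic.func
    # turns it into a Sematic Function.
    return a + b
```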
You will notice that this is just a regular Python function decorated with a Sematic decorator. Its input artifacts (a: int, b: int) and its output are type-checked, tracked, and visualized in the Sematic Dashboard.
Let’s create a simple pipeline to fully understand how Sematic works.
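You can scaffold a new project with the Sematic CLI; here the project name tutorial matches the directory described below:

```shell
sematic new tutorial
```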
A Python package with some boilerplate code will be created with the following files present in the tutorial/ directory:
- __main__.py: This is the typical entry point of any Python package.
- pipeline.py: This is where your pipeline and its nested steps are defined. You can define multiple pipelines and pilot their executions from the __main__.py file.
- requirements.txt: This is where you can keep the external dependencies specific to your project.
In the tutorial/pipeline.py, add the following code:
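For example, a minimal pipeline that greets a user by name (the function names and logic here are purely illustrative):

```python
import sematic


@sematic.func
def greet(name: str) -> str:
    # A single pipeline step: build a greeting string.
    return f"Hello, {name}!"


@sematic.func
def pipeline(name: str) -> str:
    # The pipeline itself is also a Sematic Function;
    # calling greet here returns a Future that Sematic resolves.
    return greet(name)
```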
In the tutorial/__main__.py, add the following code:
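A minimal entry point, assuming pipeline.py defines a Sematic Function called pipeline that takes a name argument, and using Sematic's resolve() API, might look like:

```python
import argparse

from tutorial.pipeline import pipeline


def main():
    parser = argparse.ArgumentParser("Tutorial pipeline")
    parser.add_argument("--name", type=str, required=True)
    args = parser.parse_args()

    # Calling pipeline() returns a Future; resolve() executes it
    # and records the run under the given name.
    pipeline(args.name).set(name="Tutorial pipeline").resolve()


if __name__ == "__main__":
    main()
```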
And you are done. The next step is to run the pipeline; note that you will need to pass a --name argument when running this code.
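For example ("Alice" is just a placeholder value):

```shell
python3 -m tutorial --name "Alice"
```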
And with that, you just created your first pipeline :)
Head over to the web dashboard and you will find your first pipeline.
Click on the pipeline to discover information such as the run ID, the latest runs, the nested runs (your Python functions), the inputs, output, source code, logs, resources, and a Note panel (bottom right corner) where you can leave a note for your team members.
In the Execution Graph panel, your pipeline is represented as a nested Directed Acyclic Graph (DAG).
Learn more about the web dashboard here.
With this understanding, let’s build a simple example pipeline for MNIST in PyTorch.
Start a new project
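As before, you can scaffold the project with the Sematic CLI (mnist_pytorch is just our chosen name):

```shell
sematic new mnist_pytorch
```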
Load the dataset
You will use the baseline MNIST dataset that ships with PyTorch's torchvision package.
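One way to wrap the torchvision MNIST dataset in a Sematic Function — the function name and defaults here are our own:

```python
import sematic
from torch.utils.data import Dataset
from torchvision import datasets, transforms


@sematic.func
def load_mnist_dataset(train: bool, path: str = "/tmp/mnist") -> Dataset:
    # Downloads MNIST on first use and returns it as tensors.
    return datasets.MNIST(
        root=path,
        train=train,
        download=True,
        transform=transforms.ToTensor(),
    )
```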
Get a dataloader
To feed this data into the model for training and testing, create a PyTorch dataloader.
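A sketch of such a step, with an assumed batch size:

```python
import sematic
from torch.utils.data import DataLoader, Dataset


@sematic.func
def get_dataloader(dataset: Dataset, batch_size: int = 64) -> DataLoader:
    # Batches the dataset for training or evaluation.
    return DataLoader(dataset, batch_size=batch_size)
```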
Train the model
Now that the data is ready, train the model.
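For illustration, here is a training step built around a small fully connected network; the architecture, hyperparameters, and names are our own, and a real pipeline would likely use a convolutional network:

```python
import sematic
import torch
import torch.nn as nn
from torch.utils.data import DataLoader


class Net(nn.Module):
    """A small fully connected classifier for 28x28 MNIST images."""

    def __init__(self):
        super().__init__()
        self.layers = nn.Sequential(
            nn.Flatten(),
            nn.Linear(28 * 28, 128),
            nn.ReLU(),
            nn.Linear(128, 10),
        )

    def forward(self, x):
        return self.layers(x)


@sematic.func
def train_model(
    train_loader: DataLoader, epochs: int = 1, learning_rate: float = 1e-3
) -> nn.Module:
    model = Net()
    optimizer = torch.optim.Adam(model.parameters(), lr=learning_rate)
    loss_fn = nn.CrossEntropyLoss()
    model.train()
    for _ in range(epochs):
        for images, labels in train_loader:
            optimizer.zero_grad()
            loss = loss_fn(model(images), labels)
            loss.backward()
            optimizer.step()
    return model
```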
Evaluate the model
After the model has been trained, you want to assess how well it performed on the test dataset.
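A sketch of an evaluation step that computes accuracy on the test set (returning a richer metrics type is equally possible):

```python
import sematic
import torch
import torch.nn as nn
from torch.utils.data import DataLoader


@sematic.func
def evaluate_model(model: nn.Module, test_loader: DataLoader) -> float:
    model.eval()
    correct, total = 0, 0
    with torch.no_grad():
        for images, labels in test_loader:
            predictions = model(images).argmax(dim=1)
            correct += (predictions == labels).sum().item()
            total += labels.size(0)
    # Fraction of test images classified correctly.
    return correct / total
```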
The end-to-end pipeline
You can now combine everything into an end-to-end pipeline.
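Assuming the dataset, dataloader, training, and evaluation steps above are Sematic Functions named load_mnist_dataset, get_dataloader, train_model, and evaluate_model (our names), the pipeline simply wires them together:

```python
import sematic


@sematic.func
def pipeline(batch_size: int = 64, epochs: int = 1) -> float:
    # Each call below returns a Future; Sematic resolves the
    # resulting DAG and tracks every intermediate artifact.
    train_dataset = load_mnist_dataset(train=True)
    test_dataset = load_mnist_dataset(train=False)
    train_loader = get_dataloader(train_dataset, batch_size)
    test_loader = get_dataloader(test_dataset, batch_size)
    model = train_model(train_loader, epochs)
    return evaluate_model(model, test_loader)
```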
Finally, the launch script
To execute the pipeline, create a launch script in the __main__.py file.
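Assuming the package is named mnist_pytorch and using Sematic's resolve() API:

```python
from mnist_pytorch.pipeline import pipeline


def main():
    # Resolve the pipeline locally; the run appears in the dashboard.
    pipeline().set(name="MNIST PyTorch example").resolve()


if __name__ == "__main__":
    main()
```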
Run the pipeline and see what the execution graph and visualizations look like in the web dashboard.
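If the package is named mnist_pytorch:

```shell
python3 -m mnist_pytorch
```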
On your local development machine, Sematic lets you iterate, prototype, and debug your pipelines before submitting them to run in your cloud environment's Kubernetes cluster, where they can make use of resources such as GPUs and large-memory instances.
To get the most out of Sematic, take advantage of its wide range of features: step retries, pipeline nesting, local execution, a lightweight Python SDK, artifact visualization, pipeline reruns, step caching, and many more.
Check out our documentation, join our Discord server, subscribe to our YouTube channel, and star us on GitHub.