Tuning and Testing Llama 2, FLAN-T5, and GPT-J with LoRA, Sematic, and Gradio
In recent months, it’s been hard to miss all the news about Large Language Models and the rapidly developing set of technologies around them. Although proprietary, closed-source models like GPT-4 have drawn a lot of attention, there has also been an explosion in open-source models, libraries, and tools. With all these developments, it can be hard to see how all the pieces fit together. One of the best ways to learn is by example, so let’s set ourselves a goal and see what it takes to accomplish it. We’ll summarize the technology and key ideas we use along the way. Whether you’re a language model newbie or a seasoned veteran, hopefully you can learn something as we go. Ready? Let’s dive in!
Let's set a well-defined goal for ourselves: building a tool that can summarize information into a shorter representation. Summarization is a broad topic, with different properties for models that would be good at summarizing news stories, academic papers, software documentation, and more. Rather than focusing on a specific domain, let's create a tool that can be used for various summarization tasks, while being willing to invest computing power to make it work better in a given subdomain.
Let’s set a few more criteria. Our tool should:
- Be able to pull from a variety of kinds of data to improve performance on a specific sub-domain of summarization
- Run on our own devices (including possibly VMs in the cloud that we’ve specified)
- Allow us to experiment using only a single machine
- Put us on the path to scale up to a cluster when we’re ready
- Be capable of leveraging state-of-the-art models for a given set of compute constraints
- Make it easy to experiment with different configurations so we can search for the right setup for a given domain
- Enable us to export our resulting model for usage in a production setting
Sounds intimidating? You might be surprised how far we can get if we know where to look!
Looking at our goal of being able to achieve good performance on a specific sub-domain, there are a few options that might occur to you. We could:
- Train our own model from scratch
- Use an existing model “off the shelf”
- Take an existing model and “tweak” it a bit for our custom purposes
Training a “near state of the art” model from scratch can be complex, time consuming, and costly. So that option is likely not the best. Using an existing model “off the shelf” is far easier, but might not perform as well on our specific subdomain. We might be able to mitigate that somewhat by being clever with our prompting or combining multiple models in ingenious ways, but let’s take a look at the third option. This option, referred to as “fine-tuning,” offers the best of both worlds: we can leverage an existing powerful model, while still achieving solid performance on our desired task.
Even once we’ve decided to fine-tune, there are multiple choices for how we can perform the training:
- Make the entire model “flexible” during training, allowing it to explore the full parameter space that it did for its initial training
- Train a smaller number of parameters than were used in the original model
While it might seem that we need the first approach to achieve full flexibility, it turns out that the second can be far cheaper (in terms of time and resource costs) while remaining just as powerful. Training a smaller number of parameters is generally referred to as “Parameter Efficient Fine Tuning,” or “PEFT” for short.
There are several mechanisms for PEFT, but one method that seems to achieve some of the best overall performance as of this writing is referred to as “Low Rank Adaptation,” or LoRA. If you’d like a detailed description, here’s a great explainer. Or if you’re academically inclined, you can go straight to the original paper on the technique.
Modern language models have many layers that perform different operations. Each one takes the output tensors of the previous layers to produce the output tensors for the layers that follow. Many (though not all) of these layers have one or more trainable matrices that control the specific transformation they will apply. Considering just a single such layer with one trainable matrix W, we can think of fine-tuning as looking for a matrix 𝚫W that we can add to the original to get the weights for the final model: W’ = W + 𝚫W.
If we just looked to find 𝚫W directly, we’d have to use just as many parameters as were in the original layer. But if we instead define 𝚫W as the product of two smaller matrices 𝚫W = A X B, we can potentially have far fewer parameters to learn. To see how the numbers work out, let’s say 𝚫W is an NxN matrix. Given the rules of matrix multiplication, A must have N rows, and B must have N columns. But we get to choose the number of columns in A and the number of rows in B as we see fit (so long as they match up!). So A is an Nxr matrix and B is an rxN matrix. The number of parameters in 𝚫W is N², but the number of parameters in A & B is Nr + rN = 2Nr. By choosing an r that’s much less than N, we can reduce the number of parameters we need to learn significantly!
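To make the arithmetic above concrete, here is a minimal sketch that compares the two parameter counts. The layer width of 4096 is a hypothetical value chosen for illustration, not a number taken from any of the models discussed here.

```python
def full_update_params(n: int) -> int:
    """Parameters needed to learn an N x N delta matrix directly."""
    return n * n


def lora_update_params(n: int, r: int) -> int:
    """Parameters needed for A (N x r) plus B (r x N)."""
    return n * r + r * n


n = 4096  # hypothetical layer width, for illustration only
for r in (1, 8, 64):
    full = full_update_params(n)
    low_rank = lora_update_params(n, r)
    print(f"r={r}: {low_rank} instead of {full} parameters ({full // low_rank}x fewer)")
```

The savings shrink as r grows, which is why the choice of r is a trade-off rather than a free win.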
So why not just always choose r=1? Well, the smaller r is, the less “freedom” there is for what 𝚫W can look like (formally, the less independent the parameters of 𝚫W will be). So for very small r values, we might not be able to capture the nuances of our problem domain. In practice, we can typically achieve significant reductions in learnable parameters without sacrificing performance on our target problem.
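One way to see this loss of “freedom” is through matrix rank: the product of an N x r matrix and an r x N matrix can never have rank greater than r. The sketch below checks this with exact fractions on small random matrices. The helper names are made up for this illustration.

```python
import random
from fractions import Fraction


def matmul(a, b):
    """Plain-Python matrix product of two lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def rank(matrix):
    """Rank via Gaussian elimination over exact fractions."""
    rows = [[Fraction(v) for v in row] for row in matrix]
    found = 0  # number of pivot rows found so far
    for col in range(len(rows[0])):
        pivot = next((i for i in range(found, len(rows)) if rows[i][col] != 0), None)
        if pivot is None:
            continue  # no pivot in this column
        rows[found], rows[pivot] = rows[pivot], rows[found]
        for i in range(found + 1, len(rows)):
            factor = rows[i][col] / rows[found][col]
            rows[i] = [x - factor * p for x, p in zip(rows[i], rows[found])]
        found += 1
    return found


def random_low_rank_delta(n, r, seed=0):
    """Build delta_W = A x B with A of shape N x r and B of shape r x N."""
    rng = random.Random(seed)
    a = [[rng.randint(-5, 5) for _ in range(r)] for _ in range(n)]
    b = [[rng.randint(-5, 5) for _ in range(n)] for _ in range(r)]
    return matmul(a, b)


for r in (1, 2, 3):
    delta_w = random_low_rank_delta(n=6, r=r)
    print(f"r={r}: rank of the 6x6 delta is {rank(delta_w)}")
```

With r=1, every row of 𝚫W is a multiple of the single row of B, so the update can only move the weights along one direction per layer.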
As one final aside in this technical section (no more math after this, I promise!), you could imagine that after tuning we might want to represent 𝚫W as 𝚫W = ⍺(A X B), with ⍺ as a scaling factor for our decomposed weights. Setting it to 1 would leave us with the same ratio of “original model” behavior to “tuned model” behavior as we had during training. But we might want to amplify or suppress these behaviors relative to one another in production.
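As a rough sketch of the scaling idea, the snippet below merges a low-rank update into a tiny weight matrix with an adjustable ⍺. The function names are hypothetical. Note also that the Hugging Face PEFT library parameterizes the scale slightly differently (it multiplies the update by lora_alpha / r), so treat this as an illustration of the formula above rather than of any library's internals.

```python
def matmul(a, b):
    """Plain-Python matrix product of two lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def merge_lora(w, a, b, alpha=1.0):
    """Return W' = W + alpha * (A x B) without modifying the inputs."""
    delta = matmul(a, b)
    return [
        [w_ij + alpha * d_ij for w_ij, d_ij in zip(w_row, d_row)]
        for w_row, d_row in zip(w, delta)
    ]


w = [[1, 0], [0, 1]]  # toy "original" weights
a = [[1], [2]]        # N x r with r = 1
b = [[3, 4]]          # r x N

print(merge_lora(w, a, b, alpha=1.0))  # tuned behavior as trained
print(merge_lora(w, a, b, alpha=0.0))  # original model only
print(merge_lora(w, a, b, alpha=0.5))  # tuned behavior suppressed by half
```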
The above should help give you some intuition for what you’re doing as you play around with the hyperparameters for LoRA, but to summarize at a high level, LoRA will require the following hyperparameters that will have to be determined via experimentation:
- r: the free dimension for decomposing the weight matrices into smaller factors. Higher values give the fine-tuning more freedom to capture the nuances of the target domain, but at the cost of increasing the computational resources (compute, memory, and storage) required for the tuning. In practice, values as low as 1 can do the trick, and values greater than around 64 generally seem to add little to the final performance.
- layer selection: as mentioned, not all layers can be tuned at all, nor do all layers have a 2d tensor (aka a matrix) as their parameters. Even for the layers that do meet our requirements, we may or may not want/need to fine-tune all of them.
- ⍺: a factor controlling how much of the tuned behavior will be amplified or suppressed once our model is done training and ready to perform evaluation.
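The three hyperparameters above map fairly directly onto the LoraConfig class in the Hugging Face PEFT library. The sketch below is one way to carry them around. The LoraHyperparams dataclass and its default values are assumptions made for this illustration, and the "q" and "v" module names are the attention projections in the Hugging Face T5 implementation, so other model families will need different names.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class LoraHyperparams:
    """Hypothetical container for the knobs discussed above."""

    r: int = 8            # free dimension of the decomposition
    lora_alpha: int = 16  # PEFT scales the update by lora_alpha / r
    # Layer selection: which sub-modules receive A/B matrices.
    target_modules: List[str] = field(default_factory=lambda: ["q", "v"])

    def to_peft_kwargs(self) -> dict:
        return {
            "r": self.r,
            "lora_alpha": self.lora_alpha,
            "target_modules": list(self.target_modules),
        }


def build_lora_config(hyperparams: LoraHyperparams):
    """Create a PEFT LoraConfig for a sequence-to-sequence model.

    The import is local so the dataclass above can be used
    without PEFT installed.
    """
    from peft import LoraConfig, TaskType

    return LoraConfig(task_type=TaskType.SEQ_2_SEQ_LM, **hyperparams.to_peft_kwargs())


print(LoraHyperparams(r=4).to_peft_kwargs())
```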
Selecting a Model
Now that we’ve decided to fine-tune an existing model using LoRA, we need to choose which model(s) we will be tuning. In our goals, we mentioned working with different compute constraints. We also decided that we would be focusing on summarization tasks. Rather than simply extending a sequence of text (so-called “causal language modeling,” the default approach used by the GPT class of models), this task looks more like taking one input sequence (the thing to summarize) and producing one output sequence (the summary). Thus we might require less fine-tuning if we pick a model designed for “sequence-to-sequence” language modeling out of the box. However, many of the most powerful language models available today use causal language modeling, so we might also want to consider a model that takes that approach and rely on fine-tuning and clever prompting to teach it to produce an output sequence that relates to the input one.
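To illustrate the difference in how the two kinds of model see a summarization example, here is a small sketch. The prompt templates are hypothetical and are not the ones used by the pipeline described in this post. The "summarize: " prefix follows the convention of the original T5 models.

```python
def seq2seq_input(text: str) -> str:
    """A sequence-to-sequence model gets the text as input and the
    summary as a separate target sequence."""
    return "summarize: " + text


def causal_prompt(text: str) -> str:
    """A causal model only continues text, so the task has to be
    spelled out in the prompt itself."""
    return f"Summarize the following text.\n\nText: {text}\n\nSummary:"


def causal_training_example(text: str, summary: str) -> str:
    """For fine-tuning a causal model, prompt and summary are joined
    into one sequence that the model learns to continue."""
    return causal_prompt(text) + " " + summary


article = "The city council voted on Tuesday to expand the bike lane network."
print(seq2seq_input(article))
print(causal_training_example(article, "Council approves more bike lanes."))
```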
Google has released a language model known as FLAN-T5 that:
- Is trained on a variety of sequence-to-sequence tasks
- Comes in a variety of sizes, from something that comfortably runs on an M1 Mac to something large enough to score well on competitive benchmarks for complex tasks
- Is licensed for open-source usage (Apache 2)
- Has achieved “state-of-the-art performance on several benchmarks” (source)
It looks like a great candidate for our goals.
Meta’s Llama 2 is another option. While it is a causal language model, and thus might require more fine-tuning, it:
- Has ranked at the top of many benchmarks for models with comparable numbers of parameters
- Is released under Meta’s Llama 2 Community License, which permits commercial use under certain conditions (it is not Apache 2)
- Comes in a variety of sizes, to suit different use cases and constraints
Let’s give it a shot too.
GPT-J 6B is another causal language model. It:
- comes from the well-known GPT class of models
- has achieved solid performance on benchmarks
- and has a number of parameters that puts it solidly in the class of large language models while remaining small enough to play around with on a single cloud VM without breaking the bank
Let’s give it a shot too.
Selecting some frameworks
Now that we have all the academic stuff out of the way, it’s time for the rubber to meet the road with some actual tooling. Our goals cover a lot of territory. We need to find tools that help us:
- Manage (retrieve, store, track) our models
- Interface with hardware
- Perform the fine-tuning
- Perform some experimentation as we go through the fine-tuning process. This might include:
  - tracking the experiments we’ve performed
  - visualizing the elements of our experiments
  - keeping references between our configurations, models, and evaluation results
  - allowing for a rapid “try a prompt and get the output” loop
- Prepare us for productionizing the process that produces our final model
As it turns out, there are three tool suites we can combine with ease to take care of all these goals. Let’s take a look at them one-by-one.
The biggest workhorse in our suite of tools will be Hugging Face. They have been in the language modeling space since long before “LLM” was on everyone’s lips, and they’ve put together a suite of interoperable libraries that have continued to evolve along with the cutting edge.
One of Hugging Face’s most central products is the Hugging Face Hub. What GitHub is for source code, Hugging Face Hub is for models, datasets, and more. Indeed, it actually uses git (plus git-lfs) to store the objects it tracks. It takes the familiar concepts of repositories, repository owners, and even pull-requests, and uses them in the context of datasets and models. Here’s the repository tree for the base FLAN-T5 model, for example. Many state-of-the-art models and datasets are hosted on this hub.
Another keystone in the Hugging Face suite is their transformers library. It provides a suite of abstractions around downloading and using pre-trained models from their hub. It wraps lower-level modeling frameworks like PyTorch, TensorFlow, and JAX, and can provide interoperability between them.
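As a minimal sketch of what using transformers looks like, the snippet below picks the appropriate Auto class for a model and generates a summary. The helper names and the prefix-based check are assumptions made for this illustration. The transformers imports are kept inside the functions so that the selection logic can be read and run on its own.

```python
SEQ2SEQ_PREFIXES = ("google/flan-t5",)  # assumption: only FLAN-T5 is seq2seq here


def is_seq2seq(model_id: str) -> bool:
    """Decide which modeling approach a Hub model ID uses."""
    return model_id.startswith(SEQ2SEQ_PREFIXES)


def load_model_and_tokenizer(model_id: str):
    """Download (or load from cache) a pre-trained model and its tokenizer."""
    from transformers import AutoModelForCausalLM, AutoModelForSeq2SeqLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model_cls = AutoModelForSeq2SeqLM if is_seq2seq(model_id) else AutoModelForCausalLM
    return model_cls.from_pretrained(model_id), tokenizer


def generate_text(model, tokenizer, prompt: str, max_new_tokens: int = 64) -> str:
    """Run generation. For causal models the decoded output also
    contains the prompt, since the model extends its input."""
    inputs = tokenizer(prompt, return_tensors="pt", truncation=True)
    output_ids = model.generate(**inputs, max_new_tokens=max_new_tokens)
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)


for model_id in ("google/flan-t5-small", "EleutherAI/gpt-j-6b"):
    kind = "sequence-to-sequence" if is_seq2seq(model_id) else "causal"
    print(f"{model_id}: {kind}")
```

A typical call would be load_model_and_tokenizer("google/flan-t5-small") followed by generate_text(model, tokenizer, "summarize: ..."), which downloads the weights on first use.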
The next piece of the Hugging Face toolkit we’ll be using is their Accelerate library, which helps us make effective use of the resources provided by different hardware configurations without much extra configuration. If you’re interested, Accelerate can also be used to enable distributed training when starting from non-distributed PyTorch code.
A new kid on the proverbial Hugging Face block is PEFT. Recall this acronym for “Parameter Efficient Fine Tuning” from above? This library will allow us to work with LoRA for fine-tuning, and treat the matrices that generate the weight deltas as models (sometimes referred to as adapters) in their own right. That means we can upload them to the Hugging Face Hub once we’re satisfied with the results. It also supports other fine-tuning methods, but for our purposes we’ll stick with LoRA.
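Here is a hedged sketch of how PEFT is typically used: wrap a base model with a LoRA configuration, check how few parameters remain trainable, and later push only the adapter to the Hub. The function names are made up for this example, and export_adapter assumes you are already authenticated with the Hub and that repo_id is a repository you own.

```python
from typing import Iterable, Tuple


def count_trainable(parameters: Iterable) -> Tuple[int, int]:
    """Return (trainable, total) element counts for PyTorch-style
    parameters, i.e. objects with .numel() and .requires_grad."""
    trainable = total = 0
    for param in parameters:
        size = param.numel()
        total += size
        if param.requires_grad:
            trainable += size
    return trainable, total


def wrap_with_lora(base_model, lora_config):
    """Freeze the base model and attach trainable LoRA matrices."""
    from peft import get_peft_model  # local import: needs `peft` installed

    peft_model = get_peft_model(base_model, lora_config)
    trainable, total = count_trainable(peft_model.parameters())
    print(f"trainable parameters: {trainable} of {total}")
    return peft_model


def export_adapter(peft_model, repo_id: str) -> None:
    """Upload only the adapter weights, not the full base model."""
    peft_model.push_to_hub(repo_id)
```

Because only the adapter is uploaded, the exported artifact is small, and anyone loading it also needs access to the original base model.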
Sematic will help us track & visualize our experiments, keep references between our configurations/models/evaluation results, and prepare us for productionization. Sematic not only handles experiment management, but is also a fully-featured cloud orchestration engine targeted at ML use cases. If we start with it for our local development, we can move our train/eval/export pipeline to the cloud once we’re ready to do so without much overhead.
There’s still one piece missing: ideally once we’ve trained a model and gotten some initial evaluation results, we’d like to be able to interactively feed the model inputs and see what it produces. Gradio is ideally suited for this task, as it will allow us to develop a simple app hooked up to our model with just a few lines of Python.
Tying it all together
Armed with this impressive arsenal of tooling, how do we put it all together? We can use Sematic to define and chain together the steps in our workflow just using regular Python functions, decorated with the @sematic.func decorator.
This will give us:
- A graph view to monitor execution of the experiment as it progresses through the various steps
- A dashboard to keep track of our experiments, notes, inputs, outputs, source code, and more. This includes links to the resources we’re using/producing on Hugging Face Hub, and navigable configuration & result displays. Sematic EE users can get access to even more, like live metrics produced during training and evaluation.
- A search UI to track down specific experiments we might be interested in
- The basic structure we need to scale our pipeline up to cloud scale. When we’re ready, we can even add distributed inference using Sematic’s integration with Ray.
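To give a feel for the shape of such a pipeline, here is a deliberately simplified sketch. The step names, the TuningConfig dataclass, and the placeholder bodies are all hypothetical, and this is not the pipeline from the linked source. Only the @sematic.func decorator itself comes from Sematic. The Sematic import lives inside build_pipeline so the plain functions can be run without it, and in a real project the decorated functions would normally be defined at module level.

```python
from dataclasses import dataclass


@dataclass
class TuningConfig:
    """Hypothetical configuration passed through the pipeline."""

    model_id: str
    dataset: str
    lora_r: int


def train_step(config: TuningConfig) -> str:
    # Placeholder: real code would fine-tune and return a model reference.
    return f"{config.model_id}-lora-r{config.lora_r}-{config.dataset}"


def evaluate_step(model_ref: str) -> str:
    # Placeholder: real code would compute evaluation results.
    return f"evaluated {model_ref}"


def build_pipeline():
    """Wrap the plain functions as Sematic functions and chain them."""
    import sematic  # requires Sematic to be installed

    @sematic.func
    def train(config: TuningConfig) -> str:
        return train_step(config)

    @sematic.func
    def evaluate(model_ref: str) -> str:
        return evaluate_step(model_ref)

    @sematic.func
    def pipeline(config: TuningConfig) -> str:
        # Calling a Sematic function returns a future; passing that
        # future to the next step is what defines the graph.
        return evaluate(train(config))

    return pipeline


config = TuningConfig(model_id="google/flan-t5-base", dataset="cnn_dailymail", lora_r=8)
print(evaluate_step(train_step(config)))  # the same steps, run as plain Python
```

How the resulting pipeline future is executed depends on the Sematic version, so consult the Sematic documentation for the run call that matches your installation.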
After defining our basic pipeline structure with Sematic, we need to define the Hugging Face code with transformers & PEFT.
This requires a bit more effort than the Sematic setup, but it’s still quite a manageable amount of code given the power of what we’re doing. The full source can be found here. Luckily, usage of the “accelerate” library comes essentially for free once you have installed it alongside transformers & PEFT.
Finally, we need to hook up Gradio. It takes only a few lines of Python to define our Gradio app.
This app will have a text input, a text output, a run button (to invoke the model and get a summary using the context), and a stop button (to close the Gradio app and allow the Sematic pipeline to continue). We’ll keep track of all the input contexts and output summaries in a history object (essentially just a list of prompt/response pairs) to be visualized in the dashboard for the Sematic pipeline. This way we can always go back to a particular pipeline execution later and see a transcript of our interactive trials. The interactive app will look like this:
The transcript will be displayed as the output of the launch_interactively step in our pipeline.
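A minimal sketch of an app along these lines is shown below, using Gradio's Blocks API. It is an approximation written for this post, not the code from the linked source: the run_and_record helper is hypothetical, and the stop button here uses a threading.Event so that the calling function can resume once the user is done.

```python
import threading
from typing import Callable, List, Tuple

History = List[Tuple[str, str]]  # (context, summary) pairs


def run_and_record(context: str, summarize: Callable[[str], str], history: History) -> str:
    """Invoke the model and remember the exchange for the transcript."""
    summary = summarize(context)
    history.append((context, summary))
    return summary


def launch_interactively(summarize: Callable[[str], str]) -> History:
    """Serve the app until "Stop" is clicked, then return the transcript."""
    import gradio as gr  # local import: needs `gradio` installed

    history: History = []
    stop_requested = threading.Event()

    with gr.Blocks() as demo:
        context_box = gr.Textbox(label="Context", lines=10)
        summary_box = gr.Textbox(label="Summary")
        run_button = gr.Button("Run")
        stop_button = gr.Button("Stop")
        run_button.click(
            fn=lambda context: run_and_record(context, summarize, history),
            inputs=context_box,
            outputs=summary_box,
        )
        stop_button.click(fn=lambda: stop_requested.set(), inputs=None, outputs=None)

    demo.launch(prevent_thread_lock=True)  # do not block this thread
    stop_requested.wait()                  # block until "Stop" is clicked
    demo.close()
    return history
```

Keeping run_and_record separate from the UI wiring makes the transcript logic easy to test without starting a server.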
We’ve set up this script so that via the command line we use to launch, we can change:
- The model (selecting from one of the FLAN-T5 variants or GPT-J 6B)
- The training hyperparameters
- The dataset used
- The Hugging Face repo to export the result to, if we even want to export the result
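A command line covering these four knobs could be sketched with argparse as follows. All flag names and defaults here are hypothetical and may not match the actual script, so check its help output for the real options.

```python
import argparse


def parse_args(argv=None) -> argparse.Namespace:
    parser = argparse.ArgumentParser(
        description="Fine-tune a summarization model with LoRA (illustrative flags only)"
    )
    parser.add_argument("--model", default="google/flan-t5-base",
                        help="Hugging Face Hub ID of the model to tune")
    parser.add_argument("--dataset", default="cnn_dailymail",
                        help="Hugging Face dataset with a text and a summary column")
    parser.add_argument("--lora-r", type=int, default=8,
                        help="Free dimension r of the LoRA decomposition")
    parser.add_argument("--learning-rate", type=float, default=1e-3,
                        help="Optimizer learning rate")
    parser.add_argument("--export-repo", default=None,
                        help="Hub repo to export the adapter to (omit to skip export)")
    return parser.parse_args(argv)


args = parse_args(["--model", "EleutherAI/gpt-j-6b", "--lora-r", "16"])
print(args.model, args.lora_r, args.export_repo)
```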
Let’s take a look at some of the results we get.
CNN Daily Mail Article Summarization
The default dataset used by our pipeline is cnn_dailymail, from Hugging Face. It contains articles from CNN and the Daily Mail paired with summaries of those articles. Using the FLAN-T5 large variant, we were able to produce some good summaries, such as the one below.
Not all results were perfect though. For example, the one below contains some repetition and misses some key information in the summary (like the names of the headliners).
Amazon Review Headline Suggestion
To demonstrate the flexibility that can be achieved with fine-tuning, we also used a fairly different use case for our second tuning. This time we leveraged the amazon_us_reviews dataset, pairing a review with the review’s headline, which could be considered a summary of the review’s content.
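Both datasets reduce to the same shape: one text column and one summary column. The sketch below shows one way to normalize records into such pairs. The column names are those published with the two datasets, while the helper names and the small default split are assumptions made for this example. The availability of a given dataset on the Hub can change over time, and some datasets require a configuration name (for example "3.0.0" for cnn_dailymail).

```python
from typing import Dict, Optional, Tuple

# (text column, summary column) for the two datasets discussed above.
DATASET_COLUMNS: Dict[str, Tuple[str, str]] = {
    "cnn_dailymail": ("article", "highlights"),
    "amazon_us_reviews": ("review_body", "review_headline"),
}


def to_pair(record: dict, text_column: str, summary_column: str) -> dict:
    """Normalize any record into a context/summary pair."""
    return {
        "context": record[text_column].strip(),
        "summary": record[summary_column].strip(),
    }


def load_pairs(name: str, config_name: Optional[str] = None, split: str = "train[:1%]"):
    """Load a slice of a Hub dataset and map it to context/summary pairs."""
    from datasets import load_dataset  # local import: needs `datasets` installed

    text_column, summary_column = DATASET_COLUMNS[name]
    dataset = load_dataset(name, config_name, split=split)
    return dataset.map(
        lambda record: to_pair(record, text_column, summary_column),
        remove_columns=dataset.column_names,
    )


review = {"review_body": " Fits well and arrived quickly. ", "review_headline": "Great fit"}
print(to_pair(review, *DATASET_COLUMNS["amazon_us_reviews"]))
```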
Try it out yourself!
Think this example might actually be useful to you? It’s free and open-source! All you need to do to use it is install Sematic 0.32.0.
Then follow the instructions here.
You can fine-tune any of the supported models on any Hugging Face dataset with two text columns (where one column contains the summaries of the other). Tuning the large FLAN variants, Llama 2 models, or GPT-J may require machines with at least 24 GB of GPU memory. However, the small and base FLAN variants have been successfully tuned on M1 MacBooks. Hop on our Discord if you have any suggestions or requests, or even if you just want to say hi!