Running Llama2 as a web service on an M1 Macbook Pro

August 4, 2023

I don’t have CUDA… Now what?

CUDA really revolutionized the AI space, and as a result Nvidia has dominated the market for AI acceleration for years. That feels a little crazy in 2023, when there are so many other powerful hardware options around.
In my time learning how to work with text-generation models, I’ve discovered that in the world of AI, you either have an Nvidia GPU, or (for all intents and purposes) you don’t have a GPU at all.
Apple’s M1 and M2 chips are pretty phenomenal processors, with powerful, efficient CPUs and GPUs as well as a dedicated AI accelerator (the Neural Engine). Unfortunately, a lot of that power goes unused, because most libraries that make AI models easy to use only support CUDA acceleration. There’s a good reason for that: CUDA is the most feature-complete backend in libraries like PyTorch and TensorFlow.

So what can we do about it?

Well, the good news is PyTorch recently announced support for Metal! Metal is Apple’s proprietary framework for building GPU-accelerated applications on Apple hardware.
The support is fairly limited, and you can’t do everything that CUDA can (just yet), but it’s enough to get started with a bit of GPU acceleration on Apple hardware. Look at the performance gains!
Cool, so just set torch.device("mps") and you’re off to the races, no? Maybe, but not quite…
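Before hard-coding "mps" anywhere, it can help to ask PyTorch whether the Metal backend is actually visible. Here’s a minimal sketch of that check. The pick_device helper is my own name, not something from PyTorch, and it simply falls back to "cpu" when PyTorch or the MPS backend is missing.

```python
def pick_device() -> str:
    """Return "mps" when PyTorch can see the Metal backend, otherwise "cpu"."""
    try:
        import torch
    except ImportError:
        return "cpu"  # PyTorch isn't installed, so there's nothing to accelerate

    # Older PyTorch builds have no torch.backends.mps at all, so look it up defensively.
    mps = getattr(torch.backends, "mps", None)
    if mps is not None and mps.is_available():
        return "mps"
    return "cpu"


print(pick_device())
```

The string it returns can be passed anywhere the examples in this post use "mps", for example model.to(pick_device()).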

A quick Text Generation Example

Let’s take a quick example using HuggingFace’s transformers library. We’ll leverage Auto Classes to make things super simple, but we’ll need to set the model to use "mps" to work with Metal.
Let’s start by installing requirements. I highly recommend using a virtual environment of your choice. You can also use conda if you’d like, but I’ve been a pip guy all my life so that’s what I’m using.
# Create a virtual environment. Skip this if you're already using one or are too boss
python3 -m venv .venv
source .venv/bin/activate
# Install some basic reqs
pip install torch transformers
Now let’s try running a simple text-generation script.
Fair warning: you might run into an error. We’ll talk about it in a second.
from transformers import AutoTokenizer, AutoModelForCausalLM

# Load the tokenizer and model. Set the model to use "mps" device
tokenizer = AutoTokenizer.from_pretrained('distilgpt2')
model = AutoModelForCausalLM.from_pretrained('distilgpt2').to('mps')

# Tokenize the input, and transform the tokens to "mps"
input_tokens = tokenizer("Once upon a time,", return_tensors='pt').to('mps')

# Generate an output and decode the output and print
output = model.generate(input_tokens.input_ids, max_new_tokens=20)
output_text = tokenizer.batch_decode(output)[0]
print(output_text)

I ran into an error

If you didn’t, hooray for you! You can skip to the next part…
If you did, you’re experiencing exactly what I did the first time I tried this. It might be a different error, but nevertheless I ran into AN error, which is apparently part of the process with this stuff.
If you ran into RuntimeError: MPS does not support cumsum op with int64 input then I have a fix for you.
This seems to be a known issue in PyTorch, and really it’s a reflection of how bleeding edge this stuff is. Even though Metal support in PyTorch was announced in 2022, it’s taken a year for people to use it enough to start running into errors.
So how do we fix it?
Well, the fix is coming… in PyTorch 2.1.0 (which isn’t out as of writing this). If PyTorch 2.1.0 is out, then great! Just run pip install "torch>=2.1.0" (the quotes stop your shell from treating the > as a redirect). If it isn’t, then you’ll need to install a nightly build with:
pip install --upgrade --no-deps --force-reinstall --pre torch --index-url
Then try the script again.
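If you’re not sure which side of the fix your install is on, you can compare the version string yourself. This is a rough sketch, assuming the fix lands in 2.1.0 as described above and that nightly builds report a version starting with 2.1.0. The helper names version_tuple and has_cumsum_fix are made up for this post.

```python
import re


def version_tuple(version: str) -> tuple:
    """Extract (major, minor, patch) from strings like '2.0.1' or '2.1.0.dev20230804'."""
    match = re.match(r"(\d+)\.(\d+)\.(\d+)", version)
    if match is None:
        raise ValueError(f"unrecognised version string: {version!r}")
    return tuple(int(part) for part in match.groups())


def has_cumsum_fix(version: str) -> bool:
    """Assumption: the int64 cumsum fix ships in 2.1.0 and anything newer."""
    return version_tuple(version) >= (2, 1, 0)
```

With PyTorch installed you’d call it as has_cumsum_fix(torch.__version__).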

Going bigger

Now that you’ve got an output, you’re probably wondering what you’re looking at. Yeah, it sounds like a child with a basic understanding of English got bored of writing halfway through.
You can probably tune things like the temperature, seed and repetition parameters, but the reality is GPT2 is only going to get you so far.
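For reference, here’s roughly what that tuning looks like with transformers. Treat it as a sketch: the values in SAMPLING are arbitrary starting points I picked for illustration, not recommendations, and the generate_text helper is hypothetical. The heavy imports live inside the function so nothing is downloaded until you call it.

```python
# Illustrative sampling settings; every value here is an assumption to experiment with.
SAMPLING = {
    "do_sample": True,           # sample instead of always taking the most likely token
    "temperature": 0.7,          # lower = more conservative, higher = more random
    "repetition_penalty": 1.2,   # discourage repeating the same tokens
    "max_new_tokens": 40,
}


def generate_text(prompt: str, device: str = "mps", seed: int = 42) -> str:
    """Generate text from distilgpt2 using the SAMPLING settings above."""
    from transformers import AutoModelForCausalLM, AutoTokenizer, set_seed

    set_seed(seed)  # makes the sampled output repeatable
    tokenizer = AutoTokenizer.from_pretrained("distilgpt2")
    model = AutoModelForCausalLM.from_pretrained("distilgpt2").to(device)
    inputs = tokenizer(prompt, return_tensors="pt").to(device)
    output = model.generate(
        **inputs,
        pad_token_id=tokenizer.eos_token_id,  # GPT-2 has no pad token of its own
        **SAMPLING,
    )
    return tokenizer.decode(output[0], skip_special_tokens=True)
```

Call it with print(generate_text("Once upon a time,")) and change one setting at a time to see what each does.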
What you want is a more powerful model. Let’s say Llama2?
See, the problem is those models are MASSIVE. You can try if you want: just swap out distilgpt2 for meta-llama/Llama-2-7b-hf to use the HuggingFace-compatible version of Llama2-7b. Heck, if you wanna go crazy you can use 70b. But you’ll quickly run into a problem.
Inference will take HOURS, if it even completes, and you’ll run into a few bottlenecks, primarily hardware related:
  • Memory
  • Compute performance
Now, we’ve tried to make compute better by leveraging the more powerful GPU on your M1/M2 chip. But that can only get us so far compared to an Nvidia A100: a data-center GPU with 40+ GB of memory, 108 streaming multiprocessors (which aren’t directly comparable to Apple’s GPU cores), and 19.5 TFLOPS of FP32 performance, running at 250 watts (compared to the M2’s 40 watt maximum).
The fact is, we’re working on a tight budget (a $3000+ tight budget; god damn, Apple shit is expensive). We can’t fit the entire model into GPU memory, and we can’t run 32-bit float computations as fast as the other guys.
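To put a number on the memory problem, here’s a back-of-the-envelope estimate of how much space the weights alone need at different precisions. It’s a sketch under simple assumptions: exactly 7 billion parameters, 1 GB = 10^9 bytes, and nothing counted for activations or runtime overhead. The weight_memory_gb name is mine.

```python
def weight_memory_gb(n_params: float, bits_per_weight: int) -> float:
    """Approximate size of the weights alone, in GB (1 GB = 1e9 bytes)."""
    return n_params * bits_per_weight / 8 / 1e9


for bits in (32, 16, 8, 4):
    print(f"{bits:>2}-bit: {weight_memory_gb(7e9, bits):.1f} GB")
```

For a 7B model that works out to about 28 GB at 32-bit, 14 GB at 16-bit, 7 GB at 8-bit and 3.5 GB at 4-bit, which is why precision matters so much on a laptop.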
So how do we “get around this”?


Quantization is a technique used for reducing the memory and compute requirements of large language models. It involves compressing the model's weights and activations to use fewer bits, which reduces the amount of memory needed to store them. This allows larger models to be run on hardware with limited memory capacity, such as mobile devices or low-power CPUs. However, there is a trade-off between the level of quantization and the accuracy of the model.
Essentially, quantization is a technique that reduces the float precision of a model’s weights to 16-bit, 8-bit or even 4-bit, at the expense of the model’s accuracy.
Now, by definition this is a lossy process, so any time I’ve mentioned a “solution” or “getting around” problems, I’ve put them in quotes, because it’s NOT a 1-1 conversion. The efficacy of the model you’re using will drop, so it’s up to you to make a call on whether the outcomes are still good enough for your project.
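To make the “lossy” part concrete, here’s a toy round trip through quantization in plain Python. This is a generic symmetric scheme with a single shared scale, written for illustration. It is not the exact format GGML uses, and the sample weights are made up.

```python
def quantize(weights, bits):
    """Map floats to signed integers of the given width using one shared scale."""
    qmax = 2 ** (bits - 1) - 1                            # 127 for 8-bit, 7 for 4-bit
    scale = max(abs(w) for w in weights) / qmax or 1.0    # avoid dividing by zero
    q = [max(-qmax, min(qmax, round(w / scale))) for w in weights]
    return q, scale


def dequantize(q, scale):
    """Turn the integers back into (approximate) floats."""
    return [v * scale for v in q]


weights = [0.82, -0.31, 0.05, -0.97, 0.44]  # made-up example values
for bits in (8, 4):
    q, scale = quantize(weights, bits)
    restored = dequantize(q, scale)
    worst = max(abs(a - b) for a, b in zip(weights, restored))
    print(f"{bits}-bit ints: {q}  worst error: {worst:.4f}")
```

The fewer bits you keep, the coarser the grid the weights get snapped to, and the bigger the worst-case error after converting back.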
So how do you find and run a quantized model?

Llama.cpp and GGML

This is where the power of the open source community shines. There are plenty of super smart people out there who are already solving this problem, so we’re going to leverage some of their awesome work.

First let’s talk about Llama.cpp

This is an awesome project that focuses on deploying quantized Llama models for inference on Apple hardware.
“The main goal of llama.cpp is to run the LLaMA model using 4-bit integer quantization on a MacBook”
Hot damn! This is exactly what we’re trying to do.
And what’s great is, I’m a Python freak, and I don’t even have to work with C++ to use this. More awesome people have created libraries like llama-cpp-python, which functions as a wrapper around llama.cpp. It even has a baked-in web server example!

Where do we get a quantized model?

Well, we could make one ourselves, but we’d be butting up against some of the same problems we’re already having with out-of-the-box models (namely memory and compute).
But hey, more open source to the rescue. Someone’s already done that! Once a model is quantized, it doesn’t have to be quantized again, so if you’re using a popular foundation model, you can probably find an appropriate quantized version online.
I ended up using TheBloke’s GGML version of Llama2-7b since it seems to be the most popular.
You can download the q4_0.bin file straight off of HuggingFace and save it somewhere safe…

Let’s run inference with a quantized model

After downloading the q4_0.bin file off of HuggingFace, renaming it llama2-7b-q4.bin and saving it in a folder called .models, we can install the dependencies and run a quick script.
pip install llama-cpp-python
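Now let’s load the model and run some inference.
from llama_cpp import Llama

# Load the quantized model from disk
llm = Llama(model_path="./.models/llama2-7b-q4.bin")

# Generate up to 32 tokens, stop at the first newline, and include the prompt in the output
output = llm("Once upon a time,", max_tokens=32, stop=["\n"], echo=True)
print(output['choices'][0]['text'])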
Holy crap! That took like no effort.
That is a result, but let’s not forget, we’re running a 7B model, which is relatively simple (relative to 70B, obvs), and at 4-bit quantization. There’s a lot of precision we’re losing here.
But it’s a place to start. From here we can experiment with higher precision as well as larger models. Just swap in the appropriately quantized or sized model. TheBloke offers all the way up to 8-bit precision on Llama2 13B and 70B.
You can also test your outputs by setting up an endpoint through HuggingFace. You may have to pay a bit for it, but if you’re concerned about accuracy it’ll give you a chance to compare and contrast quantized models against the full-fat, unquantized models.
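If you plan on swapping models in and out, a small loader makes that a one-line change. This is a sketch with a couple of assumptions: load_llama is a hypothetical helper, and the n_gpu_layers value is only an illustration. llama-cpp-python does accept an n_gpu_layers argument, but whether layers actually get offloaded to the GPU depends on how the package was built on your machine, so don’t take GPU use for granted.

```python
from pathlib import Path


def load_llama(model_path: str, n_gpu_layers: int = 1):
    """Load a quantized model, failing early with a clear error if the file is missing."""
    path = Path(model_path)
    if not path.is_file():
        raise FileNotFoundError(f"No model file at {path}")

    # Imported here so the path check works even before llama-cpp-python is installed.
    from llama_cpp import Llama

    # n_gpu_layers asks llama.cpp to offload layers to the GPU (assumes a GPU-enabled build).
    return Llama(model_path=str(path), n_gpu_layers=n_gpu_layers)
```

Usage looks the same as before: llm = load_llama("./.models/llama2-7b-q4.bin"), then call llm(...) as in the script above.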

Long way around to just get to quantization

Yeah, but along the way we’ve leveraged all the GPU acceleration the Apple ecosystem has to offer, and we’ve solved for some of its shortcomings.
Theoretically everything here is useful on a CPU-only system as well, but leveraging the GPU that’s already in your M1/M2 powered Macbook will give you a bit more performance, and a bit more headroom to run bigger, more complex models.
And hey, you can even use quantized models on Nvidia systems to run gigantic models on much smaller (and cheaper) systems. I’ll talk about how to do that later.

Run a web server

Well, finally we get to the thing we’ve all been waiting for: running an actual web server with Llama2, on our Mac, without inference taking hours and hours, while still leveraging all the performance we have on tap.
# The quotes keep zsh from treating the square brackets as a glob pattern
pip install "llama-cpp-python[server]"
python3 -m llama_cpp.server --model ./.models/llama2-7b-q4.bin
Navigate to http://localhost:8000/docs to see the OpenAPI documentation.
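Once the server is up, you can talk to it from any HTTP client. Here’s a minimal sketch using only the Python standard library. It assumes the server is on the default http://localhost:8000 and exposes an OpenAI-style /v1/completions route, so check the /docs page for the exact routes and fields on your version. The helper names are my own.

```python
import json
import urllib.request


def build_completion_request(prompt, base_url="http://localhost:8000", max_tokens=32):
    """Build (but don't send) a POST request for the assumed /v1/completions route."""
    payload = {"prompt": prompt, "max_tokens": max_tokens, "stop": ["\n"]}
    return urllib.request.Request(
        f"{base_url}/v1/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def complete(prompt: str) -> str:
    """Send the prompt to the local server and return the generated text."""
    with urllib.request.urlopen(build_completion_request(prompt)) as response:
        body = json.load(response)
    return body["choices"][0]["text"]
```

With the server running, print(complete("Once upon a time,")) should give you a continuation much like the script from earlier.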
That’s all folks!