Running llama.cpp

Posted on Jul 28, 2023

TL;DR: llama.cpp has improved a lot in the past few months. I’m happy to run 65B models on an M2 MacBook Pro at acceptable speeds (comparable to GPT4 when it first came out and was slow). I only have one NVIDIA 4090 GPU with 24 GB of VRAM; while inference on it is about 3 times faster, it can’t fit the bigger models. I considered getting a second one, but the sweet spot for pretty fast GPU inference seems to be a little north of 48 GB of VRAM.


For the past few months, I’ve been following the superb progress of the ggml and llama.cpp community towards making it more accessible, secure and fast to run models on consumer hardware. It’s amazing to see, and I’m happy to have bought a maxed-RAM M2 to follow along and run these models.

In the meantime, I’ve been playing around with ChatGPT, the GPT4 API, Claude 2 and these local models, and I wanted to share my insights from a practitioner’s perspective, in the hope that someone on the internet finds these ideas useful and/or applicable to their use cases.

These are some data points to give you an idea of how fast local llama.cpp inference is on somewhat beefy consumer hardware as of July 2023:

| Hardware    | Model size | Quantization | Time per token (ms) | Tokens per second |
|-------------|------------|--------------|---------------------|-------------------|
| CPU (i9)    | 13B        | q4_K_M       | 205                 | 4.86              |
| CPU (M2)    | 13B        | q4_K_M       | 135                 | 7.39              |
| CUDA (4090) | 13B        | q4_K_M       | 13                  | 75.13             |
| Metal (M2)  | 13B        | q4_K_M       | 38                  | 26.22             |
| CPU (i9)    | 33B        | q4_K_M       | 480                 | 2.08              |
| CPU (M2)    | 33B        | q4_K_M       | 308                 | 3.25              |
| CPU (i9)    | 33B        | q4_0         | 596                 | 1.68              |
| CPU (M2)    | 33B        | q4_0         | 292                 | 3.42              |
| CUDA (4090) | 33B        | q4_K_M       | * (failed)          | *                 |
| CUDA (4090) | 33B        | q4_1         | * (failed)          | *                 |
| CUDA (4090) | 33B        | q4_0         | 25                  | 39.26             |
| Metal (M2)  | 33B        | q4_K_M       | 101                 | 9.94              |
| Metal (M2)  | 33B        | q4_0         | 85                  | 11.73             |
| CPU (i9)    | 65B        | q4_K_M       | 1020                | 0.99              |
| CPU (M2)    | 65B        | q4_K_M       | 558                 | 1.79              |
| Metal (M2)  | 65B        | q4_K_M       | 205                 | 4.86              |
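
If you want to reproduce these numbers, the llama.cpp binaries print per-token timings at the end of each run; tokens per second is just 1000 divided by the milliseconds per token. Below is a rough sketch of the same measurement done through the llama-cpp-python bindings instead of the raw binaries (this is not exactly my setup; the model path and generation length are placeholders):

```python
# Minimal timing sketch via the llama-cpp-python bindings (an assumption on
# my part -- I measured with the timings the llama.cpp binaries print).
import time
from llama_cpp import Llama

llm = Llama(
    model_path="models/llama-13b.q4_K_M.bin",  # placeholder path
    n_ctx=2048,
    n_gpu_layers=0,  # > 0 offloads layers to CUDA/Metal
)

start = time.perf_counter()
out = llm("Tell me a story to go to bed.", max_tokens=128)
elapsed = time.perf_counter() - start

n_generated = out["usage"]["completion_tokens"]
ms_per_token = 1000 * elapsed / n_generated
print(f"{ms_per_token:.0f} ms per token, {1000 / ms_per_token:.2f} tokens per second")
# Note: this naive wall-clock measurement includes prompt ingestion, so it
# will read a bit slower than the generation-only numbers in the table above.
```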

Some notes:

  • Bigger models tend to “hallucinate/overfit” known stories more faithfully than smaller models. Guanaco 65B told me a very accurate Alice in Wonderland story in 2 out of 4 attempts, naming the Mad Hatter and other well-known characters, while 33B and smaller models never did. The prompt I used was along the lines of “Tell me a story to go to bed”.
  • The Mac seems more sensitive to “warm-up” (though it could also be that the NVMe disks in my PC are just much faster).
  • It’s a shame that q4_K_M quantization needs just a few dozen megabytes more memory than my GPU has. Even after shutting down X, I was about 20 MB short of running q4_K_M, and q4_1 also failed to run (see the size sketch after these notes). It’s promising to know that some memory-management improvements are on the roadmap.
  • I can’t say much about the inference quality of these different quantization methods; it also very likely varies a lot from task to task.
  • I couldn’t get OpenBLAS running on my Linux machines. I gave up after downloading three or four binary blobs, each weighing gigabytes; none of them worked out of the box. So there might be some performance left on the table for the i9 CPU.
  • CUDA seems to be ~3 times faster than Metal on the M2. Both machines cost me around 5k EUR. The M2 is a MacBook Pro and I always used it plugged in, but I wonder whether something less mobile-optimized could perform better.
  • I was very happy to run a 65B model at reasonable speeds. 5 tokens per second is slow to read, but not painfully so: it’s slightly faster than an off-camera narrator’s voice, though I feel an average person speaks faster than that. This also depends on which tokens get generated.
  • I had been running GPTQ models from Python code, but after seeing how fast llama.cpp’s inference is, I now favor llama.cpp over those.
  • Prompt ingestion is quite slow on the M2, both on CPU and with Metal. I wonder why; it may be related to how the KV cache is built. It’s almost instantaneous with Hugging Face’s Python transformers library, and quite fast with llama.cpp’s CUDA backend (at least an order of magnitude faster).
  • Metal inference doesn’t turn on the MacBook’s fans, but CPU inference very much does. Metal also feels a little less stable, though that might be a problem with the llama-server example code: I’ve had to restart generation a couple of times after noticing it had stopped churning out characters.
  • On the other hand, the CUDA backend is pretty solid. I ran some summarization tasks over many files overnight and didn’t hit this issue.
  • I also failed to get useful embeddings out of it. With the e5-large-v2 and instructor-xl models it was pretty easy to get interesting embeddings and query results that made sense, but doing vector similarity on the ~8k-dimensional output of llama.cpp’s embedding binary didn’t work for me (see the similarity sketch after these notes). To be honest, I didn’t try very hard with variations of this.
  • The context length limits are a bit of a pain in the ass to deal with. Most of the engineering I’ve been doing has been working around the limitations and problems of giving the model enough context.
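
To make the memory point above a bit more concrete, here is a back-of-the-envelope estimate of the 33B weight sizes per quantization. The bits-per-weight figures for q4_0 and q4_1 follow from their block layouts; the q4_K_M figure is approximate, and the KV cache and scratch buffers (which I don’t estimate here) come on top, which is plausibly why q4_0 squeezes into 24 GB while q4_K_M and q4_1 just don’t:

```python
# Rough weight-size estimate per quantization (weights only; the KV cache
# and scratch buffers come on top of this). Bits-per-weight values are
# approximate.
BITS_PER_WEIGHT = {"q4_0": 4.5, "q4_1": 5.0, "q4_K_M": 4.85}

def weights_gib(n_params: float, quant: str) -> float:
    """Size in GiB of the quantized weights for a model with n_params parameters."""
    return n_params * BITS_PER_WEIGHT[quant] / 8 / 2**30

for quant in ("q4_0", "q4_1", "q4_K_M"):
    print(f"33B {quant:>6}: ~{weights_gib(33e9, quant):.1f} GiB")
# 33B   q4_0: ~17.3 GiB
# 33B   q4_1: ~19.2 GiB
# 33B q4_K_M: ~18.6 GiB
```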
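
For completeness on the embeddings note: the vector-similarity part itself is trivial; what didn’t work for me was the quality of the resulting vectors. Here is a minimal sketch of what I mean, using the llama-cpp-python bindings rather than the embedding binary I actually used (model path and texts are placeholders):

```python
# Cosine similarity over llama.cpp embeddings, via the llama-cpp-python
# bindings (an assumption: I actually ran the standalone embedding binary).
import numpy as np
from llama_cpp import Llama

llm = Llama(model_path="models/llama-13b.q4_K_M.bin", embedding=True)  # placeholder path

def embed(text: str) -> np.ndarray:
    vec = llm.create_embedding(text)["data"][0]["embedding"]
    return np.asarray(vec, dtype=np.float32)

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

query = embed("How do I quantize a model?")
doc = embed("Quantization reduces the precision of the model weights.")
print(f"cosine similarity: {cosine(query, doc):.3f}")
```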