Running llama.cpp
TL;DR: llama.cpp has improved a lot in the past few months. I’m happy to run 65B models on an M2 MacBook Pro at acceptable speeds (comparable to GPT-4 when it had just come out and was slow). I only have one Nvidia 4090 GPU with 24 GB of VRAM; and while inference on it is about 3 times faster, it fails to run the bigger models. I considered getting another one, but the sweet spot for pretty fast GPU inference seems to be a little bit north of 48 GB of VRAM.
For the past few months, I’ve been following the superb progress of the ggml and llama.cpp community towards making running models on consumer hardware more accessible, secure and fast. It’s amazing to see, and I’m happy to have bought an M2 with maxed-out RAM to follow along and run these models.
In the meantime, I’ve been playing around with ChatGPT, the GPT-4 API, Claude 2 and these local models, and I wanted to share my insights from a practitioner’s perspective, in the hope that someone on the internet finds these ideas useful and/or applicable to their use cases.
These are some data points to give you an idea of how long local llama.cpp inference takes on somewhat beefy consumer hardware as of July 2023:
| Hardware | Model size | Quantization | Time per token (ms) | Tokens per second |
|---|---|---|---|---|
| CPU (i9) | 13B | q4_K_M | 205 | 4.86 |
| CPU (M2) | 13B | q4_K_M | 135 | 7.39 |
| CUDA (4090) | 13B | q4_K_M | 13 | 75.13 |
| Metal (M2) | 13B | q4_K_M | 38 | 26.22 |
| CPU (i9) | 33B | q4_K_M | 480 | 2.08 |
| CPU (M2) | 33B | q4_K_M | 308 | 3.25 |
| CPU (i9) | 33B | q4_0 | 596 | 1.68 |
| CPU (M2) | 33B | q4_0 | 292 | 3.42 |
| CUDA (4090) | 33B | q4_K_M | * (failed) | * |
| CUDA (4090) | 33B | q4_1 | * | * |
| CUDA (4090) | 33B | q4_0 | 25 | 39.26 |
| Metal (M2) | 33B | q4_K_M | 101 | 9.94 |
| Metal (M2) | 33B | q4_0 | 85 | 11.73 |
| CPU (i9) | 65B | q4_K_M | 1020 | 0.99 |
| CPU (M2) | 65B | q4_K_M | 558 | 1.79 |
| Metal (M2) | 65B | q4_K_M | 205 | 4.86 |
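If you want to reproduce this kind of tokens-per-second measurement yourself, here’s a rough sketch using the llama-cpp-python bindings. This isn’t how the table above was produced (those numbers come straight from running llama.cpp), and the model path and parameters below are just illustrative.

```python
# Rough sketch: measuring generation speed through the llama-cpp-python bindings.
# The model path, context size and layer-offload count are illustrative, not the
# exact setup used for the table above.
import time

from llama_cpp import Llama

llm = Llama(
    model_path="models/llama-13b.ggmlv3.q4_K_M.bin",  # hypothetical local path
    n_ctx=2048,
    n_gpu_layers=0,  # raise this to offload layers to CUDA/Metal
)

prompt = "Tell me a story to go to bed."
start = time.perf_counter()
out = llm(prompt, max_tokens=256)
elapsed = time.perf_counter() - start

# The completion dict reports how many tokens were generated.
n_tokens = out["usage"]["completion_tokens"]
print(f"{n_tokens} tokens in {elapsed:.2f}s -> {n_tokens / elapsed:.2f} tokens/s")
```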
Some notes:
- Bigger models tend to “hallucinate”/overfit to well-known stories more closely than smaller models. Guanaco 65B told me a very accurate Alice in Wonderland story 2 out of 4 times, naming the Mad Hatter and other known characters, while 33B and smaller models never did. The prompt I used was along the lines of “Tell me a story to go to bed”.
- The Mac seems more sensitive to “warm-up” (though it could also be that the NVMe disks in my PC are much faster).
- It’s a shame that q4_K_M quantization requires just a few dozen megabytes more memory than my GPU has. Even after shutting down X, I was about 20 MB short of running q4_K_M; q4_1 also failed to run. It’s promising to know that there are some memory-management improvements coming in the roadmap.
- I can’t speak to the quality of inference across these different quantization methods; it’s also very likely to vary a lot from task to task.
- I couldn’t get OpenBLAS working on my Linux machines. I gave up after downloading three or four binary blobs, each weighing gigabytes; none of them worked out of the box. So there might be some performance loss there for the i9 CPU.
- CUDA seems to be ~3 times faster than Metal on the M2. Both machines cost me about 5k EUR. The M2 is a MacBook Pro; I always used it plugged in, but I wonder if something less mobile-optimized could perform better.
- I was very happy to run a 65B model at reasonable speeds. 5 tokens per second is slow to read, but not painfully so; it’s slightly faster than an off-camera narrator’s voice, though I feel that an average person speaks faster than that. This also depends on which tokens get generated.
- I’ve been running GPTQ models from Python code, but after seeing how fast llama.cpp’s inference is, I now favor llama.cpp over those.
- Prompt ingestion time is quite slow on the M2, both on CPU and Metal. I wonder why this is; it may be related to how they build the KV cache. It’s almost instantaneous with the Python Hugging Face transformers library, and quite fast with llama.cpp’s CUDA backend (at least one order of magnitude faster).
- Metal inference doesn’t turn on the fans on the MacBook, but CPU inference very much does. Metal also feels a little less stable, though that might be a problem with the llama-server example code: I’ve had to restart generation a couple of times after noticing it had stopped churning out characters.
- The CUDA backend, on the other hand, is pretty solid. I ran some summarization tasks over many files overnight and didn’t have this issue.
- I also failed to generate useful embeddings. With the e5-large-v2 and instructor-xl models it was pretty easy to get interesting embeddings and query results that made sense, but doing vector similarity on the ~8k-dimensional output of llama.cpp’s embeddings binary didn’t work for me. To be honest, I didn’t try very hard with variations of this; there’s a sketch of the kind of query I mean right after this list.
- The context length limits are a bit of a pain in the ass to deal with. Most of the engineering I’ve been doing has been to work around the limitations and problems associated with giving enough context to the model.
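As an aside on the embeddings point above, here’s a rough sketch of the kind of vector-similarity query that did make sense for me with e5-large-v2, using the sentence-transformers library. This isn’t the exact code or data I used; the example texts are made up, and instructor-xl would need its own package and instruction prefixes, which I’m skipping here.

```python
# Sketch of a vector-similarity query with e5-large-v2 (example texts are made up).
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("intfloat/e5-large-v2")

# e5 models expect "query: " / "passage: " prefixes on the inputs.
query = "query: how do I quantize a LLaMA model?"
passages = [
    "passage: llama.cpp includes a quantize tool that converts f16 weights to q4_0, q4_K_M, etc.",
    "passage: The Mad Hatter hosts a tea party in Alice in Wonderland.",
]

q_emb = model.encode(query, normalize_embeddings=True)
p_emb = model.encode(passages, normalize_embeddings=True)

# Cosine similarity; the first passage should score clearly higher than the second.
print(util.cos_sim(q_emb, p_emb))
```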