llama.cpp
A high-performance C/C++ implementation for running LLMs on consumer hardware.
LLM Inference Everywhere in Pure C/C++
llama.cpp is one of the most significant open-source projects of the LLM era. Its primary goal is to enable LLM inference (for models such as Llama) with minimal setup and strong performance across a wide range of hardware, especially machines without powerful dedicated GPUs. Written in plain C/C++, it is designed to be lightweight and highly portable, running on everything from MacBooks to Raspberry Pis and Android phones.
The project introduced the GGUF file format, which stores model weights and metadata in a single file and supports quantization: weights are kept at lower precision (for example, roughly 4 or 8 bits each instead of 16), so a model needs far less RAM at a modest cost in accuracy. This made it practical to run models as large as 70B parameters on high-end consumer machines with enough memory. llama.cpp serves as the engine for many other popular AI tools, including Ollama and various mobile AI apps. It supports hardware acceleration through backends such as Apple's Metal, NVIDIA's CUDA, Vulkan, and OpenCL. For developers who want to embed LLM capabilities directly into local software without the overhead of Python, llama.cpp is a widely used choice.