Running LLM Inference on a Budget

Why this matters
The core problem
Quantization formats in practice
Choosing a backend
Memory estimation
Practical tips
Conclusion

Why this matters

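Running a 70B-parameter model used to require a cluster of A100s. Today a 4-bit quantized 70B model needs roughly 40 GB, which a single high-memory workstation or a Mac with 64 GB of unified memory can hold, and quantized 7B-class models run on a laptop with 16 GB of RAM. This post walks through the practical choices that make that possible.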

The core problem

Full-precision (FP32) weights for a 7B model consume roughly 28 GB of memory. BF16 halves that to ~14 GB. INT8 takes it to ~7 GB. INT4 gets you to ~3.5 GB — small enough to fit on a mid-range GPU.
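The arithmetic behind those figures is just parameters times bytes per parameter. A minimal sketch follows; the function name and the precision table are illustrative, not taken from any library, and 1 GB is taken as 1e9 bytes.

```python
# Bytes per parameter for the precisions discussed above.
BYTES_PER_PARAM = {"FP32": 4.0, "BF16": 2.0, "INT8": 1.0, "INT4": 0.5}


def weight_memory_gb(params_billion: float, precision: str) -> float:
    """Weights-only footprint in GB (1 GB = 1e9 bytes); excludes KV cache and activations."""
    # params_billion * 1e9 params * bytes per param / 1e9 bytes per GB
    return params_billion * BYTES_PER_PARAM[precision]


if __name__ == "__main__":
    for precision in BYTES_PER_PARAM:
        print(f"7B @ {precision}: {weight_memory_gb(7, precision):.1f} GB")
```

Real quantized files come out somewhat larger than the INT8 and INT4 figures, because they also store per-group scales alongside the weights.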

The tradeoff is accuracy. The art is knowing how much you can lose before it matters for your use case.

Quantization formats in practice

GGUF (llama.cpp)

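GGUF is the format used by llama.cpp and Ollama. It supports mixed precision: different tensors in the same file can use different quantization types. Q4_K_M, for example, stores most weights at 4 bits but keeps some of the more sensitive attention and feed-forward tensors at a higher-precision type, which recovers much of the accuracy lost by pure Q4.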

# Pull and run a quantized model with Ollama
ollama pull llama3.2:3b-instruct-q4_K_M
ollama run llama3.2:3b-instruct-q4_K_M

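The Q4_K_M suffix means 4-bit weights using the K-quant method, medium variant. K-quants split the weights into small blocks and store a separate scale for each block, so a single large weight only degrades the precision of its own block. This significantly reduces the error compared to rounding the whole tensor with one scale.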
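To see why per-group scales help, here is a toy comparison in plain Python. It is a simplified absmax scheme, not the actual K-quant algorithm (the real formats add more machinery, such as quantizing the scales themselves). The weight distribution, the outlier placement, and all function names are assumptions made for illustration.

```python
import random


def quantize_dequantize(values, bits=4):
    """Symmetric absmax round-trip: map to signed ints with one shared scale, then back."""
    qmax = 2 ** (bits - 1) - 1  # 7 for 4-bit
    absmax = max(abs(v) for v in values)
    if absmax == 0.0:
        return list(values)
    scale = absmax / qmax
    return [round(v / scale) * scale for v in values]


def groupwise_quantize_dequantize(values, group_size=32, bits=4):
    """Same round-trip, but each block of `group_size` weights gets its own scale."""
    out = []
    for start in range(0, len(values), group_size):
        out.extend(quantize_dequantize(values[start:start + group_size], bits))
    return out


def mean_squared_error(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def make_toy_weights(n=4096, seed=0):
    """Mostly small Gaussian weights plus a few large outliers (an assumption of this toy)."""
    rng = random.Random(seed)
    weights = [rng.gauss(0.0, 0.02) for _ in range(n)]
    for i in range(0, n, 512):
        weights[i] = 1.0  # one outlier every 512 weights
    return weights


if __name__ == "__main__":
    w = make_toy_weights()
    per_tensor = mean_squared_error(w, quantize_dequantize(w))
    per_group = mean_squared_error(w, groupwise_quantize_dequantize(w))
    print(f"one scale for the whole tensor: MSE = {per_tensor:.3e}")
    print(f"one scale per 32 weights:       MSE = {per_group:.3e}")
```

With a single scale, the outliers stretch the quantization grid so far that most ordinary weights round to zero. With per-group scales, only the groups that contain an outlier pay that price.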

GPTQ

GPTQ is a post-training quantization method that quantizes one layer at a time, using approximate second-order (Hessian) information from a small calibration set to minimize each layer's reconstruction error. It produces INT4 models that run well on NVIDIA GPUs via the auto-gptq library.

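from transformers import AutoTokenizer
from auto_gptq import AutoGPTQForCausalLM

model_id = "TheBloke/Llama-2-7B-GPTQ"

# Load the pre-quantized INT4 checkpoint onto the first GPU.
model = AutoGPTQForCausalLM.from_quantized(
    model_id,
    device="cuda:0",
    use_triton=True,  # requires the triton package to be installed
)
# The tokenizer is stored in the same repository as the weights.
tokenizer = AutoTokenizer.from_pretrained(model_id)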

AWQ

AWQ (Activation-aware Weight Quantization) is, like GPTQ, a post-training method, but it uses activation statistics to find the small fraction of weight channels that matter most and rescales them so they lose less precision during quantization. In practice it often outperforms GPTQ at 4-bit, especially for instruction-following tasks.
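The rescaling idea can be illustrated with a toy dot product. This is a deliberately simplified sketch, not the AWQ implementation: real AWQ searches for per-channel scales using calibration data and works on groups of weights. Here the salient channel is hard-coded as channel 0, the scale is fixed at 2.0, and every name is hypothetical.

```python
import random


def quantize_row(weights, bits=4):
    """Symmetric absmax round-trip with a single scale for the row."""
    qmax = 2 ** (bits - 1) - 1
    scale = max(abs(w) for w in weights) / qmax
    if scale == 0.0:
        return list(weights)
    return [round(w / scale) * scale for w in weights]


def output_error(weights, activations, channel_scales):
    """Squared dot-product error after quantizing weights * s and feeding activations / s."""
    scaled_w = [w * s for w, s in zip(weights, channel_scales)]
    scaled_x = [x / s for x, s in zip(activations, channel_scales)]
    exact = sum(w * x for w, x in zip(weights, activations))
    approx = sum(w * x for w, x in zip(quantize_row(scaled_w), scaled_x))
    return (exact - approx) ** 2


def compare(n_rows=2000, n_channels=64, salient_scale=2.0, seed=0):
    """Average output error without and with protecting the high-activation channel."""
    rng = random.Random(seed)
    activations = [10.0] + [0.1] * (n_channels - 1)  # channel 0 is "salient"
    plain = [1.0] * n_channels
    aware = [salient_scale] + [1.0] * (n_channels - 1)
    base = scaled = 0.0
    for _ in range(n_rows):
        row = [rng.gauss(0.0, 1.0) for _ in range(n_channels)]
        base += output_error(row, activations, plain)
        scaled += output_error(row, activations, aware)
    return base / n_rows, scaled / n_rows


if __name__ == "__main__":
    base, scaled = compare()
    print(f"plain 4-bit:            mean squared output error = {base:.4f}")
    print(f"salient channel scaled: mean squared output error = {scaled:.4f}")
```

Multiplying the salient weight by s before rounding and dividing its activation by s afterwards leaves the exact output unchanged, but shrinks that channel's rounding error by about a factor of s. The other channels carry small activations, so a slightly coarser grid costs them little.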

Choosing a backend

Backend	Format	Best for
llama.cpp / Ollama	GGUF	CPU + Apple Silicon, local dev
vLLM	GPTQ, AWQ, FP8	GPU serving, high throughput
TGI	GPTQ, AWQ	Production APIs
ExLlamaV2	EXL2	Maximum throughput on NVIDIA

Memory estimation

A rough formula for VRAM needed:

VRAM (GB) ≈ (params_B × bits) / 8 × 1.2

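The 1.2 factor is a rough allowance for KV cache and activation memory at short context lengths. The KV cache grows linearly with context length and batch size, so long contexts or concurrent requests need more headroom. For a 7B model at Q4: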

(7 × 4) / 8 × 1.2 = 4.2 GB

That fits comfortably on an RTX 3060 12 GB or an M2 MacBook with 16 GB of unified memory.
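The rule of thumb is easy to wrap in a helper, and to invert when the question is "what is the largest model my card can hold?". A minimal sketch, with illustrative function names and the 1.2 overhead factor carried over unchanged from the formula above:

```python
def estimate_vram_gb(params_billion: float, bits: float, overhead: float = 1.2) -> float:
    """VRAM (GB) ~= (params_B * bits) / 8 * overhead, the rule of thumb above."""
    return params_billion * bits / 8 * overhead


def max_params_billion(vram_gb: float, bits: float, overhead: float = 1.2) -> float:
    """Invert the estimate: largest model (in billions of parameters) for a memory budget."""
    return vram_gb * 8 / (bits * overhead)


if __name__ == "__main__":
    for params_b in (7, 13, 70):
        print(f"{params_b}B @ 4-bit: ~{estimate_vram_gb(params_b, 4):.1f} GB")
    print(f"12 GB @ 4-bit holds up to ~{max_params_billion(12, 4):.0f}B parameters")
```

Treat both numbers as a first filter rather than a guarantee: the estimate ignores context length, and the operating system and other processes also need memory.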

Practical tips

Start with Q4_K_M in GGUF — it hits the sweet spot of size vs quality for most tasks.
Prefer Q8 for math and code, where a single wrong token can invalidate the whole answer and 4-bit quantization error tends to hurt more.
Benchmark on your actual task, not just perplexity. A model that scores well on benchmarks can still fail on your specific prompt distribution.
KV cache quantization (supported in vLLM) can cut memory further without touching weight precision.
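To judge how much KV cache quantization can save, it helps to compute the cache size directly: two tensors (keys and values) per layer, each holding one vector per KV head per token. The sketch below assumes a Llama-2-7B-style shape (32 layers, 32 KV heads, head dimension 128). The function name is illustrative, and the small extra storage that quantized caches need for scales is ignored.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, context_len, bytes_per_value, batch_size=1):
    """Keys and values for every layer, KV head, and token position."""
    return 2 * n_layers * n_kv_heads * head_dim * context_len * bytes_per_value * batch_size


# Assumed Llama-2-7B-style shape: 32 layers, 32 KV heads of dimension 128.
LLAMA2_7B_SHAPE = dict(n_layers=32, n_kv_heads=32, head_dim=128)

if __name__ == "__main__":
    for label, nbytes in (("FP16", 2), ("8-bit", 1)):
        size = kv_cache_bytes(context_len=4096, bytes_per_value=nbytes, **LLAMA2_7B_SHAPE)
        print(f"{label} KV cache at 4096 tokens: {size / 2**30:.1f} GiB")
```

For this shape, a 4096-token FP16 cache takes 2 GiB per sequence and an 8-bit cache takes half that. Models that use grouped-query attention have fewer KV heads and a proportionally smaller cache.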

Conclusion

Budget inference is no longer a compromise — it is an engineering discipline. Pick the right quantization format for your hardware, benchmark on your task, and iterate. The tooling in 2026 makes this accessible to any ML engineer with a single consumer GPU.