Model Training

LoRA

LoRA (Low-Rank Adaptation) is a parameter-efficient fine-tuning technique that trains small rank-decomposed weight matrices alongside frozen base model weights, enabling model customization with minimal compute and memory.

What is LoRA?

Instead of updating all of a model's parameters during fine-tuning, LoRA freezes the pretrained weights and adds small trainable low-rank matrices that modify the model's behavior. Because 99%+ of the parameters stay frozen, training requires only a fraction of the compute and memory of a full fine-tuning run.

The mathematical insight is that the weight updates learned during fine-tuning tend to have low intrinsic rank. LoRA therefore decomposes the update to a d × d weight matrix into two much smaller matrices of shape d × r and r × d, where r is much smaller than d (typically 4-64). Only these small matrices are trained, reducing trainable parameters by 1,000-10,000x while achieving quality comparable to full fine-tuning on most tasks.
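To make the decomposition concrete, here is a minimal PyTorch sketch of the idea. The class name LoRALinear and the specific r and alpha values are illustrative, not taken from any particular library:

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen linear layer plus a trainable low-rank update: y = Wx + (alpha/r) * B(Ax)."""

    def __init__(self, base: nn.Linear, r: int = 16, alpha: int = 32):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)          # freeze the pretrained weights
        d_out, d_in = base.weight.shape
        # A projects down to rank r, B projects back up.
        self.A = nn.Parameter(torch.randn(r, d_in) * 0.01)
        self.B = nn.Parameter(torch.zeros(d_out, r))
        self.scale = alpha / r

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # (..., d_in) @ (d_in, r) @ (r, d_out) -> (..., d_out)
        return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)
```

Initializing B to zero means the adapter contributes nothing at step zero, so training starts from exactly the pretrained model's behavior.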

At inference time, the LoRA matrices can be merged into the base weights with zero additional latency, or kept separate to enable serving multiple adaptations from a single base model. This separation is powerful for multi-tenant deployments: one base model in GPU memory can serve dozens of customer-specific adaptations by hot-swapping LoRA weights per request.
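Merging is just folding the rank-r product back into the dense weight. A sketch continuing the example above, where scale is the same alpha/r factor used during training:

```python
import torch

@torch.no_grad()
def merge_lora(base_weight: torch.Tensor, A: torch.Tensor,
               B: torch.Tensor, scale: float) -> None:
    # (d_out, r) @ (r, d_in) -> (d_out, d_in): same shape as the base weight,
    # so the adapter disappears into it and inference pays no extra cost.
    base_weight += scale * (B @ A)
```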

Why does LoRA matter?

LoRA makes fine-tuning accessible on modest hardware. Full fine-tuning of a 70B model needs on the order of a terabyte of GPU memory once gradients and optimizer state are counted, which means a multi-GPU A100 cluster. With rank-16 LoRA over a 4-bit quantized base (the QLoRA recipe), the same model can be fine-tuned on a single 48GB GPU, and smaller models fit comfortably on a 24GB consumer card. This democratization enabled the open-source fine-tuning ecosystem that produced thousands of specialized model variants.
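A sketch of that recipe using Hugging Face transformers and peft; the model id and the choice of target modules are assumptions you would adjust for your own setup:

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

# Load the frozen base in 4-bit NF4 so the weights take a fraction of the memory.
bnb = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3-70B",       # assumed model id
    quantization_config=bnb,
    device_map="auto",
)

# Rank-16 adapters on the attention projections; only these matrices train.
config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, config)
model.print_trainable_parameters()       # reports the tiny trainable fraction
```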

How is LoRA used in practice?

A legal AI company fine-tunes Llama 3 70B with LoRA adapters for each law firm client, training on their specific document styles and citation preferences. All adapters share one base model in vLLM, with per-request adapter routing based on the client API key — serving 30 customized models from the same 2-GPU deployment.
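A minimal sketch of per-request adapter routing with vLLM's multi-LoRA support; the adapter name, path, and the API-key-to-adapter mapping are hypothetical:

```python
from vllm import LLM, SamplingParams
from vllm.lora.request import LoRARequest

# One base model in GPU memory; adapters are attached per request.
llm = LLM(
    model="meta-llama/Meta-Llama-3-70B-Instruct",  # assumed model id
    enable_lora=True,
    max_loras=8,
    tensor_parallel_size=2,
)

params = SamplingParams(max_tokens=256)

# In a real deployment, the client API key would resolve to an adapter path.
adapter = LoRARequest("firm-acme", 1, "/adapters/firm-acme")  # hypothetical
outputs = llm.generate(
    ["Summarize the attached brief."], params, lora_request=adapter
)
print(outputs[0].outputs[0].text)
```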

About the Author

Aaron is an engineering leader, software architect, and founder with 18 years of experience building distributed systems and cloud infrastructure. He now focuses on LLM-powered platforms, agent orchestration, and production AI, and shares hands-on technical guides and framework comparisons at fp8.co.