When to Fine-Tune vs. Prompt Engineer
Before writing any training code, answer this question: can you solve the problem with prompting alone?
Fine-tuning is appropriate when:
- The task requires a specific output format that's hard to specify in a prompt
- You need the model to have domain knowledge not in its pretraining data
- Latency/cost constraints rule out large models or long prompts
- You have >1000 high-quality labeled examples
Stick with prompting when:
- You have fewer than a few hundred labeled examples
- The task is general enough that a base model handles it well
- You need rapid iteration
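A cheap way to answer the question is to score a prompting baseline on your labeled examples first; if it is already near your quality bar, skip fine-tuning. A minimal sketch, where `call_model` is a hypothetical stand-in for whatever inference API you use (stubbed here so the sketch runs end to end):

```python
def call_model(prompt: str) -> str:
    # Hypothetical stub for your real inference API (OpenAI, vLLM, etc.)
    return "spam" if "free" in prompt.lower() else "not spam"

def prompting_baseline_accuracy(examples: list[dict]) -> float:
    """Score a zero-shot prompt against labeled examples."""
    correct = 0
    for ex in examples:
        prompt = f"Classify this email as spam or not spam.\n\n{ex['input']}\n\nAnswer:"
        prediction = call_model(prompt).strip().lower()
        if ex["label"] in prediction:
            correct += 1
    return correct / len(examples)

examples = [
    {"input": "Win a FREE iPhone today!", "label": "spam"},
    {"input": "Meeting moved to 3pm.", "label": "not spam"},
]
print(prompting_baseline_accuracy(examples))  # 1.0 with the stub above
```

If this baseline (with a few-shot prompt and a capable model) lands close to your target, the fine-tuning effort below may not pay for itself.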
Full Fine-Tuning with Hugging Face
Full fine-tuning updates all model parameters. This gives the best results but requires the most memory.
```python
from transformers import (
    AutoTokenizer,
    AutoModelForCausalLM,
    TrainingArguments,
    Trainer,
    DataCollatorForLanguageModeling,
)
from datasets import Dataset
import torch

# Prepare data
def format_instruction(example):
    return {
        "text": f"### Instruction:\n{example['instruction']}\n\n### Response:\n{example['output']}"
    }

raw_data = [
    {"instruction": "Summarize this code", "output": "This function..."},
    # ... thousands of examples
]
dataset = Dataset.from_list(raw_data)
dataset = dataset.map(format_instruction)

# Load model and tokenizer
model_name = "meta-llama/Llama-3.2-3B"  # 3B — feasible to fine-tune on 1 GPU
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token

model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,  # bf16 saves memory, minimal quality loss
    device_map="auto",
)

# Tokenize
def tokenize(examples):
    return tokenizer(
        examples["text"],
        truncation=True,
        max_length=512,
        padding="max_length",
    )

tokenized = dataset.map(tokenize, batched=True, remove_columns=["text", "instruction", "output"])
# Hold out a slice for evaluation
splits = tokenized.train_test_split(test_size=0.1, seed=42)

# Training config
training_args = TrainingArguments(
    output_dir="./llama3-finetuned",
    num_train_epochs=3,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,  # Effective batch size = 4 * 4 = 16
    learning_rate=2e-5,
    warmup_ratio=0.03,
    lr_scheduler_type="cosine",
    bf16=True,
    logging_steps=10,
    save_strategy="epoch",
    eval_strategy="epoch",
    report_to="wandb",
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=splits["train"],
    eval_dataset=splits["test"],
    # The collator copies input_ids into labels for causal-LM loss
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```
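As a sanity check before launching a run, it helps to compute how many optimizer steps the configuration above implies. A sketch of the arithmetic (assuming a single GPU and a hypothetical dataset size):

```python
import math

def total_optimizer_steps(num_examples: int, per_device_batch: int,
                          grad_accum: int, epochs: int, num_gpus: int = 1) -> int:
    """Optimizer steps = ceil(examples / effective batch) * epochs."""
    effective_batch = per_device_batch * grad_accum * num_gpus
    steps_per_epoch = math.ceil(num_examples / effective_batch)
    return steps_per_epoch * epochs

# The config above: batch 4, accumulation 4, 3 epochs; say 10,000 examples
print(total_optimizer_steps(10_000, 4, 4, 3))  # 1875
```

Too few total steps (under a few hundred) suggests the dataset is small enough that prompting or more epochs deserve a second look.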
LoRA: Fine-Tune 1% of Parameters
Fully fine-tuning a 7B model requires multiple 80GB A100s: in bf16 with Adam, the weights, gradients, and optimizer states alone total over 100GB before activations. LoRA (Low-Rank Adaptation) achieves comparable results by training only small adapter matrices:
```
Original weight matrix W (frozen): [d × d]
LoRA: W' = W + BA, where B: [d × r], A: [r × d], r << d

For a 4096 × 4096 attention matrix with r = 16:
  Original: 4096 × 4096 ≈ 16.7M parameters
  LoRA:     4096×16 + 16×4096 = 131K parameters (~0.8% of the original)
```
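The arithmetic generalizes to any weight shape; a quick sketch for checking adapter sizes before training:

```python
def lora_params(d_out: int, d_in: int, r: int) -> int:
    """Parameters in one LoRA adapter pair: B is [d_out, r], A is [r, d_in]."""
    return d_out * r + r * d_in

full = 4096 * 4096                      # frozen weight: 16,777,216 params
adapter = lora_params(4096, 4096, 16)   # 131,072 params
print(adapter, adapter / full)          # 131072, ~0.0078 (about 0.8%)
```

Note that adapter size scales linearly with r while the frozen weight is quadratic in d, which is why even r=64 stays tiny relative to the base model.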
```python
from peft import LoraConfig, get_peft_model, TaskType

# Load base model (Llama-2-7B: 32 layers, hidden size 4096)
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

# Apply LoRA
lora_config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=16,              # Rank — higher = more capacity, more memory
    lora_alpha=32,     # Scaling factor (usually 2*r)
    lora_dropout=0.1,
    target_modules=[         # Which layers to add LoRA to
        "q_proj", "v_proj",  # Attention projections
        "k_proj", "o_proj",  # All attention projections for better results
    ],
    bias="none",
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
# Roughly: trainable params ≈ 16.8M of ~6.76B total (~0.25%)
```
Now train exactly like full fine-tuning — the API is identical, but you're only updating ~17M parameters instead of 7B.
QLoRA: 4-bit Quantization + LoRA
QLoRA (Quantized LoRA) lets you fine-tune an 8B model on a single 16GB GPU:
```python
from transformers import BitsAndBytesConfig
from peft import prepare_model_for_kbit_training

# 4-bit quantization config
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",               # NormalFloat4 quantization
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,          # Quantize the quantization constants
)

# Load in 4-bit
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-8B",
    quantization_config=bnb_config,
    device_map="auto",
)

# Prepare for LoRA training
model = prepare_model_for_kbit_training(model)

# Apply LoRA (same lora_config as before)
model = get_peft_model(model, lora_config)
# GPU memory: ~8GB for an 8B model with QLoRA
```
Supervised Fine-Tuning for Instruction Following
For instruction-following tasks, structure your data carefully:
```python
# Good instruction format (Alpaca-style)
def format_sample(instruction: str, input: str, output: str) -> str:
    if input:
        return f"""Below is an instruction that describes a task, paired with an input. Write a response.

### Instruction:
{instruction}

### Input:
{input}

### Response:
{output}"""
    else:
        return f"""Below is an instruction. Write a response.

### Instruction:
{instruction}

### Response:
{output}"""

# Critical: only compute loss on the response, not the instruction
def tokenize_with_labels(example, tokenizer):
    full_text = format_sample(example["instruction"], example["input"], example["output"])
    tokens = tokenizer(full_text, return_tensors="pt")
    # Find where the response starts: tokenize the prompt-only prefix
    instruction_text = format_sample(example["instruction"], example["input"], "")
    instruction_len = len(tokenizer(instruction_text)["input_ids"])
    # Mask instruction tokens in labels (-100 = ignored by the loss)
    labels = tokens["input_ids"].clone()
    labels[0, :instruction_len] = -100
    return {"input_ids": tokens["input_ids"][0], "labels": labels[0]}
```
Not masking the instruction means your model wastes capacity learning to reproduce the prompt. Always mask.
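The masking logic itself is independent of any tokenizer — it just overwrites the prompt positions with -100 (the index PyTorch's cross-entropy loss ignores). A dependency-free sketch with hypothetical token IDs:

```python
IGNORE_INDEX = -100  # default ignore_index for cross-entropy loss

def mask_prompt_tokens(input_ids: list[int], prompt_len: int) -> list[int]:
    """Labels for causal-LM SFT: ignore the prompt, learn only the response."""
    labels = list(input_ids)
    labels[:prompt_len] = [IGNORE_INDEX] * prompt_len
    return labels

# Hypothetical token IDs: first 4 tokens are the prompt, last 3 the response
input_ids = [101, 2023, 2003, 1996, 3437, 2004, 102]
print(mask_prompt_tokens(input_ids, 4))
# [-100, -100, -100, -100, 3437, 2004, 102]
```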
Merging LoRA Adapters
After training, merge the LoRA weights back into the base model for clean serving:
```python
from peft import PeftModel

# Load base model (must match the base the adapter was trained on)
base_model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",
    torch_dtype=torch.bfloat16,
)

# Load and merge LoRA weights
model = PeftModel.from_pretrained(base_model, "./lora-checkpoint")
merged_model = model.merge_and_unload()  # Fuses LoRA into base weights

# Save merged model
merged_model.save_pretrained("./llama3-finetuned-merged")
tokenizer.save_pretrained("./llama3-finetuned-merged")
# Now deploy as a regular model — no PEFT dependency
```
Evaluating Fine-Tuned Models
```python
from transformers import pipeline

pipe = pipeline("text-generation", model="./llama3-finetuned-merged", device=0)

# Automated evaluation
test_cases = [
    {"instruction": "Classify this email as spam or not spam.", "input": "Win a free iPhone...", "expected": "spam"},
    # ...
]

correct = 0
for case in test_cases:
    prompt = format_sample(case["instruction"], case["input"], "")
    output = pipe(prompt, max_new_tokens=50)[0]["generated_text"]
    response = output[len(prompt):].strip().lower()  # Strip the echoed prompt
    if case["expected"] in response:
        correct += 1

print(f"Accuracy: {correct/len(test_cases):.2%}")
```
For open-ended generation tasks, use LLM-as-judge evaluation or human evaluation.
Hardware Reference
| GPU | VRAM | Max model (LoRA, bf16) | Max model (QLoRA) |
|---|---|---|---|
| RTX 4090 | 24GB | 7B | 70B |
| A100 40GB | 40GB | 13B | 70B |
| A100 80GB | 80GB | 30B | 70B+ |
| 2× A100 80GB | 160GB | 70B | 70B+ |
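These ceilings follow from rough bytes-per-parameter arithmetic; a sketch of the estimate (ignoring activations and framework overhead, which add several GB in practice):

```python
def vram_gb(params_billion: float, mode: str) -> float:
    """Very rough weight + optimizer-state memory, in GB (1 GB = 1e9 bytes)."""
    p = params_billion * 1e9
    if mode == "full_ft":    # bf16 weights + bf16 grads + fp32 Adam (m, v, master)
        bytes_total = p * (2 + 2 + 12)
    elif mode == "lora":     # bf16 frozen weights; adapter states are negligible
        bytes_total = p * 2
    elif mode == "qlora":    # ~0.5 bytes/param for NF4-quantized frozen weights
        bytes_total = p * 0.5
    else:
        raise ValueError(mode)
    return bytes_total / 1e9

print(round(vram_gb(7, "full_ft")))  # 112 — why full FT of 7B needs multiple A100s
print(round(vram_gb(7, "lora")))     # 14  — fits a 24GB card with headroom
print(round(vram_gb(8, "qlora")))    # 4   — weights only; ~8GB with overhead
```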
For production fine-tuning, use cloud GPUs (Lambda Labs, RunPod, AWS p4d instances). For experimentation, QLoRA on an RTX 4090 covers most use cases.
Next: learn how to serve your fine-tuned model efficiently with vLLM and LLM inference optimization.