When to Fine-Tune vs. Prompt Engineer
Before writing any training code, answer this question: can you solve the problem with prompting alone?
Fine-tuning is appropriate when:
- The task requires a specific output format that's hard to specify in a prompt
- You need the model to have domain knowledge not in its pretraining data
- Latency/cost constraints rule out large models or long prompts
- You have >1000 high-quality labeled examples
Stick with prompting when:
- You have fewer than a few hundred labeled examples
- The task is general enough that a base model handles it well
- You need rapid iteration
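A cheap way to answer the question is to score a prompting baseline on your labeled examples first; if it is already near your quality bar, skip fine-tuning. A minimal sketch, where `call_model` is a hypothetical stand-in for whatever inference API you use (stubbed here so the sketch runs end to end):

```python
def call_model(prompt: str) -> str:
    # Hypothetical stub for your real inference API (OpenAI, vLLM, etc.)
    return "spam" if "free" in prompt.lower() else "not spam"

def prompting_baseline_accuracy(examples: list[dict]) -> float:
    """Score a zero-shot prompt against labeled examples."""
    correct = 0
    for ex in examples:
        prompt = f"Classify this email as spam or not spam.\n\n{ex['input']}\n\nAnswer:"
        prediction = call_model(prompt).strip().lower()
        if ex["label"] in prediction:
            correct += 1
    return correct / len(examples)

examples = [
    {"input": "Win a FREE iPhone today!", "label": "spam"},
    {"input": "Meeting moved to 3pm.", "label": "not spam"},
]
print(prompting_baseline_accuracy(examples))  # 1.0 with the stub above
```

If this baseline (with a few-shot prompt and a capable model) lands close to your target, the fine-tuning effort below may not pay for itself.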
Full Fine-Tuning with Hugging Face
Full fine-tuning updates all model parameters. This gives the best results but requires the most memory.
```python
from transformers import (
    AutoTokenizer,
    AutoModelForCausalLM,
    TrainingArguments,
    Trainer,
    DataCollatorForLanguageModeling,
)
from datasets import Dataset
import torch

# Prepare data
def format_instruction(example):
    return {
        "text": f"### Instruction:\n{example['instruction']}\n\n### Response:\n{example['output']}"
    }

raw_data = [
    {"instruction": "Summarize this code", "output": "This function..."},
    # ... thousands of examples
]
dataset = Dataset.from_list(raw_data)
dataset = dataset.map(format_instruction)

# Load model and tokenizer
model_name = "meta-llama/Llama-3.2-3B"  # 3B — feasible to fine-tune on 1 GPU
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token

model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,  # bf16 saves memory, minimal quality loss
    device_map="auto",
)

# Tokenize
def tokenize(examples):
    return tokenizer(
        examples["text"],
        truncation=True,
        max_length=512,
        padding="max_length",
    )

tokenized = dataset.map(tokenize, batched=True, remove_columns=["text", "instruction", "output"])
# Hold out a slice for evaluation
splits = tokenized.train_test_split(test_size=0.1, seed=42)

# Training config
training_args = TrainingArguments(
    output_dir="./llama3-finetuned",
    num_train_epochs=3,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,  # Effective batch size = 4 * 4 = 16
    learning_rate=2e-5,
    warmup_ratio=0.03,
    lr_scheduler_type="cosine",
    bf16=True,
    logging_steps=10,
    save_strategy="epoch",
    eval_strategy="epoch",
    report_to="wandb",
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=splits["train"],
    eval_dataset=splits["test"],
    # The collator copies input_ids into labels for causal-LM loss
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```
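As a sanity check before launching a run, it helps to compute how many optimizer steps the configuration above implies. A sketch of the arithmetic (assuming a single GPU and a hypothetical dataset size):

```python
import math

def total_optimizer_steps(num_examples: int, per_device_batch: int,
                          grad_accum: int, epochs: int, num_gpus: int = 1) -> int:
    """Optimizer steps = ceil(examples / effective batch) * epochs."""
    effective_batch = per_device_batch * grad_accum * num_gpus
    steps_per_epoch = math.ceil(num_examples / effective_batch)
    return steps_per_epoch * epochs

# The config above: batch 4, accumulation 4, 3 epochs; say 10,000 examples
print(total_optimizer_steps(10_000, 4, 4, 3))  # 1875
```

Too few total steps (under a few hundred) suggests the dataset is small enough that prompting or more epochs deserve a second look.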
LoRA: Fine-Tune 1% of Parameters
Fully fine-tuning a 7B model requires multiple 80GB A100s: in bf16 with Adam, the weights, gradients, and optimizer states alone total over 100GB before activations. LoRA (Low-Rank Adaptation) achieves comparable results by training only small adapter matrices:
```
Original weight matrix W (frozen): [d × d]
LoRA: W' = W + BA, where B: [d × r], A: [r × d], r << d

For a 4096 × 4096 attention matrix with r = 16:
  Original: 4096 × 4096 ≈ 16.7M parameters
  LoRA:     4096×16 + 16×4096 = 131K parameters (~0.8% of the original)
```
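The arithmetic generalizes to any weight shape; a quick sketch for checking adapter sizes before training:

```python
def lora_params(d_out: int, d_in: int, r: int) -> int:
    """Parameters in one LoRA adapter pair: B is [d_out, r], A is [r, d_in]."""
    return d_out * r + r * d_in

full = 4096 * 4096                      # frozen weight: 16,777,216 params
adapter = lora_params(4096, 4096, 16)   # 131,072 params
print(adapter, adapter / full)          # 131072, ~0.0078 (about 0.8%)
```

Note that adapter size scales linearly with r while the frozen weight is quadratic in d, which is why even r=64 stays tiny relative to the base model.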
```python
from peft import LoraConfig, get_peft_model, TaskType

# Load base model (Llama-2-7B: 32 layers, hidden size 4096)
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

# Apply LoRA
lora_config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=16,              # Rank — higher = more capacity, more memory
    lora_alpha=32,     # Scaling factor (usually 2*r)
    lora_dropout=0.1,
    target_modules=[         # Which layers to add LoRA to
        "q_proj", "v_proj",  # Attention projections
        "k_proj", "o_proj",  # All attention projections for better results
    ],
    bias="none",
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
# Roughly: trainable params ≈ 16.8M of ~6.76B total (~0.25%)
```
Now train exactly like full fine-tuning — the API is identical, but you're only updating ~17M parameters instead of 7B.
QLoRA: 4-bit Quantization + LoRA
QLoRA (Quantized LoRA) lets you fine-tune an 8B model on a single 16GB GPU:
```python
from transformers import BitsAndBytesConfig
from peft import prepare_model_for_kbit_training

# 4-bit quantization config
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",               # NormalFloat4 quantization
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,          # Quantize the quantization constants
)

# Load in 4-bit
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-8B",
    quantization_config=bnb_config,
    device_map="auto",
)

# Prepare for LoRA training
model = prepare_model_for_kbit_training(model)

# Apply LoRA (same lora_config as before)
model = get_peft_model(model, lora_config)
# GPU memory: ~8GB for an 8B model with QLoRA
```
Supervised Fine-Tuning for Instruction Following
For instruction-following tasks, structure your data carefully:
```python
# Good instruction format (Alpaca-style)
def format_sample(instruction: str, input: str, output: str) -> str:
    if input:
        return f"""Below is an instruction that describes a task, paired with an input. Write a response.

### Instruction:
{instruction}

### Input:
{input}

### Response:
{output}"""
    else:
        return f"""Below is an instruction. Write a response.

### Instruction:
{instruction}

### Response:
{output}"""

# Critical: only compute loss on the response, not the instruction
def tokenize_with_labels(example, tokenizer):
    full_text = format_sample(example["instruction"], example["input"], example["output"])
    tokens = tokenizer(full_text, return_tensors="pt")
    # Find where the response starts: tokenize the prompt-only prefix
    instruction_text = format_sample(example["instruction"], example["input"], "")
    instruction_len = len(tokenizer(instruction_text)["input_ids"])
    # Mask instruction tokens in labels (-100 = ignored by the loss)
    labels = tokens["input_ids"].clone()
    labels[0, :instruction_len] = -100
    return {"input_ids": tokens["input_ids"][0], "labels": labels[0]}
```
Not masking the instruction means your model wastes capacity learning to reproduce the prompt. Always mask.
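The masking logic itself is independent of any tokenizer — it just overwrites the prompt positions with -100 (the index PyTorch's cross-entropy loss ignores). A dependency-free sketch with hypothetical token IDs:

```python
IGNORE_INDEX = -100  # default ignore_index for cross-entropy loss

def mask_prompt_tokens(input_ids: list[int], prompt_len: int) -> list[int]:
    """Labels for causal-LM SFT: ignore the prompt, learn only the response."""
    labels = list(input_ids)
    labels[:prompt_len] = [IGNORE_INDEX] * prompt_len
    return labels

# Hypothetical token IDs: first 4 tokens are the prompt, last 3 the response
input_ids = [101, 2023, 2003, 1996, 3437, 2004, 102]
print(mask_prompt_tokens(input_ids, 4))
# [-100, -100, -100, -100, 3437, 2004, 102]
```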
Merging LoRA Adapters
After training, merge the LoRA weights back into the base model for clean serving:
```python
from peft import PeftModel

# Load base model (must match the base the adapter was trained on)
base_model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",
    torch_dtype=torch.bfloat16,
)

# Load and merge LoRA weights
model = PeftModel.from_pretrained(base_model, "./lora-checkpoint")
merged_model = model.merge_and_unload()  # Fuses LoRA into base weights

# Save merged model
merged_model.save_pretrained("./llama3-finetuned-merged")
tokenizer.save_pretrained("./llama3-finetuned-merged")
# Now deploy as a regular model — no PEFT dependency
```
Evaluating Fine-Tuned Models
```python
from transformers import pipeline

pipe = pipeline("text-generation", model="./llama3-finetuned-merged", device=0)

# Automated evaluation
test_cases = [
    {"instruction": "Classify this email as spam or not spam.", "input": "Win a free iPhone...", "expected": "spam"},
    # ...
]

correct = 0
for case in test_cases:
    prompt = format_sample(case["instruction"], case["input"], "")
    output = pipe(prompt, max_new_tokens=50)[0]["generated_text"]
    response = output[len(prompt):].strip().lower()  # Strip the echoed prompt
    if case["expected"] in response:
        correct += 1

print(f"Accuracy: {correct/len(test_cases):.2%}")
```
For open-ended generation tasks, use LLM-as-judge evaluation or human evaluation.
Hardware Reference
| GPU | VRAM | Max model (LoRA, bf16) | Max model (QLoRA) |
|---|---|---|---|
| RTX 4090 | 24GB | 7B | 70B |
| A100 40GB | 40GB | 13B | 70B |
| A100 80GB | 80GB | 30B | 70B+ |
| 2× A100 80GB | 160GB | 70B | 70B+ |
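These ceilings follow from rough bytes-per-parameter arithmetic; a sketch of the estimate (ignoring activations and framework overhead, which add several GB in practice):

```python
def vram_gb(params_billion: float, mode: str) -> float:
    """Very rough weight + optimizer-state memory, in GB (1 GB = 1e9 bytes)."""
    p = params_billion * 1e9
    if mode == "full_ft":    # bf16 weights + bf16 grads + fp32 Adam (m, v, master)
        bytes_total = p * (2 + 2 + 12)
    elif mode == "lora":     # bf16 frozen weights; adapter states are negligible
        bytes_total = p * 2
    elif mode == "qlora":    # ~0.5 bytes/param for NF4-quantized frozen weights
        bytes_total = p * 0.5
    else:
        raise ValueError(mode)
    return bytes_total / 1e9

print(round(vram_gb(7, "full_ft")))  # 112 — why full FT of 7B needs multiple A100s
print(round(vram_gb(7, "lora")))     # 14  — fits a 24GB card with headroom
print(round(vram_gb(8, "qlora")))    # 4   — weights only; ~8GB with overhead
```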
For production fine-tuning, use cloud GPUs (Lambda Labs, RunPod, AWS p4d instances). For experimentation, QLoRA on an RTX 4090 covers most use cases.
Next: learn how to serve your fine-tuned model efficiently with vLLM and LLM inference optimization.