Pretrained language models are impressive generalists. They can write code, explain concepts, translate languages, and summarize documents — all from a single set of weights. But “impressive generalist” and “expert in your specific domain” are different things. If you need a model that consistently uses your company’s terminology, follows your specific output format, matches your brand’s tone, or performs well on a narrow task like classifying customer support tickets by urgency — fine-tuning is how you get there.
Hugging Face has become the standard infrastructure layer for working with open-source models. The transformers library provides a unified API for hundreds of model architectures. The datasets library handles data loading and preprocessing. The Trainer class wraps the training loop with gradient accumulation, mixed precision, and evaluation built in. Together, they mean you can fine-tune a model with far less boilerplate than PyTorch alone would require.
This tutorial covers the complete fine-tuning workflow: setting up a dataset, loading a pretrained model, configuring training with the Trainer API, evaluating the results, and saving/loading your fine-tuned model. We’ll work through two examples — sentiment classification (a classification task) and instruction tuning (a text generation task).
Fine-tuning with Hugging Face: load a pretrained model with
AutoModelForSequenceClassification.from_pretrained(), tokenize your dataset with AutoTokenizer, define TrainingArguments, create a Trainer, call trainer.train(). For LLMs, use SFTTrainer from TRL with LoRA (PEFT) to reduce memory requirements. Save with trainer.save_model().
What Is Fine-Tuning and When Should You Do It?
A pretrained model has learned general language understanding from billions of tokens of text. Fine-tuning continues training on a smaller, task-specific dataset to specialize those general capabilities. The pretrained weights provide a head start — you need far less data and compute than training from scratch.
Fine-tuning is the right choice when: a general model gives inconsistent results on your specific task; you need the model to follow a specific output format reliably; you have domain-specific terminology the general model handles poorly; you need to embed task-specific knowledge that’s expensive to inject via prompting; or you need a smaller, faster model that’s specialized for one task rather than a large general model.
Fine-tuning is NOT the right choice when: the task can be solved with good prompting alone; you have fewer than a few hundred examples; the task requires knowledge that changes frequently (use RAG instead); or you don’t have the compute budget even for fine-tuning.
| Approach | Data Needed | Compute | Best For |
|---|---|---|---|
| Prompting | 0–10 examples | None | General tasks, quick iteration |
| Few-shot prompting | 10–100 examples | None | Pattern following with small models |
| Fine-tuning (full) | 1K–100K examples | High (multiple GPUs) | Small models, max performance |
| Fine-tuning (LoRA/PEFT) | 100–10K examples | Moderate (1 GPU) | LLMs, memory-constrained hardware |
| RAG | Any amount | Low (just embeddings) | Knowledge that updates frequently |
Installing Dependencies
pip install transformers datasets accelerate evaluate scikit-learn
# For LLM fine-tuning with LoRA:
pip install peft trl bitsandbytes
# If you have a GPU:
pip install torch --index-url https://download.pytorch.org/whl/cu121
The transformers library is the core Hugging Face library for models and tokenizers. datasets provides efficient data loading and processing. accelerate handles distributed training and mixed precision automatically. evaluate provides standardized metrics. peft (Parameter-Efficient Fine-Tuning) provides LoRA and other memory-efficient adaptation methods. trl (Transformer Reinforcement Learning) includes SFTTrainer for supervised fine-tuning of LLMs.
Part 1: Fine-Tuning for Text Classification
Text classification is the most common fine-tuning task: given a text, predict one of N categories. Sentiment analysis (positive/negative/neutral), intent classification, and topic categorization all use the same training approach.
Loading and Preparing the Dataset
from datasets import load_dataset, DatasetDict
from transformers import AutoTokenizer
# Load the IMDB sentiment dataset from Hugging Face Hub
dataset = load_dataset("imdb")
print(dataset)
# DatasetDict with 'train' (25,000 examples), 'test' (25,000 examples),
# and an unlabeled 'unsupervised' split (50,000 examples)
# Each example: {'text': '...', 'label': 0 or 1}
# For demonstration, work with a smaller subset
small_dataset = DatasetDict({
'train': dataset['train'].select(range(2000)),
'test': dataset['test'].select(range(500))
})
# Load the tokenizer for our base model
model_name = "distilbert-base-uncased" # Fast, small, good baseline
tokenizer = AutoTokenizer.from_pretrained(model_name)
def tokenize_function(examples):
"""Tokenize text examples with truncation and padding."""
return tokenizer(
examples["text"],
truncation=True,
padding="max_length",
max_length=512
)
# Apply tokenization to the entire dataset
tokenized_dataset = small_dataset.map(
tokenize_function,
batched=True, # Process in batches for speed
remove_columns=["text"] # Remove the raw text column (we have tokens now)
)
print(f"Training examples: {len(tokenized_dataset['train'])}")
print(f"Test examples: {len(tokenized_dataset['test'])}")
print(f"Features: {tokenized_dataset['train'].features}")
The tokenizer converts raw text into token IDs that the model understands. truncation=True cuts sequences longer than max_length. padding="max_length" pads shorter sequences to the same length so they can be batched. batched=True in map() processes multiple examples at once, which is significantly faster than one-at-a-time processing.
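To make the truncation and padding behavior concrete, here is a toy illustration using plain lists of token IDs. The IDs and pad value are made up for demonstration; the real tokenizer also handles subword splitting, special tokens, and attention masks.

```python
def truncate_and_pad(ids: list[int], max_length: int, pad_id: int = 0) -> list[int]:
    """Cut sequences longer than max_length; pad shorter ones up to max_length."""
    ids = ids[:max_length]                      # what truncation=True does
    ids += [pad_id] * (max_length - len(ids))   # what padding="max_length" does
    return ids

# Two illustrative token ID sequences of different lengths
batch = [[101, 7592, 2088, 102], [101, 7592, 102]]
padded = [truncate_and_pad(ids, max_length=5) for ids in batch]
print(padded)  # [[101, 7592, 2088, 102, 0], [101, 7592, 103, ...]] -> all length 5
```

Once every sequence has the same length, the batch can be stacked into a single tensor for the model.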
Loading the Model and Configuring Training
from transformers import (
AutoModelForSequenceClassification,
TrainingArguments,
Trainer
)
import evaluate
import numpy as np
# Load pretrained model with a classification head
# num_labels=2 for binary sentiment (positive/negative)
model = AutoModelForSequenceClassification.from_pretrained(
model_name,
num_labels=2,
id2label={0: "NEGATIVE", 1: "POSITIVE"},
label2id={"NEGATIVE": 0, "POSITIVE": 1}
)
# Load evaluation metric
accuracy_metric = evaluate.load("accuracy")
f1_metric = evaluate.load("f1")
def compute_metrics(eval_pred):
"""Compute accuracy and F1 during evaluation."""
logits, labels = eval_pred
predictions = np.argmax(logits, axis=-1)
accuracy = accuracy_metric.compute(predictions=predictions, references=labels)
f1 = f1_metric.compute(predictions=predictions, references=labels, average="binary")
return {**accuracy, **f1}
# Configure training
training_args = TrainingArguments(
output_dir="./sentiment-model", # Where to save checkpoints
num_train_epochs=3, # Full passes through the training data
per_device_train_batch_size=16, # Batch size per GPU/CPU
per_device_eval_batch_size=32,
warmup_steps=100, # Gradual LR increase at start
weight_decay=0.01, # L2 regularization
learning_rate=2e-5, # Key hyperparameter for fine-tuning
evaluation_strategy="epoch", # Evaluate at end of each epoch (newer transformers versions call this eval_strategy)
save_strategy="epoch",
load_best_model_at_end=True, # Keep the best checkpoint
metric_for_best_model="f1",
logging_steps=50,
fp16=True, # Mixed precision (faster on GPU; set to False for CPU-only training)
report_to="none" # Disable wandb/tensorboard for simplicity
)
# Create the Trainer
trainer = Trainer(
model=model,
args=training_args,
train_dataset=tokenized_dataset["train"],
eval_dataset=tokenized_dataset["test"],
compute_metrics=compute_metrics,
)
print(f"Model parameters: {model.num_parameters():,}")
print(f"Trainable parameters: {sum(p.numel() for p in model.parameters() if p.requires_grad):,}")
Training and Evaluating
# Train the model
print("Starting training...")
train_result = trainer.train()
print(f"\nTraining complete!")
print(f"Train loss: {train_result.training_loss:.4f}")
# Evaluate on test set
eval_results = trainer.evaluate()
print(f"\nTest accuracy: {eval_results['eval_accuracy']:.4f}")
print(f"Test F1: {eval_results['eval_f1']:.4f}")
# Save the fine-tuned model
trainer.save_model("./sentiment-model-final")
tokenizer.save_pretrained("./sentiment-model-final")
print("\nModel saved to ./sentiment-model-final")
The Trainer handles the entire training loop — forward pass, loss calculation, backpropagation, optimizer step — for every batch across all epochs. The load_best_model_at_end=True setting means that if epoch 2 had the best F1 score but epoch 3 regressed slightly, you get the epoch 2 weights, not epoch 3. After training, trainer.save_model() writes the model weights and config to disk; saving the tokenizer alongside them with tokenizer.save_pretrained(), as the code above does, makes the output directory self-contained — you can copy it to any machine and run inference without needing to know which base model it started from.
Using the Fine-Tuned Model
from transformers import pipeline
# Load the fine-tuned model with the high-level pipeline API
classifier = pipeline(
"text-classification",
model="./sentiment-model-final",
tokenizer="./sentiment-model-final",
device=0 # First GPU; use device=-1 for CPU
)
# Test on new examples
test_texts = [
"This movie was absolutely brilliant! One of the best I've seen.",
"Complete waste of time. Boring from start to finish.",
"It was okay, nothing special but not terrible either.",
"An unexpected masterpiece. I was completely captivated."
]
results = classifier(test_texts)
for text, result in zip(test_texts, results):
print(f"Text: {text[:50]}...")
print(f"Label: {result['label']} (confidence: {result['score']:.3f})\n")
The pipeline API is the simplest way to run inference on a saved model. It handles tokenization, tensor conversion, the forward pass, and converting logits back to human-readable labels — all in one call. The device=0 argument moves the model to your first GPU; use device=-1 for CPU-only inference. For production deployments where latency matters, you’d typically load the model once at startup and keep it in memory, batching incoming requests rather than processing them one at a time.
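The batching idea can be sketched without any model at all. Below, batched_inference and fake_classify are hypothetical names for illustration; in practice the classify function would be the pipeline you loaded once at startup (the pipeline also accepts a batch_size argument directly).

```python
from typing import Callable

def batched_inference(texts: list[str],
                      classify_batch: Callable[[list[str]], list[dict]],
                      batch_size: int = 8) -> list[dict]:
    """Run classification over texts in fixed-size batches instead of one by one."""
    results: list[dict] = []
    for i in range(0, len(texts), batch_size):
        results.extend(classify_batch(texts[i:i + batch_size]))
    return results

# Stub classifier standing in for a real pipeline loaded at startup
def fake_classify(batch: list[str]) -> list[dict]:
    return [{"label": "POSITIVE", "score": 0.99} for _ in batch]

labels = batched_inference([f"ticket {i}" for i in range(20)], fake_classify, batch_size=8)
print(len(labels))  # 20
```

The same structure works for a request queue in a web service: buffer incoming texts briefly, flush them through the model as one batch, and fan the results back out to callers.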
Part 2: Fine-Tuning an LLM with LoRA
Fine-tuning full LLMs (7B+ parameters) requires significant GPU memory — too much for most developers. LoRA (Low-Rank Adaptation) is a parameter-efficient approach that freezes the original model weights and adds small trainable rank decomposition matrices to each layer. Instead of updating 7 billion parameters, you update 5-10 million. The quality loss is minimal for most tasks.
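The core idea fits in a few lines of linear algebra. This is a minimal NumPy sketch of the LoRA update, not the PEFT implementation: the frozen weight W stays fixed, only the low-rank factors A and B train, and the adapted layer computes h = Wx + (alpha/r)·B(Ax). The dimensions here are toy values.

```python
import numpy as np

d_out, d_in, r, alpha = 64, 64, 8, 16
rng = np.random.default_rng(0)

W = rng.normal(size=(d_out, d_in))     # frozen pretrained weight
A = rng.normal(size=(r, d_in)) * 0.01  # trainable, small random init
B = np.zeros((d_out, r))               # trainable, zero init

x = rng.normal(size=(d_in,))
h = W @ x + (alpha / r) * (B @ (A @ x))

# Because B starts at zero, the adapted layer initially matches the base layer
assert np.allclose(h, W @ x)

# Trainable parameters: r*(d_in + d_out) instead of d_in*d_out
print(r * (d_in + d_out), "vs", d_in * d_out)  # 1024 vs 4096
```

The zero initialization of B is what makes LoRA safe to bolt on: at step zero the model behaves exactly like the base model, and the adapters drift away from zero only as training demands.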
from transformers import AutoModelForCausalLM, AutoTokenizer, TrainingArguments
from peft import LoraConfig, get_peft_model, TaskType
from trl import SFTTrainer
from datasets import Dataset
import torch
# Load a smaller model for this demo (use llama or mistral for production)
base_model = "microsoft/phi-2" # 2.7B parameters, fits in ~8GB VRAM with LoRA
tokenizer = AutoTokenizer.from_pretrained(base_model)
tokenizer.pad_token = tokenizer.eos_token # Phi-2 doesn't have a pad token
model = AutoModelForCausalLM.from_pretrained(
base_model,
torch_dtype=torch.float16, # Use float16 to save memory
device_map="auto" # Automatically assign to GPU if available
)
# LoRA configuration
lora_config = LoraConfig(
r=16, # Rank — higher = more parameters, better quality
lora_alpha=32, # Scaling factor (usually 2x rank)
target_modules=["q_proj", "v_proj"], # Which layers to adapt (model-specific)
lora_dropout=0.05,
bias="none",
task_type=TaskType.CAUSAL_LM
)
# Apply LoRA to the model
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
# Example output (exact numbers depend on the model and LoRA config):
# trainable params: 3,145,728 || all params: 2,782,765,056 (0.11% trainable)
The r=16 rank determines LoRA’s capacity. Higher rank means more trainable parameters and better adaptation, but more memory. For most tasks, ranks between 8 and 64 work well. target_modules specifies which layers get LoRA adapters — this varies by model architecture. For LLaMA models it’s typically ["q_proj", "k_proj", "v_proj", "o_proj"].
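When you are unsure which names a new architecture uses, you can enumerate its Linear layers and pick the attention projections from the result. Here a toy module stands in for the real model purely to show the filtering trick; with a real model you would run the same comprehension over model.named_modules().

```python
import torch.nn as nn

# Toy stand-in for one transformer attention block (illustrative names only)
class ToyAttention(nn.Module):
    def __init__(self, dim: int = 32):
        super().__init__()
        self.q_proj = nn.Linear(dim, dim)
        self.k_proj = nn.Linear(dim, dim)
        self.v_proj = nn.Linear(dim, dim)
        self.dense = nn.Linear(dim, dim)

model = ToyAttention()

# Collect the leaf names of all Linear submodules -- candidate target_modules
linear_names = sorted({name.split(".")[-1]
                       for name, module in model.named_modules()
                       if isinstance(module, nn.Linear)})
print(linear_names)  # ['dense', 'k_proj', 'q_proj', 'v_proj']
```

Running this against the actual base model tells you exactly which projection names exist, so you don't have to guess between conventions like q_proj/v_proj and query/value.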
Preparing Instruction Data
from datasets import Dataset
# Instruction-following dataset format
# The model learns to follow instructions in this format
raw_data = [
{
"instruction": "Explain what a Python decorator is.",
"input": "",
"output": "A Python decorator is a function that takes another function as input and returns a modified version of that function. Decorators allow you to add functionality to existing functions without modifying them directly, using the @decorator_name syntax."
},
{
"instruction": "Write a Python function to check if a number is prime.",
"input": "",
"output": "def is_prime(n: int) -> bool:\n if n < 2:\n return False\n if n == 2:\n return True\n if n % 2 == 0:\n return False\n for i in range(3, int(n**0.5) + 1, 2):\n if n % i == 0:\n return False\n return True"
},
{
"instruction": "What does the following Python code do?",
"input": "result = [x**2 for x in range(10) if x % 2 == 0]",
"output": "This list comprehension creates a list of squares of even numbers from 0 to 9. It iterates through numbers 0-9, filters for even numbers (x % 2 == 0), squares each one (x**2), and collects them in a list. The result is [0, 4, 16, 36, 64]."
},
# ... add hundreds or thousands more examples for real training
]
def format_instruction(example):
"""Format into a single instruction-following string."""
if example.get("input"):
return f"""### Instruction:
{example['instruction']}
### Input:
{example['input']}
### Response:
{example['output']}"""
else:
return f"""### Instruction:
{example['instruction']}
### Response:
{example['output']}"""
# Convert to Dataset and format
dataset = Dataset.from_list(raw_data)
dataset = dataset.map(
lambda x: {"text": format_instruction(x)},
remove_columns=dataset.column_names
)
print(dataset[0]["text"])
Training with SFTTrainer
training_args = TrainingArguments(
output_dir="./python-tutor-lora",
num_train_epochs=3,
per_device_train_batch_size=4,
gradient_accumulation_steps=4, # Effective batch size = 4 * 4 = 16
warmup_steps=50,
learning_rate=2e-4, # LoRA uses higher LR than full fine-tuning
fp16=True,
logging_steps=10,
save_strategy="epoch",
report_to="none"
)
# Note: recent TRL releases move dataset_text_field and max_seq_length into
# SFTConfig; this call matches the older SFTTrainer signature.
trainer = SFTTrainer(
model=model,
train_dataset=dataset,
args=training_args,
tokenizer=tokenizer,
dataset_text_field="text",
max_seq_length=512,
peft_config=lora_config
)
print("Training with LoRA...")
trainer.train()
# Save the LoRA adapter (NOT the full model -- much smaller!)
trainer.save_model("./python-tutor-lora-adapters")
print("LoRA adapters saved (small file, just the delta weights)")
The gradient_accumulation_steps=4 setting simulates a larger batch size by accumulating gradients over multiple forward passes before updating weights. This is essential when GPU memory limits your batch size: an effective batch size of 16 typically trains more stably than a batch size of 4.
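This sketch shows why accumulation is equivalent to a bigger batch: accumulating gradients over four micro-batches of 4 reproduces (up to floating point) the gradient of one batch of 16, provided each micro-batch loss is divided by the number of accumulation steps, exactly what the Trainer does internally.

```python
import torch

torch.manual_seed(0)
w = torch.zeros(3, requires_grad=True)
X, y = torch.randn(16, 3), torch.randn(16)

# One big batch of 16
loss = ((X @ w - y) ** 2).mean()
loss.backward()
big_batch_grad = w.grad.clone()

# Four accumulated micro-batches of 4
w.grad = None
accum_steps = 4
for i in range(accum_steps):
    xb, yb = X[i * 4:(i + 1) * 4], y[i * 4:(i + 1) * 4]
    micro_loss = ((xb @ w - yb) ** 2).mean() / accum_steps  # scale before backward
    micro_loss.backward()  # gradients add up across backward() calls

assert torch.allclose(w.grad, big_batch_grad, atol=1e-6)
```

Only the optimizer step is deferred; memory usage stays at the micro-batch level because each backward pass frees its activations.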
Loading and Using LoRA Fine-Tuned Models
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel
import torch
# Load base model + LoRA adapters
base_model_name = "microsoft/phi-2"
adapter_path = "./python-tutor-lora-adapters"
tokenizer = AutoTokenizer.from_pretrained(base_model_name)
base_model = AutoModelForCausalLM.from_pretrained(
base_model_name,
torch_dtype=torch.float16,
device_map="auto"
)
# Load and merge LoRA adapters into the base model
model = PeftModel.from_pretrained(base_model, adapter_path)
model = model.merge_and_unload() # Merge adapters into weights for faster inference
# Generate a response
def ask_model(question: str, max_tokens: int = 300) -> str:
prompt = f"""### Instruction:
{question}
### Response:
"""
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
with torch.no_grad():
outputs = model.generate(
**inputs,
max_new_tokens=max_tokens,
temperature=0.7,
do_sample=True,
pad_token_id=tokenizer.eos_token_id
)
# Decode only the new tokens (skip the prompt)
new_tokens = outputs[0][inputs["input_ids"].shape[1]:]
return tokenizer.decode(new_tokens, skip_special_tokens=True)
# Test the fine-tuned model
response = ask_model("Explain list comprehensions in Python with an example.")
print(response)
merge_and_unload() permanently fuses the LoRA adapter weights back into the base model's weight matrices. The result is a single merged model with no LoRA overhead during inference — same speed as the original base model, but with your task-specific improvements baked in. This is the deployment-ready form. Alternatively, keep the adapter separate with PeftModel.from_pretrained() at runtime if you need to hot-swap between different adapters for the same base model without reloading the full weights each time.
Real-Life Example: Customer Support Ticket Classifier
Here's a complete fine-tuning workflow for a realistic business use case — classifying customer support tickets into categories:
from datasets import Dataset
from transformers import (
AutoTokenizer, AutoModelForSequenceClassification,
TrainingArguments, Trainer, DataCollatorWithPadding
)
import evaluate
import numpy as np
# Sample training data (in practice you'd have thousands of examples)
ticket_data = [
{"text": "My payment failed but I was still charged", "label": 0}, # billing
{"text": "Can't log into my account, password reset not working", "label": 1}, # auth
{"text": "The app crashes every time I open the dashboard", "label": 2}, # bug
{"text": "How do I export my data as a CSV file?", "label": 3}, # howto
{"text": "Invoice shows wrong amount for last month", "label": 0}, # billing
{"text": "Two-factor auth code not arriving via SMS", "label": 1}, # auth
{"text": "Search results are empty even though I have data", "label": 2}, # bug
{"text": "Can I change my billing cycle from monthly to annual?", "label": 3}, # howto
# ... add many more
]
label_names = ["billing", "authentication", "bug_report", "how_to"]
num_labels = len(label_names)
model_name = "distilbert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_name)
dataset = Dataset.from_list(ticket_data)
# Train/test split
split = dataset.train_test_split(test_size=0.2, seed=42)
def tokenize(examples):
    # No padding here: DataCollatorWithPadding pads each batch dynamically
    return tokenizer(examples["text"], truncation=True)
tokenized = split.map(tokenize, batched=True)
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)
# Load model
model = AutoModelForSequenceClassification.from_pretrained(
model_name,
num_labels=num_labels,
id2label={i: name for i, name in enumerate(label_names)},
label2id={name: i for i, name in enumerate(label_names)}
)
# Training
accuracy = evaluate.load("accuracy")
def compute_metrics(eval_pred):
preds, labels = eval_pred
preds = np.argmax(preds, axis=1)
return accuracy.compute(predictions=preds, references=labels)
args = TrainingArguments(
output_dir="./ticket-classifier",
num_train_epochs=5,
per_device_train_batch_size=8,
per_device_eval_batch_size=8,
evaluation_strategy="epoch", # newer transformers versions call this eval_strategy
save_strategy="epoch",
load_best_model_at_end=True,
metric_for_best_model="accuracy",
report_to="none"
)
trainer = Trainer(
model=model,
args=args,
train_dataset=tokenized["train"],
eval_dataset=tokenized["test"],
tokenizer=tokenizer,
data_collator=data_collator,
compute_metrics=compute_metrics
)
trainer.train()
trainer.save_model("./ticket-classifier-final")
# Production inference
from transformers import pipeline
classifier = pipeline("text-classification", model="./ticket-classifier-final")
new_tickets = [
"I was charged twice for the same subscription this month",
"Getting 404 error on the reports page",
"How do I add a team member to my account?"
]
for ticket in new_tickets:
result = classifier(ticket)[0]
print(f"Ticket: {ticket}")
print(f"Category: {result['label']} (confidence: {result['score']:.2%})\n")
Frequently Asked Questions
How much data do I need for fine-tuning?
For classification with a pretrained language model, 500-2000 labeled examples per class is a reasonable starting point. With more data you'll get better results up to a point of diminishing returns (usually 10K-100K examples). For instruction tuning LLMs, high-quality datasets of 1000-10000 examples often outperform low-quality datasets of 100K examples. Quality matters more than quantity.
Do I need a GPU for fine-tuning?
For small models (DistilBERT, BERT-base): CPU works but is slow (hours instead of minutes). For medium models (7B LLMs with LoRA): a single consumer GPU with 8-16GB VRAM (e.g., an RTX 3080 or 4080) is sufficient; Apple Silicon (M1/M2 Pro) can also work via the MPS backend, though CUDA-only tools such as bitsandbytes won't run there. For large models without LoRA: multiple high-VRAM GPUs or cloud compute (A100s).
What's the difference between fine-tuning and RLHF?
Fine-tuning (supervised) trains on (input, correct output) pairs — you need labeled data with known correct answers. RLHF (Reinforcement Learning from Human Feedback) trains the model to maximize human preference scores — you need human raters to rank model outputs. RLHF is how models like ChatGPT learn to be helpful and harmless. For most custom task fine-tuning, supervised fine-tuning is sufficient and much simpler.
How do I prevent catastrophic forgetting during fine-tuning?
Catastrophic forgetting is when fine-tuning on new data degrades performance on the original task. Solutions: use LoRA (fine-tuning a tiny fraction of parameters preserves the base model's capabilities); use a low learning rate (2e-5 for full fine-tuning, 2e-4 for LoRA); train for fewer epochs; include some original task data in your training mix.
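The last mitigation, mixing original-task data back in, is often called a replay mix. This is a minimal sketch with made-up example lists; the 10% ratio is a starting point to tune, not a recommendation from any specific paper.

```python
import random

random.seed(42)
task_examples = [{"text": f"task example {i}"} for i in range(900)]
general_examples = [{"text": f"general example {i}"} for i in range(5000)]

# Choose enough general examples that they make up ~10% of the final mix
replay_fraction = 0.10
n_replay = round(len(task_examples) * replay_fraction / (1 - replay_fraction))
mixed = task_examples + random.sample(general_examples, n_replay)
random.shuffle(mixed)
print(len(mixed), n_replay)  # 1000 100
```

With Hugging Face datasets you would do the same thing with concatenate_datasets followed by shuffle, but the arithmetic for the mixing ratio is identical.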
When should I use LoRA vs full fine-tuning?
Use LoRA when: the model has more than 1B parameters; you're memory-constrained (consumer GPU or CPU); you want to keep multiple specialized adapters for different tasks; you need fast switching between tasks. Use full fine-tuning when: the model is small (DistilBERT, BERT-base); you have significant compute budget; you need maximum performance on a single specific task.
Summary
You've fine-tuned a model for both classification (DistilBERT on sentiment) and instruction following (LoRA adapters on an LLM). The Hugging Face ecosystem handles the messy parts — gradient accumulation, mixed precision, checkpoint saving, evaluation — so you can focus on data quality and hyperparameter choices, which are the real levers for fine-tuning success.
The most important lesson: data quality beats model size almost every time. A fine-tuned small model on clean, well-labeled data usually outperforms a large pretrained model on your specific task. Invest in your dataset before spending compute. For using your fine-tuned model in a conversational interface, see How To Build a Chatbot with Ollama. For serving it behind an API, see Building a REST API with FastAPI.