The earlier tutorials fine-tuned encoder models by training the whole network. That does not scale to billion-parameter generative LLMs — full fine-tuning would need enormous GPU memory. The answer is parameter-efficient fine-tuning (PEFT): freeze the giant pretrained model and train a tiny number of new parameters.
This tutorial covers the theory — language modeling, SFT, adapters, LoRA, and QLoRA — and then fine-tunes Microsoft's Phi-2 (2.7B parameters) to generate product names and descriptions, all on a single GPU.
Prerequisites: the transformer foundations from the earlier tutorials, and a GPU environment with transformers, peft, accelerate, bitsandbytes, datasets, pandas, and torch installed.
Types of LLM Fine-Tuning
There are three main ways to adapt a large language model:
- Language modeling — predicting the next word given the previous words. This is a self-supervised task; the model trains on a large text corpus with no human labels.
- Supervised fine-tuning (SFT) — fine-tuning the pretrained model on a custom labeled dataset for a specific behavior. This is what you do in this tutorial.
- Preference fine-tuning — training on a dataset with preference labels (which response humans prefer), used to align models with human taste.
There are also lighter-weight techniques that leave the model's own weights untouched: zero-shot and few-shot prompting, which train nothing, and prompt tuning, which learns only a small set of soft-prompt embeddings. For genuinely teaching a model new behavior, though, SFT with PEFT is the practical choice.
Parameter-Efficient Fine-Tuning (PEFT)
PEFT fine-tunes a pretrained model by training only a small number of new parameters, leaving the original weights frozen. Its main techniques are adapters, LoRA, and QLoRA.
Adapters
An adapter is a small neural network inserted into the pretrained model. The overall model grows slightly, but fine-tuning updates only the adapter's parameters instead of the full network. (See the adapter paper, "Parameter-Efficient Transfer Learning for NLP", arXiv:1902.00751.)
LoRA
LoRA (Low-Rank Adaptation, arXiv:2106.09685) is the workhorse of modern fine-tuning:
- Low-rank decomposition. LoRA decomposes the weight-update matrices in the network into two much smaller, lower-rank matrices, drastically reducing the number of trainable parameters.
- Injecting trainable parameters. The original weights stay frozen; only the low-rank matrices are trainable. They are injected into the layers to capture task-specific information.
- Combining outputs. During the forward pass, the frozen weight's output is combined with the low-rank matrices' output, so the model keeps its pretrained knowledge while adapting to the new task.

LoRA freezes the original weights and trains two small low-rank matrices injected alongside them.
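The three steps above can be sketched in a few lines of plain Python. This is a minimal illustration, not the peft implementation: the helper names (matvec, lora_forward) and the toy dimensions are assumptions chosen for readability. It follows the LoRA paper's convention of initializing B to zeros, so the adapted layer starts out identical to the frozen one, and of scaling the update by alpha / r.

```python
import random

def matvec(M, x):
    """Multiply matrix M (a list of rows) by vector x."""
    return [sum(m * v for m, v in zip(row, x)) for row in M]

def lora_forward(W, A, B, x, alpha, r):
    """Frozen output W @ x plus the scaled low-rank update (alpha / r) * B @ (A @ x)."""
    base = matvec(W, x)               # frozen pretrained path
    update = matvec(B, matvec(A, x))  # trainable low-rank path
    scale = alpha / r
    return [b + scale * u for b, u in zip(base, update)]

random.seed(0)
d_in, d_out, r, alpha = 8, 6, 2, 4    # toy sizes, hypothetical

W = [[random.gauss(0, 1) for _ in range(d_in)] for _ in range(d_out)]  # frozen, d_out x d_in
A = [[random.gauss(0, 0.01) for _ in range(d_in)] for _ in range(r)]   # trainable, r x d_in
B = [[0.0] * r for _ in range(d_out)]                                  # trainable, d_out x r, zero init
x = [random.gauss(0, 1) for _ in range(d_in)]

full_params = d_out * d_in            # what a full weight update would train
lora_params = r * (d_in + d_out)      # what LoRA trains instead

y0 = lora_forward(W, A, B, x, alpha, r)
print(full_params, lora_params)       # 48 28
```

At toy sizes the saving is modest, but it grows with the layer: for a 2560 by 10240 projection with r = 32, the low-rank pair has 64 times fewer parameters than the full matrix.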
QLoRA
QLoRA (arXiv:2305.14314) is LoRA with quantized weights:
- Hybrid technique. It combines quantization of the frozen base weights with low-rank adaptation, cutting memory use well below that of plain LoRA.
- Resource-constrained fine-tuning. It makes fine-tuning large models feasible on limited hardware — even a single consumer GPU.
- High performance. Despite the reduced resources, it maintains strong downstream performance.
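- NormalFloat quantization. QLoRA stores the frozen weights in NF4 (4-bit NormalFloat), a data type whose quantization levels are chosen for normally distributed values, which is how pretrained weights tend to be distributed. This keeps quantization error low and preserves quality.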

QLoRA quantizes the frozen base model to 4-bit and trains LoRA adapters on top, fitting large models on small GPUs.
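To make the idea of 4-bit storage concrete, here is a simplified sketch of blockwise quantization: each block of weights is divided by its largest absolute value, and every weight is replaced by the index of the nearest of 16 levels. It is an assumption-laden stand-in, not the bitsandbytes implementation: the function names and sample weights are made up, and the levels here are evenly spaced, whereas the real NF4 code book is not.

```python
def quantize_block(weights, levels):
    """Map each weight to the index of the nearest level after absmax scaling."""
    scale = max(abs(w) for w in weights) or 1.0   # avoid dividing by zero for an all-zero block
    codes = [
        min(range(len(levels)), key=lambda i: abs(levels[i] - w / scale))
        for w in weights
    ]
    return codes, scale

def dequantize_block(codes, scale, levels):
    """Recover approximate weights from the 4-bit codes and the block's scale."""
    return [levels[c] * scale for c in codes]

# 16 evenly spaced levels in [-1, 1]: a stand-in for the NF4 code book, which is not uniform.
LEVELS = [-1 + 2 * i / 15 for i in range(16)]

block = [0.12, -0.40, 0.05, 0.33, -0.07, 0.21, -0.29, 0.02]   # made-up weights
codes, scale = quantize_block(block, LEVELS)
restored = dequantize_block(codes, scale, LEVELS)
max_error = max(abs(a - b) for a, b in zip(block, restored))
print(codes, round(max_error, 4))
```

Each weight now needs only a 4-bit index plus one shared scale per block. During training the frozen weights are dequantized on the fly for each forward pass, while the LoRA matrices stay in higher precision.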
Note
Phi-2 is a 2.7B-parameter small language model from Microsoft. The technical report for its successor family, Phi-3, is at arXiv:2404.14219. Always check a model's license on its Hugging Face page before commercial use.
Loading the Custom Dataset
The task: given a product category, generate a product name or description. Load an Amazon product dataset and reshape it:
import pandas as pd
from datasets import Dataset
df = pd.read_csv('https://github.com/laxmimerit/All-CSV-ML-Data-Files-Download/raw/master/amazon_product_details.csv', usecols=['category', 'about_product', 'product_name'])
df['category'] = df['category'].apply(lambda x: x.split('|')[-1])
products = df[['category', 'product_name']].rename(columns={'product_name': 'text'})
description = df[['category', 'about_product']].rename(columns={'about_product': 'text'})
products['task_type'] = 'Product Name'
description['task_type'] = 'Product Description'
df = pd.concat([products, description], ignore_index=True)
Turn it into a shuffled train/test split:
dataset = Dataset.from_pandas(df)
dataset = dataset.shuffle(seed=0)
dataset = dataset.train_test_split(test_size=0.1)
dataset
DatasetDict({
train: Dataset({ features: ['category', 'text', 'task_type'], num_rows: 2637 })
test: Dataset({ features: ['category', 'text', 'task_type'], num_rows: 293 })
})
Formatting the Prompts
Causal LLMs learn from a single text field, so define a formatting function that combines the category, task type, and target text into one instruction-style prompt:
def formatting_func(example):
    text = f"""
Given the product category, you need to generate a '{example['task_type']}'.
### Category: {example['category']}\n ### {example['task_type']}: {example['text']}
"""
    return text
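To see exactly what the model will be trained on, it helps to render one prompt. The sketch below repeats the formatting function so it runs on its own and applies it to a hypothetical row; the product name is invented for illustration and does not come from the dataset.

```python
def formatting_func(example):
    text = f"""
Given the product category, you need to generate a '{example['task_type']}'.
### Category: {example['category']}\n ### {example['task_type']}: {example['text']}
"""
    return text

# Hypothetical row, shaped like the records in the dataset.
row = {
    "category": "BatteryChargers",
    "task_type": "Product Name",
    "text": "Example 4-Slot AA/AAA Charger",
}

prompt = formatting_func(row)
print(prompt)
```

The instruction, the category, and the target text all land in a single string, so at inference time you can stop the prompt right after the final colon and let the model complete it.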
Loading Phi-2 in 8-bit
Load Phi-2 quantized to 8-bit, which roughly halves the memory footprint of its weights compared with 16-bit:
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM
base_model_id = "microsoft/phi-2"
model = AutoModelForCausalLM.from_pretrained(
    base_model_id,
    trust_remote_code=True,
    torch_dtype=torch.float16,
    load_in_8bit=True  # quantize the weights to 8-bit via bitsandbytes
)
tokenizer = AutoTokenizer.from_pretrained(
    base_model_id,
    padding_side='left',  # pad on the left so generation continues from real tokens
    add_eos_token=True,
    add_bos_token=True,
    use_fast=False
)
tokenizer.pad_token = tokenizer.eos_token
Note
Recent versions of Transformers prefer a BitsAndBytesConfig object over the load_in_8bit=True shortcut. The shortcut still works but may emit a deprecation warning.
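A minimal sketch of the config-object form is shown here, wrapped in a hypothetical helper named load_phi2_8bit so that the block can be defined without triggering a download. The helper name and its lazy imports are assumptions for illustration; BitsAndBytesConfig and the quantization_config argument are the Transformers API the note refers to.

```python
def load_phi2_8bit(model_id="microsoft/phi-2"):
    """Load the base model in 8-bit through an explicit quantization config."""
    # Imported lazily so the function can be defined without a GPU environment.
    import torch
    from transformers import AutoModelForCausalLM, BitsAndBytesConfig

    quant_config = BitsAndBytesConfig(load_in_8bit=True)
    return AutoModelForCausalLM.from_pretrained(
        model_id,
        trust_remote_code=True,
        torch_dtype=torch.float16,
        quantization_config=quant_config,  # replaces the load_in_8bit=True shortcut
    )
```

Calling load_phi2_8bit() should give the same model as the shortcut. For a 4-bit QLoRA setup, the same class accepts load_in_4bit=True together with bnb_4bit_quant_type="nf4".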
Tokenize the dataset, copying input_ids into labels (for causal language modeling the model predicts its own input shifted by one):
max_length = 400
def tokenize(prompt):
    result = tokenizer(
        formatting_func(prompt),
        truncation=True,
        max_length=max_length,
        padding="max_length"
    )
    result['labels'] = result['input_ids'].copy()
    return result
dataset = dataset.map(tokenize)
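The "shifted by one" relationship is applied inside the model, which is why labels can simply be a copy of input_ids. The toy sketch below spells out which context predicts which token; the token ids and the helper name next_token_pairs are made up, and it assumes padding positions carry the label -100, the value Hugging Face models skip when computing the loss.

```python
IGNORE_INDEX = -100  # label value skipped by the cross-entropy loss in Hugging Face models

def next_token_pairs(input_ids, labels):
    """Pair each position's context with the token it is trained to predict."""
    pairs = []
    for i in range(len(input_ids) - 1):
        target = labels[i + 1]            # position i predicts token i + 1
        if target != IGNORE_INDEX:        # masked positions contribute no loss
            pairs.append((input_ids[: i + 1], target))
    return pairs

# Toy ids: 0 is the padding id here (left padding), 11..14 are real tokens.
input_ids = [0, 0, 11, 12, 13, 14]
labels = [IGNORE_INDEX, IGNORE_INDEX, 11, 12, 13, 14]  # copy of input_ids with padding masked

pairs = next_token_pairs(input_ids, labels)
for context, target in pairs:
    print(context, "->", target)
```

Every real token becomes a training target exactly once, and the padded positions add nothing to the loss.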
The Base Model Out of the Box
Before fine-tuning, see how raw Phi-2 handles the task:
eval_prompt = """
Given the product category, you need to generate a 'Product Description'.
### Category: BatteryChargers
### Product Description:
"""
model_input = tokenizer(eval_prompt, truncation=True, max_length=max_length, padding="max_length", return_tensors='pt').to("cuda")
model.eval()
with torch.no_grad():
    output = model.generate(**model_input, max_new_tokens=256, repetition_penalty=1.15)
print(tokenizer.decode(output[0], skip_special_tokens=True))
The base model rambles — it invents a "puzzle" about writing descriptions instead of producing one. It clearly does not follow our format. That is what fine-tuning fixes.
Configuring LoRA
Attach a LoRA adapter, targeting Phi-2's attention and feed-forward projection modules:
from peft import LoraConfig, get_peft_model
config = LoraConfig(
r=32,
lora_alpha=64,
target_modules=["Wqkv", "fc1", "fc2"],
bias="none",
lora_dropout=0.05,
task_type="CAUSAL_LM"
)
model = get_peft_model(model, config)
Check how few parameters you are actually training:
def print_trainable_parameters(model):
    trainable_params = 0
    all_param = 0
    for _, param in model.named_parameters():
        all_param += param.numel()
        if param.requires_grad:
            trainable_params += param.numel()
    print(f"trainable params: {trainable_params} || all params: {all_param} || trainable%: {100 * trainable_params / all_param}")
print_trainable_parameters(model)
trainable params: 26214400 || all params: 2805898240 || trainable%: 0.9342605382581515
Only 0.93% of the parameters are trainable — 26M out of 2.8B. That is the entire point of PEFT.
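The trainable count can be checked by hand, since LoRA adds r * (d_in + d_out) parameters to each adapted linear layer. The sketch below assumes Phi-2's published dimensions (hidden size 2560, feed-forward size 10240, 32 layers); those values come from the model's config rather than from this tutorial, and the helper name lora_params is made up.

```python
def lora_params(d_in, d_out, r):
    """Trainable parameters LoRA adds to one d_in -> d_out linear layer."""
    return r * (d_in + d_out)

# Assumed Phi-2 dimensions (from its published config, not from this tutorial).
HIDDEN, INTERMEDIATE, LAYERS, RANK = 2560, 10240, 32, 32

fc1 = lora_params(HIDDEN, INTERMEDIATE, RANK)    # hidden -> intermediate
fc2 = lora_params(INTERMEDIATE, HIDDEN, RANK)    # intermediate -> hidden
wqkv = lora_params(HIDDEN, 3 * HIDDEN, RANK)     # fused query/key/value projection

mlp_only = LAYERS * (fc1 + fc2)
with_attention = LAYERS * (fc1 + fc2 + wqkv)

print(mlp_only)        # 26214400
print(with_attention)  # 36700160
```

The reported 26,214,400 matches the fc1 and fc2 adapters alone. One possible explanation is that no module named Wqkv existed in the model implementation that was loaded, so only the feed-forward layers received adapters. Listing the names from model.named_modules() is a quick way to confirm which targets are present in your version.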
Training
Prepare the model with Accelerate and configure the trainer with an 8-bit optimizer:
from accelerate import Accelerator
from transformers import TrainingArguments, Trainer, DataCollatorForLanguageModeling
accelerator = Accelerator(gradient_accumulation_steps=1)
model = accelerator.prepare_model(model)
args = TrainingArguments(
    output_dir="./train-dir",
    per_device_train_batch_size=2,
    gradient_accumulation_steps=1,
    max_steps=500,
    learning_rate=2.5e-5,
    optim="paged_adamw_8bit",
    logging_steps=25,
    save_strategy="steps",
    save_steps=25,
    evaluation_strategy="steps",  # named eval_strategy in recent Transformers releases
    eval_steps=25,
    do_eval=True
)
trainer = Trainer(
    model=model,
    args=args,
    train_dataset=dataset['train'],
    eval_dataset=dataset['test'],
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False)
)
model.config.use_cache = False # silence warnings; re-enable for inference
trainer.train()
Step    Training Loss    Validation Loss
25      3.742300         3.499480
100     2.962300         3.003503
250     2.554800         2.697148
500     2.498900         2.653663
Important
DataCollatorForLanguageModeling(tokenizer, mlm=False) sets up causal language modeling (next-token prediction), not masked language modeling. The paged_adamw_8bit optimizer keeps optimizer state in 8-bit to save memory — essential for fitting training on a single GPU.
Training loss falls steadily from 3.74 to about 2.50, and validation loss from 3.50 to 2.65, showing the adapter is learning the product-description format without obvious overfitting.
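Two quick calculations put these numbers in context: how much of the training set 500 steps actually covers, and what the final validation loss means as a perplexity. This is a back-of-the-envelope sketch that assumes a single GPU and that the reported loss is the mean cross-entropy in nats.

```python
import math

train_rows = 2637            # training examples in the split above
batch_size = 2               # per_device_train_batch_size
grad_accum = 1               # gradient_accumulation_steps
max_steps = 500

examples_seen = max_steps * batch_size * grad_accum   # assumes a single GPU
epoch_fraction = examples_seen / train_rows

final_val_loss = 2.653663
perplexity = math.exp(final_val_loss)   # assumes mean cross-entropy in nats

print(examples_seen, round(epoch_fraction, 2), round(perplexity, 1))   # 1000 0.38 14.2
```

The run sees well under half of the training set, so the model has not completed even one epoch. Raising max_steps is an obvious thing to try if the validation loss is still falling.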
Loading the Fine-Tuned Adapter
PEFT saves only the small LoRA adapter, not the whole model. To use it, load the base model and apply the adapter on top:
from peft import PeftModel
base_model = AutoModelForCausalLM.from_pretrained(
    base_model_id,
    trust_remote_code=True,
    load_in_8bit=True,
    torch_dtype=torch.float16
)
eval_tokenizer = AutoTokenizer.from_pretrained(base_model_id, add_bos_token=True, trust_remote_code=True, use_fast=False)
eval_tokenizer.pad_token = eval_tokenizer.eos_token
ft_model = PeftModel.from_pretrained(base_model, '/content/train-dir/checkpoint-500')
Note
The checkpoint path /content/train-dir/checkpoint-500 is a Google Colab path. On your machine, point it at wherever your output_dir checkpoint was saved (for example ./train-dir/checkpoint-500, relative to the directory you trained from).
Generate with the fine-tuned model on the same prompt:
eval_prompt = """
Given the product category, you need to generate a 'Product Description'.
### Category: BatteryChargers
### Product Description:
"""
model_input = eval_tokenizer(eval_prompt, return_tensors="pt").to("cuda")  # inputs must be on the model's device
ft_model.eval()
with torch.no_grad():
    output = ft_model.generate(**model_input, max_new_tokens=256, repetition_penalty=1.15)
print(eval_tokenizer.decode(output[0], skip_special_tokens=True))
Given the product category, you need to generate a 'Product Description'.
### Category: BatteryChargers
### Product Description:
#### 1. Type: USB Charger for Smartphones and Tablets (2-in1)
#### 2. Features: Supports fast charging up to 100% in 30 minutes; Compatible with all Qi wireless chargers; Includes a power bank for on-the-go charging
Now the model follows the format and produces a structured, on-topic product description — a clear improvement over the rambling base model.

Train only the LoRA adapter on a frozen Phi-2, then load the base model plus adapter for inference.
Saving the Adapter
Because the adapter is tiny compared with the base model, you can save and share just those tens of megabytes instead of several gigabytes:
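import shutil
# zip the saved adapter directory for download
# the checkpoint contains adapter_config.json and adapter_model.safetensors
shutil.make_archive('checkpoint-500', 'zip', './train-dir/checkpoint-500')  # archive name is illustrative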
The saved checkpoint contains adapter_config.json and adapter_model.safetensors — load them onto any copy of the base Phi-2 to reproduce your fine-tuned model.
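For a rough sense of scale, the parameter counts reported earlier translate into storage as follows. This is simple arithmetic, not a measurement: it ignores file-format overhead and quantization metadata, the on-disk size depends on the precision the adapter is saved in, and the helper name size_mb is made up.

```python
def size_mb(num_params, bytes_per_param):
    """Storage in megabytes (10**6 bytes) for a parameter count and width."""
    return num_params * bytes_per_param / 1e6

adapter_params = 26_214_400                    # trainable params reported earlier
base_params = 2_805_898_240 - adapter_params   # remaining, frozen params

print(round(size_mb(adapter_params, 4)))   # adapter in 32-bit floats: 105
print(round(size_mb(adapter_params, 2)))   # adapter in 16-bit floats: 52
print(round(size_mb(base_params, 2)))      # base model in 16-bit floats: 5559
print(round(size_mb(base_params, 1)))      # base model in 8-bit: 2780
```

Even at 32-bit precision the adapter is around 2% of the size of the 16-bit base model, which is what makes it practical to keep one base model and swap adapters per task.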
Summary
You learned why full fine-tuning does not scale to large LLMs and how PEFT solves it: freeze the base model and train tiny LoRA matrices, optionally over a quantized model with QLoRA. Then you fine-tuned Phi-2 by training just 0.93% of its parameters, transforming a rambling base model into one that follows your product-description format.
In the final tutorial, you apply 4-bit QLoRA to turn a base model into a conversational assistant — fine-tuning TinyLlama as a chat (instruct) model.