Fine-Tuning Phi-2 with LoRA and QLoRA

Learn the theory behind PEFT, LoRA, and QLoRA, then fine-tune Microsoft's Phi-2 on a custom product dataset with quantization on a single GPU.

Jun 18, 2026 · 14 min read

Topics You Will Master

The three types of LLM fine-tuning: language modeling, SFT, and preference tuning
Parameter-efficient fine-tuning (PEFT): adapters, LoRA, and QLoRA
Why low-rank decomposition and quantization make fine-tuning affordable
Loading Phi-2 in 8-bit and attaching a LoRA adapter

The earlier tutorials fine-tuned encoder models by training the whole network. That does not scale to billion-parameter generative LLMs — full fine-tuning would need enormous GPU memory. The answer is parameter-efficient fine-tuning (PEFT): freeze the giant pretrained model and train a tiny number of new parameters.

This tutorial covers the theory — language modeling, SFT, adapters, LoRA, and QLoRA — and then fine-tunes Microsoft's Phi-2 (2.7B parameters) to generate product names and descriptions, all on a single GPU.

Prerequisites: The transformer foundations and a GPU environment with transformers, peft, accelerate, bitsandbytes, datasets, and torch.

Types of LLM Fine-Tuning

There are three main ways to adapt a large language model:

  • Language modeling — predicting the next word given the previous words. This is a self-supervised task; the model trains on a large text corpus with no human labels.
  • Supervised fine-tuning (SFT) — fine-tuning the pretrained model on a custom labeled dataset for a specific behavior. This is what you do in this tutorial.
  • Preference fine-tuning — training on a dataset with preference labels (which response humans prefer), used to align models with human taste.

There are also lighter-weight techniques that change no weights at all — zero-shot and few-shot prompting, and prompt tuning — but for genuinely teaching a model new behavior, SFT with PEFT is the practical choice.


Parameter-Efficient Fine-Tuning (PEFT)

PEFT fine-tunes a pretrained model by training only a small number of new parameters, leaving the original weights frozen. Its main techniques are adapters, LoRA, and QLoRA.

Adapters

An adapter is a small neural network inserted into the pretrained model. The overall model grows slightly, but fine-tuning updates only the adapter's parameters instead of the full network. (See the adapter paper, "Parameter-Efficient Transfer Learning for NLP", arXiv:1902.00751.)
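
To make "small parameter count" concrete, here is a rough count for one bottleneck adapter: a down-projection followed by an up-projection, each with a bias. This is a sketch under assumptions. The hidden size of 2560 is Phi-2's, and the bottleneck width of 64 is an arbitrary illustrative value that this tutorial does not otherwise use.

```python
def adapter_param_count(hidden_size: int, bottleneck: int) -> int:
    """Parameters in one bottleneck adapter: down-projection plus up-projection, with biases."""
    down = hidden_size * bottleneck + bottleneck   # weights + bias
    up = bottleneck * hidden_size + hidden_size    # weights + bias
    return down + up

hidden_size = 2560   # Phi-2 hidden size
bottleneck = 64      # assumed bottleneck width, for illustration only
print(adapter_param_count(hidden_size, bottleneck))   # 330304
```

A few hundred thousand parameters per adapter is tiny next to the billions in the frozen network, which is why only the adapters need gradients and optimizer state.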

LoRA

LoRA (Low-Rank Adaptation, arXiv:2106.09685) is the workhorse of modern fine-tuning:

  • Low-rank decomposition. LoRA decomposes the weight-update matrices in the network into two much smaller, lower-rank matrices, drastically reducing the number of trainable parameters.
  • Injecting trainable parameters. The original weights stay frozen; only the low-rank matrices are trainable. They are injected into the layers to capture task-specific information.
  • Combining outputs. During the forward pass, the frozen weight's output is combined with the low-rank matrices' output, so the model keeps its pretrained knowledge while adapting to the new task.

Diagram of LoRA: frozen pretrained weights with two small trainable low-rank matrices injected alongside

LoRA freezes the original weights and trains two small low-rank matrices injected alongside them.
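
The three bullet points above can be sketched in plain Python. The LoRA paper computes h = W x + (alpha / r) · B (A x), with B initialized to zero so that training starts from the pretrained behavior. The matrices below are toy values chosen for illustration, not weights from any real model.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]

def lora_forward(W, A, B, x, alpha, r):
    """h = W x + (alpha / r) * B (A x); W stays frozen, only A and B would be trained."""
    base = matvec(W, x)                  # output of the frozen weight
    update = matvec(B, matvec(A, x))     # output of the low-rank path
    scale = alpha / r
    return [b + scale * u for b, u in zip(base, update)]

# Toy sizes: a 3x4 frozen weight adapted at rank r=2.
W = [[1.0, 0.0, 2.0, 0.0],
     [0.0, 1.0, 0.0, 1.0],
     [1.0, 1.0, 1.0, 1.0]]
A = [[0.1, 0.2, 0.3, 0.4],     # r x 4, small non-zero initial values
     [0.0, -0.1, 0.1, 0.0]]
B = [[0.0, 0.0],               # 3 x r, initialized to zero
     [0.0, 0.0],
     [0.0, 0.0]]
x = [1.0, 2.0, 3.0, 4.0]

print(lora_forward(W, A, B, x, alpha=4, r=2))   # [7.0, 6.0, 10.0], identical to W x
```

Because B starts at zero, the adapted layer initially reproduces the frozen layer exactly. Training then moves only A and B.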

QLoRA

QLoRA (arXiv:2305.14314) is LoRA with quantized weights:

  • Hybrid technique. It combines quantization with low-rank adaptation to push parameter efficiency even further, cutting memory and compute.
  • Resource-constrained fine-tuning. It makes fine-tuning large models feasible on limited hardware — even a single consumer GPU.
  • High performance. Despite the reduced resources, it maintains strong downstream performance.
  • NF4 quantization. QLoRA stores the frozen base weights in 4-bit NormalFloat (NF4), a data type whose quantization levels are chosen for normally distributed weights, which preserves quality better than evenly spaced 4-bit levels.

Diagram of QLoRA: a quantized frozen base model with LoRA adapters trained on top

QLoRA quantizes the frozen base model to 4-bit and trains LoRA adapters on top, fitting large models on small GPUs.
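
To show what quantizing a block of weights involves, here is a deliberately simplified sketch: absmax quantization to evenly spaced signed 4-bit codes with one scale per block. It is not NF4, which uses 16 unevenly spaced levels, and it is not the bitsandbytes implementation. The weight values are made up for illustration.

```python
def quantize_block(values, bits=4):
    """Absmax-quantize one block of floats to signed integer codes of the given width."""
    levels = 2 ** (bits - 1) - 1                        # 7 for 4-bit
    scale = max(abs(v) for v in values) / levels or 1.0  # fall back to 1.0 for an all-zero block
    codes = [round(v / scale) for v in values]
    return codes, scale

def dequantize_block(codes, scale):
    """Map integer codes back to approximate float values."""
    return [c * scale for c in codes]

weights = [0.12, -0.53, 0.07, 0.91, -0.30, 0.44, -0.88, 0.02]   # illustrative values
codes, scale = quantize_block(weights)
restored = dequantize_block(codes, scale)
max_error = max(abs(w - r) for w, r in zip(weights, restored))
print(codes)
print(round(max_error, 4))
```

Each weight is stored as a small integer plus a shared scale, and the rounding error is bounded by half a quantization step. The LoRA matrices trained on top stay in higher precision.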

Note

Phi-2 is a 2.7B-parameter small language model from Microsoft. The technical report for its successor family, Phi-3, is at arXiv:2404.14219. Always check a model's license on its Hugging Face page before commercial use.


Loading the Custom Dataset

The task: given a product category, generate a product name or description. Load an Amazon product dataset and reshape it:

PYTHON
import pandas as pd
from datasets import Dataset

df = pd.read_csv('https://github.com/laxmimerit/All-CSV-ML-Data-Files-Download/raw/master/amazon_product_details.csv', usecols=['category', 'about_product', 'product_name'])
df['category'] = df['category'].apply(lambda x: x.split('|')[-1])

products = df[['category', 'product_name']].rename(columns={'product_name': 'text'})
description = df[['category', 'about_product']].rename(columns={'about_product': 'text'})

products['task_type'] = 'Product Name'
description['task_type'] = 'Product Description'

df = pd.concat([products, description], ignore_index=True)

Turn it into a shuffled train/test split:

PYTHON
dataset = Dataset.from_pandas(df)
dataset = dataset.shuffle(seed=0)
dataset = dataset.train_test_split(test_size=0.1)
dataset
OUTPUT
DatasetDict({
    train: Dataset({ features: ['category', 'text', 'task_type'], num_rows: 2637 })
    test: Dataset({ features: ['category', 'text', 'task_type'], num_rows: 293 })
})

Formatting the Prompts

Causal LLMs learn from a single text field, so define a formatting function that combines the category, task type, and target text into one instruction-style prompt:

PYTHON
def formatting_func(example):
    text = f"""
            Given the product category, you need to generate a '{example['task_type']}'.
            ### Category: {example['category']}\n ### {example['task_type']}: {example['text']}

            """
    return text
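
One detail worth noticing: the triple-quoted string in formatting_func keeps the leading spaces from the source indentation, so every training prompt starts each line with whitespace that the evaluation prompt later in this tutorial does not have. A possible variant without that indentation is sketched below. The name formatting_func_flat and the sample row are hypothetical, and the tutorial does not measure whether this changes results.

```python
def formatting_func_flat(example):
    """Same fields as formatting_func, without indentation inherited from the source code."""
    return (
        f"Given the product category, you need to generate a '{example['task_type']}'.\n"
        f"### Category: {example['category']}\n"
        f"### {example['task_type']}: {example['text']}\n"
    )

# Hypothetical row, shaped like the dataset built above.
sample = {"category": "USBCables", "task_type": "Product Name", "text": "Example braided USB-C cable"}
print(formatting_func_flat(sample))
```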

Loading Phi-2 in 8-bit

Load Phi-2 quantized to 8-bit, which roughly halves the memory needed for its weights compared with float16:

PYTHON
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

base_model_id = "microsoft/phi-2"

model = AutoModelForCausalLM.from_pretrained(
    base_model_id,
    trust_remote_code=True,
    torch_dtype=torch.float16,
    load_in_8bit=True
)

tokenizer = AutoTokenizer.from_pretrained(
    base_model_id,
    padding_side='left',
    add_eos_token=True,
    add_bos_token=True,
    use_fast=False
)
tokenizer.pad_token = tokenizer.eos_token
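
The "roughly halves" claim is simple arithmetic on bytes per parameter. The sketch below counts memory for the weights only, using the rounded 2.7B parameter figure. Activations, the CUDA context, and quantization metadata are ignored, so real usage will be higher.

```python
def weight_memory_gb(num_params: int, bits_per_param: int) -> float:
    """Approximate memory for the weights alone, in gigabytes (10**9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9

num_params = 2_700_000_000   # Phi-2, roughly 2.7B parameters
for label, bits in [("float16", 16), ("int8", 8), ("4-bit", 4)]:
    print(f"{label:>8}: {weight_memory_gb(num_params, bits):.2f} GB")
```

Going from float16 to 8-bit takes the weights from about 5.4 GB to about 2.7 GB, and 4-bit halves that again.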

Note

Recent versions of Transformers prefer a BitsAndBytesConfig object over the load_in_8bit=True shortcut. The shortcut still works but may emit a deprecation warning.
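
A minimal sketch of the BitsAndBytesConfig route follows. The helper name load_phi2_quantized is hypothetical, device_map="auto" is an assumption about how you want the model placed, and the 4-bit branch shows QLoRA-style NF4 settings that this tutorial does not use.

```python
def load_phi2_quantized(model_id: str = "microsoft/phi-2", bits: int = 8):
    """Load a causal LM with bitsandbytes quantization described by a BitsAndBytesConfig."""
    if bits not in (4, 8):
        raise ValueError("bits must be 4 or 8")

    # Imported here so the helper can be defined without the GPU stack installed.
    import torch
    from transformers import AutoModelForCausalLM, BitsAndBytesConfig

    if bits == 8:
        quant_config = BitsAndBytesConfig(load_in_8bit=True)
    else:
        # QLoRA-style 4-bit NormalFloat settings (not used in this tutorial)
        quant_config = BitsAndBytesConfig(
            load_in_4bit=True,
            bnb_4bit_quant_type="nf4",
            bnb_4bit_compute_dtype=torch.float16,
        )

    return AutoModelForCausalLM.from_pretrained(
        model_id,
        quantization_config=quant_config,
        torch_dtype=torch.float16,
        device_map="auto",   # assumption: let Accelerate place the model on the GPU
    )
```

Calling load_phi2_quantized() should give an 8-bit model equivalent to the one loaded above, without relying on the load_in_8bit shortcut.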

Tokenize the dataset, copying input_ids into labels (for causal language modeling the model predicts its own input shifted by one):

PYTHON
max_length = 400

def tokenize(prompt):
    result = tokenizer(
        formatting_func(prompt),
        truncation=True,
        max_length=max_length,
        padding="max_length"
    )
    result['labels'] = result['input_ids'].copy()
    return result

dataset = dataset.map(tokenize)
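
The "shifted by one" idea can be sketched with a toy token list. Hugging Face causal-LM models perform this shift internally when you pass labels, which is why labels can simply be a copy of input_ids. The tokens below are illustrative strings, not real tokenizer output.

```python
def next_token_pairs(token_ids):
    """Pair each position with the token the model should predict there."""
    inputs = token_ids[:-1]    # the model sees tokens 0..n-2
    targets = token_ids[1:]    # and is scored on tokens 1..n-1
    return list(zip(inputs, targets))

tokens = ["###", "Category", ":", "USBCables"]   # toy tokens for illustration
for seen, target in next_token_pairs(tokens):
    print(f"after {seen!r} predict {target!r}")
```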

The Base Model Out of the Box

Before fine-tuning, see how raw Phi-2 handles the task:

PYTHON
eval_prompt = """
Given the product category, you need to generate a 'Product Description'.
### Category: BatteryChargers
### Product Description:
"""

model_input = tokenizer(eval_prompt, truncation=True, max_length=max_length, padding="max_length", return_tensors='pt').to("cuda")

model.eval()
with torch.no_grad():
    output = model.generate(**model_input, max_new_tokens=256, repetition_penalty=1.15)
    print(tokenizer.decode(output[0], skip_special_tokens=True))

The base model rambles — it invents a "puzzle" about writing descriptions instead of producing one. It clearly does not follow our format. That is what fine-tuning fixes.


Configuring LoRA

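Attach a LoRA adapter with PEFT. target_modules lists the layer names to adapt: Wqkv is the fused attention projection in the original remote-code Phi-2 implementation, and fc1 and fc2 are the feed-forward projections. The native Transformers implementation of Phi names the attention projections q_proj, k_proj, v_proj, and dense instead, so run print(model) and confirm the names before training: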

PYTHON
from peft import LoraConfig, get_peft_model

config = LoraConfig(
    r=32,
    lora_alpha=64,
    target_modules=["Wqkv", "fc1", "fc2"],
    bias="none",
    lora_dropout=0.05,
    task_type="CAUSAL_LM"
)

model = get_peft_model(model, config)

Check how few parameters you are actually training:

PYTHON
def print_trainable_parameters(model):
    trainable_params = 0
    all_param = 0
    for _, param in model.named_parameters():
        all_param += param.numel()
        if param.requires_grad:
            trainable_params += param.numel()
    print(f"trainable params: {trainable_params} || all params: {all_param} || trainable%: {100 * trainable_params / all_param}")

print_trainable_parameters(model)
OUTPUT
trainable params: 26214400 || all params: 2805898240 || trainable%: 0.9342605382581515

Only 0.93% of the parameters are trainable — 26M out of 2.8B. That is the entire point of PEFT.
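
The reported count can be reproduced by hand. A LoRA pair on a layer with in_features inputs and out_features outputs adds r × (in_features + out_features) parameters. The sketch below assumes Phi-2's published sizes (hidden size 2560, feed-forward size 10240, 32 layers) and the r=32 from the config above.

```python
def lora_params(in_features: int, out_features: int, r: int) -> int:
    """A is r x in_features and B is out_features x r."""
    return r * (in_features + out_features)

hidden, intermediate, layers, r = 2560, 10240, 32, 32   # Phi-2 sizes (assumed) and the LoRA rank

fc1 = lora_params(hidden, intermediate, r)   # hidden -> intermediate
fc2 = lora_params(intermediate, hidden, r)   # intermediate -> hidden
mlp_only = layers * (fc1 + fc2)

print(mlp_only)                        # 26214400, the trainable count reported above
print(100 * mlp_only / 2805898240)     # percentage of all parameters
```

The total matches adapters on fc1 and fc2 alone. That suggests the Wqkv entry matched no module in the run that produced this output, although this is an inference from the arithmetic rather than something the output states. Adapting the attention projections as well would raise the count.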


Training

Prepare the model with Accelerate and configure the trainer with an 8-bit optimizer:

PYTHON
from accelerate import Accelerator
from transformers import TrainingArguments, Trainer, DataCollatorForLanguageModeling

accelerator = Accelerator(gradient_accumulation_steps=1)
model = accelerator.prepare_model(model)

args = TrainingArguments(
    output_dir="./train-dir",
    per_device_train_batch_size=2,
    gradient_accumulation_steps=1,
    max_steps=500,
    learning_rate=2.5e-5,
    optim="paged_adamw_8bit",
    logging_steps=25,
    save_strategy="steps",
    save_steps=25,
    evaluation_strategy="steps",  # named eval_strategy in newer Transformers releases
    eval_steps=25,
    do_eval=True
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=dataset['train'],
    eval_dataset=dataset['test'],
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False)
)

model.config.use_cache = False  # silence warnings; re-enable for inference
trainer.train()
OUTPUT
Step    Training Loss    Validation Loss
25      3.742300         3.499480
100     2.962300         3.003503
250     2.554800         2.697148
500     2.498900         2.653663

Important

DataCollatorForLanguageModeling(tokenizer, mlm=False) sets up causal language modeling (next-token prediction), not masked language modeling. The paged_adamw_8bit optimizer keeps optimizer state in 8-bit to save memory, which helps training fit on a single GPU.
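
With mlm=False, the collator also rebuilds labels from input_ids and replaces every padding position with -100, the value the cross-entropy loss ignores. The plain-Python sketch below mimics that masking on a toy sequence. The pad id of 0 is hypothetical.

```python
IGNORE_INDEX = -100   # label value skipped by the cross-entropy loss

def mask_padding(input_ids, pad_token_id):
    """Copy input_ids into labels, replacing every pad token with IGNORE_INDEX."""
    return [IGNORE_INDEX if tok == pad_token_id else tok for tok in input_ids]

pad_id = 0                           # hypothetical id; this tutorial reuses the EOS id as pad
input_ids = [0, 0, 0, 17, 42, 99]    # left-padded toy sequence
print(mask_padding(input_ids, pad_id))   # [-100, -100, -100, 17, 42, 99]
```

Because this tutorial sets pad_token to eos_token, end-of-sequence positions are masked too, so the model gets no training signal for emitting EOS. That may be one reason generations run until max_new_tokens.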

The loss falls steadily from 3.74 to ~2.50, showing the adapter is learning the product-description format.
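
Loss values are easier to interpret as perplexity, the exponential of the mean per-token cross-entropy. The sketch below converts the validation losses from the table above. Treating the logged loss as a natural-log, per-token average is an assumption about how the Trainer reports it.

```python
import math

def perplexity(cross_entropy_loss: float) -> float:
    """Perplexity is exp(mean per-token cross-entropy), with the loss in nats."""
    return math.exp(cross_entropy_loss)

# Validation losses copied from the training log above.
validation_loss = {25: 3.499480, 100: 3.003503, 250: 2.697148, 500: 2.653663}
for step, loss in validation_loss.items():
    print(f"step {step:>3}: loss {loss:.3f} -> perplexity {perplexity(loss):.1f}")
```

A falling perplexity means the model is, on average, choosing among fewer plausible next tokens on the held-out prompts.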


Loading the Fine-Tuned Adapter

PEFT saves only the small LoRA adapter, not the whole model. To use it, load the base model and apply the adapter on top:

PYTHON
from peft import PeftModel

base_model = AutoModelForCausalLM.from_pretrained(
    base_model_id,
    trust_remote_code=True,
    load_in_8bit=True,
    torch_dtype=torch.float16
)

eval_tokenizer = AutoTokenizer.from_pretrained(base_model_id, add_bos_token=True, trust_remote_code=True, use_fast=False)
eval_tokenizer.pad_token = eval_tokenizer.eos_token

ft_model = PeftModel.from_pretrained(base_model, '/content/train-dir/checkpoint-500')

Note

The checkpoint path /content/train-dir/checkpoint-500 is a Google Colab path. On your machine, point it at wherever your output_dir checkpoint was saved (for example ./train-dir/checkpoint-500, relative to the directory you trained from).
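
If you would rather not hard-code the step number, a small helper can pick the newest checkpoint from output_dir. This is a sketch. The name latest_checkpoint is hypothetical, and it assumes the Trainer's checkpoint-<step> directory naming seen in this tutorial.

```python
import re
from pathlib import Path

def latest_checkpoint(output_dir):
    """Return the checkpoint-<step> subdirectory with the highest step, or None if there is none."""
    root = Path(output_dir)
    if not root.is_dir():
        return None
    pattern = re.compile(r"checkpoint-(\d+)$")
    candidates = []
    for path in root.iterdir():
        match = pattern.match(path.name)
        if path.is_dir() and match:
            candidates.append((int(match.group(1)), path))
    if not candidates:
        return None
    return max(candidates, key=lambda item: item[0])[1]

print(latest_checkpoint("./train-dir"))   # None if no checkpoints exist there yet
```

The returned path can be passed to PeftModel.from_pretrained in place of the literal string.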

Generate with the fine-tuned model on the same prompt:

PYTHON
eval_prompt = """
Given the product category, you need to generate a 'Product Description'.
### Category: BatteryChargers
### Product Description:
"""

model_input = eval_tokenizer(eval_prompt, return_tensors="pt").to("cuda")
ft_model.eval()
with torch.no_grad():
    output = ft_model.generate(**model_input, max_new_tokens=256, repetition_penalty=1.15)
    print(eval_tokenizer.decode(output[0], skip_special_tokens=True))
OUTPUT
Given the product category, you need to generate a 'Product Description'.
### Category: BatteryChargers
### Product Description:
#### 1. Type: USB Charger for Smartphones and Tablets (2-in1)
#### 2. Features: Supports fast charging up to 100% in 30 minutes; Compatible with all Qi wireless chargers; Includes a power bank for on-the-go charging

Now the model follows the format and produces a structured, on-topic product description — a clear improvement over the rambling base model.

Diagram of the PEFT workflow: frozen Phi-2 plus trainable LoRA adapter, trained then merged for inference

Train only the LoRA adapter on a frozen Phi-2, then load the base model plus adapter for inference.
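
The code above keeps the adapter separate from the base model. If you want a single standalone model, PEFT can fold the LoRA weights into the base weights with merge_and_unload(). The sketch below is an assumption-laden illustration, not part of the tutorial's workflow. The helper name and the merged_dir argument are hypothetical, and the base model is loaded in float16 rather than 8-bit, on the assumption that merging into unquantized weights is the simpler path.

```python
def merge_adapter(base_model_id: str, adapter_dir: str, merged_dir: str):
    """Fold LoRA weights into a float16 base model and save the result as one model."""
    # Imported here so the helper can be defined without the GPU stack installed.
    import torch
    from peft import PeftModel
    from transformers import AutoModelForCausalLM

    base = AutoModelForCausalLM.from_pretrained(base_model_id, torch_dtype=torch.float16)
    merged = PeftModel.from_pretrained(base, adapter_dir).merge_and_unload()
    merged.save_pretrained(merged_dir)   # writes full model weights, not just the adapter
    return merged
```

The merged model no longer needs PEFT at inference time, at the cost of storing a full copy of the weights instead of a few megabytes of adapter.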


Saving the Adapter

Because the adapter is tiny, you can save and share just those few megabytes:

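PYTHON
import shutil
# zip the saved adapter directory for download
# the checkpoint contains adapter_config.json and adapter_model.safetensors
shutil.make_archive('phi2-lora-adapter', 'zip', './train-dir/checkpoint-500')  # archive name is arbitrary; adjust the path to your checkpoint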

The saved checkpoint contains adapter_config.json and adapter_model.safetensors — load them onto any copy of the base Phi-2 to reproduce your fine-tuned model.


Summary

You learned why full fine-tuning does not scale to large LLMs and how PEFT solves it: freeze the base model and train tiny LoRA matrices, optionally over a quantized model with QLoRA. Then you fine-tuned Phi-2 by training just 0.93% of its parameters, transforming a rambling base model into one that follows your product-description format.

In the final tutorial, you apply 4-bit QLoRA to turn a base model into a conversational assistant — fine-tuning TinyLlama as a chat (instruct) model.
