Fine-Tuning Phi-2 with LoRA and QLoRA

The earlier tutorials fine-tuned encoder models by training the whole network. That does not scale to billion-parameter generative LLMs, because full fine-tuning has to hold the weights, their gradients, and the optimizer state in GPU memory at once. Parameter-efficient fine-tuning (PEFT) avoids this: we freeze the large pretrained model and train only a small number of new parameters.

In this post, we cover the theory first: language modeling, SFT, adapters, LoRA, and QLoRA. Then we fine-tune Microsoft's Phi-2 (2.7B parameters) to generate product names and descriptions, all on a single GPU.

Prerequisites: The transformer foundations and a GPU environment with transformers, peft, accelerate, bitsandbytes, datasets, and torch.

Types of LLM Fine-Tuning

There are three main ways to adapt a large language model:

Language modeling: predicting the next word from the previous words. This is a self-supervised task. The model trains on a large text corpus with no human labels.
Supervised fine-tuning (SFT): fine-tuning the pretrained model on a custom labeled dataset for a specific behavior. This is what we do in this tutorial.
Preference fine-tuning: training on a dataset with preference labels, which say which response humans prefer. It is used to align models with human taste.

There are also lighter techniques that leave the model's weights untouched: zero-shot and few-shot prompting, which change only the input, and prompt tuning, which trains a small set of soft-prompt embeddings. But to teach a model new behavior reliably, SFT with PEFT is the practical choice.

Parameter-Efficient Fine-Tuning (PEFT)

PEFT fine-tunes a pretrained model by training only a small number of new parameters. It leaves the original weights frozen. Its main techniques are adapters, LoRA, and QLoRA.

Adapters

An adapter is a small neural network placed inside the pretrained model. The overall model grows a little, but fine-tuning updates only the adapter's parameters, not the full network. (See the adapter paper, "Parameter-Efficient Transfer Learning for NLP", arXiv:1902.00751.)

LoRA

LoRA (Low-Rank Adaptation, arXiv:2106.09685) is the workhorse of modern fine-tuning:

Low-rank decomposition. LoRA represents each weight update as the product of two much smaller, low-rank matrices. This sharply cuts the number of trainable parameters.
Injecting trainable parameters. The original weights stay frozen. Only the low-rank matrices are trainable. They are placed into the layers to capture task-specific information.
Combining outputs. During the forward pass, the frozen weight's output is combined with the low-rank matrices' output. So the model keeps its pretrained knowledge while it adapts to the new task.

Diagram of LoRA: frozen pretrained weights with two small trainable low-rank matrices injected alongside

LoRA freezes the original weights and trains two small low-rank matrices injected alongside them.
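
As a rough illustration of the three ideas above, the following pure-Python sketch computes h = Wx + (alpha / r) * B(Ax) for a toy 8 x 8 layer. The sizes, values, and helper names (matvec, lora_forward) are made up for illustration. Starting B at zero follows the LoRA paper, so the adapted layer begins identical to the frozen one.

```python
import random

def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]

def lora_forward(W, A, B, x, alpha, r):
    """Frozen output W @ x plus the scaled low-rank update B @ (A @ x)."""
    base = matvec(W, x)                 # frozen pretrained path
    update = matvec(B, matvec(A, x))    # trainable low-rank path
    scale = alpha / r
    return [b + scale * u for b, u in zip(base, update)]

random.seed(0)
d, k, r, alpha = 8, 8, 2, 4             # toy sizes, not Phi-2's
W = [[random.gauss(0, 1) for _ in range(k)] for _ in range(d)]  # frozen, d x k
A = [[random.gauss(0, 1) for _ in range(k)] for _ in range(r)]  # trainable, r x k
B = [[0.0] * r for _ in range(d)]                               # trainable, d x r, starts at zero
x = [float(i) for i in range(1, k + 1)]

full_update_params = d * k       # parameters in a dense update matrix
lora_params = r * (d + k)        # parameters in A and B together

print(lora_forward(W, A, B, x, alpha, r) == matvec(W, x))  # True while B is all zeros
print(full_update_params, lora_params)                     # 64 32
```

The saving grows with layer size: for a 2560 x 10240 layer at r = 32, the two low-rank matrices hold 409,600 parameters, against about 26.2 million for a dense update.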

QLoRA

QLoRA (arXiv:2305.14314) is LoRA with quantized weights:

Hybrid technique. It combines quantization with low-rank adaptation to push memory efficiency even further: the frozen base weights are stored in 4-bit while the LoRA matrices train in higher precision.
Resource-constrained fine-tuning. It makes fine-tuning large models possible on limited hardware, even a single consumer GPU.
High performance. Despite the reduced resources, it keeps strong downstream performance.
NormalFloat quantization. QLoRA stores the base weights in 4-bit NormalFloat (NF4), a data type whose quantization levels are chosen for normally distributed weights. The weights are dequantized on the fly for computation, which helps preserve quality.

Diagram of QLoRA: a quantized frozen base model with LoRA adapters trained on top

QLoRA quantizes the frozen base model to 4-bit and trains LoRA adapters on top, fitting large models on small GPUs.
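
To make the idea of 4-bit storage concrete, here is a simplified blockwise quantizer. It is only a stand-in: it uses 16 evenly spaced levels with one absmax scale per block, whereas NF4 spaces its 16 levels according to a normal distribution. The function names, block contents, and block size are assumptions for illustration.

```python
def quantize_block(weights):
    """Map a block of floats to 4-bit codes (0..15) plus one absmax scale."""
    scale = max(abs(w) for w in weights) or 1.0   # avoid dividing by zero
    # Normalise to [-1, 1], then snap to one of 16 evenly spaced levels.
    codes = [round((w / scale + 1.0) / 2.0 * 15) for w in weights]
    return codes, scale

def dequantize_block(codes, scale):
    """Recover approximate floats from the 4-bit codes and the block scale."""
    return [(c / 15 * 2.0 - 1.0) * scale for c in codes]

block = [0.12, -0.50, 0.03, 0.44, -0.21, 0.00, 0.37, -0.08]
codes, scale = quantize_block(block)
restored = dequantize_block(codes, scale)
max_error = max(abs(a - b) for a, b in zip(block, restored))

print(codes)                 # every code fits in 4 bits
print(round(max_error, 4))   # at most half a quantization step (scale / 15)
```

Each weight shrinks from 16 bits to 4, at the cost of a small rounding error and one stored scale per block.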

Note

Phi-2 is a 2.7B-parameter small language model from Microsoft. Phi-3's technical report is at arXiv:2404.14219. Always check a model's license on its Hugging Face page before commercial use.

Loading the Custom Dataset

The task is simple. Given a product category, we generate a product name or description. We load an Amazon product dataset and reshape it:

PYTHON

import pandas as pd
from datasets import Dataset

df = pd.read_csv('https://github.com/laxmimerit/All-CSV-ML-Data-Files-Download/raw/master/amazon_product_details.csv', usecols=['category', 'about_product', 'product_name'])
df['category'] = df['category'].apply(lambda x: x.split('|')[-1])

products = df[['category', 'product_name']].rename(columns={'product_name': 'text'})
description = df[['category', 'about_product']].rename(columns={'about_product': 'text'})

products['task_type'] = 'Product Name'
description['task_type'] = 'Product Description'

df = pd.concat([products, description], ignore_index=True)
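
The category column holds a pipe-separated path, and the lambda keeps only its last segment. A quick check on a made-up value (the sample string is an assumption, not a row quoted from the dataset):

```python
def leaf_category(category_path):
    """Keep the most specific category, i.e. the text after the last '|'."""
    return category_path.split('|')[-1]

sample = "Computers&Accessories|Accessories&Peripherals|Cables&Accessories|USBCables"
print(leaf_category(sample))  # USBCables
```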

Now we turn it into a shuffled train/test split:

PYTHON

dataset = Dataset.from_pandas(df)
dataset = dataset.shuffle(seed=0)
dataset = dataset.train_test_split(test_size=0.1)
dataset

OUTPUT

DatasetDict({
    train: Dataset({ features: ['category', 'text', 'task_type'], num_rows: 2637 })
    test: Dataset({ features: ['category', 'text', 'task_type'], num_rows: 293 })
})

Formatting the Prompts

Causal LLMs learn from a single text field. So we define a formatting function. It combines the category, task type, and target text into one instruction-style prompt:

PYTHON

def formatting_func(example):
    text = f"""
            Given the product category, you need to generate a '{example['task_type']}'.
            ### Category: {example['category']}\n ### {example['task_type']}: {example['text']}

            """
    return text
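
It helps to print one formatted record and look at the exact string the model will train on. The sketch below repeats the function from above so that it runs on its own; the sample record is invented. The indentation inside the triple-quoted string becomes part of the prompt, so inference prompts match the training data most closely when they use the same layout.

```python
def formatting_func(example):
    # Same template as above, repeated so this snippet is self-contained.
    text = f"""
            Given the product category, you need to generate a '{example['task_type']}'.
            ### Category: {example['category']}\n ### {example['task_type']}: {example['text']}

            """
    return text

# Hypothetical record with the three columns the dataset carries.
record = {
    "category": "BatteryChargers",
    "task_type": "Product Name",
    "text": "Example 4-Slot AA/AAA Charger",
}

prompt = formatting_func(record)
print(repr(prompt))  # repr() makes the leading spaces and newlines visible
```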

Loading Phi-2 in 8-bit

We load Phi-2 with its weights quantized to 8-bit, which roughly halves the memory footprint compared with 16-bit weights:

PYTHON

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

base_model_id = "microsoft/phi-2"

model = AutoModelForCausalLM.from_pretrained(
    base_model_id,
    trust_remote_code=True,
    torch_dtype=torch.float16,
    load_in_8bit=True
)

tokenizer = AutoTokenizer.from_pretrained(
    base_model_id,
    padding_side='left',
    add_eos_token=True,
    add_bos_token=True,
    use_fast=False
)
tokenizer.pad_token = tokenizer.eos_token
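
As a back-of-the-envelope check on the memory saving from 8-bit loading, weight memory is roughly the parameter count times the bytes per parameter. The sketch below uses the 2,779,683,840 base parameters implied by the parameter count printed later in this tutorial (all params minus trainable params). It assumes every weight is stored at the stated width and ignores activations, the LoRA weights, and optimizer state, so real usage will be higher.

```python
BASE_PARAMS = 2_779_683_840  # Phi-2 weights, excluding the LoRA adapter

def weight_memory_gib(num_params, bits_per_param):
    """Approximate memory for the weights alone, in GiB."""
    return num_params * bits_per_param / 8 / 1024**3

for label, bits in [("float16", 16), ("int8", 8), ("4-bit", 4)]:
    print(f"{label:>8}: {weight_memory_gib(BASE_PARAMS, bits):.2f} GiB")
```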

Note

Recent versions of Transformers prefer a BitsAndBytesConfig object over the load_in_8bit=True shortcut. The shortcut still works but may emit a deprecation warning.
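
A sketch of the configuration-object form, assuming a Transformers release with the bitsandbytes integration installed. It is meant to load the same checkpoint with the same settings as the shortcut above; only the way the 8-bit option is passed changes.

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# Describe the quantization once, then hand it to from_pretrained.
bnb_config = BitsAndBytesConfig(load_in_8bit=True)

model = AutoModelForCausalLM.from_pretrained(
    "microsoft/phi-2",
    trust_remote_code=True,
    torch_dtype=torch.float16,
    quantization_config=bnb_config,
)
```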

We tokenize the dataset and copy input_ids into labels. For causal language modeling, the model predicts its own input shifted by one:

PYTHON

max_length = 400

def tokenize(prompt):
    result = tokenizer(
        formatting_func(prompt),
        truncation=True,
        max_length=max_length,
        padding="max_length"
    )
    result['labels'] = result['input_ids'].copy()
    return result

dataset = dataset.map(tokenize)
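
The "shifted by one" part happens inside the model's loss, not in our tokenize function. A toy sketch of the idea, using made-up token IDs: position i is scored on how well it predicts token i + 1, and positions labeled -100 are skipped. The value -100 is the default ignore index of PyTorch's cross-entropy loss, and the Transformers language-modeling collator is understood to assign it to padding tokens.

```python
IGNORE_INDEX = -100  # label value that the loss skips

def next_token_pairs(input_ids, labels):
    """Pair each position's input token with the label it must predict."""
    pairs = []
    for position in range(len(input_ids) - 1):
        target = labels[position + 1]          # the label is the *next* token
        if target != IGNORE_INDEX:             # padding contributes no loss
            pairs.append((input_ids[position], target))
    return pairs

# Hypothetical IDs: three padding tokens (0) on the left, then four real tokens.
input_ids = [0, 0, 0, 11, 22, 33, 44]
labels = [IGNORE_INDEX, IGNORE_INDEX, IGNORE_INDEX, 11, 22, 33, 44]

print(next_token_pairs(input_ids, labels))
# [(0, 11), (11, 22), (22, 33), (33, 44)]
```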

The Base Model Out of the Box

Before fine-tuning, we see how raw Phi-2 handles the task:

PYTHON

eval_prompt = """
Given the product category, you need to generate a 'Product Description'.
### Category: BatteryChargers
### Product Description:
"""

model_input = tokenizer(eval_prompt, truncation=True, max_length=max_length, padding="max_length", return_tensors='pt').to("cuda")

model.eval()
with torch.no_grad():
    output = model.generate(**model_input, max_new_tokens=256, repetition_penalty=1.15)
    print(tokenizer.decode(output[0], skip_special_tokens=True))

Here, we can see the base model ramble. It invents a puzzle about writing descriptions instead of producing one. It clearly does not follow our format. That is what fine-tuning fixes.

Configuring LoRA

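We attach a LoRA adapter that targets Phi-2's attention and feed-forward projection modules by name. Module names depend on the model implementation: "Wqkv" is the fused attention projection in Phi-2's original remote code, while the Phi implementation built into Transformers names its attention projections q_proj, k_proj, v_proj, and dense, so it is worth printing the model to confirm the names before relying on this list: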

PYTHON

from peft import LoraConfig, get_peft_model

config = LoraConfig(
    r=32,
    lora_alpha=64,
    target_modules=["Wqkv", "fc1", "fc2"],
    bias="none",
    lora_dropout=0.05,
    task_type="CAUSAL_LM"
)

model = get_peft_model(model, config)

Let's check how few parameters we are actually training:

PYTHON

def print_trainable_parameters(model):
    trainable_params = 0
    all_param = 0
    for _, param in model.named_parameters():
        all_param += param.numel()
        if param.requires_grad:
            trainable_params += param.numel()
    print(f"trainable params: {trainable_params} || all params: {all_param} || trainable%: {100 * trainable_params / all_param}")

print_trainable_parameters(model)

OUTPUT

trainable params: 26214400 || all params: 2805898240 || trainable%: 0.9342605382581515

Here, we can see only 0.93% of the parameters are trainable, 26M out of 2.8B. That is the whole point of PEFT.
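
The trainable count can be reproduced by hand. Each adapted linear layer with d_in inputs and d_out outputs adds r * (d_in + d_out) LoRA parameters. The sketch below assumes Phi-2's published dimensions (hidden size 2560, feed-forward size 10240, 32 layers), which are not stated in this tutorial.

```python
def lora_params(d_in, d_out, r):
    """Parameters in the A (r x d_in) and B (d_out x r) matrices of one layer."""
    return r * (d_in + d_out)

HIDDEN, FFN, LAYERS, RANK = 2560, 10240, 32, 32  # assumed Phi-2 dimensions

fc1 = lora_params(HIDDEN, FFN, RANK)   # hidden -> feed-forward
fc2 = lora_params(FFN, HIDDEN, RANK)   # feed-forward -> hidden
total = LAYERS * (fc1 + fc2)

print(total)                                  # 26214400
print(round(100 * total / 2_805_898_240, 2))  # 0.93
```

Under these assumptions, the fc1 and fc2 adapters alone account for all 26,214,400 trainable parameters reported above. That suggests "Wqkv" matched no module in the run shown here, so only the feed-forward layers were adapted.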

Training

We prepare the model with Accelerate and configure the trainer with an 8-bit optimizer:

PYTHON

from accelerate import Accelerator
from transformers import TrainingArguments, Trainer, DataCollatorForLanguageModeling

accelerator = Accelerator(gradient_accumulation_steps=1)
model = accelerator.prepare_model(model)

args = TrainingArguments(
    output_dir="./train-dir",
    per_device_train_batch_size=2,
    gradient_accumulation_steps=1,
    max_steps=500,
    learning_rate=2.5e-5,
    optim="paged_adamw_8bit",
    logging_steps=25,
    save_strategy="steps",
    save_steps=25,
    evaluation_strategy="steps",  # renamed to eval_strategy in recent Transformers releases
    eval_steps=25,
    do_eval=True
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=dataset['train'],
    eval_dataset=dataset['test'],
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False)
)

model.config.use_cache = False  # silence warnings; re-enable for inference
trainer.train()

OUTPUT

Step    Training Loss    Validation Loss
25      3.742300         3.499480
100     2.962300         3.003503
250     2.554800         2.697148
500     2.498900         2.653663

Important

DataCollatorForLanguageModeling(tokenizer, mlm=False) sets up causal language modeling, which is next-token prediction, not masked language modeling. The paged_adamw_8bit optimizer keeps the optimizer state in 8-bit and can page it to CPU memory during memory spikes, which helps training fit on a single GPU.

Here, we can see the training loss fall steadily from 3.74 to about 2.50, with the validation loss following from 3.50 to 2.65. This shows the adapter is learning the product-description format.
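
Two quick sanity checks on the numbers above, using only values from this tutorial and assuming a single GPU: how much of the training set 500 steps actually covers, and what the validation losses mean as perplexity (the exponential of the cross-entropy loss).

```python
import math

train_rows = 2637     # rows in the train split
batch_size = 2        # per_device_train_batch_size
grad_accum = 1        # gradient_accumulation_steps
max_steps = 500

examples_seen = max_steps * batch_size * grad_accum
epochs = examples_seen / train_rows
print(f"examples seen: {examples_seen}, epochs: {epochs:.2f}")

for step, val_loss in [(25, 3.499480), (500, 2.653663)]:
    print(f"step {step}: validation perplexity ~ {math.exp(val_loss):.1f}")
```

At about 0.38 of an epoch, the run sees well under half of the training examples, so longer training may still help.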

Loading the Fine-Tuned Adapter

PEFT saves only the small LoRA adapter, not the whole model. To use it, we load the base model and apply the adapter on top:

PYTHON

from peft import PeftModel

base_model = AutoModelForCausalLM.from_pretrained(
    base_model_id,
    trust_remote_code=True,
    load_in_8bit=True,
    torch_dtype=torch.float16
)

eval_tokenizer = AutoTokenizer.from_pretrained(base_model_id, add_bos_token=True, trust_remote_code=True, use_fast=False)
eval_tokenizer.pad_token = eval_tokenizer.eos_token

ft_model = PeftModel.from_pretrained(base_model, '/content/train-dir/checkpoint-500')

Note

The checkpoint path /content/train-dir/checkpoint-500 is a Google Colab path. On a local machine, point it at wherever the output_dir checkpoint was saved, for example ./train-dir/checkpoint-500.
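
Rather than hard-coding the step number, the newest checkpoint can be located by scanning output_dir. The helper below is a hypothetical sketch (latest_checkpoint is not a library function) and assumes the checkpoint-<step> directory naming seen above.

```python
from pathlib import Path

def latest_checkpoint(output_dir):
    """Return the checkpoint-<step> directory with the highest step, or None."""
    checkpoints = [
        path for path in Path(output_dir).glob("checkpoint-*")
        if path.is_dir() and path.name.split("-")[-1].isdigit()
    ]
    if not checkpoints:
        return None
    return max(checkpoints, key=lambda path: int(path.name.split("-")[-1]))

print(latest_checkpoint("./train-dir"))  # None if nothing has been saved yet
```

The steps are compared as integers, so checkpoint-500 ranks above checkpoint-75 even though it sorts lower as text.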

Now we generate with the fine-tuned model on the same prompt:

PYTHON

eval_prompt = """
Given the product category, you need to generate a 'Product Description'.
### Category: BatteryChargers
### Product Description:
"""

model_input = eval_tokenizer(eval_prompt, return_tensors="pt").to("cuda")  # move inputs to the model's device
ft_model.eval()
with torch.no_grad():
    output = ft_model.generate(**model_input, max_new_tokens=256, repetition_penalty=1.15)
    print(eval_tokenizer.decode(output[0], skip_special_tokens=True))

OUTPUT

Given the product category, you need to generate a 'Product Description'.
### Category: BatteryChargers
### Product Description:
#### 1. Type: USB Charger for Smartphones and Tablets (2-in1)
#### 2. Features: Supports fast charging up to 100% in 30 minutes; Compatible with all Qi wireless chargers; Includes a power bank for on-the-go charging

Here, we can see the model follow the format now. It produces a structured, on-topic product description. This is a clear improvement over the rambling base model.

Diagram of the PEFT workflow: frozen Phi-2 plus trainable LoRA adapter, trained then merged for inference

Train only the LoRA adapter on a frozen Phi-2, then load the base model plus adapter for inference.

Saving the Adapter

Because the adapter is tiny compared with the base model, we can save and share it on its own:

PYTHON

# zip the saved adapter directory for download
# the checkpoint contains adapter_config.json and adapter_model.safetensors

The saved checkpoint contains adapter_config.json and adapter_model.safetensors. We load them onto any copy of the base Phi-2 to reproduce our fine-tuned model.
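
One way to package just the adapter, sketched with the standard library. The function name and archive name are placeholders, and the sketch assumes the two adapter files named above sit directly in the checkpoint directory. Any other files in the checkpoint are left out.

```python
import zipfile
from pathlib import Path

ADAPTER_FILES = ("adapter_config.json", "adapter_model.safetensors")

def zip_adapter(checkpoint_dir, archive_path="phi2-lora-adapter.zip"):
    """Write the adapter files found in checkpoint_dir into a zip archive."""
    checkpoint_dir = Path(checkpoint_dir)
    with zipfile.ZipFile(archive_path, "w", zipfile.ZIP_DEFLATED) as archive:
        for name in ADAPTER_FILES:
            file_path = checkpoint_dir / name
            if file_path.exists():              # skip anything that is missing
                archive.write(file_path, arcname=name)
    return archive_path

# Example call, assuming the checkpoint directory from training exists:
# zip_adapter("./train-dir/checkpoint-500")
```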

Summary

We saw why full fine-tuning does not scale to large LLMs and how PEFT addresses that: we freeze the base model and train small LoRA matrices, and with QLoRA the frozen model can also be quantized. We then fine-tuned Phi-2 by training just 0.93% of its parameters. That turned a rambling base model into one that follows our product-description format.

In the final tutorial, we apply 4-bit QLoRA to turn a base model into a conversational assistant by fine-tuning TinyLlama as a chat (instruct) model.
