Fine-Tuning TinyLlama as a Chat (Instruct) Model

A base language model only predicts the next token. Ask it a question and it may continue the question instead of answering it. An instruct (chat) model has been fine-tuned to follow instructions and hold a conversation. In this final post of the series, we convert the TinyLlama 1.1B base model into a chat assistant using 4-bit QLoRA.

This brings together everything from the Phi-2 tutorial: PEFT, LoRA, and quantization. It also adds chat templates and the purpose-built SFTTrainer.

Prerequisites: Familiarity with the LoRA and QLoRA concepts, and a GPU environment with transformers, peft, trl, accelerate, bitsandbytes, and datasets installed.

Base vs. Instruct Models

The base TinyLlama was trained only on next-token prediction over a huge text corpus. It knows a lot about language, but it has never been taught to answer a user. Instruction tuning fills that gap. It fine-tunes the model on conversations, which pair user messages with ideal assistant replies. This way the model learns the chat format and the habit of being helpful.

We use QLoRA for this. The base model is quantized to 4-bit and frozen. We train only small LoRA adapters on top. This recap from the previous tutorial is worth keeping in mind:

Note

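QLoRA uses the 4-bit NormalFloat (NF4) data type to quantize weights to 4-bit. NF4 is designed for weights that are roughly normally distributed, so the quantized values stay close to the originals and performance is preserved during fine-tuning. Combined with training only small adapters, this is what lets a 1.1B model train comfortably on a single GPU.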
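
To make the idea concrete, here is a minimal sketch of block-wise 4-bit quantization with a normal-shaped codebook. It is a simplified illustration, not the bitsandbytes implementation: the real NF4 codebook uses different values (including an exact zero), and the function names, the codebook construction, and the sample weights below are all assumptions made for this example.

```python
from statistics import NormalDist

def make_codebook(levels=16):
    """Build `levels` values in [-1, 1] from evenly spaced normal quantiles."""
    normal = NormalDist()
    quantiles = [normal.inv_cdf((i + 0.5) / levels) for i in range(levels)]
    largest = max(abs(q) for q in quantiles)
    return [q / largest for q in quantiles]

def quantize_block(weights, codebook):
    """Return one 4-bit code per weight plus the block's absmax scale."""
    scale = max(abs(w) for w in weights) or 1.0
    codes = [
        min(range(len(codebook)), key=lambda i: abs(codebook[i] - w / scale))
        for w in weights
    ]
    return codes, scale

def dequantize_block(codes, scale, codebook):
    """Map codes back to approximate weights."""
    return [codebook[c] * scale for c in codes]

codebook = make_codebook()
block = [0.02, -0.15, 0.07, 0.31, -0.04, 0.00, 0.11, -0.26]  # hypothetical weights
codes, scale = quantize_block(block, codebook)
restored = dequantize_block(codes, scale, codebook)

for original, approx in zip(block, restored):
    print(f"{original:+.3f} -> {approx:+.3f}")
```

Each weight is stored as an index from 0 to 15, and only one scale per block is kept in higher precision. Because the codebook is denser near zero, where most weights of a normally distributed block fall, small weights are reconstructed more precisely than a uniform grid would allow.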

Diagram contrasting a base model continuing a prompt with an instruct model answering it

A base model continues text; instruction tuning teaches it to answer the user in a chat format.

Loading and Formatting the Chat Data

We use the UltraChat dataset. We sample 10,000 conversations for a manageable run:

PYTHON

from datasets import load_dataset

dataset = load_dataset("HuggingFaceH4/ultrachat_200k", trust_remote_code=True, split="train_sft")
dataset = dataset.shuffle(seed=0).select(range(10_000))

Each example is a list of chat messages. We apply TinyLlama's chat template to turn them into the exact <|user|> and <|assistant|> format the model expects:

PYTHON

from transformers import AutoTokenizer

template_tokenizer = AutoTokenizer.from_pretrained("TinyLlama/TinyLlama-1.1B-Chat-v1.0")

def format_prompt(example):
    """Format the prompt using the <|user|> and <|assistant|> format"""
    chat = example['messages']
    prompt = template_tokenizer.apply_chat_template(chat, tokenize=False)
    return {'text': prompt}

dataset = dataset.map(format_prompt)

Tip

apply_chat_template is the correct way to format conversations. It inserts the special role tokens and turn separators the model was trained on. Hand-writing the format risks small mismatches that hurt quality.
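
To see roughly what the template produces, here is a small pure-Python sketch that renders messages in the same layout as the prompt used later in this post. The function render_chat and the sample conversation are hypothetical, and the exact whitespace of the real template may differ, so treat this as an illustration only and keep using apply_chat_template for real data.

```python
def render_chat(messages, eos_token="</s>"):
    """Render role/content messages as <|role|> blocks (illustrative only)."""
    parts = []
    for message in messages:
        # Each turn: role tag, newline, content, end-of-sequence token, newline
        parts.append(f"<|{message['role']}|>\n{message['content']}{eos_token}\n")
    return "".join(parts)

chat = [
    {"role": "user", "content": "What is LoRA?"},
    {"role": "assistant", "content": "LoRA trains small low-rank adapter matrices."},
]

text = render_chat(chat)
print(text)
```

The point is that every turn carries a role tag and ends with an end-of-sequence token, which is how the model learns where the user stops and the assistant begins.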

Testing the Base Model

We load the base model and see how it responds before tuning:

PYTHON

from transformers import pipeline

model_name = "TinyLlama/TinyLlama-1.1B-intermediate-step-1431k-3T"
pipe = pipeline(task='text-generation', model=model_name, device='cuda')

prompt = """
Tell me something about Large Language Models
"""
output = pipe(prompt)
print(output[0]['generated_text'])

The raw base model typically rambles or continues the prompt instead of giving a clean, structured answer. This is exactly the gap instruction tuning closes.

Configuring 4-bit QLoRA

We define the 4-bit quantization config. It uses the NF4 type, double quantization, and a float16 compute dtype:

PYTHON

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

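bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                     # store the weights in 4-bit
    bnb_4bit_quant_type="nf4",             # 4-bit NormalFloat data type
    bnb_4bit_compute_dtype=torch.float16,  # dequantize to float16 for the matmuls
    bnb_4bit_use_double_quant=True         # also quantize the quantization constants
)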
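
As a rough back-of-the-envelope check on what this config saves, the sketch below estimates weight memory for a 1.1B-parameter model. It assumes every parameter is quantized (in practice embeddings and norms usually stay in higher precision), a block size of 64 weights per quantization constant, and a second-level block of 256 constants for double quantization. These block sizes follow the QLoRA paper's description and are assumptions here, not values read from this config.

```python
PARAMS = 1.1e9        # approximate TinyLlama parameter count
BLOCK = 64            # assumed weights per quantization constant
SECOND_BLOCK = 256    # assumed constants per second-level constant

def weight_gigabytes(bits_per_param, params=PARAMS):
    """Convert a bits-per-parameter figure into gigabytes of weight storage."""
    return params * bits_per_param / 8 / 1e9

# Extra bits per parameter spent on quantization constants
overhead_single = 32 / BLOCK                               # one fp32 constant per block
overhead_double = 8 / BLOCK + 32 / (BLOCK * SECOND_BLOCK)  # 8-bit constants + fp32 second level

fp16_gb = weight_gigabytes(16)
nf4_gb = weight_gigabytes(4 + overhead_single)
nf4_dq_gb = weight_gigabytes(4 + overhead_double)

print(f"float16 weights:          {fp16_gb:.2f} GB")
print(f"4-bit, single quant:      {nf4_gb:.2f} GB")
print(f"4-bit, double quant:      {nf4_dq_gb:.2f} GB")
```

Under these assumptions the weights shrink to roughly a quarter of their float16 size, and double quantization trims the per-parameter overhead of the constants from 0.5 bits to about 0.127 bits.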

We load the tokenizer and the quantized model:

PYTHON

tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
tokenizer.pad_token = "<PAD>"
tokenizer.padding_side = "left"

model = AutoModelForCausalLM.from_pretrained(
    model_name,
    device_map="auto",
    quantization_config=bnb_config
)

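model.config.use_cache = False   # the KV cache is not needed for training and conflicts with gradient checkpointing
model.config.pretraining_tp = 1  # use the standard, non-tensor-parallel linear computation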

Diagram of the 4-bit QLoRA setup: a quantized frozen TinyLlama with trainable LoRA adapters on attention and MLP

4-bit QLoRA quantizes and freezes TinyLlama, training LoRA adapters on the attention and MLP projections.

Preparing the LoRA Configuration

We configure LoRA across all the attention and MLP projection layers. Then we prepare the quantized model for training:

PYTHON

from peft import LoraConfig, prepare_model_for_kbit_training, get_peft_model

peft_config = LoraConfig(
    lora_alpha=32,
    lora_dropout=0.1,
    r=64,
    bias='none',
    task_type='CAUSAL_LM',
    target_modules=['q_proj', 'k_proj', 'v_proj', 'o_proj', 'gate_proj', 'up_proj', 'down_proj']
)

model = prepare_model_for_kbit_training(model)
model = get_peft_model(model, peft_config)

Note

prepare_model_for_kbit_training casts layer norms to float32 and turns on gradient checkpointing. These are small but important steps for stable training on a quantized model.

Here is the intuition behind LoRA's efficiency. A 2048 × 256 weight matrix has 524,288 values. Two low-rank matrices of rank 64 (2048 × 64 and 64 × 256) total only 147,456 values. That is about 28% of the original. And that fraction shrinks further for larger matrices.
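
The arithmetic above can be checked with a few lines of Python. The helper names are made up for this sketch, and the 4096 × 4096 matrix is a hypothetical larger layer chosen only to show how the fraction shrinks.

```python
def full_params(d_in, d_out):
    """Number of values in a dense d_in x d_out weight matrix."""
    return d_in * d_out

def lora_params(d_in, d_out, rank):
    """Number of values in the two low-rank factors (d_in x rank and rank x d_out)."""
    return d_in * rank + rank * d_out

# The matrix from the text
full = full_params(2048, 256)
low_rank = lora_params(2048, 256, rank=64)
print(full, low_rank, f"{low_rank / full:.1%}")

# A hypothetical larger, square matrix with the same rank
big_full = full_params(4096, 4096)
big_low_rank = lora_params(4096, 4096, rank=64)
print(big_full, big_low_rank, f"{big_low_rank / big_full:.1%}")
```

For the larger matrix the adapter holds only about 3% as many values as the dense weight, which is why LoRA pays off most on big projection layers.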

Training with SFTTrainer

TRL's SFTTrainer is built for supervised fine-tuning on a text field. We point it at the formatted text column:

PYTHON

from transformers import TrainingArguments
from trl import SFTTrainer

args = TrainingArguments(
    output_dir="train_dir",
    per_device_train_batch_size=2,
    gradient_accumulation_steps=4,
    optim="paged_adamw_32bit",
    learning_rate=2e-4,
    lr_scheduler_type="cosine",
    num_train_epochs=1,
    logging_steps=10,
    fp16=True,
    gradient_checkpointing=True
)

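# These keyword arguments follow older TRL releases. Newer releases move
# dataset_text_field and max_seq_length into SFTConfig and rename tokenizer
# to processing_class, so adjust them to your installed version.
trainer = SFTTrainer(
    model=model,
    train_dataset=dataset,
    dataset_text_field='text',   # column produced by format_prompt
    tokenizer=tokenizer,
    args=args,
    max_seq_length=512,          # truncate longer conversations
    peft_config=peft_config
)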

trainer.train()
trainer.model.save_pretrained("TinyLlama-1.1B-qlora")

Important

A higher learning rate (2e-4) with a cosine scheduler suits LoRA adapters, which have far fewer parameters than the full model. The paged_adamw_32bit optimizer and gradient_checkpointing keep memory in check.
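
To get a feel for what these settings imply, the sketch below works out the effective batch size, the number of optimizer steps, and the shape of the cosine schedule. It assumes a single GPU, no warmup, no sequence packing, and a decay all the way to zero. The helper cosine_lr is written for this example and is not the scheduler code used by the trainer.

```python
import math

num_examples = 10_000     # conversations sampled earlier
per_device_batch = 2
grad_accum = 4
num_gpus = 1              # assumption: single GPU
peak_lr = 2e-4

# Examples consumed per optimizer step
effective_batch = per_device_batch * grad_accum * num_gpus
# Optimizer steps in one epoch
total_steps = math.ceil(num_examples / effective_batch)

def cosine_lr(step, total=total_steps, peak=peak_lr):
    """Cosine decay from the peak learning rate to zero (no warmup assumed)."""
    progress = step / total
    return peak * 0.5 * (1.0 + math.cos(math.pi * progress))

print("effective batch size:", effective_batch)
print("optimizer steps:", total_steps)
for step in (0, total_steps // 2, total_steps):
    print(f"step {step:>4}: lr = {cosine_lr(step):.2e}")
```

Gradient accumulation gives the stability of a batch of 8 while only 2 sequences sit in memory at once, and the cosine curve spends the early steps near the peak rate before easing off toward the end of the epoch.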

Loading and Merging the Adapter

For inference, we load the adapter and merge it into the base weights with merge_and_unload. This produces a single standalone model:

PYTHON

from peft import AutoPeftModelForCausalLM

model = AutoPeftModelForCausalLM.from_pretrained(
    "TinyLlama-1.1B-qlora",
    device_map='auto'
)

merged_model = model.merge_and_unload()
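
Conceptually, merging folds the adapter into the base weight so that inference needs only one matrix per layer. The toy sketch below illustrates this with plain Python lists, assuming the standard LoRA scaling of lora_alpha / r and ignoring quantization. The matrices, shapes, and helper functions are hypothetical and chosen only to keep the arithmetic easy to follow.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def add(a, b):
    """Element-wise sum of two equally shaped matrices."""
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]

def scale(m, s):
    """Multiply every entry of a matrix by a scalar."""
    return [[x * s for x in row] for row in m]

# Toy shapes: out=2, in=3, rank=1
W = [[1.0, 0.0, 2.0], [0.0, 1.0, -1.0]]  # frozen base weight (out x in)
B = [[0.5], [-1.0]]                       # adapter factor (out x r)
A = [[2.0, 0.0, 1.0]]                     # adapter factor (r x in)
lora_alpha, r = 2, 1
scaling = lora_alpha / r

x = [[1.0], [2.0], [3.0]]                 # input as a column vector

# Adapter kept separate: base output plus scaled adapter output
separate = add(matmul(W, x), scale(matmul(B, matmul(A, x)), scaling))

# Adapter merged: fold the scaled update into the weight once
W_merged = add(W, scale(matmul(B, A), scaling))
merged = matmul(W_merged, x)

print(separate, merged)
```

Both paths give the same output, so the merged model behaves like the adapted one while dropping the extra adapter computation at inference time.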

Now we generate a response using the chat format:

PYTHON

from transformers import AutoTokenizer, pipeline

prompt = """<|user|>
Tell me something about Large Language Models.</s>
<|assistant|>
"""

tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
tokenizer.pad_token = "<PAD>"
tokenizer.padding_side = "left"

pipe = pipeline(task='text-generation', model=merged_model, tokenizer=tokenizer)
output = pipe(prompt)
print(output[0]['generated_text'])

OUTPUT

<|user|>
Tell me something about Large Language Models.</s>
<|assistant|>
Large Language Models (LLMs) are a type of artificial intelligence (AI) that can generate human-like language. They are trained on large amounts of text data, such as Wikipedia articles, to create a model that can generate natural-sounding sentences.

LLMs are becoming increasingly popular in various fields, including natural language processing (NLP), machine translation, and chatbots...

One of the key advantages of LLMs is their ability to generate natural-sounding sentences that are similar to human language...

Overall, LLMs are a powerful tool that can help to improve the quality and efficiency of various NLP applications.

The fine-tuned model now answers the question directly, in the assistant role. It gives a structured and coherent response. It behaves like a chat assistant rather than a text completer.
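
The generated text above includes the prompt itself. If only the assistant's reply is needed, a small helper can strip everything up to the last assistant tag. The function extract_reply is a hypothetical convenience written for this sketch, and the sample string is a shortened stand-in for real model output.

```python
def extract_reply(generated_text, assistant_tag="<|assistant|>", eos_token="</s>"):
    """Return the text after the last assistant tag, cut at the first EOS token."""
    reply = generated_text.rsplit(assistant_tag, 1)[-1]
    reply = reply.split(eos_token, 1)[0]
    return reply.strip()

sample = (
    "<|user|>\n"
    "Tell me something about Large Language Models.</s>\n"
    "<|assistant|>\n"
    "Large Language Models (LLMs) are a type of AI.</s>"
)

print(extract_reply(sample))
```

The text-generation pipeline also accepts return_full_text=False, which returns only the newly generated text and may be simpler when post-processing is not otherwise needed.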

Diagram of the QLoRA chat-tuning workflow: format chat data, train adapter, merge, and serve responses

The full chat-tuning workflow: format conversations, train a LoRA adapter, merge it, and generate replies.

Summary

This is how chat fine-tuning works. We turned the TinyLlama 1.1B base model into a chat model using 4-bit QLoRA and TRL's SFTTrainer. The key ingredients were the chat template for formatting conversations, the BitsAndBytesConfig for 4-bit quantization, and merge_and_unload to produce a deployable model. All of it trained on a single GPU.

That completes this series. We started by running pretrained pipelines. Then we studied the transformer and BERT architectures. We fine-tuned encoder models for classification, NER, and summarization. We fine-tuned a Vision Transformer for images. And finally we fine-tuned generative LLMs with LoRA and QLoRA. We now have the full toolkit to fine-tune transformers for almost any task.
