Fine-Tuning TinyLlama as a Chat (Instruct) Model

Turn the TinyLlama 1.1B base model into a conversational assistant with 4-bit QLoRA, chat templates, and the TRL SFTTrainer on a single GPU.

Jun 18, 2026 · 9 min read

Topics You Will Master

The difference between a base LLM and an instruct/chat model
Formatting conversations with a chat template
Configuring 4-bit QLoRA with BitsAndBytesConfig
Training with TRL's SFTTrainer and merging the adapter

A base language model only predicts the next token: ask it a question and it may continue the question rather than answer it. An instruct (chat) model has been fine-tuned to follow instructions and hold a conversation. In this final tutorial you convert the TinyLlama 1.1B base model into a chat assistant using 4-bit QLoRA.

This brings together everything from the Phi-2 tutorial — PEFT, LoRA, quantization — and adds chat templates and the purpose-built SFTTrainer.

Prerequisites: familiarity with the LoRA and QLoRA concepts, and a GPU environment with transformers, peft, trl, accelerate, bitsandbytes, and datasets installed.

Base vs. Instruct Models

The base TinyLlama was trained purely on next-token prediction over a huge text corpus. It knows a lot about language but has never been taught the behavior of answering a user. Instruction tuning fixes that by fine-tuning on conversations — pairs of user messages and ideal assistant replies — so the model learns the chat format and the habit of being helpful.

You will use QLoRA for this: the base model is quantized to 4-bit and frozen, and you train only small LoRA adapters on top. This recap from the previous tutorial is worth keeping in mind:

Note

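QLoRA stores the frozen base weights in NF4 (4-bit NormalFloat), a 4-bit data type designed for normally distributed weights. The weights are dequantized to the compute dtype on the fly during the forward and backward passes, which preserves most of the model's quality during fine-tuning. It is what lets a 1.1B model train comfortably on a single GPU.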
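As a rough illustration of the memory saving, the sketch below estimates the size of the weights alone at 16-bit and 4-bit precision. The parameter count is taken from the model name, the helper `weight_memory_gb` is a hypothetical name, and the estimate deliberately ignores quantization constants, LoRA adapters, activations, and optimizer state.

```python
def weight_memory_gb(num_params: int, bits_per_param: int) -> float:
    """Approximate weight storage in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9


# Assumption: roughly 1.1 billion parameters, as the model name suggests.
NUM_PARAMS = 1_100_000_000

fp16_gb = weight_memory_gb(NUM_PARAMS, 16)
four_bit_gb = weight_memory_gb(NUM_PARAMS, 4)

print(f"float16 weights: ~{fp16_gb:.2f} GB")
print(f"4-bit weights:   ~{four_bit_gb:.2f} GB")
print(f"reduction:       {fp16_gb / four_bit_gb:.0f}x")
```

Actual GPU usage during training will be higher than the 4-bit figure, since activations and the adapter's optimizer state also need memory.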

A base model continues text; instruction tuning teaches it to answer the user in a chat format.


Loading and Formatting the Chat Data

Use the UltraChat dataset, sampling 10,000 conversations for a manageable run:

PYTHON
from datasets import load_dataset

dataset = load_dataset("HuggingFaceH4/ultrachat_200k", split="train_sft")
dataset = dataset.shuffle(seed=0).select(range(10_000))

Each example is a list of chat messages. Apply TinyLlama's chat template to render them into the exact <|user|> / <|assistant|> format the model expects:

PYTHON
from transformers import AutoTokenizer

template_tokenizer = AutoTokenizer.from_pretrained("TinyLlama/TinyLlama-1.1B-Chat-v1.0")

def format_prompt(example):
    """Format the prompt using the <|user|> and <|assistant|> format"""
    chat = example['messages']
    prompt = template_tokenizer.apply_chat_template(chat, tokenize=False)
    return {'text': prompt}

dataset = dataset.map(format_prompt)

Tip

apply_chat_template is the correct way to format conversations — it inserts the special role tokens and turn separators the model was trained on. Hand-writing the format risks subtle mismatches that hurt quality.
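To make the rendered format concrete, here is a hand-rolled sketch that approximates the <|user|> / <|assistant|> layout, for inspection only. The function `render_chat` is hypothetical, and the exact layout (role tag, newline, content, `</s>`, newline) is an assumption based on the prompt used later in this tutorial. For real training data, keep using apply_chat_template.

```python
# Assumption: "</s>" is the end-of-sequence marker, as in this tutorial's prompts.
EOS = "</s>"


def render_chat(messages, add_generation_prompt=False):
    """Approximate the chat layout: one '<|role|>' header per turn, closed by EOS."""
    parts = []
    for message in messages:
        parts.append(f"<|{message['role']}|>\n{message['content']}{EOS}\n")
    if add_generation_prompt:
        # Leave an open assistant turn for the model to complete.
        parts.append("<|assistant|>\n")
    return "".join(parts)


chat = [
    {"role": "user", "content": "Tell me something about Large Language Models."},
    {"role": "assistant", "content": "They are neural networks trained on text."},
]

# A full training example: both turns are closed with EOS.
print(render_chat(chat))

# An inference prompt: only the user turn, plus an open assistant header.
print(render_chat(chat[:1], add_generation_prompt=True))
```

Comparing this output against `template_tokenizer.apply_chat_template(chat, tokenize=False)` is a quick way to check whether a hand-written prompt matches what the model saw during training.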


Testing the Base Model

Load the base model and see how it responds before tuning:

PYTHON
from transformers import pipeline

model_name = "TinyLlama/TinyLlama-1.1B-intermediate-step-1431k-3T"
pipe = pipeline(task='text-generation', model=model_name, device='cuda')

prompt = """
Tell me something about Large Language Models
"""
output = pipe(prompt)
print(output[0]['generated_text'])

The raw base model tends to ramble or continue the prompt rather than give a clean, structured answer — exactly the gap instruction tuning closes.


Configuring 4-bit QLoRA

Define the 4-bit quantization config — NF4 type, double quantization, and a float16 compute dtype:

PYTHON
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
    bnb_4bit_use_double_quant=True
)
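The idea behind block-wise 4-bit quantization can be sketched in plain Python. This is a simplified illustration, not the bitsandbytes implementation: it assumes 16 evenly spaced levels, whereas real NF4 uses a fixed non-uniform codebook suited to normally distributed weights, and the function names are hypothetical. Each block stores small integer codes plus one scale; double quantization goes one step further and quantizes those scales as well.

```python
# Assumption: 16 evenly spaced levels in [-1, 1]. Real NF4 uses a
# non-uniform codebook; uniform levels keep the sketch easy to follow.
LEVELS = [-1.0 + 2.0 * i / 15 for i in range(16)]


def quantize_block(weights):
    """Map each weight to the index (0..15) of its nearest level after absmax scaling."""
    scale = max(abs(w) for w in weights) or 1.0
    codes = [
        min(range(16), key=lambda i: abs(LEVELS[i] - w / scale))
        for w in weights
    ]
    return codes, scale


def dequantize_block(codes, scale):
    """Recover approximate weights from 4-bit codes and the block scale."""
    return [LEVELS[c] * scale for c in codes]


block = [0.12, -0.40, 0.03, 0.25, -0.07, 0.31, -0.18, 0.00]
codes, scale = quantize_block(block)
restored = dequantize_block(codes, scale)
max_error = max(abs(a - b) for a, b in zip(block, restored))

print("codes:    ", codes)
print("scale:    ", scale)
print("max error:", max_error)
```

The reconstruction error is bounded by half a level step times the block scale, which is why small blocks with their own scales keep the error low.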

Load the tokenizer and the quantized model:

PYTHON
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
tokenizer.pad_token = "<PAD>"
tokenizer.padding_side = "left"  # pad on the left for a decoder-only model

model = AutoModelForCausalLM.from_pretrained(
    model_name,
    device_map="auto",
    quantization_config=bnb_config
)

model.config.use_cache = False
model.config.pretraining_tp = 1

4-bit QLoRA quantizes and freezes TinyLlama, training LoRA adapters on the attention and MLP projections.


Preparing the LoRA Configuration

Configure LoRA across all the attention and MLP projection layers, and prepare the quantized model for training:

PYTHON
from peft import LoraConfig, prepare_model_for_kbit_training, get_peft_model

peft_config = LoraConfig(
    lora_alpha=32,
    lora_dropout=0.1,
    r=64,
    bias='none',
    task_type='CAUSAL_LM',
    target_modules=['q_proj', 'k_proj', 'v_proj', 'o_proj', 'gate_proj', 'up_proj', 'down_proj']
)

model = prepare_model_for_kbit_training(model)
model = get_peft_model(model, peft_config)

Note

prepare_model_for_kbit_training casts layer norms to float32 and enables gradient checkpointing — small but important steps for stable training on a quantized model.

The intuition behind LoRA's efficiency: a 2048 × 256 weight matrix has 524,288 values, but two low-rank matrices (2048 × 64 and 64 × 256) total only 147,456 — about 28% of the original — and that fraction shrinks further for larger matrices.
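The arithmetic above can be checked with a few lines of Python. The helper `lora_param_counts` is a hypothetical name, and the larger shapes are illustrative choices rather than the actual TinyLlama layer sizes.

```python
def lora_param_counts(d_in, d_out, rank):
    """Return (full matrix size, size of the two low-rank factors, their ratio)."""
    full = d_in * d_out
    low_rank = d_in * rank + rank * d_out
    return full, low_rank, low_rank / full


# First shape is the example from the text; the others are illustrative.
for d_in, d_out in [(2048, 256), (2048, 2048), (4096, 4096)]:
    full, low_rank, fraction = lora_param_counts(d_in, d_out, rank=64)
    print(f"{d_in} x {d_out}: full={full:,} lora={low_rank:,} ({fraction:.1%})")
```

With the rank held at 64, the low-rank factors grow linearly with the matrix dimensions while the full matrix grows with their product, so the fraction keeps falling as layers get larger.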


Training with SFTTrainer

TRL's SFTTrainer is built for supervised fine-tuning on a text field. Point it at the formatted text column:

PYTHON
from transformers import TrainingArguments
from trl import SFTTrainer

args = TrainingArguments(
    output_dir="train_dir",
    per_device_train_batch_size=2,
    gradient_accumulation_steps=4,
    optim="paged_adamw_32bit",
    learning_rate=2e-4,
    lr_scheduler_type="cosine",
    num_train_epochs=1,
    logging_steps=10,
    fp16=True,
    gradient_checkpointing=True
)

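# The model is already wrapped by get_peft_model, so peft_config is not passed again.
# Recent TRL releases move some of these arguments (for example dataset_text_field
# and max_seq_length) into SFTConfig and rename tokenizer to processing_class;
# adjust the call to match your installed version.
trainer = SFTTrainer(
    model=model,
    train_dataset=dataset,
    dataset_text_field='text',
    tokenizer=tokenizer,
    args=args,
    max_seq_length=512
)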

trainer.train()
trainer.model.save_pretrained("TinyLlama-1.1B-qlora")

Important

A higher learning rate (2e-4) with a cosine scheduler suits LoRA adapters, which have far fewer parameters than the full model. The paged_adamw_32bit optimizer and gradient_checkpointing keep memory in check.
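To see what these settings imply, the sketch below works out the number of optimizer steps and the shape of the cosine schedule. It assumes a single GPU, one training row per conversation, and no warmup; the function names are hypothetical, and the scheduler inside the Trainer may differ in detail.

```python
import math


def optimizer_steps(num_examples, per_device_batch, grad_accum, num_devices=1):
    """Optimizer updates per epoch: micro-batches grouped by gradient accumulation."""
    micro_batches = math.ceil(num_examples / (per_device_batch * num_devices))
    return math.ceil(micro_batches / grad_accum)


def cosine_lr(step, total_steps, peak_lr):
    """Cosine decay from peak_lr at step 0 down to 0 at total_steps (no warmup)."""
    progress = step / total_steps
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * progress))


# Values from the TrainingArguments above: 10,000 examples, batch 2, accumulation 4.
total = optimizer_steps(10_000, per_device_batch=2, grad_accum=4)
print("effective batch size:", 2 * 4)
print("optimizer steps per epoch:", total)

for step in (0, total // 2, total):
    print(f"step {step:>4}: lr = {cosine_lr(step, total, 2e-4):.2e}")
```

With logging_steps=10, a run of this length produces on the order of a hundred logged loss values, which is enough to see whether the loss is trending down.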


Loading and Merging the Adapter

For inference, load the adapter and merge it into the base weights with merge_and_unload, producing a single standalone model:

PYTHON
from peft import AutoPeftModelForCausalLM

model = AutoPeftModelForCausalLM.from_pretrained(
    "TinyLlama-1.1B-qlora",
    device_map='auto'
)

merged_model = model.merge_and_unload()
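Conceptually, merging folds the scaled low-rank update into the base weight, W' = W + (lora_alpha / r) · B · A, so the adapter no longer needs a separate forward path. The toy example below checks that identity on tiny hand-written matrices. It is a sketch of the standard LoRA formulation, not the peft implementation, and all names and values are made up for illustration.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def add_scaled(a, b, scale):
    """Element-wise a + scale * b."""
    return [[x + scale * y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


# Toy shapes: W is 2x3, B is 2x2, A is 2x3 (rank 2), x is a 3x1 column vector.
W = [[1.0, 0.0, 2.0], [0.0, 1.0, -1.0]]
B = [[0.5, 1.0], [-1.0, 0.0]]
A = [[2.0, 0.0, 1.0], [0.0, 1.0, 0.0]]
x = [[1.0], [2.0], [3.0]]

lora_alpha, r = 1, 2          # same alpha / r ratio (0.5) as lora_alpha=32, r=64
scaling = lora_alpha / r

# Unmerged: base path plus the scaled adapter path.
adapter_out = add_scaled(matmul(W, x), matmul(B, matmul(A, x)), scaling)

# Merged: fold the update into the weight once, then use a single matmul.
W_merged = add_scaled(W, matmul(B, A), scaling)
merged_out = matmul(W_merged, x)

print("adapter path:", adapter_out)
print("merged path: ", merged_out)
```

Both paths give the same output, which is why the merged model behaves like the adapter-equipped one while being a single standalone set of weights.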

Generate a response using the chat format:

PYTHON
from transformers import AutoTokenizer, pipeline

prompt = """<|user|>
Tell me something about Large Language Models.</s>
<|assistant|>
"""

tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
tokenizer.pad_token = "<PAD>"
tokenizer.padding_side = "left"  # pad on the left for a decoder-only model

pipe = pipeline(task='text-generation', model=merged_model, tokenizer=tokenizer)
output = pipe(prompt)
print(output[0]['generated_text'])

OUTPUT
<|user|>
Tell me something about Large Language Models.</s>
<|assistant|>
Large Language Models (LLMs) are a type of artificial intelligence (AI) that can generate human-like language. They are trained on large amounts of text data, such as Wikipedia articles, to create a model that can generate natural-sounding sentences.

LLMs are becoming increasingly popular in various fields, including natural language processing (NLP), machine translation, and chatbots...

One of the key advantages of LLMs is their ability to generate natural-sounding sentences that are similar to human language...

Overall, LLMs are a powerful tool that can help to improve the quality and efficiency of various NLP applications.

The fine-tuned model now answers the question directly, in the assistant role, with a structured and coherent response — a genuine chat assistant.

The full chat-tuning workflow: format conversations, train a LoRA adapter, merge it, and generate replies.


Summary

You turned the TinyLlama 1.1B base model into a chat model using 4-bit QLoRA and TRL's SFTTrainer. The essential ingredients were the chat template for formatting conversations, the BitsAndBytesConfig for 4-bit quantization, and merge_and_unload to produce a deployable model — all trainable on a single GPU.

That completes this series. You have gone from running pretrained pipelines, through the transformer and BERT architectures, to fine-tuning encoder models for classification, NER, and summarization, a Vision Transformer for images, and finally generative LLMs with LoRA and QLoRA. You now have the full toolkit to fine-tune transformers for almost any task.
