A base language model only predicts the next word — ask it a question and it may continue the question rather than answer it. An instruct (chat) model has been fine-tuned to follow instructions and hold a conversation. In this final tutorial you convert the TinyLlama 1.1B base model into a chat assistant using 4-bit QLoRA.
This brings together everything from the Phi-2 tutorial — PEFT, LoRA, quantization — and adds chat templates and the purpose-built SFTTrainer.
Prerequisites: familiarity with the LoRA and QLoRA concepts, and a GPU environment with transformers, peft, trl, accelerate, bitsandbytes, and datasets installed.
Base vs. Instruct Models
The base TinyLlama was trained purely on next-token prediction over a huge text corpus. It knows a lot about language but has never been taught the behavior of answering a user. Instruction tuning fixes that by fine-tuning on conversations — pairs of user messages and ideal assistant replies — so the model learns the chat format and the habit of being helpful.
You will use QLoRA for this: the base model is quantized to 4-bit and frozen, and you train only small LoRA adapters on top. This recap from the previous tutorial is worth keeping in mind:
Note
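QLoRA stores the frozen base weights in 4-bit NormalFloat (NF4), a data type whose quantization levels are spaced to match the roughly normal distribution of neural-network weights. Each block of weights is scaled before quantization and rescaled on dequantization, so the restored values stay close to the originals and most of the model's performance is preserved during fine-tuning. It is what lets a 1.1B model train comfortably on a single GPU.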

A base model continues text; instruction tuning teaches it to answer the user in a chat format.
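To put the 4-bit idea from the note above into rough numbers, the sketch below estimates the memory taken by the weights alone at different precisions. It assumes a round 1.1 billion parameters and ignores quantization constants, activations, optimizer state, and the LoRA adapters, so the figures are a lower bound rather than a measurement. The helper name weight_memory_gb is made up for this illustration.

```python
def weight_memory_gb(num_params: float, bits_per_param: float) -> float:
    """Approximate memory for the weights alone, in gigabytes (10^9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9


NUM_PARAMS = 1.1e9  # assumption: round parameter count for TinyLlama 1.1B

for label, bits in [("float32", 32), ("float16", 16), ("4-bit", 4)]:
    print(f"{label:>8}: {weight_memory_gb(NUM_PARAMS, bits):.2f} GB")
```

Under these assumptions the weights shrink from about 4.4 GB in float32 to about 0.55 GB in 4-bit, which leaves most of a single GPU's memory for activations and the adapter's optimizer state.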
Loading and Formatting the Chat Data
Use the UltraChat dataset, sampling 10,000 conversations for a manageable run:
from datasets import load_dataset
dataset = load_dataset("HuggingFaceH4/ultrachat_200k", trust_remote_code=True, split="train_sft")
dataset = dataset.shuffle(seed=0).select(range(10_000))
Each example is a list of chat messages. Apply TinyLlama's chat template to render them into the exact <|user|> / <|assistant|> format the model expects:
from transformers import AutoTokenizer
template_tokenizer = AutoTokenizer.from_pretrained("TinyLlama/TinyLlama-1.1B-Chat-v1.0")
def format_prompt(example):
    """Format the prompt using the <|user|> and <|assistant|> format"""
    chat = example['messages']
    prompt = template_tokenizer.apply_chat_template(chat, tokenize=False)
    return {'text': prompt}
dataset = dataset.map(format_prompt)
Tip
apply_chat_template is the correct way to format conversations — it inserts the special role tokens and turn separators the model was trained on. Hand-writing the format risks subtle mismatches that hurt quality.
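To make the target format concrete, here is a hand-written approximation of the layout the template produces, based only on the <|user|> / <|assistant|> prompt shown later in this tutorial. The function render_chat is hypothetical and exists purely for illustration; the real template may differ in whitespace or in how it handles system messages, which is exactly why training code should call apply_chat_template instead.

```python
def render_chat(messages, eos_token="</s>"):
    """Hand-written approximation of the <|role|> chat layout, for illustration only."""
    parts = []
    for message in messages:
        # Each turn: the role tag on its own line, then the content closed by the EOS token.
        parts.append(f"<|{message['role']}|>\n{message['content']}{eos_token}\n")
    return "".join(parts)


chat = [
    {"role": "user", "content": "What is QLoRA?"},
    {"role": "assistant", "content": "A way to fine-tune a 4-bit quantized model with LoRA adapters."},
]
print(render_chat(chat))
```

Each turn ends with the end-of-sequence token, which is how the model learns where an assistant reply should stop.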
Testing the Base Model
Load the base model and see how it responds before tuning:
from transformers import pipeline
model_name = "TinyLlama/TinyLlama-1.1B-intermediate-step-1431k-3T"
pipe = pipeline(task='text-generation', model=model_name, device='cuda')
prompt = """
Tell me something about Large Language Models
"""
output = pipe(prompt)
print(output[0]['generated_text'])
The raw base model tends to ramble or continue the prompt rather than give a clean, structured answer — exactly the gap instruction tuning closes.
Configuring 4-bit QLoRA
Define the 4-bit quantization config — NF4 type, double quantization, and a float16 compute dtype:
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
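bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                     # store the base weights in 4-bit
    bnb_4bit_quant_type="nf4",             # 4-bit NormalFloat data type
    bnb_4bit_compute_dtype=torch.float16,  # dtype used for computation after dequantization
    bnb_4bit_use_double_quant=True,        # also quantize the quantization constants
)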
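The idea of block-wise 4-bit quantization can be sketched in a few lines of plain Python. This is a simplified stand-in, not the bitsandbytes implementation: it uses evenly spaced integer levels instead of the non-uniform NF4 levels, and it does not show double quantization, which additionally compresses the per-block scales. The function names quantize_block and dequantize_block are hypothetical.

```python
def quantize_block(block, num_levels=7):
    """Absmax-quantize one block of floats to integer codes in [-num_levels, num_levels]."""
    # The largest magnitude in the block becomes the scale (guard against an all-zero block).
    scale = max(abs(w) for w in block) or 1.0
    codes = [round(w / scale * num_levels) for w in block]
    return codes, scale


def dequantize_block(codes, scale, num_levels=7):
    """Map integer codes back to approximate float values."""
    return [c / num_levels * scale for c in codes]


block = [0.40, -1.20, 0.05, 0.90]
codes, scale = quantize_block(block)
restored = dequantize_block(codes, scale)
print(codes)
print([round(w, 3) for w in restored])
```

The 15 possible codes fit in 4 bits, and because each block carries its own scale, the restored values stay in the same range as the original weights at the cost of a small rounding error.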
Load the tokenizer and the quantized model:
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
tokenizer.pad_token = "<PAD>"
tokenizer.padding_side = "left"  # pad on the left
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    device_map="auto",
    quantization_config=bnb_config,
)
model.config.use_cache = False  # disable the KV cache during training
model.config.pretraining_tp = 1

4-bit QLoRA quantizes and freezes TinyLlama, training LoRA adapters on the attention and MLP projections.
Preparing the LoRA Configuration
Configure LoRA across all the attention and MLP projection layers, and prepare the quantized model for training:
from peft import LoraConfig, prepare_model_for_kbit_training, get_peft_model
peft_config = LoraConfig(
    lora_alpha=32,
    lora_dropout=0.1,
    r=64,
    bias='none',
    task_type='CAUSAL_LM',
    target_modules=['q_proj', 'k_proj', 'v_proj', 'o_proj', 'gate_proj', 'up_proj', 'down_proj'],
)
model = prepare_model_for_kbit_training(model)
model = get_peft_model(model, peft_config)
Note
prepare_model_for_kbit_training casts layer norms to float32 and enables gradient checkpointing — small but important steps for stable training on a quantized model.
The intuition behind LoRA's efficiency: a 2048 × 256 weight matrix has 524,288 values, but two low-rank matrices (2048 × 64 and 64 × 256) total only 147,456 — about 28% of the original — and that fraction shrinks further for larger matrices.
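The arithmetic above is easy to reproduce. The sketch below counts the values in a full weight matrix and in its two low-rank factors; the helper lora_param_counts is a made-up name, and the 4096 × 4096 case is an illustrative larger matrix rather than a layer taken from TinyLlama.

```python
def lora_param_counts(d_in, d_out, rank):
    """Return (full, low_rank) value counts for a d_in x d_out matrix and its rank-r factors."""
    full = d_in * d_out
    low_rank = d_in * rank + rank * d_out  # (d_in x r) plus (r x d_out)
    return full, low_rank


full, low_rank = lora_param_counts(2048, 256, 64)
print(full, low_rank, f"{low_rank / full:.1%}")

# Assumption: a larger square matrix, chosen only to show the trend.
full_big, low_rank_big = lora_param_counts(4096, 4096, 64)
print(full_big, low_rank_big, f"{low_rank_big / full_big:.1%}")
```

For the 2048 × 256 case this reproduces 524,288 versus 147,456 values; for the 4096 × 4096 case the two factors hold 524,288 values against 16,777,216, a little over 3% of the original.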
Training with SFTTrainer
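TRL's SFTTrainer is built for supervised fine-tuning on a text field. Point it at the formatted text column. The keyword arguments shown here follow the older TRL interface; more recent TRL releases move options such as dataset_text_field and the sequence-length limit into SFTConfig, so check the version you have installed: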
from transformers import TrainingArguments
from trl import SFTTrainer
args = TrainingArguments(
    output_dir="train_dir",
    per_device_train_batch_size=2,
    gradient_accumulation_steps=4,
    optim="paged_adamw_32bit",
    learning_rate=2e-4,
    lr_scheduler_type="cosine",
    num_train_epochs=1,
    logging_steps=10,
    fp16=True,
    gradient_checkpointing=True,
)
trainer = SFTTrainer(
    model=model,
    train_dataset=dataset,
    dataset_text_field='text',
    tokenizer=tokenizer,
    args=args,
    max_seq_length=512,
    peft_config=peft_config,
)
trainer.train()
trainer.model.save_pretrained("TinyLlama-1.1B-qlora")
Important
A higher learning rate (2e-4) with a cosine scheduler suits LoRA adapters, which have far fewer parameters than the full model. The paged_adamw_32bit optimizer and gradient_checkpointing keep memory in check.
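A quick back-of-the-envelope calculation shows what these settings imply. The sketch assumes a single GPU, no warmup, and that all 10,000 sampled conversations are used as separate examples; the cosine_lr function is a simplified model of a cosine decay schedule, not the scheduler code the trainer actually runs.

```python
import math

NUM_EXAMPLES = 10_000   # conversations sampled from UltraChat
PER_DEVICE_BATCH = 2    # per_device_train_batch_size
GRAD_ACCUM = 4          # gradient_accumulation_steps
NUM_GPUS = 1            # assumption: single-GPU run
PEAK_LR = 2e-4          # learning_rate

# Gradients from several small batches are accumulated before each optimizer step.
effective_batch = PER_DEVICE_BATCH * GRAD_ACCUM * NUM_GPUS
total_steps = math.ceil(NUM_EXAMPLES / effective_batch)


def cosine_lr(step, total, peak):
    """Cosine decay from peak to zero over `total` steps (assumes no warmup)."""
    progress = step / total
    return peak * 0.5 * (1.0 + math.cos(math.pi * progress))


print(effective_batch, total_steps)
for step in (0, total_steps // 2, total_steps):
    print(step, f"{cosine_lr(step, total_steps, PEAK_LR):.2e}")
```

With these numbers each optimizer step sees 8 examples, one epoch is about 1,250 steps, and the learning rate falls from 2e-4 through roughly 1e-4 at the midpoint to zero at the end.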
Loading and Merging the Adapter
For inference, load the adapter and merge it into the base weights with merge_and_unload, producing a single standalone model:
from peft import AutoPeftModelForCausalLM
model = AutoPeftModelForCausalLM.from_pretrained(
    "TinyLlama-1.1B-qlora",
    device_map='auto',
)
merged_model = model.merge_and_unload()
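Conceptually, merging folds the low-rank update into the base weight, W + (lora_alpha / r) · B · A, so that no separate adapter is needed at inference time. The sketch below shows that arithmetic on toy matrices in plain Python; matmul and merge_lora are hypothetical helpers for illustration and are not how PEFT implements the merge internally.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def merge_lora(weight, lora_b, lora_a, lora_alpha, r):
    """Return W + (lora_alpha / r) * B @ A as a new matrix."""
    scaling = lora_alpha / r
    delta = matmul(lora_b, lora_a)
    return [[w + scaling * d for w, d in zip(w_row, d_row)]
            for w_row, d_row in zip(weight, delta)]


# Toy shapes: W is 2x3 (out x in), B is 2x1, A is 1x3, so the rank r is 1.
W = [[1.0, 0.0, 2.0],
     [0.0, 1.0, 0.0]]
B = [[1.0], [2.0]]
A = [[0.5, 0.0, 1.0]]

merged = merge_lora(W, B, A, lora_alpha=2, r=1)
print(merged)
```

After the merge, a single matrix multiplication with the merged weight gives the same result as running the base layer and the adapter separately and adding their outputs.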
Generate a response using the chat format:
from transformers import pipeline
prompt = """<|user|>
Tell me something about Large Language Models.</s>
<|assistant|>
"""
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
tokenizer.pad_token = "<PAD>"
tokenizer.padding_side = "left"  # pad on the left
pipe = pipeline(task='text-generation', model=merged_model, tokenizer=tokenizer)
output = pipe(prompt)
print(output[0]['generated_text'])
<|user|>
Tell me something about Large Language Models.</s>
<|assistant|>
Large Language Models (LLMs) are a type of artificial intelligence (AI) that can generate human-like language. They are trained on large amounts of text data, such as Wikipedia articles, to create a model that can generate natural-sounding sentences.
LLMs are becoming increasingly popular in various fields, including natural language processing (NLP), machine translation, and chatbots...
One of the key advantages of LLMs is their ability to generate natural-sounding sentences that are similar to human language...
Overall, LLMs are a powerful tool that can help to improve the quality and efficiency of various NLP applications.
The fine-tuned model now answers the question directly, in the assistant role, with a structured and coherent response — a genuine chat assistant.

The full chat-tuning workflow: format conversations, train a LoRA adapter, merge it, and generate replies.
Summary
You turned the TinyLlama 1.1B base model into a chat model using 4-bit QLoRA and TRL's SFTTrainer. The essential ingredients were the chat template for formatting conversations, the BitsAndBytesConfig for 4-bit quantization, and merge_and_unload to produce a deployable model — all trainable on a single GPU.
That completes this series. You have gone from running pretrained pipelines, through the transformer and BERT architectures, to fine-tuning encoder models for classification, NER, and summarization, a Vision Transformer for images, and finally generative LLMs with LoRA and QLoRA. You now have the full toolkit to fine-tune transformers for almost any task.