Fine-Tuning T5 for Custom Text Summarization

Fine-tune the T5 text-to-text model for abstractive dialogue summarization on the SAMSum dataset using Hugging Face's Trainer and DataCollatorForSeq2Seq.

Jun 18, 2026 · 9 min read

Topics You Will Master

The difference between extractive and abstractive summarization
Running summarization pipelines and comparing T5 and BART
Preparing a sequence-to-sequence dataset with text_target
Using DataCollatorForSeq2Seq and the Trainer for encoder-decoder models

Summarization is a generation task: the model writes new text rather than picking a label. T5 (Text-to-Text Transfer Transformer) is a natural fit because it treats every NLP problem as text-in, text-out and uses the full encoder-decoder transformer. In this tutorial you fine-tune t5-small to summarize chat-style dialogues.

Unlike the encoder-only BERT tutorials, this one uses a sequence-to-sequence setup: an encoder-decoder model, labels that are themselves token sequences, and a dedicated data collator.

Prerequisites: Familiarity with the transformer encoder-decoder architecture and a Python environment with transformers, datasets, and torch.

Extractive vs. Abstractive Summarization

Summarization in NLP automatically generates a concise version of a longer text.

There are two flavors:

  • Extractive — copies the most important sentences verbatim from the source.
  • Abstractive — generates new sentences that capture the meaning, the way a human would paraphrase.
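
To make the extractive side concrete, here is a minimal sketch of a frequency-based extractive summarizer. It is not part of the T5 workflow in this tutorial; the function name `extractive_summary` and the scoring rule (average word frequency per sentence) are illustrative assumptions, chosen only to show that an extractive method returns sentences copied verbatim from the source.

```python
import re
from collections import Counter


def extractive_summary(text: str, num_sentences: int = 1) -> str:
    """Return the highest-scoring sentences, copied verbatim from the text."""
    # Naive sentence split on ., ! or ? followed by whitespace.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    # Word frequencies over the whole text act as a crude importance signal.
    freqs = Counter(re.findall(r"[a-z']+", text.lower()))

    def score(sentence: str) -> float:
        words = re.findall(r"[a-z']+", sentence.lower())
        return sum(freqs[w] for w in words) / max(len(words), 1)

    # Pick the top-scoring sentences, then restore their original order.
    ranked = sorted(range(len(sentences)), key=lambda i: score(sentences[i]), reverse=True)
    chosen = sorted(ranked[:num_sentences])
    return " ".join(sentences[i] for i in chosen)


# Hypothetical three-sentence document, used only for illustration.
doc = (
    "The team shipped the new model on Friday. "
    "The model summarizes chat logs. "
    "Lunch was pizza."
)
print(extractive_summary(doc, num_sentences=1))
```

Whatever this function returns is always a substring of the input. An abstractive model has no such constraint, which is what lets it paraphrase and also what lets it introduce wording that never appeared in the source.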

T5 and BART produce abstractive summaries, which read more naturally. The CNN/DailyMail dataset — around 300,000 news article/summary pairs — is the classic benchmark, and its summaries are abstractive.

Abstractive summarization encodes a long document and decodes a short, reworded summary.


Trying Pretrained Summarizers

Load a few examples from CNN/DailyMail and compare two pretrained models:

PYTHON
from datasets import load_dataset

dataset = load_dataset("cnn_dailymail", '3.0.0', split="train[:10]")
print(dataset[0]['article'])
print("\nSummary:\n")
print(dataset[0]['highlights'])
OUTPUT
LONDON, England (Reuters) -- Harry Potter star Daniel Radcliffe gains access to a reported £20 million ($41.1 million) fortune as he turns 18 on Monday...

Summary:

Harry Potter star Daniel Radcliffe gets £20M fortune as he turns 18 Monday .
Young actor says he has no plans to fritter his cash away .
Radcliffe's earnings from first five Potter films have been held in trust fund .

Run both ubikpt/t5-small-finetuned-cnn and facebook/bart-large-cnn on the same article:

PYTHON
from transformers import pipeline
import torch

device = torch.device("cuda" if torch.cuda.is_available() else 'cpu')
summary = {}

pipe = pipeline('summarization', model='ubikpt/t5-small-finetuned-cnn', device=device)
summary['t5-small'] = pipe(dataset[0]['article'])[0]['summary_text']

pipe = pipeline('summarization', model='facebook/bart-large-cnn', device=device)
summary['bart-large'] = pipe(dataset[0]['article'])[0]['summary_text']

for model in summary:
    print(f"\n{model}\n{summary[model]}")
OUTPUT
t5-small
Harry Potter star Daniel Radcliffe says he has no plans to fritter his cash away . The actor has filmed a TV movie about author Rudyard Kipling

bart-large
Harry Potter star Daniel Radcliffe turns 18 on Monday. He gains access to a reported £20 million ($41.1 million) fortune. Radcliffe's earnings from the first five Potter films have been held in a trust fund. Details of how he'll mark his landmark birthday are under wraps.

Note

BART produces a longer, more detailed summary here; T5-small is more terse. Which is "better" depends on your use case — that is exactly why fine-tuning on your own data matters.
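
If you want a rough number to go with the side-by-side reading, one common approach is unigram overlap with the reference summary, in the spirit of ROUGE-1. The sketch below is a simplified stand-in, not the official ROUGE implementation: the function name `rouge1_f1` is an assumption, and it uses plain whitespace tokenization with no stemming. The strings are the first reference highlight and the opening sentences of the two model outputs shown above.

```python
from collections import Counter


def rouge1_f1(candidate: str, reference: str) -> float:
    """Unigram-overlap F1 between a candidate and a reference summary."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    # Counter intersection clips each shared word to its smaller count.
    overlap = sum((cand & ref).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


reference = "Harry Potter star Daniel Radcliffe gets £20M fortune as he turns 18 Monday ."
candidates = {
    "t5-small": "Harry Potter star Daniel Radcliffe says he has no plans to fritter his cash away .",
    "bart-large": "Harry Potter star Daniel Radcliffe turns 18 on Monday.",
}
for name, text in candidates.items():
    print(name, round(rouge1_f1(text, reference), 3))
```

A score like this only measures word overlap with one reference, so treat it as a sanity check rather than a verdict. For real evaluation you would typically reach for an established ROUGE implementation and a full test set.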


The SAMSum Dataset

To customize summarization, fine-tune on the SAMSum dataset — messenger-style dialogues paired with human-written summaries:

PYTHON
samsum = load_dataset('samsum', trust_remote_code=True)
samsum['train'][0]
OUTPUT
{'id': '13818513', 'dialogue': "Amanda: I baked  cookies. Do you want some?\r\nJerry: Sure!\r\nAmanda: I'll bring you tomorrow :-)", 'summary': 'Amanda baked cookies and will bring Jerry some tomorrow.'}

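Inspect dialogue and summary lengths to pick a sensible maximum length. The counts below are whitespace-separated words, which is only a rough proxy: the tokenizer usually produces somewhat more tokens than words.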

PYTHON
import pandas as pd

dialogue_len = [len(x['dialogue'].split()) for x in samsum['train']]
summary_len = [len(x['summary'].split()) for x in samsum['train']]

data = pd.DataFrame([dialogue_len, summary_len]).T
data.columns = ['Dialogue Length', 'Summary Length']
data.hist(figsize=(10, 3))
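
Reading a cutoff off a histogram is a judgment call. One way to make it explicit is to pick the length that covers a chosen share of the examples. The sketch below uses a nearest-rank percentile; the helper name `percentile_cutoff` and the toy numbers are assumptions for illustration, and in the tutorial you would pass `dialogue_len` or `summary_len` instead.

```python
import math


def percentile_cutoff(lengths: list[int], pct: float = 95.0) -> int:
    """Smallest length that covers at least `pct` percent of the examples."""
    if not lengths:
        raise ValueError("lengths must not be empty")
    ordered = sorted(lengths)
    # Nearest-rank percentile: 1-based rank of the first value covering pct percent.
    rank = max(math.ceil(pct * len(ordered) / 100), 1)
    return ordered[rank - 1]


# Hypothetical word counts; replace with dialogue_len from the block above.
toy_lengths = [12, 30, 45, 51, 60, 72, 88, 95, 140, 410]
print(percentile_cutoff(toy_lengths, 90))   # covers 9 of the 10 examples
print(percentile_cutoff(toy_lengths, 100))  # the longest example
```

Because these are word counts, it is reasonable to leave some headroom when translating the result into a token limit, or to repeat the measurement on tokenized lengths.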

Loading and Tokenizing T5

Load the t5-small tokenizer and the sequence-to-sequence model:

PYTHON
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

model_ckpt = 't5-small'
tokenizer = AutoTokenizer.from_pretrained(model_ckpt)
model = AutoModelForSeq2SeqLM.from_pretrained(model_ckpt).to(device)

The key to seq2seq tokenization is text_target, which tokenizes the labels (summaries) separately from the inputs (dialogues):

PYTHON
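def tokenize(batch):
    # text_target tokenizes the summaries into `labels`; both sides are truncated to 200 tokens.
    # No padding here: DataCollatorForSeq2Seq pads each batch at training time and fills
    # label padding with -100, so padded positions do not count toward the loss.
    return tokenizer(batch['dialogue'], text_target=batch['summary'], max_length=200, truncation=True)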

samsum_pt = samsum.map(tokenize, batched=True, batch_size=None)

The tokenized dataset now carries input_ids, attention_mask, and labels for the train, test, and validation splits.

Seq2seq tokenization encodes the dialogue as input_ids and the summary as labels via text_target.


Training

Sequence-to-sequence models need DataCollatorForSeq2Seq, which dynamically pads inputs and labels and prepares decoder inputs:

PYTHON
from transformers import DataCollatorForSeq2Seq, TrainingArguments, Trainer

data_collator = DataCollatorForSeq2Seq(tokenizer, model=model)

args = TrainingArguments(
    output_dir="train_dir",
    num_train_epochs=2,
    per_device_train_batch_size=4,
    per_device_eval_batch_size=4,
    eval_strategy='epoch',
    save_strategy='epoch',
    weight_decay=0.01,
    learning_rate=2e-5,
    gradient_accumulation_steps=500
)

trainer = Trainer(
    model=model,
    args=args,
    tokenizer=tokenizer,
    data_collator=data_collator,
    train_dataset=samsum_pt['train'],
    eval_dataset=samsum_pt['validation']
)

trainer.train()
OUTPUT
{'eval_loss': 14.6737, 'epoch': 0.95}
{'eval_loss': 13.8082, 'epoch': 1.9}
{'train_runtime': 550.3847, 'train_loss': 14.087, 'epoch': 1.9}
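
To see what the collator contributes to each batch, here is a pure-Python sketch of its two jobs: padding to the longest sequence in the batch and building decoder inputs from the labels. This is an illustration under stated assumptions, not the library's code. The helper names `pad_batch` and `shift_right` and the token ids are hypothetical; the constants reflect T5's conventions (pad id 0, which also serves as the decoder start token) and the default ignore index of PyTorch's cross-entropy loss.

```python
PAD_ID = 0            # T5's pad token id, also used as the decoder start token
IGNORE_INDEX = -100   # label value skipped by PyTorch's cross-entropy loss


def pad_batch(sequences, pad_value):
    """Right-pad every sequence to the longest one in this batch."""
    width = max(len(seq) for seq in sequences)
    return [seq + [pad_value] * (width - len(seq)) for seq in sequences]


def shift_right(labels):
    """Build decoder inputs: prepend the start token, drop the last label."""
    shifted = [PAD_ID] + labels[:-1]
    # -100 is not a valid token id for the embedding layer, so map it to PAD_ID.
    return [PAD_ID if tok == IGNORE_INDEX else tok for tok in shifted]


# Hypothetical token ids for two examples of different lengths (1 is </s> in T5).
input_ids = [[37, 423, 8, 1], [37, 1]]
labels = [[71, 9, 1], [55, 1]]

batch_inputs = pad_batch(input_ids, PAD_ID)
batch_labels = pad_batch(labels, IGNORE_INDEX)
decoder_inputs = [shift_right(row) for row in batch_labels]

print(batch_inputs)
print(batch_labels)
print(decoder_inputs)
```

The point of padding labels with -100 rather than the pad id is that the loss then skips those positions. Shifting the labels one step to the right is what lets the decoder predict token t while only seeing tokens before t.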

Warning

The large gradient_accumulation_steps=500 means the optimizer only updates 14 times over the whole run, so this is a quick workflow demonstration, not a fully converged model. For real training, lower gradient_accumulation_steps (e.g. 1–8) and increase epochs so the loss drops meaningfully.
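
The arithmetic behind that update count can be sketched as follows. Two assumptions are baked in and labeled in the code: the training split holds 14,732 examples (the size usually reported for SAMSum, not stated in this tutorial), and a leftover group of batches that does not fill an accumulation window is not counted as an extra update, which is consistent with the `epoch: 0.95` and `1.9` values logged above. The function name `optimizer_updates` is hypothetical.

```python
import math


def optimizer_updates(num_examples, batch_size, accumulation_steps, epochs):
    """Approximate number of optimizer updates on a single device."""
    batches_per_epoch = math.ceil(num_examples / batch_size)
    # Assumption: batches that do not fill a full accumulation window
    # are not counted as a separate update (floor division).
    updates_per_epoch = max(batches_per_epoch // accumulation_steps, 1)
    return updates_per_epoch * epochs


# Assumption: 14,732 training examples, the size usually reported for SAMSum.
for accumulation in (500, 8, 1):
    print(accumulation, optimizer_updates(14_732, 4, accumulation, epochs=2))
```

With a per-device batch size of 4, dropping the accumulation from 500 to 8 takes the run from 14 updates to several hundred, which is the difference between a smoke test and a run where the loss has room to move.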


Prediction

Save the model and summarize a brand-new dialogue:

PYTHON
from transformers import pipeline

trainer.save_model("t5_samsum_summarization")
pipe = pipeline('summarization', model='t5_samsum_summarization', device=device)

custom_dialogue = """
Laxmi Kant: what work you planning to give Tom?
Juli: i was hoping to send him on a business trip first.
Laxmi Kant: cool. is there any suitable work for him?
Juli: he did excellent in last quarter. i will assign new project, once he is back.
"""

output = pipe(custom_dialogue)
output
OUTPUT
[{'summary_text': 'laxmi Kant: i was hoping to send him on a business trip first . i will assign new project once he is back .'}]

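Even after this short demo run, the output surfaces the two key points of the conversation: the business trip and the new project. It does so mostly by copying Juli's lines, and it attributes them to the wrong speaker, which is what you would expect from a model that has received only a handful of optimizer updates.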
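
One quick way to check how abstractive an output really is: measure how many of its word bigrams already appear in the source. The sketch below is an informal diagnostic, not an established metric implementation; the function name `copied_bigram_fraction` and the hand-written `paraphrase` string are assumptions added for comparison, while the dialogue and the model output are the ones shown above.

```python
import re


def copied_bigram_fraction(summary: str, source: str) -> float:
    """Share of the summary's word bigrams that also occur in the source."""
    def bigrams(text):
        words = re.findall(r"[a-z0-9']+", text.lower())
        return list(zip(words, words[1:]))

    summary_bigrams = bigrams(summary)
    if not summary_bigrams:
        return 0.0
    source_bigrams = set(bigrams(source))
    copied = sum(1 for bg in summary_bigrams if bg in source_bigrams)
    return copied / len(summary_bigrams)


dialogue = """
Laxmi Kant: what work you planning to give Tom?
Juli: i was hoping to send him on a business trip first.
Laxmi Kant: cool. is there any suitable work for him?
Juli: he did excellent in last quarter. i will assign new project, once he is back.
"""
model_output = (
    "laxmi Kant: i was hoping to send him on a business trip first . "
    "i will assign new project once he is back ."
)
# Hypothetical human-style paraphrase, written only for this comparison.
paraphrase = "Juli plans a business trip for Tom and a new project afterwards."

print(round(copied_bigram_fraction(model_output, dialogue), 2))
print(round(copied_bigram_fraction(paraphrase, dialogue), 2))
```

A value close to 1 means the summary is stitched together from source phrases; a lower value indicates more rewording. Tracking this number across checkpoints is one informal way to see whether longer training moves the model toward genuine paraphrase.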

The fine-tuned T5 pipeline condenses a multi-turn dialogue into a short summary.


Summary

You fine-tuned t5-small for abstractive dialogue summarization on SAMSum. The new pieces compared to classification are the encoder-decoder model (AutoModelForSeq2SeqLM), the text_target tokenization that handles labels, and the DataCollatorForSeq2Seq that prepares decoder inputs.

Next, you leave text behind and apply the same fine-tuning recipe to images — fine-tuning a Vision Transformer for image classification.
