Fine-Tuning T5 for Custom Text Summarization

Summarization is a generation task: the model writes new text instead of picking a label. T5 (Text-to-Text Transfer Transformer) is well suited to it because it treats every NLP problem as text in, text out and uses the full encoder-decoder transformer. In this post, we fine-tune t5-small to summarize chat-style dialogues.

Unlike the earlier encoder-only BERT tutorials, this one uses a sequence-to-sequence setup with its own tokenization step and data collator.

Prerequisites: Familiarity with the transformer encoder-decoder architecture and a Python environment with transformers, datasets, torch, and pandas (plus matplotlib for the histogram).

Extractive vs. Abstractive Summarization

Summarization in NLP creates a short version of a longer text.

There are two types:

Extractive: copies the most important sentences word for word from the source.
Abstractive: writes new sentences that capture the meaning, the way a human would paraphrase.
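
To make the contrast concrete, here is a minimal sketch of the extractive approach, assuming a crude word-frequency score as the measure of importance. The function name extractive_summary and the toy text are hypothetical and not part of this tutorial's pipeline. An abstractive model such as T5 generates new wording instead of selecting sentences like this.

```python
import re
from collections import Counter


def extractive_summary(text, num_sentences=1):
    """Return the highest-scoring sentences, copied verbatim from the text."""
    # Naive sentence split: break after ., ! or ? followed by whitespace.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    # Word frequencies over the whole text act as a crude importance signal.
    freqs = Counter(re.findall(r"[a-z']+", text.lower()))

    def score(sentence):
        words = re.findall(r"[a-z']+", sentence.lower())
        return sum(freqs[w] for w in words) / max(len(words), 1)

    # Pick the top sentences, then restore their original order.
    top = sorted(sentences, key=score, reverse=True)[:num_sentences]
    return " ".join(s for s in sentences if s in top)


# Toy text, loosely modelled on a chat about cookies.
text = (
    "Amanda baked cookies. "
    "Jerry wants cookies. "
    "Amanda will bring the cookies tomorrow."
)
summary = extractive_summary(text, num_sentences=1)
print(summary)
```

Whatever it returns is always a sentence lifted from the source, which is the defining limitation of extractive methods.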

T5 and BART produce abstractive summaries, which read more naturally. The CNN/DailyMail dataset is the classic benchmark. It has around 300,000 news article and summary pairs, and its summaries are abstractive.

Figure: Abstractive summarization encodes a long document and decodes a short, reworded summary.

Trying Pretrained Summarizers

We load a few examples from CNN/DailyMail and compare two pretrained models:

PYTHON

from datasets import load_dataset

dataset = load_dataset("cnn_dailymail", '3.0.0', split="train[:10]")
print(dataset[0]['article'])
print("\nSummary:\n")
print(dataset[0]['highlights'])

OUTPUT

LONDON, England (Reuters) -- Harry Potter star Daniel Radcliffe gains access to a reported £20 million ($41.1 million) fortune as he turns 18 on Monday...

Summary:

Harry Potter star Daniel Radcliffe gets £20M fortune as he turns 18 Monday .
Young actor says he has no plans to fritter his cash away .
Radcliffe's earnings from first five Potter films have been held in trust fund .

We run both ubikpt/t5-small-finetuned-cnn and facebook/bart-large-cnn on the same article:

PYTHON

from transformers import pipeline
import torch

device = torch.device("cuda" if torch.cuda.is_available() else 'cpu')
summary = {}

pipe = pipeline('summarization', model='ubikpt/t5-small-finetuned-cnn', device=device)
summary['t5-small'] = pipe(dataset[0]['article'])[0]['summary_text']

pipe = pipeline('summarization', model='facebook/bart-large-cnn', device=device)
summary['bart-large'] = pipe(dataset[0]['article'])[0]['summary_text']

for model in summary:
    print(f"\n{model}\n{summary[model]}")

OUTPUT

t5-small
Harry Potter star Daniel Radcliffe says he has no plans to fritter his cash away . The actor has filmed a TV movie about author Rudyard Kipling

bart-large
Harry Potter star Daniel Radcliffe turns 18 on Monday. He gains access to a reported £20 million ($41.1 million) fortune. Radcliffe's earnings from the first five Potter films have been held in a trust fund. Details of how he'll mark his landmark birthday are under wraps.

Note

BART produces a longer, more detailed summary, while T5-small is terser. Which one is better depends on the use case, and that is exactly why fine-tuning on our own data matters.

The SAMSum Dataset

To customize summarization, we fine-tune on the SAMSum dataset. It has messenger-style dialogues paired with human-written summaries:

PYTHON

samsum = load_dataset('samsum', trust_remote_code=True)
samsum['train'][0]

OUTPUT

{'id': '13818513', 'dialogue': "Amanda: I baked  cookies. Do you want some?\r\nJerry: Sure!\r\nAmanda: I'll bring you tomorrow :-)", 'summary': 'Amanda baked cookies and will bring Jerry some tomorrow.'}

We inspect dialogue and summary lengths, counted in whitespace-separated words as a rough proxy for tokens, to pick a sensible maximum token length:

PYTHON

import pandas as pd

dialogue_len = [len(x['dialogue'].split()) for x in samsum['train']]
summary_len = [len(x['summary'].split()) for x in samsum['train']]

data = pd.DataFrame([dialogue_len, summary_len]).T
data.columns = ['Dialogue Length', 'Summary Length']
data.hist(figsize=(10, 3))
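
One way to turn such a length distribution into a concrete cut-off is to pick the length that covers a chosen share of the examples. The sketch below assumes a nearest-rank percentile; the helper length_at_percentile and the toy numbers are hypothetical and are not SAMSum statistics. Because subword tokenizers usually produce more tokens than there are words, a cut-off derived from word counts should be given some headroom.

```python
import math


def length_at_percentile(lengths, pct):
    """Smallest length that covers at least `pct` percent of the examples (nearest rank)."""
    if not lengths:
        raise ValueError("lengths must not be empty")
    ordered = sorted(lengths)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


# Toy word counts standing in for dialogue_len; not real SAMSum numbers.
toy_lengths = [12, 30, 45, 60, 80, 95, 110, 150, 210, 400]

cutoff = length_at_percentile(toy_lengths, 90)
covered = sum(n <= cutoff for n in toy_lengths) / len(toy_lengths)
print(f"cut-off: {cutoff} words, share of examples not truncated: {covered:.0%}")
```

In the tutorial's notebook, the same helper could be called on dialogue_len and summary_len instead of the toy list.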

Loading and Tokenizing T5

We load the t5-small tokenizer and the sequence-to-sequence model:

PYTHON

from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

model_ckpt = 't5-small'
tokenizer = AutoTokenizer.from_pretrained(model_ckpt)
model = AutoModelForSeq2SeqLM.from_pretrained(model_ckpt).to(device)

The key to seq2seq tokenization is text_target. It tokenizes the labels (summaries) separately from the inputs (dialogues):

PYTHON

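def tokenize(batch):
    # Dialogues become input_ids/attention_mask; summaries (text_target) become labels.
    # Padding is left to DataCollatorForSeq2Seq so that padded label positions are ignored by the loss.
    return tokenizer(
        batch['dialogue'],
        text_target=batch['summary'],
        max_length=200,
        truncation=True,
    )

samsum_pt = samsum.map(tokenize, batched=True, batch_size=None)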

Here, we can see the tokenized dataset now carries input_ids, attention_mask, and labels for the train, test, and validation splits.

Figure: Seq2seq tokenization encodes the dialogue as input_ids and the summary as labels via text_target.
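
At training time the labels serve two purposes: they are the targets for the loss, and a right-shifted copy of them is fed to the decoder. The following is a minimal plain-Python sketch of that idea, assuming T5's conventions (pad id 0 doubling as the decoder start token, end-of-sequence id 1) and the -100 value that a seq2seq data collator uses by default for padded label positions. The helper names and token ids are made up for illustration; the real logic lives inside the library.

```python
PAD_ID = 0           # T5's pad token id, also used as its decoder start token
IGNORE_INDEX = -100  # label value that PyTorch's cross-entropy loss skips by default


def pad_labels(batch_labels):
    """Pad every label sequence to the longest one in the batch with IGNORE_INDEX."""
    longest = max(len(seq) for seq in batch_labels)
    return [seq + [IGNORE_INDEX] * (longest - len(seq)) for seq in batch_labels]


def shift_right(labels, start_id=PAD_ID, pad_id=PAD_ID):
    """Build decoder inputs: prepend the start token, drop the last label, map -100 to pad."""
    shifted = [start_id] + labels[:-1]
    return [pad_id if tok == IGNORE_INDEX else tok for tok in shifted]


# Made-up token ids for two summaries of different length; 1 marks end of sequence.
batch = [[71, 12, 9, 1], [55, 1]]

padded = pad_labels(batch)
decoder_inputs = [shift_right(seq) for seq in padded]

print(padded)
print(decoder_inputs)
```

At each position the decoder sees the previous target token and is trained to predict the current one, while positions holding -100 contribute nothing to the loss.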

Training

Sequence-to-sequence models need DataCollatorForSeq2Seq. It pads inputs and labels on the fly and prepares the decoder inputs:

PYTHON

from transformers import DataCollatorForSeq2Seq, TrainingArguments, Trainer

data_collator = DataCollatorForSeq2Seq(tokenizer, model=model)

args = TrainingArguments(
    output_dir="train_dir",
    num_train_epochs=2,
    per_device_train_batch_size=4,
    per_device_eval_batch_size=4,
    eval_strategy='epoch',
    save_strategy='epoch',
    weight_decay=0.01,
    learning_rate=2e-5,
    gradient_accumulation_steps=500
)

trainer = Trainer(
    model=model,
    args=args,
    tokenizer=tokenizer,
    data_collator=data_collator,
    train_dataset=samsum_pt['train'],
    eval_dataset=samsum_pt['validation']
)

trainer.train()

OUTPUT

{'eval_loss': 14.6737, 'epoch': 0.95}
{'eval_loss': 13.8082, 'epoch': 1.9}
{'train_runtime': 550.3847, 'train_loss': 14.087, 'epoch': 1.9}

Warning

The large gradient_accumulation_steps=500 means the optimizer updates only 14 times over the whole run. So this is a quick workflow demo, not a fully converged model. For real training, lower gradient_accumulation_steps (say 1 to 8) and increase epochs so the loss drops meaningfully.
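
The update count can be checked with a few lines of arithmetic. This is a rough sketch that assumes a single device, 14,732 training examples in SAMSum (an assumption not stated above), and floor division for the number of updates per epoch; the exact bookkeeping inside Trainer may differ between library versions. The function name optimizer_updates is hypothetical.

```python
import math


def optimizer_updates(num_examples, batch_size, grad_accum, epochs):
    """Approximate number of optimizer steps for a single-device run."""
    batches_per_epoch = math.ceil(num_examples / batch_size)
    # Assumption: an incomplete accumulation window at the end of an epoch is dropped.
    updates_per_epoch = max(batches_per_epoch // grad_accum, 1)
    return updates_per_epoch * epochs


NUM_EXAMPLES = 14_732  # assumed size of the SAMSum training split

demo_updates = optimizer_updates(NUM_EXAMPLES, batch_size=4, grad_accum=500, epochs=2)
lower_accum_updates = optimizer_updates(NUM_EXAMPLES, batch_size=4, grad_accum=4, epochs=2)

# Fraction of an epoch covered by the updates that actually happen in one epoch.
epoch_at_first_eval = (demo_updates // 2) * 500 * 4 / NUM_EXAMPLES

print(demo_updates, lower_accum_updates, round(epoch_at_first_eval, 2))
```

Under these assumptions the demo settings give 14 updates, which also matches the 0.95 and 1.9 epoch values in the log, while gradient_accumulation_steps=4 would give well over a thousand updates for the same two epochs.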

Prediction

We save the model and summarize a brand-new dialogue:

PYTHON

from transformers import pipeline

trainer.save_model("t5_samsum_summarization")
pipe = pipeline('summarization', model='t5_samsum_summarization', device=device)

custom_dialogue = """
Laxmi Kant: what work you planning to give Tom?
Juli: i was hoping to send him on a business trip first.
Laxmi Kant: cool. is there any suitable work for him?
Juli: he did excellent in last quarter. i will assign new project, once he is back.
"""

output = pipe(custom_dialogue)
output

OUTPUT

[{'summary_text': 'laxmi Kant: i was hoping to send him on a business trip first . i will assign new project once he is back .'}]

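Even after the short demo run, the output picks up the two key points of the chat: the business trip and the new project. It still copies most of its wording from the dialogue and attributes Juli's lines to Laxmi Kant, which is what we would expect from a model that has taken only a handful of optimizer steps.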

Figure: The fine-tuned T5 pipeline condenses a multi-turn dialogue into a single-sentence summary.

Summary

This is how T5 summarization works. We fine-tuned t5-small for abstractive dialogue summarization on SAMSum. The new pieces here, compared to classification, are the encoder-decoder model (AutoModelForSeq2SeqLM), the text_target tokenization that handles labels, and the DataCollatorForSeq2Seq that prepares decoder inputs.

Next, we leave text behind and apply the same fine-tuning recipe to images by fine-tuning a Vision Transformer for image classification.
