Fine-Tuning Vision Transformer (ViT) for Images

The transformer was built for text, but the Vision Transformer (ViT) showed that the same architecture also works well on images when it is pre-trained on enough data. It treats an image as a sequence of patches, much like words in a sentence. In this post, we fine-tune ViT to classify 20 food categories from an Indian food image dataset.

The fine-tuning recipe is the same one we have used throughout this series. Only the preprocessing changes, from a tokenizer to an image processor.

Prerequisites: Familiarity with the transformer architecture and a Python environment with transformers, datasets, evaluate, torch, and torchvision. A GPU is recommended.

How the Vision Transformer Works

ViT was introduced in "An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale" (arXiv:2010.11929).

It applies a standard transformer encoder directly to images with very few changes:

Split into patches. The image is divided into fixed-size patches, say 16×16 pixels. A 224×224 image becomes a grid of 196 patches.
Linearly embed each patch. Each flattened patch is projected to the model's hidden dimension. This is the patch embedding, much like a word embedding.
Add a [class] token and position embeddings. A learnable [class] token is added at the start, and its final state becomes the image representation. Position embeddings keep the spatial order.
Run the transformer encoder. Multi-head self-attention lets every patch attend to every other patch. So we get global context from the very first layer.
Classify with an MLP head. The [class] token's output feeds a classification head.
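
The patch arithmetic in the steps above can be checked with a few lines of plain Python. This is only a sketch: patch_grid is a hypothetical helper written for this post, not part of any library, and it assumes a square image whose side is divisible by the patch size.

```python
# Sequence-length arithmetic for a ViT-style patch embedding (illustrative helper).

def patch_grid(image_size: int, patch_size: int, channels: int = 3) -> dict:
    """Return patch counts and sizes for a square image split into square patches."""
    if image_size % patch_size != 0:
        raise ValueError("image_size must be divisible by patch_size")
    per_side = image_size // patch_size      # patches along one edge
    num_patches = per_side * per_side        # total patches in the grid
    return {
        "per_side": per_side,
        "num_patches": num_patches,
        # number of raw values in one flattened RGB patch
        "flat_patch_dim": patch_size * patch_size * channels,
        # +1 because the [class] token is added to the patch sequence
        "sequence_length": num_patches + 1,
    }

info = patch_grid(image_size=224, patch_size=16)
print(info)
```

For a 224×224 image and 16×16 patches this gives a 14×14 grid, 196 patches, and a sequence of 197 tokens once the [class] token is included. Each flattened RGB patch holds 16 × 16 × 3 = 768 values, which happens to equal the hidden size of ViT-Base.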

Model	Layers	Hidden size	Heads	Params
ViT-Base	12	768	12	86M
ViT-Large	24	1024	16	307M
ViT-Huge	32	1280	16	632M
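
The parameter counts in the table can be roughly reproduced from the layer count and hidden size alone. The sketch below is a back-of-the-envelope estimate, not an exact count: it assumes the MLP block is four times as wide as the hidden size, and it ignores biases, layer norms, the patch and position embeddings, and the classification head. approx_encoder_params is a hypothetical helper name.

```python
# Rough weight count for a stack of transformer encoder blocks (estimate only).

def approx_encoder_params(layers: int, hidden: int, mlp_ratio: int = 4) -> int:
    attention = 4 * hidden * hidden           # query, key, value and output projections
    mlp = 2 * hidden * (mlp_ratio * hidden)   # the two linear layers of the MLP block
    return layers * (attention + mlp)

# (layers, hidden size) taken from the table above
configs = {"ViT-Base": (12, 768), "ViT-Large": (24, 1024), "ViT-Huge": (32, 1280)}
estimates = {name: approx_encoder_params(l, h) for name, (l, h) in configs.items()}

for name, count in estimates.items():
    print(f"{name}: ~{count / 1e6:.0f}M weights in the encoder blocks")
```

The estimates come out at about 85M, 302M, and 629M, each within a few percent of the table. The remainder is made up of the terms this sketch leaves out.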

Note

ViT has less built-in inductive bias than a CNN. It has no locality or translation-equivariance baked in. So it shines when it is pre-trained on large datasets and then fine-tuned. That is exactly what we do here, starting from a model pre-trained on ImageNet-21k.

Diagram of the Vision Transformer: an image split into patches, embedded, and processed by a transformer encoder

ViT splits an image into patches, embeds them, and feeds the sequence to a transformer encoder with a class token.

Loading the Image Dataset

We load the Indian food image dataset directly from the Hub:

PYTHON

from datasets import load_dataset

food = load_dataset("rajistics/indian_food_images")
food['train'][0]['image']

OUTPUT

<PIL.JpegImagePlugin.JpegImageFile image mode=RGB size=480x360>

We build the label-to-ID mappings from the dataset's class names:

PYTHON

labels = food['train'].features['label'].names
label2id, id2label = dict(), dict()

for i, label in enumerate(labels):
    label2id[label] = i
    id2label[i] = label

print(label2id)

OUTPUT

{'burger': 0, 'butter_naan': 1, 'chai': 2, 'chapati': 3, 'chole_bhature': 4, 'dal_makhani': 5, 'dhokla': 6, 'fried_rice': 7, 'idli': 8, 'jalebi': 9, 'kaathi_rolls': 10, 'kadai_paneer': 11, 'kulfi': 12, 'masala_dosa': 13, 'momos': 14, 'paani_puri': 15, 'pakode': 16, 'pav_bhaji': 17, 'pizza': 18, 'samosa': 19}

There are 20 food classes to classify.

Preprocessing Images

We load the matching AutoImageProcessor. It knows the model's expected input size and normalization statistics:

PYTHON

from transformers import AutoImageProcessor

model_ckpt = "google/vit-base-patch16-224-in21k"
image_processor = AutoImageProcessor.from_pretrained(model_ckpt, use_fast=True)

We define a torchvision transform pipeline. It does a random-resized crop, tensor conversion, and normalization. We apply it lazily with with_transform:

PYTHON

from torchvision.transforms import RandomResizedCrop, Compose, Normalize, ToTensor

normalize = Normalize(mean=image_processor.image_mean, std=image_processor.image_std)

size = (
    image_processor.size['shortest_edge']
    if "shortest_edge" in image_processor.size
    else (image_processor.size['height'], image_processor.size['width'])
)

_transforms = Compose([RandomResizedCrop(size), ToTensor(), normalize])

def transforms(examples):
    examples['pixel_values'] = [_transforms(img.convert('RGB')) for img in examples['image']]
    del examples['image']
    return examples

food = food.with_transform(transforms)

Tip

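RandomResizedCrop is a form of data augmentation. It randomly crops and resizes each image during training, which helps the model generalize. with_transform applies the transform on the fly when examples are accessed, so we never store a preprocessed copy of the whole dataset. Note that the code above applies the same random crop to the test split as well; a deterministic resize and center crop is the more usual choice for evaluation.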
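
To make the normalization step concrete, here is a minimal sketch of what Normalize does to a single pixel. The mean and std values of 0.5 are an assumption for illustration; read the real statistics from image_processor.image_mean and image_processor.image_std. normalize_pixel is a hypothetical helper, not a torchvision function.

```python
# Per-channel normalization of one RGB pixel, as Normalize applies it to a tensor.

def normalize_pixel(rgb, mean, std):
    """Apply (value - mean) / std per channel to a pixel already scaled to [0, 1]."""
    return [(v - m) / s for v, m, s in zip(rgb, mean, std)]

# Assumed statistics; check image_processor.image_mean / image_std for the real ones.
mean = [0.5, 0.5, 0.5]
std = [0.5, 0.5, 0.5]

black = normalize_pixel([0.0, 0.0, 0.0], mean, std)
white = normalize_pixel([1.0, 1.0, 1.0], mean, std)
print(black, white)
```

With these assumed statistics, pixel values in [0, 1] are mapped to [-1, 1], so black becomes -1.0 and white becomes 1.0 in every channel.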

We define the accuracy metric:

PYTHON

import evaluate
import numpy as np

accuracy = evaluate.load('accuracy')

def compute_metrics(eval_pred):
    predictions, labels = eval_pred
    predictions = np.argmax(predictions, axis=1)
    return accuracy.compute(predictions=predictions, references=labels)
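
The logic inside compute_metrics is small enough to trace by hand. The sketch below redoes the same argmax-then-compare computation in plain Python on made-up logits for three classes. The helper names and the toy numbers are illustrative only.

```python
# Plain-Python version of "argmax over logits, then accuracy" on toy data.

def argmax(row):
    """Index of the largest value in a list, like np.argmax on one row."""
    return max(range(len(row)), key=row.__getitem__)

def accuracy_from_logits(logits, references):
    predictions = [argmax(row) for row in logits]
    correct = sum(p == r for p, r in zip(predictions, references))
    return {"accuracy": correct / len(references)}

# Four hypothetical examples, three classes each
toy_logits = [
    [2.0, 0.1, -1.0],   # predicts class 0
    [0.3, 1.5, 0.2],    # predicts class 1
    [0.0, 0.2, 0.1],    # predicts class 1 (the true label is 2)
    [1.0, 0.9, 3.0],    # predicts class 2
]
toy_labels = [0, 1, 2, 2]

result = accuracy_from_logits(toy_logits, toy_labels)
print(result)
```

Three of the four predictions match their labels, so the result is an accuracy of 0.75, returned in the same dictionary shape that the Trainer expects from compute_metrics.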

Fine-Tuning ViT

We load the model with an image-classification head sized to 20 classes:

PYTHON

from transformers import AutoModelForImageClassification, TrainingArguments, Trainer
import torch

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

model = AutoModelForImageClassification.from_pretrained(
    model_ckpt,
    num_labels=len(labels),
    id2label=id2label,
    label2id=label2id
).to(device)

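Now we configure training. load_best_model_at_end reloads the checkpoint with the best accuracy when training finishes. gradient_accumulation_steps=4 accumulates gradients over four batches of 16 before each optimizer step, which simulates a batch of 64 per device: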

PYTHON

args = TrainingArguments(
    output_dir="train_dir",
    remove_unused_columns=False,
    eval_strategy="epoch",
    save_strategy="epoch",
    learning_rate=5e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    gradient_accumulation_steps=4,
    num_train_epochs=4,
    load_best_model_at_end=True,
    metric_for_best_model='accuracy'
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=food['train'],
    eval_dataset=food['test'],
    tokenizer=image_processor,  # newer transformers releases name this argument processing_class
    compute_metrics=compute_metrics
)

trainer.train()

OUTPUT

{'eval_loss': 1.6276, 'eval_accuracy': 0.8225, 'epoch': 1.0}
{'eval_loss': 1.1054, 'eval_accuracy': 0.8810, 'epoch': 1.99}
{'eval_loss': 0.9265, 'eval_accuracy': 0.8905, 'epoch': 2.99}
{'eval_loss': 0.8886, 'eval_accuracy': 0.8895, 'epoch': 3.99}
{'train_runtime': 928.8633, 'train_loss': 1.3234, 'epoch': 3.99}
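
The effect of gradient_accumulation_steps on batch size and step count can be worked out directly. The sketch below uses the batch settings from the configuration above, but the dataset size of 5,000 is a made-up number for illustration, and the helper is hypothetical. How a trailing partial group of batches is handled depends on the Trainer version, so treat the update count as approximate.

```python
import math

# Effective batch size and approximate optimizer updates per epoch.

def accumulation_summary(num_examples, per_device_batch, accumulation_steps, num_devices=1):
    effective_batch = per_device_batch * accumulation_steps * num_devices
    # forward/backward passes per epoch on each device
    micro_batches = math.ceil(num_examples / (per_device_batch * num_devices))
    # upper bound on optimizer updates per epoch
    updates = math.ceil(micro_batches / accumulation_steps)
    return {
        "effective_batch": effective_batch,
        "micro_batches": micro_batches,
        "updates": updates,
    }

# 5,000 examples is an assumed dataset size, not the size of the food dataset.
summary = accumulation_summary(num_examples=5000, per_device_batch=16, accumulation_steps=4)
print(summary)
```

With a per-device batch of 16 and four accumulation steps, each optimizer update sees 64 examples, while memory use stays at the level of a batch of 16.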

Important

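remove_unused_columns=False is required here. By default, the Trainer drops every dataset column that the model's forward method does not accept. That would remove the raw image column before the on-the-fly transform runs, so the transform could never create pixel_values, and training would fail.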
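
Why the column pruning matters can be shown with a simplified imitation of it. This is not the Trainer's actual code, only a sketch of the idea: keep the columns whose names appear in the model's forward signature, plus the usual label names. toy_forward and columns_kept are hypothetical stand-ins.

```python
import inspect

def toy_forward(pixel_values=None, labels=None):
    """Stand-in for the forward signature of an image-classification model."""
    return None

def columns_kept(dataset_columns, forward_fn, label_names=("label", "label_ids")):
    """Keep only columns the forward function accepts, plus common label names."""
    accepted = set(inspect.signature(forward_fn).parameters) | set(label_names)
    return [c for c in dataset_columns if c in accepted]

# Before the transform runs, the dataset only has these raw columns.
raw_columns = ["image", "label"]
kept = columns_kept(raw_columns, toy_forward)
print(kept)
```

Only the label column survives in this sketch. The image column is gone, so a lazy transform that reads examples['image'] would have nothing to work with. Setting remove_unused_columns=False skips the pruning.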

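Accuracy climbs from about 82% after the first epoch to a peak of 89.05% after the third, then dips slightly to 88.95% after the fourth. Because load_best_model_at_end is set, the Trainer reloads the best checkpoint at the end. Now we save the model: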

PYTHON

trainer.save_model('food_classification')

Diagram of the ViT fine-tuning workflow: image dataset, image processor, Trainer, evaluation, saved model

The ViT fine-tuning workflow mirrors text fine-tuning, swapping the tokenizer for an image processor.

Inference

We load the fine-tuned model as an image-classification pipeline. Then we predict on a new image from the web:

PYTHON

from transformers import pipeline
import requests
from PIL import Image
from io import BytesIO

pipe = pipeline("image-classification", model='food_classification', device=device)

url = 'https://www.indianhealthyrecipes.com/wp-content/uploads/2015/10/pizza-recipe-1.jpg'
response = requests.get(url, timeout=30)
response.raise_for_status()  # stop early if the download failed
image = Image.open(BytesIO(response.content)).convert("RGB")

pipe(image)

OUTPUT

[{'label': 'pizza', 'score': 0.5428637862205505}, {'label': 'kadai_paneer', 'score': 0.03565927594900131}, {'label': 'pav_bhaji', 'score': 0.03565794229507446}, {'label': 'butter_naan', 'score': 0.028422074392437935}, {'label': 'burger', 'score': 0.027608927339315414}]

The model classifies the image as pizza with a score of about 0.54. That is a moderate score, but it is far ahead of the runner-up, kadai_paneer, at about 0.04.
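
The scores in the output are probabilities over the classes, and by default the pipeline returns the five highest. A minimal sketch of that post-processing, on made-up logits for three classes, might look like this. The helper names, the toy label map, and the logits are all illustrative, not taken from the model.

```python
import math

def softmax(logits):
    """Convert raw logits into probabilities that sum to 1."""
    peak = max(logits)                       # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def top_k(logits, id2label, k=5):
    """Return the k most probable labels with their scores, highest first."""
    probs = softmax(logits)
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    return [{"label": id2label[i], "score": probs[i]} for i in order]

# Hypothetical three-class example
toy_id2label = {0: "burger", 1: "pizza", 2: "samosa"}
toy_logits = [0.5, 2.5, -1.0]

ranked = top_k(toy_logits, toy_id2label, k=2)
print(ranked)
```

Because the scores are spread over all classes, a top score well below 1.0 can still be a clear win as long as the gap to the second label is large, as it is in the pizza prediction above.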

Diagram of the inference pipeline: a food photo classified into one of twenty Indian food categories

The fine-tuned pipeline classifies a new food photo into one of the twenty categories.

Summary

We fine-tuned a Vision Transformer to classify 20 food categories at about 89% accuracy. The big idea is simple: ViT treats image patches like word tokens, so the same Trainer workflow applies. The only changes are swapping the tokenizer for an AutoImageProcessor and setting remove_unused_columns=False.

So far every model has been encoder-based. Next, the series moves to true generative LLMs and parameter-efficient fine-tuning. We will fine-tune Phi-2 on custom data with LoRA and QLoRA.
