Fine-Tuning Vision Transformer (ViT) for Images

Fine-tune a Vision Transformer (ViT) to classify Indian food images with Hugging Face, using an image processor, the Trainer API, and patch-based attention.

Jun 18, 2026 · 10 min read

Topics You Will Master

How the Vision Transformer turns images into patch sequences
Loading an image dataset and building label mappings
Preprocessing images with AutoImageProcessor and torchvision transforms
Fine-tuning AutoModelForImageClassification with the Trainer

The transformer was built for text, but the Vision Transformer (ViT) showed that the same architecture also works well on images when an image is treated as a sequence of patches, much like words in a sentence. In this tutorial you fine-tune ViT to classify 20 kinds of Indian food.

The fine-tuning recipe is the same one you have used throughout this series; only the preprocessing changes from a tokenizer to an image processor.

Prerequisites: Familiarity with the transformer architecture and a Python environment with transformers, datasets, evaluate, torch, and torchvision. A GPU is recommended.

How the Vision Transformer Works

ViT, introduced in An Image is Worth 16x16 Words (arXiv:2010.11929), applies a standard transformer encoder directly to images with minimal changes:

  1. Split into patches. The image is divided into fixed-size patches (e.g. 16×16 pixels). A 224×224 image becomes a 14×14 grid of 196 patches.
  2. Linearly embed each patch. Each flattened patch is projected to the model's hidden dimension — the patch embedding, analogous to a word embedding.
  3. Add a [class] token and position embeddings. A learnable [class] token is prepended (its final state becomes the image representation), and position embeddings preserve spatial order.
  4. Run the transformer encoder. Multi-head self-attention lets every patch attend to every other patch — global context from the very first layer.
  5. Classify with an MLP head. The [class] token's output feeds a classification head.

Model      Layers  Hidden size  Heads  Params
ViT-Base   12      768          12     86M
ViT-Large  24      1024         16     307M
ViT-Huge   32      1280         16     632M
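
The patch arithmetic in the steps above, and the ViT-Base row of the table, can be checked with a few lines of plain Python. This is a rough sketch rather than the library's own accounting: the function names are made up for illustration, the MLP width is assumed to be 4× the hidden size, and the classification head is left out of the count.

```python
def patch_sequence(image_size=224, patch_size=16, channels=3):
    """Return (num_patches, sequence_length, flattened_patch_dim) for a square image."""
    if image_size % patch_size != 0:
        raise ValueError("image_size must be divisible by patch_size")
    per_side = image_size // patch_size              # patches along one edge
    num_patches = per_side * per_side                # full grid of patches
    seq_len = num_patches + 1                        # plus the [class] token
    patch_dim = patch_size * patch_size * channels   # values in one flattened patch
    return num_patches, seq_len, patch_dim


def approx_encoder_params(layers, hidden, image_size=224, patch_size=16, channels=3):
    """Rough parameter count without the classification head (assumes MLP width = 4 * hidden)."""
    _, seq_len, patch_dim = patch_sequence(image_size, patch_size, channels)
    attention = 4 * hidden * hidden + 4 * hidden     # Q, K, V and output projections with biases
    mlp = 8 * hidden * hidden + 5 * hidden           # two linear layers with a 4x expansion
    layer_norms = 4 * hidden                         # two LayerNorms per encoder block
    per_layer = attention + mlp + layer_norms
    embeddings = patch_dim * hidden + hidden         # linear patch projection
    embeddings += hidden + seq_len * hidden          # [class] token and position embeddings
    final_norm = 2 * hidden                          # LayerNorm after the last block
    return layers * per_layer + embeddings + final_norm


print(patch_sequence())                 # (196, 197, 768)
print(approx_encoder_params(12, 768))   # 85798656
```

The estimate for 12 layers and a hidden size of 768 lands at roughly 86M, in line with the ViT-Base row. Note that a flattened 16×16 RGB patch happens to hold 768 values, the same as ViT-Base's hidden size.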

Note

ViT has less built-in "inductive bias" than a CNN (no locality or translation-equivariance baked in), so it shines when pre-trained on large datasets and then fine-tuned — which is exactly what you do here, starting from a model pre-trained on ImageNet-21k.

Figure: ViT splits an image into patches, embeds them, and feeds the sequence to a transformer encoder with a class token.


Loading the Image Dataset

Load the Indian food image dataset directly from the Hub:

PYTHON
from datasets import load_dataset

food = load_dataset("rajistics/indian_food_images")
food['train'][0]['image']
OUTPUT
<PIL.JpegImagePlugin.JpegImageFile image mode=RGB size=480x360>

Build the label-to-ID mappings from the dataset's class names:

PYTHON
labels = food['train'].features['label'].names
label2id, id2label = dict(), dict()

for i, label in enumerate(labels):
    label2id[label] = i
    id2label[i] = label

print(label2id)
OUTPUT
{'burger': 0, 'butter_naan': 1, 'chai': 2, 'chapati': 3, 'chole_bhature': 4, 'dal_makhani': 5, 'dhokla': 6, 'fried_rice': 7, 'idli': 8, 'jalebi': 9, 'kaathi_rolls': 10, 'kadai_paneer': 11, 'kulfi': 12, 'masala_dosa': 13, 'momos': 14, 'paani_puri': 15, 'pakode': 16, 'pav_bhaji': 17, 'pizza': 18, 'samosa': 19}

There are 20 food classes to classify.
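
The same two mappings can also be built with dict comprehensions, and it is worth checking that they are exact inverses, since the model config relies on both. The sketch below uses a made-up three-class list in place of the real class names so that it runs without downloading the dataset.

```python
# Hypothetical three-class subset standing in for food['train'].features['label'].names
labels = ['burger', 'butter_naan', 'chai']

label2id = {label: i for i, label in enumerate(labels)}
id2label = {i: label for label, i in label2id.items()}

# Every label should survive a round trip through both mappings.
assert all(id2label[label2id[label]] == label for label in labels)

print(label2id)     # {'burger': 0, 'butter_naan': 1, 'chai': 2}
print(id2label[2])  # chai
```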


Preprocessing Images

Load the matching AutoImageProcessor, which knows the model's expected input size and normalization statistics:

PYTHON
from transformers import AutoImageProcessor

model_ckpt = "google/vit-base-patch16-224-in21k"
image_processor = AutoImageProcessor.from_pretrained(model_ckpt, use_fast=True)

Define a torchvision transform pipeline — random-resized crop, tensor conversion, and normalization — and apply it lazily with with_transform:

PYTHON
from torchvision.transforms import RandomResizedCrop, Compose, Normalize, ToTensor

normalize = Normalize(mean=image_processor.image_mean, std=image_processor.image_std)

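# Use the shortest-edge size if the processor defines one; otherwise fall back to (height, width).
size = (
    image_processor.size['shortest_edge']
    if "shortest_edge" in image_processor.size
    else (image_processor.size['height'], image_processor.size['width'])
)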

_transforms = Compose([RandomResizedCrop(size), ToTensor(), normalize])

def transforms(examples):
    examples['pixel_values'] = [_transforms(img.convert('RGB')) for img in examples['image']]
    del examples['image']
    return examples

food = food.with_transform(transforms)

Tip

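RandomResizedCrop is a form of data augmentation: it randomly crops and resizes each image, which helps the model generalize. with_transform applies the transform on the fly each time an example is accessed, so you do not store a second, preprocessed copy of the dataset. Because with_transform is called on the whole DatasetDict here, the same random crop is also applied to the test split, so evaluation accuracy can vary slightly from run to run.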
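
To make the normalization step concrete, here is a minimal sketch of what ToTensor followed by Normalize does to a single channel value. It assumes a mean and standard deviation of 0.5, which is what this checkpoint's processor is expected to report; check image_processor.image_mean and image_processor.image_std to confirm. The helper name normalize_pixel is made up for illustration.

```python
def normalize_pixel(value, mean=0.5, std=0.5):
    """Apply (value - mean) / std to one channel value already scaled to [0, 1]."""
    return (value - mean) / std


# ToTensor scales 8-bit pixel values from [0, 255] to [0.0, 1.0] before Normalize runs.
for raw in (0, 128, 255):
    scaled = raw / 255
    print(raw, round(normalize_pixel(scaled), 3))
# 0 -1.0
# 128 0.004
# 255 1.0
```

With these statistics, pixel values end up in the range [-1, 1], centered near zero.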

Define the accuracy metric:

PYTHON
import evaluate
import numpy as np

accuracy = evaluate.load('accuracy')

def compute_metrics(eval_pred):
    predictions, labels = eval_pred
    predictions = np.argmax(predictions, axis=1)
    return accuracy.compute(predictions=predictions, references=labels)
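
To see what compute_metrics calculates, the following dependency-free sketch reproduces the same two steps, argmax over each row of logits and then the fraction of matches, on a made-up batch of four examples. The helper names and the numbers are illustrative only.

```python
def argmax(row):
    """Index of the largest value in a list of scores."""
    return max(range(len(row)), key=row.__getitem__)


def accuracy_from_logits(logits, labels):
    """Pick the highest-scoring class per row and compare with the references."""
    predictions = [argmax(row) for row in logits]
    correct = sum(int(p == y) for p, y in zip(predictions, labels))
    return {"accuracy": correct / len(labels)}


# Four examples, three classes; the third example is misclassified.
logits = [[2.0, 0.1, -1.0], [0.3, 1.5, 0.2], [0.9, 0.8, 0.1], [-0.5, 0.0, 2.2]]
labels = [0, 1, 1, 2]
print(accuracy_from_logits(logits, labels))  # {'accuracy': 0.75}
```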

Fine-Tuning ViT

Load the model with an image-classification head sized to 20 classes:

PYTHON
from transformers import AutoModelForImageClassification, TrainingArguments, Trainer
import torch

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

model = AutoModelForImageClassification.from_pretrained(
    model_ckpt,
    num_labels=len(labels),
    id2label=id2label,
    label2id=label2id
).to(device)

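Configure training. load_best_model_at_end reloads the checkpoint with the highest evaluation accuracy when training finishes, and gradient_accumulation_steps=4 accumulates gradients over four batches of 16, for an effective batch size of 64 per device: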

PYTHON
args = TrainingArguments(
    output_dir="train_dir",
    remove_unused_columns=False,
    eval_strategy="epoch",
    save_strategy="epoch",
    learning_rate=5e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    gradient_accumulation_steps=4,
    num_train_epochs=4,
    load_best_model_at_end=True,
    metric_for_best_model='accuracy'
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=food['train'],
    eval_dataset=food['test'],
    tokenizer=image_processor,  # newer transformers releases deprecate this name in favor of processing_class
    compute_metrics=compute_metrics
)

trainer.train()
OUTPUT
{'eval_loss': 1.6276, 'eval_accuracy': 0.8225, 'epoch': 1.0}
{'eval_loss': 1.1054, 'eval_accuracy': 0.8810, 'epoch': 1.99}
{'eval_loss': 0.9265, 'eval_accuracy': 0.8905, 'epoch': 2.99}
{'eval_loss': 0.8886, 'eval_accuracy': 0.8895, 'epoch': 3.99}
{'train_runtime': 928.8633, 'train_loss': 1.3234, 'epoch': 3.99}

Important

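remove_unused_columns=False is required here. By default the Trainer drops dataset columns that the model's forward method does not accept, which would remove the image column before the on-the-fly transform runs. Without the image column, the transform cannot create pixel_values, and training would fail.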
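
The batch settings in the training arguments combine as follows. This is a small sketch of the arithmetic only: batch_plan is a hypothetical helper, and the dataset size of 5000 is a made-up number, not the size of the food dataset.

```python
import math


def batch_plan(num_examples, per_device_batch=16, accumulation_steps=4, num_devices=1):
    """Return (effective batch size, micro-batches per epoch)."""
    # Gradients from several micro-batches are summed before one optimizer update.
    effective_batch = per_device_batch * accumulation_steps * num_devices
    # Forward/backward passes needed to see every example once.
    micro_batches = math.ceil(num_examples / (per_device_batch * num_devices))
    return effective_batch, micro_batches


effective, micro = batch_plan(num_examples=5000)  # 5000 is a made-up dataset size
print(effective)  # 64
print(micro)      # 313
```

When the number of micro-batches is not a multiple of the accumulation steps, how the leftover batches are handled depends on the transformers version, which may explain fractional epoch values such as 1.99 in the log above.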

Accuracy climbs from 82% after the first epoch to about 89% by the third and then levels off; load_best_model_at_end restores the best checkpoint (0.8905). Save the model:

PYTHON
trainer.save_model('food_classification')

Figure: The ViT fine-tuning workflow mirrors text fine-tuning, swapping the tokenizer for an image processor.


Inference

Load the fine-tuned model as an image-classification pipeline and predict on a new image from the web:

PYTHON
from transformers import pipeline
import requests
from PIL import Image
from io import BytesIO

pipe = pipeline("image-classification", model='food_classification', device=device)

url = 'https://www.indianhealthyrecipes.com/wp-content/uploads/2015/10/pizza-recipe-1.jpg'
response = requests.get(url, timeout=30)
response.raise_for_status()  # stop early if the download failed
image = Image.open(BytesIO(response.content))

pipe(image)
OUTPUT
[{'label': 'pizza', 'score': 0.5428637862205505}, {'label': 'kadai_paneer', 'score': 0.03565927594900131}, {'label': 'pav_bhaji', 'score': 0.03565794229507446}, {'label': 'butter_naan', 'score': 0.028422074392437935}, {'label': 'burger', 'score': 0.027608927339315414}]

The model ranks pizza first with a score of about 0.54, well ahead of the runner-up at about 0.04.
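
The scores in the output are softmax probabilities over the model's logits, and the pipeline returns the five highest by default. A minimal sketch of that post-processing, using made-up logits and a three-class id2label mapping rather than the real model output, might look like this:

```python
import math


def softmax(logits):
    """Convert raw logits to probabilities that sum to 1."""
    shift = max(logits)                       # subtract the max for numerical stability
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def top_k(logits, id2label, k=5):
    """Return the k most probable labels, highest score first."""
    probs = softmax(logits)
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    return [{"label": id2label[i], "score": probs[i]} for i in ranked]


# Hypothetical logits for a three-class model.
id2label = {0: 'burger', 1: 'chai', 2: 'pizza'}
logits = [0.2, -1.0, 2.5]

results = top_k(logits, id2label, k=2)
for row in results:
    print(row)
```

Because the probabilities are spread over all 20 classes, a top score near 0.5 can still be a clear win when every other class sits in the low single digits.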

Figure: The fine-tuned pipeline classifies a new food photo into one of the twenty categories.


Summary

You fine-tuned a Vision Transformer to classify 20 food categories at nearly 89% accuracy. The big idea: ViT treats image patches like word tokens, so the exact same Trainer workflow applies — the only change is swapping the tokenizer for an AutoImageProcessor and using remove_unused_columns=False.

So far every model has been encoder-based. Next, the series moves to true generative LLMs and parameter-efficient fine-tuning — fine-tuning Phi-2 on custom data with LoRA and QLoRA.
