The transformer was built for text, but the Vision Transformer (ViT) showed it works just as well on images — by treating an image as a sequence of patches, exactly like words in a sentence. In this tutorial you fine-tune ViT to classify 20 kinds of Indian food.
The fine-tuning recipe is the same one you have used throughout this series; only the preprocessing changes from a tokenizer to an image processor.
Prerequisites: Familiarity with the transformer architecture and a Python environment with transformers, datasets, evaluate, torch, and torchvision. A GPU is recommended.
How the Vision Transformer Works
ViT, introduced in An Image is Worth 16x16 Words (arXiv:2010.11929), applies a standard transformer encoder directly to images with minimal changes:
- Split into patches. The image is divided into fixed-size patches (e.g. 16×16 pixels). A 224×224 image becomes a grid of 196 patches.
- Linearly embed each patch. Each flattened patch is projected to the model's hidden dimension — the patch embedding, analogous to a word embedding.
- Add a [class] token and position embeddings. A learnable [class] token is prepended (its final state becomes the image representation), and position embeddings preserve spatial order.
- Run the transformer encoder. Multi-head self-attention lets every patch attend to every other patch, giving global context from the very first layer.
- Classify with an MLP head. The [class] token's output feeds a classification head.
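The patch-splitting step above can be sketched in a few lines of NumPy. This is only an illustration of the arithmetic, not the model's actual embedding code: the helper name `patchify` and the all-zeros placeholder image are assumptions made for the example.

```python
import numpy as np

def patchify(image: np.ndarray, patch_size: int = 16) -> np.ndarray:
    """Split an (H, W, C) image into flattened, non-overlapping patches."""
    h, w, c = image.shape
    assert h % patch_size == 0 and w % patch_size == 0, "image must tile evenly"
    rows, cols = h // patch_size, w // patch_size
    # (rows, patch, cols, patch, C) -> (rows, cols, patch, patch, C)
    grid = image.reshape(rows, patch_size, cols, patch_size, c).transpose(0, 2, 1, 3, 4)
    # One row per patch, each flattened to patch_size * patch_size * C values
    return grid.reshape(rows * cols, patch_size * patch_size * c)

image = np.zeros((224, 224, 3), dtype=np.float32)  # placeholder image
patches = patchify(image)
seq_len = patches.shape[0] + 1  # +1 for the prepended [class] token
print(patches.shape, seq_len)
```

A 224×224 RGB image yields 14 × 14 = 196 patches of 16 × 16 × 3 = 768 values each, so the encoder sees a sequence of 197 tokens once the [class] token is added. In the real model, a learned linear projection then maps each flattened patch to the hidden dimension.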
| Model | Layers | Hidden size | Heads | Params |
|---|---|---|---|---|
| ViT-Base | 12 | 768 | 12 | 86M |
| ViT-Large | 24 | 1024 | 16 | 307M |
| ViT-Huge | 32 | 1280 | 16 | 632M |
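As a rough sanity check on the parameter counts in the table, the backbone size can be estimated from the layer count and hidden size alone. The sketch below is a back-of-the-envelope estimate, not an official formula: it assumes the MLP width is four times the hidden size, a 224×224 input, and no classification head, and the function name `vit_encoder_params` is made up for this example. The original paper's ViT-Huge uses 14×14 patches, so that value is passed explicitly.

```python
def vit_encoder_params(layers, hidden, patch_size=16, image_size=224, channels=3):
    """Rough parameter count for a ViT backbone (no classification head)."""
    mlp = 4 * hidden  # assumption: MLP width is 4x the hidden size
    num_patches = (image_size // patch_size) ** 2
    patch_embed = patch_size * patch_size * channels * hidden + hidden
    cls_and_pos = hidden + (num_patches + 1) * hidden  # [class] token + positions
    attention = 4 * (hidden * hidden + hidden)         # Q, K, V, output projections
    feed_forward = hidden * mlp + mlp + mlp * hidden + hidden
    layer_norms = 2 * 2 * hidden                       # two LayerNorms per block
    final_norm = 2 * hidden
    per_layer = attention + feed_forward + layer_norms
    return patch_embed + cls_and_pos + layers * per_layer + final_norm

base = vit_encoder_params(layers=12, hidden=768)
large = vit_encoder_params(layers=24, hidden=1024)
huge = vit_encoder_params(layers=32, hidden=1280, patch_size=14)

for name, count in [("ViT-Base", base), ("ViT-Large", large), ("ViT-Huge", huge)]:
    print(f"{name}: about {count / 1e6:.0f}M parameters")
```

The estimates land within a couple of percent of the table's figures. The number of heads does not appear in the formula because splitting the hidden dimension across heads does not change the size of the projection matrices.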
Note
ViT has less built-in "inductive bias" than a CNN (no locality or translation-equivariance baked in), so it shines when pre-trained on large datasets and then fine-tuned — which is exactly what you do here, starting from a model pre-trained on ImageNet-21k.

ViT splits an image into patches, embeds them, and feeds the sequence to a transformer encoder with a class token.
Loading the Image Dataset
Load the Indian food image dataset directly from the Hub:
from datasets import load_dataset
food = load_dataset("rajistics/indian_food_images")
food['train'][0]['image']
<PIL.JpegImagePlugin.JpegImageFile image mode=RGB size=480x360>
Build the label-to-ID mappings from the dataset's class names:
labels = food['train'].features['label'].names
label2id, id2label = dict(), dict()
for i, label in enumerate(labels):
    label2id[label] = i
    id2label[i] = label
print(label2id)
{'burger': 0, 'butter_naan': 1, 'chai': 2, 'chapati': 3, 'chole_bhature': 4, 'dal_makhani': 5, 'dhokla': 6, 'fried_rice': 7, 'idli': 8, 'jalebi': 9, 'kaathi_rolls': 10, 'kadai_paneer': 11, 'kulfi': 12, 'masala_dosa': 13, 'momos': 14, 'paani_puri': 15, 'pakode': 16, 'pav_bhaji': 17, 'pizza': 18, 'samosa': 19}
There are 20 food classes to classify.
Preprocessing Images
Load the matching AutoImageProcessor, which knows the model's expected input size and normalization statistics:
from transformers import AutoImageProcessor
model_ckpt = "google/vit-base-patch16-224-in21k"
image_processor = AutoImageProcessor.from_pretrained(model_ckpt, use_fast=True)
Define a torchvision transform pipeline — random-resized crop, tensor conversion, and normalization — and apply it lazily with with_transform:
from torchvision.transforms import RandomResizedCrop, Compose, Normalize, ToTensor
normalize = Normalize(mean=image_processor.image_mean, std=image_processor.image_std)
# Use the processor's target size: a single shortest edge, or (height, width)
size = (
    image_processor.size['shortest_edge']
    if "shortest_edge" in image_processor.size
    else (image_processor.size['height'], image_processor.size['width'])
)
_transforms = Compose([RandomResizedCrop(size), ToTensor(), normalize])
def transforms(examples):
    examples['pixel_values'] = [_transforms(img.convert('RGB')) for img in examples['image']]
    del examples['image']
    return examples
food = food.with_transform(transforms)
Tip
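RandomResizedCrop is a form of data augmentation: it randomly crops and resizes each image every time the image is loaded, which helps the model generalize. with_transform applies the transform on the fly, so you do not store a preprocessed copy of the whole dataset. Note that calling with_transform on the whole DatasetDict, as here, applies the same random crop to the test split as well.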
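To see what the tensor conversion and normalization steps do to pixel values, here is a minimal NumPy imitation of those two stages. It is a sketch rather than torchvision's implementation: the helper name `to_tensor_and_normalize` is hypothetical, and the mean and standard deviation of 0.5 per channel are assumed values for illustration (in the tutorial they come from `image_processor.image_mean` and `image_processor.image_std`).

```python
import numpy as np

def to_tensor_and_normalize(img_uint8, mean, std):
    """Imitate ToTensor followed by Normalize on an (H, W, C) uint8 image."""
    x = img_uint8.astype(np.float32) / 255.0  # scale pixel values to [0, 1]
    x = x.transpose(2, 0, 1)                  # HWC -> CHW, the layout PyTorch expects
    mean = np.asarray(mean, dtype=np.float32).reshape(-1, 1, 1)
    std = np.asarray(std, dtype=np.float32).reshape(-1, 1, 1)
    return (x - mean) / std                   # per-channel standardization

# Assumed statistics for illustration only
mean, std = [0.5, 0.5, 0.5], [0.5, 0.5, 0.5]

white = np.full((224, 224, 3), 255, dtype=np.uint8)
black = np.zeros((224, 224, 3), dtype=np.uint8)
white_out = to_tensor_and_normalize(white, mean, std)
black_out = to_tensor_and_normalize(black, mean, std)
print(white_out.shape, white_out.max(), black_out.min())
```

With these assumed statistics, pixel values in [0, 255] end up in [-1, 1], and the array has the channels-first shape (3, 224, 224) that the model consumes as pixel_values.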
Define the accuracy metric:
import evaluate
import numpy as np
accuracy = evaluate.load('accuracy')
def compute_metrics(eval_pred):
    predictions, labels = eval_pred
    predictions = np.argmax(predictions, axis=1)
    return accuracy.compute(predictions=predictions, references=labels)
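The metric function above receives raw logits, so the argmax step is what turns them into class predictions. The following sketch reproduces that logic with plain NumPy on made-up logits and labels for a three-class toy problem. The function name `toy_compute_metrics` and all values are invented for the example, and the accuracy is computed by hand instead of through evaluate.

```python
import numpy as np

def toy_compute_metrics(eval_pred):
    """Same logic as compute_metrics, with accuracy computed by hand."""
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=1)  # most likely class per example
    return {"accuracy": float((predictions == labels).mean())}

# Made-up logits for 4 examples and 3 classes
logits = np.array([
    [2.0, 0.1, 0.3],  # predicted class 0
    [0.2, 1.5, 0.1],  # predicted class 1
    [0.3, 0.2, 0.9],  # predicted class 2
    [1.2, 0.4, 0.8],  # predicted class 0 (true class is 2)
])
labels = np.array([0, 1, 2, 2])

result = toy_compute_metrics((logits, labels))
print(result)
```

Three of the four predictions match the labels, so the accuracy is 0.75.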
Fine-Tuning ViT
Load the model with an image-classification head sized to 20 classes:
from transformers import AutoModelForImageClassification, TrainingArguments, Trainer
import torch
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = AutoModelForImageClassification.from_pretrained(
    model_ckpt,
    num_labels=len(labels),
    id2label=id2label,
    label2id=label2id
).to(device)
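Configure training. load_best_model_at_end reloads the checkpoint with the best evaluation accuracy when training finishes, and gradient_accumulation_steps=4 accumulates gradients over four batches of 16, simulating a batch of 64 on a single device: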
args = TrainingArguments(
    output_dir="train_dir",
    remove_unused_columns=False,
    eval_strategy="epoch",
    save_strategy="epoch",
    learning_rate=5e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    gradient_accumulation_steps=4,
    num_train_epochs=4,
    load_best_model_at_end=True,
    metric_for_best_model='accuracy'
)
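To make the gradient accumulation setting concrete, the arithmetic can be worked through as follows. This is a simplified sketch: the helper names are hypothetical, the dataset size of 5,000 is a made-up number for illustration (not the size of the food dataset), and the exact rounding the Trainer uses for steps per epoch may differ between library versions.

```python
import math

def effective_batch_size(per_device_batch, accumulation_steps, num_devices=1):
    """Examples that contribute to each optimizer update."""
    return per_device_batch * accumulation_steps * num_devices

def approx_updates_per_epoch(num_examples, per_device_batch, accumulation_steps, num_devices=1):
    """Rough number of optimizer updates in one pass over the data."""
    batch = effective_batch_size(per_device_batch, accumulation_steps, num_devices)
    return math.ceil(num_examples / batch)

batch = effective_batch_size(per_device_batch=16, accumulation_steps=4)
updates = approx_updates_per_epoch(num_examples=5000,  # hypothetical dataset size
                                   per_device_batch=16, accumulation_steps=4)
print(batch, updates)
```

With the values from the training arguments, each optimizer update sees 16 × 4 = 64 examples on a single device, while peak memory stays at the level of a batch of 16.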
trainer = Trainer(
    model=model,
    args=args,
    train_dataset=food['train'],
    eval_dataset=food['test'],
    tokenizer=image_processor,
    compute_metrics=compute_metrics
)
trainer.train()
{'eval_loss': 1.6276, 'eval_accuracy': 0.8225, 'epoch': 1.0}
{'eval_loss': 1.1054, 'eval_accuracy': 0.8810, 'epoch': 1.99}
{'eval_loss': 0.9265, 'eval_accuracy': 0.8905, 'epoch': 2.99}
{'eval_loss': 0.8886, 'eval_accuracy': 0.8895, 'epoch': 3.99}
{'train_runtime': 928.8633, 'train_loss': 1.3234, 'epoch': 3.99}
Important
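remove_unused_columns=False is required here. Without it, the Trainer would drop the image column, because image is not an argument of the model's forward method. The on-the-fly transform would then have no images from which to build pixel_values, and training would fail.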
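The column-dropping behavior can be imitated with the standard library. The sketch below is a simplified stand-in, not the Trainer's actual code: the `forward` function is a hypothetical placeholder for the model's forward method, and `kept_columns` only approximates the filtering logic by keeping columns that match a forward argument or a common label name.

```python
import inspect

def forward(pixel_values=None, labels=None):
    """Hypothetical stand-in for an image classifier's forward method."""

def kept_columns(columns, forward_fn):
    """Keep only columns the forward function accepts, plus common label names."""
    accepted = set(inspect.signature(forward_fn).parameters)
    accepted |= {"label", "label_ids"}
    return [name for name in columns if name in accepted]

raw_columns = ["image", "label"]  # columns before the transform runs
surviving = kept_columns(raw_columns, forward)
print(surviving)
```

Only the label column survives, because the raw dataset has no pixel_values column yet. Setting remove_unused_columns=False skips this filtering, so the image column is still present when the transform runs.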
Accuracy climbs from 82% to nearly 89% over four epochs. Save the model:
trainer.save_model('food_classification')

The ViT fine-tuning workflow mirrors text fine-tuning, swapping the tokenizer for an image processor.
Inference
Load the fine-tuned model as an image-classification pipeline and predict on a new image from the web:
from transformers import pipeline
import requests
from PIL import Image
from io import BytesIO
pipe = pipeline("image-classification", model='food_classification', device=device)
url = 'https://www.indianhealthyrecipes.com/wp-content/uploads/2015/10/pizza-recipe-1.jpg'
response = requests.get(url)
image = Image.open(BytesIO(response.content))
pipe(image)
[{'label': 'pizza', 'score': 0.5428637862205505}, {'label': 'kadai_paneer', 'score': 0.03565927594900131}, {'label': 'pav_bhaji', 'score': 0.03565794229507446}, {'label': 'butter_naan', 'score': 0.028422074392437935}, {'label': 'burger', 'score': 0.027608927339315414}]
The model ranks pizza first with a score of about 0.54, far ahead of the runner-up at about 0.04.
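The scores in the pipeline output are probabilities over the classes, with the top five returned by default. A minimal sketch of how raw logits become such a ranked list is shown below. It is an illustration, not the pipeline's implementation: the function name `top_k_scores`, the three-class label map, and the logit values are all made up for the example.

```python
import numpy as np

def top_k_scores(logits, id2label, k=5):
    """Softmax the logits and return the k highest-scoring labels."""
    logits = np.asarray(logits, dtype=np.float64)
    exp = np.exp(logits - logits.max())  # subtract the max for numerical stability
    probs = exp / exp.sum()
    order = np.argsort(probs)[::-1][:k]  # indices sorted by descending probability
    return [{"label": id2label[int(i)], "score": float(probs[i])} for i in order]

# Made-up logits and labels for a three-class toy example
toy_id2label = {0: "pizza", 1: "samosa", 2: "burger"}
ranked = top_k_scores([2.0, 0.0, 1.0], toy_id2label, k=3)
print(ranked)
```

Because softmax scores sum to 1 across all classes, a top score of 0.54 with every other class below 0.04 means the remaining probability is spread thinly over the other nineteen labels.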

The fine-tuned pipeline classifies a new food photo into one of the twenty categories.
Summary
You fine-tuned a Vision Transformer to classify 20 food categories at nearly 89% accuracy. The big idea: ViT treats image patches like word tokens, so the same Trainer workflow applies. The only changes are swapping the tokenizer for an AutoImageProcessor and setting remove_unused_columns=False.
So far every model has been encoder-based. Next, the series moves to true generative LLMs and parameter-efficient fine-tuning — fine-tuning Phi-2 on custom data with LoRA and QLoRA.