How to Use a Pre-trained VGG16 Model to Predict Objects
VGG-16 is a 16-layer deep CNN architecture from the 2014 ImageNet challenge (ILSVRC), where it placed first in localization and second in classification. It uses only 3×3 convolution filters, stacked to increasing depth, and achieves 92.7% top-5 accuracy on ImageNet. This tutorial uses Keras's pre-trained VGG-16 with ImageNet weights to classify new images via forward-pass inference.
Importing Libraries
from tensorflow.keras.applications.vgg16 import VGG16, preprocess_input, decode_predictions
from tensorflow.keras.preprocessing.image import load_img, img_to_array
import os
# Create the pre-trained VGG16 model (ImageNet weights are downloaded on first use)
model = VGG16()
Downloading data from https://github.com/fchollet/deep-learning-models/releases/download/v0.1/vgg16_weights_tf_dim_ordering_tf_kernels.h5
553467904/553467096 [==============================] - 255s 0us/step
model.summary()
Model: "vgg16"
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
input_1 (InputLayer) [(None, 224, 224, 3)] 0
_________________________________________________________________
block1_conv1 (Conv2D) (None, 224, 224, 64) 1792
_________________________________________________________________
block1_conv2 (Conv2D) (None, 224, 224, 64) 36928
_________________________________________________________________
block1_pool (MaxPooling2D) (None, 112, 112, 64) 0
_________________________________________________________________
block2_conv1 (Conv2D) (None, 112, 112, 128) 73856
_________________________________________________________________
block2_conv2 (Conv2D) (None, 112, 112, 128) 147584
_________________________________________________________________
block2_pool (MaxPooling2D) (None, 56, 56, 128) 0
_________________________________________________________________
block3_conv1 (Conv2D) (None, 56, 56, 256) 295168
_________________________________________________________________
block3_conv2 (Conv2D) (None, 56, 56, 256) 590080
_________________________________________________________________
block3_conv3 (Conv2D) (None, 56, 56, 256) 590080
_________________________________________________________________
block3_pool (MaxPooling2D) (None, 28, 28, 256) 0
_________________________________________________________________
block4_conv1 (Conv2D) (None, 28, 28, 512) 1180160
_________________________________________________________________
block4_conv2 (Conv2D) (None, 28, 28, 512) 2359808
_________________________________________________________________
block4_conv3 (Conv2D) (None, 28, 28, 512) 2359808
_________________________________________________________________
block4_pool (MaxPooling2D) (None, 14, 14, 512) 0
_________________________________________________________________
block5_conv1 (Conv2D) (None, 14, 14, 512) 2359808
_________________________________________________________________
block5_conv2 (Conv2D) (None, 14, 14, 512) 2359808
_________________________________________________________________
block5_conv3 (Conv2D) (None, 14, 14, 512) 2359808
_________________________________________________________________
block5_pool (MaxPooling2D) (None, 7, 7, 512) 0
_________________________________________________________________
flatten (Flatten) (None, 25088) 0
_________________________________________________________________
fc1 (Dense) (None, 4096) 102764544
_________________________________________________________________
fc2 (Dense) (None, 4096) 16781312
_________________________________________________________________
predictions (Dense) (None, 1000) 4097000
=================================================================
Total params: 138,357,544
Trainable params: 138,357,544
Non-trainable params: 0
_________________________________________________________________
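The parameter counts in the summary can be reproduced by hand. The sketch below assumes every convolution is 3×3 with one bias per output channel and every dense layer has one bias per output unit; the helper names are illustrative and not part of Keras.

```python
def conv3x3_params(in_channels, out_channels):
    """Weights plus biases of a 3x3 convolution layer."""
    return (3 * 3 * in_channels + 1) * out_channels


def dense_params(in_units, out_units):
    """Weights plus biases of a fully connected layer."""
    return (in_units + 1) * out_units


# (input channels, output channels) for the 13 convolution layers, in order.
conv_channels = [
    (3, 64), (64, 64),
    (64, 128), (128, 128),
    (128, 256), (256, 256), (256, 256),
    (256, 512), (512, 512), (512, 512),
    (512, 512), (512, 512), (512, 512),
]

# (input units, output units) for fc1, fc2 and predictions.
# The flatten layer turns the 7x7x512 feature map into 25088 units.
dense_units = [(7 * 7 * 512, 4096), (4096, 4096), (4096, 1000)]

total = sum(conv3x3_params(i, o) for i, o in conv_channels)
total += sum(dense_params(i, o) for i, o in dense_units)

print(conv3x3_params(3, 64))      # block1_conv1
print(dense_params(25088, 4096))  # fc1
print(total)                      # whole network
```

The three dense layers account for most of the total, which is why the weights file is so large even though the convolution layers do most of the feature extraction.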
The steps below perform the following activities:
- Load 8 sample images from disk.
- Convert each image to an array and reshape it to add a batch dimension.
- Pre-process the array with preprocess_input() and predict the output.
- Pass top=2 to decode_predictions() to keep the two highest-probability classes for each prediction.
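As a rough illustration of the pre-processing step, the sketch below mimics what preprocess_input() does for VGG16: it reorders each pixel from RGB to BGR and subtracts the ImageNet per-channel mean, with no further scaling. It is a pure-Python approximation on nested lists, not Keras's implementation (which works on NumPy arrays); the function names and the tiny two-pixel image are hypothetical.

```python
# ImageNet per-channel means in BGR order, as used for VGG-style preprocessing.
IMAGENET_BGR_MEAN = (103.939, 116.779, 123.68)


def preprocess_pixel(rgb):
    """Convert one RGB pixel to zero-centered BGR."""
    r, g, b = rgb
    bgr = (b, g, r)
    return tuple(value - mean for value, mean in zip(bgr, IMAGENET_BGR_MEAN))


def preprocess_image(image):
    """Apply preprocess_pixel to every pixel of a height x width x 3 nested list."""
    return [[preprocess_pixel(pixel) for pixel in row] for row in image]


# Hypothetical 1x2 "image": one pure-red pixel and one mid-grey pixel.
tiny_image = [[(255.0, 0.0, 0.0), (128.0, 128.0, 128.0)]]

# Wrapping the image in a list plays the role of the reshape step:
# it adds a leading batch dimension of size 1.
batch = [preprocess_image(tiny_image)]

print(batch[0][0][0])  # red pixel, now in zero-centered BGR order
print(batch[0][0][1])  # grey pixel
```

The red pixel ends up with its large value in the last position, which shows the channel swap; the grey pixel shows how each channel is shifted by a different mean.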
# Run each sample image through the pre-trained VGG16 model.
# top=2 in decode_predictions() keeps the two highest-probability classes per prediction.
for file in sorted(os.listdir('sample')):
    print(file)
    full_path = os.path.join('sample', file)
    # Load the image and resize it to the 224x224 input size VGG16 expects.
    image = load_img(full_path, target_size=(224, 224))
    image = img_to_array(image)
    # Add a batch dimension: (224, 224, 3) -> (1, 224, 224, 3).
    image = image.reshape((1, image.shape[0], image.shape[1], image.shape[2]))
    image = preprocess_input(image)
    y_pred = model.predict(image)
    label = decode_predictions(y_pred, top=2)
    print(label)
    print()
bottle1.jpeg
[[('n04557648', 'water_bottle', 0.6603951), ('n04560804', 'water_jug', 0.08577988)]]
bottle2.jpeg
[[('n04557648', 'water_bottle', 0.5169559), ('n04560804', 'water_jug', 0.2630159)]]
bottle3.jpeg
[[('n04557648', 'water_bottle', 0.88239855), ('n04560804', 'water_jug', 0.051655706)]]
monitor.jpeg
[[('n03782006', 'monitor', 0.46309018), ('n03179701', 'desk', 0.16822667)]]
mouse.jpeg
[[('n03793489', 'mouse', 0.37214068), ('n03657121', 'lens_cap', 0.1903602)]]
mug.jpeg
[[('n03063599', 'coffee_mug', 0.46725288), ('n03950228', 'pitcher', 0.1496518)]]
pen.jpeg
[[('n02783161', 'ballpoint', 0.6506707), ('n04116512', 'rubber_eraser', 0.12477029)]]
wallet.jpeg
[[('n04026417', 'purse', 0.530347), ('n04548362', 'wallet', 0.24484588)]]
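To make the top=2 selection concrete, here is a minimal sketch of picking the two most probable classes from a probability vector. It is not how decode_predictions() is implemented (the real function also looks up the WordNet ID for each class); the top_k helper and the 4-class toy data are made up for illustration.

```python
import heapq


def top_k(probabilities, class_names, k=2):
    """Return the k (class_name, probability) pairs with the highest probability."""
    pairs = zip(class_names, probabilities)
    return heapq.nlargest(k, pairs, key=lambda pair: pair[1])


# Toy 4-class example; the real model emits 1000 probabilities per image.
class_names = ["water_bottle", "water_jug", "pitcher", "monitor"]
probabilities = [0.66, 0.09, 0.05, 0.20]

print(top_k(probabilities, class_names, k=2))
```

Raising k returns more candidates, which helps when the top-1 score is low, as with the mouse image above.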
Challenges of VGG-16:
- It is very slow to train (the original VGG model was trained on four NVIDIA Titan Black GPUs for 2-3 weeks).
- The VGG-16 ImageNet weights file is 528 MB, so it takes considerable disk space and bandwidth, which makes the model inefficient to deploy.
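The 528 MB figure follows almost directly from the parameter count. The short calculation below assumes each parameter is stored as a 32-bit float (4 bytes) and ignores any file-format overhead; both are assumptions rather than something the weights file states.

```python
TOTAL_PARAMS = 138_357_544  # "Total params" from model.summary()
BYTES_PER_FLOAT32 = 4       # assumption: weights are stored as 32-bit floats

size_bytes = TOTAL_PARAMS * BYTES_PER_FLOAT32
size_mib = size_bytes / 2**20  # 1 MiB = 1,048,576 bytes

print(f"{size_bytes:,} bytes is about {size_mib:.1f} MiB")
```

The result is slightly below the 553,467,096 bytes reported in the download log; the small difference is presumably metadata stored alongside the weights.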
Conclusion
In this tutorial you used Keras's pre-trained VGG-16 with ImageNet weights to classify 8 sample images with one forward pass each, and no training was required. The model's top-1 prediction was correct for the water bottles, the computer monitor, the mouse, the coffee mug, and the ballpoint pen; for the wallet it predicted purse first (53%) and wallet second (24%). Top-1 confidence ranged from 37% (mouse) to 88% (bottle3).
Key takeaways:
- Pre-trained VGG-16 weights are loaded via VGG16() with the default weights="imagenet". This downloads 528 MB of weights once and caches them, making inference as simple as calling model.predict() without any fine-tuning.
- preprocess_input() must be applied after converting an image to a NumPy array. It subtracts the ImageNet per-channel mean (BGR format) to match the normalization used during VGG's original training.
- decode_predictions(y_pred, top=2) maps the 1000-class softmax output to human-readable ImageNet labels and confidence scores; top=2 returns the two most likely classes per image.
- VGG-16's 138M parameters make it accurate but heavy (528 MB, 2-3 weeks to train from scratch on GPU). For production, MobileNet or EfficientNet achieve similar accuracy at a fraction of the size.
Next steps:
- Adapt VGG-16 for a custom dataset via transfer learning by replacing the final Dense(1000) layer with Dense(num_classes) and fine-tuning only the top layers.
- Compare against multi-label classification in Multi-Label Movie Poster Classification with CNN to see how custom CNNs compare with pre-trained backbones.
- Visualize intermediate feature maps from block3_pool or block5_pool to understand what spatial features VGG-16 learns at each depth.
