Hugging Face Transformers: A Beginner's Guide

Get started with Hugging Face Transformers: run pretrained models in one line with pipelines for text, image, and audio tasks like classification, QA, and translation.

Jun 18, 2026 · 21 min read

Topics You Will Master

Understanding the Hugging Face Hub, Transformers, Datasets, and Spaces
Running pretrained models in one line with the pipeline() API
Solving text tasks: classification, NER, question answering, summarization, translation, text generation
Solving image and audio tasks: image classification, segmentation, text-to-speech, and music generation

Hugging Face is often called the GitHub of machine learning: a community hub where you can browse, download, and run more than a million pretrained models for text, image, audio, and multimodal tasks. The Transformers library is the Python package that loads those models and runs them with a single function call.

This guide walks through the pipeline() API, the fastest way to use a pretrained model, and applies it to every major task type: text classification, NER, question answering, summarization, translation, text generation, image classification, segmentation, text-to-speech, and music generation.

Prerequisites: Python 3.9+ and basic familiarity with running Python scripts or notebooks. A GPU is optional but speeds up the larger models.

The Hugging Face Ecosystem

Before writing code, it helps to know the four pieces you will use in almost every tutorial:

Component | What it does
Hub | Hosts model repos, dataset repos, and Spaces, each with files, a README (model card), and versions
Transformers | The Python library that loads tokenizers, processors, models, and pipelines
Datasets | A library and Hub section to search, inspect, and load datasets
Spaces | Browser-hosted ML demos built with Gradio, Streamlit, Docker, or static apps

Note

The mental model to remember for everything that follows: task → checkpoint → tokenizer/processor → model → output. A checkpoint is a saved model package containing config.json, the model weights, tokenizer files, a model card, and license notes.

Diagram of the Hugging Face workflow: choose a task, pick a checkpoint, load tokenizer and model, get output

The repeating Hugging Face pattern: every task follows the same task → checkpoint → model → output flow.


Installation

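Install the core libraries. transformers provides the models and pipelines, datasets loads data, accelerate helps with device placement, and torch is the deep-learning backend. The later examples also import pandas, Pillow, requests, soundfile, and scipy, so install them at the same time.

BASH
pip install -U transformers datasets accelerate torch pandas pillow requests soundfile scipy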

On Linux/macOS: same command — use pip3 if pip points to Python 2.

Tip

Use a clean virtual environment so your project dependencies stay isolated. The activation command below is for Windows:

BASH
python -m venv hf
hf\Scripts\activate

On Linux/macOS: source hf/bin/activate

Note

The companion course notebooks install everything from a single requirements file with pip install -r https://raw.githubusercontent.com/laxmimerit/Fine-Tuning-LLM-with-HuggingFace/main/requirements.txt. The explicit pip install above covers the same core libraries for this tutorial.


What Is a Pipeline?

A pipeline() hides the repetitive steps — pre-processing, model inference, and post-processing — so you can focus on the task itself. You pass a task name and an input, and it returns a clean, human-readable output.

PYTHON
from transformers import pipeline
import pandas as pd

Internally, every pipeline does three things: a tokenizer or processor converts your raw input into tensors, the model predicts logits or hidden states, and a post-processing step turns that back into labels, text, or boxes.

Diagram of the pipeline internals: input, pre-processing, model inference, post-processing, output

A pipeline wraps pre-processing, model inference, and post-processing into one call.
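To make the three stages concrete, here is a toy sketch in plain Python. It is not how Transformers implements pipeline(); the vocabulary, weights, and function names are all made up for illustration, and a real checkpoint learns its weights from data. It only shows how pre-processing, inference, and post-processing hand data to each other.

```python
import math

# Hypothetical toy vocabulary and weights; a real checkpoint learns these.
VOCAB = {"love": 0, "hate": 1, "tutorials": 2}
WEIGHTS = [[-2.0, 2.0], [2.0, -2.0], [0.0, 0.5]]  # per token: [NEGATIVE, POSITIVE]
LABELS = ["NEGATIVE", "POSITIVE"]


def preprocess(text):
    """Stage 1: turn raw text into token ids (unknown words are dropped)."""
    words = text.lower().replace(".", " ").split()
    return [VOCAB[w] for w in words if w in VOCAB]


def model(token_ids):
    """Stage 2: produce one raw score (logit) per label."""
    logits = [0.0] * len(LABELS)
    for token_id in token_ids:
        for i, weight in enumerate(WEIGHTS[token_id]):
            logits[i] += weight
    return logits


def postprocess(logits):
    """Stage 3: softmax the logits and report the best label."""
    top = max(logits)
    exps = [math.exp(x - top) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    best = max(range(len(probs)), key=probs.__getitem__)
    return [{"label": LABELS[best], "score": probs[best]}]


def toy_pipeline(text):
    """Chain the three stages, mirroring the shape of a pipeline call."""
    return postprocess(model(preprocess(text)))


result = toy_pipeline("I really love tutorials.")
print(result)
```

The returned list of label/score dictionaries has the same shape as the real text-classification output shown in the next section.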


Text Classification

Text classification assigns a label to a piece of text, such as sentiment, spam, topic, toxicity, or emotion. Create a text-classification pipeline; device=0 runs it on the first GPU. If you have no GPU, pass device=-1 to run on the CPU.

PYTHON
classifier = pipeline("text-classification", device=0)

text = "I really love tutorials by KGP Talkie."

outputs = classifier(text)

Note

When you do not specify a model, the pipeline picks a sensible default — here distilbert-base-uncased-finetuned-sst-2-english. For production, always pin an explicit model name and revision.
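One way to follow that advice is to keep the pipeline arguments in one place. The sketch below pins the default checkpoint named above and falls back to the CPU (device=-1) when no CUDA GPU is found. The helper name pick_device and the PIPELINE_KWARGS dictionary are illustrative, and "main" is only a placeholder: for a real deployment you would replace it with a specific commit hash from the model repo.

```python
def pick_device():
    """Return 0 (first GPU) when CUDA is usable, otherwise -1 (CPU)."""
    try:
        import torch  # imported lazily so the helper also works without torch
    except ImportError:
        return -1
    return 0 if torch.cuda.is_available() else -1


PIPELINE_KWARGS = {
    "task": "text-classification",
    "model": "distilbert-base-uncased-finetuned-sst-2-english",
    "revision": "main",  # assumption: swap in a commit hash to freeze the version
    "device": pick_device(),
}

print(PIPELINE_KWARGS)

# Usage (downloads the checkpoint on first run):
# from transformers import pipeline
# classifier = pipeline(**PIPELINE_KWARGS)
```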

Wrap the output in a DataFrame to read it cleanly:

PYTHON
pd.DataFrame(outputs)
OUTPUT
      label     score
0  POSITIVE  0.995206

To detect a specific emotion instead of binary sentiment, pass a model trained for that — bhadresh-savani/distilbert-base-uncased-emotion:

PYTHON
classifier = pipeline("text-classification", model='bhadresh-savani/distilbert-base-uncased-emotion', device=0)

text = "I really love tutorials by KGP Talkie."

outputs = classifier(text)
pd.DataFrame(outputs)
OUTPUT
  label     score
0   joy  0.941611
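Because every result is a list of dictionaries with label and score keys, it is easy to post-process. As a small sketch, the helper below replaces low-confidence predictions with a fallback label. The function name, the "UNSURE" label, and the 0.9 threshold are assumptions chosen for illustration, not part of the library.

```python
def confident_labels(outputs, threshold=0.9, fallback="UNSURE"):
    """Keep a label only when its score reaches the threshold."""
    return [
        item["label"] if item["score"] >= threshold else fallback
        for item in outputs
    ]


# Same shape as the pipeline output shown above.
outputs = [{"label": "joy", "score": 0.941611}]

print(confident_labels(outputs))                  # default threshold of 0.9
print(confident_labels(outputs, threshold=0.95))  # stricter threshold
```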

Named Entity Recognition

Named Entity Recognition (NER) tags spans of text as people, organizations, locations, dates, and more. Use the ner task:

PYTHON
ner = pipeline(task='ner')
text = "I really love tutorials by KGP Talkie. I live in Mumbai."
outputs = ner(text)
pd.DataFrame(outputs)
OUTPUT
  entity     score  index    word  start  end
0  I-ORG  0.999032      8       K     27   28
1  I-ORG  0.996179      9    ##GP     28   30
2  I-ORG  0.996391     10    Talk     31   35
3  I-ORG  0.993506     11    ##ie     35   37
4  I-LOC  0.999369     16  Mumbai     49   55

The model splits "KGP Talkie" into subword tokens (K, ##GP, Talk, ##ie) and tags them as an organization, and correctly tags "Mumbai" as a location.
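The token-classification pipeline can group subword tokens for you through its aggregation_strategy argument (for example aggregation_strategy="simple"). To show what that grouping involves, here is a hand-rolled sketch that merges the rows above using their character offsets. The function name and the merge rule (same entity type, at most one character between tokens) are simplifying assumptions, not the library's exact algorithm.

```python
def merge_subwords(text, tokens):
    """Group adjacent tokens of the same entity type into (label, span) pairs."""
    spans = []
    for tok in tokens:
        label = tok["entity"].split("-")[-1]  # "I-ORG" -> "ORG"
        last = spans[-1] if spans else None
        # Merge when the type matches and at most one character (a space)
        # separates this token from the previous one.
        if last and last["label"] == label and tok["start"] - last["end"] <= 1:
            last["end"] = tok["end"]
        else:
            spans.append({"label": label, "start": tok["start"], "end": tok["end"]})
    return [(s["label"], text[s["start"]:s["end"]]) for s in spans]


text = "I really love tutorials by KGP Talkie. I live in Mumbai."
tokens = [
    {"entity": "I-ORG", "word": "K", "start": 27, "end": 28},
    {"entity": "I-ORG", "word": "##GP", "start": 28, "end": 30},
    {"entity": "I-ORG", "word": "Talk", "start": 31, "end": 35},
    {"entity": "I-ORG", "word": "##ie", "start": 35, "end": 37},
    {"entity": "I-LOC", "word": "Mumbai", "start": 49, "end": 55},
]

entities = merge_subwords(text, tokens)
print(entities)
```

Slicing the original text by offsets, rather than joining the word pieces, avoids having to strip the ## markers by hand.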

The same task can do part-of-speech tagging with a different checkpoint:

PYTHON
ner = pipeline(task='ner', model='vblagoje/bert-english-uncased-finetuned-pos', device=0)
text = "I really love tutorials by KGP Talkie. I live in Mumbai."
outputs = ner(text)
pd.DataFrame(outputs)
OUTPUT
   entity     score  index    word
0    PRON  0.999532      1       i
1     ADV  0.999148      2  really
2    VERB  0.999093      3    love
3    NOUN  0.998578      4   tutor
..    ...       ...    ...     ...
14  PROPN  0.998859     15  mumbai

Question Answering

Extractive question answering finds the span of a context paragraph that answers a question:

PYTHON
pipe = pipeline('question-answering', device=0)

context = 'The iPhone is a smartphone developed by Apple Inc. It was first introduced by Steve Jobs in 2007 and became one of the most popular smartphones in the world.'

question = 'Which company developed the iPhone?'

output = pipe(question=question, context=context)
output
OUTPUT
{'score': 0.7761176228523254, 'start': 40, 'end': 49, 'answer': 'Apple Inc'}

The score is the model's confidence, and start/end are character offsets into the context. Swap in a stronger model like deepset/roberta-base-squad2 for harder questions:

PYTHON
pipe = pipeline('question-answering', device=0, model='deepset/roberta-base-squad2')
output = pipe(question=question, context=context)
pd.DataFrame([output])
OUTPUT
      score  start  end     answer
0  0.591831     40   49  Apple Inc
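The start and end values can be checked directly: slicing the context with them should reproduce the answer string. This short sketch reuses the context and the first result from above, so no model is needed to run it.

```python
context = (
    "The iPhone is a smartphone developed by Apple Inc. It was first "
    "introduced by Steve Jobs in 2007 and became one of the most popular "
    "smartphones in the world."
)

# Result copied from the first question-answering output above.
result = {"score": 0.7761176228523254, "start": 40, "end": 49, "answer": "Apple Inc"}

# start is inclusive and end is exclusive, like a normal Python slice.
span = context[result["start"]:result["end"]]
print(span)
```

This is handy when you want to highlight the answer inside the original passage instead of only printing it.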

Summarization

Summarization condenses a long passage into a short one. Use a model trained for it, such as facebook/bart-large-cnn:

PYTHON
pipe = pipeline('summarization', device=0, max_length=50, model='facebook/bart-large-cnn')

text = 'Climate change is one of the biggest challenges facing the world today. It is mainly caused by human activities such as burning fossil fuels, cutting down forests, and increasing industrial pollution. These activities release greenhouse gases into the atmosphere, which trap heat and increase the Earth’s temperature. As a result, we are seeing rising sea levels, extreme weather events, melting glaciers, and changes in rainfall patterns. To reduce the impact of climate change, countries need to use clean energy, protect forests, reduce pollution, and promote sustainable development.'

output = pipe(text)
output[0]['summary_text']
OUTPUT
'Climate change is one of the biggest challenges facing the world today. It is mainly caused by human activities such as burning fossil fuels, cutting down forests, and increasing industrial pollution. To reduce the impact of climate change, countries need to use'

Tip

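max_length caps the summary length in tokens, which is why the output above stops mid-sentence. The checkpoint also defines a default min_length; setting max_length below it makes the two limits conflict, so raise max_length (or lower min_length) to get longer, complete summaries.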
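If you need to keep a tight token cap, one possible workaround is to trim the dangling fragment after generation. The helper below is a plain-Python sketch, not a Transformers feature; its name and the simple sentence-ending rule (a period, exclamation mark, or question mark followed by whitespace or the end of the string) are assumptions that will not suit every text.

```python
import re


def trim_to_last_sentence(summary):
    """Drop a trailing fragment that does not end with '.', '!' or '?'."""
    endings = list(re.finditer(r"[.!?](?=\s|$)", summary))
    if not endings:
        return summary  # no complete sentence to fall back on
    return summary[: endings[-1].end()]


# The truncated summary from the output above.
summary = (
    "Climate change is one of the biggest challenges facing the world today. "
    "It is mainly caused by human activities such as burning fossil fuels, "
    "cutting down forests, and increasing industrial pollution. "
    "To reduce the impact of climate change, countries need to use"
)

trimmed = trim_to_last_sentence(summary)
print(trimmed)
```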


Translation

Translation pipelines are named by language pair. The default translation_en_to_de uses Google's T5:

PYTHON
pipe = pipeline('translation_en_to_de')
text = "I really love tutorials by KGP Talkie. I live in Mumbai."
output = pipe(text)
output
OUTPUT
[{'translation_text': 'Ich liebe die Tutorials von KGP Talkie und lebe in Mumbai.'}]

You can pass a fine-tuned model for other language pairs, such as English-to-Hindi. Name the task after the pair the checkpoint was trained on:

PYTHON
pipe = pipeline('translation_en_to_hi', model='AbhirupGhosh/opus-mt-finetuned-en-hi')
text = "I really love tutorials by KGP Talkie. I live in Mumbai."
output = pipe(text)
output
OUTPUT
[{'translation_text': 'मैं वास्तव में केजीपी टॉकी द्वारा शिक्षण से प्यार करता हूँ। मैं मुंबई में रहता हूँ।'}]
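The translation_XX_to_YY naming convention is easy to work with in your own code. The sketch below splits a task name into its source and target language codes; the function is a hypothetical helper written for this guide and only mirrors the naming pattern, not the library's internal parsing.

```python
def parse_translation_task(task):
    """Split 'translation_XX_to_YY' into (source, target) language codes."""
    prefix = "translation_"
    if not task.startswith(prefix) or "_to_" not in task:
        raise ValueError(f"expected 'translation_XX_to_YY', got {task!r}")
    source, target = task[len(prefix):].split("_to_", 1)
    return source, target


print(parse_translation_task("translation_en_to_de"))
print(parse_translation_task("translation_en_to_fr"))
```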

Text Generation

Text generation continues a prompt. The default model is GPT-2:

PYTHON
pipe = pipeline('text-generation')
output = pipe(text, max_length=128)
output
OUTPUT
[{'generated_text': 'I really love tutorials by KGP Talkie. I live in Mumbai.\n\nQ: Is the internet a great way to get inspired?\n\nA: It is a great tool for people to find inspiration through video and social media...'}]

Larger models tend to produce more coherent text. Here is gpt2-xl (1.5B parameters):

PYTHON
pipe = pipeline('text-generation', model='openai-community/gpt2-xl')
output = pipe(text, max_length=128)
output
OUTPUT
[{'generated_text': "I really love tutorials by KGP Talkie. I live in Mumbai. I don't know how long it takes to take an idea and just start cooking with it..."}]
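Under the hood, text generation is a loop: predict a next token, append it, and repeat until a length limit or a stopping point is reached. The toy sketch below imitates that loop with a hand-written lookup table in place of a neural network. Everything in it (the table, the function name, the seed) is hypothetical; it is only meant to show that max_length counts the prompt tokens as well as the generated ones.

```python
import random

# Hypothetical toy "model": maps the last token to candidate next tokens.
NEXT_TOKENS = {
    "I": ["live", "love"],
    "live": ["in"],
    "love": ["tutorials"],
    "in": ["Mumbai"],
    "tutorials": ["by"],
    "by": ["KGP"],
}


def generate(prompt_tokens, max_length, seed=0):
    """Extend the prompt one token at a time until max_length total tokens."""
    rng = random.Random(seed)
    tokens = list(prompt_tokens)
    while len(tokens) < max_length:
        candidates = NEXT_TOKENS.get(tokens[-1])
        if not candidates:
            break  # the toy model has nothing to say after this token
        tokens.append(rng.choice(candidates))
    return tokens


print(generate(["I", "live"], max_length=4))
print(generate(["I", "love"], max_length=3))
```

With a real model, the pipeline also accepts max_new_tokens, which limits only the newly generated tokens and is often easier to reason about than max_length.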

Diagram comparing common Hugging Face pipeline tasks across text, vision, and audio modalities

The pipeline() API covers text, vision, and audio tasks with the same simple call signature.


Image Classification

Pipelines are not limited to text. Image classification assigns a label to a whole image; this example uses microsoft/resnet-18:

PYTHON
from PIL import Image
import requests

pipe = pipeline("image-classification", model='microsoft/resnet-18')

url = 'https://headsupfortails.com/cdn/shop/articles/Pomeranian_Dog_Guide_38876a16-d481-41d0-a5d8-4bf26afd2c8f.jpg?v=1754635331'
image = Image.open(requests.get(url, stream=True).raw)

output = pipe(image)
output
OUTPUT
[{'label': 'Pomeranian', 'score': 0.9684421420097351}, {'label': 'kit fox, Vulpes macrotis', 'score': 0.0032119215466082096}, {'label': 'red fox, Vulpes vulpes', 'score': 0.0030783233232796192}, {'label': 'keeshond', 'score': 0.003051026491448283}, {'label': 'Arctic fox, white fox, Alopex lagopus', 'score': 0.002322540385648608}]

Image Segmentation

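Segmentation goes further than classification: it returns a mask for each labeled region of the image. This checkpoint performs semantic segmentation, so each mask covers a whole class such as "grass" rather than an individual object, and score is None: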

PYTHON
pipe = pipeline('image-segmentation', model='nvidia/segformer-b0-finetuned-ade-512-512')

url = 'https://headsupfortails.com/cdn/shop/articles/Pomeranian_Dog_Guide_38876a16-d481-41d0-a5d8-4bf26afd2c8f.jpg?v=1754635331'
image = Image.open(requests.get(url, stream=True).raw)
output = pipe(image)
output
OUTPUT
[{'score': None, 'label': 'tree', 'mask': <PIL.Image.Image image mode=L size=801x801>}, {'score': None, 'label': 'grass', 'mask': <PIL.Image.Image image mode=L size=801x801>}, {'score': None, 'label': 'person', 'mask': <PIL.Image.Image image mode=L size=801x801>}, {'score': None, 'label': 'animal', 'mask': <PIL.Image.Image image mode=L size=801x801>}]

Each result carries a mask you can display as an image — for example output[2]['mask'] shows the "person" mask.
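A mask is also useful as data, for instance to measure how much of the image a class covers. The sketch below works on a flat sequence of pixel values and assumes that the segment is marked with non-zero values (such as 255) and the background with 0. The function name and the tiny hand-made mask are illustrative only; with a real result you could pass the mask's pixel values instead, for example via numpy.asarray(mask).ravel() if NumPy is installed.

```python
def mask_coverage(pixels):
    """Fraction of pixels that belong to the segment (non-zero values)."""
    pixels = list(pixels)
    if not pixels:
        return 0.0
    return sum(1 for p in pixels if p > 0) / len(pixels)


# Hypothetical 2x4 mask flattened row by row; 255 marks the segment.
toy_mask = [0, 255, 255, 0, 0, 255, 0, 0]

coverage = mask_coverage(toy_mask)
print(f"{coverage:.1%} of the image belongs to this segment")
```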


Text to Speech

Audio works the same way. With no model specified, the text-to-speech task falls back to its default checkpoint, suno/bark-small:

PYTHON
import soundfile as sf

pipe = pipeline('text-to-speech')

text = """Sam Altman on Wednesday returned to OpenAI as the chief executive officer (CEO) and sacked the Board that had fired him last week."""

output = pipe(text)

Save the generated waveform to a .wav file:

PYTHON
sf.write('speech.wav', output['audio'].T, samplerate=output['sampling_rate'])

Note

output is a dictionary with an audio NumPy array and a sampling_rate (here 24000 Hz). The .T transposes the array into the channel layout soundfile expects.
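If soundfile is not available, a mono waveform can also be saved with Python's built-in wave module. The sketch below is an alternative under stated assumptions, not the method used in this guide: it assumes one channel of float samples in the range -1.0 to 1.0 and converts them to 16-bit PCM. A generated sine tone stands in for real model output, and write_wav is a hypothetical helper name.

```python
import math
import struct
import wave


def write_wav(path, samples, sampling_rate):
    """Write mono float samples in [-1.0, 1.0] as a 16-bit PCM WAV file."""
    frames = bytearray()
    for sample in samples:
        clipped = max(-1.0, min(1.0, float(sample)))  # keep within valid range
        frames += struct.pack("<h", int(clipped * 32767))
    with wave.open(path, "wb") as wav:
        wav.setnchannels(1)        # mono
        wav.setsampwidth(2)        # 2 bytes = 16 bits per sample
        wav.setframerate(sampling_rate)
        wav.writeframes(bytes(frames))


# One second of a 440 Hz tone as stand-in data for a generated waveform.
rate = 24000
tone = [0.3 * math.sin(2 * math.pi * 440 * n / rate) for n in range(rate)]
write_wav("tone.wav", tone, rate)
```

For real pipeline output you would first flatten the audio array to one dimension before passing it in.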


Text to Music Generation

The text-to-audio task with facebook/musicgen-small generates music from a text prompt:

PYTHON
pipe = pipeline('text-to-audio', model="facebook/musicgen-small")

text = "a chill song with influences from lofi, chillstep and downtempo"

output = pipe(text)

Save the result with SciPy:

PYTHON
import scipy.io.wavfile  # import the submodule explicitly so scipy.io.wavfile is always available
scipy.io.wavfile.write("music.wav", rate=output["sampling_rate"], data=output['audio'])

Choosing the Right Model

With more than a million checkpoints on the Hub, use this checklist before committing to one in a project:

Question | What to check
Is it for my task? | Task tag, model-card examples, pipeline support
Can I use it legally? | License, commercial restrictions, gated access
Will it run on my machine? | Model size, RAM/VRAM, quantized versions
Is it reliable enough? | Evaluation metrics, known limitations, recent updates
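For the "Will it run on my machine?" row, a rough first estimate is the parameter count multiplied by the bytes each parameter needs. The sketch below applies that arithmetic to gpt2-xl's 1.5B parameters. It covers the weights only; activations, caches, and framework overhead come on top, so treat the result as a lower bound. The function and dictionary names are illustrative.

```python
# Bytes needed to store one parameter in common numeric formats.
BYTES_PER_PARAM = {"float32": 4, "float16": 2, "int8": 1}


def weight_memory_gb(num_params, dtype="float32"):
    """Approximate size of the weights alone, in GB (1 GB = 1e9 bytes)."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e9


gpt2_xl_fp32 = weight_memory_gb(1.5e9)             # full precision
gpt2_xl_fp16 = weight_memory_gb(1.5e9, "float16")  # half precision

print(f"float32: {gpt2_xl_fp32} GB, float16: {gpt2_xl_fp16} GB")
```

Halving the precision halves the footprint, which is why quantized versions appear in the checklist.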

Important

Always read the model card (the repo's README) before using a checkpoint. It tells you the intended task, the license, evaluation metrics, and known limitations. Never copy a checkpoint name blindly.

For the full list of tasks and arguments, see the official Hugging Face pipeline tutorial and the Hub documentation.


Summary

The pipeline() API is the single most useful entry point in Hugging Face Transformers. With one function call you ran sentiment analysis, NER, question answering, summarization, translation, text generation, image classification, segmentation, speech synthesis, and music generation — all on pretrained models, with no training required.

The pattern never changes: pick a task, pick a checkpoint trained for it, pass your input, and read the output. Once this workflow feels natural, the next step is fine-tuning — adapting a pretrained model to your own dataset, which the rest of this series covers.
