So far you have classified whole sentences. Named entity recognition (NER) is different — it labels individual tokens, identifying spans like people, places, and organizations. In this tutorial you fine-tune DistilBERT to parse restaurant search queries, pulling out cuisines, locations, ratings, dishes, and amenities.
This is a token classification task, and it introduces one new wrinkle: aligning your word-level labels with the model's subword tokens.
Prerequisites: The BERT fine-tuning workflow and a Python environment with transformers, datasets, evaluate, seqeval, and torch. The code below also imports requests, pandas, and numpy.
What Is NER?
Named Entity Recognition identifies and classifies named entities — people, places, organizations, dates, and more — in text.
It works token by token, in five conceptual steps:
- Text input — a sentence, paragraph, or document.
- Tokenization — split the text into individual tokens.
- Entity recognition — find spans of tokens that form entities.
- Entity classification — assign each entity a category (location, dish, rating, etc.).
- Output — a structured representation of the text with its labeled entities.
Tip
The displaCy entity visualizer is a great way to see NER output rendered over text.
IOB Tagging
NER labels are usually stored in IOB format (Inside–Outside–Beginning), which marks where each entity starts and ends:
- B- (Beginning) — the first token of an entity.
- I- (Inside) — a token continuing the entity.
- O (Outside) — a token that is not part of any entity.
So "new york" as a location becomes B-Location I-Location, and a filler word like "in" is O. You can read the full convention on the IOB tagging Wikipedia page.

IOB tagging marks the Beginning, Inside, and Outside of each entity span, token by token.
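To make the convention concrete, here is a minimal sketch that turns entity spans into IOB tags. The helper name `spans_to_iob` and the sample annotations are hypothetical and exist only for illustration; they are not part of the dataset or of any library.

```python
def spans_to_iob(tokens, spans):
    """Turn (start, end, label) token spans into one IOB tag per token.

    `end` is exclusive, as in Python slicing.
    """
    tags = ["O"] * len(tokens)          # every token starts as Outside
    for start, end, label in spans:
        tags[start] = f"B-{label}"      # first token of the entity
        for i in range(start + 1, end):
            tags[i] = f"I-{label}"      # remaining tokens of the entity
    return tags


tokens = ["sushi", "in", "new", "york"]
spans = [(0, 1, "Dish"), (2, 4, "Location")]  # hypothetical annotations
for token, tag in zip(tokens, spans_to_iob(tokens, spans)):
    print(f"{token}\t{tag}")
```

The two-token location gets one B- tag followed by one I- tag, while the filler word "in" stays O.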
The Dataset
This tutorial uses the MIT Restaurant dataset, which labels search queries with entities like Rating, Amenity, Location, Restaurant_Name, Price, Hours, Dish, and Cuisine. The data is in BIO format (another name for the IOB scheme above): each line holds a tag and a token separated by a tab, and blank lines separate sentences.
Load and parse the training file into lists of tokens and tags:
import requests

# Download the raw BIO file and split it into lines
response = requests.get("https://raw.githubusercontent.com/laxmimerit/All-CSV-ML-Data-Files-Download/master/mit_restaurant_search_ner/train.bio")
lines = response.text.splitlines()

train_tokens = []
train_tags = []
temp_tokens, temp_tags = [], []
for line in lines:
    if line != "":
        # Each non-blank line is "tag<TAB>token"
        tag, token = line.strip().split("\t")
        temp_tags.append(tag)
        temp_tokens.append(token)
    else:
        # A blank line closes the current sentence
        train_tokens.append(temp_tokens)
        train_tags.append(temp_tags)
        temp_tokens, temp_tags = [], []
len(train_tokens), len(train_tags)
(7659, 7659)
Repeat the same parsing for the test.bio file, storing the results in test_tokens and test_tags; it yields 1,520 examples.
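One way to avoid copying the loop is to wrap it in a function. The sketch below is an assumption about how you might structure that, not the tutorial's original code: `parse_bio` is a hypothetical helper, and unlike the loop above it also flushes a final sentence that has no trailing blank line and skips empty sentences, so its counts may differ slightly from the numbers quoted here.

```python
def parse_bio(text):
    """Parse BIO text ("tag<TAB>token" lines, blank line between sentences)."""
    all_tokens, all_tags = [], []
    tokens, tags = [], []
    for line in text.splitlines():
        if line.strip():
            tag, token = line.strip().split("\t")
            tags.append(tag)
            tokens.append(token)
        elif tokens:
            # Blank line: close the current sentence (ignore repeated blanks)
            all_tokens.append(tokens)
            all_tags.append(tags)
            tokens, tags = [], []
    if tokens:
        # Flush a final sentence that is not followed by a blank line
        all_tokens.append(tokens)
        all_tags.append(tags)
    return all_tokens, all_tags


# Small inline sample in the same layout as the dataset files
sample = "B-Rating\t5\nI-Rating\tstar\nO\trestaurants\n\nB-Cuisine\tthai\nO\tfood"
sample_tokens, sample_tags = parse_bio(sample)
print(sample_tokens)
print(sample_tags)
```

With a helper like this, the test split could be loaded by passing the downloaded text of test.bio to `parse_bio`, assuming that file sits next to train.bio in the same repository.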
Building the Hugging Face Dataset
Convert the parsed lists into a DatasetDict. Here the test set doubles as validation:
import pandas as pd
from datasets import Dataset, DatasetDict
df = pd.DataFrame({'tokens': train_tokens, 'ner_tags_str': train_tags})
train = Dataset.from_pandas(df)
df = pd.DataFrame({'tokens': test_tokens, 'ner_tags_str': test_tags})
test = Dataset.from_pandas(df)
dataset = DatasetDict({'train': train, 'test': test, 'validation': test})
dataset['train'][0]
{'tokens': ['2', 'start', 'restaurants', 'with', 'inside', 'dining'], 'ner_tags_str': ['B-Rating', 'I-Rating', 'O', 'O', 'B-Amenity', 'I-Amenity']}
Build numeric tag mappings from the unique entity types, generating a B- and I- entry for each:
unique_tags = set()
for tag in dataset['train']['ner_tags_str']:
    unique_tags.update(tag)
# Strip the "B-"/"I-" prefix to get the entity types. Set order is arbitrary,
# so the exact integer IDs can differ between runs.
unique_tags = list(set([x[2:] for x in list(unique_tags) if x != 'O']))
tag2index = {"O": 0}
for i, tag in enumerate(unique_tags):
    tag2index[f'B-{tag}'] = len(tag2index)
    tag2index[f'I-{tag}'] = len(tag2index)
index2tag = {v: k for k, v in tag2index.items()}
Map the string tags to their integer IDs:
dataset = dataset.map(lambda example: {"ner_tags": [tag2index[tag] for tag in example['ner_tags_str']]})
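If you want the same IDs on every run, one option is to sort the entity types before numbering them. This is a sketch of that variant on toy data, not the mapping used to produce the outputs in this tutorial; `build_tag_maps` and the toy tag lists are hypothetical names.

```python
def build_tag_maps(tag_sequences):
    """Build tag->id and id->tag maps with a stable, sorted entity order."""
    entity_types = sorted(
        {tag[2:] for tags in tag_sequences for tag in tags if tag != "O"}
    )
    tag_to_id = {"O": 0}
    for entity in entity_types:
        tag_to_id[f"B-{entity}"] = len(tag_to_id)
        tag_to_id[f"I-{entity}"] = len(tag_to_id)
    id_to_tag = {idx: tag for tag, idx in tag_to_id.items()}
    return tag_to_id, id_to_tag


toy_tags = [["B-Rating", "I-Rating", "O"], ["B-Cuisine", "O", "B-Location", "I-Location"]]
toy_tag2index, toy_index2tag = build_tag_maps(toy_tags)
print(toy_tag2index)
# Encode one sequence with the mapping, as dataset.map does above
print([toy_tag2index[tag] for tag in toy_tags[0]])
```

Because the order is alphabetical, the IDs produced this way will generally not match the label IDs printed elsewhere in this tutorial.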
Aligning Labels with Subword Tokens
Here is the key challenge of token classification. The tokenizer may split one word into several subwords, but you only have one label per word. You must align them.
Load the tokenizer and see the problem:
from transformers import AutoTokenizer
model_ckpt = "distilbert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_ckpt)
words = dataset['train'][2]['tokens']
encoding = tokenizer(words, is_split_into_words=True)
tokenizer.convert_ids_to_tokens(encoding.input_ids)
['[CLS]', '5', 'star', 'rest', '##ura', '##nts', 'in', 'my', 'town', '[SEP]']
The word "resturants" (misspelled in the dataset) became three subword tokens. The fix: label only the first subword of each word, and assign -100 to the rest (and to special tokens). PyTorch's cross-entropy loss, which the model uses during training, skips positions labeled -100 by default.
def tokenize_and_align_labels(examples):
    tokenized_inputs = tokenizer(examples['tokens'], truncation=True, is_split_into_words=True)
    labels = []
    for i, label in enumerate(examples['ner_tags']):
        # word_ids maps each subword token to the index of its source word
        word_ids = tokenized_inputs.word_ids(batch_index=i)
        previous_word_idx = None
        label_ids = []
        for word_idx in word_ids:
            if word_idx is None:
                # Special tokens such as [CLS] and [SEP]
                label_ids.append(-100)
            elif word_idx != previous_word_idx:
                # First subword of a word keeps the word's label
                label_ids.append(label[word_idx])
            else:
                # Remaining subwords of the same word are masked
                label_ids.append(-100)
            previous_word_idx = word_idx
        labels.append(label_ids)
    tokenized_inputs['labels'] = labels
    return tokenized_inputs
tokenized_dataset = dataset.map(tokenize_and_align_labels, batched=True)
tokenized_dataset['train'][2]['labels']
[-100, 3, 4, 0, -100, -100, 5, 6, 6, -100]
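The alignment rule can be traced by hand without loading a tokenizer. In this sketch the `word_ids` list is written out manually to match the subword tokens shown earlier (an assumption about what the tokenizer returns for that example), and `align_labels` is a hypothetical standalone copy of the inner loop.

```python
def align_labels(word_ids, word_labels, ignore_index=-100):
    """Give each word's label to its first subword; mask everything else."""
    aligned = []
    previous = None
    for word_id in word_ids:
        if word_id is None:
            aligned.append(ignore_index)            # special token
        elif word_id != previous:
            aligned.append(word_labels[word_id])    # first subword of a word
        else:
            aligned.append(ignore_index)            # continuation subword
        previous = word_id
    return aligned


subwords = ['[CLS]', '5', 'star', 'rest', '##ura', '##nts', 'in', 'my', 'town', '[SEP]']
word_ids = [None, 0, 1, 2, 2, 2, 3, 4, 5, None]   # hand-written, see lead-in
word_labels = [3, 4, 0, 5, 6, 6]                  # one label per original word

aligned = align_labels(word_ids, word_labels)
for subword, label in zip(subwords, aligned):
    print(f"{subword:8} {label}")
```

Six word-level labels expand to ten token-level labels, and only "rest" carries the label for the three pieces of "resturants".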
Important
The -100 sentinel is what makes subword alignment work. Without it, the loss would penalize the model on subword pieces that have no real label, corrupting training.

Only the first subword of each word keeps its label; the rest are masked with -100 and ignored by the loss.
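To see why masked positions cannot affect training, here is a pure-Python sketch of a mean cross-entropy that skips one label value. It is meant to mirror the idea behind the ignore index in PyTorch's cross-entropy loss, not to reproduce the library's implementation; the function name and the toy logits are hypothetical.

```python
import math


def masked_cross_entropy(logits, labels, ignore_index=-100):
    """Mean cross-entropy over positions whose label is not ignore_index."""
    losses = []
    for scores, label in zip(logits, labels):
        if label == ignore_index:
            continue  # masked position: contributes nothing to the loss
        log_norm = math.log(sum(math.exp(s) for s in scores))
        losses.append(log_norm - scores[label])
    # Guard for the toy case where every position is masked
    return sum(losses) / len(losses) if losses else 0.0


logits = [
    [2.0, 0.5, 0.1],  # [CLS], masked
    [0.2, 3.0, 0.1],  # first subword of a word, labeled 1
    [5.0, 0.0, 0.0],  # continuation subword, masked
]
labels = [-100, 1, -100]

full = masked_cross_entropy(logits, labels)
only_labeled = masked_cross_entropy([logits[1]], [1])
print(full, only_labeled)
```

Both calls give the same value, because whatever the model predicts at the masked positions is never scored.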
Data Collation and Metrics
A DataCollatorForTokenClassification pads both the inputs and the labels to the same length per batch:
from transformers import DataCollatorForTokenClassification
data_collator = DataCollatorForTokenClassification(tokenizer=tokenizer)
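Conceptually, the collator right-pads every sequence in a batch to the longest one and fills the extra label positions with -100. The sketch below illustrates that idea with plain lists. It is a simplification under stated assumptions (a pad token ID of 0, made-up input IDs, no tensors), and `pad_batch` is a hypothetical helper rather than the library's code.

```python
def pad_batch(features, pad_token_id=0, label_pad_id=-100):
    """Right-pad input_ids, attention_mask, and labels to the batch maximum."""
    max_len = max(len(f["input_ids"]) for f in features)
    batch = {"input_ids": [], "attention_mask": [], "labels": []}
    for f in features:
        n_real = len(f["input_ids"])
        n_pad = max_len - n_real
        batch["input_ids"].append(f["input_ids"] + [pad_token_id] * n_pad)
        batch["attention_mask"].append([1] * n_real + [0] * n_pad)
        # Padded label positions get -100 so the loss skips them
        batch["labels"].append(f["labels"] + [label_pad_id] * n_pad)
    return batch


# Illustrative features; the input IDs are made up for this example
features = [
    {"input_ids": [101, 1019, 2732, 102], "labels": [-100, 3, 4, -100]},
    {"input_ids": [101, 7592, 102], "labels": [-100, 0, -100]},
]
batch = pad_batch(features)
print(batch["input_ids"])
print(batch["labels"])
```

Padding per batch rather than to a global maximum keeps short queries like these from being processed at full model length.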
NER is evaluated with seqeval, which scores entity-level precision, recall, and F1 (not just per-token accuracy):
import evaluate
import numpy as np
metric = evaluate.load('seqeval')
label_names = list(tag2index)
def compute_metrics(eval_preds):
    logits, labels = eval_preds
    predictions = np.argmax(logits, axis=-1)
    # Drop the -100 positions and convert IDs back to tag strings for seqeval
    true_labels = [[label_names[l] for l in label if l != -100] for label in labels]
    true_predictions = [[label_names[p] for p, l in zip(prediction, label) if l != -100]
                        for prediction, label in zip(predictions, labels)]
    all_metrics = metric.compute(predictions=true_predictions, references=true_labels)
    return {
        'precision': all_metrics['overall_precision'],
        'recall': all_metrics['overall_recall'],
        'f1': all_metrics['overall_f1'],
        'accuracy': all_metrics['overall_accuracy'],
    }
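The gap between entity-level and token-level scores is easier to see on a tiny example. The following is a simplified sketch of exact-match span scoring in the spirit of seqeval, not its actual implementation; `extract_entities` and `entity_scores` are hypothetical helpers, and a stray I- tag is treated as the start of a new entity.

```python
def extract_entities(tags):
    """Return a set of (label, start, end) spans from an IOB tag sequence."""
    entities, start, label = set(), None, None
    for i, tag in enumerate(list(tags) + ["O"]):  # trailing "O" flushes the last span
        continues = tag.startswith("I-") and tag[2:] == label
        if label is not None and not continues:
            entities.add((label, start, i))       # close the open span
            start, label = None, None
        if tag.startswith("B-") or (tag.startswith("I-") and label is None):
            start, label = i, tag[2:]             # open a new span
    return entities


def entity_scores(references, predictions):
    """Exact-match precision, recall, and F1 over entity spans."""
    true_spans, pred_spans = set(), set()
    for idx, (ref, pred) in enumerate(zip(references, predictions)):
        true_spans |= {(idx, *e) for e in extract_entities(ref)}
        pred_spans |= {(idx, *e) for e in extract_entities(pred)}
    correct = len(true_spans & pred_spans)
    precision = correct / len(pred_spans) if pred_spans else 0.0
    recall = correct / len(true_spans) if true_spans else 0.0
    f1 = 2 * precision * recall / (precision + recall) if correct else 0.0
    return precision, recall, f1


reference = [["B-Rating", "I-Rating", "O", "O", "B-Location", "I-Location"]]
prediction = [["B-Rating", "I-Rating", "O", "O", "B-Location", "O"]]

token_accuracy = sum(
    r == p for ref, pred in zip(reference, prediction) for r, p in zip(ref, pred)
) / sum(len(ref) for ref in reference)
print("token accuracy:", token_accuracy)
print("entity P/R/F1:", entity_scores(reference, prediction))
```

One wrong token out of six leaves token accuracy high, but the truncated Location span no longer matches the reference, so only one of the two entities counts as correct. This is why the entity-level F1 reported during training sits well below the token accuracy.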
Training
Load DistilBERT with a token-classification head, passing the label mappings:
from transformers import AutoModelForTokenClassification, TrainingArguments, Trainer
model = AutoModelForTokenClassification.from_pretrained(model_ckpt, id2label=index2tag, label2id=tag2index)
args = TrainingArguments(
    "finetuned-ner",
    evaluation_strategy='epoch',  # named eval_strategy in newer transformers releases
    save_strategy='epoch',
    learning_rate=2e-5,
    num_train_epochs=3,
    weight_decay=0.01
)
trainer = Trainer(
    model=model,
    args=args,
    train_dataset=tokenized_dataset['train'],
    eval_dataset=tokenized_dataset['validation'],
    data_collator=data_collator,
    compute_metrics=compute_metrics,
    tokenizer=tokenizer  # newer transformers releases use processing_class=tokenizer
)
trainer.train()
{'eval_loss': 0.3013, 'eval_precision': 0.7355, 'eval_recall': 0.7863, 'eval_f1': 0.7600, 'eval_accuracy': 0.9096, 'epoch': 1.0}
{'eval_loss': 0.2850, 'eval_precision': 0.7850, 'eval_recall': 0.8044, 'eval_f1': 0.7946, 'eval_accuracy': 0.9174, 'epoch': 2.0}
{'eval_loss': 0.2847, 'eval_precision': 0.7777, 'eval_recall': 0.8121, 'eval_f1': 0.7945, 'eval_accuracy': 0.9184, 'epoch': 3.0}
After three epochs the model reaches an entity-level F1 of ~0.79 and a token accuracy of ~92%.
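As a quick sanity check, F1 is the harmonic mean of precision and recall, so the logged values can be recomputed from each other. This sketch uses only the numbers from the evaluation logs above; because those are rounded to four decimals, the recomputed F1 should agree only approximately.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# (precision, recall, logged F1) per epoch, copied from the logs above
logged = [
    (0.7355, 0.7863, 0.7600),
    (0.7850, 0.8044, 0.7946),
    (0.7777, 0.8121, 0.7945),
]
for epoch, (p, r, f1) in enumerate(logged, start=1):
    print(f"epoch {epoch}: recomputed F1 = {f1_score(p, r):.4f}, logged F1 = {f1}")
```

The logs also show that F1 barely moves between the second and third epoch, while precision dips slightly and recall rises.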
Prediction
Save the model and load it as a token-classification pipeline. The aggregation_strategy='simple' option groups adjacent tokens that share an entity type into a single span, so subword pieces and multi-word entities such as "new york" come back as one result:
from transformers import pipeline
trainer.save_model("ner_distilbert")
pipe = pipeline('token-classification', model="ner_distilbert", aggregation_strategy='simple')
pipe("which restaurant serves the best shushi in new york?")
[{'entity_group': 'Rating', 'score': 0.9804273, 'word': 'best', 'start': 28, 'end': 32}, {'entity_group': 'Dish', 'score': 0.830101, 'word': 'shushi', 'start': 33, 'end': 39}, {'entity_group': 'Location', 'score': 0.8655802, 'word': 'new york', 'start': 43, 'end': 51}]
The model correctly tags "best" as a Rating, "shushi" as a Dish (even with the typo), and "new york" as a Location — exactly the structured output a restaurant search engine needs.

The fine-tuned pipeline turns a free-text query into structured entities with confidence scores.
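A search backend would typically regroup this output by entity type. The sketch below shows one possible way to do that with the pipeline result copied from above; `to_query_filters` and its `min_score` threshold are hypothetical and not part of the transformers API.

```python
from collections import defaultdict


def to_query_filters(entities, min_score=0.5):
    """Group pipeline entities by type, keeping those above a score threshold."""
    filters = defaultdict(list)
    for ent in entities:
        if ent["score"] >= min_score:
            filters[ent["entity_group"]].append(ent["word"])
    return dict(filters)


query = "which restaurant serves the best shushi in new york?"
# Pipeline output copied from the example above
entities = [
    {'entity_group': 'Rating', 'score': 0.9804273, 'word': 'best', 'start': 28, 'end': 32},
    {'entity_group': 'Dish', 'score': 0.830101, 'word': 'shushi', 'start': 33, 'end': 39},
    {'entity_group': 'Location', 'score': 0.8655802, 'word': 'new york', 'start': 43, 'end': 51},
]

filters = to_query_filters(entities)
print(filters)

# The start/end offsets index directly into the original query string
for ent in entities:
    print(ent["entity_group"], "->", query[ent["start"]:ent["end"]])
```

Using the character offsets rather than the `word` field is a reasonable choice when you need the user's original casing, since this uncased model reports lowercased words.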
Summary
You fine-tuned DistilBERT for token-level NER on restaurant queries, reaching an entity-level F1 of ~0.79 and ~92% token accuracy. The new skill here is subword label alignment with word_ids and the -100 sentinel, a standard technique for token-classification tasks, plus entity-level evaluation with seqeval.
Next, you shift from understanding tasks to generation: fine-tuning T5 for custom text summarization.