Fine-Tuning DistilBERT for Restaurant Search NER

Fine-tune DistilBERT for named entity recognition on restaurant search queries, extracting cuisines, locations, ratings, and dishes with IOB tagging and seqeval.

Jun 18, 2026 · 12 min read

Topics You Will Master

What named entity recognition (NER) is and how IOB tagging works
Parsing a BIO-format dataset into tokens and tags
Aligning word-level labels to subword tokens with word_ids
Training a token-classification model with DataCollatorForTokenClassification

So far you have classified whole sentences. Named entity recognition (NER) is different — it labels individual tokens, identifying spans like people, places, and organizations. In this tutorial you fine-tune DistilBERT to parse restaurant search queries, pulling out cuisines, locations, ratings, dishes, and amenities.

This is a token classification task, and it introduces one new wrinkle: aligning your word-level labels with the model's subword tokens.

Prerequisites: The BERT fine-tuning workflow and a Python environment with transformers, datasets, evaluate, seqeval, and torch. The code below also imports requests, pandas, and numpy.

What Is NER?

Named Entity Recognition identifies and classifies named entities — people, places, organizations, dates, and more — in text.

It works token by token, in five conceptual steps:

  1. Text input — a sentence, paragraph, or document.
  2. Tokenization — split the text into individual tokens.
  3. Entity recognition — find spans of tokens that form entities.
  4. Entity classification — assign each entity a category (location, dish, rating, etc.).
  5. Output — a structured representation of the text with its labeled entities.

Tip

The displaCy entity visualizer is a great way to see NER output rendered over text.


IOB Tagging

NER labels are usually stored in IOB format (Inside–Outside–Beginning), which marks where each entity starts and ends:

  • B- (Beginning) — the first token of an entity.
  • I- (Inside) — a token continuing the entity.
  • O (Outside) — a token that is not part of any entity.

So "new york" as a location becomes B-Location I-Location, and a filler word like "in" is O. You can read the full convention on the IOB tagging Wikipedia page.

Diagram of IOB tagging: a restaurant query with B-, I-, and O tags over each token

IOB tagging marks the Beginning, Inside, and Outside of each entity span, token by token.
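
To make the convention concrete, here is a minimal sketch that groups IOB-tagged tokens back into entity spans. The helper name iob_to_spans and the sample query are illustrative assumptions, not part of the dataset tooling, and stray I- tags are simply dropped here.

```python
def iob_to_spans(tokens, tags):
    """Group IOB-tagged tokens into (entity_type, text) spans."""
    spans = []
    current_type, current_tokens = None, []
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            # A B- tag always opens a new entity, closing any open one.
            if current_tokens:
                spans.append((current_type, " ".join(current_tokens)))
            current_type, current_tokens = tag[2:], [token]
        elif tag.startswith("I-") and current_type == tag[2:]:
            # An I- tag of the same type extends the open entity.
            current_tokens.append(token)
        else:
            # O (or a stray I- tag) closes the open entity.
            if current_tokens:
                spans.append((current_type, " ".join(current_tokens)))
            current_type, current_tokens = None, []
    if current_tokens:
        # Flush an entity that runs to the end of the sentence.
        spans.append((current_type, " ".join(current_tokens)))
    return spans


# Hypothetical query, tagged by hand for illustration.
tokens = ["sushi", "places", "in", "new", "york"]
tags = ["B-Dish", "O", "O", "B-Location", "I-Location"]
print(iob_to_spans(tokens, tags))
# [('Dish', 'sushi'), ('Location', 'new york')]
```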


The Dataset

This tutorial uses the MIT Restaurant dataset, which labels search queries with entities like Rating, Amenity, Location, Restaurant_Name, Price, Hours, Dish, and Cuisine. The data is in BIO format (another name for IOB): each line holds a tag and a token separated by a tab, tag first, and blank lines separate sentences.

Load and parse the training file into lists of tokens and tags:

PYTHON
import requests

url = "https://raw.githubusercontent.com/laxmimerit/All-CSV-ML-Data-Files-Download/master/mit_restaurant_search_ner/train.bio"
response = requests.get(url)
response.raise_for_status()  # fail early on a bad download
lines = response.text.splitlines()

train_tokens = []
train_tags = []
temp_tokens, temp_tags = [], []

for line in lines:
    if line != "":
        # Each non-blank line is "tag<TAB>token"
        tag, token = line.strip().split("\t")
        temp_tags.append(tag)
        temp_tokens.append(token)
    else:
        # A blank line closes the current sentence
        train_tokens.append(temp_tokens)
        train_tags.append(temp_tags)
        temp_tokens, temp_tags = [], []

# Note: a final sentence is only stored if the file ends with a blank line.
len(train_tokens), len(train_tags)
OUTPUT
(7659, 7659)

Repeat the same parsing for the test.bio file, storing the results in test_tokens and test_tags. It yields 1,520 examples.
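
One way to avoid copying the loop for each file is to wrap it in a small helper. The sketch below is an assumption about how you might structure it: parse_bio is a hypothetical name and the sample text is made up. Unlike the loop above, it strips each line before testing it, skips repeated blank lines, and flushes a final sentence that has no trailing blank line, so its counts can differ slightly from the numbers shown.

```python
def parse_bio(lines):
    """Split BIO lines ("tag<TAB>token") into per-sentence token and tag lists."""
    all_tokens, all_tags = [], []
    tokens, tags = [], []
    for line in lines:
        line = line.strip()
        if line:
            tag, token = line.split("\t")
            tags.append(tag)
            tokens.append(token)
        elif tokens:
            # A blank line ends the current sentence.
            all_tokens.append(tokens)
            all_tags.append(tags)
            tokens, tags = [], []
    if tokens:
        # Flush a final sentence that is not followed by a blank line.
        all_tokens.append(tokens)
        all_tags.append(tags)
    return all_tokens, all_tags


# Made-up two-sentence sample in the same tag<TAB>token layout.
sample = "B-Rating\t5\nI-Rating\tstar\nO\tresturants\n\nB-Cuisine\tthai\nO\tfood"
sample_tokens, sample_tags = parse_bio(sample.splitlines())
print(sample_tokens)
# [['5', 'star', 'resturants'], ['thai', 'food']]
```

With the downloaded text of each file in hand, the same call would produce train_tokens, train_tags, test_tokens, and test_tags.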


Building the Hugging Face Dataset

Convert the parsed lists into a DatasetDict. Here the test set doubles as validation, so the scores reported during training are test-set scores:

PYTHON
import pandas as pd
from datasets import Dataset, DatasetDict

df = pd.DataFrame({'tokens': train_tokens, 'ner_tags_str': train_tags})
train = Dataset.from_pandas(df)

df = pd.DataFrame({'tokens': test_tokens, 'ner_tags_str': test_tags})
test = Dataset.from_pandas(df)

dataset = DatasetDict({'train': train, 'test': test, 'validation': test})
dataset['train'][0]
OUTPUT
{'tokens': ['2', 'start', 'restaurants', 'with', 'inside', 'dining'], 'ner_tags_str': ['B-Rating', 'I-Rating', 'O', 'O', 'B-Amenity', 'I-Amenity']}

Build numeric tag mappings from the unique entity types, generating a B- and I- entry for each:

PYTHON
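# Collect every distinct tag string in the training set
unique_tags = set()
for tags in dataset['train']['ner_tags_str']:
    unique_tags.update(tags)

# Strip the B-/I- prefix to get the entity types.
# Set order is not stable across runs, so the integer IDs (and the label
# outputs shown later) can differ; wrap this in sorted() for a fixed order.
unique_tags = list(set([x[2:] for x in list(unique_tags) if x != 'O']))

tag2index = {"O": 0}
for tag in unique_tags:
    tag2index[f'B-{tag}'] = len(tag2index)
    tag2index[f'I-{tag}'] = len(tag2index)

index2tag = {v: k for k, v in tag2index.items()}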
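
Because the entity types above come from a set of strings, their order can change from one Python run to the next. If you want the same IDs every time, a sorted variant is one option. This is a sketch under that assumption: build_tag_maps and the demo data are hypothetical names, and the resulting IDs will not match the outputs shown in this tutorial.

```python
def build_tag_maps(tag_sequences):
    """Build tag<->index maps with a stable, alphabetically sorted entity order."""
    entity_types = sorted(
        {tag[2:] for seq in tag_sequences for tag in seq if tag != "O"}
    )
    tag2index = {"O": 0}
    for entity in entity_types:
        # Each entity type gets a consecutive B-/I- pair of IDs.
        tag2index[f"B-{entity}"] = len(tag2index)
        tag2index[f"I-{entity}"] = len(tag2index)
    index2tag = {index: tag for tag, index in tag2index.items()}
    return tag2index, index2tag


# Tiny made-up sample standing in for dataset['train']['ner_tags_str'].
demo_tags = [["B-Rating", "I-Rating", "O"], ["O", "B-Location", "I-Location"]]
demo_tag2index, demo_index2tag = build_tag_maps(demo_tags)
print(demo_tag2index)
# {'O': 0, 'B-Location': 1, 'I-Location': 2, 'B-Rating': 3, 'I-Rating': 4}
```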

Map the string tags to their integer IDs:

PYTHON
dataset = dataset.map(lambda example: {"ner_tags": [tag2index[tag] for tag in example['ner_tags_str']]})

Aligning Labels with Subword Tokens

Here is the key challenge of token classification. The tokenizer may split one word into several subwords, but you only have one label per word. You must align them.

Load the tokenizer and see the problem:

PYTHON
from transformers import AutoTokenizer

model_ckpt = "distilbert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_ckpt)

# Avoid naming a variable "input": it shadows the Python built-in
words = dataset['train'][2]['tokens']
encoding = tokenizer(words, is_split_into_words=True)
tokenizer.convert_ids_to_tokens(encoding.input_ids)
OUTPUT
['[CLS]', '5', 'star', 'rest', '##ura', '##nts', 'in', 'my', 'town', '[SEP]']

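The word "resturants" (misspelled in the dataset) became three subword tokens. The fix: label only the first subword of each word, and assign -100 to the remaining subwords and to special tokens. PyTorch's cross-entropy loss skips targets equal to -100 (its default ignore_index), so those positions do not contribute to the loss.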

PYTHON
def tokenize_and_align_labels(examples):
    tokenized_inputs = tokenizer(examples['tokens'], truncation=True, is_split_into_words=True)
    labels = []
    for i, label in enumerate(examples['ner_tags']):
        word_ids = tokenized_inputs.word_ids(batch_index=i)
        previous_word_idx = None
        label_ids = []
        for word_idx in word_ids:
            if word_idx is None:
                label_ids.append(-100)
            elif word_idx != previous_word_idx:
                label_ids.append(label[word_idx])
            else:
                label_ids.append(-100)
            previous_word_idx = word_idx
        labels.append(label_ids)
    tokenized_inputs['labels'] = labels
    return tokenized_inputs

tokenized_dataset = dataset.map(tokenize_and_align_labels, batched=True)
tokenized_dataset['train'][2]['labels']
OUTPUT
[-100, 3, 4, 0, -100, -100, 5, 6, 6, -100]
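
To see the alignment rule in isolation, without loading a tokenizer, the sketch below applies it to a hand-written word_ids list. The list mirrors the tokenization shown earlier ([CLS], 5, star, rest, ##ura, ##nts, in, my, town, [SEP]), and the word labels are the ones implied by the output above. The function name align_labels is a hypothetical stand-in for the inner loop of tokenize_and_align_labels.

```python
def align_labels(word_ids, word_labels, ignore_index=-100):
    """Expand word-level labels to subword positions, masking the extras."""
    aligned = []
    previous_word_id = None
    for word_id in word_ids:
        if word_id is None:
            # Special tokens such as [CLS] and [SEP] have no word.
            aligned.append(ignore_index)
        elif word_id != previous_word_id:
            # First subword of a word keeps the word's label.
            aligned.append(word_labels[word_id])
        else:
            # Later subwords of the same word are masked.
            aligned.append(ignore_index)
        previous_word_id = word_id
    return aligned


word_ids = [None, 0, 1, 2, 2, 2, 3, 4, 5, None]
word_labels = [3, 4, 0, 5, 6, 6]
print(align_labels(word_ids, word_labels))
# [-100, 3, 4, 0, -100, -100, 5, 6, 6, -100]
```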

Important

The -100 sentinel is what makes subword alignment work. Without it, the loss would penalize the model on subword pieces that have no real label, corrupting training.
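
The effect of the sentinel on the loss can be illustrated in plain Python. This is a simplified sketch that mimics the masking behavior of a mean cross-entropy with ignore_index=-100; it is not the PyTorch implementation, and masked_cross_entropy and the toy logits are assumptions made for illustration.

```python
import math


def masked_cross_entropy(logits, labels, ignore_index=-100):
    """Mean cross-entropy over positions whose label is not ignore_index."""
    total, count = 0.0, 0
    for row, label in zip(logits, labels):
        if label == ignore_index:
            continue  # masked position: contributes nothing to the loss
        # Numerically stable log-sum-exp of the row.
        max_logit = max(row)
        log_norm = max_logit + math.log(sum(math.exp(x - max_logit) for x in row))
        total += log_norm - row[label]
        count += 1
    # Returning 0.0 for a fully masked batch is a choice made for this sketch.
    return total / count if count else 0.0


# Three token positions, two classes. The third position is a masked subword.
logits = [[2.0, 0.0], [0.0, 2.0], [5.0, -5.0]]
loss_masked = masked_cross_entropy(logits, [0, 1, -100])
# Pretend the masked subword had been given an arbitrary label instead.
loss_unmasked = masked_cross_entropy(logits, [0, 1, 1])

print(round(loss_masked, 4))  # 0.1269
print(loss_unmasked > loss_masked)  # True
```

The third position changes the result only when it carries a real label, which is why unlabeled subword pieces need the sentinel.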

Diagram of label alignment: a word split into subwords, with the first subword labeled and the rest set to -100

Only the first subword of each word keeps its label; the rest are masked with -100 and ignored by the loss.


Data Collation and Metrics

A DataCollatorForTokenClassification pads the inputs and the labels to the longest sequence in each batch, filling the padded label positions with -100 so the loss ignores them:

PYTHON
from transformers import DataCollatorForTokenClassification

data_collator = DataCollatorForTokenClassification(tokenizer=tokenizer)
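
Conceptually, the collation step looks like the sketch below. It is a simplified illustration using plain lists and right-padding, not the library's code: the real collator returns tensors and supports more options. The name pad_batch and the example IDs are assumptions.

```python
def pad_batch(batch_input_ids, batch_labels, pad_token_id=0, label_pad_id=-100):
    """Right-pad inputs and labels in a batch to the longest sequence."""
    max_len = max(len(ids) for ids in batch_input_ids)
    padded_ids, padded_labels, attention_mask = [], [], []
    for ids, labels in zip(batch_input_ids, batch_labels):
        pad = max_len - len(ids)
        padded_ids.append(ids + [pad_token_id] * pad)
        # Labels are padded with -100 so padding never counts toward the loss.
        padded_labels.append(labels + [label_pad_id] * pad)
        attention_mask.append([1] * len(ids) + [0] * pad)
    return padded_ids, padded_labels, attention_mask


# Illustrative token IDs only; two sequences of different length.
input_ids = [[101, 7, 8, 102], [101, 9, 102]]
labels = [[-100, 1, 2, -100], [-100, 0, -100]]
padded_ids, padded_labels, attention_mask = pad_batch(input_ids, labels)
print(padded_labels)
# [[-100, 1, 2, -100], [-100, 0, -100, -100]]
```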

NER is evaluated with seqeval, which scores entity-level precision, recall, and F1 (not just per-token accuracy):

PYTHON
import evaluate
import numpy as np

metric = evaluate.load('seqeval')
label_names = list(tag2index)

def compute_metrics(eval_preds):
    logits, labels = eval_preds
    predictions = np.argmax(logits, axis=-1)
    true_labels = [[label_names[l] for l in label if l != -100] for label in labels]
    true_predictions = [[label_names[p] for p, l in zip(prediction, label) if l != -100]
                        for prediction, label in zip(predictions, labels)]
    all_metrics = metric.compute(predictions=true_predictions, references=true_labels)
    return {
        "precision": all_metrics['overall_precision'],
        'recall': all_metrics['overall_recall'],
        'f1': all_metrics['overall_f1'],
        'accuracy': all_metrics['overall_accuracy'],
    }
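
The difference between entity-level and token-level scoring is easy to show with a toy example. The sketch below is a simplified stand-in for what seqeval computes, written under the assumption that an entity counts as correct only when its type and both boundaries match exactly. seqeval handles edge cases such as stray I- tags differently, and the function names and sample tags here are hypothetical.

```python
def extract_entities(tags):
    """Return a set of (entity_type, start, end) spans from one IOB sequence."""
    entities, start, current = set(), None, None
    for i, tag in enumerate(tags + ["O"]):  # trailing "O" flushes the last span
        continues = tag.startswith("I-") and tag[2:] == current
        if current is not None and not continues:
            entities.add((current, start, i))
            current = None
        if tag.startswith("B-"):
            start, current = i, tag[2:]
    return entities


def entity_prf(references, predictions):
    """Entity-level precision, recall, and F1 with exact span matching."""
    true_entities, pred_entities = set(), set()
    for idx, (ref, pred) in enumerate(zip(references, predictions)):
        true_entities |= {(idx, *e) for e in extract_entities(ref)}
        pred_entities |= {(idx, *e) for e in extract_entities(pred)}
    correct = len(true_entities & pred_entities)
    precision = correct / len(pred_entities) if pred_entities else 0.0
    recall = correct / len(true_entities) if true_entities else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# One sentence where the prediction cuts the Location span short.
references = [["B-Rating", "I-Rating", "O", "B-Location", "I-Location"]]
predictions = [["B-Rating", "I-Rating", "O", "B-Location", "O"]]

token_accuracy = sum(
    r == p for r, p in zip(references[0], predictions[0])
) / len(references[0])
print(token_accuracy)  # 0.8
print(entity_prf(references, predictions))  # (0.5, 0.5, 0.5)
```

Four of five tokens are right, yet only one of two entities matches exactly, which is why the token accuracy reported during training sits well above the entity-level F1.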

Training

Load DistilBERT with a token-classification head, passing the label mappings:

PYTHON
from transformers import AutoModelForTokenClassification, TrainingArguments, Trainer

model = AutoModelForTokenClassification.from_pretrained(model_ckpt, id2label=index2tag, label2id=tag2index)

args = TrainingArguments(
    "finetuned-ner",                 # output directory for checkpoints
    evaluation_strategy='epoch',     # newer transformers releases rename this to eval_strategy
    save_strategy='epoch',
    learning_rate=2e-5,
    num_train_epochs=3,
    weight_decay=0.01
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=tokenized_dataset['train'],
    eval_dataset=tokenized_dataset['validation'],
    data_collator=data_collator,
    compute_metrics=compute_metrics,
    tokenizer=tokenizer              # newer transformers releases use processing_class instead
)

trainer.train()
OUTPUT
{'eval_loss': 0.3013, 'eval_precision': 0.7355, 'eval_recall': 0.7863, 'eval_f1': 0.7600, 'eval_accuracy': 0.9096, 'epoch': 1.0}
{'eval_loss': 0.2850, 'eval_precision': 0.7850, 'eval_recall': 0.8044, 'eval_f1': 0.7946, 'eval_accuracy': 0.9174, 'epoch': 2.0}
{'eval_loss': 0.2847, 'eval_precision': 0.7777, 'eval_recall': 0.8121, 'eval_f1': 0.7945, 'eval_accuracy': 0.9184, 'epoch': 3.0}

After three epochs the model reaches an entity-level F1 of ~0.79 and a token accuracy of ~92%.
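
As a quick sanity check, the reported F1 values should equal the harmonic mean of the reported precision and recall. The short sketch below recomputes them from the rounded numbers in the log above, so small rounding differences are expected. The helper name f1_from is hypothetical.

```python
def f1_from(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# (precision, recall, reported F1) for epochs 1 to 3, copied from the log above.
reported = [
    (0.7355, 0.7863, 0.7600),
    (0.7850, 0.8044, 0.7946),
    (0.7777, 0.8121, 0.7945),
]

for epoch, (precision, recall, f1_reported) in enumerate(reported, start=1):
    f1 = f1_from(precision, recall)
    print(f"epoch {epoch}: recomputed F1 = {f1:.4f}, reported = {f1_reported:.4f}")
```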


Prediction

Save the model and load it as a token-classification pipeline. The aggregation_strategy='simple' option groups adjacent tokens that share an entity type into a single span, so the output lists whole entities rather than per-token tags:

PYTHON
from transformers import pipeline

trainer.save_model("ner_distilbert")

pipe = pipeline('token-classification', model="ner_distilbert", aggregation_strategy='simple')
pipe("which restaurant serves the best shushi in new york?")
OUTPUT
[{'entity_group': 'Rating', 'score': 0.9804273, 'word': 'best', 'start': 28, 'end': 32}, {'entity_group': 'Dish', 'score': 0.830101, 'word': 'shushi', 'start': 33, 'end': 39}, {'entity_group': 'Location', 'score': 0.8655802, 'word': 'new york', 'start': 43, 'end': 51}]

The model correctly tags "best" as a Rating, "shushi" as a Dish (even with the typo), and "new york" as a Location — exactly the structured output a restaurant search engine needs.

Diagram of the NER pipeline turning a query into structured entities: Rating, Dish, Location

The fine-tuned pipeline turns a free-text query into structured entities with confidence scores.
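
A search backend would typically reshape this list into filters keyed by entity type. The sketch below shows one possible way to do that. The function entities_to_filters and the min_score threshold are assumptions made for illustration, and the input is the pipeline output shown above.

```python
from collections import defaultdict


def entities_to_filters(entities, min_score=0.5):
    """Group pipeline entities by type, dropping low-confidence ones."""
    filters = defaultdict(list)
    for entity in entities:
        if entity["score"] >= min_score:
            filters[entity["entity_group"]].append(entity["word"])
    return dict(filters)


# Pipeline output copied from the example above.
pipeline_output = [
    {"entity_group": "Rating", "score": 0.9804273, "word": "best", "start": 28, "end": 32},
    {"entity_group": "Dish", "score": 0.830101, "word": "shushi", "start": 33, "end": 39},
    {"entity_group": "Location", "score": 0.8655802, "word": "new york", "start": 43, "end": 51},
]

print(entities_to_filters(pipeline_output))
# {'Rating': ['best'], 'Dish': ['shushi'], 'Location': ['new york']}
```

Raising min_score trades recall for precision: with min_score=0.85, for example, the Dish entity above would be dropped.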


Summary

You fine-tuned DistilBERT for token-level NER on restaurant queries, reaching an entity-level F1 of ~0.79 and ~92% token accuracy. The new skill here is subword label alignment with word_ids and the -100 sentinel, a standard technique for token-classification tasks, plus entity-level evaluation with seqeval.

Next, you shift from understanding tasks to generation: fine-tuning T5 for custom text summarization.
