Fine-Tuning DistilBERT for Restaurant Search NER

So far we have classified whole sentences. Named entity recognition (NER) is different. It labels individual tokens and finds spans like people, places, and organizations. In this post, we fine-tune DistilBERT to parse restaurant search queries. The model pulls out cuisines, locations, ratings, dishes, and amenities.

This is a token classification task. It adds one new twist. We must align our word-level labels with the model's subword tokens.

Prerequisites: The BERT fine-tuning workflow and a Python environment with transformers, datasets, evaluate, seqeval, and torch. The code also imports requests, pandas, and numpy.

What Is NER?

Named Entity Recognition finds and classifies named entities in text. These include people, places, organizations, dates, and more.

It works token by token, in five steps:

Text input: a sentence, paragraph, or document.
Tokenization: split the text into individual tokens.
Entity recognition: find spans of tokens that form entities.
Entity classification: assign each entity a category (location, dish, rating, etc.).
Output: a structured representation of the text with its labeled entities.

Tip

The displaCy entity visualizer is a great way to see NER output drawn over text.

IOB Tagging

NER labels are usually stored in IOB format (Inside-Outside-Beginning).

It marks where each entity starts and ends:

B- (Beginning) is the first token of an entity.
I- (Inside) is a token that continues the entity.
O (Outside) is a token that is not part of any entity.

So new york as a location becomes B-Location I-Location. A filler word like in is O. We can read the full convention on the IOB tagging Wikipedia page.

Diagram of IOB tagging: a restaurant query with B-, I-, and O tags over each token

IOB tagging marks the Beginning, Inside, and Outside of each entity span, token by token.
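
To make the tagging scheme concrete, here is a minimal sketch that turns a tagged query back into entity spans. The helper name iob_to_spans and the sample query are illustrative assumptions, not part of any library or of the dataset.

```python
def iob_to_spans(tokens, tags):
    """Group IOB-tagged tokens into (entity_type, text) spans."""
    spans = []
    current_type, current_tokens = None, []
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            # A B- tag always opens a new span, closing any open one first.
            if current_tokens:
                spans.append((current_type, " ".join(current_tokens)))
            current_type, current_tokens = tag[2:], [token]
        elif tag.startswith("I-") and current_type == tag[2:]:
            # An I- tag of the same type extends the open span.
            current_tokens.append(token)
        else:
            # O (or an I- tag with no matching open span) closes any open span.
            if current_tokens:
                spans.append((current_type, " ".join(current_tokens)))
            current_type, current_tokens = None, []
    if current_tokens:
        # Close a span that runs to the end of the sentence.
        spans.append((current_type, " ".join(current_tokens)))
    return spans


tokens = ["5", "star", "restaurants", "in", "new", "york"]
tags = ["B-Rating", "I-Rating", "O", "O", "B-Location", "I-Location"]
spans = iob_to_spans(tokens, tags)
print(spans)  # [('Rating', '5 star'), ('Location', 'new york')]
```

The B- prefix is what lets two entities of the same type sit next to each other without merging into one span.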

The Dataset

This tutorial uses the MIT Restaurant dataset. It labels search queries with entities like Rating, Amenity, Location, Restaurant_Name, Price, Hours, Dish, and Cuisine. The data is in BIO format, which is the same scheme as IOB. Each line holds a tag and a token separated by a tab, with blank lines between sentences.

We load and parse the training file into lists of tokens and tags:

PYTHON

import requests

url = "https://raw.githubusercontent.com/laxmimerit/All-CSV-ML-Data-Files-Download/master/mit_restaurant_search_ner/train.bio"
response = requests.get(url)
response.raise_for_status()  # stop early if the download failed
lines = response.text.splitlines()

train_tokens = []
train_tags = []
temp_tokens, temp_tags = [], []

for line in lines:
    if line != "":
        # Each non-blank line is "tag<TAB>token"
        tag, token = line.strip().split("\t")
        temp_tags.append(tag)
        temp_tokens.append(token)
    else:
        # A blank line ends the current sentence
        train_tokens.append(temp_tokens)
        train_tags.append(temp_tags)
        temp_tokens, temp_tags = [], []

# Note: a sentence is stored only when a blank line follows it
len(train_tokens), len(train_tags)

OUTPUT

(7659, 7659)

We repeat the same parsing for the test.bio file. It yields 1,520 examples.
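
The parsing loop can be wrapped in a function so the same code serves train.bio and test.bio. The sketch below is one possible version, not the code that produced the counts above. parse_bio is a hypothetical helper name and the sample lines are made up. Unlike the loop above, it also keeps a final sentence that has no trailing blank line and skips repeated blank lines, so its counts could differ slightly from the ones reported here.

```python
def parse_bio(lines):
    """Parse 'tag<TAB>token' lines into parallel lists of token and tag sequences."""
    all_tokens, all_tags = [], []
    tokens, tags = [], []
    for line in lines:
        line = line.strip()
        if line:
            tag, token = line.split("\t")
            tags.append(tag)
            tokens.append(token)
        elif tokens:
            # A blank line closes the current sentence (empty sentences are skipped).
            all_tokens.append(tokens)
            all_tags.append(tags)
            tokens, tags = [], []
    if tokens:
        # Flush the last sentence when the input does not end with a blank line.
        all_tokens.append(tokens)
        all_tags.append(tags)
    return all_tokens, all_tags


# Made-up sample in the same tag<TAB>token layout as the dataset files.
sample = "B-Rating\t5\nI-Rating\tstar\nO\trestaurants\n\nB-Cuisine\tthai\nO\tfood".splitlines()
sample_tokens, sample_tags = parse_bio(sample)
print(sample_tokens)
print(sample_tags)
```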

Building the Hugging Face Dataset

We convert the parsed lists into a DatasetDict. Here the test set also acts as validation:

PYTHON

import pandas as pd
from datasets import Dataset, DatasetDict

df = pd.DataFrame({'tokens': train_tokens, 'ner_tags_str': train_tags})
train = Dataset.from_pandas(df)

df = pd.DataFrame({'tokens': test_tokens, 'ner_tags_str': test_tags})
test = Dataset.from_pandas(df)

dataset = DatasetDict({'train': train, 'test': test, 'validation': test})
dataset['train'][0]

OUTPUT

{'tokens': ['2', 'start', 'restaurants', 'with', 'inside', 'dining'], 'ner_tags_str': ['B-Rating', 'I-Rating', 'O', 'O', 'B-Amenity', 'I-Amenity']}

We build numeric tag mappings from the unique entity types. We make a B- and I- entry for each:

PYTHON

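# Collect every distinct tag string that appears in the training set
unique_tags = set()
for tags in dataset['train']['ner_tags_str']:
    unique_tags.update(tags)

# Strip the B-/I- prefix to get the entity types (set order can differ between runs)
unique_tags = list({x[2:] for x in unique_tags if x != 'O'})

# O gets ID 0; each entity type then gets a consecutive B- and I- ID
tag2index = {"O": 0}
for tag in unique_tags:
    tag2index[f'B-{tag}'] = len(tag2index)
    tag2index[f'I-{tag}'] = len(tag2index)

# Reverse mapping, used later to turn predicted IDs back into tag strings
index2tag = {v: k for k, v in tag2index.items()}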

Now we map the string tags to their integer IDs:

PYTHON

dataset = dataset.map(lambda example: {"ner_tags": [tag2index[tag] for tag in example['ner_tags_str']]})
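
Because the entity types come from a Python set, the tag IDs can change from one run to the next. If reproducible IDs matter, sorting the entity types is one option. The sketch below assumes a hypothetical helper build_tag_maps and a tiny made-up tag list. With sorted types the IDs will not match the label outputs shown in this post.

```python
def build_tag_maps(tag_sequences):
    """Build tag-to-ID and ID-to-tag maps with a stable, sorted entity order."""
    entity_types = sorted(
        {tag[2:] for tags in tag_sequences for tag in tags if tag != "O"}
    )
    tag2index = {"O": 0}
    for entity in entity_types:
        tag2index[f"B-{entity}"] = len(tag2index)
        tag2index[f"I-{entity}"] = len(tag2index)
    index2tag = {index: tag for tag, index in tag2index.items()}
    return tag2index, index2tag


sample_tag_sequences = [
    ["B-Rating", "I-Rating", "O"],
    ["B-Location", "I-Location", "O"],
]
stable_tag2index, stable_index2tag = build_tag_maps(sample_tag_sequences)
print(stable_tag2index)
# {'O': 0, 'B-Location': 1, 'I-Location': 2, 'B-Rating': 3, 'I-Rating': 4}
```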

Aligning Labels with Subword Tokens

Here is the main challenge of token classification. The tokenizer may split one word into several subwords. But we have only one label per word. So we must align them.

We load the tokenizer and see the problem:

PYTHON

from transformers import AutoTokenizer

model_ckpt = "distilbert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_ckpt)

# Named words rather than input, so the Python builtin input() is not shadowed
words = dataset['train'][2]['tokens']
output = tokenizer(words, is_split_into_words=True)
tokenizer.convert_ids_to_tokens(output.input_ids)

OUTPUT

['[CLS]', '5', 'star', 'rest', '##ura', '##nts', 'in', 'my', 'town', '[SEP]']

The word resturants (misspelled in the dataset) became three subword tokens: rest, ##ura, and ##nts. The fix is simple. We label only the first subword of each word. We assign -100 to the rest, and to the special tokens. The loss skips every position labeled -100, because -100 is the default ignore_index of the PyTorch cross-entropy loss that the model uses.

PYTHON

def tokenize_and_align_labels(examples):
    tokenized_inputs = tokenizer(examples['tokens'], truncation=True, is_split_into_words=True)
    labels = []
    for i, label in enumerate(examples['ner_tags']):
        word_ids = tokenized_inputs.word_ids(batch_index=i)
        previous_word_idx = None
        label_ids = []
        for word_idx in word_ids:
            if word_idx is None:
                label_ids.append(-100)
            elif word_idx != previous_word_idx:
                label_ids.append(label[word_idx])
            else:
                label_ids.append(-100)
            previous_word_idx = word_idx
        labels.append(label_ids)
    tokenized_inputs['labels'] = labels
    return tokenized_inputs

tokenized_dataset = dataset.map(tokenize_and_align_labels, batched=True)
tokenized_dataset['train'][2]['labels']

OUTPUT

[-100, 3, 4, 0, -100, -100, 5, 6, 6, -100]
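
The alignment rule can be checked without loading a tokenizer. In the sketch below, the word_ids list is written by hand to mirror the tokenization shown earlier, and align_labels is a hypothetical standalone version of the inner loop. The word labels use the IDs from the output above.

```python
def align_labels(word_ids, word_labels, ignore_index=-100):
    """Label the first subword of each word; mask everything else."""
    aligned = []
    previous_word_id = None
    for word_id in word_ids:
        if word_id is None:
            # Special tokens such as [CLS] and [SEP] have no source word.
            aligned.append(ignore_index)
        elif word_id != previous_word_id:
            # First subword of a new word keeps the word's label.
            aligned.append(word_labels[word_id])
        else:
            # Later subwords of the same word are masked.
            aligned.append(ignore_index)
        previous_word_id = word_id
    return aligned


# Hand-written word IDs for: [CLS] 5 star rest ##ura ##nts in my town [SEP]
word_ids = [None, 0, 1, 2, 2, 2, 3, 4, 5, None]
word_labels = [3, 4, 0, 5, 6, 6]  # one label per word
aligned = align_labels(word_ids, word_labels)
print(aligned)  # [-100, 3, 4, 0, -100, -100, 5, 6, 6, -100]
```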

Important

The -100 sentinel is what makes subword alignment work. Without it, the loss would punish the model on subword pieces that have no real label. That would corrupt training.
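
To see what ignoring a position means, here is a small pure-Python sketch of a masked cross-entropy. It is a simplified illustration, not PyTorch's implementation. The function name and the toy logits are assumptions, and returning 0.0 when every position is masked is a choice made for this sketch.

```python
import math


def masked_cross_entropy(logits, labels, ignore_index=-100):
    """Mean cross-entropy over positions whose label is not ignore_index."""
    losses = []
    for scores, label in zip(logits, labels):
        if label == ignore_index:
            continue  # masked position: contributes nothing to the loss
        # Numerically stable log-sum-exp of the scores.
        max_score = max(scores)
        log_sum_exp = max_score + math.log(sum(math.exp(s - max_score) for s in scores))
        losses.append(log_sum_exp - scores[label])
    return sum(losses) / len(losses) if losses else 0.0


logits = [[2.0, 0.5, 0.1], [0.2, 3.0, 0.1], [5.0, 0.0, 0.0]]
labels = [0, 1, -100]  # the third position is a masked subword

loss_masked = masked_cross_entropy(logits, labels)
loss_first_two = masked_cross_entropy(logits[:2], labels[:2])
print(math.isclose(loss_masked, loss_first_two))  # True
```

Whatever the model predicts at the masked position, the loss stays the same, so the extra subword pieces neither help nor hurt training.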

Diagram of label alignment: a word split into subwords, with the first subword labeled and the rest set to -100

Only the first subword of each word keeps its label; the rest are masked with -100 and ignored by the loss.

Data Collation and Metrics

A DataCollatorForTokenClassification pads both the inputs and the labels to the same length per batch:

PYTHON

from transformers import DataCollatorForTokenClassification

data_collator = DataCollatorForTokenClassification(tokenizer=tokenizer)
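
The idea behind the collator can be sketched in plain Python. This is an illustration of the padding logic only, not the library's code. pad_batch is a hypothetical helper, the token IDs other than 101 and 102 are made up, and the pad values (0 for inputs, -100 for labels) are assumptions chosen to match the -100 convention used above.

```python
def pad_batch(batch, pad_token_id=0, label_pad_id=-100):
    """Pad input_ids and labels in a batch to the length of the longest example."""
    max_len = max(len(example["input_ids"]) for example in batch)
    padded = {"input_ids": [], "attention_mask": [], "labels": []}
    for example in batch:
        length = len(example["input_ids"])
        n_pad = max_len - length
        padded["input_ids"].append(example["input_ids"] + [pad_token_id] * n_pad)
        # 1 marks real tokens, 0 marks padding.
        padded["attention_mask"].append([1] * length + [0] * n_pad)
        # Padded label positions get -100 so the loss ignores them.
        padded["labels"].append(example["labels"] + [label_pad_id] * n_pad)
    return padded


batch = [
    {"input_ids": [101, 1019, 2732, 102], "labels": [-100, 3, 4, -100]},
    {"input_ids": [101, 7273, 102], "labels": [-100, 1, -100]},
]
padded = pad_batch(batch)
print(padded["labels"])  # [[-100, 3, 4, -100], [-100, 1, -100, -100]]
```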

We evaluate NER with seqeval. It scores entity-level precision, recall, and F1, not just per-token accuracy:

PYTHON

import evaluate
import numpy as np

metric = evaluate.load('seqeval')
label_names = list(tag2index)

def compute_metrics(eval_preds):
    logits, labels = eval_preds
    predictions = np.argmax(logits, axis=-1)
    true_labels = [[label_names[l] for l in label if l != -100] for label in labels]
    true_predictions = [[label_names[p] for p, l in zip(prediction, label) if l != -100]
                        for prediction, label in zip(predictions, labels)]
    all_metrics = metric.compute(predictions=true_predictions, references=true_labels)
    return {
        "precision": all_metrics['overall_precision'],
        'recall': all_metrics['overall_recall'],
        'f1': all_metrics['overall_f1'],
        'accuracy': all_metrics['overall_accuracy'],
    }
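
To show why entity-level scores differ from token accuracy, here is a simplified sketch of entity-level scoring. It is not seqeval's implementation, and seqeval handles edge cases such as ill-formed tag sequences in its own way. The helper names and the example sequences are assumptions.

```python
def extract_entities(tags):
    """Return a set of (entity_type, start, end) spans from one IOB tag sequence."""
    entities = set()
    start, current_type = None, None
    for i, tag in enumerate(tags + ["O"]):  # the extra O closes a trailing span
        continues = tag.startswith("I-") and tag[2:] == current_type
        if current_type is not None and not continues:
            entities.add((current_type, start, i))
            start, current_type = None, None
        if tag.startswith("B-"):
            start, current_type = i, tag[2:]
    return entities


def entity_prf(true_seqs, pred_seqs):
    """Precision, recall, and F1 where a prediction counts only on an exact span match."""
    tp = n_pred = n_true = 0
    for true_tags, pred_tags in zip(true_seqs, pred_seqs):
        true_ents = extract_entities(true_tags)
        pred_ents = extract_entities(pred_tags)
        tp += len(true_ents & pred_ents)
        n_pred += len(pred_ents)
        n_true += len(true_ents)
    precision = tp / n_pred if n_pred else 0.0
    recall = tp / n_true if n_true else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


true_seqs = [["B-Rating", "I-Rating", "O", "B-Location", "I-Location"]]
pred_seqs = [["B-Rating", "I-Rating", "O", "B-Location", "O"]]  # Location span cut short

precision, recall, f1 = entity_prf(true_seqs, pred_seqs)
token_accuracy = sum(t == p for t, p in zip(true_seqs[0], pred_seqs[0])) / len(true_seqs[0])
print(precision, recall, f1, token_accuracy)  # 0.5 0.5 0.5 0.8
```

One wrong token costs a whole entity here: four of five tokens are right, but only one of two entities matches exactly.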

Training

We load DistilBERT with a token-classification head and pass the label mappings:

PYTHON

from transformers import AutoModelForTokenClassification, TrainingArguments, Trainer

model = AutoModelForTokenClassification.from_pretrained(model_ckpt, id2label=index2tag, label2id=tag2index)

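args = TrainingArguments(
    "finetuned-ner",               # output directory for checkpoints
    evaluation_strategy='epoch',   # evaluate every epoch (newer transformers releases name this eval_strategy)
    save_strategy='epoch',         # save a checkpoint every epoch
    learning_rate=2e-5,
    num_train_epochs=3,
    weight_decay=0.01
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=tokenized_dataset['train'],
    eval_dataset=tokenized_dataset['validation'],  # same data as the test split
    data_collator=data_collator,
    compute_metrics=compute_metrics,
    tokenizer=tokenizer            # newer transformers releases prefer processing_class=tokenizer
)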

trainer.train()

OUTPUT

{'eval_loss': 0.3013, 'eval_precision': 0.7355, 'eval_recall': 0.7863, 'eval_f1': 0.7600, 'eval_accuracy': 0.9096, 'epoch': 1.0}
{'eval_loss': 0.2850, 'eval_precision': 0.7850, 'eval_recall': 0.8044, 'eval_f1': 0.7946, 'eval_accuracy': 0.9174, 'epoch': 2.0}
{'eval_loss': 0.2847, 'eval_precision': 0.7777, 'eval_recall': 0.8121, 'eval_f1': 0.7945, 'eval_accuracy': 0.9184, 'epoch': 3.0}

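After three epochs the model reaches an entity-level F1 of about 0.79 and a token accuracy of about 92%, measured on the test set that doubles as validation. F1 is flat between epochs 2 and 3 (0.7946 vs. 0.7945), so most of the gain comes in the first two epochs.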
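
As a quick sanity check, F1 is the harmonic mean of precision and recall, so the logged values can be recomputed by hand. The sketch below uses the rounded numbers from the training log, so small differences in the last digit are expected. The helper name f1_from is an assumption.

```python
def f1_from(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall, logged F1) per epoch, copied from the training log
logged = [
    (0.7355, 0.7863, 0.7600),
    (0.7850, 0.8044, 0.7946),
    (0.7777, 0.8121, 0.7945),
]

recomputed = [f1_from(p, r) for p, r, _ in logged]
for epoch, (value, (_, _, f1_logged)) in enumerate(zip(recomputed, logged), start=1):
    print(f"epoch {epoch}: recomputed F1 {value:.4f}, logged F1 {f1_logged:.4f}")
```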

Prediction

We save the model and load it as a token-classification pipeline. The aggregation_strategy='simple' option groups adjacent tokens that share an entity type into a single span, so the output lists entities rather than individual subword tokens:

PYTHON

from transformers import pipeline

trainer.save_model("ner_distilbert")

pipe = pipeline('token-classification', model="ner_distilbert", aggregation_strategy='simple')
pipe("which restaurant serves the best shushi in new york?")

OUTPUT

[{'entity_group': 'Rating', 'score': 0.9804273, 'word': 'best', 'start': 28, 'end': 32}, {'entity_group': 'Dish', 'score': 0.830101, 'word': 'shushi', 'start': 33, 'end': 39}, {'entity_group': 'Location', 'score': 0.8655802, 'word': 'new york', 'start': 43, 'end': 51}]

The model tags best as a Rating, shushi as a Dish (even with the typo), and new york as a Location. This is exactly the structured output a restaurant search engine needs.

Diagram of the NER pipeline turning a query into structured entities: Rating, Dish, Location

The fine-tuned pipeline turns a free-text query into structured entities with confidence scores.
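
A search backend would usually want these entities grouped by type. The sketch below shows one way to do that with the pipeline output from above pasted in as plain data. entities_to_query and the min_score threshold are hypothetical, not part of the transformers API.

```python
from collections import defaultdict


def entities_to_query(entities, min_score=0.5):
    """Group pipeline entities by type, dropping low-confidence ones."""
    query = defaultdict(list)
    for entity in entities:
        if entity["score"] >= min_score:
            query[entity["entity_group"]].append(entity["word"])
    return dict(query)


# Output of the pipeline call above, copied as plain Python data.
pipeline_output = [
    {"entity_group": "Rating", "score": 0.9804273, "word": "best", "start": 28, "end": 32},
    {"entity_group": "Dish", "score": 0.830101, "word": "shushi", "start": 33, "end": 39},
    {"entity_group": "Location", "score": 0.8655802, "word": "new york", "start": 43, "end": 51},
]

structured = entities_to_query(pipeline_output)
print(structured)
# {'Rating': ['best'], 'Dish': ['shushi'], 'Location': ['new york']}
```

Raising min_score trades recall for precision: with a threshold of 0.85, the Dish entity would be dropped.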

Summary

We fine-tuned DistilBERT for token-level NER on restaurant queries and reached an entity-level F1 of about 0.79 and about 92% token accuracy. The new skill here is subword label alignment with word_ids and the -100 sentinel. It is a standard technique for token classification with subword tokenizers. We also used entity-level evaluation with seqeval.

Next, we shift from understanding tasks to generation. We will fine-tune T5 for custom text summarization.
