
Combining NLP Models and Custom Rules in spaCy

Extend spaCy's NLP pipeline with custom rules for named entity expansion. Covers EntityRuler, pattern matching, and combining ML models with rule-based logic.

May 17, 2026 at 7:30 PM · 8 min read

Topics You Will Master

spaCy EntityRuler for adding custom named entity patterns
Extending PERSON entities to include titles like Mr, Dr, Prof
Combining pre-trained ML model predictions with deterministic rule layers
Custom pipeline component registration and pipeline ordering
Testing and validating merged ML + rule-based pipelines on real text

Best For

NLP developers customizing pre-trained spaCy models for domain-specific entity recognition.

Expected Outcome

A spaCy pipeline that combines statistical predictions with user-defined entity rules.

Combining NLP Models and Creating Custom Rules Using spaCy


Objective: In this article, we create custom rules for our requirements and add them to the spaCy pipeline, for example to expand named entities and to identify the organization a person works for in a given text.

For example, the corpus spaCy’s English models were trained on defines a PERSON entity as just the person name, without titles like “Mr” or “Dr”. This makes sense because it makes it easier to resolve the entity type back to a knowledge base. But what if your application needs the full names, including the titles?

  • Mr. Laxmi Kant
  • Mr. Roshan Kumar Gupta

spaCy combining statistical models with rule-based components diagram


spaCy

What Pandas is to tabular, structured data in Python, spaCy is to text and unstructured data in Natural Language Processing (NLP).

Pandas vs spaCy comparison: structured tabular data vs unstructured text data

When you are starting out as a data scientist, you will mostly deal with structured data, which does not require spaCy.

In most cases, Pandas will fit your usage, as it is powerful enough to do most of the data cleaning and analysis for structured data.

Once you start dealing with unstructured text data, which Pandas can no longer handle, spaCy comes in with many built-in capabilities to process, analyze and even understand this data through sophisticated and efficient NLP techniques.

spaCy provides a one-stop shop for tasks commonly used in any NLP project, including:

  • Tokenisation
  • Lemmatisation
  • Part-of-speech tagging
  • Entity recognition
  • Dependency parsing
  • Sentence recognition
  • Word-to-vector transformations
  • Many convenience methods for cleaning and normalising text

By the end of this article, I hope you’ll understand more about spaCy and how you could leverage this powerful tool in your domain space as well as other areas.

Let’s get started!


Additional Reading:

Language Processing Pipelines:

spaCy Processing Pipelines Documentation

Natural Language Processing and Computational Linguistics:

Natural Language Processing and Computational Linguistics (Packt)


You can combine statistical and rule-based components in a variety of ways. Rule-based components can be used to improve the accuracy of statistical models, by presetting tags, entities, or sentence boundaries for specific tokens. The statistical models will usually respect these preset annotations, which sometimes improves the accuracy of other decisions. You can also use rule-based components after a statistical model to correct common errors. Finally, rule-based components can reference the attributes set by statistical models, in order to implement more abstract logic.
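As a minimal sketch of the "preset entities with rules" idea, the snippet below runs an EntityRuler on a blank English pipeline, so every entity comes from the rules alone. The two patterns are illustrative assumptions, not part of the original notebook; in a trained pipeline the same component would typically be added with before='ner' so the statistical model sees the preset entities.

```python
import spacy

# A blank English pipeline has a tokenizer but no statistical components,
# so every entity found below comes from the rule layer alone.
nlp = spacy.blank("en")

# "entity_ruler" is the built-in factory name for the EntityRuler in spaCy v3.
ruler = nlp.add_pipe("entity_ruler")
ruler.add_patterns([
    # Phrase pattern: exact string match (illustrative).
    {"label": "ORG", "pattern": "Google"},
    # Token pattern: case-insensitive match on two tokens (illustrative).
    {"label": "PERSON", "pattern": [{"LOWER": "alex"}, {"LOWER": "smith"}]},
])

doc = nlp("Alex Smith worked at Google")
entities = [(ent.text, ent.label_) for ent in doc.ents]
print(entities)
```

Rules like these are deterministic, which makes them a good fit for names or codes that a general-purpose model has never seen.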

Rule-based and statistical NLP component interaction diagram showing how EntityRuler and ML models combine

Notebook Setup

Installing libraries

PYTHON
# !pip install -U spacy
PYTHON
# !pip install -U spacy-lookups-data
PYTHON
# !python -m spacy download en_core_web_sm

Importing libraries

PYTHON
import spacy
from spacy.matcher import Matcher
from spacy.tokens import Span
from spacy import displacy

One of spaCy’s most interesting features is its language models. A language model is a statistical model that lets us perform NLP tasks such as POS-tagging and NER-tagging.

Here, we use the spacy.load() method to load a model package by name and return the nlp object.

PYTHON
#loading english language model

nlp = spacy.load('en_core_web_sm')

Next, we call nlp() on a string and spaCy tokenizes the text and creates a document object.

PYTHON
doc = nlp('Dr. Alex Smith chaired first board meeting at Google')
doc
OUTPUT
Dr. Alex Smith chaired first board meeting at Google
PYTHON
print([(ent.text, ent.label_) for ent in doc.ents])
OUTPUT
[('Alex Smith', 'PERSON'), ('first', 'ORDINAL'), ('Google', 'ORG')]

Use of Named Entity Recognition

Now we create a rule that adds the title to an entity whose label is PERSON.

Below are the steps we are performing:

  • Create a function that takes a doc as input.
  • Iterate over the entities in the doc and check whether an entity has the label PERSON and does not start at position zero.
  • Check whether the previous token is one of ('Dr', 'Dr.', 'Mr', 'Mr.').
  • If so, use Span() to create a new PERSON entity that runs from the title to the end of the name.
  • Finally, assign the updated list of entities back to doc.ents.
PYTHON
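from spacy.language import Language

# spaCy v3: register the function under a name so it can be added to the pipeline
@Language.component('add_title')
def add_title(doc):
    new_ents = []
    for ent in doc.ents:
        # expand a PERSON entity only when it is preceded by a known title
        if ent.label_ == 'PERSON' and ent.start != 0:
            prev_token = doc[ent.start-1]
            if prev_token.text in ('Dr', 'Dr.', 'Mr', 'Mr.'):
                ent = Span(doc, ent.start-1, ent.end, label=ent.label)
        # keep every entity, expanded or not, so that nothing is dropped
        new_ents.append(ent)
    doc.ents = new_ents
    return doc
PYTHON
#loading english language model

nlp = spacy.load('en_core_web_sm')
# add the component by its registered name, right after the NER stage
nlp.add_pipe('add_title', after='ner')
PYTHON
#here we call nlp() on a string and spaCy tokenizes the text and creates a document object.

doc = nlp('Dr. Alex Smith chaired first board meeting at Google')
PYTHON
print([(ent.text, ent.label_) for ent in doc.ents])
OUTPUT
[('Dr. Alex Smith', 'PERSON'), ('first', 'ORDINAL'), ('Google', 'ORG')]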
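One way to check a title rule without downloading a model is to hand-build a Doc and set its entities manually. The sketch below does that; expand_titles is a hypothetical helper written for this check, and the hand-set entities only stand in for what a statistical NER might predict.

```python
import spacy
from spacy.tokens import Doc, Span

TITLES = {"Dr", "Dr.", "Mr", "Mr."}

def expand_titles(doc):
    """Hypothetical helper: widen PERSON spans over a preceding title, keep the rest."""
    new_ents = []
    for ent in doc.ents:
        if ent.label_ == "PERSON" and ent.start > 0 and doc[ent.start - 1].text in TITLES:
            ent = Span(doc, ent.start - 1, ent.end, label=ent.label)
        new_ents.append(ent)
    doc.ents = new_ents
    return doc

nlp = spacy.blank("en")
words = ["Dr.", "Alex", "Smith", "chaired", "first", "board", "meeting", "at", "Google"]
doc = Doc(nlp.vocab, words=words)

# Hand-set entities (an assumption standing in for model predictions).
doc.ents = [
    Span(doc, 1, 3, label="PERSON"),
    Span(doc, 4, 5, label="ORDINAL"),
    Span(doc, 8, 9, label="ORG"),
]

result = [(ent.text, ent.label_) for ent in expand_titles(doc).ents]
print(result)
```

Because the input entities are fixed by hand, a check like this tests only the rule logic, independent of whatever the statistical model happens to predict.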

Use of POS and Dep Parsing

Part-of-speech (POS) tagging assigns each word in the input its part of speech. Dependency parsing analyses the structure of a sentence through the dependencies between its words: each dependency tells us how one word relates to another.

PYTHON
#loading english language model

nlp = spacy.load('en_core_web_sm')
PYTHON
#here we call nlp() on a string and spaCy tokenizes the text and creates a document object.

doc = nlp('Alex Smith was working at Google')

A parser breaks a sentence into its parts, such as the subject (a noun phrase) and the predicate (a verb phrase). A dependency parser treats the main verb as the head of the sentence and builds all other dependencies around it.

Example: Alex Smith was working at Google

PYTHON
displacy.render(doc, style='dep', options = {'compact':True, 'distance':100})

displaCy compact dependency parse of "Alex Smith was working at Google" showing subject, verb, and object relationships
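The same relations can be read programmatically through token.dep_, token.head and token.children. The sketch below uses a hand-annotated Doc so it runs without a trained model; the heads and dependency labels are assumptions written by hand to mirror a typical parse of this sentence, not actual model output.

```python
import spacy
from spacy.tokens import Doc

nlp = spacy.blank("en")
words = ["Alex", "Smith", "was", "working", "at", "Google"]

# Hand-written annotations (assumed): the index of each token's head
# and the dependency label that links the token to that head.
heads = [1, 3, 3, 3, 3, 4]
deps = ["compound", "nsubj", "aux", "ROOT", "prep", "pobj"]
doc = Doc(nlp.vocab, words=words, heads=heads, deps=deps)

# One (token, relation, head) triple per token.
relations = [(token.text, token.dep_, token.head.text) for token in doc]
for text, dep, head in relations:
    print(f"{text:<8} {dep:<9} head={head}")

# The root verb and the prepositions attached to it.
root = next(token for token in doc if token.dep_ == "ROOT")
prep_children = [child.text for child in root.children if child.dep_ == "prep"]
print(root.text, prep_children)
```

Walking from the verb to its prep children, and from there to the preposition's own children, is the traversal the extraction component relies on.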

PYTHON
from spacy.language import Language

# spaCy v3: register the function under a name so it can be added to the pipeline
@Language.component('get_person_orgs')
def get_person_orgs(doc):
    person_entities = [ent for ent in doc.ents if ent.label_=="PERSON"]
    for ent in person_entities:
        # the syntactic head of the entity, e.g. the verb "worked"
        head = ent.root.head
        if head.lemma_ == 'work':
            preps = [token for token in head.children if token.dep_ == 'prep']
            for prep in preps:
                orgs = [token for token in prep.children if token.ent_type_ == 'ORG']
                print({'person': ent, 'orgs': orgs, 'past': head.tag_ == "VBD"})
    return doc
PYTHON
#loading english language model

nlp = spacy.load('en_core_web_sm')
PYTHON
# merge_entities is a built-in component, so it is added by name
nlp.add_pipe('merge_entities')
PYTHON
nlp.add_pipe('get_person_orgs')
PYTHON
#here we call nlp() on a string and spaCy tokenizes the text and creates a document object.

doc = nlp('Alex Smith worked at Google')
OUTPUT
{'person': Alex Smith, 'orgs': [Google], 'past': True}
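To see what merge_entities does on its own, the sketch below compares tokenization before and after adding it. It uses a blank pipeline with an EntityRuler pattern as a stand-in for the statistical NER; that pattern is an assumption made so the example runs without a trained model.

```python
import spacy

nlp = spacy.blank("en")

# Stand-in for NER: a rule that marks "Alex Smith" as a PERSON (assumption).
ruler = nlp.add_pipe("entity_ruler")
ruler.add_patterns([{"label": "PERSON", "pattern": "Alex Smith"}])

text = "Alex Smith worked at Google"
tokens_before = [token.text for token in nlp(text)]

# Built-in component, available under this name in spaCy v3.
nlp.add_pipe("merge_entities")
tokens_after = [token.text for token in nlp(text)]

print(tokens_before)
print(tokens_after)
```

After merging, each entity is a single token, so a check such as token.ent_type_ == 'ORG' on one child of the preposition covers a multi-word organization name as a whole.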

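Modify model

The first version marks a match as past only when the head verb itself is tagged VBD. To also cover the past progressive ("was working"), where the head verb is tagged VBG, the version below checks the verb's auxiliaries as well.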

PYTHON
from spacy.language import Language

# spaCy v3: register the function under a name so it can be added to the pipeline
@Language.component('get_person_orgs')
def get_person_orgs(doc):
    person_entities = [ent for ent in doc.ents if ent.label_=="PERSON"]
    for ent in person_entities:
        head = ent.root.head
        if head.lemma_ == 'work':
            # past tense: a VBD verb, or a VBG verb with a past auxiliary ("was working")
            aux = [token for token in head.children if token.dep_ == 'aux']
            past_aux = any(t.tag_ == 'VBD' for t in aux)
            past = head.tag_ == 'VBD' or (head.tag_ == 'VBG' and past_aux)

            preps = [token for token in head.children if token.dep_ == 'prep']
            for prep in preps:
                orgs = [token for token in prep.children if token.ent_type_ == 'ORG']
                # print inside the loop so orgs is always defined
                print({'person': ent, 'orgs': orgs, 'past': past})
    return doc
PYTHON
#loading english language model

nlp = spacy.load('en_core_web_sm')
PYTHON
# merge_entities is a built-in component, so it is added by name
nlp.add_pipe('merge_entities')
PYTHON
nlp.add_pipe('get_person_orgs')
PYTHON
#here we call nlp() on a string and spaCy tokenizes the text and creates a document object.

doc = nlp('Alex Smith was working at Google')
OUTPUT
{'person': Alex Smith, 'orgs': [Google], 'past': True}
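The tense logic can be exercised without a trained model by hand-annotating a few sentences. In the sketch below, person_org_records and build_doc are hypothetical helpers, and the tags, lemmas, heads and dependency labels are assumptions written by hand rather than model output. The function returns plain dicts instead of printing, which makes the result easy to assert on.

```python
import spacy
from spacy.tokens import Doc, Span

nlp = spacy.blank("en")

def person_org_records(doc):
    """Hypothetical variant of the extractor that returns dicts instead of printing."""
    records = []
    for ent in (e for e in doc.ents if e.label_ == "PERSON"):
        head = ent.root.head
        if head.lemma_ != "work":
            continue
        aux = [t for t in head.children if t.dep_ == "aux"]
        past_aux = any(t.tag_ == "VBD" for t in aux)
        past = head.tag_ == "VBD" or (head.tag_ == "VBG" and past_aux)
        for prep in (t for t in head.children if t.dep_ == "prep"):
            orgs = [t.text for t in prep.children if t.ent_type_ == "ORG"]
            records.append({"person": ent.text, "orgs": orgs, "past": past})
    return records

def build_doc(words, tags, lemmas, heads, deps):
    """Hypothetical helper: hand-annotated Doc with a PERSON at the start and an ORG at the end."""
    doc = Doc(nlp.vocab, words=words, tags=tags, lemmas=lemmas, heads=heads, deps=deps)
    doc.ents = [
        Span(doc, 0, 2, label="PERSON"),
        Span(doc, len(words) - 1, len(words), label="ORG"),
    ]
    return doc

simple_past = build_doc(
    ["Alex", "Smith", "worked", "at", "Google"],
    ["NNP", "NNP", "VBD", "IN", "NNP"],
    ["Alex", "Smith", "work", "at", "Google"],
    [1, 2, 2, 2, 3],
    ["compound", "nsubj", "ROOT", "prep", "pobj"],
)

def progressive(aux_word, aux_tag):
    """Hand-annotated '<aux> working' sentence with the given auxiliary."""
    return build_doc(
        ["Alex", "Smith", aux_word, "working", "at", "Google"],
        ["NNP", "NNP", aux_tag, "VBG", "IN", "NNP"],
        ["Alex", "Smith", "be", "work", "at", "Google"],
        [1, 3, 3, 3, 3, 4],
        ["compound", "nsubj", "aux", "ROOT", "prep", "pobj"],
    )

docs = [simple_past, progressive("was", "VBD"), progressive("is", "VBZ")]
records = [person_org_records(d)[0] for d in docs]
for record in records:
    print(record)
```

The third sentence ("is working") is included as a contrast case: the head verb is VBG but the auxiliary is not past, so the record should not be flagged as past.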

Conclusion

In this tutorial you extended spaCy's statistical NER pipeline with two custom rule-based components: an add_title function that expands PERSON entities to include preceding titles ("Dr.", "Mr."), and a get_person_orgs component that uses the dependency parse to extract the organization a person worked at. Both components are added to the pipeline via nlp.add_pipe, demonstrating how rule logic layers cleanly on top of pretrained models.

Key takeaways:

  • spaCy's pipeline is composable: calling nlp.add_pipe with after='ner' inserts a custom component directly after the NER stage, without rewriting or retraining the base model.
  • The Span API lets you redefine entity boundaries: Span(doc, start-1, end, label=label) extends a PERSON entity one token backward to capture the preceding title.
  • Dependency parsing (head.lemma_ == 'work', token.dep_ == 'prep') gives structured access to grammatical relationships, making it possible to extract "who worked where" from raw text without training a relation extraction model.
  • merge_entities is a built-in pipeline component that collapses multi-token entities into single tokens, which simplifies downstream dependency traversal.

Next steps:

  • Apply EntityRuler patterns for domain-specific entities (product names, internal codes) as shown in Rule-Based Text Extraction and Matching with spaCy.
  • Chain this pipeline with the processing pipeline components from Processing Pipeline in spaCy to handle batches of documents efficiently.
  • Extend the get_person_orgs extractor to output structured JSON for downstream knowledge-graph construction.
