Combining NLP Models and Custom Rules in spaCy

Combining NLP Models and Creation of Custom rules using SpaCy

Objective: In this article, we are going to create some custom rules for our requirements and will add that to our pipeline like explanding named entities and identifying person's organization name from a given text.

For example: For example, the corpus spaCy’s English models were trained on defines a PERSON entity as just the person name, without titles like “Mr” or “Dr”. This makes sense because it makes it easier to resolve the entity type back to a knowledge base. But what if your application needs the full names, including the titles?

Mr. Laxmi Kant
Mr. Roshan Kumar Gupta

spaCy combining statistical models with rule-based components diagram

`SpaCy`

If a normal data analysis tool in Python for tabular and structured data has Pandas, then the data analysis tool in Natural Language Processing (NLP) for text and unstructured data has spaCy.

Pandas vs spaCy comparison: structured tabular data vs unstructured text data

When you’re first starting out as a data scientist, chances are you’ll be dealing with structured data which doesn’t necessarily require spaCy to handle complicated text data, depending on your needs.

In most cases — Pandas will fit your usage as it’s powerful enough to do most of the data cleaning and analysis for structured data.

Once you start dealing with unstructured text data — basically NLP stuff — where this can no longer be handled by Pandas, this is when spaCy comes in with tons of in-built capabilities to process, analyze and even understand these data through sophisticated and efficient NLP techniques.

SpaCy provides a one-stop-shop for tasks commonly used in any NLP project, including:

Tokenisation
Lemmatisation
Part-of-speech tagging
Entity recognition
Dependency parsing
Sentence recognition
Word-to-vector transformations And Many convenience methods for cleaning and normalising text

By the end of this article, I hope you’ll understand more about spaCy and how you could leverage this powerful tool in your domain space as well as other areas.

Let’s get started!

Additional Reading:

Language Processing Pipelines:

spaCy Processing Pipelines Documentation

Natural Language Processing and Computational Linguistics:

Natural Language Processing and Computational Linguistics (Packt)

Watch Full Video Here:

You can combine statistical and rule-based components in a variety of ways. Rule-based components can be used to improve the accuracy of statistical models, by presetting tags, entities, or sentence boundaries for specific tokens. The statistical models will usually respect these preset annotations, which sometimes improves the accuracy of other decisions. You can also use rule-based components after a statistical model to correct common errors. Finally, rule-based components can reference the attributes set by statistical models, in order to implement more abstract logic.

Rule-based and statistical NLP component interaction diagram showing how EntityRuler and ML models combine

Notebook Setup

Installing libraries

PYTHON

# !pip install -U spacy

PYTHON

# !pip install -U spacy-lookups-data

PYTHON

# !python -m spacy download en_core_web_sm

Importing libraries

PYTHON

import spacy
from spacy.matcher import Matcher
from spacy.tokens import Span
from spacy import displacy

One of spaCy’s most interesting features is its language models. A language model is a statistical model that lets us perform NLP tasks such as POS-tagging and NER-tagging.

Here, we are using spacy.load() method to load a model package by and return the NLP object.

PYTHON

#loading english language model

nlp = spacy.load('en_core_web_sm')

Next, we call nlp() on a string and spaCy tokenizes the text and creates a document object.

PYTHON

doc = nlp('Dr. Alex Smith chaired first board meeting at Google')

OUTPUT

doc

Dr. Alex Smith chaired first board meeting at Google

PYTHON

print([(ent.text, ent.label_) for ent in doc.ents])

OUTPUT

[('Alex Smith', 'PERSON'), ('first', 'ORDINAL'), ('Google', 'ORG')]

Use of Name Entity Recognition

Now we are creating our rule to add a title along with entity where entity label is PERSON.

Below are the steps which we are peforming:

Creating a function that will take input.
Iterating over each word or token or doc, if any token is having entity label as PERSON and its starting position is not zero.
Checking if the previous token is having values in ('Dr', 'Dr.', 'Mr', 'Mr.').
Then using Span() we are creating a rule which will take PERSON label from starting position of title till the end of the name.
And then assigning the value back to token entity values.

PYTHON

def add_title(doc):
    new_ents = []
    for ent in doc.ents:
        if ent.label_ == 'PERSON' and ent.start!=0:
            prev_token = doc[ent.start-1]
            if prev_token.text in ('Dr', 'Dr.', 'Mr', 'Mr.'):
                new_ent = Span(doc, ent.start-1, ent.end, label=ent.label)
                new_ents.append(new_ent)
            else:
                new_ents.append(ent)
    doc.ents = new_ents
    return doc

PYTHON

#loading english language model

nlp = spacy.load('en_core_web_sm')
nlp.add_pipe(add_title, after='ner')

PYTHON

#here we call nlp() on a string and spaCy tokenizes the text and creates a document object.

doc = nlp('Dr. Alex Smith chaired first board meeting at Google')

PYTHON

print([(ent.text, ent.label_) for ent in doc.ents])

OUTPUT

[('Dr. Alex Smith', 'PERSON')]

Use of POS and Dep Parsing

Parts-of-speech tagging is the process of tagging words in textual input with their appropriate parts of speech and Dependency parsing refers to understanding the structure of a sentence via dependencies between words in a sentence. When a sentence is dependency parsed it would give us information about relationships between words in a sentence.

PYTHON

#loading english language model

nlp = spacy.load('en_core_web_sm')

PYTHON

#here we call nlp() on a string and spaCy tokenizes the text and creates a document object.

doc = nlp('Alex Smith was working at Google')

Parsers break up a sentence into a subject and an object which is a noun phrase and a verb phrase. Dependency parser considers the verb as ahead of the sentence and all dependencies are built around it.

Example: Alex Smith was working at Google

PYTHON

displacy.render(doc, style='dep', options = {'compact':True, 'distance':100})

displaCy compact dependency parse of "Alex Smith was working at Google" showing subject, verb, and object relationships

PYTHON

def get_person_orgs(doc):
    person_entities = [ent for ent in doc.ents if ent.label_=="PERSON"]
    for ent in person_entities:
        head = ent.root.head
        if head.lemma_ == 'work':
            preps = [token for token in head.children if token.dep_ == 'prep']
            for prep in preps:
                orgs = [token for token in prep.children if token.ent_type_ == 'ORG']
                print({'person': ent, 'orgs': orgs, 'past': head.tag_ == "VBD"})
    return doc

PYTHON

from spacy.pipeline import merge_entities

PYTHON

#loading english language model

nlp = spacy.load('en_core_web_sm')

PYTHON

nlp.add_pipe(merge_entities)

PYTHON

nlp.add_pipe(get_person_orgs)

PYTHON

#here we call nlp() on a string and spaCy tokenizes the text and creates a document object.

doc = nlp('Alex Smith worked at Google')

OUTPUT

{'person': Alex Smith, 'orgs': [Google], 'past': True}

Modify model

PYTHON

def get_person_orgs(doc):
    person_entities = [ent for ent in doc.ents if ent.label_=="PERSON"]
    for ent in person_entities:
        head = ent.root.head
        if head.lemma_ == 'work':
            preps = [token for token in head.children if token.dep_ == 'prep']
            for prep in preps:
                orgs = [token for token in prep.children if token.ent_type_ == 'ORG']

                aux = [token for token in head.children if token.dep_ == 'aux']
                past_aux = any(t.tag_ == 'VBD' for t in aux)
                past = head.tag_ == 'VBD' or head.tag_ == 'VBG' and past_aux

            print({'person': ent, 'orgs': orgs, 'past': past})
    return doc

PYTHON

from spacy.pipeline import merge_entities

PYTHON

#loading english language model

nlp = spacy.load('en_core_web_sm')

PYTHON

nlp.add_pipe(merge_entities)

PYTHON

nlp.add_pipe(get_person_orgs)

PYTHON

#here we call nlp() on a string and spaCy tokenizes the text and creates a document object.

doc = nlp('Alex Smith was working at Google')

OUTPUT

{'person': Alex Smith, 'orgs': [Google], 'past': True}

Conclusion

In this tutorial you extended spaCy's statistical NER pipeline with two custom rule-based components: an add_title function that expands PERSON entities to include preceding titles ("Dr.", "Mr."), and a get_person_orgs dependency parser that extracts the organization an entity worked at. Both components are injected into the pipeline via nlp.add_pipe, demonstrating how rule logic layers cleanly on top of pretrained models.

Key takeaways:

spaCy's pipeline is fully composable — nlp.add_pipe(component, after='ner') inserts a custom function after any existing stage, without rewriting or retraining the base model.
The Span API lets you redefine entity boundaries: Span(doc, start-1, end, label=label) extends a PERSON entity one token backward to capture the preceding title.
Dependency parsing (head.lemma_ == 'work', token.dep_ == 'prep') gives structured access to grammatical relationships, making it possible to extract "who worked where" from raw text without training a relation extraction model.
merge_entities is a built-in pipeline component that collapses multi-token entities into single tokens — a prerequisite for clean downstream dependency traversal.

Next steps:

Apply EntityRuler patterns for domain-specific entities (product names, internal codes) as shown in Rule-Based Text Extraction and Matching with spaCy.
Chain this pipeline with the processing pipeline components from Processing Pipeline in spaCy to handle batches of documents efficiently.
Extend the get_person_orgs extractor to output structured JSON for downstream knowledge-graph construction.

Combining NLP Models and Custom Rules in spaCy

Topics You Will Master

Combining NLP Models and Creation of Custom rules using SpaCy

`SpaCy`

Additional Reading:

Watch Full Video Here:

Notebook Setup

Installing libraries

Importing libraries

Use of Name Entity Recognition

Use of POS and Dep Parsing

Modify model

Conclusion

Latest recommendations you might like

NLP: End to End Text Processing for Beginners

Text Summarization using NLP

Find this tutorial useful?

Discussion & Comments