
Processing Pipeline in spaCy

Understand how spaCy's NLP processing pipeline works end to end. Covers tokenizer, tagger, named entity recognizer, and custom component registration.

May 16, 2026 at 1:30 PM · 7 min read

Topics You Will Master

spaCy's default pipeline components: tokenizer, tagger, parser, NER
Text-to-Doc conversion and linguistic annotation layers
Disabling and enabling pipeline components for performance tuning
Adding and registering custom components into the spaCy pipeline
Inspecting pipeline state with nlp.pipe_names and nlp.analyze_pipes()

Best For

NLP beginners learning how spaCy processes raw text step by step.

Expected Outcome

A clear understanding of spaCy's pipeline architecture for building custom NLP systems.

What is spaCy?

spaCy is a free, open-source library for advanced Natural Language Processing (NLP) in Python.

If you’re working with a lot of text, you’ll eventually want to know more about it. For example, what’s it about? What do the words mean in context? Who is doing what to whom? What companies and products are mentioned? Which texts are similar to each other?

spaCy is designed specifically for production use and helps you build applications that process and “understand” large volumes of text. It can be used to build information extraction or natural language understanding systems, or to pre-process text for deep learning.

Below are some of spaCy’s features and capabilities. Some of them refer to linguistic concepts, while others are related to more general machine learning functionality.

spaCy feature overview table showing NLP capabilities including tokenization, POS tagging, named entity recognition, and dependency parsing

Pipeline in spaCy

When you call nlp on a text, spaCy first tokenizes the text to produce a Doc object. The Doc is then processed in several different steps – this is also referred to as the processing pipeline.

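The pipeline used by the default models consists of a tagger, a parser and an entity recognizer (spaCy v3 models such as en_core_web_sm also include components like tok2vec, attribute_ruler and lemmatizer). Each pipeline component returns the processed Doc, which is then passed on to the next component.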

spaCy pipeline flow diagram showing text input passing through tokenizer, tagger, parser, and NER components to produce a Doc object

Each component in the pipeline transforms the Doc object and passes it to the next stage, adding annotations layer by layer.

spaCy pipeline component table listing name, description, and creates properties for tagger, parser, NER, and other built-in components

spaCy installation

Run the following commands to install spaCy and download the small English model:

BASH
pip install -U spacy
pip install -U spacy-lookups-data
python -m spacy download en_core_web_sm

Processing text

First, import the libraries. Only spacy itself is needed for the examples in this article.

PYTHON
import spacy
from spacy.matcher import Matcher
from spacy.tokens import Span
from spacy import displacy

spacy.load() loads a model. When you call nlp on a text, spaCy will tokenize it and then call each component on the Doc, in order. It then returns the processed Doc that you can work with.

PYTHON
nlp = spacy.load("en_core_web_sm")
doc = nlp('This is raw text')
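To see which components will run, and in what order, you can inspect the nlp object. The following is a minimal sketch, assuming spaCy v3. It uses a blank English pipeline with the built-in sentencizer (chosen here only so the example runs without a model download) rather than en_core_web_sm. Note that the tokenizer does not appear in nlp.pipe_names, because it runs before the pipeline components.

```python
import spacy

# Blank English pipeline: tokenizer only, no trained components.
nlp = spacy.blank("en")
# Add the rule-based sentence splitter by its registered name (spaCy v3 API).
nlp.add_pipe("sentencizer")

doc = nlp("This is raw text. Here is more.")

# Names of the active components, in the order they run.
print(nlp.pipe_names)
# Tokens come from the tokenizer, which is not listed in pipe_names.
print([token.text for token in doc])
# Sentence boundaries were set by the sentencizer component.
print([sent.text for sent in doc.sents])

# analyze_pipes reports what each component assigns and requires.
analysis = nlp.analyze_pipes()
print(analysis["summary"])
```

With en_core_web_sm loaded instead, nlp.pipe_names lists the model's trained components, and nlp.analyze_pipes(pretty=True) prints the same information as a table.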

When processing large volumes of text, the statistical models are usually more efficient if you let them work on batches of texts. spaCy’s nlp.pipe method takes an iterable of texts and yields processed Doc objects. The batching is done internally.

PYTHON
texts = ["This is raw text", "There is lots of text"]
docs = list(nlp.pipe(texts))
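Because nlp.pipe accepts any iterable, the texts do not have to be held in a list. The sketch below streams them from a generator and sets batch_size explicitly. It assumes spaCy v3 and a blank English pipeline (tokenizer only) so it runs without a model download; read_texts is a hypothetical stand-in for your own data source, and the batch size of 50 is an arbitrary illustrative value.

```python
import spacy

nlp = spacy.blank("en")  # tokenizer only, enough to show the streaming API


def read_texts():
    """Hypothetical text source; in practice this could read a file or database."""
    yield "This is raw text"
    yield "There is lots of text"


# nlp.pipe consumes the generator lazily and yields one Doc per text, in order.
token_counts = [len(doc) for doc in nlp.pipe(read_texts(), batch_size=50)]
print(token_counts)

# as_tuples=True passes a context object through alongside each text,
# which is handy for keeping IDs or metadata attached to each Doc.
records = [("This is raw text", {"id": 1}), ("There is lots of text", {"id": 2})]
seen_ids = []
for doc, context in nlp.pipe(records, as_tuples=True):
    seen_ids.append(context["id"])
    print(context["id"], doc.text)
```

Wrapping the result in list(), as in the example before this one, loads every Doc into memory at once; iterating over the generator keeps memory use bounded for large inputs.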

Tips for efficient processing

  • Process the texts as a stream using nlp.pipe and buffer them in batches, instead of one-by-one. This is usually much more efficient.
  • Only apply the pipeline components you need. Getting predictions from the model that you don’t actually need adds up and becomes very inefficient at scale. To prevent this, use the disable keyword argument to disable components you don’t need.

In this example, we’re using nlp.pipe to process a (potentially very large) iterable of texts as a stream. Because we’re only accessing the named entities in doc.ents (set by the ner component), we’ll disable all other statistical components (the tagger and parser) during processing. nlp.pipe yields Doc objects, so we can iterate over them and access the named entity predictions.

The code below disables the tagger and parser, then prints each entity's text and the label assigned by the named entity recognizer (ner).

PYTHON
import spacy

texts = [
    "Net income was $9.4 million compared to the prior year of $2.7 million.",
    "Revenue exceeded twelve billion dollars, with a loss of $1b.",
]

nlp = spacy.load("en_core_web_sm")
docs = nlp.pipe(texts, disable=["tagger", "parser"])
for doc in docs:
    # Do something with the doc here
    print([(ent.text, ent.label_) for ent in doc.ents])
    print()
OUTPUT
[('$9.4 million', 'MONEY'), ('the prior year', 'DATE'), ('$2.7 million', 'MONEY')]

[('twelve billion dollars', 'MONEY'), ('1b', 'MONEY')]

How Pipelines Work

spaCy makes it very easy to create your own pipelines consisting of reusable components – this includes spaCy’s default tagger, parser and entity recognizer, but also your own custom processing functions. A pipeline component can be added to an already existing nlp object, specified when initializing a Language class, or defined within a model package.
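A custom component is just a function that takes a Doc and returns it. The sketch below assumes spaCy v3, where the function is registered under a name with @Language.component and then added by that name. The component name token_counter and the Doc extension token_count are hypothetical, made up for this illustration, and a blank English pipeline is used so no model download is needed.

```python
import spacy
from spacy.language import Language
from spacy.tokens import Doc

# Hypothetical custom attribute, reachable as doc._.token_count.
# force=True lets the snippet be re-run without a "already registered" error.
Doc.set_extension("token_count", default=0, force=True)


@Language.component("token_counter")
def token_counter(doc):
    """Store the number of tokens on the Doc and pass it on."""
    doc._.token_count = len(doc)
    return doc  # a component must always return the Doc


nlp = spacy.blank("en")
nlp.add_pipe("sentencizer")
# first=True puts the custom component ahead of the sentencizer.
nlp.add_pipe("token_counter", first=True)

doc = nlp("This is raw text")
print(nlp.pipe_names)
print(doc._.token_count)
```

add_pipe also accepts before=, after= and last= to control where the component lands in the pipeline.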

When you load a model, spaCy first consults the model’s meta.json (and, in spaCy v3, its config.cfg, which defines the pipeline components and their settings).

The meta typically includes the model details, the ID of a language class, and an optional list of pipeline components.

spaCy then does the following:

  • Load the language class and data for the given ID via get_lang_class and initialize it. The Language class contains the shared vocabulary, tokenization rules and the language-specific annotation scheme.
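  • Iterate over the pipeline names and create each component using create_pipe, which looks them up in Language.factories. (This describes the spaCy v2 flow; in spaCy v3, add_pipe takes the component's registered name and creates the component itself.)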
  • Add each pipeline component to the pipeline in order, using add_pipe.
  • Make the model data available to the Language class by calling from_disk with the path to the model data directory.
PLAINTEXT
{"lang": "en", "name": "core_web_sm", "description": "Example model for spaCy", "pipeline": ["tagger", "parser", "ner"]}
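The loading steps listed above can be sketched by hand. This is only an illustration under stated assumptions, not what spacy.load literally executes: it assumes spaCy v3, it first saves a tiny pipeline (blank English plus the built-in sentencizer) to a temporary directory so there is something to load, and it assumes each component name in meta.json is also its factory name, which holds for sentencizer but is not guaranteed for every pipeline.

```python
import json
import tempfile
from pathlib import Path

import spacy
from spacy.util import get_lang_class

with tempfile.TemporaryDirectory() as tmp_dir:
    model_dir = Path(tmp_dir)

    # Setup for the demo: write a small pipeline to disk.
    source = spacy.blank("en")
    source.add_pipe("sentencizer")
    source.to_disk(model_dir)

    # Read the language ID and pipeline names from meta.json.
    meta = json.loads((model_dir / "meta.json").read_text(encoding="utf8"))

    # 1. Look up the language class for the given ID and initialize it.
    lang_cls = get_lang_class(meta["lang"])
    nlp = lang_cls()

    # 2. + 3. Create each component and add it to the pipeline, in order.
    for name in meta["pipeline"]:
        nlp.add_pipe(name)

    # 4. Load the saved data (vocab, tokenizer, component state) from disk.
    nlp.from_disk(model_dir)

doc = nlp("First sentence. Second sentence.")
print(meta["lang"], nlp.pipe_names)
print([sent.text for sent in doc.sents])
```

In everyday code, spacy.load does all of this for you; spelling it out mainly helps when debugging why a component is missing or out of order.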

Fundamentally, a spaCy model consists of three components: the weights (binary data loaded from a directory), a pipeline of functions called in order, and language data such as the tokenization rules and annotation scheme.

spaCy model architecture diagram showing weights, pipeline functions, and language data as the three core components

Disabling and modifying pipeline components

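If you don’t need a particular component of the pipeline, for example the tagger or the parser, you can disable it when loading the model. This can sometimes make a big difference to loading speed. (In spaCy v3, disable= still loads the component but does not run it; pass exclude= to skip loading it entirely.)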

PYTHON
nlp = spacy.load("en_core_web_sm", disable=["tagger", "parser"])
nlp
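The difference between disabling and excluding a component at load time can be seen in a small sketch, assuming spaCy v3. To stay runnable without a model download, it saves a blank English pipeline with the built-in sentencizer to a temporary directory and loads that instead of en_core_web_sm; the same disable= and exclude= arguments apply to a packaged model.

```python
import tempfile
from pathlib import Path

import spacy

with tempfile.TemporaryDirectory() as tmp_dir:
    model_dir = Path(tmp_dir)

    # Setup for the demo: a tiny pipeline saved to disk.
    base = spacy.blank("en")
    base.add_pipe("sentencizer")
    base.to_disk(model_dir)

    # disable: the component is loaded but switched off.
    nlp_disabled = spacy.load(model_dir, disable=["sentencizer"])
    # exclude: the component is not loaded at all.
    nlp_excluded = spacy.load(model_dir, exclude=["sentencizer"])

# component_names lists everything loaded; pipe_names lists only active components.
print(nlp_disabled.component_names, nlp_disabled.pipe_names)
print(nlp_excluded.component_names, nlp_excluded.pipe_names)

# A disabled component can be switched back on later; an excluded one cannot.
nlp_disabled.enable_pipe("sentencizer")
print(nlp_disabled.pipe_names)
```

Use disable= when you may need the component later in the same process, and exclude= when you are sure you never will.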

In some cases, you do want to load all pipeline components and their weights, because you need them at different points in your application. However, if you only need a Doc object with named entities, there’s no need to run all pipeline components on it.

PYTHON
doc = nlp("Apple is buying a startup")
for ent in doc.ents:
    print(ent.text, ent.label_)
OUTPUT
Apple ORG

Now we disable ner as well. After disabling ner, the loop prints nothing because doc.ents is empty.

PYTHON
nlp = spacy.load("en_core_web_sm", disable=["tagger", "parser", "ner"])
doc = nlp("Apple is buying a startup")
for ent in doc.ents:
    print(ent.text, ent.label_)

Now suppose we have a large document and want to disable some pipeline components for only part of the text. We can do that with a with block, as given below. At the end of the with block, the disabled pipeline components are restored automatically. Here, the tagger and parser are disabled for the first doc; the second doc is tagged and parsed.

PYTHON
nlp = spacy.load('en_core_web_sm')

# 1. Use as a contextmanager
with nlp.disable_pipes("tagger", "parser"):
    doc = nlp("I won't be tagged and parsed")
doc = nlp("I will be tagged and parsed")

Alternatively, disable_pipes returns an object that lets you call its restore() method to restore the disabled components when needed. This can be useful if you want to prevent unnecessary code indentation of large blocks.

PYTHON
# 2. Restore manually
disabled = nlp.disable_pipes("ner")
doc = nlp("I won't have named entities")
disabled.restore()
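In spaCy v3, disable_pipes has been renamed to select_pipes (the old name still works but is deprecated). The sketch below assumes spaCy v3 and shows the same context-manager pattern with select_pipes, using a blank English pipeline and the built-in sentencizer so that it runs without a model download. It records nlp.pipe_names inside and outside the block to show that the component is restored automatically.

```python
import spacy

nlp = spacy.blank("en")
nlp.add_pipe("sentencizer")

text = "First sentence. Second sentence."

# Inside the block the sentencizer is switched off.
with nlp.select_pipes(disable=["sentencizer"]):
    names_inside = list(nlp.pipe_names)
    doc_without = nlp(text)

# After the block the sentencizer is active again.
names_outside = list(nlp.pipe_names)
doc_with = nlp(text)

print(names_inside, names_outside)
# has_annotation reports whether sentence boundaries were set on the Doc.
print(doc_without.has_annotation("SENT_START"), doc_with.has_annotation("SENT_START"))
```

select_pipes also accepts enable=[...], which keeps only the named components active and disables everything else.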

Conclusion

In this tutorial you explored spaCy's processing pipeline architecture: how raw text flows through the tokenizer, tagger, parser, and NER components in sequence, and how to disable or selectively apply components for performance optimization.

Key takeaways:

  • spaCy's pipeline is a sequence of components that each receive and annotate the same Doc object: the output of one component is the input to the next.
  • nlp.pipe(texts, disable=[...]) is significantly faster than calling nlp(text) in a loop: batching amortizes model overhead, and disabling unused components avoids unnecessary computation.
  • nlp.disable_pipes() (renamed select_pipes in spaCy v3) as a context manager is the cleanest way to temporarily skip components: disabled components are restored automatically at the end of the with block.
  • A model's meta.json file controls which pipeline components are loaded at startup; you can override this by passing disable= to spacy.load().
