Processing Pipeline in spaCy


What is SpaCy?

spaCy is a free, open-source library for advanced Natural Language Processing (NLP) in Python.

If you’re working with a lot of text, you’ll eventually want to know more about it. For example, what’s it about? What do the words mean in context? Who is doing what to whom? What companies and products are mentioned? Which texts are similar to each other?

spaCy is designed specifically for production use and helps you build applications that process and “understand” large volumes of text. It can be used to build information extraction or natural language understanding systems, or to pre-process text for deep learning.

Below are some of spaCy’s features and capabilities. Some of them refer to linguistic concepts, while others are related to more general machine learning functionality.

[Table: spaCy features and capabilities]

Pipeline in spaCy

When you call nlp on a text, spaCy first tokenizes the text to produce a Doc object. The Doc is then processed in several different steps – this is also referred to as the processing pipeline.

The pipeline used by the default models consists of a tagger, a parser and an entity recognizer. Each pipeline component returns the processed Doc, which is then passed on to the next component.

[Figure: the default processing pipeline – tokenizer, tagger, parser and entity recognizer producing a Doc]
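You can inspect the loaded pipeline directly on the nlp object. A minimal sketch, assuming the en_core_web_sm model is installed:

import spacy

nlp = spacy.load("en_core_web_sm")
print(nlp.pipe_names)  # names of the components, e.g. ['tagger', 'parser', 'ner']
print(nlp.pipeline)    # list of (name, component) tuples in processing order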

spaCy installation

You can run the following commands:

!pip install -U spacy

!pip install -U spacy-lookups-data

!python -m spacy download en_core_web_sm
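As a quick sanity check (not part of the original commands), you can print the installed version and try loading the downloaded model:

import spacy

print(spacy.__version__)
nlp = spacy.load("en_core_web_sm")  # raises an OSError if the model is not installed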


Processing text

Here we import the necessary libraries.

import spacy
from spacy.matcher import Matcher
from spacy.tokens import Span
from spacy import displacy

spacy.load() loads a model. When you call nlp on a text, spaCy will tokenize it and then call each component on the Doc, in order. It then returns the processed Doc that you can work with.

nlp = spacy.load("en_core_web_sm")
doc = nlp('This is raw text')
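The returned Doc carries the annotations produced by the pipeline components. A small illustrative sketch:

import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("This is raw text")
for token in doc:
    # part-of-speech tags and dependency labels are set by the tagger and parser
    print(token.text, token.pos_, token.dep_)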

When processing large volumes of text, the statistical models are usually more efficient if you let them work on batches of texts. spaCy’s nlp.pipe method takes an iterable of texts and yields processed Doc objects. The batching is done internally.

texts = ["This is raw text", "There is lots of text"]
docs = list(nlp.pipe(texts))
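nlp.pipe also accepts a batch_size argument, and with as_tuples=True you can pass extra context along with each text. A sketch with arbitrary example values:

import spacy

nlp = spacy.load("en_core_web_sm")
data = [("This is raw text", {"id": 1}), ("There is lots of text", {"id": 2})]

# as_tuples=True yields (doc, context) pairs instead of bare Doc objects
for doc, context in nlp.pipe(data, as_tuples=True, batch_size=50):
    print(context["id"], len(doc))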

Tips for efficient processing

  • Process the texts as a stream using nlp.pipe and buffer them in batches, instead of one-by-one. This is usually much more efficient.
  • Only apply the pipeline components you need. Getting predictions from the model that you don’t actually need adds up and becomes very inefficient at scale. To prevent this, use the disable keyword argument to disable the components you don’t need.

In this example, we’re using nlp.pipe to process a (potentially very large) iterable of texts as a stream. Because we’re only accessing the named entities in doc.ents (set by the ner component), we’ll disable all other statistical components (the tagger and parser) during processing. nlp.pipe yields Doc objects, so we can iterate over them and access the named entity predictions.

The code below disables the tagger and parser. We print each entity's text along with the label assigned by the named entity recognizer (ner).

import spacy

texts = [
    "Net income was $9.4 million compared to the prior year of $2.7 million.",
    "Revenue exceeded twelve billion dollars, with a loss of $1b.",
]

nlp = spacy.load("en_core_web_sm")
docs = nlp.pipe(texts, disable=["tagger", "parser"])
for doc in docs:
    # Do something with the doc here
    print([(ent.text, ent.label_) for ent in doc.ents])
    print()
[('$9.4 million', 'MONEY'), ('the prior year', 'DATE'), ('$2.7 million', 'MONEY')]

[('twelve billion dollars', 'MONEY'), ('1b', 'MONEY')]

How Pipelines Work

spaCy makes it very easy to create your own pipelines consisting of reusable components – this includes spaCy’s default tagger, parser and entity recognizer, but also your own custom processing functions. A pipeline component can be added to an already existing nlp object, specified when initializing a Language class, or defined within a model package.
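For example, a simple custom component can be written as a plain function that takes a Doc and returns it. The component below is a made-up illustration, added using the spaCy v2 add_pipe API:

import spacy

nlp = spacy.load("en_core_web_sm")

def print_length(doc):
    # a component receives the Doc, may modify it, and must return it
    print("Doc length:", len(doc))
    return doc

nlp.add_pipe(print_length, name="print_length", last=True)
doc = nlp("A short example sentence")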

When you load a model, spaCy first consults the model’s meta.json.

The meta typically includes the model details, the ID of a language class, and an optional list of pipeline components. spaCy then does the following:

  • Load the language class and data for the given ID via get_lang_class and initialize it. The Language class contains the shared vocabulary, tokenization rules and the language-specific annotation scheme.
  • Iterate over the pipeline names and create each component using create_pipe, which looks them up in Language.factories.
  • Add each pipeline component to the pipeline in order, using add_pipe.
  • Make the model data available to the Language class by calling from_disk with the path to the model data directory.
Here is an example meta.json:

{
  "lang": "en",
  "name": "core_web_sm",
  "description": "Example model for spaCy",
  "pipeline": ["tagger", "parser", "ner"]
}

Fundamentally, a spaCy model consists of three components: the weights, i.e. binary data loaded in from a directory, a pipeline of functions called in order, and language data like the tokenization rules and annotation scheme.
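Roughly speaking, the loading steps described above could be reproduced manually like this (a sketch; the model data path is a placeholder):

import spacy

lang = "en"
pipeline = ["tagger", "parser", "ner"]
model_data_path = "path/to/model/data"        # placeholder path

cls = spacy.util.get_lang_class(lang)         # 1. get the Language class, e.g. English
nlp = cls()                                   # 2. initialize it
for name in pipeline:
    component = nlp.create_pipe(name)         # 3. create each pipeline component
    nlp.add_pipe(component)                   # 4. add the component to the pipeline
nlp.from_disk(model_data_path)                # 5. load in the binary weights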


Disabling and modifying pipeline components

If you don’t need a particular component of the pipeline – for example, the tagger or the parser – you can disable loading it. This can sometimes make a big difference and improve loading speed.

nlp = spacy.load("en_core_web_sm", disable=["tagger", "parser"])
nlp
<spacy.lang.en.English at 0x2792d8109b0>
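Checking nlp.pipe_names confirms which components are left; with the tagger and parser disabled, only the entity recognizer should remain:

nlp.pipe_names
['ner']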

In some cases, you do want to load all pipeline components and their weights because you need them at different points in your application. However, if you only need a Doc object with named entities, there’s no need to run all the pipeline components on it.

doc = nlp("Apple is buying a startup")
for ent in doc.ents:
    print(ent.text, ent.label_)
Apple ORG

Now we disable ner as well. After disabling ner, the loop below prints no output.

nlp = spacy.load("en_core_web_sm", disable=["tagger", "parser", "ner"])
doc = nlp("Apple is buying a startup")
for ent in doc.ents:
    print(ent.text, ent.label_)

Now suppose we have a large document and only want to disable some pipeline components for a particular part of the text. We can do that as shown below. At the end of the with block, the disabled pipeline components are restored automatically. Here, the tagger and parser are disabled for the first doc, while the second doc is tagged and parsed.

nlp = spacy.load('en_core_web_sm')

# 1. Use as a contextmanager
with nlp.disable_pipes("tagger", "parser"):
    doc = nlp("I won't be tagged and parsed")
doc = nlp("I will be tagged and parsed")

Alternatively, disable_pipes returns an object that lets you call its restore() method to restore the disabled components when needed. This can be useful if you want to avoid unnecessarily indenting large blocks of code.

# 2. Restore manually
disabled = nlp.disable_pipes("ner")
doc = nlp("I won't have named entities")
disabled.restore()