SpaCy Introduction for NLP | Linguistic Features Extraction

Getting Started with spaCy

This tutorial is a crisp and effective introduction to spaCy and the various NLP linguistic features it offers. We will perform several NLP-related tasks, such as tokenization, part-of-speech tagging, named entity recognition, dependency parsing, and visualization using displaCy.

spaCy is a free, open-source library for advanced Natural Language Processing (NLP) in Python. spaCy is designed specifically for production use and helps you build applications that process and understand large volumes of text. It’s written in Cython and is designed to build information extraction or natural language understanding systems, or to pre-process text for deep learning.

Linguistic Features in spaCy

Processing raw text intelligently is difficult: most words are rare, and it’s common for words that look completely different to mean almost the same thing.

That’s exactly what spaCy is designed to do: you put in raw text and get back a Doc object that comes with a variety of linguistic annotations.

spaCy acts as a one-stop shop for various tasks used in NLP projects, such as tokenization, lemmatization, part-of-speech (POS) tagging, named entity recognition, dependency parsing, sentence segmentation, word-to-vector transformations, and other text cleaning and normalization methods.

Setup

!pip install -U spacy
!pip install -U spacy-lookups-data
!python -m spacy download en_core_web_sm

Once we've downloaded and installed a model, we can load it via spacy.load(). spaCy provides different types of pretrained models; the default small model for English is en_core_web_sm.

Here, the nlp object is an instance of spaCy's Language class. spacy.load() returns a Language object containing all the components and data needed to process text.

import spacy
nlp = spacy.load('en_core_web_sm')
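
As a quick check, we can list the components the loaded pipeline contains (a minimal sketch; the exact component names vary by spaCy and model version):

print(nlp.pipe_names)
# e.g. ['tok2vec', 'tagger', 'parser', 'attribute_ruler', 'lemmatizer', 'ner']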

Tokenization

Tokenization is the task of splitting a text into meaningful segments called tokens. The input to the tokenizer is a unicode text and the output is a Doc object.

Doc is a sequence of Token objects. Each Doc consists of individual tokens, and we can iterate over them.

doc = nlp("Apple isn't looking at buying U.K. startup for $1 billion")
for token in doc:
    print(token.text)
Apple
is
n't
looking
at
buying
U.K.
startup
for
$
1
billion
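
Beyond its text, each Token exposes handy attributes such as its index in the Doc and whether it is alphabetic, punctuation, or number-like. A small sketch using standard Token attributes:

for token in doc:
    # token.i is the token's index within the parent Doc
    print(token.i, token.text, token.is_alpha, token.is_punct, token.like_num)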

Lemmatization

A task closely related to tokenization, lemmatization is the method of reducing a word to its base or origin form. This reduced form or root word is called a lemma.

For example, organizes, organized and organizing are all forms of organize. Here, organize is the lemma.

Lemmatization is necessary because it helps to reduce the inflected forms of a word so that they can be analyzed as a single item. It can also help you normalize the text.
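
As a quick sketch of the organize example above (each form is expected to reduce to the lemma organize, though the exact output depends on the model version):

for token in nlp("organizes organized organizing"):
    print(token.text, '->', token.lemma_)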

doc
Apple isn't looking at buying U.K. startup for $1 billion
for token in doc:
    print(token.text, token.lemma_)
Apple Apple
is be
n't not
looking look
at at
buying buy
U.K. U.K.
startup startup
for for
$ $
1 1
billion billion

Part-of-speech tagging

Part-of-speech (POS) tagging is the process of assigning a POS tag to each token depending on its usage in the sentence.

for token in doc:
    print(f'{token.text:{15}} {token.lemma_:{15}} {token.pos_:{10}} {token.is_stop}')
Apple           Apple           PROPN      False
is              be              AUX        True
n't             not             PART       True
looking         look            VERB       False
at              at              ADP        True
buying          buy             VERB       False
U.K.            U.K.            PROPN      False
startup         startup         NOUN       False
for             for             ADP        True
$               $               SYM        False
1               1               NUM        False
billion         billion         NUM        False
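
If a tag is unfamiliar, spacy.explain() returns a short human-readable description of it:

print(spacy.explain('PROPN'))   # proper noun
print(spacy.explain('ADP'))     # adposition
print(spacy.explain('AUX'))     # auxiliary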

Dependency Parsing

Dependency parsing is the process of extracting the dependency parse of a sentence to represent its grammatical structure. It defines the dependency relationship between headwords and their dependents. The head of a sentence has no dependency and is called the root of the sentence. The verb is usually the head of the sentence. All other words are linked to the headword.
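
Each token's relation to its head is available directly on the Token via the dep_ and head attributes; a short sketch over the same doc:

for token in doc:
    # dep_ is the dependency label; head is the token this word attaches to
    print(f'{token.text:{12}} {token.dep_:{10}} {token.head.text}')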

Noun chunks are “base noun phrases” – flat phrases that have a noun as their head. To get the noun chunks in a document, simply iterate over Doc.noun_chunks.

for chunk in doc.noun_chunks:
    print(f'{chunk.text:{30}} {chunk.root.text:{15}} {chunk.root.dep_}')
Apple                          Apple           nsubj
U.K. startup                   startup         dobj

Named Entity Recognition

Named Entity Recognition (NER) is the process of locating named entities in unstructured text and then classifying them into pre-defined categories, such as person names, organizations, locations, monetary values, percentages, time expressions, and so on.

It can be used to tag a set of documents in order to improve keyword search. Named entities are available as the ents property of a Doc.

doc
Apple isn't looking at buying U.K. startup for $1 billion
for ent in doc.ents:
    print(ent.text, ent.label_)
Apple ORG
U.K. GPE
$1 billion MONEY
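
Each entity is a Span that also carries its character offsets into the original text, and spacy.explain() works for entity labels too:

for ent in doc.ents:
    # start_char/end_char are offsets into the original string
    print(ent.text, ent.start_char, ent.end_char, spacy.explain(ent.label_))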

Sentence Segmentation

Sentence segmentation is the process of locating the start and end of sentences in a given text. This allows you to divide a text into linguistically meaningful units. spaCy uses the dependency parse to determine sentence boundaries. In spaCy, the sents property is used to extract sentences.

doc
Apple isn't looking at buying U.K. startup for $1 billion
for sent in doc.sents:
    print(sent)
Apple isn't looking at buying U.K. startup for $1 billion
doc1 = nlp("Welcome to KGP Talkie. Thanks for watching. Please like and subscribe")
for sent in doc1.sents:
    print(sent)
Welcome to KGP Talkie.
Thanks for watching.
Please like and subscribe
doc1 = nlp("Welcome to.*.KGP Talkie.*.Thanks for watching")
for sent in doc1.sents:
    print(sent)
Welcome to.*.KGP Talkie.*.Thanks for watching

In the above example, our sentence segmentation process fails to detect the sentence boundaries due to the unusual delimiters. In such cases, we can write our own custom rules to detect sentence boundaries based on those delimiters.

Here’s an example where an ellipsis (...) is used as the delimiter. In spaCy v3, a custom component must be registered with the @Language.component decorator and then added to the pipeline by name.

from spacy.language import Language

@Language.component('set_rule')
def set_rule(doc):
    # Mark the token after every '...' as the start of a new sentence
    for token in doc[:-1]:
        if token.text == '...':
            doc[token.i + 1].is_sent_start = True
    return doc

nlp.add_pipe('set_rule', before='parser')
text = 'Welcome to KGP Talkie...Thanks...Like and Subscribe!'
doc = nlp(text)
for sent in doc.sents:
    print(sent)
Welcome to KGP Talkie...
Thanks...
Like and Subscribe!
for token in doc:
    print(token.text)
Welcome
to
KGP
Talkie
...
Thanks
...
Like
and
Subscribe
!

Visualization

spaCy comes with a built-in visualizer called displaCy. We can use it to visualize a dependency parse or named entities in a browser or a Jupyter notebook.

You can pass a Doc or a list of Doc objects to displaCy and run displacy.serve to run the web server, or displacy.render to generate the raw markup.

from spacy import displacy
doc
Welcome to KGP Talkie...Thanks...Like and Subscribe!

Visualizing the dependency parse

The dependency visualizer, dep, shows part-of-speech tags and syntactic dependencies.

displacy.render(doc, style='dep')

The options argument lets you specify a dictionary of settings to customize the layout.

displacy.render(doc, style='dep', options={'compact':True, 'distance': 100})

Visualizing the entity recognizer

The entity visualizer, ent, highlights named entities and their labels in a text.

doc = nlp("Apple isn't looking at buyig U.K. startup for $1 billion")
displacy.render(doc, style='ent')
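
The entity visualizer also accepts an options dictionary; for example, the ents option restricts which entity types are highlighted:

displacy.render(doc, style='ent', options={'ents': ['ORG', 'MONEY']})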

Conclusion

spaCy is a modern, reliable NLP framework that has quickly become a standard for doing NLP in Python. Its main advantages are speed, accuracy, and extensibility.

We have gained insight into linguistic annotations such as tokenization, lemmatization, part-of-speech (POS) tagging, named entity recognition, dependency parsing, sentence segmentation, and visualization using displaCy.