SpaCy Introduction for NLP | Linguistic Features Extraction
Getting Started with spaCy
This tutorial is a crisp and effective introduction to spaCy and the various NLP linguistic features it offers.We will perform several NLP related tasks, such as Tokenization, part-of-speech tagging, named entity recognition, dependency parsing and Visualization using displaCy.
spaCy is a free, open-source library for advanced Natural Language Processing (NLP) in Python.spaCy is designed specifically for production use and helps you build applications that process and understand large volumes of text. It’s written in Cython and is designed to build information extraction or natural language understanding systemor to pre-process text for deep learning.
Linguistic Features in spaCy
Processing raw text intelligently is difficult: most words are rare, and it’s common for words that look completely different to mean almost the same thing.
That’s exactly what spaCy is designed to do: you put in raw text, and get back a Doc object, that comes with a variety of Linguistic annotations.
spaCy acts as a one-stop-shop for various tasks used in NLP projects, such as Tokenization, Lemmatisation, Part-of-speech(POS) tagging, Name entity recognition, Dependency parsing, Sentence Segmentation, Word-to-vector transformations, and other cleaning and normalization text methods.

Setup
!pip install -U spacy
!pip install -U spacy-lookups-data
!python -m spacy download en_core_web_sm
Once we’ve downloaded and installed a model, we will load it via spacy.load(). spaCy has different types of pretrained models. The default model for the English language is en_core_web_sm.
Here, the nlp object is a language instance of spaCy model.This will return a Language object containing all components and data needed to process text.
import spacy
nlp = spacy.load('en_core_web_sm')
Tokenization
Tokenization is the task of splitting a text into meaningful segments called tokens. The input to the tokenizer is a unicode text and the output is a Doc object.
A Doc is a sequence of Token objects. Each Doc consists of individual tokens, and we can iterate over them.
doc = nlp("Apple isn't looking at buyig U.K. startup for $1 billion")
for token in doc:
print(token.text)
Apple is n't looking at buyig U.K. startup for $ 1 billion
Lemmatization
A work-related to tokenization, lemmatization is the method of decreasing the word to its base form, or origin form. This reduced form or root word is called a lemma.
For example, organizes, organized and organizing are all forms of organize. Here, organize is the lemma.
Lemmatization is necessary because it helps to reduce the inflected forms of a word so that they can be analyzed as a single item. It can also help you normalize the text.
doc
Apple isn't looking at buyig U.K. startup for $1 billion
for token in doc:
print(token.text, token.lemma_)
Apple Apple is be n't not looking look at at buyig buyig U.K. U.K. startup startup for for $ $ 1 1 billion billion
Part-of-speech tagging
Part of speech tagging is the process of assigning a POS tag to each token depending on its usage in the sentence.
for token in doc:
print(f'{token.text:{15}} {token.lemma_:{15}} {token.pos_:{10}} {token.is_stop}')
Apple Apple PROPN False is be AUX True n't not PART True looking look VERB False at at ADP True buyig buyig NOUN False U.K. U.K. PROPN False startup startup NOUN False for for ADP True $ $ SYM False 1 1 NUM False billion billion NUM False
Dependency Parsing
Dependency parsing is the process of extracting the dependency parse of a sentence to represent its grammatical structure. It defines the dependency relationship between headwords and their dependents. The head of a sentence has no dependency and is called the root of the sentence. The verb is usually the head of the sentence. All other words are linked to the headword.
Noun chunks are “base noun phrases” – flat phrases that have a noun as their head.To get the noun chunks in a document, simply iterate over Doc.noun_chunks.
for chunk in doc.noun_chunks:
print(f'{chunk.text:{30}} {chunk.root.text:{15}} {chunk.root.dep_}')
Apple Apple nsubj buyig U.K. startup startup pobj
Named Entity Recognition
Named Entity Recognition (NER) is the process of locating named entities in unstructured text and then classifying them into pre-defined categories, such as person names, organizations, locations, monetary values, percentages, time expressions, and so on.
It is used to populate tags for a set of documents in order to improve the keyword search. Named entities are available as the ents property of a Doc.
doc
Apple isn't looking at buyig U.K. startup for $1 billion
for ent in doc.ents:
print(ent.text, ent.label_)
Apple ORG U.K. GPE $1 billion MONEY
Sentence Segmentation
Sentence Segmentation is the process of locating the start and end of sentences in a given text. This allows you to you divide a text into linguistically meaningful units.SpaCy uses the dependency parse to determine sentence boundaries. In spaCy, the sents property is used to extract sentences.
doc
Apple isn't looking at buyig U.K. startup for $1 billion
for sent in doc.sents:
print(sent)
Apple isn't looking at buyig U.K. startup for $1 billion
doc1 = nlp("Welcome to KGP Talkie. Thanks for watching. Please like and subscribe")
for sent in doc1.sents:
print(sent)
Welcome to KGP Talkie. Thanks for watching. Please like and subscribe
doc1 = nlp("Welcome to.*.KGP Talkie.*.Thanks for watching")
for sent in doc1.sents:
print(sent)
Welcome to.*.KGP Talkie.*.Thanks for watching
From the above example our sentence segmentation process fail to detect the sentence boundries due to delimiters. In such cases we write our own customize rules to detect sentence boundry based on delimiters.
Here’s an example, where an text(…) is used as the delimiter.
def set_rule(doc):
for token in doc[:-1]:
if token.text == '...':
doc[token.i + 1].is_sent_start = True
return doc
nlp.add_pipe(set_rule, before = 'parser')
text = 'Welcome to KGP Talkie...Thanks...Like and Subscribe!'
doc = nlp(text)
for sent in doc.sents:
print(sent)
Welcome to KGP Talkie... Thanks... Like and Subscribe!
for token in doc:
print(token.text)
Welcome to KGP Talkie ... Thanks ... Like and Subscribe !
Visualization
spaCy comes with a built-in visualizer called displaCy. We can use it to visualize a dependency parse or named entities in a browser or a Jupyter notebook.
You can pass a Doc or a list of Doc objects to displaCy and run displacy.serve to run the web server, or displacy.render to generate the raw markup.
from spacy import displacy
doc
Welcome to KGP Talkie...Thanks...Like and Subscribe!
Visualizing the dependency parse
The dependency visualizer, dep, shows part-of-speech tags and syntactic dependencies.
displacy.render(doc, style='dep')

The argument options lets you specify a dictionary of settings to customize the layout.
displacy.render(doc, style='dep', options={'compact':True, 'distance': 100})

Visualizing the entity recognizer
The entity visualizer, ent, highlights named entities and their labels in a text.
doc = nlp("Apple isn't looking at buyig U.K. startup for $1 billion")
displacy.render(doc, style='ent')

Conclusion
spaCy is a modern, reliable NLP framework that quickly became the standard for doing NLP with Python. Its main advantages are: speed, accuracy, extensibility.
We have gained insights of linguistic Annotations like Tokenization, Lemmatisation, Part-of-speech(POS) tagging, Entity recognition, Dependency parsing, Sentence segmentation and Visualization using displaCy.
2 Comments