spaCy Introduction for NLP | Linguistic Feature Extraction
Getting Started with spaCy
This tutorial is a crisp and effective introduction to spaCy and the various NLP linguistic features it offers. We will perform several NLP tasks, such as tokenization, part-of-speech tagging, named entity recognition, dependency parsing, and visualization using displaCy.
spaCy is a free, open-source library for advanced Natural Language Processing (NLP) in Python. spaCy is designed specifically for production use and helps you build applications that process and understand large volumes of text. It’s written in Cython and can be used to build information extraction or natural language understanding systems, or to pre-process text for deep learning.
Linguistic Features in spaCy
Processing raw text intelligently is difficult: most words are rare, and it’s common for words that look completely different to mean almost the same thing.
That’s exactly what spaCy is designed to do: you put in raw text and get back a Doc object that comes with a variety of linguistic annotations.
spaCy acts as a one-stop shop for the various tasks used in NLP projects, such as tokenization, lemmatization, part-of-speech (POS) tagging, named entity recognition, dependency parsing, sentence segmentation, word-to-vector transformations, and other text cleaning and normalization methods.
Setup
!pip install -U spacy
!pip install -U spacy-lookups-data
!python -m spacy download en_core_web_sm
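Before moving on, it's worth verifying the installation. A quick sanity check (the exact version printed will depend on your environment):

import spacy
# Print the installed spaCy version to confirm the setup succeeded
print(spacy.__version__)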
Once we've downloaded and installed a model, we can load it via spacy.load(). spaCy has different types of pretrained models; the default model for the English language is en_core_web_sm.
Here, the nlp object is a Language instance of the spaCy model. spacy.load() returns a Language object containing all the components and data needed to process text.
import spacy
nlp = spacy.load('en_core_web_sm')
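Loading a model also gives us its processing pipeline. As a quick sketch, we can inspect the components that en_core_web_sm runs on each text (the exact component names can vary between model versions):

# List the pipeline components applied to every text,
# e.g. ['tok2vec', 'tagger', 'parser', ...] depending on the model version
print(nlp.pipe_names)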
Tokenization
Tokenization is the task of splitting a text into meaningful segments, called tokens. The input to the tokenizer is Unicode text, and the output is a Doc object.
A Doc is a sequence of Token objects. Each Doc consists of individual tokens, and we can iterate over them.
doc = nlp("Apple isn't looking at buyig U.K. startup for $1 billion") for token in doc: print(token.text)
Apple
is
n't
looking
at
buying
U.K.
startup
for
$
1
billion
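Beyond the text, each Token exposes lexical attributes that are handy for filtering. A minimal sketch, reusing the doc created above:

for token in doc:
    # is_alpha: alphabetic token; is_punct: punctuation; like_num: resembles a number
    print(token.text, token.is_alpha, token.is_punct, token.like_num)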
Lemmatization
A task closely related to tokenization, lemmatization is the process of reducing a word to its base or root form. This reduced form, or root word, is called a lemma.
For example, organizes, organized and organizing are all forms of organize. Here, organize is the lemma.
Lemmatization is useful because it reduces the inflected forms of a word so that they can be analyzed as a single item. It can also help you normalize the text.
doc
Apple isn't looking at buying U.K. startup for $1 billion
for token in doc:
    print(token.text, token.lemma_)
Apple Apple
is be
n't not
looking look
at at
buying buy
U.K. U.K.
startup startup
for for
$ $
1 1
billion billion
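As a small illustration of text normalization, the lemmas can be joined back into a single string (dropping punctuation here is just one possible choice, made for illustration):

# Rebuild the sentence from lemmas, skipping punctuation tokens
normalized = ' '.join(token.lemma_ for token in doc if not token.is_punct)
print(normalized)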
Part-of-speech tagging
Part-of-speech tagging is the process of assigning a POS tag to each token depending on its usage in the sentence.
for token in doc:
    print(f'{token.text:{15}} {token.lemma_:{15}} {token.pos_:{10}} {token.is_stop}')
Apple           Apple           PROPN      False
is              be              AUX        True
n't             not             PART       True
looking         look            VERB       False
at              at              ADP        True
buying          buy             VERB       False
U.K.            U.K.            PROPN      False
startup         startup         NOUN       False
for             for             ADP        True
$               $               SYM        False
1               1               NUM        False
billion         billion         NUM        False
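If a tag abbreviation is unfamiliar, spacy.explain() returns a short, human-readable description of it:

# Look up descriptions for tag and dependency-label abbreviations
print(spacy.explain('PROPN'))   # proper noun
print(spacy.explain('ADP'))     # adposition
print(spacy.explain('nsubj'))   # nominal subject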
Dependency Parsing
Dependency parsing is the process of extracting the dependency parse of a sentence to represent its grammatical structure. It defines the dependency relationships between head words and their dependents. The head of a sentence has no dependency and is called the root of the sentence. The verb is usually the head of the sentence; all other words are linked to a head word.
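Each token's relation to its head is exposed through token.dep_ and token.head. A short sketch that prints the dependency arc for every token in our example doc:

for token in doc:
    # token.dep_ is the relation label; token.head is the governing token
    print(f'{token.text:{12}} {token.dep_:{10}} {token.head.text}')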
Noun chunks are “base noun phrases”: flat phrases that have a noun as their head. To get the noun chunks in a document, simply iterate over Doc.noun_chunks.
for chunk in doc.noun_chunks:
    print(f'{chunk.text:{30}} {chunk.root.text:{15}} {chunk.root.dep_}')
Apple                          Apple           nsubj
U.K. startup                   startup         dobj
Named Entity Recognition
Named Entity Recognition (NER) is the process of locating named entities in unstructured text and classifying them into pre-defined categories, such as person names, organizations, locations, monetary values, percentages, time expressions, and so on.
It can be used to populate tags for a set of documents in order to improve keyword search. Named entities are available as the ents property of a Doc.
doc
Apple isn't looking at buying U.K. startup for $1 billion
for ent in doc.ents:
    print(ent.text, ent.label_)
Apple ORG
U.K. GPE
$1 billion MONEY
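Each entity also carries its character offsets into the original text, which is useful when passing results to downstream systems:

for ent in doc.ents:
    # start_char and end_char are offsets into the original string
    print(ent.text, ent.start_char, ent.end_char, ent.label_)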
Sentence Segmentation
Sentence segmentation is the process of locating the start and end of sentences in a given text. This allows you to divide a text into linguistically meaningful units. spaCy uses the dependency parse to determine sentence boundaries. In spaCy, the sents property is used to extract sentences.
doc
Apple isn't looking at buying U.K. startup for $1 billion
for sent in doc.sents:
    print(sent)
Apple isn't looking at buying U.K. startup for $1 billion
doc1 = nlp("Welcome to KGP Talkie. Thanks for watching. Please like and subscribe") for sent in doc1.sents: print(sent)
Welcome to KGP Talkie.
Thanks for watching.
Please like and subscribe
doc1 = nlp("Welcome to.*.KGP Talkie.*.Thanks for watching") for sent in doc1.sents: print(sent)
Welcome to.*.KGP Talkie.*.Thanks for watching
In the above example, the sentence segmentation process fails to detect the sentence boundaries because of the unusual delimiters. In such cases, we can write our own custom rules to detect sentence boundaries based on those delimiters.
Here's an example where an ellipsis (...) is used as the delimiter.
from spacy.language import Language

# In spaCy v3, a custom component must be registered before it can be added by name
@Language.component('set_rule')
def set_rule(doc):
    for token in doc[:-1]:
        if token.text == '...':
            # Mark the token after the ellipsis as the start of a new sentence
            doc[token.i + 1].is_sent_start = True
    return doc

# Insert the custom rule before the parser so it can influence sentence boundaries
nlp.add_pipe('set_rule', before='parser')
text = 'Welcome to KGP Talkie...Thanks...Like and Subscribe!'
doc = nlp(text)
for sent in doc.sents:
    print(sent)
Welcome to KGP Talkie...
Thanks...
Like and Subscribe!
for token in doc:
    print(token.text)
Welcome
to
KGP
Talkie
...
Thanks
...
Like
and
Subscribe
!
Visualization
spaCy comes with a built-in visualizer called displaCy. We can use it to visualize a dependency parse or named entities in a browser or a Jupyter notebook.
You can pass a Doc or a list of Doc objects to displaCy and run displacy.serve to run the web server, or displacy.render to generate the raw markup.
from spacy import displacy
doc
Welcome to KGP Talkie...Thanks...Like and Subscribe!
Visualizing the dependency parse
The dependency visualizer, dep, shows part-of-speech tags and syntactic dependencies.
displacy.render(doc, style='dep')
The options argument lets you specify a dictionary of settings to customize the layout.
displacy.render(doc, style='dep', options={'compact':True, 'distance': 100})
Visualizing the entity recognizer
The entity visualizer, ent, highlights named entities and their labels in a text.
doc = nlp("Apple isn't looking at buyig U.K. startup for $1 billion")
displacy.render(doc, style='ent')
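Like the dependency visualizer, the entity visualizer accepts an options dictionary; for example, you can restrict which entity types get highlighted:

# Highlight only ORG and MONEY entities in the rendered markup
displacy.render(doc, style='ent', options={'ents': ['ORG', 'MONEY']})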
Conclusion
spaCy is a modern, reliable NLP framework that has quickly become a standard for doing NLP in Python. Its main advantages are speed, accuracy, and extensibility.
We have gained insight into linguistic annotations such as tokenization, lemmatization, part-of-speech (POS) tagging, named entity recognition, dependency parsing, sentence segmentation, and visualization using displaCy.