Introduction to Natural Language Processing | EduGrad

Did you know that roughly 1 exabyte (10^18 bytes) of data is created on the internet daily, the equivalent of the data on about 250 million DVDs? Most of this data is in the form of text.

What do we do with these mountains of data? Suppose, for example, that a computer reads the number "30".

What should it mean? Is it 30 rupees, the weight of a bag in kilograms, the number of days in a month, or the waist size of a pair of jeans? In all these cases, context is what gives the number meaning.

Unstructured data (also known as free-form text) comprises 70%-80% of the data available on computer networks. The information in this resource is unavailable to governments, businesses, and individuals unless humans read the texts. Natural language processing can be applied to understand, interpret, or characterize the information content of such text.

What is Natural Language Processing?

Natural Language Processing (NLP) is the technology used to help computers understand human language as it is naturally spoken and written. It is concerned with the interaction between humans and computers through natural language, and it applies machine learning algorithms to speech and text.

The process of interaction between humans and machines using NLP goes as follows:

  1. Human talks to the machine
  2. The machine records the audio
  3. Audio is converted to text
  4. Text data is processed
  5. Data is again converted to audio
  6. The machine responds to human by playing the audio file
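The six steps above can be sketched as a skeleton pipeline. The stub functions below stand in for real speech-recognition and text-to-speech components; the function names and return values are illustrative, not a real API.

```python
# Skeleton of the human-machine NLP interaction loop described above.
# All three components are stubs for illustration only.
def speech_to_text(audio):
    # Step 3: convert recorded audio to text (stubbed)
    return "what time is it"

def process_text(text):
    # Step 4: process the text and produce a reply (stubbed)
    return "it is 9 am"

def text_to_speech(text):
    # Step 5: convert the reply text back to audio (stubbed)
    return b"<audio bytes>"

def respond(audio):
    text = speech_to_text(audio)    # steps 1-3: record and transcribe
    reply = process_text(text)      # step 4: process the text
    return text_to_speech(reply)    # steps 5-6: synthesize and play

print(respond(b"<recorded audio>"))
```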

Natural language processing applications

Natural Language Processing with Python is used in the following:

  • Language translators such as Google Translate
  • Word processors like Microsoft Word, and tools like Grammarly, use NLP to check the grammatical accuracy of text.
  • Call centers use Interactive Voice Response (IVR) applications to respond to certain users’ requests.
  • Personal assistants like Google Assistant, Siri, Cortana, and Alexa.
  • Spam detection uses NLP to analyze the contents of an email and classify it as spam or not.

Techniques used in NLP

In this Natural Language Processing tutorial, you’ll learn about the tasks involved in syntactic analysis and semantic analysis.

  1. Syntax

Syntax is the arrangement of words in a sentence such that they make grammatical sense. In NLP, syntactic analysis is used to assess how well the natural language aligns with grammatical rules.

Computer algorithms apply grammatical rules to a group of words and derive meaning from them.

Natural Language Processing techniques used for syntactic analysis:


Tokenization

Tokenization is the process of breaking the original raw text into component pieces, otherwise known as tokens. Generally, this is the first step in processing text in natural language processing.

For example, take the text “Hello, Mr. Anant, how are you doing today?”

After word-level tokenization, we get tokens such as “Hello”, “,”, “Mr.”, “Anant”, “how”, “are”, “you”, “doing”, “today”, and “?”.
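As a minimal illustration, a very simple word-level tokenizer can be written with a regular expression. Note that, unlike smarter tokenizers such as spaCy’s, this one splits “Mr.” into “Mr” and “.”.

```python
import re

def tokenize(text):
    # Keep runs of word characters as tokens, and each punctuation
    # mark as its own token.
    return re.findall(r"\w+|[^\w\s]", text)

print(tokenize("Hello, Mr. Anant, how are you doing today?"))
# → ['Hello', ',', 'Mr', '.', 'Anant', ',', 'how', 'are', 'you', 'doing', 'today', '?']
```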


Stop words

Stop words are words that carry little significance in search queries. Usually, these words are filtered out of search queries because they tend to return a vast amount of unnecessary information.

Each NLP library provides its own list of stop words. Mostly they are words that are commonly used in the English language, such as ‘as’, ‘the’, ‘be’, and ‘are’.
Stop words are filtered out before or after processing natural language data.

Although “stop words” generally refers to the most common words in a language, there is no universal list of stop words used by all natural language processing tools.
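A minimal sketch of stop-word filtering, using a small hand-written list for illustration; real libraries such as NLTK and spaCy ship their own, much larger lists, and no two lists are identical.

```python
# A tiny illustrative stop-word list (real libraries provide larger ones).
STOP_WORDS = {"as", "the", "be", "are", "is", "a", "an", "of", "to", "in"}

def remove_stop_words(tokens):
    # Drop any token whose lowercase form is in the stop-word list.
    return [t for t in tokens if t.lower() not in STOP_WORDS]

print(remove_stop_words(["the", "cat", "is", "in", "the", "hat"]))
# → ['cat', 'hat']
```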


Stemming

Stemming is a kind of normalization for words. Normalization is a technique in natural language processing in which a set of word variants in a sentence is reduced to a common form to simplify lookup.

It involves cutting inflected words down to their root form. Words that have the same meaning but vary with context or sentence are normalized.
Generally, there is one root word, but there are many variations of the same word.

Let’s take an example: the root word is “eat” and its variations are “eats”, “eating”, “eaten”, and so on. With the help of stemming, we can find the root word of any of these variations.

In short, stemming is a rudimentary rule-based process of stripping suffixes (“ing”, “ly”, “es”, “s”, etc.) from a word.
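A toy suffix-stripping stemmer illustrates the idea. This is not the full Porter algorithm, which applies many more rules and conditions, but it shows the rudimentary rule-based process described above.

```python
# Suffixes to strip, longest first so "ing" is tried before "s".
SUFFIXES = ["ing", "ly", "es", "s"]

def stem(word):
    for suffix in SUFFIXES:
        # Only strip if a reasonably long stem remains.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

print([stem(w) for w in ["eats", "eating", "eaten"]])
# → ['eat', 'eat', 'eaten']
```

Note that “eaten” is left untouched: simple suffix rules miss irregular inflections, which is exactly the weakness lemmatization addresses in the next section.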


Lemmatization

Lemmatization, unlike stemming, reduces inflected words properly, ensuring that the root word belongs to the language. In lemmatization, the root word is called the lemma. A lemma (plural lemmas or lemmata) is the canonical form, dictionary form, or citation form of a set of words. Lemmatization entails reducing the various inflected forms of a word to a single form for easy analysis.

For example, runs, running, ran are all forms of the word run, therefore run is the lemma of all these words. Because lemmatization returns an actual word of the language, it is used where it is necessary to get valid words.

Lemmatization does not simply chop off inflections; instead, it relies on a lexical knowledge base like WordNet to obtain the correct base forms of words.
But lemmatization has limits.

For example, stemming maps both happiness and happy to happi, while lemmatization maps the two words to themselves. The WordNet lemmatizer also requires the word’s part of speech to be specified; otherwise, it assumes the word is a noun. Finally, lemmatization cannot handle unknown words: stemming maps both iPhone and iPhones to iPhone, while lemmatization maps both words to themselves.
In general, lemmatization tends to offer better precision than stemming, but at the expense of recall.
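A minimal sketch of the idea using a hand-written lookup table keyed by word and part of speech. A real lemmatizer such as NLTK’s WordNetLemmatizer consults a lexical database instead of a table like this; the entries here are illustrative only.

```python
# Tiny illustrative lemma table: (word, part of speech) -> lemma.
LEMMAS = {
    ("runs", "verb"): "run",
    ("running", "verb"): "run",
    ("ran", "verb"): "run",
    ("better", "adj"): "good",
}

def lemmatize(word, pos="noun"):
    # Like the WordNet lemmatizer, default to treating the word as a
    # noun, and map unknown words to themselves.
    return LEMMAS.get((word, pos), word)

print(lemmatize("running", "verb"))  # → run
print(lemmatize("iPhones"))          # → iPhones (unknown word)
```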

Part-of-speech tagging

It involves identifying the part of speech for every word.


Part-of-speech tagging (POS tagging) is also known as word-category disambiguation or grammatical tagging. It is the process of marking up each word as a particular part of speech.

The process converts a sentence, given as a list of words, into a list of tuples, where each tuple has the form (word, tag). The tag in this case is a part-of-speech tag and tells whether the word is a noun, adjective, verb, and so on.
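A toy lookup-based tagger sketches the (word, tag) output format; the lexicon here is hand-written for illustration. Real taggers in NLTK and spaCy use statistical models trained on annotated corpora rather than a fixed table.

```python
# Tiny illustrative lexicon mapping words to POS tags.
LEXICON = {"the": "DET", "dog": "NOUN", "barks": "VERB", "loudly": "ADV"}

def pos_tag(words):
    # Produce (word, tag) tuples, defaulting unknown words to NOUN.
    return [(w, LEXICON.get(w.lower(), "NOUN")) for w in words]

print(pos_tag(["The", "dog", "barks", "loudly"]))
# → [('The', 'DET'), ('dog', 'NOUN'), ('barks', 'VERB'), ('loudly', 'ADV')]
```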

2. Semantics

Semantics refers to the meaning conveyed by a text. In Natural Language Processing, semantic analysis is one of the difficult aspects that has not been fully resolved yet.

It aims at understanding the meaning and interpretation of words using computer algorithms and analyzing how sentences are structured.

Some techniques in semantic analysis:

  • Named entity recognition (NER):


Analyzing parts of a text so that they can be identified and categorized into preset groups. Examples of such groups include names of people, organizations, and geographical locations.

  • Word sense disambiguation:

Finding out the meaning of a word based on the context.

  • Natural language generation:

Involves the use of databases to derive semantic intentions and convert them into human language.
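As a minimal sketch of the first of these techniques, named entity recognition can be illustrated as a simple gazetteer lookup; the entity lists here are hand-made examples. Production systems such as spaCy’s NER use statistical models rather than fixed lists.

```python
# Tiny illustrative gazetteer: known entity -> category.
GAZETTEER = {
    "Anant": "PERSON",
    "Google": "ORGANIZATION",
    "London": "LOCATION",
}

def find_entities(tokens):
    # Return (token, category) pairs for tokens found in the gazetteer.
    return [(t, GAZETTEER[t]) for t in tokens if t in GAZETTEER]

print(find_entities(["Anant", "works", "at", "Google", "in", "London"]))
# → [('Anant', 'PERSON'), ('Google', 'ORGANIZATION'), ('London', 'LOCATION')]
```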

NLP Libraries

Some common libraries used in natural language processing are:

  1. NLTK – entry-level open-source NLP Tool

Natural Language Toolkit (NLTK) is an open-source Python library for NLP. It is a standard NLP tool developed for research and education.

NLTK provides users with some basic set of tools using which they can perform text-related operations. It is a good tool for beginners to start in Natural Language Processing.

Natural Language Toolkit features include:

  • Classification of Text
  • POS tagging
  • Entity extraction
  • Tokenization
  • Parsing
  • Stemming
  • Semantic reasoning

2. SpaCy – Data Extraction, Data Analysis, Sentiment Analysis, Text Summarization

SpaCy is a step up in the evolution from NLTK. When it comes to more complex business applications, NLTK can be clumsy and slow, whereas SpaCy provides a smoother, faster, and more efficient experience.

SpaCy is an open-source NLP library and is a good fit for comparing customer/product profiles or text documents, which is useful in deep text analytics and sentiment analysis.

It is good at syntactic analysis, which is useful for aspect-based sentiment analysis and conversational user interface optimization. It is also an excellent choice for named-entity recognition and hence used for business insights and market research.

Another advantage of SpaCy over OpenNLP and CoreNLP is its use of word vectors. SpaCy works with word2vec and doc2vec representations and exposes this functionality through a single API.

3. GenSim – Document Analysis, Semantic Search, Data Exploration

GenSim is the tool to use if you need to extract particular information to discover business insights. It is an open-source NLP library designed for document exploration and topic modeling.

The key feature of GenSim is word vectors. It treats the content of documents as sequences of vectors, and clusters and classifies them.

When it comes to dealing with large amounts of data, GenSim is also resource-efficient.

Mainly used for:

  • Data analysis
  • Semantic search applications
  • Text generation applications (chatbot, text summarization, service customization, etc.)

Implementing NLP using Spacy and NLTK


Importing the libraries

# Import spaCy and its visualizer module
import spacy
from spacy import displacy

# Load spaCy's small English model
# (install it first with: python -m spacy download en_core_web_sm)
nlp = spacy.load("en_core_web_sm")

# Process a sentence and inspect each token's part of speech
doc = nlp("Hello, Mr. Anant, how are you doing today?")
print([(token.text, token.pos_) for token in doc])

