NLP by Vinod - Foundations

Text Preprocessing in NLP - Cleaning Raw Text Before Feature Extraction.

After data acquisition, I learned how raw text is cleaned, normalized, tokenized, and prepared before feature extraction. This post connects basic and advanced preprocessing from my notebooks into one clear sequence.


Text preprocessing in NLP is the step where raw collected text becomes cleaner and more useful for a model. In the previous post on data acquisition for NLP, I collected data from sources like web scraping, APIs, JSON, SQL and CSV files. But collected data is rarely ready for feature extraction directly.

When I started this topic, I knew the individual techniques but not the order they belong in. I had learned lowercasing, whitespace cleanup, number handling, HTML removal, punctuation removal, chat words, spelling correction, emojis, stopwords, tokenization, stemming, lemmatization and annotators. Working through the notebooks showed me the real order: first reduce noise, then normalize the text, then split it into useful units, then apply linguistic processing depending on the task.

What clicked for me:
Text preprocessing is not just cleaning. It is the bridge between raw human language and numerical feature extraction.

Figure: Text preprocessing workflow in NLP (raw text cleaning, lowercasing, whitespace cleanup, tokenization, stopword removal and lemmatization). Text preprocessing converts noisy raw text into cleaner tokens that can later be used for feature extraction and NLP modeling.

01 Where Text Preprocessing Fits in the NLP Pipeline

The easiest way for me to understand preprocessing was to place it between data acquisition and feature extraction. Data acquisition gives raw text. Text preprocessing improves that text. Feature extraction converts it into numbers.

Acquire: CSV, API, scraping, JSON

Clean: HTML, URLs, spaces, noise

Normalize: lowercase, numbers, spelling

Tokenize: words, sentences, tokens

Extract: BOW, TF-IDF, embeddings

This order matters because every later step depends on the earlier step. If HTML tags, URLs, repeated spaces, punctuation noise or inconsistent casing remain in the dataset, feature extraction may treat meaningless variations as important patterns.

My practical takeaway: preprocessing decisions depend on the task. Sentiment analysis, search, NER, topic modeling and text classification may need different cleaning choices.
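To make the "meaningless variations" point concrete, here is a minimal sketch that counts the vocabulary before and after a very small cleaning step. The sample reviews and the `clean` helper are my own assumptions for illustration, not code from the notebooks.

```python
from collections import Counter
import string

# Hypothetical sample: the same two words written in three noisy ways.
raw_docs = ["Great Movie", "great movie!", "GREAT  movie"]

# Naive view: split on single spaces and count every surface form.
raw_vocab = Counter(token for doc in raw_docs for token in doc.split(" "))

def clean(doc):
    """Lowercase, drop punctuation, and split on any run of whitespace."""
    no_punct = doc.lower().translate(str.maketrans("", "", string.punctuation))
    return no_punct.split()

clean_vocab = Counter(token for doc in raw_docs for token in clean(doc))

print(len(raw_vocab), "raw forms:", sorted(raw_vocab))
print(len(clean_vocab), "clean forms:", sorted(clean_vocab))
```

The raw count even includes an empty string produced by the double space, which is exactly the kind of pattern a feature extractor should never see.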

02 The Dataset Context from Data Acquisition

I started this preprocessing work using the movies.csv file that came from the data acquisition step. That was useful because it connected the previous topic directly to this one. Instead of learning preprocessing on random sentences only, I could apply functions on a dataset column like movie_name.

This made the topic feel more real. In a notebook, one function may work on one sentence, but in a project, the same function needs to work on a full column safely.
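One way to read "work on a full column safely" is: the function must not crash on missing or non-text values. The sketch below shows that idea with plain Python lists. The helper name `apply_to_text` and the sample values are assumptions for illustration.

```python
def apply_to_text(values, func):
    """Apply func to every string in values and leave other values untouched."""
    return [func(value) if isinstance(value, str) else value for value in values]

# Hypothetical column values, including a missing entry.
movie_names = ["  The Matrix ", None, "SE7EN"]

cleaned = apply_to_text(movie_names, lambda s: s.strip().lower())
print(cleaned)
```

With pandas, the same guard could probably be passed to `Series.map`, so that missing values in a column like movie_name pass through unchanged instead of raising an error.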

From Data Acquisition

movie names
movie descriptions
genre information
CSV file from API data

For Preprocessing

clean text columns
normalize inconsistent values
remove unwanted noise
prepare for feature extraction

          
        
Python

import pandas as pd

df = pd.read_csv("movies.csv")

df.head()

03 Lowercasing Text

Lowercasing was the first basic preprocessing step. The idea is simple: convert text into a consistent case so that words like Movie, movie and MOVIE are not treated as different tokens unnecessarily.

In my notebook, I tested it first on one movie name, then applied it to the whole column. This pattern is important: first test on one sample, then apply it to the whole dataset.

          
        
Python

df.movie_name[0].lower()

df["movie_name"] = df["movie_name"].str.lower()

Where lowercasing helps: search, classification, keyword matching, bag of words and TF-IDF usually benefit from consistent casing.

But lowercasing is not always harmless: for Named Entity Recognition, casing can be meaningful. For example, Apple as a company and apple as a fruit may need different treatment.

04 Removing Extra White Spaces

Extra spaces look small, but they can create messy output. A string like " movie name " may look almost normal to us, but for processing it has unnecessary leading and trailing spaces.

I tried two ways: using split() and join(), and using regex. The split() approach is very readable. Regex is powerful when whitespace patterns become more complex.

          
        
Python

text = " movie name "

def remove_extra_spaces(text):
    return " ".join(text.split())

          
        
Python

import re

pattern = r'\s+'

def remove_extra_spaces_regex(text):
    return re.sub(pattern, " ", text).strip()

This is one of those small preprocessing steps that can save many problems later, especially when text comes from web scraping, PDFs, copied text, comments or forms.
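As a quick check, the sketch below runs both approaches on a string with tabs, newlines and a non-breaking space, which are common in scraped or copied text. The sample string is an assumption, and the two functions are repeated here only so the block runs on its own.

```python
import re

def remove_extra_spaces(text):
    # split() with no arguments splits on any run of whitespace.
    return " ".join(text.split())

def remove_extra_spaces_regex(text):
    # \s+ matches one or more whitespace characters, including tabs and newlines.
    return re.sub(r"\s+", " ", text).strip()

# Hypothetical messy input: leading spaces, a tab, blank lines, a non-breaking space.
messy = "  movie\t name \n\n from\xa0scraping "

print(remove_extra_spaces(messy))
print(remove_extra_spaces_regex(messy))
```

Both calls give the same single-spaced result here, so the choice between them is mostly about readability versus how much custom pattern control is needed.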

05 Handling Numbers

Numbers are tricky in NLP. Sometimes numbers should be removed. Sometimes they should be preserved. Sometimes they should be converted into words. The correct choice depends on the task.

For example, in a movie title like Se7en, the number is part of the word. If I blindly process only space-separated words, this case may not be handled properly. This was one of the useful mistakes from the notebook.

          
        
Python

from num2words import num2words

def numbers_to_word(text):
    result = []

    for word in text.split():
        if word.isdigit():
            # Convert the digit string to int so num2words receives a number.
            result.append(num2words(int(word)))
        else:
            result.append(word)

    return " ".join(result)

Notebook observation: this word-by-word method does not properly handle mixed strings like Se7en. For that, character-level or pattern-based handling may be needed.
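A minimal sketch of the pattern-based idea: instead of converting mixed tokens automatically, first find them and look at their digit and letter runs, then decide what to do per dataset. The function names and the sample sentence are assumptions for illustration.

```python
import re

def split_digits(token):
    """Split a token into alternating digit and non-digit runs."""
    return re.findall(r"\d+|\D+", token)

def find_mixed_tokens(text):
    """Return tokens that contain digits but are not purely numeric."""
    return [
        token for token in text.split()
        if re.search(r"\d", token) and not token.isdigit()
    ]

print(split_digits("Se7en"))
print(find_mixed_tokens("Se7en released in 1995 before 2Fast"))
```

Blindly replacing the 7 in Se7en with "seven" would produce a nonsense word, so for titles it may be safer to flag these tokens and keep them as they are.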

06 Removing HTML Tags

HTML tags are common when text comes from websites. In my movie API dataset, HTML tags were less likely. But in a scraped dataset like company data from web pages, HTML noise can appear.

I tested this on a sample HTML document and removed tags using regex. The important idea is not just removing tags, but also cleaning the extra spaces that remain after tags are removed.

          
        
Python

import re

pattern_html = re.compile(r'<.*?>')
pattern_space = re.compile(r'\s+')

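def remove_html_tag(text):
    # Replace each tag with a space so words in adjacent tags do not merge.
    clean_text = pattern_html.sub(" ", text)
    # Collapse the extra spaces left behind by the removed tags.
    clean_text = pattern_space.sub(" ", clean_text)
    return clean_text.strip()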

Practical note: regex is fine for simple HTML cleanup, but for complex HTML pages, BeautifulSoup is usually safer because real web HTML can be messy.
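To illustrate the parser-based alternative without installing anything, the sketch below uses Python's built-in `html.parser` module instead of BeautifulSoup. This is a swapped-in technique, not what my notebook used, and the class and function names are assumptions.

```python
from html.parser import HTMLParser

class _TextExtractor(HTMLParser):
    """Collect visible text and ignore the contents of script and style tags."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.parts.append(data)

def html_to_text(html):
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()  # flush any text still buffered at the end of the input
    # Join pieces with spaces, then collapse repeated whitespace.
    return " ".join(" ".join(parser.parts).split())

print(html_to_text("<p>Hello <b>world</b> &amp; all</p>"))
```

Unlike the regex version, this also decodes entities such as `&amp;` and skips JavaScript or CSS that would otherwise leak into the text.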

Figure: Text cleaning and tokenization workspace (noisy raw text with HTML tags, URLs and emojis next to the cleaned NLP tokens). Cleaning raw text means removing noise like HTML tags, URLs, extra spaces and unwanted symbols before tokenization.

07 Removing and Extracting URLs, Emails and Hashtags

URLs, emails and hashtags are common in real text data. They appear in tweets, comments, forums, scraped pages, support tickets and social media datasets.

Sometimes I may want to remove them. Sometimes I may want to extract them as useful metadata. For example, an email address may not be useful for sentiment analysis, but hashtags can be useful for topic detection.

Pattern	Possible Action	Why It Matters
URL	remove or replace	URLs often add noise unless source tracking is important.
Email	remove, mask or extract	Email addresses may contain private information and should be handled carefully.
Hashtag	extract or normalize	Hashtags can carry topic or sentiment information.

          
        
Python

import re

url_pattern = r'https?://\S+|www\.\S+'
email_pattern = r'\b[\w\.-]+@[\w\.-]+\.\w+\b'
hashtag_pattern = r'#\w+'

def remove_urls(text):
    return re.sub(url_pattern, "", text)
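
The block above only removes URLs. Below is a minimal sketch of the other actions from the table (extracting URLs and hashtags, masking emails), reusing the same three patterns. The function names, the `<EMAIL>` placeholder and the sample text are assumptions for illustration.

```python
import re

URL_PATTERN = re.compile(r"https?://\S+|www\.\S+")
EMAIL_PATTERN = re.compile(r"\b[\w.-]+@[\w.-]+\.\w+\b")
HASHTAG_PATTERN = re.compile(r"#\w+")

def extract_urls(text):
    return URL_PATTERN.findall(text)

def extract_hashtags(text):
    return HASHTAG_PATTERN.findall(text)

def mask_emails(text, placeholder="<EMAIL>"):
    # Masking keeps the sentence shape but hides the private address.
    return EMAIL_PATTERN.sub(placeholder, text)

# Hypothetical social media style sample.
sample = "Loved it! Details at https://example.com/review or mail me@example.com #nlp #movies"

print(extract_urls(sample))
print(extract_hashtags(sample))
print(mask_emails(sample))
```

These patterns are deliberately simple. For example, `\S+` will also swallow punctuation that directly follows a URL, so real datasets may need stricter rules.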

08 Removing Punctuation

Punctuation removal depends on the task. If I am doing simple topic classification, removing punctuation may help. But if I am doing sentiment analysis, punctuation like exclamation marks can carry emotion.

In my notebook, I first used a loop and replace(). Then I learned a faster approach using str.translate() with str.maketrans(). This was useful because the same preprocessing function may run on thousands or millions of rows.

          
        
Python

import string

exclude = string.punctuation

def remove_punctuation(text):
    for char in exclude:
        text = text.replace(char, "")
    return text

          
        
Python

def remove_punctuation_fast(text):
    return text.translate(str.maketrans("", "", exclude))

What I learned: preprocessing is not only about correctness. Speed also matters when the dataset becomes large.
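To check the speed claim on your own machine, a rough sketch with the standard `timeit` module could look like this. The sample text and repetition count are assumptions, and the actual timings will vary by machine, so the block only prints them.

```python
import string
import timeit

# Build the translation table once instead of on every call.
PUNCT_TABLE = str.maketrans("", "", string.punctuation)

def remove_punctuation_loop(text):
    for char in string.punctuation:
        text = text.replace(char, "")
    return text

def remove_punctuation_translate(text):
    return text.translate(PUNCT_TABLE)

# Hypothetical review text, repeated to make the timing difference visible.
sample = "Wow!!! This movie, honestly, was great... (10/10) " * 50

# Both versions must agree before their speed is compared.
assert remove_punctuation_loop(sample) == remove_punctuation_translate(sample)

loop_time = timeit.timeit(lambda: remove_punctuation_loop(sample), number=200)
translate_time = timeit.timeit(lambda: remove_punctuation_translate(sample), number=200)
print(f"loop: {loop_time:.4f}s  translate: {translate_time:.4f}s")
```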

09 Chat Word Treatment and Spelling Correction

Chat words are common in social media datasets. People write short forms like U, FYI, GM and ASAP. A model may not understand these properly unless they are normalized.

          
        
Python

chat_words = {
    "U": "You",
    "FYI": "For Your Information",
    "GM": "Good Morning",
    "ASAP": "As Soon As Possible"
}

def expand_chat_words(text):
    words = text.split()
    # Look up the uppercase form so "fyi" and "FYI" are both expanded.
    expanded = [chat_words.get(word.upper(), word) for word in words]
    return " ".join(expanded)

I also explored spelling correction using libraries like TextBlob or pyspellchecker. This can be useful when user-generated text contains mistakes like recieved or messsage.

Important caution: spelling correction should not be applied blindly. Names, slang, domain words and abbreviations can be changed incorrectly.
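The caution can be shown with a toy corrector. This sketch uses the standard library's `difflib` as a stand-in for TextBlob or pyspellchecker, with a tiny made-up vocabulary and a protected-word set. All names, word lists and the 0.75 cutoff are assumptions for illustration, not how those libraries work internally.

```python
import difflib

# Hypothetical tiny vocabulary and protected terms.
VOCAB = ["received", "message", "movie", "great", "space"]
PROTECTED = frozenset({"spacy", "vinod", "asap"})

def correct_word(word, protected=PROTECTED):
    lower = word.lower()
    # Leave protected words and already-correct words alone.
    if lower in protected or lower in VOCAB:
        return word
    matches = difflib.get_close_matches(lower, VOCAB, n=1, cutoff=0.75)
    return matches[0] if matches else word

def correct_text(text, protected=PROTECTED):
    return " ".join(correct_word(word, protected) for word in text.split())

print(correct_text("I recieved your messsage about spaCy"))
# Without the protected set, the library name is "corrected" to a normal word.
print(correct_word("spaCy", protected=frozenset()))
```

The second call shows the failure mode: a valid domain word gets replaced by the closest dictionary word when nothing protects it.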

10 Handling Emojis

Emojis appear a lot in social media text. My notebook showed two approaches: remove emojis, or convert them into meaning using the emoji library.

Removing emojis may simplify the text. But converting emojis into words may preserve sentiment. For example, a smiling emoji can carry positive emotion, and a fire emoji can indicate excitement or praise.

          
        
Python

import re

def remove_emoji(text):
    emoji_pattern = re.compile(
        "["
        "\U0001F600-\U0001F64F"  # emoticons
        "\U0001F300-\U0001F5FF"  # symbols and pictographs
        "\U0001F680-\U0001F6FF"  # transport and map symbols
        "\U0001F1E0-\U0001F1FF"  # regional indicator flags
        "]+"
    )
    return emoji_pattern.sub("", text)

          
        
Python

import emoji

emoji.demojize("Python is 🔥")  # returns "Python is :fire:"

My understanding: in sentiment analysis, emojis should usually not be removed immediately. They may contain useful emotional signal.
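If the emoji library is not available, a rough version of the same "convert to words" idea can be sketched with the standard `unicodedata` module, which knows the official Unicode name of each character. This is a stand-in technique, and the labels it produces may not match the aliases that the emoji library uses. The function name and the covered ranges are assumptions.

```python
import re
import unicodedata

# Matches one character at a time from three common emoji blocks.
EMOJI_CHAR = re.compile(
    "[\U0001F300-\U0001F5FF\U0001F600-\U0001F64F\U0001F680-\U0001F6FF]"
)

def emoji_to_words(text):
    def describe(match):
        # Fall back to the original character if Unicode has no name for it.
        name = unicodedata.name(match.group(), "")
        if not name:
            return match.group()
        return ":" + name.lower().replace(" ", "_") + ":"
    return EMOJI_CHAR.sub(describe, text)

print(emoji_to_words("Python is \U0001F525"))
```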

11 Removing Stopwords

Stopwords are common words like the, is, am, are, of and to. They often appear frequently but may not always carry useful meaning for basic classification or keyword extraction.

In the notebook, I used NLTK stopwords and noticed that English and French are available, but Hindi stopwords were not directly available in that same way. This was useful because it showed that preprocessing tools are language-dependent.

          
        
Python

# One-time setup: nltk.download("stopwords")
from nltk.corpus import stopwords

stop_words = set(stopwords.words("english"))

def remove_stopwords_fast(text):
    filtered_words = [
        word for word in text.split()
        if word.lower() not in stop_words
    ]
    return " ".join(filtered_words)

Speed improvement: converting the stopword list into a set helps with repeated lookups, because set membership is a hash lookup (constant time on average) while a list has to be scanned item by item.

Do not always remove stopwords: in some tasks, words like not are very important. Removing them can change the meaning of a sentence completely.
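A minimal sketch of keeping negation words while still removing other stopwords. The tiny stopword set below is a hypothetical stand-in so the block runs without NLTK data. With NLTK, the same set subtraction could be applied to the English stopword set.

```python
# Hypothetical, tiny stopword list for illustration only.
BASIC_STOP_WORDS = {"the", "is", "am", "are", "of", "to", "a", "was", "not", "no"}

# Words that flip meaning and should survive for sentiment tasks.
NEGATIONS = {"not", "no", "never"}

SENTIMENT_STOP_WORDS = BASIC_STOP_WORDS - NEGATIONS

def remove_stopwords(text, stop_words):
    return " ".join(
        word for word in text.split() if word.lower() not in stop_words
    )

review = "the movie was not good"
print(remove_stopwords(review, BASIC_STOP_WORDS))
print(remove_stopwords(review, SENTIMENT_STOP_WORDS))
```

The first output reads like a positive review, while the second keeps the negation that a sentiment model needs.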

12 Tokenization

Tokenization is the step where text is split into smaller units. These units can be words, sentences or subwords. In my notebook, I tested tokenization using simple split, regex, NLTK and spaCy.

This is where I understood why tokenization is more than just splitting on spaces. For a sentence like "I am going to Mumbai!", a tokenizer should ideally separate Mumbai from the punctuation mark. A basic split does not do that.

Using split

          
        
Python

sent = "I am going to Mumbai!"
sent.split()

The problem is that Mumbai! remains one token. For simple data this may be okay, but for robust NLP it is weak.

Using regex

          
        
Python

import re

tokens = re.findall(r'\w+', sent)  # keeps word characters only, so "!" is dropped
print(tokens)
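
If the punctuation should be kept as its own token instead of being dropped, the pattern can be extended with a second alternative. This is a small sketch, and the function name `regex_tokenize` is an assumption.

```python
import re

def regex_tokenize(text):
    # \w+ matches a run of word characters,
    # [^\w\s] matches any single character that is neither word nor whitespace.
    return re.findall(r"\w+|[^\w\s]", text)

print(regex_tokenize("I am going to Mumbai!"))
print(regex_tokenize("We're here"))
```

The second call shows the limit of this approach: the contraction is split into three pieces, which is one reason library tokenizers exist.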

Using NLTK

          
        
Python

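# One-time setup: download the Punkt tokenizer data with nltk.download()
# (the resource is named "punkt" or "punkt_tab", depending on the NLTK version).
from nltk.tokenize import word_tokenize, sent_tokenize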

word_tokenize("I am going to Mumbai!")

sent_tokenize("I am going to Mumbai. I will stay there.")

Using spaCy

          
        
Python

import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("We're here to help! mail us at xyz@gmail.com")

for token in doc:
    print(token.text)

Notebook conclusion: spaCy performed well on tricky examples like emails and punctuation, but for simple datasets NLTK can still be enough.

13 Stemming

Stemming converts words into a root-like form, but it does not always care whether the final word is a valid dictionary word. This was the key difference I understood.

For example, studies may become studi. That is not a proper English word, but for some search and classification tasks, it may still work because related forms become similar.

          
        
Python

from nltk.stem.porter import PorterStemmer

ps = PorterStemmer()

def stem_words(text):
    return " ".join([ps.stem(word) for word in text.split()])

stem_words("walk walks walking walked")

I also compared PorterStemmer and SnowballStemmer on different domains like news, Wikipedia-style text and science paper text. The outputs were not always perfect, but the comparison helped me see that stemming is aggressive and fast.

14 Lemmatization

Lemmatization also converts a word to its base form, but unlike stemming it tries to return a valid word. It usually needs more linguistic information like POS tags and a lexicon such as WordNet.

This is why lemmatization is more meaningful but usually more expensive than stemming.

          
        
Python

# One-time setup: nltk.download("wordnet")
from nltk.stem import WordNetLemmatizer
from nltk.corpus import wordnet

lemmatizer = WordNetLemmatizer()

def apply_lemmatization(tokens, pos=wordnet.VERB):
    return [lemmatizer.lemmatize(word, pos) for word in tokens]

          
        
Python

import spacy

nlp = spacy.load("en_core_web_sm")

def spacy_lemmatization(text):
    doc = nlp(text)
    return [token.lemma_ for token in doc]

Method	Behavior	Best Use
Stemming	cuts words to a root-like form	fast search, simple classification, information retrieval
Lemmatization	returns a valid base word	clean NLP pipelines, semantic tasks, readable outputs

Figure: Advanced text preprocessing in NLP (POS tagging, named entity recognition, stemming, lemmatization and cleaned output). Advanced preprocessing adds linguistic structure through POS tagging, entity recognition, stemming, lemmatization and annotation.

15 Annotators, NER and POS Tagging

After basic preprocessing, I moved into annotators. Annotators add extra information to the text. Instead of only cleaning words, we start identifying what each word represents.

The main annotators I explored were Named Entity Recognition, POS tagging and custom annotation ideas like sentiment rules based on positive or negative words.

Named Entity Recognition

detects people
detects organizations
detects locations
detects dates and entities

POS Tagging

marks nouns
marks verbs
marks adjectives
adds grammar context

With spaCy, I tested NER on a sentence about Google and its founders. It identified Google as an organization and names like Larry Page and Sergey Brin as persons.

          
        
Python

import spacy

nlp = spacy.load("en_core_web_sm")

text = "Google was founded by Larry Page and Sergey Brin."
doc = nlp(text)

for ent in doc.ents:
    print(ent.text, ent.label_)

For POS tagging, I tried NLTK and spaCy. A sentence like "Will Will search my blogs" showed why context matters: ideally the first Will is tagged as a modal verb and the second as a proper noun, so the same word gets a different tag depending on where it appears.

          
        
Python

from nltk import pos_tag, word_tokenize

text = "Will Will search my blogs"
tokens = word_tokenize(text)

tagged = pos_tag(tokens)
print(tagged)

What I learned: annotation is not just preprocessing. It adds linguistic meaning, and that meaning becomes useful in tasks like NER systems, information extraction and search.

16 Hidden Markov Model Intuition for NER

In the annotator notebook, I also wrote notes about how NER can be understood using Hidden Markov Models and the Viterbi algorithm. I am not treating this as a full mathematical implementation yet, but the intuition became clearer.

The idea is that the actual entity tags are hidden states. The words are observations. The model tries to find the most likely sequence of entity tags for the given words.

Emission probability

This asks how likely a word is given an entity type. For example, how often a word appears as an organization, person or location.

Transition probability

This asks how likely one entity tag is after another entity tag. For example, a person tag may continue after another person tag in a full name.

Viterbi decoding

This finds the best sequence of hidden tags for the full sentence instead of deciding each word independently.

My current understanding: NER is not only about identifying words. It is about assigning the best sequence of labels using context.
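To make the three pieces concrete, here is a small Viterbi sketch in plain Python. Every probability below is a made-up toy value, the tag set (O, PER, ORG) is simplified, and unseen words get a tiny fallback probability. It is only meant to show how start, transition and emission scores combine, not how a real NER system is trained.

```python
import math

def viterbi(words, states, start_p, trans_p, emit_p, unseen=1e-6):
    """Return the most likely tag sequence for words (log-space Viterbi)."""
    def emit(state, word):
        return math.log(emit_p[state].get(word, unseen))

    # best[t][s]: best log score of any path that ends in state s at position t.
    best = [{s: math.log(start_p[s]) + emit(s, words[0]) for s in states}]
    back = [{}]
    for t in range(1, len(words)):
        best.append({})
        back.append({})
        for s in states:
            prev, score = max(
                ((p, best[t - 1][p] + math.log(trans_p[p][s])) for p in states),
                key=lambda pair: pair[1],
            )
            best[t][s] = score + emit(s, words[t])
            back[t][s] = prev

    # Follow the back-pointers from the best final state.
    path = [max(states, key=lambda s: best[-1][s])]
    for t in range(len(words) - 1, 0, -1):
        path.append(back[t][path[-1]])
    return path[::-1]

# Hypothetical toy model.
states = ["O", "PER", "ORG"]
start_p = {"O": 0.6, "PER": 0.2, "ORG": 0.2}
trans_p = {
    "O": {"O": 0.7, "PER": 0.2, "ORG": 0.1},
    "PER": {"O": 0.4, "PER": 0.5, "ORG": 0.1},
    "ORG": {"O": 0.7, "PER": 0.1, "ORG": 0.2},
}
emit_p = {
    "O": {"hired": 0.5, "Page": 0.15},
    "PER": {"Larry": 0.5, "Page": 0.2},
    "ORG": {"Google": 0.8},
}

tags = viterbi(["Google", "hired", "Larry", "Page"], states, start_p, trans_p, emit_p)
print(tags)
```

In this toy setup, the high PER to PER transition is what helps "Page" get tagged as part of a name rather than as an ordinary word.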

17 Text Preprocessing Is Not One Fixed Recipe

The biggest mistake in text preprocessing is thinking that every dataset needs the same cleaning pipeline. That is not true.

For example, punctuation may be noise in topic modeling but useful in sentiment analysis. Emojis may be useless for news classification but useful in social media sentiment. Stopwords may be removed in keyword extraction but preserved in language modeling or question answering.

Task	Preprocessing Choice	Reason
Sentiment analysis	keep negation and maybe emojis	words like not and emoji signals can change sentiment.
Topic modeling	remove stopwords and punctuation	common words may hide topic-specific words.
NER	avoid aggressive lowercasing	capitalization can help identify names and organizations.
Search	stemming or lemmatization can help	different word forms can match the same idea.

Important lesson: preprocessing should be task-aware. Cleaning too much can remove useful information.
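One way to keep preprocessing task-aware is to write a single function with switches and store one configuration per task. The sketch below follows the table above. The function name, the tiny stopword set and the three configurations are assumptions for illustration.

```python
import string

PUNCT_TABLE = str.maketrans("", "", string.punctuation)

# Hypothetical, tiny stopword list for illustration only.
STOP_WORDS = frozenset({"the", "is", "was", "a", "an", "of", "to", "not"})

def preprocess(text, *, lowercase=True, remove_punct=True, stop_words=frozenset()):
    """Return a list of tokens after the selected cleaning steps."""
    text = " ".join(text.split())  # normalize whitespace first
    if lowercase:
        text = text.lower()
    if remove_punct:
        text = text.translate(PUNCT_TABLE)
    return [token for token in text.split() if token.lower() not in stop_words]

TASK_CONFIGS = {
    # Topic modeling: strip common words and punctuation.
    "topic_modeling": dict(lowercase=True, remove_punct=True, stop_words=STOP_WORDS),
    # Sentiment: keep negation and punctuation such as "!".
    "sentiment": dict(lowercase=True, remove_punct=False, stop_words=STOP_WORDS - {"not"}),
    # NER: keep casing and every token.
    "ner": dict(lowercase=False, remove_punct=False, stop_words=frozenset()),
}

sample = "  The Movie was NOT good!  "
for task, config in TASK_CONFIGS.items():
    print(task, preprocess(sample, **config))
```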

18 My Final Preprocessing Checklist

After arranging the notebooks in sequence, this is the preprocessing checklist that now makes sense to me.

Inspect the dataset first

Check whether the text contains HTML, URLs, emojis, repeated spaces, misspellings, chat words or mixed casing.

Clean only what is actually noise

Do not remove punctuation, emojis, casing or stopwords blindly. Decide based on the NLP task.

Normalize the text

Lowercase, expand chat words, handle numbers, normalize spaces and correct spelling when it is safe.

Tokenize properly

Use split only for simple cases. Use NLTK or spaCy when punctuation, emails, abbreviations and sentence boundaries matter.

Apply linguistic processing if needed

Use stemming, lemmatization, POS tagging or NER when the task needs grammatical or semantic structure.

19 GitHub Notebook Connection

This blog explains what I understood from the notebooks, while the GitHub repository keeps the actual implementation work visible. The notebooks include basic preprocessing, tokenization, annotator creation, and stemming plus lemmatization practice.

NLP by Vinod GitHub Repository

Notebook references: 1_00_text_preprocessing.ipynb, 1_04_string_tokenization.ipynb, 1_05_Annotator_creation.ipynb, and 1_07_Stemming_Lemmatization.ipynb.


20 Related Reading

NLP learning roadmap

The main roadmap that connects this topic to the full NLP by Vinod journey.

Data acquisition for NLP

The previous topic where raw data was collected from APIs, files, SQL, JSON and web scraping.

Python strings and regex for NLP

The earlier foundation topic that supports cleaning text, extracting patterns and handling raw strings.

21 What Comes Next in the NLP Journey

The next topic is Feature Extraction in NLP. This is where cleaned text will become numbers using methods like Bag of Words, TF-IDF and later embeddings.

Bag of Words

Represent text using word counts and vocabulary-based features.

TF-IDF

Give more weight to words that are important in a document but not common everywhere.

Embeddings

Move from sparse word counts to dense vector representations that capture meaning better.

