Text Preprocessing in NLP - Cleaning Raw Text Before Feature Extraction
After data acquisition, I learned how raw text is cleaned, normalized, tokenized, and prepared before feature extraction. This post connects basic and advanced preprocessing from my notebooks into one clear sequence.
Text preprocessing in NLP is the step where raw collected text becomes cleaner and more useful for a model. In the previous post on data acquisition for NLP, I collected data from sources like web scraping, APIs, JSON, SQL and CSV files. But collected data is rarely ready for feature extraction directly.
When I started this topic, I knew the individual steps but not the order they belong in. I had learned lowercasing, whitespace cleanup, number handling, HTML removal, punctuation removal, chat words, spelling correction, emojis, stopwords, tokenization, stemming, lemmatization and annotators. The notebooks helped me see the real order: first reduce noise, then normalize the text, then split it into useful units, then apply linguistic processing depending on the task.
Text preprocessing is not just cleaning. It is the bridge between raw human language and numerical feature extraction.
01 Where Text Preprocessing Fits in the NLP Pipeline
The easiest way for me to understand preprocessing was to place it between data acquisition and feature extraction. Data acquisition gives raw text. Text preprocessing improves that text. Feature extraction converts it into numbers.
This order matters because every later step depends on the earlier step. If HTML tags, URLs, repeated spaces, punctuation noise or inconsistent casing remain in the dataset, feature extraction may treat meaningless variations as important patterns.
02 The Dataset Context from Data Acquisition
I started this preprocessing work using the movies.csv file
that came from the data acquisition step. That was useful because it
connected the previous topic directly to this one. Instead of learning
preprocessing on random sentences only, I could apply functions on a
dataset column like movie_name.
This made the topic feel more real. In a notebook, one function may work on one sentence, but in a project, the same function needs to work on a full column safely.
From Data Acquisition
- movie names
- movie descriptions
- genre information
- CSV file from API data
For Preprocessing
- clean text columns
- normalize inconsistent values
- remove unwanted noise
- prepare for feature extraction
import pandas as pd
df = pd.read_csv("movies.csv")
df.head()
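To make the point about running a function safely over a whole column concrete, here is a minimal sketch. It uses a plain Python list as a stand-in for a column so it runs without any dataset; the helper name apply_safely and the sample values are my own assumptions, not taken from the notebook. The idea is that missing values (None or NaN) should pass through untouched instead of raising an error.

```python
def apply_safely(values, func):
    """Apply func to string values only; leave None, NaN and numbers untouched."""
    return [func(v) if isinstance(v, str) else v for v in values]


# Hypothetical column with the kinds of gaps a real CSV often contains
column = ["Inception", None, float("nan"), "  The Matrix "]

cleaned = apply_safely(column, lambda s: s.strip().lower())
print(cleaned)
```

With pandas, the same isinstance guard can live inside the function passed to Series.map or Series.apply, so one bad row does not stop the whole column.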
03 Lowercasing Text
Lowercasing was the first basic preprocessing step. The idea is simple:
convert text into a consistent case so that words like Movie,
movie and MOVIE are not treated as different
tokens unnecessarily.
In my notebook, I tested it first on one movie name, then applied it to the whole column. This pattern is important: first test on one sample, then apply it to the whole dataset.
df.movie_name[0].lower()
df["movie_name"] = df["movie_name"].str.lower()
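One caution: lowercasing is not always safe. Apple as a company and apple as a fruit may need different treatment, so tasks that rely on capitalization may need to skip or delay this step.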
04 Removing Extra White Spaces
Extra spaces look small, but they can create messy output. A string like
" movie name " may look almost normal to us, but for
processing it has unnecessary leading and trailing spaces.
I tried two ways: using split() and join(), and
using regex. The split() approach is very readable. Regex is
powerful when whitespace patterns become more complex.
text = " movie name "
def remove_extra_spaces(text):
    return " ".join(text.split())
import re
pattern = r'\s+'
def remove_extra_spaces_regex(text):
    return re.sub(pattern, " ", text).strip()
This is one of those small preprocessing steps that can save many problems later, especially when text comes from web scraping, PDFs, copied text, comments or forms.
05 Handling Numbers
Numbers are tricky in NLP. Sometimes numbers should be removed. Sometimes they should be preserved. Sometimes they should be converted into words. The correct choice depends on the task.
For example, in a movie title like Se7en, the number is part
of the word. If I blindly process only space-separated words, this case
may not be handled properly. This was one of the useful mistakes from the
notebook.
from num2words import num2words
def numbers_to_word(text):
    result = []
    for word in text.split():
        if word.isdigit():
            # convert the digit string to an int before spelling it out
            result.append(num2words(int(word)))
        else:
            result.append(word)
    return " ".join(result)
This function only converts standalone numeric tokens, so it does not handle a title like Se7en. For that, character-level or pattern-based handling may be needed.
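As one possible pattern-based approach, the sketch below uses word boundaries so that only standalone digit runs are converted, while tokens that mix letters and digits are detected separately and left alone. The function names and the "<NUM>" placeholder are assumptions for illustration; in practice the convert argument could wrap num2words.

```python
import re

# Digit runs that form a whole token on their own (the 7 in "Se7en" does not match)
STANDALONE_NUMBER = re.compile(r"\b\d+\b")
# Tokens that contain at least one digit AND at least one letter
MIXED_TOKEN = re.compile(r"\b(?=\w*\d)(?=\w*[^\W\d_])\w+\b")


def convert_standalone_numbers(text, convert):
    """Replace each standalone number using the convert callable."""
    return STANDALONE_NUMBER.sub(lambda m: convert(m.group()), text)


def find_mixed_tokens(text):
    """Return tokens such as 'Se7en' that need a task-specific decision."""
    return MIXED_TOKEN.findall(text)


title = "Se7en was released in 1995"
print(convert_standalone_numbers(title, lambda digits: "<NUM>"))  # Se7en was released in <NUM>
print(find_mixed_tokens(title))  # ['Se7en']
```

Decimals such as 3.5 would be treated as two separate numbers by this pattern, so a real dataset may need a broader expression.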
06 Removing HTML Tags
HTML tags are common when text comes from websites. In my movie API dataset, HTML tags were less likely. But in a scraped dataset like company data from web pages, HTML noise can appear.
I tested this on a sample HTML document and removed tags using regex. The important idea is not just removing tags, but also cleaning the extra spaces that remain after tags are removed.
import re
pattern_html = re.compile(r'<.*?>')
pattern_space = re.compile(r'\s+')
def remove_html_tag(text):
    clean_text = re.sub(pattern_html, "", text)
    clean_text = re.sub(pattern_space, " ", clean_text)
    return clean_text.strip()
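Regex tag stripping leaves HTML entities such as &amp;amp; untouched and can struggle with unusual markup. As an alternative sketch, the standard library's html.parser can collect only the text nodes and decode entities along the way. The class and function names here are my own illustration, not from the notebook.

```python
import re
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects text nodes and ignores tags; entities are decoded by the parser."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def html_to_text(html):
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    # Join the pieces with spaces, then collapse the whitespace left behind by tags
    return re.sub(r"\s+", " ", " ".join(parser.parts)).strip()


print(html_to_text("<p>Tom &amp; Jerry</p><br><b>1940</b>"))  # Tom & Jerry 1940
```

Joining the pieces with a space can split a word that is only partly wrapped in a tag, and script or style contents are kept as text, so this is a starting point rather than a complete cleaner.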
07 Removing and Extracting URLs, Emails and Hashtags
URLs, emails and hashtags are common in real text data. They appear in tweets, comments, forums, scraped pages, support tickets and social media datasets.
Sometimes I may want to remove them. Sometimes I may want to extract them as useful metadata. For example, an email address may not be useful for sentiment analysis, but hashtags can be useful for topic detection.
| Pattern | Possible Action | Why It Matters |
|---|---|---|
| URL | remove or replace | URLs often add noise unless source tracking is important. |
| Email | remove, mask or extract | Email addresses may contain private information and should be handled carefully. |
| Hashtag | extract or normalize | Hashtags can carry topic or sentiment information. |
import re
url_pattern = r'https?://\S+|www\.\S+'
email_pattern = r'\b[\w\.-]+@[\w\.-]+\.\w+\b'
hashtag_pattern = r'#\w+'
def remove_urls(text):
    return re.sub(url_pattern, "", text)
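The patterns above can also be used to extract metadata before cleaning. The following is a minimal sketch that reuses the same three expressions; the function name, the "<EMAIL>" mask and the sample text are assumptions for illustration.

```python
import re

URL_RE = re.compile(r"https?://\S+|www\.\S+")
EMAIL_RE = re.compile(r"\b[\w.-]+@[\w.-]+\.\w+\b")
HASHTAG_RE = re.compile(r"#\w+")


def extract_and_clean(text):
    """Return the matches found in text plus a cleaned copy of the text."""
    found = {
        "urls": URL_RE.findall(text),
        "emails": EMAIL_RE.findall(text),
        "hashtags": HASHTAG_RE.findall(text),
    }
    cleaned = URL_RE.sub(" ", text)             # drop URLs
    cleaned = EMAIL_RE.sub("<EMAIL>", cleaned)  # mask private addresses
    cleaned = " ".join(cleaned.split())         # tidy leftover spaces
    return found, cleaned


sample = "Great film #MustWatch! Details at https://example.com/review or mail fan@example.com"
found, cleaned = extract_and_clean(sample)
print(found)
print(cleaned)  # Great film #MustWatch! Details at or mail <EMAIL>
```

A URL that contains a fragment like #section would also be picked up by the hashtag pattern, so extracting hashtags after URL removal may be the safer order for some datasets.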
08 Removing Punctuation
Punctuation removal depends on the task. If I am doing simple topic classification, removing punctuation may help. But if I am doing sentiment analysis, punctuation like exclamation marks can carry emotion.
In my notebook, I first used a loop and replace(). Then I
learned a faster approach using str.translate() with
str.maketrans(). This was useful because the same
preprocessing function may run on thousands or millions of rows.
import string
exclude = string.punctuation
def remove_punctuation(text):
    for char in exclude:
        text = text.replace(char, "")
    return text
def remove_punctuation_fast(text):
    return text.translate(str.maketrans("", "", exclude))
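Since punctuation removal depends on the task, one option is to build the remover from a list of characters to keep. This is a small sketch under that assumption; make_punctuation_remover is a hypothetical helper name. It also builds the translation table once, which helps when the function runs on many rows.

```python
import string


def make_punctuation_remover(keep=""):
    """Build a function that strips ASCII punctuation except the characters in keep."""
    to_remove = "".join(ch for ch in string.punctuation if ch not in keep)
    table = str.maketrans("", "", to_remove)  # built once, reused for every call

    def remove(text):
        return text.translate(table)

    return remove


remove_all = make_punctuation_remover()
remove_for_sentiment = make_punctuation_remover(keep="!?")

sample = "Wow!!! Really, that good?"
print(remove_all(sample))            # Wow Really that good
print(remove_for_sentiment(sample))  # Wow!!! Really that good?
```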
09 Chat Word Treatment and Spelling Correction
Chat words are common in social media datasets. People write short forms
like U, FYI, GM and
ASAP. A model may not understand these properly unless they
are normalized.
chat_words = {
"U": "You",
"FYI": "For Your Information",
"GM": "Good Morning",
"ASAP": "As Soon As Possible"
}
def expand_chat_words(text):
    words = text.split()
    # look up the uppercase form so "u" and "U" both match the dictionary keys
    expanded = [chat_words.get(word.upper(), word) for word in words]
    return " ".join(expanded)
I also explored spelling correction using libraries like TextBlob or
pyspellchecker. This can be useful when user-generated text contains
mistakes like recieved or messsage.
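To show the idea behind spelling correction without depending on either library, here is a sketch that swaps in a different technique: fuzzy matching against a known vocabulary with the standard library's difflib. The vocabulary, the cutoff of 0.8 and the function names are assumptions for illustration; real spell checkers use much larger dictionaries and word frequencies.

```python
from difflib import get_close_matches

# Hypothetical tiny vocabulary; a real one would come from a dictionary or corpus
VOCABULARY = ["received", "message", "movie", "review"]


def correct_word(word, vocabulary=VOCABULARY, cutoff=0.8):
    """Return the closest vocabulary word, or the word itself if nothing is close."""
    if word in vocabulary:
        return word
    matches = get_close_matches(word, vocabulary, n=1, cutoff=cutoff)
    return matches[0] if matches else word


def correct_text(text):
    return " ".join(correct_word(w) for w in text.split())


print(correct_text("i recieved your messsage"))  # i received your message
```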
10 Handling Emojis
Emojis appear a lot in social media text. My notebook showed two
approaches: remove emojis, or convert them into meaning using the
emoji library.
Removing emojis may simplify the text. But converting emojis into words may preserve sentiment. For example, a smiling emoji can carry positive emotion, and a fire emoji can indicate excitement or praise.
import re
def remove_emoji(text):
    emoji_pattern = re.compile(
        "["
        "\U0001F600-\U0001F64F"  # emoticons
        "\U0001F300-\U0001F5FF"  # symbols and pictographs
        "\U0001F680-\U0001F6FF"  # transport and map symbols
        "\U0001F1E0-\U0001F1FF"  # regional indicator (flag) symbols
        "]+",
        flags=re.UNICODE
    )
    return emoji_pattern.sub("", text)
import emoji
emoji.demojize("Python is 🔥")
11 Removing Stopwords
Stopwords are common words like the, is,
am, are, of and to.
They often appear frequently but may not always carry useful meaning for
basic classification or keyword extraction.
In the notebook, I used NLTK stopwords and noticed that English and French are available, but Hindi stopwords were not directly available in that same way. This was useful because it showed that preprocessing tools are language-dependent.
from nltk.corpus import stopwords
stop_words = set(stopwords.words("english"))
def remove_stopwords_fast(text):
    filtered_words = [
        word for word in text.split()
        if word.lower() not in stop_words
    ]
    return " ".join(filtered_words)
For sentiment analysis, negation words like not are very important. Removing them can change the meaning of a sentence completely.
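One way to protect negation is to subtract those words from the stopword set before filtering. The sketch below uses a small hand-written set as a stand-in so it runs without downloading the NLTK corpus; the word lists are assumptions, although NLTK's English list does include words such as not.

```python
# Hypothetical stand-in for stopwords.words("english")
BASIC_STOPWORDS = {"the", "is", "am", "are", "of", "to", "a", "an", "not", "no"}
NEGATIONS = {"not", "no", "nor", "never"}

# Set difference keeps negation words in the text
SENTIMENT_STOPWORDS = BASIC_STOPWORDS - NEGATIONS


def remove_stopwords(text, stop_words):
    return " ".join(w for w in text.split() if w.lower() not in stop_words)


review = "The movie is not good"
print(remove_stopwords(review, BASIC_STOPWORDS))      # movie good
print(remove_stopwords(review, SENTIMENT_STOPWORDS))  # movie not good
```

The first output flips the meaning of the review, which is exactly the risk described above.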
12 Tokenization
Tokenization is the step where text is split into smaller units. These units can be words, sentences or subwords. In my notebook, I tested tokenization using simple split, regex, NLTK and spaCy.
This is where I understood why tokenization is more than just splitting by
space. A sentence like I am going to Mumbai! should ideally
separate Mumbai and the punctuation mark. A basic split does
not do that properly.
Using split
sent = "I am going to Mumbai!"
sent.split()
The problem is that Mumbai! remains one token. For simple
data this may be okay, but for robust NLP it is weak.
Using regex
import re
tokens = re.findall(r'[\w]+', sent)
print(tokens)
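The pattern above drops the exclamation mark entirely. If punctuation should survive as its own token, a small variation is possible. This is a sketch, and regex_tokenize is a hypothetical name.

```python
import re


def regex_tokenize(text):
    # \w+ matches a run of word characters;
    # [^\w\s] matches one character that is neither a word character nor whitespace
    return re.findall(r"\w+|[^\w\s]", text)


tokens = regex_tokenize("I am going to Mumbai!")
print(tokens)  # ['I', 'am', 'going', 'to', 'Mumbai', '!']
```

Contractions are still handled crudely: "We're" becomes three tokens (We, ', re), which is one reason library tokenizers are usually preferred.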
Using NLTK
from nltk.tokenize import word_tokenize, sent_tokenize
word_tokenize("I am going to Mumbai!")
sent_tokenize("I am going to Mumbai. I will stay there.")
Using spaCy
import spacy
nlp = spacy.load("en_core_web_sm")
doc = nlp("We're here to help! mail us at xyz@gmail.com")
for token in doc:
    print(token.text)
13 Stemming
Stemming converts words into a root-like form, but it does not always care whether the final word is a valid dictionary word. This was the key difference I understood.
For example, studies may become studi. That is
not a proper English word, but for some search and classification tasks,
it may still work because related forms become similar.
from nltk.stem.porter import PorterStemmer
ps = PorterStemmer()
def stem_words(text):
    return " ".join([ps.stem(word) for word in text.split()])
stem_words("walk walks walking walked")
I also compared PorterStemmer and SnowballStemmer on different domains like news, Wikipedia-style text and science paper text. The outputs were not always perfect, but the comparison helped me see that stemming is aggressive and fast.
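To see why stems are often not dictionary words, it can help to look at a deliberately naive suffix stripper. This is a toy sketch, far simpler than the Porter or Snowball algorithms, and the suffix list and minimum stem length are arbitrary assumptions.

```python
# Checked in order, so longer suffixes are tried before the bare "s"
SUFFIXES = ("ing", "ed", "es", "s")


def naive_stem(word, min_stem=3):
    """Chop the first matching suffix, as long as enough of the word remains."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem:
            return word[: -len(suffix)]
    return word


words = "walk walks walking walked studies".split()
print([naive_stem(w) for w in words])  # ['walk', 'walk', 'walk', 'walk', 'studi']
```

Rules like these are fast because they never consult a dictionary, and for the same reason they can also damage words such as this, which the toy version cuts to thi.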
14 Lemmatization
Lemmatization also converts a word to its base form, but unlike stemming it tries to return a valid word. It usually needs more linguistic information like POS tags and a lexicon such as WordNet.
This is why lemmatization is more meaningful but usually more expensive than stemming.
from nltk.stem import WordNetLemmatizer
from nltk.corpus import wordnet
lemmatizer = WordNetLemmatizer()
def apply_lemmatization(tokens, pos=wordnet.VERB):
    return [lemmatizer.lemmatize(word, pos) for word in tokens]
import spacy
nlp = spacy.load("en_core_web_sm")
def spacy_lemmatization(text):
    doc = nlp(text)
    return [token.lemma_ for token in doc]
| Method | Behavior | Best Use |
|---|---|---|
| Stemming | cuts words to a root-like form | fast search, simple classification, information retrieval |
| Lemmatization | returns a valid base word | clean NLP pipelines, semantic tasks, readable outputs |
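The NLTK function above applies one POS to every token, with VERB as the default. A common refinement is to tag each token first and translate the Penn Treebank tag into the single-letter POS that WordNet expects. The helper below is a sketch of that mapping only; its name is an assumption, and the letters correspond to NLTK's wordnet.ADJ, wordnet.VERB, wordnet.ADV and wordnet.NOUN constants.

```python
def penn_to_wordnet(tag):
    """Map a Penn Treebank tag (as produced by nltk.pos_tag) to a WordNet POS letter."""
    if tag.startswith("JJ"):
        return "a"  # adjective
    if tag.startswith("VB"):
        return "v"  # verb
    if tag.startswith("RB"):
        return "r"  # adverb
    return "n"      # default to noun, which is also the lemmatizer's own default


# Intended use (needs NLTK and its data, so it is shown as a comment):
#   [lemmatizer.lemmatize(word, penn_to_wordnet(tag)) for word, tag in pos_tag(tokens)]
for tag in ["VBD", "JJ", "NNS", "RB"]:
    print(tag, "->", penn_to_wordnet(tag))
```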
15 Annotators, NER and POS Tagging
After basic preprocessing, I moved into annotators. Annotators add extra information to the text. Instead of only cleaning words, we start identifying what each word represents.
The main annotators I explored were Named Entity Recognition, POS tagging and custom annotation ideas like sentiment rules based on positive or negative words.
Named Entity Recognition
- detects people
- detects organizations
- detects locations
- detects dates and other entity types
POS Tagging
- marks nouns
- marks verbs
- marks adjectives
- adds grammar context
With spaCy, I tested NER on a sentence about Google and its founders. It
identified Google as an organization and names like
Larry Page and Sergey Brin as persons.
import spacy
nlp = spacy.load("en_core_web_sm")
text = "Google was founded by Larry Page and Sergey Brin."
doc = nlp(text)
for ent in doc.ents:
    print(ent.text, ent.label_)
For POS tagging, I tried NLTK and spaCy. A sentence like
Will Will search my blogs showed why context matters. The
same word can behave differently depending on where it appears.
from nltk import pos_tag, word_tokenize
text = "Will Will search my blogs"
tokens = word_tokenize(text)
tagged = pos_tag(tokens)
print(tagged)
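The custom annotation idea mentioned in this section, sentiment rules based on positive or negative words, can be sketched with plain Python. The word lists and the POS, NEG and O labels are hypothetical choices for illustration, not the notebook's actual rules.

```python
import re

# Hypothetical lexicons; a real annotator would use a much larger list
POSITIVE = {"good", "great", "love", "excellent"}
NEGATIVE = {"bad", "boring", "hate", "terrible"}


def annotate_sentiment(text):
    """Label each word token as POS, NEG or O (no sentiment signal)."""
    annotations = []
    for token in re.findall(r"\w+", text.lower()):
        if token in POSITIVE:
            label = "POS"
        elif token in NEGATIVE:
            label = "NEG"
        else:
            label = "O"
        annotations.append((token, label))
    return annotations


print(annotate_sentiment("Great cast, boring plot"))
```

Like NER and POS tagging, the output is the original tokens plus one extra piece of information per token.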
16 Hidden Markov Model Intuition for NER
In the annotator notebook, I also wrote notes about how NER can be understood using Hidden Markov Models and the Viterbi algorithm. I am not treating this as a full mathematical implementation yet, but the intuition became clearer.
The idea is that the actual entity tags are hidden states. The words are observations. The model tries to find the most likely sequence of entity tags for the given words.
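The intuition can be sketched as a small Viterbi decoder. Everything numeric here is a hand-picked, hypothetical toy value (start, transition and emission probabilities, plus the small fallback for unseen words); a real model would estimate these from labeled data.

```python
import math


def viterbi(observations, states, start_p, trans_p, emit_p, unseen=1e-6):
    """Return the most likely hidden state sequence for the observed words."""
    # best[t][s] = (log-probability of the best path ending in s at step t, previous state)
    best = [{}]
    for s in states:
        emit = emit_p[s].get(observations[0], unseen)
        best[0][s] = (math.log(start_p[s]) + math.log(emit), None)

    for t in range(1, len(observations)):
        best.append({})
        for s in states:
            emit = emit_p[s].get(observations[t], unseen)
            prev, score = max(
                ((p, best[t - 1][p][0] + math.log(trans_p[p][s])) for p in states),
                key=lambda item: item[1],
            )
            best[t][s] = (score + math.log(emit), prev)

    # Backtrack from the best final state
    path = [max(states, key=lambda s: best[-1][s][0])]
    for t in range(len(observations) - 1, 0, -1):
        path.append(best[t][path[-1]][1])
    return path[::-1]


states = ["O", "ORG", "PER"]
start_p = {"O": 0.6, "ORG": 0.2, "PER": 0.2}
trans_p = {
    "O": {"O": 0.7, "ORG": 0.1, "PER": 0.2},
    "ORG": {"O": 0.8, "ORG": 0.1, "PER": 0.1},
    "PER": {"O": 0.4, "ORG": 0.1, "PER": 0.5},
}
emit_p = {
    "O": {"was": 0.3, "founded": 0.3, "by": 0.3, "page": 0.05},
    "ORG": {"google": 0.9},
    "PER": {"larry": 0.5, "page": 0.4},
}

words = ["google", "was", "founded", "by", "larry", "page"]
tags = viterbi(words, states, start_p, trans_p, emit_p)
print(list(zip(words, tags)))
```

With these toy numbers the word page is tagged PER rather than O, because the transition from PER to PER after larry outweighs its small chance of being an ordinary word. That is the context effect the hidden state sequence is meant to capture.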
17 Text Preprocessing Is Not One Fixed Recipe
The biggest mistake in text preprocessing is thinking that every dataset needs the same cleaning pipeline. That is not true.
For example, punctuation may be noise in topic modeling but useful in sentiment analysis. Emojis may be useless for news classification but useful in social media sentiment. Stopwords may be removed in keyword extraction but preserved in language modeling or question answering.
| Task | Preprocessing Choice | Reason |
|---|---|---|
| Sentiment analysis | keep negation and maybe emojis | words like not and emoji signals can change sentiment. |
| Topic modeling | remove stopwords and punctuation | common words may hide topic-specific words. |
| NER | avoid aggressive lowercasing | capitalization can help identify names and organizations. |
| Search | stemming or lemmatization can help | different word forms can match the same idea. |
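One way to reflect these task-dependent choices in code is a single function with switches instead of a fixed recipe. The sketch below is an assumption about how that could look; the parameter names and the tiny stopword set are hypothetical.

```python
import re
import string

PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def preprocess(text, lowercase=True, strip_punctuation=True, stop_words=frozenset()):
    """Reduce noise, optionally normalize, then split into tokens."""
    text = re.sub(r"<.*?>", " ", text)                  # remove HTML tags
    text = re.sub(r"https?://\S+|www\.\S+", " ", text)  # remove URLs
    if lowercase:
        text = text.lower()
    if strip_punctuation:
        text = text.translate(PUNCT_TABLE)
    return [tok for tok in text.split() if tok.lower() not in stop_words]


# Sentiment-style settings: negation is kept because "not" is not in the stopword set
print(preprocess("<b>The plot is NOT good!</b>", stop_words={"the", "is"}))
# NER-style settings: keep the original casing and punctuation
print(preprocess("Google hired Larry.", lowercase=False, strip_punctuation=False))
```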
18 My Final Preprocessing Checklist
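After arranging the notebooks in sequence, the checklist that now makes sense to me follows the same order as this post: reduce noise (HTML tags, URLs, extra spaces), normalize the text (case, numbers, punctuation, chat words, spelling, emojis, stopwords), split it into tokens, and then apply stemming, lemmatization or annotators depending on the task.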
19 GitHub Notebook Connection
This blog explains what I understood from the notebooks, while the GitHub repository keeps the actual implementation work visible. The notebooks include basic preprocessing, tokenization, annotator creation, and stemming plus lemmatization practice.
NLP by Vinod GitHub Repository
Notebook references: 1_00_text_preprocessing.ipynb,
1_04_string_tokenization.ipynb,
1_05_Annotator_creation.ipynb, and
1_07_Stemming_Lemmatization.ipynb.
21 What Comes Next in the NLP Journey
The next topic is Feature Extraction in NLP. This is where cleaned text will become numbers using methods like Bag of Words, TF-IDF and later embeddings.