NLP by Vinod - Foundations

Feature Extraction

Feature Extraction in NLP - From Clean Text to Count-Based Features.

After text preprocessing, I learned how cleaned words are converted into numbers using word count, frequency distribution, one-hot encoding, Bag of Words, n-grams and TF-IDF.

NLP Feature Extraction Bag of Words TF-IDF

Feature extraction in NLP is the step where cleaned text is converted into numerical features. This is needed because computers and machine learning models do not directly understand raw words the way humans do. A model expects numbers, so we represent words, sentences, or documents using numerical values.

In the previous post on text preprocessing in NLP, I cleaned raw text using steps like lowercasing, punctuation handling, stopword removal, tokenization, stemming and lemmatization. But after cleaning, the text is still text. The next question is simple: how do we give this text to a model?

That is where feature extraction started making sense to me. We are not just writing words as numbers randomly. We are trying to extract useful signals from text and represent them in a form that a machine learning algorithm can process.

What clicked for me:
Text preprocessing makes text clean. Feature extraction makes that cleaned text usable for models.

Figure: Feature extraction converts cleaned text into count-based numerical features (a document-term matrix) so machine learning models can work with words and documents.

01 Where Feature Extraction Fits in the NLP Pipeline

I understood feature extraction better when I placed it in the full NLP pipeline. First we collect data. Then we clean it. Then we represent it as numbers. After that, we can use heuristic rules, machine learning models, or deep learning models.

Acquire: CSV, API, scraping, JSON

Preprocess: clean, tokenize, normalize

Represent: word count, BoW, TF-IDF

Model: heuristic, ML, DL

Evaluate: accuracy, error, insight

This also helped me understand why feature extraction is called text representation. We are representing text in another form. The original sentence may be human-readable, but the extracted feature vector is model-readable.

My simple definition: feature extraction means converting text into useful numbers while trying to preserve meaningful information from the text.

02 Approaches After Feature Extraction

In my notes, I also connected feature extraction with different ways of solving NLP tasks. Once text becomes numerical, we can use different approaches depending on the problem.

Heuristic Approach

uses rules created manually
good for simple logic
easy to explain
can fail when language becomes complex

Machine Learning Approach

uses extracted features as input
learns patterns from labeled data
works with BoW and TF-IDF
used for classification and prediction

Deep Learning Approach

can learn richer representations
uses embeddings and neural networks
captures more context than simple counts
connects to the next topic

Why This Matters

model quality depends on features
bad features can weaken good models
simple features are still useful
embeddings improve semantic meaning

This is why I do not want to skip traditional feature extraction. Even though embeddings and transformers are powerful, simple features like word count, Bag of Words and TF-IDF explain the foundation very clearly.

03 Important Terms I Had to Understand First

Before jumping into code, I had to understand a few basic terms. These terms appear again and again in feature extraction.

Term	Meaning	How I Think About It
Corpus	collection of all text data	all sentences or documents together
Document	one text sample	one sentence, paragraph, review or article
Vocabulary	unique words from the corpus	the list of words used to create columns
Feature	numerical signal extracted from text	a value that the model can use
Vector	list of numbers representing text	model-readable version of a document

The most important term here for me was vocabulary. Once the vocabulary is created, each document can be represented based on those vocabulary words.
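A minimal sketch of how these terms fit together, assuming a hypothetical two-document corpus and plain split() tokenization (the names word_to_index and to_vector are only for illustration):

```python
# Hypothetical mini-corpus; each string is one document.
corpus = [
    "nlp is fun",
    "nlp is useful",
]

# Vocabulary: the sorted set of unique words across the whole corpus.
vocabulary = sorted({word for doc in corpus for word in doc.split()})

# Each vocabulary word gets a fixed column position.
word_to_index = {word: i for i, word in enumerate(vocabulary)}

def to_vector(doc):
    """Represent one document as counts over the vocabulary columns."""
    vector = [0] * len(vocabulary)
    for word in doc.split():
        if word in word_to_index:
            vector[word_to_index[word]] += 1
    return vector

print(vocabulary)            # ['fun', 'is', 'nlp', 'useful']
print(to_vector(corpus[0]))  # [1, 1, 1, 0]
```

Each number in the vector is a feature, and the whole list is the model-readable version of the document.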

04 Word Count as the First Simple Feature

The first small feature I tried was word count. This is not a complete text representation method like TF-IDF, but it helped me understand feature extraction at the simplest level.

If a sentence is very short, very long, or has repeated words, those are also signals. In some NLP tasks, length itself can become a feature.

          
        
Python

text = "Natural Language Processing is a fascinating domain!"

words = text.split()

print(words)
print("Word count:", len(words))

The basic split() method is easy to understand, but it is not perfect. It splits only on whitespace, so punctuation stays attached to words, and that can create wrong tokens and wrong counts in some cases.

Notebook mistake I noticed: split() is simple, but it does not understand punctuation properly. So Hello! may stay as one token with punctuation attached.
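The punctuation problem is easy to see on a small example. This is a rough sketch, assuming a made-up sentence and using str.strip with string.punctuation as one simple (not complete) way to clean the tokens:

```python
import string

text = "Hello! NLP is fun, right?"

# Plain split(): punctuation stays glued to the neighbouring word.
raw_tokens = text.split()

# Strip leading/trailing punctuation from each token and drop empty results.
clean_tokens = [token.strip(string.punctuation) for token in raw_tokens]
clean_tokens = [token for token in clean_tokens if token]

print(raw_tokens)    # ['Hello!', 'NLP', 'is', 'fun,', 'right?']
print(clean_tokens)  # ['Hello', 'NLP', 'is', 'fun', 'right']
```

The word count is the same here, but the tokens are not: a later frequency count would treat Hello! and Hello as two different words.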

05 Regex, NLTK and spaCy for Better Counting

After using split(), I tested regex, NLTK and spaCy. This connected feature extraction back to the earlier tokenization topic. Better tokenization gives better counting.

Regex word count

          
        
Python

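import re

# \w+ matches runs of letters, digits and underscores, so punctuation is skipped.
pattern = re.compile(r'\w+')

def word_count_regex(text):
    # Use the compiled pattern directly instead of passing it to re.findall.
    words = pattern.findall(text)
    print(words)
    print("Word count:", len(words))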

Regex performed better than basic split for punctuation-heavy examples because it can extract word-like patterns instead of only separating by spaces.
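One caveat worth checking is that \w+ also splits on apostrophes and hyphens. The sketch below compares it with an alternative pattern; the second pattern is only an assumption for illustration, not something from the notebook:

```python
import re

text = "Don't stop, it's state-of-the-art!"

# \w+ treats the apostrophe and the hyphen as separators.
simple = re.findall(r"\w+", text)

# Hypothetical alternative: allow ' or - between word characters.
joined = re.findall(r"\w+(?:['-]\w+)*", text)

print(simple, len(simple))
# ['Don', 't', 'stop', 'it', 's', 'state', 'of', 'the', 'art'] 9
print(joined, len(joined))
# ["Don't", 'stop', "it's", 'state-of-the-art'] 4
```

The same sentence gives a word count of 9 or 4 depending on the pattern, so the regex itself is a design choice.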

NLTK word count

          
        
Python

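# Requires the Punkt tokenizer data: run nltk.download("punkt") once
# (newer NLTK releases may ask for "punkt_tab" instead).
from nltk import word_tokenize, sent_tokenize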

def word_count_nltk(text):
    tokens = word_tokenize(text)
    print(tokens)
    print("Word count:", len(tokens))

def sentence_count_nltk(text):
    sentences = sent_tokenize(text)
    print(sentences)
    print("Sentence count:", len(sentences))

spaCy word count

          
        
Python

import spacy

# Requires the model to be installed first: python -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm")

def word_count_spacy(text):
    doc = nlp(text)
    tokens = [token.text for token in doc if not token.is_punct]
    print(tokens)
    print("Word count:", len(tokens))

What I understood: word count depends on tokenization. If tokenization is weak, the extracted feature can also become weak.

06 Frequency Distribution

After word count, I learned frequency distribution. This means counting how many times each word appears in the text. This is the first point where I felt that text was becoming a numerical pattern.

Python's Counter made this very clear. It takes tokens and counts repeated words.

          
        
Python

from collections import Counter

def freq_count(text):
    words = text.lower().split()
    return Counter(words)

print(freq_count("NLP is fun. NLP is useful."))
# Counter({'nlp': 2, 'is': 2, 'fun.': 1, 'useful.': 1})

But again, punctuation can disturb the result. That is why I also tried spaCy to remove punctuation before counting.

          
        
Python

from collections import Counter
import spacy

nlp = spacy.load("en_core_web_sm")

def freq_count_spacy(text):
    doc = nlp(text.lower())
    words = [token.text for token in doc if not token.is_punct]
    return Counter(words)

Small but important point: frequency distribution looks simple, but it is the base idea behind Bag of Words.
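If spaCy is not available, a lighter option is to tokenize with a regex before counting. This is a minimal sketch; freq_count_regex is a hypothetical helper name and \w+ is an assumed tokenization rule:

```python
import re
from collections import Counter

def freq_count_regex(text):
    """Count lowercase word tokens, skipping punctuation via a \\w+ regex."""
    return Counter(re.findall(r"\w+", text.lower()))

text = "NLP is fun. NLP is useful."

# split() keeps the trailing dots, so 'fun.' and 'useful.' become tokens.
print(Counter(text.lower().split()))

# The regex version counts clean words.
freq = freq_count_regex(text)
print(freq)

# most_common(k) returns the k most frequent (word, count) pairs.
print(freq.most_common(2))  # [('nlp', 2), ('is', 2)]
```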

07 One-Hot Encoding

One-hot encoding represents each word as a binary vector whose length equals the vocabulary size. The position that belongs to the word is set to 1 and every other position stays 0.

In my notebook, I built a small corpus and created vocabulary manually. This helped me see how each word gets its own position.

          
        
Python

corpus = [
    "NLP is fun and exciting",
    "Machines understand NLP and text",
    "Text processing is part of NLP"
]

One-hot encoding is intuitive, but it has major problems. If the vocabulary has 50,000 words, each word vector becomes very large and mostly filled with zeros. This is called sparsity.
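A minimal sketch of the manual approach, assuming lowercase split() tokenization and an alphabetically sorted vocabulary (the helper name one_hot is just for illustration):

```python
corpus = [
    "NLP is fun and exciting",
    "Machines understand NLP and text",
    "Text processing is part of NLP",
]

# Build the vocabulary manually: unique lowercase words, sorted for stable positions.
vocab = sorted({word for doc in corpus for word in doc.lower().split()})
index = {word: i for i, word in enumerate(vocab)}

def one_hot(word):
    """Return a binary vector with a single 1 at the word's vocabulary position."""
    vector = [0] * len(vocab)
    vector[index[word.lower()]] = 1
    return vector

# A document becomes a list of one-hot vectors, one per word.
doc_matrix = [one_hot(word) for word in corpus[0].split()]

print(vocab)
print(one_hot("NLP"))                    # 1 at position 5, zeros elsewhere
print(len(doc_matrix), "x", len(vocab))  # 5 x 11
```

Even with only 11 vocabulary words, 10 of the 11 values in every word vector are zero, and a document with more words would produce a matrix with more rows.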

Pros

easy to understand
easy to implement
shows vocabulary position clearly

Cons

very sparse vectors
large vocabulary creates huge vectors
documents become variable-length sequences of word vectors, so the input size is not fixed
semantic meaning is not captured

This is where I understood why one-hot encoding is useful for learning but not always practical for real NLP pipelines.

08 Bag of Words

Bag of Words was the first proper text representation method that felt useful for machine learning. The idea is simple: build a vocabulary from the corpus, then represent each document using word counts.

In Bag of Words, word order is ignored. The model only sees how many times each vocabulary word appears in a document.

          
        
Python

from sklearn.feature_extraction.text import CountVectorizer

corpus = [
    "NLP is fun and exciting",
    "Machines understand NLP and text",
    "Text processing is part of NLP"
]

cv = CountVectorizer()
bow = cv.fit_transform(corpus)

print(cv.vocabulary_)
print(cv.get_feature_names_out())
print(bow.toarray())

The output matrix is a document-term matrix. Rows represent documents. Columns represent vocabulary words. Values represent word counts.

What clicked: Bag of Words solves the fixed-size input problem better than one-hot encoding because every document gets represented using the same vocabulary columns.
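To see what the vectorizer is doing, the same document-term matrix can be rebuilt by hand. This is a sketch under simple assumptions (lowercase plus split() tokenization); for this particular corpus it should line up with the CountVectorizer columns, but in general CountVectorizer tokenizes differently:

```python
from collections import Counter

corpus = [
    "NLP is fun and exciting",
    "Machines understand NLP and text",
    "Text processing is part of NLP",
]

tokenized = [doc.lower().split() for doc in corpus]

# One column per unique word, in alphabetical order.
vocab = sorted({word for doc in tokenized for word in doc})

def bow_vector(tokens):
    """Count how often each vocabulary word appears in one document."""
    counts = Counter(tokens)
    return [counts[word] for word in vocab]

# Rows are documents, columns are vocabulary words.
matrix = [bow_vector(doc) for doc in tokenized]

print(vocab)
for row in matrix:
    print(row)
# [1, 1, 1, 1, 0, 1, 0, 0, 0, 0, 0]
# [1, 0, 0, 0, 1, 1, 0, 0, 0, 1, 1]
# [0, 0, 0, 1, 0, 1, 1, 1, 1, 1, 0]
```

Every row has the same length (11), no matter how many words the document contains.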

Figure: Bag of Words and n-grams create vocabulary-based vectors where each document is represented using word presence or word frequency.

09 Binary Bag of Words

Binary Bag of Words is a small variation of Bag of Words. Instead of counting how many times a word appears, it only checks whether the word is present or not.

If a word appears one time or five times, the value becomes 1. If the word does not appear, the value becomes 0.

          
        
Python

from sklearn.feature_extraction.text import CountVectorizer

cv_binary = CountVectorizer(binary=True)
binary_bow = cv_binary.fit_transform(corpus)

print(binary_bow.toarray())

This can be useful in sentiment analysis, where sometimes the presence of a word matters more than its repeated count.

Example: if a review contains the word excellent, its presence itself can be a useful signal even if it appears only once.
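The difference between counts and presence shows up as soon as a word repeats. A small sketch, assuming a made-up review and a hand-picked five-word vocabulary:

```python
from collections import Counter

# Hypothetical review and vocabulary, chosen only to show a repeated word.
review = "excellent food excellent service excellent price"
vocab = ["excellent", "food", "price", "service", "terrible"]

counts = Counter(review.split())

# Count-based Bag of Words: how many times each word appears.
count_vector = [counts[word] for word in vocab]

# Binary Bag of Words: 1 if the word appears at all, otherwise 0.
binary_vector = [1 if counts[word] > 0 else 0 for word in vocab]

print(count_vector)   # [3, 1, 1, 1, 0]
print(binary_vector)  # [1, 1, 1, 1, 0]
```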

10 Advantages and Limitations of Bag of Words

Bag of Words is simple and useful, but it is not perfect. I liked it because it is intuitive, but its limitations also became clear quickly.

Advantages

very intuitive
easy to implement with CountVectorizer
works well for many text classification tasks
creates fixed-size feature vectors

Limitations

creates sparse vectors
ignores word order
does not understand meaning deeply
words not seen during fitting are ignored when transforming new text

The biggest problem for me was word order. For example, You are good and You are not good can look too similar in a basic Bag of Words representation, even though their meaning is different.

Important limitation: Bag of Words counts words, but it does not truly understand context.
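One way to put a number on "too similar" is cosine similarity between the two count vectors. This is a sketch with a hand-written four-word vocabulary; bow and cosine are illustrative helper names:

```python
import math
from collections import Counter

def bow(text, vocab):
    """Count vector of a text over a fixed vocabulary."""
    counts = Counter(text.lower().split())
    return [counts[word] for word in vocab]

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

vocab = ["are", "good", "not", "you"]
positive = bow("You are good", vocab)      # [1, 1, 0, 1]
negative = bow("You are not good", vocab)  # [1, 1, 1, 1]

print(round(cosine(positive, negative), 3))  # 0.866
```

A similarity of about 0.87 for two sentences with opposite meaning shows how little a single word like not changes the representation.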

11 N-grams

N-grams solve part of the word order problem by using sequences of words instead of only single words. A unigram uses one word. A bigram uses two words together. A trigram uses three words together.

This helped me understand how phrases can become features. Instead of only seeing not and good separately, the model can see not good as one feature.

          
        
Python

from sklearn.feature_extraction.text import CountVectorizer

cv_bigram = CountVectorizer(ngram_range=(2, 2))
bigram_matrix = cv_bigram.fit_transform(corpus)

print(cv_bigram.get_feature_names_out())

          
        
Python

cv_unigram_bigram = CountVectorizer(ngram_range=(1, 2))
matrix = cv_unigram_bigram.fit_transform(corpus)

print(cv_unigram_bigram.get_feature_names_out())

ngram_range	Meaning	Use
(1, 1)	only unigrams	single word features
(2, 2)	only bigrams	two-word phrase features
(1, 2)	unigrams and bigrams	single words plus short phrases
(1, 3)	unigrams, bigrams and trigrams	more context but larger feature space

N-grams are more context-aware than simple Bag of Words, but they also increase the number of features. If the dataset is large, the matrix can become very big and slow.
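Both points, phrases as features and the growing feature space, can be checked with a few lines of plain Python. A minimal sketch, assuming split() tokenization; ngrams and feature_count are hypothetical helper names:

```python
def ngrams(tokens, n):
    """Return all contiguous n-word sequences from a token list."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "you are not good".split()
print(ngrams(tokens, 1))  # ['you', 'are', 'not', 'good']
print(ngrams(tokens, 2))  # ['you are', 'are not', 'not good']
print(ngrams(tokens, 3))  # ['you are not', 'are not good']

corpus = [
    "NLP is fun and exciting",
    "Machines understand NLP and text",
    "Text processing is part of NLP",
]

def feature_count(max_n):
    """Number of unique features when using all n-grams from 1 up to max_n."""
    features = set()
    for doc in corpus:
        doc_tokens = doc.lower().split()
        for n in range(1, max_n + 1):
            features.update(ngrams(doc_tokens, n))
    return len(features)

for max_n in (1, 2, 3):
    print(max_n, feature_count(max_n))
# 1 11
# 2 24
# 3 34
```

Even on three short sentences, going from (1, 1) to (1, 3) roughly triples the number of columns.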

12 TF-IDF

TF-IDF was the most important part of this notebook for me. Bag of Words weights a word only by its count in the document, while TF-IDF scales that count by how rare the word is across the full corpus.

The intuition is simple: a word should get more weight if it appears often in one document but is not common in every document.

Term Frequency

how often a term appears in a document
higher count means stronger local signal
calculated inside one document

Inverse Document Frequency

checks how common the term is across documents
rare words get more importance
very common words get less importance
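The two parts can be multiplied by hand to see where the weights come from. This sketch assumes the smoothed formula that scikit-learn documents for its defaults (smooth_idf=True, norm="l2"), that is idf(t) = ln((1 + n) / (1 + df(t))) + 1 followed by L2 normalization of each row; other textbooks define IDF slightly differently, and the split() tokenization here is also an assumption:

```python
import math
from collections import Counter

corpus = [
    "NLP is fun and exciting",
    "Machines understand NLP and text",
    "Text processing is part of NLP",
]

docs = [doc.lower().split() for doc in corpus]
vocab = sorted({word for doc in docs for word in doc})
n_docs = len(docs)

# Document frequency: in how many documents each term appears.
df = {word: sum(1 for doc in docs if word in doc) for word in vocab}

# Smoothed inverse document frequency (assumed to mirror scikit-learn's default).
idf = {word: math.log((1 + n_docs) / (1 + df[word])) + 1 for word in vocab}

def tfidf_vector(tokens):
    """Raw term counts times IDF, then scaled to unit L2 length."""
    tf = Counter(tokens)
    weights = [tf[word] * idf[word] for word in vocab]
    norm = math.sqrt(sum(w * w for w in weights))
    return [w / norm for w in weights] if norm else weights

row = tfidf_vector(docs[0])
for word, weight in zip(vocab, row):
    if weight > 0:
        print(f"{word:10s} {weight:.3f}")
```

In the first document, nlp appears in all three documents and gets the lowest weight, while fun and exciting appear only there and get the highest.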

          
        
Python

from sklearn.feature_extraction.text import TfidfVectorizer

tfidf = TfidfVectorizer()
tfidf_matrix = tfidf.fit_transform(corpus)

print(tfidf.get_feature_names_out())
print(tfidf_matrix.toarray())

TF-IDF is widely used in information retrieval and keyword extraction because it can highlight words that are more informative for a document.

Figure: TF-IDF gives higher importance to words that are meaningful in one document but not common across every document.

13 CountVectorizer vs TfidfVectorizer

After implementing both, I understood the difference more clearly. CountVectorizer counts words. TfidfVectorizer assigns weights based on importance.

Method	What It Stores	When It Helps
CountVectorizer	word count or word presence	simple classification, baseline models, frequency-based features
TfidfVectorizer	weighted word importance	information retrieval, keyword extraction, search-style tasks

My takeaway: CountVectorizer asks how many times a word appears. TF-IDF asks how important that word is in this document compared to the corpus.

14 Custom Features

I also learned that feature extraction is not limited to built-in methods. Sometimes we can create custom features based on the domain.

For example, in sentiment analysis, we can count positive words and negative words manually. This is a heuristic feature. It may not be perfect, but it is useful for understanding how feature engineering works.

          
        
Python

from nltk import word_tokenize

positive_words = {"happy", "good", "great", "excellent", "love", "amazing"}
negative_words = {"sad", "bad", "terrible", "hate", "awful", "worst"}

def sentiment_features(text):
    tokens = word_tokenize(text.lower())
    pos_count = sum(1 for token in tokens if token in positive_words)
    neg_count = sum(1 for token in tokens if token in negative_words)
    return {"positive_count": pos_count, "negative_count": neg_count}

This showed me that feature extraction can be automatic, like CountVectorizer, or manual, like counting positive and negative words based on domain knowledge.
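Custom counts like these can sit next to other hand-made signals in one small feature vector. A rough sketch, assuming regex tokenization instead of NLTK and two extra illustrative features (token count and number of exclamation marks) that are not from the notebook:

```python
import re

positive_words = {"happy", "good", "great", "excellent", "love", "amazing"}
negative_words = {"sad", "bad", "terrible", "hate", "awful", "worst"}

def custom_features(text):
    """Return [positive_count, negative_count, token_count, exclamation_count]."""
    tokens = re.findall(r"\w+", text.lower())
    pos_count = sum(1 for token in tokens if token in positive_words)
    neg_count = sum(1 for token in tokens if token in negative_words)
    return [pos_count, neg_count, len(tokens), text.count("!")]

print(custom_features("I love this movie, the acting is great!"))  # [2, 0, 8, 1]
print(custom_features("Worst film ever. Terrible and bad."))       # [0, 3, 6, 0]
```

A vector like this could be used on its own for a rule-based baseline or placed alongside Bag of Words or TF-IDF columns.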

15 Keyword Extraction Using TF-IDF

One practical task in the notebook was keyword extraction. The idea is to use TF-IDF values to identify important words from a document.

          
        
Python

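import numpy as np

# Reuses tfidf and tfidf_matrix from the TfidfVectorizer example in the TF-IDF section.
feature_array = np.array(tfidf.get_feature_names_out())
tfidf_values = tfidf_matrix.toarray()

# Sort the first document's weights in descending order and keep the top 5 terms.
importance = np.argsort(tfidf_values[0])[::-1]
keywords = feature_array[importance[:5]]

print("Top keywords:", keywords)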

This is where I saw TF-IDF as more than a formula. It can directly help in search, ranking, summarizing important words, and extracting keywords from text.

Important correction: in a real project, I should calculate keyword importance from the correct TF-IDF matrix, not accidentally reuse a matrix from another method.
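One way to guard against that mix-up is to pass the feature names and the weights together and check that they match. This is a sketch in plain Python; top_keywords is a hypothetical helper, and the names and weights below are made-up numbers for illustration only:

```python
def top_keywords(feature_names, weights, k=5):
    """Return the k highest-weighted (term, weight) pairs for one document row."""
    if len(feature_names) != len(weights):
        # A length mismatch usually means the row came from a different vectorizer.
        raise ValueError("feature names and weights do not belong together")
    ranked = sorted(zip(feature_names, weights), key=lambda pair: pair[1], reverse=True)
    # Skip terms with zero weight: they do not occur in the document.
    return [(term, weight) for term, weight in ranked[:k] if weight > 0]

# Hypothetical feature names and TF-IDF weights for a single document.
names = ["and", "fun", "nlp", "text"]
row = [0.41, 0.53, 0.32, 0.0]

print(top_keywords(names, row, k=2))  # [('fun', 0.53), ('and', 0.41)]
```

With scikit-learn, the inputs would be tfidf.get_feature_names_out() and one row such as tfidf_matrix[0].toarray().ravel(), both taken from the same fitted vectorizer.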

16 Limitations of Count-Based Features

By the end of this topic, I understood why simple feature extraction is useful but limited. Bag of Words, n-grams and TF-IDF are strong baselines, but they still struggle with meaning.

Sparse vectors

Most values are zero because each document contains only a small part of the full vocabulary.

Out-of-vocabulary issue

If a new word was not in the vocabulary built during fitting, the vectorizer drops it when transforming new text, so the model never sees it.

Weak semantic meaning

Similar words like movie and film are treated as separate features unless the model learns that pattern from data.

Limited context

N-grams add some context, but they still cannot deeply understand sentence meaning.

These limitations naturally lead to the next topic: word and sentence embeddings. Embeddings try to represent words and sentences as dense vectors that capture meaning better than sparse count-based methods.
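The sparsity and out-of-vocabulary limitations can be made concrete with a tiny example. A minimal sketch, assuming a made-up three-document training set and split() tokenization:

```python
from collections import Counter

# Hypothetical training documents used to build the vocabulary.
train = [
    "the movie was great",
    "the plot was weak",
    "great acting and great music",
]
vocab = sorted({word for doc in train for word in doc.split()})

def bow(text):
    """Count vector over the training vocabulary; unknown words are dropped."""
    counts = Counter(text.split())
    return [counts[word] for word in vocab]

# Sparsity: share of zero entries in the document-term matrix.
matrix = [bow(doc) for doc in train]
total = len(matrix) * len(vocab)
zeros = sum(row.count(0) for row in matrix)
print(f"sparsity: {zeros / total:.2f}")  # sparsity: 0.56

# Out-of-vocabulary: 'film' was never seen, so it leaves no trace in the vector.
new_doc = "the film was great"
unseen = [word for word in new_doc.split() if word not in vocab]
print(unseen)                                # ['film']
print(bow(new_doc) == bow("the was great"))  # True
```

More than half of the matrix is already zero with nine vocabulary words, and film is silently dropped even though movie is in the vocabulary. That is also the weak semantic meaning problem in a small form.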

17 My Final Understanding

My final understanding is that feature extraction is the point where NLP starts becoming numerical. Before this, we were collecting and cleaning text. Now we are turning that cleaned text into features.

I also understood that traditional feature extraction methods are not useless just because embeddings exist. They are simple, interpretable, fast, and still useful for many baseline NLP models.

Start simple

Word count and frequency distribution help build intuition before vectorization.

Use Bag of Words as a baseline

It is easy to implement and works well for many beginner text classification tasks.

Use n-grams when short phrases matter

They preserve small word sequences and help capture simple context.

Use TF-IDF when importance matters

It highlights useful words better than raw count in many search and retrieval tasks.

18 GitHub Notebook Connection

This blog explains what I understood from the simple feature extraction notebooks. The implementation side is connected to my NLP by Vinod GitHub repository.

NLP by Vinod GitHub Repository

Notebook references: 2_01_what_is_FE.ipynb, 2_02_word_count.ipynb, and 2_03_freq_distribution_FE.ipynb.


19 Related Reading

NLP learning roadmap

The main roadmap that connects this topic to the full NLP by Vinod journey.

Text preprocessing in NLP

The previous topic where raw text was cleaned, normalized and tokenized before feature extraction.

Python strings and regex for NLP

The foundation topic that supports pattern extraction, tokenization and text cleaning.

20 What Comes Next in the NLP Journey

The next topic is Word and Sentence Embeddings in NLP. This will move beyond sparse count-based features and introduce dense vector representations.

Word embeddings

Represent words as dense vectors that can capture semantic similarity better than word counts.

Sentence embeddings

Represent full sentences in vector form so sentence meaning can be compared more directly.

From features to meaning

This becomes the bridge from traditional NLP to deep learning and transformer-based NLP.
