NLP by Vinod

A structured public journey from NLP fundamentals to real-world AI systems.

Vinod Codes is where I document my learning in AI, Machine Learning, Deep Learning, Natural Language Processing, Generative AI, and practical projects.

The main series here is NLP by Vinod — a learner-builder journey where I explain concepts with intuition, Python examples, mistakes, GitHub work, and honest implementation notes.

Feature Extraction in NLP - From Clean Text to Count-Based Features

After text preprocessing, I learned how cleaned words are converted into numbers using word count, frequency distribution, one-hot encoding, Bag of Words, n-grams and TF-IDF.

Feature extraction in NLP is the step where cleaned text is converted into numerical features. This is needed because computers and machine learning models do not directly understand raw words the way humans do. A model expects numbers, so we represent words, sentences, or documents using numerical values.

In the previous post on text preprocessing in NLP, I cleaned raw text using steps like lowercasing, punctuation handling, stopword removal, tokenization, stemming and lemmatization. But after cleaning, the text is still text. The next question is simple: how do we give this text to a model?

That is where feature extraction started making sense to me. We are not just writing words as numbers randomly. We are trying to extract useful signals from text and represent them in a form that a machine learning algorithm can process.

What clicked for me: text preprocessing makes text clean. Feature extraction makes that cleaned text usable for models.

Figure: Feature extraction converts cleaned text into numerical features (count-based features and a document-term matrix) so machine learning models can work with words and documents.

01 Where Feature Extraction Fits in the NLP Pipeline

I understood feature extraction better when I placed it in the full NLP pipeline. First we collect data. Then we clean it. Then we represent it as numbers. After that, we can use heuristic rules, machine learning models, or deep learning models.

  • Acquire: CSV, API, scraping, JSON
  • Preprocess: clean, tokenize, normalize
  • Represent: word count, BoW, TF-IDF
  • Model: heuristic, ML, DL
  • Evaluate: accuracy, error, insight

This also helped me understand why feature extraction is called text representation. We are representing text in another form. The original sentence may be human-readable, but the extracted feature vector is model-readable.

My simple definition: feature extraction means converting text into useful numbers while trying to preserve meaningful information from the text.

02 Approaches After Feature Extraction

In my notes, I also connected feature extraction with different ways of solving NLP tasks. Once text becomes numerical, we can use different approaches depending on the problem.

Heuristic Approach

  • uses rules created manually
  • good for simple logic
  • easy to explain
  • can fail when language becomes complex

Machine Learning Approach

  • uses extracted features as input
  • learns patterns from labeled data
  • works with BoW and TF-IDF
  • used for classification and prediction

Deep Learning Approach

  • can learn richer representations
  • uses embeddings and neural networks
  • captures more context than simple counts
  • connects to the next topic, embeddings

Why This Matters

  • model quality depends on features
  • bad features can weaken good models
  • simple features are still useful
  • embeddings capture semantic meaning better than counts

This is why I do not want to skip traditional feature extraction. Even though embeddings and transformers are powerful, simple features like word count, Bag of Words and TF-IDF explain the foundation very clearly.

03 Important Terms I Had to Understand First

Before jumping into code, I had to understand a few basic terms. These terms appear again and again in feature extraction.

Term       | Meaning                              | How I Think About It
Corpus     | collection of all text data          | all sentences or documents together
Document   | one text sample                      | one sentence, paragraph, review or article
Vocabulary | unique words from the corpus         | the list of words used to create columns
Feature    | numerical signal extracted from text | a value that the model can use
Vector     | list of numbers representing text    | model-readable version of a document

The most important term here for me was vocabulary. Once the vocabulary is created, each document can be represented based on those vocabulary words.

04 Word Count as the First Simple Feature

The first small feature I tried was word count. This is not a complete text representation method like TF-IDF, but it helped me understand feature extraction at the simplest level.

If a sentence is very short, very long, or has repeated words, those are also signals. In some NLP tasks, length itself can become a feature.

Python
text = "Natural Language Processing is a fascinating domain!"

words = text.split()

print(words)
print("Word count:", len(words))

The basic split() method is easy to understand, but it is not perfect. With no arguments it splits on whitespace only. Punctuation stays attached to words, and that can give wrong counts, for example when two words are joined only by a comma or a dash.

Notebook mistake I noticed: split() does not handle punctuation. "Hello!" stays as one token with the punctuation attached, and in the example above "domain!" is kept with its exclamation mark.

05 Regex, NLTK and spaCy for Better Counting

After using split(), I tested regex, NLTK and spaCy. This connected feature extraction back to the earlier tokenization topic. Better tokenization gives better counting.

Regex word count

Python
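import re

# \w+ matches runs of letters, digits and underscores, so punctuation is dropped.
# Note: it also splits contractions, e.g. "isn't" becomes "isn" and "t".
pattern = re.compile(r'\w+')

def word_count_regex(text):
    words = pattern.findall(text)
    print(words)
    print("Word count:", len(words))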

Regex performed better than basic split for punctuation-heavy examples because it can extract word-like patterns instead of only separating by spaces.
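To make the difference concrete, here is a small side-by-side sketch using only the standard library. The sample sentence and the helper names tokens_with_split and tokens_with_regex are made up for illustration. It also shows one trade-off of the \w+ pattern: contractions get split.

```python
import re

WORD_PATTERN = re.compile(r"\w+")

def tokens_with_split(text):
    """Whitespace split: punctuation stays attached to the words."""
    return text.split()

def tokens_with_regex(text):
    """Regex \\w+: keeps only runs of letters, digits and underscores."""
    return WORD_PATTERN.findall(text)

# Hypothetical sample sentence with punctuation and a contraction.
sample = "Hello! NLP is fun, isn't it?"

split_tokens = tokens_with_split(sample)
regex_tokens = tokens_with_regex(sample)

print(split_tokens, len(split_tokens))
print(regex_tokens, len(regex_tokens))
```

The split version keeps tokens such as "Hello!" and "fun," while the regex version returns clean words but breaks "isn't" into two pieces, so neither count is automatically the right one.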

NLTK word count

Python
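from nltk import word_tokenize, sent_tokenize

# One-time setup: the tokenizers need the Punkt data, e.g. nltk.download("punkt")
# (newer NLTK releases may ask for "punkt_tab" instead).

def word_count_nltk(text):
    tokens = word_tokenize(text)
    # word_tokenize returns punctuation as separate tokens, so keep only
    # tokens that contain at least one letter or digit before counting.
    words = [t for t in tokens if any(ch.isalnum() for ch in t)]
    print(words)
    print("Word count:", len(words))

def sentence_count_nltk(text):
    sentences = sent_tokenize(text)
    print(sentences)
    print("Sentence count:", len(sentences))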

spaCy word count

Python
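import spacy

# Assumes the small English model is installed:
# python -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm")

def word_count_spacy(text):
    doc = nlp(text)
    # Skip punctuation and whitespace tokens so only words are counted.
    tokens = [token.text for token in doc if not (token.is_punct or token.is_space)]
    print(tokens)
    print("Word count:", len(tokens))

What I understood: word count depends on tokenization. If tokenization is weak, the extracted feature can also become weak.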

06 Frequency Distribution

After word count, I learned frequency distribution. This means counting how many times each word appears in the text. This is the first point where I felt that text was becoming a numerical pattern.

Python's Counter made this very clear. It takes tokens and counts repeated words.

Python
from collections import Counter

def freq_count(text):
    words = text.lower().split()
    return Counter(words)

freq_count("NLP is fun. NLP is useful.")

But again, punctuation can disturb the result. That is why I also tried spaCy to remove punctuation before counting.

Python
from collections import Counter
import spacy

nlp = spacy.load("en_core_web_sm")

def freq_count_spacy(text):
    doc = nlp(text.lower())
    # Drop punctuation and whitespace tokens before counting.
    words = [token.text for token in doc if not (token.is_punct or token.is_space)]
    return Counter(words)

Small but important point: frequency distribution looks simple, but it is the base idea behind Bag of Words.
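The link between a frequency distribution and Bag of Words can be sketched with the standard library alone. The two example sentences, the tokenize helper and the regex tokenization are assumptions for this illustration, not the notebook's code. Each document's Counter is read out in the same vocabulary order, which gives one count vector per document.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and keep only word-like tokens."""
    return re.findall(r"\w+", text.lower())

# Hypothetical two-document corpus.
docs = ["NLP is fun. NLP is useful.", "Text is fun."]

# One frequency distribution per document.
counts = [Counter(tokenize(doc)) for doc in docs]

# Shared vocabulary: every unique word across all documents, in sorted order.
vocabulary = sorted(set().union(*counts))

# Reading each Counter in vocabulary order gives a fixed-length count vector.
vectors = [[c[word] for word in vocabulary] for c in counts]

print(vocabulary)
for vec in vectors:
    print(vec)
```

Every vector has the same length as the vocabulary, and a word that is missing from a document simply gets a 0.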

07 One-Hot Encoding

One-hot encoding represents each word as a binary vector with one position per vocabulary word. The position that matches the word's index in the vocabulary is set to 1, and every other position stays 0.

In my notebook, I built a small corpus and created vocabulary manually. This helped me see how each word gets its own position.

Python
corpus = [
    "NLP is fun and exciting",
    "Machines understand NLP and text",
    "Text processing is part of NLP"
]

One-hot encoding is intuitive, but it has major problems. If the vocabulary has 50,000 words, each word vector has 50,000 positions and all but one of them are zero. This is called sparsity.
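Here is a minimal sketch of building the vocabulary by hand and one-hot encoding each word. It assumes simple lowercasing and whitespace tokenization, and the helper name one_hot is illustrative rather than taken from the notebook.

```python
corpus = [
    "NLP is fun and exciting",
    "Machines understand NLP and text",
    "Text processing is part of NLP",
]

# Tokenize with a plain lowercase + whitespace split (an assumption for this sketch).
tokenized = [doc.lower().split() for doc in corpus]

# Vocabulary: sorted unique words, each mapped to a fixed position.
vocabulary = sorted({word for doc in tokenized for word in doc})
word_to_index = {word: i for i, word in enumerate(vocabulary)}

def one_hot(word):
    """Return a vector of zeros with a single 1 at the word's vocabulary index."""
    vector = [0] * len(vocabulary)
    vector[word_to_index[word]] = 1
    return vector

# Each document becomes a list of one-hot vectors, one per word.
encoded_docs = [[one_hot(word) for word in doc] for doc in tokenized]

print(vocabulary)
print("nlp ->", one_hot("nlp"))
print([len(doc) for doc in encoded_docs])
```

With 11 vocabulary words, every word vector has 11 positions and ten of them are zero. The documents come out as 5, 5 and 6 vectors long, which is the variable-size problem that Bag of Words addresses.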

Pros

  • easy to understand
  • easy to implement
  • shows vocabulary position clearly

Cons

  • very sparse vectors
  • large vocabulary creates huge vectors
  • a document becomes one vector per word, so input size changes with document length
  • semantic meaning is not captured

This is where I understood why one-hot encoding is useful for learning but not always practical for real NLP pipelines.

08 Bag of Words

Bag of Words was the first proper text representation method that felt useful for machine learning. The idea is simple: build a vocabulary from the corpus, then represent each document using word counts.

In Bag of Words, word order is ignored. The model only sees how many times each vocabulary word appears in a document.

Python
from sklearn.feature_extraction.text import CountVectorizer

corpus = [
    "NLP is fun and exciting",
    "Machines understand NLP and text",
    "Text processing is part of NLP"
]

cv = CountVectorizer()
bow = cv.fit_transform(corpus)

print(cv.vocabulary_)
print(cv.get_feature_names_out())
print(bow.toarray())

The output matrix is a document-term matrix. Rows represent documents. Columns represent vocabulary words. Values represent word counts.

What clicked: Bag of Words solves the fixed-size input problem better than one-hot encoding because every document gets represented using the same vocabulary columns.

Figure: Bag of Words and n-grams create vocabulary-based vectors where each document is represented using word presence or word frequency.

09 Binary Bag of Words

Binary Bag of Words is a small variation of Bag of Words. Instead of counting how many times a word appears, it only checks whether the word is present or not.

If a word appears one time or five times, the value becomes 1. If the word does not appear, the value becomes 0.

Python
from sklearn.feature_extraction.text import CountVectorizer

# Reuses the `corpus` list defined in the Bag of Words example above.
cv_binary = CountVectorizer(binary=True)
binary_bow = cv_binary.fit_transform(corpus)

print(binary_bow.toarray())

This can be useful in sentiment analysis, where sometimes the presence of a word matters more than its repeated count.

Example: if a review contains the word excellent, its presence itself can be a useful signal even if it appears only once.
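In the three-sentence corpus used so far, no word repeats inside a document, so the binary and count matrices look identical there. The sketch below uses two made-up reviews with a repeated word to show where the two representations actually differ. It is a plain-Python illustration, not CountVectorizer itself.

```python
from collections import Counter

# Hypothetical reviews; "excellent" is repeated in the first one.
reviews = ["excellent excellent excellent food", "average food"]

tokenized = [review.split() for review in reviews]
vocabulary = sorted({word for doc in tokenized for word in doc})

count_vectors = []
binary_vectors = []
for doc in tokenized:
    counts = Counter(doc)
    # Count version: how many times each vocabulary word appears.
    count_vectors.append([counts[word] for word in vocabulary])
    # Binary version: 1 if the word appears at all, else 0.
    binary_vectors.append([1 if counts[word] > 0 else 0 for word in vocabulary])

print(vocabulary)
print("counts:", count_vectors)
print("binary:", binary_vectors)
```

The first review gets a 3 for "excellent" in the count version and a 1 in the binary version, while the second review is the same in both.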

10 Advantages and Limitations of Bag of Words

Bag of Words is simple and useful, but it is not perfect. I liked it because it is intuitive, but its limitations also became clear quickly.

Advantages

  • very intuitive
  • easy to implement with CountVectorizer
  • works well for many text classification tasks
  • creates fixed-size feature vectors

Limitations

  • creates sparse vectors
  • ignores word order
  • does not understand meaning deeply
  • words not seen while building the vocabulary are ignored at prediction time

The biggest problem for me was word order. For example, "You are good" and "You are not good" differ by only one count in a basic Bag of Words representation, even though their meanings are opposite.

Important limitation: Bag of Words counts words, but it does not truly understand context.
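How similar the two sentences look can be checked with a few lines of plain Python. This is only a sketch: it assumes lowercase whitespace tokenization and uses cosine similarity as one reasonable way to compare count vectors. The helper names are illustrative.

```python
import math
from collections import Counter

def bow_vector(text, vocabulary):
    """Count how often each vocabulary word appears in the text."""
    counts = Counter(text.lower().split())
    return [counts[word] for word in vocabulary]

def cosine_similarity(a, b):
    """Cosine of the angle between two count vectors (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

docs = ["You are good", "You are not good"]
vocabulary = sorted({word for doc in docs for word in doc.lower().split()})

v_good = bow_vector(docs[0], vocabulary)
v_not_good = bow_vector(docs[1], vocabulary)
similarity = cosine_similarity(v_good, v_not_good)

print(vocabulary)
print(v_good, v_not_good)
print(round(similarity, 3))
```

The two vectors differ only in the "not" column, and the similarity comes out at roughly 0.87, so a model that sees only these counts has very little to separate the two sentences.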

11 N-grams

N-grams solve part of the word order problem by using sequences of words instead of only single words. A unigram uses one word. A bigram uses two words together. A trigram uses three words together.

This helped me understand how phrases can become features. Instead of only seeing "not" and "good" separately, the model can see "not good" as one feature.

Python
from sklearn.feature_extraction.text import CountVectorizer

cv_bigram = CountVectorizer(ngram_range=(2, 2))
bigram_matrix = cv_bigram.fit_transform(corpus)

print(cv_bigram.get_feature_names_out())

Python
cv_unigram_bigram = CountVectorizer(ngram_range=(1, 2))
matrix = cv_unigram_bigram.fit_transform(corpus)

print(cv_unigram_bigram.get_feature_names_out())

ngram_range | Meaning                        | Use
(1, 1)      | only unigrams                  | single word features
(2, 2)      | only bigrams                   | two-word phrase features
(1, 2)      | unigrams and bigrams           | single words plus short phrases
(1, 3)      | unigrams, bigrams and trigrams | more context but larger feature space

N-grams are more context-aware than simple Bag of Words, but they also increase the number of features. If the dataset is large, the matrix can become very big and slow.
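To see both effects at once (phrases becoming features, and the feature space growing), here is a hand-rolled n-gram sketch. The functions ngrams and ngram_features are hypothetical helpers written for this example, with plain whitespace tokenization. CountVectorizer does the equivalent work internally with its own tokenizer.

```python
def ngrams(tokens, n):
    """Return all contiguous n-word sequences joined by a space."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def ngram_features(text, n_min, n_max):
    """Collect n-grams for every n from n_min to n_max, like ngram_range."""
    tokens = text.lower().split()
    features = []
    for n in range(n_min, n_max + 1):
        features.extend(ngrams(tokens, n))
    return features

docs = ["You are good", "You are not good"]

good_features = ngram_features(docs[0], 1, 2)
not_good_features = ngram_features(docs[1], 1, 2)

# Vocabulary size with unigrams only versus unigrams plus bigrams.
unigram_vocab = {f for doc in docs for f in ngram_features(doc, 1, 1)}
uni_bi_vocab = {f for doc in docs for f in ngram_features(doc, 1, 2)}

print(good_features)
print(not_good_features)
print(len(unigram_vocab), len(uni_bi_vocab))
```

Only the second sentence produces the feature "not good", which gives a model something to separate the two. The cost is visible even here: the vocabulary doubles from 4 to 8 features on two tiny sentences.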

12 TF-IDF

TF-IDF was the most important part of this notebook for me. Bag of Words gives importance based on count, but TF-IDF gives different weights based on how important a word is in a document compared to the full corpus.

The intuition is simple: a word should get more weight if it appears often in one document but is not common in every document.

Term Frequency

  • how often a term appears in a document
  • higher count means stronger local signal
  • calculated inside one document

Inverse Document Frequency

  • checks how common the term is across documents
  • rare words get more importance
  • very common words get less importance

Python
from sklearn.feature_extraction.text import TfidfVectorizer

tfidf = TfidfVectorizer()
tfidf_matrix = tfidf.fit_transform(corpus)

print(tfidf.get_feature_names_out())
print(tfidf_matrix.toarray())

TF-IDF is widely used in information retrieval and keyword extraction because it can highlight words that are more informative for a document.
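The weighting can also be worked out by hand. The sketch below assumes raw counts for term frequency, the smoothed formula idf(t) = ln((1 + n) / (1 + df(t))) + 1, and L2 normalization of each row, which, as far as I know, is what scikit-learn documents as the TfidfVectorizer defaults. Tokenization is a plain lowercase split, so treat this as an illustration rather than a drop-in replacement.

```python
import math
from collections import Counter

corpus = [
    "NLP is fun and exciting",
    "Machines understand NLP and text",
    "Text processing is part of NLP",
]

tokenized = [doc.lower().split() for doc in corpus]
vocabulary = sorted({word for doc in tokenized for word in doc})
n_docs = len(tokenized)

# Document frequency: in how many documents does each word appear?
df = {word: sum(1 for doc in tokenized if word in doc) for word in vocabulary}

# Smoothed inverse document frequency (assumed formula, see the lead-in).
idf = {word: math.log((1 + n_docs) / (1 + df[word])) + 1 for word in vocabulary}

def tfidf_row(tokens):
    """Term count times idf for each vocabulary word, scaled to unit length."""
    tf = Counter(tokens)
    raw = [tf[word] * idf[word] for word in vocabulary]
    norm = math.sqrt(sum(value * value for value in raw))
    return [value / norm for value in raw]

tfidf_rows = [tfidf_row(doc) for doc in tokenized]
first_doc = dict(zip(vocabulary, tfidf_rows[0]))

for word in ("fun", "is", "nlp"):
    print(word, "df =", df[word], "weight =", round(first_doc[word], 3))
```

In the first document, "fun" appears in one document, "is" in two and "nlp" in all three, so the weights fall in that order even though each word occurs exactly once in the sentence.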

Figure: TF-IDF (term frequency and inverse document frequency) gives higher importance to words that are meaningful in one document but not common across every document.

13 CountVectorizer vs TfidfVectorizer

After implementing both, I understood the difference more clearly. CountVectorizer counts words. TfidfVectorizer assigns weights based on importance.

Method          | What It Stores              | When It Helps
CountVectorizer | word count or word presence | simple classification, baseline models, frequency-based features
TfidfVectorizer | weighted word importance    | information retrieval, keyword extraction, search-style tasks

My takeaway: CountVectorizer asks how many times a word appears. TF-IDF asks how important that word is in this document compared to the corpus.

14 Custom Features

I also learned that feature extraction is not limited to built-in methods. Sometimes we can create custom features based on the domain.

For example, in sentiment analysis, we can count positive words and negative words manually. This is a heuristic feature. It may not be perfect, but it is useful for understanding how feature engineering works.

Python
from nltk import word_tokenize

# Small hand-picked word lists; a real project would use a larger lexicon.
positive_words = {"happy", "good", "great", "excellent", "love", "amazing"}
negative_words = {"sad", "bad", "terrible", "hate", "awful", "worst"}

def sentiment_features(text):
    # Lowercase first so "Good" and "good" match the same lexicon entry.
    tokens = word_tokenize(text.lower())
    pos_count = sum(1 for token in tokens if token in positive_words)
    neg_count = sum(1 for token in tokens if token in negative_words)
    return {"positive_count": pos_count, "negative_count": neg_count}

This showed me that feature extraction can be automatic, like CountVectorizer, or manual, like counting positive and negative words based on domain knowledge.
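Here is a usage sketch of the same idea that runs without NLTK. It swaps word_tokenize for a simple regex tokenizer and adds a token_count feature. Both changes, the uppercase constant names and the two sample reviews are assumptions for illustration. It also shows why this heuristic "may not be perfect".

```python
import re

POSITIVE_WORDS = {"happy", "good", "great", "excellent", "love", "amazing"}
NEGATIVE_WORDS = {"sad", "bad", "terrible", "hate", "awful", "worst"}

def sentiment_features(text):
    """Count lexicon hits and total tokens using a regex tokenizer."""
    tokens = re.findall(r"\w+", text.lower())
    return {
        "positive_count": sum(1 for token in tokens if token in POSITIVE_WORDS),
        "negative_count": sum(1 for token in tokens if token in NEGATIVE_WORDS),
        "token_count": len(tokens),
    }

# Hypothetical reviews.
reviews = ["I love this great movie", "The plot was not good"]
rows = [sentiment_features(review) for review in reviews]

for review, row in zip(reviews, rows):
    print(review, "->", row)
```

The second review is negative, yet it scores one positive word and zero negative words because the counter sees "good" and ignores "not". That is the same word-order limitation as in Bag of Words.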

15 Keyword Extraction Using TF-IDF

One practical task in the notebook was keyword extraction. The idea is to use TF-IDF values to identify important words from a document.

Python
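import numpy as np

# Reuses `tfidf` and `tfidf_matrix` from the TfidfVectorizer example above.
feature_array = tfidf.get_feature_names_out()
doc_scores = tfidf_matrix[0].toarray().ravel()  # TF-IDF weights of the first document

# Sort term indices by descending weight and keep up to 5 terms with a non-zero weight.
top_idx = np.argsort(doc_scores)[::-1][:5]
keywords = [feature_array[i] for i in top_idx if doc_scores[i] > 0]

print("Top keywords:", keywords)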

This is where I saw TF-IDF as more than a formula. It can directly help in search, ranking, summarizing important words, and extracting keywords from text.

Important correction: in a real project, I should calculate keyword importance from the correct TF-IDF matrix, not accidentally reuse a matrix from another method.

16 Limitations of Count-Based Features

By the end of this topic, I understood why simple feature extraction is useful but limited. Bag of Words, n-grams and TF-IDF are strong baselines, but they still struggle with meaning.

  • Sparse vectors: most values are zero because each document contains only a small part of the full vocabulary.
  • Out-of-vocabulary issue: if a new word was not seen during vocabulary building, it may be ignored during prediction.
  • Weak semantic meaning: similar words like "movie" and "film" are treated as separate features unless the model learns that pattern from data.
  • Limited context: n-grams add some context, but they still cannot deeply understand sentence meaning.
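The first two limitations are easy to reproduce with a fixed vocabulary. The sketch below is plain Python with whitespace tokenization. The vectorize helper and the new sentence are made up for this example. The behaviour mirrors what a fitted count-based vectorizer does with words it has never seen.

```python
from collections import Counter

train_docs = [
    "NLP is fun and exciting",
    "Machines understand NLP and text",
    "Text processing is part of NLP",
]

# The vocabulary is frozen after looking at the training documents.
vocabulary = sorted({word for doc in train_docs for word in doc.lower().split()})

def vectorize(text):
    """Count only the words that already exist in the vocabulary."""
    counts = Counter(text.lower().split())
    return [counts[word] for word in vocabulary]

# Hypothetical new document containing a word that was never seen.
new_doc = "Transformers understand NLP"
new_vector = vectorize(new_doc)

ignored = [word for word in new_doc.lower().split() if word not in vocabulary]
zero_fraction = new_vector.count(0) / len(new_vector)

print(new_vector)
print("ignored:", ignored)
print("share of zeros:", round(zero_fraction, 2))
```

The word "transformers" has no column, so it leaves no trace in the vector, and 9 of the 11 positions are zero even with this tiny vocabulary.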

These limitations naturally lead to the next topic: word and sentence embeddings. Embeddings try to represent words and sentences as dense vectors that capture meaning better than sparse count-based methods.

17 My Final Understanding

My final understanding is that feature extraction is the point where NLP starts becoming numerical. Before this, we were collecting and cleaning text. Now we are turning that cleaned text into features.

I also understood that traditional feature extraction methods are not useless just because embeddings exist. They are simple, interpretable, fast, and still useful for many baseline NLP models.

  • Start simple: word count and frequency distribution help build intuition before vectorization.
  • Use Bag of Words as a baseline: it is easy to implement and works well for many beginner text classification tasks.
  • Use n-grams when short phrases matter: they preserve small word sequences and help capture simple context.
  • Use TF-IDF when importance matters: it highlights useful words better than raw count in many search and retrieval tasks.

18 GitHub Notebook Connection

This blog explains what I understood from the simple feature extraction notebooks. The implementation side is connected to my NLP by Vinod GitHub repository.

Notebook references: 2_01_what_is_FE.ipynb, 2_02_word_count.ipynb, and 2_03_freq_distribution_FE.ipynb.

19 What Comes Next in the NLP Journey

The next topic is Word and Sentence Embeddings in NLP. This will move beyond sparse count-based features and introduce dense vector representations.

  • Word embeddings: represent words as dense vectors that can capture semantic similarity better than word counts.
  • Sentence embeddings: represent full sentences in vector form so sentence meaning can be compared more directly.
  • From features to meaning: this becomes the bridge from traditional NLP to deep learning and transformer-based NLP.
