Feature Extraction in NLP - From Clean Text to Count-Based Features
After text preprocessing, I learned how cleaned words are converted into numbers using word count, frequency distribution, one-hot encoding, Bag of Words, n-grams and TF-IDF.
Feature extraction in NLP is the step where cleaned text is converted into numerical features. This is needed because computers and machine learning models do not directly understand raw words the way humans do. A model expects numbers, so we represent words, sentences, or documents using numerical values.
In the previous post on text preprocessing in NLP, I cleaned raw text using steps like lowercasing, punctuation handling, stopword removal, tokenization, stemming and lemmatization. But after cleaning, the text is still text. The next question is simple: how do we give this text to a model?
That is where feature extraction started making sense to me. We are not just writing words as numbers randomly. We are trying to extract useful signals from text and represent them in a form that a machine learning algorithm can process.
Text preprocessing makes text clean. Feature extraction makes that cleaned text usable for models.
01 Where Feature Extraction Fits in the NLP Pipeline
I understood feature extraction better when I placed it in the full NLP pipeline. First we collect data. Then we clean it. Then we represent it as numbers. After that, we can use heuristic rules, machine learning models, or deep learning models.
This also helped me understand why feature extraction is called text representation. We are representing text in another form. The original sentence may be human-readable, but the extracted feature vector is model-readable.
02 Approaches After Feature Extraction
In my notes, I also connected feature extraction with different ways of solving NLP tasks. Once text becomes numerical, we can use different approaches depending on the problem.
Heuristic Approach
- uses rules created manually
- good for simple logic
- easy to explain
- can fail when language becomes complex
Machine Learning Approach
- uses extracted features as input
- learns patterns from labeled data
- works with BoW and TF-IDF
- used for classification and prediction
Deep Learning Approach
- can learn richer representations
- uses embeddings and neural networks
- captures more context than simple counts
- connects to the next topic
Why This Matters
- model quality depends on features
- bad features can weaken good models
- simple features are still useful
- embeddings capture semantic meaning better
This is why I do not want to skip traditional feature extraction. Even though embeddings and transformers are powerful, simple features like word count, Bag of Words and TF-IDF explain the foundation very clearly.
03 Important Terms I Had to Understand First
Before jumping into code, I had to understand a few basic terms. These terms appear again and again in feature extraction.
| Term | Meaning | How I Think About It |
|---|---|---|
| Corpus | collection of all text data | all sentences or documents together |
| Document | one text sample | one sentence, paragraph, review or article |
| Vocabulary | unique words from the corpus | the list of words used to create columns |
| Feature | numerical signal extracted from text | a value that the model can use |
| Vector | list of numbers representing text | model-readable version of a document |
The most important term here for me was vocabulary. Once the vocabulary is created, each document can be represented based on those vocabulary words.
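To make the terms concrete, here is a minimal sketch of how a vocabulary can be built from a corpus. The two-sentence corpus and the names `documents` and `word_to_index` are only assumptions for illustration, not something taken from the notebook.

```python
# Hypothetical two-document corpus, used only for illustration.
corpus = ["the cat sat", "the dog sat down"]

# Each document is one text sample; tokens come from a plain split().
documents = [doc.lower().split() for doc in corpus]

# The vocabulary is the sorted set of unique words across the whole corpus.
vocabulary = sorted({word for doc in documents for word in doc})

# Map each vocabulary word to a fixed column index.
word_to_index = {word: i for i, word in enumerate(vocabulary)}

print(vocabulary)     # ['cat', 'dog', 'down', 'sat', 'the']
print(word_to_index)
```

Every later method in this post reuses this idea: the vocabulary decides the columns, and each document fills those columns with numbers.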
04 Word Count as the First Simple Feature
The first small feature I tried was word count. This is not a complete text representation method like TF-IDF, but it helped me understand feature extraction at the simplest level.
If a sentence is very short, very long, or has repeated words, those are also signals. In some NLP tasks, length itself can become a feature.
text = "Natural Language Processing is a fascinating domain!"
words = text.split()
print(words)
print("Word count:", len(words))
The basic split() method is easy to understand, but it only splits on whitespace. Punctuation stays attached to the neighbouring word, so Hello! remains a single token, and a standalone symbol such as a dash is counted as a word. Both effects can distort counts.
05 Regex, NLTK and spaCy for Better Counting
After using split(), I tested regex, NLTK and spaCy. This connected feature extraction back to the earlier tokenization topic. Better tokenization gives better counting.
Regex word count
import re

# Match runs of word characters (letters, digits, underscore).
pattern = re.compile(r'\w+')

def word_count_regex(text):
    words = pattern.findall(text)
    print(words)
    print("Word count:", len(words))
Regex performed better than basic split for punctuation-heavy examples because it can extract word-like patterns instead of only separating by spaces.
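A small side-by-side comparison shows the difference. The sample sentence is my own assumption, chosen because it contains attached punctuation and a standalone dash.

```python
import re

# Hypothetical sentence with attached punctuation and a standalone dash.
text = "Hello! NLP is fun - really fun."

split_tokens = text.split()
regex_tokens = re.findall(r"\w+", text)

print(split_tokens)  # ['Hello!', 'NLP', 'is', 'fun', '-', 'really', 'fun.']
print(regex_tokens)  # ['Hello', 'NLP', 'is', 'fun', 'really', 'fun']
print(len(split_tokens), len(regex_tokens))  # 7 6
```

The regex version is not perfect either: `\w+` does not match an apostrophe, so a word like don't is split into two tokens.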
NLTK word count
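# Requires the NLTK tokenizer data, e.g. nltk.download("punkt")
# (newer NLTK releases may ask for "punkt_tab" instead).
from nltk import word_tokenize, sent_tokenize

def word_count_nltk(text):
    # Note: word_tokenize returns punctuation marks as separate tokens,
    # so they are included in this count.
    tokens = word_tokenize(text)
    print(tokens)
    print("Word count:", len(tokens))

def sentence_count_nltk(text):
    sentences = sent_tokenize(text)
    print(sentences)
    print("Sentence count:", len(sentences))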
spaCy word count
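import spacy

# Requires the small English model: python -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm")

def word_count_spacy(text):
    doc = nlp(text)
    # Keep every token that spaCy does not flag as punctuation.
    tokens = [token.text for token in doc if not token.is_punct]
    print(tokens)
    print("Word count:", len(tokens))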
06 Frequency Distribution
After word count, I learned frequency distribution. This means counting how many times each word appears in the text. This is the first point where I felt that text was becoming a numerical pattern.
Python's Counter made this very clear. It takes tokens and counts repeated words.
from collections import Counter

def freq_count(text):
    words = text.lower().split()
    return Counter(words)

freq_count("NLP is fun. NLP is useful.")
But again, punctuation can disturb the result. That is why I also tried spaCy to remove punctuation before counting.
from collections import Counter
import spacy

nlp = spacy.load("en_core_web_sm")

def freq_count_spacy(text):
    doc = nlp(text.lower())
    words = [token.text for token in doc if not token.is_punct]
    return Counter(words)
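If spaCy is not installed, the same punctuation clean-up can be sketched with the standard library alone. The function name `freq_count_regex` is my own assumption, and it swaps in a regex tokenizer instead of spaCy.

```python
import re
from collections import Counter

def freq_count_regex(text):
    """Count lowercase word-like tokens, ignoring punctuation."""
    words = re.findall(r"\w+", text.lower())
    return Counter(words)

text = "NLP is fun. NLP is useful."

# With split(), the trailing full stop stays attached: 'fun.' and 'useful.'
print(Counter(text.lower().split()))

# With the regex tokenizer, the keys are clean words.
print(freq_count_regex(text))

# most_common(k) returns the k most frequent words with their counts.
print(freq_count_regex(text).most_common(2))  # [('nlp', 2), ('is', 2)]
```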
07 One-Hot Encoding
One-hot encoding represents each word as a binary vector whose length equals the vocabulary size. The position that belongs to the word is set to 1 and every other position stays 0.
In my notebook, I built a small corpus and created vocabulary manually. This helped me see how each word gets its own position.
corpus = [
    "NLP is fun and exciting",
    "Machines understand NLP and text",
    "Text processing is part of NLP"
]
One-hot encoding is intuitive, but it has major problems. If the vocabulary has 50,000 words, each word vector has 50,000 entries and only one of them is 1. This is called sparsity.
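The manual vocabulary and the one-hot vectors can be sketched like this. It is a minimal version under my own assumptions (lowercasing, splitting on spaces, and the helper name `one_hot`), not necessarily the exact notebook code.

```python
corpus = [
    "NLP is fun and exciting",
    "Machines understand NLP and text",
    "Text processing is part of NLP",
]

# Build the vocabulary by hand: lowercase, split on spaces, keep unique words.
vocabulary = sorted({word for doc in corpus for word in doc.lower().split()})
word_to_index = {word: i for i, word in enumerate(vocabulary)}

def one_hot(word):
    """Return a binary vector with a single 1 at the word's vocabulary index."""
    vector = [0] * len(vocabulary)
    vector[word_to_index[word]] = 1
    return vector

# A document becomes a sequence of one-hot vectors, one per word.
encoded_doc = [one_hot(word) for word in corpus[0].lower().split()]

print(vocabulary)
print(one_hot("nlp"))  # [0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0]
print(len(encoded_doc), "vectors of length", len(vocabulary))  # 5 vectors of length 11
```

Even with this tiny corpus, each vector has ten zeros and a single one, which shows how quickly the representation becomes sparse.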
Pros
- easy to understand
- easy to implement
- shows vocabulary position clearly
Cons
- very sparse vectors
- large vocabulary creates huge vectors
- a document becomes a sequence of word vectors, so its size changes with document length
- semantic meaning is not captured
This is where I understood why one-hot encoding is useful for learning but not always practical for real NLP pipelines.
08 Bag of Words
Bag of Words was the first proper text representation method that felt useful for machine learning. The idea is simple: build a vocabulary from the corpus, then represent each document using word counts.
In Bag of Words, word order is ignored. The model only sees how many times each vocabulary word appears in a document.
from sklearn.feature_extraction.text import CountVectorizer
corpus = [
    "NLP is fun and exciting",
    "Machines understand NLP and text",
    "Text processing is part of NLP"
]
cv = CountVectorizer()
bow = cv.fit_transform(corpus)
print(cv.vocabulary_)
print(cv.get_feature_names_out())
print(bow.toarray())
The output matrix is a document-term matrix. Rows represent documents. Columns represent vocabulary words. Values represent word counts.
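To see what sits inside that matrix, here is a rough pure-Python sketch of the same idea. The `tokenize` and `bow_vector` helpers are my own names. The token rule (lowercase, two or more word characters) is meant to approximate CountVectorizer's defaults as I understand them, so treat it as an assumption rather than the library's exact code.

```python
import re
from collections import Counter

corpus = [
    "NLP is fun and exciting",
    "Machines understand NLP and text",
    "Text processing is part of NLP",
]

def tokenize(text):
    # Lowercase and keep tokens of two or more word characters.
    return re.findall(r"\b\w\w+\b", text.lower())

# Columns of the document-term matrix, in alphabetical order.
vocabulary = sorted({tok for doc in corpus for tok in tokenize(doc)})

def bow_vector(text):
    """Count how often each vocabulary word appears in the text."""
    counts = Counter(tokenize(text))
    return [counts[word] for word in vocabulary]

matrix = [bow_vector(doc) for doc in corpus]

print(vocabulary)
for row in matrix:
    print(row)
# [1, 1, 1, 1, 0, 1, 0, 0, 0, 0, 0]
# [1, 0, 0, 0, 1, 1, 0, 0, 0, 1, 1]
# [0, 0, 0, 1, 0, 1, 1, 1, 1, 1, 0]
```

Each row has the same length as the vocabulary, no matter how long the original sentence was.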
09 Binary Bag of Words
Binary Bag of Words is a small variation of Bag of Words. Instead of counting how many times a word appears, it only checks whether the word is present or not.
If a word appears one time or five times, the value becomes 1. If the word does not appear, the value becomes 0.
from sklearn.feature_extraction.text import CountVectorizer
cv_binary = CountVectorizer(binary=True)
binary_bow = cv_binary.fit_transform(corpus)
binary_bow.toarray()
This can be useful in sentiment analysis, where sometimes the presence of a word matters more than its repeated count.
For example, if a review contains the word excellent, its presence itself can be a useful signal even if it appears only once.
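The difference between count and binary values is easy to see on a single review. The review text and the four-word vocabulary here are hypothetical, picked only for illustration.

```python
from collections import Counter

# Hypothetical review and vocabulary, used only for illustration.
tokens = "excellent food excellent service excellent".split()
vocabulary = ["bad", "excellent", "food", "service"]

counts = Counter(tokens)

# Count-based Bag of Words: how many times each word appears.
count_vector = [counts[word] for word in vocabulary]

# Binary Bag of Words: 1 if the word appears at all, otherwise 0.
binary_vector = [1 if counts[word] > 0 else 0 for word in vocabulary]

print(count_vector)   # [0, 3, 1, 1]
print(binary_vector)  # [0, 1, 1, 1]
```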
10 Advantages and Limitations of Bag of Words
Bag of Words is simple and useful, but it is not perfect. I liked it because it is intuitive, but its limitations also became clear quickly.
Advantages
- very intuitive
- easy to implement with CountVectorizer
- works well for many text classification tasks
- creates fixed-size feature vectors
Limitations
- creates sparse vectors
- ignores word order
- does not understand meaning deeply
- words not seen when the vocabulary was built are ignored at transform time
The biggest problem for me was word order. For example, You are good and You are not good can look too similar in a basic Bag of Words representation, even though their meaning is different.
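A quick sketch makes this visible. The helper `bow` and the shuffled sentence are my own additions for illustration.

```python
from collections import Counter

a = "you are good"
b = "you are not good"

# Shared vocabulary built from both sentences.
vocabulary = sorted(set(a.split()) | set(b.split()))

def bow(text):
    """Return word counts for the text in vocabulary order."""
    counts = Counter(text.split())
    return [counts[word] for word in vocabulary]

print(vocabulary)  # ['are', 'good', 'not', 'you']
print(bow(a))      # [1, 1, 0, 1]
print(bow(b))      # [1, 1, 1, 1]

# Shuffling the words gives exactly the same vector as the original sentence.
print(bow("good you are") == bow(a))  # True
```

The two sentences differ in only one column, and any reordering of the same words is indistinguishable from the original.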
11 N-grams
N-grams solve part of the word order problem by using sequences of words instead of only single words. A unigram uses one word. A bigram uses two words together. A trigram uses three words together.
This helped me understand how phrases can become features. Instead of only seeing not and good separately, the model can see not good as one feature.
from sklearn.feature_extraction.text import CountVectorizer
cv_bigram = CountVectorizer(ngram_range=(2, 2))
bigram_matrix = cv_bigram.fit_transform(corpus)
print(cv_bigram.get_feature_names_out())
cv_unigram_bigram = CountVectorizer(ngram_range=(1, 2))
matrix = cv_unigram_bigram.fit_transform(corpus)
print(cv_unigram_bigram.get_feature_names_out())
| ngram_range | Meaning | Use |
|---|---|---|
| (1, 1) | only unigrams | single word features |
| (2, 2) | only bigrams | two-word phrase features |
| (1, 2) | unigrams and bigrams | single words plus short phrases |
| (1, 3) | unigrams, bigrams and trigrams | more context but larger feature space |
N-grams are more context-aware than simple Bag of Words, but they also increase the number of features. If the dataset is large, the matrix can become very large and slow to process.
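The growth in features can be checked with a small sketch. The helpers `ngrams` and `feature_count` are my own names, and the tokenization is a plain lowercase split, so the numbers are for this tiny corpus only.

```python
def ngrams(tokens, n):
    """Return all runs of n consecutive tokens, joined by a space."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

print(ngrams("you are not good".split(), 2))
# ['you are', 'are not', 'not good']

corpus = [
    "NLP is fun and exciting",
    "Machines understand NLP and text",
    "Text processing is part of NLP",
]

def feature_count(min_n, max_n):
    """Count unique n-gram features for every n in [min_n, max_n]."""
    features = set()
    for doc in corpus:
        tokens = doc.lower().split()
        for n in range(min_n, max_n + 1):
            features.update(ngrams(tokens, n))
    return len(features)

for ngram_range in [(1, 1), (1, 2), (1, 3)]:
    print(ngram_range, feature_count(*ngram_range))
# (1, 1) 11
# (1, 2) 24
# (1, 3) 34
```

Three short sentences already triple the number of columns when trigrams are included.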
12 TF-IDF
TF-IDF was the most important part of this notebook for me. Bag of Words gives importance based on count, but TF-IDF gives different weights based on how important a word is in a document compared to the full corpus.
The intuition is simple: a word should get more weight if it appears often in one document but is not common in every document.
Term Frequency
- how often a term appears in a document
- higher count means stronger local signal
- calculated inside one document
Inverse Document Frequency
- checks how common the term is across documents
- rare words get more importance
- very common words get less importance
from sklearn.feature_extraction.text import TfidfVectorizer
tfidf = TfidfVectorizer()
tfidf_matrix = tfidf.fit_transform(corpus)
print(tfidf.get_feature_names_out())
print(tfidf_matrix.toarray())
TF-IDF is widely used in information retrieval and keyword extraction because it can highlight words that are more informative for a document.
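To connect the intuition with actual numbers, here is a hand-rolled sketch. It assumes the smoothed IDF variant, idf = ln((1 + n) / (1 + df)) + 1, followed by L2 normalization, which is how I understand TfidfVectorizer's defaults. Other TF-IDF variants exist, so the exact values depend on that assumption. The helper name `tfidf_vector` is mine.

```python
import math
from collections import Counter

corpus = [
    "NLP is fun and exciting",
    "Machines understand NLP and text",
    "Text processing is part of NLP",
]

docs = [doc.lower().split() for doc in corpus]
vocabulary = sorted({word for doc in docs for word in doc})
n_docs = len(docs)

# Document frequency: in how many documents does each word appear?
df = {word: sum(1 for doc in docs if word in doc) for word in vocabulary}

# Smoothed inverse document frequency (assumed variant, see lead-in).
idf = {word: math.log((1 + n_docs) / (1 + df[word])) + 1 for word in vocabulary}

def tfidf_vector(tokens):
    """Return the L2-normalized TF-IDF vector for one tokenized document."""
    tf = Counter(tokens)
    raw = [tf[word] * idf[word] for word in vocabulary]
    norm = math.sqrt(sum(value * value for value in raw))
    return [value / norm if norm else 0.0 for value in raw]

weights = dict(zip(vocabulary, tfidf_vector(docs[0])))
for word in docs[0]:
    print(f"{word:10s} {weights[word]:.3f}")
```

In the first document, nlp gets the lowest weight because it appears in all three documents, while fun and exciting get the highest because they appear only there.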
13 CountVectorizer vs TfidfVectorizer
After implementing both, I understood the difference more clearly. CountVectorizer counts words. TfidfVectorizer assigns weights based on importance.
| Method | What It Stores | When It Helps |
|---|---|---|
| CountVectorizer | word count or word presence | simple classification, baseline models, frequency-based features |
| TfidfVectorizer | weighted word importance | information retrieval, keyword extraction, search-style tasks |
14 Custom Features
I also learned that feature extraction is not limited to built-in methods. Sometimes we can create custom features based on the domain.
For example, in sentiment analysis, we can count positive words and negative words manually. This is a heuristic feature. It may not be perfect, but it is useful for understanding how feature engineering works.
from nltk import word_tokenize

positive_words = {"happy", "good", "great", "excellent", "love", "amazing"}
negative_words = {"sad", "bad", "terrible", "hate", "awful", "worst"}

def sentiment_features(text):
    tokens = word_tokenize(text.lower())
    pos_count = sum(1 for token in tokens if token in positive_words)
    neg_count = sum(1 for token in tokens if token in negative_words)
    return {"positive_count": pos_count, "negative_count": neg_count}
This showed me that feature extraction can be automatic, like CountVectorizer, or manual, like counting positive and negative words based on domain knowledge.
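Custom features can also be packed into a fixed-order vector so that a model can consume them. This sketch swaps NLTK for a regex tokenizer so it runs without extra downloads, and it adds word count as a third feature. The function name `sentiment_vector` and the sample review are my own assumptions.

```python
import re

positive_words = {"happy", "good", "great", "excellent", "love", "amazing"}
negative_words = {"sad", "bad", "terrible", "hate", "awful", "worst"}

def sentiment_vector(text):
    """Return [positive_count, negative_count, word_count] for the text."""
    tokens = re.findall(r"\w+", text.lower())
    pos_count = sum(1 for token in tokens if token in positive_words)
    neg_count = sum(1 for token in tokens if token in negative_words)
    return [pos_count, neg_count, len(tokens)]

# Hypothetical review, used only for illustration.
review = "I love this phone, the camera is great but the battery is bad."
print(sentiment_vector(review))  # [2, 1, 13]
```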
15 Keyword Extraction Using TF-IDF
One practical task in the notebook was keyword extraction. The idea is to use TF-IDF values to identify important words from a document.
import numpy as np

# Reuses tfidf and tfidf_matrix from the TF-IDF section.
feature_array = np.array(tfidf.get_feature_names_out())
tfidf_values = tfidf_matrix.toarray()

# Sort the first document's column indices by descending TF-IDF weight.
importance = np.argsort(tfidf_values[0])[::-1]
keywords = feature_array[importance[:5]]
print("Top keywords:", keywords)
This is where I saw TF-IDF as more than a formula. It can directly help in search, ranking, summarizing important words, and extracting keywords from text.
16 Limitations of Count-Based Features
By the end of this topic, I understood why simple feature extraction is useful but limited. Bag of Words, n-grams and TF-IDF are strong baselines, but they still struggle with meaning.
These limitations naturally lead to the next topic: word and sentence embeddings. Embeddings try to represent words and sentences as dense vectors that capture meaning better than sparse count-based methods.
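The struggle with meaning can be shown with cosine similarity on plain word counts. The sentences and the helper `cosine` are hypothetical examples of my own.

```python
import math
from collections import Counter

def cosine(text_a, text_b):
    """Cosine similarity between the word-count vectors of two texts."""
    a = Counter(text_a.lower().split())
    b = Counter(text_b.lower().split())
    dot = sum(a[word] * b[word] for word in set(a) | set(b))
    norm_a = math.sqrt(sum(value * value for value in a.values()))
    norm_b = math.sqrt(sum(value * value for value in b.values()))
    return dot / (norm_a * norm_b)

# Similar meaning, no shared words: similarity is zero.
print(round(cosine("good movie", "great film"), 3))  # 0.0

# Opposite meaning, one shared word: similarity is higher.
print(round(cosine("good movie", "bad movie"), 3))   # 0.5
```

For count-based features, good movie looks closer to bad movie than to great film, which is exactly the gap that dense embeddings try to close.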
17 My Final Understanding
My final understanding is that feature extraction is the point where NLP starts becoming numerical. Before this, we were collecting and cleaning text. Now we are turning that cleaned text into features.
I also understood that traditional feature extraction methods are not useless just because embeddings exist. They are simple, interpretable, fast, and still useful for many baseline NLP models.
18 GitHub Notebook Connection
This blog explains what I understood from the simple feature extraction notebooks. The implementation side is connected to my NLP by Vinod GitHub repository.
NLP by Vinod GitHub Repository
Notebook references: 2_01_what_is_FE.ipynb, 2_02_word_count.ipynb, and 2_03_freq_distribution_FE.ipynb.
20 What Comes Next in the NLP Journey
The next topic is Word and Sentence Embeddings in NLP. This will move beyond sparse count-based features and introduce dense vector representations.