Word Embeddings in NLP - Moving Beyond Sparse Features
After count-based feature extraction, I learned why NLP needs dense vectors that can capture similarity, meaning and context better than Bag of Words and TF-IDF.
Word embeddings in NLP are dense numerical representations of words, sentences or documents. In the previous topic, I learned count-based feature extraction methods like Bag of Words, n-grams and TF-IDF. Those methods are useful, but they create sparse vectors and capture very little semantic meaning: two words with the same meaning still end up as unrelated features.
My rough understanding was simple: earlier methods had disadvantages like sparsity, weak semantic meaning and static representation. After working through the embedding notebooks, I started seeing why embeddings became such an important step in NLP. Instead of representing words only by their position in a vocabulary, embeddings try to place similar words closer in a vector space.
For example, words like king and queen, or movie and film, should not behave like completely unrelated words. Count-based features often struggle with this. Embeddings try to solve this by learning dense vectors where similarity has meaning.
Bag of Words counts words. TF-IDF weighs words. Embeddings try to represent meaning.
01 Why Embeddings Were Needed
Traditional feature extraction methods were the right place to start. They helped me understand how text becomes numbers. But their limitations are also clear when the vocabulary becomes large or when meaning matters.
This is why embeddings felt like the natural next topic after feature extraction. They are still numerical representations, but now the goal is not only counting. The goal is to represent meaning in a compact vector form.
02 Sparse Vectors vs Dense Vectors
The first difference I had to understand was sparse representation versus dense representation. Bag of Words and TF-IDF usually create sparse vectors. Embeddings create dense vectors.
| Aspect | Sparse Features | Dense Embeddings |
|---|---|---|
| Example | Bag of Words, TF-IDF | Word2Vec, GloVe, FastText |
| Vector size | usually very large | usually smaller, like 50, 100 or 300 |
| Zeros | mostly zeros | mostly non-zero learned values |
| Meaning | weak semantic meaning | captures similarity better |
| Interpretability | easy to explain | harder to interpret directly |
A sparse vector records whether, or how often, each vocabulary word appears in a document. A dense vector tries to encode learned properties of the word. The values are not manually assigned by us. They are learned from data or loaded from a pretrained model.
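To make the contrast concrete, here is a minimal sketch. The five-word vocabulary and the dense values are made up for illustration (they are assumptions, not learned from any corpus). It only shows that one-hot vectors for two different words never overlap, while dense vectors can express closeness.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length lists of floats."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical vocabulary for the sparse (one-hot) view.
vocab = ["movie", "film", "banana", "king", "queen"]

def one_hot(word):
    """Sparse view: a single 1 at the word's vocabulary position."""
    return [1.0 if w == word else 0.0 for w in vocab]

# Hypothetical 3-dimensional dense vectors (hand-picked, not trained).
dense = {
    "movie": [0.9, 0.1, 0.3],
    "film": [0.8, 0.2, 0.3],
    "banana": [0.0, 0.9, -0.4],
}

sparse_sim = cosine(one_hot("movie"), one_hot("film"))
dense_sim = cosine(dense["movie"], dense["film"])
dense_unrelated = cosine(dense["movie"], dense["banana"])

print(sparse_sim)                   # 0.0: distinct one-hot vectors share no position
print(dense_sim > dense_unrelated)  # True for these hand-picked values
```

With one-hot vectors, movie and film are exactly as unrelated as movie and banana. Dense vectors are what make the similarity score informative.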
03 Word2Vec Intuition
Word2Vec was the first embedding method that made the idea feel practical to me. The basic idea is that words used in similar contexts should get similar vectors.
I learned two important training ideas: CBOW and Skip-gram. In CBOW, the model uses surrounding context words to predict the center word. In Skip-gram, the model uses the center word to predict surrounding context words.
CBOW
- uses context words
- predicts the center word
- usually faster
- works well with frequent words
Skip-gram
- uses the center word
- predicts context words
- can work well for rare words
- captures local context
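The difference between the two training modes is easiest to see in the training examples they generate. The sketch below is an illustration only: the function name `training_pairs` and the sample sentence are assumptions, and real Word2Vec training adds details such as negative sampling that are left out here.

```python
def training_pairs(tokens, window=2):
    """Return (cbow_examples, skipgram_examples) for one tokenized sentence."""
    cbow, skipgram = [], []
    for i, center in enumerate(tokens):
        start = max(0, i - window)
        end = min(len(tokens), i + window + 1)
        context = [tokens[j] for j in range(start, end) if j != i]
        if not context:
            continue
        # CBOW: all context words together predict the center word.
        cbow.append((context, center))
        # Skip-gram: the center word predicts each context word separately.
        skipgram.extend((center, ctx) for ctx in context)
    return cbow, skipgram

tokens = "the king rules the land".split()
cbow, skipgram = training_pairs(tokens, window=1)
print(cbow[1])        # (['the', 'rules'], 'king')
print(skipgram[1:3])  # [('king', 'the'), ('king', 'rules')]
```

CBOW produces one example per position, while Skip-gram produces one example per (center, context) pair. That is one way to see why Skip-gram gives rare words more training signal but takes longer.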
In my pretrained Word2Vec notebook, I loaded the Google News vectors and tested words like king, queen, man, woman, India and others. The most interesting part was checking similarity and analogy-style behavior.
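import gensim.downloader as api
# Downloads the pretrained Google News vectors on first use (a large file).
model = api.load("word2vec-google-news-300")
model["king"].shape                  # (300,)
model.most_similar("king")           # nearest neighbours by cosine similarity
model.similarity("king", "queen")    # cosine similarity of two words
# Analogy: king - man + woman. Querying with a raw vector does not exclude
# the input words, so "king" itself usually ranks first.
vec = model["king"] - model["man"] + model["woman"]
model.most_similar([vec])
# Passing words instead leaves the inputs out of the results.
model.most_similar(positive=["king", "woman"], negative=["man"])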
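The analogy arithmetic can be traced by hand with toy vectors. In this sketch the two dimensions (royalty, maleness) and all values are hypothetical and hand-picked, not taken from the Google News model. The helper `analogy` is also an assumption for illustration.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length lists of floats."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

# Hypothetical 2-d vectors: (royalty, maleness). Hand-picked, not trained.
vectors = {
    "king": [0.9, 0.9],
    "queen": [0.9, 0.1],
    "man": [0.1, 0.9],
    "woman": [0.1, 0.1],
    "apple": [0.0, 0.5],
}

def analogy(a, b, c):
    """Compute a - b + c and return the nearest word that is not an input."""
    target = [x - y + z for x, y, z in zip(vectors[a], vectors[b], vectors[c])]
    candidates = [w for w in vectors if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(vectors[w], target))

print(analogy("king", "man", "woman"))  # queen
```

Subtracting man removes the "maleness" component and adding woman puts the female one back, which lands next to queen. Real embeddings have hundreds of dimensions with no such clean labels, so the effect is approximate.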
04 Training a Custom Word2Vec Model
After using pretrained Word2Vec, I also trained my own Word2Vec model. This helped me understand what happens when the model builds vocabulary from my own corpus.
In the custom notebook, I loaded text files, broke them into sentences, applied simple preprocessing, built vocabulary and then trained a Word2Vec model. I also tested words like king, jon, daenerys, arya and sansa.
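import gensim
# story: a list of tokenized sentences (each a list of words) from the preprocessing step.
model = gensim.models.Word2Vec(
    window=10,
    min_count=2
)
# Scan the corpus once to build the vocabulary, then train on it.
model.build_vocab(story)
model.train(
    story,
    total_examples=model.corpus_count,
    epochs=10
)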
The important parameters for me were window and min_count. Window controls how many surrounding words the model considers. Min count removes words that appear very rarely.
| Parameter | Meaning | How I Understood It |
|---|---|---|
| window | context size | how many nearby words are considered |
| min_count | minimum word frequency | rare words below this count are ignored |
| vector_size | embedding dimension | length of each word vector |
| sg | training mode | 0 for CBOW, 1 for Skip-gram |
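To see what min_count does to the vocabulary, here is a small sketch in plain Python. The function `build_vocab` and the toy corpus are assumptions for illustration. It mimics the frequency filter only, not gensim's actual vocabulary builder.

```python
from collections import Counter

def build_vocab(sentences, min_count=2):
    """Keep only words that occur at least min_count times in the corpus."""
    counts = Counter(word for sentence in sentences for word in sentence)
    return {word for word, n in counts.items() if n >= min_count}

# Hypothetical toy corpus, already tokenized.
story = [
    ["jon", "is", "king"],
    ["arya", "is", "brave"],
    ["jon", "meets", "arya"],
]

print(sorted(build_vocab(story, min_count=2)))  # ['arya', 'is', 'jon']
print(len(build_vocab(story, min_count=1)))     # 6
```

A word that is filtered out gets no vector at all, so looking it up in the trained model fails. On a small corpus, a high min_count can remove most of the vocabulary.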
05 GloVe and Co-occurrence Matrix
GloVe gave me another view of embeddings. Word2Vec learns from prediction tasks. GloVe is based on word co-occurrence statistics. It looks at how often words appear together in a context window.
In the co-occurrence notebook, I built a vocabulary, created a co-occurrence matrix, initialized embedding parameters, trained them using a loss and finally visualized embeddings using PCA.
import numpy as np

def build_cooccurrence_matrix(tokenized_corpus, vocab_size, word2id, window_size=2):
    # Entry [i, j] counts how often word j appears within the window around word i.
    cooccurrence_matrix = np.zeros((vocab_size, vocab_size), dtype=np.float64)
    for sentence in tokenized_corpus:
        sentence_ids = [word2id[word] for word in sentence]
        for i, center_id in enumerate(sentence_ids):
            start = max(0, i - window_size)
            end = min(len(sentence_ids), i + window_size + 1)
            for j in range(start, end):
                if i != j:  # skip the center word itself
                    context_id = sentence_ids[j]
                    cooccurrence_matrix[center_id, context_id] += 1
    return cooccurrence_matrix
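The loss used in the notebook is not shown here. As a reference point, the sketch below follows the weighted least-squares objective from the GloVe paper for a single word pair. The defaults `x_max=100` and `alpha=0.75` are the paper's values, and the parameter values in the example are hypothetical.

```python
import math

def glove_weight(count, x_max=100.0, alpha=0.75):
    """Weighting function f(x): damps rare pairs and caps frequent ones at 1."""
    return (count / x_max) ** alpha if count < x_max else 1.0

def pair_loss(w_i, w_j, b_i, b_j, count):
    """Weighted squared error for one (word, context) pair with a non-zero count."""
    dot = sum(a * b for a, b in zip(w_i, w_j))
    # The model tries to make dot + biases match log(co-occurrence count).
    error = dot + b_i + b_j - math.log(count)
    return glove_weight(count) * error ** 2

# Hypothetical parameters for a single pair that co-occurred 10 times.
w_i, w_j = [0.5, -0.2], [0.4, 0.1]
print(pair_loss(w_i, w_j, b_i=0.0, b_j=0.0, count=10))
```

The total loss is the sum of this term over all pairs with a non-zero count. Pairs that never co-occur are skipped, which also avoids taking the log of zero.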
06 Using Pretrained GloVe for NLP Tasks
In the pretrained GloVe notebook, I used word vectors to create sentence vectors by averaging word embeddings. Then I applied those vectors to practical tasks like task clustering, summary checking, text clustering and semantic search.
This was useful because embeddings stopped feeling like only theory. I could see them being used as input for clustering and similarity-based retrieval.
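import numpy as np
# model: the pretrained GloVe vectors loaded earlier in the notebook.
def get_sentence_vector(sentence):
    # Keep only the words the model knows; unknown words are skipped.
    words = [word for word in sentence.lower().split() if word in model]
    if not words:
        return np.zeros(model.vector_size)
    return np.mean([model[word] for word in words], axis=0)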
Tasks I Tried
- semantic search
- text clustering
- task grouping
- summary accuracy checking
What I Noticed
- average vectors are simple
- unknown words need handling
- similarity uses cosine score
- meaning improves compared to counts
07 FastText and Subword Information
FastText helped me understand one important limitation of Word2Vec. Word2Vec usually treats each word as a complete unit. FastText breaks words into subword pieces or character n-grams.
This matters because similar word forms like play, playing and played can share subword information. It also helps with rare words and some out-of-vocabulary cases.
from gensim.models import FastText
cbow_model = FastText(
    sentences=tokenized,
    vector_size=100,
    window=5,
    min_count=1,
    sg=0,
    epochs=100
)
cbow_model.wv["learning"][:10]
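The subword idea can be illustrated without any library. The sketch below extracts character n-grams from a word wrapped in boundary markers. The function name `char_ngrams` is an assumption, and the n-gram range of 3 to 6 matches gensim's default `min_n` and `max_n`. Real FastText additionally hashes the n-grams into a fixed number of buckets, which is left out here.

```python
def char_ngrams(word, min_n=3, max_n=6):
    """Character n-grams of a word wrapped in < and > boundary markers."""
    wrapped = f"<{word}>"
    grams = set()
    for n in range(min_n, max_n + 1):
        for i in range(len(wrapped) - n + 1):
            grams.add(wrapped[i:i + n])
    return grams

play, playing = char_ngrams("play"), char_ngrams("playing")
print(sorted(char_ngrams("play", min_n=3, max_n=3)))  # ['<pl', 'ay>', 'lay', 'pla']
print(sorted(play & playing))  # n-grams shared by both word forms
```

Because a word vector is built from its n-gram vectors, play and playing share part of their representation. The same mechanism lets the model compose a vector for a word it never saw during training.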
08 Sentence Embeddings
Word embeddings represent individual words, but many NLP tasks need sentence-level meaning. That is where sentence embeddings come in. A sentence embedding is a fixed-length numerical representation of a full sentence or document.
I tried average sentence embeddings first. The idea is simple: take all word vectors in a sentence and calculate their average.
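import spacy
import numpy as np

# en_core_web_sm ships without static word vectors, so a model that
# includes them (en_core_web_md or en_core_web_lg) is loaded here.
nlp = spacy.load("en_core_web_md")

def average_embedding(sentence):
    doc = nlp(sentence)
    vectors = []
    for token in doc:
        if token.has_vector:
            vectors.append(token.vector)
    if vectors:
        return np.mean(vectors, axis=0)
    # No known words: fall back to a zero vector of the same length.
    return np.zeros(nlp.vocab.vectors_length)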
Average embeddings are simple and fast, but they treat all words equally. In many cases, not every word should have the same importance.
TF-IDF Weighted Sentence Embedding
To improve simple averaging, I also tried TF-IDF weighted sentence embeddings. Here, important words get more weight while common words get lower influence.
from sklearn.feature_extraction.text import TfidfVectorizer

tfidf = TfidfVectorizer()
tfidf.fit(corpus)
# Map each vocabulary word to its column index once, instead of searching per token.
vocab_index = tfidf.vocabulary_

def tfidf_weighted_embedding(sentence):
    tfidf_scores = tfidf.transform([sentence]).toarray()[0]
    doc = nlp(sentence)
    vectors = []
    weights = []
    for token in doc:
        word = token.text.lower()
        if word in vocab_index and token.has_vector:
            vectors.append(token.vector)
            weights.append(tfidf_scores[vocab_index[word]])
    # np.average raises ZeroDivisionError when the weights sum to zero.
    if vectors and sum(weights) > 0:
        return np.average(vectors, axis=0, weights=weights)
    return np.zeros(nlp.vocab.vectors_length)
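A tiny numeric example shows how the weights change the result. The two word vectors and the TF-IDF scores below are hypothetical values chosen so the arithmetic is easy to follow. The helper `weighted_average` is a plain-Python stand-in for `np.average`.

```python
def weighted_average(vectors, weights):
    """Weighted mean of equal-length vectors; weights must not sum to zero."""
    total = sum(weights)
    dims = len(vectors[0])
    return [sum(w * v[d] for v, w in zip(vectors, weights)) / total for d in range(dims)]

# Hypothetical 2-d word vectors and TF-IDF scores for the sentence "the dragon".
vectors = [[1.0, 0.0], [0.0, 1.0]]  # "the", "dragon"
tfidf_weights = [0.2, 0.8]

plain = weighted_average(vectors, [1.0, 1.0])
weighted = weighted_average(vectors, tfidf_weights)
print(plain)     # [0.5, 0.5]
print(weighted)  # [0.2, 0.8]
```

With plain averaging the common word the pulls the sentence vector as much as dragon does. With TF-IDF weights the sentence vector moves toward the rarer, more informative word.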
09 Semantic Search Using Embeddings
One practical task I liked was semantic search. Instead of matching exact words, we convert the query and documents into vectors, then compare them using cosine similarity.
This is a big jump from keyword search. A query like visualizing data can match a document like Data visualization techniques even if the exact wording is not the same.
from sklearn.metrics.pairwise import cosine_similarity
query = "visualizing data"
query_vec = get_sentence_vector(query)
scores = []
for doc in corpus:
    doc_vec = get_sentence_vector(doc)
    score = cosine_similarity(
        [query_vec],
        [doc_vec]
    )[0][0]
    scores.append(score)
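Once every document has a score, the search result is simply the documents sorted by that score. The sketch below shows this last step. The helper `top_k`, the three documents and the score values are all hypothetical and stand in for real cosine similarities.

```python
def top_k(scores, documents, k=3):
    """Return the k (score, document) pairs with the highest similarity."""
    ranked = sorted(zip(scores, documents), key=lambda pair: pair[0], reverse=True)
    return ranked[:k]

# Hypothetical documents and scores, standing in for real cosine values.
corpus = [
    "Data visualization techniques",
    "Cooking pasta at home",
    "Plotting charts in Python",
]
scores = [0.82, 0.11, 0.64]

for score, doc in top_k(scores, corpus, k=2):
    print(f"{score:.2f}  {doc}")
```

For a small corpus, sorting all scores like this is enough. Large collections usually need an approximate nearest-neighbour index instead of comparing the query with every document.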
10 PCA, SVD and Visualizing Embeddings
Embedding vectors can have many dimensions, like 50, 100 or 300. Humans cannot directly visualize those dimensions. That is why PCA and SVD are useful. They reduce the dimensions so embeddings can be plotted in 2D or 3D.
In the notebook, I used PCA to reduce sentence embeddings and word embeddings. This helped me see that embeddings are not just abstract arrays. They can be projected and visualized to observe clusters or similarity patterns.
from sklearn.decomposition import PCA
sentence_embeddings = np.array(
    [average_embedding(sent) for sent in sentences]
)
pca = PCA(n_components=2)
reduced_embeddings = pca.fit_transform(sentence_embeddings)
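PCA and SVD are closely related: PCA can be computed by centering the data and taking its SVD. The sketch below shows that connection with NumPy only. The function name `pca_via_svd` and the random input are assumptions for illustration, and the result should match scikit-learn's PCA only up to the sign of each component.

```python
import numpy as np

def pca_via_svd(X, n_components=2):
    """Project the rows of X onto the top principal components using SVD."""
    X_centered = X - X.mean(axis=0)
    # Rows of Vt are the principal directions, ordered by singular value.
    U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)
    return X_centered @ Vt[:n_components].T

# Hypothetical data: 6 "embeddings" with 5 dimensions each.
rng = np.random.default_rng(0)
X = rng.normal(size=(6, 5))
reduced = pca_via_svd(X, n_components=2)
print(reduced.shape)  # (6, 2)
```

The first output column carries the most variance and the second the next most. That is why a 2D scatter plot of these two columns is the usual way to look at embeddings.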
11 Contextual Embeddings
Up to this point, many embedding methods were static. That means a word gets the same vector even when its meaning changes in different sentences.
This creates a problem. The word left can mean a direction in one sentence and an action in another sentence. Static embeddings struggle here because the vector for the word stays the same no matter which sentence it appears in.
Contextual embeddings solve this using models like transformers. They create representations based on the surrounding words. This is where self-attention becomes important because the model can decide which words matter for the current meaning.
Static Embedding
- same word gets same representation
- works well for many similarity tasks
- struggles with multiple meanings
- example: Word2Vec, GloVe
Contextual Embedding
- word meaning depends on sentence
- uses surrounding context
- better for modern NLP tasks
- example: BERT, SBERT
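To show how context can change a representation, here is a heavily simplified self-attention sketch. It is an assumption-laden toy: queries, keys and values are the raw embeddings (real transformers use learned projections, several heads and many layers), and the 2-d vectors are hand-picked.

```python
import math

def softmax(scores):
    """Turn raw scores into weights that sum to 1."""
    exps = [math.exp(s - max(scores)) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(embeddings):
    """Simplified self-attention: queries, keys and values are the raw embeddings."""
    dim = len(embeddings[0])
    outputs = []
    for query in embeddings:
        # Scaled dot product of this word with every word in the sentence.
        scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(dim) for key in embeddings]
        weights = softmax(scores)
        # Output is a weighted mix of all word vectors in the sentence.
        outputs.append([sum(w * v[d] for w, v in zip(weights, embeddings)) for d in range(dim)])
    return outputs

# Hypothetical static 2-d vectors; "left" has one vector regardless of sentence.
static = {"turn": [1.0, 0.0], "left": [0.5, 0.5], "she": [0.0, 1.0]}

left_in_turn_left = self_attention([static["turn"], static["left"]])[1]
left_in_she_left = self_attention([static["she"], static["left"]])[1]
print(left_in_turn_left)
print(left_in_she_left)
```

The input vector for left is identical in both sentences, but the output differs because each output mixes in the neighbouring words. That is the core of what makes an embedding contextual.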
from sentence_transformers import SentenceTransformer
model = SentenceTransformer("all-MiniLM-L6-v2")
sentence = "The dog runs fast"
embedding = model.encode(sentence)
print(embedding[:5])
12 Document Embeddings and Doc2Vec
Word embeddings represent words. Sentence embeddings represent sentences. Document embeddings represent larger pieces of text like reviews, articles or documents.
I also included Doc2Vec in this topic because it extends the embedding idea to full documents. Even though my Doc2Vec notebook is still empty, the concept belongs naturally here because it explains how we can represent entire documents as vectors.
13 Embeddings in Real NLP Tasks
The best part of these notebooks was seeing embeddings used in tasks, not only printed as vectors. I tried document rating, document clustering and text similarity search using Word2Vec-style document vectors.
| Task | How Embeddings Help | Notebook Idea |
|---|---|---|
| Semantic search | finds documents with similar meaning | query vector compared with document vectors |
| Text clustering | groups similar documents together | KMeans on averaged embedding vectors |
| Document rating | predicts a numerical score from text | Ridge regression on sentence vectors |
| Summary checking | compares original text and summary meaning | cosine similarity between vector sets |
This helped me understand why embeddings are so important. They are not the final model by themselves, but they become a powerful input representation for many downstream NLP tasks.
14 Final Comparison of Embedding Methods
After arranging the notebooks together, this is how I now compare the major embedding methods.
| Method | Main Idea | Limitation |
|---|---|---|
| Word2Vec | learns word vectors from context prediction | static embedding |
| GloVe | learns from global co-occurrence patterns | still static |
| FastText | uses subword information | not fully contextual |
| Average sentence embedding | averages word vectors | treats all words equally |
| TF-IDF weighted embedding | uses word importance as weights | limited deep context |
| SBERT | creates contextual sentence embeddings | requires pretrained transformer model |
15 GitHub Notebook Connection
This blog explains what I understood from the embedding notebooks. The implementation side is connected to my NLP by Vinod GitHub repository.
Notebook references: 2_06_sentence_embeddings.ipynb, 2_08_word_embeddings_pretrained_GLOVE.ipynb, 2_09_co_occurrence_matrix_GLOVE.ipynb, 2_10_Word_embeddings_pretrained_Word2Vec.ipynb, 2_11_Word_embeddings_custom_Word2Vec.ipynb, and 2_12_doc2vec.ipynb.
16 What Comes Next in the NLP Journey
The next topic is NLP Libraries. After understanding preprocessing, count-based features and embeddings, I now want to organize the practical tools used in NLP workflows.