NLP by Vinod - Foundations

Text Representation

Word Embeddings in NLP - Moving Beyond Sparse Features.

After count-based feature extraction, I learned why NLP needs dense vectors that can capture similarity, meaning and context better than Bag of Words and TF-IDF.


Embeddings in NLP are dense numerical representations of text: words, sentences or documents. Word embeddings are the word-level case. In the previous topic, I learned count-based feature extraction methods like Bag of Words, n-grams and TF-IDF. Those methods are useful, but they create sparse vectors and capture very little semantic meaning.

My rough understanding was simple: earlier methods had disadvantages like sparsity, weak semantic meaning and static representation. After working through the embedding notebooks, I started seeing why embeddings became such an important step in NLP. Instead of representing words only by their position in a vocabulary, embeddings try to place similar words closer in a vector space.

For example, words like king and queen, or movie and film, should not behave like completely unrelated words. Count-based features often struggle with this because every word gets its own independent column. Embeddings try to solve this by learning dense vectors where closeness between vectors reflects closeness in meaning.

What clicked for me:
Bag of Words counts words. TF-IDF weighs words. Embeddings try to represent meaning.

Figure: Embeddings move NLP representation from sparse word-count vectors to dense vectors where similar meanings can be closer together.

01 Why Embeddings Were Needed

Traditional feature extraction methods were the right place to start. They helped me understand how text becomes numbers. But their limitations are also clear when the vocabulary becomes large or when meaning matters.

Sparsity

In Bag of Words and TF-IDF, vectors can become very large and mostly filled with zeros. If the vocabulary has thousands of words, each document still uses only a small part of it.

Weak semantic meaning

Count-based methods may treat related words as separate unrelated features. Words like movie and film can be close in meaning but still become different columns.

Large feature space

As vocabulary grows, the feature matrix becomes larger. This can increase memory use and slow down training.

Limited context

N-grams capture short phrases, but they still cannot capture sentence-level meaning or long-range context.

This is why embeddings felt like the natural next topic after feature extraction. They are still numerical representations, but now the goal is not only counting. The goal is to represent meaning in a compact vector form.

My simple definition: embeddings are dense vectors that represent text in a way where similar words or sentences can have similar vectors.
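To make the sparsity and weak-similarity points concrete, here is a minimal pure-Python sketch. The tiny corpus and the helper names (`bow_vector`, `dot`) are made up for illustration and are not taken from the notebooks.

```python
# Toy corpus (hypothetical, only for illustration).
docs = [
    "i loved this movie",
    "i loved this film",
    "the weather is cold today",
]

# Vocabulary: one column per unique word.
vocab = sorted({word for doc in docs for word in doc.split()})
index = {word: i for i, word in enumerate(vocab)}


def bow_vector(text):
    """Count how often each vocabulary word occurs in the text."""
    vec = [0] * len(vocab)
    for word in text.split():
        if word in index:
            vec[index[word]] += 1
    return vec


def dot(a, b):
    """Dot product: non-zero only if the two vectors share a column."""
    return sum(x * y for x, y in zip(a, b))


doc_vec = bow_vector(docs[0])
zero_fraction = doc_vec.count(0) / len(doc_vec)

print(len(vocab), zero_fraction)                    # most entries are already zero
print(dot(bow_vector("movie"), bow_vector("film")))  # separate columns, no overlap
```

Even with a ten-word vocabulary more than half of the document vector is zero, and movie and film share nothing as far as the counts are concerned. With a real vocabulary of thousands of words both effects become much stronger.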

02 Sparse Vectors vs Dense Vectors

The first difference I had to understand was sparse representation versus dense representation. Bag of Words and TF-IDF usually create sparse vectors. Embeddings create dense vectors.

Aspect	Sparse Features	Dense Embeddings
Example	Bag of Words, TF-IDF	Word2Vec, GloVe, FastText
Vector size	usually very large	usually smaller, like 50, 100 or 300
Zeros	mostly zeros	mostly meaningful values
Meaning	weak semantic meaning	captures similarity better
Interpretability	easy to explain	harder to interpret directly

A sparse vector records which vocabulary words occur in a document, and how often or with what weight. A dense vector tries to encode learned properties of the word. We do not assign the values manually. They are learned from data or loaded from a pretrained model.

Important point: dense does not mean magical. It simply means the vector has many non-zero learned values, and those values can carry useful patterns.

03 Word2Vec Intuition

Word2Vec was the first embedding method that made the idea feel practical to me. The basic idea is that words used in similar contexts should get similar vectors.

I learned two important training ideas: CBOW and Skip-gram. In CBOW, the model uses surrounding context words to predict the center word. In Skip-gram, the model uses the center word to predict surrounding context words.

CBOW

uses context words
predicts the center word
usually faster
works well with frequent words

Skip-gram

uses the center word
predicts context words
can work well for rare words
captures local context

In my pretrained Word2Vec notebook, I loaded the Google News vectors and tested words like king, queen, man, woman, India and others. The most interesting part was checking similarity and analogy-style behavior.

          
        
Python

import gensim.downloader as api

model = api.load("word2vec-google-news-300")

model["king"].shape
model.most_similar("king")
model.similarity("king", "queen")

          
        
Python

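vec = model["king"] - model["man"] + model["woman"]
model.most_similar([vec])  # with a raw vector, "king" itself may rank first
# Word-based form: gensim leaves the input words out of the results
model.most_similar(positive=["king", "woman"], negative=["man"])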

What clicked: vectors are not just numbers. They can preserve relationships learned from text, like similarity and analogy patterns.
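The analogy arithmetic can be followed by hand on a toy example. The 2-D vectors below are hand-made assumptions (one axis loosely meaning "royalty", the other "gender"), not learned values, and `analogy` is a hypothetical helper.

```python
import math

# Hand-made 2-D vectors: dimension 0 ~ "royalty", dimension 1 ~ "gender".
vectors = {
    "king":   [1.0, 1.0],
    "queen":  [1.0, -1.0],
    "man":    [0.0, 1.0],
    "woman":  [0.0, -1.0],
    "prince": [0.8, 1.0],
}


def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))


def analogy(a, b, c):
    """Word closest to vec(a) - vec(b) + vec(c), excluding the three inputs."""
    target = [x - y + z for x, y, z in zip(vectors[a], vectors[b], vectors[c])]
    candidates = [w for w in vectors if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(vectors[w], target))


print(analogy("king", "man", "woman"))
```

Subtracting man removes the "male" component, adding woman puts the "female" component in, and the "royalty" component is left untouched. Real embeddings have hundreds of dimensions and no clean axes like this, so the result there is only approximate.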

04 Training a Custom Word2Vec Model

After using pretrained Word2Vec, I also trained my own Word2Vec model. This helped me understand what happens when the model builds vocabulary from my own corpus.

In the custom notebook, I loaded text files, broke them into sentences, applied simple preprocessing, built vocabulary and then trained a Word2Vec model. I also tested words like king, jon, daenerys, arya and sansa.

          
        
Python

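import gensim

# story: a list of tokenized sentences, e.g. [["jon", "snow"], ...]
model = gensim.models.Word2Vec(
    window=10,    # max distance between center word and context word
    min_count=2   # ignore words seen fewer than 2 times
)

# Scan the corpus once to build the vocabulary
model.build_vocab(story)

# Train the vectors on the same corpus
model.train(
    story,
    total_examples=model.corpus_count,
    epochs=10
)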

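The important parameters for me were window and min_count. Window sets the maximum distance between the center word and a context word, so it controls how many neighbors on each side are considered. min_count drops words that appear fewer than that many times in the corpus.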

Parameter	Meaning	How I Understood It
window	context size	how many nearby words are considered
min_count	minimum word frequency	rare words below this count are ignored
vector_size	embedding dimension	length of each word vector
sg	training mode	0 for CBOW, 1 for Skip-gram

Notebook lesson: custom embeddings depend heavily on the corpus. If the corpus is small or narrow, the learned vectors may not generalize well.
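To see what the min_count cutoff does to a small corpus, here is a rough stand-in written with the standard library. It only mimics the frequency filter. gensim's real vocabulary building does more than this, and `build_vocab` below is a hypothetical helper, not the gensim method. The corpus is invented, in the list-of-token-lists shape that Word2Vec expects.

```python
from collections import Counter


def build_vocab(sentences, min_count=2):
    """Keep only words that occur at least min_count times in the corpus."""
    counts = Counter(word for sentence in sentences for word in sentence)
    return {word: n for word, n in counts.items() if n >= min_count}


# Hypothetical tokenized corpus: a list of sentences, each a list of tokens.
story = [
    ["jon", "met", "arya"],
    ["arya", "left", "winterfell"],
    ["jon", "returned", "to", "winterfell"],
]

print(build_vocab(story, min_count=2))
print(len(build_vocab(story, min_count=1)))
```

With min_count=2, more than half of the distinct words disappear from this tiny corpus. On a small or narrow dataset the same thing happens at scale, which is part of why custom vectors may not generalize well.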

05 GloVe and Co-occurrence Matrix

GloVe gave me another view of embeddings. Word2Vec learns from prediction tasks. GloVe is based on word co-occurrence statistics. It looks at how often words appear together in a context window.

In the co-occurrence notebook, I built a vocabulary, created a co-occurrence matrix, initialized embedding parameters, trained them using a loss and finally visualized embeddings using PCA.

Corpus: sentences and documents

Vocabulary: unique words

Matrix: word co-occurrence counts

Train: learn embedding values

Use: similarity and NLP tasks

          
        
Python

import numpy as np

def build_cooccurrence_matrix(tokenized_corpus, vocab_size, word2id, window_size=2):
    # One row and one column per vocabulary word
    cooccurrence_matrix = np.zeros((vocab_size, vocab_size), dtype=np.float64)

    for sentence in tokenized_corpus:
        sentence_ids = [word2id[word] for word in sentence]

        for i, center_id in enumerate(sentence_ids):
            # Window boundaries, clipped to the sentence
            start = max(0, i - window_size)
            end = min(len(sentence_ids), i + window_size + 1)

            for j in range(start, end):
                if i != j:  # skip the center word itself
                    context_id = sentence_ids[j]
                    cooccurrence_matrix[center_id, context_id] += 1

    return cooccurrence_matrix

Figure: GloVe learns embeddings from word co-occurrence patterns, where words appearing in similar contexts can get similar vector representations.
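The exact loss used in the notebook is not shown in this post. As a reference point, the weighted least-squares objective from the original GloVe formulation can be sketched for a single word pair as below. The defaults x_max=100 and alpha=0.75 are the values reported for GloVe. The function names and the toy inputs are illustrative assumptions.

```python
import math


def glove_weight(count, x_max=100.0, alpha=0.75):
    """Down-weight rare pairs and cap the weight of very frequent pairs at 1."""
    if count < x_max:
        return (count / x_max) ** alpha
    return 1.0


def glove_pair_loss(w_i, w_j, b_i, b_j, count):
    """Weighted squared error between (w_i . w_j + b_i + b_j) and log(count).

    w_i is the word vector, w_j the context vector, b_i and b_j their biases.
    """
    if count <= 0:
        return 0.0  # pairs that never co-occur contribute nothing
    dot = sum(x * y for x, y in zip(w_i, w_j))
    error = dot + b_i + b_j - math.log(count)
    return glove_weight(count) * error ** 2


# A pair whose dot product already equals log(count) has zero loss.
print(glove_pair_loss([1.0, 0.0], [math.log(10.0), 0.0], 0.0, 0.0, 10.0))

# Rare pairs get a small weight, very frequent pairs are capped at 1.
print(glove_weight(10.0), glove_weight(500.0))
```

Training then means summing this term over all non-zero cells of the co-occurrence matrix and adjusting the vectors and biases to reduce the total.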

06 Using Pretrained GloVe for NLP Tasks

In the pretrained GloVe notebook, I used word vectors to create sentence vectors by averaging word embeddings. Then I applied those vectors to practical tasks like task clustering, summary checking, text clustering and semantic search.

This was useful because embeddings stopped feeling like only theory. I could see them being used as input for clustering and similarity-based retrieval.

          
        
Python

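import numpy as np

# model is assumed to be the pretrained GloVe vectors loaded with gensim
def get_sentence_vector(sentence):
    # Keep only the words the embedding model knows
    words = [word for word in sentence.lower().split() if word in model]

    if not words:
        # No known words: fall back to a zero vector
        return np.zeros(model.vector_size)

    # Average the word vectors into one sentence vector
    return np.mean([model[word] for word in words], axis=0)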

Tasks I Tried

semantic search
text clustering
task grouping
summary accuracy checking

What I Noticed

average vectors are simple
unknown words need handling
similarity uses cosine score
meaning improves compared to counts

My understanding: once we can represent a sentence as a vector, we can compare sentences using cosine similarity.
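Here is a small sketch of that comparison using only the standard library. The embedding table is hand-made (hypothetical values, not real GloVe vectors), and the helper names are assumptions. It also shows one way to treat the zero vector that comes back when a sentence has no known words.

```python
import math

# Tiny hand-made embedding table (hypothetical values, not real GloVe vectors).
toy_vectors = {
    "good":  [0.9, 0.1],
    "great": [0.7, 0.3],
    "movie": [0.1, 0.9],
    "film":  [0.2, 0.8],
}
DIM = 2


def sentence_vector(sentence):
    """Average the vectors of known words; unknown words are skipped."""
    known = [toy_vectors[w] for w in sentence.lower().split() if w in toy_vectors]
    if not known:
        return [0.0] * DIM
    return [sum(values) / len(known) for values in zip(*known)]


def cosine(a, b):
    """Cosine similarity, defined here as 0.0 when either vector has zero length."""
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return sum(x * y for x, y in zip(a, b)) / (norm_a * norm_b)


a = sentence_vector("Good movie")
b = sentence_vector("great film")
c = sentence_vector("xyzzy qwerty")  # no known words, so this is the zero vector

print(round(cosine(a, b), 3))  # high, although the sentences share no word
print(cosine(a, c))            # zero vector has no direction, so 0.0 by convention
```

Returning 0.0 for the zero vector is a design choice in this sketch. The point is that unknown-word handling has to be decided somewhere, otherwise the similarity is undefined.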

07 FastText and Subword Information

FastText helped me understand one important limitation of Word2Vec. Word2Vec usually treats each word as a complete unit. FastText breaks words into subword pieces or character n-grams.

This matters because similar word forms like play, playing and played can share subword information. It also helps with rare words and some out-of-vocabulary cases.

          
        
Python

from gensim.models import FastText

cbow_model = FastText(
    sentences=tokenized,
    vector_size=100,
    window=5,
    min_count=1,
    sg=0,
    epochs=100
)

cbow_model.wv["learning"][:10]

What clicked: FastText is useful because it does not only learn word-level patterns. It also learns from parts of words.
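The shared subword pieces can be listed directly. This sketch follows the FastText description of wrapping a word in `<` and `>` boundary markers and taking character n-grams of length 3 to 6. The function name is an assumption, and the real library additionally hashes the n-grams into buckets, which is left out here.

```python
def char_ngrams(word, min_n=3, max_n=6):
    """Character n-grams of a word wrapped in < and > boundary markers."""
    wrapped = f"<{word}>"
    grams = set()
    for n in range(min_n, max_n + 1):
        for i in range(len(wrapped) - n + 1):
            grams.add(wrapped[i:i + n])
    return grams


play = char_ngrams("play")
playing = char_ngrams("playing")
played = char_ngrams("played")

# Pieces that all three word forms have in common.
shared = play & playing & played
print(sorted(shared))
```

Because the vector of a word is built from the vectors of its n-grams, these shared pieces pull play, playing and played toward each other. A word never seen in training can still get a vector as long as some of its n-grams were seen.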

08 Sentence Embeddings

Word embeddings represent individual words, but many NLP tasks need sentence-level meaning. That is where sentence embeddings come in. A sentence embedding is a fixed-length numerical representation of a full sentence or document.

I tried average sentence embeddings first. The idea is simple: take all word vectors in a sentence and calculate their average.

          
        
Python

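import spacy
import numpy as np

# en_core_web_md ships with static word vectors;
# the small model (en_core_web_sm) does not include them.
nlp = spacy.load("en_core_web_md")

def average_embedding(sentence):
    doc = nlp(sentence)
    vectors = []

    # Collect the vector of every token that has one
    for token in doc:
        if token.has_vector:
            vectors.append(token.vector)

    if vectors:
        return np.mean(vectors, axis=0)

    # No usable tokens: fall back to a zero vector of the same size
    return np.zeros(nlp.vocab.vectors_length)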

Average embeddings are simple and fast, but they treat all words equally. In many cases, not every word should have the same importance.

TF-IDF Weighted Sentence Embedding

To improve simple averaging, I also tried TF-IDF weighted sentence embeddings. Here, important words get more weight while common words get lower influence.

          
        
Python

from sklearn.feature_extraction.text import TfidfVectorizer

tfidf = TfidfVectorizer()
tfidf.fit(corpus)

def tfidf_weighted_embedding(sentence):
    tfidf_scores = tfidf.transform([sentence]).toarray()[0]
    vocabulary = tfidf.vocabulary_  # maps each term to its column index

    doc = nlp(sentence)
    vectors = []
    weights = []

    for token in doc:
        word = token.text.lower()
        index = vocabulary.get(word)

        # Keep tokens that TF-IDF knows, that have a vector,
        # and whose weight is non-zero (avoids a zero weight sum)
        if index is not None and token.has_vector and tfidf_scores[index] > 0:
            vectors.append(token.vector)
            weights.append(tfidf_scores[index])

    if vectors:
        return np.average(vectors, axis=0, weights=weights)

    return np.zeros(nlp.vocab.vectors_length)

Limitation: TF-IDF weighted embeddings account for word importance, but they still do not capture context the way transformer-based sentence embeddings do.
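To see where the weights come from, here is a small sketch of the IDF part on its own. The formula is the smoothed IDF that scikit-learn uses by default. The full TF-IDF score also includes term frequency and normalization, which are left out. The corpus, the 2-D vectors and the helper names are assumptions for illustration.

```python
import math

corpus = [
    "the model learns word vectors",
    "the vectors capture meaning",
    "the search uses cosine similarity",
]


def smooth_idf(term, documents):
    """ln((1 + n) / (1 + df)) + 1, with n documents and df documents containing the term."""
    n = len(documents)
    df = sum(1 for doc in documents if term in doc.split())
    return math.log((1 + n) / (1 + df)) + 1


def weighted_average(vectors, weights):
    """Weighted mean of equal-length vectors."""
    total = sum(weights)
    return [
        sum(w * v[i] for v, w in zip(vectors, weights)) / total
        for i in range(len(vectors[0]))
    ]


idf_the = smooth_idf("the", corpus)        # appears in every document
idf_cosine = smooth_idf("cosine", corpus)  # appears in one document

# Hypothetical 2-D word vectors, only to show the pull of the weights.
vec_the, vec_cosine = [1.0, 0.0], [0.0, 1.0]

print(idf_the, idf_cosine)
print(weighted_average([vec_the, vec_cosine], [idf_the, idf_cosine]))
```

A plain average would land exactly halfway between the two vectors. With IDF weights the result moves toward the rarer word, which is the effect the weighted sentence embedding is after.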

09 Semantic Search Using Embeddings

One practical task I liked was semantic search. Instead of matching exact words, we convert the query and documents into vectors, then compare them using cosine similarity.

This is a big jump from keyword search. A query like visualizing data can match a document like Data visualization techniques even if the exact wording is not the same.

          
        
Python

from sklearn.metrics.pairwise import cosine_similarity

query = "visualizing data"
query_vec = get_sentence_vector(query)

scores = []

for doc in corpus:
    doc_vec = get_sentence_vector(doc)
    score = cosine_similarity(
        [query_vec],
        [doc_vec]
    )[0][0]
    scores.append(score)

What I learned: semantic search depends on vector similarity, not just exact word overlap.

Figure: Semantic search compares query and document embeddings so related meanings can match even when the exact words are different.
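Once every document has a score, the last step of a search is ranking. A minimal way to pick the best matches is sketched below. The documents and scores are hypothetical stand-ins for real output, and `top_k` is an illustrative helper name.

```python
def top_k(documents, scores, k=3):
    """Return the k (score, document) pairs with the highest scores."""
    ranked = sorted(zip(scores, documents), key=lambda pair: pair[0], reverse=True)
    return ranked[:k]


# Hypothetical documents and similarity scores for the query "visualizing data".
corpus = [
    "Data visualization techniques",
    "Cooking pasta at home",
    "Plotting charts with Python",
]
scores = [0.82, 0.05, 0.64]

for score, doc in top_k(corpus, scores, k=2):
    print(f"{score:.2f}  {doc}")
```

Sorting on the score alone keeps the original document order for ties, because Python's sort is stable.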

10 PCA, SVD and Visualizing Embeddings

Embedding vectors can have many dimensions, like 50, 100 or 300. Humans cannot directly visualize that many dimensions. That is why PCA and SVD are useful. They reduce the number of dimensions so embeddings can be plotted in 2D or 3D.

In the notebook, I used PCA to reduce sentence embeddings and word embeddings. This helped me see that embeddings are not just abstract arrays. They can be projected and visualized to observe clusters or similarity patterns.

          
        
Python

from sklearn.decomposition import PCA

sentence_embeddings = np.array(
    [average_embedding(sent) for sent in sentences]
)

pca = PCA(n_components=2)
reduced_embeddings = pca.fit_transform(sentence_embeddings)

My takeaway: PCA does not create embeddings. It helps inspect or reduce existing embeddings.
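To make that takeaway concrete, here is a rough pure-Python sketch of the idea behind PCA: center the data, then find the direction with the largest variance. It uses power iteration on the covariance matrix, which is a different route from the SVD-based solver inside scikit-learn, and it finds only the first component. The toy points, the all-ones start vector and the function names are assumptions.

```python
import math


def first_principal_component(points, iterations=100):
    """Direction of largest variance, found by power iteration on the covariance."""
    n, dim = len(points), len(points[0])
    mean = [sum(p[d] for p in points) / n for d in range(dim)]
    centered = [[p[d] - mean[d] for d in range(dim)] for p in points]

    # Sample covariance matrix (dim x dim).
    cov = [
        [sum(row[a] * row[b] for row in centered) / (n - 1) for b in range(dim)]
        for a in range(dim)
    ]

    direction = [1.0] * dim  # assumed start vector
    for _ in range(iterations):
        product = [sum(cov[a][b] * direction[b] for b in range(dim)) for a in range(dim)]
        norm = math.sqrt(sum(x * x for x in product))
        if norm == 0.0:
            break  # no variance along the current direction
        direction = [x / norm for x in product]
    return mean, direction


def project(point, mean, direction):
    """1-D coordinate of a point along the principal direction."""
    return sum((x - m) * d for x, m, d in zip(point, mean, direction))


# Toy 2-D "embeddings" that lie exactly on the line y = 2x.
points = [[0.0, 0.0], [1.0, 2.0], [2.0, 4.0], [3.0, 6.0]]
mean, direction = first_principal_component(points)

print(direction)
print([round(project(p, mean, direction), 3) for p in points])
```

The points start as 2-D, but one coordinate along the found direction is enough to describe them. Nothing new is learned about the text here. The existing vectors are only re-expressed in fewer numbers.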

11 Contextual Embeddings

Up to this point, many embedding methods were static. That means a word gets the same vector even when its meaning changes in different sentences.

This creates a problem. The word left can mean a direction in one sentence and an action in another sentence. Static embeddings struggle here because the vector for the word stays the same in both sentences.

Contextual embeddings solve this using models like transformers. They create representations based on the surrounding words. This is where self-attention becomes important because the model can decide which words matter for the current meaning.

Static Embedding

same word gets same representation
works well for many similarity tasks
struggles with multiple meanings
example: Word2Vec, GloVe

Contextual Embedding

word meaning depends on sentence
uses surrounding context
better for modern NLP tasks
example: BERT, SBERT

          
        
Python

from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")

sentence = "The dog runs fast"
embedding = model.encode(sentence)

print(embedding[:5])

What clicked: contextual embeddings are one bridge from traditional NLP toward transformers and modern GenAI systems.

12 Document Embeddings and Doc2Vec

Word embeddings represent words. Sentence embeddings represent sentences. Document embeddings represent larger pieces of text like reviews, articles or documents.

I also included Doc2Vec in this topic because it extends the embedding idea to full documents. Even though my Doc2Vec notebook is still empty, the concept belongs naturally here because it explains how we can represent entire documents as vectors.

Word embedding

Represents individual words as dense vectors.

Sentence embedding

Represents full sentences as fixed-length vectors.

Document embedding

Represents long text or full documents for search, clustering and classification.

Learning note: this part needs more implementation from my side. I understood the purpose of Doc2Vec, but I still need to complete the notebook properly.

13 Embeddings in Real NLP Tasks

The best part of these notebooks was seeing embeddings used in tasks, not only printed as vectors. I tried document rating, document clustering and text similarity search using Word2Vec-style document vectors.

Task	How Embeddings Help	Notebook Idea
Semantic search	finds documents with similar meaning	query vector compared with document vectors
Text clustering	groups similar documents together	KMeans on averaged embedding vectors
Document rating	predicts a numerical score from text	Ridge regression on sentence vectors
Summary checking	compares original text and summary meaning	cosine similarity between vector sets

This helped me understand why embeddings are so important. They are not the final model by themselves, but they become a powerful input representation for many downstream NLP tasks.

14 Final Comparison of Embedding Methods

After arranging the notebooks together, this is how I now compare the major embedding methods.

Method	Main Idea	Limitation
Word2Vec	learns word vectors from context prediction	static embedding
GloVe	learns from global co-occurrence patterns	still static
FastText	uses subword information	not fully contextual
Average sentence embedding	averages word vectors	treats all words equally
TF-IDF weighted embedding	uses word importance as weights	limited deep context
SBERT	creates contextual sentence embeddings	requires pretrained transformer model

My final understanding: embeddings are not one method. They are a family of representation techniques that move NLP from sparse counts toward dense meaning.

15 GitHub Notebook Connection

This blog explains what I understood from the embedding notebooks. The implementation side is connected to my NLP by Vinod GitHub repository.

NLP by Vinod GitHub Repository

Notebook references: 2_06_sentence_embeddings.ipynb, 2_08_word_embeddings_pretrained_GLOVE.ipynb, 2_09_co_occurrence_matrix_GLOVE.ipynb, 2_10_Word_embeddings_pretrained_Word2Vec.ipynb, 2_11_Word_embeddings_custom_Word2Vec.ipynb, and 2_12_doc2vec.ipynb.


16 Related Reading

NLP learning roadmap

The roadmap that connects this topic to the complete NLP by Vinod learning path.

Text preprocessing in NLP

The previous topic where raw text was cleaned, normalized and tokenized before representation.

Feature extraction in NLP

The previous representation topic where cleaned text became sparse numerical features through Bag of Words, n-grams and TF-IDF.

17 What Comes Next in the NLP Journey

The next topic is NLP Libraries. After understanding preprocessing, count-based features and embeddings, I now want to organize the practical tools used in NLP workflows.

NLTK

Useful for tokenization, stopwords, stemming, lemmatization and classic NLP utilities.

spaCy

Useful for production-style NLP pipelines, tokenization, POS tagging, NER and linguistic features.

Gensim

Useful for topic modeling, Word2Vec, FastText and vector-based text representation.



Vinod Codes | AI Engineering & Data Science

A structured public journey from NLP fundamentals to real-world AI systems.