NLP by Vinod

A structured public journey from NLP fundamentals to real-world AI systems.

Vinod Codes is where I document my learning in AI, Machine Learning, Deep Learning, Natural Language Processing, Generative AI, and practical projects.

The main series here is NLP by Vinod — a learner-builder journey where I explain concepts with intuition, Python examples, mistakes, GitHub work, and honest implementation notes.

Start here: follow the Foundations Track first, then move into deep learning, transformers, projects, and real-world NLP systems.

Word Embeddings in NLP - Moving Beyond Sparse Features

After count-based feature extraction, I learned why NLP needs dense vectors that can capture similarity, meaning and context better than Bag of Words and TF-IDF.

Word embeddings are dense numerical representations of words, and the same idea extends to sentences and documents. In the previous topic, I learned count-based feature extraction methods like Bag of Words, n-grams and TF-IDF. Those methods are useful, but they create sparse vectors and treat every word as an independent feature, so they capture little of what the words mean.

My rough understanding was simple: earlier methods had disadvantages like sparsity, weak semantic meaning and static representation. After working through the embedding notebooks, I started seeing why embeddings became such an important step in NLP. Instead of representing words only by their position in a vocabulary, embeddings try to place similar words closer in a vector space.

For example, words like king and queen, or movie and film, should not behave like completely unrelated words. Count-based features often struggle with this. Embeddings try to solve this by learning dense vectors where similarity has meaning.

What clicked for me:
Bag of Words counts words. TF-IDF weighs words. Embeddings try to represent meaning.
Figure: Embeddings move NLP representation from sparse word-count vectors to dense vectors where similar meanings can be closer together.

01 Why Embeddings Were Needed

Traditional feature extraction methods were the right place to start. They helped me understand how text becomes numbers. But their limitations are also clear when the vocabulary becomes large or when meaning matters.

  • Sparsity: In Bag of Words and TF-IDF, vectors can become very large and mostly filled with zeros. If the vocabulary has thousands of words, each document still uses only a small part of it.
  • Weak semantic meaning: Count-based methods may treat related words as separate unrelated features. Words like movie and film can be close in meaning but still become different columns.
  • Large feature space: As vocabulary grows, the feature matrix becomes larger. This can increase memory use and slow down training.
  • Limited context: N-grams capture short phrases, but they still cannot deeply understand sentence-level meaning or long-range context.

This is why embeddings felt like the natural next topic after feature extraction. They are still numerical representations, but now the goal is not only counting. The goal is to represent meaning in a compact vector form.

My simple definition: embeddings are dense vectors that represent text in a way where similar words or sentences can have similar vectors.

02 Sparse Vectors vs Dense Vectors

The first difference I had to understand was sparse representation versus dense representation. Bag of Words and TF-IDF usually create sparse vectors. Embeddings create dense vectors.

Aspect | Sparse Features | Dense Embeddings
Example | Bag of Words, TF-IDF | Word2Vec, GloVe, FastText
Vector size | usually very large | usually smaller, like 50, 100 or 300
Zeros | mostly zeros | mostly meaningful values
Meaning | weak semantic meaning | captures similarity better
Interpretability | easy to explain | harder to interpret directly

A sparse vector records whether, or how often, each vocabulary word appears in a document. A dense vector tries to encode learned properties of the word. The values are not manually assigned by us. They are learned from data or loaded from a pretrained model.

Important point: dense does not mean magical. It simply means the vector has many non-zero learned values, and those values can carry useful patterns.
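
To make the contrast concrete, here is a minimal sketch in plain Python. The tiny vocabulary and the dense values are invented for illustration only; they are not taken from any trained model. It shows two things: a Bag of Words vector is mostly zeros, and the count columns for movie and film never overlap, while dense vectors can.

```python
# Toy vocabulary; a real one would have thousands of entries.
vocab = ["movie", "film", "great", "boring", "actor", "plot", "music", "ending"]
word_to_index = {word: i for i, word in enumerate(vocab)}


def bag_of_words(sentence):
    """Count how often each vocabulary word appears in the sentence."""
    counts = [0] * len(vocab)
    for word in sentence.lower().split():
        if word in word_to_index:
            counts[word_to_index[word]] += 1
    return counts


def dot(a, b):
    """Plain dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


doc_a = bag_of_words("great movie")
zero_share = doc_a.count(0) / len(doc_a)

# Count vectors for the single words: each uses its own column.
movie_sparse = bag_of_words("movie")
film_sparse = bag_of_words("film")

# Hypothetical dense vectors, invented here only to show the contrast.
movie_dense = [0.81, 0.10, 0.55]
film_dense = [0.79, 0.12, 0.60]

print("share of zeros in doc_a:", zero_share)
print("sparse overlap:", dot(movie_sparse, film_sparse))
print("dense overlap:", dot(movie_dense, film_dense))
```

With a realistic vocabulary the share of zeros gets much closer to 1, which is the sparsity problem described above.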

03 Word2Vec Intuition

Word2Vec was the first embedding method that made the idea feel practical to me. The basic idea is that words used in similar contexts should get similar vectors.

I learned two important training ideas: CBOW and Skip-gram. In CBOW, the model uses surrounding context words to predict the center word. In Skip-gram, the model uses the center word to predict surrounding context words.

CBOW

  • uses context words
  • predicts the center word
  • usually faster
  • works well with frequent words

Skip-gram

  • uses the center word
  • predicts context words
  • can work well for rare words
  • captures local context
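
The difference between the two training setups is easier to see as data. The sketch below only builds the training examples each mode would see from one sentence; it does not train anything. The function names and the example sentence are my own, chosen for illustration.

```python
def skipgram_pairs(tokens, window=2):
    """Return (center, context) pairs: the center word predicts each neighbor."""
    pairs = []
    for i, center in enumerate(tokens):
        start = max(0, i - window)
        end = min(len(tokens), i + window + 1)
        for j in range(start, end):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs


def cbow_examples(tokens, window=2):
    """Return (context_words, center) examples: the neighbors predict the center."""
    examples = []
    for i, center in enumerate(tokens):
        start = max(0, i - window)
        end = min(len(tokens), i + window + 1)
        context = [tokens[j] for j in range(start, end) if j != i]
        examples.append((context, center))
    return examples


tokens = "the king rules the land".split()
pairs = skipgram_pairs(tokens, window=1)
examples = cbow_examples(tokens, window=1)

print(pairs)
print(examples)
```

Skip-gram produces one example per (center, neighbor) pair, while CBOW produces one example per position. That is one intuition for why CBOW is usually faster.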

In my pretrained Word2Vec notebook, I loaded the Google News vectors and tested words like king, queen, man, woman, India and others. The most interesting part was checking similarity and analogy-style behavior.

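Python
import gensim.downloader as api

# Downloads the pretrained Google News vectors on first use (a large file).
model = api.load("word2vec-google-news-300")

print(model["king"].shape)                # each word is a 300-dimensional vector
print(model.most_similar("king"))         # nearest neighbours by cosine similarity
print(model.similarity("king", "queen"))  # cosine similarity between two words
Python
# Analogy: king - man + woman. Passing the words through positive/negative
# lets gensim exclude the input words from the results; a raw vector query
# such as most_similar([vec]) typically returns "king" itself first.
print(model.most_similar(positive=["king", "woman"], negative=["man"]))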
What clicked: vectors are not just numbers. They can preserve relationships learned from text, like similarity and analogy patterns.
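
The analogy arithmetic can be checked by hand on a toy example. This is a sketch with hypothetical 2-D vectors whose two dimensions I labeled (royalty, maleness); real Word2Vec dimensions are learned and do not have clean labels like this. The helper names are also my own.

```python
import math

# Hypothetical 2-D vectors: (royalty, maleness). Invented for illustration only.
vectors = {
    "king": [0.9, 0.9],
    "queen": [0.9, 0.1],
    "man": [0.1, 0.9],
    "woman": [0.1, 0.1],
    "prince": [0.8, 0.9],
    "apple": [0.05, 0.4],
}


def cosine(a, b):
    """Cosine similarity between two non-zero 2-D vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))


def analogy(positive, negative):
    """Return the word closest to sum(positive) - sum(negative), skipping the inputs."""
    target = [0.0, 0.0]
    for word in positive:
        target = [t + v for t, v in zip(target, vectors[word])]
    for word in negative:
        target = [t - v for t, v in zip(target, vectors[word])]

    candidates = [w for w in vectors if w not in positive and w not in negative]
    return max(candidates, key=lambda w: cosine(target, vectors[w]))


result = analogy(positive=["king", "woman"], negative=["man"])
print(result)
```

Subtracting man removes the "maleness" part of king and adding woman keeps the "royalty" part, so the result lands next to queen.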

04 Training a Custom Word2Vec Model

After using pretrained Word2Vec, I also trained my own Word2Vec model. This helped me understand what happens when the model builds vocabulary from my own corpus.

In the custom notebook, I loaded text files, broke them into sentences, applied simple preprocessing, built vocabulary and then trained a Word2Vec model. I also tested words like king, jon, daenerys, arya and sansa.

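Python
import gensim

# story is assumed to be a list of tokenized sentences (a list of word lists).
model = gensim.models.Word2Vec(
    window=10,    # maximum distance between a center word and its context words
    min_count=2   # ignore words that appear fewer than 2 times
)

# First pass: scan the corpus and build the vocabulary.
model.build_vocab(story)

# Second pass: train the vectors on the same corpus.
model.train(
    story,
    total_examples=model.corpus_count,
    epochs=10
)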

The important parameters for me were window and min_count. window sets the maximum distance between the center word and the context words the model considers. min_count drops words that appear fewer times than the threshold.

Parameter | Meaning | How I Understood It
window | context size | how many nearby words are considered
min_count | minimum word frequency | rare words below this count are ignored
vector_size | embedding dimension | length of each word vector
sg | training mode | 0 for CBOW, 1 for Skip-gram
Notebook lesson: custom embeddings depend heavily on the corpus. If the corpus is small or narrow, the learned vectors may not generalize well.
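
To see what min_count does to a small corpus, here is a plain-Python sketch of the filtering step. It mimics the idea only; it is not how gensim implements it internally, and the three toy sentences are made up.

```python
from collections import Counter


def build_vocab(sentences, min_count=2):
    """Keep only the words that appear at least min_count times in the corpus."""
    counts = Counter(word for sentence in sentences for word in sentence)
    return {word for word, count in counts.items() if count >= min_count}


# Toy tokenized corpus, invented for illustration.
toy_story = [
    ["jon", "snow", "knows", "nothing"],
    ["jon", "rides", "north"],
    ["arya", "rides", "north"],
]

kept = build_vocab(toy_story, min_count=2)
print(sorted(kept))
```

On a corpus this small, min_count=2 already throws away more than half of the words, including arya. That is the practical side of the notebook lesson: with little data, many words never get a vector at all.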

05 GloVe and Co-occurrence Matrix

GloVe gave me another view of embeddings. Word2Vec learns from prediction tasks. GloVe is based on word co-occurrence statistics. It looks at how often words appear together in a context window.

In the co-occurrence notebook, I built a vocabulary, created a co-occurrence matrix, initialized embedding parameters, trained them using a loss and finally visualized embeddings using PCA.

Corpus: sentences and documents
Vocabulary: unique words
Matrix: word co-occurrence counts
Train: learn embedding values
Use: similarity and NLP tasks
Python
import numpy as np


def build_cooccurrence_matrix(tokenized_corpus, vocab_size, word2id, window_size=2):
    cooccurrence_matrix = np.zeros((vocab_size, vocab_size), dtype=np.float64)

    for sentence in tokenized_corpus:
        sentence_ids = [word2id[word] for word in sentence]

        for i, center_id in enumerate(sentence_ids):
            start = max(0, i - window_size)
            end = min(len(sentence_ids), i + window_size + 1)

            for j in range(start, end):
                if i != j:
                    context_id = sentence_ids[j]
                    cooccurrence_matrix[center_id, context_id] += 1

    return cooccurrence_matrix
Figure: GloVe learns embeddings from word co-occurrence patterns, where words appearing in similar contexts can get similar vector representations.
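
The "trained them using a loss" step can be sketched for a single word pair. The code below follows the weighted least-squares objective from the original GloVe paper, with its published defaults x_max=100 and alpha=0.75. I am assuming that form here; the loss in my notebook may differ in details, and the function names are my own.

```python
import math


def glove_weight(count, x_max=100.0, alpha=0.75):
    """Down-weight rare pairs and cap the influence of very frequent ones."""
    if count < x_max:
        return (count / x_max) ** alpha
    return 1.0


def glove_pair_loss(w_i, w_j, b_i, b_j, count):
    """Weighted squared error between the model score and log(count).

    Only defined for count > 0; pairs that never co-occur are skipped.
    """
    score = sum(a * b for a, b in zip(w_i, w_j)) + b_i + b_j
    return glove_weight(count) * (score - math.log(count)) ** 2


# Hypothetical parameters for one (word, context) pair that co-occurred 5 times.
loss = glove_pair_loss(w_i=[0.2, -0.1], w_j=[0.4, 0.3], b_i=0.1, b_j=0.0, count=5)
print(loss)
```

Training sums this term over every non-zero cell of the co-occurrence matrix and adjusts the vectors and biases to reduce the total.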

06 Using Pretrained GloVe for NLP Tasks

In the pretrained GloVe notebook, I used word vectors to create sentence vectors by averaging word embeddings. Then I applied those vectors to practical tasks like task clustering, summary checking, text clustering and semantic search.

This was useful because embeddings stopped feeling like only theory. I could see them being used as input for clustering and similarity-based retrieval.

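Python
import numpy as np

# model is assumed to be the pretrained GloVe vectors loaded as gensim KeyedVectors.
def get_sentence_vector(sentence):
    # Keep only words the model knows; unknown words are skipped.
    words = [word for word in sentence.lower().split() if word in model]

    # No known words: fall back to a zero vector of the right size.
    if not words:
        return np.zeros(model.vector_size)

    return np.mean([model[word] for word in words], axis=0)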

Tasks I Tried

  • semantic search
  • text clustering
  • task grouping
  • summary accuracy checking

What I Noticed

  • average vectors are simple
  • unknown words need handling
  • similarity uses cosine score
  • meaning improves compared to counts
My understanding: once we can represent a sentence as a vector, we can compare sentences using cosine similarity.
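
Cosine similarity itself is a short formula: the dot product divided by the product of the two vector lengths. Here is a plain-Python sketch of it. Returning 0.0 for a zero vector is my own choice, made so that a sentence with no known words does not cause a division by zero.

```python
import math


def cosine(a, b):
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


same_direction = cosine([1.0, 2.0], [2.0, 4.0])   # same direction, different length
orthogonal = cosine([1.0, 0.0], [0.0, 3.0])       # nothing in common
no_known_words = cosine([0.0, 0.0], [1.0, 2.0])   # zero-vector fallback

print(same_direction, orthogonal, no_known_words)
```

The score depends only on direction, not length, which is why a long document and a short query can still match well.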

07 FastText and Subword Information

FastText helped me understand one important limitation of Word2Vec. Word2Vec usually treats each word as a complete unit. FastText breaks words into subword pieces or character n-grams.

This matters because similar word forms like play, playing and played can share subword information. It also helps with rare words and some out-of-vocabulary cases.

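Python
from gensim.models import FastText

# tokenized is assumed to be a list of tokenized sentences.
cbow_model = FastText(
    sentences=tokenized,
    vector_size=100,
    window=5,
    min_count=1,
    sg=0,          # 0 = CBOW, 1 = Skip-gram
    epochs=100
)

print(cbow_model.wv["learning"][:10])  # first 10 values of the word vector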
What clicked: FastText is useful because it does not only learn word-level patterns. It also learns from parts of words.
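
To see which "parts of words" are shared, here is a small sketch that extracts character n-grams the way FastText describes them: the word is wrapped in < and > boundary markers and every n-gram from length 3 to 6 is collected. The function is my own illustration, not gensim code.

```python
def char_ngrams(word, min_n=3, max_n=6):
    """Collect all character n-grams of the word wrapped in boundary markers."""
    wrapped = f"<{word}>"
    grams = set()
    for n in range(min_n, max_n + 1):
        for i in range(len(wrapped) - n + 1):
            grams.add(wrapped[i:i + n])
    return grams


shared = char_ngrams("playing") & char_ngrams("played")
print(sorted(shared))
```

Because playing and played share pieces such as play, their vectors are built partly from the same subword vectors. The same mechanism lets FastText assemble a vector for a word it never saw during training.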

08 Sentence Embeddings

Word embeddings represent individual words, but many NLP tasks need sentence-level meaning. That is where sentence embeddings come in. A sentence embedding is a fixed-length numerical representation of a full sentence or document.

I tried average sentence embeddings first. The idea is simple: take all word vectors in a sentence and calculate their average.

Python
import spacy
import numpy as np

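# en_core_web_sm ships without static word vectors, so load a model that
# includes them (install with: python -m spacy download en_core_web_md).
nlp = spacy.load("en_core_web_md")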

def average_embedding(sentence):
    doc = nlp(sentence)
    vectors = []

    for token in doc:
        if token.has_vector:
            vectors.append(token.vector)

    if vectors:
        return np.mean(vectors, axis=0)

    return np.zeros(nlp.vocab.vectors_length)

Average embeddings are simple and fast, but they treat all words equally. In many cases, not every word should have the same importance.

TF-IDF Weighted Sentence Embedding

To improve simple averaging, I also tried TF-IDF weighted sentence embeddings. Here, important words get more weight while common words get lower influence.

Python
from sklearn.feature_extraction.text import TfidfVectorizer

# corpus is the list of sentences/documents used elsewhere in the notebook.
tfidf = TfidfVectorizer()
tfidf.fit(corpus)

def tfidf_weighted_embedding(sentence):
    tfidf_scores = tfidf.transform([sentence]).toarray()[0]

    doc = nlp(sentence)
    vectors = []
    weights = []

    for token in doc:
        # vocabulary_ maps each term to its column index (dict lookup, no list scan).
        index = tfidf.vocabulary_.get(token.text.lower())

        if index is not None and token.has_vector and tfidf_scores[index] > 0:
            vectors.append(token.vector)
            weights.append(tfidf_scores[index])

    # Only weights greater than zero are kept, so np.average cannot divide by zero.
    if vectors:
        return np.average(vectors, axis=0, weights=weights)

    return np.zeros(nlp.vocab.vectors_length)
Limitation: TF-IDF weighted embeddings improve word importance, but they still do not deeply understand context like transformer-based sentence embeddings.

09 Semantic Search Using Embeddings

One practical task I liked was semantic search. Instead of matching exact words, we convert the query and documents into vectors, then compare them using cosine similarity.

This is a big jump from keyword search. A query like visualizing data can match a document like Data visualization techniques even if the exact wording is not the same.

Python
from sklearn.metrics.pairwise import cosine_similarity

query = "visualizing data"
query_vec = get_sentence_vector(query)

scores = []

for doc in corpus:
    doc_vec = get_sentence_vector(doc)
    score = cosine_similarity(
        [query_vec],
        [doc_vec]
    )[0][0]
    scores.append(score)
What I learned: semantic search depends on vector similarity, not just exact word overlap.
Figure: Semantic search compares query and document embeddings so related meanings can match even when the exact words are different.
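
Scores alone are not a search result yet; the last step is ranking. This is a minimal sketch of that step. The documents and scores are hypothetical stand-ins for what a scoring loop would produce, and top_k is a helper name I made up.

```python
def top_k(documents, scores, k=2):
    """Return the k (document, score) pairs with the highest scores."""
    ranked = sorted(zip(documents, scores), key=lambda pair: pair[1], reverse=True)
    return ranked[:k]


# Hypothetical documents and similarity scores, invented for illustration.
documents = [
    "Data visualization techniques",
    "Cooking pasta at home",
    "Plotting charts with Python",
]
scores = [0.82, 0.11, 0.64]

results = top_k(documents, scores)
for document, score in results:
    print(f"{score:.2f}  {document}")
```

For a large corpus it is worth computing the document vectors once and reusing them, since only the query vector changes between searches.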

10 PCA, SVD and Visualizing Embeddings

Embedding vectors can have many dimensions, like 50, 100 or 300. Humans cannot directly visualize that many dimensions. That is why PCA and SVD are useful. They reduce the dimensions so embeddings can be plotted in 2D or 3D.

In the notebook, I used PCA to reduce sentence embeddings and word embeddings. This helped me see that embeddings are not just abstract arrays. They can be projected and visualized to observe clusters or similarity patterns.

Python
from sklearn.decomposition import PCA

sentence_embeddings = np.array(
    [average_embedding(sent) for sent in sentences]
)

pca = PCA(n_components=2)
reduced_embeddings = pca.fit_transform(sentence_embeddings)
My takeaway: PCA does not create embeddings. It helps inspect or reduce existing embeddings.
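
PCA and SVD are closely connected: PCA can be computed by centering the data and taking its SVD. The sketch below shows that connection with NumPy on random toy data standing in for real embeddings. The function name is my own, and the signs of the components may be flipped compared with scikit-learn's PCA.

```python
import numpy as np


def pca_via_svd(embeddings, n_components=2):
    """Project embeddings onto their top principal directions using SVD."""
    # PCA works on mean-centered data.
    centered = embeddings - embeddings.mean(axis=0)

    # Rows of `components` are the principal directions, strongest first.
    _, singular_values, components = np.linalg.svd(centered, full_matrices=False)

    reduced = centered @ components[:n_components].T
    explained = singular_values**2 / np.sum(singular_values**2)
    return reduced, explained[:n_components]


# Random toy data standing in for 6 sentence embeddings of size 50.
rng = np.random.default_rng(0)
toy_embeddings = rng.normal(size=(6, 50))

reduced, explained = pca_via_svd(toy_embeddings)
print(reduced.shape)
print(explained)
```

The explained values show how much of the total variance the 2D plot keeps. When they are small, the plot hides most of the structure and should be read with care.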

11 Contextual Embeddings

Up to this point, many embedding methods were static. That means a word gets the same vector even when its meaning changes in different sentences.

This creates a problem. The word left can mean a direction in one sentence and an action in another. Static embeddings struggle here because the vector for the word stays the same in both sentences.

Contextual embeddings solve this using models like transformers. They create representations based on the surrounding words. This is where self-attention becomes important because the model can decide which words matter for the current meaning.

Static Embedding

  • same word gets same representation
  • works well for many similarity tasks
  • struggles with multiple meanings
  • example: Word2Vec, GloVe

Contextual Embedding

  • word meaning depends on sentence
  • uses surrounding context
  • better for modern NLP tasks
  • example: BERT, SBERT
Python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")

sentence = "The dog runs fast"
embedding = model.encode(sentence)

print(embedding[:5])
What clicked: contextual embeddings are one bridge from traditional NLP toward transformers and modern GenAI systems.

12 Document Embeddings and Doc2Vec

Word embeddings represent words. Sentence embeddings represent sentences. Document embeddings represent larger pieces of text like reviews, articles or documents.

I also included Doc2Vec in this topic because it extends the embedding idea to full documents. Even though my Doc2Vec notebook is still empty, the concept belongs naturally here because it explains how we can represent entire documents as vectors.

  • Word embedding: Represents individual words as dense vectors.
  • Sentence embedding: Represents full sentences as fixed-length vectors.
  • Document embedding: Represents long text or full documents for search, clustering and classification.
Learning note: this part needs more implementation from my side. I understood the purpose of Doc2Vec, but I still need to complete the notebook properly.

13 Embeddings in Real NLP Tasks

The best part of these notebooks was seeing embeddings used in tasks, not only printed as vectors. I tried document rating, document clustering and text similarity search using Word2Vec-style document vectors.

Task | How Embeddings Help | Notebook Idea
Semantic search | finds documents with similar meaning | query vector compared with document vectors
Text clustering | groups similar documents together | KMeans on averaged embedding vectors
Document rating | predicts a numerical score from text | Ridge regression on sentence vectors
Summary checking | compares original text and summary meaning | cosine similarity between vector sets

This helped me understand why embeddings are so important. They are not the final model by themselves, but they become a powerful input representation for many downstream NLP tasks.

14 Final Comparison of Embedding Methods

After arranging the notebooks together, this is how I now compare the major embedding methods.

Method | Main Idea | Limitation
Word2Vec | learns word vectors from context prediction | static embedding
GloVe | learns from global co-occurrence patterns | still static
FastText | uses subword information | not fully contextual
Average sentence embedding | averages word vectors | treats all words equally
TF-IDF weighted embedding | uses word importance as weights | limited deep context
SBERT | creates contextual sentence embeddings | requires pretrained transformer model
My final understanding: embeddings are not one method. They are a family of representation techniques that move NLP from sparse counts toward dense meaning.

15 GitHub Notebook Connection

This blog explains what I understood from the embedding notebooks. The implementation side is connected to my NLP by Vinod GitHub repository.

NLP by Vinod GitHub Repository

Notebook references: 2_06_sentence_embeddings.ipynb, 2_08_word_embeddings_pretrained_GLOVE.ipynb, 2_09_co_occurrence_matrix_GLOVE.ipynb, 2_10_Word_embeddings_pretrained_Word2Vec.ipynb, 2_11_Word_embeddings_custom_Word2Vec.ipynb, and 2_12_doc2vec.ipynb.

16 What Comes Next in the NLP Journey

The next topic is NLP Libraries. After understanding preprocessing, count-based features and embeddings, I now want to organize the practical tools used in NLP workflows.

  • NLTK: Useful for tokenization, stopwords, stemming, lemmatization and classic NLP utilities.
  • spaCy: Useful for production-style NLP pipelines, tokenization, POS tagging, NER and linguistic features.
  • Gensim: Useful for topic modeling, Word2Vec, FastText and vector-based text representation.

Sparse features count words. Embeddings move closer to meaning.

This topic helped me understand why NLP moved from count-based features to dense vectors like Word2Vec, GloVe, FastText and contextual sentence embeddings.
