NLP by Vinod - Deep Learning

Sequence Models

RNN for Sequence Modeling - Why Neural Networks Need Memory.

After PyTorch and ANN training, I moved on to Recurrent Neural Networks (RNNs) to understand why plain ANN and CNN models fall short when the input is sequential data like text.


RNNs became the next topic in my NLP journey because text is not just a collection of independent words. Text has order. The meaning of a word depends on the words before it, and sometimes on the context of the full sentence. This is where I started seeing the limits of ANN or CNN style thinking for sequential data.

My rough understanding is this: an ANN works well for tabular data, and CNNs are strong for grid-like data such as images. But when the data is sequential, like text, speech or time series such as stock prices, the model needs some way to remember previous steps. A Recurrent Neural Network does this by using a hidden state that is passed from one time step to the next.

In the notebooks, I first tried to understand the RNN forward pass from scratch using NumPy. Then I moved to PyTorch and built small RNN examples for sentiment analysis, machine translation style sequence processing and a simple question-answering system.

What clicked for me:
ANN looks at fixed features. RNN reads a sequence step by step and carries memory through hidden states.

Figure: RNN sequence model workflow showing text tokens, hidden states and the final prediction. RNNs process sequential data one step at a time, carrying information forward through hidden states.

01 Why I Needed RNN After ANN and CNN

While learning neural networks, I understood that different data types need different model thinking. Tabular data can often go into ANN. Images are grid-like, so CNNs make sense. But text is ordered. If I change the order of words, the meaning can change.

Data Type	Common Model	Why
Tabular	ANN	works with fixed feature columns
Images	CNN	captures local spatial patterns
Text	RNN	uses word order and previous context
Speech	RNN style models	signal changes over time
Time series	Sequence models	past values influence future values

This comparison helped me understand the real reason behind RNN. It is not just another neural network architecture. It is made for inputs where order matters.

My simple definition: RNN is a neural network for sequential data where the model uses a hidden state to remember information from previous time steps.
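To see why order matters, here is a minimal sketch that compares an order-free view of two sentences with their ordered token sequences. The two sentences and the bag_of_words helper are made up for illustration and are not from my notebooks.

```python
from collections import Counter


def bag_of_words(sentence):
    """Count words while ignoring their order (a fixed-feature view of text)."""
    return Counter(sentence.lower().split())


a = "dog bites man"
b = "man bites dog"

# Both sentences contain the same words, so the order-free features are identical.
same_features = bag_of_words(a) == bag_of_words(b)

# The ordered token sequences differ, and that is what a sequence model reads.
same_sequence = a.split() == b.split()

print("same bag-of-words features:", same_features)
print("same token sequence:", same_sequence)
```

A model that only sees the word counts cannot tell these two sentences apart, while a model that reads the tokens in order can.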

02 The Main Idea of Hidden State

The hidden state is the most important idea in a simple RNN. At each time step, the model takes the current input and the previous hidden state. Then it creates a new hidden state.

In a sentence like "I love NLP", the model does not read the words as unrelated features. It reads "I", carries some information forward to "love", then carries the updated information to "NLP".

Token 1: I

Hidden 1: memory after first word

Token 2: love

Hidden 2: updated memory

Prediction: sentiment or class

This flow made RNN easier for me. The same cell is reused at every time step, and the hidden state becomes the memory of what has been seen so far.

What clicked: the hidden state is not the final answer. It is the running summary of the sequence so far.
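The "running summary" idea can be checked with a small NumPy sketch. The sizes, the random weights and the names W_xh, rnn_step and run_rnn are assumptions for illustration. The point is that the hidden state after two tokens depends only on those two tokens, not on anything that comes later.

```python
import numpy as np

rng = np.random.default_rng(0)

input_size, hidden_size = 2, 3  # assumed toy sizes

# One set of weights, reused at every time step (placeholder random values)
W_xh = rng.normal(scale=0.5, size=(hidden_size, input_size))
W_hh = rng.normal(scale=0.5, size=(hidden_size, hidden_size))
b_h = np.zeros(hidden_size)


def rnn_step(x_t, h_prev):
    """One recurrent step: combine the current input with the previous memory."""
    return np.tanh(W_xh @ x_t + W_hh @ h_prev + b_h)


def run_rnn(inputs):
    """Apply the same cell to every input and collect each hidden state."""
    h = np.zeros(hidden_size)
    states = []
    for x_t in inputs:
        h = rnn_step(x_t, h)
        states.append(h)
    return states


tokens = [np.array([0.1, 0.2]), np.array([0.5, 0.2]), np.array([0.3, 0.7])]

full_states = run_rnn(tokens)        # read all three tokens
prefix_states = run_rnn(tokens[:2])  # read only the first two

# The summary after two tokens is the same in both runs.
print(np.allclose(full_states[1], prefix_states[1]))
```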

03 RNN Forward Pass from Scratch

In the first notebook, I created a very small sentence with word vectors: I, love and NLP. Then I created input weights, hidden weights, output weights and biases manually using NumPy.

The forward pass became clearer when I wrote it in small steps. First, I initialized the hidden state. Then for every word, I calculated a new hidden state using the current input vector and the previous hidden state. After the final word, I used the last hidden state to make a prediction.

Python

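import numpy as np

# Toy 2-dimensional word vectors for the example sentence
word_vectors = {
    "I": np.array([0.1, 0.2]),
    "love": np.array([0.5, 0.2]),
    "NLP": np.array([0.3, 0.7])
}

sentence = ["I", "love", "NLP"]

# Weights and bias (placeholder random values, not the exact notebook numbers)
rng = np.random.default_rng(0)
W_i = rng.normal(scale=0.5, size=(2, 2))   # input-to-hidden weights
W_hh = rng.normal(scale=0.5, size=(2, 2))  # hidden-to-hidden weights
b_h = np.zeros(2)                          # hidden bias

# The hidden state starts as zeros before the first word
h_prev = np.zeros(2)

for word in sentence:
    x_t = word_vectors[word]
    # New hidden state from the current input and the previous hidden state
    h_t = np.tanh(
        np.dot(W_i, x_t) +
        np.dot(W_hh, h_prev) +
        b_h
    )
    h_prev = h_t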

This small example was important because it showed the mathematical heart of RNN without hiding everything inside PyTorch.

Notebook lesson: when learning RNN, the shapes are easy to confuse. Input vector size, hidden size and output size must match the matrix operations.
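Here is a small sketch of how the shapes have to line up, including the output step that turns the last hidden state into a prediction. The sizes, the random values and the names W_xh, W_hy and b_y are assumptions chosen for the example, not the notebook's exact values.

```python
import numpy as np

input_size, hidden_size, output_size = 2, 4, 1  # assumed toy sizes

rng = np.random.default_rng(42)
W_xh = rng.normal(size=(hidden_size, input_size))   # (4, 2): input -> hidden
W_hh = rng.normal(size=(hidden_size, hidden_size))  # (4, 4): hidden -> hidden
W_hy = rng.normal(size=(output_size, hidden_size))  # (1, 4): hidden -> output
b_h = np.zeros(hidden_size)
b_y = np.zeros(output_size)

x_t = np.array([0.3, 0.7])      # one word vector, shape (2,)
h_prev = np.zeros(hidden_size)  # previous hidden state, shape (4,)

# (4, 2) @ (2,) -> (4,) and (4, 4) @ (4,) -> (4,), so the sum is valid
h_t = np.tanh(W_xh @ x_t + W_hh @ h_prev + b_h)

# (1, 4) @ (4,) -> (1,): one raw score, squashed to a probability with sigmoid
y = W_hy @ h_t + b_y
prob = 1.0 / (1.0 + np.exp(-y))

print(h_t.shape, y.shape, prob.shape)
```

If the hidden size is changed, W_xh, W_hh, W_hy and b_h all have to change with it, which is where most of my shape errors came from.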

04 Backpropagation Through Time

After the forward pass, I also touched on Backpropagation Through Time (BPTT). This was difficult because the same RNN weights are used repeatedly across time steps. So during training, the model has to calculate how each time step contributed to the final loss.

My understanding is that BPTT is normal backpropagation applied across the unfolded sequence. If a sentence has three tokens, the RNN can be imagined as three repeated cells sharing the same weights.

Forward Pass

read token by token
update hidden state
create final prediction
calculate loss

Backward Pass

start from the loss
move through time steps
calculate gradients
update shared weights

What I understood: RNN training is harder than simple ANN training because the model has repeated computation across time.
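A minimal sketch of this idea using PyTorch autograd: the loop below unrolls a tiny RNN over three time steps with shared weights, and one backward call sends the gradient back through every step. The sizes, random weights, inputs and target are assumptions for illustration, and the biases are left out to keep it short.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)

input_size, hidden_size = 2, 3  # assumed toy sizes

# Shared weights: the same tensors are used at every time step
W_xh = torch.randn(hidden_size, input_size, requires_grad=True)
W_hh = torch.randn(hidden_size, hidden_size, requires_grad=True)
W_hy = torch.randn(1, hidden_size, requires_grad=True)

inputs = torch.tensor([[0.1, 0.2], [0.5, 0.2], [0.3, 0.7]])  # three time steps
target = torch.tensor([1.0])                                 # assumed label

# Forward pass: unroll the recurrence token by token
h = torch.zeros(hidden_size)
for x_t in inputs:
    h = torch.tanh(W_xh @ x_t + W_hh @ h)

# Prediction and loss from the final hidden state
logit = W_hy @ h
loss = F.binary_cross_entropy_with_logits(logit, target)

# Backward pass: autograd walks back through all three steps and
# accumulates each step's contribution into the shared weight gradients
loss.backward()

print(W_hh.grad.shape)
print(W_xh.grad.shape)
```

Each shared weight ends up with a single gradient tensor, even though it took part in the computation at several time steps.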

05 RNN for Sentiment Analysis

After the from-scratch part, I moved to a PyTorch RNN sentiment analysis example. I used small sentences such as "i love it", "i hate it", "so good", "very bad", "awesome movie" and "worst ever".

The workflow was similar to what I learned earlier in NLP: tokenize text, build vocabulary, convert words to indices, pad sequences, create tensors, build an embedding layer, use RNN and train with a loss function.

Python

import torch
import torch.nn as nn

class RNNModel(nn.Module):
    def __init__(self, vocab_size):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embedding_dim=8)
        self.rnn = nn.RNN(8, 16, batch_first=True)
        self.fc = nn.Linear(16, 1)

    def forward(self, x):
        x = self.embedding(x)             # (batch, seq_len, 8)
        output, hidden = self.rnn(x)      # hidden: (1, batch, 16)
        final_hidden = hidden.squeeze(0)  # (batch, 16)
        logits = self.fc(final_hidden)    # (batch, 1): one raw score per sentence
        return logits

This example helped me connect embeddings with RNN. The RNN does not directly understand raw words. Words first become indices, then embeddings, and then those embedding vectors are processed step by step.
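The steps that come before the model (tokenize, build vocabulary, convert to indices, pad) can be sketched in plain Python. The helper names and the <pad> and <unk> tokens are assumptions for illustration, and the notebook code may differ in detail.

```python
sentences = ["i love it", "i hate it", "so good",
             "very bad", "awesome movie", "worst ever"]

PAD, UNK = "<pad>", "<unk>"  # assumed special tokens


def build_vocab(texts):
    """Give every new word the next free index; 0 and 1 are reserved."""
    vocab = {PAD: 0, UNK: 1}
    for text in texts:
        for token in text.lower().split():
            if token not in vocab:
                vocab[token] = len(vocab)
    return vocab


def encode(text, vocab):
    """Convert a sentence into a list of vocabulary indices."""
    return [vocab.get(token, vocab[UNK]) for token in text.lower().split()]


def pad_batch(sequences, pad_id=0):
    """Right-pad every sequence to the length of the longest one."""
    max_len = max(len(seq) for seq in sequences)
    return [seq + [pad_id] * (max_len - len(seq)) for seq in sequences]


vocab = build_vocab(sentences)
encoded = [encode(s, vocab) for s in sentences]
padded = pad_batch(encoded)

print(padded[0])  # "i love it"
print(padded[2])  # "so good" plus one padding index
```

Because every row of padded has the same length, it can be wrapped in torch.tensor(padded) and passed to the embedding layer as one batch.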

Figure: RNN sentiment analysis workflow showing tokenization, vocabulary, embeddings, hidden state and output prediction. In sentiment analysis, words are converted into embeddings and then passed through RNN hidden states before the final prediction.

06 RNN for Machine Translation Style Sequence Processing

I also tried a small machine translation style example using dummy English and French sentence pairs. The idea was not to build a real translator. The goal was to see how one sequence can be mapped to another sequence.

I built separate vocabularies for English and French, converted both sides into indices, padded them, passed the English sentence through an RNN and predicted output tokens.

Input Side

English sentence
tokenization
English vocabulary
integer sequence
embedding vectors

Output Side

French sentence
French vocabulary
target tokens
cross entropy loss
sequence prediction
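A rough sketch of how the two sides can meet in code: the RNN output at each time step is projected onto the target vocabulary and compared with the target token using cross entropy. The vocabulary sizes, the index values and the TinySeqModel name are made up for illustration. The sketch also assumes that source and target are padded to the same length, which is a simplification and not how a real translator works.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

SRC_VOCAB_SIZE, TGT_VOCAB_SIZE = 10, 12  # assumed toy vocabulary sizes
PAD_ID = 0                               # assumed padding index on both sides


class TinySeqModel(nn.Module):
    """Hypothetical toy model: predicts one target token per source position."""

    def __init__(self):
        super().__init__()
        self.embedding = nn.Embedding(SRC_VOCAB_SIZE, 8, padding_idx=PAD_ID)
        self.rnn = nn.RNN(8, 16, batch_first=True)
        self.fc = nn.Linear(16, TGT_VOCAB_SIZE)

    def forward(self, src):
        embedded = self.embedding(src)  # (batch, seq_len, 8)
        output, _ = self.rnn(embedded)  # (batch, seq_len, 16)
        return self.fc(output)          # (batch, seq_len, TGT_VOCAB_SIZE)


# Dummy index sequences standing in for padded English and French sentences
src = torch.tensor([[1, 2, 3, 0], [4, 5, 0, 0]])
tgt = torch.tensor([[3, 4, 5, 0], [6, 7, 0, 0]])

model = TinySeqModel()
logits = model(src)

# CrossEntropyLoss expects (N, classes) and (N,), so flatten batch and time.
# ignore_index skips the padded target positions.
criterion = nn.CrossEntropyLoss(ignore_index=PAD_ID)
loss = criterion(logits.reshape(-1, TGT_VOCAB_SIZE), tgt.reshape(-1))

print(logits.shape)
print(loss.item())
```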

Important limitation: this was a tiny learning example. Real machine translation needs a better architecture (typically an encoder-decoder), more data, teacher forcing and attention, and today mostly transformer-based models.

07 Building a Simple RNN-Based QA System

The second notebook was a small question-answering system using an RNN. I loaded a CSV dataset of question-answer pairs, tokenized both questions and answers, built vocabulary, converted text into numerical indices and created a custom PyTorch dataset.

This connected many earlier topics together: text preprocessing, vocabulary, indexing, Dataset, DataLoader, embeddings and sequence modeling.

Python

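import torch
from torch.utils.data import Dataset

class QADataset(Dataset):
    def __init__(self, df, vocab):
        self.df = df        # DataFrame with "question" and "answer" columns
        self.vocab = vocab  # word -> index mapping

    def __len__(self):
        return self.df.shape[0]

    def __getitem__(self, index):
        # text_to_indices is the notebook helper that maps text to vocabulary indices
        row = self.df.iloc[index]
        question = text_to_indices(row["question"], self.vocab)
        answer = text_to_indices(row["answer"], self.vocab)
        return torch.tensor(question), torch.tensor(answer)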

Then I created a simple RNN model with an embedding layer, an RNN layer and a linear layer that predicts a vocabulary word from the final hidden state.

Python

import torch
import torch.nn as nn

class SimpleRNN(nn.Module):
    def __init__(self, vocab_size):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embedding_dim=50)
        self.rnn = nn.RNN(50, 64, batch_first=True)
        self.fc = nn.Linear(64, vocab_size)

    def forward(self, question):
        embedded_question = self.embedding(question)  # (batch, seq_len, 50)
        # nn.RNN returns the outputs for all time steps and the final hidden state
        output, final_hidden = self.rnn(embedded_question)
        logits = self.fc(final_hidden.squeeze(0))     # (batch, vocab_size)
        return logits

What clicked: the RNN reads the question sequence and the final hidden state is used as a summary to predict an answer token.

08 Training Loop for the QA Model

The training loop followed the same PyTorch pattern I learned in the previous post: clear gradients, run forward pass, calculate loss, run backward pass and update parameters.

Python

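# Assumes model, criterion, optimizer, dataloader and epochs are already defined
for epoch in range(epochs):
    total_loss = 0

    for question, answer in dataloader:
        optimizer.zero_grad()  # clear gradients from the previous step

        output = model(question)  # (batch, vocab_size)
        # With batch_size=1, answer has shape (1, answer_len). answer[0] matches
        # the output only when each answer is a single token (assumed here).
        loss = criterion(output, answer[0])

        loss.backward()   # backpropagate through time
        optimizer.step()  # update the shared weights

        total_loss += loss.item()

    print(f"Epoch {epoch + 1}, loss: {total_loss:.4f}")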

I also wrote a prediction function that converts a new question into indices, sends it to the model, applies softmax and returns the word with the highest probability.

Notebook lesson: this QA system is simple, but it helped me understand the end-to-end flow from raw question text to indexed sequence to RNN output.
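A hedged sketch of what such a prediction function can look like. The toy vocabulary, the example question, the whitespace tokenizer and the helper bodies are assumptions for illustration. The model here is untrained, so the predicted word is arbitrary and only the flow matters.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Assumed toy vocabulary; the real one is built from the CSV dataset
vocab = {"<unk>": 0, "what": 1, "is": 2, "the": 3,
         "capital": 4, "of": 5, "france": 6, "paris": 7}
index_to_word = {index: word for word, index in vocab.items()}


class SimpleRNN(nn.Module):
    def __init__(self, vocab_size):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embedding_dim=50)
        self.rnn = nn.RNN(50, 64, batch_first=True)
        self.fc = nn.Linear(64, vocab_size)

    def forward(self, question):
        embedded = self.embedding(question)
        _, final_hidden = self.rnn(embedded)
        return self.fc(final_hidden.squeeze(0))


def text_to_indices(text, vocab):
    """Hypothetical stand-in for the notebook helper: whitespace tokens to indices."""
    return [vocab.get(token, vocab["<unk>"]) for token in text.lower().split()]


def predict(model, question, vocab):
    """Return the most probable vocabulary word and its softmax probability."""
    indices = text_to_indices(question, vocab)
    batch = torch.tensor(indices).unsqueeze(0)  # add batch dimension: (1, seq_len)

    model.eval()
    with torch.no_grad():
        logits = model(batch)                 # (1, vocab_size)
        probs = torch.softmax(logits, dim=1)  # probabilities over the vocabulary

    confidence, index = torch.max(probs, dim=1)
    return index_to_word[index.item()], confidence.item()


model = SimpleRNN(len(vocab))
word, confidence = predict(model, "what is the capital of france", vocab)
print(word, round(confidence, 3))
```

Returning the probability together with the word makes it possible to add a confidence threshold later and answer "I don't know" when the model is unsure.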

09 Mistakes and Confusions I Noticed

RNN looked simple conceptually, but implementation had several places where I could get confused. Most confusion came from shapes and from understanding what the RNN returns.

Confusions

difference between output and final hidden state
batch dimension versus sequence dimension
why padding is needed for unequal sentence lengths
why final hidden state is used for classification
why raw words cannot go directly into RNN

Better Thinking

text first becomes tokens
tokens become indices
indices become embeddings
RNN creates hidden states
final layer makes prediction
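The first two confusions can be cleared up by printing shapes. This sketch uses a random batch with assumed sizes and a single-layer, one-directional nn.RNN like the ones in this post.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

batch_size, seq_len, embed_dim, hidden_size = 4, 5, 8, 16  # assumed sizes

rnn = nn.RNN(input_size=embed_dim, hidden_size=hidden_size, batch_first=True)
x = torch.randn(batch_size, seq_len, embed_dim)  # (batch, seq_len, features)

output, hidden = rnn(x)

# output holds the hidden state at every time step: (batch, seq_len, hidden)
print(output.shape)
# hidden holds only the last hidden state: (num_layers, batch, hidden)
print(hidden.shape)

# For this setup, the last time step of output equals the final hidden state
last_step = output[:, -1, :]
final_hidden = hidden.squeeze(0)
same = torch.allclose(last_step, final_hidden)
print(same)
```

One caveat: with padded batches, the last time step of a short sentence is a padding position, so the final hidden state has also read the padding. PyTorch offers torch.nn.utils.rnn.pack_padded_sequence for this case, which I have not used in these small examples.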

Figure: Common RNN implementation mistakes around tensor shapes, hidden states, padding and sequence dimensions. The hardest part of RNN implementation was not the syntax, but understanding tensor shapes, hidden states, padding and sequence dimensions.

10 Limitations of Simple RNN

After building simple RNN examples, I also understood why RNN is not the final answer for sequence modeling. A basic RNN can struggle with long sequences because information from earlier time steps fades as it passes through many hidden-state updates.

This connects directly to the next topic: LSTM. LSTM was designed to handle the memory problem better by using gates. So RNN is the correct foundation, but LSTM explains why simple recurrence was not enough.

Simple RNN Problems

struggles with long-term dependencies
can suffer from vanishing gradients
memory becomes weak over long sequences
not ideal for complex language tasks
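The vanishing gradient problem can be seen in a one-dimensional toy RNN. In this sketch the hidden state is a single number, and the recurrent weight 0.5 and the constant input 0.1 are assumed values. Each step multiplies the gradient by w * (1 - h_t^2), which is at most 0.5 here, so the product shrinks very fast.

```python
import numpy as np

w = 0.5                    # assumed recurrent weight
inputs = np.full(20, 0.1)  # twenty time steps with a constant input

h = 0.0
grad = 1.0  # running value of d h_t / d h_0
grads = []

for x_t in inputs:
    h = np.tanh(w * h + x_t)
    # Derivative of h_t with respect to h_(t-1) for a tanh cell
    local = w * (1.0 - h ** 2)
    grad *= local
    grads.append(grad)

print("after 1 step:  ", grads[0])
print("after 20 steps:", grads[-1])
```

After twenty steps the gradient with respect to the first hidden state is close to zero, so the early tokens barely influence the weight update.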

Why LSTM Comes Next

uses gates for memory control
keeps important information longer
handles longer context better
improves sequence modeling

My final caution: RNN helped me understand sequence modeling, but for stronger NLP systems, I need to move forward to LSTM, GRU and then transformers.

11 My Final Understanding

My final understanding is that RNNs are important because they build the idea of processing text as a sequence directly into the model. Instead of treating words as independent features, an RNN reads them one by one and keeps a hidden state as memory.

RNN is for ordered data

Text, speech and time series need models that respect sequence order.

Hidden state acts like memory

The model updates its memory at each time step while reading the sequence.

RNN still uses PyTorch basics

Embedding, loss, optimizer, backward pass and DataLoader are still part of the workflow.

Simple RNN has limits

Long context is difficult, so LSTM becomes the next natural topic.

12 GitHub Notebook Connection

This blog explains what I understood from my RNN notebooks. The implementation side is connected to the NLP by Vinod GitHub repository.

NLP by Vinod GitHub Repository

Notebook references: 01_RNN_from_scratch.ipynb and 02-rnn-based-qa-system.ipynb.


13 Related Reading

NLP learning roadmap

The roadmap that connects RNNs to the complete NLP by Vinod learning journey.

PyTorch for deep learning

The previous topic where I learned tensors, autograd, training loops, DataLoader, ANN training, GPU use and tuning.

NLP libraries

The earlier topic where I explored NLTK, spaCy, TextBlob and Stanza before entering deep learning.

14 What Comes Next in the NLP Journey

The next topic is LSTM. After understanding simple recurrence and hidden states, I now want to learn how LSTM improves memory using gates.

Vanishing gradients

Why simple RNN struggles when sequences become long.

LSTM gates

How forget, input and output gates control memory.

Better sequence learning

How LSTM handles longer context better than a simple RNN.


Vinod Codes | AI Engineering & Data Science

A structured public journey from NLP fundamentals to real-world AI systems.