RNN for Sequence Modeling - Why Neural Networks Need Memory
After PyTorch and ANN training, I moved to Recurrent Neural Networks to understand why normal ANN and CNN models are not enough when the input is ordered sequential data like text.
RNN in deep learning became the next topic in my NLP journey because text is not just a collection of independent words. Text has order. The meaning of a word depends on the words before it, and sometimes even the full sentence context matters. This is where I started seeing the limitation of using only ANN or CNN style thinking for sequential data.
My rough understanding is this: an ANN works well for tabular data, and CNNs are strong for grid-like data such as images. But when the data is sequential, like text, speech, time series or stock prices, the model needs some way to remember previous steps. A Recurrent Neural Network does this by using a hidden state that is passed from one time step to the next.
In the notebooks, I first tried to understand the RNN forward pass from scratch using NumPy. Then I moved to PyTorch and built small RNN examples for sentiment analysis, machine translation style sequence processing and a simple question-answering system.
ANN looks at fixed features. RNN reads a sequence step by step and carries memory through hidden states.
01 Why I Needed RNN After ANN and CNN
While learning neural networks, I understood that different data types need different model thinking. Tabular data can often go into ANN. Images are grid-like, so CNNs make sense. But text is ordered. If I change the order of words, the meaning can change.
| Data Type | Common Model | Why |
|---|---|---|
| Tabular | ANN | works with fixed feature columns |
| Images | CNN | captures local spatial patterns |
| Text | RNN | uses word order and previous context |
| Speech | RNN style models | signal changes over time |
| Time series | Sequence models | past values influence future values |
This comparison helped me understand the real reason behind RNN. It is not just another neural network architecture. It is made for inputs where order matters.
02 The Main Idea of Hidden State
The hidden state is the most important idea in a simple RNN. At each time step, the model takes the current input and the previous hidden state. Then it creates a new hidden state.
In a sentence like "I love NLP", the model does not read all words as unrelated features. It reads "I", then carries some information to "love", then carries updated information to "NLP".
This flow made RNN easier for me. The same cell is reused at every time step, and the hidden state becomes the memory of what has been seen so far.
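Written as an equation, the update described above can be sketched like this. The symbol names W_i, W_hh and b_h and the tanh activation are taken from the NumPy code later in this post; starting from a zero hidden state also follows that code.

```latex
h_t = \tanh\left( W_i \, x_t + W_{hh} \, h_{t-1} + b_h \right), \qquad h_0 = 0
```

Here x_t is the vector for the current word, h_{t-1} is the memory from the previous step, and the same W_i, W_hh and b_h are reused at every time step.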
03 RNN Forward Pass from Scratch
In the first notebook, I created a very small sentence with word vectors: I, love and NLP. Then I created input weights, hidden weights, output weights and biases manually using NumPy.
The forward pass became clearer when I wrote it in small steps. First, I initialized the hidden state. Then for every word, I calculated a new hidden state using the current input vector and the previous hidden state. After the final word, I used the last hidden state to make a prediction.
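```python
import numpy as np

# Toy 2-dimensional word vectors
word_vectors = {
    "I": np.array([0.1, 0.2]),
    "love": np.array([0.5, 0.2]),
    "NLP": np.array([0.3, 0.7]),
}
sentence = ["I", "love", "NLP"]

# Weights and bias (random placeholder values, sized for 2-d inputs and a 2-d hidden state)
rng = np.random.default_rng(0)
W_i = rng.normal(size=(2, 2))   # input-to-hidden weights
W_hh = rng.normal(size=(2, 2))  # hidden-to-hidden weights
b_h = np.zeros(2)               # hidden bias

h_prev = np.zeros(2)  # initial hidden state
for word in sentence:
    x_t = word_vectors[word]
    # New hidden state from the current input and the previous hidden state
    h_t = np.tanh(
        np.dot(W_i, x_t) +
        np.dot(W_hh, h_prev) +
        b_h
    )
    h_prev = h_t  # carry the memory to the next time step
```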
This small example was important because it showed the mathematical heart of RNN without hiding everything inside PyTorch.
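The loop above stops at the hidden state, so here is a minimal sketch of the last step, where the final hidden state is turned into a prediction. The names `W_hy` and `b_y`, their values, the stand-in `h_last` vector and the choice of a sigmoid output are all my assumptions for illustration, not values from the notebook.

```python
import numpy as np

def sigmoid(z):
    """Squash a raw score into the range (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# Stand-in for the hidden state left in h_prev after the last word
h_last = np.array([0.2, -0.4])

# Hypothetical output weights and bias (illustrative values only)
W_hy = np.array([[0.7, -0.3]])  # shape (1, 2): hidden state -> one score
b_y = np.array([0.1])

score = np.dot(W_hy, h_last) + b_y  # raw score, shape (1,)
y_hat = sigmoid(score)              # can be read as a probability
print(y_hat)
```

Only the last hidden state is used here because, by the end of the sentence, it is the one that has seen every word.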
04 Backpropagation Through Time
After the forward pass, I also touched on Backpropagation Through Time. This was difficult because the same RNN weights are used repeatedly across time steps. So during training, the model has to calculate how each time step contributed to the final loss.
My understanding is that BPTT is normal backpropagation applied across the unfolded sequence. If a sentence has three tokens, the RNN can be imagined as three repeated cells sharing the same weights.
Forward Pass
- read token by token
- update hidden state
- create final prediction
- calculate loss
Backward Pass
- start from the loss
- move through time steps
- calculate gradients
- update shared weights
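To make the backward pass concrete, here is a small NumPy sketch of BPTT for the hidden-to-hidden weights only. It assumes a squared-error loss on the final hidden state and a made-up target, which is a simplification I chose for illustration. The gradient is accumulated into one shared matrix while walking backwards through the time steps, and then checked against a finite-difference estimate.

```python
import numpy as np

rng = np.random.default_rng(0)
W_i = rng.normal(scale=0.5, size=(2, 2))   # input-to-hidden weights
W_hh = rng.normal(scale=0.5, size=(2, 2))  # shared hidden-to-hidden weights
b_h = np.zeros(2)

xs = [np.array([0.1, 0.2]), np.array([0.5, 0.2]), np.array([0.3, 0.7])]
target = np.array([1.0, 0.0])  # hypothetical target for the final hidden state

def forward(W):
    """Run the RNN with hidden weights W; return the loss and all hidden states."""
    h = np.zeros(2)
    hs = [h]
    for x_t in xs:
        h = np.tanh(W_i @ x_t + W @ h + b_h)
        hs.append(h)
    loss = 0.5 * np.sum((h - target) ** 2)
    return loss, hs

loss, hs = forward(W_hh)

# Backward pass: start from the loss and move back through the time steps
grad_W_hh = np.zeros_like(W_hh)
dh = hs[-1] - target                      # dL/dh at the last step
for t in range(len(xs), 0, -1):
    da = dh * (1.0 - hs[t] ** 2)          # back through tanh
    grad_W_hh += np.outer(da, hs[t - 1])  # this step's share for the shared weights
    dh = W_hh.T @ da                      # pass the gradient to the previous step

# Finite-difference check of the same gradient
eps = 1e-6
num_grad = np.zeros_like(W_hh)
for i in range(2):
    for j in range(2):
        W_plus, W_minus = W_hh.copy(), W_hh.copy()
        W_plus[i, j] += eps
        W_minus[i, j] -= eps
        num_grad[i, j] = (forward(W_plus)[0] - forward(W_minus)[0]) / (2 * eps)

print(np.allclose(grad_W_hh, num_grad, atol=1e-6))
```

The `+=` line is the part that matters for my understanding: every time step adds its own contribution to the same weight matrix, because the weights are shared.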
05 RNN for Sentiment Analysis
After the from-scratch part, I moved to a PyTorch RNN sentiment analysis example. I used small sentences such as "i love it", "i hate it", "so good", "very bad", "awesome movie" and "worst ever".
The workflow was similar to what I learned earlier in NLP: tokenize text, build vocabulary, convert words to indices, pad sequences, create tensors, build an embedding layer, use RNN and train with a loss function.
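The preprocessing part of that workflow can be sketched as below, using the example sentences from this post. The labels, the `<pad>` and `<unk>` tokens and the whitespace tokenizer are my assumptions for illustration.

```python
import torch

sentences = ["i love it", "i hate it", "so good", "very bad", "awesome movie", "worst ever"]
labels = [1, 0, 1, 0, 1, 0]  # assumed labels: 1 = positive, 0 = negative

# 1. Tokenize with a simple whitespace split
tokenized = [s.split() for s in sentences]

# 2. Build the vocabulary; index 0 is reserved for padding, 1 for unknown words
vocab = {"<pad>": 0, "<unk>": 1}
for tokens in tokenized:
    for tok in tokens:
        if tok not in vocab:
            vocab[tok] = len(vocab)

# 3. Convert words to indices
indexed = [[vocab.get(tok, vocab["<unk>"]) for tok in tokens] for tokens in tokenized]

# 4. Pad every sequence to the longest length so they fit in one tensor
max_len = max(len(seq) for seq in indexed)
padded = [seq + [vocab["<pad>"]] * (max_len - len(seq)) for seq in indexed]

# 5. Create tensors
X = torch.tensor(padded)                       # shape (6, 3)
y = torch.tensor(labels, dtype=torch.float32)  # shape (6,)
print(X.shape, y.shape)
```

Padding is needed because a tensor must be rectangular, while the sentences have two or three words each.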
```python
import torch.nn as nn

class RNNModel(nn.Module):
    def __init__(self, vocab_size):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embedding_dim=8)
        self.rnn = nn.RNN(8, 16, batch_first=True)
        self.fc = nn.Linear(16, 1)

    def forward(self, x):
        x = self.embedding(x)             # (batch, seq_len, 8)
        output, hidden = self.rnn(x)      # hidden: (1, batch, 16)
        final_hidden = hidden.squeeze(0)  # (batch, 16)
        logits = self.fc(final_hidden)    # (batch, 1)
        return logits
```
This example helped me connect embeddings with RNN. The RNN does not directly understand raw words. Words first become indices, then embeddings, and then those embedding vectors are processed step by step.
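A minimal training sketch for this kind of model could look like the following. The post only says the model is trained with a loss function, so `BCEWithLogitsLoss`, the Adam optimizer, the learning rate, the epoch count and the tiny hard-coded batch are all assumptions on my side.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

class RNNModel(nn.Module):
    def __init__(self, vocab_size):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embedding_dim=8)
        self.rnn = nn.RNN(8, 16, batch_first=True)
        self.fc = nn.Linear(16, 1)

    def forward(self, x):
        x = self.embedding(x)
        output, hidden = self.rnn(x)
        return self.fc(hidden.squeeze(0))  # one logit per sentence

# Already indexed and padded sentences (0 = padding); values are illustrative
X = torch.tensor([[2, 3, 4], [2, 5, 4], [6, 7, 0], [8, 9, 0]])
y = torch.tensor([[1.0], [0.0], [1.0], [0.0]])  # 1 = positive, 0 = negative

model = RNNModel(vocab_size=10)
criterion = nn.BCEWithLogitsLoss()  # assumed loss for a single-logit binary output
optimizer = torch.optim.Adam(model.parameters(), lr=0.01)

losses = []
for epoch in range(100):
    optimizer.zero_grad()
    logits = model(X)            # (4, 1)
    loss = criterion(logits, y)
    loss.backward()
    optimizer.step()
    losses.append(loss.item())

with torch.no_grad():
    probs = torch.sigmoid(model(X))  # probabilities of the positive class
print(losses[0], losses[-1])
```

`BCEWithLogitsLoss` applies the sigmoid internally, which is why the model returns raw logits and the sigmoid only appears at prediction time.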
06 RNN for Machine Translation Style Sequence Processing
I also tried a small machine translation style example using dummy English and French sentence pairs. The idea was not to build a real translator. The goal was to see how one sequence can be mapped to another sequence.
I built separate vocabularies for English and French, converted both sides into indices, padded them, passed the English sentence through an RNN and predicted output tokens.
Input Side
- English sentence
- tokenization
- English vocabulary
- integer sequence
- embedding vectors
Output Side
- French sentence
- French vocabulary
- target tokens
- cross entropy loss
- sequence prediction
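One simple way to wire the two sides together is sketched below: the RNN reads the source indices and a linear layer predicts one target token per position, trained with cross entropy that ignores padding. This is my simplified assumption of the setup (source and target padded to the same length, no separate decoder), and the class name `TinySeqModel`, the vocabulary sizes and the index values are made up for illustration.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
PAD = 0  # assumed padding index in both vocabularies

class TinySeqModel(nn.Module):
    """Hypothetical model: one target-token prediction per source position."""
    def __init__(self, src_vocab_size, tgt_vocab_size, emb_dim=8, hidden_dim=16):
        super().__init__()
        self.embedding = nn.Embedding(src_vocab_size, emb_dim, padding_idx=PAD)
        self.rnn = nn.RNN(emb_dim, hidden_dim, batch_first=True)
        self.fc = nn.Linear(hidden_dim, tgt_vocab_size)

    def forward(self, src):
        emb = self.embedding(src)   # (batch, seq_len, emb_dim)
        outputs, _ = self.rnn(emb)  # hidden state at every time step
        return self.fc(outputs)     # (batch, seq_len, tgt_vocab_size)

# Dummy index sequences standing in for English and French sentences
src = torch.tensor([[1, 2, 3, 0], [4, 5, 0, 0]])
tgt = torch.tensor([[1, 2, 3, 0], [4, 6, 0, 0]])

model = TinySeqModel(src_vocab_size=6, tgt_vocab_size=7)
logits = model(src)

# Flatten batch and time so each position is one classification example;
# positions whose target is PAD do not contribute to the loss
criterion = nn.CrossEntropyLoss(ignore_index=PAD)
loss = criterion(logits.reshape(-1, logits.size(-1)), tgt.reshape(-1))
print(logits.shape, loss.item())
```

Unlike the sentiment model, this sketch uses the hidden state at every time step, not only the final one.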
07 Building a Simple RNN-Based QA System
The second notebook was a small question-answering system using an RNN. I loaded a CSV dataset of question-answer pairs, tokenized both questions and answers, built vocabulary, converted text into numerical indices and created a custom PyTorch dataset.
This connected many earlier topics together: text preprocessing, vocabulary, indexing, Dataset, DataLoader, embeddings and sequence modeling.
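The dataset class in this section calls a helper named `text_to_indices`. A minimal sketch of how the tokenizing, vocabulary building and indexing could fit together is below. The helper bodies, the `<UNK>` token and the two example pairs are my assumptions, not the notebook's actual code or data.

```python
def tokenize(text):
    """Lowercase, drop question marks and apostrophes, split on whitespace."""
    text = text.lower().replace("?", "").replace("'", "")
    return text.split()

def build_vocab(pairs):
    """Give every distinct token an index; 0 is reserved for unknown words."""
    vocab = {"<UNK>": 0}
    for question, answer in pairs:
        for token in tokenize(question) + tokenize(answer):
            if token not in vocab:
                vocab[token] = len(vocab)
    return vocab

def text_to_indices(text, vocab):
    """Map a string to a list of vocabulary indices."""
    return [vocab.get(token, vocab["<UNK>"]) for token in tokenize(text)]

# Illustrative question-answer pairs (not from the CSV used in the notebook)
pairs = [
    ("What is the capital of France?", "Paris"),
    ("Who wrote Hamlet?", "Shakespeare"),
]
vocab = build_vocab(pairs)
indices = text_to_indices("What is the capital of Spain?", vocab)
print(indices)  # [1, 2, 3, 4, 5, 0] -> "spain" is unknown, so it maps to 0
```

Questions and answers share one vocabulary here, which matches the single `vocab` argument passed to the dataset class.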
```python
import torch
from torch.utils.data import Dataset

class QADataset(Dataset):
    def __init__(self, df, vocab):
        self.df = df
        self.vocab = vocab

    def __len__(self):
        return self.df.shape[0]

    def __getitem__(self, index):
        # text_to_indices is defined separately; it maps text to a list of vocabulary indices
        question = text_to_indices(self.df.iloc[index]["question"], self.vocab)
        answer = text_to_indices(self.df.iloc[index]["answer"], self.vocab)
        return torch.tensor(question), torch.tensor(answer)
```
Then I created a simple RNN model with an embedding layer, an RNN layer and a linear layer that predicts a vocabulary word from the final hidden state.
```python
import torch.nn as nn

class SimpleRNN(nn.Module):
    def __init__(self, vocab_size):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embedding_dim=50)
        self.rnn = nn.RNN(50, 64, batch_first=True)
        self.fc = nn.Linear(64, vocab_size)

    def forward(self, question):
        embedded_question = self.embedding(question)    # (batch, seq_len, 50)
        # hidden: hidden state at every time step; final: last hidden state (1, batch, 64)
        hidden, final = self.rnn(embedded_question)
        output = self.fc(final.squeeze(0))              # (batch, vocab_size)
        return output
```
08 Training Loop for the QA Model
The training loop followed the same PyTorch pattern I learned in the previous post: clear gradients, run forward pass, calculate loss, run backward pass and update parameters.
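```python
# model, dataloader, criterion, optimizer and epochs are assumed to be defined earlier
for epoch in range(epochs):
    total_loss = 0
    for question, answer in dataloader:
        optimizer.zero_grad()        # clear old gradients
        output = model(question)     # forward pass: (batch, vocab_size)
        # answer[0] drops the batch dimension; the shapes match only when
        # batch_size is 1 and each answer is a single token
        loss = criterion(output, answer[0])
        loss.backward()              # backward pass
        optimizer.step()             # update parameters
        total_loss += loss.item()
    print(f"Epoch {epoch + 1}, loss: {total_loss:.4f}")
```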
I also wrote a prediction function that converts a new question into indices, sends it to the model, applies softmax and returns the word with the highest probability.
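A sketch of such a prediction function is below. The function signature, the tiny vocabulary and the `<UNK>` handling are my assumptions, and the model is untrained here, so the returned word is arbitrary. The point is only the flow from question to indices to softmax to word.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

class SimpleRNN(nn.Module):
    def __init__(self, vocab_size):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embedding_dim=50)
        self.rnn = nn.RNN(50, 64, batch_first=True)
        self.fc = nn.Linear(64, vocab_size)

    def forward(self, question):
        embedded = self.embedding(question)
        hidden, final = self.rnn(embedded)
        return self.fc(final.squeeze(0))

# Illustrative vocabulary; a real one would come from the training data
vocab = {"<UNK>": 0, "what": 1, "is": 2, "the": 3, "capital": 4, "of": 5, "france": 6, "paris": 7}
index_to_word = {idx: word for word, idx in vocab.items()}

def predict(model, question, vocab, index_to_word):
    """Return the most probable vocabulary word and its probability."""
    tokens = question.lower().replace("?", "").split()
    indices = [vocab.get(tok, vocab["<UNK>"]) for tok in tokens]
    batch = torch.tensor(indices).unsqueeze(0)  # add batch dimension: (1, seq_len)
    model.eval()
    with torch.no_grad():
        logits = model(batch)                   # (1, vocab_size)
        probs = torch.softmax(logits, dim=1)    # probabilities over the vocabulary
    best_prob, best_idx = torch.max(probs, dim=1)
    return index_to_word[best_idx.item()], best_prob.item()

model = SimpleRNN(vocab_size=len(vocab))
word, prob = predict(model, "What is the capital of France?", vocab, index_to_word)
print(word, prob)
```

Returning the probability along with the word makes it possible to reject low-confidence answers with a threshold, if that is wanted.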
09 Mistakes and Confusions I Noticed
RNN looked simple conceptually, but implementation had several places where I could get confused. Most confusion came from shapes and from understanding what the RNN returns.
Confusions
- difference between output and final hidden state
- batch dimension versus sequence dimension
- why padding is needed for unequal sentence lengths
- why final hidden state is used for classification
- why raw words cannot go directly into RNN
Better Thinking
- text first becomes tokens
- tokens become indices
- indices become embeddings
- RNN creates hidden states
- final layer makes prediction
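The first two confusions are easier to settle by printing shapes. The sketch below uses random numbers in place of real embeddings, and the batch size and sequence length are arbitrary values I picked.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

batch_size, seq_len, embed_dim, hidden_dim = 4, 5, 8, 16
rnn = nn.RNN(embed_dim, hidden_dim, batch_first=True)

# Random stand-in for a batch of embedded sentences
x = torch.randn(batch_size, seq_len, embed_dim)
output, hidden = rnn(x)

print(output.shape)  # torch.Size([4, 5, 16]) -> hidden state at every time step
print(hidden.shape)  # torch.Size([1, 4, 16]) -> only the last hidden state

# For a single-layer, one-directional RNN, the last time step of output
# is the same tensor content as the final hidden state
same = torch.allclose(output[:, -1, :], hidden[0])
print(same)  # True
```

With `batch_first=True` the batch dimension comes first in `output`, but `hidden` still puts the layer dimension first, which is why the models in this post call `squeeze(0)` before the linear layer.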
10 Limitations of Simple RNN
After building simple RNN examples, I also understood why RNN is not the final answer for sequence modeling. A basic RNN can struggle with long sequences because information from earlier time steps may become weak as the sequence grows.
This connects directly to the next topic: LSTM. LSTM was designed to handle the memory problem better by using gates. So RNN is the correct foundation, but LSTM explains why simple recurrence was not enough.
Simple RNN Problems
- struggles with long-term dependencies
- can suffer from vanishing gradients
- memory becomes weak over long sequences
- not ideal for complex language tasks
Why LSTM Comes Next
- uses gates for memory control
- keeps important information longer
- handles longer context better
- improves sequence modeling
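The weakening memory can be seen numerically. The sketch below tracks how strongly the current hidden state still depends on the very first one, by multiplying the per-step Jacobians of a tanh RNN. The hidden size, the small weight scale and the random inputs are assumptions chosen so that the effect shows up quickly.

```python
import numpy as np

rng = np.random.default_rng(0)
hidden_dim = 16
W_hh = rng.normal(scale=0.1, size=(hidden_dim, hidden_dim))  # small recurrent weights

h = np.zeros(hidden_dim)
J = np.eye(hidden_dim)  # Jacobian of the current hidden state w.r.t. the initial one
norms = []

for t in range(50):
    inp = rng.normal(size=hidden_dim)  # stands in for the input term W_i x_t
    h = np.tanh(inp + W_hh @ h)
    # One-step Jacobian: diag(1 - h_t^2) @ W_hh, chained onto the earlier steps
    J = (np.diag(1.0 - h ** 2) @ W_hh) @ J
    norms.append(np.linalg.norm(J))    # Frobenius norm of the accumulated Jacobian

print(norms[0], norms[9], norms[49])
```

The norm shrinks step after step, so a gradient flowing back from step 50 to step 1 carries almost no signal. That is the vanishing gradient problem in a small form.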
11 My Final Understanding
My final understanding is that RNNs are important because they introduced the idea of processing text as a sequence. Instead of treating words as independent features, an RNN reads them one by one and keeps a hidden state as memory.
12 GitHub Notebook Connection
This blog explains what I understood from my RNN notebooks. The implementation side is connected to the NLP by Vinod GitHub repository.
Notebook references: 01_RNN_from_scratch.ipynb and 02-rnn-based-qa-system.ipynb.
14 What Comes Next in the NLP Journey
The next topic is LSTM. After understanding simple recurrence and hidden states, I now want to learn how LSTM improves memory using gates.