NLP by Vinod - Deep Learning

Sequence Models

LSTM for Sequence Modeling - Long-Term Memory in Neural Networks.

After RNNs, I learned why simple recurrence struggles with long sequences and how LSTM improves memory using gates, cell state and hidden state.

Tags: LSTM, Sequence Model, Cell State, PyTorch

LSTM in deep learning became the next natural topic after RNN. In the previous post, I understood that RNNs read text step by step using a hidden state. But I also saw the limitation: when the sequence becomes long, earlier information can become weak. This is why simple RNNs struggle with long-term dependencies.

LSTM stands for Long Short-Term Memory. My current understanding is that LSTM is an improved version of RNN that still processes sequential data step by step, but it has a smarter memory system. Instead of relying only on one hidden state, LSTM uses a cell state for long-term memory and a hidden state for current or short-term context.

In this topic, I learned the theory of LSTM gates, implemented a small LSTM from scratch using NumPy, visualized how gates and memory change over time, built a sentiment analysis example, and then used PyTorch to create a next-word prediction model.

What clicked for me:
RNN gives memory to neural networks. LSTM makes that memory more controlled by deciding what to forget, what to store and what to output.

Figure: LSTM sequence model workflow showing the forget, input and output gates with the cell state and hidden state. LSTM improves on simple RNNs by using gates and a cell state to preserve useful long-term information.

01 Why LSTM Was Needed After RNN

Simple RNNs are useful because they process data in order. But the same hidden state keeps getting updated again and again. In short sequences this can work, but in long sequences the model may forget earlier information.

This becomes a serious issue in NLP. A word at the beginning of a sentence or paragraph may affect the meaning later. If the model cannot preserve that context, prediction becomes weak.

Problems in Simple RNN

struggles with long-term dependencies
vanishing gradient problem
exploding gradient problem
old context can be overwritten
long sequences become difficult

How LSTM Helps

uses gates to control memory
keeps useful old information
adds useful new information
uses cell state for long-term memory
handles longer context better

Important correction I learned: LSTM does not have two hidden states. It has one hidden state h_t and one cell state C_t. The cell state behaves like long-term memory, while the hidden state carries current output or short-term context.

02 Long-Term Dependency Intuition

Consider this sentence from my theory notes:

Example

Vinod is a student of B.Tech final year. He studies at VNIT Nagpur.

To understand the word He, the model should remember earlier information: Vinod. A simple RNN may weaken this information over a long sentence. LSTM is designed to carry useful information for a longer time using the cell state.

Earlier Context (Vinod) → Cell State (keeps useful memory) → Current Word (He) → Hidden State (current context) → Prediction (better meaning)

My simple definition: LSTM is a gated RNN architecture that remembers important information for longer sequences using cell state and gates.

03 Cell State vs Hidden State

The most important difference for me was cell state and hidden state. Earlier in RNN, I mainly thought about hidden state. In LSTM, the cell state becomes the memory highway.

State	Symbol	How I Understood It
Cell State	`C_t`	long-term memory that carries useful information across many time steps
Hidden State	`h_t`	current output or short-term context passed to the next step
Previous Cell	`C_{t-1}`	old long-term memory before update
Previous Hidden	`h_{t-1}`	previous short-term context

Memory intuition: C_t remembers what should be carried forward. h_t decides what should be exposed as the current output.

04 The Three Gates of LSTM

LSTM has three main gates. These gates are not physical gates; they are neural network layers that output values between 0 and 1 using sigmoid. A value close to 0 means block or forget. A value close to 1 means allow or keep.

Forget Gate

Decides what old information should be removed from the previous cell state.

Input Gate

Decides what new information should be added to the cell state.

Output Gate

Decides what information should become the current hidden state.
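The idea that a gate is a sigmoid output acting as a soft mask can be sketched in a few lines of NumPy. The scores and memory values below are made-up numbers chosen for illustration; they are not taken from the notebook.

```python
import numpy as np

def sigmoid(x):
    """Squash any real number into the open interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-x))

# Hypothetical raw gate scores: strongly negative, neutral, strongly positive.
scores = np.array([-6.0, 0.0, 6.0])
gate = sigmoid(scores)

# Hypothetical memory values; the gate is applied element by element.
memory = np.array([10.0, 10.0, 10.0])
gated = gate * memory

print(np.round(gate, 3))    # near 0, exactly 0.5, near 1
print(np.round(gated, 2))   # almost blocked, halved, almost fully kept
```

A gate never switches memory fully on or off. It scales each value, which is what keeps the whole operation differentiable and trainable.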

05 Forget Gate

The forget gate decides how much of the old cell state should be kept. It looks at the previous hidden state and current input, then creates a forget vector.

Formula

f_t = sigmoid(W_f [h_{t-1}, x_t] + b_f)

If a value in f_t is close to 0, that part of memory is forgotten. If it is close to 1, that part is preserved.
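A minimal NumPy sketch of the forget gate formula, assuming small illustrative sizes and random weights (the names `hidden_size`, `input_size` and the random values are assumptions, not the notebook's setup). It mainly shows that `[h_{t-1}, x_t]` is a concatenation, so `W_f` needs one column per entry of both vectors.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(0)

hidden_size, input_size = 3, 2           # illustrative sizes (assumption)
h_prev = rng.normal(size=hidden_size)    # h_{t-1}
x_t = rng.normal(size=input_size)        # x_t
c_prev = rng.normal(size=hidden_size)    # C_{t-1}

# [h_{t-1}, x_t] is a concatenation, so W_f has hidden_size + input_size columns.
concat = np.concatenate([h_prev, x_t])
W_f = rng.normal(size=(hidden_size, hidden_size + input_size))
b_f = np.zeros(hidden_size)

f_t = sigmoid(W_f @ concat + b_f)        # one value in (0, 1) per memory cell
kept = f_t * c_prev                      # the part of the old memory that survives

print("f_t :", np.round(f_t, 3))
print("kept:", np.round(kept, 3))
```

Because every entry of `f_t` lies strictly between 0 and 1, each entry of `kept` is never larger in magnitude than the matching entry of `c_prev`.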

06 Input Gate and Candidate Memory

The input gate controls what new information should enter the cell state. But it does not work alone. LSTM also creates candidate memory using tanh.

Formula

i_t = sigmoid(W_i [h_{t-1}, x_t] + b_i)
C̃_t = tanh(W_c [h_{t-1}, x_t] + b_c)

The input gate decides how much new information should be stored. The candidate memory contains possible new values that may be added to the cell state.
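The split between "how much to write" and "what to write" can be sketched as follows. The pre-activation values `z_i` and `z_c` are hypothetical stand-ins for `W_i [h_{t-1}, x_t] + b_i` and `W_c [h_{t-1}, x_t] + b_c`.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Hypothetical pre-activation values for three memory cells.
z_i = np.array([-4.0, 0.0, 4.0])   # stands in for W_i [h_{t-1}, x_t] + b_i
z_c = np.array([2.0, -2.0, 2.0])   # stands in for W_c [h_{t-1}, x_t] + b_c

i_t = sigmoid(z_i)        # in (0, 1): how much to write
c_tilde = np.tanh(z_c)    # in (-1, 1): what could be written
write = i_t * c_tilde     # what actually gets added to the cell state

print(np.round(i_t, 3))
print(np.round(c_tilde, 3))
print(np.round(write, 3))
```

The first cell has a strong candidate but a nearly closed input gate, so almost nothing is written. The third cell has the same candidate with an open gate, so nearly all of it is written.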

07 Updating the Cell State

This is where LSTM memory actually updates. It combines useful old memory and useful new memory.

Formula

C_t = f_t * C_{t-1} + i_t * C̃_t

What clicked: current cell state = useful old memory + useful new memory.
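A small worked example of the update, using hand-picked gate values (assumptions chosen so the arithmetic is easy to follow). Here `*` means element-wise multiplication.

```python
import numpy as np

c_prev  = np.array([2.0, -1.0, 4.0])    # old memory C_{t-1}
f_t     = np.array([1.0,  0.0, 0.5])    # keep all, forget all, keep half
i_t     = np.array([0.0,  1.0, 0.5])    # write nothing, write fully, write half
c_tilde = np.array([0.9,  0.5, -1.0])   # candidate memory

c_t = f_t * c_prev + i_t * c_tilde      # element-wise multiply, then add
print(c_t)
```

The first cell keeps its old value 2.0 untouched, the second is fully replaced by the candidate 0.5, and the third blends half of the old value with half of the candidate to give 1.5.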

08 Output Gate

The output gate decides what part of the updated cell state should become the current hidden state. The hidden state is then passed forward and can also be used for prediction.

Formula

o_t = sigmoid(W_o [h_{t-1}, x_t] + b_o)
h_t = o_t * tanh(C_t)

This helped me understand why LSTM output is controlled. It does not expose all memory directly. It filters memory through the output gate.
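The filtering step can be sketched with hand-picked values (the numbers are assumptions for illustration). It shows two things: `tanh` keeps the hidden state between -1 and 1 even when the cell state is large, and a closed output gate hides a memory cell without erasing it.

```python
import numpy as np

c_t = np.array([3.0, -0.2, 5.0])   # hypothetical updated cell state
o_t = np.array([1.0,  1.0, 0.0])   # hypothetical output gate values

h_t = o_t * np.tanh(c_t)
print(np.round(h_t, 3))
```

The third cell still stores 5.0 in `c_t`, but it contributes 0 to `h_t`, so it is remembered internally while staying invisible to the current prediction.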

09 Vanishing and Exploding Gradient Demo

In the implementation notebook, I first created a simple demo for gradient flow over time. When the recurrent multiplier was less than 1, the gradient kept shrinking. When it was greater than 1, the gradient kept growing.

Python

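T = 20                      # number of time steps to backpropagate through
Us = [0.5, 2.0]             # recurrent multipliers: one below 1, one above 1
labels = ["Vanishing Gradient", "Exploding Gradient"]

for U, label in zip(Us, labels):
    grads = []
    grad = 1.0
    for t in range(T):
        grad *= U           # each step back in time multiplies the gradient by U
        grads.append(grad)
    print(f"{label}: gradient after {T} steps = {grads[-1]:.3e}")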

Notebook lesson: when U = 0.5, the gradient shrinks exponentially. When U = 2.0, it grows exponentially. This explains why training simple RNNs on long sequences can become difficult.

10 LSTM from Scratch

After the theory, I implemented a simplified LSTM step using NumPy. This helped me see the gates as actual calculations instead of only diagrams.

Python

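import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def tanh(x):
    return np.tanh(x)

x_t = 1.0      # current input
h_prev = 0.5   # previous hidden state h_{t-1}
c_prev = 0.0   # previous cell state C_{t-1}

# Scalar toy: with every weight set to 1 and every bias set to 0,
# W [h_{t-1}, x_t] + b reduces to the sum h_prev + x_t.
concat = h_prev + x_t

f_t = sigmoid(concat)                  # forget gate
i_t = sigmoid(concat)                  # input gate
c_hat_t = tanh(concat)                 # candidate memory
c_t = f_t * c_prev + i_t * c_hat_t     # cell state update
o_t = sigmoid(concat)                  # output gate
h_t = o_t * tanh(c_t)                  # hidden state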

Then I created a small custom MiniLSTM class and ran it across multiple time steps. I tracked how forget gate, input gate, cell state and hidden state changed over time.

What clicked: LSTM is complex on paper, but the implementation is a sequence of small gate operations.
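The scalar example above adds `h_prev` and `x_t`, which works as a toy with all weights equal to 1. A vector version with real concatenation and separate weights per gate might look like the sketch below. This is not the notebook's MiniLSTM class; `lstm_step`, `init_params`, the sizes and the random weights are all assumptions made for illustration.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x_t, h_prev, c_prev, params):
    """One LSTM time step following the gate formulas in sections 05 to 08."""
    concat = np.concatenate([h_prev, x_t])                    # [h_{t-1}, x_t]
    f_t = sigmoid(params["W_f"] @ concat + params["b_f"])     # forget gate
    i_t = sigmoid(params["W_i"] @ concat + params["b_i"])     # input gate
    c_hat = np.tanh(params["W_c"] @ concat + params["b_c"])   # candidate memory
    c_t = f_t * c_prev + i_t * c_hat                          # cell state update
    o_t = sigmoid(params["W_o"] @ concat + params["b_o"])     # output gate
    h_t = o_t * np.tanh(c_t)                                  # hidden state
    return h_t, c_t, {"f": f_t, "i": i_t, "o": o_t}

def init_params(input_size, hidden_size, seed=0):
    """Small random weights and zero biases (an arbitrary choice for the demo)."""
    rng = np.random.default_rng(seed)
    shape = (hidden_size, hidden_size + input_size)
    params = {}
    for gate in "fico":
        params[f"W_{gate}"] = rng.normal(scale=0.5, size=shape)
        params[f"b_{gate}"] = np.zeros(hidden_size)
    return params

input_size, hidden_size, T = 2, 3, 5
params = init_params(input_size, hidden_size)
xs = np.random.default_rng(1).normal(size=(T, input_size))   # toy input sequence

# Both states start at zero and are carried from one step to the next.
h, c = np.zeros(hidden_size), np.zeros(hidden_size)
history = []
for t, x_t in enumerate(xs):
    h, c, gates = lstm_step(x_t, h, c, params)
    history.append((h.copy(), c.copy()))
    print(f"t={t}  f={np.round(gates['f'], 2)}  c={np.round(c, 2)}  h={np.round(h, 2)}")
```

The weights here are untrained, so the gate values carry no meaning yet. The point is only the data flow: the same `params` are reused at every step while `h` and `c` change.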

11 LSTM for Sentiment Analysis

I also tried a real NLP task using LSTM: sentiment analysis on IMDB-style movie reviews. The flow was simple: convert words into integer sequences, pad them to fixed length, pass them through an embedding layer, then use LSTM and a sigmoid output for binary classification.

Keras

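# Imports assumed; the notebook may import from keras directly.
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, LSTM, Dense

model = Sequential()
# 5000-word vocabulary, 32-dimensional embeddings, reviews padded to 200 tokens.
# Recent Keras releases deprecate input_length; drop it if a warning appears.
model.add(Embedding(5000, 32, input_length=200))
model.add(LSTM(64))                        # returns the final hidden state (size 64)
model.add(Dense(1, activation='sigmoid'))  # probability of positive sentiment

model.compile(
    loss='binary_crossentropy',
    optimizer='adam',
    metrics=['accuracy']
)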

I also wrote a small function to encode new text, pad it and predict whether the sentiment is positive or negative.
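The encode-and-pad step could look roughly like the plain-Python sketch below. It is not the notebook's function: `encode_and_pad`, the toy `word_index`, and the choice of id 0 for padding and id 1 for unknown words are all assumptions. It pads and truncates on the left, which is one common convention for LSTM inputs.

```python
def encode_and_pad(text, word_index, max_len=200, oov_id=1, pad_id=0):
    """Turn raw text into a fixed-length list of integer ids (left-padded)."""
    ids = [word_index.get(word, oov_id) for word in text.lower().split()]
    ids = ids[-max_len:]                           # keep only the last max_len tokens
    return [pad_id] * (max_len - len(ids)) + ids   # fill the front with padding

# Hypothetical toy vocabulary; ids 0 and 1 are reserved for padding and unknown words.
word_index = {"the": 2, "movie": 3, "was": 4, "great": 5, "boring": 6}

encoded = encode_and_pad("The movie was great", word_index, max_len=6)
print(encoded)  # [0, 0, 2, 3, 4, 5]
```

With a trained model, the encoded list would be wrapped in a batch of size 1 before prediction, and a sigmoid output above 0.5 would be read as positive sentiment.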

Figure: LSTM sentiment analysis and next-word prediction workflow showing tokens, embeddings, memory gates and output. LSTM can be used for NLP tasks such as sentiment analysis and next-word prediction because it processes sequences while carrying memory.

12 PyTorch LSTM for Next Word Prediction

The second notebook connected LSTM with the PyTorch workflow I already learned. I used a text document, tokenized it with NLTK, built a vocabulary, converted text to indices and created training sequences for next-word prediction.

The idea was this: given a partial sequence, predict the next token. This made LSTM feel more directly connected to language modeling.
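One way to build such training pairs is sketched below in plain Python. The toy text, the use of `str.split` as a stand-in for NLTK tokenization, and treating the whole text as a single sequence are assumptions; the notebook's exact preprocessing may differ.

```python
text = "lstm keeps useful memory . lstm uses gates"   # toy document (assumption)
tokens = text.split()                                  # stand-in for NLTK tokenization

# Vocabulary: index 0 is reserved for padding; unknown words are not handled here.
vocab = {"<pad>": 0}
for tok in tokens:
    vocab.setdefault(tok, len(vocab))

ids = [vocab[tok] for tok in tokens]

# Every prefix of length >= 1 becomes an input; the token right after it is the target.
pairs = [(ids[:i], ids[i]) for i in range(1, len(ids))]

# Left-pad all prefixes to the same length so they can be batched.
max_len = max(len(prefix) for prefix, _ in pairs)
inputs = [[0] * (max_len - len(prefix)) + prefix for prefix, _ in pairs]
targets = [target for _, target in pairs]

print(vocab)
print(inputs[0], "->", targets[0])
print(inputs[-1], "->", targets[-1])
```

Eight tokens give seven training pairs, from the shortest prefix (`lstm` predicts `keeps`) to the longest (everything except the last token predicts `gates`).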

PyTorch

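import torch.nn as nn

class LSTMModel(nn.Module):
    def __init__(self, vocab_size):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, 100)    # token id -> 100-dim vector
        self.lstm = nn.LSTM(100, 150, batch_first=True)   # input is (batch, seq, 100)
        self.fc = nn.Linear(150, vocab_size)              # hidden state -> vocabulary scores

    def forward(self, x):
        embedded = self.embedding(x)
        # outputs: hidden state at every step; h_n, c_n: final hidden and cell state
        outputs, (h_n, c_n) = self.lstm(embedded)
        # h_n is (1, batch, 150) for a single-layer LSTM; squeeze(0) gives (batch, 150)
        output = self.fc(h_n.squeeze(0))
        return output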

I trained the model using CrossEntropyLoss and Adam optimizer. The prediction function converted the input text into indices, padded it, sent it through the model and selected the word with the highest score.

What clicked: for next-word prediction, the final hidden state becomes a summary of the input sequence, and the final linear layer maps that summary to vocabulary scores.
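The shapes returned by `nn.LSTM` are easy to mix up, so a quick check can help. This is a sketch with fake token ids and illustrative sizes (`batch_size`, `seq_len` and `vocab_size` are assumptions), using the same layer sizes as the model above.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

batch_size, seq_len, vocab_size = 4, 6, 50    # illustrative sizes (assumption)
embedding = nn.Embedding(vocab_size, 100)
lstm = nn.LSTM(100, 150, batch_first=True)
fc = nn.Linear(150, vocab_size)

x = torch.randint(0, vocab_size, (batch_size, seq_len))  # fake token ids
outputs, (h_n, c_n) = lstm(embedding(x))

print(outputs.shape)  # torch.Size([4, 6, 150])  hidden state at every time step
print(h_n.shape)      # torch.Size([1, 4, 150])  final hidden state per layer
print(c_n.shape)      # torch.Size([1, 4, 150])  final cell state per layer

# For a single-layer, unidirectional LSTM the last time step of `outputs`
# holds the same values as the final hidden state.
assert torch.allclose(outputs[:, -1, :], h_n[0])

logits = fc(h_n[-1])  # (batch, vocab_size) scores for the next word
print(logits.shape)   # torch.Size([4, 50])
```

Indexing with `h_n[-1]` picks the last layer's final hidden state, so it gives the same result as `h_n.squeeze(0)` here and would still work if more layers were added later.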

13 RNN vs LSTM

This comparison made the topic much clearer for me.

Feature	Simple RNN	LSTM
Memory	mainly hidden state	cell state + hidden state
Long-term dependency	weak	stronger
Vanishing gradient	high chance	reduced
Architecture	simple	more complex
Gates	no gates	forget, input and output gates
Long sequence handling	struggles	better
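The "reduced" entry for vanishing gradients can be illustrated with a simplified calculation. Along the direct cell-state path, the derivative of `C_t` with respect to `C_{t-1}` is the forget gate `f_t`, so the gradient over many steps is a product of forget gate values. The sketch below assumes a constant forget gate of 0.98 (a made-up value) and ignores the indirect paths through the hidden state.

```python
T = 20

rnn_factor = 0.5        # recurrent multiplier below 1, as in a vanishing-gradient demo
forget_gate = 0.98      # hypothetical forget gate value that stays close to 1

rnn_grad = rnn_factor ** T
cell_grad = forget_gate ** T   # direct path dC_T / dC_0 is a product of forget gates

print(f"simple RNN factor after {T} steps : {rnn_grad:.2e}")
print(f"cell-state path after {T} steps   : {cell_grad:.2f}")
```

When the network learns to keep a forget gate near 1, the gradient along the cell state shrinks far more slowly. If the forget gate is small, the gradient still vanishes, which is why the table says "reduced" rather than "solved".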

14 Mistakes and Confusions I Noticed

LSTM is powerful, but it can be confusing in the beginning because there are many symbols and states. I had to repeatedly separate h_t, C_t, gates and candidate memory.

Confusions

thinking cell state and hidden state are the same
forgetting that LSTM has three gates
mixing candidate memory with final cell state
not understanding output shape from PyTorch LSTM
confusing padding length with sequence length

Better Thinking

cell state carries long-term memory
hidden state carries current output
gates decide memory flow
embeddings are inputs to LSTM
final linear layer maps hidden state to prediction

My final caution: LSTM improves RNN memory, but it is still not the final stage of NLP. After LSTM, I need to understand stacked models, bidirectional models, GRU and eventually transformers.

15 My Final Understanding

My final understanding is that LSTM is a memory-controlled sequence model. It processes text step by step like RNN, but it adds a cell state and gates so the model can preserve useful information for longer sequences.

LSTM is a gated RNN

It is built for sequential data but improves memory control compared to a simple RNN.

Cell state carries long-term memory

It works like a memory highway where important information can move forward across time steps.

Gates control information

Forget, input and output gates decide what to remove, store and expose.

LSTM supports real NLP tasks

It can be used for sentiment analysis, next-word prediction, sequence classification and other language tasks.

16 GitHub Notebook Connection

This blog explains what I understood from my LSTM theory and implementation notebooks. The implementation side is connected to the NLP by Vinod GitHub repository.


Notebook references: 01_LSTM_implementation.ipynb, 02_pytorch_lstm_next_word_predictor.ipynb, plus theory notes from lstm_theory.html.


17 What Comes Next in the NLP Journey

The next topic is Stacked RNN/LSTM, Bidirectional RNN/LSTM and GRU. After understanding simple LSTM, I now want to learn how sequence models become deeper, how bidirectional context works and how GRU simplifies the LSTM idea.

Stacked sequence models

How multiple recurrent layers can learn deeper sequence representations.

Bidirectional models

How models can use both past and future context in a sequence.

GRU

How GRU offers a simpler gated recurrent architecture.

