NLP by Vinod

A structured public journey from NLP fundamentals to real-world AI systems.

Vinod Codes is where I document my learning in AI, Machine Learning, Deep Learning, Natural Language Processing, Generative AI, and practical projects.

The main series here is NLP by Vinod — a learner-builder journey where I explain concepts with intuition, Python examples, mistakes, GitHub work, and honest implementation notes.

Start here: follow the Foundations Track first, then move into deep learning, transformers, projects, and real-world NLP systems.

LSTM for Sequence Modeling - Long-Term Memory in Neural Networks

After RNNs, I learned why simple recurrence struggles with long sequences and how LSTM improves memory using gates, cell state and hidden state.

LSTM was the natural next topic after RNNs. In the previous post, I understood that RNNs read text step by step using a hidden state. But I also saw the limitation: when the sequence becomes long, the influence of earlier tokens fades. This is why simple RNNs struggle with long-term dependencies.

LSTM stands for Long Short-Term Memory. My current understanding is that LSTM is an improved version of RNN that still processes sequential data step by step, but it has a smarter memory system. Instead of relying only on one hidden state, LSTM uses a cell state for long-term memory and a hidden state for current or short-term context.

In this topic, I learned the theory of LSTM gates, implemented a small LSTM from scratch using NumPy, visualized how gates and memory change over time, built a sentiment analysis example, and then used PyTorch to create a next-word prediction model.

What clicked for me:
RNN gives memory to neural networks. LSTM makes that memory more controlled by deciding what to forget, what to store and what to output.
[Figure: LSTM sequence model workflow showing forget, input and output gates with cell state and hidden state]
Caption: LSTM improves simple RNNs by using gates and a cell state to preserve useful long-term information.

01 Why LSTM Was Needed After RNN

Simple RNNs are useful because they process data in order. But the same hidden state keeps getting updated again and again. In short sequences this can work, but in long sequences the model may forget earlier information.

This becomes a serious issue in NLP. A word at the beginning of a sentence or paragraph may affect the meaning later. If the model cannot preserve that context, prediction becomes weak.

Problems in Simple RNN

  • struggles with long-term dependencies
  • vanishing gradient problem
  • exploding gradient problem
  • old context can be overwritten
  • long sequences become difficult

How LSTM Helps

  • uses gates to control memory
  • keeps useful old information
  • adds useful new information
  • uses cell state for long-term memory
  • handles longer context better
Important correction I learned: LSTM does not have two hidden states. It has one hidden state h_t and one cell state C_t. The cell state behaves like long-term memory, while the hidden state carries current output or short-term context.

02 Long-Term Dependency Intuition

Consider this sentence from my theory notes:

Example
Vinod is a student of B.Tech final year. He studies at VNIT Nagpur.

To understand the word He, the model should remember earlier information: Vinod. A simple RNN may weaken this information over a long sentence. LSTM is designed to carry useful information for a longer time using the cell state.

Earlier Context: Vinod
Cell State: keeps useful memory
Current Word: He
Hidden State: current context
Prediction: better meaning
My simple definition: LSTM is a gated RNN architecture that remembers important information for longer sequences using cell state and gates.

03 Cell State vs Hidden State

The most important distinction for me was between the cell state and the hidden state. Earlier, with RNNs, I mainly thought about the hidden state. In LSTM, the cell state becomes the memory highway.

State | Symbol | How I Understood It
Cell State | C_t | long-term memory that carries useful information across many time steps
Hidden State | h_t | current output or short-term context passed to the next step
Previous Cell | C_{t-1} | old long-term memory before update
Previous Hidden | h_{t-1} | previous short-term context
Memory intuition: C_t remembers what should be carried forward. h_t decides what should be exposed as the current output.

04 The Three Gates of LSTM

LSTM has three main gates. These gates are not physical gates; they are neural network layers that output values between 0 and 1 using sigmoid. A value close to 0 means block or forget. A value close to 1 means allow or keep.

  • Forget Gate (F): decides what old information should be removed from the previous cell state.
  • Input Gate (I): decides what new information should be added to the cell state.
  • Output Gate (O): decides what information should become the current hidden state.

[Figure: LSTM architecture diagram showing forget gate, input gate, output gate, cell state and hidden state]
Caption: The forget, input and output gates control how memory flows through an LSTM cell.

05 Forget Gate

The forget gate decides how much of the old cell state should be kept. It looks at the previous hidden state and current input, then creates a forget vector.

Formula
f_t = sigmoid(W_f [h_{t-1}, x_t] + b_f)

If a value in f_t is close to 0, that part of memory is forgotten. If it is close to 1, that part is preserved.
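
To see the formula as an actual calculation, here is a minimal NumPy sketch of the forget gate. The toy sizes, the random untrained weights and the helper name forget_gate are assumptions I picked for illustration; they are not taken from the notebook.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def forget_gate(h_prev, x_t, W_f, b_f):
    """f_t = sigmoid(W_f [h_{t-1}, x_t] + b_f)"""
    concat = np.concatenate([h_prev, x_t])   # [h_{t-1}, x_t] as one vector
    return sigmoid(W_f @ concat + b_f)

hidden_size, input_size = 3, 2               # assumed toy sizes
rng = np.random.default_rng(0)               # fixed seed so the run is repeatable
W_f = rng.normal(size=(hidden_size, hidden_size + input_size))  # untrained weights
b_f = np.zeros(hidden_size)

h_prev = np.zeros(hidden_size)               # no previous context yet
x_t = np.array([1.0, -1.0])                  # made-up input vector

f_t = forget_gate(h_prev, x_t, W_f, b_f)
print(f_t.shape)   # (3,)
```

The gate has one value per cell-state dimension, so each part of the memory gets its own keep-or-forget score.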

06 Input Gate and Candidate Memory

The input gate controls what new information should enter the cell state. But it does not work alone. LSTM also creates candidate memory using tanh.

Formula
i_t = sigmoid(W_i [h_{t-1}, x_t] + b_i)
C̃_t = tanh(W_c [h_{t-1}, x_t] + b_c)

The input gate decides how much new information should be stored. The candidate memory contains possible new values that may be added to the cell state.
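
A small sketch helped me keep the two pieces apart: the input gate is a sigmoid score, while the candidate memory is a tanh proposal. The sizes, seed and example vectors below are assumptions for illustration, and the weights are random rather than trained.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

hidden_size, input_size = 3, 2               # assumed toy sizes
rng = np.random.default_rng(1)
W_i = rng.normal(size=(hidden_size, hidden_size + input_size))
W_c = rng.normal(size=(hidden_size, hidden_size + input_size))
b_i = np.zeros(hidden_size)
b_c = np.zeros(hidden_size)

h_prev = np.array([0.1, -0.2, 0.3])          # made-up previous hidden state
x_t = np.array([1.0, 0.5])                   # made-up current input
concat = np.concatenate([h_prev, x_t])       # [h_{t-1}, x_t]

i_t = sigmoid(W_i @ concat + b_i)            # how much to write, between 0 and 1
c_hat_t = np.tanh(W_c @ concat + b_c)        # what could be written, between -1 and 1
write = i_t * c_hat_t                        # element-wise: what actually gets stored

print("input gate:", np.round(i_t, 3))
print("candidate :", np.round(c_hat_t, 3))
print("written   :", np.round(write, 3))
```

Because the gate only scales the candidate, the written values can never be larger in magnitude than the candidate values.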

07 Updating the Cell State

This is where LSTM memory actually updates. It combines useful old memory and useful new memory.

Formula
C_t = f_t * C_{t-1} + i_t * C̃_t
What clicked: current cell state = useful old memory + useful new memory.
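
In this formula, * means element-wise multiplication, so every memory slot is updated independently. A tiny worked example with hand-picked numbers (my own, chosen so the three slots show the three extreme behaviours) makes this concrete:

```python
import numpy as np

c_prev  = np.array([0.8, -0.5,  0.3])   # old memory C_{t-1}
f_t     = np.array([1.0,  0.0,  0.5])   # keep all, forget all, keep half
i_t     = np.array([0.0,  1.0,  0.5])   # write nothing, write all, write half
c_hat_t = np.array([0.9,  0.4, -0.2])   # candidate memory

c_t = f_t * c_prev + i_t * c_hat_t      # C_t = f_t * C_{t-1} + i_t * C̃_t
print(np.round(c_t, 3))
```

Slot 1 keeps its old value (0.8), slot 2 is fully overwritten by the candidate (0.4), and slot 3 blends the two: 0.5 * 0.3 + 0.5 * (-0.2) = 0.05.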

08 Output Gate

The output gate decides what part of the updated cell state should become the current hidden state. The hidden state is then passed forward and can also be used for prediction.

Formula
o_t = sigmoid(W_o [h_{t-1}, x_t] + b_o)
h_t = o_t * tanh(C_t)

This helped me understand why LSTM output is controlled. It does not expose all memory directly. It filters memory through the output gate.
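
The filtering step can be checked with hand-picked numbers. The values below are my own illustration, not notebook output:

```python
import numpy as np

c_t = np.array([2.0, -2.0, 0.5])   # updated cell state
o_t = np.array([1.0,  0.0, 0.5])   # expose all, expose nothing, expose half

squashed = np.tanh(c_t)            # cell state squeezed into (-1, 1)
h_t = o_t * squashed               # h_t = o_t * tanh(C_t)

print("tanh(C_t):", np.round(squashed, 3))
print("h_t      :", np.round(h_t, 3))
```

The second slot still holds -2.0 in the cell state, but because its output gate is 0 nothing from it reaches the hidden state. The memory is kept without being exposed.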

09 Vanishing and Exploding Gradient Demo

In the implementation notebook, I first created a simple demo for gradient flow over time. When the recurrent multiplier was less than 1, the gradient kept shrinking. When it was greater than 1, the gradient kept growing.

Python
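T = 20                      # number of time steps to backpropagate through
Us = [0.5, 2.0]             # recurrent multipliers: one below 1, one above 1
labels = ["Vanishing Gradient", "Exploding Gradient"]

for U, label in zip(Us, labels):
    grads = []
    grad = 1.0
    for t in range(T):
        grad *= U           # each step back in time multiplies the gradient by U
        grads.append(grad)
    print(f"{label}: gradient after {T} steps = {grads[-1]:.3g}")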
Notebook lesson: when U = 0.5, the gradient shrinks exponentially: after 20 steps it is 0.5^20, about 0.00000095. When U = 2.0, it grows exponentially: after 20 steps it is 2^20 = 1,048,576. This explains why training simple RNNs on long sequences can become difficult.

10 LSTM from Scratch

After the theory, I implemented a simplified LSTM step using NumPy. This helped me see the gates as actual calculations instead of only diagrams.

Python
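import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def tanh(x):
    return np.tanh(x)

# Scalar toy values for one time step
x_t = 1.0       # current input
h_prev = 0.5    # previous hidden state h_{t-1}
c_prev = 0.0    # previous cell state C_{t-1}

# Simplification: with all weights equal to 1 and all biases equal to 0,
# W [h_{t-1}, x_t] + b reduces to the plain sum h_prev + x_t.
concat = h_prev + x_t

f_t = sigmoid(concat)                 # forget gate
i_t = sigmoid(concat)                 # input gate
c_hat_t = tanh(concat)                # candidate memory
c_t = f_t * c_prev + i_t * c_hat_t    # updated cell state
o_t = sigmoid(concat)                 # output gate
h_t = o_t * tanh(c_t)                 # new hidden state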

Then I created a small custom MiniLSTM class and ran it across multiple time steps. I tracked how forget gate, input gate, cell state and hidden state changed over time.
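
The notebook class is not reproduced here. The following is a minimal sketch of what such a class could look like, assuming random untrained weights, one weight matrix per gate, and method names (step, run) that I chose for this example.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class MiniLSTM:
    """Toy LSTM cell with random, untrained weights (illustrative sketch)."""

    def __init__(self, input_size, hidden_size, seed=0):
        rng = np.random.default_rng(seed)
        concat_size = hidden_size + input_size
        names = ("f", "i", "c", "o")   # forget, input, candidate, output
        self.W = {n: rng.normal(scale=0.5, size=(hidden_size, concat_size)) for n in names}
        self.b = {n: np.zeros(hidden_size) for n in names}
        self.hidden_size = hidden_size

    def step(self, x_t, h_prev, c_prev):
        concat = np.concatenate([h_prev, x_t])                  # [h_{t-1}, x_t]
        f_t = sigmoid(self.W["f"] @ concat + self.b["f"])       # forget gate
        i_t = sigmoid(self.W["i"] @ concat + self.b["i"])       # input gate
        c_hat_t = np.tanh(self.W["c"] @ concat + self.b["c"])   # candidate memory
        o_t = sigmoid(self.W["o"] @ concat + self.b["o"])       # output gate
        c_t = f_t * c_prev + i_t * c_hat_t                      # new cell state
        h_t = o_t * np.tanh(c_t)                                # new hidden state
        return h_t, c_t, {"f": f_t, "i": i_t, "o": o_t}

    def run(self, xs):
        """Process a sequence and record gates and states at every step."""
        h = np.zeros(self.hidden_size)
        c = np.zeros(self.hidden_size)
        history = []
        for x_t in xs:
            h, c, gates = self.step(x_t, h, c)
            history.append({"h": h, "c": c, **gates})
        return history

cell = MiniLSTM(input_size=2, hidden_size=3)
xs = np.random.default_rng(1).normal(size=(5, 2))   # 5 time steps, 2 features each
history = cell.run(xs)
for t, rec in enumerate(history):
    print(t, "forget:", np.round(rec["f"], 2), "cell:", np.round(rec["c"], 2))
```

The history list is what makes the gate and memory plots possible: each entry holds the forget, input and output gate values plus both states for one time step.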

What clicked: LSTM is complex on paper, but the implementation is a sequence of small gate operations.

11 LSTM for Sentiment Analysis

I also tried a real NLP task using LSTM: sentiment analysis on IMDB-style movie reviews. The flow was simple: convert words into integer sequences, pad them to fixed length, pass them through an embedding layer, then use LSTM and a sigmoid output for binary classification.

Keras
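# Imports assume the Keras API bundled with TensorFlow
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, LSTM, Dense

model = Sequential()
# 5000-word vocabulary, 32-dimensional embeddings, reviews padded to 200 tokens.
# Newer Keras versions may warn that input_length is deprecated; it can be dropped there.
model.add(Embedding(5000, 32, input_length=200))
model.add(LSTM(64))                          # 64 units; passes on only the final hidden state
model.add(Dense(1, activation='sigmoid'))    # single score between 0 and 1

model.compile(
    loss='binary_crossentropy',
    optimizer='adam',
    metrics=['accuracy']
)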

I also wrote a small function to encode new text, pad it and predict whether the sentiment is positive or negative.
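
The encoding and padding step can be sketched in plain Python. This is not the notebook function: the name encode_and_pad, the tiny word_index and the choice of 1 as the unknown-word index are assumptions for illustration. Padding and truncating on the left mirrors the default behaviour of Keras pad_sequences.

```python
def encode_and_pad(text, word_index, max_len=200, vocab_size=5000, oov_index=1):
    """Turn raw text into a fixed-length list of integer ids."""
    ids = []
    for token in text.lower().split():           # naive whitespace tokenizer
        idx = word_index.get(token, oov_index)   # unknown words map to oov_index
        if idx >= vocab_size:                    # ids outside the embedding table
            idx = oov_index
        ids.append(idx)
    ids = ids[-max_len:]                         # keep the last max_len tokens
    return [0] * (max_len - len(ids)) + ids      # left-pad with zeros

# Hypothetical vocabulary, only for this example
word_index = {"this": 2, "movie": 3, "was": 4, "great": 5}

padded = encode_and_pad("This movie was really great", word_index, max_len=8)
print(padded)   # [0, 0, 0, 2, 3, 4, 1, 5]

# With a trained model, prediction would then look roughly like:
# prob = model.predict(np.array([padded]))[0][0]
# label = "positive" if prob >= 0.5 else "negative"
```

The word "really" is not in the toy vocabulary, so it becomes 1. The same word-to-index mapping used during training has to be reused here, otherwise the embedding layer sees the wrong words.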

[Figure: LSTM sentiment analysis and next-word prediction workflow showing tokens, embeddings, memory gates and output]
Caption: LSTM can be used for NLP tasks such as sentiment analysis and next-word prediction because it processes sequences while carrying memory.

12 PyTorch LSTM for Next Word Prediction

The second notebook connected LSTM with the PyTorch workflow I had already learned. I used a text document, tokenized it with NLTK, built a vocabulary, converted the text to indices and created training sequences for next-word prediction.

The idea was this: given a partial sequence, predict the next token. This made LSTM feel more directly connected to language modeling.
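
Here is a minimal sketch of how such training pairs can be built. It is plain Python with pre-tokenized toy sentences instead of NLTK, and the helper names and special tokens (<pad>, <unk>) are assumptions for this example rather than the notebook code.

```python
def build_vocab(tokens):
    """Map each distinct token to an integer id, reserving 0 and 1."""
    vocab = {"<pad>": 0, "<unk>": 1}
    for tok in tokens:
        if tok not in vocab:
            vocab[tok] = len(vocab)
    return vocab

def make_training_pairs(sentences, vocab):
    """For every sentence, pair each prefix with the token that follows it."""
    pairs = []
    for sent in sentences:
        ids = [vocab.get(tok, vocab["<unk>"]) for tok in sent]
        for i in range(1, len(ids)):
            pairs.append((ids[:i], ids[i]))      # (partial sequence, next token)
    return pairs

def left_pad(seq, length, pad_id=0):
    return [pad_id] * (length - len(seq)) + seq

# Toy corpus, already tokenized
sentences = [["the", "cat", "sat"], ["the", "dog", "ran", "away"]]
vocab = build_vocab(tok for sent in sentences for tok in sent)

pairs = make_training_pairs(sentences, vocab)
max_len = max(len(prefix) for prefix, _ in pairs)
X = [left_pad(prefix, max_len) for prefix, _ in pairs]
y = [target for _, target in pairs]

print(X)   # [[0, 0, 2], [0, 2, 3], [0, 0, 2], [0, 2, 5], [2, 5, 6]]
print(y)   # [3, 4, 5, 6, 7]
```

Padding on the left keeps the real tokens at the end of each row, so the last time step the LSTM sees is always an actual word rather than padding.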

PyTorch
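import torch.nn as nn

class LSTMModel(nn.Module):
    def __init__(self, vocab_size):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, 100)     # token index -> 100-dim vector
        self.lstm = nn.LSTM(100, 150, batch_first=True)    # expects (batch, seq_len, 100)
        self.fc = nn.Linear(150, vocab_size)               # hidden state -> vocabulary scores

    def forward(self, x):
        embedded = self.embedding(x)                 # (batch, seq_len, 100)
        outputs, (h_n, c_n) = self.lstm(embedded)    # h_n: (1, batch, 150)
        output = self.fc(h_n.squeeze(0))             # (batch, vocab_size)
        return output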

I trained the model using CrossEntropyLoss and Adam optimizer. The prediction function converted the input text into indices, padded it, sent it through the model and selected the word with the highest score.

What clicked: for next-word prediction, the final hidden state becomes a summary of the input sequence, and the final linear layer maps that summary to vocabulary scores.
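
The output shapes of nn.LSTM confused me at first, so a quick shape check is useful. The sketch below uses the same layer sizes as the model above; the vocabulary size, batch size and sequence length are made-up toy values, and the input is random token indices rather than real text.

```python
import torch
import torch.nn as nn

vocab_size, batch_size, seq_len = 50, 4, 7   # assumed toy sizes

embedding = nn.Embedding(vocab_size, 100)
lstm = nn.LSTM(100, 150, batch_first=True)
fc = nn.Linear(150, vocab_size)

x = torch.randint(0, vocab_size, (batch_size, seq_len))   # fake token indices
embedded = embedding(x)
outputs, (h_n, c_n) = lstm(embedded)

print(embedded.shape)   # torch.Size([4, 7, 100])
print(outputs.shape)    # torch.Size([4, 7, 150])  hidden state at every time step
print(h_n.shape)        # torch.Size([1, 4, 150])  final hidden state only
print(c_n.shape)        # torch.Size([1, 4, 150])  final cell state only

# For a single-layer, one-directional LSTM the last time step of outputs
# is the same tensor as the final hidden state.
same = torch.allclose(outputs[:, -1, :], h_n[0])

logits = fc(h_n.squeeze(0))          # (4, 50): one score per vocabulary word
predicted = logits.argmax(dim=1)     # index of the highest-scoring word per row
print(logits.shape, predicted.shape)
```

The leading dimension of 1 in h_n is the number of layers times the number of directions, which is why squeeze(0) is needed before the linear layer.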

13 RNN vs LSTM

This comparison made the topic much clearer for me.

Feature | Simple RNN | LSTM
Memory | mainly hidden state | cell state + hidden state
Long-term dependency | weak | stronger
Vanishing gradient | high chance | reduced
Architecture | simple | more complex
Gates | no gates | forget, input and output gates
Long sequence handling | struggles | better

14 Mistakes and Confusions I Noticed

LSTM is powerful, but it can be confusing in the beginning because there are many symbols and states. I had to repeatedly separate h_t, C_t, gates and candidate memory.

Confusions

  • thinking cell state and hidden state are the same
  • forgetting that LSTM has three gates
  • mixing candidate memory with final cell state
  • not understanding output shape from PyTorch LSTM
  • confusing padding length with sequence length

Better Thinking

  • cell state carries long-term memory
  • hidden state carries current output
  • gates decide memory flow
  • embeddings are inputs to LSTM
  • final linear layer maps hidden state to prediction
My final caution: LSTM improves RNN memory, but it is still not the final stage of NLP. After LSTM, I need to understand stacked models, bidirectional models, GRU and eventually transformers.

15 My Final Understanding

My final understanding is that LSTM is a memory-controlled sequence model. It processes text step by step like RNN, but it adds a cell state and gates so the model can preserve useful information for longer sequences.

01
LSTM is a gated RNN
It is built for sequential data but improves memory control compared to a simple RNN.
02
Cell state carries long-term memory
It works like a memory highway where important information can move forward across time steps.
03
Gates control information
Forget, input and output gates decide what to remove, store and expose.
04
LSTM supports real NLP tasks
It can be used for sentiment analysis, next-word prediction, sequence classification and other language tasks.

16 GitHub Notebook Connection

This blog explains what I understood from my LSTM theory and implementation notebooks. The implementation side is connected to the NLP by Vinod GitHub repository.

NLP by Vinod GitHub Repository

Notebook references: 01_LSTM_implementation.ipynb, 02_pytorch_lstm_next_word_predictor.ipynb, plus theory notes from lstm_theory.html.

17 What Comes Next in the NLP Journey

The next topics are Stacked RNN/LSTM, Bidirectional RNN/LSTM and GRU. Now that I understand a single-layer LSTM, I want to learn how sequence models become deeper, how bidirectional context works and how GRU simplifies the LSTM idea.

01
Stacked sequence models

How multiple recurrent layers can learn deeper sequence representations.

02
Bidirectional models

How models can use both past and future context in a sequence.

03
GRU

How GRU offers a simpler gated recurrent architecture.

LSTM made sequence memory feel controlled, not accidental.

This topic helped me understand why simple RNNs struggle with long context and how LSTM uses gates, hidden state and cell state to improve sequence modeling.
