LSTM for Sequence Modeling - Long-Term Memory in Neural Networks
After RNNs, I learned why simple recurrence struggles with long sequences and how LSTM improves memory using gates, cell state and hidden state.
After RNNs, LSTM was the natural next topic. In the previous post, I saw that RNNs read text step by step using a hidden state. I also saw the limitation: as the sequence gets longer, information from the early steps fades. This is why simple RNNs struggle with long-term dependencies.
LSTM stands for Long Short-Term Memory. My current understanding is that LSTM is an improved version of RNN that still processes sequential data step by step, but it has a smarter memory system. Instead of relying only on one hidden state, LSTM uses a cell state for long-term memory and a hidden state for current or short-term context.
In this topic, I learned the theory of LSTM gates, implemented a small LSTM from scratch using NumPy, visualized how gates and memory change over time, built a sentiment analysis example, and then used PyTorch to create a next-word prediction model.
RNN gives memory to neural networks. LSTM makes that memory more controlled by deciding what to forget, what to store and what to output.
01 Why LSTM Was Needed After RNN
Simple RNNs are useful because they process data in order. But the same hidden state keeps getting updated again and again. In short sequences this can work, but in long sequences the model may forget earlier information.
This becomes a serious issue in NLP. A word at the beginning of a sentence or paragraph may affect the meaning much later. If the model cannot preserve that context, its predictions get worse.
Problems in Simple RNN
- struggles with long-term dependencies
- vanishing gradient problem
- exploding gradient problem
- old context can be overwritten
- long sequences become difficult
How LSTM Helps
- uses gates to control memory
- keeps useful old information
- adds useful new information
- uses cell state for long-term memory
- handles longer context better
LSTM keeps two states at every time step: one hidden state h_t and one cell state C_t. The cell state behaves like long-term memory, while the hidden state carries current output or short-term context.
02 Long-Term Dependency Intuition
Consider this sentence from my theory notes:
Vinod is a student of B.Tech final year. He studies at VNIT Nagpur.
To understand the word He, the model should remember earlier information: Vinod. A simple RNN may weaken this information over a long sentence. LSTM is designed to carry useful information for a longer time using the cell state.
03 Cell State vs Hidden State
The most important difference for me was cell state and hidden state. Earlier in RNN, I mainly thought about hidden state. In LSTM, the cell state becomes the memory highway.
| State | Symbol | How I Understood It |
|---|---|---|
| Cell State | C_t | long-term memory that carries useful information across many time steps |
| Hidden State | h_t | current output or short-term context passed to the next step |
| Previous Cell | C_{t-1} | old long-term memory before update |
| Previous Hidden | h_{t-1} | previous short-term context |
C_t remembers what should be carried forward. h_t decides what should be exposed as the current output.
04 The Three Gates of LSTM
LSTM has three main gates. These gates are not physical gates; they are neural network layers that output values between 0 and 1 using sigmoid. A value close to 0 means block or forget. A value close to 1 means allow or keep.
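As a small illustration of that gating idea, here is a sketch in plain Python. The memory values and the gate scores are made up for the example; only the sigmoid-then-multiply pattern comes from the description above.

```python
import math

def sigmoid(z):
    """Squash any real number into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical memory values and gate pre-activations (illustrative only).
memory = [2.0, -1.5, 0.8]
gate_scores = [-6.0, 0.0, 6.0]   # very negative, neutral, very positive

gate = [sigmoid(z) for z in gate_scores]
gated_memory = [g * m for g, m in zip(gate, memory)]

for z, g, m, out in zip(gate_scores, gate, memory, gated_memory):
    print(f"score={z:+.1f}  gate={g:.3f}  memory={m:+.2f}  passed through={out:+.3f}")
```

A score far below zero almost blocks the value, a score of zero passes half of it, and a score far above zero passes nearly all of it.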
05 Forget Gate
The forget gate decides how much of the old cell state should be kept. It looks at the previous hidden state and current input, then creates a forget vector.
f_t = sigmoid(W_f [h_{t-1}, x_t] + b_f)
If a value in f_t is close to 0, that part of memory is forgotten. If it is close to 1, that part is preserved.
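To make the bracket notation concrete, here is a minimal NumPy sketch of the forget gate. The vector sizes, the random weights and the hand-picked bias are all assumptions for illustration; in a trained LSTM, W_f and b_f are learned.

```python
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

hidden_size, input_size = 3, 2          # assumed toy sizes
rng = np.random.default_rng(0)

h_prev = np.array([0.5, -0.2, 0.1])     # h_{t-1}
x_t = np.array([1.0, 0.0])              # x_t
c_prev = np.array([2.0, -1.0, 0.5])     # C_{t-1}

# [h_{t-1}, x_t] means joining the two vectors end to end.
concat = np.concatenate([h_prev, x_t])  # shape (5,)

# Random stand-in for the learned weight matrix.
W_f = rng.normal(scale=0.5, size=(hidden_size, hidden_size + input_size))
b_f = np.array([5.0, 0.0, -5.0])        # chosen so the gate values differ clearly

f_t = sigmoid(W_f @ concat + b_f)       # one value in (0, 1) per memory slot
kept = f_t * c_prev                     # element-wise: how much old memory survives

print("f_t :", np.round(f_t, 3))
print("kept:", np.round(kept, 3))
```

Each entry of f_t scales one entry of the old cell state, so the gate can keep one part of memory while dropping another.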
06 Input Gate and Candidate Memory
The input gate controls what new information should enter the cell state. But it does not work alone. LSTM also creates candidate memory using tanh.
i_t = sigmoid(W_i [h_{t-1}, x_t] + b_i)
C̃_t = tanh(W_c [h_{t-1}, x_t] + b_c)
The input gate decides how much new information should be stored. The candidate memory contains possible new values that may be added to the cell state.
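The split between "how much to write" and "what to write" can be sketched in NumPy as below. The weights are random stand-ins for learned parameters and the sizes are assumed; the point is only the value ranges of the two parts.

```python
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

rng = np.random.default_rng(1)
hidden_size, input_size = 3, 2                   # assumed toy sizes
concat = np.array([0.5, -0.2, 0.1, 1.0, 0.0])    # [h_{t-1}, x_t]

# Random stand-ins for the learned parameters W_i, b_i, W_c, b_c.
W_i = rng.normal(size=(hidden_size, hidden_size + input_size))
W_c = rng.normal(size=(hidden_size, hidden_size + input_size))
b_i = np.zeros(hidden_size)
b_c = np.zeros(hidden_size)

i_t = sigmoid(W_i @ concat + b_i)        # how much to write, in (0, 1)
c_tilde = np.tanh(W_c @ concat + b_c)    # what to write, in (-1, 1)
new_info = i_t * c_tilde                 # the part that gets added to the cell state

print("i_t      :", np.round(i_t, 3))
print("c_tilde  :", np.round(c_tilde, 3))
print("new_info :", np.round(new_info, 3))
```

The sigmoid only scales. The sign of what gets written comes from tanh, which is why the candidate can both raise and lower a memory value.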
07 Updating the Cell State
This is where LSTM memory actually updates. It combines useful old memory and useful new memory.
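C_t = f_t * C_{t-1} + i_t * C̃_t
Here * is element-wise multiplication: f_t scales the old memory and i_t scales the candidate memory.
08 Output Gate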
The output gate decides what part of the updated cell state should become the current hidden state. The hidden state is then passed forward and can also be used for prediction.
o_t = sigmoid(W_o [h_{t-1}, x_t] + b_o)
h_t = o_t * tanh(C_t)
This helped me understand why LSTM output is controlled. It does not expose all memory directly. It filters memory through the output gate.
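Putting the equations together, one full LSTM time step might look like the NumPy sketch below. The `lstm_step` and `init_params` helpers, the toy sizes and the random weights are assumptions for illustration, not the notebook's code.

```python
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def lstm_step(x_t, h_prev, c_prev, params):
    """One LSTM time step, written to mirror the gate equations."""
    concat = np.concatenate([h_prev, x_t])                     # [h_{t-1}, x_t]
    f_t = sigmoid(params["W_f"] @ concat + params["b_f"])      # forget gate
    i_t = sigmoid(params["W_i"] @ concat + params["b_i"])      # input gate
    c_tilde = np.tanh(params["W_c"] @ concat + params["b_c"])  # candidate memory
    c_t = f_t * c_prev + i_t * c_tilde                         # cell state update
    o_t = sigmoid(params["W_o"] @ concat + params["b_o"])      # output gate
    h_t = o_t * np.tanh(c_t)                                   # new hidden state
    return h_t, c_t

def init_params(input_size, hidden_size, seed=0):
    """Random stand-ins for learned weights (hypothetical helper)."""
    rng = np.random.default_rng(seed)
    params = {}
    for gate in "fico":
        shape = (hidden_size, hidden_size + input_size)
        params[f"W_{gate}"] = rng.normal(scale=0.1, size=shape)
        params[f"b_{gate}"] = np.zeros(hidden_size)
    return params

params = init_params(input_size=2, hidden_size=3)
h, c = np.zeros(3), np.zeros(3)
for x_t in [np.array([1.0, 0.0]), np.array([0.0, 1.0]), np.array([1.0, 1.0])]:
    h, c = lstm_step(x_t, h, c, params)
    print("h_t =", np.round(h, 4), " C_t =", np.round(c, 4))
```

Note that h_t is always bounded between -1 and 1 because it is a sigmoid value times a tanh value, while C_t itself has no such bound.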
09 Vanishing and Exploding Gradient Demo
In the implementation notebook, I first created a simple demo for gradient flow over time. When the recurrent multiplier was less than 1, the gradient kept shrinking. When it was greater than 1, the gradient kept growing.
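```python
T = 20                      # number of time steps to backpropagate through
Us = [0.5, 2.0]             # recurrent multipliers: one below 1, one above 1
labels = ["Vanishing Gradient", "Exploding Gradient"]

for U, label in zip(Us, labels):
    grads = []
    grad = 1.0
    for t in range(T):
        grad *= U           # each step multiplies the gradient by U
        grads.append(grad)
    print(f"{label}: gradient after {T} steps = {grads[-1]:.3e}")
```
When U = 0.5, the gradient shrinks exponentially: after 20 steps it is 0.5^20 ≈ 9.5e-7. When U = 2.0, it grows exponentially to 2^20 = 1,048,576. This explains why training simple RNNs on long sequences can become difficult.
10 LSTM from Scratch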
After the theory, I implemented a simplified LSTM step using NumPy. This helped me see the gates as actual calculations instead of only diagrams.
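```python
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def tanh(x):
    return np.tanh(x)

x_t = 1.0      # current input (a scalar, to keep the example small)
h_prev = 0.5   # previous hidden state
c_prev = 0.0   # previous cell state

# Simplification: a full LSTM concatenates [h_prev, x_t] and applies a separate
# weight matrix and bias per gate. Here the two values are just summed and
# every gate sees the same input.
concat = h_prev + x_t

f_t = sigmoid(concat)                 # forget gate
i_t = sigmoid(concat)                 # input gate
c_hat_t = tanh(concat)                # candidate memory
c_t = f_t * c_prev + i_t * c_hat_t    # updated cell state
o_t = sigmoid(concat)                 # output gate
h_t = o_t * tanh(c_t)                 # new hidden state
```
Then I created a small custom MiniLSTM class and ran it across multiple time steps. I tracked how the forget gate, input gate, cell state and hidden state changed over time.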
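The MiniLSTM class itself is not shown in this post, so the following is only a guess at its general shape: a scalar cell that records its gates and states at each step. The weights are hand-picked constants chosen for illustration, not trained values and not the notebook's numbers.

```python
import math

def sigmoid(x):
    return 1 / (1 + math.exp(-x))

class MiniLSTM:
    """Scalar LSTM cell that logs its gates and states (hypothetical sketch)."""

    def __init__(self):
        # One (w_h, w_x, b) triple per gate; all values are assumptions.
        self.w = {"f": (0.5, 0.5, 1.0), "i": (0.5, 1.0, 0.0),
                  "c": (0.5, 1.0, 0.0), "o": (0.5, 0.5, 0.0)}
        self.h, self.c = 0.0, 0.0
        self.history = []

    def _pre(self, gate, x):
        """Weighted sum of previous hidden state and input for one gate."""
        w_h, w_x, b = self.w[gate]
        return w_h * self.h + w_x * x + b

    def step(self, x):
        # All gates read the previous hidden state before it is overwritten.
        f = sigmoid(self._pre("f", x))
        i = sigmoid(self._pre("i", x))
        c_hat = math.tanh(self._pre("c", x))
        o = sigmoid(self._pre("o", x))
        self.c = f * self.c + i * c_hat
        self.h = o * math.tanh(self.c)
        self.history.append({"f": f, "i": i, "c": self.c, "h": self.h})
        return self.h

cell = MiniLSTM()
for t, x in enumerate([1.0, 0.0, 0.0, 1.0, 0.0]):
    cell.step(x)
    rec = cell.history[-1]
    print(f"t={t}  f={rec['f']:.3f}  i={rec['i']:.3f}  C={rec['c']:.3f}  h={rec['h']:.3f}")
```

The `history` list is what a plot of gates and memory over time would be drawn from.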
11 LSTM for Sentiment Analysis
I also tried a real NLP task using LSTM: sentiment analysis on IMDB-style movie reviews. The flow was simple: convert words into integer sequences, pad them to fixed length, pass them through an embedding layer, then use LSTM and a sigmoid output for binary classification.
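```python
# Imports are assumed; the original snippet did not show them.
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, LSTM, Dense

model = Sequential()
# 5000-word vocabulary, 32-dimensional embeddings, reviews padded to 200 tokens.
# Note: recent Keras versions deprecate input_length; it can be dropped there.
model.add(Embedding(5000, 32, input_length=200))
model.add(LSTM(64))                        # 64 units; returns only the final hidden state
model.add(Dense(1, activation='sigmoid'))  # single probability for binary sentiment
model.compile(
    loss='binary_crossentropy',
    optimizer='adam',
    metrics=['accuracy']
)
```
I also wrote a small function to encode new text, pad it and predict whether the sentiment is positive or negative.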
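The encode-and-pad part of such a function can be sketched in plain Python as follows. The tiny `word_index`, the reserved ids 0 (padding) and 1 (unknown word) and the helper name `encode_and_pad` are assumptions for this example; the notebook's own function and the IMDB word index may use different conventions.

```python
def encode_and_pad(text, word_index, max_len=200, vocab_size=5000,
                   pad_id=0, unk_id=1):
    """Turn raw text into a fixed-length list of integer ids (hypothetical helper)."""
    ids = []
    for word in text.lower().split():
        idx = word_index.get(word, unk_id)
        # Ids outside the embedding's vocabulary are treated as unknown.
        ids.append(idx if idx < vocab_size else unk_id)
    ids = ids[-max_len:]                          # keep the last max_len tokens
    return [pad_id] * (max_len - len(ids)) + ids  # left-pad with pad_id

# Toy vocabulary, illustrative only.
word_index = {"the": 2, "movie": 3, "was": 4, "great": 5, "boring": 6}

encoded = encode_and_pad("The movie was GREAT fun", word_index, max_len=8)
print(encoded)   # [0, 0, 0, 2, 3, 4, 5, 1]
```

Padding on the left and truncating from the front mirror the defaults of Keras `pad_sequences`. The resulting list would then be wrapped in a batch of one and passed to `model.predict`, with an output above 0.5 read as positive, assuming label 1 means positive in the training data.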
12 PyTorch LSTM for Next Word Prediction
The second notebook connected LSTM with the PyTorch workflow I already learned. I used a text document, tokenized it with NLTK, built a vocabulary, converted text to indices and created training sequences for next-word prediction.
The idea was this: given a partial sequence, predict the next token. This made LSTM feel more directly connected to language modeling.
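One common way to build such training pairs is to take every prefix of a sentence as the input and the token that follows it as the target. The sketch below assumes that scheme and uses whitespace splitting as a stand-in for the NLTK tokenizer; the helper names and the two toy sentences are hypothetical.

```python
def build_vocab(sentences):
    """Map each distinct token to an integer id; 0 is reserved for padding."""
    vocab = {"<pad>": 0}
    for sentence in sentences:
        for token in sentence.lower().split():
            vocab.setdefault(token, len(vocab))
    return vocab

def make_training_pairs(sentences, vocab):
    """Return (prefix_ids, next_id) pairs for next-word prediction."""
    pairs = []
    for sentence in sentences:
        ids = [vocab[token] for token in sentence.lower().split()]
        for end in range(1, len(ids)):
            pairs.append((ids[:end], ids[end]))
    return pairs

def left_pad(prefix, length, pad_id=0):
    """Pad a prefix on the left so all inputs share one length."""
    return [pad_id] * (length - len(prefix)) + prefix

sentences = ["lstm keeps long term memory", "lstm uses gates"]
vocab = build_vocab(sentences)
pairs = make_training_pairs(sentences, vocab)
max_len = max(len(prefix) for prefix, _ in pairs)

for prefix, target in pairs:
    print(left_pad(prefix, max_len), "->", target)
```

A five-token sentence yields four training pairs, so even a short document produces many examples.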
```python
import torch.nn as nn

class LSTMModel(nn.Module):
    def __init__(self, vocab_size):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, 100)    # token index -> 100-dim vector
        self.lstm = nn.LSTM(100, 150, batch_first=True)   # input size 100, hidden size 150
        self.fc = nn.Linear(150, vocab_size)              # hidden state -> one score per word

    def forward(self, x):
        embedded = self.embedding(x)                 # (batch, seq_len, 100)
        outputs, (h_n, c_n) = self.lstm(embedded)    # h_n: (1, batch, 150)
        output = self.fc(h_n.squeeze(0))             # (batch, vocab_size)
        return output
```
I trained the model using CrossEntropyLoss and the Adam optimizer. The prediction function converted the input text into indices, padded it, sent it through the model and selected the word with the highest score.
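Since the shapes returned by `nn.LSTM` are easy to mix up, here is a small check using the same layer sizes as the model. The vocabulary size, batch size and sequence length are assumed toy values, and the token ids are random.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

vocab_size, batch_size, seq_len = 20, 4, 6   # assumed toy sizes

embedding = nn.Embedding(vocab_size, 100)
lstm = nn.LSTM(100, 150, batch_first=True)
fc = nn.Linear(150, vocab_size)

x = torch.randint(0, vocab_size, (batch_size, seq_len))  # random token ids

with torch.no_grad():
    embedded = embedding(x)                 # (4, 6, 100)
    outputs, (h_n, c_n) = lstm(embedded)
    logits = fc(h_n.squeeze(0))             # (4, 20): one score per vocabulary word
    predicted_ids = logits.argmax(dim=1)    # (4,): highest-scoring word per sequence

print("outputs:", tuple(outputs.shape))     # hidden state at every time step
print("h_n    :", tuple(h_n.shape))         # final hidden state, one slice per layer
print("c_n    :", tuple(c_n.shape))         # final cell state, one slice per layer
print("logits :", tuple(logits.shape))
```

For a single-layer, one-directional LSTM, `h_n[0]` equals `outputs[:, -1, :]`, the hidden state after the last position. If sequences were padded on the right instead of the left, that last position would be a padding token, which is one reason the padding side matters for this model.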
13 RNN vs LSTM
This comparison made the topic much clearer for me.
| Feature | Simple RNN | LSTM |
|---|---|---|
| Memory | mainly hidden state | cell state + hidden state |
| Long-term dependency | weak | stronger |
| Vanishing gradient | high chance | reduced |
| Architecture | simple | more complex |
| Gates | no gates | forget, input and output gates |
| Long sequence handling | struggles | better |
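One way to see why the table says "reduced" rather than "solved": along the cell-state path, the derivative of C_t with respect to C_{t-1} is f_t, if the gates are treated as constants and the other paths through h_t are ignored. The gradient over many steps is then a product of forget-gate values, which the network can keep close to 1. The sketch below uses an assumed constant forget gate of 0.95 purely for illustration.

```python
T = 20

rnn_multiplier = 0.5   # same value as in the vanishing-gradient demo
forget_gate = 0.95     # assumed constant forget-gate value close to 1

rnn_grad = rnn_multiplier ** T      # repeated multiplication by 0.5
lstm_cell_grad = forget_gate ** T   # repeated multiplication by f_t

print(f"simple RNN path : {rnn_grad:.2e}")
print(f"LSTM cell path  : {lstm_cell_grad:.2e}")
```

If the forget gate stays small, the product still shrinks quickly, so the LSTM can still lose information; it just has a learned way to avoid doing so.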
14 Mistakes and Confusions I Noticed
LSTM is powerful, but it can be confusing in the beginning because there are many symbols and states. I had to repeatedly separate h_t, C_t, gates and candidate memory.
Confusions
- thinking cell state and hidden state are the same
- forgetting that LSTM has three gates
- mixing candidate memory with final cell state
- not understanding output shape from PyTorch LSTM
- confusing padding length with sequence length
Better Thinking
- cell state carries long-term memory
- hidden state carries current output
- gates decide memory flow
- embeddings are inputs to LSTM
- final linear layer maps hidden state to prediction
15 My Final Understanding
My final understanding is that LSTM is a memory-controlled sequence model. It processes text step by step like RNN, but it adds a cell state and gates so the model can preserve useful information for longer sequences.
16 GitHub Notebook Connection
This blog explains what I understood from my LSTM theory and implementation notebooks. The implementation side is connected to the NLP by Vinod GitHub repository.
Notebook references: 01_LSTM_implementation.ipynb, 02_pytorch_lstm_next_word_predictor.ipynb, plus theory notes from lstm_theory.html.
18 What Comes Next in the NLP Journey
The next topic is Stacked RNN/LSTM, Bidirectional RNN/LSTM and GRU. After understanding simple LSTM, I now want to learn how sequence models become deeper, how bidirectional context works and how GRU simplifies the LSTM idea.