Impressions After Reading Chapter 4 of “LLM From Scratch”

힘센캥거루
November 26, 2025

Today is November 26, so if I finish one chapter a day, I’ll complete the challenge.

I’m not sure if I can do it with my first and second kids constantly interrupting me.

1. Dummy Transformer


While building the GPT model, I saw that we can pull in a Transformer block to use as a dummy.

Looking it up, I found that several Transformer modules are already implemented inside torch.nn.

Such a model is called a dummy because the architecture is already in place, but it hasn't been trained yet.
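For instance, torch.nn really does ship with generic Transformer building blocks such as nn.TransformerEncoderLayer. A freshly constructed layer has the full architecture, but its weights are random until it is trained; the dimensions below are just illustrative values, not the book's:

```python
import torch
import torch.nn as nn

# A generic Transformer encoder layer from PyTorch: the architecture is
# complete, but the weights are randomly initialized until we train them.
layer = nn.TransformerEncoderLayer(
    d_model=64,           # embedding dimension
    nhead=4,              # number of attention heads
    dim_feedforward=128,  # hidden size of the internal feed-forward network
    batch_first=True,     # inputs are (batch, sequence, embedding)
)

x = torch.randn(2, 10, 64)  # a batch of 2 sequences, 10 tokens each
out = layer(x)
print(out.shape)            # the layer preserves the input shape
```

Passing data through it works immediately; it just produces meaningless outputs until training adjusts the weights.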

2. Normalization

If one dimension of the values entering a Transformer block is much larger than the others, the data becomes biased in that direction.

So we transform the values so that the mean becomes 0 and the variance becomes 1.

And in the Feed Forward Network, we first expand the dimension with a Linear layer and then apply a nonlinear transformation.

At this point we use nonlinear activation functions such as ReLU or GELU.


It's said that because GELU is smoother than ReLU (it has no hard corner at zero), the parameters can be tuned more finely.

And in each block, we increase the dimension with a Linear layer and then reduce it again.

By increasing the dimension, richer nonlinear exploration becomes possible.

For example, if you have apples, oranges, and carrots, you can try cooking them in many different ways: blanching, roasting, chopping, mixing, and so on.

Then you take the finished dishes, extract only the essence, and pack that essence back into the original dimensional space.
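The expand, apply a nonlinearity, then contract pattern can be sketched as a small module. The class name FeedForward and the 4x expansion factor are my own choices here (4x happens to match the GPT-2 convention, but any multiple illustrates the idea):

```python
import torch
import torch.nn as nn

class FeedForward(nn.Module):
    """Expand the embedding, apply a nonlinearity, then contract it back."""
    def __init__(self, emb_dim, hidden_mult=4):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(emb_dim, hidden_mult * emb_dim),  # expand: "try many recipes"
            nn.GELU(),                                  # nonlinear transformation
            nn.Linear(hidden_mult * emb_dim, emb_dim),  # contract: "keep the essence"
        )

    def forward(self, x):
        return self.net(x)

ffn = FeedForward(emb_dim=64)
x = torch.randn(2, 10, 64)
print(ffn(x).shape)  # the output shape matches the input shape
```

The richer exploration happens in the wider hidden space; the final Linear layer packs the result back into the original dimension so blocks can be stacked.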


3. Shortcut

Learning starts from where the loss occurs and then traces back through the model.

For example, imagine water is dripping from the 2nd floor.

Then you go up to the 3rd, 4th, and 5th floors, tracing back to find where the leak is happening.

This process is called backpropagation.


But in a model built like the one above, the gradient shrinks a little more every time it passes back through a layer.

When the gradient vanishes, layers deeper in the network effectively don’t learn during backpropagation, so stacking many layers becomes meaningless.

To prevent gradient vanishing, we add each layer's input to its output, providing a detour (shortcut) pathway.

This is easier to understand mathematically.

Residual connection: y = x + F(x)
Backward pass: dL/dx = dL/dy · (1 + dF/dx)

Because the identity path always contributes the 1 in each layer's gradient, the gradient does not vanish even when dF/dx is small.

This helps enable an effective training process.
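This can be checked numerically with autograd. Here F is a toy "layer" I made up that shrinks its input to 1% (so dF/dx = 0.01), which makes the effect of the shortcut obvious:

```python
import torch

def F(x):
    # A toy layer that shrinks its input heavily: dF/dx = 0.01.
    return 0.01 * x

# With the shortcut: y = x + F(x)
x = torch.tensor(3.0, requires_grad=True)
y = x + F(x)
y.backward()
print(x.grad.item())   # 1 + dF/dx = 1.01, the gradient survives

# Without the shortcut: y = F(x)
x2 = torch.tensor(3.0, requires_grad=True)
y2 = F(x2)
y2.backward()
print(x2.grad.item())  # dF/dx = 0.01, the gradient is nearly gone
```

Stacked over many layers, 0.01 multiplied by itself quickly reaches zero, while the shortcut keeps every layer's gradient at least around 1.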

4. Building a GPT Model

Now we plug the attention mechanism into the dummy model we created, replacing the placeholder blocks with real Transformer blocks.

Finally, we stack this Transformer block multiple times, and then pass the output through a decoding step that turns it back into tokens.
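The overall assembly can be sketched as follows. The class name MiniGPT and all the sizes are my own illustrative choices, and I use PyTorch's TransformerEncoderLayer as a stand-in for the book's hand-built block (note: a real GPT block uses causally masked attention, which this stand-in omits):

```python
import torch
import torch.nn as nn

class MiniGPT(nn.Module):
    """Sketch of a GPT assembly: embed tokens, run them through a stack of
    Transformer blocks, then project back to scores over the vocabulary."""
    def __init__(self, vocab_size=100, emb_dim=64, n_layers=3, context_len=32):
        super().__init__()
        self.tok_emb = nn.Embedding(vocab_size, emb_dim)    # token embeddings
        self.pos_emb = nn.Embedding(context_len, emb_dim)   # position embeddings
        self.blocks = nn.ModuleList([
            nn.TransformerEncoderLayer(d_model=emb_dim, nhead=4, batch_first=True)
            for _ in range(n_layers)  # the block is repeated multiple times
        ])
        self.final_norm = nn.LayerNorm(emb_dim)
        self.out_head = nn.Linear(emb_dim, vocab_size)      # logits per vocab entry

    def forward(self, idx):
        seq_len = idx.shape[1]
        x = self.tok_emb(idx) + self.pos_emb(torch.arange(seq_len))
        for block in self.blocks:
            x = block(x)
        return self.out_head(self.final_norm(x))

model = MiniGPT()
tokens = torch.randint(0, 100, (1, 8))  # one sequence of 8 token ids
logits = model(tokens)
print(logits.shape)                     # (1, 8, 100): a score for every vocab entry
```

Decoding then means picking the next token from the last position's scores, appending it, and repeating.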


5. Thoughts

What started off light gradually became more and more overwhelming toward the end.

The code looks difficult, but when I learned through analogies and metaphors, it actually wasn’t that hard.

Times have changed, and understanding is now more important than memorizing the content itself.

If you only memorize, you can’t create anything new; but if you understand, you can complete a GPT model with the help of AI.

There isn’t much time left, but I’ll still try my best to truly understand as much as I can.
