
Review of Chapter 3 of Learning LLM from Scratch

힘센캥거루
October 29, 2025 (edited)

After spilling a bucket of water on my MacBook, I was in shock and wasted about 3-4 days.

In retrospect, since my MacBook was already damaged, I should have treated it as being out for repair and found some other way to keep going.

Anyway, although it's a bit late, I am determined to see it through and leave a review of Chapter 3.

1. Attention Mechanism

The main content of Chapter 3 is the attention mechanism.

Attention, as the name implies, indicates which parts of a sentence to focus on.

For example, let's try to understand the sentence "I ate rice yesterday."

We will naturally focus on words like I, yesterday, rice, and ate.

To understand the word currently being processed, it has to be combined with other words, and the attention mechanism grasps the context through these combinations of words.


2. Self-Attention

Self-attention finds the relationship between tokens within a sentence. 

Suppose tokens are implemented as embedding vectors as follows.

import torch

inputs = torch.tensor(
  [[0.43, 0.15, 0.89], # Your     (x^1)
   [0.55, 0.87, 0.66], # journey  (x^2)
   [0.57, 0.85, 0.64], # starts   (x^3)
   [0.22, 0.58, 0.33], # with     (x^4)
   [0.77, 0.25, 0.10], # one      (x^5)
   [0.05, 0.80, 0.55]] # step     (x^6)
)

For a chosen token, the relationships to all the other tokens are obtained through dot products, and these values are the attention scores.

The dot product may be a new concept, but it is simply the same operation as the inner product of two vectors.


This means that the closer the directions of the two vectors are, the larger the value, and if they are perpendicular, it becomes 0.

This checks how similar two embeddings are.

We look at the similarity between two vectors rather than at a single vector on its own, because the embeddings start out as random, meaningless vectors when they are initialized.

Therefore, measuring a single vector by itself tells us nothing.
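
For example, the attention scores for the second token ("journey") can be obtained by taking the dot product of its embedding with every embedding in the input. A minimal sketch, continuing from the inputs tensor above (the variable names are my own):

# Use the second token ("journey") as the query
query = inputs[1]  # x^2

# One attention score per input token: how similar is x^i to the query?
attn_scores_2 = torch.empty(inputs.shape[0])
for i, x_i in enumerate(inputs):
    attn_scores_2[i] = torch.dot(x_i, query)

print(attn_scores_2)
# tensor([0.9544, 1.4950, 1.4754, 0.8434, 0.7070, 1.0865])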


After normalizing the scores so that the attention weights sum to 1, multiply each input vector by its attention weight and add them all up.

The higher a token's attention weight, the more it influences the result, so you get a vector that points in a direction appropriate for the embedding given as the query.

This is called the context vector.
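
Concretely, a short sketch of those two steps for the same query token (again with my own variable names):

# Normalize the scores so the attention weights sum to 1
attn_weights_2 = torch.softmax(attn_scores_2, dim=0)

# Context vector = attention-weighted sum of all the input vectors
context_vec_2 = torch.zeros(query.shape)
for i, x_i in enumerate(inputs):
    context_vec_2 += attn_weights_2[i] * x_i

print(context_vec_2)
# tensor([0.4419, 0.6515, 0.5683]) -- the second row of the tensor shown below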

If calculated for all embeddings in this way, you will get a tensor with the following shape.

tensor([[0.4421, 0.5931, 0.5790],
        [0.4419, 0.6515, 0.5683],
        [0.4431, 0.6496, 0.5671],
        [0.4304, 0.6298, 0.5510],
        [0.4671, 0.5910, 0.5266],
        [0.4177, 0.6503, 0.5645]])

The tensor itself is 2-dimensional, but each row holds one 3-dimensional context vector.

This allows each embedding's relationships to the others to be expressed as a direction in the embedding space.
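
The whole table can also be produced at once with two matrix multiplications and a softmax; a compact sketch of the same computation:

# Pairwise attention scores between every pair of tokens: shape (6, 6)
attn_scores = inputs @ inputs.T

# Row-wise softmax so each row of weights sums to 1
attn_weights = torch.softmax(attn_scores, dim=-1)

# One context vector per token: shape (6, 3), matching the tensor above
all_context_vecs = attn_weights @ inputs
print(all_context_vecs)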

3. Trainable Self-Attention

In trainable self-attention, each token is projected into query, key, and value vectors of a lower dimension through weight matrices.

The initial values of these weight matrices are all arbitrary.

They are then updated through backpropagation during training.

torch.manual_seed(123)

d_in = inputs.shape[1]  # input embedding size (3)
d_out = 2               # output dimension used in the book's example

# requires_grad=False only keeps this demo's output fixed; real training would use True
W_query = torch.nn.Parameter(torch.rand(d_in, d_out), requires_grad=False)
W_key   = torch.nn.Parameter(torch.rand(d_in, d_out), requires_grad=False)
W_value = torch.nn.Parameter(torch.rand(d_in, d_out), requires_grad=False)

You can see that the initial query, key, and value weights are all random.

Although the initial values are all meaningless, through training, they become meaningful.

Through this process, we can obtain the context vector for a single word.
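
As a rough sketch of that process (following the book's flow, with my own variable names): the query comes from the token we are currently looking at, the keys and values come from all the tokens, and the scores are divided by the square root of the key dimension before the softmax.

x_2 = inputs[1]  # "journey" as the example token

# Project the inputs into query / key / value spaces
query_2 = x_2 @ W_query      # shape (d_out,)
keys    = inputs @ W_key     # shape (6, d_out)
values  = inputs @ W_value   # shape (6, d_out)

# Scaled dot-product attention for this one query
attn_scores_2  = query_2 @ keys.T
attn_weights_2 = torch.softmax(attn_scores_2 / keys.shape[-1] ** 0.5, dim=-1)

# Context vector for "journey": weighted sum of the value vectors
context_vec_2 = attn_weights_2 @ values
print(context_vec_2)  # a d_out-dimensional context vector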

4. Causal Attention

Causal attention uses a masking technique.

By hiding the tokens that come after the current position, it forces the model to predict the next token from only the context it has seen so far.
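
As a rough sketch of the idea (not the book's exact class): the scores for positions after the current token are set to -inf before the softmax, so their weights become exactly 0 and each token can only attend to itself and the tokens before it.

# Reuse the trainable projections from above to get a 6x6 score matrix
queries = inputs @ W_query
keys    = inputs @ W_key
attn_scores = queries @ keys.T

# Upper-triangular mask marks the "future" positions in each row
mask = torch.triu(torch.ones(attn_scores.shape), diagonal=1).bool()

# -inf before softmax -> attention weight 0 for future tokens
masked_scores = attn_scores.masked_fill(mask, float("-inf"))
attn_weights  = torch.softmax(masked_scores / keys.shape[-1] ** 0.5, dim=-1)

print(attn_weights)  # lower-triangular: each token attends only to itself and earlier tokens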

When I first encountered this content, the overfitting problem came to mind, but...

ChatGPT's answer was different.


Preventing overfitting is just a side effect of causal attention.

It helps the model learn the context in the actual language generation direction (left→right).

The context vectors learned this way help the model understand a sentence's meaning and support generating new sentences, reasoning, and inference.

Various methods for this are introduced later.

5. Questions

Various thoughts arose while reading through the book.

  • Why is this more complicated than a one-hot encoder; what's the purpose?

  • Is the reason for causal attention because of overfitting?


I talked with ChatGPT for quite a long time, and the conclusion was as follows.

Knowing the answer
--> The model copies instead of predicting
--> Loss becomes almost 0 (the calculation itself still goes through)
--> Gradient becomes almost 0
--> Hardly any updates
--> No meaningful change in backpropagation
--> Learning essentially does not occur

For instance, imagine feeling around in the dark trying to find an object.

This process builds the ability to understand space and find objects.

If the object's location is already known, there's no need to search, and thus the 'method of finding' is not learned.

In language models, viewing future tokens is similar; knowing the answer causes the model to lose the ability to predict.


So, causal attention is a mechanism that deliberately induces loss.

Causal attention causes loss by masking the future, and based on this loss, the context vector is updated.

During the process of reducing prediction error, the model better understands context, improving abilities in inference, analogy, and evaluation.

6. Afterword

The week's delay due to spilling water on my MacBook made me anxious.

Though I wanted to read and move quickly, I couldn't.

Every sentence I encountered was difficult and new.


Why multiply matrices with reduced dimensions here? Why suddenly apply masking here?

The process of finding answers to these questions was quite long and tedious.

AI provided answers to my questions, but whenever I formed a new interpretation, it kept insisting I was wrong, which made me ponder at length over what the correct meaning was.

I think this process of reflection might help me grow.

I look forward to what Chapter 4 will teach me.

