After spilling a bucket of water on my MacBook, I was in shock and wasted about 3-4 days.
In retrospect, since my MacBook was already damaged, I should have treated it as if it were out for repair and kept working on something else in the meantime.
Anyway, although it's a bit late, I am determined to see it through and leave a review of Chapter 3.
1. Attention Mechanism
The main content of Chapter 3 is the attention mechanism.
Attention, as the name implies, shows which part of the sentence you want to focus on.
For example, let's try to understand the sentence "I ate rice yesterday."
We will naturally focus on words like I, yesterday, rice, and ate.
To understand the word currently being processed, it needs to be combined with the other words, and the attention mechanism captures context through exactly this combination of words.

2. Self-Attention
Self-attention finds the relationship between tokens within a sentence.
Suppose tokens are implemented as embedding vectors as follows.
import torch

inputs = torch.tensor(
    [[0.43, 0.15, 0.89], # Your     (x^1)
     [0.55, 0.87, 0.66], # journey  (x^2)
     [0.57, 0.85, 0.64], # starts   (x^3)
     [0.22, 0.58, 0.33], # with     (x^4)
     [0.77, 0.25, 0.10], # one      (x^5)
     [0.05, 0.80, 0.55]] # step     (x^6)
)

For one chosen token (the query), compute its relationship to every token with the dot product to obtain the attention scores.
The term dot product might be new, but the dot product computed here is just the inner product of two vectors.

This means that the closer the directions of the two vectors are, the larger the value, and if they are perpendicular, it becomes 0.
This checks how similar two embeddings are.
We look at the similarity between two vectors rather than at a single vector in isolation, because the embeddings start out as meaningless, randomly scattered vectors.
Measuring a single vector on its own would therefore tell us nothing.
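As a small sketch of this step (assuming the inputs tensor above, and using the second token as the query), the attention scores come from dot products against every token:

query = inputs[1]                              # the second token, "journey", as the query
attn_scores_2 = torch.empty(inputs.shape[0])
for i, x_i in enumerate(inputs):
    attn_scores_2[i] = torch.dot(x_i, query)   # dot product = similarity to the query
print(attn_scores_2)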

After normalizing the attention scores (for example with softmax) so that the attention weights sum to 1, multiply each input vector by its weight and add them up.
The higher the attention weight, the greater that token's influence on the result, so you get a vector pointing in a direction appropriate to the embedding used as the query.
This is called the context vector.
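Continuing from attn_scores_2 above, here is a sketch of that normalization and weighted sum (softmax is one common way to make the weights sum to 1):

attn_weights_2 = torch.softmax(attn_scores_2, dim=0)    # weights now sum to 1
context_vec_2 = torch.zeros(query.shape)
for i, x_i in enumerate(inputs):
    context_vec_2 += attn_weights_2[i] * x_i             # weighted sum of the input vectors
print(context_vec_2)   # should match the corresponding row of the tensor below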
If you calculate this for all embeddings, you get a tensor like the following.
tensor([[0.4421, 0.5931, 0.5790],
        [0.4419, 0.6515, 0.5683],
        [0.4431, 0.6496, 0.5671],
        [0.4304, 0.6298, 0.5510],
        [0.4671, 0.5910, 0.5266],
        [0.4177, 0.6503, 0.5645]])

The tensor itself is 2-dimensional (6 x 3), but each row it contains is a 3-dimensional context vector.
This expresses each embedding's relationship to the others as a direction in the embedding space.
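For reference, the whole table above can be produced at once in matrix form (a sketch assuming the same inputs tensor):

attn_scores = inputs @ inputs.T                      # all pairwise dot products
attn_weights = torch.softmax(attn_scores, dim=-1)    # each row sums to 1
all_context_vecs = attn_weights @ inputs             # one context vector per token
print(all_context_vecs)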
3. Trainable Self-Attention
In trainable self-attention, each input is projected into query, key, and value vectors, typically in a lower dimension.
The projection weights start out as arbitrary random values and are then updated through backpropagation during training.
torch.manual_seed(123)
d_in, d_out = 3, 2   # assumed here: d_in matches the 3-dimensional embeddings above, d_out reduces the dimension
W_query = torch.nn.Parameter(torch.rand(d_in, d_out), requires_grad=False)
W_key   = torch.nn.Parameter(torch.rand(d_in, d_out), requires_grad=False)
W_value = torch.nn.Parameter(torch.rand(d_in, d_out), requires_grad=False)

You can see that the initial query, key, and value weights are all random.
Although the initial values are all meaningless, through training, they become meaningful.
Through this process, we can obtain the context vector for a single word.
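As a minimal sketch (assuming the inputs, d_in, d_out, and the three weight matrices above), the context vector for the second token can be computed like this; dividing by the square root of the key dimension is the usual scaled dot-product form:

x_2 = inputs[1]                 # "journey" as the query token
query_2 = x_2 @ W_query         # project into query space
keys    = inputs @ W_key        # project every token into key space
values  = inputs @ W_value      # project every token into value space

attn_scores_2 = query_2 @ keys.T                                    # similarity to each key
d_k = keys.shape[-1]
attn_weights_2 = torch.softmax(attn_scores_2 / d_k ** 0.5, dim=-1)  # scaled softmax
context_vec_2 = attn_weights_2 @ values                             # weighted sum of the values
print(context_vec_2)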
4. Causal Attention
Causal attention uses a masking technique: by hiding future tokens, it forces the model to predict the next token from only the context that comes before it.
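One common way to implement that mask (a sketch, reusing the simple attention scores from before just for illustration) is to set the upper-triangular, i.e. future, positions to minus infinity before the softmax:

num_tokens = inputs.shape[0]
mask = torch.triu(torch.ones(num_tokens, num_tokens), diagonal=1).bool()   # True above the diagonal
attn_scores = inputs @ inputs.T
masked_scores = attn_scores.masked_fill(mask, float('-inf'))               # hide future tokens
causal_weights = torch.softmax(masked_scores, dim=-1)                      # future positions get weight 0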
When I first encountered this content, the overfitting problem came to mind, but...
ChatGPT's answer was different.

Preventing overfitting is just a side effect of causal attention.
It helps the model learn the context in the actual language generation direction (left→right).
The context vectors learned this way help the model understand a sentence's meaning and support generating new sentences, reasoning, and inference.
Various methods for this are introduced later.
5. Questions
Various thoughts arose while reading through the book.
Why is this so much more complicated than one-hot encoding, and what is it for?
Is causal attention there to prevent overfitting?

I talked with ChatGPT for quite a long time, and the conclusion was as follows.
Knowing the answer
--> The model copies instead of predicting
--> Loss becomes almost 0 (it can still be computed, but it is trivially small)
--> Gradient becomes almost 0
--> Hardly any updates
--> No meaningful change in backpropagation
--> Learning essentially does not occur

For instance, imagine feeling around in the dark trying to find an object.
This process builds the ability to understand space and find objects.
If the object's location is already known, there's no need to search, and thus the 'method of finding' is not learned.
In language models, viewing future tokens is similar; knowing the answer causes the model to lose the ability to predict.

So causal attention is a mechanism that induces a meaningful loss.
By masking the future it creates prediction error, and based on this loss the context vectors are updated.
During the process of reducing prediction error, the model better understands context, improving abilities in inference, analogy, and evaluation.
6. Afterword
The week's delay due to spilling water on my MacBook made me anxious.
Though I wanted to read and move quickly, I couldn't.
Every sentence I encountered was difficult and new.

Why multiply matrices with reduced dimensions here? Why suddenly apply masking here?
The process of finding answers to these questions was quite long and tedious.
AI provided answers to my questions, but whenever I formed new interpretations, it always insisted I was wrong, which made me ponder at length what the correct meaning really was.
I think this process of reflection might help me grow.
I look forward to what Chapter 4 will teach me.