Attention Is All You Need
This post is a summary of my reading of Attention Is All You Need.
Overview
We propose a new simple network architecture, the Transformer, based solely on attention mechanisms, dispensing with recurrence and convolutions entirely.
- Dispenses with recurrence and convolutions entirely and relies on attention alone, reaching better performance with significantly less training time.
Related work (Basic concepts)
- layer normalization
- RNN
- Attention
Methods
Architecture
the encoder maps an input sequence of symbol representations (x1,…,xn) to a sequence of continuous representations z = (z1,…,zn)
- Encoder
- Composed of a stack of 6 identical layers.
- each layer has 2 sub-layers.
- The first is a multi-head self-attention mechanism, and the second is a simple, position-wise fully connected feed-forward network.
- Add: denotes the residual connection around each sub-layer.
- Layer Norm: the output of each sub-layer is LayerNorm(x + Sublayer(x)); a code sketch of this pattern follows at the end of this Architecture section.
- All sub-layers produce outputs of dimension d_model = 512.
- Decoder
- Like the encoder, composed of a stack of 6 identical layers; each decoder layer also inserts a third sub-layer that performs multi-head attention over the output of the encoder stack.
- Masked Multi-Head Attention
We also modify the self-attention sub-layer in the decoder stack to prevent positions from attending to subsequent positions. This masking, combined with the fact that the output embeddings are offset by one position, ensures that the predictions for position i can depend only on the known outputs at positions less than i.
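The Add & Norm wrapper around each sub-layer can be written compactly. Below is a minimal NumPy sketch of LayerNorm(x + Sublayer(x)), assuming a plain layer norm without the learned gain and bias; it is an illustration, not the paper's implementation.

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    # Normalize each position over the feature dimension (d_model),
    # as in Ba et al. [1]; learned gain/bias are omitted for brevity.
    mean = x.mean(axis=-1, keepdims=True)
    std = x.std(axis=-1, keepdims=True)
    return (x - mean) / (std + eps)

def add_and_norm(x, sublayer):
    # Residual connection followed by layer normalization:
    # LayerNorm(x + Sublayer(x))
    return layer_norm(x + sublayer(x))

# Example with a placeholder sub-layer on a (seq_len, d_model) input.
x = np.random.randn(10, 512)
out = add_and_norm(x, lambda h: h)  # stands in for attention or the FFN
```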
Attention
- Scaled Dot-Product Attention
- Input : Q(Queries), K(Keys), V(Values)
- Compute the dot products of the query with all keys, divide each by sqrt(d_k), and apply a softmax function to obtain the weights on the values: Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V (sketched in code below).
We implement this inside of scaled dot-product attention by masking out (setting to −∞) all values in the input of the softmax which correspond to illegal connections.
- Multi-Head Attention
The queries, keys and values are linearly projected h times with different, learned linear projections to d_k, d_k and d_v dimensions, respectively; attention is applied in parallel to each projection, and the outputs are concatenated and projected once more (see the sketch below).
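A minimal NumPy sketch of scaled dot-product attention, Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V, including the masking of illegal (subsequent-position) connections described above. The function and variable names are mine, and a large negative constant stands in for -inf.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)   # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V, mask=None):
    """Q: (n_q, d_k), K: (n_k, d_k), V: (n_k, d_v); mask is True where attending is allowed."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)             # dot products scaled by sqrt(d_k)
    if mask is not None:
        scores = np.where(mask, scores, -1e9)   # "-inf" on illegal connections
    weights = softmax(scores, axis=-1)          # attention weights over the values
    return weights @ V

# Causal mask used in decoder self-attention: position i may only see positions <= i.
n = 5
causal_mask = np.tril(np.ones((n, n), dtype=bool))
Q = K = V = np.random.randn(n, 64)
out = scaled_dot_product_attention(Q, K, V, mask=causal_mask)
```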
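Continuing the sketch above, multi-head attention projects Q, K and V h times, runs scaled dot-product attention on each projection, then concatenates and projects the results. With d_model = 512 and h = 8, d_k = d_v = 64 as in the paper. The random matrices below are placeholders for the learned projections W_i^Q, W_i^K, W_i^V and W^O.

```python
def multi_head_attention(Q, K, V, h=8):
    # Reuses scaled_dot_product_attention from the sketch above.
    d_model = Q.shape[-1]
    d_k = d_v = d_model // h
    rng = np.random.default_rng(0)
    heads = []
    for _ in range(h):
        # Placeholders for the learned per-head projections.
        W_q = rng.standard_normal((d_model, d_k)) / np.sqrt(d_model)
        W_k = rng.standard_normal((d_model, d_k)) / np.sqrt(d_model)
        W_v = rng.standard_normal((d_model, d_v)) / np.sqrt(d_model)
        heads.append(scaled_dot_product_attention(Q @ W_q, K @ W_k, V @ W_v))
    W_o = rng.standard_normal((h * d_v, d_model)) / np.sqrt(h * d_v)  # placeholder for W^O
    return np.concatenate(heads, axis=-1) @ W_o

x = np.random.randn(10, 512)          # (seq_len, d_model)
out = multi_head_attention(x, x, x)   # self-attention: Q = K = V
```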
Position-wise Feed-Forward Networks
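The paper defines this sub-layer as FFN(x) = max(0, xW1 + b1)W2 + b2, applied to each position separately and identically, with d_model = 512 and inner dimension d_ff = 2048. A minimal NumPy sketch with placeholder weights:

```python
import numpy as np

def position_wise_ffn(x, W1, b1, W2, b2):
    # Two linear transformations with a ReLU in between, applied
    # independently at every position: max(0, xW1 + b1)W2 + b2
    return np.maximum(0, x @ W1 + b1) @ W2 + b2

d_model, d_ff = 512, 2048
rng = np.random.default_rng(0)
W1, b1 = rng.standard_normal((d_model, d_ff)) * 0.02, np.zeros(d_ff)
W2, b2 = rng.standard_normal((d_ff, d_model)) * 0.02, np.zeros(d_model)

x = rng.standard_normal((10, d_model))      # (seq_len, d_model)
out = position_wise_ffn(x, W1, b1, W2, b2)  # same shape as x
```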
Positional Encoding
Since the Transformer contains no recurrence, it does not model the order of the sequence in its core architecture. Word order matters in real language, however, so positional encodings are added to the input embeddings to inject information about each token's position.
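The paper uses fixed sinusoidal encodings, PE(pos, 2i) = sin(pos / 10000^(2i/d_model)) and PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model)), added to the input embeddings. A small NumPy sketch (assumes d_model is even):

```python
import numpy as np

def positional_encoding(max_len, d_model):
    """Sinusoidal positional encodings of shape (max_len, d_model)."""
    pos = np.arange(max_len)[:, None]                  # (max_len, 1)
    i = np.arange(d_model // 2)[None, :]               # (1, d_model/2)
    angles = pos / np.power(10000, 2 * i / d_model)    # pos / 10000^(2i/d_model)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)   # even dimensions use sine
    pe[:, 1::2] = np.cos(angles)   # odd dimensions use cosine
    return pe

pe = positional_encoding(max_len=50, d_model=512)
# In the model, pe[:seq_len] is added to the (scaled) token embeddings.
```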
Why Self-Attention
The paper motivates self-attention over recurrent and convolutional layers using three criteria:
- Total computational complexity per layer.
- The amount of computation that can be parallelized, as measured by the minimum number of sequential operations required.
- The path length between long-range dependencies in the network.
A self-attention layer connects all positions with a constant number of sequentially executed operations, whereas a recurrent layer requires O(n) sequential operations.
Training
See the paper for the training setup (data and batching, hardware and schedule, optimizer, and regularization).
Additional studies
[1] Jimmy Lei Ba, Jamie Ryan Kiros, and Geoffrey E Hinton. Layer normalization. arXiv preprint arXiv:1607.06450, 2016.
[2] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, 2018.