This post is a summary of the paper Attention Is All You Need.
Attention Is All You Need
Overview
We propose a new simple network architecture, the Transformer, based solely on attention mechanisms, dispensing with recurrence and convolutions entirely.
- Dispenses with recurrence and convolutions entirely, relying on attention alone to reach strong performance with significantly less training time.
Related work (Basic concepts)
- Layer normalization
- RNN
- Attention
Methods
Architecture

the encoder maps an input sequence of symbol representations (x1,…,xn) to a sequence of continuous representations z = (z1,…,zn)
- Encoder
  - A stack of 6 identical layers.
  - Each layer has 2 sub-layers: the first is a multi-head self-attention mechanism, and the second is a simple, position-wise fully connected feed-forward network.
  - Add: a residual connection around each of the two sub-layers.
  - Layer Norm: the output of each sub-layer is LayerNorm(x + Sublayer(x)).
  - All sub-layers produce outputs of dimension d_model = 512.
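The Add & Norm step (a residual connection followed by layer normalization) can be sketched in NumPy as follows; this is a minimal illustration, not the paper's implementation, and the toy sub-layer is a placeholder:

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    # Normalize each position over the feature dimension (d_model).
    mean = x.mean(axis=-1, keepdims=True)
    std = x.std(axis=-1, keepdims=True)
    return (x - mean) / (std + eps)

def add_and_norm(x, sublayer):
    # Residual connection followed by layer normalization:
    # LayerNorm(x + Sublayer(x))
    return layer_norm(x + sublayer(x))

x = np.random.randn(10, 512)               # (seq_len, d_model)
out = add_and_norm(x, lambda t: 0.1 * t)   # toy sub-layer, for illustration only
```

Note that the learnable gain and bias of layer normalization are omitted here for brevity.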
- Decoder
  - Like the encoder, a stack of 6 identical layers.
  - Masked Multi-Head Attention
    "We also modify the self-attention sub-layer in the decoder stack to prevent positions from attending to subsequent positions. This masking, combined with the fact that the output embeddings are offset by one position, ensures that the predictions for position i can depend only on the known outputs at positions less than i."
Attention
- Scaled Dot-Product Attention
  - Input: Q (queries), K (keys), V (values)
  - Compute the dot products of the query with all keys, divide each by sqrt(d_k), and apply a softmax function to obtain the weights on the values: Attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) V.
  - "We implement this inside of scaled dot-product attention by masking out (setting to −∞) all values in the input of the softmax which correspond to illegal connections."
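Scaled dot-product attention, including the decoder-style causal mask described above, can be sketched in NumPy (a minimal illustration with random inputs, not the paper's implementation):

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V, mask=None):
    # Attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) V
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)
    if mask is not None:
        # Illegal connections are set to -inf before the softmax,
        # so they receive zero weight.
        scores = np.where(mask, scores, -np.inf)
    weights = softmax(scores)
    return weights @ V, weights

n, d_k, d_v = 4, 64, 64
Q = np.random.randn(n, d_k)
K = np.random.randn(n, d_k)
V = np.random.randn(n, d_v)
causal = np.tril(np.ones((n, n), dtype=bool))  # position i attends to j <= i
out, w = scaled_dot_product_attention(Q, K, V, mask=causal)
```

With the causal mask, each row of the attention weights sums to 1 over the allowed positions only.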
- Multi-Head Attention
  - Linearly project the queries, keys and values h times with different, learned linear projections to d_k, d_k and d_v dimensions, respectively.
  - Perform the attention function in parallel, yielding d_v-dimensional output values.
  - The h outputs are then concatenated and projected once more with a final linear layer.
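The project–attend–concatenate sequence above can be sketched with NumPy; this is a minimal illustration under the paper's shapes (h = 8, d_model = 512, d_k = d_v = d_model / h), with random weight matrices standing in for learned projections:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(x, W_q, W_k, W_v, W_o, h=8):
    # Project once, split the model dimension into h heads,
    # attend in each head in parallel, then concatenate and
    # apply the output projection W_o.
    n, d_model = x.shape
    d_k = d_model // h
    Q, K, V = x @ W_q, x @ W_k, x @ W_v
    heads = []
    for i in range(h):
        q = Q[:, i*d_k:(i+1)*d_k]
        k = K[:, i*d_k:(i+1)*d_k]
        v = V[:, i*d_k:(i+1)*d_k]
        scores = q @ k.T / np.sqrt(d_k)
        heads.append(softmax(scores) @ v)
    return np.concatenate(heads, axis=-1) @ W_o

n, d_model = 10, 512
rng = np.random.default_rng(0)
x = rng.normal(size=(n, d_model))
W_q, W_k, W_v, W_o = (rng.normal(size=(d_model, d_model)) for _ in range(4))
out = multi_head_attention(x, W_q, W_k, W_v, W_o)
```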
Position-wise Feed-Forward Networks
- Applied to each position separately and identically: FFN(x) = max(0, xW1 + b1)W2 + b2, with inner dimension d_ff = 2048.
Positional Encoding
- The Transformer core contains no recurrence, so the model itself has no notion of sequence order. Since word order matters in real language, positional encodings are added to the input embeddings to inject information about each token's position:
  PE(pos, 2i) = sin(pos / 10000^(2i/d_model)), PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model))
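The sinusoidal positional encoding can be computed directly; a minimal NumPy sketch:

```python
import numpy as np

def positional_encoding(max_len, d_model):
    # PE(pos, 2i)   = sin(pos / 10000^(2i / d_model))
    # PE(pos, 2i+1) = cos(pos / 10000^(2i / d_model))
    pos = np.arange(max_len)[:, None]          # (max_len, 1)
    i = np.arange(0, d_model, 2)[None, :]      # even feature indices
    angles = pos / np.power(10000.0, i / d_model)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

pe = positional_encoding(50, 512)
```

This table is added element-wise to the input embeddings before the first encoder or decoder layer.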
Why Self-Attention
The paper compares self-attention to recurrent and convolutional layers along three criteria:
- Total computational complexity per layer
- The amount of computation that can be parallelized, as measured by the minimum number of sequential operations required
- The path length between long-range dependencies in the network
A selfattention layer connects all positions with a constant number of sequentially executed operations, whereas a recurrent layer requires O(n) sequential operations.
Training
- Refer to the paper for details (data, hardware, optimizer, and regularization).
Additional studies
[1] Jimmy Lei Ba, Jamie Ryan Kiros, and Geoffrey E Hinton. Layer normalization. arXiv preprint arXiv:1607.06450, 2016.
[2] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, 2018.
References