Attention Is All You Need

These are my notes from reading Attention Is All You Need.



Overview

(You should include contents of summary and introduction.)

We propose a new simple network architecture, the Transformer, based solely on attention mechanisms, dispensing with recurrence and convolutions entirely.

  • Uses attention instead of recurrence and convolution, which makes the model more parallelizable and faster to train
  • layer normalization
  • RNN
  • Attention

Methods

(Explain one of the methods that the thesis used.)

Architecture

[Figure: the Transformer model architecture]

  • the encoder maps an input sequence of symbol representations (x1,…,xn) to a sequence of continuous representations z = (z1,…,zn)

  • Encoder
    • Composed of a stack of 6 identical layers.
      • Each layer has 2 sub-layers.
        • The first is a multi-head self-attention mechanism, and the second is a simple, position-wise fully connected feed-forward network.
    • Add: the residual connection around each sub-layer.
    • Layer Norm: the output of each sub-layer is LayerNorm(x + Sublayer(x)).
    • All sub-layers and the embedding layers produce outputs of dimension d_model = 512.
  • Decoder
    • Like the encoder, composed of a stack of 6 identical layers.
    • Masked Multi-Head Attention

      We also modify the self-attention sub-layer in the decoder stack to prevent positions from attending to subsequent positions. This masking, combined with fact that the output embeddings are offset by one position, ensures that the predictions for position i can depend only on the known outputs at positions less than i.
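The "Add" and "Layer Norm" steps around each sub-layer can be sketched as follows. This is a minimal illustration for a single position vector, not the paper's implementation: the learned gain and bias of layer normalization are left out, `eps` is an assumed stabilizing constant, and `toy_sublayer` is a hypothetical stand-in for a real attention or feed-forward sub-layer.

```python
import math

def layer_norm(x, eps=1e-6):
    """Normalize one vector to zero mean and (almost) unit variance.

    Assumption: the learned gain and bias parameters are omitted.
    """
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def add_and_norm(x, sublayer):
    """Compute LayerNorm(x + Sublayer(x)) for a single position vector x."""
    y = sublayer(x)
    # "Add": the residual connection sums the sub-layer input and output.
    residual = [a + b for a, b in zip(x, y)]
    return layer_norm(residual)

def toy_sublayer(x):
    """Hypothetical sub-layer used only for illustration."""
    return [0.5 * v for v in x]

out = add_and_norm([1.0, 2.0, 3.0, 4.0], toy_sublayer)
```

Because of the residual sum, the sub-layer must return a vector of the same size as its input, which is why every sub-layer in the model produces outputs of dimension d_model.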

Attention

[Figure: Scaled Dot-Product Attention and Multi-Head Attention]

  • Scaled Dot-Product Attention
    • Input : Q(Queries), K(Keys), V(Values)
    • Compute the dot products of the query with all keys, divide each by sqrt(d_k), and apply a softmax function to obtain the weights on the values:

      Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V

    In the decoder, the masking is implemented inside scaled dot-product attention by masking out (setting to −∞) all values in the input of the softmax which correspond to illegal connections.

  • Multi-Head Attention

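    Linearly project the queries, keys and values h times with different, learned linear projections to d_k, d_k and d_v dimensions, respectively.

    • Perform the attention function in parallel on each projected version, yielding d_v-dimensional output values.
    • After that, the outputs of the heads are concatenated and projected once more:

      MultiHead(Q, K, V) = Concat(head_1, …, head_h) W^O
      where head_i = Attention(Q W_i^Q, K W_i^K, V W_i^V)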
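The attention steps listed above can be sketched in plain Python. This is an illustrative toy under stated assumptions, not the paper's code: matrices are lists of rows, the projection matrices are hand-picked rather than learned, and the `causal` flag and all function names are hypothetical. With `causal=True`, scores for later positions are set to −∞ before the softmax, as in the decoder.

```python
import math

def softmax(scores):
    """Numerically stable softmax; entries equal to -inf get weight 0."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def matmul(A, B):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def attention(Q, K, V, causal=False):
    """softmax(Q K^T / sqrt(d_k)) V, computed one query row at a time."""
    d_k = len(K[0])
    out = []
    for i, q in enumerate(Q):
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in K]
        if causal:
            # Illegal connections (key position j > query position i) become -inf.
            scores = [s if j <= i else -math.inf for j, s in enumerate(scores)]
        weights = softmax(scores)
        out.append([sum(w * v[c] for w, v in zip(weights, V))
                    for c in range(len(V[0]))])
    return out

def multi_head_self_attention(X, heads, W_o, causal=False):
    """heads is a list of (W_q, W_k, W_v) projection triples, one per head."""
    per_head = [attention(matmul(X, W_q), matmul(X, W_k), matmul(X, W_v), causal)
                for W_q, W_k, W_v in heads]
    # Concatenate the heads' outputs position by position, then project with W_o.
    concat = [sum((head[i] for head in per_head), []) for i in range(len(X))]
    return matmul(concat, W_o)

# Toy setup: 3 positions, d_model = 2, h = 2 heads with d_k = d_v = 1.
X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
heads = [
    ([[1.0], [0.0]], [[1.0], [0.0]], [[1.0], [0.0]]),  # head 1 reads feature 0
    ([[0.0], [1.0]], [[0.0], [1.0]], [[0.0], [1.0]]),  # head 2 reads feature 1
]
W_o = [[1.0, 0.0], [0.0, 1.0]]
out = multi_head_self_attention(X, heads, W_o, causal=True)
```

With the causal mask, the first position can attend only to itself, so `out[0]` reproduces the first input row regardless of what follows it.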

Position-wise Feed-Forward Networks

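Each encoder and decoder layer contains a fully connected feed-forward network that is applied to each position separately and identically. It consists of two linear transformations with a ReLU activation in between:

FFN(x) = max(0, x W1 + b1) W2 + b2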
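A small sketch of a position-wise feed-forward network, i.e. two linear maps with a ReLU in between, applied with the same weights at every position. The weights below are hand-picked toy values (an assumption for illustration); the paper uses d_model = 512 with an inner dimension of 2048, and the function names here are hypothetical.

```python
def feed_forward(x, W1, b1, W2, b2):
    """Compute max(0, x W1 + b1) W2 + b2 for a single position vector x.

    W1 has shape (d_model, d_ff) and W2 has shape (d_ff, d_model),
    both given as lists of rows.
    """
    hidden = [max(0.0, sum(xi * w for xi, w in zip(x, col)) + b)
              for col, b in zip(zip(*W1), b1)]
    return [sum(h * w for h, w in zip(hidden, col)) + b
            for col, b in zip(zip(*W2), b2)]

def position_wise_ffn(X, W1, b1, W2, b2):
    """Apply the same feed-forward network to every position independently."""
    return [feed_forward(x, W1, b1, W2, b2) for x in X]

# Toy sizes: d_model = 2, d_ff = 3.
W1 = [[1.0, -1.0, 0.0], [0.0, 1.0, 1.0]]
b1 = [0.0, 0.0, 0.0]
W2 = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
b2 = [0.0, 0.0]
out = position_wise_ffn([[1.0, 2.0], [1.0, -2.0]], W1, b1, W2, b2)
```

Since no information is exchanged between positions inside this sub-layer, reordering the input rows simply reorders the output rows.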

Positional Encoding

The core of the Transformer does not model sequence order by itself. Word order still matters in real language, so Positional Encoding is used to inject information about the position of each word.

PE(pos, 2i) = sin(pos / 10000^(2i / d_model))

PE(pos, 2i+1) = cos(pos / 10000^(2i / d_model))
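The sinusoidal positional encoding from the paper, PE(pos, 2i) = sin(pos / 10000^(2i / d_model)) and PE(pos, 2i+1) = cos(pos / 10000^(2i / d_model)), can be sketched as below. The function name and the `max_len` parameter are assumptions made for this example.

```python
import math

def positional_encoding(max_len, d_model):
    """Return a max_len x d_model table of sinusoidal position encodings."""
    pe = [[0.0] * d_model for _ in range(max_len)]
    for pos in range(max_len):
        # even_dim plays the role of 2i in the formula.
        for even_dim in range(0, d_model, 2):
            angle = pos / (10000 ** (even_dim / d_model))
            pe[pos][even_dim] = math.sin(angle)
            if even_dim + 1 < d_model:
                pe[pos][even_dim + 1] = math.cos(angle)
    return pe

pe = positional_encoding(max_len=50, d_model=8)
```

Each row of the table is added to the embedding of the token at that position, so the encoding has the same dimension d_model as the embeddings.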

Why Self-Attention

  1. Total computational complexity per layer

  2. The amount of computation that can be parallelized, as measured by the minimum number of sequential operations required.

  3. The path length between long-range dependencies in the network.

A self-attention layer connects all positions with a constant number of sequentially executed operations, whereas a recurrent layer requires O(n) sequential operations.
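To make the comparison concrete, the sketch below plugs numbers into the asymptotic per-layer costs reported in the paper's Table 1: n² · d operations for self-attention and n · d² for a recurrent layer, where n is the sequence length and d the representation dimension. Constant factors are dropped, so these are rough orders of magnitude rather than measured costs; the function name and the example values of n and d are assumptions.

```python
def per_layer_cost(n, d):
    """Rough per-layer costs for sequence length n and dimension d."""
    return {
        # Every position attends to every other position in one step.
        "self_attention": {"ops": n * n * d, "sequential_ops": 1, "max_path_length": 1},
        # Hidden states are computed one position after another.
        "recurrent": {"ops": n * d * d, "sequential_ops": n, "max_path_length": n},
    }

cost = per_layer_cost(n=50, d=512)
for name, c in cost.items():
    print(name, c)
```

When n is smaller than d, as in this example, the self-attention layer needs fewer operations per layer than the recurrent layer in addition to having no sequential dependency between positions.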

Training

See the paper.

Additional studies

(If you have some parts that cannot understand, you have to do additional studies for them. It’s optional.)

[1] Jimmy Lei Ba, Jamie Ryan Kiros, and Geoffrey E Hinton. Layer normalization. arXiv preprint arXiv:1607.06450, 2016.

[2] BERT

References

(References for your additional studies)

https://www.youtube.com/watch?v=EyXehqvkfF0

https://www.youtube.com/watch?v=mxGCEWOxfe8