Attention Is All You Need
This post is a summary of my reading of Attention Is All You Need.
Overview
We propose a new simple network architecture, the Transformer, based solely on attention mechanisms, dispensing with recurrence and convolutions entirely.
- Dispenses with recurrence and convolutions entirely and relies on attention alone, reaching better performance with significantly less training time.
Related work (Basic concepts)
- layer normalization
- RNN
- Attention
Methods
Architecture
the encoder maps an input sequence of symbol representations (x1,…,xn) to a sequence of continuous representations z = (z1,…,zn)
- Encoder
- Composed of a stack of 6 identical layers.
- each layer has 2 sub-layers.
- The first is a multi-head self-attention mechanism, and the second is a simple, position-wise fully connected feed-forward network.
- Add: denotes the residual connection around each sub-layer.
- Layer Norm: the output of each sub-layer is LayerNorm(x + Sublayer(x)); a code sketch of this pattern follows at the end of this Architecture section.
- All sub-layers produce outputs of dimension d_model = 512.
- Decoder
- Like the encoder, composed of a stack of 6 identical layers; each decoder layer also inserts a third sub-layer that performs multi-head attention over the output of the encoder stack.
- Masked Multi-Head Attention
We also modify the self-attention sub-layer in the decoder stack to prevent positions from attending to subsequent positions. This masking, combined with the fact that the output embeddings are offset by one position, ensures that the predictions for position i can depend only on the known outputs at positions less than i.
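The Add & Norm wrapper around each sub-layer can be written compactly. Below is a minimal NumPy sketch of LayerNorm(x + Sublayer(x)), assuming a plain layer norm without the learned gain and bias; it is an illustration, not the paper's implementation.

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    # Normalize each position over the feature dimension (d_model),
    # as in Ba et al. [1]; learned gain/bias are omitted for brevity.
    mean = x.mean(axis=-1, keepdims=True)
    std = x.std(axis=-1, keepdims=True)
    return (x - mean) / (std + eps)

def add_and_norm(x, sublayer):
    # Residual connection followed by layer normalization:
    # LayerNorm(x + Sublayer(x))
    return layer_norm(x + sublayer(x))

# Example with a placeholder sub-layer on a (seq_len, d_model) input.
x = np.random.randn(10, 512)
out = add_and_norm(x, lambda h: h)  # stands in for attention or the FFN
```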
Attention
- Scaled Dot-Product Attention
- Input : Q(Queries), K(Keys), V(Values)
- Compute the dot products of the query with all keys, divide each by sqrt(d_k), and apply a softmax function to obtain the weights on the values: Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V (sketched in code below).
We implement this inside of scaled dot-product attention by masking out (setting to −∞) all values in the input of the softmax which correspond to illegal connections.
- Multi-Head Attention
The queries, keys and values are linearly projected h times with different, learned linear projections to d_k, d_k and d_v dimensions, respectively; attention is applied in parallel to each projection, and the outputs are concatenated and projected once more (see the sketch below).
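A minimal NumPy sketch of scaled dot-product attention, Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V, including the masking of illegal (subsequent-position) connections described above. The function and variable names are mine, and a large negative constant stands in for -inf.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)   # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V, mask=None):
    """Q: (n_q, d_k), K: (n_k, d_k), V: (n_k, d_v); mask is True where attending is allowed."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)             # dot products scaled by sqrt(d_k)
    if mask is not None:
        scores = np.where(mask, scores, -1e9)   # "-inf" on illegal connections
    weights = softmax(scores, axis=-1)          # attention weights over the values
    return weights @ V

# Causal mask used in decoder self-attention: position i may only see positions <= i.
n = 5
causal_mask = np.tril(np.ones((n, n), dtype=bool))
Q = K = V = np.random.randn(n, 64)
out = scaled_dot_product_attention(Q, K, V, mask=causal_mask)
```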
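Continuing the sketch above, multi-head attention projects Q, K and V h times, runs scaled dot-product attention on each projection, then concatenates and projects the results. With d_model = 512 and h = 8, d_k = d_v = 64 as in the paper. The random matrices below are placeholders for the learned projections W_i^Q, W_i^K, W_i^V and W^O.

```python
def multi_head_attention(Q, K, V, h=8):
    # Reuses scaled_dot_product_attention from the sketch above.
    d_model = Q.shape[-1]
    d_k = d_v = d_model // h
    rng = np.random.default_rng(0)
    heads = []
    for _ in range(h):
        # Placeholders for the learned per-head projections.
        W_q = rng.standard_normal((d_model, d_k)) / np.sqrt(d_model)
        W_k = rng.standard_normal((d_model, d_k)) / np.sqrt(d_model)
        W_v = rng.standard_normal((d_model, d_v)) / np.sqrt(d_model)
        heads.append(scaled_dot_product_attention(Q @ W_q, K @ W_k, V @ W_v))
    W_o = rng.standard_normal((h * d_v, d_model)) / np.sqrt(h * d_v)  # placeholder for W^O
    return np.concatenate(heads, axis=-1) @ W_o

x = np.random.randn(10, 512)          # (seq_len, d_model)
out = multi_head_attention(x, x, x)   # self-attention: Q = K = V
```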
Position-wise Feed-Forward Networks
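The paper defines this sub-layer as FFN(x) = max(0, xW1 + b1)W2 + b2, applied to each position separately and identically, with d_model = 512 and inner dimension d_ff = 2048. A minimal NumPy sketch with placeholder weights:

```python
import numpy as np

def position_wise_ffn(x, W1, b1, W2, b2):
    # Two linear transformations with a ReLU in between, applied
    # independently at every position: max(0, xW1 + b1)W2 + b2
    return np.maximum(0, x @ W1 + b1) @ W2 + b2

d_model, d_ff = 512, 2048
rng = np.random.default_rng(0)
W1, b1 = rng.standard_normal((d_model, d_ff)) * 0.02, np.zeros(d_ff)
W2, b2 = rng.standard_normal((d_ff, d_model)) * 0.02, np.zeros(d_model)

x = rng.standard_normal((10, d_model))      # (seq_len, d_model)
out = position_wise_ffn(x, W1, b1, W2, b2)  # same shape as x
```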
Positional Encoding
Since the Transformer contains no recurrence, it does not model the order of the sequence in its core architecture. Word order matters in real language, however, so positional encodings are added to the input embeddings to inject information about each token's position.
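The paper uses fixed sinusoidal encodings, PE(pos, 2i) = sin(pos / 10000^(2i/d_model)) and PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model)), added to the input embeddings. A small NumPy sketch (assumes d_model is even):

```python
import numpy as np

def positional_encoding(max_len, d_model):
    """Sinusoidal positional encodings of shape (max_len, d_model)."""
    pos = np.arange(max_len)[:, None]                  # (max_len, 1)
    i = np.arange(d_model // 2)[None, :]               # (1, d_model/2)
    angles = pos / np.power(10000, 2 * i / d_model)    # pos / 10000^(2i/d_model)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)   # even dimensions use sine
    pe[:, 1::2] = np.cos(angles)   # odd dimensions use cosine
    return pe

pe = positional_encoding(max_len=50, d_model=512)
# In the model, pe[:seq_len] is added to the (scaled) token embeddings.
```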
Why Self-Attention
The paper motivates self-attention over recurrent and convolutional layers using three criteria:
- Total computational complexity per layer.
- The amount of computation that can be parallelized, as measured by the minimum number of sequential operations required.
- The path length between long-range dependencies in the network.
A self-attention layer connects all positions with a constant number of sequentially executed operations, whereas a recurrent layer requires O(n) sequential operations.
Training
See the paper for the training setup (data and batching, hardware and schedule, optimizer, and regularization).
Additional studies
[1] Jimmy Lei Ba, Jamie Ryan Kiros, and Geoffrey E Hinton. Layer normalization. arXiv preprint arXiv:1607.06450, 2016.
[2] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, 2018.