The Star Also Rises: Transformer (3): Illustrated

Transformer (3): Illustrated

2021/09/01

-----

https://pixabay.com/zh/photos/flash-tesla-coil-experiment-113310/

-----

Outline

-----

1.1 Transformer

https://zhuanlan.zhihu.com/p/338817680

-----

1.2 Input Embedding (Word2vec)

https://zhuanlan.zhihu.com/p/27234078

-----

1.3 Positional Encoding

https://jalammar.github.io/illustrated-transformer/

The vertical axis is pos and the horizontal axis is i. Even dimensions (2i) use sin and odd dimensions (2i+1) use cos.

1. pos: position. The position of a word within the sentence.

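2. i: dimension. The index into the positional encoding vector. The dimension indices 2i and 2i+1 run from 0 to d_model - 1, so i itself runs from 0 to d_model/2 - 1.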

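3. d_model: the dimension of the word embedding, for example 512 or 768. The positional encoding has the same dimension as the word embedding, so the two can be added together.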

4. sin(a+b) = sin(a)cos(b) + cos(a)sin(b), and cos(a+b) = cos(a)cos(b) - sin(a)sin(b).

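5. Replace a+b with p+k, where k is the offset to the new position. The encoding at position p+k can then be written as a linear combination of the sin and cos components at position p, with coefficients sin(k) and cos(k) (each scaled by that dimension's frequency), which depend only on k.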
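
The five points above can be checked numerically. The following is a minimal sketch, assuming the sinusoidal form from the original paper, PE(pos, 2i) = sin(pos / 10000^(2i/d_model)) and PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model)). The function names `positional_encoding` and `shift`, and the small sizes used, are illustrative assumptions rather than anything taken from the linked articles.

```python
import math

def positional_encoding(max_len, d_model, base=10000.0):
    """Return a max_len x d_model table of sinusoidal positional encodings."""
    if d_model % 2 != 0:
        raise ValueError("this sketch assumes an even d_model")
    table = []
    for pos in range(max_len):
        row = [0.0] * d_model
        for i in range(d_model // 2):
            angle = pos / base ** (2 * i / d_model)
            row[2 * i] = math.sin(angle)      # even dimension 2i uses sin
            row[2 * i + 1] = math.cos(angle)  # odd dimension 2i+1 uses cos
        table.append(row)
    return table

def shift(row, k, d_model, base=10000.0):
    """Build PE(pos + k) from PE(pos) with the angle-addition identities."""
    out = [0.0] * d_model
    for i in range(d_model // 2):
        w = 1.0 / base ** (2 * i / d_model)   # frequency of this sin/cos pair
        s, c = row[2 * i], row[2 * i + 1]
        out[2 * i] = s * math.cos(w * k) + c * math.sin(w * k)      # sin(a+b)
        out[2 * i + 1] = c * math.cos(w * k) - s * math.sin(w * k)  # cos(a+b)
    return out

pe = positional_encoding(max_len=50, d_model=8)
shifted = shift(pe[3], k=4, d_model=8)
# Largest difference between the shifted row and the directly computed row 7.
max_err = max(abs(a - b) for a, b in zip(shifted, pe[7]))
print(max_err < 1e-9)
```

Note that the coefficients inside `shift` depend only on the offset `k` and on the per-dimension frequency `w`, never on the starting position, which is the property described in point 5.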

-----

2.1

https://jalammar.github.io/illustrated-transformer/

-----

2.2

https://medium.com/@edloginova/attention-in-nlp-734c6fa9d983

-----

2.3

-----

2.4

-----

2.5

http://speech.ee.ntu.edu.tw/~tlkagk/courses/ML_2019/Lecture/Transformer%20(v5).pdf

-----

2.6

-----

2.7

https://jalammar.github.io/illustrated-transformer/

-----

2.8

https://zhuanlan.zhihu.com/p/75787683

-----

2.9

https://colah.github.io/posts/2015-08-Understanding-LSTMs/

-----

http://proceedings.mlr.press/v37/ioffe15.pdf

-----

2.10

https://mlexplained.com/2018/01/13/weight-normalization-and-layer-normalization-explained-normalization-in-deep-learning-part-2/

-----

2.11

https://zhuanlan.zhihu.com/p/338817680

-----

3.1

https://zhuanlan.zhihu.com/p/338817680

-----

3.2

-----

4.1

https://huggingface.co/transformers/perplexity.html

https://www.zhihu.com/question/50828855

-----

4.2

https://www.aclweb.org/anthology/P02-1040.pdf

-----

4.3

https://www.cnblogs.com/by-dream/p/7679284.html

-----

4.4

# Transformer

Notes:

"We employ three types of regularization during training:"

Dropout

"Residual Dropout: We apply dropout [27] to the output of each sub-layer, before it is added to the sub-layer input and normalized."

Dropout

"In addition, we apply dropout to the sums of the embeddings and the positional encodings in both the encoder and decoder stacks. For the base model, we use a rate of P_drop = 0.1."
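
The two dropout placements quoted above can be sketched in plain Python. This is only an illustration under stated assumptions: the helper names (`dropout`, `layer_norm`, `residual_block`, `embed_with_dropout`) are hypothetical, the dropout uses the common "inverted" rescaling by 1/(1 - p), which the quoted text does not specify, and the layer normalization omits the learned gain and bias.

```python
import random

def dropout(x, p, rng, training=True):
    """Zero each element with probability p and rescale the rest (assumed inverted dropout)."""
    if not training or p == 0.0:
        return list(x)
    keep = 1.0 - p
    return [v / keep if rng.random() < keep else 0.0 for v in x]

def layer_norm(x, eps=1e-6):
    """Normalize one vector to zero mean and unit variance (no learned parameters)."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / (var + eps) ** 0.5 for v in x]

def residual_block(x, sublayer, p_drop, rng, training=True):
    """LayerNorm(x + Dropout(Sublayer(x))): dropout comes before the add and the norm."""
    y = dropout(sublayer(x), p_drop, rng, training)
    return layer_norm([a + b for a, b in zip(x, y)])

def embed_with_dropout(embedding, pos_encoding, p_drop, rng, training=True):
    """Dropout applied to the sum of a word embedding and its positional encoding."""
    summed = [e + pe for e, pe in zip(embedding, pos_encoding)]
    return dropout(summed, p_drop, rng, training)

rng = random.Random(0)
x = [1.0, 2.0, 3.0, 4.0]
double = lambda v: [2.0 * t for t in v]   # stand-in for attention or feed-forward
out = residual_block(x, double, p_drop=0.1, rng=rng)
```

With `training=False` both functions skip the random masking, which mirrors how dropout is normally disabled at inference time.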

Label Smoothing

"Label Smoothing: During training, we employed label smoothing of value ε_ls = 0.1 [30]. This hurts perplexity, as the model learns to be more unsure, but improves accuracy and BLEU score."

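"The network drives itself to learn in the direction that widens the gap between the correct label and the incorrect labels. When the training data is not sufficient to represent all sample features, this causes the network to overfit."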

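"Label smoothing was proposed to solve this problem. It was first introduced in Inception v2 and is a regularization strategy. By 'softening' the traditional one-hot labels, it effectively suppresses overfitting when the loss is computed."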

https://blog.csdn.net/qiu931110/article/details/86684241
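
The "softening" of one-hot labels described above can be sketched as follows. This assumes the uniform formulation, in which a fraction eps of the probability mass is spread evenly over all K classes; other implementations spread it over the K - 1 incorrect classes only. The function names and the example logits are hypothetical.

```python
import math

def smooth_labels(target, num_classes, eps=0.1):
    """Return (1 - eps) * one_hot(target) + eps / K for every class (assumed uniform smoothing)."""
    return [
        (1.0 - eps) * (1.0 if k == target else 0.0) + eps / num_classes
        for k in range(num_classes)
    ]

def log_softmax(logits):
    """Numerically stable log-softmax over a list of logits."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(z - m) for z in logits))
    return [z - lse for z in logits]

def cross_entropy(logits, target_dist):
    """Cross-entropy between a target distribution and softmax(logits)."""
    return -sum(q * lp for q, lp in zip(target_dist, log_softmax(logits)))

logits = [4.0, 1.0, 0.0]                 # a confident, correct prediction for class 0
hard = smooth_labels(0, 3, eps=0.0)      # ordinary one-hot target
soft = smooth_labels(0, 3, eps=0.1)      # smoothed target
loss_hard = cross_entropy(logits, hard)
loss_soft = cross_entropy(logits, soft)
print(loss_soft > loss_hard)
```

For a prediction that is already confident and correct, the smoothed target gives a larger loss than the one-hot target. The model is therefore discouraged from pushing the logit gap ever wider, which matches the remark in the paper that label smoothing makes the model "more unsure".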

-----