Sunday, December 12, 2021

Transformer (3): Illustrated

2021/09/01

-----


https://pixabay.com/zh/photos/flash-tesla-coil-experiment-113310/

-----

Outline



-----

1.1 Transformer


https://zhuanlan.zhihu.com/p/338817680

-----

1.2 Input Embedding (Word2vec)


https://zhuanlan.zhihu.com/p/27234078

-----

1.3 Positional Encoding


https://jalammar.github.io/illustrated-transformer/

The vertical axis is pos and the horizontal axis is i. Even dimensions 2i use sin; odd dimensions use cos.

1. pos: position. The position of a word within the sentence.

2. i: dimension. The index into the positional embedding vector; its maximum value is d_model.

3. d_model: the dimensionality of the word embedding, e.g., 768 or 512. The positional encoding has the same dimensionality as the word embedding, so the two can be added together.

4. sin(a+b) = sin(a)cos(b) + cos(a)sin(b), and cos(a+b) = cos(a)cos(b) - sin(a)sin(b).

5. Substituting pos+k for a+b, where k is the offset to a new position: the new position vector can be formed as a linear combination of the earlier position's vector, with coefficients given by sin(k) and cos(k) terms (see the sketch below).
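
Below is a minimal numpy sketch of the sinusoidal positional encoding described above, including a numerical check of points 4 and 5. The function name positional_encoding and the toy sizes (max_len = 50, d_model = 512) are illustrative choices, not from the paper.

import numpy as np

def positional_encoding(max_len, d_model):
    # PE(pos, 2i)   = sin(pos / 10000^(2i / d_model))
    # PE(pos, 2i+1) = cos(pos / 10000^(2i / d_model))
    pos = np.arange(max_len)[:, None]            # (max_len, 1)
    i = np.arange(d_model // 2)[None, :]         # (1, d_model // 2)
    angle = pos / np.power(10000.0, (2 * i) / d_model)

    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angle)                  # even dimensions 2i   -> sin
    pe[:, 1::2] = np.cos(angle)                  # odd dimensions 2i+1  -> cos
    return pe

# Same width as the word embedding (e.g. d_model = 512), so it can be
# added directly to the embedded tokens.
pe = positional_encoding(max_len=50, d_model=512)
print(pe.shape)                                  # (50, 512)

# Points 4 and 5: PE(pos + k) is a linear function of PE(pos). For each
# frequency w, a fixed 2x2 matrix built from sin(k*w) and cos(k*w) maps
# (sin(pos*w), cos(pos*w)) to (sin((pos+k)*w), cos((pos+k)*w)).
p, k, w = 7, 3, 1.0 / np.power(10000.0, 2 * 5 / 512)   # i = 5, chosen arbitrarily
M = np.array([[np.cos(k * w), np.sin(k * w)],
              [-np.sin(k * w), np.cos(k * w)]])
lhs = np.array([np.sin((p + k) * w), np.cos((p + k) * w)])
rhs = M @ np.array([np.sin(p * w), np.cos(p * w)])
assert np.allclose(lhs, rhs)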

-----

2.1


https://jalammar.github.io/illustrated-transformer/

-----

2.2


https://medium.com/@edloginova/attention-in-nlp-734c6fa9d983

-----

2.3



-----

2.4



-----

2.5


http://speech.ee.ntu.edu.tw/~tlkagk/courses/ML_2019/Lecture/Transformer%20(v5).pdf

-----

2.6



-----

2.7



https://jalammar.github.io/illustrated-transformer/

-----

2.8


https://zhuanlan.zhihu.com/p/75787683

-----

2.9


https://colah.github.io/posts/2015-08-Understanding-LSTMs/

-----

BN


http://proceedings.mlr.press/v37/ioffe15.pdf 

-----

2.10


https://mlexplained.com/2018/01/13/weight-normalization-and-layer-normalization-explained-normalization-in-deep-learning-part-2/

-----

2.11


https://zhuanlan.zhihu.com/p/338817680

-----

3.1


https://zhuanlan.zhihu.com/p/338817680

-----

3.2



-----

4.1


https://huggingface.co/transformers/perplexity.html

-----

4.2


https://www.aclweb.org/anthology/P02-1040.pdf

-----

4.3


https://www.cnblogs.com/by-dream/p/7679284.html

-----

4.4


# Transformer.

Explanation:

「We employ three types of regularization during training:」

Dropout 

「Residual Dropout We apply dropout [27] to the output of each sub-layer, before it is added to the sub-layer input and normalized. 」

Dropout 

「In addition, we apply dropout to the sums of the embeddings and the positional encodings in both the encoder and decoder stacks. For the base model, we use a rate of Pdrop = 0.1.」
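
A rough numpy sketch of where these two dropouts sit, following the quotes above. The helper names, the simplified layer_norm (no learnable gain/bias), and the toy shapes are my own, not the paper's code.

import numpy as np

rng = np.random.default_rng(0)
P_DROP = 0.1   # Pdrop = 0.1 for the base model

def dropout(x, p=P_DROP, training=True):
    # Inverted dropout: zero units with probability p, scale the rest by 1/(1-p).
    if not training or p == 0.0:
        return x
    mask = rng.random(x.shape) >= p
    return x * mask / (1.0 - p)

def layer_norm(x, eps=1e-6):
    # Normalize over the feature dimension.
    mean = x.mean(axis=-1, keepdims=True)
    std = x.std(axis=-1, keepdims=True)
    return (x - mean) / (std + eps)

def residual_sublayer(x, sublayer):
    # "dropout to the output of each sub-layer, before it is added to the
    # sub-layer input and normalized"
    return layer_norm(x + dropout(sublayer(x)))

def encoder_input(token_embeddings, positional_encodings):
    # "dropout to the sums of the embeddings and the positional encodings"
    return dropout(token_embeddings + positional_encodings)

x = rng.normal(size=(2, 4, 8))                   # (batch, seq, d_model)
y = residual_sublayer(x, lambda h: h @ rng.normal(size=(8, 8)))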

Label Smoothing

「Label Smoothing During training, we employed label smoothing of value ls = 0.1 [30]. This hurts perplexity, as the model learns to be more unsure, but improves accuracy and BLEU score.」

「The network drives itself to learn in the direction that maximizes the gap between the correct label and the wrong labels; when the training data is not sufficient to represent all of the sample features, this leads the network to overfit.」

「Label smoothing was proposed to solve the problem above. It was first introduced in Inception v2 as a regularization strategy: by "softening" the traditional one-hot labels, it effectively suppresses overfitting when the loss is computed.」

https://blog.csdn.net/qiu931110/article/details/86684241
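
A minimal numpy sketch of label smoothing with eps = 0.1, in the uniform-mixing form described in the quote above (Inception v2 style). The function names and example values are illustrative.

import numpy as np

def smooth_labels(labels, num_classes, eps=0.1):
    # Mix the one-hot target with a uniform distribution: every class gets
    # eps / num_classes, and the correct class keeps 1 - eps on top of that.
    one_hot = np.eye(num_classes)[labels]
    return one_hot * (1.0 - eps) + eps / num_classes

def cross_entropy(logits, targets):
    # Cross-entropy against the softened targets (numerically stable log-softmax).
    z = logits - logits.max(axis=-1, keepdims=True)
    log_probs = z - np.log(np.exp(z).sum(axis=-1, keepdims=True))
    return -(targets * log_probs).sum(axis=-1).mean()

logits = np.array([[2.0, 0.5, -1.0],
                   [0.1, 1.5, 0.3]])
hard = np.array([0, 1])                          # ground-truth class indices
loss = cross_entropy(logits, smooth_labels(hard, num_classes=3, eps=0.1))
print(loss)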

-----


# Transformer.

-----


# Transformer.

-----

Important note 1:

https://jalammar.github.io/illustrated-transformer/

Important note 2:

https://zhuanlan.zhihu.com/p/338817680

-----

References


# Transformer. Cited 13,554 times.

Vaswani, Ashish, et al. "Attention is all you need." Advances in neural information processing systems. 2017.

https://papers.nips.cc/paper/7181-attention-is-all-you-need.pdf

-----
