The Star Also Rises: Short Attention (3): Illustrated

2021/08/31

-----

https://pixabay.com/zh/photos/window-hauswand-house-facade-831251/

-----

# Transformer.

Note:

The QKV concept is used in self-attention.

-----

QKV

modified from # ConvS2S

C = QKV

https://deeplearning.hatenablog.com/entry/convs2s

Note:

The QKV concept is also used in ConvS2S.

-----

「 In a key-value part, we separate output vectors into

◎ keys to calculate the attention, and

◎ values to encode the next-word distribution and context representation.

However, we still use the value part for two goals at once. So, the authors split it again and in the end, the model outputs three vectors at each time step. The first is used to encode the next-word distribution, the second serves as a key to compute the attention vector, and the third as value for an attention mechanism.」

https://medium.com/@edloginova/attention-in-nlp-734c6fa9d983

Note:

In QKV, K is the key and V is the value. What is Q? Q is the query, which here corresponds to the vector that encodes the next-word distribution.

-----

# Short Attention.

Note:

The figure covers Attention, Key-Value Attention, Key-Value-Predict Attention, and a 5-gram RNN.

Question: what is c? See the LSTM review later in this post.

-----

https://jalammar.github.io/illustrated-transformer/

Note:

Trainable matrices project each input vector into Q, K, and V.
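
As a minimal sketch of this projection, assuming NumPy and randomly initialised stand-ins for the trained matrices (the names `W_q`, `W_k`, `W_v` and all sizes are illustrative and not taken from the linked post):

```python
import numpy as np

rng = np.random.default_rng(0)

d_model, d_k = 8, 4                      # illustrative sizes (assumption)
x = rng.normal(size=(3, d_model))        # 3 token vectors, one per row

# Stand-ins for the trainable projection matrices (random here, learned in practice).
W_q = rng.normal(size=(d_model, d_k))
W_k = rng.normal(size=(d_model, d_k))
W_v = rng.normal(size=(d_model, d_k))

Q, K, V = x @ W_q, x @ W_k, x @ W_v      # each has shape (3, d_k)

# Scaled dot-product self-attention over the 3 tokens.
scores = Q @ K.T / np.sqrt(d_k)
weights = np.exp(scores - scores.max(axis=1, keepdims=True))
weights /= weights.sum(axis=1, keepdims=True)   # row-wise softmax
output = weights @ V                     # shape (3, d_k)
```

Each row of `weights` sums to 1, so every output row is a weighted average of the value vectors.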

-----

「More precisely, the output vector ht is divided into three equal parts: key, value and predict. In our implementation we simply split the output vector ht into kt, vt and pt. To this end the hidden dimension of the key-value-predict attention model needs to be a multiplicative of three. Consequently, the dimensions of kt, vt and pt are 100 for a hidden dimension of 300.」

# Short Attention.
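
The split described in the quote can be sketched as follows. The hidden dimension of 300 comes from the quote; the placeholder values of `h_t` and the use of `np.split` are assumptions for illustration.

```python
import numpy as np

hidden_dim = 300                              # value quoted from the paper
h_t = np.arange(hidden_dim, dtype=float)      # placeholder LSTM output (assumption)

# Split the output vector into three equal parts: key, value, predict.
k_t, v_t, p_t = np.split(h_t, 3)

print(k_t.shape, v_t.shape, p_t.shape)        # each part has 100 entries
```

`np.split` raises an error when the length is not divisible by three, which mirrors the requirement that the hidden dimension be a multiple of three.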

-----

# History DL

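Note:

An LSTM cell takes the input xt together with the previous cell state and hidden state, and produces the new cell state ct and hidden state ht.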
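
To make the roles of x, c, and h concrete, here is a minimal sketch of one step of a standard LSTM cell. The function name `lstm_step`, the stacked weight layout, and the gate ordering are assumptions chosen for illustration; real libraries differ in these conventions.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, U, b):
    """One LSTM step. W: (4k, d), U: (4k, k), b: (4k,); gates stacked as i, f, o, g."""
    k = h_prev.shape[0]
    z = W @ x_t + U @ h_prev + b
    i = sigmoid(z[0:k])            # input gate
    f = sigmoid(z[k:2 * k])        # forget gate
    o = sigmoid(z[2 * k:3 * k])    # output gate
    g = np.tanh(z[3 * k:4 * k])    # candidate cell update
    c_t = f * c_prev + i * g       # new cell state (the "c" in the figure)
    h_t = o * np.tanh(c_t)         # new hidden state / output
    return h_t, c_t

rng = np.random.default_rng(0)
d, k = 5, 3                        # toy input and hidden sizes (assumption)
W = rng.normal(size=(4 * k, d))
U = rng.normal(size=(4 * k, k))
b = np.zeros(4 * k)
h_t, c_t = lstm_step(rng.normal(size=d), np.zeros(k), np.zeros(k), W, U, b)
```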

-----

# History DL

Note:

(b) is used to predict a single word or to classify.

-----

# Short Attention.

https://zhuanlan.zhihu.com/p/25570951

Note: predicting a class, and predicting a missing word.

https://stefanengineering.com/2018/11/05/cloze-deletion-detection/

-----

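CBOW: the surrounding words are used to predict the center word.

# Word2vec 3.

Note:

A word's vector is obtained by looking it up in the word-vector matrix.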
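
A minimal sketch of both ideas, assuming a toy vocabulary and random matrices in place of trained ones (the names `E` and `W_out` are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
vocab = ["the", "cat", "sat", "on", "mat"]      # toy vocabulary (assumption)
word_to_id = {w: i for i, w in enumerate(vocab)}
dim = 4                                         # embedding size (assumption)

E = rng.normal(size=(len(vocab), dim))          # word-vector matrix, one row per word
W_out = rng.normal(size=(dim, len(vocab)))      # output matrix (random stand-in)

# Looking up a word vector is selecting a row of E,
# which equals multiplying a one-hot vector by E.
one_hot = np.zeros(len(vocab))
one_hot[word_to_id["sat"]] = 1.0
vec_sat = E[word_to_id["sat"]]

# CBOW: average the context word vectors, then score every word as the center word.
context = ["cat", "on"]
hidden = E[[word_to_id[w] for w in context]].mean(axis=0)
scores = hidden @ W_out
probs = np.exp(scores - scores.max())
probs /= probs.sum()                            # softmax over the vocabulary
```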

-----

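# Attention 2.

Note:

h: hidden state.

s: source.

t: target.

a: alignment.

Wa: the weight matrix of the alignment model.

va: the vector of the alignment model.

T: transpose (used to transpose a vector).

dot: dot product.

general: apply a linear layer first, then take the dot product (the linear layer's parameters are learned).

concat: concatenate, pass through a full linear layer (a matrix transform plus an activation function), then take the dot product with the learned vector va.

See the next image, which shows the source code.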
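
The three score functions listed above can be sketched as follows. This is an illustration with random stand-ins for the learned `W_a` and `v_a`, not the code shown in the image; the function names and sizes are assumptions.

```python
import numpy as np

def score_dot(h_t, h_s):
    return h_t @ h_s

def score_general(h_t, h_s, W_a):
    # Linear layer on the source state, then dot product with the target state.
    return h_t @ (W_a @ h_s)

def score_concat(h_t, h_s, W_a, v_a):
    # Concatenate, apply linear layer plus tanh, then dot with v_a.
    return v_a @ np.tanh(W_a @ np.concatenate([h_t, h_s]))

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

rng = np.random.default_rng(0)
k = 4                                   # hidden size (assumption)
h_t = rng.normal(size=k)                # target hidden state
H_s = rng.normal(size=(6, k))           # 6 source hidden states, one per row

# Alignment weights over the source positions, here using the dot score.
a = softmax(np.array([score_dot(h_t, h_s) for h_s in H_s]))

W_general = rng.normal(size=(k, k))
W_concat = rng.normal(size=(k, 2 * k))
v_a = rng.normal(size=k)
s_general = score_general(h_t, H_s[0], W_general)
s_concat = score_concat(h_t, H_s[0], W_concat, v_a)
```

With `W_a` set to the identity matrix, the general score reduces to the dot score, which is one way to see how the two are related.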

-----

Note:

See the notes on the paper under the previous image.

https://blog.floydhub.com/attention-mechanism/

-----

# Attention 1 and Attention 2

Note:

Two similar attention mechanisms.

# Short Attention.

Note:

Each output word additionally attends to a context vector built from the previous five words.

-----

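# Short Attention.

Note:

WY: a k × k matrix.

Yt: k × L. The memory of the L output vectors preceding ht, stacked as a matrix. k is the dimension of the LSTM output, and L is 5 here.

Wh: a k × k matrix.

ht: k × 1, a vector of dimension k.

1T: 1 × L.

Mt: a k × L matrix computed from the memory.

wT: 1 × k.

αt: 1 × L (obtained by applying softmax).

αT: L × 1.

rt: k × 1. The context vector, of dimension k.

Wr: a k × k matrix.

Wx: a k × k matrix.

ht*: k × 1, a vector of dimension k.

W*: a |V| × k matrix.

b: |V| × 1, a vector of dimension |V|.

yt: |V| × 1, a probability distribution of dimension |V|.

Yt = [h(t-L) ... h(t-1)], the previous L outputs.

1 is a vector of dimension L whose entries are all 1.

T denotes transpose.

(1) Compute the output of the attention hidden layer, Mt.

(2) Compute the attention weights αt.

(3) Combine Yt with the weights to obtain the context vector rt.

(4) Combine the context vector with ht to obtain ht*.

(5) Map ht* to yt, the probability distribution over which of the |V| words to output.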
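
Steps (1) to (5) can be sketched in NumPy using the dimensions listed above. The `tanh` nonlinearities, the function name `attention_step`, and the random parameters are assumptions for illustration; check them against the equations in the paper.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def attention_step(Y_t, h_t, params):
    """Y_t: (k, L) matrix of the previous L outputs; h_t: (k,) current output."""
    ones = np.ones(Y_t.shape[1])
    # (1) hidden layer of the attention scorer, k x L (tanh is an assumption)
    M_t = np.tanh(params["W_Y"] @ Y_t + np.outer(params["W_h"] @ h_t, ones))
    # (2) attention weights over the L previous positions
    alpha_t = softmax(params["w"] @ M_t)
    # (3) context vector: weighted sum of the columns of Y_t
    r_t = Y_t @ alpha_t
    # (4) combine context vector and current output (tanh is an assumption)
    h_star = np.tanh(params["W_r"] @ r_t + params["W_x"] @ h_t)
    # (5) probability distribution over the vocabulary
    y_t = softmax(params["W_star"] @ h_star + params["b"])
    return y_t, alpha_t

rng = np.random.default_rng(0)
k, L, V = 4, 5, 10                      # toy sizes; L = 5 as in the notes
params = {
    "W_Y": rng.normal(size=(k, k)), "W_h": rng.normal(size=(k, k)),
    "w": rng.normal(size=k),
    "W_r": rng.normal(size=(k, k)), "W_x": rng.normal(size=(k, k)),
    "W_star": rng.normal(size=(V, k)), "b": np.zeros(V),
}
y_t, alpha_t = attention_step(rng.normal(size=(k, L)), rng.normal(size=k), params)
```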

-----

# Short Attention.

Note:

KV (key-value attention).

-----

# Short Attention.

Note:

(6) ht has dimension 2k; kt and vt each have dimension k.

(7) The key parts replace Yt.

(8) This formula is unchanged.

(9) The value parts replace Yt.

(10) vt replaces ht.
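
A hedged sketch of the key-value variant described in (6) to (10). The ordering of the key and value halves, the use of kt in the second term of (7), the `tanh` nonlinearities, and all names are assumptions for illustration.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def key_value_attention_step(H_prev, h_t, params):
    """H_prev: (2k, L) previous L LSTM outputs; h_t: (2k,) current output."""
    k = h_t.shape[0] // 2
    # (6) split every output into a key half and a value half (order assumed)
    K_prev, V_prev = H_prev[:k], H_prev[k:]
    k_t, v_t = h_t[:k], h_t[k:]
    ones = np.ones(H_prev.shape[1])
    # (7) keys replace Y_t; using k_t in the second term is an assumption
    M_t = np.tanh(params["W_Y"] @ K_prev + np.outer(params["W_h"] @ k_t, ones))
    # (8) unchanged
    alpha_t = softmax(params["w"] @ M_t)
    # (9) values replace Y_t
    r_t = V_prev @ alpha_t
    # (10) v_t replaces h_t
    h_star = np.tanh(params["W_r"] @ r_t + params["W_x"] @ v_t)
    return h_star, alpha_t

rng = np.random.default_rng(0)
k, L = 4, 5                             # toy sizes (assumption)
params = {name: rng.normal(size=(k, k)) for name in ("W_Y", "W_h", "W_r", "W_x")}
params["w"] = rng.normal(size=k)
h_star, alpha_t = key_value_attention_step(
    rng.normal(size=(2 * k, L)), rng.normal(size=2 * k), params)
```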

-----

# Short Attention.

Note:

KVP (key-value-predict attention).

-----

# Short Attention.

Note:

(11) ht has dimension 3k; kt, vt, and pt each have dimension k.

(12) ht*, rt, and pt all have dimension k.
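
A hedged sketch of the key-value-predict variant: the output is split three ways, attention is computed from keys and values, and the predict part enters only the final combination. The part ordering, the `tanh` nonlinearities, and all names are assumptions for illustration.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def kvp_attention_step(H_prev, h_t, params):
    """H_prev: (3k, L) previous L LSTM outputs; h_t: (3k,) current output."""
    # (11) split every output into key, value, and predict parts (order assumed)
    K_prev, V_prev, _ = np.split(H_prev, 3, axis=0)
    k_t, _, p_t = np.split(h_t, 3)
    ones = np.ones(H_prev.shape[1])
    M_t = np.tanh(params["W_Y"] @ K_prev + np.outer(params["W_h"] @ k_t, ones))
    alpha_t = softmax(params["w"] @ M_t)
    r_t = V_prev @ alpha_t                        # context vector, dimension k
    # (12) the predict part p_t is combined with the context vector
    h_star = np.tanh(params["W_r"] @ r_t + params["W_x"] @ p_t)
    return h_star, r_t, p_t

rng = np.random.default_rng(0)
k, L = 4, 5                                       # toy sizes (assumption)
params = {name: rng.normal(size=(k, k)) for name in ("W_Y", "W_h", "W_r", "W_x")}
params["w"] = rng.normal(size=k)
h_star, r_t, p_t = kvp_attention_step(
    rng.normal(size=(3 * k, L)), rng.normal(size=3 * k), params)
```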

-----

# Short Attention.

Note:

W_N maps a vector of dimension (N-1)k to dimension k.

-----

# Short Attention.

Note:

WN: k × (N - 1) k.

[ ... ]: (N - 1) k × 1.

(13) W_N maps a vector of dimension (N-1) k to dimension k.

-----

# Short Attention.

Note:

"split" here means dividing the vector into parts. W_N maps a vector of dimension (N-1)k to dimension k.

「Specifically, at every time step we split the LSTM output into N - 1 vectors.」
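
A hedged sketch of the N-gram RNN combination. The indexing used here (the i-th part is taken from the output i steps back), the `tanh`, and the names `ngram_combine` and `history` are my reading and assumptions, not confirmed by the notes above.

```python
import numpy as np

def ngram_combine(history, W_N, N):
    """history: list of the last N-1 LSTM outputs (newest last), each of dim (N-1)k."""
    parts = []
    for i in range(N - 1):
        h = history[-1 - i]                 # output from i steps back
        chunks = np.split(h, N - 1)         # split into N-1 vectors of dim k
        parts.append(chunks[i])             # take its i-th part (assumed indexing)
    concat = np.concatenate(parts)          # dimension (N-1)k
    return np.tanh(W_N @ concat)            # W_N maps (N-1)k to k

rng = np.random.default_rng(0)
N, k = 4, 3                                 # toy sizes (assumption)
history = [rng.normal(size=(N - 1) * k) for _ in range(N - 1)]
W_N = rng.normal(size=(k, (N - 1) * k))
h_star = ngram_combine(history, W_N, N)
```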

-----

# Short Attention.

Note:

Perplexity is 3.2 lower than the baseline.
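
For reference, perplexity is the exponential of the average negative log-probability the model assigns to the correct words, so lower is better. A minimal sketch with made-up probabilities (the function name and the inputs are illustrative, not results from the paper):

```python
import numpy as np

def perplexity(probs_of_correct_words):
    """Perplexity from the probabilities a model assigned to each correct word."""
    probs = np.asarray(probs_of_correct_words, dtype=float)
    return float(np.exp(-np.mean(np.log(probs))))

# A model that always assigns probability 1/4 to the correct word
# is as uncertain as a uniform choice among 4 words.
uniform_ppl = perplexity([0.25, 0.25, 0.25, 0.25])
better_ppl = perplexity([0.5, 0.5, 0.5, 0.5])
```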

-----

# Short Attention.

Note:

Perplexity is 7.0 lower than the baseline.

-----

# Short Attention.

Note:

Perplexity is 9.4 lower than the baseline.

-----

# Short Attention.

Note:

The 4-gram RNN is simpler than KVP (key-value-predict attention), yet its performance is close.

-----

# Short Attention.

Note:

dev: the development (validation) set. The data is split into training, validation, and test sets.

-----

References

# Short Attention. Cited 76 times.

Daniluk, Michał, et al. "Frustratingly short attention spans in neural language modeling." arXiv preprint arXiv:1702.04521 (2017).

https://arxiv.org/pdf/1702.04521.pdf

# History DL

Alom, Md Zahangir, et al. "The history began from alexnet: A comprehensive survey on deep learning approaches." arXiv preprint arXiv:1803.01164 (2018).

https://arxiv.org/ftp/arxiv/papers/1803/1803.01164.pdf

# Word2vec 3. Cited 645 times.

Rong, Xin. "word2vec parameter learning explained." arXiv preprint arXiv:1411.2738 (2014).

https://arxiv.org/pdf/1411.2738.pdf

# Attention 1. Cited 14895 times.

Bahdanau, Dzmitry, Kyunghyun Cho, and Yoshua Bengio. "Neural machine translation by jointly learning to align and translate." arXiv preprint arXiv:1409.0473 (2014).

https://arxiv.org/pdf/1409.0473.pdf

# Attention 2. Cited 4781 times.

Luong, Minh-Thang, Hieu Pham, and Christopher D. Manning. "Effective approaches to attention-based neural machine translation." arXiv preprint arXiv:1508.04025 (2015).

https://arxiv.org/pdf/1508.04025.pdf

# ConvS2S. Cited 1772 times.

Gehring, Jonas, et al. "Convolutional sequence to sequence learning." arXiv preprint arXiv:1705.03122 (2017).

https://arxiv.org/pdf/1705.03122.pdf

# Transformer. Cited 13554 times.

Vaswani, Ashish, et al. "Attention is all you need." Advances in neural information processing systems. 2017.

https://papers.nips.cc/paper/7181-attention-is-all-you-need.pdf

-----

The Star Also Rises

Sunday, October 31, 2021
