The Star Also Rises: Attention (3): Illustrated

Attention (3): Illustrated

2021/08/31

-----

https://pixabay.com/zh/photos/puzzle-match-missing-hole-blank-693873/

-----

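# Attention 1

Notes:

y_t is determined by s_t.

s_t is determined by s_(t-1) (on the Decoder side) together with all forward and backward h_i and all values of the alignment function (on the Encoder side). See the formulas explained later in this post.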

-----

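# Attention 1

Notes:

In the Encoder of a conventional Seq2seq model, each hidden state h_t is determined by the input x_t at that time step and the previous hidden state h_(t-1). The Encoder compresses all the hidden states into a single context vector c. The recurrent unit can be an LSTM.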
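The encoder description above can be written compactly. This is a sketch in the notation of the Bahdanau et al. paper, where f and q stand for unspecified nonlinear functions and T_x is the source length; one common choice, assumed here only as an example, is for q to return the last hidden state.

```latex
\begin{align}
h_t &= f(x_t, h_{t-1}) \\
c   &= q(\{h_1, \dots, h_{T_x}\}) \qquad \text{e.g. } q(\{h_1, \dots, h_{T_x}\}) = h_{T_x}
\end{align}
```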

-----

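# Attention 1

Notes:

The Decoder of a conventional Seq2seq model:

Equation (2): the probability of the output sentence is the product of the probabilities of all output words. The probability of each output word is conditioned on the preceding output words and c.

Equation (3): the probability of the current output word is determined by the previous output word, the hidden state, and c.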
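For reference, the two equations described above can be sketched as follows, using the symbols of the Bahdanau et al. paper: g is a nonlinear function that outputs the word probability, s_t is the decoder hidden state, and T is the target length.

```latex
\begin{align}
p(\mathbf{y}) &= \prod_{t=1}^{T} p(y_t \mid \{y_1, \dots, y_{t-1}\}, c) \tag{2} \\
p(y_t \mid \{y_1, \dots, y_{t-1}\}, c) &= g(y_{t-1}, s_t, c) \tag{3}
\end{align}
```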

-----

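# Attention 1

Notes:

The key point of Equation (4) is that c becomes c_i: each target position i gets its own context vector instead of sharing one fixed c.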
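Written out, the change is only in the last argument. The sketch below follows the paper's notation, with x the source sentence and s_i the decoder hidden state at position i.

```latex
\begin{align}
p(y_i \mid y_1, \dots, y_{i-1}, \mathbf{x}) &= g(y_{i-1}, s_i, c_i) \tag{4} \\
s_i &= f(s_{i-1}, y_{i-1}, c_i)
\end{align}
```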

-----

# Attention 1

Notes:

a: the alignment model, which is a feedforward neural network (FNN) ("We parametrize the alignment model a as a feedforward neural network which is jointly trained with all the other components of the proposed system.").

s_i: the Decoder hidden state.

h_j: the Encoder hidden state.

e_ij: computed from s_(i-1) and h_j.

α_ij: e_ij passed through a softmax.

c_i: the sum of the h_j weighted by α_ij.
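The definitions above can be sketched in NumPy. This is a minimal sketch, assuming the additive form e_ij = v_a^T tanh(W_a s_(i-1) + U_a h_j) for the feedforward alignment model; the function name `additive_attention`, the dimensions, and the random weights are illustrative assumptions, not the authors' code.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over a 1-D array."""
    z = np.exp(x - np.max(x))
    return z / z.sum()

def additive_attention(s_prev, H, W_a, U_a, v_a):
    """Return (alpha_i, c_i) for one decoder step.

    s_prev: previous decoder hidden state s_(i-1), shape (n,)
    H:      encoder hidden states h_1..h_T stacked row-wise, shape (T, m)
    """
    # e_ij = v_a^T tanh(W_a s_(i-1) + U_a h_j): one score per source position j
    e = np.tanh(s_prev @ W_a.T + H @ U_a.T) @ v_a   # shape (T,)
    alpha = softmax(e)                               # alpha_ij, sums to 1 over j
    c = alpha @ H                                    # c_i = sum_j alpha_ij h_j, shape (m,)
    return alpha, c

# Illustrative sizes (assumptions): source length, decoder size, encoder size, alignment size.
rng = np.random.default_rng(0)
T, n, m, d = 5, 4, 6, 3
s_prev = rng.normal(size=n)
H = rng.normal(size=(T, m))
W_a = rng.normal(size=(d, n))
U_a = rng.normal(size=(d, m))
v_a = rng.normal(size=d)

alpha, c = additive_attention(s_prev, H, W_a, U_a, v_a)
print(alpha.shape, c.shape)
```

The weights α_ij form a probability distribution over the source positions, so c_i is a convex combination of the encoder states.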

# Attention 1

Notes:

a is an FNN. See the concat code later in this post.

-----

Figure 2: The BLEU scores of the generated translations on the test set with respect to the lengths of the sentences. The results are on the full test set which includes sentences having unknown words to the models.

# Attention 1

Notes:

RNNsearch is the model proposed in this paper (# Attention 1). RNNenc is the RNN Encoder-Decoder, that is, # Seq2seq 2.

30 and 50:

"We train each model twice: first with the sentences of length up to 30 words (RNNencdec-30, RNNsearch-30) and then with the sentences of length up to 50 word (RNNencdec-50, RNNsearch-50)."

30,000 + 1 (the 30,000-word shortlist plus the [UNK] token):

"After a usual tokenization, we use a shortlist of 30,000 most frequent words in each language to train our models. Any word not included in the shortlist is mapped to a special token ([UNK]). We do not apply any other special preprocessing, such as lowercasing or stemming, to the data."

-----

# Seq2seq 2

Notes:

The RNN Encoder–Decoder, used as the baseline model for comparison.

-----

# Seq2seq 2

Notes:

h_t draws on information from h_(t-1), y_(t-1), and c.
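As a compact restatement of the note above, the decoder of the RNN Encoder–Decoder can be sketched as follows. Here f and g are unspecified nonlinear functions, and the notation is adapted to match this post rather than copied from the paper.

```latex
\begin{align}
h_t &= f(h_{t-1}, y_{t-1}, c) \\
p(y_t \mid y_{t-1}, \dots, y_1, c) &= g(h_t, y_{t-1}, c)
\end{align}
```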

-----

Figure 1: Our model reads an input sentence “ABC” and produces “WXYZ” as the output sentence. The model stops making predictions after outputting the end-of-sentence token. Note that the LSTM reads the input sentence in reverse, because doing so introduces many short term dependencies in the data that make the optimization problem much easier.

# Seq2seq 1

Notes:

In # Seq2seq 1, h_t draws on information from h_(t-1) and y_(t-1).

-----

Figure 3: Four sample alignments found by RNNsearch-50. The x-axis and y-axis of each plot correspond to the words in the source sentence (English) and the generated translation (French), respectively. Each pixel shows the weight α_ij of the annotation of the j-th source word for the i-th target word (see Eq. (6)), in grayscale (0: black, 1: white). (a) an arbitrary sentence. (b–d) three randomly selected samples among the sentences without any unknown words and of length between 10 and 20 words from the test set.

# Attention 1

Notes:

English to French. Each French word draws its context vector from the English words with different weights; the higher the weight, the larger that word's share. (a) is an arbitrary sentence. (b-d) are randomly selected sentences that contain no unknown words and are between 10 and 20 words long.

-----

# Attention 1

Notes:

The vertical axis shows the French words. The figure shows which English words contribute most to each French word.

-----

Table 1: BLEU scores of the trained models computed on the test set. The second and third columns show respectively the scores on all the sentences and, on the sentences without any unknown word in themselves and in the reference translations. Note that RNNsearch-50* was trained much longer until the performance on the development set stopped improving. (◦) We disallowed the models to generate [UNK] tokens when only the sentences having no unknown words were evaluated (last column).

# Attention 1

Notes:

The All column includes sentences containing [UNK], so its scores are lower. The No UNK column excludes sentences containing [UNK], so its scores are higher (the vocabulary in use is smaller).

-----

Figure 1: Neural machine translation – a stacking recurrent architecture for translating a source sequence A B C D into a target sequence X Y Z. Here, <eos> marks the end of a sentence.

# Attention 2

Notes:

Same as # Seq2seq 1.

Stacking: two stacked layers are shown here.

-----

Figure 2: Global attentional model – at each time step t, the model infers a variable-length alignment weight vector a_t based on the current target state h_t and all source states h̄_s. A global context vector c_t is then computed as the weighted average, according to a_t, over all the source states.

# Attention 2

Notes:

Blue: Encoder. Red: Decoder. This corresponds to a single layer of Figure 1 in the paper; according to the paper, the hidden states are taken from the top LSTM layer.

https://blog.csdn.net/weixin_40871455/article/details/85007560

a_t: the alignment weight vector.

See the step-by-step breakdown that follows.

-----

# Attention 2

Notes:

h̃_t is obtained by concatenating c_t and h_t, applying the matrix W_c, and then passing the result through the tanh activation function.
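The three steps above (concatenate, linear transform, tanh) can be sketched in NumPy. This is a minimal sketch of h̃_t = tanh(W_c [c_t; h_t]); the hidden size `d`, the random inputs, and the function name `attentional_hidden_state` are assumptions made for illustration.

```python
import numpy as np

def attentional_hidden_state(c_t, h_t, W_c):
    """Compute h_tilde_t = tanh(W_c [c_t; h_t])."""
    concat = np.concatenate([c_t, h_t])   # [c_t; h_t], shape (2d,)
    return np.tanh(W_c @ concat)          # shape (d,)

rng = np.random.default_rng(0)
d = 4                                     # hypothetical hidden size
c_t = rng.normal(size=d)                  # context vector
h_t = rng.normal(size=d)                  # current target hidden state
W_c = rng.normal(size=(d, 2 * d))         # maps the 2d-dim concatenation back to d dims

h_tilde = attentional_hidden_state(c_t, h_t, W_c)
print(h_tilde.shape)
```

Because W_c acts on the concatenation, it is equivalent to applying one matrix to c_t and another to h_t and adding the results before the tanh.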

-----

"A global context vector c_t is then computed as the weighted average, according to a_t, over all the source states."

"In this model type, a variable-length alignment vector a_t, whose size equals the number of time steps on the source side, is derived by comparing the current target hidden state h_t with each source hidden state h̄_s:"

# Attention 2

Notes:

c_t is obtained from a_t and all the h̄_s.

The size of a_t equals the length of the source. See the next figure.
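As a separate illustration of the two notes above, the weighted average can be sketched in NumPy. The numbers below are made up for the example, and `global_context` is a hypothetical helper name; the alignment weights a_t are taken as given here.

```python
import numpy as np

def global_context(a_t, H_src):
    """c_t = sum_s a_t(s) * h_bar_s, a weighted average of the source states."""
    return a_t @ H_src

# Hypothetical example: 3 source positions, hidden size 2.
H_src = np.array([[1.0, 0.0],
                  [0.0, 1.0],
                  [1.0, 1.0]])
a_t = np.array([0.5, 0.25, 0.25])   # one weight per source position, summing to 1

c_t = global_context(a_t, H_src)
print(len(a_t) == H_src.shape[0])   # a_t has as many entries as source positions
print(c_t)                           # c_t equals [0.75, 0.5]
```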

-----

# Attention 2

One particular a_t.

-----

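# Attention 2

Notes:

Equation (7)

a_t: alignment.

The notation a_t here can give the impression that it is a single value. In fact, the h̄_s in the numerator ranges over as many source states as the source has positions, so a_t has one entry per source position. See the previous figure.

Note the phrase "each h̄_s" in the paper.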
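Equation (7) can be written out as below. The index s picks one entry of a_t, and the sum in the denominator runs over every source position s', which is why a_t is a vector of the same length as the source.

```latex
\begin{equation}
a_t(s) = \operatorname{align}(h_t, \bar{h}_s)
       = \frac{\exp\big(\operatorname{score}(h_t, \bar{h}_s)\big)}
              {\sum_{s'} \exp\big(\operatorname{score}(h_t, \bar{h}_{s'})\big)}
\tag{7}
\end{equation}
```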

-----

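# Attention 2

Notes:

h: hidden state.

s: source.

t: target.

a: alignment.

W_a: the weight matrix of the alignment.

v_a: the vector of the alignment.

T: transpose (used to transpose a vector).

dot: the dot product.

general: pass through a linear layer, then take the dot product (the parameters of the linear layer are learned during training).

concat: concatenate, pass through a full linear layer (a matrix transform plus an activation function), then take the dot product with the learned vector v_a.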
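The three score functions listed above can be sketched in NumPy. This is a minimal sketch, not the code from the figure referenced in this post: the function names, the sizes `d` and `k`, and the random weights are assumptions, and the concat variant assumes tanh as the activation.

```python
import numpy as np

def score_dot(h_t, h_s):
    """dot: h_t^T h_s"""
    return h_t @ h_s

def score_general(h_t, h_s, W_a):
    """general: h_t^T W_a h_s (linear layer on the source state, then dot product)"""
    return h_t @ (W_a @ h_s)

def score_concat(h_t, h_s, W_a, v_a):
    """concat: v_a^T tanh(W_a [h_t; h_s])"""
    return v_a @ np.tanh(W_a @ np.concatenate([h_t, h_s]))

rng = np.random.default_rng(0)
S, d, k = 5, 4, 3                        # source length, hidden size, concat layer size (all hypothetical)
h_t = rng.normal(size=d)                 # current target hidden state
H_src = rng.normal(size=(S, d))          # source hidden states, one row per position
W_general = rng.normal(size=(d, d))
W_concat = rng.normal(size=(k, 2 * d))
v_a = rng.normal(size=k)

scores = {
    "dot": np.array([score_dot(h_t, h_s) for h_s in H_src]),
    "general": np.array([score_general(h_t, h_s, W_general) for h_s in H_src]),
    "concat": np.array([score_concat(h_t, h_s, W_concat, v_a) for h_s in H_src]),
}
for name, values in scores.items():
    print(name, values.shape)            # each variant yields one score per source position
```

With W_a set to the identity matrix, general reduces to dot, which is one way to see that general is dot with a learned linear map in between.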

See also the source-code image in the next figure.

-----

Notes:

See the explanation of the paper in the previous figure.

https://blog.floydhub.com/attention-mechanism/

-----

# Attention 1 and Attention 2

# Attention 2

Excerpts from the paper follow.

"First, we simply use hidden states at the top LSTM layers in both the encoder and decoder as illustrated in Figure 2. Bahdanau et al. (2015), on the other hand, use the concatenation of the forward and backward source hidden states in the bi-directional encoder and target hidden states in their non-stacking unidirectional decoder."

"Second, our computation path is simpler; we go from h_t -> a_t -> c_t -> h̃_t then make a prediction as detailed in Eq. (5), Eq. (6), and Figure 2. On the other hand, at any time t, Bahdanau et al. (2015) build from the previous hidden state h_(t−1) -> a_t -> c_t -> h_t, which, in turn, goes through a deep-output and a maxout layer before making predictions."

"Lastly, Bahdanau et al. (2015) only experimented with one alignment function, the concat product; whereas we show later that the other alternatives are better."

Excerpts from the blog posts linked below follow.

Main differences between the Bahdanau and Luong attention mechanisms:

1. Attention computation

"Bahdanau et al. uses the concatenation of the forward and backward hidden states in the bi-directional encoder and previous target’s hidden states in their non-stacking unidirectional decoder."

"Luong et al. attention uses hidden states at the top LSTM layers in both the encoder and decoder."

"Luong attention mechanism uses the current decoder’s hidden state to compute the alignment vector, whereas Bahdanau uses the output of the previous time step."

2. Alignment functions

"Bahdanau uses only concat score alignment model whereas Luong uses dot, general and concat alignment score models."

https://towardsdatascience.com/sequence-2-sequence-model-with-attention-mechanism-9e9ca2a613a

https://blog.floydhub.com/attention-mechanism/

-----

References

# Attention 1. Cited 14,895 times.

Bahdanau, Dzmitry, Kyunghyun Cho, and Yoshua Bengio. "Neural machine translation by jointly learning to align and translate." arXiv preprint arXiv:1409.0473 (2014).

https://arxiv.org/pdf/1409.0473.pdf

# Visual Attention. Cited 6,060 times.

Xu, Kelvin, et al. "Show, attend and tell: Neural image caption generation with visual attention." International conference on machine learning. 2015.

http://proceedings.mlr.press/v37/xuc15.pdf

# Attention 2. Cited 4,781 times.

Luong, Minh-Thang, Hieu Pham, and Christopher D. Manning. "Effective approaches to attention-based neural machine translation." arXiv preprint arXiv:1508.04025 (2015).

https://arxiv.org/pdf/1508.04025.pdf

# Seq2seq 1. Cited 12,676 times.

Sutskever, Ilya, Oriol Vinyals, and Quoc V. Le. "Sequence to sequence learning with neural networks." Advances in neural information processing systems. 2014.

http://papers.nips.cc/paper/5346-sequence-to-sequence-learning-with-neural-networks.pdf

# Seq2seq 2. Cited 11,284 times.

Cho, Kyunghyun, et al. "Learning phrase representations using RNN encoder-decoder for statistical machine translation." arXiv preprint arXiv:1406.1078 (2014).

https://arxiv.org/pdf/1406.1078.pdf

-----

The Star Also Rises

Sunday, October 17, 2021
