Sunday, October 17, 2021

Attention (4): Appendix

2021/10/05

-----


https://pixabay.com/zh/photos/photo-photographer-old-photos-256887/

-----


# Attention 1.

Explanation:

The Encoder and Decoder architecture of attention. The Decoder is at the top and the Encoder is at the bottom. When generating each Decoder output, the values of all Encoder sub-units are consulted (these values form an array, and all such arrays in turn form a matrix). See the formula explanations below.
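
As a concrete illustration, here is a minimal NumPy sketch of one Decoder step with additive (Bahdanau-style) attention over all Encoder states. The sizes, parameter names, and random initialization are assumptions for illustration only, not the paper's implementation.

```python
import numpy as np

# Minimal sketch of one decoder step with additive (Bahdanau-style) attention.
# All sizes, parameter names, and the random initialization are illustrative
# assumptions, not the paper's actual implementation.

rng = np.random.default_rng(0)
S, enc_dim, dec_dim, att_dim = 6, 8, 8, 10    # source length and sizes (assumed)

enc_states = rng.normal(size=(S, enc_dim))    # one vector per Encoder sub-unit
dec_state  = rng.normal(size=(dec_dim,))      # current Decoder hidden state
W_enc = rng.normal(size=(att_dim, enc_dim))   # projects the Encoder states
W_dec = rng.normal(size=(att_dim, dec_dim))   # projects the Decoder state
v     = rng.normal(size=(att_dim,))           # scoring vector

# Additive score against every Encoder state, then softmax to get the weights.
scores  = np.tanh(enc_states @ W_enc.T + dec_state @ W_dec.T) @ v   # shape (S,)
weights = np.exp(scores - scores.max())
weights /= weights.sum()

# Context vector for this Decoder step: weighted average of all Encoder states.
context = weights @ enc_states                                      # shape (enc_dim,)
print(weights.round(3), context.shape)
```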

-----

Explanation:

Soft

「Soft Attention: the alignment weights are learned and placed “softly” over all patches in the source image; essentially the same type of attention as in Bahdanau et al., 2015.

Pro: the model is smooth and differentiable.

Con: expensive when the source input is large.」

https://lilianweng.github.io/lil-log/2018/06/24/attention-attention.html#soft-vs-hard-attention
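
A minimal sketch of the soft case, under assumed toy shapes: the weights are a softmax over all patches, so the weighted average stays smooth and differentiable end to end.

```python
import numpy as np

# Soft attention sketch: the weights "softly" cover every patch, so the
# weighted average is smooth and differentiable. Shapes are toy assumptions.

rng = np.random.default_rng(0)
patches = rng.normal(size=(16, 32))   # 16 patch features from the source image
query   = rng.normal(size=(32,))      # current decoder state (assumed)

scores  = patches @ query             # one alignment score per patch
weights = np.exp(scores - scores.max())
weights /= weights.sum()              # softmax over all patches

context = weights @ patches           # differentiable weighted average, shape (32,)
```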

--

Hard

「Hard Attention: only selects one patch of the image to attend to at a time.

Pro: less calculation at the inference time.

Con: the model is non-differentiable and requires more complicated techniques such as variance reduction or reinforcement learning to train. (Luong, et al., 2015)」

https://lilianweng.github.io/lil-log/2018/06/24/attention-attention.html#soft-vs-hard-attention
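
For contrast, a sketch of the hard case under the same toy assumptions: a single patch index is sampled, and that sampling step is what breaks differentiability.

```python
import numpy as np

# Hard attention sketch: only one patch is attended to at a time. The sampling
# step is not differentiable, which is why variance reduction or reinforcement
# learning is needed for training. Same toy shapes as the soft sketch above.

rng = np.random.default_rng(0)
patches = rng.normal(size=(16, 32))
query   = rng.normal(size=(32,))

scores = patches @ query
probs  = np.exp(scores - scores.max())
probs /= probs.sum()

idx     = rng.choice(len(patches), p=probs)   # sample a single patch index
context = patches[idx]                        # the context is just that one patch
```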

-----

Explanation:

Global vs Local

「Luong, et al., 2015 proposed the “global” and “local” attention. The global attention is similar to the soft attention, while the local one is an interesting blend between hard and soft, an improvement over the hard attention to make it differentiable: the model first predicts a single aligned position for the current target word and a window centered around the source position is then used to compute a context vector.」

https://lilianweng.github.io/lil-log/2018/06/24/attention-attention.html#soft-vs-hard-attention

-----


# Attention 2.

-----


Figure 3: Local attention model – the model first predicts a single aligned position p_t for the current target word. A window centered around the source position p_t is then used to compute a context vector c_t, a weighted average of the source hidden states in the window. The weights a_t are inferred from the current target state h_t and those source states h̄_s in the window.

# Attention 2.

Explanation:

Blue: Encoder; red: Decoder. This corresponds to the first layer of Figure 1 in the paper.

https://blog.csdn.net/weixin_40871455/article/details/85007560

The window size of 2D + 1 is the key point. For the position p_t of the window's center, see the explanation on the following slides.

「In concrete details, the model first generates an aligned position p_t for each target word at time t. The context vector c_t is then derived as a weighted average over the set of source hidden states within the window [p_t - D, p_t + D]; D is empirically selected.」

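A minimal sketch of this window computation, assuming toy sizes and the simpler monotonic alignment (local-m, p_t = t); the dot-product score is an illustrative choice, not necessarily the variant used in the paper.

```python
import numpy as np

# Minimal local attention sketch with monotonic alignment (local-m: p_t = t).
# Sizes and the dot-product score are illustrative assumptions.

rng = np.random.default_rng(0)
S, dim, D = 10, 8, 2                      # source length, hidden size, half-window
src_states = rng.normal(size=(S, dim))    # source hidden states
h_t = rng.normal(size=(dim,))             # current target hidden state
t = 4                                     # current target time step

p_t = t                                           # local-m: aligned position is t
lo, hi = max(0, p_t - D), min(S, p_t + D + 1)     # clip the window at the borders
window = src_states[lo:hi]                        # at most 2D + 1 source states

scores  = window @ h_t                    # alignment scores inside the window only
weights = np.exp(scores - scores.max())
weights /= weights.sum()

c_t = weights @ window                    # context vector: weighted average
print(weights.round(3), c_t.shape)
```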
-----


# Attention 2.

Explanation:

Equation (9): p_t = S * sigmoid(v_p^T tanh(W_p h_t))

h_t: the current target (decoder) hidden state.

W_p and v_p: parameters of the model, learned during training, used to predict p_t.

Equation (10): a_t(s) = align(h_t, h̄_s) * exp(-(s - p_t)^2 / (2σ^2))

p_t: as in Equation (9).

s: a source position index within the window [p_t - D, p_t + D].

σ: the standard deviation of the Gaussian, set empirically to D / 2 in the paper.

--

「Here, the computation of the context vector c_t attends only to the 2D + 1 source hidden states inside the window [p_t - D, p_t + D] (source hidden states that fall outside the sentence are ignored). Here p_t is a source position index, which can be understood as the focus of the attention; the parameter D is chosen empirically. For computing p_t, the paper gives two schemes: Monotonic alignment (local-m) and Predictive alignment (local-p).」

「Predictive alignment (local-p): here W_p and v_p are model parameters and S is the length of the source sentence, so p_t lies in [0, S]. It is easy to see that the farther a position is from the center p_t, the smaller the weight given to the source hidden state at that position.」

https://zhuanlan.zhihu.com/p/48424395
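
A follow-up sketch of Predictive alignment (local-p) under the same toy assumptions: p_t is predicted from h_t as in Equation (9), and the alignment weights inside the window are scaled by the Gaussian factor of Equation (10) with σ = D/2. W_p, v_p, and all sizes are randomly initialized stand-ins, not trained parameters.

```python
import numpy as np

# Predictive alignment (local-p) sketch. W_p, v_p, and all sizes are randomly
# initialized stand-ins for illustration, not trained parameters.

rng = np.random.default_rng(1)
S, dim, att_dim, D = 10, 8, 6, 2
src_states = rng.normal(size=(S, dim))    # source hidden states
h_t = rng.normal(size=(dim,))             # current target hidden state
W_p = rng.normal(size=(att_dim, dim))
v_p = rng.normal(size=(att_dim,))

# Equation (9): p_t = S * sigmoid(v_p^T tanh(W_p h_t)), so p_t lies in [0, S].
p_t = S / (1.0 + np.exp(-(v_p @ np.tanh(W_p @ h_t))))

sigma = D / 2.0                                   # empirically, sigma = D / 2
lo, hi = max(0, int(p_t) - D), min(S, int(p_t) + D + 1)
s = np.arange(lo, hi)                             # integer positions in the window
window = src_states[lo:hi]

# Softmax alignment within the window, then the Gaussian factor of Equation (10):
# positions farther from the center p_t get smaller weights.
align = np.exp(window @ h_t - (window @ h_t).max())
align /= align.sum()
a_t = align * np.exp(-((s - p_t) ** 2) / (2 * sigma ** 2))

c_t = a_t @ window                                # context vector
print(round(float(p_t), 2), a_t.round(3), c_t.shape)
```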

-----

References

# Attention 1. Cited 14,895 times.

Bahdanau, Dzmitry, Kyunghyun Cho, and Yoshua Bengio. "Neural machine translation by jointly learning to align and translate." arXiv preprint arXiv:1409.0473 (2014).

https://arxiv.org/pdf/1409.0473.pdf


# Visual Attention. Cited 6,060 times.

Xu, Kelvin, et al. "Show, attend and tell: Neural image caption generation with visual attention." International conference on machine learning. 2015.

http://proceedings.mlr.press/v37/xuc15.pdf


# Attention 2. Cited 4,781 times.

Luong, Minh-Thang, Hieu Pham, and Christopher D. Manning. "Effective approaches to attention-based neural machine translation." arXiv preprint arXiv:1508.04025 (2015).

https://arxiv.org/pdf/1508.04025.pdf


# Seq2seq 1. Cited 12,676 times.

Sutskever, Ilya, Oriol Vinyals, and Quoc V. Le. "Sequence to sequence learning with neural networks." Advances in neural information processing systems. 2014.

http://papers.nips.cc/paper/5346-sequence-to-sequence-learning-with-neural-networks.pdf


# Seq2seq 2. Cited 11,284 times.

Cho, Kyunghyun, et al. "Learning phrase representations using RNN encoder-decoder for statistical machine translation." arXiv preprint arXiv:1406.1078 (2014).

https://arxiv.org/pdf/1406.1078.pdf

-----
