Monday, December 28, 2020

Template

2020/12/04

-----


https://pixabay.com/zh/photos/silk-screen-silk-screening-art-1246169/

-----

◎ Abstract

-----

◎ Introduction

-----

Which problems (weaknesses) of earlier work does this paper aim to solve?

-----

◎ Method

-----

What is the proposed solution?

-----

What are the specific details?

-----

◎ Result

-----

Results of this paper.

-----

◎ Discussion

-----

Comparison of this paper with other papers (results or methods).

-----

Comparison of results.

-----

Comparison of methods.

-----

◎ Conclusion 

-----

◎ Future Work

-----

Follow-up research in related fields.

-----

Follow-up research in extended fields.

-----

◎ References

-----

# HDR. Cited 3,589 times. An early neural network architecture for handwritten digit recognition, without fully connected layers.

LeCun, Yann, et al. "Handwritten digit recognition with a back-propagation network." Advances in neural information processing systems 2 (1989): 396-404.

https://papers.nips.cc/paper/1989/file/53c3bce66e43be4f209556518c2fcb54-Paper.pdf


# LeNet. Cited 31,707 times. The classic convolutional neural network; its main addition over HDR is the fully connected layers.

LeCun, Yann, et al. "Gradient-based learning applied to document recognition." Proceedings of the IEEE 86.11 (1998): 2278-2324.

http://yann.lecun.com/exdb/publis/pdf/lecun-01a.pdf


# AlexNet. Cited 74,398 times. One of the earlier large convolutional neural networks to run on GPUs; it delivered a dramatic leap in performance over prior work and successfully used dropout to avoid overfitting.

Krizhevsky, Alex, Ilya Sutskever, and Geoffrey E. Hinton. "Imagenet classification with deep convolutional neural networks." Communications of the ACM 60.6 (2017): 84-90.

https://proceedings.neurips.cc/paper/2012/file/c399862d3b9d6b76c8436e924a68c45b-Paper.pdf

-----

Thursday, December 17, 2020

What are the main points of the Transformer?

2020/10/27

-----


-----

-----

I. Basic learning and paper comprehension (70%).


◎ 1. What can we learn from this paper (what problem does it solve)?

◎ A. Causes of the problem.

◎ 1.a.1: How far had this field progressed before?

-----


# The ConvS2S paper.

-----

◎ 1.a.2: What bottlenecks have recent studies run into?

◎ 1.a.3: What is the root cause of the issue?


◎ B. Solution.

◎ 1.b.1: What approach did the authors take to solve it?

◎ 1.b.2: How do the details solve it?

◎ 1.b.3: (optional) Did the authors explain their reasoning? Or have later researchers discussed it?


◎ C. Performance evaluation.

◎ 1.c.1: Comparison of performance results.

◎ 1.c.2: Does this method still have limitations, and what are they?

◎ 1.c.3: (optional) What do the authors expect for future development? What follow-up work have other researchers done?


II. Follow-up development and extended applications (30%)


◎ 2. Which vertical (application) domains can it be applied to?

◎ 3. What is the value of this paper (how can it be extended across domains)?

◎ 4. How could it be improved (follow-up research)?

-----

-----

References

◎ Related papers

◎ Extended papers

10. Transformer - Transformer Lab: GPT-1, (GPT-2, GPT-3), BERT

◎ Reference articles

The Star Also Rises: NLP(五):Transformer

http://hemingwang.blogspot.com/2019/01/transformer.html

-----

[Translation] The Illustrated Transformer

2019/10/02

----- 

-----

References

-----

The Illustrated Transformer – Jay Alammar – Visualizing machine learning one concept at a time
https://jalammar.github.io/illustrated-transformer/

Monday, December 07, 2020

What are the main points of ConvS2S?

2020/10/27

-----



-----

I. Introduction

Which problems (weaknesses) of earlier work does this paper aim to solve?

-----


# GNMT

-----


# GNMT

-----


# PreConvS2S

-----

II. Method

-----

III. Result

-----

IV. Discussion

-----

V. Conclusion and Future Work

-----

Conclusion

-----

Future Work

-----

References

◎ Main papers

[1] LSTM. Cited 39,743 times.

Hochreiter, Sepp, and Jürgen Schmidhuber. "Long short-term memory." Neural computation 9.8 (1997): 1735-1780.

http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.676.4320&rep=rep1&type=pdf


[2] Seq2seq. Cited 12,676 times.

Sutskever, Ilya, Oriol Vinyals, and Quoc V. Le. "Sequence to sequence learning with neural networks." Advances in neural information processing systems. 2014.

http://papers.nips.cc/paper/5346-sequence-to-sequence-learning-with-neural-networks.pdf


[3] Attention 1. Cited 14,895 times.

Bahdanau, Dzmitry, Kyunghyun Cho, and Yoshua Bengio. "Neural machine translation by jointly learning to align and translate." arXiv preprint arXiv:1409.0473 (2014).

https://arxiv.org/pdf/1409.0473.pdf


[4] ConvS2S. Cited 1,772 times.

Gehring, Jonas, et al. "Convolutional sequence to sequence learning." arXiv preprint arXiv:1705.03122 (2017).

https://arxiv.org/pdf/1705.03122.pdf


[5] Transformer. Cited 13,554 times.

Vaswani, Ashish, et al. "Attention is all you need." Advances in neural information processing systems. 2017.

https://papers.nips.cc/paper/7181-attention-is-all-you-need.pdf

-----

◎ Related papers

-----

[6] GNMT. Cited 3,391 times.

Wu, Yonghui, et al. "Google's neural machine translation system: Bridging the gap between human and machine translation." arXiv preprint arXiv:1609.08144 (2016).

https://arxiv.org/pdf/1609.08144.pdf


[7] PreConvS2S. Cited 273 times.

Gehring, Jonas, et al. "A convolutional encoder model for neural machine translation." arXiv preprint arXiv:1611.02344 (2016).

https://arxiv.org/pdf/1611.02344.pdf

-----

◎ Reference articles

The Star Also Rises: NLP(四):ConvS2S

https://hemingwang.blogspot.com/2019/04/convs2s.html

-----

[Translation] Understanding incremental decoding in fairseq

2019/10/02


-----




-----

References

# Overview
Understanding incremental decoding in fairseq – Telesens
http://www.telesens.co/2019/04/21/understanding-incremental-decoding-in-fairseq/

-----

Tuesday, November 24, 2020

C=QKV

2019/09/30

-----

NTM
1. MN
2. EEMN
3. KVMN
4. PN
5. FSA

-----


// 論文解説 Memory Networks (MemNN) - ディープラーニングブログ

-----


// Attention in NLP – Kate Loginova – Medium

-----


// 論文解説 Attention Is All You Need (Transformer) - ディープラーニングブログ

-----


// Attention? Attention!

-----



// Attention in NLP – Kate Loginova – Medium

-----


-----


// 論文解説 Convolutional Sequence to Sequence Learning (ConvS2S) - ディープラーニングブログ 

-----


-----


References

◎ Papers

// MN
J Weston, S Chopra, and A Bordes. Memory networks. ICLR, 2014.
https://arxiv.org/abs/1410.3916

// EEMN
Sukhbaatar, Sainbayar, Jason Weston, and Rob Fergus. "End-to-end memory networks." Advances in neural information processing systems. 2015.
https://papers.nips.cc/paper/5846-end-to-end-memory-networks.pdf

// KVMN
Miller, Alexander, et al. "Key-value memory networks for directly reading documents." arXiv preprint arXiv:1606.03126 (2016).
https://arxiv.org/pdf/1606.03126.pdf 

// PN
Vinyals, Oriol, Meire Fortunato, and Navdeep Jaitly. "Pointer networks." Advances in Neural Information Processing Systems. 2015.
http://papers.nips.cc/paper/5866-pointer-networks.pdf

// FSA
Daniluk, Michał, et al. "Frustratingly short attention spans in neural language modeling." arXiv preprint arXiv:1702.04521 (2017).
https://arxiv.org/pdf/1702.04521.pdf

-----

◎ English references

# EEMN
# FSA
# 680 claps
Attention in NLP – Kate Loginova – Medium
https://medium.com/@joealato/attention-in-nlp-734c6fa9d983

# KVMN
Summary of paper 'Key-Value Memory Networks for Directly Reading Documents' · GitHub
https://gist.github.com/shagunsodhani/a5e0baa075b4a917c0a69edc575772a8

# PN
Attention? Attention!
https://lilianweng.github.io/lil-log/2018/06/24/attention-attention.html 

-----

◎ Japanese references

# MN
論文解説 Memory Networks (MemNN) - ディープラーニングブログ
http://deeplearning.hatenablog.com/entry/memory_networks

# KVMN
論文解説 Attention Is All You Need (Transformer) - ディープラーニングブログ
http://deeplearning.hatenablog.com/entry/transformer

# QKVC
論文解説 Convolutional Sequence to Sequence Learning (ConvS2S) - ディープラーニングブログ
http://deeplearning.hatenablog.com/entry/convs2s 

NLP(三):Attention

2019/01/18

-----


https://zhuanlan.zhihu.com/p/37601161

# Seq2seq

-----


https://zhuanlan.zhihu.com/p/37601161

# Attention

-----




Fig. 2. An illustration of the attention mechanism (RNNSearch) proposed by [Bahdanau, 2014]. Instead of converting the entire input sequence into a single context vector, we create a separate context vector for each output (target) word. These vectors consist of the weighted sums of encoder’s hidden states.



# Global Attention [1].

-----


https://lilianweng.github.io/lil-log/2018/06/24/attention-attention.html

# Global Attention and Local Attention.

-----



-----



-----

References

◎ Papers

[1] Attention - using GRU
Bahdanau, Dzmitry, Kyunghyun Cho, and Yoshua Bengio. "Neural machine translation by jointly learning to align and translate." arXiv preprint arXiv:1409.0473 (2014).
https://arxiv.org/pdf/1409.0473.pdf

[2] Global Attention - using LSTM
Luong, Minh-Thang, Hieu Pham, and Christopher D. Manning. "Effective approaches to attention-based neural machine translation." arXiv preprint arXiv:1508.04025 (2015).
https://arxiv.org/pdf/1508.04025.pdf 

[3] Visual Attention
Xu, Kelvin, et al. "Show, attend and tell: Neural image caption generation with visual attention." International conference on machine learning. 2015.
http://proceedings.mlr.press/v37/xuc15.pdf

-----




-----

◎ English references

# Overview
[1] Attention? Attention!
https://lilianweng.github.io/lil-log/2018/06/24/attention-attention.html 

# Overview
# 680 claps
[2] Attention in NLP – Kate Loginova – Medium
https://medium.com/@joealato/attention-in-nlp-734c6fa9d983

# 1.4K claps
Attn: Illustrated Attention - Towards Data Science
https://towardsdatascience.com/attn-illustrated-attention-5ec4ad276ee3

# 1.3K claps
A Brief Overview of Attention Mechanism - SyncedReview - Medium
https://medium.com/syncedreview/a-brief-overview-of-attention-mechanism-13c578ba9129

# 799 claps
Intuitive Understanding of Attention Mechanism in Deep Learning
https://towardsdatascience.com/intuitive-understanding-of-attention-mechanism-in-deep-learning-6c9482aecf4f

# 126 claps
Understanding Attention Mechanism - Shashank Yadav - Medium
https://medium.com/@shashank7.iitd/understanding-attention-mechanism-35ff53fc328e

Attention and Memory in Deep Learning and NLP – WildML
http://www.wildml.com/2016/01/attention-and-memory-in-deep-learning-and-nlp/

-----

◎ Japanese references

深層学習による自然言語処理 - RNN, LSTM, ニューラル機械翻訳の理論 - ディープラーニングブログ
http://deeplearning.hatenablog.com/entry/neural_machine_translation_theory

-----

◎ Simplified Chinese references

# Overview
Attention-Mechanisms-paper/Attention-mechanisms-paper.md at master · yuquanle/Attention-Mechanisms-paper · GitHub
https://github.com/yuquanle/Attention-Mechanisms-paper/blob/master/Attention-mechanisms-paper.md

深度学习中的注意力模型(2017版) - 知乎
https://zhuanlan.zhihu.com/p/37601161

自然语言处理中的Attention Model:是什么及为什么 - 张俊林的博客 - CSDN博客
https://blog.csdn.net/malefactor/article/details/50550211



目前主流的attention方法都有哪些? - 知乎
https://www.zhihu.com/question/68482809/answer/264632289

# 110 claps
自然语言处理中注意力机制综述 - 知乎
https://zhuanlan.zhihu.com/p/54491016 

# 14 claps
NLP硬核入门-Seq2Seq和Attention机制 - 知乎
https://zhuanlan.zhihu.com/p/73589030

注意力机制(Attention Mechanism)在自然语言处理中的应用 - Soul Joy Hub - CSDN博客
https://blog.csdn.net/u011239443/article/details/80418489

【NLP】Attention Model(注意力模型)学习总结 - 郭耀华 - 博客园
https://www.cnblogs.com/guoyaohua/p/9429924.html
 
-----

◎ Traditional Chinese references

# 486 claps
[1] Seq2seq pay Attention to Self Attention: Part 1 (中文版)
https://medium.com/@bgg/seq2seq-pay-attention-to-self-attention-part-1-%E4%B8%AD%E6%96%87%E7%89%88-2714bbd92727

-----

◎ Code implementations

Translation with a Sequence to Sequence Network and Attention — PyTorch Tutorials 1.0.0.dev20181228 documentation
https://pytorch.org/tutorials/intermediate/seq2seq_translation_tutorial.html

PyTorch 1.1 Tutorials   テキスト   Sequence to Sequence ネットワークと Attention で翻訳 – PyTorch
http://torch.classcat.com/2019/07/20/pytorch-1-1-tutorials-text-seq2seq-translation/

# 832 claps
Attention in Deep Networks with Keras - Towards Data Science
https://towardsdatascience.com/light-on-math-ml-attention-with-keras-dc8dbc1fad39

Thursday, November 19, 2020

What are the main points of Attention?

2020/10/27

-----


-----

I. Basic learning and paper comprehension (70%).


◎ 1. What can we learn from this paper (what problem does it solve)? (macro view)

◎ 1. What can we learn from this paper (what previously unsolved problem does it solve)? (micro view)



# 机器翻译的技术进化史——机器翻译专题(一) - 云+社区 - 腾讯云

https://cloud.tencent.com/developer/news/16139

-----


Figure 1 of the paper: the Seq2seq architecture [2].

-----

◎ A. Causes of the problem.

◎ 1.a.1: How far had this field progressed before this paper?

-----


Attention 2, published around the same time as Attention 1.

-----

◎ 1.a.1: How far did this field progress shortly after this paper?



GNMT, a multi-layer RNN version of the Attention model.

-----

◎ 1.a.2: What bottlenecks did the most recent research before this paper run into?



# Seq2seq [1].

-----

◎ 1.a.3: What is the root cause of the issue?



https://medium.com/@joealato/attention-in-nlp-734c6fa9d983

-----

◎ B. Solution.

◎ 1.b.1: What approach did the authors take to solve it?



# Attention [6]。

-----

◎ 1.b.2: How do the details solve it?

-----


Fig. 2. An illustration of the attention mechanism (RNNSearch) proposed by [Bahdanau, 2014]. Instead of converting the entire input sequence into a single context vector, we create a separate context vector for each output (target) word. These vectors consist of the weighted sums of encoder’s hidden states.

-----

http://hemingwang.blogspot.com/2019/01/attention.html

-----

◎ 1.b.3: (optional) Did the authors explain their reasoning?



The context vector ci depends on a sequence of annotations (h1, ... hTx ) to which an encoder maps the input sentence. Each annotation hi contains information about the whole input sequence with a strong focus on the parts surrounding the i-th word of the input sequence. We explain in detail how the annotations are computed in the next section.
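
As a reading aid, the computation described in this passage can be written out as follows (a condensed sketch using the paper's notation, not a quotation):

c_i = \sum_{j=1}^{T_x} \alpha_{ij} h_j
\alpha_{ij} = \exp(e_{ij}) / \sum_{k=1}^{T_x} \exp(e_{ik})
e_{ij} = a(s_{i-1}, h_j)

where a(·) is the alignment model, a small feed-forward network that scores how well the input around position j matches the output at position i.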

-----

- Or has it been discussed by later researchers?



Sequence to sequence modeling has been synonymous with recurrent neural network based encoder-decoder architectures (Sutskever et al., 2014; Bahdanau et al., 2014). The encoder RNN processes an input sequence x = (x1, . . . , xm) of m elements and returns state representations z = (z1, . . . , zm). The decoder RNN takes z and generates the output sequence y = (y1, . . . , yn) left to right, one element at a time. To generate output yi+1, the decoder computes a new hidden state hi+1 based on the previous state hi, an embedding gi of the previous target language word yi, as well as a conditional input ci derived from the encoder output z. Based on this generic formulation, various encoder-decoder architectures have been proposed, which differ mainly in the conditional input and the type of RNN.

ConvS2S has such a discussion; the Transformer paper does not.

-----


# Attention [6]。

-----

◎ C. Performance evaluation.

◎ 1.c.1: Comparison of performance results.



-----

◎ 1.c.2: Does this method still have limitations, and what are they?


◎ 1.c.3: (optional) What do the authors expect for future development?

-----



One of challenges left for the future is to better handle unknown, or rare words. This will be required for the model to be more widely used and to match the performance of current state-of-the-art machine translation systems in all contexts.

-----

- What follow-up work have other researchers done?


# GNMT

-----


# GNMT

-----

II. Follow-up development and extended applications (using papers as examples) (30%)

◎ 2. Which vertical (application) domains can it be applied to?



# HAN。

-----

◎ 3. What is the value of this paper (how can it be extended across domains)?


SAT

-----


ST

-----

◎ 4. How could it be improved (follow-up research)?



# ConvS2S [4]。

-----

References

◎ Main papers

[1] LSTM. Cited 39,743 times.

Hochreiter, Sepp, and Jürgen Schmidhuber. "Long short-term memory." Neural computation 9.8 (1997): 1735-1780.

http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.676.4320&rep=rep1&type=pdf


[2] Seq2seq. Cited 12,676 times.

Sutskever, Ilya, Oriol Vinyals, and Quoc V. Le. "Sequence to sequence learning with neural networks." Advances in neural information processing systems. 2014.

http://papers.nips.cc/paper/5346-sequence-to-sequence-learning-with-neural-networks.pdf


[3] Attention 1 - Using GRU. Cited 14,895 times.

Bahdanau, Dzmitry, Kyunghyun Cho, and Yoshua Bengio. "Neural machine translation by jointly learning to align and translate." arXiv preprint arXiv:1409.0473 (2014).

https://arxiv.org/pdf/1409.0473.pdf


[4] ConvS2S. Cited 1,772 times.

Gehring, Jonas, et al. "Convolutional sequence to sequence learning." arXiv preprint arXiv:1705.03122 (2017).

https://arxiv.org/pdf/1705.03122.pdf


[5] Transformer. Cited 13,554 times.

Vaswani, Ashish, et al. "Attention is all you need." Advances in neural information processing systems. 2017.

https://papers.nips.cc/paper/7181-attention-is-all-you-need.pdf

-----

◎ Related papers

-----

[] Attention 2 - Using LSTM. Cited 4,688 times.

Luong, Minh-Thang, Hieu Pham, and Christopher D. Manning. "Effective approaches to attention-based neural machine translation." arXiv preprint arXiv:1508.04025 (2015).

https://arxiv.org/pdf/1508.04025.pdf


[] GNMT. Cited 3,391 times.

Wu, Yonghui, et al. "Google's neural machine translation system: Bridging the gap between human and machine translation." arXiv preprint arXiv:1609.08144 (2016).

https://arxiv.org/pdf/1609.08144.pdf

-----

# HAN. An in-domain application. Cited 2,596 times.

Yang, Zichao, et al. "Hierarchical attention networks for document classification." Proceedings of the 2016 conference of the North American chapter of the association for computational linguistics: human language technologies. 2016.

https://www.aclweb.org/anthology/N16-1174.pdf

-----

Out-of-domain applications

[3] SAT. Visual Attention 1. Cited 6,040 times.

Xu, Kelvin, et al. "Show, attend and tell: Neural image caption generation with visual attention." International conference on machine learning. 2015.

http://proceedings.mlr.press/v37/xuc15.pdf

-----

[] ST. Visual Attention 2. Cited 4,059 times.

Vinyals, Oriol, et al. "Show and tell: A neural image caption generator." Proceedings of the IEEE conference on computer vision and pattern recognition. 2015.

https://openaccess.thecvf.com/content_cvpr_2015/papers/Vinyals_Show_and_Tell_2015_CVPR_paper.pdf

-----

GAP, NIN

CAM

Grad-CAM

Grad-CAM++

Score-CAM


◎ English references

Attention in NLP. In this post, I will describe recent… | by Kate Loginova | Medium

https://medium.com/@joealato/attention-in-nlp-734c6fa9d983

-----

◎ Traditional Chinese references

歐尼克斯實境互動工作室(OmniXRI): 【AI HUB專欄】如何利用可視化工具揭開神經網路背後的祕密(上)

https://omnixri.blogspot.com/2020/10/ai-hub_16.html

歐尼克斯實境互動工作室(OmniXRI): 【AI HUB專欄】如何利用可視化工具揭開神經網路背後的祕密(下)

https://omnixri.blogspot.com/2020/09/ai-hub_20.html


The Star Also Rises: NLP(三):Attention

http://hemingwang.blogspot.com/2019/01/attention.html

-----

[Translation] Attention in NLP

2019/10/02

-----

In this post, I will describe recent work on attention in deep learning models for natural language processing. I’ll start with the attention mechanism as it was introduced by Bahdanau. Then, we will go through self-attention, two-way attention, key-value-predict models and hierarchical attention.

-----

在本文裡,我會說明有關自然語言處理的深度學習模型中有關 attention 最新的進展。 從 Bahdanau 介紹的注意力機制開始。 然後,我們會經歷 self-attention、two-way attention、key-value-predict 模型和 hierarchical attention。

-----

In many tasks, such as machine translation or dialogue generation, we have a sequence of words as an input (e.g., an original text in English) and would like to generate another sequence of words as an output (e.g., a translation to Korean). Neural networks, especially recurrent ones (RNN), are well suited for solving such a task. I assume that you are familiar with RNNs and LSTMs. Otherwise, I recommend to check out an explanation in a famous blog post by Christopher Olah.

-----

在許多任務(例如機器翻譯或對話生成)中,我們有一系列單字作為輸入(例如,英語的原始文本),並且希望生成其他單字序列作為輸出(例如,對韓語的翻譯) 。 神經網路,尤其是遞歸網路(RNN),非常適合解決此類任務。 我假設您熟悉 RNN 和 LSTM。 否則,我建議在 Christopher Olah 的著名博文中查看解釋。

-----

The “sequence-to-sequence” neural network models are widely used for NLP. A popular type of these models is an “encoder-decoder”. There, one part of the network — encoder — encodes the input sequence into a fixed-length context vector. This vector is an internal representation of the text. This context vector is then decoded into the output sequence by the decoder. See an example:

-----

“序列到序列”神經網路模型被廣泛用於 NLP。 這些模型的一種流行類型是“編碼器-解碼器”。 在那裡,網路的一部分 — 編碼器 — 將輸入序列編碼為固定長度的 context vector。 此向量是文本的內部表示。 然後,該 context vector 由解碼器解碼為輸出序列。 看一個例子:

-----


Fig. 1. An encoder-decoder neural network architecture. An example on machine translation: an input sequence is an English sentence “How are you” and the reply of the system would be a Korean translation: “잘 지냈어요”.

-----

Here h denotes hidden states of the encoder and s of the decoder. Tx and Ty are the lengths of the input and output word sequences respectively. q is a function which generates the context vector out of the encoder’s hidden states. It can be, for example, just q({h_i}) = h_T. So, we take the last hidden state as an internal representation of the entire sentence.

-----

在此,h 表示編碼器的隱藏狀態,s 表示解碼器的隱藏狀態。 Tx 和 Ty 分別是輸入和輸出字序列的長度。 q 是從編碼器的隱藏狀態中生成 context vector 的函數。 例如,它可以只是 q({h_i}) = h_T。 因此,我們將最後一個隱藏狀態作為整個句子的內部表示。
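
To make the notation concrete, the plain encoder-decoder described above can be summarized as follows (a sketch using the figure's symbols):

h_t = f_enc(x_t, h_{t-1}),        t = 1, ..., T_x
c = q({h_1, ..., h_{T_x}}),       e.g.  q({h_i}) = h_{T_x}
s_t = f_dec(s_{t-1}, y_{t-1}, c), t = 1, ..., T_y

where f_enc and f_dec are the encoder and decoder RNN cells and c is the single fixed-length context vector.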

-----

You can easily experiment with these models, as most deep learning libraries have general purpose encoder-decoder frameworks. To name a few, see Google’s implementation for Tensorflow and IBM’s one for PyTorch.

-----

您可以輕鬆地使用這些模型進行試驗,因為大多數深度學習庫都具有通用的編碼器-解碼器框架。 僅舉幾例,請參閱 Google 針對 Tensorflow 的實現和 IBM 針對 PyTorch 的實現。

https://github.com/google/seq2seq

https://github.com/IBM/pytorch-seq2seq

-----

However, there is a catch with the common encoder-decoder approach: a neural network compresses all the information of an input source sentence into a fixed-length vector. It has been shown that this leads to a decline in performance when dealing with long sentences. The attention mechanism was introduced by Bahdanau in “Neural Machine Translation by Jointly Learning to Align and Translate” to alleviate this problem.

-----

但是,常見的編碼器/解碼器方法存在一個問題:神經網路將輸入源語句的所有信息壓縮為固定長度的向量。 已經表明,這在處理長句子時導致性能下降。 Bahdanau在“通過共同學習對齊和翻譯的神經機器翻譯”中引入了注意力機制來緩解此問題。

https://arxiv.org/abs/1409.0473

-----

Attention

-----

The basic idea: each time the model predicts an output word, it only uses parts of an input where the most relevant information is concentrated instead of an entire sentence. In other words, it only pays attention to some input words. Let’s investigate how this is implemented.

-----

基本思想:每次模型預測一個輸出詞時,它僅使用輸入中最相關信息集中的部分,而不是整個句子。 換句話說,它僅注意某些輸入單字。 讓我們研究一下這是如何實現的。

-----


Fig. 2. An illustration of the attention mechanism (RNNSearch) proposed by [Bahdanau, 2014]. Instead of converting the entire input sequence into a single context vector, we create a separate context vector for each output (target) word. These vectors consist of the weighted sums of encoder’s hidden states.

-----

Encoder works as usual, and the difference is only on the decoder’s part. As you can see from a picture, the decoder’s hidden state is computed with a context vector, the previous output and the previous hidden state. But now we use not a single context vector c, but a separate context vector c_i for each target word.

-----

編碼器照常工作,不同之處僅在於解碼器。 從圖片中可以看到,解碼器的隱藏狀態是使用 context vector,先前的輸出和先前的隱藏狀態來計算的。 但是現在我們不使用單個 context vector c,而是為每個目標單詞使用單獨的 context vector c_i。

-----

These context vectors are computed as a weighted sum of annotations generated by the encoder. In Bahdanau’s paper, they use a Bidirectional LSTM, so these annotations are concatenations of hidden states in forward and backward directions.

-----

這些 context vectors 被視為是編碼器生成的註釋的加權和。 在 Bahdanau 的論文中,他們使用了雙向 LSTM,因此這些註釋是向前和向後隱藏狀態的串聯。

-----

The weight of each annotation is computed by an alignment model which scores how well the inputs and the output match. An alignment model is a feedforward neural network, for instance. In general, it can be any other model as well.

-----

每個註釋的權重由對齊模型計算,該模型對輸入和輸出的匹配程度進行評分。 對齊模型例如是前饋神經網路。 通常,它也可以是任何其他模型。

-----

As a result, the alphas — the weights of hidden states when computing a context vector — show how important a given annotation is in deciding the next state and generating the output word. These are the attention scores.

-----

結果,alpha(即計算 context vector 時隱藏狀態的權重)顯示了給定註釋在決定下一個狀態並生成輸出字時的重要性。 這些是 attention scores。
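
A minimal NumPy sketch of one decoder step of this additive (Bahdanau-style) attention; the weight names W_a, U_a, v_a and all sizes here are illustrative, not taken from any particular implementation:

import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def attention_step(s_prev, H, W_a, U_a, v_a):
    # s_prev: previous decoder hidden state, shape (d_dec,)
    # H: encoder annotations h_1..h_Tx stacked as rows, shape (Tx, d_enc)
    scores = np.tanh(s_prev @ W_a + H @ U_a) @ v_a  # alignment scores, shape (Tx,)
    alpha = softmax(scores)                         # attention weights (the "alphas")
    c = alpha @ H                                   # context vector, shape (d_enc,)
    return c, alpha

# Toy usage with random annotations and weights.
rng = np.random.default_rng(0)
Tx, d_enc, d_dec, d_att = 5, 8, 8, 16
H = rng.normal(size=(Tx, d_enc))
s_prev = rng.normal(size=(d_dec,))
W_a = rng.normal(size=(d_dec, d_att))
U_a = rng.normal(size=(d_enc, d_att))
v_a = rng.normal(size=(d_att,))
c, alpha = attention_step(s_prev, H, W_a, U_a, v_a)

The decoder would then combine c with its previous state and the previously generated word to produce the next hidden state, as in the figure.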

-----

If you want to read a bit more about the intuition behind this, visit WildML’s blog post. You can also enjoy an interactive visualization in the Distill blog. In the meantime, let’s move on to a bit more advanced attention mechanisms.

-----

如果你想進一步了解其背後的直覺,請訪問 WildML 的博文。 你還可以在 Distill 博客中享受交互式可視化效果。 同時,讓我們繼續講一些更高級的注意力機制。

http://www.wildml.com/2016/01/attention-and-memory-in-deep-learning-and-nlp/

https://distill.pub/2016/augmented-rnns/

-----

Memory networks

-----

One group of attention mechanisms repeats the computation of an attention vector between the query and the context through multiple layers. It is referred to as multi-hop. They are mainly variants of end-to-end memory networks, which we will discuss now.

-----

一組注意機制通過多層重複查詢和上下文之間的 attention vector 的計算。 這被稱為 multi-hop。 它們主要是端到端內存網路的變體,我們現在討論。

-----

[Sukhbaatar, 2015] argues that the attention mechanism implemented by Bahdanau can be seen as a form of memory. They extend this mechanism to a multi-hop setting. It means that the network reads the same input sequence multiple times before producing an output, and updates the memory contents at each step. Another modification is that the model works with multiple source sentences instead of a single one.

-----

[Sukhbaatar,2015] 認為 Bahdanau 實施的注意力機制可以被視為一種記憶形式。 他們將此機制擴展到 multi-hop 設置。 這意味著網路在產生輸出之前會多次讀取相同的輸入序列,並在每個步驟中更新內存內容。 另一個修改是該模型可以處理多個源語句,而不是單個源語句。

-----



Fig. 3. End-to-End Memory Networks.

-----

Let’s take a look at the inner workings. First, let me describe the single layer case (a). It implements a single memory hop operation. The entire input set of sentences is converted into memory vectors m. The query q is also embedded to obtain an internal state u. We compute the match between u and each memory by taking the inner product followed by a softmax. This way we obtain a probability vector p over the inputs (this is the attention part). Each input also has a corresponding output vector. We use the weights p to weigh a sum of these output vectors. This sum is a response vector o from the memory. Now we have an output vector o and the input embedding u. We sum them, multiply by a weight matrix W and apply a softmax to predict a label.

-----

讓我們看一下內部運作方式。 首先,讓我描述一下單層情況(a)。 它實現了單個記憶體跳躍操作。 句子的整個輸入集被轉換為 memory vectors m。 查詢 q 也被嵌入以獲得內部狀態 u。 我們通過取內積和 softmax 來計算 u 和每個記憶體之間的匹配。 這樣,我們就獲得了輸入的概率向量 p(這是 attention 的部分)。 每個輸入還具有一個對應的輸出向量。 我們使用權重 p 加權這些輸出向量的總和。 該總和就是來自記憶體的響應向量 o。 現在我們有一個輸出向量 o 和輸入嵌入 u。 我們將它們相加,乘以權重矩陣 W,並用 softmax 來預測標籤。

-----

Now, we can extend the model to handle K hop operations (b). The memory layers are stacked so that the input to layers k + 1 is the sum of the output and the input from layer k. Each layer has its own embedding matrices for the inputs.

-----

現在,我們可以擴展模型以處理 K 跳操作(b)。 堆疊記憶層,使得第 k + 1 層的輸入是第 k 層的輸出與輸入之和。 每層都有自己的輸入嵌入矩陣。
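
A rough NumPy sketch of the memory read just described, with K hops implemented by feeding u + o back in as the next internal state. For brevity the embedding matrices A, C, B are shared across hops (the paper also discusses per-layer and adjacent-tying schemes), and all sizes are illustrative:

import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def memory_network(X, q, A, C, B, W, K=3):
    # X: bag-of-words sentence matrix, shape (n_sentences, vocab)
    # q: bag-of-words query vector, shape (vocab,)
    # A, C: input/output memory embeddings, shape (vocab, d)
    # B: query embedding, shape (vocab, d); W: output weights, shape (d, n_labels)
    m = X @ A           # memory vectors m_i
    c = X @ C           # output vectors c_i
    u = q @ B           # internal state of the query
    for _ in range(K):  # K = 1 reduces to the single-layer case (a)
        p = softmax(m @ u)  # attention over memories
        o = p @ c           # response vector from memory
        u = u + o           # input to the next hop
    return softmax(u @ W)   # predicted label distribution

# Toy usage.
rng = np.random.default_rng(0)
n, vocab, d, labels = 4, 20, 8, 5
X = rng.integers(0, 2, size=(n, vocab)).astype(float)
q = rng.integers(0, 2, size=vocab).astype(float)
A, C, B = (rng.normal(size=(vocab, d)) for _ in range(3))
W = rng.normal(size=(d, labels))
answer = memory_network(X, q, A, C, B, W)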

-----

When the input and output embeddings are the same across different layers, the memory is identical to the attention mechanism of Bahdanau. The difference is that it makes multiple hops over the memory (because it tries to integrate information from multiple sentences).

-----

當不同層的輸入和輸出嵌入相同時,內存與 Bahdanau 的注意力機制相同。 不同之處在於,它在內存上進行了多次跳轉(因為它試圖集成來自多個句子的信息)。

-----

A fine-grained extension of this method is an Attentive Reader introduced by [Hermann, 2015].

-----

[Hermann,2015] 引入的 Attentive Reader 是該方法的細粒度擴展。

https://arxiv.org/pdf/1506.03340.pdf

-----

Variations of attention

-----

[Luong, 2015] introduces the difference between global and local attention. The idea of a global attention is to use all the hidden states of the encoder when computing each context vector. The downside of a global attention model is that it has to attend to all words on the source side for each target word, which is computationally costly. To overcome this, the local attention first chooses a position in the source sentence. This position will determine a window of words that the model attends to. The authors also experimented with different alignment functions and simplified the computation path compared to Bahdanau’s work.

-----

[Luong,2015] 介紹了 global attention 與 local attention 之間的差異。 Global attention 的想法是在計算每個 context vector 時使用編碼器的所有隱藏狀態。 Global attention 模型的缺點是,對於每個目標單字,它都必須注意來源句子端的所有單字,這在計算上是昂貴的。 為了克服這個問題,local attention 首先在來源句子中選擇一個位置。 該位置決定了模型會關注的單字窗口。 與 Bahdanau 的工作相比,作者還嘗試了不同的對齊函數,並簡化了計算路徑。
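
For reference, the alignment (score) functions explored in the Luong paper, and the position prediction used by its local attention, look roughly as follows (a sketch reproduced from the paper's description; treat the exact forms as assumptions):

score(h_t, \bar{h}_s) = h_t^T \bar{h}_s                      (dot)
score(h_t, \bar{h}_s) = h_t^T W_a \bar{h}_s                  (general)
score(h_t, \bar{h}_s) = v_a^T \tanh(W_a [h_t ; \bar{h}_s])   (concat)

Local attention first predicts an aligned source position p_t = S \cdot \sigma(v_p^T \tanh(W_p h_t)) and restricts attention to the window [p_t - D, p_t + D], optionally reweighting scores with a Gaussian centered at p_t.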

https://arxiv.org/abs/1508.04025

-----

Attention Sum Reader [Kadlec, 2016] uses attention as a pointer over discrete tokens in the text. The task is to select an answer to a given question from the context paragraph. The difference with other methods is that the model selects the answer from the context directly using the computed attention instead of using the attention scores to weigh the sum of hidden vectors.

-----

Attention Sum Reader [Kadlec,2016] 使用注意力作為文本中離散標記的指針。 任務是從上下文段落中選擇給定問題的答案。 與其他方法的不同之處在於,該模型直接使用計算出的注意力從上下文中選擇答案,而不是使用 attention scores 來權衡隱藏向量的總和。

https://arxiv.org/abs/1603.01547

-----


Fig. 4. Attention Sum Reader.

-----

As an example, let us consider the question-context pair. Let the context be “A UFO was observed above our city in January and again in March.” and the question be “An observer has spotted a UFO in … .” January and March are equally good candidates, so the previous models will assign equal attention scores. They would then compute a vector between the representations of these two words and propose the word with the closest word embedding as the answer. At the same time, Attention Sum Reader would correctly propose January or March, because it chooses words directly from the passage.

-----

例如,讓我們考慮 question-context pair。 假設 context 為“在一月和三月在我們城市上方觀測到一個不明飛行物”,而問題為“觀察員在…發現了一個不明飛行物。”一月和三月是同樣好的候選者,因此以前的模型將分配相等 attention scores。 然後,他們將計算這兩個詞的表示之間的向量,並提出嵌入詞最接近的詞作為答案。 同時,Attention Sum Reader 會正確建議一月或三月,因為它直接從段落中選擇單字。
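
The pointer-style aggregation this example relies on can be written in one line (a sketch of the attention-sum idea):

P(w | q, d) = \sum_{i \in I(w, d)} \alpha_i

where I(w, d) is the set of positions at which candidate word w occurs in document d and \alpha_i are the attention weights over document tokens; the predicted answer is the candidate with the largest summed weight.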

-----

Two-way Attention & Coattention

-----

As you might have noticed, in the previous model we pay attention from source to target. It makes sense in translation, but what about other fields? For example, consider textual entailment. We have a premise “If you help the needy, God will reward you” and a hypothesis “Giving money to a poor man has good consequences”. Our task is to understand whether the premise entails the hypothesis (in this case, it does). It would be useful to pay attention not only from the hypothesis to the text but also the other way around.

-----

您可能已經注意到,在先前的模型中,我們是從來源對目標進行 attention。 這在翻譯中很合理,但是其他領域呢? 例如,考慮 textual entailment。 我們有一個前提:“如果您幫助有需要的人,上帝會獎勵您”,還有一個假設“將錢捐給窮人會帶來好的後果”。 我們的任務是判斷前提是否蘊含假設(在這個例子中,確實蘊含)。 不僅從假設到文本進行 attention,反過來也一樣,這會很有用。

-----

This brings the concept of two-way attention [Rocktäschel, 2015]. The idea is to use the same model to attend over the premise, as well as over the hypothesis. In the simplest form, you can simply swap the two sequences. This produces two attended representations which can be concatenated.

-----

這帶來了 two-way attention 的概念 [Rocktäschel,2015]。 想法是使用同一個模型分別對前提和假設進行 attention。 以最簡單的形式,你可以簡單地交換兩個序列。 這會產生兩個可以串接起來的 attended 表示。

https://arxiv.org/abs/1509.06664

-----

However, such a model will not let you emphasize more important matching results. For instance, alignment between stop words is less important than between the content words. In addition, the model still uses a single vector to represent the premise. To overcome these limitations, [Wang, Jiang, 2016] developed MatchLSTM. To deal with the importance of the matching, they add a special LSTM that will remember important matching results, while forgetting the others. This additional LSTM is also used to increase the granularity level. We will now multiply attention weights with each hidden state. It performed well in question answering and textual entailment tasks.

-----

但是,這種模型不會讓您強調更重要的匹配結果。 例如,stop words 之間的對齊不如 content words 之間的對齊重要。 此外,模型仍使用單個向量表示前提。 為了克服這些限制, [Wang, Jiang, 2016] 開發了 MatchLSTM。 為了處理匹配的重要性,他們添加了一個特殊的 LSTM,它將記住重要的匹配結果,而忽略其他結果。 此附加的 LSTM 也用於增加 granularity level。 現在,我們將注意力權重與每個隱藏狀態相乘。 它在 question answering 和 textual entailment 任務方面表現出色。

https://arxiv.org/abs/1512.08849

-----


Fig. 5. Top: model from [Rocktäschel, 2015]. Bottom: MatchLSTM from [Wang, Jiang, 2016]. h vectors in the first model are weighted versions of the premise only, while in the second model they “represent the matching between the premise and the hypothesis up to position k.”

-----

The question answering task gave rise to even more advanced ways to combine both sides. Bahdanau’s model, that we have seen in the beginning, uses a summary vector of the query to attend to the context. In contrast to it, the coattention is computed as an alignment matrix on all pairs of context and query words. As an example of this approach, let’s examine Dynamic Coattention Networks [Xiong, 2016].

-----

Question answering 任務帶來了將兩邊結合起來的更高級的方法。 我們在一開始看到的 Bahdanau 模型,是使用查詢的摘要向量來 attend context。 與此形成對比的是,coattention 是在所有 context 詞與查詢詞的配對上計算出的對齊矩陣。 作為這種方法的一個例子,讓我們研究一下 Dynamic Coattention Networks [Xiong, 2016]。

https://arxiv.org/abs/1611.01604

-----


Fig. 6. Dynamic Coattention Networks [Xiong, 2016].

-----

Let’s walk through what is going on in the picture. First, we compute the affinity matrix of all pairs of document and question words. Then we get the attention weights AQ across the document for each word in the question and AD — the other way around. Next, the summary or attention context of the document in light of each word in the question is computed. In the same way, we can compute it for the question in light of each word in the document. Finally, we compute the summaries of the previous attention contexts given each word in the document. The resulting vectors are concatenated into a co-dependent representation of the question and the document. This is called the coattention context.

-----

讓我們逐步瀏覽圖片中發生的事情。 首先,我們計算所有文檔詞與問題詞配對的 affinity 矩陣。 然後,我們為問題中的每個單字取得跨整份文檔的注意力權重 AQ;AD 則是反過來。 接下來,根據問題中的每個單字來計算文檔的摘要(attention context)。 以相同的方式,我們可以根據文檔中的每個單字為問題計算它。 最後,我們計算給定文檔中每個單字的先前 attention context 的摘要。 結果向量被連接成問題和文檔相互依賴的表示。 這稱為 coattention context。
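
In compact form, the coattention computation walked through here is roughly as follows (following the DCN paper's notation, with the sentinel vectors and the final fusion BiLSTM omitted):

L = D^T Q              (affinity matrix over all document/question word pairs)
A^Q = softmax(L)       (attention weights across the document for each question word)
A^D = softmax(L^T)     (attention weights across the question for each document word)
C^Q = D A^Q            (summaries of the document in light of each question word)
C^D = [Q ; C^Q] A^D    (coattention context: the co-dependent representation)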

-----

Self-attention

-----



Fig. 7. Syntactic patterns learnt by the Transformer [Vaswani, 2017] using solely self-attention.

-----



-----




-----

References

# Overview
# 749 claps 
Attention in NLP – Kate Loginova – Medium
https://medium.com/@joealato/attention-in-nlp-734c6fa9d983