2019/10/02
-----
In this post, I will describe recent work on attention in deep learning models for natural language processing. I’ll start with the attention mechanism as it was introduced by Bahdanau. Then, we will go through self-attention, two-way attention, key-value-predict models and hierarchical attention.
-----
In many tasks, such as machine translation or dialogue generation, we have a sequence of words as an input (e.g., an original text in English) and would like to generate another sequence of words as an output (e.g., a translation to Korean). Neural networks, especially recurrent ones (RNNs), are well suited for solving such tasks. I assume that you are familiar with RNNs and LSTMs. Otherwise, I recommend checking out the explanation in a famous blog post by Christopher Olah.
-----
The “sequence-to-sequence” neural network models are widely used for NLP. A popular type of these models is the “encoder-decoder”. There, one part of the network, the encoder, encodes the input sequence into a fixed-length context vector. This vector is an internal representation of the text. The context vector is then decoded into the output sequence by the decoder. See an example:
-----
Fig. 1. An encoder-decoder neural network architecture. An example of machine translation: the input sequence is the English sentence “How are you” and the reply of the system would be a Korean translation: “잘 지냈어요”.
-----
Here h denotes the hidden states of the encoder and s those of the decoder. Tx and Ty are the lengths of the input and output word sequences, respectively. q is a function that generates the context vector out of the encoder’s hidden states. It can be, for example, simply q({h_i}) = h_Tx. That is, we take the last hidden state as an internal representation of the entire sentence.
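To make the notation concrete, here is a minimal NumPy sketch of a plain RNN encoder with q({h_i}) = h_Tx. The weight names (W_x, W_h, b) and the hand-rolled tanh cell are my own illustration, not code from any of the frameworks mentioned below.

import numpy as np

def rnn_encoder(embedded_words, W_x, W_h, b):
    # A plain tanh RNN cell applied to a sequence of word embeddings;
    # returns all hidden states h_1 ... h_Tx.
    h = np.zeros(W_h.shape[0])
    states = []
    for x in embedded_words:
        h = np.tanh(W_x @ x + W_h @ h + b)
        states.append(h)
    return np.stack(states)            # shape (Tx, hidden_size)

def q_last_state(states):
    # q({h_i}) = h_Tx: the last hidden state is the context vector
    return states[-1]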
-----
You can easily experiment with these models, as most deep learning libraries have general-purpose encoder-decoder frameworks. To name a few, see Google’s implementation for TensorFlow and IBM’s for PyTorch.
https://github.com/google/seq2seq
https://github.com/IBM/pytorch-seq2seq
-----
However, there is a catch with the common encoder-decoder approach: a neural network compresses all the information of an input source sentence into a fixed-length vector. It has been shown that this leads to a decline in performance when dealing with long sentences. The attention mechanism was introduced by Bahdanau in “Neural Machine Translation by Jointly Learning to Align and Translate” to alleviate this problem.
https://arxiv.org/abs/1409.0473
-----
Attention
-----
The basic idea: each time the model predicts an output word, it only uses parts of an input where the most relevant information is concentrated instead of an entire sentence. In other words, it only pays attention to some input words. Let’s investigate how this is implemented.
-----
Fig. 2. An illustration of the attention mechanism (RNNSearch) proposed by [Bahdanau, 2014]. Instead of converting the entire input sequence into a single context vector, we create a separate context vector for each output (target) word. Each of these vectors is a weighted sum of the encoder’s hidden states.
-----
The encoder works as usual; the difference is only on the decoder’s side. As you can see from the picture, the decoder’s hidden state is computed from a context vector, the previous output, and the previous hidden state. But now, instead of a single context vector c, we use a separate context vector c_i for each target word.
-----
These context vectors are computed as weighted sums of annotations generated by the encoder. In Bahdanau’s paper, they use a bidirectional RNN, so each annotation is the concatenation of the forward and backward hidden states.
-----
The weight of each annotation is computed by an alignment model, which scores how well the inputs and the output match. The alignment model is, for instance, a feedforward neural network; in general, it can be any other model as well.
-----
As a result, the alphas — the weights of hidden states when computing a context vector — show how important a given annotation is in deciding the next state and generating the output word. These are the attention scores.
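A minimal NumPy sketch of this alignment-and-weighting step, following the paper’s additive scoring e_ij = v_a^T tanh(W_a s_{i-1} + U_a h_j). The parameter names W_a, U_a, v_a mirror the paper; everything else is a simplified illustration.

import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def additive_attention(s_prev, annotations, W_a, U_a, v_a):
    # Score every annotation h_j against the previous decoder state s_{i-1},
    # normalize the scores into alphas, and build the context vector c_i.
    scores = np.array([v_a @ np.tanh(W_a @ s_prev + U_a @ h_j)
                       for h_j in annotations])   # e_ij for j = 1..Tx
    alphas = softmax(scores)                      # attention scores
    context = alphas @ annotations                # c_i = sum_j alpha_ij * h_j
    return context, alphas

The decoder would then compute its next hidden state from this c_i, the previous output word, and its previous hidden state, as described above.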
-----
If you want to read a bit more about the intuition behind this, visit WildML’s blog post. You can also enjoy an interactive visualization in the Distill blog. In the meantime, let’s move on to a bit more advanced attention mechanisms.
http://www.wildml.com/2016/01/attention-and-memory-in-deep-learning-and-nlp/
https://distill.pub/2016/augmented-rnns/
-----
Memory networks
-----
One group of attention mechanisms repeats the computation of an attention vector between the query and the context through multiple layers. This is referred to as multi-hop attention. These models are mainly variants of end-to-end memory networks, which we will discuss now.
-----
[Sukhbaatar, 2015] argues that the attention mechanism implemented by Bahdanau can be seen as a form of memory. They extend this mechanism to a multi-hop setting. It means that the network reads the same input sequence multiple times before producing an output, and updates the memory contents at each step. Another modification is that the model works with multiple source sentences instead of a single one.
-----
Fig. 3. End-to-End Memory Networks.
-----
Let’s take a look at the inner workings. First, let me describe the single-layer case (a), which implements a single memory hop operation. The entire input set of sentences is converted into memory vectors m. The query q is also embedded to obtain an internal state u. We compute the match between u and each memory by taking the inner product followed by a softmax. This way we obtain a probability vector p over the inputs (this is the attention part). Each input also has a corresponding output vector. We use the weights p to compute a weighted sum of these output vectors. This sum is the response vector o from the memory. Now we have the output vector o and the input embedding u. We sum them, multiply by a weight matrix W, and apply a softmax to predict a label.
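In code, the single-hop case can be sketched roughly as follows (NumPy; I assume bag-of-words sentence and query vectors, with embedding matrices A, B, C and output weights W as in the paper):

import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def memory_hop(sentence_vecs, query_vec, A, B, C, W):
    # sentence_vecs: bag-of-words vectors of the input sentences, shape (n, V)
    # query_vec:     bag-of-words vector of the question, shape (V,)
    m = sentence_vecs @ A.T          # memory vectors   m_i = A x_i
    c = sentence_vecs @ C.T          # output vectors   c_i = C x_i
    u = B @ query_vec                # internal state   u = B q
    p = softmax(m @ u)               # attention over the memories
    o = p @ c                        # response vector  o = sum_i p_i c_i
    return softmax(W @ (o + u))      # predicted answer distribution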
-----
Now, we can extend the model to handle K hop operations (b). The memory layers are stacked so that the input to layer k + 1 is the sum of the output and the input from layer k. Each layer has its own embedding matrices for the inputs.
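A rough sketch of the stacking, with u^{k+1} = u^k + o^k and per-layer embedding matrices (how the layer weights are passed in here is my own choice):

import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def multi_hop(sentence_vecs, query_vec, layer_weights, B, W):
    # layer_weights: a list of (A_k, C_k) embedding matrices, one pair per hop.
    u = B @ query_vec
    for A_k, C_k in layer_weights:
        m = sentence_vecs @ A_k.T    # memories for this hop
        c = sentence_vecs @ C_k.T    # output vectors for this hop
        p = softmax(m @ u)           # attention over the memories
        o = p @ c                    # response from the memory
        u = u + o                    # input to the next layer
    return softmax(W @ u)            # answer prediction after the last hop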
-----
When the input and output embeddings are the same across different layers, the memory is identical to the attention mechanism of Bahdanau. The difference is that it makes multiple hops over the memory (because it tries to integrate information from multiple sentences).
-----
A fine-grained extension of this method is the Attentive Reader introduced by [Hermann, 2015].
https://arxiv.org/pdf/1506.03340.pdf
-----
Variations of attention
-----
[Luong, 2015] introduces the distinction between global and local attention. The idea of global attention is to use all the hidden states of the encoder when computing each context vector. The downside of a global attention model is that it has to attend to all words on the source side for each target word, which is computationally costly. To overcome this, local attention first chooses a position in the source sentence. This position determines a window of words that the model attends to. The authors also experimented with different alignment functions and simplified the computation path compared to Bahdanau’s work.
https://arxiv.org/abs/1508.04025
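As a rough illustration, here is a simplified NumPy sketch of global attention with Luong’s dot-product score, plus the windowing idea behind local attention (the paper’s local-p variant also predicts the window position and applies a Gaussian weighting, which I omit here):

import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def luong_attention(target_state, source_states, window=None, center=None):
    # Global attention scores every source state with a dot product;
    # local attention (window and center given) restricts this to a window
    # of 2*window + 1 source positions around `center`.
    if window is not None:
        lo = max(0, center - window)
        hi = min(len(source_states), center + window + 1)
        source_states = source_states[lo:hi]
    scores = source_states @ target_state    # dot-product alignment scores
    alphas = softmax(scores)
    return alphas @ source_states            # context vector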
-----
Attention Sum Reader [Kadlec, 2016] uses attention as a pointer over discrete tokens in the text. The task is to select an answer to a given question from the context paragraph. The difference from other methods is that the model selects the answer directly from the context using the computed attention, instead of using the attention scores to compute a weighted sum of hidden vectors.
https://arxiv.org/abs/1603.01547
-----
Fig. 4. Attention Sum Reader.
-----
As an example, let us consider a question-context pair. Let the context be “A UFO was observed above our city in January and again in March.” and the question be “An observer has spotted a UFO in … .” January and March are equally good candidates, so the previous models would assign them equal attention scores. They would then blend the representations of these two words and propose the word with the closest embedding as the answer. In contrast, Attention Sum Reader would correctly propose January or March, because it chooses words directly from the passage.
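A minimal sketch of that pointer-style selection (NumPy; candidate answers are simply the tokens of the passage, and the query is summarized by a single vector, which is a simplification of the paper’s GRU encoders):

import numpy as np
from collections import defaultdict

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def attention_sum_answer(context_tokens, context_states, query_vec):
    # One attention weight per token position in the passage.
    p = softmax(context_states @ query_vec)
    # Sum the probability mass over repeated occurrences of the same token
    # (the "attention sum"), then pick the highest-scoring word.
    totals = defaultdict(float)
    for token, prob in zip(context_tokens, p):
        totals[token] += prob
    return max(totals, key=totals.get)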
-----
Two-way Attention & Coattention
-----
As you might have noticed, in the previous models we pay attention in one direction only, from one sequence to the other. This makes sense in translation, but what about other tasks? For example, consider textual entailment. We have a premise, “If you help the needy, God will reward you”, and a hypothesis, “Giving money to a poor man has good consequences”. Our task is to decide whether the premise entails the hypothesis (in this case, it does). It would be useful to pay attention not only from the hypothesis to the premise but also the other way around.
-----
This brings us to the concept of two-way attention [Rocktäschel, 2015]. The idea is to use the same model to attend over the premise as well as over the hypothesis. In the simplest form, you can simply swap the two sequences. This produces two attended representations, which can then be concatenated.
https://arxiv.org/abs/1509.06664
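In its simplest form, the idea can be sketched like this (NumPy; using the last hidden state of the other sequence as the query is my simplification of the conditional encoding in the paper):

import numpy as np

def attend(query_vec, states):
    # Attention over `states`, conditioned on a single query vector.
    scores = states @ query_vec
    e = np.exp(scores - scores.max())
    alphas = e / e.sum()
    return alphas @ states

def two_way_representation(premise_states, hypothesis_states):
    # Run the same attention in both directions and concatenate the results.
    attended_premise = attend(hypothesis_states[-1], premise_states)
    attended_hypothesis = attend(premise_states[-1], hypothesis_states)
    return np.concatenate([attended_premise, attended_hypothesis])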
-----
However, such a model will not let you emphasize the more important matching results. For instance, alignment between stop words is less important than alignment between content words. In addition, the model still uses a single vector to represent the premise. To overcome these limitations, [Wang, Jiang, 2016] developed MatchLSTM. To handle the importance of matching, they add a special LSTM that remembers important matching results while forgetting the others. This additional LSTM is also used to increase the level of granularity: the attention weights are now multiplied with each hidden state. The model performed well on question answering and textual entailment tasks.
https://arxiv.org/abs/1512.08849
-----
Fig. 5. Top: model from [Rocktäschel, 2015]. Bottom: MatchLSTM from [Wang, Jiang, 2016]. h vectors in the first model are weighted versions of the premise only, while in the second model they “represent the matching between the premise and the hypothesis up to position k.”
-----
The question answering task gave rise to even more advanced ways of combining both sides. Bahdanau’s model, which we saw in the beginning, uses a summary vector of the query to attend to the context. In contrast, coattention is computed as an alignment matrix over all pairs of context and query words. As an example of this approach, let’s examine Dynamic Coattention Networks [Xiong, 2016].
https://arxiv.org/abs/1611.01604
-----
Fig. 6. Dynamic Coattention Networks [Xiong, 2016].
-----
Let’s walk through what is going on in the picture. First, we compute the affinity matrix of all pairs of document and question words. From it we get the attention weights A^Q across the document for each word in the question, and A^D the other way around. Next, the summary, or attention context, of the document in light of each word in the question is computed. In the same way, we compute it for the question in light of each word in the document. Finally, we compute the summaries of the previous attention contexts given each word in the document. The resulting vectors are concatenated into a co-dependent representation of the question and the document. This is called the coattention context.
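The same steps in a compact NumPy sketch (rows are words; I drop the sentinel vectors and the final BiLSTM fusion from the paper):

import numpy as np

def softmax_rows(x):
    e = np.exp(x - x.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def coattention(D, Q):
    # D: document states, shape (m, d); Q: question states, shape (n, d).
    L = D @ Q.T                      # affinity matrix over all word pairs, (m, n)
    A_Q = softmax_rows(L.T)          # attention over the document per question word, (n, m)
    A_D = softmax_rows(L)            # attention over the question per document word, (m, n)
    C_Q = A_Q @ D                    # document summaries in light of each question word, (n, d)
    # Summaries of [question; previous attention contexts] for each document word:
    C_D = A_D @ np.concatenate([Q, C_Q], axis=1)   # coattention context, (m, 2d)
    return C_D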
-----
Self-attention
-----
Fig. 7. Syntactic patterns learnt by the Transformer [Vaswani, 2017] using solely self-attention.
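As a minimal illustration, the scaled dot-product self-attention at the core of the Transformer lets every position of a sequence attend to every other position of the same sequence. The sketch below shows a single head with no masking; W_q, W_k, W_v are learned projection matrices, and the names are my own.

import numpy as np

def softmax_rows(x):
    e = np.exp(x - x.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def self_attention(X, W_q, W_k, W_v):
    # X: sequence of token representations, shape (T, d).
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    scores = Q @ K.T / np.sqrt(K.shape[1])   # scaled dot-product scores, (T, T)
    return softmax_rows(scores) @ V          # each position is a weighted sum of values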
-----
References
# Survey
Attention in NLP – Kate Loginova – Medium
https://medium.com/@joealato/attention-in-nlp-734c6fa9d983