The Star Also Rises: 06C

06C_Word2vec

2020/07/18

-----

1. Word2vec Family

Fig. Word2vec (image source).

-----

2. Outline

https://hemingwang.blogspot.com/2020/07/06cword2vec.html

06C_Word2vec

◎ Word2vec v1: CBOW and Skip-gram.
◎ Word2vec v2: Hierarchical Softmax and Negative Sampling.
◎ Word2vec v3: Simplified Word2vec v1 and v2.

◎ LSA: Co-occurrence Matrix + SVD.
◎ GloVe: Word2vec + LSA.
◎ fastText v1: CBOW and w(t) to Label.
◎ fastText v2: Skip-gram and Word to Subword.
◎ WordRank: Word Embedding to Word Ranking.

-----

3. Word2vec

https://medium.com/@tengyuanchang/%E8%AE%93%E9%9B%BB%E8%85%A6%E8%81%BD%E6%87%82%E4%BA%BA%E8%A9%B1-%E7%90%86%E8%A7%A3-nlp-%E9%87%8D%E8%A6%81%E6%8A%80%E8%A1%93-word2vec-%E7%9A%84-skip-gram-%E6%A8%A1%E5%9E%8B-73d0239ad698

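Explanation:

Word2vec is the representative word-embedding algorithm. It comprises two models: CBOW (continuous bag-of-words) and Skip-gram. CBOW uses the surrounding words to predict the word in the middle, much like a cloze (fill-in-the-blank) test in an English exam. Skip-gram does the reverse and uses the middle word to predict the surrounding words. Both approaches yield word vectors.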

-----

4. King - Man + Woman = Queen

https://arxiv.org/abs/1708.02709

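Explanation:

Once an English word has been converted from its one-hot encoding into a dense vector, we can do arithmetic on the vectors. The classic example is King - Man + Woman = Queen. From this "equation" we can see that, roughly speaking, one direction in the vector space encodes gender and another encodes social rank.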
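The vector arithmetic can be sketched as follows. This is only an illustration: the 2-D vectors are hand-made toy values, not trained embeddings, and the names `toy_vectors`, `cosine`, and `analogy` are my own. Real word vectors have hundreds of dimensions and the analogy only holds approximately.

```python
import math

# Hand-made 2-D toy vectors (NOT trained embeddings).
# Axis 0 loosely stands for "royalty", axis 1 for "gender".
toy_vectors = {
    "king":  [0.9,  0.8],
    "queen": [0.9, -0.8],
    "man":   [0.1,  0.8],
    "woman": [0.1, -0.8],
    "apple": [-0.7, 0.1],
}

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def analogy(a, b, c, vectors):
    """Return the word closest to vec(a) - vec(b) + vec(c), excluding the inputs."""
    target = [x - y + z for x, y, z in zip(vectors[a], vectors[b], vectors[c])]
    candidates = [w for w in vectors if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(target, vectors[w]))

result = analogy("king", "man", "woman", toy_vectors)
print(result)
```

Excluding the three input words from the candidates is the usual convention, because otherwise the nearest neighbor of the result is often one of the inputs.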

-----

5. Regression Model

https://www.deeplearningbook.org/

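// Page 119.

Explanation:

Before moving on to Word2vec, let us first review regression analysis. Why? Because Word2vec can look complicated on first contact, so we start from a simple example that everyone already knows. The LeNet model in the next figure is, in effect, a very complicated regression model, and Word2vec in turn is a much simpler neural network than LeNet, with no convolutional layers. Regression is the model people know best, and LeNet is perhaps the second best known.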

-----

6. CNN Model

http://hemingwang.blogspot.com/2018/02/deep-learninglenet-bp.html

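Explanation:

LeNet is perhaps the most familiar CNN model in deep learning. It is considerably more complex than Word2vec, but since most readers have already mastered LeNet before studying Word2vec, we use LeNet as the stepping stone. In short, Word2vec has only three layers: an input layer, a hidden layer, and an output layer. The connections from the input layer to the hidden layer, before any activation function is applied, can be viewed as a matrix multiplication. Because the input is a one-hot encoding, each row of that matrix becomes the word vector of one word.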
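To see why a one-hot input turns the weight matrix into a lookup table, here is a minimal sketch. The vocabulary and the 4 x 3 matrix `W_in` are made-up values for illustration only.

```python
vocab = ["the", "quick", "brown", "fox"]

# Hypothetical 4 x 3 input-to-hidden weight matrix:
# one row per vocabulary word, three hidden units.
W_in = [
    [0.1, 0.2, 0.3],
    [0.4, 0.5, 0.6],
    [0.7, 0.8, 0.9],
    [1.0, 1.1, 1.2],
]

def one_hot(index, size):
    """Vector of zeros with a single 1.0 at position `index`."""
    return [1.0 if i == index else 0.0 for i in range(size)]

def vec_mat(x, W):
    """Row vector x (length V) times matrix W (V x N)."""
    return [sum(x[i] * W[i][j] for i in range(len(W))) for j in range(len(W[0]))]

idx = vocab.index("brown")
hidden = vec_mat(one_hot(idx, len(vocab)), W_in)
print(hidden)  # the same numbers as row `idx` of W_in
```

In practice nobody performs this multiplication: implementations simply index the row, which gives the same result at a fraction of the cost.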

-----

7. Back Propagation

http://hemingwang.blogspot.com/2018/02/deep-learninglenet-bp.html

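Explanation:

Likewise, once the model, the dataset of input/output pairs, and the loss function have been fixed, Word2vec uses Back Propagation to learn the word vectors, which are simply the weights of the neural network.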
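A minimal sketch of one such learning step, assuming a Skip-gram-style model with a full softmax output and a cross-entropy loss. The names `W_in`, `W_out`, `sgd_step`, the sizes, and the learning rate are illustrative choices of mine, not taken from the original papers.

```python
import math
import random

def softmax(scores):
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def sgd_step(W_in, W_out, center, context, lr=0.1):
    """One gradient step on a single (center, context) pair. Returns the loss before the update."""
    h = list(W_in[center])  # hidden layer = row of W_in (copied)
    n_hidden, n_vocab = len(h), len(W_out[0])
    scores = [sum(h[k] * W_out[k][j] for k in range(n_hidden)) for j in range(n_vocab)]
    probs = softmax(scores)
    loss = -math.log(probs[context])
    # Gradient of the loss w.r.t. the scores: predicted probability minus one-hot target.
    err = [p - (1.0 if j == context else 0.0) for j, p in enumerate(probs)]
    # Gradient w.r.t. the hidden layer, computed before W_out is changed.
    grad_h = [sum(W_out[k][j] * err[j] for j in range(n_vocab)) for k in range(n_hidden)]
    for k in range(n_hidden):
        for j in range(n_vocab):
            W_out[k][j] -= lr * h[k] * err[j]
        W_in[center][k] -= lr * grad_h[k]  # only the center word's vector changes
    return loss

rng = random.Random(0)
V, N = 5, 3  # toy vocabulary size and embedding size
W_in = [[rng.uniform(-0.5, 0.5) for _ in range(N)] for _ in range(V)]
W_out = [[rng.uniform(-0.5, 0.5) for _ in range(V)] for _ in range(N)]
losses = [sgd_step(W_in, W_out, center=2, context=4) for _ in range(50)]
print(losses[0], losses[-1])
```

Repeating the step on the same pair should drive the loss down, which is all this toy run is meant to show.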

-----

8. CBOW

https://lilianweng.github.io/lil-log/2017/10/15/learning-word-embedding.html

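Explanation:

The orange part, the connections between the input layer and the hidden layer, is really a matrix. Its entries are the word vectors we want to learn. The green part is the matrix transformation from the hidden layer to the output layer, which produces the probability of the word being predicted (the center word in CBOW, the surrounding words in Skip-gram). Word2vec does not use this part as word vectors, but in the QKV (Query, Key, Value) decomposition of ConvS2S or the Transformer, the green part plays the role of the Query. In short: the Key is the one-hot encoding, i.e., the input layer. The Value is the meaning of the word, i.e., the word vector, the orange part. The Query is the probability distribution over the next word, i.e., the green part.

This relationship between Word2vec and QKV occurred to me while writing this passage. I believe the interpretation is close to correct, but I have not yet confirmed it.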
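Returning to the CBOW network itself, the forward pass through the "orange" and "green" matrices can be sketched like this. The vocabulary and both weight matrices (`W_in` for orange, `W_out` for green) are made-up values, and averaging the context vectors is the usual CBOW convention that I am assuming here.

```python
import math

def softmax(scores):
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def cbow_forward(context_ids, W_in, W_out):
    """Average the context word vectors, then score every vocabulary word."""
    n_hidden = len(W_in[0])
    hidden = [sum(W_in[i][k] for i in context_ids) / len(context_ids)
              for k in range(n_hidden)]
    scores = [sum(hidden[k] * W_out[k][j] for k in range(n_hidden))
              for j in range(len(W_out[0]))]
    return hidden, softmax(scores)

vocab = ["the", "dog", "and", "cat", "sat"]
# Hypothetical weights: W_in is V x N (orange), W_out is N x V (green).
W_in = [[0.2, -0.1], [0.5, 0.3], [-0.4, 0.1], [0.6, 0.4], [-0.2, -0.5]]
W_out = [[0.1, 0.4, -0.3, 0.7, -0.2],
         [-0.2, 0.3, 0.1, 0.5, -0.4]]

context = [vocab.index(w) for w in ["the", "dog", "and", "the"]]
hidden, probs = cbow_forward(context, W_in, W_out)
for word, p in zip(vocab, probs):
    print(word, round(p, 3))
```

With untrained weights the probabilities are meaningless. Training adjusts both matrices so that the true center word receives a high probability.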

-----

9. Skip-gram

https://lilianweng.github.io/lil-log/2017/10/15/learning-word-embedding.html

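Explanation:

This is the Skip-gram model. It is hard to follow at first sight because the figure draws matrices rather than a neural network. Skip-gram was briefly introduced above. The light-blue part is the key: the one-hot encoding at the input layer becomes the word vector at the hidden layer, which is then mapped to probabilities at the output layer. The input x and the output y together represent one record of the training set, i.e., one word pair.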

-----

10. QKV

https://medium.com/@joealato/attention-in-nlp-734c6fa9d983

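Explanation:

From Attention to Key-Value to QKV. Taking Word2vec as the example, the Query corresponds to the matrix between the hidden layer and the output layer, the Key corresponds to the one-hot encoding at the input layer, and the Value corresponds to the matrix between the input layer and the hidden layer.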

-----

11. Skip-gram Model

https://zhuanlan.zhihu.com/p/27234078

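Explanation:

This figure is probably the classic Skip-gram illustration on the web. Although the connections from the input layer to the hidden layer are simplified, the hidden-to-output part is labeled very clearly, especially the outputs.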

-----

12. Skip-gram Data

https://zhuanlan.zhihu.com/p/27234078

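Explanation:

Take the fourth row on the left of the figure as an example. Assuming a sliding window of length 5, the four words surrounding fox are quick, brown, jumps, and over.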
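The pair generation can be sketched as below. I am assuming the figure uses the familiar sentence "the quick brown fox jumps over the lazy dog", and that a window of length 5 means two words on each side of the center word. The function name `skipgram_pairs` is my own.

```python
def skipgram_pairs(tokens, window=2):
    """Return (center, context) pairs using `window` words on each side."""
    pairs = []
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs

sentence = "the quick brown fox jumps over the lazy dog".split()
pairs = skipgram_pairs(sentence, window=2)
fox_pairs = [p for p in pairs if p[0] == "fox"]
print(fox_pairs)
# [('fox', 'quick'), ('fox', 'brown'), ('fox', 'jumps'), ('fox', 'over')]
```

Words near the start or end of the sentence get fewer pairs because the window is clipped at the boundaries.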

-----

13. Skip-gram Result

https://zhuanlan.zhihu.com/p/27234078

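Explanation:

After training we obtain a table of word vectors. Thanks to the one-hot encoding, multiplying by this table simply extracts the word vector of the corresponding word.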

-----

14. Weight Matrix of Word2vec

https://mc.ai/deep-nlp-word-vectors-with-word2vec/

Explanation:

The figure is clearer than any written explanation.

-----

15. Softmax

https://pojenlai.wordpress.com/2016/02/27/tensorflow%E8%AA%B2%E7%A8%8B%E7%AD%86%E8%A8%98-softmax%E5%AF%A6%E4%BD%9C/

Explanation:

Before moving on to hierarchical softmax, let us look at softmax first.

Again, the formula is clearer than a description in words.
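For readers without the figure, softmax maps scores z_1..z_V to probabilities p_i = exp(z_i) / sum_j exp(z_j). A small sketch follows. Subtracting the maximum before exponentiating is a standard numerical-stability trick that I have added, and it does not change the result.

```python
import math

def softmax(scores):
    """Turn a list of real-valued scores into probabilities that sum to 1."""
    m = max(scores)  # shift so the largest exponent is 0 (avoids overflow)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax([1.0, 2.0, 3.0])
big = softmax([1000.0, 1001.0])  # would overflow without the shift
print(probs)
print(big)
```

The cost is proportional to the number of scores, which is why a vocabulary-sized softmax becomes the bottleneck that hierarchical softmax and negative sampling try to avoid.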

-----

16. Huffman Coding

https://hemingwang.blogspot.com/2020/08/huffman-coding.html

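Explanation:

Huffman coding tries to represent the most frequent words with the fewest bits. For the procedure, see the link above.

Algorithm:

1. Sort the words by frequency, in ascending order.
2. Merge the two smallest frequencies into a tree, i.e., add the two frequencies to obtain a new frequency, then go back to step 1. When only two frequency values remain, merging them yields the final Huffman tree.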
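The two steps above can be sketched with a priority queue, which keeps the frequencies sorted so that "go back to step 1" does not require a full re-sort. The word frequencies are made up for illustration, and `huffman_codes` is my own helper name.

```python
import heapq
import itertools

def huffman_codes(freqs):
    """Build a Huffman tree from {word: frequency} and return {word: bit string}."""
    counter = itertools.count()  # tie-breaker so the heap never compares subtrees
    heap = [(f, next(counter), word) for word, f in freqs.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        f1, _, left = heapq.heappop(heap)    # smallest frequency
        f2, _, right = heapq.heappop(heap)   # second smallest
        heapq.heappush(heap, (f1 + f2, next(counter), (left, right)))
    _, _, tree = heap[0]

    codes = {}
    def walk(node, prefix):
        if isinstance(node, tuple):          # internal node: (left, right)
            walk(node[0], prefix + "0")
            walk(node[1], prefix + "1")
        else:                                # leaf: a word
            codes[node] = prefix or "0"
    walk(tree, "")
    return codes

# Hypothetical word counts.
freqs = {"the": 45, "of": 13, "cat": 12, "dog": 16, "fox": 9, "zebra": 5}
codes = huffman_codes(freqs)
for word in sorted(codes, key=lambda w: len(codes[w])):
    print(word, codes[word])
```

The most frequent word ends up with a one-bit code and the rarest words with the longest codes, which is the property hierarchical softmax relies on.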

-----

17. Hierarchical Softmax 1

https://zhuanlan.zhihu.com/p/66417229

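Explanation:

"The original Word2Vec uses softmax to obtain the final probability distribution over the vocabulary. A vocabulary often contains millions of words, so computing a softmax probability for every output word is very expensive. One remedy is Hierarchical Softmax. Instead of computing the probability of each word directly as the original softmax does, Hierarchical Softmax uses a binary tree to obtain the probability of each word. The type of binary tree shown to work best is the Huffman tree."

"A Huffman tree has V-1 internal nodes and V leaf nodes, and the leaves correspond one-to-one to the V words in the vocabulary. First, a Huffman tree is built from the word frequencies: the more frequent a word, the shorter its Huffman code and the closer it sits to the root. The original Word2Vec architecture changes accordingly: the hidden layer is connected directly to every non-leaf node of the Huffman tree, as shown in the figure below (equivalent to an output layer with only V-1 neurons). A binary probability is then computed at every non-leaf node (i.e., a Sigmoid activation). This is the probability of a random walk from the current node, and one may freely designate it as the probability of going left or of going right. The path from the root to the target word is unique, and multiplying the walk probabilities of the internal nodes along that path gives the probability of the target word."

In this way, the probability of the target word is obtained by evaluating only as many output nodes as the depth of the tree. The depth of a Huffman tree is roughly log V, so the computational cost drops to O(log V). In addition, high-frequency words sit very close to the root and need even fewer evaluations, which is another advantage of using a Huffman tree.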
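A small sketch of the path-product idea, under stated assumptions: the four-word tree, the node vectors, and the hidden vector are all invented, and I arbitrarily let the sigmoid give the probability of going left. Each word's probability is the product of the branch probabilities along its path from the root.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

# Hypothetical tree with V = 4 leaves and V - 1 = 3 internal nodes (ids 0, 1, 2).
# Each path lists (internal node id, direction); 0 = left, 1 = right.
paths = {
    "the": [(0, 0)],
    "cat": [(0, 1), (1, 0)],
    "dog": [(0, 1), (1, 1), (2, 0)],
    "fox": [(0, 1), (1, 1), (2, 1)],
}
# One made-up parameter vector per internal node.
node_vectors = {0: [0.5, -0.2], 1: [-0.3, 0.8], 2: [0.1, 0.4]}

def word_probability(word, hidden):
    """Multiply the left/right probabilities along the word's root-to-leaf path."""
    prob = 1.0
    for node, direction in paths[word]:
        p_left = sigmoid(sum(a * b for a, b in zip(node_vectors[node], hidden)))
        prob *= p_left if direction == 0 else 1.0 - p_left
    return prob

hidden = [0.7, -1.1]  # hypothetical hidden-layer vector
probs = {w: word_probability(w, hidden) for w in paths}
total = sum(probs.values())
print(probs)
print(total)
```

Because every internal node splits its probability mass between left and right, the leaf probabilities sum to 1 without ever computing a normalizer over the whole vocabulary.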

https://zhuanlan.zhihu.com/p/66417229

-----

18. Hierarchical Softmax 2

https://ruder.io/word-embeddings-softmax/

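Explanation:

With the original softmax, the output layer holds the probabilities of, say, V words. With Hierarchical Softmax, the output layer is replaced by the V - 1 internal nodes of the Huffman tree.

Take the CBOW example above: the input is "the dog and the" and the prediction is "cat". Instead of updating all 10,000 output units, we only need to update three nodes, 1, 2, and 5, so that their decisions are left, right, right respectively and the output is "cat".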

-----

19. Hierarchical Softmax 3

https://sunjackson.github.io/2017/08/01/fb7b83894c233646897598c40c328c23/

http://building-babylon.net/2017/08/01/hierarchical-softmax/

https://zhuanlan.zhihu.com/p/53425736

Explanation:

These are the commonly used example figures. They are honestly hard to understand from the picture alone.

-----

20. Negative Sampling

https://python5566.wordpress.com/2018/03/17/nlp-%E7%AD%86%E8%A8%98-negative-sampling/comment-page-1/

http://mccormickml.com/2017/01/11/word2vec-tutorial-part-2-negative-sampling/

https://zhuanlan.zhihu.com/p/53425736

-----

21. Negative Sampling

https://medium.com/@tengyuanchang/%E8%AE%93%E9%9B%BB%E8%85%A6%E8%81%BD%E6%87%82%E4%BA%BA%E8%A9%B1-%E7%90%86%E8%A7%A3-nlp-%E9%87%8D%E8%A6%81%E6%8A%80%E8%A1%93-word2vec-%E7%9A%84-skip-gram-%E6%A8%A1%E5%9E%8B-73d0239ad698

Explanation:

Negative Sampling

Positive sample: (fox, quick). 1 sample.
Negative samples: (fox, word_not_quick). 9,999 candidates, assuming a 10,000-word vocabulary.

Small datasets: choose 5 to 20 negative samples.
Large datasets: choose 2 to 5 negative samples.
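A minimal sketch of the idea, with several assumptions: the word counts, vectors, and helper names are invented, and the negatives are drawn in proportion to count^0.75, which is the noise distribution reported in the Word2vec v2 paper [2] rather than something stated in this post. Instead of normalizing over the whole vocabulary, the loss pushes the positive pair's score up and the scores of k sampled negatives down.

```python
import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def sample_negatives(counts, center, positive, k, rng):
    """Draw k negative words (with replacement), weighted by count ** 0.75."""
    candidates = [w for w in counts if w not in (center, positive)]
    weights = [counts[w] ** 0.75 for w in candidates]
    return rng.choices(candidates, weights=weights, k=k)

def negative_sampling_loss(hidden, out_vectors, positive, negatives):
    """-log sigmoid(v_pos . h) - sum over negatives of log sigmoid(-v_neg . h)."""
    loss = -math.log(sigmoid(dot(out_vectors[positive], hidden)))
    for w in negatives:
        loss -= math.log(sigmoid(-dot(out_vectors[w], hidden)))
    return loss

rng = random.Random(0)
# Hypothetical word counts and output vectors.
counts = {"the": 50, "quick": 5, "brown": 6, "fox": 4,
          "jumps": 3, "over": 8, "lazy": 2, "dog": 7}
out_vectors = {w: [rng.uniform(-0.5, 0.5) for _ in range(3)] for w in counts}
hidden = [0.3, -0.2, 0.5]  # hypothetical vector for the center word "fox"

negatives = sample_negatives(counts, "fox", "quick", k=5, rng=rng)
loss = negative_sampling_loss(hidden, out_vectors, "quick", negatives)
print(negatives, loss)
```

Each training pair now touches only k + 1 output vectors instead of all V, which is where the speed-up comes from.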

-----

22. TF-IDF - term frequency–inverse document frequency

https://medium.com/nanonets/topic-modeling-with-lsa-psla-lda-and-lda2vec-555ff65b0b05

-----

23. LSA1

https://www.analyticsvidhya.com/blog/2018/10/stepwise-guide-topic-modeling-latent-semantic-analysis/

-----

24. LSA2

https://www.analyticsvidhya.com/blog/2018/10/stepwise-guide-topic-modeling-latent-semantic-analysis/

-----

25. GloVe

https://towardsdatascience.com/word-embeddings-for-nlp-5b72991e01d4

-----

26. GloVe in a Picture

https://dudeperf3ct.github.io/lstm/gru/nlp/2019/01/28/Force-of-LSTM-and-GRU/

-----

27. GloVe Loss

https://medium.com/@jonathan_hui/nlp-word-embedding-glove-5e7f523999f6

-----

28. GloVe Alpha

Fig. Weighting Function [].

-----

29. fastText

https://www.jiqizhixin.com/articles/2020-07-03-14

-----

30. fastText v1

https://www.twblogs.net/a/5ba122282b71771a4da89d89

-----

31. fastText v2

https://blog.csdn.net/u012931582/article/details/83818374

-----

32. WordRank

https://leovan.me/cn/2018/10/word-embeddings/

-----

33. NNLMs

https://www.jiqizhixin.com/graph/technologies/c61ba3b9-40e2-4864-a941-9adc19e6792e

-----

References

[1] Word2vec v1

Mikolov, Tomas, et al. "Efficient estimation of word representations in vector space." arXiv preprint arXiv:1301.3781 (2013).

https://arxiv.org/pdf/1301.3781.pdf

[2] Word2vec v2

Mikolov, Tomas, et al. "Distributed representations of words and phrases and their compositionality." Advances in neural information processing systems. 2013.

https://papers.nips.cc/paper/5021-distributed-representations-of-words-and-phrases-and-their-compositionality.pdf

[3] Word2vec v3

Rong, Xin. "word2vec parameter learning explained." arXiv preprint arXiv:1411.2738 (2014).

https://arxiv.org/pdf/1411.2738.pdf

[4] GloVe

Pennington, Jeffrey, Richard Socher, and Christopher Manning. "Glove: Global vectors for word representation." Proceedings of the 2014 conference on empirical methods in natural language processing (EMNLP). 2014.

https://www.aclweb.org/anthology/D14-1162

[5] fastText v1

Joulin, Armand, et al. "Bag of tricks for efficient text classification." arXiv preprint arXiv:1607.01759 (2016).

https://arxiv.org/pdf/1607.01759.pdf

[6] fastText v2

Bojanowski, Piotr, et al. "Enriching word vectors with subword information." Transactions of the Association for Computational Linguistics 5 (2017): 135-146.

https://www.mitpressjournals.org/doi/pdfplus/10.1162/tacl_a_00051

[7] WordRank

Ji, Shihao, et al. "Wordrank: Learning word embeddings via robust ranking." arXiv preprint arXiv:1506.02761 (2015).

https://arxiv.org/pdf/1506.02761.pdf

Friday, August 13, 2021