The Star Also Rises: Word2vec (3): Illustrated


2021/07/19

-----

https://pixabay.com/zh/photos/numbers-cipher-calculation-list-16804/

-----

Word2vec

1.1 Skip-Gram Model

1.2 Continuous Bag-of-Words Model

2.1 Hierarchical Softmax

2.2 Negative Sampling

-----

Source:

https://arxiv.org/abs/1708.02709

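Explanation:

Once an English word is converted from a one-hot encoding into a dense vector, it can take part in vector arithmetic. The classic example is King - Man + Woman ≈ Queen. This "equation" suggests that some direction in the vector space encodes gender and another encodes social status.

Compared with a plain one-hot encoding, a word embedding carries much richer meaning once words are represented as vectors.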

-----

Figure 2: Two-dimensional PCA projection of the 1000-dimensional Skip-gram vectors of countries and their capital cities. The figure illustrates ability of the model to automatically organize concepts and learn implicitly the relationships between them, as during the training we did not provide any supervised information about what a capital city means.


# Word2vec 2.

-----

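# Word2vec 1.

Explanation:

w: a word in the window. The window size here is 5. CBOW uses the surrounding words to predict the word in the middle. Skip-gram uses the word in the middle to predict the surrounding words.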
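The difference between the two models shows up in how training examples are cut from a sentence. The sketch below is a minimal illustration, not the original word2vec code: the function names and the toy sentence are made up, and a window of 5 is assumed to mean the center word plus c = 2 words on each side.

```python
def skipgram_pairs(tokens, c=2):
    """Return (center, context) pairs, taking up to c words on each side."""
    pairs = []
    for t, center in enumerate(tokens):
        for j in range(-c, c + 1):
            if j != 0 and 0 <= t + j < len(tokens):
                pairs.append((center, tokens[t + j]))
    return pairs


def cbow_examples(tokens, c=2):
    """Return (context_words, target) examples, one per position."""
    examples = []
    for t, target in enumerate(tokens):
        context = [tokens[t + j] for j in range(-c, c + 1)
                   if j != 0 and 0 <= t + j < len(tokens)]
        examples.append((context, target))
    return examples


tokens = "the quick brown fox jumps".split()
print(skipgram_pairs(tokens)[:4])   # center word first, context word second
print(cbow_examples(tokens)[2])     # context list first, target word second
```

Skip-gram produces one example per (center, context) pair, while CBOW produces one example per position with the whole context grouped together.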

-----

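Skip-Gram: the center word predicts the surrounding words

# Word2vec 3.

Explanation:

Input layer: for skip-gram, a V-dimensional one-hot encoding. The weights from the single nonzero input neuron to the hidden layer are that word's word vector.

Hidden layer: the hidden layer.

Output layer: the output layer.

V-dim: the dimension of the input layer, equal to the vocabulary size.

N-dim: the dimension of the hidden layer.

CxV-dim: the dimension of the output layer, V values for each of the C context words.

W VxN: a VxN matrix that maps the V-dimensional input to the N-dimensional hidden layer.

W' NxV: an NxV matrix. The output is a probability for every word in the vocabulary: the hidden layer is first mapped to V values, and a softmax is then applied to those V values. We want the words in the context to receive as high a probability as possible.

xk: the k-th unit of the input layer; k is an index.

hj: the j-th unit of the hidden layer; j is an index.

y Cj: C stands for context. The target is the word in the middle of the window, and the context is every other word in the window. CBOW predicts the target from the context, while skip-gram predicts the context from the target.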
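The layer-by-layer description above can be sketched as a forward pass in NumPy. This is an illustrative sketch only: the toy sizes V = 10 and N = 4, the random weights, and the names `W_out` and `skipgram_forward` are assumptions, not values from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
V, N = 10, 4                      # toy vocabulary size and hidden size (assumed)
W = rng.normal(size=(V, N))       # input -> hidden weights, one row per word
W_out = rng.normal(size=(N, V))   # hidden -> output weights (W' in the figure)


def skipgram_forward(word_index):
    x = np.zeros(V)
    x[word_index] = 1.0                     # one-hot input
    h = x @ W                               # equals row `word_index` of W
    scores = h @ W_out                      # one score per vocabulary word
    exp = np.exp(scores - scores.max())     # shift for numerical stability
    return h, exp / exp.sum()               # hidden vector, softmax probabilities


h, y = skipgram_forward(3)
print(np.allclose(h, W[3]), y.sum())
```

Because the input is one-hot, multiplying by W simply selects one row, which is why that row can be read off as the word vector. Since W' is shared, the same distribution y is compared against each of the C context words.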

-----

「The training objective of the Skip-gram model is to find word representations that are useful for predicting the surrounding words in a sentence or a document. More formally, given a sequence of training words w1,w2,w3, . . . ,wT , the objective of the Skip-gram model is to maximize the average log probability where c is the size of the training context (which can be a function of the center word wt).」


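# Word2vec 2.

Explanation:

T: the length of the sentence or document, that is, the number of training words.

t: the index of a word in the sentence.

j: the offset of a word within the window, relative to the center word.

c: the size of the training context: c words before the center word and c words after it.

p: probability.

wt: the center word.

w(t+j): a surrounding word.

As the paper states, the point of the formula is that the center word should predict its surrounding words: "The training objective of the Skip-gram model is to find word representations that are useful for predicting the surrounding words in a sentence or a document."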
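For reference, the average log probability that these symbols describe can be written out as follows. This is the objective as given in the Word2vec 2 paper, typeset here because the formula appears only as an image in the post.

```latex
\frac{1}{T} \sum_{t=1}^{T} \; \sum_{-c \le j \le c,\; j \ne 0} \log p(w_{t+j} \mid w_t)
```

The outer sum runs over every position t in the training text, and the inner sum runs over the c words on each side of the center word w_t.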

https://www.quora.com/Why-is-word2vec-a-log-linear-model

-----

「The basic Skip-gram formulation defines p(wt+j |wt) using the softmax function:

where vw and v′ w are the “input” and “output” vector representations of w, and W is the number of words in the vocabulary. This formulation is impractical because the cost of computing ∇log p(wO|wI ) is proportional to W, which is often large (10^5–10^7 terms).」


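# Word2vec 2.

Explanation:

The paper says: "where vw and v′ w are the “input” and “output” vector representations of w". The input vector representation is what is usually meant by a word2vec word vector. The less familiar output vector representation is a vector in the hidden-to-output matrix; there is one for each word in the vocabulary.

p: probability.

wO: the word that should be output.

wI: the input word.

W: the size of the vocabulary.

w: an index over the vocabulary.

v'w: the output vector representation of each word w.

v'wO: the output vector representation of the target word.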
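With these symbols, the softmax definition of p(wO|wI) from the Word2vec 2 paper reads as follows. It is typeset here because the formula appears only as an image in the post.

```latex
p(w_O \mid w_I) =
  \frac{\exp\left( {v'_{w_O}}^{\top} v_{w_I} \right)}
       {\sum_{w=1}^{W} \exp\left( {v'_{w}}^{\top} v_{w_I} \right)}
```

The denominator sums over all W words in the vocabulary, which is why the cost of computing the gradient is proportional to W.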

-----

Explanation:

The skip-gram model uses the word in the middle to predict the surrounding words.

https://zhuanlan.zhihu.com/p/27234078

-----

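CBOW: the surrounding words predict the center word

# Word2vec 3.

Explanation:

Each context word's one-hot vector is first multiplied by the shared VxN matrix; the resulting vectors are then averaged to form the hidden-layer vector.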
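The averaging step can be sketched in a few lines of NumPy. The sizes, the random matrix, and the name `cbow_hidden` are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
V, N = 10, 4                     # toy vocabulary size and hidden size (assumed)
W = rng.normal(size=(V, N))      # shared input -> hidden matrix


def cbow_hidden(context_indices):
    one_hots = np.eye(V)[context_indices]   # shape (C, V), one row per context word
    projected = one_hots @ W                # shape (C, N), each row is a word vector
    return projected.mean(axis=0)           # average over the C context words


h = cbow_hidden([1, 2, 4, 5])
print(np.allclose(h, W[[1, 2, 4, 5]].mean(axis=0)))
```

In practice the one-hot multiplication is replaced by a direct row lookup, which gives the same result without building the C x V matrix.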

https://blog.csdn.net/WitsMakeMen/article/details/89511764

-----

Explanation:

Huffman coding assigns the shortest codes to the most frequent words, so that frequent words take the fewest bits.
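A minimal sketch of Huffman coding using Python's `heapq` is shown below. The word frequencies are made up for illustration, and the function name `huffman_codes` is hypothetical. In hierarchical softmax, a tree built this way gives frequent words short paths from the root.

```python
import heapq
from itertools import count


def huffman_codes(freqs):
    """Build Huffman codes from a {symbol: frequency} mapping."""
    tiebreak = count()  # unique counter so heap entries never compare dicts
    heap = [(f, next(tiebreak), {sym: ""}) for sym, f in freqs.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        f1, _, left = heapq.heappop(heap)    # two least frequent subtrees
        f2, _, right = heapq.heappop(heap)
        merged = {s: "0" + code for s, code in left.items()}
        merged.update({s: "1" + code for s, code in right.items()})
        heapq.heappush(heap, (f1 + f2, next(tiebreak), merged))
    return heap[0][2]


# Made-up frequencies, for illustration only.
freqs = {"the": 50, "fox": 20, "quick": 15, "jumps": 10, "lazy": 5}
codes = huffman_codes(freqs)
for word in sorted(codes, key=freqs.get, reverse=True):
    print(word, freqs[word], codes[word])
```

The two least frequent subtrees are merged at every step, so rare words end up deepest in the tree and receive the longest codes.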

https://www.gatevidyalay.com/huffman-coding-huffman-encoding/

-----

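Hierarchical Softmax

# Word2vec 3.

Explanation:

An example binary tree for the hierarchical softmax model. The white units are words in the vocabulary, and the dark units are inner units. The thick line shows an example path from the root to w2. In this example the path length is L(w2) = 4. n(w; j) denotes the j-th unit on the path from the root to the word w.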

-----

Images:

https://zhuanlan.zhihu.com/p/66417229

https://ruder.io/word-embeddings-softmax/

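Explanation:

Suppose the vocabulary has V words. The original goal is for the target word to get probability 1 and every other word probability 0, but the full softmax over V words is expensive to compute. With hierarchical softmax, the tree has V - 1 non-leaf nodes, and only the probabilities at the non-leaf nodes on the path from the root to the target word need to be considered, which greatly reduces the computation. When the path goes right at a node, the sigmoid value is maximized; when it goes left, 1 minus the sigmoid value is used instead. See the figure above.

p(right | n, c) = sigmoid(h^T v'_n).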
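The path product can be sketched as follows, assuming a hypothetical complete binary tree over four words and random vectors. The tree layout, the sizes, and all names are illustrative. The sketch follows the convention above, where going right uses the sigmoid and going left uses 1 minus the sigmoid.

```python
import numpy as np

rng = np.random.default_rng(0)
N = 4                                     # hidden size (assumed)
inner_vectors = rng.normal(size=(3, N))   # v'_n for the V - 1 = 3 inner nodes
h = rng.normal(size=N)                    # hidden-layer vector

# Hypothetical tree: node 0 is the root, node 1 its left child,
# node 2 its right child; the four words are the leaves.
paths = {
    "w1": [(0, "left"), (1, "left")],
    "w2": [(0, "left"), (1, "right")],
    "w3": [(0, "right"), (2, "left")],
    "w4": [(0, "right"), (2, "right")],
}


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def word_probability(word):
    prob = 1.0
    for node, direction in paths[word]:
        p_right = sigmoid(h @ inner_vectors[node])
        prob *= p_right if direction == "right" else 1.0 - p_right
    return prob


total = sum(word_probability(w) for w in paths)
print(total)   # the leaf probabilities form a distribution
```

Each word costs only as many sigmoid evaluations as its path has inner nodes, and the leaf probabilities still sum to 1 without any normalization over the whole vocabulary.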

https://www.cnblogs.com/pinard/p/7243513.html

https://zhuanlan.zhihu.com/p/56139075

-----

Negative Sampling

-----

https://tengyuanchang.medium.com/%E8%AE%93%E9%9B%BB%E8%85%A6%E8%81%BD%E6%87%82%E4%BA%BA%E8%A9%B1-%E7%90%86%E8%A7%A3-nlp-%E9%87%8D%E8%A6%81%E6%8A%80%E8%A1%93-word2vec-%E7%9A%84-skip-gram-%E6%A8%A1%E5%9E%8B-73d0239ad698

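Explanation:

Negative Sampling

Positive sample: (fox, quick). There is 1.

Negative samples: (fox, any word other than quick). There are 9,999 candidates, which implies a 10,000-word vocabulary.

Small datasets: choose 5 to 20 negative samples.

Large datasets: choose 2 to 5 negative samples.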
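The following is a minimal sketch of the negative-sampling loss for one positive pair and k sampled negatives. The sizes, random vectors, and word indices are illustrative assumptions. Negatives are drawn uniformly here for simplicity, whereas the Word2vec 2 paper draws them from the unigram distribution raised to the 3/4 power. Output vectors are stored as rows of `W_out` for convenience.

```python
import numpy as np

rng = np.random.default_rng(0)
V, N, k = 10_000, 50, 5                     # vocabulary, vector size, negatives (assumed)
W_in = rng.normal(scale=0.1, size=(V, N))   # input vectors, one row per word
W_out = rng.normal(scale=0.1, size=(V, N))  # output vectors, one row per word


def log_sigmoid(z):
    return -np.logaddexp(0.0, -z)           # stable log(1 / (1 + exp(-z)))


def negative_sampling_loss(center, positive, negatives):
    v = W_in[center]
    pos_term = log_sigmoid(W_out[positive] @ v)              # pull the true pair together
    neg_term = log_sigmoid(-(W_out[negatives] @ v)).sum()    # push sampled pairs apart
    return -(pos_term + neg_term)


center, positive = 0, 1                     # stand-ins for "fox" and "quick"
candidates = np.setdiff1d(np.arange(V), [positive])   # the 9,999 other words
negatives = rng.choice(candidates, size=k, replace=False)
loss = negative_sampling_loss(center, positive, negatives)
print(len(candidates), loss)
```

Each update touches only the 1 + k output vectors involved, instead of all V, which is where the savings come from.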

-----

References

# NNLM. Cited 7,185 times.

Bengio, Yoshua, et al. "A neural probabilistic language model." Journal of machine learning research 3.Feb (2003): 1137-1155.

http://www-labs.iro.umontreal.ca/~felipe/IFT6010-Automne2011/resources/tp3/bengio03a.pdf

# Word2vec 1. Cited 18,991 times.

Mikolov, Tomas, et al. "Efficient estimation of word representations in vector space." arXiv preprint arXiv:1301.3781 (2013).

https://arxiv.org/pdf/1301.3781.pdf

# Word2vec 2. Cited 23,990 times.

Mikolov, Tomas, et al. "Distributed representations of words and phrases and their compositionality." Advances in neural information processing systems. 2013.

https://papers.nips.cc/paper/2013/file/9aa42b31882ec039965f3c4923ce901b-Paper.pdf

# Word2vec 3. Cited 645 times.

Rong, Xin. "word2vec parameter learning explained." arXiv preprint arXiv:1411.2738 (2014).

https://arxiv.org/pdf/1411.2738.pdf

# C&W v1. Cited 5,099 times.

Collobert, Ronan, and Jason Weston. "A unified architecture for natural language processing: Deep neural networks with multitask learning." Proceedings of the 25th international conference on Machine learning. 2008.

http://www.cs.columbia.edu/~smaskey/CS6998-Fall2012/supportmaterial/colbert_dbn_nlp.pdf

# C&W v2. Cited 6,841 times. This paper explains why it was necessary to go on from Word2vec to develop Paragraph2vec.

Collobert, Ronan, et al. "Natural language processing (almost) from scratch." Journal of machine learning research 12 (2011): 2493-2537.

https://www.jmlr.org/papers/volume12/collobert11a/collobert11a.pdf

Word2Vec Tutorial - The Skip-Gram Model · Chris McCormick

http://mccormickml.com/2016/04/19/word2vec-tutorial-the-skip-gram-model/

Understanding the Skip-Gram Model of Word2Vec - Zhihu

https://zhuanlan.zhihu.com/p/27234078

-----

[4] GloVe

Pennington, Jeffrey, Richard Socher, and Christopher Manning. "Glove: Global vectors for word representation." Proceedings of the 2014 conference on empirical methods in natural language processing (EMNLP). 2014.

https://www.aclweb.org/anthology/D14-1162

[5] fastText v1

Joulin, Armand, et al. "Bag of tricks for efficient text classification." arXiv preprint arXiv:1607.01759 (2016).

https://arxiv.org/pdf/1607.01759.pdf

[6] fastText v2

Bojanowski, Piotr, et al. "Enriching word vectors with subword information." Transactions of the Association for Computational Linguistics 5 (2017): 135-146.

https://www.mitpressjournals.org/doi/pdfplus/10.1162/tacl_a_00051

[7] WordRank

Ji, Shihao, et al. "Wordrank: Learning word embeddings via robust ranking." arXiv preprint arXiv:1506.02761 (2015).

https://arxiv.org/pdf/1506.02761.pdf

-----

The Star Also Rises

Saturday, August 14, 2021
