The Star Also Rises: December 2021

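「Each layer of BERT learns something like one stage of the NLP pipeline: first it determines parts of speech, then syntax, then which noun a pronoun refers to, and so on. In the figure, the horizontal axis is the layer and the vertical axis is the NLP pipeline task; each task puts different weights on the layers when it takes a weighted sum of their output embeddings. For example, POS tagging may rely mostly on the middle layers, around layers 10 to 13.」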

https://hackmd.io/@shaoeChen/Bky0Cnx7L

https://zhuanlan.zhihu.com/p/70757539
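The weighted sum over layers described in the quote can be sketched as follows. This is a minimal illustration, not the probing code from the cited paper: the layer outputs are random stand-ins, the sizes are toy values, and the names `weighted_layer_sum` and `layer_scores` are hypothetical.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over a 1-D array."""
    e = np.exp(x - np.max(x))
    return e / e.sum()

def weighted_layer_sum(layer_outputs, layer_scores):
    """Mix per-layer embeddings with softmax-normalized weights.

    layer_outputs: (num_layers, seq_len, hidden)
    layer_scores:  (num_layers,) unnormalized, learned per task
    """
    weights = softmax(layer_scores)
    # Contract the layer axis: result has shape (seq_len, hidden).
    mixed = np.tensordot(weights, layer_outputs, axes=1)
    return mixed, weights

rng = np.random.default_rng(0)
num_layers, seq_len, hidden = 24, 5, 8           # toy sizes (assumption)
layer_outputs = rng.normal(size=(num_layers, seq_len, hidden))

# Pretend a task learned to favour layers 10..13 (1-indexed), as in the quote.
layer_scores = np.zeros(num_layers)
layer_scores[9:13] = 3.0

mixed, weights = weighted_layer_sum(layer_outputs, layer_scores)
print(mixed.shape, int(np.argmax(weights)) + 1)
```

Plotting `weights` per task against the layer index gives the kind of figure the quote describes.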

-----

Monday, December 13, 2021

BERT（三）：Illustrated

2021/09/01

-----

https://pixabay.com/zh/photos/fantasy-light-mood-sky-beautiful-2861107/

-----

# BERT

Notes:

「The Token Embeddings layer will convert each wordpiece token into a 768-dimensional vector representation.」

In other words, the Token Embeddings layer maps each WordPiece token to a 768-dimensional vector.

「The Segment Embeddings layer only has 2 vector representations. The first vector (index 0) is assigned to all tokens that belong to input 1 while the last vector (index 1) is assigned to all tokens that belong to input 2.」

The Segment Embeddings layer holds only two vectors: index 0 is assigned to every token of input 1, and index 1 to every token of input 2.

「BERT was designed to process input sequences of up to length 512. The authors incorporated the sequential nature of the input sequences by having BERT learn a vector representation for each position. This means that the Position Embeddings layer is a lookup table of size (512, 768).」

The Position Embeddings layer is a lookup table of size (512, 768). BERT handles input sequences of up to 512 tokens and captures word order by learning one vector per position.

https://medium.com/@_init_/why-bert-has-3-embedding-layers-and-their-implementation-details-9c261108e28a
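The three lookup tables can be sketched with NumPy as below. This is only a sketch: the tables are randomly initialized rather than learned, `VOCAB_SIZE` is a small toy value chosen to keep the example light, the token ids are made up, and the real model also applies layer normalization and dropout after the sum.

```python
import numpy as np

VOCAB_SIZE = 1000            # toy value (assumption); the real vocabulary is much larger
MAX_LEN, HIDDEN = 512, 768   # sizes quoted in the notes above

rng = np.random.default_rng(0)
token_table = rng.normal(scale=0.02, size=(VOCAB_SIZE, HIDDEN))
segment_table = rng.normal(scale=0.02, size=(2, HIDDEN))         # input 1 / input 2
position_table = rng.normal(scale=0.02, size=(MAX_LEN, HIDDEN))  # one row per position

def embed(token_ids, segment_ids):
    """Return the element-wise sum of token, segment and position embeddings."""
    token_ids = np.asarray(token_ids)
    segment_ids = np.asarray(segment_ids)
    if len(token_ids) > MAX_LEN:
        raise ValueError("sequence longer than the position table")
    positions = np.arange(len(token_ids))
    return (token_table[token_ids]
            + segment_table[segment_ids]
            + position_table[positions])

token_ids = [1, 57, 2, 93, 2]      # hypothetical ids
segment_ids = [0, 0, 0, 1, 1]      # first three tokens belong to input 1
x = embed(token_ids, segment_ids)
print(x.shape)
```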

-----

# Transformer。

-----

# BERT

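Notes:

「GPT is short for "Generative Pre-Training"; as the name suggests, it means generative pre-training. GPT also uses a two-stage process: the first stage pre-trains a language model, and the second stage solves downstream tasks by fine-tuning.」

「GPT's pre-training is similar to ELMo's, with two main differences. First, the feature extractor is a Transformer rather than a two-layer bidirectional LSTM; as noted above, the Transformer is the stronger feature extractor. Second, although GPT's pre-training objective is still language modeling, it uses a unidirectional language model: GPT predicts a word from its preceding context (context-before) only.」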

https://blog.csdn.net/qq_35883464/article/details/100173045

-----

# GPT

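Notes:

「GPT uses a two-stage process: the first stage pre-trains a language model, and the second stage solves downstream tasks by fine-tuning.」

Four fine-tuning tasks:

1. 「For classification, almost nothing changes: just add a start symbol and an end symbol.」

2. 「For sentence-relation tasks such as entailment, add a delimiter between the two sentences.」

3. 「For text similarity, swap the order of the two sentences to produce two inputs; this tells the model that sentence order does not matter.」

4. 「For multiple choice, use several input streams, each one concatenating the passage with one answer option.」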

https://blog.csdn.net/qq_35883464/article/details/100173045
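The four input transformations above can be sketched as plain list manipulation. The special-token strings `START`, `DELIM` and `EXTRACT` are placeholders chosen for this illustration, and the function names are hypothetical.

```python
# Placeholder special tokens (assumption: the real model uses learned embeddings for these).
START, DELIM, EXTRACT = "<s>", "<$>", "<e>"

def classification_input(text):
    """Task 1: wrap the text in start and end symbols."""
    return [[START, *text, EXTRACT]]

def entailment_input(premise, hypothesis):
    """Task 2: put a delimiter between the two sentences."""
    return [[START, *premise, DELIM, *hypothesis, EXTRACT]]

def similarity_input(a, b):
    """Task 3: emit both orderings, since sentence order carries no meaning."""
    return [[START, *a, DELIM, *b, EXTRACT],
            [START, *b, DELIM, *a, EXTRACT]]

def multiple_choice_input(context, answers):
    """Task 4: one input stream per answer option."""
    return [[START, *context, DELIM, *answer, EXTRACT] for answer in answers]

print(similarity_input(["good", "movie"], ["great", "film"]))
```

Each function returns a list of token sequences, so the tasks that need one stream and the tasks that need several share the same return type.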

-----

# BERT

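Notes:

1. Cloze (fill in the blank): Masked Language Model (MLM). Output: a probability distribution over the vocabulary.

2. Next Sentence Prediction (NSP): whether the second sentence follows the first in the original document. Output: yes or no.

[CLS] (classification): its output is used for the binary classification. [SEP] (separation): separates the sentences.

The two pre-training tasks are trained at the same time.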

https://leemeng.tw/attack_on_bert_transfer_learning_in_nlp.html
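Training both tasks at the same time amounts to adding the two losses. The sketch below assumes cross-entropy for both heads and uses random logits in place of real model outputs; the function names and toy sizes are hypothetical.

```python
import numpy as np

def cross_entropy(logits, targets):
    """Mean cross-entropy for integer targets; logits has shape (n, num_classes)."""
    logits = logits - logits.max(axis=-1, keepdims=True)
    log_probs = logits - np.log(np.exp(logits).sum(axis=-1, keepdims=True))
    return float(-log_probs[np.arange(len(targets)), targets].mean())

def pretraining_loss(mlm_logits, mlm_targets, nsp_logits, nsp_targets):
    """Sum of the masked-LM loss (over masked positions) and the NSP loss."""
    return cross_entropy(mlm_logits, mlm_targets) + cross_entropy(nsp_logits, nsp_targets)

rng = np.random.default_rng(0)
vocab_size = 50                                   # toy vocabulary (assumption)
mlm_logits = rng.normal(size=(3, vocab_size))     # one row per masked position
mlm_targets = np.array([7, 12, 3])                # ids of the original tokens
nsp_logits = rng.normal(size=(1, 2))              # yes / no, read from the [CLS] output
nsp_targets = np.array([1])

loss = pretraining_loss(mlm_logits, mlm_targets, nsp_logits, nsp_targets)
print(loss)
```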

-----

# BERT

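Notes:

A. Sentence-pair classification tasks

B. Single-sentence classification tasks

C. Question answering tasks

D. Single-sentence tagging tasks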

https://leemeng.tw/attack_on_bert_transfer_learning_in_nlp.html

-----

# BERT

Notes:

A. Sentence-pair classification tasks

Input: [CLS] (classification), sentence 1, [SEP] (separator), sentence 2.

Output: a class label.

「[CLS] is the special symbol for classification output, and [SEP] is the special symbol to separate non-consecutive token sequences.」 # BERT

That is, the output at [CLS] is used for classification, and [SEP] marks the boundary between the two token sequences.

https://hackmd.io/@shaoeChen/Bky0Cnx7L
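A sentence-pair input and its class-label output can be sketched as below. The layout follows the BERT paper, which also appends a final [SEP] after the second sentence. The classifier weights are random stand-ins, the `cls_vector` replaces the real BERT output at [CLS], and the three MNLI labels are used only as an example label set.

```python
import numpy as np

LABELS = ["entailment", "contradiction", "neutral"]   # example label set (MNLI)

def build_pair_input(tokens_a, tokens_b):
    """Return tokens and segment ids for [CLS] A [SEP] B [SEP]."""
    tokens = ["[CLS]", *tokens_a, "[SEP]", *tokens_b, "[SEP]"]
    segment_ids = [0] * (len(tokens_a) + 2) + [1] * (len(tokens_b) + 1)
    return tokens, segment_ids

def classify(cls_vector, weight, bias):
    """Linear layer on the [CLS] output followed by argmax."""
    logits = weight @ cls_vector + bias
    return LABELS[int(np.argmax(logits))]

tokens, segment_ids = build_pair_input(["a", "man", "sleeps"], ["a", "person", "rests"])

rng = np.random.default_rng(0)
cls_vector = rng.normal(size=8)                  # stand-in for BERT's output at [CLS]
weight, bias = rng.normal(size=(3, 8)), np.zeros(3)
label = classify(cls_vector, weight, bias)
print(tokens, segment_ids, label)
```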

MNLI（Multi-genre Natural Language Inference）

「Multi-Genre Natural Language Inference is a large-scale, crowdsourced entailment classification task (Williams et al., 2018). Given a pair of sentences, the goal is to predict whether the second sentence is an entailment, contradiction, or neutral with respect to the first one.」 # BERT

In short: given a sentence pair, predict whether the second sentence entails, contradicts, or is neutral toward the first.

「Entailment (蘊涵) is used in propositional logic and predicate logic to describe a relationship between two sentences or sets of sentences, and is usually written with the ⇒ symbol.」

https://zh.wikipedia.org/zh-tw/%E8%95%B4%E6%B6%B5

This is a three-class classification task.

QQP

QNLI

STS-B

MRPC

RTE

SWAG

https://zhuanlan.zhihu.com/p/102208639

-----

# BERT

Notes:

B. Single-sentence classification tasks

Input: [CLS] (classification), a single sentence.

Output: a class label.

https://hackmd.io/@shaoeChen/Bky0Cnx7L

SST-2

「SST-2 The Stanford Sentiment Treebank is a binary single-sentence classification task consisting of sentences extracted from movie reviews with human annotations of their sentiment (Socher et al., 2013).」 # BERT

In short: SST-2 is binary sentiment classification of single sentences taken from movie reviews, with human-annotated labels.

CoLA（Corpus of Linguistic Acceptability）

「CoLA The Corpus of Linguistic Acceptability is a binary single-sentence classification task, where the goal is to predict whether an English sentence is linguistically “acceptable” or not (Warstadt et al., 2018).」 # BERT

In short: CoLA is binary classification of whether a single English sentence is linguistically acceptable.

https://zhuanlan.zhihu.com/p/102208639

-----

# BERT

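Notes:

C. Question answering tasks

1. Input: [CLS] (classification), the question, [SEP] (separator), the document that contains the answer.

2. After passing through BERT, each token yields an output embedding vector.

3. The model learns two vectors, v_s and v_e, with the same dimension as the output embeddings.

4. Take the dot product of v_s with the output embedding of every document token to get one scalar per token, then apply softmax. The position with the highest probability is the start index s.

5. Take the dot product of v_e with the output embedding of every document token to get one scalar per token, then apply softmax. The position with the highest probability is the end index e.

6. If s = 2 and e = 3, the answer is the span "d2 d3".

7. If s = 3 and e = 2 (the start falls after the end), the model answers that the question has no answer.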

https://hackmd.io/@shaoeChen/Bky0Cnx7L
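The steps above can be sketched as follows. The document embeddings are a toy identity matrix and `v_s`, `v_e` are hand-picked rather than learned, purely so that the example reproduces the s = 2, e = 3 case from the notes; `predict_span` is a hypothetical helper name.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over a 1-D array."""
    e = np.exp(x - np.max(x))
    return e / e.sum()

def predict_span(doc_embeddings, v_s, v_e, doc_tokens):
    """Pick start/end by dot product + softmax; positions are 1-indexed as in the notes."""
    start_probs = softmax(doc_embeddings @ v_s)
    end_probs = softmax(doc_embeddings @ v_e)
    s = int(np.argmax(start_probs)) + 1
    e = int(np.argmax(end_probs)) + 1
    if s > e:                       # start after end: treat as "no answer"
        return None, s, e
    return doc_tokens[s - 1:e], s, e

doc_tokens = ["d1", "d2", "d3", "d4"]
doc_embeddings = np.eye(4)                 # toy stand-in for BERT outputs
v_s = np.array([0.0, 5.0, 0.0, 0.0])       # scores d2 highest as start
v_e = np.array([0.0, 0.0, 5.0, 0.0])       # scores d3 highest as end

answer, s, e = predict_span(doc_embeddings, v_s, v_e, doc_tokens)
print(answer, s, e)
```

Swapping `v_s` and `v_e` gives s = 3, e = 2, and the function returns `None` for the answer.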

SQuAD

https://zhuanlan.zhihu.com/p/102208639

-----

# BERT

Notes:

D. Single-sentence tagging tasks

Input: [CLS] (classification), a single sentence.

Output: for each token, a position tag and an entity type (see NER below).

https://hackmd.io/@shaoeChen/Bky0Cnx7L

https://zhuanlan.zhihu.com/p/102208639

NER（Named Entity Recognition）

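「Given the sentence "小明在北京大學的燕園看了中國男籃的一場比賽" (Xiao Ming watched a Chinese men's basketball game at Yan Garden in Peking University), an NER model picks out "小明" as PER, "北京大學" as ORG, "燕園" as LOC, and "中國男籃" as ORG.」

「B (Begin) marks the start of an entity, I (Intermediate) the middle, E (End) the end, S (Single) a single-character entity, and O (Other) any unrelated character.」

「Tagging the sentence "小明在北京大學的燕園看了中國男籃的一場比賽" character by character gives: [B-PER, E-PER, O, B-ORG, I-ORG, I-ORG, E-ORG, O, B-LOC, E-LOC, O, O, B-ORG, I-ORG, I-ORG, E-ORG, O, O, O, O, O].」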

https://zhuanlan.zhihu.com/p/88544122
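The BIOES scheme can be sketched as a small function that turns known entity strings into per-character tags. It is a labeling helper for illustration only, not an NER model; `bioes_tags` is a hypothetical name, and it assumes each entity string occurs once in the sentence. The example sentence has 21 characters, so the output has 21 tags, ending with five O tags.

```python
def bioes_tags(sentence, entities):
    """Tag each character with B/I/E/S-<label> or O, given (text, label) entities."""
    tags = ["O"] * len(sentence)
    for text, label in entities:
        start = sentence.find(text)          # assumes the entity occurs once
        if start == -1:
            raise ValueError(f"entity not found: {text}")
        end = start + len(text) - 1
        if start == end:
            tags[start] = f"S-{label}"       # single-character entity
        else:
            tags[start] = f"B-{label}"
            for i in range(start + 1, end):
                tags[i] = f"I-{label}"
            tags[end] = f"E-{label}"
    return tags

sentence = "小明在北京大學的燕園看了中國男籃的一場比賽"
entities = [("小明", "PER"), ("北京大學", "ORG"), ("燕園", "LOC"), ("中國男籃", "ORG")]
tags = bioes_tags(sentence, entities)
print(list(zip(sentence, tags)))
```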

-----

# BERT

-----

References

# BERT. Cited 12,556 times.

Devlin, Jacob, et al. "Bert: Pre-training of deep bidirectional transformers for language understanding." arXiv preprint arXiv:1810.04805 (2018).

https://arxiv.org/pdf/1810.04805.pdf

# BERT Pipeline

Tenney, Ian, Dipanjan Das, and Ellie Pavlick. "BERT rediscovers the classical NLP pipeline." arXiv preprint arXiv:1905.05950 (2019).

https://arxiv.org/pdf/1905.05950.pdf

# Transformer. Cited 13,554 times.

Vaswani, Ashish, et al. "Attention is all you need." Advances in neural information processing systems. 2017.

https://papers.nips.cc/paper/7181-attention-is-all-you-need.pdf

# GPT

Radford, Alec, et al. "Improving language understanding by generative pre-training." URL https://s3-us-west-2.amazonaws.com/openai-assets/research-covers/language-unsupervised/language_understanding_paper.pdf (2018).

https://www.cs.ubc.ca/~amuham01/LING530/papers/radford2018improving.pdf

-----

BERT（二）：Overview

2020/12/28

-----

https://pixabay.com/zh/photos/bert-and-ernie-sesamstrasse-382270/

-----

◎ Abstract

-----

◎ Introduction

-----

Which problems (weaknesses of prior work) does this paper address?

-----

# BERT。

-----

◎ Method

-----

What is the solution?

-----

# BERT。

-----

Implementation details?

-----

◎ Result

-----

Results of this paper.

-----

◎ Discussion

-----

Comparison of this paper with other papers (results or methods).

-----

Comparison of results.

-----

Comparison of methods.

-----

◎ Conclusion

-----

◎ Future Work

-----

Follow-up research in related fields.

-----

Follow-up research in extended fields.

-----

◎ References

-----

# ULMFiT. Cited 1,339 times.

Howard, Jeremy, and Sebastian Ruder. "Universal language model fine-tuning for text classification." arXiv preprint arXiv:1801.06146 (2018).

https://arxiv.org/pdf/1801.06146.pdf

# Transformer. Cited 13,554 times.

Vaswani, Ashish, et al. "Attention is all you need." Advances in neural information processing systems. 2017.

https://papers.nips.cc/paper/7181-attention-is-all-you-need.pdf

# BERT. Cited 12,556 times.

Devlin, Jacob, et al. "Bert: Pre-training of deep bidirectional transformers for language understanding." arXiv preprint arXiv:1810.04805 (2018).

https://arxiv.org/pdf/1810.04805.pdf

-----

The Star Also Rises: BERT

https://hemingwang.blogspot.com/2019/01/bert.html

-----

NLP（六）：BERT Overview

2019/01/17

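Notes:

Pre-trained models for Natural Language Processing (NLP) advanced enormously between Word2vec in 2013 and BERT in 2018 [1]-[11]. In short, Word2vec, GloVe, and fastText are shallow models; ELMo, AWD-LSTM, and ULMFiT are deep LSTM models; OpenAI GPT is a deep Transformer model. BERT combines the strengths of the bidirectional ELMo and the deep Transformer into a bidirectional Transformer model.

Readers may want to first understand how the Transformer works [5], then follow the evolution of pre-trained models step by step through one article [6], two articles [7], [8], or three articles [9]-[11].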

-----

Outline：

1. Word2vec, GloVe, fastText
2. ELMo, AWD-LSTM, ULMFiT
3. OpenAI GPT
4. BERT
5. From Word2vec to BERT

-----

Fig. 1. BERT [1]。

-----

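1. Word2vec, GloVe, fastText

◎ Three shallow models.

「Ever since Google proposed Word2Vec in 2013, deep learning in NLP has used pre-trained models, and Stanford's GloVe and Facebook's fastText developed the idea further. Before this year [2018], however, most such attempts were limited to shallow networks that model language at the word level. Getting good results in a specific application still required a very large amount of labeled data. The success of pre-trained deep models, and of transfer learning on top of them, in computer vision led NLP researchers to ask how to achieve the same paradigm. After years of effort and exploration, this year finally brought a harvest.」[4]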

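2. ELMo, AWD-LSTM, ULMFiT

◎ ELMo is a bidirectional LSTM model.

「First came the ELMo pre-trained model, published early in the year at NAACL-HLT 2018. It trains a forward and a backward LSTM language model (a BiLM) on a general corpus and feeds the resulting pre-trained model (ELMo) into the input of a deep network, which clearly improves existing models on many tasks.」[4]

◎ ULMFiT is a three-layer AWD-LSTM model.

「After that, FastAI built a language model on a three-layer AWD-LSTM and pre-trained it on a large general corpus to obtain ULMFiT. Applied to a specific domain, the model needs only a very small amount of labeled data to match what ordinary models achieve with large amounts. Its success showed the field the dawn of transfer learning in NLP.」[4]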

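3. OpenAI GPT

◎ OpenAI GPT is a unidirectional Transformer (it drops the LSTM and relies on self-attention and feed-forward layers).

「Soon afterwards, OpenAI combined the Transformer with unsupervised learning, training on a large general corpus to obtain the pre-trained GPT model. For a specific scenario, supervised learning on a much smaller dataset on top of the pre-trained GPT achieved the best results of the time.」[4]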

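4. BERT

◎ BERT combines the strengths of the bidirectional ELMo and the unidirectional OpenAI GPT into a bidirectional Transformer.

「In October 2018, Google improved on GPT and proposed BERT, a Transformer-based model. To train BERT, Google constructed the MLM (Masked Language Model) objective, a "truly" bidirectional language model, and ran unsupervised training on a large general corpus, BooksCorpus (800M words) plus English Wikipedia (2,500M words), to obtain the pre-trained BERT. In the paper, the pre-trained BERT is fine-tuned with supervision (transfer learning) on 11 tasks and reaches the state of the art on all of them. In particular, it surpasses human expert performance on the Stanford Question Answering Dataset (SQuAD 1.1).」[4]

5. From Word2vec to BERT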

-----

Fig. 2. From fastText to ELMo [2]。

-----

Fig. 3. CBOW and skip-gram [3]。

-----

Task #1: Masked LM

◎ CBOW and Masked-LM

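「The core idea of CBOW is this: in the language-modeling task, remove the word to be predicted, then predict it from its preceding context (context-before) and following context (context-after). And what does BERT do? Exactly that. This shows how one method inherits from another. The BERT authors do not mention Word2vec or CBOW, so this is my own judgment; they say they were inspired by the cloze task, which is quite possible, but I find it unlikely that CBOW never crossed their minds.」[6].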

-----

Task #2: Next Sentence Prediction

◎ Skip-gram and Skip-Thought

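「Besides using Masked-LM to make a language model with a bidirectional Transformer possible, BERT also borrows the sentence-prediction problem from Skip-thoughts to learn sentence-level semantic relations. Concretely, two sentences are combined into one sequence (in the manner described below), and the model predicts whether the two sentences are adjacent and in order; that is, "Next Sentence Prediction" is modeled as a binary classification problem. During training, in 50% of the cases the two sentences really are consecutive, and in the other 50% they are drawn together at random from the corpus and are not consecutive; this is how the training data is built. The sentence-level prediction idea is essentially the same as in Skip-thoughts, described earlier, though its deeper origin is the skip-gram model in Word2vec.」[10].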
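The 50/50 construction of NSP training pairs can be sketched as below. This is a simplified illustration: the corpus is three tiny made-up documents, negatives are drawn from other documents, and `make_nsp_examples` is a hypothetical helper rather than the original data pipeline.

```python
import random

def make_nsp_examples(documents, rng):
    """Build (sentence_a, sentence_b, label) triples; label 1 means b really follows a."""
    examples = []
    for doc in documents:
        # Sentences from every other document serve as random negatives.
        others = [s for other in documents if other is not doc for s in other]
        for i in range(len(doc) - 1):
            if rng.random() < 0.5:
                examples.append((doc[i], doc[i + 1], 1))        # true next sentence
            else:
                examples.append((doc[i], rng.choice(others), 0))  # random sentence
    return examples

documents = [
    ["a1", "a2", "a3"],
    ["b1", "b2", "b3", "b4"],
    ["c1", "c2"],
]
examples = make_nsp_examples(documents, random.Random(0))
for sentence_a, sentence_b, label in examples:
    print(sentence_a, sentence_b, label)
```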

◎ Skip-Thought and Quick-Thought

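「In 2018, building on Skip-thoughts, Logeswaran et al. of Google Brain refined the idea further. They argued that the Skip-thoughts decoder is too inefficient and cannot be trained well on large corpora (a common weakness of RNN architectures). So they turned the Skip-thoughts generation task into a classification task: sentence pairs from the same context window are labeled positive, pairs not from the same context window are labeled negative, and the model is asked to judge whether each pair comes from the same context window. This is clearly a classification task. Arguably BERT, which appeared only a few months later, uses exactly this idea, and all of these methods descend from Skip-thoughts.」[11].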

-----

Legend:

# basic
// advanced

-----

Paper

# Word2vec

Le, Quoc, and Tomas Mikolov. "Distributed representations of sentences and documents." International Conference on Machine Learning. 2014.

http://proceedings.mlr.press/v32/le14.pdf

# GloVe

Pennington, Jeffrey, Richard Socher, and Christopher Manning. "Glove: Global vectors for word representation." Proceedings of the 2014 conference on empirical methods in natural language processing (EMNLP). 2014.

https://nlp.stanford.edu/pubs/glove.pdf

# fastText

Bojanowski, Piotr, et al. "Enriching word vectors with subword information." Transactions of the Association for Computational Linguistics 5 (2017): 135-146.

https://www.mitpressjournals.org/doi/pdfplus/10.1162/tacl_a_00051

# ELMo

Peters, Matthew E., et al. "Deep contextualized word representations." arXiv preprint arXiv:1802.05365 (2018).

https://arxiv.org/pdf/1802.05365.pdf

# AWD-LSTM

Merity, Stephen, Nitish Shirish Keskar, and Richard Socher. "Regularizing and optimizing LSTM language models." arXiv preprint arXiv:1708.02182 (2017).

https://arxiv.org/pdf/1708.02182.pdf

# ULMFiT

Howard, Jeremy, and Sebastian Ruder. "Universal language model fine-tuning for text classification." Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Vol. 1. 2018.

http://www.aclweb.org/anthology/P18-1031

# OpenAI GPT

Radford, Alec, et al. "Improving language understanding by generative pre-training." URL https://s3-us-west-2.amazonaws.com/openai-assets/research-covers/language-unsupervised/language_understanding_paper.pdf (2018).

https://www.cs.ubc.ca/~amuham01/LING530/papers/radford2018improving.pdf

# BERT

Devlin, Jacob, et al. "Bert: Pre-training of deep bidirectional transformers for language understanding." arXiv preprint arXiv:1810.04805 (2018).

https://arxiv.org/pdf/1810.04805.pdf

-----

References

# BERT
[1] Google AI Blog Open Sourcing BERT State-of-the-Art Pre-training for Natural Language Processing
https://ai.googleblog.com/2018/11/open-sourcing-bert-state-of-art-pre.html

[2] 想研究 NLP，不了解詞嵌入與句嵌入怎麼行？ _ 香港矽谷
https://www.hksilicon.com/articles/1646194

[3] A Review of the Neural History of Natural Language Processing - AYLIEN
http://blog.aylien.com/a-review-of-the-recent-history-of-natural-language-processing/

# Word2vec, GloVe, fastText (shallow); ELMo, ULMFiT, OpenAI GPT, BERT (deep)
[4] 帮AI摆脱“智障”之名，NLP这条路还有多远 - AI科技大本营 - CSDN博客
https://blog.csdn.net/dQCFKyQDXYm3F8rB0/article/details/83754091

# Transformer
[5] Seq2seq pay Attention to Self Attention Part 2(中文版)
https://medium.com/@bgg/seq2seq-pay-attention-to-self-attention-part-2-%E4%B8%AD%E6%96%87%E7%89%88-ef2ddf8597a4

-----

One article that covers pre-trained models end to end

[6] 从Word Embedding到Bert模型—自然语言处理中的预训练技术发展史 - 知乎
https://zhuanlan.zhihu.com/p/49271699

-----

Two articles that explain shallow models and deep models respectively

# word2vec、GloVe、fastText
[7] word2vec、glove和 fasttext 的比较 - sun_brother的博客 - CSDN博客
https://blog.csdn.net/sun_brother/article/details/80327070

# ELMo、ULMFiT、OpenAI GPT
[8]【NLP】语言模型和迁移学习 - 知乎
https://zhuanlan.zhihu.com/p/42618178

-----

A three-part series that explains pre-trained models in detail.

[9] NLP的巨人肩膀（上） - 简书
https://www.jianshu.com/p/fa95963c9abd

[10] NLP的巨人肩膀（中） - 简书
https://www.jianshu.com/p/81dddec296fa

[11] NLP的巨人肩膀（下） - 简书
https://www.jianshu.com/p/922b2c12705b

BERT Applications

2019/01/17

-----

Fig. BERT Applications (image source).

-----

Legend:

# basic
// advanced

-----

Paper

# BERT application 1: text classification

Kant, Neel, et al. "Practical Text Classification With Large Pre-Trained Language Models." arXiv preprint arXiv:1812.01207 (2018).

https://arxiv.org/pdf/1812.01207.pdf

# BERT application 2: question answering

Zhu, Chenguang, Michael Zeng, and Xuedong Huang. "SDNet: Contextualized Attention-based Deep Network for Conversational Question Answering." arXiv preprint arXiv:1812.03593 (2018).

https://arxiv.org/pdf/1812.03593.pdf

# BERT application 3: text generation
Transfer Learning for Style-Specific Text Generation
https://nips2018creativity.github.io/doc/Transfer%20Learning%20for%20Style-Specific%20Text%20Generation.pdf

// Why

Qi, Ye, et al. "When and Why are Pre-trained Word Embeddings Useful for Neural Machine Translation?." arXiv preprint arXiv:1804.06323 (2018).

https://arxiv.org/pdf/1804.06323.pdf

// Common Sense
Trinh, Trieu H., and Quoc V. Le. "Do Language Models Have Common Sense?." (2018).
https://openreview.net/pdf?id=rkgfWh0qKX

-----

References

NLP中的语言模型预训练&微调 - CLOUD - CSDN博客
https://blog.csdn.net/muumian123/article/details/84990765

如何应用 BERT ：Bidirectional Encoder Representations from Transformers - aliceyangxi1987的博客 - CSDN博客
https://blog.csdn.net/aliceyangxi1987/article/details/84403311

# Running BERT is still expensive in hardware.
谷歌终于开源BERT代码：3 亿参数量，机器之心全面解读 _ 机器之心
https://www.jiqizhixin.com/articles/2018-11-01-9

预训练BERT，官方代码发布前他们是这样用TensorFlow解决的 - 知乎
https://zhuanlan.zhihu.com/p/48018623

BERT Explained State of the art language model for NLP
https://towardsdatascience.com/bert-explained-state-of-the-art-language-model-for-nlp-f8b21a9b6270

BERT Tasks

2019/01/27

-----

Fig. BERT Tasks (image source).

-----

11 NLP Tasks

1. GLUE: The General Language Understanding Evaluation (GLUE) benchmark
1.1. MNLI: Multi-Genre Natural Language Inference
1.2. QQP: Quora Question Pairs
1.3. QNLI: Question Natural Language Inference
1.4. SST-2: The Stanford Sentiment Treebank
1.5. CoLA: The Corpus of Linguistic Acceptability
1.6. STS-B: The Semantic Textual Similarity Benchmark
1.7. MRPC: Microsoft Research Paraphrase Corpus
1.8. RTE: Recognizing Textual Entailment

2. SQuAD: The Stanford Question Answering Dataset
3. NER: CoNLL 2003 Named Entity Recognition (NER) dataset
4. SWAG: The Situations With Adversarial Generations (SWAG) dataset

-----

Paper

# BERT

Devlin, Jacob, et al. "Bert: Pre-training of deep bidirectional transformers for language understanding." arXiv preprint arXiv:1810.04805 (2018).

https://arxiv.org/pdf/1810.04805.pdf

-----

BERT

2019/01/17

-----

Fig. BERT (image source: Pixabay).

-----

# BERT

-----

// LeeMeng - 進擊的 BERT：NLP 界的巨人之力與遷移學習

-----

// Add Shine to your Data Science Resume with these 8 Ambitious Projects on GitHub

-----

References

[1] BERT

Devlin, Jacob, et al. "Bert: Pre-training of deep bidirectional transformers for language understanding." arXiv preprint arXiv:1810.04805 (2018).

https://arxiv.org/pdf/1810.04805.pdf

-----

// English

# Illustrated
The Illustrated BERT, ELMo, and co. (How NLP Cracked Transfer Learning) – Jay Alammar – Visualizing machine learning one concept at a time
http://jalammar.github.io/illustrated-bert/

Dissecting BERT Part 1 The Encoder – Dissecting BERT – Medium
https://medium.com/dissecting-bert/dissecting-bert-part-1-d3c3d495cdb3

Dissecting BERT Part 2 BERT Specifics – Dissecting BERT – Medium
https://medium.com/dissecting-bert/dissecting-bert-part2-335ff2ed9c73

Dissecting BERT Appendix The Decoder – Dissecting BERT – Medium
https://medium.com/dissecting-bert/dissecting-bert-appendix-the-decoder-3b86f66b0e5f

BERT Explained State of the art language model for NLP
https://towardsdatascience.com/bert-explained-state-of-the-art-language-model-for-nlp-f8b21a9b6270

How BERT leverage attention mechanism and transformer to learn word contextual relations
https://towardsdatascience.com/how-bert-leverage-attention-mechanism-and-transformer-to-learn-word-contextual-relations-5bbee1b6dbdb

Deconstructing BERT Distilling 6 Patterns from 100 Million Parameters
https://www.kdnuggets.com/2019/02/deconstructing-bert-distilling-patterns-100-million-parameters.html

Are BERT Features InterBERTible
https://www.kdnuggets.com/2019/02/bert-features-interbertible.html

Add Shine to your Data Science Resume with these 8 Ambitious Projects on GitHub
https://www.nekxmusic.com/add-shine-to-your-data-science-resume-with-these-8-ambitious-projects-on-github/

-----

// Simplified Chinese

NLP自然语言处理：文本表示总结 - 下篇（ELMo、Transformer、GPT、BERT）_陈宸的博客-CSDN博客
https://blog.csdn.net/qq_35883464/article/details/100173045

Attention isn’t all you need！BERT的力量之源远不止注意力 - 知乎
https://zhuanlan.zhihu.com/p/58430637

-----

LeeMeng - 進擊的 BERT：NLP 界的巨人之力與遷移學習
https://leemeng.tw/attack_on_bert_transfer_learning_in_nlp.html

-----

// Code

# PyTorch
GitHub - huggingface_pytorch-pretrained-BERT 📖The Big-&-Extending-Repository-of-Transformers Pretrained PyTorch models for Google's BERT, OpenAI GPT & GPT-2, Google_CMU Transformer-XL
https://github.com/huggingface/pytorch-pretrained-BERT

# PyTorch
GitHub - codertimo_BERT-pytorch Google AI 2018 BERT pytorch implementation
https://github.com/codertimo/BERT-pytorch

GPT-1

2019/04/08

-----

Fig. OpenAI GPT (image source).

-----

# BERT

-----

// LeeMeng - 直觀理解 GPT-2 語言模型並生成金庸武俠小說

-----

# GPT-1

-----

-----

// NLP自然语言处理：文本表示总结 - 下篇（ELMo、Transformer、GPT、BERT）_陈宸的博客-CSDN博客

-----

// NLP自然语言处理：文本表示总结 - 下篇（ELMo、Transformer、GPT、BERT）_陈宸的博客-CSDN博客

-----
References

NLP自然语言处理：文本表示总结 - 下篇（ELMo、Transformer、GPT、BERT）_陈宸的博客-CSDN博客
https://blog.csdn.net/qq_35883464/article/details/100173045

LeeMeng - 直觀理解 GPT-2 語言模型並生成金庸武俠小說
https://leemeng.tw/gpt2-language-model-generate-chinese-jing-yong-novels.html

Sunday, December 12, 2021

Transformer（三）：Illustrated

2021/09/01

-----

https://pixabay.com/zh/photos/flash-tesla-coil-experiment-113310/

-----

Outline

-----

1.1 Transformer

https://zhuanlan.zhihu.com/p/338817680

-----

1.2 Input Embedding (Word2vec)

https://zhuanlan.zhihu.com/p/27234078

-----

1.3 Positional Encoding

https://jalammar.github.io/illustrated-transformer/

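The vertical axis is pos and the horizontal axis is i. Even dimensions (2i) use sin and odd dimensions (2i+1) use cos.

1. pos: position, i.e. the position of a word in the sentence.

2. i: dimension index. Dimensions 2i and 2i+1 of the positional encoding vector form one sin/cos pair, so i runs from 0 to dmodel/2 - 1.

3. dmodel: the dimension of the word embedding (768, 512, and so on). The positional encoding has the same dimension as the word vector, so the two can be added.

4. sin(a+b) = sin(a)cos(b) + cos(a)sin(b), and cos(a+b) = cos(a)cos(b) - sin(a)sin(b).

5. Substitute p+k for a+b, where k is the offset of the new position. The encoding at position p+k is then a linear combination of the sin and cos components at position p, with coefficients sin(w_i k) and cos(w_i k), where w_i = 1 / 10000^(2i/dmodel) is the frequency of pair i.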
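The sinusoidal encoding and the linear-combination property from the angle-addition identities can be checked numerically. The sketch below assumes an even `d_model` and the 10000-based frequencies from the Transformer paper; the names `positional_encoding` and `shift` are hypothetical.

```python
import numpy as np

def positional_encoding(max_len, d_model):
    """Return the (max_len, d_model) encoding table and the per-pair frequencies."""
    pos = np.arange(max_len)[:, None]            # (max_len, 1)
    i = np.arange(d_model // 2)[None, :]         # (1, d_model / 2)
    omega = 1.0 / (10000 ** (2 * i / d_model))   # frequency of each sin/cos pair
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(pos * omega)            # even dimensions 2i
    pe[:, 1::2] = np.cos(pos * omega)            # odd dimensions 2i+1
    return pe, omega.ravel()

def shift(pe_p, omega, k):
    """Compute PE(p + k) from PE(p) using only sin(omega*k) and cos(omega*k)."""
    s, c = pe_p[0::2], pe_p[1::2]
    out = np.empty_like(pe_p)
    out[0::2] = s * np.cos(omega * k) + c * np.sin(omega * k)   # sin(a+b)
    out[1::2] = c * np.cos(omega * k) - s * np.sin(omega * k)   # cos(a+b)
    return out

pe, omega = positional_encoding(max_len=50, d_model=16)
print(np.allclose(shift(pe[3], omega, k=7), pe[10]))
```

Because the coefficients depend only on the offset k and not on p, the same linear map moves any position forward by k steps.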

-----

2.1

https://jalammar.github.io/illustrated-transformer/

-----

2.2

https://medium.com/@edloginova/attention-in-nlp-734c6fa9d983

-----

2.3

-----

2.4

-----

2.5

http://speech.ee.ntu.edu.tw/~tlkagk/courses/ML_2019/Lecture/Transformer%20(v5).pdf

-----

2.6

-----

2.7

https://jalammar.github.io/illustrated-transformer/

-----

2.8

https://zhuanlan.zhihu.com/p/75787683

-----

2.9

https://colah.github.io/posts/2015-08-Understanding-LSTMs/

-----

http://proceedings.mlr.press/v37/ioffe15.pdf

-----

2.10

https://mlexplained.com/2018/01/13/weight-normalization-and-layer-normalization-explained-normalization-in-deep-learning-part-2/

-----

2.11

https://zhuanlan.zhihu.com/p/338817680

-----

3.1

https://zhuanlan.zhihu.com/p/338817680

-----

3.2

-----

4.1

https://huggingface.co/transformers/perplexity.html

https://www.zhihu.com/question/50828855

-----

4.2

https://www.aclweb.org/anthology/P02-1040.pdf

-----

4.3

https://www.cnblogs.com/by-dream/p/7679284.html

-----

4.4

# Transformer。

Notes:

「We employ three types of regularization during training:」

Dropout

「Residual Dropout We apply dropout [27] to the output of each sub-layer, before it is added to the sub-layer input and normalized. 」

Dropout

「In addition, we apply dropout to the sums of the embeddings and the positional encodings in both the encoder and decoder stacks. For the base model, we use a rate of P_drop = 0.1.」

Label Smoothing

「Label Smoothing During training, we employed label smoothing of value ε_ls = 0.1 [30]. This hurts perplexity, as the model learns to be more unsure, but improves accuracy and BLEU score.」

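「The network drives itself toward a large gap between the score of the correct label and the scores of the incorrect labels; when the training data cannot represent all sample features, this causes overfitting.」

「Label smoothing was proposed to solve this problem. It first appeared in Inception v2 as a regularization strategy: by "softening" the traditional one-hot labels, it effectively suppresses overfitting when the loss is computed.」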

https://blog.csdn.net/qiu931110/article/details/86684241
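Label smoothing with ε = 0.1 can be sketched as below. The sketch assumes the common variant that spreads ε uniformly over all K classes; other implementations spread it over the K - 1 incorrect classes only. The function names and toy logits are hypothetical.

```python
import numpy as np

def smooth_labels(targets, num_classes, eps=0.1):
    """Soften one-hot targets: (1 - eps) on the one-hot vector plus eps / K everywhere."""
    one_hot = np.eye(num_classes)[np.asarray(targets)]
    return one_hot * (1.0 - eps) + eps / num_classes

def cross_entropy(logits, target_dist):
    """Mean cross-entropy between softmax(logits) and a target distribution."""
    logits = logits - logits.max(axis=-1, keepdims=True)
    log_probs = logits - np.log(np.exp(logits).sum(axis=-1, keepdims=True))
    return float(-(target_dist * log_probs).sum(axis=-1).mean())

targets = [2, 0]
soft = smooth_labels(targets, num_classes=4)

# Very confident (and correct) logits: nearly free under hard labels,
# but penalized under smoothed labels.
logits = np.array([[0.0, 0.0, 10.0, 0.0],
                   [10.0, 0.0, 0.0, 0.0]])
hard_loss = cross_entropy(logits, np.eye(4)[targets])
soft_loss = cross_entropy(logits, soft)
print(soft[0], hard_loss, soft_loss)
```

The smoothed loss stays well above zero even for correct, confident predictions, which is the sense in which the model "learns to be more unsure".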

-----

# Transformer。

-----