The Star Also Rises: Deep Learning Paper Study (4): Deep Learning (2)

2020/11/24

-----

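In Invisible Cities, Marco Polo tells the Great Khan the stories of fifty-five cities, yet every story is really about Venice. Likewise, this post introduces many papers, yet there is really only one, and it is called "Deep Learning".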

-----

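Preface:

This post explains why I selected these twenty deep learning papers as a foundation for deep learning. Paper downloads and brief descriptions follow below, and a selection of related blog posts is at the very bottom. The content focuses on CV and NLP models, with the Embedding series added on the NLP side. Several important topics (Regularization, Normalization, Optimization, Activation Function, Loss Function, and so on) are left out for reasons of space.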

-----

https://pixabay.com/zh/photos/bubble-gum-shoes-glue-dirt-438404/

-----

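Summary:

Computer Vision (CV) [1] and Natural Language Processing (NLP) [2], [3] are the two major applications of Deep Learning. [1] was my main reference after I first encountered LeNet. It is of course outdated now, but at the time it showed me which models to study after LeNet. [2] and [3] skip some important details, but they lay out how NLP models progressed. That is why I put [1], [2], and [3] first. [4] - [27] are blog posts that support the twenty papers.

Of course, you could start with the online machine learning course [28], the machine learning papers [29], the online deep learning course [30], or the deep learning papers in this post. Any of them works; there is no required order.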

-----

◎ 1. Why LeNet and AlexNet?

-----

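LeNet [4] is the classic convolutional neural network: convolution, activation function, pooling, fully connected layers, loss function, gradient descent, and backpropagation are all there. Every other convolutional neural network can be seen as an extension of LeNet. Beginners, and even people who have studied for a while, are often confused about why the activation function must be nonlinear. For this, see Colah's post on nonlinear activation functions [5]: a nonlinear activation function can turn a linearly inseparable problem into a linearly separable one.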
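A minimal sketch of that last point, using XOR as the standard example of a linearly inseparable problem. The weights are hand-picked for illustration rather than trained, and the helper names (`linear`, `xor_net`, `linear_only`) are hypothetical, not taken from [5].

```python
def linear(weights, bias, x):
    """Apply y = W x + b, with W given as a list of rows."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, bias)]


def relu(values):
    """Element-wise ReLU, the nonlinearity used in this sketch."""
    return [max(0.0, v) for v in values]


# Hand-picked (not trained) weights: 2 inputs -> 2 hidden units -> 1 output.
W1 = [[1.0, 1.0], [1.0, 1.0]]
B1 = [0.0, -1.0]
W2 = [[1.0, -2.0]]
B2 = [0.0]


def xor_net(x):
    """Two layers with a ReLU in between: computes XOR on {0, 1} inputs."""
    hidden = relu(linear(W1, B1, x))
    return linear(W2, B2, hidden)[0]


def linear_only(x):
    """The same two layers without the ReLU: the result is a single affine map."""
    return linear(W2, B2, linear(W1, B1, x))[0]


if __name__ == "__main__":
    for point in ([0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0]):
        print(point, xor_net(point), linear_only(point))
```

Any stack of purely linear layers is itself affine, so it always satisfies f(0,0) + f(1,1) = f(0,1) + f(1,0) and cannot output XOR. Putting a ReLU between the layers removes that constraint.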

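AlexNet [6] was the first successful large convolutional neural network after LeNet. It handles larger images than before, and far more of them. It also used many techniques that were new at the time, such as ReLU and Dropout. Dropout is also used in the Transformer.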

-----

◎ 2. Why NIN and GoogLeNet?

-----

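The core of NIN [7] is Conv1, the 1 x 1 convolution [8]. Put simply, it works like a mille-feuille: it can squash many feature maps into fewer, or stretch a few feature maps into more. The ratio? That is learned in training. The most successful application of Conv1 is GoogLeNet [9]. GoogLeNet gets more attention, but in terms of the evolution of deep learning NIN matters somewhat more. GoogLeNet is Inception v1. Label Smoothing from Inception v3 is used in the Transformer.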

-----

◎ 3. Why ResNet and DenseNet?

-----

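ResNet [10]: you could say that today's CNNs are all ResNets (or variants of it), or that they all use identity mapping. It is in effect ensemble learning; that is, a deep ResNet behaves like an average of many shallow networks. DenseNet [10] can be seen as a specialization of NIN. It is also averaging: a repeated averaging of feature maps. Because of the averaging, the plotted loss landscapes are smooth.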

-----

◎ 4. Why FCN and PFPNet?

-----

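FCN [11] comes after the three CNN papers: LeNet, NIN, and ResNet. A CNN mainly classifies images, whereas FCN classifies pixels. Covering FCN before YOLO has an advantage: once you know FCN's semantic segmentation, adding object detection gives you instance segmentation.

PFPNet [12] is an example of panoptic segmentation. Its backbone is FPN [13], on top of which it performs FCN semantic segmentation [11] and Mask R-CNN instance segmentation.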

-----

◎ 5. Why YOLO and Mask R-CNN?

-----

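Many people find it odd that YOLO [14] is the fifth CV paper: why not YOLO v3, or even YOLO v4? Choosing YOLO does not restrict you to reading only YOLO; SSD and YOLO v2 - v4 all perform better than YOLO. So what is good about YOLO? It was the first object detection paper to move from two-stage to one-stage, which makes it a paper that created something from nothing rather than a "better" paper.

Covering Mask R-CNN [15] requires covering Faster R-CNN first. Mask R-CNN is the "first" "reasonably good" instance segmentation paper.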

-----

◎ 6. Why LSTM and Word2vec?

-----

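LSTM [16] can process sequentially ordered data, but a beginner who wants to do NLP may not know that Word2vec [17], [18] has to come first. Word2vec can handle word-meaning and syntactic tasks.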

-----

◎ 7. Why Seq2seq and Paragraph2vec?

-----

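Seq2seq [19] improves on a plain LSTM in that its Encoder-Decoder architecture reads the whole sentence before producing output, which avoids the weakness of interpreting fragments out of context. Because the Word2vec family does somewhat poorly on semantic-level tasks, Paragraph2vec [20] is needed. Skip-thought [20] and Paragraph2vec are both Sentence Embeddings. Skip-thought extends the Seq2seq idea, except that one sentence predicts both the previous sentence and the next one.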

-----

◎ 8. Why Attention and Short Attention?

-----

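Compressing the input into a single vector, as Seq2seq does, is too coarse. With Attention [21], every output word takes all the words of the input sentence into account (the weights are learned in training). Short Attention [22] explicitly splits the word vector into three parts: Query, Key, and Value. K is the index and V is the actual value; Q is harder to grasp. Q is in fact the probability distribution over the next word. The QKV concept is already implicit in Word2vec.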

-----

◎ 9. Why ConvS2S and ELMo?

-----

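ConvS2S [23] is an easily overlooked paper. Both it and the Transformer use the QKV idea, so it can be treated as a warm-up for the Transformer; reading the two together makes both easier to understand.

ELMo [24] continues from Word2vec and Paragraph2vec and is one of the better Context2vec-style approaches. It solves the problem of one word having several meanings.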

-----

◎ 10. Why Transformer and BERT?

-----

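The biggest difference between the Transformer [25] and ConvS2S is that both the Encoder and the Decoder first apply self attention. BERT [26], a pre-trained model based on the Transformer Encoder, is the current classic of NLP. How is it pre-trained? What are the four main tasks? Two figures in [26] explain it all. Finally, BERT NLP Pipeline [27] shows experimentally that the lower layers mainly serve syntactic tasks and the upper layers mainly serve semantic tasks, consistent with traditional NLP and with the Embedding series.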

-----

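Closing remarks:

-----

A summary of what I have learned over five years of studying deep learning

Over the past year, thanks to sponsorship, I read a great many deep learning papers and blog posts. After all that reading I formed some views and picked out the papers that are especially important and the blog posts that are especially good. If you want a solid foundation in deep learning, the paper list in this post can be your starting point!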

-----

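The papers are described below:

-----

This post is organized into ten stages and selects about twenty papers, half CV and half NLP. It "briefly describes" ten classic deep learning papers (LeNet, NIN, ResNet, FCN, YOLO, LSTM, Seq2seq, Attention, ConvS2S, Transformer): the problem each one addresses, how it solves it, and the follow-up research.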

-----

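01A: LeNet. 01B: AlexNet (supplement: Dropout).

02A: NIN (supplement: SENet). 02B: GoogLeNet (supplement: Inception v3, Label smoothing).

03A: ResNet. 03B: DenseNet (supplement: CSPNet, the backbone network of YOLO v4).

04A: FCN. 04B: PFPNet (supplement: Faster R-CNN and what came before it).

05A: YOLO (supplement: YOLO v4). 05B: Mask R-CNN (supplement: Faster R-CNN and what came after it).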

-----

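06A: LSTM. 06B: Word2vec (word meaning, syntax) (supplement: C&W v2).

07A: Seq2seq. 07B: Paragraph2vec (word meaning, syntax, semantics).

08A: Attention. 08B: Short Attention (QKV, a horizontal decomposition).

09A: ConvS2S. 09B: ELMo (a vertical decomposition into word meaning, syntax, and semantics) (supplement: Context2vec).

10A: Transformer. 10B: BERT (a vertical division of labor among word meaning, syntax, and semantics; supplement: BERT NLP Pipeline).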

-----

Computer Vision（CV）

-----

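1. Precursor: HDR. Classic paper: LeNet. Extension: AlexNet.

2. Precursor: ZFNet. Classic paper: NIN. Extension: GoogLeNet.

3. Precursor: VGGNet. Classic paper: ResNet. Extension: DenseNet.

4. Precursor: SDS. Classic paper: FCN. Extension: PFPNet.

5. Precursor: Faster R-CNN. Classic paper: YOLO. Extension: Mask R-CNN.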

-----

Natural Language Processing (NLP)

-----

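6. Precursor: RNN. Classic paper: LSTM. Extension: Word2vec.

7. Precursor: RCTM. Classic paper: Seq2seq. Extension: Paragraph2vec.

8. Precursor: Visual Attention. Classic paper: Attention. Extension: Short Attention.

9. Precursor: GNMT. Classic paper: ConvS2S. Extension: ELMo.

10. Precursor: ULMFiT. Classic paper: Transformer. Extension: BERT.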

-----

Advanced Topics

-----

1. RL

2. Mobile

3. NAS RL

4. Semantic Segmentation

5. Object Detection

6. PCA

7. Normalization

8. MLE

9. GAN

10. BERT Family

-----

◎ 1. Precursor: HDR. Classic paper: LeNet. Extension: AlexNet.

-----

Notes:

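LeNet is the classic CNN: convolution, activation function, pooling, fully connected layers, and loss function are all there. HDR, an earlier version of LeNet, was the first CNN to use backpropagation and has no fully connected layers.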

Much later, AlexNet became the first successful large CNN. It used all sorts of techniques, of which the most important is probably Dropout. Dropout was later used in the Transformer.
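As a rough illustration of Dropout, here is a sketch of the "inverted" variant that is common in current frameworks: surviving activations are scaled up during training so that nothing changes at inference time. This is an assumption made for convenience; the Dropout paper listed below describes the scaling at test time instead. The function name and parameters are hypothetical.

```python
import random


def dropout(activations, drop_prob, training, rng):
    """Inverted dropout: zero each activation with probability drop_prob
    while training, and scale the survivors by 1 / (1 - drop_prob)."""
    if not training or drop_prob == 0.0:
        # At inference time the layer is the identity.
        return list(activations)
    keep_prob = 1.0 - drop_prob
    return [a / keep_prob if rng.random() < keep_prob else 0.0
            for a in activations]


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so repeated runs match
    print(dropout([1.0, 2.0, 3.0, 4.0], 0.5, True, rng))
    print(dropout([1.0, 2.0, 3.0, 4.0], 0.5, False, rng))
```

Because each unit is dropped at random, the network cannot rely on any single activation, which is the intuition behind its use against overfitting.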

-----

# HDR. Cited 3589 times. Handwritten digit recognition; an early neural network architecture with no fully connected layers.

LeCun, Yann, et al. "Handwritten digit recognition with a back-propagation network." Advances in neural information processing systems 2 (1989): 396-404.

https://papers.nips.cc/paper/1989/file/53c3bce66e43be4f209556518c2fcb54-Paper.pdf

# LeNet. Cited 31707 times. The classic convolutional neural network; its main addition over HDR is the fully connected layers.

LeCun, Yann, et al. "Gradient-based learning applied to document recognition." Proceedings of the IEEE 86.11 (1998): 2278-2324.

http://yann.lecun.com/exdb/publis/pdf/lecun-01a.pdf

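# AlexNet. Cited 74398 times. One of the earlier large convolutional neural networks to use GPUs; a leap in performance over prior work; successfully used dropout to avoid overfitting.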

Krizhevsky, Alex, Ilya Sutskever, and Geoffrey E. Hinton. "Imagenet classification with deep convolutional neural networks." Communications of the ACM 60.6 (2017): 84-90.

https://proceedings.neurips.cc/paper/2012/file/c399862d3b9d6b76c8436e924a68c45b-Paper.pdf

# Dropout. Cited 24940 times. Dropout avoids overfitting; the Transformer uses this technique.

Srivastava, Nitish, et al. "Dropout: a simple way to prevent neural networks from overfitting." The journal of machine learning research 15.1 (2014): 1929-1958.

https://www.jmlr.org/papers/volume15/srivastava14a/srivastava14a.pdf

-----

◎ 2. Precursor: ZFNet. Classic paper: NIN. Extension: GoogLeNet.

-----

Notes:

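ZFNet is a fine-tuned AlexNet. NIN adds Conv1 on top of ZFNet to re-fuse feature-map values across channels. Conv1 later came to be used mainly for raising or lowering dimensionality, that is, increasing or decreasing the number of feature maps, and it has become almost standard equipment in deep learning. SENet does not fuse values across channels the way Conv1 does; instead it scales all the values of a single feature map up or down together.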
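To make the channel fusion concrete, here is a small sketch of a 1 x 1 convolution written with plain nested lists. At every pixel the output channels are weighted sums of the input channels, so the weight matrix alone decides how many feature maps come out. The function name, the `[channel][row][column]` layout, and the example weights are assumptions for illustration; bias and activation are omitted.

```python
def conv1x1(feature_maps, weights):
    """1 x 1 convolution.

    feature_maps: nested list indexed [c_in][row][col]
    weights:      nested list indexed [c_out][c_in]
    returns:      nested list indexed [c_out][row][col]
    """
    c_in = len(feature_maps)
    height = len(feature_maps[0])
    width = len(feature_maps[0][0])
    return [
        [
            [sum(w_row[c] * feature_maps[c][i][j] for c in range(c_in))
             for j in range(width)]
            for i in range(height)
        ]
        for w_row in weights
    ]


if __name__ == "__main__":
    # Three 2 x 2 feature maps reduced to two (dimensionality reduction).
    maps = [
        [[1.0, 2.0], [3.0, 4.0]],
        [[10.0, 20.0], [30.0, 40.0]],
        [[100.0, 200.0], [300.0, 400.0]],
    ]
    mix = [[1.0, 1.0, 1.0],   # output map 0: sum of all input maps
           [0.5, 0.0, 0.0]]   # output map 1: half of input map 0
    print(conv1x1(maps, mix))
```

Passing a weight matrix with more rows than input channels raises the number of feature maps instead of lowering it; the spatial size never changes.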

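GoogLeNet is the first large network to apply Conv1 successfully, and is also called Inception v1. Inception v2 is mainly about Batch Normalization; it also splits the 5 x 5 kernel into two 3 x 3 kernels. Inception v3 splits 3 x 3 into 3 x 1 and 1 x 3, and also uses Label Smoothing (LS), which was later used in the Transformer. Inception v4 integrates with ResNet.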
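Label Smoothing can be sketched in a few lines: the one-hot target is mixed with a uniform distribution over the classes, so the true class keeps most but not all of the probability mass. The function name is hypothetical, and the default `epsilon=0.1` is an assumption here, chosen as a typical small value.

```python
def smooth_labels(true_class, num_classes, epsilon=0.1):
    """Return the smoothed target distribution
    (1 - epsilon) * one_hot + epsilon * uniform."""
    uniform_share = epsilon / num_classes
    target = [uniform_share] * num_classes
    target[true_class] += 1.0 - epsilon
    return target


if __name__ == "__main__":
    # Four classes, true class is index 2.
    print(smooth_labels(2, 4))
```

The smoothed target is then used in the cross-entropy loss in place of the one-hot vector, which discourages the network from becoming over-confident.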

-----

# ZFNet. Cited 10795 times. A fine-tuned version of AlexNet and a precursor of VGGNet. Visualization of convolution kernels.

Zeiler, Matthew D., and Rob Fergus. "Visualizing and understanding convolutional networks." European conference on computer vision. Springer, Cham, 2014.

https://arxiv.org/pdf/1311.2901.pdf

# NIN. Cited 4475 times. Fusion across channels (feature maps). Can be used to raise or lower dimensionality (change the number of feature maps).

Lin, Min, Qiang Chen, and Shuicheng Yan. "Network in network." arXiv preprint arXiv:1312.4400 (2013).

https://arxiv.org/pdf/1312.4400.pdf

# SENet. Cited 4780 times. A special case of NIN that scales all the values of each feature map at once.

Hu, Jie, Li Shen, and Gang Sun. "Squeeze-and-excitation networks." Proceedings of the IEEE conference on computer vision and pattern recognition. 2018.

https://openaccess.thecvf.com/content_cvpr_2018/papers/Hu_Squeeze-and-Excitation_Networks_CVPR_2018_paper.pdf

# GoogLeNet. Cited 25849 times. Successfully applies NIN's 1 x 1 convolution in a large network; performs slightly better than VGGNet.

Szegedy, Christian, et al. "Going deeper with convolutions." Proceedings of the IEEE conference on computer vision and pattern recognition. 2015.

https://www.cv-foundation.org/openaccess/content_cvpr_2015/papers/Szegedy_Going_Deeper_With_2015_CVPR_paper.pdf

# Inception v3. Cited 11280 times. Label smoothing avoids overfitting; the Transformer uses this technique.

Szegedy, Christian, et al. "Rethinking the inception architecture for computer vision." Proceedings of the IEEE conference on computer vision and pattern recognition. 2016.

https://www.cv-foundation.org/openaccess/content_cvpr_2016/papers/Szegedy_Rethinking_the_Inception_CVPR_2016_paper.pdf

-----

◎ 3. Precursor: VGGNet. Classic paper: ResNet. Extension: DenseNet.

-----

Notes:

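AlexNet and VGGNet both cite an earlier paper, PreVGGNet. PreVGGNet tried widening the network (more feature maps), which was not very effective. Increasing depth does work, and AlexNet was the first to go deeper. VGG, following ZFNet's shrinking of the first-layer kernel (which improves feature-map resolution), replaces one 5 x 5 kernel with two 3 x 3 kernels and deepens repeatedly to 16 layers, with very good results. Going to 19 layers improves things only slightly, and going deeper still makes results worse.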
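The trade behind "two 3 x 3 instead of one 5 x 5" can be checked with a little arithmetic. The sketch below assumes stride-1 convolutions with the same number of input and output channels and ignores biases; the function names are hypothetical.

```python
def conv_weights(kernel_size, channels):
    """Weights in one kernel_size x kernel_size convolution layer with
    `channels` input and `channels` output feature maps (bias ignored)."""
    return kernel_size * kernel_size * channels * channels


def stacked_receptive_field(kernel_size, num_layers):
    """Receptive field of num_layers stride-1 convolutions stacked in sequence."""
    return 1 + num_layers * (kernel_size - 1)


if __name__ == "__main__":
    channels = 64
    print("receptive field of two 3x3 layers:", stacked_receptive_field(3, 2))
    print("weights, two 3x3 layers:", 2 * conv_weights(3, channels))
    print("weights, one 5x5 layer: ", conv_weights(5, channels))
```

Two stacked 3 x 3 layers see the same 5 x 5 window as a single 5 x 5 layer, use 18 weights per channel pair instead of 25, and add one more nonlinearity along the way.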

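ResNet v1 borrows identity mapping from LSTM and successfully deepens the network to the hundred-layer range, but not to a thousand. ResNet-D adds a dropout-like technique (stochastic depth, per the paper's title) that lets the network reach a thousand layers. ResNet v2 moves the ReLU to obtain a pure identity mapping, which also lets the network reach a thousand layers (without needing dropout). ResNet-E shows that ResNet v2 is in effect ensemble learning: a deep ResNet v2 is an ensemble of many shallow networks. ResNet-V uses visualization to show how ensembling, averaging, smoothness, and ease of training relate to one another.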
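The following sketch shows why a pure identity mapping makes depth harmless: each block only adds a residual to its input, so the final output is the original input plus the sum of all residuals. The residual functions here are toy stand-ins (assumptions), not the convolutional branches of a real ResNet, and `forward` is a hypothetical name.

```python
def forward(x, residual_fns):
    """Run x through a chain of residual blocks, y = x + F(x).

    Returns the final vector and the list of residuals each block added."""
    residuals = []
    for residual_fn in residual_fns:
        r = residual_fn(x)
        residuals.append(r)
        # Identity shortcut: the input passes through untouched and the
        # block's output is simply added on top.
        x = [xi + ri for xi, ri in zip(x, r)]
    return x, residuals


def zero_branch(v):
    """A residual branch that has learned nothing: it outputs zeros."""
    return [0.0] * len(v)


def small_branch(v):
    """A toy residual branch that nudges the signal by 10 percent."""
    return [0.1 * a for a in v]


if __name__ == "__main__":
    signal = [1.0, -2.0, 3.0]
    deep_out, _ = forward(signal, [zero_branch] * 1000)
    print(deep_out)  # unchanged after 1000 blocks
    out, added = forward([1.0], [small_branch] * 3)
    print(out, added)
```

With zero residuals a thousand-block stack is exactly the identity, so extra depth cannot make the signal worse. The sum-of-residuals view is also what makes the "ensemble of shallow paths" reading plausible.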

DenseNet can be seen as a super NIN. It does not use identity mapping, yet its results are close to ResNet's. CSPNet can improve both ResNet and DenseNet and is the backbone network of YOLO v4.

-----

# VGGNet. Cited 47721 times. Uses two conv3 layers in place of one conv5 and deepens the network repeatedly to 16 and 19 layers.

Simonyan, Karen, and Andrew Zisserman. "Very deep convolutional networks for large-scale image recognition." arXiv preprint arXiv:1409.1556 (2014).

https://arxiv.org/pdf/1409.1556.pdf

# ResNet v1. Cited 61600 times. Adds identity mapping, inspired by LSTM; the network can reach the hundred-layer range.

He, Kaiming, et al. "Deep residual learning for image recognition." Proceedings of the IEEE conference on computer vision and pattern recognition. 2016.

https://openaccess.thecvf.com/content_cvpr_2016/papers/He_Deep_Residual_Learning_CVPR_2016_paper.pdf

# ResNet-D. Cited 982 times. A dropout version of ResNet v1; the network can reach a thousand layers.

Huang, Gao, et al. "Deep networks with stochastic depth." European conference on computer vision. Springer, Cham, 2016.

https://arxiv.org/pdf/1603.09382.pdf

# ResNet v2. Cited 4560 times. The focus shifts from the residual block to pure identity mapping; the network can reach a thousand layers.

He, Kaiming, et al. "Identity mappings in deep residual networks." European conference on computer vision. Springer, Cham, 2016.

https://arxiv.org/pdf/1603.05027.pdf

# ResNet-E. Cited 551 times. ResNet v2 is in effect an ensemble of shallow networks.

Veit, Andreas, Michael J. Wilber, and Serge Belongie. "Residual networks behave like ensembles of relatively shallow networks." Advances in neural information processing systems. 2016.

https://papers.nips.cc/paper/2016/file/37bc2f75bf1bcfe8450a1a41c200364c-Paper.pdf

# ResNet-V. Cited 464 times. Ensembling smooths the loss function, which makes training easier.

Li, Hao, et al. "Visualizing the loss landscape of neural nets." Advances in Neural Information Processing Systems. 2018.

https://papers.nips.cc/paper/2018/file/a41b3bb3e6b050b6c9067c67f663b915-Paper.pdf

# DenseNet. Cited 12498 times. Repeated use of conv1 can also deepen the network.

Huang, Gao, et al. "Densely connected convolutional networks." Proceedings of the IEEE conference on computer vision and pattern recognition. 2017.

https://openaccess.thecvf.com/content_cvpr_2017/papers/Huang_Densely_Connected_Convolutional_CVPR_2017_paper.pdf

# CSPNet. Cited 45 times. The backbone of YOLOv4.

Wang, Chien-Yao, et al. "CSPNet: A new backbone that can enhance learning capability of cnn." Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops. 2020.

https://openaccess.thecvf.com/content_CVPRW_2020/papers/w28/Wang_CSPNet_A_New_Backbone_That_Can_Enhance_Learning_Capability_of_CVPRW_2020_paper.pdf

-----

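◎ 4. Precursor: SDS. Classic paper: FCN. Extension: PFPNet.

-----

Notes:

FCN performs semantic segmentation with a fully convolutional network, avoiding SDS's limitation of accepting only fixed-size input images. Semantic segmentation plus object detection gives instance segmentation. PFPNet's panoptic segmentation performs instance segmentation on top of semantic segmentation.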

-----

# SDS. Cited 983 times.

Hariharan, Bharath, et al. "Simultaneous detection and segmentation." European Conference on Computer Vision. Springer, Cham, 2014.

https://www2.eecs.berkeley.edu/Research/Projects/CS/vision/shape/papers/BharathECCV2014.pdf

# FCN. Cited 19356 times.

Long, Jonathan, Evan Shelhamer, and Trevor Darrell. "Fully convolutional networks for semantic segmentation." Proceedings of the IEEE conference on computer vision and pattern recognition. 2015.

https://www.cv-foundation.org/openaccess/content_cvpr_2015/papers/Long_Fully_Convolutional_Networks_2015_CVPR_paper.pdf

# PFPNet. Cited 171 times.

Kirillov, Alexander, et al. "Panoptic feature pyramid networks." Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2019.

https://openaccess.thecvf.com/content_CVPR_2019/papers/Kirillov_Panoptic_Feature_Pyramid_Networks_CVPR_2019_paper.pdf

-----

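◎ 5. Precursor: Faster R-CNN. Classic paper: YOLO. Extension: Mask R-CNN.

-----

Notes:

HOG is an early feature extractor. DPM, an early object detection paper, uses an SVM as the classifier. SS (Selective Search) then introduced region proposals. R-CNN takes the SS proposals, uses a CNN as the feature extractor, and still uses an SVM as the classifier.

Fast R-CNN builds on R-CNN and adopts the idea from SPPNet: features are extracted only once, the SS proposals are then applied, and the classifier changes from an SVM to a CNN. Faster R-CNN replaces SS as well with the CNN-based RPN, cutting the number of proposals from about two thousand to about three hundred. Conceptually, RPN is close to YOLO.

Faster R-CNN is a two-stage object detection algorithm, while YOLO is a one-stage one. YOLO is fast but less accurate. By YOLO v4, which integrates a variety of network architectures and training methods, it is both fast and accurate.

Mask R-CNN adds a mask branch for segmentation on top of Faster R-CNN. Its distinguishing feature is the more precise RoIAlign.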

-----

# Faster R-CNN. Cited 23747 times.

Ren, Shaoqing, et al. "Faster r-cnn: Towards real-time object detection with region proposal networks." Advances in neural information processing systems. 2015.

https://proceedings.neurips.cc/paper/2015/file/14bfa6bb14875e45bba028a21ed38046-Paper.pdf

# YOLO. Cited 12295 times.

Redmon, Joseph, et al. "You only look once: Unified, real-time object detection." Proceedings of the IEEE conference on computer vision and pattern recognition. 2016.

https://www.cv-foundation.org/openaccess/content_cvpr_2016/papers/Redmon_You_Only_Look_CVPR_2016_paper.pdf

# YOLOv4. Cited 253 times.

Bochkovskiy, Alexey, Chien-Yao Wang, and Hong-Yuan Mark Liao. "YOLOv4: Optimal Speed and Accuracy of Object Detection." arXiv preprint arXiv:2004.10934 (2020).

https://arxiv.org/pdf/2004.10934.pdf

# Mask R-CNN. Cited 8887 times.

He, Kaiming, et al. "Mask r-cnn." Proceedings of the IEEE international conference on computer vision. 2017.

https://openaccess.thecvf.com/content_ICCV_2017/papers/He_Mask_R-CNN_ICCV_2017_paper.pdf

-----

◎ 6. Precursor: RNN. Classic paper: LSTM. Extension: Word2vec.

-----

Notes:

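RNN is the simple recurrent neural network. LSTM adds three gates and a pass-through path to mitigate vanishing and exploding gradients. The pass-through path also became the inspiration for ResNet.

LSTM can process audio and speech data. To process text with LSTM, the text must first go through Word Embedding, which compresses one hot encoding into lower-dimensional vectors that capture relatedness. Word2vec, which came after NNLM, is the most classic word-vector algorithm.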

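Word2vec 1 is mainly about CBOW and Skip-gram. CBOW predicts one word from several words; Skip-gram predicts several words from one word. Word2vec 2 is mainly about Hierarchical Softmax and Negative Sampling. Hierarchical Softmax's tree structure greatly simplifies the network, and Negative Sampling greatly reduces the number of training samples. Word2vec 3 is mainly a more approachable presentation of Word2vec 1 and Word2vec 2. After Word2vec processing, word-meaning and syntactic NLP tasks can be carried out.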
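The difference between CBOW and Skip-gram is easiest to see in the training examples each one builds from the same sentence. This sketch only generates the (input, target) pairs; the embedding network, Hierarchical Softmax, and Negative Sampling are left out. The function names and the `window` parameter are assumptions for illustration.

```python
def cbow_examples(tokens, window):
    """CBOW: the surrounding words (input) predict the center word (target)."""
    examples = []
    for i, center in enumerate(tokens):
        context = tokens[max(0, i - window):i] + tokens[i + 1:i + 1 + window]
        examples.append((context, center))
    return examples


def skipgram_examples(tokens, window):
    """Skip-gram: the center word (input) predicts each surrounding word."""
    return [(center, neighbor)
            for context, center in cbow_examples(tokens, window)
            for neighbor in context]


if __name__ == "__main__":
    sentence = ["the", "cat", "sat", "down"]
    print(cbow_examples(sentence, 1))
    print(skipgram_examples(sentence, 1))
```

With a window of 1, "cat" appears once as a CBOW target with the context ["the", "sat"], and twice as a Skip-gram input, once for each neighbor.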

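C&W v2 is also a Word Embedding algorithm. It can perform word-meaning and syntactic NLP tasks, and because some semantic tasks performed poorly, it also introduced a version that takes the whole sentence as input, which improved performance on semantic tasks. This shows that Sentence Embedding is needed in addition to Word Embedding.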

-----

# RNN. Cited 11946 times.

Elman, Jeffrey L. "Finding structure in time." Cognitive science 14.2 (1990): 179-211.

https://cogsci.ucsd.edu/~rik/courses/readings/elman90-fsit.pdf

# LSTM. Cited 39743 times.

Hochreiter, Sepp, and Jürgen Schmidhuber. "Long short-term memory." Neural computation 9.8 (1997): 1735-1780.

http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.676.4320&rep=rep1&type=pdf

# Word2vec 1. Cited 18991 times.

Mikolov, Tomas, et al. "Efficient estimation of word representations in vector space." arXiv preprint arXiv:1301.3781 (2013).

https://arxiv.org/pdf/1301.3781.pdf

# Word2vec 2. Cited 23990 times.

Mikolov, Tomas, et al. "Distributed representations of words and phrases and their compositionality." Advances in neural information processing systems. 2013.

https://papers.nips.cc/paper/2013/file/9aa42b31882ec039965f3c4923ce901b-Paper.pdf

# Word2vec 3. Cited 645 times.

Rong, Xin. "word2vec parameter learning explained." arXiv preprint arXiv:1411.2738 (2014).

https://arxiv.org/pdf/1411.2738.pdf

# C&W v1. Cited 5099 times.

Collobert, Ronan, and Jason Weston. "A unified architecture for natural language processing: Deep neural networks with multitask learning." Proceedings of the 25th international conference on Machine learning. 2008.

http://www.cs.columbia.edu/~smaskey/CS6998-Fall2012/supportmaterial/colbert_dbn_nlp.pdf

# C&W v2. Cited 6841 times. This paper shows why it was necessary to go on from Word2vec to Paragraph2vec.

Collobert, Ronan, et al. "Natural language processing (almost) from scratch." Journal of machine learning research 12.ARTICLE (2011): 2493-2537.

https://www.jmlr.org/papers/volume12/collobert11a/collobert11a.pdf

-----

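◎ 7. Precursor: RCTM. Classic paper: Seq2seq. Extension: Paragraph2vec.

-----

Notes:

An LSTM alone can already do machine translation, but it is hard to avoid interpreting fragments out of context. A better approach is to read the whole sentence, compress it into one vector, and then turn that vector into the target language; this is Seq2seq. Seq2seq is the LSTM version of the Encoder-Decoder architecture. The earlier RCTM already used an Encoder-Decoder architecture, but its Encoder is a CNN and therefore loses temporal information. Seq2seq uses LSTM on both ends, which fixes this.

Paragraph2vec is the sentence-vector version extended from Word2vec word vectors. Conceptually it is close to Word2vec, but the paragraph id takes part in training, so each paragraph or sentence also gets a vector. Whereas word vectors mainly carry word-meaning and syntactic information, sentence vectors additionally carry semantic information.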

-----

# RCTM. Cited 1137 times.

Kalchbrenner, Nal, and Phil Blunsom. "Recurrent continuous translation models." Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing. 2013.

https://www.aclweb.org/anthology/D13-1176.pdf

# Seq2seq 1. Cited 12676 times.

Sutskever, Ilya, Oriol Vinyals, and Quoc V. Le. "Sequence to sequence learning with neural networks." Advances in neural information processing systems. 2014.

http://papers.nips.cc/paper/5346-sequence-to-sequence-learning-with-neural-networks.pdf

# Seq2seq 2. Cited 11284 times.

Cho, Kyunghyun, et al. "Learning phrase representations using RNN encoder-decoder for statistical machine translation." arXiv preprint arXiv:1406.1078 (2014).

https://arxiv.org/pdf/1406.1078.pdf

# Paragraph2vec. Cited 6763 times.

Le, Quoc, and Tomas Mikolov. "Distributed representations of sentences and documents." International conference on machine learning. 2014.

http://proceedings.mlr.press/v32/le14.pdf

-----

◎ 8. Precursor: Visual Attention. Classic paper: Attention. Extension: Short Attention.

-----

Notes:

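The Seq2seq decoder relies on a single vector, so its information is coarse. With Attention, every output word considers weights over all the words (vectors) on the encoder side, so the result is finer-grained and therefore better. As for where the weights come from: training.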
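A sketch of one attention step for a single output word. It assumes a plain dot product as the scoring function, which is a simplification (Attention 1 scores with a small learned network; a dot-product score is one of the options in Attention 2), and it uses hand-written vectors in place of trained encoder and decoder states. All names are hypothetical.

```python
import math


def softmax(scores):
    """Turn raw scores into weights that are positive and sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(decoder_state, encoder_states):
    """Score every encoder state against the decoder state, then return
    the attention weights and the weighted sum (the context vector)."""
    scores = [sum(d * h for d, h in zip(decoder_state, state))
              for state in encoder_states]
    weights = softmax(scores)
    dim = len(encoder_states[0])
    context = [sum(w * state[k] for w, state in zip(weights, encoder_states))
               for k in range(dim)]
    return weights, context


if __name__ == "__main__":
    encoder_states = [[5.0, 0.0], [0.0, 5.0], [0.0, 0.0]]  # one per input word
    weights, context = attend([1.0, 0.0], encoder_states)
    print(weights)
    print(context)
```

Each output word gets its own weights and therefore its own context vector, in contrast to the single fixed vector that plain Seq2seq hands to the decoder.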

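A single word vector in fact already contains three kinds of information: Query, Key, and Value. K can be seen as the index into a dictionary, and V as the actual meaning. Query is the probability distribution over the next word. Short Attention builds on earlier work and splits the Word Embedding into three. The formula: Context = Q op K op V.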
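The "one into three" split can be pictured as slicing a vector into equal parts, each with its own role. This is only a schematic sketch: the function name is hypothetical, the ordering of the three parts is an assumption, and the third part corresponds to what this post calls Q (the part used to predict the next word).

```python
def split_key_value_predict(vector):
    """Split one vector into three equal-length parts:
    key (used for matching), value (used to build the context),
    and predict (used for the next-word distribution)."""
    if len(vector) % 3 != 0:
        raise ValueError("vector length must be divisible by 3")
    third = len(vector) // 3
    key = vector[:third]
    value = vector[third:2 * third]
    predict = vector[2 * third:]
    return key, value, predict


if __name__ == "__main__":
    key, value, predict = split_key_value_predict([1, 2, 3, 4, 5, 6])
    print(key, value, predict)
```

Separating the roles means the part used for matching no longer has to double as the content that is retrieved or as the basis for the prediction.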

-----

# Attention 1. Cited 14895 times.

Bahdanau, Dzmitry, Kyunghyun Cho, and Yoshua Bengio. "Neural machine translation by jointly learning to align and translate." arXiv preprint arXiv:1409.0473 (2014).

https://arxiv.org/pdf/1409.0473.pdf

# Visual Attention. Cited 6060 times.

Xu, Kelvin, et al. "Show, attend and tell: Neural image caption generation with visual attention." International conference on machine learning. 2015.

http://proceedings.mlr.press/v37/xuc15.pdf

# Attention 2. Cited 4781 times.

Luong, Minh-Thang, Hieu Pham, and Christopher D. Manning. "Effective approaches to attention-based neural machine translation." arXiv preprint arXiv:1508.04025 (2015).

https://arxiv.org/pdf/1508.04025.pdf

# Short Attention. Cited 76 times.

Daniluk, Michał, et al. "Frustratingly short attention spans in neural language modeling." arXiv preprint arXiv:1702.04521 (2017).

https://arxiv.org/pdf/1702.04521.pdf

-----

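◎ 9. Precursor: GNMT. Classic paper: ConvS2S. Extension: ELMo.

-----

Notes:

The LSTM-with-Attention architecture reached its limit with the multi-layer GNMT. ConvS2S uses one-dimensional convolution to overcome LSTM's inability to compute in parallel. ConvS2S also introduces the QKV idea.

QKV splits the Context along the width into three parts: Query, Key, and Value. ELMo instead splits the Context along the depth into three layers: word meaning, syntax, and semantics. It pre-trains first, then feeds the concatenated word-meaning, syntax, and semantics vectors into different NLP tasks to train the weights of the three vectors, and can thereby resolve the context issue of one word having several meanings.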

-----

# GNMT. Cited 3391 times.

Wu, Yonghui, et al. "Google's neural machine translation system: Bridging the gap between human and machine translation." arXiv preprint arXiv:1609.08144 (2016).

https://arxiv.org/pdf/1609.08144.pdf

# ConvS2S. Cited 1772 times.

Gehring, Jonas, et al. "Convolutional sequence to sequence learning." arXiv preprint arXiv:1705.03122 (2017).

https://arxiv.org/pdf/1705.03122.pdf

# Context2vec. Cited 312 times.

Melamud, Oren, Jacob Goldberger, and Ido Dagan. "context2vec: Learning generic context embedding with bidirectional lstm." Proceedings of the 20th SIGNLL conference on computational natural language learning. 2016.

https://www.aclweb.org/anthology/K16-1006.pdf

# ELMo. Cited 5229 times. ELMo is the best of the Context2vec-style models.

Peters, Matthew E., et al. "Deep contextualized word representations." arXiv preprint arXiv:1802.05365 (2018).

https://arxiv.org/pdf/1802.05365.pdf

-----

◎ 10. Precursor: ULMFiT. Classic paper: Transformer. Extension: BERT.

-----

Notes:

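GNMT is a multi-layer LSTM structure. ConvS2S is multi-layer one-dimensional convolution with the spirit of QKV. The Transformer's main feature is that it first applies fully connected self attention separately on the encoder side and the decoder side, and then performs encoder-decoder attention between the K and V outputs of the Encoder and the Q of every Decoder layer. It can be seen as the culmination of earlier NLP research.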
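A single-head sketch of the self attention step, in which every position in a sequence attends to every position of the same sequence. The projection matrices would normally be learned; here they are passed in as arguments, and multi-head splitting, masking, and positional encoding are deliberately left out. The helper names are hypothetical.

```python
import math


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]


def softmax(scores):
    """Turn raw scores into weights that are positive and sum to 1."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(x, w_q, w_k, w_v):
    """Scaled dot-product self attention over a sequence x
    (one row per token). Q, K, and V are all projections of x itself."""
    q, k, v = matmul(x, w_q), matmul(x, w_k), matmul(x, w_v)
    k_transposed = [list(col) for col in zip(*k)]
    scale = math.sqrt(len(k[0]))
    scores = matmul(q, k_transposed)  # one row of scores per query token
    weights = [softmax([s / scale for s in row]) for row in scores]
    return matmul(weights, v)


if __name__ == "__main__":
    identity = [[1.0, 0.0], [0.0, 1.0]]
    tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
    for row in self_attention(tokens, identity, identity, identity):
        print(row)
```

Encoder-decoder attention uses the same computation, except that Q comes from the decoder while K and V come from the encoder output.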

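GPT-1 uses the Transformer decoder as its pre-trained model. BERT, building on ELMo's bidirectional structure, uses the Transformer encoder as its pre-trained model. ULMFiT is an earlier pre-trained model, based on LSTM, aimed at document classification.

BERT's input is word vectors, which carry word meaning. BERT NLP Pipeline finds that BERT's lower layers mainly deal with syntax (grammar / context) and its higher layers mainly deal with semantics, consistent with the traditional NLP pipeline.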

-----

# ULMFiT. Cited 1339 times.

Howard, Jeremy, and Sebastian Ruder. "Universal language model fine-tuning for text classification." arXiv preprint arXiv:1801.06146 (2018).

https://arxiv.org/pdf/1801.06146.pdf

# Transformer. Cited 13554 times.

Vaswani, Ashish, et al. "Attention is all you need." Advances in neural information processing systems. 2017.

https://papers.nips.cc/paper/7181-attention-is-all-you-need.pdf

# BERT. Cited 12556 times.

Devlin, Jacob, et al. "Bert: Pre-training of deep bidirectional transformers for language understanding." arXiv preprint arXiv:1810.04805 (2018).

https://arxiv.org/pdf/1810.04805.pdf

# BERT NLP Pipeline. Cited 262 times.

Tenney, Ian, Dipanjan Das, and Ellie Pavlick. "BERT rediscovers the classical NLP pipeline." arXiv preprint arXiv:1905.05950 (2019).

https://arxiv.org/pdf/1905.05950.pdf

-----

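Appendix:

-----

What problem (weakness) left unsolved by earlier papers does each paper address?

1. LeNet (before: HDR). (Weakness: poor performance.)

2. NIN (before: ZFNet). (Weakness: poor performance.)

3. ResNet (before: VGGNet). (Weakness: the network could not keep getting deeper; poor performance.)

4. FCN (before: SDS). (Weakness: could not handle images of arbitrary size; poor performance.)

5. YOLO (before: Faster R-CNN). (Weakness: too slow.)

6. LSTM (before: RNN). (Weakness: vanishing and exploding gradients, which LSTM does not truly solve either; passing information over long distances remains hard.)

7. Seq2seq 1, 2 (before: LSTM, RCTM). (Weakness: LSTM interprets fragments out of context; RCTM loses temporal order.)

8. Attention 1, 2 (before: Seq2seq, Visual Attention). (Weakness: Seq2seq's single vector is not fine-grained enough.)

9. ConvS2S (before: Attention, GNMT). (Weakness: LSTM cannot compute in parallel the way convolution can; Attention is not fine-grained enough, and QKV is finer.)

10. Transformer (before: ConvS2S, ULMFiT). (Weakness: QKV is not fine-grained enough; applying self-attention on both sides first is finer.)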

-----

References

# CV

[1] 深度學習 : Caffe 之經典模型詳解與實戰 | 天瓏網路書店

https://www.tenlong.com.tw/products/9787121301186

# NLP (Part 1)

[2] Seq2seq pay Attention to Self Attention: Part 1(中文版) | by Ta-Chun (Bgg/Gene) Su | Medium

https://medium.com/@bgg/seq2seq-pay-attention-to-self-attention-part-1-%E4%B8%AD%E6%96%87%E7%89%88-2714bbd92727

# NLP (Part 2)

[3] Seq2seq pay Attention to Self Attention: Part 2(中文版) | by Ta-Chun (Bgg/Gene) Su | Medium

https://medium.com/@bgg/seq2seq-pay-attention-to-self-attention-part-2-%E4%B8%AD%E6%96%87%E7%89%88-ef2ddf8597a4

-----

# LeNet

[4] Review: LeNet-1, LeNet-4, LeNet-5, Boosted LeNet-4 (Image Classification) | by Sik-Ho Tsang | Medium

https://sh-tsang.medium.com/paper-brief-review-of-lenet-1-lenet-4-lenet-5-boosted-lenet-4-image-classification-1f5f809dbf17

# Non-linear activation function

[5] Neural Networks, Manifolds, and Topology -- colah's blog

http://colah.github.io/posts/2014-03-NN-Manifolds-Topology/

# AlexNet

[6] Review: AlexNet, CaffeNet — Winner of ILSVRC 2012 (Image Classification) | by Sik-Ho Tsang | Coinmonks | Medium

https://medium.com/coinmonks/paper-review-of-alexnet-caffenet-winner-in-ilsvrc-2012-image-classification-b93598314160

# NIN

[7] Review: NIN — Network In Network (Image Classification) | by Sik-Ho Tsang | Towards Data Science

https://towardsdatascience.com/review-nin-network-in-network-image-classification-69e271e499ee

# Conv1

[8] CNN网络中的 1 x 1 卷积是什么？_AI小作坊的博客-CSDN博客

https://blog.csdn.net/zhangjunhit/article/details/55101559

# GoogLeNet

[9] Review: GoogLeNet (Inception v1)— Winner of ILSVRC 2014 (Image Classification) | by Sik-Ho Tsang | Coinmonks | Medium

https://medium.com/coinmonks/paper-review-of-googlenet-inception-v1-winner-of-ilsvlc-2014-image-classification-c2b3565a64e7

# ResNet and DenseNet

[10] An Overview of ResNet and its Variants | by Vincent Fung | Towards Data Science

https://towardsdatascience.com/an-overview-of-resnet-and-its-variants-5281e2f56035

# FCN

[11] Review: FCN — Fully Convolutional Network (Semantic Segmentation) | by Sik-Ho Tsang | Towards Data Science

https://towardsdatascience.com/review-fcn-semantic-segmentation-eb8c9b50d2d1

# PFPNet

[12] PFPNet 算法笔记_AI之路-CSDN博客

https://blog.csdn.net/u014380165/article/details/82468725

# FPN

[13] Understanding Feature Pyramid Networks for object detection (FPN) | by Jonathan Hui | Medium

https://jonathan-hui.medium.com/understanding-feature-pyramid-networks-for-object-detection-fpn-45b227b9106c

# YOLO

[14] What do we learn from single shot object detectors (SSD, YOLOv3), FPN & Focal loss (RetinaNet)? | by Jonathan Hui | Medium

https://jonathan-hui.medium.com/what-do-we-learn-from-single-shot-object-detectors-ssd-yolo-fpn-focal-loss-3888677c5f4d

# Mask R-CNN

[15] Image segmentation with Mask R-CNN | by Jonathan Hui | Medium

https://jonathan-hui.medium.com/understanding-feature-pyramid-networks-for-object-detection-fpn-45b227b9106c

-----

# LSTM

[16] Understanding LSTM Networks -- colah's blog

http://colah.github.io/posts/2015-08-Understanding-LSTMs/

# Word2vec

[17] Word2Vec Tutorial - The Skip-Gram Model · Chris McCormick

http://mccormickml.com/2016/04/19/word2vec-tutorial-the-skip-gram-model/

# Hierarchical Softmax

[18] Approximating the Softmax for Learning Word Embeddings

https://ruder.io/word-embeddings-softmax/

# Seq2seq

[19] Word Level English to Marathi Neural Machine Translation using Encoder-Decoder Model | by Harshall Lamba | Towards Data Science

https://towardsdatascience.com/word-level-english-to-marathi-neural-machine-translation-using-seq2seq-encoder-decoder-lstm-model-1a913f2dc4a7

# Paragraph2vec and Skip-thought

[20] Meanings are Vectors - Seeking Wisdom

http://sanjaymeena.io/tech/word-embeddings/

# Attention

[21] Attention? Attention!

https://lilianweng.github.io/lil-log/2018/06/24/attention-attention.html

# Short Attention

[22] Attention in NLP. In this post, I will describe recent… | by Kate Loginova | Medium

https://medium.com/@edloginova/attention-in-nlp-734c6fa9d983

# ConvS2S

[23] Understanding incremental decoding in fairseq – Telesens

https://www.telesens.co/2019/04/21/understanding-incremental-decoding-in-fairseq/

# ELMo

[24] Learn how to build powerful contextual word embeddings with ELMo

https://medium.com/saarthi-ai/elmo-for-contextual-word-embedding-for-text-classification-24c9693b0045

# Transformer

[25] The Illustrated Transformer – Jay Alammar – Visualizing machine learning one concept at a time.

https://jalammar.github.io/illustrated-transformer/

# BERT

[26] LeeMeng - 進擊的 BERT：NLP 界的巨人之力與遷移學習

https://leemeng.tw/attack_on_bert_transfer_learning_in_nlp.html

# BERT NLP Pipeline

[27] 《BERT Rediscovers the Classical NLP Pipeline》阅读笔记 - 知乎

https://zhuanlan.zhihu.com/p/70757539

-----

[28] The Star Also Rises: Deep Learning Paper Study (1): Machine Learning (1)

http://hemingwang.blogspot.com/2020/12/hsuan-tien-lin.html

[29] The Star Also Rises: Deep Learning Paper Study (2): Machine Learning (2)

http://hemingwang.blogspot.com/2020/12/problem.html

[30] The Star Also Rises: Deep Learning Paper Study (3): Deep Learning (1)

http://hemingwang.blogspot.com/2020/11/hung-yi-lee.html

-----

The Star Also Rises

Monday, January 25, 2021
