The Star Also Rises: June 2017

AI from Scratch (29): GoogLeNet

2017/06/12

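Preface:

In GoogLeNet, the 1 x 1 convolution is important and not hard to understand. Its theoretical basis, the optimization of sparse networks, is the real core. Reading this far, I find that the amount of mathematics keeps growing!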

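Summary:

GoogLeNet is Google's paper paying homage to LeNet. This post discusses the GoogLeNet architecture [1].

The idea comes from the concept of the cerebral cortex [2]. In practice, a large number of 1 x 1 convolutions reduce dimensions so that the computational cost of a very deep architecture drops to a feasible level [3], [4]. The mathematical foundation comes from [5].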

-----

http://celebnetworth.wiki/wp-content/uploads/2014/04/Google-Net-Worth.jpg

-----

Questions

Q1: Multiple scales
Q2: Dimension reduction
Q3: Sparsity
Q4: Structure
Q5: Concatenation
Q6: Convolution
Q7: LRN, dropout and softmax

This post explains GoogLeNet by answering seven questions.

-----

Q1: Multiple scales

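GoogLeNet introduces the Inception concept: 1 x 1, 3 x 3, and 5 x 5 convolutions are packaged together with 3 x 3 max pooling in a single module (see Fig. 1.1a). This module is then used repeatedly for feature extraction.

The concept comes from earlier vision research that adopted a cortex-like architecture [2], in which different neurons in the cortex respond to different scales (see Fig. 1.2). One reason for using small filters is their lower computational cost. Another is that a large filter composed of small filters has more nonlinearity. Many well-performing large networks are built from small filters.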
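
The point about building a large filter from small ones can be checked with a little arithmetic. The sketch below assumes stride-1 convolutions and counts weights for a single input and output channel; the function names are illustrative and do not come from [1].

```python
def stacked_receptive_field(kernel_sizes):
    """Receptive field of stride-1 convolutions applied one after another."""
    field = 1
    for k in kernel_sizes:
        field += k - 1  # each k x k layer widens the field by k - 1
    return field


def stacked_weight_count(kernel_sizes):
    """Weights per channel pair, ignoring biases (an illustrative simplification)."""
    return sum(k * k for k in kernel_sizes)


# Two stacked 3 x 3 filters see the same 5 x 5 window as one 5 x 5 filter,
# with fewer weights and one extra nonlinearity in between.
print(stacked_receptive_field([3, 3]), stacked_weight_count([3, 3]))  # 5 18
print(stacked_receptive_field([5]), stacked_weight_count([5]))        # 5 25
```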

Fig. 1.1. Inception module, p. 4 [1].

Fig. 1.2a. Multiple scales, p. 2 [1].

Fig. 1.2b. Multiple scales, p. 414 [2].

-----

Q2: Dimension reduction

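Because the computational cost grows sharply as the network extends to deeper layers, 1 x 1 convolutions are used for dimension reduction [3], [4] (see Fig. 1.3).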
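
A rough count of multiply-adds shows how much the 1 x 1 reduction saves. This sketch uses the 5 x 5 branch of Inception 3(a), with the filter counts read from Table 1 in [1]; it assumes stride 1 and padding that keeps the 28 x 28 size, and it ignores biases. The helper name is illustrative.

```python
def conv_multiply_adds(height, width, kernel, in_channels, out_channels):
    """Multiply-adds of a stride-1 convolution that keeps the spatial size."""
    return height * width * kernel * kernel * in_channels * out_channels


# 5 x 5 branch of Inception 3(a): 28 x 28 x 192 input, 16 reduce filters, 32 outputs.
direct = conv_multiply_adds(28, 28, 5, 192, 32)
reduced = (conv_multiply_adds(28, 28, 1, 192, 16)
           + conv_multiply_adds(28, 28, 5, 16, 32))

print(direct)                      # 120422400
print(reduced)                     # 12443648
print(round(direct / reduced, 1))  # 9.7
```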

Fig. 1.3. Dimension reduction, p. 2 [1].

-----

Q3: Sparsity

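The theoretical basis is a rigorous mathematical proof [5]: if the probability distribution of a dataset can be represented by a large, sparse deep network, then an optimal network topology can be constructed by analyzing the activations of the preceding layer and clustering neurons with highly correlated outputs.

The paragraph above is roughly a translation of Fig. 1.4. Put simply, it means continual refinement, which is what CNNs have traditionally done [6], [7]: features are extracted by repeated convolution and pooling. The difference is that the Inception architecture, following the Hebbian principle, packages filters of different scales together with pooling and then reuses the module.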
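
To make "clustering neurons with highly correlated outputs" concrete, here is a toy sketch. It is not the construction from [5]; it is a greedy grouping by Pearson correlation, and the activations, unit names, and the 0.9 threshold are all hypothetical.

```python
import math


def pearson(xs, ys):
    """Pearson correlation of two equally long activation sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def cluster_by_correlation(activations, threshold=0.9):
    """Greedily group units whose activations correlate with a group's first unit."""
    groups = []
    for unit, values in activations.items():
        for group in groups:
            if pearson(values, activations[group[0]]) >= threshold:
                group.append(unit)
                break
        else:
            groups.append([unit])  # no existing group matched
    return groups


# Hypothetical activations of four units over five inputs.
activations = {
    "u0": [0.1, 0.9, 0.2, 0.8, 0.1],
    "u1": [0.2, 1.0, 0.3, 0.9, 0.2],  # u0 shifted by 0.1, so perfectly correlated
    "u2": [0.9, 0.1, 0.8, 0.2, 0.9],
    "u3": [0.8, 0.0, 0.7, 0.1, 0.8],  # u2 shifted by -0.1, so perfectly correlated
}
print(cluster_by_correlation(activations))  # [['u0', 'u1'], ['u2', 'u3']]
```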

Fig. 1.4. Sparsity, p. 2 [1].

Fig. 1.5. Hebbian principle, p. 3 [1].

Fig. 1.6a. The Inception architecture started out as a case study for assessing the hypothetical output and covering the hypothesized outcome, p. 3 [1].

Fig. 1.6b. The Inception architecture started out as a case study for assessing the hypothetical output and covering the hypothesized outcome, p. 3 [1].

-----

Fig. 2.1. Module 1, p. 6 [1].

Fig. 2.2. Module 2, p. 6 [1].

Fig. 2.3. Module 3, p. 6 [1].

Fig. 2.4. Module 4, p. 6 [1].

Fig. 2.5. Module 5, p. 6 [1].

Fig. 2.6. Module 6, p. 6 [1].

-----

Q4: Structure

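Figs. 2.1 to 2.6 break GoogLeNet down into its parts. Next, the architecture of GoogLeNet is explained using Fig. 3.5.

First, #3x3 reduce and #5x5 reduce are the numbers of 1 x 1 convolutions used for dimension reduction ahead of the 3 x 3 and 5 x 5 convolutions, respectively (see Fig. 3.1a). Likewise, pool proj is the number of 1 x 1 convolutions, also used for dimension reduction, that follow the max pooling (see Fig. 3.1b).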

Fig. 3.1a. Number of 1 x 1 filters, p. 4 [1].

Fig. 3.1b. Pool proj, p. 4 [1].

-----

The input image is 224x224 with three RGB channels (see Fig. 3.2). The paper does not specifically explain how the input is processed.

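After the first convolution layer (stride = 2), the size becomes 112x112 with a depth of 64 (there are 64 7x7 convolutions). After max pooling with stride = 2, the size becomes 56x56x64. After the second convolution stage and its max pooling, the size becomes 28x28x192, which is the input to Inception 3(a) (see Fig. 3.3a).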
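
The sizes above follow from the usual output-size formula. The sketch below is one way to reproduce them; the padding of 3 for the 7x7 convolution, the padding of 1 for the 3x3 convolution, and the round-up rule for pooling are assumptions chosen to match the sizes in the paper's table, since the text here does not state them.

```python
import math


def output_size(size, kernel, stride, padding=0, ceil_mode=False):
    """Spatial output size of a convolution or pooling layer."""
    span = (size + 2 * padding - kernel) / stride
    return (math.ceil(span) if ceil_mode else math.floor(span)) + 1


size = 224
size = output_size(size, kernel=7, stride=2, padding=3)       # conv 7x7, stride 2
print(size)  # 112
size = output_size(size, kernel=3, stride=2, ceil_mode=True)  # max pool 3x3, stride 2
print(size)  # 56
size = output_size(size, kernel=3, stride=1, padding=1)       # conv 3x3, stride 1
print(size)  # 56
size = output_size(size, kernel=3, stride=2, ceil_mode=True)  # max pool 3x3, stride 2
print(size)  # 28
```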

-----

Q5: Concatenation

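Next comes an important part: the outputs have to be concatenated (see Fig. 3.3b). In note * - ****, the orders are m531, 531m, 53m1, and 531m (m presumably stands for max pooling, and 5, 3, and 1 for the 5 x 5, 3 x 3, and 1 x 1 convolutions), so they are not consistent. Figs. 3.3c and 3.3d discuss this using the order listed in the paper's table.

First, highly correlated outputs should sit together, so 1 x 1 goes next to 3 x 3, and 3 x 3 next to 5 x 5. Also, the less important outputs should be placed in front, because the 1 x 1 dimension reduction would discard feature maps starting from the front. Is max pooling more or less important? Since it has fewer maps, placing it at the back seems more suitable, because scarcity makes things valuable. These are only very superficial guesses, though. Whether the order really affects the result is also worth investigating!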
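
A small sketch of what concatenation does to the depth, using the branch output counts of Inception 3(a) and 3(b) from Table 1 in [1]. The dictionary keys and helper names are illustrative. The order changes which channel indices a branch occupies, but not the total depth.

```python
# Output channels per branch, in the column order of the paper's table:
# 1 x 1, 3 x 3, 5 x 5, pool proj.
inception_3a = {"1x1": 64, "3x3": 128, "5x5": 32, "pool_proj": 32}
inception_3b = {"1x1": 128, "3x3": 192, "5x5": 96, "pool_proj": 64}


def concat_depth(branches):
    """Depth after concatenating all branch outputs along the channel axis."""
    return sum(branches.values())


def channel_slices(branches, order):
    """Channel range each branch occupies for a given concatenation order."""
    slices, start = {}, 0
    for name in order:
        slices[name] = (start, start + branches[name])
        start += branches[name]
    return slices


print(concat_depth(inception_3a))  # 256
print(concat_depth(inception_3b))  # 480
# The "m531" order: pooling first, then 5 x 5, 3 x 3, 1 x 1.
print(channel_slices(inception_3a, ["pool_proj", "5x5", "3x3", "1x1"]))
```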

Fig. 3.2. Input image size, p. 4 [1].

Fig. 3.3a. Output size, p. 5 [1].

Fig. 3.3b. Single output vector, p. 5 [1].

Fig. 3.3c. Inception 3.

Fig. 3.3d. Inception modules.

-----

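The paper states that the ratio of 3 x 3 and 5 x 5 convolutions should increase in higher layers (see Fig. 3.3e). We can see that, moving toward the top within Inception 3, 4, and 5, the numbers of 3 x 3 and 5 x 5 convolutions increase, but the ratio does not necessarily do so (see Figs. 3.3d and 3.3f).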
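
The count-versus-ratio observation can be recomputed from Table 1 in [1]. The definition of the ratio used here, the share of the output depth produced by the 3 x 3 and 5 x 5 branches, is an assumption; the definition behind Fig. 3.3f may differ.

```python
# (name, #1x1, #3x3, #5x5, pool proj) output filters per module, from Table 1 in [1].
MODULES = [
    ("3a", 64, 128, 32, 32),
    ("3b", 128, 192, 96, 64),
    ("4a", 192, 208, 48, 64),
    ("4b", 160, 224, 64, 64),
    ("4c", 128, 256, 64, 64),
    ("4d", 112, 288, 64, 64),
    ("4e", 256, 320, 128, 128),
    ("5a", 256, 320, 128, 128),
    ("5b", 384, 384, 128, 128),
]


def spatial_ratio(n1, n3, n5, pool):
    """Share of the output depth produced by the 3 x 3 and 5 x 5 branches."""
    return (n3 + n5) / (n1 + n3 + n5 + pool)


# Print module name, count of 3 x 3 plus 5 x 5 filters, and their share.
for name, n1, n3, n5, pool in MODULES:
    print(name, n3 + n5, round(spatial_ratio(n1, n3, n5, pool), 3))
```

Under this definition, going from 3a to 3b the count rises from 160 to 288 while the share drops from 0.625 to 0.6.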

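In the top modules of Inception 3, 4, and 5 (3b, 4e, and 5b), the 1 x 1 convolutions are used to increase the dimension, playing the same role as some of the 1 x 1 convolutions in ResNet [4]. The depth of each of these top modules is close to that of the next layer (see Fig. 3.3d).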

Fig. 3.3e. Ratio of convolutions, p. 5 [1].

Fig. 3.3f. Ratio of convolutions.

-----

Q6: Convolution

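Going from the 1 x 1 reduction to the 3 x 3 or 5 x 5 convolution, the number of feature maps does not necessarily grow by an integer multiple, and the paper gives no details (see Fig. 3.4a). LeNet offers a reference here: how 6 feature maps are expanded to 16 (see Figs. 3.4b and 3.4c).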
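
One reason the counts need not be multiples of each other is that, in a standard convolution layer, every output map sums over all input maps, so the number of output maps is simply the number of filters. The sketch below is a minimal pure-Python version with full connections; LeNet's C3 layer in [6] instead uses a sparse connection table. The sizes and all-ones values are hypothetical.

```python
def conv2d_valid(inputs, weights):
    """Unpadded stride-1 convolution (cross-correlation, as in most CNN libraries).

    inputs:  [in_channels][height][width]
    weights: [out_channels][in_channels][kernel][kernel]
    """
    kernel = len(weights[0][0])
    out_h = len(inputs[0]) - kernel + 1
    out_w = len(inputs[0][0]) - kernel + 1
    outputs = []
    for filt in weights:  # one filter per output map
        out_map = [[0.0] * out_w for _ in range(out_h)]
        for channel, plane in enumerate(inputs):  # sum over every input map
            for i in range(out_h):
                for j in range(out_w):
                    out_map[i][j] += sum(
                        plane[i + di][j + dj] * filt[channel][di][dj]
                        for di in range(kernel)
                        for dj in range(kernel)
                    )
        outputs.append(out_map)
    return outputs


# Hypothetical sizes: 6 input maps of 5 x 5, 16 filters of 3 x 3, all values 1.
inputs = [[[1.0] * 5 for _ in range(5)] for _ in range(6)]
weights = [[[[1.0] * 3 for _ in range(3)] for _ in range(6)] for _ in range(16)]
outputs = conv2d_valid(inputs, weights)
print(len(outputs), len(outputs[0]), len(outputs[0][0]))  # 16 3 3
print(outputs[0][0][0])  # 54.0
```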

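As for the feature map shrinking after a convolution, padding can keep the map at a fixed size so that the outputs can be concatenated and passed to the next layer (see Fig. 3.4d).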
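
A short check of the padding needed so that the 1 x 1, 3 x 3, and 5 x 5 branches all keep the same spatial size. It assumes stride 1 and odd kernel sizes; the 28 x 28 input size and the function names are only for illustration.

```python
def same_padding(kernel):
    """Padding per side that keeps the size for an odd kernel at stride 1."""
    return (kernel - 1) // 2


def padded_output_size(size, kernel, padding):
    """Stride-1 output size with the given padding per side."""
    return size + 2 * padding - kernel + 1


for kernel in (1, 3, 5):
    padding = same_padding(kernel)
    print(kernel, padding, padded_output_size(28, kernel, padding))
# 1 0 28
# 3 1 28
# 5 2 28
```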

Fig. 3.4a. Convolutions, p. 5 [1].

Fig. 3.4b. Convolutions from 6 to 16, p. 7 [6].

Fig. 3.4c. Convolutions from 6 to 16, p. 8 [6].

Fig. 3.4d. Padding, p. 13 [8].

-----

Q7: LRN, dropout and softmax

Fig. 3.5 gives a more complete description of the parameters. For LRN and dropout, see the earlier introduction to AlexNet [9], [10]. For softmax, see [11].
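
For reference, softmax turns the final scores into probabilities that sum to 1. This is a minimal sketch with hypothetical scores; subtracting the maximum before exponentiating is a common numerical-stability step, not something taken from [1] or [11].

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of scores."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]  # subtracting the max avoids overflow
    total = sum(exps)
    return [e / total for e in exps]


probs = softmax([2.0, 1.0, 0.1])
print([round(p, 3) for p in probs])  # [0.659, 0.242, 0.099]
```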

Fig. 3.5. GoogLeNet incarnation of the Inception architecture, p. 5 [1].

-----

Conclusion:

GoogLeNet is cleverly designed and worth savoring slowly!

-----

References

[1] 2015_Going deeper with convolutions

[2] 2007_Robust object recognition with cortex-like mechanisms

[3] 2014_Network in network

[4] AI from Scratch (28): Network in Network (original title: AI從頭學（二八）：Network in Network)

[5] 2014_Provable bounds for learning some deep representations

[6] 1998_Gradient-Based Learning Applied to Document Recognition

[7] AI from Scratch (12): LeNet (original title: AI從頭學（一二）：LeNet)

[8] 2016_A guide to convolution arithmetic for deep learning

[9] 2012_Imagenet classification with deep convolutional neural networks

[10] AI from Scratch (27): AlexNet (original title: AI從頭學（二七）：AlexNet)

[11] Lab DRL_04: Caffe Network Definition (original title: Lab DRL_04：Caffe網絡定義)
