AlexNet (3): Illustrated
2020/12/31
-----
Preface:
Beyond the model design itself (including ReLU), the three most important topics in AlexNet are dropout, momentum, and weight decay. On a first reading, momentum and weight decay are easy to overlook.
-----
https://pixabay.com/zh/photos/berlin-tv-tower-skyline-alex-4001319/
-----
2 The Dataset
-----
3.1 ReLU Nonlinearity
-----
Fig. 25. Pictorial representation of Rectified Linear Unit (ReLU)
# History DL
-----
Figure 1: A four-layer convolutional neural network with ReLUs (solid line) reaches a 25% training error rate on CIFAR-10 six times faster than an equivalent network with tanh neurons (dashed line). The learning rates for each network were chosen independently to make training as fast as possible. No regularization of any kind was employed. The magnitude of the effect demonstrated here varies with network architecture, but networks with ReLUs consistently learn several times faster than equivalents with saturating neurons.
# AlexNet
Notes:
ReLU converges faster than tanh.
-----
Fig. 26. Diagram for (a) Leaky ReLU (b) Exponential Linear Unit (ELU)
# History DL
Notes:
1. Sigmoid
2. tanh (Hyperbolic Tangent)
3. ReLU (Rectified Linear Unit)
4. Leaky ReLU and Parametric ReLU
5. ELU (Exponential Linear Unit)
1. Sigmoid: three main drawbacks: vanishing gradients (its derivative is at most 0.25), non-zero-centered outputs, and a costly exponential operation.
2. tanh: fixes the non-zero-centered output.
3. ReLU (not differentiable everywhere; a subgradient can be used). Pros: no saturation in the positive region (which addresses vanishing gradients), cheap to compute, fast convergence. Cons: outputs are not zero-centered; dead ReLUs can be mitigated with Xavier initialization.
4. Leaky ReLU and Parametric ReLU: the zero output for x < 0 is replaced with 0.01x or ax, where a is learned. They avoid dead ReLUs and are better in theory, though not necessarily in practice.
5. ELU (Exponential Linear Unit): avoids ReLU's drawbacks but is also slow (exponential). Better in theory, not necessarily in practice.
A minimal numpy sketch of these activations follows the links below.
https://zhuanlan.zhihu.com/p/25110450
http://howrudatou.blogspot.com/2016/12/mlcs231n-training-neural-networks-1.html
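A minimal numpy sketch of the activations listed above (0.01 for Leaky ReLU and α = 1.0 for ELU are common defaults, assumed here for illustration):

```python
import numpy as np

# Minimal numpy versions of the activations discussed above.
def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))      # derivative peaks at 0.25

def tanh(x):
    return np.tanh(x)                    # zero-centered output

def relu(x):
    return np.maximum(0.0, x)            # no saturation for x > 0

def leaky_relu(x, a=0.01):
    return np.where(x > 0, x, a * x)     # a is a learned parameter in PReLU

def elu(x, alpha=1.0):
    return np.where(x > 0, x, alpha * (np.exp(x) - 1.0))

x = np.linspace(-3.0, 3.0, 7)
print(relu(x))                           # [0. 0. 0. 0. 1. 2. 3.]
```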
-----
3.2 Training on Multiple GPUs
ResNeXt
Figure 1. Left: A block of ResNet [14]. Right: A block of ResNeXt with cardinality = 32, with roughly the same complexity. A layer is shown as (# in channels, filter size, # out channels).
# ResNeXt
Notes:
AlexNet, constrained by the limited GPU memory of the time, split its feature maps into two groups across two GPUs; ResNeXt later introduced many groups (high cardinality) and showed that this works very well.
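A minimal grouped-convolution sketch. It assumes TensorFlow 2.3 or later, where `tf.keras.layers.Conv2D` exposes a `groups` argument; the shapes are illustrative only.

```python
import tensorflow as tf

# A grouped convolution splits the input channels into `groups` sets and
# convolves each set with its own filters: the idea behind AlexNet's
# two-GPU split (2 groups) and ResNeXt's cardinality (e.g. 32 groups).
x = tf.random.normal((1, 56, 56, 64))                       # NHWC input
layer = tf.keras.layers.Conv2D(filters=128, kernel_size=3,
                               padding="same", groups=32)   # cardinality 32
print(layer(x).shape)                                       # (1, 56, 56, 128)
```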
-----
Figure 2. Normalization methods. Each subplot shows a feature map tensor. The pixels in blue are normalized by the same mean and variance, computed by aggregating the values of these pixels. Group Norm is illustrated using a group number of 2.
# Group Normalization
Notes:
ResNeXt also provided inspiration for Group Normalization.
-----
# Group Normalization
Notes:
Simply put, reshape turns a vector into a multi-dimensional array; more generally, it converts between multi-dimensional arrays. The number of elements stays the same while the dimensions change.
https://www.delftstack.com/zh-tw/tutorial/python-numpy/numpy-array-reshape-and-resize/
https://www.tensorflow.org/api_docs/python/tf/reshape
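A short numpy sketch of reshape, and of how Group Normalization uses it; the tensor sizes here are made up for illustration.

```python
import numpy as np

# reshape keeps the element count; only the dimensions change.
v = np.arange(24)                  # shape (24,)
m = v.reshape(2, 3, 4)             # shape (2, 3, 4); m.size == v.size == 24

# Group Normalization relies on the same trick: reshape the channel axis
# into (groups, channels_per_group), then normalize within each group.
N, C, H, W, G = 2, 6, 4, 4, 2      # illustrative sizes, NCHW layout
x = np.random.randn(N, C, H, W)
xg = x.reshape(N, G, C // G, H, W)
mu = xg.mean(axis=(2, 3, 4), keepdims=True)
var = xg.var(axis=(2, 3, 4), keepdims=True)
x_norm = ((xg - mu) / np.sqrt(var + 1e-5)).reshape(N, C, H, W)
print(x_norm.shape)                # (2, 6, 4, 4)
```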
-----
3.3 Local Response Normalization
# AlexNet
Notes:
A dated, hand-crafted technique. GoogLeNet considered it useful, while VGGNet found it useless. In practice, a 1×1 convolution (Conv1) is more general and gives better results.
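For reference, a minimal numpy sketch of AlexNet-style LRN using the paper's hyperparameters (k = 2, n = 5, α = 1e-4, β = 0.75); the NHWC layout is an assumption for illustration.

```python
import numpy as np

# Local Response Normalization: each activation is divided by a term that
# sums the squared activations of up to n neighboring channels.
def local_response_norm(a, k=2.0, n=5, alpha=1e-4, beta=0.75):
    C = a.shape[-1]
    half = n // 2
    b = np.empty_like(a)
    for i in range(C):
        lo, hi = max(0, i - half), min(C, i + half + 1)
        denom = (k + alpha * np.sum(a[..., lo:hi] ** 2, axis=-1)) ** beta
        b[..., i] = a[..., i] / denom
    return b

x = np.random.randn(1, 8, 8, 16)
print(local_response_norm(x).shape)    # (1, 8, 8, 16)
```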
-----
3.4 Overlapping Pooling
This section has no figure.
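Since there is no figure, here is a tiny 1-D sketch contrasting AlexNet's overlapping pooling (window 3, stride 2) with the usual non-overlapping pooling (window 2, stride 2):

```python
import numpy as np

# Max pooling over a 1-D signal; with window > stride the windows overlap.
def max_pool_1d(x, window, stride):
    return np.array([x[i:i + window].max()
                     for i in range(0, len(x) - window + 1, stride)])

x = np.array([1, 3, 2, 5, 4, 0, 6, 1])
print(max_pool_1d(x, window=2, stride=2))  # non-overlapping: [3 5 4 6]
print(max_pool_1d(x, window=3, stride=2))  # overlapping:     [3 5 6]
```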
-----
# LeNet
Notes:
LeNet established the basic CNN architecture.
-----
Figure 1: Architecture of a convolutional neural network. In this case, the convolutional layers are fully connected. Both convolutional layers use a kernel of 5 x 5 and skipping factors of 1.
# PreVGGNet
Notes:
Uses GPUs and small datasets. Going wider does not help; going deeper does.
-----
Table 1: Error rates on MNIST test set for randomly connected CNNs with 2 to 6 convolutional layers with M Maps and an optional fully connected layer with N neurons. Various kernel sizes and skipping factors were used.
# PreVGGNet
Notes:
Going deeper helps.
-----
Table 3: Average error rates and standard deviations for N runs of an eight hidden layer CNN on the CIFAR10 test set (see text for details). The first five nets have 100 maps per convolutional and max-pooling layer, whereas the sixth, seventh and eighth have 200, 300 and 400 maps per hidden layer, respectively. IP - image processing layer: edge - 3 x 3 Sobel and Scharr filters; hat -13 x 13 positive and negative contrast extraction filters.
# PreVGGNet
Notes:
Going wider does not help.
-----
Figure 2: An illustration of the architecture of our CNN, explicitly showing the delineation of responsibilities between the two GPUs. One GPU runs the layer-parts at the top of the figure while the other runs the layer-parts at the bottom. The GPUs communicate only at certain layers. The network’s input is 150,528-dimensional, and the number of neurons in the network’s remaining layers is given by 253,440–186,624–64,896–64,896–43,264–4096–4096–1000.
# AlexNet
Notes:
AlexNet makes the network deeper and larger.
-----
4.1 Data Augmentation
-----
Figure 3: 96 convolutional kernels of size 11 x 11 x 3 learned by the first convolutional layer on the 224 x 224 x 3 input images. The top 48 kernels were learned on GPU 1 while the bottom 48 kernels were learned on GPU 2. See Section 6.1 for details.
# AlexNet
Notes:
The two data-augmentation methods show up as two kinds of kernels: the top half corresponds to the plain augmentation (crops and reflections), the bottom half to the color-intensity augmentation.
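A minimal numpy sketch of the first augmentation scheme (random 224 x 224 crops of a 256 x 256 image plus random horizontal reflections); the PCA-based color-intensity augmentation is omitted here.

```python
import numpy as np

rng = np.random.default_rng(0)

# Take a random 224x224 crop and flip it horizontally half of the time.
def random_crop_flip(img, size=224):
    h, w, _ = img.shape
    top = rng.integers(0, h - size + 1)
    left = rng.integers(0, w - size + 1)
    patch = img[top:top + size, left:left + size]
    if rng.random() < 0.5:
        patch = patch[:, ::-1]             # horizontal reflection
    return patch

img = rng.random((256, 256, 3))
print(random_crop_flip(img).shape)         # (224, 224, 3)
```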
-----
Fig. 3. Architecture of our 8 layer convnet model. A 224 by 224 crop of an image (with 3 color planes) is presented as the input. This is convolved with 96 different 1st layer filters (red), each of size 7 by 7, using a stride of 2 in both x and y. The resulting feature maps are then: (i) passed through a rectified linear function (not shown), (ii) pooled (max within 3x3 regions, using stride 2) and (iii) contrast normalized across feature maps to give 96 different 55 by 55 element feature maps. Similar operations are repeated in layers 2,3,4,5. The last two layers are fully connected, taking features from the top convolutional layer as input in vector form (6 · 6 · 256 = 9216 dimensions). The final layer is a C-way softmax function, C being the number of classes. All filters and feature maps are square in shape.
# ZFNet
Notes:
Compared with AlexNet, one of ZFNet's changes is that the first convolutional layer's filter size and stride are both reduced, which preserves more resolution; this also inspired VGGNet.
-----
Fig. 5. 1st layer features without feature scale clipping. Note that one feature dominates. (a): 1st layer features from Krizhevsky et al. [18]. (b): Our 1st layer features. The smaller stride (2 vs 4) and filter size (7x7 vs 11x11) results in more distinctive features and fewer “dead” features. (c): Visualizations of 2nd layer features from Krizhevsky et al. [18]. (d): Visualizations of our 2nd layer features. These are cleaner, with no aliasing artifacts that are visible in (c).
# ZFNet
Notes:
Shrinking the first convolutional layer's filter size and stride improves the quality of the learned features.
-----
# VGGNet
Table 1: ConvNet configurations (shown in columns). The depth of the configurations increases from the left (A) to the right (E), as more layers are added (the added layers are shown in bold). The convolutional layer parameters are denoted as “conv⟨receptive field size⟩-⟨number of channels⟩”. The ReLU activation function is not shown for brevity.
Notes:
Inspired by ZFNet's smaller first-layer kernels, VGGNet stacks two 3×3 convolutions in place of one 5×5 and repeats this pattern to go deeper.
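A quick parameter-count check (biases ignored, C = 64 chosen arbitrarily): two stacked 3x3 convolutions cover the same 5x5 receptive field with fewer parameters than one 5x5 convolution.

```python
# Two 3x3 layers vs one 5x5 layer, C input and C output channels, no biases.
C = 64
params_two_3x3 = 2 * (3 * 3 * C * C)
params_one_5x5 = 5 * 5 * C * C
print(params_two_3x3, params_one_5x5)      # 73728 102400
```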
-----
# VGGNet
Notes:
19 layers is slightly better than 16; going even deeper makes results worse.
-----
# Highway v2
Notes:
1. y = H is the block's transform.
2. x is the identity mapping.
3. T and C are gating coefficients. For convenience, set C = 1 - T, so that T and C sum to 1 (the gating formula is restated after this list).
4. In the extreme cases the block either applies the full transform or passes the input through unchanged, but the ResNet case never occurs; ResNet is essentially the identity mapping plus a residual.
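For reference, the Highway gating written out in LaTeX (notation as in the paper, with the coupling C = 1 - T):

```latex
% Highway block: H is the block transform, T the transform gate, C the carry gate.
\[
\begin{aligned}
y &= H(x, W_H)\cdot T(x, W_T) + x\cdot C(x, W_C) \\
  &= H(x, W_H)\cdot T(x, W_T) + x\cdot\bigl(1 - T(x, W_T)\bigr)
\end{aligned}
\]
```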
-----
# Highway v2
Notes:
Take the derivative of Equation 4. This example is a simplification in the style of the GRU; the paper also mentions an LSTM-style simplification.
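A restatement of the limiting cases and the corresponding derivative, as given in the paper:

```latex
% When the gate saturates, the block is either the identity or the plain
% transform, and the derivative follows suit.
\[
y = \begin{cases} x, & \text{if } T(x, W_T) = 0 \\ H(x, W_H), & \text{if } T(x, W_T) = 1 \end{cases}
\qquad
\frac{dy}{dx} = \begin{cases} I, & \text{if } T(x, W_T) = 0 \\ H'(x, W_H), & \text{if } T(x, W_T) = 1 \end{cases}
\]
```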
-----
Figure 1. Comparison of optimization of plain networks and highway networks of various depths. All networks were optimized using SGD with momentum. The curves shown are for the best hyperparameter settings obtained for each configuration using a random search. Plain networks become much harder to optimize with increasing depth, while highway networks with up to 100 layers can still be optimized well.
# Highway v1
Notes:
Beyond VGGNet's 19-layer networks, highway networks, with the added carry gate C, can reach a hundred layers.
-----
Figure 1: Comparison of optimization of plain networks and highway networks of various depths. Left: The training curves for the best hyperparameter settings obtained for each network depth. Right: Mean performance of top 10 (out of 100) hyperparameter settings. Plain networks become much harder to optimize with increasing depth, while highway networks with up to 100 layers can still be optimized well. Best viewed on screen (larger version included in Supplementary Material).
# Highway v2
Notes:
Beyond VGGNet's 19-layer networks, highway networks with the added carry gate C can reach a hundred layers.
-----
4.2 Dropout
Figure 1: Dropout Neural Net Model. Left: A standard neural net with 2 hidden layers. Right: An example of a thinned net produced by applying dropout to the network on the left. Crossed units have been dropped.
# Dropout
Notes:
Randomly drop neurons during training; at the end, the thinned models are effectively averaged.
-----
Figure 2: Left: A unit at training time that is present with probability p and is connected to units in the next layer with weights w. Right: At test time, the unit is always present and the weights are multiplied by p. The output at test time is same as the expected output at training time.
Notes:
During training each unit is retained with probability p; the averaged model is obtained at test time by multiplying the weights by p.
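A minimal numpy sketch of this scheme, with retain probability p = 0.5 as in AlexNet's fully connected layers:

```python
import numpy as np

rng = np.random.default_rng(0)
p = 0.5                                   # probability of keeping a unit

# Training: keep each activation with probability p.
def dropout_train(a, p):
    mask = rng.random(a.shape) < p        # 1 with probability p
    return a * mask

# Test: keep everything, scale by p so the expected output matches training.
def dropout_test(a, p):
    return a * p

a = rng.random((4, 8))
print(dropout_train(a, p).shape, dropout_test(a, p).shape)
```

Modern frameworks usually implement the equivalent "inverted dropout", dividing by p at training time so that no rescaling is needed at test time.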
-----
Figure 1. (a): An example model layout for a single DropConnect layer. After running feature extractor g() on input x, a random instantiation of the mask M (e.g. (b)), masks out the weight matrix W. The masked weights are multiplied with this feature vector to produce u which is the input to an activation function a and a softmax layer s. For comparison, (c) shows an effective weight mask for elements that Dropout uses when applied to the previous layer's output (red columns) and this layer's output (green rows). Note the lack of structure in (b) compared to (c).
# DropConnect
Notes:
A generalization of Dropout: connections (weights) are dropped at random rather than units. The figure shows a single draw, which makes DropConnect look more varied and Dropout more rigid, but Dropout drops a different set of units on every pass, so in practice Dropout is not clearly worse.
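A minimal numpy sketch contrasting the two masks for a single layer, u = (M * W) v for DropConnect versus u = m * (W v) for Dropout; the shapes are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
p = 0.5
W = rng.standard_normal((6, 4))            # 6 output units, 4 input units
v = rng.standard_normal(4)                 # input feature vector

# DropConnect: an independent Bernoulli mask on every weight.
M = rng.random(W.shape) < p
u_dropconnect = (M * W) @ v

# Dropout: a Bernoulli mask on the output units instead.
m = rng.random(6) < p
u_dropout = m * (W @ v)
print(u_dropconnect.shape, u_dropout.shape)  # (6,) (6,)
```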
-----
Table 1: Comparison of results on ILSVRC- 2010 test set. In italics are best results achieved by others.
# AlexNet
Notes:
Much better than previous results.
-----
Table 2: Comparison of error rates on ILSVRC-2012 validation and test sets. In italics are best results achieved by others. Models with an asterisk* were “pre-trained” to classify the entire ImageNet 2011 Fall release. See Section 6 for details.
# AlexNet
Notes:
Pre-training helps, and so does ensembling.
-----
Figure 4: (Left) Eight ILSVRC-2010 test images and the five labels considered most probable by our model. The correct label is written under each image, and the probability assigned to the correct label is also shown with a red bar (if it happens to be in the top 5). (Right) Five ILSVRC-2010 test images in the first column. The remaining columns show the six training images that produce feature vectors in the last hidden layer with the smallest Euclidean distance from the feature vector for the test image.
# AlexNet
Notes:
Top-5 means the correct label is among the five most probable predictions; the top-5 error rate is the fraction of cases where none of the five most probable predictions is correct.
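A minimal numpy sketch of the top-5 error rate on made-up scores:

```python
import numpy as np

rng = np.random.default_rng(0)
scores = rng.random((100, 1000))           # 100 images, 1000 class scores each
labels = rng.integers(0, 1000, size=100)   # ground-truth labels

# A prediction counts as correct if the true label is among the top-5 classes.
top5 = np.argsort(-scores, axis=1)[:, :5]
top5_err = 1.0 - np.mean([labels[i] in top5[i] for i in range(len(labels))])
print(top5_err)
```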
-----
# Optimization
Notes:
Momentum takes the previous step's movement into account, which damps oscillations and speeds up progress.
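Classical momentum written out in LaTeX, in the notation of Ruder's overview [13]; γ is the momentum coefficient (0.9 in AlexNet) and η the learning rate:

```latex
% Velocity accumulates past gradients; the parameters follow the velocity.
\[
\begin{aligned}
v_t    &= \gamma\, v_{t-1} + \eta\, \nabla_{\theta} J(\theta) \\
\theta &= \theta - v_t
\end{aligned}
\]
```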
-----
# Weight Decay
Notes:
According to the formula, at every update the weights shrink by a factor of (1 - λ) (moving toward 0), regardless of whether the gradient changes them. Over time some weights effectively vanish, which reduces the effective number of parameters and therefore helps prevent overfitting.
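One SGD step with decoupled weight decay in LaTeX (decay factor λ, learning rate α, following [14]); the (1 - λ) shrinkage is applied at every step, whether or not the gradient changes the weight:

```latex
\[
\theta_{t+1} = (1 - \lambda)\,\theta_t - \alpha\, \nabla_{\theta} f(\theta_t)
\]
```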
-----
# Weight Decay
Notes:
L2 regularization adds a penalty on the weight norm (geometrically, a circular constraint region, an L2 ball) on top of the loss value. We want the training loss to be small, but driving it too low means overfitting; the penalty term guards against that.
Choosing λ' = λ / α makes the two formulations equivalent.
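For plain SGD the two views coincide; a short LaTeX derivation using the choice λ' = λ / α:

```latex
% L2-regularized SGD with lambda' = lambda / alpha reduces to the weight-decay update.
\[
\begin{aligned}
f^{\mathrm{reg}}(\theta) &= f(\theta) + \tfrac{\lambda'}{2}\,\lVert\theta\rVert_2^2 \\
\theta_{t+1} &= \theta_t - \alpha\,\nabla f^{\mathrm{reg}}(\theta_t)
              = \theta_t - \alpha\,\nabla f(\theta_t) - \alpha\,\lambda'\,\theta_t \\
             &= (1 - \lambda)\,\theta_t - \alpha\,\nabla f(\theta_t),
              \qquad \lambda = \alpha\,\lambda'
\end{aligned}
\]
```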
-----
# Weight Decay
Notes:
The part shown in red is the L2 form; to carry it out as weight decay, an extra coefficient has to be multiplied in.
-----
Keras Implementation
-----
http://hemingwang.blogspot.com/2021/03/keras-lenet.html
http://hemingwang.blogspot.com/2021/03/keras-alexnet.html
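For orientation, a minimal AlexNet-like model in Keras (single GPU, no LRN, 227 x 227 input). This is a sketch under those assumptions, not the code behind the links above.

```python
from tensorflow.keras import layers, models

# Five convolutional layers, three max-pooling layers, and three fully
# connected layers with dropout, roughly following the AlexNet layout.
def alexnet_like(num_classes=1000):
    return models.Sequential([
        layers.Conv2D(96, 11, strides=4, activation="relu",
                      input_shape=(227, 227, 3)),
        layers.MaxPooling2D(3, strides=2),
        layers.Conv2D(256, 5, padding="same", activation="relu"),
        layers.MaxPooling2D(3, strides=2),
        layers.Conv2D(384, 3, padding="same", activation="relu"),
        layers.Conv2D(384, 3, padding="same", activation="relu"),
        layers.Conv2D(256, 3, padding="same", activation="relu"),
        layers.MaxPooling2D(3, strides=2),
        layers.Flatten(),
        layers.Dense(4096, activation="relu"),
        layers.Dropout(0.5),
        layers.Dense(4096, activation="relu"),
        layers.Dropout(0.5),
        layers.Dense(num_classes, activation="softmax"),
    ])

alexnet_like().summary()
```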
-----
References
[1] # AlexNet
Krizhevsky, Alex, Ilya Sutskever, and Geoffrey E. Hinton. "Imagenet classification with deep convolutional neural networks." Advances in neural information processing systems 25 (2012): 1097-1105.
https://proceedings.neurips.cc/paper/2012/file/c399862d3b9d6b76c8436e924a68c45b-Paper.pdf
[2] # LeNet. Cited 31,707 times. The classic convolutional neural network; its main addition over HDR is the fully connected layers.
LeCun, Yann, et al. "Gradient-based learning applied to document recognition." Proceedings of the IEEE 86.11 (1998): 2278-2324.
http://yann.lecun.com/exdb/publis/pdf/lecun-01a.pdf
[3] # PreVGGNet
Cireşan, Dan C., et al. "High-performance neural networks for visual object classification." arXiv preprint arXiv:1102.0183 (2011).
https://arxiv.org/pdf/1102.0183.pdf
[4] # ZFNet
Zeiler, Matthew D., and Rob Fergus. "Visualizing and understanding convolutional networks." European conference on computer vision. Springer, Cham, 2014.
https://cdanfort.w3.uvm.edu/csc-reading-group/zeiler-eccv-2014.pdf
[5] # VGGNet
Simonyan, Karen, and Andrew Zisserman. "Very deep convolutional networks for large-scale image recognition." arXiv preprint arXiv:1409.1556 (2014).
https://arxiv.org/pdf/1409.1556.pdf
[6] # Highway v1
Srivastava, Rupesh Kumar, Klaus Greff, and Jürgen Schmidhuber. "Highway networks." arXiv preprint arXiv:1505.00387 (2015).
https://arxiv.org/pdf/1505.00387.pdf
[7] # Highway v2
Srivastava, Rupesh K., Klaus Greff, and Jürgen Schmidhuber. "Training very deep networks." Advances in neural information processing systems. 2015.
https://papers.nips.cc/paper/2015/file/215a71a12769b056c3c32e7299f1c5ed-Paper.pdf
[8] # History DL(ReLU)
Alom, Md Zahangir, et al. "The history began from alexnet: A comprehensive survey on deep learning approaches." arXiv preprint arXiv:1803.01164 (2018).
https://arxiv.org/ftp/arxiv/papers/1803/1803.01164.pdf
[9] # ResNeXt(Channel)
Xie, Saining, et al. "Aggregated residual transformations for deep neural networks." Proceedings of the IEEE conference on computer vision and pattern recognition. 2017.
[10] # Group Normalization(Channel)
Wu, Yuxin, and Kaiming He. "Group normalization." Proceedings of the European conference on computer vision (ECCV). 2018.
[11] # Dropout
Srivastava, Nitish, et al. "Dropout: a simple way to prevent neural networks from overfitting." The journal of machine learning research 15.1 (2014): 1929-1958.
https://www.jmlr.org/papers/volume15/srivastava14a/srivastava14a.pdf
[12] # DropConnect
Wan, Li, et al. "Regularization of neural networks using dropconnect." International conference on machine learning. PMLR, 2013.
http://proceedings.mlr.press/v28/wan13.pdf
[13] # Optimization(Momentum)
Ruder, Sebastian. "An overview of gradient descent optimization algorithms." arXiv preprint arXiv:1609.04747 (2016).
https://arxiv.org/pdf/1609.04747.pdf
[14] # Weight Decay
Loshchilov, Ilya, and Frank Hutter. "Decoupled weight decay regularization." arXiv preprint arXiv:1711.05101 (2017).
https://arxiv.org/pdf/1711.05101.pdf
-----