The Star Also Rises: ResNet (3): Illustrated

2021/03/27

-----

https://pixabay.com/zh/photos/utah-america-nevada-arizona-4272944/

-----

Outline

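1. ResNet v1: residuals are easier to train; networks can reach hundreds of layers.

2. ResNet-D: v1 is not an ensemble, but dropout is; networks can reach a thousand layers.

3. ResNet v2: the key point is moving ReLU off the identity path; networks can reach a thousand layers.

4. ResNet-E: the identity paths of v2 form an ensemble.

5. ResNet-V: an ensemble is an average, which smooths the loss surface and helps training reach a thousand layers.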

-----

Figure 1. Training error (left) and test error (right) on CIFAR-10 with 20-layer and 56-layer “plain” networks. The deeper network has higher training error, and thus test error. Similar phenomena on ImageNet is presented in Fig. 4.

# ResNet v1.

Notes:

This is not overfitting: overfitting means good training results but poor test results. The figure shows poor training error and, as a consequence, poor test error, which is network degradation. Up to roughly the depth of VGG-19, adding layers improves accuracy; beyond that depth, adding more layers to a plain network lowers accuracy.

-----

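# ResNet v1.

Notes:

y = F(x, {Wi}) + x.  (1)

Here x and y are the input and output vectors of the layers considered. The function F(x, {Wi}) represents the residual mapping to be learned.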

-----

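Figure 2. Residual learning: a building block.

# ResNet v1.

Notes:

The architecture of the ResNet v1 block. Once an identity mapping is forced onto the shortcut, what the convolutional layers have to learn is the residual.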
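A minimal sketch of the block described in the note above, under simplifying assumptions: small fully connected layers stand in for the convolutions, and biases and batch normalization are left out. The helper names (`relu`, `linear`, `residual_block`) are illustrative and not taken from the paper.

```python
def relu(v):
    """Element-wise ReLU on a list of floats."""
    return [max(0.0, a) for a in v]


def linear(weights, v):
    """Multiply a square weight matrix (list of rows) by a vector."""
    return [sum(w * a for w, a in zip(row, v)) for row in weights]


def residual_block(x, w1, w2):
    """Compute y = F(x, {W1, W2}) + x, followed by ReLU as in ResNet v1."""
    f = linear(w2, relu(linear(w1, x)))  # the residual F(x)
    return relu([fi + xi for fi, xi in zip(f, x)])  # add the identity shortcut


# With all-zero weights the residual is zero, so the block passes x through.
zeros = [[0.0, 0.0], [0.0, 0.0]]
print(residual_block([1.0, 2.0], zeros, zeros))  # [1.0, 2.0]
```

The zero-weight case shows why the identity shortcut helps: the layers only need to drive F toward zero for the block to behave as an identity mapping.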

-----

Figure 3. Example network architectures for ImageNet. Left: the VGG-19 model [40] (19.6 billion FLOPs) as a reference. Middle: a plain network with 34 parameter layers (3.6 billion FLOPs). Right: a residual network with 34 parameter layers (3.6 billion FLOPs). The dotted shortcuts increase dimensions. Table 1 shows more details and other variants.

# ResNet v1.

Notes:

From left to right: VGG-19, the 34-layer plain network (a ResNet without identity shortcuts), and the 34-layer ResNet.

In the 34-layer ResNet, solid shortcuts are identity mappings. Dotted shortcuts mark the places where the input and output dimensions differ, which is where projection shortcuts can be used.

-----

Table 1. Architectures for ImageNet. Building blocks are shown in brackets (see also Fig. 5), with the numbers of blocks stacked. Downsampling is performed by conv3_1, conv4_1, and conv5_1 with a stride of 2.

# ResNet v1.

Notes:

The 18-layer network has one conv1 layer; two 2-layer blocks in each of conv2, conv3, conv4, and conv5; and one fully connected layer (1 + 4 × 2 × 2 + 1 = 18).

The 50-, 101-, and 152-layer networks replace the 2-layer blocks with 3-layer bottleneck blocks.
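The layer arithmetic can be checked with a short script. The block counts per stage below follow Table 1 of the ResNet paper; the dictionary layout and the name `count_layers` are only illustrative.

```python
# depth -> (layers per block, blocks in conv2_x .. conv5_x), following Table 1.
CONFIGS = {
    18: (2, [2, 2, 2, 2]),
    34: (2, [3, 4, 6, 3]),
    50: (3, [3, 4, 6, 3]),
    101: (3, [3, 4, 23, 3]),
    152: (3, [3, 8, 36, 3]),
}


def count_layers(layers_per_block, blocks):
    """conv1 + all convolutions inside the blocks + the final fully connected layer."""
    return 1 + layers_per_block * sum(blocks) + 1


for depth, (per_block, blocks) in CONFIGS.items():
    assert count_layers(per_block, blocks) == depth
    print(depth, blocks)
```

Note that ResNet-34 and ResNet-50 share the same block counts; only the block type changes from the 2-layer basic block to the 3-layer bottleneck.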

-----

Figure 4. Training on ImageNet. Thin curves denote training error, and bold curves denote validation error of the center crops. Left: plain networks of 18 and 34 layers. Right: ResNets of 18 and 34 layers. In this plot, the residual networks have no extra parameter compared to their plain counterparts.

# ResNet v1.

Notes:

ImageNet example: with ResNet, going from 18 to 34 layers keeps improving performance (higher accuracy, lower error rate).

-----

Table 2. Top-1 error (%, 10-crop testing) on ImageNet validation. Here the ResNets have no extra parameter compared to their plain counterparts. Fig. 4 shows the training procedures.

# ResNet v1.

Notes:

The table shows the same ImageNet trend in numbers: going from 18 to 34 layers, ResNet keeps improving (higher accuracy, lower error rate).

-----

# ResNet v1.

Notes:

Dotted shortcuts indicate an increase in dimensions. There are three ways to handle it, options A, B, and C, discussed below.

-----

Next we investigate projection shortcuts (Eqn.(2)). In Table 3 we compare three options: (A) zero-padding shortcuts are used for increasing dimensions, and all shortcuts are parameter-free (the same as Table 2 and Fig. 4 right); (B) projection shortcuts are used for increasing dimensions, and other shortcuts are identity; and (C) all shortcuts are projections.

# ResNet v1.

Notes:

See below.

-----

Figure 5. Structure of residual unit (a) with zero-padded identity-mapping shortcut, (b) unraveled view of (a) showing that the zero-padded identity-mapping shortcut constitutes a mixture of a residual network with a shortcut connection and a plain network.

# PyramidNet

Notes:

Option A: zero-padding.
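A small sketch of option A, assuming a feature is represented as a flat list of channel values (spatial dimensions and striding are ignored). The function names are hypothetical. It illustrates the point of the PyramidNet figure: the original channels get a residual connection, while the newly added channels see only the plain branch.

```python
def zero_pad_shortcut(x, out_channels):
    """Option A: extend a channel vector with zeros; no parameters are learned."""
    if out_channels < len(x):
        raise ValueError("zero-padding can only increase the channel count")
    return list(x) + [0.0] * (out_channels - len(x))


def add_shortcut(f, x):
    """y = F(x) + pad(x): padded channels receive only the residual branch."""
    padded = zero_pad_shortcut(x, len(f))
    return [fi + xi for fi, xi in zip(f, padded)]


# F(x) has 4 channels, x has 2: the last two outputs equal F(x) alone.
print(add_shortcut([0.5, 0.5, 0.5, 0.5], [1.0, 2.0]))  # [1.5, 2.5, 0.5, 0.5]
```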

-----

The dimensions of x and F must be equal in Eqn.(1). If this is not the case (e.g., when changing the input/output channels), we can perform a linear projection Ws by the shortcut connections to match the dimensions:

y = F(x, {Wi}) + Ws x.  (2)

We can also use a square matrix Ws in Eqn.(1). But we will show by experiments that the identity mapping is sufficient for addressing the degradation problem and is economical, and thus Ws is only used when matching dimensions.

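# ResNet v1.

Notes:

A linear projection is simply a projection. Note that Ws here is a learned linear map that changes the dimension, so it is a projection only in the loose sense and need not satisfy the strict definitions quoted below.

"In linear algebra and functional analysis, a projection is a linear transformation from a vector space to itself."

https://baike.baidu.com/item/%E6%AD%A3%E4%BA%A4%E6%8A%95%E5%BD%B1

"In an inner product space, besides the inner product itself, the most important operation is orthogonal projection, a powerful algebraic tool that decomposes any vector into a sum of orthogonal components."

https://ccjou.wordpress.com/2010/04/19/%E6%AD%A3%E4%BA%A4%E6%8A%95%E5%BD%B1-%E5%A8%81%E5%8A%9B%E5%BC%B7%E5%A4%A7%E7%9A%84%E4%BB%A3%E6%95%B8%E5%B7%A5%E5%85%B7/

Mismatched dimensions: if the channel counts differ, use a 1x1 convolution (Conv1). If the spatial sizes differ, use pooling (ResNet itself applies the shortcut with a stride of 2).

Number of feature maps unchanged: e.g., the 1x1 convolutions used in VGGNet.

Number of feature maps changed: e.g., ResNet, where the number of feature maps doubles. The bottleneck block likewise first reduces the dimension to cut computation and then restores it.

Notes:

B: use a 1x1 convolution to increase dimensions and identity mappings everywhere else. C is slightly better but costs more computation than B, so B is the option used afterwards.

C: use a 1x1 convolution to increase dimensions, and replace the remaining identity mappings with dimension-preserving (square-matrix) 1x1 convolutions.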
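To make the 1x1 convolution concrete, here is a sketch that applies the same linear map Ws at every pixel to change the channel count. The data layout (a list of per-pixel channel vectors), the weights, and the name `conv1x1` are assumptions chosen for illustration; biases and striding are omitted.

```python
def conv1x1(pixels, weights):
    """Apply a 1x1 convolution: the same linear map W_s at every pixel.

    pixels:  list of per-pixel channel vectors, each of length C_in
    weights: C_out rows of C_in weights (bias omitted for brevity)
    """
    return [[sum(w * c for w, c in zip(row, px)) for row in weights] for px in pixels]


# Two pixels with 2 channels each, projected to 4 channels (as in option B).
pixels = [[1.0, 2.0], [3.0, 4.0]]
w_s = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [1.0, -1.0]]
print(conv1x1(pixels, w_s))  # [[1.0, 2.0, 3.0, -1.0], [3.0, 4.0, 7.0, -1.0]]
```

Option C would apply the same kind of map with a square weight matrix on every shortcut, which is where its extra parameters and computation come from.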

-----

Table 3. Error rates (%, 10-crop testing) on ImageNet validation. VGG-16 is based on our test. ResNet-50/101/152 are of option B that only uses projections for increasing dimensions.

# ResNet v1.

Notes:

(10-crop testing.)

B is slightly better than A. C is slightly better than B but requires considerably more computation.

-----

Table 4. Error rates (%) of single-model results on the ImageNet validation set (except † reported on the test set).

# ResNet v1.

Notes:

Single model.

-----

Table 5. Error rates (%) of ensembles. The top-5 error is on the test set of ImageNet and reported by the test server.

# ResNet v1.

Notes:

Ensemble (multiple models).

-----

Figure 5. A deeper residual function F for ImageNet. Left: a building block (on 56×56 feature maps) as in Fig. 3 for ResNet-34. Right: a “bottleneck” building block for ResNet-50/101/152.

# ResNet v1.

Notes:

The ResNet v1 basic block and the bottleneck block.
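A quick weight count shows why the bottleneck is affordable. The channel widths (64 for the basic block, 256 reduced to 64 for the bottleneck) are the ones drawn in the paper's figure; biases and batch normalization are ignored, and the helper name is illustrative.

```python
def conv_weights(kernel, c_in, c_out):
    """Number of weights in a kernel x kernel convolution (biases ignored)."""
    return kernel * kernel * c_in * c_out


# Basic block on 64 channels: two 3x3 convolutions.
basic = 2 * conv_weights(3, 64, 64)

# Bottleneck block on 256 channels: 1x1 reduce, 3x3 on 64 channels, 1x1 restore.
bottleneck = (
    conv_weights(1, 256, 64) + conv_weights(3, 64, 64) + conv_weights(1, 64, 256)
)

print(basic, bottleneck)  # 73728 69632
```

The bottleneck handles four times as many input and output channels with slightly fewer weights, because the expensive 3x3 convolution runs only on the reduced 64-channel tensor.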

-----

Table 6. Classification error on the CIFAR-10 test set. All methods are with data augmentation. For ResNet-110, we run it 5 times and show “best (mean±std)” as in [42].

# ResNet v1.

Notes:

CIFAR-10 test results.

-----

Figure 6. Training on CIFAR-10. Dashed lines denote training error, and bold lines denote testing error. Left: plain networks. The error of plain-110 is higher than 60% and not displayed. Middle: ResNets. Right: ResNets with 110 and 1202 layers.

# ResNet v1.

Notes:

On CIFAR-10, ResNet performance gets worse when going from 110 to 1202 layers.

-----

Table 7. Object detection mAP (%) on the PASCAL VOC 2007/2012 test sets using baseline Faster R-CNN. See also appendix for better results.

Table 8. Object detection mAP (%) on the COCO validation set using baseline Faster R-CNN. See also appendix for better results.

# ResNet v1.

Notes:

ResNet v1 outperforms VGG16 on both PASCAL VOC and COCO.

-----

Fig. 2. The linear decay of p_ℓ illustrated on a ResNet with stochastic depth for p_0 = 1 and p_L = 0.5. Conceptually, we treat the input to the first ResBlock as H_0, which is always active.

# ResNet-D.

Notes:

"Here L denotes the total number of blocks, so p_L is the survival probability of the last residual block, and it is fixed at 0.5 throughout the experiments. Note also that in this setting the input is treated as the first layer (l = 0) and is therefore never dropped."

http://hemingwang.blogspot.com/2019/10/an-overview-of-resnet-and-its-variants.html
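The linear decay rule can be sketched as follows, assuming the form p_l = 1 - (l / L)(1 - p_L) from the stochastic depth paper. The function names and the choice of 54 blocks (the block count of a 110-layer CIFAR ResNet) are assumptions for illustration.

```python
import random


def survival_probability(l, num_blocks, p_last=0.5):
    """Linear decay: p_0 = 1 for the input, p_L = p_last for the final block."""
    return 1.0 - (l / num_blocks) * (1.0 - p_last)


def sample_active_blocks(num_blocks, p_last=0.5, rng=random):
    """Draw one training-time mask: True means the residual branch is kept."""
    return [
        rng.random() < survival_probability(l, num_blocks, p_last)
        for l in range(1, num_blocks + 1)
    ]


L = 54  # a 110-layer CIFAR ResNet has 54 two-layer residual blocks
expected_depth = sum(survival_probability(l, L) for l in range(1, L + 1))
print(expected_depth)  # about 40.25 blocks survive on average
```

With p_L = 0.5 the expected number of active blocks is roughly three quarters of L, so each training step sees a noticeably shallower network while the full depth is used at test time.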

-----

Fig. 5. With stochastic depth, the 1202-layer ResNet still significantly improves over the 110-layer one.

# ResNet-D.

Notes:

On CIFAR-10, the 1202-layer ResNet-D has a lower error rate than the 110-layer one.

-----

# ResNet v2.

Notes:

CIFAR-10.

ResNet v1-110: 6.43%. ResNet v1-1202: 7.93%. See Table 6 of the original paper.

ResNet v1-1001: 7.61%. ResNet v2-1001: 4.92%.

-----

Figure 1: Residual Networks are conventionally shown as (a), which is a natural representation of Equation (1). When we expand this formulation to Equation (6), we obtain an unraveled view of a 3-block residual network (b). Circular nodes represent additions. From this view, it is apparent that residual networks have O(2^n) implicit paths connecting input and output and that adding a block doubles the number of paths.

# ResNet-E.

Notes:

A residual network has implicit paths connecting input and output, and adding a block doubles the number of paths.
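The path count can be illustrated by enumeration: at each block a path either takes the shortcut or goes through the residual branch. The 0/1 encoding and the name `enumerate_paths` are illustrative choices.

```python
from itertools import product


def enumerate_paths(num_blocks):
    """Each block is either skipped (0) or traversed (1), giving 2**n paths."""
    return list(product([0, 1], repeat=num_blocks))


paths = enumerate_paths(3)
print(len(paths))  # 8
print(len(enumerate_paths(4)))  # 16: one extra block doubles the count
```

The sum of each tuple is the number of residual modules the path goes through, which is the path length used in the ensemble analysis.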

-----

Figure 5: (a) Error increases smoothly when randomly deleting several modules from a residual network. (b) Error also increases smoothly when re-ordering a residual network by shuffling building blocks. The degree of reordering is measured by the Kendall Tau correlation coefficient. These results are similar to what one would expect from ensembles.

"In statistics, the Kendall correlation coefficient is named after Maurice Kendall, and its value is usually denoted by the Greek letter τ (tau). ... The coefficient ranges from -1 to 1: τ = 1 means the two random variables have identical rank ordering; τ = -1 means they have completely opposite rank ordering; τ = 0 means the two random variables are independent." (Google search snippet, 2018-12-19.)

https://blog.csdn.net/zhaozhn5/article/details/78392220

# ResNet-E.

Notes:

(a) Randomly delete several modules: the more modules deleted, the higher the error.

(b) Randomly swap pairs of blocks (k swaps in total): the error rate rises as the Kendall tau coefficient between the new and the original ordering falls.
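For reference, a minimal Kendall tau for a reordering without ties, computed as (concordant pairs minus discordant pairs) divided by the total number of pairs. Representing the shuffled network as a permutation of block indices is an assumption made for this sketch.

```python
from itertools import combinations


def kendall_tau(order):
    """Kendall tau between `order` and the sorted order 0..n-1 (no ties)."""
    pairs = list(combinations(range(len(order)), 2))
    concordant = sum(1 for i, j in pairs if order[i] < order[j])
    discordant = len(pairs) - concordant
    return (concordant - discordant) / len(pairs)


print(kendall_tau([0, 1, 2, 3]))  # 1.0: original block order
print(kendall_tau([1, 0, 2, 3]))  # one adjacent swap lowers tau slightly
print(kendall_tau([3, 2, 1, 0]))  # -1.0: fully reversed order
```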

-----

Figure 6: How much gradient do the paths of different lengths contribute in a residual network? To find out, we first show the distribution of all possible path lengths (a). This follows a Binomial distribution. Second, we record how much gradient is induced on the first layer of the network through paths of varying length (b), which appears to decay roughly exponentially with the number of modules the gradient passes through. Finally, we can multiply these two functions (c) to show how much gradient comes from all paths of a certain length. Though there are many paths of medium length, paths longer than 20 modules are generally too long to contribute noticeable gradient during training. This suggests that the effective paths in residual networks are relatively shallow.

# ResNet-E.

"Surprisingly, most of the contribution comes from paths of length 9 to 18 (VGG?), as shown in (c), yet they make up only a small fraction of all paths, as shown in (a). This is a very interesting finding: it suggests that ResNet does not solve the vanishing-gradient problem for very long paths, and that ResNet instead makes very deep networks trainable by shortening the effective paths."

http://hemingwang.blogspot.com/2019/10/an-overview-of-resnet-and-its-variants.html

Notes:

(a) The number of paths of each length.

(b) The gradient that paths of each length induce at the first layer.

(c) The product of (a) and (b): the total gradient contributed by all paths of each length.
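The interplay of the three panels can be sketched numerically. The binomial path counts are exact; the per-path gradient decay rate of 0.5 per module is a made-up value used only to show the mechanism, and the block count of 54 is likewise an assumption.

```python
from math import comb

NUM_BLOCKS = 54  # assumption: block count of a 110-layer CIFAR ResNet

# (a) number of paths that pass through exactly k residual modules
path_counts = [comb(NUM_BLOCKS, k) for k in range(NUM_BLOCKS + 1)]


def gradient_share(per_path_gradient):
    """(c) = (a) * (b), normalized so the shares over all lengths sum to 1."""
    weighted = [n * g for n, g in zip(path_counts, per_path_gradient)]
    total = sum(weighted)
    return [w / total for w in weighted]


# (b) hypothetical exponential decay of gradient magnitude with path length
DECAY = 0.5  # illustrative value only, not measured
shares = gradient_share([DECAY ** k for k in range(NUM_BLOCKS + 1)])

print(sum(path_counts) == 2 ** NUM_BLOCKS)  # True
print(max(range(NUM_BLOCKS + 1), key=lambda k: path_counts[k]))  # 27: most paths are mid-length
print(max(range(NUM_BLOCKS + 1), key=lambda k: shares[k]))  # peak of (c) is shorter than 27
```

Even with this toy decay rate, the gradient-weighted distribution peaks at shorter paths than the raw path count does, which is the qualitative effect the figure reports.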

-----

Figure 1: The loss surfaces of ResNet-56 with/without skip connections. The proposed filter normalization scheme is used to enable comparisons of sharpness/flatness between the two figures.

# ResNet-V.

Notes:

A ResNet with a smooth loss surface is easy to train.

-----

Figure 5: 2D visualization of the loss surface of ResNet and ResNet-noshort with different depth.

# ResNet-V.

Notes:

The deeper the network, the more the shortcut connections help.

-----

Figure 6: Wide-ResNet-56 on CIFAR-10 both with shortcut connections (top) and without (bottom). The label k = 2 means twice as many filters per layer. Test error is reported below each figure.

# ResNet-V.

Notes:

A wider WRN does fine even without shortcut connections, but the computation increases.

-----

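# ResNet-V.

Notes:

The loss surface is projected from high dimensions down to low dimensions. In high dimensions, any two random vectors are very likely to be nearly orthogonal. (From the 【蜻蜓点论文】 video.)

θ* is the center point.

δ and η are two direction vectors sampled from a Gaussian distribution, nearly orthogonal to each other, with the same dimension as θ*.

α and β are the scales applied to the two vectors.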
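A sketch of the 2D slice f(α, β) = L(θ* + αδ + βη) used in the loss landscape paper. The quadratic loss, the dimension of 10,000, and all function names are stand-ins chosen for illustration; a real visualization would evaluate the network's training loss instead.

```python
import math
import random


def gaussian_direction(dim, rng):
    """Sample a random direction with i.i.d. standard normal entries."""
    return [rng.gauss(0.0, 1.0) for _ in range(dim)]


def cosine(u, v):
    """Cosine of the angle between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def surface_point(loss, theta_star, delta, eta, alpha, beta):
    """f(alpha, beta) = L(theta* + alpha * delta + beta * eta)."""
    theta = [t + alpha * d + beta * e for t, d, e in zip(theta_star, delta, eta)]
    return loss(theta)


def quadratic(theta):
    """Toy loss (hypothetical) with its minimum at the origin."""
    return sum(t * t for t in theta)


rng = random.Random(0)
dim = 10_000
delta, eta = gaussian_direction(dim, rng), gaussian_direction(dim, rng)
print(abs(cosine(delta, eta)))  # small: random high-dimensional vectors are nearly orthogonal

theta_star = [0.0] * dim
grid = [[surface_point(quadratic, theta_star, delta, eta, a, b) for a in (-1, 0, 1)]
        for b in (-1, 0, 1)]
print(grid[1][1])  # 0.0 at the center (alpha = beta = 0)
```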

-----

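# ResNet-V.

Notes:

Filter normalization.

Taking a 3x3 convolution kernel as an example, the j-th filter consists of w11, w12, w13, w21, w22, w23, w31, w32, w33, and b. Each w value is a sample from a Gaussian distribution.

The norm of d_ij is the square root of the sum of squares of all the entries above.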

https://zhuanlan.zhihu.com/p/52314278
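A sketch of the normalization step, assuming the rule from the loss landscape paper: each filter d_ij of the random direction is rescaled to d_ij / ||d_ij|| × ||θ_ij||, where θ_ij is the matching filter of the trained weights. Filters are flattened to plain lists here, and the small epsilon is an added safeguard, not part of the paper's formula.

```python
import math


def norm(values):
    """Frobenius norm: square root of the sum of squares."""
    return math.sqrt(sum(v * v for v in values))


def filter_normalize(direction, weights):
    """Rescale each filter of a random direction to the norm of the matching weight filter.

    direction, weights: lists of filters, each a flat list of numbers.
    """
    normalized = []
    for d_filter, w_filter in zip(direction, weights):
        scale = norm(w_filter) / (norm(d_filter) + 1e-10)  # epsilon avoids division by zero
        normalized.append([scale * v for v in d_filter])
    return normalized


weights = [[3.0, 4.0], [0.6, 0.8]]      # filter norms 5 and 1
direction = [[1.0, 0.0], [0.0, 2.0]]    # filter norms 1 and 2
print([round(norm(f), 6) for f in filter_normalize(direction, weights)])  # [5.0, 1.0]
```

After this step a unit move along the direction perturbs every filter by an amount proportional to its own scale, which is what makes sharpness comparable between different networks.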

-----

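# ResNet-V.

Notes:

Generalization ability refers specifically to how well a learning algorithm adapts to new samples.

With filter normalization applied, the flatness of the minimizer (the lowest point) turns out to correlate positively with generalization ability. See Figure 1 for an example.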

-----

References

# VGGNet. Cited 47721 times. Stacks two conv3 layers in place of one conv5 and repeatedly deepens the network to 16 and 19 layers.

Simonyan, Karen, and Andrew Zisserman. "Very deep convolutional networks for large-scale image recognition." arXiv preprint arXiv:1409.1556 (2014).

https://arxiv.org/pdf/1409.1556.pdf

# ResNet v1. Cited 61600 times. Adds an identity mapping inspired by LSTM; networks can reach hundreds of layers.

He, Kaiming, et al. "Deep residual learning for image recognition." Proceedings of the IEEE conference on computer vision and pattern recognition. 2016.

https://openaccess.thecvf.com/content_cvpr_2016/papers/He_Deep_Residual_Learning_CVPR_2016_paper.pdf

# ResNet-D. Cited 982 times. A dropout version of ResNet v1; networks can reach a thousand layers.

Huang, Gao, et al. "Deep networks with stochastic depth." European conference on computer vision. Springer, Cham, 2016.

https://arxiv.org/pdf/1603.09382.pdf

# ResNet v2. Cited 4560 times. Shifts the focus from the residual block to a pure identity mapping; networks can reach a thousand layers.

He, Kaiming, et al. "Identity mappings in deep residual networks." European conference on computer vision. Springer, Cham, 2016.

https://arxiv.org/pdf/1603.05027.pdf

# ResNet-E. Cited 551 times. ResNet v2 is in effect an ensemble of shallow networks.

Veit, Andreas, Michael J. Wilber, and Serge Belongie. "Residual networks behave like ensembles of relatively shallow networks." Advances in neural information processing systems. 2016.

https://papers.nips.cc/paper/2016/file/37bc2f75bf1bcfe8450a1a41c200364c-Paper.pdf

# ResNet-V. Cited 464 times. The ensemble smooths the loss function, which is why training becomes easier.

Li, Hao, et al. "Visualizing the loss landscape of neural nets." Advances in Neural Information Processing Systems. 2018.

https://papers.nips.cc/paper/2018/file/a41b3bb3e6b050b6c9067c67f663b915-Paper.pdf

# PyramidNet

Han, Dongyoon, Jiwhan Kim, and Junmo Kim. "Deep pyramidal residual networks." Proceedings of the IEEE conference on computer vision and pattern recognition. 2017.

https://openaccess.thecvf.com/content_cvpr_2017/papers/Han_Deep_Pyramidal_Residual_CVPR_2017_paper.pdf

# Batch Normalization

Ioffe, Sergey, and Christian Szegedy. "Batch normalization: Accelerating deep network training by reducing internal covariate shift." International conference on machine learning. 2015.

http://proceedings.mlr.press/v37/ioffe15.pdf

-----

Visualizing Loss Landscape of Deep Neural Networks…..but can we Trust them? | by Jae Duk Seo | Towards Data Science

https://towardsdatascience.com/visualizing-loss-landscape-of-deep-neural-networks-but-can-we-trust-them-3d3ae0cff46e

Loss landscapes and the blessing of dimensionality | by Javier Ideami | Towards Data Science

https://towardsdatascience.com/loss-landscapes-and-the-blessing-of-dimensionality-46685e28e6a4

[论文阅读]损失函数可视化及其对神经网络的指导作用 - 知乎

https://zhuanlan.zhihu.com/p/52314278

NeurIPS 2018提前看：可视化神经网络泛化能力 | 机器之心

https://www.jiqizhixin.com/articles/2018-11-28-10

【蜻蜓点论文】Visualizing the Loss Landscape of Neural Nets - YouTube

https://www.youtube.com/watch?v=xVxMvoacWMw