The Star Also Rises: May 2021

Sunday, May 30, 2021

DenseNet (4): Appendix

2021/04/27

-----

Only the papers are listed below.

-----

# DPN

Chen, Yunpeng, et al. "Dual path networks." Advances in Neural Information Processing Systems. 2017.

https://proceedings.neurips.cc/paper/2017/file/f7e0b956540676a129760a3eae309294-Paper.pdf

# DLA

Yu, Fisher, et al. "Deep layer aggregation." Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2018.

http://openaccess.thecvf.com/content_cvpr_2018/papers/Yu_Deep_Layer_Aggregation_CVPR_2018_paper.pdf

-----

Only the papers are listed below.

-----

1. CapsNet v0 paper

Hinton, Geoffrey E., Alex Krizhevsky, and Sida D. Wang. "Transforming auto-encoders." International conference on artificial neural networks. Springer, Berlin, Heidelberg, 2011.

http://www.cs.toronto.edu/~fritz/absps/transauto6.pdf

2. CapsNet v1 paper

Sabour, Sara, Nicholas Frosst, and Geoffrey E. Hinton. "Dynamic routing between capsules." arXiv preprint arXiv:1710.09829 (2017).

http://papers.nips.cc/paper/6975-dynamic-routing-between-capsules

3. CapsNet v2 paper

Hinton, Geoffrey E., Sara Sabour, and Nicholas Frosst. "Matrix capsules with EM routing." International conference on learning representations. 2018.

https://openreview.net/forum?id=HJWLfGWRb&noteId=BkelcSxC47

4. CapsNet v3 paper

Kosiorek, Adam R., et al. "Stacked capsule autoencoders." arXiv preprint arXiv:1906.06818 (2019).

http://papers.nips.cc/paper/9684-stacked-capsule-autoencoders

# CapsNet v4 (very few citations)

Smith, Lewis, et al. "Capsule Networks--A Probabilistic Perspective." arXiv preprint arXiv:2004.03553 (2020).

https://arxiv.org/pdf/2004.03553.pdf

# Set Transformer

Lee, Juho, et al. "Set transformer: A framework for attention-based permutation-invariant neural networks." International Conference on Machine Learning. PMLR, 2019.

http://proceedings.mlr.press/v97/lee19d/lee19d.pdf

# Caps SS

LaLonde, Rodney, and Ulas Bagci. "Capsules for object segmentation." arXiv preprint arXiv:1804.04241 (2018).

https://arxiv.org/pdf/1804.04241.pdf

# CapsuleGAN

Jaiswal, Ayush, et al. "Capsulegan: Generative adversarial capsule network." Proceedings of the European Conference on Computer Vision (ECCV) Workshops. 2018.

https://openaccess.thecvf.com/content_ECCVW_2018/papers/11131/Jaiswal_CapsuleGAN_Generative_Adversarial_Capsule_Network_ECCVW_2018_paper.pdf

-----

DenseNet (3): Illustrated

2021/03/27

-----

https://pixabay.com/zh/photos/city-architecture-building-urban-5051196/

-----

The first key point of DenseNet is Figure 4, which compares DenseNet, DenseNet-C, DenseNet-B, and DenseNet-BC. Pay particular attention to how DenseNet-B differs from the plain DenseNet.

The second key point is Figure 5. See https://www.tensorinfinity.com/paper_89.html.

The third key point is Figure 7 of "Visualizing the loss landscape of neural nets."

Li, Hao, et al. "Visualizing the loss landscape of neural nets." Advances in Neural Information Processing Systems. 2018.

https://papers.nips.cc/paper/2018/file/a41b3bb3e6b050b6c9067c67f663b915-Paper.pdf

-----

# DenseNet

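Notes:

growth rate

k is the growth rate, i.e. the number of feature maps each layer outputs. Each layer's output is also used as input by all subsequent layers. The naive picture is that each feature map passes through one convolution kernel and becomes one new feature map. The key question in this paper is how k0 feature maps become k feature maps. LeNet is a useful reference.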

-----

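# LeNet

From six feature maps to sixteen feature maps.

Notes:

Take map 0 of the sixteen as an example: it is computed from the first three of the six input maps, which together are covered by one convolution kernel.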

combined

How are the feature maps of all filters in a convolutional layer combined? What is the final output of the layer?

「The feature maps from one layer are used to create new feature maps in the next layer. Each feature map in this second layer is a combination of all the feature maps in the first layer. And the value of the feature map in the second layer, at any one pixel, is found by multiplying each feature in the first layer with a convolution kernel, with a different kernel for each feature map in the first layer. The responses are then summed, added to a bias term, and then modified by a simple non-linear operation.」

https://www.quora.com/How-are-the-feature-maps-of-all-filters-in-a-convolutional-layer-combined-What-is-the-final-output-of-the-layer
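The combination described in the quote can be sketched in plain Python. This is only an illustration, not LeNet's actual code: the function names, the toy 3x3 inputs, the 2x2 kernels, and the choice of tanh as the non-linearity are all assumptions made for the example.

```python
import math

def correlate2d_valid(image, kernel):
    """'Valid' cross-correlation (what CNNs call convolution) of one map with one kernel."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [
        [
            sum(image[r + i][c + j] * kernel[i][j]
                for i in range(kh) for j in range(kw))
            for c in range(out_w)
        ]
        for r in range(out_h)
    ]

def conv_output_map(input_maps, kernels, bias):
    """One output map: a different kernel per input map, responses summed, then bias + tanh."""
    responses = [correlate2d_valid(m, k) for m, k in zip(input_maps, kernels)]
    h, w = len(responses[0]), len(responses[0][0])
    return [
        [math.tanh(sum(resp[r][c] for resp in responses) + bias) for c in range(w)]
        for r in range(h)
    ]

# Toy example (hypothetical values): 2 input maps of size 3x3, one 2x2 kernel each.
maps = [
    [[1, 0, 0], [0, 1, 0], [0, 0, 1]],
    [[0, 0, 1], [0, 1, 0], [1, 0, 0]],
]
kernels = [
    [[1, 0], [0, 1]],
    [[0, 1], [1, 0]],
]
out = conv_output_map(maps, kernels, bias=0.0)
print(out)
```

A layer with 16 output maps would repeat this 16 times, each time with its own set of kernels and its own bias.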

-----

# Convolution Guide

Notes:

A convolution can be viewed as a sparse fully connected layer.
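A small 1-D example makes this view concrete. The sketch below is illustrative only (the helper names and the input and kernel values are made up): it builds the weight matrix of a fully connected layer that computes the same result as a "valid" convolution, and counts how many of its entries are structurally zero.

```python
def conv1d_valid(x, w):
    """Valid 1-D cross-correlation of signal x with kernel w."""
    n_out = len(x) - len(w) + 1
    return [sum(x[i + j] * w[j] for j in range(len(w))) for i in range(n_out)]

def conv_as_dense_matrix(w, n_in):
    """Build the (n_out x n_in) weight matrix equivalent to conv1d_valid."""
    n_out = n_in - len(w) + 1
    matrix = [[0.0] * n_in for _ in range(n_out)]
    for i in range(n_out):
        for j, wj in enumerate(w):
            matrix[i][i + j] = wj  # the same weights are reused on every row
    return matrix

def matvec(matrix, x):
    """Plain matrix-vector product, i.e. a fully connected layer without bias."""
    return [sum(a * b for a, b in zip(row, x)) for row in matrix]

x = [1.0, 2.0, 3.0, 4.0, 5.0]
w = [1.0, 2.0, -1.0]
W = conv_as_dense_matrix(w, len(x))
dense_out = matvec(W, x)
conv_out = conv1d_valid(x, w)
zero_entries = sum(v == 0.0 for row in W for v in row)
print(dense_out, conv_out, zero_entries)
```

The dense matrix has 15 entries but only 3 distinct weights; the rest are zeros or repeats, which is the sparsity and weight sharing that the note refers to.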

-----

Figure 1: A 5-layer dense block with a growth rate of k = 4. Each layer takes all preceding feature-maps as input.

# DenseNet

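Notes:

Suppose the first layer takes k0 feature maps as input and outputs k feature maps. Every later layer also outputs k feature maps, and each layer's output becomes an input to every later layer.

So how are these k maps produced? The standard DenseNet does not use Conv1 (a 1x1 convolution). DenseNet-B does.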
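The bookkeeping implied here can be written down directly: layer l of a dense block sees k0 + k(l - 1) input feature maps. The sketch below uses illustrative values (k0 = 16 is an assumption, and the function name is hypothetical).

```python
def dense_block_input_channels(k0, k, num_layers):
    """Input feature-map count seen by each layer l = 1..num_layers of a dense block."""
    return [k0 + k * (layer - 1) for layer in range(1, num_layers + 1)]

# Illustrative values only: k0 = 16 input maps, growth rate k = 4, 5 layers.
inputs = dense_block_input_channels(k0=16, k=4, num_layers=5)
block_output = 16 + 4 * 5  # the block's output concatenates the input with every layer's k maps
print(inputs, block_output)
```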

-----

Figure 2: A deep DenseNet with three dense blocks. The layers between two adjacent blocks are referred to as transition layers and change feature-map sizes via convolution and pooling.

# DenseNet

Notes:

The transition layer reduces the number of channels with a 1x1 convolution and changes the feature-map size with pooling.

-----

Table 1: DenseNet architectures for ImageNet. The growth rate for the first 3 networks is k = 32, and k = 48 for DenseNet-161. Note that each “conv” layer shown in the table corresponds the sequence BN-ReLU-Conv.

# DenseNet

Notes:

Each "conv" layer corresponds to the sequence BN-ReLU-Conv. This table shows DenseNet-B.

-----

# DenseNet

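Notes:

Adding k feature maps per layer makes the total keep growing. The bottleneck layer uses Conv1 to force the input of the following convolution down to 4k feature maps.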
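A rough weight count shows why the bottleneck helps once many feature maps have accumulated. This is a back-of-the-envelope sketch: biases and batch-norm parameters are ignored, the 3x3 kernel size after the bottleneck follows the paper's DenseNet-B layer, and the input width of 512 maps with k = 32 is an illustrative assumption.

```python
def conv_weights(in_ch, out_ch, kernel):
    """Number of weights in a kernel x kernel convolution (no bias)."""
    return in_ch * out_ch * kernel * kernel

def dense_layer_weights(in_ch, k, bottleneck):
    """Weights of one dense layer, with or without the 1x1 bottleneck to 4k maps."""
    if bottleneck:
        return conv_weights(in_ch, 4 * k, 1) + conv_weights(4 * k, k, 3)
    return conv_weights(in_ch, k, 3)

plain = dense_layer_weights(in_ch=512, k=32, bottleneck=False)
with_bottleneck = dense_layer_weights(in_ch=512, k=32, bottleneck=True)
print(plain, with_bottleneck)
```

The saving grows with the number of accumulated input maps, because the 3x3 convolution always sees 4k maps instead of the full concatenation.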

-----

# DenseNet

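Notes:

The transition layer can use Conv1 to compress the number of feature maps. The paper compresses to half the number of maps (θ = 0.5).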
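The effect of compression can be traced with a few lines of arithmetic. The sketch below assumes the DenseNet-121 settings from Table 1 of the paper (growth rate k = 32, dense blocks of 6, 12, 24 and 16 layers, 2k initial channels, θ = 0.5); the function names are hypothetical.

```python
import math

def transition_output_channels(in_channels, theta=0.5):
    """Compression: a transition layer keeps floor(theta * m) feature maps."""
    return math.floor(theta * in_channels)

def trace_channels(k, block_layers, theta=0.5):
    """Return the channel count at the end of each dense block."""
    channels = 2 * k  # the initial convolution outputs 2k feature maps
    trace = []
    for index, num_layers in enumerate(block_layers):
        channels += num_layers * k  # each layer appends k new feature maps
        trace.append(channels)
        if index < len(block_layers) - 1:  # no transition after the last block
            channels = transition_output_channels(channels, theta)
    return trace

channel_trace = trace_channels(k=32, block_layers=[6, 12, 24, 16])
print(channel_trace)
```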

-----

Table 2: Error rates (%) on CIFAR and SVHN datasets. k denotes network’s growth rate. Results that surpass all competing methods are bold and the overall best results are blue. “+” indicates standard data augmentation (translation and/or mirroring). “∗” indicates results run by ourselves. All the results of DenseNets without data augmentation (C10, C100, SVHN) are obtained using Dropout. DenseNets achieve lower error rates while using fewer parameters than ResNet. Without data augmentation, DenseNet performs better by a large margin.

# DenseNet

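Notes:

Bold: better than all competing methods.

Blue: the best result overall.

+: data augmentation (crops from the top-left, bottom-left, top-right, bottom-right and center, plus horizontal flips).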

-----

Table 3: The top-1 and top-5 error rates on the ImageNet validation set, with single-crop (10-crop) testing.

# DenseNet

Notes:

Depth matters, but width (k) seems to matter even more.

-----

Figure 3: Comparison of the DenseNets and ResNets top-1 error rates (single-crop testing) on the ImageNet validation dataset as a function of learned parameters (left) and FLOPs during test-time (right).

# DenseNet

Notes:

DenseNet beats ResNet in both parameter count and FLOPs.

-----

Figure 4: Left: Comparison of the parameter efficiency on C10+ between DenseNet variations. Middle: Comparison of the parameter efficiency between DenseNet-BC and (pre-activation) ResNets. DenseNet-BC requires about 1/3 of the parameters as ResNet to achieve comparable accuracy. Right: Training and testing curves of the 1001-layer pre-activation ResNet [12] with more than 10M parameters and a 100-layer DenseNet with only 0.8M parameters.

# DenseNet

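Notes:

Left: DenseNet-BC is the most parameter-efficient variant.

Middle: at the same accuracy, DenseNet-BC needs about one third of the parameters of ResNet.

Right: DenseNet-BC, with far fewer parameters, generalizes better than ResNet. Could the reason be that DenseNet is an even denser ensemble?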

-----

Figure 5: The average absolute filter weights of convolutional layers in a trained DenseNet. The color of pixel (s, ℓ) encodes the average L1 norm (normalized by number of input feature-maps) of the weights connecting convolutional layer s to ℓ within a dense block. Three columns highlighted by black rectangles correspond to two transition layers and the classification layer. The first row encodes weights connected to the input layer of the dense block.

# DenseNet

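Notes:

Red means strong use; blue means almost no use. The horizontal axis is the target layer, and the vertical axis is a source layer that precedes it. The rightmost column and the top row correspond to transition layers.

The figure supports the following conclusions:

a) Features extracted by early layers may still be used by much deeper layers.

b) Even a transition layer may use features from all layers of the preceding dense block.

c) Layers in the second and third dense blocks make very little use of the preceding transition layer, which suggests that the transition layer outputs many redundant features. This also supports DenseNet-BC, i.e. the need for compression.

d) The final classification layer uses information from many layers of the preceding dense block, but it leans toward the last few feature maps. This suggests that some high-level features are produced in the last few layers of the network.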

https://www.tensorinfinity.com/paper_89.html

-----

Figure 4: The loss surfaces of ResNet-110-noshort and DenseNet for CIFAR-10.

# ResNet-V

Notes:

ResNet-110-noshort versus DenseNet. Is DenseNet an ensemble too?!

-----

Are we really seeing convexity? We are viewing the loss surface under a dramatic dimensionality reduction, and we need to be careful interpreting these plots. For this reason, we quantify the level of convexity in loss functions but computing the principle curvatures, which are simply eigenvalues of the Hessian. A truly convex function has no negative curvatures (the Hessian is positive semi-definite), while a non-convex function has negative curvatures.

Notes:

If the Hessian is positive semi-definite everywhere, the function is convex (smooth).
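For a two-variable function the test can be carried out by hand, since a symmetric 2x2 matrix has closed-form eigenvalues. The sketch below is illustrative (the function names and the two example functions are assumptions, not taken from the paper): it checks whether the smallest eigenvalue is non-negative.

```python
import math

def symmetric_2x2_eigenvalues(a, b, d):
    """Eigenvalues of the symmetric matrix [[a, b], [b, d]], smallest first."""
    mean = (a + d) / 2.0
    radius = math.hypot((a - d) / 2.0, b)
    return mean - radius, mean + radius

def is_positive_semidefinite(a, b, d, tol=1e-12):
    """A symmetric matrix is PSD exactly when its smallest eigenvalue is >= 0."""
    return symmetric_2x2_eigenvalues(a, b, d)[0] >= -tol

# f(x, y) = x^2 + y^2 has constant Hessian [[2, 0], [0, 2]]: a convex bowl.
bowl_is_convex = is_positive_semidefinite(2.0, 0.0, 2.0)
# g(x, y) = x^2 - y^2 has constant Hessian [[2, 0], [0, -2]]: a saddle.
saddle_is_convex = is_positive_semidefinite(2.0, 0.0, -2.0)
print(bowl_is_convex, saddle_is_convex)
```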

-----

https://zh.wikipedia.org/wiki/%E9%BB%91%E5%A1%9E%E7%9F%A9%E9%99%A3

Notes:

Hessian.

-----

https://ccjou.wordpress.com/2013/01/10/%E5%8D%8A%E6%AD%A3%E5%AE%9A%E7%9F%A9%E9%99%A3%E7%9A%84%E5%88%A4%E5%88%A5%E6%96%B9%E6%B3%95/

-----

# Hessian

Notes:

Eigenvalues and eigenvectors.

-----

Figure 7: For each point in the filter-normalized surface plots, we calculate the maximum and minimum eigenvalue of the Hessian, and map the ratio of these two.

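# ResNet-V

Notes:

At each point of the filter-normalized surface plot, the maximum and minimum eigenvalues of the Hessian are computed and their ratio is mapped.

When the Hessian is positive semi-definite there are no negative eigenvalues (the minimum eigenvalue is at least 0, not necessarily exactly 0). Regions where negative curvature is negligible relative to positive curvature appear dark blue, which indicates near-convex behavior. Regions with significant negative curvature shift toward yellow.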

-----

Figure 2: Architecture comparison of different networks. (a) The residual network. (b) The densely connected network, where each layer can access the outputs of all previous micro-blocks. Here, a 1 x 1 convolutional layer (underlined) is added for consistency with the micro-block design in (a). (c) By sharing the first 1 x 1 connection of the same output across micro-blocks in (b), the densely connected network degenerates to a residual network. The dotted rectangular in (c) highlights the residual unit. (d) The proposed dual path architecture, DPN. (e) An equivalent form of (d) from the perspective of implementation, where the symbol “o” denotes a split operation, and “+” denotes element-wise addition.

# DPN

(a) ResNet.

(b) DenseNet.

(c) DenseNet rewritten in residual form.

(d) DPN.

(e) An equivalent form of (d).

-----

# CSPNet

Notes:

Part of the feature maps go through the DenseBlock and part bypass it, so the computation drops. The results may be close.

-----

# DenseNet

Huang, Gao, et al. "Densely connected convolutional networks." Proceedings of the IEEE conference on computer vision and pattern recognition. 2017.

http://openaccess.thecvf.com/content_cvpr_2017/papers/Huang_Densely_Connected_Convolutional_CVPR_2017_paper.pdf

# LeNet

LeCun, Yann, et al. "Gradient-based learning applied to document recognition." Proceedings of the IEEE 86.11 (1998): 2278-2324.

http://yann.lecun.com/exdb/publis/pdf/lecun-01a.pdf

# Hessian

Brown, David E. "The Hessian matrix: Eigenvalues, concavity, and curvature." BYU Idaho Department of Mathematics (2014).

https://www.iith.ac.in/~ashok/Maths_Lectures/TutorialB/Hessian_Examples.pdf

# ResNet-V. Cited 464 times. Ensembling smooths the loss function, which makes training easier.

Li, Hao, et al. "Visualizing the loss landscape of neural nets." Advances in Neural Information Processing Systems. 2018.

https://papers.nips.cc/paper/2018/file/a41b3bb3e6b050b6c9067c67f663b915-Paper.pdf

# DPN

Chen, Yunpeng, et al. "Dual path networks." Advances in Neural Information Processing Systems. 2017.

https://proceedings.neurips.cc/paper/2017/file/f7e0b956540676a129760a3eae309294-Paper.pdf

# CSPNet

Wang, Chien-Yao, et al. "CSPNet: A new backbone that can enhance learning capability of CNN." Proceedings of the IEEE/CVF conference on computer vision and pattern recognition workshops. 2020.

https://openaccess.thecvf.com/content_CVPRW_2020/papers/w28/Wang_CSPNet_A_New_Backbone_That_Can_Enhance_Learning_Capability_of_CVPRW_2020_paper.pdf

# Convolution Guide

Dumoulin, Vincent, and Francesco Visin. "A guide to convolution arithmetic for deep learning." arXiv preprint arXiv:1603.07285 (2016).

https://arxiv.org/pdf/1603.07285.pdf

-----

[1] DenseNet：比ResNet更優的CNN模型- 知乎

https://zhuanlan.zhihu.com/p/37189203

[2] DenseNet詳解

https://www.tensorinfinity.com/paper_89.html

[3] [線性系統] 對角化與 Eigenvalues and Eigenvectors

https://ch-hsieh.blogspot.com/2010/08/eigenvalues-and-eigenvectors.html

[4] 半正定矩陣的判別方法 | 線代啟示錄

https://ccjou.wordpress.com/2013/01/10/%E5%8D%8A%E6%AD%A3%E5%AE%9A%E7%9F%A9%E9%99%A3%E7%9A%84%E5%88%A4%E5%88%A5%E6%96%B9%E6%B3%95/

[5] 【蜻蜓点论文】Visualizing the Loss Landscape of Neural Nets - YouTube

https://www.youtube.com/watch?v=xVxMvoacWMw

[6] Tom Goldstein: "What do neural loss surfaces look like?" - YouTube

https://www.youtube.com/watch?v=78vq6kgsTa8

-----

DenseNet - Keras

keras-applications/densenet.py at master · keras-team/keras-applications · GitHub

https://github.com/keras-team/keras-applications/blob/master/keras_applications/densenet.py

DenseNet - TensorFlow

GitHub - taki0112/Densenet-Tensorflow: Simple Tensorflow implementation of Densenet using Cifar10, MNIST

https://github.com/taki0112/Densenet-Tensorflow

DenseNet - PyTorch

vision/densenet.py at master · pytorch/vision · GitHub

https://github.com/pytorch/vision/blob/master/torchvision/models/densenet.py

-----

DenseNet (2): Overview

2020/12/28

-----

Under construction...

-----

https://pixabay.com/zh/photos/architecture-construction-sites-3254023/

-----

◎ Abstract

-----

◎ Introduction

-----

Which problems (weaknesses) of earlier work does this paper set out to solve?

-----

# ResNet v2

-----

◎ Method

-----

The solution?

-----

# DenseNet

-----

Specific details?

http://hemingwang.blogspot.com/2021/03/densenetillustrated.html

-----

◎ Result

-----

Results of this paper.

-----

◎ Discussion

-----

Comparison of this paper with other papers (results or methods).

-----

Comparison of results.

-----

Comparison of methods.

-----

◎ Conclusion

-----

◎ Future Work

-----

Follow-up research in related areas.

-----

# CSPNet

-----

Follow-up research in extended areas.

-----

# Tiramisu

-----

◎ References

-----

# ResNet v2. Cited 4560 times. The focus shifts from the residual block to a pure identity mapping; networks can reach a thousand layers.

He, Kaiming, et al. "Identity mappings in deep residual networks." European conference on computer vision. Springer, Cham, 2016.

https://arxiv.org/pdf/1603.05027.pdf

# DenseNet. Cited 12498 times. Repeated use of conv1 can also deepen the network.

Huang, Gao, et al. "Densely connected convolutional networks." Proceedings of the IEEE conference on computer vision and pattern recognition. 2017.

https://openaccess.thecvf.com/content_cvpr_2017/papers/Huang_Densely_Connected_Convolutional_CVPR_2017_paper.pdf

# CSPNet

Wang, Chien-Yao, et al. "CSPNet: A new backbone that can enhance learning capability of CNN." Proceedings of the IEEE/CVF conference on computer vision and pattern recognition workshops. 2020.

https://openaccess.thecvf.com/content_CVPRW_2020/papers/w28/Wang_CSPNet_A_New_Backbone_That_Can_Enhance_Learning_Capability_of_CVPRW_2020_paper.pdf

# Tiramisu

Jégou, Simon, et al. "The one hundred layers tiramisu: Fully convolutional densenets for semantic segmentation." Proceedings of the IEEE conference on computer vision and pattern recognition workshops. 2017.

https://openaccess.thecvf.com/content_cvpr_2017_workshops/w13/papers/Jegou_The_One_Hundred_CVPR_2017_paper.pdf

-----

DenseNet (1): Paper Translation

2021/03/27

-----

Due to time constraints, no translation will be provided.

-----

https://pixabay.com/zh/photos/landscape-forest-trees-jungle-2584127/

-----

# DenseNet

Huang, Gao, et al. "Densely connected convolutional networks." Proceedings of the IEEE conference on computer vision and pattern recognition. 2017.

http://openaccess.thecvf.com/content_cvpr_2017/papers/Huang_Densely_Connected_Convolutional_CVPR_2017_paper.pdf

-----

Sunday, May 23, 2021

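Fundamentals of Deep Learning Theory

2021/02/26

Notes:

Course materials by Prof. Hung-yi Lee.

Before studying LeNet and AlexNet, you can watch the first four lectures of the video series as supplementary material.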

-----

https://pixabay.com/zh/photos/excavators-blade-1937151/

-----

Materials

[1] courses

https://speech.ee.ntu.edu.tw/~tlkagk/courses.html

-----

Machine Learning Notes 1

Date watched: 2021/02/25.

Content: Course Info (2015/09/18); What is Machine Learning, Deep Learning and Structured Learning? (2015/09/18)

Thoughts: a lively and engaging way to build up the basic concepts.

http://speech.ee.ntu.edu.tw/~tlkagk/courses_MLSD15_2.html

-----

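Machine Learning Notes 2

Date watched: 2021/02/26.

Content: Neural Network (Basic Ideas) (2015/09/25)

Thoughts: a video of over two hours that carefully covers neural networks and gradient descent, plus a hands-on speech exercise. This is the common foundation of CNNs and RNNs. Anyone interested in deep learning who has time to watch the lectures in order should come away with a solid foundation.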

http://speech.ee.ntu.edu.tw/~tlkagk/courses_MLSD15_2.html

-----

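Machine Learning Notes 3

Date watched: 2021/02/27.

Content: Backpropagation (2015/10/02); Theano: DNN (2015/10/02)

Thoughts: about thirty minutes on backpropagation, explained in great detail; if one pass is not enough, watch it again. There is also about an hour on Theano. Theano is outdated, but this part mainly supports a small neural network exercise, so it is still worth a look.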

http://speech.ee.ntu.edu.tw/~tlkagk/courses_MLSD15_2.html

-----

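Machine Learning Notes 4

Date watched: 2021/03/01.

Content: Tips for Training Deep Neural Network (2015/10/16)

Thoughts: about two and a half hours. Main topics: activation functions (ReLU, Maxout); the softmax loss; optimization (Adagrad, Momentum); regularization (Weight Decay, Dropout). It works well as supplementary material for AlexNet. It is dense, so it is best watched over several sittings.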

http://speech.ee.ntu.edu.tw/~tlkagk/courses_MLSD15_2.html

-----

I originally planned to work through all the videos slowly, but decided to run through more tutorials instead. 2021/03/24.

-----

Sunday, May 16, 2021

ResNet (3): Illustrated

2021/03/27

-----

https://pixabay.com/zh/photos/utah-america-nevada-arizona-4272944/

-----

Outline

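1. ResNet v1: residuals are easier to train; networks can reach a hundred layers.

2. ResNet-D: v1 is not an ensemble, dropout is an ensemble; networks can reach a thousand layers.

3. ResNet v2: the key is moving ReLU off the identity path; networks can reach a thousand layers.

4. ResNet-E: the identity paths of v2 form an ensemble.

5. ResNet-V: an ensemble is an average, which smooths the loss surface and helps training reach a thousand layers.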

-----

Figure 1. Training error (left) and test error (right) on CIFAR-10 with 20-layer and 56-layer “plain” networks. The deeper network has higher training error, and thus test error. Similar phenomena on ImageNet is presented in Fig. 4.

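# ResNet v1

Notes:

This is not overfitting, because overfitting means good training results but poor test results. The figure shows poor training error and, as a consequence, poor test error; this is network degradation. Up to the depth of VGG-19, adding layers improves accuracy, but beyond that depth, adding more layers lowers accuracy.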

-----

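# ResNet v1

Notes:

Here x and y are the input and output vectors of the layers considered, and the function F(x, {Wi}) is the residual mapping to be learned, so that y = F(x, {Wi}) + x (Eqn. (1) in the paper).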
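The formula y = F(x, {Wi}) + x from the ResNet paper can be illustrated with a toy vector version. This is a minimal sketch, not the paper's implementation: `residual_fn` stands in for the stacked convolutional layers, and both example residual functions are made up.

```python
def residual_block(x, residual_fn):
    """y = F(x) + x: add the learned residual to the unchanged input."""
    fx = residual_fn(x)
    return [f + xi for f, xi in zip(fx, x)]

def zero_residual(x):
    """A residual branch that has learned to output nothing."""
    return [0.0 for _ in x]

def small_residual(x):
    """A hypothetical residual branch that nudges each component by 10%."""
    return [0.1 * xi for xi in x]

x = [1.0, -2.0, 3.0]
identity_out = residual_block(x, zero_residual)
nudged_out = residual_block(x, small_residual)
print(identity_out, nudged_out)
```

When the residual branch outputs zeros, the block reduces to the identity mapping, which is why extra blocks should not make a deeper network harder to fit than a shallower one.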

-----

Figure 2. Residual learning: a building block.

# ResNet v1

Notes:

Architecture of the ResNet v1 block. Once an identity mapping is forced in, what the convolutional layers have to learn is the residual.

-----

Figure 3. Example network architectures for ImageNet. Left: the VGG-19 model [40] (19.6 billion FLOPs) as a reference. Middle: a plain network with 34 parameter layers (3.6 billion FLOPs). Right: a residual network with 34 parameter layers (3.6 billion FLOPs). The dotted shortcuts increased dimensions. Table 1 shows more details and other variants.

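# ResNet v1

Notes:

VGG-19, ResNet without identity mappings (the plain network), and ResNet.

In the 34-layer ResNet, the solid shortcuts are identity mappings and the dotted shortcuts are projection shortcuts, used where the input and output dimensions differ.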

-----

Table 1. Architectures for ImageNet. Building blocks are shown in brackets (see also Fig. 5), with the numbers of blocks stacked. Downsampling is performed by conv3_1, conv4_1, and conv5_1 with a stride of 2.

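# ResNet v1

Notes:

The 18-layer network consists of one conv1 layer; two two-layer blocks in each of conv2, conv3, conv4 and conv5; and one fully connected layer.

ResNet-50, 101 and 152 replace the two-layer blocks with three-layer bottleneck blocks.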
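The layer counts can be checked by simple arithmetic: one conv1 layer, then the stacked blocks, then one fully connected layer. The block counts per stage below are taken from Table 1 of the ResNet paper; the dictionary and function names are hypothetical.

```python
# Depth -> (block type, number of blocks in conv2_x, conv3_x, conv4_x, conv5_x).
RESNET_CONFIGS = {
    18: ("basic", [2, 2, 2, 2]),
    34: ("basic", [3, 4, 6, 3]),
    50: ("bottleneck", [3, 4, 6, 3]),
    101: ("bottleneck", [3, 4, 23, 3]),
    152: ("bottleneck", [3, 8, 36, 3]),
}
LAYERS_PER_BLOCK = {"basic": 2, "bottleneck": 3}

def count_weight_layers(block_type, blocks_per_stage):
    """conv1 + (layers per block * total blocks) + the final fully connected layer."""
    return 1 + LAYERS_PER_BLOCK[block_type] * sum(blocks_per_stage) + 1

for depth, (block_type, stages) in RESNET_CONFIGS.items():
    print(depth, count_weight_layers(block_type, stages))
```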

-----

Figure 4. Training on ImageNet. Thin curves denote training error, and bold curves denote validation error of the center crops. Left: plain networks of 18 and 34 layers. Right: ResNets of 18 and 34 layers. In this plot, the residual networks have no extra parameter compared to their plain counterparts.

# ResNet v1

Notes:

ImageNet example: going from 18 to 34 layers, ResNet keeps improving (higher accuracy, lower error rate).

-----

Table 2. Top-1 error (%, 10-crop testing) on ImageNet validation. Here the ResNets have no extra parameter compared to their plain counterparts. Fig. 4 shows the training procedures.

# ResNet v1

Notes:

ImageNet example: going from 18 to 34 layers, ResNet keeps improving (higher accuracy, lower error rate).

-----

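# ResNet v1

Notes:

The dotted shortcuts mark an increase in dimensions. There are three ways to increase dimensions, options A, B and C; see the discussion below.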

-----

Next we investigate projection shortcuts (Eqn.(2)). In Table 3 we compare three options: (A) zero-padding shortcuts are used for increasing dimensions, and all shortcuts are parameter-free (the same as Table 2 and Fig. 4 right); (B) projection shortcuts are used for increasing dimensions, and other shortcuts are identity; and (C) all shortcuts are projections.

# ResNet v1

Notes:

See below.

-----

Figure 5. Structure of residual unit (a) with zero-padded identity-mapping shortcut, (b) unraveled view of (a) showing that the zero-padded identity-mapping shortcut constitutes a mixture of a residual network with a shortcut connection and a plain network.

# PyramidNet

Notes:

A: zero-padding.

-----

The dimensions of x and F must be equal in Eqn.(1). If this is not the case (e.g., when changing the input/output channels), we can perform a linear projection Ws by the shortcut connections to match the dimensions:

y = F(x, {Wi}) + Ws x. (2)

We can also use a square matrix Ws in Eqn.(1). But we will show by experiments that the identity mapping is sufficient for addressing the degradation problem and is economical, and thus Ws is only used when matching dimensions.

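# ResNet v1

Notes:

A linear projection is simply a projection.

「In linear algebra and functional analysis, a projection is a linear transformation from a vector space to itself.」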

https://baike.baidu.com/item/%E6%AD%A3%E4%BA%A4%E6%8A%95%E5%BD%B1

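「In an inner product space, besides the inner product itself, another powerful algebraic tool is the orthogonal projection, which decomposes any vector into a sum of orthogonal components.」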

https://ccjou.wordpress.com/2010/04/19/%E6%AD%A3%E4%BA%A4%E6%8A%95%E5%BD%B1-%E5%A8%81%E5%8A%9B%E5%BC%B7%E5%A4%A7%E7%9A%84%E4%BB%A3%E6%95%B8%E5%B7%A5%E5%85%B7/

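Different dimensions: if the number of channels differs, Conv1 can be used; if the spatial size differs, pooling can be used.

Number of feature maps unchanged: e.g. the Conv1 used in VGGNet.

Number of feature maps changed: e.g. ResNet, where the number of feature maps doubles. Also, the bottleneck block first reduces the dimension to cut computation and then restores it.

Notes:

B: Conv1 for increasing dimensions, identity mapping everywhere else. C is slightly better but costs more computation than B, so B is the option adopted later.

C: Conv1 for increasing dimensions, and the remaining identity mappings are replaced by a square (dimension-preserving) Conv1.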

-----

Table 3. Error rates (%, 10-crop testing) on ImageNet validation. VGG-16 is based on our test. ResNet-50/101/152 are of option B that only uses projections for increasing dimensions.

# ResNet v1

Notes:

(10-crop testing.)

B is slightly better than A. C is slightly better than B, but costs considerably more computation.

-----

Table 4. Error rates (%) of single-model results on the ImageNet validation set (except † reported on the test set).

# ResNet v1

Notes:

Single model.

-----

Table 5. Error rates (%) of ensembles. The top-5 error is on the test set of ImageNet and reported by the test server.

# ResNet v1

Notes:

Ensemble (multiple models).

-----

Figure 5. A deeper residual function F for ImageNet. Left: a building block (on 56×56 feature maps) as in Fig. 3 for ResNet-34. Right: a “bottleneck” building block for ResNet-50/101/152.

# ResNet v1

Notes:

The ResNet v1 basic block and the bottleneck block.
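The two blocks in this figure can be compared by counting weights. The sketch assumes the channel widths shown in Fig. 5 of the ResNet paper (two 3x3 convolutions on 64 channels versus 1x1, 3x3, 1x1 convolutions going 256 to 64 to 64 to 256) and ignores biases and batch-norm parameters; the helper name is hypothetical.

```python
def conv_weights(in_ch, out_ch, kernel):
    """Number of weights in a kernel x kernel convolution (no bias)."""
    return in_ch * out_ch * kernel * kernel

# Basic block: two 3x3 convolutions on 64-channel feature maps.
basic_block = conv_weights(64, 64, 3) + conv_weights(64, 64, 3)

# Bottleneck block: 1x1 reduce (256 -> 64), 3x3 (64 -> 64), 1x1 restore (64 -> 256).
bottleneck_block = (
    conv_weights(256, 64, 1) + conv_weights(64, 64, 3) + conv_weights(64, 256, 1)
)
print(basic_block, bottleneck_block)
```

The bottleneck block handles four times as many input and output channels for a similar number of weights, which is what makes the 50, 101 and 152-layer networks affordable.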

-----

Table 6. Classification error on the CIFAR-10 test set. All methods are with data augmentation. For ResNet-110, we run it 5 times and show “best (mean±std)” as in [42].

# ResNet v1

Notes:

CIFAR-10 test.

-----

Figure 6. Training on CIFAR-10. Dashed lines denote training error, and bold lines denote testing error. Left: plain networks. The error of plain-110 is higher than 60% and not displayed. Middle: ResNets. Right: ResNets with 110 and 1202 layers.

# ResNet v1

Notes:

On CIFAR-10, ResNet performance gets worse going from 110 to 1202 layers.

-----

Table 7. Object detection mAP (%) on the PASCAL VOC 2007/2012 test sets using baseline Faster R-CNN. See also appendix for better results.

Table 8. Object detection mAP (%) on the COCO validation set using baseline Faster R-CNN. See also appendix for better results.

# ResNet v1

Notes:

ResNet v1 beats VGG16 on both PASCAL VOC and COCO.

-----

Fig. 2. The linear decay of p_ℓ illustrated on a ResNet with stochastic depth for p_0 = 1 and p_L = 0.5. Conceptually, we treat the input to the first ResBlock as H_0, which is always active.

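# ResNet-D

Notes:

「Here L denotes the total number of blocks, so p_L is the survival probability of the last residual block, which is fixed at 0.5 throughout the experiments. Note also that in this setting the input is treated as the first layer (l = 0) and is therefore never dropped.」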
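The linear decay rule from the stochastic depth paper, p_l = 1 - (l / L)(1 - p_L), is easy to tabulate. In the sketch below the function name is hypothetical and L = 54 is an assumed block count for a 110-layer CIFAR ResNet.

```python
def survival_probability(block_index, total_blocks, p_last=0.5):
    """Linear decay: p_l = 1 - (l / L) * (1 - p_L), with l = 0 being the input."""
    return 1.0 - (block_index / total_blocks) * (1.0 - p_last)

total_blocks = 54  # assumed: residual blocks in a 110-layer CIFAR ResNet
p_input = survival_probability(0, total_blocks)
p_final = survival_probability(total_blocks, total_blocks)
# Expected number of blocks that stay active during one training pass.
expected_active = sum(
    survival_probability(l, total_blocks) for l in range(1, total_blocks + 1)
)
print(p_input, p_final, expected_active)
```

With these settings roughly three quarters of the blocks are active on average during training, while the full depth is used at test time.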

http://hemingwang.blogspot.com/2019/10/an-overview-of-resnet-and-its-variants.html

-----

Fig. 5. With stochastic depth, the 1202-layer ResNet still significantly improves over the 110-layer one.

# ResNet-D

Notes:

On CIFAR-10, the 1202-layer ResNet-D has a lower error rate than the 110-layer one.

-----

# ResNet v2

Notes:

CIFAR-10 error rates.

ResNet v1 with 110 layers: 6.43%. ResNet v1 with 1202 layers: 7.93%. (Table 6 of the original paper.)

ResNet v1 with 1001 layers: 7.61%. ResNet v2 with 1001 layers: 4.92%.

-----

Figure 1: Residual Networks are conventionally shown as (a), which is a natural representation of Equation (1). When we expand this formulation to Equation (6), we obtain an unraveled view of a 3-block residual network (b). Circular nodes represent additions. From this view, it is apparent that residual networks have O(2^n) implicit paths connecting input and output and that adding a block doubles the number of paths.

# ResNet-E

Notes:

A residual network has implicit paths connecting input and output, and adding one block doubles the number of paths.
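The doubling can be checked by brute-force enumeration: every path either skips a block or passes through its residual branch. This is an illustrative sketch with hypothetical function names; it also tallies paths by length, which gives binomial counts.

```python
from collections import Counter
from itertools import product

def enumerate_paths(num_blocks):
    """Each path makes one binary choice per block: 0 = skip, 1 = residual branch."""
    return list(product((0, 1), repeat=num_blocks))

def path_length_histogram(num_blocks):
    """Count paths by length (number of residual branches traversed)."""
    return Counter(sum(path) for path in enumerate_paths(num_blocks))

for n in (1, 2, 3, 4):
    print(n, "blocks ->", len(enumerate_paths(n)), "paths")
print(dict(path_length_histogram(3)))
```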

-----

Figure 5: (a) Error increases smoothly when randomly deleting several modules from a residual network. (b) Error also increases smoothly when re-ordering a residual network by shuffling building blocks. The degree of reordering is measured by the Kendall Tau correlation coefficient. These results are similar to what one would expect from ensembles.

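「1. Introduction. In statistics, the Kendall correlation coefficient is named after Maurice Kendall and is usually denoted by the Greek letter τ (tau). ... The Kendall correlation coefficient ranges from -1 to 1. When τ is 1, the two random variables have identical rank ordering; when τ is -1, they have exactly opposite rank ordering; when τ is 0, the two random variables are independent. (December 19, 2018)」 From a Google search.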

https://blog.csdn.net/zhaozhn5/article/details/78392220
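For rankings without ties, the coefficient can be computed by counting concordant and discordant pairs. The sketch below implements that tau-a variant in plain Python; the function name and the example orderings are made up, and this is not the code used in the paper.

```python
from itertools import combinations

def kendall_tau(order_a, order_b):
    """Kendall tau-a for two rankings of the same items, assuming no ties."""
    n = len(order_a)
    concordant = discordant = 0
    for i, j in combinations(range(n), 2):
        sign = (order_a[i] - order_a[j]) * (order_b[i] - order_b[j])
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    return (concordant - discordant) / (n * (n - 1) / 2)

original = [0, 1, 2, 3]
tau_same = kendall_tau(original, [0, 1, 2, 3])      # unchanged order
tau_reversed = kendall_tau(original, [3, 2, 1, 0])  # fully reversed order
tau_one_swap = kendall_tau(original, [1, 0, 2, 3])  # one adjacent swap
print(tau_same, tau_reversed, tau_one_swap)
```

In the reordering experiment, a value near 1 means the blocks are almost in their original order, and lower values mean heavier shuffling.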

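# ResNet-E

Notes:

(a) Randomly delete a few layers: the more layers are deleted, the higher the error.

(b) Randomly swap pairs of layers (k pairs in total): the resulting error rate is inversely related to the Kendall coefficient of the reordering.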

-----

Figure 6: How much gradient do the paths of different lengths contribute in a residual network? To find out, we first show the distribution of all possible path lengths (a). This follows a Binomial distribution. Second, we record how much gradient is induced on the first layer of the network through paths of varying length (b), which appears to decay roughly exponentially with the number of modules the gradient passes through. Finally, we can multiply these two functions (c) to show how much gradient comes from all paths of a certain length. Though there are many paths of medium length, paths longer than 20 modules are generally too long to contribute noticeable gradient during training. This suggests that the effective paths in residual networks are relatively shallow.

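# ResNet-E

「Surprisingly, most of the contribution comes from paths of length 9 to 18 (VGG?), as shown in (c), yet they make up only a tiny fraction of all paths, as shown in (a). This is a very interesting finding, because it suggests that ResNet does not solve the vanishing-gradient problem for very long paths; rather, ResNet makes very deep networks trainable by shortening the effective paths.」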

http://hemingwang.blogspot.com/2019/10/an-overview-of-resnet-and-its-variants.html

Notes:

(a) The number of paths of each length.

(b) The gradient that paths of each length induce at the first layer.

-----

Figure 1: The loss surfaces of ResNet-56 with/without skip connections. The proposed filter normalization scheme is used to enable comparisons of sharpness/flatness between the two figures.

# ResNet-V

Notes:

A smooth ResNet loss function is easy to train.

-----

Figure 5: 2D visualization of the loss surface of ResNet and ResNet-noshort with different depth.

# ResNet-V

Notes:

The deeper the network, the more clearly shortcut connections help.

-----

Figure 6: Wide-ResNet-56 on CIFAR-10 both with shortcut connections (top) and without (bottom). The label k = 2 means twice as many filters per layer. Test error is reported below each figure.

# ResNet-V

Notes:

The label k = 2 means twice as many filters per layer.

A wider WRN does fine even without shortcut connections, but the computation increases.

-----

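# ResNet-V

Notes:

The loss is projected from a high-dimensional space onto a low-dimensional one. In high dimensions, any two random vectors are very likely to be nearly orthogonal. (From the 【蜻蜓点论文】 video.)

θ* is the center point.

δ and η are two orthogonal vectors sampled from a Gaussian distribution, with the same dimension as θ*.

α and β are the scales of the two vectors, so the plotted surface is f(α, β) = L(θ* + αδ + βη).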
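Both claims can be tried out numerically: two Gaussian vectors in a high-dimensional space are close to orthogonal, and the plotted surface is the loss evaluated at θ* + αδ + βη. The sketch below is a toy under stated assumptions: the quadratic `toy_loss`, the dimension of 10000, and all function names are made up, and the filter normalization used in the paper is omitted.

```python
import math
import random

def gaussian_direction(dim, rng):
    """Sample a random direction with i.i.d. standard normal entries."""
    return [rng.gauss(0.0, 1.0) for _ in range(dim)]

def cosine(u, v):
    """Cosine of the angle between two vectors (0 means orthogonal)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def loss_slice(loss_fn, theta_star, delta, eta, alpha, beta):
    """Evaluate the loss at theta* + alpha * delta + beta * eta."""
    point = [t + alpha * d + beta * e for t, d, e in zip(theta_star, delta, eta)]
    return loss_fn(point)

def toy_loss(theta):
    """Stand-in loss: a quadratic bowl with its minimum at the origin."""
    return sum(t * t for t in theta)

rng = random.Random(0)
dim = 10000
delta = gaussian_direction(dim, rng)
eta = gaussian_direction(dim, rng)
cos_angle = cosine(delta, eta)
theta_star = [0.0] * dim
center = loss_slice(toy_loss, theta_star, delta, eta, 0.0, 0.0)
off_center = loss_slice(toy_loss, theta_star, delta, eta, 1.0, 0.0)
print(cos_angle, center, off_center)
```

Sweeping α and β over a grid and plotting the returned values would give the 2D surface.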

-----

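# ResNet-V

Notes:

Filter normalization.

Take a 3x3 kernel as an example: the j-th filter consists of w11, w12, w13, w21, w22, w23, w31, w32, w33, b. In the random direction d, each of these values is sampled from a Gaussian distribution.

The norm of d_ij is the square root of the sum of squares of all the entries above. Each direction filter is then rescaled to the norm of the corresponding network filter: d_ij ← d_ij / ||d_ij|| · ||θ_ij||.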

https://zhuanlan.zhihu.com/p/52314278
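The rescaling rule from the loss landscape paper, d_ij ← d_ij / ||d_ij|| · ||θ_ij||, can be sketched on flattened filters. This is a minimal illustration with made-up numbers and hypothetical function names; it treats each filter as a flat list of values and does not handle zero-norm direction filters.

```python
import math

def l2_norm(values):
    """Square root of the sum of squares of all entries of one filter."""
    return math.sqrt(sum(v * v for v in values))

def filter_normalize(direction_filters, weight_filters):
    """Rescale each direction filter to the norm of the matching weight filter."""
    normalized = []
    for d, theta in zip(direction_filters, weight_filters):
        scale = l2_norm(theta) / l2_norm(d)
        normalized.append([v * scale for v in d])
    return normalized

# Two tiny "filters" with made-up values, flattened to lists.
direction = [[3.0, 4.0], [1.0, 0.0]]
weights = [[0.6, 0.8], [0.0, 2.0]]
normalized = filter_normalize(direction, weights)
print(normalized)
```

After normalization each direction filter keeps its orientation but matches the scale of the trained filter, which is what makes sharpness comparable between networks.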

-----

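# ResNet-V

Notes:

Generalization ability refers specifically to how well a learning algorithm adapts to new samples.

With filter normalization, the flatness of the minimizer (the lowest point) is found to correlate positively with generalization ability. See Figure 1, for example.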

-----

References

# VGGNet. Cited 47721 times. Two stacked conv3 layers stand in for one conv5; repeating this deepens the network to 16 and 19 layers.

Simonyan, Karen, and Andrew Zisserman. "Very deep convolutional networks for large-scale image recognition." arXiv preprint arXiv:1409.1556 (2014).

https://arxiv.org/pdf/1409.1556.pdf

# ResNet v1. Cited 61600 times. Adding an identity mapping inspired by LSTM lets networks reach a hundred layers.

He, Kaiming, et al. "Deep residual learning for image recognition." Proceedings of the IEEE conference on computer vision and pattern recognition. 2016.

https://openaccess.thecvf.com/content_cvpr_2016/papers/He_Deep_Residual_Learning_CVPR_2016_paper.pdf

# ResNet-D. Cited 982 times. A dropout version of ResNet v1; networks can reach a thousand layers.

Huang, Gao, et al. "Deep networks with stochastic depth." European conference on computer vision. Springer, Cham, 2016.

https://arxiv.org/pdf/1603.09382.pdf

# ResNet v2. Cited 4560 times. The focus shifts from the residual block to a pure identity mapping; networks can reach a thousand layers.

He, Kaiming, et al. "Identity mappings in deep residual networks." European conference on computer vision. Springer, Cham, 2016.

https://arxiv.org/pdf/1603.05027.pdf

# ResNet-E. Cited 551 times. ResNet v2 is in effect an ensemble of shallow networks.

Veit, Andreas, Michael J. Wilber, and Serge Belongie. "Residual networks behave like ensembles of relatively shallow networks." Advances in neural information processing systems. 2016.

https://papers.nips.cc/paper/2016/file/37bc2f75bf1bcfe8450a1a41c200364c-Paper.pdf

# ResNet-V. Cited 464 times. Ensembling smooths the loss function, which makes training easier.

Li, Hao, et al. "Visualizing the loss landscape of neural nets." Advances in Neural Information Processing Systems. 2018.

https://papers.nips.cc/paper/2018/file/a41b3bb3e6b050b6c9067c67f663b915-Paper.pdf

# PyramidNet

Han, Dongyoon, Jiwhan Kim, and Junmo Kim. "Deep pyramidal residual networks." Proceedings of the IEEE conference on computer vision and pattern recognition. 2017.

https://openaccess.thecvf.com/content_cvpr_2017/papers/Han_Deep_Pyramidal_Residual_CVPR_2017_paper.pdf

# Batch Normalization

Ioffe, Sergey, and Christian Szegedy. "Batch normalization: Accelerating deep network training by reducing internal covariate shift." International conference on machine learning. 2015.

http://proceedings.mlr.press/v37/ioffe15.pdf

-----

Visualizing Loss Landscape of Deep Neural Networks…..but can we Trust them? | by Jae Duk Seo | Towards Data Science

https://towardsdatascience.com/visualizing-loss-landscape-of-deep-neural-networks-but-can-we-trust-them-3d3ae0cff46e

Loss landscapes and the blessing of dimensionality | by Javier Ideami | Towards Data Science

https://towardsdatascience.com/loss-landscapes-and-the-blessing-of-dimensionality-46685e28e6a4

[论文阅读]损失函数可视化及其对神经网络的指导作用 - 知乎

https://zhuanlan.zhihu.com/p/52314278

NeurIPS 2018提前看：可视化神经网络泛化能力 | 机器之心

https://www.jiqizhixin.com/articles/2018-11-28-10

【蜻蜓点论文】Visualizing the Loss Landscape of Neural Nets - YouTube

https://www.youtube.com/watch?v=xVxMvoacWMw