The Star Also Rises: NIN（一）：Paper Translation

2021/03/07

Network In Network

網路中的網路

-----

https://pixabay.com/zh/photos/drops-raindrops-synapse-network-1101259/

-----

Abstract

摘要

-----

We propose a novel deep network structure called “Network In Network”(NIN) to enhance model discriminability for local patches within the receptive field. The conventional convolutional layer uses linear filters followed by a nonlinear activation function to scan the input. Instead, we build micro neural networks with more complex structures to abstract the data within the receptive field. We instantiate the micro neural network with a multilayer perceptron, which is a potent function approximator.

The feature maps are obtained by sliding the micro networks over the input in a similar manner as CNN; they are then fed into the next layer. Deep NIN can be implemented by stacking multiple instances of the structure described above. With enhanced local modeling via the micro network, we are able to utilize global average pooling over feature maps in the classification layer, which is easier to interpret and less prone to overfitting than traditional fully connected layers. We demonstrate state-of-the-art classification performance with NIN on CIFAR-10 and CIFAR-100, and reasonable performance on the SVHN and MNIST datasets.

我們提出了一種新穎的深層網路結構，稱為“網路中的網路”（NIN），以增強模型對可接收區域內局部色塊的判別能力。傳統的卷積層使用線性濾波器，然後使用非線性激活函數來掃描輸入。取而代之的是，我們構建具有更複雜結構的微神經網路，以萃取接受域中的資料。我們使用多層感知器實做微神經網路，該感知器是一種有效的函數逼近器。

通過以類似於 CNN 的方式在輸入上滑動微網路來獲得特徵圖。然後將它們送入下一層。深度 NIN 可以通過堆疊上述結構的多個實現。通過微網路增強的局部建模，我們能夠在分類層的特徵圖上利用全局平均池化，與傳統的全連接層相比，它更易於解釋且不太容易過擬合。我們展示了 NIN 在 CIFAR-10 和 CIFAR-100 上的卓越性能，以及在 SVHN 和 MNIST 數據集上的合理性能。

-----

1 Introduction

1 引言

Convolutional neural networks (CNNs) [1] consist of alternating convolutional layers and pooling layers. Convolution layers take inner product of the linear filter and the underlying receptive field followed by a nonlinear activation function at every local portion of the input. The resulting outputs are called feature maps.

卷積神經網路（CNN）[1] 由交替的卷積層和池化層組成。卷積層取線性濾波器和下面的接收場的內積，然後在輸入的每個局部部分加上非線性激活函數。結果輸出稱為特徵圖。

The convolution filter in CNN is a generalized linear model (GLM) for the underlying data patch, and we argue that the level of abstraction is low with GLM. By abstraction we mean that the feature is invariant to the variants of the same concept [2]. Replacing the GLM with a more potent nonlinear function approximator can enhance the abstraction ability of the local model. GLM can achieve a good extent of abstraction when the samples of the latent concepts are linearly separable, i.e. the variants of the concepts all live on one side of the separation plane defined by the GLM.

本段不翻譯。

Thus conventional CNN implicitly makes the assumption that the latent concepts are linearly separable. However, the data for the same concept often live on a nonlinear manifold, therefore the representations that capture these concepts are generally highly nonlinear functions of the input. In NIN, the GLM is replaced with a “micro network” structure which is a general nonlinear function approximator. In this work, we choose multilayer perceptron [3] as the instantiation of the micro network, which is a universal function approximator and a neural network trainable by back-propagation.

因此，傳統的 CNN 隱含地假設了潛在概念是線性可分離的。但是，同一概念的數據通常存在於非線性流形上，因此，捕獲這些概念的表示形式通常是輸入的高度非線性函數。在 NIN 中，將 GLM 替換為“微網路”結構，該結構是通用的非線性函數逼近器。在這項工作中，我們選擇多層感知器[ 3] 作為微網路的具體化，這是一個通用函數逼近器和可通過反向傳播訓練的神經網路。

The resulting structure, which we call an mlpconv layer, is compared with CNN in Figure 1. Both the linear convolutional layer and the mlpconv layer map the local receptive field to an output feature vector. The mlpconv maps the input local patch to the output feature vector with a multilayer perceptron (MLP) consisting of multiple fully connected layers with nonlinear activation functions. The MLP is shared among all local receptive fields. The feature maps are obtained by sliding the MLP over the input in a similar manner as CNN and are then fed into the next layer. The overall structure of the NIN is the stacking of multiple mlpconv layers. It is called “Network In Network” (NIN) as we have micro networks (MLP), which are composing elements of the overall deep network, within mlpconv layers.

在圖1 中將我們稱為 mlpconv 層的結果結構與 CNN 進行了比較。線性卷積層和 mlpconv 層均將局部接收場映射到輸出特徵向量。 mlpconv 使用多層感知器（MLP）將輸入局部補丁映射到輸出特徵向量，該感知器由具有非線性激活函數的多個全連接層組成。 MLP 在所有局部接收域之間共享。通過以類似於 CNN 的方式在輸入上滑動 MLP 來獲得特徵圖，然後將其饋送到下一層。 NIN 的整體結構是多個 mlpconv 層的堆疊。它被稱為“網路中的網路”（NIN），因為我們在 mlpconv 層中擁有構成整個深層網路元素的微網路（MLP）。

Instead of adopting the traditional fully connected layers for classification in CNN, we directly output the spatial average of the feature maps from the last mlpconv layer as the confidence of categories via a global average pooling layer, and then the resulting vector is fed into the softmax layer. In traditional CNN, it is difficult to interpret how the category level information from the objective cost layer is passed back to the previous convolution layer due to the fully connected layers which act as a black box in between. In contrast, global average pooling is more meaningful and interpretable as it enforces correspondence between feature maps and categories, which is made possible by a stronger local modeling using the micro network. Furthermore, the fully connected layers are prone to overfitting and heavily depend on dropout regularization [4] [5], while global average pooling is itself a structural regularizer, which natively prevents overfitting for the overall structure.

代替在 CNN 中採用傳統的全連接層進行分類，我們通過全局平均池化層直接從最後一個 mlpconv 層輸出特徵圖的空間平均值作為類別的置信度，然後將所得向量饋入 softmax 層。在傳統的 CNN 中，由於全連接層之間充當了黑匣子，因此很難解釋來自輸出層的類別級別信息如何傳遞回先前的卷積層。相比之下，全局平均池化在實現特徵圖和類別之間的對應關係時更具意義和可解釋性，這可以通過使用微網路進行更強大的局部建模來實現。此外，全連接層易於過度擬合，並嚴重依賴於 dropout 正則化[4] [5]，而全局平均池化本身就是結構性正則化器，可以自然地防止整體結構的過度擬合。

-----

2 Convolutional Neural Networks

2 卷積神經網路

-----

Classic convolutional neural networks [1] consist of alternately stacked convolutional layers and spatial pooling layers. The convolutional layers generate feature maps by linear convolutional filters followed by nonlinear activation functions (rectifier, sigmoid, tanh, etc.). Using the linear rectifier as an example, the feature map can be calculated as follows:

經典的卷積神經網路 [1] 由交替堆疊的卷積層和空間池化層組成。卷積層由線性卷積濾波器生成特徵圖，然後是非線性激活函數（整流器，sigmoid，tanh 等）。以線性整流器為例，可以如下計算特徵圖：
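The equation itself did not survive in this post. As best it can be recovered from the cited paper (Lin et al., arXiv:1312.4400), Equation 1 has the following form, where $w_k$ denotes the linear filter for channel $k$:

```latex
f_{i,j,k} = \max\left(w_k^{T} x_{i,j},\ 0\right) \tag{1}
```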

Here (i, j) is the pixel index in the feature map, x_{i,j} stands for the input patch centered at location (i, j), and k is used to index the channels of the feature map.

這裡 (i, j) 是特徵圖中的像素索引，x_{i,j} 代表以位置 (i, j) 為中心的輸入色塊，而 k 用於索引特徵圖的通道。

This linear convolution is sufficient for abstraction when the instances of the latent concepts are linearly separable. However, representations that achieve good abstraction are generally highly nonlinear functions of the input data. In conventional CNN, this might be compensated by utilizing an over-complete set of filters [6] to cover all variations of the latent concepts. Namely, individual linear filters can be learned to detect different variations of a same concept. However, having too many filters for a single concept imposes extra burden on the next layer, which needs to consider all combinations of variations from the previous layer [7]. As in CNN, filters from higher layers map to larger regions in the original input. It generates a higher level concept by combining the lower level concepts from the layer below. Therefore, we argue that it would be beneficial to do a better abstraction on each local patch, before combining them into higher level concepts.

當潛在概念的實例是線性可分離的時，此線性卷積足以進行萃取。但是，獲得良好萃取的表示形式通常是輸入資料的高度非線性函數。在常規的 CNN 中，可以通過使用一組過於完整的過濾器[ 6] 來覆蓋潛在概念的所有變體來對此進行補償。即，可以學習各個線性濾波器以檢測相同概念的不同變化。然而，對於單個概念而言，擁有太多過濾器會給下一層帶來額外的負擔，這需要考慮上一層的所有變體組合 [7]。與 CNN 一樣，較高層的過濾器會映射到原始輸入中的較大區域。它通過組合來自下一層的較低級別的概念來生成較高級別的概念。因此，我們認為在將每個局部色塊組合成更高級別的概念之前，對其進行更好的萃取將是有益的。

In the recent maxout network [8], the number of feature maps is reduced by maximum pooling over affine feature maps (affine feature maps are the direct results from linear convolution without applying the activation function). Maximization over linear functions makes a piecewise linear approximator which is capable of approximating any convex functions. Compared to conventional convolutional layers which perform linear separation, the maxout network is more potent as it can separate concepts that lie within convex sets. This improvement endows the maxout network with the best performances on several benchmark datasets.

在最近的 maxout 網路 [8]中，通過對仿射特徵圖進行最大池化來減少特徵圖的數量（仿射特徵圖是線性卷積的直接結果，而未應用激活函數）。線性函數的最大化使得分段線性逼近器能夠逼近任何凸函數。與執行線性分離的常規卷積層相比，maxout 網路更有效，因為它可以分離出凸集內的概念。這項改進使 maxout 網路在多個基準數據集上具有最佳性能。

However, maxout network imposes the prior that instances of a latent concept lie within a convex set in the input space, which does not necessarily hold. It would be necessary to employ a more general function approximator when the distributions of the latent concepts are more complex. We seek to achieve this by introducing the novel “Network In Network” structure, in which a micro network is introduced within each convolutional layer to compute more abstract features for local patches.

但是，maxout 網路強加了先驗，即潛在概念的實例位於輸入空間中的凸集內，而這一假設不一定成立。當潛在概念的分佈更複雜時，有必要採用更通用的函數逼近器。我們試圖通過引入新穎的“網路中的網路”結構來實現這一目標，其中在每個卷積層內引入一個微網路，以計算局部色塊的更多抽象特徵。

Sliding a micro network over the input has been proposed in several previous works. For example, the Structured Multilayer Perceptron (SMLP) [9] applies a shared multilayer perceptron on different patches of the input image; in another work, a neural network based filter is trained for face detection [10]. However, they are both designed for specific problems and both contain only one layer of the sliding network structure. NIN is proposed from a more general perspective: the micro network is integrated into the CNN structure in pursuit of better abstractions for all levels of features.

在先前的一些工作中已經提出了在輸入上滑動微網路。例如，結構化多層感知器（SMLP）[9] 將共享的多層感知器應用於輸入圖像的不同色塊；在另一項工作中，訓練了基於神經網路的過濾器進行面部檢測[10]。但是，它們都針對特定問題而設計，並且都只包含滑動網路結構的一層。 NIN 是從更一般的角度提出的，為了更好地萃取所有級別的功能，微網路已整合到 CNN 結構中。

-----

3 Network In Network

3 網路中的網路

We first highlight the key components of our proposed “Network In Network” structure: the MLP convolutional layer and the global averaging pooling layer in Sec. 3.1 and Sec. 3.2 respectively. Then we detail the overall NIN in Sec. 3.3.

我們先在 3.1 節和 3.2 節中重點介紹我們提出的“網路中的網路”結構的關鍵組成部分：MLP 卷積層和全局平均池化層。然後再於 3.3 節中詳細介紹整個 NIN。

3.1 MLP Convolution Layers

3.1 MLP卷積層

Given no priors about the distributions of the latent concepts, it is desirable to use a universal function approximator for feature extraction of the local patches, as it is capable of approximating more abstract representations of the latent concepts. Radial basis network and multilayer perceptron are two well known universal function approximators. We choose multilayer perceptron in this work for two reasons. First, multilayer perceptron is compatible with the structure of convolutional neural networks, which is trained using back-propagation. Second, multilayer perceptron can be a deep model itself, which is consistent with the spirit of feature re-use [2]. This new type of layer is called mlpconv in this paper, in which MLP replaces the GLM to convolve over the input. Figure 1 illustrates the difference between linear convolutional layer and mlpconv layer. The calculation performed by mlpconv layer is shown as follows:

在沒有關於潛在概念的分佈的先驗條件的情況下，期望使用通用函數逼近器來對局部色塊進行特徵提取，因為它能夠逼近潛在概念的更多抽象表示。徑向基網路和多層感知器是兩個眾所周知的通用函數逼近器。我們在這項工作中選擇多層感知器有兩個原因。首先，多層感知器與使用反向傳播訓練的卷積神經網路的結構相容。其次，多層感知器本身可以是一個深層模型，這與特徵重用的精神是一致的 [2]。在本文中，這種新類型的層稱為 mlpconv，其中 MLP 代替了 GLM 以在輸入上進行卷積。圖1 說明了線性卷積層和 mlpconv 層之間的區別。 mlpconv層執行的計算如下所示：
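The equation referred to here (Equation 2) is missing from this post. As best it can be recovered from the cited paper, the mlpconv computation is a chain of rectified affine maps applied at each location, with the superscript indexing the layer inside the micro network:

```latex
\begin{aligned}
f^{1}_{i,j,k_1} &= \max\left({w^{1}_{k_1}}^{T} x_{i,j} + b_{k_1},\ 0\right) \\
&\ \ \vdots \\
f^{n}_{i,j,k_n} &= \max\left({w^{n}_{k_n}}^{T} f^{\,n-1}_{i,j} + b_{k_n},\ 0\right)
\end{aligned} \tag{2}
```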

Here n is the number of layers in the multilayer perceptron. Rectified linear unit is used as the activation function in the multilayer perceptron.

在此，n 是多層感知器中的層數。整流線性單元用作多層感知器中的激活函數。

From a cross channel (cross feature map) pooling point of view, Equation 2 is equivalent to cascaded cross channel parametric pooling on a normal convolution layer. Each pooling layer performs weighted linear recombination on the input feature maps, which then go through a rectified linear unit. The cross channel pooled feature maps are cross channel pooled again and again in the next layers. This cascaded cross channel parametric pooling structure allows complex and learnable interactions of cross channel information.

從跨通道（跨特徵圖）池化的角度看，公式2 等效於正常卷積層上的級聯跨通道參數池化。每個池化層在輸入特徵圖上執行加權線性重組，然後通過整流器線性單元。跨通道池化後的特徵圖會在下一層中一次又一次地進行跨通道池化。這種級聯的跨通道參數池化結構可讓跨通道資訊學習複雜的交互作用。

The cross channel parametric pooling layer is also equivalent to a convolution layer with a 1x1 convolution kernel. This interpretation makes it straightforward to understand the structure of NIN.

跨通道參數池化層也等效於具有 1x1 卷積內核的卷積層。這種解釋使得理解 NIN 的結構更直截了當。
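The 1x1 convolution view can be illustrated with a minimal pure-Python sketch. It is not the paper's cuda-convnet implementation; the function name `conv1x1_relu` and the nested-list layout `[channel][row][column]` are assumptions made for this example. Each output pixel is a weighted recombination of the input channels at the same location, followed by a rectifier.

```python
def conv1x1_relu(feature_maps, weights, biases):
    """Cross channel parametric pooling, i.e. a 1x1 convolution plus ReLU.

    feature_maps: nested list indexed as [c_in][row][col] (assumed layout)
    weights:      nested list indexed as [c_out][c_in]
    biases:       list of length c_out
    """
    c_in = len(feature_maps)
    height = len(feature_maps[0])
    width = len(feature_maps[0][0])
    output = []
    for k, w_k in enumerate(weights):
        plane = []
        for i in range(height):
            row = []
            for j in range(width):
                # Weighted linear recombination of all input channels at (i, j).
                z = sum(w_k[c] * feature_maps[c][i][j] for c in range(c_in)) + biases[k]
                row.append(max(z, 0.0))  # rectified linear unit
            plane.append(row)
        output.append(plane)
    return output


# Two input channels of size 1x2, recombined into two output channels.
maps_in = [[[1.0, 2.0]], [[3.0, -4.0]]]
maps_out = conv1x1_relu(maps_in, weights=[[1.0, 1.0], [1.0, -1.0]], biases=[0.0, 0.0])
```

Stacking several such calls after an ordinary convolution gives the cascaded structure described in the text; the spatial size never changes because the kernel covers a single pixel.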

Comparison to maxout layers: the maxout layers in the maxout network perform max pooling across multiple affine feature maps [8]. The feature maps of maxout layers are calculated as follows:

與 maxout 層的比較：maxout 網路中的 maxout 層跨多個仿射特徵圖執行最大池化 [8]。 maxout 圖層的特徵圖的計算方式如下：
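The maxout equation (Equation 3) is also missing from this post. As best it can be recovered from the cited paper, it takes the maximum over the affine feature maps indexed by $m$ that belong to output channel $k$:

```latex
f_{i,j,k} = \max_{m}\left(w_{k_m}^{T} x_{i,j}\right) \tag{3}
```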

Maxout over linear functions forms a piecewise linear function which is capable of modeling any convex function. For a convex function, samples with function values below a specific threshold form a convex set. Therefore, by approximating convex functions of the local patch, maxout has the capability of forming separation hyperplanes for concepts whose samples are within a convex set (e.g. ℓ2 balls, convex cones). The mlpconv layer differs from the maxout layer in that the convex function approximator is replaced by a universal function approximator, which has greater capability in modeling various distributions of latent concepts.

線性函數上的 Maxout 形成分段線性函數，該函數可以對任何凸函數建模。對於凸函數，函數值低於特定閾值的樣本將形成凸集。因此，通過近似局部色塊的凸函數，maxout 可以為樣本在凸集合內的概念（例如 ℓ2 球、凸錐）形成分離超平面。Mlpconv 層與 maxout 層的不同之處在於，凸函數逼近器被通用函數逼近器替代，通用函數逼近器在建模各種潛在概念分佈時具有更強的能力。
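A small sketch may help show why a maximum over linear functions is convex and piecewise linear. The helper name `maxout_unit` is hypothetical, and the one-dimensional "patch" is chosen only to keep the arithmetic visible: with the two filters `[1]` and `[-1]`, the unit computes the absolute value, a convex function that no single linear filter can represent.

```python
def maxout_unit(patch, weight_vectors):
    """Maximum over several linear responses to the same input patch."""
    return max(
        sum(w * x for w, x in zip(weights, patch))
        for weights in weight_vectors
    )


# Two linear pieces, x and -x, whose maximum is |x|.
abs_filters = [[1.0], [-1.0]]
responses = [maxout_unit([x], abs_filters) for x in (-3.0, 0.0, 2.0)]
```

The sub-level set of this function, for example all x with |x| below 1, is an interval, which is the one-dimensional case of the convex sets mentioned in the text.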

-----

3.2 Global Average Pooling

3.2 全局平均池化

Conventional convolutional neural networks perform convolution in the lower layers of the network. For classification, the feature maps of the last convolutional layer are vectorized and fed into fully connected layers followed by a softmax logistic regression layer [4] [8] [11]. This structure bridges the convolutional structure with traditional neural network classifiers. It treats the convolutional layers as feature extractors, and the resulting feature is classified in a traditional way.

傳統的卷積神經網路在網路的較低層中執行卷積。為了進行分類，將最後一個卷積層的特徵圖向量化，並饋入全連接層，然後再輸入softmax logistic 回歸層[4] [8] [11]。這種結構溝通卷積結構與傳統神經網路分類器。它將卷積層視為特徵提取器，然後以傳統方式對生成的特徵進行分類。

However, the fully connected layers are prone to overfitting, thus hampering the generalization ability of the overall network. Dropout is proposed by Hinton et al. [5] as a regularizer which randomly sets half of the activations to the fully connected layers to zero during training. It has improved the generalization ability and largely prevents overfitting [4].

但是，全連接層容易過度擬合，進而妨礙了整個網路的泛化能力。Hinton 等人 [5] 提出了 dropout 作為正則化器，它在訓練過程中將輸入全連接層的激活值隨機設置一半為零。它提高了泛化能力，並在很大程度上防止了過擬合 [4]。

In this paper, we propose another strategy called global average pooling to replace the traditional fully connected layers in CNN. The idea is to generate one feature map for each corresponding category of the classification task in the last mlpconv layer. Instead of adding fully connected layers on top of the feature maps, we take the average of each feature map, and the resulting vector is fed directly into the softmax layer. One advantage of global average pooling over the fully connected layers is that it is more native to the convolution structure by enforcing correspondences between feature maps and categories. Thus the feature maps can be easily interpreted as category confidence maps. Another advantage is that there is no parameter to optimize in the global average pooling, thus overfitting is avoided at this layer. Furthermore, global average pooling sums out the spatial information, thus it is more robust to spatial translations of the input.

在本論文中，我們提出了另一種稱為全局平均池化的策略，以取代 CNN 中的傳統全連接層。這個想法是為最後一個 mlpconv 層中分類任務的每個對應類別生成一個特徵圖。我們沒有在特徵圖的頂部添加全連接層，而是取每個特徵圖的平均值，然後將所得的向量直接輸入到 softmax 層中。全局平均池化在全連接層上的優勢之一是，通過強制執行特徵圖和類別之間的對應關係，對於卷積結構而言，是更自然的事。因此，特徵圖可以容易地解釋為類別置信度圖。另一個優點是，在全局平均池化中沒有要優化的參數，因此在此層避免了過擬合。此外，全局平均池化匯總了空間信息，因此對輸入的空間轉換更加穩健。
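The pooling step described above can be sketched in a few lines of pure Python. The function names and the `[class][row][column]` nested-list layout are assumptions for this example, not the paper's code. Each feature map is reduced to its mean, and the resulting vector goes straight into a softmax.

```python
import math


def global_average_pooling(feature_maps):
    """Reduce each feature map (one per category) to its spatial mean."""
    return [
        sum(sum(row) for row in fmap) / (len(fmap) * len(fmap[0]))
        for fmap in feature_maps
    ]


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    shift = max(scores)
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


# Two categories, each with a 2x2 confidence map.
maps = [[[1.0, 3.0], [1.0, 3.0]], [[0.0, 0.0], [0.0, 0.0]]]
pooled = global_average_pooling(maps)
probs = softmax(pooled)

# Shifting the activations inside a map leaves its average unchanged.
shifted = [[[3.0, 1.0], [3.0, 1.0]], [[0.0, 0.0], [0.0, 0.0]]]
pooled_shifted = global_average_pooling(shifted)
```

The layer has no trainable parameters, and the last two lines illustrate the robustness to spatial translation claimed in the text: moving the activations within a map does not change the pooled value.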

We can see global average pooling as a structural regularizer that explicitly enforces feature maps to be confidence maps of concepts (categories). This is made possible by the mlpconv layers, as they approximate the confidence maps better than GLMs.

我們可以將全局平均池化視為一種結構化的正則化器，它可以明確地將特徵圖強制化為概念（類別）的置信度圖。 mlpconv 層使這成為可能，因為它們比 GLM 更近似置信度圖。

3.3 Network In Network Structure

3.3 網路中的網路結構

The overall structure of NIN is a stack of mlpconv layers, on top of which lie the global average pooling and the objective cost layer. Sub-sampling layers can be added in between the mlpconv layers as in CNN and maxout networks. Figure 2 shows an NIN with three mlpconv layers. Within each mlpconv layer, there is a three-layer perceptron. The number of layers in both NIN and the micro networks is flexible and can be tuned for specific tasks.

NIN 的總體結構是一疊 mlpconv 層，其上是全局平均池化和輸出層。像 CNN 和 maxout 網路一樣，可以在 mlpconv 層之間添加子採樣層。圖2 顯示了具有三個 mlpconv 層的 NIN。在每個 mlpconv 層中，都有一個三層感知器。 NIN 和微網路中的層數都很靈活，可以針對特定任務進行調整。

-----

4 Experiments

4 實驗

4.1 Overview

4.1概述

We evaluate NIN on four benchmark datasets: CIFAR-10 [12], CIFAR-100 [12], SVHN [13] and MNIST [1]. The networks used for the datasets all consist of three stacked mlpconv layers, and the mlpconv layers in all the experiments are followed by a spatial max pooling layer which downsamples the input image by a factor of two. As a regularizer, dropout is applied on the outputs of all but the last mlpconv layers. Unless stated specifically, all the networks used in the experiment section use global average pooling instead of fully connected layers at the top of the network. Another regularizer applied is weight decay as used by Krizhevsky et al. [4]. Figure 2 illustrates the overall structure of NIN network used in this section. The detailed settings of the parameters are provided in the supplementary materials. We implement our network on the super fast cuda-convnet code developed by Alex Krizhevsky [4]. Preprocessing of the datasets, splitting of training and validation sets all follow Goodfellow et al. [8].

我們在四個基準資料集上評估 NIN：CIFAR-10 [12]，CIFAR-100 [12]，SVHN [13] 和 MNIST [1]。用於資料集的網路均由三個堆疊的 mlpconv 層組成，並且在所有實驗中的 mlpconv 層之後是一個空間最大池化層，該層將輸入圖像下採樣兩倍。作為正則化器，dropout 被應用於除最後一個 mlpconv 層以外的所有 mlpconv 層的輸出。除非特別說明，否則實驗部分中使用的所有網路均使用全局平均池化，而不是網路頂部的全連接層。另一個採用的正則化器是 Krizhevsky 等人所使用的權重衰減 [4]。圖2 說明了本節中使用的 NIN 網路的整體結構。參數的詳細設置在補充材料中提供。我們在 Alex Krizhevsky [4] 開發的超快速 cuda-convnet 代碼上實現我們的網路。資料集的預處理，訓練和驗證集的劃分均遵循 Goodfellow 等人的方法 [8]。

We adopt the training procedure used by Krizhevsky et al. [4]. Namely, we manually set proper initializations for the weights and the learning rates. The network is trained using mini-batches of size 128. The training process starts from the initial weights and learning rates, and it continues until the accuracy on the training set stops improving, and then the learning rate is lowered by a factor of 10. This procedure is repeated once such that the final learning rate is one percent of the initial value.

我們採用 Krizhevsky 等人使用的訓練程序 [4]。即，我們手動設置權重和學習率的適當初始化。使用大小為 128 的迷你批次對網路進行訓練。訓練過程從初始權重和學習率開始，一直持續到訓練集的準確性停止提高，然後將學習率降低為原來的十分之一。重複此過程一次，以使最終學習率是初始值的百分之一。
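The learning rate arithmetic can be checked with a short sketch. The function name and the initial value 0.1 are placeholders for this example; the post does not state the actual initial learning rates, which the paper leaves to its supplementary materials. Dividing by 10 at each of the two plateaus leaves one percent of the starting rate.

```python
def learning_rate_stages(initial_lr, num_drops=2, factor=10.0):
    """Learning rate used in each training stage.

    The rate is divided by `factor` each time training accuracy plateaus;
    two drops reproduce the "one percent of the initial value" in the text.
    """
    rates = [initial_lr]
    for _ in range(num_drops):
        rates.append(rates[-1] / factor)
    return rates


stages = learning_rate_stages(0.1)  # 0.1 is an assumed, illustrative value
final_ratio = stages[-1] / stages[0]
```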

4.2 CIFAR-10

The CIFAR-10 dataset [12] is composed of 10 classes of natural images with 50,000 training images in total, and 10,000 testing images. Each image is an RGB image of size 32x32. For this dataset, we apply the same global contrast normalization and ZCA whitening as was used by Goodfellow et al. in the maxout network [8]. We use the last 10,000 images of the training set as validation data.

CIFAR-10 資料集 [12] 由 10 類自然圖像組成，總共有 50,000 個訓練圖像和 10,000 個測試圖像。每個圖像都是大小為 32x32 的 RGB 圖像。對於此資料集，我們應用與 Goodfellow 等人在 maxout 網路中 [8] 所使用的相同的全局對比度歸一化和 ZCA 白化。我們使用訓練集的最後 10,000 張圖像作為驗證資料。

The number of feature maps for each mlpconv layer in this experiment is set to the same number as in the corresponding maxout network. Two hyper-parameters are tuned using the validation set, i.e. the local receptive field size and the weight decay. After that the hyper-parameters are fixed and we re-train the network from scratch with both the training set and the validation set. The resulting model is used for testing. We obtain a test error of 10.41% on this dataset, which improves more than one percent compared to the state-of-the-art. A comparison with previous methods is shown in Table 1.

在此實驗中，每個 mlpconv 層的特徵圖數量設置為與相應的 maxout 網路中相同的數量。使用驗證集調整兩個超參數，即局部感受野大小和權重衰減。之後，超參數被固定，我們使用訓練集和驗證集從頭開始重新訓練網路。生成的模型用於測試。在此資料集上，我們獲得了 10.41％的測試誤差，與最新技術相比，該誤差提高了百分之一以上。表1 中顯示了與先前方法的比較。

It turns out in our experiment that using dropout in between the mlpconv layers in NIN boosts the performance of the network by improving the generalization ability of the model. As is shown in Figure 3, introducing dropout layers in between the mlpconv layers reduced the test error by more than 20%. This observation is consistent with Goodfellow et al. [8]. Thus dropout is added in between the mlpconv layers to all the models used in this paper. The model without dropout regularizer achieves an error rate of 14.51% for the CIFAR-10 dataset, which already surpasses many previous state-of-the-art results obtained with regularizers (except maxout). Since the performance of maxout without dropout is not available, only the dropout-regularized versions are compared in this paper.

在我們的實驗中發現，在 NIN 的 mlpconv 層之間使用 dropout 可以通過提高模型的泛化能力來提高網路的性能。如圖3 所示，在 mlpconv 層之間引入 dropout 層可將測試誤差降低 20％以上。該觀察與 Goodfellow 等人一致 [8]。因此，在 mlpconv 層之間將 dropout 添加到了本文使用的所有模型中。對於 CIFAR-10 資料集，不帶 dropout 正則化器的模型的錯誤率達到 14.51％，這已經超過了許多具有正則化器的最新技術水平（maxout 除外）。由於沒有 dropout 的 maxout 的性能數據不可得，因此本文僅比較使用 dropout 正則化的版本。
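As a quick check on the "more than 20%" figure, assuming it is a relative reduction measured between the two error rates quoted in this section (14.51% without dropout and 10.41% with it), the arithmetic works out as follows:

```latex
\frac{14.51 - 10.41}{14.51} = \frac{4.10}{14.51} \approx 0.283
```

Under that reading the relative reduction is about 28%, which agrees with the claim.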

To be consistent with previous works, we also evaluate our method on the CIFAR-10 dataset with translation and horizontal flipping augmentation. We are able to achieve a test error of 8.81%, which sets the new state-of-the-art performance.

為了與以前的工作保持一致，我們還使用平移和水平翻轉增強在CIFAR-10數據集上評估了我們的方法。我們能夠實現 8.81％的測試誤差，這設定了新的卓越性能。

4.3 CIFAR-100

The CIFAR-100 dataset [12] is the same in size and format as the CIFAR-10 dataset, but it contains 100 classes. Thus the number of images in each class is only one tenth of the CIFAR-10 dataset. For CIFAR-100 we do not tune the hyper-parameters, but use the same setting as the CIFAR-10 dataset. The only difference is that the last mlpconv layer outputs 100 feature maps. A test error of 35.68% is obtained for CIFAR-100 which surpasses the current best performance without data augmentation by more than one percent. Details of the performance comparison are shown in Table 2.

CIFAR-100 資料集 [12 ]的大小和格式與 CIFAR-10 資料集相同，但包含 100 個類。因此，每個類別中的圖像數量僅為 CIFAR-10 資料集的十分之一。對於 CIFAR-100，我們不調整超參數，但使用與 CIFAR-10 資料集相同的設置。唯一的區別是最後一個 mlpconv 層輸出 100 個特徵圖。 CIFAR-100 的測試誤差為 35.68％，在沒有資料擴充的情況下超過了目前的最佳性能百分之一。表2 中顯示了性能比較的細節。

4.4 Street View House Numbers

The SVHN dataset [13] is composed of 630,420 32x32 color images, divided into training set, testing set and an extra set. The task of this data set is to classify the digit located at the center of each image. The training and testing procedure follow Goodfellow et al. [8]. Namely 400 samples per class selected from the training set and 200 samples per class from the extra set are used for validation. The remainder of the training set and the extra set are used for training. The validation set is only used as a guidance for hyper-parameter selection, but never used for training the model.

SVHN 資料集 [13] 由 630,420 個 32x32 彩色圖像組成，分為訓練集，測試集和額外集。該資料集的任務是對位於每個圖像中心的數字進行分類。訓練和測試程序遵循 Goodfellow 等 [8]。即，從訓練集中選擇的每個類別 400 個樣本和從額外集中的每個類別 200 個樣本用於驗證。訓練集的其餘部分和額外集用於訓練。驗證集僅用作超參數選擇的參考，而從未用於訓練模型。

Preprocessing of the dataset again follows Goodfellow et al. [8], which was a local contrast normalization. The structure and parameters used in SVHN are similar to those used for CIFAR-10, which consist of three mlpconv layers followed by global average pooling. For this dataset, we obtain a test error rate of 2.35%. We compare our result with methods that did not augment the data, and the comparison is shown in Table 3.

資料集的預處理再次遵循 Goodfellow 等 [8]，這是局部對比度歸一化。 SVHN 中使用的結構和參數類似於 CIFAR-10 中使用的結構和參數，它由三個 mlpconv 層組成，然後進行全局平均池化。對於此資料集，我們獲得 2.35％的測試錯誤率。我們將結果與未擴充資料的方法進行比較，比較結果如表3 所示。

4.5 MNIST

The MNIST [1] dataset consists of handwritten digits 0-9 which are 28x28 in size. There are 60,000 training images and 10,000 testing images in total. For this dataset, the same network structure as used for CIFAR-10 is adopted, but the number of feature maps generated from each mlpconv layer is reduced, because MNIST is a simpler dataset than CIFAR-10 and fewer parameters are needed. We test our method on this dataset without data augmentation. The results are compared with previous works that adopted convolutional structures, and are shown in Table 4.

MNIST [1] 資料集由大小為 28x28 的手寫數字 0-9 組成。總共有 60,000 張訓練圖像和 10,000 張測試圖像。對於此資料集，採用與 CIFAR-10 相同的網路結構。但是減少了從每個 mlpconv 層生成的特徵圖的數量。因為與 CIFAR-10 相比，MNIST 是更簡單的資料集；需要的參數更少。我們在沒有資料擴充的情況下在此資料集上測試了我們的方法。將結果與採用卷積結構的以前的作品進行比較，如表4 所示。

We achieve comparable but not better performance (0.47%) than the current best (0.45%) since MNIST has been tuned to a very low error rate.

由於 MNIST 已調整到非常低的錯誤率，因此我們獲得的性能（0.47％）與目前的最佳性能（0.45％）接近，但並不比其高。

4.6 Global Average Pooling as a Regularizer

4.6 全局平均池化作為正則器

Global average pooling layer is similar to the fully connected layer in that they both perform linear transformations of the vectorized feature maps. The difference lies in the transformation matrix. For global average pooling, the transformation matrix is prefixed and it is non-zero only on block diagonal elements which share the same value. Fully connected layers can have dense transformation matrices and the values are subject to back-propagation optimization. To study the regularization effect of global average pooling, we replace the global average pooling layer with a fully connected layer, while the other parts of the model remain the same. We evaluated this model with and without dropout before the fully connected linear layer. Both models are tested on the CIFAR-10 dataset, and a comparison of the performances is shown in Table 5.

全局平均池化層與全連接層相似，因為它們都執行向量化特徵圖的線性變換。區別在於變換矩陣。對於全局平均池化，轉換矩陣是預先固定的，並且僅在共享相同值的塊對角線元素上為非零。全連接層可以具有密集的轉換矩陣，並且值需要進行反向傳播優化。為了研究全局平均池化的正則化效果，我們用全連接層替換了全局平均池化層，而模型的其他部分保持不變。我們評估了該模型在全連接線性層之前有無 dropout 的情況。兩種模型都在 CIFAR-10 資料集上進行了測試，其性能比較如表5 所示。
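The "fixed block diagonal matrix" description can be made concrete with a small sketch. The helper names `gap_matrix` and `matvec` are hypothetical, and the feature maps are assumed to be flattened one after another into a single vector. Row k of the matrix holds the constant 1/(H·W) over the entries of map k and zeros elsewhere, so multiplying by it yields the per-map averages.

```python
def gap_matrix(num_maps, map_size):
    """Fixed transformation matrix equivalent to global average pooling.

    Shape is num_maps x (num_maps * map_size); the only non-zero entries sit
    on the block diagonal and all share the value 1 / map_size.
    """
    value = 1.0 / map_size
    return [
        [value if k * map_size <= col < (k + 1) * map_size else 0.0
         for col in range(num_maps * map_size)]
        for k in range(num_maps)
    ]


def matvec(matrix, vector):
    """Plain matrix-vector product on nested lists."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


# Two feature maps of four pixels each, flattened into one vector.
vectorized = [1.0, 3.0, 5.0, 7.0, 2.0, 2.0, 2.0, 2.0]
pooled = matvec(gap_matrix(num_maps=2, map_size=4), vectorized)
```

A fully connected layer would replace this fixed matrix with a dense one whose entries are all learned, which is where the extra capacity to overfit comes from.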

As is shown in Table 5, the fully connected layer without dropout regularization gave the worst performance (11.59%). This is expected as the fully connected layer overfits to the training data if no regularizer is applied. Adding dropout before the fully connected layer reduced the testing error (10.88%). Global average pooling has achieved the lowest testing error (10.41%) among the three.

如表5 所示，沒有 dropout 正則化的全連接層有最差的性能（11.59％）。這是可以預期的，因為如果未應用任何正則化器，則全連接層將過度擬合訓練資料。在全連接層之前添加 dropout 可減少測試誤差（10.88％）。全局平均池化已實現三項中最低的測試錯誤（10.41％）。

We then explore whether the global average pooling has the same regularization effect for conventional CNNs. We instantiate a conventional CNN as described by Hinton et al. [5], which consists of three convolutional layers and one local connection layer. The local connection layer generates 16 feature maps which are fed to a fully connected layer with dropout. To make the comparison fair, we reduce the number of feature map of the local connection layer from 16 to 10, since only one feature map is allowed for each category in the global average pooling scheme. An equivalent network with global average pooling is then created by replacing the dropout + fully connected layer with global average pooling. The performances were tested on the CIFAR-10 dataset.

然後，我們探討了全局平均池化對常規 CNN 是否具有相同的正則化效果。我們實做了傳統的 CNN，如 Hinton 等人所述 [5]，它由三個卷積層和一個局部連接層組成。局部連接層生成 16 個特徵圖，這些特徵圖被饋送到帶有 dropout 的全連接層。為了使比較合理，我們將局部連接層的特徵圖的數量從 16 個減少到 10 個，因為全局平均池化方案中的每個類別只允許一個特徵圖。然後，通過使用全局平均池化替換 dropout + 全連接層，創建具有全局平均池化的等效網路。在 CIFAR-10 資料集上對性能進行了測試。

This CNN model with fully connected layer can only achieve the error rate of 17.56%. When dropout is added we achieve a similar performance (15.99%) as reported by Hinton et al. [5]. By replacing the fully connected layer with global average pooling in this model, we obtain the error rate of 16.46%, which is one percent improvement compared with the CNN without dropout. It again verifies the effectiveness of the global average pooling layer as a regularizer. Although it is slightly worse than the dropout regularizer result, we argue that the global average pooling might be too demanding for linear convolution layers as it requires the linear filter with rectified activation to model the confidence maps of the categories.

這種具有全連接層的 CNN 模型只能達到 17.56％的錯誤率。當添加 dropout 時，我們將獲得與 Hinton 等人報告的相似的性能（15.99％） [5]。通過在此模型中用全局平均池化替換全連接層，我們獲得了 16.46％的錯誤率，與沒有 dropout 的 CNN 相比，錯誤率降低了約一個百分點。這再次驗證了全局平均池化層作為正則化器的有效性。儘管它比 dropout 正則化器結果稍差，但我們認為全局平均池化對於線性卷積層可能要求太高，因為它需要具有整流激活的線性濾波器來建模類別的置信度圖。

4.7 Visualization of NIN

4.7 NIN 的可視化

We explicitly enforce feature maps in the last mlpconv layer of NIN to be confidence maps of the categories by means of global average pooling, which is possible only with stronger local receptive field modeling, e.g. mlpconv in NIN. To understand how much this purpose is accomplished, we extract and directly visualize the feature maps from the last mlpconv layer of the trained model for CIFAR-10.

我們通過全局平均池化將 NIN 的最後一個 mlpconv 層中的特徵圖顯式強制為類別的置信度圖，這只有在更強大的局部感受野建模（例如 NIN 中的 mlpconv）下才有可能實現。要了解達到此目的的程度，我們從 CIFAR-10 的訓練模型的最後一個 mlpconv 層提取並直接可視化特徵圖。

Figure 4 shows some exemplar images and their corresponding feature maps for each of the ten categories selected from the CIFAR-10 test set. It is expected that the largest activations are observed in the feature map corresponding to the ground truth category of the input image, which is explicitly enforced by global average pooling. Within the feature map of the ground truth category, it can be observed that the strongest activations appear roughly at the same region of the object in the original image. It is especially true for structured objects, such as the car in the second row of Figure 4. Note that the feature maps for the categories are trained with only category information. Better results are expected if bounding boxes of the objects are used for fine grained labels.

圖4 顯示了從 CIFAR-10 測試集中選擇的十個類別中的每個類別的一些範例圖像及其對應的特徵圖。預計在與輸入圖像的基準真相類別相對應的特徵圖中會觀察到最大的激活，這明顯是通過全局平均池化實施的。在基準真相類別的特徵圖中，可以觀察到最強的激活大致出現在原始圖像中物件的同一區域。對於結構化物件（例如圖4 第二行中的汽車）尤其如此。請注意，僅使用類別資訊來訓練類別的特徵圖。如果將物件的邊界框用於細粒度標籤，則預期會獲得更好的結果。

The visualization again demonstrates the effectiveness of NIN. It is achieved via a stronger local receptive field modeling using mlpconv layers. The global average pooling then enforces the learning of category level feature maps. Further exploration can be made towards general object detection. Detection results can be achieved based on the category level feature maps in the same flavor as in the scene labeling work of Farabet et al. [20].

可視化再次證明了 NIN 的有效性。這是通過使用 mlpconv 層進行更強的局部感受野建模來實現的。然後，全局平均池化將強制執行類別級別特徵圖的學習。可以對一般物件偵測進行進一步的探索。可以基於類別級別特徵圖以與 Farabet 等人的場景標記工作相同的方式獲得檢測結果 [20]。

5 Conclusions

5 結論

We proposed a novel deep network called “Network In Network” (NIN) for classification tasks. This new structure consists of mlpconv layers which use multilayer perceptrons to convolve the input and a global average pooling layer as a replacement for the fully connected layers in conventional CNN. Mlpconv layers model the local patches better, and global average pooling acts as a structural regularizer that prevents overfitting globally. With these two components of NIN we demonstrated state-of-the-art performance on CIFAR-10, CIFAR-100 and SVHN datasets. Through visualization of the feature maps, we demonstrated that feature maps from the last mlpconv layer of NIN were confidence maps of the categories, and this motivates the possibility of performing object detection via NIN.

我們為分類任務提出了一種新穎的深度網路，稱為“網路中的網路”（NIN）。這種新結構由 mlpconv 層組成，這些層使用多層感知器對輸入進行卷積，並使用全局平均池化層替代了傳統 CNN 中的全連接層。 Mlpconv 層可以更好地對局部色塊進行建模，而全局平均池化則可以作為結構性正則化器，防止全局過度擬合。使用 NIN 的這兩個組件，我們展示了在 CIFAR-10，CIFAR-100 和 SVHN 資料集上的超卓性能。通過特徵圖的可視化，我們證明了 NIN 的最後一個 mlpconv 層的特徵圖是類別的置信度圖，這激勵了通過 NIN 執行物件偵測的可能性。

-----

# NIN。

Lin, Min, Qiang Chen, and Shuicheng Yan. "Network in network." arXiv preprint arXiv:1312.4400 (2013).

https://arxiv.org/pdf/1312.4400.pdf

-----

Sunday, April 11, 2021
