GoogLeNet(一):Paper Translation
2021/03/10
-----
https://pixabay.com/zh/photos/frogs-computer-google-search-1037868/
-----
Going Deeper with Convolutions
以卷積走向更深
-----
Abstract
摘要
We propose a deep convolutional neural network architecture codenamed Inception that achieves the new state of the art for classification and detection in the ImageNet Large-Scale Visual Recognition Challenge 2014 (ILSVRC14). The main hallmark of this architecture is the improved utilization of the computing resources inside the network. By a carefully crafted design, we increased the depth and width of the network while keeping the computational budget constant. To optimize quality, the architectural decisions were based on the Hebbian principle and the intuition of multi-scale processing. One particular incarnation used in our submission for ILSVRC14 is called GoogLeNet, a 22 layers deep network, the quality of which is assessed in the context of classification and detection.
我們提出了一種代號為 Inception 的深度卷積神經網路架構,該架構在 ImageNet 大規模視覺識別挑戰賽 2014(ILSVRC14)中實現了分類和偵測的最新技術水準。該架構的主要特點是網路內部計算資源的利用率得到提高。通過精心設計,我們在保持計算預算不變的情況下增加了網路的深度和寬度。為了優化品質,架構決策基於 Hebbian 原則和多尺度處理的直覺。在我們提交的 ILSVRC14 中使用的一種特定實例稱為 GoogLeNet,它是一個 22 層的深度網路,其品質在分類和偵測的範圍內進行評估。
-----
1. Introduction
1. 引言
In the last three years, our object classification and detection capabilities have dramatically improved due to advances in deep learning and convolutional networks [10]. One encouraging news is that most of this progress is not just the result of more powerful hardware, larger datasets and bigger models, but mainly a consequence of new ideas, algorithms and improved network architectures. No new data sources were used, for example, by the top entries in the ILSVRC 2014 competition besides the classification dataset of the same competition for detection purposes. Our GoogLeNet submission to ILSVRC 2014 actually uses 12 times fewer parameters than the winning architecture of Krizhevsky et al [9] from two years ago, while being significantly more accurate. On the object detection front, the biggest gains have not come from naive application of bigger and bigger deep networks, but from the synergy of deep architectures and classical computer vision, like the R-CNN algorithm by Girshick et al [6].
在過去的三年中,由於深度學習和卷積網路的進步,我們的物件分類和偵測能力得到了顯著提高 [10]。令人鼓舞的是,大部分進展不僅僅是更強大的硬體、更大的資料集和更大的模型的結果,更主要是新想法、新算法和改進的網路架構的成果。以 ILSVRC 2014 競賽的前幾名為例,除了將同一競賽的分類資料集用於偵測之外,均未使用新的資料來源。實際上,我們提交給 ILSVRC 2014 的 GoogLeNet 所用的參數量僅為兩年前 Krizhevsky 等人 [9] 獲勝架構的十二分之一,而準確度卻顯著更高。在物件偵測方面,最大的進步並非來自單純地套用越來越大的深度網路,而是來自深度架構與經典計算機視覺的協同作用,例如 Girshick 等人的 R-CNN 算法 [6]。
Another notable factor is that with the ongoing traction of mobile and embedded computing, the efficiency of our algorithms – especially their power and memory use – gains importance. It is noteworthy that the considerations leading to the design of the deep architecture presented in this paper included this factor rather than having a sheer fixation on accuracy numbers. For most of the experiments, the models were designed to keep a computational budget of 1.5 billion multiply-adds at inference time, so that they do not end up to be a purely academic curiosity, but could be put to real world use, even on large datasets, at a reasonable cost.
另一個值得注意的因素是,隨著移動和嵌入式計算的不斷發展,我們算法的效率(尤其是其功耗和內存使用)變得越來越重要。值得注意的是,本論文所提出的深層架構在設計時就納入了這個因素,而非盲目追求準確度數字。對於大多數實驗,這些模型的設計目標是在推論時將計算預算控制在 15 億次乘加運算(multiply-adds),這樣它們就不會淪為純粹的學術好奇,而是能以合理的成本投入現實世界的使用,即使是在大型資料集上。
In this paper, we will focus on an efficient deep neural network architecture for computer vision, codenamed Inception, which derives its name from the Network in network paper by Lin et al [12] in conjunction with the famous “we need to go deeper” internet meme [1]. In our case, the word “deep” is used in two different meanings: first of all, in the sense that we introduce a new level of organization in the form of the “Inception module” and also in the more direct sense of increased network depth. In general, one can view the Inception model as a logical culmination of [12] while taking inspiration and guidance from the theoretical work by Arora et al [2]. The benefits of the architecture are experimentally verified on the ILSVRC 2014 classification and detection challenges, where it significantly outperforms the current state of the art.
在本論文中,我們將專注於一個代號為 Inception、用於計算機視覺的高效深度神經網路架構。該名稱源自 Lin 等人 [12] 的 Network in Network 論文,以及著名的"我們需要更深入(we need to go deeper)"網路迷因 [1]。在我們的例子裡,"深"一詞有兩種不同的含義:首先,我們以"Inception 模組"的形式引入了新的組織層次;其次,在更直接的意義上,是指網路深度的增加。一般而言,可以將 Inception 模型視為 [12] 的必然成果,同時從 Arora 等人 [2] 的理論工作中獲得啟發和指導。該架構的優勢已在 ILSVRC 2014 分類和偵測挑戰賽上通過實驗驗證,其性能明顯優於當前的最新水準。
-----
2. Related Work
2. 相關研究
Starting with LeNet-5 [10], convolutional neural networks (CNN) have typically had a standard structure –stacked convolutional layers (optionally followed by contrast normalization and max-pooling) are followed by one or more fully-connected layers. Variants of this basic design are prevalent in the image classification literature and have yielded the best results to-date on MNIST, CIFAR and most notably on the ImageNet classification challenge [9, 21]. For larger datasets such as Imagenet, the recent trend has been to increase the number of layers [12] and layer size [21, 14], while using dropout [7] to address the problem of overfitting.
從 LeNet-5 [10] 開始,卷積神經網路(CNN)通常有一個標準結構:堆疊的卷積層(可選擇性地接上對比度正規化和最大池化),後面再接一或多層全連接層。這種基本設計的變體在圖像分類文獻中很普遍,並且迄今為止在 MNIST、CIFAR 以及最著名的 ImageNet 分類挑戰上均取得了最佳結果 [9, 21]。對於較大的資料集(例如 Imagenet),最近的趨勢是增加層數 [12] 和層的大小 [21, 14],同時使用 dropout [7] 來解決過度擬合的問題。
Despite concerns that max-pooling layers result in loss of accurate spatial information, the same convolutional network architecture as [9] has also been successfully employed for localization [9, 14], object detection [6, 14, 18, 5] and human pose estimation [19].
儘管擔心最大池化層會導致丟失準確的空間信息,但與 [9] 相同的卷積網路架構也已成功用於定位 [9、14],物件偵測 [6、14、18、5] 和人體姿勢估計 [19]。
Inspired by a neuroscience model of the primate visual cortex, Serre et al. [15] used a series of fixed Gabor filters of different sizes to handle multiple scales. We use a similar strategy here. However, contrary to the fixed 2-layer deep model of [15], all filters in the Inception architecture are learned. Furthermore, Inception layers are repeated many times, leading to a 22-layer deep model in the case of the GoogLeNet model.
受到靈長類動物視覺皮層神經科學模型的啟發,Serre 等人 [15] 使用了一系列不同大小的固定 Gabor 濾波器來處理多個尺度。我們在這裡使用類似的策略。但是,與 [15] 的固定 2 層深度模型相反,Inception 架構中的所有濾波器參數都是學習得來的。此外,Inception 層被重複多次,在 GoogLeNet 模型中形成一個 22 層的深度模型。
Network-in-Network is an approach proposed by Lin et al. [12] in order to increase the representational power of neural networks. In their model, additional 1 x 1 convolutional layers are added to the network, increasing its depth. We use this approach heavily in our architecture. However, in our setting, 1 x 1 convolutions have dual purpose: most critically, they are used mainly as dimension reduction modules to remove computational bottlenecks, that would otherwise limit the size of our networks. This allows for not just increasing the depth, but also the width of our networks without a significant performance penalty.
Network-in-Network 是 Lin 等人 [12] 為了增加神經網路表示能力而提出的一種方法。在他們的模型中,網路中加入了額外的 1x1 卷積層,從而增加了其深度。我們在架構中大量使用了這種方法。但是,在我們的設定中,1x1 卷積具有雙重目的:最關鍵的是,它們主要用作降維模組以消除計算瓶頸,否則這些瓶頸將限制我們網路的規模。這不僅允許增加網路的深度,還允許增加其寬度,而不會有明顯的性能損失。
Finally, the current state of the art for object detection is the Regions with Convolutional Neural Networks (R-CNN) method by Girshick et al. [6]. R-CNN decomposes the overall detection problem into two subproblems: utilizing lowlevel cues such as color and texture in order to generate object location proposals in a category-agnostic fashion and using CNN classifiers to identify object categories at those locations. Such a two stage approach leverages the accuracy of bounding box segmentation with low-level cues, as well as the highly powerful classification power of state-of-the-art CNNs. We adopted a similar pipeline in our detection submissions, but have explored enhancements in both stages, such as multi-box [5] prediction for higher object bounding box recall, and ensemble approaches for better categorization of bounding box proposals.
最後,物件偵測的最新技術是 Girshick 等人的"帶卷積神經網路特徵的區域"(R-CNN)方法 [6]。R-CNN 將整體偵測問題分解為兩個子問題:先利用顏色和紋理等低階線索,以與類別無關的方式生成物件位置建議,再使用 CNN 分類器識別那些位置上的物件類別。這種兩階段方法結合了低階線索下邊界框分割的準確性,以及最新 CNN 強大的分類能力。我們在偵測提交中採用了類似的流程,但在兩個階段都進行了改進,例如使用多框(multi-box)[5] 預測來提高物件邊界框的召回率,以及使用集成方法對邊界框建議進行更好的分類。
-----
3. Motivation and High Level Considerations
3. 動機和高層次的考量
The most straightforward way of improving the performance of deep neural networks is by increasing their size. This includes both increasing the depth – the number of network levels – as well as its width: the number of units at each level. This is an easy and safe way of training higher quality models, especially given the availability of a large amount of labeled training data. However, this simple solution comes with two major drawbacks.
改善深度神經網路性能的最直接方法是增加其大小。這既包括增加深度(網路的層數),也包括增加寬度(每一層的單元數)。這是一種訓練高品質模型的簡便而安全的方法,尤其是在有大量標記訓練資料可用的情況下。但是,這種簡單的解決方案有兩個主要缺點。
Bigger size typically means a larger number of parameters, which makes the enlarged network more prone to overfitting, especially if the number of labeled examples in the training set is limited. This is a major bottleneck as strongly labeled datasets are laborious and expensive to obtain, often requiring expert human raters to distinguish between various fine-grained visual categories such as those in ImageNet (even in the 1000-class ILSVRC subset) as shown in Figure 1.
較大的尺寸通常意味著大量的參數,這使得變大的網路更容易過度擬合,尤其是在訓練集中標記樣本數量有限的情況下。這是一個主要瓶頸,因為帶有強標籤的資料集費工且昂貴,通常需要專業的人類評估者來區分各種細粒度的視覺類別,例如 ImageNet(甚至是 1000 類的 ILSVRC 子集)中的類別,如圖 1 所示。
The other drawback of uniformly increased network size is the dramatically increased use of computational resources. For example, in a deep vision network, if two convolutional layers are chained, any uniform increase in the number of their filters results in a quadratic increase of computation. If the added capacity is used inefficiently (for example, if most weights end up to be close to zero), then much of the computation is wasted. As the computational budget is always finite, an efficient distribution of computing resources is preferred to an indiscriminate increase of size, even when the main objective is to increase the quality of performance.
均勻增加網路大小的另一個缺點是計算資源的使用會急劇增加。例如,在深度視覺網路中,如果將兩個卷積層串接在一起,其濾波器數量的任何均勻增加都會導致計算量呈平方增加。如果增加的容量未被充分利用(例如,大多數權重最終接近於零),則大量計算會被浪費。由於計算預算始終是有限的,因此即使主要目標是提高結果品質,有效分配計算資源也比盲目增加網路大小更為可取。
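譯註:這裡的「平方增加」可以用幾行 Python 粗算來體會(假設 28x28 的特徵圖與 3x3 卷積;數字為筆者隨意假設,僅供說明,非論文內容):

```python
# 兩個串接卷積層之間的乘加運算量 (multiply-adds):
# 輸出特徵圖每個位置需要 k*k*f1 次乘加,共 f2 張輸出圖。
def layer_to_layer_madds(h, w, f1, f2, k=3):
    return h * w * f2 * (k * k * f1)

# 將兩層的濾波器數量同時放大 s 倍,層間計算量隨 s 的平方成長:
for s in (1, 2, 4):
    print(s, layer_to_layer_madds(28, 28, 96 * s, 96 * s))
# s=2 時約為 4 倍、s=4 時約為 16 倍,即文中所說的「計算量的平方增加」。
```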
A fundamental way of solving both of these issues would be to introduce sparsity and replace the fully connected layers by the sparse ones, even inside the convolutions. Besides mimicking biological systems, this would also have the advantage of firmer theoretical underpinnings due to the groundbreaking work of Arora et al. [2]. Their main result states that if the probability distribution of the dataset is representable by a large, very sparse deep neural network, then the optimal network topology can be constructed layer after layer by analyzing the correlation statistics of the preceding layer activations and clustering neurons with highly correlated outputs. Although the strict mathematical proof requires very strong conditions, the fact that this statement resonates with the well known Hebbian principle – neurons that fire together, wire together – suggests that the underlying idea is applicable even under less strict conditions, in practice.
解決這兩個問題的根本方法是引入稀疏性,用稀疏的層取代全連接層,甚至在卷積內部也是如此。除了模仿生物系統之外,得益於 Arora 等人 [2] 的開創性工作,這種做法還具有更堅實的理論基礎。他們的主要結果表明,如果資料集的機率分佈可以由一個大型的、非常稀疏的深度神經網路表示,那麼可以通過分析前一層激活的相關統計量、並將輸出高度相關的神經元聚類,來逐層構建最佳的網路拓撲。儘管嚴格的數學證明需要非常強的條件,但這一陳述與著名的 Hebbian 原理(一起激發的神經元會連接在一起)相呼應,這表明即使在較不嚴格的條件下,此基本思想在實務上仍然適用。
Unfortunately, today’s computing infrastructures are very inefficient when it comes to numerical calculation on non-uniform sparse data structures. Even if the number of arithmetic operations is reduced by 100, the overhead of lookups and cache misses would dominate: switching to sparse matrices might not pay off. The gap is widened yet further by the use of steadily improving and highly tuned numerical libraries that allow for extremely fast dense matrix multiplication, exploiting the minute details of the underlying CPU or GPU hardware [16, 9]. Also, non-uniform sparse models require more sophisticated engineering and computing infrastructure. Most current vision oriented machine learning systems utilize sparsity in the spatial domain just by the virtue of employing convolutions. However, convolutions are implemented as collections of dense connections to the patches in the earlier layer. ConvNets have traditionally used random and sparse connection tables in the feature dimensions since [11] in order to break the symmetry and improve learning, yet the trend changed back to full connections with [9] in order to further optimize parallel computation. Current state-of-the-art architectures for computer vision have uniform structure. The large number of filters and greater batch size allows for the efficient use of dense computation.
不幸的是,當涉及非均勻稀疏資料結構的數值計算時,現今的計算基礎設施效率很低。即使算術運算的數量減少 100 倍,查找(lookup)和快取未命中(cache miss)的開銷仍將佔主導地位:切換到稀疏矩陣可能並不划算。通過使用持續改進、高度調校的數值庫(它們利用底層 CPU 或 GPU 硬體的細節,實現極快的密集矩陣乘法),這一差距被進一步拉大 [16, 9]。而且,非均勻稀疏模型需要更複雜的工程和計算基礎設施。當前大多數面向視覺的機器學習系統,僅憑藉使用卷積就在空間域中利用了稀疏性。但是,卷積是以對前一層色塊(patch)的密集連接集合來實現的。自 [11] 以來,卷積網路傳統上在特徵維度上使用隨機且稀疏的連接表,以打破對稱性並改善學習效果;然而,為了進一步優化平行計算,[9] 之後趨勢又改回了全連接。當前最先進的計算機視覺架構具有均勻的結構。大量的濾波器和更大的批次大小允許高效地使用密集計算。
This raises the question of whether there is any hope for a next, intermediate step: an architecture that makes use of filter-level sparsity, as suggested by the theory, but exploits our current hardware by utilizing computations on dense matrices. The vast literature on sparse matrix computations (e.g. [3]) suggests that clustering sparse matrices into relatively dense submatrices tends to give competitive performance for sparse matrix multiplication. It does not seem far-fetched to think that similar methods would be utilized for the automated construction of non-uniform deeplearning architectures in the near future.
這就提出了一個問題:是否有希望實現下一個中間步驟,即一種如理論所建議的利用濾波器層級稀疏性、但藉由在密集矩陣上的計算來利用現有硬體的架構。關於稀疏矩陣計算的大量文獻(例如 [3])表明,將稀疏矩陣聚類為相對密集的子矩陣,往往能為稀疏矩陣乘法帶來有競爭力的性能。認為在不久的將來,類似的方法將被用於非均勻深度學習架構的自動化構建,似乎並不牽強。
The Inception architecture started out as a case study for assessing the hypothetical output of a sophisticated network topology construction algorithm that tries to approximate a sparse structure implied by [2] for vision networks and covering the hypothesized outcome by dense, readily available components. Despite being a highly speculative undertaking, modest gains were observed early on when compared with reference networks based on [12]. With a bit of tuning the gap widened and Inception proved to be especially useful in the context of localization and object detection as the base network for [6] and [5]. Interestingly, while most of the original architectural choices have been questioned and tested thoroughly in separation, they turned out to be close to optimal locally. One must be cautious though: although the Inception architecture has become a success for computer vision, it is still questionable whether this can be attributed to the guiding principles that have lead to its construction. Making sure of this would require a much more thorough analysis and verification.
Inception 架構最初是作為一個案例研究,用於評估一個複雜網路拓撲構建算法的假設輸出。該算法試圖逼近 [2] 所暗示的視覺網路稀疏結構,並以密集且易於取得的組件來涵蓋假設的結果。儘管這是一項高度推測性的工作,但與基於 [12] 的參考網路相比,早期就觀察到了適度的效益。稍加調整後差距進一步擴大,而 Inception 作為 [6] 和 [5] 的基礎網路,在定位和物件偵測的場景中被證明特別有用。有趣的是,儘管大多數原始的架構選擇都曾被質疑並分別徹底測試,但結果證明它們都接近局部最佳。不過必須謹慎:儘管 Inception 架構已成為計算機視覺的一大成功,但這是否可以歸因於引導其構建的指導原則,仍然值得懷疑。要確認這一點,需要更徹底的分析和驗證。
-----
4. Architectural Details
4. 架構細節
The main idea of the Inception architecture is to consider how an optimal local sparse structure of a convolutional vision network can be approximated and covered by readily available dense components. Note that assuming translation invariance means that our network will be built from convolutional building blocks. All we need is to find the optimal local construction and to repeat it spatially. Arora et al. [2] suggests a layer-by layer construction where one should analyze the correlation statistics of the last layer and cluster them into groups of units with high correlation. These clusters form the units of the next layer and are connected to the units in the previous layer. We assume that each unit from an earlier layer corresponds to some region of the input image and these units are grouped into filter banks. In the lower layers (the ones close to the input) correlated units would concentrate in local regions. Thus, we would end up with a lot of clusters concentrated in a single region and they can be covered by a layer of 1x1 convolutions in the next layer, as suggested in [12]. However, one can also expect that there will be a smaller number of more spatially spread out clusters that can be covered by convolutions over larger patches, and there will be a decreasing number of patches over larger and larger regions. In order to avoid patch-alignment issues, current incarnations of the Inception architecture are restricted to filter sizes 1x1, 3x3 and 5x5; this decision was based more on convenience rather than necessity. It also means that the suggested architecture is a combination of all those layers with their output filter banks concatenated into a single output vector forming the input of the next stage. Additionally, since pooling operations have been essential for the success of current convolutional networks, it suggests that adding an alternative parallel pooling path in each such stage should have additional beneficial effect, too (see Figure 2(a)).
Inception 架構的主要思想,是考慮如何以易於取得的密集組件,來近似並覆蓋卷積視覺網路的最佳局部稀疏結構。請注意,假設平移不變性意味著我們的網路將由卷積構建塊搭建而成。我們所需要的,只是找到最佳的局部構造並在空間上重複它。Arora 等人 [2] 提出了一種逐層構建的方法:分析最後一層的相關性統計數據,並將它們聚類為相關性高的單元組。這些群集構成下一層的單元,並與上一層的單元相連。我們假設前面層的每個單元都對應輸入圖像的某個區域,並且這些單元被分組為濾波器組(filter bank)。在較低的層(靠近輸入的層),相關的單元會集中在局部區域。因此,我們最終會得到許多集中在單一區域的群集,而它們可以在下一層被一層 1x1 卷積覆蓋,如 [12] 中所建議。但是,也可以預期會有數量較少、在空間上分佈更分散的群集,它們可以被較大色塊上的卷積覆蓋;而且在越來越大的區域上,色塊的數量將會遞減。為了避免色塊對齊問題,目前 Inception 架構的實例被限制在 1x1、3x3 和 5x5 的濾波器大小;這個決定更多是基於方便而非必要。這也意味著建議的架構是所有這些層的組合,它們的輸出濾波器組被串接成單一的輸出向量,構成下一階段的輸入。此外,由於池化操作對當前卷積網路的成功至關重要,這表明在每個這樣的階段添加一條並行的池化路徑,應該也會帶來額外的好處(參見圖 2(a))。
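譯註:圖 2(a) 的「樸素版」Inception 模組可以用 PyTorch 草擬如下。這只是筆者重現概念的最小示意,並非論文原始碼;各路徑的濾波器數量為假設值:

```python
import torch
import torch.nn as nn

class NaiveInception(nn.Module):
    """圖 2(a) 示意:1x1、3x3、5x5 卷積與 3x3 最大池化並行,輸出沿通道串接。"""
    def __init__(self, in_ch, n1x1, n3x3, n5x5):
        super().__init__()
        self.b1 = nn.Conv2d(in_ch, n1x1, kernel_size=1)
        self.b3 = nn.Conv2d(in_ch, n3x3, kernel_size=3, padding=1)
        self.b5 = nn.Conv2d(in_ch, n5x5, kernel_size=5, padding=2)
        self.pool = nn.MaxPool2d(kernel_size=3, stride=1, padding=1)

    def forward(self, x):
        # padding 使四條路徑的空間尺寸一致,才能沿通道維度串接
        return torch.cat([self.b1(x), self.b3(x), self.b5(x), self.pool(x)], dim=1)

x = torch.randn(1, 192, 28, 28)
print(NaiveInception(192, 64, 128, 32)(x).shape)
# torch.Size([1, 416, 28, 28]):64+128+32,再加上池化路徑原封不動的 192 個通道
```

注意池化路徑的輸出通道數等於輸入通道數(192),這正是下文所說輸出數量逐級膨脹的原因。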
As these “Inception modules” are stacked on top of each other, their output correlation statistics are bound to vary: as features of higher abstraction are captured by higher layers, their spatial concentration is expected to decrease. This suggests that the ratio of 3 x 3 and 5 x 5 convolutions should increase as we move to higher layers.
由於這些 Inception 模組彼此堆疊,它們的輸出相關性統計數據必然會發生變化:隨著較高的層捕獲更高抽象度的特徵,預期其空間集中度將會降低。這表明,隨著層數升高,3x3 和 5x5 卷積的比例應該增加。
One big problem with the above modules, at least in this naive form, is that even a modest number of 5x5 convolutions can be prohibitively expensive on top of a convolutional layer with a large number of filters. This problem becomes even more pronounced once pooling units are added to the mix: the number of output filters equals to the number of filters in the previous stage. The merging of output of the pooling layer with outputs of the convolutional layers would lead to an inevitable increase in the number of outputs from stage to stage. While this architecture might cover the optimal sparse structure, it would do it very inefficiently, leading to a computational blow up within a few stages.
上述模組至少在這種樸素形式下有一個大問題:在具有大量濾波器的卷積層之上,即使是數量不多的 5x5 卷積,代價也可能高得令人卻步。一旦加入池化單元,這個問題會變得更明顯:其輸出濾波器的數量等於上一階段的濾波器數量。池化層輸出與卷積層輸出的合併,將導致各階段的輸出數量不可避免地逐級增加。儘管這種架構可能覆蓋了最佳的稀疏結構,但其效率非常低,會在幾個階段之內導致計算量爆炸。
In general, an Inception network is a network consisting of modules of the above type stacked upon each other, with occasional max-pooling layers with stride 2 to halve the resolution of the grid. For technical reasons (memory efficiency during training), it seemed beneficial to start using Inception modules only at higher layers while keeping the lower layers in traditional convolutional fashion. This is not strictly necessary, simply reflecting some infrastructural inefficiencies in our current implementation.
總的來說,Inception 網路是由上述模組彼此堆疊而成的網路,其間偶爾插入步長為 2 的最大池化層,以將網格的解析度減半。出於技術原因(訓練期間的內存效率),僅在較高的層開始使用 Inception 模組、而讓較低的層保持傳統卷積的形式,似乎是有益的。這並非嚴格必要,只是反映了我們當前實作中基礎設施的一些低效率。
A useful aspect of this architecture is that it allows for increasing the number of units at each stage significantly without an uncontrolled blow-up in computational complexity at later stages. This is achieved by the ubiquitous use of dimensionality reduction prior to expensive convolutions with larger patch sizes. Furthermore, the design follows the practical intuition that visual information should be processed at various scales and then aggregated so that the next stage can abstract features from the different scales simultaneously.
該架構的一個有用特性是,它允許在每個階段顯著增加單元數量,而不會在後續階段出現失控的計算複雜度暴增。這是通過在使用較大色塊的昂貴卷積之前,普遍使用降維來實現的。此外,該設計遵循一個實務直覺:視覺信息應在不同尺度上進行處理再匯總,以便下一階段能同時從不同尺度中提取特徵。
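譯註:加入 1x1 降維後(對應圖 2(b)),模組可草擬如下;同樣只是筆者的示意,濾波器數量由呼叫端指定:

```python
import torch
import torch.nn as nn

class InceptionModule(nn.Module):
    """圖 2(b) 示意:帶 1x1 降維的 Inception 模組,所有卷積後接 ReLU。"""
    def __init__(self, in_ch, n1x1, n3x3red, n3x3, n5x5red, n5x5, pool_proj):
        super().__init__()
        self.b1 = nn.Sequential(
            nn.Conv2d(in_ch, n1x1, 1), nn.ReLU(inplace=True))
        self.b3 = nn.Sequential(
            nn.Conv2d(in_ch, n3x3red, 1), nn.ReLU(inplace=True),    # 先以 1x1 壓縮通道
            nn.Conv2d(n3x3red, n3x3, 3, padding=1), nn.ReLU(inplace=True))
        self.b5 = nn.Sequential(
            nn.Conv2d(in_ch, n5x5red, 1), nn.ReLU(inplace=True),
            nn.Conv2d(n5x5red, n5x5, 5, padding=2), nn.ReLU(inplace=True))
        self.bp = nn.Sequential(
            nn.MaxPool2d(3, stride=1, padding=1),
            nn.Conv2d(in_ch, pool_proj, 1), nn.ReLU(inplace=True))  # 池化後的 1x1 投影

    def forward(self, x):
        return torch.cat([self.b1(x), self.b3(x), self.b5(x), self.bp(x)], dim=1)
```

昂貴的 3x3 / 5x5 卷積只作用在壓縮後的通道上,池化路徑的輸出通道數也由 pool_proj 控制,解決了前面樸素版的兩個問題。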
The improved use of computational resources allows for increasing both the width of each stage as well as the number of stages without getting into computational difficulties. One can utilize the Inception architecture to create slightly inferior, but computationally cheaper versions of it. We have found that all the available knobs and levers allow for a controlled balancing of computational resources resulting in networks that are 3 - 10X faster than similarly performing networks with non-Inception architecture, however this requires careful manual design at this point.
計算資源利用率的提升,使得既能增加每個階段的寬度、也能增加階段的數量,而不會陷入計算困難。也可以利用 Inception 架構創建品質稍遜、但計算上更便宜的版本。我們發現,所有可用的調節手段都能實現計算資源的受控平衡,從而使網路比性能相近的非 Inception 架構網路快 3 至 10 倍;不過,目前這需要仔細的手動設計。
-----
5. GoogLeNet
5. GoogLeNet
By the “GoogLeNet” name we refer to the particular incarnation of the Inception architecture used in our submission for the ILSVRC 2014 competition. We also used one deeper and wider Inception network with slightly superior quality, but adding it to the ensemble seemed to improve the results only marginally. We omit the details of that network, as empirical evidence suggests that the influence of the exact architectural parameters is relatively minor. Table 1 illustrates the most common instance of Inception used in the competition. This network (trained with different image-patch sampling methods) was used for 6 out of the 7 models in our ensemble.
我們用 “GoogLeNet” 這個名稱,指代我們在 ILSVRC 2014 競賽提交中所用的 Inception 架構的特定實例。我們還使用了一個更深、更寬、品質略優的 Inception 網路,但將其加入集成似乎只能略微改善結果。我們省略了該網路的細節,因為經驗表明,確切的架構參數的影響相對較小。表 1 展示了競賽中最常用的 Inception 實例。該網路(以不同的圖像色塊採樣方法訓練)被用於我們集成的 7 個模型中的 6 個。
All the convolutions, including those inside the Inception modules, use rectified linear activation. The size of the receptive field in our network is 224x224 in the RGB color space with zero mean. “#3x3 reduce” and “#5x5 reduce” stands for the number of 1x1 filters in the reduction layer used before the 3x3 and 5x5 convolutions. One can see the number of 1x1 filters in the projection layer after the built-in max-pooling in the pool proj column. All these reduction/projection layers use rectified linear activation as well.
所有卷積,包括 Inception 模組內部的卷積,都使用整流線性激活(ReLU)。我們網路的感受野大小為 RGB 色彩空間中的 224x224,並採用零均值。“#3x3 reduce” 和 “#5x5 reduce” 表示在 3x3 和 5x5 卷積之前的降維層中 1x1 濾波器的數量。在 pool proj 一欄中,可以看到內建最大池化之後投影層中 1x1 濾波器的數量。所有這些降維/投影層同樣使用整流線性激活。
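譯註:以論文表 1 中 inception (3a) 那一列為例(筆者記憶中的數值為 #1x1=64、#3x3 reduce=96、#3x3=128、#5x5 reduce=16、#5x5=32、pool proj=32,若有出入請以原文表格為準),接續上面的 InceptionModule 示意:

```python
x = torch.randn(1, 192, 28, 28)                       # inception (3a) 的輸入:28x28、192 通道
block = InceptionModule(192, 64, 96, 128, 16, 32, 32)
print(block(x).shape)                                 # torch.Size([1, 256, 28, 28])
# 輸出通道數 256 = 64 + 128 + 32 + 32(四條路徑沿通道串接)
```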
The network was designed with computational efficiency and practicality in mind, so that inference can be run on individual devices including even those with limited computational resources, especially with low-memory footprint. The network is 22 layers deep when counting only layers with parameters (or 27 layers if we also count pooling). The overall number of layers (independent building blocks) used for the construction of the network is about 100. The exact number depends on how layers are counted by the machine learning infrastructure. The use of average pooling before the classifier is based on [12], although our implementation has an additional linear layer. The linear layer enables us to easily adapt our networks to other label sets, however it is used mostly for convenience and we do not expect it to have a major effect. We found that a move from fully connected layers to average pooling improved the top-1 accuracy by about 0.6%, however the use of dropout remained essential even after removing the fully connected layers.
該網路在設計時就考慮了計算效率和實用性,因此推論可以在單個設備上運行,甚至包括那些計算資源有限、特別是內存較小的設備。僅計算帶參數的層時,網路深 22 層(若把池化層也算進去,則為 27 層)。構建網路所用的層(獨立構建塊)總數約為 100;確切數目取決於機器學習基礎設施如何計算層數。在分類器之前使用平均池化是基於 [12],不過我們的實現額外加了一個線性層。線性層使我們能輕鬆地讓網路適應其他標籤集,但這主要是為了方便,我們不預期它有重大影響。我們發現,從全連接層改為平均池化後,top-1 準確率提高了約 0.6%;然而,即使移除了全連接層,dropout 的使用仍然必不可少。
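譯註:這段描述的分類器頭端大致如下(dropout 比率 40% 為論文主網路的設定;此為筆者示意,非原始實作):

```python
import torch.nn as nn

# 以平均池化取代全連接層,但保留 dropout 與一個方便換標籤集的線性層
head = nn.Sequential(
    nn.AdaptiveAvgPool2d(1),   # 對最終 7x7x1024 的特徵圖做全域平均池化 -> 1x1x1024
    nn.Flatten(),              # -> 1024 維向量
    nn.Dropout(0.4),
    nn.Linear(1024, 1000),     # 額外的線性層;softmax 交給損失函數處理
)
```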
Given relatively large depth of the network, the ability to propagate gradients back through all the layers in an effective manner was a concern. The strong performance of shallower networks on this task suggests that the features produced by the layers in the middle of the network should be very discriminative. By adding auxiliary classifiers connected to these intermediate layers, discrimination in the lower stages in the classifier was expected. This was thought to combat the vanishing gradient problem while providing regularization. These classifiers take the form of smaller convolutional networks put on top of the output of the Inception (4a) and (4d) modules. During training, their loss gets added to the total loss of the network with a discount weight (the losses of the auxiliary classifiers were weighted by 0.3). At inference time, these auxiliary networks are discarded. Later control experiments have shown that the effect of the auxiliary networks is relatively minor (around 0.5%) and that it required only one of them to achieve the same effect.
考慮到網路相對較深,能否以有效的方式將梯度傳播回所有層是一個值得關注的問題。較淺的網路在此任務上的強勁表現表明,網路中間各層產生的特徵應具有很強的判別力。通過添加連接到這些中間層的輔助分類器,我們期望在分類器的較低階段就能進行判別。這被認為能在提供正則化的同時,對抗梯度消失問題。這些分類器採用較小卷積網路的形式,置於 Inception (4a) 和 (4d) 模組的輸出之上。在訓練過程中,它們的損失會以折扣權重加到網路的總損失中(輔助分類器的損失權重為 0.3)。在推論時,這些輔助網路會被丟棄。後來的對照實驗表明,輔助網路的效果相對較小(約 0.5%),而且只需其中一個即可達到相同的效果。
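譯註:訓練期總損失的組合方式可以寫成如下(aux1_logits、aux2_logits 等名稱為筆者假設的介面):

```python
import torch.nn.functional as F

def googlenet_loss(main_logits, aux1_logits, aux2_logits, target):
    """訓練期:兩個輔助分類器的損失各以 0.3 的折扣權重併入總損失。
    推論期不會用到此函數,輔助網路整個被丟棄。"""
    return (F.cross_entropy(main_logits, target)
            + 0.3 * F.cross_entropy(aux1_logits, target)
            + 0.3 * F.cross_entropy(aux2_logits, target))
```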
The exact structure of the extra network on the side, including the auxiliary classifier, is as follows:
• An average pooling layer with 5x5 filter size and stride 3, resulting in an 4x4x512 output for the (4a), and 4x4x528 for the (4d) stage.
• A 1x1 convolution with 128 filters for dimension reduction and rectified linear activation.
• A fully connected layer with 1024 units and rectified linear activation.
• A dropout layer with 70% ratio of dropped outputs.
• A linear layer with softmax loss as the classifier (predicting the same 1000 classes as the main classifier, but removed at inference time).
A schematic view of the resulting network is depicted in Figure 3.
旁路的附加網路(包括輔助分類器)的確切結構如下:
• 一個濾波器大小為 5x5、步長為 3 的平均池化層,在 (4a) 階段輸出 4x4x512,在 (4d) 階段輸出 4x4x528。
• 具有 128 個濾波器的 1x1 卷積,用於降維,並使用整流線性激活(ReLU)。
• 具有 1024 個單元的全連接層,使用整流線性激活。
• 丟棄率為 70% 的 dropout 層。
• 以 softmax 損失作為分類器的線性層(預測與主分類器相同的 1000 個類別,但在推論時移除)。
網路的示意圖如圖3。
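譯註:按上列五點規格,輔助分類器可草擬如下(以 (4a) 為例,其輸入特徵圖為 14x14x512;此為筆者重現的示意):

```python
import torch
import torch.nn as nn

class AuxClassifier(nn.Module):
    """輔助分類器示意,對應正文的五點規格。"""
    def __init__(self, in_ch, num_classes=1000):
        super().__init__()
        self.net = nn.Sequential(
            nn.AvgPool2d(kernel_size=5, stride=3),   # (4a):14x14 -> 4x4
            nn.Conv2d(in_ch, 128, kernel_size=1),    # 1x1 降維
            nn.ReLU(inplace=True),
            nn.Flatten(),
            nn.Linear(128 * 4 * 4, 1024),            # 1024 單元的全連接層
            nn.ReLU(inplace=True),
            nn.Dropout(0.7),                         # 70% 丟棄率
            nn.Linear(1024, num_classes),            # softmax 由損失函數處理;推論時移除
        )

    def forward(self, x):
        return self.net(x)

print(AuxClassifier(512)(torch.randn(1, 512, 14, 14)).shape)  # torch.Size([1, 1000])
```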
-----
6. Training Methodology
6.訓練方法
GoogLeNet networks were trained using the DistBelief [4] distributed machine learning system using modest amount of model and data-parallelism. Although we used a CPU based implementation only, a rough estimate suggests that the GoogLeNet network could be trained to convergence using few high-end GPUs within a week, the main limitation being the memory usage. Our training used asynchronous stochastic gradient descent with 0.9 momentum [17], fixed learning rate schedule (decreasing the learning rate by 4% every 8 epochs). Polyak averaging [13] was used to create the final model used at inference time.
我們的 GoogLeNet 網路是使用 DistBelief [4] 分散式機器學習系統訓練的,僅使用了適量的模型與資料平行。儘管我們只使用基於 CPU 的實現,但粗略估計,使用少量高階 GPU 可以在一週內將 GoogLeNet 網路訓練至收斂,主要限制是內存用量。我們的訓練使用動量為 0.9 的非同步隨機梯度下降 [17],以及固定的學習率時間表(每 8 個 epoch 將學習率降低 4%)。推論時使用的最終模型則由 Polyak 平均 [13] 產生。
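譯註:上述超參數大致對應如下設定。原文使用 DistBelief 的非同步 SGD,這裡以 PyTorch 的單機寫法示意,初始學習率為筆者假設:

```python
import torch
import torch.nn as nn

model = nn.Linear(10, 2)  # 佔位模型,實際應換成 GoogLeNet

# 動量 0.9 的 SGD(lr=0.01 為假設值)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)

# 固定學習率時間表:每 8 個 epoch 將學習率乘上 0.96,即「降低 4%」
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=8, gamma=0.96)

# Polyak 平均:維護參數的移動平均,推論時改用平均後的模型
averaged = torch.optim.swa_utils.AveragedModel(model)
```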
Image sampling methods have changed substantially over the months leading to the competition, and already converged models were trained on with other options, sometimes in conjunction with changed hyperparameters, such as dropout and the learning rate. Therefore, it is hard to give a definitive guidance to the most effective single way to train these networks. To complicate matters further, some of the models were mainly trained on smaller relative crops, others on larger ones, inspired by [8]. Still, one prescription that was verified to work very well after the competition, includes sampling of various sized patches of the image whose size is distributed evenly between 8% and 100% of the image area with aspect ratio constrained to the interval [ 3/4 ; 4/3 ]. Also, we found that the photometric distortions of Andrew Howard [8] were useful to combat overfitting to the imaging conditions of training data.
在邁向競賽的幾個月裡,圖像採樣方法發生了很大的變化,已經收斂的模型會再用其他選項繼續訓練,有時還會搭配更改過的超參數,例如 dropout 和學習率。因此,很難對訓練這些網路最有效的單一方法給出明確的指引。讓問題更複雜的是,受 [8] 的啟發,一些模型主要在相對較小的裁切上訓練,另一些則在較大的裁切上訓練。儘管如此,有一個在賽後被驗證效果很好的處方:對圖像中各種大小的色塊進行採樣,其面積均勻分佈在圖像面積的 8% 到 100% 之間,寬高比則限制在 [3/4, 4/3] 區間內。此外,我們發現 Andrew Howard [8] 的光度失真有助於對抗對訓練資料成像條件的過度擬合。
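譯註:文中這個賽後驗證有效的採樣處方(面積 8%~100%、寬高比 [3/4, 4/3]),恰好是後來 torchvision 中 RandomResizedCrop 的預設參數,可示意如下:

```python
from torchvision import transforms

train_tf = transforms.Compose([
    # 面積均勻取樣於原圖的 8%~100%,寬高比限制在 [3/4, 4/3]
    transforms.RandomResizedCrop(224, scale=(0.08, 1.0), ratio=(3 / 4, 4 / 3)),
    transforms.RandomHorizontalFlip(),
    transforms.ColorJitter(0.4, 0.4, 0.4),  # 光度失真,近似 [8] 的精神;強度為筆者假設
    transforms.ToTensor(),
])
```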
7. ILSVRC 2014 Classification Challenge Setup and Results
7. ILSVRC 2014 分類挑戰賽設置和結果
The ILSVRC 2014 classification challenge involves the task of classifying the image into one of 1000 leaf-node categories in the Imagenet hierarchy. There are about 1.2 million images for training, 50,000 for validation and 100,000 images for testing. Each image is associated with one ground truth category, and performance is measured based on the highest scoring classifier predictions. Two numbers are usually reported: the top-1 accuracy rate, which compares the ground truth against the first predicted class, and the top-5 error rate, which compares the ground truth against the first 5 predicted classes: an image is deemed correctly classified if the ground truth is among the top-5, regardless of its rank in them. The challenge uses the top-5 error rate for ranking purposes.
ILSVRC 2014 分類挑戰的任務,是將圖像分類為 Imagenet 層次結構中 1000 個葉節點類別之一。大約有 120 萬張圖像用於訓練、5 萬張用於驗證、10 萬張用於測試。每張圖像都對應一個真值(ground truth)類別,性能則根據分類器得分最高的預測來衡量。通常報告兩個數字:top-1 準確率,將真值與排名第一的預測類別進行比較;以及 top-5 錯誤率,將真值與前 5 個預測類別進行比較:只要真值位於前五名之中,無論名次如何,圖像即被視為正確分類。挑戰賽使用 top-5 錯誤率進行排名。
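譯註:top-1 / top-5 錯誤率的計算方式可以寫成如下(假設 logits 為 [N, 1000]、labels 為 [N] 的張量):

```python
import torch

def topk_error(logits, labels, k=5):
    """真值只要落在得分最高的 k 個預測之中即算正確,與名次無關。"""
    topk = logits.topk(k, dim=1).indices                # [N, k]
    correct = (topk == labels.unsqueeze(1)).any(dim=1)  # [N] 布林值
    return 1.0 - correct.float().mean().item()

logits = torch.randn(8, 1000)
labels = torch.randint(0, 1000, (8,))
print(topk_error(logits, labels, k=1), topk_error(logits, labels, k=5))
```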
We participated in the challenge with no external data used for training. In addition to the training techniques aforementioned in this paper, we adopted a set of techniques during testing to obtain a higher performance, which we describe next.
我們在未使用任何外部訓練資料的情況下參加了挑戰賽。除了本文前面提到的訓練技術外,我們在測試過程中還採用了一組技術來獲得更高的性能,說明如下。
1. We independently trained 7 versions of the same GoogLeNet model (including one wider version), and performed ensemble prediction with them. These models were trained with the same initialization (even with the same initial weights, due to an oversight) and learning rate policies. They differed only in sampling methodologies and the randomized input image order.
1. 我們獨立訓練了 7 個版本的同一個 GoogLeNet 模型(包括一個更寬的版本),並用它們進行集成預測。這些模型使用相同的初始化(由於疏忽,甚至具有相同的初始權重)和學習率策略進行訓練,差別僅在於採樣方法和隨機化的輸入圖像順序。
2. During testing, we adopted a more aggressive cropping approach than that of Krizhevsky et al. [9]. Specifically, we resized the image to 4 scales where the shorter dimension (height or width) is 256, 288, 320 and 352 respectively, take the left, center and right square of these resized images (in the case of portrait images, we take the top, center and bottom squares). For each square, we then take the 4 corners and the center 224x224 crop as well as the square resized to 224x224, and their mirrored versions. This leads to 4x3x6x2 = 144 crops per image. A similar approach was used by Andrew Howard [8] in the previous year’s entry, which we empirically verified to perform slightly worse than the proposed scheme. We note that such aggressive cropping may not be necessary in real applications, as the benefit of more crops becomes marginal after a reasonable number of crops are present (as we will show later on).
2. 在測試過程中,我們採用了比 Krizhevsky 等人 [9] 更激進的裁切方法。具體來說,我們將圖像縮放為 4 個尺度,使其較短邊(高或寬)分別為 256、288、320 和 352,再取這些縮放後圖像的左、中、右三個正方形(直式圖像則取上、中、下三個正方形)。對於每個正方形,我們取 4 個角落和中心的 224x224 切塊、將整個正方形縮放到 224x224 的版本,以及它們的鏡像。這樣每張圖像共有 4x3x6x2 = 144 個切塊(枚舉方式可參考本列表後的示意)。Andrew Howard [8] 在前一年的參賽中使用了類似的方法,經我們實證驗證,其性能比我們提出的方案稍差。我們注意到,在實際應用中這種激進的裁切可能並非必要,因為在切塊數量達到一定程度後,更多切塊的收益會變得微乎其微(我們將在後面展示)。
3. The softmax probabilities are averaged over multiple crops and over all the individual classifiers to obtain the final prediction. In our experiments we analyzed alternative approaches on the validation data, such as max pooling over crops and averaging over classifiers, but they lead to inferior performance than the simple averaging.
3. 將 softmax 機率在多個切塊和所有單個分類器上取平均,以獲得最終預測。在實驗中,我們在驗證資料上分析了替代方法,例如對切塊做最大池化、再對分類器取平均,但它們的性能都不如簡單的平均。
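譯註:第 2、3 點的切塊枚舉與機率平均,可用下面的草圖整理(ensemble_predict 的介面為筆者假設):

```python
import torch

# 144 個切塊:4 個尺度 x 每尺度 3 個正方形 x 每個正方形 6 種切法 x 鏡像 2
scales = [256, 288, 320, 352]   # 短邊縮放到的 4 個尺度
squares_per_scale = 3           # 左/中/右(直式圖像為上/中/下)
crops_per_square = 6            # 4 個角落 + 中心 224x224 + 整個正方形縮放成 224x224
mirrors = 2
assert len(scales) * squares_per_scale * crops_per_square * mirrors == 144

def ensemble_predict(models, crops):
    """crops: [144, 3, 224, 224] 的張量;models: 模型列表。
    先對每個模型在所有切塊上的 softmax 機率取平均,再對所有模型取平均。"""
    per_model = [torch.softmax(m(crops), dim=1).mean(dim=0) for m in models]
    return torch.stack(per_model).mean(dim=0)  # 長度 1000 的平均機率向量
```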
In the remainder of this paper, we analyze the multiple factors that contribute to the overall performance of the final submission.
在本文的其餘部分,我們將分析影響最終提交版本整體性能的多種因素。
Our final submission to the challenge obtains a top-5 error of 6.67% on both the validation and testing data, ranking the first among other participants. This is a 56.5% relative reduction compared to the SuperVision approach in 2012, and about 40% relative reduction compared to the previous year’s best approach (Clarifai), both of which used external data for training the classifiers. Table 2 shows the statistics of some of the top-performing approaches over the past 3 years.
我們對挑戰賽的最終提交,在驗證和測試資料上均獲得 6.67% 的 top-5 錯誤率,在所有參賽者中排名第一。與 2012 年的 SuperVision 方法相比,相對降低了 56.5%;與前一年的最佳方法(Clarifai)相比,相對降低了約 40%,而這兩者都使用了外部資料來訓練分類器。表 2 顯示了過去 3 年中一些表現最佳方法的統計資料。
We also analyze and report the performance of multiple testing choices, by varying the number of models and the number of crops used when predicting an image in Table 3. When we use one model, we chose the one with the lowest top-1 error rate on the validation data. All numbers are reported on the validation dataset in order to not overfit to the testing data statistics.
我們還在表 3 中,通過改變預測一張圖像時所用的模型數量和切塊數量,分析並報告了多種測試選擇的性能。當只使用一個模型時,我們選擇在驗證資料上 top-1 錯誤率最低的那一個。所有數字均在驗證資料集上報告,以免過度擬合測試資料的統計特性。
-----
8. ILSVRC 2014 Detection Challenge Setup and Results
8. ILSVRC 2014 偵測挑戰賽設置和結果
The ILSVRC detection task is to produce bounding boxes around objects in images among 200 possible classes. Detected objects count as correct if they match the class of the groundtruth and their bounding boxes overlap by at least 50% (using the Jaccard index). Extraneous detections count as false positives and are penalized. Contrary to the classification task, each image may contain many objects or none, and their scale may vary. Results are reported using the mean average precision (mAP). The approach taken by GoogLeNet for detection is similar to the R-CNN by [6], but is augmented with the Inception model as the region classifier. Additionally, the region proposal step is improved by combining the selective search [20] approach with multibox [5] predictions for higher object bounding box recall. In order to reduce the number of false positives, the super-pixel size was increased by 2x. This halves the proposals coming from the selective search algorithm. We added back 200 region proposals coming from multi-box [5] resulting, in total, in about 60% of the proposals used by [6], while increasing the coverage from 92% to 93%. The overall effect of cutting the number of proposals with increased coverage is a 1% improvement of the mean average precision for the single model case. Finally, we use an ensemble of 6 GoogLeNets when classifying each region. This leads to an increase in accuracy from 40% to 43.9%. Note that contrary to R-CNN, we did not use bounding box regression due to lack of time.
ILSVRC 偵測任務是在 200 個可能類別的圖像中,為物件生成邊界框。若偵測到的物件與真值的類別相符、且其邊界框重疊至少 50%(以 Jaccard 指數衡量),則計為正確。無關的偵測計為誤報,並會被扣分。與分類任務相反,每張圖像可能包含許多物件,也可能一個都沒有,而且物件的尺度可能不一。結果以平均精度均值(mAP)報告。GoogLeNet 用於偵測的方法與 [6] 的 R-CNN 類似,但以 Inception 模型作為區域分類器加以增強。此外,區域建議步驟結合了選擇性搜索(selective search)[20] 與多框(multi-box)[5] 預測,以提高物件邊界框的召回率。為了減少誤報數量,超像素(super-pixel)的大小增加為 2 倍,這使來自選擇性搜索算法的建議減半。我們再加回 200 個來自多框 [5] 的區域建議,最終總數約為 [6] 所用建議的 60%,同時覆蓋率從 92% 提高到 93%。減少建議數量並提高覆蓋率的整體效果,是單模型情況下的平均精度均值提高了 1%。最後,在對每個區域分類時,我們使用 6 個 GoogLeNet 的集成,這使準確率從 40% 提高到 43.9%。請注意,與 R-CNN 不同,由於時間不足,我們沒有使用邊界框回歸。
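譯註:文中「邊界框重疊至少 50%(Jaccard 指數)」即一般所稱的 IoU >= 0.5,可草擬如下(框以 (x1, y1, x2, y2) 表示,為筆者假設的介面):

```python
def jaccard(a, b):
    """兩個框 (x1, y1, x2, y2) 的 Jaccard 指數(IoU):交集面積 / 聯集面積。"""
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    union = area(a) + area(b) - inter
    return inter / union if union > 0 else 0.0

def detection_correct(pred_box, pred_cls, gt_box, gt_cls, thresh=0.5):
    # 類別須相符且 IoU 至少 0.5,否則該偵測計為誤報
    return pred_cls == gt_cls and jaccard(pred_box, gt_box) >= thresh
```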
We first report the top detection results and show the progress since the first edition of the detection task. Compared to the 2013 result, the accuracy has almost doubled. The top performing teams all use convolutional networks. We report the official scores in Table 4 and common strategies for each team: the use of external data, ensemble models or contextual models. The external data is typically the ILSVRC12 classification data for pre-training a model that is later refined on the detection data. Some teams also mention the use of the localization data. Since a good portion of the localization task bounding boxes are not included in the detection dataset, one can pre-train a general bounding box regressor with this data the same way classification is used for pre-training. The GoogLeNet entry did not use the localization data for pretraining.
我們首先報告最佳的偵測結果,並展示自第一屆偵測任務以來的進展。與 2013 年的結果相比,準確率幾乎翻倍。表現最好的團隊都使用了卷積網路。我們在表 4 中報告了官方成績以及各團隊的共同策略:使用外部資料、集成模型或情境模型。外部資料通常是 ILSVRC12 分類資料,用於預訓練一個模型,之後再以偵測資料微調。一些團隊還提到使用了定位(localization)資料。由於定位任務的很大一部分邊界框未包含在偵測資料集中,可以用這些資料預訓練一個通用的邊界框回歸器,方式與用分類資料做預訓練相同。GoogLeNet 的參賽模型並未使用定位資料進行預訓練。
In Table 5, we compare results using a single model only. The top performing model is by Deep Insight and surprisingly only improves by 0.3 points with an ensemble of 3 models while the GoogLeNet obtains significantly stronger results with the ensemble.
在表 5 中,我們僅比較單一模型的結果。表現最好的模型來自 Deep Insight;令人驚訝的是,其 3 個模型的集成僅提高了 0.3 個百分點,而 GoogLeNet 的集成則明顯強得多。
-----
9. Conclusions
9. 結論
Our results yield a solid evidence that approximating the expected optimal sparse structure by readily available dense building blocks is a viable method for improving neural networks for computer vision. The main advantage of this method is a significant quality gain at a modest increase of computational requirements compared to shallower and narrower architectures.
我們的結果提供了有力的證據:以易於取得的密集構建塊來近似預期的最佳稀疏結構,是改善計算機視覺神經網路的可行方法。與較淺、較窄的架構相比,此方法的主要優點是在計算需求僅適度增加的情況下,品質顯著提升。
Our object detection work was competitive despite not utilizing context nor performing bounding box regression, suggesting yet further evidence of the strengths of the Inception architecture.
儘管既未利用上下文資訊、也未執行邊界框回歸,我們的物件偵測工作仍具有競爭力,這進一步佐證了 Inception 架構的優勢。
For both classification and detection, it is expected that similar quality of result can be achieved by much more expensive non-Inception-type networks of similar depth and width. Still, our approach yields solid evidence that moving to sparser architectures is feasible and useful idea in general. This suggest future work towards creating sparser and more refined structures in automated ways on the basis of [2], as well as on applying the insights of the Inception architecture to other domains.
對於分類和偵測,可以預期,深度和寬度相似、但昂貴得多的非 Inception 類型網路也能達到相近的結果品質。儘管如此,我們的方法仍提供了有力的證據,表明轉向更稀疏的架構總體上是可行且有用的想法。這提示了未來的工作:在 [2] 的基礎上以自動化方式創建更稀疏、更精細的結構,以及將 Inception 架構的洞見應用於其他領域。
-----
Szegedy, Christian, et al. "Going deeper with convolutions." Proceedings of the IEEE conference on computer vision and pattern recognition. 2015.
-----