The Star Also Rises: YOLO (3): Illustrated

2021/05/26

-----

# A Survey of Deep Learning-based Object Detection

Explanation:

Two-stage and one-stage detectors.

-----

AP and mAP.

https://blog.paperspace.com/mean-average-precision/

https://jonathan-hui.medium.com/map-mean-average-precision-for-object-detection-45c121a31173

https://zhuanlan.zhihu.com/p/88896868

https://chih-sheng-huang821.medium.com/%E6%B7%B1%E5%BA%A6%E5%AD%B8%E7%BF%92%E7%B3%BB%E5%88%97-%E4%BB%80%E9%BA%BC%E6%98%AFap-map-aaf089920848

http://hemingwang.blogspot.com/2018/04/machine-learning-conceptmean-average.html

-----

Figure 1: The YOLO Detection System. Processing images with YOLO is simple and straightforward. Our system (1) resizes the input image to 448 × 448, (2) runs a single convolutional network on the image, and (3) thresholds the resulting detections by the model’s confidence.

# YOLOv1

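Explanation:

1. Resize the image to 448 × 448.

2. Run a convolutional network to perform single-stage object detection.

3. Apply a confidence threshold, then use NMS to remove duplicate detections: keep the highest-scoring box and discard any other box that overlaps heavily with it.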

-----

Figure 3. In object detection, first category independent region proposals are generated. These region proposals are then assigned a score for each class label using a classification network and their positions are updated slightly using a regression network. Finally, non-maximum-suppression is applied to obtain detections.

# NMS

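Explanation:

The non-maximum suppression pipeline: first generate the region proposals, then group them by class, and finally apply the NMS procedure described above to remove duplicate predictions within each class.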
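The grouping and suppression steps can be sketched in a few lines of Python. This is a minimal illustration, not the implementation of any YOLO release: the `(x1, y1, x2, y2)` box format, the function names, and the 0.5 IoU threshold are assumptions chosen for the example.

```python
from collections import defaultdict

def iou(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def nms_per_class(detections, iou_threshold=0.5):
    """detections: list of (box, score, class_id). Returns the kept detections."""
    by_class = defaultdict(list)
    for det in detections:
        by_class[det[2]].append(det)          # group by class first

    kept = []
    for dets in by_class.values():
        dets.sort(key=lambda d: d[1], reverse=True)
        while dets:
            best = dets.pop(0)                # highest remaining score
            kept.append(best)
            # drop every box that overlaps the kept box too much
            dets = [d for d in dets if iou(best[0], d[0]) < iou_threshold]
    return kept

detections = [
    ((0, 0, 10, 10), 0.9, "dog"),
    ((1, 1, 11, 11), 0.8, "dog"),    # overlaps the first dog box heavily
    ((1, 1, 11, 11), 0.7, "cat"),    # same place, different class: kept
    ((50, 50, 60, 60), 0.6, "dog"),
]
kept = nms_per_class(detections)
```

Because suppression runs separately for each class, the "cat" box survives even though it coincides with a suppressed "dog" box.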

-----

Figure 2: The Model. Our system models detection as a regression problem. It divides the image into an S ×S grid and for each grid cell predicts B bounding boxes, confidence for those boxes, and C class probabilities. These predictions are encoded as an S × S × (B * 5 + C) tensor.

# YOLOv1

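Explanation:

7 × 7 × (2 * 5 + 20).

YOLOv1 has 49 grid cells. Each cell predicts two bounding boxes and one set of 20 class probabilities (one per class). Each bounding box has five values: confidence, center x, center y, width, and height. The box center falls inside the grid cell.

The paper defines confidence as Pr(Object) * IOU^truth_pred: the probability that the box contains an object multiplied by the IoU (Intersection over Union) between the predicted box and the ground-truth box.

"If no object falls inside this grid cell, all confidence scores should be 0; otherwise the confidence is computed from the object probability and the overlap between the predicted box and the ground-truth box."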

https://python5566.wordpress.com/2019/02/14/deep-learning-notes-object-detection-model-yolo/
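As a rough sketch of that definition, the confidence target can be computed from a predicted box and a ground-truth box. The `(cx, cy, w, h)` box format and the function names are assumptions made for this example; Pr(Object) is treated as 1 when an object is present and 0 otherwise.

```python
def iou_center(box_a, box_b):
    """IoU of two boxes given as (cx, cy, w, h)."""
    def corners(b):
        cx, cy, w, h = b
        return cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2

    ax1, ay1, ax2, ay2 = corners(box_a)
    bx1, by1, bx2, by2 = corners(box_b)
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = box_a[2] * box_a[3] + box_b[2] * box_b[3] - inter
    return inter / union if union > 0 else 0.0

def confidence_target(has_object, pred_box, truth_box):
    """Pr(Object) * IoU: 0 for an empty cell, otherwise the IoU itself."""
    return iou_center(pred_box, truth_box) if has_object else 0.0

# Two 10 x 10 boxes shifted by 5 horizontally: intersection 50, union 150.
target = confidence_target(True, (5, 5, 10, 10), (10, 5, 10, 10))
```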

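The upper part of the figure shows the B * 5 box predictions; the lower part shows the C class probabilities (the class with the highest value).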

-----

Figure 3: The Architecture. Our detection network has 24 convolutional layers followed by 2 fully connected layers. Alternating 1 × 1 convolutional layers reduce the features space from preceding layers. We pretrain the convolutional layers on the ImageNet classification task at half the resolution (224 × 224 input image) and then double the resolution for detection.

# YOLOv1

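Explanation:

The paper states that YOLO's network architecture was inspired by GoogLeNet. YOLO, however, mainly uses 1 × 1 reduction layers and 3 × 3 convolutions (see the figure), and some of the 3 × 3 convolutions are preceded by a 1 × 1 layer.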

-----

Model output

YOLOv1

S × S × (B * 5 + C).

7 × 7 × (2 * 5 + 20).

YOLOv2

S × S × (B * (5 + C)).

13 × 13 × (5 * (5 + 20)).

YOLOv3

S × S × (B * (5 + C)).

13 × 13 × (3 * (5 + 80)), 26 × 26 × (3 * (5 + 80)), and 52 × 52 × (3 * (5 + 80)).

YOLOv4

S × S × (B * (5 + C)).

19 × 19 × (3 * (5 + 80)), 38 × 38 × (3 * (5 + 80)), and 76 × 76 × (3 * (5 + 80)).
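The shapes listed above follow from two small formulas, sketched here in Python. The function names are hypothetical, and the input sizes of 416 and 608 pixels together with the strides 32, 16, and 8 are assumptions that happen to reproduce the grids in the list.

```python
def yolov1_depth(num_boxes, num_classes):
    """YOLOv1: class probabilities are shared by all boxes of a cell."""
    return num_boxes * 5 + num_classes

def anchor_depth(num_anchors, num_classes):
    """YOLOv2 and later: every anchor carries its own class probabilities."""
    return num_anchors * (5 + num_classes)

def grid_sizes(input_size, strides=(32, 16, 8)):
    """Grid side length at each detection scale for a square input."""
    return [input_size // s for s in strides]

print(yolov1_depth(2, 20))   # 30
print(anchor_depth(5, 20))   # 125
print(anchor_depth(3, 80))   # 255
print(grid_sizes(416))       # [13, 26, 52]
print(grid_sizes(608))       # [19, 38, 76]
```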

-----

https://medium.com/ching-i/yolo%E6%BC%94%E9%80%B2-1-33220ebc1d09

https://zhuanlan.zhihu.com/p/49995236

https://zhuanlan.zhihu.com/p/143747206

-----

Loss function

# YOLOv1

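Explanation:

1. I_obj_ij: refers to the j-th box of the i-th grid cell. It is 1 when an object is present.

2. Symbols without a hat are the ground truth, and symbols with a hat are the predictions. In the v2 paper, however, symbols without a hat denote the predictions.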

https://medium.com/ching-i/yolo%E6%BC%94%E9%80%B2-1-33220ebc1d09

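3. I_noobj_ij: refers to the j-th box of the i-th grid cell. It is 1 when no object is present.

Take ground-truth boxes of 10 × 10 and 100 × 100, each predicted with an error of 10 per side, so the predictions are 20 × 20 and 110 × 110. Without the square root the two cases give the same loss, because the difference is 10 per side in both (a squared error of 2 × 10² = 200). That is unreasonable: the small box is off by far more in relative terms, so its loss should be larger. With the square root the losses are 3.43 and 0.48 respectively, which is more reasonable.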

https://zhuanlan.zhihu.com/p/49995236
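The numbers in the square-root example can be checked with a short script. This covers only the width and height term of the loss, without the λcoord weight; `size_loss` is a name made up for the illustration.

```python
import math

def size_loss(gt, pred, use_sqrt=True):
    """Sum of squared differences over (width, height)."""
    f = math.sqrt if use_sqrt else float
    return sum((f(p) - f(g)) ** 2 for g, p in zip(gt, pred))

small = ((10, 10), (20, 20))        # ground truth, prediction
large = ((100, 100), (110, 110))

for gt, pred in (small, large):
    plain = size_loss(gt, pred, use_sqrt=False)
    rooted = round(size_loss(gt, pred), 2)
    print(gt, pred, plain, rooted)
# (10, 10) (20, 20) 200.0 3.43
# (100, 100) (110, 110) 200.0 0.48
```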

-----

Parameters

# YOLOv1

Explanation:

λcoord = 5 emphasizes localization accuracy. λnoobj = 0.5 reduces the weight of boxes that contain no object.

-----

Table 6: Darknet-19.

# YOLOv2

Explanation:

The YOLOv2 architecture. It combines GoogLeNet (1 × 1 convolutions) and VGGNet (3 × 3 convolutions).

-----

# YOLOv2 Structure

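Explanation:

This architecture differs slightly from the one in the paper. The figure is included mainly to show that the output is 13 × 13. Its output depth of 40 also differs from the 125 = 5 * (5 + 20) used in other versions.

The fully connected layers are removed. Two paths are concatenated: 2048 + 1024 = 3072 channels.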

https://jonathan-hui.medium.com/real-time-object-detection-with-yolo-yolov2-28b1b93e2088
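One way to arrive at 2048 + 1024 = 3072 is a space-to-depth rearrangement of a 26 × 26 × 512 feature map, followed by concatenation with the 13 × 13 × 1024 map. The sketch below uses nested lists indexed `[row][col][channel]`; the function names and the channel ordering are assumptions for illustration and may not match Darknet's own reorg layer.

```python
def space_to_depth(x, block=2):
    """Move each block x block spatial patch into the channel axis."""
    h, w = len(x), len(x[0])
    out = []
    for i in range(h // block):
        row = []
        for j in range(w // block):
            cell = []
            for bi in range(block):
                for bj in range(block):
                    cell.extend(x[i * block + bi][j * block + bj])
            row.append(cell)
        out.append(row)
    return out

def concat_channels(a, b):
    """Concatenate two maps of equal spatial size along the channel axis."""
    return [[ca + cb for ca, cb in zip(ra, rb)] for ra, rb in zip(a, b)]

fine = [[[0.0] * 512 for _ in range(26)] for _ in range(26)]      # 26 x 26 x 512
coarse = [[[0.0] * 1024 for _ in range(13)] for _ in range(13)]   # 13 x 13 x 1024
merged = concat_channels(space_to_depth(fine), coarse)
print(len(merged), len(merged[0]), len(merged[0][0]))  # 13 13 3072
```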

-----

# YOLO v2

Explanation:

Anchor boxes of five sizes, obtained by clustering.

-----

Figure 2. Bounding boxes with dimension priors and location prediction. We predict the width and height of the box as offsets from cluster centroids. We predict the center coordinates of the box relative to the location of filter application using a sigmoid function. This figure blatantly self-plagiarized from [15].

# YOLOv3

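Explanation:

cx and cy are the coordinates of the top-left corner of the grid cell (the cell that contains the blue dot). Each grid cell has a side length of 1. In the figure, cx = 1 and cy = 1.

tx and ty are the predicted offsets of the box center. A sigmoid constrains them to values between 0 and 1.

bx and by are the center coordinates of the predicted box.

pw and ph are the width and height of the anchor box. tw and th are exponential scaling factors; when they are 0, the predicted size equals the anchor box size.

bw and bh are the width and height of the predicted box.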

https://zhuanlan.zhihu.com/p/49995236
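The decoding described above can be written out as a short function. Values are in grid-cell units, and the anchor size used in the example call is made up for illustration.

```python
import math

def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))

def decode_box(tx, ty, tw, th, cx, cy, pw, ph):
    """Turn raw network outputs into a box center and size."""
    bx = sigmoid(tx) + cx      # center stays inside the cell at (cx, cy)
    by = sigmoid(ty) + cy
    bw = pw * math.exp(tw)     # tw = 0 keeps the anchor width
    bh = ph * math.exp(th)     # th = 0 keeps the anchor height
    return bx, by, bw, bh

# All-zero outputs in cell (1, 1) with a hypothetical 3.0 x 2.0 anchor.
print(decode_box(0, 0, 0, 0, cx=1, cy=1, pw=3.0, ph=2.0))  # (1.5, 1.5, 3.0, 2.0)
```

With all raw outputs at zero, the box sits at the middle of its cell and has exactly the anchor's size, which is why the anchors act as priors.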

-----

# YOLOv2

Explanation:

The fifth line is the formula for confidence: the predicted to, after passing through a sigmoid, equals the confidence.

-----

Table 2: The path from YOLO to YOLOv2. Most of the listed design decisions lead to significant increases in mAP. Two exceptions are switching to a fully convolutional network with anchor boxes and using the new network. Switching to the anchor box style approach increased recall without changing mAP while using the new network cut computation by 33%.

# YOLOv2

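Explanation:

The table shows where YOLOv2 improves on v1. One improvement is the higher resolution; another is dimension priors combined with location prediction. The rest are smaller refinements. This post skips most of YOLOv2 and focuses on YOLOv3. "Anchor boxes" in the table means fixed, hand-picked anchor boxes, while "dimension priors" means anchor boxes obtained by clustering.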

https://tangh.github.io/articles/yolo-from-v1-to-v4/

-----

# Focal Loss

Explanation:

The Focal Loss paper poked fun at YOLOv2.

-----

# YOLOv3

Explanation:

YOLOv3 poked fun at Focal Loss in return.

-----

# YOLOv3

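Explanation:

The YOLOv3 backbone. Like YOLOv2, it combines GoogLeNet (1 × 1 convolutions) and VGGNet (3 × 3 convolutions). It removes all max-pooling layers, increases the number of 1 × 1 and 3 × 3 convolutions, and adopts the identity mapping from ResNet.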

-----

# YOLOv3 Plus

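Explanation:

13 × 13 × 18. Here 18 is the number of feature maps, not the output.

Pay special attention to the upsampling step: each pixel is simply expanded into four, so one pixel becomes 4 pixels in a 2 × 2 area.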

https://stackoverflow.com/questions/60333349/what-is-the-upsampling-technique-used-in-yolov3-upsampling-layers-no-resources
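This kind of upsampling is nearest-neighbor interpolation with a factor of 2. A minimal single-channel sketch, using a plain 2D list as the feature map:

```python
def upsample_nearest_2x(fmap):
    """Copy every value of a 2D feature map into a 2 x 2 block."""
    out = []
    for row in fmap:
        wide = [v for v in row for _ in range(2)]   # repeat along the width
        out.append(wide)
        out.append(list(wide))                      # repeat along the height
    return out

print(upsample_nearest_2x([[1, 2],
                           [3, 4]]))
# [[1, 1, 2, 2], [1, 1, 2, 2], [3, 3, 4, 4], [3, 3, 4, 4]]
```

No new values are computed, so the operation has no learnable parameters.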

Leaky ReLU addresses the dead ReLU problem, in which some neurons are never activated and so some parameters are never updated.

https://zhuanlan.zhihu.com/p/25110450
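The difference shows up in the gradient for negative inputs. The sketch below assumes a negative slope of 0.1; treat that value as an assumption rather than something stated in this post.

```python
def relu_grad(x):
    """Gradient of plain ReLU: zero for every negative input."""
    return 1.0 if x > 0 else 0.0

def leaky_relu(x, negative_slope=0.1):
    return x if x > 0 else negative_slope * x

def leaky_relu_grad(x, negative_slope=0.1):
    """Gradient of Leaky ReLU: small but non-zero for negative inputs."""
    return 1.0 if x > 0 else negative_slope

print(relu_grad(-2.0), leaky_relu_grad(-2.0))  # 0.0 0.1
```

Since the gradient never reaches exactly zero, the weights feeding a neuron with negative pre-activations still receive updates.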

-----

YOLOv3

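Feature map: anchor boxes

13 × 13: (116 × 90), (156 × 198), (373 × 326).

26 × 26: (30 × 61), (62 × 45), (59 × 119).

52 × 52: (10 × 13), (16 × 30), (33 × 23).

Explanation:

"The smallest feature map, 13 × 13, has the largest receptive field, so it uses the largest anchor boxes, (116 × 90), (156 × 198), (373 × 326), and suits large objects. (These sizes are for a 416 × 416 input and must be divided by 32 to scale them to the 13 × 13 grid.) The medium feature map, 26 × 26, has a medium receptive field, so it uses the medium anchor boxes, (30 × 61), (62 × 45), (59 × 119), and suits medium-sized objects. The larger feature map, 52 × 52, has a smaller receptive field, so it uses the smallest anchor boxes, (10 × 13), (16 × 30), (33 × 23), and suits small objects."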

https://zhuanlan.zhihu.com/p/49995236
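The rescaling mentioned in the quote divides each anchor by the stride of its feature map. A small sketch, assuming a 416 × 416 input so that the strides are 32, 16, and 8; the names `ANCHORS` and `anchors_in_grid_units` are made up for the example.

```python
INPUT_SIZE = 416

# Anchor sizes in input pixels, keyed by the grid size they are assigned to.
ANCHORS = {
    13: [(116, 90), (156, 198), (373, 326)],
    26: [(30, 61), (62, 45), (59, 119)],
    52: [(10, 13), (16, 30), (33, 23)],
}

def anchors_in_grid_units(grid):
    """Express the anchors of one scale in units of grid cells."""
    stride = INPUT_SIZE // grid          # 32, 16, or 8
    return [(w / stride, h / stride) for w, h in ANCHORS[grid]]

print(anchors_in_grid_units(13)[0])   # (3.625, 2.8125)
print(anchors_in_grid_units(52)[0])   # (1.25, 1.625)
```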

-----

# YOLOv3 Plus

Explanation:

The original SPP: max pooling at several different scales, followed by fully connected layers.

-----

# YOLOv3 Plus

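Explanation:

The improved SPP. Padding is used here to keep the original spatial size.

Another version is described as: "SPP: uses 1 × 1, 5 × 5, 9 × 9, and 13 × 13 max pooling for multi-scale fusion."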

https://zhuanlan.zhihu.com/p/143747206
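A size-preserving SPP block can be sketched as stride-1 max pooling at several kernel sizes. The example below works on a single-channel 2D list and returns one pooled map per kernel size; a real block would concatenate these along the channel axis of a multi-channel tensor. Skipping out-of-range cells stands in for padding, and the function names are assumptions.

```python
def max_pool_same(fmap, k):
    """Stride-1 max pooling that keeps the spatial size unchanged."""
    h, w = len(fmap), len(fmap[0])
    r = k // 2
    return [[max(fmap[y][x]
                 for y in range(max(0, i - r), min(h, i + r + 1))
                 for x in range(max(0, j - r), min(w, j + r + 1)))
             for j in range(w)]
            for i in range(h)]

def spp(fmap, kernel_sizes=(1, 5, 9, 13)):
    """One pooled map per kernel size; k = 1 is the input itself."""
    return [max_pool_same(fmap, k) for k in kernel_sizes]

fmap = [[0] * 13 for _ in range(13)]
fmap[6][6] = 1                       # a single activation in the center
pooled = spp(fmap)
print([sum(map(sum, p)) for p in pooled])  # [1, 25, 81, 169]
```

The single activation spreads over a 5 × 5, 9 × 9, and 13 × 13 neighborhood, which shows how each branch sees context at a different scale while the map stays 13 × 13.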

-----

# YOLOv4 Structure

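Explanation:

YOLOv4 = CSPNet + SPP + (FPN) + PAN.

a. FPN: top-down semantic enhancement. After the convolutions, the higher-level feature maps carry stronger semantic features. Upsampling them and adding them to the lower-level feature maps gives those maps stronger semantic features as well. Semantic features help classification.

b. PAN: the bottom-up path preserves more contour features. Contour features help bounding-box prediction and object segmentation.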

-----

# CSPNet

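Explanation:

CSPNet, the backbone of YOLOv4. The key idea is to run the computation on only half of the feature maps and then concatenate the result with the untouched half.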
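The split-and-merge idea can be shown with a toy example in which each feature map is just a label. `csp_block` and the stand-in `transform` are hypothetical; a real CSP stage would use convolutional blocks and transition layers instead.

```python
def csp_block(channels, transform):
    """Process only the second half of the channels, then merge both halves."""
    half = len(channels) // 2
    shortcut, to_process = channels[:half], channels[half:]
    return shortcut + transform(to_process)     # concatenate along channels

features = ["c0", "c1", "c2", "c3"]
out = csp_block(features, lambda part: [name + "'" for name in part])
print(out)  # ['c0', 'c1', "c2'", "c3'"]
```

Only half of the channels pass through the expensive part, which is where the savings in computation come from.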

-----

Figure 1. (a) Using an image pyramid to build a feature pyramid. Features are computed on each of the image scales independently, which is slow. (b) Recent detection systems have opted to use only single scale features for faster detection. (c) An alternative is to reuse the pyramidal feature hierarchy computed by a ConvNet as if it were a featurized image pyramid. (d) Our proposed Feature Pyramid Network (FPN) is fast like (b) and (c), but more accurate. In this figure, feature maps are indicate by blue outlines and thicker outlines denote semantically stronger features.

# FPN

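Explanation:

a. Detect at every scale: slow.

b. Detect at the last layer only: fast. The representative work is YOLOv1.

c. Detect at the last few layers: fast. The representative work is SSD.

d. FPN, the Feature Pyramid Network. YOLOv3 adds this kind of network for detection.

Feature maps are indicated by blue outlines, and thicker outlines denote semantically stronger features.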

-----

# YOLOv4

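Explanation:

Switching from addition to concatenation gives better results, but it increases the channel dimension and therefore the computation.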

https://medium.com/ching-i/yolo%E6%BC%94%E9%80%B2-3-yolov4%E8%A9%B3%E7%B4%B0%E4%BB%8B%E7%B4%B9-5ab2490754ef

https://www.jianshu.com/p/fa9b8b4361e8
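The trade-off between the two fusion styles can be made concrete with a toy example. Each feature map here is a list of channels, and `conv_weights` counts the weights of a following 3 × 3 convolution; all names and channel counts are assumptions chosen for illustration.

```python
def fuse_add(a, b):
    """Element-wise sum; both inputs need the same number of channels."""
    return [[x + y for x, y in zip(ca, cb)] for ca, cb in zip(a, b)]

def fuse_concat(a, b):
    """Stack the channels of b after the channels of a."""
    return a + b

def conv_weights(in_channels, out_channels, kernel=3):
    """Weight count of one convolution layer, ignoring biases."""
    return in_channels * out_channels * kernel * kernel

a = [[1, 2], [3, 4]]          # 2 channels with 2 values each
b = [[10, 20], [30, 40]]
added = fuse_add(a, b)        # still 2 channels
stacked = fuse_concat(a, b)   # 4 channels

print(len(added), conv_weights(len(added), 8))       # 2 144
print(len(stacked), conv_weights(len(stacked), 8))   # 4 288
```

Concatenation keeps both inputs intact for the next layer to combine as it sees fit, at the price of doubling the input channels and the weights of the convolution that follows.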

-----

# YOLOv4

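Explanation:

A comparison of YOLOv4 with other networks; the main one to compare against is YOLOv3. Many small refinements add up to a large improvement.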

-----

References

# A Survey of Deep Learning-based Object Detection

Jiao, Licheng, et al. "A survey of deep learning-based object detection." IEEE Access 7 (2019): 128837-128868.

https://ieeexplore.ieee.org/stamp/stamp.jsp?arnumber=8825470

YOLO v1

Redmon, Joseph, et al. "You only look once: Unified, real-time object detection." Proceedings of the IEEE conference on computer vision and pattern recognition. 2016.

https://www.cv-foundation.org/openaccess/content_cvpr_2016/papers/Redmon_You_Only_Look_CVPR_2016_paper.pdf

YOLO v2

Redmon, Joseph, and Ali Farhadi. "YOLO9000: better, faster, stronger." Proceedings of the IEEE conference on computer vision and pattern recognition. 2017.

http://openaccess.thecvf.com/content_cvpr_2017/papers/Redmon_YOLO9000_Better_Faster_CVPR_2017_paper.pdf

YOLO v3

Redmon, Joseph, and Ali Farhadi. "Yolov3: An incremental improvement." arXiv preprint arXiv:1804.02767 (2018).

https://arxiv.org/pdf/1804.02767.pdf

YOLO v4

Bochkovskiy, Alexey, Chien-Yao Wang, and Hong-Yuan Mark Liao. "YOLOv4: Optimal Speed and Accuracy of Object Detection." arXiv preprint arXiv:2004.10934 (2020).

https://arxiv.org/pdf/2004.10934.pdf

# YOLOv3 Plus

Zhou, Jun, et al. "Improved uav opium poppy detection using an updated yolov3 model." Sensors 19.22 (2019): 4851.

https://www.mdpi.com/1424-8220/19/22/4851/pdf

# Zero-Centered

Kim, Sungrae, and Hyun Kim. "Zero-Centered Fixed-Point Quantization With Iterative Retraining for Deep Convolutional Neural Network-Based Object Detectors." IEEE Access 9 (2021): 20828-20839.

https://ieeexplore.ieee.org/stamp/stamp.jsp?arnumber=9336635

# NMS

Bodla, Navaneeth, et al. "Soft-NMS--improving object detection with one line of code." Proceedings of the IEEE international conference on computer vision. 2017.

https://openaccess.thecvf.com/content_ICCV_2017/papers/Bodla_Soft-NMS_--_Improving_ICCV_2017_paper.pdf

# YOLOv4 Structure

Lyu, Jianjun, et al. "Extracting the Tailings Ponds from High Spatial Resolution Remote Sensing Images by Integrating a Deep Learning-Based Model." Remote Sensing 13.4 (2021): 743.

https://www.mdpi.com/2072-4292/13/4/743/pdf

# FPN

Lin, Tsung-Yi, et al. "Feature pyramid networks for object detection." Proceedings of the IEEE conference on computer vision and pattern recognition. 2017.

https://openaccess.thecvf.com/content_cvpr_2017/papers/Lin_Feature_Pyramid_Networks_CVPR_2017_paper.pdf

-----

Real-time Object Detection with YOLO, YOLOv2 and now YOLOv3 | by Jonathan Hui | Medium

https://jonathan-hui.medium.com/real-time-object-detection-with-yolo-yolov2-28b1b93e2088

What’s new in YOLO v3?. A review of the YOLO v3 object… | by Ayoosh Kathuria | Towards Data Science

https://towardsdatascience.com/yolo-v3-object-detection-53fb7d3bfe6b

Review: YOLOv3 — You Only Look Once (Object Detection) | by Sik-Ho Tsang | Towards Data Science

https://towardsdatascience.com/review-yolov3-you-only-look-once-object-detection-eab75d7a1ba6

Implementing YOLO-V3 Using PyTorch

http://leiluoray.com/2018/11/10/Implementing-YOLOV3-Using-PyTorch/

Tutorial on implementing YOLO v3 from scratch in PyTorch

https://blog.paperspace.com/how-to-implement-a-yolo-object-detector-in-pytorch/