The Star Also Rises: July 2017

AI from Scratch (35): DPM

Object Detection with Discriminatively Trained Part-Based Models

2017/07/27

Preface:

This post is one of the warm-ups for R-CNN.

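Summary:

The paper covered in this post performs object detection with a Latent SVM (L-SVM) [1].

I chose it because an earlier paper I read [2] cites R-CNN [3], which in turn uses Selective Search [4] and L-SVM [1].

Both [1] and its earlier version [5] use Histograms of Oriented Gradients (HOG) for feature extraction [6], [7]. SIFT [8] is similar to HOG, so this post uses it to help explain how HOG works.

[6] deals with a rough outline of the human body, whereas [1] and [5] use [9] and [10] to compute the placement of the parts. Besides giving a higher recognition rate, this also adapts to detecting many different kinds of objects.

L-SVM can be viewed as an Energy-Based Model [11], and also as a kind of MI-SVM [12]. For reasons of space, this post explains only the basic SVM concepts [13] and does not go into much detail on L-SVM (in truth, the author is still studying it).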

-----

This post is divided into eight sections.

Q1: What is L-SVM?
Q2: What is AOL?
Q3: What is HOG?
Q4: What is SIFT?
Q5: What is SVM?
Q6: What is the Pyramid Structure?
Q7: What is the flowchart of the object detection algorithm?
Q8: What is the result?

-----

Q1: What is L-SVM?

A1:

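The top of Fig. 1.1 shows the test image, and the bottom shows the trained L-SVM model.

Fig. 1.1a shows a coarse human shape. This is the HOG filter trained with a linear SVM [6].

Fig. 1.1b adds part filters on top of [6]. Because the part locations are not labeled, they have to be treated as latent (hidden) variables during training [1].

Fig. 1.1c completes the model and is displayed as spatial weights.

The technical details are discussed later in this post.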

Fig. 1.1. Detections obtained with a single component person model, p. 2 [1].
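The role of the latent variables can be sketched in a few lines. In [1], an example x is scored by maximizing over the latent values z, f(x) = max_z w · Φ(x, z). The snippet below is only a toy illustration of that maximization: the function name `latent_score`, the weight vector, and the three candidate feature vectors are made up for the example, with each row standing in for Φ(x, z) under one possible part placement.

```python
import numpy as np

def latent_score(w, candidate_features):
    """Score an example as the best w . Phi(x, z) over all latent choices z.

    candidate_features has one row per latent choice (hypothetical data).
    Returns the best score and the index of the latent choice achieving it.
    """
    scores = candidate_features @ w
    best = int(np.argmax(scores))
    return float(scores[best]), best

# Toy numbers, not taken from the paper.
w = np.array([1.0, -0.5, 2.0])
phi = np.array([
    [1.0, 0.0, 0.0],   # features under latent choice z = 0
    [0.0, 2.0, 1.0],   # features under latent choice z = 1
    [1.0, 1.0, 1.0],   # features under latent choice z = 2
])
score, z = latent_score(w, phi)
```

With these numbers, the third placement wins, so the example is scored as if its parts sat at that placement.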

-----

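Fig. 1.2 shows models for different object categories: bottle, car, sofa, and bicycle.
Fig. 1.3 shows person, bottle, cat, and car. We can see that the person and bottle models look alike, as do the cat and car models.

Note the difference between the old and new versions: in the new version, each category has two sub-models to assist detection, which is called a mixture model.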

Fig. 1.2. Some models learned from the PASCAL VOC 2007 dataset, p. 6 [5].

Fig. 1.3. Some of the models learned on the PASCAL 2007 data set, p. 1641 [1].

-----

Q2: What is AOL?

A2:

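AOL abbreviates the title of a 2015 paper on active object localization [2]; see Fig. 2. [2] mainly cites R-CNN [3], ZFNet, and DQN, all of which are classic papers.

R-CNN [3] in turn mainly cites Selective Search [4] and L-SVM [1], which is why this post covers [1].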

Fig. 2. Related research.

-----

Q3: What is HOG?

A3:

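HOG stands for Histograms of Oriented Gradients. The paper was published in 2005 [6] and had been cited 19,606 times as of this writing. The basic architecture of [1] is rooted in it, so this post first gives a brief introduction to HOG.

Fig. 3.1a shows images from the database, all of which have the same size of 64 x 128; see Fig. 3.1b.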

Fig. 3.1a. Some sample images from the new human detection database, p. 3 [6].

Fig. 3.1b. The image resolution is 64 x 128, p. 2 [6].

-----

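This section assumes the reader already has a basic understanding of SVMs. SVMs are briefly introduced in Q5.

Fig. 3.2a is the average of all the images. Because left-right mirrored copies of the images were also used, the result is a symmetric shape.

Fig. 3.2b and Fig. 3.2c visualize the largest weights of the vector after training the linear SVM, for "person" and "non-person" respectively. Here one pixel represents (the maximum of) an 8 x 8 vector. The counterpart in Fig. 3.2e is the HOG descriptor: within each 8 x 8 block there are lines in nine orientations, and the strength of each line is shown by its brightness (see the enlarged figure below).

Fig. 3.2d is the test image.

Computing HOG on Fig. 3.2d gives Fig. 3.2e. Multiplying it by (b) and (c) gives (f) and (g), respectively.

For a more detailed introduction, see Q7.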

Fig. 3.2. The HOG detectors cue mainly on silhouette contours, p. 7 [6].

Fig. 3.2e. Its computed R-HOG descriptor, p. 7 [6].
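The step from (e) to (f) and (g) can be sketched as an element-wise product. This is a minimal illustration, assuming the descriptor and the SVM weights are flat arrays of equal length; the function name `split_weighting` and the numbers are made up for the example.

```python
import numpy as np

def split_weighting(descriptor, svm_weights):
    """Weight a descriptor by the positive and by the negative SVM weights.

    Returns two arrays: the part of the evidence that votes for "person"
    and the part that votes against it.
    """
    positive = descriptor * np.maximum(svm_weights, 0.0)
    negative = descriptor * np.maximum(-svm_weights, 0.0)
    return positive, negative

# Toy values, not taken from the paper.
descriptor = np.array([0.2, 0.5, 0.1, 0.7])
svm_weights = np.array([1.0, -2.0, 0.0, 0.5])
pos, neg = split_weighting(descriptor, svm_weights)

# The linear SVM response (without bias) is the difference of the two sums.
response = pos.sum() - neg.sum()
```

Splitting the weights this way makes it visible which regions of the window argue for a person and which argue against one.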

-----

Fig. 3.3 is the complete flowchart, which has six steps. For the normalization in steps one and four, see [7]. This post covers steps two, three, five, and six.

Fig. 3.3. An overview of the feature extraction and object detection chain, p. 2 [6].

-----

Q4: What is SIFT?

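A4:

SIFT stands for Scale-Invariant Feature Transform [8]. It is "somewhat similar" to HOG.

On the left of Fig. 4.1, the smallest units are n x n pixel cells, followed by 4 x 4 grids, and then 2 x 2 grids make up a descriptor. Here strength is shown by the length of the arrows, whereas HOG uses brightness. In addition, SIFT uses eight orientations, while the HOG paper chooses nine.

The formulas are given in Fig. 4.2.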

Fig. 4.1. SIFT: A keypoint descriptor is created, p. 101 [8].

Fig. 4.2. SIFT: For each image sample, the gradient magnitude and orientation is precomputed using pixel differences, p. 99 [8].
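The pixel-difference computation named in the caption of Fig. 4.2 can be sketched as follows. This is an illustration under stated assumptions: `L` is a 2-D array indexed as `L[y, x]`, border pixels are dropped, and `arctan2` is used so that the orientation covers the full circle. The function name is made up for the example.

```python
import numpy as np

def pixel_difference_gradients(L):
    """Gradient magnitude and orientation from central pixel differences.

    The output is two pixels smaller than the input in each dimension,
    because border pixels have no neighbor on one side.
    """
    L = np.asarray(L, dtype=float)
    dx = L[1:-1, 2:] - L[1:-1, :-2]    # L(x+1, y) - L(x-1, y)
    dy = L[2:, 1:-1] - L[:-2, 1:-1]    # L(x, y+1) - L(x, y-1)
    magnitude = np.sqrt(dx ** 2 + dy ** 2)
    orientation = np.arctan2(dy, dx)   # radians
    return magnitude, orientation

# A 5 x 5 ramp whose brightness grows by 1 per pixel from left to right.
ramp = np.tile(np.arange(5, dtype=float), (5, 1))
mag, ori = pixel_difference_gradients(ramp)
```

For the ramp, every interior pixel gets the same magnitude and an orientation pointing along the x axis.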

-----

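With SIFT introduced, we return to HOG.

Fig. 4.3a shows the 36 (9 x 4) eigenvectors obtained by PCA in [1]; each eigenvector is a 36-dimensional (9 x 4) vector.

This figure can be read together with Fig. 3.2e. That is, each "snowflake" in Fig. 3.2e can be composed from these 36 eigenvectors (or a subset of them). Conceptually, each snowflake consists of four small snowflakes, each small snowflake has nine orientations, and each orientation has its own strength. With four small snowflakes, this gives 9 x 4.

Fig. 4.3b, c, and d provide supporting text.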

Fig. 4.3a. PCA of HOG features, p. 1638 [1].

Fig. 4.3b. HOG: A 4 x 9 matrix, p. 1639 [1].

Fig. 4.3c. HOG: A vector of length 9 x 4, p. 2 [5].

Fig. 4.3d. HOG vs. SIFT, p. 6 [6].
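To make the "36 eigenvectors of 36 dimensions" statement concrete, here is a small PCA sketch. It uses random placeholder data rather than real HOG features, so the eigenvectors themselves carry no meaning; only the shapes and the reconstruction property are the point.

```python
import numpy as np

rng = np.random.default_rng(0)
features = rng.random((500, 36))           # placeholder data, NOT real HOG features

centered = features - features.mean(axis=0)
covariance = centered.T @ centered / (len(features) - 1)
eigenvalues, eigenvectors = np.linalg.eigh(covariance)

order = np.argsort(eigenvalues)[::-1]      # sort by decreasing variance
eigenvalues = eigenvalues[order]
eigenvectors = eigenvectors[:, order]      # column k is the k-th principal direction

# Any 36-dimensional feature is a weighted combination of the 36 eigenvectors.
sample = centered[0]
coefficients = eigenvectors.T @ sample
reconstructed = eigenvectors @ coefficients
```

Keeping only the first few columns of `eigenvectors` would give an approximation instead of an exact reconstruction, which is the sense of "or a subset of them" above.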

-----

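[6] tried several ways of computing the gradient, and [-1, 0, 1] worked best; see Fig. 4.4a.

Fig. 4.4b shows how HOG computes the orientation. Compare it with SIFT to see where they differ.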

Fig. 4.4a. HOG: Simple 1-D [-1, 0, 1] masks work best, p. 4 [6].

Fig. 4.4b. HOG: Gradient orientation and magnitude, p. 1638 [1].
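The gradient and orientation steps can be sketched as below. This is a simplified illustration, not the full pipeline of [6]: it applies the [-1, 0, 1] mask, uses unsigned orientations in nine bins over 0 to 180 degrees, and accumulates magnitudes per 8 x 8 cell with hard binning. Vote interpolation and block normalization are left out, and the function name is made up for the example.

```python
import numpy as np

def cell_histograms(image, cell=8, bins=9):
    """Per-cell orientation histograms, weighted by gradient magnitude."""
    image = np.asarray(image, dtype=float)
    gx = np.zeros_like(image)
    gy = np.zeros_like(image)
    gx[:, 1:-1] = image[:, 2:] - image[:, :-2]    # [-1, 0, 1] along x
    gy[1:-1, :] = image[2:, :] - image[:-2, :]    # [-1, 0, 1] along y

    magnitude = np.hypot(gx, gy)
    angle = np.degrees(np.arctan2(gy, gx)) % 180.0            # unsigned orientation
    bin_index = np.minimum((angle / (180.0 / bins)).astype(int), bins - 1)

    rows, cols = image.shape[0] // cell, image.shape[1] // cell
    hist = np.zeros((rows, cols, bins))
    for r in range(rows):
        for c in range(cols):
            block = np.s_[r * cell:(r + 1) * cell, c * cell:(c + 1) * cell]
            hist[r, c] = np.bincount(bin_index[block].ravel(),
                                     weights=magnitude[block].ravel(),
                                     minlength=bins)
    return hist

# A 64 x 128 window (128 rows, 64 columns) holding a horizontal brightness ramp.
window = np.tile(np.arange(64, dtype=float), (128, 1))
hist = cell_histograms(window)
```

A 64 x 128 window yields 8 x 16 cells. For the ramp, all gradient energy lands in the first orientation bin.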

-----

Q5: What is SVM?

A5:

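This section briefly introduces the Support Vector Machine (SVM) [13].

D(x) in Fig. 5.1 is the decision function, and D(x) = 0 defines a hyperplane. w is an n-dimensional vector.

Fig. 5.2 shows a linear SVM, where φ(x) = x. It uses two dimensions as an example.

Two classes of data are scattered on the plane (roughly forming two clusters), labeled 1 and -1 respectively.

A training set is drawn from them, and D(x) is chosen so that the margin M is maximized. In other words, we find the few vectors lying on the dashed lines on either side of D(x) = 0, which in turn determine w and b. These vectors are called support vectors.

In [6], once the HOG features of the training images have been computed, they can be classified with the linear SVM introduced here.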

Fig. 5.1. The decision function in the direct space, p. 145 [13].

Fig. 5.2. Maximum margin linear decision function D(x), p. 146 [13].
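The quantities in Fig. 5.2 can be checked on a toy 2-D data set. The sketch below does not train anything: w and b are hand-picked for these six made-up points, and the code only evaluates D(x) = w · x + b, the margin M as the smallest signed distance to the boundary, and which points attain it.

```python
import numpy as np

def decision(w, b, X):
    """Linear decision function D(x) = w . x + b for every row of X."""
    return X @ w + b

# Toy data: three points labeled +1 and three labeled -1.
X = np.array([[2.0, 2.0], [3.0, 3.0], [2.0, 3.0],
              [0.0, 0.0], [-1.0, 0.0], [0.0, -1.0]])
y = np.array([1, 1, 1, -1, -1, -1])

# Hand-picked separator, assumed rather than learned.
w = np.array([0.5, 0.5])
b = -1.0

distances = y * decision(w, b, X) / np.linalg.norm(w)   # signed distance to D(x) = 0
margin = distances.min()                                # M in Fig. 5.2
support = np.flatnonzero(np.isclose(distances, margin)) # points on the dashed lines
```

Here the points (2, 2) and (0, 0) sit on the dashed lines. Since they are the closest pair across the two classes and the boundary passes halfway between them, no other line could give a larger margin for this data.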

-----

Q6: What is the Pyramid Structure?

A6:

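See Fig. 6.1. The pyramid here actually has only two levels.

The light blue part is the same as in [6]: plain pedestrian detection.

[1] and [5] differ from [6] in that they add part structure, shown as the yellow boxes.

Fig. 6.2 shows a test image and the model trained with L-SVM.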

Fig. 6.1. The HOG feature pyramid, p. 2 [5].

Fig. 6.2. Example detection obtained with the person model, p. 1 [5].

-----

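One question here is how the bounding box (the dashed box in Fig. 6.3) is found.

In [6], every image is a fixed-size 64 x 128 window with the person in the middle, so this is not an issue.

In [5] and [1], the bounding boxes are provided by the dataset; see Fig. 6.4a and Fig. 6.4b.

The open question is Fig. 6.4c. According to [4], [1] tries every possible location by exhaustive search. See Fig. 7.1, where the filters slide over the whole image and find two people.

[4] provides a good algorithm for finding bounding boxes, which will be discussed when [4] is covered.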

Fig. 6.3. The dotted box is the bounding box label provided in the PASCAL training set, p. 5 [5].

Fig. 6.4a. The dotted box is the bounding box label provided in the PASCAL training set, p. 5 [5].

Fig. 6.4b. Training models from images labeled with bounding boxes, p. 1636 [1].

Fig. 6.4c. Exhaustive search, p. 155 [4].
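Exhaustive search at a single scale amounts to sliding a filter over every position of a feature map and recording the dot product. The sketch below assumes a feature map of shape (height, width, 9) and a random filter, and it plants one exact copy of the filter in an otherwise empty map so that the best position is known in advance. All names and sizes are made up for the example.

```python
import numpy as np

def filter_responses(feature_map, filt):
    """Dot product of the filter with every same-sized window of the map."""
    H, W, _ = feature_map.shape
    h, w, _ = filt.shape
    out = np.empty((H - h + 1, W - w + 1))
    for y in range(out.shape[0]):
        for x in range(out.shape[1]):
            out[y, x] = np.sum(feature_map[y:y + h, x:x + w] * filt)
    return out

rng = np.random.default_rng(0)
filt = rng.standard_normal((3, 2, 9))       # hypothetical 3 x 2 cell filter
feature_map = np.zeros((10, 12, 9))
feature_map[4:7, 5:7] = filt                # plant a patch that matches the filter

responses = filter_responses(feature_map, filt)
best = tuple(int(i) for i in np.unravel_index(np.argmax(responses), responses.shape))
```

The double loop visits every window, which is what makes the search exhaustive. Repeating it over all pyramid levels would also cover different object sizes.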

-----

Fig. 6.5 shows that a better bounding box is a by-product of [1].

Fig. 6.5. A car detection and the bounding box predicted from the object configuration, p. 1640 [1].

-----

Q7: What is the flowchart of the object detection algorithm?

A7:

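Fig. 7.1 is the flowchart of the object detection algorithm.

The trained model is at the upper right.

Once HOG has been computed, the score is roughly the sum of the root (whole-object) response and the part responses.

There are two main points:

First, the small part boxes have to be located first using [9] and [10]; see Fig. 7.2.

Second, the part responses have to go through the transform of [9]; see Fig. 7.3.

These two papers are outside the scope of this post.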

Fig. 7.1. The matching process at one scale, p. 1633 [1].

Fig. 7.2. Best locations of the parts, p. 1631 [1].

Fig. 7.3. Generalized distance transform, p. 1632 [1].
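The "transform" of the part responses can be sketched with brute force. For each position, the code below looks for the displacement (dx, dy) that maximizes the part response minus a quadratic deformation penalty. This is a simplification under stated assumptions: the search is limited to a small radius, the penalty has only the squared terms, the doubled resolution of the part level is ignored, and the linear-time algorithm of [9] is not used. The function name and all numbers are made up.

```python
import numpy as np

def transformed_response(R, cost=(1.0, 1.0), radius=2):
    """D[y, x] = max over (dx, dy) of R[y+dy, x+dx] - (cx*dx^2 + cy*dy^2)."""
    H, W = R.shape
    cx, cy = cost
    D = np.full((H, W), -np.inf)
    for dy in range(-radius, radius + 1):
        for dx in range(-radius, radius + 1):
            penalty = cx * dx * dx + cy * dy * dy
            # Positions (y, x) for which (y + dy, x + dx) stays inside R.
            y0, y1 = max(0, -dy), min(H, H - dy)
            x0, x1 = max(0, -dx), min(W, W - dx)
            shifted = R[y0 + dy:y1 + dy, x0 + dx:x1 + dx]
            D[y0:y1, x0:x1] = np.maximum(D[y0:y1, x0:x1], shifted - penalty)
    return D

root = np.zeros((5, 5))     # placeholder root filter responses
part = np.zeros((5, 5))     # placeholder part filter responses
part[2, 3] = 10.0           # the part fires strongly at (y=2, x=3)

D = transformed_response(part)
total = root + D            # root score plus the best displaced part score
```

The strong part response spreads to neighboring positions with a score that drops as the required displacement grows, which is how a part can shift slightly away from its ideal position while still contributing to the total.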

-----

The main difference between [5] and [6]:

[5] adds parts, and the bounding box size also varies from image to image.

Fig. 7.4. L-SVM vs. HOG, p. 1628 [1].

-----

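The main difference between [1] and [5]:

Within a single category, [1] trains two (or more?) models, which is called a mixture model.

The algorithm has also been improved considerably.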

Fig. 7.5. L-SVM 2010 vs. L-SVM 2008, p. 1630 [1].
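One simple way to picture detection with a mixture model is to score each component separately and keep the better one at every root position. The sketch below assumes this per-position maximum as the combination rule. The component names and score maps are made up for the example.

```python
import numpy as np

def mixture_score(component_scores):
    """Best score and winning component index at every position."""
    stacked = np.stack(component_scores)
    return stacked.max(axis=0), stacked.argmax(axis=0)

# Hypothetical score maps of two components over a 2 x 2 grid of positions.
frontal = np.array([[0.1, 0.9], [0.2, 0.3]])
side = np.array([[0.5, 0.4], [0.8, 0.1]])
best, which = mixture_score([frontal, side])
```

The `which` map records which sub-model explains each position, for instance a frontal or a side view.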

-----

Q8: What is the result?

A8:

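Fig. 8.1 shows a bicycle model, in which the frontal and side views can be seen.

Fig. 8.2 shows a car model and illustrates the training process from the whole object to "whole plus parts".

Fig. 8.3 and Fig. 8.4 show some of the test results. Some of the errors are due to how the conditions were set.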

Fig. 8.1. Detections obtained with a two-component bicycle model, p. 1629 [1].

Fig. 8.2. The result of Phase 1 (a), (b) and Phase 3 (c) of the initialization process, p. 1637 [1].

Fig. 8.3. Examples of high-scoring detections on the PASCAL 2007 data set, p. 1642 [1].

Fig. 8.4. Some results from the PASCAL 2007 dataset, p. 7 [5].

-----

Conclusion:

HOG + SVM led the field for a long time, and even now that deep learning dominates, it is still well worth studying.

-----

References

[1] 2010_Object detection with discriminatively trained part-based models

[2] 2015_Active object localization with deep reinforcement learning

[3] 2014_Rich feature hierarchies for accurate object detection and semantic segmentation

[4] 2013_Selective search for object recognition

[5] 2008_A discriminatively trained, multiscale, deformable part model

[6] 2005_Histograms of oriented gradients for human detection

[7] 目標檢測的圖像特徵提取之（一）HOG特徵 (Image Feature Extraction for Object Detection, Part 1: HOG Features)
http://alex-phd.blogspot.tw/2014/03/hog.html

[8] 2004_Distinctive image features from scale-invariant keypoints

[9] 2004_Distance transforms of sampled functions

[10] 2005_Pictorial structures for object recognition

[11] 2006_A tutorial on energy-based learning

[12] 2003_Support vector machines for multiple-instance learning

[13] 1992_A training algorithm for optimal margin classifiers
