The Star Also Rises: AI from Scratch (35): DPM

Object Detection with Discriminatively Trained Part-Based Models

2017/07/27

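Preface:

This post is one of the warm-ups for R-CNN.

Summary:

The paper covered here performs object detection with a Latent SVM (L-SVM) [1].

I chose it because an earlier paper I read [2] cites R-CNN [3], which in turn relies on Selective Search [4] and L-SVM [1].

Both [1] and its predecessor [5] use Histograms of Oriented Gradients (HOG) for feature extraction [6], [7]. SIFT [8] is similar to HOG, so this post uses it to help explain how HOG works.

[6] handles a coarse outline of the human body, whereas [1] and [5] use [9] and [10] to compute a structure of local parts. Besides giving a higher recognition rate, this also adapts to detecting many different kinds of objects.

L-SVM can be viewed as an Energy-Based Model [11], or as a kind of MI-SVM [12]. For reasons of space, this post only explains the basic SVM concept [13] and does not go far into L-SVM (in truth, the author is still working through the material).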

-----

This post is organized into eight sections.

Q1: What is L-SVM?
Q2: What is AOL?
Q3: What is HOG?
Q4: What is SIFT?
Q5: What is SVM?
Q6: What is the Pyramid Structure?
Q7: What is the flowchart of the object detection algorithm?
Q8: What is the result?

-----

Q1: What is L-SVM?

A1:

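The top row of Fig. 1.1 is the test image; the bottom row is the trained L-SVM model.

Fig. 1.1a shows a coarse human shape: the HOG filter trained with a linear SVM [6].

Fig. 1.1b adds part filters on top of [6]. Because the part locations are not labeled, they have to be treated as latent (hidden) variables during training [1].

Fig. 1.1c is the last piece of the model, displayed as spatial weights.

Technical details are discussed later in this post.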

Fig. 1.1. Detections obtained with a single component person model, p. 2 [1].
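
To make the role of the latent variables concrete: [1] scores an example x as the maximum, over all latent values z (for instance, part placements), of the dot product between the model vector and a feature vector Φ(x, z). The sketch below only illustrates that maximization. The function name and the idea of passing one precomputed feature row per candidate z are assumptions made for this example, not code from [1].

```python
import numpy as np

def latent_score(beta, features_per_placement):
    """Return (best score, index of best z) for f(x) = max_z beta . Phi(x, z).

    Each row of features_per_placement is assumed to hold Phi(x, z) for one
    candidate latent value z (hypothetical input layout for this sketch).
    """
    scores = np.asarray(features_per_placement, dtype=float) @ np.asarray(beta, dtype=float)
    best = int(np.argmax(scores))
    return float(scores[best]), best

# Three candidate placements, each described by a 2-D feature vector.
score, z = latent_score([1.0, -1.0], [[1.0, 0.0], [0.0, 1.0], [2.0, 0.5]])
```

The placement that wins the maximization is what the detector reports as the part location, even though no such location was ever labeled in training.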

-----

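Fig. 1.2 shows models for several object categories; those listed are bottle, car, sofa, and bicycle.
Fig. 1.3 lists person, bottle, cat, and car. The person and bottle models look alike, as do the cat and car models.

One difference between the old and new versions is worth noting: in the new version each category has two sub-models to aid detection, which is called a mixture model.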

Fig. 1.2. Some models learned from the PASCAL VOC 2007 dataset, p. 6 [5].

Fig. 1.3. Some of the models learned on the PASCAL 2007 data set, p. 1641 [1].

-----

Q2: What is AOL?

A2:

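AOL abbreviates Active Object Localization, from the title of a 2015 paper [2]; see Fig. 2. [2] draws mainly on R-CNN [3], ZFNet, and DQN, all classic papers.

R-CNN [3] in turn draws mainly on Selective Search [4] and L-SVM [1], which is why this report covers [1].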

Fig. 2. Related research.

-----

Q3: What is HOG?

A3:

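HOG stands for Histograms of Oriented Gradients. The paper [6], published in 2005, had been cited 19,606 times when this post was written. The basic architecture of [1] is rooted in it, so this post starts with a brief introduction to HOG.

Fig. 3.1a shows images from the database, all sized 64 x 128; see Fig. 3.1b.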

Fig. 3.1a. Some sample images from the new human detection database, p. 3 [6].

Fig. 3.1b. The image resolution is 64 x 128, p. 2 [6].

-----

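This section assumes the reader already knows the basics of SVM; Q5 gives a brief introduction.

Fig. 3.2a is the average over all images. Because left-right mirrored copies of the images were also used, the result is symmetric.

Fig. 3.2b and 3.2c show the largest weights of the vector learned by the linear SVM, for "person" and "non-person" respectively. Each pixel here stands for one 8 x 8 vector (its maximum value). In the HOG descriptor of Fig. 3.2e, this corresponds to an 8 x 8 cell holding lines in nine orientations, with the strength of each line shown by its brightness (see the enlarged figure below).

Fig. 3.2d is the test image.

Applying HOG to Fig. 3.2d gives Fig. 3.2e. Multiplying it by (b) and (c) gives (f) and (g), respectively.

See Q7 for a more detailed description.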

Fig. 3.2. The HOG detectors cue mainly on silhouette contours, p. 7 [6].

Fig. 3.2e. It’s computed R-HOG descriptor, p. 7 [6].
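
The step from (e) to (f) and (g) can be sketched as an element-wise product between the descriptor and the positive or negative part of the SVM weight vector. This is a minimal illustration on a flat toy vector; the function name and the inputs are assumptions made for this example, and the real descriptor in [6] is far larger.

```python
import numpy as np

def weight_descriptor(descriptor, svm_weights):
    """Split linear-SVM weights by sign and apply each part to the descriptor."""
    d = np.asarray(descriptor, dtype=float)
    w = np.asarray(svm_weights, dtype=float)
    positive = d * np.maximum(w, 0.0)    # evidence for "person", as in panel (f)
    negative = d * np.maximum(-w, 0.0)   # evidence for "non-person", as in panel (g)
    return positive, negative

pos, neg = weight_descriptor([1.0, 2.0, 3.0], [0.5, -1.0, 0.0])
# The linear SVM score (without the bias) is the difference of the two sums.
score = pos.sum() - neg.sum()
```

Seen this way, the two weighted images are just the two halves of the usual dot product between the weight vector and the descriptor.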

-----

Fig. 3.3 is the complete flowchart, with six steps. For the normalization in steps one and four, see [7]; this post covers steps two, three, five, and six.

Fig. 3.3. An overview of the feature extraction and object detection chain, p. 2 [6].

-----

Q4: What is SIFT?

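A4:

SIFT stands for Scale-Invariant Feature Transform [8]. It is "somewhat similar" to HOG.

On the left of Fig. 4.1, the smallest units are n x n pixel cells, followed by 4 x 4 grids, and then a 2 x 2 arrangement of grids forms one descriptor. Here strength is shown by arrow length, whereas HOG uses brightness. SIFT also uses eight orientations, while the HOG paper chooses nine.

For the formulas, see Fig. 4.2.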

Fig. 4.1. SIFT: A keypoint descriptor is created, p. 101 [8].

Fig. 4.2. SIFT: For each image sample, the gradient magnitude and orientation is precomputed using pixel differences, p. 99 [8].

-----

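With SIFT introduced, back to HOG.

Fig. 4.3a shows the 36 (9 x 4) eigenvectors that [1] obtains by PCA on HOG features; each eigenvector is a 36 (9 x 4)-dimensional vector.

This figure can be read together with Fig. 3.2e: each "snowflake" in Fig. 3.2e can be composed from these 36 eigenvectors (or a subset of them). Conceptually, each snowflake consists of four small snowflakes, and each small snowflake has nine orientations, each with its own strength. Four small snowflakes times nine orientations gives 9 x 4.

Fig. 4.3b, c, and d provide supporting text.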

Fig. 4.3a. PCA of HOG features, p. 1638 [1].

Fig. 4.3b. HOG: A 4 x 9 matrix, p. 1639 [1].

Fig. 4.3c. HOG: A vector of length 9 x 4, p. 2 [5].

Fig. 4.3d. HOG vs. SIFT, p. 6 [6].
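
The 9 x 4 count can be sketched as follows, using the 2 x 2-cell block of [6]: four 9-bin cell histograms are concatenated and L2-normalized into one 36-dimensional vector. This is a simplification chosen for illustration. [1] reaches its 4 x 9 layout differently, by normalizing a single cell's histogram against four neighbouring blocks, which is not shown here. The function name and the `eps` value are assumptions.

```python
import numpy as np

def block_descriptor(cell_histograms, eps=1e-6):
    """Concatenate four 9-bin cell histograms and L2-normalize the result."""
    cells = np.asarray(cell_histograms, dtype=float)
    if cells.shape != (4, 9):
        raise ValueError("expected 4 cells with 9 orientation bins each")
    v = cells.reshape(-1)                          # 4 x 9 -> 36 values
    return v / np.sqrt(np.sum(v ** 2) + eps ** 2)  # eps avoids division by zero

descriptor = block_descriptor(np.ones((4, 9)))     # 36-dimensional vector
```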

-----

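[6] tried several ways of computing the gradient and found the [-1, 0, 1] mask to work best; see Fig. 4.4a.

Fig. 4.4b shows how HOG computes orientation, which can be compared with SIFT to see the differences.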

Fig. 4.4a. HOG: Simple 1-D [-1, 0, 1] masks work best, p. 4 [6].

Fig. 4.4b. HOG: Gradient orientation and magnitude, p. 1638 [1].
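
A minimal NumPy sketch of these two steps: centered [-1, 0, 1] differences, followed by a magnitude-weighted histogram over nine unsigned orientation bins. It assumes a grayscale image, leaves border pixels at zero gradient, and uses hard binning instead of the interpolated voting of [6]. All function names are made up for this example.

```python
import numpy as np

def gradient_magnitude_orientation(image):
    """Apply the [-1, 0, 1] mask in x and y; return magnitude and angle in [0, 180)."""
    img = np.asarray(image, dtype=float)
    gx = np.zeros_like(img)
    gy = np.zeros_like(img)
    gx[:, 1:-1] = img[:, 2:] - img[:, :-2]   # horizontal [-1, 0, 1]
    gy[1:-1, :] = img[2:, :] - img[:-2, :]   # vertical [-1, 0, 1]
    magnitude = np.hypot(gx, gy)
    orientation = np.degrees(np.arctan2(gy, gx)) % 180.0  # unsigned orientation
    return magnitude, orientation

def cell_histogram(magnitude, orientation, n_bins=9):
    """Accumulate gradient magnitude into n_bins orientation bins for one cell."""
    bin_width = 180.0 / n_bins
    bins = np.minimum((orientation // bin_width).astype(int), n_bins - 1)
    hist = np.zeros(n_bins)
    np.add.at(hist, bins.ravel(), magnitude.ravel())
    return hist

# An 8 x 8 cell whose intensity rises from left to right.
cell = np.tile(np.arange(8.0), (8, 1))
mag, ori = gradient_magnitude_orientation(cell)
hist = cell_histogram(mag, ori)
```

For this horizontal ramp, all of the gradient energy falls into the first orientation bin, which is the kind of single dominant line seen in the HOG visualizations.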

-----

Q5: What is SVM?

A5:

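Here is a brief introduction to the Support Vector Machine (SVM) [13].

D(x) in Fig. 5.1 is the decision function, and D(x) = 0 defines a "hyperplane". w is an n-dimensional vector.

Fig. 5.2 shows a linear SVM, where φ(x) = x, using two dimensions as the example.

Two classes of data are scattered over the plane (roughly forming two clusters), labeled 1 and -1.

Taking a training set from them, we find the D(x) that maximizes the margin M. That amounts to finding the few vectors lying on the dashed lines on either side of D(x) = 0, which in turn determines w and b. These vectors are called support vectors.

In [6], once the HOG features of the training images are computed, the linear SVM described here can be used to classify them.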

Fig. 5.1. The decision function in the direct space, p. 145 [13].

Fig. 5.2. Maximum margin linear decision function D(x), p. 146 [13].
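
The quantities in this section can be checked on a toy 2-D example. The sketch below does not train an SVM. It takes a hand-picked w and b, evaluates D(x) = w . x + b with φ(x) = x, and reports the geometric margin and the points that attain it. The data, w, b, and function names are all assumptions made for illustration.

```python
import numpy as np

def decision(w, b, X):
    """Linear decision function D(x) = w . x + b for each row of X."""
    return np.asarray(X, dtype=float) @ np.asarray(w, dtype=float) + b

def geometric_margin(w, b, X, y):
    """Smallest signed distance from a correctly labeled point to D(x) = 0."""
    return float(np.min(np.asarray(y) * decision(w, b, X)) / np.linalg.norm(w))

def support_vector_indices(w, b, X, y):
    """Indices of the points lying closest to the hyperplane."""
    scores = np.asarray(y) * decision(w, b, X)
    return np.flatnonzero(np.isclose(scores, scores.min()))

X = np.array([[2.0, 2.0], [3.0, 3.0], [0.0, 0.0], [-1.0, 0.0]])
y = np.array([1, 1, -1, -1])
w, b = np.array([0.5, 0.5]), -1.0

margin = geometric_margin(w, b, X, y)        # distance from D(x) = 0 to the dashed lines
support = support_vector_indices(w, b, X, y)
```

Only the points returned by `support_vector_indices` touch the dashed lines; moving any other point slightly would leave w and b unchanged.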

-----

Q6: What is the Pyramid Structure?

A6:

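See Fig. 6.1. The pyramid here actually has only two levels.

The light blue part is the same as [6]: plain pedestrian detection.

What distinguishes [1] and [5] from [6] is the added part structure, shown as the yellow boxes.

Fig. 6.2 shows a test image and the model trained by L-SVM.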

Fig. 6.1. The HOG feature pyramid, p. 2 [5].

Fig. 6.2. Example detection obtained with the person model, p. 1 [5].
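
The idea of placing the root and the parts at different resolutions can be sketched with a plain image pyramid. The example below halves the resolution at each level by 2 x 2 averaging, so that a root filter could look at level k + 1 while part filters look at level k, at twice the resolution. It is a simplification: it works on raw pixels, uses whole-octave steps only, and leaves out the HOG computation that would run on each level. The function names are assumptions.

```python
import numpy as np

def downsample_2x(image):
    """Halve the resolution by averaging non-overlapping 2 x 2 blocks."""
    H, W = image.shape
    H2, W2 = H // 2 * 2, W // 2 * 2            # drop an odd last row or column
    blocks = image[:H2, :W2].reshape(H2 // 2, 2, W2 // 2, 2)
    return blocks.mean(axis=(1, 3))

def image_pyramid(image, n_levels):
    """Return [full resolution, 1/2, 1/4, ...] with n_levels entries."""
    levels = [np.asarray(image, dtype=float)]
    for _ in range(n_levels - 1):
        levels.append(downsample_2x(levels[-1]))
    return levels

levels = image_pyramid(np.arange(64.0).reshape(8, 8), n_levels=3)
part_level, root_level = levels[0], levels[1]  # parts see twice the root's resolution
```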

-----

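One question here is how the bounding box (the dotted box in Fig. 6.3) is found.

In [6] this is not an issue: every image is a fixed 64 x 128 window, with the person in the middle.

In [5] and [1], the bounding boxes used for training are provided by the dataset; see Fig. 6.4a and 6.4b.

The open question concerns Fig. 6.4c. According to [4], [1] uses exhaustive search, trying every possible position. See Fig. 7.1, where the filters slide across the whole image and find two people.

[4] offers a good algorithm for finding bounding boxes; I will discuss it when I get the chance to report on [4].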

Fig. 6.3. The dotted box is the bounding box label provided in the PASCAL training set, p. 5 [5].

Fig. 6.4a. The dotted box is the bounding box label provided in the PASCAL training set, p. 5 [5].

Fig. 6.4b. Training models from images labeled with bounding boxes, p. 1636 [1].

Fig. 6.4c. Exhaustive search, p. 155 [4].
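
Exhaustive search of this kind can be sketched as a sliding window: the filter is scored at every placement on a feature map, and placements above a threshold become detections. The single-channel feature map, the threshold, and the function names are assumptions made for this example; a real detector would repeat this at every scale of the pyramid.

```python
import numpy as np

def sliding_window_scores(feature_map, filt):
    """Score the filter at every placement where it fits inside the feature map."""
    H, W = feature_map.shape
    h, w = filt.shape
    scores = np.empty((H - h + 1, W - w + 1))
    for y in range(H - h + 1):
        for x in range(W - w + 1):
            scores[y, x] = np.sum(feature_map[y:y + h, x:x + w] * filt)
    return scores

def detections(scores, threshold):
    """Return (row, column, score) for every placement above the threshold."""
    ys, xs = np.nonzero(scores > threshold)
    return [(int(y), int(x), float(scores[y, x])) for y, x in zip(ys, xs)]

feature_map = np.zeros((5, 5))
feature_map[1:3, 2:4] = 1.0                    # a toy "object" in the feature map
found = detections(sliding_window_scores(feature_map, np.ones((2, 2))), threshold=3.0)
```

The cost grows with the number of placements times the filter size, which is the motivation [4] gives for proposing fewer candidate boxes.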

-----

Fig. 6.5 shows that a better bounding box comes out of [1] as a by-product.

Fig. 6.5. A car detection and the bounding box predicted from the object configuration, p. 1640 [1].

-----

Q7: What is the flowchart of the object detection algorithm?

A7:

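Fig. 7.1 is the flowchart of the object detection algorithm.

The trained model is at the upper right.

Once HOG is computed, the score is essentially the sum of the whole-object and part responses.

There are two main points:

First, the small part boxes have to be located beforehand using [9] and [10]; see Fig. 7.2 (these are [14] and [15] in the reference list of [1]).

Second, the part responses have to pass through the transform of [9]; see Fig. 7.3 ([14] in [1]).

These two papers are outside the scope of this post.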

Fig. 7.1. The matching process at one scale, p. 1633 [1].

Fig. 7.2. Best locations of the parts, p. 1631 [1].

Fig. 7.3. Generalized distance transform, p. 1632 [1].
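
The "whole plus parts" sum can be sketched in a few lines. Each part's filter response is first transformed so that every position holds the best nearby response minus a quadratic deformation cost, and the transformed maps are then added to the root response. This is a brute-force illustration under several assumptions: parts and root share one resolution, part anchor offsets are omitted, the cost is d * (dy^2 + dx^2), and displacements are limited to `max_shift`. The linear-time distance transform used in [1] is not shown, and the function names are made up.

```python
import numpy as np

def transformed_response(response, d=1.0, max_shift=2):
    """D[y, x] = max over (dy, dx) of response[y + dy, x + dx] - d * (dy^2 + dx^2)."""
    H, W = response.shape
    out = np.full((H, W), -np.inf)
    for dy in range(-max_shift, max_shift + 1):
        for dx in range(-max_shift, max_shift + 1):
            cost = d * (dy * dy + dx * dx)
            # Region where both (y, x) and (y + dy, x + dx) lie inside the map.
            y0, y1 = max(0, -dy), min(H, H - dy)
            x0, x1 = max(0, -dx), min(W, W - dx)
            shifted = response[y0 + dy:y1 + dy, x0 + dx:x1 + dx] - cost
            out[y0:y1, x0:x1] = np.maximum(out[y0:y1, x0:x1], shifted)
    return out

def total_score(root_response, part_responses, d=1.0, bias=0.0):
    """Root response plus the transformed response of every part."""
    score = np.asarray(root_response, dtype=float) + bias
    for response in part_responses:
        score = score + transformed_response(np.asarray(response, dtype=float), d)
    return score

part = np.zeros((5, 5))
part[2, 3] = 10.0                              # the part filter fires strongly here
score = total_score(np.ones((5, 5)), [part])
```

A root position one cell to the left of the strong part response still benefits from it, but pays a deformation cost of 1, so its total is lower than at the position directly under the response.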

-----

The main differences between [5] and [6]:

[5] adds parts, and the bounding box size varies from image to image.

Fig. 7.4. L-SVM vs. HOG, p. 1628 [1].

-----

The main differences between [1] and [5]:

Within one category, [1] trains two (or more?) models, called a mixture model.

The algorithm has also been improved considerably.

Fig. 7.5. L-SVM 2010 vs. L-SVM 2008, p. 1630 [1].

-----

Q8: What is the result?

A8:

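Fig. 8.1 is a bicycle model, showing both frontal and side views.

Fig. 8.2 is a car model, showing the training process going from the whole object to "whole plus parts".

Fig. 8.3 and 8.4 show some test results; some of the errors are due to how the criteria were set.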

Fig. 8.1. Detections obtained with a two-component bicycle model, p. 1629 [1].

Fig. 8.2. The result of Phase 1 (a), (b) and Phase 3 (c) of the initialization process, p. 1637 [1].

Fig. 8.3. Examples of high-scoring detections on the PASCAL 2007 data set, p. 1642 [1].

Fig. 8.4. Some results from the PASCAL 2007 dataset, p. 7 [5].

-----

Conclusion:

HOG + SVM led the field for a long time, and even now, with Deep Learning dominant, it remains well worth studying.

-----

References

[1] 2010_Object detection with discriminatively trained part-based models

[2] 2015_Active object localization with deep reinforcement learning

[3] 2014_Rich feature hierarchies for accurate object detection and semantic segmentation

[4] 2013_Selective search for object recognition

[5] 2008_A discriminatively trained, multiscale, deformable part model

[6] 2005_Histograms of oriented gradients for human detection

[7] 目標檢測的圖像特徵提取之（一）HOG特徵 (Image feature extraction for object detection, part 1: HOG features)
http://alex-phd.blogspot.tw/2014/03/hog.html

[8] 2004_Distinctive image features from scale-invariant keypoints

[9] 2004_Distance transforms of sampled functions

[10] 2005_Pictorial structures for object recognition

[11] 2006_A tutorial on energy-based learning

[12] 2003_Support vector machines for multiple-instance learning

[13] 1992_A training algorithm for optimal margin classifiers
