2017/04/06
Preface:
There is a lot of material, so I first lay out a roadmap to make it easier to navigate.
-----
Summary:
Deep learning (DL) [1]-[40] + reinforcement learning (RL) [41]-[44] = deep reinforcement learning (DRL) [45]-[53].
Over the past four months or so, I have used my after-work hours to teach myself some DL [1]-[20]. Strictly speaking, it has not been pure self-study, because the seniors in our study group are very generous: I have received guidance from Jason, Winston, and others along the way, so things have gone fairly smoothly. Recently, in addition to the original DL reading group [29], Jason started a new RL reading group [41], and I also came across new DRL material [20], so I am taking the opportunity to redraw my own study plan; see Fig. 1.1.
At the start, I looked up some review papers [1]. A post on S/W tools [2] then triggered a series of discussions [3]-[8], and I worked my way through BP [9], AD [10], LeNet [12], [13], CNN [18], and RNN [19]. All the while, I kept asking myself: what exactly is DL [1], [11], [17], [20]?
Having organized the DRL references, I can say I now have at least a rough picture!
Fig. 1.1. Deep reinforcement learning.
-----
Introduction
How should AI be defined? Fig. 1.2 answers this by nesting, from general to specific, machine learning, representation learning, and deep learning. Fig. 1.3 classifies the related techniques as deep vs. shallow and supervised vs. unsupervised. Fig. 1.4 points out that there is, in addition, reinforcement learning. Combining DL with RL, DRL is currently an important direction of development.
So far, I can give DL a rough, not-quite-precise working definition to keep reminding myself: DL is a continuous, differentiable function realized as a multi-layer network whose inputs and outputs are both vectors. The first half of each layer is linear, the second half nonlinear. Writing it as f(x, w, b) = y, the goal of training is to find values of w and b that make the error small over all inputs x. The method is BP: with an input x held fixed, w and b are treated as the variables, and we adjust w and b in the direction that decreases the cost function (i.e., along its negative partial derivatives), so the error gradually converges.
The paragraph above is written for myself: those who do not understand it will not follow it, and those who already understand it do not need it. A minimal numerical sketch of the same training loop follows below.
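To make the working definition concrete, here is a minimal sketch in plain NumPy (the toy data, layer sizes, and learning rate are all made up for illustration): a two-layer network y = f(x, w, b), each layer a linear map followed by a nonlinearity, trained by moving w and b along the negative gradient of a squared-error cost.

import numpy as np

# Minimal sketch of the working definition above: a 2-layer network
# y = f(x, w, b), each layer = linear half followed by a nonlinear half,
# trained by backpropagation (gradient descent on a squared-error cost).
# All shapes and constants here are illustrative, not from any particular paper.
rng = np.random.default_rng(0)
x = rng.normal(size=(100, 3))                 # 100 samples, 3 input features
t = np.sin(x).sum(axis=1, keepdims=True)      # toy regression target

W1, b1 = rng.normal(size=(3, 8)) * 0.5, np.zeros(8)
W2, b2 = rng.normal(size=(8, 1)) * 0.5, np.zeros(1)
lr = 0.05

for step in range(2000):
    # forward pass: linear half, then nonlinear half, for each layer
    z1 = x @ W1 + b1
    h1 = np.tanh(z1)
    y = h1 @ W2 + b2                           # output layer kept linear
    cost = ((y - t) ** 2).mean()

    # backward pass: partial derivatives of the cost w.r.t. w and b
    dy = 2 * (y - t) / len(x)
    dW2, db2 = h1.T @ dy, dy.sum(axis=0)
    dh1 = dy @ W2.T
    dz1 = dh1 * (1 - h1 ** 2)                  # tanh'(z) = 1 - tanh(z)^2
    dW1, db1 = x.T @ dz1, dz1.sum(axis=0)

    # move w, b a small step in the direction that decreases the cost
    W1 -= lr * dW1; b1 -= lr * db1
    W2 -= lr * dW2; b2 -= lr * db2

print("final cost:", cost)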
What follows is divided into six parts: A, B, and C cover DL from basic to advanced, D covers RL, and E and F cover DRL; see Fig. 1.1. The numbers are the course order I designed for myself, and this article mainly walks through that order.
Fig. 1.2. AI, p. 9 [29].
Fig. 1.3. Deep learning taxonomy, p. 492 [28].
Fig. 1.4. Machine learning, p. 515 [28].
-----
A, 01-07
Part A covers the basics of DL. For this part I particularly recommend [28], which has a downloadable electronic edition. It is not especially well written and is too brief in places, but its advantage is that DL gets just one chapter and RL another, so you do not have to chew through [29] right at the start. Even now that I have some understanding of CNNs [18] and RNNs [19], I still find the corresponding chapters of [29] hard to read.
As for papers, [23] is worth reading first because it is mainly introductory. [21] gives brief introductions to several basic architectures and rewards repeated reading. After this warm-up, you can then focus on LeNet [12].
-----
B, 08-10
An RNN is a unit that can remember; adding a controller on top gives the Neural Turing Machine (NTM), Differentiable Neural Computer (DNC), and Memory-Augmented Neural Network (MANN) architectures [30]-[37].
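As a rough illustration of what "a unit that can remember" means, here is a small sketch (random, untrained weights; all names and sizes are hypothetical): a plain RNN step that carries a hidden state across time, plus an NTM-style content-based read from an external memory matrix in the spirit of [35].

import numpy as np

# An RNN cell keeps a hidden state h as its internal "memory" across steps;
# an NTM-style controller additionally reads an external memory matrix M
# by content-based (softmax) addressing, roughly as in Graves et al. [35].
rng = np.random.default_rng(1)
Wx, Wh, b = rng.normal(size=(4, 6)), rng.normal(size=(6, 6)), np.zeros(6)

def rnn_step(x, h):
    # plain recurrent update: the new state mixes the input and the old state
    return np.tanh(x @ Wx + h @ Wh + b)

def content_read(M, key):
    # cosine similarity between the key and each memory row, then softmax
    scores = M @ key / (np.linalg.norm(M, axis=1) * np.linalg.norm(key) + 1e-8)
    w = np.exp(scores - scores.max()); w /= w.sum()
    return w @ M                               # weighted sum of memory rows

h = np.zeros(6)
M = rng.normal(size=(8, 6))                    # 8 memory slots of width 6
for t in range(5):
    x = rng.normal(size=4)
    h = rnn_step(x, h)                         # internal memory (hidden state)
    r = content_read(M, h)                     # external memory (NTM-style read)
print(h.shape, r.shape)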
-----
C, 11-15
This part is mainly about generative models.
Figure 2.1a shows that representation learning follows Autoencoders (AE), and that before reading about Deep Generative Models (DGM) you also have to read about AE first.
The Boltzmann machine occupies the first half of Chapter 20, while the VAE and the GAN are both tucked away in Section 20.10; see Fig. 2.2a.
Fig. 2.1a. Deep Learning, p. 12 [29].
Fig. 2.1b. Autoencoder, p. 501 [28].
Fig. 2.1c. Representation learning, [38].
Fig. 2.2a. Deep generative models, [29].
Fig. 2.2b. Restricted Boltzmann machine, p. 498 [28].
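For orientation, here is a minimal autoencoder sketch (untrained, random weights; the shapes are chosen arbitrarily): an encoder compresses the input to a low-dimensional code, a decoder reconstructs it, and the reconstruction error is the cost. Training would use the same BP loop as the earlier sketch; VAEs and GANs then extend or replace this objective.

import numpy as np

# Minimal autoencoder skeleton: encode 10-dimensional inputs to a 3-dimensional
# code, decode back, and measure the reconstruction error. Purely illustrative.
rng = np.random.default_rng(2)
x = rng.normal(size=(32, 10))                  # 32 samples, 10 features

We, be = rng.normal(size=(10, 3)) * 0.3, np.zeros(3)   # encoder: 10 -> 3
Wd, bd = rng.normal(size=(3, 10)) * 0.3, np.zeros(10)  # decoder: 3 -> 10

code = np.tanh(x @ We + be)                    # compressed representation
x_hat = code @ Wd + bd                         # reconstruction
recon_error = ((x - x_hat) ** 2).mean()
print("code shape:", code.shape, "reconstruction error:", round(recon_error, 3))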
-----
D, 16-23
As for reinforcement learning, [41] is not an easy book to digest, so I first extracted the key points from [28]. Foremost are four especially important techniques: Dynamic Programming (DP), Monte Carlo Methods (MCM), Temporal-Difference (TD) Learning, and the Policy Gradient Method (PGM).
MCM is important, and [29] also devotes a chapter to it. TD may be the most important of all; see Fig. 3.2a. For that reason I also gathered material on Q-learning [43], [44].
From Fig. 3.1a we can see that the Markov Decision Process (MDP) and Function Approximation (FA) are the foundations. FA spans quite a few chapters of [41], but it is indispensable for understanding AlphaGo [50], so there is no escaping reading [41] in full.
An MDP, simply put, is the s, a, r, s' loop (state, action, reward, next state); see Fig. 3.1b.
Fig. 3.1a. RL basis, p. 513 [28].
Fig. 3.1b. Basic RL model, p. 524 [28].
Fig. 3.2a. RL methods, p. 513 [28].
Fig. 3.2b. Q-learning, p. 537 [28].
Fig. 3.2c. Actor-Critic, p. 538 [28].
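To pin down the s, a, r, s' loop and the TD idea, here is a tabular Q-learning sketch on a made-up five-state chain (the environment and all constants are hypothetical); the update is the rule of Watkins and Dayan [43].

import numpy as np

# Tabular Q-learning on a toy 5-state chain MDP: each step is exactly the
# (s, a, r, s') loop of Fig. 3.1b, and the update is the TD rule
# Q(s,a) += alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))  [43].
rng = np.random.default_rng(3)
n_states, n_actions = 5, 2                     # actions: 0 = left, 1 = right
Q = np.zeros((n_states, n_actions))
alpha, gamma, epsilon = 0.1, 0.9, 0.1

def step(s, a):
    s_next = max(0, s - 1) if a == 0 else min(n_states - 1, s + 1)
    r = 1.0 if s_next == n_states - 1 else 0.0  # reward only in the rightmost state
    return r, s_next

for episode in range(500):
    s = 0
    for t in range(20):
        # epsilon-greedy action selection
        a = rng.integers(n_actions) if rng.random() < epsilon else int(Q[s].argmax())
        r, s_next = step(s, a)                  # environment returns (r, s')
        # TD update toward the bootstrapped target r + gamma * max_a' Q(s', a')
        Q[s, a] += alpha * (r + gamma * Q[s_next].max() - Q[s, a])
        s = s_next

print(np.round(Q, 2))                           # learned action values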
-----
E, 24-27
Reference [45] gives a very detailed overview of DRL, including papers published in 2017. The ATARI games are the focus of the DQN, A3C, and UNREAL papers [46]-[49].
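As a very rough sketch of two ingredients the DQN paper [47] adds on top of Q-learning, experience replay and a periodically synchronized target network, here is a toy version with a linear Q function standing in for the CNN (the fake transitions and all constants are made up; a real agent would interact with the ATARI emulator).

import numpy as np

# Toy DQN-flavored loop: store (s, a, r, s') tuples in a replay buffer, sample
# past transitions for updates, and bootstrap from a frozen target network that
# is only synchronized occasionally. Linear Q function used here for brevity.
rng = np.random.default_rng(4)
state_dim, n_actions = 4, 2
W = rng.normal(size=(state_dim, n_actions)) * 0.1   # online Q network (linear here)
W_target = W.copy()                                 # frozen target network
replay, alpha, gamma = [], 0.01, 0.99

def random_transition():
    # stand-in for interacting with an emulator
    s, s_next = rng.normal(size=state_dim), rng.normal(size=state_dim)
    return s, rng.integers(n_actions), rng.random(), s_next

for step_i in range(1000):
    replay.append(random_transition())                    # store experience
    s, a, r, s_next = replay[rng.integers(len(replay))]   # sample a past transition
    target = r + gamma * (s_next @ W_target).max()        # bootstrap from target net
    td_error = target - (s @ W)[a]
    W[:, a] += alpha * td_error * s                       # gradient step on online net
    if step_i % 100 == 0:
        W_target = W.copy()                               # periodic synchronization

print("Q values for a random state:", np.round(rng.normal(size=state_dim) @ W, 2))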
-----
F, 28-34
Finally there is AlphaGo [50]. Several Chinese-language articles introduce the techniques it uses [11]; besides CNNs, it also relies on the PGM [51] and on MCTS [52], [53]. Because this PGM uses function approximation, you have to finish [41] before you can really understand AlphaGo.
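To show what a policy-gradient update with function approximation looks like in the spirit of [51], here is a REINFORCE sketch with a linear softmax policy on a one-step toy problem (the reward rule and all parameters are hypothetical and have nothing to do with AlphaGo's actual networks).

import numpy as np

# REINFORCE with a linear softmax policy on a one-step (contextual bandit) task:
# the parameters theta move along reward * grad log pi(a|s), the core update
# behind policy-gradient methods with function approximation [51].
rng = np.random.default_rng(5)
state_dim, n_actions = 3, 2
theta = np.zeros((state_dim, n_actions))            # policy parameters
alpha = 0.05

def policy(s):
    logits = s @ theta
    p = np.exp(logits - logits.max())
    return p / p.sum()

for episode in range(2000):
    s = rng.normal(size=state_dim)                  # one-step episode
    p = policy(s)
    a = rng.choice(n_actions, p=p)
    reward = 1.0 if a == int(s.sum() > 0) else 0.0  # hypothetical reward rule
    # REINFORCE: grad of log softmax for linear features is outer(s, onehot(a) - p)
    grad_log_pi = np.outer(s, np.eye(n_actions)[a] - p)
    theta += alpha * reward * grad_log_pi

s_test = np.array([1.0, 0.5, 0.2])                  # s.sum() > 0, so action 1 should win
print("pi(a|s_test):", np.round(policy(s_test), 2))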
-----
Conclusion:
This has been a quick pass over the whole of DRL, more a map than a travel guide. There is still a long road ahead!
-----
Fig. 4.1a. Policy gradient method [50].
Fig. 4.1b. Policy and value networks [50].
Fig. 4.1c. Monte Carlo tree search (MCTS) [50].
Fig. 4.2. Actor-critic artificial neural network and a hypothetical neural implementation, p. 340 [41].
-----
References
[1] AI從頭學(一):文獻回顧
[2] AI從頭學(二):Popular Deep Learning Software Tools
[3] AI從頭學(三):Popular Deep Learning Hardware Tools
[4] AI從頭學(四):AD and LeNet
[5] AI從頭學(五):AD and Python
[6] AI從頭學(六):The Net
[7] AI從頭學(七):AD and Python from Jason
[8] AI從頭學(八):The Net from Mark
[9] AI從頭學(九):Back Propagation
[10] AI從頭學(一0):Automatic Differentiation
[11] AI從頭學(一一):A Glance at Deep Learning
[12] AI從頭學(一二):LeNet
[13] AI從頭學(一三):LeNet - F6
[14] AI從頭學(一四):Recommender
[15] AI從頭學(一五):Deep Learning,How?
[16] AI從頭學(一六):Deep Learning,What?
[17] AI從頭學(一七):Shallow Learning
[18] AI從頭學(一八):Convolutional Neural Network
[19] AI從頭學(一九):Recurrent Neural Network
[20] AI從頭學(二0):Deep Learning,Hot
[21] Wang, Hao, and Dit-Yan Yeung. "Towards Bayesian Deep Learning: A Survey." arXiv preprint arXiv:1604.01662 (2016).
[22] Lacey, Griffin, Graham W. Taylor, and Shawki Areibi. "Deep learning on fpgas: Past, present, and future." arXiv preprint arXiv:1602.04283 (2016).
[23] LeCun, Yann, Yoshua Bengio, and Geoffrey Hinton. "Deep learning." Nature 521.7553 (2015): 436-444.
[24] Schmidhuber, Jürgen. "Deep learning in neural networks: An overview." Neural networks 61 (2015): 85-117.
[25] Deng, Li, and Dong Yu. "Deep learning: methods and applications." Foundations and Trends® in Signal Processing 7.3–4 (2014): 197-387.
[26] Bengio, Yoshua, Aaron C. Courville, and Pascal Vincent. "Unsupervised feature learning and deep learning: A review and new perspectives." CoRR, abs/1206.5538 1 (2012).
[27] Bengio, Yoshua. "Learning deep architectures for AI." Foundations and trends® in Machine Learning 2.1 (2009): 1-127.
[28] Gollapudi, Sunila. Practical machine learning. Packt Publishing, 2016.
[29] Goodfellow, Ian, Yoshua Bengio, and Aaron Courville. Deep learning. MIT Press, 2016.
http://www.deeplearningbook.org/
[30]
[31] 程式編寫程式:泛用人工智慧領域的一顆明珠 - 歌穀穀
http://www.gegugu.com/2017/03/16/9050.html
[32] 深度學習挑戰馮·諾依曼結構_幫趣網
bangqu.com/gpu/blog/5239
[33] 深度學習的新方向 One-shot learning - George's Research Website
http://tzuching1.weebly.com/blog/-one-shot-learning
[34] Olah, Chris, and Shan Carter. "Attention and Augmented Recurrent Neural Networks." Distill 1.9 (2016): e1.
http://distill.pub/2016/augmented-rnns/
[35] Graves, Alex, Greg Wayne, and Ivo Danihelka. "Neural turing machines." arXiv preprint arXiv:1410.5401 (2014).
[36] Graves, Alex, et al. "Hybrid computing using a neural network with dynamic external memory." Nature 538.7626 (2016): 471-476.
[37] Santoro, Adam, et al. "One-shot learning with memory-augmented neural networks." arXiv preprint arXiv:1605.06065 (2016).
[38] Bengio, Yoshua, Aaron Courville, and Pascal Vincent. "Representation learning: A review and new perspectives." IEEE transactions on pattern analysis and machine intelligence 35.8 (2013): 1798-1828.
[39] Doersch, Carl. "Tutorial on variational autoencoders." arXiv preprint arXiv:1606.05908 (2016).
[40] Goodfellow, Ian, et al. "Generative adversarial nets." Advances in neural information processing systems. 2014.
[41] Sutton, Richard S., and Andrew G. Barto. Reinforcement learning: An introduction. 2nd ed. (draft), 2016.
http://incompleteideas.net/sutton/book/bookdraft2016sep.pdf
[42] Heidrich-Meisner, Verena, et al. "Reinforcement learning in a nutshell." ESANN. 2007.
[43] Watkins, Christopher JCH, and Peter Dayan. "Q-learning." Machine learning 8.3-4 (1992): 279-292.
[44] Artificial Intelligence: Foundations of Computational Agents, Section 11.3.3: Q-learning.
http://artint.info/html/ArtInt_265.html
[45] Li, Yuxi. "Deep reinforcement learning: An overview." arXiv preprint arXiv:1701.07274 (2017).
[46] 深度增強學習前沿算法思想 - 歌穀穀
http://www.gegugu.com/2017/02/17/1360.html
[47] Mnih, Volodymyr, et al. "Human-level control through deep reinforcement learning." Nature 518.7540 (2015): 529-533.
[48] Mnih, Volodymyr, et al. "Asynchronous methods for deep reinforcement learning." International Conference on Machine Learning. 2016.
[49] Jaderberg, Max, et al. "Reinforcement learning with unsupervised auxiliary tasks." arXiv preprint arXiv:1611.05397 (2016).
[50] Silver, David, et al. "Mastering the game of Go with deep neural networks and tree search." Nature 529.7587 (2016): 484-489.
[51] Sutton, Richard S., et al. "Policy gradient methods for reinforcement learning with function approximation." NIPS. Vol. 99. 1999.
[52] Browne, Cameron B., et al. "A survey of monte carlo tree search methods." IEEE Transactions on Computational Intelligence and AI in games 4.1 (2012): 1-43.
[53] Huang, Shih-Chieh, and Martin Müller. "Investigating the limits of Monte-Carlo tree search methods in computer Go." International Conference on Computers and Games. Springer International Publishing, 2013.