Thursday, December 05, 2019

Adam

2019/12/02

-----


// Overview of different Optimizers for neural networks

-----


// An Overview on Optimization Algorithms in Deep Learning 2 - Taihong Xiao

-----


# Adam

-----


// SGD算法比较 – Slinuxer

-----

「3 Initialization bias correction

As stated in the algorithm in Section 2 of the paper, Adam uses initialization bias-correction terms. This section derives the correction term for the second-moment estimate; the derivation for the first-moment estimate is entirely analogous. We first obtain the gradient of the stochastic objective f, and we wish to estimate its second raw moment (the uncentered variance) using an exponential moving average of the squared gradient with decay rate β2. Let g_1, …, g_T be the gradients at the successive timesteps, each drawn from an underlying gradient distribution g_t ∼ p(g_t). We initialize the exponential moving average as v_0 = 0 (a vector of zeros); the update of the moving average at timestep t is then

v_t = β2 · v_{t−1} + (1 − β2) · g_t^2,

where g_t^2 denotes the Hadamard product g_t ⊙ g_t, i.e. the elementwise square. Equivalently, this can be rewritten as a function of only the gradients and the decay rate over all previous timesteps, eliminating v:

v_t = (1 − β2) Σ_{i=1}^{t} β2^{t−i} · g_i^2.   (1)

We want to know how the expected value E[v_t] of the moving average at timestep t relates to the true second moment, so that we can correct the discrepancy between the two. Taking expectations of both sides of expression (1) gives

E[v_t] = E[g_t^2] · (1 − β2^t) + ζ.

If the true second moment E[g_i^2] is stationary, then ζ = 0; otherwise ζ can be kept small, since the exponential decay rate β1 should be chosen so that the moving average assigns only small weights to gradients far in the past. Initializing the average with a zero vector is therefore what leaves the (1 − β2^t) term, and in Algorithm 1 we divide by this term to correct the initialization bias.

For sparse gradients, a reliable second-moment estimate requires averaging over many gradients by choosing a small β2; but it is exactly this case of small β2 in which the lack of initialization bias correction would make the initial step sizes too large.」

// 深度学习最常用的学习算法:Adam优化算法 - 知乎

-----
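To make the derivation quoted above concrete, here is a minimal Python sketch (not from the quoted source; β2, the constant gradient g, and the variable names are assumptions chosen for illustration). It tracks the exponential moving average of squared gradients and shows that the raw average v_t stays far below the true second moment in the early steps, while dividing by (1 − β2^t) recovers it.

```python
# Minimal numerical check of the (1 - beta2^t) bias factor described above.
# Assumes a constant gradient g, so the "true" second moment is exactly g**2.

beta2 = 0.999
g = 0.5                # constant gradient => true second moment is g**2 = 0.25
v = 0.0                # EMA of squared gradients, initialized at zero

for t in range(1, 11):
    v = beta2 * v + (1 - beta2) * g**2      # v_t = beta2*v_{t-1} + (1-beta2)*g_t^2
    v_hat = v / (1 - beta2**t)              # bias-corrected estimate
    print(f"t={t:2d}  v={v:.6f}  v_hat={v_hat:.6f}  factor={1 - beta2**t:.6f}")

# v alone is far below 0.25 in the early steps (biased toward its zero init),
# while v_hat recovers the true second moment 0.25 exactly for a constant gradient.
```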

「In lines 9 and 10 we correct the bias of the two moment estimates. But why? Because we initialized the moments at 0, they are biased toward 0. We could omit these two lines and still converge in theory, but training would be very slow during the initial steps. As an example, suppose β1 = 0.2 and g1 = 10. From line 7 we have m1 = 0.2 * 0 + 0.8 * 10 = 8, but realistically the average should be 10. Applying the line-9 update gives new m1 = 8 / (1 − 0.2^1) = 8 / 0.8 = 10, which is the desired value (this is the bias correction).」

// Everything you need to know about Adam Optimizer - Nishant Nikhil - Medium

-----
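The arithmetic in the quoted example can be checked directly; this tiny sketch simply restates those numbers in code (variable names are illustrative), reproducing m1 = 8 before correction and 10 after.

```python
# Reproduces the quoted bias-correction example: beta1 = 0.2, m0 = 0, g1 = 10.
beta1 = 0.2
m0, g1 = 0.0, 10.0

m1 = beta1 * m0 + (1 - beta1) * g1   # first-moment EMA update: 0.2*0 + 0.8*10 = 8
m1_hat = m1 / (1 - beta1**1)         # bias correction at t = 1: 8 / 0.8 = 10

print(m1, m1_hat)                    # 8.0 10.0
```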

「Initialization bias-correction. Since the EMA vectors are initialized as vectors of 0's, the moment estimates are biased towards zero, especially during the initial time steps, and especially when the decay rates are small (the βs are close to 1). To counteract this, the moment estimates are bias-corrected. Using the recurrence between v(t) and v(t−1), it follows that v(t) = (1 − β2) Σ_{i=1}^{t} β2^(t−i) · g(i)². Taking the expectation of both sides, it quickly follows that E[v(t)] = E[g(t)²] · (1 − β2^t) + ζ, where ζ is a residual term that adjusts the equation. That is why, at every step, the estimate v(t) is divided by (1 − β2^t). The same reasoning applies to the first gradient moment.」

// Understanding Adam   how loss functions are minimized

-----
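Putting the pieces together, here is a minimal NumPy sketch of one full Adam update with bias-corrected first and second moments, following Algorithm 1 of Kingma & Ba (2014). The hyperparameter defaults match the paper; the function name `adam_step` and the quadratic toy objective are assumptions made for illustration, not the paper's reference implementation.

```python
import numpy as np

def adam_step(theta, grad, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update (Kingma & Ba, Algorithm 1) with bias-corrected moments."""
    m = beta1 * m + (1 - beta1) * grad        # first-moment EMA
    v = beta2 * v + (1 - beta2) * grad**2     # second-moment EMA (elementwise square)
    m_hat = m / (1 - beta1**t)                # bias-corrected first moment
    v_hat = v / (1 - beta2**t)                # bias-corrected second moment
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v

# Toy usage on an assumed test problem: minimize f(theta) = ||theta||^2, gradient 2*theta.
theta = np.array([1.0, -2.0])
m = np.zeros_like(theta)
v = np.zeros_like(theta)
for t in range(1, 5001):
    grad = 2 * theta
    theta, m, v = adam_step(theta, grad, m, v, t)
print(theta)  # both coordinates are driven close to 0
```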

References

# Adam
Kingma, Diederik P., and Jimmy Ba. "Adam: A method for stochastic optimization." arXiv preprint arXiv:1412.6980 (2014).
https://arxiv.org/pdf/1412.6980.pdf

Keskar, Nitish Shirish, and Richard Socher. "Improving generalization performance by switching from adam to sgd." arXiv preprint arXiv:1712.07628 (2017).
https://arxiv.org/pdf/1712.07628.pdf

-----

Overview of different Optimizers for neural networks
https://medium.com/datadriveninvestor/overview-of-different-optimizers-for-neural-networks-e0ed119440c3

An Overview on Optimization Algorithms in Deep Learning 2 - Taihong Xiao
https://prinsphield.github.io/posts/2016/02/overview_opt_alg_deep_learning2/

Everything you need to know about Adam Optimizer - Nishant Nikhil - Medium
https://medium.com/@nishantnikhil/adam-optimizer-notes-ddac4fd7218 

Understanding Adam   how loss functions are minimized
https://towardsdatascience.com/understanding-adam-how-loss-functions-are-minimized-3a75d36ebdfc

Why is it important to include a bias correction term for the Adam optimizer for Deep Learning? - Cross Validated
https://stats.stackexchange.com/questions/232741/why-is-it-important-to-include-a-bias-correction-term-for-the-adam-optimizer-for/234686#234686

-----

SGD算法比较 – Slinuxer
https://blog.slinuxer.com/2016/09/sgd-comparison

听说你了解深度学习最常用的学习算法:Adam优化算法? - 机器之心
https://www.jiqizhixin.com/articles/2017-07-12

深度学习最常用的学习算法:Adam优化算法 - 知乎
https://zhuanlan.zhihu.com/p/33385885
