The Star Also Rises: LSTM (3): Illustrated

2021/07/12

-----

https://pixabay.com/zh/photos/mother-baby-child-garden-home-1150056/

-----

# History DL

Notes:

A rolled-up (not yet unrolled) recurrent neural network (RNN).

xt is the input.

ht is the output.

-----

# History DL

Notes:

The unrolled RNN.

-----

# History DL

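Notes:

The formulas here differ from the preceding figure. Here, ht is the hidden state that is passed on to the next hidden state, and yt is the actual output.

The ordinary RNN is an Elman network: its hidden state takes xt and h(t-1) as input.

A Jordan network instead takes xt and y(t-1) as input.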
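
The difference between the two recurrences can be sketched with scalar states. This is only a minimal illustration: the function names, the linear readout for yt, and the toy weights are assumptions and do not come from the cited papers.

```python
import math

def elman_step(x_t, h_prev, w_x, w_h, b):
    """Elman update: the hidden state is computed from xt and h(t-1)."""
    return math.tanh(w_x * x_t + w_h * h_prev + b)

def jordan_step(x_t, y_prev, w_x, w_y, b):
    """Jordan update: the hidden state is computed from xt and y(t-1)."""
    return math.tanh(w_x * x_t + w_y * y_prev + b)

def readout(h_t, w_out, b_out):
    """Actual output yt, here assumed to be a linear function of ht."""
    return w_out * h_t + b_out

xs = [1.0, 0.5, -0.5]  # toy input sequence

# Elman: feed the previous hidden state back in.
h, y_elman = 0.0, []
for x in xs:
    h = elman_step(x, h, 0.8, 0.5, 0.0)
    y_elman.append(readout(h, 2.0, 0.0))

# Jordan: feed the previous output back in.
y, y_jordan = 0.0, []
for x in xs:
    h_j = jordan_step(x, y, 0.8, 0.5, 0.0)
    y = readout(h_j, 2.0, 0.0)
    y_jordan.append(y)
```

Both loops start from zero feedback, so the first outputs agree. From the second step on they diverge, because one network feeds back h(t-1) and the other feeds back y(t-1).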

-----

# History DL

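Notes:

LSTM.

ft is the forget gate. Its inputs are h(t-1) and xt, its weights are Wf, and its bias is bf. A sigmoid keeps its value between 0 and 1.

it is the input gate. Its inputs are h(t-1) and xt, its weights are Wi, and its bias is bi. A sigmoid keeps its value between 0 and 1.

~ is read "tilde".

C~t is the candidate cell state. Its inputs are h(t-1) and xt (formula (36) contains an error at h(t-1)). Its weights are WC and its bias is bC. It passes through the tanh activation function.

Ct is the cell state (the memory value). The forget gate value times the previous cell state, plus the input gate value times the candidate cell state, gives the memory value at the current step.

Ot is the output gate. Its inputs are h(t-1) and xt, its weights are WO, and its bias is bO. A sigmoid keeps its value between 0 and 1.

ht is the output. Its value is the output gate times tanh(Ct).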
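
The gate definitions above can be written out as one forward step. The following is a minimal sketch in plain Python, assuming each weight matrix acts on the concatenation [h(t-1), xt]. The helper names, the parameter dictionary layout, and the toy constant weights are illustrative assumptions.

```python
import math

def sigmoid(z):
    """Logistic function; squashes z into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def affine(W, v, b):
    """Return W @ v + b, with W a list of rows and v, b plain lists."""
    return [sum(w * x for w, x in zip(row, v)) + b_k for row, b_k in zip(W, b)]

def lstm_step(x_t, h_prev, c_prev, p):
    """One LSTM time step. p is a dict of weights and biases (names assumed)."""
    z = h_prev + x_t  # list concatenation [h(t-1), xt]
    f_t = [sigmoid(a) for a in affine(p["Wf"], z, p["bf"])]      # forget gate
    i_t = [sigmoid(a) for a in affine(p["Wi"], z, p["bi"])]      # input gate
    c_hat = [math.tanh(a) for a in affine(p["WC"], z, p["bC"])]  # candidate C~t
    o_t = [sigmoid(a) for a in affine(p["WO"], z, p["bO"])]      # output gate
    # Ct = ft * C(t-1) + it * C~t, element by element
    c_t = [f * c + i * g for f, c, i, g in zip(f_t, c_prev, i_t, c_hat)]
    # ht = Ot * tanh(Ct)
    h_t = [o * math.tanh(c) for o, c in zip(o_t, c_t)]
    return h_t, c_t

# Toy sizes and constant weights, chosen only for illustration.
hidden_size, input_size = 2, 3
W = [[0.1] * (hidden_size + input_size) for _ in range(hidden_size)]
b = [0.0] * hidden_size
params = {"Wf": W, "bf": b, "Wi": W, "bi": b, "WC": W, "bC": b, "WO": W, "bO": b}

h_t, c_t = lstm_step([1.0, 0.5, -0.5], [0.0, 0.0], [0.0, 0.0], params)
```

With a zero initial cell state the forget gate term drops out, so the first Ct is simply the input gate times the candidate value.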

-----

# History DL

Notes:

See the explanation in the figure.

-----

Vanishing and exploding gradients

-----

ANN

Notes:

Assume each layer has only one neuron.

https://zhuanlan.zhihu.com/p/25631496

https://ziyubiti.github.io/2016/11/06/gradvanish/
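
With one neuron per layer, the chain rule reduces to a product of scalars, one factor per layer. The sketch below assumes sigmoid activations, whose derivative is at most 0.25, and uses toy weights of 1. The function names and values are illustrative assumptions.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def sigmoid_prime(z):
    """Derivative of the sigmoid; its maximum is 0.25, reached at z = 0."""
    s = sigmoid(z)
    return s * (1.0 - s)

def input_gradient(x, weights, biases):
    """d(output)/d(input) for a chain of one-neuron sigmoid layers."""
    grad, a = 1.0, x
    for w, b in zip(weights, biases):
        z = w * a + b
        grad *= w * sigmoid_prime(z)  # chain rule: one factor per layer
        a = sigmoid(z)
    return grad

shallow = input_gradient(0.5, [1.0] * 2, [0.0] * 2)   # 2 layers
deep = input_gradient(0.5, [1.0] * 20, [0.0] * 20)    # 20 layers
```

Each factor here is at most 0.25, so the 20-layer gradient is bounded by 0.25^20 and is far smaller than the 2-layer one. Large weights can push the factors above 1 instead, which is the exploding case.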

-----

RNN backpropagation: the terms of the chained product.

Modified from # History DL

Notes:

St (red dot).

-----

Vanishing and exploding gradients in RNNs

Notes:

In RNN backpropagation, each factor of the chained product is tanh' times Ws.

https://zhuanlan.zhihu.com/p/28687529
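
The chained product can be checked numerically on a scalar RNN. This is a minimal sketch that assumes the recurrence St = tanh(Ws * S(t-1) + Wx * xt). The function name and the toy values are illustrative, and the inputs are set to zero so that the state stays at the origin, where tanh' equals 1.

```python
import math

def state_gradient(w_s, w_x, xs, s0=0.0):
    """d S_T / d S_0 for the scalar RNN S_t = tanh(w_s * S_(t-1) + w_x * x_t)."""
    grad, s = 1.0, s0
    for x in xs:
        s = math.tanh(w_s * s + w_x * x)
        grad *= (1.0 - s * s) * w_s  # tanh'(z) = 1 - tanh(z)^2, times Ws
    return grad

steps = [0.0] * 30  # 30 time steps of zero input
vanishing = state_gradient(0.5, 1.0, steps)  # |Ws| < 1
exploding = state_gradient(1.5, 1.0, steps)  # |Ws| > 1
```

In this special case the product reduces to Ws raised to the number of steps: it shrinks toward zero for Ws = 0.5 and grows rapidly for Ws = 1.5. Away from the origin tanh' is below 1, which further shrinks every factor.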

-----

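How LSTM solves the vanishing gradient

The value of the derivative is essentially either 0 or 1. When the values are all 1, the gradient does not vanish. When a value is 0, no gradient needs to be passed back.

Notes:

LSTM backpropagation also yields a chained product, similar to the RNN. The difference is that each LSTM factor is tanh' times the forget gate value instead of the plain Ws. The figure in the reference shows that when the forget gate value equals 0, no derivative needs to be passed back, because the past value does not affect the present value.

The explanation above is not ideal. A product of tanh' and sigmoid values is still essentially a product of many numbers between 0 and 1. The values in the figure look mostly like 0 or 1 only because the input plane was plotted over a fairly large range. In a small region around the origin, many values still fall between 0 and 1.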

https://zhuanlan.zhihu.com/p/28749444
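
The objection can be made concrete. Because Ct = ft * C(t-1) + it * C~t, the direct derivative of Ct with respect to C(t-1) is ft if the gates are treated as constants. The sketch below makes that simplifying assumption, ignoring how the gates themselves depend on h(t-1), and multiplies the forget gate values over 30 steps. The function name and gate values are illustrative.

```python
def cell_path_gradient(forget_gates):
    """Product of forget-gate values: d C_T / d C_0 along the direct cell path."""
    grad = 1.0
    for f in forget_gates:
        grad *= f
    return grad

steps = 30
open_gates = cell_path_gradient([1.0] * steps)   # gates fully open
mostly_open = cell_path_gradient([0.9] * steps)  # gates near, but below, 1
half_open = cell_path_gradient([0.5] * steps)    # gates in the middle of (0, 1)
```

Only gates held at 1 preserve the gradient exactly. At 0.9 the gradient falls to roughly 0.04 after 30 steps, and at 0.5 it is below 1e-8, so the product of forget gates alone does not rule out vanishing.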

-----

Modified from # History DL

Notes:

In backpropagation, Ct is the key to passing derivatives back without the gradient vanishing.

-----

Notes:

In backpropagation, Ct is the key to passing derivatives back without the gradient vanishing.

https://www.zhihu.com/question/44895610

https://medium.datadriveninvestor.com/how-do-lstm-networks-solve-the-problem-of-vanishing-gradients-a6784971a577

-----

Parameter count.

Formula: (word embedding size (input) + 1 (bias) + hidden size (input)) * hidden size (output) * 4 (gates).

Example 1: "((embedding_size + hidden_size) * hidden_size + hidden_size) * 4"

"((128 + 64) * 64 + 64) * 4 = 49408."

https://zhuanlan.zhihu.com/p/147496732

Example 2: "params = 4 * ((size_of_input + 1) * size_of_output + size_of_output^2)"

"4 * (4097 * 256 + 256^2) = 4457472"

https://stackoverflow.com/questions/38080035/how-to-calculate-the-number-of-parameters-of-an-lstm-network

https://medium.com/deep-learning-with-keras/lstm-understanding-the-number-of-parameters-c4e087575756

https://datascience.stackexchange.com/questions/10615/number-of-parameters-in-an-lstm-model
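
The formula and both examples can be checked with a few lines. The function name is an illustrative assumption. The count covers a single LSTM layer with one bias vector per gate, as in the formula above.

```python
def lstm_param_count(input_size, hidden_size):
    """Parameters of one LSTM layer: 4 gates, each with weights and a bias."""
    per_gate = (input_size + hidden_size + 1) * hidden_size
    return 4 * per_gate

example_1 = lstm_param_count(128, 64)    # embedding size 128, hidden size 64
example_2 = lstm_param_count(4096, 256)  # input size 4096, output size 256
```

Both quoted expressions are rearrangements of the same product, so the function gives 49408 for Example 1 and 4457472 for Example 2.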

-----

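LSTM input

Taking Keras as the example, the input to an LSTM network is a three-dimensional array with a shape like (batch_size, time_steps, seq_len). The first dimension is the batch size, the second is the number of time steps in the input sequence, and the third is the number of units per time step. The Keras documentation names these three dimensions batch, timesteps, and feature.

batch_size: for example, the number of sentences processed in one batch.

time_steps: for example, the length of the longest sentence (shorter sentences are zero-padded).

seq_len: for example, the word embedding dimension of each word in the sentence.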

https://www.kaggle.com/shivajbd/input-and-output-shape-in-lstm-keras

https://www.zhihu.com/question/41949741

https://ithelp.ithome.com.tw/articles/10214405?sc=iThelpR

https://keras-cn.readthedocs.io/en/latest/layers/recurrent_layer/
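
The three dimensions can be illustrated without any framework by zero-padding a toy batch of sentences. This is a plain Python sketch, not Keras code. The helper names and the two-dimensional toy "embeddings" are assumptions made for illustration.

```python
def pad_batch(sentences, time_steps, features):
    """Zero-pad (or truncate) each sentence to exactly time_steps word vectors."""
    batch = []
    for sent in sentences:
        rows = [list(vec) for vec in sent[:time_steps]]
        rows += [[0.0] * features for _ in range(time_steps - len(rows))]
        batch.append(rows)
    return batch

def shape(batch):
    """Return (batch_size, time_steps, features) of a nested-list batch."""
    return (len(batch), len(batch[0]), len(batch[0][0]))

sentences = [
    [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]],  # a sentence of 3 words
    [[0.7, 0.8]],                          # a sentence of 1 word
]
batch = pad_batch(sentences, time_steps=3, features=2)
batch_shape = shape(batch)
```

Here the batch holds 2 sentences, each padded to 3 time steps, with 2 values per word, so the shape is (2, 3, 2). An array of this shape is what the LSTM layer expects as input.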

-----

# History DL

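Notes:

One to one. The standard mode for classification without an RNN (for example, image classification), as shown in Fig. 35 (a).

Many to one. A sequence of inputs and a single output (for example, sentiment analysis, where the input is a set of sentences or words and the output is a positive or negative sentiment), as shown in Fig. 35 (b).

One to many. The system takes a single input and produces a sequence of outputs (for example, image captioning, where the input is a single image and the output is a set of words with context), as shown in Fig. 35 (c).

Many to many. Sequences of inputs and outputs (for example, machine translation, where the machine takes a sequence of English words and translates it into a sequence of French words), as shown in Fig. 35 (d).

Many to many. Sequence-to-sequence learning (for example, video classification, where video frames are the input and we want to label every frame), as shown in Fig. 35 (e).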

-----

# History DL

Notes:

One to one.

Many to one.

One to many.

Many to many.

-----

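# Word2vec 1.

Notes:

CBOW predicts the center word from the surrounding words, and Skip-gram predicts the surrounding words from the center word. To be continued in the next post.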

https://www.zhihu.com/question/45027109
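
The two prediction directions can be previewed by listing the training examples each one would draw from a sentence. This sketch covers only how the (input, target) pairs are formed, not the models themselves. The function names, the window size, and the toy sentence are illustrative assumptions.

```python
def context_windows(tokens, window):
    """Yield (surrounding words, center word) for every position."""
    for i, center in enumerate(tokens):
        left = tokens[max(0, i - window):i]
        right = tokens[i + 1:i + 1 + window]
        yield left + right, center

def cbow_examples(tokens, window):
    """CBOW: the surrounding words are the input, the center word the target."""
    return [(context, center) for context, center in context_windows(tokens, window)]

def skipgram_examples(tokens, window):
    """Skip-gram: the center word is the input, each surrounding word a target."""
    return [(center, word)
            for context, center in context_windows(tokens, window)
            for word in context]

tokens = ["the", "star", "also", "rises"]
cbow = cbow_examples(tokens, window=1)
skipgram = skipgram_examples(tokens, window=1)
```

With a window of 1, CBOW produces one example per word, such as (["the", "also"], "star"), while Skip-gram splits the same window into separate pairs such as ("star", "the") and ("star", "also").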

-----

References

# RNN

Elman, Jeffrey L. "Finding structure in time." Cognitive science 14.2 (1990): 179-211.

https://cogsci.ucsd.edu/~rik/courses/readings/elman90-fsit.pdf

# LSTM

Hochreiter, Sepp, and Jürgen Schmidhuber. "Long short-term memory." Neural computation 9.8 (1997): 1735-1780.

http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.676.4320&rep=rep1&type=pdf

# LSTM odyssey

Greff, Klaus, et al. "LSTM: A search space odyssey." IEEE transactions on neural networks and learning systems 28.10 (2016): 2222-2232.

https://ieeexplore.ieee.org/stamp/stamp.jsp?arnumber=7508408&casa_token=vRwVnt8iHsUAAAAA:G6OKQJ2x7VhXziDNSxLPohd9iEwyulS9cLZ73sk332XXWWizD_SNSZH9u6k68kmBWdMCiTtX&tag=1

# History DL

Alom, Md Zahangir, et al. "The history began from alexnet: A comprehensive survey on deep learning approaches." arXiv preprint arXiv:1803.01164 (2018).

https://arxiv.org/ftp/arxiv/papers/1803/1803.01164.pdf

# Word2vec 1. Cited 18,991 times.

Mikolov, Tomas, et al. "Efficient estimation of word representations in vector space." arXiv preprint arXiv:1301.3781 (2013).

https://arxiv.org/pdf/1301.3781.pdf

-----

Saturday, July 31, 2021
