LSTM units

Dimitri Fichou

2023-04-21

Feed forward pass

\[ f_t = sigmoid(h_{t-1} * W_f + x_t * U_f) \]

\[ i_t = sigmoid(h_{t-1} * W_i + x_t * U_i) \]

\[ g_t = tanh(h_{t-1} * W_g + x_t * U_g) \]

\[ o_t = sigmoid(h_{t-1} * W_o + x_t * U_o) \]

\[ c_t = c_{t-1} \cdot f_t + i_t \cdot g_t \]

\[ h_t = y_t = tanh(c_t) \cdot o_t \]
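The equations above can be sketched for a single time step in NumPy. This is only an illustration under stated assumptions, not the package's implementation: the function name `lstm_forward`, the dictionary layout of the weights, the row-vector shapes and the random values are all hypothetical.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_forward(x_t, h_prev, c_prev, W, U):
    """One LSTM step. W and U are dicts keyed by gate name: 'f', 'i', 'g', 'o'."""
    f_t = sigmoid(h_prev @ W["f"] + x_t @ U["f"])  # forget gate
    i_t = sigmoid(h_prev @ W["i"] + x_t @ U["i"])  # input gate
    g_t = np.tanh(h_prev @ W["g"] + x_t @ U["g"])  # candidate values
    o_t = sigmoid(h_prev @ W["o"] + x_t @ U["o"])  # output gate
    c_t = c_prev * f_t + i_t * g_t                 # new cell state (point-wise)
    h_t = np.tanh(c_t) * o_t                       # new hidden state, also the output y_t
    return h_t, c_t, (f_t, i_t, g_t, o_t)

# Hypothetical sizes and random weights, for illustration only.
rng = np.random.default_rng(0)
n_in, n_hid = 3, 4
W = {k: rng.normal(size=(n_hid, n_hid)) for k in "figo"}  # hidden-to-hidden
U = {k: rng.normal(size=(n_in, n_hid)) for k in "figo"}   # input-to-hidden
x_t = rng.normal(size=(1, n_in))
h_prev = np.zeros((1, n_hid))
c_prev = np.zeros((1, n_hid))

h_t, c_t, gates = lstm_forward(x_t, h_prev, c_prev, W, U)
```

The gate activations are returned alongside the states because the back propagation pass needs them again.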

Back propagation pass

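To perform BPTT with an LSTM unit, we have the error coming from the top layer (\(\delta 1\)), from the future cell state (\(\delta 4\)) and from the future hidden state (\(\delta 2\)). We have also stored the states of each time step during the feed forward pass. The errors coming from the future (\(\delta 2\) and \(\delta 4\)) are simply set to zero when they have not been calculated yet, for example at the last time step. In the equations below, a bare number refers to the forward value at the node with that number, and \(f'(n)\) is the derivative of the activation at node \(n\). By convention, \(\cdot\) denotes point-wise multiplication, while \(*\) denotes matrix multiplication.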

The rules on how to back-propagate come from this post.

\[\delta 3 = \delta 1 + \delta 2 \]

\[\delta 5 = \delta 3 \cdot 6 = \delta 3 \cdot o_t \]

\[\delta 6 = \delta 3 \cdot 5 = \delta 3 \cdot tanh(c_t) \]

\[\delta 7 = \delta 5 \cdot f'(5) = \delta 5 \cdot tanh'(tanh(c_t)) \]

\[\delta 8 = \delta 7 + \delta 4 \]

\[\delta 9 = \delta 8 \cdot 10 = \delta 8 \cdot i_t \]

\[\delta 10 = \delta 8 \cdot 9 = \delta 8 \cdot g_t \]

\[\delta 11 = \delta 8 \cdot 12 = \delta 8 \cdot f_t \]

\[\delta 12 = \delta 8 \cdot 11 = \delta 8 \cdot c_{t-1} \]

\[\delta 13 = \delta 6 \cdot f'(6) = \delta 6 \cdot sigmoid'(o_t) \] \[\delta 14 = \delta 9 \cdot f'(9) = \delta 9 \cdot tanh'(g_t) \] \[\delta 15 = \delta 10 \cdot f'(10) = \delta 10 \cdot sigmoid'(i_t) \] \[\delta 16 = \delta 12 \cdot f'(12) = \delta 12 \cdot sigmoid'(f_t) \]
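The derivative functions in these formulas take the stored activation as their argument, not the pre-activation, which seems to be the convention used throughout: for \(y = sigmoid(z)\) the derivative is \(y \cdot (1 - y)\), and for \(y = tanh(z)\) it is \(1 - y^2\). A minimal sketch, with hypothetical helper names, that checks both identities against central finite differences:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_prime_from_output(y):
    # Derivative of sigmoid, written in terms of its output y = sigmoid(z).
    return y * (1.0 - y)

def tanh_prime_from_output(y):
    # Derivative of tanh, written in terms of its output y = tanh(z).
    return 1.0 - y ** 2

z = np.linspace(-2.0, 2.0, 9)
eps = 1e-6

# Central finite differences taken on the pre-activation z.
num_sig = (sigmoid(z + eps) - sigmoid(z - eps)) / (2 * eps)
num_tanh = (np.tanh(z + eps) - np.tanh(z - eps)) / (2 * eps)

sig_ok = np.allclose(sigmoid_prime_from_output(sigmoid(z)), num_sig, atol=1e-8)
tanh_ok = np.allclose(tanh_prime_from_output(np.tanh(z)), num_tanh, atol=1e-8)
```

Working from the outputs avoids storing the pre-activations during the feed forward pass.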

\[\delta 17 = \delta 13 * U_o^T \] \[\delta 19 = \delta 14 * U_g^T \] \[\delta 21 = \delta 15 * U_i^T \] \[\delta 23 = \delta 16 * U_f^T \] \[\delta 18 = \delta 13 * W_o^T \] \[\delta 20 = \delta 14 * W_g^T \] \[\delta 22 = \delta 15 * W_i^T \] \[\delta 24 = \delta 16 * W_f^T \]

\[\delta 25 = \delta 18 + \delta 20 + \delta 22 + \delta 24 \] \[\delta 26 = \delta 17 + \delta 19 + \delta 21 + \delta 23 \]
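The chain of errors for one time step can be sketched in NumPy and compared with finite differences. This is an illustration under stated assumptions, not the package's code: the function names, shapes and random values are hypothetical. The sketch takes \(\delta 8\) as the sum of \(\delta 7\) and \(\delta 4\), and multiplies each gate error by that gate's own \(W\) and \(U\), which is what the numerical check confirms.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, h_prev, c_prev, W, U):
    f = sigmoid(h_prev @ W["f"] + x @ U["f"])
    i = sigmoid(h_prev @ W["i"] + x @ U["i"])
    g = np.tanh(h_prev @ W["g"] + x @ U["g"])
    o = sigmoid(h_prev @ W["o"] + x @ U["o"])
    c = c_prev * f + i * g
    h = np.tanh(c) * o
    return h, c, (f, i, g, o)

def backward(d1, d2, d4, c_prev, c, gates, W, U):
    f, i, g, o = gates
    tc = np.tanh(c)
    d3 = d1 + d2               # error on h_t
    d5 = d3 * o                # error on tanh(c_t)
    d6 = d3 * tc               # error on o_t
    d7 = d5 * (1 - tc ** 2)    # back through tanh
    d8 = d7 + d4               # total error on c_t
    d9 = d8 * i                # error on g_t
    d10 = d8 * g               # error on i_t
    d11 = d8 * f               # error on c_{t-1}
    d12 = d8 * c_prev          # error on f_t
    # Errors on the gate pre-activations (derivatives from the outputs).
    pre = {"o": d6 * o * (1 - o),     # delta 13
           "g": d9 * (1 - g ** 2),    # delta 14
           "i": d10 * i * (1 - i),    # delta 15
           "f": d12 * f * (1 - f)}    # delta 16
    d25 = sum(pre[k] @ W[k].T for k in "figo")  # error on h_{t-1}
    d26 = sum(pre[k] @ U[k].T for k in "figo")  # error on x_t
    return d11, d25, d26, pre

def numeric_grad(fn, a, eps=1e-6):
    grad = np.zeros_like(a)
    for idx in np.ndindex(a.shape):
        up, down = a.copy(), a.copy()
        up[idx] += eps
        down[idx] -= eps
        grad[idx] = (fn(up) - fn(down)) / (2 * eps)
    return grad

# Hypothetical sizes and random values, for illustration only.
rng = np.random.default_rng(1)
n_in, n_hid = 3, 4
W = {k: rng.normal(scale=0.5, size=(n_hid, n_hid)) for k in "figo"}
U = {k: rng.normal(scale=0.5, size=(n_in, n_hid)) for k in "figo"}
x = rng.normal(size=(1, n_in))
h_prev = rng.normal(size=(1, n_hid))
c_prev = rng.normal(size=(1, n_hid))
d1 = rng.normal(size=(1, n_hid))  # error from the top layer
d2 = np.zeros((1, n_hid))         # future hidden state error, not calculated yet
d4 = rng.normal(size=(1, n_hid))  # error from the future cell state

def loss(x, h_prev, c_prev):
    # Scalar whose gradients with respect to h_t and c_t are d1 + d2 and d4.
    h, c, _ = forward(x, h_prev, c_prev, W, U)
    return np.sum(h * (d1 + d2)) + np.sum(c * d4)

h, c, gates = forward(x, h_prev, c_prev, W, U)
d11, d25, d26, pre = backward(d1, d2, d4, c_prev, c, gates, W, U)

ok_x = np.allclose(d26, numeric_grad(lambda a: loss(a, h_prev, c_prev), x), atol=1e-6)
ok_h = np.allclose(d25, numeric_grad(lambda a: loss(x, a, c_prev), h_prev), atol=1e-6)
ok_c = np.allclose(d11, numeric_grad(lambda a: loss(x, h_prev, a), c_prev), atol=1e-6)
```

The surrogate `loss` exists only to make the check possible; in real training the incoming errors come from the layer above and from the following time step.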

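The errors \(\delta 11\), \(\delta 25\) and \(\delta 26\) are passed on: \(\delta 11\) goes to the cell state of the previous time step (where it plays the role of \(\delta 4\)), \(\delta 25\) goes to the hidden state of the previous time step (where it plays the role of \(\delta 2\)), and \(\delta 26\) is the error on the input \(x_t\), sent to the layer below. Once all those errors are available, it is possible to calculate the weight update.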

\[\delta W_f = \delta W_f + h_{t-1}^T * \delta 16 \] \[\delta U_f = \delta U_f + x_{t}^T * \delta 16 \]

\[\delta W_i = \delta W_i + h_{t-1}^T * \delta 15 \] \[\delta U_i = \delta U_i + x_{t}^T * \delta 15 \]

\[\delta W_g = \delta W_g + h_{t-1}^T * \delta 14 \] \[\delta U_g = \delta U_g + x_{t}^T * \delta 14 \]

\[\delta W_o = \delta W_o + h_{t-1}^T * \delta 13 \] \[\delta U_o = \delta U_o + x_{t}^T * \delta 13 \]
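Each update above adds one outer product per time step to a running total. A minimal sketch of that accumulation for the forget gate, assuming row-vector states and using random stand-ins for the stored states and for \(\delta 16\) (the function name and all values are hypothetical):

```python
import numpy as np

def accumulate(h_prevs, xs, gate_errors, n_in, n_hid):
    """Sum h_{t-1}^T * delta and x_t^T * delta over all stored time steps."""
    dW = np.zeros((n_hid, n_hid))
    dU = np.zeros((n_in, n_hid))
    for h_prev, x, d in zip(h_prevs, xs, gate_errors):
        dW += h_prev.T @ d  # (n_hid, 1) @ (1, n_hid)
        dU += x.T @ d       # (n_in, 1) @ (1, n_hid)
    return dW, dU

# Random stand-ins for the stored states and the forget gate errors.
rng = np.random.default_rng(2)
T, n_in, n_hid = 5, 3, 4
h_prevs = [rng.normal(size=(1, n_hid)) for _ in range(T)]
xs = [rng.normal(size=(1, n_in)) for _ in range(T)]
d16s = [rng.normal(size=(1, n_hid)) for _ in range(T)]

dW_f, dU_f = accumulate(h_prevs, xs, d16s, n_in, n_hid)

# The running sum equals one matrix product over the stacked time steps.
same_as_stacked = np.allclose(dW_f, np.vstack(h_prevs).T @ np.vstack(d16s))
```

The same loop applies to the other three gates with \(\delta 15\), \(\delta 14\) and \(\delta 13\) in place of \(\delta 16\).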