make LSTM cell trainable
I'm using the tf.contrib.rnn.MultiRNNCell class to build a multi-layered RNN. I use the following lines to define a 3-layer RNN-LSTM network:
n_hidden = 2
num_layers = 3
lstm_cell = tf.contrib.rnn.BasicLSTMCell(n_hidden)
stacked_lstm_cell = tf.contrib.rnn.MultiRNNCell([lstm_cell] * num_layers)
However, there is some uncertainty in my mind about what is actually happening in TensorFlow. As far as I understand, this code gives me a computational graph in which there are 3 layers of LSTM cells, and each layer has 2 LSTM cells. I have the following doubts:
One note on syntax: as of TF ~1.0, you need to define multiple layers in a loop rather than using the [cell] * num_layers syntax, so something like:
lstm_cells = []
for _ in range(num_layers):
    cell = tf.contrib.rnn.BasicLSTMCell(n_hidden)
    lstm_cells.append(cell)
stacked_lstm_cell = tf.contrib.rnn.MultiRNNCell(lstm_cells)
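If it helps, here is a minimal sketch of how that stacked cell is typically unrolled over a batch of sequences with tf.nn.dynamic_rnn (TF 1.x; the feature size, sequence length, and placeholder shapes are illustrative assumptions, not from your question):

import tensorflow as tf

n_hidden = 2
num_layers = 3
n_features = 5   # assumed input feature size, for illustration
n_steps = 10     # assumed sequence length, for illustration

# a batch of sequences: [batch_size, n_steps, n_features]
inputs = tf.placeholder(tf.float32, [None, n_steps, n_features])

# build a fresh cell object per layer (same loop as above, as a comprehension)
lstm_cells = [tf.contrib.rnn.BasicLSTMCell(n_hidden) for _ in range(num_layers)]
stacked_lstm_cell = tf.contrib.rnn.MultiRNNCell(lstm_cells)

# outputs: [batch_size, n_steps, n_hidden], the top layer's h at every step
# final_state: one LSTMStateTuple(c, h) per layer
outputs, final_state = tf.nn.dynamic_rnn(stacked_lstm_cell, inputs, dtype=tf.float32)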
To your main questions: your code gives you a stack of 3 LSTM layers (num_layers), where each layer contains an LSTM with a hidden state of length 2 (n_hidden). More on this in a bit.

A look at the LSTM
Let's take a look at the LSTM network architecture; there are some great overviews of LSTMs online that I recommend reading. Basically, a single LSTM cell maintains a hidden state that represents its "memory" of what it has seen so far, and at each update step it uses "gates" to decide how much new information to blend with the existing information in that hidden state. It also uses a gate to determine what it will output. Walking through the update process for a single cell:
We first determine how much old information to forget (our forget gate):
f_k = sigmoid(W_f * [h_k-1, x_k] + b_f)
Here we are operating on the network's previous history h_k-1 concatenated with the current observations x_k. The size of your history vector h is defined by n_hidden. The weights W_f and biases b_f will be learned through the training procedure.
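As a toy illustration (plain NumPy, not TF code), here is what that forget-gate computation looks like; n_hidden = 2 as in your code, while the input size of 3 and the random/zero values are just assumptions for the example:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

n_hidden, n_input = 2, 3
h_prev = np.zeros(n_hidden)                          # h_k-1: previous hidden state
x_k = np.random.randn(n_input)                       # current observation
W_f = np.random.randn(n_hidden, n_hidden + n_input)  # learned during training
b_f = np.zeros(n_hidden)                             # learned during training

f_k = sigmoid(W_f @ np.concatenate([h_prev, x_k]) + b_f)  # each entry lies in (0, 1)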
We then determine how much new information to incorporate (our input gate, i_k), and create some new candidate cell states (c'_k):
i_k = sigmoid(W_i * [h_k-1, x_k] + b_i)
c'_k = tanh(W_c * [h_k-1, x_k] + b_c)
Again, we are operating on our old internal state h_k-1 and our new observations x_k to figure out what to do next. The size of the cell state c and the candidate cell state c'_k is also determined by n_hidden. The W_* and b_* are more parameters that we will be learning.
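Continuing the toy NumPy sketch from above (same assumed sizes), the input gate and candidate state are computed the same way, just with their own weights:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

n_hidden, n_input = 2, 3
h_prev, x_k = np.zeros(n_hidden), np.random.randn(n_input)
concat = np.concatenate([h_prev, x_k])

W_i = np.random.randn(n_hidden, n_hidden + n_input); b_i = np.zeros(n_hidden)
W_c = np.random.randn(n_hidden, n_hidden + n_input); b_c = np.zeros(n_hidden)

i_k = sigmoid(W_i @ concat + b_i)     # how much new information to let in
c_cand = np.tanh(W_c @ concat + b_c)  # candidate cell state c'_k, entries in (-1, 1)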
Next, we combine the old information with the new candidate states to come up with a new cell state:
c_k = f_k * c_k-1 + i_k * c'_k
Here the multiplications are element-wise rather than dot products. Basically, we choose how much of our old information to keep (f_k * c_k-1) and how much new information to incorporate (i_k * c'_k).
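A quick numeric example of that element-wise blend, with made-up values just to show the arithmetic:

import numpy as np

c_prev = np.array([0.5, -1.0])      # c_k-1, the old cell state
f_k    = np.array([0.9,  0.1])      # forget gate output
i_k    = np.array([0.2,  0.8])      # input gate output
c_cand = np.array([1.0,  0.5])      # candidate state c'_k

c_k = f_k * c_prev + i_k * c_cand   # element-wise, not a dot product
# c_k is now array([0.65, 0.3])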
Finally, we determine how much of our cell state we want to output with an output gate:
o_k = sigmoid(W_o * [h_k-1, x_k] + b_o)
h_k = o_k * tanh(c_k)
So basically we are blending old and new information into an internal "cell state" c_k, and then outputting some amount of that information as h_k. I recommend also looking into the gated recurrent unit (GRU) network, which performs similarly to the LSTM but has a slightly simpler structure that is easier to understand.
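Putting the pieces together, one full update step of a single cell looks like this in plain NumPy (shapes follow the equations above; the random/zero parameter values are purely illustrative):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_k, h_prev, c_prev, params):
    W_f, b_f, W_i, b_i, W_c, b_c, W_o, b_o = params
    z = np.concatenate([h_prev, x_k])
    f_k = sigmoid(W_f @ z + b_f)        # forget gate
    i_k = sigmoid(W_i @ z + b_i)        # input gate
    c_cand = np.tanh(W_c @ z + b_c)     # candidate cell state
    c_k = f_k * c_prev + i_k * c_cand   # new cell state
    o_k = sigmoid(W_o @ z + b_o)        # output gate
    h_k = o_k * np.tanh(c_k)            # new hidden state / output
    return h_k, c_k

n_hidden, n_input = 2, 3
params = []
for _ in range(4):                      # one (W, b) pair per gate/candidate
    params += [np.random.randn(n_hidden, n_hidden + n_input), np.zeros(n_hidden)]

h, c = np.zeros(n_hidden), np.zeros(n_hidden)
for x in np.random.randn(5, n_input):   # a toy sequence of 5 observations
    h, c = lstm_step(x, h, c, params)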
Now on to how the multi-layer network stacks up. Basically, you have something that looks like this:
x_k ---> (network 0) --h0_k--> (network_1) --h1_k--> (network_2) --h2_k-->
So your observations come into the first network, and that network's output is fed as input to the next network, which blends it with its own internal state to produce an output, which then becomes the input to the third network, and so on until the end. This is supposed to help with learning temporal structure in the data. I do not have a good citation for that.
Typically if you are doing classification (for instance), you would throw a final fully-connected layer on your last network's output to get some measure of confidence that your observed process lies within each category over which you are classifying.
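For example, a hedged TF 1.x sketch of that final layer might take the top layer's output at the last time step and project it to class scores (n_classes and the other sizes here are illustrative assumptions):

import tensorflow as tf

n_hidden, num_layers, n_steps, n_features, n_classes = 2, 3, 10, 5, 4

inputs = tf.placeholder(tf.float32, [None, n_steps, n_features])
cells = [tf.contrib.rnn.BasicLSTMCell(n_hidden) for _ in range(num_layers)]
stacked = tf.contrib.rnn.MultiRNNCell(cells)

outputs, _ = tf.nn.dynamic_rnn(stacked, inputs, dtype=tf.float32)
last_output = outputs[:, -1, :]                   # [batch_size, n_hidden]
logits = tf.layers.dense(last_output, n_classes)  # fully-connected layer
probs = tf.nn.softmax(logits)                     # per-class confidence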
Trainable variables
You can print out all of the trainable variables that your network is going to learn using something like:
for var in tf.trainable_variables():
    print('{}\nShape: {}'.format(var.name, var.get_shape()))
TensorFlow does some fancy stuff with combining different operations, so you may see some odd shapes and apparently missing weight matrices and biases, but it's all there. Basically you are learning the weights and biases used in each gate. In the above, that would be:
W_f, W_i, W_c, and W_o for each layer
b_f, b_i, b_c, and b_o for each layer

I am more familiar with how TF handles the GRU architecture, where it basically combines all of the gates into a single big matrix operation, so you have one combined weight matrix and one combined bias vector for all of the gates. It then splits the result into the individual gates to apply them at the right place. Just an FYI in case it looks like you do not have weights and biases for each individual step of each cell.
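If it helps to picture that combined-matrix trick, here is a hedged NumPy sketch: one big matmul, then a split back into the individual gates. The gate ordering here is arbitrary for illustration; the ordering TF actually uses internally is an implementation detail I am not asserting.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

n_hidden, n_input = 2, 3
W_all = np.random.randn(4 * n_hidden, n_hidden + n_input)  # one combined kernel
b_all = np.zeros(4 * n_hidden)                             # one combined bias

h_prev, x_k = np.zeros(n_hidden), np.random.randn(n_input)
z = W_all @ np.concatenate([h_prev, x_k]) + b_all

f_pre, i_pre, c_pre, o_pre = np.split(z, 4)  # slice back into the four gates
f_k, i_k, o_k = sigmoid(f_pre), sigmoid(i_pre), sigmoid(o_pre)
c_cand = np.tanh(c_pre)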