make LSTM cell trainable

I'm using the tf.contrib.rnn.MultiRNNCell module to build a multi-layered RNN. I use the following lines to define a 3-layer RNN-LSTM network:

n_hidden = 2
num_layers = 3        
lstm_cell = tf.contrib.rnn.BasicLSTMCell(n_hidden)
stacked_lstm_cell = tf.contrib.rnn.MultiRNNCell([lstm_cell] * num_layers)

However, I am uncertain about what is actually happening in TensorFlow. As far as I can understand, this code gives me a computational graph in which there are 3 layers of LSTM cells, and each layer has 2 LSTM cells. I have the following doubts:

  • Are the weights between these 3 LSTM layers treated as variables?
  • If these weights are treated as variables, are they modified during the training session?
  • The LSTM cells have operations such as the forget gate, etc. Are these ops also treated as variables, and hence tuned during training?

  • One note on syntax: as of TF ~1.0, you need to define multiple layers in a loop rather than using the [cell] * num_layers syntax, so something like:

    lstm_cells = []
    for _ in range(num_layers):
        cell = tf.contrib.rnn.BasicLSTMCell(n_hidden)
        lstm_cells.append(cell)
    stacked_lstm_cell = tf.contrib.rnn.MultiRNNCell(lstm_cells)
    

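A quick way to see why the [cell] * num_layers form is problematic: list multiplication repeats the same Python object, so every layer would point at one cell (and hence one set of weights), whereas the loop builds independent cells. A minimal illustration with a plain Python class standing in for the cell (FakeCell is hypothetical, just to show object identity):

```python
class FakeCell:
    """Stand-in for an RNN cell (hypothetical); only used to show object identity."""
    pass

# [obj] * n copies the reference, not the object:
shared = [FakeCell()] * 3
print(shared[0] is shared[2])      # True - all three entries are the same cell

# Building the list in a loop creates a distinct cell per layer:
separate = [FakeCell() for _ in range(3)]
print(separate[0] is separate[2])  # False - each layer gets its own cell
```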
    To your main questions:

  • Your code is giving you a network with 3-layers ( num_layers ), where each layer contains an LSTM with a hidden state of length 2 ( n_hidden ). More on this in a bit.
  • There are no weights between the three LSTM layers: each LSTM feeds its output to the input of the next LSTM.
  • All of the weights and biases in your network will be treated as trainable variables and trained through back propagation unless you tell TF not to train certain things.
  • The operations like forget and update in the LSTM execute some function on a linear combination of the inputs to the network and the network's previous hidden state. The "linear combination" part of this involves weights and biases that are trained by your network.
  • A look at LSTM


    Let's take a look at the LSTM network architecture. This is a pretty great overview that I recommend reading. Basically a single LSTM cell maintains a hidden state that represents its "memory" of what it has seen so far, and at each update step, it decides how much new information it is going to blend with the existing information in this hidden state using "gates". It also uses a gate to determine what it will output. Taking a look at the update process for a single cell:

  • We first determine how much old information to forget (our forget gate): f_k = sigmoid(W_f * [h_k-1, x_k] + b_f)

    Here we are operating on the network's previous history h_k-1 concatenated with current observations x_k . The size of your history vector h is defined by n_hidden . The weights W_f and biases b_f will be learned through the training procedure.

  • We determine how much new information to incorporate (our input gate, i_k ), and create some new candidate cell states ( c'_k ):

    i_k = sigmoid(W_i * [h_k-1, x_k] + b_i)
    c'_k = tanh(W_c * [h_k-1, x_k] + b_c)
    

    Again, we are operating on our old internal state h_k-1 , and our new observations x_k to figure out what to do next. The size of the cell state c and candidate cell state c' is also determined by n_hidden . The W_* and b_* are more parameters that we will be learning.

  • Combine old information with new candidate states to come up with a new cell state: c_k = f_k * c_k-1 + i_k * c'_k

    Here we are doing element-wise multiplication instead of dot products or whatever else. Basically we choose how much of our old information to keep ( f_k * c_k-1 ), and how much new information to incorporate ( i_k * c'_k ).

  • Finally, we determine how much of our cell state we want to output with an output gate:

    o_k = sigmoid(W_o * [h_k-1, x_k] + b_o)
    h_k = o_k * tanh(c_k)
    
  • So basically we are blending old and new information into an internal "cell state" c_k , and then outputting some amount of that information in h_k . I recommend also looking into the gated recurrent unit (GRU) network, which performs similarly to LSTM but has a slightly easier structure to understand.
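The four update equations above can be sketched directly in NumPy. This is an illustrative toy with random weights, not TensorFlow's actual implementation; the input dimension (4) is an assumption, and n_hidden = 2 matches the question:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_k, h_prev, c_prev, W, b):
    """One LSTM update. W and b are dicts keyed by gate name ('f', 'i', 'c', 'o');
    each W[g] maps the concatenation [h_prev, x_k] to a vector of size n_hidden."""
    hx = np.concatenate([h_prev, x_k])       # [h_{k-1}, x_k]
    f_k = sigmoid(W['f'] @ hx + b['f'])      # forget gate
    i_k = sigmoid(W['i'] @ hx + b['i'])      # input gate
    c_cand = np.tanh(W['c'] @ hx + b['c'])   # candidate cell state c'_k
    c_k = f_k * c_prev + i_k * c_cand        # new cell state (element-wise blend)
    o_k = sigmoid(W['o'] @ hx + b['o'])      # output gate
    h_k = o_k * np.tanh(c_k)                 # new hidden state
    return h_k, c_k

# Toy sizes: n_hidden = 2 as in the question, input dimension 4 (assumed)
rng = np.random.default_rng(0)
n_hidden, n_in = 2, 4
W = {g: rng.normal(size=(n_hidden, n_hidden + n_in)) for g in 'fico'}
b = {g: np.zeros(n_hidden) for g in 'fico'}
h, c = np.zeros(n_hidden), np.zeros(n_hidden)
h, c = lstm_step(rng.normal(size=n_in), h, c, W, b)
print(h.shape, c.shape)   # (2,) (2,)
```

Note that the W and b dicts here are exactly the trainable parameters the answer describes; in a real network they would be updated by backpropagation rather than drawn at random.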

    Now on to how the multi-layer network stacks up. Basically, you have something that looks like this:

    x_k ---> (network 0) --h0_k--> (network_1) --h1_k--> (network_2) --h2_k-->
    

    So your observations come into the first network, and then that network's output is fed as input to the next network, which blends it with its own internal state to produce an output, which then becomes the input to the third network, and so on until the end. This is supposed to help with learning temporal structure in the data. I do not have a good citation for that.

    Typically if you are doing classification (for instance), you would throw a final fully-connected layer on your last network's output to get some measure of confidence that your observed process lies within each category over which you are classifying.
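As a sketch of that final classification layer (sizes here are assumptions: n_hidden = 2 and 5 classes), a linear map of the last layer's hidden state pushed through a softmax gives the per-class confidences:

```python
import numpy as np

def softmax(z):
    z = z - z.max()            # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()

n_hidden, n_classes = 2, 5     # assumed sizes
rng = np.random.default_rng(1)
W_out = rng.normal(size=(n_classes, n_hidden))  # learned output weights
b_out = np.zeros(n_classes)                     # learned output biases

h2_k = rng.normal(size=n_hidden)    # stand-in for the last LSTM layer's output
probs = softmax(W_out @ h2_k + b_out)
print(probs.shape)                  # (5,) - one confidence per class, summing to 1
```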

    Trainable variables


    You can print out all of the trainable variables that your network is going to learn using something like:

    for var in tf.trainable_variables():
    print('{}\nShape: {}'.format(var.name, var.get_shape()))
    

    Tensorflow does some fancy stuff with combining different operations, so you may see some odd shapes and apparently missing weight matrices and biases, but it's all there. Basically you are learning the weights and biases used in each gate. In the above, that would be:

  • weights: W_f , W_i , W_c , and W_o for each layer
  • biases: b_f , b_i , b_c , and b_o for each layer
  • and additional output layer weights/biases that you add on top of the last LSTM layer
  • I am more familiar with how TF handles the GRU architecture, where it basically combines all of the gates into a single big matrix operation, so you have one combined weight matrix and one combined bias vector for all gates. It then splits the result into each individual gate to apply them at the right place. Just an FYI in case it looks like you do not have weights and biases for each individual step of each cell.
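That "one big matrix" trick can be sketched like this: compute all gate pre-activations with a single matrix multiply, then split the result back into the individual gates (the sizes and gate ordering here are assumptions; TF's actual ordering may differ):

```python
import numpy as np

n_hidden, n_in = 2, 4
rng = np.random.default_rng(2)

# One combined weight matrix and bias vector covering all four gates:
W_all = rng.normal(size=(4 * n_hidden, n_hidden + n_in))
b_all = np.zeros(4 * n_hidden)

hx = rng.normal(size=n_hidden + n_in)           # [h_{k-1}, x_k]
pre = W_all @ hx + b_all                        # all gate pre-activations at once
f_pre, i_pre, c_pre, o_pre = np.split(pre, 4)   # split back into the four gates
print(f_pre.shape)                              # (2,) - one slice per gate
```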
