Why random sample from replay for DQN?
I'm trying to gain an intuitive understanding of deep reinforcement learning. In deep Q-networks (DQN) we store all states/actions/rewards in a memory array and, at the end of the episode, "replay" them through our neural network. This makes sense because we are trying to build out our rewards matrix: we check whether our episode ended in a reward and scale that reward back through the matrix.
I would think the sequence of actions that led to the reward state is what is important to capture - this sequence of actions (and not the actions independently) is what led us to our reward state.
In the Atari DQN paper by Mnih et al., and in many tutorials since, we see the practice of randomly sampling from the memory array for training. So if we have a memory of:
(action a, state 1) --> (action b, state 2) --> (action c, state 3) --> (action d, state 4) --> reward!
We may train a mini-batch of:
[(action c state 3), (action b, state 2), reward!]
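In code, that random-sampling practice looks roughly like this (a minimal sketch; the class and method names are my own, not from the paper):

```python
# Minimal sketch of a replay memory with uniform random sampling.
import random
from collections import deque

class ReplayMemory:
    def __init__(self, capacity):
        # Old transitions are discarded once capacity is exceeded.
        self.buffer = deque(maxlen=capacity)

    def push(self, state, action, reward, next_state, done):
        # Store individual transition tuples, not whole episodes.
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        # Uniform random sampling: the batch mixes transitions from different
        # episodes and time steps, breaking temporal correlation.
        return random.sample(self.buffer, batch_size)
```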
The reason given is:
Second, learning directly from consecutive samples is inefficient, due to the strong correlations between the samples; randomizing the samples breaks these correlations and therefore reduces the variance of the updates.
or from this PyTorch tutorial:
By sampling from it randomly, the transitions that build up a batch are decorrelated. It has been shown that this greatly stabilizes and improves the DQN training procedure.
My intuition would tell me the sequence is what is most important in reinforcement learning. Most episodes have a delayed reward, so most action/state pairs receive no reward (and are not "reinforced"). The only way to bring a portion of the reward back to these earlier states is to propagate it retroactively along the sequence, through the future-reward term in the Q-learning update (reward + discount_factor * max_future_q).
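For concreteness, here is a minimal tabular sketch of that update; Q is assumed to be a mapping from state to a list of action values, and alpha/gamma are illustrative hyperparameters:

```python
# Standard tabular Q-learning update (sketch).
def q_update(Q, state, action, reward, next_state, alpha=0.1, gamma=0.99):
    # TD target: immediate reward plus the discounted best Q-value of the
    # next state. This bootstrap term is what carries a delayed reward
    # backwards, one transition at a time.
    td_target = reward + gamma * max(Q[next_state])
    Q[state][action] += alpha * (td_target - Q[state][action])
```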
Random sampling from the memory bank breaks our sequence, so how does that help when you are trying to back-fill a Q (reward) matrix?
Perhaps this is more similar to a Markov model where every state should be considered independent? Where is the error in my intuition?
Thank you!
"The sequence is what is most important in reinforcement learning." No: by the Markovian property, given the current state you can "ignore" all the past states and still be able to learn.
One thing that you are missing is that the stored tuples are not just (state, action), but (state, action, reward, next state). For instance, in DQN, when you update the Q-network you compute the TD error, and in doing so you consider the Q-value of the next state. This allows you to still "backpropagate" the delayed reward through the Q-values, even if the samples are drawn at random.
(Problems still arise if the reward is very delayed, because of its sparsity, but in theory you can learn anyway.)
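As a rough sketch of what that update looks like on a randomly sampled batch (the names q_net/target_net, the done mask, and the tensor shapes are my assumptions, not part of the answer):

```python
# DQN update on a randomly sampled batch of (state, action, reward,
# next_state, done) transitions, stacked into tensors.
import torch
import torch.nn.functional as F

def dqn_update(q_net, target_net, optimizer, batch, gamma=0.99):
    states, actions, rewards, next_states, dones = batch

    # Q-values of the actions actually taken.
    q_values = q_net(states).gather(1, actions.unsqueeze(1)).squeeze(1)

    # TD target bootstraps from the next state's Q-values, so a delayed
    # reward still propagates backwards through the value estimates even
    # though the batch is sampled at random from the replay memory.
    with torch.no_grad():
        next_q = target_net(next_states).max(dim=1).values
        targets = rewards + gamma * next_q * (1.0 - dones)

    loss = F.smooth_l1_loss(q_values, targets)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

Note that nothing in the target computation depends on where in an episode a transition came from; each tuple carries its own next state, which is why the ordering of the batch does not matter.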