TensorFlow NaN bug?

I'm using TensorFlow and I modified the tutorial example to take my RGB images.

The algorithm works flawlessly out of the box on the new image set, until suddenly (while still converging, usually at around 92% accuracy) it crashes with the error that ReluGrad received non-finite values. Debugging shows that nothing unusual happens with the numbers until, very suddenly and for no apparent reason, the error is thrown. Adding

print "max W vales: %g %g %g %g"%(tf.reduce_max(tf.abs(W_conv1)).eval(),tf.reduce_max(tf.abs(W_conv2)).eval(),tf.reduce_max(tf.abs(W_fc1)).eval(),tf.reduce_max(tf.abs(W_fc2)).eval())
print "max b vales: %g %g %g %g"%(tf.reduce_max(tf.abs(b_conv1)).eval(),tf.reduce_max(tf.abs(b_conv2)).eval(),tf.reduce_max(tf.abs(b_fc1)).eval(),tf.reduce_max(tf.abs(b_fc2)).eval())

as debug code inside the training loop yields the following output:

Step 8600
max W values: 0.759422 0.295087 0.344725 0.583884
max b values: 0.110509 0.111748 0.115327 0.124324
Step 8601
max W values: 0.75947 0.295084 0.344723 0.583893
max b values: 0.110516 0.111753 0.115322 0.124332
Step 8602
max W values: 0.759521 0.295101 0.34472 0.5839
max b values: 0.110521 0.111747 0.115312 0.124365
Step 8603
max W values: -3.40282e+38 -3.40282e+38 -3.40282e+38 -3.40282e+38
max b values: -3.40282e+38 -3.40282e+38 -3.40282e+38 -3.40282e+38

Since none of my values is very large, the only way a NaN could appear is from a badly handled 0/0, but since this tutorial code doesn't do any divisions or similar operations, I see no explanation other than that it comes from TensorFlow's internal code.

I'm clueless about what to do here. Any suggestions? The algorithm was converging nicely; its accuracy on my validation set was steadily climbing and had just reached 92.5% at iteration 8600.


Actually, it turned out to be something stupid. I'm posting the answer in case anyone else runs into a similar error.

cross_entropy = -tf.reduce_sum(y_*tf.log(y_conv))

is actually a horrible way of computing the cross-entropy. For some samples, certain classes get excluded with certainty after a while, resulting in y_conv=0 for that class. That's normally not a problem, since you're not interested in those classes, but with cross_entropy written this way it produces 0*log(0) for that particular sample/class, and 0*(-inf) evaluates to NaN in floating-point arithmetic.
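You can reproduce the failure mode in isolation; here is a minimal sketch using NumPy rather than the original graph, with made-up one-hot labels and a saturated softmax output purely for illustration:

import numpy as np

y_true = np.array([1.0, 0.0], dtype=np.float32)  # one-hot label: the true class is class 0
y_pred = np.array([1.0, 0.0], dtype=np.float32)  # softmax output saturated at exactly 0 for class 1

# log(0) = -inf, and 0 * (-inf) = nan, so the whole sum becomes nan
print(-np.sum(y_true * np.log(y_pred)))          # -> nan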

Replacing the original line with

cross_entropy = -tf.reduce_sum(y_*tf.log(tf.clip_by_value(y_conv,1e-10,1.0)))

solved all my problems.


Actually, clipping is not a good idea, as it stops the gradient from propagating backwards once the threshold is reached. Instead, we can add a small constant to the softmax output:

cross_entropy = -tf.reduce_sum(y_*tf.log(y_conv + 1e-10))
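To see why clipping hurts, you can compare the gradients of the two losses directly. Here is a minimal sketch; the logit values are made up for illustration, and it assumes the TF 1.x graph-mode API used elsewhere in this thread:

import tensorflow as tf

y_ = tf.constant([[0.0, 1.0]])           # one-hot label: the "impossible" class is the true one
logits = tf.constant([[12.0, -12.0]])    # hypothetical logits; softmax of class 1 is ~3.8e-11, below the clip bound
y_conv = tf.nn.softmax(logits)

loss_clip = -tf.reduce_sum(y_ * tf.log(tf.clip_by_value(y_conv, 1e-10, 1.0)))
loss_eps = -tf.reduce_sum(y_ * tf.log(y_conv + 1e-10))

grad_clip = tf.gradients(loss_clip, logits)[0]  # exactly zero: the active clip blocks the gradient
grad_eps = tf.gradients(loss_eps, logits)[0]    # non-zero, so learning can still push the logits

with tf.Session() as sess:
    print(sess.run([grad_clip, grad_eps]))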

If y_conv is the result of a softmax, say y_conv = tf.nn.softmax(x), then an even better solution is to replace it with log_softmax:

y = tf.nn.log_softmax(x)  # numerically stable log-probabilities
cross_entropy = -tf.reduce_sum(y_ * y)
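As a quick check that this formulation stays finite even when one class is effectively ruled out, here is a minimal sketch with made-up logit values, again assuming the TF 1.x graph-mode API used in the rest of the thread:

import tensorflow as tf

x = tf.constant([[200.0, -200.0]])   # hypothetical logits; softmax of class 1 underflows to exactly 0 in float32
y_ = tf.constant([[1.0, 0.0]])       # the true class is class 0, so class 1 "shouldn't matter"

naive = -tf.reduce_sum(y_ * tf.log(tf.nn.softmax(x)))   # 0 * log(0) -> nan
stable = -tf.reduce_sum(y_ * tf.nn.log_softmax(x))      # 0 * (-400) -> 0, stays finite

with tf.Session() as sess:
    print(sess.run([naive, stable]))  # -> [nan, ~0.0]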