TensorFlow training/validation loss nan questions

I've read a few other posts here on what to do when training/validation loss goes to nan. My guess is that my learning rate isn't decaying enough, but I'm wondering if someone could take a look and agree/disagree.

I'm following the awesome blog post here but implementing it in TensorFlow. Converting the model over is fairly easy, but the momentum and learning rate schedules are a bit trickier, and I think that's where the issue is. I can only go so many epochs before the loss jumps up and then goes to nan. The model I'm using should be equivalent to net4/net5 in the blog tutorial.
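
For context, this is roughly how I'm handling the learning rate and momentum in TensorFlow. It's a minimal sketch, not my exact code: the blog post ramps the rate down (and momentum up) over the epochs, and here I'm just standing in an exponential decay with placeholder values, plus a dummy loss so the snippet runs on its own.

import tensorflow as tf

# Stand-in loss so the snippet runs; the real model uses the MSE over the
# 30 keypoint coordinates.
w = tf.Variable(tf.zeros([1]))
loss = tf.reduce_mean(tf.square(w))

# Decay the learning rate as training progresses and feed it to a momentum
# optimizer; decay_steps/decay_rate here are placeholder values.
global_step = tf.Variable(0, trainable=False)
learning_rate = tf.train.exponential_decay(
    0.03, global_step, decay_steps=1000, decay_rate=0.96, staircase=True)
optimizer = tf.train.MomentumOptimizer(learning_rate, momentum=0.9)
train_op = optimizer.minimize(loss, global_step=global_step)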

... Epoch /Time/Train Loss/Valid Loss/Learn Rate
Epoch[ 900]0:14:11 0.000116 0.001566 0.027701
Epoch[ 910]0:14:20 0.000107 0.001565 0.026593
Epoch[ 920]0:14:29 0.000098 0.001564 0.026593
Epoch[ 930]0:14:39 0.000088 0.001567 0.026593
Epoch[ 940]0:14:48 0.000080 0.001567 0.026593
Epoch[ 950]0:14:58 0.000069 0.001578 0.026593
Epoch[ 960]0:15: 7 0.000072 0.001600 0.026593
Epoch[ 970]0:15:17 0.000105 0.001664 0.026593
Epoch[ 980]0:15:26 0.000221 0.001799 0.026593
Epoch[ 990]0:15:35 0.000456 0.002045 0.026593
Epoch[1000]0:15:45 0.000955 0.002473 0.025530
Epoch[1010]0:15:54 0.002148 0.003415 0.025530
Epoch[1020]0:16: 4 0.008455 0.009337 0.025530
Epoch[1030]0:16:13 0.009042 0.010412 0.025530
Epoch[1040]0:16:22 nan nan 0.025530

I've seen this before, and it seems like it's just a case of needing to lower the learning rate at that point. What worries me is that it doesn't match up well with the tutorial's numbers.

The next step in the blog post is adding dropout. I've already implemented it in the model; I just pass a boolean tensor to tell it whether it's training or not. With dropout enabled I'm getting nans in under 150 epochs, and I'm not sure what the problem is. Since dropout is supposed to regularize the network, I wasn't expecting this to happen.
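
For reference, this is roughly what the boolean-tensor toggle looks like. It's a minimal sketch of the approach, not my exact code; the is_training name and the use of tf.layers.dropout are just illustrative.

import tensorflow as tf

# Boolean tensor that toggles dropout: fed as True for training steps and
# False for validation/inference.
is_training = tf.placeholder(tf.bool, name="is_training")

x = tf.placeholder(tf.float32, [None, 500])
dropped = tf.layers.dropout(x, rate=0.5, training=is_training)

# sess.run(train_op,   feed_dict={..., is_training: True})   # training step
# sess.run(valid_loss, feed_dict={..., is_training: False})  # validation step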

... Epoch /Time/Train Loss/Valid Loss/Learn Rate
Epoch[   0]0: 0: 1 0.025211 0.025614 0.045000
Epoch[  10]0: 0:11 0.003496 0.004075 0.045000
Epoch[  20]0: 0:22 0.003202 0.003742 0.045000
Epoch[  30]0: 0:32 0.003169 0.003712 0.045000
Epoch[  40]0: 0:42 0.003084 0.003605 0.045000
Epoch[  50]0: 0:53 0.002976 0.003507 0.045000
Epoch[  60]0: 1: 3 0.002891 0.003437 0.045000
Epoch[  70]0: 1:14 0.002795 0.003381 0.045000
Epoch[  80]0: 1:24 0.002648 0.003317 0.045000
Epoch[  90]0: 1:34 0.002408 0.003181 0.011250
Epoch[ 100]0: 1:45 0.002267 0.003107 0.011250
Epoch[ 110]0: 1:55 0.001947 0.003003 0.011250
Epoch[ 120]0: 2: 6 0.004507 0.005768 0.011250
Epoch[ 130]0: 2:16 nan nan 0.011250

Any thoughts on what the issue could be with dropout enabled? As far as I can tell I've built the exact same model, though even setting the nan issue aside, my losses aren't as good as the tutorial's.

My code: https://github.com/sdeck51/CNNTutorials/blob/master/7.%20FacialFeatureDetection_Tutorial/FaceDetector.ipynb

EDIT:

It turns out I had my convolution layers set up incorrectly. I've gone over the tutorial, which has this:

InputLayer            (None, 1, 96, 96)       produces    9216 outputs
Conv2DCCLayer         (None, 32, 94, 94)      produces  282752 outputs
MaxPool2DCCLayer      (None, 32, 47, 47)      produces   70688 outputs
Conv2DCCLayer         (None, 64, 46, 46)      produces  135424 outputs
MaxPool2DCCLayer      (None, 64, 23, 23)      produces   33856 outputs
Conv2DCCLayer         (None, 128, 22, 22)     produces   61952 outputs
MaxPool2DCCLayer      (None, 128, 11, 11)     produces   15488 outputs
DenseLayer            (None, 500)             produces     500 outputs
DenseLayer            (None, 500)             produces     500 outputs
DenseLayer            (None, 30)              produces      30 outputs

and I've just updated mine, so I think it's now the same:

conv: input size: (?, 96, 96, 1)
pool: input size: (?, 94, 94, 32)
conv: input size: (?, 47, 47, 32)
pool: input size: (?, 46, 46, 64)
conv: input size: (?, 23, 23, 64)
pool: input size: (?, 22, 22, 128)
fc: input size before flattening: (?, 11, 11, 128)
fc: input size: (?, 15488)
fc: input size: (?, 500)
fc: input size: (?, 500)
out: (?, 30)
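
For reference, here's roughly how that stack goes together in TensorFlow. This is a minimal sketch, not my exact code; the tf.layers helpers and relu activations are just illustrative, but the 3x3 / 2x2 / 2x2 filters with VALID padding reproduce the shapes above.

import tensorflow as tf

x = tf.placeholder(tf.float32, [None, 96, 96, 1])

net = tf.layers.conv2d(x, 32, 3, padding="valid", activation=tf.nn.relu)   # (?, 94, 94, 32)
net = tf.layers.max_pooling2d(net, 2, 2)                                   # (?, 47, 47, 32)
net = tf.layers.conv2d(net, 64, 2, padding="valid", activation=tf.nn.relu) # (?, 46, 46, 64)
net = tf.layers.max_pooling2d(net, 2, 2)                                   # (?, 23, 23, 64)
net = tf.layers.conv2d(net, 128, 2, padding="valid", activation=tf.nn.relu)# (?, 22, 22, 128)
net = tf.layers.max_pooling2d(net, 2, 2)                                   # (?, 11, 11, 128)

net = tf.reshape(net, [-1, 11 * 11 * 128])                                 # (?, 15488)
net = tf.layers.dense(net, 500, activation=tf.nn.relu)
net = tf.layers.dense(net, 500, activation=tf.nn.relu)
output = tf.layers.dense(net, 30)                                          # 30 keypoint coordinates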

Still not working though. With dropout enabled on the convolution layers and the first fully connected layer, the model lasts for under 50 epochs before the error goes through the roof. The problem occurs even with very small learning rates.

Epoch[   0]0: 0: 1 0.029732 0.030537 0.030000
Epoch[  10]0: 0:11 0.004211 0.004986 0.030000
Epoch[  20]0: 0:20 0.003013 0.003530 0.004500
Epoch[  30]0: 0:30 5.250690 5.426279 0.004500
Epoch[  40]0: 0:40 nan nan 0.000675

And it looks like the non-dropout version is now broken and doing the same thing >_>...

EDIT: I think I've figured out the issue. I'm using a momentum optimizer whose momentum increases over time, and I think that small increase was causing it to overshoot. I'm currently running without dropout, but with a constant momentum I'm getting better results than before. After I run 1000 epochs I'm going to check it out with dropout.

Running with dropout now and it's not blowing up, so I think I've fixed the issue.


The problem was indeed the optimizer. I'm using the Momentum optimizer with the momentum initially set to 0.9, scheduled to reach 0.999 towards the end of the epoch cycle. For some reason the extra momentum causes the loss to skyrocket. Leaving it at 0.9 fixes the issue.
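
In case it helps anyone else, here's a minimal before/after sketch (not my exact code; ramp_between is just a hypothetical helper). MomentumOptimizer accepts a tensor for its momentum, so I had been feeding in a value ramped from 0.9 towards 0.999; keeping it fixed at 0.9 is what stopped the blow-up.

import tensorflow as tf

# Momentum fed as a tensor so it can be changed per step; defaults to 0.9.
momentum = tf.placeholder_with_default(0.9, shape=[])
learning_rate = tf.placeholder(tf.float32, shape=[])

optimizer = tf.train.MomentumOptimizer(learning_rate, momentum)
# train_op = optimizer.minimize(loss)

# Before (loss eventually blows up):
#   feed_dict={momentum: ramp_between(0.9, 0.999, epoch), ...}
# After (stable):
#   just leave momentum at the 0.9 default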
