Are you sure the run that completed normally used the full dataset? Has the model been trained locally? Can you capture more debug information?
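I have tried the full dataset several times and every run failed; all the runs that completed used partial data.
Both links in the original post are runs on the same partial data.
Local training on partial data always completes normally.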
Add FLAGS_check_nan_inf=1 to the user configuration
and see which layer produces the nan.
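The error log in the original post was already produced with FLAGS_check_nan_inf=1. It shows that nan appears after the softmax of the last fc layer of the CTR prediction.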
Could you also print the input to the softmax?
I reran the job and found that the input to the softmax is already nan. Where should the output of FLAGS_check_nan_inf=1 show up?
Tensor[batch_norm_1.tmp_3]
shape: [1024,64,]
dtype: f
data: nan,nan,nan,nan,nan,nan,nan,nan,nan,nan,nan,nan,nan,nan,nan,nan,nan,nan,nan,nan,
1573027773 The place is:CPUPlace
Tensor[fc_3.tmp_2]
shape: [1024,2,]
dtype: f
data: nan,nan,nan,nan,nan,nan,nan,nan,nan,nan,nan,nan,nan,nan,nan,nan,nan,nan,nan,nan,
1573027773 The place is:CPUPlace
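In case it helps with scanning long logs, here is a minimal sketch of a script that walks through tensor dumps in the format shown above and reports the first tensor whose printed values contain nan or inf. The function name `first_non_finite_tensor` is made up for illustration, and the parsing assumes only the `Tensor[...]` and `data:` layout visible in this thread.

```python
import re

# One pattern for both pieces of a dump: the "Tensor[name]" header and the
# "data: v1,v2,..." value list. \b keeps "predict_data:" from matching.
TOKEN_RE = re.compile(r"Tensor\[([^\]]+)\]|\bdata:\s*(\S+)")
NON_FINITE = {"nan", "-nan", "inf", "-inf"}


def first_non_finite_tensor(log_text):
    """Return the name of the first dumped tensor whose printed values
    include nan or inf, or None if every printed value is finite."""
    current = None
    for match in TOKEN_RE.finditer(log_text):
        name, data = match.groups()
        if name is not None:
            current = name  # remember which tensor the next data line belongs to
        elif current is not None:
            values = [v for v in data.split(",") if v]
            if any(v.lower() in NON_FINITE for v in values):
                return current
    return None
```

Note that the dumps in this thread list only 20 values for tensors of shape [1024,64] and [1024,2], so a tensor reported as clean by this script may still contain nan or inf in the values that were not printed.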
Set the environment variable at runtime:
export FLAGS_check_nan_inf=1
This detects which layer produces nan first.
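If the job is started from a Python launcher rather than a shell, the same flag can be passed through the child process environment. This is only a sketch: `env_with_nan_check` is a hypothetical helper name and `train.py` is a placeholder for the real training entry point.

```python
import os


def env_with_nan_check(base_env=None):
    """Return a copy of the environment with FLAGS_check_nan_inf=1 added."""
    env = dict(os.environ if base_env is None else base_env)
    env["FLAGS_check_nan_inf"] = "1"
    return env


# Hypothetical usage (train.py is a placeholder script name):
# import subprocess
# subprocess.run(["python", "train.py"], env=env_with_nan_check(), check=True)
```

Copying the environment instead of mutating `os.environ` keeps the flag scoped to the training process that is launched.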
I added several Print ops for intermediate variables, and the log shows that nan first appears after batchnorm.
Fri Nov 8 11:07:52 2019[1,1]<stdout>:1573182472 The place is:CPUPlace
Fri Nov 8 11:07:52 2019[1,1]<stdout>:Tensor[fc_0.tmp_1]
Fri Nov 8 11:07:52 2019[1,1]<stdout>: shape: [1024,512,]
Fri Nov 8 11:07:52 2019[1,1]<stdout>: dtype: f
Fri Nov 8 11:07:52 2019[1,1]<stdout>: data: 0.364076,-0.215658,0.0754562,-0.706892,0.232593,-0.91087,-0.0826245,-0.238436,0.891936,0.904805,-0.06953,-0.567432,-0.142411,0.288235,-0.00939118,-0.146659,-0.0667748,0.0340427,-0.619866,-0.00593939,
Fri Nov 8 11:07:52 2019[1,1]<stdout>:1573182472 The place is:CPUPlace
Fri Nov 8 11:07:52 2019[1,1]<stdout>:Tensor[batch_norm_0.tmp_3]
Fri Nov 8 11:07:52 2019[1,1]<stdout>: shape: [1024,512,]
Fri Nov 8 11:07:52 2019[1,1]<stdout>: dtype: f
Fri Nov 8 11:07:52 2019[1,1]<stdout>: data: -nan,-nan,-nan,-nan,-nan,-nan,-nan,-nan,-nan,-nan,-nan,-nan,-nan,-nan,-nan,-nan,-nan,-nan,-nan,-nan,
Fri Nov 8 11:07:53 2019[1,1]<stdout>:1573182473 The place is:CPUPlace
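One thing that might be worth ruling out, and this is only a guess that nothing in this thread confirms: the dump of fc_0.tmp_1 lists 20 values out of a [1024,512] tensor, so an inf further down a column would not be visible, yet a single inf is enough to turn an entire normalized column into nan. The sketch below is plain Python written for illustration, not Paddle's batch_norm kernel, and it omits the learned scale and shift.

```python
import math


def batch_norm_column(column, eps=1e-5):
    """Normalize one feature column with its batch mean and variance
    (training-mode batch norm without the learned scale and shift)."""
    n = len(column)
    mean = sum(column) / n
    var = sum((x - mean) ** 2 for x in column) / n
    return [(x - mean) / math.sqrt(var + eps) for x in column]


# Four finite values taken from the fc_0.tmp_1 dump: the output stays finite.
clean = batch_norm_column([0.364076, -0.215658, 0.0754562, -0.706892])

# Same column with one value replaced by inf: the mean becomes inf, the
# variance becomes nan, and every normalized value ends up nan.
dirty = batch_norm_column([0.364076, -0.215658, 0.0754562, float("inf")])
```

If this is what is happening, dumping the whole fc_0 output (or checking it for non-finite values) on a failing batch should show it; if that tensor is fully finite, the batch norm parameters or statistics would be the next thing to inspect.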
Log link: http://10.76.118.34:8910/fileview.html?type=logsdir&path=/&instance=0.app-user-20191108094838-44504--shulei_msd_mmoe_dnn_v1_20191108_paddlecloud
1) PaddlePaddle version: paddlepaddle 1.6.0
2) CPU:
4)
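An ordinary MLP click-through-rate prediction model, trained in distributed mode through the fleet API. During training, the softmax output of the last layer becomes nan, which causes a core dump when computing AUC.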
Tensor[fc_3.tmp_2]
shape: [1024,2,]
dtype: f
data: nan,nan,nan,nan,nan,nan,nan,nan,nan,nan,nan,nan,nan,nan,nan,nan,nan,nan,nan,nan,
terminate called after throwing an instance of 'paddle::platform::EnforceNotMet'
PaddleCheckError: Expected predict_data <= 1, but received predict_data:nan > 1:1. The predict data must less or equal 1. at [/paddle/paddle/fluid/operators/metrics/auc_op.h:83] [operator < auc > error]
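For debugging offline, the check that fails here can be mirrored in a few lines of plain Python so that a bad batch is reported with its index instead of aborting the process. This is a sketch only: `check_predictions` is a hypothetical helper, and the lower bound of 0 is my assumption for a probability (the error message above only states the upper bound).

```python
import math


def check_predictions(predictions):
    """Raise ValueError on the first prediction that is nan or outside [0, 1]."""
    for index, p in enumerate(predictions):
        # nan fails every comparison, so it is tested explicitly for clarity.
        if math.isnan(p) or not 0.0 <= p <= 1.0:
            raise ValueError(
                "prediction %d is %r, expected a value in [0, 1]" % (index, p)
            )
```

Calling this on the predictions fetched from a failing batch would point at the first offending sample, which can then be traced back to its input features.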
MPI log link: http://10.76.58.25:8910/fileview.html?type=logsdir&path=/&instance=0.app-user-20191103135008-40922--shulei_msd_mmoe_dnn_v1_20191102_paddlecloud
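However, after resubmitting the same job, the model trained to completion and the nan-induced core dump could not be reproduced. MPI log link for that run: http://10.76.125.48:8910/fileview.html?type=logsdir&path=/&instance=0.app-user-20191103163759-40991--shulei_msd_mmoe_dnn_v1_20191102_paddlecloud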
So it does not look like this is caused by dirty data.
Now almost every full-data training job I submit hits the same core dump.