The batch size is 8 images, so why do the top1 values during training look so odd? The validation top1 values look normal.
[2022/03/13 18:03:04] root INFO: [Train][Epoch 50/50][Iter: 0/5]lr: 0.00041, CELoss: 0.56853, loss: 0.56853, top1: 0.87500, top3: 1.00000, batch_cost: 0.32509s, reader_cost: 0.11848, ips: 24.60891 images/sec, eta: 0:00:01
[2022/03/13 18:03:04] root INFO: [Train][Epoch 50/50][Iter: 1/5]lr: 0.00038, CELoss: 0.32264, loss: 0.32264, top1: 0.93750, top3: 1.00000, batch_cost: 0.32469s, reader_cost: 0.11800, ips: 24.63901 images/sec, eta: 0:00:01
[2022/03/13 18:03:05] root INFO: [Train][Epoch 50/50][Iter: 2/5]lr: 0.00036, CELoss: 0.21675, loss: 0.21675, top1: 0.95833, top3: 1.00000, batch_cost: 0.32431s, reader_cost: 0.11753, ips: 24.66740 images/sec, eta: 0:00:00
[2022/03/13 18:03:05] root INFO: [Train][Epoch 50/50][Iter: 3/5]lr: 0.00033, CELoss: 0.16291, loss: 0.16291, top1: 0.96875, top3: 1.00000, batch_cost: 0.32394s, reader_cost: 0.11706, ips: 24.69624 images/sec, eta: 0:00:00
[2022/03/13 18:03:05] root INFO: [Train][Epoch 50/50][Iter: 4/5]lr: 0.00031, CELoss: 0.13069, loss: 0.13069, top1: 0.97500, top3: 1.00000, batch_cost: 0.32356s, reader_cost: 0.11659, ips: 24.72523 images/sec, eta: 0:00:00
[2022/03/13 18:03:05] root INFO: [Train][Epoch 50/50][Avg]CELoss: 0.13069, loss: 0.13069, top1: 0.97500, top3: 1.00000
[2022/03/13 18:03:05] root INFO: [Eval][Epoch 50][Iter: 0/5]CELoss: 2.76934, loss: 2.76934, top1: 0.12500, top3: 1.00000, batch_cost: 0.14596s, reader_cost: 0.02998, ips: 54.80825 images/sec
[2022/03/13 18:03:05] root INFO: [Eval][Epoch 50][Iter: 1/5]CELoss: 4.25078, loss: 4.25078, top1: 0.00000, top3: 1.00000, batch_cost: 0.08149s, reader_cost: 0.01499, ips: 98.17481 images/sec
[2022/03/13 18:03:05] root INFO: [Eval][Epoch 50][Iter: 2/5]CELoss: 2.22130, loss: 2.22130, top1: 0.50000, top3: 1.00000, batch_cost: 0.05899s, reader_cost: 0.00999, ips: 135.61383 images/sec
[2022/03/13 18:03:05] root INFO: [Eval][Epoch 50][Iter: 3/5]CELoss: 0.13056, loss: 0.13056, top1: 0.87500, top3: 1.00000, batch_cost: 0.04774s, reader_cost: 0.00750, ips: 167.57107 images/sec
[20
Found the problem. It is in trainer.py:
if iter_id % print_batch_step == 0:
    lr_msg = "lr: {:.5f}".format(lr_sch.get_lr())
    metric_msg = ", ".join([
        "{}: {:.5f}".format(key, output_info[key].avg)  # perhaps this should be output_info[key].val
        for key in output_info
    ])
    print(output_info['top1'].avg)
    print(metric_msg)
    time_msg = "s, ".join([
        "{}: {:.5f}".format(key, time_info[key].avg)
        for key in time_info
    ])
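To show why `.avg` produces values such as 0.95833 that are not multiples of 1/8, here is a minimal sketch of a running-average meter. The `AverageMeter` class and the `replay` helper below are hypothetical stand-ins written for illustration, not the actual trainer.py code. The per-batch accuracies (0.875, then 1.0 four times) are an assumption inferred from the logged training values above.

```python
class AverageMeter:
    """Hypothetical minimal meter: keeps the latest value and a running average."""

    def __init__(self):
        self.val = 0.0   # value of the most recent batch
        self.sum = 0.0   # weighted sum over all batches seen so far
        self.count = 0   # number of samples seen so far

    def update(self, val, n=1):
        self.val = val
        self.sum += val * n
        self.count += n

    @property
    def avg(self):
        return self.sum / self.count if self.count else 0.0


def replay(batch_top1, batch_size=8):
    """Feed per-batch accuracies and record (val, avg) as the log would format them."""
    meter = AverageMeter()
    rows = []
    for acc in batch_top1:
        meter.update(acc, batch_size)
        rows.append(("{:.5f}".format(meter.val), "{:.5f}".format(meter.avg)))
    return rows


# Assumed per-batch top1 values, inferred from the logged running averages.
rows = replay([0.875, 1.0, 1.0, 1.0, 1.0])
for val, avg in rows:
    print("val:", val, "avg:", avg)
```

Under these assumptions the `avg` column matches the logged training top1 sequence, while the `val` column only takes values that are multiples of 1/8. This suggests the training log is printing the average accumulated over the epoch rather than the accuracy of the current batch.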