Multi-GPU training on AI Studio dies for no apparent reason

I'm launching it with paddle.distributed.launch:

python -m paddle.distributed.launch run.py

The log then shows:

------------------------------------------------
launch train in GPU mode!
launch proc_id:314 idx:0
launch proc_id:317 idx:1
launch proc_id:320 idx:2
launch proc_id:324 idx:3
W0618 10:18:48.121762   314 device_context.cc:404] Please NOTE: device: 0, GPU Compute Capability: 7.0, Driver API Version: 10.1, Runtime API Version: 10.1
W0618 10:18:48.126435   314 device_context.cc:422] device: 0, cuDNN Version: 7.6.
/mnt
[INFO]: train job failed! train_ret: 1

The job died without printing a single error message.
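
Every log line above comes from process 314 (idx 0), so the real failure may have been written by one of the other three ranks. The following is a minimal sketch for checking that. It assumes the launcher wrote per-rank logs to `log/workerlog.N`, which is the usual default for the `--log_dir` option of paddle.distributed.launch. AI Studio may store them elsewhere, so treat the directory as a guess. `tail_worker_logs` is a hypothetical helper name, not part of Paddle.

```python
from pathlib import Path


def tail_worker_logs(log_dir="log", lines=20):
    """Return {file name: last `lines` lines} for each workerlog.* file in log_dir.

    The default directory "log" is an assumption about where the launcher
    writes per-rank logs; pass a different path if the platform differs.
    """
    tails = {}
    for path in sorted(Path(log_dir).glob("workerlog.*")):
        # errors="replace" keeps the helper from failing on odd bytes in a log.
        text = path.read_text(errors="replace")
        tails[path.name] = text.splitlines()[-lines:]
    return tails


if __name__ == "__main__":
    for name, tail in tail_worker_logs().items():
        print(f"===== {name} =====")
        print("\n".join(tail))
```

If a rank other than 0 crashed first, its traceback should appear at the end of its own file even though the console showed nothing.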

All comments (3)
HClO
#2 replied 2021-06
W0618 11:29:19.044981   312 gen_comm_id_helper.cc:120] connect addr=127.0.0.1:33748 failed 21 times with reason: Connection refused retry after 3 seconds
W0618 11:29:22.045243   312 gen_comm_id_helper.cc:120] connect addr=127.0.0.1:33748 failed 22 times with reason: Connection refused retry after 3 seconds
W0618 11:29:25.045517   312 gen_comm_id_helper.cc:120] connect addr=127.0.0.1:33748 failed 23 times with reason: Connection refused retry after 3 seconds
W0618 11:29:28.045799   312 gen_comm_id_helper.cc:120] connect addr=127.0.0.1:33748 failed 24 times with reason: Connection refused retry after 3 seconds
W0618 11:29:31.046074   312 gen_comm_id_helper.cc:120] connect addr=127.0.0.1:33748 failed 25 times with reason: Connection refused retry after 3 seconds
W0618 11:29:34.046363   312 gen_comm_id_helper.cc:120] connect addr=127.0.0.1:33748 failed 26 times with reason: Connection refused retry after 3 seconds
W0618 11:29:37.046645   312 gen_comm_id_helper.cc:120] connect addr=127.0.0.1:33748 failed 27 times with reason: Connection refused retry after 3 seconds
W0618 11:29:40.046919   312 gen_comm_id_helper.cc:120] connect addr=127.0.0.1:33748 failed 28 times with reason: Connection refused retry after 3 seconds

Connection refused? Are you kidding me?
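
"Connection refused" on 127.0.0.1 means nothing was accepting connections on that port when this process tried. Most likely the rank that was supposed to listen there had already exited or never started, though the log alone does not confirm that. The sketch below uses only the Python standard library. It pulls the endpoint and retry count out of log lines like the ones above and probes whether anything is listening. The helper names `failed_endpoints` and `is_listening` are made up for illustration.

```python
import re
import socket

# Matches the address and attempt count in lines such as:
#   ... connect addr=127.0.0.1:33748 failed 21 times with reason: Connection refused ...
CONNECT_FAIL = re.compile(r"connect addr=([\d.]+):(\d+) failed (\d+) times")


def failed_endpoints(log_text):
    """Return {(host, port): highest attempt count} found in the log text."""
    result = {}
    for host, port, attempts in CONNECT_FAIL.findall(log_text):
        key = (host, int(port))
        result[key] = max(result.get(key, 0), int(attempts))
    return result


def is_listening(host, port, timeout=1.0):
    """Return True if a TCP connection to host:port succeeds, else False."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


if __name__ == "__main__":
    sample = (
        "W0618 11:29:19.044981   312 gen_comm_id_helper.cc:120] "
        "connect addr=127.0.0.1:33748 failed 21 times with reason: "
        "Connection refused retry after 3 seconds"
    )
    for (host, port), attempts in failed_endpoints(sample).items():
        state = "listening" if is_listening(host, port) else "not listening"
        print(f"{host}:{port} failed {attempts} times, currently {state}")
```

Running the probe while the job is still retrying shows whether the listening side ever comes up. If it never does, the next place to look is the log of the rank that owns that port.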

PaddleTalent
#3 replied 2021-06

What is the job ID for this task?

johnyanccer
#4 replied 2021-08

Same error here...
