Distributed training hangs

1. The problem

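I am using PaddlePaddle for distributed training. Single-node multi-GPU training works, but in multi-node multi-GPU collective mode the job hangs right after init nccl and never reaches the next step. There is no error message, and running with --log_level=debug produces no additional output. The logs are as follows: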

First node, launched with: python -m paddle.distributed.launch --ips=172.16.13.74,172.16.13.87 train_with_fleet.py

grep: warning: GREP_OPTIONS is deprecated; please use an alias or script
LAUNCH INFO 2023-03-02 02:01:33,781 ----------- Configuration ----------------------
LAUNCH INFO 2023-03-02 02:01:33,782 devices: None
LAUNCH INFO 2023-03-02 02:01:33,782 elastic_level: -1
LAUNCH INFO 2023-03-02 02:01:33,782 elastic_timeout: 30
LAUNCH INFO 2023-03-02 02:01:33,782 gloo_port: 6767
LAUNCH INFO 2023-03-02 02:01:33,782 host: None
LAUNCH INFO 2023-03-02 02:01:33,782 job_id: default
LAUNCH INFO 2023-03-02 02:01:33,782 legacy: False
LAUNCH INFO 2023-03-02 02:01:33,782 log_dir: log
LAUNCH INFO 2023-03-02 02:01:33,782 log_level: INFO
LAUNCH INFO 2023-03-02 02:01:33,782 master: None
LAUNCH INFO 2023-03-02 02:01:33,783 max_restart: 3
LAUNCH INFO 2023-03-02 02:01:33,783 nnodes: 1
LAUNCH INFO 2023-03-02 02:01:33,783 nproc_per_node: None
LAUNCH INFO 2023-03-02 02:01:33,783 rank: -1
LAUNCH INFO 2023-03-02 02:01:33,783 run_mode: collective
LAUNCH INFO 2023-03-02 02:01:33,783 server_num: None
LAUNCH INFO 2023-03-02 02:01:33,783 servers:
LAUNCH INFO 2023-03-02 02:01:33,783 trainer_num: None
LAUNCH INFO 2023-03-02 02:01:33,783 trainers:
LAUNCH INFO 2023-03-02 02:01:33,783 training_script: train_with_fleet.py
LAUNCH INFO 2023-03-02 02:01:33,784 training_script_args: []
LAUNCH INFO 2023-03-02 02:01:33,784 with_gloo: 0
LAUNCH INFO 2023-03-02 02:01:33,784 --------------------------------------------------
LAUNCH WARNING 2023-03-02 02:01:33,784 Compatible mode enable with args ['--ips=172.16.13.74,172.16.13.87']
----------- Configuration Arguments -----------
backend: auto
cluster_topo_path: None
elastic_pre_hook: None
elastic_server: None
enable_auto_mapping: False
force: False
gpus: None
heter_devices:
heter_worker_num: None
heter_workers:
host: None
http_port: None
ips: 172.16.13.74,172.16.13.87
job_id: None
log_dir: log
np: None
nproc_per_node: None
rank_mapping_path: None
run_mode: None
scale: 0
server_num: None
servers:
training_script: train_with_fleet.py
training_script_args: []
worker_num: None
workers:
------------------------------------------------
INFO 2023-03-02 02:01:33,787 launch.py:504] Run collective mode. gpu arguments:['--ips'], cuda count:2
INFO 2023-03-02 02:01:33,787 launch.py:504] Run collective mode. gpu arguments:['--ips'], cuda count:2
launch train in GPU mode!
INFO 2023-03-02 02:01:33,809 launch_utils.py:561] Local start 2 processes. First process distributed environment info (Only For Debug):
+=======================================================================================+
| Distributed Envs Value |
+---------------------------------------------------------------------------------------+
| PADDLE_TRAINER_ID 2 |
| PADDLE_CURRENT_ENDPOINT 172.16.13.87:6070 |
| PADDLE_TRAINERS_NUM 4 |
| PADDLE_TRAINER_ENDPOINTS ... :6071,172.16.13.87:6070,172.16.13.87:6071|
| PADDLE_RANK_IN_NODE 0 |
| PADDLE_LOCAL_DEVICE_IDS 0 |
| PADDLE_WORLD_DEVICE_IDS 0,1,0,1 |
| FLAGS_selected_gpus 0 |
| FLAGS_selected_accelerators 0 |
+=======================================================================================+

INFO 2023-03-02 02:01:33,809 launch_utils.py:561] Local start 2 processes. First process distributed environment info (Only For Debug):
+=======================================================================================+
| Distributed Envs Value |
+---------------------------------------------------------------------------------------+
| PADDLE_TRAINER_ID 2 |
| PADDLE_CURRENT_ENDPOINT 172.16.13.87:6070 |
| PADDLE_TRAINERS_NUM 4 |
| PADDLE_TRAINER_ENDPOINTS ... :6071,172.16.13.87:6070,172.16.13.87:6071|
| PADDLE_RANK_IN_NODE 0 |
| PADDLE_LOCAL_DEVICE_IDS 0 |
| PADDLE_WORLD_DEVICE_IDS 0,1,0,1 |
| FLAGS_selected_gpus 0 |
| FLAGS_selected_accelerators 0 |
+=======================================================================================+

INFO 2023-03-02 02:01:33,809 launch_utils.py:566] details about PADDLE_TRAINER_ENDPOINTS can be found in log/endpoints.log, and detail running logs maybe found in log/workerlog.0
INFO 2023-03-02 02:01:33,809 launch_utils.py:566] details about PADDLE_TRAINER_ENDPOINTS can be found in log/endpoints.log, and detail running logs maybe found in log/workerlog.0
launch proc_id:324 idx:0
launch proc_id:329 idx:1
grep: warning: GREP_OPTIONS is deprecated; please use an alias or script
I0302 02:01:35.129714 324 gen_comm_id_helper.cc:205] Server listening on: 172.16.13.87:6070 successful.
I0302 02:01:36.279278 324 nccl_context.cc:83] init nccl context nranks: 4 local rank: 2 gpu id: 0 ring id: 0


Second node, launched with: python -m paddle.distributed.launch --ips=172.16.13.74,172.16.13.87 train_with_fleet.py

grep: warning: GREP_OPTIONS is deprecated; please use an alias or script
LAUNCH INFO 2023-03-02 02:01:22,936 ----------- Configuration ----------------------
LAUNCH INFO 2023-03-02 02:01:22,936 devices: None
LAUNCH INFO 2023-03-02 02:01:22,936 elastic_level: -1
LAUNCH INFO 2023-03-02 02:01:22,936 elastic_timeout: 30
LAUNCH INFO 2023-03-02 02:01:22,936 gloo_port: 6767
LAUNCH INFO 2023-03-02 02:01:22,936 host: None
LAUNCH INFO 2023-03-02 02:01:22,936 job_id: default
LAUNCH INFO 2023-03-02 02:01:22,936 legacy: False
LAUNCH INFO 2023-03-02 02:01:22,936 log_dir: log
LAUNCH INFO 2023-03-02 02:01:22,936 log_level: INFO
LAUNCH INFO 2023-03-02 02:01:22,936 master: None
LAUNCH INFO 2023-03-02 02:01:22,936 max_restart: 3
LAUNCH INFO 2023-03-02 02:01:22,936 nnodes: 1
LAUNCH INFO 2023-03-02 02:01:22,936 nproc_per_node: None
LAUNCH INFO 2023-03-02 02:01:22,936 rank: -1
LAUNCH INFO 2023-03-02 02:01:22,936 run_mode: collective
LAUNCH INFO 2023-03-02 02:01:22,936 server_num: None
LAUNCH INFO 2023-03-02 02:01:22,936 servers:
LAUNCH INFO 2023-03-02 02:01:22,936 trainer_num: None
LAUNCH INFO 2023-03-02 02:01:22,936 trainers:
LAUNCH INFO 2023-03-02 02:01:22,936 training_script: train_with_fleet.py
LAUNCH INFO 2023-03-02 02:01:22,936 training_script_args: []
LAUNCH INFO 2023-03-02 02:01:22,937 with_gloo: 0
LAUNCH INFO 2023-03-02 02:01:22,937 --------------------------------------------------
LAUNCH WARNING 2023-03-02 02:01:22,937 Compatible mode enable with args ['--ips=172.16.13.74,172.16.13.87']
----------- Configuration Arguments -----------
backend: auto
cluster_topo_path: None
elastic_pre_hook: None
elastic_server: None
enable_auto_mapping: False
force: False
gpus: None
heter_devices:
heter_worker_num: None
heter_workers:
host: None
http_port: None
ips: 172.16.13.74,172.16.13.87
job_id: None
log_dir: log
np: None
nproc_per_node: None
rank_mapping_path: None
run_mode: None
scale: 0
server_num: None
servers:
training_script: train_with_fleet.py
training_script_args: []
worker_num: None
workers:
------------------------------------------------
INFO 2023-03-02 02:01:22,938 launch.py:504] Run collective mode. gpu arguments:['--ips'], cuda count:2
INFO 2023-03-02 02:01:22,938 launch.py:504] Run collective mode. gpu arguments:['--ips'], cuda count:2
launch train in GPU mode!
INFO 2023-03-02 02:01:22,939 launch_utils.py:561] Local start 2 processes. First process distributed environment info (Only For Debug):
+=======================================================================================+
| Distributed Envs Value |
+---------------------------------------------------------------------------------------+
| PADDLE_TRAINER_ID 0 |
| PADDLE_CURRENT_ENDPOINT 172.16.13.74:6070 |
| PADDLE_TRAINERS_NUM 4 |
| PADDLE_TRAINER_ENDPOINTS ... :6071,172.16.13.87:6070,172.16.13.87:6071|
| PADDLE_RANK_IN_NODE 0 |
| PADDLE_LOCAL_DEVICE_IDS 0 |
| PADDLE_WORLD_DEVICE_IDS 0,1,0,1 |
| FLAGS_selected_gpus 0 |
| FLAGS_selected_accelerators 0 |
+=======================================================================================+

INFO 2023-03-02 02:01:22,939 launch_utils.py:561] Local start 2 processes. First process distributed environment info (Only For Debug):
+=======================================================================================+
| Distributed Envs Value |
+---------------------------------------------------------------------------------------+
| PADDLE_TRAINER_ID 0 |
| PADDLE_CURRENT_ENDPOINT 172.16.13.74:6070 |
| PADDLE_TRAINERS_NUM 4 |
| PADDLE_TRAINER_ENDPOINTS ... :6071,172.16.13.87:6070,172.16.13.87:6071|
| PADDLE_RANK_IN_NODE 0 |
| PADDLE_LOCAL_DEVICE_IDS 0 |
| PADDLE_WORLD_DEVICE_IDS 0,1,0,1 |
| FLAGS_selected_gpus 0 |
| FLAGS_selected_accelerators 0 |
+=======================================================================================+

INFO 2023-03-02 02:01:22,939 launch_utils.py:566] details about PADDLE_TRAINER_ENDPOINTS can be found in log/endpoints.log, and detail running logs maybe found in log/workerlog.0
INFO 2023-03-02 02:01:22,939 launch_utils.py:566] details about PADDLE_TRAINER_ENDPOINTS can be found in log/endpoints.log, and detail running logs maybe found in log/workerlog.0
launch proc_id:31267 idx:0
launch proc_id:31272 idx:1
grep: warning: GREP_OPTIONS is deprecated; please use an alias or script
server not ready, wait 3 sec to retry...
not ready endpoints:['172.16.13.74:6071', '172.16.13.87:6070', '172.16.13.87:6071']
server not ready, wait 3 sec to retry...
not ready endpoints:['172.16.13.87:6070', '172.16.13.87:6071']
server not ready, wait 3 sec to retry...
not ready endpoints:['172.16.13.87:6070', '172.16.13.87:6071']
server not ready, wait 3 sec to retry...
not ready endpoints:['172.16.13.87:6070', '172.16.13.87:6071']
I0302 02:01:36.274989 31267 nccl_context.cc:83] init nccl context nranks: 4 local rank: 0 gpu id: 0 ring id: 0
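
For readers following the env tables above, the rank layout they imply can be sketched as below. This is only an illustration, assuming the launcher assigns ranks node by node in `--ips` order with one port per local process starting at 6070, which matches the PADDLE_TRAINER_ID and PADDLE_CURRENT_ENDPOINT values in the two logs. The helper name `build_endpoints` is hypothetical and not part of Paddle.

```python
def build_endpoints(ips, procs_per_node, base_port=6070):
    """Return endpoints in rank order: node by node, then port by port.

    Assumption: ranks follow the order of --ips and ports start at
    base_port, as the logs in this post suggest.
    """
    return [
        f"{ip}:{base_port + local_rank}"
        for ip in ips
        for local_rank in range(procs_per_node)
    ]


ips = "172.16.13.74,172.16.13.87".split(",")
endpoints = build_endpoints(ips, procs_per_node=2)

# Print one "rank endpoint" pair per line.
for rank, endpoint in enumerate(endpoints):
    print(rank, endpoint)
```

Under that assumption, rank 0 is 172.16.13.74:6070 and rank 2 is 172.16.13.87:6070, the same pairing the two env tables report.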


Both nodes hang right after init nccl. I have also checked the endpoint logs of all four workers, and every one of them shows:

I0302 02:01:36.274989 31267 nccl_context.cc:83] init nccl context nranks: 4 local rank: 0/1 gpu id: 0/1 ring id: 0/1
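Watching GPU usage with nvidia-smi -l 10 shows utilization stuck at 0. The longest I waited was about 20 minutes, and the job was still stuck after init nccl.
I would like to know what the cause is. Both servers run GPU driver version 470, and the job is launched inside containers.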

Paddle version: 2.3.2
Python version: 3.7
CUDA version: 11.6
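
Since --log_level=debug printed nothing further, one additional source of output is NCCL's own logging, which is controlled by environment variables and not by the launcher. The sketch below builds such an environment: `NCCL_DEBUG` and `NCCL_SOCKET_IFNAME` are standard NCCL variables, but the interface name "eth0" is a placeholder, and whether interface selection inside the containers is related to this hang is an assumption, not something the logs confirm. The helper names are hypothetical.

```python
import os
import sys


def nccl_debug_env(base_env, iface=None):
    """Copy base_env and add NCCL diagnostic variables."""
    env = dict(base_env)
    env["NCCL_DEBUG"] = "INFO"  # ask NCCL to log its init and transport setup
    if iface is not None:
        # Restrict NCCL to one network interface (placeholder name below).
        env["NCCL_SOCKET_IFNAME"] = iface
    return env


def launch_command(ips, script):
    """Build the same launch command used in this post."""
    return [sys.executable, "-m", "paddle.distributed.launch", "--ips=" + ips, script]


env = nccl_debug_env(os.environ, iface="eth0")  # "eth0" is an assumption
cmd = launch_command("172.16.13.74,172.16.13.87", "train_with_fleet.py")

# Show what would be run; nothing is launched here.
print("NCCL_DEBUG=" + env["NCCL_DEBUG"], "NCCL_SOCKET_IFNAME=" + env["NCCL_SOCKET_IFNAME"])
print(" ".join(cmd))
```

To actually start the job with these settings, `cmd` and `env` could be passed to `subprocess.run(cmd, env=env)` on each node, with the interface name replaced by the one that carries the 172.16.13.x addresses.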

2. Troubleshooting

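I investigated and found that port 6070 on the 74 server never seems to be listening, while port 6071 on the same server is, and both ports on the other server are listening normally. I switched to a different listening port and saw the same thing: the first port on the 74 server is not listening and everything else is fine, which suggests the port is not simply occupied by another process.
I also tried training with only one GPU. The single port on the 74 server still was not listening, while the port on the 87 server was. However, when I swapped the order of the servers in --ips, the behavior flipped: the 87 server's port was not listening and the 74 server's port was.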
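
The listening check described above can be reproduced from the peer node with a plain TCP connect. This is a minimal sketch using only the Python standard library; the function names are hypothetical, and a successful connect only shows that something accepts connections on that port, not that it is the Paddle worker.

```python
import socket


def is_listening(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, unreachable host, and timeouts.
        return False


def check_endpoints(endpoints, timeout=3.0):
    """Map each "host:port" string to whether it accepts connections."""
    results = {}
    for endpoint in endpoints:
        host, port = endpoint.rsplit(":", 1)
        results[endpoint] = is_listening(host, int(port), timeout)
    return results
```

For example, calling `check_endpoints(["172.16.13.74:6070", "172.16.13.74:6071", "172.16.13.87:6070", "172.16.13.87:6071"])` from either container while the job is hanging would show which of the four endpoints are reachable across the container network, and not only locally.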

3. Other attempts

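I tried several launch methods. Whether I use `fleetrun --ips` or `fleetrun --nnodes` + `fleetrun --nnodes --master`, I run into the same problem and the job hangs at the same place. The symptoms are also the same as those described in the troubleshooting section above.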

All comments (2)
李长安
#2 replied 2023-03

Try asking the platform's R&D engineers what is going on.

何必固執丶
#3 replied 2023-03

You will still need to contact customer support.
