1. The problem
I am using PaddlePaddle for distributed training. Single-node multi-GPU training works fine, but in multi-node multi-GPU collective mode the job hangs right after `init nccl` and never moves on to the next step. There is no error message at all, and `--log_level=debug` does not produce any extra output either. The logs are as follows:
First node, launched with `python -m paddle.distributed.launch --ips=172.16.13.74,172.16.13.87 train_with_fleet.py`:
```
grep: warning: GREP_OPTIONS is deprecated; please use an alias or script
LAUNCH INFO 2023-03-02 02:01:33,781 -----------  Configuration  ----------------------
LAUNCH INFO 2023-03-02 02:01:33,782 devices: None
LAUNCH INFO 2023-03-02 02:01:33,782 elastic_level: -1
LAUNCH INFO 2023-03-02 02:01:33,782 elastic_timeout: 30
LAUNCH INFO 2023-03-02 02:01:33,782 gloo_port: 6767
LAUNCH INFO 2023-03-02 02:01:33,782 host: None
LAUNCH INFO 2023-03-02 02:01:33,782 job_id: default
LAUNCH INFO 2023-03-02 02:01:33,782 legacy: False
LAUNCH INFO 2023-03-02 02:01:33,782 log_dir: log
LAUNCH INFO 2023-03-02 02:01:33,782 log_level: INFO
LAUNCH INFO 2023-03-02 02:01:33,782 master: None
LAUNCH INFO 2023-03-02 02:01:33,783 max_restart: 3
LAUNCH INFO 2023-03-02 02:01:33,783 nnodes: 1
LAUNCH INFO 2023-03-02 02:01:33,783 nproc_per_node: None
LAUNCH INFO 2023-03-02 02:01:33,783 rank: -1
LAUNCH INFO 2023-03-02 02:01:33,783 run_mode: collective
LAUNCH INFO 2023-03-02 02:01:33,783 server_num: None
LAUNCH INFO 2023-03-02 02:01:33,783 servers:
LAUNCH INFO 2023-03-02 02:01:33,783 trainer_num: None
LAUNCH INFO 2023-03-02 02:01:33,783 trainers:
LAUNCH INFO 2023-03-02 02:01:33,783 training_script: train_with_fleet.py
LAUNCH INFO 2023-03-02 02:01:33,784 training_script_args: []
LAUNCH INFO 2023-03-02 02:01:33,784 with_gloo: 0
LAUNCH INFO 2023-03-02 02:01:33,784 --------------------------------------------------
LAUNCH WARNING 2023-03-02 02:01:33,784 Compatible mode enable with args ['--ips=172.16.13.74,172.16.13.87']
-----------  Configuration Arguments -----------
backend: auto
cluster_topo_path: None
elastic_pre_hook: None
elastic_server: None
enable_auto_mapping: False
force: False
gpus: None
heter_devices:
heter_worker_num: None
heter_workers:
host: None
http_port: None
ips: 172.16.13.74,172.16.13.87
job_id: None
log_dir: log
np: None
nproc_per_node: None
rank_mapping_path: None
run_mode: None
scale: 0
server_num: None
servers:
training_script: train_with_fleet.py
training_script_args: []
worker_num: None
workers:
------------------------------------------------
INFO 2023-03-02 02:01:33,787 launch.py:504] Run collective mode. gpu arguments:['--ips'], cuda count:2
INFO 2023-03-02 02:01:33,787 launch.py:504] Run collective mode. gpu arguments:['--ips'], cuda count:2
launch train in GPU mode!
INFO 2023-03-02 02:01:33,809 launch_utils.py:561] Local start 2 processes. First process distributed environment info (Only For Debug):
    +=======================================================================================+
    |                        Distributed Envs                      Value                    |
    +---------------------------------------------------------------------------------------+
    |                       PADDLE_TRAINER_ID                        2                      |
    |                 PADDLE_CURRENT_ENDPOINT                 172.16.13.87:6070             |
    |                     PADDLE_TRAINERS_NUM                        4                      |
    |                PADDLE_TRAINER_ENDPOINTS  ... :6071,172.16.13.87:6070,172.16.13.87:6071|
    |                     PADDLE_RANK_IN_NODE                        0                      |
    |                 PADDLE_LOCAL_DEVICE_IDS                        0                      |
    |                 PADDLE_WORLD_DEVICE_IDS                     0,1,0,1                   |
    |                     FLAGS_selected_gpus                        0                      |
    |             FLAGS_selected_accelerators                        0                      |
    +=======================================================================================+
INFO 2023-03-02 02:01:33,809 launch_utils.py:561] Local start 2 processes. First process distributed environment info (Only For Debug):
    +=======================================================================================+
    |                        Distributed Envs                      Value                    |
    +---------------------------------------------------------------------------------------+
    |                       PADDLE_TRAINER_ID                        2                      |
    |                 PADDLE_CURRENT_ENDPOINT                 172.16.13.87:6070             |
    |                     PADDLE_TRAINERS_NUM                        4                      |
    |                PADDLE_TRAINER_ENDPOINTS  ... :6071,172.16.13.87:6070,172.16.13.87:6071|
    |                     PADDLE_RANK_IN_NODE                        0                      |
    |                 PADDLE_LOCAL_DEVICE_IDS                        0                      |
    |                 PADDLE_WORLD_DEVICE_IDS                     0,1,0,1                   |
    |                     FLAGS_selected_gpus                        0                      |
    |             FLAGS_selected_accelerators                        0                      |
    +=======================================================================================+
INFO 2023-03-02 02:01:33,809 launch_utils.py:566] details about PADDLE_TRAINER_ENDPOINTS can be found in log/endpoints.log, and detail running logs maybe found in log/workerlog.0
INFO 2023-03-02 02:01:33,809 launch_utils.py:566] details about PADDLE_TRAINER_ENDPOINTS can be found in log/endpoints.log, and detail running logs maybe found in log/workerlog.0
launch proc_id:324 idx:0
launch proc_id:329 idx:1
grep: warning: GREP_OPTIONS is deprecated; please use an alias or script
I0302 02:01:35.129714   324 gen_comm_id_helper.cc:205] Server listening on: 172.16.13.87:6070 successful.
I0302 02:01:36.279278   324 nccl_context.cc:83] init nccl context nranks: 4 local rank: 2 gpu id: 0 ring id: 0
```
Second node, launched with `python -m paddle.distributed.launch --ips=172.16.13.74,172.16.13.87 train_with_fleet.py`:
```
grep: warning: GREP_OPTIONS is deprecated; please use an alias or script
LAUNCH INFO 2023-03-02 02:01:22,936 -----------  Configuration  ----------------------
LAUNCH INFO 2023-03-02 02:01:22,936 devices: None
LAUNCH INFO 2023-03-02 02:01:22,936 elastic_level: -1
LAUNCH INFO 2023-03-02 02:01:22,936 elastic_timeout: 30
LAUNCH INFO 2023-03-02 02:01:22,936 gloo_port: 6767
LAUNCH INFO 2023-03-02 02:01:22,936 host: None
LAUNCH INFO 2023-03-02 02:01:22,936 job_id: default
LAUNCH INFO 2023-03-02 02:01:22,936 legacy: False
LAUNCH INFO 2023-03-02 02:01:22,936 log_dir: log
LAUNCH INFO 2023-03-02 02:01:22,936 log_level: INFO
LAUNCH INFO 2023-03-02 02:01:22,936 master: None
LAUNCH INFO 2023-03-02 02:01:22,936 max_restart: 3
LAUNCH INFO 2023-03-02 02:01:22,936 nnodes: 1
LAUNCH INFO 2023-03-02 02:01:22,936 nproc_per_node: None
LAUNCH INFO 2023-03-02 02:01:22,936 rank: -1
LAUNCH INFO 2023-03-02 02:01:22,936 run_mode: collective
LAUNCH INFO 2023-03-02 02:01:22,936 server_num: None
LAUNCH INFO 2023-03-02 02:01:22,936 servers:
LAUNCH INFO 2023-03-02 02:01:22,936 trainer_num: None
LAUNCH INFO 2023-03-02 02:01:22,936 trainers:
LAUNCH INFO 2023-03-02 02:01:22,936 training_script: train_with_fleet.py
LAUNCH INFO 2023-03-02 02:01:22,936 training_script_args: []
LAUNCH INFO 2023-03-02 02:01:22,937 with_gloo: 0
LAUNCH INFO 2023-03-02 02:01:22,937 --------------------------------------------------
LAUNCH WARNING 2023-03-02 02:01:22,937 Compatible mode enable with args ['--ips=172.16.13.74,172.16.13.87']
-----------  Configuration Arguments -----------
backend: auto
cluster_topo_path: None
elastic_pre_hook: None
elastic_server: None
enable_auto_mapping: False
force: False
gpus: None
heter_devices:
heter_worker_num: None
heter_workers:
host: None
http_port: None
ips: 172.16.13.74,172.16.13.87
job_id: None
log_dir: log
np: None
nproc_per_node: None
rank_mapping_path: None
run_mode: None
scale: 0
server_num: None
servers:
training_script: train_with_fleet.py
training_script_args: []
worker_num: None
workers:
------------------------------------------------
INFO 2023-03-02 02:01:22,938 launch.py:504] Run collective mode. gpu arguments:['--ips'], cuda count:2
INFO 2023-03-02 02:01:22,938 launch.py:504] Run collective mode. gpu arguments:['--ips'], cuda count:2
launch train in GPU mode!
INFO 2023-03-02 02:01:22,939 launch_utils.py:561] Local start 2 processes. First process distributed environment info (Only For Debug):
    +=======================================================================================+
    |                        Distributed Envs                      Value                    |
    +---------------------------------------------------------------------------------------+
    |                       PADDLE_TRAINER_ID                        0                      |
    |                 PADDLE_CURRENT_ENDPOINT                 172.16.13.74:6070             |
    |                     PADDLE_TRAINERS_NUM                        4                      |
    |                PADDLE_TRAINER_ENDPOINTS  ... :6071,172.16.13.87:6070,172.16.13.87:6071|
    |                     PADDLE_RANK_IN_NODE                        0                      |
    |                 PADDLE_LOCAL_DEVICE_IDS                        0                      |
    |                 PADDLE_WORLD_DEVICE_IDS                     0,1,0,1                   |
    |                     FLAGS_selected_gpus                        0                      |
    |             FLAGS_selected_accelerators                        0                      |
    +=======================================================================================+
INFO 2023-03-02 02:01:22,939 launch_utils.py:561] Local start 2 processes. First process distributed environment info (Only For Debug):
    +=======================================================================================+
    |                        Distributed Envs                      Value                    |
    +---------------------------------------------------------------------------------------+
    |                       PADDLE_TRAINER_ID                        0                      |
    |                 PADDLE_CURRENT_ENDPOINT                 172.16.13.74:6070             |
    |                     PADDLE_TRAINERS_NUM                        4                      |
    |                PADDLE_TRAINER_ENDPOINTS  ... :6071,172.16.13.87:6070,172.16.13.87:6071|
    |                     PADDLE_RANK_IN_NODE                        0                      |
    |                 PADDLE_LOCAL_DEVICE_IDS                        0                      |
    |                 PADDLE_WORLD_DEVICE_IDS                     0,1,0,1                   |
    |                     FLAGS_selected_gpus                        0                      |
    |             FLAGS_selected_accelerators                        0                      |
    +=======================================================================================+
INFO 2023-03-02 02:01:22,939 launch_utils.py:566] details about PADDLE_TRAINER_ENDPOINTS can be found in log/endpoints.log, and detail running logs maybe found in log/workerlog.0
INFO 2023-03-02 02:01:22,939 launch_utils.py:566] details about PADDLE_TRAINER_ENDPOINTS can be found in log/endpoints.log, and detail running logs maybe found in log/workerlog.0
launch proc_id:31267 idx:0
launch proc_id:31272 idx:1
grep: warning: GREP_OPTIONS is deprecated; please use an alias or script
server not ready, wait 3 sec to retry...
not ready endpoints:['172.16.13.74:6071', '172.16.13.87:6070', '172.16.13.87:6071']
server not ready, wait 3 sec to retry...
not ready endpoints:['172.16.13.87:6070', '172.16.13.87:6071']
server not ready, wait 3 sec to retry...
not ready endpoints:['172.16.13.87:6070', '172.16.13.87:6071']
server not ready, wait 3 sec to retry...
not ready endpoints:['172.16.13.87:6070', '172.16.13.87:6071']
I0302 02:01:36.274989 31267 nccl_context.cc:83] init nccl context nranks: 4 local rank: 0 gpu id: 0 ring id: 0
```
Both nodes get stuck right after `init nccl`. I have also checked the logs of all four trainers, and each of them ends with a line of the form:

```
I0302 02:01:36.274989 31267 nccl_context.cc:83] init nccl context nranks: 4 local rank: 0/1 gpu id: 0/1 ring id: 0/1
```
Watching GPU usage with `nvidia-smi -l 10`, utilization stays at 0 the whole time. The longest I waited was about 20 minutes, and the job was still stuck after `init nccl`.
Could someone help me figure out what the cause is? Both servers use NVIDIA driver 470, and training is launched inside containers.
Paddle version: 2.3.2
Python version: 3.7
CUDA version: 11.6
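
For reference, the job is launched in collective mode with the fleet API. Below is a minimal sketch of what such a script typically looks like, assuming the dynamic-graph fleet API; the model, optimizer, and data are placeholders and are not the actual contents of train_with_fleet.py:

```python
# Minimal sketch of a fleet collective training script (assumption: dynamic-graph
# fleet API). Model, optimizer, and data are placeholders, NOT the real
# train_with_fleet.py.
import paddle
from paddle.distributed import fleet


def main():
    # Initialize fleet in collective mode.
    fleet.init(is_collective=True)

    # Placeholder model and optimizer.
    model = paddle.nn.Linear(10, 1)
    optimizer = paddle.optimizer.SGD(learning_rate=0.01,
                                     parameters=model.parameters())

    # Wrap model and optimizer for data-parallel (collective) training.
    model = fleet.distributed_model(model)
    optimizer = fleet.distributed_optimizer(optimizer)

    for step in range(10):
        x = paddle.randn([4, 10])  # placeholder batch
        loss = model(x).mean()
        loss.backward()
        optimizer.step()
        optimizer.clear_grad()


if __name__ == "__main__":
    main()
```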
2. Troubleshooting
I did some checking and found that port 6070 on the 172.16.13.74 server never enters the listening state, while port 6071 on that machine does; both ports on the other server listen normally. I tried switching to a different pair of ports and saw exactly the same pattern: the first port on the .74 server is not listening and everything else is fine, which suggests this is not a port-conflict issue.
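
The listening check can be reproduced with something like the sketch below (the endpoints are the ones from the logs above; the helper itself is just for illustration and is not part of the training code):

```python
# Quick TCP check of the trainer endpoints (illustrative helper).
import socket


def is_listening(ip, port, timeout=3):
    """Return True if a TCP connection to ip:port succeeds within the timeout."""
    try:
        with socket.create_connection((ip, port), timeout=timeout):
            return True
    except OSError:
        return False


for endpoint in ["172.16.13.74:6070", "172.16.13.74:6071",
                 "172.16.13.87:6070", "172.16.13.87:6071"]:
    ip, port = endpoint.split(":")
    state = "listening" if is_listening(ip, int(port)) else "NOT listening"
    print(endpoint, state)
```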
I also tried training with only one GPU per node. In that case the single port on the .74 server still fails to listen while the .87 server's port listens normally. However, when I swap the order of the servers in `--ips`, the symptom flips: the .87 server's port fails to listen and the .74 server's port listens normally.
3. Other attempts
I tried several launch methods. Whether launching with `fleetrun --ips`, or with `fleetrun --nnodes` / `fleetrun --nnodes --master`, I hit the same problem and hang at the same place, with the same symptoms as described in the troubleshooting section above.
Let's ask the platform RD (R&D) folks to take a look at what's going on.
Looks like we'll have to go to customer support after all.