1. The problem
I am using PaddlePaddle for distributed training. Single-node multi-GPU training works fine, but in multi-node multi-GPU collective mode the job hangs right after `init nccl` and never moves on to the next step. There is no error message at all, and `--log_level=debug` does not produce any extra output either. The logs are as follows:
First node, launched with `python -m paddle.distributed.launch --ips=172.16.13.74,172.16.13.87 train_with_fleet.py`:
```
grep: warning: GREP_OPTIONS is deprecated; please use an alias or script
LAUNCH INFO 2023-03-02 02:01:33,781 -----------  Configuration  ----------------------
LAUNCH INFO 2023-03-02 02:01:33,782 devices: None
LAUNCH INFO 2023-03-02 02:01:33,782 elastic_level: -1
LAUNCH INFO 2023-03-02 02:01:33,782 elastic_timeout: 30
LAUNCH INFO 2023-03-02 02:01:33,782 gloo_port: 6767
LAUNCH INFO 2023-03-02 02:01:33,782 host: None
LAUNCH INFO 2023-03-02 02:01:33,782 job_id: default
LAUNCH INFO 2023-03-02 02:01:33,782 legacy: False
LAUNCH INFO 2023-03-02 02:01:33,782 log_dir: log
LAUNCH INFO 2023-03-02 02:01:33,782 log_level: INFO
LAUNCH INFO 2023-03-02 02:01:33,782 master: None
LAUNCH INFO 2023-03-02 02:01:33,783 max_restart: 3
LAUNCH INFO 2023-03-02 02:01:33,783 nnodes: 1
LAUNCH INFO 2023-03-02 02:01:33,783 nproc_per_node: None
LAUNCH INFO 2023-03-02 02:01:33,783 rank: -1
LAUNCH INFO 2023-03-02 02:01:33,783 run_mode: collective
LAUNCH INFO 2023-03-02 02:01:33,783 server_num: None
LAUNCH INFO 2023-03-02 02:01:33,783 servers:
LAUNCH INFO 2023-03-02 02:01:33,783 trainer_num: None
LAUNCH INFO 2023-03-02 02:01:33,783 trainers:
LAUNCH INFO 2023-03-02 02:01:33,783 training_script: train_with_fleet.py
LAUNCH INFO 2023-03-02 02:01:33,784 training_script_args: []
LAUNCH INFO 2023-03-02 02:01:33,784 with_gloo: 0
LAUNCH INFO 2023-03-02 02:01:33,784 --------------------------------------------------
LAUNCH WARNING 2023-03-02 02:01:33,784 Compatible mode enable with args ['--ips=172.16.13.74,172.16.13.87']
-----------  Configuration Arguments -----------
backend: auto
cluster_topo_path: None
elastic_pre_hook: None
elastic_server: None
enable_auto_mapping: False
force: False
gpus: None
heter_devices:
heter_worker_num: None
heter_workers:
host: None
http_port: None
ips: 172.16.13.74,172.16.13.87
job_id: None
log_dir: log
np: None
nproc_per_node: None
rank_mapping_path: None
run_mode: None
scale: 0
server_num: None
servers:
training_script: train_with_fleet.py
training_script_args: []
worker_num: None
workers:
------------------------------------------------
INFO 2023-03-02 02:01:33,787 launch.py:504] Run collective mode. gpu arguments:['--ips'], cuda count:2
INFO 2023-03-02 02:01:33,787 launch.py:504] Run collective mode. gpu arguments:['--ips'], cuda count:2
launch train in GPU mode!
INFO 2023-03-02 02:01:33,809 launch_utils.py:561] Local start 2 processes. First process distributed environment info (Only For Debug):
    +=======================================================================================+
    |                        Distributed Envs                      Value                    |
    +---------------------------------------------------------------------------------------+
    |                       PADDLE_TRAINER_ID                        2                      |
    |                 PADDLE_CURRENT_ENDPOINT                 172.16.13.87:6070             |
    |                     PADDLE_TRAINERS_NUM                        4                      |
    |                PADDLE_TRAINER_ENDPOINTS  ... :6071,172.16.13.87:6070,172.16.13.87:6071|
    |                     PADDLE_RANK_IN_NODE                        0                      |
    |                 PADDLE_LOCAL_DEVICE_IDS                        0                      |
    |                 PADDLE_WORLD_DEVICE_IDS                     0,1,0,1                   |
    |                     FLAGS_selected_gpus                        0                      |
    |             FLAGS_selected_accelerators                        0                      |
    +=======================================================================================+
INFO 2023-03-02 02:01:33,809 launch_utils.py:561] Local start 2 processes. First process distributed environment info (Only For Debug):
    +=======================================================================================+
    |                        Distributed Envs                      Value                    |
    +---------------------------------------------------------------------------------------+
    |                       PADDLE_TRAINER_ID                        2                      |
    |                 PADDLE_CURRENT_ENDPOINT                 172.16.13.87:6070             |
    |                     PADDLE_TRAINERS_NUM                        4                      |
    |                PADDLE_TRAINER_ENDPOINTS  ... :6071,172.16.13.87:6070,172.16.13.87:6071|
    |                     PADDLE_RANK_IN_NODE                        0                      |
    |                 PADDLE_LOCAL_DEVICE_IDS                        0                      |
    |                 PADDLE_WORLD_DEVICE_IDS                     0,1,0,1                   |
    |                     FLAGS_selected_gpus                        0                      |
    |             FLAGS_selected_accelerators                        0                      |
    +=======================================================================================+
INFO 2023-03-02 02:01:33,809 launch_utils.py:566] details about PADDLE_TRAINER_ENDPOINTS can be found in log/endpoints.log, and detail running logs maybe found in log/workerlog.0
INFO 2023-03-02 02:01:33,809 launch_utils.py:566] details about PADDLE_TRAINER_ENDPOINTS can be found in log/endpoints.log, and detail running logs maybe found in log/workerlog.0
launch proc_id:324 idx:0
launch proc_id:329 idx:1
grep: warning: GREP_OPTIONS is deprecated; please use an alias or script
I0302 02:01:35.129714   324 gen_comm_id_helper.cc:205] Server listening on: 172.16.13.87:6070 successful.
I0302 02:01:36.279278   324 nccl_context.cc:83] init nccl context nranks: 4 local rank: 2 gpu id: 0 ring id: 0
```
Second node, launched with `python -m paddle.distributed.launch --ips=172.16.13.74,172.16.13.87 train_with_fleet.py`:
```
grep: warning: GREP_OPTIONS is deprecated; please use an alias or script
LAUNCH INFO 2023-03-02 02:01:22,936 -----------  Configuration  ----------------------
LAUNCH INFO 2023-03-02 02:01:22,936 devices: None
LAUNCH INFO 2023-03-02 02:01:22,936 elastic_level: -1
LAUNCH INFO 2023-03-02 02:01:22,936 elastic_timeout: 30
LAUNCH INFO 2023-03-02 02:01:22,936 gloo_port: 6767
LAUNCH INFO 2023-03-02 02:01:22,936 host: None
LAUNCH INFO 2023-03-02 02:01:22,936 job_id: default
LAUNCH INFO 2023-03-02 02:01:22,936 legacy: False
LAUNCH INFO 2023-03-02 02:01:22,936 log_dir: log
LAUNCH INFO 2023-03-02 02:01:22,936 log_level: INFO
LAUNCH INFO 2023-03-02 02:01:22,936 master: None
LAUNCH INFO 2023-03-02 02:01:22,936 max_restart: 3
LAUNCH INFO 2023-03-02 02:01:22,936 nnodes: 1
LAUNCH INFO 2023-03-02 02:01:22,936 nproc_per_node: None
LAUNCH INFO 2023-03-02 02:01:22,936 rank: -1
LAUNCH INFO 2023-03-02 02:01:22,936 run_mode: collective
LAUNCH INFO 2023-03-02 02:01:22,936 server_num: None
LAUNCH INFO 2023-03-02 02:01:22,936 servers:
LAUNCH INFO 2023-03-02 02:01:22,936 trainer_num: None
LAUNCH INFO 2023-03-02 02:01:22,936 trainers:
LAUNCH INFO 2023-03-02 02:01:22,936 training_script: train_with_fleet.py
LAUNCH INFO 2023-03-02 02:01:22,936 training_script_args: []
LAUNCH INFO 2023-03-02 02:01:22,937 with_gloo: 0
LAUNCH INFO 2023-03-02 02:01:22,937 --------------------------------------------------
LAUNCH WARNING 2023-03-02 02:01:22,937 Compatible mode enable with args ['--ips=172.16.13.74,172.16.13.87']
-----------  Configuration Arguments -----------
backend: auto
cluster_topo_path: None
elastic_pre_hook: None
elastic_server: None
enable_auto_mapping: False
force: False
gpus: None
heter_devices:
heter_worker_num: None
heter_workers:
host: None
http_port: None
ips: 172.16.13.74,172.16.13.87
job_id: None
log_dir: log
np: None
nproc_per_node: None
rank_mapping_path: None
run_mode: None
scale: 0
server_num: None
servers:
training_script: train_with_fleet.py
training_script_args: []
worker_num: None
workers:
------------------------------------------------
INFO 2023-03-02 02:01:22,938 launch.py:504] Run collective mode. gpu arguments:['--ips'], cuda count:2
INFO 2023-03-02 02:01:22,938 launch.py:504] Run collective mode. gpu arguments:['--ips'], cuda count:2
launch train in GPU mode!
INFO 2023-03-02 02:01:22,939 launch_utils.py:561] Local start 2 processes. First process distributed environment info (Only For Debug):
    +=======================================================================================+
    |                        Distributed Envs                      Value                    |
    +---------------------------------------------------------------------------------------+
    |                       PADDLE_TRAINER_ID                        0                      |
    |                 PADDLE_CURRENT_ENDPOINT                 172.16.13.74:6070             |
    |                     PADDLE_TRAINERS_NUM                        4                      |
    |                PADDLE_TRAINER_ENDPOINTS  ... :6071,172.16.13.87:6070,172.16.13.87:6071|
    |                     PADDLE_RANK_IN_NODE                        0                      |
    |                 PADDLE_LOCAL_DEVICE_IDS                        0                      |
    |                 PADDLE_WORLD_DEVICE_IDS                     0,1,0,1                   |
    |                     FLAGS_selected_gpus                        0                      |
    |             FLAGS_selected_accelerators                        0                      |
    +=======================================================================================+
INFO 2023-03-02 02:01:22,939 launch_utils.py:561] Local start 2 processes. First process distributed environment info (Only For Debug):
    +=======================================================================================+
    |                        Distributed Envs                      Value                    |
    +---------------------------------------------------------------------------------------+
    |                       PADDLE_TRAINER_ID                        0                      |
    |                 PADDLE_CURRENT_ENDPOINT                 172.16.13.74:6070             |
    |                     PADDLE_TRAINERS_NUM                        4                      |
    |                PADDLE_TRAINER_ENDPOINTS  ... :6071,172.16.13.87:6070,172.16.13.87:6071|
    |                     PADDLE_RANK_IN_NODE                        0                      |
    |                 PADDLE_LOCAL_DEVICE_IDS                        0                      |
    |                 PADDLE_WORLD_DEVICE_IDS                     0,1,0,1                   |
    |                     FLAGS_selected_gpus                        0                      |
    |             FLAGS_selected_accelerators                        0                      |
    +=======================================================================================+
INFO 2023-03-02 02:01:22,939 launch_utils.py:566] details about PADDLE_TRAINER_ENDPOINTS can be found in log/endpoints.log, and detail running logs maybe found in log/workerlog.0
INFO 2023-03-02 02:01:22,939 launch_utils.py:566] details about PADDLE_TRAINER_ENDPOINTS can be found in log/endpoints.log, and detail running logs maybe found in log/workerlog.0
launch proc_id:31267 idx:0
launch proc_id:31272 idx:1
grep: warning: GREP_OPTIONS is deprecated; please use an alias or script
server not ready, wait 3 sec to retry...
not ready endpoints:['172.16.13.74:6071', '172.16.13.87:6070', '172.16.13.87:6071']
server not ready, wait 3 sec to retry...
not ready endpoints:['172.16.13.87:6070', '172.16.13.87:6071']
server not ready, wait 3 sec to retry...
not ready endpoints:['172.16.13.87:6070', '172.16.13.87:6071']
server not ready, wait 3 sec to retry...
not ready endpoints:['172.16.13.87:6070', '172.16.13.87:6071']
I0302 02:01:36.274989 31267 nccl_context.cc:83] init nccl context nranks: 4 local rank: 0 gpu id: 0 ring id: 0
```
Both nodes get stuck right after `init nccl`. I have also checked the logs of all four trainers, and each of them ends with a line of the form:

```
I0302 02:01:36.274989 31267 nccl_context.cc:83] init nccl context nranks: 4 local rank: 0/1 gpu id: 0/1 ring id: 0/1
```
Watching GPU usage with `nvidia-smi -l 10`, utilization stays at 0 the whole time. The longest I waited was about 20 minutes, and the job was still stuck after `init nccl`.
Could someone help me figure out what the cause is? Both servers use NVIDIA driver 470, and training is launched inside containers.
Paddle version: 2.3.2
Python version: 3.7
CUDA version: 11.6
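
For reference, the job is launched in collective mode with the fleet API. Below is a minimal sketch of what such a script typically looks like, assuming the dynamic-graph fleet API; the model, optimizer, and data are placeholders and are not the actual contents of train_with_fleet.py:

```python
# Minimal sketch of a fleet collective training script (assumption: dynamic-graph
# fleet API). Model, optimizer, and data are placeholders, NOT the real
# train_with_fleet.py.
import paddle
from paddle.distributed import fleet


def main():
    # Initialize fleet in collective mode.
    fleet.init(is_collective=True)

    # Placeholder model and optimizer.
    model = paddle.nn.Linear(10, 1)
    optimizer = paddle.optimizer.SGD(learning_rate=0.01,
                                     parameters=model.parameters())

    # Wrap model and optimizer for data-parallel (collective) training.
    model = fleet.distributed_model(model)
    optimizer = fleet.distributed_optimizer(optimizer)

    for step in range(10):
        x = paddle.randn([4, 10])  # placeholder batch
        loss = model(x).mean()
        loss.backward()
        optimizer.step()
        optimizer.clear_grad()


if __name__ == "__main__":
    main()
```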
2. Troubleshooting
I did some checking and found that port 6070 on the 172.16.13.74 server never enters the listening state, while port 6071 on that machine does; both ports on the other server listen normally. I tried switching to a different pair of ports and saw exactly the same pattern: the first port on the .74 server is not listening and everything else is fine, which suggests this is not a port-conflict issue.
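
The listening check can be reproduced with something like the sketch below (the endpoints are the ones from the logs above; the helper itself is just for illustration and is not part of the training code):

```python
# Quick TCP check of the trainer endpoints (illustrative helper).
import socket


def is_listening(ip, port, timeout=3):
    """Return True if a TCP connection to ip:port succeeds within the timeout."""
    try:
        with socket.create_connection((ip, port), timeout=timeout):
            return True
    except OSError:
        return False


for endpoint in ["172.16.13.74:6070", "172.16.13.74:6071",
                 "172.16.13.87:6070", "172.16.13.87:6071"]:
    ip, port = endpoint.split(":")
    state = "listening" if is_listening(ip, int(port)) else "NOT listening"
    print(endpoint, state)
```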
I also tried training with only one GPU per node. In that case the single port on the .74 server still fails to listen while the .87 server's port listens normally. However, when I swap the order of the servers in `--ips`, the symptom flips: the .87 server's port fails to listen and the .74 server's port listens normally.
3. Other attempts
I tried several launch methods. Whether launching with `fleetrun --ips`, or with `fleetrun --nnodes` / `fleetrun --nnodes --master`, I hit the same problem and hang at the same place, with the same symptoms as described in the troubleshooting section above.
Let's ask the platform RD (R&D) folks to take a look at what's going on.
Looks like we'll have to go to customer support after all.