PS D:\pycharm\PaddlePaddle\PaddleDetection> python -m paddle.distributed.launch --selected_gpus='0,1' tools/train_fleet.py -c configs/fcos/fcos_dcn_r50_fpn_1x_coco.yml -o use_gpu=true
----------- Configuration Arguments -----------
backend: auto
elastic_server: None
force: False
gpus: 0,1
heter_devices:
heter_worker_num: None
heter_workers:
host: None
http_port: None
ips: 127.0.0.1
job_id: None
log_dir: log
np: None
nproc_per_node: None
run_mode: None
scale: 0
server_num: None
servers:
training_script: tools/train_fleet.py
training_script_args: ['-c', 'configs/fcos/fcos_dcn_r50_fpn_1x_coco.yml', '-o', 'use_gpu=true']
worker_num: None
workers:
------------------------------------------------
WARNING 2022-04-02 15:01:02,073 launch.py:423] Not found distinct arguments and compiled with cuda or xpu. Default use collective mode
launch train in GPU mode!
INFO 2022-04-02 15:01:02,073 launch_utils.py:641] Change selected_gpus into reletive values. --ips:0,1 will change into relative_ips:[0, 1] according to your CUDA_VISIBLE_DEVICES:['0', '1']
INFO 2022-04-02 15:01:02,073 launch_utils.py:528] Local start 2 processes. First process distributed environment info (Only For Debug):
+=======================================================================================+
| Distributed Envs Value |
+---------------------------------------------------------------------------------------+
| PADDLE_TRAINER_ID 0 |
| PADDLE_CURRENT_ENDPOINT 127.0.0.1:64626 |
| PADDLE_TRAINERS_NUM 2 |
| PADDLE_TRAINER_ENDPOINTS 127.0.0.1:64626,127.0.0.1:64627 |
| PADDLE_RANK_IN_NODE 0 |
| PADDLE_LOCAL_DEVICE_IDS 0 |
File "tools/train_fleet.py", line 197, in
main()
File "tools/train_fleet.py", line 193, in main
run(FLAGS, cfg)
File "tools/train_fleet.py", line 130, in run
init_parallel_env()
File "D:\pycharm\PaddlePaddle\PaddleDetection\ppdet\engine\env.py", line 44, in init_parallel_env
paddle.distributed.init_parallel_env()
File "D:\pycharm\PaddlePaddle\lib\site-packages\paddle\distributed\parallel.py", line 215, in init_parallel_env
core.NCCLParallelContext(strategy, place))
AttributeError: module 'paddle.fluid.core_avx' has no attribute 'NCCLParallelContext'
INFO 2022-04-02 15:01:11,275 launch_utils.py:341] terminate all the procs
ERROR 2022-04-02 15:01:11,275 launch_utils.py:604] ABORT!!! Out of all 2 trainers, the trainer process with rank=[0, 1] was aborted. Please check its log.
INFO 2022-04-02 15:01:14,290 launch_utils.py:341] terminate all the procs
INFO 2022-04-02 15:01:14,290 launch.py:311] Local processes completed.
Has anyone solved this?
Most likely Windows doesn't support multi-GPU training: NCCL has no Windows build, so switch to Linux.
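To expand on the answer: the `AttributeError` appears because `NCCLParallelContext` is only compiled into Linux builds of Paddle, since NCCL (NVIDIA's collective communications library) is distributed for Linux only. A minimal, framework-agnostic pre-flight check could look like the sketch below (the helper name `supports_nccl` is my own, not a Paddle API):

```python
import platform

def supports_nccl() -> bool:
    """Return True only on platforms where NCCL can exist.

    NCCL is distributed for Linux only, so multi-GPU collective
    training that depends on it cannot work on Windows or macOS.
    """
    return platform.system() == "Linux"

if not supports_nccl():
    # Fall back to single-GPU training instead of launching a
    # distributed job that would crash with the AttributeError above.
    print("NCCL unavailable on this OS; use single-GPU training or Linux.")
```

Running such a check before `paddle.distributed.launch` gives a clear message instead of a crash deep inside `init_parallel_env()`.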