Single-machine multi-GPU training error with paddle.distributed.launch [Solved]

Following the Paddle tutorial, with version 2.1.2 (GPU), I ran single-machine multi-GPU training with: python -m paddle.distributed.launch train.py

Training only works on the default GPU, i.e. GPUS = 0. Selecting GPUS = 1 or GPUS = 0,1 raises the following error:

[2021-08-21 16:07:27,671] [ INFO] - Found /home/johnyan/.paddlenlp/models/ernie-gram-zh/vocab.txt
[2021-08-21 16:07:27,678] [ INFO] - Already cached /home/johnyan/.paddlenlp/models/ernie-gram-zh/ernie_gram_zh.pdparams
Traceback (most recent call last):
File "class2.py", line 116, in <module>
pretrained_model = paddlenlp.transformers.ErnieGramModel.from_pretrained('ernie-gram-zh')
File "/home/johnyan/anaconda3/envs/rtx_gpu/lib/python3.8/site-packages/paddlenlp/transformers/model_utils.py", line 274, in from_pretrained
model = cls(*base_args, **base_kwargs)
File "/home/johnyan/anaconda3/envs/rtx_gpu/lib/python3.8/site-packages/paddlenlp/transformers/utils.py", line 83, in __impl__
init_func(self, *args, **kwargs)
File "/home/johnyan/anaconda3/envs/rtx_gpu/lib/python3.8/site-packages/paddlenlp/transformers/ernie_gram/modeling.py", line 201, in __init__
self.embeddings = ErnieGramEmbeddings(
File "/home/johnyan/anaconda3/envs/rtx_gpu/lib/python3.8/site-packages/paddlenlp/transformers/ernie_gram/modeling.py", line 45, in __init__
self.word_embeddings = nn.Embedding(
File "/home/johnyan/anaconda3/envs/rtx_gpu/lib/python3.8/site-packages/paddle/nn/layer/common.py", line 1343, in __init__
self.weight = self.create_parameter(
File "/home/johnyan/anaconda3/envs/rtx_gpu/lib/python3.8/site-packages/paddle/fluid/dygraph/layers.py", line 411, in create_parameter
return self._helper.create_parameter(temp_attr, shape, dtype, is_bias,
File "/home/johnyan/anaconda3/envs/rtx_gpu/lib/python3.8/site-packages/paddle/fluid/layer_helper_base.py", line 369, in create_parameter
return self.main_program.global_block().create_parameter(
File "/home/johnyan/anaconda3/envs/rtx_gpu/lib/python3.8/site-packages/paddle/fluid/framework.py", line 2895, in create_parameter
initializer(param, self)
File "/home/johnyan/anaconda3/envs/rtx_gpu/lib/python3.8/site-packages/paddle/fluid/initializer.py", line 561, in __call__
op = block.append_op(
File "/home/johnyan/anaconda3/envs/rtx_gpu/lib/python3.8/site-packages/paddle/fluid/framework.py", line 2921, in append_op
_dygraph_tracer().trace_op(type,
File "/home/johnyan/anaconda3/envs/rtx_gpu/lib/python3.8/site-packages/paddle/fluid/dygraph/tracer.py", line 43, in trace_op
self.trace(type, inputs, outputs, attrs,
NotImplementedError: (Unimplemented) Place CUDAPlace(0) is not supported. Please check that your paddle compiles with WITH_GPU, WITH_XPU or WITH_ASCEND_CL option or check that your train process set the correct device id if you use Executor. (at /paddle/paddle/fluid/platform/device_context.cc:88)
[operator < uniform_random > error]
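
One way to narrow down an error like this is to check which device the launcher assigned to each worker process. The sketch below is an assumption about how one might debug it, not something posted in the thread. It only reads environment variables that paddle.distributed.launch is understood to export to its workers (FLAGS_selected_gpus, PADDLE_TRAINER_ID, PADDLE_TRAINERS_NUM), plus CUDA_VISIBLE_DEVICES; the helper name launcher_env is hypothetical.

```python
import os

# Variables commonly set for each worker by paddle.distributed.launch
# (assumed here; verify against your Paddle version).
LAUNCHER_VARS = (
    "CUDA_VISIBLE_DEVICES",
    "FLAGS_selected_gpus",
    "PADDLE_TRAINER_ID",
    "PADDLE_TRAINERS_NUM",
)


def launcher_env(environ=None):
    """Return the launcher-related variables, with None for unset ones."""
    environ = os.environ if environ is None else environ
    return {name: environ.get(name) for name in LAUNCHER_VARS}


if __name__ == "__main__":
    # Run inside a worker (e.g. from train.py) to see its device assignment.
    for name, value in launcher_env().items():
        print(f"{name}={value}")
```

If a worker reports FLAGS_selected_gpus=1 while the script still creates parameters on CUDAPlace(0), that mismatch would be consistent with the error above.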

The log below shows the training output when using GPU = 0:
[2021-08-21 16:08:40,256] [ INFO] - Found /home/johnyan/.paddlenlp/models/ernie-gram-zh/vocab.txt
[2021-08-21 16:08:40,262] [ INFO] - Already cached /home/johnyan/.paddlenlp/models/ernie-gram-zh/ernie_gram_zh.pdparams
W0821 16:08:40.263463 2116 device_context.cc:404] Please NOTE: device: 0, GPU Compute Capability: 8.6, Driver API Version: 11.4, Runtime API Version: 11.2
W0821 16:08:40.276515 2116 device_context.cc:422] device: 0, cuDNN Version: 8.1.
global step 10, epoch: 1, batch: 10, loss: 0.59065, accu: 0.61250, speed: 13.08 step/s
global step 20, epoch: 1, batch: 20, loss: 0.59798, accu: 0.67812, speed: 16.54 step/s
global step 30, epoch: 1, batch: 30, loss: 0.42581, accu: 0.71250, speed: 17.50 step/s
global step 40, epoch: 1, batch: 40, loss: 0.44860, accu: 0.74609, speed: 16.76 step/s
global step 50, epoch: 1, batch: 50, loss: 0.43702, accu: 0.76313, speed: 16.37 step/s
global step 60, epoch: 1, batch: 60, loss: 0.47891, accu: 0.78073, speed: 17.34 step/s
global step 70, epoch: 1, batch: 70, loss: 0.46816, accu: 0.78571, speed: 15.20 step/s
global step 80, epoch: 1, batch: 80, loss: 0.35006, accu: 0.79063, speed: 17.75 step/s
global step 90, epoch: 1, batch: 90, loss: 0.43554, accu: 0.79271, speed: 17.62 step/s
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

I followed the installation tutorial. After import paddle, running paddle.utils.run_check() prints:

Running verify PaddlePaddle program ...
W0821 14:49:41.777092 2595 device_context.cc:404] Please NOTE: device: 0, GPU Compute Capability: 8.6, Driver API Version: 11.4, Runtime API Version: 11.2
W0821 14:49:41.779155 2595 device_context.cc:422] device: 0, cuDNN Version: 8.1 .
PaddlePaddle works well on 1 GPU.
W0821 14:49:43.047895 2595 parallel_executor.cc:601] Cannot enable 'Pe2Pe' access from 0 to 1
W0821 14:49:43.047909 2595 parallel_executor.cc:601] Cannot enable 'Pe2Pe' access from 1 to 0
W0821 14:49:44.378496 2595 fuse_all_reduce_op_pass.cc:76] Find all_reduce operators: 2.
To make the speed faster, some all_reduce ops are fused during training, after fusion, the number of all_reduce ops is 2.
PaddlePaddle works well on 2 GPUs.
PaddlePaddle is installed successfully! Let's start deep learning with PaddlePaddle now.

All comments (7)
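三岁
#2 replied 2021-08

I suggest opening an issue. The community also has multi-GPU tutorials you can refer to, and you can file an issue to report the problem and get it resolved.

Reference: Create project, Script project, Template

issue: https://github.com/PaddlePaddle/Paddle/issues
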
johnyanccer
#3 replied 2021-08
In reply to 三岁 #2

Thanks. I searched earlier posts on the forum and on GitHub and did find the same error reported, but none of them came with a solution.

三岁
#4 replied 2021-08
In reply to johnyanccer #3

Then just file another one. It's Monday today, so they're all back at work.

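johnyanccer
#5 replied 2021-08 (accepted solution)

I added paddle.device.set_device("gpu") to the code and then ran python -m paddle.distributed.launch --gpus 0,1 train.py. There is no error now, both GPUs show memory usage, and the number of batches is halved, so it appears to be running correctly. Many thanks to the engineer for the reply!
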
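The fix described in #5 can be sketched as below. This is a minimal outline under assumptions, not the poster's actual script: the nn.Linear model is a stand-in for the ErnieGram model seen in the traceback, and the init_parallel_env / DataParallel calls follow Paddle's usual dynamic-graph data-parallel pattern rather than anything quoted in the thread. The point it illustrates is that paddle.device.set_device("gpu") runs before any layer is created, presumably so that parameters are allocated on the card assigned to each worker.

```python
def main():
    # Imported inside main() so this file can be imported without Paddle installed.
    import paddle
    import paddle.distributed as dist
    import paddle.nn as nn

    # Select the device BEFORE creating any layer or parameter.
    paddle.device.set_device("gpu")

    # Set up the multi-card communication environment for this worker.
    dist.init_parallel_env()

    # Placeholder model (assumption); the thread's script builds ErnieGram here.
    model = nn.Linear(10, 2)
    model = paddle.DataParallel(model)

    print(f"rank {dist.get_rank()} of {dist.get_world_size()} is ready")
    return model
```

Saved as train.py with a final main() call, it would be started the same way as in the post: python -m paddle.distributed.launch --gpus 0,1 train.py
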
111nvhaizi111
#6 replied 2023-03

Hello, I am running on a server with 4 processors and have the CPU version of paddlepaddle installed. How should I modify this command? python -u -m paddle.distributed.launch --gpus "0" predict.py

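This question went unanswered in the thread. One hedged observation is that a CPU-only build has no GPU to select, so set_device("gpu") cannot apply there. A minimal sketch, assuming the script should simply fall back to the CPU when Paddle was not built with CUDA, might look like this; choose_device and setup_device are hypothetical helper names, while paddle.is_compiled_with_cuda() and paddle.device.set_device() are existing Paddle calls.

```python
def choose_device(compiled_with_cuda):
    """Return the device string to pass to paddle.device.set_device."""
    return "gpu" if compiled_with_cuda else "cpu"


def setup_device():
    # Imported lazily so choose_device() stays usable without Paddle installed.
    import paddle

    device = choose_device(paddle.is_compiled_with_cuda())
    paddle.device.set_device(device)
    return device
```

Whether the --gpus option should be dropped from the launch command on a CPU-only install is not covered in the thread and would need checking against the launcher documentation.
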
lewhy2004
#7 replied 2023-05

Where can I see that the batch count was halved?

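The thread does not answer this directly. A plausible reading, assuming the data loader shards the dataset evenly across cards (as paddle.io.DistributedBatchSampler is designed to do), is that each card processes about 1/N of the batches per epoch, which would show up in the "batch:" counter of the training log. The arithmetic can be sketched as follows; the function name and the sample numbers are illustrative only.

```python
import math


def batches_per_card(num_samples, batch_size, num_cards, drop_last=False):
    """Batches each card processes per epoch when samples are split evenly."""
    # Each card receives ceil(num_samples / num_cards) samples (tail is padded).
    samples_per_card = math.ceil(num_samples / num_cards)
    if drop_last:
        return samples_per_card // batch_size
    return math.ceil(samples_per_card / batch_size)


if __name__ == "__main__":
    # Illustrative numbers, not taken from the thread.
    print(batches_per_card(10000, 32, num_cards=1))  # 313
    print(batches_per_card(10000, 32, num_cards=2))  # 157
```
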
101yang578
#8 replied 2024-03

Hello, where in the code should paddle.device.set_device("gpu") be added? Thank you.