Training a model in the AI Studio environment fails with OSError: (External) CUBLAS error(7).

aistudio@jupyter-783833-4012746:~/work/PaddleDetection$ python -m paddle.distributed.launch --gpus 0 tools/train.py -c configs/ppyoloe/ppyoloe_test.yml
----------- Configuration Arguments -----------
backend: auto
elastic_server: None
force: False
gpus: 0
heter_devices:
heter_worker_num: None
heter_workers:
host: None
http_port: None
ips: 127.0.0.1
job_id: None
log_dir: log
np: None
nproc_per_node: None
run_mode: None
scale: 0
server_num: None
servers:
training_script: tools/train.py
training_script_args: ['-c', 'configs/ppyoloe/ppyoloe_test.yml']
worker_num: None
workers:
------------------------------------------------
WARNING 2022-05-15 17:15:13,840 launch.py:423] Not found distinct arguments and compiled with cuda or xpu. Default use collective mode
launch train in GPU mode!
INFO 2022-05-15 17:15:13,842 launch_utils.py:528] Local start 1 processes. First process distributed environment info (Only For Debug):
+=======================================================================================+
| Distributed Envs Value |
+---------------------------------------------------------------------------------------+
| PADDLE_TRAINER_ID 0 |
| PADDLE_CURRENT_ENDPOINT 127.0.0.1:41631 |
| PADDLE_TRAINERS_NUM 1 |
| PADDLE_TRAINER_ENDPOINTS 127.0.0.1:41631 |
| PADDLE_RANK_IN_NODE 0 |
| PADDLE_LOCAL_DEVICE_IDS 0 |
| PADDLE_WORLD_DEVICE_IDS 0 |
| FLAGS_selected_gpus 0 |
| FLAGS_selected_accelerators 0 |
+=======================================================================================+

INFO 2022-05-15 17:15:13,842 launch_utils.py:532] details abouts PADDLE_TRAINER_ENDPOINTS can be found in log/endpoints.log, and detail running logs maybe found in log/workerlog.0
launch proc_id:9147 idx:0
loading annotations into memory...
Done (t=0.54s)
creating index...
index created!
[05/15 17:15:17] ppdet.data.source.coco WARNING: Found an invalid bbox in annotations: im_id: 6918, area: 0.0 x1: 917, y1: 284, x2: 922, y2: 284.
W0515 17:15:18.745193 9147 device_context.cc:447] Please NOTE: device: 0, GPU Compute Capability: 8.0, Driver API Version: 11.2, Runtime API Version: 11.2
W0515 17:15:18.748642 9147 device_context.cc:465] device: 0, cuDNN Version: 8.2.
[05/15 17:15:21] ppdet.utils.checkpoint INFO: The shape [80] in pretrained weight yolo_head.pred_cls.0.bias is unmatched with the shape [7] in model yolo_head.pred_cls.0.bias. And the weight yolo_head.pred_cls.0.bias will not be loaded
[05/15 17:15:21] ppdet.utils.checkpoint INFO: The shape [80, 576, 3, 3] in pretrained weight yolo_head.pred_cls.0.weight is unmatched with the shape [7, 576, 3, 3] in model yolo_head.pred_cls.0.weight. And the weight yolo_head.pred_cls.0.weight will not be loaded
[05/15 17:15:21] ppdet.utils.checkpoint INFO: The shape [80] in pretrained weight yolo_head.pred_cls.1.bias is unmatched with the shape [7] in model yolo_head.pred_cls.1.bias. And the weight yolo_head.pred_cls.1.bias will not be loaded
[05/15 17:15:21] ppdet.utils.checkpoint INFO: The shape [80, 288, 3, 3] in pretrained weight yolo_head.pred_cls.1.weight is unmatched with the shape [7, 288, 3, 3] in model yolo_head.pred_cls.1.weight. And the weight yolo_head.pred_cls.1.weight will not be loaded
[05/15 17:15:21] ppdet.utils.checkpoint INFO: The shape [80] in pretrained weight yolo_head.pred_cls.2.bias is unmatched with the shape [7] in model yolo_head.pred_cls.2.bias. And the weight yolo_head.pred_cls.2.bias will not be loaded
[05/15 17:15:21] ppdet.utils.checkpoint INFO: The shape [80, 144, 3, 3] in pretrained weight yolo_head.pred_cls.2.weight is unmatched with the shape [7, 144, 3, 3] in model yolo_head.pred_cls.2.weight. And the weight yolo_head.pred_cls.2.weight will not be loaded
[05/15 17:15:21] ppdet.utils.checkpoint INFO: Finish loading model weights: /home/aistudio/.cache/paddle/weights/ppyoloe_crn_m_300e_coco.pdparams
[05/15 17:15:27] ppdet.engine INFO: Epoch: [0] [ 0/583] learning_rate: 0.000005 loss: 4.545620 loss_cls: 3.359301 loss_iou: 0.283135 loss_dfl: 0.956964 loss_l1: 0.481069 eta: 11 days, 9:39:17 batch_cost: 5.6327 data_cost: 3.1799 ips: 4.2608 images/s
Traceback (most recent call last):
  File "tools/train.py", line 177, in <module>
    main()
  File "tools/train.py", line 173, in main
    run(FLAGS, cfg)
  File "tools/train.py", line 127, in run
    trainer.train(FLAGS.eval)
  File "/home/aistudio/work/PaddleDetection/ppdet/engine/trainer.py", line 425, in train
    outputs = model(data)
  File "/opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages/paddle/fluid/dygraph/layers.py", line 917, in __call__
    return self._dygraph_call_func(*inputs, **kwargs)
  File "/opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages/paddle/fluid/dygraph/layers.py", line 907, in _dygraph_call_func
    outputs = self.forward(*inputs, **kwargs)
  File "/home/aistudio/work/PaddleDetection/ppdet/modeling/architectures/meta_arch.py", line 58, in forward
    out = self.get_loss()
  File "/home/aistudio/work/PaddleDetection/ppdet/modeling/architectures/yolo.py", line 125, in get_loss
    return self._forward()
  File "/home/aistudio/work/PaddleDetection/ppdet/modeling/architectures/yolo.py", line 88, in _forward
    yolo_losses = self.yolo_head(neck_feats, self.inputs)
  File "/opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages/paddle/fluid/dygraph/layers.py", line 917, in __call__
    return self._dygraph_call_func(*inputs, **kwargs)
  File "/opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages/paddle/fluid/dygraph/layers.py", line 907, in _dygraph_call_func
    outputs = self.forward(*inputs, **kwargs)
  File "/home/aistudio/work/PaddleDetection/ppdet/modeling/heads/ppyoloe_head.py", line 214, in forward
    return self.forward_train(feats, targets)
  File "/home/aistudio/work/PaddleDetection/ppdet/modeling/heads/ppyoloe_head.py", line 157, in forward_train
    ], targets)
  File "/home/aistudio/work/PaddleDetection/ppdet/modeling/heads/ppyoloe_head.py", line 301, in get_loss
    pred_bboxes = self._bbox_decode(anchor_points_s, pred_distri)
  File "/home/aistudio/work/PaddleDetection/ppdet/modeling/heads/ppyoloe_head.py", line 238, in _bbox_decode
    ])).matmul(self.proj)
  File "/opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages/paddle/tensor/linalg.py", line 139, in matmul
    return op(x, y, 'trans_x', transpose_x, 'trans_y', transpose_y)
OSError: (External) CUBLAS error(7).
  [Hint: 'CUBLAS_STATUS_INVALID_VALUE'. An unsupported value or parameter was passed to the function (a negative vector size, for example). To correct: ensure that all the parameters being passed have valid values. ] (at /paddle/paddle/fluid/operators/math/blas_impl.cu.h:55)
  [operator < matmul_v2 > error]
INFO 2022-05-15 17:15:47,900 launch_utils.py:341] terminate all the procs
ERROR 2022-05-15 17:15:47,900 launch_utils.py:604] ABORT!!! Out of all 1 trainers, the trainer process with rank=[0] was aborted. Please check its log.
INFO 2022-05-15 17:15:51,904 launch_utils.py:341] terminate all the procs
INFO 2022-05-15 17:15:51,904 launch.py:311] Local processes completed.
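
The traceback above ends in `_bbox_decode`, on a statement that finishes with `.matmul(self.proj)`. The NumPy sketch below is only an assumption about what that step computes: a softmax over per-side distance bins followed by a product with a 1-D vector of bin values. The value `reg_max=16`, the tensor shapes, and the helper names `softmax` and `decode_distances` are illustrative and do not come from the log. Running something like this on CPU may help confirm that the shapes themselves are sane before suspecting the GPU library.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along `axis`."""
    shifted = x - x.max(axis=axis, keepdims=True)
    e = np.exp(shifted)
    return e / e.sum(axis=axis, keepdims=True)


def decode_distances(pred_dist, reg_max=16):
    """Turn per-side bin logits into expected distances.

    pred_dist: array of shape (batch, num_anchors, 4 * (reg_max + 1)).
    Returns an array of shape (batch, num_anchors, 4).
    """
    b, l, _ = pred_dist.shape
    # Assumed projection vector: bin values 0, 1, ..., reg_max.
    proj = np.linspace(0, reg_max, reg_max + 1)
    probs = softmax(pred_dist.reshape(b, l, 4, reg_max + 1), axis=-1)
    # (b, l, 4, reg_max + 1) @ (reg_max + 1,) -> (b, l, 4)
    return probs @ proj


# Hypothetical input: 2 images, 5 anchors, all-zero logits.
pred = np.zeros((2, 5, 4 * 17))
dist = decode_distances(pred)
print(dist.shape)
```

With all-zero logits every bin is equally likely, so each decoded distance is the mean of the bin values. If the equivalent computation works on CPU with the real tensors, the shapes are probably not the problem.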

All comments (2)
zhujiehaode
#2 Replied 2022-05

Try a different Paddle version.
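
Before switching versions, it may be worth recording what is currently installed. The log reports a GPU with compute capability 8.0 and driver/runtime API version 11.2, so one thing to compare is the CUDA version the installed Paddle wheel was built with. The sketch below is a minimal check, not a diagnosis; the helper name `describe_paddle_env` is hypothetical, and it falls back gracefully when Paddle is not importable.

```python
import importlib


def describe_paddle_env():
    """Collect the installed Paddle version and the CUDA/cuDNN versions it was built with."""
    try:
        paddle = importlib.import_module("paddle")
    except ImportError:
        # Paddle is not installed in this interpreter.
        return {"installed": False}
    return {
        "installed": True,
        "paddle": paddle.__version__,
        "built_with_cuda": paddle.version.cuda(),
        "built_with_cudnn": paddle.version.cudnn(),
        "compiled_with_cuda": paddle.is_compiled_with_cuda(),
    }


info = describe_paddle_env()
for key, value in info.items():
    print(f"{key}: {value}")
```

If the reported build does not match the environment shown in the log, installing a wheel built for the matching CUDA version is one thing to try. `paddle.utils.run_check()` can then be used to verify that the new installation runs on the GPU.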

AIStudio790555
#4 Replied 2022-09

Has this been resolved?
