PaddleDetection training error: available memory is only 0.000000B.
While using PaddleDetection, I tried to train a model and got this error:
:ResourceExhaustedError:
Out of memory error on GPU 0. Cannot allocate 20.187744MB memory on GPU 0, 5.999512GB memory has been allocated and available memory is only 0.000000B.
The full command and output are as follows:
(paddle) PS D:\PythonProjects\PaddleDetection-release-2.3> python -m paddle.distributed.launch --log_dir=./jde_darknet53_30e_1088x608 --gpus 0 tools/train.py -c configs/mot/jde/jde_darknet53_30e_1088x608.yml
----------- Configuration Arguments -----------
backend: auto
elastic_server: None
force: False
gpus: 0
heter_worker_num: None
heter_workers:
host: None
http_port: None
ips: 127.0.0.1
job_id: None
log_dir: ./jde_darknet53_30e_1088x608
np: None
nproc_per_node: None
run_mode: None
scale: 0
server_num: None
servers:
training_script: tools/train.py
training_script_args: ['-c', 'configs/mot/jde/jde_darknet53_30e_1088x608.yml']
worker_num: None
workers:
------------------------------------------------
WARNING 2021-11-21 13:58:57,274 launch.py:416] Not found distinct arguments and compiled with cuda or xpu. Default use collective mode
launch train in GPU mode!
INFO 2021-11-21 13:58:57,276 launch_utils.py:527] Local start 1 processes. First process distributed environment info (Only For Debug):
+=======================================================================================+
| Distributed Envs Value |
+---------------------------------------------------------------------------------------+
| PADDLE_TRAINER_ID 0 |
| PADDLE_CURRENT_ENDPOINT 127.0.0.1:55300 |
| PADDLE_TRAINERS_NUM 1 |
| PADDLE_TRAINER_ENDPOINTS 127.0.0.1:55300 |
| PADDLE_RANK_IN_NODE 0 |
| PADDLE_LOCAL_DEVICE_IDS 0 |
| PADDLE_WORLD_DEVICE_IDS 0 |
| FLAGS_selected_gpus 0 |
| FLAGS_selected_accelerators 0 |
+=======================================================================================+
INFO 2021-11-21 13:58:57,276 launch_utils.py:531] details abouts PADDLE_TRAINER_ENDPOINTS can be found in ./jde_darknet53_30e_1088x608/endpoints.log, and detail running logs maybe found in ./jde_darknet53_30e_1088x608/workerlog.0
The syntax of the command is incorrect.
'rm' is not recognized as an internal or external command,
operable program or batch file.
launch proc_id:18568 idx:0
D:\App\Anaconda\envs\paddle\lib\site-packages\paddle\tensor\creation.py:130: DeprecationWarning: `np.object` is a deprecated alias for the builtin `object`. To silence this warning, use `object` by itself. Doing this will not modify any behavior and is safe.
Deprecated in NumPy 1.20; for more details and guidance: https://numpy.org/devdocs/release/1.20.0-notes.html#deprecations
if data.dtype == np.object:
[11/21 13:59:01] ppdet.data.source.mot INFO: MOT dataset summary:
[11/21 13:59:01] ppdet.data.source.mot INFO: OrderedDict([('mot17.train', 1639)])
[11/21 13:59:01] ppdet.data.source.mot INFO: Total images: 5316
[11/21 13:59:01] ppdet.data.source.mot INFO: Image start index: OrderedDict([('mot17.train', 0)])
[11/21 13:59:01] ppdet.data.source.mot INFO: Total identities: 1640
[11/21 13:59:01] ppdet.data.source.mot INFO: Identity start index: OrderedDict([('mot17.train', 0)])
W1121 13:59:03.767341 18368 device_context.cc:447] Please NOTE: device: 0, GPU Compute Capability: 8.6, Driver API Version: 11.5, Runtime API Version: 11.0
W1121 13:59:03.780344 18368 device_context.cc:465] device: 0, cuDNN Version: 8.2.
D:\App\Anaconda\envs\paddle\lib\site-packages\paddle\tensor\creation.py:130: DeprecationWarning: `np.object` is a deprecated alias for the builtin `object`. To silence this warning, use `object` by itself. Doing this will not modify any behavior and is safe.
Deprecated in NumPy 1.20; for more details and guidance: https://numpy.org/devdocs/release/1.20.0-notes.html#deprecations
if data.dtype == np.object:
[11/21 13:59:07] ppdet.utils.checkpoint INFO: Finish loading model weights: C:\Users\wenk/.cache/paddle/weights\DarkNet53_pretrained.pdparams
Traceback (most recent call last):
File "tools/train.py", line 172, in <module>
main()
File "tools/train.py", line 168, in main
run(FLAGS, cfg)
File "tools/train.py", line 128, in run
trainer.train(FLAGS.eval)
File "D:\PythonProjects\PaddleDetection-release-2.3\ppdet\engine\trainer.py", line 393, in train
outputs = model(data)
File "D:\App\Anaconda\envs\paddle\lib\site-packages\paddle\fluid\dygraph\layers.py", line 914, in __call__
outputs = self.forward(*inputs, **kwargs)
File "D:\PythonProjects\PaddleDetection-release-2.3\ppdet\modeling\architectures\meta_arch.py", line 54, in forward
out = self.get_loss()
File "D:\PythonProjects\PaddleDetection-release-2.3\ppdet\modeling\architectures\jde.py", line 120, in get_loss
return self._forward()
File "D:\PythonProjects\PaddleDetection-release-2.3\ppdet\modeling\architectures\jde.py", line 70, in _forward
det_outs = self.detector(self.inputs)
File "D:\App\Anaconda\envs\paddle\lib\site-packages\paddle\fluid\dygraph\layers.py", line 914, in __call__
outputs = self.forward(*inputs, **kwargs)
File "D:\PythonProjects\PaddleDetection-release-2.3\ppdet\modeling\architectures\meta_arch.py", line 54, in forward
out = self.get_loss()
File "D:\PythonProjects\PaddleDetection-release-2.3\ppdet\modeling\architectures\yolo.py", line 121, in get_loss
return self._forward()
File "D:\PythonProjects\PaddleDetection-release-2.3\ppdet\modeling\architectures\yolo.py", line 79, in _forward
body_feats = self.backbone(self.inputs)
File "D:\App\Anaconda\envs\paddle\lib\site-packages\paddle\fluid\dygraph\layers.py", line 914, in __call__
outputs = self.forward(*inputs, **kwargs)
File "D:\PythonProjects\PaddleDetection-release-2.3\ppdet\modeling\backbones\darknet.py", line 329, in forward
out = conv_block_i(out)
File "D:\App\Anaconda\envs\paddle\lib\site-packages\paddle\fluid\dygraph\layers.py", line 914, in __call__
outputs = self.forward(*inputs, **kwargs)
File "D:\PythonProjects\PaddleDetection-release-2.3\ppdet\modeling\backbones\darknet.py", line 230, in forward
y = basic_block_i(y)
File "D:\App\Anaconda\envs\paddle\lib\site-packages\paddle\fluid\dygraph\layers.py", line 914, in __call__
outputs = self.forward(*inputs, **kwargs)
File "D:\PythonProjects\PaddleDetection-release-2.3\ppdet\modeling\backbones\darknet.py", line 174, in forward
conv1 = self.conv1(inputs)
File "D:\App\Anaconda\envs\paddle\lib\site-packages\paddle\fluid\dygraph\layers.py", line 914, in __call__
outputs = self.forward(*inputs, **kwargs)
File "D:\PythonProjects\PaddleDetection-release-2.3\ppdet\modeling\backbones\darknet.py", line 79, in forward
out = F.leaky_relu(out, 0.1)
File "D:\App\Anaconda\envs\paddle\lib\site-packages\paddle\nn\functional\activation.py", line 380, in leaky_relu
return _C_ops.leaky_relu(x, 'alpha', negative_slope)
SystemError: (Fatal) Operator leaky_relu raises an struct paddle::memory::allocation::BadAlloc exception.
The exception content is
:ResourceExhaustedError:
Out of memory error on GPU 0. Cannot allocate 20.187744MB memory on GPU 0, 5.999512GB memory has been allocated and available memory is only 0.000000B.
Please check whether there is any other process using GPU 0.
1. If yes, please stop them, or start PaddlePaddle on another GPU.
2. If no, please decrease the batch size of your model.
(at ..\paddle\fluid\memory\allocation\cuda_allocator.cc:79)
. (at ..\paddle\fluid\imperative\tracer.cc:221)
INFO 2021-11-21 13:59:24,436 launch_utils.py:340] terminate all the procs
ERROR 2021-11-21 13:59:24,436 launch_utils.py:603] ABORT!!! Out of all 1 trainers, the trainer process with rank=[0] was aborted. Please check its log.
INFO 2021-11-21 13:59:27,448 launch_utils.py:340] terminate all the procs
INFO 2021-11-21 13:59:27,448 launch.py:304] Local processes completed.
Your batch size is probably too large. Try reducing it.
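One way to try a smaller batch size without editing the YAML files is to override it on the command line. The sketch below only builds the command. It assumes that `tools/train.py` accepts `-o key=value` overrides and that the reader section in this config is named `TrainReader`; check both against your PaddleDetection version. `build_train_command` and `launch` are hypothetical helper names used for illustration.

```python
import subprocess
import sys


def build_train_command(config, batch_size):
    """Return the training command with the batch size overridden via -o."""
    return [
        sys.executable, "tools/train.py",
        "-c", config,
        # Assumption: the training reader section is called TrainReader.
        "-o", f"TrainReader.batch_size={batch_size}",
    ]


def launch(cmd):
    """Run the command; call this from the PaddleDetection root directory."""
    subprocess.run(cmd, check=True)


cmd = build_train_command(
    "configs/mot/jde/jde_darknet53_30e_1088x608.yml", batch_size=1
)
print(" ".join(cmd))
```

If the error persists even with a batch size of 1, the 1088x608 input on a GPU with about 6 GB may simply not fit, or another process may already be holding the memory.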
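I tried changing the batch size, but I still get the same error. The main issue seems to be this "available memory is only 0.000000B."
I searched on Baidu and found nothing.
Other people's out-of-memory errors at least show some available memory left, but mine is always 0B.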
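To read the numbers in the message more easily, the small sketch below extracts the requested, allocated, and available amounts and converts them to bytes. It assumes the message format shown in the log above and that the units are binary (1 MB = 1024 * 1024 bytes); `parse_oom` is a hypothetical helper, not part of Paddle.

```python
import re

# Assumption: the units in the message are binary multiples.
UNITS = {"B": 1, "KB": 1024, "MB": 1024 ** 2, "GB": 1024 ** 3}

PATTERN = re.compile(
    r"Cannot allocate (?P<req>[\d.]+)(?P<req_u>[KMG]?B) memory on GPU (?P<gpu>\d+), "
    r"(?P<used>[\d.]+)(?P<used_u>[KMG]?B) memory has been allocated and "
    r"available memory is only (?P<free>[\d.]+)(?P<free_u>[KMG]?B)"
)


def parse_oom(message):
    """Return the sizes from a Paddle OOM message in bytes, or None if no match."""
    m = PATTERN.search(message)
    if m is None:
        return None

    def to_bytes(value, unit):
        return int(round(float(value) * UNITS[unit]))

    return {
        "gpu": int(m["gpu"]),
        "requested": to_bytes(m["req"], m["req_u"]),
        "allocated": to_bytes(m["used"], m["used_u"]),
        "available": to_bytes(m["free"], m["free_u"]),
    }


msg = ("Out of memory error on GPU 0. Cannot allocate 20.187744MB memory on GPU 0, "
       "5.999512GB memory has been allocated and available memory is only 0.000000B.")
info = parse_oom(msg)
print(info)
```

Read this way, the message says that roughly 6 GB is already allocated on GPU 0, which is likely the whole card, so even a 20 MB request fails. That alone does not tell whether this process or another one holds the memory.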
help
I have exactly the same problem. Did you manage to solve it?
Has this been solved? I am seeing exactly the same 0.000000B.
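The error message itself suggests checking whether another process is using GPU 0. A minimal sketch of that check, assuming an NVIDIA driver with `nvidia-smi` on the PATH, is to query the used and total memory per GPU before starting training. The helper names `parse_gpu_rows` and `query_gpus` are hypothetical.

```python
import shutil
import subprocess

QUERY = [
    "nvidia-smi",
    "--query-gpu=index,memory.used,memory.total",
    "--format=csv,noheader,nounits",
]


def parse_gpu_rows(text):
    """Parse 'index, used, total' CSV rows (values in MiB) into dicts."""
    rows = []
    for line in text.strip().splitlines():
        index, used, total = (field.strip() for field in line.split(","))
        rows.append({
            "gpu": int(index),
            "used_mib": int(used),
            "total_mib": int(total),
            "free_mib": int(total) - int(used),
        })
    return rows


def query_gpus():
    """Return per-GPU memory usage, or None if nvidia-smi is unavailable."""
    if shutil.which("nvidia-smi") is None:
        return None
    try:
        out = subprocess.run(QUERY, capture_output=True, text=True, check=True)
    except subprocess.CalledProcessError:
        return None
    return parse_gpu_rows(out.stdout)


if __name__ == "__main__":
    gpus = query_gpus()
    if gpus is None:
        print("nvidia-smi not available")
    else:
        for g in gpus:
            print(f"GPU {g['gpu']}: {g['used_mib']} / {g['total_mib']} MiB used")
```

If most of the memory is already in use before training starts, close the other processes (for example a leftover Python process from an earlier run) and try again. If the GPU is nearly empty beforehand, the model and input size are the more likely cause.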