Why is my GPU out of memory?
Out of memory error on GPU 0. Cannot allocate 4.267456GB memory on GPU 0, 27.955811GB memory has been allocated and available memory is only 3.792725GB.
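Your GPU memory is probably exhausted. If you have several Jupyter notebooks open, shut down the ones you are not using or release their GPU memory with a command; otherwise other programs may not have enough GPU memory to run.
But I'm not running any other programs.
Then your batch size may be too large for the available GPU memory.
I've already set the batch size to 1 and it still fails.
Could you post your code so we can take a look?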
2022-07-18 10:13:44 [INFO]
------------Environment Information-------------
platform: Linux-4.15.0-140-generic-x86_64-with-debian-stretch-sid
Python: 3.7.4 (default, Aug 13 2019, 20:35:49) [GCC 7.3.0]
Paddle compiled with cuda: True
NVCC: Cuda compilation tools, release 10.1, V10.1.243
cudnn: 7.6
GPUs used: 1
CUDA_VISIBLE_DEVICES: None
GPU: ['GPU 0: Tesla V100-SXM2-32GB']
GCC: gcc (Ubuntu 7.5.0-3ubuntu1~16.04) 7.5.0
PaddleSeg: 2.4.0
PaddlePaddle: 2.2.2
OpenCV: 4.1.1
------------------------------------------------
2022-07-18 10:13:44 [INFO]
---------------Config Information---------------
batch_size: 2
iters: 40000
loss:
  coef:
  - 1
  - 0.4
  types:
  - ignore_index: 255
    type: CrossEntropyLoss
lr_scheduler:
  end_lr: 1.0e-05
  learning_rate: 0.01
  power: 0.9
  type: PolynomialDecay
model:
  align_corners: false
  backbone:
    output_stride: 8
    pretrained: https://bj.bcebos.com/paddleseg/dygraph/resnet50_vd_ssld_v2.tar.gz
    type: ResNet50_vd
  enable_auxiliary_loss: true
  pretrained: null
  type: PSPNet
optimizer:
  momentum: 0.9
  type: sgd
  weight_decay: 4.0e-05
train_dataset:
  dataset_root: /home/aistudio
  mode: train
  num_classes: 11
  train_path: train.txt
  transforms:
  - max_scale_factor: 2.0
    min_scale_factor: 0.5
    scale_step_size: 0.25
    type: ResizeStepScaling
  - crop_size:
    - 512
    - 512
    type: RandomPaddingCrop
  - type: RandomHorizontalFlip
  - brightness_range: 0.4
    contrast_range: 0.4
    saturation_range: 0.4
    type: RandomDistort
  - type: Normalize
  type: Dataset
val_dataset:
  dataset_root: /home/aistudio
  mode: val
  num_classes: 11
  transforms:
  - type: Normalize
  type: Dataset
  val_path: eval.txt
------------------------------------------------
W0718 10:13:44.540231 883 device_context.cc:447] Please NOTE: device: 0, GPU Compute Capability: 7.0, Driver API Version: 11.2, Runtime API Version: 10.1
W0718 10:13:44.540282 883 device_context.cc:465] device: 0, cuDNN Version: 7.6.
2022-07-18 10:13:47 [INFO] Loading pretrained model from https://bj.bcebos.com/paddleseg/dygraph/resnet50_vd_ssld_v2.tar.gz
Connecting to https://bj.bcebos.com/paddleseg/dygraph/resnet50_vd_ssld_v2.tar.gz
Downloading resnet50_vd_ssld_v2.tar.gz
[==================================================] 100.00%
Uncompress resnet50_vd_ssld_v2.tar.gz
[==================================================] 100.00%
2022-07-18 10:13:56 [INFO] There are 275/275 variables loaded into ResNet_vd.
/opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages/paddle/nn/layer/norm.py:653: UserWarning: When training, we now always track global mean and variance.
"When training, we now always track global mean and variance.")
/opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages/paddle/fluid/dygraph/math_op_patch.py:253: UserWarning: The dtype of left and right variables are not the same, left dtype is paddle.float32, but right dtype is paddle.int64, the right dtype will convert to paddle.float32
format(lhs_dtype, rhs_dtype, lhs_dtype))
2022-07-18 10:13:59 [INFO] [TRAIN] epoch: 1, iter: 10/40000, loss: 5.9374, lr: 0.009998, batch_cost: 0.3095, reader_cost: 0.02927, ips: 6.4623 samples/sec | ETA 03:26:16
2022-07-18 10:14:02 [INFO] [TRAIN] epoch: 1, iter: 20/40000, loss: 5.0971, lr: 0.009996, batch_cost: 0.2788, reader_cost: 0.00504, ips: 7.1730 samples/sec | ETA 03:05:47
2022-07-18 10:14:05 [INFO] [TRAIN] epoch: 1, iter: 30/40000, loss: 4.4511, lr: 0.009993, batch_cost: 0.2886, reader_cost: 0.01262, ips: 6.9295 samples/sec | ETA 03:12:16
2022-07-18 10:14:08 [INFO] [TRAIN] epoch: 1, iter: 40/40000, loss: 4.5184, lr: 0.009991, batch_cost: 0.2888, reader_cost: 0.01298, ips: 6.9248 samples/sec | ETA 03:12:21
2022-07-18 10:14:11 [INFO] [TRAIN] epoch: 1, iter: 50/40000, loss: 3.6415, lr: 0.009989, batch_cost: 0.2878, reader_cost: 0.01424, ips: 6.9499 samples/sec | ETA 03:11:36
2022-07-18 10:14:13 [INFO] [TRAIN] epoch: 1, iter: 60/40000, loss: 4.0224, lr: 0.009987, batch_cost: 0.2933, reader_cost: 0.01799, ips: 6.8182 samples/sec | ETA 03:15:15
2022-07-18 10:14:16 [INFO] [TRAIN] epoch: 1, iter: 70/40000, loss: 4.5293, lr: 0.009984, batch_cost: 0.2794, reader_cost: 0.00488, ips: 7.1581 samples/sec | ETA 03:05:56
2022-07-18 10:14:19 [INFO] [TRAIN] epoch: 1, iter: 80/40000, loss: 4.5741, lr: 0.009982, batch_cost: 0.2966, reader_cost: 0.01943, ips: 6.7441 samples/sec | ETA 03:17:18
2022-07-18 10:14:22 [INFO] [TRAIN] epoch: 1, iter: 90/40000, loss: 5.5846, lr: 0.009980, batch_cost: 0.2913, reader_cost: 0.01670, ips: 6.8651 samples/sec | ETA 03:13:46
2022-07-18 10:14:25 [INFO] [TRAIN] epoch: 1, iter: 100/40000, loss: 5.4736, lr: 0.009978, batch_cost: 0.2967, reader_cost: 0.02022, ips: 6.7416 samples/sec | ETA 03:17:17
2022-07-18 10:14:25 [INFO] Start evaluating (total_samples: 114, total_iters: 114)...
18/114 [===>..........................] - ETA: 4s - batch_cost: 0.0506 - reader cost: 0.00
Traceback (most recent call last):
File "PaddleSeg/train.py", line 199, in <module>
main(args)
File "PaddleSeg/train.py", line 194, in main
to_static_training=cfg.to_static_training)
File "/home/aistudio/PaddleSeg/paddleseg/core/train.py", line 281, in train
model, val_dataset, num_workers=num_workers, **test_config)
File "/home/aistudio/PaddleSeg/paddleseg/core/val.py", line 123, in evaluate
crop_size=crop_size)
File "/home/aistudio/PaddleSeg/paddleseg/core/infer.py", line 232, in inference
logits = model(im)
File "/opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages/paddle/fluid/dygraph/layers.py", line 917, in __call__
return self._dygraph_call_func(*inputs, **kwargs)
File "/opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages/paddle/fluid/dygraph/layers.py", line 907, in _dygraph_call_func
outputs = self.forward(*inputs, **kwargs)
File "/home/aistudio/PaddleSeg/paddleseg/models/pspnet.py", line 69, in forward
feat_list = self.backbone(x)
File "/opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages/paddle/fluid/dygraph/layers.py", line 917, in __call__
return self._dygraph_call_func(*inputs, **kwargs)
File "/opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages/paddle/fluid/dygraph/layers.py", line 907, in _dygraph_call_func
outputs = self.forward(*inputs, **kwargs)
File "/home/aistudio/PaddleSeg/paddleseg/models/backbones/resnet_vd.py", line 359, in forward
y = block(y)
File "/opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages/paddle/fluid/dygraph/layers.py", line 917, in __call__
return self._dygraph_call_func(*inputs, **kwargs)
File "/opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages/paddle/fluid/dygraph/layers.py", line 907, in _dygraph_call_func
outputs = self.forward(*inputs, **kwargs)
File "/home/aistudio/PaddleSeg/paddleseg/models/backbones/resnet_vd.py", line 138, in forward
y = self.relu(y)
File "/opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages/paddle/fluid/dygraph/layers.py", line 917, in __call__
return self._dygraph_call_func(*inputs, **kwargs)
File "/opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages/paddle/fluid/dygraph/layers.py", line 907, in _dygraph_call_func
outputs = self.forward(*inputs, **kwargs)
File "/home/aistudio/PaddleSeg/paddleseg/models/layers/activation.py", line 71, in forward
return self.act_func(x)
File "/opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages/paddle/fluid/dygraph/layers.py", line 917, in __call__
return self._dygraph_call_func(*inputs, **kwargs)
File "/opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages/paddle/fluid/dygraph/layers.py", line 907, in _dygraph_call_func
outputs = self.forward(*inputs, **kwargs)
File "/opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages/paddle/nn/layer/activation.py", line 430, in forward
return F.relu(x, self._name)
File "/opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages/paddle/nn/functional/activation.py", line 520, in relu
return _C_ops.relu(x)
SystemError: (Fatal) Operator relu raises an paddle::memory::allocation::BadAlloc exception.
The exception content is
:ResourceExhaustedError:
Out of memory error on GPU 0. Cannot allocate 4.267456GB memory on GPU 0, 28.054688GB memory has been allocated and available memory is only 3.693848GB.
Please check whether there is any other process using GPU 0.
1. If yes, please stop them, or start PaddlePaddle on another GPU.
2. If no, please decrease the batch size of your model.
(at /paddle/paddle/fluid/memory/allocation/cuda_allocator.cc:79)
. (at /paddle/paddle/fluid/imperative/tracer.cc:221)
Probably just not enough GPU memory?
2022-07-18 10:14:25 [INFO] Start evaluating (total_samples: 114, total_iters: 114)...
18/114 [===>..........................] - ETA: 4s - batch_cost: 0.0506 - reader cost: 0.00
Traceback (most recent call last):
What batch size did you set for the validation set?
2
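Your GPU has 32 GB of memory, so with a batch size of 1 or 2 you should not be using 27 GB. If you are not the only person using this GPU, check whether someone else is running something on it.
How do I check whether someone else is using it?
You can run nvidia-smi to see how much GPU memory each process is using.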
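For anyone who wants to do that check from Python, here is a minimal sketch, assuming `nvidia-smi` is on the PATH. The `--query-compute-apps` and `--format=csv,noheader,nounits` options are standard `nvidia-smi` flags; the helper names `parse_compute_apps` and `gpu_processes` are made up for this example.

```python
import subprocess


def parse_compute_apps(csv_text):
    """Parse the CSV printed by
    `nvidia-smi --query-compute-apps=pid,process_name,used_memory
                --format=csv,noheader,nounits`
    into a list of dicts, largest memory user first."""
    rows = []
    for line in csv_text.strip().splitlines():
        if line.count(",") < 2:
            continue  # skip blank or unexpected lines
        # pid is the first field and used_memory the last; the name is in between
        pid, rest = line.split(",", 1)
        name, used = rest.rsplit(",", 1)
        try:
            used_mib = int(used.strip())
        except ValueError:
            used_mib = None  # some environments report "[N/A]" here
        rows.append({"pid": int(pid.strip()), "name": name.strip(), "used_mib": used_mib})
    return sorted(rows, key=lambda r: r["used_mib"] or 0, reverse=True)


def gpu_processes():
    """Run nvidia-smi and return the compute processes it reports."""
    result = subprocess.run(
        ["nvidia-smi",
         "--query-compute-apps=pid,process_name,used_memory",
         "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    )
    return parse_compute_apps(result.stdout)
```

Calling `gpu_processes()` on the training machine should list every process holding GPU memory. If the only PID shown is your own training job, nobody else is using the card.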
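Take a look.
What is the resolution of your validation images, and are they all the same size? It looks like one particularly large image may be what blew up the GPU memory.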
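To see why a single large image can matter so much: in the config above, training images are cropped to 512x512 by RandomPaddingCrop, but the val_dataset transforms contain only Normalize, so validation images go through the network at full size. The rough arithmetic below is only a sketch. The 256-channel feature map at 1/4 resolution and the 4096x4096 image are illustrative assumptions, not values taken from the log.

```python
def activation_gib(batch, channels, height, width, bytes_per_elem=4):
    """Size in GiB of one dense float32 activation tensor of shape (N, C, H, W)."""
    return batch * channels * height * width * bytes_per_elem / 2**30


# One feature map at 1/4 of the input resolution with 256 channels (assumed values)
small = activation_gib(1, 256, 512 // 4, 512 // 4)     # 512x512 training crop
large = activation_gib(1, 256, 4096 // 4, 4096 // 4)   # hypothetical 4096x4096 image

print(f"512x512 input:   {small:.4f} GiB per feature map")
print(f"4096x4096 input: {large:.4f} GiB per feature map")
```

Memory per feature map grows with the pixel count, so an image 8 times larger on each side needs 64 times the memory for every such tensor, and the network holds many of them at once.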
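Thanks for your reply. The validation set did contain a few very high-resolution images; after I resized them to be smaller, it ran fine.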
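A small sketch for finding such images before training, assuming the file list has one "image_path label_path" pair per line separated by whitespace and that Pillow is installed. The function names and the 1024x1024 pixel threshold are arbitrary choices for this example.

```python
import os


def find_oversized(sizes, max_pixels=1024 * 1024):
    """Return the (path, width, height) entries whose pixel count exceeds
    max_pixels, largest first."""
    big = [(p, w, h) for p, w, h in sizes if w * h > max_pixels]
    return sorted(big, key=lambda t: t[1] * t[2], reverse=True)


def read_sizes(list_file, dataset_root):
    """Collect (path, width, height) for every image named in a file list."""
    from PIL import Image  # imported here so find_oversized works without Pillow

    sizes = []
    with open(list_file) as f:
        for line in f:
            parts = line.split()
            if not parts:
                continue  # skip empty lines
            path = os.path.join(dataset_root, parts[0])
            with Image.open(path) as im:
                width, height = im.size  # Pillow reports (width, height)
            sizes.append((path, width, height))
    return sizes
```

With the paths from the config above, something like `find_oversized(read_sizes("eval.txt", "/home/aistudio"))` would list the images worth resizing. Adjust the list file path to wherever eval.txt actually lives.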
Hello, my dataset has 11 classes, but some of them have very few samples. How can I add custom per-class weights to the cross-entropy loss?
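Nobody answered this in the thread, so the following is only a hedged sketch of one common way to choose such weights: inverse class frequency, normalized so the average weight is 1. The pixel counts are made up. As far as I know, `paddle.nn.functional.cross_entropy` accepts a `weight` argument and PaddleSeg's `CrossEntropyLoss` takes a `weight` list next to `ignore_index`, but check this against the PaddleSeg 2.4.0 source you have installed.

```python
def inverse_frequency_weights(pixel_counts):
    """Per-class weights proportional to 1 / frequency, scaled so that the
    classes that actually occur have a mean weight of 1.
    Classes with zero pixels get weight 0."""
    total = sum(pixel_counts)
    if total <= 0:
        raise ValueError("pixel_counts must contain at least one positive count")
    raw = [total / c if c > 0 else 0.0 for c in pixel_counts]
    present = [w for w in raw if w > 0]
    mean = sum(present) / len(present)
    return [w / mean for w in raw]


# Hypothetical pixel counts for 11 classes (not taken from the thread)
counts = [900000, 50000, 40000, 30000, 20000, 10000, 8000, 6000, 4000, 2000, 1000]
weights = inverse_frequency_weights(counts)
print([round(w, 3) for w in weights])
```

The resulting list has one entry per class, in class-index order, and would be passed as the `weight` of the loss. Rare classes get larger weights, so mistakes on them cost more.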
Whoa, nice one (666).
Nothing much, just padding out the thread.