Error when using spawn for single-machine multi-GPU training on AI Studio

[INFO]: current net device: eth0, ip: 172.28.1.98
[INFO]: paddle job envs:
POD_IP=job-725165ef608a14a778ee2072ce3a3408-trainer-0.job-725165ef608a14a778ee2072ce3a3408
PADDLE_PORT=12345
PADDLE_TRAINER_ID=0
PADDLE_TRAINERS_NUM=1
PADDLE_USE_CUDA=1
NCCL_SOCKET_IFNAME=eth0
PADDLE_IS_LOCAL=1
OUTPUT_PATH=/root/paddlejob/workspace/output
LOCAL_LOG_PATH=/root/paddlejob/workspace/log
LOCAL_MOUNT_PATH=/mnt/code_20211113175010,/mnt/datasets_20211113175010
JOB_ID=job-725165ef608a14a778ee2072ce3a3408
TRAINING_ROLE=TRAINER
[INFO]: user command: python run.py
[INFO]: start trainer
~/paddlejob/workspace/code /mnt
W1113 17:50:13.102766 223 device_context.cc:447] Please NOTE: device: 0, GPU Compute Capability: 7.0, Driver API Version: 10.1, Runtime API Version: 10.1
W1113 17:50:13.107898 223 device_context.cc:465] device: 0, cuDNN Version: 7.6.
model done
train_set done
eval_set done
W1113 17:50:38.728013 391 device_context.cc:447] Please NOTE: device: 0, GPU Compute Capability: 7.0, Driver API Version: 10.1, Runtime API Version: 10.1
W1113 17:50:38.732640 391 device_context.cc:465] device: 0, cuDNN Version: 7.6.
W1113 17:50:38.935602 392 device_context.cc:447] Please NOTE: device: 0, GPU Compute Capability: 7.0, Driver API Version: 10.1, Runtime API Version: 10.1
W1113 17:50:38.940632 392 device_context.cc:465] device: 0, cuDNN Version: 7.6.
W1113 17:50:39.245048 394 device_context.cc:447] Please NOTE: device: 0, GPU Compute Capability: 7.0, Driver API Version: 10.1, Runtime API Version: 10.1
W1113 17:50:39.250128 394 device_context.cc:465] device: 0, cuDNN Version: 7.6.
W1113 17:50:39.460613 395 device_context.cc:447] Please NOTE: device: 0, GPU Compute Capability: 7.0, Driver API Version: 10.1, Runtime API Version: 10.1
W1113 17:50:39.465711 395 device_context.cc:465] device: 0, cuDNN Version: 7.6.
model done
train_set done
eval_set done
The loss value printed in the log is the current step, and the metric is the average value of previous steps.
Epoch 1/200
model done
train_set done
eval_set done
model done
train_set done
eval_set done
model done
train_set done
eval_set done
Traceback (most recent call last):
  File "run.py", line 73, in <module>
    dist.spawn(train)
  File "/opt/_internal/cpython-3.7.0/lib/python3.7/site-packages/paddle/distributed/spawn.py", line 564, in spawn
    while not context.join():
  File "/opt/_internal/cpython-3.7.0/lib/python3.7/site-packages/paddle/distributed/spawn.py", line 373, in join
    self._throw_exception(error_index)
  File "/opt/_internal/cpython-3.7.0/lib/python3.7/site-packages/paddle/distributed/spawn.py", line 391, in _throw_exception
    raise Exception(msg)
Exception:

----------------------------------------------
Process 0 terminated with the following error:
----------------------------------------------

Traceback (most recent call last):
  File "/opt/_internal/cpython-3.7.0/lib/python3.7/site-packages/paddle/distributed/spawn.py", line 322, in _func_wrapper
    result = func(*args)
  File "/root/paddlejob/workspace/code/run.py", line 71, in train
    save_dir = "/root/paddlejob/workspace/output/")
  File "/opt/_internal/cpython-3.7.0/lib/python3.7/site-packages/paddle/hapi/model.py", line 1732, in fit
    logs = self._run_one_epoch(train_loader, cbks, 'train')
  File "/opt/_internal/cpython-3.7.0/lib/python3.7/site-packages/paddle/hapi/model.py", line 2051, in _run_one_epoch
    0].shape) else data[0].shape[0]
AttributeError: 'int' object has no attribute 'shape'

/mnt
[INFO]: train job failed! train_ret: 1
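
The last frame of the traceback evaluates `data[0].shape` on the first element of a batch, and that element turned out to be a plain Python `int`. The thread does not establish why. One possibility (an assumption, not confirmed by the poster's code) is that whatever is iterated during `fit` yields bare integers instead of array-like values. Below is a minimal sketch for inspecting one yielded item before training. `find_fields_without_shape` and `FakeArray` are hypothetical names introduced for illustration and are not part of Paddle.

```python
def find_fields_without_shape(sample):
    """Return the indices of fields in `sample` that have no `.shape` attribute.

    `sample` is assumed to be a tuple or list such as (image, label).
    Anything else is treated as a single field.
    """
    fields = sample if isinstance(sample, (tuple, list)) else (sample,)
    return [i for i, field in enumerate(fields) if not hasattr(field, "shape")]


class FakeArray:
    """Hypothetical stand-in for an array type such as numpy.ndarray."""

    def __init__(self, shape):
        self.shape = shape


# The label is a bare int, so field 1 is flagged.
print(find_fields_without_shape((FakeArray((3, 32, 32)), 7)))  # [1]
# The label is wrapped in an array-like, so nothing is flagged.
print(find_fields_without_shape((FakeArray((3, 32, 32)), FakeArray((1,)))))  # []
```

In a real script the same check could be run on the first item of the training data, with real arrays in place of `FakeArray`. Whether this explains the failure here depends on the project code, which the thread does not show.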

All comments (1)
学习委员
#2 Replied 2021-11

Could you please share the project link?
