Are there any projects that use the spawn method for multi-process training?
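According to the docs, multi-GPU training can either run as multi-process training launched per script with distributed.launch, or use the spawn method. However, the training script I wrote in my project by following the spawn documentation does not run correctly. For now I can only use the original distributed.launch method, which makes writing logs and saving models a bit more cumbersome. Is there a spawn example on AI Studio?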
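I couldn't find one on AI Studio. We are currently consolidating the distributed training documentation and will try to provide an example as soon as possible.
Hello, could you share the error message or the code from the spawn launch? We'll help track down the cause.
Here is the error message: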
[INFO]: current net device: eth0, ip: 172.28.27.52
[INFO]: paddle job envs:
POD_IP=job-3fea207c500fb9e8352ef592d033b19f-trainer-0.job-3fea207c500fb9e8352ef592d033b19f
PADDLE_PORT=12345
PADDLE_TRAINER_ID=0
PADDLE_TRAINERS_NUM=1
PADDLE_USE_CUDA=1
NCCL_SOCKET_IFNAME=eth0
PADDLE_IS_LOCAL=1
OUTPUT_PATH=/root/paddlejob/workspace/output
LOCAL_LOG_PATH=/root/paddlejob/workspace/log
LOCAL_MOUNT_PATH=/mnt/code_20210622183143,/mnt/datasets_20210622183144
JOB_ID=job-3fea207c500fb9e8352ef592d033b19f
TRAINING_ROLE=TRAINER
[INFO]: user command: python run.py
[INFO]: start trainer
~/paddlejob/workspace/code /mnt
/opt/_internal/cpython-3.7.0/lib/python3.7/site-packages/matplotlib/__init__.py:107: DeprecationWarning: Using or importing the ABCs from 'collections' instead of from 'collections.abc' is deprecated, and in 3.8 it will stop working
from collections import MutableMapping
/opt/_internal/cpython-3.7.0/lib/python3.7/site-packages/matplotlib/rcsetup.py:20: DeprecationWarning: Using or importing the ABCs from 'collections' instead of from 'collections.abc' is deprecated, and in 3.8 it will stop working
from collections import Iterable, Mapping
/opt/_internal/cpython-3.7.0/lib/python3.7/site-packages/matplotlib/colors.py:53: DeprecationWarning: Using or importing the ABCs from 'collections' instead of from 'collections.abc' is deprecated, and in 3.8 it will stop working
from collections import Sized
/mnt/code_20210622183143/generator.py:14: DeprecationWarning: invalid escape sequence \D
parsed = re.search('spade(\D+)(\d)x\d', config_text)
train_img: 118287
train_label: 118287
train_inst: 118287
Traceback (most recent call last):
File "run.py", line 22, in
dist.spawn(train, args=(opt, vggwpath, lastoutput, output, 1, opt.batchSize, 1, False))
File "/opt/_internal/cpython-3.7.0/lib/python3.7/site-packages/paddle/distributed/spawn.py", line 458, in spawn
format(device))
ValueError: `device` should be a string of `cpu`, 'gpu' or 'xpu', but got gpu:0
/mnt
[INFO]: train job failed! train_ret: 1
Here is how I call it:
import paddle.distributed as dist
dist.spawn(train, args=(opt, vggwpath, lastoutput, output, 1, opt.batchSize, 1, False))
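Is there a parameter in my call that I haven't set correctly?
Great. If there is any way to check the GPU utilization of a script task, please document that in detail too. It would be even better if SyncBatchNorm could also be covered.
There is a minor issue here that has already been fixed in the latest 2.1.1 release. In the meantime, you can specify the nprocs parameter in this call.
For example: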
dist.spawn(train, args=(opt, vggwpath, lastoutput, output, 1, opt.batchSize, 1, False), nprocs=2)
GPU utilization cannot be viewed at the moment. Thanks for the good suggestion; we'll evaluate it carefully.