Are there any projects that use the spawn method for multi-process training?

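According to the documentation, multi-GPU training can be started either with distributed.launch, which runs multi-process training at the level of a whole script, or with the spawn method. However, the training script I wrote in my project by following the spawn documentation does not run correctly. For now I can only use the original distributed.launch approach, which makes writing logs and saving models a bit more cumbersome. Is there a spawn example on AI Studio?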

 

All comments (8)
TC.Long
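#2 Replied 2021-06

I couldn't find one on AI Studio. We are currently consolidating the distributed training documentation and will try to provide an example as soon as possible.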

freelyrunning1
#3 Replied 2021-06

Hello, could you share the error message or the code from your spawn launch? We will help track down the cause.

FutureSI
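#4 Replied 2021-06
freelyrunning1 #3
Hello, could you share the error message or the code from your spawn launch? We will help track down the cause.

Here is the error message: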

[INFO]: current net device: eth0, ip: 172.28.27.52
[INFO]: paddle job envs:
POD_IP=job-3fea207c500fb9e8352ef592d033b19f-trainer-0.job-3fea207c500fb9e8352ef592d033b19f
PADDLE_PORT=12345
PADDLE_TRAINER_ID=0
PADDLE_TRAINERS_NUM=1
PADDLE_USE_CUDA=1
NCCL_SOCKET_IFNAME=eth0
PADDLE_IS_LOCAL=1
OUTPUT_PATH=/root/paddlejob/workspace/output
LOCAL_LOG_PATH=/root/paddlejob/workspace/log
LOCAL_MOUNT_PATH=/mnt/code_20210622183143,/mnt/datasets_20210622183144
JOB_ID=job-3fea207c500fb9e8352ef592d033b19f
TRAINING_ROLE=TRAINER
[INFO]: user command: python run.py
[INFO]: start trainer
~/paddlejob/workspace/code /mnt
/opt/_internal/cpython-3.7.0/lib/python3.7/site-packages/matplotlib/__init__.py:107: DeprecationWarning: Using or importing the ABCs from 'collections' instead of from 'collections.abc' is deprecated, and in 3.8 it will stop working
from collections import MutableMapping
/opt/_internal/cpython-3.7.0/lib/python3.7/site-packages/matplotlib/rcsetup.py:20: DeprecationWarning: Using or importing the ABCs from 'collections' instead of from 'collections.abc' is deprecated, and in 3.8 it will stop working
from collections import Iterable, Mapping
/opt/_internal/cpython-3.7.0/lib/python3.7/site-packages/matplotlib/colors.py:53: DeprecationWarning: Using or importing the ABCs from 'collections' instead of from 'collections.abc' is deprecated, and in 3.8 it will stop working
from collections import Sized
/mnt/code_20210622183143/generator.py:14: DeprecationWarning: invalid escape sequence \D
parsed = re.search('spade(\D+)(\d)x\d', config_text)
train_img: 118287
train_label: 118287
train_inst: 118287
Traceback (most recent call last):
  File "run.py", line 22, in <module>
    dist.spawn(train, args=(opt, vggwpath, lastoutput, output, 1, opt.batchSize, 1, False))
  File "/opt/_internal/cpython-3.7.0/lib/python3.7/site-packages/paddle/distributed/spawn.py", line 458, in spawn
    format(device))
ValueError: `device` should be a string of `cpu`, 'gpu' or 'xpu', but got gpu:0
/mnt
[INFO]: train job failed! train_ret: 1

FutureSI
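#5 Replied 2021-06
freelyrunning1 #3
Hello, could you share the error message or the code from your spawn launch? We will help track down the cause.

Here is how I call it: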

import paddle.distributed as dist

dist.spawn(train, args=(opt, vggwpath, lastoutput, output, 1, opt.batchSize, 1, False))

FutureSI
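#6 Replied 2021-06
freelyrunning1 #3
Hello, could you share the error message or the code from your spawn launch? We will help track down the cause.

Is there some parameter in my call that I have not set correctly?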

FutureSI
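#7 Replied 2021-06
TC.Long #2
I couldn't find one on AI Studio. We are currently consolidating the distributed training documentation and will try to provide an example as soon as possible.

That's great. If there is a way to check the GPU utilization of a script task, please cover that in detail as well. It would be even better if SyncBatchNorm could also be mentioned.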

Chenwh
#8 Replied 2021-07
FutureSI #4
Here is the error message: [...] ValueError: `device` should be a string of `cpu`, 'gpu' or 'xpu', but got gpu:0 [...] [INFO]: train job failed! train_ret: 1

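There is a minor issue here, which has already been fixed in the latest release, 2.1.1. As a workaround, you can specify the nprocs parameter in this call.

For example: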

dist.spawn(train, args=(opt, vggwpath, lastoutput, output, 1, opt.batchSize, 1, False), nprocs=2)
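
To put the call above in context, here is a minimal sketch of a complete spawn-based script. It is not the original poster's code: the `train` body, the `Linear` model, the file name `model.pdparams`, and the helper names `build_spawn_kwargs` and `launch` are all assumptions made for illustration. Paddle is imported inside the functions so the file can be loaded on a machine without Paddle installed.

```python
def build_spawn_kwargs(nprocs):
    """Build keyword arguments for paddle.distributed.spawn (hypothetical helper).

    Passing nprocs explicitly is the workaround suggested in this thread.
    """
    if not isinstance(nprocs, int) or nprocs < 1:
        raise ValueError("nprocs must be a positive integer")
    return {"nprocs": nprocs}


def train(epochs, batch_size):
    """Illustrative training function; spawn runs it once in each process."""
    import paddle
    import paddle.distributed as dist

    # Set up the parallel environment for this process.
    dist.init_parallel_env()

    # Placeholder model; a real project would build its own network here.
    model = paddle.DataParallel(paddle.nn.Linear(10, 1))
    opt = paddle.optimizer.SGD(learning_rate=0.01,
                               parameters=model.parameters())

    for _ in range(epochs):
        x = paddle.randn([batch_size, 10])  # random stand-in for real data
        loss = model(x).mean()
        loss.backward()
        opt.step()
        opt.clear_grad()

    # Let only rank 0 save, so the processes do not overwrite each other.
    if dist.get_rank() == 0:
        paddle.save(model.state_dict(), "model.pdparams")


def launch(nprocs=2):
    """Start `nprocs` training processes with an explicit nprocs value."""
    import paddle.distributed as dist

    dist.spawn(train, args=(1, 8), **build_spawn_kwargs(nprocs))
```

In a real script, `launch(2)` would normally be called from under an `if __name__ == "__main__":` guard, so that the spawned child processes do not start the launch again when they import the module. Keeping the save (and any logging) behind the rank 0 check is one way to ease the log and checkpoint handling mentioned in the original question.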

freelyrunning1
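#9 Replied 2021-07
FutureSI #7
That's great. If there is a way to check the GPU utilization of a script task, please cover that in detail as well. It would be even better if SyncBatchNorm could also be mentioned.

GPU utilization cannot be viewed at the moment. Thank you for the good suggestion; we will evaluate it carefully.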
