Are there any projects that use the spawn method for multi-process training?

FutureSI posted 2021-06

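According to the docs, multi-GPU training can be started either with distributed.launch, which runs multi-process training with the launch script as the unit, or with the spawn method. However, the training script I wrote in my project by following the spawn documentation does not run properly. For now I can only use the original distributed.launch approach, which makes writing logs and saving models a bit more cumbersome. Are there any spawn examples on AI Studio?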

All comments (8)

TC.Long

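#2 replied 2021-06

I could not find one on AI Studio. We are currently consolidating the distributed training documentation and will try to provide one as soon as possible~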

freelyrunning1

#3 replied 2021-06

Hello, could you share the error message or the code for the spawn launch? We will help track down the cause.

FutureSI

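#4 replied 2021-06

freelyrunning1 #3

Hello, could you share the error message or the code for the spawn launch? We will help track down the cause.

Here is the error message: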

[INFO]: current net device: eth0, ip: 172.28.27.52
[INFO]: paddle job envs:
POD_IP=job-3fea207c500fb9e8352ef592d033b19f-trainer-0.job-3fea207c500fb9e8352ef592d033b19f
PADDLE_PORT=12345
PADDLE_TRAINER_ID=0
PADDLE_TRAINERS_NUM=1
PADDLE_USE_CUDA=1
NCCL_SOCKET_IFNAME=eth0
PADDLE_IS_LOCAL=1
OUTPUT_PATH=/root/paddlejob/workspace/output
LOCAL_LOG_PATH=/root/paddlejob/workspace/log
LOCAL_MOUNT_PATH=/mnt/code_20210622183143,/mnt/datasets_20210622183144
JOB_ID=job-3fea207c500fb9e8352ef592d033b19f
TRAINING_ROLE=TRAINER
[INFO]: user command: python run.py
[INFO]: start trainer
~/paddlejob/workspace/code /mnt
/opt/_internal/cpython-3.7.0/lib/python3.7/site-packages/matplotlib/__init__.py:107: DeprecationWarning: Using or importing the ABCs from 'collections' instead of from 'collections.abc' is deprecated, and in 3.8 it will stop working
from collections import MutableMapping
/opt/_internal/cpython-3.7.0/lib/python3.7/site-packages/matplotlib/rcsetup.py:20: DeprecationWarning: Using or importing the ABCs from 'collections' instead of from 'collections.abc' is deprecated, and in 3.8 it will stop working
from collections import Iterable, Mapping
/opt/_internal/cpython-3.7.0/lib/python3.7/site-packages/matplotlib/colors.py:53: DeprecationWarning: Using or importing the ABCs from 'collections' instead of from 'collections.abc' is deprecated, and in 3.8 it will stop working
from collections import Sized
/mnt/code_20210622183143/generator.py:14: DeprecationWarning: invalid escape sequence \D
parsed = re.search('spade(\D+)(\d)x\d', config_text)
train_img: 118287
train_label: 118287
train_inst: 118287
Traceback (most recent call last):
File "run.py", line 22, in
dist.spawn(train, args=(opt, vggwpath, lastoutput, output, 1, opt.batchSize, 1, False))
File "/opt/_internal/cpython-3.7.0/lib/python3.7/site-packages/paddle/distributed/spawn.py", line 458, in spawn
format(device))
ValueError: `device` should be a string of `cpu`, 'gpu' or 'xpu', but got gpu:0
/mnt
[INFO]: train job failed! train_ret: 1
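Judging only from the last line of the traceback above, the check appears to compare the full device string against bare device types, so the indexed form gpu:0 is rejected. The sketch below illustrates that mismatch; `ALLOWED_TYPES` and `device_type` are hypothetical names made up for illustration and are not Paddle's actual code.

```python
# Illustrative only: these names are hypothetical, not Paddle internals.
ALLOWED_TYPES = ("cpu", "gpu", "xpu")


def device_type(device: str) -> str:
    """Strip an optional ':<index>' suffix, e.g. 'gpu:0' -> 'gpu'."""
    return device.split(":", 1)[0]


device = "gpu:0"  # the value reported in the traceback

# An exact comparison rejects the indexed form...
print(device in ALLOWED_TYPES)               # False
# ...while comparing only the device type accepts it.
print(device_type(device) in ALLOWED_TYPES)  # True
```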

FutureSI

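#5 replied 2021-06

freelyrunning1 #3

Hello, could you share the error message or the code for the spawn launch? We will help track down the cause.

This is how I call it: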

import paddle.distributed as dist

dist.spawn(train, args=(opt, vggwpath, lastoutput, output, 1, opt.batchSize, 1, False))

FutureSI

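#6 replied 2021-06

freelyrunning1 #3

Hello, could you share the error message or the code for the spawn launch? We will help track down the cause.

Could it be that some parameter in my call is still not set correctly?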

FutureSI

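#7 replied 2021-06

TC.Long #2

I could not find one on AI Studio. We are currently consolidating the distributed training documentation and will try to provide one as soon as possible~

Great! If there is any way to check the GPU utilization of a script task, please cover that in detail too. It would be even better if SyncBatchNorm could also be mentioned~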

Chenwh

#8 replied 2021-07

FutureSI #4

Here is the error message: [INFO]: current net device: eth0, ip: 172.28.27.52 [...] ValueError: `device` should be a string of `cpu`, 'gpu' or 'xpu', but got gpu:0 /mnt [INFO]: train job failed! train_ret: 1

There is a small bug here, which has already been fixed in the latest release, 2.1.1. You can specify the nprocs parameter here.

For example:

dist.spawn(train, args=(opt, vggwpath, lastoutput, output, 1, opt.batchSize, 1, False), nprocs=2)
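A fuller sketch of the spawn style with nprocs passed explicitly might look like the following. It assumes PaddlePaddle 2.x and at least two visible GPUs; the toy linear model, the random data, and the `launch` helper are placeholders for illustration and are not taken from the original project.

```python
# A sketch of spawn-style data-parallel training with PaddlePaddle 2.x.
# The model and data are toy placeholders; `launch` is a hypothetical helper.
# Paddle is imported inside the functions so that importing this module
# alone does not require or initialize it.


def train(print_loss=False):
    import paddle
    import paddle.distributed as dist

    # 1. Set up the parallel environment for this process.
    dist.init_parallel_env()

    # 2. Wrap the model so gradients are synchronized across processes.
    model = paddle.DataParallel(paddle.nn.Linear(10, 1))
    loss_fn = paddle.nn.MSELoss()
    optimizer = paddle.optimizer.Adam(
        learning_rate=0.001, parameters=model.parameters())

    # 3. Run one training step on random data.
    inputs = paddle.randn([8, 10], "float32")
    labels = paddle.randn([8, 1], "float32")
    loss = loss_fn(model(inputs), labels)
    loss.backward()
    optimizer.step()
    optimizer.clear_grad()

    # Print from rank 0 only, so the output is not repeated per process.
    if print_loss and dist.get_rank() == 0:
        print("loss:", loss.numpy())


def launch(nprocs=2):
    import paddle.distributed as dist

    # Pass nprocs explicitly, as suggested in the reply above, instead of
    # relying on the default.
    dist.spawn(train, args=(True,), nprocs=nprocs)
```

In a script, `launch()` would be called under an `if __name__ == "__main__":` guard, as is usual for spawn-based multiprocessing, because the child processes re-import the main module.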

freelyrunning1

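#9 replied 2021-07

FutureSI #7

Great! If there is any way to check the GPU utilization of a script task, please cover that in detail too. It would be even better if SyncBatchNorm could also be mentioned~

GPU utilization cannot be viewed at the moment. Thank you for the good suggestion! We will evaluate it carefully~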
