How to resolve "bvar is busy at sampling for 2"

Could anyone tell me how to resolve "bvar is busy at sampling for 2"?

All comments (3)
三岁
#2 Replied 2021-11

Could you give a bit more detail?

十进制到二进制
#3 Replied 2021-12

I suspect this may be a Linux system-related error. Please describe the situation in more detail.

alixiu
#4 Replied 2023-03
三岁 #2
Could you give a bit more detail?

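I ran into this problem too. For example, when training a model for 200 epochs, the first 100 epochs run normally, and then this error appears while the 101st epoch is running.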

 

W0325 15:05:45.884913 29836 sampler.cpp:189] bvar is busy at sampling for 2 seconds!
W0325 15:05:45.884963 29886 sampler.cpp:189] bvar is busy at sampling for 2 seconds!
LAUNCH INFO 2023-03-25 15:05:50,799 Pod failed
[2023-03-25 15:05:50,799] [ INFO] controller.py:109 - Pod failed
LAUNCH ERROR 2023-03-25 15:05:50,848 Container failed !!!
Container rank 0 status failed cmd ['/home/maxiu/anaconda3/bin/python', '-u', 'paddle_static_earlyexit_pruning_cifar10_resnet.py', './data/cifar.python', '--dataset', 'cifar10', '--arch', 'resnet20_cifar', '--save_path', './logs/cifar10/resnet20/ee_1', '--epochs', '200', '--batch_size', '128', '--rate', '1.0', '--earlyexit_lossweights', '1', '1', '--earlyexit_thresholds', '0.3'] code -9 log log/workerlog.0
env {'CONDA_SHLVL': '1', 'LD_LIBRARY_PATH': '/home/maxiu/anaconda3/lib/python3.9/site-packages/cv2/../../lib64:/usr/local/cuda/lib64:/usr/local/cuda/lib64:/usr/local/cuda/lib6
4:', 'LS_COLORS': 'rs=0:di=01;34:ln=01;36:mh=00:pi=40;33:so=01;35:do=01;35:bd=40;33;01:cd=40;33;01:or=40;31;01:mi=00:su=37;41:sg=30;43:ca=30;41:tw=30;42:ow=34;42:st=37;44:ex=0
1;32:*.tar=01;31:*.tgz=01;31:*.arc=01;31:*.arj=01;31:*.taz=01;31:*.lha=01;31:*.lz4=01;31:*.lzh=01;31:*.lzma=01;31:*.tlz=01;31:*.txz=01;31:*.tzo=01;31:*.t7z=01;31:*.zip=01;31:*
.z=01;31:*.Z=01;31:*.dz=01;31:*.gz=01;31:*.lrz=01;31:*.lz=01;31:*.lzo=01;31:*.xz=01;31:*.zst=01;31:*.tzst=01;31:*.bz2=01;31:*.bz=01;31:*.tbz=01;31:*.tbz2=01;31:*.tz=01;31:*.de
b=01;31:*.rpm=01;31:*.jar=01;31:*.war=01;31:*.ear=01;31:*.sar=01;31:*.rar=01;31:*.alz=01;31:*.ace=01;31:*.zoo=01;31:*.cpio=01;31:*.7z=01;31:*.rz=01;31:*.cab=01;31:*.wim=01;31:
*.swm=01;31:*.dwm=01;31:*.esd=01;31:*.jpg=01;35:*.jpeg=01;35:*.mjpg=01;35:*.mjpeg=01;35:*.gif=01;35:*.bmp=01;35:*.pbm=01;35:*.pgm=01;35:*.ppm=01;35:*.tga=01;35:*.xbm=01;35:*.x
pm=01;35:*.tif=01;35:*.tiff=01;35:*.png=01;35:*.svg=01;35:*.svgz=01;35:*.mng=01;35:*.pcx=01;35:*.mov=01;35:*.mpg=01;35:*.mpeg=01;35:*.m2v=01;35:*.mkv=01;35:*.webm=01;35:*.ogm=
01;35:*.mp4=01;35:*.m4v=01;35:*.mp4v=01;35:*.vob=01;35:*.qt=01;35:*.nuv=01;35:*.wmv=01;35:*.asf=01;35:*.rm=01;35:*.rmvb=01;35:*.flc=01;35:*.avi=01;35:*.fli=01;35:*.flv=01;35:*
.gl=01;35:*.dl=01;35:*.xcf=01;35:*.xwd=01;35:*.yuv=01;35:*.cgm=01;35:*.emf=01;35:*.ogv=01;35:*.ogx=01;35:*.aac=00;36:*.au=00;36:*.flac=00;36:*.m4a=00;36:*.mid=00;36:*.midi=00;
36:*.mka=00;36:*.mp3=00;36:*.mpc=00;36:*.ogg=00;36:*.ra=00;36:*.wav=00;36:*.oga=00;36:*.opus=00;36:*.spx=00;36:*.xspf=00;36:', 'CONDA_EXE': '/home/maxiu/anaconda3/bin/conda',
'LC_MEASUREMENT': 'zh_CN.UTF-8', 'SSH_CONNECTION': '10.3.0.56 61036 192.168.199.157 22', 'LESSCLOSE': '/usr/bin/lesspipe %s %s', 'LC_PAPER': 'zh_CN.UTF-8', 'LC_MONETARY': 'zh_
CN.UTF-8', 'LANG': 'en_US.UTF-8', 'OLDPWD': '/home/maxiu/soft-filter-pruning', 'COLORTERM': 'truecolor', 'CONDA_PREFIX': '/home/maxiu/anaconda3', 'S_COLORS': 'auto', '_CE_M':
'', 'LC_NAME': 'zh_CN.UTF-8', 'XDG_SESSION_ID': '4', 'USER': 'maxiu', 'PWD': '/home/maxiu/Paddle-EE/paddle_ee_pruning', 'HOME': '/home/maxiu', 'CONDA_PYTHON_EXE': '/home/maxiu
/anaconda3/bin/python', 'BROWSER': '/home/maxiu/.vscode-server/bin/c3f126316369cd610563c75b1b1725e0679adfb3/bin/helpers/browser.sh', 'VSCODE_GIT_ASKPASS_NODE': '/home/maxiu/.v
scode-server/bin/c3f126316369cd610563c75b1b1725e0679adfb3/node', 'TERM_PROGRAM': 'vscode', 'SSH_CLIENT': '10.3.0.56 61036 22', 'TERM_PROGRAM_VERSION': '1.58.2', 'CPU_NUM': '6'
, 'TMUX': '/tmp/tmux-1003/default,4295,2', 'XDG_DATA_DIRS': '/usr/local/share:/usr/share:/var/lib/snapd/desktop', '_CE_CONDA': '', 'VSCODE_IPC_HOOK_CLI': '/run/user/1003/vscod
e-ipc-34435bce-3047-437c-9529-bf86e86439aa.sock', 'LC_ADDRESS': 'zh_CN.UTF-8', 'LC_NUMERIC': 'zh_CN.UTF-8', 'CONDA_PROMPT_MODIFIER': '(base) ', 'MAIL': '/var/mail/maxiu', 'VSC
ODE_GIT_ASKPASS_MAIN': '/home/maxiu/.vscode-server/bin/c3f126316369cd610563c75b1b1725e0679adfb3/extensions/git/dist/askpass-main.js', 'SHELL': '/bin/bash', 'TERM': 'screen', '
TMUX_PANE': '%2', 'SHLVL': '6', 'VSCODE_GIT_IPC_HANDLE': '/run/user/1003/vscode-git-1b289be235.sock', 'LC_TELEPHONE': 'zh_CN.UTF-8', 'LOGNAME': 'maxiu', 'DBUS_SESSION_BUS_ADDR
ESS': 'unix:path=/run/user/1003/bus', 'GIT_ASKPASS': '/home/maxiu/.vscode-server/bin/c3f126316369cd610563c75b1b1725e0679adfb3/extensions/git/dist/askpass.sh', 'XDG_RUNTIME_DIR
': '/run/user/1003', 'PATH': '/usr/include/leveldb:/usr/local/cuda/bin:/usr/include/leveldb:/usr/local/cuda/bin:/home/maxiu/.vscode-server/bin/c3f126316369cd610563c75b1b1725e0
679adfb3/bin:/home/maxiu/anaconda3/bin:/home/maxiu/anaconda3/condabin:/usr/include/leveldb:/usr/local/cuda/bin:/home/maxiu/.vscode-server/bin/c3f126316369cd610563c75b1b1725e06
79adfb3/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/local/games:/snap/bin', 'LC_IDENTIFICATION': 'zh_CN.UTF-8', 'CONDA_DEFAULT_ENV': 'base
', 'LESSOPEN': '| /usr/bin/lesspipe %s', 'LC_TIME': 'zh_CN.UTF-8', '_': '/home/maxiu/anaconda3/bin/python', 'CUSTOM_DEVICE_ROOT': '', 'OMP_NUM_THREADS': '1', 'QT_QPA_PLATFORM_
PLUGIN_PATH': '/home/maxiu/anaconda3/lib/python3.9/site-packages/cv2/qt/plugins', 'QT_QPA_FONTDIR': '/home/maxiu/anaconda3/lib/python3.9/site-packages/cv2/qt/fonts', 'POD_NAME
': 'jtbnvr', 'PADDLE_MASTER': '127.0.1.1:62293', 'PADDLE_GLOBAL_SIZE': '1', 'PADDLE_LOCAL_SIZE': '1', 'PADDLE_GLOBAL_RANK': '0', 'PADDLE_LOCAL_RANK': '0', 'PADDLE_NNODES': '1'
, 'PADDLE_TRAINER_ENDPOINTS': '127.0.1.1:62294', 'PADDLE_CURRENT_ENDPOINT': '127.0.1.1:62294', 'PADDLE_TRAINER_ID': '0', 'PADDLE_TRAINERS_NUM': '1', 'PADDLE_RANK_IN_NODE': '0'
, 'PADDLE_DISTRI_BACKEND': 'gloo'}
[2023-03-25 15:05:50,848] [ ERROR] controller.py:110 - Container failed !!!

LAUNCH INFO 2023-03-25 15:05:50,849 ------------------------- ERROR LOG DETAIL -------------------------

[2023-03-25 15:05:50,849] [ INFO] controller.py:111 - ------------------------- ERROR LOG DETAIL -------------------------
_reduce operators: 73. To make the speed faster, some all_reduce ops are fused during training, after fusion, the number of all_reduce ops is 71.
W0325 15:05:26.213204 29868 fuse_all_reduce_op_pass.cc:79] Find all_reduce operators: 73. To make the speed faster, some all_reduce ops are fused during training, after fusion, the number of all_reduce ops is 71.
W0325 15:05:27.389850 29868 fuse_all_reduce_op_pass.cc:79] Find all_reduce operators: 73. To make the speed faster, some all_reduce ops are fused during training, after fusion, the number of all_reduce ops is 71.
W0325 15:05:28.549260 29868 fuse_all_reduce_op_pass.cc:79] Find all_reduce operators: 73. To make the speed faster, some all_reduce ops are fused during training, after fusion, the number of all_reduce ops is 71.
W0325 15:05:29.739878 29868 fuse_all_reduce_op_pass.cc:79] Find all_reduce operators: 73. To make the speed faster, some all_reduce ops are fused during training, after fusion, the number of all_reduce ops is 71.
W0325 15:05:30.894938 29868 fuse_all_reduce_op_pass.cc:79] Find all_reduce operators: 73. To make the speed faster, some all_reduce ops are fused during training, after fusion, the number of all_reduce ops is 71.
W0325 15:05:32.036509 29868 fuse_all_reduce_op_pass.cc:79] Find all_reduce operators: 73. To make the speed faster, some all_reduce ops are fused during training, after fusion, the number of all_reduce ops is 71.
W0325 15:05:33.199337 29868 fuse_all_reduce_op_pass.cc:79] Find all_reduce operators: 73. To make the speed faster, some all_reduce ops are fused during training, after fusion, the number of all_reduce ops is 71.
W0325 15:05:34.405827 29868 fuse_all_reduce_op_pass.cc:79] Find all_reduce operators: 73. To make the speed faster, some all_reduce ops are fused during training, after fusion, the number of all_reduce ops is 71.
W0325 15:05:35.620419 29868 fuse_all_reduce_op_pass.cc:79] Find all_reduce operators: 73. To make the speed faster, some all_reduce ops are fused during training, after fusion, the number of all_reduce ops is 71.
W0325 15:05:36.792032 29868 fuse_all_reduce_op_pass.cc:79] Find all_reduce operators: 73. To make the speed faster, some all_reduce ops are fused during training, after fusion, the number of all_reduce ops is 71.
W0325 15:05:38.002785 29868 fuse_all_reduce_op_pass.cc:79] Find all_reduce operators: 73. To make the speed faster, some all_reduce ops are fused during training, after fusion, the number of all_reduce ops is 71.
W0325 15:05:39.153999 29868 fuse_all_reduce_op_pass.cc:79] Find all_reduce operators: 73. To make the speed faster, some all_reduce ops are fused during training, after fusion, the number of all_reduce ops is 71.
W0325 15:05:40.334870 29868 fuse_all_reduce_op_pass.cc:79] Find all_reduce operators: 73. To make the speed faster, some all_reduce ops are fused during training, after fusion, the number of all_reduce ops is 71.
W0325 15:05:45.884963 29886 sampler.cpp:189] bvar is busy at sampling for 2 seconds!
LAUNCH INFO 2023-03-25 15:05:50,867 Exit code -9
[2023-03-25 15:05:50,867] [ INFO] controller.py:141 - Exit code -9
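
A note on reading the "code -9" / "Exit code -9" lines above: assuming the launcher reports the worker's status the way Python's `subprocess` module does (an assumption, the log does not say so), a negative code -N means the process was terminated by signal N. The helper below, `describe_exit_code`, is a hypothetical name used only for illustration; it is a minimal sketch of that decoding and is not part of Paddle.

```python
import signal


def describe_exit_code(code: int) -> str:
    """Explain an exit code in the convention used by Python's subprocess module.

    Hypothetical helper for illustration only; not part of Paddle.
    """
    if code < 0:
        # subprocess reports "terminated by signal N" as the negative value -N.
        try:
            name = signal.Signals(-code).name
        except ValueError:
            # The signal number is not defined on this platform.
            name = f"signal {-code}"
        return f"killed by {name}"
    return f"exited with status {code}"


if __name__ == "__main__":
    print(describe_exit_code(-9))
```

On Linux, signal 9 is SIGKILL, so the worker was probably killed from outside rather than crashing on its own. One common source of SIGKILL on Linux is the kernel's out-of-memory killer, which would fit the suggestion in #3 that this is system-related, but the log alone does not confirm it; checking `dmesg` and memory usage around the failing epoch would be a reasonable next step.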
