>>> paddle.fluid.install_check.run_check()
Running Verify Fluid Program ...
E:\Pycharm\anconda\envs\paddle\lib\site-packages\paddle\fluid\executor.py:774: UserWarning: The following exception is not an EOF exception.
"The following exception is not an EOF exception.")
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "E:\Pycharm\anconda\envs\paddle\lib\site-packages\paddle\fluid\install_check.py", line 123, in run_check
test_simple_exe()
File "E:\Pycharm\anconda\envs\paddle\lib\site-packages\paddle\fluid\install_check.py", line 119, in test_simple_exe
exe0.run(startup_prog)
File "E:\Pycharm\anconda\envs\paddle\lib\site-packages\paddle\fluid\executor.py", line 775, in run
six.reraise(*sys.exc_info())
File "E:\Pycharm\anconda\envs\paddle\lib\site-packages\six.py", line 696, in reraise
raise value
File "E:\Pycharm\anconda\envs\paddle\lib\site-packages\paddle\fluid\executor.py", line 770, in run
use_program_cache=use_program_cache)
File "E:\Pycharm\anconda\envs\paddle\lib\site-packages\paddle\fluid\executor.py", line 817, in _run_impl
use_program_cache=use_program_cache)
File "E:\Pycharm\anconda\envs\paddle\lib\site-packages\paddle\fluid\executor.py", line 894, in _run_program
fetch_var_name)
paddle.fluid.core_noavx.EnforceNotMet:
--------------------------------------------
C++ Call Stacks (More useful to developers):
--------------------------------------------
Windows not support stack backtrace yet.
----------------------
Error Message Summary:
----------------------
PaddleCheckError: cudaGetDeviceProperties failed in paddle::platform::GetCUDAComputeCapability, error code : 30, Please see detail in https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__TYPES.html#group__CUDART__TYPES_1g3f51e3575c2178246db0a94a430e0038: unknown error at [D:\1.6.1\paddle\paddle\fluid\platform\gpu_info.cc:84]
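A note on the error above: error code 30 (cudaErrorUnknown) from cudaGetDeviceProperties usually means the CUDA runtime could not initialize the GPU at all, which points at a driver/CUDA mismatch rather than at Paddle itself. A minimal, hedged way to narrow it down, assuming the Paddle 1.6 fluid API: confirm the CPU path works first, then probe the GPU separately.

import paddle.fluid as fluid

# Was this wheel built with CUDA support at all?
print("compiled with CUDA:", fluid.is_compiled_with_cuda())

# Run a trivial program on CPU first; if this passes, the Paddle install itself
# is fine and the problem is limited to GPU/driver initialization.
cpu_exe = fluid.Executor(fluid.CPUPlace())
cpu_exe.run(fluid.default_startup_program())
print("CPU executor OK")

# Only once the CPU path works, retry the GPU path; fluid.CUDAPlace(0) will
# raise the same cudaGetDeviceProperties error if the driver/runtime is broken.
# gpu_exe = fluid.Executor(fluid.CUDAPlace(0))
# gpu_exe.run(fluid.default_startup_program())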
The following error is raised during training:
RuntimeError: DataLoader worker (pid 17629) is killed by signal: Bus error. It is possible that dataloader's workers are out of shared memory. Please try to raise your shared memory limit.
After searching on Baidu, I found it is caused by the shm space being too small: only 64 MB, which is far too little.
aistudio@jupyter-141218-157872:~$ df -H
Filesystem Size Used Avail Use% Mounted on
overlay 832G 165G 625G 21% /
tmpfs 68M 0 68M 0% /dev
tmpfs 60G 0 60G 0% /sys/fs/cgroup
/dev/vda1 521G 53G 469G 10% /home/aistudio
/dev/vdb 832G 165G 625G 21% /etc/hosts
shm 68M 39M 29M 59% /dev/shm
tmpfs 60G 13k 60G 1% /proc/driver/nvidia
tmpfs 12G 1.3G 11G 11% /run/nvidia-persistenced/socket
udev 60G 0 60G 0% /dev/nvidia0
tmpfs 60G 0 60G 0% /proc/acpi
tmpfs 60G 0 60G 0% /proc/scsi
tmpfs 60G 0 60G 0% /sys/firmware
Could the shm space be increased? According to what I found on Baidu, this requires adding a parameter to the docker command:
--shm-size=4g
Alternatively, it can be enlarged manually inside the system, but there is no sudo password, so I cannot do that myself.
I hope the shm space can be increased. Thank you.
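Since there is no sudo inside the environment and the container's --shm-size cannot be changed from the user side, a common client-side workaround is to stop the DataLoader from spawning worker processes, because only multi-process workers exchange batches through /dev/shm. Below is a hedged sketch, assuming the fastai v1 API used by the pets notebooks; path_img, fnames, pat and bs are placeholders for whatever the notebook already defines, and num_workers is passed straight through to PyTorch's DataLoader.

from fastai.vision import *

# Rebuild the DataBunch with num_workers=0 so batches are loaded in the main
# process and no shared memory is needed (slower, but avoids the Bus error).
data = ImageDataBunch.from_name_re(
    path_img, fnames, pat, ds_tfms=get_transforms(),
    size=224, bs=bs, num_workers=0
).normalize(imagenet_stats)

learn = cnn_learner(data, models.resnet34, metrics=error_rate)
learn.fit_one_cycle(3, slice(1e-2), pct_start=0.8)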
I was running text classification on my own; after removing batched execution, the following error appeared:
/opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages/paddle/fluid/executor.py in _run(self, program, exe, feed, fetch_list, feed_var_name, fetch_var_name, scope, return_numpy, use_program_cache)
747 self._feed_data(program, feed, feed_var_name, scope)
748 if not use_program_cache:
--> 749 exe.run(program.desc, scope, 0, True, True, fetch_var_name)
750 else:
751 exe.run_cached_prepared_ctx(ctx, scope, False, False, False)
The arguments passed to exe.run seem to be wrong.
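For reference, the frame at line 749 in the traceback is Paddle's internal C++ entry point; user code normally does not call it with positional arguments but goes through the Python-level Executor.run with feed and fetch_list. A minimal hedged sketch of a single-sample (non-batched) run, assuming the Paddle 1.6 fluid API and a hypothetical tiny text-classification program with an input variable named "words":

import numpy as np
import paddle.fluid as fluid

# Hypothetical tiny program, just to show the shape of the run() call:
# int64 word-id sequence -> embedding -> average pooling -> softmax.
words = fluid.layers.data(name="words", shape=[1], dtype="int64", lod_level=1)
emb = fluid.layers.embedding(input=words, size=[100, 16])
pooled = fluid.layers.sequence_pool(input=emb, pool_type="average")
prediction = fluid.layers.fc(input=pooled, size=2, act="softmax")

exe = fluid.Executor(fluid.CPUPlace())
exe.run(fluid.default_startup_program())

# Even a single sample has to be fed as a batch of size 1, wrapped in a
# LoDTensor; the feed-dict key must match the data variable's name.
one_sample = fluid.create_lod_tensor(
    np.array([[1], [5], [9]], dtype="int64"), [[3]], fluid.CPUPlace())

out, = exe.run(fluid.default_main_program(),
               feed={"words": one_sample},
               fetch_list=[prediction])
print(out)  # shape (1, 2): probabilities for the two classes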
The earlier Hi3518 used an ARM926 core; the newer models use an ARM Cortex-A7 now. Thanks very much!
Hi, could you tell us the ID of the project where the problem occurs and where exactly it fails? Please provide that information.
Project ID:
https://aistudio.baidu.com/aistudio/projectdetail/157872
The test code is lesson 6 of the fastai course: lesson6-pets-more.
The line learn.fit_one_cycle(3, slice(1e-2), pct_start=0.8) raises an error when it runs:
0.00% [0/3 00:00<00:00]   epoch | train_loss | valid_loss | error_rate | time
Interrupted
---------------------------------------------------------------------------RuntimeError Traceback (most recent call last)~/work/py3/lib/python3.7/site-packages/torch/utils/data/dataloader.py in _try_get_data(self, timeout)
723 try:
--> 724 data = self._data_queue.get(timeout=timeout)
725 return (True, data)
/opt/conda/envs/python35-paddle120-env/lib/python3.7/multiprocessing/queues.py in get(self, block, timeout)
103 timeout = deadline - time.monotonic()
--> 104 if not self._poll(timeout):
105 raise Empty
/opt/conda/envs/python35-paddle120-env/lib/python3.7/multiprocessing/connection.py in poll(self, timeout)
256 self._check_readable()
--> 257 return self._poll(timeout)
258
/opt/conda/envs/python35-paddle120-env/lib/python3.7/multiprocessing/connection.py in _poll(self, timeout)
413 def _poll(self, timeout):
--> 414 r = wait([self], timeout)
415 return bool(r)
/opt/conda/envs/python35-paddle120-env/lib/python3.7/multiprocessing/connection.py in wait(object_list, timeout)
919 while True:
--> 920 ready = selector.select(timeout)
921 if ready:
/opt/conda/envs/python35-paddle120-env/lib/python3.7/selectors.py in select(self, timeout)
414 try:
--> 415 fd_event_list = self._selector.poll(timeout)
416 except InterruptedError:
~/work/py3/lib/python3.7/site-packages/torch/utils/data/_utils/signal_handling.py in handler(signum, frame)
65 # Python can still get and update the process status successfully.
---> 66 _error_if_any_worker_fails()
67 if previous_handler is not None:
RuntimeError: DataLoader worker (pid 25593) is killed by signal: Bus error. It is possible that dataloader's workers are out of shared memory. Please try to raise your shared memory limit.
During handling of the above exception, another exception occurred:
RuntimeError Traceback (most recent call last) in
----> 1 learn.fit_one_cycle(3, slice(1e-2), pct_start=0.8)
...(middle of the traceback omitted)...
~/work/py3/lib/python3.7/site-packages/torch/utils/data/dataloader.py in _try_get_data(self, timeout)
735 if len(failed_workers) > 0:
736 pids_str = ', '.join(str(w.pid) for w in failed_workers)
--> 737 raise RuntimeError('DataLoader worker (pid(s) {}) exited unexpectedly'.format(pids_str))
738 if isinstance(e, queue.Empty):
739 return (False, None)
RuntimeError: DataLoader worker (pid(s) 25723) exited unexpectedly
How can this be solved? I installed paddlepaddle-gpu 1.6.1.post107; my CUDA version is 10.0 and my cuDNN is 7.6.
Hi, I'd like to ask: when creating a project in the Notebook, PaddlePaddle is already installed, right? If I want to work on other projects, do I need to install paddlepaddle again?
Yes, it comes pre-installed. Just select the Paddle version you need when creating the project.
For the generative adversarial network project, you could take a shot at CycleGAN.
Sentiment classification: a strange issue with prediction results and the choice of neural network model
I trained with the lstm_net model from the PaddlePaddle GitHub model zoo; the prediction results are as follows:
Predict probability of 0.533816 to be positive and 0.46618402 to be negative for review ' read the book forget the movie '
Predict probability of 0.8875562 to be positive and 0.11244377 to be negative for review ' this is a great movie '
Predict probability of 0.42828017 to be positive and 0.5717198 to be negative for review ' this is very bad '
For the negative sentence 'this is very bad', the result is ambiguous.
Training and predicting with bi_lstm_net instead gives the opposite result: negative sentences are classified accurately, while positive ones are ambiguous.
Strangest of all, when I train once with lstm_net and save the model, then train once with bi_lstm_net and save the model to the same location, and afterwards load that model for prediction, both positive and negative sentences are classified accurately. What causes this? Could anyone shed some light on it?
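One plausible explanation (a guess, nothing confirmed): lstm_net and bi_lstm_net have different parameter sets, so if both are saved into the same directory their parameter files end up mixed, and what gets loaded afterwards is an unintended combination of the two. A hedged sketch of keeping the two models apart, assuming the Paddle 1.x fluid inference save/load API; the "words" input and the prediction variable below are minimal stand-ins for whatever lstm_net / bi_lstm_net actually defines.

import paddle.fluid as fluid

# Minimal stand-in network; in the real script these come from lstm_net / bi_lstm_net.
words = fluid.layers.data(name="words", shape=[128], dtype="float32")
prediction = fluid.layers.fc(input=words, size=2, act="softmax")

exe = fluid.Executor(fluid.CPUPlace())
exe.run(fluid.default_startup_program())

# Key point: give every network its own directory (e.g. "model_lstm_net" vs
# "model_bi_lstm_net") so their parameter files can never overwrite each other.
fluid.io.save_inference_model(dirname="model_lstm_net",
                              feeded_var_names=["words"],
                              target_vars=[prediction],
                              executor=exe)

# At prediction time, load explicitly from the directory of the intended model.
infer_prog, feed_names, fetch_targets = fluid.io.load_inference_model(
    dirname="model_lstm_net", executor=exe)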
You are welcome to join the PaddlePaddle discussion group (group number 432676488), where professional engineers will answer your questions.
2020-03-18: cannot enter training on the server.
Q: How do I rewrite the old dygraph FC layer using the dygraph Linear layer?
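A hedged sketch of that migration, assuming Paddle 1.7's dygraph API: fluid.dygraph.Linear takes an explicit input_dim and output_dim, whereas the old dygraph FC only took an output size, so the input width now has to be stated by hand.

import numpy as np
import paddle.fluid as fluid
from paddle.fluid.dygraph import Linear, to_variable

with fluid.dygraph.guard():
    # Old style was roughly: fc = fluid.dygraph.FC("fc", size=10)
    # New style: the input dimension is explicit.
    linear = Linear(input_dim=32, output_dim=10, act="softmax")

    x = to_variable(np.random.rand(4, 32).astype("float32"))  # batch of 4, 32 features
    y = linear(x)
    print(y.shape)  # [4, 10]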
AI Studio, please increase the maximum session duration.
Please make VisualDL fully support dygraph (dynamic graph) mode.
It would be nice if AI Studio offered walkthroughs after the competitions.
Has your problem been solved? I have the same issue; changing num_workers and batch size didn't help.
No, it wasn't solved; I ended up switching to PaddlePaddle instead!
By now I'm much more familiar with PaddlePaddle than with PyTorch.
Come on, add more datasets.
Let's move them over together!
It would be nice to have a few more permissions on AI Studio.