Error when running Paddle: what is the cause?
Paddle Framework · Tags: Q&A, Deep Learning, Model Training, Training Tips · 1845 views · 2 replies

The error output at runtime is as follows:
Traceback (most recent call last):
  File "run_classifier.py", line 287, in <module>
    main(args)
  File "run_classifier.py", line 187, in main
    main_program=train_program)
  File "/home/slurm/job/tmp/job-168618/ERNIE/python/lib/python2.7/site-packages/paddle/fluid/parallel_executor.py", line 122, in __init__
    self.compiled_program.compile(place=self.place, scope=self.scope)
  File "/home/slurm/job/tmp/job-168618/ERNIE/python/lib/python2.7/site-packages/paddle/fluid/compiler.py", line 282, in compile
    scope=self.scope)
  File "/home/slurm/job/tmp/job-168618/ERNIE/python/lib/python2.7/site-packages/paddle/fluid/compiler.py", line 253, in compile_data_parallel
    self.exec_strategy, self.build_strategy, self.graph)
paddle.fluid.core.EnforceNotMet: internal error at [/ssd2/liyukun01/paddle-env/repos/Paddle/paddle/fluid/platform/nccl_helper.h:113]
PaddlePaddle Call Stacks:
0 0x7f3274456992p void paddle::platform::EnforceNotMet::Init<char const*>(char const*, char const*, int) + 354
1 0x7f3274456d16p paddle::platform::EnforceNotMet::EnforceNotMet(std::exception_ptr::exception_ptr, char const*, int) + 134
2 0x7f327459dfb5p paddle::platform::NCCLContextMap::NCCLContextMap(std::vector<boost::variant<paddle::platform::CUDAPlace, paddle::platform::CPUPlace, paddle::platform::CUDAPinnedPlace, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_>, std::allocator<boost::variant<paddle::platform::CUDAPlace, paddle::platform::CPUPlace, paddle::platform::CUDAPinnedPlace, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_> > > const&, ncclUniqueId*, unsigned long, unsigned long) + 3317
3 0x7f327459a2a2p paddle::framework::ParallelExecutor::ParallelExecutor(std::vector<boost::variant<paddle::platform::CUDAPlace, paddle::platform::CPUPlace, paddle::platform::CUDAPinnedPlace, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_>, std::allocator<boost::variant<paddle::platform::CUDAPlace, paddle::platform::CPUPlace, paddle::platform::CUDAPinnedPlace, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_> > > const&, std::vector<std::string, std::allocator<std::string> > const&, std::string const&, paddle::framework::Scope*, std::vector<paddle::framework::Scope*, std::allocator<paddle::framework::Scope*> > const&, paddle::framework::details::ExecutionStrategy const&, paddle::framework::details::BuildStrategy const&, paddle::framework::ir::Graph*) + 2962
4 0x7f32744b4bfbp
5 0x7f327447b3c9p
6 0x7f32de9b6f13p PyObject_Call + 67
7 0x7f32de9c5a4dp
8 0x7f32de9b6f13p PyObject_Call + 67
9 0x7f32dea2520fp
10 0x7f32dea2197fp
11 0x7f32de9b6f13p PyObject_Call + 67
12 0x7f32dea6cb66p PyEval_EvalFrameEx + 15014
13 0x7f32dea7286dp PyEval_EvalCodeEx + 2061
14 0x7f32dea6f9fcp PyEval_EvalFrameEx + 26940
15 0x7f32dea7286dp PyEval_EvalCodeEx + 2061
16 0x7f32dea6f9fcp PyEval_EvalFrameEx + 26940
17 0x7f32dea7286dp PyEval_EvalCodeEx + 2061
18 0x7f32de9e9015p
19 0x7f32de9b6f13p PyObject_Call + 67
20 0x7f32de9c5a4dp
21 0x7f32de9b6f13p PyObject_Call + 67
22 0x7f32dea2520fp
23 0x7f32dea2197fp
24 0x7f32de9b6f13p PyObject_Call + 67
25 0x7f32dea6cb66p PyEval_EvalFrameEx + 15014
26 0x7f32dea7286dp PyEval_EvalCodeEx + 2061
27 0x7f32dea6f9fcp PyEval_EvalFrameEx + 26940
28 0x7f32dea7286dp PyEval_EvalCodeEx + 2061
29 0x7f32dea729a2p PyEval_EvalCode + 50
30 0x7f32dea9b782p PyRun_FileExFlags + 146
31 0x7f32dea9caf9p PyRun_SimpleFileExFlags + 217
32 0x7f32deab282dp Py_Main + 3149
33 0x7f32ddcafbd5p __libc_start_main + 245
34 0x4007a1p

*** Aborted at 1581944112 (unix time) try "date -d @1581944112" if you are using GNU date ***
PC: @ 0x0 (unknown)
*** SIGSEGV (@0x5b22120) received by PID 28165 (TID 0x7f32def9f700) from PID 95559968; stack trace: ***
@ 0x7f32de755160 (unknown)
@ 0x5b22120 (unknown)
script/run_ChnSentiCorp.sh: line 39: 28165 Segmentation fault python -u run_classifier.py --use_cuda true --verbose true --do_train true --do_val false --do_test true --batch_size 1024 --init_pretraining_params ${MODEL_PATH} --train_set ${TASK_DATA_PATH}/train_filelist --test_set ${TASK_DATA_PATH}/test_filelist.without_punc_stopword --vocab_path config/vocab_word.txt --checkpoints ./checkpoints --save_steps 100000000 --weight_decay 0.01 --warmup_proportion 0.0 --validation_steps 50 --epoch 1 --max_seq_len 70 --ernie_config_path config/ernie_config_distillation.json --learning_rate 1e-3 --skip_steps 50 --num_iteration_per_drop_scope 1 --num_labels 1 --random_seed 1
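
The trace shows the failure inside NCCLContextMap (nccl_helper.h:113), i.e. while ParallelExecutor is setting up NCCL communicators for the GPUs it will train on, so the job environment is a natural first thing to inspect. As a minimal sketch, assuming the job selects its GPUs through CUDA_VISIBLE_DEVICES, the standard-library script below prints which GPUs the process will see and any NCCL_* variables set in the job. The helper names are hypothetical and not part of Paddle or ERNIE.

```python
import os


def visible_gpu_ids(env=None):
    """Return the GPU ids listed in CUDA_VISIBLE_DEVICES.

    Returns None when the variable is unset (CUDA then exposes every GPU on
    the node) and an empty list when it is set but blank (no GPU visible).
    Hypothetical helper, not a Paddle API.
    """
    env = os.environ if env is None else env
    raw = env.get("CUDA_VISIBLE_DEVICES")
    if raw is None:
        return None
    return [tok.strip() for tok in raw.split(",") if tok.strip()]


def nccl_env_report(env=None):
    """Collect CUDA_VISIBLE_DEVICES and every NCCL_* variable (hypothetical helper)."""
    env = os.environ if env is None else env
    return {k: v for k, v in sorted(env.items())
            if k == "CUDA_VISIBLE_DEVICES" or k.startswith("NCCL_")}


if __name__ == "__main__":
    ids = visible_gpu_ids()
    if ids is None:
        print("CUDA_VISIBLE_DEVICES is unset: all GPUs on the node are visible")
    else:
        print("visible GPUs (%d): %s" % (len(ids), ids))
    # Print the collected variables so they can be pasted into a bug report.
    for key, value in sorted(nccl_env_report().items()):
        print("%s=%s" % (key, value))
```

One way to use it is to run it from the same job script, just before the python -u run_classifier.py line, and then rerun training with NCCL_DEBUG=INFO set, which makes NCCL log its own initialization steps and may show more precisely where communicator setup fails.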

=============================================
Code:
[screenshot of the code; image not available]

All comments (2)
AIStudio786089
#2 replied 2020-02

Please first confirm whether this is an environment problem on the Kongming cluster.

AIStudio786089
#3 replied 2020-02

Confirmed: it was an environment problem on the Kongming cluster. The issue has been closed.
