A problem I ran into when using multi-GPU computation

Hello, I hit the following problem when computing on multiple GPUs: training works fine on a single GPU, but fails as soon as I use multiple GPUs.

paddle.fluid.core_avx.EnforceNotMet:

--------------------------------------------
C++ Call Stacks (More useful to developers):
--------------------------------------------
0 std::string paddle::platform::GetTraceBackString(std::string&&, char const*, int)
1 paddle::platform::EnforceNotMet::EnforceNotMet(paddle::platform::ErrorSummary const&, char const*, int)
2 paddle::platform::DeviceContextPool::Get(paddle::platform::Place const&)
3 paddle::imperative::PreparedOp::Prepare(std::map, std::allocator > >, std::less, std::allocator, std::allocator > > > > > const&, std::map, std::allocator > >, std::less, std::allocator, std::allocator > > > > > const&, paddle::framework::OperatorWithKernel const&, paddle::platform::Place, std::unordered_map >, std::vector >, std::vector >, bool, std::vector >, paddle::framework::BlockDesc*, long, std::vector >, std::vector >, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_>, std::hash, std::equal_to, std::allocator >, std::vector >, std::vector >, bool, std::vector >, paddle::framework::BlockDesc*, long, std::vector >, std::vector >, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_> > > > const*)
4 paddle::imperative::OpBase::Run(std::map, std::allocator > >, std::less, std::allocator, std::allocator > > > > > const&, std::map, std::allocator > >, std::less, std::allocator, std::allocator > > > > > const&)
5 paddle::imperative::Tracer::TraceOp(std::string const&, std::map, std::allocator > >, std::less, std::allocator, std::allocator > > > > > const&, std::map, std::allocator > >, std::less, std::allocator, std::allocator > > > > > const&, std::unordered_map >, std::vector >, std::vector >, bool, std::vector >, paddle::framework::BlockDesc*, long, std::vector >, std::vector >, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_>, std::hash, std::equal_to, std::allocator >, std::vector >, std::vector >, bool, std::vector >, paddle::framework::BlockDesc*, long, std::vector >, std::vector >, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_, boost::detail::variant::void_> > > >, paddle::platform::Place const&, bool)

----------------------
Error Message Summary:
----------------------
Error: Place CUDAPlace(0) is not supported, Please check that your paddle compiles with WITH_GPU option or check that your train process hold the correct gpu_id if you use Executor at (/paddle/paddle/fluid/platform/device_context.cc:67)

W0622 13:17:57.689676 13321 device_context.cc:237] Please NOTE: device: 0, CUDA Capability: 61, Driver API Version: 9.0, Runtime API Version: 9.0
W0622 13:17:57.693465 13321 device_context.cc:245] device: 0, cuDNN Version: 7.6.
2020-06-22 13:17:58,544-INFO: Loading pretrained model from ./pretrained/resnet18-torch
2020-06-22 13:17:58,728-ERROR: ABORT!!! Out of all 4 trainers, the trainer process with rank=[1, 2, 3] was aborted. Please check its log.
ERROR 2020-06-22 13:17:58,728 launch.py:284] ABORT!!! Out of all 4 trainers, the trainer process with rank=[1, 2, 3] was aborted. Please check its log.
W0622 13:17:58.734313 13321 init.cc:209] Warning: PaddlePaddle catches a failure signal, it may not work properly
W0622 13:17:58.734344 13321 init.cc:211] You could check whether you killed PaddlePaddle thread/process accidentally or report the case to PaddlePaddle
W0622 13:17:58.734351 13321 init.cc:214] The detail failure signal is:

W0622 13:17:58.734359 13321 init.cc:217] *** Aborted at 1592803078 (unix time) try "date -d @1592803078" if you are using GNU date ***
W0622 13:17:58.737927 13321 init.cc:217] PC: @ 0x0 (unknown)
W0622 13:17:58.738090 13321 init.cc:217] *** SIGTERM (@0x3e80000335c) received by PID 13321 (TID 0x7f33bbbd1700) from PID 13148; stack trace: ***
W0622 13:17:58.739579 13321 init.cc:217] @ 0x7f33bb7c9390 (unknown)
W0622 13:17:58.740978 13321 init.cc:217] @ 0x7f33bb7c851d __libc_read
W0622 13:17:58.741684 13321 init.cc:217] @ 0x55b0c52e8107 _Py_read
W0622 13:17:58.742020 13321 init.cc:217] @ 0x55b0c52e82a7 _io_FileIO_readinto
W0622 13:17:58.742621 13321 init.cc:217] @ 0x55b0c52a82c5 _PyCFunction_FastCallDict
W0622 13:17:58.743213 13321 init.cc:217] @ 0x55b0c52a871f _PyObject_FastCallDict
W0622 13:17:58.743783 13321 init.cc:217] @ 0x55b0c5308aac PyObject_CallMethodObjArgs
W0622 13:17:58.744184 13321 init.cc:217] @ 0x55b0c5325326 _bufferedreader_raw_read
W0622 13:17:58.744612 13321 init.cc:217] @ 0x55b0c53340de _io__Buffered_read
W0622 13:17:58.745206 13321 init.cc:217] @ 0x55b0c52a8241 _PyCFunction_FastCallDict
W0622 13:17:58.745788 13321 init.cc:217] @ 0x55b0c52a871f _PyObject_FastCallDict
W0622 13:17:58.747185 13321 init.cc:217] @ 0x7f33bba07834 _Unpickler_ReadImpl
W0622 13:17:58.748574 13321 init.cc:217] @ 0x7f33bba08cae load_counted_binunicode
W0622 13:17:58.749981 13321 init.cc:217] @ 0x7f33bba0a6eb load
W0622 13:17:58.751315 13321 init.cc:217] @ 0x7f33bba14265 _pickle_load
W0622 13:17:58.751838 13321 init.cc:217] @ 0x55b0c52a8390 _PyCFunction_FastCallDict
W0622 13:17:58.752391 13321 init.cc:217] @ 0x55b0c52d4cd0 _PyCFunction_FastCallKeywords
W0622 13:17:58.752756 13321 init.cc:217] @ 0x55b0c532fb0c call_function
W0622 13:17:58.753306 13321 init.cc:217] @ 0x55b0c53535d9 _PyEval_EvalFrameDefault
W0622 13:17:58.753839 13321 init.cc:217] @ 0x55b0c532aa49 PyEval_EvalCodeEx
W0622 13:17:58.754194 13321 init.cc:217] @ 0x55b0c532b976 function_call
W0622 13:17:58.754747 13321 init.cc:217] @ 0x55b0c52a810e PyObject_Call
W0622 13:17:58.755306 13321 init.cc:217] @ 0x55b0c5353fbf _PyEval_EvalFrameDefault
W0622 13:17:58.755841 13321 init.cc:217] @ 0x55b0c532ade8 PyEval_EvalCodeEx
W0622 13:17:58.756196 13321 init.cc:217] @ 0x55b0c532b976 function_call
W0622 13:17:58.756750 13321 init.cc:217] @ 0x55b0c52a810e PyObject_Call
W0622 13:17:58.757300 13321 init.cc:217] @ 0x55b0c5353fbf _PyEval_EvalFrameDefault
W0622 13:17:58.757653 13321 init.cc:217] @ 0x55b0c53291f6 _PyEval_EvalCodeWithName
W0622 13:17:58.758005 13321 init.cc:217] @ 0x55b0c5329f31 fast_function
W0622 13:17:58.758366 13321 init.cc:217] @ 0x55b0c532fbe5 call_function
W0622 13:17:58.758913 13321 init.cc:217] @ 0x55b0c535281a _PyEval_EvalFrameDefault
W0622 13:17:58.759274 13321 init.cc:217] @ 0x55b0c5328f26 _PyEval_EvalCodeWithName
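
For reference, here is a minimal sketch of the Paddle 1.x dygraph multi-card setup this kind of run normally uses (the toy Linear network, random data, and hyperparameters are placeholders I made up, not taken from the script above). The detail relevant to the error message is that, when launched with python -m paddle.distributed.launch, each trainer process is expected to bind to its own device via fluid.dygraph.ParallelEnv().dev_id; hard-coding fluid.CUDAPlace(0) in every process is one common way to hit a "Place CUDAPlace(x) is not supported" failure.

import numpy as np
import paddle.fluid as fluid

def train():
    # Bind each trainer process to its own GPU instead of hard-coding
    # fluid.CUDAPlace(0); ParallelEnv reads the device id assigned by
    # paddle.distributed.launch.
    place = fluid.CUDAPlace(fluid.dygraph.ParallelEnv().dev_id)
    with fluid.dygraph.guard(place):
        strategy = fluid.dygraph.prepare_context()        # set up the NCCL context
        model = fluid.dygraph.Linear(10, 1)               # placeholder network
        model = fluid.dygraph.DataParallel(model, strategy)
        opt = fluid.optimizer.SGD(learning_rate=0.01,
                                  parameter_list=model.parameters())
        for _ in range(10):
            x = fluid.dygraph.to_variable(
                np.random.rand(4, 10).astype('float32'))  # placeholder data
            loss = fluid.layers.reduce_mean(model(x))
            loss = model.scale_loss(loss)                 # scale loss for multi-card
            loss.backward()
            model.apply_collective_grads()                # all-reduce gradients
            opt.minimize(loss)
            model.clear_gradients()

if __name__ == '__main__':
    # Launch as, e.g.:
    #   python -m paddle.distributed.launch --selected_gpus=0,1,2,3 train.py
    train()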

All comments (2)
aaaaaa
#2 Replied 2020-06

Maybe open an issue on GitHub; I haven't used multi-card myself.

Sun小男生
#3 Replied 2021-08

Did you ever solve this? Unfortunately I've run into the same problem.
