1. Over the past few days, I keep hitting the error below when running my model on AI Studio, and I haven't been able to resolve it.
2. Following the hint in the message, I tried lowering the GPU memory allocation settings (see the sketch right after the error log), but it didn't help; the error still appears after a restart.
3. Sometimes it runs fine with no changes at all: last night it worked, but on the next run this morning it failed again, with no modifications made in between.
4. Platform environment: paddle 1.5 + python 2.7 + ernie 1.0
5. The error message is as follows:
I0317 14:18:13.443540 165 parallel_executor.cc:329] The number of CUDAPlace, which is used in ParallelExecutor, is 1. And the Program will be copied 1 copies
I0317 14:18:13.620071 165 build_strategy.cc:340] SeqOnlyAllReduceOps:0, num_trainers:1
W0317 14:18:14.687919 209 system_allocator.cc:121] Cannot malloc 1535.25 MB GPU memory. Please shrink FLAGS_fraction_of_gpu_memory_to_use or FLAGS_initial_gpu_memory_in_mb or FLAGS_reallocate_gpu_memory_in_mbenvironment variable to a lower value. Current FLAGS_fraction_of_gpu_memory_to_use value is 0.1. Current FLAGS_initial_gpu_memory_in_mb value is 0. Current FLAGS_reallocate_gpu_memory_in_mb value is 0
F0317 14:18:14.688122 209 legacy_allocator.cc:201] Cannot allocate 192.000000MB in GPU 0, available 265.937500MBtotal 16945512448GpuMinChunkSize 256.000000BGpuMaxChunkSize 1.499265GBGPU memory used: 14.673292GB
*** Check failure stack trace: ***
@ 0x7f1faa411f3d google::LogMessage::Fail()
@ 0x7f1faa4159ec google::LogMessage::SendToLog()
@ 0x7f1faa411a63 google::LogMessage::Flush()
@ 0x7f1faa416efe google::LogMessageFatal::~LogMessageFatal()
@ 0x7f1fac2b73d4 paddle::memory::legacy::Alloc<>()
@ 0x7f1fac2b76b5 paddle::memory::allocation::LegacyAllocator::AllocateImpl()
@ 0x7f1fac2ab7d5 paddle::memory::allocation::AllocatorFacade::Alloc()
@ 0x7f1fac2ab95a paddle::memory::allocation::AllocatorFacade::AllocShared()
@ 0x7f1fabeb8fcc paddle::memory::AllocShared()
@ 0x7f1fac27e204 paddle::framework::Tensor::mutable_data()
@ 0x7f1faaad4ce1 paddle::framework::Tensor::mutable_data<>()
@ 0x7f1faab1c035 paddle::operators::ActivationGradKernel<>::Compute()
@ 0x7f1faab1c183 _ZNSt17_Function_handlerIFvRKN6paddle9framework16ExecutionContextEEZNKS1_24OpKernelRegistrarFunctorINS0_8platform9CUDAPlaceELb0ELm0EINS0_9operators20ActivationGradKernelINS7_17CUDADeviceContextENS9_15ReluGradFunctorIfEEEENSA_ISB_NSC_IdEEEENSA_ISB_NSC_INS7_7float16EEEEEEEclEPKcSM_iEUlS4_E_E9_M_invokeERKSt9_Any_dataS4_
@ 0x7f1fac227907 paddle::framework::OperatorWithKernel::RunImpl()
@ 0x7f1fac227ce1 paddle::framework::OperatorWithKernel::RunImpl()
@ 0x7f1fac2252dc paddle::framework::OperatorBase::Run()
@ 0x7f1fac021a2a paddle::framework::details::ComputationOpHandle::RunImpl()
@ 0x7f1fac0143d0 paddle::framework::details::OpHandleBase::Run()
@ 0x7f1fabff5746 paddle::framework::details::FastThreadedSSAGraphExecutor::RunOpSync()
@ 0x7f1fabff43af paddle::framework::details::FastThreadedSSAGraphExecutor::RunOp()
@ 0x7f1fabff476f _ZNSt17_Function_handlerIFvvESt17reference_wrapperISt12_Bind_simpleIFS1_ISt5_BindIFZN6paddle9framework7details28FastThreadedSSAGraphExecutor10RunOpAsyncEPSt13unordered_mapIPNS6_12OpHandleBaseESt6atomicIiESt4hashISA_ESt8equal_toISA_ESaISt4pairIKSA_SC_EEESA_RKSt10shared_ptrINS5_13BlockingQueueImEEEEUlvE_vEEEvEEEE9_M_invokeERKSt9_Any_data
@ 0x7f1faa4feb43 std::_Function_handler<>::_M_invoke()
@ 0x7f1faa395787 std::__future_base::_State_base::_M_do_set()
@ 0x7f200a925a99 __pthread_once_slow
@ 0x7f1fabfefdf2 _ZNSt13__future_base11_Task_stateISt5_BindIFZN6paddle9framework7details28FastThreadedSSAGraphExecutor10RunOpAsyncEPSt13unordered_mapIPNS4_12OpHandleBaseESt6atomicIiESt4hashIS8_ESt8equal_toIS8_ESaISt4pairIKS8_SA_EEES8_RKSt10shared_ptrINS3_13BlockingQueueImEEEEUlvE_vEESaIiEFvvEE6_M_runEv
@ 0x7f1faa396d04 _ZZN10ThreadPoolC1EmENKUlvE_clEv
@ 0x7f1fcb685678 execute_native_thread_routine_compat
@ 0x7f200a91e6ba start_thread
@ 0x7f2009f4441d clone
@ (nil) (unknown)
Aborted (core dumped)
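For reference, this is roughly how I lowered the allocation mentioned in point 2 (a minimal sketch; it assumes the usual approach of setting the Paddle flags named in the warning as environment variables before importing paddle, and the values are only examples):

import os

# Set the flags mentioned in the warning above to smaller values,
# before paddle is imported and the executor is created.
os.environ['FLAGS_fraction_of_gpu_memory_to_use'] = '0.05'   # example value
os.environ['FLAGS_initial_gpu_memory_in_mb'] = '500'         # example value
os.environ['FLAGS_reallocate_gpu_memory_in_mb'] = '500'      # example value

import paddle.fluid as fluid  # import only after the flags are set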
Could you share the project link?
https://aistudio.baidu.com/aistudio/projectdetail/301315
It's the baseline system for the opinion-type question reading comprehension task.
Could you please take a look?
When a notebook runs Paddle code, it creates temporary variables, and these leftover variables can affect the next run. If you need to re-run the whole Paddle task, clear them first; the simplest way is "top menu -> Code Executor -> Restart Executor".
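If you would rather not restart the whole executor, a rough alternative is to drop the big Python-side objects by hand in a notebook cell (plain Python, not an AI Studio API; the names are placeholders for whatever your script defines). Note that memory already held by the Paddle allocator may only be fully released by the restart described above:

import gc

# Placeholder names: replace with the variables your previous run created.
for name in ['exe', 'train_program', 'test_program', 'reader']:
    globals().pop(name, None)
gc.collect()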
I've tried restarting, and I've also tried stopping and then running again; it didn't help.
Now it works again. It feels like pure luck.
I've run into this problem too. It seems to be related to memory overflow.
Possibly. I tried some other small projects and they run fine. How did you solve it when you hit this?
I printed the loop iteration by iteration to see roughly where it overflowed, then changed the code by hand and ran it semi-automatically.
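Roughly like this (just a sketch; it assumes nvidia-smi is available in the environment, and train_batches and the training step are placeholders for whatever your own loop does):

import subprocess

def gpu_mem_used_mb():
    # Ask nvidia-smi for the currently used GPU memory, in MB.
    out = subprocess.check_output(
        ['nvidia-smi', '--query-gpu=memory.used',
         '--format=csv,noheader,nounits'])
    return int(out.decode().strip().splitlines()[0])

for step, batch in enumerate(train_batches):  # placeholder iterable
    print('step %d, GPU memory used: %d MB' % (step, gpu_mem_used_mb()))
    # ... run one training step on `batch` here ...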
Each run gets assigned a different machine, so sometimes it works and sometimes it doesn't.
I heard AI Studio had a system upgrade on the 26th, with quite a few improvements.
The GPU memory overflowed.
Could it be because of the system upgrade?
The latest Paddle is already 1.7.
One limit is GPU memory, the other is the shm limit. If it still crashes after you shrink the allocation, could it be the shm that's blowing up?
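If you want to check whether shm is the tight one, a quick diagnostic in plain Python (nothing AI Studio-specific, just a sketch):

import os

st = os.statvfs('/dev/shm')
print('/dev/shm total: %.2f GB, free: %.2f GB' % (
    st.f_blocks * st.f_frsize / 1e9,
    st.f_bavail * st.f_frsize / 1e9))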
I've changed the code and brought memory usage down to 5 GB, but it still won't run. It's probably what #11 said: the system isn't very stable.
That's probably it; there are a lot of competitions going on now, and a lot of people using the platform.
How do you get around the shm limit?
The model you're running is really quite large. In my case it was actually a code-efficiency problem; I was trying to cut corners at first.