Invite-only beta Docker

lxvicvicvic posted 2020-12 | Views: 4222 | Replies: 3

Last edited 2020-12

Host: CentOS 7. The CUDA toolkit and nvidia-container-runtime are both installed correctly.

Docker image: https://public-codelab.bj.bcebos.com/docker-images/codelab_gpu.0.3.0.tar.gz

After setting "PADDLE_USE_GPU": 1 in examples/cls_cnn_ch.json, I ran:

!python3 run_with_json.py --param_path examples/cls_cnn_ch.json

This produced the following error:

INFO: 12-02 11:35:35: base_dataset_reader.py:110 * 139892197275456 set data_generator and start.......
W1202 11:35:37.351891 6466 dynamic_loader.cc:167] You may need to install 'nccl2' from NVIDIA official website: https://developer.nvidia.com/nccl/nccl-downloadbefore install PaddlePaddle.
/usr/local/bin/conda/envs/blackhole/lib/python3.7/site-packages/paddle/fluid/executor.py:1070: UserWarning: The following exception is not an EOF exception.
"The following exception is not an EOF exception.")
ERROR: 12-02 11:35:37: custom_trainer.py:116 * 139892197275456 traceback.format_exc():Traceback (most recent call last):
File "../../wenxin/training/custom_trainer.py", line 59, in train_and_eval
return_numpy=self.return_numpy)
File "textone_pro/training/controler.py", line 437, in controler.BaseTrainer.run
File "/usr/local/bin/conda/envs/blackhole/lib/python3.7/site-packages/paddle/fluid/parallel_executor.py", line 303, in run
return_numpy=return_numpy)
File "/usr/local/bin/conda/envs/blackhole/lib/python3.7/site-packages/paddle/fluid/executor.py", line 1071, in run
six.reraise(*sys.exc_info())
File "/usr/local/bin/conda/envs/blackhole/lib/python3.7/site-packages/six.py", line 703, in reraise
raise value
File "/usr/local/bin/conda/envs/blackhole/lib/python3.7/site-packages/paddle/fluid/executor.py", line 1066, in run
return_merged=return_merged)
File "/usr/local/bin/conda/envs/blackhole/lib/python3.7/site-packages/paddle/fluid/executor.py", line 1156, in _run_impl
program._compile(scope, self.place)
File "/usr/local/bin/conda/envs/blackhole/lib/python3.7/site-packages/paddle/fluid/compiler.py", line 443, in _compile
places=self._places)
File "/usr/local/bin/conda/envs/blackhole/lib/python3.7/site-packages/paddle/fluid/compiler.py", line 396, in _compile_data_parallel
self._exec_strategy, self._build_strategy, self._graph)
paddle.fluid.core_avx.EnforceNotMet:

--------------------------------------------
C++ Call Stacks (More useful to developers):
--------------------------------------------
0 std::string paddle::platform::GetTraceBackString(std::string&&, char const*, int)
1 paddle::platform::EnforceNotMet::EnforceNotMet(paddle::platform::ErrorSummary const&, char const*, int)
2 paddle::platform::dynload::GetNCCLDsoHandle()
3 void std::__once_call_impl(ncclComm**, int, int*)::{lambda()#1} ()> >()
4 paddle::platform::NCCLContextMap::NCCLContextMap(std::vector > const&, ncclUniqueId*, unsigned long, unsigned long)
5 paddle::platform::NCCLCommunicator::InitFlatCtxs(std::vector > const&, std::vector > const&, unsigned long, unsigned long)
6 paddle::framework::ParallelExecutorPrivate::InitNCCLCtxs(paddle::framework::Scope*, paddle::framework::details::BuildStrategy const&)
7 paddle::framework::ParallelExecutorPrivate::InitOrGetNCCLCommunicator(paddle::framework::Scope*, paddle::framework::details::BuildStrategy*)
8 paddle::framework::ParallelExecutor::ParallelExecutor(std::vector > const&, std::vector > const&, std::string const&, paddle::framework::Scope*, std::vector > const&, paddle::framework::details::ExecutionStrategy const&, paddle::framework::details::BuildStrategy const&, paddle::framework::ir::Graph*)

----------------------
Error Message Summary:
----------------------
PreconditionNotMetError: The third-party dynamic library (libnccl.so) that Paddle depends on is not configured correctly. (error code is libnccl.so: cannot open shared object file: No such file or directory)
Suggestions:
1. Check if the third-party dynamic library (e.g. CUDA, CUDNN) is installed correctly and its version is matched with paddlepaddle you installed.
2. Configure third-party dynamic library environment variables as follows:
- Linux: set LD_LIBRARY_PATH by `export LD_LIBRARY_PATH=...`
- Windows: set PATH by `set PATH=XXX; at (/paddle/paddle/fluid/platform/dynload/dynamic_loader.cc:194)

INFO: 12-02 11:35:37: params.py:41 * 139892197275456 ./output/cls_cnn_ch/save_checkpoints/checkpoints_step_1/model.meta
INFO: 12-02 11:35:37: params.py:48 * 139892197275456 {
"deploy_type": 4,
"encrypt_type": null,
"framework_version": "bml-code-lab-public-v1.0.0",
"is_encryption": false,
"job_type": "text_classification",
"model_type": "",
"net_type": "CnnClassification",
"pretrain_model_type": "",
"pretrain_model_version": "",
"stat_file_name": "wenxin_stat",
"task_type": "train"
}
INFO: 12-02 11:35:37: params.py:41 * 139892197275456 ./output/cls_cnn_ch/save_inference_model/inference_step_1/infer_data_params.json
INFO: 12-02 11:35:37: params.py:48 * 139892197275456 {
"fields": [
"text_a#src_ids",
"text_a#seq_lens"
]
}
INFO: 12-02 11:35:37: params.py:41 * 139892197275456 ./output/cls_cnn_ch/save_inference_model/inference_step_1/model.meta
INFO: 12-02 11:35:37: params.py:48 * 139892197275456 {
"deploy_type": 4,
"encrypt_type": null,
"framework_version": "bml-code-lab-public-v1.0.0",
"is_encryption": false,
"job_type": "text_classification",
"model_type": "",
"net_type": "CnnClassification",
"pretrain_model_type": "",
"pretrain_model_version": "",
"stat_file_name": "wenxin_stat",
"task_type": "train"
}
Traceback (most recent call last):
File "run_with_json.py", line 115, in <module>
run_trainer(_params)
File "run_with_json.py", line 101, in run_trainer
trainer.train_and_eval()
File "../../wenxin/training/custom_trainer.py", line 119, in train_and_eval
raise e
File "../../wenxin/training/custom_trainer.py", line 59, in train_and_eval
return_numpy=self.return_numpy)
File "textone_pro/training/controler.py", line 437, in controler.BaseTrainer.run
File "/usr/local/bin/conda/envs/blackhole/lib/python3.7/site-packages/paddle/fluid/parallel_executor.py", line 303, in run
return_numpy=return_numpy)
File "/usr/local/bin/conda/envs/blackhole/lib/python3.7/site-packages/paddle/fluid/executor.py", line 1071, in run
six.reraise(*sys.exc_info())
File "/usr/local/bin/conda/envs/blackhole/lib/python3.7/site-packages/six.py", line 703, in reraise
raise value
File "/usr/local/bin/conda/envs/blackhole/lib/python3.7/site-packages/paddle/fluid/executor.py", line 1066, in run
return_merged=return_merged)
File "/usr/local/bin/conda/envs/blackhole/lib/python3.7/site-packages/paddle/fluid/executor.py", line 1156, in _run_impl
program._compile(scope, self.place)
File "/usr/local/bin/conda/envs/blackhole/lib/python3.7/site-packages/paddle/fluid/compiler.py", line 443, in _compile
places=self._places)
File "/usr/local/bin/conda/envs/blackhole/lib/python3.7/site-packages/paddle/fluid/compiler.py", line 396, in _compile_data_parallel
self._exec_strategy, self._build_strategy, self._graph)
paddle.fluid.core_avx.EnforceNotMet:

terminate called without an active exception
W1202 11:35:37.705427 6513 init.cc:226] Warning: PaddlePaddle catches a failure signal, it may not work properly
W1202 11:35:37.705468 6513 init.cc:228] You could check whether you killed PaddlePaddle thread/process accidentally or report the case to PaddlePaddle
W1202 11:35:37.705476 6513 init.cc:231] The detail failure signal is:

W1202 11:35:37.705487 6513 init.cc:234] *** Aborted at 1606880137 (unix time) try "date -d @1606880137" if you are using GNU date ***
W1202 11:35:37.709525 6513 init.cc:234] PC: @ 0x0 (unknown)
W1202 11:35:37.709700 6513 init.cc:234] *** SIGABRT (@0x3e800001942) received by PID 6466 (TID 0x7f3abdb98700) from PID 6466; stack trace: ***
W1202 11:35:37.712968 6513 init.cc:234] @ 0x7f3b30774980 (unknown)
W1202 11:35:37.716092 6513 init.cc:234] @ 0x7f3b303affb7 gsignal
W1202 11:35:37.719053 6513 init.cc:234] @ 0x7f3b303b1921 abort
W1202 11:35:37.721267 6513 init.cc:234] @ 0x7f3b063c784a __gnu_cxx::__verbose_terminate_handler()
W1202 11:35:37.722939 6513 init.cc:234] @ 0x7f3b063c5f47 __cxxabiv1::__terminate()
W1202 11:35:37.724933 6513 init.cc:234] @ 0x7f3b063c5f7d std::terminate()
W1202 11:35:37.726735 6513 init.cc:234] @ 0x7f3b063c5c5a __gxx_personality_v0
W1202 11:35:37.729219 6513 init.cc:234] @ 0x7f3b2c672b97 _Unwind_ForcedUnwind_Phase2
W1202 11:35:37.731604 6513 init.cc:234] @ 0x7f3b2c672e7d _Unwind_ForcedUnwind
W1202 11:35:37.734496 6513 init.cc:234] @ 0x7f3b30773000 __GI___pthread_unwind
W1202 11:35:37.737349 6513 init.cc:234] @ 0x7f3b3076aae5 __pthread_exit
W1202 11:35:37.738099 6513 init.cc:234] @ 0x55db59fb1e49 PyThread_exit_thread
W1202 11:35:37.738323 6513 init.cc:234] @ 0x55db59e35b23 PyEval_RestoreThread.cold.796
W1202 11:35:37.741436 6513 init.cc:234] @ 0x7f3aca40ee69 pybind11::gil_scoped_release::~gil_scoped_release()
W1202 11:35:37.741873 6513 init.cc:234] @ 0x7f3aca4f7976 _ZZN8pybind1112cpp_function10initializeIZN6paddle6pybind10BindReaderEPNS_6moduleEEUlRNS2_9operators6reader22LoDTensorBlockingQueueERKSt6vectorINS2_9framework9LoDTensorESaISC_EEE1_bIS9_SG_EINS_4nameENS_9is_methodENS_7siblingENS_10call_guardIINS_18gil_scoped_releaseEEEEEEEvOT_PFT0_DpT1_EDpRKT2_ENUlRNS_6detail13function_callEE1_4_FUNES11_
W1202 11:35:37.744927 6513 init.cc:234] @ 0x7f3aca42c679 pybind11::cpp_function::dispatcher()
W1202 11:35:37.745769 6513 init.cc:234] @ 0x55db59f32914 _PyMethodDef_RawFastCallKeywords
W1202 11:35:37.746541 6513 init.cc:234] @ 0x55db59f32a31 _PyCFunction_FastCallKeywords
W1202 11:35:37.747300 6513 init.cc:234] @ 0x55db59f9f39e _PyEval_EvalFrameDefault
W1202 11:35:37.747999 6513 init.cc:234] @ 0x55db59ee2160 _PyEval_EvalCodeWithName
W1202 11:35:37.748723 6513 init.cc:234] @ 0x55db59ee2925 _PyFunction_FastCallDict
W1202 11:35:37.749497 6513 init.cc:234] @ 0x55db59f9beea _PyEval_EvalFrameDefault
W1202 11:35:37.750239 6513 init.cc:234] @ 0x55db59f31e7b _PyFunction_FastCallKeywords
W1202 11:35:37.751001 6513 init.cc:234] @ 0x55db59f9a740 _PyEval_EvalFrameDefault
W1202 11:35:37.751672 6513 init.cc:234] @ 0x55db59f31e7b _PyFunction_FastCallKeywords
W1202 11:35:37.752458 6513 init.cc:234] @ 0x55db59f9a740 _PyEval_EvalFrameDefault
W1202 11:35:37.753166 6513 init.cc:234] @ 0x55db59ee285b _PyFunction_FastCallDict
W1202 11:35:37.753934 6513 init.cc:234] @ 0x55db59f014d3 _PyObject_Call_Prepend
W1202 11:35:37.754766 6513 init.cc:234] @ 0x55db59ef3ffe PyObject_Call
W1202 11:35:37.755129 6513 init.cc:234] @ 0x55db59ff2f77 t_bootstrap
W1202 11:35:37.755313 6513 init.cc:234] @ 0x55db59fad818 pythread_wrapper
W1202 11:35:37.758723 6513 init.cc:234] @ 0x7f3b307696db start_thread
Aborted

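Does this mean CUDA is not installed properly inside the Docker container? Running the same example on CPU inside the container works. In the container's terminal only nvidia-smi is available; nvcc cannot be found.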
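The error summary in the log says that libnccl.so could not be opened, which is a separate question from whether nvcc is on the PATH. A quick way to narrow this down is to ask the dynamic loader directly which libraries it can find. The sketch below is only a diagnostic aid: the list of library names is an assumption chosen for illustration, not something taken from the image, and a missing entry only tells you what the loader cannot see with the current LD_LIBRARY_PATH.

```python
import ctypes


def can_load(lib_name: str) -> bool:
    """Return True if the dynamic loader can open lib_name, else False."""
    try:
        ctypes.CDLL(lib_name)
        return True
    except OSError:
        # Raised when the shared object is not found or cannot be loaded.
        return False


if __name__ == "__main__":
    # Library names below are illustrative assumptions; adjust as needed.
    for name in ("libnccl.so", "libcudart.so", "libcuda.so.1"):
        status = "loadable" if can_load(name) else "NOT found"
        print(name, status)
```

Run it inside the container with the same Python environment that runs the example. If libnccl.so is reported as not found while the CUDA libraries load, the failure is likely limited to multi-GPU (NCCL) initialization rather than to CUDA as a whole.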

3 replies in total; last reply by JavaRoom, 2020-12

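#4 JavaRoom replied 2020-12

In reply to #2 lxvicvicvic

Solved: in the notebook, even running export CUDA_VISIBLE_DEVICES='0' does not fix the problem, but running export CUDA_VISIBLE_DEVICES='0' in a terminal first and then running the example program works.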

Hahaha, I got it working too.

Why not try CentOS 8?

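#3 春水shine replied 2020-12

Nice!!!

#2 lxvicvicvic replied 2020-12

Solved:

In the notebook, even running export CUDA_VISIBLE_DEVICES='0' does not fix the problem, but running export CUDA_VISIBLE_DEVICES='0' in a terminal first and then running the example program works.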
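One possible reading of this fix, offered as an assumption rather than a confirmed diagnosis: if the export was run in a notebook cell with a `!` prefix, it executed in a throwaway subshell, so the later `!python3` cell never saw the variable. With only one GPU visible, the data-parallel path presumably no longer needs to initialize NCCL, which would explain why the libnccl.so error disappears. A minimal sketch of passing the variable explicitly to the child process from Python follows; the helper name `run_with_visible_gpus` is hypothetical.

```python
import os
import subprocess
import sys


def run_with_visible_gpus(cmd, gpu_ids="0", **kwargs):
    """Run cmd with CUDA_VISIBLE_DEVICES set only for the child process."""
    # Copy the current environment and override a single variable.
    env = dict(os.environ, CUDA_VISIBLE_DEVICES=gpu_ids)
    return subprocess.run(cmd, env=env, **kwargs)


# Probe: a child Python process reports the value it actually received.
probe = [
    sys.executable,
    "-c",
    "import os; print(os.environ.get('CUDA_VISIBLE_DEVICES'))",
]
result = run_with_visible_gpus(
    probe, "0", stdout=subprocess.PIPE, universal_newlines=True
)
print(result.stdout.strip())

# For the example in this thread, the call would look like:
# run_with_visible_gpus(
#     ["python3", "run_with_json.py", "--param_path", "examples/cls_cnn_ch.json"]
# )
```

Inside a notebook, setting `os.environ["CUDA_VISIBLE_DEVICES"] = "0"` in a cell before launching the script should have a similar effect, since child processes inherit the kernel's environment.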
