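I entered a competition and was lucky enough to get a dual-T4 machine to use. I assumed installing PaddlePaddle on it would be a piece of cake, but the process turned out to be very bumpy.
The things about this machine that were a challenge for me:
1 The OS is CentOS 7, while all my previous experience with AI frameworks was on Ubuntu
2 The CUDA version was 10.2, while PaddlePaddle only supported up to 10.1
3 Two T4 cards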
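First I downgraded CUDA to 10.1, and after a dizzying series of steps (adding paths, creating symlinks, and so on) PaddlePaddle did install. But later I got careless: while installing other frameworks I touched something inside CUDA, and PaddlePaddle broke. After I reinstalled CUDA 10.1, no amount of the same dizzying steps helped, and PaddlePaddle kept erroring. Specifically, static graph mode was OK, but dynamic graph mode failed with:
Error: Cannot load cudnn shared library. Cannot invoke method cudnnGetVersion at (/paddle/paddle/fluid/platform/dynload/cudnn.cc:63)
I then reinstalled CUDA 10.0 and still got the same error.
In the end I had no choice but to install CUDA 9.2, and this time the error changed to: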
Error: Failed to find dynamic library: libnccl.so ( libnccl.so: cannot open shared object file: No such file or directory )
Please specify its path correctly using following ways:
Method. set environment variable LD_LIBRARY_PATH on Linux or DYLD_LIBRARY_PATH on Mac OS.
For instance, issue command: export LD_LIBRARY_PATH=...
Note: After Mac OS 10.11, using the DYLD_LIBRARY_PATH is impossible unless System Integrity Protection (SIP) is disabled. at (/paddle/paddle/fluid/platform/dynload/dynamic_loader.cc:177)
Your Paddle Fluid is installed successfully ONLY for SINGLE GPU or CPU!
Let's start deep Learning with Paddle Fluid now
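The message above comes down to whether the dynamic loader can find libnccl.so. A minimal sketch for checking that from Python, using only the standard library; the helper names can_load and search_ld_library_path are made up for illustration and are not part of Paddle. Note that LD_LIBRARY_PATH is read by the loader when the process starts, so it has to be exported in the shell before launching Python.

```python
import ctypes
import os


def can_load(libname):
    """Return True if the dynamic loader can open the shared library."""
    try:
        ctypes.CDLL(libname)
        return True
    except OSError:
        return False


def search_ld_library_path(libname, ld_library_path=None):
    """List the directories in LD_LIBRARY_PATH that contain libname."""
    if ld_library_path is None:
        ld_library_path = os.environ.get("LD_LIBRARY_PATH", "")
    hits = []
    for directory in ld_library_path.split(os.pathsep):
        if directory and os.path.isfile(os.path.join(directory, libname)):
            hits.append(directory)
    return hits


# Report on the two libraries that appear in the errors in this post.
for lib in ("libnccl.so", "libcudnn.so"):
    print(lib, "loadable:", can_load(lib),
          "| found in LD_LIBRARY_PATH dirs:", search_ld_library_path(lib))
```

If a library shows up as not loadable and the directory list is empty, the directory that holds it is probably missing from LD_LIBRARY_PATH (or from the ldconfig cache).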
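So now the problem became installing libnccl. I went to the NVIDIA site https://developer.nvidia.com/nccl, followed the installation guide, and once it was installed I ran the test again: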
python -c "import paddle; paddle.fluid.install_check.run_check()"
This time the test passed. PaddlePaddle is working on both cards!
Running Verify Fluid Program ...
W0909 06:08:51.656754 20353 device_context.cc:252] Please NOTE: device: 0, CUDA Capability: 75, Driver API Version: 10.1, Runtime API Version: 10.0
W0909 06:08:51.660171 20353 device_context.cc:260] device: 0, cuDNN Version: 7.6.
Your Paddle Fluid works well on SINGLE GPU or CPU.
W0909 06:08:56.232347 20353 fuse_all_reduce_op_pass.cc:74] Find all_reduce operators: 2. To make the speed faster, some all_reduce ops are fused during training, after fusion, the number of all_reduce ops is 1.
Your Paddle Fluid works well on MUTIPLE GPU or CPU.
Your Paddle Fluid is installed successfully! Let's start deep Learning with Paddle Fluid now
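PaddlePaddle itself was never touched after the first install; everything after that was just installing and debugging CUDA. CUDA is a great thing, but it is really not friendly to non-specialists.
It just occurred to me: what I installed earlier was the cuda10.x 107 build of PaddlePaddle, and now I have switched to CUDA 9.2. If PaddlePaddle has not changed, then it is still using cuda10.x internally!
nvidia-smi shows 10.1, so did the command /usr/local/cuda-10.1/bin/cuda-uninstaller not take effect? I can see the symlink already points to cuda92! I don't get it...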
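One thing that may explain part of this: the CUDA version printed by nvidia-smi is the highest version the installed driver supports, not the toolkit that is installed, so uninstalling a toolkit would not change it. The W0909 log line in the test output likewise reports "Driver API Version" and "Runtime API Version" as two separate numbers. A minimal sketch for checking what the toolkit symlink really resolves to; the path /usr/local/cuda and the lib64 layout are assumptions about a typical Linux install, and describe_cuda_link is a made-up helper name.

```python
import glob
import os


def describe_cuda_link(link="/usr/local/cuda"):
    """Resolve the toolkit symlink and list the CUDA runtime libraries under it."""
    real = os.path.realpath(link)
    runtimes = sorted(glob.glob(os.path.join(real, "lib64", "libcudart.so*")))
    return real, runtimes


real, runtimes = describe_cuda_link()
print("toolkit symlink resolves to:", real)
for path in runtimes:
    print("runtime library:", path)
```

This only shows what is on disk. Which runtime the installed PaddlePaddle build was compiled against is a separate question that this check does not answer.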
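So in the end, how do you actually get the GPU working for inference? Has anyone here figured it out?
Inference shouldn't be a problem. After I got it installed I only ran training and haven't really run inference much; I just tried a few of the examples in PaddleHub, and they all worked fine.
Nice~
Awesome, bookmarked.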