OSError: (External) CUBLAS error(13).
[Hint: 'CUBLAS_STATUS_EXECUTION_FAILED'. The GPU program failed to execute. This is often caused by a launch failure of the kernel on the GPU, which can be caused by multiple reasons. To correct: check that the hardware, an appropriate version of the driver, and the cuBLAS library are correctly installed. ] (at /paddle/paddle/phi/kernels/funcs/blas/blas_impl.cu.h:35)
[operator < matmul_v2 > error]
6%|▌ | 1597/25932 [08:37<2:11:33, 3.08it/s]
terminate called after throwing an instance of 'phi::enforce::EnforceNotMet'
what(): (External) CUDA error(700), an illegal memory access was encountered.
[Hint: 'cudaErrorIllegalAddress'. The device encountered a load or store instruction on an invalid memory address. This leaves the process in an inconsistentstate and any further CUDA work will return the same error. To continue using CUDA, the process must be terminated and relaunched. ] (at /paddle/paddle/fluid/memory/allocation/cuda_device_context_allocator.h:99)
INFO 2022-08-17 13:27:29,212 launch_utils.py:322] terminate process group gid:4437
INFO 2022-08-17 13:27:33,215 launch_utils.py:343] terminate all the procs
ERROR 2022-08-17 13:27:33,215 launch_utils.py:640] ABORT!!! Out of all 2 trainers, the trainer process with rank=[0] was aborted. Please check its log.
INFO 2022-08-17 13:27:37,220 launch_utils.py:343] terminate all the procs
INFO 2022-08-17 13:27:37,220 launch.py:402] Local processes completed.
666
What is this 66666 supposed to mean?
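Follow what the hint says: "To correct: check that the hardware, an appropriate version of the driver, and the cuBLAS library are correctly installed."
Start by running a single-card test on each GPU separately to confirm that none of the cards themselves is faulty.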
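The per-card check suggested above could be scripted roughly as follows. This is only a sketch: `run_on_each_gpu` is a hypothetical helper name, the GPU ids `[0, 1]` are assumed from the "2 trainers" in the log, and the command you pass in (for example your own single-card training script) is a placeholder. It relies on two standard CUDA environment variables: `CUDA_VISIBLE_DEVICES` to expose one card at a time, and `CUDA_LAUNCH_BLOCKING=1` so that a failing kernel is reported at the call that launched it rather than at a later, unrelated call.

```python
import os
import subprocess
import sys


def run_on_each_gpu(cmd, gpu_ids):
    """Run `cmd` once per GPU, exposing only that GPU to the child process.

    Returns a dict mapping each GPU id to the child's exit code.
    `run_on_each_gpu` is a hypothetical helper, not part of Paddle.
    """
    results = {}
    for gpu_id in gpu_ids:
        env = dict(os.environ)
        # Only this card is visible, so the child sees it as device 0.
        env["CUDA_VISIBLE_DEVICES"] = str(gpu_id)
        # Make kernel launches synchronous so errors surface at the faulting call.
        env["CUDA_LAUNCH_BLOCKING"] = "1"
        proc = subprocess.run(cmd, env=env)
        results[gpu_id] = proc.returncode
    return results


if __name__ == "__main__" and len(sys.argv) > 1:
    # Assumed usage: python check_gpus.py python train.py
    # (train.py stands in for your own single-card training script.)
    for gpu_id, code in run_on_each_gpu(sys.argv[1:], [0, 1]).items():
        status = "ok" if code == 0 else f"failed with exit code {code}"
        print(f"GPU {gpu_id}: {status}")
```

If the job crashes on one card only, that points toward the hardware or its driver state. If it crashes on every card at about the same step, the data or the model code (for example an out-of-range index feeding `matmul_v2`) is the more likely suspect.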