[MKL] layers.matmul segfaults on 3D-shape input when built with MKL 2014

Aurelius84 posted on 2019-11

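When the whl package of the develop branch is built against the 2014 version of libmklml_intel.so, calling layers.matmul on 3D-shape input triggers a Segmentation fault.

Note: the current paddle-dev build has no problem; the error only appears after MKL is replaced with the 2014 version.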

1. Reproduction environment

Python version: 2.7.15
Operating system: CentOS release 6.3 (Final)

2. Reproduction code

import numpy as np
import paddle.fluid as fluid

# Two 3D inputs of identical shape
shape = [11, 84, 12]
x = fluid.data(shape=shape, dtype='float32', name='x')
y = fluid.data(shape=shape, dtype='float32', name='y')

# Multiply x by the transpose of y over the last two dimensions
z = fluid.layers.matmul(x, y, transpose_y=True)

x_data = np.random.random(shape).astype('float32')
y_data = np.random.random(shape).astype('float32')

place = fluid.CPUPlace()
exe = fluid.Executor(place)
exe.run(fluid.default_startup_program())
# Run the default main program and fetch z
res = exe.run(feed={'x': x_data, 'y': y_data}, fetch_list=[z])

Error log:

/workspace/baidu/personal-code/test_matmul/env2014/lib/python2.7/site-packages/paddle/fluid/executor.py:784: UserWarning: The current program is empty.
  warnings.warn(error_info)
W1128 08:57:38.760659 182155 init.cc:205] *** Aborted at 1574931458 (unix time) try "date -d @1574931458" if you are using GNU date ***
W1128 08:57:38.761991 182155 init.cc:205] PC: @                0x0 (unknown)
W1128 08:57:38.762099 182155 init.cc:205] *** SIGSEGV (@0x0) received by PID 182155 (TID 0x7fc6fcda5700) from PID 0; stack trace: ***
W1128 08:57:38.763234 182155 init.cc:205]     @     0x7fc6fc5777e0 (unknown)
W1128 08:57:38.764281 182155 init.cc:205]     @                0x0 (unknown)
Segmentation fault

All comments (8)

AIStudio785465

#2 replied on 2019-11

@yinghu5 @LeoZhao-Intel

Could you help take a look?

AIStudio790712

#3 replied on 2019-11

I'm not sure about the 2014 MKL version. From https://software.intel.com/en-us/articles/introduction-to-the-conditional-numerical-reproducibility-cnr

There was a requirement before 11.1:

In MKL versions before 11.1, there was one more limitation: Input and output arrays in function calls must be aligned on 16, 32, or 64 byte boundaries on systems with SSE / AVX1 / AVX2 instructions support (resp.). MKL 11.1 has dropped this requirement. CNR can be obtained on unaligned input arrays, but aligning data will typically lead to better performance.

You can simply try different shapes to check whether it segfaults for all of them.
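
A minimal sketch of such a shape sweep, assuming each shape is run in a child Python process so that a crash does not take the driver down with it. The names `REPRO_TEMPLATE`, `segfaults`, and `sweep` are hypothetical helpers; the template simply reuses the reproduction code from the issue with the shape left as a parameter.

```python
import signal
import subprocess
import sys

# Hypothetical template: the issue's reproduction code, parameterized by shape.
REPRO_TEMPLATE = """
import numpy as np
import paddle.fluid as fluid

shape = {shape!r}
x = fluid.data(shape=shape, dtype='float32', name='x')
y = fluid.data(shape=shape, dtype='float32', name='y')
z = fluid.layers.matmul(x, y, transpose_y=True)

exe = fluid.Executor(fluid.CPUPlace())
exe.run(fluid.default_startup_program())
feed = {{'x': np.random.random(shape).astype('float32'),
         'y': np.random.random(shape).astype('float32')}}
exe.run(feed=feed, fetch_list=[z])
"""


def segfaults(source):
    """Run `source` in a child interpreter; True if it was killed by SIGSEGV."""
    # On POSIX, a child killed by signal N is reported as return code -N.
    return subprocess.call([sys.executable, "-c", source]) == -signal.SIGSEGV


def sweep(shapes):
    """Map each shape to whether the reproduction segfaults for it."""
    return {tuple(s): segfaults(REPRO_TEMPLATE.format(shape=list(s)))
            for s in shapes}
```

Calling `sweep([[11, 84, 12], [16, 32, 64], [4, 4]])` in the environment with the 2014 library would then show whether the crash depends on the shape at all.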

Aurelius84

#4 replied on 2019-11

You can simply try different shapes to check whether it segfaults for all of them.

Changing the shape to [16, 32, 64] also results in a Segmentation fault.

AIStudio790713

#5 replied on 2019-11

@Aurelius84

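Early on we used version names such as MKL 11.x, and only later switched to MKL 2017/2018/2019. Where was the 2014 MKL version downloaded from, and what exactly was the file name used at installation?

Also, does export MKL_VERBOSE=1 print any information when you run it? The MKL verbose feature was introduced in 11.2.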
https://software.intel.com/en-us/articles/verbose-mode-supported-in-intel-mkl-112
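
For reference, a small sketch of setting the variable from Python instead of the shell, assuming the reproduction code has been saved to a script. The helper name `run_with_mkl_verbose` and the file name `repro.py` are hypothetical. If the library predates 11.2, no verbose output should be expected, per the note above.

```python
import os
import subprocess
import sys


def run_with_mkl_verbose(argv):
    """Run `argv` as a child process with MKL_VERBOSE=1; return its exit code."""
    # Copy the current environment so that only the child sees the variable.
    env = dict(os.environ, MKL_VERBOSE="1")
    return subprocess.call(argv, env=env)


# Hypothetical usage, with the reproduction code saved as repro.py:
#   run_with_mkl_verbose([sys.executable, "repro.py"])
```

Setting the variable on the child process rather than in the running interpreter keeps it in place before the MKL library is loaded.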

Aurelius84

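#6 replied on 2019-11

Where was the 2014 MKL version downloaded from

I checked: the .so file for this version is probably not complete (compared with the 2019 version). It was built separately and provided by colleagues at Intel in January this year.

So this segmentation fault is probably caused by the .so not being an official build (only a subset of the functions was packaged).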

AIStudio790713

#7 replied on 2019-11

@Aurelius84 @luotao1

Thank you for sharing the 2014 information with us; I have it now :)
[yhu5@snb04 baidu_sgemm]$ ./a.out
Major version: 11
Minor version: 1
Update version: 2
Product status: Product
Build: 20140122

This reminds me that version 11.1.2 (2014-01-22) does not support sgemm_batch, which is used by the 3D-shape test code.
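
To illustrate why a batched GEMM routine comes into play for 3D inputs, here is a NumPy sketch (not Paddle's implementation) of what matmul with transpose_y=True computes on the shapes from the reproduction code: one independent 2D matrix product per entry along the leading axis. That Paddle maps this pattern to sgemm_batch is taken from the comment above and is not verified here.

```python
import numpy as np

shape = (11, 84, 12)  # same shape as the reproduction code
rng = np.random.RandomState(0)
x = rng.random_sample(shape).astype('float32')
y = rng.random_sample(shape).astype('float32')

# Batched form: multiply x[i] by the transpose of y[i] for every i.
batched = np.matmul(x, y.transpose(0, 2, 1))

# Equivalent loop: one independent 2D matrix product per batch entry.
looped = np.stack([x[i].dot(y[i].T) for i in range(shape[0])])

print(batched.shape)  # (11, 84, 84)
```

A 2D input has no leading batch axis, so it reduces to a single matrix product.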

Could you please check 2014_mklml.lib? For example, run:

nm 2014_libmklml.lib | grep sgemm_batch (it may only be declared, with no implementation)

Aurelius84

#8 replied on 2019-11

@yinghu5

nm 2014_libmklml.lib | grep sgemm_batch (maybe only definition, no implementation)

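Thanks for the reply. I ran the command you provided; the results are below. It looks like the 2014 MKL .so file indeed has no implementation of sgemm_batch?

Command: nm 2014_libmklml_intel.so | grep sgemm_batch
Output: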

U cblas_sgemm_batch

Command: nm 2019_libmklml_intel.so | grep sgemm_batch
Output:

00000000001d75f0 T cblas_sgemm_batch
0000000000218d50 T mkl_blas__sgemm_batch
000000000024a7d0 T mkl_blas_errchk_sgemm_batch
00000000002abe10 T mkl_blas_sgemm_batch
0000000000218d50 T sgemm_batch
0000000000218d50 T sgemm_batch_
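
In nm output, U marks a symbol that the file references but does not define, while T marks a symbol defined in the text (code) section. A minimal sketch of separating the two automatically, using lines from the outputs above as sample input; the helper name `classify_symbols` is hypothetical.

```python
def classify_symbols(nm_output):
    """Split `nm` output lines into defined and undefined symbol names."""
    defined, undefined = set(), set()
    for line in nm_output.splitlines():
        parts = line.split()
        if len(parts) == 2 and parts[0] == "U":
            # "U name": referenced by this file but not defined in it.
            undefined.add(parts[1])
        elif len(parts) == 3:
            # "address type name": the symbol has an address in this file.
            defined.add(parts[2])
    return defined, undefined


# Sample lines copied from the two nm outputs in this comment.
nm_2014 = "                 U cblas_sgemm_batch"
nm_2019 = """\
00000000001d75f0 T cblas_sgemm_batch
0000000000218d50 T mkl_blas__sgemm_batch
"""

print(classify_symbols(nm_2014))
print(classify_symbols(nm_2019))
```

For the 2014 file, cblas_sgemm_batch lands in the undefined set, which matches the reading that the library references the routine without providing it.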

AIStudio785465

#9 replied on 2019-12

Since the business use case only requires results to match to 6 decimal places, this issue is being closed.