【MKL】Different output with different version of libmklml_intel.so

I'm training a model on a CPU machine with Paddle, but the precision (PNR) cannot converge to the baseline with the same training dataset.

I found that the result trained with a Paddle whl built against 2014MKL is better than with 2019MKL.

  • 2019MKL means Paddle compiled with the 2019 libmklml_intel.so, which is used by default in Paddle.
  • 2014MKL means Paddle compiled with the 2014 libmklml_intel.so.
  • The Baseline result is trained by another framework based on 2014MKL.

There are many matrix multiplications in my model, such as blas.MatMul or blas.GEMM. I found that the output of matrix multiplication differs depending on which version of libmklml_intel.so is used to compile the Paddle whl.

I'm not sure how the differing output of MatMul or GEMM influences the precision. I would appreciate it very much if the relevant developers could help follow up on this problem.
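For intuition, here is a minimal numpy sketch (illustrative only, not Paddle code) showing that merely changing the summation order of a float32 GEMM already produces differences of roughly this magnitude, so two MKL builds that pick different kernels or blocking can legitimately disagree at around 1e-5:

import numpy as np

np.random.seed(0)
x = np.random.random((16, 384)).astype('float32')
y = np.random.random((384, 128)).astype('float32')

# One GEMM over the full K dimension:
z1 = x.dot(y)
# The same product with K split in half, i.e. a different summation order:
z2 = x[:, :192].dot(y[:192, :]) + x[:, 192:].dot(y[192:, :])

# Typically on the order of 1e-5 for float32 inputs of this size:
print(np.max(np.abs(z1 - z2)))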

See details as follows:


result    | Baseline | 2019MKL | 2014MKL
max(PNR)  | 2.4207   | 2.3924  | 2.4303

How to reproduce with Docker:

1. Environment & Version

  • CentOS release 6.3 (Final)
  • docker image: docker.paddlepaddlehub.com/paddle_manylinux_devel:cuda8.0_cudnn7
  • python: 2.7.15

2. Build Paddle

  • Compile with 2019 MKL
1. git clone https://github.com/PaddlePaddle/Paddle.git
2. cd Paddle && mkdir build && cd build
3. compile:
export LD_LIBRARY_PATH=/opt/_internal/cpython-2.7.11-ucs4/lib:${LD_LIBRARY_PATH#/opt/_internal/cpython-2.7.11-ucs2/lib:}

cmake .. ${PYTHON_FLAGS} -DWITH_DISTRIBUTE=ON -DWITH_GRPC=ON -DWITH_BRPC=OFF  -DWITH_FAST_BUNDLE_TEST=OFF -DWITH_PROFILER=OFF -DPY_VERSION=2.7 -DWITH_FLUID_ONLY=ON -DWITH_GPU=OFF -DWITH_TESTING=OFF -DCMAKE_BUILD_TYPE=Release -DWITH_MKL=ON -DWITH_MKLDNN=OFF

make -j$(nproc)
  • Compile with 2014 MKL
1. git clone https://github.com/PaddlePaddle/Paddle.git
2. cd Paddle && mkdir build && cd build
3. compile:
export LD_LIBRARY_PATH=/opt/_internal/cpython-2.7.11-ucs4/lib:${LD_LIBRARY_PATH#/opt/_internal/cpython-2.7.11-ucs2/lib:}

cmake .. ${PYTHON_FLAGS} -DWITH_DISTRIBUTE=ON -DWITH_GRPC=ON -DWITH_BRPC=OFF  -DWITH_FAST_BUNDLE_TEST=OFF -DWITH_PROFILER=OFF -DPY_VERSION=2.7 -DWITH_FLUID_ONLY=ON -DWITH_GPU=OFF -DWITH_TESTING=OFF -DCMAKE_BUILD_TYPE=Release -DWITH_MKL=ON -DWITH_MKLDNN=OFF

# here you should put the 2014 version of libmklml_intel.so into build/2014_libmklml_intel.so first
cp 2014_libmklml_intel.so third_party/install/mklml/lib/libmklml_intel.so
cp 2014_libmklml_intel.so third_party/mklml/src/extern_mklml/lib/libmklml_intel.so

make -j$(nproc)
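After installing each whl, it is worth confirming which libmklml_intel.so (and its companion libiomp5.so) the process actually loads. A minimal Linux-only sketch using /proc/self/maps:

import paddle.fluid  # importing paddle maps its compiled core and MKL deps

# /proc/self/maps lists every shared object mapped into this process,
# so we can see exactly which MKL libraries were picked up.
with open('/proc/self/maps') as f:
    libs = sorted({line.split()[-1] for line in f
                   if 'mklml' in line or 'iomp5' in line})
for path in libs:
    print(path)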

3. Check diff

You can compare the output of fluid.layers.fc by feeding the same data and weights. Here is my script. First, save x and y into .bin files so that the feed data stays identical across runs.

import paddle.fluid as fluid
import numpy as np

shape = [16, 384]
x = fluid.data(shape=shape, dtype='float32', name='x')
#y = fluid.data(shape=shape, dtype='float32', name='y')

#z = fluid.layers.matmul(x, y, transpose_y=True)

##### run me only once: dump fixed random inputs to disk #####
x_data = np.random.random(shape).astype('float32')
y_data = np.random.random([shape[1], 128]).astype('float32')
x_data.tofile('x.bin')
y_data.tofile('y.bin')
##### end #####

# reload the saved inputs so every build sees bit-identical data
x_data = np.fromfile('x.bin', dtype=np.float32).reshape(shape)
y_data = np.fromfile('y.bin', dtype=np.float32).reshape([shape[1], 128])

# fc with a fixed weight matrix, so the GEMM inputs are fully deterministic
z = fluid.layers.fc(x, size=128,
                    param_attr=fluid.initializer.NumpyArrayInitializer(y_data))

place = fluid.CPUPlace()
exe = fluid.Executor(place)
exe.run(fluid.default_startup_program())
res = exe.run(feed={'x': x_data}, fetch_list=[z])
np.savetxt('res.txt', res[0].reshape(16, -1))

Install the Paddle whl built with one version of libmklml_intel.so and run the above script to save the result into res.txt. Then reinstall the whl built with the other version and generate a new res.txt.
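To keep the two results apart, rename res.txt after each run before switching wheels (the tag is set by hand to match the file names used below):

import os

tag = '2019'  # set by hand: which MKL build produced this res.txt
os.rename('res.txt', 'res.txt_fc_%s' % tag)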

Then run the following code to check the difference:

import numpy as np

res_2014 = np.loadtxt("res.txt_fc_2014")
res_2019 = np.loadtxt("res.txt_fc_2019")

print(res_2014 - res_2019)

Output may look like:

[[ 2.28881836e-05 -2.28881836e-05  1.52587891e-05 ...  3.05175781e-05
   3.05175781e-05  3.81469727e-05]
 [ 0.00000000e+00  1.52587891e-05  2.28881836e-05 ...  9.15527344e-05
  -7.62939453e-06  7.62939453e-06]
 [ 6.10351562e-05 -3.05175781e-05  1.52587891e-05 ... -7.62939453e-06
   1.52587891e-05  7.62939453e-06]
 ...
 [ 1.52587891e-05  0.00000000e+00  5.34057617e-05 ... -4.57763672e-05
   0.00000000e+00 -1.52587891e-05]
 [-3.05175781e-05  2.28881836e-05  6.86645508e-05 ... -1.52587891e-05
   7.62939453e-06  3.81469727e-05]
 [-6.10351562e-05 -3.81469727e-05  4.57763672e-05 ...  7.62939453e-06
  -3.05175781e-05  0.00000000e+00]]
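Rather than eyeballing the raw matrix, a short summary script (same file names as above) makes the scale of the discrepancy explicit:

import numpy as np

res_2014 = np.loadtxt("res.txt_fc_2014")
res_2019 = np.loadtxt("res.txt_fc_2019")

diff = np.abs(res_2014 - res_2019)
print("max abs diff: %g" % diff.max())
# The fc outputs here are O(100), so an absolute difference around 1e-4
# is a relative error close to float32 rounding:
print(np.allclose(res_2014, res_2019, rtol=1e-6))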
All comments (24)
Aurelius84
#22 · replied 2019-12
@yinghu5 wrote: "Presumably the libiomp5.so on the cloud is different from the local one."

The libiomp5.so on the cloud is the file I uploaded myself; it is the same as the one tested on my local dev machine.

libmkl2014.zip
(huying's new libmkl2014.so)

I'd like to ask: is the libiomp5.so added as a dependency here the same one that Paddle depends on?

Right now this approach fails to run experiments on the cloud, so I have no way to verify further.

AIStudio790713
#23 · replied 2019-12
@Aurelius84

Is MKL 2014 installed on the cloud or on any other machine? If so, there is a libiomp5.so in its directory; could you try that one?
Also, is the path ./libiomp5.so the same file as /home/disk1/normandy/maybach/app-user-20191203185711-11915/workspace/env_run/thirdparty/libiomp5.so? The current error message says this libiomp5 is a higher version.
AIStudio790713
#24 · replied 2019-12
@Aurelius84

libmkl2014.zip

I built a package without OpenMP. Try it offline first, then put it online.

AIStudio785465
#25 · replied 2019-12

Since the business only needs the results to agree to 6 decimal places, I'm closing this issue.
