[QAT] BERT Model
PaddleNLP · Q&A · NLP · 894 · 3

Hello,
as agreed previously, Intel is going to prepare a QAT pass for BERT INT8.
Hence, we'd like to find out the following:

  1. What flavour (type) of BERT model shall we optimize? There are several that I know of, each differing in the ops it is built from.
  2. Could you please provide the data reader for us? Two of the QAT BERT models we have received accepted only 2 inputs, while other BERT models we have seen (the one from the benchmark repository or the one from the bert unit-test) contained 4 inputs, named placeholder[0-3].
  3. How should we compute the accuracy?
  4. What performance measure should we use for the model in question? Is it words per second (wps) or something else?
All comments (3)
Chronological order
AIStudio786081
#2 · Replied 2019-11

So far we've attempted optimization of the BERT float_model using the QATv1 mechanism; these are the profiling results:
FP32 QAT BERT

Run 100 samples, average latency: 181.305 ms per sample.
Run 99 samples, average latency [exclude 1 warmup steps]: 181.006 ms per sample.

QATv1 INT8 model

Run 100 samples, average latency: 50.4984 ms per sample.
Run 99 samples, average latency [exclude 1 warmup steps]: 48.1151 ms per sample.

According to the final benchmark results, we have managed to achieve a ~3.8x speedup. However, since both the FP32 and INT8 runs had a lot of outliers in their results (a typical result was ~100.712 ms, while some outliers were much larger, e.g. 705.972 ms or 672.464 ms), the averages were skewed. Hence, I want to add that the typical latency of a single-batch computation was 100.712 ms for FP32 QAT BERT and 44.4283 ms for QAT INT8 BERT.
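To illustrate how outliers and warmup steps skew the reported average, here is a small sketch using the mean versus the median. The latency values below are made up for illustration only (they are not the actual benchmark data); only the first two, 705.972 ms and 100.712 ms, echo numbers quoted above.

```python
import statistics

# Illustrative per-sample latencies (ms); NOT the real benchmark output.
# The first sample stands in for a warmup step.
latencies_ms = [705.972, 100.712, 100.5, 101.1, 100.9, 100.712, 672.464, 100.8]

warmup_steps = 1
steady = latencies_ms[warmup_steps:]  # drop warmup samples, as the tool does

mean_all = statistics.mean(latencies_ms)       # skewed by warmup + outliers
mean_steady = statistics.mean(steady)          # still skewed by the 672 ms outlier
median_steady = statistics.median(steady)      # robust "typical" latency

print(f"mean (all samples):   {mean_all:.3f} ms")
print(f"mean (no warmup):     {mean_steady:.3f} ms")
print(f"median (no warmup):   {median_steady:.3f} ms")
```

The median tracks the "typical" latency mentioned above much more closely than the mean, which is pulled up by the rare 600-700 ms outliers.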

Full output for FP32 QAT
Full output for INT8 QAT

AIStudio786082
#3 · Replied 2019-11
@Sand3r-

Thanks for your results. To answer your questions:

  1. The QAT INT8 model, which has two inputs.
  2. The UT and input data have been sent to you via Slack.
  3. Accuracy can be calculated by comparing the results against the labels (sent to you via Slack).
  4. Performance should be the average latency with batch_size=1 and max_seqlen=128.
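Point 3 can be sketched as a simple match rate between model outputs and reference labels. This is a hypothetical illustration: the actual result and label files (and their format) were shared via Slack and are not shown here.

```python
def accuracy(predictions, labels):
    """Fraction of predictions that exactly match their reference labels."""
    assert len(predictions) == len(labels), "result/label count mismatch"
    correct = sum(p == l for p, l in zip(predictions, labels))
    return correct / len(labels)

# Toy example: 4 of 5 predicted class ids match the labels.
preds = [1, 0, 1, 1, 0]
labels = [1, 0, 0, 1, 0]
print(accuracy(preds, labels))  # 0.8
```

For latency (point 4), the reported number would be the mean per-sample time over the run at batch_size=1 with sequences padded/truncated to length 128.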
AIStudio786082
#4 · Replied 2019-11

New benchmark follow-up is in PaddlePaddle/benchmark#275
