[QAT] BERT Model
So far we have attempted optimisation of the BERT float model using the QATv1 mechanism; these are the profiling results:
FP32 QAT BERT
Run 100 samples, average latency: 181.305 ms per sample.
Run 99 samples, average latency (excluding 1 warmup step): 181.006 ms per sample.
QATv1 INT8 model
Run 100 samples, average latency: 50.4984 ms per sample.
Run 99 samples, average latency (excluding 1 warmup step): 48.1151 ms per sample.
According to the final benchmark result we have achieved a ~3.8x speedup. However, both the FP32 and INT8 runs contained large outliers (a typical sample took ~100.712 ms, while some outliers were much larger, e.g. 705.972 ms or 672.464 ms), so the averages are skewed. Hence I want to add that the typical single-batch latency was 100.712 ms for FP32 QAT BERT and 44.4283 ms for QAT INT8 BERT.
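Because a few very large outliers pull the mean upward, reporting the median alongside the mean gives a more representative per-sample latency. A minimal sketch of such a summary (the function name and the sample values below are illustrative, not the actual benchmark harness):

```python
import statistics

def summarize_latencies(latencies_ms, warmup=1):
    """Summarize per-sample latencies, excluding warmup iterations.

    The mean is sensitive to rare large spikes (e.g. a ~700 ms outlier
    among ~100 ms samples), so the median is reported as well.
    """
    steady = latencies_ms[warmup:]
    return {
        "mean_ms": statistics.mean(steady),
        "median_ms": statistics.median(steady),
        "max_ms": max(steady),
    }

# Illustrative data: one slow warmup step, mostly ~100.7 ms samples,
# plus two large outliers similar to the ones observed above.
samples = [181.0] + [100.7] * 97 + [705.9, 672.5]
stats = summarize_latencies(samples, warmup=1)
# The median stays near the typical 100.7 ms, while the mean is pulled up.
```

This matches the observation above: the "typical" latency (median) is a better headline number than the outlier-skewed average.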
@Sand3r-
Thanks for your results. To answer your questions:
- Use the QAT INT8 model, which has two inputs.
- The UT and input data have been sent to you via Slack.
- Accuracy can be calculated by comparing the results against the labels (sent to you via Slack).
- Performance should be the average latency with batch_size=1 and max_seqlen=128.
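Comparing results against the provided labels typically means an argmax match between the model's output scores and the reference label per sample. A minimal sketch, assuming the outputs are per-class scores (the variable names and toy values are placeholders, not the actual test data):

```python
def accuracy(scores, labels):
    """Fraction of samples whose highest-scoring class matches the label."""
    correct = sum(
        1
        for row, label in zip(scores, labels)
        # argmax over the score row, without external dependencies
        if max(range(len(row)), key=row.__getitem__) == label
    )
    return correct / len(labels)

# Hypothetical 3-sample example with 2-class scores.
preds = [[0.1, 0.9], [0.8, 0.2], [0.3, 0.7]]
labels = [1, 0, 0]
acc = accuracy(preds, labels)  # 2 of 3 predictions match
```

The same accuracy metric can then be compared between the FP32 and INT8 models to quantify any degradation from quantization.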
Hello,
as agreed previously, Intel is going to prepare a QAT pass for BERT INT8.
Hence, we'd like to find out the following:
- Which model should we use: the one from the benchmark repository or the one from the bert unit-test? The model we have inspected so far contained 4 inputs, named placeholder[0-3].