[QAT] BERT Model
So far we have attempted optimisation of the BERT float model using the QATv1 mechanism; these are the profiling results:
FP32 QAT BERT
Run 100 samples, average latency: 181.305 ms per sample.
Run 99 samples, average latency (excluding 1 warmup step): 181.006 ms per sample.
QATv1 INT8 model
Run 100 samples, average latency: 50.4984 ms per sample.
Run 99 samples, average latency (excluding 1 warmup step): 48.1151 ms per sample.
According to the final benchmark result we have achieved a ~3.8x speedup. However, both the FP32 and INT8 runs contained large outliers (a typical sample took ~100.712 ms, while some outliers were much larger, e.g. 705.972 ms or 672.464 ms), so the averages are skewed. Hence I want to add that the typical single-batch latency was 100.712 ms for FP32 QAT BERT and 44.4283 ms for QAT INT8 BERT.
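Because a few very large outliers pull the mean upward, reporting the median alongside the mean gives a more representative per-sample latency. A minimal sketch of such a summary (the function name and the sample values below are illustrative, not the actual benchmark harness):

```python
import statistics

def summarize_latencies(latencies_ms, warmup=1):
    """Summarize per-sample latencies, excluding warmup iterations.

    The mean is sensitive to rare large spikes (e.g. a ~700 ms outlier
    among ~100 ms samples), so the median is reported as well.
    """
    steady = latencies_ms[warmup:]
    return {
        "mean_ms": statistics.mean(steady),
        "median_ms": statistics.median(steady),
        "max_ms": max(steady),
    }

# Illustrative data: one slow warmup step, mostly ~100.7 ms samples,
# plus two large outliers similar to the ones observed above.
samples = [181.0] + [100.7] * 97 + [705.9, 672.5]
stats = summarize_latencies(samples, warmup=1)
# The median stays near the typical 100.7 ms, while the mean is pulled up.
```

This matches the observation above: the "typical" latency (median) is a better headline number than the outlier-skewed average.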
@Sand3r-
Thanks for your results. To answer your questions:
- Use the QAT INT8 model, which has two inputs.
- The UT and input data have been sent to you via Slack.
- Accuracy can be calculated by comparing the results against the labels (sent to you via Slack).
- Performance should be the average latency with batch_size=1 and max_seqlen=128.
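Comparing results against the provided labels typically means an argmax match between the model's output scores and the reference label per sample. A minimal sketch, assuming the outputs are per-class scores (the variable names and toy values are placeholders, not the actual test data):

```python
def accuracy(scores, labels):
    """Fraction of samples whose highest-scoring class matches the label."""
    correct = sum(
        1
        for row, label in zip(scores, labels)
        # argmax over the score row, without external dependencies
        if max(range(len(row)), key=row.__getitem__) == label
    )
    return correct / len(labels)

# Hypothetical 3-sample example with 2-class scores.
preds = [[0.1, 0.9], [0.8, 0.2], [0.3, 0.7]]
labels = [1, 0, 0]
acc = accuracy(preds, labels)  # 2 of 3 predictions match
```

The same accuracy metric can then be compared between the FP32 and INT8 models to quantify any degradation from quantization.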
Hello,
as agreed previously, Intel is going to prepare a QAT pass for BERT INT8.
Hence, we'd like to find out the following:
- Which model should we use: the one from the benchmark repository or the one from the bert unit-test? The model we have inspected so far contained 4 inputs, named placeholder[0-3].