Incorrect performance data after enabling enable_profile
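While evaluating performance with paddle-trt I ran into a problem: the timings in the profiler report do not match the latency I measured myself with the time function.
Test setup: the dataset is coco val2017, which contains 5000 images. I measured the latency as follows: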
import time

total_time = 0.0  # accumulated wall-clock time in seconds
start = time.time()
predictor.run()  # one inference call
end = time.time()
total_time += (end - start)
The average latency measured this way is 26.76 ms.
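For comparison, a slightly more careful version of the manual timing could look like the sketch below. It is only an illustration under assumptions: `run_once` is a hypothetical zero-argument callable standing in for a wrapper around `predictor.run()` (including whatever output copy is needed so that the GPU work has actually finished), and the warm-up count is an arbitrary choice. It uses `time.perf_counter()`, a monotonic high-resolution clock, and returns the mean in milliseconds.

```python
import time


def average_latency_ms(run_once, num_iters, num_warmup=10):
    """Return the mean wall-clock time of run_once() in milliseconds.

    run_once is a hypothetical zero-argument callable, for example a
    wrapper around predictor.run(); it is an assumption, not part of
    the original measurement.
    """
    if num_iters <= 0:
        raise ValueError("num_iters must be positive")

    # Warm-up runs are excluded so that one-off costs do not skew the mean.
    for _ in range(num_warmup):
        run_once()

    total_s = 0.0
    for _ in range(num_iters):
        start = time.perf_counter()
        run_once()
        total_s += time.perf_counter() - start

    # Convert the per-iteration mean from seconds to milliseconds.
    return total_s / num_iters * 1000.0
```

It might be called as `average_latency_ms(lambda: predictor.run(), 5000)`, which gives one timed call per image if each call processes a single image.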
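After turning on the enable_profile interface I got the performance report below. Summing the average per-call times of the events (for example multiclass_nms + tensorrt_engine = 27.4 ms) gives a value larger than 26.76 ms.
Can anyone explain how to read this report, and how its numbers differ from the manually measured time?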
------------------------- GpuMemCpy Summary -------------------------
GpuMemcpy Calls: 65291 Total: 25498.3 Ratio: 10.2264%
GpuMemcpyAsync Calls: 50000 Total: 9653.51 Ratio: 3.87167%
GpuMemcpySync Calls: 15291 Total: 15844.8 Ratio: 6.35477%
------------------------- Event Summary -------------------------
Event Calls Total CPU Time (Ratio) GPU Time (Ratio) Min. Max. Ave. Ratio.
thread0::multiclass_nms 5000 119291 112637.455869 (0.944221) 6653.985737 (0.055779) 7.37919 40.9935 23.8583 0.478434
GpuMemcpySync:GPU->CPU 10000 15595.7 8941.718900 (0.573345) 6653.985737 (0.426655) 0.183995 13.9719 1.55957 0.0625486
thread0::tensorrt_engine 30000 106199 13364.216963 (0.125842) 92834.317382 (0.874158) 0.547528 31.3716 3.53995 0.425923
thread0::GpuMemcpyAsync:CPU->GPU 10000 8898.93 5031.297751 (0.565382) 3867.636127 (0.434618) 0.014988 16.3952 0.889893 0.0356903
thread0::deformable_conv 15000 7910.64 2967.598615 (0.375140) 4943.039027 (0.624860) 0.388718 25.0518 0.527376 0.0317267
thread0::yolo_box 15000 2614.69 1414.669365 (0.541046) 1200.021849 (0.458954) 0.057361 38.3538 0.174313 0.0104866
GpuMemcpyAsync:CPU->GPU 15000 299.826 282.410887 (0.941916) 17.415218 (0.058084) 0.007734 38.1444 0.0199884 0.00120249
thread0::concat 10000 1234.86 756.215652 (0.612392) 478.639522 (0.387608) 0.05831 38.8619 0.123486 0.00495255
GpuMemcpyAsync:CPU->GPU 20000 271.862 249.451655 (0.917566) 22.410810 (0.082434) 0.00598 38.5968 0.0135931 0.00109034
thread0::nearest_interp 10000 996.956 595.503555 (0.597322) 401.452901 (0.402678) 0.071152 1.95621 0.0996956 0.00399843
thread0::transpose2 15000 841.012 525.319150 (0.624628) 315.692511 (0.375372) 0.022546 4.25895 0.0560674 0.00337299
thread0::scale 5000 835.687 814.550576 (0.974707) 21.136627 (0.025293) 0.110291 3.72206 0.167137 0.00335163
GpuMemcpySync:CPU->GPU 5000 232.961 226.762321 (0.973391) 6.198947 (0.026609) 0.032096 3.56408 0.0465923 0.000934322
thread0::load_combine 1 315.491 315.491330 (1.000000) 0.000000 (0.000000) 315.491 315.491 315.491 0.00126532
thread0::GpuMemcpyAsync:GPU->CPU 5000 182.892 176.209841 (0.963465) 6.682012 (0.036535) 0.024113 0.3656 0.0365784 0.000733512
thread0::GpuMemcpySync:CPU->GPU 291 16.1448 10.823600 (0.670406) 5.321247 (0.329594) 0.016948 3.47948 0.0554806 6.4751e-05
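As a sanity check on how the columns relate, the two largest rows can be recomputed by hand. The sketch below copies the numbers from the Event Summary above and assumes that times are in milliseconds and that the 5000 images correspond to 5000 calls of `predictor.run()`; both are assumptions, not something the report states. In these rows Total matches CPU Time + GPU Time and Ave. matches Total / Calls. Because tensorrt_engine is called 30000 times (6 calls per image), its Ave. is a per-call figure, not a per-image one.

```python
# Values copied from the Event Summary: (calls, total, cpu_time, gpu_time, ave).
NUM_IMAGES = 5000  # assumption: one predictor.run() per image
events = {
    "multiclass_nms": (5000, 119291.0, 112637.455869, 6653.985737, 23.8583),
    "tensorrt_engine": (30000, 106199.0, 13364.216963, 92834.317382, 3.53995),
}


def per_image(total, num_images=NUM_IMAGES):
    """Spread an event's Total evenly over the number of images."""
    return total / num_images


for name, (calls, total, cpu, gpu, ave) in events.items():
    # Check the column relationships visible in the report rows.
    assert abs(total - (cpu + gpu)) < 1.0   # Total ~= CPU Time + GPU Time
    assert abs(ave - total / calls) < 1e-3  # Ave. ~= Total / Calls
    print(f"{name}: {calls // NUM_IMAGES} call(s) per image, "
          f"{per_image(total):.2f} per image")

# Sum of per-call averages (the 27.4 figure from the question).
sum_of_averages = sum(ave for (_, _, _, _, ave) in events.values())
# Sum of per-image totals, which weights each event by its call count.
sum_per_image = sum(per_image(total) for (_, total, _, _, _) in events.values())
print(f"sum of Ave.: {sum_of_averages:.2f}, sum per image: {sum_per_image:.2f}")
```

Since Total adds CPU and GPU time together, these sums may count overlapping CPU and GPU work twice, so neither of them has to equal the wall-clock time measured around `predictor.run()`.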