Enabling enable_profile: the reported performance data looks wrong

While evaluating performance with paddle-trt, I ran into a problem: the timing data in the profiler report does not match the elapsed time I measured myself with the time function.

Test setup: the test dataset is COCO val2017, which contains 5000 images. Inference time is measured as follows:

import time  # wall-clock timing with the standard library

start = time.time()
predictor.run()                 # one forward pass through the Paddle-TRT predictor
end = time.time()
total_time += (end - start)     # accumulated over all 5000 images

The average time per image measured this way is 26.76 ms.
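Spelled out as a one-line sketch (using the total_time variable from the snippet above), that average is just:

avg_ms = total_time / 5000 * 1000   # per-image average in milliseconds, ≈ 26.76 ms here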

With the enable_profile interface turned on, I get the performance report below. Analyzing it and summing the average per-call times (e.g. multiclass_nms + tensorrt_engine = 27.4 ms), the result is already larger than 26.76 ms.
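For context, a minimal sketch of how the profiler was switched on, assuming the Paddle 2.x inference API; the model paths, GPU pool size, and TensorRT settings here are placeholders rather than my exact configuration:

from paddle.inference import Config, create_predictor

config = Config("model/__model__", "model/__params__")   # placeholder model files
config.enable_use_gpu(1000, 0)                           # 1000 MB initial GPU memory pool on device 0 (illustrative)
config.enable_tensorrt_engine()                          # run supported subgraphs with TensorRT (default settings)
config.enable_profile()                                  # produces the operator-level report below
predictor = create_predictor(config)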

Could someone explain how to read this report, and why it differs from the time I measured by hand?

-------------------------     GpuMemCpy Summary     -------------------------

GpuMemcpy                Calls: 65291       Total: 25498.3     Ratio: 10.2264%
  GpuMemcpyAsync         Calls: 50000       Total: 9653.51     Ratio: 3.87167%
  GpuMemcpySync          Calls: 15291       Total: 15844.8     Ratio: 6.35477%

-------------------------       Event Summary       -------------------------

Event                                   Calls       Total       CPU Time (Ratio)        GPU Time (Ratio)        Min.        Max.        Ave.        Ratio.      
thread0::multiclass_nms                 5000        119291      112637.455869 (0.944221) 6653.985737 (0.055779) 7.37919     40.9935     23.8583     0.478434    
  GpuMemcpySync:GPU->CPU                10000       15595.7     8941.718900 (0.573345)  6653.985737 (0.426655)  0.183995    13.9719     1.55957     0.0625486   
thread0::tensorrt_engine                30000       106199      13364.216963 (0.125842) 92834.317382 (0.874158) 0.547528    31.3716     3.53995     0.425923    
thread0::GpuMemcpyAsync:CPU->GPU        10000       8898.93     5031.297751 (0.565382)  3867.636127 (0.434618)  0.014988    16.3952     0.889893    0.0356903   
thread0::deformable_conv                15000       7910.64     2967.598615 (0.375140)  4943.039027 (0.624860)  0.388718    25.0518     0.527376    0.0317267   
thread0::yolo_box                       15000       2614.69     1414.669365 (0.541046)  1200.021849 (0.458954)  0.057361    38.3538     0.174313    0.0104866   
  GpuMemcpyAsync:CPU->GPU               15000       299.826     282.410887 (0.941916)   17.415218 (0.058084)    0.007734    38.1444     0.0199884   0.00120249  
thread0::concat                         10000       1234.86     756.215652 (0.612392)   478.639522 (0.387608)   0.05831     38.8619     0.123486    0.00495255  
  GpuMemcpyAsync:CPU->GPU               20000       271.862     249.451655 (0.917566)   22.410810 (0.082434)    0.00598     38.5968     0.0135931   0.00109034  
thread0::nearest_interp                 10000       996.956     595.503555 (0.597322)   401.452901 (0.402678)   0.071152    1.95621     0.0996956   0.00399843  
thread0::transpose2                     15000       841.012     525.319150 (0.624628)   315.692511 (0.375372)   0.022546    4.25895     0.0560674   0.00337299  
thread0::scale                          5000        835.687     814.550576 (0.974707)   21.136627 (0.025293)    0.110291    3.72206     0.167137    0.00335163  
  GpuMemcpySync:CPU->GPU                5000        232.961     226.762321 (0.973391)   6.198947 (0.026609)     0.032096    3.56408     0.0465923   0.000934322 
thread0::load_combine                   1           315.491     315.491330 (1.000000)   0.000000 (0.000000)     315.491     315.491     315.491     0.00126532  
thread0::GpuMemcpyAsync:GPU->CPU        5000        182.892     176.209841 (0.963465)   6.682012 (0.036535)     0.024113    0.3656      0.0365784   0.000733512 
thread0::GpuMemcpySync:CPU->GPU         291         16.1448     10.823600 (0.670406)    5.321247 (0.329594)     0.016948    3.47948     0.0554806   6.4751e-05
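
To spell out where my 27.4 ms figure comes from, it is the sum of the Ave column (ms per call) of the two heaviest events in the report:

multiclass_nms_ave  = 23.8583    # Ave of thread0::multiclass_nms, ms per call
tensorrt_engine_ave = 3.53995    # Ave of thread0::tensorrt_engine, ms per call
print(multiclass_nms_ave + tensorrt_engine_ave)   # ≈ 27.40 ms, already above the 26.76 ms I measured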