Debugging tensorrt

Previous results with efficientdet were documented here:

https://hackaday.io/project/162944/log/203849-best-results-with-raspberry-pi-4

Reasons for not using efficientdet were documented here:

https://hackaday.io/project/162944/log/203975-the-last-temptation-of-christ

Detecting 2 overlapping animals, lying positions, & oscillations from windowing were the deal breakers.

Despite the naming convention, all the caffe "forward" functions in openpose seem to be raw CUDA with no specific dependencies on caffe. The memory layout of the caffe & CUDA buffers is the same. They're flat float buffers of pixel data. The caffe buffers (ArrayCpuGpu) have a cpu_data() to access the data from the CPU. The CUDA buffers need a cudaMemcpy to copy the data to the CPU.

To debug the porting effort, it was essential to have the caffe & tensorrt programs read the same frame from a gootube vijeo (yJFOojKXe4A) as input. Then write different stages of the output as flat PPM files. Never underestimate the value of obsolete formats like PPM.
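A minimal sketch of the kind of dump helper this implies, assuming the stage has already been copied to the CPU as a flat list of floats. The names plane_to_ppm & dump_plane are hypothetical, not anything in openpose. Each plane is scaled to its own min & max, & non-finite values are painted red so NaN output is visible at a glance.

```python
import math


def plane_to_ppm(plane, width, height):
    """Return binary PPM (P6) bytes for 1 flat float plane, drawn as gray."""
    if len(plane) != width * height:
        raise ValueError("plane size does not match width * height")
    finite = [v for v in plane if math.isfinite(v)]
    lo = min(finite, default=0.0)
    hi = max(finite, default=0.0)
    span = (hi - lo) or 1.0  # avoid dividing by 0 on a constant plane
    header = f"P6\n{width} {height}\n255\n".encode("ascii")
    pixels = bytearray()
    for v in plane:
        if math.isfinite(v):
            gray = int(round(255 * (v - lo) / span))
            pixels += bytes((gray, gray, gray))
        else:
            pixels += b"\xff\x00\x00"  # NaN or inf shows up as red
    return header + bytes(pixels)


def dump_plane(path, plane, width, height):
    """Write 1 plane to disk so 2 programs' outputs can be compared by eye."""
    with open(path, "wb") as f:
        f.write(plane_to_ppm(plane, width, height))
```

Dumping the same stage from both programs with the same scaling makes a rotated, blank or NaN plane obvious in any image viewer.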

The input to body_25 is 3 planes of 256x144.

The output of body_25 is 78 32x18 frames. The top frames are the confidence maps for each body part & the bottom frames are the part affinity fields describing how the body parts attach. 1 frame is a background. That adds up as 25 body parts + 1 background + 52 part affinity fields.
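As a sketch of how those 78 frames can be pulled out of the flat output buffer: this assumes the channel order given in the openpose output documentation (25 body part confidence maps, then the background, then 52 part affinity field channels). The helper names are hypothetical.

```python
NUM_PARTS = 25                 # body part confidence maps
NUM_HEATMAPS = NUM_PARTS + 1   # plus 1 background frame
NUM_PAF = 52                   # x & y channel for each of 26 connections
OUT_W, OUT_H = 32, 18


def channel_plane(buffer, channel, width=OUT_W, height=OUT_H):
    """Slice 1 frame out of a flat, channel-major float buffer."""
    size = width * height
    return buffer[channel * size:(channel + 1) * size]


def describe_channel(channel):
    """Label a channel, assuming the documented openpose ordering."""
    if channel < 0 or channel >= NUM_HEATMAPS + NUM_PAF:
        raise IndexError("body_25 has 78 output channels")
    if channel < NUM_PARTS:
        return "confidence map"
    if channel == NUM_PARTS:
        return "background"
    return "part affinity field"
```

Combined with a per-frame image dump, this makes it easy to check which frame a bad result lives in.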

In the tensorrt version, much effort was spent chasing why the output was NaN, all 0 or rotated. 1 problem was the program wasn't allocating a big enough input frame & CUDA doesn't give any errors for buffer overruns. Another problem was the input dims back in pose_deploy.prototxt were in height, width order rather than width, height.

name: "OpenPose - BODY_25"
input: "image"
input_dim: 1 # This value will be defined at runtime
input_dim: 3
input_dim: 144 # This value will be defined at runtime
input_dim: 256 # This value will be defined at runtime
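The arithmetic behind both bugs can be sketched from the dims above. This assumes float32 buffers & the 78x18x32 output described earlier; element_count & nchw_index are hypothetical helpers, not tensorrt calls.

```python
FLOAT_BYTES = 4  # float32


def element_count(dims):
    """Number of elements in a buffer with the given dims."""
    count = 1
    for d in dims:
        count *= d
    return count


def nchw_index(c, y, x, height, width):
    """Offset of pixel (x, y) of channel c in a flat channel-major buffer."""
    return (c * height + y) * width + x


INPUT_DIMS = (1, 3, 144, 256)    # batch, channels, height, width
OUTPUT_DIMS = (1, 78, 18, 32)

input_bytes = element_count(INPUT_DIMS) * FLOAT_BYTES    # 442368
output_bytes = element_count(OUTPUT_DIMS) * FLOAT_BYTES  # 179712

# Swapping height & width leaves the byte count unchanged but moves
# every row, so nothing complains & the image comes out scrambled.
right = nchw_index(0, 1, 0, height=144, width=256)   # 256
wrong = nchw_index(0, 1, 0, height=256, width=144)   # 144
```

Since the total size is the same either way, a size check alone won't catch swapped dims. Only looking at the dumped frames does.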

The proper output of tensorrt was nearly identical to caffe. Minor differences could have been caused by FP16, which means all of the improvement in body_25 could be from the bit precision rather than the model. trt_pose drops a lot of body parts & sure enough, the big difference with FP16 is some markers not being as bright as float32.
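A rough way to put a number on "nearly identical" is the largest absolute difference between the 2 dumped buffers. The sketch below also rounds a value through half precision with Python's struct module to show the size of error FP16 storage alone introduces. It only models storage rounding, not how tensorrt actually computes in FP16, & the function names are hypothetical.

```python
import struct


def to_fp16(value):
    """Round a float to the nearest half precision value & back.

    struct raises OverflowError for magnitudes beyond the FP16 range.
    """
    return struct.unpack("<e", struct.pack("<e", value))[0]


def max_abs_diff(a, b):
    """Largest absolute difference between 2 equally sized float buffers."""
    if len(a) != len(b):
        raise ValueError("buffers differ in size")
    return max((abs(x - y) for x, y in zip(a, b)), default=0.0)


reference = [0.1, 0.5, 0.9]                 # stand-in for the float32 output
rounded = [to_fp16(v) for v in reference]   # same values stored as FP16
error = max_abs_diff(reference, rounded)
```

Values like 0.5 survive the round trip exactly while 0.1 & 0.9 don't, so a small nonzero error is what FP16 rounding on its own looks like.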

The spResizeAndMergeCaffe function just upscales these 32x18 frames back to the original 256x144 resolution. The magic is in spNmsCaffe (NmsCaffe) & spBodyPartConnectorCaffe (BodyPartConnectorCaffe). Those 3 functions are busters with many openpose dependencies & multidimensional arrays.
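For orientation, the general idea of non-maximum suppression on a confidence map can be sketched in a few lines. This is not the NmsCaffe code. It's a plain 8 neighbor local maximum search with a made up threshold, without the subpixel refinement or GPU kernels a real implementation would have.

```python
def find_peaks(plane, width, height, threshold=0.05):
    """Return (x, y, value) for each pixel brighter than all 8 neighbors.

    plane is a flat row-major float buffer. threshold is a hypothetical
    cutoff for ignoring low confidence noise. Border pixels are skipped.
    """
    peaks = []
    for y in range(1, height - 1):
        for x in range(1, width - 1):
            v = plane[y * width + x]
            if v <= threshold:
                continue
            neighbors = [
                plane[(y + dy) * width + (x + dx)]
                for dy in (-1, 0, 1)
                for dx in (-1, 0, 1)
                if (dx, dy) != (0, 0)
            ]
            if all(v > n for n in neighbors):
                peaks.append((x, y, v))
    return peaks
```

Each surviving peak is a candidate body part location. Joining candidates into skeletons is the body part connector's job, which isn't sketched here.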

Openpose framerates with tensorrt

Body_25 using FP16
