Simplified pose tracker dreams

A project log for Auto tracking camera

A camera that tracks a person & counts reps using *AI*.

lion mclionhead • 02/26/2022 at 10:35 • 0 Comments

The lion kingdom believes the pose tracker can run fast enough on a raspberry pi if it's reduced to tracking just 4 keypoints.  It currently just uses head, neck, hip, & foot zones to drive camera tilt.  These are fusions of 3-6 keypoints.  1 way is to reduce the movenet_multipose model itself to 4 keypoints.  Another way is to try training efficientdet_lite on the 4 objects for the 4 keypoints.  
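A minimal sketch of the fusion idea, assuming movenet's standard 17 COCO keypoints as `(y, x, score)` rows. The exact grouping of joints into the 4 zones is an assumption; the project's real groupings may differ.

```python
import numpy as np

# COCO keypoint indices as output by movenet (17 keypoints per person).
# The zone groupings below are assumptions about how head/neck/hip/foot
# could be fused, not the project's actual mapping.
ZONES = {
    "head": [0, 1, 2, 3, 4],   # nose, eyes, ears
    "neck": [5, 6],            # shoulders
    "hip":  [11, 12],          # hips
    "foot": [15, 16],          # ankles
}

def fuse_keypoints(kp, min_score=0.2):
    """kp: (17, 3) array of (y, x, score) rows from movenet.
    Returns zone name -> (y, x) average of confident keypoints,
    or None when no keypoint in the zone clears min_score."""
    zones = {}
    for name, idx in ZONES.items():
        pts = kp[idx]
        good = pts[pts[:, 2] > min_score]
        zones[name] = tuple(good[:, :2].mean(axis=0)) if len(good) else None
    return zones
```

A zone that comes back `None` is itself a signal: an invisible head or foot zone is what drives the tilt logic later on.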

The mane point of the 4 keypoints is tracking when the complete body isn't visible.  A full body bounding box can't show if the head or the feet are visible & can't detect anything if it's too close.

Sadly, there's no source code for training the fastest model, movenet_multipose.  There's source code for another pose estimator based on the COCO2017 dataset:

https://github.com/scnuhealthy/Tensorflow_PersonLab

Key variables are defined in config.py: NUM_KP, NUM_EDGES, KEYPOINTS

1 keypoint applies to both sides.  The other 16 have a left & right side which seem to be unnecessary.
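A hypothetical reduced config.py for a 4-keypoint variant might look like the following. The names and edge list are assumptions for illustration, not the repo's actual values.

```python
# Hypothetical 4-keypoint config for a cut-down Tensorflow_PersonLab.
# The original config.py defines 17 COCO keypoints; this collapses
# them to the 4 fused zones the tracker actually uses.
KEYPOINTS = ['head', 'neck', 'hip', 'foot']
NUM_KP = len(KEYPOINTS)           # 4 instead of 17
EDGES = [(0, 1), (1, 2), (2, 3)]  # head-neck, neck-hip, hip-foot
NUM_EDGES = len(EDGES)            # 3 instead of 16
```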

The mane problem is consolidating multiple keypoints into 1.  The model expects a single point for every keypoint, & COCO likewise defines each keypoint as a single point rather than a box.  Maybe it can accept an average point in the middle of multiple keypoints, or it can accept the same keypoint applying to multiple joints in different images.
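The averaging option could be sketched like this, assuming COCO's flat `[x1, y1, v1, x2, y2, v2, ...]` keypoint layout where `v > 0` means the joint was labeled. The merge rule itself is the assumption here.

```python
# Collapse several COCO-annotated joints into one training label by
# averaging the labeled ones, per the averaging idea above.
def merge_coco_keypoints(coco_kp, indices):
    """coco_kp: flat COCO keypoint list [x1, y1, v1, x2, y2, v2, ...].
    indices: which joints to merge into a single keypoint."""
    xs, ys = [], []
    for i in indices:
        x, y, v = coco_kp[3 * i:3 * i + 3]
        if v > 0:                  # only average joints that were labeled
            xs.append(x)
            ys.append(y)
    if not xs:
        return (0, 0, 0)           # COCO convention for an unlabeled keypoint
    return (sum(xs) / len(xs), sum(ys) / len(ys), 2)
```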

It's still a mystery to lions how the neural network goes from recognizing objects to organizing the hits into an array of coordinates.  There's not much in model.py.  The magic seems to happen in resnet_v2_101, a pretrained image classification model which presumably knows about the hierarchy of keypoints & persons.  The results of the image classification get organized by a conv2d model.  This seems to be a very large, slow process, not a substitute for movenet.

Defining tensorflow models seems to be a basic skill every high school student knows but lions missed out on, yet it's also like decoding mp3's, a skill that everyone once knew how to do but now is automated.

The lion kingdom observed when efficientdet was trained on everything above the hips, it tended to detect only that.  If efficientdet was trained just on above-shoulder & below-shoulder objects, it might have enough objects to control pitch.  It currently just has 3 states: head visible but not feet, feet visible but not head, both visible.  It can probably be reduced to head & body visible or just body visible.  If head is visible, put it in the top 3rd.  If head is invisible but body is visible, tilt up.
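The head/body tilt rule above could be sketched as below. The deadband value is an assumption; the "top 3rd" target comes from the text. The sign convention assumed is that tilting the camera up moves the subject down in the frame.

```python
# Tilt rule sketch: keep a visible head in the top 3rd of the frame,
# tilt up when only the body is visible.  Deadband is an assumed value.
def tilt_command(head_top, body_visible, target=1 / 3, deadband=0.05):
    """head_top: normalized row of the head (0 = top of frame),
    or None when the head isn't detected.
    Returns 'up', 'down', or 'hold'."""
    if head_top is not None:
        if head_top < target - deadband:
            return 'up'      # head too near the frame top: push it down
        if head_top > target + deadband:
            return 'down'    # head sagging toward the middle: raise it
        return 'hold'
    if body_visible:
        return 'up'          # body in frame but head cut off above it
    return 'hold'            # nothing detected: don't hunt
```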

The mane problem is training this dual object tracker.  Openpose could maybe label all the heads & bodies.  It's pretty bad at detecting sideways bodies.  The head has to be labeled from all angles, which rules out a face detector.

There's a slight chance tracking would work with a general body detector.  Fuse the 2 largest objects.  It would tilt up if the object top was above a certain row.  Tilt down if the object bottom was above a certain row & the object top was below a certain row.
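A sketch of that fallback, fusing the 2 largest detections into one box and tilting on fixed rows. `TOP_ROW` & `BOTTOM_ROW` are assumed tuning values, not numbers from the project.

```python
# Fuse the 2 largest person boxes & tilt on fixed row thresholds.
# TOP_ROW / BOTTOM_ROW are assumed tuning values.  0 = top of frame.
TOP_ROW, BOTTOM_ROW = 0.1, 0.8

def fuse_2_largest(boxes):
    """boxes: list of (top, bottom, left, right) in normalized coords.
    Returns the union of the 2 largest boxes by area."""
    area = lambda b: (b[1] - b[0]) * (b[3] - b[2])
    big = sorted(boxes, key=area, reverse=True)[:2]
    return (min(b[0] for b in big), max(b[1] for b in big),
            min(b[2] for b in big), max(b[3] for b in big))

def tilt_from_box(box):
    top, bottom = box[0], box[1]
    if top < TOP_ROW:                          # subject pushed against the top
        return 'up'
    if bottom < BOTTOM_ROW and top > TOP_ROW:  # whole subject riding high
        return 'down'
    return 'hold'
```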

The mane problem is the general body detector tiles the image to increase the input layer's resolution.  It can't tile the image in portrait mode.

There was an attempt to base the tracker on efficientdet_lite0 tracking a person.  It was bad.  The problem is that there are a lot more possible positions than there are in running, so it detects a lot of false positives.  It might have to be trained using unholy videos.  Another problem is the tilt tends to oscillate.  Finally, it's become clear that the model needs to see the viewfinder & the phone is too small.  It might require porting to Android 4 to run on the lion kingdom's obsolete tablet, or direct HDMI on the pi with a trackpad interface.

The best results with the person detector came from just keeping the top of the box 10% below the top of the frame with a 5% deadband.  There's not enough information to have it center the subject based on the paw position.  A full pose tracker does a much better job than a person detector at composing the shot.  So far, it's not worth paying for a photo shoot.
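The rule that worked best reduces to a few lines. The 10% target & 5% deadband are the values from the text; the direction convention (tilting up moves the subject down in the frame) is assumed.

```python
# Best-performing rule: hold the top of the person box 10% below the
# top of the frame, with a 5% deadband.  0 = top of frame.
def tilt_step(box_top, target=0.10, deadband=0.05):
    """box_top: normalized row of the detection box's top edge."""
    if box_top < target - deadband:
        return 'up'      # subject too high: tilt up to push it down
    if box_top > target + deadband:
        return 'down'    # subject too low: tilt down to raise it
    return 'hold'        # inside the deadband: don't oscillate
```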
