Overview of modern detectors for human pose estimation
In this article, we review three pipelines for human pose estimation: the most popular detectors on GitHub (OpenPose and Simple Pose) and a lighter-weight model from TensorFlow, PoseNet (tflite).
We tested each pipeline on three different images of a person and assessed the following parameters: the ability to find key points in the image, the confidence of the predictions, the total size of the models on disk, and the prediction time on CPU and GPU (Google Colaboratory). All our experiments are available on GitHub.
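A mean prediction time of this kind can be measured along the following lines. This is a minimal sketch, not necessarily the benchmarking code used in the experiments: `mean_prediction_time_ms`, `predict`, `warmup`, and `runs` are hypothetical names, and `dummy_predict` stands in for a real model call.

```python
import time
from statistics import mean


def mean_prediction_time_ms(predict, image, warmup=1, runs=5):
    """Return the mean wall-clock time of predict(image) in milliseconds.

    `predict` is any callable wrapping a model's forward pass (a hypothetical
    placeholder here). Warm-up calls are excluded from the timing because
    the first call often includes one-time initialization.
    """
    for _ in range(warmup):
        predict(image)

    timings = []
    for _ in range(runs):
        start = time.perf_counter()
        predict(image)
        timings.append((time.perf_counter() - start) * 1000.0)
    return mean(timings)


def dummy_predict(image):
    # Stand-in for a real detector: sleeps about 10 ms instead of running a model.
    time.sleep(0.01)
    return []


elapsed = mean_prediction_time_ms(dummy_predict, image=None)
print(f"mean prediction time: {elapsed:.1f} ms")
```

Averaging over several runs after a warm-up call smooths out one-off costs such as model loading, which would otherwise dominate the figure for small models.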
The key feature of the Simple Pose pipeline is that it works on crops obtained with the Yolo v.3 object detector: the image first passes through the Yolo v.3 detector, and only then are the resulting crops (with people) fed to the input of the pose estimation detector.
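The first stage of such a two-stage pipeline can be sketched as follows. This is an illustrative outline, not the pipeline's actual code: the detection format `(label, score, x1, y1, x2, y2)`, the `person_crops` helper, the score threshold, and the padding factor are all assumptions made for the example.

```python
def person_crops(detections, img_w, img_h, min_score=0.5, pad=0.1):
    """Turn object detections into crop boxes for the pose estimator.

    detections: iterable of (label, score, x1, y1, x2, y2) tuples
                (hypothetical format; real detectors return their own types).
    Keeps confident "person" boxes, enlarges each by `pad` on every side,
    and clips the result to the image bounds.
    """
    crops = []
    for label, score, x1, y1, x2, y2 in detections:
        if label != "person" or score < min_score:
            continue
        dx = (x2 - x1) * pad
        dy = (y2 - y1) * pad
        crops.append((
            max(0, round(x1 - dx)),
            max(0, round(y1 - dy)),
            min(img_w, round(x2 + dx)),
            min(img_h, round(y2 + dy)),
        ))
    return crops


detections = [
    ("person", 0.98, 100, 50, 300, 450),   # kept
    ("dog", 0.91, 320, 300, 420, 460),     # wrong class, skipped
    ("person", 0.20, 10, 10, 40, 90),      # low score, skipped
]
print(person_crops(detections, img_w=640, img_h=480))
```

Each returned box would then be cut out of the image, resized to the pose model's input size, and passed to the pose estimation detector.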
Mean prediction time (Yolo + pose detector):
- CPU: 921 ms
- GPU: 918 ms
The occupied space on the disk:
- Yolo v.3: 91 MB
- Pose Detector: 61 MB
- Total: 152 MB
The detector can recognize and localize 17 points:
nose, eyes, ears, shoulders, elbows, wrists, hips, knees, ankles.
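Spelled out, the 17 points are the nose plus a left and a right instance of each remaining body part. The listing below uses the COCO keypoint ordering, on the assumption that the detector follows it; verify the order against the model you actually use. `KEYPOINT_NAMES` and `KEYPOINT_INDEX` are names chosen for this example.

```python
# 17 keypoints in COCO order (assumed to match the detector's output order).
KEYPOINT_NAMES = [
    "nose",
    "left_eye", "right_eye",
    "left_ear", "right_ear",
    "left_shoulder", "right_shoulder",
    "left_elbow", "right_elbow",
    "left_wrist", "right_wrist",
    "left_hip", "right_hip",
    "left_knee", "right_knee",
    "left_ankle", "right_ankle",
]

# Map name -> index for convenient lookup in a prediction array.
KEYPOINT_INDEX = {name: i for i, name in enumerate(KEYPOINT_NAMES)}

print(len(KEYPOINT_NAMES))
print(KEYPOINT_INDEX["right_ankle"])
```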
In the first image, the detector correctly located almost all the points. The exception is the right ankle, most likely because of the non-standard position of the legs. The model's confidence in this prediction is only 0.34, the lowest value among all points. The facial points have the highest values (0.93 to 0.97), and the average confidence score for all points is 0.83.
In the second image, the detector predicted all points correctly. The right knee (relative to the character) has the lowest confidence score, 0.77. The facial points have the highest confidence scores, between 0.93 and 0.98. The average score is 0.90.
In the third image, the facial points and the left shoulder are shifted because the person is shown in profile. The left ear has the lowest confidence score, 0.6. The average score is 0.88.
Based on these results, we can conclude that a prediction can still be erroneous despite a fairly high confidence score. You should therefore select the confidence threshold carefully for a particular pose.
PoseNet is a pose estimation detector from TensorFlow. Here we look at the TensorFlow Lite (tflite) version, which can be used on mobile devices. It has the smallest size of the three models, and it is also fast.
Mean prediction time:
- CPU: 40 ms
- GPU: 45 ms
The total occupied space on the disk: 4.81 MB
In the first image, PoseNet also locates the right ankle incorrectly, and the left ankle is slightly shifted as well. However, the confidence scores for these points are 0.80 and 0.85, respectively. The average confidence score for all points is 0.94. The right wrist has the lowest score, 0.74.
In the second image, all points were found correctly, but the facial points are slightly shifted. The right ankle has the lowest confidence score, 0.7. The average score is 0.93.
In the third image, we can see several errors: the right ankle was detected incorrectly (because of the non-standard position), the right shoulder is shifted down, and the left eye, which is not visible in the image, is falsely localized. The detector assigned low confidence scores to the left ear (0.03) and the left eye (0.49). With a threshold of 0.5, it would not have localized these points, so we can assume that PoseNet did a good job with the unseen points. The average score for all points is 0.81.
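Applying a confidence threshold is a simple filter over the predicted points. The sketch below uses only the two scores quoted above for the left ear and the left eye; `filter_keypoints` and the dictionary format are hypothetical choices for the example, not part of the PoseNet API.

```python
def filter_keypoints(scores, threshold=0.5):
    """Keep only the keypoints whose confidence reaches the threshold.

    scores: dict mapping keypoint name -> confidence in [0, 1].
    """
    return {name: s for name, s in scores.items() if s >= threshold}


# PoseNet scores for the two points that are not visible in the third image.
scores = {"left_ear": 0.03, "left_eye": 0.49}

print(filter_keypoints(scores, threshold=0.5))  # both points are dropped
print(filter_keypoints(scores, threshold=0.4))  # the left eye is kept
```

The second call shows how sensitive the result is to the chosen value: lowering the threshold from 0.5 to 0.4 already lets the falsely localized left eye through.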
OpenPose is the first real-time multi-person system to jointly detect human body, hand, facial, and foot keypoints (135 keypoints in total) in a single image. Here we consider a variant with 18 points (the same 17 points plus the neck).
Mean prediction time:
- CPU: 2430 ms
- GPU: 63 ms
The total occupied space on the disk: 85 MB
In the first image, the detector made two mistakes: it incorrectly localized the nose and, like the others, the right ankle. The right wrist is also shifted upward, and all facial points except the left ear are shifted to the left. The right wrist has the lowest confidence score, 0.74. The average score for all points is 0.89.
In the second image, the model incorrectly localized the nose, predicting the same coordinates for it as for the right eye. The left elbow is slightly shifted to the left and has the lowest score, 0.81. The average score for all points is 0.92.
In the third image, the model did a good job as well, although some points are slightly shifted. The left ear has the lowest confidence score, 0.03. The average score for all points is 0.78.
| | Simple Pose | PoseNet (tflite) | OpenPose |
|---|---|---|---|
| Count of key points | 17 | 17 | 18 |
| The point with the lowest confidence score (for 3 images) | right ankle, right knee, left ear | right ankle, left ankle, left ear | right wrist, left elbow, left ear |
| Average confidence score (for 3 images) | 0.83, 0.9, 0.88 | 0.94, 0.93, 0.81 | 0.89, 0.92, 0.88 |
| Number of real errors (wrong localization or point not found) | 1 | 2 | 2 |
- If the priority is speed and model size (MB), then PoseNet (tflite) is the best choice. This model can also be used directly on mobile devices because the prediction time is nearly the same on CPU and GPU, between 40 and 45 ms in both cases. However, in some non-standard poses the model may be wrong (as in the third image).
- In my opinion, the most accurate pipeline is Simple Pose, but it is also the heaviest and the slowest one (due to the two models).
- OpenPose is a compromise between accuracy and size: it is slightly less accurate than Simple Pose but takes up almost half the disk space (85 MB versus 152 MB).
- First image: All models were wrong on the right ankle. Crossed legs are a problem for the models.
- Second image: All models performed well, despite the curvature of the skeleton.
- Third image: Simple Pose and OpenPose performed well. Only PoseNet misidentified the right ankle. All models also gave a low confidence score to the left ear (which is not visible).
- All detectors show a slight vertical or horizontal deviation of the points. The main reasons are the localization accuracy of the detector itself and the resizing of the image. First, we reduce the original image to the size that the model requires, so a small part of the information is already lost. Then we take the point coordinates predicted for the reduced image and scale them back to the original image, rounding to whole numbers, which can also introduce inaccuracy.
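The effect of the resize round trip can be illustrated with a few lines of arithmetic. The image size, the model input size, the sample point, and the helper names are assumptions chosen for the example, not values from the experiments; the sketch also assumes, for simplicity, that the model localizes points on whole pixels of its input.

```python
def to_model_coords(x, y, orig_size, model_size):
    """Scale a point from the original image into the resized model input."""
    (ow, oh), (mw, mh) = orig_size, model_size
    return x * mw / ow, y * mh / oh


def to_original_coords(x, y, orig_size, model_size):
    """Scale a predicted point back to the original image, rounded to pixels."""
    (ow, oh), (mw, mh) = orig_size, model_size
    return round(x * ow / mw), round(y * oh / mh)


orig_size = (1000, 750)   # hypothetical original width, height
model_size = (257, 257)   # hypothetical model input width, height

# A true point in the original image, seen by the model on a coarser grid.
true_x, true_y = 613, 402
mx, my = to_model_coords(true_x, true_y, orig_size, model_size)
pred = (round(mx), round(my))   # best possible whole-pixel answer on the small image
back = to_original_coords(*pred, orig_size, model_size)

print(pred, back)
print(back[0] - true_x, back[1] - true_y)   # deviation caused by resizing alone
```

Even with a perfect prediction on the reduced image, the point comes back a pixel or two away from its true position, because one pixel of the small image covers several pixels of the original.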