Official YOLOv7 Pose vs MediaPipe | Full comparison of real-time Pose Estimation | Which is Faster?

Captions
YOLOv7 Pose vs Mediapipe! Human Pose Estimation benchmark, and which one to choose? Hey there, welcome to LearnOpenCV. YOLOv7 Pose is the new kid on the block for Human Pose Estimation. We will pit it against the popular Mediapipe pose model using challenging videos and share our findings through this video. I am sure this will help you choose the right pose estimation model for your next project. Before we start, let's do a quick refresher on Human Pose Estimation for those who are new to this topic. If you want to jump into the comparison section, skip to the fourth minute of the video.

Now, what is Human Pose Estimation? Human Pose Estimation is the task of predicting the location of the major joints of the human body. It is also referred to as Keypoint Detection. Now you may ask: how many points? There are many datasets for keypoint detection with different topologies, the most common being the COCO dataset, which has 17 keypoints. Others include the MPII dataset with 16 keypoints, and the AI Challenger and CrowdPose datasets, both with 14 keypoints, and so on. Human Pose Estimation has possible applications in many industries. For example, you can build AI fitness trainers like that of peloton.ai. It can be used in healthcare services, sports analytics, activity recognition, surveillance, and in the VR industry for gaming and human-computer interaction. There has been extensive research on this topic and many algorithms have been proposed over the years. Some of the popular ones are OpenPose, AlphaPose, HRNet, DEKR, BlazePose and YOLO Pose.

Now, what is YOLO Pose? YOLO Pose is an end-to-end pose estimation model. To understand what that means, let us quickly discuss the two general approaches to pose estimation. In the top-down approach we first detect persons in the image and then use these boxes to find keypoints. As you can guess, this gives pretty good results but would be too slow if there are too many people in the scene. The bottom-up approach uses heatmaps to find keypoints and then uses a grouping algorithm to map the keypoints to each person. This is a single-stage approach and thus it is fast, but you can imagine it won't be very accurate in crowded scenes. YOLO Pose moves away from both these approaches and directly optimizes the object keypoint similarity (OKS) metric, jointly detecting the keypoints as well as the bounding boxes. So it does not need the grouping step done in bottom-up approaches, and because the keypoints and boxes are predicted in a single inference step, it does not compromise on speed either. Thus it brings in the best of both approaches. YOLO Pose is designed with a general architecture so that you can plug in any object detection model. The original paper was based on YOLOv5. Since the architecture of YOLO Pose is general, the YOLOv7 authors adapted it to YOLOv7, and hence the name YOLOv7 Pose.

We want to test YOLOv7 Pose in terms of speed and accuracy and compare its features with Mediapipe, as it is a very accurate and real-time model. Mediapipe is a framework that provides multiple computer vision solutions, and one of them is pose estimation. It uses the popular BlazePose model for pose estimation.
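To make the Mediapipe side of the comparison concrete, here is a minimal sketch of running its pose solution on a video using the Python solutions API. The video path and the confidence thresholds are placeholders, not the values used in the video.

```python
import cv2
import mediapipe as mp

mp_pose = mp.solutions.pose
mp_drawing = mp.solutions.drawing_utils

# "dance.mp4" is a placeholder; use any test video.
cap = cv2.VideoCapture("dance.mp4")

# model_complexity 0/1/2 trades speed for accuracy; these settings are assumptions.
with mp_pose.Pose(static_image_mode=False,
                  model_complexity=1,
                  min_detection_confidence=0.5,
                  min_tracking_confidence=0.5) as pose:
    while cap.isOpened():
        ok, frame = cap.read()
        if not ok:
            break
        # Mediapipe expects RGB input; OpenCV reads frames as BGR.
        results = pose.process(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
        if results.pose_landmarks:
            # 33 landmarks for the single detected person.
            mp_drawing.draw_landmarks(frame, results.pose_landmarks,
                                      mp_pose.POSE_CONNECTIONS)
        cv2.imshow("Mediapipe Pose", frame)
        if cv2.waitKey(1) & 0xFF == 27:  # Esc to quit
            break

cap.release()
cv2.destroyAllWindows()
```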
Now let's see the differences between YOLOv7 Pose and Mediapipe. YOLOv7 Pose is a multi-person pose estimation model, whereas Mediapipe is for single-person pose estimation only. YOLOv7 has 17 keypoints as compared to 33 keypoints for Mediapipe. The default input size of YOLOv7 is 960 pixels, whereas for Mediapipe it is 256 by 256, which means Mediapipe is trained to work at small image sizes as well. YOLOv7 performs detection on every frame, whereas the Mediapipe framework uses a detection-plus-tracking approach to increase speed and provide a more stable output with less jitter. Mediapipe is optimized for CPU and edge devices, but there is no such optimization done for YOLOv7 Pose.

Now let us see how they compare against each other by running them on some challenging videos. First we see the results that we get out of the box with the default values. Here we can see the drastic difference in FPS between Mediapipe and YOLO. This is because the input size of Mediapipe is 256 by 256, whereas for YOLO it is 960, so the comparison is a little unfair. In our next run, we change the input size of YOLO to 256 and get these results. This gives an 8x boost to YOLO, but it still lags behind Mediapipe in terms of FPS.

The next thing we want to check is the amount of jitter or flickering present in the models. For this to be prominent, we zoom in on a particular area of the video and see the results. The difference is evident: there is a lot of flickering in the YOLO results as compared to Mediapipe. Even at the default 960 input size, the YOLO Pose model flickers more than the Mediapipe model.

The next comparison checks for scale variation, that is, what happens when the person's size in the frame varies. We can see that the YOLO model with a small input size of 256 fails to detect the person, but Mediapipe is able to detect the keypoints. YOLOv7 with the larger input size of 960 detects the person for some frames but then fails, whereas Mediapipe continues to detect because it performs tracking: once it has detected the person, it is able to track the keypoints across frames.

The other common challenge in pose estimation is occlusion. In this specific example of occlusion, Mediapipe mistakes the horse's legs for the human's, but YOLOv7 is able to detect the occluded legs as well. This seemed very uncharacteristic of Mediapipe, so we tested it on some more videos. In this video we can see that both models perform decently well even when only a part of the person is visible. Still, on closer observation, the YOLO outputs look better than Mediapipe's under occlusion. In low-lighting conditions, Mediapipe makes more mistakes compared to YOLOv7, but it is still a close call.

Finally, it's time to put them to the extreme test. This video is difficult because the motion is fast and the person is upside down. The YOLOv7 model is not able to keep track of the fact that the person is upside down and still predicts the points as if the person were upright. Notice that the blue lines are for the arms and the orange lines are for the legs. Mediapipe completely outperforms YOLO in this video for both input sizes.
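For reference, this is roughly how YOLOv7 Pose inference can be run at a chosen input size (960 versus 256 in the runs above). It is a sketch that assumes the official yolov7 repository (pose branch) is cloned and on the Python path and that the yolov7-w6-pose.pt checkpoint has been downloaded; the helper functions are the ones that repo provides, and the image path and thresholds are placeholders.

```python
import cv2
import torch
from torchvision import transforms

# Helpers from the official yolov7 repo; run this from inside the cloned repo.
from utils.datasets import letterbox
from utils.general import non_max_suppression_kpt
from utils.plots import output_to_keypoint, plot_skeleton_kpts

INPUT_SIZE = 960  # try 256 to reproduce the "small input" run
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Load the pose checkpoint released with YOLOv7.
weights = torch.load("yolov7-w6-pose.pt", map_location=device)
model = weights["model"].float().eval()
if device.type == "cuda":
    model = model.half().to(device)

frame = cv2.imread("person.jpg")  # placeholder image path
# Letterbox-resize so the longest side matches INPUT_SIZE (stride 64 for the w6 model).
img = letterbox(frame, INPUT_SIZE, stride=64, auto=True)[0]
tensor = transforms.ToTensor()(img).unsqueeze(0)
if device.type == "cuda":
    tensor = tensor.half().to(device)

with torch.no_grad():
    output, _ = model(tensor)

# NMS over boxes and keypoints, then flatten to one row per detected person.
output = non_max_suppression_kpt(output, conf_thres=0.25, iou_thres=0.65,
                                 nc=model.yaml["nc"], nkpt=model.yaml["nkpt"],
                                 kpt_label=True)
people = output_to_keypoint(output)

vis = img.copy()
for person in people:
    # Each row: [batch_id, class_id, x, y, w, h, conf, 17 * (x, y, conf)].
    plot_skeleton_kpts(vis, person[7:], steps=3)
cv2.imwrite("pose_result.jpg", vis)
```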
Now let's summarize the results and look at the key takeaways. Mediapipe cannot be used for multi-person pose estimation, so if you are looking for a multi-person pose estimation model, the discussion ends there. Mediapipe has less jitter as compared to YOLO, which can be attributed to its additional tracking step. If you have a GPU and you want very accurate results, use YOLOv7 with a larger image size. If you want a model capable of running in real time on the CPU, go for Mediapipe. When you need to perform pose estimation on still images, Mediapipe might not be the best option, since a big advantage of Mediapipe comes from its detection-plus-tracking framework; note, though, that it does have a separate mode for running pose estimation on images as well (a small sketch of that mode follows the transcript).

What do you think about the results? We would love to hear your observations as well. I hope you found the video interesting. Do LIKE and SHARE the video and SUBSCRIBE to our channel. Until next time!
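About the still-image mode mentioned in the takeaways: the Mediapipe solutions API exposes a static_image_mode flag that skips tracking and runs the person detector on every call, which is what you want for unrelated images. A minimal sketch, with a placeholder image path:

```python
import cv2
import mediapipe as mp

mp_pose = mp.solutions.pose

# static_image_mode=True disables tracking and re-runs detection on every image.
with mp_pose.Pose(static_image_mode=True, model_complexity=2) as pose:
    image = cv2.imread("photo.jpg")  # placeholder path
    results = pose.process(cv2.cvtColor(image, cv2.COLOR_BGR2RGB))
    if results.pose_landmarks:
        nose = results.pose_landmarks.landmark[mp_pose.PoseLandmark.NOSE]
        # Landmarks are normalized to [0, 1]; scale by image size for pixel coordinates.
        print("Nose at:", nose.x * image.shape[1], nose.y * image.shape[0])
```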
Info
Channel: LearnOpenCV
Views: 65,836
Keywords: mediapipe pose vs openpose, mediapipe pose multiple person, mediapipe pose classification, mediapipe pose estimation github, mediapipe pose detection, pose classification github, mediapipe pose python, 3d human pose estimation github, real-time human pose estimation, human pose estimation opencv, yolov7 pose detection, yolo-pose estimation github, human pose estimation-opencv python, human pose estimation project report, human pose estimation, yolov7 versus mediapipe
Id: hCJIU0pOl5g
Length: 9min 9sec (549 seconds)
Published: Mon Nov 14 2022