Dear Fellow Scholars, this is Two Minute Papers
with Dr. Károly Zsolnai-Fehér. Today, a variety of techniques exist that
can take an image that contains humans, and perform pose estimation on it. This gives us these interesting skeletons
that show us the current posture of the subjects shown in these images. Having this skeleton opens up the possibility
for many cool applications, for instance, it’s great for fall detection and generally
many kinds of activity recognition, analyzing athletic performance and much, much more. But that would require that we can do it
not only for still images, but also for animations. Can we? Yes, we already can; this is a piece of footage
from a previous episode that does exactly that. But what if we wish for more? Let’s think bigger, for instance, can we
reconstruct not only the pose of the model, but the entire 3D geometry of the model itself? You know, including the body shape, face,
clothes, and more. That sounds like science fiction, right? Or with today’s powerful learning algorithms,
maybe it is finally a possibility, who really knows? Let’s have a look together and evaluate
it with three increasingly difficult experiments. Let’s start with experiment number one,
still images. Nice! I think if I knew these people, I might have
a shot at recognizing them solely from the 3D reconstruction. And not only that, but I also see some detail
in the clothes: a suit can be recognized, and the jeans have wrinkles. This new method uses a different geometry
representation that enables higher-resolution outputs, and it immediately shows. Checkmark. It is clearly working quite well on still
images. And now, hold on to your papers for experiment
number two, because not only can it deal with the front side seen in the still images, but it
can also reconstruct the backside of the person. Look! My goodness, but hold on for a second… that
part of the data is completely unobserved. We haven’t seen the backside… so how is
that even possible? Well, we have to shift our thinking a little. An intelligent person would be able to infer
some of these details, for instance, we know that this is a suit, or that these are boots,
and we know roughly what the backside of these objects should look like. This new method leans on an earlier technique
by the name of image-to-image translation to estimate this data. And it truly works like magic! If you take a closer look, you see that we
have less detail in the backside than in the front, but the fact that we can do this is
truly a miracle. But we can go even further. I know it is not reasonable to ask, but what
about video reconstruction? Let’s have a look. Don’t expect miracles, at least not yet,
there is obviously still quite a bit of flickering left, but the preliminary results are quite
encouraging, and I am fairly certain that two more papers down the line, these video
results will be nearly as good as the ones for the still images. The key idea here is that the new method performs
these reconstructions in a way that is consistent; in other words, if there is a small change
in the input model, there will be only a small change in the output model. This is the property that opens up the possibility
to extend this method to videos! So, how does it compare to previous methods? All of these competing techniques are quite
recent, as they are from 2019. They appear to be missing a lot of detail,
and I don’t think we would have a chance of recognizing the target subject from the
reconstructions. And now, just a year and a half later, look
at that incredible progress! It truly feels like we are living in a science
fiction world. What a time to be alive! Thanks for watching and for your generous
support, and I'll see you next time!