Learning 3D Human Pose Estimation from Dozens of Datasets by Bridging Skeleton Formats (WACV'23)

Captions
Hi, I'm István Sárándi, and today I will present "Learning 3D human pose estimation from dozens of datasets using a geometry-aware autoencoder to bridge between skeleton formats." In this work, we tackle 3D human pose estimation from a single RGB image. A major challenge in this task has always been the difficulty of obtaining ground truth, and therefore a lack of large-scale, diverse training data. For example, to annotate 2D poses you can just click on the body joint pixels, but annotating depth is not possible that way. So most 3D pose datasets are recorded in a studio environment, which means that the models often overfit to the studio and don't generalize very well to the wild.

However, we have noticed recently that many new datasets have been released to the community. In fact, we found a total of 28 large labeled datasets: indoors, outdoors and synthetic ones as well. So why don't we train on all of them at once? When taken together, they could cover much more pose and appearance variation. But here's the catch. These datasets use a variety of different skeleton format definitions, for example because they were recorded with different mocap systems. Both the number and the placement of the keypoints can be different. Some have surface markers, some have joints inside the body.

So how can we train one model with all these different labels? On the one hand, we could just pretend, for example, that any joint named "hip" is the same point on the body. But then we would blur the different hip points together within the model. On the other extreme, we could predict the different formats on separate output heads, as in multi-task learning, without assuming any relations between them. Unfortunately, this second option doesn't work very well either. The depth predictions for the different skeletons become inconsistent with each other. It seems that these different labels don't end up supervising a single internal human representation.

So instead of those, we propose to find a good middle ground and establish some relations between the formats, without assuming that they are all the same. Our workflow has three steps. We first train a pose estimator with separate output heads and use it to create pseudo ground truth. This will function as a Rosetta Stone parallel corpus for figuring out how the skeletons relate. In the second step, we perform dimensionality reduction on the number of keypoints in the pseudo ground truth. In other words, we want to discover a set of latent keypoints underlying all of these formats. For this, we propose a novel linear autoencoder formulation, the affine-combining autoencoder. Both the encoder and the decoder compute simple affine combinations of the list of input points, making the transformations equivariant to rotation and translation. This formulation also allows learning these 3D relations merely from the 2D projections of the pseudo ground truth, which is more accurate. Finally, we attach the frozen autoencoder to the end of the pose estimator and define a consistency regularization loss. This encourages the model to output poses that can pass through the autoencoder unchanged, because that means that the different skeleton formats are now consistent.

And indeed, the resulting predictions now become much more consistent and improve the scores on four different benchmarks. When compared with methods from the literature, which typically only train on one or a few datasets, our models are also much stronger. Qualitative results in the wild are very good, too, even on very challenging poses. We believe our models can be useful for many downstream research applications that currently only use 2D poses out of convenience. We therefore make our models easy to use, with minimal dependencies and an intuitive multi-person-aware API.
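
As a rough illustration of the affine-combining idea described in the talk, the following Python sketch checks that a map built from affine combinations (weights in each row summing to 1) commutes with a rotation and translation of the input points, and computes a consistency penalty for a pose passed through an encoder and decoder. This is not the authors' implementation. All function names, the joint counts and the random weights are hypothetical, and the talk does not state the exact form of the loss, so a mean Euclidean distance is assumed here.

```python
import math
import random


def make_affine(raw):
    """Shift each row so its weights sum to 1 (negative weights stay allowed)."""
    return [[w + (1.0 - sum(row)) / len(row) for w in row] for row in raw]


def affine_combine(weights, points):
    """Return one 3D point per weight row: the weighted sum of the input points."""
    return [
        tuple(sum(w * p[axis] for w, p in zip(row, points)) for axis in range(3))
        for row in weights
    ]


def rotate_z_and_translate(points, angle, shift):
    """Rigid motion used for the demo: rotate about the z axis, then translate."""
    c, s = math.cos(angle), math.sin(angle)
    return [
        (c * x - s * y + shift[0], s * x + c * y + shift[1], z + shift[2])
        for x, y, z in points
    ]


def consistency_loss(pose, w_enc, w_dec):
    """Mean distance between a pose and its reconstruction (assumed loss form)."""
    recon = affine_combine(w_dec, affine_combine(w_enc, pose))
    return sum(math.dist(p, r) for p, r in zip(pose, recon)) / len(pose)


def max_gap(a, b):
    """Largest point-to-point distance between two equally long point lists."""
    return max(math.dist(p, q) for p, q in zip(a, b))


def random_rows(num_rows, num_cols):
    """Random stand-in for learned weights; real weights would come from training."""
    return [[random.uniform(-1, 1) for _ in range(num_cols)] for _ in range(num_rows)]


random.seed(0)
NUM_JOINTS, NUM_LATENT = 6, 3  # hypothetical sizes, far smaller than real skeletons

w_enc = make_affine(random_rows(NUM_LATENT, NUM_JOINTS))  # joints -> latent keypoints
w_dec = make_affine(random_rows(NUM_JOINTS, NUM_LATENT))  # latent keypoints -> joints
pose = [tuple(random.uniform(-1, 1) for _ in range(3)) for _ in range(NUM_JOINTS)]

# Equivariance check: moving the pose and then encoding gives the same latent
# points as encoding first and then moving them.
angle, shift = 0.7, (2.0, -1.0, 0.5)
moved_then_encoded = affine_combine(w_enc, rotate_z_and_translate(pose, angle, shift))
encoded_then_moved = rotate_z_and_translate(affine_combine(w_enc, pose), angle, shift)
print("equivariance gap:", max_gap(moved_then_encoded, encoded_then_moved))

# Untrained random weights do not reconstruct the pose, so the loss is positive.
print("consistency loss:", consistency_loss(pose, w_enc, w_dec))
```

The rotation commutes with any weighted sum because it is linear. The translation passes through unchanged only because each row of weights sums to 1, so the printed equivariance gap should be at the level of floating-point noise.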
Info
Channel: RWTHVision
Views: 9,041
Id: 6IW6oImq3RM
Length: 4min 0sec (240 seconds)
Published: Thu Dec 29 2022