Learning 3D Human Pose Estimation from Dozens of Datasets by Bridging Skeleton Formats (WACV'23)

Captions

Hi, I'm István Sárándi, and today I will present "Learning 3D Human Pose Estimation from Dozens of Datasets Using a Geometry-Aware Autoencoder to Bridge Between Skeleton Formats." In this work, we tackle 3D human pose estimation from a single RGB image. A major challenge in this task has always been the difficulty of obtaining ground truth, and therefore a lack of large-scale, diverse training data. For example, to annotate 2D poses you can just click on the body joint pixels, but annotating depth is not possible that way. So most 3D pose datasets are recorded in a studio environment, which means that the models often overfit to the studio and don't generalize very well to the wild.

However, we have noticed that many new datasets have recently been released to the community. In fact, we found a total of 28 large labeled datasets: indoor, outdoor and synthetic ones as well. So why don't we train on all of them at once? Taken together, they could cover much more pose and appearance variation.

But here's the catch. These datasets use a variety of different skeleton format definitions, for example because they were recorded with different mocap systems. Both the number and the placement of the keypoints can be different. Some have surface markers, some have joints inside the body. So how can we train one model with all these different labels? At one extreme, we could just pretend, for example, that any joint named "hip" is the same point on the body. But then we would blur the different hip points together within the model. At the other extreme, we could predict the different formats on separate output heads, as in multi-task learning, without assuming any relations between them. Unfortunately, this second option doesn't work very well either: the depth predictions for the different skeletons become inconsistent with each other. It seems that these different labels don't end up supervising a single internal human representation.
So instead of those, we propose to find a good middle ground and establish some relations between the formats, without assuming that they are all the same. Our workflow has three steps. We first train a pose estimator with separate output heads and use it to create pseudo-ground truth. This will function as a Rosetta Stone, a parallel corpus for figuring out how the skeletons relate. In the second step, we perform dimensionality reduction on the number of keypoints in the pseudo-ground truth. In other words, we want to discover a set of latent keypoints underlying all of these formats. For this, we propose a novel linear autoencoder formulation, the affine-combining autoencoder. Both the encoder and the decoder compute simple affine combinations of the list of input points, making the transformations equivariant to rotation and translation. This formulation also allows learning these 3D relations merely from the 2D projections of the pseudo-ground truth, which are more accurate than its depth. Finally, we attach the frozen autoencoder to the end of the pose estimator and define a consistency regularization loss. This encourages the model to output poses that can pass through the autoencoder unchanged, because that means that the different skeleton formats are now consistent.

And indeed, the resulting predictions become much more consistent and improve the scores on four different benchmarks. Our models are also much stronger than methods from the literature, which typically train on only one or a few datasets. Qualitative results in the wild are very good, too, even on very challenging poses. We believe our models can be useful for many downstream research applications that currently only use 2D poses out of convenience. We therefore make our models easy to use, with minimal dependencies and an intuitive multi-person-aware API.
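
Editor's note: the talk describes an autoencoder whose encoder and decoder compute affine combinations of keypoints, and a consistency loss that rewards poses passing through it unchanged. The NumPy sketch below is only an illustration of that idea under stated assumptions, not the authors' implementation. The function names (`affine_rows`, `encode`, `decode`, `consistency_loss`), the keypoint counts, and the random (untrained) weights are all hypothetical. It checks numerically that weights whose rows sum to 1 make the mapping commute with rotation and translation.

```python
import numpy as np

def affine_rows(raw):
    # Shift each row so it sums to 1: every output point then becomes an
    # affine combination of the input points (weights may be negative).
    return raw - raw.mean(axis=1, keepdims=True) + 1.0 / raw.shape[1]

def encode(poses, w_enc):
    # poses: (batch, J, 3) keypoints, w_enc: (L, J) -> (batch, L, 3) latent keypoints
    return np.einsum("lj,bjc->blc", w_enc, poses)

def decode(latent, w_dec):
    # latent: (batch, L, 3), w_dec: (J, L) -> (batch, J, 3) reconstructed keypoints
    return np.einsum("jl,blc->bjc", w_dec, latent)

def consistency_loss(poses, w_enc, w_dec):
    # Mean per-keypoint distance between a pose and its reconstruction.
    # It is zero exactly when the pose passes through the autoencoder unchanged.
    recon = decode(encode(poses, w_enc), w_dec)
    return float(np.mean(np.linalg.norm(poses - recon, axis=-1)))

rng = np.random.default_rng(0)
num_joints, num_latent = 10, 4  # hypothetical sizes, chosen only for the demo

# Assumption: random weights stand in for learned ones.
w_enc = affine_rows(rng.normal(size=(num_latent, num_joints)))
w_dec = affine_rows(rng.normal(size=(num_joints, num_latent)))
poses = rng.normal(size=(5, num_joints, 3))

# Random orthogonal matrix (a rotation, possibly combined with a reflection)
# and a random translation.
rot, _ = np.linalg.qr(rng.normal(size=(3, 3)))
shift = rng.normal(size=3)
moved = poses @ rot.T + shift

# Equivariance: encoding the moved pose equals moving the encoded pose.
equivariant = np.allclose(encode(moved, w_enc),
                          encode(poses, w_enc) @ rot.T + shift)

loss_original = consistency_loss(poses, w_enc, w_dec)
loss_moved = consistency_loss(moved, w_enc, w_dec)
print("equivariant:", equivariant)
print("loss unchanged by rigid motion:", np.isclose(loss_original, loss_moved))
```

The rows summing to 1 is what makes the translation term carry through unchanged; the linearity of the combination handles the rotation. With random weights the loss is of course not small; in the workflow described in the talk the weights would be learned first and then frozen.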

Info

Channel: RWTHVision

Views: 9,041

Id: 6IW6oImq3RM

Length: 4min 0sec (240 seconds)

Published: Thu Dec 29 2022