Hi, I'm István Sárándi, and today I will present to you "Learning 3D Human Pose Estimation from Dozens of Datasets Using a Geometry-Aware Autoencoder to Bridge Between Skeleton Formats." In this work, we tackle 3D human pose
estimation from a single RGB image. A major challenge in this task has always been
the difficulty of obtaining ground truth and therefore a lack of large-scale diverse training
data. For example, to annotate 2D poses you can just click on the body joint pixels, but
annotating depth is not possible that way. So most 3D pose datasets are recorded in a studio environment, which means that the models often overfit to the studio and don't generalize very well in the wild. However, we have recently noticed that many new
datasets have been released to the community. In fact, we found a total of 28 large labeled datasets: indoor, outdoor, and synthetic ones as well. So why don't we train on all of them at
once? When taken together, they could cover much more pose and appearance variation. But here's the
catch. These datasets use a variety of different skeleton formats, for example because they were recorded with different mocap systems. Both the number and the placement of the keypoints
can be different. Some have surface markers, some have joints inside the body. So how can we
train one model with all these different labels? On the one hand, we could just pretend for
example that any joint named hip is the same point on the body. But then we would blur the
different hip points together within the model. On the other extreme, we could predict the
different formats on separate output heads as in multi-task learning, without
assuming any relations between them. Unfortunately, this second option doesn't work
very well either. The depth predictions for the different skeletons become inconsistent
with each other. It seems that these different labels don't end up supervising
a single internal human representation. So instead of these two extremes, we propose to find a good middle ground and establish some relations between the formats, without assuming that they
are all the same. Our workflow has three steps. We first train a pose estimator with separate output heads and use it to create pseudo ground truth. This will serve as a Rosetta Stone-like parallel corpus for figuring out how the skeletons relate to each other.
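As a rough illustration, such a separate-head setup could look like the following sketch; the names here (the backbone, feature dimension, and per-dataset joint counts) are illustrative assumptions, not the exact architecture from the paper.

```python
import torch
import torch.nn as nn

class MultiSkeletonPoseEstimator(nn.Module):
    """Sketch of a pose estimator with one output head per dataset skeleton.

    'backbone' stands in for any image feature extractor; each head regresses
    the 3D joints of one dataset's skeleton format.
    """

    def __init__(self, backbone: nn.Module, feat_dim: int, joints_per_dataset: dict[str, int]):
        super().__init__()
        self.backbone = backbone
        # One linear head per dataset, predicting (num_joints * 3) coordinates.
        self.heads = nn.ModuleDict({
            name: nn.Linear(feat_dim, num_joints * 3)
            for name, num_joints in joints_per_dataset.items()
        })

    def forward(self, images: torch.Tensor) -> dict[str, torch.Tensor]:
        feats = self.backbone(images)  # assumed shape: (batch, feat_dim)
        # Predict every skeleton format for every image. During training, only
        # the head matching the image's source dataset gets a supervised loss;
        # the full set of predictions can later serve as pseudo ground truth.
        return {
            name: head(feats).view(images.shape[0], -1, 3)
            for name, head in self.heads.items()
        }
```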
In the second step, we perform dimensionality reduction on the number of keypoints in the pseudo ground truth. In other words, we want to discover a set of latent keypoints underlying all of these formats. For this, we propose a novel linear
autoencoder formulation, the affine-combining autoencoder. Both the encoder and the decoder
compute simple affine combinations of the input points, which makes the transformations equivariant to rotation and translation. This formulation also allows us to learn these 3D relations merely from the 2D projections of the pseudo ground truth, which are more accurate.
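A minimal sketch of such an affine-combining autoencoder is shown below, assuming PyTorch. Here, num_joints would be the total keypoint count across all skeleton formats in the pseudo ground truth, and any additional regularization on the weights is omitted.

```python
import torch
import torch.nn as nn

class AffineCombiningAutoencoder(nn.Module):
    """Sketch of an affine-combining autoencoder (ACAE).

    Each latent keypoint is an affine combination (weights summing to 1) of the
    input joints, and each reconstructed joint is an affine combination of the
    latent keypoints. Since affine combinations commute with rotation and
    translation, the mapping is equivariant to those transformations, and the
    same weights learned on 2D projections also apply to 3D points.
    """

    def __init__(self, num_joints: int, num_latents: int):
        super().__init__()
        # Unconstrained parameters; each row is shifted to sum to 1 in _affine().
        self.enc_weights = nn.Parameter(torch.randn(num_latents, num_joints) * 0.01)
        self.dec_weights = nn.Parameter(torch.randn(num_joints, num_latents) * 0.01)

    @staticmethod
    def _affine(weights: torch.Tensor, points: torch.Tensor) -> torch.Tensor:
        # Shift each row so its weights sum to exactly 1 (affine combination).
        w = weights + (1.0 - weights.sum(dim=1, keepdim=True)) / weights.shape[1]
        # points: (batch, num_points, dim) with dim = 2 or 3.
        return torch.einsum('lj,bjd->bld', w, points)

    def encode(self, joints: torch.Tensor) -> torch.Tensor:
        return self._affine(self.enc_weights, joints)

    def decode(self, latents: torch.Tensor) -> torch.Tensor:
        return self._affine(self.dec_weights, latents)

    def forward(self, joints: torch.Tensor) -> torch.Tensor:
        return self.decode(self.encode(joints))
```

Training such a sketch would then simply minimize a reconstruction error between the 2D pseudo-ground-truth points and their reconstructions; because the learned weights only describe affine combinations, they transfer unchanged to 3D.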
Finally, we attach the frozen autoencoder to the end of the pose estimator and define a consistency regularization loss. This encourages the model to output poses that can pass through the autoencoder unchanged, because that means the different skeleton formats are consistent with each other.
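In code, such a consistency term could be as simple as the following sketch; the distance function and the loss weight in the usage comment are illustrative assumptions.

```python
import torch
from torch import nn

def consistency_loss(pred_joints: torch.Tensor, frozen_acae: nn.Module) -> torch.Tensor:
    """L1 penalty on how much the predictions change inside the frozen autoencoder.

    pred_joints: (batch, total_joints, 3), the concatenated predictions for all
    skeleton formats. The autoencoder's weights are assumed frozen
    (requires_grad=False), so this term only pushes the pose estimator towards
    outputs that pass through the autoencoder unchanged, i.e. skeletons that
    are consistent with each other.
    """
    reconstructed = frozen_acae(pred_joints)
    return torch.mean(torch.abs(reconstructed - pred_joints))

# During training, this term would be added to the supervised loss with some
# weight (the 0.1 here is an illustrative value, not taken from the paper):
# total_loss = supervised_loss + 0.1 * consistency_loss(pred_all_formats, acae)
```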
And indeed, the resulting predictions become much more consistent, and the scores improve on four different benchmarks.
When compared with methods from the literature that typically only train on one or a few
datasets, our models are also much stronger. Qualitative results in the wild are very
good, too, even on very challenging poses. We believe our models can be useful for
many downstream research applications that currently only use 2D poses out of
convenience. We therefore make our models easy to use, with minimal dependencies and an intuitive, multi-person-aware API.
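Using the released models could look roughly like this hypothetical sketch; the loader, method, and field names below are illustrative placeholders rather than the exact released interface.

```python
# Hypothetical usage sketch; names are placeholders, not the actual released API.
model = load_pretrained_pose_model()    # placeholder model loader
image = load_image('street_scene.jpg')  # any RGB image, H x W x 3

result = model.detect_poses(image)      # multi-person aware: all people in the image
print(result['poses3d'].shape)          # e.g. (num_people, num_joints, 3)
```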