Dear Fellow Scholars, this is Two Minute Papers
with this guy's name that is impossible to pronounce. My name is Dr. Károly Zsolnai-Fehér, and
indeed, it seems that pronouncing my name requires some advanced technology. So what was this? I promise to tell you in a moment, but to
understand what happened here, first, let’s have a look at this deepfake technique we
showcased a few videos ago. As you see, we are at the point where our
mouth, head, and eye movements are also realistically translated to a chosen target subject, and
perhaps the most remarkable part of this work was that we don’t even need a video of this
target person, just one photograph. However, these deepfake techniques mainly
help us in transferring video content. So what about voice synthesis? Is it also as advanced as this technique we’re
looking at? Well, let’s have a look at an example, and
you can decide for yourself. This is a recent work that goes by the name
Tacotron 2, and it performs AI-based voice cloning. All this technique requires is a 5-second sound sample of us, and it is able to synthesize new sentences in our voice, as if we had uttered these words ourselves. Let's listen to a couple of examples. Wow, these are truly incredible. The timbre of the voice is very similar, and it is able to synthesize sounds and consonants that have to be inferred because they were not heard in the original voice sample.
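For the programmers among you, here is a minimal structural sketch of how such a voice-cloning pipeline is usually wired together: a speaker encoder compresses the short reference clip into a fixed-size embedding, a Tacotron-style synthesizer conditioned on that embedding predicts a mel spectrogram for the new text, and a vocoder turns the spectrogram into sound. Every function below is a toy placeholder for illustration, not the actual code of Tacotron 2 or of this paper.

```python
import numpy as np

# Structural sketch of speaker-embedding-based voice cloning.
# All internals are placeholders; only the shape of the pipeline matters here.

def speaker_encoder(reference_audio: np.ndarray) -> np.ndarray:
    """Map a ~5-second reference clip to a fixed-size speaker embedding."""
    frames = reference_audio.reshape(-1, 400)          # 25 ms frames at 16 kHz
    return frames.mean(axis=0)                         # placeholder embedding

def synthesizer(text: str, speaker_embedding: np.ndarray) -> np.ndarray:
    """Predict a mel spectrogram for `text`, conditioned on the speaker embedding.
    A real system uses an attention-based sequence-to-sequence network."""
    n_frames = 10 * len(text)                          # rough length heuristic
    rng = np.random.default_rng(abs(hash(text)) % 2**32)
    return rng.standard_normal((n_frames, 80)) + speaker_embedding[:80].mean()

def vocoder(mel: np.ndarray) -> np.ndarray:
    """Turn the mel spectrogram into a waveform (a WaveNet-style model in practice)."""
    hop = 200                                          # 12.5 ms hop at 16 kHz
    return np.repeat(mel.mean(axis=1), hop)            # placeholder waveform

if __name__ == "__main__":
    reference = np.random.randn(5 * 16000)             # 5 seconds of "our" voice
    embedding = speaker_encoder(reference)
    mel = synthesizer("Dear Fellow Scholars, this is Two Minute Papers.", embedding)
    audio = vocoder(mel)
    print(audio.shape)                                  # a new sentence, in "our" voice
```

The interesting part in practice is the speaker embedding: it is what lets the synthesizer produce sounds and consonants it never actually heard from us.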
And now, let's jump to the next level and use a new technique that takes a sound sample and animates the video footage as if the target
subject said it themselves. This technique is called Neural Voice Puppetry,
and even though the voices here are synthesized by this previous Tacotron 2 method that you
heard a moment ago, we shouldn't judge this technique by its audio quality, but by how well
the video follows these given sounds. Let’s go! If you decide to stay until the end of this
video, there will be another fun video sample waiting for you there. Now, note that this is not the first technique
to achieve results like this, so I can’t wait to look under the hood and see what’s
new here. After processing the incoming audio, the gestures are applied to an intermediate 3D model, which is specific to each person, since each speaker has their own way of expressing themselves. You can see this intermediate 3D model here, but we are not done yet: we feed it through a neural renderer, which applies this motion to the particular face model shown in the video. You can imagine the intermediate 3D model as a crude mask that models the gestures well but does not look like the face of anyone, and the neural renderer adapts this mask to our target subject. This includes adapting it to the current resolution, lighting, face position and more, all of which is specific to what is seen in the video. What is even cooler is that this neural rendering part runs in real time.
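If you prefer to see this in code, here is a rough sketch of that two-stage structure, with the audio-to-expression network and the neural renderer as stand-in PyTorch modules. The module names, feature sizes, and the crude rasterizer are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

# Sketch of the two-stage idea: audio features -> person-specific expression
# coefficients -> crude intermediate face "mask" -> neural renderer that pastes
# the motion onto the target video frame. All names and sizes are made up.

class AudioToExpression(nn.Module):
    """Maps a window of audio features to expression coefficients for one speaker."""
    def __init__(self, audio_dim=29, expr_dim=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(audio_dim, 128), nn.ReLU(),
            nn.Linear(128, expr_dim),
        )
    def forward(self, audio_feat):                     # (batch, audio_dim)
        return self.net(audio_feat)                    # (batch, expr_dim)

class NeuralRenderer(nn.Module):
    """Adapts the crude rendered mask to the target video's look:
    resolution, lighting and head pose come from the background frame."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(6, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 3, 3, padding=1), nn.Sigmoid(),
        )
    def forward(self, coarse_face, background_frame):
        x = torch.cat([coarse_face, background_frame], dim=1)  # (batch, 6, H, W)
        return self.net(x)                                      # final frame

def render_coarse_face(expr, height=128, width=128):
    """Placeholder for rasterizing the intermediate 3D model from expression coeffs."""
    batch = expr.shape[0]
    return expr.mean(dim=1).view(batch, 1, 1, 1).expand(batch, 3, height, width)

audio_feat = torch.randn(1, 29)            # per-frame audio features (placeholder size)
frame      = torch.rand(1, 3, 128, 128)    # current frame of the target video
expr       = AudioToExpression()(audio_feat)
coarse     = render_coarse_face(expr)
output     = NeuralRenderer()(coarse, frame)
print(output.shape)                         # torch.Size([1, 3, 128, 128])
```

Note that the renderer gets to see the background frame: in the real system, that is what lets it adapt the crude mask to the target's resolution, lighting and head pose.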
So, what do we get from all this? Well, one, superior quality, but at the same time, it also generalizes to multiple targets. Have a look here! And the list of great news is not over yet: you can try it yourself, the link is available in the video description. Make sure to leave a comment with your results! To sum up, by combining multiple existing techniques, we can now perform joint video and audio synthesis for a target subject, and it is important that everyone knows that this is possible. This episode has been supported by Weights
& Biases. Here, they show you how to use their tool
to perform faceswapping and improve the model that performs it. Weights & Biases provides tools to track your experiments in your deep learning projects.
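For the curious, tracking a run with their Python client really is just a couple of calls; the project name, config values, and the toy loss below are placeholders, not part of any real project.

```python
import random
import wandb

# Minimal sketch of experiment tracking with the wandb client.
run = wandb.init(project="faceswap-demo", config={"lr": 3e-4, "batch_size": 16})

for step in range(100):
    loss = 1.0 / (step + 1) + random.random() * 0.01   # stand-in for a real training loss
    wandb.log({"loss": loss, "step": step})

run.finish()  # the logged curves show up on the wandb dashboard
```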
Their system is designed to save you a ton of time and money, and it is actively used in projects at prestigious labs, such as OpenAI,
Toyota Research, GitHub, and more. And, the best part is that if you are an academic
or have an open source project, you can use their tools for free. It really is as good as it gets. Make sure to visit them through wandb.com/papers
or just click the link in the video description and you can get a free demo today. Our thanks to Weights & Biases for their long-standing
support and for helping us make better videos for you. Thanks for watching and for your generous
support, and I'll see you next time!
This is going to be a game changer for game development.
Text to voice, then style to voice, or even merging styles to make up new voices.
Use voice actors for the most important lines and auto generate the rest.
When there is a version my hardware can run (or my next upgrade) I'd love to get Daggerfall Unity and Morrowind characters fully voiced.
Hello. This is 2 minute papers with Caroly Zsonai Frafsra.
If you were to make one of these for a single person, but you wanted to be able to express different emotional cadences, could you have a single model? Or would you have to train separate models for each emotion, with training sets for each emotion?
This is incredible and thank you for sharing. I’m gonna excerpt this for some of my insta story; do you have an IG?
Where do I get this?