Dear Fellow Scholars, this is Two Minute Papers
with this guy's name that is impossible to pronounce. My name is Dr. Károly Zsolnai-Fehér, and
indeed, it seems that pronouncing my name requires some advanced technology. So what was this? I promise to tell you in a moment, but to
understand what happened here, first, let’s have a look at this deepfake technique we
showcased a few videos ago. As you see, we are at the point where our
mouth, head, and eye movements are also realistically translated to a chosen target subject, and
perhaps the most remarkable part of this work was that we don’t even need a video of this
target person, just one photograph. However, these deepfake techniques mainly
help us in transferring video content. So what about voice synthesis? Is it also as advanced as this technique we’re
looking at? Well, let’s have a look at an example, and
you can decide for yourself. This is a recent work that goes by the name
Tacotron 2, and it performs AI-based voice cloning. All this technique requires is a 5-second sound sample of us, and it is able to synthesize new sentences in our voice, as if we had uttered these words ourselves. Let's listen to a couple of examples. Wow, these are truly incredible. The timbre of the voice is very similar, and it is able to synthesize sounds and consonants that have to be inferred because they were not heard in the original voice sample.
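For the programmers among you, here is a minimal structural sketch of how such a voice-cloning pipeline is usually wired together: a speaker encoder compresses the short reference clip into a fixed-size embedding, a Tacotron-style synthesizer conditioned on that embedding predicts a mel spectrogram for the new text, and a vocoder turns the spectrogram into sound. Every function below is a toy placeholder for illustration, not the actual code of Tacotron 2 or of this paper.

```python
import numpy as np

# Structural sketch of speaker-embedding-based voice cloning.
# All internals are placeholders; only the shape of the pipeline matters here.

def speaker_encoder(reference_audio: np.ndarray) -> np.ndarray:
    """Map a ~5-second reference clip to a fixed-size speaker embedding."""
    frames = reference_audio.reshape(-1, 400)          # 25 ms frames at 16 kHz
    return frames.mean(axis=0)                         # placeholder embedding

def synthesizer(text: str, speaker_embedding: np.ndarray) -> np.ndarray:
    """Predict a mel spectrogram for `text`, conditioned on the speaker embedding.
    A real system uses an attention-based sequence-to-sequence network."""
    n_frames = 10 * len(text)                          # rough length heuristic
    rng = np.random.default_rng(abs(hash(text)) % 2**32)
    return rng.standard_normal((n_frames, 80)) + speaker_embedding[:80].mean()

def vocoder(mel: np.ndarray) -> np.ndarray:
    """Turn the mel spectrogram into a waveform (a WaveNet-style model in practice)."""
    hop = 200                                          # 12.5 ms hop at 16 kHz
    return np.repeat(mel.mean(axis=1), hop)            # placeholder waveform

if __name__ == "__main__":
    reference = np.random.randn(5 * 16000)             # 5 seconds of "our" voice
    embedding = speaker_encoder(reference)
    mel = synthesizer("Dear Fellow Scholars, this is Two Minute Papers.", embedding)
    audio = vocoder(mel)
    print(audio.shape)                                  # a new sentence, in "our" voice
```

The interesting part in practice is the speaker embedding: it is what lets the synthesizer produce sounds and consonants it never actually heard from us.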
And now, let's jump to the next level and use a new technique that takes a sound sample and animates the video footage as if the target
subject said it themselves. This technique is called Neural Voice Puppetry,
and even though the voices here are synthesized by this previous Tacotron 2 method that you
heard a moment ago, we shouldn't judge this technique by its audio quality, but by how well
the video follows these given sounds. Let’s go! If you decide to stay until the end of this
video, there will be another fun video sample waiting for you there. Now, note that this is not the first technique
to achieve results like this, so I can’t wait to look under the hood and see what’s
new here. After processing the incoming audio, the gestures are applied to an intermediate 3D model, which is specific to each person, since each speaker has their own way of expressing themselves. You can see this intermediate 3D model here, but we are not done yet: we feed it through a neural renderer, which applies this motion to the particular face model shown in the video. You can imagine the intermediate 3D model as a crude mask that models the gestures well but does not look like the face of anyone, and the neural renderer adapts this mask to our target subject. This includes adapting it to the current resolution, lighting, face position and more, all of which is specific to what is seen in the video. What is even cooler is that this neural rendering part runs in real time.
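If you prefer to see this in code, here is a rough sketch of that two-stage structure, with the audio-to-expression network and the neural renderer as stand-in PyTorch modules. The module names, feature sizes, and the crude rasterizer are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

# Sketch of the two-stage idea: audio features -> person-specific expression
# coefficients -> crude intermediate face "mask" -> neural renderer that pastes
# the motion onto the target video frame. All names and sizes are made up.

class AudioToExpression(nn.Module):
    """Maps a window of audio features to expression coefficients for one speaker."""
    def __init__(self, audio_dim=29, expr_dim=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(audio_dim, 128), nn.ReLU(),
            nn.Linear(128, expr_dim),
        )
    def forward(self, audio_feat):                     # (batch, audio_dim)
        return self.net(audio_feat)                    # (batch, expr_dim)

class NeuralRenderer(nn.Module):
    """Adapts the crude rendered mask to the target video's look:
    resolution, lighting and head pose come from the background frame."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(6, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 3, 3, padding=1), nn.Sigmoid(),
        )
    def forward(self, coarse_face, background_frame):
        x = torch.cat([coarse_face, background_frame], dim=1)  # (batch, 6, H, W)
        return self.net(x)                                      # final frame

def render_coarse_face(expr, height=128, width=128):
    """Placeholder for rasterizing the intermediate 3D model from expression coeffs."""
    batch = expr.shape[0]
    return expr.mean(dim=1).view(batch, 1, 1, 1).expand(batch, 3, height, width)

audio_feat = torch.randn(1, 29)            # per-frame audio features (placeholder size)
frame      = torch.rand(1, 3, 128, 128)    # current frame of the target video
expr       = AudioToExpression()(audio_feat)
coarse     = render_coarse_face(expr)
output     = NeuralRenderer()(coarse, frame)
print(output.shape)                         # torch.Size([1, 3, 128, 128])
```

Note that the renderer gets to see the background frame: in the real system, that is what lets it adapt the crude mask to the target's resolution, lighting and head pose.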
So, what do we get from all this? Well, one, superior quality, but at the same time, it also generalizes to multiple targets. Have a look here! And the list of great news is not over yet: you can try it yourself, the link is available in the video description. Make sure to leave a comment with your results! To sum up, by combining multiple existing techniques, we can now perform joint video and audio synthesis for a target subject, and it is important that everyone knows that this is possible. This episode has been supported by Weights
& Biases. Here, they show you how to use their tool
to perform faceswapping and improve the model that performs it. Weights & Biases provides tools to track your experiments in your deep learning projects.
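For the curious, tracking a run with their Python client really is just a couple of calls; the project name, config values, and the toy loss below are placeholders, not part of any real project.

```python
import random
import wandb

# Minimal sketch of experiment tracking with the wandb client.
run = wandb.init(project="faceswap-demo", config={"lr": 3e-4, "batch_size": 16})

for step in range(100):
    loss = 1.0 / (step + 1) + random.random() * 0.01   # stand-in for a real training loss
    wandb.log({"loss": loss, "step": step})

run.finish()  # the logged curves show up on the wandb dashboard
```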
Their system is designed to save you a ton of time and money, and it is actively used in projects at prestigious labs, such as OpenAI,
Toyota Research, GitHub, and more. And, the best part is that if you are an academic
or have an open source project, you can use their tools for free. It really is as good as it gets. Make sure to visit them through wandb.com/papers
or just click the link in the video description and you can get a free demo today. Our thanks to Weights & Biases for their long-standing
support and for helping us make better videos for you. Thanks for watching and for your generous
support, and I'll see you next time!
This is going to be a game changer for game development.
Text to voice, then style to voice, or even merging styles to make up new voices.
Use voice actors for the most important lines and auto generate the rest.
When there is a version my hardware can run (or my next upgrade) I'd love to get Daggerfall Unity and Morrowind characters fully voiced.
Hello. This is 2 minute papers with Caroly Zsonai Frafsra.
If you were to make one of these for a single person, but you wanted to be able to express different emotional cadences, could you have a single model? Or would you have to train separate models for each emotion, with training sets for each emotion?
This is incredible and thank you for sharing. I’m gonna excerpt this for some of my insta story; do you have an IG?
Where do I get this?