Google's NEW TERRYIFYING AI 'Soundstorm' Shocks The ENTIRE INDUSTRY! (NOW ANNOUNCED!)

Captions
So a couple of days ago Google released a stunning new paper that could have some potentially terrifying impacts on society, but it wasn't picked up by the wider AI community, since this wasn't a tool that was released but rather a demonstration of what is now possible with certain AI software. The tool in question is SoundStorm, a new AI tool that can leverage current AI capabilities to create hyper-realistic voiceovers with a new sort of architecture. This is very interesting because it does link to many previous AI voiceovers that we've heard in the past, but this one is definitely striking in terms of the accuracy that we see. In this video I'm going to show you tons of different examples, and why this is possibly one of the most dangerous pieces of AI to come out, because the implications they discuss in the paper are actually quite bad for wider society if it gets into the wrong hands. So let's get into this, because this one is very interesting.

The demo you're about to hear isn't based purely on how the voice sounds. Try to notice the small inflections you hear when someone is speaking, for example the pauses, the ums, the little things that make a voiceover human. I'm also going to show you another example from Google, released a couple of years back, that demonstrates this in full effect.

"Did you hear about Google's paper on SoundStorm? Um, no, I must have missed it. What's, what's it about? Well, it's a parallel decoder for efficient audio generation. Uh, so it can even be used to generate dialogues. Oh, interesting. Yeah, yeah, like this one was generated by SoundStorm. Wait, what?"

What you just heard there was the official demo that they had on their research paper. There are tons more demos, which we will get into, but before we do I just want you to pay attention to the little inflections that we hear when someone speaks, because these are the important
things that determine whether or not someone interprets a voice as sounding human, and SoundStorm manages to capture that very, very well, so I think it's very interesting how it does that. So let's look at some of the other things on the page and see just how realistic this stuff is.

There are going to be three parts to this that we need to talk about, and the first part is the dialogue synthesis. What SoundStorm can do really well, and something that I found very interesting, is that it can generate dialogue between two people talking from just around three seconds of audio. You might be wondering how this differs from ElevenLabs. This actually sounds really realistic in terms of the little inflections that we talked about before, the ones we hear in the dialogue. So this is the text, this is the voice prompt, and then of course we have the synthesized dialogue. What we're about to listen to is the original recording, and then the synthesized dialogue, a.k.a. the AI-generated dialogue. So let's listen to the voice prompt:

"Where did you go last summer? I went to Greece, it was amazing."

And then what we're going to listen to now is of course the synthesized dialogue:

"Where did you go last summer? I went to Greece, it was amazing. That's great, I've always wanted to go to Greece. What was your favorite part? Uh, it's hard to choose just one favorite part, but yeah, I really loved the food. The seafood was especially delicious. Yeah, and the beaches were incredible. We spent a lot of time swimming, uh, sunbathing, and, and exploring the islands. That sounds like a perfect vacation, I'm so jealous. It was definitely a trip I'll never forget. I really hope we'll get to visit someday."

Now some people might think that's not impressive, but that's pretty insane. I'm not sure if this is linked to Google's previous AI voiceover software, which they demoed
ages ago, and when I say ages ago I think it was around 2018, when they demoed something that sounded so realistic it genuinely freaked me out, along with anyone I showed it to. I'm going to show you a little clip of that, and when I show you that clip and then replay the clip here, you're going to realize just what these implications are.

"So let's go back to this example. Let's say you want to ask Google to make you a haircut appointment on Tuesday between 10 and noon. What happens is the Google Assistant makes the call seamlessly in the background for you. So what you're going to hear is the Google Assistant actually calling a real salon to schedule the appointment for you. Let's listen."

"Hi, I'm calling to book a women's haircut for a client. Um, I'm looking for something on May 3rd. Sure, give me one second. [Music] Sure, what time are you looking for? Around at 12 p.m. We do not have a 12 p.m. available. The closest we have to that is a 1:15. 10 a.m. and, uh, 12 p.m.? Depending on what service she would like. What service is she looking for? Just a women's haircut for now. Okay, we have a 10 o'clock. 10 a.m. is fine. Okay, what's her first name? The first name is Lisa. Okay, perfect, so I will see Lisa at 10 o'clock on May 3rd. Okay, great, thanks. Great, have a great day, bye."

So yeah, that was just so you could hear exactly how realistic this can sound when put in the right environment. Of course we're going to talk about applications later, but I just wanted to show you that when you actually know the applications of this kind of stuff, it definitely is pretty crazy. So let's listen to another example right here. We're going to look at the voice prompt:

"Something really funny happened to me this morning. Oh wow, what?"

And then let's listen to the synthesized dialogue:

"Something really funny happened to me this morning. Oh wow, what? Well, uh, I woke up as usual. Uh-huh. Went downstairs to have, uh, breakfast. Yeah. Started eating. Then, uh, 10 minutes later I realized it was the middle of the night. Oh, oh"
"wait, that's so funny."

"Something really funny happened to me this morning. Oh wow, what? Well, uh, I woke up as usual. Uh-huh. Went downstairs to have, uh, breakfast. Yeah. Started eating. Then 10 minutes later I realized it was the middle of the night."

That was both of those, and it's interesting because the second one did sound a lot better. But I've got to be honest, this is definitely just a little bit too realistic for my liking if that is an AI voice, because if I was listening to that conversation I would definitely say that those people are human. So this is definitely starting to freak me out just a little bit. It's definitely impressive, I would say, but at the same time it's a little bit scary. Let's also listen to one more example, and I think you guys are going to like this one because it provides more context and sounds really realistic:

"I'm going to Istanbul for the Champions League final. That's awesome, who are you supporting? Liverpool, I've always been a big fan. Ah, Liverpool is a great team, but I, I think it will be, it will be a close match. Yeah, I can't wait, you know, I'm super excited to be going there. Yeah, I can imagine. Are you coming as well? Uh, no, unfortunately I, I can't."

Now I've got to be honest, the male voice in this one definitely sounded really, really realistic, and like I said, even if the text is great, it's definitely starting to sound a bit too realistic. There are some subtle cues that this is AI-generated: some of the audio does sound a little bit, I'm not sure what the word is, not crisp per se, but it does sound just a tad generated in terms of the actual audio quality, though not in terms of the linguistics and the way that they speak, if that makes any sense. But we're going to move on now. I do think this one was pretty interesting, and if you want to know how this works, like we said before, the text was controlled by Google, it of course had two people speaking, and then they were able to replicate that conversation. And I do
think that one thing that's going to be interesting is to see how exactly this plays out when it comes to video games, because as you know, AI-generated conversations in real time could be something very interesting in terms of a new world. We did see that recently with Nvidia, which we will be making a video on, because Nvidia demoed a huge amount of stuff at their recent conferences that gave us a lot to look at in terms of the AI future.

Then, if we scroll on down the paper page to here, we've got these baselines. Essentially we look at the original audio, then we look at what these other AI tools can replicate when trying to make the voiceover sound exactly like the original audio. So here we have the original audio:

"His heart full of charity and severity at the same time."

Then we have AudioLM:

"Descend with his heart full of charity and severity at the same time."

Then we have greedy:

"Descend with his heart full of charity and severity at the same time."

You can hear that the greedy example does sound a little bit more robotic, which is one of the things I talked about earlier: sometimes you can hear that sort of echoey, robotic sound in certain kinds of voiceovers. Then we have SoundStorm's:

"He must descend with his heart full of charity and severity at the same time."

Another thing that we wanted to talk about was prompted and unprompted generation. With SoundStorm, and I need to show you guys another example because this is actually really interesting, you can see right here on the left this is the original, then this is the unprompted, and this is the prompted. Essentially, the original audio is taken from a huge library of audio spoken by a variety of different people, and it's, I guess you could say, a benchmark that they use to see how people actually sound compared to the AI voices. The original samples are taken from LibriSpeech test-clean,
and that's the sample set. So right here this is the original, and then of course you can get it converted into other human voices. Let's listen to the original, then we're going to listen to how the AI voice mimics that voice. What you'll notice is that the original and the prompted sound almost exactly alike, while the unprompted ones sound like other people. So let's listen to the original:

"Mr. Meadowcroft the elder, having not spoken one word thus far, himself introduced the newcomer to me, with a side glance at his sons, which had something like defiance."

And then let's listen to the unprompted, so this is going to sound like another person:

"Mr. Meadowcroft the elder, having not spoken one word thus far, himself introduced the newcomer to me, with a side glance at his sons, which had something like defiance, a glance which"

Now we're going to listen to the prompted one, which is essentially the AI voice that sounds like the original voice:

"Mr. Meadowcroft the elder, having not spoken one word thus far, himself introduced the newcomer to me, with a side glance at his sons, which had something like defiance in it."

So why this is very interesting is that with prompted and unprompted generation, you're able to generate high-quality voiceovers with only three seconds of voice. One thing I do find interesting about the testing is that we don't see examples compared to other software, for example ElevenLabs. So what I'm going to do is download the original sample and then see how this compares to ElevenLabs, because if I quickly show you the voices that I've cloned on ElevenLabs, they actually sound really good. I'm going to show you a voice that I personally cloned of Joe Rogan and one of Donald Trump to show you how they sound. Like we said, it would have been good if we had benchmarks against some popular figures, because then it would be
interesting to see exactly how they sound, but of course, as you know, they decided to use that specific set of voices because those are the industry standard. So this is ElevenLabs. If you're not familiar with it, it's a voice AI tool, and if you want to clone a voice it's very simple. All I had to do was click "Add Voice", then click "Add". Then, for example, if you wanted to clone Joe Rogan or anyone else, you'd type in the name right there. Not that the name has anything to do with it, you can call it what you like. Then you just upload a file. You don't need to upload 25 samples, you can literally just upload a single standard sample, and within a matter of seconds you have something like this. So I'm going to play you the clip that we just got:

"Hey everyone, welcome to the Joe Rogan podcast, episode 100. We are here with The AI Grid and we're going to talk about artificial intelligence."

So I think that does sound decently like Joe Rogan, if you ask me. I mean, other people will be able to tell it's an AI voiceover, but it is still very good given how far we've come, and I think with SoundStorm and other technologies it's also going to sound very good. What I'm also going to show you is one more example, and this is of Joe Biden, because more people are going to know exactly what he sounds like:

"Welcome to the United States, the greatest country in the world, and today we are speaking with The AI Grid."

So I think that one also did sound quite realistic. You have to understand that certain ones won't sound that realistic, but the point of this is to demo how quickly these tools can actually be used. What I'm going to do now is show you some examples of where this was used maliciously. You can see from this Washington Post article, it says they thought their loved ones were calling for help, it was an AI scam. Scammers are using
artificial intelligence to sound more like family members in distress. People are falling for it, losing thousands of dollars. The story goes on to say the man calling Ruth Card sounded just like her grandson Brandon, so when he said he was in jail with no wallet or cell phone and needed cash for bail, she scrambled to do whatever she could to help. It was definitely the feeling of fear, she said, that you've got to help him right now. So the 73-year-old woman and her 75-year-old partner ran to their bank and withdrew 3,000 Canadian dollars, which is around 2,200 in U.S. currency and is the daily maximum. Then they hurried to another branch for more money, but a bank manager told them that this was a scam, because they'd learned earlier that someone else had been through a similar scenario, and that the person on the phone wasn't their grandson. That's when they realized that they had been duped, and in an interview they were like, you know what, we've been duped.

This is why these tools are very scary but at the same time very good. I mean, it's good for content creation. For example, I could clone my voice with AI, and then if something happened to my voice, or if I had a cough or something, I could simply write the script and use the clone to read it out, and people wouldn't know. That could save a lot of time, because voice actors are very expensive and sometimes pronouncing words is quite hard. So with a technology like ElevenLabs, and of course with technology like SoundStorm coming out, it's going to be very hard to tell what is real and what is fake, and what kind of identity verification we're going to need in the future. For example, if we check out this article from The Guardian, it also tells us about some of the problems that are going to come up. It says AI can fool voice recognition
used to verify identity by Centrelink and Australian tax office. It says a voice identification system used by the Australian government for millions of people has a serious security flaw. Essentially, this government office uses a voiceprint along with other information, but the problem is that the voiceprint is just someone's voice over the phone, and of course we now know that AI programs trained to sound like a specific person could be used to access phone banking services overseas, and their system was able to be fooled by this AI voiceover. So we know that this is going to be a problem in the future.

One thing that I wanted to touch on, and something that is very important, is the broader impact section. I'll also link another video in the description, because it's a really important video that you need to see and not enough people have seen it. It says SoundStorm is a model for high-quality, efficient generation of neural audio codec-derived audio representations, yada yada yada. So basically it does say here that this is dangerous. It says, in turn, the ability to mimic a voice can have numerous malicious applications, including bypassing biometric identification and for the purpose of impersonation. Thus, it is crucial to put in place safeguards against the potential misuse. To this end, we have verified that after replacement the generated audio remains detectable by a dedicated classifier, 98.5% using the same classifier as Borsos et al. Hence, as a component of a larger system, we believe that SoundStorm would be unlikely to introduce additional risks to those previously discussed by Borsos et al.

So essentially this broader impact section talks about how, if someone is able to clone your voice with just three seconds of audio, which people are currently doing, and if they're able to get it to this level, where it sounds even more realistic in terms of the levels and how you talk, it's going
to have potentially malicious applications. But they also say that even when you generate audio through SoundStorm, you're able to detect that it's AI-generated with a dedicated classifier. Now I do want to be honest, the only problem I have with this is one of the recent examples that we saw, where people have been scammed by callers posing as their grandsons or other relatives asking for money. If you haven't been made aware, there have been several scams going around where someone receives a phone call from a loved one. If you receive a phone call from your loved one, and you hear their voice on the phone, and they're asking you to transfer money into a certain account, you might not think twice about it, and that's happened to many different people. With the rise of voice cloning from tools like ElevenLabs, this has been made free on a global scale. Of course, with every great tool there's always going to be some misuse, but I think with this the risk is definitely higher. It does say that it's unlikely to introduce additional risks, but I guess we're going to have to see.

That's where this product comes in. Most of you in the AI space probably haven't heard of this before, but what's interesting is you've likely heard of the person who created it. This is Worldcoin, created by Sam Altman. If you don't know who Sam Altman is, he's the guy that created ChatGPT. What Worldcoin is, essentially, is a project that is going to fix digital ID, allegedly. How it works is it's an orb that scans your eye, and that is the only form of ID that you're going to have, and it's going to be tied into this huge, I guess, digital wallet. You can see right here that it's essentially a digital passport that lets you prove you are a unique and real person while remaining anonymous. And you can see right here how this works: they have this
orb, it scans your eye, and then you're in the database, you have your digital ID, and that's how they're going to prevent, or they're working on preventing, this kind of AI revolution where we're not going to know who is who. So let me know if you are excited about Worldcoin, or are you a bit on the other side? Maybe you don't want a digital ID, maybe you think this isn't a big problem. Are programs like Google's SoundStorm and, for example, ElevenLabs, where we do see professional voice cloning, which is going to start very soon in July and is going to be indistinguishable from reality, likely going to be problems in the future? I'd love to know your thoughts in the comments down below, because it is definitely a very interesting debate. As the rapid rise of AI continues, all kinds of different problems that we didn't see before are starting to come to light.
Info
Channel: TheAIGRID
Views: 264,421
Id: mAh1ls2sAKM
Length: 19min 2sec (1142 seconds)
Published: Fri Jun 09 2023