10 Things About OpenAI SORA You Probably Missed

Captions
Sora, the video generator by OpenAI, released on February 15th, 2024, and I've spent pretty much every hour of my life since scouring the internet and researching what else this could do. There are actually a lot of things that weren't obvious in the middle of all the hype that accompanied the release of this AI video generator. I studied the technical report in detail, watched all the YouTube videos, and spent an unhealthy amount of time on Twitter looking for all the discussions and the little findings people had. Matter of fact, since release I didn't even leave the apartment.

If we haven't met yet, I'm Igor. I made it my full-time calling to research what AI has to offer and how to put it to work in your everyday life. Before doing that with The AI Advantage, I had a video production company that operated for eight years in Central Europe. I helped clients with everything from corporate video trainings to directing smaller commercials and even shooting festival and nightclub videos. When it comes to videography, I've really seen it all, and this stuff is exactly in the middle between technology and video production, so I can't wait to dive into all of this. All right, so without further ado, let's look at all the implications of Sora that you might not have been aware of right away.

Okay, so first of all I want to talk about audio, because Sora only generates video, right? All the examples we saw were muted, without music or sound effects in the background, and a lot of people rightfully pointed out that, hey, in film it's really 50/50: at the very least it's 50% visuals and another 50% audio. And there are many layers to that. You might have the actor's voice as one track, but then there are also sound effects of things happening around them, and then you have foley and ambience, the background sound that just persists; you're not really consciously aware of it, but it's there, and if it's not there, the shot is missing something. So surely audio must be a complicated issue too, right? Well, not really, because ElevenLabs actually reacted to the Sora release and announced a new sound generator that can generate an entire soundscape from text prompts. Okay, so today we don't have access, but if OpenAI hooked up Sora to this audio generator, you would have an audiovisual generator where you create full soundscapes. Have a quick listen. And sure, a sound designer could do this manually, but again, if you're a one-man show and you're producing a commercial, like I did so many times, you're doing everything yourself: from planning to recording, editing, doing the sound design, doing the color grading, doing feedback rounds with the client, invoicing, and oftentimes you don't have budget for a sound designer. So you bet there are going to be models, I don't know if Sora or others, that combine both; they're going to give you audiovisual outputs. This is not a question; that's just a straight fact at this point. And with tools like Suno AI out there already that can generate full songs, including lyrics, at a decent quality with AI, well, you're going to be able to generate the background music, the background sound effects, and the voices that are in the scene, because voice generators are a thing and they're virtually indistinguishable already, right? And now the video component. So we really have the full stack for audiovisual production. It's just a question of time now, and from my estimate it looks to be months, not years, till we get there.
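The ElevenLabs sound generator wasn't publicly accessible at recording time, so here's only a sketch of what prompt-to-soundscape could look like as an API call. The endpoint path, header, and response handling are assumptions modeled loosely on ElevenLabs' existing API conventions, not a documented interface:

```python
# Hypothetical sketch only: prompt-to-soundscape over a REST API.
# The endpoint, payload, and header are assumptions modeled on
# ElevenLabs' API conventions; the sound generator had no public
# API when this video was recorded.
import requests

API_KEY = "YOUR_ELEVENLABS_KEY"  # placeholder

resp = requests.post(
    "https://api.elevenlabs.io/v1/sound-generation",  # assumed endpoint
    headers={"xi-api-key": API_KEY},
    json={"text": "Busy Tokyo street at night: rain, neon hum, distant traffic"},
)
resp.raise_for_status()

# Assume the response body is raw audio bytes, as with ElevenLabs TTS.
with open("soundscape.mp3", "wb") as f:
    f.write(resp.content)
```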
Okay, my next point is all about the capabilities of Sora that are actually brand new, because a lot of the stuff we saw just drastically reduces the cost of producing a clip like this, or an animated video like this. You might be aware that movies like this exist, right? It just costs a lot of money to produce them. So first of all, let's talk about the things that are actually brand new and not just a cost reduction, although that has its implications too, and we'll talk about that.

The things that are actually new are, first of all: you can extend videos. This is beautifully outlined in the technical paper, and it shows the example of a San Francisco subway car. As you can see, this clip is the same in all three instances, but if you back up a little bit, they extended the beginning of it. As you can see, the video generated by Sora is different every single time, and it seamlessly transitions into the subway car. This is something that was not possible up until now; it generates this video from scratch. Now, I guess you could argue that you could recreate this entire scene in 3D and then create the frames before that and seamlessly transition into it, but you have to realize that at a certain point this is going to become a feature in every editing software, right? You'll have just an image, and it will turn it into a video, and then you can extend it to any duration; you can add a clip before, add a clip after. You'll be able to turn your old family photos into vivid memories, sort of. That is really scary, but it's going to be a thing, and you bet apps like Instagram, at one point, I don't know when, are going to have a feature where you can turn a photo into video and then extend it indefinitely.

Another new capability is that you're going to be able to loop videos. This is also something that you could kind of, but not really, achieve today, definitely not in this form. You'll give it a video clip, and it will generate extra frames that seamlessly let the footage loop. I had a good chat with a friend, and we talked about how this could be the new Rickrolling on the internet, because if you do this to a longer clip, you just don't realize that it's looping and that it's just playing forever. You could send somebody a clip, and it might take them minutes to realize that the whole thing is looping and repeating over and over again. Anyway, this is something that was not really possible, though some people went ahead and tried it anyway. In videography there was this whole trend a few years back where people were trying to seamlessly transition one thing into another, like, for example: and my shirt is gone, magic. Now, those are the simplest ways to do it; traditionally you'd fake a loop by crossfading the end of a clip into its beginning (see the sketch below), but here we'll have the capability of generating brand new frames, and things will be able to loop indefinitely. So those are the new features you can expect in editing software somewhere down the line.
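For contrast, here's roughly what the old-school fake loop looks like: a minimal sketch using moviepy that crossfades the first second of a clip over its last second, so the cut point hides inside a dissolve. File names and the fade length are placeholders; a generative model would instead synthesize genuinely new in-between frames:

```python
# Minimal sketch of the classic "fake" seamless loop: crossfade the
# head of the clip over its tail so the loop point hides in a dissolve.
# File names and the fade duration are placeholders.
from moviepy.editor import VideoFileClip, CompositeVideoClip

clip = VideoFileClip("shot.mp4")
fade = 1.0  # seconds of overlap

# Overlay the first `fade` seconds on top of the last `fade` seconds,
# fading it in, then trim so the output ends on the same frame it starts on.
head = clip.subclip(0, fade).crossfadein(fade).set_start(clip.duration - fade)
loop = CompositeVideoClip([clip, head]).subclip(fade, clip.duration)

loop.write_videofile("seamless_loop.mp4")
```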
But then there are a lot of capabilities that are just a simple cost reduction, and this is why people refer to it as the death of Hollywood in many cases. I don't know if that's an accurate assessment; in my opinion, they're going to use this tech to their advantage, lower the cost of production, and pump out even more content. We'll also talk about that soon, but let's finish up this segment and talk about the things that were already available, but now at something like a 10,000x cost reduction. For that calculation, I assume a subscription price somewhere around the GPT Plus plan.

So what's going to be possible at this super low cost is, first of all, generating images. We're able to do that with other image generators already, right? Sure, these are hyperrealistic and very high quality, just like Midjourney and so on. But then its capability to turn images into videos, that is very, very big in my opinion, because it's going to make it so easy to craft compelling videos. I feel like most people that talk about this don't appreciate how much it's going to lower the barrier to entry for videography, and high-quality videography at that, because you're going to get access to things like this. So even if you've seen this before, I think I have a bit of a different perspective here. Look: here on the left you have the drone shot, here on the right you have this butterfly, and here in the middle you have the mix of the two, where the drone is flying through something like the Colosseum and then morphs into a butterfly. And look, I could do this today. It just takes about 3 to 5 hours of work, depending on your skill level. You go into After Effects and you rotoscope out this butterfly, meaning you go frame by frame, that's 25 frames every single second, and you animate a mask exactly in the shape of the butterfly's wings, and you redo that for every movement. Now, yes, there are tools that help you, but a lot of the time you're stuck with manual labor there, so it might just turn out that the 3-to-5-hour task turns into 15 or 20 hours. And then you can bring the butterfly in here and morph it into the drone shot with something like a morph cut inside of Premiere Pro. If none of that means anything to you, that's fine; I'm just saying hours of work are going to be done like this (there's a rough sketch of automating that rotoscope below). And this is just one simple example; for many others, a one-man crew could never do this. All these animation-related examples where they turn an image into an animation like this are usually just not feasible for a one-man show. It takes too much time to animate all the little things. You might be able to do it for a few shots, but if you do a whole one-minute trailer, you'll find that you spend two weeks at the computer if you really animate all the little details like in this shot, and you have a lot of different shots. So that's my second point: it lowered the bar by a factor that is larger than most people realize. I don't know if it's 1,000x or 10,000x, but a lot of these things were unthinkable for small crews or one-man shows, and now they will be doable. Like, for example: before, after.
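To make the rotoscoping pain concrete: a 60-second clip at 25 fps is 1,500 frames of masks. Below is a minimal sketch of how you might automate a rough cut-out today with an off-the-shelf segmentation library (rembg); file names are placeholders, and a production rotoscope would still need temporal smoothing and manual cleanup on top of this:

```python
# Minimal sketch: automate a rough rotoscope with rembg's
# general-purpose segmentation model. File names are placeholders;
# real rotoscoping would still need temporal smoothing and cleanup.
import os

import cv2
from rembg import remove

os.makedirs("masks", exist_ok=True)
cap = cv2.VideoCapture("butterfly.mp4")
frame_idx = 0
while True:
    ok, frame = cap.read()
    if not ok:
        break
    # OpenCV reads BGR; rembg expects RGB. remove() returns the frame
    # with an alpha channel where the background has been masked out.
    rgb = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)
    cutout = remove(rgb)
    bgra = cv2.cvtColor(cutout, cv2.COLOR_RGBA2BGRA)
    cv2.imwrite(f"masks/butterfly_{frame_idx:05d}.png", bgra)
    frame_idx += 1
cap.release()
```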
Okay, so this point is all about the editability of the video. Over on Twitter, Owen Fern criticized the fact that, hey, yes, these generations are absolutely incredible, but what if the client has feedback? And this is very appropriate criticism in my opinion, because clients always have feedback, and if you're going to use this for a job, if this is supposed to be the death of Hollywood, then just between directors and producers there is so much feedback going on in the post-production of any advertisement or movie. Heck, even if it's an event video; I had clients that went back and forth ten times and gave feedback over and over again, and I had to adjust things. So he points out that there are going to be a lot of little details that will need to be changed about these scenes, and with Sora you're not really able to go back and change little details, right? You're going to have to regenerate the whole scene. Maybe you like the character here, but you just don't like the fact that this is not a thumb, it just looks like a fifth finger, and you'd like to give it the look of a thumb. Can we do that? His point is that the answer has to be no, and then you have a dissatisfied client. Which is a very fair point, but as I've been following this very closely over the last months, there's one tool and one piece of research that need to be pointed out here.

First things first: Runway ML, the previous, so to say, leader in AI video, introduced a feature a few weeks ago called the multi motion brush, which lets you use multiple brushes on the video to animate specific parts. Now, that is for animation, but over in Midjourney and many other image generators you're able to do something called inpainting, where you paint in a little part of the image and then edit just that; you can reprompt it. So on images today, you could actually go in, paint over this thumb, and say: regenerate the thumb (there's a sketch below of what that looks like with open tooling). Why would that not be possible on video? Eventually it will be. And further than that, ByteDance, the creator of TikTok, actually published a research paper less than a week ago about this so-called Boximator. I didn't cover it on the channel, because I like to cover things that are available today or truly revolutionary, and this falls in the in-between zone of "really interesting, but not available", and in my eyes probably not worth a dedicated video. But look, the whole point of it is that you draw different boxes in the scene, and thereby you can control the scene in great detail. So if you select the balloon and say it's going to fly away in this direction, and then you select the girl and she's going to run in a different direction, exactly that is going to happen. So between tools like Boximator and inpainting in Midjourney, it's just a question of time until you'll be able to use a mix of these tools and inpaint on top of AI video too. Now, sure, there's going to be a temporal axis there, because on images you only have the x and y axes, while in video there's also the time axis, and sometimes you even have movement in z-space, but between this research and inpainting, I can totally see that happening for AI video down the line.

Plus, as we know from prompt engineering with today's language-based models, there's a lot of control available in the text prompt; you just have to be really detailed. If you look at a lot of these prompts, they're good, but they're not as detailed as they could be. Some of the best Stable Diffusion prompting is extremely detailed. Also, in Midjourney and Stable Diffusion, if you keep your prompts relatively simple, you're going to get varied results, whereas with a detailed prompt, even when you roll the dice and create a new scene, it's going to be very similar. Plus, let's refer back to Midjourney again: they just recently announced a new character tool that maintains character consistency based on a character that you pick in the tool. So all of these AI image features that we've been talking about, and that I've been tracking regularly, are going to apply to video tools too. It's just going to take longer, but I absolutely believe that we'll be able to implement all of this little feedback into AI video, and therefore this will actually be production-ready at some point.
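Since Sora itself exposes nothing like this yet, here's a minimal sketch of what inpainting already looks like on still images with open tooling, using the diffusers library's Stable Diffusion inpainting pipeline. The file names and prompt are placeholders; the point is simply that you mask a region and reprompt only that region while the rest of the image is preserved:

```python
# Minimal sketch: still-image inpainting with diffusers.
# Paint a mask over the bad region (white = regenerate) and
# reprompt only that region; the rest of the image stays intact.
import torch
from diffusers import StableDiffusionInpaintPipeline
from PIL import Image

pipe = StableDiffusionInpaintPipeline.from_pretrained(
    "runwayml/stable-diffusion-inpainting",
    torch_dtype=torch.float16,
).to("cuda")

image = Image.open("character.png").convert("RGB")   # placeholder file
mask = Image.open("thumb_mask.png").convert("RGB")   # white where the thumb should be

result = pipe(
    prompt="a natural, anatomically correct human thumb",
    image=image,
    mask_image=mask,
).images[0]
result.save("character_fixed.png")
```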
Okay, so my next point is one I didn't expect at the beginning: you can prompt stories into existence from a single prompt. Here's an example from Bill Peebles from the OpenAI team, and he generated an entire story: two dogs should walk through NYC, then a taxi should stop to let the dogs pass at a crosswalk, then they should walk past the pretzel and hot dog stands, and finally they should end up at the Broadway signs. And if you follow this channel, you might know how much context you can add to text prompts to achieve exceptionally accurate results from things like ChatGPT. If you added way more details here, I believe they would be reflected in the output, and then the story can develop.
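Purely as an illustration (this is my own hypothetical wording, not one of OpenAI's published prompts), here's what "way more detail" could look like, layering shot, lens, lighting, and timing cues into the same story, the way detailed image prompts already do:

```python
# Hypothetical, more detailed rewrite of the dog-story prompt.
# None of this wording comes from OpenAI; it just illustrates how
# shot-level detail could be layered into a single story prompt.
prompt = (
    "Two golden retrievers walk through Manhattan at golden hour, "
    "handheld tracking shot at dog's-eye level, 35mm lens, shallow "
    "depth of field. Beat 1: a yellow taxi brakes at a crosswalk to "
    "let them pass. Beat 2: they trot past a steaming pretzel and "
    "hot dog stand, vendor in the background. Beat 3: they stop "
    "beneath the glowing Broadway marquees as night falls."
)
```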
And since right now you already have tools that can manipulate someone's mouth so they appear to speak another language naturally, that will be possible here too. So you'll be able to create these long shots like they have in movies, which are incredibly difficult to achieve. I mean, some movies, like Dunkirk, took it so far that they made the film feel like a single take; it all flows seamlessly. And Sora is able to do it too, which I didn't expect at the beginning; they also didn't share this example right off the bat. I think this is actually very, very impressive, and if we're already able to generate stories from a single simple text prompt, it's just a question of time until we arrive at something like this, where you type in a prompt and get a full movie back, or a full show. I mean, at some point it's just a question of having enough GPUs. This is obviously just a mockup, but it's something to think about, especially because this is the worst this tech is ever going to be.

And you know what, let's talk about that point; it's actually my next one: where are we in the timeline of all this? It was really helpful to look into some of the discussions happening online to orient myself in terms of where we actually are today. Emad Mostaque from Stability AI had a fantastic take here: he compared this to the GPT-3 of video models. If you didn't know, GPT-3 was the predecessor to ChatGPT. It was available before, but the interface was not as intuitive, and you actually had to prompt it differently than ChatGPT, which was trained with reinforcement learning from human feedback, meaning a lot of humans rated the outputs to make it more user-friendly. And that's where this is at right now. It's not at the ChatGPT point yet, where it becomes really easy to use and gains mass popularity; and after that came GPT-4 with all the additional features, and it's just crazy capable now. He even said that all the image generators like Stable Diffusion were more comparable to GPT-2, where the quality of the output was not nearly as good as GPT-3's. So, mapping this onto large language models, that puts us somewhere in the middle of 2022, because the ChatGPT, GPT-4, Llama, and Mistral equivalents will come over the next few years, or sooner, at the pace we're moving.

And on this topic, there's another fantastic thread, by Nick St. Pierre here on X: he ran the exact prompts from Sora's examples through Midjourney and then paired them with the results, and the thing is, they're shockingly similar. People are already joking that, hey, is Midjourney just OpenAI in disguise? Probably they're just using very similar training data. But look at that: all of these examples are very similar. Now, I'm sure these are the ones that were the most similar, to create this illusion of it essentially being the same model; if you look closer, the beaver is very different. But the point is, these are not night and day. Sure, these helmets are completely different, but the cinematic look is very similar, with slightly different color grading down here, fair. The point I'm trying to make is that we literally skipped two to three years ahead in AI video, because what we had up until now was something like GPT-1 or GPT-2, and now we've got the GPT-3: actually usable, and able to create useful outputs that are essentially hyperrealistic. But we're not even at the ChatGPT moment yet, where you get editability and things like the audio generation we talked about; that is all yet to come. But again, at this pace of development, we should probably be thinking in days and weeks and maybe months, not years or decades. I guess that poses the question: at which point in this development do we reach the Matrix? I don't know the answer. I'm turning 30 next month, and it does feel like it will happen in this lifetime, or something akin to that. Who knows. Moving on.

Okay, so my next point goes back to my original video, where I stated that this is going to be the death of stock footage. I've been selling it myself for almost a decade, and there's just no way people are going to keep paying $50 or $100 per clip if they can generate them for a few cents. I think that one is obvious, but beyond that, it really got me thinking about what this means for video creation, especially for smaller crews and one-man shows. Well, you're going to be able to generate entire video libraries for yourself. Hear me out. Right now, if you have a video, let's say this is the A-roll, the main story of the video: me talking, presenting all my findings to you. And then on top of that we have something we refer to as B-roll: the clips that are there to add an additional layer of information; they add visual interest, keep you more engaged, and really let us get the most out of this audiovisual medium. Right at this very moment you're consuming both audio and video at the same time, so we're trying to make the most of all these layers. I do my best to keep my speech and presentation concise, because I value your time, and then in the editing we do our best to add as much information on top. Right now, that is done with B-roll, so we pay for various stock libraries where we get these shots that enhance our videos, and we also pay for various music libraries to add the right type of music to enhance the atmosphere of the video. But with models like Sora, this will really change the game, because you'll be able to generate an entire library for yourself, for that specific project, because the cost goes down so much. You'll be able to prompt things into existence that beforehand you would have had to research, download, and compile, and usually they don't even match, so you have to do color correction and color grading on top of them. And here, as you can see, from a single text prompt we got five video frames, and all of these can be upscaled with something like Topaz Video AI. That tool is paid, it costs a few hundred dollars, but you can upscale 1080p clips to 4K with AI really effectively. Here, though, you'll just be able to prompt them, and again, looking over at all the AI imaging tools: all the features we see in the imaging tools are going to be available in the video tools. Something like a one-click upscale to 4K quality is going to be there; "can you regenerate this" or "can you generate four more just like this" is going to be there. You can think of the whole Midjourney interface in Discord as something you'll be able to do with these videos: upscale, reroll, more like this, use a different version of the model. And after a few minutes you'll have a whole library of B-roll that can enhance your video.
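As a rough reference point for the non-AI baseline: a plain resampling upscale from 1080p to 4K is a one-liner with ffmpeg (sketch below; file names are placeholders). Tools like Topaz instead use trained models to reconstruct detail that plain resampling cannot recover:

```python
# Baseline, non-AI upscale from 1080p to 4K with plain Lanczos
# resampling via ffmpeg. File names are placeholders; AI upscalers
# like Topaz reconstruct detail this kind of resampling cannot.
import subprocess

subprocess.run(
    [
        "ffmpeg", "-i", "broll_1080p.mp4",
        "-vf", "scale=3840:2160:flags=lanczos",
        "-c:a", "copy",
        "broll_4k.mp4",
    ],
    check=True,
)
```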
Now, I as a video creator can't wait for this. I know that eventually the end point of all of this is the technology really replacing a lot of content, and who knows if I'll still be sitting here presenting the news to you if an AI can do it in real time, minutes after the release of something, and you can get it exactly in the voice you prefer, while it also respects your context. Right? In this video I kind of have to assume your knowledge level, so at certain points I also have to assume that somebody has never created a video before, while some of you might be experienced directors who know all these concepts and how the industry works. Well, the AI is eventually going to be able to create that exactly for your context. But I digress. The point here is that, at least for the footage, at least for the production of this video, I could have a custom library that enhances all the visuals, and maybe we could be taking a trip through Tokyo while I present these ideas. There will be some point where I can just take my voice, use my digital avatar, let him walk through Tokyo and explain these concepts in a very practical manner, without ever leaving my desk. I don't think, at this point, that's a stretch. A week or two ago it seemed a bit unreal to think of lifelike video; the best we had were animations that were good and talking-head videos that looked okay, convincing for a second or two if you weren't looking for AI. But again, if this is the GPT-3 of AI video, then what are the ChatGPT and the GPT-4 going to look like? That's what I'm already thinking about.

Some of these advanced capabilities are outlined in the technical paper too. It clearly states that you'll be able to create videos in any format, from 1920x1080 to 1080x1920, so, you know, widescreen all the way to phone format. And cropping into cinematic formats from there is easy, right? All you need to do is trim to a wider ratio, which reads on screen as black bars at the top and bottom, and you have all the cinematic formats. So really, there's going to be a lot of variability, and you'll be able to get exactly the B-roll you need for your project.
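For instance, going from 16:9 to a 2.39:1 "scope" look is a single ffmpeg crop (sketch below; file names are placeholders, and 804 is just 1920 divided by 2.39, rounded to an even number):

```python
# Crop a 1920x1080 (16:9) clip to a 2.39:1 cinematic ratio.
# 1920 / 2.39 ≈ 804, so we keep a centered 1920x804 window;
# in a 16:9 player this shows up as black bars top and bottom.
import subprocess

subprocess.run(
    [
        "ffmpeg", "-i", "broll_16x9.mp4",
        "-vf", "crop=1920:804",  # centered by default
        "-c:a", "copy",
        "broll_cinematic.mp4",
    ],
    check=True,
)
```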
And then, eventually, AI is going to be writing the scripts and editing the video itself, according to all the other videos it saw and how they were edited. I mean, that might take a lot of time, and we do so much manual work on these videos that there will always be a style, an expression, a handwriting to the post-production of a video, I think. But it's crazy to see: a week ago, if you wanted a library of B-roll for a specific video, well, you had to go out and shoot it in the real world, or you had to purchase stock footage, and then it was scattered all over the place. Here you'll get the best of both worlds: great B-roll, all from the same scene, and it will cost virtually nothing. Or if you have some B-roll you already use, you'll be able to extend it, or maybe you have some phone pictures and you'll turn those into B-roll. It's really a whole new world for video production; I can't overstate that.

But it doesn't end there, and this brings me to my last point, which is 3D and world generation, because in the technical paper they actually refer to Sora as a world simulator. I think that's a big claim, but it's also a justified one, because if you take some of these clips at face value, it's incredible. It's temporally consistent: these houses are not warping, you're moving through the scene like a drone would, and you have these people on their horses going about their daily business. It's incredible. But what you have to realize is that, beyond that, you can apply this in something like Gaussian splatting, which, simply put, is a technique that creates a so-called Gaussian splat, a 3D representation of the video. In even simpler terms: it turns a video into a 3D model, and this is what it looks like in practice. Now look, this is a simple video that wasn't even intended for this purpose, but you could easily imagine a drone shot where the drone parallaxes around the subject and captures it from all angles, and then you can create 3D objects of something that doesn't even exist. Right here, Manov Vision took exactly this drone clip, recreated it as a Gaussian splat, and then brought it into Unity, a real-time game engine, where you can animate the camera, insert characters, and do all sorts of things. The important fact here is that Sora doesn't have to do everything from A to Z. You can still have a human write the script; you can still have a human in front of a green screen acting it out; you can have your favorite actors in these scenes. But it's going to be so much cheaper to produce, because you'll just generate whole environments like this, and everything will be shot in front of a green screen, until AI perfectly synthesizes the actors' voices, which, if you follow this channel, you know it already can. And then the last missing piece is really the human part: character consistency and the ability to edit little details so the result aligns with the vision of everybody involved in the movie's creation.

And if you take that thought experiment a step further, you end up in Minecraft, because in the technical paper you can see clips that were not recorded from within Minecraft; they were generated by Sora, simply by including the word "Minecraft" in the prompt. It saw so much Minecraft footage that it was able to recreate Minecraft. And if it can do that with Minecraft now, how long until it can do it with all of this world? I don't know, but I'm scared and excited at the same time. One thing is for sure: I want to stay on top of all of this, and I'm going to keep my eye on it. If you want to follow along for the ride, subscribe to this channel and to our weekly newsletter; it's completely free and keeps you up to date once a week with all the revolutionary breakthroughs. And that's really all I've got for today, except: if you want to try out Sora, there is actually a very, very limited demo on this page. If you haven't tried it yet, I recommend it, because it's the closest you can get to playing with it. It's this little interface where you can change these variables, so you can go from an old man to an adorable kangaroo, and then there are a few more variables you can change out here, okay, Antarctica. For now, this is the closest we get to playing with this thing, so I hope you enjoyed this. Let me know which of these was new or interesting to you, and if you have even more findings I might not have considered yet, leave those below too. And if you haven't seen the original video about the announcement and all the clips they presented, that's over here. All right, I can't wait to see how this develops and what the competition comes up with. This is a whole new world, and I'm here for it. See you soon.
Info
Channel: The AI Advantage
Views: 14,600
Keywords: theaiadvantage, aiadvantage, chatgpt, ai, chatbot, advantage, artificial intelligence, gpt-4, openai, ai advantage, igor
Id: 2sFw9cEaOHg
Length: 23min 17sec (1397 seconds)
Published: Thu Feb 22 2024