OpenAI's NEW AI "SORA" Just SHOCKED EVERYONE! (Text To Video)

Video Statistics and Information

Captions
What you're looking at is 100% AI generated. This is OpenAI's state-of-the-art new text-to-video model. They state: "Introducing Sora, our text-to-video model. Sora can generate videos up to a minute long while maintaining visual quality and adherence to the user's prompt. Today, Sora is becoming available to red teamers to assess critical areas for harms or risks. We are also granting access to a number of visual artists, designers and filmmakers to gain feedback on how to advance the model to be most helpful for creative professionals." They also state: "We're sharing our research progress early to start working with and getting feedback from people outside of OpenAI, and to give the public a sense of what AI capabilities are on the horizon."

Sora is able to generate complex scenes with multiple characters, specific types of motion, and accurate details of the subject and background. The model understands not only what the user has asked for in the prompt (which you can see in the bottom part of the video), but also how those things exist in the physical world. The model has a deep understanding of language, enabling it to accurately interpret prompts and generate compelling characters that express vibrant emotions. Sora can also create multiple shots within a single generated video that accurately persist characters and a visual style.

On the research techniques side, Sora is a diffusion model, which generates video by starting off with one that looks like static noise and gradually transforming it by removing the noise over many steps. Sora is capable of generating entire videos all at once, or of extending generated videos to make them longer. By giving the model foresight of many frames at a time, they've solved the challenging problem of making sure a subject stays the same even when it goes out of view temporarily. Similar to GPT models, Sora uses a Transformer architecture, unlocking superior scaling performance. They also state: "We represent videos and images as collections of smaller units of data called patches, each of which is akin to a token in GPT. By unifying how we represent data, we can train diffusion Transformers on a wider range of visual data than was possible before, spanning different durations, resolutions and aspect ratios."

Sora builds on past research in DALL-E and GPT models, and uses the recaptioning technique from DALL-E 3, which involves generating highly descriptive captions for the visual training data. As a result, the model is able to follow the user's text instructions in the generated video more faithfully. In addition to being able to generate a video solely from text instructions, the model can take an existing still image and generate a video from it, animating the image's contents with accuracy and attention to small detail. The model can also take an existing video and extend it or fill in missing frames; you can learn more in their technical paper, which is likely going to be linked in the description. Sora serves as a foundation for models that can understand and simulate the real world, a capability they believe will be an important milestone.
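To make those two ideas, denoising from static and patches-as-tokens, a little more concrete, here is a minimal, illustrative sketch in Python. OpenAI has not released Sora's code, so everything here (the patch sizes, the `patchify` helper, the `toy_denoiser` stand-in, the crude update rule) is a hypothetical reconstruction of the general shape of a diffusion Transformer over spacetime patches, not their actual implementation:

```python
import numpy as np

# Toy, hypothetical shapes: none of these sizes come from OpenAI.
T, H, W, C = 16, 32, 32, 3   # frames, height, width, channels
pt, ph, pw = 2, 8, 8         # spacetime patch: 2 frames x 8 x 8 pixels

def patchify(video: np.ndarray) -> np.ndarray:
    """Cut a (T, H, W, C) video into flat spacetime patches.

    Each patch plays the role of a 'token', akin to a token in GPT,
    as described in OpenAI's blog post.
    """
    t, h, w, c = video.shape
    return (video
            .reshape(t // pt, pt, h // ph, ph, w // pw, pw, c)
            .transpose(0, 2, 4, 1, 3, 5, 6)
            .reshape(-1, pt * ph * pw * c))   # (num_tokens, token_dim)

def toy_denoiser(tokens: np.ndarray, step: int) -> np.ndarray:
    """Stand-in for the diffusion Transformer's noise prediction.

    A real model would run the tokens through a large Transformer;
    here we return random 'predicted noise' so the loop is runnable.
    """
    rng = np.random.default_rng(step)
    return rng.normal(size=tokens.shape)

# Start from pure static noise, then remove a little noise per step,
# mirroring the "starts as static, gradually denoised" description.
num_steps = 50
tokens = np.random.default_rng(0).normal(
    size=patchify(np.zeros((T, H, W, C))).shape)
for step in range(num_steps):
    predicted_noise = toy_denoiser(tokens, step)
    tokens = tokens - (1.0 / num_steps) * predicted_noise

print(tokens.shape)  # (128, 384): 128 spacetime tokens of dim 384
```

In a real system the denoiser would be a text-conditioned Transformer and the update rule a proper diffusion sampler; the point is just that the video becomes a sequence of tokens that gets denoised step by step.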
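The recaptioning idea is, at heart, a data pipeline: run a highly descriptive captioning model over the visual training data and train on those dense captions instead of whatever sparse text came with each clip. A minimal sketch, where `caption_model` is a hypothetical stand-in for the captioner (OpenAI attributes the technique to DALL-E 3 but has not published the model):

```python
from dataclasses import dataclass

@dataclass
class TrainingExample:
    video_path: str
    caption: str

def caption_model(video_path: str) -> str:
    """Placeholder for a highly descriptive captioner (hypothetical).

    A real pipeline would run a strong vision-language model here;
    we return a dummy string so the sketch runs end to end.
    """
    return f"A detailed, multi-sentence description of {video_path}."

def recaption_dataset(video_paths: list[str]) -> list[TrainingExample]:
    """Replace sparse original captions with dense synthetic ones."""
    return [TrainingExample(p, caption_model(p)) for p in video_paths]

dataset = recaption_dataset(["clip_0001.mp4", "clip_0002.mp4"])
for ex in dataset:
    print(ex.video_path, "->", ex.caption)
```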
So, ladies and gentlemen, we have it here. This is by far the most shocking thing I've seen in AI since the GPT-4 capabilities reveal. This text-to-video model is genuinely, absolutely incredible based on what we're seeing. I still cannot believe that the clip I'm looking at right now is actually AI generated. It looks like a realistic clip, and not only that, it's so high quality that it looks photorealistic.

I mean, take a look at this example right here. If someone had shown me this video a couple of months ago, or even a couple of days ago, I would have said with complete confidence that it was 100% real. But now we really don't know what's going on, because this is the next evolution in text-to-video. There have been a ton of different competitors and companies racing toward the title of state-of-the-art model, but OpenAI has shown us what is actually possible.

Now, there are so many different examples that OpenAI has included in this blog post, but I'm going to show you some of my favorites, the ones that show why this is a truly advanced model that really does understand what's going on. Take this example of a stop-motion animation of a flower growing out of the windowsill of a suburban house. The reason I like this one so much is that it shows how well the model understands what is going to come next, with true coherence in the final output. It's not like previous models, where we just got a literal text-to-video mapping; we can clearly see that this AI system has some general world model, which essentially means it has a real understanding of physics, of how things move and how they exist in the world.

Now, in addition, they also state that there are several model weaknesses: "The current model has weaknesses. It may struggle with accurately simulating the physics of a complex scene, and may not understand specific instances of cause and effect. For example, a person might take a bite out of a cookie, but afterward, the cookie may not have a bite mark. The model may also confuse spatial details of a prompt, for example, mixing up left and right, and may struggle with precise descriptions of events that take place over time, like following a specific camera trajectory."

Let's take a look at some of these weaknesses. This is one of the first demos of the model's weaknesses; the prompt is "Step-printing scene of a person running, cinematic film shot in 35mm." To the normal eye this does look pretty realistic, but the stated weakness is that Sora sometimes creates physically implausible motion: if you haven't noticed, the person is running backwards on a treadmill, which treadmills don't really do. I guess you could kind of run like that if you made your own treadmill, but that's not what we expect when we see someone running on a treadmill.

There was also this example, and although they list it as a weakness, I still find the video really cool and one of my favorites. The prompt is "Basketball through hoop then explodes," and we can see that the basketball does go through the hoop in what looks like slow motion. The stated weakness is that it's an example of inaccurate physical modeling and unnatural object morphing, which is how the basketball passes through the rim without physically interacting with that part of it.

Then there was this really cool example, prompted with "Archaeologists discover a generic plastic chair in the desert, excavating and dusting it with great care."
And this next one is really funny. You can see that the prompt is rather large: "A grandmother with neatly combed grey hair stands behind a colorful birthday cake with numerous candles at a wood dining room table. The expression is one of pure joy and happiness, with a happy glow in her eye. She leans forward and blows out the candles with a gentle puff; the cake has pink frosting and sprinkles and the candles cease to flicker. The grandmother wears a light blue blouse adorned with floral patterns, and several happy friends and family sitting at the table can be seen celebrating, out of focus. The scene is beautifully captured, cinematic, showing a 3/4 view of the grandmother and the dining room. Warm color tones and soft lighting enhance the mood."

The weakness here isn't evident at first, because this looks really, really good, but they state: "Simulating complex interactions between objects and multiple characters is often challenging for the model, sometimes resulting in humorous generations." If you look at the background, you can see that at the start the woman is clapping, and it looks really good, but over time she starts doing something really weird with her hands, and the other person is waving in a strange way too (at least they do have five fingers, so it's not that crazy), and her eyes just seem to be looking around strangely. Still, I think this "bad" clip from OpenAI is better than some of the clips from state-of-the-art models at other companies.

The reason I love this next clip so much is that if you watch the Dalmatian moving over here, the model clearly understands how the dog's legs move as it jumps from one ledge to another. This is some really, really advanced technology.

What's crazy is that I recently saw a tweet from Sam Altman showcasing exactly what this model is like, so I'm going to show you a few of the tweets he directly replied to, and there is a lot to unpack here. One of them is a prompt from a random person, which is why this is so cool because it's interactive: "A wizard wearing a pointed hat and a blue robe with white stars, casting a spell that shoots lightning from his hand and holding an old tome in his other hand." In full screen this looks really, really good, and at the end you can see he actually casts the spell and there's some kind of orange effect going on, which is fascinating, because we really want to understand how these models are able to do what they do.

There was also this one: "A half duck, half dragon flies through a beautiful sunset with a hamster dressed in adventure gear on its back." This one is rather accurate, and it looks like some kind of 3D animation from a kids' show. If you put it in full screen, you can see that they clearly have access to a model that is really responsive and rapid in its development.
And the reason I like the fact that Sam Altman is responding to these tweets with the actual video outputs is that he's showcasing that the model they've built is actually really good, and that they haven't cherry-picked any of the videos output from their system. Him going through all these tweets, responding to people, and showcasing the model's capabilities shows us that this is likely a very capable text-to-video model.

Now, this one is one of my favorites: "A street-level tour through a futuristic city which is in harmony with nature and simultaneously cyberpunk/high-tech." This looks pretty crazy. If I saw this as the background of something, I would say there's no way this isn't CGI or some kind of 3D render that someone made in a 3D application, but this is literally text-to-video, and it looks so realistic, with the humans walking around; it's just absolutely incredible. And that was literally someone's prompt sent to Sam Altman. What was funny was that, with the earlier prompt we just saw, we got an updated one from Sam Altman, where he said, "look, an even better one." He prompted it again to see what the model would do, and you can see a hamster with adventure gear on, chilling on this half duck, half dragon type of creature.

There was also this futuristic drone race at sunset on the planet Mars, and this is pretty crazy because it looks incredible. I think the reason this is so interesting is that when we get an image from a text-to-image model, it's just a snapshot of what could be some kind of story. The fascinating thing about these videos is that they sometimes have their own storylines: you can see that these drones are racing around beams that seem to be the guardrails marking where they should be racing, and I think these mini stories we get are part of how we come to understand how crazy these models are in terms of their capabilities.

This video right here shows you how good OpenAI's image modeling is, because I'm pretty sure that whatever model they're using, they definitely have an upgraded version of DALL-E that is purely focused on photorealism. Some people were saying that OpenAI wants to use their photorealism work for text-to-video, and it seems like that's really true, because this prompt right here is just absolutely incredible. If someone showed me this, I would ask why on Earth they put those two dogs there with a microphone and headphones just for a picture, but it's a rather funny thing that you'll all enjoy.

And if you take a look at this example that Sam Altman just tweeted out, it's crazy: we can see that the woman actually grabs a spoon, starts stirring some of the stuff in the bowl, and then, crazily enough, picks up the bowl with that stuff in it. The consistency of these videos is absolutely incredible, nothing like what we've seen before in terms of how good the final output looks, because there is a clear understanding of physics and how things interact with one another.

There was also this example, tweeted out by a researcher at OpenAI, Bill Peebles. He tweeted: "welcome to bling zoo! this is a single video generated by Sora, shot changes and all." What's crazy about this model is that it's able not only to generate realistic text-to-video, but also something like a small movie.
You can see here that all of these shots, the way they pan, the way they move, the way they change, are very, very cinematic, and this seems really advanced, because it's not just generating short clips; it's generating them and telling a story. Being able to generate this entire thing with shot changes is absolutely telling in terms of what this model is truly capable of.

Now, there are some other videos I do want to show you. For example, this one right here is absolutely incredible, and it shows the kind of stories we get when these things are left to run for longer durations. The panning around and the scenery are absolutely incredible. I'm honestly struggling to put into words what I'm looking at, because the fact that this is AI generated genuinely blows my mind. All I can say is that the people who work at OpenAI are truly ahead of where anyone thought they were, especially with their video models.

There was also this example, with the prompt "Historical footage of California during the gold rush," and it looks absolutely incredible. A big problem many AI-generated videos have had is that they're not high quality: when they're essentially 1080p footage and there are small errors, you really see them. But in these old Western-style videos, the minor mistakes you usually get simply aren't visible.

Now, there are some other crazy examples I need to show you, and if there's one example you want to see, it's this one: "Reflections in the window of a train traveling through the Tokyo suburbs." Ladies and gentlemen, you have to give OpenAI credit here, because the video can literally pan across, and when the reflection comes up, the model understands the lighting and how all of that works. I genuinely have no words. OpenAI telling me that this video on my screen right now is AI generated, I genuinely struggle to believe. I'm going to have to wait until the model itself comes out, because part of me doesn't want to believe it; I truly cannot tell that this is AI generated, even as someone who looks at tons and tons of research papers and AI videos. This one is by far the most realistic. Some of the other ones, like this one, you could potentially tell are maybe 3D animated, and even this one looks 3D animated rather than AI generated, but some of them just don't make sense.

One of the capabilities I wanted to show off is this one: "A beautiful silhouette animation shows a wolf howling at the moon, feeling lonely until it finds its pack." You can see there's a kind of animation where the story goes on, so I think this model is advanced at generating single-shot videos, essentially mini movies, panning around to showcase what else is in the environment while sticking to the entire storyline.
The shot changes you're seeing here are truly incredible. At the end you'll see that this wolf does manage to find its pack, and it makes me wonder about the future: since you can have an entire storyline with this, are we going to get music tracks with it? Voiceovers? Are we going to get ChatGPT and GPT-4 creating entire movies that are just AI generated for us? Imagine you're bored, you decide you want a movie about this and about that, you type that into an AI system, and it outputs that entire movie for you. We could potentially be moving toward an era of 100% AI-generated content that is literally user-specific. Of course, that is a long way off, but the level of quality I'm seeing in these clips is something I thought we wouldn't see for at least another year or two. Some people predicted this was going to be the year of Hollywood-level AI movies, and I think those predictions were really correct.

Take a look at this: the model really does understand exactly how snow interacts with fur. And do you know the craziest thing about all of this? When you've looked at the back end of CGI and visual effects studios, at the amount of computation they have to do for things like water and fluid simulation, you have to understand that it takes hours, days, sometimes even months to render a few seconds of these advanced shots. That's why this is so crazy: the AI system is able to do it with a level of accuracy we really didn't think was possible.

Now I want to get into some of the comments, because one from a senior AI researcher at NVIDIA, Jim Fan, touches on this. He stated: "If you think OpenAI Sora is a creative toy like DALL-E, think again. Sora is a data-driven physics engine. It is a simulation of many worlds, real or fantastical. The simulator learns intricate rendering, intuitive physics, long-horizon reasoning, and semantic grounding, all by some denoising and gradient maths. I won't be surprised if Sora is trained on lots of synthetic data using Unreal Engine 5. It has to be." He also points out the fluid dynamics of the coffee, even the foams that form around the ships. Fluid simulation, like I just talked about, is an entire subfield of computer graphics that traditionally requires very complex algorithms and equations, and of course, there's the photorealism.
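To give a sense of why fluid simulation is its own subfield: even the simplest classical solvers, in the style of Jos Stam's "stable fluids", iterate a relaxation step like the one below over every grid cell, every frame, before advection and pressure projection are even considered. This is purely illustrative of the traditional graphics approach, not anything Sora is known to do internally:

```python
import numpy as np

def diffuse(density: np.ndarray, diff: float, dt: float,
            iters: int = 20) -> np.ndarray:
    """One diffusion step of a Stam-style grid fluid solver.

    Jacobi-style relaxation toward the implicit solution. Production
    solvers add advection, pressure projection, and boundary handling,
    and run at far higher resolution in 3D, which is why film-quality
    water can take hours or days per shot to compute.
    """
    a = dt * diff * density.shape[0] * density.shape[1]
    result = density.copy()
    for _ in range(iters):
        result[1:-1, 1:-1] = (
            density[1:-1, 1:-1]
            + a * (result[:-2, 1:-1] + result[2:, 1:-1]
                   + result[1:-1, :-2] + result[1:-1, 2:])
        ) / (1 + 4 * a)
    return result

grid = np.zeros((64, 64))
grid[32, 32] = 100.0               # a drop of dye in the middle
grid = diffuse(grid, diff=0.0001, dt=0.1)
print(grid[30:35, 30:35].round(3))  # dye has spread to neighbors
```

Scale that to millions of cells in 3D, plus advection and pressure solves, and the hours-per-frame render times mentioned above start to make sense.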
So, with what OpenAI has released here, I think this is genuinely incredible. This is something I'm going to keep watching again and again, trying to understand exactly how on Earth this thing works, because I really want to see more videos from it, and I really want to see what OpenAI has to say about it. They've already said a decent amount, but this is truly going to change everything if it goes mainstream, and with OpenAI we know they're working on a ton of different things. For all of these clips, you can take a look at the prompts at the bottom of the screen if you want to know what they were prompted with. I think this is about to shake up entire industries, because this is something we didn't see coming.

One of my favorite things from this was this one right here, where the prompt is "a movie trailer featuring the adventures of a 30-year-old spaceman wearing a red wool-knitted motorcycle helmet…" Genuinely, guys, we are going to get to a level in the future where we can probably generate entire movies with a single push of a button. The crazy thing people are missing about this clip is the consistency. Character consistency is something we don't usually get, because it's quite hard to achieve due to the randomness of generative models, but it's clearly no problem for this model.

And of course, this one right here was the debut video that a lot of people saw on Twitter, so I'm glad I got to showcase a lot of these videos, because it's just absolutely incredible what we're seeing here. Like I said before, I'm still completely mind-blown. This is very captivating technology, and again, very, very surprising. Do you know why it's so surprising? Because DALL-E 3 was really good, and of course there were other models that were better than it, so when OpenAI came out with this model, it was something I don't think anybody expected, and something we're likely to see even more advances on in the future. With that being said, if you did enjoy this, I'll see you on the next one.
Info
Channel: TheAIGRID
Views: 169,565
Id: agTJpLS7cjY
Length: 21min 56sec (1316 seconds)
Published: Thu Feb 15 2024