Sora - Full Analysis (with new details)

Video Statistics and Information

Captions
Sora, the text-to-video model from OpenAI, is here, and it appears to be exciting people and worrying them in equal measure. There is something visceral about actually seeing the rate of progress in AI that hits differently from leaderboards or benchmarks. In just the last 18 hours, the technical report for Sora has come out, and more demos and details have been released. I'm going to try to unpack what Sora is, what it means, and what comes next.

Before getting into any details, though, we just have to admit that some of the demos are frankly astonishing. This one, a tour of an art gallery, is jaw-dropping to me. But that doesn't mean we have to get completely carried away with OpenAI's marketing material claiming that the model understands what the user asks for and understands how those things exist in the physical world. I don't even think the authors of Sora would have signed off on that statement, and I know it might seem I'm being pedantic, but these kinds of edge-case failures are what have held back self-driving for a decade. Yes, Sora has been trained at an immense scale, but I wouldn't say that it understands the world. It has derived billions and trillions of patterns from the world but can't yet reason about those patterns, hence anomalies like the video you can see. Later on in the release notes, OpenAI says this: the current model has weaknesses. It may struggle with accurately simulating the physics of a complex scene; it doesn't quite get cause and effect; it also mixes up left and right, and objects appear spontaneously and disappear for no reason. It's a bit like GPT-4 in that it's breathtaking and intelligent, but if you probe a bit too closely, things fall apart a little. To be clear, I am stunned by Sora just as much as everyone else; I just want it to be put in a little bit of context. That being said, if and when models crack reasoning itself, I will try to be among the first to let you know.

Time for more details. Sora can generate videos up to a full minute long, up to 1080p. It was trained on, and can output, different aspect ratios and resolutions. And speaking of high resolution, this demo was amongst the most shocking; it is incredible, just look at the consistent reflections. In terms of how they made it, they say model and implementation details are not included in this report, but later on they give hints in the papers they cite in the appendices. Almost all of them, funnily enough, come from Google: we have Vision Transformers, adaptable-aspect-ratio-and-resolution Vision Transformers (also from Google DeepMind, and we saw that being implemented with Sora), and many other papers from Facebook and Google were cited. That even led one Google DeepMinder to jokingly say: "You're welcome, OpenAI. I'll share my home address in DM if you want to send us flowers and chocolate."

By the way, my 30-second summary of how it's done would be this. Just think to yourself about the task of predicting the next word: it's easy to imagine how you test yourself; you'd cover the next word, make a prediction, and check. But how would you do that for images or frames of a video? If all you did was cover the entire image, it would be pretty impossible to guess, say, a video frame of a monkey playing chess. So how would you bridge that gap? Well, as you can see below, how about adding some noise, a little bit of cloudiness, to the image? You can still see most of the image, but now you have to infer patches here and there, with, say, a text caption to help you out. That's more manageable, right? And now it's just a matter of scale: scale up the number of images, or frames of images from a video, that you train on. Ultimately you could go from a highly descriptive text caption to the full image from scratch, especially if the captions are particularly descriptive, as they are for Sora.
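To make that summary concrete, here is a minimal toy sketch of the noise-then-predict training idea in PyTorch. Sora's actual architecture is not public; the report only describes a diffusion model over "spacetime patches," so the denoiser below, the tensor shapes, the patch sizes, and the noising scheme are all illustrative assumptions, not OpenAI's code.

```python
import torch
import torch.nn as nn

# Toy stand-in for the denoising network. Sora reportedly uses a
# diffusion transformer over "spacetime patches"; this small MLP is
# only a placeholder so the training step below runs end to end.
class ToyDenoiser(nn.Module):
    def __init__(self, patch_dim, text_dim):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(patch_dim + text_dim + 1, 256),
            nn.ReLU(),
            nn.Linear(256, patch_dim),
        )

    def forward(self, noisy_patches, text_emb, t):
        # Broadcast the caption embedding and timestep to every patch.
        n = noisy_patches.shape[0]
        cond = torch.cat(
            [noisy_patches, text_emb.expand(n, -1), t.expand(n, 1)], dim=-1
        )
        return self.net(cond)  # predict the noise that was added

# A video clip: (frames, channels, height, width) -- shapes made up.
video = torch.randn(16, 3, 64, 64)

# "Patchify" into spacetime patches: 2 frames x 8 x 8 pixels per patch.
patches = video.unfold(0, 2, 2).unfold(2, 8, 8).unfold(3, 8, 8)
patches = patches.reshape(-1, 2 * 3 * 8 * 8)  # (num_patches, patch_dim)

text_emb = torch.randn(1, 128)         # caption embedding (placeholder)
t = torch.rand(1)                      # random noise level in [0, 1)
noise = torch.randn_like(patches)
noisy = (1 - t) * patches + t * noise  # blend clean patches with noise

model = ToyDenoiser(patch_dim=patches.shape[-1], text_dim=128)
loss = nn.functional.mse_loss(model(noisy, text_emb, t), noise)
loss.backward()  # one gradient step of "learn to see through the noise"
```

The point is only the training signal: corrupt the data with noise, condition on a caption, and train the network to recover what was hidden; at full scale, the same objective lets you start from pure noise and a caption alone.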
Now, by the way, all you need to do is find a sugar daddy to invest $13 billion in you, and boom, you're there. Of course, I'm being a little bit facetious; it builds on years of work, including by notable contributors from OpenAI. They pioneered the auto-captioning of images with highly descriptive language, and using those synthetic captions massively optimized the training process. When I mention scale, by the way, look at the difference that more compute makes. When I say compute, think of arrays of GPUs in a data center somewhere in America: when you 4x the compute you get this, and if you 16x it you get that. More images, more training, more compute, better results. Now I know what you're thinking: just 100x the compute, there's definitely enough data. I did a back-of-the-envelope calculation that there are quadrillions of frames just on YouTube (definitely easier to access if you're Google, by the way). But I will caveat that, as we've seen with GPT-4, scale doesn't get you all the way to reasoning, so you'll still get weird breaches of the laws of physics until you get other innovations thrown in.
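For what it's worth, that back-of-the-envelope calculation is easy to reproduce. The inputs below (hours uploaded per minute, years of accumulation, frame rate) are rough public estimates I'm assuming, not measured figures:

```python
# Rough back-of-envelope: how many video frames exist on YouTube?
hours_uploaded_per_minute = 500   # oft-quoted YouTube statistic
years_of_uploads = 15             # assume ~15 years of accumulation
frames_per_second = 30            # typical frame rate

minutes_per_year = 60 * 24 * 365
total_hours = hours_uploaded_per_minute * minutes_per_year * years_of_uploads
total_frames = total_hours * 3600 * frames_per_second

print(f"{total_frames:.1e} frames")
# ~4.3e14 with these inputs, i.e. hundreds of trillions; nudge the
# frame rate or upload estimates upward and you land in the quadrillions.
```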
But then we get to something big that I don't think enough people are talking about: by training on video, you're inadvertently solving images. An image, after all, is just a single frame of a video. The images from Sora go up to 2K by 2K pixels, and of course they could be scaled up further with a tool like Magnific. I tried that for this image, and honestly there was nothing I could see that would tell me this isn't just a photo. I'd almost ask the question of whether this means there won't be a DALL·E 4, because Sora supersedes it.

Take animating an image, and this example is just incredible: this Shiba Inu dog wearing a beret and black turtleneck. That's the image on the left, and it being animated on the right. You can imagine the business use cases of this, where people bring to life photos of themselves, friends and family, or maybe even deceased loved ones. Or how about every page in what would be an otherwise static children's book being animated on demand? You just click, and the characters get animated. Honestly, the more I think about it, the more I think Sora is going to make OpenAI billions and billions of dollars. The number of other companies and apps that it just subsumes within it is innumerable. I'll come back to that point, but meanwhile, here is a handful of other incredible demos.

This is a movie trailer, and notice how Sora is picking quite fast cuts, obviously all automatically; it gets that a cinematic trailer is going to be pretty dynamic and fast-paced. Likewise, this is a single video generated by Sora, not a compilation, and if you ignore some text-spelling issues, it is astonishing.

And here is another one that I'm going to have to spend some time on; the implications of this feature alone are astonishing. All three videos that you can see are going to end with the exact same frame. Even that final frame of the cable car crashing into that sign was generated by Sora, including the minor misspelling at the top. But just think of the implications: you could have a photo with your friends and imagine a hundred different ways that you could have got to that final photo. Or maybe you have your own website, and every user gets a unique voyage to your landing page. And of course, when we scale this up, we could put the ending of a movie in, and Sora 2 or Sora 3 would calculate all the different types of movies that could have led to that point. You could have daily variations to your favorite movie ending. As a side note, this also allows you to create these funky loops where the starting and finishing frames are identical. I could just let this play for a few minutes until people got really confused, but I won't do that to you.

And here is yet another feature that I was truly bowled over by. The video you can see on screen was not generated by Sora, and now I'm going to switch to another video which was also not generated by Sora. But what Sora can do is interpolate between those videos to come up with a unique creation. This time, I'm not even going to list the potential applications, because again, they are innumerable. What I will do, though, is give you one more example that I thought of when I saw this. Another demo that OpenAI used was mixing together this chameleon and this funky-looking bird (I'm not sure of its name) to create this wild mixture. Now, we all know that OpenAI are not going to allow you to do this with human images, but an open-source version of Sora will be following close behind. So imagine putting in a video of you and your partner and creating this hybrid, freaky video, or maybe you and your pet.
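OpenAI hasn't said how this blending works internally, but in diffusion systems this kind of interpolation is commonly done in latent space: encode both videos, blend the latents, and decode (often with some re-denoising). Here's a sketch under that assumption, with placeholder shapes and no real encoder or decoder:

```python
import torch

def interpolate_videos(latent_a, latent_b, steps=8):
    """Linearly blend two video latents and return the in-betweens.

    latent_a / latent_b: tensors of identical shape, standing in for
    the encoded latents of two source videos. Real systems often use
    spherical interpolation (slerp) for Gaussian latents; plain lerp
    keeps the sketch simple.
    """
    blends = []
    for i in range(steps + 1):
        alpha = i / steps
        blends.append((1 - alpha) * latent_a + alpha * latent_b)
    return torch.stack(blends)

# Placeholder latents for two encoded videos (a drone shot, a
# butterfly shot, etc.) -- the shapes are made up for illustration.
latent_a = torch.randn(16, 4, 32, 32)
latent_b = torch.randn(16, 4, 32, 32)

in_betweens = interpolate_videos(latent_a, latent_b)
# Each blended latent would then be decoded (and typically re-denoised)
# to produce the hybrid frames seen in the demo.
print(in_betweens.shape)  # (9, 16, 4, 32, 32)
```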
Now, the best results you're going to get from Sora are inevitably when there's not as much movement going on; the less movement, the fewer problems with things like object permanence. Mind you, even when there is quite a lot going on, the results can still be pretty incredible. Look at how Sora handles object permanence here, with the dog fully covered and then emerging looking exactly the same. Likewise, this video of a man eating a burger: because he's moving in slow motion, it's much more high-fidelity, and aside from the bokeh effect, it could almost be real. And then we get this gorgeous video, where you'd almost have to convince me it's from Sora; look at how the paint marks stay on the page. And then we get simulated gaming, where again, if you ignore some of the physics and the rule-breaking, the visuals alone are just incredible. Obviously, they trained Sora on thousands of hours of Minecraft videos; I mean, look how accurate some of the boxes are. I bet some of you watching this think I simply replaced a Sora video with an actual Minecraft video, but no, I didn't.

That has been quite a few hype demos, so time for some anti-hype ones. Here is Sora clearly not understanding the world around it. Just as ChatGPT's understanding can sometimes be paper thin, so can Sora's: it doesn't get the physics of the cup, the ice, or the spill.

I can't forget to mention, though, that you can also change the style of a video. Here is the input video, presumably from a game. Now, with one prompt, you can change the background to a jungle, or maybe you'd prefer to play the game in the 1920s. I mean, you can see how the wheels aren't moving properly, but the overall effect is incredible. Well, actually, this time I want to play the game underwater; how about that? Job done. Or maybe I'm high and I want the game to look like a rainbow, or maybe I prefer the old-fashioned days of pixel art.
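Again, the report doesn't explain the mechanism, but prompt-driven restyling of an existing video matches a well-known trick from image diffusion, SDEdit: partially noise the source, then denoise under the new prompt, so the structure survives while the style changes. A hedged sketch with a stand-in denoiser, purely for illustration:

```python
import torch

def restyle(video_latents, denoise_fn, new_prompt_emb, strength=0.6):
    """SDEdit-style restyling sketch.

    strength in (0, 1]: how much noise to inject. Low strength keeps
    the original structure (the kart, the track); high strength lets
    the new prompt ("jungle", "underwater", "pixel art") take over.
    """
    noise = torch.randn_like(video_latents)
    # Jump partway into the diffusion process instead of starting from
    # pure noise, so the source video's layout is preserved.
    noisy = (1 - strength) * video_latents + strength * noise
    # Denoise conditioned on the *new* prompt; denoise_fn stands in
    # for a full sampler loop in a real system.
    return denoise_fn(noisy, new_prompt_emb)

# Placeholder pieces, all assumptions for the sketch:
video_latents = torch.randn(16, 4, 32, 32)  # encoded source video
prompt_emb = torch.randn(128)               # "set the race in a jungle"
identity_denoiser = lambda x, p: x          # stand-in for a real sampler

restyled = restyle(video_latents, identity_denoiser, prompt_emb)
print(restyled.shape)
```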
I've noticed a lot of people, by the way, speculating about where OpenAI got all the data to train Sora. I think many people have forgotten that they did a deal back in July with Shutterstock. In case you don't know, Shutterstock has 32 million stock videos, and most of them are high resolution. They probably also used millions of hours of video game frames, would be my guess.

One more thing you might be wondering: don't these worlds just disappear the moment you move on to the next prompt? Well, with video-to-3D, that might not always be the case. This is from Luma AI, and imagine a world generated at first by Sora, then turned into a universally shareable 3D landscape that you can interact with. Effectively, you and your friends could inhabit a world generated by Sora, and yes, ultimately, with scale, you could generate your own high-fidelity video game. And given that you can indefinitely extend clips, I am sure many people will be creating their own short movies, perhaps voiced by AI. Here's an ElevenLabs voice giving you a snippet of the caption to this video: "An adorable happy otter confidently stands on a surfboard wearing a yellow lifejacket, riding along turquoise tropical waters near lush tropical islands." Well, how about hooking Sora up to the Apple Vision Pro or Meta Quest, especially for those who can't travel? That could be an incredible way of exploring the world. Of course, being real here, the most common use case might be children using it to make cartoons and play games, but still, that counts as a valid use case to me.

But underneath all of these use cases are some serious points. In a since-deleted tweet, one OpenAI employee said this: "We are very intentionally not sharing it widely yet. The hope is that a mini public demo kicks a social response into gear." I'm not really sure what social response people are supposed to give, though. However, it's not responsible to let people just panic, which is why I've given the caveats I have throughout this video. I believe, as with language and self-driving, that the edge cases will still take a number of years to solve; that's at least my best guess. But it seems to me that when reasoning is solved, and therefore even long videos actually make sense, a lot more jobs than just videographers' might be under threat. As the creator of GitHub Copilot put it: if OpenAI is going to continue to eat AI startups sector by sector, they should go public; building the new economy where only 500 people benefit is a dodgy future. And the founder of Stability AI tweeted out this image. It does seem to be the best of times and the worst of times to be an AI startup; you never know when OpenAI or Google are going to drop a model that massively changes and affects your business. It's not just Sora whacking Pika Labs, Runway ML, and maybe Midjourney. If you make the chips that OpenAI uses, they want to make them instead (I'm going to be doing a separate video about all of that). When you use the ChatGPT app on a phone, they want to make the phone you're using. You come up with Character AI, and OpenAI comes out with the GPT Store. I bet OpenAI are even cooking up an open-world game with GPT-powered NPCs; don't forget that they acquired Global Illumination, the makers of this Minecraft clone. If you make agents, we learned last week that OpenAI want to create an agent that operates your entire device (again, I've got more on that coming soon). Or what about if you're making a search engine powered by a GPT model? That's the case, of course, with Perplexity, and I will be interviewing the CEO and founder of Perplexity for AI Insiders next week; insiders can submit questions, and of course, do feel free to join on Patreon. But fitting with the trend, we learned less than 48 hours ago that OpenAI is developing a web search product. I'm not necessarily critiquing any of this, but you're starting to see the theme: OpenAI will have no qualms about eating your lunch.

And of course, there's one more implication that's a bit more long-term. Two lead authors from Sora both retweeted this video from Berkeley. You're seeing a humanoid transformer robot trained with large-scale reinforcement learning in simulation and deployed to the real world zero-shot; in other words, it learned to move like this by watching and acting in simulations. If you want to learn more about learning from simulations, do check out my Eureka video and my interview with Jim Fan. TL;DR: better simulations mean better robotics.

Two final demos to end this video with. First, a monkey playing chess in a park. This demo kind of sums up Sora: it looks gorgeous, and I was astounded like everyone else; however, if you look a bit closer, the piece positions and the board don't make any sense. Sora doesn't understand the world, but it is drawing upon billions and billions of patterns. And then there's the obligatory comparison: the Will Smith spaghetti video (and I wonder what source they originally got some of those images from). You could say that was around state-of-the-art just 11 months ago, and now here's Sora: not perfect (look at the paws), but honestly remarkable. Indeed, I would call Sora a milestone human achievement. But now I want to thank you for watching this video all the way to the end, and no, despite what many people think, it isn't generated by an AI. Have a wonderful day.
Info
Channel: AI Explained
Views: 225,644
Id: nYTRFKGR9wQ
Length: 16min 9sec (969 seconds)
Published: Fri Feb 16 2024