OpenAI's "World Simulator" SHOCKS The Entire Industry | Simulation Theory Proven?!

Captions
What I'm about to show you might actually scare you. What you're looking at right now is Minecraft, but it's not Minecraft: it's a simulated version of Minecraft generated by OpenAI's new text-to-video product, Sora. Why is that so interesting? This was generated using a completely new technology that is vastly different from how video games are created today, and I'm going to tell you exactly how that happens. And if we take it a step further: if AI is able to simulate infinite worlds, then hypothetically it could simulate our reality perfectly, and the implications are stunning. So let's talk about all of this.

Everybody has seen OpenAI's Sora at this point. The text-to-video product released by OpenAI last week has shown absolutely stunning results; what you can build, and the consistency of objects within the video, are mind-blowing. But I don't think people truly appreciate the magnitude of Sora's potential, and that's what we're going to talk about in this video. I'm specifically going to highlight how video games are likely to change completely over the next few years because of what Sora will be able to deliver. And keep this in mind throughout the video: what OpenAI has shared with the world is probably only a fraction of what they are capable of today. They probably have things in development that might actually scare people, and they kind of hinted at it in the title of their research, "Video generation models as world simulators." That is such a distinct way to title this paper.

So "world simulators": what do they mean by that? First, let's talk about how worlds are currently simulated. Typically you're using something like Unreal Engine, and you're creating every single object in the world. Every single pixel really has to be simulated and calculated as objects move through the world: the light, the movement of hair, objects, skin,
everything has to be calculated pixel by pixel, and doing that calculation on a GPU in real time is extremely, extremely expensive. What Sora is showing us is that it can do these calculations at much lower cost, because its model is essentially generating the entire scene all at once. It doesn't have an explicit understanding of each individual pixel in the scene, yet it's still able to track objects moving through a scene seemingly perfectly. Even when occlusion happens and an object passes behind another object, the model still remembers it moving back there, and that alone is insane compared to all the other text-to-video products we've seen. This is far and away the best and most consistent.

The last sentence of the paper is also really telling: "Our results suggest that scaling video generation models is a promising path towards building general purpose simulators of the physical world." Now, if you're into simulation theory, you're probably pretty excited to hear this, and if you're not familiar with simulation theory, let me explain what that is. The simulation hypothesis proposes that what humans experience as the world is actually a simulated reality, such as a computer simulation, in which humans themselves are constructs. There has been much debate over this topic, ranging from philosophical discourse to practical applications in computing. So imagine this: as computers get better and better at simulating the real world, at a certain point they will simulate the world perfectly, down to the very atoms that make up the entire universe. And at that point, what's the difference between that simulation and reality? I love thinking about this stuff, and there was actually a show called Devs that was really good, which I recommend checking out, that is all about this topic. But
here's the problem, at least to date: video games are a great representation of our ability to simulate environments, and as you can tell, they're pretty good, but nowhere near realistic, nowhere near reality. The reason is that traditional simulation methods in video games use GPUs to calculate every single object, every single pixel, and how they move through the world. Sora, however, promises to do things very differently: as I mentioned, it generates the entire simulation all at once using diffusion and transformers, the same technology that powers large language models.

So with that context, let's talk a little about what's possible. I want to show off this video first. This is Minecraft, but it's not Minecraft: it's a simulated version of Minecraft generated by Sora. Not only did Sora simulate the graphics here, it also simulated the interface, the UI. And if we layer on ChatGPT, it probably has a very good understanding of how to play the game. So if you take these two concepts, the visuals and the logic to run the game, and put them together, all of a sudden you have a full understanding of this video game, and you can actually play it without anyone having actually written it, which is crazy to think about. You could simply describe the type of game you want to play and the rules you want to play by, and it would generate the game in real time. It is completely mind-blowing. All of the physics of the game seem to be generated really, really well, and obviously this is still pre-beta, probably alpha, yet the results we're seeing are already extremely impressive. This is a completely different way to think about video games: rather than writing all of the logic and rules of the game line by line, rather than having to paint every single pixel and calculate how all the pixels move through the environment, you're able to do
this all at once using artificial intelligence. We don't actually know exactly how it's doing it, but it's doing it; that's what we do know.

Now look at this video. This could be a video game: a Land Rover-type vehicle driving through the mountains, a bunch of trees, and the graphics are flawless. They look real. This is better than any video game I've seen, and this 20-second-long video was processed by Sora, every frame of it. I can imagine the future of video games being essentially this: you are playing, and the game is generating the environment in real time. Nothing is pre-rendered, which is difficult for me to even comprehend, but it seems like it could be possible with enough computing power. And not only that: you could change the game dynamically as you're playing it. You could say, "Rather than being in the mountains, I want to be in a swamp," and it would dynamically change the entire environment as you play. You could also change the rules of the game as you play: "Rather than Earth's gravity, I want the Moon's gravity," and all of a sudden you'd be bouncing like crazy. The potential is awe-inspiring.

And not only that: in a previous video I made a while ago, I talked about a simulated environment in which 25 AI agents lived in a simulated world, formed relationships, formed habits, and went about their lives like any human would. Now pair that with what we're seeing here: a completely simulated environment generated by artificial intelligence, populated by NPCs that are effectively alive. How you interact with them changes the environment and changes how they behave. Video games are going to become, essentially, reality simulators, and that is what is so exciting about this.

With all that aside, I want to show what Dr. Jim Fan has said about Sora, which really paints an important
picture of how he sees the future. Dr. Jim Fan, if you're not aware, is a senior research scientist and lead of AI agents at NVIDIA, so he's definitely very knowledgeable about this topic: "If you think OpenAI Sora is a creative toy like DALL-E, think again. Sora is a data-driven physics engine." That is so cool to think about. "It is a simulation of many worlds, real or fantastical. The simulator learns intricate rendering, intuitive physics, long-horizon reasoning, and semantic grounding, all by some denoising and gradient maths." I don't think many people thought this was possible with the transformer architecture, but it seems it is. And here is the most important line: "I won't be surprised if Sora is trained on lots of synthetic data using Unreal Engine 5. It has to be." So basically, Unreal Engine 5 created a bunch of simulated environments, and Sora learned from them. But eventually, wouldn't Sora replace Unreal Engine 5? It would be much more efficient, much cheaper, and easier to produce with.

He then breaks down this video, which I'm going to show as I read his breakdown of it: "The simulator instantiates two exquisite 3D assets: pirate ships with different decorations. Sora has to solve text-to-3D implicitly in its latent space. The 3D objects are consistently animated as they sail and avoid each other's paths. Fluid dynamics of the coffee, even the foam that forms around the ships; fluid simulation is an entire subfield of computer graphics, which traditionally requires very complex algorithms and equations. Photorealism, almost like rendering with ray tracing. The simulator takes into account the small size of the cup compared to the ocean, and applies tilt-shift photography to give a 'minuscule' vibe. The semantics of the scene do not exist in the real world, but the engine still implements the correct physical rules that we expect." Very, very cool. And he follows up with: "Apparently some folks don't get 'data-driven physics engine,' so let me clarify. Sora is an
end-to-end diffusion transformer model. It inputs text/image and outputs video pixels directly. Sora learns a physics engine implicitly in the neural parameters by gradient descent through massive amounts of videos." So basically, the same way an LLM learns to predict the next token in a sentence is the way Sora learns to predict the next frame in a sequence of frames. "Sora is a learnable simulator, or 'world model.' Of course it does not call UE5 explicitly in the loop, but it's possible that UE5-generated (text, video) pairs are added as synthetic data to the training set."

And not only that: a few days after this was announced, ElevenLabs came out with synthetic audio. Basically, they took a bunch of Sora videos, fed them through their model, and output the audio it thought would go along with the videos, and it is so cool. Check this out: [Music] "In a place beyond imagination, where the horizon kisses the heavens, one man dares to journey where few have ventured. Armed with nothing but his wit and an unyielding spirit, he seeks the answers to mysteries that lie beyond the stars." Okay, now you get the sense that with Sora generating the graphics, ChatGPT generating the logic, and ElevenLabs generating the audio, this has the potential to simulate real worlds.

Thanks to the sponsor of this video, Hostinger. Hostinger allows you to build a full website just by answering three questions, and then AI will do the rest for you; no coding required. Hostinger has a ton of functionality, including 150 customizable templates, a drag-and-drop editor, and an AI heatmap, with very easy migration to WordPress. Not to mention, they also have an AI writer that can write all of your SEO-optimized copy for you, which is a huge timesaver. Other hosting services charge for these features, but Hostinger includes them all with their hosting service. To get 10% off, use my code MATTHEW or go to hostinger.com/Matthew. Check out Hostinger today; I'll drop a link in the description below so you can check it
out and receive an exclusive 10% discount just for my viewers. Thanks again to Hostinger; now back to the video.

Here's what Jim Fan said about the synthetic audio: "It's prompted by text, but the right conditioning should be on both text and video pixels. Learning an accurate video-to-audio mapping also requires modeling some implicit physics in the latent space." Here's what he thinks is happening: "Identify each object's category, materials, and spatial locations. Identify the higher-order interactions between objects: is a stick hitting a wooden, metal, or drum surface, and at what speed? Identify the environment: restaurant, space station, Yellowstone, Japanese shrine? Retrieve the typical sound patterns of objects and surroundings from the model's internal memory. Run soft, learned physical rules to piece together and adjust the parameters of the sound patterns, or even synthesize completely new ones on the fly, kind of like procedural audio in game engines. If the scene is busy, the model needs to overlay multiple soundtracks based on their spatial locations. None of the above is an explicit module; all will be learned by gradient descent through massive amounts of (video, audio) pairs." Amazing. So again, all of this is based on traditional machine-learning techniques.

Now let's bring it all back to simulation theory, which Dr. Jim Fan also thinks about, apparently: "If there's a higher being who writes the simulation code for our reality, we can estimate the file size of the compiled binary. Meta's Emu Video is 6 billion parameters. Let's say Sora is 10 times larger, with BF16 precision; then the Creator's binary might be no larger than 111 GB." Crazy to think about, and that might actually be more plausible than you'd think. Large language models use much more processing power than our human minds to process the world; the human mind processes the world really efficiently, and the way we learn is very efficient compared to large language models. So apparently there's probably this delta
between the way large language models work today and the way the human mind works. So what Dr. Jim Fan is saying is that we might actually be able to simulate reality with data that can fit on a small hard drive.

Let's read his caveats: "The actual code might be far simpler, as Sora is still far away from the Kolmogorov complexity." I didn't actually know what that is, so let's read it: in algorithmic information theory, the Kolmogorov complexity of an object, such as a piece of text, is the length of the shortest computer program that produces the object as output. Next: "Sora is not just compressing our world, but all possible worlds. Our reality is only one of the simulations that Sora is able to compute. It's possible that some parts of the physical world don't exist until you look at them, much like you don't need to render every atom in UE5 to make a realistic scene."

And here's Nando de Freitas, previously a lead at Google DeepMind. What he's saying is that the only way a finite-sized neural net can predict what will happen in any situation is by learning internal models that facilitate such predictions, including intuitive laws of physics: "Given this intuition, I cannot find any reason to justify disagreeing with Dr. Jim Fan." So there are major minds coming together and saying, yeah, this is probably the prelude to being able to simulate all realities: "With more data of high quality, electricity, feedback (aka fine-tuning), grounding, and parallel neural-net models that can efficiently absorb data to reduce entropy, we will likely have machines that reason about physics better than humans, and hopefully teach us new things."

I love the future we're living in. I am so excited about everything I'm reading; Sora blows my mind. I want to play these video games that are being created dynamically, where I can essentially dictate exactly what I want the video game to be. What do you think about all this? Let me know in the comments below. If you liked this video, please consider giving it a like and subscribing, and
I'll see you in the next one.
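As an aside on the "predict the next frame like the next token" idea above: OpenAI's technical report describes chopping videos into "spacetime patches" that play the role tokens play in a language model. Here is a minimal NumPy sketch of that patching step; the video size and patch dimensions are illustrative assumptions, not Sora's actual configuration.

```python
import numpy as np

# Toy "video": 8 frames of 32x32 RGB pixels (all sizes are made up)
video = np.random.rand(8, 32, 32, 3)

# Cut it into spacetime patches spanning 2 frames and 8x8 pixels each
t, h, w = 2, 8, 8
T, H, W, C = video.shape
patches = (video
           .reshape(T // t, t, H // h, h, W // w, w, C)  # split each axis into (blocks, within-block)
           .transpose(0, 2, 4, 1, 3, 5, 6)               # group the block indices together
           .reshape(-1, t * h * w * C))                  # flatten each block into one "token"

print(patches.shape)  # → (64, 384): 64 tokens, each a flattened 2x8x8x3 patch
```

Each row of `patches` is one token a transformer could attend over, which is how a single architecture can treat video the way an LLM treats text.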
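The "111 GB" figure quoted from Jim Fan above is just back-of-the-envelope arithmetic: Emu Video's roughly 6 billion parameters, times a speculative 10x for Sora, times 2 bytes per BF16 parameter, expressed in binary gigabytes.

```python
emu_params = 6e9                  # Meta's Emu Video: ~6 billion parameters
sora_params = emu_params * 10     # speculative 10x estimate quoted in the video
bytes_per_param = 2               # BF16 is 16 bits = 2 bytes per parameter
size_gib = sora_params * bytes_per_param / 2**30  # convert bytes to GiB

print(round(size_gib, 1))  # → 111.8
```

So the quoted number is simply 60 billion BF16 weights, about 120 GB in decimal units or roughly 111 GiB in binary units.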
Info
Channel: Matthew Berman
Views: 45,706
Keywords: ai, simulation theory, video games, artificial intelligence, sora, ai sora, openai, chatgpt
Id: BH9FU7Gd6v8
Length: 15min 56sec (956 seconds)
Published: Wed Feb 21 2024