"VoT" Gives LLMs Spacial Reasoning AND Open-Source "Large Action Model"

Video Statistics and Information

Captions
Today we have an open-source large action model. Very similar to how the Rabbit R1 can control applications within the Android environment just by speaking natural language, we now have a completely open-source version of that for the Windows environment, released by Microsoft. Not only did Microsoft release a research paper outlining how they were able to achieve it, there is also an open-source project which you can download and use right away, and I'm going to show you that today. First, let's go over the white paper. It's called "Visualization of Thought Elicits Spatial Reasoning in Large Language Models," and essentially it describes a way to give large language models spatial reasoning. If you're not familiar with the term, spatial reasoning is basically the ability to visualize the relationships between different objects in a 3D environment, or even a 2D environment. This is something large language models have historically done really poorly, and the lead of Meta AI, Yann LeCun, has talked about it as a core missing capability of large language models that will prevent us from reaching AGI. But in this paper they show that it's actually possible to get spatial reasoning out of large language models. Let me give you an example of what spatial reasoning is. In your mind, think about this: you're standing at a point on the North Pole and you start walking. You walk 50 yards in one direction, then you turn left, and then you continue walking indefinitely. If you kept walking, would you ever cross over that initial point? You're doing all of this spatial reasoning in your head through what's called your mind's eye; language isn't really involved when you're thinking through this problem. That is what spatial reasoning is, and that is why Yann LeCun thinks it isn't possible with language models alone. But according to this paper, it definitely is. So let me get into it, and stick around afterwards, because I'm going to show it to you in action in an open-source project. The paper is out of Microsoft Research. In the beginning it talks about how large language models are really great; however, their abilities in spatial reasoning, a crucial aspect of human cognition, remain relatively unexplored. Humans possess a remarkable ability to create mental images of unseen objects and actions through a process known as the mind's eye, enabling the imagination of the unseen world. Inspired by this cognitive capacity, the authors propose Visualization-of-Thought prompting. I'm going to show you why this translates into a large action model: right now it's called Visualization of Thought, but if we take this technique and apply it to a user interface, we can actually control that user interface, and that's essentially what a large action model is. So let's look at this diagram. On the left is what happens in the human mind: we have visuals, we have verbal language, we put it all together in what is called the mind's eye, and we form a mental image of whatever we're thinking about. On the right side is the mind's eye of large language models: really we only have text, we put it all together in the large language model's mind's eye, and we come up with a mental image. So can we actually achieve that with a large language model? Let's find out. Here is conventional prompting: you have an input and then you get an output.
Then we have more advanced prompting techniques like Chain of Thought: it's an input, plus "walk me through, thought by thought, how you get to the output," and what we've found is that when you use Chain-of-Thought prompting and other prompting techniques like reflection, you actually improve the performance of the large language model quite a bit. Then we have Visualization of Thought: we take the input and ask the model to have a thought and to represent the visualization at each step along the way before we get to the output. This is all theoretical so far; I'm going to show you actual examples in a second. Humans can enhance their spatial awareness and inform decisions by creating mental images during the spatial reasoning process; similarly, large language models can create internal mental images, and the paper proposes Visualization-of-Thought prompting to elicit the mind's eye of LLMs for spatial reasoning. Spatial reasoning is super important in basically every aspect of life: whether you're driving, playing video games, playing chess, or just walking, everything you do uses spatial awareness as long as you're interacting with your 3D world. So let's talk about Visualization-of-Thought (VoT) prompting to elicit this ability, meaning spatial awareness. This method augments LLMs with a visuospatial sketchpad to visualize their reasoning steps and inform subsequent steps. VoT adopts zero-shot prompting instead of relying on few-shot demonstrations or text-to-image visualization with CLIP. To evaluate the effectiveness of VoT in spatial reasoning, the authors selected three tasks that require spatial awareness in LLMs: natural language navigation, visual navigation, and visual tiling; I'll explain what all three of those are. They designed 2D grid worlds using special characters as enriched input formats for the LLMs in the visual navigation and visual tiling tasks. Remember, large language models can't interpret graphics directly: if we were to put together a 2D tile map and just pass it to the large language model, it wouldn't really understand it, so we have to represent that 2D space with text, and you'll see how they do it. VoT prompting, as proposed in this paper, consistently induces LLMs to visualize their reasoning steps and inform subsequent steps, and consequently this approach achieves significant performance improvements on the corresponding tasks. So let's look at this: we have a bunch of 2D grids of different sizes containing different objects. Take the k = 2 example: the house is the starting point and the office is the ending point, and we're going to ask the large language model to navigate step by step from the house to the office. It's easy for humans to do this, right? Go right, go right, go up, go up, and that's it. Obviously it can get more complicated, but it's still super easy; in fact, we don't even need to go step by step, we can kind of just look at it and trace the whole path by thinking about it. But if we had to, we could describe it: up, up, left, left, up, up, et cetera. This is spatial awareness, this is spatial reasoning, and it has been very difficult for large language models to date, but not anymore.
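To make the contrast between these prompting styles concrete, here is a minimal Python sketch of how the three prompts could be assembled for a house-to-office navigation question like the one above. The grid text and the '#' obstacle symbol are illustrative assumptions rather than the paper's exact special-character format; the VoT instruction itself is the one quoted from the paper.

# Minimal sketch (illustrative, not the paper's exact setup): the same
# house-to-office navigation question under three prompting styles.
# "H" = house (start), "O" = office (goal), "#" = an obstacle cell.
GRID = "H . .\n. # .\n. . O"

QUESTION = (
    "You are navigating the grid below. You can move up, down, left or right, "
    "and you cannot enter cells marked '#'. List the moves that take you from "
    "H (the house) to O (the office).\n\n" + GRID
)

conventional_prompt = QUESTION                                     # input -> output
cot_prompt = QUESTION + "\n\nLet's think step by step."            # Chain of Thought
vot_prompt = QUESTION + "\n\nVisualize the state after each reasoning step."  # VoT

for name, prompt in [("conventional", conventional_prompt),
                     ("chain-of-thought", cot_prompt),
                     ("visualization-of-thought", vot_prompt)]:
    print(f"--- {name} ---\n{prompt}\n")

Notice that VoT changes only the instruction appended to the input; the model itself is expected to produce a visual state after every reasoning step.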
Spatial reasoning refers to the ability to comprehend and reason about the spatial relationships among objects, their movements, and their interactions, and in technology it applies to navigation, robotics, and autonomous driving. Here's natural language navigation: in this context, a square map is defined by a sequence of random-walk instructions with corresponding objects, and the paper gives the algorithm used to generate the map and the walking path. Then we have visual navigation: this task presents a synthetic 2D grid world to the LLM, challenging it to navigate using visual cues. The model must generate navigation instructions to move in four directions, left, right, up, and down, as we just talked about, to reach the destination from the starting point while avoiding obstacles. This involves two subtasks, route planning and next-step prediction, both requiring multi-hop spatial reasoning, with the former being more complex. Here is the formulation of it; again, it's represented by a formula rather than just passing in an image of that 2D grid. Then we have visual tiling, which is what we're seeing in these examples; let me talk about that for a second. Polyomino tiling is a classic spatial reasoning challenge, and the paper extends this concept to test the LLM's ability to comprehend, organize, and reason with shapes in a confined area. Essentially you have a grid with different shapes in different colors, and you are tasked with finding a place for a new object. If we just look at this, we can tell that within this grid we can place this red 4x1 (or 1x4) object right here. That is essentially what this test is accomplishing. Now, the really important part of VoT prompting is visualizing at each step. It's kind of like Chain of Thought: we're not just saying "do it all at once," we're saying "I want to see a trace of the path, step by step, as you go along the way." So the paper introduces VoT prompting, and it starts really simply: "Visualize the state after each reasoning step." This new paradigm for spatial reasoning aims to generate reasoning traces and visualizations in an interleaved manner. Let's look at the one on the left first. This is visual navigation, which we've already seen: we have the house right here, and the LLM is supposed to navigate through the empty squares (the ones with gates in them cannot be navigated through) all the way down to the office. What we're seeing down here is the LLM doing that step by step: step one, move right; step two, move down; step three, move left; move down, move left, move down, and it reached it. Same with visual tiling: we provide it with this grid and three different objects, essentially Tetris pieces like a 1x4, and we say, can you fit all of them into this grid? So it says, okay, let's look at I, where does that go; then let's look at L, where does that go; then let's look at T, where does that go; and it's able to accomplish that and get them all in. Then here we have natural language navigation: we describe a 3x3 grid and tell it, step by step, what it needs to do, so we're actually giving it the steps, and at the end we say, okay, where are you, what did you find? We're visualizing each step, and the square marked with a star is where the large language model thinks it currently is: at step two it's W, at step three it's C, all the way up to step seven, S, and so on, and then finally we're at C.
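To show what "reasoning traces and visualizations in an interleaved manner" looks like as plain text, here is a small Python sketch that simulates the kind of trace the model is asked to produce for visual navigation. The 3x3 grid, the move list, and the '*' marker for the current position are illustrative assumptions, not the paper's exact symbols.

# Minimal sketch: simulate the interleaved trace VoT asks the model to write,
# printing the grid state after every move with '*' marking the current cell.
grid = [
    ["H", ".", "."],
    [".", "#", "."],
    [".", ".", "O"],
]
moves = ["right", "right", "down", "down"]     # a hypothetical obstacle-free route
pos = (0, 0)                                   # start at the house
deltas = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}

def render(grid, pos):
    # Draw the grid with '*' marking the current position.
    return "\n".join(
        " ".join("*" if (r, c) == pos else cell for c, cell in enumerate(row))
        for r, row in enumerate(grid)
    )

for step, move in enumerate(moves, start=1):
    dr, dc = deltas[move]
    pos = (pos[0] + dr, pos[1] + dc)
    print(f"Step {step}: move {move}\n{render(grid, pos)}\n")

Running it prints one small grid per move, with the star tracing the path from the house to the office, which is exactly the kind of step-by-step visualization VoT asks the model to write out itself.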
They tested four configurations, all based on GPT-4: first, GPT-4 with Chain of Thought, so "let's think step by step"; GPT-4 without visualization, meaning none of the visualization techniques we're talking about today, just "let's think step by step"; GPT-4 with vision, so the ability to interpret what's in an image, plus "let's think step by step"; and GPT-4 with VoT, so "visualize the state after each reasoning step." Now let's look at the performance. The bold entries across the board mark where each approach performed best. For route planning we have the completing rate, with GPT-4 with VoT as the best; then we have the success rate, far superior, nearly 50% greater than second-place GPT-4 without visualization. For next-step prediction, visual tiling, and natural language navigation, across the board the VoT prompting technique just wins. It's really impressive. So do different prompting techniques actually affect the outcome? Well, yeah, that's obvious, right? What it says here is that in the GPT-4 CoT setting, Chain of Thought without explicit visualization prompts, the model demonstrated a noticeable tracking rate across almost all tasks except route planning, which implies that LLMs innately exhibit this capability of visual state tracking when spatiotemporal simulation is necessary for reasoning. In this figure we're also seeing the difference between asking it to output the visualization at every step along the way versus at least one step. Here is the complete tracking rate, which means it's visualizing at every single step: for route planning it completely dominates, for next-step prediction it does a lot better, and so on for visual tiling and natural language navigation; the purple bar is GPT-4 with VoT. On the right side is the partial tracking rate, which means at least one step had the visualization, and we're seeing similar results, except for next-step prediction, where GPT-4 with CoT actually performs pretty darn well. One last thing before I show you the examples: what are the limitations? Both mental images and visual state tracking rely on the emerging abilities of advanced LLMs, so performance may deteriorate in less advanced language models or on more challenging tasks. So here is the project. It's called PyWinAssistant, and it's described as the first open-source large action model, a generalist artificial narrow intelligence that controls completely human user interfaces only by using natural language. They reference this paper, which is actually how I found it, and it uses the same techniques to control a Windows environment. It gives you a cute little character on the right, and you can essentially task it with anything you want, so let's look at a few examples. What we're going to see is an example in the Windows environment: we have this little assistant right there, and you can tell it to do different things. The first thing the video tells it to do is "open Firefox," then "click on YouTube," so it's being given a series of things to do, and it clicks onto the element without visioning context. Okay, so it clicked on YouTube. Let's take a look at what's actually happening. You clicked on the assistant, "you dragged me," so that's just the person dragging the little assistant around. Then we say "open Firefox," and it responds with clicking; then "click on YouTube," selected application Mozilla Firefox, then "AI decision coordinates," so it actually finds the coordinates; then it says "clicking on the search input," and so on. Let's keep watching: "type Rick Roll," "click on search," clicking onto the element without visioning context, "click on the second video." We're just telling it what to do and it's able to do that. This is essentially Open Interpreter, but it works really, really well: it clicks onto the element without visioning context, and there we go, it was able to do that. I'm going to mute it because I don't want to get copyright struck, and it's playing the video now. So it's just step by step, telling it exactly what it needs to do; there, it was told to mute, so it clicked the mute button. Again, it has no training on what is on the screen or how to click; it's figuring it out as it goes, and it's being asked to visualize at each step. Very impressive.
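The video doesn't show the project's internals, so here is only a rough sketch of the general shape of this kind of action loop, not PyWinAssistant's actual code: a language model turns an instruction into a structured plan, and each step is executed against the real screen while a "current status" line is reported. The plan_actions stub, the action format, and the hard-coded coordinates are placeholder assumptions.

# Rough sketch only -- NOT PyWinAssistant's actual code. A language model turns
# an instruction into a structured plan; each step is executed on the real
# screen and reported as a "current status" line. Requires: pip install pyautogui
import time
import pyautogui  # mouse/keyboard automation

def plan_actions(instruction: str) -> list[dict]:
    # Placeholder for the LLM call: a real implementation would send the
    # instruction (plus a text description of the current screen state) to a
    # model and parse its reply into structured actions.
    return [
        {"action": "click", "x": 200, "y": 1050, "why": "open the browser from the taskbar"},
        {"action": "type", "text": "youtube.com\n", "why": "go to YouTube"},
    ]  # the coordinates here are made up for illustration

def execute(plan: list[dict]) -> None:
    for step in plan:
        print("Current status:", step["why"])   # per-step status, like the assistant's log
        if step["action"] == "click":
            pyautogui.click(step["x"], step["y"])
        elif step["action"] == "type":
            pyautogui.write(step["text"], interval=0.05)
        time.sleep(1.0)                         # give the UI time to respond

execute(plan_actions("open Firefox and click on YouTube"))

A real implementation would also ground the coordinates in the live UI, for example via accessibility APIs or a vision model, and re-plan when a step fails.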
All right, let's look at this next example (by the way, this is an awesome background). The user has given it the instruction: make a new post on Twitter saying "hello world" and a brief greeting explaining you're an artificial intelligence. Then here's the prompt, and here's another prompt: it's analyzing what to do, generating the test case, and then, interestingly, it iterates on the prompt automatically. Then it says "current status," which is where it represents what it currently understands; it's basically the visualization at each step. Let's keep watching: it adds a spatial map, clicks on "What is happening," and then it generates the actions right here: click on the browser address bar, enter twitter.com, wait for the Twitter homepage to load. So it's producing the entire set of actions it needs to accomplish, and it's going to go through them step by step; it's actually being asked to do the planning up front. Let's watch it: selected element, locate the address bar, it shows the coordinates of the address bar, clicks on it, enters twitter.com, and there we go. It found the address bar right there, entered the tweet, and hopefully it's going to push post, but here we go, we can see every single step along the way. Very cool. Let's look at some of the cases; these are proven, working cases: open a new tab with the song, click on the button, send a list of steps to make a joke about engineers whilst making it an essay, and so on and so forth. There are a lot of really cool implementations of this, so I encourage you to check it out, and read the research paper if you're interested. If you want to see me do a full tutorial of PyWinAssistant, let me know in the comments; I'm happy to do that. If you enjoyed this video, please give a like and subscribe, and I'll see you in the next one.
Info
Channel: Matthew Berman
Views: 93,702
Keywords: prompting, coding, llm, ai, microsoft, visualization of thought, research paper, prompt, llm prompt
Id: JSRBesvBCtI
Length: 16min 25sec (985 seconds)
Published: Wed May 08 2024