Solving Wordle using information theory

Reddit Comments

I really enjoyed this analysis. Wordle reminds me of learning Mastermind years ago. In an attempt to make Mastermind more challenging, I would allow duplicates and blanks, and eventually I programmed it with the 16 MS-DOS colors. But a key challenge was making every guess conform to all previous guesses. For this reason, I play Wordle the same way.

👍 33 · u/toowm · Feb 06 2022

This is interesting. I’ve been going with a Wheel of Fortune based start word (r, s, t are most popular) and using “stare”.

👍 71 · u/LeonardSmallsJr · Feb 06 2022

Wonderful video, but I am sceptical about whether this is actually the optimal strategy, or merely a good heuristic.

I tackled this problem last week by considering it as an adversarial game where player A picks a guess word, player B then picks the largest class of possible final words (where a class is just the words that would produce the same green-yellow-gray clue), and so forth. Then the problem boils down to a minimax problem, which as far as I could tell would give you a true optimal strategy (if you pretend to play as player A).

I would be interested in whether these strategies boil down to the same strategy. Or perhaps we have some slightly different definitions of what "optimal" Wordle play entails.

👍 142 · u/StephenSwat · Feb 06 2022

This is a good video, but I disagree a bit with how it defines the optimal strategy as the word which results in the lowest average number of guesses. For me, the optimal strategy is one which always guarantees that victory is possible in six guesses or fewer, and also minimises the maximum number of guesses.

On this front, there's a great post on Puzzling Stack Exchange which details this (https://puzzling.stackexchange.com/questions/114316/whats-the-optimal-strategy-for-wordle) and contains a 'solution' to Wordle (although the second aspect, as to which starting word minimises the average maximum number of guesses required, still seems to be open).

👍 22 · u/ConstantAndVariable · Feb 06 2022

I have two questions. (He may have already addressed them to a certain degree in the video, but I didn't catch it since I didn't pay 100% attention during the whole 30 minutes. Please let me know if that's the case.)


In general, would "only using words that fulfill the patterns revealed in previous guesses" be a better strategy than picking the one that maximizes (average) new information?

The answer seems to be obviously no, at least for the first few tries, but I'm still kinda curious (like, if the answer set is much smaller, surely it would help to try to directly guess the answer earlier, right?)


In his second strategy, he uses the frequency of words (not literally, but fitted into a sigmoid function) as a weight.

The idea here is that the answer is more likely to be a common word, but from a game design perspective (even with no knowledge about the word list in the source code), it makes more sense that there would be a cutoff somewhere rather than a smooth transition, and that after the cutoff all the words are equally likely. I.e., a "more common" word is not going to be more likely to be the answer than a "slightly less common" word once both have made it into the list of possible answers. We may still want to assign a low non-zero probability to other words since we don't know where the cutoff is.

In other words, the probability should be a step function. I guess he uses a sigmoid function simply for ease of computation? Anyway, in one of his demos there are only two choices left (words and dorms), and one has a significantly higher P, which sounds wrong. But then again, in these "pick 1 from 2" scenarios it doesn't really matter. I'm just curious whether it may affect the strategy in other cases.

👍 8 · u/fireattack · Feb 06 2022

I always open with penis.

👍 13 · u/darthlobster603 · Feb 06 2022

My first two words contain the 10 most popular letters in the English language. I recognize this isn’t perfect, and considered doing some additional work on it, but thought better of it since I’m already solving at a 100% rate.

👍 3 · u/JuuliusCaesar69 · Feb 06 2022

Great video. The way entropy is used here doesn't guarantee an optimal strategy, though. The problem is that not every group of words is the same. You could have a relatively large group of words with lots of distinct letters between them that can be completely disambiguated with a single word, and on the other hand you might have a smaller set of words that all share similar letters, like [bills, dills, fills, hills, kills, ...], which can't be disambiguated in a single guess, and therefore at least some of those words require three or more guesses. Therefore the high-entropy/low-probability event (the group of words is small) doesn't exactly correspond to the target of "will be solved in the fewest possible guesses".

One way to find the best solution is to use a search tree, where you evaluate the top N candidates based on the above heuristic at each branch. Here's a nice write-up.

👍 3 · u/xloper · Feb 07 2022

Here's a fun idea:

Suppose Wordle is a 2-player game (player A and player B). A picks a word for B to guess, and B picks a word for A to guess. The winner is the player who guesses the word in the fewest turns.

👍 2 · u/No-Eggplant-5396 · Feb 07 2022
Captions
The game Wordle has gone pretty viral in the last month or two, and never one to overlook an opportunity for a math lesson, it occurs to me that this game makes for a very good central example in a lesson about information theory, and in particular a topic known as entropy. You see, like a lot of people I got kind of sucked into the puzzle, and like a lot of programmers I also got sucked into trying to write an algorithm that would play the game as optimally as it could. What I thought I'd do here is just talk through with you some of my process in that, and explain some of the math that went into it, since the whole algorithm centers on this idea of entropy.

First things first, in case you haven't heard of it, what is Wordle? To kill two birds with one stone here, while we go through the rules of the game let me also preview where we're going with this, which is to develop a little algorithm that will basically play the game for us. I haven't done today's Wordle, this is February 4th, and we'll see how the bot does.

The goal of Wordle is to guess a mystery five-letter word, and you're given six different chances to guess. For example, my wordlebot suggests that I start with the guess "crane". Each time that you make a guess, you get some information about how close your guess is to the true answer. Here the gray box is telling me there's no 'c' in the actual answer, the yellow box is telling me there is an 'r' but it's not in that position. The green box is telling me that the secret word does have an 'a' and it's in the third position. And then there's no 'n' and there's no 'e'. Let me just go in and tell the wordlebot about that information... we started with "crane", we got gray, yellow, green, gray, gray. Don't worry about all the data that it's showing right now, I'll explain that in due time.

Its top suggestion for our second pick is "shtik". Your guess does have to be an actual five-letter word, but as you'll see it's pretty liberal with what it will actually let you guess. In this case we try "shtik" and... all right! Things are looking pretty good. We hit the 's' and the 'h', so we know the first three letters, and we know that there's an 'r'. So it's going to be like s-h-a-something-r or s-h-a-r-something. And it looks like the wordlebot knows that it's down to just two possibilities, either "shard" or "sharp". It's kind of a toss-up between them at this point, so I guess probably just because it's alphabetical it goes with "shard", which... hooray! It is the actual answer. So we got it in three.

If you're wondering if that's any good, the way I heard one person phrase it is that with Wordle, four is par and three is birdie, which I think is a pretty apt analogy. You have to be consistently on your game to be getting four, but it's certainly not crazy. But when you get it in three, it just feels great.

If you're down for it, what I'd like to do here is just talk through my thought process from the beginning for how I approached the wordlebot. And like I said, really it's an excuse for an information theory lesson; the main goal is to explain what is information and what is entropy.

My first thought in approaching this was to take a look at the relative frequencies of different letters in the English language. I thought, okay, is there an opening guess or an opening pair of guesses that hits a lot of these most frequent letters? One that I was pretty fond of was doing "other" followed by "nails".
The thought is that if you hit a letter, you know you get a green or a yellow, and that always feels good, it feels like you're getting information. But in these cases even if you don't hit and you always get grays, that's still giving you a lot of information, since it's pretty rare to find a word that doesn't have any of these letters. But even still, that doesn't feel super systematic, because for example it does nothing to consider the order of the letters. Why type "nails" when I could type "snail"? Is it better to have that 's' at the end? I'm not really sure.

Now a friend of mine said that he liked to open with the word "weary", which kind of surprised me because it has some uncommon letters in there, like the 'w' and the 'y'. But who knows, maybe that is a better opener. Is there some kind of quantitative score that we can give to judge the quality of a potential guess?

To set up for the way that we're going to rank possible guesses, let's go back and add a little clarity to how exactly the game is set up. There's a list of words that it will allow you to enter, that are considered valid guesses, and that's just about 13,000 words long. But when you look at it, there are a lot of really uncommon things, like "aahed" or "aalii" and "aargh", the kind of words that bring about family arguments in a game of Scrabble. But the vibe of the game is that the answer is always going to be a decently common word, and in fact there's another list of around 2,300 words that are the possible answers. This is a human-curated list, I think specifically by the game creator's girlfriend, which is kind of fun.

But what I would like to do, our challenge for this project, is to see if we can write a program solving Wordle that doesn't incorporate previous knowledge about this list. For one thing, there are plenty of pretty common five-letter words that you won't find in that list, so it would be better to write a program that's a little more resilient and would play Wordle against anyone, not just what happens to be on the official website. And also, the reason that we know what this list of possible answers is, is because it's visible in the source code, but the way that it's visible in the source code is in the specific order in which answers come up from day to day, so you could always just look up what tomorrow's answer will be. So clearly there's some sense in which using the list is cheating, and what makes for a more interesting puzzle and a richer information theory lesson is to instead use some more universal data, like relative word frequencies in general, to capture this intuition of having a preference for more common words.

So! Of these 13,000 possibilities, how should we choose the opening guess? For example, if my friend proposes "weary", how should we analyze its quality? Well, the reason he said he likes that unlikely 'w' is that he likes the long-shot nature of just how good it feels if you do hit that 'w'. For example, if the first pattern revealed was something like this, then it turns out there are only 58 words in this giant lexicon that match that pattern, so that's a huge reduction from 13,000. But the flip side of that, of course, is that it's very uncommon to get a pattern like this. Specifically, if each word were equally likely to be the answer, the probability of hitting this pattern would be 58 divided by around 13,000.
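Counting matches like that 58 comes down to one helper: computing the green/yellow/gray clue a guess would produce against a hypothetical answer. A minimal Python sketch (not the code from the video; the tiny `word_list` below is just a stand-in for the ~13,000-word guess list):

```python
from collections import Counter

GRAY, YELLOW, GREEN = 0, 1, 2

def wordle_pattern(guess: str, answer: str) -> tuple:
    """Return the clue as a tuple of 5 values (0 = gray, 1 = yellow, 2 = green)."""
    pattern = [GRAY] * 5
    # Letters of the answer that are not exact (green) matches remain available for yellows.
    unmatched = Counter(a for g, a in zip(guess, answer) if g != a)
    for i, (g, a) in enumerate(zip(guess, answer)):
        if g == a:
            pattern[i] = GREEN
        elif unmatched[g] > 0:      # the letter appears elsewhere and hasn't been used up
            pattern[i] = YELLOW
            unmatched[g] -= 1
    return tuple(pattern)

# Counting matches for a clue: how many words would have produced the clue we saw?
word_list = ["wordy", "wormy", "shard", "sharp", "crane", "weary"]
clue = wordle_pattern("weary", "wordy")     # w green, r yellow, y green, e and a gray
matches = [w for w in word_list if wordle_pattern("weary", w) == clue]
print(clue, matches)                        # (2, 0, 0, 1, 2) ['wordy', 'wormy']
```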
Of course, they're not equally likely to be answers; most of these are very obscure and even questionable words. But at least for our first pass at all of this, let's assume that they're all equally likely, then refine that a bit later. The point is, a pattern with a lot of information is by its very nature unlikely to occur. In fact, what it means to be informative is that it's unlikely. A much more probable pattern to see with this opening would be something like this, where of course there's not a 'w' in it, maybe there's an 'e', and maybe there's no 'a', there's no 'r', and there's no 'y'. In this case there are 1,400 possible matches, so if all were equally likely, it works out to be a probability of about 11% that this is the pattern you would see. So the most likely outcomes are also the least informative.

To get a more global view here, let me show you the full distribution of probabilities across all of the different patterns that you might see. Each bar that you're looking at corresponds to a possible pattern of colors that could be revealed, of which there are 3^5 possibilities. And they're organized from left to right, most common to least common. So the most common possibility here is that you get all grays; that happens about 14% of the time. What you're hoping for when you make a guess is that you end up somewhere out in this long tail, like over here where there are only 18 possibilities for what matches this pattern, which evidently look like this. Or if we venture a little farther to the left... you know, maybe we go all the way over here... okay, here's a good puzzle for you. What are the three words in the English language that start with a 'w', end with a 'y', and have an 'r' somewhere in them? It turns out the answers are... let's see... "wordy", "wormy", and "wryly".

To judge how good this word is overall, we want some kind of measure of the expected amount of information that you're going to get from this distribution. If we go through each pattern and we multiply its probability of occurring times something that measures how informative it is, that can maybe give us an objective score. Now, your first instinct for what that something should be might be the number of matches, you know, you want a lower average number of matches, but instead I'd like to use a more universal measurement that we often ascribe to information, and one that will be more flexible once we have a different probability assigned to each of these 13,000 words for whether or not they're actually the answer.

The standard unit of information is the bit, which has a little bit of a funny formula, but it's really intuitive if we just look at examples. If you have an observation that cuts your space of possibilities in half, we say that it has one bit of information. In our example, the space of possibilities is all possible words, and it turns out about half of the five-letter words have an 's', a little less than that but about half. So that observation would give you one bit of information. If instead a new fact chops down that space of possibilities by a factor of four, we say that it has two bits of information. For example, it turns out about a quarter of these words have a 't'. If the observation cuts that space by a factor of eight, we say it has three bits of information, and so on and so forth. Four bits cuts it into a sixteenth, five bits cuts it into a 32nd.
So now is when you might want to take a moment to pause and ask for yourself: what is the formula for information, for the number of bits, in terms of the probability of an occurrence? Well, what we're saying here is basically that when you take one half to the number of bits, that's the same thing as the probability, which is the same thing as saying 2 to the power of the number of bits is 1 over the probability, which rearranges further to saying the information is the log base 2 of 1 divided by the probability. And sometimes you see this with one more rearrangement still, where the information is the negative log base 2 of the probability. Expressed like this it can look a little bit weird to the uninitiated, but it really is just the very intuitive idea of asking how many times you've cut down your possibilities in half.

Now if you're wondering, you know, I thought we were just playing a fun word game, why are logarithms entering the picture? One reason this is a nicer unit is that it's just a lot easier to talk about very unlikely events. It's much easier to say that an observation has 20 bits of information than it is to say that the probability of such and such occurring is 0.00000095. But a more substantive reason that this logarithmic expression turned out to be a very useful addition to the theory of probability is the way that information adds together. For example, if one observation gives you two bits of information, cutting your space down by four, and then a second observation, like your second guess in Wordle, gives you another three bits of information, chopping you down further by another factor of eight, the two together give you five bits of information. In the same way that probabilities like to multiply, information likes to add. So as soon as we're in the realm of something like an expected value, where we're adding a bunch of numbers up, the logs make it a lot nicer to deal with.

Let's go back to our distribution for "weary" and add another little tracker on here showing us how much information there is for each pattern. The main thing I want you to notice is that the higher the probability, as we get to those more likely patterns, the lower the information, the fewer bits you gain. The way we measure the quality of this guess will be to take the expected value of this information, where we go through each pattern, we say how probable it is, and then we multiply that by how many bits of information we get. And in the example of "weary", that turns out to be 4.9 bits.

So on average, the information you get from this opening guess is as good as chopping your space of possibilities in half about five times. By contrast, an example of a guess with a higher expected information value would be something like "slate". In this case you'll notice the distribution looks a lot flatter; in particular, the most probable occurrence of all grays only has about a 6% chance of occurring. So at minimum you're getting, evidently, 3.9 bits of information. But that's a minimum; more typically you'd get something better than that. And it turns out when you crunch the numbers on this one and you add up all of the relevant terms, the average information is about 5.8. So in contrast with "weary", your space of possibilities will be about half as big after this first guess, on average.
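As a quick sanity check of that formula (a few lines of Python, not from the video): information in bits is -log2(p), and information from independent observations adds while their probabilities multiply.

```python
from math import log2

def information_bits(p: float) -> float:
    """Bits of information from observing an event of probability p."""
    return -log2(p)

print(information_bits(1/2))   # 1.0  (cuts the space in half)
print(information_bits(1/4))   # 2.0
print(information_bits(1/8))   # 3.0

# Information adds where probabilities multiply:
p1, p2 = 1/4, 1/8
print(information_bits(p1) + information_bits(p2))  # 5.0
print(information_bits(p1 * p2))                    # 5.0

# "20 bits" is easier to say than its probability:
print(2 ** -20)   # ~9.5e-07, the 0.00000095 mentioned above
```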
There's actually a fun story about the name for this expected value of information quantity. You see, information theory was developed by Claude Shannon, who was working at Bell Labs in the 1940s. He was talking about some of his yet-to-be-published ideas with John von Neumann, who was this intellectual giant of the time, very prominent in math and physics and the beginnings of what was becoming computer science. And when he mentioned that he didn't really have a good name for this expected value of information quantity, von Neumann supposedly said, so the story goes, "Well, you should call it entropy, and for two reasons. In the first place your uncertainty function has been used in statistical mechanics under that name, so it already has a name. And in the second place, and more important, nobody knows what entropy really is, so in a debate you'll always have the advantage."

So if the name seems a little bit mysterious, and if this story is to be believed, that's kind of by design. Also, if you're wondering about its relation to all of that second law of thermodynamics stuff from physics, there definitely is a connection, but in its origins Shannon was just dealing with pure probability theory. And for our purposes here, when I use the word entropy, I just want you to think of the expected information value of a particular guess.

You can think of entropy as measuring two things simultaneously. The first one is how flat the distribution is. The closer a distribution is to uniform, the higher that entropy will be. In our case, where there are 3^5 total patterns, for a uniform distribution, observing any one of them would have information log_2(3^5), which happens to be 7.92. So that is the absolute maximum that you could possibly have for this entropy. But entropy is also kind of a measure of how many possibilities there are in the first place. For example, if you happen to have some word where there are only 16 possible patterns, and each one is equally likely, this entropy, this expected information, would be four bits. But if you have another word where there are 64 possible patterns that could come up, and they're all equally likely, then the entropy would work out to be six bits. So if you see some distribution out in the wild that has an entropy of six bits, it's sort of like it's saying there's as much variation and uncertainty in what's about to happen as if there were 64 equally likely outcomes.

For my first pass at the wordlebot, I basically had it just do this. It goes through all of the different possible guesses that you could have, all 13,000 words. It computes the entropy for each one, or more specifically the entropy of the distribution across all patterns that you might see for each one, and then it picks the highest, since that's the one that's likely to chop down your space of possibilities as much as possible.

And even though I've only been talking about the first guess here, it does the same thing for the next few guesses. For example, after you see some pattern on that first guess, which would restrict you to a smaller number of possible words based on what matches with that, you just play the same game with respect to that smaller set of words. For a proposed second guess, you look at the distribution of all patterns that could occur from that more restricted set of words.
You search through all 13,000 possibilities, and you find the one that maximizes that entropy.

To show you how this works in action, let me just pull up a little variant of Wordle that I wrote that shows the highlights of this analysis in the margins. So after doing all its entropy calculations, on the right here it's showing us which ones have the highest expected information. It turns out the top answer, at least at the moment, we'll refine this later, is "tares", which means... um... of course, a vetch, the most common vetch.

Each time we make a guess here, where maybe I kind of ignore its recommendations and go with "slate", because I like "slate", we can see how much expected information it had. But then on the right of the word here it's showing us how much actual information we got, given this particular pattern. So here it looks like we were a little unlucky. We were expecting to get 5.8, but we happened to get something less than that. And then on the left side here it's showing us all of the different possible words given where we are now. The blue bars are telling us how likely it thinks each word is, so at the moment it's assuming each word is equally likely to occur, but we'll refine that in a moment.

And then this uncertainty measurement is telling us the entropy of this distribution across the possible words, which right now, because it's a uniform distribution, is just a needlessly complicated way to count the number of possibilities. For example, if we were to take 2 to the power of 13.66, that should be around the 13,000 possibilities. It's a little bit off here, but only because I'm not showing all the decimal places.

At the moment that might feel redundant, like it's overly complicating things, but you'll see why it's useful to have both numbers in a minute. Here it looks like it's suggesting the highest entropy for our second guess is "ramin", which again... just really doesn't feel like a word. So to take the moral high ground here, I'm going to go ahead and type in "rains". Again it looks like we were a little unlucky; we were expecting 4.3 bits and we only got 3.39 bits of information. So that takes us down to 55 possibilities. And here maybe I'll just actually go with what it's suggesting, which is "kombu", whatever that means.

Okay! This is actually a good chance for a puzzle. It's telling us this pattern gives us 4.78 bits of information, but over on the left, before we saw that pattern, there were 5.78 bits of uncertainty. So as a quiz for you, what does that mean about the number of remaining possibilities? Well, it means that we're reduced down to 1 bit of uncertainty, which is the same thing as saying that there are two possible answers, it's a 50/50 choice. And from here, because you and I know which words are more common, we know that the answer should be "abyss". But as it's written right now, the program doesn't know that, so it just keeps going, trying to gain as much information as it can until there's only one possibility left, and then it guesses it. So obviously we need a better endgame strategy, but let's say we call this version one of our Wordle solver, and then we go and run some simulations to see how it does.
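Here is a rough sketch of what that first version boils down to, under the simplifying assumption that every remaining word is equally likely (this is a paraphrase, not the video's actual code; the clue helper is repeated from the earlier sketch so the block is self-contained):

```python
from collections import Counter
from math import log2

def wordle_pattern(guess, answer):
    # Same clue-computing helper as in the earlier sketch: 0 = gray, 1 = yellow, 2 = green.
    pattern = [0] * 5
    unmatched = Counter(a for g, a in zip(guess, answer) if g != a)
    for i, (g, a) in enumerate(zip(guess, answer)):
        if g == a:
            pattern[i] = 2
        elif unmatched[g] > 0:
            pattern[i] = 1
            unmatched[g] -= 1
    return tuple(pattern)

def expected_information(guess, possible_words):
    """Entropy (in bits) of the clue distribution for `guess`, with a uniform prior."""
    counts = Counter(wordle_pattern(guess, answer) for answer in possible_words)
    total = len(possible_words)
    return -sum((n / total) * log2(n / total) for n in counts.values())

def best_guess_v1(allowed_guesses, possible_words):
    """Version 1: play whichever allowed guess has the highest expected information."""
    return max(allowed_guesses, key=lambda g: expected_information(g, possible_words))

def consistent_words(possible_words, guess, observed_clue):
    """After seeing a clue, keep only the words that would have produced it."""
    return [w for w in possible_words if wordle_pattern(guess, w) == observed_clue]

# Usage sketch: repeatedly guess, observe the clue, and filter:
#   guess = best_guess_v1(allowed_guesses, possible_words)
#   possible_words = consistent_words(possible_words, guess, clue_we_saw)
```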
The way this simulation works is that it's playing every possible Wordle game. It goes through all of those 2,315 words that are the actual Wordle answers, basically using that as a testing set. And with this naive method of not considering how common a word is, and just trying to maximize the information at each step along the way until it gets down to one and only one choice, by the end of the simulation the average score works out to be about 4.124. Which... you know, it's not bad. To be honest, I kind of expected it to do worse. But the people who play Wordle will tell you that they can usually get it in four; the real challenge is to get as many in three as you can. It's a pretty big jump between a score of four and a score of three. The obvious low-hanging fruit here is to somehow incorporate whether or not a word is common, so how exactly do we do that?

The way I approached it is to get a list of the relative frequencies for all of the words in the English language. I just used Mathematica's WordFrequencyData function, which itself pulls from the Google Books English n-gram public dataset. And it's kind of fun to look at; for example, if we sort it from the most common words to the least common words, evidently these are the most common five-letter words in the English language. Or rather, "these" is the eighth most common. First is "which", after which there's "there" and "their". "First" itself is not first but ninth, and it makes sense that these other words could come about more often, where those after "first" are "after", "where", and "those", being just a little bit less common.

Now, in using this data to model how likely each of these words is to be the final answer, it shouldn't just be proportional to the frequency. For example, "which" is given a score of 0.002 in this dataset, whereas the word "braid" is in some sense about a thousand times less likely. But both of these are common enough words that they're almost certainly worth considering, so we want more of a binary cutoff.

The way I went about it is to imagine taking this whole sorted list of words, arranging it on an x-axis, and then applying the sigmoid function, which is the standard way to have a function whose output is basically binary, it's either zero or it's one, but with a smoothing in between for that region of uncertainty. So essentially the probability that I'm assigning to each word for being in the final list will be the value of the sigmoid function above wherever it sits on the x-axis.

Now obviously this depends on a few parameters. For example, how wide a space on the x-axis those words fill determines how gradually or steeply we drop off from one to zero, and where we situate them left to right determines the cutoff. And to be honest, the way I did this was kind of just licking my finger and sticking it into the wind. I looked through the sorted list and tried to find a window where, when I looked at it, I figured about half of these words are more likely than not to be the final answer, and I used that as the cutoff.
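A minimal sketch of that kind of prior (the particular center and width below are made-up placeholders, not the values used in the video): words are sorted from most to least frequent, spread along an axis, and a sigmoid assigns each one a near-0-or-1 weight, which is then normalized into a probability of being the answer.

```python
from math import exp

def sigmoid(x: float) -> float:
    return 1 / (1 + exp(-x))

def word_priors(words_by_frequency, n_common=3000, width=600):
    """Assign each word a probability of being the answer.

    words_by_frequency: list of words sorted from most to least common.
    n_common, width:    hypothetical cutoff position and softness of the cutoff.
    """
    weights = {}
    for rank, word in enumerate(words_by_frequency):
        # High weight well before the cutoff, low weight well after it.
        weights[word] = sigmoid((n_common - rank) / width)
    total = sum(weights.values())
    return {word: w / total for word, w in weights.items()}
```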
Now once we have a distribution like this across the words, it gives us another situation where entropy becomes a really useful measurement. For example, let's say we were playing a game and we start with my old openers, which were "other" and "nails", and we end up with a situation where there are four possible words that match it, and let's say we consider them all equally likely. Let me ask you, what is the entropy of this distribution? Well, the information associated with each one of these possibilities is going to be log_2(4), since each one is 1/4, and that's 2. It's two bits of information, 4 possibilities, all very well and good.

But! What if I told you that actually there are more than four matches? In reality, when we look through the full word list, there are 16 words that match it. But suppose our model puts a really low probability on those other 12 words actually being the final answer, something like one in a thousand, because they're really obscure. Now let me ask you, what is the entropy of this distribution?

If entropy were purely measuring the number of matches here, then you might expect it to be something like log_2(16), which would be 4, two more bits of uncertainty than we had before. But of course, the actual uncertainty is not really that different from what we had before. Just because there are these 12 really obscure words doesn't mean that it would be all that more surprising to learn that the final answer is "charm", for example. So when you actually do the calculation here, and you add up the probability of each occurrence times the corresponding information, what you get is 2.11 bits. It's saying it's basically two bits, it's basically those four possibilities, but there's a little more uncertainty because of all of those highly unlikely events, though if you did learn that one of them was the answer, you'd get a ton of information from it.

So zooming out, this is part of what makes Wordle such a nice example for an information theory lesson. We have these two distinct-feeling applications for entropy: the first one telling us what's the expected information we'll get from a given guess, and the second one saying can we measure the remaining uncertainty among all of the words that are possible.

And I should emphasize, in that first case where we're looking at the expected information of a guess, once we have an unequal weighting to the words, that affects the entropy calculation. For example, let me pull up that same case we were looking at earlier of the distribution associated with "weary", but this time using a non-uniform distribution across all possible words. Let me see if I can find a part here that illustrates it pretty well... uh, okay, here, this is pretty good. Here we have two adjacent patterns that are about equally likely, but one of them, we're told, has 32 possible words that match it. And if we check what they are, these are those 32, which are all just very unlikely words. As you scan your eyes over them, it's hard to find any that feel like plausible answers. Maybe "yells"?

But if we look at the neighboring pattern in the distribution, which is considered just about as likely, we're told that it only has eight possible matches. So a quarter as many matches, but it's about as likely. And when we pull up those matches, we can see why: some of these are actually plausible answers, like "wring" or "wrath" or "wraps".
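Once the prior is non-uniform, the same entropy formula applies with the word probabilities in place of equal pattern counts. A quick check of the 2.11-bit figure from the four-plausible-plus-twelve-obscure-words example above (the exact probabilities below are a reconstruction, chosen so the twelve obscure words sit near one in a thousand each):

```python
from math import log2

def entropy(probabilities):
    """Entropy in bits of a distribution given as a list of probabilities."""
    return -sum(p * log2(p) for p in probabilities if p > 0)

# Four plausible words share almost all of the probability...
obscure = [0.001] * 12                 # ...twelve obscure words get about 1/1000 each
common = [(1 - sum(obscure)) / 4] * 4
print(entropy(common + obscure))       # about 2.11 bits

# Versus treating all 16 matches as equally likely:
print(entropy([1 / 16] * 16))          # 4.0 bits
```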
To illustrate how we incorporate all of that, let me pull up version 2 of the wordlebot here. There are two or three main differences from the first one that we saw. First off, like I just said, the way that we're computing these entropies, these expected values of information, now uses the more refined distribution across the patterns, which incorporates the probability that a given word would actually be the answer. As it happens, "tares" is still number one, though the ones following it are a bit different. Second, when it ranks its top picks, it's now going to keep a model of the probability that each word is the actual answer, and it'll incorporate that into its decision, which is easier to see once we have a few guesses on the table. Again ignoring its recommendation, because we can't let machines rule our lives...

And I suppose I should mention another thing that's different here: over on the left, that uncertainty value, that number of bits, is no longer just redundant with the number of possible matches. Now if we pull it up and calculate, say, 2 to the 8.02, that would be a little above 256... I guess 259. What it's saying is that even though there are 526 total words that actually match this pattern, the amount of uncertainty it has is more akin to what it would be if there were 259 equally likely outcomes. You could think of it like this: it knows "borks" is not the answer, same with "yortz" and "zoril" and "zorus". So it's a little less uncertain than it was in the previous case; this number of bits will be smaller. And if I keep playing the game, I'll keep refining this down with a couple of guesses that are apropos of what I would like to explain here...

By the fourth guess, if you look over at its top picks, you can see it's no longer just maximizing the entropy. At this point, there are technically seven possibilities, but the only ones with a meaningful chance are "dorms" and "words", and you can see it ranks choosing both of those above all of these other values that, strictly speaking, would give more information.

The very first time I did this, I just added up these two numbers to measure the quality of each guess, which actually worked better than you might suspect. But it really didn't feel systematic. I'm sure there are other approaches people could take, but here's the one I landed on. If we're considering the prospect of a next guess, like in this case "words", what we really care about is the expected score of our game if we do that. And to calculate that expected score, we say, "What's the probability that 'words' is the actual answer?", which at the moment it ascribes 58% to. So we say: with a 58% chance, our score in this game would be four, and with probability one minus that 58 percent, our score will be more than four.

How much more? We don't know, but we can estimate it based on how much uncertainty there's likely to be once we get to that point. Specifically, at the moment there are 1.44 bits of uncertainty; if we guess "words", it's telling us the expected information we'll get is 1.27 bits, so this difference represents how much uncertainty we're likely to be left with after that happens. What we need is some kind of function, which I'm calling f here, that associates this uncertainty with an expected score.
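Putting those pieces together, the ranking of a candidate guess looks roughly like the sketch below (a paraphrase of the idea, not the code used in the video). Here f is the still-to-be-fit function mapping leftover uncertainty to an expected number of additional guesses, and the placeholder fit at the bottom is purely illustrative.

```python
def expected_score(guess, guess_number, p_is_answer,
                   current_uncertainty, expected_info, f):
    """Estimate the final score if we play `guess` as guess number `guess_number`.

    p_is_answer:         model's probability that `guess` is the answer
    current_uncertainty: entropy (bits) over the remaining candidates
    expected_info:       expected information (bits) from playing `guess`
    f:                   maps leftover uncertainty to expected extra guesses
    """
    leftover = current_uncertainty - expected_info
    # Either this guess is the answer, or we expect roughly f(leftover) more guesses.
    return (p_is_answer * guess_number
            + (1 - p_is_answer) * (guess_number + f(leftover)))

# The "words" example from the transcript: guess number 4, a 58% chance of being
# the answer, 1.44 bits of uncertainty, 1.27 bits of expected information,
# with a made-up placeholder for f.
f = lambda bits: 1 + 0.56 * bits   # hypothetical fit, for illustration only
print(expected_score("words", 4, 0.58, 1.44, 1.27, f))
```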
The way I went about this was to plot a bunch of data from previous games based on version one of the bot, to ask, "What was the actual score at various points with certain measurable amounts of uncertainty?" For example, these data points here that are sitting above a value around 8.7 or so are saying, "For some games, after a point at which there were 8.7 bits of uncertainty, it took two guesses to get the final answer. For other games it took three guesses, for other games it took four guesses."

If we shift over to the left here, all the points over zero are saying, "Whenever there are zero bits of uncertainty, which is to say there's only one possibility, then the number of guesses required is always just one," which is reassuring. Whenever there was one bit of uncertainty, meaning it was essentially just down to two possibilities, then sometimes it required one more guess, sometimes it required two more guesses, and so on and so forth. Here, maybe a slightly easier way to visualize this data is to bucket it together and take averages. For example, this bar here is saying, "Among all the points where we had one bit of uncertainty, on average the number of new guesses required was about 1.5." And the bar over here is saying, "Among all of the different games where at some point the uncertainty was a little above 4 bits, which is like narrowing it down to 16 different possibilities, then on average it required a little more than two guesses from that point forward." And from here I just did a regression to fit a function that seemed reasonable to this.

And remember, the whole point of doing any of that is so that we can quantify this intuition that the more information we gain from a word, the lower the expected score will be. So with this as version 2.0, if we go back and run the same set of simulations, having it play against all 2,315 possible Wordle answers, how does it do? Well, in contrast to our first version, it's definitely better, which is reassuring. All said and done, the average is around 3.6. Although unlike the first version, there are a couple of times that it loses and requires more than six guesses in this circumstance, presumably because there are times when it's making that trade-off to actually go for the goal, rather than maximizing information.

So can we do better than 3.6? We definitely can. I said at the start that it's most fun to try not incorporating the true list of Wordle answers into the way that it builds its model. But if we do incorporate it, the best performance I could get was around 3.43. So if we try to get more sophisticated than just using word frequency data to choose this prior distribution, this 3.43 probably gives a ceiling on how good we could get with that, or at least on how good I could get with it.

That best performance essentially just uses the ideas that I've been talking about here, but it goes a little farther. For example, it does a search for the expected information two steps forward, rather than just one. Originally I was planning on talking more about that, but I realize we've actually gone quite long as it is. The one thing I'll say is that after doing this two-step search, and then running a couple of sample simulations on the top candidates, so far, for me at least, it's looking like "crane" is the best opener. Who would have guessed?

Also, if you use the true word list to determine your space of possibilities, then the uncertainty you start with is a little over 11 bits.
And it turns out, just from a brute-force search, the maximum possible expected information after the first two guesses is around 10 bits, which suggests that, best-case scenario, after your first two guesses with perfectly optimal play, you'll be left with around one bit of uncertainty, which is the same as being down to two possible guesses. So I think it's fair, and probably pretty conservative, to say that you could never possibly write an algorithm that gets this average as low as 3, because with the words available to you, there's simply not room to get enough information after only two steps to be able to guarantee the answer in the third slot every single time without fail.
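The arithmetic behind that last claim, for anyone who wants to check it (assuming the 2,315-word answer list and the roughly 10-bit figure quoted above):

```python
from math import log2

starting_uncertainty = log2(2315)   # about 11.18 bits over the answer list
best_two_guess_info = 10.0          # approximate brute-force maximum cited above
leftover = starting_uncertainty - best_two_guess_info
print(leftover)                     # about 1.18 bits left after two optimal guesses
print(2 ** leftover)                # roughly 2.3 "equally likely" words remaining
```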
Info
Channel: 3Blue1Brown
Views: 8,368,749
Keywords: Mathematics, three blue one brown, 3 blue 1 brown, 3b1b, 3brown1blue, 3 brown 1 blue, three brown one blue
Id: v68zYyaEmEA
Length: 30min 38sec (1838 seconds)
Published: Sun Feb 06 2022