Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution (Paper Explained)

Captions
Hello there! Today we're going to look at "Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution". This is a paper by Google DeepMind, which is the new conglomerate between Google Brain and DeepMind, so it should be fun.

The basic idea behind this paper is that rather than writing your own prompts for large language models to do a particular task, you have a system that evolves prompts by itself. You just give it the description of a task, and it comes up with its own prompts in order to have a large language model solve the task. Obviously, the coming up with its own prompts is also done by large language models, via an evolutionary algorithm. Evolutionary algorithms, if you know them, rely on mutation and fitness evaluation, and in this particular paper even the mutation part is done by large language models. That's what makes it self-referential and self-improving: it improves its own method of generating improved prompts.

We're going to look into this. It all gets a little bit complicated, not necessarily complex, but complicated in the sense of "what are you actually doing right now?", because every single thing is handled by a large language model. But at the end of it, it's not that complicated. And while the results seem promising, if you dig into them a little, in my opinion this is certainly not the end of the problem. We haven't solved the problem at all; it just pushes the problem into a different domain. If you look at the actual results, I have a hard time believing that this here is the solution. Something is going on, maybe, but they do see considerable improvements, so we have to give them that.

The authors say that strategies like chain-of-thought prompting can dramatically improve the reasoning abilities of large language models. Say you have a family of tasks, like little math tasks: "Peter ate two apples and then Peter ate three apples; how many apples did Peter eat?" You can just ask this of a language model and it will tell you something, maybe even the correct answer. But what people have noticed is that you can dramatically improve the final answer if you say something like "how about you think step by step". You tell that to the language model, and thereby you force it to break the problem down into small chunks and then perform these chunks individually and explicitly, and via various techniques of doing that, you get a much better likelihood of the final answer being correct than by simply asking your question directly. People had to sit down and think to come up with these prompting techniques (chain of thought, tree of thought, step by step, and so on), which is why the paper calls them handcrafted prompt strategies, and even though they work, they are often suboptimal.

Promptbreeder, which is this paper, is what they call a general-purpose self-referential self-improvement mechanism that evolves and adapts prompts for a given domain. The point is that it is no longer a manual process where we sit down and think "okay, how am I going to write this: please solve this math problem, think step by step, on each line write the formula, then write the result", and so on. None of that anymore; all of this coming up with the correct prompts is now done by this mechanism. The mechanism, as I said, is an evolutionary algorithm that evolves task prompts.
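To make concrete what such a handcrafted prompt strategy amounts to mechanically, here is a minimal sketch in Python, with a hypothetical `query_llm` stub standing in for whatever LLM API you use: the strategy is literally a fixed string glued onto the question.

```python
# Minimal sketch of a handcrafted prompt strategy (e.g. chain of thought).
# `query_llm` is a hypothetical stub; swap in a real LLM API call.

def query_llm(prompt: str) -> str:
    return "<llm response for: %s>" % prompt[:40]  # placeholder

question = ("Peter ate two apples and then Peter ate three apples. "
            "How many apples did Peter eat?")

# Zero-shot: ask the question directly.
zero_shot = query_llm(question)

# Handcrafted strategy: prepend a fixed instruction to coax step-by-step reasoning.
cot = query_llm("Let's think step by step.\n" + question)

print(zero_shot)
print(cot)
```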
The clue of this paper is that Promptbreeder is not just improving task prompts; it is also improving the mutation prompts that improve those task prompts. That's the self-referential part.

In the introduction they say that prompting is central to the downstream performance of foundation models. If you buy into this ridiculous term "foundation models", then in my opinion it is used slightly incorrectly here, because prompting specifically refers to language models, not to foundation models in general. There are models I can prompt that would be classified as foundation models, but also models that would certainly not be classified as foundation models, even by the vague original terminology around that term, and there are foundation models that cannot be prompted, at least so far, according to that vague definition (which I guess comes from a marketing department rather than a research organization these days). That's just a pet peeve of mine. Fine: prompting is important, we get it.

But prompt strategies are very manually engineered, so people have asked themselves whether this prompt engineering can be automated, and there have already been papers in that direction. The Automatic Prompt Engineer (APE) is one attempt. It generates an initial distribution of prompts using another prompt that infers the problem from a number of input-output examples from the dataset. So you take a part of the dataset as a training set, and those examples are your demonstrations of how the task is done correctly, and from them you try to infer a prompt that would produce them. It is very much like classic machine learning: you take a set of examples of how something is correctly done, and then you train something that gives you the correct prompt. In that sense the prompt here is like the weights of a regression in the classic sense, and the examples are the data. It's a weird world, but that is how people have tackled automated prompt engineering.
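As a rough sketch of that APE-style induction step (the meta-prompt wording below is a paraphrase, not APE's exact template, and `query_llm` is again a hypothetical stub): you show the model input-output demonstrations and ask it to guess the instruction that produced them.

```python
# Sketch of APE-style prompt induction: infer an instruction from demonstrations.
# The meta-prompt wording here is illustrative, not APE's exact template.

def query_llm(prompt: str) -> str:
    return "<inferred instruction>"  # placeholder; call a real LLM here

demos = [
    ("2 + 3", "5"),
    ("7 + 4", "11"),
]

meta_prompt = ("I gave a friend an instruction. Based on these input-output "
               "pairs, what was the instruction?\n\n")
for x, y in demos:
    meta_prompt += f"Input: {x}\nOutput: {y}\n"
meta_prompt += "\nInstruction:"

candidate_prompt = query_llm(meta_prompt)
```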
However, those approaches have shown diminishing returns: as they go on trying to improve and improve the prompt, they hit a ceiling relatively quickly, and that's it. The authors here say: we propose a solution to the problem of diminishing returns via a diversity-maintaining evolutionary algorithm for the self-referential self-improvement of prompts for LLMs. So the claim is that previous automated prompt engineering efforts haven't worked out well because they didn't maintain diversity as they improved their prompts; they just improved and improved, and after the initial boost there is a limit in that direction. Maintaining diversity is a very well-known technique in a lot of these exploratory black-box optimizers, and that's what they do right here. If you want to learn more about these things, there is a lot of work on open-ended learning and, I guess, goal-free learning, originally by Kenneth Stanley and the people around him. I've made videos about this stuff before, and it's highly interesting; there is a wealth of things to be explored in this direction, and this paper goes a little bit into it. The idea is: we want to maintain diversity of solutions, because when one solution gets to a point where you can't improve it any longer, you may be able to take ideas or approaches from another place (because you've maintained that diversity), port them over, or combine two solutions, and thereby improve even more, rather than getting stuck in a dead end.

The first thing they show are the results. You can see that on various datasets, PB (Promptbreeder), in zero-shot and in few-shot inference, largely surpasses the other methods. It has to be said that for some of these, for example the one marked with an asterisk, it's not the original number if you want to compare to other papers, because they have subdivided the data into training and test samples. Yes, we are back in the world of training set and test set. LLMs used in zero-shot fashion can actually solve tasks sometimes; that was the point of the original "language models are multitask zero-shot learners" line of work, the GPT-2 paper maybe, or GPT-3, I don't remember. But here we are back in the world of actually using training data, because these prompt-improvement systems work by consuming training data of successful executions of the task.

So how does Promptbreeder work? This is the overview they give. It all starts with the initialization of a population of task prompts and mutation prompts. So they have two kinds of prompts; actually three kinds of prompts, plus some constant prompts. First, there are these thinking styles. It's a bit weird to me why they're here. In principle I can totally see that at some point this just didn't work as well as it should, and they said: how about we introduce different thinking styles to push up the diversity even more. The thinking styles are just a fixed list of thought heuristics, not specific to any problem whatsoever, things like "let's think step by step", "try to think outside the box", "try to reformulate the problem", or "try to break the problem down into sub-pieces". Completely generic, and probably introduced to make it work better. Second, the problem description is very specific to the task you want to solve, for example "solve this math problem, give the answer as a number". And third, there are the so-called mutation prompts. A mutation prompt takes one prompt and turns it into another prompt: you take a large language model, you feed in the mutation prompt plus the prompt, and the output is going to be the next prompt. A mutation prompt is going to be something like "change this instruction to make it more fun". That's a mutation prompt, because you can concatenate it with a prompt.
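Here is a minimal sketch of what that initialization could look like, under my reading of the paper; the exact concatenation order and all the strings are illustrative, and `query_llm` is a stub.

```python
# Sketch of population initialization: combine a random thinking style and a
# random mutation prompt with the fixed problem description, and let the LLM
# write an initial task prompt. All lists and wordings are illustrative.
import random

def query_llm(prompt: str) -> str:
    return "<generated task prompt>"  # placeholder; call a real LLM here

problem_description = "Solve the math word problem, giving your answer as an Arabic numeral."
thinking_styles = ["Let's think step by step.",
                   "Let's break the problem into sub-problems."]
initial_mutation_prompts = ["Change this instruction to make it more fun.",
                            "Rephrase this instruction as concisely as possible."]

def init_population(size: int) -> list[dict]:
    population = []
    for _ in range(size):
        style = random.choice(thinking_styles)
        mutation = random.choice(initial_mutation_prompts)
        task_prompt = query_llm(f"{mutation} {style} {problem_description}")
        # A unit of evolution keeps its mutation prompt paired with its task prompt.
        population.append({"mutation_prompt": mutation, "task_prompt": task_prompt})
    return population
```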
They have this structure right here: they put the word "INSTRUCTION", then the problem description, then "INSTRUCTION MUTANT", so you kind of coax the language model into giving you a variant of this prompt. At the very beginning they just mutate the original description of the task, like "solve this math word problem", but as the algorithm goes on, that original task description also changes. So these things here are P, the prompts, and these are M, the mutation prompts, and both are going to change: the prompts are going to improve, and the mutation prompts are also going to improve over time. At least, that's the idea. So those are the pieces that take part: there are prompts, there are mutation prompts, there is the original problem description, there is a list of thinking styles, and there is a little bit of structure, for example the "INSTRUCTION / INSTRUCTION MUTANT" scaffold, the meta-prompting so to say. And that's it, except for the mutation operators down here, but we'll get to those in a bit. The initial population we just generate like this: we take the original problem description and an initial set of mutation prompts, and we just generate a bunch of variants.

One thing that wasn't very clear to me when I first read the paper was the following. They have what they call a unit of evolution, and a unit of evolution is a mutation prompt and, as far as I understand, two prompts, P1 and P2. The mutation prompt could be something like we just saw: "reformulate this instruction to make it better, to make it more fun, to make it more concise", something like this. P1 and P2 are actual task prompts, like "solve this math problem". Those two are used as follows, at least for most tasks; it's not super clear, but this is how I figure it works. At evaluation time, when you actually get a task, you write P1, then you write the actual problem, the math question, and you let the LLM produce something. That could already be the solution. But then you put prompt P2 and let it produce something else, and that second output is what you check for correctness. So you have, for example, the chance of P1 instructing the model to solve the problem and P2 bringing the result into the correct format, which is a very common approach. First you instruct the model in some way to solve the problem, and usually these LLMs will put out something like "here is the solution:" followed by the solution, when what you wanted was just the solution. People have started putting things like "only provide the solution" into the initial prompt, but it works equally well if the first prompt really just focuses on solving the problem and the second focuses on getting the solution into the correct format. Other splits are thinkable, but essentially these two prompts together play the role of the prompt, except that you structure it so that you first put the first prompt with the problem, let the model produce something, then put the second prompt, let it produce something again, and check that final output for the answer. Apparently the APE paper introduced this format. It's another thing that makes it even harder to compare against plain zero-shot prompting, where you just say "please solve this problem": here you have two prompts and you kind of interleave them. In any case, that's how they do it.
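A sketch of that two-stage evaluation, as I understand it; the prompts, the naive answer check, and `query_llm` are all stand-ins.

```python
# Sketch of the two-stage evaluation described above: prompt P1 plus the
# question produces a working-out, then prompt P2 asks for the final,
# checkable answer. Stubs stand in for the LLM and the answer check.

def query_llm(prompt: str) -> str:
    return "<llm output>"  # placeholder; call a real LLM here

def evaluate(p1: str, p2: str, question: str, gold_answer: str) -> bool:
    working_out = query_llm(f"{p1}\n{question}")
    final = query_llm(f"{p1}\n{question}\n{working_out}\n{p2}")
    return gold_answer in final  # naive answer check, for illustration only

is_correct = evaluate(
    p1="Solve the math word problem.",
    p2="Therefore, the answer (as an Arabic numeral) is:",
    question="Peter ate two apples and then three apples. How many in total?",
    gold_answer="5",
)
```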
Now, why do they call this thing a unit of evolution? Because they always evolve these together: one mutation prompt goes with one actual task prompt (or two). That's not strictly necessary; you could just maintain a population of mutation prompts and a separate population of task prompts. But they do it because they always need to evaluate the fitness of these things. If you know evolutionary algorithms: you maintain a population of individuals, and you want to know the fitness of each, i.e. how good these prompts are and how good this mutation prompt is. The prompts you can easily evaluate: you take a bunch of questions from your training dataset, as we said, run them, and check whether the prompts lead to the correct answer. If they get the correct answer on a lot of these data points, they're good prompts. The mutation prompt, however, is a bit trickier. The way you evaluate it is that, because the unit keeps them together, you effectively always save the mutation prompt together with the prompts it produced, and then it's easy to say: if the prompts are good, that means at least in part the mutation prompt was good, because it produced these prompts and therefore probably didn't do something stupid. You could also check whether the new prompts it mutated are better than the old prompts it got as input, but in essence you keep them together because you need to evaluate the mutation prompts, and you evaluate them by proxy, via the prompts they produce. Keeping them together is not necessary engineering-wise, but that's what they do. That's why, if you look at the population down here, an individual is made up of a prompt and a mutation prompt (or alternatively two prompts and a mutation prompt), together with a fitness.

So what do we do now? As the algorithm progresses we're building up this population, as we said, and we have fitnesses. Usually you order the individuals by fitness and drop the ones with very low fitness: that one right there, out you go, replaced by another individual. The other individuals you get by taking the good ones in your population and mutating them. So you take the good ones, mutate them, put them back, order everything by fitness, evaluate, and cull away the ones with very low fitness. That's how evolution-based, population-based algorithms work.
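Put as code, one generation of such a loop might look like this; the selection details in the actual paper differ (this is the generic sort-cull-mutate version I just described), and all names are illustrative.

```python
# Sketch of one generation of the evolutionary loop: score every unit on a
# sample of training questions, keep the fitter half, and refill the
# population with mutated copies of survivors.
import random

def solves(unit: dict, question: str, answer: str) -> bool:
    return False  # placeholder; run the two-stage evaluation here

def mutate(unit: dict) -> dict:
    return dict(unit)  # placeholder; apply one of the nine mutation operators

def fitness(unit: dict, train_set: list[tuple[str, str]]) -> float:
    # Fraction of sampled training questions the unit's prompts answer correctly.
    sample = random.sample(train_set, k=min(20, len(train_set)))
    return sum(solves(unit, q, a) for q, a in sample) / len(sample)

def step(population: list[dict], train_set: list[tuple[str, str]]) -> list[dict]:
    scored = sorted(population, key=lambda u: fitness(u, train_set), reverse=True)
    survivors = scored[: len(scored) // 2]           # cull the low-fitness half
    children = [mutate(random.choice(survivors)) for _ in survivors]
    return survivors + children                      # population size stays roughly fixed
```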
How do they mutate these things? That's where the mutation operators come in. They have various ways of mutating the prompts themselves, which sometimes make use of the mutation prompt, and ways of evolving the mutation prompt as such. This is where it gets complex. These mutation operators either take a prompt and make a new prompt, so if you think of what happens to the unit of evolution (let's focus on a single prompt for now, not two), the unit (M, P) becomes (M, P'), as far as I understand: you replace one of the prompts with the new prompt, keeping the same mutation prompt. It's also possible for an operator to take absolutely nothing, or just the initial task description, and make up a new prompt. Then it's also possible to take a prompt and a mutation prompt and produce a new prompt; that's why we have the mutation prompts in the first place, and in that case, again, (M, P) is transformed into (M, P'). And then it's also possible to take a mutation prompt and make a new mutation prompt, which you can then use to turn the prompt into a new prompt. So they have all these different operations that they apply to the population in order to change it around, and they list them. They have nine operators for mutation, and the stated rationale for using this diverse set of operators is "to enable the LLM to explore a large space of cognitive methods of linguistic self-questioning, by repeatedly changing the framing of the problem as well as retrieving mental models expressed in natural language that can help tackle a given reasoning challenge".

So we'll go through the mutation operators. Keep in mind these are things they do to change the population around between rounds: every round, they mutate a bunch of the individuals they have, add them back into the population, order everything by fitness, cull away the bottom ones, and repeat. The mutation always concerns the prompts, but also the mutation prompts that are sometimes used to mutate the prompts.

Direct mutation is concerned with directly generating a new prompt, a new task prompt P. The first kind is what they call zero-order generation: they generate a new task prompt by concatenating the problem description with the prompt "a list of 100 hints". This is like the very start, when we just generated our initial list of prompts; we just do that again. In that sense we always have a generator of prompts straight from the original problem description, so we don't evolve into a place where we get lost and have nothing to do anymore with the original problem, which could be a degenerate state. It's regenerated from the problem description each time; this does not use any existing prompt or mutation prompt, it just generates new prompts and adds them to the population.

Then we have first-order prompt generation. Here is what they say: we concatenate the mutation prompt to the parent task prompt and pass it to the LLM to produce the mutated task prompt. For example, the mutation prompt "Say that instruction again in another way. DON'T use any of the words in the original instruction, there's a good chap." and the instruction "Solve the math word problem, giving your answer as an Arabic numeral." This is probably what you imagine when I tell you there are mutation prompts and there are prompts and we use the mutation prompts to mutate the prompts: this is exactly it. We give the mutation prompt, we give the prompt, the mutation prompt says something like "change this prompt to be better", and out comes a new prompt. The procedure, they say, is identical to the initialization method, except that a randomly sampled thinking-style string is not used. So in the initialization you use the thinking style, and here you don't.
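A sketch of that first-order operator, using the exact example strings quoted above; `query_llm` is a stub and the INSTRUCTION / INSTRUCTION MUTANT scaffold is the one from the paper.

```python
# Sketch of first-order prompt generation: concatenate the unit's mutation
# prompt with its current task prompt and let the LLM produce the mutant.

def query_llm(prompt: str) -> str:
    return "<mutated task prompt>"  # placeholder; call a real LLM here

def first_order_mutation(unit: dict) -> dict:
    prompt = (f"{unit['mutation_prompt']}\n"
              f"INSTRUCTION: {unit['task_prompt']}\n"
              f"INSTRUCTION MUTANT:")
    return {**unit, "task_prompt": query_llm(prompt)}

unit = {
    "mutation_prompt": ("Say that instruction again in another way. DON'T use any of "
                        "the words in the original instruction, there's a good chap."),
    "task_prompt": "Solve the math word problem, giving your answer as an Arabic numeral.",
}
mutant = first_order_mutation(unit)
```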
Now, the whole point of this paper is: can we do something hands-off that is not hand-engineered, so that we don't have to think about this prompting thing ourselves (since performance really depends on how you do it). Yet so many things in this paper are clearly done out of fiddling around, because someone discovered "oh, if I do it like this, it works better", "ooh, if I add the thinking styles at the beginning, it works a lot better". That's a bit my problem with it. I'd rather have had a paper that doesn't do a lot of the things this paper does to make the numbers better, but instead is a more solid piece of research, one that asks: what if we do the most general, straightforward, hands-off thing, and see how far that gets? That would have been so much more valuable than adding in thinking styles and adding in this and that. The fact that they have these nine mutation operators, and the exact way they construct them, is also pretty surely a result of tinkering and fiddling around. So in my estimation this just delays the problem to a different domain, namely the domain of meta-prompting. Yes, it is probably like teaching a person how to fish instead of giving them a fish, but to me it's a little bit too focused on teaching the person how to fish that exact one fish, in that exact one pond, with the exact one fishing line I give them, and only if they do it exactly my way; also, I'm going to kill the fish first and put it there so they can just grab it. Plus the fact that, obviously, all of this uses training data, which is somewhat against the spirit of working with LLMs. But never mind.

Then they say: look, we can condition not just on zero or one parent, but on a set of parents. They provide a filtered and numbered list of the current population of task prompts to the LLM and ask it to continue this list with new task prompts. So they go through all their units of evolution (back to the graphic: they go through all of these right here), grab just the prompts out of each, provide that list, and let the LLM simply continue it. But they filter it: they don't want entries that are too similar, so if two prompts are too similar by cosine similarity, one is thrown away. This encourages diversity. And, my bad: this particular mutation operator does not provide an ordered list, just an unordered list that is focused on diversity, filtering out very similar things.
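A sketch of that diversity-filtered list operator: drop prompts that are too similar to an already-kept one by embedding cosine similarity, then hand the numbered list to the LLM to continue. The toy `embed` function and the 0.95 threshold are assumptions; you'd use a real sentence encoder.

```python
# Sketch of the diversity-filtered list mutation: filter near-duplicate
# prompts by embedding cosine similarity, then ask the LLM to extend the list.
import math

def embed(text: str) -> list[float]:
    # Toy fixed-length embedding for illustration; use a real sentence encoder.
    vec = [float(ord(c)) for c in text[:16]]
    return vec + [0.0] * (16 - len(vec))

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    na, nb = math.sqrt(sum(x * x for x in a)), math.sqrt(sum(x * x for x in b))
    return dot / (na * nb or 1.0)

def diverse_subset(prompts: list[str], threshold: float = 0.95) -> list[str]:
    kept: list[str] = []
    for p in prompts:
        if all(cosine(embed(p), embed(q)) < threshold for q in kept):
            kept.append(p)
    return kept

def continuation_prompt(prompts: list[str]) -> str:
    kept = diverse_subset(prompts)
    listing = "\n".join(f"{i + 1}. {p}" for i, p in enumerate(kept))
    return f"Continue this list with a new instruction:\n{listing}\n{len(kept) + 1}."
```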
The next operator provides an ordered list of the same thing: a variant of the above in which the task prompts are listed in fitness order. They say preliminary experiments showed that the LLM is more likely to generate entries that are similar to the elements appearing later in the list. So they order the task prompts in the population by ascending order of fitness: the fittest ones are at the end of the list, where the LLM's bias is. But then they tell the LLM that this is a list of responses in descending order of fitness. So they lie to the language model. They order it in ascending order of fitness, so that the last entry, which they know the LLM focuses on by default, is the best one, but then they say: no, no, it's actually the other way around. In the prompt they claim a descending order of fitness, and they even clarify it, telling the model that the first index is the best response and that its output should resemble that one more than the last. They really try hard to convince the language model that it's the other way around. Why? They say that otherwise the model "is too biased towards producing a new entry that is too similar to the final entry", and that "the contradiction between the ascending ordering and the statement that it is a descending ordering appears to improve the diversity of sampling". (I'll sketch this construction in code just below, after the remaining operators.) To me, this is the most staggering example. I have nothing against weird tricks; the initial ones were things like the Unreal Engine trick for diffusion models, and I'm all for that, but not in a paper whose explicit reason for existing is saying: these other approaches just do handcrafted random things that people happen to find working, and that's a state of affairs we should fix, so here is an automated, hands-off system where you don't have to handcraft weird things anymore. Not in a paper like this. I get it, it improves the score, but in my opinion it diminishes the value of this particular paper. The finding itself is great and could have been its own paper, just saying "I found this weird trick", and that's very cool. In any case, as we go on, if you see things in this light, you'll notice a lot of the same pattern.

Next, lineage-based mutation: they provide the history of the individuals in the lineage of a unit of evolution, i.e. how this particular unit evolved over time, its ancestors. It was mutated and mutated and mutated and always remained in the population, which means its fitness probably increased over time. So they ask: could we coax the language model into continuing that list, and thereby make it even better? They tell the model this is a list of genotypes found in ascending order of quality, and you may notice: now they don't lie. Now they just say it's ascending order of quality, and they actually provide it in chronological order, which, given the evolutionary algorithm, is probably an ascending order of quality, because the lineage always survived in the past. And why would you do the trick in the first case and not the same trick here? Probably because they found that in the first case it actually improved things on their particular problem with their particular method, and in the second case it didn't. Again, same pattern.

And then there's hyper-mutation: how do we mutate the mutation prompts? Zeroth-order hyper-mutation is: we concatenate the original problem description to a randomly sampled thinking style and feed it to the LLM to generate a new mutation prompt. So the thinking styles act as a sort of meta-mutation prompt to generate a mutation prompt. Again, why do the thinking styles enter here? Who knows; it just worked better. Which thinking styles are in the list? Just a bunch.
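Here is the promised sketch of the ranked-list construction with the contradictory header; the header wording is my paraphrase of the trick, not the paper's verbatim string.

```python
# Sketch of the fitness-ranked list operator with the contradictory label:
# prompts are listed in ascending fitness (best last, where the LLM's
# imitation bias sits), but the header claims descending order of quality.

def ranked_list_prompt(scored_prompts: list[tuple[str, float]]) -> str:
    ascending = sorted(scored_prompts, key=lambda pf: pf[1])  # worst first, best last
    header = "INSTRUCTIONS LISTED IN DESCENDING ORDER OF QUALITY:\n"  # the "lie"
    body = "\n".join(f"{i + 1}. {p}" for i, (p, _) in enumerate(ascending))
    return header + body + f"\n{len(ascending) + 1}."
```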
Then first-order hyper-mutation: we concatenate the hyper-mutation prompt "Please summarize and improve the following instruction:" to a mutation prompt, so that the LLM generates a new mutation prompt. This thing right here, I know it's quite generic, but it is a handcrafted prompt. (It's sketched in code below.) Then you have Lamarckian evolution, where you give the model a bunch of examples of how the task worked out previously and try to get the prompt back out of those. And you have prompt crossover and context shuffling and so on, techniques that are used in evolutionary algorithms in general.
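And the promised sketch of first-order hyper-mutation: the mutation prompt is improved first, and the improved version is then applied to the task prompt. The hyper-mutation string is the one quoted from the paper; everything else is illustrative.

```python
# Sketch of first-order hyper-mutation: mutate the mutation prompt itself,
# then use the improved mutation prompt to mutate the task prompt.

def query_llm(prompt: str) -> str:
    return "<llm output>"  # placeholder; call a real LLM here

HYPER = "Please summarize and improve the following instruction:"

def first_order_hyper_mutation(unit: dict) -> dict:
    new_mutation = query_llm(f"{HYPER}\n{unit['mutation_prompt']}")
    new_task = query_llm(f"{new_mutation}\n"
                         f"INSTRUCTION: {unit['task_prompt']}\n"
                         f"INSTRUCTION MUTANT:")
    return {"mutation_prompt": new_mutation, "task_prompt": new_task}
```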
So, their results are okay: they're better than the other automated prompt generation methods, and also better than, for example, hand-designed prompts. They say that on this hate speech classification problem, Promptbreeder was able to evolve a prompt strategy consisting of two sequentially applied, relatively long prompts that scored 89%, an improvement over the hand-designed prompt "determine whether a text contains hate speech", which scores only 80%. I'm going to show you these prompts, but just look at what that means. The hand-designed prompt is "determine whether a text contains hate speech". I don't know how much time went into it, but it can't be more than 30 seconds, and it already gets 80%, zero-shot, without ever looking at any training data; I could come up with it right now. In order to get to 89%, this entire apparatus is being used, including training data, i.e. actually taking data from the dataset and using it as training data, which is often very expensive to collect. Plus, we now don't have one prompt; we have two sequentially applied, relatively long prompts that were evolved by an evolutionary algorithm that uses LLMs on top of LLMs. And you get nine percentage points out of it. Now granted, that's roughly 50% fewer mistakes, if you think of it like that, but still.

We can actually go look at this stuff; there's only the appendix remaining, where I want to show you a bunch of things. Here are curves, very typical of evolutionary algorithms: you start with generation zero, you evolve over time, and you see the Pareto frontier improving and the average fitness improving, which is pretty cool. Here's a list of mutator prompts; I'm going to guess the initial ones, or the evolved ones, this should be described somewhere, I'm not sure. I just found this one funny: "break free from conventional constraints, take the prompt to uncharted territories". What is the world coming to? And here's the list of thinking styles, things like "are there any stakeholders or individuals who are directly affected by the problem? What are their perspectives and needs?" I mean, sure, but also: this is completely engineered. Though it is more general; I totally believe that I can throw this thing at various different problems and it will do a good job at them, as long as they're in the same meta-domain as the problems we see here in this paper. So take my criticism with a grain of salt: it is a cool work and definitely a cool attempt, and I don't want to denigrate the paper as such.

Now let's look at some of these results, and that's going to be my last criticism. Oh, here is the evolved prompt for the hate speech detection. Look at that! You think: okay, yes, that's probably going to be better than just saying "does the text contain hate speech", but if you read it, I wonder what a human could do by just spending 15 minutes.

Here are the evolved mutation prompts. The top one is "please summarize and improve the following instruction". These are the most successful mutation prompts, evolved in a self-referential way, and you can see that the top-scoring mutation prompt is essentially what you would come up with initially. I mean, it's cool, but does it really necessitate this entire machinery? Also, here are the mutation operators that are the most effective: zeroth-order hyper-mutation, where you just come up with a new mutation prompt, is at the top. Lineage-based mutation is interestingly high, and first-order hyper-mutation also shows that reformulating the prompts in some way is important. What's also notable is that there is no super dominant mutation operator; all the mutation operators kind of contribute, which I would guess is a positive result for this paper, even though the zeroth-order things being at the top is probably different from what was hoped for. What you'd hope for is that the most intricate ones go to the top, the ones where you take a mutation prompt that was already evolved, evolve it further, and then use it to change a regular prompt, with that mutation being the really important one. Although, if you think about it, first-order hyper-mutation, where you mutate a mutation prompt and use the result to mutate a task prompt, would be the core of the paper, so I guess it's a good result that it's in third place.

Here are some examples after 1,600 mutations, in the two-prompt setup: prompt zero is a mutant and prompt one is a mutant, that's it. The tasks are, I don't even know exactly what the tasks are, something like AddSub, bits of math problems. They say themselves that in the few-shot evolution case, the contexts dominate and the task prompts often drift into nonsense; they are less critically determining of fitness than the evolved contexts. The evolved contexts: I didn't touch on this, but they also essentially evolve which few-shot examples get into the context, and determining those well can, I guess, make a difference. That being said, every time they use training data, they use completely in-domain training data, the same distribution as the test data.

There are more examples. For math problems: "solve the math word problem, giving your answer as an Arabic numeral", followed by "have you solved a problem like this before?". It's interesting that this is what works well, a sort of self-assessment thing, but I have my doubts that this is really the best prompt. If you look through these examples, you can definitely find more. These are also math problems: "I would solve the math word problem without using a calculator, giving my answer as an Arabic numeral". Okay, that's a fine prompt, I guess. But then the next prompt is "1 2 3 4": you just shout at the language model and hope that improves things. Okay, here is where I'm actually supposed to believe that this is a good prompt: it just shouts, and that's it. And here the first prompt is just a date range, which has nothing to do with dates; these are also math problems.
So: doubt. Press X to doubt. It's definitely cool work, but if you look at the actual results, it's sometimes questionable whether it really does something sensible, or whether the good numbers come from something else that contributes to them, something in the nature of these problems, or some kind of circumvention of them. And whether this is really the method that evolves these stellar prompts, especially such that it could generalize to other things, do more stuff, and be extended, I remain doubtful. Nevertheless, it is actual cool work, and I'm very happy that people are researching this and very excited to see where it goes from here; don't get me wrong. So thank you a lot to the people from Google DeepMind for researching this. I hope we can continue, I hope the next steps get better and better, and at some point we won't have to write prompts or meta-prompts or hyper-prompts or anything like this anymore. Excellent. Thanks for listening. Bye-bye.
Info
Channel: Yannic Kilcher
Views: 35,782
Keywords: deep learning, machine learning, arxiv, explained, neural networks, ai, artificial intelligence, paper, prompt engineering, think step by step, deepmind, google deepmind, google brain, prompts, llm prompts, automatic prompts, llm prompt engineering
Id: tkX0EfNl4Fc
Length: 46min 45sec (2805 seconds)
Published: Sat Oct 07 2023