George Hotz | Programming | Decision Transformer Reinforcement Learning (RL) | LunarLander | Part 1

Captions
What? This paper's pretty old, this paper's like two years old. No, this needs double... this doesn't... I don't know. I hate this. Does everybody struggle with this? I feel like I struggle with this stuff all the time, just getting the dimensions to be right. I'm sure everyone struggles with it. That's not true, George, there are super geniuses out there who don't struggle with it. Never mind, I believe that. Well, that's just not right. Oh, the last actions. Oh no, this doesn't work. Okay, why aren't we going back to... why can't I just predict the three together? Why don't I just group R, S, A together? I see no reason why that wouldn't work, because it's causal. So the question is kind of: where is the autoregressive nature? And they kind of don't answer that in the paper. Okay, so you put in a state, you get out an action, you take the action, right? Then you feed the action back in. That's great, that's normal GPT stuff. The reward and then the state. The question is, is that all done autoregressively when this samples the next action? No, because there's no sampling. This is dumb; I'm just going to do this unless someone can tell me why I can't. Speak now or forever hold your peace: why can't I just concatenate the dimensions together? Why does this actually have to be three steps of the Transformer? Could it be a causal problem? There definitely is causal masking there, but it's not like those outputs are even used, so why are they not just catted together? Can someone actually explain this to me?

Wait, Twitch gave you a substance warning? Are people watching my Twitch, or is there weed AI which detects if someone smokes weed on Twitch? Who bought weed AI? You know what I mean, it's like hot dog and not hot dog, but weed and no weed. We'll try vaping next stream and see if we get a contact warning. Wow, well, at least they gave you a warning. Everybody should be entitled to choice. My Chinese food will be here in 15 minutes, that's great.

Okay, so you see what I'm saying? Could we just concatenate these together? I'm just going to do that, that's so much simpler. So let's concatenate them here, just do it at the dimension level, and have a completely separate head which predicts the output. So let's just do that. Is there a reason I can't do that? Let's ask Perplexity. Ah, that one's too slow. That makes no sense, I don't really get this. It's really hard to check if you implemented it correctly. I'm not trying to implement it correctly; I'm trying to implement something that works. The only problem is I put in a fake... no, I don't even need to put in a fake action. What do I do about the first one? Yeah, I'll just make a fake action. Okay, so let's start thinking about it as (r, s, a), right, and that's a token. The start action is always fake; I don't think that's going to affect anything. I might explicitly want to call it a fake action, just do action space plus one, and then I can probably do minus... no, I probably can't do minus one. Yeah, okay, well, we can't do that. Every time after that I put the real action in. Okay, great, put them in the right order. Let's look at the shapes. "Cannot expand": yeah, don't need that. No, I think I actually want three: batch, time, and then the thing. Okay, that's fine. So then we can concatenate them on the minus-one axis and get rid of these weird times-threes and these weird plus-ones. Okay.
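A minimal sketch of the single-token variant he's proposing: concatenate return-to-go, state, and previous action along the feature axis so each timestep is one token instead of three. The dimension names, the one-hot fake start action, and the linear embedding are assumptions for illustration, not the stream's exact code (tinygrad API as in recent versions):

```python
from tinygrad.tensor import Tensor
from tinygrad import nn

B, T, state_dim, n_actions, d_model = 4, 16, 4, 2, 128   # assumed sizes

rtg    = Tensor.randn(B, T, 1)                # returns-to-go
states = Tensor.randn(B, T, state_dim)
acts   = Tensor.randn(B, T, n_actions + 1)    # one-hot; the +1 slot is the fake start action

# one token per timestep: concatenate (r, s, a) on the feature axis
tok = rtg.cat(states, acts, dim=-1)           # (B, T, 1 + state_dim + n_actions + 1)
embed = nn.Linear(1 + state_dim + n_actions + 1, d_model)
x = embed(tok)                                # fed into the causal transformer
# a completely separate head then predicts the next action from each position
```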
Let's just copy that from GPT-2, great. Yeah, B, S, C... okay, great. Mask... yeah, minus one there, because we're going to want to train using that same thing. Don't worry about that, because we're not training. Okay, why is that true? Oh, because I did count minus one. Oh no, that's start_pos, so that's right. Take a look at this. No, the mask is None. No, we need to keep that there. Okay, that's fine. Oh, why do I have this? This just got so much simpler now that we concatenated them. I don't understand what they're saying about the causal structure. It's very possible that certain... not x.requires_grad, but... oh, that's annoying. Okay, now we have stupid stuff. Well, can I just not use GPT entirely? The only thing that I have to change: if not isinstance(tokens, Variable)... how could tokens be a Variable? Oh, I see, interesting, I didn't know we supported that. Well, that doesn't work anyway; this is not a trainable Transformer. But okay, the problem with that... you can actually set start_pos to zero. I've had this problem before. Oh, I know the problem: it's because usually this avoids the jit, because the size is not... we have bugs in this implementation. Usually your first token is a long sequence. Think about when you're using a chatbot, or even a language model with a prompt: your prompt is your initial sequence, and that initial sequence is why this works there and why it doesn't work here. Shouldn't I update the target reward? Yes, I do have to update the target reward. Okay, and then we do rewards-to-go when we're training this, and this should all work. Okay, it's annoying that that doesn't work; it doesn't like doing that initial sample. Let's take a look at what it's doing here: 192 mod something that's clearly a lot bigger than 192. Just fix that. We have test_symbolic for it. Okay, so symbolic is the backend that's used to render all these shapes. After I smoke it's harder to talk. I think I still understand it, it's just with a different sort of vocabulary. At least I think I still understand it; who really knows. Put a number like 1 or 2, mod it by 384, and we expect the answer to be 192; let's just see if that works. Well, that's also a number, which is not fair. Oh, this doesn't work because 2 times 192 mod 384 is zero. So if start_pos is zero there, that's zero, but if start_pos is... that's stupid. You see why it doesn't like it? You see how this is going to evaluate to two different things based on the value of start_pos? It's not going to work, is it? Yeah, so where do I pass in start_pos in here, to the Transformer block? I wonder which function actually breaks. Should we fix this in gpt2 first? Should we fix this bug? I don't like the None thing. This is what's failing; this might be fixable simply. Okay. Oh no, part of the problem: you want this to be zero, and I think probably it can be zero. How do I deal with this in GPT? "1 if start_pos else 0" fails on the first run, though, right? Do I need the cache? It's very slow without it, and then we can't jit without it. This is showing me where tinygrad is still not quite usable.

I think my Chinese food's here. It is a glorious day for eating Chinese food. These are hand-pulled noodles, a bit of sauteed chicken, some vegetables. Spicy, delicious. This is actually pretty authentic; I ate stuff like this in China. And yeah, I'd shove my chopsticks in like that. I've heard it's offensive in Japan, but I'm offensive in America, so why should I change? That's not true; you should respect other cultures.
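To make the symbolic-shape issue above concrete, here is a tiny, self-contained illustration. The 192 and 384 are the constants from the stream; everything else is assumed. The point is that an index expression like start_pos * 192 % 384 takes different values as the jitted start_pos grows, so the symbolic simplifier can't fold it to one constant:

```python
# (start_pos * 192) % 384 alternates between 0 and 192, so it cannot be
# replaced by a single number when start_pos is a symbolic variable
for start_pos in range(5):
    print(start_pos, (start_pos * 192) % 384)
# 0 0
# 1 192
# 2 0
# 3 192
# 4 0
```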
That's really a terrible bug; I think we need to fix that. Let's replicate it in GPT so I can run this. What? Oh, because that's a single... yeah, you see, look, there's a replication of the problem in GPT, so let's fix it, boys. Wow, we eat Chinese food. I'll move the mic back a little so you guys can hear the Chinese food and wish you had some Chinese food of your own. This stream has been brought to you by Uber Eats, which, probably no matter where you live, can get you Chinese food, and if you live in California or New York or a place with a large Chinese population, you can get authentic Chinese food. There's so much to fix in tinygrad. I'm glad we're fixing things. By the way, this is the schedule; you can see here in lower_schedule_item. One thing that would be cool is tying this back and logging where each statement happened. It'd be pretty beautiful: how much time is wasted in tinygrad because people don't know? So yeah, the reason this doesn't work: if start_pos started at 1, that would be allowed, or if 1024 were bigger, and 1024 is bigger if you have multiple tokens in your sequence. Okay, so let's extract the problem case to a test. Should it be in test_symbolic? Can we have a test for integrated symbolic? This is probably pretty good. Okay, so there's the bug. If I do this, it works; see, that's fine, but that's not fine. So we should think about why that's true and why this thing's not behaving as I want it to.

Eating this chicken is difficult. How much of this eating can you see on stream? Should I get a better camera? You got a hand cam? We can zoom in. We can fix this in chat; might just have to do that. Well, okay, wait, there's something we could do: instead of division, I think we could subtract what we've done so far first. Would you like to see the eating? How many people... oh, 890 people, okay.

So the truth is this has to be zero, and okay, maybe we can fix it here. Print doesn't work. Either way, it's fully divided, actually, because I'm not asking the question. Interesting. Okay, so it could be 128 or it could be zero; that doesn't matter. But how do I detect that generically? If a mod equals zero... wait, but actually that probably shouldn't be true. No, same problem. Okay, there are other ways we can compute the expand var idx. I guess what we can do... well, let's look at what the shape is, let's look at the whole shape. So it's this. Once we're here, can this just go away? I'm not sure why I'm even checking this. real_strides, simplify, merge adjacent... so it's in real_strides. Oh, it's multiview; the problems go so deep. That should not be multiview; that's the bigger problem. So the only difference there... but we have to shrink it, we have to remove the ones. You see the problem: we basically have to remove these strides, or put those strides in, but without... so if it's a shrink, only if it's a shrink, I'm going to write logic for this. It's stupid that this is so annoying. We have to put the one-strides back in whenever there's, you know what I'm saying... Okay, is it correct? I do not know. Isn't that stupid? The print is in the shapetracker. Okay, wait, that still looks bad. Is the problem fixed in gpt2? Sweet. Hello, security grants program; hello.com. Looks like it works. Why does it still not work in lunar lander? That's the same bug. We'll get there, but this is a... oh, maybe because... no, the start_pos is zero; that's a different problem. Oh, sometimes it picks two? It shouldn't be able to pick two. Okay, wait, wait: we're going to have to have an ignore value for sparse categorical cross entropy, I believe.
Or maybe we just... it should not be able to pick two. Oh, I know why. Okay, so back to this bug: it's still in real_strides. Let's see if we can hack this case too. It's contiguous... oh, but that's 192, that's still... okay. I think if it's a contraction; we'll write that more cleanly, have a function called get_contraction. In my example it's not a contraction... but it totally is. Did I do that backwards? Wait, no, that's just weird. Never mind. Great, okay, so that's a contraction from that one. Wait, that's not... oh, those are just... okay, never mind, that's just fine. Does everyone understand what I mean by contraction? I will now go on to not explain what I mean; that sounds like it would take a lot of effort. Okay, the contraction is the shape of the... no, that's not actually what we want. Wait, it might actually be this stupid. Hang on, it might be really stupid. Okay, if len(c[z])... if there is a length, we want the stride from c[-1]. Oh, this is so easy, it's this, I'm 80% sure. Wait, that's hella nice. Yeah, it's almost worth it actually, now that it's only two lines. It's a weird special case, but still an issue. A contraction... it is a contraction. What? Oh, it's not even a valid reshape. Oh no, it is a valid reshape. This is what we get for just trying to hack around the problem and not trying to fix it like adults. I ate all the noodles; I want more noodles. You see the problem with that, right? That's only true sometimes. We can also probably make that a contraction. Okay, so better question: why is this not a problem for GPT? How did that become three? I just have weird parameters in here, don't I: 848, 816, 264. In GPT that splits evenly. Okay, so what are the numbers from GPT? n_layers and n_heads are 12, dim is 768. That doesn't seem so... can't type. Well, it's definitely not a contraction; it's also not a contraction with GPT. Oh, I see. Okay, that's actually fine. No, still doesn't work. Okay, great, fixed. We have tests which don't test anything; that's good. Let's get rid of that print. Okay, so a contraction is when your mother was nine months pregnant... we're all just GPTs, man. So a contraction is when you have something like this and you're contracting it: you reshape this shape into this one, and the contraction is the mapping of which old axes merged into each new axis. Does everybody understand why that's the contraction?
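A simplified illustration of the idea (tinygrad has its own get_contraction helper; this is not its code, just a sketch that ignores size-1 edge cases): for each axis of the new shape, find the group of old axes whose sizes multiply out to it, or report that the reshape is not a contraction.

```python
def get_contraction(old, new):
    # for each axis of `new`, the list of `old` axes that merged into it,
    # or None if `new` isn't a contraction of `old`
    groups, i = [], 0
    for n in new:
        group, prod = [], 1
        while prod < n and i < len(old):
            prod *= old[i]
            group.append(i)
            i += 1
        if prod != n:
            return None
        groups.append(group)
    return groups if i == len(old) else None

print(get_contraction((4, 3, 2), (12, 2)))   # [[0, 1], [2]]
print(get_contraction((4, 3, 2), (4, 6)))    # [[0], [1, 2]]
print(get_contraction((4, 3, 2), (8, 3)))    # None: not a contraction
```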
If you don't understand why that's the contraction, ask your neighbor. We have office hours on Wednesday, and the TAs are very helpful. I will warn you, though, this school is big on DEI, and some of the TAs did not receive their positions based on merit. Look, I'm not going to single any TAs out by race, but I hope everybody has come to understand that DEI means you have low expectations for a certain look of a person, and it's really a tragedy to that person. It really is: you just have to prove yourself harder, man, and it just sucks, and that's why the left are the real racists. Tune in to George Hotz tonight. We used to be on Fox News; now we've moved to Rumble. Yes, that's right, Rumble with an R. It's like YouTube but right-wing, and we're going to repeatedly claim how this is the real home of free expression. So yeah, I hope everyone enjoys my show. Make sure to check out my key sponsors: the My Pillow pillow and the... man, My Pillow's gone. I don't know what we're going to do, man. How are we going to get funding to run our right-wing show? I know, you can buy my male vitality supplement. Oh, supplements are over. I'm out of ideas, man. I don't know what I'm going to do. How are we going to keep the doors open? I'm getting sued by those kids from Sandy Hook, man. What am I going to do? Oh no, they're dead, it's fine. Jokes about dead kids... don't sue me, don't sue me. It's brought to you by Bud Light. Bud Light's trying really hard to rehabilitate their brand, so they're going out there and sponsoring this far-right content. They're very upset that their sales are down; they thought they were just doing the right thing, but it turns out they weren't, and they're doubling down on politicizing instead of going back to just being a beer that you could normally drink. Bud Light makes a statement, and that is the real problem: making a statement and caring. I hope everybody understands that that's the takeaway and that's the only way to survive the future. But that's what they want you to believe, because if you're not out there making a statement and making your voice heard, then it's easier for them to make their voice heard, because that's how democracy works. No, I have not watched Alex Jones on the... I only have so many hours in my day; I don't know about watching Alex Jones on the Joe Rogan podcast. Who am I voting for? Well, I learned a long time ago, when I supported Obama, and when I supported Ron Paul, and when I supported Trump, and when I supported Andrew Yang, that sometimes you get the candidate you want and sometimes you don't, but either way nothing changes. And I'm sorry, Vivek, but you're the first year that I just know nothing is going to change, and I'm not going to get my hopes up, man. President Nikki Haley, here we come. That's right. Yeah, Ron Paul was a winner. I know, I know, but they all turn out to suck, right? George, how did you ever vote for Trump? Well, he turned out to suck. George, how did you ever vote for Obama? Well, he turned out to suck. They always turn out to suck, man, and that is the truth. You just can't get your hopes up. And then, how do old people still vote? I guess you really just have nothing else to do when you're old. But yeah, Vivek gets my year of indifference, unfortunately. Does he sell hats? I did buy hats from all the other ones. I did not have an Obama hat in 2008, and that's because I used to be poor, but I did have a Ron Paul hat, a Trump hat, a Yang hat. If Vivek sells hats, I'll buy a hat. Okay, how old are you, boner pull? Okay, why are we printing this? Why is that printing? Let me put it here. Okay, no, the other way around; we want... yeah. No, really, Ron Paul was the... you see, Dad, if he's dead we can say all the great things about him we want and talk about how much we loved Ron Paul; if he's not dead, he still has room to disappoint us. There's simply one way to tell if someone's the good guy or the bad guy, and it's: do they support the surveillance of Americans? That's it. That should be so far beyond the Overton window. Oh my God. Did they repeal the Patriot Act yet? No, they did not. H. Ross Perot: I'm sure I would have been a Perot fan. I was a Gore fan in 2000, and in 2004 I was just devastated by what an idiot Bush was.
Look, this is the normal trajectory of the manipulated American political system, and then you just learn to be a nihilist and realize that nothing you're going to do is going to change things. But that's not even how things change to begin with. Would things really change if Charlie had died sooner on Lost? Yes, the show would have been better, because that guy sucked. I was so happy when Charlie died; finally they were listening. Spoiler alert if you didn't watch season 3 of Lost. I don't know. Okay, all right, you ready for how the trick works? There's this thing called the government, and then there's this thing called the deep state, right? And the deep state is actually what runs the country, but you don't elect the deep state; you elect the government. The government goes on TV and talks about how are we going to let this happen again, and that we need to do something and protect the children. The deep state sends billion-dollar wire transfers to weird places in the world and brings Hunter Biden his hookers. That's what the deep state does, but you don't vote for that; you vote for these guys over here, the government. So then you get people over and over again who are like, well, I'm going to vote and I'm going to make change, but it turns out you don't vote for the guys who actually have power over here; you vote for the guys over here who are puppets. And they're not even puppets in the sense that they're being controlled by somebody to do something; they're just over here playing. Do you think anybody lets Nancy Pelosi near any real lever of power? Probably yes, and that's the problem. What's the difference between the PMC and the deep state? I think they're the same thing; I think the deep state is made up of PMC members. By the way, the deep state is not a conspiracy; don't let people tell you that. You can find all their names on the internet. It is all the unelected bureaucrats, all the people who just hang around in Washington, DC. It turns out if you just stand next to the guy long enough, you just know how things work, and that's what these people have done. And they don't die, because we have high blood pressure medication and they're all on that. What are those things called? Statins. They're all on statins, and now they don't die, so we don't have a natural cycle of turnover in this country. And even if we did, who would we turn it over to? There's nobody, because the pyramid ended. And that is why I'm investing all my money in Nigeria. Look at Nigeria: they have a pyramid. Nigeria still knows how to build pyramids; in America we forgot in 1971. I actually have money on Biden, so we've got to root for Biden, guys; we're riding with Biden on this channel. I think I got good odds on Biden.

Okay, this works; we just need to make the jit work. Let me copy the jit code. Where's the jit code? Can always use the jit. Okay, I think it's faster now. Okay, 20 milliseconds a step doesn't sound too terrible. Actually, that might even just be because it's render_mode human; it might be even faster if I don't say that. Great, it's fast now, love it. Got to use the jit, boys. Are you using the jit? Wait, no, no, no. Why is that back? Oh, because I got rid of count plus one. Someone's really clanking on the stairs this morning.
Are you using the jit jet? Yeah, I'm using the jit jet; of course I'm using the jit jet. You got to get your jets, boys. Yo, I heard you like jets. Okay, that looks pretty good. It looks like it takes the actions equally, and that's what matters: equality, guys. The actions don't have to be good as long as you have a diversity of zeros and ones. It doesn't matter if the actions actually keep the cart pole balanced, because we achieved our goal of diversity. I've never been to Nigeria; I do know that they have a much more functional-looking pyramid than we do. By the way, Africa documentaries: that's what I'm going to do today. We're going to write decision transformers more, we're going to smoke some more, and then we're going to watch Africa documentaries. How does anyone take anything seriously anymore? This is also wrong, I believe. Oh, if count is one... why do I have count like that? Why don't I just make count zero, and then I can say zero or one if count... You meet people and they still take things seriously. Okay, so this can generate; we'll say that's good. That's a good rollout. The wokies are great, boys, the wokies are great. Who said I can't stand wokies? I'm not complaining about the wokies, man; I'm complaining about the people complaining about the wokies. They cared, and that was the problem, man. I'm complaining about myself. I'm the problem, it's me. Hi, I'm the problem, it's me. Straight white men are the problem. Wow, if we just went to space, it would be pretty cool. You know what, I had an idea a few nights ago: we should turn Earth into a nature preserve. Think about it; a lot of people win in that scenario. Earth has so much natural beauty and biodiversity that humanity should head off to space and leave Earth to be a nature preserve. Where's my cross? I'll go get my cross. That's a good point; I've been wearing my cross, but I showered and I didn't put it back on. The cross is on under my sweatshirt; we're not going to show it on stream, because we don't want to offend people of other faiths. Does DEI work with it? No. What? This ends with tanks on the Harvard campus. This ends with tanks on the Harvard campus. Wow, it did pretty well that time. Should we just give up on Transformers and use an evolutionary algorithm? Japan is very cheap right now; maybe I'll go chill in Japan for a bit, sounds nice. Okay, so hang on: we don't just want to append the reward there; we want to compute rewards-to-go. I have that code in beautiful_cartpole, do I not? What are discounts? Oh, I have to compute that; I'll copy this line. What did he say? Yeah, it's cold, and I didn't use enough energy today; I wanted to use more. It's pretty sick: return 109. Wait, I did not get booted; I did not get thrown out of any colleges. I quit both RIT and CMU. And at CMU, yes, I had a 4.0. I tried; one of the first times in my life, I tried at something. Okay, that's good, that's pretty good. Wait, we can't just do that; we also have to pass in the model too. Okay, sweet. Great, now we just need to generate a whole bunch of them; then each one is an episode for a Transformer. I think that's batch size. Then I have to pad it, which kind of sucks. Or should we just... that's just the reset action, and we'll just concatenate all this crap together. That kind of sounds better. Let's concatenate the crap together. We'll call them BR, BS, and BA. Yeah, I love my names. BR plus-equals R. I've got to make sure these don't go numpy.
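The rewards-to-go he mentions porting from beautiful_cartpole: a minimal numpy sketch, with the discount factor as an assumed parameter. Each position gets the discounted sum of its own reward and everything after it, computed backwards over one episode.

```python
import numpy as np

def rewards_to_go(rewards, gamma=0.99):
    # R_t = r_t + gamma * R_{t+1}, accumulated backwards over one episode
    out = np.zeros(len(rewards), dtype=np.float32)
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        out[t] = running
    return out

print(rewards_to_go([1.0, 1.0, 1.0], gamma=0.95))  # [2.8525 1.95 1.]
```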
I was just calling numpy; that's fine. And then actions are not numpy. Okay, good. while len(BR) less than, I don't know, 128. Does this work? "Mismatch of Variables": what the hell does that mean? Oh, we have to reset the jit. Damn, homie. I think it's called reset; I don't exactly know why we have to reset it, but we just do all this crap again. Oh, because we also have to reset the stupid... actually, probably, I was thinking about this, we probably want to say "if start_pos.val == 0" instead. Great, okay, we're running episodes. I'm so happy we concatenated them; that makes life a lot easier. Okay, BR equals... we can also, if we want, do the mask, so we mask the episodes separately, but I don't think it really matters. Just if you give up, just reset, and it can't generate that action, so great. Oh my God. Is college worth it, guys? Do you know what a MOSFET is? Why is this one so big? Oh, because I have 100 as the default; it should be zero. This is the most incredible... decision Transformers are brilliant: you just put in the reward you want to get. That's so cool. We can also roll out multiple at once, I think. I don't know, that sounds like a lot of effort. Yo, that Chinese food was great, but you know what would be... wait, where did my Chinese food go? I put it over there. You know what I really want? A cinnamon roll. That's just your high brain talking; you don't actually want a cinnamon roll. Oh my God. I don't know if it's better or worse; everyone talks about how being anti-woke is as bad as the woke stuff... liberal arts woke studies... we aren't even close. There's like a Maslow's hierarchy, and we're still at the level of figuring out... Yudkowsky is right, guys. Yudkowsky is right: our civilization was faced with a test and we lost. We lost, okay, and now we just have two choices: you can either be on the do-things team or on the don't-do-things team, and I like things; I think we should do them. This is subscribers talking. See, this is the problem. Leo, I love you for gifting subscribers, but the problem is you should get a star with an asterisk next to it if you're gifted a sub. Or, I don't know, actually, Leo, I might be blaming the wrong person: you might have carefully picked those 20 people and determined that they were good people, and that's probably good, because I trust sub gifters a lot. Let's order. I would order an apple pie and panna cotta, but they got rid of... oh, the guy who DM'd me yesterday wants to do a phone call to be an intern for tinygrad. All right, let's see: I'm streaming right now; are you okay with the interview being on stream? Wait, what do we think about that? Should we do an interview? No, I think... let's see what he says. I shouldn't do that to people; that's pretty terrible. And if they're okay with it... I'm curious what he says. He should say no, but if he says yes, that's ballsy. We can't just use people for content; then I'm no better than any other... Did he solve a bounty? No, but he did a few PRs. I want donuts. Okay, I want a cinnamon roll. I won't eat the whole thing; I just want to look at the cinnamon roll. Wow, wait, Cinnabon can deliver to me? $11.49 at Donut Bar, Krispy Kreme can deliver, Chief Le Cafe, basic waffle, bon bon croissant. Can we just order a single cinnamon roll? That's kind of savage. I'm just high and want to order things. Wait a second, this cinnamon roll costs $9.50. Let's buy it. Utensils? I don't... fine, I won't get my Uber One discount; I'm not buying more stuff.
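The collection loop he's describing above, sketched with assumed helper names: episodes get rolled out one at a time and concatenated back-to-back into flat BR/BS/BA buffers until there's enough data, with no padding.

```python
# run_episode is an assumed helper that rolls the policy in the env and
# returns per-step rewards-to-go, states, and actions for one episode
BR, BS, BA = [], [], []
while len(BR) < 128:              # the 128-timestep budget from the stream
    rtg, states, actions = run_episode(model, env)
    BR += list(rtg)
    BS += list(states)
    BA += list(actions)           # episodes concatenated, not padded
```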
Okay, I'm getting a $9.50 cinnamon roll; let's see what we get. Thank you, people who purchase subscribers, to fund my stupid... He wouldn't be comfortable being on stream. No, I don't think that gives a signal either way; I'm just curious. Okay, that took a really long time to generate. This is the batch size. Why did I order... I was just high and ordered a cinnamon roll. Okay, we're making bad choices today, guys, bad choices. Just because you're self-aware about your bad choices doesn't make it any better. Just because you say things and it's sarcastic... you can't just chase the irony all the way down. Something has to be real; something has to... okay, because if nothing's real then nothing matters, and if nothing matters you can just die, and if you can just die, why don't you do it already? Trigger warning, self-harm: got to click through that link, bros. How does anyone take it seriously? I guess they're trying, and I'm just mocking the people who are trying. Postmodernism is... but you can't get out of postmodernism; you'll be stuck in postmodernism forever. You won't, really, but, like the gradient, you know what I'm saying: we can't get out, it's just a postmodern hell. In order to get out, people have to die, and I don't mean that in a war sense; I mean that in a generational-change sense. You do know everybody dies, right? That's what they never told you during COVID: you're going to die. Are you that worried if it's tomorrow or 50 years from now? Does that really matter that much to you? What's the difference between now, 50 years from now, and the heat death of the universe? The first two are so close together compared to that last one. I'm sure you're going to have some gamma decay on it, but you still know what I'm saying, right? We're playing for the whole fate of the universe, the whole future of the light cone. Does it happen now, or do we just feel that way? I mean, is that true? I don't know. Again, it's like saying: if you think that you're crazy, you're not crazy. If you're aware of the fact that everyone who makes these apocalyptic doomsday predictions is an idiot, then why don't you look at yourself and treat yourself just like the Mayan calendar? Why do I think the singularity is going to happen in 2038? Am I just as bad as the Maya? It's not even a date that has any significance; it's just the Unix timestamp rollover. Literally, how stupid is that? But it makes more sense than anything anyone else is saying, and the question is: is that because I'm crazy? Why does it make more sense to say that the singularity is going to happen in 2038 than to say Joe Biden is a great leader? Why does the first one make more sense? What's wrong with... Metic, thank you for gifting subs; do you have a question? All right, it's like we're learning. It's still a little slow, but we're learning, hopefully. All right, let's turn on training: Tensor.training = True, no_grad = False. a_logits = model(BR, BS, BA); loss = a_logits.sparse_categorical_crossentropy(BA). It can't just be that... no, it's not that; it's that. That makes more sense. All right, decision Transformer, make decisions. It can't just be zero; it's going to be Variable start_pos, zero. All right, all right, let it happen after I die: that's Boomer logic.
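A minimal sketch of the update he's writing, with assumed names (model, br, bs, ba, the forward signature) and import paths that match recent tinygrad releases; older versions place these modules elsewhere. The per-timestep action prediction is trained with sparse categorical cross-entropy, which tinygrad exposes as a Tensor method.

```python
from tinygrad.tensor import Tensor
from tinygrad.nn.state import get_parameters
from tinygrad.nn.optim import Adam

Tensor.training = True
opt = Adam(get_parameters(model), lr=3e-4)        # model and lr are assumptions

logits = model(br, bs, ba)                        # assumed forward: (B, T, n_actions)
# flatten batch and time so each timestep is one classification target
loss = logits.reshape(-1, logits.shape[-1]) \
             .sparse_categorical_crossentropy(ba_targets.reshape(-1))
opt.zero_grad()
loss.backward()
opt.step()
```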
"Input tensor shapes cannot be multiplied": well, that's upsetting; I would like them to be multiplied. Actually, I'm not sure I'd like them to be multiplied. What? Okay, this is probably wrong then. Oh, I think I have to do this... no, there we go. Okay: assert t.grad is not None. I know about that. I haven't watched Cowboy Bebop; I should do that. Okay, why are some of the gradients None? Is that no_grad defaulting to False? We're not using the jit; actually, let's just explicitly not use the jit. I can do .forward here; then we're explicitly not using the jit. assert t.grad is not None: I mean, I did loss.backward(); I don't know what else to say. I set no_grad to False. Maybe something's wrong there. The live action... she likes anime, mostly; we watched Jujutsu Kaisen, she liked that a lot. Okay, we're getting an assertion that the gradient is None. It's because it's not computing the gradient for some stuff, and why is that? Oh no, that's going to be after... I wish I could know which one that was. Again, this comes back to: tinygrad needs introspection; the tinygrad introspection project will happen. Search for None... well, obviously it's None. Oh, I think I do some caching; maybe I don't explicitly say that it doesn't require gradient. It shouldn't require a gradient. Do I have any advice? Dude, you are late, and if you're asking me that question, you're never going to get better, and that's the truth. Do you want the truth? That's a dumb question. Do you think anybody who's any good asks that question? I'm just going to be honest: I don't have an answer for you, but it's a dumb question, and I get emails like this all the time, do I have any advice, and we're going to be harsh about it, because guess what, guys, we're in the end times. Does it matter? No. You know, it kind of upsets me: I think the quality in the tinygrad Discord is going down, and partially it's because of these streams. Here's my harsh tone. And in some ways I get it: these streams are not particularly... this is not my job, right? I'm here, you know, smoking weed on a Saturday, ordering $9.50 cinnamon rolls off Uber Eats, so they're not high quality, and they don't really attract... Would I watch my own streams? Maybe, sometimes. Honestly, I'd probably watch clips; I don't think I'd sit there for eight hours and watch a stream, and that's kind of, you know, the line of the Lil Dicky song Professional Rapper: the stuff that I'm bumping wasn't the stuff that I'm making. And that's the truth, and that's on me. I don't know; make content for me to consume. Something's gotta... yeah, we've been doing this. I've been trying to get into having a schedule lately: go to work every day, put in 10 hours; no, put in 12 hours into working, make tinygrad better, come home, sleep, do it again the next day. On Saturday, stream. Sundays, either go to work or try to do something with Alex. Then on Monday, go to work again. Just a schedule. I don't have any answers for you; I have answers for me. The answer for what this stream becomes: you ask me for advice; well, why don't you give me advice? That sounds better. Ask not what this stream can do for you, but what you can do for this stream. This is probably... oh, this is the time embedding. Wait, did I delete the time embedding? Oh well, that explains it. Great, we found a real bug. Okay, we are learning. Oh, the loss went up. Oh, I love this. Learning; oh, don't we love learning.
So what's the initial return, zero? Just going to flip the order. Oh, this is actually reward, is it not? What does env.reset return? Do you get a reward from the first time step? I guess you don't. Yeah, notice how I never trained on that one. And also, this is not what I should be putting in there. Let's look at the decision Transformer paper. Okay, R... well, no, that's not right. Okay, let's target a return of 50: target_return minus-equals reward. Okay, well, my loss went to zero very fast. Maybe it's predicting the same action every time; we have no diversity. Oh yeah, no, that's definitely true. Also, move to Nigeria; move to Nigeria and have three children. So I don't understand how that's zero so quickly. Okay, so very quickly it converges to just predicting a single action every time, actually after one update. Oh, make sure... actually P1 is the rest of it, okay. It converges to 90%, and you can see that that's the episode line, so they're very short. It converges to 90% zero; sometimes it converges to 90% one. Let's reduce the learning rate. Okay, well, at least reducing the learning rate doesn't make it collapse anymore. Okay, it doesn't seem like it's getting better. I should also probably calibrate that. So now it's getting 68. Okay, let's discount this one. Oh, well, that's just dumb; that should not be the reward. I don't really understand why it's outputting that; this should just be the real returns from the thing. It should not be that. How is target_return getting into R? Because it clearly is. Oh, because I put it in R; that's not good. Maybe we don't train on that first one, because it has a fake action too, but I kind of like the fake action. Did I do this backwards? No, that's right: we never train it to predict the fake action, which is what works, but the reward should not include the target reward. Yeah, okay, maybe this just shouldn't be target_return; it should just be zero. That can stay target_return. Okay, so that's probably not... we want the target return to be like the most it's ever seen, or like what it converges to. Go back to 0.99; it actually discounts pretty fast. What's the convergence of the power series of 0.99? There's some way to calculate that, right? We can calculate it; it's just the sum of that, and there's probably some closed-form expression for that equation. If I do that with 0.95 it's 20, so the target return is 20, and we'll change it to a discount factor of 95%. Okay, that goes negative, which should not happen, because we need to discount the target return. Actually, we don't want to discount it; I don't think they discount it in the paper, so I'm not really sure how to take that into consideration. Actually, in some ways maybe we never change the target return. Do they change the target return? No. This should work, actually; that's fine, because that's like the max. It's because we're discounting it. It actually should work. Okay, are you learning how to be better or not? My batch size is very small; did you just get lucky with those big numbers? Okay, let's reduce multi-rollout to 64 and batch size to four. So each one of those is an episode happening. You see, this one's converged to zero. I thought a bigger batch size would help; it did not. Going to turn on render_mode human so we can watch it not balance. Is it learning? Do we think it's learning, boys? See, look, it's collapsing. I'm printing the probability of each state, so it's just learning to spam one. Now it's forgotten.
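For reference, the closed form he's reaching for: a constant reward of 1 per step, discounted by a factor gamma, sums to

```latex
\sum_{k=0}^{\infty} \gamma^k = \frac{1}{1-\gamma},
\qquad \gamma = 0.95 \;\Rightarrow\; 20,
\qquad \gamma = 0.99 \;\Rightarrow\; 100
```

so a discount factor of 0.95 gives exactly the target return of 20 he lands on.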
Okay, so here's another cool thing: this is an off-policy algorithm, so we can do some amount of replay... sorry, not replay. I'm not sure this helps, but maybe it does. George, you have to not be so cynical. Don't insult the guy who posted some big long rant in your chat expecting a real answer. You have to have a positive outlook on people, you know. You catch more bees with vinegar than you do with bee zappers. I guess you don't actually want to do multi-rollout; you want to just stick these all together. This function sucks; I don't know why I wrote it. All right, now we just... I don't know. It's actually stupid, because we don't actually need that context. Where's the pipe? Where am I piping it to, bro? Pipe, pipe, pipe, pipe. I don't know. You know what we could do? What if we just normalize it? So you say an AI could probably be typing this for me right now. It probably could. I think the bigger problem with that is: it's not like I'm doing nothing with this time; I'm actually thinking. I use a different part of my brain to... okay, that's not actually what I want to do. Okay, it's four, that's right. What doesn't work about this? Oh, that problem: I reuse BA down there. Okay, it's learning. It's machine learning, and it's doing a shitty job of it. It learns to always predict zero. Okay, my hyperparameters... let's raise the temperature. Didn't fix anything. So we're clipping each one to be only 20. Let's take a look at a sample: so that's the reward, these are the states, and then those are the actions. Those ones don't look clipped. Wait, I think that worked. Did clip R not work? I don't know. Okay, they're definitely there; that's fine. I guess we just happened to look at a sample that was bigger. Yeah, it's just jamming zero as hard as it can. Now here, sometimes it looks like it's predicting one at least. Oh, we're never deleting them. My overpriced cinnamon roll is nine minutes away. Wait, is it learning to balance? At least it has diversity; very important. It kind of is working. That's how you feel? Your feelings matter. Feelings matter, bro. Is it learning? I don't know. You know how you can know if something's learning or not? Ooh, 70. You can use a graph. Okay, let's first do a short one to make sure I didn't mess up the plot. See, you can't just count on that 60; that was all just luck. Okay, it's just going to warm up a bit before it starts training, with some random noise. Still all in random-noise land... wait, no, it's not. All right, now we're learning. We're doing two episodes every time. Okay, plt.show; good thing I didn't run the big one. I mean, maybe... can I do this here? Does that give me... it's going to hang after I do plt.show, though, right? Okay, for some reason all my returns... oh, they're all 20, damn it. That's not what I want. Okay, good; well, that was a quick way to see a mistake. If I do that, does it not hang? Does that just do what I want, which is show the plot without... no. Oh, I think there's an argument to plt.show: block. Yeah. Okay, so far it's doing worse. Make that one one again. plt.plot(returns); seems fine. All right, will it learn? It's not even getting longer anymore; that's supposed to be getting bigger. Are there other figures hiding? Hidden figures, you might say. plt.clear? How do I make plt update a thing every loop? "plt plot update in loop": figure.canvas.flush_events. That's a lot of work. Do I have to do figures? Okay, fine, we'll make a figure. figure.canvas.draw... oh, this is too much work. plt.pause: there we go.
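The live-plot loop he settles on, as a minimal sketch; run_episode is an assumed helper returning one episode's return. plt.pause redraws the figure and yields to the GUI without blocking, which is what plt.show (blocking by default) wouldn't do.

```python
import matplotlib.pyplot as plt

plt.ion()                          # interactive mode: drawing doesn't block
fig, ax = plt.subplots()
returns = []
for episode in range(200):
    returns.append(run_episode())  # assumed helper
    ax.clear()
    ax.plot(returns)
    plt.pause(0.01)                # redraw and keep going
```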
Oh, is it going to learn? Will it learn? It looks like it's getting stupider. Media comes to the moon: was that a movie? I'd believe it. I don't know; reinforcement learning takes forever, but it also may be doing nothing. Oh wait, is it learning, boys? No, it can't be learning; that's too optimistic. If you stare at an RNG long enough, you'll start to see God. Well, okay, so here's something we can do: we don't actually need to watch it try to balance; it should be faster without that. What's that stupid print? All these prints; that's the most disgusting code I've written in here. Okay, this is faster now. Are you learning? Is you learning? Learn better. Oh, I know one of the problems. Okay, so because we're clipping things to 20, yeah, that sucks: it's going to have trouble when it starts to try to roll out past 20, so let's clip to 50 instead. I thought it would be okay, but it might just get confused. I don't think it's actually learning anything. You see, look: when it goes past 50, it gets angry. I don't know, I might just be staring at a random number generator. Where's my cinnamon roll? No more staring at the problem. Also, this is kind of tiny; let's do four. Will you learn how to play CartPole? There are so many pieces in here that could be broken; I have very low hope. Do you see why it should predict better actions? It's because I'm telling it that you want to get 20s, but it's never actually seen a 20, which is also kind of unfair. Let's just tell it you want to get 15s; that's probably directionally correct, that's probably fine. Wait, no, those rewards are... hang on, we might just be doing this too much. Set this back to 50. Okay, remember, RNG says you're just going to get some good ones sometimes. What we're looking for is steady upward progress, and we're making sure we keep the diversity. You see, these aren't very diverse. I don't want to start adding crap like entropy regularizers. I don't know, there are so many ways this could be broken. All right, let's see if it learns how to fly. Let's just give it a harder problem and see if maybe magically it'll work. I mean, it won't, but I've got to get my cinnamon roll anyway, and you guys can watch the lunar lander while this thing doesn't work. It hasn't even started learning yet. Okay, now it started learning. Let's see if it learns. Will it learn? How's my cinnamon roll look? Not very good. $9 Uber Eats is such a scam, man. Let's see how it tastes. Terrible. I paid $9.50 for this. Is it learning? Do we think it's learning? Oh, we're also not looking at the rewards; we should probably look at the rewards. Oh, did I mask the first one like I said I was going to, or did I not? I just put zero in there. Okay, fine. You feel like it's worse than random? That's very possible; there are many things that are worse than random. Okay, I'm going to have to think about how to build tools to debug this. So I don't think it's being trained on predicting the next states, because the only output from the Transformer is the actions: "select hidden states for action prediction tokens." It's crazy; there really is only loss for the actions. The loss is going down, sort of. The loss is good, but does that just mean it's good at predicting failure? The loss is now almost zero, and it never predicts the other action. Let's go back to a small one. Okay, could it just not be working? Could I have a bug in, like,
the jit and stuff? That's probably the first place to look. No, I don't think we have to implement it in torch; I don't think it's a tinygrad bug. I'm saying my implementation is wrong: not the code in tinygrad, but the code outside. So how do I possibly debug that? Let's look at the Hugging Face blog. Okay: from other agents or human demonstrations, you feed the last K time steps into the decision transformer with three inputs: return-to-go, state, action. You know what I mean; it might be a little confusing how... I wonder if it doesn't like how we're adding the positional embedding. No, that should be fine. The prediction is conditioned... oh, on up to 20 previous frames. Oh, interesting: a window of the previous 20 time steps. So they're not doing rollout forever. But I don't really care about that even. Okay, let's just try something stupid: what if I set the target return to zero? Okay, that immediately collapses. Well, let's set the target return to like eight, or whatever that is. Okay, that collapses to zero pretty quickly. So now, if we set the target return to like 50, it still kind of collapses. I don't think that code is wrong; we're putting in one sequence per thing. I guess something I could do as a sanity check is also predict the state. There's nothing fundamental that makes them logits, right? No, it's just a dimensions fact. I could even move that whole head out if I wanted. Min 20, highest found... what? Oh, you're actually getting that from somewhere. Yeah, okay, we can do that; let's actually do like a 90th percentile or something. Why is it sometimes printing a really small number? Where does that print... what is that tiny number? It's not the target return. Oh, that's the loss. Why don't I label my numbers? I can't believe I ate that cinnamon roll; it's kind of disgusting. Will the reward embedding make you choose a good path? Well, that one I could have done: done equals... why do you not learn? Okay, let's try something else. This isn't right.
We really want random.choice; I write this all the time. I really want to do that without replacement... whatever, does this replacement matter? Why did I eat that disgusting cinnamon roll? That was terrible. Okay, see, I get excited when it goes up, but you've got to learn how to not do that. Put this in the fridge for later. Okay, hopefully we can see that it clearly does not learn. Did I implement the sliding rollout? I don't know what that is. Okay, how can we check this for bugs? The loss doesn't even look like it's going down that much. It does seem like, using the larger replay buffer, the loss isn't going down anymore. Is it only learning with... no, I don't think you need that; I think that helps. All right, so here's the loss; it doesn't look like it's really learning. Pick the same one from each one; print; this shouldn't do anything weird. No, that looks fine. This way it trains with the same ones at first; doesn't matter. This is loss; we want this to go down. Oh, so I don't understand: these things are not logits? Does sparse_categorical_crossentropy... no, I think it includes a thing; I don't think I need that. It takes self as a logits input. So this actually just might be wrong here, because that might have to be log_softmax. Argmax is still the same; not sure how much this matters. Let's look at the logits. Okay, they're definitely logits. And then, if that's in log space, we don't divide by temperature; we subtract the log of the temperature. See if that sounds right. Again, that isn't going to affect this. Those things are logits; we're predicting the next action; that's right. And yet now my loss goes up. It is true that it's a non-stationary distribution. Those don't look like logits now. Oh wait, what the hell am I doing? That's not right; I have obvious bugs in here. God, this stuff's so hard. Okay, that's that, so you've got to do that. How did this not just break? Deep learning, man. I was just cutting one off the batch. Does the loss go down at least now? No; might be a different problem. My learning rate might just be too low. How does anybody get this stuff to work? Does deep learning work, or is it all a lie made up by Big Support Vector Machine? Loss, go down. Trust that samples works. Okay, how can we debug this? I can't believe we had that bug and it did anything. All right, so one way you can always make sure: if I set samples to 0 * bs, that should be the same sample over and over again, and the loss had better go to zero. Great, the loss isn't going to zero. Wait, what? Okay, if it can't learn literally the same pattern over and over again, something's really messed up. Okay, well, that time it did learn it, except it spiked. How is that happening? I'm feeding in the exact same thing every time. Oh well, there's another bug: what are those twos getting set to? Maybe we can let it... well, those are like masked actions. Okay, this is another good sanity check: it should never predict a two. Learning the same thing over and over again, the loss goes down, but very slowly, and then spikes up. That's usually a problem with, like, a lack of log_softmax or something, but there's a log_softmax right there. It's outputting twos; it's literally outputting twos; it should never output twos. That should strip all the twos: from the beginning, at the end it should have a reward that's close to zero, and clip that. Let's clip that smaller; four is fine. Oh no, it's just learning on the same thing over and over again, but it's not learning.
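The sanity check he's running here, as a short sketch with assumed helpers (sample_batch, train_step): freeze one batch and train on it repeatedly. If the loss can't be driven toward zero on a single fixed batch, the training plumbing is broken regardless of the data.

```python
# overfit-one-batch check: the same inputs and targets every step
br, bs, ba = sample_batch()                  # assumed helper; called once
for step in range(500):
    loss = train_step(model, br, bs, ba)     # assumed: the update from above
    if step % 50 == 0:
        print(step, float(loss))             # should march toward ~0
```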
Okay: Tensor.training = True, no_grad = False, zero_grad, optim.step... oh, okay, well, now the loss got to zero, so that's great. Finally, it learned how to overfit my one sample. Okay, it looks like loss pretty reliably goes down now; the twos were causing it pain. I know why; they don't bias there either. Okay, loss go down, very good. Okay, good, it's not predicting a lot of twos. I hate twos. How is the two-count so high? Look at all the twos. We shouldn't have a raw two-count; we should say tc divided by len(R). How is there such a percentage of twos? But I'm never putting in twos; why does it predict twos? The twos percent needs to go down. Does this make sense? Twos mean you've lost the game; twos also mean you just started the game, so that's very confusing. Let's increase this by two and make a three, so you just start the game and it can't predict threes, right? Does the loss go down? Sometimes you get good reward and sometimes you get bad reward; do you not understand? It doesn't understand. Wow, if we think our world's confusing, imagine how this poor model feels. Good thing it doesn't have enough layers to be cured. I understand the model-rights activists will eventually come for three-layer models, but, you know, until they do... Okay, so it's predicting a lot of twos; this is never going to work. I'm asking for something way too complex. The loss is lower because it just predicts twos all the time, huh. You know, we should really subtract the reward from the real reward; you have to compute reward-to-go as you go. Am I sure I'm implementing reward-to-go correctly? Yeah, it's the same one I use in beautiful_cartpole, so probably; can never be sure. Let's try a much larger batch size; let's try 64. Rewards aren't discounted in decision Transformer? Yeah, I think they should be. There's probably some broadcasting bug; like, why is this not learning way faster, and why is it still predicting twos? No, this doesn't make any sense. Okay, so these are the probabilities: my sampler is just totally busted. What did I just do? Go back to dividing by temperature, and we can go back to softmax; we just need to add an exp. Okay, now it's not twos anymore; that's good. What happened? What did I write wrong? Okay, well, at least... say it again: how does anyone ever make this stuff work? Okay, the twos percentage quickly goes to zero, but that probability also goes to zero, so how do we fix that? Do we need a positional embedding? For CartPole we really shouldn't. I mean, CartPole should be so simple; I'm not that worried about that. All right, let's get rid of the reward-to-go, and let's add that back, and then let's say target_return minus-equals... well, that just collapsed even faster. If the loss is low... no, it just becomes so quick. Is it doing multiple actions for one run of the simulation? No. How are we going to debug this? Discounts work better, not subtracting that, because, I mean, in theory it learns to end the game as fast as possible. I agree, that's definitely what it looks like. There's so much potential for bugs: the loss goes down, but the average episode length is also down; this is now doing worse than random, ever since I fixed the sampler. Okay, learning rate too high. Get rid of printing all those stupid actions. All right, it doesn't predict many twos, which is good; we have a somewhat balanced policy there. It should see that some of them are better than others. I get my hopes up every time it goes up; I shouldn't do that. Still pretty large; want to go 384. Also, try not adding the positional embedding. I mean, it's CartPole; I don't
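The sampler fix he describes above, sketched in numpy: temperature applies by dividing the logits before the softmax. Subtracting log(T) from logits (or log-probabilities) just shifts them by a constant, which softmax ignores, so it is a no-op; that is consistent with the collapse he saw.

```python
import numpy as np

def sample_action(logits, temperature=1.0):
    # divide logits by T, then softmax; exp is what he had to add back
    z = logits / temperature
    z = z - z.max()                          # subtract max for numerical stability
    probs = np.exp(z) / np.exp(z).sum()
    return np.random.choice(len(probs), p=probs)
```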
need it, right? Just get rid of the positional embedding; I don't need the position, we can all agree on that. I don't know; my Transformer learning could just be wrong. No, it's just collapsed to total crap now. You'll overfit on the policy. If you look at the... to conquer the CartPole problem... no, this has to work. Put this back in. All right, fine, can I do lunar lander? You think lunar lander is going to work? We have so many bugs in this. What did I hardcode to here? "Can you at least do your custom gym environment that tries to guess the previous state?" Which one is that? Oh, that's smart. Yeah, you could have a really dumb... that makes sense. Yeah, there should be environments that are just good examples for reinforcement learning. Oh, when I made one... yeah, I don't know. We can just sit here and pray for the numbers to go up; should we try that? The power of the cross, the power of Christ: please, Jesus, make the numbers go up. My machine learning is very bad and I'd like it to be better now. There are so many things; please, just make the numbers go up. "Your crypto telegram," right. It's definitely learned something: it's learned how to just go there and never come back. Is that easy to do, fly off screen? It's not landing on the moon; the only winning move. All right, boys, we have so much to debug here. I think we should just do it: let's debug with contrived data to see if our Transformer is learning, and if our Transformer is not learning, we can cry. All right, it's learned something new; it's very repeatable, at least. Oh, look at that. I kind of think it's learning; it's trying this for a little bit. Yeah, remember, it's in its don't-land-on-the-moon phase. That was a good phase. Okay, so part of the problem now might be that we're clipping at 64; it looks like we're getting episodes that are way longer than that, so let's clip at 128. If this just magically starts to work... It's learning something, right? No, we should write some contrived... okay, it's in its don't-land-on-the-moon phase now. Oh, I also have a bug: highest reward should not start at zero; it should be negative infinity. I was going to actually try to put that in the first time. I don't know, negative 1000? I can put in negative infinity, but I don't know what that does; it might throw errors. Okay, it's learned about not landing on the moon. Our candidate would prefer not to be interviewed on stream, so we will not interview our candidate on stream. Want to know what to expect? I'm going to tell you what to expect: I'm going to ask you why my lunar lander doesn't land on the moon. Better not to crash, so it runs away. Is that actually true? It doesn't want to get the negative... I don't know; to say that it doesn't want to do anything implies that this thing works at all, and I have some serious questions. It's not predicting twos; it consistently stopped predicting twos, which is good. Whoa, whoa, that was close, boys. I think it's going to learn... no, it's back to doing this. It's learning how not to do that. We're just staring at nothing; RNG plays Pokemon. Okay, it's found a new thing: just not firing the engines actually gets a pretty good reward. It has learned... you can't just use the engine sometimes. But look, the rewards are higher; use the engine. I think it's stuck. I'm not giving it more reward for engine use. Come on, bring the engine back. No, it's very fixed in its I'm-never-firing-the-engine-again phase. Oh, remember, it's smoking weed Saturday; we're not trying that hard.
You know, we should really be writing test cases and writing a few different test environments, seeing if it can learn in the test environments. Or we can just run it again and see if it works better this time — what do you think? Okay, now it's learned how to spam the right thruster. All it's doing is spamming the right thruster. That can't possibly be good. You can't possibly be happy with yourself, spamming the right thruster; that can't possibly be a fulfilling lifestyle. Let's just jack the temperature up. Okay, guys, we've crossed over from machine learning to just randomly doing things. Do temperature decay? No — and that's without decay; we're still suffering from collapse. These new models love just spamming the right thruster — is that the new club, where you talk about how hard you can press the right thruster? Oh, remember that one that learned... reinforcement learning doesn't work, okay. All right, let's think of some contrived environments and let's make it work on them. You landed, but you killed everybody on board. You landed, but you killed everybody on board. We haven't even — like, at least the lunar faller was getting pretty good scores; the lunar barrel roller just sucks. I don't understand, it should be pretty high. Um, okay, I don't know — smaller batch size? No, no, no — we can get even more contrived than that environment, trust me. Oh, we got another lunar faller here — it fell right in the middle, though. Do these things ever learn, or is that just too small of a learning rate? Let's try 3e-5 — that sounds like a good learning rate, sounds like a good number, right? 3e-5. All right, we're going to need coffee, and we're going to actually have to try to write environments. Remind me to never try to do reinforcement learning on stream, because reinforcement learning just doesn't work, okay? And if you thought it worked, someone lied to you. Okay, I think this is just RNG lander; I don't actually think it's learning anything. My loss is not really going down. At least it's using a diverse set of thrusters. Yeah, I know 3e-4 is Karpathy-blessed — I've heard about that one, I've heard that's a good learning rate. You think you've ever seen it land? I got one to work once — I think I did finally get this to work with PPO and, like, a massive amount of tuning. Okay, guys, we're going to see it land. Well, I think it's learned to be a lunar faller. Yes, it's learned to be a lunar faller — a common mode of collapse. The poor crews. Okay, you know what — let's make it shoot for the stars: let's not update highest reward, and let's give it a target return of 50. You can do it — figure out what 50 might be. "Interpolate, bro." Oh, regret. We'll just change that to 50. We're not even trying anymore. We're not even trying anymore! Am I losing viewers? No, the viewers don't care if I try or not. No, we're going to try. What did I change? I don't even remember what I changed, I just changed something. Oh, right — I told it to shoot for the stars and try to get a 50, and it has no idea how to get a 50, so it just kind of tries random things. Look, it still has a twos percent — that's just really giving up; if you're predicting a two, you've just really given up. It's interesting — that definitely shows it's doing something, though. Look, see this twos percent — this might be too aggressive; 50 might be too far out of distribution. Let's try zero again. That was the bug that I wrote before, and I think that was actually the best run we had.
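For the target-return experiments (50, then zero, and the chat question coming up about decrementing it), here's a hedged sketch of the conditioning loop as the Decision Transformer paper describes it: pick a target return at reset, then subtract each observed reward as the episode runs, so the model is always conditioned on the return still to go. `sample_action` is a hypothetical stand-in for the model plus the temperature-softmax sampler; none of these names are from the stream's code.

```python
def rollout(env, sample_action, target_return=50.0, temperature=1.0):
    # sample_action(obs, rtg, temperature) is assumed to compute logits,
    # divide by temperature, softmax, and sample -- the fixed sampler above.
    obs, _ = env.reset()
    rtg, total, done = target_return, 0.0, False
    while not done:
        action = sample_action(obs, rtg, temperature)
        obs, reward, terminated, truncated, _ = env.step(action)
        rtg -= reward          # condition on the return that is still to go
        total += reward
        done = terminated or truncated
    return total
```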
The twos percent went away — this is when it learns to fly away. Are we in the wrong — oh, we're in Just Chatting. We probably are in the wrong category; we should change categories. We'll do that now. Done. "Shouldn't it be decremented at each step?" Yeah, it probably should. I don't know how to do that though — we can try that. No — okay, let's try some very naive environments. They're not dead, they left the moon; they went that way. All right, we're going to get some coffee, and we're going to try some simple environments, and if we can't learn the simple environments, then... that's just sad. If it starts crashing again, we'll play crashing music. We're going off on a cosmic adventure — that's my song, dedicated to all the lunar landers that have flown off into space. There's a beautiful flag set up for them, there's a great surface of the moon, but they just decided to go up. Okay, I think this is — we need the — okay, fine, 3e-4. 3e-4, that's the one that works. I'm going to get some coffee; we're going to try some real environments. You will not beat me today, lunar lander. If we conquer lunar lander today — oh, if we manage to use a Decision Transformer to solve lunar lander, we'd be back on track for doing what we said we were going to do on these streams. But you have to counterbalance that with the fact that reinforcement learning is impossible — and considering this is reinforcement learning, it's impossible, so I don't know why we're even trying. Still got to open the coffee. It's less easy, you think? Okay, I don't think I've ever seen it land, actually, to be fair. I don't know — okay, did it learn? We try it not because it is easy, but because we thought it would be easy. Okay, let's make a ridiculous environment. How do I actually write a real gym environment? I've been working on a song — "it's only dark the first time, we'll hang lights for those who come next." Oh no, I have to do render? Oh, that's hard, this is impossible, we can't do that. Keep your insights about reward scaling. Now, you ready? We're writing stupid crap. Okay, what do I have to write? I've got to write reset and I've got to write step, and for reset we return the observation. Does everyone understand? We're writing a contrived, stupid environment.
Okay — this times self.size, r sub... um, now we have to set self.done, and we don't do anything there. And for step, we return the — what's the observation? It's always the same observation. All you have to do is learn how to guess the stupid — I'm giving you the thing, you just have to take the action. You have to literally press the button: one of the buttons lights up, and you press the button. All right, does everyone understand the game? One of these buttons is going to light up, and you are going to press the button. Okay? There's my reward function. All right, is everybody happy? Is it terminated? Yeah, it's terminated. Is it truncated? No. And that's — all right, we don't need those parentheses; those parentheses were dumb. Let's go, let's go. All right, we wrote a stupid environment. It's called press-the-light-up-button — okay, we're going to use Java names here — WhichButtonLightsUp. Okay, you literally just press the lit-up button. Stupidest environment in the world. We can call it PressTheLightUpButton, for when you want to cite it in papers. Cite that in papers. Okay — "PressTheLightUpButton has no..." — okay, well, you know, that's because I didn't construct it. You've got to construct it. All right, boys, you've got to construct it, because if you don't construct it, you're going to — what the — you — Tuple? It gave you a Tuple? Well, that's because I put a comma there for no apparent reason. All right, let's get rid of that comma. Okay, well, why does it still make a lunar lander? Oh — because right there we're making a lunar lander. We don't want to make lunar landers — why would we do that? Let me just comment those lines out, because we don't need any more lunar landers, we have — all right, well, that can't be multiplied, that's too bad. Do I see one dollar? One dollar, one dollar — do I see two dollars? Three — ChatGPT, write this. ChatGPT, write this — why am I writing this? Oh, what — that doesn't even make sense, the Box — oh, that's low and high. Okay, well, high is that, and the Box is that. Okay, I just made the size of the box. Ready? They're all just zeros or ones; everything's a zero or a one. Okay, can you learn? No, you can't learn. Congratulations, you're an idiot. Literally, there are two buttons; you either press that button or that button. Don't press the two button, right? Can it learn how to play this game? It cannot learn how to play this game. This is the stupidest transformer I've ever seen. Okay, if you can't learn how to play press-the-light-up-button — it can't even learn how to play press-the-light-up-button. Okay, let's see if we can press the light-up button. Can we play press-the-light-up-button? Let's play it ourselves. So we're going to get the observation — print the observation — we're going to get an input, and this input is going to be called the action. We're going to get an action — you ready for an action? — and we're going to call env.step(act) and print it.
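For reference, here's a minimal sketch of what an environment like this looks like with the gymnasium API — reset/step, a Box observation of zeros and ones, Discrete actions, and the terminated/truncated pair discussed above. The class name and details are reconstructed from the description, not copied from the stream's code.

```python
import numpy as np
import gymnasium as gym

class PressTheLightUpButton(gym.Env):
    """One of `size` buttons lights up; reward 1 for pressing it. One step."""
    def __init__(self, size=2):
        self.size = size
        self.observation_space = gym.spaces.Box(low=0.0, high=1.0, shape=(size,))
        self.action_space = gym.spaces.Discrete(size)

    def reset(self, seed=None, options=None):
        super().reset(seed=seed)
        self.target = int(self.np_random.integers(self.size))
        obs = np.zeros(self.size, dtype=np.float32)
        obs[self.target] = 1.0
        return obs, {}

    def step(self, action):
        reward = float(action == self.target)
        # obs, reward, terminated, truncated, info
        return np.zeros(self.size, dtype=np.float32), reward, True, False, {}

# human play, roughly as done on stream:
env = PressTheLightUpButton()
obs, _ = env.reset()
print("obs:", obs)
print(env.step(int(input("press which button? "))))
```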
Okay, ready? You guys ready to play press-the-light-up-button? Let's play. Okay, did we mess this game up? Which button lit up — button zero or button one? Let's play button zero — well, look, we got a reward. Button zero, yay. Zero — oh, okay, well, this one we're going to guess one, and we're going to get it right. Okay, this one we're going to guess zero — look, we got it wrong. Press-the-light-up-button — come on down. It's the easiest game in the world. If you can't play press-the-light-up-button, you are a certified — all right, does everyone see how to play press-the-light-up-button? Does everyone understand the rules? Do you understand the rules? Okay, well, I can't even play this, so that sucks. Wait — this model is so stupid it can't play press-the-light-up-button. "You got 50% of them" — you guess one all the time, stop it. All right, let's look at some episodes of press-the-light-up-button. Okay, we don't need 128, let's just clamp it to four. Four sounds good. Are you learning how to play press-the-light-up-button? What — how did it do that? I don't even understand what just happened. Why is it doing so many things? I'm getting rid of that. I'm getting rid of that! This is the easiest game in history, and it can't play it. You literally press the light-up button. Okay, all right, everybody calm down. Let's take a look at the rewards, the states, and the actions from press-the-light-up-button. The reward is actually always zero — no, wait, that one had a reward. Okay, this one has some reward. Uh, what — I don't even understand this, it's, like, not right. Oh, okay — well, we still have decay. I don't know if we need decay; let's just get rid of decay, that just made things more complicated. I don't even think it matters, but whatever — it's just one more thing that can't be wrong. "Could MuZero learn to play press-the-light-up-button?" That's the same bug I wrote before — it's the exact same bug I wrote before. Idiot. Okay, see, sometimes it wins at press-the-light-up-button. All right, so there's the button: it pressed one, which was the right choice, since that was the lit-up button. And that one pressed one too. But that one pressed zero, and button one was lit up; that one pressed zero, and button one was lit up, so no reward. That one pressed one, button one was lit up — okay, great. So press-the-light-up-button the game seems to work. No, no — you press the light-up button! Okay, this is why we have subscriber-only chat. You know, honestly, how many of you would fail at press-the-light-up-button? Let's be real here — let's be real about who couldn't play this game. "What about an epsilon-greedy strategy?" Okay, I mean, we can even try — we'll set the sampler to just do that, no temperature — and it still can't learn to play press-the-light-up-button. You literally just press the one that lights up, bro. Okay, well, it can't learn press-the-light-up-button, but — oh. Oh, that's because my target return is zero. But then it should be trying to lose, actually, so I don't even know what to say about that. You were literally trying to lose and you couldn't even lose at press-the-light-up-button — you're that dumb. Something's just broken. Broken. Wait, boys, that was a good streak of press-the-light-up-button. Is it learning? Kind of? Well, the loss is definitely zero. Let's print the action distribution. Okay, this decision transformer is too complicated; let's make a stupider decision transformer. Okay, let's make the stupid decision transformer.
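A hedged sketch of what a "stupid decision transformer" could look like — no attention, no sequence, just a small tinygrad MLP over the current (return-to-go, state, previous action) features, predicting action logits. This is my reconstruction of the idea, not the stream's actual class.

```python
from tinygrad import Tensor
from tinygrad.nn import Linear

class StupidDecisionTransformer:
    # No transformer junk: one hidden layer over a single timestep's
    # [return-to-go, state, one-hot previous action] feature vector.
    def __init__(self, state_dim: int, act_dim: int, d: int = 12):
        self.l1 = Linear(1 + state_dim + act_dim, d)
        self.l2 = Linear(d, act_dim)

    def __call__(self, x: Tensor) -> Tensor:
        return self.l2(self.l1(x).relu())  # action logits
```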
God, I don't even know. d = 12. We were so far away from anything actually working — it couldn't even play press-the-light-up-button. Okay, stupid decision transformer. Hey, decision transformer, I want to tell you: if stupid decision transformer solves this problem and you can't... Oh, we have a jit, that's fine. Wait, huh — let's just use the seq len, just not put minus one. Okay, it doesn't even do any transformer junk — and this one can't learn either. Okay, does machine learning work? Real question. Wait, does tinygrad not work? What if tinygrad's broken? Did I break tinygrad? I did make a lot of changes; it's possible I broke something. Let's see if GPT-2 still works. That's just a different problem. Wait, did I just seriously break tinygrad? Maybe, or maybe those tests are just dumb — "has no attribute val" — oh, okay. GPT-2 still works. See, beautiful_mnist still works. Wait — that's less beautiful than usual, though. Wait, I think I might have broken tinygrad. That's stupid — this whole time, I had just broken tinygrad, because of that contraction thing that was wrong. This is making me mad. Oh, deep learning sucks, bros. If I run beautiful_mnist on master, what accuracy do I get? Oh yeah, look, see, that one works. And the worst part is it kind of works if you do it wrong — it just doesn't completely work. This should get like 98, but I broke something, and it doesn't get 98; on the all-Transformer branch it's just worse. There's some bug. This doesn't work — but that was completely my fault. Okay, well, I'll note that even with that bug fixed, it's still — deep learning is impossible. I know, I know — look, I know all these things. All right, we're doing better now, we're doing better. Wait — I don't like how I subtly broke something by doing that. Tinygrad needs so many more tests. I will note, by the way, that master still works fine; it was my fix from before that broke things. That's what we get for smoking weed, you know — that's what we get. That was before we smoked weed. Okay, this should be the easiest thing in the world to learn. I'm literally just asking you to learn to put things in the right order — and I did put stuff in the right order. Target return — what did I set the batch size to, 16? That's crazy high. We now have the simplest possible problem: the action, the state, and the reward. Why is the reward always zero? But that's not even what we're learning on — let's print what we're learning on. Wait, what is this — never called with training?
Right — Tensor.training equals True, then we — oh, we call model.forward. Okay, that's fine. Wait, whoa — that's not really right — no, that is right. We thought this was all going to work, with the complex jit and stuff. Wow, we were so off the mark. We were so Dunning-Kruger on this, you know. Okay, I'll press the light-up button. We're going to — at least — if we can't solve press-the-light-up-button, we're going to have to give back our machine learning card. They're going to take it away from us. Uh, oh — do I have to put an axis into softmax? No, this one's probably right. "Once it works, will it scale to solve AGI?" Yeah, probably. I mean, that's the kind of cool thing, right — after you get press-the-light-up-button to work, there's very little between press-the-light-up-button and AGI. Literally — literally, you output the action. Okay, the data looks fine. Why is the learning not happening? Let's comment out the forward jit. I don't think it's my embedding; that doesn't even matter, because that's always fixed for the one that I actually manage to use. All right, let's get stupider. Oh — I guess, what if press-the-light-up-button is actually a very complicated game? We have an even stupider version, where the button that lights up is just always the first button. Can it learn this? It can't even learn that. Oh — guys, I think it finally learned to play a game. It finally learned how to play a game! Okay — it plays press-the-light-up-button where the button that lights up is always the first button. It took a long time to learn how to play that game — like, it takes many episodes. Oh, that time it didn't get it. Something's really broken — that time it failed at literally press-the-light-up-button where the button that lights up is always button zero. Reinforcement learning is impossible. And also, how did I break tinygrad in such a way that it's just subtly wrong? What did I do wrong here? Oh — I have to check to see if the strides are actually contiguous. Okay, well, I see what I did wrong there. No, we're not even using transformers — we haven't even gotten to transformers yet. Okay, that embedding seems to work. How about the state? That embedding seems to work. How about the reward? That embedding works. Okay, good, all the embeddings work. Let's look at the embeddings together — do they work together? I think so. Sometimes that one's different, but that's always the first action embedding — okay, should be fine. What did I hardcode? 12, right there. Okay, that's the action, that's the state — see, we have action and state, and then we have reward. We shouldn't even train on that last one, actually; I can truncate this, I can even clip R to one. I have two buttons; it's the dumbest thing I could possibly make. Wait, why does that affect that? Why is the loss on that — oh, because you have to take the action for the next one. Okay, that's right. There should be normal numbers here, right? Those aren't numbers. No, there should be numbers — what? Oh, those are the actions, those are the states — oh, and I guess we need the rewards too. So these are the states, these are the actions. So we got one — okay: here, when the button was one, it pressed one, it got a reward of one. Button was one, pressed one, got a reward of one. Here, when the button was one, it pressed zero, got a reward of zero. Okay, great. The optimal strategy is to literally press the light-up button. It's the dumbest strategy in the world, this is the dumbest game in the world, yet this thing for some reason can't learn it.
Yeah, okay — sometimes you just have to press the light-up button, and we're going to write the dumbest code. Okay, the code's about to get dumber: we're going to start from dumb, and we're going to go to smart. Okay, so this is the game; it's called press-the-light-up-button. We didn't bring in the whole game — we also need this. I'm going to bring in the stupid decision transformer. So, by the way, we were just staring at RNG all the time before. Do I need a start_pos? I don't even need a sequence length. Okay — now, okay, we don't have to go this far. I'm getting rid of this. We're just going to be very careful, slow, and methodical — I know those things aren't my strong suit. We gathered data; let's test the model. So, for reward — the action we're going to start with is action three, the observed state is (1, 0), and our target return is one. You know, you do this — you never forget, never forget how this thing works. We test the model; for some reason we're putting some other logits there, which doesn't really make sense — oh, it probably goes through a softmax and stuff. Okay, regardless — argmax the action out of the model's probabilities — no, don't do that. Okay, so this is our desired reward. This is our desired reward: here, when this is pressed, we don't press that. Why does this not learn? Why does it ever learn that this is an okay action? It never sees that action. It's just — it's almost like it just isn't learning. optim.zero_grad, loss.backward, optim.step — I'm doing that right, right? I'm putting the parameters of the model in. Something's just so broken. Remember when we just stared at the lunar lander and thought it might land? Ah, the old days. It's like — the loss is really high. Look, so this is the same as that. Let's make a contrived train. So hang on — we can actually just do this with batch sizes. Um, target return one, one — well, that's somehow the timestep, which we don't really care about. It doesn't matter; it's equivalent. Okay, does this — if I sparse-categorical-cross-entropy it — we're running out of places for bugs to hide. But I don't think it's better in PyTorch; like, I've struggled with stupider things in PyTorch.
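The training step being described — zero the gradients, backward, step, with the model's parameters handed to the optimizer — looks roughly like this in tinygrad. `model`, `X`, and `Y` are hypothetical stand-ins for the contrived batch, and exact module paths can vary between tinygrad versions.

```python
from tinygrad import Tensor
from tinygrad.nn.optim import Adam
from tinygrad.nn.state import get_parameters

optim = Adam(get_parameters(model), lr=3e-4)  # `model` is assumed to exist
with Tensor.train():                          # enable training mode / grads
    for _ in range(200):
        optim.zero_grad()
        loss = model(X).sparse_categorical_crossentropy(Y)
        loss.backward()
        optim.step()
```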
Okay — a.logits, sparse categorical cross entropy — okay, so desired_act, make that a tensor. Desired action there. Does it learn? It doesn't learn. Like, is my embedding broken? Let's get rid of just one of these. This is completely independent of time. No — now we need that. I mean, if it can't learn this, how's it going to learn anything? "Can you look at the gradients and stuff?" "It just hasn't trained long enough?" No, it gets no smarter. It doesn't even learn that twos are bad actions — well, it does, very slowly. We don't have norms. I bet there's some bug with — okay, this is the token embedding every time, and it does move a little bit. Wait, why are those the same? Why are those the same? They start out pretty close to the same — all pretty close to the same. That's sort of annoying. What did I initialize the embedding with? Glorot uniform — that's fine. What do I initialize linear with? Kaiming uniform — that's fine. Oh, okay, so each one of those is one embedding. So that's the action embedding — those actually all are the same. This is the reward embedding, which, again, makes sense. This is that, and this is the state embedding. And then it's being told to predict either this or that — assuming this actually works, which I'm starting to think is where the bug is. I don't know if this thing was designed to work for more than — what if I do this, and then here I do reshape(-1, 3)? It learns, just very, very slowly. Okay, I don't know — let's give it a larger learning rate. Okay, look, it learned. Why are these learning rates just so small? Well, at least it's learning not to press that one. The mean — I mean, that's totally right: take the log softmax, multiply by that, take the sum. I don't think there's anything in loss_mask — true, it divides it by one, that does nothing. Take the derivative of that: zero_grad, backward, step. It's learning how to diminish the difference between those — it's just learning how to mush all the loss together. I feel like there's some broadcast issue, but I'm not sure where it would be. I'm just, like, lost. I should expect this to work. What did I forget? Why is this loss not going to zero? Does that loss go to zero? That loss goes to zero, but it can't distinguish them for some reason, even though they look pretty different to me. On topic, please. Okay: target return for the batch — timestep zero, timestep one, timestep two, timestep three. Target return is, like, the desired reward — we can call it reward if you want. Desired action, okay. State — let's go through the states. Okay, so the state for batch size one, timestep one, is this. "Is the desired reward infinity?" What are you talking about? So we'll go through all the states now. If your target return is that, you do that; if your target return is zero, you do that; if it's that, you do that. Okay, good, let's turn these into tensors. Like — I'm going to get this, and then what's it going to be? Oh — the fake action goes here. Then my loss is literally just this. Yes: sparse_categorical_crossentropy with desired_act. Am I trying to teach a linear model XOR? Maybe. Maybe — you want me to give it another layer? Well, now the losses are more of a mush — but you're right, because it was converging to something that looked like what happens when you try to teach a linear model XOR. Let's give it another layer. You agree this should definitely be able to learn it, right? No — same crap, same blah. Though, to be fair, no, that's not XOR that I'm trying to teach it. Give it a d of 12 too? No, that doesn't matter.
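The loss walk-through here — "take the log softmax, multiply by that, take the sum, loss mask divides it by one" — is just masked sparse categorical cross entropy. Here's a small numpy sketch of that computation, handy for checking a framework implementation against; it's an illustration, not the tinygrad source.

```python
import numpy as np

def sparse_categorical_crossentropy(logits, targets, mask=None):
    # logits: (N, C) float array; targets: (N,) int class indices.
    shifted = logits - logits.max(axis=-1, keepdims=True)      # stability
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))
    nll = -log_probs[np.arange(len(targets)), targets]
    if mask is not None:                 # zero out padded / masked steps
        return (nll * mask).sum() / mask.sum()
    return nll.mean()
```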
Same blah effect. Uh, okay — the target returns are wrong. I mean, okay — that seems to learn the function. Let's just test the model here. No norms or anything — there's no reason that shouldn't work. No, it doesn't work. All right, let's get stupider. I agree — I don't know if it's tinygrad weirdness or — oh, I have to name it properly: even stupider decision transformer. Okay, so now we concatenate them on the final axis, same as this. It's the action embedding that literally does nothing — this does not matter. Okay, what's the function I'm trying to learn? It's, like, some L-looking chart. No — we also need to know the reward; we can't just know the state. One, two, three, four — okay, it's four; the output action space is that plus that. It didn't learn. Yeah, we're going to try just the state next — okay, we will try just the state. No — we need the state and the reward; it doesn't work with only the state. We don't need the action. That should learn, right? It definitely learns a 50/50 probability there. It's probably bugs in tinygrad — now I'm thinking that's what it has to be, because this doesn't make sense. Okay, it can't distinguish those. It's not XOR that it's trying to learn, is it? This needs to be zero, this needs to be one, this needs to be one, and this needs to be zero — it's like a linear function. Okay, let's get even dumber. Okay: we have (1, 1, 0), and for the output you should predict zero. You have (1, 0, 1), and for the output you should predict one. You have (0, 1, 0), and for the output, predict one. And then you have (0, 0, 1) — let me just put a float in each one of them — and you should output zero. Okay, great. Um, model — state space is three, action space is two. "'list' object has no attribute 'linear'" — oh, we've got to make those tensors. Great. Uh, okay, let's go stupider than this. I want you to learn to predict this — XOR — and a linear layer can't learn that. That's XOR; well, it can't learn that, so fine, we need a nonlinear layer. How many layers do I need? Is two a good number of layers? Deep learning just doesn't work; I give up on deep learning. Okay, is ReLU a good enough nonlinearity? Okay, now it learns — with varying degrees. No — sometimes it doesn't learn. Sometimes it doesn't learn! See, that time — dude, deep learning is — okay, there we go, now we've got something that should always learn. Lottery ticket hypothesis, dude. I feel so scammed — like, deep learning is a scam. Okay, all right, let's work our way back up the stupids. Let's work our way back up to stupid. Okay, all right, that one was really stupid. Let's look at this one — it didn't learn. Okay, let's give it a little more space. Print a comment that it works, whatever. Okay, now this one doesn't work — why doesn't it work? Okay, the stupidest test works. Instead of doing this, let's change this to zero and one. XOR is a hard function, to be fair — it's very — I understand why, it's very difficult. Sparse categorical cross entropy to Y — nope, and now it doesn't learn again. I don't understand why it's XOR either, but that one learns. So stupidest-test-two doesn't work — there's something wrong with sparse categorical cross entropy. Oh, is it because they're not floats? Oh, maybe — no, actually, they shouldn't be, right? No — it's because those are actually the wrong shape. It should actually just be this. Okay — of course, with no error. And it seems to learn slightly slower, but it still learns. Okay, all right, let's make a stupider test. We're slowly working our way up to deep learning.
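The XOR dead-end here is the classic one: a single linear layer can't represent XOR, but one hidden ReLU layer can. A minimal tinygrad sketch — hyperparameters are guesses, and, as seen on stream, a net this small can occasionally land in a bad initialization and fail to converge:

```python
from tinygrad import Tensor
from tinygrad.nn import Linear
from tinygrad.nn.optim import Adam
from tinygrad.nn.state import get_parameters

X = Tensor([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0]])
Y = Tensor([0, 1, 1, 0])                 # XOR labels as class indices

l1, l2 = Linear(2, 8), Linear(8, 2)      # one hidden ReLU layer suffices
optim = Adam(get_parameters([l1, l2]), lr=1e-2)
with Tensor.train():
    for _ in range(500):
        optim.zero_grad()
        loss = l2(l1(X).relu()).sparse_categorical_crossentropy(Y)
        loss.backward()
        optim.step()
print(l2(l1(X).relu()).argmax(axis=-1).numpy())  # expect [0 1 1 0]
```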
Um — state space is unused for some reason? It's four. Oh, was I trying to use even-stupider-decision-transformer for something? I don't think I tried to use that one. Let's just use stupid decision transformer. Okay, doesn't work — the losses got low there, that's pretty good. All right, it seems like that works now, for some — I don't know how. Where were we — press-the-light-up-button. Okay, uh — action space should not be three, it should be two, but maybe that's what's causing the problem. Now that works fine. Huh, okay — maybe it was just the XOR thing the whole time. All right, let's play press-the-light-up-button — does this work? No. Put it all together — it doesn't work yet, but we're getting closer, guys, we're making progress. Okay, stupidest test seems to work. Um, we can now do that max, okay, and then we can put desired out. Okay, it's going to work. Okay, look — they are equal now. Okay, we got something to learn. I don't know why the other one doesn't learn, but stupider test passes. All right, great. Okay, so that's two-by-two. Let's print this here and see if that matches. Okay — don't need that, that's junk. Guys, we're almost there. Okay, if stupider test works — we comment that it works. Okay, guys, you don't understand how close we actually are to AGI. As soon as we can play press-the-light-up-button, that's AGI. Like — because if it learned to do that — you know, it's astonishingly hard to learn tic-tac-toe. Um, we have this one working, so now let's see how this somehow doesn't match this. Wait, I think it is learning — no — oh, there we go, it learned how to play the game. After many struggles, it sometimes learns how to play the game. It learned once how to play the game; now it's just learned how to output zero. Uh, human play: one — oh yeah, we got reward! Zero — we got reward! Yo, yo, this game's fire, boys. We're pressing the light-up button. One — oh, I didn't get reward, I played wrong. Yeah, stupider test learns. Okay, this learned, like, once, but I don't know why it doesn't reliably learn. It should be pretty much the same thing, though — I guess there could be sampling noise here. See, look, it's just learning to always play zero now. It learns to always play one, too. Too aggressive with the learning rate — let's lower the learning rate. Okay, what happens if I lower the learning rate? Stupider test — does it learn? It still learns. Okay — it doesn't learn as decisive a policy, but it definitely learns. Okay, that's fine. Right — now it's not learning anymore. It's the same stuff. I have a state: 821, 822, 821, 821 — oh, okay, to be fair, I'm putting two of them in somehow. No, but I'm not. No, no, no, that's just — I guess I want the forward on two of them, but that shouldn't matter, because, regardless — let's not do that, I'm doing extra compute anyway. Uh, so let's just cut — that, that, that — we'll cut the last timestep off, because we only have those actions, and then we can just do logits, sparse_categorical_crossentropy. Changes nothing. How is this different from stupider test, which, may I remind you, works? Well, that doesn't matter, because I have the same test-model code here. No, actually, I don't — let's get the test-model code. No — see, it doesn't work. Kind of works. Anyone who ever told you they were doing deep learning is lying. I think the companies have just figured out that you get people to type in all the weights, and then they tell them it's deep learning — everything else is just a red herring, you know? It's just a psyop; there's no deep learning. I think this works. Are we making any progress? No. A deep state, right? Sounds right.
right no like learn you piece of if I if you ate dinner I wouldn't give you any tonight oh look that one's increasing oh it's going down again oh that one learned to predict one oh this one's learning to predict one too oh is it going to work oh look it learned it just took forever the reward is now one all the time it's just the slowest like thing you can imagine I see that one learn to be one assume that one's going to learn to be one there we go I mean it it learns it just takes 32 years okay let's put the big decision Transformer in and see if that works maybe it's [Music] faster Jack no Jackal it up it collapses okay now we have this bug again great well we at least know how to fix that for [Music] yeah that still works we break it uh the rides okay to check if those strides are contiguous or not for [Music] [Music] on damn it that's contiguous see the problem a for this is this is just frustration this is this is not good programming you can't actually do that what strides are both zero wait that's actually fine [Music] [Music] that one's not fine I'm not even returning [Music] rat I don't even know anything about [Music] that I just returned on there is that fine I didn't need that at all oh I needed some of that maybe that's when I went off the ra and got stupid supported type for oh well that's a different problem I think I think I put none in there somewhere yeah I have none in the test that's fine okay well now this one doesn't learn and not only does it not learn it doesn't learn very slowly quickly converge to all ones see if I broke anything fun [Music] that so quickly just outputs just snaps to one output this one learn is just excruciatingly slowly it would learn faster if I gave it more layers actually [Music] might no no 's at about the same shitty speed in fact it still hasn't learned has never see that example okay now it's learned reinforcement learning doesn't work okay someone might told you that reinforcement learning does work but they're a dirty liar okay cuz it doesn't work okay well now this is again collapsing immediately to a ridiculous State the model test should still be right right what is the let me test the model before I even train okay don't update that's just strange look some of them output two it learns not to Output two and then eventually it learns to Output some ones bias is towards zeros but eventually that one flips this one and now this one's going to flip there there's one and it [Music] converges okay that pretty reliably converges and it looks like it stays converged now why can our decision Transformer not learn as we switch to this thing it doesn't learn anymore okay first let's get rid of the jit we don't need a jit right now JS just one more thing that can break things don't the jet was breaking things but look it's like good and then it updates the first step just destroys it completely is the learn right too high maybe but again it's it's converged to junk converges immediately to that okay we I mean we can do roll outs like [Music] while length of all are less than the S so we get some [Music] diversity oh wow it learned look at that very fast too see look what I'm looking for so see these are zeros this is one zero um I wish I'm numpy how do I not do scientific notation what are we suppressing okay it becomes very confident in its outputs and it shouldn't be like this one didn't totally converge that that's wrong there again we're back to very quick snap judgments that are wrong entirely based on what we get as an initial uh just do 
You feeling lucky, punk? It's learning something. Now we're making some progress — there we go, see, it's right now. Okay, cool. It didn't work with the jit. Um — we have no idea if it works with temporal things, but at least it's winning most of the time at pressing the light-up button. Uh, where's my rollout — where I print the — where's that print? Let's stop testing the model — oh, it's not doing any more rollouts, I see. Uh, put this back — here we go. And it gets one every time — but somehow it's wrong about that. Oh, that time it got it. Okay — it's very sensitive to the initial conditions. Did not learn then. Um, I don't know — should we try going deeper? Let's go deeper. I'm literally having GPT-2 play press-the-light-up-button, and it loses at it. My initialization may just be crappy, too — like, I kind of suspect that's why tinygrad didn't train the ResNet to the required accuracy. Just, like, yeah — my initializer for this. I'm doing Kaiming uniform, but who knows — who knows if that's right. Um, inside the transformer block, who knows. The feed-forward linears — are those right? I don't know. Where do I get the hidden dim for that? Four times d. Whoa, that's ballsy — let's go small. Is that one going to flip? That one flips. This one's still learning, too — it's only 90% confident, see the confidence here. Okay, then that one flips — there we go. Okay, I mean, it's learning to play press-the-light-up-button, that's all I can say. There weren't any bugs; it was just lots of hyperparameters — and also, reinforcement learning doesn't work, okay? Now — do I just do gym.make("AGI") and it's going to do AGI? Okay, well, we can't trust the model anymore. I'm going to commit that, because I know it works sometimes — if we get a good seed. Can I increase the temperature? The temperature is already very high — you can look at what the probabilities are converging to, and I don't think that's going to help you. Okay, cartpole — press-the-light-up-button is a really hard game. Okay, well, that one just converged to pressing one all the time. Oh — am I not — what's my target return? I think my target return is zero here and my target return is one there, so that's not going to work. Uh, we can go back to highest reward; that's mostly pretty good. It's not that my temperature is high, though, I'm telling you. 0.7 — we can go higher, I'm telling you. "Shouldn't they initialize this with deep Q?" Where'd my line go, for my training here? That shape — that's stupid. Let's bring the plot back. Should I use two figures? Let's use two figures. Oh, GPT, write this for me. And we have to give it a while. I also disabled the jit, unfortunately, so it's a lot slower doing the rollouts. We can try to re-enable the jit and see what I broke there — I have an idea that I might not have broken anything, and that it was just a test. Okay, the loss is not really going down, but that's okay — give it time, all things take time. I like that it's maintaining diversity, at least. What did I make the batch size — eight? I could probably make it bigger. Whoa, we love diversity. I don't know about equity and inclusion — I don't know if those things help RL. Now, really, we need merit. We need you to take the best action. You know what the action is — the best action. You've got to take the best action. One time it got 93 — that was a good time. That was a long time ago, though. I don't know — like, is it learning? Maybe it's learning. Probably not. I actually doubt it's doing anything. At least it has not collapsed; it's still outputting twos sometimes. This thing is unforgivably stupid. I don't know.
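On the "four times d" and initialization questions above: the standard transformer feed-forward block expands to a hidden width of 4*d and projects back, and, as far as I know, tinygrad's nn.Linear defaults to Kaiming-uniform initialization (nn.Embedding to Glorot uniform, matching what was checked earlier). A hedged sketch of the block in question — whether these defaults match the reference Decision Transformer implementation is exactly the open question on stream:

```python
from tinygrad import Tensor
from tinygrad.nn import Linear

class FeedForward:
    # Standard transformer FFN: expand to 4*d, nonlinearity, project back.
    # tinygrad's Linear initializes its weights Kaiming-uniform by default.
    def __init__(self, d: int):
        self.up = Linear(d, 4 * d)
        self.down = Linear(4 * d, d)

    def __call__(self, x: Tensor) -> Tensor:
        return self.down(self.up(x).gelu())
```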
No — you don't understand — there's no policy. You just put the reward in and you tell it to do things. Oh, okay — so one thing I did was I got rid of — so that's equal there. Okay, let's try that. I don't know if that's going to change anything, but — we should also be able to learn better if I increase the batch size, so let's put the batch size up to 32. "Can you give an explanation of this transformer?" Read the paper. Of course, the paper makes it look so easy — until you actually try to make it deep-learn, and then you remember that deep learning, especially reinforcement learning, doesn't work. Oh, look at that loss going down — oh yeah, go down, yeah, that's the direction you want your loss to go. Did it learn, or did it collapse? Okay, the loss looks like it's still going down. It may not understand the monotonicity of reward. How do they choose their highest reward — how do they choose their, uh — oh, I guess — well, this problem kind of sucks. Target return — yeah, see, the loss is going up. Okay — it's nice that we can always go back to press-the-light-up-button. It's a great task for everyone who believes in tasks. It should pretty quickly get — yeah, okay, look, it actually gets good at press-the-light-up-button. Okay, so we've made some progress. Notice how it's getting one reward every time — it's clamped to the highest reward, that's pretty good. Um, do you want to also see if it can lose every time, just for fun? We can just set target return to zero, and we'll see if it learns to lose every time at press-the-light-up-button. So give it a little bit — lose — oh, here we go — there's the way to lose. Okay, so it takes about 50 episodes of press-the-light-up-button, and it can reliably play that game. I feel so good. We did reinforcement learning, and it learned something, so I'm hopeful for everything else now. Um, I don't know if we should try to re-enable the jit. Does it reliably learn to play press-the-light-up-button? If we could get something that could reliably learn to play cartpole, that would be a great victory for reinforcement learning. And then — you're right, I've never seen that lunar lander actually land nicely. I think I saw it once on YouTube, but they had a guy with an Xbox controller. Are you streaming, Alex? We've learned how to play press-the-light-up-button. Do you know how to play press-the-light-up-button? What is that for you — a court summons? No, it's your ID. Oh, it's my ID, returned — whoa, that's cool. Oh, voters — we're not doing that this year, guys. Every time I vote, I'm upset, and every time I don't vote, I'm upset, and that's why — do you know how to play press-the-light-up-button? No? What is it? Okay, so you have two buttons, and one of them lights up, and you press the one that lights up. Okay, that's the whole game. Sounds very straightforward. Well, I know — but trust me, not all models are smart enough to learn it. But don't worry, we now have a model that, after playing the game just 50 times, can learn how to play press-the-light-up-button. Um — ID reveal if you did gift subs? I would think about it, but — okay, let's play lunar lander. All right, we got lunar faller. Well, it's going to have to do a bunch anyway before it goes. Uh, I need more coffee, and I had something else to do, but I forget what it was. You going to make coffee? No, I'm going to have cold coffee. Why are you making coffee? It was disgusting, and I don't know why I got it. Do I want a Poké Ball? I wouldn't say no — yeah.
I need a Poké Ball. All right, has it learned lunar lander yet? I will say, the loss is very much going down. Wait, boys, it might be working — no, we're staring at a random thing again. Remember when we stared at a random thing last time? We never really found the bug; we just fixed it. What happened to not wearing all black? I gave up on that. Now I'm back at work, and all I care about is work. The loss is still going down, so I actually have big hopes for this model, even though the — oh, is the position embedding enabled? I don't know — uh, yes, it is. We definitely had a bug before — the one that was breaking beautiful_mnist. Still putting out twos some percent of the time. Whoa, I think it's learning. Okay — you want to let it do ten on its own? "It can't learn cartpole" — but someone said this thing could never learn cartpole, and that you had to initialize cartpole with good policy data. I think it is getting better. I agree — I mean, the rewards aren't better, but that was worse. It's trying things, though, which is cool. Oh, big spike in the loss there. We should probably have something that discards the older samples. It'll also be a lot faster if I disable the display. "Try a positive target reward" — well, it has a positive target reward now, because it did manage to hit that once — I don't know how. Okay, but look — the loss spiked, but that could be okay, right? It could just be that now it's exploring a different part of the space, which is actually pretty cool. I think this is going to learn, I just don't think it's fast. Should we implement saving of the model? Like, why not — saving the optimizer state — tinygrad's not good about that stuff. Okay, I think it got dumb again. It still has all the good ones in the replay buffer, so — now, look, the model got dumb. That's the loss there — it was going down, and now it's gotten stupid. So, no, I don't think it's learning. Learning rate scheduling? I think the — I mean, the distribution of the data definitely changed, right? And this may actually reconverge; I can't say it's gone forever. But part of the problem may be that this highest-reward thing — it really has no idea that the rewards are, like, monotonic, and I wonder if the paper does anything about that. No — see, they just have — separate tokens aren't going to help. We haven't done that many rollouts — like, whenever I've done lunar lander before, I've never gotten it in fewer than 500. The loss is coming down again, and so is the reward. Like, it's possible that it just does well on one, and then gets itself into a bad state and has no idea what to do. Okay, I think we should add back in — do I have reward — oh, I do have reward subtraction again. Is there a way I can change the render mode? What's the default render mode? Okay, never mind, complicated code. How slow is it if I'm not doing render_mode="human"? What — why is that not — oh, that's all still the first — pretty useless. Oh wait, what are we doing — what — I still have a two there! Well, that doesn't work. That might be why cartpole didn't work, actually. Um, okay, let's go back to playing press-the-light-up-button, because we know press-the-light-up-button is reliable and a good game, and then let's increase that two to things that are not two. Okay, it learns how to play this. Um, let's raise this up to 128. It should still learn, just slower — maybe not much slower — no, it doesn't learn with all that garbage padded on the end. We might want to mask it. Okay, where does it break? What if I do four? I think the problem might just be that there's, like, no signal in the thing anymore — or is it explicitly broken? Interesting.
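"Something that discards the older samples" is just a bounded replay buffer. A minimal sketch with a deque — illustrative names, not the stream's actual data structure:

```python
import random
from collections import deque

class ReplayBuffer:
    # Keep only the most recent episodes; stale rollouts from a much worse
    # policy fall off the back automatically as new ones are appended.
    def __init__(self, max_episodes=512):
        self.episodes = deque(maxlen=max_episodes)

    def add(self, episode):
        # episode: e.g. a list of (state, action, reward) tuples
        self.episodes.append(episode)

    def sample(self, batch_size):
        n = min(batch_size, len(self.episodes))
        return random.sample(list(self.episodes), n)
```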
Okay, it might just be explicitly broken with more than two — and that's fair, actually. This might just have to do with how I'm doing sparse categorical cross entropy. Oh yeah — okay, well, again, I don't know how lunar lander did anything, because this is totally wrong. This has to actually be this, I think — no, that's fixed elsewhere. Okay, so just with three it breaks. Uh, I don't know — do we have a test for sparse categorical cross entropy with more things? There should be a positional embedding too, so they shouldn't actually break — this is where it came from, along with code that learns lunar lander. What's upside-down reinforcement learning? Rewards — no, whatever. Okay, good, let's just fix this for this. Um, so the problem is in clip_r. Oh, I shouldn't put two there, first of all — that's a mistake that happens to be correct. I should probably actually do minus one there; these won't have any loss, if my sparse categorical works. Okay, that works. It should never learn to predict two. For some reason that has action_space + 1 — I don't really know why; it's just a fake action, it's fine. Okay, but yeah, this is fine — I'm not actually masking the transformer; that worked before. Okay — okay, that works. Uh, let's put it up to 128, see if it still works. We're just masking the return — we're masking the cross entropy at the end of the episode, which is fine. See if it works at 128. Okay, good — takes a tiny bit longer. I don't know — the transformer's masked, great. All right, cartpole. Mostly collapsed. "That'll never work" — get out of here. I've told you about the volcano, man. I've told you — what kind of person would I be if I took you seriously? What kind of person am I, that I get upset about why my reinforcement learning doesn't work? That's what you really have to focus on; these are the real problems in the world. Am I okay? Yeah. There are just idiots on the internet, you know, and it's upsetting, because I wish everyone on the internet was smart — and they used to be, until the Eternal September, and then the noobs just kept coming. You know that porno where the girls never came? The girls never came, and it was just dudes. It's like that, but the internet — and it's the idiots, and they do come, and it sucks. Hey — like, I feel like it wants to learn. Even subscribers-only is lost — yeah, I know. I mean, it's just the hordes — it's the same thing happening in the real world: there's no more separation between the first and third world; the internet's just the third world, okay? Like, I feel like it wants to learn — every time it does that, you know? You can learn, man. Learning is possible. You can be smarter — be smarter. Uh, my positional embedding could be messing things up — you can turn it off. Also, cartpole may just not work. I like seeing my lunar landers. What was it learning before? We were just getting lucky before. You need better-than-random inputs. Okay, so we did have great — like, there were great results in the loss with lunar lander. Okay, this is really slow — should we try to turn the jit back on? That guy said it took 300 epochs — how do they define an epoch? A good result after 700 epochs — oh, that's cool. Desired return and actual return — oh, we can plot that too. We're going to plot our desired return on the same graph as the first one — that's pretty cool. Oh, where is that — a list? That's a list. Okay — I think we can get the jit turned back on. Let's get the jit turned back on — I hate when things are slow.
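The fix being described — pad episodes up to the 128-step context, but mask the cross entropy so padded steps contribute no loss — can be sketched like this. The mask produced here is what the `mask` argument of the `sparse_categorical_crossentropy` sketch shown earlier consumes; names are illustrative.

```python
import numpy as np

def pad_episode(actions, max_len, pad_value):
    # Pad the action sequence to a fixed context length. The mask is 1 on
    # real steps and 0 on padding, so the model is never trained to
    # predict the pad token tacked onto the end of an episode.
    n = len(actions)
    padded = np.full(max_len, pad_value, dtype=np.int32)
    padded[:n] = actions
    mask = np.zeros(max_len, dtype=np.float32)
    mask[:n] = 1.0
    return padded, mask
```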
Uh, okay — press-the-light-up-button. What did I break? Oh — how did it get negative 1000? Oh, that's the desired — oh, that's annoying. Okay, that learned even faster than usual, that's great. Um, can it learn with the jit on? We have to add back the jit reset, and roll out model.forward — jit reset, forward jit. No, it doesn't learn with the jit — oh, maybe it does — oh, we're good, it just took a while. Okay, fine, it just works. Good. We've had very few tinygrad bugs, except for that one from before; most of these bugs are just kind of hyperparameter, reinforcement-learning-is-impossible kinds of bugs. All right, lunar lander — let's go. And it keeps learning, learning, learning — we're raising the ceiling. This is one step away from AGI — don't you feel it? If we just, like, turned this on the world instead of lunar lander — I couldn't land this lunar lander, you know. It's just the same algorithm as AGI; you just need to scale it up a little bit. They got 700 epochs, but I felt like they were getting gradually better over time, whereas we, on the other hand, are not. You know how to do better? Do better. Remember that time we thought it was learning, but maybe it wasn't? His orange line kept going up; my orange line does not. Okay, all right — all right, fine, fine, fine — we're in the period of great struggle. Wow — let's watch a GIF of it working. Wow. Wow, that's a nice GIF. Is the cutoff still two? No, I fixed the cutoff — no, I changed the clip to 128. I think that's fine — yeah, none of these episodes are going 128. All you do is learn by trial and error, okay? You say that's not how AGI works? It's totally how AGI works. I feel like the loss got lower last time. I don't know — maybe we should try a higher learning rate. The learning rate is low — and these problems are low-dimensional too, which sucks. Let's also change this to every 25, to watch an epoch. It is fast now too, which is nice — not sarcastic. This is AGI, boys — this is it, it's just going to take off from here. Okay, look, the loss is lower than last time, now that I raised the learning rate. Oh, it's learned to be a lunar faller — oh good, it remembered how to be a lunar faller, that's good. Come on, learn. All right, lunar faller — there are a lot of episodes of falling, but it fell right in the middle that time. It's hard to overcome being a lunar faller; you can get a pretty good reward by falling, you've just got to fall in the right place. This is definitely not what's in your brain — okay, something in your brain has got to work way better than this. "SpaceX landing produced by AI?" Have you seen AI try to do this? You think SpaceX used this to land the Falcon? What, they just built 100,000 fake Falcons and crashed them all? They did it in simulation? No — they used control theory. Okay, I mean, this kind of sucks — it's not moving. All right, I think it's time: 3e-4. It's got to be that — Karpathy, come on, 3e-4. All right, while we watch this thing never actually learn to land, let's think about what other games we could make. Oh, you want to try iterated — you want to try iterated press-the-light-up-button, where it's a multi-episode game that keeps going? Is my schedule too aggressive? Do you want to add the times-0.9 back? We could do that. I mean, it's going up, sort of — that one's going down — that's pretty good. There's also some RNG built into lunar lander, which is kind of interesting. Um, I think we should clear the plots. I also think it's getting slower because it's drawing those plots. It's hard to overcome being a lunar faller.
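On the "can it learn with the jit on" question, here's roughly what jitting the model's forward pass looks like in tinygrad. Caveats, loudly: this is a hedged sketch from memory of the TinyJit API — decorate a tensor-in, tensor-out function, keep input shapes fixed, and reset to clear the captured kernels — and `model` is a stand-in; exact import paths and methods vary across tinygrad versions.

```python
from tinygrad import Tensor, TinyJit

@TinyJit
def jit_forward(x: Tensor) -> Tensor:
    # The first call(s) capture the kernels; later calls replay them,
    # which is why shape changes (or stale state) can silently break it.
    return model(x).realize()   # `model` is assumed to exist

# after changing sequence length / batch size between rollouts:
jit_forward.reset()
```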
See if that works — cla — what's the difference between cla and clf? This works. The light-switch game worked; you can make more complicated versions of the light-switch game. Oh yeah — no, I think I met that guy. He's just a genius of controls — it totally makes sense, and they definitely didn't use AI. Well, be nice to each other on this stream. Only I'm allowed to be mean — it's because I have the ban button. I don't know — who thinks it's eventually going to learn? It's got a lot of parameters, I think. Okay, do they update the target reward based on the actual reward? Yeah, they do. Okay — we honestly have no idea if the KV cache is working. It may just not be working, which would explain a lot. I don't know — like, the loss is going down, maybe it'll do something. I don't know. High hopes. Actually, where'd the video go? All right, we should try multi — uh, I feel like it's not just going to get it now. There are still bugs — we definitely still have bugs. Like, now, that's what I'm doing: target_return minus-equals observed reward, at the reward, at the observation, at the action. It's true that if, like, reward-to-go is wrong and stuff, none of that will show up in the light-bulb game, right. Okay — bro, bro — okay, I hooked it up. Get the Xbox controller, right — come on, we've got to look good for the video. It loves that thruster. It's barely using the down thruster. I mean, it is learning, and the loss can only go to zero. Good night — reinforcement learning doesn't work. All right, let's go back to the light-bulb game; it's not learning anything. It's, like, learning something here — but that's the higher learning rate too. Oh, look, it fired that thruster that time — see, it gives me a little bit of hope. It gives me a little bit of hope that it won't just crash into the ground at full speed — nope, crashing. Okay, we have to just stop watching it. The number is not going to go down, okay? It doesn't matter how long you watch it for, the number is not going to go down. Let's play press-the-light-up-button. Okay, now let's add a few parameters to press-the-light-up-button: let's call one of them size and one of them game_length. Let's confirm we can still learn press-the-light-up-button. It takes a little bit — almost — we got it wrong that time — now we've learned it. Okay, so we do learn to play this game reliably. Now let's add some complexity to the game. Well, let's start by setting the size to four. Now it has four buttons to choose from — it's a much harder game. Can it still learn to play? Can GPT-2 — the whole GPT-2 — learn how to play the new press-the-light-up-button? Okay, it can. So it has no problem with size. Let's increase the game length. Uh, we have to be a little careful now: we want to get a new observation, so we're going to say reward equals — self.obs — step equals zero — self.reset — and then this is self.step_num greater than or equal to the game length, okay.
And we broke it, even though that totally shouldn't matter. We somehow broke the game — or we just got unlucky. We can go back to size one — by showing a different observation the second time, we break the game? That does not make sense. You can also put the size back — no reason to make it harder than it has to be. I understand, press-the-light-up-button is a very hard game. Okay, well, that works again — so maybe that wasn't the problem; maybe the problem was just that it's unreliable with a large size. Okay, that's fine. All right, so it's unreliable with the large size — well, that's a different problem. Okay, let's try a game length of five. It should be able to get five reward now — and the first time out it got four reward — nope, it's very stupid. Okay, well, it definitely can't play this complicated game. I don't think it's a tinygrad issue — like, I wrote a bug in tinygrad, but tinygrad has a very comprehensive test suite. "How easy is it to do in PyTorch?" Why don't you give it a try? I think this has nothing at all to do with PyTorch or tinygrad, and everything to do with the fact that machine learning is just hard. Yeah, let's try a game length of two — let's work our way up slowly. It should be able to get two reward now, I think — but yet it can't. Do we have some bug in the game? Let's try some human play of this great game. Okay — one — we should also probably never give reward after the game ends, but I don't think that matters. Now we have a game length of two. Oh, it did get a reward of two once, by sheer luck, and it will never do it again. "Larger models are less temperamental" — yeah, that's generally true, right? Let's try a game of size 16. Is this ever going to work? Will it eventually learn to play this? Now there are 16 buttons, and it has to press the one that lights up — this is very hard. It did get the four, though. But yeah, there's definitely something wrong with game length — that's interesting. It can't reliably play this game. There are four buttons, guys — you press the one that lights up, it's not that hard. "Is the light switch fixed?" What do you mean by that? You want to play — you want to play the human version of this? Zero — look, I got reward two. Look, I got reward two. One — I missed, reward one. What — got reward three. See, you just press the light-up button. Of course this game is too hard — I'm just like, if this game's too hard, what can this thing do? d = 12 is like — give it six heads? Three heads, maybe? I don't know — we have too many heads. Okay, that was a stupid guess. Why did I think it had anything to do with heads? Oh God, cringe. Do you not see the other people posting cringe, and you're like, "I'm going to come in here and I'm going to post cringe"? What — what do you want? Do you want to make the machine learning work? Do you want to make the thing learn how to play press-the-button? Well, I'm sorry — I'll close the door, we're very angry. Oh, like, "hello, I'm a 17-year-old, please help me, give me life advice" — bro, have you not watched any of the streams? And how does it end for people like you? See, that's the really baffling thing — this is the most baffling thing about these people. It's, like, lemmings — it's literally: did you not see that guy fall off the cliff? Did you not watch? Did you not see him fall off the cliff? "I'm going to walk off the cliff too!" Like, what are you doing? Um — or do you think other people in chat are going to answer you? And anyone in chat who answers these people: banned. Okay, we're here to play press-the-light-up-button, right?
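Here's a hedged sketch of the parameterized version being built — `size` buttons and `game_length` rounds per episode, with a fresh button lighting up after every press and a step counter deciding termination. Reconstructed from the description (including the step_num >= game_length check), not copied from the stream:

```python
import numpy as np
import gymnasium as gym

class PressTheLightUpButton(gym.Env):
    """`size` buttons, `game_length` rounds; a new button lights up each round."""
    def __init__(self, size=2, game_length=1):
        self.size, self.game_length = size, game_length
        self.observation_space = gym.spaces.Box(low=0.0, high=1.0, shape=(size,))
        self.action_space = gym.spaces.Discrete(size)

    def _new_obs(self):
        self.target = int(self.np_random.integers(self.size))
        obs = np.zeros(self.size, dtype=np.float32)
        obs[self.target] = 1.0
        return obs

    def reset(self, seed=None, options=None):
        super().reset(seed=seed)
        self.step_num = 0
        return self._new_obs(), {}

    def step(self, action):
        reward = float(action == self.target)
        self.step_num += 1
        terminated = self.step_num >= self.game_length
        return self._new_obs(), reward, terminated, False, {}
```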
Can you play press the light up button? Can you write deep learning to do this? How about three; does three work? We solved it once with four, but that was before we added the game length stuff. I worry that my causal transformer doesn't work. Let's get rid of the JIT again; we definitely don't need that. "Now we all get banned." All right, never mind. We need to make the press the light up button game work with three actions first, because this is clearly way too hard. How did anyone get this to work?

Right, so now we can see the probability matrix. From that, it does look like the right answer is increasing there. Okay, now it's getting two of them correct; it just isn't getting this one correct. It's torn between stupid answers. This doesn't really look like a fundamental problem. How do they predict their action? A Hugging Face blog post had some stuff, right? Where's get_action? This doesn't work... does it? No. Okay, you see the problem here: it's just collapsed very quickly and is never guessing zero for the first one. This one fixed the first one, but now it's struggling with the third one. It's not confident in those answers, but it never predicts that one. All right, let's make the batch size bigger. Also, what happens if I make the temperature greater than one? (See the sampling sketch after this section.) Good, that's going up now.

But my serious question to the people who come in here and ask for life advice: are you not a frequent watcher of the stream? Now it's solved; I decreased the learning rate and increased the batch size. Let's see if I can solve size eight. Or do you think for some reason I'm going to give you a different answer? Which is perhaps even more baffling. It's a dumb question in kind. Can it learn it with five? Oh, you think most people who stream give those answers? I mean, they aren't good questions, and anyone who answers them seriously is doing you a disservice. Can someone who's paying attention explain why people who answer those things are doing people a disservice? Yeah, and it's not actually because they're very personal; it's actually because the biggest problem you have is asking that question. I'm genuinely giving you good advice: the biggest problem you have is coming on the internet asking someone for life advice. That's structurally broken, and I'm sorry other people have taught you to think that way, but until you fix that, you will never make any progress in anything. You can ask it on the street if you want to; most people understand those questions for what they are, and maybe these people do too. It's kind of like, you know, talking, whatever, man, I'm hanging out at the bus stop. I guess what else upsets me about the questions is that you wouldn't ask someone at the bus stop that kind of question. You think that I have some answer for you? I don't. Nobody has an answer for you. You're either going to find the answer yourself, or you're going to die miserable and alone. I don't have experience with being stupid, because I'm not. "It sucks." I mean, I don't know if it sucks. I think what sucks is to have ambitions greater than your intelligence. I think there are some very happy people who are smart and have low ambition, and these people are just like, yeah man, everything's just easy. There are people whose intelligence matches their ambition; things kind of work out for these people.
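(The temperature tweak mentioned above, as a generic sketch: dividing the logits by a temperature above one flattens the action distribution, so the sampler explores more. This is standard softmax-with-temperature, not the stream's actual code.)

    import numpy as np

    def sample_with_temperature(logits, temperature=1.0):
        # temperature > 1 flattens the distribution (more exploration);
        # temperature < 1 sharpens it toward the argmax
        z = np.asarray(logits, dtype=np.float64) / temperature
        probs = np.exp(z - z.max())  # subtract max for numerical stability
        probs /= probs.sum()
        return int(np.random.choice(len(probs), p=probs))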
Who it really doesn't work out for is people who have ambitions up here and aren't smart, and, like, what can you do, you know? Just be happy that you're not this model that can't even solve press the light up button. "If you're watching the stream, you're above average intelligence." I don't think that's true; again, I think the internet's gotten really stupid.

But yeah, no, it's the shortcutting-the-learning-process thing again. You completely misunderstand what learning is if you think there's a way to shortcut it. It's like cheating on a test. They say you should learn how to learn, and that's part of it, but still not all of it, because then you'll just get the stupid question, "well, can you teach me how to learn?" And no, nobody can teach you that either. Like, you can be taught very narrow facts about the world, sure, but again, you don't even need a person anymore; just go look at my conversation this morning about MOSFETs. "People have paralysis with trying." I don't think that's true. I actually think the problem is that they have no gradient. They have no gradient, just like these models. Like, this problem is incredibly easy, but it can't learn. We can sit here and be like, bro, okay, you got the first three, but why can't you output three for the last one? And the probabilities are slowly changing, and maybe someday this model will eventually solve press the light up button with four states. The saddest thing is I don't even think there are bugs here. It's gonna flip... it's gonna flip... there we go: 0, 1, 2, 3, and now it gets reward. Well, it doesn't get reward one every time yet; we have to give it a little bit longer to stabilize. Okay, so after 500 epochs it learned how to play press the light up button. "Is the JIT running again?" No, the JIT's turned off, but I think it makes no difference at all; it's just another confounding variable to remove. I think there is something more structurally broken. Okay, so it can solve size equals four, it just takes a long time. I still think there's something more structurally broken with game length equals two.

"Do you think learning to learn is part of the problem?" "People give up when they see people like you who started young and have a lot of talent." Yeah, I mean, the truth is most people should probably give up. And again, this is not advice that someone's going to tell you, but if you're 25 and you have aptitude with physics or philosophy or math, right, if you have strong aptitude in another area and you just didn't put time into programming, then yeah, perhaps you can learn to program. But then ask yourself honestly: how did you do on your math SAT? It's that stupid, right? If you got a 450 on your math SAT, you're not going to learn to be a great programmer at 25. You're just not. That's the reality of it. "Are you good at STEM?" Ask me a question. Ask me a science question; don't ask me if I'm good at STEM. Don't trust me to evaluate myself. Who knows? That's just Dunning-Kruger; you can't ask someone if they're good at something.

Why is this model not good at solving the two-step game? I wonder if it's the KV cache; for some reason we're not incrementing it. (A sketch of that invariant follows below.) "Prove the irrationality of the square root of two." I actually did that; I have that on my, uh, my thing. I saw a "Taylor Swift proves the irrationality of the square root of two." Oh, people should ask themselves that. I mean, again, this is just a complete circle jerk.
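(On the KV-cache suspicion, a sketch of the invariant being questioned, with hypothetical names: if the cache's position counter never advances, every step overwrites slot 0 and the model effectively attends over a single timestep.)

    import numpy as np

    class KVCache:
        def __init__(self, max_len, dim):
            self.k = np.zeros((max_len, dim))
            self.v = np.zeros((max_len, dim))
            self.pos = 0

        def append(self, k_new, v_new):
            self.k[self.pos] = k_new
            self.v[self.pos] = v_new
            self.pos += 1  # skip this increment and slot 0 is overwritten forever
            return self.k[:self.pos], self.v[:self.pos]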
No, nobody who's any good at things sits around and talks like this, and if you are talking like this, I don't think it's going to get better for you, or at least not the way you're trying.

I have a contiguous after the KV cache. Yeah, I'm going to try disabling the KV cache. I mean, the KV cache definitely works in GPT-2. Welcome, wasp. "You have contiguous there?" Yeah. Well, so, I guess, how does it not even get it wrong accidentally half the time? That's the weirder thing. What is it outputting? Why is it only outputting one thing? Wait, it's terminated already? No, that seems broken. Oh, okay, we had a bug in press the light up button, if you can believe it. But now we're still getting false, false there. Why is that loop exiting? God, I can't believe I actually had a bug in press the light up button. This is just bike-shedding, guys. Bike-shedding. Okay: terminated and not truncated, loop through there. Why is that not printing a second action? Okay, okay, there. I mean, to be fair, this was just a bug in my implementation of press the light up button and not a bug in... okay. Will it learn to play two steps? It's slowly learning. I had to lower the learning rate to get it to learn the big-size ones.

Should we have early termination if you lose? Probably. Let's do that. So, if you lose... okay, so I guess this can be truncated, and then done goes here: if not reward, self.done = True. Okay, now the episode length will change. "You want me to add a failure reward?" I don't know if we need a failure reward. Okay, let's try a game length of five, and we'll put the learning rate back up to something reasonable. "Yo, give me a McFlurry, bro." No, like, literally, I'd love a McFlurry right now. Is it going to learn how to get good reward here? Interesting: it learns how to get three reward, and then it stops. I see my ID; I'm so glad I got a new ID, mine had been expired for a while. Okay, it learned how to get three reward. I don't know if it just, like, stops at three. I'll point out also that we haven't fixed any real bugs in it. Or did it... maybe it got five? No, I'm going to have to get one from the top. Interesting. I wonder if that means something; I wonder if it's trying for that somehow. Tendies, yeah, tendies. "Is the model using the cache?" Yeah, it is. Again, this is a totally stateless game.

All right, you guys want to try a new game? You'll have to guess the state, to see if it's doing the time stuff correctly; it just tests sequential actions. It's interesting that it converges to four; I wonder if that's, like, fundamental. Or a lower learning rate? It's worse, yeah. Banned. Is that the same guy who posted before? Dumb crap, dumb crap. That's dumb. It's dumb. Start with a timeout, okay. See, that's interesting: it learns, and then, it must be when it subtracts the reward, it must actually be learning to do badly on the last one. There's an off-by-one error somewhere. Okay, let's try a few things. Let's always shoot for a little bit higher than the highest reward. I don't know what that's going to do; it might just not learn anything, and this is all out of distribution at that point. It's shooting for 5.5. Oh, well, that's the best I've seen it do. Okay, that's pretty good. It solved it! We shoot for a little bit higher than the highest reward. Now, these are still state-independent, so we could get trickier with this game; like, make it learn the stuff offset by one. All right, we'll just add something here called hard mode. In hard mode, the correct action is actually the lit action plus the step number, mod self.size. (See the sketch after this section.)
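(Putting the two changes together, early termination when you miss and hard mode where the correct button is the lit one shifted by the step number, here's a sketch of step() extending the toy environment above. It assumes a hard_mode flag was added to __init__, and the names remain guesses.)

    # replacing PressTheLightUpButton.step() from the earlier sketch
    def step(self, action):
        target = self.state
        if self.hard_mode:
            # hard mode: correct button is the lit one plus the step number, mod size
            target = (self.state + self.step_num) % self.size
        reward = 1 if action == target else 0
        if not reward:
            self.done = True  # early termination: a miss ends the episode
        self.step_num += 1
        self.done = self.done or self.step_num >= self.game_length
        self.state = random.randrange(self.size)
        return self.state, reward, self.done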
All right, let's see if we can play it in hard mode as a human. Okay, the answer is one; the answer they want you to think is zero, but it's actually one. Wait, no, that was wrong. Oh, because we're not in hard mode. Let's go to hard mode. One. The answer you think is one, but it's actually zero. No, that should have been step_num zero. Oh, also, if self.hard_mode... no, it's not respecting hard mode. What's going on here? Why is that step zero? It shouldn't be; I incremented step_num here. Oh, am I resetting every time? Yeah, I'm resetting every time. Yes. Okay, can I play the game on hard mode? You guys see how hard mode works? I feel like I probably wrote this exact same game last time, with all the same stuff. Does it ever learn hard mode? It knows the first step reliably. Why is it not actually taking that first step reliably, then? Why do I sometimes get zero reward? It seems extremely confident in the first step. No, hard mode's a little too hard for it. I don't know. I'd be interested to plug press the light up button into all the RL stable-baselines things; I'd be curious to see someone do it. Like, can PPO solve this? How is it ever getting a reward of zero? It just shouldn't. Hard mode's too hard for it.

Non-hard mode: do better than that; there's a five you can get to. I mean, actually, we're asking it to play long games, so let's give it a game length of 32, because it's really a max game length. Yeah, I don't know, try that some other stream; there are lots of different variants of this game that are quite interesting. It's only getting to nine. Can't do better? It's confident there. Now it got to nine. Oh yeah, like, a table-based algorithm would solve this instantaneously. Why doesn't reinforcement learning work? Oh, part of the reason this doesn't work is because target return is one. Part of the reason it's unreliable and sometimes does worse is... I don't know, I almost want, like, a discount factor on the subtraction over the rollout. What if I do that? There's no excuse for not doing well on the first one. I mean, it occasionally gets it, it's just not reliable. I feel like I'm missing something. I mean, I see no reason... like, this game's going to struggle once the game length is 32, bro. Keep going, you can do better. You'll never do better now. Like, it's a pretty good agent, but it peaks there. It's because when it gets to the last one, it's in a fixed state. Okay, I have an idea: just so you don't get stuck, subtract a random amount of reward.

It can't just be that. target_return = torch.cat(target_return, ... oh, what is this scale? What is this? That's exactly what I was doing. Where does that come from? The word "scale" doesn't appear anywhere else in here. Yeah, then I've got to set that up. What do you think, that I'm good at computers? I'm not. Okay, that's a lie. So where do I get this scale from? Well, look, I told you this had to be in here. Okay, now we're in the land of RL: like, it's mostly correct, but... hey, Marcel Bishoff. "If you check the GitHub, it's there." Where? "Ask Quentin." I'm over Quentin. Scale equals... no, no, no, scale can't equal a thousand, that doesn't make sense. "That's decision transformer." Show me where scale equals a thousand. I'm not asking a model; that doesn't make sense. run_decision_transformer, line 93, normalization. Oh, okay, well, that's actually not what I thought it was. That's stupid. I mean, that's just because regression is hard, but we don't really care about that. We actually don't care about that at all.
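(For context, the mechanism being dug through here: the decision transformer conditions on a target return-to-go, and after every step the obtained reward, divided by a reward-scale constant, is subtracted from it. The reference gym configs do use scale = 1000. A sketch of that evaluation loop; model.get_action and the env API are hypothetical stand-ins, not the stream's code.)

    def rollout(model, env, target_return, scale=1000.0):
        # `scale` normalizes rewards, as in the reference decision-transformer code
        state = env.reset()
        states, actions, rtg = [state], [], [target_return / scale]
        done, total = False, 0.0
        while not done:
            action = model.get_action(states, actions, rtg)  # hypothetical API
            state, reward, done = env.step(action)
            states.append(state)
            actions.append(action)
            rtg.append(rtg[-1] - reward / scale)  # spend the reward we just got
            total += reward
        return total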
Okay, so that wasn't what I thought it was. What did we have the best results with? We had the best results with just multiplying it by a little something, because we wanted to do a little better, you know? Like, everybody's got to move up in the world, move a little past their parents' generation or whatever. Everyone's got to move up in the world: you're 5% better than anything you've seen before. "Try penalizing it if it gets stuck." All right, good: carrots and sticks, I love it. I don't know how to do that. Reinforcement learning just doesn't work. This whole thing's broken. Humans don't do this crap. If it learned the whole thing before it... I mean, okay, part of the problem now is that this thing's just too simple. You know, like the failed self-driving-car company. So let's go back to LunarLander and see if we can make it land. Like the old days, boys, like the old days.

Okay, well, let's try a little more press the light up button first, just to make sure the JIT works. Okay, I don't think there's anything wrong with the JIT. All right, reinforcement learning is addictive because it never works. Yeah. Okay, let's just run this again. What, "mismatch of variables"? I don't know about that; looks like some JIT problem. How come I didn't see that when I was in the JIT before? The reset not actually resetting? Oh, it's because I have this. Yeah, it gets a little faster with the JIT. Or no, it's not faster. It converges to a point where the loss is low, and yeah, it converges to something like that. Come on, just beat that. Now let's play with our model a little. Maybe make the model a little bigger? Definitely don't need that. Give it four layers. It learns so fast. Come on, do better than four. But it has no incentive now to do better. Yeah, the loss gets low and then it, like, stops caring; nothing improves after the loss is low. We just learn to predict everything.

Okay, I mean, another fix for this, potentially: we can adjust the experience ratio, right? You can decide how many episodes per training pass; like, now I'm doing two episodes for each, so maybe the loss won't go to zero as fast. Pretty useless. I mean, okay, we can inject variance by lowering the batch size; that's pretty much the same thing as messing with the learning rate. Thank you; we got tendies and mozzarella. Okay, well, lowering the batch size made it worse. Lowering the batch size made it way worse. So you know what that makes me think? Increasing the batch size will make it better. Wait, look, it just improved. Can it do better than that? Why does it stop at seven? There's nothing special about seven. Let's target a 3% increase. A 3% GDP increase; let's target a 3% increase. Oh, we also have to be a little careful with this: I actually want to say the highest reward plus the absolute value of the highest reward times a little additive, I don't know, something like 0.1. (Sketched below.) That one is nothing. Oh, I went crazy there. I meant... since 128 is divisible by 8? I don't think so, bro. Should we also add some RNG to that? Okay, now we reach for something that's a little random. I don't think we want minus one. Come on, go up. Keep going, keep going, you got this, boy. It don't got this. It's something. Can it learn to play the game in hard mode? Hard mode has, like, polarity and stuff. Universal function approximator, my ass. To be fair, these are the kinds of functions deep learning is very bad at. Okay, the actual thing I want: sometimes, don't do your best.
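(The target-return heuristic that's emerging here, as a sketch with made-up names: aim a bit above the best return seen so far, plus a little randomness so the target doesn't freeze.)

    import random

    def pick_target_return(best_reward, bump=0.1):
        # shoot slightly above the best return seen so far ("5% better than
        # anything you've seen before"), plus noise so it doesn't get stuck
        target = best_reward + bump * abs(best_reward)
        return target + random.uniform(0.0, bump)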
Yeah. Okay, well, hard mode's impossible. LunarLander: why is it not outputting twos? Because it never sees them in the data set. Okay, I think machine learning is impossible. This doesn't work. It just doesn't work. I'm sure I have tons of bugs still. Ah, I think this is the final attempt. Tomorrow we'll try again; we'll look at some things that actually work. Okay. Okay, the loss is going down, the reward is going up. I think that it's just become Lunar Faller. Yep. All right, it's just in that local minimum. Try a crazy high batch size? I mean, part of the problem is it's just showing it the same data over and over again, which, I mean, this should be offline. All right, now we've gone into this mode where we just fire the right thruster. How did anyone get this to work? Has it learned anything? Like, it all just looks so easy. They draw that picture, and then you think, oh, I'll just implement that. But you won't.

Hmm, what do they use for K? "We feed it the last K timesteps." I wonder if this actually helps. What's K? "We ablate on the context length K." What do they use for K? No, these things work; there are just bugs, and they're not debuggable, which is the most annoying thing. Falling is pretty good. Oh, here we go: "we use context length K equal 30," which is interesting. (See the slicing sketch after this section.) I've tried, like, the stable-baselines stuff to get something that works, and I've had no luck. Okay, I think we tried this for a while; I don't think this works. Let's spend 30 minutes and see if we can get beautiful CartPole to become... I mean, I'm sure it works; I don't know where my bugs are. There's so much that we have to test, and so much that we just need better introspection for. Tinygrad needs better introspection.

So I wrote this; this is beautiful_cartpole. It's an example in tinygrad, and it actually is an example of working RL. It uses PPO. Now let's see if we can change the environment name to LunarLander and see if we can make it work. This one does work at CartPole, at least; it's good to have something that works. I've solved LunarLander before with PPO. Okay, let's train for way longer than 40; let's try 400, because it's fast too, which is nice. It's quickly losing episodes now; let's see if it comes back. No, a longer replay buffer; let's tweak hyperparameters until it learns. Both losses are really high; might be too aggressive with the learning rate. Grow the replay-buffer thing; let's try 200. Oh, see, look: it takes some bad step, and then it just starts losing entirely. Let's get rid of reward-to-go. Oh, sorry, not reward-to-go, let's go to discounting. What does this do now? I think it just NaNed out. Or maybe that's when it fills the replay buffer. Oh yeah, that probably makes sense: once it reaches the replay-buffer size, it's fast. Okay, that makes a lot of sense, actually. But, sorry, you don't get to be fast: it has to do compilation until it hits the right size. We could fix that with variables. I'm sure RLlib works, kind of, if you set the hyperparameters to exactly what they demand them to be. I don't know; maybe tomorrow we'll take a look at other people's RL code and see what we can replicate. Why does MNIST work and RL not work? No, the answer is not because RL has too many hyperparameters. Ooh, entropy loss; let's change that. I mean, it can't just be that it has too many hyperparameters, right? MNIST has all those hyperparameters too, and it's stable across a wide variety of them, right? There are not that many hyperparameters that are unique to RL.
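(The K detail read out of the paper, in sketch form: the decision transformer only attends over the last K timesteps, so the sequences get sliced before going into the model. K = 30 is the number quoted on stream; the function name is made up.)

    def clip_context(states, actions, returns_to_go, K=30):
        # only the most recent K timesteps are fed to the transformer
        return states[-K:], actions[-K:], returns_to_go[-K:]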
The critic loss is really high. Oh, we might also have to scale these rewards; they don't have to be such large numbers. I don't know how it manages to do so much worse than just falling; at least the other one learned how to fall. This one, I don't know what it's learning, but it's fast. Critic loss too high. What does it do after all that? Oh yeah, great, good move, bro, good move. It's almost like it wants the reward to be big. Oh, we might have to make the hidden state a lot bigger; this isn't CartPole. Those are very simple models, and it's unclear if they can learn it. That might be the problem. But no, it's probably not that. I don't know; you can solve this with, like, NEAT and three neurons. All right, guess what: we're going to stream tomorrow too, and we're going to make this work. It's just so frustrating. Has anyone made, like, dumb gym games that are just way stupider? Well, this doesn't work either, so that sucks. We'll put back beautiful CartPole; that one at least solves CartPole. To be fair, this algorithm does better in that it learns how to become Lunar Faller. I wonder if it would be better with entropy regularization. I don't know if letting it train overnight is going to help. I mean, it can't solve those simple environments, which is a bigger problem. Has anyone done that before? No, I'm not trying some stupid hyperparameter optimization; this is exactly the problem. Every time we try to do RL on stream, we just end up frustrated and angry. This stuff just doesn't really work. Maybe some very careful implementations of it do. Box2D is a physics game library, I think. Yeah, the loss calculation and update. Well, yeah, the update's the most important part. Oh, the update's the most important part of all of learning? The update is what learning is. Sorry, is that where my head is? Can you not see that graph not go up? And I thought we were going to get Pokémon playing on a tiny box. Can't even get LunarLander to work; can't even get the button game to work. Yeah, I've seen that one. Why are LLMs bad? Because the RL doesn't work.

Oh, today was a shitty stream. We started off strong with some good rants, we smoked some weed, we had high hopes, and what actually happened? The thing didn't even... like, it just doesn't land. I don't know how to fix that. There are a million places bugs can be; they're probably in all of those places, plus ten more I didn't think of. Like, it's not just going to start learning. It can't learn. It can't learn how to get past the seventh step of the button game. I'm not pushing this to GitHub; this sucks too much. We'll stream again tomorrow. Maybe; I might have something to do tomorrow. Stream again, try again. We'll go very nice and slow; we'll try introspection, the usual techniques. "What time tomorrow?" Dude, just don't trigger me right now, man. I thought, oh, we'll just get a transformer, it'll be easy, it's just like supervised learning. It's not like supervised learning, because your shitty model changes the distribution of the shitty data. Wow, I never realized that the ground changed shape. I wonder if that's put into the state. I know, you just want to sit here and watch it learn for a little bit. There are just eight observations, so it doesn't tell you how bumpy the ground is. Maybe one of the observations is, like, how close you are to the ground or something.
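(For reference, the eight observations in Gym's LunarLander, per the Gym docs; indeed, none of them describe the terrain.)

    # LunarLander's 8-dimensional observation vector:
    LUNAR_LANDER_OBS = [
        "x position", "y position",
        "x velocity", "y velocity",
        "angle", "angular velocity",
        "left leg contact", "right leg contact",
    ]
    # only the lander's own state; nothing about the shape of the ground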
I could try implementing it with PyTorch. So, what's my take on the future of web dev? No; the words are still true, but I'm in a bad mood because this doesn't work, and, like, I probably have bugs. I probably have just tons of bugs. "And we'll all be equal when we're dead." That's true, that's true; except this model is still going to suck more than everybody when it's dead. Just land the thingy. Just land it, bro. You can do it. To be fair, this does look a lot more stable than PPO, at least. It's a cool idea. We learned something cool today: we learned about how decision transformers work, and who knows, if you get everything right, maybe it just works. The loss is going down, which is interesting: this big transformer is learning a model of how the world works. (The training objective behind that loss is sketched below.) Just leave it. It's just, like, my computer tries to land. Did that increase the thing? Look, it increased right there; it went up a tiny bit. "Did we get the press button game working?" Yeah, but not the hard edition of it. We got some simple variants of it working, sometimes. But I think we should formalize the press button game. Formalize the press button game, and we'll see what else can play it. To be fair, I haven't done much transformer training at all. Yeah, I played Breath of the Wild last night; there's my Switch controller. Should we play some Breath of the Wild and be bad at that?

Yeah, this is what we did. Like, it doesn't work; the thing doesn't land. I have to do some real soul-searching after this. No, I don't. I'm just mad RL doesn't work. Like, you see the DeepMind robots playing soccer? They did that by... no, nobody's gotten this stuff right. Like, RL doesn't work anywhere. Why are there no robots cooking and cleaning? It's because... like, no, I work in robotics too; trust me, I know it's all fake. Like, it works through so much careful testing, so much just testing and tweaking. Why do I just believe that it's going to get better soon? I just believe that, like, oh, it's going to learn how to land. But, like, it's not. This is a problem. Does comma use RL? We don't use RL, because RL doesn't work. We wasted six months last year on RL. comma has stuff; like, all our stuff works with LunarLander and CartPole, and we have something that can kind of drive the car with RL. It can go straight on the highway, but it's just way worse than everything else. The Figure robot tweet says they have an RL breakthrough. Can they solve the button-pressing game? Can their thing just one-shot the button-pressing game? I've seen Mobile ALOHA. The videos are cool, but, yeah, like all demos: what's real, what's sped up? It is cool, though, and I think that is a lot of the right approach to doing cooking and stuff: first set up a very good rig. Their rig has, like, four hands on it; it's got two hands on the front and then two places you put your hands, and you can manipulate the rig. Okay, we'll try to stream again tomorrow. We're going to make this work. We're not going to lose. We are going to make the lunar lander land. If you want more content, you're in luck, because you get more content when it doesn't work. We're in a bad mood; nothing you can do about that. I'm not frustrated that this code doesn't work; I'm frustrated that I feel like all of the introspection tools that I would need to debug this just don't exist.
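(The loss being watched here is just supervised action prediction: the decision transformer paper trains by predicting the action at each timestep from the return-to-go/state/action history, with cross-entropy for discrete actions. A schematic sketch; the model and batch layout are hypothetical stand-ins.)

    import torch
    import torch.nn.functional as F

    def train_step(model, optimizer, batch):
        rtg, states, actions = batch          # actions are integer class indices
        logits = model(rtg, states, actions)  # predicted action logits per timestep
        loss = F.cross_entropy(logits.reshape(-1, logits.shape[-1]),
                               actions.reshape(-1))
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        return loss.item()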
As far as I know, unless something's really changed recently, it's not any better in PyTorch. "Do you remember your MuZero stream?" I just failed at that. I don't think I ever got that MuZero to... I did not get it to solve LunarLander; it just didn't work, which is just frustrating. And then even beautiful CartPole: we have beautiful CartPole in there, and it only solves it like 75% of the time. I do think that another problem might be that my initialization is bad, that torch has better initialization, so that's one thing that might be different between tinygrad and torch. (See the sketch below.) "RL is so hard and stupid." Yeah, but, and we were talking about this at lunch a few days ago, if there's anything that is consciousness, it's having a model of yourself in the environment, and you'll never do that with supervised learning. You'll never understand that you are the agent. You need RL for that. So this has to work. All right, I'm tired. Try again tomorrow. Yeah, I know the MuZero streams were massive, because I probably got equally frustrated by this same garbage. Like, why won't you just land between the flags, bro? Land between the flags. All right, we'll make it work.
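(On the initialization hunch: PyTorch's nn.Linear defaults to Kaiming-uniform weights and a fan-in-scaled uniform bias, which a hand-rolled init may not match. A sketch of replicating that default by hand, using standard torch.nn.init calls:)

    import math
    import torch

    def torch_default_linear_init(weight, bias):
        # nn.Linear's default: Kaiming-uniform weights with a=sqrt(5),
        # bias ~ U(-1/sqrt(fan_in), +1/sqrt(fan_in))
        torch.nn.init.kaiming_uniform_(weight, a=math.sqrt(5))
        fan_in = weight.shape[1]
        bound = 1.0 / math.sqrt(fan_in)
        torch.nn.init.uniform_(bias, -bound, bound)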
Info
Channel: george hotz archive
Views: 84,836
Keywords: programming, livecoding, georgehotz, george, hotz, geohot, twitch, github, yt:cc=on, lunarlander, tinygrad, decision, transformer, paper, balancing, temperature, logits, bugs, reinforcement learning, impossible, decision transformer, gym environment, press the light up button, game, embedding, action, reward, broadcast, issue, probability, layer, AGI, progress, scientific, notation, suppress, learning, loss, equity, inclusion, CartPole, pressthelightupbutton, game_length, advice, learn, gradient, game_lenght=32, rl, model, strategy
Id: 8U8kK3SpLTU
Length: 477min 31sec (28651 seconds)
Published: Tue Jan 09 2024