Building an LLM fine-tuning Dataset

Captions
What is going on everybody, and welcome to yet another Reddit dataset video on this channel. I just can't stay away. I've already done a very similar video, but it was more focused on the QLoRA aspect of this process as opposed to the dataset creation, and arguably the hardest part is the dataset curation. With QLoRA you actually need such a small number of samples that you can get much closer to handcrafting them. In this video we're going to be building out about 35,000 samples, though with other filtration choices you could really make this millions, even with WallStreetBets data only, and that's the subreddit this is going to be fine-tuned on.

I had a couple of questions about whether you can have multiple speakers. The dataset I've built is formatted so that you have a beginning-of-conversation marker, then SPEAKER_0 up through the nth speaker (there are 10-plus speakers in some of the data), and then the bot's line, which is the actual bot response. This is good for a couple of reasons: generally, online, a conversation isn't just one party and then the bot or a person responding to that one person; in reality there are usually multiple parties, and that's what this dataset ends up capturing, so that's pretty cool. At the end I do a QLoRA fine-tune on it. We've seen a QLoRA fine-tune in the past; I think this one is actually a little better than the previous one I trained, but there's still some room for improvement, so we'll get there. This video is going to be quite long, mostly covering the entire process of actually finding the data, downloading the data, formatting the data, making the dataset, and uploading the dataset. That's what you're in for.

Because this is going to be so long, I don't want to bury it at the very end: I'm also giving away an RTX 4080 SUPER, courtesy of NVIDIA. To be eligible, you need to sign up for GTC, NVIDIA's GPU Technology Conference. It's absolutely massive, virtual and in person; if you sign up from the link below, that makes you eligible, and you have to attend at least one session, but otherwise you're good to go. There are so many sessions; personally I'm interested in the video generation talk from the Runway ML CTO, as well as the Arthur Mensch talk from, of course, Mistral, but there are tons. Same thing with robotics: I'm really interested to see the eventual meshing of robotics with generative AI. So check that out below; otherwise, let's get into the video.

About six years ago I found the following Reddit thread. It was actually posted nine years ago, but I found it about six years ago, and basically it was every available Reddit comment up until that point. Of course, that was nine years ago, so it's quite old now. It was very interesting when I found it, because this was when a lot of people were playing with chatbots, and at the time it was mostly RNNs, not the large language models we know today with attention and Transformers and all that fun stuff. I made a chatbot with it, and it actually worked pretty well; it was a very enjoyable chatbot. And it's so very nostalgic for me to look at this super old
room. We've got Python Plays GTA V in the background, the good old days. Let's see, was I coding in IDLE back then as well? If I open up an editor here... do I ever open an editor, or were we just downloading stuff? It's actually funny, because we probably hit the exact same problems in that video series that I'm hitting now. Oh no, we're using Sublime Text; well, we're using both Sublime and IDLE. I'm still coding in IDLE, love it. I probably won't keep doing that, though; I don't have Copilot in IDLE, so it can't be done. Anyways, I made that chatbot and it was pretty good, and what I really like about the Reddit data is the character that comes out, both from the average of Reddit and because you can target particular subreddits. More recently I did this with a QLoRA on top of, I want to say, Llama 2 7B, or maybe it was the 70B, I don't remember which one, but I did a fine-tune on WallStreetBets data. It was way too little data and the quality was not quite right, so I wanted to revisit it. But just getting the data is actually the hardest part; training the model is the easy part. So I keep running into this problem of "ah, I've got to go back to that Reddit dataset," and then I keep being reminded how difficult this Reddit dataset can be. Here's why.

First of all, you have this one, which eventually turned into a torrent, but again, it's the old stuff; I want to say it goes up to 2015 or something like that. Beyond that, I think this same data is also housed on archive.org, so you can come here, go to any of these years, and download. So 2015 is the first five months, which I'm pretty sure is what this torrent is, and you can go all the way back to 2007 and download the months' worth from back then. Alternatively, there is BigQuery. This was linked, I think, somewhere in this post; yeah, in this URL, so I'll just copy that and paste... hey, oh, we don't have the /r/, we got rid of the slash, okay. So somebody took this data initially and put it up on BigQuery, which, if you click here, takes you to the fh-bigquery dataset, and this got maintained all the way up to the end of 2019. If you click on that URL (I'll put a link in the description), you get BigQuery, and there's a whole bunch of genuinely good information in here that's worth checking out: besides the Reddit stuff you've got Hacker News and things like that, some other Reddit information, and apparently some Python stuff. Mainly I'm interested in Reddit comments, but there's tons of stuff here that would definitely be worth checking out. I'm also going to check V2 real quick; I don't know if there's anything post-2019, 2020 onward, I don't think there is. Anyway, we're interested in Reddit comments, not Reddit posts, and this has Reddit comments going from 2005 all the way up to the end of 2019. That is a ton of comments: terabytes of comments, billions and billions of comments. Honestly, there's enough here to fully train a language model, especially something like a 7-billion-parameter language model;
you have enough data here to get it done, so that's very interesting, and it's definitely enough to fully fine-tune just about any model on a particular subreddit if you want that kind of behavior.

So what I want to go through is the process I've ended up with. I'm sure there's a better way, there's always a better way, but this is what I have. I've downloaded a lot of the 2018 data already, so I'm going to start now in 2017. You can click on that and it takes you to this table. Things have changed over time: you used to be able to just straight export, which was gnarly but cool, but it's just too much data to do that now. So what we're going to do is export to GCS, which I think is Google Cloud Storage. I want to be honest and upfront: I basically never use Google Cloud, or at least these kinds of things in Google Cloud, so again, I'm sure there's a better way. It's always fun; in the latest video I did with the Wi-Fi stuff, people were like, "wow, you know AI but you don't know how to do Wi-Fi." Well, no, I don't need to know Wi-Fi for AI. Everybody is good at their one little thing, and as soon as they step outside it, guess what, they're an idiot. So, moving on.

The GCS location is going to be your bucket. I've made a whole bunch of buckets while I was trying to figure out how the heck I wanted to do this, because you have a bunch of options: you can save as CSV or JSON, you can use compression or not, and there's some other stuff I can't even remember. I couldn't decide how I wanted to do it, so I ended up making a whole bunch of buckets that I probably should delete, and I have no idea what this costs, by the way. Eventually I'm going to upload everything, or as much as I can, to Hugging Face. I might upload whole months' worth of Reddit comments, because that's about 50 gigabytes of data per month, so you could share that as a single file, especially because there's a lot of duplication in here; we'll get to that. You could have a whole dataset per month and let people download that from Hugging Face, so hopefully I'll find a way to make this a little more accessible.

Anyway, continuing on: I'm going to create a new bucket, and since I'm starting in 2017 the name is going to be 2017_j, where the j is just "JSON" for me; you can call it whatever you want. Then I'll click create and go with all the defaults. I don't know what any of this means, honestly; if that's a security risk, thanks. Then the file name: all of this is just December of 2017. In BigQuery it's separated by month so that each table isn't too massively huge, but it's still too massively huge for me to save as one JSON object into my bucket, so I have to save it as many little files. To do that you set the file name to 2017_j_ and then an asterisk, and that gives it the ability to save a whole bunch of sharded files; I want to say it's going to make, well, I don't even know, probably 100-plus, we'll find out in a moment. Actually, maybe I want to save it into 2017 but then also
as the month, so I'll say _12_, I think. We could also just save it as 12_; I'm trying to remember how I did this with 2018. Yeah, I just used the month in the file name. It really probably won't matter, and we can always change these file names later. All of this stuff can be different, and that's why it's so hard to make a video about how I go through this process: so much changes and then you forget. I'm going to copy what I did for 2018 and just do the month, so it will be 12_ and then the file number.

Okay, select, and we're going to export, but not as CSV. I found the CSVs to be illegible: when I saved one and looked at it, it's delimited by newlines, potentially, and that's why I have so many of these buckets, because I was like "oh, CSV, cool, that's easy," but actually, no, it's very hard to go through these CSV files in this format. I have no idea how you would properly parse it with a program; it would be different if every value were a single word or a number, but in this case it's a body of text, so it's unbelievably complicated as a CSV. So: JSON. You can add compression if you want, but then you have to go through and decompress everything; on the other hand, leaving it uncompressed adds quite a bit to my own download. So this time I'm actually going to test GZIP and see what happens; wish me luck. It will be quicker to unzip it locally, at least for me; if I had better internet out here I would not compress. I'll go ahead and hit save. My internet is no good, and as soon as I start downloading this it just nukes the rest of my internet, so I have to download this stuff overnight. Okay, so we're grabbing December 2017; because we're compressing it, it might take a little bit longer.

Now that we've done that, I'm tempted to continue, but instead I'm going to search for the bucket: 2017... I don't even see it yet... there's 2017_j, our storage bucket. We can click that and wait for the export. Okay, we're going, good stuff; each file is 27.6 megabytes. I don't actually know whether they'll each hold the exact same amount of data. What I will say is interesting: maybe it's because I'm in the Brave browser, which is still a Chromium browser, but the Google Cloud console is unbelievably laggy and I can't understand why, because I have plenty of CPU and RAM available, a hundred gigs of RAM and tons of CPU cycles free right now. There's no reason for it to be so laggy, but this website is just very unpleasant to use.
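As an aside, the same export can be kicked off from code rather than clicking through the console. Here's a minimal sketch, assuming the google-cloud-bigquery client is installed and authenticated; the bucket and shard naming mirror the scheme above, but treat them as placeholders:

```python
# Minimal sketch: export one month of fh-bigquery Reddit comments to GCS as
# gzipped, newline-delimited JSON shards. Bucket name is a placeholder.
from google.cloud import bigquery

client = bigquery.Client()

source = bigquery.TableReference.from_string("fh-bigquery.reddit_comments.2017_12")
destination_uri = "gs://2017_j/12_*.json.gz"  # the * lets BigQuery shard into many small files

job_config = bigquery.ExtractJobConfig()
job_config.destination_format = bigquery.DestinationFormat.NEWLINE_DELIMITED_JSON
job_config.compression = bigquery.Compression.GZIP

extract_job = client.extract_table(source, destination_uri, job_config=job_config)
extract_job.result()  # block until the export finishes
print("exported", source.table_id)
```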
Okay, we are well on our way. I've been downloading a bunch of data, I've got some more on the way, and the entire 2017 is actually ready to be downloaded. Now I want to figure out how I want to decompress all of these files. I could decompress inline with my actual data as I begin to populate a pandas DataFrame or a database or something like that, but I suspect I'm going to want to change how I'm doing things down the line, and I believe, in my little lizard brain, that interacting with already-decompressed files will speed up the line-by-line iterations. We're dealing with so many comments that the decompression is going to take CPU power no matter what, spread over millions and billions of comments, and I'd rather only pay that cost one time rather than on every pass, and I'm almost certain to make more than one pass.

To do that, I've got two directories: basically a development compressed directory and a directory where I want the decompressed files to go. So I'm going to grab a couple of compressed files once this loads, copy those, come over into the dev directory, and paste them in. Now I'm going to hope that Copilot just writes this for me. So: compressed_dir will be, what was it, 2017_dev_c; and then the decompressed directory... "decompressed_dir" would have made more sense as a name, but I'm just going to go with dev. Okay, now I have those. Copilot imports os, zipfile (sure, that sounds fine), shutil (that seems to make some sense), then: if the decompressed dir doesn't exist, make the directory, and for each file in files, if the file ends with ".zip"... so that almost makes sense, but these aren't zips, they're gzip. Let's just write a comment: decompress all the gzip files in compressed_dir and put them into decompressed_dir. Okay, it brings in gzip; we might have to install that, probably not. If not os.path.exists(decompressed_dir), make that directory, thank you; I already made it, but that does make sense. For root, dirs, files in os.walk, blah blah blah, cool, cool, cool. Wait, what is this doing? Oh, it's slicing off ".gzip", that's why it's using negative five, and then gzip.open, blah blah blah. Hmm, I actually don't think these files have
".gzip" at the end of their names, though, so that suffix handling might not have any purpose. Let's go ahead and run it. I bet that'll error, probably because of that... maybe I'm wrong. Oh, I see, it's going to create the output file, okay, that makes sense. But really, shouldn't that have worked? Why wouldn't that have worked? For file in files, let's print... oh, it didn't end with "gzip"; that's exactly the problem, they don't. If I save this and run it again, yeah, see, they're not named .gzip, which is why it's doing that negative-five slice. So I'm going to remove that check, move this up, and we actually don't want that slice at all. It's little stuff like this: it still saves me time to have Copilot write it, but if you weren't an engineer, that would have been annoying. Oh man, we have so many files, this is going to take ages. Oh no. Oh dear. I don't know if I really want to do this; I might want to just do it live, but no matter what, it's going to take so long to go through these files. That's gross, that's very gross. Let's see what we're looking at now, though. Okay, so these are decompressed. I don't know how many we got through per minute, but boy, that's going to be nasty. Okay, well, anyway, this is the data that we're interested in, and now we could load it and begin interacting with it. For now I'm not going to; I think we'll continue downloading, and then, man, we've got a lot of decompression to do. I could use a couple of machines at least to do this decompression, and maybe I'll do that, I'm not really sure. Okay, we'll run that one more time. Yeah, we could definitely multiprocess this; you know what, I wonder if Copilot will handle that for us. Okay, so now it's using multiprocessing; in this case I'm not really sure how big that pool should be, but let's run it again and see what happens. I don't see anything happening; maybe I'll have to debug that. Oh no, it just did everything, which was weird: it did everything all at once at the very end. I'm not really sure that's the right way to do it, but it was definitely faster, about twice as fast. I'm not really sure why it took so long to show anything; I'll probably have to tinker with that, but I think multiprocessing is how we'll do the decompression, and I'll find a way to make it a little quicker.
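Put together, the decompression script ends up looking roughly like this. A minimal sketch, assuming the exported shards are gzip files without a .gz suffix; the directory names and pool size are placeholders:

```python
# Decompress every exported shard in parallel into plain JSON-lines files.
import gzip
import os
import shutil
from multiprocessing import Pool

COMPRESSED_DIR = "2017_dev_c"    # placeholder: where the downloaded shards live
DECOMPRESSED_DIR = "2017_dev"    # placeholder: where the decompressed files go

def decompress(filename: str) -> None:
    """Stream one gzip shard out to a plain text file."""
    src = os.path.join(COMPRESSED_DIR, filename)
    dst = os.path.join(DECOMPRESSED_DIR, filename + ".json")
    with gzip.open(src, "rb") as f_in, open(dst, "wb") as f_out:
        shutil.copyfileobj(f_in, f_out)

if __name__ == "__main__":
    os.makedirs(DECOMPRESSED_DIR, exist_ok=True)
    files = sorted(os.listdir(COMPRESSED_DIR))
    with Pool(processes=8) as pool:   # pool size is a guess; tune for your machine
        pool.map(decompress, files)
```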
Okay, so we have that data, and once you have it you can immediately begin interacting with it. Import pandas as pd, then grab one of the file names here; it's JSON, so pd.read_json with lines=True, and print the DataFrame. What is that file, 100 megabytes? So it's probably pretty big. Anyway, here you see all our data, and we can begin working with it. Let's look at df.columns: this is a lot of columns, a lot of information that we probably don't even want. We definitely want body. "Hey, welcome to Reddit," nice. So we probably want body (hope that doesn't cause me the sad yellow dollar sign), maybe author, created_utc. I actually don't know the difference between created and retrieved_on, but one of those is going to be silly. link_id we don't need, subreddit_id we don't need, id we don't need... I can't remember whether controversiality even works. I also seem to recall something about score: I want to say that, because the API later started doing upvotes and downvotes, score won't go into the negatives, or something weird like that. Well, this one goes into the negatives, so maybe it does; very interesting. Anyway, I remember something goofy about score. Oh, of course, we're starting off with an NSFW subreddit, so that starts to explain... oh, actually quite a few of these. How wonderful. So anyway, there's probably a lot of stuff here that we don't actually need, and we'll figure that out; hopefully we can get this compressed down even further. Also, if a comment doesn't have a reply, we probably aren't going to save it, because basically I'm only interested in a pair or more. But as we make longer comment chains, you're actually going to get more data, because every single subsequent reply becomes a new sample, in theory, to train the model with a history. And now that I think about it, that's not even really an instruct model anymore; it's some other type of model, because it's not a conversation between two entities, it's a conversation between a nebulous number of entities, and really each entity could get a unique ID. Sometimes it is a back and forth between two individuals, but other times it's a back and forth between multiple parties.

All right, checking in again here: we've got quite a few files downloaded; I've got the entire 2017 and 2018 downloaded. All I did so far to make this script is, well, I helped it a little bit, but mostly I started with a very long comment to Copilot essentially saying: here's what I want to do. I have a main DataFrame that's just going to contain subreddit, author, when it was created, the parent ID, the current ID, and the body. Then I sort the files in ascending order, because that's theoretically the chronological order the comments came in. At the moment I'm just taking the first 25 files, because I'm trying to do some development first, and then we'll do the entire dataset. Then I iterate over those files, open them, and load every single line with json.loads into a temporary DataFrame. This is probably not the most efficient way to do it; it might be better to use something like pd.read_json from pandas. I'm going to tinker, figure out the fastest method I can, and go from there. The other thing I forgot to mention: using that multiprocess decompression, I needed to run it on the actual NAS itself, and that went much faster; I could do about 2,800 files in roughly 10 minutes. I think the reason is that, if you're not on the actual NAS machine, it's transferring those files over the network, and even though everything is connected over 10-gig, I guess that still wasn't enough. My main PC's CPU is, I think, considerably more powerful than what's in the NAS, so I'm not really sure what else it could be; it must just be the network transfer that was slowing it down.
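Pulling that loading loop together, here's a rough sketch of what I just described; the paths, the 25-file development limit, and the wallstreetbets filter are assumptions based on the description, not the exact script:

```python
# Load the decompressed JSON-lines shards, keep only the columns we care about,
# and filter down to one subreddit.
import json
import os
import pandas as pd

DATA_DIR = "2017_dev"   # placeholder path to the decompressed shards
KEEP_COLS = ["subreddit", "author", "created_utc", "parent_id", "id", "body"]

frames = []
for fname in sorted(os.listdir(DATA_DIR))[:25]:   # first shards only, while developing
    rows = []
    with open(os.path.join(DATA_DIR, fname), "r", encoding="utf-8") as f:
        for line in f:
            comment = json.loads(line)
            if comment.get("subreddit") == "wallstreetbets":
                rows.append({col: comment.get(col) for col in KEEP_COLS})
    if rows:
        frames.append(pd.DataFrame(rows))   # the "temporary DataFrame" per file

main_df = pd.concat(frames, ignore_index=True).sort_values("created_utc")
print(main_df.head())
```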
So anyway, continuing on: we load into a temporary DataFrame and then concat onto the main DataFrame the subreddit, author, created, and so on. And in fact, part of me wants to remove subreddit at this point: once we've filtered for the subreddit, that column is a waste of memory, and that's going to make a pretty substantial difference when we're getting into enormous row counts. Basically all of the Reddit dataset is something like 2.5 terabytes; I don't yet know the total comment count, but it's quite a bit. I think rather than 25 files, let's just do 10 for now. I'll save and run that and let it populate with the update. I do want to sort by created_utc, very good, I'll go ahead and do that, and then we can just print the head here so we're not doing that twice.

Okay, so then we start having these conversations, and what I couldn't quite decide on is how we build this. At first I wanted to just make conversation pairs, but as I'm working through this dataset I keep thinking about it. Historically, everyone has always done paired conversations: it's always bot and human, or human and bot, or instruction and response, always pairs. But in reality, both in the real world and in conversation, you have multiple entities coming in to add input to a situation, and you actually want the output to be based on multiple variable inputs. That's true for real-world modeling of all kinds, but also for conversation: while you do have some back and forth that's just between one other entity, most of the time, especially on social media, and for sure on Reddit, you have thousands of other people reading, and then maybe tens, or five, or a few people actually taking part and interacting. I think we want to be able to model that, because until we can, large language models aren't really very realistic, are they? So the first thing I really want to do here is try to model that.

As we iterate through, how do we know how deep the conversation goes? You can do it dynamically: every unique author gets a unique ID, and we just increment it, zero, one, two, three, four and so on, until we're done, and then you have the actual final bot reply. In fact, the bot will get its own ID; I'd probably give the bot ID zero or something like that. It's not just self-incrementing for every single reply, because sometimes speaker 2 chimes in again, or speaker 3, or some new person, and then the bot replies, and all of that would be in the historical context, so to speak. So we need to find a way to build that, and for some reason it's really breaking my brain, because making a pair is really simple: you just iterate through, and every time you have a parent ID you search the DataFrame for that ID,
bing bang boom, done, and it's super easy. But in this case we actually want to build the whole conversation, and for training, testing, and validation purposes there's a problem: if you just keep running through and make a new sample every time there's a new comment, I think it would help the model cheat on validation loss. That said, I'm not sure it matters; as you get larger and larger models, it's really the training loss that matters, but you would still have too many samples that are likely too similar. I don't know for sure, but if you shuffled them, there would be a historical sample that contained the response the model was supposed to make, and I feel like that would help the model cheat, and we don't want that. Even though cheating sounds great (loss goes down, good), I don't think it would actually be good, so we want to combat that too.

So the problem is kind of annoyingly tough, but I think the solution is likely found right here. I happened to notice you've got id and then parent_id, and what I'm uncertain about at the moment is these "t1"s. I'm wondering if T is for tier, right? So this would be a tier one, and this would be the lowest level, potentially, and what order do these go in? I don't really know, but I'm starting to suspect that's what's going on. So the next question I'd have is: does id ever contain a tier prefix? Let's see: df["id"]... really? Oh, welcome to Python, buddy. Okay, we'll just make this a loop: for idx in df... please help me out... it is not idx, is it; we should probably fix that. We'll print the ID and make the output scrollable. So at no point does the id itself contain the prefix. But I do wonder: rather than the first 10 rows, what if we do the last 10? So we'll slice [-10:] and run that, then look at the IDs, because I'm wondering whether parent_id is just the id with a tier prefix appended, and whether we should remove the prefix, but then you'd almost want to save that tier to another column or something, I don't really know. Okay, so if these IDs don't have a tier prefix, we probably need to remove the prefix from parent_id; and if not, we could potentially use the tier information to our advantage; like, we would want to hunt for parents with deep tiers, right? So parent_id, t1_dk-something, okay; the parent_id is probably just this id right here with the prefix in front, and then maybe this prefix tells you how deep this comment likely is. I need to ponder whether I want to use that information or build it myself. I'll check back in when I have the answer.
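For what it's worth, a quick check like the one below (run against the DataFrame from the loading sketch above) is enough to answer the id / parent_id question. Reddit's API refers to these prefixes as "fullname" type tags (t1_ for a comment, t3_ for a submission) rather than depth tiers, which is consistent with what the check shows:

```python
# Inspect the parent_id prefixes and confirm the id column itself has none.
print(main_df["parent_id"].str[:3].value_counts())   # mostly t1_ (reply to a comment) and t3_ (reply to the post)

has_prefix = main_df["id"].str[:3].isin(["t1_", "t3_"]).any()
print("ids carry a prefix:", has_prefix)              # expect False

# To match a comment's parent against the id column, strip the three-character tag:
main_df["parent_key"] = main_df["parent_id"].str[3:]  # e.g. "t1_dk..." -> "dk..."
```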
All right, after many, many hours of very inefficient programming, we finally got the outputs of 2016, 2017, and 2018 WallStreetBets. I also went ahead and tried to add a bunch of other subreddits, which I'll talk about in a second; that took a little more time. But as you can see, the WallStreetBets files were actually not too large, something like 100 megabytes up to 300 toward 2018. What I'll probably do is add 2019, and then potentially some of the earlier years like 2015, because I want more data, and there really is something special about WallStreetBets.

Coming back over: I did also do 2016, 2017, and 2018 of all these other subreddits, but as I was going through the data, and I really don't understand how this could be, nearly every sample was an instance of either incrementing numbers (people just replying to each other with an incrementing number) or everything in all caps, and I don't know which subreddit is causing my problem. Initially, while I was loading subreddits, I had removed the subreddit column from the dataset building, because with only WallStreetBets it was taking up space for no reason; then I added all these subreddits and forgot to add it back. So what I'll probably end up doing is redoing all of this with the subreddit column so I can figure out which of these subreddits is the one I really don't want. I even wrote a little function to detect whether a comment is all caps. It's not necessarily the case that I want to remove all caps; it's just that it was really low-effort, low-quality stuff, people spelling words one letter at a time, like 90 to 95 percent junk, and it was millions and millions of these comments. I don't want to go through that kind of dataset, so instead I commented this out, and I'm continuing to focus primarily on WallStreetBets, but like I said, I'll come through and figure out which subreddit it was, or whether it was a bunch of them.

So then, coming over here, we can load these files in, whether it's WallStreetBets or a bunch of subreddits, and again, I will put this code on GitHub or something like that, so if you want to do this for your own favorite subreddit, have at it. The other thing we left off on was the tier stuff. Honestly, I went through it (feel free to comment below): I tried to track whether the tier correlates to the depth of the comment, and I could not find a correlation. I really thought I was right on the tier thing, but it does not appear to be the case, so I have no idea. All I'm doing is, once I've loaded the DataFrame, saying: hey, for parent_id, just skip the first three characters. Once we have that, at least with WallStreetBets, we have 3.29 million rows. Obviously not every single one of those is a reply to anything; many are actually replies to the top-level post, and part of me wants to go back and grab those top-level posts, because a lot of the time the reply is a perfect reply to the actual title string, and we're missing out on a lot of comments and a lot of great quality there; the most-upvoted comments are often exactly those, and they're not currently in reply to anything, so we can't find a parent for them.
So anyway, all I'm doing now is basically building chains of conversations. I take an ID, then go through asking: can we find the parent? Can we find that one's parent? And once we can't find any more parents, okay, that's our conversation chain, essentially. Then, coming down here, this part is subject to change; that's why we're not really typing any of this out. A lot of this is just the live development process, because so many things change as you go: something doesn't work, okay, scrap it. For example, I really couldn't decide how I wanted to multiprocess this aspect, because this part takes a lot of time as well.

For now, what I'm doing is building the sample itself. We have all these conversation chains, and now we want to build the text that will be the training data for our AI. In this case I wanted to normalize usernames, so later on, if you were trying to actually reply to users, you would probably keep a dictionary mapping each username to its speaker number. Then I just gave the bot a name; I thought about just using "bot:" but I couldn't really decide, so I went with a WallStreetBets-themed bot name, and I'm particularly proud of it. And then, finally, these are basically our samples. For example, this would be your supposed input, and the output would just be this bit here: SPEAKER_0 says "What's Robinhood's customer service number?" and the reply is "It's 555-shoe" (I don't know why "shoe"). "Which one of you failed to pray for Starbucks and McDonald's?" "Guilty." And then here's one where you have SPEAKER_0, SPEAKER_1, and the bot, so that's a multi-speaker, or multi-chatter, scenario.

I was talking to someone, I think on Twitter, who said they did some multi-speaker stuff, and part of the problem is the bot never really knows when it's its turn to interject; it always thinks it's its turn. So I'm still curious: if you have multiple speakers, could you set the bot up so that, unless it generates its own bot token, it doesn't speak? If it tries to generate another speaker, suppress that; but if it starts generating the bot token, okay, now it wants to give input. Maybe; I have no idea, got to fine-tune to find out. Essentially, all that's left to do is: you could quite literally have this be the input and this be the output, depending on how you want to fine-tune the actual model, or you can just have all of it be your sample and train the model on that, because really, an instruct model is just trained on some specific format, and that's basically what this is.

So the next thing I'm going to do is take this kind of data, where you have multiple speakers and multiple times the bot, quote-unquote, is talking. As you can see, this one is maybe not exactly what you're looking for, but it at least is a somewhat decent conversation.
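Here's a minimal sketch of the chain building and sample formatting just described. The speaker tokens and the bot token are placeholders, not necessarily the exact strings in the real dataset:

```python
# Walk parent links to build a conversation chain, then format it as training text.
by_id = main_df.set_index("id")   # main_df / parent_key come from the sketches above

def build_chain(comment_id: str) -> list:
    """Follow parent links upward until no parent is found; return oldest-first."""
    chain, current = [], comment_id
    while current in by_id.index:
        row = by_id.loc[current]
        chain.append(row)
        current = row["parent_key"]   # parent_id with the t1_/t3_ tag stripped
    return list(reversed(chain))

def format_sample(chain) -> str:
    """Map authors to SPEAKER_n and make the final comment the bot's reply."""
    speakers, lines = {}, []
    for row in chain[:-1]:
        if row["author"] not in speakers:
            speakers[row["author"]] = f"SPEAKER_{len(speakers)}"
        lines.append(f"{speakers[row['author']]}: {row['body']}")
    lines.append(f"WSB_BOT: {chain[-1]['body']}")   # placeholder bot token
    return "\n".join(lines)
```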
The other thing: I wrote some quick code for a minimum length, which is how many back-and-forths there are. In this case there must be at least a reply, otherwise what's the point, but you could also look only for long-form back-and-forth discussions. I also set a minimum score, which is the number of upvotes, and that pertains to the final reply. So basically, in all of these samples, the final reply required people to upvote it three or more times, or really a surplus of three or more. So now all that's left to do is build those JSON samples, pick a model we want to fine-tune, and see how that goes.

All right, such is life in the reality of software development. In many videos it looks like one pass through and I know all the settings I want, but the reality is that's almost never the case. For example, I thought I'd just want the ID as a primary key and then the text, but I think what I really want is to be able to sort and filter by score. I don't want anything that hasn't been upvoted at least once, so I still want that 2-plus or 3-plus filter; I think I was actually using 3-plus somewhere... there it is. That's what happens when you copy and paste from a notebook. So min score is 3 and min length is 2, but later we might want to change that and make min score 5 or something, and it would be a lot nicer to just select from a database where score is 5 or greater, or 10, or 20, or whatever we're actually looking for. Same thing with length: maybe we want much longer conversations, because the bot still hasn't learned how to handle multiple people in a conversation, something like that. So we definitely want to be able to filter for that without needing to rerun that multiprocessed script.

Okay, so what I'm going to do here is: id, train_text, and then score, which will be an INT (I can't even remember if it's INT or INTEGER; I really wish Copilot would save me here; it is just INT), and then length, also an INT. Now that we have that, we come down to the insert: INSERT INTO ... (id, train_text, score, length); is that right? Score and length, good. We'll add two new values here: the reply score, and the length, which is just len(chain). We commit, add to database, good, and I think the rest of that gets to remain. Now let me make sure this actually works. I'm going to get rid of this database, because we made a new schema, and then test it real quick: python3 build_training_data... okay, so we're starting to append to the database. Do we have anything else in there? No, nothing. So, SELECT train_text FROM the wallstreetbot table... let me just run this real quick to make sure it actually works; we'll grab 10 rows. Hopefully I can select all... okay, well, for r in rows, up to 10, print r. I hadn't saved, so I'll save; we'll just call it analysis, blah blah blah. Let's try that one more time. Okay, very good: you have your ID, the conversation, score, and length, so then we can filter as needed.
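A minimal sketch of the table and insert I just typed out, assuming SQLite; the database filename and table name are placeholders:

```python
# Store each formatted sample along with its reply score and chain length,
# so we can re-filter later without rebuilding everything.
import sqlite3

conn = sqlite3.connect("wsb_training_data.db")   # placeholder filename
cur = conn.cursor()
cur.execute("""
    CREATE TABLE IF NOT EXISTS wallstreetbot (
        id INTEGER PRIMARY KEY AUTOINCREMENT,
        train_text TEXT,
        score INT,
        length INT
    )
""")

def add_to_database(train_text: str, reply_score: int, chain_len: int) -> None:
    cur.execute(
        "INSERT INTO wallstreetbot (train_text, score, length) VALUES (?, ?, ?)",
        (train_text, reply_score, chain_len),
    )
    conn.commit()
```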
So now I'm going to run this with a whole bunch of processes on the Puget Systems machine again, and hopefully this will be the last time I modify it before I actually upload the data and can fine-tune a model on it.

All right, another day, and finally we have quite a few updates. First of all, I now have the full-ish database of about 600,000 paired samples; that's with a minimum length of two and a minimum vote of three, if I recall right. Let me check the build script to make sure I was correct in that statement: yes, a minimum length of two, so at least one reply, and a minimum score of three for the latest reply, in order to go into the database. That's the bare minimum. One of the changes I made (in one of the previous segments, which would be yesterday for me but could be five seconds ago for you, depending on how I edit this) is that initially I wanted to go with an even bigger score threshold just to keep the sample count low; I don't think I need 600,000 samples. But it is beneficial to store score and length in the actual database, so later I can filter those down and tinker, rather than needing to change the threshold and rebuild the dataset over and over.

At this point, for some reason I couldn't load the JSON locally, so I went ahead and just uploaded it to Hugging Face, which is here. So 003 and then 001; this one is going to be the min score, you can actually see it here, min score 5 (hello, dog), and then the minimum length of two. I made a few of these: min score 3 / min length 2, min score 5 / min length 2, min score 10 / min length 2, and min score 10 / min length 5. In fact, I'd also like to see a min score 3 with a length of 5, so I'll come over to make_train_json, set min score 3 and min length 5, and save that. Each time I save one, I can see how many samples were in whatever I'm saving. This wasn't really meant to be shown on video, but just to give a sense: with the settings I just set, let's see how many we have. 98,000; and this one, we can probably guess, is the min score 10 / min length 5 one. Anyway, I want to be able to go back and forth, because one of the hardships of creating models like this in the past was that if you wanted to deploy the model to something like Twitter, it never really worked super well, because the model wasn't trained on multi-turn, multi-speaker contexts. That's one of the things I'm super interested in checking here.

So anyway, once this is uploaded and easily accessible and legible, you can definitely see the samples are loaded here. I do wonder why I couldn't view it locally: analyze_training_data, import json, filename, json.load... I have no idea why that won't load; someone comment below. I've been doing way too much, and it's so hard to do projects like this and then come back to them; I'm sure it's something really stupid. Anyway, the dataset is up there.
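And a sketch of how each of those per-setting training files can be built from that database: select with whatever min score / min length you want, then write JSON lines. The output filename convention here is just a guess:

```python
# Filter the stored samples by score and length, then dump them as JSON lines.
import json
import sqlite3

MIN_SCORE, MIN_LENGTH = 3, 5

conn = sqlite3.connect("wsb_training_data.db")
cur = conn.cursor()
cur.execute(
    "SELECT train_text FROM wallstreetbot WHERE score >= ? AND length >= ?",
    (MIN_SCORE, MIN_LENGTH),
)
rows = cur.fetchall()
print(f"{len(rows)} samples at min_score={MIN_SCORE}, min_length={MIN_LENGTH}")

with open(f"train_minscore{MIN_SCORE}_minlen{MIN_LENGTH}.jsonl", "w", encoding="utf-8") as f:
    for (text,) in rows:
        f.write(json.dumps({"text": text}) + "\n")
```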
So the next step is to figure out which model I want to use. Initially I really wanted to use Gemma, but after poking around I was talked out of it, and instead I think I'm going to go with StableLM 3B, which is comical, because StableLM 3B is actually 2.8 billion parameters and Gemma 2B is actually 2.5 billion parameters. The cool thing to do these days seems to be to round down considerably; even Gemma 7B is actually well over 8 billion, I want to say about 8.5 billion parameters. We can look that up real quick: yeah, 8.5 billion parameters, but they call it 7B. So that's cool that people are doing that now. Anyway, Gemma 2B, for example, is 2.5 billion, and the one I'm going to try first will be StableLM 3B. Let's find the one I actually want: it's from Stability AI, but not that one, it's going to be the Zephyr one, yeah, right here. That will probably be the first one I attempt to fine-tune, but I'm going to poke around a little bit, I'll let you know how it goes, and maybe I'll upload these other datasets too so I can play with those. I'll see you in a moment.

All right, in the end I went with neither of those models. Gemma I didn't go with simply because I was talked out of it, and StableLM 3B, I can't remember my exact issue, but I was unable to figure out how the heck to QLoRA it. So I ended up QLoRA-ing Llama 2 7B, which I've done before, so I was familiar with at least what I wanted to target. I also found a super helpful notebook, which I will also link; I think it's the simplest notebook I've found so far (I haven't looked super hard, but it's simpler than what I'd found in the past). I just ended up using that notebook and switching out the things I wanted for my particular model. As that notebook trains, you can set how many steps it takes before it checkpoints, and those checkpoints are your adapters. Also, while I was researching and actually reading some documentation for once in, like, a year, I found out that PEFT has AutoPeftModelForCausalLM, and you can use it to target both the adapter on Hugging Face and the base model on Hugging Face. What this allows you to do is: in the past, you would train your adapter, then merge the models, dequantize, and upload the whole model to Hugging Face, which in some cases you might actually want; but chances are, most of the time you actually want to keep the base model quantized, keep the adapter separate, and potentially switch adapters around and use them interchangeably. I don't know how long AutoPeftModelForCausalLM has been around; it could have been there from the very beginning and I just didn't know, because I very rarely read docs. It makes things unbelievably easy: every time training checkpoints, you can test that checkpoint, you can train different adapters and swap them around, and it's all much smaller, because the model itself stays quantized (so much less memory) and the adapter is tiny. This makes things super convenient, and at least for me personally, this is the method I'm going to use for my fine-tuning: get my adapter, find the one I like, put that on Hugging Face, and then I can pull it down any time I want and work with it. So that's what I've done.
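A minimal sketch of that AutoPeft loading path, with a hypothetical adapter repo id. It assumes the tokenizer was pushed alongside the adapter (otherwise load it from the base model repo) and that bitsandbytes is installed for the 4-bit load:

```python
# Load the QLoRA adapter plus its quantized base model in one call, then generate.
from peft import AutoPeftModelForCausalLM
from transformers import AutoTokenizer

ADAPTER_REPO = "your-username/wsb-llama2-7b-qlora"   # hypothetical adapter repo

model = AutoPeftModelForCausalLM.from_pretrained(
    ADAPTER_REPO,
    load_in_4bit=True,    # keep the base model quantized; only the small adapter sits on top
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained(ADAPTER_REPO)

prompt = "SPEAKER_0: What's Robinhood's customer service number?\nWSB_BOT:"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(out[0], skip_special_tokens=True))
```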
I've got a few examples; I'll show some of the text outputs from some of the sample prompts that I have. None of these are doing that kind of multi-speaker prompt, but I did test some of that and it does seem to work pretty well; I need to build a bigger list of prompts to further test that functionality, because doing it quickly on the fly wasn't going to work. Maybe I'll follow up with a video on that. I do think the result here is actually better than the previous QLoRA I've done, so I do like the filtration I've done up to this point. I definitely think, first of all, that you only need to take about 1,000 steps; I'm not really seeing much improvement in the model past 1,000 steps, and arguably 500 steps is enough. For me that was a batch size of 7, so we're talking about 3,500 samples is probably all you really need. Then the next question is: could you do it with, say, 200 samples and just two or three epochs? My suspicion is yes. So that's another thing I really want to try: what if we drill down into the best of the best samples somehow, maybe scoring on sentiment, or with a much higher threshold on the upvote score? Whenever you do that, though, you have to be careful, because then you get people quoting a movie, or just counting up, or those other things I found; you get weird ones when the scores get really high. But again, we only need a few hundred, so chances are you could start manually filtering for that. So maybe I'll have a follow-up on that, because I think we're on a good track here, and now it's become so simple with this AutoPeft approach; that's my favorite part, I'm so excited I found it, I can't believe I'd never seen it before. And I appreciate the notebook, which made the actual LoRA/QLoRA of Llama 2 7B much easier.

I used the Nous model, the chat variant. You could probably use the base Llama 2 model; I'm not sure the chat one is actually better to use in this case. At the end of the day it's all generating tokens, but maybe the base model would have made more sense, since I'm changing the structure anyway; if I were keeping that chat-style structure, or whatever the Nous Research structure was (I haven't even double-checked), maybe the chat model would make more sense. I don't really know; lots of things to test, and now that it's a little easier to test, and I know a thousand steps is really all you need, you can start automating and doing evals and things like that.

Anyway, everything is uploaded. I'll try my best to have links in the description to everything, but if there's something I've mentioned or shown and there's no link, or you can't find it, feel free to ask for it. The models are up; I'll link at least a few of them, and the datasets are up. I'll try to upload at least one large dataset that you could then filter yourself. The problem is, I really wanted to upload all of the Reddit data, which would be really cool; I just simply do not have internet capable of doing that, unfortunately. I'm trying to think of a way to do it: I could pull the dataset down on some host
somewhere, like some cloud provider, and then upload it from there, kind of mooching off someone else's internet; maybe I'll do it that way. Anyway, I think it's just such a cool dataset, and there's so much room for opportunity here. I really like the WallStreetBets one, but there are lots of subreddits. It is tough to wrangle this dataset, so it'd be cool to have something up and ready to go, and then maybe a few slightly curated versions you could download, so you're not downloading terabytes; maybe you're downloading a gigabyte or a few gigabytes and working from there. If anybody has ideas about types of datasets, or structures, or anything like that, let me know, because I could definitely upload all kinds of variations; I have a lot of Reddit data now.

Also, a reminder about the 4080 SUPER: all you need to do is sign up for GTC with the link in the description and attend a session. That 4080 SUPER GPU would likely fine-tune faster than I fine-tuned here on the RTX 8000, since the RTX 8000 is kind of slow nowadays; I love it for the 48 GB of GPU memory, but otherwise it is quite slow, especially compared to something like a 4080 SUPER. Congratulations in advance to whoever snags that and takes it home. That's all for now; questions, comments, concerns, suggestions, whatever, feel free to leave them below. Otherwise, I will see you all in another video.
Info
Channel: sentdex
Views: 35,313
Keywords: python, programming
Id: pCX_3p40Efc
Length: 61min 55sec (3715 seconds)
Published: Wed Mar 06 2024