Data Mining in C

Captions
Looks like we're live. Hello everyone, and welcome to yet another recreational programming session with ausim. Let's make a little bit of an announcement and officially start the stream, as usual. So: red circle, live on Twitch. And what are we doing today on the Twitch television website? Today we're doing data mining in C, how about that. I'm going to give the link to where we're doing all that, https twitch.tv/, and I'm going to ping everyone who's interested in being pinged. And there we go, the stream has officially started.

So the title is pretty loud, but we're not going to be doing anything super special, right? Especially if you're a person who works in the data mining industry, it's probably not going to be that interesting for you. What I want to do today is implement the k-means clustering algorithm in C. If you want to learn more about this kind of thing, you can find it in here. I'm going to copy-paste it in the chat; for people who are watching on YouTube, it's going to be in the description, of course.

I learned about this algorithm in the context of one meme paper that was circulating around in 2022, when ChatGPT was the hot thing and stuff like that. The code name of that paper is basically "gzip is all you need". It's a pretty funny thing. The idea is about classification of documents: essentially, this paper demonstrated that you can quite easily classify documents by just gzipping them and doing k-means clustering on them, and that basically competes in performance with deep learning algorithms and stuff like that. So a very dumb thing performs better than large language models for classifying documents, which is very unexpected. And I was like, what is k-means clustering? And k-means clustering is a very interesting algorithm.
Essentially, you give it a set of points, some sort of observations. In here I suppose it's a set of documents, so one document is going to be one point. And you say: okay, I want to split this entire set of points into k clusters. You basically decide how many clusters you want to have, and it tries to find the so-called centroids (I think they're called centroids, or means) for each individual cluster, and it groups each point into the cluster whose mean it is closest to.

There's a pretty cool visualization of how this algorithm works. Here is an animation: you have a bunch of points, and as you can see you have centroids, and the algorithm sort of adjusts the centroids so you can optimally fit all of this data into, like, three clusters. Interestingly, it kind of reminds me of a machine learning algorithm: you have some sort of cost function and some sort of parameters, and you adjust the parameters, a.k.a. the centroids, to optimize a certain cost function, and somehow you split the entire data set into k clusters.

And apparently the naive algorithm, which is one of the most used algorithms (Wikipedia at least calls it the standard algorithm), is extremely simple. To the point that, I mean, I can implement that in C. Not only can I implement that in C, I want to actually write a visualization in raylib, with animation, that just does that. That would be interesting, I think, to visualize that stuff as well. It shouldn't be that difficult. Essentially, you have k centroids, k means, and you split the entire set according to those means. It's a very simple algorithm. As you can see, this is sort of like a for loop: you're iterating from 1 to k, for all j between 1
and k. It's a for loop, right? Or more like a list comprehension. Then you take a point, and if the point is closer to this mean than to any other mean, you include it in that mean's set. So basically you split all the points according to which mean each one is closest to. Then you update each individual mean by summing up all of the values in its set and dividing by the size of the set. It's an iterative process, and it slowly converges towards a local optimum. It doesn't necessarily converge to the global optimum, but it converges to some local one, depending on where you start. You can basically randomize these means, and depending on how you randomize them it will find different optimums; it also depends on the data set and stuff like that. And that's it, that's the entirety of the naive algorithm. I suppose that's the most used algorithm as well. There are more optimal algorithms that go really nuts about different stuff, but that's the basic idea.

Once we nail that, once we implement that, we can try to tackle this paper. I'm not going to try it today, that could probably be a separate stream, but we can try to implement this kind of stuff today, just to learn how to use this algorithm. I've never actually implemented it, even though I went to a university. I went to a pretty shitty university that didn't teach me, so I have to make up for that myself on the streams, just reading Wikipedia and stuff like that. It sounds interesting. I really like that it's just that simple: really easy to comprehend, really easy to implement, really easy to visualize. I really like things like that, small little golden nuggets that you can just play around with. So let's go ahead and try to do that, how about that.
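As a rough sketch of the two steps just described (assign every point to its nearest mean, then move every mean to the average of its points), something like the following could work. This is not the code written on stream: the names `kmeans_assign`, `kmeans_update` and `kmeans_run` are made up for illustration, `Vec2` is a stand-in for raylib's `Vector2` so the sketch builds without raylib, and leaving an empty cluster's mean where it is is just one possible policy.

```c
#include <assert.h>
#include <float.h>
#include <stddef.h>

// Stand-in for raylib's Vector2 (assumption: 2D float points).
typedef struct { float x, y; } Vec2;

static float dist_sq(Vec2 a, Vec2 b)
{
    float dx = a.x - b.x, dy = a.y - b.y;
    return dx*dx + dy*dy;
}

// Assignment step: label each sample with the index of its nearest mean.
// Returns how many labels changed compared to the previous pass.
size_t kmeans_assign(const Vec2 *samples, size_t n,
                     const Vec2 *means, size_t k, size_t *labels)
{
    assert(k > 0);
    size_t changed = 0;
    for (size_t i = 0; i < n; ++i) {
        size_t best = 0;
        float best_d = FLT_MAX;
        for (size_t j = 0; j < k; ++j) {
            float d = dist_sq(samples[i], means[j]);
            if (d < best_d) { best_d = d; best = j; }
        }
        if (labels[i] != best) { labels[i] = best; changed++; }
    }
    return changed;
}

// Update step: each mean becomes the average of the samples assigned to it.
// A mean with no samples is left untouched (hypothetical policy).
void kmeans_update(const Vec2 *samples, size_t n,
                   Vec2 *means, size_t k, const size_t *labels)
{
    for (size_t j = 0; j < k; ++j) {
        Vec2 sum = {0};
        size_t count = 0;
        for (size_t i = 0; i < n; ++i) {
            if (labels[i] != j) continue;
            sum.x += samples[i].x;
            sum.y += samples[i].y;
            count++;
        }
        if (count > 0) {
            means[j].x = sum.x / (float)count;
            means[j].y = sum.y / (float)count;
        }
    }
}

// Alternate the two steps until no label changes or max_iters is hit.
// Returns the number of assignment passes performed.
size_t kmeans_run(const Vec2 *samples, size_t n, Vec2 *means, size_t k,
                  size_t *labels, size_t max_iters)
{
    for (size_t i = 0; i < n; ++i) labels[i] = k; // k means "no cluster yet"
    size_t iters = 0;
    while (iters < max_iters) {
        iters++;
        if (kmeans_assign(samples, n, means, k, labels) == 0) break;
        kmeans_update(samples, n, means, k, labels);
    }
    return iters;
}
```

The caller supplies the starting means; as mentioned above, different starting means can end in different local optima.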
All right, so let's go ahead and just create some sort of boilerplate: kmeans. This is basically going to be the folder where we do all of that, and I'm going to create a simple hello world. And of course, for building this entire stuff we're going to be using nob, as usual, because that's the only acceptable build system for C for me. I mean, that's not true, I can use whatever build system is required, but in my personal projects I do whatever I want, and I want to use nob. I think nob is located somewhere here. Here it is, nob.h, and there you go.

So I need to create nob.c, and in nob.c we're going to include nob.h locally. And of course we need to define NOB_IMPLEMENTATION to make nob.h act like a C file as well, because by default it acts like a header; with that define, as soon as you include it, it also includes the implementations and stuff. The first thing we're going to do is Nob_Cmd, and we're going to craft a simple command line that is going to build our project. For now it's going to be very simplistic: nob_cmd_append, and it's simply going to call cc. Then it's going to accept main.c, and then it's going to output the executable; let's call it kmeans. And that's going to be about it. I would also like to add a bunch of warnings, maybe a bunch of extra warnings, for extra safety, if you know what I'm talking about. We may also include some debug information in case we want to debug something.

Let's go ahead and run this entire thing synchronously. I don't remember how to run it; I think it's nob_cmd_run_sync. We want to run it specifically synchronously because we're not doing a parallel build or anything like that. If it fails, we exit with a non-zero exit code; if it doesn't fail, we exit with zero. And we should not forget to enable the Go Rebuild Urself technology, so we don't have to rebuild our build system over and over again every time we modify this entire thing, and we're probably going to be modifying it pretty often, I do believe. So let's go to the compilation error: you just pass it without any pointer. And there we go, we just compiled our own custom build system that we can try to run, and that build system is supposed to build some other things.

In here something weird happened, probably because I forgot to put a comma somewhere. Look at that, I forgot to put a comma, but that is valid C code. That's the most annoying part about C: two neighboring string literals are considered a valid C expression, which is just one string literal that looks like this. And if you forget to separate string literals with a comma, well, you get very weird errors, honestly. But in any case, there we go, we've got our build working. And in main, do we even do anything? I think we should just print something just in case, like hello world. There we go.
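The missing-comma problem mentioned above can be reproduced without any build system. This is only an illustration: the arrays `args_ok` and `args_bug` and the flags in them are hypothetical, not the actual argument list from the stream. Adjacent string literals are merged by the compiler, so a forgotten comma silently turns two arguments into one.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define ARRAY_LEN(xs) (sizeof(xs) / sizeof((xs)[0]))

// Intended argument list: three separate strings.
static const char *const args_ok[] = { "cc", "-Wall", "-Wextra" };

// Same list with the comma after "-Wall" forgotten. This still compiles:
// the adjacent literals "-Wall" "-Wextra" become the one string "-Wall-Wextra".
static const char *const args_bug[] = { "cc", "-Wall" "-Wextra" };

// Documents the effect: one element fewer, and a merged second argument.
void check_merge(void)
{
    assert(ARRAY_LEN(args_ok) == 3);
    assert(ARRAY_LEN(args_bug) == 2);
    assert(strcmp(args_bug[1], "-Wall-Wextra") == 0);
}
```

A compiler invoked with the merged argument sees a single unknown flag, which is where the confusing errors come from.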
So I'm going to rebuild this entire stuff. It got rebuilt, and now I can run kmeans and it says hello world, how about that.

We also need to think about what kind of data set we're going to use. I suppose we can just generate some sort of data set. It's kind of funny that on Wikipedia they're using the so-called "Mouse" data set, an artificial data set. Guess why it is called Mouse; that is actually kind of funny. I suppose we can do a similar thing: we can create a generator where you put in centroids, and it generates random points around each specific centroid, or something like that. It shouldn't be that difficult to do.

Okay, so, interestingly, we now need to start using raylib. The problem with starting to use raylib is that I don't really have a raylib build; in Musializer I actually build raylib from source, from scratch, every single time, and in here I don't really want to do that. So for now, as I'm prototyping this entire thing, I'm going to literally just copy-paste libraylib.a from the Musializer build and use it raw, just as a binary. I'm not going to commit that yet, but for local quick iteration, just to get started really quickly, I might actually do that. I might copy-paste this thing in here, and then I can go and take the source code of this entire thing. So this is going to be raylib.h. Here is raylib.h, and as far as I know raylib also depends on raymath, so this is another thing that we may need, and rlgl is another thing that is very important in here. So that's all of the things. We might as well create a folder raylib where we're going to put all of the raylib-related things, so we're going to move that stuff in there.

The next thing we can try to do is simply include raylib and marvel at how it is not going to compile. Look at that: I run nob and it doesn't compile, it can't find raylib. So one of the things we probably need to do is add the search path to the compilation command. I'd like to first split this entire stuff so the command line is a little bit shorter. I'm going to do something like this: here are all of the flags, and I might as well split it like so. So here we have the compiler, then all of the flags, then the output, and maybe the input is also going to be a separate line, so it's easier to manage. We're splitting all of that into logical chunks, into logical clusters, if you know what I mean. Then we're going to do nob_cmd_append on cmd with the include path, and we're going to just put raylib like this. There we go.

Let's try to compile this entire thing, and as you can see it compiled successfully, because we don't really use anything from raylib yet. So we can InitWindow, provide the size, and then we can organize a loop: while not WindowShouldClose, we're going to BeginDrawing and then EndDrawing, like so. Then we're going to clear... I think it's ClearBackground; let me open raylib.h in here. Yeah, it was ClearBackground, right: clear
background, red. And after we're done with this entire loop, we're going to close the window, like so. Now if I try to compile that, it is not going to compile, because I forgot to provide the title for the window. Let me see, where do you put the title? I think you put the title last. So we're going to put kmeans in here. There we go, it compiled successfully, but it didn't link properly, because it doesn't know where to find the library. So we can try the following thing: we can say, okay, link with -lraylib. But it's still not going to find it, because there's no such thing, so we also need to modify the search path for libraries as well. We're going to point it at the same place where we have the headers and stuff like that. And now it kind of compiles, but it complains about other missing libraries, so we also need to link this entire stuff with the math library. And there you go, we finally compiled, and if we now try to run the entire program, we've got a window. It's that freaking simple.

What's interesting is that the entirety of raylib is basically this: three headers and one static library. And I actually built that static library myself from scratch; I didn't download it from the official raylib source. I could have, but I didn't; I just took it from Musializer, and it still works, how about that. But you'd expect that to happen.

Okay, so let's go ahead and maybe try to render something on the window. The first thing we need is a way to render the samples that we're generating, because we're going to be clustering these samples. But before we can cluster them we need to generate them, and before we can generate them
we need a way to display them. So we need to figure out how exactly we're going to be displaying the dots, the samples. For the background I think I'm going to use my usual background, the same one I have in Emacs. The hex code of this background is 18 18 18; it's a pretty good background, in my opinion. I don't remember how to specify a hex code for colors. I think it's ColorGet, or maybe it's GetColor. Right, GetColor. So essentially you can provide a hex value in here, and that way we can do GetColor, something like this. I mean, hex is actually not written like this, it's more like that. Let's go ahead and run it. This looks sus, not going to lie. Maybe you want to do something like this. Uh-huh, so that means we have to do it like that. There we go.

All right, so I forgot to provide the alpha channel, and it created this weird greenish color. It looks like the color that Jonathan Blow would use, honestly. It is literally the Jon Blow theme; he likes to put these kinds of colors in his tests and stuff like that. That's funny, it is the Jon Blow color. It's kind of funny how it's instantly recognizable, if you know what I mean.

Okay, so how are we going to be generating the points? We can just go ahead and draw some points on the screen. Luckily we have functions like DrawCircle, and it's pretty straightforward: just draw the circle. We're going to render it somewhere, let's say at the center. So we can get the screen width, and as far as I know, screen width is not the size of the whole screen, it is the size of the window, for some reason. Yeah, "get current screen width". And there's also the render one, yeah.
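To see why the missing alpha channel gave a greenish color, here is a small sketch that unpacks a hex value the way raylib's `GetColor` reads it, as 0xRRGGBBAA. The `Rgba` struct and both function names are stand-ins invented for this illustration so that it builds without raylib, and it assumes a 32-bit `unsigned int`.

```c
#include <assert.h>

// Stand-in for raylib's Color: four 8-bit channels.
typedef struct { unsigned char r, g, b, a; } Rgba;

// Unpack a value laid out as 0xRRGGBBAA.
Rgba rgba_from_hex(unsigned int hex)
{
    Rgba c;
    c.r = (unsigned char)((hex >> 24) & 0xFF);
    c.g = (unsigned char)((hex >> 16) & 0xFF);
    c.b = (unsigned char)((hex >>  8) & 0xFF);
    c.a = (unsigned char)( hex        & 0xFF);
    return c;
}

// Turn a 0xRRGGBB value into a fully opaque 0xRRGGBBAA value.
unsigned int opaque_from_rgb(unsigned int rgb)
{
    assert(rgb <= 0xFFFFFFu);
    return (rgb << 8) | 0xFFu;
}
```

With this layout, 0x181818 is read as red 0x00, green 0x18, blue 0x18, alpha 0x18, while 0x181818FF gives the intended dark gray at full opacity.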
I don't quite understand what the difference is between these things. One of them considers HiDPI. But I remember when I started to use this thing there were some problems on macOS. Ah, I suppose: essentially, in HiDPI you have more pixels per pixel, sort of, so because of that the value in render width and height is actually going to be bigger, and maybe because of that the scaling was kind of weird on macOS. We had a bug: I was using render width in Musializer, and macOS people basically changed it to screen width, and they said that it fixed the problem with scaling for them. So I suppose that's what we want to use generally. I don't really know what's up with that, but if that makes macOS people happy, I mean, so be it.

So we're going to put all of that stuff in the center: the width, the height. And what kind of radius are we going to have? Let's say 30, and the color is going to be red for now, because I want to be able to see that little dot in there. It's not really that little, honestly. So what about 10 pixels? Yeah, 10 pixels looks okay.

Essentially, we probably want to be able to render the points on different ranges, if you know what I mean. But does it even really matter? It probably doesn't matter if you think about it. What I was thinking is that maybe I want to make it so the left is minus 20 and the right is plus 20, and up is plus 20 and down is minus 20, or something like that. And all of those ranges are going to be configurable, so you can fit different kinds of data and stuff like that. But since we kind of just want to
generate a bunch of random things and just cluster them, maybe it doesn't matter. Maybe we can do all of that within the ranges from 0 to 80, not 80, 800, and 0 to 600. Maybe that's fine, because we're not going to be working with real data. Though we are going to be working with real data in the future, so we'll need to think about how we want to approach that. So maybe I'm still going to create the range. We're going to have something like the minimal x, which is going to be minus 20, and the maximum x is going to be plus 20, and the same is going to go for y.

And essentially maybe we're going to have something like project sample. You give it the coordinates of the sample in this specific range, so a Vector2 sample. As far as I know it's somewhere in raymath; there's a structure Vector2. "Vector 2: Electric Boogaloo". There we go, here is the Vector2. We're going to accept that, and we're going to get a vector in screen coordinates. So maybe I'm going to call it project sample to screen. Is that a good name? Maybe it's too long, I don't really know.

But essentially, what we want to do: we have a sample somewhere in the range of minus 20 to 20, and we want to convert this entire thing to a range from 0 to 1, because that will allow us to map the entire thing to the screen. How can we do that? We can find the length of this entire range, which is going to be max x minus min x. And then we probably need to shift the x of the sample to start from
zero, to go from zero to that specific length. To do that, I suppose we need to subtract the minimal x. By subtracting the minimal x we are mapping the sample from this range to the 0 to 40 range, if you know what I mean. So the range goes from negative to positive, but we are shifting it to go from zero to the length of this entire thing, which then makes it super easy to map to 0 to 1 by just dividing by the length. That's basically what it is. We might as well inline this entire length, and that's basically the normalized x: it's from 0 to 1. And you want to do the same with y as well. So we're remapping this entire thing like that.

So we have x and y in 0 to 1, but now, to map it properly to the screen, we need to know the size of the screen. We already know it: it's the screen width, and we just multiply it by x. And essentially what we can return here is probably CLITERAL Vector2, and this is going to be just that. We'll have to replace width with height for y, something like this. Interestingly, maybe we can even inline all of that to make it even more unreadable for the normies. Yeah, that's a good idea, look at that. We really want to make it unreadable for the normies so we can gatekeep all of our mathematical secrets; that's what we want to do, honestly. We probably need to do that in a very specific order, because the operations are going from left to right: this is going to be the first operation, and then this one. So we want to do that in that specific order, which of course makes it even more unreadable for the normies, and that's very important, that's very
important. So there we go, we've got a projection formula. It's a very simple formula. And essentially we can try to map something with it. We're going to have some sort of test sample: here is a sample, and this is going to be a sample at minus 10, minus 10. I think that's a pretty good sample. It's supposed to be somewhere at the bottom left corner of the screen, I think. So now we are projecting the sample to the screen and we need to render it, but I can't render it with this function, because this function accepts the coordinates separately, so we need to find a different function. I think there was a DrawCircleV, which accepts a vector instead. We know that the radius is 10 and the color is red, and we can just do something like that instead of the center. Look at how it looks: DrawCircleV, project sample to screen, and we just render it.

So we expect it to be in the bottom left quarter of the screen, and that's probably not going to be true. It's probably going to be in the upper left one, because the y coordinate is flipped, as in any self-respecting graphics library. Let's actually see if that's true or not. There we go, it is in the upper one instead of the bottom one. Which begs the question: should we flip it to be more mathematical? I'm not really sure. We could probably flip it by just taking this entire thing and subtracting it from the height, like that. I don't really like how we call a function every time we need to get the width and height, so maybe one of the things we're going to do is basically cache all of those things. We're going to simply
cache them, so this is going to be the width and height, and then we're going to put it like this. Come on. Boom. So hopefully that will flip it. We could also flip it by, well, moving h outside, something like this, but it's basically the same, isn't it? Yeah, it's basically the same. So this is one of the things we can do. But anyway, there we go: it is at the bottom, in a more mathematical manner.

The entire thing is not particularly resizable, so what if we make it resizable? This is actually very interesting: how do you make the window resizable in it? Set window flags, I think... resizable. Okay, so there are some flags: window flags, config flags. Yeah, you can set the config flags, I suppose. So we can try to set the config flags right before we initialize the window, and one of the things we can say is window resizable, and we're just going to slap it in here. There we go, it is resizable, and as you can see it scales accordingly.

Okay, so here's the funny thing. If we did everything correctly, (0, 0) should be at the center of the screen. There we go, it is at the center of the screen, and no matter how we resize it, it's going to stay at the center of the screen. I think it's pretty freaking cool: very simple formula, maximum result. It is easier than centering a div in CSS, you know it. But all of the React devs are going to be in denial, saying that you don't understand the actual business need of the enterprise, you don't understand that it needs to be overcomplicated like that. You don't understand, you don't understand. I'm in denial, you don't understand. [Music]
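Put together, the projection worked out above (shift by the minimum, divide by the range length, scale by the window size, flip y) might look roughly like this. It is a sketch, not the exact code from the stream: `Vec2` stands in for raylib's `Vector2`, the bounds are written as macros whose names are assumptions, and the width and height are passed in as parameters where the stream caches the values from raylib.

```c
#include <assert.h>

// Stand-in for raylib's Vector2.
typedef struct { float x, y; } Vec2;

// Sample-space bounds: -20 to 20 on both axes, as chosen on stream.
#define MIN_X (-20.0f)
#define MAX_X ( 20.0f)
#define MIN_Y (-20.0f)
#define MAX_Y ( 20.0f)

// Map a sample from [MIN_X, MAX_X] x [MIN_Y, MAX_Y] to window pixels.
// w and h are the current window width and height in pixels.
Vec2 project_sample_to_screen(Vec2 sample, float w, float h)
{
    assert(w > 0.0f && h > 0.0f);
    // Normalize to 0..1: shift so the range starts at zero, then divide by its length.
    float nx = (sample.x - MIN_X) / (MAX_X - MIN_X);
    float ny = (sample.y - MIN_Y) / (MAX_Y - MIN_Y);
    // Scale to the window; flip y so larger sample y is higher on screen.
    Vec2 p = { nx * w, h - ny * h };
    return p;
}
```

Because the result depends only on the current width and height, the sample at (0, 0) lands at the window center for any window size, which matches what the resizing test showed.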
Anyway. Software must be dog slow. It must be, there's no way around it, it just must be like that, you don't understand. It must be, must be, must be.

Okay, so I suppose we are ready to generate some random points, how about that. What I'm thinking is that we need to have a dynamic array of points. So let's introduce something like Samples, and we're going to have items, and this is going to be count, and this is going to be capacity. Essentially, this is a dynamic array. It feels like something from raylib; is it going to collide with raylib? No, it's not going to, I don't think so.

But anyway, we probably want to have a function to generate a bunch of points around a certain center: basically, generate cluster. We are going to provide the center, the specific center around which we want to generate a bunch of points. We're also going to specify the radius within which we want to generate all of that stuff, and also probably how many points we want to generate. And this is basically going to be the samples. That algorithm should just generate a bunch of points. So let's go ahead and do that. We're going to iterate up to count, so however many things we have to generate: this is going to be count, plus plus i. And I suppose what we need to do is basically generate a vector, because we are generating around a certain centroid. So essentially we have just a circle, and we're going to Monte Carlo that, like this. I just realized this is going to be
a pretty funny clip out of context. So yeah, essentially what we need to do is generate a random vector, with a random direction and a random magnitude from 0 to 1. As soon as we have that, we can pretty much scale it and move it to any center and any radius, if that makes any sense.

"Monte Carlo equals random?" Yes. So if you are talking to normies and you need to gatekeep all of the precious knowledge, instead of "random" you have to use words like "Monte Carlo". Another good gatekeeping word is "stochastic". To be fair, "stochastic" became a meme these days due to the rise in popularity of large language models, so a lot of normies learned the word because there was the "stochastic parrot" meme, and it kind of lost its gatekeeping quality. So I personally would recommend using "Monte Carlo" instead, if you know what I mean.

So how are we generating all that? We need to generate a random direction, which means we need to generate an angle from 0 to 2 pi. How can we generate a random thing from 0 to 2 pi? We can generate a thing from 0 to 1 and multiply it by 2 pi, and that will give us the random direction from 0 to 2 pi. But how can we generate stuff from 0 to 1 in C? We don't really have that; we only have a thing that generates integers. So the thing I like to do is have something like rand float, which basically generates a value from 0 to 1. I don't remember if I included that thing in nob. Oh, by the way, do you guys remember the changelog of raylib? In the changelog of raylib version 5.0 they said that they finally
introduced random number generators. Maybe the time has come to test it out. Okay, let's give it a try: random... GetRandomValue. It's integers. Flipping integers. Is that it? Really useless. Anyway, we're going to do our own one. I'm just joking, by the way, we could use this one, but then the resolution of the random number would depend on the distance between min and max, and I would like to have the maximum resolution.

So essentially what I like to do: if I take a look at the function rand, it has a very interesting property. It generates values from zero to RAND_MAX, inclusive by the way, which is exactly what we want. So you can just take rand and divide it by RAND_MAX, and of course before you do that you want to convert it to float, and that effectively gives you a value from 0 to 1 with the maximum resolution. So, rand float, and we're going to multiply it by 2 pi. Do we have pi in raylib, does anybody know? Yeah, we do have PI, would you look at that. All right, so this is a random angle. And we want to have a random magnitude: rand float. There we go, we just generated a random point within a circle with the center at (0, 0) and with radius one. And that allows us to scale and offset this entire thing however we want.

So how are we going to be scaling and offsetting this entire stuff? Let's create a Vector2 sample. This is going to be x, and we're going to say that it is the center plus the cosine of that specific angle, multiplied by the magnitude, but also multiplied by the radius.
the length one, right? So you want to shorten it to the magnitude that we generated from 0 to 1, and then you want to scale it again to the actual radius. And you want to do the same thing with y, if I'm not mistaken, but there we have to use sin, and that's about it. Then you can do nob_da_append to samples, so we are appending all of the samples in here, and theoretically we just generated a cluster. It's pretty cool, I guess. So let's generate a simple cluster: this is going to be Samples cluster, and then let's go ahead and call generate_cluster. Where are we going to have the center? I suppose at (0, 0), that makes sense. Is there any quick way to do that without the CLITERAL and such? I suppose there's no such way; maybe we can do it like that. What's going to be the radius? Let's say 10. How many points do we want to have? Maybe also 10. And then I'm going to provide the pointer to the cluster, and we need to start iterating through all of the points within that cluster. So I'm going to iterate up to cluster.count, I think it's count, and it is in fact count. In here we have a sample, and this is basically cluster.items[i], and we're just mapping it to the screen. One of the things I like to do when I'm iterating like this is to have a variable called it, so it's a little bit more readable. There we go. Hopefully we managed to generate some clusters. Okay, it doesn't compile because we don't
even include the standard library where rand is located, and we don't include the math library. We need to include math.h. And where is nob_da_append from? It's from nob, which means we also have to include nob.h, and of course we have to define NOB_IMPLEMENTATION, and there we go, we are ready. So we just generated a cluster. This is only 10 points, so we simply Monte Carlo'd that mother flipper. We can provide a bigger amount of points, maybe 100. What if we generate 100 of these points? There we go, this is 100, so that's pretty pog. We've got a data set at the center. You know what's funny is that we can do that several times at different centers and put everything into a single set. Maybe we should actually call this thing a set, I think that makes a little bit more sense. Okay, so what's going to be the second blob? Let's generate the mouse data set, how about that. The second blob is of course going to be smaller, maybe half as small, and its x is going to be half of min X, so I'm going to multiply by half like this, and in terms of y it's also going to be half of max Y, so halfway. And the right ear is going to use max X. Disney lawsuit incoming. This is kind of weird, though: why is it denser in here? In terms of the amount of points I actually want fewer points in the ears, because the ears are a little bit too dense, if you know what I mean.
The funniest thing is that they're kind of dense in the center. Is that expected, or did I mess something up with the random number generator? What's interesting is that I don't really change the seed, so maybe I need to set the seed according to the current time. That's one of the things we can do. So yeah, that's really weird. We can probably bind some sort of a button that regenerates those things over and over again: if IsKeyPressed(KEY_R), we reset the entire thing and basically regenerate all of that stuff. If I run this one more time, at least by the feel, maybe I'm just imagining it, they are denser toward the center. So there's definitely something. Maybe make the dots smaller, that's a good idea actually. Let's introduce SAMPLE_RADIUS, which is going to be 2.5f, and let's also introduce SAMPLE_COLOR, though we're going to have samples of different colors as we cluster them differently. Yeah, they do feel denser at the center a lot of the time, can't you see that? Maybe it's too small actually, so let's do something like 4. Rand is not uniform? Yeah, that's true, it is not particularly uniform. Maybe that's fine, maybe that's even better, because you can more clearly see where the centroids of the clusters are. That looks kind of cool, I really like that. All right, let me think. I already ran out of tea and I've already been streaming for 1 hour, so I suppose this is a good moment to make a small break and refill a cup of tea, and after the small break we're going to start clustering
these mother flippers, how about that? Sounds good, sounds Gucci. Let's go. All right, so let's go ahead and see what we can do in here. This is k-means clustering, which means we need to decide up front how many clusters, K, we're going to have. So I'm going to literally define K, and as of right now it is three: you can clearly see that there are three clusters in here. And now we probably want to preallocate an array of clusters into which we're going to be clustering things. So this is the Samples type, and this is an array of them called clusters, and there are K of them. It's zero initialized, and I also make it static. Might as well make these other things static too. Making them static means they're not going to be visible outside of the translation unit, which helps the compiler do more optimizations; for example, maybe it will be able to inline some of the small functions like rand_float. We can maybe do static inline to help it a little bit more, who knows. Saying that we're not going to use something outside of the current translation unit kind of helps the optimizer. Anyway, we're also going to have the means themselves, and this is going to be an array of not floats but rather vectors. So we generated the set, that's totally fine, and now we need to generate random means. How are we going to do that? I suppose we can just iterate i from 0 to K, ++i, and just pick a random point on the screen, a
random point on the screen. So means[i].x is going to be rand_float multiplied by max X minus min X, the whole range, plus min X, so we're generating within a particular range. This is basically lerp, isn't it? Maybe we should literally use something like lerp. Do we already have lerp in the standard library? Apparently we don't. I could have written something like lerp from min X to max X. For some reason I never feel the need to have a lerp function; I usually implement lerp directly, as I just did, because for my brain it's just easier to think that way. Raylib has Lerp, people say. Let's see... lies! Oh, it's in raymath, okay. I was almost about to ban you. So I guess that's fine, let's go ahead and start using it: Lerp from min X to max X by rand_float, and the same for y, from min Y to max Y. And there we go, we have random means. We have random means of production. We can render them in a similar way as we do with the set: we iterate up to K, and in here we have means[i], and I suppose we want a different radius, something like MEAN_RADIUS. I want it to be bigger, so you can clearly see that specific mean right away, and let's make it yellow or something. So the samples right now are going to be red, but the means around which we cluster are going to be yellow. Let me see: Lerp is not available because we have to include raymath, okay. And it's means, there are several of them. And MEAN_RADIUS is not defined, so let's go ahead and define it, and I might as well define it in terms of the sample radius, so that every time
I modify the sample radius, the mean radius also changes, and it's always going to be twice the sample radius, so we can always see it. There we go. So we generated three of them in really weird places, but maybe that's fine. Another interesting thing: I think when we press R we should regenerate not only the set but also the means, so this is going to be the button that resets literally everything. I think that's very good. As you can see, we're just generating those things in different places, so these are the initial means. All right, that's pretty cool. I would also like to have different colors for the means, because we're going to be visualizing the points depending on what cluster they belong to, if you know what I mean: if a point belongs to a particular cluster, it's going to be drawn in that cluster's color. So maybe we could introduce something like an array colors, with K colors in here, and we probably want to generate random colors. How are we going to be doing all of that? I want to just take the raylib colors and pick colors from that specific set. Okay, let me see how we can easily do that. I specifically avoided dark gray, gray and light gray, because I don't think having grays in there is that good of an idea. We might as well also avoid dark colors, since we want bright colors in here, so let's just remove them. The next thing I would like to do is get rid of all of these things, we don't need them, and we probably also want to get rid of these things.
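The random initial means discussed above can be sketched like this. It is a minimal sketch, assuming a hypothetical Vec2 type in place of raylib's Vector2 and a local lerpf helper that uses the same start + amount * (end - start) formula as a linear interpolation function; the bounds are passed in as parameters rather than read from globals.

```c
#include <assert.h>
#include <stdlib.h>

// Hypothetical stand-in for raylib's Vector2.
typedef struct { float x, y; } Vec2;

// Float in [0, 1].
static float rand_float(void)
{
    return (float)rand() / (float)RAND_MAX;
}

// Linear interpolation: amount == 0 gives start, amount == 1 gives end.
static float lerpf(float start, float end, float amount)
{
    return start + amount * (end - start);
}

// Place k means at random positions inside the given rectangle.
static void generate_means(Vec2 *means, size_t k,
                           float min_x, float max_x,
                           float min_y, float max_y)
{
    assert(min_x <= max_x && min_y <= max_y);
    for (size_t i = 0; i < k; ++i) {
        means[i].x = lerpf(min_x, max_x, rand_float());
        means[i].y = lerpf(min_y, max_y, rand_float());
    }
}
```

Calling generate_means again on a key press is enough to get a fresh starting state, which matters because k-means can settle into different clusterings from different initial means.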
Delete trailing whitespace, there we go. We can query-replace, actually query-replace with regular expressions, so that at the end of each line we put a comma. I don't know what happened here, I think I messed something up as usual, and there we go. So we have the raylib colors, so to speak: this is going to be a static Color array of raylib colors, and this is basically how many of them we have. When we are generating this kind of stuff we can just pick a thing from there, or in fact we can just use these colors directly. But then what if you define more clusters than you have colors? There are two options: we can forbid that by adding a static assert which says that K should be less than or equal to the array length of this thing, because if it is bigger than that, we don't have enough colors. Or we can simply wrap around, so if you have too many clusters we're going to reuse some of the colors. I think I'm going to go with the latter and simply wrap around. So how are we going to approach that? As you can see, here we use yellow, so instead we're going to use colors[i], and we have to wrap the index around with NOB_ARRAY_LEN of the colors. But that is too much to type, so I would like to wrap it into COLORS_COUNT or something. I'm not sure if it is shorter, if you know what I mean, but if we compare, COLORS_COUNT is shorter: it saves us nine characters, which is obviously something that we would like to take. And here we're just wrapping around.
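The wrap-around choice can be sketched as below. This is only an illustration: Rgba is a hypothetical stand-in for raylib's Color, the RGBA values are illustrative rather than copied from any header, and PALETTE_COUNT plays the role of the COLORS_COUNT shorthand mentioned above.

```c
#include <assert.h>
#include <stddef.h>

// Hypothetical stand-in for raylib's Color.
typedef struct { unsigned char r, g, b, a; } Rgba;

// Illustrative bright colors; the exact values are assumptions.
static const Rgba palette[] = {
    {255, 203,   0, 255},  // gold-like
    {255, 109, 194, 255},  // pink-like
    {190,  33,  55, 255},  // maroon-like
    {  0, 158,  47, 255},  // lime-like
    {102, 191, 255, 255},  // sky-blue-like
    {200, 122, 255, 255},  // purple-like
};
#define PALETTE_COUNT (sizeof(palette) / sizeof(palette[0]))

// Color for a cluster index; indices past the end of the palette wrap
// around, so clusters start sharing colors instead of reading out of bounds.
static Rgba cluster_color(size_t cluster_index)
{
    assert(PALETTE_COUNT > 0);  // guards against an emptied palette
    return palette[cluster_index % PALETTE_COUNT];
}
```

The other option mentioned, forbidding K larger than the palette, would be a compile-time check instead of the modulo; wrapping trades that strictness for never failing to build.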
Just wrapping around, and let's see how it's going to go. Something really weird is going on. I provided the colors, so did I not recompile or something? This one is yellow... I did recompile. Oh wait, they are different: this one is orange, this one is yellow, this one is gold. They look very similar, actually. These are three dots with different colors; I mean, technically, yeah, I can see that, but it actually made me think that I have a bug somewhere, because the colors look almost the same. Anyway, we can just remove some of them. Let's keep gold, because who needs yellow when you have gold? Now we're talking. Red and pink, that's pretty cool. Do we have any others, like lime for instance? Yeah, let's do green, sky blue, purple. I want to keep the colors with the fancy names, if you know what I mean. So out of yellow, gold and orange we're going to keep only gold. Red, pink... What is maroon, by the way? I don't really speak English, so I don't really know what maroon means. Oh, that's basically a brownish crimson color, uh-huh. So maybe we want to keep maroon instead of red. Then this one, this one, this one... beige, brown, I guess that's fine. They're different colors, I'm pretty sure they're different enough. Or maybe they're not really that vibrant, they're kind of dark. Yeah, I'm going to keep these. There are not that many colors, but that's fine. Okay, cool. Now we need to start clustering these things, and how are we going to be clustering them? Okay, so we
have to essentially iterate through each individual point and see which mean that point is closer to. Is that how we say that in English? Anyway, let's go ahead and iterate the set, so i goes from zero while it is less than set.count, ++i, and according to Wikipedia we have to minimize the squared distance to the mean. How do I interpret that? I suppose it's going to be more of a two-dimensional thing, so we'll need a nested loop in here: we're iterating each individual sample in the set, and for that sample we're iterating each individual cluster, so basically up to K. I can see why it is a very slow algorithm, because to recluster things you have to do stuff like that. I wonder if you can optimize it by not reclustering from scratch every time. One of the things we'll have to do before we recluster is iterate through each individual cluster and basically clean it up: we're first going to clear all of the clusters, and only then figure out which points go into which clusters. Then we take the point, Vector2 p = set.items[i], so here is the point, and here we have the mean, means[j], and now we need to find which cluster this point is the closest to, if I understand correctly. So let me see, do we have something like a vector subtract? We do in fact have Vector2Subtract. So we're subtracting the mean from the point, and we get that, and then we have a length, but we also have a length squared, Vector2LengthSqr, which is basically what it
is, so we can just do that, like so, and there we go, we got the value, and we are minimizing by that specific value. Now I need to keep track of the index of the closest mean, so maybe I'm going to have something like k. At the very beginning we're not going to have any cluster assigned to p, so that's kind of a problem, but maybe not: we essentially mark it as minus one, meaning it doesn't belong to anything. Then we need to keep track of the smallest value, let's call it s, and since we're minimizing it, we want to make it super big at the start. Is there something like FLT_MAX? Then when I have the current value, sm, and I do if sm is smaller than s, the very first comparison will trigger, and I can do s = sm and k = j. That way I figure out the cluster into which I want to push that specific point. So this is the k-th cluster, and we're going to do nob_da_append, taking the cluster by pointer of course, and I'm just pushing that specific point in there. Something like that. But I'm not sure if FLT_MAX is an actual thing. It probably isn't, or maybe I just need to include something; I can never remember the header that I have to include. Yeah, FLT_MAX and FLT_MIN are a thing. And what do I include? Do I just include limits? I think limits only contains integers. I think there was something like float, or numeric limits... Right, floating point limits are located within float.h; limits.
h contains integer limits, maybe. It's kind of bizarre to me, not going to lie, but I mean, it's C, what did you expect? It's just float.h. There we go. Okay, so at least it didn't crash or anything. I suppose now, when we're rendering the points, instead of rendering the set we should render the individual clusters, if you know what I mean. How are we going to be doing that? Within this loop over the clusters I'm going to iterate the specific cluster, so j goes while it is less than clusters[i].count: we iterate each cluster and then each point within the cluster. And in here, the variable it is going to be clusters[i].items[j]. We project it, we take SAMPLE_RADIUS, and we use the color of the k-th cluster, so we use i, actually. And since we're going to be using that for both the samples and the mean, I feel like it makes sense to factor it out into a separate variable, color, so it can be used within this loop and for the mean as well. Let's remove the variable it over here because the expression is not that big anyway; here we do need it, because this expression is pretty long. So effectively, by doing it like that, we're going to color samples according to the cluster they currently belong to, hopefully, if we did everything correctly. So these are several clusters, and here we also have two, and we got a segmentation fault. It's so cool, I like that, it's kind of funny. I wonder why, though. We do that stuff in here, clusters[i]... k... j... aha! Freaking classic. That's why you should program in Rust. Okay, so that's pretty cool, isn't it? I think it's fair to actually use this kind of algorithm, yeah, I think I
think it is. So now they are colored differently. Huh, wait, that's a bit bizarre. Oh yeah, I see: when I refresh, it doesn't recluster them. So I think we need to have separate operations: here we're generating a completely new set, and here we're doing the clustering. I feel like we need to factor these operations out into their own functions. Maybe the set must also be a static variable, since the clusters only kind of reuse the points from the set. But in here, when we are regenerating, we set the count to zero, and that is already equal to literally what we do down there. Well, not exactly equal, because there is additional padding here, which is why I couldn't find it, but it's the same operation. So we have a chunk of code that is repeated two times. How are we going to call this thing? generate_new_set... maybe new state. I think generate_new_state is a good name, because the state includes not only the set but also the means and so on. Okay, so we call generate_new_state, and every time you press R you also generate a new state. So we factored out that operation, and here we can add recluster_state, which simply updates the clusters. At the beginning we generate a new state, then we recluster the state, and then we recluster it every time we press R, so that should fix the problem. Not really, because it says that we don't use recluster_state... type defaults to int? What? Ah, okay, essentially I made a typo, and it's interpreted as a variable without any type, and that
basically means it has the default type int, because C is an old language. There we go. Okay, so now every time I refresh this entire thing, as you can see, we have a different state and a different clustering. All right, that's pretty cool. So now we need to implement the second step of k-means clustering. We implemented only the assignment step, which is only the first step: we just recluster everything. Now we need to do the update: for each cluster we need to find the sum of its points and divide it by the number of points. But how does that work if x is multi-dimensional? The end result here is going to be a vector... wait a second. Oh yeah, the mean is also a vector, I'm an idiot. Okay, sorry. The mean is also a vector, so that's fine. And we don't add to the mean, we actually make it equal to that. So I want to assign a special key to that; let's say it's going to be space. On space we're going to do update_means and then recluster. In update_means we're going to do exactly that: for each cluster, so we iterate the K clusters, we need to find the sum. This is going to be Vector2 s, the sum, and we can use Vector2Add to accumulate into it, so we're adding up everything within clusters[i]. That means we need a nested loop here as well: j, j less than clusters[i].count, ++j. Okay, caught myself, almost. All right, and after that we take s and divide it by clusters[i].count, and we do it like this because I think that's what it means, right?
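The two steps described so far, assignment and update, can be put together in one small sketch. It rests on a few assumptions that differ from the stream: Vec2 is a hypothetical stand-in for raylib's Vector2, each sample gets a label instead of being appended to a per-cluster dynamic array, MAX_K is an arbitrary illustrative cap, and a cluster that ends up with no samples simply keeps its old mean here (which is only one of several possible choices).

```c
#include <assert.h>
#include <float.h>
#include <stddef.h>

// Hypothetical stand-in for raylib's Vector2.
typedef struct { float x, y; } Vec2;

#define MAX_K 16  // illustrative upper bound on the number of clusters

// Assignment step for one point: index of the mean with the smallest
// squared distance. The running minimum starts at FLT_MAX (from float.h)
// so the first comparison always wins.
static size_t nearest_mean(Vec2 p, const Vec2 *means, size_t k)
{
    assert(k > 0);
    size_t best = 0;
    float best_d = FLT_MAX;
    for (size_t j = 0; j < k; ++j) {
        float dx = p.x - means[j].x;
        float dy = p.y - means[j].y;
        float d  = dx * dx + dy * dy;
        if (d < best_d) { best_d = d; best = j; }
    }
    return best;
}

// One k-means iteration: label every sample with its nearest mean, then move
// each mean to the average of its samples. Returns the number of empty
// clusters, whose means are left where they were.
static size_t kmeans_step(const Vec2 *samples, size_t n,
                          Vec2 *means, size_t k, size_t *labels)
{
    assert(k > 0 && k <= MAX_K);
    Vec2   sum[MAX_K]   = {0};
    size_t count[MAX_K] = {0};

    for (size_t i = 0; i < n; ++i) {
        size_t j = nearest_mean(samples[i], means, k);
        labels[i] = j;
        sum[j].x += samples[i].x;
        sum[j].y += samples[i].y;
        count[j] += 1;
    }

    size_t empty = 0;
    for (size_t j = 0; j < k; ++j) {
        if (count[j] == 0) { empty += 1; continue; }  // avoid dividing by zero
        means[j].x = sum[j].x / (float)count[j];
        means[j].y = sum[j].y / (float)count[j];
    }
    return empty;
}
```

Calling kmeans_step repeatedly until the labels stop changing is the whole algorithm; the returned count tells the caller whether any mean needs special treatment.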
So this is the size of the cluster, the cardinality of that specific set. And because of that, count must not be zero, so one of the things we probably want to do in here is check that it is greater than zero. That's kind of important, so we're going to assert that as well. Once we've done that, we want to reassign means[i] to s. Maybe we can actually inline s everywhere: here we can start with Vector2Zero, and then instead of s... well, that's kind of difficult, so we have to do it like this. It's actually means, because there are several of them, so means[i], and here we're also going to have means[i]. There we go, and that means we don't need s anymore: we're recomputing the means directly. Okay, interesting... wait, wait. We can have a situation when the cluster is actually empty; we even had that at some point, I remember. So we need to be able to handle that. One obvious way to handle it is to update the mean only when the cluster is not empty, but that means that mean is never going to be updated. Maybe we should do something else in case of an empty cluster. What can we do? We can regenerate the mean, that's actually a very good strategy: if you ended up with an empty cluster, you might as well put that mean in a different place. How can we do that? The same way as when we are generating a new state. Let me think. What do you guys think, is that a common practice? Because this is a very weird situation, and what can you even do if you ended up with a cluster which is empty? Okay, so this is clusters, and yeah, I forgot to update all of these
things, let's quickly do that. It doesn't compile, but it's not that big of a deal. For example, here we ended up with an empty cluster, and it actually regenerated the red mean. We can keep updating it, and as you can see... okay, that actually converges very quickly, surprisingly. It's kind of cool. We do it again, and eventually it converges. This is so cool. So yeah, this is k-means clustering; apparently it's a very simple algorithm, I didn't expect that. What if you have fewer clusters, let's say two? If I want to split it in two, how is it going to split it? It's not a bad way to split it, actually, look at that. We can try that again: another interesting way to split it, so we have the head and the ears. That's a really weird way to split, but sure, you can do that. And now if you have, for instance, four clusters, how would it split four clusters? So now we have four, and that's another way to do that. Depending on the initial state and the number of clusters, it can split them differently. Okay, what about 10 clusters? That's a lot of clusters, but we'll see. Okay, here we have 10 clusters, and it found something. We can maybe increase the number of actual blobs that we have. Here, for instance, I generate three blobs; what if I generate two additional blobs somewhere at the bottom, something like this? So you have four blobs now, though we said that we have three Ks... yeah, and it actually split like that. But what if I say that we have five Ks? Okay, so it actually split them, you know, more or less
correctly. So, does anybody know interesting, simple data sets that we could use to, you know, k-cluster? What could we use? We could try to animate that, but... I know the Leaf data set, I think. Okay, let me see the Leaf data set. All right: the data set consists of a collection of shape and texture features extracted from digital images. Okay, so is that multi-dimensional, though? For further details on this data set and its attributes, please see the readme. Okay, let's download it. How long is it going to be downloading? Oh, it's pretty big. How big is it? So let's see what we can extract from there, and let me by the way put the link in the description. All right, so this is the leaf data. Let's go ahead; I hope it's not going to be a zip bomb, because I don't trust those people: they probably use Windows, and Windows people like to do zip... yeah, it is a zip bomb, I actually predicted that. Of course. The readme PDF, okay, data description: the present database comprises 40 different plant species. Table 1 details each plant's specific name and the number of leaf specimens available by species. Species numbered from 1 to 15 and from 22 to 36 exhibit simple leaves, and species numbered from 16 to 21 and from 37 to 40 have complex leaves. Okay. Each leaf specimen was photographed over a colored background using an Apple iPad 2 device. The RGB images have a resolution of blah blah blah, binary version, okay. It would be nice to have two-dimensional points, though. So, attributes: each leaf has attributes such as specimen number, eccentricity, aspect ratio... We can maybe cluster them by aspect ratio, that's one of the things we can do, or by uniformity, or by entropy, maximal indentation depth, isoperimetric factor,
right, so "consider any..." for X and Y such that... I don't know what that is. Uh-huh. All right, so I suppose aspect ratio just defines the shape of the leaf, some sort of shape of the leaf. Convexity... that's a pretty cool data set, actually. References: "Evaluation of features for leaf discrimination", "Development of a system for automatic plant species recognition". Holy, this is actually a cool data set for, for example, machine learning and classification and stuff like that. That would have been kind of cool. "Are you using..." I don't know what that is. You mean the PDF reader? The PDF reader that I use is MuPDF, that's the one: a lightweight PDF viewer written in portable C. It also has Vim key bindings, so yeah, it looks similar and uses Vim key bindings. All right, so it looks interesting, and we can try to parse these mother flippers. Is that everything? Oh, look, I really like that background color. Oh my God, that's a very nice pink, I love it. This is a nice background. Okay, and this one is black and white, I suppose. Ah, okay, so this is sort of like a contour of the... Man, every time I'm looking through these data sets I feel like an actual scientist, I feel good about myself. Look at that: we gathered information, we conducted experiments, we're analyzing experiments. We have the data, we have the technology. So where do we have the aspect ratio? Attributes... I suppose aspect ratio is the fourth attribute: one, two, three, four, and I suppose they are numbered from one, so that's basically aspect ratio. And if we want to cluster in two-dimensional space, we can just pick two of the attributes, just like
yeah, so here's the pair of attributes. Can we just cluster them? Actually, here's a very interesting one: we have a class. Class could be one axis and aspect ratio the other, and we see how they cluster relative to their species or something like that. I'm not a data scientist, so I'm not sure I can even interpret that in any meaningful way, but we can just try to do that. Why not? The first thing we need to do is parse this entire file. So, main.c. Here is the leaf folder. In main.c, the first thing we probably want to do is read the entire file. In nob we do have a function for that, read entire file. Yeah, there we go. The path is going to be leaf_path, a char pointer to the leaf CSV, and we need to save the content into its own string builder, which I'm going to call sb. If we couldn't read the file, we just return 1. The next thing we want to do is iterate over the entire thing by lines. So let's make a nob string view called content. Is there any way to construct a nob string view from parts? That's the thing we probably want to use: from parts. We take the string builder's items and the string builder's count, and that gives us the content. We want to start splitting by lines: we're going to parse the CSV in a very dumb way, by splitting by lines and then splitting by commas. If you are anxious about escaping and stuff like that: we don't have commas in any of the fields, and we don't have quotes anywhere. It's a very straightforward
like, this specific file doesn't really use any weirdness of the CSV format, so we can parse it in a very dumb way. We don't really need a special library to parse this specific file, and if so, why bother trying to find some sort of third-party dependency and whatnot, if we can just parse it directly? "Advent of Code 2023 parsing vibes": yeah, exactly. So we're going to be parsing while we have some content left. I suppose now we need to chop by delimiter: the string view is the content, and the delimiter is the newline. That should give us the line, a nob string view called line. Then we can do a nob log at info level, with SV_Fmt and SV_Arg, to print the line. Afterwards I want to exit right away; I don't want to continue execution, I just want to see how I'm parsing this entire thing. Okay, so we managed to split everything by lines: we read the entire file and split it by lines, which is pretty cool. Next we need to start splitting the line itself. Here I think we keep splitting for as long as there is something left in the line. This is the attribute: we chop from the line by the comma, and that gives us the attribute. Maybe we could print it, too: we provide the number of the attribute, which is going to be i, and then the value of the attribute. And we can separate the individual lines with some sort of a bar, something like this, so we can see each individual line. So here we go, here are the attributes, and
as you can see, we have 16 attributes, and here are the numbers. So which ones should we pick? We can pick class, I think that's a pretty interesting idea, and maybe smoothness, or maybe aspect ratio, or entropy. Stochastic convexity, elongation... I think I like elongation, actually. All right, so maybe we can have an enumeration, a typedef'd enum for the leaf attributes, and we can just enumerate all of them. Though there's no really easy way for me to copy-paste those names, unless I open the whole thing in Chromium. I think I could open it in Chromium. There we go. Then I should be able to select the entire list and copy-paste it in here. That was easier than I expected, honestly. So: class, specimen number, and so on, stochastic convexity. I'm just thinking how I'm going to approach all of that. Obviously I might as well leave the order as it is, because the first one is going to be zero and the rest are going to increment, so we don't really need the numbers, that's for sure. We probably want to add a comma at the end of each, and furthermore I want to capitalize all of them. I think that's the easiest way to do that. I wonder if I can select some of these things and say: if you encounter an actual space, replace that space with an underscore. That was easier than I expected. Okay, and then we can prefix each of them with LEAF. That's pretty cool. So here I can query-replace, boom, there we go. Easy peasy lemon squeezy. Can your Vim do that? "No, but my Vim can do..." Sorry. All right, so now, since we have the attributes, we can just basically
do the following thing: maybe even a switch on all that stuff. Switch on i: case LEAF_CLASS, this is one, and then LEAF_ENTROPY, and that's how we pick them. In the default case we just do nothing; we literally ignore the attribute. I think that's a good idea. In any case, what we have to do is convert the attribute that we got to a float. Funny enough, we can have a variable called class. Well, okay, let's call it class; this is C, so it's going to be fine. I still want to spell it with a Z, though, because the Emacs extension do be like that. This is the entropy. Essentially we're going to have some sort of attribute value that we convert from the attribute; I don't really know how yet, but that's going to be the case. Depending on the attribute number, we do class = value or entropy = value. That's basically the idea. Maybe even align that stuff a little bit differently. And if we want to capture more attributes, that's how we can do it. Though I think it would be even better if we just had a point in here, and instead of specific names we would say x and y. There we go. Then we can assign different attributes to x and y, which is pretty cool. The conversion is a TODO; we don't really know how to do that yet, but that's the basic idea: we can control which fields we want to use from that file. After that, what we're essentially doing is appending that point to the set.
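The parsing approach described up to this point (split the content by newlines, split each line by commas, and pick two columns by index as x and y) can be sketched in plain C. This is a minimal sketch, not the code from the stream: the stream uses the nob library, while `View`, `chop_by_delim`, `view_to_float`, and `parse_points` here are hypothetical stand-ins, and the `Leaf_Attr` values assume the 16-column order of the file with 0-based indices.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

// A sized string: pointer plus length, not NUL-terminated.
typedef struct { const char *data; size_t count; } View;
typedef struct { float x, y; } Point;

// Hypothetical 0-based column indices, assuming the file's column order.
typedef enum { LEAF_CLASS = 0, LEAF_ASPECT_RATIO = 3, LEAF_ENTROPY = 15 } Leaf_Attr;

// Cut everything up to the first `delim` off the front of `sv` and return it.
// The delimiter itself is consumed but not returned.
static View chop_by_delim(View *sv, char delim)
{
    size_t i = 0;
    while (i < sv->count && sv->data[i] != delim) i++;
    View head = { sv->data, i };
    if (i < sv->count) i++; // skip the delimiter
    sv->data += i;
    sv->count -= i;
    return head;
}

// strtof needs a NUL-terminated string, so copy the field into a small buffer.
static float view_to_float(View v)
{
    char buf[64];
    size_t n = v.count < sizeof(buf) - 1 ? v.count : sizeof(buf) - 1;
    memcpy(buf, v.data, n);
    buf[n] = '\0';
    return strtof(buf, NULL);
}

// Parse up to `cap` rows of a simple CSV (no quotes, no escaped commas).
// Column `x_attr` goes to Point.x, column `y_attr` to Point.y.
// Returns the number of points written to `out`.
static size_t parse_points(const char *csv, size_t len,
                           Leaf_Attr x_attr, Leaf_Attr y_attr,
                           Point *out, size_t cap)
{
    assert(csv != NULL && out != NULL);
    View content = { csv, len };
    size_t n = 0;
    while (content.count > 0 && n < cap) {
        View line = chop_by_delim(&content, '\n');
        if (line.count == 0) continue; // skip empty lines
        Point p = {0};
        for (int i = 0; line.count > 0; ++i) {
            View attr = chop_by_delim(&line, ',');
            float value = view_to_float(attr);
            if (i == (int)x_attr) p.x = value;
            if (i == (int)y_attr) p.y = value;
        }
        out[n++] = p;
    }
    return n;
}
```

Calling `parse_points(text, length, LEAF_ASPECT_RATIO, LEAF_ENTROPY, points, capacity)` would then pick aspect ratio as x and entropy as y; switching attributes only means passing different column indices.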
Right, so, da_append: this is the set, and we're just appending the point to it. That's how we're going to read the entire thing. But that is not enough, actually. The rendering assumes a pretty specific range, so maybe we should derive the range from the set that we've got, because here we have a hard-coded minus 20 to 20. Maybe we can handcraft it somehow, but it would be better to derive it automatically. So how can we parse floats? The problem is that we have sized strings, which means we cannot just use strtof or anything like it, because those expect a null-terminated string. Though in nob we have a pretty cool thing: nob temp sprintf, which is basically sprintf that allocates in a temporary buffer. We can use that: with SV_Fmt and SV_Arg of the attribute, it converts the string view to a null-terminated C string, which we can then pass to strtof. I suppose strtof accepts a second parameter; we can just put NULL in there, because we know it's going to parse everything successfully, so it doesn't really matter. That gives us the value. Then we can reset the temporary buffer, no big deal. How do we do that? Temp reset: that deallocates all of the allocations made by temp sprintf. Hopefully that will work; that's a pretty good way to do it. So we've got a bunch of points, which we can print, for instance: size_t i from 0 up to set.count, ++i, and then nob log, info, with %f %f. Might as well make them look like vectors, if you know what I mean: p.x and p.y. So we get the points. Now we need to go
through the compilation errors. Vector2, of course. And this one, p, oh yes, it's a Vector2 p taken from set.items at i. Boom, we got the points. Are they what they should be? Obviously class is parsed correctly, because we have the first class, then another one, and so on and so forth. The second one is entropy; we decided on entropy, and entropy in our case is the last column, so we can double check that it is correct. The last one is 1.1756. 1.1756, okay, so we parsed everything correctly, and we can even control which fields we assign to x and y by specific attributes. That's actually a pretty cool system, and all of that in pure C without third-party dependencies. We don't use a CSV library or anything like that; we can quite easily parse the CSV file, assign different columns to x and y, and then reuse that for K-means clustering. That's actually really cool. It is kind of surprising how much you can achieve with simple code. Not with C, but with simple code. This is not about C, this is about simplicity. I already made that mistake before, and people started to say, "Oh, C is such a nice language." This is not about C, this is about simplicity, okay? So, we need to figure out the min x and max x, min y and max y, so we can map everything correctly. How can we do that? We could do it as we gather the points. We could have float min_x, which starts at FLT_MAX, then max_x, which starts at FLT_MIN; it's a classical way of doing that. Then we copy-paste the whole thing and change it to y. And essentially, if p.x is smaller than min_x, that's the new min_x. If it is greater than max_x,
that's the new max, and we can repeat the whole thing for y as well. So we keep track of the ranges, and we only need to pass these ranges to the projection, where we project a sample to the screen. I suppose we can simply get rid of the constants there and accept all of these as parameters: min x, max x, min y, max y. As we encounter the old names, we just bring them to lowercase. Now every time we call the sample-to-screen function we have to provide all of them. We probably need to get rid of the float here; it's not needed. Another project call; there aren't that many calls in here, so we can easily do all of that. Okay, now I'm going to try to compile this entire thing. What do we have? When we generate a new state... okay, this one is interesting. Generating a new state kind of implies that we have the ranges, so it can accept float min_x and so on. Since we're reading the state from the file, we don't really need this function anymore. Maybe it would make sense to remove it, but I want to keep it: I want to have a mode for generating random data and also a mode for loading data from the file. I want to keep both; I think it makes sense. So let's go ahead. Can I just select this entire thing and... nah, I literally have to do it by hand. But maybe there is a function in Emacs for lowercasing. It would be kind of nice if we had a function that brings everything within the selected region to lowercase. If we take a look at the binding for M-l: downcase-word. So there should be something like downcase-region. Let me see. Downcase... "You have invoked the disabled command." Disabled command sounds
funny. Okay, it's disabled because new users often find it confusing. So let's type yes. Enable for future? No. Why would users find this feature confusing? Is it really that dangerous? I don't know. Here's the interesting thing: I've been using Emacs for more than 10 years, and I've occasionally encountered this warning about disabled commands, but I never legitimately needed one of them. I was only stumbling upon these commands because I accidentally pressed something. This is the first time ever in my entire time of using Emacs that I legitimately needed one of these confusing commands. You just witnessed a historical point in my life. I knew they existed, because Emacs kept telling me about these confusing commands, but that's the first time I ever needed one. That is amazing. Did I level up as an Emacs user now? "Where were you when Zozin downcased his region?" Yeah. "You must be confused about it, Emacs was right." Okay, so that's what makes it confusing. Now I know. Let's continue. But now I want to use it a second time; this is the second time I need a confusing command. Oh, I can just do C-x C-l. Wait a second, I can just do C-x C-l. So logically it has to be C-x C-u or something if I want to bring to uppercase. Yeah, and that was also a dangerous command. Whatever. Anyway, let's continue recompiling this entire stuff. Update the means. Oh, I see what's going on here: updating the means still needs the min/max values, because in case you end up with an empty cluster you need to reshuffle the means. That makes a lot of sense. Maybe I should group this entire thing into a structure, but maybe not. Copy-pasting this stuff around is not that big of
a deal, honestly. Sure. Oh, this is because I query-replaced x with y, so "max" turned into "may". That's funny. All right, so this is max. And every time we call this we have to pass min x and the rest. Recluster state... I don't know why I thought that reclustering the state needs this stuff. I'm an idiot; it doesn't need it. Okay, it seems to be compiling, that is nice. So we generate a new state here, and since that regenerates the state, we probably don't want to do that. I'm going to disable it for now, because it literally generates a new data set, but we're reading the data set from the file. Generate cluster: "defined but never used". Yeah, that is totally true. I'm a little bit scared, because that should be basically it: we parse everything, we assign everything, and then we build the state from all of that. Okay, cool. We could also add a little bit of padding to min x and max x. Let me show: min x minus some padding, max x plus padding, and of course query-replace x with y, and don't forget about "may". It's freaking hard. Let's say the padding is going to be around four or something. It depends on the scale, though; it really depends on the scale. One of the things we can do is multiply all of them by two; I think that's the easiest way to put it. Or, you know what, multiply them by 10%: just add a margin of 10% to them, a little bit of space around them. I
think that's a good way to do it. Though, interestingly, that will only work if min is negative, and it's not necessarily negative. Actually, let's not do that; it's kind of a dumb thing. I'd need to think about how to extend the range properly: we'd have to work on the level of vectors, from the center, and extend the vector outward. I don't want to go into that, it's too much. All right, everything seems to be compiling, so let's try to run K-means. Oh, this is interesting. X is class and Y is entropy, and they form these sort of streaks, which is kind of expected. We can start clustering them. That formed very specific clusters, which is actually very interesting. So this is entropy. I want to take a look at entropy and aspect ratio: let's say aspect ratio is X and entropy is Y. How are they going to look? That's very cool. So X is aspect ratio and Y is entropy. Huh. There are clear clusters. I literally have no idea what any of these attributes mean, but I really like the fact that when you dissect the data along different planes, there are clear clusters within it. "Most leaves are round"? I guess, maybe, but I don't know the interpretation of these attributes. Does that mean most of them are round? If this is the aspect ratio, the smaller the aspect ratio is, the rounder the leaf is. I would expect them to be closer to... ah, here's the thing: we don't really know the values here, because I picked a very dumb way of representing the axes. Is that zero, or is that one? "No axis labels": yeah, exactly. That makes it difficult. But we can print those values.
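The range tracking and padding discussed above can be sketched as follows. This is a hedged illustration rather than the stream's code: `Bounds`, `compute_bounds`, `pad_bounds`, and `project_to_screen` are hypothetical names. Two choices differ from what was said on stream and are assumptions of this sketch: the maximum starts at `-FLT_MAX` (in C, `FLT_MIN` is the smallest positive normalized float, so it only works as a starting maximum when the data is positive), and the margin is a fraction of the range (max minus min), which behaves the same whether the minimum is negative or not.

```c
#include <assert.h>
#include <float.h>
#include <stddef.h>

typedef struct { float x, y; } Point;
typedef struct { float min_x, max_x, min_y, max_y; } Bounds;

// Track the bounding box of all samples in one pass.
static Bounds compute_bounds(const Point *pts, size_t count)
{
    assert(pts != NULL && count > 0);
    Bounds b = { FLT_MAX, -FLT_MAX, FLT_MAX, -FLT_MAX };
    for (size_t i = 0; i < count; ++i) {
        if (pts[i].x < b.min_x) b.min_x = pts[i].x;
        if (pts[i].x > b.max_x) b.max_x = pts[i].x;
        if (pts[i].y < b.min_y) b.min_y = pts[i].y;
        if (pts[i].y > b.max_y) b.max_y = pts[i].y;
    }
    return b;
}

// Grow the box on every side by `fraction` of its width/height.
// Based on the range, so the sign of min/max does not matter.
static Bounds pad_bounds(Bounds b, float fraction)
{
    float pad_x = (b.max_x - b.min_x) * fraction;
    float pad_y = (b.max_y - b.min_y) * fraction;
    b.min_x -= pad_x; b.max_x += pad_x;
    b.min_y -= pad_y; b.max_y += pad_y;
    return b;
}

// Map v from [lo, hi] to [0, 1]; a degenerate range maps to the middle.
static float normalize(float v, float lo, float hi)
{
    return hi > lo ? (v - lo) / (hi - lo) : 0.5f;
}

// Project a sample into a width x height screen with y growing downward.
static Point project_to_screen(Point p, Bounds b, float width, float height)
{
    float nx = normalize(p.x, b.min_x, b.max_x);
    float ny = normalize(p.y, b.min_y, b.max_y);
    Point s = { nx * width, height - ny * height };
    return s;
}
```

With a 10% margin this would be used as `pad_bounds(compute_bounds(points, count), 0.1f)` before projecting; printing the resulting `Bounds` also gives the axis extents that the plot itself does not label.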
So we can actually print them: nob log, info, something like this. This is X: min x, max x; and this is Y, likewise. We should get something like that. Okay, X starts at one. An aspect ratio of one means the leaves are more squarish or roundish, and that tells us the majority of the leaves are roundish or squarish. We can clearly see that. But there is a very specific cluster here of different leaves with generally higher entropy. Hm, that is very interesting. What is the entropy, though? There are some definitions somewhere: elongation, solidity, stochastic convexity, lobedness, entropy: a measure of intensity of randomness. Huh. We can take a look at smoothness: we're going to keep the aspect ratio but use smoothness for Y. All right, so the majority of them are just like that, and there is this cluster of maybe longer leaves: the bigger the aspect ratio, the more elongated they are. But we have an attribute that measures elongation, which means we can take a look at that: aspect ratio and elongation. The bigger the aspect ratio, the more elongated they are. Yeah, it makes sense. It's not really about clustering, but it's so cool that you can see that. It probably depends on the definitions of both aspect ratio and elongation: the definitions of these two parameters are related to each other, and that's why they form a clear function. So it's more about the definitions, but that is kind of cool. We can already extract different knowledge about leaves. Isoperimetric factor, stochastic convexity,
solidity... elongation, something about the maximum escape. So elongation and class, for instance; what about that? Yeah, I think using class as an axis is useless, because class is a discrete thing, not a continuous value, so these clusters don't make much sense if you have class as the x axis. Though maybe within a single class clusters do make sense. Oh yeah, so we did a little bit of data mining today, I suppose. Wasn't it cool? I think it was pretty cool. And the coolest thing is that all of that could be done in C, without any Python or anything like that. All right, so we learned how to do K-means clustering, which is going to be the first step in me trying that legendary paper, "Less is More: Parameter-Free Text Classification with Gzip". The reason I studied all of that is that I want to try this: classification of documents by K-means clustering on the gzip of the documents. But we're not going to do that today; we're going to try that next time. I'm going to read the paper, research a little bit more, maybe gather the data, because we definitely need a set of documents that we want to cluster. That is going to be a separate stream. Now I understand how K-means clustering works, so I think I'm prepared for this paper. All right, that's it for today. Thanks to everyone who's watching right now, I really appreciate it. Have a good one, and I'll see you all at the next recreational programming session with Mr. Zozin. I love you.
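As a recap of the algorithm the session was built around, one K-means iteration (assign every sample to its nearest mean, then move each mean to the average of its samples) might look roughly like the sketch below. It is written under assumptions and is not the stream's code: the function names and the `Bounds` struct are hypothetical, and re-seeding an empty cluster at a random position inside the data range is one reading of the "reshuffle the means" remark about why updating the means needs the min/max values.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

typedef struct { float x, y; } Point;
typedef struct { float min_x, max_x, min_y, max_y; } Bounds;

// Squared Euclidean distance; enough for nearest-mean comparisons.
static float dist_sq(Point a, Point b)
{
    float dx = a.x - b.x, dy = a.y - b.y;
    return dx * dx + dy * dy;
}

static float rand_between(float lo, float hi)
{
    return lo + (float)rand() / (float)RAND_MAX * (hi - lo);
}

// Assignment step: labels[i] becomes the index of the mean nearest to samples[i].
static void assign_clusters(const Point *samples, size_t n,
                            const Point *means, size_t k, size_t *labels)
{
    assert(k > 0);
    for (size_t i = 0; i < n; ++i) {
        size_t best = 0;
        for (size_t j = 1; j < k; ++j) {
            if (dist_sq(samples[i], means[j]) < dist_sq(samples[i], means[best])) best = j;
        }
        labels[i] = best;
    }
}

// Update step: each mean moves to the average of its assigned samples.
// An empty cluster is re-seeded at a random point inside `b` (assumption).
static void update_means(const Point *samples, size_t n, const size_t *labels,
                         Point *means, size_t k, Bounds b)
{
    for (size_t j = 0; j < k; ++j) {
        float sum_x = 0.0f, sum_y = 0.0f;
        size_t count = 0;
        for (size_t i = 0; i < n; ++i) {
            if (labels[i] == j) {
                sum_x += samples[i].x;
                sum_y += samples[i].y;
                count++;
            }
        }
        if (count > 0) {
            means[j].x = sum_x / (float)count;
            means[j].y = sum_y / (float)count;
        } else {
            means[j].x = rand_between(b.min_x, b.max_x);
            means[j].y = rand_between(b.min_y, b.max_y);
        }
    }
}
```

Repeating the two steps until the labels stop changing gives the usual K-means loop; the number of clusters and the initial means are left to the caller.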
Info
Channel: Tsoding Daily
Views: 39,651
Id: kH-hqG34ylA
Length: 119min 50sec (7190 seconds)
Published: Mon Jan 01 2024