Let's Code an AI Search Engine with LLM Embeddings, Django, and pgvector

Captions
OK, so on a few of our projects we've been talking about building recommendation systems around large language models like ChatGPT, and I've been playing with one way of doing that: PostgreSQL with the pgvector extension, which unlocks vector database functionality inside of Postgres. I like that because I prefer not to add more tools to our tech stack if I can help it, so we can use Postgres as our normal Django database but, as you'll see, extend it with vector functionality. Let's dive into it.

I have a repository of my own that I'm calling "AI experiments," but let's get started with the bootstrapper. I have a terminal open, I'll put in our cookiecutter command for the bootstrapper, and let's call the project "vector-demonstration." And we're good to go — oh, I accidentally created it in the migrations folder; move it up. Now we can open our Django application; it follows our normal structure. One of the easiest ways I've found to get started with Postgres is Docker with Docker Compose, and the image you can use is called ankane/pgvector, so I'm going to drop that in instead of the default postgres image. I'm also going to change the port to 56432 so it doesn't clash with any other Postgres instances I have. It loads from the .env file, and we can grab our database name and password from there. I'll go ahead and create my local .env — oh, it looks like one was created for me by cookiecutter, which is great. The volumes persist the data between runs of the Docker image, so that's all fine, and we'll use the server and the client if we want to.

We'll get into the server code now. I'm just going to build in core, a normal app — you may want to create a new Django application — and I'll scroll down to the bottom of models.py in core. The first thing I want to do is decide what I'm going to use pgvector for. I have a dataset I found on Reddit with 137,000 job descriptions; some are in languages other than English. It's very nicely structured data — every job description is laid out like this, pretty clean HTML — so I think this was scraped, cleaned up, and then posted to Reddit, which is nice. A lot of it comes from indeed.com; I don't know what the terms of use are on this data, but we're not using it for anything professional.

What I want to do, then, is define a couple of things, starting with a JobDescription class. That suggestion there, by the way, is from ChatGPT, and I'm going to accept what it's telling me. I do like having the title and the description — those are two of the fields we get in our dataset. We have title, company, location, a link (which we don't necessarily need but could store), description, and skills, so I'm going to grab all of these. Company can be a CharField, location can also be a CharField, link I don't care much about, description we have, and then skills is actually listed out in a sort of JSON or even Pythonic notation — it's a list of skills, which is kind of interesting. We could read that in and parse it, but for now I might just save it as a text field to make the import easy. I don't want to order these by title, so I'll get rid of the Meta, and the string method is fine. And actually, there's one more field that I want: language.
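As a reference point, here is a minimal sketch of the JobDescription model as it's described at this point. Field names come from the narration; the max_length values and the plain models.Model base are my assumptions, and the language field gets added next.

# Sketch of the JobDescription model described above (lengths are assumptions).
from django.db import models

class JobDescription(models.Model):
    title = models.CharField(max_length=255)
    company = models.CharField(max_length=255)
    location = models.CharField(max_length=255)
    link = models.CharField(max_length=255, blank=True)
    description = models.TextField()
    # The dataset stores skills as a JSON/Python-style list; keeping it as
    # plain text for now makes the import easy.
    skills = models.TextField(blank=True)

    def __str__(self):
        return self.title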
Language can also be a CharField. This will be helpful to make sure we're only getting English-language job descriptions, and I'll show you how we'll determine the language — we're not given it in this dataset, but we can use a little AI to detect it.

Let's talk about approach. We'll start with the job descriptions, and the description field is really the field that contains the most content, so that's the field of interest. Now we have our basic JobDescription model, and we could load it into the database from our dataset — we'll write a loader in a minute. That will be pretty run-of-the-mill except for one thing, which is detecting language. Once we have the job descriptions in the database, we get to the interesting part, which is parsing the description into something we can use for search, and our approach for that will be embeddings. Embeddings are long lists of floating-point numbers that reflect the "location" of the semantic data contained in the description — its location in semantic space.

I'm going to define a model here called JobDescriptionChunk, which will store chunks of the description along with the embedding we generate using the language model. This can also extend the abstract base model. The chunk will have a foreign key to the job description — on_delete=CASCADE is great, related_name="chunks" is fine — and then a text field for the chunk. GitHub Copilot just generated this for me; I did not write this code previously. This one line doesn't make much sense, but I can live with that string method; it's probably as good as anything I'd write myself. chunk_type is made up — that's not something we need — but these two fields are a good starting point. What we want to add here is not just the chunk content, which will be a piece of the description, but also the embedding. We'll be cutting the description into pieces small enough to fit into the embedder, so we'll start with that.

What this allows us to do is come up with a query. Let's say we have a student who is interested in acting and has two years of experience in community theater. We can use that as a starting point: generate embeddings for that one sentence, compare them to all the embeddings in the database, find the job description chunks that are closest to the query, relate those back to the job descriptions they came from, and rank those job descriptions based on the similarity match. So we can basically find the job descriptions that are the closest match to our input query. We'll see how to generate those embeddings — basically you set up a model (we'll be using SentenceTransformer in a minute), you set up your query, and then you use the model to encode your query as an embedding. That comes out as, like I said before, a long list of floating-point numbers, and the length of that list depends on the model you're using; the query embedding ends up looking like a list with hundreds of members, if not thousands. We'll come back to that later, but for now I defined embedding as a text field — and that's not actually what we want. We actually want to use pgvector, so first we should make sure that pgvector is installed.
Something I haven't done here is actually activate our Pipfile environment. OK, I paused there momentarily; now pipenv is set up, so go into that shell. I'm also going to take note of this path to the virtual environment so I can set up my Python interpreter correctly in VS Code — copy it, paste it; it's the full path ending in bin/python. If I open my shell, it automatically sources the correct interpreter.

Now I want to install pgvector so I can use it with Django, so I do pipenv install pgvector, which installs the pgvector Python package. Again: you have to make sure the pgvector extension is installed and activated in Postgres, and you also have to make sure the pgvector Python package is installed on the Python side so you can use its features. That's an important detail.

Let's hide the terminal for a second — we're back in our Django code and we're going to define the embedding. Now that it's installed, what we want is from pgvector.django import VectorExtension — we don't actually need that in the models file, but we'll need it over in a migration file, which we'll create in a second — and then from pgvector.django import VectorField, which is what we actually need in our Python file here. So we define a VectorField, and as we define it we want to set the number of dimensions. I already know the model I want to use has 384 dimensions, meaning that's the number of floating-point values in the list for that vector. This is very important: all the vectors you save and compare have to have the same number of dimensions, and the model you use determines how many dimensions there are. If you use ChatGPT's embeddings, for example, there are something like 1,500 dimensions; the model we're going to use actually performs pretty close to ChatGPT's and stores its output in fewer dimensions, which is a good thing in terms of performance.

We want to create a new migration; I think we can just create it from scratch — a new file, 0002_enable_pgvector.py. I copied the initial migration just so we have the template, and we'll strip it down to the operations. It will depend on the initial migration, so "core", "0001_initial", just to keep things in order; this is not our initial migration, so we remove that flag and these dead dependencies up here. Now we can move over the import statement, and the one operation we'll run is VectorExtension, which makes sure the vector extension is set up. We're still using the pgvector image over here, which should have it enabled by default, but it never hurts to make sure your migration is set up so that if someone else tries to run this on a database that doesn't support pgvector, it will throw an error. That's our migration taken care of; remove this import here and fold some things down to get them out of the way.

Let's review what we have so far. We have our JobDescription class with all the data about our job descriptions, matching what we have in our dummy dataset — the is_active thing is here, and we can turn that off — and then we have our JobDescriptionChunk, which will represent chunks of each job description.
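Putting that together, the chunk model and the pgvector migration end up looking roughly like this sketch (file and class names are my guesses based on the narration; VectorField and VectorExtension come from the pgvector Python package):

# models.py — chunk model with a 384-dimension vector for all-MiniLM-L6-v2
from django.db import models
from pgvector.django import VectorField

class JobDescriptionChunk(models.Model):
    job_description = models.ForeignKey(
        "core.JobDescription", on_delete=models.CASCADE, related_name="chunks"
    )
    chunk = models.TextField()
    embedding = VectorField(dimensions=384)

# core/migrations/0002_enable_pgvector.py — make sure the extension is enabled
from django.db import migrations
from pgvector.django import VectorExtension

class Migration(migrations.Migration):
    dependencies = [("core", "0001_initial")]
    operations = [VectorExtension()]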
Those chunks will be embedded as vectors in the database for later querying. The next thing we have to do is go into our Django app, go into the server folder, and make migrations for our new models — that ran without a hitch. The other thing we want to do is get our database up and running so we can actually run these migrations and start storing data. I'm going to comment out most of the docker-compose file so we're only using the postgres service and not running the server, because I want to run the server from the command line so I can get a shell on it more easily. docker compose up -d — all right, that started our vector-demonstration Postgres database.

Then we'll try to connect to it. We prepare the psql command: the postgresql:// protocol, the name of our database user, that user's password, the host, the port, and finally the database name. We consult our auto-generated .env file and see the user is vector_demonstration, so we plug that in; the password is this big long password, so we plug that in; that will be at localhost, the port was 56432, and the database name is vector_demonstration_db. This tests that we can actually connect to that Docker instance — and it's upset because of the way this password is set up, so I'm going to simplify the password just so we don't have to worry about it. Get rid of the crazy password, go into vector-demonstration, run sudo docker compose down, remove the database, and then sudo docker compose up -d to start it in the background again. Copy-paste the connection command — and again, this is after running docker compose up -d — and now I have a database. I shouldn't see anything in it yet: if we describe the tables, there are no relations, which is fine; we don't expect any.

So we go into our server folder and run manage.py migrate, which should connect to our Docker instance. It did not — I think because we didn't set the port in our environment variables or in settings.py. When we come down here, we have a DB host but no port configuration, so by default it's going to look at 5432. It seems our bootstrapper doesn't account for this, so we'll add a port configuration; the default is 5432, which makes sense since we're using PostgreSQL, and then over in our .env file we add the port — that looks good. Then we need to make sure we force-quit and restart our shell to load the new environment variables. Now, if we run migrate, the migrations run.

If we open a new little terminal and connect to the database directly, we should see our tables, and we do — among them the job description and job description chunk tables. We can describe the job description — sorry, core_jobdescription — and we see the fields we created; those all look correct. Then we can describe the core_jobdescriptionchunk model, and we see the embedding column with the vector type, which means pgvector is working as expected. Great — we'll close that down.
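For reference, the port fix amounts to something like this in settings.py (the environment variable names are assumptions based on the bootstrapper's conventions; the docker-compose file maps host port 56432 to the container's 5432):

# settings.py — database connection, now including a PORT read from the environment
import os

DATABASES = {
    "default": {
        "ENGINE": "django.db.backends.postgresql",
        "NAME": os.environ.get("DB_NAME", "vector_demonstration_db"),
        "USER": os.environ.get("DB_USER", "vector_demonstration"),
        "PASSWORD": os.environ.get("DB_PASSWORD", ""),
        "HOST": os.environ.get("DB_HOST", "localhost"),
        "PORT": os.environ.get("DB_PORT", "5432"),
    }
}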
Now, on to the fun part: everything so far was just getting us set up to store these embeddings and job descriptions in the database in the first place. Next we want to load our job descriptions in from a file, and I already wrote this up because it's kind of trivial, so I'm going to rush through it a little. I have another directory where I wrote some functionality for loading the job descriptions. I was using psycopg — the PostgreSQL Python library — directly and not going through Django's ORM, so there's some raw SQL in here that we can circumvent, but the idea is still basically the same. We connect to our database (Django does that for us), we look at the data/jobs folder, which I have over here, and find all the CSVs in that directory — it only contains CSVs. Then there's a generator function: we open each CSV and return it as a file we can read, and each line of that file is a separate job description. The load_all_job_descriptions calling function calls get_job_csvs, goes file by file, row by row, and loads each job description into the database.

Let's port this over to Django quickly. I like putting things like this on the model itself, so we could make it, say, a classmethod and call it import_job_descriptions — that looks good; we don't need that second parameter. Then we bring over our helper function, get_job_csvs. It was written with a relative path, ../data/jobs, and if that's incorrect this will fail. Something Django does give us is the BASE_DIR setting in settings.py, which tells you the path to your project; BASE_DIR should be in the same place as settings.py, so from there we'd go up a couple of levels and then descend into data/jobs. We're already importing settings over here, so let's use settings.BASE_DIR, and it's always a good idea to import os in Python and use it to join your paths: data_dir = os.path.join(BASE_DIR, and then up a few directories). There's probably a better way to do this, but we'll stick with it. Then we define the glob pattern: glob.glob on the data directory plus *.csv. glob, by the way, is part of the Python standard library, and glob.glob gives us back something we can iterate over, one entry for each match of our pattern in that directory. We iterate over that, call open on each file, and yield the file — the yield statement is what makes this function a generator, so we can iterate over it as well.

Now we can take the loading functionality I already wrote and do for csv_file in get_job_csvs() — fix our indentation — and I have some nice print statements that tell you what's going on. We need to import time, because I'm keeping track of how long these imports take, and we're importing csv so we can read the CSVs; the CSV reader, by the way, is also something you iterate over, and it gives you the contents row by row. What I was doing before was setting up a list to receive all the data from each CSV file and then bulk-inserting the contents into the database. I'm going to reset this, so to speak, and write it from scratch for Django. One thing I didn't delete here is the loop over the rows in the CSV reader.
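A minimal sketch of that generator, assuming the CSVs live in a data/jobs directory a couple of levels above BASE_DIR:

import glob
import os

from django.conf import settings

def get_job_csvs():
    """Yield an open file handle for each CSV in the data/jobs directory."""
    # Assumption: the repo layout puts data/jobs two levels above BASE_DIR.
    data_dir = os.path.join(settings.BASE_DIR, "..", "..", "data", "jobs")
    for path in glob.glob(os.path.join(data_dir, "*.csv")):
        with open(path, newline="") as csv_file:
            yield csv_file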
For each row we try to detect the language of the description, and if there's an exception we set the language to blank. This is pretty key for knowing which language each job description is written in, so we can filter down to only English job descriptions later. I found that this dataset has French and Japanese and other languages mixed in, which I'm not interested in, and I think they'd probably also mess with the embeddings, because I don't think the model I'm using is multilingual like that. So I want to stick with English text and English embeddings.

I need to bring that in as well: pipenv install langdetect, I think it's called. This is a port of a Google library; it runs completely offline, and it is a little slow — it's going to slow down this whole import job — so we could potentially put an option on here to detect the language in flight, or add language detection after we import the basic jobs. To detect the language we import detect from langdetect, and I'm also importing the exception so we can catch any languages that can't be identified and just set those fields to blank. That also reminds me that language here needs blank=True, and we can update our migrations.

Now we need to save the instances to the database, so we'd do cls.objects.create — actually, what we can do is create a job_descriptions list, append instances of the class to that list, and then bulk_create at the end; ChatGPT is suggesting that for me. I'm surprised Copilot isn't smart enough to know what to do here, so we'll write it out: title equals row["title"] — and by the way, I'm using a DictReader, so the key names here match the row headers in our dataset: title, company, location, link, description, skills. That makes this really easy to write and to read — not for the AI, however, it would seem. Company, location, description (that's the key piece), skills — we'll just bring in the string — and then language, which we're detecting here and setting to this language variable.

After this block of code finishes executing we'll have a list of JobDescription instances, and when you have a list of model instances like that you can call bulk_create on it, which creates all of those instances in one database transaction. That's far faster and more efficient than calling .create() on each one as we go through the loop. I also like the idea of separating logic like this from input/output operations, but that's a different story.

This should work now. If we go over to our terminal — my shell_plus terminal is all kinds of messed up; there we go — we can do JobDescription.import_job_descriptions(). We can call it directly on the class because it's a classmethod, and it should be entirely self-contained, because it finds the CSVs itself and does all the iteration within the method. That seems to have failed: we didn't get any output and it ran way faster than it should have, which implies it failed to find the directory. Let's do a little testing — oh, OK, we actually didn't need to go up so many directories. Now we're getting a list of all the CSV paths when we preview it, so let's make sure we reflect that over here: we needed to go up two directories and then down into the data path.
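Here's a sketch of the import classmethod as described — csv.DictReader for the rows, langdetect for the language, and a single bulk_create per file. The exact CSV header names and the detect_language flag are assumptions based on the narration:

import csv
import time

from langdetect import detect
from langdetect.lang_detect_exception import LangDetectException

@classmethod
def import_job_descriptions(cls, detect_language=True):
    # get_job_csvs is the module-level generator sketched above.
    for csv_file in get_job_csvs():
        start = time.time()
        job_descriptions = []
        for row in csv.DictReader(csv_file):
            language = ""
            if detect_language:
                try:
                    language = detect(row["description"])
                except LangDetectException:
                    language = ""
            job_descriptions.append(
                cls(
                    title=row["title"],
                    company=row["company"],
                    location=row["location"],
                    link=row["link"],
                    description=row["description"],
                    skills=row["skills"],
                    language=language,
                )
            )
        # One INSERT per file instead of one query per row.
        cls.objects.bulk_create(job_descriptions)
        print(f"Imported {len(job_descriptions)} rows in {time.time() - start:.1f}s")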
Again, there's probably a better way to specify that path, but I don't want to think about it too much. We'll come back in here, run JobDescription.import_job_descriptions(), and let it run. This is going to take a little while, so I'll probably pause the video until it finishes, but we'll watch it for a couple of seconds just to see how it works. The reason it's thinking so long on the first CSV file is probably the language detection; this would go a lot faster without it. In fact, let's remove the language detection for a minute and run import_job_descriptions again — now it's about half a second to import each CSV. Actually, this is probably a better way to do it anyway: we can get the job descriptions imported and then start embedding and chunking them while we wait for the detector to detect the languages and save them to the database.

So we'll write another classmethod called detect_languages. Here I don't want to fetch all the job descriptions from the database in one go — that would be a major hit to memory. We could probably get away with it locally (I probably have plenty of memory on my current machine), but you'd never want to do that in production or on a small machine. A better way is to write another generator function, get_jobs, and page through the table: say a page_size of 1,000, so we do a thousand job descriptions at a time — that shouldn't be too crazy. Then we want the total number of pages, which is cls.objects.count() // page_size + 1. The integer division rounds down to the nearest whole number — order of operations means the division runs first and then we add one — so that looks good for the number of pages. This is a good suggestion from Copilot: for page in range(num_pages), and the slicing syntax in this part of the code actually adds a LIMIT clause to the SQL, so it works fine for our purposes. So we have the generator, and then for each job description from get_jobs we try to detect the language, update the job description, and save.

Let's test that: JobDescription.detect_languages() — ah, yes, what we want is for this to be the queryset, and then we yield each object in that queryset. Anyway, what this code does is fetch 1,000 job descriptions at a time, so it's not slamming the database with repeated fetch-one queries, and once it has the 1,000 job descriptions it spits each of them out one at a time using the generator syntax. So let's run JobDescription.detect_languages() — I didn't put any progress output on it, which may be annoying because we won't know how long it's likely to take, so I'll quickly add a print statement: "detecting language for job description…". We could do something almost like a progress bar: total_jds is the count, we increment a counter with each step, and print "JD number {count} of {total_jds}". That gives us something a little more descriptive to work with, and then we can run JobDescription.detect_languages().
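A sketch of the paging generator and the language-detection pass, following the narration (page size and the progress print are as described; detect and LangDetectException are the langdetect imports from above):

@classmethod
def get_jobs(cls, page_size=1000):
    """Yield job descriptions a page at a time instead of loading the whole table."""
    num_pages = cls.objects.count() // page_size + 1
    for page in range(num_pages):
        # Slicing the queryset adds LIMIT/OFFSET to the SQL, so each page is one query.
        for jd in cls.objects.all()[page * page_size:(page + 1) * page_size]:
            yield jd

@classmethod
def detect_languages(cls):
    total_jds = cls.objects.count()
    count = 0
    for jd in cls.get_jobs():
        count += 1
        print(f"Detecting language for JD {count} of {total_jds}")
        try:
            jd.language = detect(jd.description)
        except LangDetectException:
            jd.language = ""
        jd.save()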
Here we go: 137,000 job descriptions, and you can see the language detection is actually moving pretty quickly, so we'll let that continue to run while we move on to the next step. So now we've imported all of our job descriptions, and we're generating the languages for them and saving that down in a separate function over here. Let's move on to the logic of chunking and generating the embeddings.

I'll add another method to JobDescription and call it generate_embeddings. We can do this a couple of different ways: we can throw everything into one big function that generates embeddings for all of the job descriptions at once, or we can write it to generate embeddings for one job description at a time. For demonstration purposes I'll do one job description at a time, so let's make this a normal instance method that operates on self.

This is a good opportunity to talk about how we're going to generate the embeddings. I have a demonstration over here in an IPython notebook, and the simplest approach I've found is the sentence-transformers library and the SentenceTransformer class out of that library. I'm also bringing in AutoTokenizer for a different purpose, but that's all you need to generate embeddings locally and offline. We could also use ChatGPT — that would be faster in terms of how quickly it can generate embeddings, but slower in terms of the latency of making those API calls, and we might hit rate limits or it might be flaky — so I like the idea of generating these embeddings offline. To do that I'm using all-MiniLM-L6-v2, which has been benchmarked as having pretty high quality while being relatively small — a model on the order of a hundred megabytes. The model gets automatically downloaded when we first instantiate SentenceTransformer, so one piece of this code that is not necessarily production-ready is how we store the model alongside our code: we don't want to download the model every time we instantiate our code and generate embeddings. At the same time, we usually won't be generating embeddings on the fly; we'll generate them offline, and once you have an embedding it stays the same forever, so we don't necessarily need to fetch the model on every run of the code.

That's all you need to get started: once you have the model, you call model.encode to embed a chunk. In our use case we don't want to call model.encode on an entire job description — we need to split the job description into smaller chunks first, then generate embeddings for each chunk and store each chunk in the database with a foreign key pointing back at the job description it came from. We'll also store the chunk text itself so we can reference it and get a sense of what the language model is "thinking" when it says certain things are similar to each other. I'll go ahead and add the import now; we haven't installed the library yet, so we can do that — it's the PyPI module sentence-transformers. Meanwhile we're still detecting languages over there; we'll let that run. I'm going to go get some coffee. OK, our pipenv install finished there.
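The notebook setup boils down to a couple of lines; the model name is the one named in the video, and the example string is just a placeholder of mine:

from sentence_transformers import SentenceTransformer

# Downloads the model on first use, then loads it from the local cache afterwards.
model = SentenceTransformer("all-MiniLM-L6-v2")

embedding = model.encode("A short chunk of job description text.")
print(len(embedding))  # 384 floating-point values for this model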
Now we could run pipenv install sentence-transformers, but I think it already installed because I put it in the Pipfile, so we're good on that front. Let's open a new terminal and try it. We can try SentenceTransformer from the command line real quick: instantiate our model — copy, paste, there we go — and now let's encode something. Let's come up with a sentence: "This is a sentence about me." It doesn't really matter what the sentence is, as long as it's English and as long as it's under the token limit, which for this model is 512 tokens — I find that around 400 characters, or 200 or so words, is a good length for this particular model. We encode that sentence and get back a list of numbers that represents the model's understanding of where this particular sentence fits in its semantic space, and we can compare different sentences this way.

So let's come up with a query: "What is the capital of France?" Then let's define a couple of possible responses. We could say "The capital of France is Paris." Now, facts don't necessarily matter here, to be clear, but the semantics do. Even if we put in an incorrect fact — if we said "The capital of France is Brussels" — that would probably push the sentence a bit further away semantically from our query about the capital of France, but the LLM wouldn't know the factual difference between those things; it would just know that one is a better match than the other, and not necessarily why. That's part of what makes LLMs maybe not fully "conscious": they don't understand those conceptual differences; they're only looking at word frequencies and word likelihoods. For the second sentence we can say "The capital of the United States is Washington, D.C." Now we have two sentences to compare with our query.

How would we do that? We'd get embeddings for all three: the query embedding is model.encode(query), the sentence embeddings are model.encode(s) for s in sentences, and then we use some sort of similarity metric to compare them. Looking at the docs, I found a way to do this: we import the util module from sentence_transformers, and util has a cosine similarity function, cos_sim. Now we can compare our query embedding with our first sentence embedding, and we get back a tensor object. It contains a value, and that value is what's of interest: a number between zero and one, and the closer it is to one, the more similar the two things are. We see that "What is the capital of France?" is fairly similar to "The capital of France is Paris," which seems like a likely answer. If we compare against our second sentence we get a much lower result, meaning it's less similar, so the first sentence is a better fit — more similar to our query than the second sentence. And we could add a third sentence: sentences.append("My cat's name is Hobbes."), regenerate our embeddings, and compute the cosine similarity against index 2, the new sentence. We see that's not similar at all — it's even further away. We were talking about capitals; now we're talking about cats and the names of cats.
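Here is that comparison as a small, self-contained sketch; the exact scores will differ from the video, but the ordering should come out the same:

from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

query = "What is the capital of France?"
sentences = [
    "The capital of France is Paris.",
    "The capital of the United States is Washington DC.",
    "My cat's name is Hobbes.",
]

query_embedding = model.encode(query)
sentence_embeddings = [model.encode(s) for s in sentences]

for sentence, sentence_embedding in zip(sentences, sentence_embeddings):
    # cos_sim returns a 1x1 tensor; values closer to 1.0 mean more similar.
    score = util.cos_sim(query_embedding, sentence_embedding).item()
    print(f"{score:.3f}  {sentence}")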
Hopefully you can start to see how this is useful for storing chunks of a job description — or pieces of any sort of document — in embedding form, and then retrieving those pieces based on some query and the similarity of each chunk to the input query. That's what we're interested in here. The way we'll query our job descriptions database is by providing some information about the student, or what the student is interested in, then querying for chunks of job descriptions to see if we find chunks that match the student's interests, and then using the rankings of those chunks to rank the full job descriptions in order of relevance. We'll use the cosine similarity metric, but we could also use other distance metrics that pgvector provides.

OK, let's go ahead and do that — we need to generate our embeddings. I'm going to generate them for a bunch of job descriptions. This part takes a long time, and it depends on the GPU of your local machine or your cloud resource, so I'd recommend doing it offline. Depending on the size of your dataset, it could make sense to send this off to ChatGPT for processing, or to buy some cloud GPU time and parallelize the process. For now we'll just run it on my machine. I ran it yesterday and it took about 90 minutes to generate embeddings for all of the job descriptions, but that includes the ones that aren't English, so we can trim it down a little, and we also don't need to import the full dataset to get a demo going. I'll just do what I have time for — maybe let the embedding run and come back later.

To generate embeddings for a single job description, the procedure will be to first chunk the job description into sentences. This is actually more nuanced than you might think, especially considering the structure of the data we're working with. Let's take a look at it again: the data is in HTML format, so what do we do about these tags? We'll need to make a decision there. We also don't have very clear markers about where sentences begin and end, and there are a lot of situations where we're dealing with lists — here's a good example, an unordered list of bullet points. Should we consider each of these a sentence, or do we want to consider the relationship between elements in a list and maybe group some of them together, since these all fall under "work perks"? Let's say a student is really interested in a workplace with good benefits or good work perks: we may want to consider the work perks as a whole and not break them down into chunks that are too small. Then again, maybe we could make each work perk a separate chunk and embed that; then we'd end up with more chunk matches on the other end, and we could use that to weight the recommendation somehow. The only way to know which approach is better is to try them, put them against each other, and see which results we prefer. I won't go that deep in this demonstration; I'll just show you how to get to the point where you could start to experiment and iterate. The way I was doing this is by simply using another generator — I find generators very useful for this particular exercise — so I'll bring that code over and we can work on it.
First we start with the content of the job description, which is self.description, and then we want to start chunking it, so we'll write a generator called get_chunks. Let me bring over the code I had for that. As you can see, I call this naive chunking of job descriptions: we set a chunk size, which I default to 750 — I found 750 to be a good size — and that's 750 characters, importantly, not words. Then I had another function, strip_html_tags: in order to keep our inputs to the embedder as lean as possible, I'm removing all of the HTML noise around the words; we really only want and need the English words.

Now, that HTML structure may actually provide information that could be useful to the embedder — all of these large language models are trained not just on English but on code and things like that — so it's possible we're losing some structural information that the system could use to make inferences as it's doing the querying. But we'd have to try that to find out, and I'm not convinced we're losing enough to make it worth having to cut the job description into so many little chunks. It would definitely increase the computation time for generating these embeddings in the first place, and with all those HTML tags in the mix, a big portion of our chunks would end up being HTML and not actual text. We're going to query the system with English text, not HTML, so in my opinion the HTML doesn't add much information to our dataset and only detracts, because it consumes tokens and resources. So I'm stripping it out — that's the long and short of it. To do that I need to import the regex library and the html library, so I'll do that here; those are both part of the standard library.

So we have our strip_html_tags helper and the get_chunks helper, which takes in some content and chunks it into 750-character bits. Again, I'm saying this is naive because 750 characters may still be too much — it may end up giving us too many tokens, or too few; the variance in the number of tokens that eventually get embedded will be wide. We may also end up encoding some blank space, and I'm not going out of my way to strip that whitespace, but I am removing newlines and replacing them with spaces, because we don't really need the newline content. Again, this could go either way — maybe the newlines carry important information — but we'll be querying with simple sentences and paragraphs of text that won't have newlines, so I don't know that there's much informational benefit in retaining them. Oh, and by the way, all of our job descriptions now have languages attached to them — the detection just finished over there, which is great — and we can quickly count the job descriptions that are in English: there are 74,787 English job descriptions out of a dataset of 137,000.
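A sketch of the two helpers as described — strip the tags, drop newlines, then slice the text into fixed-size pieces. The regex here is a simple stand-in rather than the exact one from the video:

import html
import re

def strip_html_tags(content):
    """Remove HTML tags and unescape entities so only plain text reaches the embedder."""
    text = re.sub(r"<[^>]+>", " ", content)
    return html.unescape(text)

def get_chunks(content, chunk_size=750):
    """Naively yield chunk_size-character pieces of the cleaned-up content."""
    content = strip_html_tags(content).replace("\n", " ")
    while content:
        chunk = content[:chunk_size]
        content = content[chunk_size:]
        yield chunk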
So there are quite a few job descriptions in this dataset that were not in English, and filtering them out should also reduce our embedding computations. Back in the chunker: while we continue to have content, we grab a chunk by slicing the front of the content — we take the first 750 characters — and then we update content with everything that's left, so content gets shorter and shorter. That's why the while loop eventually ends: content shrinks, and the generator spits out one chunk at a time that we can work on.

Then down here we get the job description content and call get_chunks, and we can save ourselves some lines of code with a generator expression. It's sort of like a list comprehension, but instead of square brackets we use parentheses, and it gives us a generator we can iterate later on — we'll do that down below. To walk through it quickly: for each chunk — for c in get_chunks(...) — we get the chunk content, then we tokenize it, turning the content into individual tokens. The reason I want those individual tokens is to keep a count of how many tokens we have; we're not actually going to use it, I'm just keeping track of the token count, and that reminds me I wanted to add it to the model: token_count as an IntegerField, and we'll let it be nullable. For the tokenizer I was using transformers' AutoTokenizer — transformers is, I believe, installed as part of installing sentence-transformers — and then we get the embeddings. The get_chunk_embedding helper is simply calling model.encode, so really we just need to set our model up.

So that's really step one: set up the embedding model. Step two: chunk the job description. Step three: save the embeddings for each chunk. The model will be — I thought I had already brought this in; no, just the import — the SentenceTransformer model we defined before, and instead of the helper we can call model.encode directly. Then we also need to set up the tokenizer: AutoTokenizer.from_pretrained with all-MiniLM-L6-v2. Let's preview what this looks like: import AutoTokenizer, copy-paste it over here — it isn't happy. Oh, I remember: for this to work we need to prefix the model name with "sentence-transformers/", because through the transformers library it's sort of a dependency, whereas in the sentence-transformers library itself the model name is available at the top level; it's not at the top level of this other library. There we go.

Now we can preview what it looks like to tokenize a sentence: "This is my sentence. It contains a cheeseburger." You see that it breaks the sentence down into what it considers logical tokens. The reason I added "cheeseburger" is that I know it's a word made of multiple tokens — cheese, burger — and you'll also see that the punctuation is identified as separate tokens.
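The tokenizer preview looks roughly like this; note the "sentence-transformers/" prefix that's needed when loading the model through transformers directly:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("sentence-transformers/all-MiniLM-L6-v2")

tokens = tokenizer.tokenize("This is my sentence. It contains a cheeseburger.")
print(tokens)       # word pieces and punctuation come back as separate tokens
print(len(tokens))  # more tokens than words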
When these LLMs like ChatGPT run, they take in a string of tokens like this and predict the next most likely token — not necessarily the next most likely word, but the next most likely token. That could be a bit of punctuation, a symbol, or a piece of a word, and that's also how GPT can make words up: it can combine different pieces of words if that makes sense in terms of its statistical predictions, which is pretty cool. This is also a good example of how the number of words doesn't reflect the number of tokens: there are 11 tokens in a sentence that has eight words. So that's what the tokenizer is doing; here I'm just tokenizing each chunk of content, and later I plan to save the length of the tokenized content.

Now we want to save the embeddings, so we call our generator expression and iterate over it — that's what we're doing here — and then we expand the tuple we defined into variables: chunk_content, chunk_tokens, and chunk_embedding, and append that to the data, with self.id as the job description's ID. Actually, I'm going to do what I did before: we create a JobDescriptionChunk instance for each iteration and then bulk_create them at the end — we'll call the list jd_chunks. This runs for one job description: we set up our models (which could maybe happen outside this function — they could be globals), we get the chunks of the description (the generator here is what actually does the chunking), and then for each chunk we tokenize and encode it and instantiate a JobDescriptionChunk — I'm saying "chunk" an awful lot, it's getting funny — keeping track of the content of each chunk, the number of tokens, and the embedding we generated from our model, and bulk-creating them here. This should be a complete implementation; let's see if we made any mistakes and whether it works.

Back in the shell, we get a job description — let's just get the first one — and generate the embeddings. Oh, we need to define these as keyword arguments, of course: job_description will be the job description (in fact we can just pass self here), chunk will be the chunk content, token_count will be the token count, and embedding will be the chunk embedding. Restart the shell, get the first job description, and run jd.generate_embeddings(). Let it run — that was pretty quick. Now we can do jd.chunks.all() to see the related chunks. Aha — this is where relying too much on ChatGPT can get you in trouble: it put a made-up attribute into our string method. It had actually added that as a field before and I didn't catch it, and it also added it to our dunder-str method. So: JobDescription.objects.first(), then jd.chunks.all(), and we see the chunks there. How many do we have? Six. Now we can go through and look at the content: for chunk in jd.chunks.all(), print the chunk — and actually print a couple of newlines between each one so it's easier to see. There we go: these are the individual chunks out of the job description.
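Pulling that together, generate_embeddings comes out something like this sketch, with imports as in the earlier sketches (as noted, the model and tokenizer setup could be hoisted to module level so they aren't reloaded per job description):

def generate_embeddings(self):
    model = SentenceTransformer("all-MiniLM-L6-v2")
    tokenizer = AutoTokenizer.from_pretrained("sentence-transformers/all-MiniLM-L6-v2")

    # Generator expression: (content, tokens, embedding) for each chunk of the description.
    chunk_data = (
        (chunk, tokenizer.tokenize(chunk), model.encode(chunk))
        for chunk in get_chunks(self.description)
    )

    jd_chunks = [
        JobDescriptionChunk(
            job_description=self,
            chunk=chunk_content,
            token_count=len(chunk_tokens),
            embedding=chunk_embedding,
        )
        for chunk_content, chunk_tokens, chunk_embedding in chunk_data
    ]
    JobDescriptionChunk.objects.bulk_create(jd_chunks)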
We see some clutter at the top here — some CSS that got mixed in with the HTML — so it might be good to strip out that CSS too if we can, because the real content of this chunk ends up squished at the end, and this chunk isn't going to be very useful to us. We see another potential issue in that we're cutting chunks up by the number of characters and not cutting cleanly at the edges of words, so that could be an improvement we make. Again, tokens can correspond to pieces of words, and I've found that ChatGPT, for example, is pretty good at picking up on misspellings and partial words, but it could be improved. And as I was saying before, finding better, reliable logical separators in the job descriptions could be a good idea — whether that's splitting on periods, actually reading some of that HTML structure and chunking based on the list content, or even asking ChatGPT or another LLM to extract information into a structured format, such as extracting the salary and determining whether it's high or low for a particular job. You could get really intricate with how you do your chunking and your analysis of the dataset. I glossed over that part here and did it in a very basic way, but garbage in, garbage out, as they say: investing in your content preparation will potentially improve the quality and the sensibility of the recommendations you get out of it. I wanted to get us to a point where we had some data we could actually query, which is the interesting part.

The other thing we can look at for each chunk is the embedding. We have a chunk, and we can look at its embedding — again, not much to see, but we do see that we're storing 384 values, and now we can start to do some interesting things with this, like querying against it.

So how would I run a query? Well, first let's generate more embeddings from our job descriptions — say 300 of them — and let that run: for jd in JobDescription.objects.filter(language="en"), limited to 300, call jd.generate_embeddings(). I want some sort of progress indicator, so: count = 0, count += 1, print "generating embeddings for JD number {count}", then jd.generate_embeddings(). Let that go. OK, that finished generating embeddings for 300 job descriptions, which is great, but it occurred to me that it's probably iterating through these in the order they were imported, meaning most of them are probably accountants, which is not very interesting for our purposes. That took about two minutes to run, so you can imagine that for 75,000 job descriptions it's going to take quite a while to generate the embeddings — maybe an hour or so. What I'm going to do is change the query to order the job descriptions randomly. So first I'm going to reset: JobDescriptionChunk.objects.all().delete() to delete all of the embeddings we've generated so far, and then this time we'll order_by("?"), which should give us a random sampling from across the entire database.
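The quick shell loop for a random English-language sample looks roughly like this (app and model names as used in the video; the slice size is whatever you have time for):

from core.models import JobDescription, JobDescriptionChunk

# Start over, then embed a random sample of English-language postings.
JobDescriptionChunk.objects.all().delete()

count = 0
for jd in JobDescription.objects.filter(language="en").order_by("?")[:1000]:
    count += 1
    print(f"Generating embeddings for JD number {count}")
    jd.generate_embeddings()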
I'm going to bump this up to a thousand jobs, just because I think that will be more interesting to look at. Let's do that and let it run — oh boy, nope, that's not what we wanted. Let's reset the counter, count += 1, put the print line back in, and generate the embeddings; I think that's all we need. All right, let that go. We're about halfway done generating the embeddings, so I'll let that run on the side; I figured we could go ahead and get started on the querying side of this.

Just as a refresher, we're going to start with something like what a student is interested in. We could say the student would prefer a job in the arts; they have a background in — let's not complicate this, let's make it very straightforward — choir and theater. We could add more detail, like their major: let's say they majored in music and minored in theater, graduating year 2022, something like that. We can provide as much detail as we like, as long as it fits within our token limit. So we're going to start with a query like this, and what we want to get back is a list of job descriptions in descending order of relevance. Just like the content-preparation side, the querying and search side of this has a lot of nuance. I'm going to show a very basic way to do it — not necessarily the most performant way — but it gets us some results we can start to look at very quickly.

Let's write a search function — we can just call it search — and make it another classmethod, and move it up here; meanwhile, our embeddings have finished. We're going to be searching against chunks, and I put the search method here on JobDescription because that's where we've been putting all the methods, so it's kind of sensible, but just a reminder that we'll be searching against the chunks, not the job descriptions directly: JobDescriptionChunk.objects..., and so on. In order to query against our embeddings, we need to bring in a distance function from the pgvector Django library, so we'll bring in L2Distance — this is a bit different from cosine similarity — and we could also bring in CosineDistance; maybe it would be fun to compare the two and see what we get.

It's similar to querying on a normal field, but because we're doing comparisons using the distance function, and distance is not a literal field on the object, we need to annotate our queryset with the distance of interest. Let's use L2 distance. We use L2Distance in a similar way to other model expressions like Q or F: we wrap the field name — "embedding," which needs to match the field name on the JobDescriptionChunk model — and we supply it with a query vector, which for now can be the query we defined up here, vectorized. So we need to bring our model in the same way we did before, define the SentenceTransformer model, and then the query embedding — let's call it query_embedding because that's what we call it everywhere else — is model.encode(query), and we pass query_embedding in there. Now, first of all, this is going to query all of the job description chunks.
It grabs all of the chunks and annotates them with their distance to the query embedding. So let's start there: we can accept the search term as a parameter, do query = query or the default one we've programmed in, and then just return our annotated queryset. If we bring this into the shell now and do JobDescription.search() — this takes a moment because it's comparing against everything — we get "has no attribute default_alias." I think it means we're supposed to give L2Distance an alias, maybe something like this — well, yes, the manager doesn't have it because it's a classmethod — there we go. Let's call the result jd_chunk_results; the length of that should be all of the chunks in the database. Let's get the first one — .first() should work because it's a queryset — and then check out the distance, since we set it as an alias. No, we cannot; maybe we need to annotate it instead of aliasing it. So we run JobDescription.search() again, and after a moment we get some results back; jd_chunk_results = JobDescription.search(), look at the first result, and see if we can inspect its distance from our query. There it is — the distance comes in (and again, this is L2 distance in this case, not cosine distance) at 1.3. Do we know if that's good or bad? No. The only way to know is to sort our results, so let's do that: results.order_by("distance") — and we can do this because we're working on a queryset annotated with distance; that's what our search method already did for us.

That works, so then for chunk in jd_chunks — let's just take the top 20 — print chunk.distance and chunk.job_description.title, so we can see the titles being recommended to this person. Remember, this person would prefer a job in the arts; they have a background in choir and theater; they majored in music and minored in theater; and we threw in the graduating year just to see what that would come up with. We get results back, ordered from the lowest distance to the highest, and because this is a distance metric, low is good — it means a close match. If we flipped this to descending order we'd see larger distances and, in theory, less relevant results, and at a glance: UX researcher — that's kind of a creative field, maybe — but data analyst, data entry, business analyst, strategic sourcing, senior finance — those don't look like artistic jobs necessarily. So I have a good sense that this is working.

Now, what we've ended up with here is not job descriptions but just chunks of job descriptions. What we'd like is the top 10, 20, 30, 40 jobs — not by chunk, but by full, complete job description. We could try to figure out how to do that with the Django ORM — I thought about it, and it seems a little complicated — so what I'm going to do instead is coalesce the values in Python.
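The first cut of the search method — before any grouping by job description — is roughly this: encode the query, annotate every chunk with its L2 distance, and order ascending. The default query text is paraphrased from the video, and SentenceTransformer and JobDescriptionChunk are imported as in the earlier sketches:

from pgvector.django import L2Distance

@classmethod
def search(cls, query=None):
    query = query or (
        "The student would prefer a job in the arts. They have a background in "
        "choir and theater. They majored in music and minored in theater, "
        "graduating in 2022."
    )
    model = SentenceTransformer("all-MiniLM-L6-v2")
    query_embedding = model.encode(query)

    # distance isn't a real column, so annotate it onto every chunk and sort by it.
    return JobDescriptionChunk.objects.annotate(
        distance=L2Distance("embedding", query_embedding)
    ).order_by("distance")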
The way I'm going to do that is by looping over each chunk. If the job description ID is not already in the lookup dictionary, we add that ID as a key, and the value is a dictionary that contains the job description object — not just the ID, but the object — and starts a list of chunks with the current chunk. Otherwise, if the job description ID is already in our lookup dictionary, we append the current chunk to the existing entry. This lets us gather everything together. Actually, I'm going to go back to calling this "results," because it's ultimately what we want to return. Then we can iterate the dictionary — for k, v in results.items() — and update each entry with a score field, which will be the average of the distances of its chunks. Copilot strikes again and does a great job with it.

Now, the results we get back are going to be kind of weird — they'd be this dictionary — yes, we could keep the unique-JDs dictionary, but then let's make results a list and append the job description along with the computed score, instead of updating the dictionary. A less hacky way to do this is to have results be a list of tuples, just like it's suggesting here, and we'll put the score first; that way we're not modifying a Django object, and we have the score paired with each job description. It'll be interesting to see what this does to ordering — we may need to sort the results one last time before returning them — but this should now give us a list of job descriptions with their related score, rather than job description chunks. The chunks have effectively been discarded, though we could keep them around and make sure they get returned with the results too.

Let's see what that looks like: results = JobDescription.search() with our usual search query. This will take a while because it's computing against all job descriptions. Again, we may want to pre-filter what we're querying against, or create an index to optimize the query, or only get the nearest neighbors — it is possible to filter to the nearest neighbors as part of this distance calculation. I didn't demonstrate that, but if we simply add a slice here we could take the 100 nearest neighbors by distance, and that would be a lot more performant, because then it wouldn't necessarily be comparing against every single job description in the database, all 100,000 of them. But I like this, because now we have all of the distance comparisons, we can rank them in order, and we can see how our recommendation algorithm is performing.

Then for r in results — let's just take the top 25 — we can peek at those, and they are tuples. The chunks are making this hard to look at, so let's just print r[0], which is the score, and r[1], which should be the job description — we can grab the title out of that. There we go: we see the results we saw before, but now each row corresponds to a specific job description instead of a chunk, and the score we see is the average of the scores of all its chunks — an aggregated version of what we were looking at before. These are somewhat out of order now because of the averaging we did, but still mostly in order.
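The Python-side aggregation into per-job-description scores might look like this sketch, including the final sort that comes up next. I've split the chunk query out into a hypothetical search_chunks helper (the annotated queryset from the previous sketch) just to keep the two steps readable:

@classmethod
def search(cls, query=None):
    chunk_results = cls.search_chunks(query)  # hypothetical helper: the annotated chunk queryset

    # Group chunks under their job description, keeping the object and its chunks.
    unique_jds = {}
    for chunk in chunk_results:
        entry = unique_jds.setdefault(
            chunk.job_description_id,
            {"job_description": chunk.job_description, "chunks": []},
        )
        entry["chunks"].append(chunk)

    # Score each job description by the average distance of its chunks;
    # lower L2 distance means a better match, so sort ascending.
    results = [
        (
            sum(c.distance for c in entry["chunks"]) / len(entry["chunks"]),
            entry["job_description"],
            entry["chunks"],
        )
        for entry in unique_jds.values()
    ]
    return sorted(results, key=lambda r: r[0])

Used from the shell, results = JobDescription.search() followed by printing r[0] and r[1].title for the top results would reproduce the ranked listing described here.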
So we need to add one more sort, which we can do with sorted() — let's see if Copilot can help me out here — there we go. We sort the list of results by their first element, which is the score, and this is in ascending order, which is what we want because we're using L2 distance. Let's back out of this, come back in, results = JobDescription.search(), and for r in results print r[0] and r[1].title — oh, I listed them all out again; let's just take the top 40 — there we go. Now these are in order from the best match to the least-best match, and that's looking good.

So that's a basic walkthrough of how to set up Django models so that you can store vectors using pgvector, and then query against those vectors using the distance functions provided by the pgvector Django package. In this case we used L2 distance (which is Euclidean distance), but you could also use cosine distance or other metrics — import whichever function you prefer, whichever your experimentation shows delivers the best results. That's where the magic is really happening. Things got a little messy down here — I bet there's a better, easier way to do this that I'm not thinking of right now — but this part, at least, is all you really need to execute your query and get back the matching bits of data that you had embedded and stored in your database. So I hope this has been informative and helps you on your machine learning journey.
Info
Channel: ThinkNimble
Views: 3,327
Id: OPy4dLHdZng
Length: 71min 1sec (4261 seconds)
Published: Wed Oct 25 2023