How We're Building AI Search Engines using LLM Embeddings

Captions
Hey everyone. We have a lot of clients who are interested in seeing how we can use AI, specifically the large language models that are all the rage right now, to search custom data sets and custom documents. We have a lot of current clients, and probably future sales prospects, who are looking to solve this problem with large language models, if only to get that AI stamp on their company, but also because these models lend themselves very well to searching unstructured data. So I've put together a quick proof of concept in our stack, using Django and PostgreSQL with a plugin called pgvector. I think that's important for us to get started on in thinking about how we can do this technically, and I also like using the tools we already have rather than introducing new ones to our stack if we can help it. In this case, Django, Postgres, and pgvector work very well together and will be a good solution for a lot of our clients right now.

I built a very basic app using our bootstrapper. You can grab a sentence about a student, say, "This student is looking for a job teaching high school math," completely unstructured English, and search against my job descriptions database. These job descriptions, by the way, come from a public data set I found on Reddit. The first result is "Mathematics Teacher," and we get a match score, where lower is better, because lower means the result is closer to the sentence we put in rather than further away. As you scroll down, the results get further and further away, and we can look at each job description itself.

The way this works is with a concept called embeddings. What is an embedding? The classical example, sort of the "hello world" of machine learning, especially for embeddings, is to think of words as points on a graph. Take the words "queen," "king," "woman," and "man": the classic equation is king − man + woman = queen. Essentially, we're taking the meanings of words and translating them into locations in space. The graph has x/y coordinates, but the models can work in many more dimensions; in our case we're using a model that produces embeddings with 384 dimensions, so each one is 384 numbers long. Those numbers encode the meaning of the text in relation to other texts, so words with similar meanings end up with numbers that are closer together. You can mathematically compare these vectors, these long lists of numbers, and find out which are most similar to each other, and that's the basic premise of our search algorithm. The graph is a very simple example with only two dimensions and single words, but these models can take entire sentences and paragraphs and come up with sensible embeddings for those too. That's what embeddings attempt to do: use numbers to represent the meanings of words.

This is really interesting because it means that in this search we're not searching based on literal content. We're not searching for jobs that literally include the words "teaching high school math," although that happens to be what we got back; we're searching for concepts, for job descriptions whose concepts are a close match to the concept we searched for. That's a really important distinction from traditional search algorithms: it's not really searching for keywords, it's searching for words with similar meanings.
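This isn't the author's code, but a minimal sketch of the idea is easy to show, assuming the sentence-transformers library and the all-MiniLM-L6-v2 model he names later in the video:

```python
# Minimal sketch: comparing meanings via embeddings.
# Assumes `pip install sentence-transformers numpy`; the example strings are made up.
from sentence_transformers import SentenceTransformer
import numpy as np

model = SentenceTransformer("all-MiniLM-L6-v2")

query = "This student is looking for a job teaching high school math"
docs = [
    "Mathematics teacher at a public high school",
    "Line cook needed for a busy restaurant",
]

# encode() returns one 384-dimensional vector per input string
query_vec, *doc_vecs = model.encode([query] + docs)

for doc, vec in zip(docs, doc_vecs):
    # L2 distance between vectors: lower = closer in meaning
    distance = np.linalg.norm(query_vec - vec)
    print(f"{distance:.3f}  {doc}")
```

Run as-is, the teaching job should come back with a noticeably smaller distance than the cooking job, which is exactly the "lower is better" match score described above.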
That's why we can get back jobs that say "tutor": the word "tutor" is similar to "teacher," even though it's not a precise literal match. I think that's really interesting and really powerful.

Another thing we can do is go to our data set, where we have hundreds of students, each of whom has filled out a profile with their graduation date, their major, and the preferred job function they're looking for. This person is interested in analytics, research, and finance; here's their major, their job experience, the year they graduated, and then a bit about their skills. This person is advanced in Excel, a beginner in R and Python, and has some experience with SQL and Salesforce. What's really interesting is that we can take this very unstructured data, copy and paste it in exactly as it is out of the spreadsheet, without modifying it at all, and search against it. We get back bookkeeping jobs, data scientist (which matches the data analyst position they're looking for), administrative assistant; a completely different set of results than for the student we searched for before, and all related to analytics and finance, as we expected. And we didn't have to massage our data in any way to make it nice and pretty for the computer; the model can just make sense of it.

What it's doing is taking our query and generating an embedding from it, a list of numbers like the earlier example, except 384 elements long. I like to think of this as a multi-dimensional space, sort of like how your mind has all these neurons that are deeply interconnected with each other, which gives it a lot of dimensions to store data. In a sense, you can think of this embedding as the place in the model's "brain" where it stores information that is close to this information. It generates that embedding for our query, and since we already have embeddings generated for our results, it uses a similarity algorithm to find the job descriptions that are the closest match for the query, again by the meanings encoded in the words and not by literal keywords. It's super cool.

I'll switch over to VS Code now and show how I went about building this. The first part is that you need an instance of Postgres with the pgvector extension installed. What I did here is set up a Docker Compose file using the ankane/pgvector image and configure that. The next thing is a migration that enables the pgvector extension, which is as simple as importing the VectorExtension operation from pgvector.django and adding it to the migration's operations. This is the second migration, so it happens after the initial migration; it just needs to happen before you start adding vector fields to your database.
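The video doesn't show the file contents on screen here, but a Compose service along these lines would match the description; the service name, database name, and credentials are illustrative assumptions, while ankane/pgvector is the image named in the video:

```yaml
# docker-compose.yml (sketch): Postgres with the pgvector extension available
services:
  db:
    image: ankane/pgvector
    environment:
      POSTGRES_DB: app          # hypothetical database name
      POSTGRES_USER: app        # hypothetical user
      POSTGRES_PASSWORD: changeme
    ports:
      - "5432:5432"
```

The migration step is then a one-liner, using the real VectorExtension operation from the pgvector.django integration; the app label and file name below are hypothetical:

```python
# migrations/0002_enable_pgvector.py (sketch): enable the extension
# after the initial migration and before any VectorField is added.
from django.db import migrations
from pgvector.django import VectorExtension

class Migration(migrations.Migration):
    dependencies = [("jobs", "0001_initial")]  # "jobs" app label is illustrative
    operations = [VectorExtension()]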
Over in models.py, I've added a JobDescription model with title, company, location, description, and skills. All of these, by the way, come out of our data set, which is just a collection of CSV files grouped by job type. If we take a look at one, we can see those columns: title, company, location, link, description, and skills. What I'm really interested in here is the description field, and we see that the job descriptions are all HTML documents with some pretty clean markup on them, which is nice. Then I wrote a few methods: to import the job descriptions, to detect the language (since I'm only looking at English job descriptions), to generate the embeddings, and finally to search.

Then we have what I call job description chunks. We take the job descriptions and break them down into 750-character chunks. The goal is to not exceed the context window of the model we're using: there's a limit to the number of, not words exactly, the model calls them tokens, that it can process at one time, so we need to break these longer job descriptions into smaller chunks the model can process and produce an embedding vector for. Each chunk has a token count associated with it, and then we have the embedding with 384 dimensions. Here's where we're using a feature of pgvector's Django integration, the VectorField: we import it from pgvector.django and use it just like any other model field, defining the number of dimensions. We can then take a list of 384 numbers and store it in this field, and what that unlocks is the ability to search based on the similarity of different vectors: using this field enables those similarity search algorithms that find vectors close to each other.

Here's how we generate the embeddings. The main approach is to take those longer job descriptions, cut them into chunks (in my case, 750 characters each), and generate an embedding for each chunk. Later, when we query, we find all of the matching chunks, bundle them up, and return just the best-matching job descriptions; I think we're picking out the top 40 or 50. The first thing I do is strip the HTML tags out of the job descriptions. I didn't think they added much information for our purposes, and they take up a lot of tokens too, because each bracket ends up being its own token, so I keep only the text content of each job description.

Then the magic happens: we set up our model using the sentence-transformers library and its SentenceTransformer class, and that's all you need to do to get the model set up. I'm using the all-MiniLM-L6-v2 model because it's relatively small, about 400 megabytes (and it can actually be compressed to a smaller size than that), and I heard it's pretty good. The next line is from another library: we use the AutoTokenizer class to count the number of tokens. It's a sanity check, just to make sure that 750 characters is neither too large nor too small a chunk size, because this model is limited to 512 tokens on the input side. On the output side, it generates an embedding vector that's 384 numbers long, which is what we're going to store in the database.
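Pulling those pieces together, a sketch of the chunk model and the offline embedding pass might look like the following. The field names, app label, and chunk_text helper are illustrative rather than the author's exact code; VectorField, SentenceTransformer, and AutoTokenizer are the real APIs named in the video:

```python
from django.db import models
from pgvector.django import VectorField
from sentence_transformers import SentenceTransformer
from transformers import AutoTokenizer

class JobDescriptionChunk(models.Model):
    job_description = models.ForeignKey(
        "jobs.JobDescription",  # hypothetical app label
        on_delete=models.CASCADE,
        related_name="chunks",
    )
    text = models.TextField()
    token_count = models.IntegerField()
    embedding = VectorField(dimensions=384)  # all-MiniLM-L6-v2 output size

model = SentenceTransformer("all-MiniLM-L6-v2")  # ~512-token input limit
tokenizer = AutoTokenizer.from_pretrained("sentence-transformers/all-MiniLM-L6-v2")

def chunk_text(text, size=750):
    """Naive chunking as described in the video: fixed 750-character slices."""
    return [text[i:i + size] for i in range(0, len(text), size)]

def generate_embeddings(job_description):
    # Assumes HTML tags have already been stripped from .description
    for chunk in chunk_text(job_description.description):
        tokens = tokenizer.encode(chunk)   # sanity check against the 512-token limit
        embedding = model.encode(chunk)    # 384 floats encoding the chunk's meaning
        JobDescriptionChunk.objects.create(
            job_description=job_description,
            text=chunk,
            token_count=len(tokens),
            embedding=embedding,
        )
```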
The next line grabs all of the chunks, and for each of them we encode it: model.encode() is what actually generates the embedding for us. The tokenizer takes the chunk and breaks it into tokens, though all I'm interested in is the number of tokens, and finally we save everything down to the database as chunks. All of this is generated offline; we generate embeddings for every job description before we do any searching, so this needs to happen first. I had 137,000 job descriptions in this data set, and it took about 90 minutes to generate embeddings for all of them, done completely offline; I wasn't using something like ChatGPT to generate those embeddings. It can take a long time, depending on your hardware, whether you've parallelized the operation, and so on, so bear that in mind: it's probably best to generate embeddings offline.

Finally, there's a search function where we actually run the query. I hooked this up through Django REST Framework to the front end, which is a Vue.js app I quickly threw together, but the search is also where the magic happens. Once the embeddings are in the database, we once again instantiate our model and encode the query. This is an important step: I have a test query in here, but this is where we pass in the query from the front end. Encoding it gives us an embedding, 384 numbers, and then we use the ORM features provided by the pgvector VectorField to find matching job description chunks, based on those chunk embeddings and the query embedding we just generated, ordered by distance. The logic after that is probably not optimal, but what it does is bundle up all those chunks by job description, so that what we list on the front end is job descriptions and not all the individual chunks.
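A sketch of that search path, reusing the model and chunk model from the earlier snippets: L2Distance is a real pgvector.django query expression, while the candidate cutoff and the bundling logic are illustrative guesses at the "probably not optimal" grouping step described above.

```python
from pgvector.django import L2Distance
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")

def search(query_text, limit=50):
    query_embedding = model.encode(query_text)  # 384 numbers, same space as the chunks
    chunks = (
        JobDescriptionChunk.objects
        .select_related("job_description")
        .annotate(distance=L2Distance("embedding", query_embedding))
        .order_by("distance")[:200]  # candidate pool size is an assumption
    )
    # Bundle chunks by job description, keeping each description's best distance
    best = {}
    for chunk in chunks:
        jd = chunk.job_description
        if jd.pk not in best or chunk.distance < best[jd.pk][1]:
            best[jd.pk] = (jd, chunk.distance)
    ranked = sorted(best.values(), key=lambda pair: pair[1])[:limit]
    return [jd for jd, _ in ranked]
```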
This is where things get interesting with these chunks: we could also take those shorter snippets of content and run them through another language model, say ChatGPT, and ask it to summarize them. It could say, "we picked this job description because this matched, and this matched, and this matched," and I think that could be a really cool way for us to build an AI explainer of why certain job descriptions are considered a good match for a student and why others are not.

So that's a quick overview of how we can use our stack to search a custom data set. This has all sorts of applications. For example, for a doctor in a healthcare setting, we could pull in research from various sources and then summarize it in plain English for the doctor or the patient. We have a lot of clients interested in this search problem in other domains, for example learning content: they want to take things like the title of a calendar invite or the content of an email you wrote recently and then search for learning content that's relevant to what you're dealing with in your workplace or what you're emailing people about, so that knowledge hits you where you're at. There are so many examples of how this can be useful.

There's a lot of nuance in here, though, around exactly how you break content down into chunks. I did it in a really naive way, just grabbing 750 characters at a time, but there might be a smarter way that keeps cohesive concepts together, which is not something I did here. For example, a job description might have a perks-and-benefits section, and maybe we want to keep all of that content together in the same chunk (the sketch below shows one simple option).

So yeah, this has been a really basic, first-stab proof of concept at building search using LLM embeddings. I hope you learned something, and I hope you see the potential of this for your projects. If you're a coder, you now have the basics of how you could implement this yourself using Django and pgvector. Thanks for watching, and I look forward to seeing what you build.
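One simple way to keep cohesive concepts together, as suggested above, is to pack whole paragraphs into each chunk instead of cutting blindly every 750 characters. This is a hypothetical sketch, not what the video implements:

```python
def chunk_by_paragraph(text, max_chars=750):
    """Paragraph-aware chunking: fill each chunk with whole paragraphs up to
    max_chars, so sections like perks and benefits tend to stay together.
    (A single paragraph longer than max_chars is kept whole here for simplicity.)"""
    chunks, current = [], ""
    for para in text.split("\n\n"):
        if current and len(current) + len(para) + 2 > max_chars:
            chunks.append(current)
            current = para
        else:
            current = f"{current}\n\n{para}" if current else para
    if current:
        chunks.append(current)
    return chunks
```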
Info
Channel: ThinkNimble
Views: 15,578
Id: ZCPUmC37HLU
Length: 13min 58sec (838 seconds)
Published: Tue Sep 12 2023