Build A Simple Search Engine in Python

Video Statistics and Information

Captions
What is going on guys, welcome back! In this video today we're going to learn how you can build your own simple search engine in Python using language models, so let us get right into it.

All right, so we're going to learn how to easily build our own customized search engine in Python using language models. Just to clarify this right away: we're not going to build a search engine for the internet, like Google, Bing or DuckDuckGo. This is going to be a search engine for our own dataset. For example, you might have a dataset full of movies with titles, descriptions and so on, or of products, or of blog posts, and you want to search them with your own search engine. That is what we're going to learn in this video today, and it's a very easy process because we're going to use an external Python package called txtai. So we're going to run pip (or pip3) install txtai. In addition, we're also going to install numpy and pandas to work with our datasets, and we're going to build this into a web application, so we'll actually have a user interface around the search field; for this we're going to use streamlit. These are the packages we need for this video: txtai, numpy, pandas and streamlit.

Once you have all of this installed, let's start with the simple act of embedding the items into a vector store and then searching the items with a similarity search. I use two datasets in this video; you'll find the links in the description down below: the Amazon product dataset and the Seth Godin blog post dataset. You can of course use whatever data you want; I just wanted data with a lot of text, like product titles, descriptions or blog post content. So the first thing we do is import txtai.
We also import numpy as np and pandas as pd, and then we load the Amazon dataset, specifically the train.csv file that you'll find on Kaggle when you download the dataset (I've already extracted it). For this we just say df = pd.read_csv("train.csv"). This loads the dataset, and we can print it to see what it looks like. Once all the importing is done (the TensorFlow message here appears because we import txtai), pandas loads the CSV file and displays it. I can already tell you there's going to be a field TITLE which contains a lot of information about the product: the full Amazon title. If you want to explore the data yourself, you can print df.columns, and you'll see that there's a column TITLE in caps; this is the one we're going to use for our search. We're going to take the title of each individual item, which is not just two words but a description of what the product is actually about, embed it into a vector store, and then use that as our database. So you can see here, these are the titles of the products, and this is what we're going to embed. In order not to use the whole dataset, because it's quite large, we're going to limit our selection to 100,000 items. To keep this consistent I set a random seed, np.random.seed(1), so that when I draw a random sample it's always the same sample. So I say that titles is equal to the DataFrame with all the NaN values dropped.
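A quick aside on why the seed matters: seeding NumPy's global generator makes pandas' sample (which falls back on that generator when no random_state is given) return the same rows every run. A minimal sketch, where the toy DataFrame is just a stand-in for the real dataset:

```python
import numpy as np
import pandas as pd

# Toy stand-in for the Amazon dataset (the real one also has a TITLE column)
df = pd.DataFrame({"TITLE": [f"item {i}" for i in range(1000)]})

np.random.seed(1)                                   # fix the global seed
sample_a = df.dropna().sample(100)["TITLE"].values

np.random.seed(1)                                   # same seed again...
sample_b = df.dropna().sample(100)["TITLE"].values

print((sample_a == sample_b).all())                 # ...same 100 rows
```

Without the seed (or an explicit random_state=), every run of the script would index a different 100,000 titles, and the saved embeddings would no longer line up with the titles loaded later in the app.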
From the remaining rows we sample 100,000 and take just the titles as a list, so titles = df.dropna().sample(100000)["TITLE"].values. This is now 100,000 randomly selected titles from the dataset, but always the same ones, so every time I run the script I get the same selection. Now, to embed these titles into vector space, we say that embeddings is equal to txtai's Embeddings class, and we provide the path to a compatible Hugging Face transformer. You can look at the package documentation if you want to try some alternatives; the one we're going to use in this video is sentence-transformers/all-MiniLM-L6-v2 (I think there's also an L12 variant; maybe we'll try that later on). This is just the path to a Hugging Face transformer model: under this URL on Hugging Face you'll find the transformer, and we're going to use this model to embed our titles into vector space. Then we say embeddings.load("embeddings.tar.gz"), and now we can try a search prompt. Once the embeddings are done, I call embeddings.search with a search phrase, for example "protector for cam", and say that I want the top five results for this query. To get the actual text results we then do titles[x[0]] for x in result. Why do we do that? Because the result is a collection of indices: we don't get the text back, we get the index of the title that fits the query best, and then we look up the actual title by that index.
The index is stored in the first position of each result object, so each result carries the ID at index zero, and for this ID we get the title. These are the actual results, and we can print them. And that is actually the whole magic. This is not the application yet, but it is the process: we load, we embed, we search, and we get the results. This is going to take some time, so meanwhile we can discuss what we do next: building a Streamlit application. I have a video on this channel where I explain how to build Streamlit applications, but it's quite simple: you provide the UI elements and you say how they are connected.

Now, what's the problem here? Oh, of course, sorry: I must not load the embeddings here. We load the embeddings in the actual application; here we need to save the embeddings so that we can load them later on. Everything else stays the same. And before we save them, we also need to call embeddings.index(titles); that's the order. That was a silly mistake, so let me repeat it again: we load the dataset, we get the titles, we choose this embedding model, then we do the actual embedding on the titles, then we store the result, and then we can use it. I had basically skipped the most important step, where we actually embed the dataset.

But let's get back to Streamlit. In Streamlit we're just going to define the UI elements and say what happens when you press the search button or enter a search query: we make a request to the database, get the result, and display it in the UI. Basically, the functionality that I'm showing you
here is going to be done in Streamlit, but we're not going to embed all of this again; we're just going to load it, the way I actually did it before by mistake. While the indexing is running, I create a second Python file; let's call it main_streamlit.py. In here we say import streamlit as st, import numpy as np, import pandas as pd, and again import txtai the way we did before. Now we're going to define what we just did with the embeddings as a function, with the difference, of course, that we don't index and save but load. So I define a function load_data_and_embeddings, and in it we call np.random.seed(1) again so that we have the same dataset.

By the way, as you can see, the indexing takes some time down here, but there you go: we now have the search results. These are the indices of the best-fitting items. Whether they're good or not we'll examine in the actual web application, but you can see already that we have a screen guard and some camera lens protection, so it's definitely not something that doesn't fit at all; remember, the prompt was "protector for cam", so this makes sense.

So let's continue: we load the same dataset, and it's important that you set the same seed, because you want to load the exact same 100,000 samples. This only matters if you work with a sample rather than the full dataset, but in our case we used 100,000 samples, so the application has to use the same 100,000 samples. So pd.read_csv("train.csv"), then again titles = df.dropna().sample(100000)["TITLE"].values, and then again embeddings is going to be
equal to txtai's Embeddings, where it's important of course to provide the same path, sentence-transformers/all-MiniLM-L6-v2. And now we do what I did by mistake before: embeddings.load("embeddings.tar.gz"). So we load what we exported earlier, which means we have the same embeddings, the same dataset, the same titles, and all we return from this function is titles and embeddings.

Now, the Streamlit application itself is not complex at all. First we say titles, embeddings = st.cache_data(load_data_and_embeddings)(). Note that we don't call our own function here: we pass the function to st.cache_data, and then we call the result, which is a callable, so the data gets cached. Then we provide a title; let's call it "Amazon Item Search Engine". Then our query is going to be query = st.text_input("Enter a search query", ""), a text field with an empty default. Then we say if st.button("Search"):, which in Streamlit basically means "when the Search button is pressed". If there is a query, we set result = embeddings.search(query, 5) for the top five results, get the actual results again with titles[x[0]] for x in result, and for each of the actual results we write it below the search field. Otherwise we write "Please enter a query". So that is our application. Again, to just recap: we load the data and the embeddings in the same way as before, except that we load the embeddings instead of indexing and saving them.
Here we cache them, we define that there is a title, we define that there is a text input whose value is stored in query, and if the Search button is pressed and there is a query, we search the vector store, get the actual results, and write them into the user interface; otherwise we just say "Please enter a query".

So let's run this. We don't just run it the conventional way: we go to the actual directory and say streamlit run main_streamlit.py, and this opens up the web application. If everything went fine and I didn't make another mistake, we should be able to get results. You can see it's running the load_data_and_embeddings function, so it's now loading the dataset from disk and also loading the embeddings, basically the tar.gz file. This can take some time, but once it has loaded a single time we don't need to repeat the process; we run it once and then we can use our search engine. So let's search for "protector for cam", and you can see we get the titles. Of course, in this case we're just displaying the titles; it wouldn't be difficult to also display the URL, because all we'd have to do is go to the dataset, take the right indices, and fetch the URLs as well. That's quite straightforward, and I'm going to do it with the next dataset, the blog post dataset. As you can see, we have a camera lens protector, another protector, some back protector with camera lens protector, and so on. Maybe if I search "laptop gaming" (I'm not sure if I'll find something here, but let's see) we get a Lenovo IdeaPad gaming laptop and several more gaming laptops, and
it doesn't have to be the exact keyword. So if I say "laptop game", maybe I'll now get video games, but maybe I'll still get a gaming laptop even though the word "game" does not occur in it; and indeed, "laptop" occurs here but "game" does not. Here, for example, we have a mini laptop for kids for English study games, which again contains "game", and here we have a Game of Thrones laptop skin. So it's not perfect, but you can see it works, and depending on the model you use for the embeddings it will work better or worse.

Let's now change this to work with the other dataset. For this, let's first go back to our main file. The only thing we're going to change is that we're not going to use a sample, because this dataset is a bit smaller. I'm going to load the Seth data instead: this is just a collection of Seth Godin blog posts which I found on Kaggle. So we load the dataset, drop the NaN values, and then we can use all of it. What we're going to use here is the content, so content is going to be taken from the DataFrame, and again we can print the column names to see what is part of the dataset; I already know it, but you can just print df.columns. I'm going to take the content_plain values, because what I want to index is the actual contents of the blog posts, not the titles and not the URLs; but I can still access all the other fields when I get results. So content = df["content_plain"].values, I use the same embedding model, here I'm now indexing content, and we save to embeddings_seth.tar.gz. I'm not even going to run a search query here; I'm just going to save the index, which will hopefully work unless I missed something. But what we also have to do now is adjust our Streamlit application, because our load function is going to be
different. Of course, we don't need a seed anymore; we just drop the NaN values. And we also don't actually want to display the contents of the blog posts, so we don't really care here about the thing we indexed: the indexing and searching are separate from the displaying, so we don't even need to load the content in the application. We just load the DataFrame with the NaN values dropped, and here we say titles = df["title"].values and urls = df["url"].values. These are the things I want to display: I want to search based on the content, but what I want to display as a result is titles and URLs. Then I use the same embeddings model, then I load embeddings_seth.tar.gz, I do the same thing as before, and I can change the title to "Seth Blog Post Search Engine". Now everything stays the same; the only difference is that I don't write the raw result, but a formatted string with the title, titles[x[0]], and the URL, urls[x[0]], because remember, x[0] is just the index of the item we need. And of course we need to return urls here as well, otherwise it's not going to work.

So now we get a different result. Let's go back to our terminal, rerun this, and see if it works. Okay, I made one important mistake: obviously I loaded the same dataset as before, but we need to load seth_data.csv here as well, otherwise it's not going to work. So now we have to restart this and wait again. All right, now it seems to work. We can just enter a search query, for example "productivity tricks" or something like that. I don't know if there is a blog post for that, but there you go: "The Productivity Pyramid". Then we can go to the blog and see what's part of the post, and maybe "The Productivity Gap" here as well.
What I can also do is maybe search for "book recommendations" if I'm interested in that. I hit Search, and you can see here a book list, or "a big week for books", so here I get book recommendations. And even if "book" does not occur in the title, for example here, "more than 12 minutes", I assume it's going to be in the content; there you go, "unlike most books that cross my desk". Or here we have a reading list, even though "book" does not occur in the title; that's because "book list" occurs in the content.

One more thing I want to show you: we can change the model. One easy thing to do is to switch to the L12 variant; I don't know if L12 is going to be more powerful. But there are also entirely different models; we can use, for example, one of the paraphrase models, paraphrase-mpnet-base-v2. And the important thing is that if you want to change the model, you have to do it all over again: you need to run the indexing script again to compute all the embeddings with the new model first, and then you need to restart the application and load the new embeddings there as well. I'm not going to wait and show you this now, but this is how you do it: you change the model in the indexing script, you run the embedding process again, then you change it to the same model in the application, load the embeddings, and use the application. And this is how you build your own customized search engine in Python.

So that's it for today's video. I hope you enjoyed it and I hope you learned something. If so, let me know by hitting the like button and leaving a comment in the comment section down below. And of course, don't forget to subscribe to this channel and hit the notification bell to not miss a single future video, for free. Other than that, thank you very much for watching, see you in the next video, and bye!
Info
Channel: NeuralNine
Views: 9,568
Keywords: search engine, python, search, txtai, embeddings, vector store, python search engine, streamlit
Id: H-Cgag672nU
Length: 22min 57sec (1377 seconds)
Published: Thu Mar 21 2024