MongoDB + Python #3 - Full-Text Search with Atlas Search

Captions
[Music] in this video i'll be continuing to teach you about mongodb in python by covering advanced queries in full text search i'll be showing you how to search documents using fuzzy matching and synonyms how you can auto complete search queries how to perform compound queries and how to find documents ranked by their relevance this video is the final video in my mongodb with python series and you can check out the previous two videos from the link in the description lastly if you haven't already you can claim 25 in free mongodb atlas credits by clicking the link in the description and using the code mkt tim i'd also like to thank mongodb for sponsoring this video and providing you all with this discount code and with that said let's go ahead and learn about full text search so to begin here i want to give you a brief on what full text search is so full text search refers to searching some text inside of extensive text data that's stored electronically and returning results that contain some or all of the words from the search query so full text search is different from searches based on metadata or on pieces of the original text like titles regions etc since the full text search engine is examining all of the words in every single stored document so wherever you have a collection of documents or even just a single document and you want to search through all of the text in those documents for some keywords you can use full text search so the obvious example here is something like the internet that contains a high volume of text that's stored in some type of documents and when we're looking for something say by using google we're looking for a specific word or words or some related topic to our query now when we get our results back we want relevancy rankings on how well the piece of content matched our query and this can be very basic from an exact match to something like fuzzy matching where the words don't quite match but they are equivalent so the internet is 
just one large example of this but many types of websites and applications need efficient and fully featured full-text search so as you can imagine there's a good demand for people who know understand and want to work in this area and usually they are called search engineers now a search engineer has two primary responsibilities and those are to develop and program search engines and to optimize web content to achieve the best possible rankings in search results so obviously what comes to mind here is working for google as a search engineer but there are like 20 people doing this job and it's likely that you need a phd to actually work at google and do this however there are a lot of other available options it's kind of an undiscovered gold mine in terms of engineering and starting salaries for this position at least according to glassdoor are $125,000 and up so this is kind of a fascinating area and there's a lot of opportunity and so i thought i'd show you some full text search concepts in action using non-relational data via mongodb's full text search implementation so now that we have an idea of what full text search is let's drop into a cluster with some sample data and i'll show you how you can perform full text search on a mongodb database alright so i'm here in mongodb atlas i've already set up a collection and a database i have some sample data in here i'm just going to show it to you and then we'll start looking at full text search on this data so here i have a bunch of jeopardy questions i have about 156 000 of them the full data set i'm using has over 200 000 i just haven't inserted all of them into the collection anyways for every single question we have a category an air date the actual question a value the answer to the question the round and then a show number and you can imagine that this would make decent data for us to be able to search through if we're looking for a specific question if we're looking for a question associated with an answer if 
maybe we want to find all of the questions in a specific round there's all kinds of stuff that we can kind of search through this data with so it'll make a good example for this video now if you want to mess around with this data yourself i'll leave a link to where you can download the data set in the description all i've done is downloaded the json file and then i've loaded the json file into my python script and just inserted all of the documents into mongodb if you're unfamiliar with how to do that then please check out the first two videos in this series again this will be linked in the description so now that we have a basic idea of what the data looks like i'm going to go here in vs code and we'll start writing some stuff that we can use for full text search so as you can see here my basic setup is done i've already connected to my cluster connected to my jeopardy database and then my question collection so the first thing we're going to have a look at here is how we perform fuzzy matching and then how we use synonyms when we're actually searching for stuff inside of our text so what i mean by that is if we search for something say like beer maybe we're going to have pint be a synonym of beer and that way if we see either of those words then we'll return them as if they were equivalent right i'm sure you guys know what synonyms are i don't really need to explain that but i'll show you how we perform that using mongodb search so the first thing we actually need to do here is go back to our cluster and we need to create something known as a search index now if you're unfamiliar with indexes essentially what this is is a special data structure that holds the data of a few fields in our documents on which the index is created now what this allows us to do is search through the index which is containing less data than the entire collection itself so we can speed up our searching operations on our database i'll put up the official definition of an index on the 
screen so you can read through it but that's the basic idea is that by creating an index we're storing less data that we need to search through and then the index will kind of point us to the original documents and it just speeds up our search operations so i'm going to go here to search indexes and i'm going to create an index on my collection now if you're following along with this you'll need to do this on your collection as well if you want to do the fuzzy matching and the synonym search so we're going to click on create search index here now there's the option to use a json editor where you can just type all this in like kind of raw you don't have a visual editor but we're going to use the visual editor for right now okay so let's go into the visual editor for our search index we can call this whatever we want but i'm just going to call this my language search because we want to create a search index here that's going to allow us to search for text specifically english text now in terms of the collection we're going to select this one right here the question collection on our jeopardy database okay so let's go next here so now we need to kind of mess around with a few of these parameters so we need to click on refine your index here we need to modify some of these parameters specifically the index analyzer and the search analyzer so for this we're going to select the lucene.language and then english it should change both of them for us and i apologize if i'm mispronouncing this i'm not sure exactly how you say that now what these are is essentially our full text search engine so this is what's actually going to perform the search for us when we make a search query and what this will do is essentially ignore insignificant words for us and provide some context to our text so when we do this full text search it's just going to actually allow this to work properly and again ignore those insignificant words and do some other more advanced stuff that i won't get 
into here so there's a few other options as well since we know we only have english text here i'm selecting english but if you had a specific language right you could select that in here now there's a few other things you could do not going to get into it in this video for now this is all we need for our search index so we can leave the rest the same and simply click on save changes here and then create our search index now this is going to take a second to complete once it's done we can actually start using this language search and i'll show you how to do that from code all right so our index is now done and i'm going to go back to the code here and we're just going to write a very simple query that's going to give us all of the text that matches with a specific search query we're going to use something called fuzzy matching but for now let me just write this function so i'm going to say def fuzzy_matching like that and inside of here we're going to write our query so i'm going to say result is equal to and then this is going to be question dot aggregate and inside of here we're going to put a list with our different operations so the operation that we're going to be using here for pretty much the entire video is $search this is how you perform the full text search and for search we need to provide the index that we're going to be using so we're going to say index and then we're going to paste in whatever we called our index which in this case was the search or the language search sorry probably should have called that language index but that's fine then we're going to provide a keyword here called text and inside of text we pass a query this is what we want to search for so for the query for now let's go with something simple like computer we then want to have a path now the path is going to be the field that we want to search on so i want to search on the category here and see if we can find something that has a category 
similar to computer and then lastly here we're going to pass fuzzy and i'm just going to pass an empty object here now what fuzzy says is that we want to look for something that's similar to computer but not exactly the same really what that means is that i can do something like add an extra r here i can misspell this slightly uh i can you know do something like compute and this will still give me results for computer because we're doing fuzzy matching so this is similar to how google would work right when you spell something incorrectly or maybe you have like kind of a grammatical error but it still gives you the correct results back now i'm not exactly sure how off you can be in terms of the query i know in the mongodb documentation it does state that but i know with using the fuzzy search here you can actually manually pass in some parameters on how fuzzy you want it to be so how off you want to allow it to be sorry but for now we're just leaving this empty because i just want to use the default parameters so hopefully that makes a bit of sense again just performing a fuzzy search for something kind of similar to computer in the category field so we want to print this out so let's use our pretty printer so printer.pprint like that and we'll just print out the list of result and let's see what we get here when we call fuzzy matching so let's call this let's run our code and let's see what we get all right so we've just got a ton of results here from the fuzzy match and you're going to notice a bunch of them are not actually that similar to computer based on how fuzzy matching works by default so there's actually a bunch of kind of variations it's going to search for and it's going to allow letters to be in kind of the wrong place and i haven't messed with any of the the settings right i've just passed in kind of the default object here and that's why you see when we have a look at category we're getting some stuff like take a comp day right so comp close enough 
to computer hence why that's being returned if we scroll up a little bit we have completes the play title the reason we are getting this is because completes is close enough to computer uh right that's why we're getting that and then if we were to continue here you'd see kind of all of the other categories that match like campus this is close enough to computer with the default fuzzy matching settings now we can change them i'm not going to do that you can have a look on your own on how to do that but if we want to get a more exact search we can remove this fuzzy parameter here and run the code again and now when we do this we should only get results that actually contain computer so it doesn't have to be the last word it could be one of the first words too like we have computer literacy here but we're searching exactly for a computer whereas when we add in fuzzy we're doing kind of that fuzzy match so that is the first thing that i wanted to show you how you perform search for specific text as well as fuzzy matching now what i want to show you is how we look for synonyms so how do we look for something that's maybe similar to computer like a laptop or tech while we're searching for the query computer now to do this we need to implement a synonyms collection and kind of combine that with our search index so we're going to go back to mongodb atlas and do that alright so i'm back on mongodb atlas i've gone to collections here and the first thing i need to do to implement this synonym search is create a collection that contains the different synonyms so it doesn't actually give them to you by default you do need to add your own synonyms although you could bring in like a pre-built database if you want however i'm just going to make a collection here let's call this synonyms and let's click on create now inside of here we're i'm just going to provide one document that contains some synonyms but you would put your documents in kind of the following format that i'm about 
to show you so i'm just going to copy this in and then i will discuss kind of how this works all right so let's go to insert document here i'm going to go to the actual object view and i'm just going to paste this in where we have a mapping type which is equal to equivalent and then we have synonyms and these are going to be the synonyms that are equivalent to each other so for now i've just had a basic one like beer and pint we could change this and do something like computer and laptop uh if i could type laptop properly here and maybe we just throw in tech while we're at it uh just so that we have a few that are that are similar so this is a way that you create a kind of synonym what would you call this document now i'm just going to bring up the documentation on exactly how you do this and you can see that we have the mapping type equivalent as one valid option but we also have the mapping type of explicit so the first type we have here is equivalent which is the one that i'm using and what this means is that all three of these terms are equivalent to each other so if i search for vehicle it will return car and automobile if i search for automobile it will return anything with vehicle and car they're all equivalent however if i explicitly map something i need to pass another field here called input and now what i'm saying is i'm mapping this input to all three of these terms but not the other way around so that means if i search for something like pint only stuff that contains pint is going to be returned it's not a synonym of brew and beer it's only that beer maps to these three terms so hopefully that makes a bit of sense you can read through this explanation it probably explains it better than i just did and i'll leave this in the description anyways for now we're going to go with the mapping type of equivalent i'm just going to insert this in to our document here into our collection and now that we've done this we actually need to add the synonyms collection 
to our search index so we're gonna go back to search index here uh and this needs to be on so i have the wrong selection here the language search and what we're going to do is go to edit index definition and this time we're actually going to use the json editor because the add synonyms at least right now when i'm filming this video it's not supported in the visual editor so i need to add a field here this field is going to be called synonyms i think i spelt that correctly this is going to be a list and we need to pass these objects here which are going to define the collections that contain our synonyms so the first thing i'm going to do is just give a name this name will just be i'll say mapping for right now if i could spell mapping correctly that was atrocious okay now that we have mapping i'm going to pass a source the source is going to be the collection that contains our synonyms so actually this will be an object and inside of here we're going to say collection and then we're just going to pass our collection which is called synonyms which is in the same database as this so we don't need to explicitly reference it and then after this i need to pass an analyzer for our synonyms and this is going to be the lucene dot and then english like we've used before so let's spell english correctly okay let me just make sure that looks good i think we are right so we have name we have our source we have our collection we have our analyzer and now we can save okay so we've now added synonyms to this index i don't think i need to do anything else i think that's saved and we are all good and now that we have the synonyms here we can actually start searching using them so to do that let's go back to our code and let's write the kind of synonym search all right so to do this actually fairly easy all we're going to do is add one parameter here and i really should have made another function but that's fine we'll do it inside of fuzzy matching and this is going to be called 
synonyms of course i spelt that incorrectly so just spell it right for me thank you very much and for this we're just going to pass the name of the synonyms that we added so if i go back here sorry to our search index and we have a look here and we go to edit with json editor notice that for my synonyms here inside of synonyms i called this one mapping so since i called that mapping this is the one i want to access from my code and so i'm referencing mapping here for the synonyms field all right so now that we've done that it should actually return to us anything that contains a computer or is a synonym of computer in the category field so let's run this and let's see if we do actually end up getting that okay so it gave me a ton of results here and notice that we're getting tech right we're getting tech again let's scroll up a bit and find some other ones we're getting computers okay uh computer characters let's see if there's any laptop stuff we get computer geniuses uh okay we get techno so all this stuff is uh you know a synonym of computer as i stated not much more for me to explain all right so with that said i have now shown you how we do fuzzy searching or fuzzy matching how we search with synonyms and how we do just a regular text search on a specific field in this case we've been using category now that we've done that i want to show you something known as autocomplete so how we actually do a search that's going to give us autocomplete results so i'm sure you're all familiar with autocomplete but this is very similar to when you're kind of typing in like a google search result or you're searching some website or something like that and as you're typing you kind of get results being filled in based on their relevancy that's what we want to do here so we want to find all of the things that could be auto-completed from what you're typing and return those so let's make a new function here let's call this autocomplete and inside of here we'll start writing 
what we need now the first thing we actually have to do here is we have to go back to mongodb atlas and i just need to remove the synonyms from this because they're not supported with the visual editor and we're going to be using the visual editor to help us with the autocomplete so let me remove synonyms let's save that let's go here to the visual editor and what we need to do is add a field mapping here with something that is autocomplete so that i can actually use the autocomplete feature so let's make this full screen i'm going to say add field here and for the field name i need to select the one that i want to have autocomplete for so i'm actually going to go with question because i think that makes sense for autocomplete we'll have enabled dynamic mapping that's fine and then for the data type here are actually going to select not string but autocomplete so here you can mess with some of the properties of the autocomplete i'm not going to change any of them this is fine for right now and i'll just hit save so that's what we've done we've now added the question field with data types autocomplete and this means now when i use this search index i can use the autocomplete feature for the question field okay hopefully that's clear let's go back here to autocomplete and let's start writing this out so for autocomplete we need to do something a little bit more advanced than before we're going to say result is equal to and then this is going to be question.aggregate and inside of here we're going to pass our search so we're going to do our operator and then search like that for the search again we need to pass the index so our index is going to be language search and then at this time rather than text we're going to do autocomplete okay so for our autocomplete we need a query so let's go with the query and we're searching for questions so we can do something like what is the i don't know fastest uh and maybe that will give us some autocomplete might have to change 
that if there's no results for that but that's fine for now next we're going to have a path and the path here is going to be the question so that is the field that we added in our kind of field mappings right so we need to use the same one here which is question then we're going to have token order so token order is essentially saying are we going to be looking for something sequentially or do we not care about the order i'll talk about that more in a second and then lastly i'm going to say fuzzy and when we add in fuzzy here it'll give us a fuzzy matching not just the exact query which is kind of what we're looking for here all right so let's just break this down a little bit here so token order as i was saying sequential means that what we've placed right here we're looking for exactly this where the different words or what we could call tokens appear adjacent to each other in whatever the result is that we're going to be kind of matching with this that might be a little bit confusing but all that means is we're looking for what is the fastest kind of in this sequential order if we had the other one which is any then that means that we're looking for any four of these words in any order in the result so it could be like fastest is the what as opposed to what is the fastest so use the appropriate one based on that i also pull up the documentation here i'll link this in the description with all the other options so we have fuzzy enables fuzzy search right so that's what we're doing path this is the indexed autocomplete type of field to search so we're searching for question we have our query this is a string or multiple strings that we're going to search for if we wanted to do multiple we actually could just pass an array here of multiple strings for now though i'm just going to do one single string okay so let's do that for now and just to make this a little bit easier to see i'm going to add a projection operation here just so that we're only projecting the 
question so we don't have to search through so much text to see if this is working properly so i'm just going to say underscore id is 0 and then i need my answer or not answer sorry this is going to be question and this is going to be 1. okay so let's print this out let's say printer.pprint and then the list of the result i need to actually call this function otherwise of course it's not going to do anything so let's run this again okay and let's scroll down and we actually didn't get any results okay so what is the fastest wasn't really giving me any results there so let's do something that's going to be a bit better for autocomplete uh let's just try actually computer programmer and see if that actually gives us anything at all with the fuzzy matching okay so let's have a look here and we should see computer programmer okay so gary kasparov recently beat a computer program okay that's pretty close uh computer programming language okay computer program there you go so we're getting all the autocomplete results that contain something similar to computer programmer or exactly computer programmer okay so that is how you perform the autocomplete maybe this wasn't actually the best example using the jeopardy data set because it's hard to really auto complete i guess the questions there's so many of them that are very similar to each other you get the point that's how you do autocomplete all right so my apologies for the abrupt cut here but at this point we have covered autocomplete fuzzy matching searching with synonyms and i want to start showing you some more advanced stuff and how we actually kind of filter the result here so we have this search stage right where we're actually going and we're searching for some specific text but a lot of times i want to kind of fine-tune this and make it so that maybe we're filtering out specific results or we're prioritizing results that contain some extra data in our queries right so i'll show you how we do that and i'm actually 
just going to paste this in and then i'll kind of walk you through the syntax and explain how this works so this is something that i have here that is going to perform a more advanced search using this compound operator or this compound field now i'll bring up the documentation here so we can have a quick look all of this will be linked in the description afterwards so you can look at it yourself but we can see compound as the following syntax where we can pass an object here that contains must must not should filter etc now these keywords here you want to use over something like a match statement in the aggregation pipeline so rather than searching getting all the search results and then trying to match them to a specific query instead you want to use this the must must not should and filter so as you can see here for must this is kind of mapping to and and it means anything that we provide here must be true for a document to be included in the results must not that's the opposite and then for should this is going to prioritize results that do have the should clause that so the should clause is true now as you read here it says if you use more than one should clause uh you can use the minimum should match option to specify a minimum number of should clauses that must match to include your document in the results and if omitted the default is zero which is what we're going to have we then have filter and you can have a read at how that works i'm not going to go through that in this video okay so let's go back here to vs code and let's actually run compound queries uh after quickly just having a look at all the stuff that we've put inside of here so we have our search we're looking at the index called language search and then we have our compound keyword so for must here we've provided an array of must clauses in this case we've just done one which is text and this is saying that we want to have computer or coding inside of category so that must be true for us to 
return this continuing we have must not and this is saying we don't want to have codes inside of the category path so if we have codes we cannot return that and this is case insensitive by the way so if this was like a capital codes same thing it's not going to make a difference here in this result continuing we have should so for should we want application to be a part of the answer so anything that has application as a part of the answer we're going to prioritize returning that and then we're performing a projection here where we're getting the question answer and category and we have a score which is a field that we're adding to each result which contains the metadata of the search score for this kind of search operation so let's just run this and see what our result looks like and notice here that this is what we're getting so the answer application that's the first answer that we're getting it has the highest score of 10. we then have the question here the app in killer app stands for this the category is computers okay continuing we see all of our scores down here we have category computers the answer is not application it wasn't prioritized blah blah blah blah uh you guys get the point i'm not going to go through all of this and you're getting results uh in terms of their relevancy right ranked based on that score nice okay so that is the first operation right here again to learn more about this stuff please reference the documentation it is very time consuming to try to explain every single field here in this video actually i'll leave that there now though i want to show you something called relevance search now what we just did is kind of a relevance search but this one is more fine-tuned and allows you to kind of boost answers and change the score of specific results based on some specific queries so let's paste this one in here it's called relevance i just need to change this to be search and what this does is prioritize questions appearing in the later 
rounds as the comment states so we have our aggregation we're doing search we have our index and we have compound again now this time we're looking for anything that contains geography in the category and now we have multiple should clauses so the first one here we're looking for final jeopardy as the query in the path round all of our documents here have a round and we're saying if it appears in final jeopardy we want to boost the score by a value of three now what boost does it actually multiplies the score by three so that's what we're doing just multiplying it by three if it appears in the final round and then we have another text query here for double jeopardy so if this appears in one of the later rounds i believe double jeopardy is i think the second last round in jeopardy then we're going to boost the value by 2.0 now there's a whole bunch of other stuff that i can do here rather than multiplying i could add a constant value i could use a custom function i could implement something like gaussian decay i think i'm pronouncing that correctly but i might not be so please excuse me if that's the case and i can really customize kind of how i'm getting results ranked by relevance in the way that i define for now though let's just call this function and let's see what the result is so let's run this and let's bring our terminal up here and if i scroll down uh actually let's just clear and rerun it just so i get all the results here in the terminal okay nice so now that i'm here i've just limited this to 10 by the way so i'm only getting 10 results you can see i have my category geography i have my question it's the only country whose name begins with an a but doesn't end with an a okay and this round final jeopardy that's why it's appearing first we have a score of 7.7 which means the score would have been lower but we multiplied it by three right continuing we have another one in final jeopardy and i think all of these are appearing in final jeopardy now if i 
make the limit like 100 let's rerun this and let's see if we get some ones that are appearing in double jeopardy yes you can see now we have double jeopardy and we're only multiplying those results by two so they're going to have less of a score than the ones that appear in final jeopardy and those ones all seem to have kind of like a seven plus score all right so there you go that is the relevance search again as i keep saying there's a lot more advanced stuff you can do i can't cover it all in this video it's really meant to be kind of an introduction to these topics and encourage you to go read the documentation i will bring up the documentation for this which is customizing the score in your results again all this will be in the description and you can see we have options like boost constant and function so the boost is going to multiply result score we can actually use a value from the document for the multiplication factor or we can just hard code our own value like two or three which is what we did we then have uh what else was here uh the constant this is going to add a constant amount and then we have function and if i scroll down here i think there's some examples yeah so the constant option replaces the base score with the specified number so my apologies actually we're not adding we're just replacing it with a value continuing we have function the function option allows you to alter the final score of the document using a numeric field you can specify the numeric field for computing the final score through an expression if the final result of the function score is less than zero atlas search replaces it with zero okay and you can use stuff like a gaussian decay and it kind of shows you how you would do that here not really going to go through much more of that okay so i think with that said that is going to wrap up this video i do apologize that this wasn't extremely in-depth but i can't really go through much more than i covered in this video because it 
gets very granular there's all kinds of options at that point i'm just really reading the documentation to you and covering you know all the different options and kind of specific stuff that you use really the core thing i wanted to show you here was this search operator how you create that search index and how you can perform full text search in mongodb because this is something i've actually never seen before and that was really cool and then i wanted to kind of mention to you in this video so with that said i think i will wrap it up here another massive thank you to mongodb for sponsoring this video and this series hope you guys enjoyed and learned a bit about mongodb and python if you did leave a like subscribe to the channel and i will see you in another one [Music] you
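The code typed on screen is not part of the captions, so here is a rough sketch of the pipelines described in the video, reconstructed from the narration only. Several things are assumptions rather than the author's exact code: the index name spelling `language_search`, the helper function names, and the rule that `fuzzy` and `synonyms` should not be combined in one `text` operator (which is how I read the Atlas Search documentation). The field names `category`, `question`, `answer` and `round`, the synonym mapping name `mapping`, and the boost values 3 and 2 come from the narration. The functions only build pipeline lists, so the sketch runs without a database.

```python
from pprint import pprint
from typing import Any, Dict, List, Optional

INDEX = "language_search"    # assumed spelling of the index called "language search"
SYNONYM_MAPPING = "mapping"  # name given to the synonyms entry in the index definition

Pipeline = List[Dict[str, Any]]

# Projection used by the compound examples: chosen fields plus the search score.
SCORED_PROJECTION = {"$project": {"_id": 0, "question": 1, "answer": 1,
                                  "category": 1, "round": 1,
                                  "score": {"$meta": "searchScore"}}}


def text_search(query: str, path: str, fuzzy: bool = False,
                synonyms: Optional[str] = None) -> Pipeline:
    """Plain, fuzzy, or synonym-aware text search on one field."""
    if fuzzy and synonyms:
        # Assumption: the text operator does not accept both options together.
        raise ValueError("use either fuzzy or synonyms, not both")
    text: Dict[str, Any] = {"query": query, "path": path}
    if fuzzy:
        text["fuzzy"] = {}  # empty object means default fuzziness
    if synonyms:
        text["synonyms"] = synonyms
    return [{"$search": {"index": INDEX, "text": text}}]


def autocomplete_search(query: str, path: str = "question",
                        token_order: str = "sequential") -> Pipeline:
    """Autocomplete on a field mapped with the autocomplete data type."""
    return [
        {"$search": {"index": INDEX,
                     "autocomplete": {"query": query, "path": path,
                                      "tokenOrder": token_order, "fuzzy": {}}}},
        {"$project": {"_id": 0, path: 1}},  # show only the searched field
    ]


def compound_search() -> Pipeline:
    """must / mustNot / should clauses, as walked through in the video."""
    compound = {
        "must": [{"text": {"query": ["computer", "coding"], "path": "category"}}],
        "mustNot": [{"text": {"query": "codes", "path": "category"}}],
        "should": [{"text": {"query": "application", "path": "answer"}}],
    }
    return [{"$search": {"index": INDEX, "compound": compound}}, SCORED_PROJECTION]


def relevance_search(limit: int = 10) -> Pipeline:
    """Boost questions from later rounds: x3 for the final round, x2 for double."""
    def boosted(round_name: str, factor: float) -> Dict[str, Any]:
        return {"text": {"query": round_name, "path": "round",
                         "score": {"boost": {"value": factor}}}}
    compound = {
        "must": [{"text": {"query": "geography", "path": "category"}}],
        "should": [boosted("Final Jeopardy", 3.0), boosted("Double Jeopardy", 2.0)],
    }
    return [{"$search": {"index": INDEX, "compound": compound}},
            SCORED_PROJECTION, {"$limit": limit}]


if __name__ == "__main__":
    pprint(text_search("computer", "category", fuzzy=True))
    pprint(relevance_search(limit=100))
```

With a pymongo collection handle for the question collection (called `question` in the video), any of these could presumably be run as `list(question.aggregate(text_search("computer", "category", synonyms=SYNONYM_MAPPING)))`. Keeping the pipeline builders separate from the database call makes them easy to inspect and test before pointing them at a cluster.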
Info
Channel: Tech With Tim
Views: 21,042
Keywords: tech with tim, mongodb and python full text search, what is full text search, full text search, full-text search, metadata search, difference between full text search and metadata search, relevancy search, programming full text search, coding full text search, how to use full text search, how to do full text search, search engine, search engine organization, search engine optimization, fuzzy matching, compound queries, search indexes, what are search indexes
Id: nc-Kpiq1zLc
Length: 31min 35sec (1895 seconds)
Published: Fri May 06 2022