How does Google search one document among billions of documents so quickly?

Video Statistics and Information

Captions
Have you ever gotten frustrated while searching for a file or folder on your desktop or laptop? Whether it is macOS, Windows, Linux, or any other operating system, the search still takes a couple of seconds, and in the worst cases it can take minutes. But when you search something in Google, the results pop up much faster, and you know the scale at which Google works: they crawl billions of web pages, store them on their computers, and when you search something, not just a single word but a whole sentence, they are able to find all the web pages containing those keywords in a matter of milliseconds. So why does our computer take seconds, or sometimes minutes, to search something we have locally, while Google, which actually executes the query on remote computers or servers far away from our laptop, still gets the results back faster? You might say they have cached the results, and that's true, but even without caching the search is fast and you get results in milliseconds. You might also say Google uses supercomputers or especially powerful servers to find one specific document among billions, but that's not true either: they largely use commodity computers, like the desktop you have at home. Of course they have a huge number of computers powering their search results, but still, how are they able to find a specific web page among billions when your computer takes a couple of seconds to find a file among a few thousand files? And it's not just Bing, Google, Yahoo, or other search engines; there are systems like Elasticsearch and Solr which you can use in your own project to build a very good search engine with these technologies. In this video I'm going to explain the
underlying data structures and the process which power these search systems. Searching is basically information retrieval, and any search system we build has to fulfill two important criteria: low latency and high throughput. Low latency means that whenever we search something, the results come back as fast as possible, like Google. Throughput means the system should be able to answer a lot of queries at any given point of time: even when thousands or millions of users are all searching at once, the system should still be able to answer all of them, so of course it also has to scale.

Let's go back to the basics of searching. Suppose your computer has a number of files, say 100 text files, and each file contains many, many words. Now you search for a keyword, say "ThinkPad", and your computer should find all the files in which "ThinkPad" appears one or more times. How does the computer do it? A primitive program would do a physical search: it has to iterate through all one hundred files and look for the word inside each one. It goes to the first file, opens it, and scans for "ThinkPad" anywhere in the text. If the word isn't there, it just closes the file; if it is there, it adds the document name into a list of search results.
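The brute-force scan just described can be sketched in a few lines of Python (a minimal sketch; the file names and contents here are made-up stand-ins for real files on disk):

```python
def naive_search(files, keyword):
    """Linearly scan every file's text for the keyword."""
    results = []
    for name, text in files.items():   # "open" every file one by one
        if keyword in text.split():    # scan its contents word by word
            results.append(name)       # record the file if the word appears
    return results

# Toy stand-in for 100 files on disk:
files = {
    "specs.txt": "lenovo thinkpad x1 carbon specs",
    "notes.txt": "grocery list milk eggs",
    "review.txt": "the thinkpad keyboard is great",
}
print(naive_search(files, "thinkpad"))  # ['specs.txt', 'review.txt']
```

Every query pays the full cost of reading every file again, which is exactly the problem indexing solves.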
If the second file doesn't contain the word either, it moves on, and it keeps appending matches as it goes through the files. Now, how much time does this take? File operations are expensive: when the computer opens and closes every one of these files, it takes a lot of time. That's part of why your computer's search is slow, but in practice there are more optimizations. The program responsible for searching words, files, or folders on your computer does something smarter: at free time, when the computer is idle, it indexes the files. What does indexing mean? Indexing is pre-processing. Instead of explaining exactly how the operating system indexes files, let me explain what an index is. An index basically helps you identify a resource based on some identifier. A simple example is arrays and dictionaries: if you have an array x containing ['a', 'b', 'c', ...], then x[1] points you directly to 'b'; the position is the identifier. This is a very basic representation, and the same idea shows up in databases: say you have a table with an ID column holding 1, 2, 3, 4, and you build an index on that column. If you now search for the row with ID 3, you don't need to scan row by row, row 1, row 2, row 3, which is an O(n) operation. With the index you can find it in O(1): you give the ID, and the database can easily find the address or location where the row with ID 3 lives. That is the advantage of indexing.
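As a rough sketch of the idea (not how a real database implements it internally), here is the difference between the O(n) scan and an O(1) indexed lookup, using a plain dictionary as the index:

```python
rows = [
    {"id": 1, "name": "alice"},
    {"id": 2, "name": "bob"},
    {"id": 3, "name": "carol"},
]

# Without an index: O(n) linear scan for the row with ID 3.
def scan(rows, row_id):
    for r in rows:                 # check row 1, row 2, row 3, ...
        if r["id"] == row_id:
            return r
    return None

# With an index: one pre-processing pass builds a lookup table,
# after which every search is a single O(1) dictionary access.
index = {r["id"]: r for r in rows}

assert scan(rows, 3) is index[3]   # same row, found two ways
print(index[3]["name"])  # carol
```

The pre-processing pass costs something once, but every later lookup is essentially free, which is the same trade-off search indexing makes.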
How do these indexes work internally? They use data structures like B-trees and related structures, but I'm not going to explain all of that here, because it deserves a separate video; just for your curiosity, you can read about B-trees and how database indexing happens. This video focuses on how search indexing works. In our case, the program should index all of the files so that whenever you search some keyword, it can easily figure out which documents contain the word "ThinkPad". It has to do this kind of pre-processing when your computer is not in use, or in the background while it is being used, and then it keeps the metadata somewhere. With that metadata, whenever you search for something, the program doesn't go into the documents themselves; it searches the metadata to figure out which files contain the word. That's exactly what search engines do too: the search engine doesn't actually go and look into each and every document it has crawled; instead it looks into the metadata to figure out the file names or document IDs in which a particular word exists. That is what we are going to learn in this video.

Before understanding indexing, let's also understand the basic building blocks of any search engine. There are three main parts: the first is crawling, the second is indexing, and the third is the query side. Crawling is the process of fetching all of the different web pages from all over the internet. Google has distributed crawlers, also called spiders: many computers crawling different sites by following the hyperlinks available in the pages. They keep crawling pages and store them into some persistent
storage. All of these pages are then fed into the indexer. What indexing does is what we are going to learn: it runs the indexing process on all of the web pages the crawler has fetched and emits the metadata into some kind of persistent storage. Then querying happens on top of that; it's not exactly this simple in reality, but you can think of queries being served from the metadata. Who issues the queries? Users, or any other systems that want to see results, go through the query system, and there are many different ways queries can work, which we'll discuss shortly. The query side looks into the metadata and returns the results to the users. That's the overall picture.

Our goal now is to build a search engine, a search system with low latency, high throughput, and near-real-time capability to search over very big data: many text files or web pages. Let's take an example with just three documents; we won't take more, because it would be difficult to show the examples otherwise. First, let me say a little about the plain-database approach, because that's important. If you wanted to build search with just a database, you would simply make two columns, one for the ID and one for the content, and every document's ID and text would go into those columns. How do you search? You use the LIKE operator, with % and _ as placeholders, to find words inside the content column. That works, but it is not efficient: when there are many documents, the database will not deliver low latency or high throughput. So how do we solve this
problem? I'm going to introduce a concept called the inverted index. Let's see how this inverted index table, also called a term-document matrix or term-document table, is built. As I mentioned, we have three documents, each containing a sentence; if you look at them, many words repeat across documents and some are unique. To build the term-document matrix, the first thing to do is read each document and start marking words. Initially the table is empty, and I'll go through each sentence, take the words one by one, and build the matrix.

In the first document we have "the quick brown fox jumped over the lazy dog". The first word in that sentence is "the". Should we record "the"? No, because "the" doesn't really matter to us; it's just an article, and almost every document contains many occurrences of it. If someone searches "the" in Google, do they really want every page that contains it? For a meaningful search we don't need to support searching for articles and similar filler; we should only consider the meaningful words in the documents, like "quick", "brown", "fox", "jumped", "lazy", and "over". So we remove "the" from the first sentence, and likewise the articles from the second and third sentences. These words are called stop words in natural language processing, and before we build the table we have to pre-process the data and remove all the stop words.
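A minimal sketch of stop-word removal (the stop-word list below is a tiny hand-picked one for illustration; real libraries such as NLTK ship much fuller per-language lists):

```python
STOP_WORDS = {"the", "a", "an", "i", "my", "must", "few", "yours"}

def remove_stop_words(text):
    """Keep only the meaningful words; drop articles and filler."""
    return [w for w in text.lower().split() if w not in STOP_WORDS]

doc = "The quick brown fox jumped over the lazy dog"
print(remove_stop_words(doc))
# ['quick', 'brown', 'fox', 'jumped', 'over', 'lazy', 'dog']
```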
Stop words are words like "I", "need", "my", "must", "few", "yours", "as", and so on; we don't really need them, so we remove all of them. There are libraries for this in pretty much every language, for example the NLTK package in Python: you tell it which language's stop words to remove, and it strips them from the given document. Once that is done, all of these filler words are gone from our three documents.

There is still more pre-processing required. For example, suppose the documents contain "quick", "quick", and "quickly". By meaning, these three words are essentially the same; only the form or tense differs. But because the letters differ, a computer doesn't care about meaning and treats them as different words, while for search we want to consider all three as the same word. For that we do something called stemming and lemmatization; I'll talk about lemmatization in a moment. Stemming is the process of removing suffixes to get the base word, the actual core word inside the given word. In the case of "quickly", the suffix is "ly", so removing it gives "quick"; the other two have no suffix to strip. After stemming, all three become "quick", so the sentence essentially becomes "the quick fox jumped over the bridge" with its meaning intact, and when we run stemming, "jumped" similarly becomes "jump". Stemming is a rather rudimentary process: you simply strip the suffix without caring whether the result is a real, meaningful word. We are just cutting the suffix off to get a root, and the resulting word might not be meaningful.
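Here is a deliberately crude stemmer to illustrate the point. The suffix rules below are my own toy rules, not the Porter algorithm that real engines typically use, but they show how blind suffix stripping can produce non-words:

```python
SUFFIXES = ("ly", "ed", "ing", "es", "s")

def crude_stem(word):
    """Strip the first matching suffix; there is no dictionary,
    so the result may not be a real English word (that is
    stemming's weakness, which lemmatization fixes)."""
    for suf in SUFFIXES:
        if word.endswith(suf) and len(word) > len(suf) + 2:
            return word[: -len(suf)]
    return word

print(crude_stem("quickly"))   # quick
print(crude_stem("jumped"))    # jump
print(crude_stem("troubled"))  # troubl  <- not a real word; a
                               #    lemmatizer would give "trouble"
```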
To give an example, take the word "troubled". If you run stemming on it and strip the "ed" suffix, what you actually get is "troubl", not exactly the word "trouble". Lemmatization, on the other hand, is a smarter process, but it requires a lot of metadata: it understands the vocabulary of English and the different forms and tenses of each word, and it produces the actual root word. If we run lemmatization on "troubled", we get "trouble". The exact outputs can differ between tools, but the underlying concept holds: the motivation is always to get the actual root word. So we have to run stemming or lemmatization to remove the suffixes and tenses and get root words. That's one more good step.

The next thing is noise removal. What is noise? Our small example text doesn't have any, but when Google crawls web pages there will be tons of noise in them: hashtags, URLs, hyperlinks, all kinds of random junk. We have to remove those too, because considering them would pollute the index; although it's all up to your use case, and if you want to make hashtags searchable you can still keep them. After running all of these steps, our documents are clean: the noise is gone and only the proper words remain.
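Noise removal can be sketched with a couple of regular expressions (the patterns below are simplistic and purely illustrative; production crawlers do much more careful cleaning):

```python
import re

def remove_noise(text):
    """Drop URLs and #hashtags before indexing."""
    text = re.sub(r"https?://\S+", " ", text)  # strip URLs
    text = re.sub(r"#\w+", " ", text)          # strip hashtags
    return " ".join(text.split())              # collapse whitespace

crawled = "great laptop #ThinkPad see https://example.com/review for more"
print(remove_noise(crawled))  # great laptop see for more
```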
Now let's build the term-document matrix. We go word by word. The first remaining word is "quick", and we have to mark which documents it is present in: it appears in document one, so we mark it there, and it is also present in documents two and three. Rather than scanning for one word at a time, you do it this way: the first time you see the word "quick", you record it as a row and put a mark under the document-one column. There will be n columns, one per document to be indexed; in our case we have three documents, so three document columns in the table plus one column dedicated to the term. There is only one row entry per unique word. So "quick" gets its marks; next comes "brown", so we write down "brown" and mark where it is present; then "dog", and then "fox". The words come out in a slightly jumbled order, but that's fine: you take all the unique words, and then you mark which documents each word appears in. "Dog" appears in documents one and two but not in document three, so we mark one and two. "Fox" is present in all three documents, so it gets three marks. "Bridge" is present only in document three, because it appears in neither of the first two. This is what is called the term-document matrix.

How does this index table help us search faster? Let's see. If you want to search for some word and find all the documents where it is present, it is very simple now. Say I want to know which documents, whether files on my computer or all the pages I have stored, contain "quick": I just refer to this table, look up "quick", and read off the columns that are marked. That simple. I go to the "quick" row and immediately get the matching documents; similarly, to search for "jumped", I go to its row and read off its documents. This is much, much faster than opening each file and looking inside it individually. That's the performance advantage of having this index table.
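Putting the steps together, here is a sketch of building and querying the term-document structure. The exact wording of the three example documents isn't fully recoverable from the captions, so the texts below are stand-ins chosen to reproduce the marks described (dog in documents 1 and 2, fox in all three, bridge only in document 3):

```python
STOP_WORDS = {"the", "a", "an"}

docs = {
    1: "the quick brown fox jumped over the lazy dog",
    2: "the quick brown fox leaped over the lazy dog",
    3: "a quick fox crossed the old bridge",
}

matrix = {}  # term -> set of document IDs containing it
for doc_id, text in docs.items():
    for word in text.lower().split():
        if word not in STOP_WORDS:
            matrix.setdefault(word, set()).add(doc_id)

# Search is now one dictionary lookup instead of opening every file:
print(sorted(matrix["fox"]))     # [1, 2, 3]
print(sorted(matrix["dog"]))     # [1, 2]
print(sorted(matrix["bridge"]))  # [3]
```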
This table is the metadata I was referring to earlier: the search engine refers to this indexed metadata and produces the result instead of looking into the actual documents. All good, but is this literally how search engines work? Not really, because there is a problem: we are creating a table in which every row, column, and cell takes up space. Even if each cell is conceptually a single bit, storing this as a database table of IDs and boolean columns would be far too much data; the storage cost is too high. That's why there are tricks: instead of materializing all the columns, you can use a compact representation. For "quick", which is present in all three documents, you map the word to the bit string 111; for "brown", present in documents one and two but not three, you get 110. Using one simple number you are able to store a lot of information; it's like a compression technique in which you save, in bitwise format, which documents a word exists in. Reading from the left, the first bit tells you whether document one has the word, the second bit document two, and so on. But this representation also has its limitations, so there is additional indexing.

Now let me show a different representation of the term-document table, in which we also capture the frequency and the positions of the occurrences: the inverted index proper. How is this table built? Because of space constraints I've only written out a few words. Let's take "quick": how many times has it appeared overall? Three times, so we keep the total number of occurrences in the table.
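The bit-string trick described above can be sketched like this (a toy encoding; real engines use more sophisticated compressed posting formats):

```python
docs_with_quick = {1, 2, 3}   # 'quick' appears in all three documents
docs_with_brown = {1, 2}      # 'brown' only in documents 1 and 2
num_docs = 3

def to_bitmap(doc_ids, n):
    """Pack document membership into one integer: reading the bits
    left to right, bit i is 1 iff document i+1 contains the term."""
    bits = 0
    for d in doc_ids:
        bits |= 1 << (n - d)   # leftmost bit = document 1
    return bits

quick = to_bitmap(docs_with_quick, num_docs)
brown = to_bitmap(docs_with_brown, num_docs)
print(f"{quick:03b}")          # 111
print(f"{brown:03b}")          # 110
# Bitwise AND of two bitmaps = documents containing both terms:
print(f"{quick & brown:03b}")  # 110
```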
As we index documents, every time we see the word again we increment this count and append to the postings. Here is how it is built. Initially the count is zero and the postings list is empty; all of this data belongs to "quick". When we see "quick" for the first time, in document one, we increment the count by one and add an entry for that document. The structure here is multi-dimensional, and heterogeneous too: each entry tells you the document ID and the positions. So the first entry says this word appeared in document one, and the position recorded there is two, meaning "quick" is at position two of document one. If, hypothetically, "quick" had appeared in document one several more times, say at positions five, seven, and nine, we would have listed all of those positions in the same entry; but in our example it appears only once there, so the entry just has position two, and I don't want to modify the example. Moving on: in the second document "quick" is present at the first position, so we increment the total to two and add an entry for document two with position one. In the third document "quick" appears as well, at the third position, making three appearances in total, with an entry for document three and position three. You build this up similarly for every unique word you want to index, and you end up with this kind of table.
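A sketch of building this positional structure; the three tiny documents are invented so that "quick" lands at the positions the transcript describes (document 1 position 2, document 2 position 1, document 3 position 3):

```python
docs = {
    1: "the quick brown fox",
    2: "quick brown a fox",
    3: "a lazy quick fox",
}

index = {}  # term -> [total_count, {doc_id: [positions...]}]
for doc_id, text in docs.items():
    for pos, word in enumerate(text.split(), start=1):  # 1-based positions
        entry = index.setdefault(word, [0, {}])
        entry[0] += 1                                   # bump total frequency
        entry[1].setdefault(doc_id, []).append(pos)     # remember where

total, postings = index["quick"]
print(total)     # 3
print(postings)  # {1: [2], 2: [1], 3: [3]}
```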
We can keep this table in a hash map, in Python a dictionary, so that with the help of the index you can jump to the right entry much faster than scanning, in O(1). If I want to search for "brown", I hash "brown", go straight to its entry, and the data there tells me how many times it appears across all the documents, say two, and exactly where: in the first document at its position, and in the second at its position. The same trick is used when you search any word in an e-book app: this kind of indexing has been performed on all of the pages of the book, so as soon as you search something, the app can highlight all the occurrences of that word. With this same information you can build the search results page of your application. This is what Lucene, Solr, Elasticsearch, and even Google and all the other search engines essentially have at the core of their frameworks. Because of space constraints I haven't drawn the whole table; I've just written the rows I'm going to use in the examples.

Now let's discuss the query, the search side. We've created the metadata; how do we retrieve information from it? I already showed you single-word retrieval: to get the result for "quick" or "brown" or "fox" you have the data and can directly show the result. But search is not always limited to one word. In Google people type complete sentences, "how do I do X" and so on. You don't strictly need to type all of that, the keywords are what matter, but from the user's perspective you want your search system to accept more than one word. So how do we handle that?
Let's take an example: suppose I search for "quick brown fox" against the metadata we have. How does this work? There are two strategies: conjunctive queries and disjunctive queries. A conjunctive query is an AND operation: I'm expecting all of these words, quick, brown, and fox, to be present. We fetch the results for "quick", "brown", and "fox" separately from the table, and I'm primarily interested in the documents that contain all of these words, not the documents that only have "quick" or only have "fox"; documents with two of the words, or one, come later. Google does something similar: it orders results by occurrence and also by PageRank and a lot of other signals, but for our purposes let's say we order pages by number of occurrences. So a conjunctive query takes the postings of the individual words, the set of documents for "quick", the set for "brown", the set for "fox", and intersects them, producing only the few documents that actually contain all of the words. A disjunctive query, on the other hand, takes the results for all of the words and performs a union: it just combines them, without caring whether any one document has all three words. The output can contain documents that have just "quick", just "brown", just "fox", some of the words, or all of them. Those are the two kinds of queries. But if you think about it, the order in which these words appear is also important, right?
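The conjunctive and disjunctive strategies can be sketched with plain Python sets; the posting sets below are taken from the running example:

```python
# Posting sets for each query term (document IDs from the index):
postings = {
    "quick": {1, 2, 3},
    "brown": {1, 2},
    "fox":   {1, 2, 3},
}

def conjunctive(terms):
    """AND query: only documents containing every term."""
    result = postings[terms[0]]
    for t in terms[1:]:
        result = result & postings[t]   # set intersection
    return result

def disjunctive(terms):
    """OR query: documents containing at least one term."""
    result = set()
    for t in terms:
        result = result | postings[t]   # set union
    return result

print(sorted(conjunctive(["quick", "brown", "fox"])))  # [1, 2]
print(sorted(disjunctive(["quick", "brown", "fox"])))  # [1, 2, 3]
```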
We're not really interested in a document that merely contains the words in some arbitrary order: if a document reads "fox brown quick", does it really match the query "quick brown fox"? I don't think so, although in some instances the positions of the words matter more than in others. So beyond the conjunctive query we need a little more processing to get the documents that have the same ordering as well. How do we do that? It's actually pretty easy: after doing the conjunctive query, among the matching documents we find the ones in which the positions of the query words are in increasing order. For example, in the first document we have "quick brown fox"; in the second document we have "quick brown fox" with a word in between; but in the third we just have "fox ... quick". One caveat: during pre-processing some words get removed, so the recorded positions may not be perfectly contiguous with the original text; but in our example let's take the positions as recorded. If you're okay with some words, articles or otherwise, sitting between your keywords, you can simply take the intersection and check the ordering; let me show what I mean. For "quick brown fox" in document one, the positions are 2, 3, 4: the words are right next to each other. All three entries point at document one, so "quick", "brown", and "fox" all appear in the same document, and their positions 2, 3, 4 are strictly increasing.
That means "quick" is in document one at position 2, "brown" is in document one at position 3, right next to it, and "fox" is at exactly position 4, next to "brown". So these three words occur in the same order, right next to each other, which makes this result much more relevant than the others: we found a document containing the same words in the same order as the query. There are many tricks to figure this out. For instance, if there had been an extra word between "quick" and "brown" in that document, "brown" would not have been at position 3; the gap would show up in the recorded positions. Now check "quick brown fox" against the second document: the words are there, but with a word in between. If our framework wants to treat that as a valid result too, how do we detect it? In the second document "quick" is at position 1, "brown" is at position 2, and "fox" is at position 4. They are not all exactly next to each other; there is a gap, so this is not an exact phrase match. But the order is preserved: "quick" appears first, "brown" after it, and "fox" after that. We might still consider this a valid case and include it in the results. There are many different ways to build such queries; it's all up to us and our use cases. These are some of the different tricks you can use for querying.
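The positional check can be sketched like this, using the positions from the example (document 1: quick=2, brown=3, fox=4; document 2: quick=1, brown=2, fox=4):

```python
positions = {  # term -> {doc_id: [positions]}, from the positional index
    "quick": {1: [2], 2: [1]},
    "brown": {1: [3], 2: [2]},
    "fox":   {1: [4], 2: [4]},
}

def exact_phrase(terms, doc_id):
    """True iff the terms appear next to each other, in query order."""
    for start in positions[terms[0]].get(doc_id, []):
        if all(start + i in positions[t].get(doc_id, [])
               for i, t in enumerate(terms)):
            return True
    return False

print(exact_phrase(["quick", "brown", "fox"], 1))  # True:  2, 3, 4
print(exact_phrase(["quick", "brown", "fox"], 2))  # False: 1, 2, 4 has a gap
```

A looser check that only requires the positions to be increasing (gaps allowed) would accept document 2 as well, which is the "valid case with a word in between" discussed above.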
I've written the term-frequency and inverted index table once again, in a slightly different way. We've learned how to search by a single word: this table can be implemented using a B-tree, a hash table, or a trie, any of these data structures, and they all help you get to the right entry much faster. If it's a hash table, you hash the word, figure out the position, and jump straight there, effectively O(1); so the result for any single word can be fetched very quickly. We've also learned how to get disjunctive and conjunctive results, with positions as well. Now there's one more use case: how do we search by just a prefix or suffix? Say I want to find any word matching "jum*": I don't care what comes after "jum", I just want all the documents matching that pattern. How do we do that? For that, when we build this table, we need to keep the keys in sorted order. What's the advantage of doing this? If you look at the keys in our example, "brown", "dog", "fox", "jumped", "lazy", "over": B, D, F, J, L, O, they are sorted alphabetically in the table, and that means we can do a binary search on the keys. To search for "jum", I jump to the middle of the keys: say I land on F, so I go right; halfway again I might land on L, so I go back left; and eventually I land among the keys starting with J. Then I keep narrowing until I find the keys that start with "jum"; I don't care what comes after. If there were many words starting with "jum", all of those keys would be considered for the result; in this case there is just one, "jumped", and that's the result. That's how we can efficiently answer prefix searches.
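With the keys kept sorted, the prefix lookup can be sketched with Python's bisect module (binary search for the first key at or after the prefix, then walk forward while keys still match):

```python
import bisect

# Keys of the index kept in sorted (alphabetical) order:
terms = ["bridge", "brown", "dog", "fox", "jumped", "lazy", "over", "quick"]

def prefix_matches(prefix):
    """Binary-search for the first key >= prefix, then collect
    every consecutive key that starts with the prefix."""
    i = bisect.bisect_left(terms, prefix)
    matches = []
    while i < len(terms) and terms[i].startswith(prefix):
        matches.append(terms[i])
        i += 1
    return matches

print(prefix_matches("jum"))  # ['jumped']
print(prefix_matches("br"))   # ['bridge', 'brown']
```

Each matching key can then be looked up in the inverted index as usual to fetch its documents.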
There are also wildcard queries, like "*zy*": I don't care what the prefix is or what the suffix is, I just want all the results for any word that contains "zy" somewhere in the middle. These kinds of searches are a little more tedious. Nevertheless, English has a limited vocabulary; I don't remember the exact count, maybe around ten thousand common words, I don't really know, so you could even do a linear scan over all the keys in the table and find every word that contains "zy". But that's overkill. There are other techniques, like prefix tries and related structures, in which you build another data structure to find all the keys matching these kinds of wildcard searches; let's not discuss that here, because it's a big topic in itself, and I think the concepts we've covered are the most important ones for you to understand. So, when you want to search by prefix or suffix, you can just leverage binary search; the only requirement is that the table's keys are kept in alphabetical order. That's how we can do this type of search.
Info
Channel: Tech Dummies Narendra L
Views: 111,179
Keywords: Amazon interview question, interview questions, interview preparations, software interview preparation, Facebook interview question, google interview question, Technical interview question, software architecture, system design, learn System design, search engine system design, inverted index, search engine design, system design search engine, how google finds pages, how inverted index works, search engine datastructures, indexing billions documents
Id: CeGtqouT8eA
Length: 41min 34sec (2494 seconds)
Published: Mon Jun 17 2019