Elasticsearch (Part 1): Indexing and Querying

Captions
Welcome. There's plenty of space up here in the front and middle if you're interested in moving in a little closer. This is Erik Rose from Mozilla, and he's going to talk to you about Elasticsearch, so please give him a warm welcome.

Thank you very much. Welcome to "Elasticsearch: The Missing Intro." What I mean by that: this is the orientation I wish somebody had given me when I was getting started with Elasticsearch. It's the stuff that either isn't in the manual at all or is scattered across the manual pages, so it's hard to pull back together and get the main idea. We only have 30 minutes here, or 25 without questions, so I want you to get the most bang for your buck. Here's what I'm going to do: I'm not going to quote trivia at you — you can look that up, it's in the manual. Instead I'm going to run this like a seminar on the fundamentals of Elasticsearch. I'm going to show you what goes on inside Elasticsearch mechanically, so that you'll be able to derive the answers to your own questions after you leave.

So who am I to attempt this? Well, in 2011 I was working on Mozilla's support site, and we were using Sphinx as our search engine, and we kind of hit our heads on it in a couple of ways. We were sick of reindexing as a batch every 15 minutes, and we wanted really good multi-language support, since we internationalized — or localized, I should say — into 80 different languages. So I led the team that transitioned us to ES. From there I went on to a startup called Votizen, where I used ES in a much different way: we would pull in bits of your Facebook friends and Twitter friends — names, addresses, states they lived in, countries they lived in, whatever we could get — and match that kind of fuzzy contact info, no doubt full of lies and inaccuracies, against the 180 million U.S. registered voters, and try to figure out which one of them was most likely identified with each pile of contact info. We got that down to about 23 milliseconds, with the aid of giving Amazon a hell of a lot of money.
When that startup got acquired, I moved back to Mozilla, where, in addition to my day job doing static analysis on the Firefox codebase, I do a lot of ES consulting. We have a lot of projects at Mozilla that use ES — everything from the company phonebook, which is very small and not challenging, to things like terabytes of Firefox crash reports that come in. I also maintain the pyelasticsearch library, which is a low-level Python interface to ES. I don't know if anybody outside the authors of Elasticsearch can really claim to be an expert, but I got started pretty early with it, and I've had the chance to make a lot of mistakes. I'm going to try to help you avoid most of them, with any luck.

So let's start with the basics: what is ES good for? Obviously it's good for full-text search. How many of you have ever thrown a query at ES? And how many are here just because you're thinking it might be fun? It's about half and half, maybe even a little more of the latter — fantastic. And full-text search handles multiple languages really well. I've also been continually impressed with how good a general data store ES is. These days, whenever I have a general data storage and retrieval problem, it's in the running right alongside relational stores and key-value stores. Segueing into that: big data. It doesn't do any transactions, so there's no lockstep synchronization across the nodes in an ES cluster, and there's no referential integrity enforced. As a result, ES parallelizes and distributes really, really well. It scales to at least terabytes — I put 280 gigs in and got those 20-millisecond response times fairly straightforwardly. Third, it's fantastic at faceting, and this is mostly due to the Lucene libraries it sits on, which are infinitely tweakable and academically impressive. Relational databases are pretty bad at faceting: typically you can do a GROUP BY count query, but you have to do one per facet — maybe one if you're faceting on color, another if you're faceting on price, and so on and so forth. And finally, geo queries. That's relatively new to ES, but it will happily do bounding-box and distance queries, and even really expensive polygon-containment queries, which it will very nicely parallelize for you.

All the parts of ES that aren't Lucene are largely a one-man show: Shay Banon, who goes by kimchy on IRC. Shay writes the clustering and the JSON and everything else that isn't raw information retrieval, and although the docs his team has written are extensive, ES is far more extensive. You can get the API out of the box, but a lot of the practical perspective is still tied up in the author's head, so it kind of tends to feel like this a lot of the time. Fortunately, once you understand the data structures, you can answer a lot of your own questions about how to format your stored data and efficiently query it.

So let's take a look at the shape of the data, in case you're unfamiliar with it. ES is a document store, kind of like MongoDB. Its documents look and feel like JSON, and you access them over HTTP, so you can use curl, web browsers, and all your standard urllib or requests stuff to talk to it. You can even put load balancers in front of it.
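To make the HTTP-and-JSON point concrete, here is a minimal sketch, not from the talk, of talking to an ES node with the requests library. It assumes a node on localhost:9200; the "talks" index and the document are made up for illustration.

    # A minimal sketch of talking to ES over plain HTTP with requests,
    # assuming a node on localhost:9200. The "talks" index and the
    # document body are hypothetical, made up for illustration.
    import requests

    # Store a JSON document (PUT creates or overwrites it at an explicit ID).
    doc = {"title": "Elasticsearch: The Missing Intro", "speaker": "Erik Rose"}
    resp = requests.put("http://localhost:9200/talks/talk/1", json=doc)
    print(resp.json())

    # Fetch it back.
    print(requests.get("http://localhost:9200/talks/talk/1").json())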
Now let's talk about the conceptual data structure — the one the API presents to you. At the first level you have indices, and there's no structure or schema information; they're just bags, kind of like the databases in a relational system. Inside those you have doc types, and doc types you can think of like tables, or in OOP terms like a class: they do have schemas attached to them. Now you might say, "I thought ES was schemaless — I thought I could just stick JSON in it and it would go." Well, you can. What will happen is that the first time it sees any new field in a document, it will infer a type for it. If there's a 3 in it, it'll say "that's an int field," and then if you try to put a string in it in a new document later, it'll freak out and throw an error. You'll get unhappy surprises. So when you make the documents that go inside a doc type, be sure to express your schemas explicitly. Also, if you express them explicitly, they end up in your source code, and you can kind of use them as documentation. You can think of the contents of a document like a hash table — it's very hash-table-y. It may not be stored exactly like that, but if you use that to reason with, you won't go too far wrong.

Now, if you want to store a bunch of things that have vastly different shapes in one Elasticsearch instance, you can go ahead and make multiple doc types — no problem. In fact, ES is even happy to query across multiple doc types. Here's something to keep in mind if you do that: if your different document types both have, say, a title field and a modification date, try to name those fields the same across the doc types, and then it becomes easy to do a single query. And of course you can have more than one index; that just works like you'd think it would, and ES is even happy to query across multiple indices, though it's a small performance hit. Go ahead and query across three or four or five, but maybe not across 5,000.

Now let's talk about the documents. Documents have IDs, and you can think of these like the primary keys in a relational system; they end up in the URLs in the API. A common pattern is to keep all your data wherever it is — in MySQL, Postgres, or whatever — and batch-index it into Elasticsearch to get started, leaving MySQL as the canonical data store, so you don't rewrite all your code. I advise you to take the primary key from your original data store and use it as your document ID in Elasticsearch. You'll be glad you did, because otherwise Elasticsearch will make really ugly document IDs for you — these GUIDs — and you have to put them somewhere, and they're enormous and unmemorable. So don't do that.
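As a rough sketch, not from the talk, here is what expressing a schema explicitly and reusing a relational primary key as the document ID might look like over the HTTP API. The "blog" index, "post" doc type, field names, and the ID 42 are all hypothetical; the mapping syntax is the 2013-era string/not_analyzed style.

    # A rough sketch: declare a mapping explicitly, then index a row
    # reusing its relational primary key (42) as the ES document ID.
    # The "blog" index, "post" doc type, and fields are hypothetical.
    import requests

    requests.put("http://localhost:9200/blog")  # create the index first

    mapping = {
        "post": {
            "properties": {
                "title":    {"type": "string"},
                "category": {"type": "string", "index": "not_analyzed"},
                "created":  {"type": "date"},
            }
        }
    }
    requests.put("http://localhost:9200/blog/post/_mapping", json=mapping)

    row = {"title": "Why I rant", "category": "rants", "created": "2013-03-16"}
    requests.put("http://localhost:9200/blog/post/42", json=row)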
Now let's start to explore full-text search. This is where things really get interesting. It's easy to look at the ES query docs and wonder, "well, how fast will this go? How can I tell?" But if you understand the data structures, you can reason that out. How does full-text searching of documents work? It's actually surprisingly straightforward: it does pretty much exactly what you would do if someone locked you in a room with a pencil and made you do it by hand. Think about concordances — who's ever used a concordance in here? Not too many. They've gone out of style, for obvious reasons: they were very, very expensive to construct. There's a monk in a back room somewhere figuring out which pages a word occurred on in a book and writing those down next to the word, and you can see how trivial it would be to work that backwards and say, "well, which pages did this word occur on? These two."

A full-text index is just like this, but instead of page numbers it uses document numbers. Here we can see which documents the word "row" is in — documents 0, 1, and 3 — so down at the bottom, in our index, we list 0, 1, and 3 next to the word "row". Similarly, here are the docs that "boat" is in, 0 and 1, so in our index "boat" points to 0 and 1, and "chicken" is only in document 2, so "chicken" points to just 2. Now that we've built up this index — called an inverted index — it's trivial to do full-text searches on those three words. Let's try a few. Look just at the bottom here, just at our index, and try to figure out which documents contain the word "boat". Who thinks document 0 contains the word "boat"? Yep. How about document 1? How about 2? Wrong — okay, very good. Let's try something a little harder: which docs contain the word "row" or the word "boat"? 0, 1, 2, 3 — good, keep it going. And you can also see how you could find which contain both, 0 and 1: you just take the intersection of those two sets. You can imagine giving each document a point for each search term it contains, and that would give you a basic scoring algorithm. This is roughly what ES does, although it does a couple of other fancy Lucene things — for example it says, "well, 'chicken' is a very rare word in your corpus, so if you match 'chicken', that's kind of a rare gem; we'll give you extra points for that."

Now let's try something a little trickier: a phrase match — find these words in exactly this order. How would we use this index to search for the phrase "row boat"? You can't do it. The best we could do is see that only documents 0 and 1 contain both words at all, and then we'd have to scan through those documents, which could be six-meg documents, right? So let's make our index finer-grained. What we had before is in the yellow column on the left, and the new stuff is on the right: next to each document number, we now keep track of the positions where each of these words occurs — in other words, the word numbers. For the first line, this means the word "row" occurs in document 0 at positions 0, 1, and 2: "row, row, row your boat". Now we can use this to search for the phrase "row boat". First, we can eliminate any document that doesn't contain both words, so the red ones are still under consideration. Then all we have to do is examine the positions array. We'll start with document 0: "row" occurs there at positions 0, 1, and 2, so to match the phrase, "boat" would have to occur at 1, 2, or 3 — just offset by one. Does it? No, it does not; it occurs only at position 4, so document 0 is out of consideration. Meanwhile, maybe on another CPU, we could be examining document 1: "row" occurs there at positions 0 and 2, so "boat" would have to be at 1 or 3. Does it occur there? Why, yes it does. So just by looking at the index, we know that document 1 contains the phrase "row boat" — we never had to look at the original text. This, again, is called an inverted index, and it's what ES uses on all of your text fields.
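Here is a toy version of that positional inverted index — not Elasticsearch code, just the by-hand idea from the slides done in Python, with made-up documents arranged to match the row/boat/chicken example.

    # A toy positional inverted index and phrase match, mirroring the
    # row/boat/chicken example above. Not ES code; purely illustrative.
    from collections import defaultdict

    docs = ["row row row your boat", "row boat row ashore", "chicken", "row"]

    # term -> {doc_id: [positions]}
    index = defaultdict(dict)
    for doc_id, text in enumerate(docs):
        for pos, term in enumerate(text.split()):
            index[term].setdefault(doc_id, []).append(pos)

    def phrase_match(first, second):
        """Return doc IDs where `second` occurs right after `first`."""
        hits = []
        for doc_id, positions in index.get(first, {}).items():
            later = index.get(second, {}).get(doc_id, [])
            if any(p + 1 in later for p in positions):
                hits.append(doc_id)
        return hits

    print(index["boat"])                 # which docs contain "boat", and where
    print(phrase_match("row", "boat"))   # docs containing the phrase "row boat"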
The preceding was all based on breaking documents down into words, and we glossed over how that's done. That is a subject in itself, known as analysis. Analysis breaks text down into words — or, more generically, into "terms," which is the vocab word — because you might not always want to break on word boundaries, hyphens, spaces, and such. ES comes with several analyzers that know how to do this in various ways. Here I've taken each of the stock analyzers — whitespace, standard, et cetera — and applied them to an English sentence, and you can notice the differences. First, certain analyzers have stripped out what we call stop words: words that don't have a lot of semantic significance because they're just overly common, like "and" or "at" in this case. Second, some of them stem words — they extract the roots — so "gerbils" becomes "gerbil" on the last line, and "orange" turns into "orang", which is really strange. Then punctuation: most of these strip the number sign, for example, and the period off the end of the sentence. And finally they've done case folding, mashing everything down to lowercase — uppercase would have been fine too, it doesn't matter. This last analyzer, the snowball one, is particularly worthy of mention: it's got built-in support for stemming 22 different languages, and it's got stop words for a lot of them, so you don't have to bring your own stop-word lists. It's where I go first for bodies of prose — a good first step, though as you can see, it's not perfect.

The analyze API is really handy for seeing what the results of the analyzers will be — for kind of auditioning them. That's how I built that last slide: I just ran this, said which analyzer I wanted in that first bold thing, and then took a look at the tokens in the result. To use a particular analyzer, you just specify it in your mapping — and a mapping is that schema I talked about that doc types have. I didn't give you the entire big crazy hash for doing that, but you can pull it out of the docs; this is just the interesting segment, where we choose an analyzer for a specific field, the address field in this case.

Analyzers are made out of three stages, and they flow one stage into the next, like a pipeline: char filters, tokenizers, and token filters. Char filters are just what they sound like: they manipulate a raw stream of characters coming through. The only two char filters out of the box are one that removes HTML elements and one that replaces specified combinations of characters with other ones — so you could replace "ph" with "f" if you wanted to and make a poor man's phonetic key, maybe. The second stage, the tokenizer, picks how to divide text up into words or terms. For example, the letter tokenizer breaks the text on anything that is not a letter, so a name like O'Brien — O, apostrophe, Brien — would turn into two terms with a letter tokenizer: "o" and "brien". And finally, the token filter at the end. This is the most interesting part; this is where all the magic happens. It does stemming, removing stop words, and so on, and at the end, out pop the terms, and the terms turn into those words in the inverted index we already looked at.

The stock analyzers are prepackaged combinations — kind of like choosing from columns at a Chinese restaurant: choose some char filters, a tokenizer, some token filters, and you get these pre-built bundles out of the box. But you can also build your own by choosing an implementation of each. For example, this is a settings excerpt which defines a custom analyzer called "name_analyzer", and you can reuse it throughout your schemas. It doesn't give a char filter, so we don't see one. For the tokenizer, it uses one built on a custom regex — at the bottom — which divides the terms on anything but letters and apostrophes, so apostrophes are included and O'Brien stays one term. Then it uses the lowercase token filter, which turns everything into lowercase, so your searches can be case-insensitive. This analyzer also splits hyphenated names into two terms, so that queries that have both halves of the name, but maybe don't put a hyphen between them, can still match.
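The slide itself isn't reproduced here, but a settings excerpt along the lines described might look roughly like this. The exact regex, the "people" index, and the tokenizer name are illustrative, not the slide's literal content.

    # Roughly the shape of a settings block defining a custom analyzer
    # like the "name_analyzer" described above. The regex and names are
    # illustrative, not the slide's literal content.
    import requests

    settings = {
        "settings": {
            "analysis": {
                "tokenizer": {
                    "name_tokenizer": {
                        "type": "pattern",
                        # Split on anything that is not a letter or apostrophe,
                        # so O'Brien stays one term and hyphenated names split.
                        "pattern": "[^a-zA-Z']+",
                    }
                },
                "analyzer": {
                    "name_analyzer": {
                        "type": "custom",
                        "tokenizer": "name_tokenizer",
                        "filter": ["lowercase"],
                    }
                },
            }
        }
    }
    requests.put("http://localhost:9200/people", json=settings)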
So far I've talked about analyzers merely for indexing things, but there's another time when analyzers are used: at query time. When you put a search box someplace and somebody types a query into it in uppercase, by default the same analyzer that ran at index time also runs on your query, so your query will be folded to lowercase and split up, maybe into "o" and "brien", and you can see how it's much simpler to match that lowercased, split-up version against the static data in your index.

Now, there are certain instances where it's helpful to be able to choose a different analyzer at query time than at index time. One of those is when you deal with synonyms. Synonyms are handled by a special kind of token filter: when it sees one word, it replaces it with another one, or with lots of different words. You can't change the synonym list of the index-time analyzer on the fly, because if you think about it, ES would have to go over the whole corpus again and recompute the index for all your documents — it would take forever. But you can change the synonym list of your query-time analyzer, because there's no document corpus bound to it; it just happens dynamically. So if we map "Elbert" to "Albert" and "Al", and "Alan" to "Alan" and "Al", we can do a search for "Alan Smith" and that effectively turns into a query for "Alan or Al, Smith". The notation there means that "Alan" and "Al" are both superimposed on position 0 in the index — that's okay, they can overlap; ES is perfectly happy to deal with those superpositions. Likewise, if you search for "Albert Smith", it would really search for "Albert or Al, Smith", so you'll find what you're looking for either way.

Now, I like to set up my query-side synonym mappings using the update-settings API, which is a departure from what the docs recommend. They recommend putting everything in a config file on the server, which is kind of a drag: you need to puppet that out to all your servers — fine, you do that anyway for your real config file — but then you also have to restart the server so it picks it up, and nobody likes to do that. So why do that? Use update settings.
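A sketch of what pushing a query-side synonym list through the settings API might look like is below. The index and analyzer names are illustrative, and note, as an assumption about version behavior, that on many ES versions the index must be closed before its analysis settings can be changed and reopened afterwards.

    # A sketch of pushing a query-time synonym list through the settings
    # API instead of a config file. Names are illustrative; on many ES
    # versions the index must be closed before analysis settings change.
    import requests

    base = "http://localhost:9200/people"
    requests.post(base + "/_close")

    synonym_settings = {
        "analysis": {
            "filter": {
                "name_synonyms": {
                    "type": "synonym",
                    "synonyms": ["elbert => albert, al", "alan => alan, al"],
                }
            },
            "analyzer": {
                "name_query_analyzer": {
                    "type": "custom",
                    "tokenizer": "standard",
                    "filter": ["lowercase", "name_synonyms"],
                }
            },
        }
    }
    requests.put(base + "/_settings", json=synonym_settings)
    requests.post(base + "/_open")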
Now that we know and are comfortable with the index data structures, let's talk about how to get data out of them efficiently. The best way by far to query ES is via JSON; this is called the query DSL, and here's a kind of medium-sized example query. ES also understands Lucene-style textual queries — who's familiar with those, the ones with capital ANDs and NOTs in them? I don't really know them very well. The trouble with exposing those via your application is that a parse error on a partial query kind of explodes and throws ugly errors, and you don't know what to feed your users then; plus they can only express a small subset of what the JSON queries can. The JSON queries, though, are basically ASTs — abstract syntax trees. They're cumbersome to build by hand, and they're not terrifically easy to read, but on the other hand they are immensely powerful and easy to build programmatically. In fact, there are several libs that are really good at making them easy to build; we'll talk about that in a moment. The documentation around the query syntax is among the most comprehensive in all the ES manual, so I'm not going to sit here and read it all to you — you can do that — but I do want to cover some bumpy ground that is almost guaranteed to trip up newcomers.

They call it the query DSL, but really there are two constructs that can go into it: filters and queries. Two things that can go into a query — great idea, right? It's important to understand the difference between these two constructs, because picking the wrong one will kill your performance. First, filters are boolean: they either match a doc or they don't, there's nothing in between, no ranking or scoring. Queries, on the other hand, score and rank docs — they can match one doc more strongly than another. Because they don't have to do any ranking, filters are an order of magnitude faster, so: filter when you can, query when you must. What's more, filters are automatically cached, so if you use the same type of filter with the same parameters, it'll go crazy fast the second and following times. And here's a really fun thing about the caches ES keeps: they don't get invalidated at the drop of a hat like most caches do; instead they get updated as the index changes. So if I've filtered for all my blog posts whose category field is "rants", and then I write a new rant, that filter cache — which is stored as a bitmap internally — will actually be updated to take the new doc into account. When memory gets full, these things get evicted in least-recently-used fashion, as you might imagine.

Anything you can imagine doing by hand with the index data structures we've already talked about, you can do with the query DSL. For example, a common pattern is using a filter with a query — there's a specific kind of query for this, and here's one of them. It combines some filters with some queries, kind of piping one into the other: the filters narrow down the set of considered documents, just like we did by hand, and then the queries further narrow them down and rank the remainder. This is about the most realistic example I could fit on a slide; they can get really long — they'll fill up a page of your editor, no problem. Let me walk you through this one. The second operative line here says this is a filtered query, so if you looked in the documentation under "filtered query" you'd see it has a filter and a query, and you can see those as the next set of keys within it. Inside the filter is a term filter, which says the field "category" should contain "rants". Meanwhile, in the query we have a boolean query — bool queries are maybe even overkill for this — but one of the things they can do is say "this query should match and this query should match," meaning a document won't necessarily be excluded if it doesn't match, but it will get a nice score bonus if it does. This illustrates another pattern: here we have a match_phrase — a phrase match — and then a normal match, which just says the more of these words that are in there, the better, and both of them work together to build the final score. You can think about why this might be nice: if I search for, say, "fix my little red wagon", I'd expect the top hits to contain exactly "fix my little red wagon", and then as we move down the results, maybe some of them just say "fix my wagon", for example. This way we're sure to catch anything that has all the words, but we give a nice ranking bump to the ones that have the exact phrase.
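Roughly the shape of the query being walked through, reconstructed rather than copied from the slide, with an illustrative "body" field and phrase:

    # Roughly the shape of the filtered query described above: a cheap
    # term filter narrows the candidate set, then a bool query's
    # "should" clauses rank what's left. Field values are illustrative.
    query = {
        "query": {
            "filtered": {
                "filter": {
                    "term": {"category": "rants"}
                },
                "query": {
                    "bool": {
                        "should": [
                            {"match_phrase": {"body": "fix my little red wagon"}},
                            {"match": {"body": "fix my little red wagon"}},
                        ]
                    }
                },
            }
        }
    }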
Now let's talk about how to do this from Python. pyes was one of the first libraries for getting at ES. I used it for a while; it was fine, but I have a couple of objections to it. It can't really decide what level of abstraction it wants to operate at, it does weird things like having no-op lines in the code, and it does socket calls from destructors — so if you somehow have circular data structures holding an ES object, you never know when your bulk indexing is actually going to happen. The code is also kind of Byzantine and hard to read, especially the connection-handling and failover code, which should be the most auditable. So I don't recommend it.

Instead, I like pyelasticsearch right now. At Mozilla we needed a more reliable client, and I found this one lying around and essentially rewrote the thing so that it would have more solid connection pooling, better load balancing, and so that it would say, "well, that node timed out when I tried to make a query to it; I'm going to put it in a penalty box for five minutes and not try to query it anymore." I also made the API a lot more consistent and improved error handling, so if you ever get a bad HTTP code back from ES, it'll be sure to raise an exception — it won't just get swallowed. It's fairly low-level: it has methods for each of the ES API calls, but you do have to construct the dict equivalent of the JSON yourself. So it's my favorite choice for complex queries, where you want to start mucking around with something like custom scoring. But if you want to do something simpler — maybe combine kind of Venn-diagram-y sets of things like Django's ORM does — you might want to take a look at elasticutils. It's a higher-level library built on pyelasticsearch, so you get all that load-balancing stuff for free, and it gives you a kind of Django-chainable flavor to your API — shorter, terser — though it's sometimes hard to translate between that and all the JSON on the ES documentation pages. It also gives you more convenient, object-y ways of getting at your results. So is the abstraction worth it? It depends on where you're coming from. If you're coming into it fresh and you have to learn one API or the other — ES's or elasticutils' — then choose whichever you like better. But if you already know ES, maybe just go with pyelasticsearch and don't learn an additional API. And remember, you all have to learn a little bit of the ES API anyway, so you can read the documentation. Finally, there's django-haystack. I don't know too much about it, but one of the cool things is that it's kind of Django-ORM-y and it has pluggable backends: you can point it at ES, Solr, and a couple of other engines.
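A minimal sketch of the kind of low-level pyelasticsearch usage described — assuming the 2013-era API (an ElasticSearch class with .index() and .search()); treat the exact signatures as approximate, and the index, doc type, and document as made up.

    # A minimal sketch of low-level pyelasticsearch usage as described,
    # assuming the 2013-era API; exact signatures are approximate, and
    # the "blog" index, "post" doc type, and document are hypothetical.
    from pyelasticsearch import ElasticSearch

    es = ElasticSearch('http://localhost:9200/')

    # You build the dict equivalent of the JSON yourself.
    es.index('blog', 'post', {'title': 'Why I rant', 'category': 'rants'}, id=42)
    results = es.search(
        {'query': {'match': {'title': 'rant'}}},
        index='blog',
    )
    print(results['hits']['hits'])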
I'm afraid we're out of time. I hope this has given you a really good taste of the ES internals, I hope you find uses for them, and I hope your new understanding helps you avoid my past mistakes. Thanks so much for coming. If you have any questions, you can come up to the microphone — we've got just a couple of minutes left — or you can raise your hand and I'll run over to you. Part two of this talk will be on Sunday at 1:10; we'll be talking about clustering, replication, sharding, deployment, general maintenance concerns, monitoring, all that good stuff. Somebody's got to have query questions — come on.

All right — with the pyelasticsearch library being so low-level, do you actually get much out of it, as opposed to just using, say, requests and treating ES as a JSON web service?

So the question is: pyelasticsearch is so low-level, what do you get out of it compared to just requests? Requests is great — I use requests inside pyelasticsearch; it handles the pooling. What it gives you in addition is something I think is very important, which is errors that are hard to accidentally swallow: you get a bad response code back, it'll go kaboom, and it has kind of nice exceptions that let you catch things like "the index wasn't there at all" and distinguish that from a 500 error meaning bad things are going on. And then the failed-node penalty box is another important bit.

Hello. I realize there are a lot of variables here, but I'm wondering if you'd be brave enough to throw out a number for indexing speed that you think is decent, from what you've seen with Elasticsearch.

So the question is: if I'm working with Elasticsearch, how might I know my numbers are way off and something is wrong with my configuration — what's a reasonable indexing speed? A thousand documents a second is a perfectly reasonable indexing speed for a cluster of eight nodes fed with one Python process. Typically ES will happily outrun your Python processes; you typically need to run just short of one Python feeder per ES node. There are a lot of variables here — I'll talk about that in part two.

Is it possible to give an edge to certain terms — like, say, whenever I type in a certain query, have a particular result come up first? How hard or easy is it to boost up certain results?

So it's a boosting question, but I didn't quite get the specifics — what do you want to boost up?

It might be, say, "these ten results should always be at the top."

Ah, so the question is: if I'm doing a manual audit of my search results, and I put in "little red wagon" and my little-red-wagon document isn't coming up at the top, how can I force it up to the top — is that easy or not? There are two ways to go at this. You can make it a hard requirement that this document will be at the top and handle it with if-thens in your app logic, or in a separate step — which is what you have to do if people just don't understand — but ideally you're reasonable and you can solve it algorithmically, instead of manually chasing these things down forever. One of the approaches is obviously the match phrase — mix that into your query someplace. The other approach is quasi-manual: if people just won't leave you alone about this, or you really want to be able to influence things, you can make a keywords field, throw stuff in there, query against that while you're also querying against the full text, and really just boost the heck out of it.
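A sketch of that quasi-manual pattern — a hand-curated keywords field queried alongside the full text and boosted heavily; the field names and boost value here are illustrative, not from the talk:

    # A sketch of the quasi-manual approach just described: a curated
    # "keywords" field queried alongside the full text and boosted
    # heavily. Field names and the boost value are illustrative.
    query = {
        "query": {
            "bool": {
                "should": [
                    {"match": {"keywords": {"query": "little red wagon",
                                            "boost": 10}}},
                    {"match": {"body": "little red wagon"}},
                ]
            }
        }
    }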
I know this could be a talk in itself, but maybe you could say why you chose Elasticsearch over other search engines? I'm trying to adopt something for my app, so I'm just kind of curious — maybe a couple of sentences on this. Thank you.

So, Solr — think of another one and name it, and I can address it.

Sure — something like, well, Whoosh? I don't know; I'm going to go to the Whoosh talk later on, that's fascinating to me.

Okay. So, Sphinx: I said Sphinx didn't support incremental indexing when I got off of it — it does now. I also think you can really only put integers into it, but it's been a long time since I used Sphinx. Comparing Elasticsearch to Solr: Solr 4.0 has clustering now, but 3.x did not, so that was a big thing while it lasted. ES is still easier to configure, I believe — it's very easy to build up a cluster, a little harder to build it up well — and the JSON querying is nice; I like that a lot. Whoosh reminds me of ZCatalog; it's a Python implementation of an inverted text index, and I suppose speed would be a major thing there — put PyPy around it and we'll see. Thank you.

All right, thank you very much. If you have more questions, I'll be somewhere else — right there, right there. And actually Whoosh is our next talk, so stay tuned.
Info
Channel: Next Day Video
Views: 127,981
Keywords: psf, pycon2013, talk, ErikRose
Id: lWKEphKIG8U
Length: 31min 16sec (1876 seconds)
Published: Fri Oct 16 2015