Elasticsearch in an Hour

than three of you. I have given this talk within a room of three people, so hopefully, uh... That was... they had a lot of trouble, but we got to work back and forth and figure it out. Anyway, you're here to learn Elasticsearch. I've got 50 minutes, I believe, so we'll see if I can fit an hour-long talk within 50 minutes.

Hi, my name is John Berryman. Just to do a quick introduction to myself: you can find me on Twitter; a lot of my self-worth is derived from my Twitter followers. Growing up I was a pretty nerdy kid. I started reading programming manuals when I was in, like, the first grade, and ended up getting into aerospace engineering. That was my first career. I decided that satellites and all that stuff were pretty cool, but I liked the programming and I liked the math, so after about four years in the field I got into search technology. I was a consultant. I wrote a book, that guy right there. I wouldn't necessarily recommend doing that with your life, but it's a good calling card. And now I work at Eventbrite. I am a discovery engineer, so search and recommendations and stuff like that.

To give you a little preview of what we're gonna talk about: this is not really an advertisement for Elasticsearch, but a lot of what we're doing involves my mental model of thinking through the Eventbrite problem, so I'll just give you a little shared background. Historically, the company I work at, Eventbrite, has been a very organizer-focused startup. We allow organizers who want to put on their own events to come to our website. You can build a nice little web page with little effort expended, you can sell tickets, we take care of all the credit card mess, you have a platform for messaging attendees, and we get their metrics, so after the event is done you get to look back and make sure that your next events are as good as or better than this event. But after years of actually nailing down this side of the market pretty well, my
company realized that, look, we've got all this inventory. We're basically white label, but everyone is plastering their events on our website. If we can turn around and sell our inventory to everyone else, then organizers are happy and the customers are happy, because you can find something to do over the weekend, and we're hoping to generate the so-called flywheel effect. This is exciting for me, because this is where I belong: creating the marketplace is all about building search and browsing and recommendation features for Eventbrite, and of course this technology is based on Elasticsearch, which is what we're talking about today.

But can you guys keep a secret? I know we're supposed to talk about Elasticsearch today, but I gotta tell you, I'm actually more interested in talking about my new startup, event dark. Yep. So don't tell anybody, but I'm going to start competing directly against Eventbrite. Our guiding principles... and I am sorry to do this to you, I know we're supposed to talk about Elasticsearch, but Elasticsearch is hard. So this new startup, and you guys can join me if you'd like, is gonna be focused on MySQL, because everyone knows how databases work. Databases are easy. Let's just build on a tried-and-true platform and not overthink it. And our specialty, because I found a free data thing online, is cat-related events. We'll start with cat-related events. Good, see, we have some attendees already. And then we'll expand to other fields. All right, we have someone who'll at least buy our tickets, so we have a marketplace. Excellent.

All right, so building this new website is going to be pretty easy. There's not really too much to an event, so here's our schema with MySQL: we're gonna have an ID (an integer), name, description, city, start date. You can look at all that; it makes pretty good, simple sense, right? And my hypothesis, which I know will play out well, is that we can build a website based on this. I'll demonstrate it.
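As a rough illustration of the schema just described, here is a minimal sketch. It uses Python's built-in sqlite3 module as a stand-in for MySQL, which is an assumption made only so the example runs without a database server. The column names follow the fields listed in the talk; the sample rows and dates are made up.

```python
import sqlite3

# In-memory SQLite stands in for MySQL here (an assumption for portability).
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE events (
        id          INTEGER PRIMARY KEY,
        name        TEXT,
        description TEXT,
        city        TEXT,
        start_date  TEXT   -- ISO-8601 date string; SQLite has no native DATE type
    )
    """
)

# Hypothetical sample rows; names echo the talk's cat theme, dates are invented.
rows = [
    (1, "Teach Your Cat to Knit", "Bring yarn.", "Nashville", "2015-06-01"),
    (2, "BYOC Cat Dance Party", "Bring your own cat.", "Nashville", "2015-06-02"),
]
conn.executemany("INSERT INTO events VALUES (?, ?, ?, ?, ?)", rows)

event_count = conn.execute("SELECT COUNT(*) FROM events").fetchone()[0]
print(event_count)
```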
Here's our event search: SELECT * FROM events. That gives us all the details we'll need back for the website. We have date range search; obviously you'll need that to find something this weekend. We have geo search. Not hard. Why invest in all that stuff? We can just do string matching. And finally, it's easy to search for events that you like. So I want to find an event where the name equals "cat". The results are: nothing. So this is the interactive part. Why do you think there might not be any results for that particular MySQL query? Yeah, OK, so that's a little problem. I could spell "cat" with misspellings... You guys are overloading my brain, but I think we can still make this work out. Now, you guys can't spoil all my slides before I get to them.

All right, so the particular problem here is, to the first answer, probably no one is going to name their event "cat". Would you like to come to see "cat"? But OK, MySQL solves this for us. We can use a LIKE query, like '%cat%', and the results that come back are, yep, "teach your cat to knit", and even "cat bowling" and "BYOC cat dance party". We're on board. So that was just a silly thing to show you that we can probably accomplish this. Let's get more serious, with a more serious query. Someone's likely to be looking for a cat farming seminar, so we're gonna help them do it. Not in a bad way... that might have a particular meaning to you that it doesn't to most of my audiences. No, not that event. So anyway, how do we search for this? If someone comes to our website and they look for "cat farming seminar", we do SELECT * FROM events WHERE name LIKE '%cat farming seminar%'. Results? Well, it's in red; that's the thing I would like to match, but it doesn't match. Interactive time: what have I done wrong now? OK, that's right. MySQL is all uppity about case. So this is also not hard. All we have to
do is, whatever people type to us, we lowercase it, and it'll still work. "Cat farming seminars": OK, great, that matches. But "seminar for farming of cats": not such a match. Anyone have any ideas how I can deal with this one? "Cats or farming"? Well, let's try. So OK, let's do something like this. Good idea, good idea. Well, it's starting to itch me a little bit, because I heard that LIKE is not as efficient a query as a pure match. But surely not, right? And we're doing it three times, so it's kind of like scanning every document in the database three times, right? But we'll probably shard it, and at scale it will be fine, I'm sure. So anyway, we do indeed match that "seminar for farming of cats", but we don't yet match "making a cat farm: the seminar". And now you're totally in my head, because I didn't realize that this was a potentially derogatory thing. "Making a cat farm: the seminar": so why does "farm" not match "farming"? They're the same thing, right? Yeah. I've heard search technologies do a pretty good job of understanding language, so I guess we'll have to cut off the ends of the words. So "farming" becomes "farm"; at least that'll match "farmer" and "farms", and it'll match other stuff, and we do indeed get back the results we want.

You want to poke some holes in my little theory here, though? This is an old presentation. Are you telling me I should have retired my presentation after this time? Oh yes, you're right. OK, so I should have updated the dates on my examples on my slide, for Mr.
Michael handle in the front row. So, next one: "cat farm class" doesn't match either. It's a class; it's kind of like a little mini seminar. In order to make that work, I'm gonna have to do a... what am I gonna have to do for that one? Oh, OK. It doesn't match all the terms, but at least if it matches a couple of them, that should be good enough, right? So I replace my ANDs with ORs, per someone's suggestion earlier, and what happens? I do indeed match everything I want, and I match all these things that I don't want. And since there's no notion of which match is better than another match, all the stuff with a cat event goes to the top, and this whole thing is about cat events. So guys, I think we're sunk. I apologize for taking you through the startup with me.

Databases are very good at some things, but search engines and search technology are very good at a different set of things. In particular, search engines are quite good at finding documents that don't just match exactly what you have, but contain specific tokens, phrases of the tokens, and different mutations of the tokens. They understand English, in a way that I think you'll understand when you leave here. Then there's scoring and sorting of documents: MySQL finds the set that matches, whereas with Elasticsearch, as we'll see in a little bit, you can put into it an understanding of how good or bad a match is for particular search terms. And finally, this is something that both MySQL and Elasticsearch are good at, but it's become an interesting, more recent use case for search technologies: search engines are actually really good for filtering, grouping, and aggregating data. Search engines came out of the information retrieval field, but they're being used more and more for log analytics and stuff like that, and we'll touch on that right at the end.

All right, so now, since we have failed, let's go ahead and get back to the main talk that you guys came here for. We're going to teach about
Elasticsearch. In the next thirty minutes we'll do a really quick and dirty application: I'll show you how to pull down Elasticsearch, create an index, index stuff, and retrieve it. We'll take a peek under the hood so that you can see the data structures and algorithms in place. Fortunately, the data model for Elasticsearch is simple, simple enough that you can leave with a basic understanding of it. As I promised, we'll get into some of the data aggregation stuff that Elasticsearch has been used for more recently, and then we'll hopefully have a little time for questions.

What in particular I want you guys to get out of this is a couple of meta goals. One, I want you to see me using a very basic implementation of Elasticsearch, and I want it to be approachable for you, so it's a tool on your shelf that you can grab and learn more about when you need it. The second thing, and I encourage you to do this with any data store technology that you want to use: I want to impart an intuition about how these data structures work, what they're good at, and a little bit about what they're not good at. This means that when you reach for the shelf to get your tool, you actually get the right tool.

So, building the basic search app is not that hard, and you can get, I'd say, far. There's a lot of tuning that comes with Elasticsearch, getting the behavior and the notion of relevance just right, but getting the thing out of the box and turning it on will actually get you about fifty percent of the way there. So it's a really quick technology to get up and running and get some good results.

Installing and running Elasticsearch is pretty easy. You all probably know what wget is, so find your favorite mirror and pull down Elasticsearch. In this case, I do need to update my notes here; it's a little bit older version of Elasticsearch. But pull it down, unzip it to wherever you want it to live, cd into
that directory, and then start the binary, bin/elasticsearch. Once you do that, you can just curl localhost at the Elasticsearch port, 9200, and it tells you, hey, "You Know, for Search", in case you forgot that it was for search. But Elasticsearch is now up and running.

Just like with MySQL, with Elasticsearch you will want to think in advance about the type of data that you're going to be interacting with and build a schema for it, or, as they say in Elasticsearch, a mapping. Now, Elasticsearch is interesting here, because early on they advertised that they were a schemaless data store. In the age when MongoDB was rocketing off, everyone was kind of tacking on to this, and it was true to an extent: you could just start dumping information into Elasticsearch, and that's gained Elasticsearch a lot of popularity. But it's still kind of an anti-pattern. In my opinion, after years of using this technology, it's still very important to think through what you're getting ready to do with this thing.

So, setting up the mapping is simple. Everything in Elasticsearch is a JSON interface, and this is a Python conference, so in every example that you'll see here I am using the Python client. It's really nice; it's a fairly thin layer over the JSON interface of Elasticsearch. When you're setting up a schema, all you have to do is specify the fields that you're going to have, in this case ID, name, description, city, start date, price, and you get all of the things that you would typically think of as existing in a data store: numbers (integers, floats), strings, dates. You can also start to get more complex things, like dates and geo locations that are a little bit more aware than, you know, just two numbers; it knows what a location is. But one thing I'll be focusing on is that not only can you have strings, but you can say that your strings are special in some way. For example, an ID is a type of string, but it is a string that is not
analyzed. That means that we're not going to do any special massaging and trying to understand this as a string from natural language. However, both the name and the description here I've marked as having an analyzer that is English. So this is me giving Elasticsearch a hint that not only is this blob of bytes actually text, but it's English text, and I'll show you what that means to Elasticsearch in a little bit. It's interesting, because you don't have to put English here. You can put Chinese or Japanese or most any language that you'd want, and you can make up your own stuff. There are extra rules you can put in: for example, if you have camelCase strings because you're indexing programming languages, you can break them up and make your own analysis chain for it. And then, of course, here's me using the client: you create the Eventbrite index with that mapping structure.

OK, so we have an index set up, ready to receive events. Actually adding the events at that point is pretty simple. You have an array of events, and it's just JSON blobs again. The client is nice because you can use datetimes and it does the right thing. The simplest version is just to iterate: for every doc that you have, dump it into Elasticsearch. This does make an HTTP request for every doc, so there are batch methods for once you really want to put this into production, but that's an easy way to get up and running.

OK, so now we've got a bunch of documents in the index. The next bit is to pull stuff out of it, and the easiest way to explain this... oh yeah, sorry for the microscopic text. How horrible is that to people in the back? I'll just speak louder. So the simplest building block for pulling stuff back is this match_all query, and it does exactly what you think. It's effectively the SELECT * FROM the events table. It gets everything back in the order that you indexed it in, and you don't have to understand what
is on the screen here, but I'll provide these notes on my Twitter account later so you can see it. It gives you back what you'd expect: it tells you how much time the query took, it tells you if there were any errors, and, importantly, it gives you all the hits back, all the documents that match the query, sorted by how well they match the query. In the case of match_all there's no notion of relevance, and you just get them back in the order that you indexed them.

All right, so that was like the hello world of making a query, but there are a lot of different things you can do to craft the notion of relevance: what is an important document, what should match, what should not. The smallest building block for these is the so-called term query. So if we have an indexed document, an event in Nashville, and I want to make a filter over all the documents and only hit documents corresponding to the city Nashville, then that's a term query. I say this is a term, the field is city, the token is Nashville. The special thing about a term query is, just like earlier when I said "not analyzed", term means that this is just a token. It has to be capital N-a-s-h-v-i-l-l-e. It doesn't do anything special. And so that's a match.

But where it gets interesting, and where you really get a benefit from a search engine, is when you start incorporating this notion of, hey, this is not just a string, this is actually English text. So if we have a sort of stupid document here, name equals "Filbert Sorting for Fun and Profit", then a query that is not of type term but of type match actually applies that special knowledge that this is English. And so, rather than looking for "sort filberts", the exact tokens there, it knows that it can be lowercased, we can split on spaces, and "sorting" and "sort" should be basically the same information. And so that's a match. Compare that to how you'd have to do that in MySQL: you would have to make a horrendous query to make that one simple
match right there, and it would also be very poorly performing, for reasons that I'll get into in a little bit.

Getting more complicated, because your application has to have a lot of different ideas mixed together: you can do phrase matching. So not only do we have the notion of matching documents that have these terms, but we want a document that has the terms "sorting" and "filbert" in it, in that order. This is not a match, because the original document had "Filbert Sorting". However, if we search for "filberts sort", that is a match, despite the fact that it's different from the original document: the original document has uppercase and has different parts of speech. But think about a user looking for something: you don't quite remember the... in the movie, but you're probably going to get something like this. So getting these types of fuzzy matches is a specialty of search technology. "Filbert fun" won't match, because there's space between "filbert" and "fun"; just one more example of how match phrase works. But you can add this notion of slop (everyone chuckles when I say that one, but that's what it's called). You can add slop, and it'll find any document that has these two words within a space of two. And you can go nuts with this. I once had a gig with the US Patent Office, and with their search technology, which they were getting rid of and moving to a different search engine than Elasticsearch, Solr, they really wanted to say: I want to find this word within the same sentence as some other word, and I want to find it before, or within some number of words. And so you can take this same behavior and overload it and get some really complex search behavior.

But everything I've showed you to this point is just atomic. It's like, I want this thing or that thing. You have to have a way of gluing these things together. In Elasticsearch, that is a Boolean query. In normal notions of Boolean you think ANDs and ORs and NOTs. Elasticsearch has that
but using different terminology: rather than AND we say must, rather than OR we say should, and then NOT is must_not. So that one makes pretty good sense. And if you play around with a few queries, you see why they moved to this terminology. Usually you have an array of things that must match, so in your Elasticsearch query you have a must key, and you stick all the sub-clauses that must match there. Additionally, you have several things that don't have to match but should match: if you could find documents that also happen to have these other things, they should be boosted a little bit higher. So that's yet another array of things where, if one matches, you get a better score. For each one of these pieces you also have the ability to adjust weights. So we're starting to get into a notion of how search understands what's important to your customers and to your business. You can not only match documents that match the queries, but you can also boost documents that we need to sell quickly, because they're expiring inventory or something like that.

And that leads us to our next big topic, search relevance. I'm curious how many people here have heard of the notion of TF-IDF? OK, only this half of the room. That's interesting; you guys should have mixed in a little bit more. It's not a hard concept. I think it's intimidating at first, but I can break it down pretty easily. This will be a little bit of a mathy slide, but not too bad. First off, TF really just means term frequency, and I'll get into that, and IDF means inverse document frequency. Rather than giving you the Webster's definition, the best way of explaining this is through an example. Let's say a user comes to your website and makes a search for "the diddle". Now, that seems odd until you realize that one of the matching documents in your index is "Hey diddle diddle, the cat and the fiddle". That's actually a pretty good
match for it. So let's do a little practice round and see what this document would be scored as from the search engine's perspective. Term frequency is simply the number of times a term occurs in a document. So the TF for "the" in this case is two; "the" occurs twice. Similarly, just by coincidence, "diddle" also occurs twice, so TF for both of those guys is two. So far so good. Inverse document frequency: sometimes I just wish they called it document frequency and just put a 1 over it. The document frequency is how many documents the term occurs in, not just this document, but across the entire set of documents. So the document frequency for "the" is pretty high, so the inverse document frequency for "the" is just about zero. Makes sense. And the document frequency for "diddle", not a very common word, is low: it only occurs in seven documents. So it's actually very important, and it gets an inverse document frequency of 1 over 7, which is a lot higher than 0.

So when you finally figure out the total score of this document against this query, you put all those pieces together. The score is the TF-IDF score for "the" plus the TF-IDF score for "diddle". It probably makes sense already, but just to be a little bit redundant: TF of "the" is two, IDF of "the" is zero, so that goes away. TF of "diddle" is two and its IDF is 1/7, which gives 2/7, and so you get the final result of 0.2857. The idea is that every document is going to go through the same process and be sorted, and so the way that you craft your query informs the way that this math works on the documents. You know you have 10,000 matches, but you want to make sure you do the right thing so that the top 10 search results are what they want.

OK, yep, so that was a pretty overloading slide. I always like to take a break after heavy slides like that, and I think clay work is really therapeutic. In particular, I think that this is my favorite one. Ah, that's great. We're gonna watch that one
more time. I love this part of this. OK, so that's a good break.

To this point... how much time have I got left, by the way? So at this point we've done a lot to get you into the mind space of how search works from a mechanical perspective: how to dump stuff in, how to pull stuff out, and what it can do as compared to other data stores, like MySQL, which I was picking on. The next thing that we want to do is dive inside the data store and give you a little intuition about how the pieces inside work, and what you'll find is that it's not that complicated. After this section you'll have a little better understanding of when it's right to use Elasticsearch and when it's not. With any data store there are two main chunks that you have to understand: how you get data in and how you get data out. So that's the outline for the next bit.

The first step of getting data into Elasticsearch is a step called analysis. Basically, we're gonna take a document, and in this case I've got just one field out of a document, and I will show you how it effectively gets shredded and rearranged and shoved into the data structures that make search technology so fast. Our example in this case is the sentence "The conspirators conspire conspicuously". I chose it so that I could almost not pronounce it at a conference.

Tokenization: that's the first step. In this case we have told Elasticsearch, hey, this is English, and that gives us some interesting things that we can play off of. We know that English is split on white space and also punctuation; we can basically throw out punctuation. An interesting side note that I always like to make here is that this is not true of a lot of languages on the other half of the earth. My wife is Japanese, and there are places where you could have symbols right next to each other and they're different words. To do the same thing in Japanese, which you still have to do, you have to have a really complex algorithm
to know where the best place is to split these things to make a logical sentence. So tokenization itself is a fairly deep topic.

The next step is actually a fairly shallow topic: lowercasing. Pretty easy, but if you have someone type in lowercase, you'd better make sure that it matches a document that has uppercase letters.

Stop wording: a lot of the words in English are just noise words. They help us understand where things are placed relative to each other, but they don't really change the content. So we can throw out words like "the", "in", "is", and "was", and stuff like that.

And perhaps my favorite step of analysis is stemming. This is another place where, because we've given Elasticsearch the hint that this is English, it knows some interesting tricks to do. If you want a document for "farming" to match a query for "farms", which is often the case, then effectively what stemming accomplishes is that you can take a word and, using a statistical technique, effectively chop off and sometimes modify the end of these words to make tokens that are easier to match, no matter what the intent was of the people searching.

All right, the next step after analysis is indexing. Our example sentence has turned into these three tokens: "conspir", "conspir", "conspicu". Sounds like Latin. Let's say that this is document 1. The secret sauce of Elasticsearch for being so fast is that, during the indexing process, it takes these sentences, turns them into a bunch of tokens, and then effectively transposes that. So instead of "document 1 has these tokens", at the end of the analysis, when you've gone through all your documents, you say "these tokens have these documents". Document 1 had these tokens, but in the end "conspir" appeared in document 1 as well as these two other documents, and "conspicu" appeared in document 1 as well as these three other documents. And so, from a Python point of view, you can effectively implement this with a dictionary where the keys are tokens and the values are an
array of IDs. Now, under the hood this is actually implemented in Java, and they do a lot of sneaky stuff. They shim extra information in the keys, so all the notions of document frequency, which we use for scoring, get shoved over into the keys when you look stuff up, and all the notions of term frequency, the other half of TF-IDF, are basically hidden in the values on the right, as well as other information like the positions of the words in the document, so you can do phrase matches and stuff like that. But effectively a simple search engine is just a Python dictionary like that.

All right, so we have now gotten all the information into the index. The next half of the equation is getting information out of the index. So our inverted index looks like this, and, yeah, let's make this interactive: given that data structure, what's the easiest way to find all documents that contain "conspicuous" and "aardvarks"? Anyone? Yep, that's all you have to do. Effectively you have lists, but they might as well be sets or iterators, and you find whichever IDs occur in both. And you can build arbitrarily complex things on the same idea. OR is just a set union, and if you combine them in a more complicated search, it's a set union followed by a different set union or set intersection. Pretty easy.

But that's only half the puzzle, because MySQL is really good at finding documents that match. I just showed you how Elasticsearch finds documents that match efficiently, but Elasticsearch has to turn around and do a sorting algorithm, and that is an important aspect of search. When Google gives you back the 60 thousand results it supposedly says you have for your query, you only see the top 10, and they're usually pretty good. If you scrolled down 50,000 pages, they would probably be less good. So it's important to know how that works. Effectively, what happens is, when your user gives you a query, you have an iterator of all the documents that match, and so what you do to find the top 10 is you initialize
it: you have a priority queue. Do you all know roughly what a priority queue is? OK, we can talk about that. But effectively what you do is, for every document that comes through, you take it off of that iterator, you look at all the other secret stuff we've hidden in there, and you find the score for that document. Now you put the document and that score on your priority queue, and you just iterate, doing that with every single match that exists. The interesting aspect of this priority queue, though, is that it doesn't keep up with every document it ever sees. It's only of length 10, or whatever you tell it to be. So as soon as you get past the top 10 documents and you've got one that scores lower than those documents, it compares itself to not even 10 (it's log order, log n or whatever), it compares itself to a few of the documents and says, "I'm lower than all these, never think of me again." And so the action is actually pretty efficient.

Now there's a little side note. This is another intuition that might be important for Elasticsearch. If you're doing some sort of relevance but you also want to return 100% of the documents, think about how you'd implement that. Deep paging is what it's called. If you've got a robot scanning your website for the 10,000th to 10,010th most fun event, then this means that you have to have a priority queue that is 10,010 long, and you sort all the documents in, throw away the first 10,000 of them, and give that chunk back. And guess what happens when the robot carelessly goes to the next page: it just gets worse and worse and worse. So that's one important intuition to think about with search technology. Elasticsearch allows you to turn that off if you don't care about relevance, but if you do, I would recommend not letting anyone get past about 500 results.

All right, and then... I said it returns the most high-priority contents. Thank you.
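The pieces described so far (an inverted index as a dictionary, AND as a set intersection, a score per matching document, and a bounded priority queue for the top results) can be put together in a short sketch. This is a toy under stated assumptions, not how Elasticsearch is implemented: the function names `build_index` and `search` are made up, the documents are hypothetical pre-analyzed token lists, and the score is the simplified "term frequency times one over document frequency" from the talk's slide, whereas the real scoring formula is more involved.

```python
import heapq
from collections import Counter, defaultdict

def build_index(docs):
    """Map token -> {doc_id: term frequency}. `docs` maps doc_id -> list of tokens."""
    index = defaultdict(dict)
    for doc_id, tokens in docs.items():
        for token, tf in Counter(tokens).items():
            index[token][doc_id] = tf
    return index

def search(index, query_tokens, k=10):
    """Return up to k (score, doc_id) pairs, best first, for docs containing every token."""
    postings = [index.get(t, {}) for t in query_tokens]
    if not postings:
        return []
    # AND is a set intersection over the posting lists.
    matches = set(postings[0])
    for p in postings[1:]:
        matches &= set(p)
    heap = []  # min-heap of at most k entries; heap[0] is the weakest hit kept so far
    for doc_id in matches:
        # Simplified score from the talk: sum over terms of tf * (1 / document frequency).
        score = sum(p[doc_id] / len(p) for p in postings)
        if len(heap) < k:
            heapq.heappush(heap, (score, doc_id))
        elif (score, doc_id) > heap[0]:
            heapq.heapreplace(heap, (score, doc_id))  # evict the current weakest hit
    return sorted(heap, reverse=True)

# Hypothetical, already-analyzed documents (lowercased, stop words removed, stemmed).
docs = {
    1: ["conspir", "conspir", "conspicu"],
    2: ["conspir", "aardvark"],
    3: ["conspicu", "aardvark", "aardvark"],
    4: ["conspicu", "aardvark"],
}
index = build_index(docs)
top = search(index, ["conspicu", "aardvark"], k=1)
print(top)
```

Because the heap never holds more than `k` entries, most low-scoring matches are rejected after a single comparison against `heap[0]`. The same structure shows why deep paging is expensive: serving hits 10,000 through 10,010 would require `k=10010`.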
That's effectively what we do: after the top ten, they go away. The data structure is only ten items long, so it can't hold any more than that. Oh yeah, that's not a bad idea. I don't know how I would implement that in Elasticsearch; I don't think they make that easy for you. But yeah, that totally checks out.

All right, so I need a little transition slide here, but effectively that gets us through everything that a search engine has been until about three years ago. Search came out of information retrieval, library technology type stuff: finding whatever I wanted to find. But Elasticsearch has started to prove the point really strongly that the same data structures that serve search results are actually really good for online analysis, log parsing, and stuff like that. A big chunk of that is its ability to do aggregations, and I think I can convince you that it's basically what we were doing before with just one extra step, and you get this nice ability to do aggregations almost for free.

So, just like before, whenever we're aggregating, say we want to find a histogram of the ticket prices or something like that, we have all the results that we had before, and we do the sorting like we did before. But while we still have that document in hand, we push it through an aggregator. It's basically just a little in-memory thing that says, OK, how many documents have I seen from $10 to $20, and it just increments those counters for every document. It does this, and at the end of it you pass back this aggregator thing, and you have these really nice results, and it was just something that you did almost as a byproduct of the actual search itself. So with the building blocks that I've given you right now, you can see how we have the ability to easily filter, which is just what a search is. You can group stuff, because as the documents are coming through you can already figure out which group each one belongs
to, and within each group you can do calculations: running averages or anything like that.

So to give you a little more intuition about how you might use aggregations, here is how I encountered them for the first time. Let's say you go to... and you're chuckling. Have you seen my... That top book, by the way, is a really excellent book. So anyway, if you go to e-commerce sites, you see the original use for aggregations. They were called facets, faceted search. You have a list of subcategories on the side, you have the counts for how many things are in each category, and you can click on one and it serves as a filter. It gives you a little bit of what I call relevance feedback, so you can understand what's actually happening.

But people have taken the same data structure, you turn it on its side, and you've got really nice histograms, which at Eventbrite we're making prettier now, but you can use them to feed back good information about how many tickets were sold from a particular class. You can take exactly the same information but a different data set and give spark charts for how many tickets were sold on a particular day. You can take, again, counts over buckets and plot them on a map, and you've got a really nice geo information console, just to give you intuition about where things are happening geospatially. And finally, I don't know exactly how to make a picture for it, but analytics, log analytics in particular, are great with Elasticsearch.

Building aggregations in Elasticsearch is easy. I'm going to kind of fly through this so I have time for a couple of questions, but effectively all you have to do is take your normal query, you keep asking your query like normal, but you add a new section to your query to Elasticsearch called aggs. In this particular case it's going to be hard to read, so I'll gloss over it, but you can say things like: for my aggregations, I want counts grouped by city,
so that's a terms aggregation where the field is city. And I also want you to do a histogram aggregation for the prices with an interval of 10, so that's the second thing. The results come back and you have the normal search results at the very top, but you have a new section that has these aggregations in it. In this case I've got the city buckets right there, with my Nashville and Dallas and BFE events, and I've got my price buckets showing what distribution of events occurred.

But a neat thing that you can do, and I really needed a graphic for this: right now I've got two separate aggregations. A neat thing for us to provide back to our users is not only the histogram of all the events, but a histogram per city, and you can do this with Elasticsearch. Aggregations can be arbitrarily nested. There are performance issues after some point, but I can say at the top level, do a terms aggregation, so we bucket everything by city and I get the counts back, and then within that aggregation do a histogram, so that we can show our users the price distribution within the city they're interested in. The results come back in a very similar structure, except it's appropriately nested, so that for each city bucket you have the count, and within that you have sub-buckets for the histogram so that you can draw it on the screen.

That's effectively it. I've been doing this a while, so I have a lot of things to learn but also a lot of other things that I would enjoy talking about. If you're interested in learning more on your own, I know of some reading material, and, you know, find me on Twitter, tell me what I did right and what I did wrong. Anyway, that's it. What have you guys got? Any questions?

So, repeating questions, I guess. Right, the question was around: we can specify English or not, but how do we deal with unknown terms, different languages, jargon terms, stuff like that? The easy answer is you
still just say it's English if it's basically English, and you still get the ability to split on whitespace and all that stuff, because that's presumably where you're coming from. I'll go to the extreme in a second. And you still do stemming, which means if it's maybe a verb, but a verb I haven't heard before, stemming actually does pretty well for English-like things.

But if you're willing to put the work in, you have an arbitrary amount of control over what you can do. At the other extreme end of things, I guess you could write your own Java. It's all pluggable, it's just Lucene, Java Lucene, and you could write your own classes to do whatever custom logic you want. If you don't want to go quite that far, there are other middle-ground things like synonyms. As a pre-processing step, before you do the stemming and chop off and throw away the ends of words, you can say, here's a file of every jargon word you might see, and you can either say don't touch it for the downstream stuff, or you can say this maps to three other words, or these three words map to one word. So there's a lot of flexibility in what you can do to tune that relevance notion, but it might be a lot of work.

He had a question first. Yes and no. Part of that is, not only do we have the term frequency, the counts for how often each of those terms occurred in the documents, but we also have a few other small things that we stick next to the tokens. We have the position in the document, which gets to your question about phrases. There are a couple of other things that aren't used as often: you can have a part of speech there if you have that set up, and you can have a payload, which you can do whatever you want with, so you can boost documents that have certain words in them a little bit higher. But it's still there. One thing you can't do, though, is make a search and reassemble it into the original document from this data
structure. That's why whenever you store a document in Elasticsearch, it gets shredded and turned into that, and at the same time you have a different file on disk that reads the original document back out, so you're effectively storing it twice.

Every document at Eventbrite is an event, and it has what I call the boring fields that are expected: the name, description, the date, geolocation, which actually gets interesting. But we also have, this is in progress, interesting fields, machine learning things like event cluster, which we can later match up with a user cluster that comes in, or event quality, which is another thing that we're inferring from the metadata around it. Those are all things that Elasticsearch is happy dealing with, and there's not too much more than that that's a mind-blowing departure from what I showed here.

Elasticsearch is a data store, yeah. It's a JSON record, and we do exactly that thing with it: we hand it to Elasticsearch and make it available. Elasticsearch stores both.

Ah, like connector-type things. Back when I was in Solr land, the predecessor, sort of, to Elasticsearch, there were a lot of plugins where you could connect things up quickly, called pipeline things. There are some of those for Elasticsearch too. We ended up just rolling our own because we want control over it. So, not a very specific answer.

One more. I think he had his hand up first, but please talk to me later. So, James, really cool question. A really cool benefit of Elasticsearch is that it's a write-once index, so segments on disk effectively are never touched again. But the caveat is, when you actually change a field, what you do is go back, find that record where it used to be written, read out the entire document, change that one field, and write it to a new segment file. The only thing you can change in the old file is that you mark one bit: it's dead, done. So not great, but it's a trade-off. You get benefits for treating
it that way. Definitely not a table scan. It's still pretty quick. Cool.

So I have exactly zero minutes left. Please come back and talk to me later, and thank you very, very much for coming.
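For readers following the aggregation part of the talk, here is a minimal sketch of the nested request it describes: a terms aggregation by city with a price histogram (interval of 10) inside each city bucket. The field names `city` and `price`, the aggregation names `by_city` and `price_histogram`, the `match_all` query, and the sample documents are all assumptions made for illustration; the slides themselves are not in the transcript. The small `aggregate` function is a toy stand-in for the "little in-memory thing" that increments counters as each document passes through. It is not how Elasticsearch is implemented, and its output shape is simplified compared with a real response.

```python
from collections import Counter, defaultdict

# Request body sketch. Field and aggregation names are hypothetical; the talk
# only specifies "terms aggregation by city" and "histogram, interval of 10".
nested_aggs_body = {
    "query": {"match_all": {}},  # placeholder for whatever the normal query is
    "aggs": {
        "by_city": {
            "terms": {"field": "city"},
            "aggs": {
                "price_histogram": {
                    "histogram": {"field": "price", "interval": 10}
                }
            },
        }
    },
}


def aggregate(docs, interval=10):
    """Toy aggregator: one pass over the hits, bumping a counter per document."""
    buckets = defaultdict(Counter)
    for doc in docs:
        # Floor the price to the start of its bucket, e.g. 17 -> 10.
        bucket_start = (doc["price"] // interval) * interval
        buckets[doc["city"]][bucket_start] += 1
    return {
        city: {
            "doc_count": sum(hist.values()),
            "price_histogram": dict(sorted(hist.items())),
        }
        for city, hist in buckets.items()
    }


# Made-up sample events, standing in for documents matched by the query.
docs = [
    {"city": "Nashville", "price": 12},
    {"city": "Nashville", "price": 17},
    {"city": "Nashville", "price": 35},
    {"city": "Dallas", "price": 20},
]
result = aggregate(docs)
```

With these sample documents, `result["Nashville"]` holds a count of 3 and a histogram with two events in the 10 bucket and one in the 30 bucket, which mirrors the "count per city, sub-buckets per price range" structure described in the talk.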
Info
Channel: Next Day Video
Views: 113,701
Rating: 4.7911301 out of 5
Keywords: pyohio, pyohio_2017, JohnBerryman
Id: UPkqFvjN-yI
Length: 49min 35sec (2975 seconds)
Published: Tue Aug 01 2017