Getting Down and Dirty with ElasticSearch by Clinton Gormley

Video Statistics and Information

Captions
I'm Clinton Gormley. In fact I was the first user to put Elasticsearch into production, back in February 2010, just after release 0.4, at which time it didn't do nearly as much as it does now and it had a hell of a lot more bugs than it does now. I wrote the Perl API back then and I'm still maintaining it, and in fact we've just written new official APIs for a number of languages. I'm also working with O'Reilly on a book about Elasticsearch, the first part of which is due to go online in December, hopefully, if all goes according to plan. So that's me.

Tell me a bit about yourselves. I'm assuming you've all heard of Elasticsearch, which is why you're here. How many of you have used it? And how many are using it in production? That's great. Are you using it more for search or for logging? Search... logging... cool, interesting.

For those of you who don't know that much about Elasticsearch: it is a real-time search and analytics engine, and it comes with a lot of buzzwords. It's distributed: I can start a node here on my laptop and it will work very happily, but if you were to start nodes on your laptops, they'd start talking to each other, form a cluster, and spread the load between them. That makes it really scalable: not only can you give it better hardware, you can give it more hardware, and it knows how to spread the load across all of that hardware automatically. So you can have small beginnings and grow to a massive size. With all this extra hardware there are more chances that things will fail, so it knows how to handle that too: it gives you high availability by keeping redundant copies of your data spread around the cluster, so that when a node does go down it still has a copy of your data and can reallocate that redundant data to somewhere else in the cluster.

You speak to it using a RESTful API, with JSON over HTTP. And we say it's schema free. Now, after Nathan's talk, it's not schema-less; schemas are important, and the only meaningful way you can index data is if you know what the data contains. But what Elasticsearch tries to do is not get in your way. If you just want to try something out, it's not going to say "first, tell me what this column contains and how you're planning on using it", because very often you don't know; you just want to throw some data at it and get going, and it makes that easy to do. It also supports multi-tenancy. With some software you would need a separate instance of the application for every use case; like any real database, a single Elasticsearch cluster can support lots of different applications with quite different characteristics. And it's open source, Apache 2 licensed, and it's based on Lucene. So this is all cool, and this is why you go to the website, download it, unzip it and start it up. But this talk is about how you actually use it.

All the examples on the Elasticsearch website use this curl format as a way of speaking to Elasticsearch. Because it's JSON over HTTP, you can speak to it with any HTTP client, and you can speak to it from the command line. So this is the structure of your basic command; let's look at what it consists of. You have a verb (the standard HTTP verbs GET, HEAD, PUT, POST, DELETE), then the node you're talking to. You can talk to any node in the cluster, but in this case, since we're just running it on our laptop, we're talking to localhost, and 9200 is the port that speaks HTTP. Then we give it a path, and then some query-string parameters; pretty, in this case, just makes the output easier to read. From request to request most of this doesn't change, so I'm going to show you just the bits that are interesting.

So, let's find out if the node is there for a start. We do a GET request on the root path and we get some information back: it has a name, the cluster name, the status (200 OK) and the current version of the node we're talking to. OK, it's running.
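As a concrete illustration, here is roughly what that basic command shape and the "is the node there?" check look like from the command line. This is a sketch against a local 0.90-era node, and the exact response fields vary by version:

```sh
# Anatomy of a request:  <VERB> http://<node>:<port>/<path>?<query-string-params>
# Check that the node is alive by GETting the root path:
curl -XGET 'http://localhost:9200/?pretty'

# The response looks something like this (field names and values vary by version):
# {
#   "ok"      : true,
#   "status"  : 200,
#   "name"    : "Quicksilver",
#   "version" : { "number" : "0.90.x", ... },
#   "tagline" : "You Know, for Search"
# }
```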
But now what? Where do we start? We start with data. The reason you're using Elasticsearch is that you have some data you want to make searchable, and the first thing you've got to do with that data is convert it into JSON form. JSON is a lovely serialization format: it's simple, it's easy to read as a human, and it's actually quite expressive. Here we've got a tweet with both exact values and full-text content: my nick, which is clintongormley, my name, a date, a retweet count and a geolocation point for where I am at this moment.

So how do we put this data into Elasticsearch? Well, we need to say where we're going to put it, and an index is a bit like a database in MySQL. We'll call it myapp, because we're creative like that. We need to say what it is: a type, which is a bit like a table in MySQL, and it's a tweet. Which tweet is it? We'll give it tweet number one. If we didn't give it an ID it would generate one for us, but one is a nice easy number to look at, so we'll use that. And then we pass the document itself; that -d is what curl uses to say "pass this document body with the request". We get back a response, status 201 Created, and it gives us back the metadata that we passed in, plus this version number. You'll see this version number a lot; we'll come back to it later on.

To retrieve the document we just GET it, using exactly the same URL that we passed in at the beginning, myapp/tweet/1, and we get back the same metadata plus the _source field, which contains the actual JSON string that we put into Elasticsearch two slides ago. To check that it exists you use the HEAD verb; this doesn't give you back a body, it gives you either a 200 OK if it does exist or a 404 Not Found if it doesn't.

If we want to update a document, we just PUT it again. The files in an Elasticsearch index are immutable: you don't go back in and randomly update them. So a reindex is actually an atomic delete of the old document plus a put, and the delete is really just marking the old document as deleted; the actual removal happens later on, at some stage, in the background, and we don't need to worry about it. So it reindexes (deletes the old version, puts the new version), we get a 200 OK, and you can see that the version number has incremented because we've made a change to the document. To delete the document we use the DELETE verb with the same index, type and ID; we get a 200 OK and the version number has again incremented.
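A minimal sketch of that index/retrieve/check/update/delete cycle, using the myapp index and tweet type from the talk (the field values are illustrative):

```sh
# Index (create) tweet number 1: PUT to /index/type/id with the JSON document as the body
curl -XPUT 'http://localhost:9200/myapp/tweet/1?pretty' -d '{
  "nick"     : "clintongormley",
  "name"     : "Clinton Gormley",
  "tweet"    : "Getting down and dirty with Elasticsearch",
  "date"     : "2013-06-03",
  "retweets" : 0,
  "location" : { "lat" : 51.51, "lon" : -0.08 }
}'

# Retrieve it: returns the metadata plus the original JSON in the _source field
curl -XGET 'http://localhost:9200/myapp/tweet/1?pretty'

# Does it exist? HEAD gives back a 200 OK or a 404 Not Found, with no body
curl -I 'http://localhost:9200/myapp/tweet/1'

# Update = reindex: PUT a new version over the old one (the _version increments)
curl -XPUT 'http://localhost:9200/myapp/tweet/1?pretty' -d '{
  "nick"     : "clintongormley",
  "name"     : "Clinton Gormley",
  "tweet"    : "Getting down and dirty with Elasticsearch (edited)",
  "date"     : "2013-06-03",
  "retweets" : 1
}'

# Delete it: the version increments one more time
curl -XDELETE 'http://localhost:9200/myapp/tweet/1?pretty'
```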
OK, let's talk about those version numbers. They are used for optimistic concurrency control. Nathan gave the example of incrementing counters: with all these changes happening in parallel, what you don't want to do is miss some changes. Well, that kind of depends; sometimes you don't care. Sometimes your data is sitting in one database, you're indexing it into Elasticsearch, and all you want is the latest version of the data from that other database, so you can just stick it into Elasticsearch and be done with it. But sometimes you do care. Things like incrementing counters: I read the document, another process reads the document, we both increment and write back, and we're going to miss one change. So Elasticsearch uses these version numbers for optimistic concurrency control: the ability to make sure that you're not missing changes, without any locking. Most relational databases will lock a table, a row or a field in order to make a change, but you can't use locking in a distributed system and expect it to perform well. With the version numbers, whenever we make a change we tell Elasticsearch what version we expect to be changing, in other words the current version that is on disk. Assuming that version number is correct, we get back a 200 OK. If we try to change an old version, it's not going to work: we get a 409 Conflict back, and at that stage our application has to decide what to do. Something has changed; perhaps we need to show the form to the user who is editing the object and say "somebody else has changed this in the meantime, please check the changes and try again", or whatever makes sense for your application.

There's also an update-in-place command which allows you to change an existing document. Here we use the update endpoint and a script which basically increments a count field in the source document. This works exactly the same way as what we were doing manually ourselves: it gets the old document, makes the change, and puts it again. The difference is that it happens local to a shard in Elasticsearch, rather than having to cross the network for each of those phases. That reduces the chances of conflicts, but it doesn't eliminate them, because you can still have changes happening in parallel, and this is what the retry_on_conflict parameter is for. retry_on_conflict says: get the document with its version number, make the change and try to index it; if the version has changed, there's a conflict, so either throw an error or retry this many times until you succeed or run out of tries. So this would try three times; if it still hasn't succeeded it throws an error, and if it succeeds then all is good and dandy.

All of these operations are cheaper to do in bulk. If you are loading lots of data into Elasticsearch, indexing a thousand records at a time rather than one at a time saves you a lot of network overhead; you probably get ten times better throughput using bulk indexing rather than going document by document. Similarly you can get multiple documents at once, run multiple searches at once, and so on. I'm not going to go into the details of the syntax because it's not really that interesting, but you should know that it exists and that it is worth using.
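A sketch of those three ideas: a versioned write, the update endpoint with a script and retry_on_conflict, and a small bulk request. The script uses the 0.90-era default scripting language (MVEL), so treat it as era-specific; the field and document values are illustrative:

```sh
# Optimistic concurrency control: only apply the change if the current version is still 3.
# If someone else changed the document in the meantime, you get a 409 Conflict back.
curl -XPUT 'http://localhost:9200/myapp/tweet/1?version=3&pretty' -d '{
  "nick"  : "clintongormley",
  "tweet" : "Getting down and dirty with Elasticsearch (edited again)",
  "date"  : "2013-06-03"
}'

# Update-in-place with a script, retrying up to 3 times on version conflict
curl -XPOST 'http://localhost:9200/myapp/tweet/1/_update?retry_on_conflict=3&pretty' -d '{
  "script" : "ctx._source.retweets += 1"
}'

# Bulk indexing: one action/metadata line followed by one document line, repeated.
# --data-binary keeps the newlines intact, which the bulk format requires.
curl -XPOST 'http://localhost:9200/myapp/tweet/_bulk?pretty' --data-binary '{ "index" : { "_id" : "2" } }
{ "nick" : "mary", "tweet" : "Elasticsearch is easy to get started with", "date" : "2013-05-14" }
{ "index" : { "_id" : "3" } }
{ "nick" : "john", "tweet" : "Bulk indexing saves a lot of network overhead", "date" : "2013-05-15" }
'
```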
So, you've got to fit Elasticsearch into your infrastructure, and there's a good chance, if you have an existing application, that you already have a data store. That's fine: it will work quite happily by taking that data and mirroring it into Elasticsearch. The client makes changes to the data store, those changes get reflected into Elasticsearch, and any queries the client wants to run can be run against Elasticsearch directly. If you're starting a new application, you can either go with that, or you can use the distributed data store you've just seen in Elasticsearch to store your data directly. Completely up to you; it doesn't matter.

So that's using Elasticsearch as a data store. Now let's talk about how to search your data. The first, most simple search is the empty search, which just says "give me everything you've got". That responds with the time it took and whether it timed out (don't worry about that; you can set timeouts on queries, and it will tell you if it timed out before it had succeeded), and then it tells you how many shards were expected to respond and how many actually responded.

I wasn't going to go into the internals of how Elasticsearch works, but this is worth knowing. We're talking to an index, but an index is really just a virtual namespace; where the data actually lives on disk is in shards. A shard is the smallest unit of scale in Elasticsearch, and it is an engine in its own right. You can move shards around from node to node, and you can make copies of shards to give you high availability. In your application you just put a document into the index, and Elasticsearch figures out which shard it should belong to. So when you search against an index, it looks at all of the shards in that index, runs the search on each of them, gets the results back, and reduces them into the overall result set, which it then returns to the client. Here it tried to hit 10 shards and hit 10 successfully. We wouldn't expect shards to fail, but if some massive disaster happens and you lose two thirds of the nodes in your cluster, there's a good chance you will have some missing data. At that point Elasticsearch isn't going to refuse to work: it will give you the data that it can, but let you know that it's incomplete. Rather some results than nothing at all.

Then you get back the hits section, which tells you how many documents matched your query, a maximum score (which I'll come back to later) and the top hits in the hits array. The hits themselves look like this: the same metadata you've already seen, plus the _source field, so you have your original document available to you immediately. You don't need a separate stage where you fetch it from some other data store in order to display your results; you can display them directly from the search results coming out of Elasticsearch. And then each hit has a _score, which is a relevance score: it says how well this document matched the query. We didn't have a query here, just the empty search, so all documents get a neutral score of 1.

When searching, you can choose how broadly or narrowly to search. I said that a search on one index hits multiple shards; a search on multiple indices just hits more shards. It's exactly the same process, there are simply more shards involved. So you can quite happily search across one, many or all of the indices in your cluster, and you can limit the search to a single type, a few types or all of the types in your indices. The syntax looks like this: all types in one index; all types in two indices; all types in all indices beginning with ind; one type in one index; two types in one index; multiple types across indices; all types beginning with a given prefix in all indices. You get the idea; you can combine these as makes sense for your needs.

Of course these searches could match millions of documents, and we're not going to return millions of documents to you, so we paginate the results. They get sorted, by default by this relevance score, and then we return you the top 10 results. So by default size is 10, and you can choose later pages by setting the from parameter, which says how many results to skip. If we changed our page size to 5, then from=0, from=5, from=10 would give you page 1, page 2, page 3 and so on. These are all things you would expect to be able to do in a search.
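A sketch of the empty search, the different search scopes, and pagination with from and size (blog is a hypothetical second index, used only to show the comma-separated form):

```sh
# The empty search: everything in the cluster
curl -XGET 'http://localhost:9200/_search?pretty'

# Narrower scopes: one index, one type in one index, two indices, a wildcard on index names
curl -XGET 'http://localhost:9200/myapp/_search?pretty'
curl -XGET 'http://localhost:9200/myapp/tweet/_search?pretty'
curl -XGET 'http://localhost:9200/myapp,blog/_search?pretty'
curl -XGET 'http://localhost:9200/ind*/_search?pretty'

# Pagination: page size 5, third page (skip the first 10 results)
curl -XGET 'http://localhost:9200/myapp/tweet/_search?size=5&from=10&pretty'
```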
Then we have something that we refer to as "search lite". Search lite allows you to specify a query in the query string. Here we're looking for john in the name field; pretty simple, but this syntax is actually quite powerful. The plus here means "this is required", so foo must be in the tweet field, john must be in the name field, and the date must be greater than the beginning of May. That's fairly powerful. Of course, once you put it through percent-encoding it becomes slightly less readable. The other thing is that this is a mini-language, and as a language it has a syntax, which means you can have syntax errors. So this style of search is not what we recommend you use in production. It is very useful for running ad hoc queries from the command line, and it is useful for demoing, which is what I'm about to do, but you don't want these to be the main queries that you use in your code. We'll come to what you should be using later on; for now, we'll use it for the demo.

Here I'm searching for mary, and I haven't specified a field, and I get back a user object named mary, tweets by mary, and a tweet mentioning mary. So what's happening here? All of this data is stored in different fields. Well, it turns out we've got something called the _all field. The _all field takes the string values from all of the other fields, concatenates them into one big string and indexes that. It's quite a useful way of getting started: you don't have to know too much about the structure of your document, and chances are, if you search against the _all field, you're going to find something relevant. But it's worth understanding how different fields work.

As an example, let's look at this query: q=2013. Now, I've got 12 tweets in my index, each with a different date in 2013, so I get back 12 results. That kind of makes sense. If I search for the exact date, I still get back 12 results, although there's only one tweet on that date. If I search on the date field for the exact value, I correctly get one result. But if I search for 2013 in the date field, I get nothing back. This seems weird; it doesn't seem to be doing what we want. The reason is that these two fields have different data types, and this is where the schema matters: it determines how the values in each field get indexed.

To figure out what's going on, we check the mapping. The mapping is what we call the field schema, and to get it back we say GET index/type/_mapping, which gives us a whole lot of information. We didn't set any of this up at the beginning; we just indexed into it, and this is what Elasticsearch has understood from our documents. There's a lot going on here, but let's focus on the date field: it has been recognised as type date. The _all field doesn't appear here because it's a special added metadata field, but as I said, the _all field is of type string. So you can understand that dates and strings are quite different beasts and would be indexed differently. In fact, the real difference is between exact values and full text. Exact values are things like whole numbers, floating-point numbers, dates, booleans, and strings where a capital-F Foo and a lowercase foo are different: enums, status codes, things like that. This is the kind of data that a typical database handles very well indeed.
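A sketch of the search-lite form and of fetching the mapping. The richer query string is shown unencoded in a comment for readability; on a real command line it has to be percent-encoded, and as the talk says, this style is best kept for ad hoc use:

```sh
# Search lite: look for "john" in the name field
curl -XGET 'http://localhost:9200/myapp/tweet/_search?q=name:john&pretty'

# The richer form: "+" means required. foo in the tweet field, john in the name field,
# and a date after the beginning of May (shown unencoded for readability):
#   ?q=+tweet:foo +name:john +date:>2013-05-01

# What has Elasticsearch understood about our fields? Fetch the mapping for the type.
curl -XGET 'http://localhost:9200/myapp/tweet/_mapping?pretty'
```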
Full text is more like the body of an email, or the words in a play: the sort of unstructured text which is the province of the full-text engine. With exact values we want to match things exactly; with full text we want to search within the text, and that's a completely different beast. To enable this, full-text search engines use something called the inverted index.

We'll take these two documents as an example of how to build an inverted index. The first thing you do is separate all the words out into individual terms (don't fall asleep), then you sort them into a sorted list of unique terms, and then for each term you list the documents that contain that term. You end up with something that looks a lot like this: each unique term, sorted, and the documents in which it appears. A search is now quite easy: we look up the term quick and the term brown in this list (the list is sorted, so that's quick) and find out which documents they belong to. We find that document two just has brown, but document one has both terms, so it is probably the more relevant document. That's a very simple relevance calculation, but it does a reasonable job, on the surface anyway.

But what if we say that we must have Quick and we must have foxes in our search? Capital-Q Quick appears in doc one and foxes appears in doc two, but we said that both terms are required, and neither document contains both, so we end up with no matches. Really, though, we've got fox and foxes, capital Quick and lowercase quick, so we need to be a bit more lenient about how we treat our terms. By normalizing them into a broader form we can find more documents, and so better capture the intent of the user when they're searching.

So how should we normalize? The first thing we'll do is lowercase everything, so Quick and quick both become lowercase quick. Then dog and dogs: we reduce dogs to the stemmed form dog, and foxes to the stemmed form fox, so both documents now contain those. Jump and leap are different words, but they have much the same meaning; really they're synonyms, so let's index both of them as the single term jump. This process is known as term normalization, and our new inverted index looks like this: there's a lot more overlap between the terms in the two documents.

Now we try our search, Quick foxes, and we get back nothing, because capital Quick and foxes no longer exist in our inverted index. You need to normalize the terms at query time too. At index time all the terms go through a normalization process, and at search time they have to go through the same normalization process, because you can only find what's actually in the inverted index. So capital Quick becomes lowercase quick, foxes becomes fox, and now our query matches.

This process is called analysis, and it consists of tokenization, which is breaking things up into individual words, and this normalization. In Elasticsearch, analysis is done by analyzers: the tokenization is done by tokenizers and the normalization by token filters. Every analyzer has to have a tokenizer, and it can have zero or more token filters, and there are a bunch of analyzers that come out of the box. The default one is the standard analyzer, and here's how it works. First it breaks the sentence, the whole string, up into individual words. It's not just breaking on whitespace; it uses an algorithm from the Unicode consortium that works well for most of the languages in the world. Then it applies the lowercase token filter and the stop-words filter, which removes words that carry little weight, like "the". We haven't told it a language, so it just defaults to the English stop words, and that is the list of terms that then gets stored in our index.
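You can watch the analysis happen with the _analyze API. This is a sketch in the 0.90-era form, where the text to analyze is sent as the raw request body; the exact token output depends on the version's default stop-word list:

```sh
# Ask the standard analyzer which terms it would put into the inverted index
curl -XGET 'http://localhost:9200/_analyze?analyzer=standard&pretty' \
     -d 'The QUICK brown foxes jumped over the lazy dog'

# Roughly: quick, brown, foxes, jumped, over, lazy, dog
# (lowercased, split on word boundaries, "the" dropped as a stop word, but not stemmed).
# Swap in analyzer=english or analyzer=spanish to compare the language-specific analyzers.
```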
We also have language-specific analyzers, such as english. These use the standard tokenizer and the lowercase filter, but now we do know something about the language, so the english analyzer applies the English stemmer (jumps becomes jump) and then removes the English stop words. If we were to use the Spanish analyzer, the French analyzer or whatever else, it would put the words through the appropriate stemming and stop-word removal for that language.

OK, so let's go back to our data type differences and figure out why we got the results that we did. date is an exact-value field and _all is a full-text field. That means date is storing the exact value 2013-06-03, while the _all field treats it as a string and breaks it up into three terms: 2013, 06 and 03. When we search the _all field for 2013, that term exists in all of the documents, so we get 12 results. When we search it for the exact date, the query string itself gets analyzed and becomes a query for 2013 OR 06 OR 03, so every document matches on 2013 and we still get 12 results. When we search the date field for the exact value, it exists and we correctly get one result; but the exact value 2013 does not exist in the date field, so we get zero results.

So understanding your field mappings is clearly important. Fields can have a variety of types: strings, dates, whole numbers, floats, booleans and objects, plus a variety of other more specialized types, a couple of which we'll talk about as we go on. When we threw the document at Elasticsearch at the beginning, the schema was empty, so all of these fields were new and it had to figure out what data they contained. This is how the dynamic detection works: if it looks like a string, it's a string, unless it's a string that looks like a date, in which case it's a date; whole numbers become longs; floats become doubles; true and false become booleans; and objects become objects. We haven't mentioned arrays: there's no special mapping for an array, because any field can have multiple values. Elasticsearch just uses the first value in the array to determine what the rest of the array contains, which also means you can't mix data types in an array; it will be a multi-value field of one of these types.

The most important setting by far is type, and for things like numbers and dates there's usually very little else you want to set. Here we've got type string, type date, type long and type object for the location. Well, that last one isn't correct: it isn't just an object, it represents something very specific, a latitude/longitude coordinate. It's a geo_point. But Elasticsearch can't recognise geo-points automatically, so we have to specify this manually, and we'll change the location field from type object to type geo_point.

For string fields the most important distinction is between full text and exact values. String fields default to being full text, which really means the index property is set to analyzed: analyze the values before you store them in the index. If you want a field to be an exact value, you need to say so by setting index to not_analyzed. You can also set index to no, meaning the field isn't searchable at all; the string will still be there in the _source field, so you can retrieve it, but you won't be able to search on it. Our nick is not full text, it contains clintongormley, so we'll change that to index: not_analyzed. And for full-text fields there is one more important setting: the analyzer. Let's say our tweet field contains English tweets; then rather than the standard analyzer we'd actually like to use the english analyzer, and we can set that as follows. That analyzer will be applied both at index time and at search time.
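A sketch of what the corrected mapping might look like when the index is created (or recreated, since, as explained next, existing fields can't be changed in place): the tweet field gets the english analyzer, nick becomes a not_analyzed exact value, and location becomes a geo_point. 0.90-era syntax, field names from the talk's example:

```sh
curl -XPUT 'http://localhost:9200/myapp?pretty' -d '{
  "mappings" : {
    "tweet" : {
      "properties" : {
        "tweet"    : { "type" : "string", "analyzer" : "english" },
        "name"     : { "type" : "string" },
        "nick"     : { "type" : "string", "index" : "not_analyzed" },
        "date"     : { "type" : "date" },
        "retweets" : { "type" : "long" },
        "location" : { "type" : "geo_point" }
      }
    }
  }
}'
```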
So we've made these changes to our mappings, but now we need to apply them. You can add new fields to a mapping using the mapping endpoint: anything new you put in will be merged into the existing mapping. But you can't change existing fields, because data already exists for those fields and it wouldn't be searchable if you changed the definition. Instead you delete the index, create a new index with the right mappings, and then reindex all of your data into it; all of the official clients have helpers which make the reindexing process much easier.

Now we come to full-body search, and this is the type of search you should be using in your applications. It's called body search because we pass everything in the request body. It takes a query parameter, which defaults to match_all (in other words, give me all documents), plus the from and size parameters and various other parameters you can pass in. Those are the defaults; let's look at an actual query. The query is expressed in the query DSL, which is this JSON structure, and it's a really rich, flexible query language: you can be very specific and it can get quite involved, but you can tailor it to give you exactly what you want. We won't go into the complexities; we'll start with a simple query, the match query, and say "look for search in the tweet field". That will find the term search in the tweet field, go through all the documents that contain it, calculate a relevance score for each of them, and return the ten best results. These queries are individual units that can be used all over the place in the query DSL, so we describe them on their own like this, but to actually use one you pass it to the query parameter of the search API; that's what the actual request looks like.

It's not just queries; we also have filters, and they look quite similar, but they have different use cases. Filters are used for exact matching. Queries can also handle exact matching, but they are really for full-text search. Filters give a yes/no answer: a document either matches or it doesn't. Queries, on the other hand, tell you how well a document matches, so there's an extra layer of subtlety there. Because filters are simple, they're fast; queries have to do the scoring, so they can be a bit heavier. And filters are cacheable, while queries are not. So the general rule of thumb is: use a query for anything involving full-text search or relevance, and filters for everything else.

Here we have a query that we've seen already, and we want to filter our documents to only the tweets by mary. Search only takes a query parameter, so we need to somehow wrap these two things into one query, and that query is called the filtered query: it takes the query and the filter, and then we can pass that into the search request. Perhaps you want just a filter, without a query at all (just give me all the tweets by mary); then the query becomes a match_all query, or we can leave it out completely.
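A sketch of the full-body form in the 0.90-era query DSL: a match query wrapped together with a term filter in a filtered query, and the filter-only variant using match_all (the search text and nick are illustrative):

```sh
# Tweets matching "full text search", restricted to tweets by mary
curl -XGET 'http://localhost:9200/myapp/tweet/_search?pretty' -d '{
  "query" : {
    "filtered" : {
      "query"  : { "match" : { "tweet" : "full text search" } },
      "filter" : { "term"  : { "nick"  : "mary" } }
    }
  }
}'

# Just the filter: all tweets by mary, every hit with the same neutral score
curl -XGET 'http://localhost:9200/myapp/tweet/_search?pretty' -d '{
  "query" : {
    "filtered" : {
      "query"  : { "match_all" : {} },
      "filter" : { "term" : { "nick" : "mary" } }
    }
  }
}'
```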
But now there's no intelligent sort order, because all of the returned results get the same relevance score, so we can add a sort parameter which sorts by date, most recent tweets first. Here's a different type of filter: a range filter on the date field, saying give me everything in the month of May, so greater than or equal to the first of May and less than the start of June.

We can also summarize data. Here we look at the top tweeters in our index, meaning the number of times each nick has a tweet associated with it. These are called facets, and they run against millions of documents in real time. We call this one the top_tweeters facet, and we're asking for popular terms in the nick field. Facets can also be run in the context of a query, so this query is saying: give me people who have tweeted about elasticsearch, and give me the top ten people who have tweeted about it. All of this can be done on the fly, in any context, without having to set anything up beforehand. Similarly we could ask for tweets by month: we use a date_histogram facet on the date field, bucketing the dates by individual month, and it gives us the count of tweets in each month.

Right, let's put some of this together. A common requirement is autocomplete: you want to search as the person types. Somebody types in "joh" and we'd like to return John Smith, Johnny Depp, Lyndon Johnson. The only problem is that joh doesn't exist in the index, and if it isn't in the index, you can't find it. So we need to put it in the index, and we use n-grams to do this. N-grams are like a window on a word: the window moves across the word and gives us whatever is under it. An n-gram of length one gives us each individual character, length two gives us two characters at a time, and so on. This is great for partial-word matching, but not perfect for autocomplete, where really we want to start at the beginning of the word. For that we use a specialized kind of n-gram called the edge n-gram, which is anchored to the beginning of each word, and which would result in terms like these being added to the index. Of course, this is perfect for autocomplete.

Setting this up takes a bit of work, but we'll go through it step by step. The first thing is the edge n-gram token filter: it's a token filter, we're calling it autocomplete, it's of type edge_ngram, and we want one to twenty characters, which gives us plenty of space to match the right word from whatever the person has started typing. We're going to apply this to the name field, but we actually want to analyze the name field in two ways: one, in the standard way, word by word, and two, as autocomplete. So we define two analyzers. The first, the name analyzer, just uses the standard analyzer, but we remove the stop words, otherwise a name like A. A. Milne would have the A's stripped out. Then we have another analyzer, called name_autocomplete, which is a custom analyzer: it also uses the standard tokenizer and the lowercase token filter, but then adds in our newly defined autocomplete token filter.
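A sketch of those analysis settings, supplied when the index is created: the autocomplete edge n-gram token filter plus the two analyzers. 0.90-era syntax; the analyzer names are just the ones used in the talk's example:

```sh
curl -XPUT 'http://localhost:9200/myapp?pretty' -d '{
  "settings" : {
    "analysis" : {
      "filter" : {
        "autocomplete" : {
          "type"     : "edge_ngram",
          "min_gram" : 1,
          "max_gram" : 20
        }
      },
      "analyzer" : {
        "name_analyzer" : {
          "type"      : "standard",
          "stopwords" : []
        },
        "name_autocomplete" : {
          "type"      : "custom",
          "tokenizer" : "standard",
          "filter"    : [ "lowercase", "autocomplete" ]
        }
      }
    }
  }
}'
```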
Now that we've defined the analyzers, we need to apply them to the name field, and we want to use the name field in two different ways. To do that we use a multi-field: a multi-field lets us use one field for multiple purposes. This simple mapping becomes type multi_field, and it has two sub-fields. The main sub-field has the same name as the multi-field itself, and we can refer to it as name or as name.name; it's just type string with our standard name analyzer. Then autocomplete is a sub-field we can only refer to as name.autocomplete; it's type string, and here we actually use a different analyzer at index time and at search time. At index time we want to store all of those edge n-grams, but at search time we just want to search on what the user actually typed in; we don't want to search on j, jo, joh, we just want the joh part.

We have to delete our index, recreate it with the new settings and mappings we've just specified, and reindex our data, and now we can run an autocomplete query. It's very simple: we query the name.autocomplete field for "john smith". That's great, except that we can do better: john might be the start of Jonathan, so we should probably reward whole-word matches. We want to match both in the autocomplete field and in the standard name field, which means we need to combine multiple queries, and this we can do with the bool query. Bool takes must, must_not and should clauses: must clauses must match, must_not clauses must not match, and should clauses don't have to match, but if they do, the document gets extra points; it is more relevant. So we make the name.autocomplete match a must clause (terms must appear in that field), but if a term also appears in the name field, the document gets extra points. name.autocomplete will match both john and smith, while the should clause will only match the whole words, and that gives those documents extra points, bumping them higher up the results list.

And that's what I've got for you, although there are some bonus slides coming. How many more minutes do I have? OK, we'll see where we get.

Let's say we want to give some reward to popular tweets, and we have this retweet count, so we could incorporate that. If we just sorted on it, well, you're either sorting on retweet count or on score; you can't use both of them to sort. But we can combine the retweet count into the score. Here we use a function_score query: we run a simple query on "search", but then we apply a script which takes the value of the retweet field, smooths it out with a log (because we want a lot of effect at the beginning and less effect the more retweets you have) and combines it with the score from the query, and you get a nice combination of relevance with popularity.

Similarly, if we wanted to find results near the person who is querying, we could filter by geo-distance and say "only include results within 100 kilometres of the user". But that's quite a harsh cut-off: if there's a really relevant result at 101 kilometres, it's gone. So we can combine the two: we take the score from our query and use a Gaussian curve to apply a decreasing boost, so the further away you get from the origin, the centre of the circle, the less boost gets added to each tweet. This will still include really relevant results that are far away, but it gives more weight to results that are local.

And now I really am finished, so thank you very much. Any questions?
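To round off the transcript, here is a sketch of the last few pieces described above: the multi_field mapping for name (applied after recreating the index with the analysis settings shown earlier), the bool autocomplete query, and a geo_distance filter. All of it is 0.90-era syntax, and the search text and coordinates are illustrative:

```sh
# The name field as a multi_field: one sub-field analyzed normally, one that stores
# edge n-grams at index time but analyzes the search text with the plain analyzer
curl -XPUT 'http://localhost:9200/myapp/tweet/_mapping?pretty' -d '{
  "tweet" : {
    "properties" : {
      "name" : {
        "type" : "multi_field",
        "fields" : {
          "name" : {
            "type"     : "string",
            "analyzer" : "name_analyzer"
          },
          "autocomplete" : {
            "type"            : "string",
            "index_analyzer"  : "name_autocomplete",
            "search_analyzer" : "name_analyzer"
          }
        }
      }
    }
  }
}'

# Autocomplete with a reward for whole-word matches: terms must match the edge n-grams
# in name.autocomplete, and whole-word matches in name push a document up the results
curl -XGET 'http://localhost:9200/myapp/tweet/_search?pretty' -d '{
  "query" : {
    "bool" : {
      "must"   : { "match" : { "name.autocomplete" : "john sm" } },
      "should" : { "match" : { "name" : "john sm" } }
    }
  }
}'

# Only results within 100km of the user, expressed as a filter
curl -XGET 'http://localhost:9200/myapp/tweet/_search?pretty' -d '{
  "query" : {
    "filtered" : {
      "query"  : { "match" : { "tweet" : "search" } },
      "filter" : {
        "geo_distance" : {
          "distance" : "100km",
          "location" : { "lat" : 51.51, "lon" : -0.08 }
        }
      }
    }
  }
}'
```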
Info
Channel: NoSQL matters Conference
Views: 80,508
Id: 7FLXjgB0PQI
Length: 44min 12sec (2652 seconds)
Published: Mon Dec 16 2013