ElasticSearch in action - Thijs Feryn

Captions
Buongiorno e benvenuti a tutti! I hope that sounded right. Is that okay, is that somewhat Italian? All right, let's do this. This talk is called "Elasticsearch in action", and the reason I call it that is because I've seen way too many presentations about Elasticsearch that spend 40 minutes explaining what Elasticsearch is. I'm going to explain what Elasticsearch is in a single slide, I promise you that, and right after that we'll dive right in and go with an action-packed presentation. One single slide, you ready? Let's do it. Elasticsearch is (and I'm going to try to sound as boring as I can to give you this one) a full-text search engine. It is also a NoSQL database. It's also an analytics engine. It's written in Java, sorry about that. It's based on Lucene technology, just like Solr. It has inverted indexes. It is very easy to scale, hence the term "elastic". It has a RESTful interface, and that's quite interesting: you interact with it over REST. It's somewhat schemaless, somewhat. It's good for semi-real-time stuff. And there's this thing called the ELK stack, which I'm going to leave for the end. So that was that. Are you still with me? That was the theory for today; let's dive right into the action.

But before we dive into the action I need to formally introduce myself. Hi, hello everyone, my name is Thijs. Yes, that's the way you pronounce this name. I am Belgian, and I am ThijsFeryn on Twitter, if you've ever heard of Twitter; you can follow me on that one. I'd love to interact with you, talk with you afterwards and answer questions if you want to, because I'm not sure we'll have enough time for Q&A. Professionally I'm a tech evangelist at a Belgian web hosting company called Combell, but most people probably know me from my involvement in the PHP community, and more specifically as an organizer of the PHP Benelux events. That was the intro, so let's dive right in. What's the first thing you need to do when you want to work with Elasticsearch?
Download Elasticsearch. This is the link where you get the stuff. It's quite easy, and it's there for all platforms, even if you just want to do some testing on your own computer. I have a Mac, and there's Mac-compatible software there: it's a folder, you extract it, there's a bin folder, you start the binary, done. How do you address it? I mentioned that it's RESTful, so the only thing you need to do is point a browser, or another client like curl (I love using curl), at localhost:9200, and this is what you get. When you see "You Know, for Search", you know you've set it up right and you can dive right in.

There is a bit of specific lingo, some terms you need to get familiar with, and I'm going to compare it with the RDBMS systems we're all well aware of. What we call a database in the RDBMS world is called an index in Elasticsearch, which might sound confusing because you have indexes in RDBMS terminology as well, but because Elasticsearch indexes everything, you don't have to add indexes; they just call it an index. What we call a table in the RDBMS world, let's say in MySQL or Oracle or what have you, is called a type. And what we call a row is called a document. We agree? Okay, let's do this. Creating a database (I'm going to use the term database now), or an index as we call it in Elasticsearch terminology, is just doing a POST to the name you want. We'll call this one "blog". I'm going to give a fair number of examples about the blog: in preparation for this presentation I wrote a script that reads the Atom feeds of our company blog and indexed all of that in Elasticsearch. So a lot of the stuff you'll see is just blog posts, but I did not include the blog post body itself, just the title and some meta information. To create this you only have to do POST /blog. Any idea what kind of HTTP status code this will give, just as a teaser? Come on, we create a database... 201, yes, brilliant audience, thank you. And if you want to feed it a document, you just send it some random JSON.
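In the console notation the Elasticsearch documentation uses, those two requests might be sketched like this. This is an illustration under my own assumptions, not the literal slide content, and note that current Elasticsearch versions expect PUT for index creation:

```
PUT /blog

POST /blog/post/1
{
  "title": "Proxy protocol support in Varnish",
  "author": "Thijs Feryn"
}
```

Both respond with a 201 Created on success, along with some metadata about the index or document.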
You POST it to the index, give it a type, give it an ID, and you're good to go. This is just some random stuff I fed it from our corporate blog; it will give a 201, it will return some metadata, and that is as easy as it gets: if you can work with an HTTP client, you can work with this. Retrieving is also easy: you just do a GET with the index, the type name and the identifier, and this is your document. If you did not specify the identifier, it will choose one for you; it will be a random hash, and then you can use that hash to retrieve the document.

I told you it was schemaless. I actually lied: it's not really schemaless, it takes the burden of choosing a schema off of you by guessing the schema for you, and as we see right here, it guessed wrong. I fed it data; let's go back to the data. This is the date, Tuesday the 15th of December 2015, with the time and the timezone, and it turned that into a string. But that's not a string, it's a date type. And that nice little GUID, this one-six-one-six-oh thing, it turned that into a string as well, and that's not really good either. So it guesses, but it can guess wrong, and that's where explicit mapping comes into play: you want to define a schema. Other people say "oh, we have highly unstructured data, we never know what data is going to be fed in". Be realistic: in a lot of cases you will have a pretty good idea of the data that will be fed into your system. So let's do this, let's create an actual schema for the data we created. The only difference is that we give the date field the type "date", and we mention the format, because we have a very specific format; we have to specify this, otherwise Elasticsearch won't be able to deal with it. That is what explicit mapping is. You feed that again via a POST command and it will create the mapping; the next time you insert a document, it will respect that, and the guessing won't be done. Let's pick it up a notch and make it a bit more interesting. I hope you can read this in the back.
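A sketch of what such an explicit mapping could look like in the 1.x-era syntax the talk appears to use. The field name and the RSS-style date format here are my assumptions, not taken from the slides:

```
PUT /blog/_mapping/post
{
  "post": {
    "properties": {
      "date": {
        "type": "date",
        "format": "EEE, dd MMM yyyy HH:mm:ss Z"
      }
    }
  }
}
```

With this in place, a value like "Tue, 15 Dec 2015 10:00:00 +0100" is indexed as a real date instead of a string.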
They specifically asked me to have a minimum font size of 48, but I didn't manage that for this slide. What we're doing here, and I'm going to zoom in a little further, is defining all our properties, but we're adding some weird keywords, as you can see: "analyzed", "analyzer english", "analyzer dutch", and I'm adding fields to properties. What's that about? Let's zoom in, because this is where it gets a lot of fun. What we do is specify that our title exists as a property, so you can send it the title of a blog post, but I want that data to also be structured in other ways, as a sort of copy: I want it analyzed in English, analyzed in Dutch, and I also want the raw form. Raw means that it is not analyzed.

So we're coming into a situation of choosing between analyzed data and non-analyzed data: basically full-text search on human language versus exact values like you use in a database. By default, strings are analyzed: it takes a string, chops it up into bits and pieces, and stores those in a very specific way. That's just in the index; the original source document is never touched, it's left intact, so you still have your source document. What the analyzer does is first apply so-called character filters: it throws out the characters it doesn't feel are relevant for the indexing process. Next up it tokenizes the rest of the text, and then it uses token filters to remove certain aspects of it. By default it's the standard analyzer, and you have a bunch of them; this is just a small list, there are plenty more, and you can influence the way your text is dealt with by choosing between them. The standard analyzer uses a standard tokenizer, a lowercase token filter, and so on. So let's do this with a piece of text, a piece of information that might be useful to you. I wasn't really creative at the time, so I just took a random English catchphrase: "Hey man, how are you doing?".
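A hedged sketch of what such a multi-field title mapping could look like in the pre-5.x "string" syntax; the sub-field names "en", "nl" and "raw" follow the talk, the rest is my reconstruction:

```
PUT /blog/_mapping/post
{
  "post": {
    "properties": {
      "title": {
        "type": "string",
        "analyzer": "standard",
        "fields": {
          "en":  { "type": "string", "analyzer": "english" },
          "nl":  { "type": "string", "analyzer": "dutch" },
          "raw": { "type": "string", "index": "not_analyzed" }
        }
      }
    }
  }
}
```

One incoming title is then indexed four ways: standard-analyzed, English-stemmed, Dutch-stemmed, and as a single untouched exact value.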
There's a way of doing the analysis in Elasticsearch as well; at the end I'll have a link with the examples so you can fiddle around with them yourself, and you can use the analyze endpoint to feed it random data without saving anything in the database, just to see how the analyzer responds. If we use the standard analyzer, it will chop our phrase into "hey", "man", "how", "are", "you", "doing". Notice that it removed the question mark, and notice that it removed the comma, because those aren't really relevant to human language. If we use not the standard one but the whitespace analyzer, it keeps those intact: there's a comma here, there's a question mark here. If we use the English one, it reduces certain words to their base form: "hey" all of a sudden has an "i" instead of a "y", the verb "doing" is reduced to "do", and "are" is a stop word, so it's removed.

So if we go all the way back to our analysis process and zoom in: we store the title as an analyzed string using the standard analyzer. If the data I store in it is English, I might want to query it using the "en" field, because we know that one is analyzed as English. If it's Dutch text (that's my native language), it will be stored as Dutch. And if we don't want it analyzed at all, if we want the full string, we use the raw field. If you manage it that way, life will be easier for you and you have a lot more capabilities to leverage. So let's get back to business. This is a query; this is what the query syntax looks like. You POST it a piece of JSON, and what I ask is: just return the title field. It's like SELECT title FROM blogposts WHERE title = 'working'; that would be the corresponding SQL query. So what I ask the system is: give me the title of each blog post where the word "working" occurs. If I use this one, it will return just a single blog post, "Hosted SharePoint 2010: working efficiently as a team". It's an OR match on terms: it looks for a term, and if it's there, the document is returned.
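The analyze endpoint mentioned here lets you try this without indexing anything. A sketch in the 1.x query-string form; the exact token output depends on the analyzer version, so take the result shown as indicative only:

```
GET /_analyze?analyzer=english&text=Hey man, how are you doing?

Tokens come back along the lines of: "hei", "man", "how", "you", "do"
```

Swapping analyzer=english for standard or whitespace shows the differences described above.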
If we do the same thing on the .en field: boom, a lot more results. It will find "work", because it knows "working" comes from the word "work", so documents that match the term "work" are returned as well, and it assigns a relevance score too. So this makes it easy to dig deeper and return data that might be relevant to the user.

Let's talk about search in detail and dive a little deeper. Look, Elasticsearch is quite simple and it's a small, compact tool; it's not the size that matters, it's what you do with it, that's what they say, right? So let's go into this. If we just do _search without passing any parameters (we add ?pretty because it prettifies your JSON: it indents, it adds newlines; if you don't do that, it will be one long piece of JSON), it will return every single document in the blog index and within the post type. All of it, but paged: it will output a given number of results, and then you can use the cursor to loop through it. This is a lazy way of getting all the data from all our blog posts; it's what I call "search lite". If you make it a bit more complicated and use the full search DSL, that one could be turned into this piece: we query it and we match all documents. We're going to stop using search lite in just a minute, but I'll show you a single example. You can say: I want all the documents where my name, Thijs, occurs in the title field. I can turn this into a match query, which we already saw in a previous example, and this will return our document. So there's search lite using the query string parameter, which I would not advise because it's not that powerful, but if you use POST here and pass along a payload, you have a much more elaborate way to search and retrieve data. You can also count: if you don't call _search but _count, it will count the results.
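As a sketch, the "search lite" form, the equivalent full-DSL form, and the count variant of the same request; the index, type and field names follow the talk's blog example:

```
GET /blog/post/_search?q=title:thijs&pretty

POST /blog/post/_search
{
  "query": {
    "match": { "title": "thijs" }
  }
}

POST /blog/post/_count
{
  "query": {
    "match": { "title": "thijs" }
  }
}
```

The first two return matching documents; the last returns only the number of hits.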
What I'm interested in here is all the blog posts where the title matches "proxy protocol support in Varnish", and that returns 162 posts. Why? Because it's an OR match: it looks for "proxy", it looks for "protocol", it looks for "support", it looks for "in", and it looks for "varnish", and every time one of those tokens matches, the document is returned. If I use the raw field instead (if you remember the mapping correctly, that one isn't analyzed, it's the full text), only one single document is returned, being that very blog post. So lots of opportunities, lots of possibilities. My goal is to show you that this exists, because before I dug deeper into Elasticsearch, I created a system that indexed email logs, and in those logs the identifier contained dashes, and whenever I searched for one I couldn't find it, because the dashes were analyzed away in the index. Then I learned about mapping, about analyzed versus non-analyzed data, and that got me into this.

So let's go to the next step. I showed you how to query the database and I mentioned a bit about filters; let me show you the difference between a filter and a query. Filters are faster. Filters are exact matches on values; it's like what you do in SQL, SELECT * FROM table WHERE key = value. That's what a filter does, and it only answers the question "does it match, yes or no". Queries are better for full-text search data, and they do relevance scoring: they estimate how relevant a document is to you. That's ideal for human language; if it's not human language, use exact values. So in this case (let me go back to the document): this is an exact value, it will be either en-US or nl-NL. This is human language, this could vary. This is an exact value, but it's a date, so it won't be analyzed in any way; analysis only applies to strings. This is an exact value, these are exact values, this is an integer. So for the majority of our fields we'll be using filters; only for the title field in this case will we be using queries. And there are a lot of queries; I've only shown you a small piece. Go to the Elasticsearch website and you will see all the kinds of queries they have.
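The 162-versus-1 contrast might be sketched like this: an analyzed match against title versus an exact term filter against title.raw, using the 1.x-era filtered-query syntax (a reconstruction, not the literal slide):

```
POST /blog/post/_search
{
  "query": {
    "match": { "title": "Proxy protocol support in Varnish" }
  }
}

POST /blog/post/_search
{
  "query": {
    "filtered": {
      "filter": {
        "term": { "title.raw": "Proxy protocol support in Varnish" }
      }
    }
  }
}
```

The first matches any document containing any of the tokens; the second matches only the one document whose untouched title is exactly that string.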
And for filters they also have a bunch; you can do an amazing amount of things with them. Let me show you some filter examples, let's dive in deeper. I showed you how you can do a GET on a single document ID, and that returns that one document. Well, if you want multiple documents, we can use the ids filter: pass it a set of identifiers and it will retrieve those very documents. A simple one, but a convenient one. Next up we're going to use the bool query, and that's a way of mixing multiple filters, multiple queries, or combinations of filters and queries. Bool has three clauses, the logical operators: must, must_not and should. Must is an AND, must_not is a NOT, and should is an OR.

So let's figure out what happens here. We're doing a filtered query, so we're going for exact value matches; we're not going to interpret human language, it's going to be on non-analyzed fields. And we say that every document must be written in English, the date range must be from the beginning of this year up until now, the category must definitely not be Joomla, because we don't care about Joomla blog posts, right, and it should contain either the category "hosting" or "evangelist". So you can combine things in multiple ways and design more complicated filters. Another one: an exact value match, but with a wildcard. We could do a prefix filter on the raw field and look for everything that starts with "Combell", which is the name of my company, so every document starting with Combell or any other business name would be returned that way. And this is where it gets really fun: if you have geospatial data, you can throw it in and map it as a geo point (this only works on geo point fields). You can say: this is the very location of my city in the west of Belgium, and I want to know all the little villages around it.
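A sketch of how those bool clauses might combine inside a filtered query; the category values and date follow the talk, the field names are assumptions:

```
POST /blog/post/_search
{
  "query": {
    "filtered": {
      "filter": {
        "bool": {
          "must": [
            { "term":  { "language": "en-us" } },
            { "range": { "date": { "gte": "2016-01-01" } } }
          ],
          "must_not": [
            { "term": { "category": "joomla" } }
          ],
          "should": [
            { "term": { "category": "hosting" } },
            { "term": { "category": "evangelist" } }
          ]
        }
      }
    }
  }
}
```

Inside a filter context the should acts as a real OR: at least one of its clauses has to match.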
Within a five-kilometer radius, that is. The data I fed into it I got from the MaxMind geo database, which is freely available. I filtered it down, because if you want to index every city in the world, it will take you some time; even all the cities in Belgium alone were more than 100 megabytes of plain text data, and that was a lot. So I indexed all the cities from my region, from my province, and I said: I want to know everything in a five-kilometer range from my place. It did just that, and it even taught me that there are small villages close to where I live that I had never heard of. Interesting. You can take it to the next level if you want and filter on a geo bounding box: you give it a bottom-left and an upper-right point, you draw a box around them, and it will return every document that has a field within that box. So if you're working with, let's say, mobile devices and you have geo data, this is very useful for retrieving documents that match. Use it if you can; I like this one.

Next up: relevance. Let's dig into relevance for a quick second and have a look at this query. What I'm doing here is a bool query, and I'm interested in every blog post where the title contains either "varnish" or my name, Thijs; we return those. And the language must be en-US. So what we're doing here is combining a query and a filter: we're using full text and examining some exact matches. This will be faster than using queries for everything, especially as your data set grows, because these filters will be cached. That's a nice way of doing things. And you can see, talking about relevance, that this document contains both "Thijs" and "varnish", so its relevance score will be higher than the second document, which only contains "varnish". We boost it in relevance there. So that's a good one: if you use regular queries, and that's what we're using here, the sorting will be done on relevance.
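Sketches of the two geo filters mentioned; the index name, field name and coordinates are placeholders, not the actual values from the slides:

```
POST /cities/city/_search
{
  "query": {
    "filtered": {
      "filter": {
        "geo_distance": {
          "distance": "5km",
          "location": { "lat": 50.8, "lon": 3.3 }
        }
      }
    }
  }
}

POST /cities/city/_search
{
  "query": {
    "filtered": {
      "filter": {
        "geo_bounding_box": {
          "location": {
            "top_left":     { "lat": 51.0, "lon": 3.0 },
            "bottom_right": { "lat": 50.6, "lon": 3.6 }
          }
        }
      }
    }
  }
}
```

Both require the location field to be mapped as type geo_point.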
If you use filters, you can define the sorting yourself, because, let me show you here: we're going to filter on the category "PHP Benelux". Everything will have a constant score of 1, there'll be no difference in score, and then it makes sense to add your own ordering: you order on the fields you want it ordered by. In this way we treat Elasticsearch more like a NoSQL database rather than a full-text search engine. I use it more as a database; this is my go-to database. A lot of people say, well, there's Mongo and CouchDB and what have you. I like this because I tend to know it a bit, and it's cheaper for me than running MongoDB, and I'll show you at the end why that is. I'm not bashing MongoDB, I'd like to learn more about MongoDB, but this is easy for me because it combines full-text search, NoSQL features, and, as I'll show you in just a minute, an analytics engine.

This is also a cool one, so let's look at it a bit deeper. It has a must and a should; the should is an OR. What it says is: look for all the documents that contain my name, either "Thijs" or "Feryn", so documents that contain just my last name will be returned as well. And if the category contains "varnish", the relevance will be increased. Now mind you, we've seen these shoulds in combination with filters: when you use filters, the OR is not optional, it needs to match. In queries it is only used to boost the relevance, so if you use a should and it doesn't match, documents will still be returned, but the documents that do contain "varnish" will be boosted: the score will be higher and they'll rank higher in the result set. And you can combine it the other way around: we do a filtered query where we look for all documents that were written by my colleague Romi, but we want to sort based on relevance, so if a document contains "Magento", we boost it. So you can see the intricate differences between queries and filters; use them wisely, I'd say. Let's continue. This is query-time boosting.
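A sketch of the filter-plus-explicit-sort pattern described here; since a pure filter gives every hit the same constant score, the sort clause decides the order (field names assumed):

```
POST /blog/post/_search
{
  "query": {
    "filtered": {
      "filter": {
        "term": { "category": "php benelux" }
      }
    }
  },
  "sort": [
    { "date": { "order": "desc" } }
  ]
}
```

This is the point where Elasticsearch behaves like a NoSQL document store rather than a search engine: exact matching plus deterministic ordering.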
In your query logic, what you can do is define a relative boosting score, and in this case we want Magento documents to have a three-to-two ratio in relevance: every Magento document is treated as 1.5 times more relevant than WordPress documents. This is a good way of controlling it even better.

Okay, next up: multi-index and multi-type. Throughout this presentation I have only talked about retrieving data explicitly from the blog index and the post type. Elasticsearch can do way more, and this is an overview of the capabilities it has, something SQL doesn't have: you cannot select all documents from all tables in all databases at once, and if you did, it would be highly inefficient, I guess. In Elasticsearch you can just call _search and it will look for data in every index and every type, so in every database and every table, and it will cursor through it by five or by ten; you can define the size of your cursor. You can also say: I want all information out of the products index regardless of the type, so regardless of the table. And you can continue: this is an exact one for every product in the products database. Next up, you can define multiple indexes to search, like "I'm interested in all the clients and all the products, return me that data", and then you can have specific payloads by doing a POST to define what you want returned. It goes even further: you can use wildcards. You can say that every index containing "pro", with whatever comes after that, should be searched, and so on; you can combine multiple wildcards. You can do the same thing for types, and you can use the _all keyword to say "I want everything from all indexes, but each time only from the product type", or the product and invoice types, and so on. So "multi all the things" is exactly what I'm trying to say here: it's a powerful way of returning data. Next part of the presentation: aggregations.
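Sketches of the URL patterns this enables; the index and type names are illustrative:

```
GET /_search                       search every index, every type
GET /products/_search              one index, all types
GET /products/product/_search      one index, one type
GET /clients,products/_search      multiple indexes
GET /pro*/_search                  wildcard on index names
GET /_all/product,invoice/_search  all indexes, specific types
```

Each of these can also be a POST with a query payload to narrow down what comes back.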
This is why people tend to like Elasticsearch in the beginning, and when they learn about aggregations, they fall in love with the product. It's basically GROUP BY on steroids. It's the third part of the theory: first of all it's a search engine, second of all it's a NoSQL database, and aggregations make it an analytics engine. It's a cheap one, it's an easy one, and a lot of people use it for simple BI and simple data analytics, simple data science. It's not as elaborate as a lot of other technologies, but it does the basic job.

Let's go back to SQL for a minute and see what happens there. We have aggregations as well; we call them GROUP BYs. We say: select the author and the count of documents, based on the guid field, from the blog posts, and group them by author. We have a bucket, which is what we're grouping by, we have different buckets, and every bucket has a metric. In Elasticsearch we have the very same thing. What I'm doing here is again using the count, not to return the actual documents that match, but just to return the grouping information. If you don't do that, you get both the information you query for and, on top of that, the aggregations, so you can combine both; that's also something SQL does not do for you. So we create an aggregation. This is the name I'm giving my aggregation, it's called "popular bloggers", and I'm using a terms aggregation on the author field, and the only thing it does is count, for every author, the number of blog posts. And this is what gets returned: my colleague Romi, she works in the marketing department, has 415 blog posts; we also have our anonymous Combell name, which has 184; and then we dive all the way down to Christoph, our marketing manager, who only wrote 23 blog posts. We can dig even deeper, we can nest: what we're doing is the very same thing, the popular bloggers, but for every bucket we want sub-buckets. We want to know the difference in language: how many were written in English and how many in Dutch.
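A sketch of the terms aggregation described; setting size to 0 suppresses the hits themselves so only the buckets come back, mirroring the count-only behaviour mentioned (names follow the talk):

```
POST /blog/post/_search
{
  "size": 0,
  "aggs": {
    "popular_bloggers": {
      "terms": { "field": "author" }
    }
  }
}
```

The response contains one bucket per author with a doc_count, the Elasticsearch equivalent of SELECT author, COUNT(*) ... GROUP BY author.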
And this is how it gets returned: the very same results at the top, but then it dives deeper per author. Oh, and it contains a filter as well, I forgot, or rather a query: all the documents where "varnish" is matched. Romi wrote four blog posts about Varnish: three were in English, one was in Dutch. You see where this is going; this goes deeper and deeper and deeper, and you can get some really interesting statistics and information about the documents that were indexed. And there are a bunch of aggregations, as you can see: you also have geo aggregations, you can build graphs, histograms, date histograms, it's all in there.

Next up: managing Elasticsearch. That's also an interesting one, because we've talked about Elasticsearch more or less from a developer perspective, but hosting Elasticsearch and making sure it's reliably available and accessible is a different story altogether. There are plenty of ways to manage Elasticsearch, and that's a different talk as a whole; we're not going to dive into all the specifics, but I'm going to show you a couple of bits and pieces. The reason it's called "elastic" is because it scales very well, and it has a built-in clustering model that works quite okay. There are some caveats, of course, but every technology has those. If you have a node, it will divide your data, every index, up into shards. Shards, for those who aren't familiar with the terminology: let's say we have a regular SQL database, let's say MySQL. If we have copies of our database, like with replication, every database server has the exact same data set. So if we have lots of load, we distribute the load by having multiple servers using replication. But the problem is, as your data set grows, every server must still contain all the data, and that's a problem. That's the thing sharding solves: it looks at your data set, chops it up into pieces, and divides the pieces among the various servers you have. In this example we only have one single server.
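The nested variant might be sketched like this, with a language sub-bucket per author and the Varnish query on top; the field names are assumptions based on the talk:

```
POST /blog/post/_search
{
  "size": 0,
  "query": {
    "match": { "title": "varnish" }
  },
  "aggs": {
    "popular_bloggers": {
      "terms": { "field": "author" },
      "aggs": {
        "languages": {
          "terms": { "field": "language" }
        }
      }
    }
  }
}
```

Each author bucket then carries its own set of language buckets, which is how you get "Romi: 4 Varnish posts, 3 in English, 1 in Dutch".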
And because in the settings of this index there are only three shards (by default there are five), it will keep them all on that very same server. It also has the concept of replicas, so we can have copies of our data that will be distributed. Even if we define that there should be replicas, they're not there yet, because there's only a single server, and Elasticsearch is smart enough to figure that out. As soon as we add a secondary node, Elasticsearch gets smart and starts replicating, copying every shard to the secondary node. There's no actual sharding happening yet, because there are only two nodes: the first one contains all the primary shards, the second one contains all the replica shards. As soon as we start adding a third node, the actual sharding happens: it will dynamically, on the fly, start moving data across to spread it equally. And the more replicas you have (it's like a RAID controller), the more power you have for searches, because it can distribute the searches among all the servers you have and return data faster.

Now, there's a cost to having more replicas: a write cost, an operational cost. I tend to stick with a single replica; if the data is that valuable, add more replicas. But in a lot of cases you don't use Elasticsearch as the source of all truth, like some people do. Other people say: we're going to use our relational database as the transactional piece of data that we trust and know, and we're going to add Elasticsearch as a layer on top of that, to index certain data and make access a lot faster; we'll flatten some queries into views and store those views in Elasticsearch. A lot of systems do that, because if you do all of these complicated queries and full-text searches in a relational database, it will slow you down quite dramatically. So as soon as we have a three-node cluster, it starts distributing. Configuration is really easy.
This is just an example. The only thing you need to do is give it a cluster name, and all the nodes within the same network that have the same cluster name will automatically join. You don't have to set up an elaborate strategy for grouping nodes; no, Elasticsearch will do that for you. It will find them, using multicast or unicast (you can define which), look for nodes, group them into a cluster, and start sharding along. So as a sysadmin this is kind of easy. Now, about the minimum number of servers you need for a decent cluster: the minimum, though not the optimum, seems to be two; you have your basic one, and if that one falls over, you have another one. But the problem with clustering systems, and consensus systems in general, is what happens when a vote needs to take place. Let's say there's a network partition and the connectivity between the nodes is dropped; they individually still keep processing data. When the network comes back up, they need to reach a sort of consensus, there's a vote, and the tricky bit is that the one that loses the vote drops all its data, which is a nightmare. That's why I would advise, if you have a mission-critical system, to have at least three nodes and set discovery.zen.minimum_master_nodes to two. What will happen is that there must be at least two master-eligible nodes visible: if a node gets disconnected from the network, it won't be able to find a second one, and it will say "oh, I'm in trouble, I'm alone" and go down gracefully. That's a good thing: you'd rather have a system go down than become inconsistent.

You can define, at the server level, the number of replicas, the number of shards, and so on. Shards and replicas can also be defined on your index, so per index you can set the number of shards and the number of replicas. And there's another bunch of cool things you can define, like whether HTTP is enabled. So if you have a big stack of servers, let's say 20 or 30 Elasticsearch servers, you can choose their roles.
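A minimal elasticsearch.yml sketch of the settings discussed; the values are illustrative, and the Zen discovery setting applies to the pre-7.x versions the talk refers to:

```
cluster.name: my-cluster
discovery.zen.minimum_master_nodes: 2
index.number_of_shards: 3
index.number_of_replicas: 1
node.rack: rack-1
cluster.routing.allocation.awareness.attributes: rack
```

The last two lines sketch the rack-awareness idea mentioned next: tag nodes with an attribute and tell the allocator to keep primaries and replicas on nodes with different values.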
You can choose which ones are data nodes, which ones are master nodes, and which ones are HTTP nodes, and you can start dedicating certain nodes to just processing incoming requests. You can also tune, at the operating system level, certain nodes that only store data, with nodes on the side that only participate in the master election; but that only works if your cluster is big enough. And if your cluster is distributed among many locations, you can use node attributes. "rack" here is just a value: nodes that are in the same rack get the same name, but rack is just a keyword, you can add other keywords. You can give nodes in the same data center the same attribute, and Elasticsearch will be smart enough to distribute data accordingly, and to not put replicas on nodes that are too close to each other; it puts them a bit further out, so if a data center dies, you still have the data elsewhere. Elasticsearch is that smart.

In terms of manageability there are lots of APIs, which are all JSON, which is not always easy to read. But they have these so-called cat APIs: if you do a GET of /_cat, it will be like cat on Linux, the cat binary, right, just plain columns and rows. And if you do this, you'll see a nice ASCII cat on the first line, which always makes me smile, and then all the APIs you can use; there are a bunch, to manage your indexes, your nodes, your masters. Let me show you an example: if you do cat shards, with ?v to display the column headers, you see every shard on your system, which node it's located on, which IP that node has, and whether it's a primary or a replica. You can see that we have ten shards in total, five primaries and five replicas, and you can see that I have a three-node cluster out there. Another thing you can do is check the cluster health. If one node goes down, you might lose replicas. If the state is green, that means all primary shards are there and all replica shards are up. If it's yellow, that means the primaries are up, but some replica shards might be missing.
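Sketches of the endpoints mentioned; appending ?v to a cat endpoint adds the column headers:

```
GET /_cat
GET /_cat/shards?v
GET /_cat/nodes?v
GET /_cat/health?v
GET /_cluster/health?pretty
```

The cat variants return plain columns and rows, the cluster health endpoint returns the JSON status (green, yellow or red) that monitoring systems can check.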
Missing replicas are not a disaster, but they are a cause for alarm. If the status is red, you have lost primary data. You can use this in your monitoring system: refer to that status, and you can even filter on the word "green", so if the word "green" is not there, you know you're in trouble. That's a good way of monitoring it. On a final note, because I think we have five minutes left, I promised I'd mention something about the ELK stack. Who has heard of the ELK stack? For those who haven't: ELK stands for Elasticsearch, Logstash and Kibana. Those are, or those were, separate open source projects, but they are now coordinated by the company behind Elasticsearch. Logstash, as the name suggests, is really good at processing log information. And because Elasticsearch is a good NoSQL database, can do full-text search and has an analytics function, it's a great way of storing log files for visualization purposes. In the end you drop Kibana on top, and Kibana will visualize what you have. ELK has become even more interesting since the new Beats product was released; it's also open source. The problem with Logstash is that it's a big Java application, and it might be heavy on your system if all it does is extract logs and save them in Elasticsearch. So the new stack adds these lightweight Beats services, written in Go: Filebeat can analyze files, Topbeat can analyze process information, and there is also a Beat that can check network traffic, so you can examine that traffic over its own protocol. A Beat sends its data to Logstash, and Logstash can parse it, analyze the data, re-index it in a certain way and store it in Elasticsearch. Kibana will visualize this, and this is what it could look like: a nice-looking interface and an easy way of creating dashboards that you can project on the big LCD screens in the office to see what is happening. It could be used for sales.
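Conceptually, the parsing step Logstash performs before indexing can be pictured like this. This is not Logstash code, just a hypothetical Python sketch: the log format and the field names are assumptions.

```python
import re

# pattern for a simplified access-log line, e.g. "10.0.0.1 GET /search 200"
LOG_PATTERN = re.compile(r"(?P<ip>\S+) (?P<method>\S+) (?P<path>\S+) (?P<status>\d+)")

def parse_log_line(line: str) -> dict:
    """Turn a raw log line into a structured document ready for indexing."""
    match = LOG_PATTERN.match(line)
    if match is None:
        raise ValueError("unparseable line: " + line)
    doc = match.groupdict()
    doc["status"] = int(doc["status"])  # numeric fields enable analytics
    return doc

print(parse_log_line("10.0.0.1 GET /search 200"))
# → {'ip': '10.0.0.1', 'method': 'GET', 'path': '/search', 'status': 200}
```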
It could be used for checking servers, it could be used for internal application data. It's just the simplest way of composing really shiny, nice-looking, manager-friendly dashboards. On a final note, and this is how the presentation will end: how do you integrate Elasticsearch if you're not using it for logs, but as a search engine, an analytics engine, a database? How do you integrate it? Well, my simple answer is: it's REST, deal with it. And if you don't want to deal with that directly, or if you want a bit more of a layer around it, there are APIs and SDKs and nice little libraries in every language you want. Now, we covered a lot of information today, and I encourage you to test it yourself. What I created for you is a GitHub repository, and you need to download two things; these slides will be online as well, no worries, no reason to write anything down. You need Elasticsearch: whether you use Windows, Mac or Linux, it's readily available, you just download it, extract it and run the binary. But you also need to install Kibana, not for visualization, just as the framework, because we'll be using the Sense plugin, and I have all of this information in the GitHub repository. Sense is like an IDE for Elasticsearch: you can start typing queries and it will autocomplete, so if you're looking into learning the query DSL, it's a really interesting and simple way of doing it. All these examples are linked to Sense and all the data sets are in there, so the only thing you have to do is click a link, Sense will open, you press play, and all the example data sets and all the queries I showed you will be in there, and more. That's a good way of playing around. This is what Sense looks like: it runs on localhost:5601, it's your Kibana app with Sense, and you can throw stuff in there and it will return it in a nice way. So that's the end of the presentation. If you want to follow the other stuff I do: I have a blog, and I have a YouTube channel where I do interviews, tech talks and tech reviews.
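"It's REST, deal with it" can be taken quite literally: nothing more than an HTTP client is needed. Here is a minimal sketch using only the Python standard library; the index name "videos", the field "title" and localhost:9200 are assumptions for illustration.

```python
import json
import urllib.request

def build_search_body(term: str) -> bytes:
    """Build a simple match query against a hypothetical 'title' field."""
    query = {"query": {"match": {"title": term}}}
    return json.dumps(query).encode("utf-8")

def search(term: str) -> dict:
    """POST the query to a (hypothetical) local cluster and decode the response."""
    req = urllib.request.Request(
        "http://localhost:9200/videos/_search",
        data=build_search_body(term),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:  # requires a running cluster
        return json.load(resp)

print(build_search_body("elasticsearch"))
```

Any language with an HTTP client can do the same, which is exactly why the official SDKs are thin conveniences rather than necessities.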
I also have a podcast, I'm on Twitter, and you can follow me on iTunes. Again, these slides will be available; I'll upload them and distribute them over Twitter. Thank you for being part of this, and thanks to the organizers for inviting me. I have another talk this afternoon: if you're running PHP and PHP is too slow for you, how can you solve that with non-PHP technology? So that's another one I'm doing. Thanks for your attention, thanks for being here. Grazie mille!
Info
Channel: Codemotion
Views: 56,936
Keywords: codemotion, codemotion rome, codemotion rome 2016, Thijs Feryn, ElasticSearch
Id: oPObRc8tHgQ
Length: 38min 1sec (2281 seconds)
Published: Fri May 27 2016