Deep dive into the new Elastic Indexing Strategy

Captions
All right, I guess we're live now. Hi everyone, welcome to today's meetup. We're all here to hear from Nicolas, who is going to talk to you about the new Elastic indexing strategy. Please ask your questions at the end if possible; if not, go ahead throughout and we'll try to answer them. Enjoy. Nicolas, please go ahead.

Thanks for the intro. I'd say we jump right in. It's a virtual meetup, so I'm going to pretend I can see all of you, like Kim and Bernard, some of you out of Switzerland. A quick introduction about myself: my name is Nicolas, but most of you probably know me under my handle, ruflin. I'm an engineer at Elastic, with a lot of Beats in my past, now working on a new project called Fleet and Elastic Agent, which brings all the Beats together as one piece. I live in Switzerland and normally run the Elastic Meetup Switzerland; under the current circumstances we're virtual, but there we are. Please use the chat and ask questions. If a question fits directly into the part I'm talking about, I can see it here on the side and will try to answer it right away, but we'll also have time for questions at the end.

So today we're going to do a deep dive into the new Elastic indexing strategy. First, why there is a new indexing strategy; then a bit of technical detail on how it works and what it is; then I'm going to show how we use it in Fleet and Elastic Agent, because that was the initial driver behind the new indexing strategy, with a live demo; and at the end, a quick summary.

To get the elephant out of the room: I'm going to talk about the "new" indexing strategy, but there is no "old" indexing strategy. I just pretend the old indexing strategy is what we do with Beats and Logstash. Second, I'll talk about indices, but as you're going to see along the talk, it's all based on data streams, which are a construct around indices. When we started on the indexing strategy that concept didn't exist yet, so the name "new indexing strategy" just stayed around and we haven't found a good new name yet. You'll probably see it with a better name in the future, but as you know, naming is one of the hardest engineering problems.

So let's jump in: why a new indexing strategy? Normally, if we build something new, it means there were problems or challenges with the old one. So what is the past indexing strategy? Anyone who has used Filebeat, Metricbeat, any of the Beats or Logstash is used to the filebeat-* indices. Historically, the star meant days or months, basically a date, a timestamp, so you had an index per day or an index per week. Then along came ILM with rollover, and you could actually have a criterion, but you still had something like filebeat-000001.
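To make that old bootstrapping concrete, here is a minimal hedged sketch of the kind of setup that had to exist before data could be shipped (index and alias names are illustrative): the first backing index is created by hand with a write alias so that ILM rollover can take over from there.

```
# Hypothetical legacy bootstrap: create the first backing index and mark it
# as the write index behind the "filebeat" alias so rollover has a target.
PUT filebeat-000001
{
  "aliases": {
    "filebeat": { "is_write_index": true }
  }
}
```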
That is one part of the old indexing strategy. The other part is that Beats loaded all the assets for you: there was one massive index template with all the fields, there was a single ILM policy, and the ingest pipelines to be used on the data were basically specified with each request. That is what we had historically, but there are a few problems with it.

The first one is too many fields; I'm pretty sure a few of you have seen that. The challenge is the following: Metricbeat has so-called modules, for MySQL, for Redis, for Apache, and all of these have their own fields. Over time that meant we broke the limit of 1024 fields for a single index, even though most of these fields were never used. Even if you didn't use Apache, its fields were there. So we started to hit the limits of Elasticsearch on the number of fields in a single index and a single template more and more. That was one of the issues.

Another one: if you wanted to do ILM and you said "I want a retention of one month for my Apache logs and a retention of two years for my MySQL logs", you couldn't do that, because ILM is per index.

Performance-wise, if you ran a query for your Apache dashboard, say to see the traffic on Apache over time, that query was actually run across the MySQL data, the Redis data, any data you had in there.

Fourth, bootstrapping was tricky. Because Beats or Logstash did the bootstrapping, meaning setting up the pipeline, they always had to check in advance: is the template there? If not, please load it. What if two Beats tried to do that at the same time? What if a thousand Beats tried to do it at the same time? So bootstrapping was tricky; it worked, but you needed additional permissions on the edge to do it, and that was another problem.

And the last one: let's say you wanted separate ILM for your MySQL logs and your Redis logs. With user modifications, the indices became filebeat-mysql-something, filebeat-redis-something. That often didn't last long, because things broke: the dashboards didn't work anymore, the index pattern in Kibana didn't work, or the ingest pipeline didn't apply correctly. With user modifications you really had to know what you were doing to be successful.

With all these problems, about a year ago, we set off building Elastic Agent, which is basically a supervisor for all the Beats and a few more processes, and asked: how could we change this, how can we improve it? For the first time we had an opportunity to really rethink how indexing should work and how indices should be split up. After many discussions and many iterations, we came up with three important pieces in an index name.

The first one we call the type. The type is very generic: logs, metrics, traces. The type is the first split of the indices: users are going to be interested in querying all the logs, or querying all the metrics, and that's what they can do with the type.

The second one is the data set, and those of you who have used Beats are probably familiar with it, for example nginx.access or mysql.status. The data set part really describes the data itself: its structure, its mapping, and this part is unique to that data set. Nginx access logs have a different structure from nginx error logs, and potentially a different retention: you might be required by law to keep the access logs for two years, while the error logs you should throw away after a month.
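That kind of per-data-set retention becomes straightforward once each data set gets its own lifecycle. As a hedged sketch (policy names and durations here are illustrative, not what any integration actually ships), two ILM policies could look like this:

```
# Keep access logs for two years
PUT _ilm/policy/logs-nginx.access
{
  "policy": {
    "phases": {
      "hot":    { "actions": { "rollover": { "max_size": "60gb", "max_age": "60d" } } },
      "delete": { "min_age": "730d", "actions": { "delete": {} } }
    }
  }
}

# Throw error logs away after a month
PUT _ilm/policy/logs-nginx.error
{
  "policy": {
    "phases": {
      "hot":    { "actions": { "rollover": { "max_size": "60gb", "max_age": "60d" } } },
      "delete": { "min_age": "30d", "actions": { "delete": {} } }
    }
  }
}
```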
And the last part we call the namespace. This is user-configurable, and we really wanted to solve the problem of someone saying "I actually have production and testing", or "I have two teams, team one and team two, and they should not share the same index; I want separate security, different retention, different permissions." That third part is the namespace.

So when we put it together, we have type-dataset-namespace. This is the very basic foundation of the new indexing strategy. To take an example: logs-nginx.access-default. We have default values that you can use out of the box, so if you just start shipping logs, they go to logs-generic-default. That index is basically the filebeat-* index of the new world, the dump where you can still put just about everything. But in general you're going to have much more fine-grained indices.

This new indexing strategy had a few challenges. The first one is bootstrapping. By default, if you just ship data, Elasticsearch maps strings as text fields with a keyword subfield. So what we did is introduce some basic templates into Elasticsearch: logs-*-*, metrics-*-*, and traces-*-*. What that means is that if you ship any data in the predefined format I just showed you, from Elasticsearch 7.9 onwards it's just going to work; there's a very basic template built into Elasticsearch, and this template is ECS-based.

So what is ECS? ECS is the Elastic Common Schema. We published it for the first time, I think, a bit more than two years ago. It is basically a standard format for how you should structure your data, with a predefined set of fields; you can of course add more. These new indices are all ECS-based, and that has a lot of advantages: you're going to know where a specific field is, what it does, and where it comes from. And these templates are loaded by Elasticsearch itself.

The second part: besides these generic templates, we have the data-set-specific templates, and when I say templates, I mean Elasticsearch index templates. These specific templates look like logs-nginx.access-*. As you can see, there is a star at the end, where the namespace was before. What that template means: whether you use production, testing, or anything else as the namespace, mapping-wise exactly the same template is going to apply, so even with user modifications things keep working.

Then the ingest pipeline. Historically it had to be specified with each request. We changed that: now you can attach it in your template, or on the data stream, and you specify it there because we know all the data coming in will have to be processed in that way. That has the advantage that you can centrally modify the processing. And it's all loaded by Ingest Manager. Ingest Manager, or as we sometimes refer to it, Fleet, is a new package management system in Kibana: instead of the modules in Beats, all these modules are now centrally managed as integrations. You can install them with a single click, I'll show you that later in the demo, and things just work. With that we also solved the bootstrapping issue.
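To make the "attach the pipeline in your template" part concrete, here is a minimal hedged sketch (the template and pipeline names are illustrative; the templates that integrations actually install are more elaborate). The index.default_pipeline setting makes every document indexed into matching backing indices go through the pipeline without the client having to specify anything:

```
# Hypothetical data-set template: note the namespace star in the pattern and
# the default pipeline attached via the settings.
PUT _index_template/logs-nginx.access
{
  "index_patterns": ["logs-nginx.access-*"],
  "data_stream": {},
  "priority": 200,
  "template": {
    "settings": {
      "index.default_pipeline": "logs-nginx.access-pipeline"
    }
  }
}
```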
But there was a problem: about a year ago, what we wanted to build didn't work in Elasticsearch. Either we had problems because we had to query across too many indices, or we had a problem with what we called accidental inheritance, where Elasticsearch index templates inherit from each other if they apply to the same indices. And for bootstrapping, an alias with indices behind it (I'm sure a few of you have worked with that) wasn't an option for us, because we just wanted to start shipping data and did not want to set up the indices in advance. So there are three new features, which I'm going to dive into: constant keyword fields, V2 templates, and data streams. These are what make this new indexing strategy possible.

I hope some of you are interested in a live demo, and what is more exciting than a live demo that can actually break? So let me quickly log into Elastic Cloud and create a new deployment; the defaults are all good, I guess. Let's call it "meetup demo" and create the deployment, and let's copy the password over here, we're going to need that later on. Good. What we just did is set up a completely fresh cluster. As that takes one or two minutes, I'm going to continue my talk and come back to it later. What I just used was Elastic Cloud, cloud.elastic.co, where you can spin up your own clusters with the whole Elastic Stack.

Before we do the demo, let's dive a bit into the technical details behind the indexing strategy. First, let's talk about data streams. Data streams are a new concept in Elasticsearch, basically a formalization of time-based indices. Behind a data stream there are multiple hidden backing indices, normal indices, but when you interact with a data stream it actually feels like a single index, and that is really the goal. Whatever we do, we always think in terms of data streams, and the indices behind them are an implementation detail.

The benefit of data streams for us is that they fix the bootstrapping problem. There is no setting up an alias and triggering a rollover to get the first write index: you just ship data to a data stream, and if it isn't set up yet, Elasticsearch is going to do it for you. The only thing it requires is the definition of a timestamp field, and guess which one is the default: @timestamp, again coming from ECS. So data streams are very focused on time series. In addition, they have an integration with ILM: they accept rollover, they accept shrink commands. And they are based on composable index templates, or what we sometimes call V2 index templates.

Historically, if you had an index template for logs-* and an index template for logs-nginx.*, then the fields missing in one were inherited from the other, assuming it had a higher priority. That caused a few issues on our end, as we sometimes had settings in one template that we didn't want applied through the other. So the Elasticsearch team built composable templates: an index template can pull in multiple component templates. The basic logs settings can be one component, and we can use it in this template or that template. You get building blocks that you can move around, but without the inheritance problems: whatever components a template pulls in, only that template applies, and no parts of some other template leak in.
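A minimal hedged sketch of that building-block idea (all names here are illustrative): a shared component template that a data-stream-enabled V2 template pulls in via composed_of.

```
# A reusable building block with shared settings
PUT _component_template/logs-shared-settings
{
  "template": {
    "settings": { "index.number_of_shards": 1 }
  }
}

# A V2 index template that pulls the block in; "data_stream": {} means that
# indices matching the pattern are created as data streams, not plain indices.
PUT _index_template/logs-nginx.error
{
  "index_patterns": ["logs-nginx.error-*"],
  "data_stream": {},
  "priority": 200,
  "composed_of": ["logs-shared-settings"]
}
```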
Talking a bit more about the bootstrapping problem: historically, you created that alias with the write index for compatibility with rollover, then you started sending data, and you had to do that for each data stream. The problem: let's assume you started to ship data to a specific index and you forgot to set it up. There was no way out; you just had a single index that kept growing, and you had to wipe it, delete it, and create the alias and the rollover index. Now, V2 index templates are what you use to create data streams, so these two are tied together: if you want data streams, you need to jump to V2 templates. You can basically have one template such as logs-*-*, anything that matches it just uses data streams if you want it to, and everything gets created automatically, which is great.

A bit more about the indexing strategy and the split into data streams. You saw before that we split by type, data set, and namespace. What that means is that you can really have retention per data stream, you can have warm and cold tiers per data stream, you can have rollup per data stream. All of these things benefit you as a user. It also makes ingestion easier, and it helps you with security: instead of having to expose all your logs to users, or needing document-based or field-based security, you can now use index-based security, for example to give a small group access to just the access logs but not the error logs. Overall, and I'm going to make this point later with some examples, the indices are smaller, and because they are smaller, querying also becomes faster. So we also get performance and storage benefits out of the new indexing strategy.

Here is an overview of the system module and the metric sets we had in the past, to give you a bit of understanding of what we're talking about: process is a metric set, or data set, and so are network, memory, and CPU. Historically, all of this went into a single index. Now, with the new indexing strategy, each slice becomes its own index, which means network data is not mixed with memory data and not mixed with process data; they all have very different structures.

Smaller, faster: we took some identical data sets, and the index size decreased by about 18%. Obviously that's our specific test data set, but that's the number we saw, an 18% reduction in index size from the old to the new strategy. Memory usage also dropped. And, initially surprising to us, the indexing throughput increased. We worried in the beginning that more indices would potentially mean more data routing inside Elasticsearch, so indexing might slow down, but actually the opposite happened.

Let me quickly check, there are some questions in the chat. "Does reducing the number of fields in the index template to a small subset of what is needed give us better memory efficiency? Does having fewer possible fields save heap space?" So, David, I would need to pull in someone from Elasticsearch to answer that one hundred percent correctly. One thing: the template itself doesn't make a difference; what helps is the mapping of the data stream, which is going to be much smaller. I think the answer to "it's going to be a bit more memory efficient and save heap" is yes, but I would need someone from Elasticsearch to be completely sure. What I can guarantee you is that memory usage drops, and I think having fewer fields is one of the reasons.

There is a related question from Bernard: "Data streams are lightweight, but they will lead to a higher number of indices; what is the impact of this?" This is not quite correct: data streams don't lead to a higher number of indices, the indexing strategy leads to a higher number of indices. We'll get to the impact later; it is one of the downsides that we have more indices, but I'll show you some more benefits first. With data streams themselves, if you just create a data stream, which is a shell with a single index, you still have the same number of indices. It's really the indexing strategy that makes the difference. Great, let's go on.
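Picking up the index-based security point from a moment ago, here is a hedged sketch of what it enables (role name and pattern are illustrative): a role that can read the nginx access logs and nothing else, with no need for document- or field-level security.

```
# Grant read access to the access logs only; error logs stay invisible.
PUT _security/role/nginx_access_reader
{
  "indices": [
    {
      "names": ["logs-nginx.access-*"],
      "privileges": ["read"]
    }
  ]
}
```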
So the next technical piece is can_match, which Elasticsearch provides and which gives us a lot of the magic around the new indexing strategy. What is can_match? It was initially introduced for range queries on timestamps. With time series, you very often say "give me the data from the 22nd of November" or "give me the data of the last 15 minutes". For many of the indices we then know in advance that we don't even have to open them, because we know there is no timestamp in there that is going to match.

Can_match was later extended: it can now, for example, also be applied to constant keyword fields. A constant keyword is basically a keyword field that has exactly the same value in all the documents of an index. Applying that to the indexing strategy: we have the three fields, type, data set, and namespace, which I showed you in the beginning, and these have exactly the same value within a single data stream.

Why does that help? Let's have a look. Can_match basically runs before the query and checks, for all the indices, whether it is even possible that the query might match. Because the data set, say mysql.status, is a constant keyword, Elasticsearch can check all the indices and say in advance "these are not going to match", so the query itself only needs to run on that single index. Before, in Metricbeat, the query ran not only on all the MySQL data but on all the metrics data; now, with the data set as a constant keyword plus can_match, the query runs only on the mysql.status data, which is a very, very small subset. The query itself is going to be much faster, and it touches a much smaller portion of the data, because all the other indices were simply skipped and ignored.

This is also an advantage if you have cold or frozen indices. Those indices are slow; queries against them are going to be slow, because you said you rarely query them. Before, if you ran this kind of query, all of them had to be opened; now, with can_match, the indices don't have to be opened, because we know in advance whether they can match or not.
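To ground the constant keyword part, a hedged sketch of the mapping idea (this mirrors what the built-in templates do conceptually; the exact layout here is illustrative), followed by the kind of query that can_match can then prune down to a single data stream:

```
# Map the three routing fields as constant_keyword
PUT _component_template/data-stream-mappings
{
  "template": {
    "mappings": {
      "properties": {
        "data_stream": {
          "properties": {
            "type":      { "type": "constant_keyword" },
            "dataset":   { "type": "constant_keyword" },
            "namespace": { "type": "constant_keyword" }
          }
        }
      }
    }
  }
}

# Although this targets all metrics indices, can_match lets Elasticsearch
# skip every index whose constant dataset value is not mysql.status.
GET metrics-*-*/_search
{
  "query": {
    "bool": {
      "filter": [
        { "term":  { "data_stream.dataset": "mysql.status" } },
        { "range": { "@timestamp": { "gte": "now-15m" } } }
      ]
    }
  }
}
```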
So what is the catch? Bernard, you kind of pointed in that direction: we now have many more indices. This is partially true. Let's take the example that we have a lot of data for nginx access and error logs, and we roll over at 60 gigabytes or 60 days. The access logs fill up 60 gigabytes in two days; the error logs never get close. From day one we're going to have an index for access logs and one for error logs, but the access log index keeps rolling over every second day, while the error log index rolls over only every 60 days. So in the beginning you probably have a few more indices, but over time, assuming you have a lot of traffic, the number of indices becomes very similar. Where it has a bigger impact is if you have certain data sets with a very low volume: the rollover on those indices might keep months-old data in the hot tier, and rolling over by age sometimes means you have more shards, and they're smaller than average.

It is a disadvantage, but we're working with the Elasticsearch team to make sure it doesn't become an issue. There is always a trade-off between benefits and downsides, as we see here, but with all the benefits I showed you before (faster queries, better compression, easier modification for users, all these things we solved) we said this is a trade-off we're going to take, and we know it's there so we can keep improving it.

Let me quickly jump to the questions again. David asked how he can retrieve the follow-up information. I'm not sure which one you refer to; David, can you perhaps type in a bit more detail, so I can tackle it again later on? I would appreciate that.

So let's continue with the demo, let's see if it's going to work. We have our cluster; single sign-on should work, we have that in Cloud now, hopefully. What I'm going to do, to show you the new indexing strategy, is a quick installation of an Elastic Agent, and then look at the data it ingests and the metrics around it, because it's all going to be based on the new indexing strategy.

What I just opened is Fleet. That is the product that installs packages, where you can enroll agents, add integrations, and manage them centrally. There is one issue, our setup sometimes takes too long; I should probably have selected a bigger demo cluster. But we're ready, so let's add an agent: enable central management for the Elastic Agents and go to the download page. I want a Mac build, so let's download that quickly; I'm glad my internet connection is fast. We have nothing enrolled yet, so let's open the terminal, zoom in a bit, and jump into this directory. This is the new Elastic Agent binary, and to start it we get a nice enroll command for macOS here. Let's copy-paste it; ah yes, let's run it as root. Let's have a look at what is happening: it tells me it's installed and it's running, and see what happens here: my agent shows up, it just enrolled.

Let's have a quick look at the policy that we put in. A policy is basically the configuration of the agent. What we have in here is called "system"; we can have a look at this integration, and we have a few logs and quite a few metrics: system disk IO, filesystem, and so on. I hope that rings a bell by now, because what we actually have here are the data sets.

By now I hope the data is in, so let's have a look at our data streams, and now we see the three pieces: we have the type, we have the data set, and we have the namespace. We have system.process, system.network, and so on. And again, the beauty of this is that we can jump directly from here to some dashboards, and all the dashboards you're going to see now are hopefully not broken. Come on...

All these dashboards are now based on the new indexing strategy. What that means: we have the load visualization here, and load is the system.load data set, so the query behind it ran only on the load data set; the same for memory. Before, when you just spun up Metricbeat, every single visualization on a dashboard like this actually ran across all your data; now every single block runs only on a very small subset of the data. If you look at it, this dashboard has about 12 to 15 queries, perhaps a few more. So now you can actually see how the new indexing strategy is going to improve dashboards and everything around them.
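If you want to poke at the same things outside the Fleet UI, here are a couple of hedged console equivalents (which data streams exist depends on the integrations you run):

```
# List the system data streams the agent created
GET _data_stream/metrics-system.*

# The kind of narrowly scoped query a single dashboard panel now runs
GET metrics-system.load-default/_search
{
  "query": { "range": { "@timestamp": { "gte": "now-15m" } } }
}
```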
One other thing I promised you is a look at Fleet. It's only indirectly related to the indexing strategy, but I want to show it because, as I said, everything is now loaded centrally. For example, Apache: we have the Apache integration here, and as soon as I click "Add Apache", I add it to my default policy, which is the one running on my local machine. I save the integration and confirm that I want to deploy it. What now happens in the background: it installs the ingest pipelines and index templates for every single data set of Apache, and rolls the configuration out to my agent. Unfortunately, I don't have Apache running on my local machine today, so it's probably going to report some errors, but I have dashboards now: if we go to the dashboards, I didn't have any Apache dashboards before, and now we have a few. So this indexing strategy brings all the pieces together: it makes central installation of assets possible, it makes the separation of all the indices possible, and it makes the queries faster. That's probably also why it took us so long to build, but it is now there.

So, a quick summary. I see there is an additional question from David following up on the heap question from earlier: does reducing the number of fields in the index template to a small set give us memory efficiency? Got it, David. What I would suggest is to go to discuss.elastic.co and post exactly that question in the Elasticsearch channel; I'm pretty sure someone can answer it directly for you. Feel free to ping me, @ruflin is always going to work, so we can get that answer for you. In general, for any question you don't get answered today, jump on Discuss; I'm more than happy to answer there or find you the right person to answer it.

A few questions answered in advance before we get into all of yours. Can you use the new indexing strategy? Yes, you can. It ships with 7.9 and data streams, and you can use it not only with the Elastic Agent: if you have your own data shipper, you can actually use it directly. Elastic Agent is today the biggest user, but you can use the strategy for any of your time series data, for logs, metrics, and traces.

Next: how should you name your indices, or actually, your data streams? The first part, the type, is very simple: logs, metrics, traces; pick the one that fits your data best. The next question is which data set to pick. If we're talking about services, I often say prefix it with the service name, and make the second part describe what the data actually is. As for whether you need to split a data set up or not, the questions are: does the data have the same structure, do you expect the same ILM policy, does it have the same processing, and, very important, what will the queries look like? Will you always query it as one block, or are all your queries just for a section of the data? The latter is probably a good indication that you should split it up. And lastly, decide whether you need a namespace. At first you probably don't; just leave it as default. If you later figure out you have multiple teams, or you want to split by production and testing or whatever you come up with for the namespace, you can still introduce it later on.

I see a few more questions in the chat. "When you go for custom logs, the custom integration pushes configuration to the endpoint; who creates the ingest pipelines, applies the config, and routes the data over it?" There are two ways I might understand "custom integration". If you mean really building your own integration: in theory that is possible today, but we don't support it yet. It would show up in Kibana, you'd install it there, and it would install the central ingest pipelines and push the configuration down to the agent. The other way to read it, and we can jump over here quickly to show it, is the Custom Logs integration. That one means you specify a log file path, like using the log input in Filebeat, and you can specify the data set and some additional configuration. If you do that and set the data set to "foo", you can then load a template for logs-foo-default and attach an ingest pipeline to it, and your processing is going to work. You've basically built your own integration into it; we're not going to update it, it's yours.
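For reference, a hedged sketch of what that hypothetical "foo" setup could look like (the pipeline body and field names are purely illustrative): create a pipeline, then reference it from a logs-foo-* template just like the nginx one shown earlier.

```
# A made-up pipeline for the custom "foo" data set
PUT _ingest/pipeline/logs-foo
{
  "description": "Parse custom foo logs",
  "processors": [
    {
      "dissect": {
        "field": "message",
        "pattern": "%{source.ip} - %{event.action}"
      }
    }
  ]
}
```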
A few more questions. "Are the standard Beats moving to the new strategy as well in the future?" Probably not. The reason is that it's a very hard breaking change. We're still thinking about a migration path, of course, and indirectly Beats support it, because the agent just runs Beats, but we're still thinking about how all these things are going to play together. So no definite answer here. Good, seems like I answered mr.dot's question along the way; I'm glad I did.

So let's jump back to the presentation. Now we're at the questions part; please fill up the chat, and I can see more questions over here. Don't be shy. It's a pity that you cannot speak up, but I think YouTube doesn't support that. While you're typing, let me give you another minute and mention one thing: we have a survey, and Dahlia, I think you can post it in the chat. I'd like to ask you to fill it out; it's really for us to get feedback about the talk, the meetups, and all these kinds of things, so that we can improve on our end. Perhaps you don't want to see the indexing strategy but different topics, or more technical depth, or more high level; let us know in the survey.

We have a question from mr.dot about ECS. Yes, we keep pushing on ECS in all places, also in Logstash. By the way, Logstash is also working on supporting data streams and the new indexing strategy, so hopefully soon you'll be able to use the new indexing strategy from Logstash. Actually, to be fair, you can use it today if you specify the right patterns: in the end, whether it's an index or a data stream, Logstash doesn't care, it's just going to ship the data. What else do we have: the survey again, good.

I think we're going to wrap up here soon. What I really recommend: please jump on discuss.elastic.co if you have more questions. Oh, more questions are coming in; Dahlia, is it okay to take a few more? Absolutely, go ahead. So David asks: they built a few dashboards and SIEM rules on the old index patterns with Winlogbeat; how can they adapt to the new indexing strategy, and are we asking our users to modify every dashboard? Very good question, and we're in the process of answering it, because we also need to answer it for all our users. Winlogbeat actually also runs under the Elastic Agent, or at least the winlog input does, and all of these modules are going to be published as integration packages too. So I don't have an answer for you today, but I hope we can provide you with one in the near future.
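On the Logstash point: since in the end it's just documents being sent to a name that matches the built-in patterns, any shipper can target the new scheme. A hedged sketch with a hypothetical "myapp" service, using the plain index API (on 7.9+ the built-in logs-*-* template picks this up and creates the data stream automatically):

```
# The first document creates the data stream logs-myapp.access-default
POST logs-myapp.access-default/_doc
{
  "@timestamp": "2020-11-24T12:00:00Z",
  "message": "GET /index.html HTTP/1.1 200",
  "data_stream": {
    "type": "logs",
    "dataset": "myapp.access",
    "namespace": "default"
  }
}
```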
Next one: "When using Fleet for this, I understand it's not a released feature currently; what about release dates?" So, we don't talk publicly about release dates or about when something is going to go GA, so I'm not going to step onto that slippery slope. I always hope "soon". I'm basically the tech lead on this whole project, and we're pushing really hard to make it available as GA, but it has a lot of moving parts, and if we make it GA we really want to make sure it's good. The goal is definitely within the 7.x cycle.

One more: "Is it possible to configure integrations so they only set up data streams and dashboards but not ingest pipelines, if the logs are enriched by Logstash?" Yes. An integration, a package, can basically load whatever you want: it could be just a dashboard, it could be just an ingest pipeline, it could be just the template. You can really pick what goes into an integration. Bernard, before you get too excited: unfortunately, we don't officially support building your own integrations yet. Perhaps you have figured out how you could do it yourself by now, but it's not officially supported. But yes, everything mentioned there is possible; you can slice and dice it the way you want.

Did I miss any more, Dahlia? All good? Good, then I'd say we wrap it up. Thank you very much; Dahlia, the last words are yours. Yeah, thank you very much, Nicolas, what an amazing presentation, super interactive. We'd love to have you again as a presenter, and if any one of the attendees ever wants to present at one of our meetups, please give us a shout; we're always looking for amazing speakers who are willing to share their stories and their use cases. So please let us know. And if that's all, we wish a very good evening to all of you, and thanks for coming to our meetup. Thank you!
Info
Channel: Official Elastic Community
Views: 4,208
Id: ls1O-gB-Voo
Length: 44min 58sec (2698 seconds)
Published: Tue Nov 24 2020