Michael Desa [InfluxData] | InfluxDB 101 – Concepts and Architecture | InfluxDays London 2019

Captions
Hi everybody, my name is Michael DeSa, and today we're going to be going over InfluxDB 101. A little more background on me before I dig into the agenda: I've been an engineer at InfluxData for about four years, so when Paul talks about the history, right around the 0.8 to 0.9 transition is when I came on board. I've been through this whole story arc of the company, so if you have questions about anything at any time, I'm happy to answer them and hopefully give you some kind of explanation.

The agenda for today is to define, at a high level, what time series data is and recognize some of its use cases; to describe what InfluxDB is and its relation to InfluxData, since there's been a little journey there; to explain the InfluxDB data model; and then to reason about the impact that schema decisions might have on an instance. We'll also talk a little bit about Flux, InfluxQL, and Kapacitor, and how we got to each of those things.

So to get started: what is time series data? It's everything from tick market data, which is probably where I see people have the most experience with time series data: on the x-axis you have time, on the y-axis you have the stock price. It's a great instance of time series data. We also have things like dashboarding and application monitoring, where we're looking at service latency, incident creation, and so on. All of this is in the realm of time series data, and it should be very familiar to everybody here; this is, after all, a time series data conference. We also have system monitoring, where we're looking at disk ops, disk usage, the short-term load on instances, or the resident set size of a particular process, as well as IoT and things like heart rate data.

Then there are logs. Logs are an interesting thing that people usually don't associate with time series, but if you think about it, at its core a log really is an event; it's really a time series. That's actually something we've been focusing on at Influx: trying to figure out how to get log data into our platform in a way that's reasonable. We may not have the best compression for strings, but we've been working on making that more of a possibility. Something else that people don't think of as much when they think of time series data is traces, but really, at its core, a trace is a time series. If you're not familiar with traces or distributed tracing, essentially what you're trying to do is map what took the most time in your system and the relationships between those components. The time component here is really how long something took, which doesn't necessarily fit with a traditional time series database; it's not a value over time, it's a range of time. Still, you can efficiently store those types of data in a time series database.

There are two major categories of time series; the formal terms for them are regular and irregular. A regular time series is the graph up at the top: you have a point coming in at every fixed interval, every five seconds, every ten seconds, something that you know ahead of time. Things like CPU monitoring, where I poll the CPU at the same interval every time just to get a general idea of the overarching shape, are regular time series. Beneath that we have irregular time series, where the data is going to come in but we don't know when. Both of these things are time series. The more colloquial names for them are metrics and events: metrics are that regular time series data coming in at a fixed interval, every ten seconds, every five seconds, while events are things like somebody pings my server and I get a latency value back that I want to record, or there's a security breach and I want to log it. An event takes place, and we're trying to monitor those things.
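To make the two shapes concrete, here is a small sketch of what each might look like written as InfluxDB line protocol; the hosts, values, and timestamps are made up for illustration and are not from the talk. The regular series arrives on a fixed ten-second cadence, while the irregular series arrives whenever a request happens to occur:

    # Regular series (metrics): CPU sampled every 10 seconds
    cpu,host=server01 usage_percent=41.8 1561624800000000000
    cpu,host=server01 usage_percent=43.2 1561624810000000000
    cpu,host=server01 usage_percent=42.5 1561624820000000000

    # Irregular series (events): individual requests, recorded as they occur
    http_request,host=server01 latency_ms=112.4 1561624803512000000
    http_request,host=server01 latency_ms=97.1 1561624811087000000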
So, just so I'm not up here talking to myself the whole time: is the graph that we have here a time series? We've got temperature on the x-axis and price on the y-axis. Would anybody ever think this is a time series? I see a lot of shaking heads; anybody want to give a verbal answer? There we go: not time series, definitely not. What about this one over here? Why? Yeah, exactly. What we're looking at is the viral spread of an Ed Sheeran song, and we're looking at that over time: we've got a visualization on a map and we're watching how its popularity has spread over time. The thing that's changing here is the time axis.

Great, so what is a time series database? A time series database, in my mind, is a database with a few characteristics. My answer is that it's a database where you manage and store time series data, but you can pretty much put that kind of data into any type of database; you can put it into MySQL, really anything. The two things that I think a time series database should handle out of the box are, first, it should be able to efficiently handle time series workloads. Time series data traditionally comes in at very, very high write loads: you're monitoring hundreds of thousands, millions, or even billions of independent things. Your query workloads follow the same pattern: when you're reading data, you're reading millions and millions of records. You're not selecting a specific record; maybe you're picking something like the max value, but you're not looking for individual records, you're usually looking at ranges of time. The other thing a time series database must handle out of the box, to be considered a time series database, is range queries: queries based on time that are efficient. That shouldn't be something tacked on at the end; it should be a first-class part of the system.
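To illustrate what "range queries as a first-class feature" means in practice, here's a hedged sketch of the kind of time-bounded query a time series database is expected to make cheap; it's written in InfluxQL against the made-up cpu measurement from the earlier example, not a query from the talk:

    -- mean CPU usage per host, over the last six hours, in 5-minute buckets
    SELECT MEAN("usage_percent")
    FROM "cpu"
    WHERE time > now() - 6h
    GROUP BY time(5m), "host"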
There are a bunch of different time series databases out there. We happen to be number one; shameless plug, that's why you're all here. But there have been a bunch of them in the past. There's kdb+, a multi-model database that's very popular in the finance industry; in the monitoring space you have things like Graphite and Prometheus or RRDtool; and among newer up-and-comers you have things like Timescale. One question I get all the time is, "I know MySQL really well, why can't I just use MySQL?" or "I know Postgres super well, I can put my records in there." The answer is you really can, but you're losing out on a bunch of things that a time series database gives you for free, and it's only going to scale to a certain level. Day in and day out we see people going from a Graphite installation, or MySQL, or some other solution, to wanting something specific to time series, and that has to do with the massive write loads and massive query loads associated with these workloads.

I think the more important factor in all of this is that time series really is not just a database problem. If you've ever done anything with a time series database, you know that putting your data into the database and just letting it sit there is not super useful. There are a lot of other components intertwined with time series. The big ones that come to mind for me are: I want to visualize my data somehow, to see what is actually happening and how it's changing over time; I want to be able to alert on that data, so if something goes over a threshold, or a particular value or error rate looks wrong, I want to know about it; I want to process that data, because I may not care about individual points but I do want to see trends over some time scale; and most importantly, I want to be able to take some kind of action from my data.

So what are InfluxDB and InfluxData? We talked about this a little bit: it's a database, but how did we get here? We didn't fall into this time series database idea, or platform, by accident. It really started in 2012, when Paul and another founder started a company called Errplane. The objective there was to make something a lot like New Relic and make application monitoring a lot better. But pretty early on, Paul recognized there was a big gap in the market. There were other time series databases like OpenTSDB, but they always had external dependencies and were a giant pain to set up; or you could do something custom with Cassandra, and it was always very difficult to get up and running. His idea was: let's just make a database that does all these things very simply, with no external dependencies, and just get it up and running. Around 2015 is when I joined the company, and that's when we started the path of transitioning from InfluxDB to InfluxData, which is really a company built around this idea of solving the time series problem: not just a database where you store the data, but also the visualization, the alerting, and the processing of that data.

From 2015 to 2018 we were going down this path, and then in 2018 we really started hitting some rough edges with InfluxQL and the general model of things. We had built about four or five different pieces; we had something called the TICK Stack, which I'm sure you're all familiar with, but a problem we had was that a lot of people couldn't configure it very easily, and setting up all the pieces independently was a bit of a pain. There wasn't really a single, unified experience of what InfluxDB or InfluxData was: some people were using Kapacitor, some people were using Chronograf or Grafana, and all these different things.
We wanted to have a cohesive, singular experience that was the InfluxData experience. On top of that, something we had heard time and time again was: we're a team at a company and we want to offer InfluxData as an internal SaaS tool, essentially offering it to our internal teams, and we want a lot of tooling around controlling which series get written and who can do what in the system. We wanted to build, from first principles, a system that would work in that kind of scenario.

So why InfluxData? Why would you go with InfluxData or InfluxDB over some of the other solutions out there? To this day, our aim is that it should be very easy to get started with; we do everything we can to get out of the way and help you solve your problem however you see fit. To my knowledge, we're the only company that is really aiming to solve the entire time series problem: we're not trying to solve just the database part or just the visualization part, we really want to get the whole solution right, because we think that's the key to making time series no longer an issue. On top of that, we scale well both horizontally and vertically. When I started, I would say our performance was okay; today it's great. On a single instance you can do a couple million writes per second for time series data, which is pretty impressive, or if you want to use our commercial product, we scale horizontally as well.

That's a lot about InfluxData, so now I'll talk a little bit about the InfluxDB data model. To do that, I'm going to start with the canonical time series line graph and reason about how I think about the various pieces of InfluxDB. I always struggle to explain what a measurement is, what a tag is, how they're different, why you'd put something where, and how you should think about these things, but whenever I come back to this graph I find I can reason about what those things are.

To start, we've got a graph: it's stock price, with time on the x-axis, price on the y-axis, and some legend data off to the side. The first thing you should notice is the label up at the top. That label we call the measurement. It's a high-level grouping for all the data beneath it; common ideas for measurements are memory, or CPU, or Postgres. You take a high-level concept and group things under it. It's somewhat similar to a table name in MySQL, but I don't quite like that analogy; I think it's a layer above that. Next we have the legend. The legend is metadata about what's going on here, and we call that metadata tags. For example, we have that blue circle there: that would be ticker=A, and we'll say the market is NASDAQ, where the circle shape corresponds to NASDAQ and the color corresponds to the ticker. The collection of all of these things for a particular point is what we call the tag set. One important thing to note about tags is that they're just key-value pairs, and they're indexed, so when you're deciding whether something should be a tag, ask whether adding it as a tag actually offers any benefit: is there really more than one point that might be associated with this value, or is it just a way of getting at a single point?
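Putting those pieces together, here is roughly how that stock price example might look written as InfluxDB line protocol; this is my own sketch, and the measurement, tag, and field names are made up to match the graph rather than taken from the slides. The measurement comes first, then the comma-separated tag set, then the fields (the actual recorded values, which are not indexed), then a timestamp:

    stock_price,ticker=A,market=NASDAQ price=157.32 1561624800000000000
    stock_price,ticker=B,market=NASDAQ price=73.08 1561624800000000000

Here ticker and market are tags, and price is a field; the measurement plus the tag set identifies the series, and each line adds one point to it.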
The other thing that's super important is not to use too few tags. Something that can happen if you use too few tags is that you end up getting data collisions. Say I have the measurement cpu, with region=uswest, host=server1, and a value of 0. You can think of the combination of measurement, tag set, and timestamp as the ID for a point in a series. If I write something that has the same measurement and tag set and the same timestamp, the fields that were there are going to be overwritten. So you want to be careful that your data is sufficiently distinguished, because otherwise you're going to start getting collisions and losing data; the system is a last-write-wins system. If I wrote a bunch of points and then a new one comes in with the same measurement and tags and the same timestamp, those fields get overwritten. That's something to be mindful of whenever you're working with InfluxDB: make sure you don't have too few tags.
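As a quick sketch of that last-write-wins behavior (illustrative values only), these two line protocol writes share the same measurement, tag set, and timestamp, so they identify the same point, and the second write silently replaces the first point's field value:

    cpu,region=uswest,host=server1 value=0 1561624800000000000
    cpu,region=uswest,host=server1 value=0.64 1561624800000000000

After both writes, a query at that timestamp returns only value=0.64. If those were really two different machines, a distinguishing tag (say host=server2 on the second write) would keep them as separate series instead of colliding.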
So now I've told you what you should not do; the real question is, what should you do? I think that's a really hard question to answer, and I find it's best to walk through an example and reason from first principles about what the right thing might be. Whenever I'm designing a schema, I really start with questions: what dashboards do I need, what do I need to see, what alerts do I need to have, which are the mission-critical things I need to be alerted on, what reports do I want to generate, and is there any information I need readily available if there's an incident? I'm trying to identify which things I'm going to be running continuously, like every five seconds on a dashboard somewhere, versus things that may be a little more ad hoc that I just need available if there is a problem, but don't need to have on hand.

So let's walk through an example and reason about these things. Suppose I operate a SaaS application with hundreds or thousands of different services. I want to know the request and error rates for each service, I want to trigger an alert if the error rate for any service is too high, and I want to know which services currently have the highest average request durations. That's the problem I'm trying to solve. At a high level, the data I have available to me is: the application, the service name, the container ID, the path (the HTTP request path), the method I'm using, the source and destination of the request, the HTTP status of the request, the request ID, the duration of the request, the bytes transmitted, and the bytes received. There's potentially a lot more than this, but I didn't want the slide to go on forever.

One question: why would it be a bad idea to make container ID or request ID a tag? One thing I want to stress is that container ID is not that bad, and today I probably would use container ID as a tag, but request ID is the one I really want to call out. Why would that be a bad idea? Cardinality, yeah, exactly: we're going to blow up the cardinality. A request ID is unique to each request, so as a tag it would create a new series for every single request. Container ID is also something that continuously goes up, especially if you're doing things like Kubernetes where containers are coming up and down; they're very ephemeral, and they will grow unbounded. You do want some way to make sure those values are being churned down, but that kind of churn is common in the business, and we've put a bunch of effort into allowing for that kind of churn in cardinality. As long as you have something cleaning up the series behind the scenes, your cardinality won't grow forever; but if you need to keep that data indefinitely, having container ID as a tag is probably a bad idea, since your series cardinality will only increase.

The next question is, how should we organize our data? Thinking back to the original questions and the data available to us: we really wanted to know request rates and error rates, we want to be able to monitor and alert on those things, and we want to know which services are taking the longest. The schema I would propose is a single measurement; we'll call it latency, something like that. Then we'll have tags for the application, the service, the container ID, the path (the HTTP request path), the method, the source and the destination, and the HTTP status we got back. And we'll have fields for things like request ID, duration, bytes_tx, and bytes_rx. This should be pretty intuitive: none of the things that are tags are going to blow up cardinality on us, except potentially the container ID, which could grow over time, and request ID is still accessible if we need it. If I want to look for a specific request ID, I can eventually do so and get the actual value back, it just might be a little slower. It fits our category of being able to get at the data, just maybe not in the fastest fashion. And most of the visualization I'm going to want to do is probably based on the application, the container, the path, or something like that.

Just to give you an example, here are the InfluxQL queries we would issue to get those exact things. We have something like SELECT TOP of the average duration, 10: give us the top ten average durations, and inside that we have a subquery where we compute the mean duration as the average duration from the latency measurement for the last hour, grouped into one-minute intervals. Then we have the request rate and the error rate. One thing that's a little difficult to do here is that there's no way I can look at the ratio of those two things, which is what I would really want. You can plot them both on a graph, and a lot of visualization tools will let you do that in the browser, but in InfluxQL this is something we struggle with: how do we take these two pieces and combine them into something coherent?
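To make that schema and those queries concrete, here's a rough sketch of what a single write and the queries just described might look like. This is my reconstruction from the description above, so the exact tag and field names and the query shapes are assumptions rather than the slides' exact text:

    # One write against the proposed schema: tags before the space, fields after
    latency,application=storefront,service=checkout,container_id=c9f2,path=/api/cart,method=POST,source=web,destination=payments,status=200 request_id="9f8e7d",duration=112.4,bytes_tx=512i,bytes_rx=2048i 1561624800000000000

    -- Top 10 services by average request duration over the last hour
    SELECT TOP("avg_duration", "service", 10)
    FROM (
      SELECT MEAN("duration") AS "avg_duration"
      FROM "latency"
      WHERE time > now() - 1h
      GROUP BY time(1m), "service"
    )

    -- Request rate and error rate, per service, per minute
    SELECT COUNT("duration") FROM "latency" WHERE time > now() - 1h GROUP BY time(1m), "service"
    SELECT COUNT("duration") FROM "latency" WHERE "status" =~ /^5/ AND time > now() - 1h GROUP BY time(1m), "service"

The last two queries illustrate the ratio problem mentioned above: each rate is easy to compute on its own, but dividing one by the other isn't something a single InfluxQL statement expresses cleanly.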
Awesome, I believe that is all the content I have, but I'm happy to stick around and answer any questions.

On retention policies: it's a little complicated to answer that question, but when a retention policy applies to some data and that data is evicted, the cardinality will come down; the data gets de-indexed when it expires. No, if it's two retention policies in the same database, it'll be the same cardinality.

On what counts as high cardinality: it's a great question, and it depends what you're using. About six months to a year ago we introduced something called TSI, the time series index, which is a disk-based index that is the default in 1.7. In that case I would consider high cardinality to be around a billion series. We regularly test in the hundreds of millions of series, and that's our target, but I've also seen workloads in the billion range that are possible, and our real aim is to get well beyond a billion. You can imagine that as more and more of the world gets instrumented with sensors, there are going to be more and more ephemeral things coming alive that we want to track as series over time, and those things keep changing, so we want to push that number as high as possible.

Next question: is there a way to systematically understand our schema decisions? There are a couple of utilities we have that will help with that. You can look at a few metrics we expose about the sizes of various things to get a feel for it, but the best way is to give a schema a try, and then we have an EXPLAIN query that will show you where a query is taking a really long time, which is a pretty good indication of where problem spots might be. A utility that tells you "this is the tag that is the problem and here are the steps to take" is something we haven't gotten to yet, but we really want to; it's on the agenda.

On triggers: yes, that's something we're working on in the immediate future, so it's not something I've done yet. But if you remember from Paul's talk yesterday, he talked about thinking of InfluxData and InfluxDB 2.0 as really a serverless platform for time series data, and having specific triggers that cause another Flux function to run, or pull in a certain set of data, is something we're working on. We do have tasks, which are that but on a cron-like schedule, and we want something more reactionary than that, so it's definitely coming. The question, which I forgot to repeat, was whether there's a way to have a Flux function run as a trigger in response to something.
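For context on what those scheduled tasks look like today, here's a minimal Flux task sketch; it's my own illustration with made-up bucket and measurement names, assuming the InfluxDB 2.0-style task syntax. It runs on a fixed interval, downsamples recent data, and writes it back out, which is the cron-style behavior described above rather than the event-driven triggers being discussed:

    option task = {name: "downsample-latency", every: 1h}

    from(bucket: "app-metrics")
      |> range(start: -task.every)
      |> filter(fn: (r) => r._measurement == "latency")
      |> aggregateWindow(every: 1m, fn: mean)
      |> to(bucket: "app-metrics-downsampled")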
How's it going? Anybody else? Yeah, that's a great question. So the objective: there have been a couple of other projects doing something pretty similar; Grafana has a thing called Loki that does a pretty similar thing. Our opinion is that what you really want to do is grep through your logs, and if you can give us a time range and a couple of tags about what you want to be looking at, we think you can probably pull log messages pretty efficiently after that. That being said, we still have a lot of work to do to index those things efficiently and to store strings in a way that doesn't blow up memory, and that's something we're actively working on. But the dream is a system where you can have logs, traces, and metrics all in one place. The idea is, suppose I have a graph and I see that the latency for my service spiked; I want to be able to click on a specific metric there and then have the logs for that appear. You could hook into other systems for that, and we'll probably investigate whether that's the thing we want to do. Our preliminary testing has shown that putting logs in InfluxDB works for our internal purposes; we do it internally, so it's something we want to explore and validate, and if it isn't possible, figure out what we should do about it. So far it seems like it's worked pretty well. Did I repeat the question on that one? I don't remember; it's so hard to remember to do that.

Any other questions? Going once, going twice; I feel like somebody's dying to ask a question. All right, that's my time. Thank you. [Applause]
Info
Channel: InfluxData
Views: 9,452
Rating: 4.7391305 out of 5
Keywords: InfluxDays, InfluxDays London, InfluxDB, Time Series, Time Series Database, Time Series Data
Id: S1kuOyS8FHY
Length: 26min 33sec (1593 seconds)
Published: Thu Jun 27 2019