Time series collection and processing in the cloud: integrating OpenTSDB with Google Cloud Bigtable

Captions
All right, let's get going. Thanks everyone for coming out today for our OpenTSDB and Bigtable session. I was really excited, and actually surprised, to learn there's a separate track for time series — when I submitted the talk I didn't know about that, I just thought, great, we'll talk about OpenTSDB and Bigtable — and it turns out time series is a really hot topic at Percona Live, which is amazing. It was great to hear about all the other time series backends, and today we'll talk about something slightly different that I think we haven't seen in the other presentations: integrating OpenTSDB with Google Cloud Bigtable.

My name is Danil Zburivsky, I lead the big data practice at Pythian, and with me is my co-presenter Christos Soulios, who is our big data architect. I'll do a quick intro about time series and about why we even started this project, and then Christos will take over and talk more about the implementation side of things and how we actually did it.

Really quick about Pythian: we are a consulting and managed services company. We provide support for relational databases like MySQL, Oracle and SQL Server, and NoSQL databases like Cassandra, and my team specifically is responsible for architecting and building big data platforms, whether on-prem with Hadoop or in the cloud with things like Bigtable, BigQuery or Redshift. That's pretty much what my team does.

Time series: if you've been attending the previous sessions of this track, you've probably heard about ten times already what time series are and why they matter, but as a quick refresher, a time series point is essentially a three-tuple of a time, a metric, and a value assigned to it. It's a really simple concept but super powerful, because it describes so many processes in the real world. You can use it anywhere from monitoring applications, where you collect data about how your systems behave, to industrial equipment and sensors that measure pressure and temperature, to describing web traffic and events from your applications. The concept is very simple yet very powerful and descriptive, and that's why I think it's really important to have databases and storage engines that can deal with this data efficiently.

So why is storing time series a challenge? Why don't we just take a MySQL database, store everything in one table and forget about it? The challenges are pretty common, and I'm sure you've heard about them before, but the main problem is that time series data arrives in one pattern and is consumed in a completely different pattern, and the volume of time series can be explosive. If you think about monitoring applications or sensor data, it's easy to see how adding more nodes to your infrastructure or more sensors to your equipment can result in explosive growth in the volume of data.

As for the access pattern, imagine we have some measurements M1, M2 and M3. We receive the data pretty much sequentially in time: for each point in time we get one data point per measurement, so data arrives in these virtual columns, one per measurement. But when it comes to consuming the data, in most cases you want to read it the opposite way, in a row format: you want to look at one particular metric and read a long series of its measurements.
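To make the arrival-versus-consumption discussion above concrete, here is a minimal sketch of what a single data point looks like in an OpenTSDB-style model: a metric name, a timestamp, a numeric value, and a set of identifying tags. The class and field names below are illustrative assumptions, not OpenTSDB's own classes.

```java
import java.util.Map;

// Minimal illustration of a time-series data point: a metric name, a Unix
// timestamp, a numeric value, and a set of identifying tags. Names are
// illustrative only, not taken from the OpenTSDB codebase.
public final class DataPoint {
    private final String metric;            // e.g. "sys.cpu.user"
    private final long timestampSeconds;    // Unix epoch seconds
    private final double value;             // the measurement itself
    private final Map<String, String> tags; // e.g. {"host": "web01", "dc": "lga"}

    public DataPoint(String metric, long timestampSeconds, double value,
                     Map<String, String> tags) {
        this.metric = metric;
        this.timestampSeconds = timestampSeconds;
        this.value = value;
        this.tags = tags;
    }

    // Data arrives "column-wise": one point per metric at each tick.
    // Queries read "row-wise": a long time range of points for one metric,
    // e.g. all values of sys.cpu.user for host=web01 over the last day.
}
```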
And that's essentially the challenge with indexes in relational databases: you can index one way or the other, but then you either slow down the writes or you slow down the reads — there's just no ideal solution. There are tricks people use to work around this; if you attended Baron Schwartz's talk from VividCortex yesterday, they're storing time series in MySQL, but they're really using it as a NoSQL store — they're not using any MySQL-specific features. That's the challenge time series databases are trying to address.

There are better alternatives. Instead of trying to fit time series into relational databases, we've invented multiple specialized stores — you've seen at this track that more are probably added to the list every year — and the idea is to use a data model and storage layout optimized for time series. We invent our own structures that are more efficient for just that use case, and the trade-off is that we'll probably have to use a separate query language to work with the data, because we're building our own layer on top of the storage.

Today we'll talk about OpenTSDB, which is probably the most established time series database. It's been around for a long time, it's open source, it uses HBase as the data store, and it has a data model optimized for time series. We're actually very lucky to have Chris Larsen here, one of the committers of OpenTSDB; he'll be talking more in depth about the data structures later today, so if you're interested, definitely attend that talk as well. Essentially, what OpenTSDB provides is a data model optimized for time series, plus an API on top of it that lets us work with the data easily.

At a high level, the OpenTSDB architecture looks something like this. You have a number of servers — that's the monitoring use case, but they could also be sensors and so on. You have TSD instances, which are Java daemons sitting between the servers and HBase, and those TSD instances translate the operations you perform on time series — insert a metric, insert a new measurement, get an aggregation, get a value for a given point in time. For the storage layer, OpenTSDB relies on HBase, a scalable NoSQL store built on Hadoop. What's really happening is that TSD gives you a translation layer: you don't work directly with the data in HBase, because the storage format is so highly optimized for time series that it's very hard to work with the data manually — it's very compact storage — so you need this layer to translate your requests into requests against the storage.

TSD also has its own built-in web UI, which is pretty scary, but it supports all the modern graphing tools, so you can plug them in and actually display the data that's in TSD, and you can also plug into its RPC interface for alerting and things like that. This is not an in-depth overview, just enough to understand the picture: you send your metrics to TSD, TSD translates them into HBase operations, and if you want to query, you also query through TSD — you don't go to HBase directly.
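As an illustration of "talk to TSD, not to HBase": a client never writes to the storage layer itself, it posts data points to a TSD instance, for example over OpenTSDB's HTTP `/api/put` endpoint. A minimal sketch — the host name, metric, and tag values are made up, and a real collector would batch points and check the response body:

```java
import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;

// Hedged sketch: pushing one data point to a TSD instance over OpenTSDB's
// HTTP API (POST /api/put on the default port 4242). Host, metric, and tags
// are placeholder values.
public class TsdPutExample {
    public static void main(String[] args) throws Exception {
        String json = "{\"metric\":\"sys.cpu.user\","
                    + "\"timestamp\":" + (System.currentTimeMillis() / 1000) + ","
                    + "\"value\":42.5,"
                    + "\"tags\":{\"host\":\"web01\"}}";

        URL url = new URL("http://tsd-host:4242/api/put");  // assumed TSD address
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        conn.setRequestMethod("POST");
        conn.setRequestProperty("Content-Type", "application/json");
        conn.setDoOutput(true);
        try (OutputStream out = conn.getOutputStream()) {
            out.write(json.getBytes(StandardCharsets.UTF_8));
        }
        System.out.println("TSD responded: HTTP " + conn.getResponseCode());
        conn.disconnect();
    }
}
```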
But the challenge that we've seen, and I think many people have seen, is this: OK, I want to build a metrics collection solution, and OpenTSDB is great, but HBase is just too much sometimes. HBase requires a full Hadoop setup: ZooKeeper nodes, NameNodes, three DataNodes at a minimum, two HBase master nodes, three RegionServer nodes. You need a full-fledged cluster up and running just to get your data collected and monitored. So if you're building a monitoring solution for your in-house infrastructure, you're actually building more infrastructure just to watch over that infrastructure — it's a very challenging task.

HBase is a great NoSQL store, but it's rough around the edges sometimes, and tuning HBase and dealing with some of its challenges is not for the faint of heart. There's a lot going on and a lot of components that can break. HBase is scary — there's a lot of stuff inside and everything in there can break, so you have to be ready for that. It's getting better and better, but it's still a lot of management overhead. If you don't already have Hadoop and a team of Hadoop admins with HBase skills, that's probably a blocker for deploying OpenTSDB on HBase, because you'll spend more time just managing the cluster than actually doing what you need to do.

But all we wanted was a time series database. We wanted to store this simple data structure in a predictable, scalable way; we didn't want to roll our own Hadoop clusters. Luckily a better way exists, and in the second part of this presentation we'll talk about Google Cloud Bigtable, how OpenTSDB can integrate with it, and why you would want to do that. And here I will hand off to Christos.

Hello, can you hear me? So, hello everyone. What I would also like to add is this: if it's a tough job to set up an HBase/Hadoop cluster on premises, then setting it up in the cloud is ten times harder, because it's more expensive and it's harder to get properly configured disks and I/O and everything. So let's see if we can take another approach to this problem, and maybe check out Google Bigtable.

I won't go into too much detail about Google Bigtable. It's an HBase alternative; in fact, it has been around for about ten years as an internal project at Google. It is a massively scalable NoSQL database that supports high throughput, low latency, and high concurrency. Most Google products used to run on it, and as of about a year ago it has been released as a Google Cloud service. When Google Cloud Bigtable was released, Google approached us and said: here's what we're going to release — would you like to take some business use cases that run on HBase and make them work on Bigtable? We are originally a Hadoop and HBase team, so OpenTSDB was a very good candidate, for the reasons mentioned before.

Google has done something really smart to attract HBase users. Since HBase and Bigtable are similar in architecture, approach, and data structure, they created a library that is API-compatible with HBase. What does this mean? If I have an HBase client application and I want to migrate it to Google Cloud Bigtable, all I need to do is drop in the new jars — the bigtable-hbase jar. No code change is required, just configuration and replacing the library.
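To illustrate the "no code change" point: the code below is plain HBase 1.x client code, the kind that, per the talk, keeps working against Cloud Bigtable once the bigtable-hbase jar is on the classpath and the connection configuration points at a Bigtable instance. The table name, column family, and qualifier are illustrative assumptions.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

// Standard HBase client code: open a connection, get a table, write one cell.
// With the HBase-compatible Bigtable library, the same code runs against
// Bigtable; only the jars and the configuration change.
public class HBaseClientExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create(); // picks up hbase-site.xml
        try (Connection connection = ConnectionFactory.createConnection(conf);
             Table table = connection.getTable(TableName.valueOf("tsdb"))) {
            Put put = new Put(Bytes.toBytes("example-row-key"));
            put.addColumn(Bytes.toBytes("t"), Bytes.toBytes("q"), Bytes.toBytes(42.5d));
            table.put(put);
        }
    }
}
```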
So we said: all right, that should be an easy task — let's take OpenTSDB, replace the libraries, test it and see how it goes. Unfortunately it was not that easy, it was not that simple, and that's because OpenTSDB does not use the standard HBase client. What the OpenTSDB maintainers have done is rewrite the whole HBase client stack and replace it with the asynchbase library.

So what is asynchbase? It's an open-source HBase client library. It is multi-threaded — a while back HTable was not, though the latest release is — and it is fully asynchronous and non-blocking, implementing the low-level HBase RPCs that change every few months, which is a tough job.

For those of you who have been writing software, asynchronous programming is a hot thing nowadays. So far, when we wanted to do I/O — network I/O, disk I/O or anything — we would call a function and it would block; while things happened in the background we were waiting, the thread was blocked and probably removed from the scheduler, and when the result was ready the thread would wake up, get the result and go on. That cannot scale. What asynchronous I/O does instead is this: when we invoke a call, we get back a future. This is just a handle — it does not mean the result is there yet; it means we will be able to access the result through this handle later. Then, once we've done some other work, we can either call get and block, or we can add listeners, chaining more methods that run asynchronously, and then more methods, so that essentially we build a chain of asynchronous methods (a generic sketch of this pattern appears a bit further below).

And this is good — really good. First, because we get efficient thread usage: so far we had many threads that would block on I/O, get removed from the CPU, and then new threads would come in and block in turn, and so on; now we have fewer threads that perform more work — they don't block, they don't wait, they don't need to be rescheduled — so we use fewer threads and less memory. Second, the CPU usage is scheduler-friendly: we get cache affinity, we don't have to move our threads around. And finally, we get extremely high concurrency with a lower hit on resources.

Here is a graph from the asynchbase library showing sequential writes for the HTable API and for asynchbase, depending on the number of writing threads. You can see how latency increases with HTable and stays flat with asynchbase, and the page faults are also a very important KPI.

So we said: asynchbase is a very good implementation, we won't be able to be as fast, at least in the beginning, but let's get started and we can probably improve the performance later. However, when we started looking at the asynchbase documentation, it turned out to be totally different — it has its own API. So switching away from asynchbase is not a simple thing; it's not like just dropping in a jar, we needed to write code, and essentially that's what we had to do. So we developed the asyncbigtable library: what Google did for HBase, we did for asynchbase.
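Here is the generic illustration of the asynchronous callback-chaining pattern described above, using java.util.concurrent.CompletableFuture. asynchbase itself uses its own Deferred class rather than CompletableFuture, and the fetch/parse steps below are placeholders, not real asynchbase or OpenTSDB calls.

```java
import java.util.concurrent.CompletableFuture;

// Blocking vs. asynchronous: instead of parking the calling thread on I/O,
// an async call returns a handle (a future) immediately, and further steps
// are attached to it as callbacks, forming a chain of asynchronous methods.
public class AsyncChainExample {

    static CompletableFuture<String> fetchRow(String key) {
        // In the blocking model this would stall the caller until the I/O
        // completes; here it returns a handle right away and the work runs
        // on another thread.
        return CompletableFuture.supplyAsync(() -> "raw-bytes-for-" + key);
    }

    public static void main(String[] args) {
        CompletableFuture<Integer> chain =
            fetchRow("sys.cpu.user")             // start the I/O, get a future
                .thenApply(raw -> raw.length())  // "listener" #1: parse the result
                .thenApply(len -> len * 2);      // "listener" #2: more processing

        // Only at the very end do we (optionally) block for the final value.
        System.out.println("chained result = " + chain.join());
    }
}
```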
So we wrote an API-compatible library that can replace asynchbase and use the standard HBase client or the Bigtable libraries underneath. That was the plan — so let's see what the challenges were.

To begin with, OpenTSDB is really opinionated when it comes to jar dependencies. I don't know if any of you have worked with Java, but there is nothing like jar hell: you have some libraries, some jars, and they have transitive dependencies that need more jars, which need more still, and in the end you wanted four libraries and you end up with fifty jars that you don't know what they do, and sometimes you have version conflicts and everything — you don't want to be there. The OpenTSDB maintainers decided to keep it simple: all libraries are declared explicitly and kept to as few as possible. Of course, what we wanted was to add all our libraries, but at that moment we did not want to touch OpenTSDB's dependency system that much, so what we did at the time was create an uber jar — a fat jar, which is basically a zip file containing all the libraries inside. Instead of shipping twenty libraries we shipped one big 55-megabyte library. It takes a while to upload, but it seemed to work.

The second thing: asyncbigtable is not actually async, and that is because we use the standard HBase API, which is a blocking API. So we call Bigtable and then wait, blocking there, and that's not a good thing — first because the rest of the OpenTSDB infrastructure expects calls to return immediately — and we ended up with many threading issues: increasing the number of workers did not help, and sometimes we would run out of threads. What we did instead is use a BufferedMutator and a thread pool, so we kind of emulated the async behavior. That means that instead of writing the inserts one by one, we buffer them, and a separate thread drains that buffer (a rough sketch of this pattern follows below). When the buffer gets full under very high concurrency, threads essentially block again, but for now this can take us quite far.

Here is a short benchmark we've done. It's not extremely reliable, because asynchbase ran on a single-node HBase on Google Cloud — and you know that with a single node you don't have all those replication issues, you don't have consistency issues, nothing like that — and it used local storage, not HDFS, which is another thing. That simpler setup reached up to about 38–39 thousand events per second. The first version of our library, 0.2, could reach about 12–13 thousand events per second, and last week we released 0.3, which uses the latest Bigtable client API and has some performance improvements — so I think we're getting good at this.

The asyncbigtable library has been accepted by the OpenTSDB project and merged upstream. We do our best to help there — provide more code, provide feedback and everything — and if you want to check it out, there's documentation on how to enable it, and the code is on GitHub.
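Here is a rough sketch of the BufferedMutator-plus-thread-pool approach described above, written against the plain HBase 1.x client API. The table name, column family, loop size, and pool size are made-up values, and this is only an illustration of the pattern; asyncbigtable's real implementation lives in its own repository.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.BufferedMutator;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

// Writes are handed to a BufferedMutator (which batches them) from a small
// thread pool, so callers return quickly instead of blocking on every put.
public class BufferedWriteExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        ExecutorService pool = Executors.newFixedThreadPool(4);

        try (Connection connection = ConnectionFactory.createConnection(conf);
             BufferedMutator mutator =
                 connection.getBufferedMutator(TableName.valueOf("tsdb"))) {

            for (int i = 0; i < 10_000; i++) {
                Put put = new Put(Bytes.toBytes("row-" + i))
                    .addColumn(Bytes.toBytes("t"), Bytes.toBytes("q"), Bytes.toBytes(i));
                // mutate() only appends to the mutator's write buffer; the buffer
                // is flushed to the server when it fills up (or on flush/close).
                pool.submit(() -> {
                    try {
                        mutator.mutate(put);
                    } catch (Exception e) {
                        e.printStackTrace();
                    }
                });
            }
            pool.shutdown();
            pool.awaitTermination(1, TimeUnit.MINUTES);
            mutator.flush();  // push whatever is still buffered
        }
    }
}
```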
What we intend to do first: there is another Bigtable API that is not an HBase replacement — it is native and it supports asynchronous calls — so we intend to reimplement asyncbigtable on top of the native Bigtable API so that it is truly asynchronous, which should improve performance. We have many more things to look at to improve the benchmarks — memory management, thread management, buffers and everything — and finally we need to add more tests; for now we haven't done much on that front. I think that was all; we'll take some questions.

[Q&A]

We've been up to three-node and five-node instances; we haven't gone bigger than that.

Yes, most of the time. Most of the time the issue was due to the blocking behavior: the buffers get full — you can increase them, but you can't get that far — and you have many threads writing, so when the buffer fills up, everything blocks. Another bad thing here is that the write path and the read path actually use the same threads, so you may have really heavy traffic on the write path and then you cannot even view the graphs. I think this will be resolved once we make it asynchronous.

For us this was not something we needed internally; it was a good case we could work on using the Bigtable client, and it has proved to be really needed — a lot of people have asked us how they can make this work.

You mean that one? Yes, that's asynchbase, and that is asyncbigtable 0.2, which is what currently runs with the 2.3 release, and this is asyncbigtable 0.3, which was released last week and will be pushed in the future. That one is a three-node setup.

And I would also like to say that Bigtable doesn't even feel it — we write as much as possible, but we cannot take up all of Bigtable's capacity. What we have tested so far is another setup with two different TSDs on the same machine sending messages, and that actually shows twice the writes — so for that setup a three-node Bigtable was nothing.

Yes, that's HBase right now: the HBase client library is blocking, it does not support non-blocking asynchronous I/O. HBase 2.0 supports that, but we don't know when it's going to be out — it's still a snapshot — and that's why we need to figure out whether the HBase asynchronous API is going to be out soon, so that we can work on that, or whether we just bypass that API and work directly with the Google libraries for Bigtable that already support asynchronous behavior — libraries you can download today. Chris, you plan to do that in the following few weeks? Yeah.

Yes — if you want, you can just download it, replace the current snapshot library with this, and it will work. Also, there are a few changes to the documentation, because since a year ago they have changed the connection parameters for Bigtable, so I'm going to file a pull request for the OpenTSDB documentation. Is anyone using OpenTSDB in production with HBase right now? Any other questions? Thank you very much.

[Applause]
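The "connection parameters" mentioned at the end of the Q&A refer to how the bigtable-hbase client identifies a Bigtable instance: newer client versions take a project ID and an instance ID (older releases used zone and cluster IDs instead). A minimal sketch, assuming the com.google.cloud.bigtable.hbase.BigtableConfiguration helper and placeholder IDs — check the OpenTSDB and Cloud Bigtable documentation for the exact settings your client version expects:

```java
import org.apache.hadoop.hbase.client.Connection;
import com.google.cloud.bigtable.hbase.BigtableConfiguration;

// Opens an HBase-compatible connection to a Cloud Bigtable instance using
// project ID + instance ID. Both IDs below are placeholders.
public class BigtableConnectExample {
    public static void main(String[] args) throws Exception {
        try (Connection connection =
                 BigtableConfiguration.connect("my-gcp-project", "my-bigtable-instance")) {
            System.out.println("Connected: " + !connection.isClosed());
        }
    }
}
```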
Info
Channel: Percona
Views: 1,091
Rating: 5 out of 5
Keywords: Time Series, NoSQL, Data in the cloud, Developer
Id: 13xIdnIAkn8
Length: 24min 50sec (1490 seconds)
Published: Wed Nov 15 2017