Jaeger Intro - Yuri Shkuro, Uber Technologies & Pavol Loffay, Red Hat

Captions
Welcome to this talk, an introduction to Jaeger. As an agenda, we will give a brief introduction to what tracing is and why we use it. How many people really want to see a live demo versus screenshots? All right, I'll try to improvise on the fly; there's a docker-compose file, so it's not that hard to run, but I may still fall back to the screenshots. We'll also talk about overall project status (Jaeger has been a CNCF graduated project since October) and about the roadmap.

Hi, I'm Pavol Loffay. I'm a software engineer at Red Hat, working on Jaeger, OpenTracing, and OpenTelemetry, and a little bit on tracing in service mesh, the Kiali project. And I'm Yuri Shkuro from Uber; I also maintain Jaeger and OpenTelemetry, and I published a book on tracing earlier this year. (Is this microphone better? Yep, cool.) If you're interested in the book, check it out: it covers the basics of what tracing is, how to use it, and how to deploy it, which is usually the hard part of doing tracing.

So let's start quickly with what tracing is and why. The need for tracing appeared in the industry when systems became very complex. We started building distributed systems that are not only scaled wide, by replicating the same service many times, but also scaled deep, with multiple layers of microservices in the architecture, and it becomes really challenging to figure out what's going on in those architectures. Here is a screenshot from Uber, with data collected and visualized by Jaeger. It's a service graph: nodes represent services, edges represent connections, the RPC calls between services. A transaction that comes from the mobile app to the backend might look like this: it can touch several dozen microservices, hundreds of nodes, and potentially include hundreds of RPC calls.

So if you are on call for a particular service and you have an SLA, say your latency or your error rate is supposed to stay within certain limits, and suddenly, boom, you get an alert that your SLA is broken, what do you do? Your service, in a deep microservices architecture, probably depends on a whole bunch of other services, and any of them can ultimately be responsible for the error, but you are the one on the hook, because it's your service whose SLA is broken. You need to do something to fix it, even though you may not have control over the actual fix. The one thing you can do is at least isolate the problem and say: OK, I need to page that team to wake up and fix it, or maybe I can route around the problem somehow. Distributed tracing helps you with that. Unlike other monitoring tools, it allows you to figure out what's going on in this complex architecture.

You might ask: what about other monitoring tools, like metrics and logs? The difference between tracing and monitoring specifically is that monitoring covers the things you know up front you want to measure, like the SLA. For example, you say: I want to measure all failed requests, and no more than one percent of requests should fail.
That's the thing you know up front: you measure it, you set up an alert, and metrics are perfect for that. But once that alert fires, then what? Can you keep using metrics to actually troubleshoot the problem? The answer is usually no. You might, but it's very hard, because metrics are focused on an individual node. In that picture, if you're at the top of the pyramid and all those downstream services are below you, you would have to go through every service and look at its metrics. You probably don't own those services, you probably don't even know where their dashboards are, and even if you find them, you may not understand them, because again, you don't own them. So doing it with metrics is very hard, and it's actually an N-squared problem: the surface of that pyramid is proportional to the square of its depth, so the deeper the architecture, the more stuff you have to look through to understand where the problem is. It's a similar story with logs. Logs are what you typically use to troubleshoot problems; you usually go look at them when something goes wrong. But you have the same issue: an N-squared number of potential places where you have to go looking for the error, for the explanation.

Traces, on the other hand, provide correlation across all those services, which allows you to narrow down the problem much faster. You can think of a trace as a stack trace for a distributed system. How tracing works is actually very simple; you may already be doing something like it in your applications anyway. When a request comes to a front-end service, we assign it a unique ID, and we make sure that ID is propagated through the whole execution of the request, across the call graph. Deceptively simple, somewhat hard to implement in some cases. What that allows us to do is this: if, in addition to propagating the ID, we also collect some telemetry and send it out of the process to a central backend like Jaeger, then we can reconstruct the execution of the request later on into a graph, a Gantt chart, or a time-sequence diagram, and reason about which services executed the request, whether there were errors in them, and so on. The idea of tracing is not new; it's been around for fifteen or twenty years, and there are academic papers written about fifteen years ago. But it's only now getting into the mainstream, because more and more organizations are switching to microservices, and microservices are the very deep systems where you experience these problems that we didn't have before.
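To make that concrete, here is a minimal sketch of what such instrumentation looks like in Go with the OpenTracing API and the Jaeger client. This is not the talk's exact code; the service names and endpoints are illustrative. The front-end service starts a span for the incoming request and injects the trace context into the headers of its outbound call, so the downstream service can continue the same trace.

```go
package main

import (
	"log"
	"net/http"

	"github.com/opentracing/opentracing-go"
	"github.com/opentracing/opentracing-go/ext"
	jaegercfg "github.com/uber/jaeger-client-go/config"
)

func main() {
	// Initialize a Jaeger tracer for this service (sample everything, fine for a demo).
	cfg := jaegercfg.Configuration{
		ServiceName: "frontend",
		Sampler:     &jaegercfg.SamplerConfig{Type: "const", Param: 1},
		Reporter:    &jaegercfg.ReporterConfig{LogSpans: true},
	}
	tracer, closer, err := cfg.NewTracer()
	if err != nil {
		log.Fatal(err)
	}
	defer closer.Close()
	opentracing.SetGlobalTracer(tracer)

	http.HandleFunc("/dispatch", func(w http.ResponseWriter, r *http.Request) {
		// Continue the caller's trace if it sent one, otherwise start a new root span.
		parent, _ := tracer.Extract(opentracing.HTTPHeaders, opentracing.HTTPHeadersCarrier(r.Header))
		span := tracer.StartSpan("HTTP GET /dispatch", ext.RPCServerOption(parent))
		defer span.Finish()

		// Propagate the trace context to a downstream service by injecting it
		// into the outgoing request headers.
		req, _ := http.NewRequest("GET", "http://customer:8081/customer?id=123", nil)
		_ = tracer.Inject(span.Context(), opentracing.HTTPHeaders, opentracing.HTTPHeadersCarrier(req.Header))
		resp, err := http.DefaultClient.Do(req)
		if err != nil {
			ext.Error.Set(span, true) // mark the span as failed
			http.Error(w, err.Error(), http.StatusInternalServerError)
			return
		}
		defer resp.Body.Close()
		w.WriteHeader(resp.StatusCode)
	})
	log.Fatal(http.ListenAndServe(":8080", nil))
}
```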
Since so many people raised their hands for the live demo, I'll try to do that now. I already have it running; I started this docker-compose earlier. You can go to the Jaeger repository, under examples/hotrod: it has this docker-compose file, so you can run it very easily yourself, and there is also a short link to a blog post with a walkthrough of this demo in a lot more detail if you want to try it out. While this is starting, you can see there are two services. One is the actual Jaeger backend: we have a component in Jaeger called all-in-one, which packs all of the Jaeger components, the collector, query, and in-memory storage, into a single container, so it's very easy to run. The other, HotROD, is the demo application instrumented for tracing that I'll be showing.

It has a URL here somewhere, this one, so I can open it; let me make it bigger. It's a mock ride-sharing application: you have several customers, you press a button, and a car comes to that address. So let me make a request. We see a license plate number; that's a New York-style plate for commercial drivers, which is what it usually looks like at Uber. The car is arriving in two minutes, you get an ETA, and you get some additional information, specifically the latency and a request ID, which we might get to in a moment.

We don't know what the application does behind the scenes, so let's look. Here is the home page for Jaeger, and the first thing I want to go to is the dependency diagram, switching to this view. I executed a single request against the application, and Jaeger automatically built a service graph by observing what happened in the application and how the services are connected. This picture tells us that there are four microservices and two storage components in this one small application. Of course they're all simulated, and in practice it's a single binary, but that binary talks to itself over several different protocols, so we can easily see the architecture.

One nice thing about tracing in general: you could have drawn this architecture diagram by hand and posted it on your office wall, but two weeks later someone new joins and doesn't know it's there, or someone else adds another service, and what happens to your poster? You spent two hundred bucks printing it at Kinko's and it's already wrong. Documentation for architecture is always out of date, whereas with microservices you do releases maybe a hundred times a day in a large organization, so everything keeps changing. Even if you don't create a new service every day, you're potentially creating new connections between services, because you start calling some other endpoint, and every extra dependency, back to that pyramid idea, is more stuff that can go wrong in your application and your service. Tracing keeps this picture up to date, because it simply observes what actually happens in production; it knows all the dependencies, and you can always go back and say: yes, this is the actual picture of my architecture.

Now let's look at an actual trace. Notice how Jaeger automatically recognized all the services that were involved and shows them here. It also traces itself, so you see jaeger-query showing up as a service as well, but we'll skip that. I'll pick the frontend, because that's the top service where requests come in, search for traces, and I get this one trace, /dispatch. We can see the latency is about 730 milliseconds, while the app reported something like 737; the difference is the networking, because that number is measured from the front end and this one only from the backend, so it's a bit shorter.
If we go into the trace, we see this view (I'll skip the corresponding slide, since I'm running behind anyway). It's the very classic view of distributed tracing, a Gantt chart. On the left we see the hierarchy of calls: the services that were involved in executing this single request, and the parent-child relationships between them. Horizontally there is a timeline, and every operation a service executed, which may itself have been waiting for something else, is called a span; it's represented by a bar whose length is proportional to the duration of the operation. At the top there is a minimap, which I'll just hide for now: it's the same view as the Gantt chart, but collapsed, so even if the chart itself is a hundred pages long for a large trace, the minimap still shows everything and makes it easy to navigate.

And the Gantt chart itself, what's useful about it? Certain things are made to jump out. You can see a few operations marked with an exclamation point; those tell us there was an error in that operation. Jaeger is actually smart enough that if I collapse this whole subtree, everything the driver service was doing, the exclamation point bubbles up to the parent span, so we still see that there were errors underneath even when those spans are hidden; and if I collapse everything, it bubbles up to the top. (If it's too small, I can make it bigger; it still works.)

So what else can a trace tell us? We can track down specific execution errors, as we just saw. Another thing it tells us immediately, if we're investigating latency, maybe because our latency SLA is broken, is where the bottlenecks and hotspots in the request are. There is clearly this MySQL SELECT that takes something like 40 percent of the total time, so if we were to optimize, that's the first thing we would look at, because it's the largest item on the critical path.

And here is one of the hidden powers of traces. When we look at this high-level view of how the request executed, we're getting a macro picture; there isn't a lot of information, just which service called which service and how long it took. But what happens if I click on this one span? Suddenly I get a whole other set of details about this single operation. I can expand the tags and see the whole SQL query that was executed within that database call; I also see process information, like the hostname of the host it was running on, and things like that. You can enrich this with your own instrumentation in any way you want: port, region, zone, whatever. So you not only know logically which service executed this part of the request, but which specific instance of that service. In that way tracing provides a micro view of individual operations as well as a macro view of the whole transaction across the architecture.
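As a rough illustration of how those details get onto a span, here is a sketch with the Go OpenTracing API. This is not the actual HotROD code; the query, tag values, and helper names are made up. A database-access function can attach both standard and custom tags to its span, and mark it as failed, which is what the UI renders as an exclamation point.

```go
package dbexample

import (
	"errors"

	"github.com/opentracing/opentracing-go"
	"github.com/opentracing/opentracing-go/ext"
)

// runQuery stands in for the real database call; it is just a placeholder here.
func runQuery(id string) error {
	if id == "" {
		return errors.New("missing customer id")
	}
	return nil
}

// getCustomer shows how a database-access layer can attach details to its span.
func getCustomer(tracer opentracing.Tracer, parent opentracing.Span, id string) {
	span := tracer.StartSpan("SQL SELECT", opentracing.ChildOf(parent.Context()))
	defer span.Finish()

	// Standard OpenTracing tags: these show up in the span details view.
	ext.DBType.Set(span, "sql")
	ext.DBStatement.Set(span, "SELECT * FROM customer WHERE customer_id = ?")
	ext.PeerService.Set(span, "mysql")

	// Custom tags: anything useful for pinpointing the exact instance.
	span.SetTag("region", "us-east-1")
	span.SetTag("customer_id", id)

	if err := runQuery(id); err != nil {
		// Marking the span as failed is what produces the error marker in the UI.
		ext.Error.Set(span, true)
		span.LogKV("event", "error", "message", err.Error())
	}
}
```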
There are also these tick marks that you can see on the spans; those are so-called logs. Interestingly, we see a lot of them on the top-level request, so let's look at that. There are 18 logs here, and they look almost like regular logging, and in fact they are regular logging. If you look at the source code of the application, it literally just calls a logger, but it does it in a way that the information is not only written to standard output (if I switch to this window, these are the same logs in stdout), it is also written to the span in the trace.

Which one is better? The stdout version was reasonably readable too, but this was a single request that I executed. What if I were doing hundreds of them per second? How would you find them in a typical log stream? You would need to correlate them somehow, to select the subset that actually belongs to your thread of execution, and there is no such ID you can filter on. In a trace, you are guaranteed to get exactly the logs that correspond to this single request, to this specific span. More than that, as I mentioned, there are other spans here, and they have their own logs, so you get logs that are contextual to the place in your execution workflow where they actually happened, not just lines sitting somewhere in a big pile of text. It's all collapsed into one tool where you can easily investigate problems.
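The instrumentation for that is essentially structured logging against the span, in addition to the normal process log. A minimal sketch with the Go OpenTracing API; the function and field names here are illustrative, not taken from the demo's source.

```go
package logexample

import (
	"github.com/opentracing/opentracing-go"
	otlog "github.com/opentracing/opentracing-go/log"
)

// findNearestDrivers sketches how log statements can be attached to a span,
// so they show up in the trace on this exact operation.
func findNearestDrivers(span opentracing.Span, location string) {
	// Key-value style logging on the span.
	span.LogKV("event", "searching for drivers", "location", location)

	// Or the typed field API from opentracing-go/log.
	span.LogFields(
		otlog.String("event", "drivers found"),
		otlog.Int("count", 10),
	)
}
```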
Now, going back to latency. Let's say we solved that first problem: we optimized the SQL query, maybe put an index on the table, and it becomes much faster. Another potential latency issue we see here is this driver operation. It takes around 200 milliseconds by itself, but what really happens is that it makes a whole bunch of calls, and those calls, as we can see, happen one after another. It's a very clear pattern just from the visual representation of the trace, a staircase pattern, and you can immediately tell that this is probably something you can optimize by doing it in parallel. In this case those are actual requests going to Redis, so you could run them in parallel. What I've seen in practice is that people often don't realize that when they use an ORM, an object-relational mapping library, and write a loop, it looks like a very tight loop, but behind the scenes it's actually making a database call on every iteration. Instead of doing that, you want to issue a single request and load in bulk, and suddenly people say things like: I improved my performance by 10x using tracing. Grafana published a blog post about Cortex where they did exactly that; it looked like some brilliant feat of engineering, but really they found a very simple thing that was very easy to fix and very hard to find. That's the challenge, and that's what tracing helps with: it helps you find the places in the architecture where there are things you can potentially fix very easily.

Since I'm here, let me also show this one. If I collapse this section, we can see there are a whole bunch of requests going on, and unlike the previous staircase pattern, these run three at a time in parallel. Good news, bad news: three in parallel, but why three? In fact, if I run the same request many times concurrently from the front end, these groups of three start drifting apart, because what happens in the application is that there is a thread pool limited to size three. With a single request you get parallelism of three, but this operation makes about ten requests, so that's not enough: you still get a staircase, just in groups. And when there are many concurrent requests going through the application, they all start competing for that same thread pool. Again, it's fairly straightforward to see this in the trace; it may not be straightforward to fix in the application. In this demo application, if you go through the blog post, there is a command-line parameter you can pass to take care of it, and you can watch how the whole trace shape changes. You can, for example, also reduce the simulated time of the MySQL query and make it much smaller. You can't parallelize the sequential Redis calls without actually rewriting the code, but the parallelism of this thread pool you can change easily.
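That grouping in threes is exactly what a fixed-size worker pool produces. Here is a small Go sketch of the pattern (an assumption about how the demo behaves internally, not its actual code): ten tasks pushed through three workers show up in a trace as a staircase of groups of three, and raising the pool size flattens it out.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// runWithPool executes the tasks with at most poolSize of them running concurrently.
// With poolSize=3 and 10 tasks, a trace of the tasks shows groups of three:
// the staircase-of-groups pattern from the demo.
func runWithPool(poolSize int, tasks []func()) {
	sem := make(chan struct{}, poolSize) // counting semaphore playing the role of the thread pool
	var wg sync.WaitGroup
	for _, task := range tasks {
		wg.Add(1)
		sem <- struct{}{} // blocks while poolSize tasks are already running
		go func(t func()) {
			defer wg.Done()
			defer func() { <-sem }()
			t()
		}(task)
	}
	wg.Wait()
}

func main() {
	tasks := make([]func(), 10)
	for i := range tasks {
		i := i
		tasks[i] = func() {
			time.Sleep(50 * time.Millisecond) // stand-in for one Redis call
			fmt.Println("finished task", i)
		}
	}
	runWithPool(3, tasks) // try a larger pool size to remove the staircase
}
```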
So let me switch to the slides and see if I missed anything. We talked about the service graph; we talked about the timeline view, the hierarchy, the blocking operations, the staircase, the nesting of parent and child spans that represents one waiting on another; and we talked about span details, drilling into a span to see things like the database query and some logs.

Now, a feature that is new in Jaeger since last year: trace comparison. When you do performance optimization, think about how you debug a memory leak: you typically take a snapshot before, take a snapshot after, and compare them. When you look at a single trace, as we did before, there is a lot of rich information in it, but if you're trying to troubleshoot a performance regression it might not always be helpful. For example, say you got an alert that your SLA is broken: latency suddenly went from 700 milliseconds to 900. You go look at a trace and see the MySQL call taking 300 or 500 milliseconds. But what if it's always 500? You can't tell from that single trace. What you really want is to look at the difference between what a normal trace looks like and what this one looks like, and that's what we built in Jaeger: you can select two traces and compare them. Let me quickly show it in the UI. I go back, make a few more requests, search for them, select two of them, and compare. This graph isn't as interesting as the one on the slide, so I'll switch to the slide, but you can do this in the UI today if you know which traces to pick, and picking them is the real challenge.

This one is probably from Uber production, so it's a lot more complex; there are on the order of a hundred RPC calls in it. It's a diff, like a code diff: red and green, where red means something is missing from the right-hand trace and green means it was added in the right-hand trace. What it's helpful for is troubleshooting. If you're trying to understand how the executions of these two requests differ, you can immediately tell that this whole bottom segment in red was missing from the second trace, so something probably went wrong above it that caused this part not to execute. You can see that at a glance; without this kind of tool you would be staring at a trace with a hundred requests in it, trying to figure out where something actually went wrong. There are some more details here: the light colors mean the spans are present in both traces but in different quantities, so you may have more of a given span in one trace than in the other. Another property of this graph is that it doesn't represent every single span from the trace, because that would make it almost as complex as the Gantt chart itself, and for large traces the Gantt chart becomes unwieldy. Instead it collapses things: if you repeated that Redis call from the earlier example ten times, it would probably show up as just one or two nodes, because structurally the repeats are the same, and if the counts differ you see that through the light colors. In this production example there was an issue where the user had an outstanding balance, a credit card problem, so we couldn't charge the card and couldn't complete the transaction, and that's why this whole section of the overall request never got executed in the second trace: it failed on the transaction earlier.

There is another visualization with the same structure but different color coding, which can also be useful. Remember the example I gave about investigating a latency regression: if your MySQL span always takes 40 percent of the time, it's not useful to stare at it. In the diff we can instead use a heat-map color coding of which spans contribute the most to the difference in duration. When you do that, you can immediately see along which path the differences are, where the biggest contribution to the latency difference between the two traces is coming from. It's the same principle, just a different color coding, and you can probably come up with other variations, because behind these views are full Jaeger traces. So you can also get pop-ups on individual nodes, and you can deep-link from every node to the exact place within the trace. That is helpful again for big traces: if a trace has a couple of thousand spans, the Gantt chart is very unwieldy, it's long and wide, and all the spans are tiny because the timeline is long. But if you deep-link to a span at the exact place in the architecture, you can zoom in and drill down into a lot more detail, after you've already isolated where you want to look.
OK, so just to summarize the demo. We've seen that tracing can be used for monitoring and troubleshooting distributed transactions in an architecture; in general you can do root cause analysis, which is what we did with the latency and the errors, looking into the details. We can obviously use it to optimize for latency and performance, which is really just another side of the root cause analysis coin, because ultimately the performance of an application doesn't only mean speed: it can mean correctness, it can mean availability, and it can mean raw performance. And finally, we can automatically produce service dependency diagrams with tracing.

One thing I haven't talked about, because it's somewhat tangential to this talk: the Jaeger libraries provide distributed context propagation functionality that carries not only the trace ID but other information as well. You can tag a request with a specific product line, say Gmail versus Google Docs, two different business lines at Google that probably both eventually hit Bigtable. If you own Bigtable as a shared platform service, how do you know which customers are using you, how do you do capacity planning, how do you say anything about the usage patterns of your service? The product, Gmail or Docs, is way up at the top, far from the shared platform down here. By using context propagation you can tag the request at the top, get that information all the way down, and emit metrics tagged with it as well.
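In the OpenTracing API this mechanism is called baggage. A small sketch of the idea in Go follows; the key name and the metrics helper are examples I made up, not anything Jaeger defines. A value set near the top of the call graph travels with the trace context and can be read many services deeper.

```go
package baggageexample

import (
	"github.com/opentracing/opentracing-go"
)

// At the edge service: tag the request with its business line.
// Baggage is propagated along with the trace context to all downstream spans.
func tagProductLine(span opentracing.Span, productLine string) {
	span.SetBaggageItem("product-line", productLine)
}

// Many layers deeper, e.g. in the shared storage service: read the tag and use it,
// for example as a label on usage metrics.
func recordUsage(span opentracing.Span, bytesRead int) {
	productLine := span.BaggageItem("product-line") // empty string if not set upstream
	emitMetric("storage.bytes_read", bytesRead, map[string]string{
		"product_line": productLine,
	})
}

// emitMetric is a placeholder for whatever metrics library the service uses.
func emitMetric(name string, value int, labels map[string]string) {}
```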
And at this point I'll hand over to Pavol to talk more about the Jaeger project. So, Jaeger is not only a distributed tracing tool, it's a whole platform with different pieces. We have instrumentation libraries that implement the OpenTracing API in different languages; then we have the whole trace collection backend, which collects these traces from the clients; there is also the visualization part, which shows the traces in the UI; and the last part, which is optional, is a data mining platform that reads traces from storage, does some analysis and aggregation, and stores the results back for later presentation.

Jaeger's architecture was inspired by Google's Dapper paper and by OpenZipkin. It was created at Uber in 2015 and open-sourced in 2017, when it also joined the CNCF as an incubating project, and I think it was last month that Jaeger graduated as a top-level CNCF project. I would like to thank all of the contributors who made this possible.

On the architecture, this is just a high-level view; as you know, the backend is more complicated. The important part here is that service A is instrumented with the OpenTracing API using the Jaeger client, which implements that API, while service B is instrumented with OpenZipkin, OpenCensus, or OpenTelemetry SDKs. Both services report data to the same backend, and the backend can connect these spans together and visualize them as a single trace. So with Jaeger you can collect data from different libraries. The other interesting part of this slide is that even though service A and service B are instrumented with different SDKs, they can still propagate the trace between them, because context propagation in Jaeger is pluggable.

When we zoom into Jaeger itself, the architecture looks like this. On the right side we see the host or container with your application instrumented with a Jaeger client. The client sends data to the Jaeger agent, which runs on the host; in Kubernetes it can be a sidecar container or a DaemonSet. The agent then sends data to the collector, which stores it in the storage backend, and the query service reads from storage and visualizes the traces (on the client side this choice of path is just reporter configuration; see the sketch after this section). Then there is a Spark job that does aggregation over the data; at the moment this is used for the dependency links you saw in the first view of the demo. That used to be the traditional architecture, the simplest thing you can deploy. We improved it by using Kafka: the Jaeger collector is able to write the traces to Kafka, and then a new component called the ingester reads the traces from Kafka and stores them in the storage. This provides more flexibility and elasticity, so instead of only moving traces from Kafka into storage, we can also run a streaming pipeline to do analytics in a near-real-time fashion.

On the technology stack, Jaeger itself is written in Go. There is a pluggable storage layer; as part of the main project we bundle implementations for Cassandra, Elasticsearch, Badger, which is an embedded local storage somewhat like what Prometheus uses, and in-memory storage, which is what the all-in-one uses. Then there are instrumentation libraries for many languages: Go, Java, Python, Node.js, C++, C#; I think PHP and Ruby are community-maintained. There is integration with Kafka and Apache Flink, which is the newer streaming path, and with Apache Spark, which is used for the older job that processes the data on a daily basis. On the community side, at the moment we have around 9,000 stars on the main repository, a lot of contributors, and 15 maintainers across all the repositories (we are happy to accept more), and we have done 15 releases since the start.

What is new in Jaeger since the last KubeCon: we implemented the operator for Kubernetes. This operator is able not only to deploy and manage Jaeger, but also to create CRs that are picked up by the storage operators to deploy the storage. So if you're using OpenShift, you can just create one CR and the operator running on OpenShift will also create the Elasticsearch for it, which makes it very simple to create a production-ready Jaeger deployment. Then there is the Badger storage, an embedded local storage, single node, something like Prometheus; this is used by default in Istio. The storage layer is pluggable, based on a gRPC plugin; at the moment there are two external implementations, one for Couchbase and the other for InfluxDB. There is the trace comparison view we already talked about, and we made some improvements on the security side: TLS for the gRPC communication between the agent and the collector, and also for Kafka and Elasticsearch. On documentation, I don't have time to go through it, but we have quite a nice website and everything is there, so you don't have to dig through our GitHub repositories to find out about things. There are also various integrations with Jaeger, and I would like to mention Zipkin compatibility: Jaeger can receive data from Zipkin clients and can also talk to Zipkin clients, so if one service is instrumented with Zipkin and another with Jaeger, they can still talk to each other.
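On the client side, the choice between reporting through the agent or directly to the collector from that architecture diagram is just reporter configuration. A hedged sketch with the Go Jaeger client; the host names are placeholders for wherever the agent and collector run in your deployment.

```go
package reporterexample

import (
	jaegercfg "github.com/uber/jaeger-client-go/config"
)

// agentConfig reports spans to a jaeger-agent running next to the application
// (on the host, or as a sidecar / DaemonSet in Kubernetes), over UDP.
func agentConfig() jaegercfg.Configuration {
	return jaegercfg.Configuration{
		ServiceName: "service-a",
		Reporter: &jaegercfg.ReporterConfig{
			LocalAgentHostPort: "jaeger-agent:6831",
		},
	}
}

// collectorConfig skips the agent and sends spans straight to the collector
// over HTTP, which can be simpler where running an agent is awkward.
func collectorConfig() jaegercfg.Configuration {
	return jaegercfg.Configuration{
		ServiceName: "service-a",
		Reporter: &jaegercfg.ReporterConfig{
			CollectorEndpoint: "http://jaeger-collector:14268/api/traces",
		},
	}
}
```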
I'll probably skip the roadmap, since we are out of time; there is a deep dive session tomorrow at 2:25, and you are welcome to join. Thank you very much. [Applause]
Info
Channel: CNCF [Cloud Native Computing Foundation]
Views: 13,903
Id: cXoTja7BvSA
Length: 34min 46sec (2086 seconds)
Published: Fri Nov 22 2019