Observability of Your Application by Marcin Grzejszczak & Tommy Ludwig @ Spring I/O 2023

Captions
[Music] The conference — yay! So the first thing we've got to do is take a selfie, because rarely do we get to be on the main stage, so let's try it. Everybody say "Spring!" I like your enthusiasm. Okay, let's get started. Just before we do: if anyone has any questions, you can put them in the app and we'll try to check it at the end of the talk and make sure we answer them. If we forget, which is very likely to happen, please remind us. We're also going to be showing a lot of UI elements where we might need to zoom in so that you can see them, so if you're sitting in the back and you don't have amazing eyesight you may want to move a little farther forward — but wherever you're comfortable. Okay, so welcome everybody to our presentation, titled "Observability of Your Application". My name is Marcin Grzejszczak — I know, the last name is a mess, which is why I picked as my Twitter handle the first letter of my first name and the whole surname. That was a good idea, but if you go to my blog, toomuchcoding.com, you can find a link to my Twitter handle. Then I created an account on Mastodon and made the same mistake — I created it with my last name — but I renamed it to Too Much Coding, which is much better. And I am Tommy Ludwig. I am super excited to be here in Barcelona with all of you today. I came on a 20-hour flight from Japan and I am still jet-lagged, so bear with me if I'm not making any sense. This is one of the greatest Spring conferences — if you're watching this recording and you haven't come to it, you really should — and I'm really happy that you all are here. We're really excited to talk about observability in your application. This slide is the one our lawyers make us show: don't sue us, and don't take anything we say as absolute truth. Before we get started with the actual talk I want to gauge the audience a little bit — gauge, like the Micrometer gauge. That was a joke. Sorry, we laugh at our own jokes; it's fine. So, how many people in the audience, by show of hands, are already using Spring Boot 3?
Wow, that's about half of the audience, I'd say — pretty good. And how many people are using observability in production, where we don't really define what observability is — whatever it means to you — who's doing observability in production? Yeah, about half again. Not bad. And how many of you are using Micrometer already? Pretty good. And who of you is actually reacting to alerts? Okay, some people — good, not ignoring them, that's good. All right, we just wanted to see where everyone in the audience is at. We're going to try to do this so that it hopefully makes sense and you can get something out of it regardless of whether you're an expert in observability or you're just wondering what this all means. To that end, instead of starting off with a bunch of slides trying to explain what observability is and how you do it, we wanted to give you something a little more visual: a sample application. We'll walk you through issues that happen in that sample application and show how observability can help you, as a developer or somebody in charge of a production application, enjoy your time, not be woken up so much in the middle of the night, and not have to spend days trying to figure out what happened. So that everyone can understand what's going on in this sample application, let me start by explaining the architecture. All of the applications are Spring Boot 3 applications, and their purpose is a tea service. There's a tea service application — the one end users interact with — and this tea service calls another Spring Boot 3 application, the water service, which has its own database, and in addition it calls a tea leaf service, which also has its own database. So this is what the architecture is supposed to look like, and we'll be able to confirm later, using observability, whether that's actually the architecture in production. Okay, time for the demo. For the sake of this demo we stopped writing libraries for a second and decided to write production apps, and this is the best we can do with the UI. We think this is the 2023 stack — it's responsive: when you click on the box — can you do it? — it reacts. Responsive, reactive design, responsive Kubernetes — everything we have here is the best of the best. So what do we have here, Tommy? In the architecture I showed before, this is the tea service application we're interacting with. You can select which kind of tea leaf you want — I like the sencha — and I'm really thirsty up here, so I'll do a large on the water, 300 milliliters, steep it, and it gives you all the details you need to know about it. So that's working; I'm satisfied. Marcin, would you like some tea? What kind of tea leaf do you have? Oh, English breakfast — who wants an English breakfast? Okay, let's do it, give me the English version. What size do you think we need? Well, we need to share it, so 300 milliliters — everybody gets a milliliter, that's fine. All right. And I don't know how you brew your tea, but I don't typically... wait, wait, I think this is a bug in Firefox. Can you click five more times? Yeah? Okay, refresh.
Refresh, refresh... okay. So it went away? Wait, wait, there's no problem. Okay, click again. Are you sure? So we tried to push the problem onto somebody else, but most likely it's our fault. What can we do about it? Well, we are responsible for this, so I guess we're the ones who have to fix it, right? I mean, can we just admire the UI instead? That's one way of doing things, but we're going to get fired. Let's maybe look at the metrics. Okay, that sounds like a good idea. We just tried five times and refreshed the page, and that didn't work, which is weird, because that usually does. So we can go over here and check out the dashboard we already made for this tea service application. Because I have a problem with my eyesight, can you zoom in on things? Yeah, that's too good even — that's fine. Can everyone see? So we have this dashboard for the tea service application and we've got a couple of different panels in it. We have one here showing the latency and one here showing the throughput, and you can see on the throughput graph that there are two different statuses: errors and successes. We can see that there's a more or less constant rate of errors, and we're not using the UI right now, so clearly other users — all of our tea service users — are trying this out and they're also getting errors. So it's not just my browser, and it's not our UI — because that UI is perfect; something else is problematic. So we need to dig into that. And actually, if you look at this latency graph over here, you'll see that in addition to the lines there are these dots, and if you mouse over the dots you can see that it says "exemplar". What an exemplar shows is a specific request, as opposed to these metrics, which are aggregated across all of the requests we're getting. The tea service is getting all these requests, and among those we're getting 0.5 requests per second that are errors, but we want to look into the details and figure out exactly why that's happening, so we need to look at a specific request that has an error. If we had exemplars like this over here, we could just jump over to our distributed tracing data and find a specific request with that error. So wait a second — do I understand correctly that by having exemplars you can click this and then see a visualization of the latency of the processing of this request? Yes. Okay, and that's what we want to do, and we want to do it specifically for one of the requests that was an error, like the error that we had. So why can't we click it on the throughput panel? That's what we want to do, but we don't have exemplars available there right now. It is a feature we're working on; it will be there in the future. So instead we're going to have to do this the old-fashioned way and type a few things. Shall we SSH to the machine and grab the logs? No, never — never do that, that's too old school. We have too beautiful a UI to be doing those kinds of things. So, the old-fashioned way: we're going to search our distributed tracing data. We're going to search the tea service, and we saw in that graph before that it had an outcome, and the outcomes were successes and server errors. Wait a second — do I understand correctly that because we have a Spring Boot app with Actuator on the classpath, and thus Micrometer, we have instrumentation of the HTTP-related components, so that when an error comes they tag the metrics or spans with the information that there was an error? Yes. Okay, and we didn't have to write any of that code ourselves, so we could spend all of our time on the UI. Of course. We didn't have to write any instrumentation — that was provided by the Spring projects we're using and by Micrometer.
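(Editor's sketch, not shown in the talk: the uri, outcome and status tags come out of the box, but if you want every meter from an app to carry extra tags you can register a customizer. The tag values below are made up for illustration, assuming a Spring Boot 3 app with Actuator.)

    import io.micrometer.core.instrument.MeterRegistry;
    import org.springframework.boot.actuate.autoconfigure.metrics.MeterRegistryCustomizer;
    import org.springframework.context.annotation.Bean;
    import org.springframework.context.annotation.Configuration;

    @Configuration
    class MetricsConfig {

        // Every meter created by this application's registry gets these extra tags,
        // e.g. so dashboards can be filtered per application or region.
        @Bean
        MeterRegistryCustomizer<MeterRegistry> commonTags() {
            return registry -> registry.config().commonTags("application", "tea-service", "region", "barcelona");
        }
    }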
So we're going to search the traces — if I can zoom out, there we go. These are all of the recent traces in the tea service that have an error, so this is how we can find a specific request. I'm just going to click on one of them, and now this is a visualization of the tracing data. It's probably a bit small, but you can see there's a hierarchy here between the services — they're nested and they have a parent-child relationship — so you can see that the tea service is calling the water service and the tea service is calling the tea leaf service, and you can see exactly how much... — oh, technical difficulties; okay, we're back — you can see how much time is spent in each service and in each part of that service. This is very useful if you want to drill down and see how much time is being spent, which services are calling which services, and where the errors are happening. So do I understand correctly that — if you zoom in a little bit — we have this water service connection, and to the right we have the bar, so that represents the duration of the connection? Yes. Okay. And you can see visually here there's an icon showing an error. We knew this was going to be an error because we searched specifically for tracing data that had an error in it, but this shows that the tea service had an error, and then you can look down through here and see there was no error in the water service, but this tea leaf service also has the error icon, so you can see that's where it originated. We mentioned before that the architecture is supposed to be a certain way, and — if I can zoom out again — you can see here a service graph that should match what we were looking at before: you have the end user calling the tea service, the tea service calling the tea leaf service, the tea leaf service calling the tea leaf database, and the same thing for the water service. So this should match, and you can see it based on what's actually happening in production, instead of what we intended to design and deploy — you can verify, based on actual requests from users, what is happening in production. It's a way of saying: why did we, on the slides, take images of servers instead of screenshotting this? We would have had more time to work on the UI. You're right, but when we made that architecture diagram we hadn't made these services yet, so we couldn't get this image — we had to deploy them to production, actually get requests, and then we could get this image. So it's like box, box, cylinder, right — service, service, database. Okay, yes, I get it. So if we go back to the trace view — what is the problem, why did we have the exception? We could see that the error is happening in the tea leaf service, but we need to dig in, and if you click on one of these you can get even more metadata about what happened in that specific span, as it's called — which is just a unit of work. If we look here, you can see these attributes, and you can see that the error is set to true, but there's also a lot of other really useful metadata that we're getting from the different Spring projects.
This is a Web MVC endpoint, I believe, so we can see the actual URL that was called, the HTTP method, the outcome, the HTTP status — all of these things get attached to your tracing data, so you can drill down and see, for a specific request, exactly what happened and where. We can see that the error message here is "resource not found", and it was trying to search for English breakfast, so maybe the reason for this exception is that there was no resource — that is what it's saying, isn't it? Yeah, but we should probably dig in a little bit more just to confirm. You can see the query here: we don't just have the HTTP layer instrumented, we have the JDBC layer instrumented too, and here you can see that the actual query going to the database is also attached as metadata to your tracing data. So, because the Boot apps are instrumented with Micrometer, and because the JDBC integration is instrumented with Micrometer, we can see all of this information bound together, right? Yes. Okay — and without you having to write any of this instrumentation yourself, so you can spend all of your time on the UI, of course. Oh wait, this is the result set then, right? Yes. So why do we have the JDBC row count equal to zero? I'm guessing it must have run that query, then gone to process the result set, but the result set was empty. So are you saying that in the UI we have put a value that is clickable but is not present in the database? Yes — in our perfect UI there was an option, but we forgot to actually put the data in the database. Okay, that's not good. But maybe you're wrong — can we check the logs for this whole trace, for the processing of this whole request? Maybe you could double-check this in the logs. Yes — there is this handy "logs for the span" button right here, and if you click on it, it opens another window, and now, using another component in the Grafana ecosystem called Loki, you can look at logs that are also collected from your application. Because we're inserting the trace ID and span ID into the logs, we can search for all of the logs that match this specific trace ID. So we've looked at the tracing view, we can see everything that's happening over here, and then we can see the exact logs that correspond to that trace, from that one specific user, so to say. So I can have them in chronological order, instead of SSH-ing into every single machine, looking at the logs, grepping for the trace, copying it somehow and gluing it together — I have everything in a single place. Exactly, it's fantastic. So we can confirm from the logs exactly what happened, and you can see the "resource not found", which was also in the trace data. We could basically piece together what happened just from looking at the trace data, but in case you can't, or in case you have some special information that you want to see in your logs, it's easy to jump from the traces to the logs, because they are correlated.
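(Editor's sketch of what that log correlation looks like from the application side, assuming a Spring Boot 3 app with Micrometer Tracing on the classpath; the endpoint below is hypothetical, and newer Boot versions include a correlation pattern by default.)

    import org.slf4j.Logger;
    import org.slf4j.LoggerFactory;
    import org.springframework.web.bind.annotation.GetMapping;
    import org.springframework.web.bind.annotation.RestController;

    @RestController
    class SteepController {

        private static final Logger log = LoggerFactory.getLogger(SteepController.class);

        // Micrometer Tracing puts the current traceId/spanId into the logging MDC, so a pattern like
        //   logging.pattern.level=%5p [${spring.application.name:},%X{traceId:-},%X{spanId:-}]
        // prints them on every line, which is what lets Loki find the "logs for this span".
        @GetMapping("/steep")   // hypothetical endpoint, for illustration only
        String steep() {
            log.info("Steeping tea");   // logged together with the trace and span identifiers
            return "ok";
        }
    }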
So wait a second — what we've managed to do so far is: we had metrics, from metrics we managed to jump to traces, and from traces we managed to jump to logs. Now, there is a chance that, for example, one of the users tells us about this bug — they file an issue and copy the log they've seen, containing the trace ID. Could we then, assuming for the moment that we're in that situation, jump from the trace view to the corresponding metrics? Yes, actually — there's this link here, and if you click on it you can look at the metrics, the throughput or the latency, for a specific span. If I click on that, it's going to open another window and look for the metrics that have the same matching tags as the span you clicked on. This might be useful because we were looking at the metrics before and saw there's a consistent rate of errors, so it's not just a one-off thing — but how do we know that the trace we found is actually the same issue everyone is having? Maybe there are multiple issues, and maybe with the one trace we clicked on we were just unlucky and it was something nobody else is having. So you can jump to the metrics and see how many requests with those exact matching tags correspond to that trace. It's not just one trace — we can see here that over time the throughput is non-zero, right, so we're getting requests that match the same tags. Okay, so how can we fix this? Well, I think we've got to put that English breakfast in the database, and I just happen to have a request ready to do that. This is amazing — how did you have it prepared, as if you had known beforehand that there was a problem? I mean, there's no way I could have known, but... really conveniently. Can you zoom? I cannot — oh, there we go. So what we've managed to do right now is send a request to the API to insert the missing English breakfast. Yeah, and that was successful, so if we go to the metrics now, we should see things getting better, right? Yeah, let's go back to the dashboard — the tea API dashboard we had before. It's going to take a little bit of time, so let me try to summarize what we did until now: we clicked our fabulous UI, which resulted in a fabulous exception; then we went to the metrics and confirmed that the problem is more generic — it's not only our browser; then from the metrics we managed to go to traces, which we could analyze to verify the potential cause of the exception; we managed to drill down and find that the exception comes from the database, because we're missing an entry; so we added the entry, and now the problem is gone, right? Yeah — now we can see in the throughput that there are zero server errors. Okay, so I think that can conclude the presentation — we're done, right? Or is there anything else you want to tell me? No, I think that's probably it; it's looking pretty good. Wait a second — can you explain the heat map to me? Why is it moving upwards, what is going on here? Well, I understand that with the heat map we have buckets of latency, and previously they were more or less around 20 to 90 milliseconds, and now we have around 300.
That is what it looks like, yeah. I mean, I think the heat map is broken — if we look at the latency over here... Either Grafana is broken, or we have introduced latency in the meantime. Well, I would never do that. Okay, so that's unfortunate, because we already wanted to end this — but if we never introduced any bugs we would get fired, so this is why bugs equal work. Let's now figure out how we can fix this. So you mentioned something about the exemplars, right? We have those traces corresponding to a given metric in time, so to say. Could we do it like this: take a trace for analysis from before the latency got introduced and compare it to one that happened after? That's a good idea — let's zoom out a little bit. This is from before, and then let's take one from after; we'll copy the trace ID so we can split the view. Right, split the view here, put in that one, and now we've got a side-by-side comparison of a trace from before, when the latency was low, and a trace from after. Let's check this out: what we have here is the 75.4 milliseconds case — okay, that's the good one, so to say — and here we have 226. So I can see that the connection has increased in time. Why is that? It means there's a problem with the database, it looks like — yeah, the water service connection. But maybe again we're unlucky and we have just clicked the one trace that is related to this particular problem; maybe there are other problems. So, if we have a hunch that the problem lies with the database, what we could do is analyze metrics from the water service. We'll start by looking at the JVM stuff, then Tomcat, etc., and then we'll drill down to the database metrics to confirm it. What do you think? Sounds like a good plan. Okay, so we'll go over to this pre-built dashboard — and you said it was the water service, I guess. Can you zoom in a little? So, the basic settings: there are no spikes or anything, the situation seems kind of normal, I don't see anything out of the ordinary, so let's just collapse this. The JVM looks ordinary, nothing special is going on, GC every now and then, normal stuff. HTTP server throughput is the same; we see the increase in latency, fair enough, but that doesn't mean anything — if it's the database, it's obvious that the HTTP side will be problematic. So let's ignore Tomcat. Hikari — so the database. What do we have here? Can you show the connection latency? Oh, we see a spike. Okay, so we do have a spike, which means we have confirmed that in this set of apps, in all the instances that we have, we have an introduced latency, right? So what we could do is file a ticket to the database team saying "hey, there's a problem", and we could introduce caching, I guess, right? So we would create a new instance — because you mentioned that we're using Eureka — we would create a new instance that has caching, then we would tell all the clients that the other instance is getting deactivated, we would wait for it to deregister from Eureka, and then the problem would be gone.
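(Editor's sketch of the caching idea they describe here and come back to in the Q&A at the end: with Spring's cache abstraction, the slow lookup in the water service could be wrapped roughly like this. The class and method names are invented for illustration.)

    import org.springframework.cache.annotation.Cacheable;
    import org.springframework.cache.annotation.EnableCaching;
    import org.springframework.context.annotation.Configuration;
    import org.springframework.stereotype.Repository;

    @Configuration
    @EnableCaching              // switches on Spring's cache abstraction
    class CacheConfig {
    }

    @Repository
    class WaterRepository {     // hypothetical repository in the water service

        // The result is cached per temperature, so the slow database call that showed up
        // as the long span in the trace only runs on a cache miss.
        @Cacheable("water")
        String findWaterByTemperature(int temperature) {
            // ... the query that got slow would run here ...
            return "water@" + temperature;
        }
    }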
The only thing I think we could have done better is to notice this latency problem without looking at the graph — I prefer not to just sit in a chair and stare at the screen; I would prefer to be notified about the problem, right? Yeah — I want to sit and stare at our UI, not the dashboard. The beauty of the UI — we should put it in the Louvre. Anyway, we could create an alert so that if this metric goes beyond a certain point, we get a notification, right? Yeah, and if we did that, we would have known without having to just happen to be looking at the dashboard at that time. Okay, so with that I think we've managed to fix the bug and we've managed to fix the latency — do you want to summarize this part? Yeah. We started off by trying out the UI and ran into an issue — which, again, if we had alerts we would have known about without having to actually try the UI ourselves. We found that we were getting an error when trying to steep English breakfast tea, and then we confirmed with the metrics that it wasn't just us getting that, but that it was happening across our system. Then we drilled down into the tracing data by finding traces that corresponded to the same error we were seeing, and we found the root cause by looking at where the error was originating in the trace — which service. We narrowed it down to the tea leaf service, and in the tea leaf service it was calling the database and not getting any results back, so we inserted the missing entry and that fixed it. But then we noticed that the latency was high, and we again narrowed that down: it was high because of the water service, and it was the water service's call to the database that was slow. We came up with a plan — we could add some caching, because we don't need to call that every time, and that would solve the latency problem. Fantastic. So that concludes the demo; let's now talk a little bit about the theory behind the whole thing. Let's start with metrics. As we mentioned before, we have Spring Boot apps with Actuator on the classpath, so we have Micrometer on the classpath, and we have been using both the direct Micrometer API for creating different types of metrics — timers, counters, gauges — and the new Micrometer Observation API, to instrument once and get multiple benefits out of it; we're going to talk about that later. So how do things work with metrics? Since we have Micrometer on the classpath, each of the apps registers its own metrics. What we were using in this demo is Prometheus, so each of those apps has a dedicated Prometheus endpoint where, when asked, it can print the metrics it has gathered in a dedicated format; the metrics store then polls for the data, gathers it, aggregates it, and then you can visualize it. Here you can see an example of Tanzu Observability by Wavefront, but you can use whatever tool you want — we were using Prometheus. You showed Tanzu Observability — does that mean those are the only two options? No: if you go to the Micrometer page you'll see that we are like SLF4J, but for metrics — you have one abstraction, you don't change your instrumentation code, and you can have different backends. Recently we rebranded that slogan, because we are more about observability in general, so we are like SLF4J, but for observability. We don't care about the backend — the backend is a configuration problem; your instrumentation code will not change if you change the backend.
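(Editor's sketch of that direct Micrometer API — a counter, a timer and a gauge registered against the auto-configured MeterRegistry; the metric names here are made up.)

    import io.micrometer.core.instrument.Counter;
    import io.micrometer.core.instrument.MeterRegistry;
    import io.micrometer.core.instrument.Timer;
    import java.util.concurrent.atomic.AtomicInteger;

    class TeaMetrics {

        private final Counter steepings;
        private final Timer steepTimer;
        private final AtomicInteger queueSize;

        TeaMetrics(MeterRegistry registry) {   // Spring Boot injects its auto-configured registry
            this.steepings = Counter.builder("tea.steepings")
                    .description("Number of teas steeped")
                    .tag("leaf", "english-breakfast")
                    .register(registry);
            this.steepTimer = Timer.builder("tea.steep.duration")
                    .register(registry);
            this.queueSize = registry.gauge("tea.queue.size", new AtomicInteger(0));   // gauge tracks a live value
        }

        void enqueue() {
            queueSize.incrementAndGet();   // reflected the next time the gauge is sampled
        }

        void steep(Runnable work) {
            steepTimer.record(work);   // records how long steeping took
            steepings.increment();     // counts one more steeped tea
        }
    }

With the Prometheus registry on the classpath, meters like these are exposed on the /actuator/prometheus endpoint in the Prometheus text format, which is what the metrics store scrapes.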
Now let's talk about log correlation — the fact that we were able to see the trace identifier in the logs and correlate the logs. I mentioned Spring Cloud Sleuth here — who has been using Spring Cloud Sleuth? Okay, and who is sad that Spring Cloud Sleuth is feature complete? What we've done is make it feature complete, and from it we created Micrometer Tracing. Why did we do that? Essentially it's a copy of Spring Cloud Sleuth with certain bugs fixed — of course we backported those to Sleuth — but Spring Cloud Sleuth required you to use Spring Cloud, and Micrometer Tracing does not: it's Spring-agnostic. How does it work? Let's say we have those four services. A request comes in; there was no tracing context, so an identifier gets generated — let's say it's 123. The thing is that whenever the service logs — if you can't properly see the text it doesn't matter, it's just a log statement — the important thing is that the identifier gets attached. Now, for one service that's kind of easy, so to say, but what needs to happen is that when we send the request to service 2, this context — the distributed tracing context — needs to be propagated, which means this identifier needs to be propagated over the wire so that service 2 is able to put it in its logs as well, and the same for service 3 and service 4. That means that for the time of the processing of this business request, within those four services, you always get the same trace ID. What happens then, depending on your configuration — you can store the logs on disk, push them somewhere, or write to standard out, whatever — at the end of the day the logs need to be sent to a log store, where they are parsed, cut into pieces, and then you can search for them: give me all the logs for the given trace identifier. Then you can put them in chronological order, and no longer do you have to SSH to any machines and copy all the logs and do all sorts of stuff like that — which of course none of us has ever done. Never. Now, for that to work we need distributed tracing, which means we have a tracing context that gets propagated. But what we also want to achieve is latency visualization. How does it work? Let's look at this four-service communication. Looking at it, we can say that what we would like to observe and measure, for sure, would be this part — how much time it takes for service 1 to call service 2; maybe this part — for example a very specific piece of code inside service 2 that you want to analyze, which would be a different observation; and then the same thing for communicating from service 2 to service 3 and to service 4. What that means is that you need some sort of framework, like Spring Cloud Sleuth or Micrometer Tracing, that handles the lifecycle of spans: it knows how to create a span, how to push it over the wire, and how to pick it up from the wire and continue it. Everywhere here you have the framework, and since you're using Spring, everything happens out of the box, behind the scenes — you don't have to configure anything for that to happen. What happens then is that not only do we have to propagate the trace ID but also the span ID — actually the same thing happens when you're doing log correlation, but I wanted to simplify the example. So here you have the span ID, and the difference between a trace and a span is that all these spans form a tree-like structure, a hierarchy, and they all share the same trace identifier. So if you go over the wire you have the same trace ID but a new span ID, because it's a child span: 234 is a child of 123, 345 is a child of 234, and 456 is also a child of 234.
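(Editor's sketch: most of that span lifecycle is handled for you by the framework, but if you ever need to create a span by hand, the Micrometer Tracing Tracer looks roughly like this; the class and span names below are invented.)

    import io.micrometer.tracing.Span;
    import io.micrometer.tracing.Tracer;

    class LeafLookup {

        private final Tracer tracer;   // backed by Brave or OpenTelemetry, auto-configured by Boot

        LeafLookup(Tracer tracer) {
            this.tracer = tracer;
        }

        void lookup(String leaf) {
            // Child of the current span: same trace id, new span id.
            Span span = tracer.nextSpan().name("leaf-lookup");
            try (Tracer.SpanInScope ws = tracer.withSpan(span.start())) {
                span.tag("leaf", leaf);    // metadata, like the attributes seen in the demo
                // ... the work being measured ...
            } catch (RuntimeException e) {
                span.error(e);             // marks the span as errored
                throw e;
            } finally {
                span.end();                // finished spans get reported to the span store
            }
        }
    }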
So what does a span have? A parent ID, a start time, a stop time — thus the duration — and metadata: do you remember when Tommy was showing the JDBC result count and so on? That is the metadata; those are tags. Once a span is stopped, the spans need to be sent to a span store, and then you can visualize them. How does that look? At the top you typically have the production dependency graph — built from production, it tells you what your communication actually looks like; you have the whole trace, meaning the time required to process the whole thing; and you have a single span, meaning this particular operation took that much time to complete. Okay, so what options are out there? There are certain tracing standards: we have OpenZipkin, OpenTracing, OpenCensus and OpenTelemetry. Tell me, what do you think they have in common? I think they're all very open. They're very open — which is a lie, because OpenTracing and OpenCensus are closed; they're gone. They should be renamed ClosedTracing and ClosedCensus, because they got deprecated and you should not be using them — which is kind of sad for quite a few customers who had been using their APIs and had their customizations, and now they're all gone. You should be using either OpenZipkin or OpenTelemetry, because OpenTelemetry has devoured OpenTracing and OpenCensus. Now, OpenTelemetry is the new thing, so let's talk about OpenZipkin first. It's a mature, production-tested project and ecosystem with multi-language support. The latency visualization tool has around 16,000 stars on GitHub, and its first release was in May 2016 — that's the open source release, because Zipkin originally comes from Twitter, so it's older than that. Brave has around 2,300 stars. What is Brave? It's a tracer library. What is a tracer library? It's a library that handles the lifecycle of a span: it knows how to start a span, stop a span, annotate it, etc. It was first released in April 2013, which is ten years ago, and it supports quite a few languages. Brave takes care only of tracing. Now, not to mix OT, which is OpenTracing, with OTel, which is OpenTelemetry: OTel is new, multi-language and CNCF-oriented — by the way, Spring Cloud Sleuth is also in the CNCF landscape, don't ask me why. opentelemetry-java has around 1,500 stars on GitHub and its first release, a milestone release, happened in November 2019; the spec has 3,200 stars and its first release was in June 2019.
So the first GA releases were about two to three years ago. Let's pause here for a second, because for us GA means no breaking changes, right? If you do a 1.0 and then you break something, you do a 2.0. That's quite a different understanding from the OTel world, because over there you have GA alpha things — there is in Maven Central an alpha jar describing the semantic conventions, so the names of all the metrics and all the spans can be changed at any point in time in a backward-incompatible way, which we know will happen with HTTP — I don't know when, but sooner rather than later. So if you have any dashboards you've been building for years: all of them gone, and you'll have to write a script to migrate to the new thing. Another thing that OTel does, apart from breaking compatibility, is cover the three types of problem analysis: they want to solve the problem of logs, tracing and metrics. For logs they don't have a separate API; they do have separate APIs for tracing and metrics. This is also the main difference from Zipkin, because Zipkin only takes care of traces. Let's park that topic of logs, tracing and metrics, because it's something that in the Micrometer world we want to address in a different way. OTel supports different languages as well. So what latency visualization tools are out there? There are a lot of them; we're just mentioning a few. There is Tanzu Observability by Wavefront, there's the Zipkin one, there's Jaeger — it was shown even today at the 9 a.m. presentation — and during the demo you've seen Grafana Tempo. For metrics, again Tanzu Observability by Wavefront, and you've seen Grafana today. And for logs — who has heard of the ELK stack? Yeah, quite a few people: Elasticsearch, Logstash, Kibana, a very frequently used stack for logs. Today we've also shown you the Grafana Loki tool. Okay, now let's talk about the Micrometer Observation API. Who has been using this API? Wow, there are quite a few people; that's interesting. We believe that you should instrument once and get multiple benefits out of it, because if you have multiple APIs to do tracing, metrics, etc., and you have one block of code, you have to instrument it multiple times with different APIs. We believe that this is not the way to go; we believe you should do it once. You have an example here, hopefully you can see it, a very simple one: you have an ObservationRegistry, which you can configure in different ways, and you have a context, which is in essence a mutable map, so to say. If there is a method that you want to observe, called doSomeWork here, you create an observation, give it a name — for example "my.operation" — call observe, and that's it: this is the whole instrumentation code. When you do that, what can happen is that you get metrics, traces, logs and whatever else you want. How does it work? In Micrometer we have the notion of a handler. A handler reacts to events: start an observation, stop an observation, there was an error with an observation, etc. Micrometer gives you such handlers, like the metrics handler, which starts a Micrometer Timer when the observation starts and stops it when the observation stops, and then you can ship the metrics wherever you want. Micrometer Tracing gives you tracing handlers that can start and stop spans, and you can use the Brave or the OpenTelemetry tracer to create those spans. You can also choose not to use the Micrometer API to create metrics — you can use the OpenTelemetry metrics API, and that's fine. Whatever you pick, it's an implementation and configuration detail, because you have separated it from the instrumentation code.
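(Editor's sketch of the example described on the slide; the operation name and key-value are placeholders, and in a Boot app you would inject the auto-configured ObservationRegistry instead of creating one.)

    import io.micrometer.observation.Observation;
    import io.micrometer.observation.ObservationRegistry;
    import io.micrometer.observation.ObservationTextPublisher;

    class MyOperation {

        void run() {
            ObservationRegistry registry = ObservationRegistry.create();
            // Handlers react to start/stop/error events; Boot registers metrics and tracing handlers
            // for you, here we add one that simply prints the lifecycle events.
            registry.observationConfig().observationHandler(new ObservationTextPublisher());

            Observation.createNotStarted("my.operation", registry)
                    .lowCardinalityKeyValue("leaf", "english-breakfast")
                    .observe(this::doSomeWork);   // start, run, record errors, stop - all in one call
        }

        void doSomeWork() {
            // the block of code you want metrics, spans and whatever else for
        }
    }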
In your instrumentation code, metrics and traces should not leak at all, and the same goes for logs — you can log whatever you want; we are not providing a custom logging solution, because we don't know what you want to log, so you can provide your own. And we're good on time here, so if you have any questions — we have the app for questions as well, and we have a couple already — maybe we'll take the questions from the app first, and then if you have any questions from the audience we'll take those. The first question that came in was: would you recommend this kind of setup, with Grafana etc., for a monolithic application? I think the answer is yes — you can certainly use tracing and metrics and get benefit out of them even in a monolithic application. I suppose the difference is that if you want to debug something or see where something is called from, in a monolithic application it's much easier to do that in your own IDE, because it's not spread across multiple services. But you can certainly still get benefit out of this, because it's not as easy to debug even a monolithic application in production, right? You want to know what's actually happening based on real user requests in the wild, so to speak. And if you're using a project like Spring Modulith — the project that Oliver is running — it has built-in observability support that creates separate spans when you cross module boundaries, which means that where in a microservice architecture you would be making network hops, here you're not making network hops and you still see the spans, which is great. So if you have a monolithic application and it's a Spring Boot 3 app, the only thing you should be doing is manually adding observations wherever you think it makes sense to have them in the trace view. This is also somewhere we offer annotation support: we have an @Observed annotation that you can use in places where you want that separation — "this is a separate operation" — that you want timing data for, for metrics and tracing and any other handler you might configure. So I think it's definitely viable to use this in a monolithic application, and as we showed in the example, even in a microservice world we had the HTTP instrumentation and the JDBC instrumentation, so there's still a breakdown of timing within one app, within one service. I would not only say this is viable — you definitely should do it.
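(Editor's sketch of that annotation support — @Observed plus the aspect that makes it work; spring-boot-starter-aop needs to be on the classpath, and the names below are invented.)

    import io.micrometer.observation.ObservationRegistry;
    import io.micrometer.observation.annotation.Observed;
    import io.micrometer.observation.aop.ObservedAspect;
    import org.springframework.context.annotation.Bean;
    import org.springframework.context.annotation.Configuration;
    import org.springframework.stereotype.Service;

    @Configuration
    class ObservedConfig {

        // Turns @Observed methods into observations (and therefore timers, spans, ...).
        @Bean
        ObservedAspect observedAspect(ObservationRegistry registry) {
            return new ObservedAspect(registry);
        }
    }

    @Service
    class BrewingService {

        @Observed(name = "tea.brew", contextualName = "brew-tea")
        void brew() {
            // this module-internal operation gets its own span and timing data, even in a monolith
        }
    }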
Any more questions? We have a lot of questions now, so let's try to answer them — I mean, time is up, but we can talk for one more minute; I think that clock's fast. Oh yeah, you're right. The next question we had was: will the trace ID be generated for all Spring applications, for example Spring Integration or Spring Batch? One of the key themes for Spring Boot 3 was observability, so we created Micrometer Observation and Micrometer Tracing, and we discussed this with many Spring projects, which are already being instrumented with Micrometer Observation. We automatically generate documentation from the observations they declare, so you can go to their docs and check exactly what kind of observations they make. So the answer is yes — we haven't actually instrumented all the projects yet, but we are working on that. Next question — can we say it in a different way: can we remove OpenTelemetry if we migrate our Spring microservice to Spring Boot 3 and make use of Micrometer? I don't fully understand the question, because if you're using Micrometer Observation with Micrometer Tracing, one of the tracers you can use is OpenTelemetry. If you're saying you want to stop using the agent, then the answer is yes, you can consider that — there might be a case where we don't instrument a given library, but then just file an issue and we're going to fix it, because more and more external libraries are using Micrometer Observation. But yes, you can consider doing that. What about tracing the case when service A is calling service B not directly over HTTP, but over a message broker? Of course — we're instrumenting Spring Rabbit and Spring Kafka, and we are also actively working, as we speak, with the RabbitMQ project to have native observability instrumentation in RabbitMQ itself. So that's already working. Let me answer this one: how did you solve the latency in the tea service? We didn't actually show solving the latency problem in the tea service — if we had more time we would have — but it's really pretty simple to implement the caching with the Spring cache annotations: you enable caching, and you just add the annotation to the data repository — @Cacheable, provide the name, and you're done. So we didn't show that. And how did the latency get introduced? I don't know if you've seen how fast Tommy was when he switched screens and typed "make chaos". In the Makefile, what we do — the thing is that one of the apps is not communicating with the MySQL database directly, but via Toxiproxy, so we introduced latency in Toxiproxy. That's why it was exactly 50 milliseconds for the connection and 50 milliseconds for the result set, which makes 100 milliseconds of latency. I think we're out of time, right? Yes, so let's conclude for now, and we'll go outside and answer any questions that you have. Thank you so much. [Applause]
Info
Channel: Spring I/O
Views: 15,710
Id: fh3VbrPvAjg
Length: 47min 30sec (2850 seconds)
Published: Wed Jun 14 2023