Getting Started with OpenTelemetry and Distributed Tracing in Java

Captions
Hello everyone, welcome to the workshop. Great to see people signing in. We're going to wait until 9:10 to start just to make sure we get everyone involved, but in the meantime I'd love to do some introductions and get a sense of why everyone's here.

Real quick: my name is Ted Young, I'm one of the founders of the OpenTelemetry project, and I'm really excited that we're in beta, that we're coming close to GA, and that people are really starting to use it aggressively. That's why I'm excited to give this series of workshops, and today's Java workshop is especially interesting to me because Java is one of the core languages in OpenTelemetry and one of the languages with some of the best dynamic support.

So that's me, but I'd love to hear from you all. If people wouldn't mind saying hi in the chat and giving a one-liner about what brought you to this workshop or what you're interested in, that would be great. The format is going to be pretty loose and I'd really like it to be Q&A focused so I can customize it to your needs. There's a chat, so pop that open and post there, and in addition to the chat there's also a Q&A panel where you can ask questions; I'll be looking over both periodically as I go through the workshop. One thing I'm curious about: has anyone attending this workshop used OpenTelemetry before, or another distributed tracing tool, or is this your first encounter with one? That would actually be super useful to know.

Muhammad has heard about the AWS distro. Yeah, I'm happy to talk about distros; we're actually going to be using the Lightstep distro for this workshop. Distros are a concept I came up with as a way to package up OpenTelemetry and hide a lot of the boilerplate you would normally have to deal with when connecting OpenTelemetry to a particular backend or service like AWS or Lightstep.

We've got Doug here, hi Doug. Doug's been working on cloud computing for the past decade and is looking at service meshes, monitoring services, and service request performance. Great. We don't dive specifically into service meshes here, but I do want to point out that Envoy and NGINX have OpenTracing support, and they'll have OpenTelemetry support soon if they don't already.

And we've got Dan from Skyscanner, hi Dan. Dan's interested because they're working on a PoC to migrate tracing on some of their services from OpenTracing to OTel, hopefully in Q1, with metrics in OTel as well. Great, and yes, metrics are coming soon. We do have an OpenTracing shim in Java that works today, so we can go over setting that up in the hands-on portion, but it's pretty straightforward to do.

Cool, we've got Stephen, hi Stephen. Stephen's looking forward to getting involved in distributed tracing and hoping that OpenTelemetry takes care of the OpenTracing versus OpenCensus split. Yes, it did, at the cost of maybe a year, but I think it was very much worth it.

And we've got Chris, hi Chris. Chris has been looking at tracing tools like Jaeger but hasn't implemented anything yet and wants to see what OpenTelemetry gives you. Awesome; I'll go over exactly what you get with OpenTelemetry versus which parts of Jaeger you'd still want to keep running if that's the system you want to run.
I'm always bummed that these webinars don't let the participants speak, because I think this would be a lot more fun as an actual conversation instead of going through a little chat window, but I do appreciate you all taking the time to type.

We're coming up on 9:10. Feel free to keep posting in the chat even as we get started; I'll look over there occasionally to answer your questions, and you can also use the Q&A tool.

To get started, I'm going to share my screen so you should be able to see the slides, and I'll post a link to the slides in the chat. You're welcome to grab them and follow along on your own, or follow along on my screen. I'm going to leave the slides out of presentation mode so I can still see your chats and questions. If you grab the slides, there are some helpful links on that first page; these are things you'll want to do for this workshop.

The first thing is account setup, so if people could go ahead and do this part right now: clicking this link will take you to a free Lightstep account. If you sign in there you'll get a free account, and we're going to use it to look at our traces on the backend during this workshop. I'm going to sign in to my work account here. That's step one, and we'd really appreciate it if you do that now so we can jump right into the walkthrough later.

The code for the walkthrough, the code we're going to be looking at, can be found here, so you can go ahead and download that. These other links are handy links to the repos: Lightstep's OpenTelemetry distro is called the OpenTelemetry Launcher, and there's a link to that bootstrapper code. After this workshop, if you're interested in learning more, I've got the beginnings of some quick start guides going here; I'm starting to put together my own set of OpenTelemetry resources to help with getting started. And last but not least, there's OpenTelemetry Java, the core project, just so you know where that is.

Okey dokey, so those are the intros and the links. We'll get to the walkthrough at around 10 a.m., and between now and then I'm just going to do an overview of OpenTelemetry and discuss the major components. Again, this is very Q&A focused, so as I'm going along, if you have a question please add it to the Q&A tool here in this Zoom webinar and I'll stop periodically to answer. I'd much prefer to do this as a conversation than me just yapping at the webcam.

Cool, so hopefully you've got an account and will be able to grab an access token and actually do the workshop. Let's talk about getting started with OpenTelemetry. The first thing people often ask is: what is OpenTelemetry, and what is its scope? We called it telemetry for a reason. The Cambridge Dictionary definition of telemetry is, I think, accurate for what we're doing: the science or process of collecting information about objects that are far away and sending that information somewhere electronically. That really covers the part of observability that OpenTelemetry fills, which is the generation and the transmission of the data.
To get into the major software components of OpenTelemetry: when you have a service (the green circle here is our service), you install the OpenTelemetry SDK. The SDK is basically the OpenTelemetry client that will be running in every service in your deployment, so when you're deploying OpenTelemetry, step one is getting the SDK set up and running everywhere.

That SDK in turn implements an API. The OpenTelemetry API is actually a completely separate package from the OpenTelemetry SDK, and there are a couple of good reasons for that. One is clean separation of concerns: the SDK can potentially haul in a number of dependencies, and since it's an implementation there's just a lot more code there. When you're instrumenting your frameworks, your applications, your HTTP clients, and all of your libraries, you don't want adding that instrumentation to pull in a big dependency like the full set of SDK dependencies. This is especially true if you're an open source framework that's going to be used in a variety of environments, or some shared piece of code: you want to be able to natively instrument that code with OpenTelemetry without worrying that it creates some kind of incompatibility when your library gets installed somewhere else. So the OpenTelemetry API is a completely separate, interface-only layer.

This also allows, by the way, other SDK implementations to be used. We have the official OpenTelemetry SDK, but we really see this as a standardization project, so we want it to be well factored enough that if you didn't like our SDK and wanted to create your own, or something experimental, you could swap it out. The SDK itself is actually a framework with various plugins (exporters, sampling plugins, and things of that nature), so you shouldn't have to rebuild the SDK; I'm just pointing out that swapping is possible. For example, you might want to use a C++ implementation through foreign function calls, which you probably wouldn't do in Java but might be useful in other languages.

So all of your framework code and your application code gets instrumented through this API, and then once your program is running, all of that instrumentation starts producing data. That data goes into the SDK and then into a plugin called an exporter, which batches the data up and sends it out, by default to something called the Collector, which is another OpenTelemetry component. You're not required to run a Collector, but we do recommend one. Collectors are the part of OpenTelemetry that allow for data pipelining, data munging, scrubbing, and changing data formats. OpenTelemetry supports a number of tracing data formats out of the box: OTLP is our own data format and is what OpenTelemetry uses by default, but you can configure it to use Zipkin, Jaeger, or Prometheus; those are the protocols OpenTelemetry ships with, and of course you can always write your own exporter if you want to talk to a different protocol. These exporters can also run at the language level, so again, you don't need to run a Collector; you could just configure your SDK to use, say, the Zipkin exporter.
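If you're curious what the SDK wiring looks like underneath the agent and the Launcher, here's a minimal manual setup sketch. It assumes the post-1.0 opentelemetry-sdk and opentelemetry-exporter-otlp artifacts (class names moved around during the beta this workshop was recorded in), and the Collector endpoint is just an illustrative default.

```java
import io.opentelemetry.api.OpenTelemetry;
import io.opentelemetry.api.common.AttributeKey;
import io.opentelemetry.api.common.Attributes;
import io.opentelemetry.exporter.otlp.trace.OtlpGrpcSpanExporter;
import io.opentelemetry.sdk.OpenTelemetrySdk;
import io.opentelemetry.sdk.resources.Resource;
import io.opentelemetry.sdk.trace.SdkTracerProvider;
import io.opentelemetry.sdk.trace.export.BatchSpanProcessor;

public final class TelemetrySetup {
  public static OpenTelemetry init() {
    // Identify this service; resource attributes ride along with every span it emits.
    Resource resource = Resource.getDefault().merge(
        Resource.create(Attributes.of(AttributeKey.stringKey("service.name"), "hello-server")));

    // Exporter speaks OTLP, here to a local Collector (endpoint is an assumption).
    OtlpGrpcSpanExporter exporter = OtlpGrpcSpanExporter.builder()
        .setEndpoint("http://localhost:4317")
        .build();

    // The batch processor buffers finished spans and flushes them periodically.
    SdkTracerProvider tracerProvider = SdkTracerProvider.builder()
        .setResource(resource)
        .addSpanProcessor(BatchSpanProcessor.builder(exporter).build())
        .build();

    return OpenTelemetrySdk.builder()
        .setTracerProvider(tracerProvider)
        .buildAndRegisterGlobal();
  }
}
```

With the Java agent or a distro, all of this is done for you and driven by configuration rather than code.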
But it is nice to do this stuff in a Collector rather than in your application, because it takes the data formatting and data pipelining concerns and makes them a purely operational matter. You're not writing code and redeploying your application in order to make a change to your data pipeline; instead you're reconfiguring the Collector and redeploying that component. I think that's really useful: being able to adjust your telemetry pipeline and change your observability without redeploying your actual applications. Especially if you're in a situation where you're fighting a fire, or you're seeing load issues and want to change something, doing a redeploy in that moment is a really bad idea. So the Collector takes that weight off the application for a lot of the adjustments you might want to make to your telemetry deployment.

One thing I'll note is that OpenTelemetry doesn't come with any analysis tools. The end of the road is the telemetry pipeline, and that data can then be piped into a variety of backends. Jaeger, for example, is an open source backend that accepts OpenTelemetry data natively, and Lightstep is a commercial one that does the same. The reason we don't ship analysis tools baked into OpenTelemetry is, again, that we see this as a standardization process. We want to define a standard language for describing distributed systems; that's OTLP, an actual structured language for describing what distributed systems are doing. We think you can standardize how systems describe themselves, but actually analyzing that data, or doing something useful with it, isn't really something you would standardize; that's the area where we want to see a lot of innovation and competition. OpenTelemetry actually enables that, because if you try to build some kind of cool analysis tool for observability, especially distributed tracing, you're going to discover that a huge amount of the schlep is building the ecosystem of plugins and the whole telemetry system. There's a huge amount of code and effort in that part, and if it's already done for you and all you have to build is the analysis tool, that really lowers the barrier and makes it a lot easier to build even one-off analysis tools. So I'm really excited to see the world of observability accelerate once something like the OpenTelemetry standard is adopted everywhere.

As for how OpenTelemetry manages its own code ecosystem, since this is, as I said, a giant pile of code: we do that through a specification process. The heart of OpenTelemetry is the OpenTelemetry specification, which you can find on GitHub. It's a language-neutral document where we define how OpenTelemetry should work, what interfaces it should have, and so on, and that specification is then implemented in every language.

So that's a super basic overview of OpenTelemetry. Do we have any questions at this time? I'm moving a little fast, so we can definitely do a lot of Q&A here. Ah, Tony asks: does the Collector push to Prometheus, or can Prometheus scrape it? That's a good question. I believe the pull model is what Prometheus prefers to work with; it may actually work in both modes, but I do know you can run the Collector in a mode where it buffers the data and Prometheus comes and scrapes it.
I'm not a Prometheus expert, I come from the tracing side more than the metrics side, but I do believe that works. Cool, any other questions? Wow, we have a quiet crowd today; Java people are much quieter than Go people, apparently.

Well, if we don't have more questions, I do want to show people around the project. Oh, and Andrea asks about Prometheus too, so let's look in the specification in a moment. One thing we can do with our time right now is a quick tour of the OpenTelemetry project. If you go to github.com/open-telemetry you'll find where everything lives. If you're interested in getting involved in the OpenTelemetry community, like you're wondering how the project works or where to find all of the meetings and things of that nature, the community repo contains a lot of that. It also explains the governance structure of the project: we have a governance committee, which I'm on, which defines the rules and bottom-lines the project, and a technical committee which bottom-lines the specification work, and then every implementation has a set of maintainers, approvers, and triagers who work on it. So it's a pretty large community. We do have some mailing lists that we almost never use; everyone pretty much hangs out on Gitter. We also have a calendar, because we have lots and lots of meetings; you'll notice there are meetings almost every day of the week. We just got done with the OpenTelemetry specification meeting, and the agent and Collector meeting is tomorrow. If you want to get directly involved in the project, or if you have questions, jumping onto Gitter is a great way to get answers really quickly, posting issues in the GitHub repos will get you a quick response, and if you want to talk to someone live you can always hop onto one of these calls and ask a question. It's a very friendly and active community, and since the project is still in beta, that's really a great way to get your answers.

So, Andrea says: we are encountering some difficulties writing manual code to configure the Prometheus exporter with OpenTelemetry, because in Java there isn't much information. Yeah, this is true; that's the biggest problem with the project right now. "We don't understand how to expose a port on our service for Prometheus to scrape from; maybe we could send the information to a push gateway?" That is possibly true. Again, I'm not a Prometheus expert, and I will say the tracing part of the project is farther along than the metrics part. We're looking at tracing GA by the end of the year, but metrics probably won't be done until sometime in Q1, like January or February, so the metrics side isn't nearly as fully baked as the tracing side, and it's possible that's what you're running into. But if I go into the OpenTelemetry specification, one thing we can look at is the spec compliance matrix. If you're wondering whether or not something has been implemented, this lists all the different features of OpenTelemetry and whether they've been implemented on master, so you can see Java here and where Java is at; and you can see that for metrics we're not even tracking this yet, because we're really focused on getting tracing complete at this time.
But you can come in here and look at the metrics specification: if you go to metrics and look at the specification, you can learn a lot about how this should work, and there may be a Prometheus-specific document in here somewhere, but it might not be there yet. So I'd recommend popping into the Java Gitter room, or posting an issue saying you're having trouble getting this configured, and they should be able to help you out. We'll also have Carlos Alberto come on at 10 a.m.; he's one of the members of the technical committee and one of the Java maintainers, so he can also answer some more Java-specific questions.

Great, nothing much else in the queue, so maybe we can move a little faster. I'm going to see if I can grab Carlos a little early, so one second; if you do have more questions about how to get this set up, please post them right now. All right, I've asked Carlos to pop on, and since it doesn't seem like we have too many questions about the project overview, we can get into something more... oh, why is this slide here? Never mind, I still have more stuff to talk about.

So let's get into some core concepts of OpenTelemetry. One thing that's core to OpenTelemetry is the concept of a transaction. The OpenTelemetry model of a distributed system is transaction focused, meaning in particular transactions that are within the realm of human wait time. A classic example might be someone trying to upload a photo from an app. Say we have a client, and this client is trying to talk to a server to upload a photo with a caption. Of course we know it's not just one server: it's going to be talking to a reverse proxy, which then calls out to an authentication service to check that this is kosher, then downloads the image to local scratch disk and calls a local application. The application might take that image off the scratch disk and throw it into cloud storage, then contact a data service, which writes down where it put the image in SQL and caches it in Redis. I almost wouldn't even call this a distributed system; I feel like I've been looking at systems that do literally exactly this for 20 years. But you can see that even a basic system like this is already fairly distributed: you have so many different components, and as your system grows it becomes harder and harder to keep track of which logs are associated with what. Without the right kind of context, you can't easily find all the logs and events that occur within a single transaction and do some basic analysis on them, which is what you want.

So that's what I would call a service diagram view of a transaction. Another way to look at a transaction is more like a call graph. In this drawing, each colored line represents a service and the amount of time that was spent in that service. Here we have our client, where the most time was spent, because the client was waiting for the transaction to complete; the client then talks to the proxy, the proxy talks to our auth service, writes down the image, talks to our app, the app talks to a third-party service and our data service, and so on. One of the first things we might want to look at in this transaction is: where did the time go?
That's an interesting question to ask, because where the time went is not necessarily the longest span. By the way, we call each one of these transactional hops a span. For example, the client here is the longest span: from the beginning to the end of the transaction, the client took the longest amount of time, but the client was actually waiting for most of that time. So even though it's the longest span, that's not actually where you'd want to look in order to reduce the overall latency. Likewise, you can see that the amount of time spent in the data service is minimal, so if you went to optimize that data service, you wouldn't really be moving the needle on decreasing the latency of this transaction. In this particular case, uploading the image to scratch disk and then uploading it to cloud storage are where the bulk of the time went. That might convince you, for example, that it wouldn't be worth optimizing this particular transaction because you don't have control over those two aspects of writing the image, or it might encourage you to find a better, faster way to deal with image uploads. By the way, we call this the critical path: the time spent actively working. The critical path also implicitly shows you all the time spent waiting. These are places where you have a program sitting there waiting; if your program is multi-threaded that may be fine, but that waiting might itself be creating a blockage somewhere. So you can learn a lot just from looking at the critical path and where things are waiting.

The other thing you want to look at is errors. You want to have an error budget for every service and transaction, you want to be monitoring errors, and of course when things do catch on fire you want to immediately understand where to start hunting. Usually the first place to look is where the error actually occurred: did it originate from the auth server, from cloud storage, from your SQL storage? Being able to quickly identify which service to look at is one of the great things you get out of distributed tracing.

And last but not least, you want to have contextualized events. These events are basically logs; that's what I used to do traditionally, just log all over the place. The problem with those logs is that they're not necessarily contextualized, or not easily contextualized. I might have a certain amount of local context I can attach to a log, like a request ID, but there's a lot of context that isn't available at the call site for that log that would actually be really useful information to index that log with. That's what correlations are, and these correlations are really what makes something like OpenTelemetry and distributed tracing, to my mind, fundamentally more useful than the traditional kinds of logging that you might do.
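To make "contextualized events" concrete, here's a small sketch using the OpenTelemetry Java API. The event is recorded on whatever span is currently active, so it automatically inherits the trace and span IDs; the attribute names here are made up for illustration.

```java
import io.opentelemetry.api.common.AttributeKey;
import io.opentelemetry.api.common.Attributes;
import io.opentelemetry.api.trace.Span;

public final class UploadHandler {
  void afterUpload(long sizeBytes) {
    // Instead of a free-floating log line, record the event on whatever span is
    // currently active; it carries the trace ID and span ID automatically.
    Span.current().addEvent("image.uploaded",
        Attributes.of(
            AttributeKey.stringKey("storage.bucket"), "photos",      // illustrative attribute
            AttributeKey.longKey("image.size_bytes"), sizeBytes));   // illustrative attribute
  }
}
```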
So you basically have this traditional problem: you have all the logs, and you're trying to reconstruct a particular transaction, which essentially means filtering out all the logs, on all the servers that transaction touched, that didn't have anything to do with that transaction. If you've ever gone through that process, it can be pretty labor intensive, depending on what kind of identifiers you have available. With something like tracing, it's really, really easy to do.

But at the heart of tracing is something called context propagation. Context propagation is what actually makes this easy, but it's also the thing that makes setting up something like distributed tracing a little more difficult than setting up standalone logging or metrics. What context propagation actually means is this: say you have two services participating in a transaction, so you have some spans, then a network request, then some more spans within the next process. In process, you need a way of propagating all of this context as you follow your control flow, and we call that a context object. MDC is an example of one of these that exists in Java; there's a variety of them, in fact too many, so we actually had to invent our own context object. We kind of wish this could get standardized in Java, and we do hope it becomes standardized at some point. So a context object gets you context within your program, but in a distributed system you have to actually propagate that context to the next service, and we call that context propagation. What that means is that at the call site, when you're making, say, an HTTP request, you take the context, serialize it, and inject it into a set of HTTP headers; then on the other end you extract that context from those headers, deserialize it into a new context object, and carry on. This is the core underlying principle that all of OpenTelemetry is built on top of: you have a context object (which tends to be kind of hidden in Java, but it is there and you can access it directly), and you configure what are called propagators, which define what should be injected when you make something like an HTTP request and what your system should look for on the other end in order to extract.

There are a couple of different header formats. B3 is the de facto standard that's out there today; that comes from Zipkin. But we've also been working on standardizing these tracing headers and actually adding them to the HTTP spec: a bunch of us have been part of a W3C working group to standardize this, and that header format is called Trace Context. Trace Context consists of two header fields. The first is traceparent, which contains your trace ID and your span IDs; these are the two core identifiers I was talking about earlier. The trace ID is attached to every span and event in the whole transaction, and a span ID is attached to every event in a particular operation within that transaction. Those two identifiers on their own eliminate a huge amount of work when you're trying to collect all this data together, and that's what enables tracing-enabled analysis tools to automate a lot of that analysis for you.
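Here's a rough sketch of what injection looks like through the OpenTelemetry Java API, using Java's built-in HTTP client builder as the carrier. In practice the instrumentation plugins do this for you on both ends; the URL is illustrative.

```java
import io.opentelemetry.api.GlobalOpenTelemetry;
import io.opentelemetry.context.Context;
import io.opentelemetry.context.propagation.TextMapSetter;

import java.net.URI;
import java.net.http.HttpRequest;

public final class PropagationExample {
  // The setter tells the propagator how to write a header into our carrier type.
  private static final TextMapSetter<HttpRequest.Builder> SETTER =
      (builder, key, value) -> builder.header(key, value);

  public static HttpRequest outgoingRequest(String url) {
    HttpRequest.Builder builder = HttpRequest.newBuilder(URI.create(url));

    // Serializes the active context into headers (e.g. a W3C `traceparent` header),
    // so the next service can extract it and continue the same trace.
    GlobalOpenTelemetry.getPropagators()
        .getTextMapPropagator()
        .inject(Context.current(), builder, SETTER);

    return builder.build();
  }
}
```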
In addition to the Trace Context headers, which are tracing specific, there's also a new set of headers we're designing called baggage. Baggage is literally generic key-value pairs. These aren't tracing specific, but the idea is that once you've got context propagation flowing through your whole system, you're going to discover other uses for it. For example, you may want to propagate something like a project ID: maybe you have an identifier that's available in one service early on, but it would be expensive to go grab that identifier from every service down the line. Rather than doing that, you could simply add that project ID as baggage, and then it would propagate to all of your downstream services and you could have access to it as an index. You could also potentially use baggage for non-observability things, such as feature flagging or A/B testing in a distributed system: knowing whether you're on the A side or the B side is information you could flow through with baggage. You could potentially use it for authentication. I would hold off on anything that's system critical and dependent on baggage at this point, just because it's all new; observability is important, but it doesn't break your system if observability has some kind of hiccup. In the long run, though, I expect a lot of interesting cross-cutting concerns to find baggage a really useful tool.
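Here's a minimal baggage sketch using the OpenTelemetry Java API. The "project.id" key is just an illustrative name, and note that baggage entries don't become span attributes automatically; you copy them onto spans (or use a processor) if you want to index on them.

```java
import io.opentelemetry.api.baggage.Baggage;
import io.opentelemetry.api.trace.Span;
import io.opentelemetry.context.Scope;

public final class BaggageExample {
  public static void handleUpload(String projectId) {
    // Attach the project ID to the current context; configured baggage propagators
    // carry it across service boundaries on outgoing requests.
    Baggage baggage = Baggage.current().toBuilder()
        .put("project.id", projectId)
        .build();

    try (Scope ignored = baggage.makeCurrent()) {
      doWork();
    }
  }

  static void doWork() {
    // Any downstream service (or code later in this request) can read the value back...
    String projectId = Baggage.current().getEntryValue("project.id");
    // ...and, for example, copy it onto the active span so it becomes searchable.
    if (projectId != null) {
      Span.current().setAttribute("project.id", projectId);
    }
  }
}
```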
And then last but not least, let's talk about SDK setup. If you're going to actually sit down and start attaching OpenTelemetry to your services, there are some things you want to check first. You want to verify, first and foremost, that the frameworks and libraries you're using actually have instrumentation plugins. Getting back to this diagram: if the HTTP client and the HTTP server aren't instrumented with OpenTelemetry, then inject and extract won't work. Most of the heavy lifting is done through instrumentation plugins for the open source or shared code you're using in your application. Ideally you don't actually have to add any application-level tracing or logging in order to get a complete trace; enough information and visibility into your system should be available just through installing these plugins. But you do want to make sure that OpenTelemetry actually supports the frameworks and libraries you're using; your application framework and your HTTP and database clients are the most important ones. You also want to make sure context propagation is working once you get this set up. It's one of the areas where things can break most easily, and it can be a little mysterious, because it's potentially normal to get a request that doesn't have any context in it, so this can be a silent error. The way you avoid it is to make sure you're using the same context propagation headers everywhere, and that's why we really want to standardize these headers: if they're not standardized, you run the risk of service A not being able to talk to service B because one is expecting Trace Context and the other one is sending B3.

Okay, now we're going to get to the code walkthrough, but before we do, are there any other questions on what we've just been talking about? I see Stephen's got one: "We have a lot of investment in homegrown metrics pipelines. We 100% want tracing, but do we have to be concerned about the performance impact of OTel also gathering metrics, or is there a way to just disable that?" Yes: OpenTelemetry is totally a la carte, so you can use just the tracing portion and disable the metrics portion. It's also very light on the metrics front. I actually believe that a lot of the metrics you want can be obtained by aggregating your tracing data, if you have an analysis tool that lets you build aggregates, histograms, and metrics dashboards by counting trace data, rather than going into your code and putting a bunch of instrumentation points in there. I think that's a really useful way to move a lot of your metrics work from typing code to setup you do on your backend. There is of course also a metrics API, which does a lot of useful things and is very efficient at pre-calculating the kinds of things you might need for aggregation work, but you can turn it off, you don't need to use it, and you could also potentially write a metrics plugin that simply pipes those calls into your custom metrics pipeline.

Okay, Daniel's got a couple of questions in the Q&A, so let's get through these. Daniel asks: what is the recommended migration path from OpenTracing to OpenTelemetry? We're thinking of starting with the shims, but can we expect the shims to be compatible long term? Yes, that is the correct migration path. If you're already instrumented with OpenTracing, you've got the API layer done. Going back up to the earlier diagram: OpenTracing is just an API layer, and it was designed to talk to a variety of implementations, so you take the OpenTracing API and build a client implementation of it. There are lots of implementations of OpenTracing, and OpenTelemetry is just one more; this is actually precisely how OpenTracing was designed to work. When you install the shim, all it does is allow your OpenTracing API calls to be collected by the OpenTelemetry SDK. So step one, if you're looking to migrate, is to migrate your client from whatever client you're currently using to the OpenTelemetry SDK; you shouldn't have to change anything else. The one gotcha is that you'll potentially want to disable any auto-instrumentation the SDK is doing, because you don't want to double-install instrumentation; that might lead to a lot of weirdness. OpenTracing and OpenTelemetry API calls can be mixed, though: spans you create through OpenTracing will be available through the OpenTelemetry API, so you can progressively migrate your system over if you eventually want to get rid of the OpenTracing API calls. But there's no need to rush: we're looking at very long-term backwards compatibility, a minimum of two years, and in general the project is very, very backwards-compatibility focused. It's in beta right now, but once we 1.0 these APIs we're going to adhere to a strict backwards compatibility guarantee. There might be breaking changes to some of the internal APIs within the SDK, but the plan for the separate API package is to never break it: we might add things to it, but we never want to break everyone's instrumentation, because that's the stuff that gets out into the wild with millions of lines written against it. That's the other reason why the API is separate from the SDK.
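For reference, wiring up the OpenTracing shim looks roughly like this. It's a sketch: it assumes the opentelemetry-opentracing-shim artifact and the opentracing-util GlobalTracer, and the exact factory method names have shifted between releases, so check the shim's README for your version.

```java
import io.opentelemetry.api.GlobalOpenTelemetry;
import io.opentelemetry.opentracingshim.OpenTracingShim;
import io.opentracing.util.GlobalTracer;

public final class OpenTracingBridge {
  public static void install() {
    // Wrap the globally registered OpenTelemetry SDK in an OpenTracing Tracer...
    io.opentracing.Tracer shim = OpenTracingShim.createTracerShim(GlobalOpenTelemetry.get());
    // ...and hand it to any code that still looks tracers up via OpenTracing's GlobalTracer.
    GlobalTracer.registerIfAbsent(shim);
  }
}
```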
Okay, thanks for your questions, let's move through some of these. Hopefully that answered your migration question, Daniel; please follow up if it didn't.

Chris Vogel asks: do you see any interaction between OpenTelemetry metrics and the Micrometer Java library? We communicate with the Micrometer crew regularly, and Micrometer is a very similar project to OpenTracing in a lot of ways: it's mostly an API layer that lets you mix and match things, so it should be compatible with OpenTelemetry as well. I think we did look at potentially using the Micrometer API, but because OpenTelemetry is a cross-language project and Micrometer is Java specific, we didn't end up doing that. But they should be compatible, and if you're happy with Micrometer and you just want to add tracing, that's totally fine; there's no need to use OpenTelemetry metrics if you want to use another metrics system. Likewise, if you wanted to use OpenTelemetry metrics with another tracing system, that might be feasible, but I think it would be a bit trickier.

Great, and we've got Carlos on the call as well. Hi Carlos. Carlos is one of the members of the technical committee and one of the Java maintainers, so he can answer some of your questions. Actually, we've been getting some Prometheus questions, Carlos; do you happen to know the state of the Prometheus exporter in Java? Is that working right now? "It's working. I'm not very familiar with Prometheus, but we have had the exporter for a few months now, so it should work overall." Do you happen to know if it's a push or pull model? "I'm not aware of that." Yeah, you have people from the tracing side on this call, apologies. There is also a strong metrics community, and if you're interested in metrics there's a metrics specification call that happens once a week; that's a place you can hop on to get access to all the metrics experts. Okay, Carlos, we're just going through some Q&A now and then we'll be moving on to the code walkthrough.

Tony asks: how do you differentiate between network latency and service latency? You can see the difference basically as the gap in time between the client span and the server span. That's a rough way of getting network latency out of the system right now; we don't currently have anything that specifically hooks in and measures just the network portion, but you can deduce it by looking at the difference in time spent on the client versus the server: the difference represents the network latency.

Daniel asks: can the OpenTelemetry tracer propagate context with multiple propagators while injecting and extracting? Yes, it can. OpenTelemetry comes with what we used to call a stack propagator and now, I believe, call a composite propagator. You can set it up to look for multiple types of incoming context: it can scan in order, say first looking for Trace Context headers and, if it doesn't find them, looking for B3 headers. You can also configure it, when injecting, to inject all the header types, so rather than just one type you can inject both Trace Context and B3, and that can be a useful way to transition your system.
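With the agent, the workshop does this through configuration (the propagators setting), but programmatically a composite propagator looks roughly like this. It assumes the opentelemetry-extension-trace-propagators artifact for B3, and the result would be passed to OpenTelemetrySdk.builder().setPropagators(...).

```java
import io.opentelemetry.api.trace.propagation.W3CTraceContextPropagator;
import io.opentelemetry.context.propagation.ContextPropagators;
import io.opentelemetry.context.propagation.TextMapPropagator;
import io.opentelemetry.extension.trace.propagation.B3Propagator;

public final class Propagation {
  // Injects both header formats on outgoing requests and can extract either on the way in,
  // which is what makes a rolling migration between header types possible.
  public static ContextPropagators both() {
    return ContextPropagators.create(
        TextMapPropagator.composite(
            W3CTraceContextPropagator.getInstance(),
            B3Propagator.injectingMultiHeaders()));
  }
}
```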
So if your system is currently running B3 and you want a zero-downtime transition over to Trace Context, you configure your composite propagator to propagate both, get that deployed into all of your services so that both header types are being propagated, and then remove B3 from the list. There's obviously extra overhead when you're propagating both kinds of headers, but it's fairly negligible, and that's a great way to do a rolling transition from one header type to another.

Tony asks: what would a trace look like where an async call to some service didn't complete, say the service that was called died before writing to the Collector? If the service dies before a span completes or is exported, then you'll lose that data. The way OpenTelemetry exporters currently work is a buffer-based model: they buffer spans locally, and once you've hit your max buffer size or your timeout, those spans are flushed. So one gotcha is that if your service suddenly gets killed before those spans get exported, you won't get that data; that is a limitation right now. We're looking at adding shutdown hooks and similar things to Java in particular (some other languages have this already) to help ensure that when your system shuts down cleanly, all of that information gets flushed. It's a particular gotcha in a serverless environment where you may have systems spinning up and down really quickly. If you do encounter something like that and the volume is low enough, you can switch from the batch exporter to the single-span exporter, which flushes every span as soon as it ends. I'd also be interested in looking at a more event-based model in the future, so that it might be possible to collect partial spans. There's a protocol we've experimented with, but probably won't implement until we're well on the other side of GA, that involves sending out individual span events (start span, add event, add attribute) and spewing that out over UDP to a sidecar. We've tested some of these streaming protocols and think they work really well, but one thing that doesn't play well with streaming is wanting something in your process that collects the whole span, looks at all the data, and, say, manipulates it before it goes out the door. That's part of the trade-off: you can't really have both. But if you move a lot of that processing out to the Collector and make the SDK leaner, just something that streams out data, then you could potentially have something like that. That kind of thing is what we want to look into after GA; right now we're sticking with the core model that OpenTracing and OpenCensus used, the buffering model that was used at Google, for the first version.
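Going back to the batch versus single-span exporter point for a second: in the SDK those correspond to the batch and simple span processors. Here's a rough sketch, with an assumed flush interval:

```java
import io.opentelemetry.sdk.trace.SdkTracerProvider;
import io.opentelemetry.sdk.trace.export.BatchSpanProcessor;
import io.opentelemetry.sdk.trace.export.SimpleSpanProcessor;
import io.opentelemetry.sdk.trace.export.SpanExporter;

import java.time.Duration;

public final class Processors {
  public static SdkTracerProvider forLongRunningService(SpanExporter exporter) {
    // Default choice: buffer finished spans and flush them in batches.
    return SdkTracerProvider.builder()
        .addSpanProcessor(BatchSpanProcessor.builder(exporter)
            .setScheduleDelay(Duration.ofSeconds(5)) // flush interval; tune for your backend
            .build())
        .build();
  }

  public static SdkTracerProvider forShortLivedProcess(SpanExporter exporter) {
    // Low-volume or serverless-style option: export every span as soon as it ends,
    // at the cost of more export calls.
    return SdkTracerProvider.builder()
        .addSpanProcessor(SimpleSpanProcessor.create(exporter))
        .build();
  }
}
```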
Cool. Okay, I think we've gotten through the Q&A for now, so let's hop into an actual walkthrough. I'm going to do the walkthrough, and it's great to have Carlos on the call in case I'm doing something silly in Java. You can follow along yourself: go to the links I sent out earlier and go to, not the Java Launcher (you do want that too), but the otel-java-basics repo; that's the walkthrough code we're going to be using. If you haven't created an account, please do that now, and go ahead and download the repo if you want to follow along; otherwise you can just watch me do it, but I find it's more fun to play with it yourself.

Once you've downloaded the repo, run make, and that will download all the components you need. If you're wondering what it's doing, just check out the Makefile; it's pretty simple. Basically, you're going to download the Lightstep Java agent jar, which is Lightstep's wrapper around the core OpenTelemetry agent. I should mention that the primary and recommended way to work with OpenTelemetry right now is using the Java agent; it is possible to do it all manually if you like, but we really strongly recommend the agent approach. To actually connect to Lightstep you need an access token, which you can set as an environment variable. To get it, go to your Lightstep account (I'm going to share my whole screen so I don't have to keep popping back and forth), click the little settings wheel in the corner, and you'll see an access token; grab that and add it to your environment. You can also set it as a system property, but we'll just use environment variables for now. I'm adding it in this other window too, because I want to run both a server and a client.

Once you've added your access token, you should be able to run this and see a bunch of spans. Run the server first with make run server. You'll see a lot of building going on, and you'll see the command line adds the agent and sets a couple of different configurations. One thing we've done is name this service hello server; it's important to always add a service name so you can identify the service that's issuing the spans. For propagators, we've configured it to propagate both Trace Context and B3, as per the earlier discussion. We're also adding resource attributes. Resource attributes are a way to index your services: it's not just about adding indices to your traces, you also want indices on the services those traces flow through, because that's another way to correlate data. OpenTelemetry auto-detects a number of useful resources, such as hostname and things of that nature, but you can also add your own attributes; currently in Java that's supported through a system property, so you can pass them in as key-value pairs, which I'd say is a useful approach. And then here we've reduced the delay on flushing the data so we're not waiting a long time for data to show up in our backend; normally, in production, you'd want this to be a much higher number.

So that's it, and that gets our service started. Then on the client we say make run client (actually, I need to shrink this a little or I can't see your questions). What the client does in this walkthrough is make five requests to the service, and the service just prints out hello world.
All five of those requests are themselves connected into a single transaction by a parent span, so what we'll see on the backend is a trace containing five requests to this server. We're getting a timeout here, possibly just because I'm running it in these terminals, so if I run it out here instead: make run server to get the server going, then make run client... okay, that exited cleanly, so we successfully connected to our server from our client.

So if I go into Lightstep and look at the Explorer, I'll see these spans coming in, and if I click on any of them we'll see the overall trace. Here's an example of a trace view, and you can see this black line represents the critical path. You'll also notice there are errors going on: we added an exception to one of these, and I'll show how that works, but if you do capture an exception and count it as an error, it marks the entire span as an error, and you can use that for error budgeting and calculating error rates and things of that nature. Looking at the actual data that comes in, we can see a bunch of interesting information. The instrumentation name tells you where this data actually came from, so you can see which package created this particular span. You can see some additional pieces of interesting information, and if you look under details you can see a lot about the service this came from: the platform, the client library version, the hostname, and so on. And if you look at a span that represents an HTTP client, you'll notice it has a bunch of HTTP and networking information attached to it. All of these attributes are standardized in OpenTelemetry; one of the things we want to do is make sure an HTTP request looks like an HTTP request regardless of where it came from. If you go look at the specification you'll see that we have what we call semantic conventions, which basically describe how you should observe these common operations. If you're writing your own instrumentation, I strongly recommend giving these conventions a once-over and using them whenever you're describing something common like an HTTP request or a database request.
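As a sketch of what following the semantic conventions looks like for a database call (the attribute names here follow the conventions roughly as they stood at the time and have since evolved, so check the current semantic conventions for the exact keys):

```java
import io.opentelemetry.api.GlobalOpenTelemetry;
import io.opentelemetry.api.trace.Span;
import io.opentelemetry.api.trace.SpanKind;
import io.opentelemetry.api.trace.Tracer;

public final class AccountRepository {
  private static final Tracer tracer =
      GlobalOpenTelemetry.getTracer("com.example.AccountRepository"); // hypothetical class

  void findAccount(long id) {
    // Describe the database call with standard attribute names, so every backend
    // recognizes it as a database client span rather than a one-off custom shape.
    Span span = tracer.spanBuilder("SELECT accounts")
        .setSpanKind(SpanKind.CLIENT)
        .startSpan();
    try {
      span.setAttribute("db.system", "postgresql");
      span.setAttribute("db.statement", "SELECT * FROM accounts WHERE id = ?");
      // ... run the query ...
    } finally {
      span.end();
    }
  }
}
```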
Oh, and we've got a question from Daniel: "Last year the MicroProfile spec specified that operation names on server spans should be of the form 'GET <path>'. As far as I can tell, the OpenTelemetry spec does not mention a convention on names for server spans; do we want to formalize this?" Yes, we do have conventions, and you'll notice these spans get named something specific. For HTTP spans in general we have some specific recommendations about what the name should be, and the ideal span name is a route. One issue with naming spans is that you don't want something with too high cardinality: something like /api/users/123, a raw path like that, has cardinality that's too high, because you want to bundle up all requests to a particular endpoint rather than splitting them out by individual variables. On the flip side, if you don't have routes, you go with something a little simpler, which is just the HTTP method name, and that's a little too low cardinality, because you don't necessarily want to look at all the HTTP GETs across your whole system, though you can scope it down a bit by looking at, say, GETs coming from a particular client. Ideally, though, you're getting routing information in there: if you're using a web framework that has a concept of templated paths or routes, that's the ideal name for an HTTP span. We do have general guidelines for span names that get into a lot of these details; I'll post that in the chat, and it's a useful read-through if you're thinking about making your own spans and operations. Daniel follows up: "I understand you're supposed to be using operation names for statistically significant groups of spans; that's perhaps why the method is useful in the server span. POST is very different from GET." Yes, this is true, POST is different from GET, but the method alone is a little too generic, and something like "GET /account/42" is too specific. Something like "GET /account", a handler name, is good, or a templated path is also good; either of those is the level of granularity you want in a span name.

Okay, popping into some actual code, let's have a look. This walkthrough code is a very simple client-server example. Here we're setting up a little Jetty server, and since OpenTelemetry comes with servlet instrumentation out of the box, it's automatically instrumented. We've added some additional instrumentation in here just to explain it and show it off, but I do want to stress that you don't actually have to write any code; you'll get most of the information you want directly from the instrumentation that's automatically installed by the agent. But if you do want to add additional data from your application, here's how you do it. The first thing is to get a handle to a tracer: if you're going to be doing tracing-related operations in your class, you get a tracer and name it with the fully qualified name of your class. That's what lets your backend know where the instrumentation came from; here, for example, you can see that this particular client span came from the auto-installed OpenTelemetry instrumentation for OkHttp, and if you're going to add your own, you want to qualify it in a similar manner. That gives you a handle on a tracer, which you can use to build your own spans. If there's already a span running in the current context, which there is in this case because we're in a handler for an instrumented service, you can access the currently running span through a utility call, getCurrentSpan. For example, since Jetty doesn't add a route name automatically, I'm manually setting the route here: I'm not changing the span name, but I am adding the route semantic convention as an attribute so you can index on it. That's how you access and enrich the current span.
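Here's roughly what that looks like with the stable OpenTelemetry Java API (the workshop's beta-era code uses a getCurrentSpan() helper; in current releases it's Span.current()). The class and route names are illustrative.

```java
import io.opentelemetry.api.GlobalOpenTelemetry;
import io.opentelemetry.api.trace.Span;
import io.opentelemetry.api.trace.Tracer;

public final class HelloHandler {
  // Name the tracer after the fully qualified class, so the backend can show
  // which piece of instrumentation produced each span.
  private static final Tracer tracer =
      GlobalOpenTelemetry.getTracer("com.example.HelloHandler");

  void handle() {
    // The servlet/Jetty instrumentation has already started a server span;
    // grab it and enrich it rather than creating a new one.
    Span serverSpan = Span.current();
    serverSpan.setAttribute("http.route", "/hello");
  }
}
```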
If you want to create a child span, you grab your tracer, build the span with a span name you've thought about (hopefully according to those guidelines), and then start it; that gives you a running child span. If you want to measure something with it, you can just grab the span, do your work, and then end it: that's start and end. But this doesn't actually change which span is active; you've just made a span locally. Generally speaking, when you create a span you then want to make it the active span in your current context, so that it follows the flow of your application. To do that, you create a closure: by activating the child span inside this try block, getCurrentSpan will now return the new child span. Just to be clear: out here, getCurrentSpan is the span created by our Jetty instrumentation, and inside this try block, getCurrentSpan returns the new child span. That's how you set up the context side of context propagation.

Oh, and by the way, spans have a kind of chaining API, so it's easy to grab the current span and perform a set of operations on it. We added an event to a child span here, but we could also add an event directly like this, which is a pretty concise, convenient way to do quick logging and quick attribute setting. It's also easy enough to wrap this stuff up in your own shortcuts, which is a thing I recommend: the OpenTelemetry API is a little low level, and you'll usually end up with patterns of things you're doing in your code, so it's easy enough to create a little wrapper to make that easier, or at least take up less space on your screen.

There are some other special things you can do. One is recording exceptions, which is a special kind of event: if you're doing regular structured logging you use addEvent, but if you have an exception you want to use recordException, which will apply the right semantic conventions and make sure it looks uniform on your backend. Notably, recording an exception won't automatically count the span as an error, because not every exception is an error; in order for a span to count as an error, you need to set its status to error. These are two separate things: it's possible to have errors where there was no exception present, and it's possible to have an exception that doesn't count as an error, so you do need to do both. Another pretty common approach, rather than creating child spans, is to grab the current span and set it to error: if you're in your HTTP handler and you hit an error path, you can just set the status code to error, and that's what will cause it to look like an error in most backends, like here for example.
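Putting those pieces together, here's a sketch of the pattern with the stable Java API: a child span made current for a block, an event, exception recording, and an explicit error status. The names are illustrative.

```java
import io.opentelemetry.api.GlobalOpenTelemetry;
import io.opentelemetry.api.trace.Span;
import io.opentelemetry.api.trace.StatusCode;
import io.opentelemetry.api.trace.Tracer;
import io.opentelemetry.context.Scope;

public final class ImageService {
  private static final Tracer tracer =
      GlobalOpenTelemetry.getTracer("com.example.ImageService");

  void resizeAndStore(byte[] image) {
    // Child span measuring just this sub-operation.
    Span span = tracer.spanBuilder("resize-and-store").startSpan();
    // makeCurrent() activates the span for this block, so Span.current()
    // (and any spans created inside) pick it up as the parent.
    try (Scope ignored = span.makeCurrent()) {
      span.addEvent("resize.started");
      store(image);
    } catch (RuntimeException e) {
      // Recording the exception and marking the span as an error are separate steps.
      span.recordException(e);
      span.setStatus(StatusCode.ERROR, "failed to store image");
      throw e;
    } finally {
      span.end();
    }
  }

  void store(byte[] image) { /* ... */ }
}
```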
Going over to some questions now: Daniel asked about operation names, which I already answered. Tony asks, "How does tracer.spanBuilder work with async code? How will it determine what the correct span to use is?" Carlos, you might be able to answer this one better than I can.

Carlos: Yeah, so the question is how the current context works with asynchronous code. Usually in that case you rely on instrumentation that automatically activates and deactivates the span. Behind the scenes, if you have something like a thread scheduler or a thread pool, every time a task is handed to a thread the span is activated, or set as current, and before the work moves to the next thread it's deactivated. This can be done manually, but it's a lot of work; hopefully auto-instrumentation, or instrumentation in general, takes care of it, because essentially every time you change threads you need to do this activation and then deactivate the span.

Ted: And there's a general point there: ideally you want to centralize this work. You don't want every async call or every handler to be setting this up manually; if it's remotely repetitive, pull it out of your application code, centralize it somewhere, or turn it into a wrapper, just to keep your code clean and regular, especially given that this has to get deployed across all the different services and service teams in your system. The less you have to do by repetitively writing application code, the better, and honestly that goes for logging and everything else too, but I won't get into that.

Carlos: And to be clear, a lot of users are afraid of how this will work with asynchronous frameworks. It's actually very simple, but it's very repetitive, so use instrumentation if possible, and if we don't have instrumentation for your framework you can always ask us to write one: just tell us what framework you're using, say Scala and Akka, and we'll try to get some instrumentation going.

Ted: We also have some conveniences. You'll notice that creating a scope, a try block, and a span is a bit of boilerplate, and if you've already wrapped the work inside a method, you can use the WithSpan annotation. WithSpan automatically wraps the method in essentially that same try block with a child span, so it's a cleaner way to add more spans to your system. That said, I want to recommend that you don't make lots and lots of child spans. We can go into the details later, but ideally you want fewer spans with more correlations and attributes on them, rather than lots of small spans. You should only create child spans when you're specifically trying to measure the latency of a particular operation. Sometimes people think span equals function, that every function should be wrapped in a span, but that's serious overkill.

Okay, so that's the basics of span creation, and you'll see this pattern a lot. But most of the time, as I said, you won't need to create your own spans; you can just access the current span, and if you want to be able to correlate it, you add attributes to it. Here we're adding project.id 456, which will let you find all the spans carrying that attribute in your backend.
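As a sketch of those conveniences, here is roughly what the annotation and attribute-based correlation look like, plus one way to keep context flowing across a thread pool for hand-rolled async code. The class, method, and executor are hypothetical, and the @WithSpan annotation's package has moved between instrumentation releases, so treat the import as approximate for the version you are running.

```java
import io.opentelemetry.api.trace.Span;
import io.opentelemetry.context.Context;
import io.opentelemetry.instrumentation.annotations.WithSpan;

import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

class ProjectHandler {

  // @WithSpan wraps the method in a child span plus the scope/try boilerplate.
  // It is honored by the Java agent when annotation support is enabled.
  @WithSpan("render-project")
  void renderProject(long projectId) {
    // Correlate rather than create more spans: stamp the id on the current
    // span so you can query for it in the backend ("project.id = 456").
    Span.current().setAttribute("project.id", projectId);
    // ... do the work ...
  }

  // For hand-rolled async code, wrap the executor so the current context
  // (and therefore the current span) follows tasks across threads.
  private final ExecutorService pool =
      Context.taskWrapping(Executors.newFixedThreadPool(4));

  void renderAsync(long projectId) {
    pool.submit(() -> renderProject(projectId));
  }
}
```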
So, for example, if I query for project.id 456 in the backend, I can see all the spans in my system that had 456 on them. This data might have been flushed out already, so let me create some more: run the server, run the client, and then wait a moment for the data to flush. There it goes. Now I can pop in here, search for anything with project.id 456, and those spans show up. From one of them I can look at the whole trace, then ask, "what about everything associated with this hello route?", look at that in aggregate, and then dive back into an individual span. Going back and forth like that, looking at a trace, noticing a correlation you might want to look across, checking that correlation in aggregate, seeing something interesting pop up, and diving back into a span, is a super basic, common workflow for investigating your data. You can obviously do much more complicated things, but when I'm trying to diagnose a system, this is how I start looking around. Different systems will build fancier tools that automatically surface correlations, so hopefully there will be more and more of that, but this basic pattern of pivoting on your correlations to move around through your data is great. I find it genuinely impressive, because in the past, doing lots of querying and filtering to move around like that was laborious, and with distributed tracing and OpenTelemetry it's a lot more straightforward.

And that's basically it. Really, all you need to know to use OpenTelemetry is the instructions for installing the agent, plus this super basic set of operations: grabbing the current span whenever you want to add more correlations (so you can do the workflow I just showed you), creating child spans and setting them as active, recording exceptions, setting the status to error, and adding events, a.k.a. logging. There's lots and lots more to OpenTelemetry, but as an application developer, that's your bread and butter; that's all you should really be interacting with on a day-to-day basis when you're dealing with tracing. And of course there are also metrics, but we're not going into that today because it's not fully baked yet.

That's the basic walkthrough. Do we have any questions about it at this time? After that we're going to go on break for a bit, and when we come back we're going to talk about advice, best practices, and how to roll this stuff out in your organization. Cool, well, it's 10:30 now, so we're going to go on break for 30 minutes and come back at 11,
and we'll be diving into more of these details then, around best practices. If you're around over the break, I really encourage you to play with the walkthrough code: try modifying the API calls in there and see how that changes your data, try adding a new semantic convention, things of that nature. If you do run into a hiccup, if for some reason you can't get it connected, just post a note in the chat and we'll see if we can sort you out. Otherwise I'll see you at 11. Thank you.

Hello everyone, welcome back from break. Hopefully people got a chance to play around with some of that code; if you have any questions or comments, please post them in the Q&A or the chat. Okay, we're coming into the last part of the workshop, where we move back out to talk about some of the softer issues: best practices, rolling out OpenTelemetry, things of that nature.

We've already talked about this a bit, but one best practice is around naming spans and how many spans you need in your system. Is a span a function? Is a span a library? Where does it live? There are a few scopes you can see when you're looking at a trace: transactions are the largest scope, then your process, then the transitions between code bases, so every library your code winds through, and then of course functions down at the bottom. Spans represent operations, which is usually a granularity somewhere larger than a function but maybe smaller than a library. Maybe a single span represents all the work a library did in a trace, the way all the work a database client did might be encapsulated in a single span, but there may be sub-operations within that database client you'd want to measure. As a rule of thumb I'd say one to three spans per library, and I strongly recommend aiming for fewer rather than more.

Part of the reason is that spans are a little more expensive: the overhead of measuring the timing is small, but it's still overhead, so measuring things you're not going to look at isn't a great idea. You can use events to measure things, and you can look at the time between events. When you're creating spans and operations, you're really thinking about operations you'd want to set up monitoring for, not proactively wrapping every function in a span. Also, as we saw, starting and finishing spans means creating a try block and a scope, which can get a little bulky; ideally you don't have too much of that going on in your application code. And there are practical aspects around actually being able to index and look up these spans in your tracing backend. In Lightstep, like in a lot of systems, you can query for spans based on the attributes on a particular span, but it's harder to run queries across spans, like "find traces where span A has the route hello and span B has project.id 456"; it's usually much easier for these systems if a single span carries both the project.id and the http.route. For that reason you want coarser-grained spans, with as much of the indexing on a single span as you can get.
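To illustrate that advice, here is a sketch of the "one coarse span, rich attributes, events for sub-steps" shape, as opposed to one span per function. The operation, attribute, and event names are made up for the example.

```java
import io.opentelemetry.api.trace.Span;
import io.opentelemetry.api.trace.Tracer;
import io.opentelemetry.context.Scope;

class CheckoutWorker {
  private final Tracer tracer;

  CheckoutWorker(Tracer tracer) {
    this.tracer = tracer;
  }

  void checkout(long projectId, String route) {
    // One coarse span for the whole operation, carrying everything you
    // might want to index on in the backend...
    Span span = tracer.spanBuilder("checkout").startSpan();
    try (Scope ignored = span.makeCurrent()) {
      span.setAttribute("project.id", projectId);
      span.setAttribute("http.route", route);

      // ...and events, rather than child spans, to mark sub-steps.
      // The time between events gives you sub-operation timings without
      // the cost and clutter of extra spans.
      span.addEvent("inventory reserved");
      span.addEvent("payment authorized");
      span.addEvent("receipt emailed");
    } finally {
      span.end();
    }
  }
}
```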
Keeping the indexing on one span will give you the best experience on the backend, though obviously you still sometimes want to create child spans, when you're specifically trying to measure the latency of a particular operation. So that's my main advice. Also, as much as possible, try to keep this observability code inside your framework code rather than dirtying up your application code. I know we have a tendency to pepper application code with logs, and you can still do that with span events, but in general try to find a way to get this out of your hair, so you're not copying and pasting observability code all over the place while you're trying to write application code. Does that make sense to everyone? I think it's pretty straightforward, but if people have questions around span size and operations we can go into that now; otherwise we'll move on, and please continue to post questions even after we do.

Let's move on to getting started. The thing about distributed tracing is that you have to install it in every service in the trace, so unless your organization is very small, or very centralized, you may need a rollout plan, especially if you have a lot of service teams and a very large system. One thing that's important when you go to roll out something like OpenTelemetry is to have a goal. If you look at the amount of code involved, the vast bulk of the code is instrumentation. Hopefully most of that instrumentation is already written for you, as I said, but installing and writing instrumentation is where a lot of the work can be, and you want to avoid boiling the ocean when you get started.

So here's what I recommend. If you're not in a situation where you have a centralized infrastructure team or a centralized code base that lets you set up and deploy OpenTelemetry to all of your services in one go, and you have to go from service team to service team and install it service by service, try to find some kind of high-value transaction. It might be something where you're already having issues, or where you're already concerned about the latency and would like to reduce it, or it could just be a really important transaction in your system, like checkout. Have some kind of goal that's high value and important, and do the work of instrumenting that transaction first: find all of the services that are part of it, get OpenTelemetry installed, make sure context propagation is flowing, and make sure you can get a proper end-to-end trace for at least one transaction that people in your organization understand and care about. If you do that, it becomes a lot easier to explain the value of distributed tracing, and you can potentially get there faster than by trying to add OpenTelemetry everywhere.

Once you get that one trace going, that alone should give you some insights, especially if you've never added a latency monitoring tool like distributed tracing before. If this is your first time adding something like that to your system, I find things will often just pop out. There's often some low-hanging fruit lying around: places where operations were serialized when they could easily have been parallelized,
things of that nature. So you want to start hunting for that sort of low-hanging fruit, looking for latency outliers or average latencies that seem strangely long, and then going back and adding detail to your traces as you go. Rather than trying to deeply instrument every service up front, I recommend just getting the framework and library instrumentation installed and getting that high-level trace going; it's more fruitful to have a complete high-level trace than a lot of detail in a potentially unconnected trace. That's honestly how I see these things go wrong. When I've seen it become difficult to roll out tracing within an organization, it usually follows a common pattern: someone gets really excited about tracing, maybe one of you on the call today, decides to roll it out, goes back to their company, and then goes around to each service team trying to encourage them to add it. Maybe some teams add it and some don't; some teams are more interested in observability than others. You end up with a scattershot approach where you don't really have complete traces, it's hard to show the value of distributed tracing, and the effort peters out before any kind of success is reached. That's the number one thing to avoid: an inconsistent rollout where you're not getting complete traces.

Tony asks whether there's any scope to use a service mesh, for example Istio, to avoid having to update all services. I believe you're asking: can you roll out distributed tracing by enabling it in your service mesh, and not have to go into your application servers and add it there? That would be awesome, but unfortunately the answer is no. The reason is that a service mesh can see ingress to and egress from your application, but it can't flow the context through the application, meaning there's no way to determine which ingress matched which egress, because within your application you're not flowing those trace IDs. The mesh would have to hold on to the trace IDs during ingress, then spot the corresponding egress and reattach them, and there's no realistic way to do that. Instrumenting the service mesh is a useful way to get a lot of data out of your system, but it's not a replacement for at least enabling the basic context propagation of OpenTelemetry within your application (there's a small sketch of that propagation step a little further down). Hopefully that explanation makes sense; if it doesn't, let me know, I'm happy to go into more detail, because it's a pretty common question.

Again, the pitfall I encounter when people try to kick off an observability effort within their organization, if it involves multiple teams and a lot of engineers, is skipping project management. I know as engineers we have a tendency to skimp on that, but getting those teams organized and running an organized effort really does make a difference.
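To make the service-mesh answer concrete: the piece only the application can do is carrying the context from ingress to egress. The agent's HTTP instrumentation does this for you, but here is roughly what it looks like by hand with the OpenTelemetry propagation API, using plain header maps as a stand-in for real request objects; this is a sketch, not the instrumentation's actual code.

```java
import io.opentelemetry.api.GlobalOpenTelemetry;
import io.opentelemetry.context.Context;
import io.opentelemetry.context.Scope;
import io.opentelemetry.context.propagation.TextMapGetter;
import io.opentelemetry.context.propagation.TextMapSetter;

import java.util.Map;

class PropagationExample {

  // Read incoming headers (e.g. the W3C traceparent) from a request map.
  private static final TextMapGetter<Map<String, String>> GETTER =
      new TextMapGetter<Map<String, String>>() {
        @Override
        public Iterable<String> keys(Map<String, String> carrier) {
          return carrier.keySet();
        }

        @Override
        public String get(Map<String, String> carrier, String key) {
          return carrier.get(key);
        }
      };

  // Write outgoing headers onto a request map.
  private static final TextMapSetter<Map<String, String>> SETTER = Map::put;

  void handle(Map<String, String> incomingHeaders, Map<String, String> outgoingHeaders) {
    // 1. Ingress: pull the trace context out of the incoming request...
    Context extracted = GlobalOpenTelemetry.getPropagators()
        .getTextMapPropagator()
        .extract(Context.current(), incomingHeaders, GETTER);

    // 2. ...carry it through the application code...
    try (Scope ignored = extracted.makeCurrent()) {
      // 3. Egress: ...and re-inject it into the outgoing request. This is the
      // step a mesh cannot do on the application's behalf.
      GlobalOpenTelemetry.getPropagators()
          .getTextMapPropagator()
          .inject(Context.current(), outgoingHeaders, SETTER);
    }
  }
}
```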
If you feel like you're not positioned to run that kind of organized effort yourself, maybe the first step is to find an engineering or project manager you can pitch, to help with the project management aspect of connecting the different teams and having a planned rollout. It also helps, given that there are always tons of questions about distributed tracing (it's a fairly new thing, and when people get started it can be a little confusing, especially the context propagation part), to centralize all of those resources: maintain some internal documentation that's specific to your organization; as much as possible, make the instrumentation live in shared libraries so people aren't manually instrumenting every handler or route; and come up with your own semantic conventions that are specific to your organization. There are standardized conventions for things like HTTP and SQL calls, but surely your organization has its own concepts you'll want to standardize: you want it to be project.id everywhere, not project.id here, project-id there, and projectId somewhere else. Standardizing and sanitizing the data that comes out of your system is another thing that can be really helpful, and again, if you can add all of that in some centralized place so individual application developers don't have to remember to add it themselves, the whole thing goes more smoothly. So those are the basic pitfalls.

I do have a short pitch here that they asked me to do. If you're looking at rolling this out in your organization and you'd like help, we're looking at offering an OpenTelemetry quick start. This would basically be a consultation with Lightstep; you don't have to be a Lightstep customer, you don't even have to use Lightstep, it's purely an OpenTelemetry support offering, and we're checking out whether it's something people would actually want. It would be an engagement where we come into your organization and, hands on keyboard, help you actually set up OpenTelemetry. If something like that is interesting, you can contact us at support@lightstep.com and we'd be happy to help you set this up. Hopefully you can get it set up yourself, but if this looks like it would really facilitate the effort, it's a thing we're happy to do, and I don't think it's that expensive relative to what you get out of it.

And that's basically what we have for this workshop. I want to do a brief overview of where everything is at before we go. At this time, the betas that I consider to be production ready are Go, Python, Java, and JavaScript; other languages are coming along. I do want to stress that while we're close to GA, these languages are still in beta, so you may see breaking changes to the API. Again, if you stick to the automatic instrumentation, that will really reduce any potential pain if we break an API. Once we hit 1.0 we're going to lock those APIs down hard, and it won't be a problem after that. If you're looking for the basics of the project, opentelemetry.io will give you access to all the different repos, and we're adding official documentation there soon, so you'll see more and more resources showing up on that website.
I'm also starting to put together and maintain my own set of resources, which you can find at otel.lightstep.com. We have getting started guides for these languages right now, but I really want to add a lot more detail: cookbooks, deep dives, framework- or database-specific guidance, message queues; there are a lot of areas I think would be helpful, so if there's anything in particular you'd like to see, please let me know. You can DM me on Twitter, I'm @tedsuo there, and I'll be posting regular updates about the OpenTelemetry project plus any additional resources I'm creating, so you can follow me if you want updates on all of that.

And last but certainly not least, if you want org buy-in, the basic strategy is: pick a known pain point, instrument only what you need to get that single transaction completely instrumented by installing the automated instrumentation via the Java agent, then start looking for other outliers and low-hanging fruit and expand the effort from there. And if you're looking for more help and you'd like to chat, there's of course Gitter, where you can find the OpenTelemetry community, but you can also check out our own Discord. If you're looking for a place where Lightsteppers who work on a lot of this stuff hang out, we're running a Discord now, and it's a great place to pop in and ask quick questions while you're trying to set this all up.

And that's the workshop. Thank you so much everyone for coming; we've got about 10 minutes left. Oh, thanks Tony, I appreciate you as well. If people have any final questions, I'm happy to hang out for the next 10 minutes and answer them.

Daniel, oh, drama, a contentious question: "OpenTelemetry distros are great, but doesn't it go against the openness of OpenTelemetry when some distros contain proprietary receivers or exporters that are not in the core OpenTelemetry or OpenTelemetry contrib repos? What is your take on that?" I don't think it does. First and foremost, in my book it's only an OpenTelemetry distro if you can continue to configure OpenTelemetry beyond what that distro does for you, and if it interoperates with any other plugin; basically, it can't be a fork. If someone changes up their distro to the point that it's incompatible with other things you might want to install or use with OpenTelemetry on the same service, that's not a distro, that's a fork. It's open source, so people are of course welcome to fork it if they want, but distros are really just a pre-packaged version of OpenTelemetry. I've yet to see a proprietary exporter, but distros most certainly contain exporters and other code that's specific to a backend. AWS is an example: they released their own distro recently, and part of why they needed to do that is that when you connect to X-Ray, they have their own data format that nobody else cares about and their own sampling algorithm. So if you want to connect to X-Ray, you have to install that exporter and that sampling plugin and configure them correctly, and if you forgot to do one of those things you'd have a funky, half-set-up system. That's the concern. Rather than telling end users, "here's a bunch of boilerplate, go do it yourself," and having them
potentially get something wrong, why not just package up all that boilerplate, since most of it is specific to an individual backend anyway? Another way of putting it: a lot of the flexibility in OpenTelemetry is about connecting to different systems, but once you pick the system you're going to connect to, most of that configuration becomes boilerplate, and forgetting to install a sampling plugin in particular is a footgun waiting to happen. I wrote about this in more detail; let me paste the link in. This is the blog post that launched the whole concept of distros.

You can see here what we're doing with our Lightstep distro. We actually use OpenTelemetry out of the box: Lightstep speaks OTLP natively, so you don't need to install any plugins to connect OpenTelemetry to Lightstep, but you do have to configure it, and this is all the configuration you would do by hand. It can be helpful to walk through it. First, there's the Lightstep access token, which you need in order to gain access to your project; that gets set as a gRPC header on the exporter. We're secure, so you've got to create your TLS credentials and get your certs. Then you set up the OTLP exporter, again core OpenTelemetry: we're creating the exporter with our headers and our security options, and configuring the default address to point at Lightstep. Then, by default, we add some resources we've detected on your system; in particular we care about service name and service version, and we automatically add things like the language and library version. Once you've got all of those pieces, you configure tracing by creating a new tracer provider with the config we just set up, adding our configuration, our exporter, and our resources, and then you set that tracer provider as the global tracer provider.

In my opinion this is very well factored: each one of these pieces makes sense, and it's a logical way to put them together. But holy smokes, you don't want to be copy-pasting all of this everywhere; that's obnoxious. And what if we end up tweaking some of these things? Telling everyone, "now go back and tweak all of these copy-pasted scripts everywhere" is just a gross way to do it. Given that all of this is boilerplate once you've decided to connect to Lightstep, it's a lot easier to package it up in a little configurator that makes it simpler and makes it clear which options you actually need to set. And if you use the Lightstep launchers, nothing stops you from cracking open the lower-level APIs and adding more exporters, mucking about with things, or configuring metrics a certain way. That's why I say distros are really more of a configuration layer.
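For reference, here is a rough sketch of that hand-rolled configuration using the current opentelemetry-java SDK. It is not the Lightstep launcher's actual code: the endpoint, header name, service attributes, and token handling are assumptions based on the walkthrough, TLS details are omitted (gRPC uses TLS by default for an https endpoint), and the beta-era API shown in the workshop was spelled differently.

```java
import io.opentelemetry.api.common.AttributeKey;
import io.opentelemetry.api.common.Attributes;
import io.opentelemetry.exporter.otlp.trace.OtlpGrpcSpanExporter;
import io.opentelemetry.sdk.OpenTelemetrySdk;
import io.opentelemetry.sdk.resources.Resource;
import io.opentelemetry.sdk.trace.SdkTracerProvider;
import io.opentelemetry.sdk.trace.export.BatchSpanProcessor;

public final class ManualOtlpSetup {

  public static void configure(String accessToken) {
    // The exporter: OTLP over gRPC, pointed at the backend's ingest endpoint,
    // with the access token passed as a gRPC header. (Endpoint and header
    // name here are assumptions; check your backend's docs.)
    OtlpGrpcSpanExporter exporter = OtlpGrpcSpanExporter.builder()
        .setEndpoint("https://ingest.lightstep.com:443")
        .addHeader("lightstep-access-token", accessToken)
        .build();

    // Resources: who is emitting this telemetry (service name/version, etc.).
    Resource resource = Resource.getDefault().merge(Resource.create(
        Attributes.of(
            AttributeKey.stringKey("service.name"), "my-service",
            AttributeKey.stringKey("service.version"), "1.2.3")));

    // The tracer provider ties the exporter and resources together.
    SdkTracerProvider tracerProvider = SdkTracerProvider.builder()
        .addSpanProcessor(BatchSpanProcessor.builder(exporter).build())
        .setResource(resource)
        .build();

    // Finally, register it as the global provider so GlobalOpenTelemetry
    // and the instrumentation can find it.
    OpenTelemetrySdk.builder()
        .setTracerProvider(tracerProvider)
        .buildAndRegisterGlobal();
  }
}
```

A distro or launcher collapses all of this into a one-liner plus an access token, which is exactly the boilerplate-hiding being described here.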
Amazon has basically done the same thing, but they've also included things like their own locked-down set of plugins that get installed automatically and have gone through their own security procedures. We want to add a lot of that CI/CD to OpenTelemetry itself, so that all of that auditing happens in OpenTelemetry core, with maybe some kind of GitHub action or trigger that can kick off builds of all these different distros. That's kind of where we're going. I do of course have a worry that big co is going to big co, and that maybe someone does something kind of gross with open source code; that's always a risk, and you can look at all the different places where there's been contention between what one of these infrastructure providers is doing and the spirit of a particular project. But in this case I think Amazon is acting very much in good faith. Alolita, who I know is running the project there, is pretty deeply involved in OpenTelemetry; they've done a lot of work to help us set up our own CI/CD pipeline, and they're going to try to move a bunch of the work they've done back into core. So they are acting in good faith; they just need to install some Amazon-specific pieces to make sure that getting started with them is easy, the same way getting started with Lightstep should be easy. Hopefully that answers your question, Daniel, and explains why it's not as contentious as it may have looked when they first launched it.

And that might be it. We've come up on 11:30, so if anyone has a final question I'm happy to answer it; otherwise, I hope you enjoyed this workshop. I'm always looking for feedback, positive or negative, so again, you can DM me, @tedsuo on Twitter, and if you have any questions, hit me up there. Alright, hope you enjoyed it.
Info
Channel: Lightstep
Keywords: #OpenTelemetry, opentelemetry, #Java, APM, OTel
Id: 7h9LTTrGL28
Length: 117min 22sec (7042 seconds)
Published: Tue Oct 27 2020