Serverless Observability: Introducing OpenTelemetry for AWS Lambda

Video Statistics and Information

Captions
Let's get back to the beginning and get started. This is just a quick overview of OpenTelemetry, and if we're going to talk about OpenTelemetry we have to talk about the problem it's solving, which is distributed transactions. The part of solving distributed transactions that OpenTelemetry addresses is telemetry: the process of collecting information about objects that are far away and sending that information somewhere electronically so it can be analyzed. The world of observability is big and has a lot of moving parts. The part OpenTelemetry solves is standardizing the data that systems generate, so that we have a common language for describing systems. That allows analysis tools to get a much better sense of what those systems actually are, and it creates a green field of possibility for building new analysis systems on top of the same data.

The reason we need this is that transactions are becoming more and more distributed. What do I mean by a distributed transaction? Let me walk you through a super basic example, the world's simplest app. Say we have a mobile client that wants to upload a photo and a caption for that photo and send it over to a server to get stored. Of course, we know it's never that simple: that "server" is actually a bunch of systems. It's probably a reverse proxy that talks to an authentication system, writes the image to scratch disk, and then sends the image and caption to an app, which uploads the image to cloud storage like S3 and sends the URL and caption over to a data service. That data service sits in front of a couple of other cloud databases, say a Redis for caching that information and a SQL database for permanent storage. So even this really simple application is already fairly distributed: there are a lot of moving parts and a lot of separate services. And as we move into a serverless world it becomes even more distributed. Microservices, serverless, growth in scale and diversity all mean our systems get really complicated. It's not one big monolith anymore, it's many different systems talking to each other. That means reconstructing a chain of events, to figure out where the error a user saw came from or why a page loaded slowly, gets harder and harder the more distributed our systems become.

Take logging, for example. All of these systems produce logs, and we're all used to looking at log data. But if you try to reconstruct one of these big distributed transactions out of your logging system, it becomes really difficult, because you have to find and filter down to just the logs that were part of your particular transaction. If you had something like a transaction ID, or as we'll call it in a moment a trace ID, attached to all of these logs, you could run a query that says "find all the other logs in this transaction." With a traditional logging tool you can't, because you don't have that kind of context.
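As a rough illustration of that "trace ID on every log" idea in Python (this snippet is not from the talk; it assumes the OpenTelemetry API is installed and that some SDK or auto-instrumentation has started a span):

```python
# Attach the current trace/span IDs to a log record so a logging backend can
# later answer "show me all the logs in this transaction".
import logging

from opentelemetry import trace

logger = logging.getLogger("photo-upload")  # illustrative logger name

def log_with_trace_context(message: str) -> None:
    ctx = trace.get_current_span().get_span_context()
    logger.info(
        message,
        extra={
            # IDs are plain integers; render them as the usual hex strings.
            "trace_id": format(ctx.trace_id, "032x"),
            "span_id": format(ctx.span_id, "016x"),
        },
    )
```

If no span is active (or no SDK is configured), the IDs come back as zeros, which is exactly the "no context" situation described above.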
Likewise, we have metrics and dashboards, and we're really used to looking at those, but it can be difficult to connect what you're seeing in those aggregates with the actual events being produced. That's because those systems tend to be completely separate: one system, like a logging system, stores events, and another system shows you metrics, but the two aren't tied together. Even at a fundamental level the data isn't tied together, which makes the time you spend connecting the dots painful and makes it really difficult to automate that connecting of dots.

This is where OpenTelemetry comes in. OpenTelemetry unifies all of these different signals into a single system, with distributed tracing as the fundamental backbone that metrics and logging sit on top of. By combining everything into a single data stream and correlating data across those streams, we create a much richer set of data that lets us build new tools and new technology that provide more insight and can automate a lot of the work you have to do by hand today.

Because many people aren't familiar with distributed tracing, I want to give a quick overview of how it works, since this is the fundamental thing you get with OpenTelemetry. Imagine you have two servers, server A and server B, a sequence of operations, and server A talking to server B over a network call. Distributed tracing, at its core, is about having some amount of context that follows the flow of execution, and when that flow of execution hits a network call, that context gets propagated along with it. We call this context propagation. Within each process, at any moment you can call up this context object and ask it for a trace ID, an operation ID, or any piece of baggage you want to add to it. When you hit a network call, you inject that context, say as HTTP headers, and on the other side the tracing system extracts it and turns it back into a context object that continues along. That gives us the ability to trace the entire distributed transaction, no matter how many services are involved.

Once you have all of your data in tracing, you can form a graph. You'll see these kinds of graphs a lot if you start looking at tracing systems. Each color-coded line represents an operation, possibly on a different server; the length of the line represents the latency, how long that operation took; and the arrows represent network calls. This is our original simple app: a client talking to a reverse proxy, an authentication server, and so on. Being able to connect all of this together automatically into a graph is really powerful. For one, you can visualize latency and see where the time is going in your system, which can be difficult to do if you're just looking at logs. You can also quickly identify which service an error is coming from. You might see an error downstream, say a client reporting a 500, and you want to quickly trace that back to the service that actually generated the error; that's much easier to do with a tracing tool. And of course you want fine-grained events. This is basically logging, except your logs are now contextualized by the trace, which means if you find one log you can find all the other logs in your transaction, and that's a really powerful time-saving tool.
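Here is a minimal sketch of that inject/extract flow using the OpenTelemetry Python API (the service names and URL are made up; the default propagator writes W3C `traceparent` headers):

```python
# Client side injects the current trace context into outgoing HTTP headers;
# the server side extracts it and continues the same distributed trace.
import requests

from opentelemetry import trace
from opentelemetry.propagate import extract, inject

tracer = trace.get_tracer("example.client")

def call_server_b() -> requests.Response:
    with tracer.start_as_current_span("upload-photo"):
        headers = {}
        inject(headers)  # writes e.g. the `traceparent` header into the carrier
        return requests.post("https://server-b.example/photos", headers=headers)

def handle_request(incoming_headers: dict) -> None:
    ctx = extract(incoming_headers)  # turn the headers back into a context object
    with tracer.start_as_current_span("store-photo", context=ctx):
        ...  # this span is now a child in the same trace as the client's span
```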
Last but not least, you want to be able to connect these transactions in aggregate with other transactions, so you can build dashboards, metrics, and things of that nature. We do that by attaching attributes to each of these operations: what host it ran on, whether a client request returned a 500 error, and so on, so you can build graphs out of that information. That's the basics of what OpenTelemetry provides.

Let me check the time real quick. Cool, I've got a couple more minutes, so I want to pivot to how OpenTelemetry is actually structured, because we're going to be installing the system on Lambda and it's good to know what the actual pieces are. OpenTelemetry is divided into a couple of pieces. The piece you install in your client is what we call the SDK. This is the OpenTelemetry implementation that actually collects the data; you install all your plugins and exporters into it. However, the SDK is not what you use to instrument your system. The SDK can be installed by the application owner, or in the case of Lambda maybe even a third party. When you go to instrument your system, you use the API. One advantage of keeping the API separate from the SDK is that it comes with no dependencies: the API is just the interfaces you use to instrument your system, and that lets you separate the implementation you're using from the instrumentation you're providing. This is especially useful for shared open source libraries, like frameworks and network clients, that get installed into a lot of different applications. When such a library is installed everywhere, it doesn't want to haul in some huge dependency chain that might create conflicts. So when you instrument your app or your library, you just use the API, and we'll see why this is helpful in Lambda in a minute.

To show how that works: your frameworks, HTTP clients, SQL clients, and app code all talk to the API, and the API forwards that information on to the SDK implementation. Long term, we hope these instrumentation calls can be moved natively into frameworks and clients, so that out of the box your web framework actually ships with observability, and the people developing that framework are thinking not just about how to test it but about how you'd want to observe it in production. We're hoping to bring observability more to the forefront of people's minds by giving them a tool they can successfully install directly into their libraries.

The other important piece of OpenTelemetry is what we call the collector. The collector is a separate service that the SDK talks to. You tend to leave the SDK in a default mode speaking OTLP, the native protocol for OpenTelemetry, and move all of your configuration to the collector. The collector is where you do data processing and where you can tee the data off to multiple systems in multiple formats; for example, you might send your traces to a Zipkin backend and your metrics to a Prometheus backend. All of that processing and configuration you want to do in the collector. You can do it directly in the SDK using exporters, but if you do it in the collector you can redeploy and change your observability topology without having to redeploy your application. Pulling this out of the application helps operators stay in control of their observability pipeline.
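A small sketch of that API/SDK separation in Python, assuming the standard opentelemetry-api and opentelemetry-sdk packages and an OTLP exporter pointed at a local collector (the names here are illustrative, not from the talk):

```python
# library code: ships with only opentelemetry-api -----------------------------
from opentelemetry import trace

_tracer = trace.get_tracer("acme.http_client")  # hypothetical library name

def fetch(url: str) -> bytes:
    # If no SDK is installed this is a cheap no-op; if one is, it records a span.
    with _tracer.start_as_current_span("acme.fetch") as span:
        span.set_attribute("http.url", url)
        ...
        return b""

# application code: owns the opentelemetry-sdk dependency ---------------------
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter

provider = TracerProvider()
# Default OTLP export to a local collector keeps vendor configuration out of the app.
provider.add_span_processor(
    BatchSpanProcessor(OTLPSpanExporter(endpoint="http://localhost:4317"))
)
trace.set_tracer_provider(provider)
```

Swapping Zipkin, Prometheus, or a vendor backend in or out then becomes collector configuration rather than an application change.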
Now, I should mention that OpenTelemetry doesn't come with an analysis backend. That part is, as we say, sold separately, because OpenTelemetry is a standardization effort: we want to standardize how this data is produced, collected, and sent off to different analysis tools, but the analysis tools are where all the innovation is happening. There's no standard way to look at this data; instead we want competition there, and we want people inventing new, more interesting, and more helpful ways to look at it. So OpenTelemetry tries to be vendor neutral and to become an industry standard, and in fact all of the major players in the observability industry are involved in OpenTelemetry at this point.

Last but not least, if you want to get involved, the actual process for developing all of this runs through a specification. We all come together and work on the design of the system, then cut a release of that specification, which then gets implemented by the maintainers of the different language implementations. In fact, just this week we cut a 1.0 release of the specification, so release candidates and 1.0 implementations will start landing next week. Okay, that's a very quick, fast-paced overview of OpenTelemetry and how it works. Now I'm going to hand it off to Nazar to talk about Lambda specifically.

Thank you, Ted. Let me share my screen. Can you see my slides? Yeah, we can see them. Cool. In this part of the session we'll dive deep into the AWS Distro for OpenTelemetry: why AWS is investing in OpenTelemetry, why we're creating a distribution of it, and then focus the conversation on what we've done for the Lambda use case. We know customers run applications in different compute environments: EKS, ECS, on premises, EC2, and Lambda. Each environment is a little different in its architecture and communication patterns, and Lambda is different too, so we want to focus on what we've done specifically to make OpenTelemetry integrate well with Lambda, and how you can use the OpenTelemetry SDKs, APIs, collectors, and agents that Ted just showed to instrument your Lambda applications.

To start at a very high level: why an AWS Distro for OpenTelemetry at all? We've heard from our customers that many of them use multiple frameworks, SDKs, and agents to collect logs, metrics, and traces. That usually reflects their use cases, the tools they've used in the past and want to keep using, and what each team is comfortable with and has invested heavily in. There are a couple of problems with this approach.
With multiple tools, the data ends up distributed across multiple back-end monitoring services and different sets of agents and SDKs. That adds a lot of time and resources to make it all work at production capacity: managing and maintaining those agents, upgrading them, patching them, picking up new versions, and so on. Not only that, the data coming out of these SDKs and agents is disconnected; there's no correlation between metrics, logs, and traces. As Ted already said, the value is in understanding that correlation: why there's a peak in the error rate, which services it's impacting, and how it's impacting your end users.

OpenTelemetry comes into the picture to solve that problem. It provides a standardized set of APIs, SDKs, and agents to collect your metrics and traces and send them to multiple destinations at once. With a simple configuration change you can say, "I want to send my traces to AWS and to Lightstep, and I want to send my metrics to a Prometheus backend or any other vendor-specific backend." The other benefit is the correlation context: the metrics and traces being generated carry a correlation context that helps you connect the dots in your monitoring service, and because that correlation is established at ingestion time, it becomes much simpler for monitoring backends to build innovations around it, such as machine learning models or deeper triaging workflows.

AWS Distro for OpenTelemetry is a secure, production-ready, open source distribution of the upstream OpenTelemetry project. We are upstream first, meaning 100% of our code goes into the upstream OpenTelemetry repositories. We then create a distribution from it, run it through security analysis, security checks, penetration tests, and performance tests to make it production-ready for use with AWS, and we back it with AWS support. We also have tight integrations with AWS services, and Lambda is one of the services we'll show in the demo today. We integrate with the container services as well, or are in the process of integrating and improving that, so that when you create a task or a cluster you can simply select that you want to collect traces and metrics and choose your destinations, and we will provision, deploy, and configure everything automatically. At the same time, the AWS Distro has exporters not only for AWS services but also for partner solutions: you can send traces to X-Ray, metrics to the managed Prometheus service and CloudWatch, and trace data to partner solutions like Lightstep, which we'll show in the demo as well.

So where are we currently with Lambda support for OpenTelemetry, and where are we going next? Today we support Lambda with Python through the AWS Distro for OpenTelemetry. It's based on Lambda extensions, and we have a Lambda layer that you can add to your functions. There are a couple of areas we've invested in to make it very simple to use. First, auto-instrumentation for Python, so that many of your dependency calls and AWS calls are instrumented automatically.
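Conceptually, the auto-instrumentation of AWS SDK calls looks roughly like the upstream botocore instrumentation sketched below; the Lambda layer wires up the equivalent for you, so this is only a picture of what happens under the hood (the table name is hypothetical):

```python
# Roughly what the auto-instrumentation does for AWS SDK calls, if wired up by hand.
import boto3
from opentelemetry.instrumentation.botocore import BotocoreInstrumentor

# Patch botocore so every AWS API call emits a client span with service/operation attributes.
BotocoreInstrumentor().instrument()

dynamodb = boto3.client("dynamodb")
# This PutItem call is now traced without any per-call instrumentation code.
dynamodb.put_item(
    TableName="products",  # hypothetical table name
    Item={"productId": {"S": "12345"}},
)
```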
That's what we'll show in the demo: if you make a call to S3 or DynamoDB from Lambda, we instrument that call automatically, so you don't have to write any custom instrumentation code. The other investment is that we automatically collect resource-specific information, such as the Lambda function name and function version. That helps in the end, when you're building a monitoring or triaging experience, to connect application performance data with infrastructure and resource performance data and tell a better story. We've also used the Lambda extensions and layer technology to provide batching within an invoke, and we aim to expand that to batching across invokes in the future. Today we support sending traces to X-Ray and over the OTLP protocol, so any partner solution that uses OTLP, like Lightstep, can receive the data as well.

In the next few slides, let's dive a little deeper into how the OpenTelemetry integration is set up within the Lambda architecture, and how calls are passed from the point you invoke a function, to the Lambda service, to the runtime, and then to the OpenTelemetry SDK and the collector, which batches the data and sends it to the back-end service. As I mentioned, the entire Lambda integration for OpenTelemetry is based on Lambda extensions, which came out last year, so I want to dig into how the calls flow with that. Before Lambda extensions, the lifecycle of a Lambda function was: initialize the function, execute the function, end the function. With the advent of Lambda extensions, there's more control and more specific communication you can handle, and the lifecycle is divided into three phases: an initialization phase, a function invoke phase, and a function shutdown phase. Extensions are basically a way to bring your own agent, or your own code, that runs alongside your function, and you can control its lifecycle.

Let's look at each of these phases, initialization, invoke, and shutdown, and see how the communication happens with a Lambda extension. In the whole Lambda execution there are three different areas: the Lambda platform or service, the Lambda runtime, and the Lambda extensions, and the extensions area is where OpenTelemetry fits in. We've created a Lambda extension specific to OpenTelemetry, where the collector runs as part of your function execution. In the initialization phase, all extensions need to be initialized before your function or the runtime, and they have to register with the platform, so the platform knows: I have this OpenTelemetry extension, it is initialized, and it's ready to take requests.
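As a rough sketch of that register-then-poll lifecycle, here is what a toy extension loop could look like against the documented Lambda Extensions API (paths and headers should be verified against the AWS documentation; the real collector extension is considerably more involved than this):

```python
# Minimal sketch of the Extensions API flow described above: register the
# extension, then poll for INVOKE/SHUTDOWN events.
import json
import os
import urllib.request

RUNTIME_API = os.environ["AWS_LAMBDA_RUNTIME_API"]
BASE = f"http://{RUNTIME_API}/2020-01-01/extension"

def register(name: str) -> str:
    req = urllib.request.Request(
        f"{BASE}/register",
        data=json.dumps({"events": ["INVOKE", "SHUTDOWN"]}).encode(),
        headers={"Lambda-Extension-Name": name},
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        return resp.headers["Lambda-Extension-Identifier"]

def event_loop(extension_id: str) -> None:
    while True:
        req = urllib.request.Request(
            f"{BASE}/event/next",
            headers={"Lambda-Extension-Identifier": extension_id},
        )
        with urllib.request.urlopen(req) as resp:
            event = json.load(resp)
        if event["eventType"] == "SHUTDOWN":
            break  # flush any buffered telemetry here, then exit cleanly

event_loop(register("otel-collector-sketch"))
```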
Once all the extensions have been initialized (you can have multiple extensions: the OpenTelemetry extension, another extension for a different observability solution, or your own custom extension running as part of your Lambda execution), we're ready to receive invokes for the function. When an invoke comes in, there's no specific ordering of which extension receives it first; the invoke events are passed to all the extensions. When you invoke a function, the platform and runtime send a message to the OpenTelemetry extension, and that's where we bring in all the OpenTelemetry dependencies, initialize the auto-instrumentation for Python, and so on. Once the invoke has been received and handled, the extension responds back to the platform, saying in effect, "I've finished my part of the invocation, and here is my response."

The final stage is the shutdown phase. When the function invoke has completed, we want to shut the extension down gracefully, which means shutting down the OpenTelemetry collector. The reason is that the collector (or the OpenTelemetry SDK sending to the collector, as in this case) may buffer data for a certain period of time, and I'll go deeper into how that buffering works in a moment. So we signal the OpenTelemetry collector to push its data out to the back-end services, so we don't lose any of the observability data that has been collected and stored. The runtime is shut down before the extensions, and once the extensions have shut down they signal that the shutdown is complete. If for some reason an extension cannot shut down, a kill signal can be sent.

Now let's get specific about how a user function within the Lambda runtime works with the Lambda extension that carries the OpenTelemetry collector. On the left-hand side we have the user function, which is the execution flow, and then the OpenTelemetry SDK and the OpenTelemetry collector; the collector is part of the Lambda extension. We create one layer that bundles the OpenTelemetry SDK and the collector together (I'll show that in the demo), and that works in conjunction with your Lambda execution. For example, say an invoke comes in: your user function is invoked and executing, telemetry data is being collected and sent to the OpenTelemetry SDK. The SDK bundles the APIs and all the components Ted mentioned a few slides back, and it batches the data for five seconds. Within an invoke you're processing multiple operations, right? Maybe there's a for loop, or you call S3, then call an API, validate authentication, then call DynamoDB, and so on. All of that instrumentation data is collected and batched for five seconds.
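In SDK terms, the batching and flushing being described here (and in the next part, about the end of the invoke) maps onto knobs like these. This is a sketch assuming the standard Python SDK and an OTLP exporter pointed at the in-layer collector; the layer's actual wiring may differ:

```python
# Sketch of the five-second batching and the end-of-invoke force flush.
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter

provider = TracerProvider()
provider.add_span_processor(
    BatchSpanProcessor(
        OTLPSpanExporter(endpoint="http://localhost:4317"),  # the in-layer collector
        schedule_delay_millis=5000,  # the "batch for five seconds" timeout flush
    )
)
trace.set_tracer_provider(provider)

def handler(event, context):
    ...  # spans recorded during the invoke are buffered by the processor
    # Flush synchronously before returning so buffered data is not lost when the
    # environment is frozen or shut down (the layer/extension does this for you).
    provider.force_flush()
    return {"statusCode": 200}
```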
Once those five seconds are up, a timeout flush automatically sends the data to the OpenTelemetry receiver inside the collector; the receiver passes it to the exporter, and the exporter flushes the data out to the back-end service. So it's quite close to real time from collection to wherever you want to send the data. What happens when the function invoke ends? You've stopped invoking the function, and there might be telemetry data that hasn't been sent out because the five-second timeout hasn't been reached. So we do a force flush: we signal the Lambda extension that holds the collector that we're ready to send the data out, the data is sent directly, the collector flushes it to the monitoring backend, and then the response is sent. This force flush is a synchronous process, to ensure that all the data that was buffered but not yet flushed (because the timeout might not have been reached) gets flushed to the monitoring backend, so that we don't lose data.

So what's coming next? With the AWS Distro, Lambda support today is Python only. We're actively working on four other languages, Node.js, Java, Go, and .NET, and more in the future. We also want to emphasize making instrumentation easy and simple, so we try to provide auto-instrumentation capabilities in the language-specific SDKs; we can't do it in all the languages, for example the scripting languages, so we focus on Java and .NET there. We're also going to provide an AWS-managed layer, so instead of creating your own layer you just search for the managed layer, add it to your function through CloudFormation or SAM templates or the console, and everything is set up for you, ready to go. We'll also add more vendor-specific exporters; today we support the X-Ray and OTLP exporters, and we're going to support other vendor-specific exporters to send data to other partner solutions you'd like. Batching today is only within an invoke; we want to batch data across invokes, so that if you have a high-scale, high-usage application you don't hit the throttling limits of your back-end monitoring services and the data is sent within an acceptable network traffic range. We're also focusing on logging support in OpenTelemetry: once the logging specification is defined and implemented, we'll add it to the AWS Distro for OpenTelemetry, and as logging is defined we'll have better correlation between logs, metrics, and traces.

Now let's go to the demo. I'll start with a small demo and we can take it from there; let me share my screen again. The demo shows an API invoking a Lambda function, which then writes to DynamoDB tables; we want to show a semi-real use case. With that, I want to show what data you can collect with OpenTelemetry, dive into how you can use X-Ray, and then show how you can do the same thing with Lightstep and talk a little more about Lightstep's functionality around triaging and root-causing issues.
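The demo code itself isn't shown in the talk, but a minimal handler along these lines would behave the same way. The table and field names are made up, as is the simulated "invalid product ID" failure mentioned a bit later in the demo:

```python
# A minimal function in the spirit of the demo: write an item to DynamoDB and
# occasionally fail with an "invalid product id" error.
import json
import random

import boto3

dynamodb = boto3.resource("dynamodb")
table = dynamodb.Table("products")  # hypothetical table

def lambda_handler(event, context):
    product_id = event.get("productId", str(random.randint(1, 100)))
    if random.random() < 0.2:
        # Simulated failure mode, like the random errors injected in the demo.
        raise ValueError(f"invalid product id: {product_id}")
    table.put_item(Item={"productId": product_id, "source": "otel-lambda-demo"})
    return {"statusCode": 200, "body": json.dumps({"productId": product_id})}
```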
The function itself is basically just making simple calls to insert an item into a DynamoDB table. What I'll do is test it from my console a couple of times, and I'm also going to invoke a few calls in a loop so it keeps executing; I have that running on the side. This function doesn't have any OpenTelemetry yet: it's a simple Lambda function in Python making DynamoDB calls. If I go to my console and refresh the last five minutes, you'll see something like this: the Lambda function is making calls and all the calls have been successful. One thing you'll notice, though, is that I don't see the DynamoDB call being made, because the only thing I've done in my Lambda function is enable tracing for it. That's it. I haven't instrumented the code or added any instrumentation SDKs, but as soon as I enable tracing I get some amount of data: I can look at the traces, the response times, and so on.

So let's walk through adding a Lambda layer to it. I'm going to go to the layers section and add a Lambda layer I've already created (there's documentation on how to create the layer on the AWS Distro for OpenTelemetry developer portal), and I'm adding version 5 of it. This layer has the OpenTelemetry SDK and the OpenTelemetry collector baked into it, so you just add the layer to your function and you're good to go. In addition, I have to add an environment variable, so I'm going to copy it in and explain what it is. This environment variable is for the auto-instrumentation agent: it wraps the Lambda handler and enables auto-instrumentation. What auto-instrumentation does is, as the Lambda extension is initialized, automatically initialize all the OpenTelemetry libraries and make the connections, so that when you invoke those libraries the instrumentation happens automatically. One example is that the calls you make downstream to AWS services are instrumented automatically.

So I've added the layer, I've enabled the auto-instrumentation agent, and I've also added some random failure modes (an invalid product ID and so on). I'm going to execute a couple of times, and if I refresh, hopefully you'll see a DynamoDB node connected as well. Remember, I was invoking an API, which invokes the Lambda function. Before adding the OpenTelemetry layer I didn't have this DynamoDB node, but as soon as I added it, without changing a single line of code, just adding the layer and the environment variable, I can now instrument the AWS calls I'm making to DynamoDB or S3.
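The same console steps (attach the layer, set the auto-instrumentation wrapper variable) can also be scripted. The talk doesn't name the environment variable; the `AWS_LAMBDA_EXEC_WRAPPER=/opt/otel-instrument` value below is the one the ADOT documentation uses for the Python layer, so treat it and the layer ARN and function name as placeholders to verify:

```python
# Sketch: attach the OpenTelemetry layer and set the wrapper variable with boto3.
import boto3

lambda_client = boto3.client("lambda")

lambda_client.update_function_configuration(
    FunctionName="otel-dynamodb-demo",  # hypothetical function name
    Layers=["arn:aws:lambda:us-east-1:123456789012:layer:my-otel-python-layer:5"],
    Environment={
        "Variables": {
            # Wraps the handler so auto-instrumentation is initialized before the invoke.
            # Note: this call replaces the function's existing environment variables.
            "AWS_LAMBDA_EXEC_WRAPPER": "/opt/otel-instrument",
        }
    },
)
```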
My application also throws some basic random errors, so let's look at those and see if we can figure out why they happen. I selected that node, picked one of the traces, and focused on the error: I want to find out why I'm seeing it. You can see very easily that there's an issue in my Lambda handler, the hypothetical error I'm creating, an invalid product ID, and I get the stack trace as well, so you can see where the call is going. Let's also look at some of the successful traces. In those, a client calls my Lambda function, which then calls my DynamoDB table; you can see the Lambda handler automatically instruments the DynamoDB calls, everything succeeds, and it all looks good.

As the second step, one of the key features we want to provide in the AWS Distro is making it easy to send data to any destination in addition to the AWS ones. So just by adding one more environment variable, I'm going to send the data to Lightstep and see how it looks there. Let me add the environment variable and then explain it: it points to the collector configuration file. Each collector has a configuration file where you define what data you want to collect, where you want to send it, and other parameters such as the sampling rate. Let me run some tests here to make sure everything is good, and invoke a few more data points so we have data to send to Lightstep.

Now let's look at the collector config file. With the environment variable I'm saying that I want to bring my own collector configuration file, and I'm specifying the path to it, which is inside my function package itself. If you look at this YAML file, I'm sending data to two places. Before, I was only sending data to X-Ray, but with this configuration file I'm sending data to X-Ray as well as Lightstep: I declare the X-Ray exporter, and in the OTLP section I specify the endpoint I need to send to, which is Lightstep, along with my access key. Then I'm going to make some invokes and go into the Lightstep console to see if the data shows up.
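The configuration file being described might look roughly like the YAML below (in the ADOT Lambda documentation the environment variable that points at it is `OPENTELEMETRY_COLLECTOR_CONFIG_FILE`; the Lightstep endpoint, header name, and token here are placeholders drawn from common OTLP-exporter usage, not values shown in the talk):

```yaml
# Sketch of a collector config that receives OTLP from the SDK and exports to
# both X-Ray and an OTLP endpoint (Lightstep in this example).
receivers:
  otlp:
    protocols:
      grpc:

exporters:
  awsxray:
  otlp:
    endpoint: ingest.lightstep.com:443
    headers:
      lightstep-access-token: "${LS_ACCESS_TOKEN}"

service:
  pipelines:
    traces:
      receivers: [otlp]
      exporters: [awsxray, otlp]
```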
I've logged into my Lightstep console and I'm running a real-time query in the Explorer view, and I can see a lot of data coming in. I'm sure we'll see some errors as well; I'll run the function again and send more data just to make sure it's there. You can look at individual responses, see how the Lambda function is behaving, what the latency is, and what correlation keys you get. Let me try to get more data here; actually, let me try to post more errors so we know the data is coming through. This is real data, and these are real-time queries we're running. Let me go back a second... there you go, I got some errors. I'm creating these fictional errors, and I wanted to make sure that the data we viewed in X-Ray shows up in Lightstep as well. I click on one of those errors, and in the correlations tab you automatically see the events, the log events, the stack trace, the error message, and so on.

So, coming back to where we started: we showed how easily you can instrument your existing application. Just add the layer (Python today, other languages in the future); by default it sends the data to X-Ray, and if you want to send data to any other vendor you add your own configuration file, which is just a very simple YAML file. Without writing a single line of code you can start collecting data in a few seconds or minutes. With this I'd like to hand it back over to Ted; I know he wanted to talk a little more about Lightstep and its debugging capabilities. Do you want to take it from here?

Yeah, absolutely. Just to reiterate, because we've been looking at these small examples so you can see a bit about tracing and how it works on Lambda, I did want to show how tracing can tie these tools together when you've got lots of data in more of a production environment. So we're looking at a larger system in Lightstep right now (let me get our little Zoom chrome out of the way). You can see there are a number of services involved, some of them have errors, and there's a latency histogram: all the different operations sorted by duration, so you can see how many fall into each time bucket. We've got some operations all the way out at the extreme, so you can focus in on that particular set of operations and ask for more information about them. When you do that, you start getting correlations. For example, here the highest correlation with slow requests is Kafka partition 4, and you'll notice part of that bar is shaded dark: operations labeled with this attribute seem to be clustered toward the slow end. That's interesting, so if you group by that Kafka partition attribute, all of a sudden you can see the different Kafka partitions, and we notice they're all fairly slow. Oops, this demo environment is being a little fussy for me, apologies; let me reload, I think our demo is a little under-provisioned. Here, if you have a look at Kafka partition 4, you can see it's significantly slower than the rest of the partitions. You can also see the span count: the number of operations hitting partition 4 seems to be a lot larger than the number hitting the other partitions. That already tells you something's up, like some kind of rebalancing probably needs to occur. But you can then also click into this and look at the actual operations, and if I click into one of them I can look at an individual trace. These are the trace graphs I was showing you before.
If you're trying to figure out where the time went, this black bar represents the critical path, which is our guess at what was actually holding the system up. If I look at that particular span, it's called "wait for client queue": it's calling a back-end service, trying to get something from a Kafka queue, and waiting. That ties together with what we just saw, that a large number of requests are going to this particular Kafka queue, so the latency we're seeing is because things are queued up waiting for the other requests to complete. Now you know you need to go look at your Kafka system and figure out how to rebalance it. And I should mention: look how all of this information is tied together. If you were trying to collect all of this from a logging system, you'd have to do a fair amount of querying just to piece together such a large distributed trace.

To see how traces can relate to metrics, here I'm looking at a metrics dashboard for this API gateway, with latency, rate, and error rate. There aren't any errors, the rate of operations is fairly steady, but there are these latency blips. So what's going on with this metric? You might have a metric like this in another system if you're just recording latency as a metric, but here you'll notice all of these dots, and each dot represents a particular operation that contributed to this graph. I can click on one of these outliers, look at a trace, and start to get a sense of what was actually going on. In this case it looks like a Cassandra client query is taking over a second, so boom, I've gone straight from a metric alert to a trace and to root-causing the issue. That's the kind of quick movement you can do once you have all of this data in a single system. Okay, this is more of an OpenTelemetry demo than a Lightstep demo, so that's all I want to show of Lightstep for now, but I wanted to give you an impression of how fast you can start to move once you have all the data in one spot.

All righty, I think that's it for our presentation. Nazar, thank you so much for showing off all the Lambda stuff, that's really exciting. Could you tell us again about the roadmap for Lambda development, like what other languages you're looking at? Oh, you're muted. Sorry. Our near-term roadmap focuses on two things. One is expanding language support: our next languages are Node.js, Java, Go, and .NET. The second area is making it easy for customers to collect data, so investing in auto-instrumentation and the managed layer, so customers can just search for the layer, add it, or use it as part of their CloudFormation or SAM template and start collecting instrumentation. Those are the two key areas we're focusing on over the next one or two quarters. Awesome, that's great. For people who are super interested in Lambda and want to get involved in development, I just want to mention we have a Lambda working group that meets once a week. If you go to the OpenTelemetry community repo on GitHub, you can find the links to where all the different SIG meetings are held, including the Lambda one.
We hope you'll contribute, and we hope to see you there. Thank you. Thank you, bye.
Info
Channel: Lightstep is now ServiceNow Cloud Observability
Views: 5,124
Keywords: apm, microservice, performance monitoring, distributed tracing, distributed systems, opentelemetry, observability
Id: Ty_AToJW5Fc
Length: 48min 17sec (2897 seconds)
Published: Fri Feb 12 2021