OpenTelemetry Collector Deployment Patterns - Juraci Paixão Kröhling, Red Hat

Captions
Hello, and welcome to "OpenTelemetry Collector Deployment Patterns". My name is Juraci Paixão Kröhling. I'm a software engineer at Red Hat, where I work on the distributed tracing team. I'm a maintainer on the Jaeger project and a collaborator on OpenTelemetry. For this conversation today we're talking about patterns, but before we go into that, we're going to invest a couple of minutes talking about OpenTelemetry and the OpenTelemetry Collector.

These are the patterns we're going to cover today. The first one is the very basic pattern; if you have followed a quick start of the OpenTelemetry Collector, you know it already. The second pattern is the normalizer pattern, followed by a couple of variants of a pattern for deploying the OpenTelemetry Collector on Kubernetes. We also talk about load balancing, multi-cluster, and multi-tenant scenarios.

All of the patterns we talk about here today are available in this repository. It contains images, configuration file examples, and a deeper explanation of those patterns. You can download this slide deck either from the conference's website or from the box shown here, and I'm also sharing the slide deck right now on Twitter.

So let's get started. OpenTelemetry is a project that was created as a fusion of OpenTracing and OpenCensus, and it is composed of a few big parts. The first one is the specification and conventions part: it is where the community gets together to determine the semantic conventions we should all be following when instrumenting our applications, and it also defines specifications for the data types, like traces, metrics, logs, and so on. We have a group taking care of the client APIs, the instrumentation APIs. We have a group working on the definition of OTLP, the OpenTelemetry Protocol: in concrete terms it is basically a protobuf, but it is a specification of how we can transmit telemetry data from one service to another, and it specifies both the messages and how the services should look on the client and on the server side.

And then the fourth big part of OpenTelemetry is the Collector, which is what we are focusing on here today. If you go to OpenTelemetry's website and open the Collector documentation, you'll see that the definition of the Collector is this: a vendor-agnostic way to receive, process, and export telemetry data. That definition hints at the internal architecture of the Collector, which is composed of the following components: receivers, processors, exporters, and extensions. We're not talking about extensions that much here; they are a special type of component that is not part of the pipeline, and the Collector is best viewed as a pipeline for telemetry data. Data is either received by, or pulled by, receivers, and the receivers put data at the beginning of the pipeline. As the data flows through the pipeline, it gets to the processors, and the processors have the ability to look into this data and take some action on it, or just observe the data and perhaps create a new data point out of it. Once the processors have finished doing their jobs with that specific data point, with that batch of data, the data reaches the exporters.
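To make the pipeline idea concrete before we look at each component type, here is a minimal sketch of a Collector configuration wiring one receiver, one processor, and one exporter into a traces pipeline. The exact component set and names depend on which Collector distribution and version you run, so treat this as illustrative rather than exact.

```yaml
receivers:
  otlp:               # accepts OTLP from instrumented applications
    protocols:
      grpc:
      http:

processors:
  batch:              # groups telemetry into batches before exporting

exporters:
  logging:            # writes received telemetry to the Collector's own log (handy for testing)

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [batch]
      exporters: [logging]
```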
Exporters, again, can be passive or active: they can send data actively to a final destination, or they might make data available for other systems to pull out of the Collector. When we talk about receivers, we are talking about things that emulate a Jaeger server, or that emulate the behavior of Prometheus, or that implement OTLP on the server side, or they might emulate a Zipkin server as well. Processors might be doing sampling; they might be changing attributes, that is, adding, removing, or updating attributes on the data points themselves; they might be doing batching, routing, and so on. And when it comes to exporters, we have a bunch of them, for pretty much all the relevant systems and vendors out there: we have exporters for Jaeger, Zipkin, and OTLP; we have exporters that expose a Prometheus-compatible endpoint, so that Prometheus-compatible systems can scrape data out of the process; and we have exporters for pretty much all of the commercial vendors out there.

Besides the Collector proper, besides the OpenTelemetry Collector itself, we have a couple of repositories, a couple of projects, that are part of the same ecosystem. The first one, and the biggest one, is the contrib. The contrib is where all the non-core components live, including the vendor-specific ones, so if you want to use a very specific component for the OpenTelemetry Collector, you'd probably download the contrib distribution first. You might notice that the contrib distribution, the contrib binary, is actually quite big, so you might want to consider using the OpenTelemetry Collector Builder to pick and choose which components you want as part of your own distribution of the Collector. That way the binary contains only the components you need, so it's very slim, very lightweight.
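As a rough sketch of how such a custom distribution can be assembled with the OpenTelemetry Collector Builder: a builder manifest lists the components to compile in. The module paths, versions, and output settings below are placeholders, and the manifest format varies between builder versions, so check the builder's own documentation for the exact fields.

```yaml
dist:
  name: otelcol-custom
  description: A slim Collector build with only the components we actually use
  output_path: ./otelcol-custom
  otelcol_version: "0.40.0"          # placeholder version

receivers:
  - gomod: "go.opentelemetry.io/collector/receiver/otlpreceiver v0.40.0"              # placeholder module path
processors:
  - gomod: "go.opentelemetry.io/collector/processor/batchprocessor v0.40.0"           # placeholder module path
exporters:
  - gomod: "github.com/open-telemetry/opentelemetry-collector-contrib/exporter/jaegerexporter v0.40.0"  # placeholder module path
```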
This is how a configuration file looks for the Collector: we have sections for extensions, receivers, processors, and exporters, and we tie them all together under the service node. There we specify the extensions we want for this process, and we specify the pipelines. The pipelines can be traces, metrics, and logs pipelines, and we might even have multiple pipelines for the same data type, so we can have multiple trace pipelines, for instance. In our example we have only one receiver, one processor, and one exporter.

So let's get started with the patterns. The very first one is the basic pattern. In this case we have our application instrumented with the OpenTelemetry SDK, exporting data via OTLP to an OpenTelemetry Collector located somewhere, and that Collector then exports the data to a final destination, in this case Jaeger. Note that throughout this presentation I'm using Jaeger as the example final destination, but you can replace it with pretty much anything you want: a different tracing solution, a specific vendor, or even something that isn't tracing-specific; in most cases it could also be Prometheus here. So this is our first pattern; I hope it was enough to warm up.

The second pattern, or rather a variant of the first one, is the fan-out pattern. Going back to our original image, we have the same application instrumented with the OpenTelemetry SDK, exporting data via OTLP to the OpenTelemetry Collector, and the Collector exports data to Jaeger and, additionally, to an external vendor. The point here is that we still have the ability to own our data: we still have access to our raw data within our realm, within our infra, and at the same time we can send the same data to an external vendor and get a different view of this data.

The second pattern we have today is the normalizer pattern. In this case we probably have a Prometheus client instrumenting our application for metrics, and we might have the application instrumented with OpenTracing for the traces, with the Jaeger client as the actual tracer. Data is then either made available to Prometheus or sent to Jaeger, and in this pattern we are using the Collector as a drop-in replacement for Jaeger and Prometheus when it comes to the contact with our application. So Prometheus is now scraping the OpenTelemetry Collector, not our application anymore, and at the same time the Jaeger client is making a connection to the OpenTelemetry Collector thinking that it's a Jaeger server, a Jaeger collector. The point is that the OpenTelemetry Collector sitting in the middle between those systems has the ability to look into all the data points flowing through this pipeline, through this communication channel, and ensure that they all have the same set of basic labels. Let's say we want all data points to contain the name of the cluster they originated in: we can ensure that the Collector has a couple of processors adding a cluster label to the metrics and to the traces.
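A sketch of what that normalizer setup can look like in Collector configuration terms follows: the Collector accepts Jaeger traces and scrapes the application's Prometheus endpoint, a resource processor stamps a cluster label onto everything, and the data is re-exposed to Jaeger and Prometheus. The endpoints, the cluster name, and the scrape target are hypothetical, and component behavior differs between Collector versions.

```yaml
receivers:
  jaeger:                       # the Jaeger client talks to this as if it were a Jaeger collector
    protocols:
      grpc:
      thrift_http:
  prometheus:                   # the Collector scrapes the app instead of Prometheus doing it
    config:
      scrape_configs:
        - job_name: app
          scrape_interval: 15s
          static_configs:
            - targets: ['my-app:8080']            # hypothetical metrics endpoint

processors:
  resource:
    attributes:
      - key: cluster.name
        value: prod-eu-1                          # hypothetical cluster name
        action: upsert
  batch:

exporters:
  jaeger:
    endpoint: "jaeger-collector.observability:14250"   # hypothetical Jaeger gRPC endpoint
    tls:
      insecure: true
  prometheus:
    endpoint: "0.0.0.0:8889"    # Prometheus now scrapes the Collector here

service:
  pipelines:
    traces:
      receivers: [jaeger]
      processors: [resource, batch]
      exporters: [jaeger]
    metrics:
      receivers: [prometheus]
      processors: [resource, batch]
      exporters: [prometheus]
```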
Our third pattern is actually a couple of patterns for deploying on Kubernetes. The first one uses a sidecar. A sidecar on Kubernetes is basically a second container as part of your pod: you have your application pod with one container in it, and you add a second container with the OpenTelemetry Collector. That Collector acts as an agent, and it then sends data to an external Collector, possibly deployed in another namespace, "observability" in this case, and from that Collector we export data to Jaeger.

There are a few advantages to this kind of approach. The first one: if we decide to change the way we send data to Jaeger, or if we change our Jaeger location, or if we decide not to use Jaeger anymore, or if we decide to use another exporter in addition to Jaeger, then we can just change this one deployment in our observability namespace; we don't have to change any of the sidecars. The second advantage: we're not talking about only one application here, we're talking about multiple applications in multiple namespaces, so we get better client-side load balancing, because we have many client instances making connections to one server. When we scale up the number of Collector instances in the observability namespace, all the Collector instances in the workload namespaces will start using those new instances, so load balancing tends to work better when you have more clients rather than fewer.

Another advantage of having sidecars on a per-application, per-pod, or per-deployment basis is that we can fine-tune the configuration of that Collector to the necessities of that application, of that deployment. If we have a critical deployment, we can have a very specific configuration for its Collector, perhaps with more resilient retry mechanisms, and perhaps even more memory and CPU allocated to that specific process, whereas for lower-criticality services we can have a lighter-weight configuration for that agent, for that sidecar. Of course, managing hundreds of sidecars might be a headache, and that's why we might want to use the OpenTelemetry Operator to manage the sidecars for us: the Operator can inject and manage sidecars on our behalf.
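As a sketch of the operator-managed approach, assuming the OpenTelemetry Operator is installed in the cluster, a sidecar-mode OpenTelemetryCollector resource can be defined in the workload namespace and pods opt in via an annotation. Names, namespaces, images, and endpoints below are hypothetical.

```yaml
apiVersion: opentelemetry.io/v1alpha1
kind: OpenTelemetryCollector
metadata:
  name: sidecar
  namespace: my-app                     # hypothetical workload namespace
spec:
  mode: sidecar
  config: |
    receivers:
      otlp:
        protocols:
          grpc:
    processors:
      batch:
    exporters:
      otlp:
        endpoint: "otelcol.observability.svc.cluster.local:4317"   # hypothetical central Collector
        tls:
          insecure: true
    service:
      pipelines:
        traces:
          receivers: [otlp]
          processors: [batch]
          exporters: [otlp]
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-app
  namespace: my-app
spec:
  replicas: 1
  selector:
    matchLabels:
      app: my-app
  template:
    metadata:
      labels:
        app: my-app
      annotations:
        sidecar.opentelemetry.io/inject: "true"    # the operator injects the Collector sidecar
    spec:
      containers:
        - name: my-app
          image: ghcr.io/example/my-app:latest     # hypothetical application image
```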
The second variant of this pattern does not use a sidecar but a DaemonSet. The advantages are very similar: we have a Collector that is very close to our application, to our workload, making it easy for the application to offload data very quickly to the agent, in this case the DaemonSet. But we don't have many of the advantages of the sidecars. For instance, it is harder to do multi-tenancy in this kind of scenario, because we have multiple namespaces running on the same node and only one Collector running on that node; that Collector is going to see data from all the tenants there, so it's harder for us to manage tenancy at that level, not impossible, of course, but harder. It is also harder to have multiple instances of the Collector on the same node, so if we need to scale up for some reason it's harder to do, again not impossible, but harder, especially when it comes to service discovery. The biggest advantage of DaemonSets over sidecars is, as you can imagine, the overhead. The OpenTelemetry Collector itself is not big, or it doesn't have to be: as a sidecar, I would say the memory consumption of the Collector itself should be around five to ten megabytes, but that does add up when you have hundreds of those instances. With a DaemonSet we have only one Collector per node, so the overhead here is lower. The idea is that each one of those Collectors collects data from the applications running on its node, so it's local to the applications, and then resiliently, securely, and safely sends the data to a central Collector.

Our next pattern is about load balancing. To explain why you need load balancing at the Collector level, instead of just a regular gRPC or HTTP load balancer, we have to go back a little bit and understand how tracing actually works. If a user performs a transaction on service A, in a microservices architecture it is very likely that service A makes a connection to service B, service C, service D, and each one of them then makes downstream connections to whatever services they need to get information from. The way tracing works is not that we wait for all the services to complete and then service A sends the data to a tracing backend; it's not like that. The way it works is that service A is responsible for collecting and sending the data related to its own operations, so only the spans belonging to service A are sent from service A to a Collector somewhere.

The idea here is that each Collector should have a complete view of the trace, so that it can make a decision based on that trace. For instance, if we're doing tail-based sampling, we want to look at the whole trace and decide whether we want to sample it or not. Or perhaps we are doing some analytics on the tracing data, so we want to look at the whole trace, perhaps compress it in some way, perhaps just extract some metrics and discard the trace itself, and things like that. To do that, we need to ensure that all the spans for the same trace end up at the same Collector; what we do not want is spans for the same trace at different Collectors. We ensure that by having two layers of OpenTelemetry Collectors. The first layer is a load balancer layer: a basic OpenTelemetry Collector with the load-balancing exporter. This load-balancing exporter splits the batches it receives from the clients, looks at each single span, extracts the trace ID, hashes it, determines which Collector should receive that data point, and sends the data to that Collector. So you can have an HA-like deployment of the load balancer with, say, three replicas, while at the same time having hundreds of Collector instances backing that load balancer. As a final, or intermediary, destination for this data, those Collectors might then be doing the tail-based sampling and sending data to the final destination, like Jaeger or your other tracing system.
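A sketch of the first-layer configuration, using the load-balancing exporter from the contrib distribution, might look like the following. The DNS name is hypothetical, the resolver can alternatively be given a static list of hostnames, and the second-layer Collectors would carry the tail-sampling or analytics configuration.

```yaml
receivers:
  otlp:
    protocols:
      grpc:

exporters:
  loadbalancing:
    protocol:
      otlp:                    # how the load balancer talks to the second-layer Collectors
        tls:
          insecure: true
    resolver:
      dns:
        hostname: otelcol-backends.observability.svc.cluster.local   # hypothetical headless service

service:
  pipelines:
    traces:
      receivers: [otlp]
      exporters: [loadbalancing]
```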
The next pattern is the multi-cluster pattern, and it is very similar to the ones we've seen before for Kubernetes, either the sidecar or the DaemonSet. The idea is that very close to our application we have a sidecar or an agent that receives the telemetry from that application and sends it to a central Collector local to the cluster. That Collector then makes decisions about data from the cluster itself: perhaps it is adding cluster-specific information to the data points, perhaps it is doing tail-based sampling at the cluster level. The point is that we centralize all the data at that Collector, and that Collector then makes a very secure connection to a Collector in a control-plane cluster. That communication might have different resiliency and reliability requirements, and it may have different security requirements, so it makes sense to extract that knowledge, that logic, into one Collector that works at the boundary. On the other side, in the control-plane cluster, we then have a Collector receiving data from the workload clusters; of course, we're not talking about one cluster only, we're talking about multiple clusters sending data to a central control-plane cluster. That Collector receives the data, processes it the way it should be processed, and sends it to the final destination. So this is the multi-cluster pattern.

And finally we have the multi-tenant pattern. In this case we have data coming from different tenants: we might have one application that is multi-tenant, or we might have multiple tenants as clients of our application. I named it here "OpenTelemetry Collector as a service", or OTel as a service, but in the real world it's more likely to look like this: the IT department owns your observability stack, and each department is a tenant that is charged back based on the resources it consumes. We should have one central location for the data to arrive at, an OpenTelemetry Collector that takes care of logic that applies to all the tenants, for instance security and data cleanup: perhaps we are removing some personally identifiable information from the spans, perhaps we are adding some other information to those spans, and then sending the data to the final destination. In this case, there are two different Jaegers, one for each tenant, and then we can charge each one of those tenants based on their usage, and the central OpenTelemetry Collector here could be owned by IT itself.

We also have a bonus pattern, and that is a per-signal deployment of the OpenTelemetry Collector. In this case we have one Collector taking care of the metrics and one Collector taking care of the traces. There are some reasons why we would want to do that. The first one is that the way we scale an OpenTelemetry Collector that does scraping is different from the way we scale in a push-based model. With a push-based model we can just scale up the number of replicas, and the new clients will find the new backends and send data directly to them. When we are talking about pulling data out, a typical Prometheus scraping mechanism, we cannot just increase the number of replicas for the Collector, because the more replicas we have, the more scrapers we have, and they would all go to the same targets. So we might want a different instance of the Collector taking care of metrics than the one taking care of the tracing data points. Another reason to split it per signal is that, right now, the maturity of each of those signals is not the same: the tracing components are very mature, whereas the metrics components are not that mature yet. So, to mitigate risk, we might want a highly available, production-quality deployment for the tracing pipeline, while for the metrics pipeline we accept something that is more fragile than the tracing one.
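One way to picture the per-signal split is as two separate Collector deployments with two separate configuration files, sketched below under assumed endpoints and scrape targets: a push-based traces Collector that can simply be scaled horizontally, and a pull-based metrics Collector whose replica count has to be planned around its scrape targets.

```yaml
# traces-collector.yaml (push-based; scaling out just adds more OTLP backends)
receivers:
  otlp:
    protocols:
      grpc:
processors:
  batch:
exporters:
  jaeger:
    endpoint: "jaeger-collector.observability:14250"   # hypothetical
    tls:
      insecure: true
service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [batch]
      exporters: [jaeger]
---
# metrics-collector.yaml (pull-based; more replicas means more scrapers hitting the same targets)
receivers:
  prometheus:
    config:
      scrape_configs:
        - job_name: workloads                           # hypothetical scrape job
          scrape_interval: 30s
          static_configs:
            - targets: ['my-app:8080']
processors:
  batch:
exporters:
  prometheusremotewrite:
    endpoint: "http://metrics-backend.example.com/api/v1/write"   # hypothetical remote-write endpoint
service:
  pipelines:
    metrics:
      receivers: [prometheus]
      processors: [batch]
      exporters: [prometheusremotewrite]
```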
Those are the patterns I had for today, and I think there are a few key takeaways. The first one is that the OpenTelemetry Collector is very versatile; it's incredible how many things we can accomplish with the Collector, and I think the key is to understand and get to know the existing components. Once you know which components you have, you can start planning, drawing, and architecting how you're going to deploy the OpenTelemetry Collector. A key thing as well is to understand that Collectors can be chained together: if you go back to our patterns, most of them are about chaining OpenTelemetry Collectors together at different levels. We have sidecars talking to other Collectors local to the Kubernetes cluster, and those talking to Collectors in a control-plane cluster, and so on, but they are all the same binaries, just configured differently. And finally, mix and match components and Collector instances: plan your deployment, and remember you don't actually have to use the same binary everywhere; you can build your own binaries depending on the needs you have at each of those levels.

These are some of the resources you can use to continue from this conversation: some information about the OpenTelemetry Collector, the location of the contrib repository, and the patterns from this presentation. If you've used the OpenTelemetry Collector before, I'm quite sure you have your own patterns, so here's my action item for you: go to this repository, fork it right now, and include your own patterns. Thank you very much.
Info
Channel: CNCF [Cloud Native Computing Foundation]
Views: 327
Id: WhRrwSHDBFs
Length: 25min 8sec (1508 seconds)
Published: Fri Oct 29 2021