GopherCon 2021: Tom Lyons - Rethinking How We Test Our Async Architecture

Captions
Happy to be here, GopherCon. Like they said, I was the first GopherCon hire at CrowdStrike, so it's really great to be back and actually be able to talk about what we do at CrowdStrike. A little bit about me: I'm a senior software developer at CrowdStrike. And a little background about CrowdStrike: we are a cybersecurity company, and we ingest over a trillion messages a day into our distributed graph database called Threat Graph. The basis of the company is that we have sensors all over the servers, laptops, desktops, and mobile devices of all of our customers, and those sensors use our cloud to send us data about what might be going on on each system. That gives us a global resource pool of all the events that happen throughout our network, which we can use to build data models and protect the entire network live. To do that we need to process a lot of messages, a lot of data, and get it into our cloud, so today I'm going to talk about how we build this architecture and how we troubleshoot it live.

This is a talk about asynchronous architectures, also called event-driven architectures. It's a relatively common concept in modern cloud computing: you have different microservices, say A, B, and C. A gets a message and sends it over to B; B takes that message, does some processing, and sends it to C, asynchronously. These architectures are very common because they're very fault tolerant: they can ingest a large amount of data and process it, because the processing doesn't have to be time-dependent. You might have low lag, processing a message through the pipeline in milliseconds, but the real goal is that the architecture stays available to ingest all of that information and process it whenever you have the CPU cycles or resources available.
So if you get a large burst of messages — all of your devices sending millions, or in our case hundreds of millions, of messages a second — you might end up with a small backlog that you can work through, while at other times you might have lower latency needs.

All right, let's talk a bit about the current state of the art for testing these architectures. The baseline is unit tests. This is the absolute minimum for any kind of Go application: if it doesn't have a unit test, it's not done — allegedly. They're relatively easy to write and tend to be very small; if you're writing unit tests correctly, you're writing them on a per-function or per-method basis to verify that the code does what you intend. If you write a function that reverses a slice, you want a unit test that feeds it a bunch of different slices and checks that it reverses them properly. Unit tests also tend to be service-specific: it's very rare for a unit test to exercise cross-service functionality, especially now that Go is much more modularized and different services live in different modules, so you won't readily write many unit tests across dependencies. You can theoretically mock data — for instance, if you have a Redis-backed application, you can mock the messages going in and out of it — but once you test against a live Redis cluster, you're in integration-test territory.

Integration tests are the other core foundation of testing your architecture. They're something you can run in a development or production environment to verify that your services are talking to each other: if your service needs to talk to Cassandra or Kafka or Redis, or to another service, this is where you write those tests. They're relatively straightforward: they usually inject some kind of real or mock data into the system and check what comes out. A RESTful API, for instance, is very easy to test — you send a message to the API and expect some kind of response. RESTful responses, or really any request/response flow, are easy to test: did I get what I expected, yes or no? It's also relatively easy to test very simple asynchronous flows: if a single service writes to a database, you can inject a message into that service and verify that it actually showed up in the database.

The problem is that asynchronous pipelines can get incredibly complex incredibly fast. Take this use case: service A writes to B; B does some analysis on the message and writes it to C; C creates a very complex message and writes it back to A; A then writes to D and E; D sends it to a database; E sends out an email. This is based on a real flow — it's relatively common for one message coming into your pipeline to spawn a bunch of processes as part of your asynchronous architecture. There are a couple of layers at which you could cover this with integration tests. Say you write an integration test that checks the message got into the database: you send it to A, and then nothing happens. That's fairly common — there could be queue lag. You might have injected into a queue that already had a backlog of a million or ten million messages, so your message won't process immediately, and you have to continually poll your database to see if it arrived, and eventually time out.
So the test could have failed because the message never got to D, or it could have just been delayed, or it could have been dropped legitimately. If you send it to A, how do you know it didn't get dropped at B or at C? You could theoretically simulate the message that came from C and put it into A, or simulate the message that came from A and put it into D. The problem is that at that point you have to write very manual mocks: data that lets you verify what came out of the previous service. And if you've ever worked with developers, you can absolutely rely on them to never update their integration tests — because it's hard, it takes time, and it's very difficult to keep those mocks up to date in a robust way.

Before I keep going, let's talk about the last pillar of testing: tracing. I absolutely love tracing — it's an incredible way to look at your data. Things like OpenTracing and Jaeger, and many other tools I won't get into here, give you really robust views into your architecture. When a message comes in, each service sends the attributes of that message to a central data lake as it moves through the system. When the message goes into A, A sends up "I got this message and sent it over to B"; B then sends its own record to the data lake: "I got this message from A and I'm taking these parts of it onward." As the message goes from A to B to C, all of them upload that data into a central data lake, which can assemble the full picture — every step the message takes as it traverses your system.
This is fantastic, and I highly recommend it. We do have tracing at CrowdStrike, but we can't have the same level of tracing for a lot of our asynchronous pipeline, because we get a trillion messages a day — in the time since I started this talk, we've processed definitely over a billion messages. There isn't a data lake big enough for us to ingest traces for all of those messages, so we need a more targeted approach.

Before I get into how we do that, let's talk about context in Go. I have opinions on this, and I'm going to share them with you. Context is one of the most powerful things in Go, and also one of the worst things in Go. Used correctly, it is amazing: context lets you pass information down to your downstream method and function calls.

Rule number one: don't attach things you want your function to call, or data you want your function to have. This is very common, especially for new developers coming from, say, Java environments, who want to pack stuff into context so they don't have to add parameters to their function signature. It's very normal for new developers to want to use context as a central repository for every parameter they might possibly pass to the next function. The problem is that it's incredibly hard to debug, and it defeats the robustness you can expect from a strictly typed language like Go. In general, nothing in context should modify how your function fundamentally works: if a value in context can wildly change what your function is doing, you've used context wrong.
Now some do's: absolutely use context for things like logging and tracing — that's what it's there for. If you're carrying a customer ID, a device ID, a message ID, or a trace ID, it's not necessarily a parameter your function needs, but it is something you want your downstream loggers and tracers to be able to use, so that when they log "hey, I have this error," they can also include the message ID or trace ID. Context is also valuable for deadlines: you can add timeouts to otherwise ordinary function calls. This is very common in HTTP libraries, where you might make a request and want to cut it off if it takes too long. If you're making a request to Redis and you decide it should absolutely come back within five seconds, you can add a context deadline so that if it doesn't, you know that service might be dead and you should bail out faster. And the last recommendation: whenever you have a function that accepts a context, always assume that context is blank — that it carries no data. If you can assume that and your function still works, you've written it correctly.
All right, let's talk about our testing paradigm at CrowdStrike — or at least part of our testing paradigm for the asynchronous pipeline. First: no-op services, or "noop" services if you pronounce it that way. Like I said, our pipeline handles over a trillion messages a day, and we have massive data lakes, so we cannot mock every permutation of a message. The messages come from sensors on computers and servers all over the world — from all different developers, people, and companies that might send us different data — so the variety and scope of that data is enormous, and it is virtually impossible to mock all of its potential permutations. So instead, we mock the services.

We have the ability to stand up two instances of a single service side by side, and we actually use that for deployments. For instance, if your service reads off a Kafka cluster, you can spin up the same service with a different consumer group, and it will process the same messages the live service is processing. One of the instances is just your regular service: it takes in data, writes to the database, writes to the next service — easy, good to go. The other is what we call a no-op service: it takes in the data and goes through all of the business logic, but when it actually goes to do any I/O — any call to the network or the disk — we simply skip that step and say "it succeeded," assuming that the backing service (Cassandra, Redis, Kafka, whatever you're using) is available. So we call those no-op, "no operation," instances.

Why? We can use this to canary our services. When we want to deploy a service, we can first deploy the no-op instance. This lets us put it against live data and see everything that can go wrong, which is really important because it processes all of the real-world messages we might see in our production environment. If somebody updates the code and, say, forgets to initialize a map, that can cause a panic, and we want to catch that very early, before the new code actually processes anything in our system. We can still rely on the existing version of the service out in the cloud, while the no-op canary lets us verify whether the new version is going to die under that kind of load.
The canary can exist for a few seconds or a few days — it depends on the change you've made as a developer and how long it takes to verify that the change actually does what you want.

So how do we do it? Very simple: if no-op is set on the context, return nil — or whatever the values for success are — instead of doing the I/O. It's basically checking whether a value is true: if this is a no-op instance, we return success, assuming we've already run everything else in the business logic. We have two functions, NoOp and ContextWithNoOp. NoOp looks at the context: if the context is nil or doesn't contain the value, it returns false, so when we check the context we get a boolean that's only true if no-op was explicitly set. ContextWithNoOp adds the no-op value to the context and returns it; if the context is nil, it first creates a new one from context.Background. The reason we use a constant of a dedicated type called contextKey is that it's package-specific: it doesn't matter if another package adds the string "noop" with the value true, because our NoOp function looks specifically for our contextKey type and will only match values of that type. So we can reliably say the only way a context will read as no-op is if it went through ContextWithNoOp.
Now let's look at a message handler. This is something that might be running off a queue: whenever we get a new message, we receive the context and the raw bytes of that message. We unmarshal the JSON into our message struct. Then we check our NOOP environment variable — it could be an environment variable, or it could be something like a REST parameter — and if it's true, we add the no-op flag to the context. From there we just call our other functions like normal: we write the message to the database and send it to the next service with WriteToKafka, passing along the context that contains our no-op value.

WriteToKafka itself is very bare-bones: we marshal the message to JSON again, and then, just before we actually do the Kafka library write, we check whether the context is no-op. If it is, we assume we succeeded — we don't actually write to Kafka; we just stop there and report success.

And yes, I literally just got done saying "don't let context fundamentally change how your code works." So first, let's address why this isn't a global variable — why put it on context at all? You could just check "is this global true, yes or no" and keep going. The reason is that we want this to be request-specific rather than service-specific: we want to run this service in production, not as a canary, just as a standalone service, and still be able to feed it individual messages that run in no-op mode. That's useful if we want to replay a message and see what would have happened, or test something without affecting production data. Context lets us do it on a per-message, per-request basis.

As for the rule that context should never fundamentally change your code — that's valid, and I'd argue this doesn't fundamentally change the code, because of where it's placed. The only thing it changes is your criterion for success when saving to a database or sending a message.
As opposed to the usual path, where you check the real error for success or failure, in no-op mode we want to be able to say "this would have succeeded" — the only criterion the context changes is our recognition of whether the operation would have succeeded. So it really comes down to where you put the check. On the left-hand side of the screen — the one with the green check mark — is all of your logic: a writeToDatabase function that takes a context and a message, looks at the message, makes sure it's the right type, makes sure it has a field, makes sure that field is valid, and only then calls NoOp, right before the actual database insert. You do it this way so that if you mess up the code somewhere — if you're refactoring for robustness, or changing which field you handle — the no-op canary still exercises the real business logic against real messages, because that's the thing you're actually checking. Once the message has made it through the core business logic, you're good to go.

What you don't want to do is put the check at the beginning of the function, ahead of the validity checks. In the second case, where we immediately no-op out, the code could actually panic once it runs outside no-op mode: maybe the payload isn't a map of string to interface, maybe it doesn't have an "id" field, maybe that field isn't a string — any number of business-logic checks need to happen before writing to the database. If you check no-op before all of those, you could deploy bad code and never notice. If you check just before the disk or network I/O, you're in a much safer place.
All right, let's talk about step two: API injection. We talked about how asynchronous services can be very hard to test because they're intricate — they have a lot of layers, and it can be difficult to work through those layers and get to the root of a problem. But RESTful request/response — or gRPC, or really any request/response service — is super easy to test: you give it data, you get back data, and you see what happened. So we just reuse that paradigm. It's a very simple hack, really: take the same code that handles your asynchronous pipeline and drive it like a synchronous request/response service. We set up an API handler and pass the request off to the message handler — the same one we use with whatever queuing service we're on — and any actions taken inside the handler get returned to the API caller. So we have an API injection endpoint: you give it a message as if it came off the queue, it replays that message into the pipeline via the message handler, and it builds a record — essentially a trace — of all the actions it took: "I would have written to the database, I would have written to Kafka, I would have written to Redis," and so on for any number of downstream services. It returns that trace back to you.

Let's look at what that might look like, starting with the RESTful handler. This is a pretty common format for Go REST services — there are a lot of opinions about how to structure REST handler functions that we won't get into in this talk; this is a very basic example. First we read the message off the request body, then we check the no-op value.
In this case we check whether the "noop" URL parameter is true, so the requester can choose whether this request runs in no-op mode or in standard operation mode — whether it actually writes to the database and the next service. Both can be valuable for different reasons, and it's up to you as a developer or tester to decide which works better for you. Next, we create a new data structure we call a memoir — I'll get into that in a moment. We create a pointer to it and add it to the context; this is the thing we populate with all of those calls: "I wrote to the database," "I wrote to Kafka," "I wrote to the cache." All of these get saved in the memoir, because we pass it through context into HandleMessage, and then we encode the memoir as the response to the REST request. The memoir contains all the actions the handler did — or would have taken — and we send it back, so we can verify that the service is actually operating the way we expect.

So let's get into memoirs. What is a memoir? A historical account — something like a journal, a log, or a trace. It's in the same family, in that it gathers up all of the operations that have taken place, in a form that makes sense to send back. We need to record everything a service is doing, put that on the context, and have downstream code decorate it whenever a memoir is present. We're not going to use memoirs in our production logic at all: no downstream code will ever open up a memoir and look at what happened before it, the same way no service reads the logs or the traces that came before it.
The memoir exists purely for the response we're building up to this one REST request; that response is its only consumer. And because code only touches the memoir when one is present on the context, it becomes super cheap to run in production: when a normal asynchronous message comes through, we simply don't attach a memoir, so none of these events take up any memory — nothing is recorded, because the memoir doesn't exist. We can run everything at scale, and yet every time we make an API request, we get a very rich account of that message and all of the traversals it made.

All right, so let's make some memoirs. A memoir is a very simple structure: it holds a channel of a slice. In this example I use a slice of empty interface; in your application I'd highly recommend a concrete event structure — service name, hostname, data, logs, trace ID, whatever is specific to your needs — but for a GopherCon talk, something generic will do. When we initialize the memoir, we create the struct, make a channel of capacity one, and put a single empty slice onto that channel. We do it this way for thread safety: whenever we add an event, we pull the slice off the channel, append the event to it, and put it back on the channel. That gives us a shared pointer containing this channel that we can keep updating and appending to safely. So if we branch out into multiple goroutines with a common context and a common memoir, as they add events we collect them in a thread-safe way, relatively in order.
There might be a race as to which event lands first: if you write to two caches inside goroutines, say, one run might record them in one order and the next run in the other. Finally, when we actually marshal the memoir, we have a custom JSON marshaller that supersedes the default: when MarshalJSON is called, it takes all of the events off the channel, marshals them as JSON, and defers putting them back onto the channel, so we can keep updating the memoir even after marshaling. Totally thread safe, really easy to use.

All right, let's look at ContextWithMemoir and FromContext. This is very similar to our no-op functions — surprise, you thought you were learning about async architectures; it's actually mostly about context. We use the typed context key again, this time for the memoir. ContextWithMemoir puts a memoir on the context — creating one if needed and making sure the channel has been initialized — and later we can take it back out with FromContext. You'll notice that FromContext returns a nil memoir if the context is nil or if no memoir is actually present. That's totally okay, because everywhere we use it — for instance whenever we add an event — we first check: is the memoir nil, are the events nil? If so, we return and do nothing; we don't need a memoir there. And because we only call MarshalJSON from the REST handler, checking the events field is enough — although it's probably good practice to also check m == nil.
All right, back to the handleRest function. Just like before, we read the body into a message and check the no-op flag; we add the no-op flag to the context, but now we also add the memoir to the context — same as before, plus a new memoir — and pass all of that into the HandleMessage function. That HandleMessage function is, again, the same one we use to process all of the data off the queue: whether the message came from Kafka or from this REST call, the handling is identical; the only difference is that the REST path carries a no-op flag and a memoir. And now our WriteToKafka function will add to that memoir. This is a very simple example — here we just add a string, but you can add much more complex data structures to your memoir for richer debugging. We marshal the message, and just before we check no-op, we record "I am writing this message to this Kafka topic." That's the important part: it lets us collect all the messages we would have written to all these services and return them in the REST response.

So let's put it all together. Going back to our flow — A goes to B, B goes to C, C back to A, A out to D and E — all of these services can carry API injection, memoirs, and no-ops. So we can walk through this structure very cleanly and see the whole thing: send a message into A, get the response back, move on to B, then to C. For the purposes of testing, it becomes much less asynchronous and much more synchronous, which is great. First we have our testing program. It could be a CLI, it could be part of your CI/CD pipeline, or it could just be a random engineer testing things manually. I do this all the time: if I'm testing a change locally and want to send it something, I'll spin up a small CLI to send REST requests, get the responses, and see where they go.
The flow is fairly straightforward: everything becomes a synchronous response. You send to A, and A sends back what it would have written to B. So you go over to B and send it that exact message, then over to C, and on down the pipeline — constructing each message from the actual output of the service before it. This is really important because it lets you walk your services and, if something changed — say two services both had a version update within a short time of each other, against integration tests that hadn't been updated — find out exactly where in the pipeline it fell down. Did the message actually get from A to B? From B to C? Are those messages legitimate? We can have a lot of confidence, because the messages we're sending between these services are the literal messages — not mocks — that would flow in production.

We're also able to check things like database writes and email sends. For instance, if you're wondering "I'm writing this to the database, so why isn't it available after a couple of days?" and you discover a time-to-live stamp that was calculated incorrectly — that's a bug you can find before anything reaches production, because you can do all of this in a test environment, or deploy the service as a canary and inspect the actual data coming through.

This is also really valuable for blue-green testing, where you have two instances of the same service and you want to verify that each behaves as expected. You can send the same message into your version one and your version two — your blue and your green — and compare: I sent this message into A and got back one response; I sent it into A 2.0 and got back a different response. Then you can analyze, automatically or manually: is the change valid? Is there any change at all? Is this the behavior I expect?

So that's the core of how we test our asynchronous pipeline. It has allowed us to be incredibly robust while handling trillions of messages in our system in a given week — tens of millions of messages a second — and it gives us high confidence in production, so we can deploy changes and guarantee they won't cause downtime. Because we're a security company, we can't afford to drop any detections. We can't afford to leave messages on the floor, because any one of them could be the million-dollar message — the one that saves some number of customers from some number of breaches. We really value our customers, and we strive to give them the best possible delivery, because our core business is stopping breaches.
Info
Channel: Gopher Academy
Views: 151
Id: xa_lLAucKKg
Length: 37min 1sec (2221 seconds)
Published: Fri Dec 17 2021