Matt Heath - Building Microservice Architectures with Go

Captions
Hello — yeah, everyone can hear me. You'll be pleased to know I'm the last obstacle between everyone and beer, so I'll race through this quite quickly. Today I'm going to talk a bit about building microservice architectures, specifically with Go, but this is a general architecture talk, so obviously there's going to be a little bit of overlap with Neil's talk earlier, if you saw it — which is cool, there are a lot of recurring themes.

My name is Matt, I'm on Twitter here — this is a very huge screen, but it's somewhere over here — so if you've got any questions I'll be around today and tomorrow, and you can ping me on Twitter. I'm a back-end engineer primarily, and I've worked with distributed systems for quite a few years, specifically with Go for the last three years. Before my current job I was working at a company called Hailo, where we built an on-demand taxi app. We ran across three different continents — three different Amazon regions — and we used a lot of distributed databases, because we obviously had to have high availability. Over the course of a year or so we moved from a monolithic PHP and Java application to a microservice-based architecture. That was about two or three years ago, and it was really interesting — we learnt a huge amount — and I've been able to take that with me, along with a couple of other people I used to work with, to the company I work at now, which is Mondo.

At Mondo we're building a new kind of bank. That doesn't sound like the traditional place to use extremely new, bleeding-edge technologies. Banking is stuck in an age where you have branches, you go to branches, and people send you a statement with your list of transactions on it, on paper, through the post. Banks have moved online, except it's still essentially the same thing — you have a list of things in your app and that's considered internet banking. There's not really been any change there, and fundamentally that's because their systems are like this: huge numbers of mainframes running COBOL and all sorts of other systems. They have to be extremely reliable and they can't be changed very often — they're quite literally the definition of a monolithic system. On top of that, lots of banks are using extremely new technologies — Docker and other scalable technologies — but only around the edges; they can never really change the systems right at the core, and that's a problem. Because for us the world has changed: everyone expects to get stuff on their phone immediately, in real time. If I go and pay with my card, it should show up immediately in my app; it shouldn't take two days, or even one day. That's what we're building.

So we're mobile first, and we have contextual information — I can see where I actually bought things. This is directly from my app a couple of months ago. I can attach receipts, I can search — you can actually search for emoji, an extremely useful feature — and we can do contextual things like "this month", which is super easy for search. We can also use contextual information about where you were when you bought things, which allows us to detect fraud more easily, or attach pictures to each transaction. Most people use this to take pictures of their receipts and flag things as expenses; I take pictures of my brunch, which I can do.
And then there are other kinds of features you'd really expect if you built something from scratch. If you lost your card you'd be able to freeze it, so it wouldn't work, but then — crucially — unfreeze it if you find it again. That's a flag in your database, right? It's a boolean flag. Except in the UK this takes seven days to happen: you can't un-cancel a card, you get sent a new one through the post.

So if we're going to build one of these applications right from the beginning, where would we start? Traditionally we'd start with a large application — very simple, doing exactly what we need it to do — and as a starting point we'd build the minimum viable product. Behind that we'd have some form of database; it'd be quite nice and simple. Then over time this is going to expand: we get a larger application and it gets more complicated. This is the story that lots and lots of people have gone through. Lots of startups have grown to the point where the development team is too big to coordinate everyone working on one single codebase. We end up with more databases, more features — searching, caching, all sorts of other things we need — and over time our development process slows to a halt. I've worked on projects like this: it's extremely difficult to get changes in and then deployed, so the deployment frequency slows down because you have to spend longer testing each release, and you end up in a position where you're not shipping for months. That's really what's happening at this point: we have a monolithic system, we're shipping slowly, and this is my face when I have to work on it. Fundamentally there's not a huge amount I can do as a developer working on that team — I have to get my changes in, I'm trying to deliver business value and ship a product, except I'm constrained by my development tools.

The standard thing we've heard a lot over the last couple of years is to start chipping away at these systems. We break sections off and isolate them within a bounded context — small amounts of functionality that do one thing and do it well — and at the end of that we end up with a whole system of these systems. This is essentially just a service-oriented architecture. As Neil mentioned, there's a lot of very specific terminology around service-oriented architectures, whether that means enterprise service buses and all sorts of horrible things; in this case all I mean is that there are a number of applications coordinating across some mechanism to deliver our product. These might be synchronous, they might be asynchronous, they might be event-driven — we can build all of these things in, because they're all independent applications and we're not constrained to any particular technology for each one.

So why would we start with a microservice architecture? A little bit of background: Mondo has only existed for about a year. We founded the company last February, over a period of about three or four months we got debit cards working, and then over the next few months we added additional functionality and launched into an alpha stage. Obviously as a startup you need to know your product, you need to work out exactly what customers want, and that means you need to get something into the market extremely quickly — and microservice architectures have a huge amount of upfront cost. There's a lot of work you usually have to do just to get a simple product working.
But the reason we did this was because we need speed. In the UK there are four or five really huge banks, and they are massive corporations — Barclays spent three billion pounds on IT in 2014. We do not have three billion pounds worth of funding; we have no way of competing with that. The only real advantage we have is that we're extremely small and nimble, so we need to move extremely quickly. At some point these large companies will realise what's going on — they're already aware, there are a lot of really smart people who work in those places — and they will slowly change their businesses, so we need to be a long way ahead of them before they can catch up.

When we're building these systems we really start with the single responsibility principle. For every single one of our services — at Mondo we have about 90 at the moment, Hailo have about 250 — we make sure it does one thing and only one thing. They're usually quite small, and as a result we can scale them and we can understand exactly what's going on in that small box. To do that we take our problem domain and break it apart into a number of bounded contexts. Someone's bank account — that is one thing; their customer details, like their addresses, would be a different system — and we can break these things apart along those lines. The nice thing is that if we're modelling this in a domain-driven way, we get really nice break points and really nice interfaces between these systems, and the result is really nice composability. What I mean by that is: say we need to send push notifications. We have a push notification service, and it has an interface we can interact with, and that's it. I don't need to deal with idempotency — the service will deal with that for us. It will make sure that if our APNs gateway is down it handles retries, and potentially it does slightly more complicated things like exponential back-off, maybe with jitter, but my other systems don't need to be aware of any of that. They just want to send a push notification, so they talk to that service and trust that it's going to handle it for them.

So how would we start? In our case we start with a load balancer right at the top, and behind that we have some form of HTTP API. Crucially, we have a mobile app — we also have to integrate with a lot of other third-party systems, banking networks, things like that — but our mobile app is just HTTP and JSON, so we need some way of terminating SSL and dealing with our requests. Behind that, we fire those requests internally over some form of transport — that could be HTTP, it might be REST, it might be a message bus using AMQP or RabbitMQ or something like that. Behind that we have a number of services which all talk to each other, and then, crucially, some of those have databases, and those are only used within each individual service. There should be no talking from one service to a different service's database — that's a recipe for disaster, because it makes things really tightly coupled and extremely difficult to migrate.

In this case, what we've really got right at the top is actually an API gateway, and in our case the API gateway deals with things like rate limiting, so our services underneath don't really need to be aware of that.
It also does things like taking the messages off HTTP — in our case we actually use RabbitMQ behind this, so we take HTTP in and then fire the requests internally over our internal network. The cool thing about having an API gateway right at the top is that you can do quite interesting migration patterns. At Hailo we had quite a large application, and what we could do was slot this gateway in above it, and it just proxied all of the existing traffic through to the current thing — literally just reverse proxying. Then we could move that off to one side and introduce a number of our services, and what the API gateway was doing was matching particular routing patterns and sending those requests off to other services, which allowed us to migrate straight over, chopping away certain sections of our infrastructure. In Hailo's case we took in huge quantities of GPS points — if you've got taxis driving around, you take in hundreds of millions of GPS points — and that was a huge percentage of our API traffic. So the really nice thing we could do was build the services that just handled that section, switch it over, and we'd already massively reduced our costs and could shut down loads and loads of servers — mainly because we were moving from PHP to Go.

So, hands up, anyone who's used Go here? Two people — excellent. For anyone who hasn't used Go: Go is a quite new language — it's been around six years now — and it's got quite a few nice properties. The language itself is really, really simple. There's a huge amount of other functionality you can use, but the actual language constructs and the syntax are extremely simple, and for me that's quite nice because it means you can get up to speed really quickly. In some languages I find I write some code and it totally works, but then I'm not sure if it's the most idiomatic code and wonder whether I should rewrite it in a slightly better way. In Go I don't really get that: it's very straightforward and kind of procedural, but that's awesome — it's extremely simple to read through codebases, and as a result it's really great for a development team. It's also statically typed, which, if you're dealing with people's money, is quite useful — we don't have any floating-point issues, which is nice, or any type coercion problems. And it's statically linked, which means you can take your Go program, compile it for a particular architecture, and then just put that binary — a single file — on a machine and run it, with no runtime dependencies, so deployment is extremely simple.

On top of that it has a really comprehensive standard library, with things like creating an HTTP server in three or four lines of code. It's really quite nice, because Go 1.6, which has just launched, has native HTTP/2 support — if you have a Go 1.5 server and you recompile it with Go 1.6, it will natively upgrade your clients to HTTP/2 without you having to do anything. The garbage collection performance improves with every version. It's just really nice that those things are built in. In microservice architectures in particular you have a huge number of network calls, so it matters that the networking library is really good and does all the normal things — connection pooling, keep-alives and so on — automatically.
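As a rough illustration of the "HTTP server in a few lines" point above — this is a generic sketch rather than anything from Mondo, and the path and port are made up:

```go
package main

import (
	"fmt"
	"log"
	"net/http"
)

func main() {
	// Register a handler and start serving.
	http.HandleFunc("/ping", func(w http.ResponseWriter, r *http.Request) {
		fmt.Fprintln(w, "pong")
	})
	// With Go 1.6+ and TLS, clients that support it are upgraded to
	// HTTP/2 automatically; no extra code is needed.
	log.Fatal(http.ListenAndServe(":8080", nil))
}
```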
But the main thing that you hear about Go is its concurrency. In Go you can take a function call and literally put "go" and a space in front of it, and it will execute asynchronously and concurrently with all of the other goroutines you have running, across multi-core machines as well. Go has a very efficient scheduler: it runs a small number of OS-level threads — one per core, plus a few others if goroutines get blocked on IO and things like that — and it multiplexes your goroutines across those OS-level threads, dealing with all the execution for you. So you essentially have an extremely simple threading model, but it's extremely performant because you're not paying for OS threads.

On top of that you have really nice interfaces. Go supports duck-typed interfaces: if you define an interface with a number of methods, any type that has those methods satisfies it, so you can switch out types, which is super useful for testing. As an example, in our case we have an abstracted transport: we can plug in HTTP, we can plug in RabbitMQ, but we can also plug in an in-memory mock, so we can stub that object for testing and run all of our tests in memory, which makes them thousands of times faster.

If you're starting to build a microservice architecture with Go, there are actually a few frameworks you can use. Go kit has been around for a year or so and provides a really nice framework for building up services. Micro is quite similar — again it lets you build clients and servers very easily, and there are a number of reference services, things like discovery and configuration, that the author has put together. gRPC has come out of Google and, I think, Stripe; it's a bit lower level, but one of the really nice things about gRPC is that it does things like streaming, so if you've got massive domain objects that you're moving between services you can stream those rather than doing the whole thing in a request–response style. Unfortunately, in our case none of these existed when we started building our infrastructure, so we have our own system. These libraries are on GitHub, they're open source and MIT licensed. What they do is provide an abstracted transport, with a layer on top of that which gives us a client and a server, and we can configure these with things like how you get configuration and how you do service discovery.
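As a rough sketch of what an abstracted transport with an in-memory mock might look like — this is illustrative only, not Mondo's actual library, and every name here is made up:

```go
package transport

import (
	"context"
	"fmt"
)

// Request and Response are placeholder message types.
type Request struct {
	Service, Endpoint string
	Body              []byte
}

type Response struct {
	Body []byte
}

// Transport is the abstraction the rest of the code depends on.
// Real implementations might wrap HTTP or RabbitMQ.
type Transport interface {
	Send(ctx context.Context, req Request) (Response, error)
}

// InMemoryTransport satisfies Transport by routing requests to handlers
// registered in a map, so tests never touch the network.
type InMemoryTransport struct {
	handlers map[string]func(Request) (Response, error)
}

func NewInMemoryTransport() *InMemoryTransport {
	return &InMemoryTransport{handlers: map[string]func(Request) (Response, error){}}
}

func (t *InMemoryTransport) Register(service, endpoint string, h func(Request) (Response, error)) {
	t.handlers[service+"."+endpoint] = h
}

func (t *InMemoryTransport) Send(ctx context.Context, req Request) (Response, error) {
	h, ok := t.handlers[req.Service+"."+req.Endpoint]
	if !ok {
		return Response{}, fmt.Errorf("no handler for %s.%s", req.Service, req.Endpoint)
	}
	return h(req)
}
```

Because Go interfaces are satisfied implicitly, a test can construct an InMemoryTransport and pass it anywhere a Transport is expected, which is the property being described above.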
So in our case, to go back to the previous example: we have a load balancer, we have our API routing layer, and behind this we have a number of services. These are API services — they deal with our API functionality. As an example, we have a webhook API, so you can register webhooks on your bank account, and every time you use a card you get a webhook immediately, usually before the receipt has printed in the machine — it's not very hard to send things in 200 milliseconds or so. In our case we take /webhooks on our API and route it internally to a service that is the webhook API, and the reason we do this is that all we have to do is deploy additional APIs. We have this routing pattern, which means that if I didn't have a webhook API before, I can now deploy it, it registers with our service discovery, our service discovery knows it's there, and our API layer — which is polling this continuously — now knows there is an application capable of handling /webhooks and will start routing that traffic through to it. What that means is that I can deploy entirely new sections of our API with zero impact on anything that is currently working in our infrastructure, and this makes our infrastructure a lot more stable. We don't have to change the HTTP routing layer right at the top — that's extremely stable, which is great if you're doing things like PCI compliance, where you have to do lots of change control on anything involved in a data flow that touches card numbers — and it just means it's much simpler, and we can test things really easily.

Behind this, the webhook API is then going to send off requests to a number of other services. The webhook API in this case is acting as an orchestration layer: it orchestrates calls to our other services and does things like authentication, while our webhooks service down here is the thing that actually stores webhooks in the database, listens to an event stream and decides when to dispatch them. It's responsible just for storing and processing webhooks, and we have a separate thing responsible for orchestrating the API our customers interact with. Behind these, the services have their own databases, which they use entirely on their own. We use Cassandra for almost all of our data, which means we have eventual consistency, but it's tunable — we can choose extremely high levels of consistency with lower availability, or actually accept eventual consistency in a number of cases — and it also means we can very easily scale outwards, as Cassandra scales linearly with its writes. Realistically, though, these services can do anything: we might have a number of services that do things like abstracting a third-party API, and each one of these again is just a single Go application — a single binary running on a machine — connecting to our message bus, scooping up messages and sending back replies.

Inside one of these services, things are broken into three main areas. We have the handlers right at the top — these are essentially just functions that take a particular message format and send a response. Behind that we obviously have all of our business logic, and underneath that we have all of these abstracted service providers. What I mean here is that if I want to write data into Cassandra and I'm doing application development, I shouldn't need to worry about where Cassandra is — Cassandra servers may go up and down, the IPs may change — all of that can be handled automatically for me, which means I can just write my code, deploy a new service, and have stuff in production within a couple of hours. That's the kind of speed we're developing at at Mondo. So we have our Mercury library at the top, which provides this framework that we can slot function calls into, and each one of these just takes a request and returns a response and an error. For anyone who hasn't used Go before, this is a standard type definition — effectively an interface in Go: we have a type called Handler, and that type is a function that takes a request and returns a response and an error. The convention in Go is that most functions return multiple values, and the last value is an error, which may be nil or may be some kind of error type. This means that any function at all can satisfy this interface — as long as it takes a request and returns a response and an error — and we can build a lot of very flexible things off the back of that.
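A minimal sketch of the kind of handler type being described — the names Request, Response and Handler here are placeholders, not the actual Mercury definitions:

```go
package example

// Request and Response stand in for whatever message format the framework
// actually uses on the wire.
type Request struct{ Body []byte }
type Response struct{ Body []byte }

// Handler is a function type: it takes a request and returns a response
// and an error, following the usual Go error-last convention.
type Handler func(req Request) (Response, error)

// Ping is an ordinary function, but because its signature matches, it can
// be used anywhere a Handler is expected.
func Ping(req Request) (Response, error) {
	return Response{Body: []byte("pong")}, nil
}

// Compile-time check that Ping satisfies the Handler type.
var _ Handler = Ping
```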
Underneath, these handlers then go into our transport. Again, our transport is an interface, so we can use any number of different types that satisfy it. The interface looks something like this — most of it isn't particularly relevant, but the important thing is that it can do four things. It can start listening for messages; it can stop listening, so if we're going to shut down we can do it gracefully — stop receiving new requests, reply to all the in-flight ones, then shut the service down; it can send new requests if we're calling another service; and it can respond to inbound requests. Because this can be any type that satisfies the interface, when we're testing we can switch it out for what is essentially a map wrapped in this type, where we put a request in and our response is just taken back out of the map.

Behind this we have our service and all of our business logic, and our storage is abstracted into a number of libraries. Neil mentioned moving a lot of this development out into platform teams, and that's something we do: we have a lot of core libraries in a service template — which this is — and that means we can improve the reliability of all of our clients, our database connections and the way we handle requests, then just rebuild services with the latest versions of those libraries, so our services become more resilient and more reliable over time. These libraries do a huge number of the things everyone talks about when they're talking about microservices. With Go we get really nice deployment — a single binary that we can put on any server. When these services come up they register with a central discovery system — in our case ZooKeeper, which we have a service wrapped around. ZooKeeper is a really nice consistent data store, but it doesn't provide much beyond key-value storage, so we have a service that deals with services registering, and with the fact that registrations should only have a particular lifetime: if a service hasn't reported that it's healthy within a certain amount of time, its registration should expire and we shouldn't communicate with that instance any more. It's the same for things like configuration, and our monitoring systems are plugged into this automatically — if you use a database, you're using the database driver from our service template, which already has monitoring hooks built in. That means if you deploy your service and you haven't actually created the keyspace or the database on your database server, it will flag up automatically in our monitoring systems, because there are tests built in to make sure you can read from and write to those databases. Circuit breakers are a really interesting one too, which I'll come on to in a little bit.

As far as deployment goes, as I mentioned, with Go this is really easy: we have a single binary, and if it's compiled for 64-bit Linux I can put that binary on any server.
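Here's a rough, self-contained sketch of the registration-with-a-lifetime idea — services re-register periodically and entries expire if they stop checking in. This is a simplified illustration, not Mondo's discovery service (which sits on top of ZooKeeper), and all names are invented:

```go
package discovery

import (
	"sync"
	"time"
)

// Registry tracks service instances and forgets any instance that
// hasn't re-registered within the TTL.
type Registry struct {
	mu  sync.Mutex
	ttl time.Duration
	// service name -> instance address -> last heartbeat time
	seen map[string]map[string]time.Time
}

func NewRegistry(ttl time.Duration) *Registry {
	return &Registry{ttl: ttl, seen: map[string]map[string]time.Time{}}
}

// Register records a heartbeat for an instance; healthy services call
// this periodically.
func (r *Registry) Register(service, addr string) {
	r.mu.Lock()
	defer r.mu.Unlock()
	if r.seen[service] == nil {
		r.seen[service] = map[string]time.Time{}
	}
	r.seen[service][addr] = time.Now()
}

// Instances returns only the instances whose registration has not expired.
func (r *Registry) Instances(service string) []string {
	r.mu.Lock()
	defer r.mu.Unlock()
	var out []string
	for addr, last := range r.seen[service] {
		if time.Since(last) < r.ttl {
			out = append(out, addr)
		} else {
			delete(r.seen[service], addr) // expired: stop routing to it
		}
	}
	return out
}
```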
Obviously there's a lot of tooling around this. The main one we all hear about is Docker, and Docker is great because with Go you already have a single binary — you don't have all the runtime dependencies — but on top of the repeatability Docker gives you, you get a lot of other things: you can limit memory and CPU usage, which you could do natively using cgroups but with Docker you get for free, and you get a whole ecosystem of tooling. It's really easy to take the binary you've compiled and put it in a really simple container — you don't need an entire Ubuntu image. Who uses Docker here? A few people. Hands up if you use full Linux images, or small ones? OK, a couple of people. The difference is that you're either shipping around a container that's maybe a couple of hundred megabytes, or a five-megabyte binary in a container that makes it fifteen megabytes, which is a lot easier. There are a couple of really good options: you can use BusyBox, or if you need a few more tools inside you can use something like Alpine Linux, but either way those are really small containers that you can ship anywhere in your infrastructure. As I mentioned, Go binaries are statically compiled, so there are no runtime dependencies inside the container, which makes deployment super easy. There are still a couple of things to bear in mind, though: if you use the scratch container, for example, which has nothing inside — a blank image — you don't have root certificates, so you need to copy those into your container before you run your binary, otherwise you can't do TLS, which for a bank is quite important.

Once you have your containers you can ship them onto any number of platforms. We actually use Mesos with Marathon as a scheduler, which means we have aggregated compute power across a number of availability zones in Amazon, and we can apply constraints — we'll never run more than one copy of a service per instance, or we'll run at least two per availability zone, things like that — so we get an even balance across the availability zones and we can make sure a single node failure won't take out every copy of a service.

Given that we're running on systems such as Amazon or other cloud providers, where our infrastructure is constantly shifting underneath us, how do we make our services more reliable? There's a lot we can do, but in many cases we have extremely low latency requirements: when you use a card in a card machine we have about 100 milliseconds to reply, which in Go is super easy — but it needs to be reliable, so our distribution of response times has to be quite tight. The way we get around most of this is by pushing everything we can to be asynchronous, so we use a lot of event-driven patterns within our infrastructure. By this I mean: given a synchronous call path — this would be charging a card, for example — there are a certain number of things we have to do immediately and can't make asynchronous, but the end result is that we can push a lot of other work out via an asynchronous event bus, and then any number of systems can listen to those events and take further actions off the back of them. A good example would be something like Kafka — we use NSQ — where we can publish events onto a durable message bus, so we know they will get actioned, and any number of subscribers can read them.
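A minimal sketch of publishing and consuming an event with the go-nsq client, assuming a local nsqd on its default port; the topic and channel names are invented for illustration and this isn't Mondo's actual event code:

```go
package main

import (
	"log"

	nsq "github.com/nsqio/go-nsq"
)

func main() {
	cfg := nsq.NewConfig()

	// Publisher side: a service publishes an event onto the durable bus.
	producer, err := nsq.NewProducer("127.0.0.1:4150", cfg)
	if err != nil {
		log.Fatal(err)
	}
	if err := producer.Publish("transaction.created", []byte(`{"amount_pence":250}`)); err != nil {
		log.Fatal(err)
	}

	// Subscriber side: each consuming service reads from its own channel,
	// so multiple services can react to the same event independently.
	consumer, err := nsq.NewConsumer("transaction.created", "webhook-dispatcher", cfg)
	if err != nil {
		log.Fatal(err)
	}
	consumer.AddHandler(nsq.HandlerFunc(func(m *nsq.Message) error {
		log.Printf("got event: %s", m.Body)
		// Returning a non-nil error requeues the message for retry, which
		// is why handlers need to be idempotent (see the Q&A later).
		return nil
	}))
	if err := consumer.ConnectToNSQD("127.0.0.1:4150"); err != nil {
		log.Fatal(err)
	}
	select {} // block forever
}
```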
The advantage there, again, is that we can deploy new services to add functionality without impacting any of our existing architecture, so we can test things out really easily. The downside, of course, is that you end up with something like this and you have no idea what's going on. We only have about 90 services at the moment, but we're running multiple copies per availability zone, we run across two availability zones, and at some point in the not-too-distant future we'll run across multiple Amazon regions. Keeping track of every single instance of a service is essentially impossible — I certainly can't keep in my head where all these things are or how all of our systems work — and we've still got quite a small team, so this is only going to get worse.

We can get around a lot of this with topology management, and we can do it all automatically. We use things like service discovery so that we know which services are available, which are currently in a degraded state, and which instances of each service are the fastest. If you're running on a cloud provider, some instances will simply be lower performance — I think I saw that Netflix used to test instances when they came up, kill the ones with really bad performance, get a new instance and test that one to see if it was better. We don't have that problem yet — maybe at some point in the future — but what we do have is services failing, and that's a problem, because if this call fails then our API fails and we return errors back to our clients. If we had a very naive approach here — say round-robin across all of the available instances — we'd get one in three failures at random, and that's not great. There are a lot of ways to mitigate this: if we can detect failure we can spin up new instances, and we can update our service discovery so that our clients only send messages to available, healthy instances of each service. That's great, but then you have a slightly different problem, where some instances are really slow. A particular one might have transient network problems — some of its calls work, it's still checking in to our service discovery system, it still thinks it's OK, but it's significantly slower than any other copy — and we need ways to test for that.

This is where circuit breakers come in — Go has a really nice circuit breaker implementation, and there are definitely a couple in Java and Scala. The way this usually works is that you have a number of hosts, and every time you make a request — say an HTTP request — you check one of these hosts out of a library, use it, and then check it back in, along with whether you got an error and how long the call took. From that, this central library in your service can keep track of which copies are fastest and which ones are erroring, so if a particular copy just returns lots of errors you might stop sending it requests for a certain period of time — which is the circuit breaker pattern — and if you do something a little bit nicer, you can actually preferentially send requests to the fastest instances.
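A rough sketch of that check-out/check-in idea — a tracker that records errors and latency per host, skips hosts that have been failing recently, and prefers the fastest. It's a toy illustration of the pattern, not any particular library, and all names and thresholds are invented:

```go
package balancer

import (
	"errors"
	"sync"
	"time"
)

type hostStats struct {
	recentErrors int
	lastError    time.Time
	avgLatency   time.Duration
}

// Tracker hands out hosts and records how each request went.
type Tracker struct {
	mu    sync.Mutex
	stats map[string]*hostStats
}

func NewTracker(hosts []string) *Tracker {
	t := &Tracker{stats: map[string]*hostStats{}}
	for _, h := range hosts {
		t.stats[h] = &hostStats{}
	}
	return t
}

// Checkout picks the host with the lowest average latency, skipping any
// host that has errored repeatedly in the last 30 seconds (the "open
// circuit" case).
func (t *Tracker) Checkout() (string, error) {
	t.mu.Lock()
	defer t.mu.Unlock()
	var best string
	for h, s := range t.stats {
		if s.recentErrors >= 5 && time.Since(s.lastError) < 30*time.Second {
			continue // circuit open: give this host a rest
		}
		if best == "" || s.avgLatency < t.stats[best].avgLatency {
			best = h
		}
	}
	if best == "" {
		return "", errors.New("no healthy hosts")
	}
	return best, nil
}

// Checkin records the outcome so future Checkout calls can use it.
func (t *Tracker) Checkin(host string, took time.Duration, err error) {
	t.mu.Lock()
	defer t.mu.Unlock()
	s := t.stats[host]
	if s == nil {
		return
	}
	if err != nil {
		s.recentErrors++
		s.lastError = time.Now()
		return
	}
	s.recentErrors = 0
	// Exponentially weighted moving average of latency.
	s.avgLatency = (s.avgLatency*4 + took) / 5
}
```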
What I mean by that is: you could have every copy of a service in an Amazon region available, but usually you'd preferentially send requests to the ones within your own availability zone, or even locally on the machine if you're co-locating services. You can use something like an epsilon-greedy algorithm, which will preferentially select the fastest instances while still occasionally trying the others.

The other way to deal with tail latencies is with fan-out and cancellation. Neil mentioned earlier how Google search works: when you type something in, it gives you an extremely fast search result, and if you type something slightly wrong it gives you possible corrections. The way you can do this is by firing off lots of searches in parallel, and you can do the same thing even when every request is doing the same work, as long as that work is cancellable. You send a request to one instance of a service; a few milliseconds later you send another request to a different instance, so now you've got two in flight; and then maybe you send a third, so at this point you've got three outstanding requests. Some of these will probably return really quickly, but if your client is set to time out at, say, the 95th percentile, then some of the time a response is going to come back really slowly and your client would have timed out — whereas in this case we might get a really fast response from one copy of the service, and at that point we can cancel the two outstanding requests. This is something we can do in our Go libraries, and the way we do it is by passing request IDs throughout our system, so we can say: here's a request — now cancel that one and that one, but don't cancel this one. In Go that's actually a little bit tricky: we have a context that we need to pass through on every request, because Go doesn't have thread-local variables, so you can't just store a request ID somewhere and have it follow the request — you have to pass it explicitly as an object. Go has a really nice context pattern for this: again, it's just an interface, some object with a number of methods, and the crucial one here is the Done method. When you call Done on a context it returns a channel of empty structs — channels are a really nice way to communicate between goroutines in Go — and you'll receive on that channel when the context is terminated. So you can implement this in an application where a request has gone several levels down the stack: you still have a copy of the context, you can cancel it, and all the way down in your application that work will return and just stop at that point.
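As a small illustration of that pattern using the standard context package — fire the same query at several hypothetical replicas, take the first answer and cancel the rest; the query function and replica names are made up:

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// queryReplica pretends to call one replica; it gives up early if the
// context is cancelled.
func queryReplica(ctx context.Context, addr string) (string, error) {
	select {
	case <-time.After(50 * time.Millisecond): // pretend network latency
		return "result from " + addr, nil
	case <-ctx.Done():
		return "", ctx.Err() // cancelled: stop doing work
	}
}

// first fans the query out to every replica and returns the first
// successful response, cancelling the stragglers. A real version would
// also give up if every replica failed.
func first(ctx context.Context, replicas []string) (string, error) {
	ctx, cancel := context.WithCancel(ctx)
	defer cancel() // cancels all outstanding calls once we return

	results := make(chan string, len(replicas))
	for _, addr := range replicas {
		go func(addr string) {
			if res, err := queryReplica(ctx, addr); err == nil {
				results <- res
			}
		}(addr)
	}
	select {
	case res := <-results:
		return res, nil
	case <-ctx.Done():
		return "", ctx.Err()
	}
}

func main() {
	res, err := first(context.Background(), []string{"replica-a", "replica-b", "replica-c"})
	fmt.Println(res, err)
}
```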
Given that we have quite a complex infrastructure, we need to be able to test it, and because we're dealing with people's money and low-latency payments, it needs to be extremely reliable. There are a few main things we do. Obviously load testing — I'd recommend doing that — and capacity planning is super important. But then also failure testing: things like using Netflix's Simian Army, where you can terminate instances, corrupt services and so on, and make sure your application still works. The thing I normally see problems with is degradation — when you have higher latencies in particular applications and you haven't configured timeouts correctly in other sections of your application. So this is something I would strongly recommend: if you have tests like these, run them, and hook them into your monitoring systems as well. All of our services, through the service template, register callbacks that send monitoring data constantly to our monitoring systems, which means that as an application developer I can build a service really quickly and everything is already built in — I don't need to worry about building monitoring hooks unless I want custom ones, and then we have a framework that allows me to do that.

The nice thing is that if you have monitoring systems, you want to test that the things you actually expect to work do work — things like card payments. So, who has acceptance tests for their application? No one? Please, please — come on. OK, a few people have acceptance tests; everyone else, homework: write acceptance tests for your application, you should definitely have them. In our case our acceptance tests work with essentially fake customers and fake cards, and we can assert not only that our acceptance tests pass in our build pipeline — locally on my machine and on our build system — but also, once we've deployed to production, we run them again continuously. We have real cards making real payments constantly throughout the day, which means we're testing all of the functionality customers actually use, and as long as our acceptance tests pass in production, we're up — customers are unaware of any problems. Even if we have database failures, or servers that have run out of disk space, or crazy latency between different sections of our application, as long as our tests run successfully our customers are unaffected, and we can be very confident of that. Crucially, that matters for us, because if you go to a cash machine and your card doesn't work, you're going to be really, really angry and you'll probably switch banks, so it's crucial that these kinds of things work all the time.

So we have two kinds of monitoring. As I mentioned, we have out-of-band checks — outside our infrastructure, continually testing that our application works — and then in-band checks built into the services themselves: pinging, checking that the transport layer is working, checking that they have a connection to the database, checking that they don't have too many connections open to an external supplier — maybe we've got a bug in our connection pooling logic and we're establishing hundreds or thousands of new connections constantly. Again, in Go you can do this really easily: we have a type which is just a function that returns an error, so it could be any function, and because Go has closures you can pass configuration in and get a custom checker back. We plug these into our monitoring framework, which executes them periodically, and if one returns an error then we have a problem and it goes into our monitoring systems.
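A minimal sketch of that checker pattern — a function type returning an error, plus a closure that builds a custom checker; the names and the monitoring loop are illustrative, not the actual framework:

```go
package health

import (
	"database/sql"
	"fmt"
	"log"
	"time"
)

// Checker is any function that returns nil when healthy.
type Checker func() error

// DatabaseChecker uses a closure to capture the connection and return a
// custom Checker for it.
func DatabaseChecker(db *sql.DB) Checker {
	return func() error {
		if err := db.Ping(); err != nil {
			return fmt.Errorf("database unreachable: %w", err)
		}
		return nil
	}
}

// Run executes every registered checker periodically and reports
// failures — in a real system this would feed the monitoring pipeline.
func Run(interval time.Duration, checks map[string]Checker) {
	for range time.Tick(interval) {
		for name, check := range checks {
			if err := check(); err != nil {
				log.Printf("health check %q failed: %v", name, err)
			}
		}
	}
}
```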
On top of that, there are lots of other tools you can use to debug performance problems and all the other problems you have with a distributed infrastructure, and one really cool one is distributed tracing. Has anyone used Zipkin or something like it? A couple of people. These are implementations of Google's Dapper paper, and there are a few others as well. What this really does is give us really good insight into what happened on a particular request: we can aggregate traces to get an overview of our system, but we can also debug individual requests, even though they've gone through 20 or 30 different systems. The way we do this, thinking of it as a sequence diagram, is that right at the top of our API we generate a random ID for the request — something like a version 4 UUID, so completely random — and this is then passed all the way down through our application stack. When a request leaves the API we take the request ID, marshal it into headers on the wire, send it across, and then unmarshal it from the headers back into this context object on the other side; when that service sends off another client request we again take our context, marshal it into headers, and send it down to the next service. That means every service has this unique request ID and we always know which request we're dealing with. Here we're using the same context interface, but a slightly different piece of its functionality: it has a key-value store, so you can put in a key of any type and get back a value of any type — the empty interface in Go essentially represents any type.

On the return path, each of these points sends tracing data — instrumentation data about whether we got an error, whether we sent a request and got a reply, whether we timed out, how long it took, the service instance that handled the request, the machine it was on, all those kinds of things — and we aggregate all of that into a central system. You could use Zipkin; we have an open-source one called Phosphor. The end result is something ridiculous like this — sometimes megabytes of JSON, completely unintelligible, I have no idea what's going on here — but we can take all of that and reconstruct what happened on a particular request.

This is an example from our production systems: someone in a shop has put their card in the card machine and charged their card, and we do a lot of things off the back of that. We have our API, and a synchronous call to our card processing service — we have a card API that takes all of our card payment requests. We pass that down to another service that understands how those payments work — the top level in this case is actually dealing with XML, and it's basically a SOAP server, because banking networks use SOAP; a lot of them do, SOAP is quite advanced for them, so, great. We then pass that down to other systems: we look up whether the card actually exists, and if it does, and it hasn't been blocked or temporarily frozen, we carry on down, put the payment into our transaction system, and do a number of other things there. The end result is that we do all of these sections synchronously, then respond back to our card processor, and as a result of that we publish an event onto a durable message bus — and everything else from this point is asynchronous.
If our services are down we can pause things — events will queue up and we can process them later — and we do everything separately. When we take in a card payment we go and look up merchant data: if you use your card at a petrol station like BP, we'll identify that and go off and do things like get the logo from Twitter — we actually have a whole system that aggregates this and vets it to make sure it looks OK. As a result of that we publish another event asynchronously, and that goes into a separate service that actually puts an item into the app — in the app we have a feed of transactions, and the point at which the item is slotted into the feed and becomes visible to the customer is entirely asynchronous. So if those systems were down, your card would still work but your app might not update, which is preferable to your card not working at all. As a result of that we then send you a push notification, and again that's asynchronous — if we haven't processed it within a short period of time we cancel it, but if our systems are down, if our APNs gateway isn't working, we'll re-queue these and try again.

This sounds like it takes a huge amount of time. Basically, we get this at the end: a coffee cup, because this was an espresso shop; we know exactly how much it was — we actually do the currency conversion in real time — we have that contextual information, and we can do all of this in about 200 milliseconds. It's really easy. When you use your card, banking networks are really slow — you use your card and it takes a few seconds before the till has accepted the payment, and then a couple more seconds before the receipt prints — whereas we can get something to someone's phone in a couple of hundred milliseconds, so 90 per cent of the time we actually beat the receipt machine. That's really not very hard to do with modern systems. Except this isn't really what we do — we do something that's a bit more like this. The end result is that I use my card and my phone buzzes before the receipt prints, or sometimes before I get asked whether I'd like a receipt.

The cool thing, again, is that because we have this request ID at every stage of our application, we can look up log lines. We try not to write too many log lines, because logging is extremely expensive — I don't know if anyone's used things like Kibana or Logstash? Cool — yeah, if you stick ten billion log lines a day into that, it doesn't work; it's really hard to scale. But we never really need to search across every log; we need to search for the specific log lines that are interesting. We have different levels — error, debug, info — so we can throw away debug lines really quickly and keep errors, or more serious errors, for a much longer period so we can investigate them. And this is for one particular request: we have every log line from this request, even though it traversed ten different services across ten different physical machines, in ten completely different log files. That's quite easy to do, because all we're doing is looking up by a single key — a key-value lookup — and Cassandra is really good at that kind of thing. We can also do aggregates, so we can generate diagrams like this, which are pretty but not very useful; they do convey the complexity we're dealing with, though.
With microservice architectures you need a lot of tooling to be able to debug these problems when you have them. This is an example where the width of the line is the amount of traffic and the colour is whether it's healthy or not — there are a couple of failures in here, but hopefully everything's OK.

So, if you're starting to build a microservice architecture, should you use Go? Obviously I'm quite biased, so I'm going to say yes. I think it has a number of advantages: it's really easy to learn because the language is so small, so you can be productive in a really short period of time, and because of the language features it's extremely good for microservice-style architectures — you can have highly concurrent services that in some cases do ten thousand or a hundred thousand requests a second on a single process, and it scales pretty much linearly with the machine, so performance-wise it's really good. There are obviously downsides too. Go historically hasn't had particularly good dependency management — there are a few alternatives now, but it currently means that if you have lots of third-party libraries you have to vendor all of them into a repository, which is less good. Also, because the language is relatively young, the quality of those third-party libraries is quite variable: at Hailo we were stuck for about a year maintaining our own Cassandra driver, which is not ideal. Thankfully time has moved on a bit and most of the database drivers are now really reliable, but there are still libraries that just don't exist or are of quite poor quality.

And should you start with a microservice architecture? I'd say in most cases it's still not necessarily the best thing to do, because the main problem you have — certainly if you've just started your company — is understanding the problem you really have. Unless you understand the business problem, it's very difficult to break it apart into services at the right level, so you'll get those abstractions wrong, and because of that you'll have to rewrite things and throw things away, which for an extremely early-stage company is a lot of extra work. But if you've already built your application and you know how it works, then you can make far better decisions, and that's a really good time to start chipping away and breaking your application into much smaller parts.

That's about it — thank you very much. Does anyone have any questions before we drink lots of beer? Yeah — what kind of information? Yep. So the question is how we get the information about cards when they're used. We actually issue our own cards, which look a bit like that, and we have a card processor. We're in the process of becoming a bank — in about six to nine months we should get full approval — and until then we have a card processor who we've interfaced with, so we're in the authorisation loop. When I charge the card in a machine we get a real-time, synchronous request from our card processor — it's a SOAP request — and as a result we get a small amount of information, things like the postal code or the name of the merchant, from the MasterCard network. No, it's a standard MasterCard, so it's a normal card; we get a small amount of information from MasterCard about that transaction, and from that we approve or decline it in real time, as the card processor.
Then we have an asynchronous pipeline which takes that small amount of information and does things like looking it up against Foursquare, Google Places and other location-based services.

Yes — so our card processor has a 200-millisecond timeout on their end, and within that the request has to come across a VPN and negotiate TLS — because the requests are comparatively few we don't have a lot of established connections — and then we have to do all of our internal processing and return within those 200 milliseconds. So we have about 100 milliseconds on our side by the time we've done TLS and all those other things. Anyone else? Yep.

Yeah, so we have our own system, which is open source and very similar to Zipkin. The question is how we aggregate all of these things together, bearing in mind we have a request that's traversed, let's say, five systems synchronously, and then we've published an event as a result of that, which has triggered an unknown number of other things at some point in the future. We pass the request ID through our event bus as well, so every touchpoint has that UUID, and every touchpoint — synchronous or asynchronous — publishes its data onto an internal event bus. We have a separate service which aggregates all of that and currently stores it in Cassandra. The important thing is that we're not doing full-text searching: the only search we ever do is by the UUID, and in Cassandra that means you have a wide row and you can put every single event into it and pull it out very, very quickly. It does mean we're writing maybe 10,000 rows a second, but we can throw the data away very quickly, so it's not a huge problem to store. Zipkin can do the vast majority of that and, to be fair, has much better tooling, but it doesn't do any of the asynchronous parts, which are extremely important in our systems. Anyone else? Yep.

How do we avoid data loss in the financial parts? As I mentioned, we use Cassandra, and Cassandra is really nice because it has tunable consistency. We have a number of Cassandra nodes, and for everything that's extremely important we can write at a very high level of consistency — we can actually write to all of the nodes, which guarantees you've written it but means that if one of the nodes is down you can't write — or you can write to a quorum of nodes, which is far more performant. For our cards, at the moment our card processor is the system of record, which means they hold the balance officially and we hold it internally; when we become a bank we'll need a fully consistent data store to store people's money, but until then we can do most of these things with quorum-level writes. My experience with Cassandra is that if you write at quorum you have quite a reliable store of data, and you can also tune Cassandra to do things like immediately sync data to disk, which lowers your write performance but again makes it more reliable in the case of process crashes. As far as consistent data stores go, you can obviously use any number of large enterprise databases, or things like Postgres where you have serialisable consistency — the problem then is that you have issues with replication and node failure. In our case there's really only one system which is the ledger of all movements of money — not the metadata, just literally "this account moved that money to that account".
With double-entry accounting, that's the only part that has to be strictly consistent, and even then the choice of consistency level is really a risk decision for the business. So yes, we can use a consistent data store, but only behind one service, and most of the touchpoints are asynchronous, so if it's down it's less of an issue. Yep — so in our case, when we write something onto an asynchronous message bus — if you use something like Kafka, it's durable and you can write to a quorum of nodes; we use NSQ, and we write to a quorum — we actually publish each event multiple times and only acknowledge success once we've achieved that. That means we have duplicated messages throughout our system, but both Kafka and NSQ are at-least-once delivery systems anyway, so you have to handle duplicated delivery regardless. In our case we might get a message two or three times, and we make sure we process it correctly either by using something like a distributed lock — we use etcd for our locking — although that adds a lot of additional complexity, so we try to use locks as infrequently as possible, or by making everything idempotent. A lot of operations in our system are idempotent, so we can deal with them being retried. The message bus itself is durable and will retry, which means we might pause a particular service, or a whole section of the event bus, for a day or two if we had a serious problem — and NSQ does exponential back-off per message, so it deals with a lot of that for us. Cool, anyone else? Yep.

Yeah — in the case where we're reading, that's really easy: we can just send lots of requests in parallel. In the case where we're mutating state we now have race conditions, so where we do that we'd still be using something like a distributed lock, and we use the fan-out-and-retry mechanism less often, because otherwise we're always blocked waiting on that lock to expire. So it's mainly for the read path, to be honest — and most of our writes are actually quite low volumes of data, so they complete quickly. Yep.

Yep — so the question is whether we replicate data between services and how we deal with that. Take the example of a user profile: we store that in one place — there is a profile service which stores the user's name. Addresses are a bit more complicated, because we have to keep history for all of these things and you may have multiple addresses, for example, so we store those in services that just deal with those things. That does mean that if we send someone an email and we need their name to go on it, we'll call the profile service to get their name, and you might end up with quite a large chain of services that need to execute. What we usually do is try to abstract that up a bit so we can do as many of these things in parallel as possible: rather than calling one service, which calls two services, which call three services, we recognise that we need some of this data for all of these interfaces, so we can call all three in parallel and then call a final service that sends the email, for example. That also means that in some cases we can deal with partial failures — an example would be when we're returning search results to the end user in the app.
We look up the items in the feed, the transactions, things like particular merchants, and if the merchant lookups fail — or the attachments you've added — we can actually tolerate that and return the search results without that extra information. If you then tap on one of those search results, the app can request that data again, so it allows us to deal with partial failure. But we don't replicate data if we can avoid it: in this case, for a user profile, we have one domain object — the user — and there's one service responsible for the user's name and information, and in a lot of cases that's how it works. Underneath that, in Cassandra, we do denormalise the data a lot, because that's the way Cassandra works, but at a service level, no — we generally have one thing that's responsible for a particular section of our application.

Cool — anyone else? Otherwise I'll be around the whole time, and I think we have beer. Thank you very much.
Info
Channel: Voxxed Days Vienna
Views: 25,096
Rating: 4.8869257 out of 5
Keywords: VoxxedDays, Go, Voxxed Days Vienna, Vienna, Microservices, Voxxed Vienna, Devflix, Voxxed, Voxxed Days
Id: dVnMLtdJzn4
Length: 62min 11sec (3731 seconds)
Published: Thu Mar 17 2016