Design Microservice Architectures the Right Way

Reddit Comments

Damn they have some slick processes

👍 7 · u/jeefsiebs · Feb 20 2019

With all their slick processes in place, I am really surprised that they have no staging environment. He only briefly mentioned it at the end in the Q and A, but his reasoning was that since their testing processes are so tight that it's ok to build everything on production.

He says that they are using feature flags and pushing right to production. I just can't fathom how that is a good idea no matter how good you think your testing infra is.

👍 3 · u/apennypacker · Feb 21 2019

What's wrong with URL rewrite?

👍 1 · u/przemyslawlib · Feb 20 2019
Captions
Today we're going to talk about designing microservice architectures the right way, and I thought it would be helpful to start by sharing a personal story that I hope resonates with many of you. The names will stay unnamed, but it basically goes like this: "Could you please change this URL from foo.com/latest/bar to foo.com/1.5.3/bar?" And the answer is: "Sorry, that would take weeks; we don't have the resources to do that." It's just a frigging URL, and we have to ask: how does that happen? In this particular case the URL was in a library, there are hundreds of services to update, and many of those services have not been updated in a long time, which means updating a service means updating its dependencies, which frankly just takes time. That is the reality of the work, and it is incredibly frustrating when a simple task ends up being complicated or time-consuming in practice. This is where I think great architecture can really help us.

When we talk about great architecture, I think what we're really after is a few key features: the ability to scale development teams, to deliver higher quality software, to give ourselves a choice (are we after high performance, or low cost?), and to be able to make the changes that actually drive what we want in our business. One of the defining characteristics of great architecture, and this is the hard one, is that it supports future features naturally. The way we like to think about it: do we have a good design? I don't know; I'll let you know in three or four years, when we find out how everybody would like to use it and whether or not we made good decisions today. That's ultimately when we learn.

When we talk about not-so-great architecture, it often looks like this: people like to talk about spaghetti, and goto gets a lot of memes, but what's really happening is that we're trading near-term velocity for what I like to call future paralysis. In the microservice space we've personally seen this in many, many examples, where we're tempted by the benefits of microservices and we underestimate, or under-invest in, what is needed to build a great architecture. And this is what we end up with: one to two weeks of work to change a URL. So today's talk is about designing microservice architectures the right way; frankly, how do we avoid spaghetti and make a perfectly crafted meal, where each layer is itself simple and perfect, and together the whole is greater than the sum of its parts?

Briefly, my background: today I'm the co-founder and CTO of an enterprise SaaS company called Flow Commerce (flow.io), where we build software that helps brands expand internationally. From day one we built our company on microservices, applying many of the lessons we learned in our prior experience. Before that I was the co-founder and CTO of Gilt.com; if you're familiar with Gilt, it's a large-scale microservice architecture, probably over 400 applications now, quite a large company, over a thousand people and six to seven hundred million in annual revenue at the point in time we sold the company to Hudson's Bay. It was there that we really learned a lot about the benefits of microservice architectures in terms of scaling teams, delivering quality, isolation, and performance, as well as many of the challenges and, frankly, the areas where in hindsight we wish we had invested more. That's what today's talk will focus on.
Let's start with a few misconceptions.

Misconception number one: microservices enable our teams to choose the best programming languages and frameworks for their tasks. This is often cited as one of the big benefits of microservices: I can build one service in Go, one service in Rust, one service in Node, and one service in whatever language gets invented tomorrow. The reality is, and we'll demonstrate this today, it is super expensive to adopt new programming languages and frameworks, and the real bar here is team size and the level of investment in the architecture. One metric: if we look at Google, generally a great engineering company, they have something like 20,000 to 30,000 engineers at last count and eight programming languages. So I like to say one programming language for every 4,000 engineers as a good metric.

Misconception number two: code generation is evil. I used to think this. The reality is that code generation is just a technique, and what's really important, especially in these microservice architectures, is creating a defined schema that is actually one hundred percent trusted. Today we'll demonstrate one technique we use quite a bit; at Flow we leverage a significant amount of code generation in different parts of our software development process.

Misconception number three: the event log must be the source of truth. When we were starting Flow I also thought this, and I reached out to Jay Kreps, who co-created Kafka, wrote what I think is the definitive paper on the log (which many people here have surely read), and really helped us as we were scaling Gilt. I said: Jay, I don't get it. I've got a REST service and I'm creating a user. What am I supposed to do, publish an event and wait for that event to come back to my service so that I can respond to my client? Because after my client creates a user they may want to get that user's details, so how do I guarantee that I have those details? And from the horse's mouth he said: no, no, no, you just store that in the database, that's fine. So that's what we did: the resources are stored in databases that belong to the microservices, but then we guarantee, absolutely, at-least-once semantics that those messages are going to end up on the event stream.

Misconception number four: developers can maintain no more than three services each. We certainly heard this at Gilt, and a few folks from Netflix shared that three services per developer ended up being this kind of magic number: you get to that number, and then you stop feature development and all you do is babysit and maintain your services. I think this is the wrong metric to focus on, and if you're having conversations about this metric, it's a clear sign that you need to invest in automation and tooling; we'll go through a lot of the tooling we have here today. At Flow we're in our third year, we have about a hundred services, and the ratio is about five services per engineer. Every week we ask people how much time is being spent on maintenance, and the portion that goes into maintaining and loving our microservices is less than five percent. So it's absolutely doable.

Now let's touch on the Flow architecture, because it drives a lot of the content today. It's a distributed microservice architecture: at Flow, over 100 microservices, each one defining a REST API, and all of our services communicate via APIs. On the bottom, events: every service publishes events of interest, and there is a lambda architecture at the bottom.
I think the key thing here is that there are a lot of services interacting together. We do a few things that are quite unique. One: we don't have a private network, which means all of the products we build, our UIs, are actually built on the same APIs that we offer our clients. There is just one set of APIs and events at Flow.

One of the key practices, and I think the first critical decision to make when going into a microservice architecture, is how you are going to manage and define your APIs. Here we'll talk about REST APIs; we'll talk about events in a bit. This is what an API looks like at Flow. The very first artifact and step of software development at Flow is the design of the API. A few critical things: it's not in code, it's not annotations in code; it has to be language neutral. In this example we use JSON. Second, we define resources: everything is resource-first at Flow. That's why you see a definition of a user: a user has an ID, email, name, and status. Then, by convention, to create a user you use an object named with the resource name plus "_form"; in our language, forms are used to create instances of resources, and there you can see the data needed to create the resource.

One small thing, just because it's 2018: GDPR, the new privacy regulation for Europe, which a lot of people have had to scramble and invest to comply with. In an API-first world we simply added an annotation where we can say, at the field level, that for example email is considered personal data. From that we can automatically generate a complete trace, through every single service at Flow, of anything that may contain an email. We know; we don't have to guess, it's programmatic. And that's possible because we start with the API definitions.

So we've got our user; how do we actually interact with the user? This is how we create operations on the API. We take a user resource and expose it: here's a resource, it's a user, and it has two operations. One is a GET by ID; the second is a POST, and the POST accepts a body of type user_form. This is how you create a user: we take our user model and expose it as a resource, and that is what makes it available through the API.

What does it mean for the API to be first class? The API definitions are not in the microservice repos; they live in a dedicated git repo called "api". How do you make a change to an API? You open a text editor, you modify the JSON file, and you create a pull request. What happens when you create a pull request? Continuous integration, of course; why wouldn't we run automated tests on the definition of our API? Here you can see an example of a pull request in GitHub just on the definition of the API: we haven't touched the implementation, people can collaborate, we can have feedback through the standard tools we use, and on the right you see a set of linters that have run over the definition of the API.
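To make the user and user_form definitions described above concrete, here is a rough sketch of what a JSON definition along these lines might look like. It loosely follows the API Builder style; the exact field names, types, and the personal-data annotation are illustrative assumptions, not Flow's actual spec:

    {
      "name": "user",
      "models": {
        "user": {
          "fields": [
            { "name": "id", "type": "string" },
            { "name": "email", "type": "string", "required": false, "annotations": ["personal_data"] },
            { "name": "name", "type": "string", "required": false },
            { "name": "status", "type": "user_status" }
          ]
        },
        "user_form": {
          "fields": [
            { "name": "email", "type": "string", "required": false },
            { "name": "name", "type": "string", "required": false }
          ]
        }
      },
      "resources": {
        "user": {
          "operations": [
            { "method": "GET", "path": "/:id" },
            { "method": "POST", "body": { "type": "user_form" } }
          ]
        }
      }
    }

The resource, its form, and the operations exposed on it all live in one language-neutral document, which is what makes the downstream tooling (linters, code generators, the GDPR trace) possible.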
What does a linter look like? Here's an example of a real linter. Linters do lots of things, but one of the big goals of continuous integration on the API definition is that it really should feel like one person wrote the entire API. Your customers don't care that team one and team two have different opinions on how REST works; they care about using software from your company in a consistent way, and automated linters on the API definition are a pretty powerful tool to make sure things feel that way. This is just a simple example of a linter that walks through the entire service (the service is an instance of the entire API) and validates that everybody has defined paths in lower case.
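A rule like that can stay very small. Here is a hypothetical sketch of such a lint check in Scala; the case classes are simplified stand-ins for the parsed service definition a real linter would be given:

    // Simplified stand-ins for the parsed service definition
    case class Operation(method: String, path: String)
    case class Resource(plural: String, operations: Seq[Operation])
    case class Service(name: String, resources: Seq[Resource])

    object LowerCasePaths {
      // Returns one error message per operation whose path is not all lower case
      def validate(service: Service): Seq[String] =
        for {
          resource <- service.resources
          op       <- resource.operations
          if op.path != op.path.toLowerCase
        } yield s"${resource.plural} ${op.method} ${op.path}: paths must be all lower case"
    }

Run against every pull request on the api repo, dozens of small checks like this are what make the whole API feel like one author wrote it.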
Breaking changes in APIs: from a policy perspective, what's really interesting is that we're actually empowered to just make a decision and say we don't break things. That is just a decision we get to make. We make that decision for databases and schema design, and we can make the same decision for APIs: we just decide, don't break APIs. That's it; you just decide, it really is that simple. Once you make that decision, though, and I think it's a critical one, you have to build practices around it, and one of those practices is making sure you know when you may be about to break an API.

These examples are from a tool called API Builder, which is something we started at Gilt. It's an open-source, free hosted solution for managing API designs that captures a lot of these practices, and one of the really nice things that comes natively is annotation of every single change in the API, at a detailed level, including the changes that are breaking. So it's trivial now to add a continuous integration test that says: did you break your API? If so, the build fails, or at least you build a process to review it, so you'll know. It's also super interesting that this happens in the API design phase: before we've built any code, or really invested in implementation or any UAT, we already know up front that, wait a minute, we're on a path that may be a breaking change. Now we can decide whether that's something we want to do or whether we want to course-correct, and it's cheap to course-correct because this is the very beginning of the process. This is cool, by the way; I think it is great.

So now we can go ahead and start implementing our service. We've got our API, our user model, user form, and user resource; let's go do some stuff. This is the first time we get into code generation. If you're using gRPC or any binary protocol, you're probably already using code generation. The really important thing about code generation is that it's an opportunity to say the specification is in fact the first-class thing we built, so don't duplicate it: anything that can be driven off the specification is an opportunity, either dynamically through reflection or through code generation. We've become fans of code generation in particular because it's really easy for anybody to read the code, and we invest heavily in the generators we write to make them readable, so you can really understand what's happening without having to dig through megs and megs of source code in libraries you may not be familiar with.

Here's an example: we run apibuilder update for the user app, and it generates three types of things: the routes file, a client, and a mock client. Let's look at them. For the routes file: all of our microservices at Flow are written in Scala and Play, and in Play the way you respond to an HTTP request is to declare a route. We automatically generate this routes file from the API Builder specification: GET users by ID, POST users. This is really nice because now we're guaranteeing that our implementation has these methods defined: when we generate the routes and everything compiles, the compiler complains if those methods don't exist, so we actually have a guarantee that the operations exposed on our resources in the API are in fact implemented by the service. Another thing to note is user-friendly paths: we didn't actually specify the path, and we can provide nice defaults that are RESTful, with consistent naming. This is also really important. Users GET by ID, users POST; if I told you we had a resource called company, you could probably guess that to create a company it's going to be companies POST, and to get a company it's going to be companies GET by ID. That consistency again is really important, because we want the API to feel like it was built by one person for all of our users, whether they're internal or our clients.

Next, code generating a client library. This is the client library used to communicate with any of the services through the REST API, and this is an example of the implementation of the post method; this one is in Scala using Play JSON. The key things are that it's entirely generated from the specification and really friendly to use as a developer: it's .post, you pass in an instance of a user form, and that's it. The key message here is that there's little value in developers writing this over and over again. As you build microservices you'll have lots of them, and you'll spend all your time writing client libraries. Then imagine you introduce a second language, or say you use Scala internally like us and your client wants to interact with you in Ruby, or Go, or whatever: now you have to take all those fine libraries you wrote and write them again in every single language. That work, while valuable, starts to compete with work you could be doing on performance tuning, implementing more features, building new product. This is where I think a lot of things fall down in industry: when building code generation, a lot of people just optimize to make things possible, but that's not the intent. The goal here is to make the generated client so nice that a developer will love using it, because only then will developers not write their own handcrafted client library.

Third, we'll look at generating a mock client. We're going to talk a lot about testing; testing has to be thought about from the start, particularly in a microservice architecture where there's a lot of asynchronous communication going on. This is an example of the actual method generating the mock client. The mock clients we produce in Scala compile and are fully functional, and because they come from that same API specification, they allow us to do high-fidelity, fast testing: we can write a bunch of unit tests and integration tests against the mocks and have confidence that those tests are sufficient to prove that things will work correctly in production. Mocking is never one hundred percent, and there are other techniques that complement this; you never get to one hundred percent confidence, but pretty close. In practice, in three years I can think of two bugs that made it to production that couldn't be caught by mocks, and they had to do with network things, like authentication being a little bit different on one particular resource. So we do this for everything: generate everything from the spec, and now we have a good way to go and test everything.
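As a rough picture of the developer-facing surface being described (the names and shapes here are illustrative, not the actual generated code), a generated client interface and its mock might look something like this:

    import scala.concurrent.Future

    // Illustrative generated models
    case class User(id: String, email: Option[String] = None, name: Option[String] = None)
    case class UserForm(email: Option[String] = None, name: Option[String] = None)

    // Illustrative generated client: one method per operation declared in the spec
    trait Users {
      def getById(id: String): Future[User]
      def post(form: UserForm): Future[User]
    }

    // Illustrative generated mock: fully functional, backed by an in-memory store,
    // so unit and integration tests stay fast while exercising the same interface
    class MockUsers extends Users {
      private val store = scala.collection.concurrent.TrieMap[String, User]()

      override def getById(id: String): Future[User] =
        store.get(id) match {
          case Some(user) => Future.successful(user)
          case None       => Future.failed(new NoSuchElementException(s"user $id")) // stands in for a 404
        }

      override def post(form: UserForm): Future[User] = {
        val user = User(id = java.util.UUID.randomUUID.toString, email = form.email, name = form.name)
        store.put(user.id, user)
        Future.successful(user)
      }
    }

The real generated clients speak HTTP with Play JSON underneath; the point is that the client and the mock come from the same spec, so tests written against the mock describe the same contract the service implements.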
Great, so now we're actually ready to write some code; we're developers, we like to write code. Here's what code looks like at Flow. This is the actual implementation of the post method: usersDao.create (we'll talk about the DAO in a bit). Basically, I've got a user form from the request body; back come either validation errors or a created user, and we pattern match and serialize to JSON. This is what basically all of our controllers at Flow look like; it's the same thing over and over again, validate and create. And first of all, that's beautiful code: simple to read, and that's what we want. The code we're actually writing as developers we make as simple as possible, and then there are even fewer bugs for our tests to catch.
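In Play terms, a controller in that shape might look roughly like the sketch below. The model, form, and DAO are hypothetical stand-ins (at Flow these pieces are generated from the spec and metadata), so read it as the pattern rather than their code:

    import javax.inject.Inject
    import play.api.libs.json.{JsError, JsSuccess, Json, OFormat}
    import play.api.mvc.{AbstractController, ControllerComponents}

    // Illustrative model and form; in practice these come from the generated code
    case class User(id: String, email: Option[String], name: Option[String])
    object User { implicit val format: OFormat[User] = Json.format[User] }

    case class UserForm(email: Option[String], name: Option[String])
    object UserForm { implicit val format: OFormat[UserForm] = Json.format[UserForm] }

    // Hypothetical data access object: returns validation errors or the created user
    trait UsersDao {
      def create(form: UserForm): Either[Seq[String], User]
    }

    class Users @Inject() (cc: ControllerComponents, usersDao: UsersDao) extends AbstractController(cc) {

      // POST /users: validate the form, create, serialize. Nothing else.
      def post() = Action(parse.json) { request =>
        request.body.validate[UserForm] match {
          case e: JsError =>
            UnprocessableEntity(JsError.toJson(e))
          case JsSuccess(form, _) =>
            usersDao.create(form) match {
              case Left(validationErrors) => UnprocessableEntity(Json.toJson(validationErrors))
              case Right(user)            => Created(Json.toJson(user))
            }
        }
      }
    }

The generated routes file is what ties POST /users to this method, which is how the "it must be implemented or it doesn't compile" guarantee shows up in practice.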
Now let's talk about this users DAO, and a little bit about database architecture. First: each microservice application owns its own database. The way we run this is, if you need a database you get a database, and that database belongs to the service. No other service is allowed to connect to it; it is private. The database is not part of a microservice's interface. Every other service communicates with the service either through the API or through events. This is really important, because once you let people connect to your database through JDBC, you lose the ability to know whether a change is safe. You just lose it, and over time it turns into effectively an NP-complete problem: you will not be able to prove whether you can make a change in the database, and that's hugely frustrating. We talk about tech debt in all these different variations; this is a very insidious form of tech debt. The solution is just: don't let anybody in. It's yours, it's not part of the interface. And it works, as long as we have a great API and we publish the right events, because everybody else will still be able to do the things they need.

How do you create a database? With dev rds; we're running on RDS in Amazon. You get your default settings, and this one is going to be called test-db. These are our defaults; you can change them if you want. The important thing here, in terms of investment in tooling, is that we have a single CLI, which we call dev, intended for developers, and that's what all developers use for all of our common infrastructure and development tasks. If I want to know how to do something and don't know where it is, the first thing you do is type dev, hit enter, and you get a menu of a bunch of stuff that people before you have done, now automated in this consistent way. This is super important; it has to be the same. One of the things I love, and I'll share this anyway: I love logging into Amazon, which of course nobody has to do on a daily basis because everything's automated, and just looking at our database names. They all follow the exact same naming convention. Everything is the same, everything is automated, and it's only automated because somebody took the time to invest in the CLI, so that the experts in databases do it once and everybody else benefits from their work. We don't need everyone to be an expert in every piece of technology.

Great, so now let's actually use the database. We're talking about code generation, and we like code generation, so why don't we describe our database needs in metadata and code generate our way to a solution? First we describe our Scala requirements: here, the package name is going to be db.generated, and the ID generator (this is just a Flow thing for how we generate unique IDs) will use the prefix usr for user. Then we describe our storage requirements in a psql attribute: the primary key is a field named id, and we would like an index on the field named email. We then wrote a code generator that takes this metadata and actually creates the table definition and the data access code. It's important to note that even though we're using the same tool chain for capturing the metadata and writing the code generators, we have divorced our storage needs from our API; those are two different things.

What does the table look like? Here's an example of the code generation that produces a table. There's nothing fancy here, but there are a few really interesting things. One: I personally hate debugging the difference between null and empty string in an Excel report. I don't know how many of you have done that; I hate it. So from the beginning we came up with a convention that we're not going to allow it at Flow. We have these util non-empty trimmed string constraints in Postgres, on fields like id, and if you try to insert an empty string you get an error. We're never going to have that problem; we're going to have good, clean data from the beginning, and because we're using code generation, this is how everything is. We have the ability to enforce a policy like this across the company.

Maybe more interesting, there's this thing called hash_code at the end of the table. This came from a conversation with a colleague at the beginning: as we started updating our records a lot, we kept generating all this load on the database, yet many of the updates were actually the same. Clients would just take their e-commerce catalog and send us the whole product catalog every day, some of them their whole catalog every hour, and not much changes every hour. So we implemented a global solution that simply computes the hash code of what we're about to write, and we only actually update the record in the database if the hash code changed. That feature is available for every single table at Flow, across every single microservice, and developers don't have to think about it, because we have the point of leverage of metadata and code generation. That's a really powerful thing; we've probably saved 100x writes on our databases, and we could do it globally because we've had the discipline to use metadata. Creating and defining a database table is not an area where, at Flow, we value creativity: this is a known problem, we just need to get the work done.

What do the Scala classes look like? The big thing here is simply normalized access to the databases. One small thing to pick out: because we documented that the email column has an index, on the findAll method (how you get a collection of objects) you'll see that we can filter by an email, or filter by the presence of an email. That is driven by the fact that there's an index there. The only reason I highlight this is that in industry there are a lot of people who say: oh my gosh, things are slow, I've got to fix it; I created an index, I'm a hero, things are fast again. No: the hero is the person who prevents you from ever having that problem in the first place; they're the unsung hero. To do that, you have to think about it in advance, and one way is to make sure your data access layer is tied to what you're actually indexing for retrieval. What will happen is a developer goes in and says "I need to find a user by email," there won't be a find-by-email method, and guess what, maybe we want to put an index on that. We catch it at the beginning of the design process and drive quality through the entire infrastructure.
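A minimal sketch of the hash_code idea, assuming a users table that carries a hash_code column; the SQL and the hashing function below are my own illustration of the technique, not Flow's generated code:

    import java.sql.Connection

    // Illustrative row shape
    case class UserRow(id: String, email: Option[String], name: Option[String], status: String)

    object UsersTable {

      // Fingerprint of everything we are about to write (other than the key itself)
      private def fingerprint(row: UserRow): Long =
        (row.email, row.name, row.status).hashCode().toLong

      // Skip the write entirely when the stored hash_code already matches, so
      // re-sending an identical catalog row costs (almost) nothing
      def upsert(row: UserRow)(implicit conn: Connection): Int = {
        val sql =
          """insert into users (id, email, name, status, hash_code)
            |values (?, ?, ?, ?, ?)
            |on conflict (id) do update
            |   set email = excluded.email,
            |       name = excluded.name,
            |       status = excluded.status,
            |       hash_code = excluded.hash_code
            | where users.hash_code <> excluded.hash_code""".stripMargin
        val stmt = conn.prepareStatement(sql)
        try {
          stmt.setString(1, row.id)
          stmt.setString(2, row.email.orNull)
          stmt.setString(3, row.name.orNull)
          stmt.setString(4, row.status)
          stmt.setLong(5, fingerprint(row))
          stmt.executeUpdate() // returns 0 when the row was identical and the update was skipped
        } finally {
          stmt.close()
        }
      }
    }

Because the generator emits this for every table, the write-skipping behavior is a platform-wide policy rather than something each developer has to remember.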
Great, testing. Here's an example of how to create an instance of our mock client for testing. In Play they use dependency injection, so we grab an instance of WSClient (the Play thing), the URL is localhost:$port, where $port is where our integration test is running, and here we put in two auth headers to identify as a user. Pretty basic stuff, but what this enables is real tests that look like this. For the getUserById method we started with at the beginning, this is an entire integration test, end to end, running with the mock clients from the generated code: I create a user, and then await (it's a Future) users.getById, and I had better get back the user I expected. The second test case: if I get a user by a random ID, I had better get back a not found, a 404. That's it. This is a real test, it's actually using the mock client, actually making an internal HTTP request within the Play framework, and we're testing end to end. And I will tell you, these tests work: when we write tests like this, we never find discrepancies in production. Said another way, over the past few years, through all of this focus on testing and really leveraging the specification, I've come to a point where I expect, and our team expects, that as code moves to production it just works. You're not surprised; we verify it, and it just works, over and over again. That drives quality, and that drives team velocity.
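The two tests he describes can be sketched without committing to a particular test framework. The client trait below stands in for the generated client bound to localhost and the test auth headers; everything here is illustrative:

    import java.util.UUID
    import scala.concurrent.{Await, Future}
    import scala.concurrent.duration._
    import scala.util.Try

    // Illustrative shapes; at Flow these come from the generated code
    case class User(id: String, email: Option[String], name: Option[String])
    case class UserForm(email: Option[String], name: Option[String])

    trait UsersClient {
      def post(form: UserForm): Future[User]
      def getById(id: String): Future[User]
    }

    class UsersResourceSpec(users: UsersClient) {

      private def await[T](f: Future[T]): T = Await.result(f, 5.seconds)

      // Create a user, then GET it by id and expect the same user back
      def getUserByIdReturnsTheUserWeCreated(): Unit = {
        val created = await(users.post(UserForm(email = Some("test@example.com"), name = Some("Test User"))))
        val found   = await(users.getById(created.id))
        assert(found == created)
      }

      // GET by a random id must come back as a not found
      // (the client surfaces the 404 as a failed Future in this sketch)
      def getUserByRandomIdIsA404(): Unit = {
        val result = Try(await(users.getById(UUID.randomUUID.toString)))
        assert(result.isFailure)
      }
    }

Because the same spec drives the mock and the real service, a test like this passing locally is strong evidence the deployed service will behave the same way.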
Great, so we've written our service, wrote our beautiful code, tested it; time to deploy. Let's talk quickly about deployment. Continuous delivery is a prerequisite to managing microservice architectures; you can quote me, it is absolutely essential. If your team is spending hours babysitting releases and you have 100 microservices, good luck: you're going to spend all your time deploying your services, and it will just bottleneck you. At Gilt, when we moved down the path from monolith and started to distribute, this was the first big investment Gilt made, a delivery system to deploy software, and it was an excellent decision. It probably took us nine months to get it to a point where it was reliable, but it was definitely the right first decision.

Continuous delivery means a ton of things; what we mean is that a deploy is triggered by a git tag. We use git to deploy: you create a tag, and the tag gets deployed. In addition, the continuous part is that we automatically create tags whenever there's a change on master. That triggers the system to create a tag, and the tag automatically kicks off a bunch of stuff, like creating a Docker image and setting the desired state. The principles of a continuous delivery system, our metrics, are 100% automated and 100% reliable. Rarely do systems behave that way, so a red flag is if deploys keep failing and you find developers having to log into lots of systems to debug why a deploy failed. All that time is wasted, and it needs to be fixed to get back the velocity across the platform.

Here's our dashboard and what it looks like: microservices and when they were last deployed. A deploy just sets the desired state to the latest tag, and white means nothing's going on. We use an open source project that we created in a week and a half at the beginning of Flow, called Delta. Let me say that again: our entire continuous delivery system, which deploys software thousands of times a week, we wrote in a week and a half, one person. That's it. You don't need a massive investment; this isn't an insanely large project. You just have to really focus on what you're delivering, and what you're delivering is a reliable pipeline to deploy software into the cloud. All these tools exist; you're just doing a little bit of plumbing to connect them. So if you're interested, that's Delta. Here's what it looks like: when you change something in git, GitHub sends a webhook, and Delta says, oh look, something changed in my project. What's the head of master? Oh, the head of master is something new: create a tag. I've got a new tag: set the desired state of my project to the new version, from 54 to 55 here. And then it just monitors, it just polls: is my Docker image ready? Great, my Docker image is ready. Hey ECS, go deploy that Docker image. And it just monitors; it doesn't do anything fancy.

Configuration: I'll speak for everyone here, I don't like describing my infrastructure needs with 10,000 lines of JSON. I don't understand it; I don't know what security groups are, I don't know what VPCs are, I'm a software developer in that role. One of the key things for microservices infrastructure is to really try to get it down to the most basic elements and let the people who understand infrastructure make those recommendations for the company. This is what our configuration looks like for deploying: it's six lines of YAML, and if you needed a bigger instance, my guess is everyone here could figure out how to do it. That's what we want: self-documenting.

Another key thing we do, and we see all over the place, is standardized health checks. Ours run on an internal health check: it's just a standard, well-known URL, and every service implements it. How do they implement it? They pull in the specification of the health check from the API spec, because how else would you expose an endpoint? Here you see a simple model called healthcheck which just has a status: we say "healthy" when healthy, otherwise we describe the problem. A 200 with "healthy" means good; a 422 means bad. This is really important, just a critical element: institutionalize some kind of health check into all the microservices, even if at the beginning it's just returning a 200, because at least you've got the placeholder to add checks in the future. Things that services do: make sure they have access to their database, make sure that any environment variables are actually available in production. What this allows us to do is, during a deploy, if any of that isn't ready, that instance simply fails to become healthy and is never put into traffic, and then we can go debug whatever happened with the deploy on our own time, as opposed to having an issue in production.
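A sketch of the kind of health check endpoint being described. The checks, the route, and the exact JSON shape are assumptions here; the convention from the talk is simply a 200 with "healthy" when good and a 422 describing the problem otherwise:

    import javax.inject.Inject
    import play.api.libs.json.Json
    import play.api.mvc.{AbstractController, ControllerComponents}

    // One named check with an optional problem description; real checks verify
    // things like database access and required environment variables
    trait Check {
      def name: String
      def problem: Option[String]
    }

    class Healthchecks @Inject() (cc: ControllerComponents, checks: Seq[Check])
        extends AbstractController(cc) {

      def get() = Action {
        val problems = checks.flatMap(c => c.problem.map(p => s"${c.name}: $p"))
        if (problems.isEmpty) {
          Ok(Json.obj("status" -> "healthy"))
        } else {
          // While this returns a non-2xx the instance never becomes healthy,
          // so it is never put into traffic
          UnprocessableEntity(Json.obj("status" -> "unhealthy", "problems" -> problems))
        }
      }
    }

Even a placeholder version of this (always returning 200) is worth institutionalizing early, because the deploy pipeline can then gate traffic on it from day one.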
And now it's time to talk about events. The way I like to describe our APIs is that at Flow we're incredibly proud of them: everything we do is API first, they're beautiful, they're well documented, they're simple, they're consistent, and we're perfectly happy if you never use our APIs. We'd prefer you to use our event streams instead. More than that, on our own internal network we don't want to use our APIs at all. There are a few rare cases where you really need synchronous operations, and in those instances we will make API calls; for everything else, our own services just consume events and process everything asynchronously. There have been a lot of talks on this, and particularly this year there's even more momentum around this sort of approach. It really works, but it requires, again, an investment to make the tooling right.

Let's talk first about a few principles of an event interface. First: a first-class schema for all events. You have to have events with a well-defined schema. Everyone who's using binary formats like gRPC is in great shape; everyone who's using things like Swagger is in terrible shape. The big difference is that binary formats force developers to use the code generation to produce and consume events, which is a good way of guaranteeing that the schema is in fact correct. Regardless of the tooling used, that correctness, for the events and the API, is critical, critical, critical to keep true. If you find an example in the organization where behavior in production differs from the declared spec, in my opinion that's like the Toyota kaizen process: you pull the alarm, everybody stops, and you fix the process so it can never happen again, because if developers lose trust in the specification, it turns into an incredibly huge bottleneck in software development. Second: all of our producers guarantee at-least-once delivery, and all of our consumers must assume multiple deliveries and therefore have to implement idempotency. These are the semantics we chose, quite common today, and it works; it puts a bit more emphasis on the consumers to implement idempotency, but it keeps the system quite reliable.

A few metrics: we're built on top of Kinesis, which has some inherent latency, but end-to-end single-event latency at Flow is about half a second, from the time a database record is created, published to Kinesis, consumed on the other side, and some action taken on it. For almost everything we do that is plenty, and the things that need to be faster, well, actually we've never had to make anything faster; we just do those few things synchronously. Our system is based on Postgres, and we'll walk through how we do it. We emphasized a simple system that was easy to debug and, frankly, low-tech, but it still scales to about a billion events per day per service, which for the majority of use cases that many of us interact with on a daily basis is plenty.

Here's how we do it. On the producers, we create a journal of all operations on the table. The journal basically stores every insert, update, and delete, along with the operation, so we have a complete history of everything that ever happened. In the users table, you can think of the record in the users table as the current view of that user, and behind it there is a journal users table which has every single operation. When we insert into the journal table, we queue that journal record to be published, and asynchronously, in real time, we publish one event per journal record: insert into users, insert into journal users, insert into a queue that says there is a journal record to publish, then notify (we use actors) an actor that something has changed; the actor comes in, gets a bunch of work, and publishes it to Kinesis. Replay becomes quite simple, because you can either just re-queue the record or, frankly, what developers do is just run an update, update users set id = id where id = 5, and it goes through the chain again and that user gets published.
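A sketch of the producer side of that pipeline. The publisher is abstracted away so this does not depend on a particular Kinesis SDK, and the type names are illustrative; the property being demonstrated is at-least-once delivery, where a record is only marked published after the publish call succeeds:

    // Abstracted stream publisher (in Flow's case this is Kinesis behind their library)
    trait StreamPublisher {
      def publish(streamName: String, partitionKey: String, payload: String): Unit
    }

    // One queued journal record waiting to be published
    case class PendingJournalRecord(journalId: Long, userId: String, operation: String, payloadJson: String)

    // Hypothetical drain loop; an actor (or scheduled task) calls drain whenever it is
    // notified that new journal records have been queued
    class UserEventPublisher(publisher: StreamPublisher, streamName: String) {

      def drain(pending: Seq[PendingJournalRecord], markPublished: Long => Unit): Unit =
        pending.foreach { rec =>
          // Publish first, mark published second: a crash in between means the record
          // is re-sent later (at-least-once), never silently lost
          publisher.publish(streamName, partitionKey = rec.userId, payload = rec.payloadJson)
          markPublished(rec.journalId)
        }
    }

Replay falls out of the same structure: re-queue a journal record, or touch the source row so a new journal record is written, and it flows through exactly the same path.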
On the consumer side, consumers read off of Kinesis, get a batch of records, and actually insert them into their own local database, basically temporary storage, partitioned for fast removal. On event arrival we just queue that there's an event to be consumed, send a message to an actor that something new has come in, and then we process in micro-batches: by default, every 250 milliseconds we grab a batch of all the new events and process them in the app. Any failures are recorded locally and published to a monitoring system, so we receive notifications if there's any build-up in the failure queues, and operationally we work so that there are no failures. Failures are treated as first class; usually it's not a production issue, but within hours it gets looked at and resolved, and often the solution is a bug fix or a replay. Visibility on errors is super important, and by having a local copy of the event in the consumer, once you fix the bug that caused the error in the first place, you just re-queue it to be processed. You don't have to go back to Kinesis, you don't have to go back to the producer: you have a local copy of your event, so it's really easy to write a test case for it, fix the problem, and then just re-queue the record to be processed.

This is how we define our event schemas. Again, we like one tool for everything, less to learn, so we use the same tool, API Builder, to define our events. The way this works is we define one model per event. In this case, for users we have two events: one called user_upserted, meaning an insert or an update on a user, and a second called user_deleted, meaning a user was deleted. We group the models into union types, so all the event types go into a single union type; in this case the union type is called user_event. By convention the first part is the name of the microservice, user, and the second part is the word event; the convention is important because we have a linter that checks events too. Then every union type maps to a single stream in Kinesis. This is nice because we stay in control: if we have a very high-velocity event it's easy to give it a dedicated stream, but the common case is that a microservice has one union type that's published to one stream for others to consume. Streams are owned by exactly one service: if there is a user event, it can only be published by a single microservice, so you can always go back to the source, and most services define exactly one stream.
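A rough sketch of what the event definitions might look like in the same JSON spec language. This is illustrative, loosely in API Builder's style, with field lists trimmed down; the event_id field is a guess at what precedes timestamp, since the talk only pins timestamp as the second field and organization as the third:

    {
      "models": {
        "user_upserted": {
          "fields": [
            { "name": "event_id", "type": "string" },
            { "name": "timestamp", "type": "date-time-iso8601" },
            { "name": "organization", "type": "string" },
            { "name": "user", "type": "user" }
          ]
        },
        "user_deleted": {
          "fields": [
            { "name": "event_id", "type": "string" },
            { "name": "timestamp", "type": "date-time-iso8601" },
            { "name": "organization", "type": "string" },
            { "name": "id", "type": "string" }
          ]
        }
      },
      "unions": {
        "user_event": {
          "types": [
            { "type": "user_upserted" },
            { "type": "user_deleted" }
          ]
        }
      }
    }

The union type user_event is what maps to the single Kinesis stream owned by the user service.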
Here's an example of what a linter looks like for events. One of the things we do is that every event must have a field called timestamp, and it must be the second field. For the third field: we're a SaaS platform, and we identify our customers by a field called organization, so if this is a model that is organization-specific, that field will be organization. And if the model has a field called number, for example, that will be the fourth field. Really pedantic stuff, but at the end of the day all of our events look the same: you can see a user event, and you can probably guess what a company event looks like; it will have the exact same structure. That again is super important, and this kind of consistency actually enables you to do interesting things, like consistently dropping your events into a data warehouse. You can do that programmatically because you have consistency.

What does the database look like? We talked about journaling: we have metadata that describes our storage requirements. Here the storage requirement is called journal, and journal has two attributes, how long and how frequent. In this one we're going to journal data for three days; what this actually does in the database is create daily partitions, and on day four we drop the old partition from four days ago. We use a couple of libraries: one is a journaling library for Postgres written by Ryan Martin from Gilt, and second, we use a partition manager, which I think is from Keith at OmniTI, a great library for partitioning on Postgres. I know he's optimistic that we're finally going to get native partitioning; we have native partitioning in Postgres 10, and by 11 we're hopeful that the feature set we need will become native. But that's it: all the developer has to do is declare the retention policy, and these journals are created for them.

Now we'll go to app code: how do we actually publish? The first thing we need to do is get a stream. How do you get a stream? We have a library for eventing that, frankly, we invested a lot of time in so that the developer experience of working with events would be easy and simple. Here's an example: q is our internal library, and I'm going to be producing user events, q.producer[UserEvent]. You'll notice that nowhere do you see the stream name; why should somebody have to write a stream name? They don't care. We can use reflection to figure out which stream we should be publishing to and make sure it's consistent. We like JSON because it works everywhere, but in cases where you actually need a binary format you can switch that, and we publish to a stream where we essentially embed the content type in the name of the stream.

This is the actual code to produce an event. It takes an instance of a user version; the user version maps to a record from the journal table, and all of that, again, is code generated, so developers don't have to worry about it. What a developer has to worry about is actually publishing the event they want, and this is exactly what they do: given a version of a user, if it was an insert or an update, publish an upserted; otherwise publish a deleted. Interestingly, because we've code generated the interfaces to this data, all of our app code starts to look the same, and an unspoken benefit is that any one of the developers on the back-end team can drop into any of the other microservices and be productive. Yes, they need domain knowledge and context and all that, but it all behaves the same, so the learning curve as teams shift and people move around really goes away, and we can stay focused on building product.
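The developer-facing publishing code can be sketched like this. UserVersion, the producer trait, and the event types are illustrative stand-ins for the generated pieces described above:

    // Illustrative generated pieces: a journal "version" of a user, and the event union type
    case class User(id: String, email: Option[String], name: Option[String])
    case class UserVersion(journalId: Long, operation: String, user: User) // operation: INSERT / UPDATE / DELETE

    sealed trait UserEvent
    case class UserUpserted(user: User) extends UserEvent
    case class UserDeleted(id: String) extends UserEvent

    // Stand-in for the producer obtained from the eventing library (q.producer[UserEvent])
    trait Producer[T] { def publish(event: T): Unit }

    object UserEventsPublisher {
      // Given a journaled version, publish an upserted for inserts/updates and a deleted for deletes
      def publish(producer: Producer[UserEvent], version: UserVersion): Unit =
        version.operation match {
          case "DELETE"            => producer.publish(UserDeleted(version.user.id))
          case "INSERT" | "UPDATE" => producer.publish(UserUpserted(version.user))
          case other               => sys.error(s"Unknown journal operation: $other")
        }
    }

The stream name never appears here, which is the point: the library resolves it from the event type, so it stays consistent across every service.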
Testing. Testing is super important. This is an actual test that goes end to end on publishing an event on user creation: we create a user, and then, eventually, our stream must contain an event of type user_upserted with the ID, email, and name of the user we created. An end-to-end test; that's it. And again, we've invested a lot of time in the library around streaming so that when this test passes locally, it works in production. The only difference in production is that instead of an in-memory queue we now have the network and Kinesis, but from an interface perspective, for us as developers, it doesn't matter.

Similarly on the consumer side, this is what it looks like: you receive a user event payload as JSON, cast it to a user_event, and then we can pattern match and just store a copy of the event. We do this quite a bit: if I need to operate on users, I keep my own copy of users locally and then interact with that data there. And similarly, testing on the consumer side; again, we spent a lot of time here making sure it's simple to write tests. factories.makeUserUpserted (Factories, you can probably guess, was code generated based on the API spec) gives us an instance of the user_upserted event. I can publish it to my mock stream, and within a few milliseconds we'll see it upserted into my local database, so I can just go check my database. This is an end-to-end test: event published, my consumer picked it up, and shortly thereafter stored a copy in my local database. And I can keep building on this. Critically, critically important.
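On the consumer side, the cast-then-pattern-match-and-store-locally handler he describes might be sketched like this. The event types mirror the illustrative ones above, the local store stands in for the consumer's own Postgres tables, and the discriminator convention is an assumption about how payloads are labelled. The handler has to be idempotent, because at-least-once delivery means the same event can arrive more than once:

    import play.api.libs.json.{JsValue, Json, Reads}

    // Illustrative event types, mirroring the union type in the spec
    case class UserData(id: String, email: Option[String], name: Option[String])
    sealed trait UserEvent
    case class UserUpserted(user: UserData) extends UserEvent
    case class UserDeleted(id: String) extends UserEvent

    // The consumer's own local copy of users (a table in its private database)
    trait LocalUsers {
      def upsert(user: UserData): Unit // insert-or-update by id, so duplicate deliveries are harmless
      def delete(id: String): Unit     // deleting an already-deleted row is also harmless
    }

    class UserEventHandler(local: LocalUsers) {

      private implicit val userReads: Reads[UserData] = Json.reads[UserData]

      // Turn the raw JSON payload into one of the union's types
      private def parse(discriminator: String, payload: JsValue): UserEvent =
        discriminator match {
          case "user_upserted" => UserUpserted((payload \ "user").as[UserData])
          case "user_deleted"  => UserDeleted((payload \ "id").as[String])
          case other           => sys.error(s"Unknown user_event type: $other")
        }

      // Invoked by the micro-batch loop for each locally stored event; must stay idempotent
      def handle(discriminator: String, payload: JsValue): Unit =
        parse(discriminator, payload) match {
          case UserUpserted(user) => local.upsert(user)
          case UserDeleted(id)    => local.delete(id)
        }
    }

A failed handle call leaves the locally stored event in place, which is what makes the fix-the-bug-then-re-queue workflow described above possible.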
So now our service is in production, it's working, we have a database, we can create users, everything is great; we're done, right? Now, dependencies. This is where things get really interesting. There's a decision to make in microservices; to paint the two broad extremes: one extreme is that once it's deployed you never touch it, and if you need to make an improvement you might as well rewrite it; on the flip side, you can decide to pay a tax as you go and just keep your dependencies up to date. We've chosen to pay that tax, and our goal in dependency management is to be able to automatically update all of our services to the latest dependencies. I think this is the right thing to do. It's debatable, because there's a tax you pay as you go, but it means that if there's a critical security update in a library we can get it out to all of our services in hours, and if we have a critical bug fix in our core libraries we can get it out in hours. It should take hours, not weeks or months. Interestingly, we thought a lot about making sure the process we use for the software we develop internally is the same as the process we use for all the open source libraries. It's the same; it's just code. I don't care if I know the author personally or we sit at the same table; at the end of the day it's just a library, it's just code, and we should have the same process whether we're together or apart. At Flow, process-wise, we upgrade our services generally at least once a week. Our process is once a week, but I think this week we did it twice; in fact we did it this morning, starting about 10:30 a.m., and it was done at 11:30.

We've really invested in the tooling here, and I think it's one of the best things we've done; honestly, I don't see this much in industry. This is dependency.flow.io, an open source project we built early on in the days of Flow. What it does: you connect it to GitHub and add a project; it then crawls your project and extracts all the dependencies automatically, your libraries. It then crawls all the resolvers all over the world and keeps track of every library and every version of every library, and turns that into an event stream back to you, the human: hey, for this project I have some recommendations; for my user project, you're using lib validation 0.0.17, and I'd like to suggest you upgrade to 0.0.18. And there's a crazy (if you're into this stuff) version tag parser built on Scala parser combinators, and it frigging works: there are no false positives, which is what drives the quality of the automation. If you're into Scala, it's actually cross-build aware, so if you're on 2.10 you're not going to get a recommendation to upgrade to a 2.11 library, that sort of thing. A big investment, but it's worth it, because once a week we can run this script written in Scala, an Ammonite script, upgrade.sc. It reaches out to the dependency REST API (it has its own REST API, of course), comes back with a list of recommendations for all of our projects, upgrades our dependencies, and then creates pull requests. When this happens, and actually, funny story: you know there are services that tell you who on your team is productive based on analyzing git usage? We had one analyze us, and we were monsters in their system because of all these pull requests we were submitting to GitHub on a weekly basis; an amazing contribution history. But this is it: you run one command and we have PRs for everything, and because we've spent so much time on our testing, the policy is, once it's green, we deploy. That's it. We can do that every single week, and all of our applications are running on the latest versions of every piece of software; we just nurture them and feed them every single week.

So, in summary, let me focus on three critical decisions. The first is to design your schema first, for all of your APIs and events, and within that to really focus on consuming the events, not the API. Have the API, but by all means, if you can use the events, use the events; you're going to get so many benefits. The second is a high level of investment in automation across the board; whether it's the code generators, the deployment system, or the dependency management, a real investment in automation. And when we think about polyglot, microservices in different languages, this is where we have to be really careful, because for every language and every framework we choose to add to our infrastructure, all of the things we just saw that allow us to be efficient in delivering our microservices now need to be built to take into account that new language and that new framework. Frankly, it's a huge investment, and I think that's what needs to be considered, which is why, when we look at that number of roughly 4,000 engineers per language, having the resources and the time to make it a priority to invest is absolutely critical to being able to do this successfully.
And third, I think, is this focus on enabling teams to write amazing and simple tests, to drive quality, streamline maintenance, and enable continuous delivery. Imagine if you didn't trust your tests and you wanted to upgrade your dependencies: what are you going to do? How are you going to verify? How much time is it actually going to take before you feel confident to deploy?

Yesterday (I'll tell a story about testing) I was on an airplane, and as I like to do on an airplane, I was writing tests, and I was writing tests against production, because that is a really good way to build quality software. In this case there was a bug reported by a user, and I said, well, it's a complicated bug, it involves lots of services and some orchestration, so I'm going to write a test that sets everything up in production. I got to the end and the test actually passed. I was frustrated, because I thought there was a bug, and there was no bug. We published that test, and it now runs every single day as a cron job against production. This morning a slight variation was reported, and I said: aha, we have the framework. I wrote the test to demonstrate the bug in production, and yes, there was a bug. I'd like to call it TDD in production: write the test, the test is failing against production, and we've now been able to go into the microservice, replicate the bug there, and write the unit test in the microservice. That microservice is getting deployed, and when the deploy finishes we can go run the production test and verify that it now passes. TDD in production, and it feels amazing to be able to do that. At the end of the day, it's one of those critical elements that goes overlooked: how do we get to the point where we're so confident in our tests that we can do all of these other things and automate the maintenance, so that we can actually get the benefits of these architectures (because there are a lot of benefits) without paralyzing our teams?

So thank you very much; go forth and design microservice architectures the right way. I think we may have time for one or two questions, and I'm happy to stay after as well. Yes sir, there's a microphone next to you.

Audience: Thank you very much. How do you balance new features that aren't ready yet with continuous deployment? Do you have branches, do you have separate environments, how do you handle that?

Yeah, it's a great question: how do we manage features at different stages of development? I'll answer personally: I can't go to sleep if I have an open PR. I can't, I don't want to, and I don't, so everything I do is optimized so that, however much time I have, when I'm done it's in production. If it's not in production, I have to worry about it. What that means in practice is that if we're working on a larger feature, you've got to decompose it, and every day we're deploying, and it's dark, it's dark, it's dark, until we get to the point where we have a feature flag we can enable and start to verify. But it's always in production, all the time, non-stop, and frankly I think that's the best way to do it. It depends on continuous delivery, and it depends on a great system of testing to make sure you have the confidence that you're not breaking anything, but boy, is it nice to go to sleep knowing that everything is working and there's no outstanding work to do. Great question.
I think we're at time, so I'm happy to take your questions after, and thank you very much. [Applause]
Info
Channel: InfoQ
Views: 566,947
Keywords: Microservices, Software Architecture, Continuous Deployment, Continuous Delivery, DevOps, IT Service Management, Automated Deployment, Agile Techniques, Agile, Performance, Architecture, Infrastructure, Scalability, Cloud Computing, InfoQ, QCon, QCon New York
Id: j6ow-UemzBc
Length: 48min 29sec (2909 seconds)
Published: Mon Oct 22 2018