What I Wish I Had Known Before Scaling Uber to 1000 Services • Matt Ranney • GOTO 2016

Video Statistics and Information

Reddit Comments

Most of these problems are why I haven't jumped into all-out microservices, and I don't plan to any time soon. For a small-scale platform it just isn't feasible to build the necessary infrastructure to handle all of these issues; you would spend more time on infrastructure and supporting systems (tracing, context propagation, logging, deployment, etc.) than on actual features.

I was expecting some talk about distributed transactions (is this the proper name?). When you have a chain of calls and somewhere along the way something goes wrong, how do you revert these changes on the previous links of the chain? Is this a solved problem and I don't know?

👍︎︎ 78 👤︎︎ u/darkean 📅︎︎ Sep 28 2016 🗫︎ replies

There are a lot of comments here and on HN about how 1700+ services seems insane. I think they're missing the point of the talk.

Look at all of his WIWIK points. They're all people-related. Every one of them.

You don't get to 2000 services because the technological needs of the system demand it. You can be guaranteed there are overlapping concerns between many of the services. They probably aren't well-planned.

You get there because you're hiring people non-stop. You can't train all these people on your Perfect Way to design and interface with Uber services. Even if you did, they all have conflicting opinions on how to do things. Some people would just ignore your advice; did you miss the part where he said "mandates are bad" and "I would use carrots 99% of the time over sticks"?

Having 1000 engineers does not mean your company is churning out engineering-related features 1000 times faster than if you had one engineer. In reality, there are reasons why it might be more, and there are reasons why it might be less. But if your product concept is big enough to support another engineer, then it is practically guaranteed that, with the right management structures in place, adding another engineer means you will get some positive return on productivity.

👍︎︎ 39 👤︎︎ u/[deleted] 📅︎︎ Sep 29 2016 🗫︎ replies

Some people, when confronted with a library versioning and boundary problem, think "I know, I'll use microservices." Now they have 1000 problems.

👍︎︎ 18 👤︎︎ u/midairfistfight 📅︎︎ Sep 29 2016 🗫︎ replies

Out of curiosity, is there a TL;DR anywhere? I'd love to watch this but can't at the moment...

👍︎︎ 37 👤︎︎ u/kirbyfan64sos 📅︎︎ Sep 28 2016 🗫︎ replies

I worked at a place that made relatively mediocre insurance agency software. They fucking LOVED microservices. I still, to this day, do not understand why they thought they needed 350 services for what was essentially a crappy CRM.

👍︎︎ 13 👤︎︎ u/TheGonadWarrior 📅︎︎ Sep 29 2016 🗫︎ replies

You can also build things that are partially monolithic but built as microservices inside (modularity!), with routing rules that know whether you can use the in-process service or need to call a remote one.

In this way you can keep your projects light and fast to develop, and not incur the penalty of continually traveling over the network at every step of logic or data gathering.

EDIT: To expand on this slightly: the internal microservices can be treated the same way as external microservices, so they can later be split out of the "monolith" if required, and eventually that will happen for various reasons as things scale. Other things that were originally separate may also be migrated in, if they were built under the same principles. Either way, callers work through an API that is available locally or remotely and behaves the same in both cases. That is what makes this kind of modularity possible, and it is how things can be moved into or out of a "monolith" as needed. What allows it is the ability to route effectively: this node or another node? ("Node" here meaning a process, though it could also just be localhost TCP, etc.) With enough intelligence in the design of the system, the same monolith may sometimes be able to serve a request for a given service locally (cached data, or it is the appropriate node to ask), and other times may have to talk to a different node for the same service (a remote cache lookup). A sketch of this local-or-remote routing idea appears at the end of this comment.

Routing requests to the appropriate nodes is what makes this work, and standard ring-based or modulus partitioning methods can be used.
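A minimal Go sketch of the idea (names like PriceService are purely illustrative, not anyone's real API): the same interface is satisfied by either an in-process implementation or a remote client, and a single routing decision picks one, so callers never care where the code runs.

// Hypothetical sketch: one interface, two implementations (local and remote).
package pricing

import (
	"context"
	"encoding/json"
	"fmt"
	"net/http"
)

// PriceService is the module's API, usable in-process or over the network.
type PriceService interface {
	Quote(ctx context.Context, rideID string) (float64, error)
}

// localPriceService runs inside the "monolith" process.
type localPriceService struct{}

func (localPriceService) Quote(ctx context.Context, rideID string) (float64, error) {
	return 12.50, nil // real business logic would live here
}

// remotePriceService calls the same logic over HTTP once it has been split out.
type remotePriceService struct {
	baseURL string
	client  *http.Client
}

func (r remotePriceService) Quote(ctx context.Context, rideID string) (float64, error) {
	req, err := http.NewRequestWithContext(ctx, http.MethodGet, r.baseURL+"/quote?ride="+rideID, nil)
	if err != nil {
		return 0, err
	}
	resp, err := r.client.Do(req)
	if err != nil {
		return 0, err
	}
	defer resp.Body.Close()
	var out struct{ Price float64 }
	if err := json.NewDecoder(resp.Body).Decode(&out); err != nil {
		return 0, fmt.Errorf("decode quote: %w", err)
	}
	return out.Price, nil
}

// NewPriceService is the routing decision: in-process if the module is
// deployed locally, remote otherwise. Callers only ever see PriceService.
func NewPriceService(local bool, baseURL string) PriceService {
	if local {
		return localPriceService{}
	}
	return remotePriceService{baseURL: baseURL, client: http.DefaultClient}
}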

👍︎︎ 12 👤︎︎ u/[deleted] 📅︎︎ Sep 29 2016 🗫︎ replies

A very good talk, IMO. He illustrates the problems I had with the microservices religion in a previous thread.

Use the right tool for the job. Sometimes that is microservices, but not always. Microservices don't magically fix spaghetti code; they just turn it into a different problem.

👍︎︎ 8 👤︎︎ u/RiPont 📅︎︎ Sep 28 2016 🗫︎ replies

Question here: how do they (or you) do service discovery for developers? What I mean is: if I'm a developer writing a new service that needs to do certain computations that might already be implemented by another microservice, how can I organise my services so as to find them easily?

👍︎︎ 3 👤︎︎ u/need-some-sleep 📅︎︎ Sep 29 2016 🗫︎ replies

This sounds like a nightmare.

👍︎︎ 5 👤︎︎ u/colelawr 📅︎︎ Sep 29 2016 🗫︎ replies
Captions
All right, great, thank you so much. I want to talk to you a little bit about some lessons learned from Uber's scaling. There's a bunch of stuff out there already; Uber got pretty big, and most people know what it is now. What is perhaps surprising to many people is how quickly things are changing, so I've got a couple of quick videos to show you the pace of growth in a couple of Chinese cities over the course of one year. This is Beijing from the beginning of the year to the end of the year, and you can see that in one city things are happening like that. This is Chengdu, and it's the same story; you see the same pattern happening in cities all over the world. This thing is happening really quickly, and the ways we responded to all of this scaling were the best we could do at the time. They weren't always right, but we certainly learned a lot along the way. Some quick public numbers: most of the stats I'm not allowed to talk about, but those are some fun ones. Another fun one is how many engineers we have, which is somewhere around 2,000. When I started a year and a half ago there were 200, so that's 10x growth in a year and a half, which is pretty crazy. A year and a half doesn't seem long at all, but on this kind of crazy accelerated path it seems like forever ago, and I tried to think about what I could tell myself a year and a half ago, what advice I wish I could have gotten. A funny thing about advice is that you only hear it when you're ready to hear it, and with most of this advice I probably would have said, "You don't understand, man, it'll be different this time, I've got this." But there are a few things that maybe I could have gotten through to my young and naive self a year and a half ago. A lot of this stuff you probably have to learn the hard way, at least a little bit, so take this advice for what it is: you might have to learn some of these things the hard way, but then maybe you can think back and say, "Oh, that one Uber guy was telling me about this a long time ago," and it'll be fun. Here's a graph of our service growth. You'll note it doesn't end at a thousand even though I said a thousand; this is because we don't have a reliable way of tracking it over time. It's bizarre, but it's somehow really hard to get the exact number of services running in production at any one time because it's changing so rapidly. There are all these different teams building all these different things, cranking out new services every week, and some services are going away; honestly the total count is a weird anomaly. We have metrics on everything else, but an actual count of services is not a thing people care about. So yes, I have to use the word "microservices." We, like everyone else, have microservices, and it's a technology conference, so we have to say "microservices" a whole lot.
So here we go: I have said it. But I want to try to add a little something to this discussion. Yes, it's super great to break up all of your monoliths into smaller things. "Monolith" just sounds bad; it has "lith" at the end of it, it sounds like a terrible thing you would never want. But there are some surprising downsides and trade-offs to microservices at a certain scale that I want to explore with you here a little bit. One of them is this: if you're building a whole lot of services and you start to accumulate a lot of them, you might start to wonder whether you could get some of the other cool buzzwords from the industry that database people and programming-language people are always so excited about. They've got their immutability: "it changes everything, it's so great." Append-only databases are great, everyone loves them, you can reason about them, and so on. I think similar properties hold for a microservices deployment at some level. The time when things are most likely to break is when you change them, so Uber is most reliable on the weekends, when our engineers aren't making changes, even though that's when we're the busiest. Every time you go to change something you risk breaking it, and I think that at some point, for a set of software deployed as a microservice, it might actually make sense to just never touch it. Maybe. I think we're doing that whether we like it or not; it wasn't even planned, it was a surprising place we ended up. It's worth questioning why we're throwing a lot of engineering effort into this architectural style. It's not purely by accident; we're definitely getting a lot of good things out of it. We're adding people all the time who need to be quickly formed into teams and get to work on something where they can reason about the boundaries, so this lets teams form quickly and move and release independently, and that's really good. We've also adopted "own your uptime": you run the code that you write, and all the service teams are on call for the services they run in production, so that's pretty cool. Then it starts to get a little less obvious, though it doesn't mean people say it less often: "we should use the best tool for the job." But "best" has so many different ways of being understood. Best in what way? Best to write, best to run, best because I know it, best because there are libraries? It sounds so obvious when you just say "the best," but when you really dig into it, it starts to get a little shaky. And if you start really thinking about the costs of a big microservices deployment: right off the bat, now you have a distributed system, and these are much, much harder to work with than a single monolithic system.
All of a sudden everything is an RPC, and you have to handle all these different crazy failure modes and weird behaviors that in a single monolith did not exist. More importantly, what if it breaks? How do you troubleshoot? How do you figure out whether your service sits in some chain between things when something else is broken? How do you make sure the right people get paged and the right corrective actions are taken? Those are still the obvious things; a lot of people talk about them, and they're fairly apparent if you think about them. But there are some less obvious costs, and I think these are really interesting. First of all, everything is a trade-off; let me just reiterate that. Even if you don't realize you're making it, everything is a trade-off. In exchange for all these microservices you get those good things, but you're probably giving some things up, and this happens in more ways than you might think. You might choose to build some new service instead of fixing something that's broken. That maybe doesn't seem like a cost at first; it maybe seems like a feature, because you don't have to wade into that old code and risk breaking it. But at some point the cost of always building around problems and never cleaning up the old ones starts to be a factor. Another way to say that is that you might trade complexity for politics. Instead of having a maybe awkward conversation with some other human beings (it turns out it's still human beings writing the software, at least these days; there are people on the other end of those source code files), you might have to talk with them and have those awkward human interactions, and this is really easy to avoid if you can just write more software. That's a weird property of this system: you can just write more software and avoid the awkward conversations. And you get to keep your biases. Along the same lines, if there are things you don't necessarily agree with (say you like Python, but the team you're interfacing with writes their stuff in Node.js), you might have opinions about which of those languages is better, and instead of helping work in that other codebase in a language you don't personally care for, you might just build new stuff in your own language to prop up the things you think are best, even though that might not be the best thing for the organization or the system as a whole. These are things I hadn't really considered: that by super-modularizing everything we might be introducing these dynamics. So I'm going to give you some concrete examples of how we've broken up some of our engineering teams and some of the technology choices we've made, and you can imagine how this might be playing out with all the people who work at Uber.
In the prehistory of Uber, engineering was 100 percent outsourced; it didn't seem like a technology problem, so some company somewhere wrote the first version of the mobile app and the backend. Eventually engineering was brought in house. We had dispatch, which was all written in Node.js; these days that team is writing all the new stuff in Go. The core services, which is roughly the rest of the system, were originally written in Python, and that team is now moving to Go. Eventually we brought maps in house, and those teams use Python and Java. We have a whole data engineering group whose stuff is written in Python and Java, and we've got an in-house metrics system written in Go. You start to notice that's a lot of languages, and we could do that; it was great, microservices let us do that. We could have teams writing in different languages and still communicating with each other, and that's cool, except we hadn't really counted on, or factored in, what all the costs of operating that way would be. First, the obvious one: it's hard to share code. Fine, no surprises there. But it's also kind of hard to move between teams. If you want to reorganize some people, there's a whole bunch of knowledge built up inside people's heads on one platform that doesn't directly translate to another. Obviously anybody can learn, but there's a switching cost to moving between teams. What I wish I had known before thoroughly embracing this microservices-everywhere model is that having multiple languages can fragment the culture. You end up with camps: "I'm a Node programmer," "I'm a Go programmer," "Java, I don't care for it." It's natural, human beings organize into tribes, but by embracing lots of languages everywhere there's nothing to check that tendency, and there's a real cost there. I'm not sure exactly how to address that cost appropriately, but surprise, there is a cost. Like I said, everything becomes an RPC, and that has a lot of surprising issues when teams are communicating with each other over RPCs. When I started we were doing everything with HTTP, which is a very common path, but at scale, especially with lots of people joining really quickly, the weaknesses of something like HTTP really start to show up. What's a status code for? What are headers for? What goes in the query string? Is this RESTful? What method is it, and what does that mean? All these subtle things that seem really cool when you're doing browser programming suddenly become very complicated, when what you really want is just to say "run this function over there, as opposed to here, and tell me what happened." Instead you have all these weird, subtle interpretation issues around HTTP semantics, and that's surprisingly expensive. And JSON, again, super fun: it's neat that you can just look at it with your eyeballs, read the UTF-8 characters, and see it, but without types it is a crazy mess. Not right away; you're borrowing against a future crazy mess, when someone changes something and a couple of hops downstream something was depending on some subtle interpretation of empty string versus null, or some type-coercion behavior in one language versus another, and it causes a huge mess that takes forever to sort out. We really could have fixed all of this if we had just had types on all these interfaces. All of these things that seem so great at small scale, or when doing browser programming, end up being real costs at scale, especially at big-team scale. Of course, RPCs are slower than just making a function call, and as you move everything to microservices, the cost of making these calls starts adding up. So what I wish I knew is that servers are now browsers. When you're talking across the datacenter, it makes far more sense to treat everything like a function call than like some kind of web request that you might want to cache and fetch over the internet with all the extra browser-y stuff. When you control both sides of the interaction, it's better for everybody to treat it like a function call.
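As a hedged illustration of that "function call, not web request" point (this is not Uber's actual RPC framework, just a sketch with made-up names): give every internal call an explicit, typed request and response, so empty-string-versus-null ambiguities are caught at compile time rather than three hops downstream.

// Sketch: a typed internal interface that reads like a local function call,
// even if the implementation happens to go over the network.
package rides

import "context"

type GetRiderRequest struct {
	RiderID string
}

type GetRiderResponse struct {
	Name      string
	Rating    float64
	Suspended bool
}

// RiderService is the contract both sides compile against; there is no
// untyped JSON blob for downstream callers to reinterpret.
type RiderService interface {
	GetRider(ctx context.Context, req GetRiderRequest) (GetRiderResponse, error)
}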
So, how many repos is best? This is a discussion I never thought I would have to have; I thought the answer was obviously one, but it turns out many people disagree with me and think that many repos is the right answer, maybe one repo per project, maybe multiple, who knows. The trend in the industry is to publish many small modules, which is very open-source friendly: if you break everything into nice small modules, it's really easy to open-source part of it or swap pieces out. On the other hand, one repo is really great because you can make cross-cutting changes. If you want to change something in two services at the same time, it's really easy, and you can browse the code easily. So there are definitely trade-offs. Many is bad because it really stresses your build system and your ability to navigate, and it's painful to make sure you've done a cross-cutting change correctly, or even that you have all the right code. One is bad because eventually it gets so big that you can't build, or even check out, your software unless you have some elaborate system that goes through heroics to preserve the illusion, like the system at Google. Google famously has one repo, but it's so large you could never actually check it all out, so they have a virtual file system layer that lets you pretend you have it all checked out when you only actually have the pieces you're working with. Wow, one repo, super cool, but probably not usable without specialized tooling. Here's a glimpse of how many repos we have at Uber. I took this a month ago, and this is also apparently not tracked anywhere; I just happened to have looked. A month ago there were over 7,000 repos in our internal git hosting, and now there are over 8,000.
That's a lot of repos. Even if you filter out individual engineers putting up a few of their own little projects as separate repos, and the teams that track their service's configuration separately from the service itself in configuration repos, that's still not the majority of them; there are just a lot of repos out there. We are so far to the "many" side of the one-versus-many continuum that I fear there's no way we could ever become a single repo. I don't know, maybe; we'll see. So with a big microservices deployment there are a ton of operational issues that come up. Like I mentioned, what happens when things break? There are downstream dependencies, and that's really complicated. But back to the people issues: some surprising situations come up, such as when another team is blocked by you because you're not ready to release, but they have to get a fix into your service. Here you are owning your uptime, but now another team depends on you, so can they release your service? If all your tests pass, is your automation good enough that teams can release each other's software? That's an interesting problem, and for us it depends on the situation. Sometimes yes, but usually no; usually you just have to stay blocked until the other team releases their thing, and you coordinate, and it's expensive in time. I think the big realization for me was this: we went to one side of the trade-off with microservices. We said, great, small teams, everybody's moving fast, releasing features quickly, iterating on their own schedule, and that's all great, but sometimes you have to understand the whole thing, the whole service, as one top-level system, one giant machine working together. That is hard when you've spent all this time breaking it up. We just got done decomposing it as much as possible, and now we have to somehow re-understand it all as one. That's a tricky problem, and I wish I had spent more time keeping that context together; we should still be able to understand the whole system working as one. Performance is definitely going to come up when teams are talking to each other over RPCs. RPCs are expensive, and especially when there are multiple languages, the answer to how you understand your performance totally depends on the language tools, and the tools are all different. We've just gone through this process of allowing people to write in all these different languages, and now performance across those languages becomes a real challenge. In Go you might have something like pprof output, which is pretty sweet, but not all runtimes have this.
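For reference, getting that pprof output from a Go service is roughly this (a standard-library sketch; the port is arbitrary and the endpoints are the usual net/http/pprof defaults):

// Expose Go's built-in profiler on a side port.
package main

import (
	"log"
	"net/http"
	_ "net/http/pprof" // registers the /debug/pprof/* handlers on the default mux
)

func main() {
	go func() {
		// CPU profiles can then be pulled with, for example:
		//   go tool pprof http://localhost:6060/debug/pprof/profile
		log.Println(http.ListenAndServe("localhost:6060", nil))
	}()
	// ... the service's real work would go here ...
	select {}
}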
What we have been doing is trying to get the performance tools for all the languages we run to emit a common profiling format, which is flame graphs. This, for example, is a flame graph from a Go program, and this is a flame graph from a Node program. We have similar setups for Python, and I'm pretty sure we have this just about worked out for Java at this point, although there are some things in Java that are hard to capture performance-wise. The point is that as you move around and want to understand the performance of the system, if the tools are very different from one side to the other, that's another big point of friction when you're trying to chase down a performance problem. The other surprising thing I learned is about dashboards. Everybody wants performance dashboards like this, but if they're not generated automatically, teams end up making their own fun little dashboards of the things they think are important, and then when you try to trace some problem down, one team's dashboard looks totally different from another's. What we should have done earlier, and what we're doing now, is make it so that when you create a service you get a standard dashboard with the same set of things we all agree are useful to know about a service, without doing any work at all. It just shows up: you have a service, and now you have these dashboards, and you can browse everybody else's stuff and it all looks the same. Another big performance debate is whether you should even care about performance. "Premature optimization is the root of all evil" has spawned a weird subculture in our industry of people who are practically against optimization: "it doesn't matter, this service isn't that busy, we should always optimize for developer velocity, buying computers is cheaper than hiring engineers." There's a certain truth to that; engineers are indeed very expensive and getting more expensive, whereas computers are getting cheaper all the time. But the problem is that performance doesn't matter until it does. One day you will probably have a performance problem, and if you've established a culture of "performance doesn't matter, we'll just get more computers," it can be very difficult for it to suddenly matter when you lack the infrastructure to deal with it. So I think you want at least some minimum requirement, some kind of performance SLA on everything, applied automatically when it's created, whether or not it's even a usefully fast number, just so there is a number. Maybe set it to 20 seconds, but make it so everybody has a performance SLA they don't have to opt into; they can opt into a more aggressive number, but everything has an SLA. That way you can always tighten it, and there will always be that safety net, a knob you can turn. What I wish I knew is that being good is not required, but you have to at least know where you stand, and it's maybe not now, but it will be eventually.
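A rough sketch of what a default, opt-out SLA could look like as Go HTTP middleware (the 20-second default comes from the talk; the metric emission is reduced to a log line here, and the names are illustrative):

// slamw gives every service a latency budget by default.
package slamw

import (
	"log"
	"net/http"
	"time"
)

const defaultSLA = 20 * time.Second // deliberately generous default, per the talk

// WithSLA wraps a handler and records whenever it blows its latency budget.
// Teams can pass a tighter budget, but they cannot opt out entirely.
func WithSLA(service string, budget time.Duration, next http.Handler) http.Handler {
	if budget == 0 {
		budget = defaultSLA
	}
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		start := time.Now()
		next.ServeHTTP(w, r)
		if elapsed := time.Since(start); elapsed > budget {
			// A real system would increment a metric; a log line keeps the sketch small.
			log.Printf("SLA violation: service=%s path=%s took=%s budget=%s",
				service, r.URL.Path, elapsed, budget)
		}
	})
}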
Related to performance is the notion of fan-out, which causes a lot of performance problems. It has effects that are obvious at first glance but a little subtle if you've never dealt with the problem before. The overall latency before you can respond to the user is gated by the slowest thing in your call chain. Imagine a typical service that's generally pretty fast and responds in one millisecond 99 percent of the time; that's pretty good, but 1 percent of the time it unfortunately takes a whole second. That still doesn't sound too bad: it's a bummer, but only 1 percent of my users hit the slow case. But if you suddenly have a big fan-out, the chances go up quickly, and more and more of your users are hitting somebody's slow case somewhere in the chain. (With a fan-out of 100 such calls, the chance that at least one of them hits the slow path is 1 − 0.99^100, or roughly 63 percent.) It gets even worse if the slow case isn't at the p99 but at the p95; then it only takes a few fanned-out service calls before you start hitting the slow case far more often. The best way to go after these fan-out problems is to get some kind of distributed tracing. There are lots of ways to get it: you can use Zipkin, and this OpenTracing project is pretty cool now; we're participating in that, and it's a great effort. But if you don't have that kind of scale, you can just use logs: plumb a common ID through every request and put it in your log messages.
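A minimal sketch of that low-tech option, assuming a hypothetical X-Request-ID header and helper package: pull the ID off the incoming request (or mint one), carry it in the context, and stamp it on every log line so a request's journey can be stitched back together from logs.

// reqid plumbs a common request ID through a service.
package reqid

import (
	"context"
	"crypto/rand"
	"encoding/hex"
	"log"
	"net/http"
)

type ctxKey struct{}

const header = "X-Request-ID" // illustrative header name

func newID() string {
	b := make([]byte, 8)
	rand.Read(b) // error ignored for brevity in this sketch
	return hex.EncodeToString(b)
}

// Middleware extracts the incoming ID (or mints one) and stores it in the
// request context so handlers and outgoing calls can forward it.
func Middleware(next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		id := r.Header.Get(header)
		if id == "" {
			id = newID()
		}
		ctx := context.WithValue(r.Context(), ctxKey{}, id)
		next.ServeHTTP(w, r.WithContext(ctx))
	})
}

// FromContext returns the request ID for use on outgoing calls and log lines.
func FromContext(ctx context.Context) string {
	id, _ := ctx.Value(ctxKey{}).(string)
	return id
}

// Logf stamps every message with the request ID so logs across services
// can be correlated without a full tracing system.
func Logf(ctx context.Context, format string, args ...interface{}) {
	log.Printf("request_id=%s "+format, append([]interface{}{FromContext(ctx)}, args...)...)
}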
Without that kind of tracing, without a way to understand a request's journey through the architecture, understanding these fan-out problems is really difficult. I'll give you some examples, actually from our production Zipkin deployment, of why some of this stuff is so hard to track down. I'll start with an easy one: the super obvious, classic case where a request is working its way through the system, then something takes a long time, and the services below it all depend on that. It's really straightforward, no tricks; get in there and figure out why that thing is occasionally too slow. Here's a trickier one: there's a huge fan-out, and it turns out it's all to the same service. The top-level service took a long time before it could respond, but each individual call into the service below it takes about the same time. They're not that fast (this is not a blazingly fast service; I can't remember what it does on each request, but it does something that takes some time), yet they all take roughly the same time. So if you ask that team, "Hey, it really seems like something might be wrong with your thing," they'll look at their graphs and say, "No way, our stuff is super fast, really consistent, responding really well." But there's this massive fan-out. It was a case of unintentional fan-out: the top-level service fetched a list of things, the list contained a bunch of IDs, and it resolved those IDs one at a time, making a whole bunch of concurrent RPCs with a sliding window of pending requests, firing off more as some finished. That's just silly; there should be a bulk-resolve endpoint where you ship the whole list of IDs you want and it sends them all back. We never would have found this without tracing, because these things confuse a lot of the common metrics. Here's another one: the individual requests in the fan-out are super fast, but there are many thousands of them. What's going on is that there's an ORM involved. The service gets back an object that looks like a regular object you might loop over or modify in a loop, but when you do that, every time you change one of these properties, due to magic, it turns into a database request, and this seemingly innocent traversal of a collection turns into 10,000 database requests. Whoops. If you ask the database team how their thing is going, they'll say, "We're serving requests super fast, you've got nothing to worry about, everything's great. But by the way, why are we doing a million transactions per second?" Because callers are pointlessly doing these tiny little operations. So you probably want some kind of tracing to understand the fan-out.
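To make the bulk-resolve point concrete, here is a hedged Go sketch with a hypothetical client interface: the naive version is the unintentional fan-out the trace showed (one RPC per ID), while the bulk version makes a single round trip for the whole list.

// resolve contrasts per-ID fan-out with a bulk endpoint.
package resolve

import "context"

type Item struct {
	ID   string
	Name string
}

// Client is hypothetical; GetItems is the bulk endpoint that should exist.
type Client interface {
	GetItem(ctx context.Context, id string) (Item, error)       // what the trace showed, called in a loop
	GetItems(ctx context.Context, ids []string) ([]Item, error) // one round trip for the whole list
}

// ResolveNaive is the unintentional fan-out: one RPC per ID.
func ResolveNaive(ctx context.Context, c Client, ids []string) ([]Item, error) {
	items := make([]Item, 0, len(ids))
	for _, id := range ids {
		it, err := c.GetItem(ctx, id)
		if err != nil {
			return nil, err
		}
		items = append(items, it)
	}
	return items, nil
}

// ResolveBulk replaces N round trips with one.
func ResolveBulk(ctx context.Context, c Client, ids []string) ([]Item, error) {
	return c.GetItems(ctx, ids)
}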
Another surprising thing is that the overhead of tracing actually changes the results; it's a lot of work to do all this tracing, so you probably want to trace not everything but a statistically significant portion of requests, like 1 percent or something pretty low, which I think is what we do. The really surprising thing to me, though, was that implementing this kind of tracing requires cross-language context propagation, and this is in the way of so many other problems we have right now. Because we have all these different languages using different frameworks, if we want to plumb through some kind of context about the request (what user it is, whether they're authenticated, what geofence they're in, and so on), that becomes very complicated if there's no place to put this context where it will be propagated. Specifically: if you take in a request and make dependent requests based on it, will you pass along the incoming context without even understanding it? There may be fields in there that the intermediate service doesn't understand, but it should still send them along to the services below. This would have saved us so much time and hassle had we added it, or just prioritized adding it, a long time ago, but it didn't seem that important, and it turns out it is. Related to tracing, of course, is logging, another big surprise, though perhaps by now it's no longer a surprise; you're probably starting to see the pattern here. With a bunch of different languages, a bunch of teams, and a bunch of people who are all very new (half the engineering team has been here less than six months), everybody tends to log in very different, incompatible ways. I think the answer is to give everyone structure, not through mandates (mandates are tricky), but through tools that are so obvious and easy to use that they wouldn't do it any other way, to get consistent, structured logging. Of course, multiple languages make this hard. Even worse, when there are problems, logging itself can make those problems worse: something bad starts happening, you start logging a lot, and now you've got a whole new problem caused by logging too much, and that can cascade. So you want some kind of back-pressure in the logging, where you simply drop log messages if you can't log fast enough. We're slowly rolling that out to the various platforms, but I wish we had put it in sooner. More important than all of that, I wish we had added some notion of accounting for log messages. What happens to all these logs is that they generally get indexed by something like ELK or Splunk, or some Hadoop-based thing that takes all the log messages and builds indexes on them so people can search through them and learn things. But if logging is free, some people will log far more than others, with no malicious intent; logs feel free, so they just emit them. Without a way to trace where the volume comes from and somebody to send a bill to, when the big Elasticsearch cluster that has to index all of this starts running into capacity problems and needs to prune something, nobody gets the feedback of "hey, maybe slow down with the logs already." We've open-sourced the thing we're building a lot of our new Go stuff with, called zap, for structured logging. My favorite part about it is how fast it is: zero allocations on a log call. You might want to check that out.
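A short example of zap's structured API (zap is the real open-source library mentioned above; the field names here are made up): typed fields rather than free-form strings, so whatever indexes the logs sees a consistent shape.

// Minimal zap usage: structured, typed fields on every log line.
package main

import "go.uber.org/zap"

func main() {
	logger, err := zap.NewProduction() // JSON output with sane production defaults
	if err != nil {
		panic(err)
	}
	defer logger.Sync() // flush any buffered entries on exit

	logger.Info("trip completed",
		zap.String("trip_id", "abc123"),
		zap.String("city", "chengdu"),
		zap.Float64("fare", 12.50),
		zap.Int("rider_rating", 5),
	)
}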
Another area that contains surprises is load testing. We want to load test before things go into production; of course, that only makes sense. Unfortunately there's no way to build a test environment as big as the production environment, and even worse, there's no way to generate realistic traffic that exercises all of the problems lurking out there. So what we have started to do is run these load tests against production during off-peak times, and this causes a lot of strange problems. The biggest one is that it blows up all the metrics: suddenly everyone thinks we're getting way more traffic than we actually are. To fix that, we're back to the context-propagation problem again: if you're load testing production and you don't want people to think we're being denial-of-serviced, you have to make sure all of those requests carry some kind of context that says "this is a test request, don't increment the counters, don't freak people out," and that has to be plumbed all the way through the system, otherwise someone is going to get paged. What we really want is to run load against all of our services all of the time, even when there is excess capacity, because we keep hitting latent bugs that only show up when traffic hits its peak. We want to keep our systems near their peaks, just shy of them, and back off as real traffic ramps up, but that makes this problem even worse. I wish we had started building systems long ago that knew how to handle this test traffic and account for it differently. If you just saw Casey's talk, this is very relevant: I was already fully on board with the idea of failure testing, chaos monkey and so on. It's super great, and I'm not here to convince you of that. But a surprising thing I found when getting it done was that not everybody likes it; in fact, a lot of people hate it, especially if you have to add it in later. Imagine you've written a service you think is pretty great and put it into production, and then someone comes to you and says, "Oh hey, by the way, we're going to start killing your thing now." You will feel bad: that's my baby, I just spent all this time making it, now you're going to kill it, and users are probably going to notice; this is going to be bad. What I wish we had done is make failure testing happen to you whether you like it or not. It's not a decision; it's just part of going into production that your service has to withstand these random killings, slowdowns, and perturbations, because people probably won't opt into it. Another big surprise was migrations. A lot of times big companies come and talk about how amazing their technology is, and it's always the new stuff they just got working; that's why they can give a tech talk, because they want to talk about their new stuff. Meanwhile you might think, "Oh no, my stuff is really bad and old, we spend all our time doing migrations, and it's mostly legacy." Well, no surprise: all of our stuff is legacy too, and it's all migrations. Most of the people who work on storage are only doing migrations, some constant migration from one legacy thing to another not-quite-so-legacy thing. Someone is always migrating something somewhere, and I'm pretty sure that's what everyone is doing, in spite of what you might think when people come to conferences and give awesome talks about their new tech, because the old stuff still has to work. You have to keep the business working; there's no such thing these days as a maintenance window, no taking things down to work on your service. As you become a global business there are no off-peak times; it's always peak time somewhere. And remember what I said earlier about these immutable microservices: what are you migrating? That turns out to be the problem with abandoning things as long as they keep working: at some point you might need to make a cross-cutting change, and that's going to be very expensive if people haven't touched that code in months and are afraid of ever touching it again, because the last release was six months ago. That's a tricky problem.
What I really wish I knew was that mandates to migrate are bad. I knew people wouldn't like it; I never want to be told I have to adopt some new system. Specifically, making someone change just because the organization needs to change, versus offering a new system that is so much better it's obvious you need to get onto it: I wish I had placed the appropriate priority on pure carrots, no sticks. Any time the sticks come out, I'm convinced it's bad, unless it's for security or compliance reasons; those are the two trump cards where it's probably okay to force people to do things. But as someone working on architecture or infrastructure, I think carrots are the only way to go. Here's another surprise. Open source is super fun, right? Everybody loves open source. But the build-versus-buy trade-off remains a not-commonly-understood concept; not everyone agrees on what you should build, what you should buy, what you should use from open source, and what you should build internally. I would observe that anything that seems like infrastructure, anything that is part of the platform and not part of the product, even if it makes sense for you to build it yourself right now, is at some point on its way to becoming an undifferentiated commodity. Amazon, or someone, is going to offer it as a service, and eventually that platform thing you're spending all your time on, someone else will do cheaper and better. As long as you have that perspective, I think that's very healthy. What surprised me is that this makes a lot of people sad. People get really invested in their work, and if they're working on some lower-level infrastructure or platform thing, it doesn't feel good to hear that Amazon has just released your thing as a service; you still try to rationalize why the private thing should be used. This is not obvious, and it's again back to the fact that there are people on the other end of all these text editors. I wish I had known how different the understandings of this trade-off are across our industry. Which brings us to politics. By breaking everything up into small pieces, these services can allow people to play politics. "Politics" is one of those funny words a lot of people throw around, and I don't think everyone knows what it means. A colleague of mine has a really great definition that I like: politics happens whenever you make a decision that puts your own values above those of your team, or your team's values above those of the company. Decisions that violate that ordering are politics. In many environments "politics" just means "things I don't like," but I think this property is a useful one to consider, and it's one where, by embracing highly modular, rapid development, hiring people very quickly, and trying to release features as fast as you can, there is a temptation to start violating it.
When you value shipping things as smaller, individual accomplishments, it can be harder to prioritize what is better for the company. A surprising trade-off, but everything is a trade-off, and if there's one thing I learned from going through this process, it's that I wish I knew how to make these trade-offs intentionally: to notice when things are just happening without an explicit decision, because that's the way the momentum is heading. I wish I had thought about what trade-offs were being made even when we weren't explicitly making them. Anyway, that's all I have. Thanks a lot. [Applause] We have time for one question; there are like 30 questions here. "Given that you use RPC and REST, how do you avoid coupling between your services?" How do we avoid coupling? I don't know; in many cases we don't. We do our best, but it happens, and that's part of the problem. There's duplication sometimes, and there's coupling sometimes. We try to exercise good engineering discipline, and we do failure testing to surface unexpected couplings, but sometimes the coupling just exists, especially around downstream availability. A lot of times there's nothing you can do: if some service three hops down the chain is down, there's no useful answer you can degrade to, and that coupling just exists. But it's important to remember what the trade-off is: even with the coupling, the teams can still move independently, release their components independently, monitor them, understand them, and so on. But sometimes it does happen. Okay, thank you. Remember to vote in the app; have a good break.
Info
Channel: GOTO Conferences
Views: 278,159
Keywords: GOTO, GOTOcon, GOTO Conference, GOTO (Software Conference), GOTOchgo, GOTO Chicago, Matt Ranney, Uber, Voxer, Always On, Scaling Software, Computer Science, Videos for Developers, Microservices, Software Industry, Software Engineering, Software Development
Id: kb-m2fasdDY
Length: 46min 38sec (2798 seconds)
Published: Wed Sep 28 2016