API Platform Conference 2021 - Phil Sturgeon - API Horror Stories from an Unnamed Coworking Company

Video Statistics and Information

Captions
[Music] Right — so yeah, I'm talking today about API horror stories from an unnamed co-working company, for legal purposes. Yes, it's that company. I was basically asked to go and work at that company because they have terrible APIs — in their words, they were like "everything's terrible, please can you come and help us". So I did, and there were so many problems we don't have time to get into all of them. But API A would talk to B, C, D, E, all synchronously and entirely in web threads, which is a recipe for disaster. Some of those APIs individually could take two to five seconds on a good day, which means that on a bad day it was more like 20 to 30 seconds for some of them. So things were pretty slow. What made it worse was that the APIs were designed so badly that they were suffering from over-fetching and under-fetching at the same time: you'd have to hit lots and lots of different services and different endpoints to get your information, but you'd also get more information shoved into each response than you really wanted — some of it calculated on the fly, so it was very slow, and you didn't even want it in the first place. There was no HTTP caching anywhere; no one used it at all, for anything. So those poorly targeted, very slow, badly designed endpoints were always the slowest they could possibly be — you could never reuse a response, even when the data was very much the same. Different error formats would be emitted from the same API: different versions, or sometimes different endpoints, would have their own unique-snowflake custom error format, so people would try to write generic code that said "oh, if we get an error, use this" — and because it was JavaScript, the customer would end up seeing "Error: [object Object]", which is ridiculous. Auth was enabled on a per-endpoint basis in the Ruby code, and if you didn't enable auth for an endpoint, it just didn't have any. But because all
authentication was disabled in testing — and the testing was just unit tests — they'd effectively skip auth in the tests, and so quite often really sensitive data would just be emitted in production, because there was no authentication on any of it. So there were a lot of scary things. But the worst part, when I was trying to figure out "OK, I'm coming in, I'm trying to help fix all this stuff, what are we going to do first?", was the question: so, where's your API documentation? And there wasn't any. None. Anywhere at all. After trying to figure out why, a lot of the time the answer was "we don't have time to write and maintain documentation — we just rewrite it if we forget how it works". And this is a genuine thing people would do: they would legitimately rewrite the API if they forgot how it worked. This is kind of the foundation for me learning about the API design-first workflow several years ago, and trying to refine it, improve it, and build tooling around it — because this company did it in the worst, code-first way possible. They would decide they had to make an API somehow; they'd plan it on a whiteboard, or in a meeting, or in notes, or on a napkin, whatever. They'd share that napkin around between the API developers, who would spend a lot of time writing a bunch of code, and then maybe give it to the customer for feedback — but because they'd spent so much time writing all that code, they were close to the deadline, and couldn't necessarily act on all the feedback. So they'd deploy the close-enough API and have a big celebration: "we did it, yay, let's go to the pub, and maybe we'll write the API documentation later — but oh, first, a few of us want holidays, a few of us need to work on this feature, a few people need to work on bugs or performance issues; we'll do 1.1 and 1.2 and we'll get there later". And while
they've been doing all of that, a new customer appears, maybe six months later. They have no idea how the API works, because the documentation doesn't exist or is bad — sometimes it exists; at this company, it didn't. And the API developers, in that time, have legitimately forgotten how it works: they might not remember exactly what the different statuses mean, or which combination of booleans has which effect. So they try looking at the code, and after a few rewrites done to hack performance, it's too confusing — so they just make a new API, or a new version, and you end up with multiple global versions of an API that exist literally just for different clients: iPhone, web, etc. No client could use the newer versions of the API, because each one was designed for another client's requirements — it wouldn't necessarily work the way they wanted, or solve the problem they needed solved. So no client could use the other versions, which meant they were stuck maintaining versions 1, 2, 3, 4... 12, forever, until one of those clients vanished or retired their app. And even if a client theoretically could use a newer version, there was no documentation, so they'd never figure out how. So, the solution: I went to work at a company called Stoplight. We make a bunch of tools that solve these problems — it's nice to be the customer we're designing solutions for, because I've felt all these pains. We support the API design-first workflow: you design a bunch of OpenAPI — you can write it by hand or use a GUI like the one we make — and then you can give mocks and docs to your potential customers (multiple potential customers, hopefully) and get their feedback on those mocks and docs. They can interact with a fake API to see how it works before you've written any code. It means that you can just
tweak a little bit of YAML to change the API, and keep asking "is this what you want? does this work better?" until you've got something — and you can iterate really quickly, because you're not rewriting all the code every time you iterate. Then, once the customer says "yeah, this is great", you've got your OpenAPI living in your GitHub repo, and you can use it to simplify the code you're about to write. You can use it for request validation: incoming requests hit a generic middleware that can say "oh, that email should be this format" or "that string's required and you haven't provided it" — it takes care of all the stuff you'd normally have to write yourself, making request handling really easy. Then you can use it for contract testing on your responses, so you don't have to spend loads of time in both places saying "yes, this is an email; yes, this is required; yes, this is a string". You used to do that in several places; now you do it once, in OpenAPI, the source of truth. And it means that when you deploy, your code and your docs are completely up to date, because they use the same source of truth — your tests are up to date too; everything's the same. So when new functionality is requested, or a new customer comes along needing something slightly different, you've already got that OpenAPI completely up to date and completely accurate — no more "whoops, we forgot how this works, we'll have to figure it out". If you want to give that a try, there's a free account on stoplight.io — you can do a lot on the free account; I'm not here to sell anything, it's pretty cool for free.
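To make that request-validation idea concrete, here's a rough sketch of the kind of OpenAPI fragment you'd be tweaking — the endpoint and field names are hypothetical, not this company's actual API. A generic validation middleware driven by a schema like this can reject a request with a missing or malformed email before any handler code runs:

```yaml
# Hypothetical spec fragment, for illustration only
openapi: 3.1.0
info:
  title: Rooms API
  version: 1.0.0
paths:
  /bookings:
    post:
      requestBody:
        required: true
        content:
          application/json:
            schema:
              type: object
              required: [email, room_id]   # "that string's required"
              properties:
                email:
                  type: string
                  format: email            # "that email should be this format"
                room_id:
                  type: string
      responses:
        "201":
          description: Booking created
```

The same file can drive the mock server your customers click around in, the request validation, the contract tests, and the docs — one source of truth.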
The next thing is something I call cache stampede Cluedo. You know Cluedo — the game where you have to figure out who did the murder, in the library, with the candlestick? That was what we had, but with a cache stampede. If you don't know the phrase, a cache stampede is when you have an application that relies for its performance on a cache being there: every request that comes through hits this cache, where normally it would hit some third party, some other API, some dependency — and the only reason it's working is that your cache is there holding back all that traffic from being unleashed on the rest of your architecture. If for any reason the cache breaks, all of your dependencies suddenly get flooded with a whole bunch of traffic, and that can knock out huge chunks of your ecosystem if it's not prepared for carefully. We had a cache break, and we had no idea where the traffic was coming from. There was no way of finding out: no OpenTracing, no service mesh tracking anything, no proxies between anything — just purely random Heroku instances calling random Heroku instances. But the one clue we had was the user agent: "Faraday v0.9.1". That's a common Ruby HTTP client, and all of our apps were using it — but luckily, we were able to search for "gem faraday" across the entire organization in GitHub, which is not how you should do this stuff, but it's all we could think of. And thankfully, all of the apps were so far behind on keeping up with their dependencies that the version made a unique fingerprint we could look for. So we went through each app's lock file: that one would be on 0.9.1, that one on 0.8.1 — all on different versions, different levels of old, with different numbers of security bugs and problems. Eventually we found the right one: just one app on 0.9.1. So we took down the server — screw it, that one can go down — and that solved the problem. I mean, that app was gone, but the rest of the architecture was OK, because this little server
was no longer obliterating the monolith that everything else required to run. There are a lot of solutions to this. The most basic thing in the world is to put a User-Agent with your app name — and if you can, add the version number, or the commit hash of the version that's been deployed, just so you know what's going on; if something weird is happening, you can see "oh, there's an errant old instance running". But at the very least, put the app name. Then set up something like OpenTelemetry — OpenTracing is what it used to be called, before it merged with some other projects — which will show you the request as it passes through A, through B, through C, through all these different services, and the responses coming back; it can track a single transaction end to end, which is really helpful. The advanced move is to use a service mesh to explicitly set which apps are allowed to talk to what. A problem you hit quite often is that a service starts off with just A and B talking to it — that's what your documentation says, that's what your notes and your architecture diagrams say — but some other service quietly started calling it because they needed something, and they might be the ones causing the problem, and no one knows they're doing it. With a service mesh you can say "these are allowed to talk to it, and if you'd like to talk to it, you have to ask the ops team to enable that" — and service meshes usually have some sort of tracing ready to go. Really, I think if you're using a microservice architecture without a service mesh, you're probably just going to have a distributed monolith. There are a lot of things that term means, and a lot of different problems with distributed monoliths, but basically: if you just have a bunch of random stuff calling a bunch of random stuff, with no control over it, and no limitations on those network interactions in any way — no timeouts, no circuit breakers — then you don't really have a microservice architecture. What you have is something scary.
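Circling back to that first, cheapest fix from the stampede story — identifying yourself in the User-Agent — here is what it might look like in plain Ruby. This is a sketch, not the company's code: the app name and the env var are made up, and a real deploy script would inject the actual commit hash.

```ruby
require "net/http"
require "uri"

APP_NAME   = "rooms-api"                        # hypothetical service name
GIT_COMMIT = ENV.fetch("GIT_COMMIT", "unknown") # e.g. set by your deploy script

# Build a request that says who we are, instead of the client library's
# default ("Faraday v0.9.1"), so a stampede can be traced in minutes, not days.
def identified_request(uri)
  req = Net::HTTP::Get.new(uri)
  req["User-Agent"] = "#{APP_NAME}/#{GIT_COMMIT}"
  req
end

req = identified_request(URI("https://example.com/users/123"))
puts req["User-Agent"]   # e.g. "rooms-api/unknown" when GIT_COMMIT is unset
```

One header, and the "search every Gemfile.lock in the org" detective work becomes a grep of your access logs.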
A chap called Scott-something in the chat had a brilliant quote. He said: if you switch one of the microservices off and anything else breaks, you don't really have a microservice architecture — you have a distributed monolith. Things can start off really well planned and really simple: a few different UIs, a few different APIs, maybe occasionally one thing calls another API. But over time, more interfaces need more functionality from more of your other services — you want more integration, more everything — and everything starts calling everything. It's just a mess: no clear separation of concerns, everything syncing with everything else, everything copying data everywhere; deploy a change to one thing and everything else breaks, one thing gets slow and everything else crashes. And that was exactly how it was at unnamed co-working company. We had these two giant monoliths in the middle that required each other, everything else required them, and they required a lot of other things — no real separation between upstream and downstream dependencies, just an octopus getting entangled with everything else. There were a lot of problems there: loads and loads of apps — 50 to 100 APIs, plus the various client apps — all quite badly written, randomly deployed on Heroku, pointing at each other. So you'd get these random transactions that would sometimes take 30 seconds — and the only reason they stopped at 30 is that Heroku would chop them off; they'd probably have gone on much longer. That led to these awful New Relic graphs where performance would just spike — and this is just the averages, not even the 95th percentile, so it isn't that things were actually taking a second and a half; they were taking much
longer. This dark green spike at the top is "web external": it means this service has been waiting on some other API for about a second and a half. You can see it's also caused a bump in request queuing, because there's only a certain number of web workers available, and they're spending all their time waiting for this other API — which means they can't do anything; they're just stuck. So Heroku starts stacking up the requests from other people trying to talk to your API, hoping it can serve them later, before they time out. This would happen all the time, with really bad knock-on effects. At first I struggled to understand what that really meant — like, why would setting timeouts help? If stuff's slow, it's going to be slow. But here's the problem. Say you have service A, with a bunch of different endpoints. One of them — we'll call it "call" — just hits the database and returns some information. That one's easy; it's probably not going to have much trouble. But this other endpoint — we'll call it "slow" — needs to talk to service B, and service B right now is incredibly slow, maybe because it's waiting on C and D and E; whatever, B is slow right now. And say you have only six worker processes — sometimes they're called threads, or dynos; there are a lot of ways of breaking this down, but you have six things responding to web traffic. Of those six, one request was for the slow endpoint. Fine: five out of six respond in 50 milliseconds or whatever, you're good, more requests can be handled. But a few more people have requested "slow": request number three is still waiting for that slow response to resolve, and now requests eight and eleven are stuck on that slow
endpoint too. Meanwhile 7, 9 and 10 respond quickly and pick up more incoming traffic, because they're available — and damn it, three more people have hit the slow endpoint. Now all of our web threads are completely stuck trying to serve this slow thing, and because no one in the company has set any timeouts anywhere, on anything, each one will sit there as long as it can, until Heroku tells it to stop. So yeah, this would happen a lot, and request queuing was a real problem: it would back up and up and up, and people would spend forever waiting for a request to even start. The problem is that any random API could take out everything, depending on what the arrows were doing. Every now and then, something like the members network API — it was like Facebook, but specifically for this co-working company, for some reason — would deploy some change: run a migration that locked a table and caused performance issues, or update a query so it missed an index, making a query really slow where it used to be really fast. Whatever it was, anything that made that API go slow would trigger this cascading failure, because there are no timeouts anywhere, so the instability just spreads like a fire. That would crash the rooms API, for booking meeting rooms, and that would crash the user and company API — one of these two giant mega monoliths — and because that one crashed, now everything that talks to either of those two is crashed and slow and waiting for everything else. It even spread to the other monolith, because these two mega monoliths both require each other — which means now everything's on fire, and the two mega monoliths just thrash and make each other worse. So yeah — people would just stand around.
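To make that worker-pool arithmetic concrete, here's a toy simulation of the six-worker scenario above. All the numbers and the traffic pattern are invented, and real schedulers are messier — but the shape of the failure is the same: a handful of un-timed-out slow requests pin every worker, and perfectly healthy traffic behind them just queues.

```ruby
WORKERS = 6
FAST_MS = 50       # a normal endpoint's response time
SLOW_MS = 30_000   # the "slow" endpoint: only stopped by Heroku's 30s cap

# requests is a list of [arrival_ms, cost_ms] pairs. Each request grabs a
# free worker if one exists; otherwise it sits in the backlog.
def simulate(requests)
  busy_until = Array.new(WORKERS, 0)  # per-worker time at which it frees up
  queued = 0
  requests.each do |arrival_ms, cost_ms|
    worker = busy_until.index { |free_at| free_at <= arrival_ms }
    if worker
      busy_until[worker] = arrival_ms + cost_ms
    else
      queued += 1  # no worker free: request waits in Heroku's queue
    end
  end
  queued
end

# Six requests to the slow endpoint trickle in over half a second...
slow_traffic = (0...WORKERS).map { |i| [i * 100, SLOW_MS] }
# ...then healthy fast traffic arrives, and every bit of it queues.
fast_traffic = (600..950).step(50).map { |t| [t, FAST_MS] }

puts simulate(slow_traffic + fast_traffic)   # => 8: all 8 fast requests queued
```

Swap `SLOW_MS` for a sane timeout value and the workers free up instead of pinning — which is exactly why timeouts help even though "slow stuff is still slow".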
"Ah, this happens," they'd say, "it happens all the time — distributed systems are hard." Sure, but there are solutions to these problems. They would treat the symptom every time: the post-mortem would say "there was an issue with the members network API — we deployed a bad migration, so we should change how we do migrations", or "we made a mistake with indexes, so we'll be more careful to check indexes" — and never the overarching architectural issue that meant a problem with any one API would crash everything else. A lot of the time, what they'd do is spin up more servers and more instances to fend off the trouble — and that doesn't really help; it just buys you a few more minutes before all your threads are stuck. Also: there's an actual climate crisis going on, and burning resources — literally burning the world to the ground — just because we can't design good architectures is not the solution. The internet already accounts for about 3.7% of global emissions, so throwing servers at poor design is not the way to go. General recommendation: if you can't make a good monolith, with good separation of concerns, in a single codebase, don't start adding network calls. I'm not the first person to say this, but I really can't emphasize it enough: microservices aren't necessarily better every time, and you need a high level of understanding of network interactions to make them work well, especially with things like a service mesh. But one thing you can do is create service-level agreements for your APIs and stick to them. You can say "this API will always respond in one second — always — and if anything goes wrong, we'll jump on it like it's down, we'll panic", and make sure it always responds in less than one second, or even less if you can.
Then, when you're calling an API, set a timeout on every single call — always set a timeout — and it should match the SLA, if there is one. Basically, expect everything to fail and be pleasantly surprised when it doesn't; never expect a request to work, because it probably won't. When something goes wrong, you can queue requests up for later; you can back off and retry, so you're not just going "is it good yet? is it good yet? is it good yet?" — because now you're thrashing that server. Or you can hide a feature: I've seen people hide the map view when the map servers were broken, or drop the autocomplete when the autocomplete search was broken. You can make things react to what's working and what's not, right now. And setting timeouts is easy — most HTTP clients have a timeout option. Here it's five seconds: if I haven't got the whole response, with my status code, within five seconds, I'm gone. And five seconds is still a long time — this code is in a background worker, not in the web thread; that would be too long for a web thread. Then there's "open timeout", which in some HTTP clients means "I need to know within two seconds that this server is going to start working on my thing, otherwise I'm out of here". And these numbers don't add: it's "within two seconds you'd better have started this job, and within five seconds you'd better have finished it" — not seven seconds.
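The talk shows those options in Faraday; the same idea sketched with Ruby's standard-library Net::HTTP looks like this (the host is hypothetical). One caveat: Net::HTTP's `read_timeout` applies to each read from the socket rather than to the whole response, so it approximates an overall deadline rather than strictly enforcing one.

```ruby
require "net/http"

http = Net::HTTP.new("wobbly-upstream.example.com", 443)
http.use_ssl      = true
http.open_timeout = 2  # "within two seconds you'd better have started this job"
http.read_timeout = 5  # "within five seconds you'd better have finished it"

# A real call would now raise Net::OpenTimeout or Net::ReadTimeout instead of
# hanging until Heroku kills the whole web request at 30 seconds:
#   http.get("/cards/frank")
```

Rescue those two exceptions at the call site and you get to choose what happens next — queue it, back off, or hide the feature — instead of having the choice made for you by a stuck worker.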
The craziest problem we had: on busy days in Australia, nobody poops. Really weird real-world problems came from this. Let's look at this little bit of the architecture. We've got a receptionist app — a front-desk kind of situation — used to check people in, see who's where, let people know about meetings, things like that. That talked to a lot of different apps, but specifically here the user and company API, which would, for some reason, talk to an API maintained by another company: the keycard API. The wobbly, bad keycard API. If you've ever been in an office building with keycards, you know you tap the card against the door to get in — we had those, and the keycards talked to some other service; that wasn't the problem. But we had a gateway API that would say "Frank owns this card, and this card should let you into these buildings and these floors" — and that gateway was the problem. Basically, on the first of the month, all the new customers would turn up to access the new office they'd just paid a whole bunch of money for. They're excited, they want to get in and get started on their very important startup work — and they couldn't get into the building. Because, well: there were only three instances of this gateway API. One on the US east coast, one on the west coast, and then "other". At the time there wasn't much "other" traffic — I think it was an EC2 micro instance — but that company never noticed traffic increasing, and as we got more and more buildings, "other" got busier and busier. By the time the company had offices all over the world, "other" was pretty strained, and once Australia and East Asia really built up traffic, it would crush the entire company for pretty much the whole day. What would happen is the keycard API would get overloaded. The people on the front desk are sitting there trying to add a card — there's a queue of ten people — and the first person's up: "right, Frank, let's give you this card, let's hook it up". That goes through to the user and company API, and the wobbly bad keycard API is being wobbly and bad, slowing down, and it doesn't work —
it might take 30 seconds, might take two minutes, to get your "no". At which point the front desk notices "wow, that didn't work, let's try again" — wait another 30 seconds, another two minutes; didn't work; try again. And that queue of people is now mad, because they're paying loads of money, it's the first day of their contract, and their first experience is that they literally can't get into the building. So what the front desk would end up doing is saying "oh, go on in, we'll let you through the front door ourselves, and if you need anything, let us know" — which meant the customers couldn't even go to the toilet without finding an employee to let them in, because you needed the keycards to get into the toilets as well. Just a really bad experience. But not only did that make the whole keycard thing a problem — it would also crash the monolith that all these other services required. And because that monolith handled not just user information, company information and keycard logic, but also the OAuth tokens for everything, it meant that everything crashed. So even though you'd got into your new office, none of the systems would work that day — you couldn't book a conference room, you couldn't do anything — and it was a complete mess on the first of the month, every month. So I tried to fix that — when I found out it was happening, I thought, let's not just let that happen every month — and started working on solutions. We implemented a traffic proxy — at the time we used Runscope, which isn't around anymore as a proxy, but there are plenty of other options — and funneled all of our traffic through it. The other company was saying "nothing is slower than 100 milliseconds", and I was saying "I have evidence that that's not true". They tried to blame us, but I was able to share those requests with them, and they dug through everything, and they
found out that their logging system would just give up on anything that took more than a second — "because clearly that was a mistake" — so 100% of their requests looked good, because they were ignoring the errors. I also pointed out "hey, I've got these two-minute responses that are giving a 502", and their response was "oh man, nothing should ever give a 502 — it should 200 on that error". My eye started twitching a little, and I said: I don't have time for that conversation, but can we focus on the two-minute part, please? "Oh yeah, that's weird too." So the first step — because we had no trust in them actually solving the problem — was that I basically copied and pasted all the keycard code from the user and company API over to a new keycard service, and rerouted all the traffic to that. It was still synchronous, still bad, but it meant that when the wobbly bad keycard API crashed, only the keycard service crashed — because we'd redirected the receptionist app to the keycard service, and it no longer touched the monolith. So the sync service would crash, and that's OK for now: you still have the same problem of people backing up at the front desk, but you haven't taken down the entire company in the process. The next step was to make an asynchronous service, so requests could go back through the user and company API, but it would just store the request — "add that keycard, for that person, in that building" — in a background worker. Then, if the wobbly bad keycard API was being slow, it would just retry, and retry, and retry. We put timers on it: if it's taking longer than ten minutes, send the customer an email to say we're working on it — and then another email when it's done. That pretty much solved it, and eventually they made their API not be bad. But by then, a failure was purely a case of "how long will it take to get your keycard?", not "the entire company is down".
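A sketch of that background job — everything here is illustrative (a real app would use a job queue like Sidekiq, actually sleep between attempts, and send real emails), but the shape is the point: retry with capped backoff, and alert the customer if it drags on. The upstream and clock are injected as callables so the sketch runs without any waiting.

```ruby
ALERT_AFTER = 10 * 60  # seconds before we email "we're working on it"

# Capped exponential backoff, so retries don't hammer the struggling
# upstream with "is it good yet? is it good yet?".
def backoff(attempt, base: 2, cap: 300)
  [base**attempt, cap].min
end

# upstream: callable returning true on success; clock: callable returning
# "now" in seconds.
def run_keycard_job(upstream, clock)
  enqueued_at = clock.call
  attempt = 0
  alerted = false
  until upstream.call
    attempt += 1
    if !alerted && clock.call - enqueued_at > ALERT_AFTER
      alerted = true           # real job: email the customer we're on it
    end
    # sleep backoff(attempt)   # real job would actually wait here
  end
  { attempts: attempt, alerted: alerted }
end

# Simulated wobbly upstream: fails five times, then succeeds; the fake clock
# advances three minutes per call, so the ten-minute alert fires along the way.
tries = 0
now = 0
result = run_keycard_job(-> { (tries += 1) > 5 }, -> { now += 180 })
p result   # five failed attempts, and the customer got the "we're on it" email
```

The front desk's manual "try again, wait two minutes" loop becomes the machine's problem — with backoff, and with an honest status email instead of an angry queue.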
So: demand an SLA from third-party services. Pipe external traffic through proxies like Resurface or Istio. Really try to avoid putting your API requests in a web thread whenever possible — it's not always possible, but usually it is — and that's especially important when they're not under your control, because you don't want to be completely reliant on something you don't own: it might take months for them to fix it, or they might lie about it in the meantime. And use background workers and event-driven APIs for things that can happen later — even sending an email via some mail API; anything that requires going to another API can usually be done later. The last part of the distributed monolith is what I like to call mutually assured destruction. This part of the diagram, if you've noticed it, is the worst thing possible. Something as simple as getting information about a user was a giant JSON response — around 200 kilobytes of JSON for an average user, and on the collection endpoint even more; we had to remove the collection page entirely, because 200 kilobytes per user times 100 users... And that response could take 10 to 20 seconds — really, really slow. It was a mixture of stuff: basic user information — locale, phone number, profile stuff — plus all the locations that user is in, the companies the user belongs to, and then the locations those companies have access to, all jammed into one response by default. Not "you can choose to include it if you want" — jammed in by default, which is terrible. What it generally meant was that whenever a client wanted some information about subscription and billing, it would need the user's locale to pick the right currency, or language, or whatever — it needed to find out what the
locale was. So the subscription and billing API — one of these two mega monoliths — would call the other mega monolith to ask "hey, users/123, what's their locale?". And because at some point they'd decided to jam that locations and company-locations data into the response by default, the user and company API was now calling back into the other monolith — the one that made the original request — to fetch data nobody cared about but that it had to include. Those calls might take four or seven seconds to respond, so each monolith depended not only on itself but back on the original caller, and the whole thing might take 20 seconds even when both were working — and again, none of it was cached. So if the user and company API started to go slow for any reason, the subscription and billing service went slow, and then ID crashed because it was waiting on subscription, and of course subscription crashed because it was waiting on ID. I had to call that a double knockout, and it happened all the time — and guess what, timeouts help there as well. The real solution: we couldn't really change /users/123, so we made a new locale endpoint that held just the locale information, and people could request that instead — and we put Cache-Control on it, because they're not going to change their locale that often, mate; once a day is fine. So, solutions there: stop designing APIs for HTTP/1. You don't need to smash everything into one mega call — it's the API-world equivalent of CSS concatenation and image sprites, which fell out of fashion ages ago, yet for some reason in the API world we still do it. Use HTTP/2 and HTTP/3 to multiplex multiple requests — that's a lot quicker a lot of the time — and it means clients only request more data if they want it, which is what requests are for: you just request what you want.
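That dedicated locale endpoint might look like the following as a bare Rack-style app — an object responding to `#call` with the `[status, headers, body]` triple. The path, fields and max-age are all illustrative; pick a TTL that matches how often locales really change.

```ruby
require "json"

# Instead of 200 KB of user JSON, return the one thing the caller wanted,
# and let HTTP caching absorb the repeat requests for it.
LOCALE_APP = lambda do |env|
  # e.g. GET /users/123/locale — the real lookup is stubbed out here
  body = JSON.generate({ locale: "en-GB", currency: "GBP" })
  headers = {
    "content-type"  => "application/json",
    # "they're not going to change their locale that often, mate"
    "cache-control" => "public, max-age=86400",  # reuse for a day
  }
  [200, headers, [body]]
end

status, headers, body = LOCALE_APP.call({ "PATH_INFO" => "/users/123/locale" })
puts status   # 200
```

With `Cache-Control: public`, shared caches and clients can serve the second, third and thousandth "what's their locale?" without ever waking the monolith.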
And you can put caches on things to speed them up and claw that time back; and using timeouts and circuit breakers means simple requests can succeed even while complicated requests are failing — and you can react differently when they are. Then get an API architecture and governance team to review changes, so things that start off nice on the whiteboard don't grow into a mess over time. If you want help with all that, I work for a company called Stoplight, which I've already mentioned — we make the whole API design-first workflow easy. All of the OpenAPI lives in your GitHub repo, so you can see pull requests changing it and go "ooh, that's a bad idea", and a bunch of other cool stuff like that. So on that note, I'd like to thank you all for listening — sorry I couldn't be there in person; I hope the sound was OK — and we've got a little bit of time left for questions. [Host] Thank you very much, Phil. Are there any questions in the audience, or maybe online? Not a dev question, but: curious how long you were at the unnamed company? [Phil] It was about a year and a half — about 18 months. That was about all I could handle of firefighting every single day, putting out P1s and P0s, every day. I had to do that for too long. [Host] Is stoplight.io supporting JSON-LD and API Platform? [Phil] Not currently — we focus on OpenAPI tooling, and you can use that to describe pretty much any type of HTTP-based API, but there's no formal support for JSON-LD. It is something I'd like to look into, so if you want to talk to me on Twitter or wherever, let's talk about it.
[Host] OK, great, let's do that — thank you very much. Oh, and do you have a blog or somewhere you talk about what you just presented? [Phil] Yes: apisyouwonthate.com. I've put a link to the slides in the chat here, and there are lots of links to blog posts and other material there. [Host] Don't hesitate to share those links on the livestream too, so people online can follow them... there we go, I've got apisyouwonthate.com. Thank you very much, Phil. In a few moments we're going to welcome here in this room Samir Jose, who's going to talk about JSON-LD, and then Nicolas Grekas, who's going to talk about Symfony Runtime. First we're going to take a short break — we'll see you all back here at 25 past 4.
Info
Channel: Les-Tilleuls.coop
Views: 128
Keywords: API Platform Conference, APIs
Id: C72UE0ypr6c
Length: 36min 43sec (2203 seconds)
Published: Tue Nov 09 2021