How Netflix Scales Its API with GraphQL Federation

Captions
[Music]

In the beginning, the cloud was formless and void, and the engineers provisioned a server. It talked to the database and served up the web page, and it was good. The users came and the business grew. As the company grew, they formed more engineering teams. As the server grew, it became a monolith. They divided the monolith into microservices in order to increase the autonomy of the teams so they could move faster. Then the engineers created apps that used the services. But the engineers saw that it was not good for each app to have to talk to every service on its own, so they created an API gateway to bind the services together. And in the seventh year, they rested. But not for long, because they continued to innovate. They saw that REST was insufficient, so they created graph query languages for the apps to fetch data from the API, and it was all good.

Time passed and the company continued to grow, and teams were fruitful and the services multiplied. The API gateway that bound them together grew as well in order to compose the many services. And there was temptation: in order to handle failure gracefully, they added fallback logic into the gateway. Simple caches gave way to complex in-memory data stores along with business logic. Before they knew it, the API gateway had become the new monolith.

What Stephen has just described is the state of Netflix microservices today. We have hundreds of mid-tier services providing APIs for UIs to consume, and yet they're all aggregated into a single API monolith. This architecture might sound familiar to you if your organization also implements a microservices architecture with a single API aggregation tier. We had done all of this work to break apart our system into microservices, and yet we still found ourselves with an API monolith. So we asked ourselves: now what?

I'm Jennifer, and I'm Stephen. We're engineers at Netflix on the API systems team. You might also know us as Edge Engineering. We work on this API aggregation layer, the nexus point between UIs
and the universe of Netflix microservices. There are all these services on the back end and all these different UIs running on different device types. Our team represents this tiny aggregation point in the middle. We take the APIs exposed by microservices and weave them together into one big graph API, and thus the clients can simply pretend that Netflix is a single service. But we're actually just the middleman: we simply aggregate information from all of these different data sources. This architecture has served us well for many years, but we're starting to see that it's reaching its limit. In order to scale even further, Netflix has placed a big bet on an architecture called federation. We're here today to tell you what federation is, how it's enabled us to scale to previously unprecedented levels, and to convince you that federation is the future of APIs.

First, let me tell you a story. One year ago I was on call for the Netflix API service team. It was Thanksgiving and I was visiting family. Everyone was bustling, trying to prepare dinner, when I get paged. So I log on to see what the problem is, and I'm told that the trick play images are appearing in Mandarin on a big movie that we had just released. So naturally the first thing I did was Google "what are trick play images?" [Laughter] Then I started searching through the API to figure out where it was exposed and how it was used. I knew I'd seen the word "trick" before somewhere, but even though I'd been on the API team for several years, I never knew exactly what they were. And yet right now I needed to become expert enough to fix them, and quick, before everyone starts tweeting that Netflix is in Chinese.

That's the problem: our graph had grown so large that no single human understands the entire surface area, and yet the entire graph is owned by a single team. What if we could break the API apart so that domain experts could own their part of the graph, and still expose the entire Netflix ecosystem from a single unified access point? This is
precisely what federation enables. It's a way of breaking apart the implementation of your API while preserving the facade of a unified API for clients. It allows you to remove the business logic from the core aggregator so that it becomes an appliance, kind of like a reverse proxy, and that is something you can scale.

The title of this talk is "How Netflix Scales Its API with GraphQL Federation", so let's talk a little bit about what GraphQL is, and then we'll talk about how federation works within that context. Now, if you've seen Netflix talks in the past, you might know that the Netflix app actually uses a different graph API technology called Falcor. It's conceptually very similar to GraphQL, but back in 2012 GraphQL didn't exist yet, so we created Falcor. In 2020, GraphQL is pretty much taking over the world, and Netflix is using it too. Federation can actually be applied to both, but today we'll be talking about GraphQL.

Here's an example of a really simple graph API. First, from the very root of the graph, which for GraphQL is called Query, you can fetch the recommended videos for a user. From there you can traverse into each one of those videos. The Video type has further fields that we can fetch, like title or rating. Some of these fields just return scalar values, but others express a relationship to another object, such as trailer, which would then be another Video object. One of the key distinctions of a graph API is that we can selectively choose the properties that we want as a client, and then follow relationships and recursively select properties from other objects.

Now, the actual Netflix graph is a bit more complex. You might say it's like something from the Upside Down. So let's break it up with GraphQL federation. Each distinct domain, or logical business portion of the graph, is served by a different service. The API aggregation layer composes these together into a single unified graph. Let's take a look at our example again. This is what that graph would look like in a federated
architecture. Each of these colors represents fields fulfilled by a separate service. These different chunks of the graph represent a portion of the graph that one domain service is responsible for serving up. Then a service called the gateway, which you can think of as the aggregator, binds these separate schemas or graphs into a single composed graph. Each service only provides the part of the schema it is responsible for. The video service provides the title, description, and trailer for a single video, and the images service provides image URLs or box art URLs. The recommendation service provides the top recommended videos for a user. For each video it only knows the video ID. That's all it knows about videos, and that's all it needs to know. The gateway does the rest.

With such clear service boundaries, we now know who to talk to when, say, video metadata stops appearing for a given video. In fact, their PagerDuty ID and Slack support channel are actually embedded directly into the graph metadata, so we know exactly who to call when something goes wrong. But also, the video service team is not bound by the speed of any other team in exposing their APIs. When they're ready for their API to be consumed by clients, their APIs are available.

So that's federation: you take your API and break it into chunks that can be developed independently. You can think of each chunk as a micro-aggregator that just handles a single domain. These can be implemented by domain experts. Then a graph-aware gateway ties them together into a single API. Now, this graph gateway is still a central junction in our architecture, but there's a key difference: it doesn't contain any business logic. It just follows a declarative configuration that tells it which data comes from which service. This means, crucially, that the team managing the graph gateway doesn't need to scale along with the size and complexity of the graph that's being exposed.

So that's the idea behind federation. Let's go a little bit deeper into how it's
implemented. There are three components to a federated architecture: graph services, the schema registry, and the graph gateway.

Let's take graph services first. Graph services are simply GraphQL servers. They expose a small portion of the overall schema and publish it using what's called a schema registry.

The schema registry has one essential task: to hold the schemas for all your services. Along with each schema it holds some configuration, like a URL or discovery identifier. We also like to register the contact information for the team behind the service, because then that can be embedded automatically into the documentation for the graph. Before a schema can be updated, it has to be validated. Beyond basic linting, we also catch things like breaking changes and conflicts that arise when combining the schema with the rest of the graph. And finally, the registry provides the schemas and configuration to the gateway.

And finally we have the graph gateway. This is where the magic happens. The job of the gateway is to take a single incoming client query and break it into sub-queries that can then be executed against the downstream GraphQL servers. Now remember, this graph gateway is an appliance. Until it loads a configuration from the schema registry, it knows nothing about Netflix or about our API in particular. A traditional HTTP proxy is generally considered a layer 7 proxy, in reference to HTTP belonging to the application layer of the OSI reference model. But GraphQL queries, unlike REST, are abstracted from the HTTP layer, so you could think of this as a layer 8 multiplexing proxy.

So how does it work? The gateway processes a request in two stages: query planning and query execution. The query planner traverses a client's entire request and recursively collects the fields that belong to each service, and it identifies the ones that can be fetched in parallel and the ones that have to be retrieved sequentially. Let's look at an example. Given our initial schema, if we wanted to take the top
10 videos for a given user, and then for each one fetch the title from the video service and the box art images from the images service, the query plan would look something like this. We know we have to fetch the top recommended videos first, because we need those video IDs in order to know which titles and image URLs to fetch. So that's precisely how the query plan is constructed: the recommended videos are fetched first, and then, in parallel, the title is fetched from the video service and box art URLs are fetched from the images service.

The gateway sees a query plan in this form. It's simply a tree of fetch, parallel, sequence, and flatten nodes. There are three fetch nodes in this query plan. You'll notice the parent of the very first fetch is a sequence node, signifying that whatever is a sibling of this first fetch will happen after that fetch is executed. The sibling of this initial fetch is a nested parallel node, signifying that after the first fetch from recommendations, the gateway should then execute the subsequent fetches in parallel. That is then wrapped in a flatten node, which signifies how the results should get stitched back together.

Executing the query plan is pretty straightforward then. We simply traverse the entire query plan starting from the very root node, executing nodes in parallel or in sequence, and merge the results together into the overall response. Here's pretty much the actual code that's responsible for this. It's simply a recursive function that traverses all the nodes in the query plan. When the node is a sequence node, we simply execute whatever's inside that node. When the node is a parallel node, the code execution block is wrapped in an async block, and that tells the Kotlin compiler to execute the nested code asynchronously, or in parallel. Flatten nodes, as mentioned, tell the executor where to stitch the result of the query execution back into the overall response. And fetch nodes actually do the fetching against the graph service. And that's pretty much it. So that's
federation. Now let's talk about how we've been using this at Netflix and what we've learned.

It all started a couple of years ago. In 2018 the Netflix API team was exploring ways to break apart our API monolith. We prototyped a federating graph gateway for Falcor. This was really exciting because it demonstrated a potentially transformative way to scale our API. Meanwhile, there was another organization at Netflix that was building their own API aggregation layer. This rapidly growing organization is Netflix Studio. Netflix Studio Engineering makes a bunch of apps that facilitate the creation of all the content you enjoy in the Netflix app. This includes custom software for things like scheduling, talent management, dubbing, animation, etc. There were dozens of services providing all this functionality, and Netflix Studio decided to make a GraphQL aggregation layer to tie them together.

Now, you might have noticed that the number of Netflix originals has exploded over the last few years. Well, so has the Netflix Studio graph. In only a few months, this studio graph had grown to a point that it took the Netflix consumer graph years to get to. The studio API team was already feeling the pains of a monolithic architecture, so the company placed a big bet on implementing federation here first. The two API teams joined forces to create a scalable API platform for Studio.

Right around this time, a company called Apollo had just released a spec for something they called GraphQL federation. Studio was already using GraphQL at the time, so this seemed like a perfect fit. In July of 2019 the combined API teams started building a GraphQL gateway based off of Apollo's reference implementation. We chose to implement our gateway using Kotlin. This would give us access to Netflix's Java ecosystem while allowing us to rapidly develop a robust solution with language features such as coroutines for efficient parallel fetches and an expressive type system that handles null safely. As we started to implement the
gateway, we had one lingering question: would it be fast enough? We wanted to make sure that we weren't going to add too much latency, so as soon as the basic functionality was complete we did some benchmarking. The core gateway activities of query planning and execution were clocking in at under a millisecond. This gave us enough confidence to move forward.

Within a few months we had an initial release of the gateway ready to go. We took the former API monolith and put it behind the gateway. This became our very first federated graph service. Next, we set up one new graph service alongside the API monolith that exposed one small portion of the monolith's schema, but we marked this new schema with a directive called @override. This schema directive instructed the gateway to route to this new service instead of the old one when constructing and executing back-end queries.

From there we opened up the platform for wider adoption. Over the past year we've had more and more graph services taking over functionality from the former API monolith and adding brand new functionality to the graph. If you look at this chart of the number of graph services behind the gateway, you can see that we're looking at exponential growth. There are now over 50 services in production contributing to the graph. That graph is now being used to power over 60 studio applications. With all the teams behind the services contributing, the graph has exploded: the number of nodes in the graph has grown from about 800 in October of last year to almost 7,000 today. There would have been no way that one team could have added this much functionality to the graph in only one year, and yet this was precisely what the federated platform enabled.

And that is the kind of software you want to be building. You want the effect of your efforts to be multiplicative, not linear. Especially when your growth goals are as big as Netflix's have been, and as we yet envision for the future, this kind of multiplicative effect is precisely what
you're going after. Finally, not only has the federated platform enabled such explosive growth, but that old API monolith that used to be a bottleneck, that was complex and hard to reason about, and that so many considered to be a blocker, is now slated to be deprecated and fully removed sometime this quarter.

Now that the federation platform was built and the studio graph was taking off, it was time to circle back to the Netflix consumer API. There has been growing interest over the last few years in using GraphQL for the consumer Netflix app, and our internal Falcor implementation has been evolving in that direction. In fact, we are already using GraphQL's schema definition language to describe our API, so we don't have to maintain our own parser anymore. Could we use the exact same federation infrastructure that we built for Studio to power the Netflix app?

A small working group was formed to build out one page, the search page on mobile devices, on GraphQL. Here's what that screen looks like. The client here is using Apollo Client to speak GraphQL to a gateway. This gateway is federating requests to three separate graph services on the back end, each exposing a different portion of the graph. We then sent a small amount of production traffic into this new stack. Initial results are looking good, so the next steps are (a) to fill out the graph even further and make more data available on this platform, and (b) to send more traffic to the stack overall so that we can collect more data from the wild. But we're really excited about the results so far.

By now I hope your lizard brain is starting to tingle and you're thinking there's no way a project like this is all roses. It's true: federation isn't some magic potion. It's not going to solve problems like climate change or world hunger. In fact, it comes with some real challenges.

First, you're going to need a team that's devoted to building and operating a brand new platform. We dedicated three engineers to building out the core components like the
gateway and the registry. We also dedicated an entire team to the developer experience and tooling of these graph services. Our colleagues Paul and Kavitha are actually presenting their work on this effort on November 18th, so be sure to check it out. They've worked on some really great cutting-edge stuff. We also dedicated resources for instrumentation like distributed tracing, so engineers can investigate and troubleshoot problems that are happening in near real time. For more information on how we did that, check out our colleague Elizabeth Carretto's talk on distributed tracing at Netflix. She just presented it yesterday, November 10th, so be sure to check that out.

With a distributed graph, you're going to have a lot of engineers contributing to a massive graph. If you don't have a strong sense of controlled chaos (best practices, documentation, even a schema working group) you can end up with one gigantic, highly messy graph.

Finally, you'll be distributing the concerns of the API layer instead of shielding your engineering teams from them. This could potentially be a radical and costly shift, especially if most engineers in your organization are not generally concerned with matters like defensive programming or security.

We invested heavily into this new architecture. We even merged two teams that used to be part of two totally separate organizations in order to deliver this new product. Was it worth such a heavy investment, especially when you consider all of the tooling and instrumentation that had to be built in as well? We posited that it was. As Netflix grows its subscriber base to 300 million global subscribers and more, it's crucial that no single layer of our architecture is a bottleneck for growth and innovation.

What have we covered? We talked about federation, and we went into some detail about how we've implemented it at Netflix. But our hope is that in all of the technical detail we haven't lost the essence of what federation is. It's more of a philosophy than a specific
technology. You could call it a philosophy of illogical aggregation: you remove the business logic from the core aggregator, you restore API ownership to domain experts, and yet still maintain a single unified API for clients to access your entire ecosystem. This vastly increases productivity, it enables loosely coupled teams and systems, and it restores separation of concerns to your microservices architecture.

Is federation the future? It is a future. We're not so naive as to believe that a single technology or pattern could be the right choice for everybody, and we've already seen that it doesn't solve problems like world hunger or COVID-19. But we've taken a journey through the hype cycle, past the peak of inflated expectations, through the trough of disillusionment, to a place of extreme productivity, and that is a pretty awesome future. So one question remains: is this your future?

Thank you both, Stephen and Jennifer. Great presentation, very informative. We've got a lot of great questions coming in from attendees. I encourage folks that are attending to keep asking in the chat. We've got a curated list so far, but we'll keep building on it. So I'll start with some general questions and then we'll go into some of the specific attendee questions. So first to you, Jennifer: what was the biggest challenge technologically with moving to this federation architecture?

Technologically, I would just say the actual architecture was not super difficult. I think part of the challenge in bringing this in is that it was brand new; this was from Apollo. A lot of the challenges actually came from marketing this to the wider organization, making sure that we had broad alignment across the entire org, so that people like the domain service teams would be on board with graph services, and that we had the framework teams building out platforms that would become available to these other teams to consume and use.

Excellent. That was
actually the next question: what was the biggest people challenge? So do you want to expand on that? You didn't have this awesome presentation with cool music and graphics, so how did you build that alignment across the organization?

Sorry, was that targeted to me or Stephen?

Go for it, Jennifer, for continuity.

Okay. So the Studio Edge team was originally the GraphQL owner. They were the ones who first created the initial monolith for the studio side of the organization. So Robert and Tejas, our teammates, really formulated the idea of having this gateway layer, created a doc, and shopped it around. Basically they did a grassroots-level effort, starting with pretty much every single team and marketing it from there.

It's interesting: the people side is very one-on-one, very relational. The technology side is the big-scale part, right? Once you get that alignment you can scale it up.

Yeah, right, exactly. Especially with Netflix, where individual contributorship is so important, I think having those relationships and working really closely with individuals is pretty important.

Excellent. Next question. Actually, two more questions, then we'll go to the attendee questions. Where should people consider starting if they're considering doing federation? Stephen, we'll go over to you for that.

Well, I think the first question to ask yourself is why. Why are we interested in federation, and are you committed to doing microservices already? If a monolithic architecture is working for your company, then don't let us be the ones to convince you otherwise. But if you are down this microservices road, then this is a really great option. And so to start, you'll need a gateway. We chose to implement our own using this GraphQL federation spec, but
there is one that is open sourced by Apollo, so there's that. And then you'll also need some way to register your schemas. This could be as simple as a Git repo that you all put your schema files in, or you can create a registry service like we did, or there's also an offering from Apollo that you can use for that. So I guess these are the practical steps that you would start out with.

That's great. And folks, keep asking; I'm curating them and adding to the end of our list of questions to go through. One more general question, then we'll get to attendees: is the gateway open source? You mentioned Kotlin, you mentioned it works at Netflix scale. Tell us about that.

Our gateway is not currently open source, but it is something that we're definitely open to, and it's been on a roadmap. So if this is something that you would be interested in using or collaborating on, then do let us know, and that'll help us prioritize the work of supporting that as an open source project.

And to follow up on that, Stephen: Boris had asked, so ours isn't open source right now, but where could somebody use an open source gateway?

Yeah, as I mentioned, Apollo Gateway is a TypeScript Node.js server that implements the same spec, and so ours is actually interchangeable with that in terms of the spec for talking to the GraphQL servers.

Great. And then Bartholson (apologies if I didn't say your name correctly) has asked: how should you think about the granularity level of the federated layer, the federated GraphQL servers? When you're breaking this up, how should you think about splitting that up? Jennifer, what do you think?

So this is a two-part question. We can talk about it in terms of the Netflix Studio organization and then more broadly. For Netflix Studio, each team was kind of already working in silos, just because of the way that Netflix
Studio grew organically. Because of that, they were sort of owning their own part of the graph anyway when we were dealing with the GraphQL monolith, so we were able to take that model and apply it to the federated world. In terms of how you might think of your organization, I would say you would think of it similarly: who owns the data, and is there a way that you can organically create these relationships between your front-end engineers who are consuming the graph and the people who are actually delivering it from the back-end side?

And then two related questions on that schema management and ownership. John and Christian both asked. John asks: how do you avoid teams creating similar schema entities in the federated environment? And then: is there any governance around the schemas? Can you speak to that, Jennifer?

Yeah, this is really interesting because this is pretty new, right? Apollo just open sourced their federation spec; it was last year sometime. Netflix, I think, has one of the biggest federated graphs out there, pretty much in the world. So one of the things that we're seeing is that you create a person entity or a movie entity, and any other graph service can extend it and just start adding fields to it. So we actually have a schema working group that meets weekly. We have a schema architect who understands the entire surface area of the studio graph, and that has helped. Like we talked about with controlled chaos, it aligns the teams to have a similar ideology or methodology behind the schema.

To add on that one: that's one of the advantages of this architecture, that you're separating your schema, the design of your API, from the implementation. And so that makes it possible to have this working group that can be front-end engineers,
backend engineers, a data architect who's not focused on the engineering part of it, and they can have that API discussion separately from the backend engineers implementing it.

And then similarly on that: how about authorizers, approvers, security dimensions? How do you think about that? After that we'll go back to some questions about the gateway and questions about back-end data sources behind it. Jennifer or Stephen, do you want to speak to that one?

Sure. So as far as authorizing and approval of the API itself, I think that is a question that is really for your organization and how you do things. Netflix tends to lean into freedom and responsibility, and so we try to provide context to allow people to make good choices for their API, and then give them the freedom and the responsibility to do that. But that maybe isn't the right choice for every organization, and so you can put more oversight on that. As far as authorization, well, we do the authentication at the gateway level, so that's centralized, and then the authenticated user information is provided to these individual federated graph services, and so they are able to make authorization decisions that are right for the domain of the data that they're providing.

And part of it too: we gave a shout-out to the talk that's going to happen on the 18th from Paul and Kavitha, but at Netflix we have this idea of a paved path. It's basically: if you want to do things the way that Netflix does them, the platform team is going to make it super easy to just get on that paved path. So everybody has a single way of doing authentication, for instance. That helps us, even though we have many different graph services out there, to have a similar stack across them.

Great. So let's talk about the gateway itself a bit more. There are some questions around why did we choose
to implement it ourselves, and then also the performance, the caching, the comparison to Apollo's version. I'll open it up to both of you, if you have thoughts on that. Jennifer, you want to start?

Yeah. One of the things, I think, was just that this was going to be such a huge bet for Netflix. We were really betting the entire future of the architecture of Studio on this new technology and this framework, and because it was such a big bet, we wanted to make sure that we owned that code and had control over it going forward. So that was one big concern for us.

Yeah, and whenever you're doing something like this, if there are things in the community you can leverage, you want to do that. And so we considered all layers of leverage, including using the code. But then there's also internal leverage that we could get. Within Netflix we have platform teams that develop a lot of libraries for our microservice ecosystem, service discovery and things like that, and so we could leverage those best by writing our own gateway that runs on the Java virtual machine, and still get a lot of leverage from the community by the fact that we're using GraphQL and this open spec, as opposed to doing our own kind of federation protocol on top of GraphQL.

And then how about the caching concerns? Can you speak to that, Stephen?

Sure. As far as caching, we are letting the federated services themselves make the decisions on caching, because whenever you start caching things you open up a Pandora's box of questions about how you manage and expire those caches, and so we want to make sure that those domain experts are able to make the right decisions there. And I think there's also, in that same question, asking about optimization,
query optimization, and that is something that the gateway does. The query plan is optimized to minimize the end-to-end latency of a query, and so it's broken up: parts that can be done in parallel are done in parallel, and anything that can be batched together is batched together, so you have a minimum number of requests going to the GraphQL services.

Yeah, if you remember the part of the talk where we were talking about the OSI reference model, we really want to keep the gateway as dumb as possible and basically just make it a routing layer, as opposed to something that holds on to information and becomes stateful.

Thank you. So we're running down to the end of our time. There's a bunch of questions about how to think about data sources, how to think about connecting to REST, how to think about HTTP/1.1 and 2. Stephen, Jennifer, do you mind sticking around for the hallway track to go deeper for people who have interest in that?

Yeah.

So, final two questions. If people want to learn more, where can they find out more?

Our team just released a blog post on this, so if you look for "Netflix tech blog GraphQL" you'll find it. Check that out; it provides a lot of great details.

Great. And then to you, Jennifer, and then Stephen: what's one thing you want everybody to remember as they go off to the rest of their day, the rest of the talks, and everything else in life?

Yeah, the presentation kind of hit it at the very end, but I think my point would be that federation is sort of like a philosophy. It doesn't have to just be about GraphQL or about a certain technology. Really, if you have an aggregation layer, as Stephen said, take the logic outside of the aggregation layer so that you can have a single API that also isn't beholden to the speed of the team that can actually open it up.

And if you can do something like
this, then, as with all big projects, start small and move incrementally, but then finally go big or go home, because this is ultimately about bringing your organization to another level of scale.

Thank you to both of you, and thank you to all the attendees. Excellent questions. See you on the hallway track with the speakers, and see you in the rest of the talks for the track if you're interested. We've got someone from Airbnb, and then somebody from Twitter, and then a panel where Stephen and Jennifer will come back for a broader discussion on API architectures. Thanks, everybody.

Thanks, y'all.

Thank you.
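
Editor's sketch: the recursive query-plan execution described in the talk (sequence, parallel, flatten, and fetch nodes) could look roughly like the following. The talk's gateway is written in Kotlin with coroutines; this is only a Python analogue using asyncio. Every name here (the node classes, `execute`, `run`, the stub services, and the data they return) is hypothetical and invented for illustration; none of it is taken from Netflix's or Apollo's code.

```python
import asyncio
from dataclasses import dataclass
from typing import Any


# Hypothetical node types mirroring the four kinds named in the talk.
@dataclass
class Fetch:
    service: str          # name of the graph service to call


@dataclass
class Sequence:
    nodes: list           # children run one after another


@dataclass
class Parallel:
    nodes: list           # children run concurrently


@dataclass
class Flatten:
    path: str             # where in the response results are stitched in
    node: Any             # sub-plan executed against the entities at `path`


async def execute(node, response, services, entities=None):
    """Recursively walk the plan and merge results into `response`."""
    if isinstance(node, Sequence):
        for child in node.nodes:
            await execute(child, response, services, entities)
    elif isinstance(node, Parallel):
        # asyncio.gather plays the role of the Kotlin async block.
        await asyncio.gather(
            *(execute(child, response, services, entities) for child in node.nodes)
        )
    elif isinstance(node, Flatten):
        # Point the nested plan at the entities found under `path`.
        await execute(node.node, response, services, response[node.path])
    elif isinstance(node, Fetch):
        result = await services[node.service](entities)
        if entities is None:
            response.update(result)            # root-level fetch
        else:
            for entity, extra in zip(entities, result):
                entity.update(extra)           # stitch fields onto each entity


# Stub services with made-up data (assumptions, not real Netflix APIs).
async def recommendations(_):
    return {"recommendedVideos": [{"id": 1}, {"id": 2}]}


async def video(entities):
    return [{"title": f"Video {e['id']}"} for e in entities]


async def images(entities):
    return [{"boxArtUrl": f"https://images.example/{e['id']}.jpg"} for e in entities]


SERVICES = {"recommendations": recommendations, "video": video, "images": images}

# Recommendations first, then titles and box art in parallel, stitched
# back onto each recommended video.
PLAN = Sequence([
    Fetch("recommendations"),
    Flatten("recommendedVideos", Parallel([Fetch("video"), Fetch("images")])),
])


def run(plan, services):
    response = {}
    asyncio.run(execute(plan, response, services))
    return response
```

Calling `run(PLAN, SERVICES)` returns a single merged response in which each recommended video carries its id, title, and box art URL. The recommendations stub only knows video IDs, as in the talk; the other two services fill in their own fields. A real gateway would also build the sub-queries, batch requests, and handle errors, all of which this sketch leaves out.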
Info
Channel: InfoQ
Views: 76,541
Rating: 4.9258924 out of 5
Keywords: Software Architecture, Netflix, API, GraphQL, Federation, GraphQL Federation, QCon Plus, InfoQ, Transcripts
Id: QrEOvHdH2Cg
Length: 38min 50sec (2330 seconds)
Published: Thu Mar 25 2021