GraphQL design patterns for real-time and offline app architecture - Richard Threlkeld

Captions
My name is Richard; I'm an engineer at AWS Mobile, and it's been great hearing all the energy and the stories here. Hopefully this session is worthwhile: I listened to Rob's talk this morning, and to Lee's and some others, and before this role I worked in the field as a mobile architect, on site with customers implementing the first versions of GraphQL as well as other technologies. Before that I did a lot of peer-to-peer software engineering for large banks, at scale. All of this is to say that I've studied these problems alongside a lot of other people, for a really long time, in production with some major customers.

Piggybacking off what Uri was just talking about, I thought he made some really good points about what the goal is when you roll out GraphQL. At Amazon we have a saying that sounds generic but is actually true: working backwards from use cases, and working backwards from the customer. When we built some of these services we didn't decide, "hey, we're going to build a GraphQL service," or "we're going to build a GraphQL proxy." What we actually did was spend a lot of time working backwards from some very specific use cases, especially when I was in the field, and two of them popped up a lot. One was around offline use cases: lossy networks, or people who would go into mines with project management software and try to do offline operations or synchronization tasks. Even living in New York City, I had a lot of apps where I wanted to read a blog post, comment on it, and have that comment optimistically sent over the network. Interestingly enough, these same apps had real-time use cases too: I wanted to send data down to clients in quote-unquote "network real time," whatever that
means; usually people mean three to four hundred milliseconds when they talk about that. There was a lot of overlap, and this is where GraphQL as a network protocol really started to make sense, because we could do a couple of things. Obviously we could take advantage of queries, mutations, and subscriptions, but we could also remap data structures from backend databases to client apps, and do some of these operations with local stores.

With that level-set, what I want to do is talk about these problems generically and how we think about them. Some of the talks earlier, including Uri's, touched on who owns which piece — the backend or the client. The truth is that the concerns you think about on the backend are very similar to those on the client; it's just your perspective and what the constraints are at a given point in time. So I'm going to talk a little bit of theory, and then we'll see how this shows up practically in applications, and techniques you can use to solve these problems.

For instance, on the backend we talk about caches. What are caches typically used for? Making things faster; usually we put them in front of a store, and there was some talk about stores in client applications earlier today. We talk about types nowadays, which is interesting because we didn't used to: we had generic endpoints, but now we've taken concepts from languages and compilers and pushed them to the backend. In fact, in some ways GraphQL works very much like a compiler: we're processing an AST, except that instead of it happening in your build tooling, it happens on the backend at runtime when you send something over the network. The same goes for queues, message ordering, transactions, and so forth.

The key difference when it comes to client architecture is that we have different kinds of constraints. We're bound by CPU, by memory, by the storage on the physical device, and these things might not seem important at first, but they really are. One example I commonly talk about is SSL handshakes. This is a classical network exchange: the request and acknowledgment of the handshake process when you send an HTTPS request from a client to the backend. It might seem trivial, but all of this keeps hitting the CPU on your clients, and that ultimately consumes battery life. If I do one query I might not notice, but if I want real-time data on clients, this is one of the actual motivating factors for why we look at things like subscriptions over WebSockets: they keep a persistent connection open, so you only pay the handshake cost at initialization. So even though the backend designs with caches and stores might look similar, in a data center I have a faster network, and I don't need to worry about sending delta updates from one node to another. That's essentially how we think about these things: what constraints exist on the clients, and can we apply techniques from the backend to address them in a nice, flexible way?

To break this down from a philosophical perspective, let's talk about something known as the CAP theorem. Is anyone here familiar with the CAP theorem? Show of hands... okay, actually not a lot of people — good. If you spend any time in backends talking about distributed systems, the CAP theorem comes up all the time.
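The handshake overhead mentioned above can be made concrete with a rough back-of-the-envelope model; the numbers below are illustrative assumptions, not measurements, and the function names are hypothetical:

```typescript
// Back-of-the-envelope model of connection overhead (illustrative numbers only).
// Polling over fresh HTTPS connections pays the TLS handshake on every request;
// a WebSocket subscription pays it once at initialization.

const HANDSHAKE_MS = 150; // assumed cost of TLS handshake round trips
const REQUEST_MS = 50;    // assumed request/response time on a warm connection

function pollingCost(requests: number): number {
  // Each poll sets up a new connection: handshake + request, every time.
  return requests * (HANDSHAKE_MS + REQUEST_MS);
}

function subscriptionCost(messages: number): number {
  // One handshake at initialization, then messages flow over the open socket.
  return HANDSHAKE_MS + messages * REQUEST_MS;
}

// Under these assumptions, 100 polled updates spend 15,000 ms in handshakes
// alone, while the persistent connection spends 150 ms total on its handshake.
console.log(pollingCost(100), subscriptionCost(100));
```

The exact numbers vary wildly with network conditions; the point is only that handshake cost scales with request count for polling, but is constant for a persistent connection.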
A researcher named Eric Brewer, who's at Google now, came up with it — I think during his PhD work — and it basically says: if you have more than one node and you're trying to build a service or, usually, a data store, there are three properties you'd like to satisfy. You want consistency between the nodes: when I do a write, I get a correctness guarantee like "read my writes," so the next read I do sees the data I just wrote. You can choose availability, meaning I never get a failed request from that endpoint. And you can have partitions — network splits between nodes. The CAP theorem says that for any distributed system of more than one node, you can only choose two. Practically, in real networks, this means you have to choose between availability and consistency, because with two nodes you automatically get a partition between them. Almost all services choose availability and sacrifice consistency, settling for what's known as eventual consistency — eventual replication.

Why is this important in client apps? Because I'm always going to have a partition; I actually have even less of a choice. But the point that needs to land in the GraphQL ecosystem is that we tend to impose a mental model — in GraphQL itself, in the developer tooling, even in the clients we expose — that is a little unnatural in some domains. When you're building client apps, for real-time or offline data or just regular development, you're logically thinking about the device itself. And on the device you don't have a partition: you have immediate consistency, and it's of course always available, because you're doing a local write.

So let's see what this looks like architecturally. This is the architecture I'll refer to when I talk about the restrictions we have and the trade-offs we choose in these designs. Let's first list out the partitions. I have one between my two nodes in the backend, like I was just describing — you can picture a load balancer in between — and I naturally get a partition between the backend and the client. So I can treat nodes on the backend, or the client and the backend, as separate nodes. I'm going to have stores, and if I want to make those stores faster, I add caches. I did the animation this way on purpose, because the community in general — not just around GraphQL, but around client-side state management — tends to conflate caches and stores. We talk about them synonymously, but they're actually quite different, and by treating them the same we pick up trade-offs and developer difficulties over time, which I'll dive into in a bit. Then of course we have queues — yes, I did queue that up — and queues are a great tool for decoupling systems in the backend. They also let us decouple the backend from the client: if I'm on a lossy network, the request gets enqueued, and when I come online — or immediately, if the network is available — it gets sent successfully.

So what does GraphQL give us here? Let's look at the store property first. As Uri was just saying, resolvers could be anything: a NoSQL database, a MySQL database, raw compute, something reading or writing to a file system. GraphQL effectively lets us remap these types from the backend to client persistent storage, which could itself be a lot of things — web storage, storage on a native app, and so forth. And this is what the popular GraphQL clients use for all of
their partitioning schemes on the clients. If you look at the Apollo Client, what it basically does is take that type name and remap it into a relational structure on the client — I'll talk about that in a second. Other clients take the type and all the items in it and create a murmur hash locally, which is great for fast operations but a little tougher if you want to manipulate the data locally, because hashes can't be easily updated. As Uri talked about, we also use this type information for some of our codegen capabilities.

One thing I will point out, especially as GraphQL matures and needs broader adoption: using codegen from a GraphQL API to generate things like TypeScript types is really nice, but we've found in practice it's a bit unnatural for iOS and Android developers. What effectively happens is you take your API — which is your data model for the network — and impose it on the domain structure. A Java or Swift developer has nice classes, properties, and functional styles, and now, instead of taking a list of items and just doing a put into that list, they need to make API calls inside their code. This is an area I really think is an untapped resource: to look at it from the other direction and ask whether we could first look at the code in a client app, and make the API respect that kind of entity-first model.

If I want to remap some of these things, what I basically need to do is take the type information and translate it down into a store on the client. On the right-hand side you'll see a screenshot of how the Apollo cache handles this. Effectively, when I run a GraphQL query — say a list of items, like posts here — it creates a root element; that root element has pointers to one or more queries in your system, and each of those queries has pointers to one or more items. This is a normalization process, and it's really nice, because with the AppSync client, for instance, we persist this to a local store. If you just run queries it's great: the query comes down into the cache, and we persist it to SQLite, to local storage, to AsyncStorage in React Native, and so on. It's great until you want to manipulate your data locally — and I saw a lot of apps presented today where people were clicking buttons, so I think we need mutations.

The fundamental difference here, again, is that caches are not databases, but in client apps you really are thinking about stores. Caches show up all the time: CPU registers have caches because you can't even wait the nanoseconds it takes to move data across a bus to RAM, and caches show up in networking, in storage, and so forth. They're optimized with specific data structures that give you constant-time lookups for inserts and searches, but they don't have relational structures like B-trees or indexes — structures where, if you update an item in one row of a table, all of your queries, from a relational-algebra perspective, will respect that data. When we use these local caches of GraphQL query results as databases, we have to give developers controls to manipulate them locally, which is what ends up happening in a bunch of these tools.

And here's the problem with doing that. This is an image of that Apollo cache, using something very simple: a GraphQL query that I want to filter, with categories. Let's say this is something
important, like a list of immunizations that animals need to get. The filter I'm using says: give me everything in the dog category. Now I, as a developer, need to keep a reference — and humans aren't really good at reference counting, by the way. If you've ever looked at how reference counting is implemented, there's a reason it's done by things like the JVM. If you don't track these references carefully, and you have five queries locally — or sometimes fifteen or twenty, which happens all the time — you end up doing something like this: I go to my cache, I pop everything off of it, I manipulate one of the items so it says "cat," and I push everything back up. That's fine as far as the proxy to my cache is concerned, but because I didn't manage my relationships, I'm now showing invalid queries, and people ask: why is GraphQL showing me stale data, or the wrong data? This is a problem we see all the time with customers building on top of AppSync.

So what's the message here — why did I walk through these theoretical concepts? Because when you're dealing with data, it's super important that data can't be lost, that it converges, and that it's correct. There's a whole set of topics around conflict resolution that I won't go into in detail, but just to give you a taste: even deletes are hard. We have a bunch of customers using this kind of strategy to go offline — people going into subways, going underground. We even have a customer, Aldo shoes, with a really cool GraphQL use case they've rolled out in production. They're a shoe company out of Canada that operates across Canada and the United States. Imagine this scenario: I go into a shoe store in the mall and I want a pair of shoes. An employee presses a button on an iPad, which sends a real-time subscription message down to the phone of a runner in the store. The runner goes into the back room — and the back room has no network connectivity, so now they're offline. While that's happening, other subscription messages keep coming through, so when they come back out, they need to sync. The naive way to sync — I'll talk about a more sophisticated way at the end — is to rerun the GraphQL query so the data comes down. But what about deleted data?

Let's look at what happens in a couple of cases. I have a server and a client, and on the server I've got some data that's replicated down to the client. Now the client goes offline, I delete the data on the server, and the client comes back online. There's a problem: the client doesn't know the data was deleted. All it knows is that it fetched the query, went through the normalization process, looked at the type names, and updated any new items in its cache. We have an inconsistency problem — again, going back to the CAP theorem with two nodes, I chose availability under a partition rather than correctness.

What's another way to look at this? The client goes offline, but I use a technique known in databases as "accountants always keep their records": instead of deleting the item, I mark it for deletion by putting a status on it. Now, when the client comes back online, it actually sees the deletion marker and removes the item from its local cache. The downside is that as your backend database grows and you accumulate hundreds of thousands of records over time, you need to do some garbage collection. This is the traditional process we've found works best, because you
never know how long clients can be offline. In the happy case, the client comes back online, all of your clients remove the record, and then the real deletion happens in the background using some sort of TTL. But if somebody goes on vacation with their phone in their laptop bag for a few weeks and then comes back, you can end up in an inconsistent state again. How you deal with this really depends on what your business operations are and how fast they move.

So, our client architecture — this was mentioned a couple of times today. Here's a diagram of how we do this in the AWS AppSync SDK for Apollo. The SDK is essentially a collection of Apollo Links: we have an auth link, a subscription handler, and so on. Effectively, the top layer works like this: whether the client is offline, or online and the operation just happens locally first, we let people run a GraphQL query or mutation, we automatically propagate the data through to their local store, and we pop the operation into a queue. It's a write-through cache: it uses the normal Apollo cache and everything else, but right after the operation hits the cache, we persist it to a storage medium like local storage. Then, when the client comes back online, we trigger an action — we actually have a global flag that says "do it," and if it's true we send the data back over the network, disable the offline link, and respond back to the client. This is really nice because we handle it all under the covers, and it's how we get automatic persistence of queries to the local store.

How many people here — show of hands — have tried to use the Apollo cache (which is a cache, again, with the same properties) to update their local store? Okay, a handful. The rest of you will discover that when you're dealing with caches, especially if you have a master-detail view, you end up with a big piece of code for a simple use case, like this one: I have to have a query, then a proxy object; then I read data off of that query; then I update all of that data — and if this were a post with a list of comments, I'd need to do that for each child object as well — and then I push all of it back up to my cache, just so I can update the local store and get an optimistic response. It's pretty messy. The way we've dealt with this is, again, by borrowing a technique from backend caches like Redis: we make it a simple put operation. On the right-hand side is one of the helpers we've used for this, where you just pass in the data and the type you're sending up. The only catch — there are no free trade-offs at this scale — is that we get a bit of a leaky abstraction: we can just send through the comments and the new data we want, but we need to tell it the type name to update in the cache. So it's simpler for normal operations, but developers still need a little bit of GraphQL knowledge.

All right, you're probably sitting there saying: this is great, but I'm okay with not having full offline operations — I just want subscriptions. Well, subscriptions are another channel of data coming in: near-real-time updates for changes of state in the backend. You might have thought you could get away with just using queries, mutations, and subscriptions and not worry about these convergence problems, but you still have an issue merging data into your local cache, even if you're only using subscriptions. So again we ask: what qualities can backend architectures and clients
share? This is a diagram of an architecture that's roughly five to ten years old, called the Lambda Architecture. It has nothing to do with AWS Lambda; it comes from a big data paper, and I pulled this diagram from AWS reference docs. This architecture is commonly used for big data processing when customers need to present the results in a real-time view in a web app. You usually have a batch layer and a speed layer. The batch layer might be a cluster running on EMR, or your own instances, churning through and normalizing a bunch of data — taking semi-structured data and putting it in an MPP warehouse, say — and then you have streaming data coming in in real time. The interesting thing is that on the left-hand side there's a merge operation joining the two into a single query view. That looks pretty similar to what we're doing on these clients: GraphQL queries are effectively your batch operation, and you listen for GraphQL subscriptions and merge them into your local client store.

The takeaway is that dealing with the partition that naturally exists between the backend and the client is effectively a synchronization problem: you have to merge data whether it comes from a query, a subscription, or a mutation. And on the client it's actually a pretty complex problem — more so than with a database on the backend — because you basically have a three-way merge if you use all of the GraphQL operations: a merge for queries coming in and updating items, a merge for mutations that happen locally first if you want an optimistic UI, and a merge for real-time updates coming from the backend. And you have clients going online and offline — which, by the way, most of you do if you're dealing with Wi-Fi networks.
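The "batch plus speed layer" merge just described can be sketched in a few lines: a query snapshot (the batch layer) reconciled with subscription events buffered while the query ran (the speed layer). The shapes and the version-based conflict rule here are illustrative assumptions, not the actual client implementation:

```typescript
// Minimal sketch of merging a query snapshot with buffered subscription
// events into one client store; the newer version of each item wins.

interface Item { id: string; version: number; data: string; }

function mergeIntoStore(
  querySnapshot: Item[],      // result of the GraphQL query (batch layer)
  subscriptionEvents: Item[], // events buffered while the query ran (speed layer)
): Map<string, Item> {
  const store = new Map<string, Item>();
  for (const item of querySnapshot) store.set(item.id, item);
  // Drain the buffered subscription events: only apply an event if it is
  // newer than what the query already returned for that id.
  for (const event of subscriptionEvents) {
    const existing = store.get(event.id);
    if (!existing || event.version > existing.version) store.set(event.id, event);
  }
  return store;
}

const store = mergeIntoStore(
  [{ id: "1", version: 2, data: "from query" }, { id: "2", version: 1, data: "b" }],
  [{ id: "1", version: 1, data: "stale event" }, { id: "3", version: 1, data: "new" }],
);
console.log(store.get("1")?.data); // query result wins for id 1: its version is newer
```

Real systems need a more careful conflict rule than a version counter (server timestamps, vector clocks, or application-level resolution), but the ordering problem is the same.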
This problem gets really tricky when it comes to sequencing. Let's take mutations out of it for a second and just look at the simple case of subscriptions and queries. Say I come onto the network just by opening my laptop, I subscribe to incoming data, and I also run a GraphQL query to show fresh data from my local cache. By the way, this use case comes up all the time: real-time dashboards, trading apps, banking apps, you name it. The query starts at some time t0 and finishes at tN, and while it's in flight — just because of the physics of the network — a couple of mutations happen on the backend and trigger subscription messages on the client. So I merge those subscription messages into my cache, and then, when the query finishes, I merge its result into my cache as well. If those items coincide, you now have a convergence problem: your data is out of sync with the backend, and you need to handle that appropriately. The way we handle it — again, this is the simple case without mutations — is the pattern we use in all of our clients: we first set up the subscription, and all of its events are written to a local queue; then we run the query and merge the result into our cache; and only then do we start draining items off the queue and processing them as normal. If you want to introduce mutations into this equation, you have to process them after the query finishes as well, and then you can merge the subscription operations into your state.

All right, let's look at one more advanced use case, to give a little bit of a vision of the future. Rob talked this morning about CPU impact on clients, which is something we focus on quite a bit. I mentioned Aldo earlier: at first they were using some of the naive approaches in the clients we gave them, running GraphQL queries to synchronize data down to the clients, or just to filter some things. This is a native iOS app — top-of-the-line iOS developers, the latest iOS devices — so we're not talking about old hardware either. But even when filtering and pruning just a couple of hundred records coming down, you'll see sluggishness in the app, flickering, and CPU impact locally. So no matter what your geography or your hardware is today, you still need to get just the latest data down to clients, in an efficient manner. We implemented a pattern known as Delta Sync to do this.

The way we do delta synchronization is to break our resolvers into two types: a primitive type called a unit resolver, and a newer one called a pipeline resolver — but you could do this in any GraphQL engine today, because they all have resolver mechanisms that can run multiple actions. All writes go to a base table; that's the base query on the left-hand side, where I get full results. After that write happens — and you could and should do this transactionally — we write to a second table, which we call the delta table, and we add a couple of properties to each record: a server-generated timestamp for when the change happened, and a TTL. The TTL is configurable per customer; it's essentially their epsilon for when they want to evict data from the backend, so that the delta table stays small and operations against it stay really fast. The GraphQL response to the client then carries this last-synced, server-generated timestamp. Now there are some edge cases
where even the nodes on the server might be slightly out of sync, but those are 99th-percentile cases. On the client, normal operations — a base list or a base query — pull from the base table. The key is that when clients are going in and out of those warehouses, or doing any of these delta sync operations, they run their query against the delta table, so they get just the changed events. And the great thing — you might ask whether this is GraphQL-specific — is that it actually is, because we can send the same data structures and types down to the client, so it can normalize them in its local store through the cache we operate over. The client gets the same data model; it just happens to send a different argument in its query to the backend. It would have been really difficult to do this without GraphQL — without hints coming from the API and a type system to normalize the data and effectively remap two different databases down to a single store on the client.

All right, a couple of slides looking to the future. These are problems we've had to tackle over the past couple of years at AWS, building real-world apps in production with companies that have hundreds of thousands of clients, but I would love to see more open-source tooling and more solutions from the community, because if I look ahead at GraphQL or the web, things could look a little different. One: I would love to see some of the GraphQL clients move towards proper stores, rather than just caches that mobile and web developers have to manipulate. Yes, we can do things like we've done over the past year with the AppSync SDK — provide helpers so that common operations are one line of code — but it would be nice to find ways to automatically reconcile this data when you're just writing in a natural, local way on the client. Think of how you would want to use plain setters and getters over your data models, or over local databases, and ask how we could do that with GraphQL clients while still keeping their power.

The other thing I'd love to see is GraphQL used as a network layer, but not necessarily as the language you use in the client, or even in some of the server-backed interfaces. Like I said earlier, the way codegen operates today is nice in a lot of ways, but code generation can't be the answer for everything — because when it is, your client applications end up being a function of your backend API. If you really want GraphQL as a technology to take off, especially with native developers — iOS, Android, C#, and so forth — we need tools that start in their domain-specific language and translate their data models into an API on the backend that still uses GraphQL over the network.

So what does this all really mean, tying into the presentations from this morning? GraphQL is a great technology that addresses a lot of problems, but to really tackle all the real-world problems of mobile and web apps — especially synchronizing data for offline and real-time use cases — it's one part of the solution for the entire stack, not the whole solution by itself. Thanks. [Applause]
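The Delta Sync flow described in the talk can be sketched roughly as follows. All names here (`sync`, `Row`, `DELTA_TTL_MS`) are hypothetical illustrations, not the actual AppSync SDK API: the client queries the delta table for changes since its last sync when it's within the TTL window, falls back to a full base query otherwise, and applies tombstones ("accountants always keep their records") as local deletions:

```typescript
// Sketch of the Delta Sync + tombstone pattern (hypothetical names, not the
// real AppSync SDK). Writes land in a base table and a delta table; deletes
// are tombstone markers rather than hard removals.

interface Row { id: string; updatedAt: number; deleted?: boolean; data?: string; }

const DELTA_TTL_MS = 60_000; // how long the backend keeps delta rows (assumed)

function sync(
  localStore: Map<string, Row>,
  lastSyncedAt: number,
  now: number,
  baseTable: Row[],  // full current state
  deltaTable: Row[], // recent changes, including tombstones
): number {
  const withinWindow = now - lastSyncedAt < DELTA_TTL_MS;
  const changes = withinWindow
    ? deltaTable.filter(r => r.updatedAt > lastSyncedAt) // cheap incremental query
    : baseTable;                                         // fall back to full re-query
  if (!withinWindow) localStore.clear(); // the base query is authoritative
  for (const r of changes) {
    if (r.deleted) localStore.delete(r.id); // apply the tombstone locally
    else localStore.set(r.id, r);
  }
  return now; // new lastSyncedAt (a server-generated timestamp in the real pattern)
}

// A client that was offline briefly: the tombstone for "1" and the new
// record "2" both arrive via the delta table.
const store2 = new Map<string, Row>([["1", { id: "1", updatedAt: 0, data: "old" }]]);
sync(store2, 0, 30_000,
  [], // base table not consulted while within the TTL window
  [{ id: "1", updatedAt: 10_000, deleted: true },
   { id: "2", updatedAt: 20_000, data: "new" }]);
console.log(store2.has("1"), store2.has("2")); // false true
```

The TTL fallback is what handles the "phone in a laptop bag for a few weeks" case from the talk: once the delta window has expired, the client stops trusting incremental results and re-runs the full base query.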
Info
Channel: GraphQL Asia
Views: 1,396
Keywords: graphql, graphql asia, bangalore, offline app
Id: v7HsQRx0e2A
Length: 30min 31sec (1831 seconds)
Published: Tue May 14 2019