Common design patterns with Azure Cosmos DB | Azure Friday

Captions
>> Hey friends, it's another episode of Azure Friday. I'm here with Aravind Krishnan, and we're talking about some common design patterns that you might find when you're working with Azure Cosmos DB. How are you, sir? >> Good. Thanks for having me, Scott. >> My pleasure. So what do you have for me today? >> So, Cosmos DB makes it really easy to build scalable applications, so your data can grow seamlessly in terms of storage and throughput. You don't have to worry about schema. You don't have to worry about indexing, so all that's good. A lot of these problems, the data models we see in applications, tend to fit a few common themes and patterns. So in this session, I just want to walk through them: maybe start with the vanilla use case and then look at a few exceptions to the rule. I want to stress that these are exceptions, only to be used in case you run into certain kinds of issues. Things like, how do you handle skews? How do you handle a workload which is mostly writes? We'll just look at it case by case and walk through code to see how to do it. >> It sounds like a plan. >> So, let me walk through. This is in GitHub; all of these samples are out there. These are mostly snippets to illustrate these patterns. >> Okay. >> So, before we jump into it, I just want to summarize the key concepts that become the guiding principles behind these patterns and why you adopt them. >> Okay. >> First is that documents in Cosmos DB are free-form JSON. The only schema is that they should have a primary key, and that's a combination of partition key and row key. >> Okay. >> Good. Now, documents live within collections. These are the containers, and each of these collections must have a partition key definition. That's the property that you use for scaling out data. They have an optional indexing policy; by default, everything is indexed, but you can turn it off for paths you don't want to query. There's automatic expiration, so if you have time-series data, you can set expiration windows. You can set unique key constraints. And this is an important part: collections are scaled out across partitions, and each of these partitions, you can think of it as a server, hosts a range of partition keys. So this becomes important when you think about how your workload will scale out effectively. Now, in terms of operations, it's a simple set of basic primitives. You have your CRUD operations, so that's GET, POST, PUT, and DELETE. You have SQL queries, so that's your single-partition queries as well as cross-partition queries. You have stored procedures within the scope of a partition that let you do things like bulk operations. And you have read feed and change feed. So, this is your set of toys. And the principle that you'll see across all of these patterns is, going from most efficient to most expensive, the order pretty much goes GET, which is your point read, then your single-partition query, your cross-partition query, and lastly the scan query that goes over read feed. >> Oh, interesting. Okay, so to put that in the context of a SQL person's brain, because I'm an old SQL person: a scan query, like a table scan, if you end up having to look at every single row to answer a question, that's going to be more expensive than a simple straight get by ID or a known index. >> Correct. >> That makes sense.
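To make those collection-level concepts concrete, here is a rough C# sketch against the DocumentDB .NET SDK that creates a collection with a partition key definition, an indexing policy, automatic expiration (TTL), and a unique key constraint. The account endpoint, database name, paths, and throughput are illustrative assumptions, not values from the episode.

```csharp
using System;
using System.Collections.ObjectModel;
using System.Threading.Tasks;
using Microsoft.Azure.Documents;
using Microsoft.Azure.Documents.Client;

class CollectionSetup
{
    static async Task Main()
    {
        // Hypothetical account endpoint, key, and names -- placeholders only.
        var client = new DocumentClient(new Uri("https://<account>.documents.azure.com"), "<key>");

        var collection = new DocumentCollection { Id = "players" };

        // Every partitioned collection needs a partition key definition; this is the
        // property Cosmos DB uses to scale data out across partitions ("servers").
        collection.PartitionKey.Paths.Add("/id");

        // Everything is indexed by default; keep it consistent here, or exclude
        // paths you never query to save request units.
        collection.IndexingPolicy.IndexingMode = IndexingMode.Consistent;

        // Automatic expiration (TTL, in seconds) -- handy for time-series data.
        collection.DefaultTimeToLive = 7 * 24 * 60 * 60;

        // Optional unique key constraint, enforced within each partition key value.
        collection.UniqueKeyPolicy = new UniqueKeyPolicy
        {
            UniqueKeys = new Collection<UniqueKey>
            {
                new UniqueKey { Paths = new Collection<string> { "/handle" } }
            }
        };

        await client.CreateDocumentCollectionIfNotExistsAsync(
            UriFactory.CreateDatabaseUri("gamesdb"),
            collection,
            new RequestOptions { OfferThroughput = 10000 });
    }
}
```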
>> And the same policies apply, just like a relational database. Bulk insert is better than POST, which is an insert, which is better than Replace, which is a little bit more heavyweight. >> Okay. >> And lastly, if I could summarize, one takeaway would be: you should use change feed more. >> Okay. And maybe you could speak a little bit about using change feed and why that's important. >> Yeah. I can go through the specific examples, but >> Great. >> the benefit is, it's incremental data, right? So, instead of running a query over a large data set, you're working off of the deltas, and that lets you do a lot of work at write time so that reads are inexpensive. That becomes a very scalable, distributed kind of pattern. >> Cool. >> So, let's look at the vanilla use case, right? This is your simplest use case. Let's say you're building a gaming app, you have player profiles, and you want to look up by player. The APIs you'd implement are GetPlayerById, AddPlayer, RemovePlayer, UpdatePlayer, right? >> All for an individual player, and you'd be doing it probably within the scope of that player. They would log in and you'd call those queries and learn about that one person. >> Correct. >> And that should be very fast, right? >> Those would be very fast. And the beauty of it is, this is our player class, right? Id, name, handle, high score, pretty simple class. And the beauty of this is, this code, whatever you write, will work for hundreds of players or billions of players. You just need to scale out your Cosmos DB throughput. >> Just to be clear, you said billions and you're not joking. >> Yeah, yeah, for sure. Billions. I mean, I wish I could go further, but billions is how many people we have. >> Cosmos DB is known to scale to world scale. >> Yeah, yeah, for sure. So, this is your simplest case, right? In this case, you need to come up with the partition key. We know it's all key-based access, so your partition key could be the Id. And this is perfectly fine. I would say the strawman for most of your partition key selection is Id, because that's the simplest and gives you great distribution of writes across partitions. So you get seamless scale-out, and for read access, if it's by key, then it works. >> And is Id, forgive my ignorance, is it always unique, or is it always a monotonically increasing number? Is it something you decide? >> You get to decide. What the database provides is that the Id is guaranteed to be unique. So we have constraints on the database, and the combination of partition key and Id is unique. But if you set Id as the partition key, then that value alone is unique. >> Okay. >> So in this case, we know it's just key-value access. So why not turn indexing off, because that could give you a few more RUs back in terms of indexing. Not for this document maybe, but it certainly gives you a big impact if you have big documents. So if you look at GetPlayer, it's just a wrapper around ReadDocumentAsync. This was our first point: use GET over query, because you don't even have to compile SQL, it's just a straight read. AddPlayer, you can use Upsert, which is the atomic insert-or-update. Remove is again a delete. Now for update, if you look at the pattern, it's a little bit more interesting. You have to do a conditional update with an ETag.
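Here is a minimal sketch of what that vanilla player repository might look like with the DocumentDB .NET SDK, assuming a Player class keyed on Id as both the id and the partition key; the database and collection names are hypothetical, and the exact code in the GitHub samples may differ.

```csharp
using System;
using System.Threading.Tasks;
using Microsoft.Azure.Documents;
using Microsoft.Azure.Documents.Client;
using Newtonsoft.Json;

public class Player
{
    [JsonProperty("id")]
    public string Id { get; set; }      // also the partition key in this design
    public string Name { get; set; }
    public string Handle { get; set; }
    public int HighScore { get; set; }
}

public class PlayerRepository
{
    private readonly DocumentClient client;
    private readonly Uri collectionUri = UriFactory.CreateDocumentCollectionUri("gamesdb", "players");

    public PlayerRepository(DocumentClient client) => this.client = client;

    // GET over query: a point read by id + partition key is the cheapest operation.
    public async Task<Player> GetPlayerAsync(string id)
    {
        Document doc = await client.ReadDocumentAsync(
            UriFactory.CreateDocumentUri("gamesdb", "players", id),
            new RequestOptions { PartitionKey = new PartitionKey(id) });
        return (Player)(dynamic)doc;
    }

    // Upsert is the atomic insert-or-update.
    public Task AddPlayerAsync(Player player) =>
        client.UpsertDocumentAsync(collectionUri, player);

    public Task RemovePlayerAsync(string id) =>
        client.DeleteDocumentAsync(
            UriFactory.CreateDocumentUri("gamesdb", "players", id),
            new RequestOptions { PartitionKey = new PartitionKey(id) });
}
```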
So, you read the ETag back, and if it's the same as what you had read previously, then you apply the update to the document. This gives you the benefit of optimistic concurrency instead of locking, and lets you scale out effectively for items that change infrequently. >> Now, I don't see an if; there's no explicit if here, it's built within that access condition. >> Correct. You see the access condition says IfMatch, and that is a request option. >> And you're passing that condition into this request object, so it's an implied if, you just can't see it. >> Correct. So, that's your simple vanilla case, right? If you look at it, this is pretty much the simplest case, and a lot of use cases fit within this bucket. Now, let's get a little bit more complicated. What if you have hierarchy? Say you have players and players have games. You have a 1:N, parent-child relationship, and that's not all that much more complicated either. In this case, let's look at our repository class. It's over the same data set, but what I do is I set player Id as my partition key. And this means I can get all of the data for a player. If I wanted to implement a GetGameAsync, then I just pass in my player Id and game Id, and again I can do a GET. And similarly, if I want to find all of the games of a player, I can write, for example, a LINQ query, where I say player Id equals player Id. So this is again a SQL query that's sent to the back end, and it's still scoped to a single partition key, so that's also efficient. So, that one was not that much more complicated, right? If you have 1:N, the key is, if you have that hierarchy, you want to take the parent of the hierarchy and make it your partition key. And here again, we stick to single-partition queries as well as the REST primitives, the CRUD primitives. So this is going to scale really well. Now, let's get into the exceptions. What do you do when you have M:N relationships? The idea being, if you have a multiplayer game, a popular viral game, then it's not really a parent-child relationship, so you want to find by player and find by game ID. The methods you need are get player by ID, get game by ID, add game, remove game. So that gets a little bit tricky. Let me walk you through our same repository if we were to extend this. >> Yes, I'm seeing a lot of games now where they have 100 people on an island. So, you've got hundreds of players playing hundreds of games, and you have to manage all of that state. >> Yeah, it's a fun problem. So, in this case, the simplest thing is, we have one collection. Let's start with one collection. Here, again, because we are explicitly controlling indexes, because my goal was showing how to get the best performance, you have indexes both on player ID and game ID. Get game by player ID is simple, we just saw that earlier, that's the partition key query. >> Right. >> Now, what about get by game ID? Notice it's not a partition key, right? What you do here is, if it's infrequent, and this is a big if, usually if most requests have a player ID and only infrequent requests are by game ID, then you can go do a cross-partition query. >> Do you have to decide what infrequent means to you? You said if it's infrequent. What's infrequent? Daily? >> I would say if it's a 90/10 mix, roughly speaking. It's a rule of thumb. Not often. >> That makes sense. >> But what if it's often? In some cases, there might be a 50/50 mix, right? >> Right.
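Two of the patterns described here, sketched in C# under the same assumptions as above: a conditional replace that passes the ETag as an IfMatch access condition (the implied "if"), and a single-partition query for all games of a player, where the player Id is the partition key. The Game class and method names are illustrative, not the repo's exact code.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Net;
using System.Threading.Tasks;
using Microsoft.Azure.Documents;
using Microsoft.Azure.Documents.Client;
using Newtonsoft.Json;

public class Game
{
    [JsonProperty("id")] public string Id { get; set; }
    public string PlayerId { get; set; }   // partition key in the 1:N (player -> games) design
    public int Score { get; set; }
}

public class GameRepository
{
    private readonly DocumentClient client;
    private readonly Uri gamesUri = UriFactory.CreateDocumentCollectionUri("gamesdb", "games");

    public GameRepository(DocumentClient client) => this.client = client;

    // Optimistic concurrency: read, keep the ETag, replace only if it still matches.
    public async Task UpdateScoreAsync(string playerId, string gameId, int newScore)
    {
        var docUri = UriFactory.CreateDocumentUri("gamesdb", "games", gameId);
        var options = new RequestOptions { PartitionKey = new PartitionKey(playerId) };

        Document current = await client.ReadDocumentAsync(docUri, options);
        Game game = (Game)(dynamic)current;
        game.Score = newScore;

        // The "implied if": only apply the replace if the ETag has not changed.
        options.AccessCondition = new AccessCondition
        {
            Type = AccessConditionType.IfMatch,
            Condition = current.ETag
        };

        try
        {
            await client.ReplaceDocumentAsync(docUri, game, options);
        }
        catch (DocumentClientException ex) when (ex.StatusCode == HttpStatusCode.PreconditionFailed)
        {
            // Someone else updated the document first; re-read and retry as needed.
        }
    }

    // Single-partition query: all games of one player, scoped by the partition key.
    public IEnumerable<Game> GetGamesByPlayer(string playerId) =>
        client.CreateDocumentQuery<Game>(gamesUri,
                new FeedOptions { PartitionKey = new PartitionKey(playerId) })
            .Where(g => g.PlayerId == playerId)
            .AsEnumerable();
}
```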
>> This is where you have change feed to the rescue. Let's say you wanted another pivot of the same collection, which has game ID as its partition key. So you would have a clone of this collection with a different partition key, holding the same data, so that now you can perform lookups against game ID. Now this problem is solved, the partition key lookup problem is solved. But how do you populate this data? >> Yeah, and using a clone, do I have to keep this up to date? Which one's the authoritative source? >> Correct. Double writes, the evils of double writes. Like, what happens if one fails and the other succeeds, right? Change feed gives you a deterministic way to keep track of changes as they come in. So you have failure recovery, you have scale-out. For example, you can throw a bunch of worker nodes at it, they can do load balancing, things of that nature. So this is the bare-bones change feed API, right? If you look at partition key ranges, the way this works is you have direct access to the ranges of partition keys, and you can retrieve changes from those partitions independently. The beauty of this is, now, for example, I can throw multiple workers on it, run this in parallel, and drain all of the changes that come in. Ultimately, the code that does the work here is essentially a single upsert document call. That goes and takes all of the changes from the source and applies them to the destination. If you were, for example, using Azure Functions or Spark Streaming, some of the higher-level primitives, essentially you'd just have to write that one for loop without all of the wrapper code. >> Is there a concern that I might be in a for loop that has millions? >> The beauty of it is, it's once. So, reads are cheaper than writes, and you're doing this work exactly once, and once you have the data cooked up, the fact that reads are efficient more than pays for that cost. And this shows you, when you're getting a game, you're just getting it by its own ID. So, if you look at another more complicated example, and this is very common, it's time-series data, right? The challenge there is that there is no natural key. It's tempting to use time as the key, but that naturally leads to hot spots within your application. So, how do you handle time-series data? Essentially, the trick is finding something that has a wide range of values. If I look at my repo, what I'm doing here is essentially using sensor ID as my partition key. And what this means is, if I have a number of sensors, they can continue to use the full throughput, and they can all continue to write at the same time. The flip side is, when you're doing queries, you would have to read across all of these partitions. >> Right. >> But that's okay, because I have an index on the time range information within each partition. Now, to extend the same idea of change feed, sometimes what you want to do is take raw data and roll it up. You've seen this in the Azure portal as well: you have data that's organized by minute, and then you have data that's rolled up into a summarized aggregate per hour, for example, and then per day and so on. The benefit of this is that you care about high precision within your recent time windows, but you only care about roll-ups at the longer time windows. And you can build that exact same time window-based aggregation using change feed.
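Here is a bare-bones sketch of that change feed loop, assuming a source collection partitioned by player ID and a clone (pivot) collection partitioned by game ID. The collection names are hypothetical, and a real worker would persist continuation tokens and run the ranges in parallel rather than starting from the beginning on every run.

```csharp
using System;
using System.Threading.Tasks;
using Microsoft.Azure.Documents;
using Microsoft.Azure.Documents.Client;

public class GameIdPivotSync
{
    private readonly DocumentClient client;
    private readonly Uri sourceUri = UriFactory.CreateDocumentCollectionUri("gamesdb", "gamesByPlayerId");
    private readonly Uri destUri   = UriFactory.CreateDocumentCollectionUri("gamesdb", "gamesByGameId");

    public GameIdPivotSync(DocumentClient client) => this.client = client;

    // Drain the source collection's change feed and upsert every change into the
    // clone collection that uses game ID as its partition key.
    public async Task SyncAsync()
    {
        // Each partition key range can be drained independently (and in parallel).
        FeedResponse<PartitionKeyRange> ranges =
            await client.ReadPartitionKeyRangeFeedAsync(sourceUri);

        foreach (PartitionKeyRange range in ranges)
        {
            var query = client.CreateDocumentChangeFeedQuery(sourceUri, new ChangeFeedOptions
            {
                PartitionKeyRangeId = range.Id,
                StartFromBeginning = true   // a real worker would resume from a saved continuation token
            });

            while (query.HasMoreResults)
            {
                foreach (Document changed in await query.ExecuteNextAsync<Document>())
                {
                    // The one upsert that does all the work: apply the change to the pivot.
                    await client.UpsertDocumentAsync(destUri, changed);
                }
            }
        }
    }
}
```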
>> Where is this repository, so people can go and look at all these samples? >> So, this is in GitHub, if you go take a look under our .NET samples. >> So, we have github.com/azure, and under Azure, DocumentDB .NET, you've got samples, and literally everything you've got right here, as you have it, is available there. So all the source code, we hide nothing. >> Yeah, we hide nothing. And, of course, there are more interesting cases beyond all of these: how to handle cases where you have large documents, large keys. So it illustrates the various tradeoffs, the patterns that you can use. Of course, there is a ton of documentation and samples aside from this as well in this repository. >> I wish we could talk for an hour or more. There's so much great information and so many great resources available to you to learn how to use Azure Cosmos DB and the DocumentDB features in it. Make sure you check out that GitHub repository. All of the samples that Aravind showed are available to you. And I learned a lot today on Azure Friday.
Info
Channel: Microsoft Azure
Views: 22,288
Keywords: azure cosmos db, design patterns, time series data, event sourcing, data storage, data modeling, microsoft azure
Id: 5YNJpGwj_Zs
Length: 15min 47sec (947 seconds)
Published: Thu Apr 12 2018