Behind the scenes - Exploring Azure Cosmos DB's front and back end - Episode 27

Video Statistics and Information

Captions
Hi everyone, welcome to Azure Cosmos DB Live TV. I'm your host, Mark Brown, and as promised we've got Andrew Liu back. Hey Andrew, how's it going?

It's really wonderful to be back here, Mark.

So back in episode 20, when I first had you on, we talked about a lot of stuff: the history and evolution of NoSQL as a database and a market, the history of Cosmos DB and how it went from DocumentDB to multi-model, and we started getting into all these different aspects of how Cosmos DB is built. I think we just ran out of time and ended up with a parking lot of topics we said we'd come back to. Now it's seven weeks later (that was episode 20 and this is episode 27), so I wanted to bring you back and keep the discussion going. I think it's a good topic, because whether you're new to Cosmos DB or experienced with it, it's a fully managed service, so there's a lot you never get to see. You don't get into the weeds provisioning VMs, deploying database engines, or configuring replication and networking just to make it work; you go to the Azure portal or deploy an ARM template, give it a name and some throughput, create some databases and containers, and that's pretty much all you need to do. There are all these things we obviously do for you as a fully managed service, and as a developer I'd like to know how things work behind the scenes so I can understand the behavior and why it does what it does. So here we are; let's continue our conversation about Cosmos DB and how it's built.

Yeah, I'm super excited. As always, everyone, please bring your questions; it is a live session, we'll keep an eye out for them, you can put them in the comments and questions window, and we'll try to get to them. Otherwise I'm going to riff with you and give a behind-the-scenes look at Cosmos DB and what that means to you from a performance and scalability perspective.

So where do you want to start? I guess there's the front of the house and our gateway, or the back of the house and the storage.

Let's jump over to a whiteboard. I've put together a parking lot of topics that we tried to get to last time, and I know the list grew as we riffed through the session. We'll continue to grow the list, and I'll try to dequeue items as we go (everyone, please feel free to add to it), but we can make this our rough agenda.

I figure a nice starting point is looking at gateway and direct connectivity. Cosmos DB actually has two connection modes. Effectively, Cosmos DB is a distributed database, meaning it's not just one server; we distribute it across many, many computers in the back end. So how do we route requests? Imagine Alice's data is on computer A and Bob's data is on computer B, and you run a query: how do we route that query to the right computer? That sets the stage for the first two topics: our connection modes, gateway and direct, will demystify our front end and back end.
The back end is partitioned, so it'll be a very natural segue into the second topic, how Cosmos DB actually handles scale-out. When we zoom in on any particular node there are also a few interesting aspects, like how queries work: once we've routed a request to a computer, what happens next?

There's also the other side of scale-out. The two dimensions of scale-out you can think of are: I have a bunch of requests, like queries, and I have a bunch of data that doesn't fit on a single machine. There are two ways to scale that. If you have a ton of writes and a ton of data, an easy way to handle it is to load balance: you get two computers, put half the data on one and half on the other, and load balance the requests across them. That's called partitioning. But what if I have the same piece of data that is also being read very frequently? You can scale that through replication, getting more reads by copying the data rather than dividing it. And as you copy that data there are a few other motivating reasons, especially around high availability: replication gives you a lot of redundancy, which is very good for mission-critical applications.

Then, if we have time, I'll get to one more topic, the Cosmos DB change feed. I think it's one of the most powerful features. It's a very simple feature and it often gets overlooked, but in terms of design patterns, how you implement things like event sourcing or materialized views, it's an incredibly powerful primitive for developers.

That looks like a good agenda, and Tom in the comments says he loves behind-the-scenes talks. Wonderful, so let's dive in.

I'm going to get some whiteboard real estate and demystify the gateway first. Let's break it into the lifetime of a request: there's the database client, there's a front-end view, which is how we route requests, and there's a back-end view.

The client can be multiple different things. The primary way to interact with Cosmos DB is through our SDKs. We have a variety of APIs, and no matter which one you're using, the core SQL API, the API for MongoDB, the Cassandra API, the Table API, or the Gremlin API, your application connects through some form of SDK. Alternatively you have an HTTP client; depending on the API there's a different protocol, but think of it as an HTTP client or a TCP client that ultimately makes the request, and that goes to this mysterious cloud endpoint. The cloud endpoint we expose is that connection string, and if you're using the core SQL API there's a REST API hidden behind it.
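As a minimal sketch of that front door, here's what connecting looks like with the Python SDK (azure-cosmos); the endpoint and key are placeholders. The Python SDK connects over HTTPS, i.e., gateway mode; direct TCP connectivity to the replicas is something SDKs such as .NET and Java expose.

```python
# Minimal connection sketch (azure-cosmos, Python). The account endpoint is the
# front door: requests travel over HTTPS to the gateway, which translates and
# routes them to the right back-end partition. Endpoint and key are placeholders.
from azure.cosmos import CosmosClient

client = CosmosClient(
    url="https://<your-account>.documents.azure.com:443/",  # connection-string endpoint
    credential="<your-primary-key>",
)

# Any operation from here on is a request to that gateway endpoint.
for db in client.list_databases():
    print(db["id"])
```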
If you're using some of the other APIs, for example the API for MongoDB, you're actually talking over a raw TCP stream, where the first bits are something like an op code. The op code tells you whether this is a create versus a query, and the following bits depend on that op code: if it's a query, they describe what the query is, the filter predicates, and what we want to project.

That request goes into the thing we like to call the gateway. Effectively it does two jobs. First, interoperability: this is what powers all of our different APIs. When you make a Cassandra request, a Table request, a Gremlin request, or a core SQL request, Cosmos DB doesn't actually have five different databases in the back end; it has one common platform, one common back end. The core SQL API is the quote-unquote native API, it's what our back end speaks, so the interop layer does a live translation: whether the query came through the MongoDB driver, the Cassandra driver, or the Table driver, Cosmos DB says, okay, I understand queries, so I'll translate this into a Cosmos DB query the back end understands. Second, we host another little client inside the gateway that uses direct mode against the back end, and that's what federates the request to the right back-end partition.

When you look at the back end, there's a hierarchy. We have these things we call partition sets, or in our documentation I believe they're called physical partitions. The thing to note about a physical partition is that it's not one server; it's a replica set. Within the partition set we create replicas, a little set per region, so if you have a multi-region account, two regions, three regions, n regions, then within that quote-unquote physical partition we set up database replicas in each region. The database replica is where we host the Cosmos DB software itself: think of it as a query engine, an indexing engine, and a storage engine. That's what does the bread-and-butter work of a database, writing a piece of data, reading a piece of data, and we distribute this workload across many, many replicas.

So think of a replica set as, just for the sake of example, four computers. Whenever I write a piece of data, say the color red, we replicate it across the replica set. If this is the first region, say West US, and I have a database account also configured in East US (it could be any region), then effectively the SDK says, let me write this request, the gateway routes it to a particular partition, and if I'm writing the red piece of data, that's an insert of red into one region. We then also establish replication across regions, where one of the nodes within the replica set is designated the forwarder, and that's what coordinates the replication.
Things get a little more complicated when you look at multiple write regions, active-active write topologies, because you're no longer just forwarding a piece of data. Suppose two conflicting writes come in at exactly 10 a.m., and across West US and East US the network latency is about 50 milliseconds. Within that 50 milliseconds I try to write two pieces of information: the record I'm modifying is an apple, and I say the color is red in region 1 and the color is green in region 2. We also have to do some things around conflict resolution so the regions converge; you don't want a situation where in one region the record's value is red and in another it's green. We'll go into that rabbit hole a bit later. For now we'll simplify the discussion and first talk about what happens with the client, the front end, and the back end, and what the overall topology looks like.

So this (let me change colors) is partition 1, and we have n partitions forming these replica sets. That's what allows us to scale. If I have an arbitrarily large amount of information, say a petabyte that I want to store in Cosmos DB, unfortunately we can't store a petabyte on one machine, but a petabyte can be subdivided. You keep subdividing the data until it fits on a machine, and you have many partitions that each hold a fragment, so that cumulatively you can store the entire petabyte. This scales to n partitions.

Now, how do we actually route? When we write a piece of data, say the red record, and later the blue record, how do we know the blue record should go to partition 2 and the red record to partition 1? You want this to be deterministic. Why? Because if I write a piece of data to partition 1 and later query for it on partition 2, the query won't find it; the read has to go back to where the data was written. So this has to happen in a very deterministic manner, and we have an algorithm for it, a fairly common one that pretty much every distributed database standardizes on, called consistent hashing. It's how we deterministically write to a partition and locate that data to read it back.

Imagine a number line that goes from zero to some maximum value, and a hashing function (that's the hash part of consistent hashing). If you feed in a value, say Alice, Bob, or Carol, it hashes to somewhere between the min value and the max value: Alice might land here, Bob here, Carol there. Now there's a challenge here. Let's talk about naive hashing and then evolve that into consistent hashing, naive versus consistent, and why it matters is a concept called elasticity.
Not only do we want to scale a database, meaning distribute the workload across a cluster of many machines, we also want to elastically grow and shrink it so we can manage costs. If we can only grow, or we're stuck, that puts ceilings on our scalability. Elasticity is the property that the workload can keep growing and shrinking to fit what's needed. Cosmos DB of course manages that, and it needs a mechanism to do so.

If I take a naive hash of those values, imagine I hash Alice, Bob, and Carol to some number and then mod it by the number of partitions. That way, regardless of what a value hashes to, we fit it onto the set of partitions. Say the number of partitions is initially two; then we solve the scalability problem by saying Alice goes to machine 1, Bob goes to machine 2, Carol goes back to machine 1, and so on. But what happens if we need to grow from two to three partitions? This is where it becomes problematic. If the old distribution looked like 1, 2, 1, 2, 1, 2 and I now need to move to 1, 2, 3, 1, 2, 3, you'll notice a lot of wasted movement. When I expand the example out to Danny and Eve, Carol moves from machine 1 to 3, Danny moves from machine 2 to 1, and Eve moves from machine 1 to 2. That isn't efficient, which is why a naive hash is not what we want.

Instead we use a different algorithm, consistent hashing. Remember that number line? Project it onto a circle, where this is the min value and this is the max value, just like a clock: once you hit the max value you roll over back to the min value. A consistent hash mixes two concepts: rather than just hash partitioning, it adds a layer of range partitioning on top of the hash. Given a set of hashed values, where Alice might be here, Bob here, Danny here, and Carol here, if I initially need two partitions, we simply divide the number line so the first half of hash values goes to machine 1 and the second half goes to machine 2.

Here's where consistent hashing really shines: the elasticity problem. Imagine machine 1 is filling up; Alice's and Carol's data turned out to be a lot heavier than expected, or we're onboarding even more users, say we've built a social media platform and it's going gangbusters, and we now have billions of users hashed all over this ring. The very easy thing to do is split that partition: instead of one machine carrying the workload, we have machine 1a and machine 1b, so in total we have three partitions. The nice thing you'll notice is that this doesn't require any data movement on the other partitions; I'm not moving data to or from machine 2. A partition can split itself in isolation and elastically grow.
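To make that contrast concrete, here's a small illustrative Python sketch, my own toy example rather than Cosmos DB's actual implementation: with naive hash-mod-N, changing the partition count remaps most keys, while with range partitioning over a hash ring, splitting one partition's range leaves every other partition's data in place.

```python
import hashlib

def h(key: str) -> int:
    """Hash a key onto a fixed numeric ring [0, 2**32)."""
    return int.from_bytes(hashlib.md5(key.encode()).digest()[:4], "big")

# Naive hashing: partition = hash(key) mod N. Changing N remaps most keys.
def naive_partition(key: str, n_partitions: int) -> int:
    return h(key) % n_partitions

keys = ["alice", "bob", "carol", "danny", "eve"]
before = {k: naive_partition(k, 2) for k in keys}
after = {k: naive_partition(k, 3) for k in keys}
print("naive: keys that moved when growing 2 -> 3 partitions:",
      [k for k in keys if before[k] != after[k]])

# Consistent hashing as range partitioning over the ring: each partition owns a
# contiguous hash range. Splitting one range moves no data on other partitions.
RING_MAX = 2**32
ranges = {  # partition id -> (inclusive start, exclusive end)
    1: (0, RING_MAX // 2),
    2: (RING_MAX // 2, RING_MAX),
}

def range_partition(key: str) -> int:
    v = h(key)
    for pid, (lo, hi) in ranges.items():
        if lo <= v < hi:
            return pid
    raise AssertionError("ring ranges must cover [0, RING_MAX)")

owners_before = {k: range_partition(k) for k in keys}
# Split partition 1 into "1a"/"1b" (ids 3 and 1 here): only partition 1's range changes.
ranges[3] = (0, RING_MAX // 4)
ranges[1] = (RING_MAX // 4, RING_MAX // 2)
owners_after = {k: range_partition(k) for k in keys}
print("consistent: keys moved off partitions other than the split one:",
      [k for k in keys if owners_before[k] != owners_after[k] and owners_before[k] != 1])
# The second list is always empty: only keys on the split partition ever move.
```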
Later on, if machine 2 needs to split as well, the same thing applies: machine 2 becomes machine 2a and machine 2b. We can subdivide these ranges into n subdivisions; think of a pizza that you can keep slicing, subdividing a slice to serve as many consumers as you need. That's effectively how we divide up the data, and because we have a deterministic way of locating information, when I later want to query for Alice's information I hash "Alice," get a value, say 1356, and look at the ring: this partition owns hash values from 0 to n, the next one owns n to m, and 1356 falls in the range from m+1 to o, so the query routes there. Therefore when you run a query you don't have to fan out. There's a category of queries called fan-out queries: a fan-out query is one where I don't know which partition to route to, so I ask all of the partitions and have each check its index. In this case, because we have a locator, a placement hint, we can deterministically route the query to just the one partition that's relevant.

Now, how do we implement our consistent hashing? Notice we need a value to hash, and that's why, when you create a Cosmos DB container, whether that's a collection, a table, or a graph in our various APIs, we always ask you for this thing called a partition key. People complain about it and ask, what the heck, why do I need that thing? Well, effectively the partition key is the placement hint. When you tell us, I want to partition my data by users, so I set my partition key to the user ID, you're telling Cosmos DB that it can deterministically look at the user ID on every record, run it through the consistent hashing algorithm, and route the writes, and the reads, to the correct computer.

Because we use the partition key for this, you want to put a lot of care and thought into your choice of partition key. There's a set of design-time problems and a set of runtime problems. The runtime problems are the ones Cosmos DB solves: how routing happens, how growing and shrinking provide elasticity. All of that is taken care of by the system, but at design time you still need to define a partition key.

Yeah, this is what we ask our users to do, and it's one of those things where we front-load a lot of big decisions on users when they're getting started with us. I see users constantly tripping over this, because they have to think about it at some depth before they can really do anything. But if you do put the time into it, look at what you get as a result.
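As a concrete sketch of declaring that placement hint with the azure-cosmos Python SDK (the database, container, and throughput values are illustrative placeholders), this is what creating a container partitioned by user id and then writing and reading with it looks like:

```python
# Sketch: the partition key declared at container creation is the placement hint
# that consistent hashing uses for routing. Names and throughput are illustrative.
from azure.cosmos import CosmosClient, PartitionKey

client = CosmosClient("https://<your-account>.documents.azure.com:443/", "<your-primary-key>")
db = client.create_database_if_not_exists("store")

# Partition by user id: every item carries /userId, and all of a user's items
# hash to (and stay on) the same logical partition.
carts = db.create_container_if_not_exists(
    id="carts",
    partition_key=PartitionKey(path="/userId"),
    offer_throughput=400,
)

# The write is routed by hashing "alice"; the read supplies the same value,
# so the request can go straight to the owning partition.
carts.upsert_item({"id": "cart-alice", "userId": "alice", "items": ["book"]})
cart = carts.read_item(item="cart-alice", partition_key="alice")
print(cart["items"])
```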
An important thing to call out is that this isn't unique to Cosmos DB; it's a fairly general principle for scaling out. The observation I've made is that scaling applications is a lot easier than scaling data. The reason is that for scaling an application, what you're typically taught is to make the application stateless. As long as the application is stateless, I can spin up another virtual machine and deploy my software, whether directly on a VM, through Azure App Service, or through containerization software like Kubernetes, hide it behind a load balancer, and randomly load balance across those machines.

The problem on the data side is that, as much as we want applications to be stateless, it turns out state really, really matters. We scale our app and say, okay, it's stateless, things are great, but imagine I'm building a shopping cart. My shopping cart service might be stateless, but if Alice goes to your website, adds something to her cart, and later goes to check out, and you say, well, we went stateless, so I have no idea what's in your cart, that's a problem: it was one of those things she added earlier. So state is really important; all we've done is punt the problem to the database. We've said: hey database, I'm going to dump all of my state on you; application, don't worry about state, the database will handle it, we'll run a query and it'll tell us what the shopping cart contents are. The challenge is that the database then needs a way to scale out too. However much you scale out your application, if you home all of your data on a single computer, the database very rapidly becomes the bottleneck.

These design patterns aren't new; consistent hashing is actually a very old algorithm. If the year were 1995 and the world still revolved around relational databases, you would manually shard across relational databases: if I needed ten shards, I'd set up ten database instances with a tenth of the data set in each. The challenge is that it's not just a design-time problem anymore, it's also a runtime problem: your application has to solve the routing, and your application has to solve the "oh crap, what if I need to grow from 10 shards to 11 because business is doing well and we need to grow?"

So these are not novel problems, but times have changed. In 1995, I still remember the days of writing plain old HTML websites; the web was very simple and static. The whole concept of DHTML, having dynamic content on your website, was completely new (terrible flashbacks right now). This was pre-JavaScript, or JavaScript was very simple; AJAX, asynchronously calling a back end for dynamic content, just wasn't a mainstream concept yet unless you were really early, a thought leader. Fast-forward to today and the web is very dynamic. Rather than a static product catalog, things like real-time personalization can really make that catalog pop, actually surfacing the things that are relevant to a shopper.
To personalize like that, you need to know a lot about your users, and so scalability challenges have simply become mainstream. It's not just retail and real-time personalization, either. In manufacturing, when the assembly line halts it's really, really bad, so you want things like predictive maintenance to keep the line online; but to do that, every machine now produces a heartbeat and a pile of telemetry, and once again that's a pretty gnarly scale problem.

Anyhow, long story short: partitioning is the key to making this work. I'll leave the partitioning conversation with a few design principles; they're your guiding light for this confusing new topic that's been thrust upon you, scoped to picking a partition key.

Design principle one: choose a partition key that distributes your overall workload. You want to efficiently load balance the workload across as many computers, or as many partition key values, as possible.

Design principle two: think about how to efficiently route the majority (keyword: majority) of your requests, especially your query patterns. Ideally a query can be efficiently routed. What if it can't be, because the partition key isn't in the query? Say a use case is partitioned on user ID and I run the query "SELECT * FROM c WHERE c.userId = 'Alice'": we know exactly where Alice's data is. But when I run a different query, "SELECT * FROM c WHERE c.age > 30," well, Alice might be over 30, Bob might be over 30, Carol might be over 30, and the only way to find out is to do this thing called a fan-out.
A fan-out is not necessarily a bad thing. With ad hoc queries, I guarantee that if you enumerate every possible query you'll run, you won't find the exact same WHERE clause in every single one; WHERE clauses differ. But you want the majority of your queries efficiently routed. The reason is that one fan-out query, or ten, won't make or break an application; they can run in parallel, so the performance of an individual fan-out query is actually still pretty good, because we can just ask all of the partitions at once. What really hurts is when every query fans out. If my goal is to scale to, say, a million requests per second, and every machine has to handle a million requests per second, that's not going to work. The way we actually reach a million requests per second is by subdividing: ideally 10,000 requests go to machine 1, 10,000 go to machine 2, and so on. A fan-out, instead of dividing the load, runs the million on every single machine, and you have a scalability challenge again.
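Here's a sketch of the difference with the azure-cosmos Python SDK, reusing the carts container shape from the earlier sketch (the property names are illustrative): the first query carries the partition key and can be routed to one partition, while the second filters on a non-key property and has to be fanned out across partitions.

```python
# Sketch: same container, two queries. One is routable to a single partition,
# the other must fan out. Endpoint, key, and names are placeholders.
from azure.cosmos import CosmosClient

client = CosmosClient("https://<your-account>.documents.azure.com:443/", "<your-primary-key>")
carts = client.get_database_client("store").get_container_client("carts")

targeted = carts.query_items(
    query="SELECT * FROM c WHERE c.userId = @uid",
    parameters=[{"name": "@uid", "value": "alice"}],
    partition_key="alice",               # routed to the single owning partition
)

fan_out = carts.query_items(
    query="SELECT * FROM c WHERE c.age > 30",
    enable_cross_partition_query=True,   # every partition is asked to check its index
)

print(list(targeted))
print(list(fan_out))
```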
What do you say to someone designing their database and their model who goes: hey, I ran all these queries in Data Explorer and they didn't cost a whole lot in RUs, but six months or a year later I'm running into problems?

The thing to remember is that during initial development of a brand new application, the amount of data and traffic is still low. Once you hit production and people are actually using the system, that's typically when you realize that running one query and running a million queries at any given point in time are very different stories. So you want to figure out ahead of time which queries will be your common queries. Given a hundred different queries, a minority of them end up being the majority of the workload. If I enumerate query 1, query 2, query 3, all the way to query 100, you'll see a Pareto principle, the 80/20 rule (usually even more drastic than that): a minority of your queries is the majority of your traffic. Using a product catalog as an example, when the application boots up you show aggregate product listings a lot more often than certain details pages, because people browse around. So out of those 100 queries you'll find, say, three that keep getting run over and over, and those are the ones to optimize around. If query 100 runs infrequently, fine, just let it fan out; but if queries 1, 2, and 4 are the ones constantly peppering the application, those are the ones to optimize the partition key around. The way to get clarity on this is to ask: do you understand how your application is being used?

If it's a shopping cart or a product catalog, you can intuitively reason that the common operations are "give me the shopping cart for Alice," "give me the shopping cart for Bob." What's not going to be frequent, and what users don't typically do, is "find me all the shopping carts for people between the ages of 30 and 35." You might have a business analyst interested in that kind of information, maybe for some very important business decisions, but you might have a million shoppers at any given time, and you are not going to have a million data analysts asking that question every second. In fact you probably have a much smaller team; if you have a million data scientists, wow, that's amazing, but most companies have more like a team of ten data scientists, and instead of a query per second they run maybe five queries per hour. Translate that into queries per second and it's less than one, while meanwhile my website is doing a million.

They're also probably not running those queries against the OLTP data store if they're data analysts, right?

You bring up a very good point: your data analyst honestly has other tools available, and a beautiful thing Cosmos DB has is a completely different feature called Synapse Link, which gives you a query engine purpose-built for analytics. That's actually a nice segue into our next topic, which I believe was queries.

I'd broadly put queries into two buckets: OLTP queries and OLAP queries. The T in OLTP stands for transactional. With those queries you're typically doing a ton of them, it's what your live website is doing: give me the shopping cart for Alice, now give me the shopping cart for Bob. Lots of small requests, and they tend to be needle-in-a-haystack queries: given a million users, Alice really doesn't care what's in the other 999,000 shopping carts, she just wants to know what's in hers, because that's what she's going to buy.

I always leave my Amazon wish list up, though. [Laughter] You'd probably love for everyone else to buy and send you stuff. Send them my way.

So let's capture and codify this; let me get some board real estate. OLTP queries versus OLAP queries, and why this matters is that you optimize them very differently. OLTP queries tend to be highly concurrent, there are lots of them in flight, and they tend to be short-lived; they're actually pretty quick, since "find what's in a shopping cart" is a relatively simple query. They're also latency sensitive: if my shopping cart took an hour to load, I'm pretty sure the user bounced at 59 minutes and 59 seconds, too late. Nobody will sit and wait an hour for a shopping cart to load unless you're selling something really compelling, like a PlayStation 5 or a new graphics card, things in hot demand and short supply. Most people just won't put up with it.
If you're just selling t-shirts, by the time the shopping cart loads I've bought a t-shirt on another website; sorry, I'm gone.

OLAP queries, on the other hand, are compute-dominant. Rather than fetching Alice's shopping cart, I want to do some analysis; these are your typical analytics (that's the A in OLAP). A key difference is that OLAP queries tend to be scans while OLTP queries tend to be seeks. A seek is: given, say, a terabyte data set, I only want a kilobyte of information, and I want you to find that kilobyte, the needle in the haystack, very fast and very efficiently. In contrast, OLAP is often a group-by: find sales patterns by age demographic, by geography, by some set of dimensions. Those are scan queries that read a majority, or at least a very large chunk, of the terabyte and then do batch computation on top of it. The main thing you want for a seek is an index, whereas for a scan, if you're going to read everything anyway, an index can still help narrow the analysis when you only care about certain dimensions, but at a macro level it isn't nearly as important to optimize as aggressively as an OLTP query.

So there are different characteristics, and then there's how distributed, scale-out systems optimize for each, and one key differentiator is how they manage resource governance. I have a cluster with finite CPU; I can keep scaling it, but whether I have a hundred nodes or a thousand, they have a fixed number of CPUs and those CPUs cost very real money. If I have a million requests, I should be very frugal with CPU, memory, and I/O per request, because being frugal per request is what lets me run lots of requests in parallel. You're optimizing for parallelism at the workload level, not the query level.

So this is managed, at least in terms of how requests get distributed, at the gateway level, right? When a query hits the gateway, the gateway distributes it; it knows which physical partitions to hit because it knows where the data is, if the partition key is passed in, and if it isn't, it fans out.

Right, and what I'm trying to get at is: should I run analytics directly against Cosmos DB's OLTP surface area, or should I run transactional workloads on my data warehouse? The answer is that neither works well, and the reason comes down to architecture, specifically how these systems handle resource governance when they scale out. With millions of requests, Cosmos DB tries to reduce the CPU, memory, and I/O per request so it can run all of them. In contrast, if I only have a team of data scientists running ten queries per hour instead of a million requests per second, I want to parallelize the request itself (my spelling is not very good: "parallelize"). You're optimizing for a different thing: the latency of a single, very compute-heavy request. This is why MapReduce is so popular in that space.
Yes, MapReduce is the core primitive behind pretty much any modern data warehouse or big data system. Even if you aren't writing maps and reduces yourself, when you run something like SELECT ... GROUP BY with a sum, average, min, max, whatever math you want, what effectively happens behind the scenes is that the query is turned into map and reduce steps that run in the back end. The main question is: given a pool of compute, do I give all of it to this one request, meaning I can't run many of them concurrently because I've parallelized the request itself, or do I minimize the footprint of each request so I can run a ton of them?

That's why those systems are job-based as well, right? They're long-running, they hit lots of machines, and there's lots of coordination involved. I remember this from the old, well-scoped query languages.

Yeah, whether you're writing MapReduce, Hive, Pig, SCOPE, or U-SQL, at a high level it's all more or less the same thing happening. When you zoom in there are differences across these systems, and a lot of intellectual property built up over the years, but at a broad category level there are fundamental differences between transactional scale-out systems and the kinds of things I'd run in a data warehouse or on something like Synapse; those look very different from what I run in Cosmos DB. You really want to use purpose-built tools; otherwise it's the metaphorical hammering in a screw. Maybe if you hit the screw hard enough it goes in as if it were a nail, but it'll probably tear all the threads out of the hole; it's not an efficient way to go about things, and you should probably pick up a screwdriver.

So let's start zooming in on the OLTP side, although both sides have a lot of depth, each is a rabbit hole. On the OLAP side it's actually not just batch: within analytics there are subcategories like batch analytics, stream analytics, and interactive analytics, and within Azure there are different tools for each. For batch analytics, Synapse SQL or a Spark SQL query on Synapse or Databricks can work very well. For stream processing, you should probably be looking at Azure Stream Analytics, a more purpose-built tool there. And for interactive analytics, the tools are a lot more responsive, you're not waiting an hour for a query to come back, but they put very tight boundaries on what you're able to accomplish; this is where Azure Data Explorer, Log Analytics, and Time Series Insights do very well.
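To make the earlier point that "a GROUP BY aggregate becomes map and reduce steps" concrete, here's a toy Python illustration; it's my own sketch, not what Synapse or Spark literally executes. The map phase emits partial aggregates per key, the shuffle groups them by key, and the reduce phase combines them, which is exactly the shape that lets the phases run on different machines.

```python
from collections import defaultdict

# Toy illustration of how "SELECT city, AVG(age) ... GROUP BY city" decomposes
# into map and reduce steps; a real engine runs these phases across many machines.
records = [
    {"name": "alice", "city": "Seattle", "age": 30},
    {"name": "bob", "city": "Los Angeles", "age": 35},
    {"name": "carol", "city": "Seattle", "age": 42},
]

def map_phase(record):
    # Emit (key, partial aggregate): here (city, (sum_of_ages, count)).
    yield record["city"], (record["age"], 1)

def reduce_phase(key, partials):
    total = sum(s for s, _ in partials)
    count = sum(c for _, c in partials)
    return key, total / count

# Shuffle: group partial results by key (in a real system, across the network).
grouped = defaultdict(list)
for rec in records:
    for key, partial in map_phase(rec):
        grouped[key].append(partial)

print(dict(reduce_phase(k, v) for k, v in grouped.items()))
# {'Seattle': 36.0, 'Los Angeles': 35.0}
```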
So on that note, let's talk about your indexing. Yeah, let's jump into indexing on the OLTP side. Cosmos DB has a fairly interesting approach to indexing. Imagine I have a table, and whether it's a table or JSON doesn't make a huge difference: a person's name, Alice, her age, 30, and an address, let's say Alice lives in Seattle. The traditional way we used to build indexes is, say I'm querying on the name and the age a lot, I create a B-tree on name and a B-tree on age, two independent trees. That brings with it the legacy of ALTER TABLE and CREATE INDEX style management, the classic index management behavior. In contrast there's another approach I want to show, called an inverted index, and it brings some nice properties.

Let me first talk about the pain point. The pain point with traditional index management shows up as I grow my cluster. Remember that we've now sharded our database across many little computers. An easy way to see the problem is to go back to something everyone understands well: relational databases. How long did it take to run ALTER TABLE or CREATE INDEX? Now magnify that: I have 10 database shards times four replicas, 40 computers, and when I do ALTER TABLE, CREATE INDEX, or add a column, these things take a while. During that time your application can't operate smoothly: if my application's data model expects a new property I've added for a new feature, then until adding the column and creating the index on it has completed across every single one of those shards and every single one of those replicas, the application just has to sit there and wait. The way we used to handle this was some kind of Patch Tuesday: hey everybody, we're taking the website offline for maintenance and we'll be back at 6 a.m. on the next business day.

The challenge is that modern workloads are very sensitive to downtime. When I take my shopping cart offline, that's a lot of lost revenue; all order flow just stops, and during that maintenance window, especially if I'm selling goods that aren't unique to me, goods available in someone else's catalog, well, crap, I just missed out on a whole lot of traffic.

How do we solve that? Look at what's fundamentally happening: every shard and every replica is saying, "I am the source of truth for the data model," and meanwhile every instance of your application is also saying, "I am the source of truth for the data model." So I have, say, 10,000 sources of truth, and if they don't match, the application doesn't run well. A better way to handle this is to have one source of truth: make the application itself authoritative. Whatever the application's source code says defines the data model, and I don't have to align a bajillion other things to resolve the discrepancy. This is the birth of flexible schemas. Flexible schema does not mean schema has gone away; it means that, from a management perspective, you can keep your workload online even in a highly distributed system, and that's what an inverted index brings to the table: a way to handle flexible schemas from an indexing point of view. So rather than having a B-tree per column, with well-defined columns like name, age, and address, and having to run ALTER TABLE and CREATE INDEX again tomorrow when zip code becomes the new column on the block, the inverted index is flexible.
To show what an inverted index looks like, let's talk about how Cosmos DB views data. We've embraced a few different data models fairly heavily; for simplicity I'll focus on the document-oriented model, but most of these concepts translate across the NoSQL data models. The neat thing about a document-oriented model is that it's self-describing. JSON has become fairly ubiquitous, and every record tells you its own schema: rather than table-level schemas you have record-level schemas. The record itself says, I have a name, I have an age, I have an address. For one instance I might have Alice; having "Alice" tells you several things, not only what the value is but also its data type, a string. If age is 30, the record tells you it's numeric. This can get more complex: if I embed an address with the city and zip code as separate fields, just to show the shape of the inverted index, say the city is Seattle and the zip code is 98101. In contrast we also have Bob: Bob is 35, lives in Los Angeles, and his zip code is, honestly I don't know what the zip codes are over there, 12345. (You know what yours is, but not the ones up in L.A.; 902-something, I don't know.) Call these record 1 and record 2.

Now I want to run a query over this data set of two records: find me users whose age is 30, or find me users who live in zip code 98101. Here's what an inverted index looks like. There's a logical layout and a physical layout; we'll use the logical layout to build intuition. We first have a root, and the root has pointers to records 1 and 2. Underneath the root is another tree: it has a node called "name," and under name it has both "Alice" and "Bob." Bob points to record 2 and Alice points to record 1; likewise "name" itself points to records 1 and 2. As new records come in, say tomorrow I add another property, "address," with a "city" underneath: Seattle points to record 1, Los Angeles points to record 2, city points to records 1 and 2, and address points to records 1 and 2.

The types of queries you can run are things like: SELECT * FROM this container WHERE the name property is defined. We look in the index, see that the name property is defined for documents 1 and 2, and surface both Alice and Bob in the query results. If I run SELECT * WHERE name = "Alice", we look up Alice, Alice points to record 1, and we surface record 1 in the query result. What's neat about an inverted index is that you're not explicitly defining it at a per-property or per-column level.
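Here's a toy Python sketch of that logical layout; it's my own illustration, and as described below, Cosmos DB's physical layout uses bitmaps and other compression tricks rather than plain Python sets. Each path, and each path/value pair, keeps a posting list of the documents that contain it, and new paths appear automatically as documents are indexed.

```python
from collections import defaultdict

# Toy sketch of the *logical* shape of an inverted index over two documents.
docs = {
    1: {"name": "Alice", "age": 30, "address": {"city": "Seattle", "zip": "98101"}},
    2: {"name": "Bob", "age": 35, "address": {"city": "Los Angeles", "zip": "12345"}},
}

index = defaultdict(set)   # (path, value) -> doc ids; (path, None) -> "path is defined"

def index_doc(doc_id, obj, path=""):
    for key, value in obj.items():
        p = f"{path}/{key}"
        index[(p, None)].add(doc_id)          # supports IS_DEFINED-style lookups
        if isinstance(value, dict):
            index_doc(doc_id, value, p)       # recurse; new paths appear automatically
        else:
            index[(p, value)].add(doc_id)     # supports equality lookups

for doc_id, doc in docs.items():
    index_doc(doc_id, doc)

# "SELECT * WHERE c.name = 'Alice'"      -> follow the (path, value) posting list.
print(index[("/name", "Alice")])          # {1}
# "SELECT * WHERE IS_DEFINED(c.address.zip)" -> follow the (path, None) posting list.
print(index[("/address/zip", None)])      # {1, 2}
```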
Tomorrow, for the sake of example, say we never defined a zip code for Alice, but for Bob we define one for the very first time. From an app-development perspective, you've added a new feature that needs a new property. The inverted index is very dynamic: it just adds another path. Oh, I have a zip code; the records with a zip code defined are record 2, and the zip code value 12345 also points to record 2. So as data comes in, we're just adding new paths to the tree, and without any index management, no ALTER TABLE or CREATE INDEX, the tree populates itself.

That isn't to say you shouldn't optimize the index. You still should, because if you have a document with a thousand properties and you only query off ten of them, it's very wasteful to generate the other 990 entries. However, it becomes an optional optimization rather than a mandatory deployment step.

Does that get expensive from a storage perspective? Look at this, there's a lot of stuff getting stored in here.

This is the interesting thing we found when we did the analysis: when you look at typical documents in a collection, most of them end up having the exact same properties. If I have a shopping cart, what are the chances that if document 1 has the property "name," document 2 also has the name of the shopper? Very high. So at a property level there's a very high degree of overlap, and because of that the additional overhead is small: when I insert document 3, all I'm doing in the tree is adding the value 3 to the posting lists. This is highly compressible. We also play additional tricks at the physical layer: once we have the logical tree, physically we do a bunch of bit manipulation, so those pointers are really big bitmaps we can optimize much further. In practice, because of the high degree of overlap across documents, even when they don't have exactly the same schema, the index overhead isn't too high: if you store a gigabyte of data and turn on automatic indexing, on average you'll see something like 400 megabytes of additional index storage, roughly 40 percent more, for indexing every property, and you can then optimize that further.

Where it becomes problematic is the RU charge. If I have a thousand properties and only query ten of them, that's only one percent, and there is an RU charge for generating all of those index entries. So rather than the storage footprint, you should really be looking at the RU overhead. At roughly 0.2 RUs per indexed path, auto-indexing 20 paths costs about 4 RUs, not a big deal; auto-indexing a thousand properties is around 200 RUs, and if I only need ten of them, I should probably optimize my indexing policy. That's how you want to think about indexing: I'd encourage people to optimize, but I'd also point out the nice thing, which is that indexing becomes an optional optimization rather than a mandatory deployment hurdle.
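As a sketch of what that optimization looks like, here's an indexing policy that only indexes the paths you actually query and excludes everything else. The policy shape follows Cosmos DB's documented JSON; the account, database, container, and paths are placeholders I've chosen for illustration.

```python
# Sketch: include only the paths you query (/userId and /name here); exclude the rest.
from azure.cosmos import CosmosClient, PartitionKey

client = CosmosClient("https://<your-account>.documents.azure.com:443/", "<your-primary-key>")
db = client.create_database_if_not_exists("store")

lean_policy = {
    "indexingMode": "consistent",
    "includedPaths": [
        {"path": "/userId/?"},    # the trailing ? matches the scalar value at this path
        {"path": "/name/?"},
    ],
    "excludedPaths": [
        {"path": "/*"},           # everything not explicitly included is skipped
    ],
}

catalog = db.create_container_if_not_exists(
    id="catalog",
    partition_key=PartitionKey(path="/userId"),
    indexing_policy=lean_policy,
)
```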
That, at a high level, is how indexing works. There are a couple of other nuances worth calling out: geospatial indexing and composite indexing. Geospatial indexing is a pretty deep rabbit hole on its own. There are ways of handling it; basically there's this thing called geohashing, and you can use quadtrees, where, if I have a map and a coordinate, I subdivide the map into quadrants and subdivide those again, so I know a point falls in this part of the tree, then this part, then this part. That, at a high level, is how quadtrees quickly narrow down a bounding-box or distance query. I won't go much deeper, because from a timing perspective I think we might be running up against an hour again.

That's fine, it's my show and I don't have to worry about any show coming after me. Oh, wonderful. Yeah, we'll go over a little bit then; that's fine.

My takeaway here is that composite indexing is actually the more important one. Composite indexing becomes applicable when you have things like WHERE a AND b as a condition, or alternatively WHERE a ORDER BY b. Cosmos DB can check the inverted index for a, check the inverted index for b, and do some set theory behind the scenes: for WHERE a AND b it looks at the set of index postings for a, the set of index postings for b, and computes their intersection. On its own that doesn't take a huge amount of compute, but it adds up when it's a frequently run query and you're constantly recomputing the intersection of a and b. What we can do instead is materialize the index postings for the intersection of a and b, and that's effectively what a composite index does. Rather than computing that intersection for every single query (if I run the query a million times, I recompute it a million times), wouldn't it be great to compute it once and let every query take advantage of it? When you define a composite index on a and b, we pre-compute at write time and optimize for queries that use properties a and b, so on the read path it minimizes the work, and the RU charge, for such a query.

And it's sometimes staggering how much it minimizes it. I've seen queries go from something like a thousand RUs down to ten. It's insane, and it really adds up.

Yeah, now imagine running that query a thousand times per second: you're saving that difference a thousand times over, every single second.
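Continuing the indexing-policy sketch above, here's roughly what defining a composite index looks like for a query shaped like `SELECT * FROM c WHERE c.category = @cat ORDER BY c.price`. The "compositeIndexes" section follows the documented policy shape; the paths, sort orders, and container name are illustrative assumptions.

```python
# Sketch: a composite index pre-computed at write time for queries that filter
# on /category and sort on /price. Names and paths are illustrative.
from azure.cosmos import CosmosClient, PartitionKey

client = CosmosClient("https://<your-account>.documents.azure.com:443/", "<your-primary-key>")
db = client.create_database_if_not_exists("store")

composite_policy = {
    "indexingMode": "consistent",
    "includedPaths": [{"path": "/*"}],
    "excludedPaths": [],
    "compositeIndexes": [
        [
            {"path": "/category", "order": "ascending"},
            {"path": "/price", "order": "ascending"},
        ]
    ],
}

products = db.create_container_if_not_exists(
    id="products",
    partition_key=PartitionKey(path="/category"),
    indexing_policy=composite_policy,
)
```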
I'm going to take a peek in case there are any questions; please add them to the chat window, I'm happy to entertain them. Tom, or any of the other folks, feel free to add to the list, and we can make this a running series: we'll continue to talk about Cosmos DB's back end in follow-up episodes, and we'll drill in more on replication, HA, and the change feed, since each of those is a rabbit hole on its own.

I'm still working out the schedule for the rest of the calendar year. I've got most slots filled, but there are about three spots left for the end of the year, and then we're going to take some weeks off; we've got holidays coming up in November and December. There are also conferences happening again: NDC Oslo, and I will be in Oslo at the end of November. I was toying with the idea of doing a show remotely from NDC, but honestly the logistics are a little too crazy, and it would be about 10 o'clock at night there, a little late. But we'll get you scheduled back, and maybe people can come and ask questions, either about the stuff we talked about today or from episode 20, where all of this started. For anyone watching after the fact, feel free to leave a comment once this video is posted, and we'll keep this a living, evolving discussion.

That sounds good. That sounds cool.

All right, thank you everyone for joining us this week. We had Azure ServerlessConf yesterday and early this morning; you should be able to watch those sessions on demand if you missed any of them, at the aka.ms Azure ServerlessConf link. I'm thinking at least the Americas sessions should be up, it's only been about 24 hours, and there's a ton of on-demand sessions available there as well, so please go check out all the great sessions. There were some fantastic ones yesterday: the keynote from the .NET community guys James and John was amazing, showing how they do all their content production using serverless technologies; Dolby showed how they do signal processing entirely serverless; there were just lots of great sessions all day.

We've also got a new user group. We had our first meeting a couple of weeks ago, and we'll be scheduling the next one for the end of October; details will come out in the next week or so, but feel free to sign up and RSVP, or at least get reminders when the next meeting is scheduled, via the aka.ms Cosmos DB user group link. And of course, if you've got any show ideas, I do have some slots left for the end of this year, so feel free to hit us up on Twitter at Azure Cosmos TV.

Oh, I do see a question, no, just a comment from an old friend: "Great explanations Andrew, very clear, love your insights." Yep, always good, Andrew.

So that's it. Next week, folks, we'll have Deborah Chen back; she was with us a couple of weeks ago talking about all the autoscale stuff, and she's going to give us a sneak peek into hierarchical partitioning, which is something I'm really excited about. In fact, if you're building SaaS applications or using synthetic partition keys today, this is something you should definitely learn about and check out, and see what we're building around it.

So that's it. Thanks again, Andrew, I appreciate you coming by; good seeing you again. Cool, awesome, thank you everyone. All right, thank you everyone, see you next week. Bye.
Info
Channel: Azure Cosmos DB
Views: 487
Id: O5oBM8O-MCE
Length: 69min 25sec (4165 seconds)
Published: Fri Oct 01 2021