Meetup November 19, 2021

Video Statistics and Information

Captions
Okay folks, welcome to the Azure Cosmos DB Global User Group. This is meetup number three, our November 2021 edition, and I've got a guest with us this month. Ken, do you want to say hi and introduce yourself?

Good afternoon everyone, I'm Ken Muse. I'll be presenting today. I'm the VP of Professional Services at Wintellect as well as a four-time Azure MVP.

Great. So folks, thanks for joining us. Feel free to make comments or ask questions throughout the talk; Ken's going to do a Q&A at the end. We'll also do a drawing for two of these very fancy Cosmos DB neoprene drink coasters (mine's kind of dirty, yours would be cleaner, I promise) and a Cosmos DB sticker. To enter, just put a comment or a question in the chat and we'll do the drawing at the end. With that, let me get rid of the "we'll start shortly" banner, because we're going to start right now. Ken, it's all yours, my friend, take it away.

Thank you very much. Welcome, everyone. Today we're going to be talking about how you think about and adopt Cosmos DB if you're coming from a SQL background or you're a SQL Server developer. There are a number of differences between the two products, so we're going to walk through some of those and help you take the skills you have today and understand how they translate to the Cosmos DB world. As I mentioned a moment ago, my name is Ken Muse, I'm the VP of Professional Services at Wintellect, and I'll be guiding you through this session. Afterwards, if you have any questions, feel free to reach out to me on Twitter or through LinkedIn; I'm glad to help you make better use of these technologies.

In this session we'll walk through what Cosmos DB is, how it compares to SQL Server, and what you need to know to understand the basics of data modeling and query design.

From a high level, Cosmos DB is Microsoft's globally distributed, multi-model database service. It's elastically scalable and it's distributed all over the world. If you're coming from a SQL Server background, trying to figure out how you replicate data and make those changes available everywhere, not just for disaster recovery but for performance, can be quite daunting and quite challenging. Cosmos DB takes some of those pains away and makes it very easy to build a distributed data solution that's globally available, can scale up or down on demand, and lets you balance the trade-offs that typically come if you try to build this kind of model yourself. It's also one of the few Microsoft products that carries some additional SLAs, and I'll discuss that more in a moment, but essentially it guarantees you a number of things, including latency, so you have very predictable performance characteristics when you use this product. Today we're going to focus primarily on just one of the models that are available, but it's worth knowing that for different use cases other models exist that teams can use to target specific types and ways of storing data. As I mentioned, Cosmos DB is globally distributed: it's in more than 30 regions worldwide and it supports multi-master out of the box. So if you're trying to figure out how to replicate data, make it available, and lower latency for distributed applications, Cosmos was designed with that particular use case in mind, out of the box.
Whether you're comfortable working through the portal or using infrastructure as code with ARM templates, it's either a one-button click or a couple of lines of code to enable those features.

It's also elastically scalable, which means we can quickly grow or shrink what we're using on demand, aligning Cosmos to the particular workload we're going to have so that we're only paying for what we use. I'll talk about this more in a moment because it's an important concept: usage is expressed in request units, which represent a blend of memory, CPU, and IOPS. If you've worked with Azure for a long time, this isn't too different from the DTU model that existed for Azure SQL for many years. Essentially, when you're talking about request units, you're talking about how much it costs to run a query within a defined period of time, and a key concept with that is how much you can do per second. This is the throughput, which is expressed in request units per second, or RU/s. That term is going to be used a lot, and in today's session you'll understand more about why.

Another unique characteristic of Cosmos DB is that it has five consistency levels. If you're coming from the SQL Server world, you're used to a strong model; that is, you have guarantees about the consistency of the data between a read and a write: how long it takes for a read to see the write, how consistent things are between two requests, what happens under load. In the rest of the NoSQL world, we're used to an eventual model, which essentially says I'm going to write the data and at some point it's going to be consistent, but I don't know when. The wonderful thing about Cosmos DB is that it recognizes both of these models and the design trade-offs that come with either, and it expands on that by making three other models, sitting in between, available so you can pick the right level of consistency for the job. By default, session is the primary consistency level, and it's going to feel very similar to what developers are used to when working with a traditional database. At a high level, what's important to know is that as you move towards strong consistency you give up a few things: availability guarantees, latency becomes higher, and the overall throughput may become a bit lower, but you gain guarantees around the ordering and behavior of your reads and writes. At the other extreme, eventual, you get the highest possible availability and the lowest possible latency. Everything can perform faster, but it's weaker consistency; in a globally distributed system it may take a little longer before you see all those changes reflected in every environment. In many cases that may be sufficient, but you have a range of options to fine-tune what you need. If you're working with a single region, a single location, it's worth knowing that you're effectively getting strong behavior; most of this really becomes something you notice as you spread out and begin to use that globally distributed nature.

The other really big thing to know about Cosmos, if you're coming to it new, is that it's the only service in Azure that guarantees and puts an SLA on latency. That can be a bit of a shock if you're coming from SQL Server, but it's worth realizing that the latency figures that exist in Azure for that product are really targets: it's where it will end up most of the time, but there isn't the same SLA on it.
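As a concrete illustration of choosing a consistency level, here is a minimal sketch using the azure-cosmos Python SDK (the SDK, account URL, and key are assumptions; the talk itself doesn't show code). The client-level consistency can be relaxed relative to the account default, but not made stronger:

from azure.cosmos import CosmosClient

# Hypothetical account endpoint and key.
client = CosmosClient(
    url="https://<your-account>.documents.azure.com:443/",
    credential="<your-account-key>",
    # One of: Strong, BoundedStaleness, Session, ConsistentPrefix, Eventual.
    consistency_level="Session",
)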
With Cosmos DB there are, in fact, specific SLAs that guarantee you the performance of the overall system: you do your part and the system will do its. Those guarantees are documented online; I've put them here, and it's worth knowing that as you go to a more strongly consistent model across multiple regions, the time does increase, because it takes time even for light to travel. In each case there are lots of guarantees behind the scenes for the integrity of that data, and lots of control over what happens if, in those brief windows, there's a conflict. As a developer, it's also worth understanding that when we talk about latency and these models, we're often talking on the order of milliseconds, so the delays aren't huge, but they're worth being aware of. Even with eventual consistency, while there's no hard guarantee of how fast that catch-up occurs in terms of specific milliseconds, the targets are quite rapid, so it's really about fine-tuning your approach to the overall system.

So, the next big question I hear from SQL Server developers: why would I consider Cosmos DB when I have SQL Server? A key to that is understanding what Cosmos DB is intended for. It is optimized for point queries. If you're doing a lot of operations that involve reading and writing by record IDs, that's a point query: I'm getting a specific record and changing it, and that is where Cosmos DB will excel above all others. It can also really stand out because, designed properly, it has virtually unlimited data volume support, meaning your database can grow as big as it needs to and still provide these guarantees. If you've ever had to deal with global distribution, trying to replicate SQL Server and handle failovers, you know what that pain point is, and you'll appreciate the benefit Cosmos brings by distributing reads and writes globally when you need it to. And if you're working with different data modeling methodologies, such as table, column store, column family, or graph modeling, that's where the multi-model capabilities come into play.

Cosmos DB is also great for working with JSON, and you'll often hear it referenced as a way of storing JSON documents. I want to clarify for everyone that this is not the same as storing a JSON document. If we're talking about straight document storage and archival, blob storage may be a better fit. But if we're talking about needing to read and write by ID, reference certain parts of the data, and react to changes in it, then Cosmos DB giving you that paradigm you're used to around JSON, and the apps that use JSON, can be incredibly powerful and convenient. I'll go into the JSON structure a bit more in a minute, because understanding what it really is matters for understanding how to do the data modeling.

So if we have these features, why SQL Server? Well, it comes down to the use case. SQL Server is optimized for aggregations and set-based operations. If your primary use case is pulling reports, grouping together similar data to analyze, or returning chunks of data by their relationships, SQL Server is a much more natural fit and will align much better. That doesn't mean you can't use Cosmos DB, and with some of the features it has, such as Synapse Link, you can bridge the two worlds very effectively. It's great to know, because with Azure it really is about using the right tool for the right job at the right time.
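To make the point-query idea concrete, here's a small sketch assuming the azure-cosmos Python SDK and a hypothetical orders container partitioned on /customerId. A point read (id plus partition key) is the cheapest possible operation, while a query is still reasonable when scoped to one partition but costs more:

from azure.cosmos import CosmosClient

client = CosmosClient("https://<your-account>.documents.azure.com:443/", "<your-account-key>")
container = client.get_database_client("storedb").get_container_client("orders")

# Point read: retrieve a specific record by id and partition key (~1 RU for a 1 KB item).
order = container.read_item(item="order-1001", partition_key="customer-42")

# Query: still efficient when it targets a single partition, but more costly than a point read.
open_orders = container.query_items(
    query="SELECT * FROM c WHERE c.customerId = @cid AND c.status = 'open'",
    parameters=[{"name": "@cid", "value": "customer-42"}],
    partition_key="customer-42",
)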
SQL Server supports normalization and joins, which really isn't the use case for Cosmos DB; Cosmos tends to need a level of denormalization in order to provide that performance guarantee through point operations. That means if you're tying in a reporting system, you may want to consider tools like Synapse, Hyperscale, Azure SQL DB, or managed instances, because those use cases may align better given all the aggregations and set operations happening behind the scenes. And if you're counting on broad SQL support: Cosmos has lots of support for SQL, but that does not mean that all of the SQL you use today on SQL Server is available inside of Cosmos. Finally, beyond general reporting, if you're tying into a reporting solution like Power BI, SQL Server or one of its flavors may be a better fit, again because of the level of denormalization typically required.

The two systems are very similar in some ways, though. First, Cosmos DB does support SQL, although some aspects are a bit different, and we'll dive into that more in a moment. It uses indexes to get the best performance, but not the way you may be used to thinking about them on SQL Server, so we'll dive into that a bit as well. Like any good index, if you make a change it may take a little time to rebuild the index and get the full benefit of it; that said, you can rebuild those indexes in production to see that performance in an environment that needs it. And really, the key here: partitioning and data modeling are critical in both if you want to get the most out of each system. What you know about how to organize data continues to be critically important, but as you'll see in a minute, the way we model the data is different and worth diving into a lot further. And contrary to the misconception that often comes with NoSQL environments, schema changes do need to be planned in both systems. When we alter a schema, there can always be downstream impacts on the applications that interact with the environment, so we always want to consider what it means to change schema, even when we're using JSON and NoSQL solutions.

You may hear me keep saying NoSQL, and that's because we really are starting to bridge into the NoSQL world in many ways, which means the solution here will be highly denormalized, more so than you may be used to with a traditional SQL Server. While Cosmos DB does support joins, those joins can't cross what we might think of as tables. You'll understand this more in a minute, but join has a different purpose in a NoSQL or Cosmos DB world. And one of the biggest differences: triggers don't really work the same here as they do in the SQL Server world. They're not intended to solve the same problem the same way, so we're going to drill into what that means. Query operations between the two are also very different. As I mentioned, Cosmos is point-query based, so set operations, while completely doable in Cosmos and often important, are more expensive and can require more careful consideration. Finally, it's worth knowing that changes in both environments can be tracked, but Cosmos DB includes a feature called the change feed that lets you go beyond tracking to replaying changes, meaning you can create a reactive model out of the box that makes it very easy to carry your changes downstream, and this can often replace the need developers have for the traditional trigger.

I'll mention it again because it's important: partitioning in this world is not optional.
It is critical, and a lot of your data modeling decisions will revolve around how you think about partitioning. That may be a bit of a shift unless you're used to working with large data sets in SQL. The story really begins with the fact that Cosmos is so scalable that we can easily change how it scales and how much throughput it gets, but all of that is tied to how we partition, because these numbers and a lot of the behaviors are directly related to how the data is partitioned.

First and foremost, as I mentioned a moment ago, backing all of this is the concept of the request unit. It's our fundamental unit for understanding pricing and throughput. Conceptually, it's easiest to think of one RU as the resources required to read a one-kilobyte document, and one RU per second as reading one of those documents every second. The more throughput I provision, the more I can read, analyze, and index at any one moment in time. We can generally provision everything in increments of 100 RUs, with the lowest level being 400 RUs. All of this is exposed: if your developers are working with these systems, they can even see the specific cost of each query in real time in order to optimize.

It's very important to note that there's another relationship here. I mentioned before that an RU encompasses things like the data, and there's a minimum of 10 RU/s of throughput required for every gigabyte of data that is stored. Also, once you increase the RUs you've provisioned as throughput, you can never decrease that number back below ten percent of the largest value you've used. So if I'm using 400 RUs, I could go up to 4,000 without it having an impact on my scalability, but once I start to go beyond that, from that point on I can't come back down to 400 anymore. That's really aligned to how the data is partitioned and what's happening behind the scenes.

From a query-cost perspective, request units are also a way of understanding the complexity of the query, what's involved behind the scenes, and that will drive your throughput. Each of the different operations requires a certain amount of resources to do its job: roughly, it starts at 1 RU for reads and goes up to around 10 for updates. Whenever a query has to go beyond one partition, we start seeing additional expense, because we're going to need at least one RU from every partition we're querying. So this again drives the discussion around how much we're doing and how spread out the query is; how the partitioning occurs in the data model will directly affect the performance of the overall system. This will get much clearer in a moment, but these numbers are your core metrics for starting to understand how the system is performing, whether you need to do administration or understand where bottlenecks are coming from.

In addition to the type of operation and the consistency level having some impact, RUs may also increase because we're writing larger documents with more properties. This can also be because, out of the box, everything is indexed, and if more properties are being indexed, it's more expensive. If we have unique keys defined, that can also increase the cost, because unique keys mean we need to check behind the scenes whether another record with the same key exists, and that has an expense. And as we move towards a stronger consistency model, our RUs will increase because more work is involved to guarantee that level of consistency at global scale.
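Since Ken mentions that developers can see the cost of each query in real time, here is one way to surface that with the azure-cosmos Python SDK; reading the x-ms-request-charge header off the last response is a common pattern, though the exact mechanism shown here (client_connection.last_response_headers) is SDK-version dependent and an assumption:

from azure.cosmos import CosmosClient

client = CosmosClient("https://<your-account>.documents.azure.com:443/", "<your-account-key>")
container = client.get_database_client("catsdb").get_container_client("sightings")

# Run a single-partition query, then inspect the RU charge the service reported.
results = list(container.query_items(
    query="SELECT VALUE COUNT(1) FROM c WHERE c.city = 'New York City'",
    partition_key="New York City",
))
charge = container.client_connection.last_response_headers["x-ms-request-charge"]
print(f"Result: {results[0]}, cost: {charge} RUs")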
And of course, I know I keep mentioning it, but it is so important: whenever we have to cross partition boundaries, or we cross them and miss an index, that's going to increase our RUs, because more work is required to find our data. So again, the closer we get to a point query, the better the performance.

With SQL Server, the way we handle these kinds of challenges, beyond fixing the data model, is often to throw more memory and more processor at it, or add better drive arrays. With Cosmos, we take advantage of the different scaling models. Each of these gives us different trade-offs and can be used to let Cosmos grow and handle that workload for us. We have four primary approaches, starting with dedicated throughput. In that model we're saying I want a specific amount to be available for my database, or rather my container, to use; I should be a little more precise there. When I'm using the dedicated model, my container has resources allocated to it, so it can consume some or all of that, and I pay a very fixed price. I can also use shared throughput: now I move up a level and my containers share a common throughput, so one container may use it and then another, and if the model supports that, I can distribute the load across all of the different containers, each using what it needs when it needs it. For cases where things are less predictable, there's autoscale: within boundaries, it grows up to a certain point automatically for me and then comes back down when it's not needed. And there's serverless, where all of this happens completely dynamically. I pay a little more for those latter two models, but I gain the flexibility to match what's happening and how the system is being used, if I need that.

Now we'll dive a bit into the data modeling. Cosmos DB has a model for how we organize data at a high level, which starts with the account. Roughly, you can think of that in some ways as if it were a server. It really isn't, but it is a grouping mechanism: it pulls together the various databases in the account. Within the databases we have containers. You can roughly think of a container like a table, but as you'll see, it isn't really a table either; conceptually it's very similar, though, in the way we use it. And within the container we have items. Conceptually, think of an item sort of like a row, but you'll see how that also relates to the overall model as we go through this. So it's very similar to SQL Server: databases with tables, or in this case containers, and tables containing rows, or in this case items. We'll go into this more, because you have to really make sure you understand that it is not tables and it is not rows; it's actually entities and trees, and that's where some of the power of this modeling comes from. But because conceptually it looks, behaves, and acts similar to something we're used to, it also behaves really well when we work with it using SQL, so we can treat it like things we know and apply our SQL understanding to a Cosmos DB world.

In a traditional database world we might have an order table with related order-detail rows, perhaps denormalizing some aspects. In Cosmos DB it gets a bit more complicated. The items appear to be JSON documents, so it seems similar to a table, but for performance reasons we may actually need to embed some of the data that might classically be modeled in another table directly into the document. This is a really big conceptual change.
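A short sketch of the account, database, container, item hierarchy, again assuming the azure-cosmos Python SDK with hypothetical names; the order document embeds its line items, which is the pattern discussed next:

from azure.cosmos import CosmosClient, PartitionKey

client = CosmosClient("https://<your-account>.documents.azure.com:443/", "<your-account-key>")

# Account -> database -> container (roughly a "table") -> item (roughly a "row").
database = client.create_database_if_not_exists("storedb")
container = database.create_container_if_not_exists(
    id="orders",
    partition_key=PartitionKey(path="/customerId"),
    offer_throughput=400,  # dedicated throughput in RU/s
)

# An item is a JSON-like document; "id" plus the partition key value locate it.
container.upsert_item({
    "id": "order-1001",
    "customerId": "customer-42",
    "lineItems": [{"sku": "widget", "qty": 2}],  # embedded one-to-few details
})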
It requires some understanding of how NoSQL modeling really works. At a very high level, these kinds of contained relationships work well when you have a one-to-few relationship; that is, I've got a discrete, fixed number of items, so it may perform better to pull back all of the data at once, especially if I use the pieces of data together. If, when I pull up an order, I commonly show the individual line items, and a typical order has a very small number of line items, then I may do well to embed the order details directly into the document. If it's a one-to-many relationship, especially an unbounded one, then I may need two documents and a very different design pattern. If I need to be able to provide a subset of the information, again I may have to embed. This difference, the need to embed in the data models, is probably one of the most significant and stark contrasts with how you'd traditionally model databases. I encourage you to go to the Microsoft Docs site: there is a walkthrough of modeling posts and comments that is incredibly valuable, and there are several of these types of walkthroughs available in the docs covering different approaches to modeling. This is definitely an area to focus on if you're trying to transition your skills and get the most out of the modeling experience, because understanding when, how, and why you might need to embed and denormalize data is really important to getting the most out of the system. From a high level, bringing in the database background, we're essentially denormalizing the data to put the information we need together, so in many cases you're beginning with the end in mind, very similar in some ways to traditional data warehouse modeling.

Under the covers, the data storage, as I mentioned, is not tabular. It isn't even actually JSON. It's a tree-based model where the document is shredded and the details are stored in a very specific, highly optimized way, where all of the properties can be indexed to give you a lot of performance guarantees as you retrieve documents based on certain criteria. We'll come back to indexing in a minute, but this basic concept means that when we query against these documents, the behavior is a bit different. It looks similar to your traditional table query, where it may be container name dot item property, such as families.id, but because documents can be nested we may have multiple dot delimiters that take us down into the contained relationships themselves, like families.address.state. We can even use array indexing to get to elements that are in an array, to pull back one or more values or even complex JSON objects. So the syntax is very similar to the dot notation you're used to in SQL, but you can now go deeper: each dot takes you further into the structure of the document model itself. Again, very similar to something you've already known and used.

Being able to use these properties and build queries on them brings us to the next key to understanding Cosmos, and that's partitioning. There are two parts to partitioning, and I'll walk through a few examples to help you understand them. The logical partition takes everything that has the same partition key and groups it together for you. A logical partition is limited to 20 gigabytes, so you have to consider whether you'll have more than 20 gigabytes of data over time that has to live in the same partition, because once you have that partition key in place, you're not going to be able to change it without some significant effort.
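The dot-notation queries Ken describes look like this; a sketch assuming the azure-cosmos Python SDK and a hypothetical families container shaped like the Microsoft Docs example he references:

from azure.cosmos import CosmosClient

client = CosmosClient("https://<your-account>.documents.azure.com:443/", "<your-account-key>")
container = client.get_database_client("storedb").get_container_client("families")

# Each dot walks deeper into the embedded document; arrays can be addressed by position.
query = """
SELECT f.id, f.address.state, f.children[0].firstName
FROM families f
WHERE f.address.state = 'NY'
"""
for family in container.query_items(query=query, enable_cross_partition_query=True):
    print(family)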
A logical partition is backed by a physical partition: some reserved SSD-backed storage and compute that are specifically responsible for storing all of this information in Cosmos. A physical partition can include one or more logical partitions and is constrained to 50 gigabytes. These two work very closely together, and one of the key things to keep in mind is that partitioning directly impacts performance, so the choice of the partition key is where all of your data models absolutely have to start.

I'm going to use a very contrived example, but hopefully it helps you understand a bit about how modeling works and the impact of that choice, so you can see why you may not want to partition on ID, or may choose not to partition on dates. There's a natural inclination to partition on dates, because it groups together things that are related, but you'll see in a minute why that could really hurt you. I'm going to build a model that needs to do a lot of querying to understand how many cats exist in a given city. The number of cities is fairly broad but still within reason, I don't expect huge amounts of data to come in per city, and I expect to only have to keep a few years of data, so I've got some constraints. Perhaps I decide to model on the city, because I've already identified that the majority case is going to be working with data by city.

As I mentioned before, the partitions are organized based on that partition key. Under the covers, a hash is used to create an integer value that assigns your data somewhere. So let's say I want to save some data about New York City: any new data is going to try to go to the same logical partition that already has the New York City data. Now, Cosmos DB will handle some of this for you automatically. When it sees, as in this case, that the physical partition is nearing being full but the logical partition still has room, it's going to automatically scale itself to handle that; and when there's no need for that, it distributes your logical partitions in a way that makes sense to ensure the data is available and the latency guarantees can be maintained. So I get some data about New York City and I want to add it to Cosmos DB. Cosmos DB sees that it's running out of space, it creates an additional physical partition, it moves that data over to the new physical partition, and behind the scenes it makes sure that the logical partition now has room to hold this other chunk of data, keeping it one seamless, complete model. Now, again, all of my New York City data is close together, and any queries I run against New York City are going to perform well.

Potentially, though, there's a consequence that comes with this, and it comes down to how physical partitions behave. When you allocate throughput, you're actually allocating it evenly across physical partitions, and Cosmos DB manages and owns those, not you. So if I allocate 30,000 RUs of throughput and I have three partitions, each one gets 10,000 RUs. When we have to split a partition and distribute it, Cosmos is also going to split and distribute the RUs, so whenever we add more partitions, the RUs in a given physical partition may decrease, changing the performance behavior we're seeing. That means we want to be careful to understand the characteristics of our partition key, and we want to be aware that excessive querying tied to that partition, if it's incorrect or inexact, could put a strain on a single partition.
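Here's what the contrived cats-per-city model might look like as code, assuming the azure-cosmos Python SDK; the container and field names are made up for the example. Every document with the same /city value lands in the same logical partition:

from azure.cosmos import CosmosClient, PartitionKey

client = CosmosClient("https://<your-account>.documents.azure.com:443/", "<your-account-key>")
database = client.create_database_if_not_exists("catsdb")

# Partition on /city: all New York City documents share one logical partition (20 GB max).
container = database.create_container_if_not_exists(
    id="sightings",
    partition_key=PartitionKey(path="/city"),
    offer_throughput=1200,
)

container.upsert_item({
    "id": "sighting-0001",
    "city": "New York City",   # partition key value
    "cats": 3,
    "recordedOn": "2021-11-19",
})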
This is the nature of many of the performance problems you will run into as you transition your skills and get used to working with Cosmos DB. Typically you're going to see either a hot-partition issue or query fan-out, and in each of those cases you're going to have to work with developers to make sure they understand how to avoid it, or correct how you think about the partitioning to better align it; otherwise you're going to have to increase the throughput to compensate for the problems these cause.

As a practical example, let's start with the idea of a hot partition. I've got data coming in for New York City, and it's coming in frequently, fast, and throughout the day. As that data comes in, it's all being directed to one and only one partition. That may be fine, depending on how fast and how much, but it's also possible, if there's a lot of it coming in quickly, that I end up needing more resources than the partition can make available at any one moment. In that case I'm going to have to use retry logic to get the data across, because I've created a hot partition. With a hot partition, my data is going to one place, that place takes all the load, and therefore I suffer performance problems. This is a key reason why just using the date is often not the right answer: if I've got lots of logging information, lots of date-based information coming in, and that date by itself is my partition key, then everything with the same date could end up on the same partition, and now my writes are constrained by a hot partition. That may work really well for reads, but I may suffer on the write side.

Similarly, if I run a broad query that isn't a point operation, and worse, doesn't use the indexes correctly, Cosmos is going to have to send it not to one partition but to all of them. In this case I've got a query by year; as you may recall, I partitioned on city, so there's an obvious mismatch. Cosmos DB can't decide that one partition is the fit, or even eliminate a partition, so the request goes to all the partitions. That means each of them has to incur the cost of querying that data and returning it, and that overhead creates a performance issue on the read side. We want to try to optimize for a balance, or take advantage of change feeds, which I'll discuss in a minute, to optimize the system for both.

At a really high level, the thing to know as a developer is that query performance is best when we can get to a single partition key and query within that partition, and we get the absolute worst performance, the highest RUs, as we move towards queries that span partitions or don't use filters, because then we're having to look at all the data on all the partitions. Finding the right balance is going to drive a lot of the data modeling decisions, and may even mean we create redundancy, the same data stored in different containers, specifically so it's optimized for the kind of querying or the patterns we need to support. I may take a container and optimize its partition key for high-throughput writes, then transform that data through the change feed and put it in another container optimized for high-throughput reads. With any NoSQL solution, we're trading extra storage for huge gains in performance: instead of spending all of our money on processor and memory, we're simply using more data.
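To contrast the two query shapes Ken describes, here is a sketch using the same assumed SDK and sightings container as above: the first query can be routed to a single partition, while the second filters only on a non-partition-key property and fans out to every physical partition:

from azure.cosmos import CosmosClient

client = CosmosClient("https://<your-account>.documents.azure.com:443/", "<your-account-key>")
container = client.get_database_client("catsdb").get_container_client("sightings")

# Single-partition query: the city filter matches the partition key.
per_city = container.query_items(
    query="SELECT VALUE SUM(c.cats) FROM c WHERE c.city = @city",
    parameters=[{"name": "@city", "value": "New York City"}],
    partition_key="New York City",
)

# Fan-out query: filtering only on the year says nothing about the partition key,
# so every physical partition must be consulted (at least 1 RU each).
per_year = container.query_items(
    query="SELECT VALUE SUM(c.cats) FROM c WHERE STARTSWITH(c.recordedOn, '2021')",
    enable_cross_partition_query=True,
)
print(list(per_city), list(per_year))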
Now, that may lead to the question: why not just use a trigger? I could take the data as it comes in, transform it right there, and put it in the right place. But that's not always the best approach or the right solution, because triggers here are not the same. In fact, triggers don't fire automatically as part of an insert, a delete, or an update; instead, we have to specifically ask for a trigger to be executed as part of the request for a data operation. On the development side, as we call Cosmos, we have to actually request that the trigger be used. That's because it runs as a single transaction with ACID guarantees on the incoming data, but there's a limit: it executes within a single logical partition. It's restricted; it can't span partitions, and because it's scoped to a single logical partition key, there's only so much a trigger can do. The great thing, though, is that when we do need this support, it's guaranteed to always run on a primary replica and always have strong consistency, so it can make a set of tightly coupled, narrowly scoped changes very efficiently. But because it doesn't give us the behavior we're used to from traditional SQL triggers, we need to look at another option: the change feed.

The change feed is a different approach we can take, which allows you to tie in code that specifically listens for when data has changed or when operations occur. It lets you create a reactive model where, as things occur, the changes are pushed to your code and you can act on them, possibly storing new data or pushing it down the pipeline. We aren't strictly limited to a push model; we can also do a pull, so I can say I need to replay the events to bring myself up to date and apply all of those changes, giving me a very event-driven approach to rebuilding a data store. It gives you access to key parts of your document history, essentially providing the items as they've changed over time, and this allows you to programmatically apply changes to the data. The items are sorted by their modification time within each logical partition key, so there's an order as that data is provided to us, and some lightweight guarantees in how we see the data. That does mean the data may arrive potentially out of order in terms of timing, but ultimately we get to see all of the data.

It's really important to understand that in this model the change feed does not, as of today, include delete operations. As data changes you can see the current state of the data, unless there's a delete; in that case the item simply disappears from the change feed going forward. The workaround for this is very simple: a soft delete. You create a property to flag when an item is deleted; the change will percolate through the systems and through the change feed, and you can react to it, either changing the time-to-live on the document to let Cosmos clean it up itself, or manually deciding when the right time is to actually delete it. If you're doing this with Functions, it's really important to be aware that the eventual consistency, both from Cosmos DB and from the characteristics of Azure Functions themselves, means there can be duplicates: two records that are exactly the same, where the system delivered you the same data twice. This is at-least-once delivery, especially when you're doing anything with Azure Functions, so be aware that you may process the same item more than once.
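A rough sketch of pulling the change feed and reacting to soft-deleted items, assuming the azure-cosmos Python SDK's query_items_change_feed helper; the method's parameters vary by SDK version, so treat the exact call shape as an assumption rather than the definitive API:

from azure.cosmos import CosmosClient

client = CosmosClient("https://<your-account>.documents.azure.com:443/", "<your-account-key>")
container = client.get_database_client("catsdb").get_container_client("sightings")

# Replay changes for one logical partition from the beginning of the feed.
# Deletes never appear here, which is why the soft-delete flag matters.
for change in container.query_items_change_feed(
    partition_key="New York City",
    is_start_from_beginning=True,
):
    if change.get("isDeleted"):
        print(f"soft-deleted: {change['id']}")  # e.g. remove it from a read-optimized copy
    else:
        print(f"changed: {change['id']}")       # e.g. project it into a read-optimized container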
And while some of this has since changed, most new features and functionality enhancements really come from being on the core platform, the SQL API; so coming at this from the SQL world, choosing that model gives you the most options and the most new features.

Since we're getting closer to the end of our time, I want to quickly dive into another area that is incredibly important to understand, and that is indexes. As I mentioned before, by default everything is indexed out of the box. You can customize this and fine-tune it, and ultimately the most specific index definitions will always win. Because everything is broken out into a tree, the layout of the indexes and the way they're structured is very consistent, and we can use a path-based structure to define what is indexed: is it everything under locations, is it only the countries, is it just the employee value under headquarters and nothing else? I can fine-tune what's indexed. Perhaps, if I'm taking on large incoming data feeds, I may even restrict how much indexing I'm doing altogether. I have total control over the index, so I can trade off some performance against the types of indexes I choose and what I choose to index.

By default everything starts with a range index. These support the various equality and inequality operations, as well as letting you know whether something contains something else, whether something exists, or even enabling the joins within our data. At the same time, they don't cover all of the needs; perhaps we've got special data types. For those reasons, Cosmos allows us to fine-tune some of this and apply indexes manually: I can restrict them, eliminate some of them, or use other index types. If I'm working with spatial data, Cosmos unfortunately isn't going to recognize that's what you mean, so it won't create a spatial index by default, but it's incredibly important to know that Cosmos DB does support spatial indexes. I can flag that a given part of the data represents some sort of location or locality, so that I can do distance, within, and intersects operations very efficiently. Swapping out a range index for a spatial index on those fields may give me significant improvements, especially if I'm doing queries based on things like intersection.

In other cases, if I need to query across multiple properties, I may need something called a composite index. Range indexes are inherently tied to one property each, but if I'm doing an ORDER BY or a filtering operation where a composite index applies, it allows me to work with multiple properties to better index what I'm trying to do. Ultimately a composite will always follow the same pattern: an equality property must be at the beginning, and if we want any sort of filtering with a greater-than or less-than range operation, those fields have to come later. It's worth knowing that if I index ascending, Cosmos will automatically include the exact opposite as part of that index, so if I choose ascending, descending is effectively covered as well. Now, the order of those properties, and making sure the ascending/descending directions align with the additional indexing you get by default, is important, because if you don't include a property, or it's mis-ordered, it may not be accounted for when you do your filtering or ORDER BY operations. Having the properties in the index helps, in the sense that ascending versus descending becomes less consequential because we've got both covered, but if we're doing filtering operations this is critical, or you may miss your index. So: start with the equality, then put things in order by the property ordering. If you have two equalities, put both of those first. If I have two range operations, things change just a bit: whenever there are multiple range operations, the greater-thans and less-thans, I'm going to need an index for each, starting with the equalities and then the range property.

To make this really concrete, here's a simple SQL example. I'm trying to get all the records where the person's name is John, the age is greater than 18, and it happened after a certain moment in time. In traditional SQL we might choose to build one index over name, age, and timestamp, keeping that order, and we'd have our bases covered. With Cosmos, as I mentioned before, whenever we have an inequality the equality needs to go first, and if there's more than one range operation we want an index for each one: name plus age, and name plus the timestamp. By doing this we make sure we hit the index and get the best performance.
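The two composite indexes from that example might be declared like this when creating the container, a sketch assuming the azure-cosmos Python SDK with the indexing policy passed as a plain dictionary; the database, container, and property names are made up for the illustration:

from azure.cosmos import CosmosClient, PartitionKey

indexing_policy = {
    "indexingMode": "consistent",
    "includedPaths": [{"path": "/*"}],
    # One composite per (equality, range) pair: (name, age) and (name, timestamp).
    "compositeIndexes": [
        [{"path": "/name", "order": "ascending"}, {"path": "/age", "order": "ascending"}],
        [{"path": "/name", "order": "ascending"}, {"path": "/timestamp", "order": "ascending"}],
    ],
}

client = CosmosClient("https://<your-account>.documents.azure.com:443/", "<your-account-key>")
database = client.create_database_if_not_exists("peopledb")
container = database.create_container_if_not_exists(
    id="people",
    partition_key=PartitionKey(path="/name"),
    indexing_policy=indexing_policy,
)

# The query those indexes are meant to serve: one equality plus two range filters.
query = "SELECT * FROM c WHERE c.name = 'John' AND c.age > 18 AND c.timestamp > '2021-01-01'"
results = list(container.query_items(query=query, partition_key="John"))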
So with that, I hope this has helped you understand a little bit about how you transfer some of these skills from the SQL Server world to Cosmos DB. Obviously there's a lot to learn here, and hopefully this gives you a high-level understanding and just enough depth to start knowing where to begin looking and the kinds of things you want to know. Going forward, I recommend you try to really understand the partitioning and the data modeling, actually get hands-on in the environment, and then look at the RU counts that each query requires; that way you can better optimize how you build queries and data models, and that will really set you up for getting much deeper and much more hands-on in this NoSQL world. And with that, I'm going to turn it back over for questions. If questions come to mind after the discussion, feel free to reach out via Twitter, through my website, through any of my social media accounts, or to Mark, and we'll be glad to help you understand this topic in more depth. Thank you all.

Hey Ken, thank you very much, that was a great talk. We've got a few questions from during your presentation. One comment asks: can you show us how to use Cosmos DB with Azure Functions? You touched on that a bit; join us for a future meetup, for sure we'll have more talks that get you hands-on with Cosmos and Azure Functions. Another question, from Yash: is there any option for doing bulk deletes in Cosmos DB?

Unfortunately, that one becomes a bit complicated by the fact that, again, Cosmos is really optimized for point updates, but there are a few options that can help. You do have the ability to use some of the programmatic approaches I talked about, features such as stored procedures, triggers, and operations through Functions or other things that iterate and do the deletes, though they're not always the traditional bulk-type operations you might have in SQL Server. Another option that can be incredibly valuable here is the TTL property, time-to-live. I can set an expiration on documents so that Cosmos will clean them up for me, and it will optimize when and how it does that cleanup. If you pair this with a soft-delete column, something like isDeleted, flagging that and updating the TTL can often replace the need for a traditional bulk delete coming from that world.
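The soft-delete-plus-TTL pattern described here could look like this, assuming the azure-cosmos Python SDK and the sightings container from earlier; setting default_ttl=-1 turns TTL on for the container without expiring anything unless an item carries its own ttl value:

from azure.cosmos import CosmosClient, PartitionKey

client = CosmosClient("https://<your-account>.documents.azure.com:443/", "<your-account-key>")
database = client.create_database_if_not_exists("catsdb")

# TTL enabled at the container level, but items only expire if they carry a "ttl" property.
container = database.create_container_if_not_exists(
    id="sightings",
    partition_key=PartitionKey(path="/city"),
    default_ttl=-1,
)

# "Bulk delete" via soft delete: flag the item and give it a short TTL.
doc = container.read_item(item="sighting-0001", partition_key="New York City")
doc["isDeleted"] = True   # change feed consumers can react to this flag
doc["ttl"] = 300          # Cosmos removes the item in the background after ~5 minutes
container.replace_item(item=doc["id"], body=doc)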
Mark, anything to add on that one?

Yeah, I mean, you kind of covered it. You can use bulk mode in the SDK to help a bit with that: bulk mode will basically batch up a bunch of different operations and then dispatch them, so you get a little performance there. Stored procs are probably a little better, but again, that's mostly going to help with the resources client-side, because you're issuing a single request and then allowing it to happen server-side. I can say that this is on our roadmap, but I don't really have a date for when we'll do bulk deletes, and it wouldn't be bulk in the sense of truncating the entire container; it would be on a partition-key basis, because that's the storage boundary, if you will, for partition keys. But yeah, those are the options, and I think TTL is probably the best one. Although, if you don't already know that you need to TTL the data, doing an update is functionally going to cost you about the same as doing the delete, so pick your poison, I guess, one way or another. But if you do know you want to TTL data off, then for sure set that, and Cosmos will expire that data using unused RUs in the background. To be clear, though: the TTL takes effect immediately, but the physical removal of that data happens as a background task.

Okay, let's see, a question here from Dr. Lowe: as well as the change feed for changes, is there a way to trace all queries, a bit like SQL Profiler? Anything like that for SQL devs?

Unfortunately, there isn't an all-encompassing tool that dumps all of that. Mark, you may be aware of something; I'm not at this moment. What I do see a lot of developers do is tie this into Azure Monitor, especially through Application Insights, so they can capture the parts that make sense, because in many cases it's either about certain performance criteria, say filtering out queries that have high RUs or are exceptional, or situations where they're trying to gain some insight into the data through those mechanisms in order to refine things. So I've personally seen Azure Monitor and Application Insights used for monitoring, but not a tool like SQL Profiler where you can just connect in and watch things in real time.

Yeah, that's correct. The closest thing to that is to turn on Log Analytics and then write Kusto queries against it, and you can do things just like you said: give me my top long-running queries, or most expensive queries, or other things like that. You can also set a flag in there to show the query text, so you can get the plain query text and see which queries are the most expensive or not performing well. But yeah, Cosmos was born in Azure; we don't have standalone tools like Query Profiler, which has been around forever, going back maybe to the Sybase days in some way, shape, or form. All right.
Also, Cosmos is so broadly distributed that it would be quite challenging to concurrently monitor up to 30 regions worldwide for all operations; that would be one aspect that makes it challenging.

Another question, from Greg: no deletes in the change feed? Yeah, no deletes, not yet; you've got to use that soft-delete flag.

Okay, another question relating to the bulk deletes and edits: is there a way to limit the amount of RUs you want your task to be throttled at? No, there is not. You could approximate it if you were issuing them from, say, an Azure Function; it's a bit tricky, but if you pick a small enough Functions instance that can only run so fast, you could potentially gate it. You could also do some kind of queue-based load leveling, which would be another way to achieve a similar thing, where you're basically queuing those operations and then dispatching them on some kind of schedule, but you're getting pretty complicated at that point. And like I mentioned earlier, the TTL approach kind of limits it for you, because it only uses unused RUs, so it won't throw exceptions; it will step back if requests coming from the clients are taking over.

Okay, well, that's it for questions. Let's do a drawing, huh? Let me share here... where are you, Microsoft Edge tab, giveaway... all right. So we've got, well, multiple comments from the same people, but five unique folks in here. If you get chosen through our drawing, just DM me at my Twitter account, which is right there underneath my head, and we'll connect; I'll get your details and ship you one of these lovely Cosmos DB neoprene drink coasters and a Cosmos DB sticker. All right, here we go, and our first winner is... yes, you win, congratulations! Okay, let's do one more, and I can't find my mouse because StreamYard is... here we go. One more time, who's going to be the lucky winner? I should have some kind of click-click-click sound effect. Okay, and start... you are our second winner, congratulations!

Okay, well, that's it for us this month. Thank you very much, everyone, for joining us; it was great having you on. Next month, in December, we're not going to do a meetup, but we will be back in January. I believe Alex Mang has signed up and he's going to host our January meetup, so we'll work on getting that scheduled. Once it's set up it'll show up in Meetup as a new event, and then you can go in and RSVP. And that's it. Thank you, everyone, for joining us. Thank you, Ken, for coming and presenting this month; it was great to have you on, and I really appreciate you joining us.

Thanks for having me.

Okay, that's it. Thank you, everyone. Bye-bye, we'll see you in a couple of months.
Info
Channel: Azure Cosmos DB
Views: 336
Id: 7LZrR_zK5TA
Length: 73min 4sec (4384 seconds)
Published: Fri Nov 19 2021