AWS re:Invent 2017: GPS: SaaS Metrics: The Ultimate View of Tenant Consumption (GPSTEC308)

Captions
Hello everybody, thanks so much for showing up today; I really appreciate you being here. As the slide suggests, my name is Todd Golding. I'm a partner solutions architect, and my area of focus, not surprisingly, is SaaS. When I'm working with partners and customers who are adopting or migrating to a SaaS delivery model (can we bring the mic down just a little?) there are some common themes and patterns that I see. These organizations are really hungry to achieve the promise of SaaS agility: they want faster delivery models, they want to be able to push out features faster, and a lot of this is driven by competitive need. Either they're worried that some competitor is going to come into their space with a SaaS offering and move past them in a way that leaves them behind, or they're generally trying to create that gap themselves between them and the rest of the competition. As part of that effort, I find some common themes among the organizations that are successful, and a huge part of that is the capture, processing, and correlation of a much richer population of metrics about how tenants are consuming their system and how those tenants are pushing and exercising the architecture of their environments. This whole engine that fuels their agility, that sits at the core of that agility's success, ends up being driven by a bunch of metrics, and not just technical metrics but business metrics. More importantly, it's often the correlation of these business and technical metrics that is at the core of this lifeblood of knowledge that drives the technical, architectural, operational, and business direction of the organization. When I look at these organizations and ask how they are using metrics, where are
metrics applied in their solutions, I clearly see a whole bunch of technical uses. On the architectural side you'll see a bunch of metrics that are, by the way, very unique to multi-tenant environments: how does my system respond to peaks and loads, how does my system handle new customers that are onboarding, does it onboard them fast enough? So I see SaaS architects obsessing over lots of metrics, and they're not just looking at them periodically; they're looking at them all the time to decide: is my architecture efficient, is my architecture responding to and servicing the needs of the business? In fact, for me as a SaaS architect, both at AWS and on other solutions I've worked on, it was always fundamental to be able to go to the business and say I am doing everything I can to eke every ounce of efficiency out of the architecture, because I know the architectural footprint of our solution is going to have an impact on the bottom line and the success of the company, much more so than at other companies I've worked at. But you'll also see metrics having a huge impact on the operational side of the house. On the operational side I need some notion of how tenants are pushing my system, because in a multi-tenant SaaS environment, where I've pushed everybody under the hood of one infrastructure, the stakes are so much higher: if my system goes down, my entire business goes down. So I see operations people mining these metrics all the time, and we'll even look at a few slides here of the kinds of operational metrics people track that are very unique to SaaS, metrics that help you respond more proactively to issues before they're an issue for your tenants or before they bring your system down. So there are all these very clear technical dimensions we look at, and maybe some of you are already looking at those today, but if
you're not, you should be. But then the business is also looking at a bunch of metrics. They're looking at how tenants are using your system and what the patterns of use are, and now, with all the tenants running in one environment, we'll see tenants bottlenecking in certain areas or not using certain parts of the system, and the business wants to know why. More importantly, you'll see tiering strategies here: we'll have a basic, an advanced, and a premium tier in our solution, and we'll find that without analytics we don't really have a good sense of how our customers are moving from one tier to the next, and whether the features and capabilities we're putting out there are driving them up that ladder to the higher-end, higher-cost solution. So my big point here is that, in a universe of metrics for SaaS, both the business side and the technical side ought to be looking at these bits very heavily. Now, the awesome part of this is that once you put all your tenants and customers under one umbrella, the ability to get to the metrics they need, and the ability to aggregate those metrics, gets a lot easier. If you had a situation before where you had on-premises customers, or some custom installs, maybe hosted in your environment but very distributed, it was very hard to collate an aggregate view of those metrics that was really useful to you, and it was very rarely something you got in real time. So now that you've got this multi-tenant infrastructure, you say: awesome, I can get to a lot of these metrics easily, I can surface them easily, and that's great. But you also run into a huge challenge here, and this challenge is a big part of the focus of this conversation today. As a SaaS architect, I don't just want to know the basic metrics everybody's tracking; there is a metric that to me is often the most important metric to chase, which is tenant
consumption. I want to know how much load, how much effort, we're putting into supporting every single tenant in our system. I don't want to just generally know the system is doing well and responding to the load; I want to know, with the granularity of a tenant view, what those tenants are doing, how they're pushing the design of my system, how they're pushing the architecture of my system. And guess what, I want to know how consumption is driving the business side of the house as well, because those same metrics should mean a lot to product managers. So I have become very obsessed, as a SaaS architect, with saying: I need to know this number, I have to be able to create some view of it. But here's the challenge: all the goodness that makes those other metrics easy makes getting this tenant consumption number hard. Now that everything's in a shared environment, now that we're running on shared infrastructure, it suddenly becomes very hard to parse out what load each individual tenant is putting on your environment. Imagine you have compute instances, ECS instances, EC2 instances, and you've got tenants exercising them all over the place. How do I know which tenants are imposing which amounts of load? How do I attribute consumption to them? That, and how it affects both my architecture and the business strategy of the house, is the focus of this talk. So hopefully that fits with the abstract you read and what you came for, because that's what we're going to talk about: how do you get there through a technical lens, and how do I build the moving parts of that solution? Now, before we dig into the mechanics, I want to be very clear about the difference between consumption and metering, because these two words tend to get conflated, and they tend to overlap a lot in the SaaS world
where people will talk about metering when they mean consumption, or talk about consumption when they mean metering, and I think you could get lost in this terminology really easily. When I talk about metering, to me metering is pure and simple a billing construct. When I meter in a SaaS environment, my goal is to decide what things I'm going to track and follow to be able, in the end, to figure out how much I'm going to bill a customer. That is a very different mindset, because here we might not even meter consumption; we might meter a number of users, or features, or we might even have some composite metric, a composite of a whole bunch of kinds of consumption unique to my domain, that ends up being the metric I bill on. So, very different: my goals there are all driven by how we're selling the product, how we're billing for the product, and how we measure that. On the consumption side, I think of consumption as more of an internal metric. With consumption I want to say: if I could take a snapshot at a moment in time, look through my consumption lens, and see a line tracing through all the architecture of my system, all my storage, all my EC2 instances, all the resources of my environment, I could say that this particular tenant is consuming 22% of the resources of my system. That's what I'm after with consumption. And with consumption I want to be able to correlate that with my bill, how much I'm paying AWS for those resources, so that I can come up with a metric that is cost per tenant. That's my magic metric: could I say to the business, the cost per tenant in our system right now is this? And then you can imagine a bunch of derivatives, like average cost per tenant, and you could start trending cost per tenant across tiers. And so consumption can certainly
overlap with metering in this discussion, because some people who sell storage as part of their SaaS service are certainly going to measure your consumption of storage as a way to arrive at your bill, but that's just the natural area where the two overlap. I prefer to see one as billing and one as more of an internal metric. Now, as a last case-building moment here, I want to talk about a specific example where this consumption metric meant a lot to me, and which had a lot to do with driving my belief that this is a metric every architect in the SaaS world should be chasing. I worked at a SaaS e-commerce provider for a few years, and we basically sold an e-commerce system where merchants could come in, build their own storefront, have their own cart, customize the look and feel entirely, and sell their own products, but all of that ran on a common shared infrastructure. When I got there, the company seemed to be doing quite well: they had lots of merchants in the system, they were moving lots of product, they were acquiring lots of new merchants all the time. But for some reason, when you looked at the actual books of the company, they weren't doing so well, and I'm like: what's the deal? I don't get it. I'm asking people: what's behind this, why are we not doing better? So I thought, I'm going to at least go away and do my part as an architect and figure out how my part of the problem is contributing to this, or what role it might be playing. The system we were working on had a basic tier, a standard tier, and an advanced tier, a very common construct. The basic tier was like $49.95 a month, the advanced tier was maybe five grand or ten grand a month, whatever it was, and we had features that distinguished one tier from the next and tried to get people to move up that ladder. So when we started to try
to understand what was going on by mining data, we went after three specific things. We actually went after more than three, but these are the three interesting ones to me. First, catalog size: how many products are in their catalog, what's the diversity and the selection? Because maybe they're just having varying success based on the number of products they're offering, and that's just not driving enough business, or something of that nature, and maybe that's why they're succeeding or failing. Second, we looked generally at how much sales they're doing on their e-commerce site, how much product is actually selling, because we got a little cut of those sales, and that hits our bottom line, so we wanted to see what the correlation was there. And then the last thing was this tenant consumption number: how do we correlate these tiers (we actually did it at the tenant level, then rolled it up to the tier level) with how these tenants are contributing to our infrastructure costs? When you look at the graph, the trend kind of jumps out at you; it's pretty obvious. If you look at the advanced tier, those customers had these really small catalogs; they were selling niche products, but they were great at selling them. They knew how to market them, they knew how to sell them, and they were moving lots of revenue. That revenue certainly was pushing some transactional bits of our platform, but overall they were barely contributing to the infrastructure footprint cost for us; their cost per tenant was low. But then you look at the basic-tier side, and we saw these people who were like: I don't really know how to run a business, I'm just going to try everything I can. I'm going to hit your API like there's no tomorrow, upload all these bizarre products and extras, and just hammer your API, hammer your store-design bits. And so they exercised the heck out of our system, but they moved
hardly any revenue, and of course, the big number here: they were absolutely consuming most of our infrastructure costs. So at one end, as basic-tier customers, they're paying us $49.95 a month, and they're absolutely eroding all of our profits. As soon as I had this sort of data, I could go into the business and show it to them, which I think anybody would love to be able to do: walk into the business side and say, look what I found out. Somebody's going to have to say: well, we're doing something fundamentally wrong here. We're not doing anything to throttle these basic-tier users; we're letting them go wild and push our infrastructure without any kind of controls or limits on what they can do, so they're driving our costs up without contributing anything new to the bottom line. And so, even though you're acquiring new tenants all the time, as long as they stay in that basic tier, you're kind of stuck. The business loved this, and they went back and fundamentally changed their tiering strategy, changed their limits strategy, changed the way they priced the product, and shifted things so these basic-tier people could no longer consume more infrastructure resources without paying more of a premium, and then we were in good shape. So this is just one example of tenant consumption, but to me it was a very compelling one that really gives you a sense of why you need this number. Now, I mentioned the value of these metrics to operations. I think this is a hugely overlooked opportunity for companies. If I can get tenant consumption metrics, I have the ability to surface an experience that can be very helpful to my operational staff. So I put together this little wonky dashboard here, just trying to give you a bit of a view of what I might do in real time if I had tenant consumption data. On
the left-hand side, in the top-left corner, you'll see that I have some notion of who my most active tenants are. If I have this tenant consumption data, I can say who's pushing my system the most right now; I can have a top-ten list or whatever it is. More importantly, what is the health of those tenants, what is the experience they're getting? If I've got all kinds of red, and my most active tenants are red, something's wrong: the people using my system the most are having the worst experience, and I've got to go do something about that. Now, a totally separate view of this, on the right-hand side, is tenant resource consumption. Here I'm going to look at the individual resources of my system, compute, storage, whatever notion of services you're consuming in your system, and ask how tenants are consuming each of those resources. To me it's super valuable, in an operational context, to be able to say: wow, Hogwarts Memorabilia is consuming 27 percent of my compute resource right now, and there might be some other tenant consuming a disproportionate amount of my DynamoDB resources. If I can break consumption out along these dimensions, these usually end up being red flags that let the operational people go back and say: hey, something's funky in the architecture, or this tenant is doing something we don't normally see; we need to dig in here and find out what's going on. Because otherwise what we're doing is just turning up the provisioned IOPS every time they push DynamoDB harder, when in reality we're turning them up for one tenant, and that's not a very smart move. The last one is maybe more debatable, but it's actually one of my favorites: this notion of service activity. I decompose my system into all these cool microservices; in this case it's e-commerce, so I've got catalog, shopping, and product, but whatever your microservices are in your application, I go to all this effort to build cool boundaries around them, make them scale, and encapsulate
storage. Well, I also want some view of how tenants are consuming those services. So this view at the bottom is trying to say: who are the top consumers of the individual services of my application, and why is somebody potentially using one service disproportionately over the others? And yes, it could just be that one tenant uses more of the system than another, but what I tend to find is very bizarre and interesting bits in here, and I see operational people building proactive policies around these metrics. They'll say: hey, every time this starts to happen, I'm going to do something to have the environment auto-scale, or adjust some policy, because I'm anticipating an issue here and I want to react and respond before it becomes a huge issue. The last one, and I think developers can very much relate to this, is to imagine having this metric and being in a room with a product manager who has an idea for a new feature. Well, actually, maybe the better example is to imagine you don't have this metric. The product manager is in the room and says: hey, the competition is doing X, we want this new feature, let's go do it, here's how it needs to work. And you race off immediately to build it, and there's almost no discussion of, well, how is that feature going to affect the cost of the system? What's it going to do to our tenant consumption profile? So you just keep adding features, and without this metric you have no idea whether the feature you just added might have upped your infrastructure cost per tenant by some huge amount that nobody thought about. Now, for some features this is more obvious than for others, and it becomes part of the discussion. But what I want to do is equip the product-management side of the house with this knowledge of cost per tenant, and get them used to it, so they start thinking about it as they think about what new
feature they want. I want them to start asking me: how is this going to affect our tiering strategy? Is this a good strategy? Should we only make this feature available to certain tiers? How is this going to affect the bottom-line cost model, and do we need to raise our prices if we implement this feature? I want that to be as much a part of their vocabulary as what the feature will do and how important it is to the market. So to me, this is about taking that knowledge as an architect, extending it into the business, and putting it in front of them in a way they can act on and make part of their decision-making process. Now, enough about all that; let's talk about how you might implement this consumption metric. There are some really simple models of consumption we can start with, and if you have these, you're golden; you don't have to do a whole lot of work. If you're in what we in the SaaS world call a siloed environment, where every tenant gets their own stack of resources, my notion of cost, and my ability to do cost per tenant, is a much simpler discussion. In fact, what we'll often see people do, if they don't have too many tenants, is use AWS linked accounts: they'll put each tenant in a linked account, use the AWS billing constructs to pretty much build their bill, and they'll know cost per tenant. Done, hooray, you win; you've got everything you need. Now, you probably have a pretty costly infrastructure as part of that, because you're not sharing infrastructure, but it's a totally valid approach. Another approach is to use VPCs: same idea, siloed tenants, all the infrastructure in a silo, but instead of using a linked account as the construct, we use a VPC. We'll put each tenant in a VPC, and we might do something with peered VPCs as a way of managing each tenant VPC. And then what's the AWS construct we're going to use here? Now we use tags. Tags are a pretty natural way to say that all the resources in that VPC still belong to this tenant, and I can still derive a billing construct from that pretty easily. So with these two models, if you're in these universes, you can go back and just start to grab these numbers. Now, what you will find is that these numbers, because they're siloed, are less actionable in some of the other contexts: the more siloed something is, the less you tend to worry about the system going down, because it's usually only one tenant going down. Nobody likes one tenant to go down, but that's certainly very different from the entire customer base being down. So the investment here tends to be less, but it's still a valuable number to go after. Now, where it gets a little more interesting is when you end up in what we call the bridge model, where layers of your architecture use a mix of the schemes we've talked about. This is kind of a traditional monolithic architecture: I have a pooled web tier, a bunch of web servers that are all shared infrastructure for all the tenants; at the app tier of my solution I've said, no, those have to be siloed, so each tenant gets their own cluster of application servers; and then back down at the storage level, the storage tier is pooled again. Well, for all the siloed bits of this, we get the same rules we just talked about: I have ways to go away and pretty easily derive the cost. But what do I do for the pooled resources? What do I do for that web tier? AWS doesn't give you a construct to derive consumption in a pooled web tier where all the tenants are running on shared infrastructure. That's where this problem gets hard, and that's where we need to focus a ton of our energy.
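For the siloed models just described, the tag-based accounting reduces to a simple roll-up over cost line items. A minimal sketch, assuming each resource in a tenant's silo carries a hypothetical `tenant` tag (the record shape and names here are illustrative, not from the talk):

```python
from collections import defaultdict

def cost_per_tenant(line_items):
    """Roll up billing line items into a per-tenant cost map.

    Each line item is assumed to carry the resource's cost and its
    tags, including a 'tenant' tag applied to every resource in that
    tenant's VPC or silo. Untagged (shared) spend is kept separate
    so it can be allocated by some other policy.
    """
    totals = defaultdict(float)
    for item in line_items:
        tenant = item["tags"].get("tenant", "untagged")
        totals[tenant] += item["cost"]
    return dict(totals)

# Example: three tagged line items plus one untagged shared cost.
items = [
    {"cost": 12.50, "tags": {"tenant": "tenant-1"}},
    {"cost": 3.25,  "tags": {"tenant": "tenant-2"}},
    {"cost": 4.75,  "tags": {"tenant": "tenant-1"}},
    {"cost": 1.00,  "tags": {}},
]
print(cost_per_tenant(items))
```

In practice the line items would come from AWS cost and usage data with cost allocation tags activated; the point is that in a silo the mapping from resource to tenant is one-to-one, so the math is trivial.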
There is no textbook answer to this problem. There's no framework or tool or white paper somebody wrote that says: here's how you derive consumption in a shared-infrastructure model. You have to come up with it yourself. I'm pitching some ideas in the coming slides that I think will help, but I also think you'll have to be innovative on your own to come up with strategies for this. So let's start with the simplest model. Let's say we have an EC2 instance running some service, probably as part of some cluster, with some app on it exposing some kind of REST API, and tenants are exercising that REST API in tenant contexts, but they're all just calling it. If you could somehow know the distribution, which is the problem we're trying to solve, you would see that, for example, tenant 5 in this view is consuming way more of this EC2 instance at this moment in time than, say, tenant 1 or tenant 10, and if I look twenty minutes later, that distribution will probably look entirely different. So what strategy can I use, in this sort of model, to derive consumption? I'm going to say an important word here that I hope stays with you through this whole thing: what are you going to do to derive an approximation of consumption? My belief is that you should not be targeting some absolute notion of consumption. I don't believe there's an algorithm or a strategy yet, at least not one I've come up with (and if somebody has one, I'd gladly have you share it with me when we're done), that is so precise as to say: I've absolutely nailed it, and I can tell you down to the millisecond who's consuming this resource and how much. So instead, we're going to go for some strategy that is an approximation. So what are the ways you could approximate consumption? To me, they relate to the amount of activity somebody is requesting of the service, and there are ways to measure that. Request count could be the easiest one: you say, oh, this is a really simple service, everything it's asked to do it can respond to very quickly, so I'm just going to count the number of requests per tenant, and that'll be my distribution. I'll figure out: hey, this tenant did ten thousand requests, this tenant did four thousand; I've got my breakdown, and I can compute a distribution from it. But that may not be the right answer, because somebody else might say: I don't care how frequently you call; some of the methods in there are super-intense methods that impose a really heavy load on the instance, and that's the better measure of consumption. So there I might look at latency instead: how long did it actually take to process that request? And I'd use that latency as the unit for apportioning consumption of this resource. You could also look at CPU impact, or memory impact. Like I said, it's an inexact science, but these are some of the strategies that make sense to me, and if we live in the world of approximation, we can be comfortable with them. The other thing we have to think about is that the strategies per service are not universal. What I do for EC2, versus what I do for DynamoDB, versus what I might do for Redshift to apportion cost: a managed service versus an instance is a very different mindset when you ask how to apportion consumption. So I actually have to have different strategies for the different services I'm leveraging.
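The two approximation strategies just described, request count versus latency weighting, can be sketched as one small aggregation over per-request records. This is a minimal illustration under assumed field names (`tenant_id`, `latency_ms`), not the talk's actual code:

```python
from collections import defaultdict

def consumption_distribution(requests, weight="latency_ms"):
    """Approximate each tenant's share of a shared instance.

    `requests` is a list of per-request records captured from the
    service. `weight` picks the apportioning strategy:
      - "count": every request counts equally
      - "latency_ms": requests are weighted by processing time,
        so heavy methods contribute more to a tenant's share
    """
    totals = defaultdict(float)
    for r in requests:
        totals[r["tenant_id"]] += 1.0 if weight == "count" else r[weight]
    grand_total = sum(totals.values())
    return {tenant: v / grand_total for tenant, v in totals.items()}

requests = [
    {"tenant_id": "tenant-1", "latency_ms": 5.0},
    {"tenant_id": "tenant-1", "latency_ms": 5.0},
    {"tenant_id": "tenant-5", "latency_ms": 90.0},
]
# By request count tenant-1 dominates; by latency tenant-5 does.
print(consumption_distribution(requests, weight="count"))
print(consumption_distribution(requests, weight="latency_ms"))
```

The same three requests produce opposite answers under the two policies, which is exactly why the choice of strategy has to fit the nature of each service.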
Then there's how I aggregate all of that into some notion of cost per tenant. The other thing I have to think about is that each of the microservices, whatever flavor of service you've decomposed your application into, can have a different strategy for how you apportion consumption: catalog might look different from cart. The same theme, by the way, applies to auto-scaling. If you think about it, some things need to scale on CPU and some things need to scale on memory; you have to take the nature of that service and figure out the right policy for it. The same thing is true here: you have to look at these individual services and say, hey, request count is good on this one, and latency is better on that other one. Where I don't want you to go crazy is in saying: I need this menu of fifty strategies, and I'm going to spend a week thinking about every service and figuring out the ultimate strategy for each one. Come up with a menu of three to five approaches, pick a few that seem to make sense, apply them where they fit, and in the mindset of approximation you'll probably be good enough. I just don't want you to say there's one strategy that applies universally to all services. So what does this look like behind the scenes? How do I get to this number? What sort of infrastructure, what sort of implementation strategy? Well, if you think about what you have to do here, it's not all that different from what you do when you're logging or raising any other kind of metric, or building any metric instrumentation process. You need some way to instrument the actual moving parts of your application with the frameworks and libraries that will let you surface the metric, and then you need to feed that into some scalable mechanism that can accept that data at scale, feed it through, aggregate it, and store it somewhere so that some BI tool can go consume it and give you answers. This is no different from that, and fortunately there are a ton of patterns for how to do it. This is one very simplified one: I've taken an application service and instrumented it with some framework of mine. Depending on your language of choice, that might be a module, it might be a JAR, whatever language construct you're using. And I'm going to instrument this environment with some key metrics I can use to derive consumption. The most important one is tenant ID: if I don't know the tenant ID, I have no notion of which tenant to associate this consumption with. But I'm also going to capture maybe the method they called and the latency of processing that call, and you can imagine date and time and a whole bunch of other data points that could help drive my consumption metrics. Once I have that data, I'm just going to feed it into this pipeline. In this example I picked one of the models you could use; it could be the ELK stack, it could be Redshift, it could be any number of stream-based models for capturing events. But here I'm going to shove it into CloudWatch, use Lambda to pre-process it, feed it into Kinesis, and then eventually dump it into S3 and say: okay, the data is all there; now some aggregation service can go grab it and do something interesting with it. Now, the thing I would like to see people do is, however you instrument on the left, make sure you are emitting the richest set of data that you can. Don't presume exactly which model or policy you're going to use to derive your consumption numbers. Instead, instrument with data as rich as you can, and then, when it finally lands in S3 and I decide to aggregate it, if today this scheme seems to be working pretty well as an aggregation scheme to come up with a distribution, awesome; but if tomorrow I say, you know what, there's another data point that seems to be a better contributor to consumption, and it's available to me there, I can start to play with it.
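The instrumentation side of that pipeline might look something like the following minimal sketch. The wrapper name `emit_consumption_event` and the field set are my own for illustration; the idea from the talk is simply to emit a rich, tenant-scoped record per call and defer the apportioning decision to the aggregation stage:

```python
import json
import time
from datetime import datetime, timezone

def emit_consumption_event(tenant_id, service, method, fn, *args, **kwargs):
    """Run a service call and emit a rich, tenant-scoped metric event.

    In a real system the event would be written to a log shipped into
    CloudWatch (then Lambda -> Kinesis -> S3, as in the talk's pipeline);
    here we just print the JSON as a stand-in.
    """
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    event = {
        "tenant_id": tenant_id,   # essential: ties the work to a tenant
        "service": service,       # e.g. "catalog"
        "method": method,         # e.g. "GET /products"
        "latency_ms": (time.perf_counter() - start) * 1000.0,
        "timestamp": datetime.now(timezone.utc).isoformat(),
    }
    print(json.dumps(event))      # stand-in for the metrics pipeline
    return result

# Hypothetical usage wrapping a catalog lookup:
emit_consumption_event("tenant-42", "catalog", "GET /products",
                       lambda: ["widget", "gadget"])
```

Capturing more fields than today's aggregation policy needs is cheap insurance: when the policy changes, the historical events in S3 already carry the data.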
Now, the thing I would like to see people do: however you instrument on the left there, input the richest set of data you can. Don't presume exactly which model or policy you're going to use to derive your consumption numbers. Instead, instrument with data as rich as you can, and then, when it finally lands in S3 and I decide to aggregate it, if today I'm saying, hey, this scheme seems to be working pretty well as an aggregation scheme to come up with a distribution, awesome. But if tomorrow I say, you know what, there's another data point that seems to contribute to consumption, it's a better data point, and it's already available to me there, I can start to play with it. Because I guarantee you, as you start to play with this concept, you'll be wrong. You'll see your first set of numbers on consumption and say, that doesn't fit at all, that's not what my tenants are doing; we factored this wrong, or our algorithm was wrong. Now, are you going to go re-instrument because you need a new piece of data? You could do that; I just like to keep those two concerns separate. Ultimately, when you're done, you end up with some distribution that says which tenants consumed which percentages of this particular service, and that gets us a little way down the road toward this problem.
Now, it wouldn't be fair to talk about this without talking about what it could mean in a serverless model, because if you catch me anywhere in the halls talking about SaaS, I will tell you I believe SaaS is a fantastic fit for serverless; those two line up very nicely, and I have another talk on serverless and SaaS about the alignment between the two. So what does it mean to apportion consumption in a universe where you're building a serverless application? I've got an API Gateway, I've got my catalog service now represented as a series of individual Lambda functions, and each one of those Lambda functions is executed in an individual tenant context. We are no longer sharing some compute resource like ECS or EC2; that function will execute in the context of that tenant. Now, because I have this one-to-one mapping of tenant to the method that executes, my ability to derive consumption gets a lot easier: I just have to know how many times you called that function, and I have a much easier way to get to a consumption number. So to me this is yet another box to check for serverless: consumption is also easier to do here. That doesn't mean you don't have to build all the infrastructure to capture and aggregate it, but I think your math and your mechanisms for figuring out consumption get easier.
The other thing I want to point out is that this doesn't have to start out as something so granular as every service instrumented in every detail. Instead, you could put something at the edge, just capture the interactions with the services, and say, hey, it's not 100% accurate, but it's better than having nothing. For me, API Gateway is an awesome opportunity to do just this. If I use API Gateway and put my services on the other side of the gateway, I can take the REST entry points to my services and wire them up with instrumentation to raise the same metrics, because I can get latency, I can get tenant context, and some of this other data out of there, and I ship that data off to my aggregation scheme. I have some notion there of consumption. It's not as precise and granular as the other models we talked about, but it's a really good place to start. To me, if you did nothing else, I'd go away and do something like this. And it doesn't have to be API Gateway: is there some edge, some entry point, some managed API experience for your product at the outer edge where you could do this as a starting point? Because I guarantee, if you start to get this data and it's doing interesting things for you, somebody will want the next level of data from you.
Another approach: I was working with somebody on my team named Andy Powell, and he was working on Java annotations as a way to work with AWS X-Ray. For those of you not familiar, AWS X-Ray is a tracing tool; it will essentially capture an end-to-end trace of the calls through your application.
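The edge-capture idea can be sketched as a tiny aggregation over gateway-style access-log records; the record fields and function name here are assumptions for illustration, not an actual API Gateway log format:

```python
from collections import defaultdict

def consumption_from_edge_logs(records):
    """Approximate per-tenant consumption purely from edge traffic.

    records: dicts with 'tenant_id' and 'latency_ms' keys (assumed
    fields; use whatever your gateway actually logs). Returns the
    fraction of total requests attributable to each tenant; you could
    just as easily weight by latency instead of raw request counts.
    """
    counts = defaultdict(int)
    for rec in records:
        counts[rec["tenant_id"]] += 1
    total = sum(counts.values())
    return {tenant: n / total for tenant, n in counts.items()}

logs = [
    {"tenant_id": "t1", "latency_ms": 20},
    {"tenant_id": "t1", "latency_ms": 30},
    {"tenant_id": "t2", "latency_ms": 25},
    {"tenant_id": "t1", "latency_ms": 10},
]
shares = consumption_from_edge_logs(logs)  # t1 made 3 of the 4 calls
```

It's crude, but it captures the "better than nothing" spirit: one collection point at the edge, no per-service instrumentation required.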
It will tell you things like latency, and you can inject annotations while it's capturing these bits. We started looking at this tool just as a cool tool for SaaS, so that I could see traces of what tenants are doing in my system, and then we said, well, it ends up being a repository of some very interesting data that could be a way to calculate consumption. Could I just take this data, take the latency of the calls in my app, run some algorithm against it, and call that an approximation of my consumption? The good news is that you now have a lot less infrastructure, because X-Ray is housing that data. Now, I think you'd probably pull the data out of X-Ray and put it into something a little more BI-friendly; I haven't looked at the footprint for pulling data out of X-Ray closely enough, but I think it's a compelling alternative to going super deep here.
Now, where this gets more interesting is when you start talking about other services, like storage, because storage isn't as straightforward as compute. When I get an AWS bill, I get different line items for different bits of storage: for DynamoDB I might be paying for IOPS, but I might also be paying for the physical size of the data, so there are multiple dimensions along which I can get billed for storage. Unlike compute, where I can just say "tell me how much of X that tenant consumed," I have all this other data about that storage to calculate. For size, the good news is that it's a pretty concrete number, and it can be acquired at a point in time for a tenant. IOPS is harder: how do I know how much of the IOPS to attribute to whom? It's probably some variation of what we talked about with the compute services: some notion of the frequency of the calls, some way of determining which tenants are more active than others, apportioning that and saying, that's how I'm going to apportion your IOPS contribution.
Now, to get the storage aggregation, there's no secret sauce here, no magic. Ultimately this is less about eventing metrics out and more of a pull model, because really all you want to ask is: how frequently do I need this number? Do we do some daily run? Are we only interested when we do a billing cycle? I need some aggregation service that can go out to the individual storage resources, ask what the current consumption for this tenant is, and get that number. Then I can say, this is the distribution of data size, at least footprint size, for a given service associated with a given tenant, and that gives me some way to connect it to the bill.
Now, there are two approaches to figuring out what you want to do with this; there are probably more than two, but two jump out at me. You could decide, you know what, IOPS and data size should be separate metrics, they're interesting to the business separately, and I'll leave them separate: I'll look at the frequency data, come up with some IOPS allocation and say how much each tenant contributes to IOPS, then aggregate the data sizes and say how much each tenant contributes to the data footprint, and stop right there. The other option is to say, no, the business wants some normalized composite view of consumption. They want to know what DynamoDB is costing us, and they don't really care about IOPS versus size versus these other granular bits; normalize that for me somehow and turn it into one normalized DynamoDB consumption number. There's no right or wrong in any of these strategies; you have to compute the underlying numbers either way to get the composite number, so the only question is whether the composite number is of any value to you.
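As a sketch of that second, normalized option; the 50/50 weighting between IOPS and size is an arbitrary business choice assumed for illustration, not anything AWS prescribes:

```python
def composite_storage_consumption(iops_by_tenant, bytes_by_tenant,
                                  iops_weight=0.5):
    """Fold per-tenant IOPS share and storage-size share into one
    normalized composite consumption number per tenant.

    The weight is a business decision: if size is 99% of your cost,
    push iops_weight toward zero (or skip the composite entirely).
    """
    total_iops = sum(iops_by_tenant.values())
    total_bytes = sum(bytes_by_tenant.values())
    tenants = set(iops_by_tenant) | set(bytes_by_tenant)
    composite = {}
    for t in tenants:
        iops_share = iops_by_tenant.get(t, 0) / total_iops
        size_share = bytes_by_tenant.get(t, 0) / total_bytes
        composite[t] = iops_weight * iops_share + (1 - iops_weight) * size_share
    return composite

shares = composite_storage_consumption(
    {"t1": 800, "t2": 200},        # apportioned read/write activity
    {"t1": 10_000, "t2": 30_000},  # table bytes attributed per tenant
)
```

Because each dimension is normalized to shares before weighting, the composite numbers still sum to 1 across tenants, which keeps them usable as an allocation of the DynamoDB line item.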
What I do see people doing is inventing composite numbers that are new ones, sometimes out of their own domain. For some businesses, though, the IOPS aren't relevant and the size is 99 percent of the cost, so normalizing doesn't add a lot of value there. My bigger point in all of this was only to illustrate that as you dig into each type of service, there's a range of new considerations you have to think through to get to an allocation.
Now, the problem only gets more challenging, because there are lots more services we haven't even talked about here: Athena, Kinesis; lots of apps are certainly using SQS and SNS as a big chunk of this, and I see SES used a fair amount as well. Do I go off and invest a ton of energy apportioning consumption across all of these? It kind of depends. For most organizations, at least for me as an architect, I usually have a good sense that, hey, DynamoDB and compute, or whatever, these six or seven areas of our system are 90% of our bill. And yes, we do a ton with SQS, but as a percentage of our overall bill it's so small that even if I invested a ton of time and energy figuring out the per-tenant breakdown, it would have hardly any influence on the final number I come up with. So my advice is: for services you know aren't going to make a big contribution to your number, even if tenant consumption varies for that service, just do an equal distribution across that service, call it a day, and move on. Then, somewhere in the future, if you're suddenly seeing, wow, SNS is doing interesting things to our bill and these tenants are affecting consumption in ways that matter to us, now I'll go figure out how I want to apportion SNS costs and attribute them.
The other one we talked about was the individual services of your application. This one I would go for every time. It's my personal bias, but I've always loved seeing this number on my dashboard: knowing how the actual microservices are contributing to the consumption profile on a per-tenant basis. In this model you have to do some extra work. You can just decide, hey, I'll look at frequency of activity and latency for that service, I won't care what's under the hood, and that'll be an approximation of consumption for that service. Or I'll say, no, I need to look at the storage used by that service; I've got an example of a catalog service here using RDS and S3, and I may look at the consumption of those as well and find some way to normalize both the compute use and the storage use into some idea of tenant consumption for this particular service. I have personally seen my design and my architecture change, and I've even broken my app into new microservices, because of the data points I've gotten out of this discovery.
Now, once you have this data, there are some interesting things you can do with it. Once I figure out consumption, I can start to correlate it to other metrics, and what the business starts to see is that consumption isn't just good because we know how much a tenant is costing us. I've got usage metrics, for example: I know where tenants are getting stuck and bottlenecked inside the application, and suddenly I start correlating some of those bottlenecks and usage challenges with consumption, and I find out, gosh, when this tenant loads the system in this particular way, I see a correlation to where people stop using the system because this particular service isn't working well for them. So I start to see a business-and-technical intersection here that's a very interesting one. I'd like you to think of this as a first-class metric that you put alongside all the other metrics you're gathering, and start looking for new and interesting ways to correlate with those metrics.
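The "equal distribution and move on" fallback for low-impact services, next to a measured allocation for the big-ticket ones, might look like this; the names and numbers are purely illustrative:

```python
def allocate_service_cost(service_cost, tenants, measured_shares=None):
    """Attribute one service's cost across tenants.

    If we bothered to measure consumption for this service, use those
    shares; for services that barely dent the bill (the SQS/SNS case),
    fall back to an equal split and move on.
    """
    if measured_shares is None:
        share = 1.0 / len(tenants)
        return {t: service_cost * share for t in tenants}
    return {t: service_cost * measured_shares[t] for t in tenants}

# SQS is a rounding error on the bill: split it evenly.
sqs = allocate_service_cost(30.0, ["t1", "t2", "t3"])
# DynamoDB matters: use the measured consumption distribution.
ddb = allocate_service_cost(900.0, ["t1", "t2", "t3"],
                            {"t1": 0.5, "t2": 0.3, "t3": 0.2})
```

The design point is that both paths produce the same shape of output, so the downstream cost-per-tenant aggregation doesn't care which services were measured and which were approximated.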
I'd also say, just generally for metrics: once you have consumption, start looking across the AWS stack at other things that might be interesting to correlate with it that you may not be aggregating today. There are plenty of sources of data. CloudTrail; even if you don't use X-Ray the way we described, X-Ray may still be an interesting piece of data to put into your dashboard and your view of metrics; and CloudWatch, obviously, since the standard AWS metrics are interesting to bring into this environment. The whole point is to create this BI service where you're able to correlate all of this data and build interesting views of it, creating a real-time dynamic where people are making decisions based on both consumption and consumption's correlation to these other pieces.
Now, the hard part of this; well, the hard part was certainly figuring out how you're even going to do consumption, but even after you get consumption there's another challenge we still haven't talked about: how this really relates to the bill. We know that a tenant is consuming X percentage of our resources, but that doesn't give us a cost per tenant; it just gives us a percentage of consumption. We still have to go out to the AWS bill and somehow bring that bill in. If we're doing a siloed system, I wanted to show that you could just use a tagging or account scheme to get a total cost per tenant for the silo. But more importantly, we've mostly talked about this pooled mechanism at the bottom, which is: for all my tenants that are pooled, what's the total cost for those tenants, and what am I going to do with those numbers? Well, I'm going to do some very basic math. It's very straightforward: I take my aggregated pooled costs, I multiply by whatever the allocation for each of my tenants is, and I arrive at some distribution of that tenant cost, and there's my cost per tenant as the magic number. And if I'm running in an environment where I've got pooled tenants and siloed tenants, I might choose to aggregate all that up and come up with some total cost per tenant that spans both of those universes; the bottom number is just a way to show that you could aggregate those two concepts.
Now, what I will say is this: the detailed work of getting the billing report, pulling it in, aggregating that data, and figuring out those costs, I'd rather not see people spending energy and time on that. It can add way more to your investment in this effort, so I'd like to see you go out to partner solutions and other vendors who are already good at aggregating that data, and use them as the resource that gets you the billing information; then all you have to worry about is how to apply the allocations to it. Because if you've ever looked through the AWS billing constructs and ways to aggregate them, there are a lot of moving parts, and generating a meaningful bill out of that is a lot of work, more than you may want to invest for this.
So, I talked about this idea of approximation, and I just want to hammer the point home. With this consumption metric, I don't have the idea that these are absolute numbers and I've just nailed what consumption is. I'm not in a general-ledger, credits-and-debits kind of mindset where I've figured out your cost per tenant to the penny. Instead, my mindset is that I want to be good enough here: I want to be able to say these numbers are driving the value proposition we wanted to drive into the business, I'm targeting the right kinds of information, and I'm exposing the right kinds of consumption metrics that are valuable to my business.
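The basic math described here, pooled cost multiplied by each tenant's allocation and then merged with any silo costs pulled from tags or accounts, can be sketched as follows (all names hypothetical):

```python
def cost_per_tenant(pooled_cost, pooled_shares, silo_costs=None):
    """Turn a consumption distribution into dollars per tenant.

    pooled_cost: total bill for the shared (pooled) infrastructure.
    pooled_shares: {tenant: fraction} from the consumption pipeline.
    silo_costs: optional {tenant: cost} read directly from tagging or
    account-scoped billing for siloed tenants.
    """
    costs = {t: pooled_cost * share for t, share in pooled_shares.items()}
    for t, c in (silo_costs or {}).items():
        costs[t] = costs.get(t, 0.0) + c
    return costs

totals = cost_per_tenant(
    1000.0,
    {"t1": 0.6, "t2": 0.4},    # from the consumption distribution
    silo_costs={"t3": 500.0},  # tagged / account-scoped tenant
)
```

This is deliberately trivial: the hard work is producing the allocation shares and the aggregated pooled cost, which is exactly why the billing aggregation itself is worth delegating to a partner solution.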
Then just challenge yourself periodically: reassess where you're at, and ask whether you should be going deeper or more granular. And like I said, if you want, just start at the outer edge; that may be good enough for you.
So what are some of the key takeaways? Hopefully, if you're not already a metrics person and you're in the SaaS world, you see that you ought to be one. You ought to be hunting for metrics and trying to gather all the metrics you can, because even if you're just on the technical side of the house, metrics are going to have a huge impact on how you shape your architecture, and on how you help inform the business and drive business decisions, based on what those metrics tell you about how the system responds and how efficiently it responds. Hopefully you also get the sense that consumption and metering are different concepts; if you're doing metering today, that doesn't necessarily mean you've solved this consumption problem. The other bit is that these tenant-centric metrics, these consumption metrics we talked a ton about here, directly influence the operational side of your house. Hopefully you saw that dashboard we had, and the ways you could surface these operational metrics: these can be real-time metrics, they can affect scale, they can affect availability. They're not just about finding interesting trends we'll look at once in a while; to the operational side of the house, these become critical metrics. When you go down this path, you do have to allow for the fact that different services and different AWS resources may each require a different consumption strategy, and part of that strategy, hopefully you realize, is that you don't have to boil the ocean: you can pick the services where it makes the most sense to start, and do an even distribution in the other cases. The next point is the one we just made: if you're going to go get billing data, correlating to that billing data is a challenge, and I'd like to see you delegate that to a partner solution or somebody else and let them own that API and that experience. And if nothing else, take the last point home: start small. Just demonstrate value, get the business to see the value of these metrics, get them engaged with wanting to know what they're about, get your product managers thinking about these metrics, and get them to realize how important they can be to the bottom line.
Some additional resources that may be of interest to you: there is a Quick Start solution, "SaaS Identity and Isolation with Amazon Cognito"; I didn't put the link here, but if you search for it, it's an end-to-end reference SaaS implementation doing some interesting things with identity and with isolation. I only point it out because you're a SaaS crowd, and I have a session at the end of the week that dives very deeply into its implementation. In terms of other sessions, I've got another session on identity later, I think in another hour, and a colleague of mine, Judah Bernstein, is doing GPSTEC309, I think at 5 o'clock, where he looks at New Relic and multi-tenant health. It marries nicely to the topic we just hit here, but focuses more on how you can surface these things in a partner experience and instrument your app to get them to show up there. And then, like I said, there's the Friday session I mentioned. So thank you so much for coming. Hopefully you're interested in tenant consumption metrics: not always the sexiest thing, but a metric that's very important to SaaS businesses. Thank you very much [Applause]
Info
Channel: Amazon Web Services
Views: 2,285
Keywords: AWS re:Invent 2017, Amazon, GPSTechnical, GPSTEC308, Compute, Storage
Id: X2SgoAl1vK8
Length: 52min 34sec (3154 seconds)
Published: Wed Nov 29 2017