Multi-tenant architecture in 20 minutes

Captions
Okay, hi everybody. My name is Carmel Hinks, I'm a software engineer at Atlassian, and I'm here today to speak to you about multi-tenant architectures in 20 minutes. By the end of the presentation I'd like you to be able to answer three questions: what is a multi-tenant architecture, and why would you use one? How can you connect your customers to their data within a multi-tenant architecture if every customer has their own database? And what are three things to consider when building a low-latency, highly available application that serves critical data?

To get you to the stage where you can answer all of those questions, I'm going to run you through a couple of things. We'll start off with the history of Atlassian cloud, and then move on to speak about multi-tenant architectures and what they actually are. We'll then talk about how we at Atlassian connected our customers to their data within a multi-tenant architecture, and I'll break that down into three different phases: designing, building, and refining. We'll finish up with a reflection on what we think we probably could have done differently, and a summary at the end.

So, to start off with the history of Atlassian cloud. Before we get too far into things, there might be a couple of you sitting there thinking: what's an Atlassian? Well, at Atlassian we specialize in building collaboration software to unleash the potential in every team. That means we're responsible for building tools such as Jira, Confluence, Trello, and Bitbucket.

In the beginning, Atlassian was a server company, which just meant that we built our tools in house and then bundled them up. Our customers would come and purchase those little bundles, and it would then be up to them to unwrap them, set them up, and maintain them on their own servers. After a little bit of time, cloud started to become a pretty big deal, and our customers started to express to us that they were having a hard time
running and maintaining their own servers. So we figured, why don't we just run and maintain the servers, and then our customers can simply access them? That's exactly what we did: we essentially deployed our server products up in the cloud, and that became our cloud offering.

This is what is known as a single-tenant architecture, where a tenant is just a cloud customer; in Atlassian's case that would typically be an entire company. A single-tenant architecture is a model where a single compute node serves a single tenant. This is because those compute nodes are stateful, which means they have pre-existing knowledge of the tenants they're able to serve. To be honest, we got really good at doing single-tenant architectures. We did this for many, many years, and we were really able to get the most out of this infrastructure.

But when we scaled to tens of thousands of customers, we started to run into some problems. For example, if a compute node were to go down, an entire customer somewhere, and all of their users, would be completely unable to access their instance. Upgrades became pretty problematic, because if we had tens of thousands of customers, we had tens of thousands of compute nodes, and we had to go through and apply an upgrade to every single one of them. Upgrades are complex and time-consuming processes, and they required downtime, which meant we had to perform them during our tenants' maintenance windows. Given that we had customers all over the world, it could actually take us 24 hours to roll an upgrade out to production. You can imagine that's a pretty big deal when you need to roll out the critical fix for a security bug. Finally, cost: at the end of the day Atlassian is a business, and we care about where we spend our money. With the single-tenant architecture we were scaling by the number of customers, and that doesn't necessarily correlate to usage. It meant that we were paying the same amount of money for
some fairly large customers, who were hitting us all the time, as we were for smaller customers who weren't actually using any of the resources we had provisioned for them. Basically, we said: there has to be a better way. And this is where the multi-tenant architecture came in.

But what is a multi-tenant architecture? Well, where the single-tenant architecture is a model where a single tenant is served by a single compute node, a multi-tenant architecture is a model where any tenant can be served by any compute node. This is possible because those compute nodes are stateless, which means they can figure out all of the information they need to know on the fly. This is awesome for a couple of different reasons. To begin with, it unlocks horizontal auto-scaling, which just means that if some of those compute nodes start to get a little bit strained, we can just add more nodes to the pool. This allows you to scale by usage instead of by number of customers. Where in a single-tenant architecture a compute node going down means a customer can't access their instance, in a multi-tenant architecture you just take that compute node out of the pool. Finally, with upgrades, multi-tenant architectures give you the ability to spin up new compute nodes running the latest version of the software, and when they're ready you just switch the traffic over, and you have yourself a zero-downtime upgrade. Here in Australia, this is what we refer to as bloody beautiful.

So we figured, all right, we'll quickly go over to Jira and Confluence and switch them over to multi-tenant architectures. And that's exactly what we did. It just took us dozens of teams performing over 15,000 tasks over the space of three years. Why is that? Well, it all comes down to legacy. At the end of the day, Jira and Confluence have been around for almost 16 years now, and sixteen years ago we wrote millions of lines of code under the assumption that we could do things such as cache tenant
data and connect the database directly to the compute node. All of this constitutes state, so to make the multi-tenant architecture work we had to go through and make a lot of changes, and we also had to build supporting infrastructure to make the whole thing work in the first place. That's what we'll be focusing on for the rest of this talk: some of the infrastructure we had to build to support our customers and connect them to their data within this new architecture.

We'll focus first on how we designed a solution to this problem. In the beginning, we wanted to build a multi-tenant database. The idea here is that you put all of your customers' data within one database, and the compute node figures out what information to fetch based on the context of the request. But we were already in the middle of a really large and risky project, so we sat down and investigated ways we could cut scope and reduce risk while still delivering as much value to our customers as possible. At that point in time we decided that a multi-tenant database just wasn't the best thing to do, so we compromised and stuck with our single-database-per-tenant model.

But this created a new problem: we now had tens of thousands of databases but only a handful of compute nodes, and we needed to tell each compute node which database to connect to, dynamically. To handle this, we built a new service called the tenant context service, or TCS. The idea with the TCS is that it provides a lookup at runtime. For example, if a tenant were to make a request, one of the compute nodes would receive that request and send it over to the TCS, saying, okay, show me sarahs-company.atlassian.net. The TCS would receive that hostname, and because those hostnames are unique, it could perform an O(1) lookup and get hold of the tenant configuration. The tenant configuration in this case is
what type of products the tenant has, how many users, and the database details. The TCS sends the database information back to the compute node, and the compute node can now resolve the request.

This seems easy enough, aside from the fact that we've just introduced a lookup into every single request made in Jira and Confluence cloud. To put that into perspective, that's actually over 30,000 requests per second. So we needed to be pretty careful with how we built this thing. It needed to be really, really fast; we didn't want people to even realize that the lookup was happening. It had to be incredibly highly available, because if the TCS went down, that was an outage in Atlassian cloud. It had to be incredibly scalable, because 30,000 requests per second was a number that was only going to continue to increase. And finally, it had to be strongly consistent, because tenant configuration is really critical data, and we wanted to be completely sure that we had a single source of truth that gave us an accurate representation of what any of our tenants looked like at any point in time.

But we start to run into some problems when we look at these requirements, because we've just asked for low latency, high availability, and strong consistency, and that's actually not possible according to the PACELC theorem, which states that in the case of a network partition you have to choose between availability and consistency, and else, even under normal operation, you have to choose between latency and consistency. For those of you familiar with the CAP theorem, this is just an extension of that; for everybody else, it's a whole bunch of words which basically say you have to make compromises if you want availability and consistency.

Luckily for us, we aren't the only people to ever run into this problem. There is a pattern known as CQRS, which stands for command query responsibility segregation. Again, a lot of fancy words that basically just mean: don't build one thing, build two, so that you can optimize each accordingly.
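A minimal sketch of that split, in Python with hypothetical names (this is an illustration of the pattern only, not Atlassian's actual code): commands go to a strongly consistent write store, which emits ordered change events that are asynchronously applied to a separate read model, so reads never touch the write store.

```python
from collections import deque

class WriteModel:
    """Single source of truth: accepts commands, emits ordered change events."""
    def __init__(self):
        self.records = {}          # authoritative tenant configuration
        self.events = deque()      # ordered change events, not yet propagated

    def handle_command(self, tenant_id, config):
        self.records[tenant_id] = config               # strongly consistent write
        self.events.append((tenant_id, dict(config)))  # event for the read side

class ReadModel:
    """Optimized for fast lookups; eventually consistent with the write model."""
    def __init__(self):
        self.view = {}

    def apply(self, event):
        tenant_id, config = event
        self.view[tenant_id] = config

    def query(self, tenant_id):
        return self.view.get(tenant_id)  # O(1) lookup, may briefly lag writes

def propagate(write_model, read_model):
    """Stand-in for the stream that ships events, in order, to read replicas."""
    while write_model.events:
        read_model.apply(write_model.events.popleft())

# A write becomes visible on the read side only after propagation:
writes, reads = WriteModel(), ReadModel()
writes.handle_command("sarahs-company", {"products": ["jira"], "db": "shard-7"})
print(reads.query("sarahs-company"))   # None: not yet propagated
propagate(writes, reads)
print(reads.query("sarahs-company"))   # {'products': ['jira'], 'db': 'shard-7'}
```

In the system the talk goes on to describe, the write store is a DynamoDB table behind the catalog service, the event channel is a Kinesis stream, and the read replicas are the regional TCS stacks; the sketch only shows the shape of the pattern.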
So you have one model for your commands (your updates) and another model for your queries (your reads). The idea is that your commands model can be the strongly consistent single source of truth, whereas your reads model can be highly available with lower latency. It's worth noting that because there are now two systems in this design, it takes a little bit of time to get data from one place to the other, which means you have yourself an eventually consistent solution. In our particular scenario that was okay: as long as we had strong consistency somewhere, we were happy to pay the price of eventual consistency for our TCS. So we decided to adopt this pattern. We built the TCS as the read side, and we moved the requirement of strong consistency over to a tenant catalog service.

We now had a pretty good design to follow; we just needed to build the thing. Looking at our two services, we realized they were just trying to do two things, store tenant configuration data and serve tenant configuration data, and at the end of the day that's really just a key-value store. This meant that Amazon Web Services' DynamoDB was a really good fit for this use case, so we got ourselves a DynamoDB table. Out of the box, AWS guarantees that DynamoDB will be available 99.99% of the time, which correlates to about four minutes of downtime per month at worst. That's okay for something like the catalog service, where we're not optimizing for availability, but for something like the TCS it's not really going to cut it. That's fine, though, because an easy way to increase your availability is to introduce redundancy, and that's exactly what we did. We got ourselves not one but three DynamoDB tables, deployed into three different geographic locations, or AWS regions. The idea here is that if one of those tables were to fail, the others can take over. This actually brings us up to a theoretical table
availability of six nines, which is closer to three seconds of downtime per month. That's starting to sound a lot better. The multi-region deployment gave us an added benefit of reduced network latency. Let's say we had a tenant located in Europe who was constantly making requests to a TCS over in the western USA: they could be incurring 80 milliseconds of network overhead alone, before they even hit the service. Alternatively, they could hit a TCS in their own region and incur only around 7 milliseconds of network overhead.

This did mean we needed to find a solution that could get our data from our single source of truth out to our TCS stacks. It needed to be something that could propagate data to multiple different regions. It needed to guarantee ordering, because think about us receiving a request to deactivate a product and then a request to activate a product: if we were to get that order wrong, the impact is customers paying for products they're completely unable to access, and that was just not going to fly. And finally, it had to be near real-time, because as I mentioned before, we were making these trade-offs with eventual consistency, but at the end of the day we're talking in the scope of milliseconds the vast majority of the time; we don't want people to even realize there are two systems going on in the background.

For us this meant that Kinesis streams from AWS were a really good fit, because Kinesis not only allows multiple different readers from all over the world, it also guarantees that your records will be delivered in order, at least once, and it's fast. So now we had ourselves a Kinesis stream, but we needed something that would write to the Kinesis stream only upon successful persistence, and likewise read from the stream and write out to the TCS tables. At this point in time we got ourselves a couple of clusters of
EC2 nodes, which would scale depending on the load we received. With this in place, the catalog service could receive a request, write it to its table, and then, upon successful persistence, put that information onto the stream. TCS stacks from all over the world would be listening to that stream; they could then pick up that information and persist it to their own tables. With all of this in place, we were able to satisfy all of our requirements.

However, this is software: when you build something, it doesn't always work out the way you planned the first time around. So we had to spend a little bit of time refining our solution, and to be honest, there were quite a number of problems we had to solve. For example, we had cases where we redeployed the TCS during a high-load period and our load balancer didn't scale up fast enough, so we needed to start looking into things like pre-warming the load balancers. We had cases where AWS had an outage and entire records would be dropped from one of the TCS stacks. While that was okay in the sense that the other TCS stacks could take over, it did mean we had to write a tool capable of reading all of the information from the single source of truth, making a modification, writing it back, and then flushing that information through the stream to get us back into a consistent state.

However, the biggest problem we had to solve was with regard to caching. This is a real graph of the TCS's average latency from around November last year. (I'm speaking about the average instead of something like the p99 because I don't have a screenshot of any other metric from that time, but it tells the story anyway.) You can see that we were sitting at about five milliseconds of latency, and then, pretty much out of nowhere, we started to skyrocket, up to one second of latency for the TCS. Upon investigation, we realized this was actually due to an internal problem at AWS, where DynamoDB slowed down
significantly and was taking a really, really long time to respond. This meant that all of our EC2 nodes were sitting there waiting for DynamoDB to get back to them, and requests to the TCS started to time out. So we figured, all right, no worries, we'll just reduce the number of times we hit DynamoDB in the first place, and we can do that by introducing caching. Caching in this case made a lot of sense, because these records were accessed all the time but actually changed very infrequently. However, given the type of data we were serving, we could only cache if we could reliably invalidate, and that's where it became a little bit more complicated, because we don't have just one cache, we've got a cluster of caches.

That meant we had some issues. If, for example, a request came in to make a change, one node would receive that request and make the change to the DynamoDB table. Because it handled the request, that node knows to invalidate its own cache; however, everything else in the pool has no idea that anything has happened. To handle this, we added Simple Notification Service (SNS), which is AWS's publish-subscribe solution. The idea here is that when a request is received, the node still writes to the table and invalidates its own cache, but it then notifies SNS that something has happened, and SNS broadcasts that change event to all of our nodes, which then know to invalidate their caches as well. After rolling this out, we didn't just see that incident go away; we saw a dramatic decrease in the overall latency of the service. In fact, this was such a successful pattern that we decided to roll it out client-side as well, and the average TCS latency has dropped to about zero now, because hardly anyone even hits the service anymore.

However, this does bring us to a reflection on what we think we probably could have done differently, because if we were to have our time again, we probably would have taken caching a little bit more
seriously from the very beginning. A lot of our caches were added quite reactively, when things started to break, and that was fundamentally due to the fact that we were scared to cache this type of data. As Phil Karlton once said, there are two hard problems in computer science: cache invalidation and naming things. That was very much the case for us. We were very scared that we would get the invalidation wrong and serve stale tenant configurations, potentially indefinitely. However, it became clear to us that the benefits were going to outweigh the risks: sure, caching is hard, but it's not impossible, and we had a pretty good idea of how we would go about it. So we decided to move forward with it, not just once but twice, and that has probably been the best thing we've done for the TCS to date.

Nevertheless, despite all the problems we had to solve, since December 2017 the TCS has been behind every single Jira and Confluence request in cloud, and that also marks the end of the story about our big multi-tenant move-all-the-things-to-AWS project.

All right, so at the beginning of the presentation I said I'd like you to be able to answer three questions. What is a multi-tenant architecture, and why would you use one? Well, it's just a model where any tenant can be served by any compute node, and you use one because it enables awesome things such as horizontal auto-scaling (so that you can scale by load), enhanced resilience, and zero-downtime upgrades. How can you connect your customers to their data in a multi-tenant architecture if every customer has their own database? Well, in our case, we introduced a lookup service that resolves tenant configuration on every request, based on a unique identifier such as the hostname. And finally, what are three things to consider when building a low-latency, highly available application that serves critical data? Well, you can introduce redundancy to reduce latency and increase availability, you can harness CQRS to mitigate PACELC, and you can cache aggressively, as long as you invalidate intelligently. Thank you so much. [Applause]
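As a footnote to the caching discussion in the talk, the pattern of per-node caches plus a broadcast invalidation channel can be sketched as follows. This is a minimal in-memory illustration with hypothetical names; the pub/sub channel stands in for SNS, and the shared store stands in for DynamoDB.

```python
class CachingNode:
    """A compute node with a local cache over a shared backing store."""
    def __init__(self, store):
        self.store = store   # shared source of truth (stand-in for DynamoDB)
        self.cache = {}      # local cache of tenant configuration

    def read(self, key):
        if key not in self.cache:            # cache miss: hit the store
            self.cache[key] = self.store.get(key)
        return self.cache[key]

    def invalidate(self, key):
        self.cache.pop(key, None)

class Broadcaster:
    """Stand-in for a pub/sub channel (SNS): fans change events out to all nodes."""
    def __init__(self, nodes):
        self.nodes = nodes

    def publish(self, key):
        for node in self.nodes:
            node.invalidate(key)

def write(store, broadcaster, key, value):
    store[key] = value          # write to the source of truth first
    broadcaster.publish(key)    # then tell every node to drop its stale entry

store = {"tenant-a": {"product": "jira"}}
nodes = [CachingNode(store) for _ in range(3)]
bus = Broadcaster(nodes)

for n in nodes:                       # warm every node's cache
    n.read("tenant-a")
write(store, bus, "tenant-a", {"product": "confluence"})
print(nodes[2].read("tenant-a"))      # {'product': 'confluence'}: no stale entry
```

Without the broadcast step, only the node that handled the write would drop its stale entry; every other node would keep serving the old configuration until its cache entry happened to be evicted, which is exactly the invalidation risk the talk describes.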
Info
Channel: Carmel Hinks
Views: 60,800
Rating: 4.9100718 out of 5
Keywords: multi-tenant architecture, multi-tenancy, grace hopper celebration, atlassian, multi tenant architecture, aws
Id: 0N4KknY_zdU
Length: 18min 56sec (1136 seconds)
Published: Thu Oct 04 2018