Scaling Slack - The Good, the Unexpected, and the Road Ahead

Captions
I'm going to talk about some of the transitions that Slack has gone through over the last couple of years. Just by way of introduction, this was the nice welcome post that my manager Julia posted on my first day at Slack, just about two years ago, when I joined the infrastructure and scalability team. Before this I spent a bunch of time at various startups and other companies, mostly in the networking space, and spent some years at Berkeley doing a PhD that has very little to do with anything I'm going to talk to you about right now. This is not a talk about a transition from monolith to microservices; there were some really great talks earlier about that theme, and if you didn't see them I encourage you to catch the videos afterwards. What this talk is about is what's changed in Slack over these past couple of years.

To frame the conversation, we'll go back in time a little bit. I'll lay out the basic architecture of the Slack service in 2016, motivate why we made these changes in light of new requirements and new scaling challenges, spend most of the talk on what we did about it, and then sum up with some themes that came out of that work and talk a little about what we have in front of us.

All right, so let's all get in a time machine and go back to 2016. I've just joined the company, I'm really excited to start, and I go through this presentation about how Slack works at the time. I'm not going to do the thing where I ask everyone to raise your hand if you use Slack, but if you don't know Slack, it is a messaging and communication hub for teams. People can communicate in real time with rich messaging features in order to get their work done; there are lots of integrations and plenty of other features in the product. I want to point out right now that I'm going to be focused almost entirely on the core messaging features of the product. Search, calls, and the other parts of the system have their own interesting tales, but I'm not going to talk about them; this is just about the messaging core of the application.

It's important to note that Slack is organized around this concept of a workspace. When you log into the service, you log into a particular domain that relates to your company or an interest group, and then all of the communication, all of the user profiles, all of the interactions really live within this bubble that we call a Slack workspace. This was a really important element of the architecture at the time.

Back in 2016 we had about four million daily active users, and people stayed connected to the service for much of the work day. A large number of our users would open their laptop in the morning and have Slack open for the duration of the day before shutting down at night, so our peak connected users globally were about two and a half million, and those sessions were really long: ten hours a day of staying connected to the service. At the time, organizations of around 10,000-plus users were the biggest using the product. There was a culture of a very pragmatic and conservative engineering style, and you'll see that reflected in the architecture of how the system worked, which really can be drawn up in five simple boxes on this diagram.
Client applications did most of their interactions with a big PHP monolith hosted in Amazon's US East data center; in fact most of the site ran out of US East, and it still does, backed by a MySQL database tier. There was an asynchronous job queue system, which has had its own share of scalability and performance challenges that I'm not going to cover in this talk, but that's where we would defer activities like unfurling links or indexing messages for search. And then in the yellowish boxes on top was the real-time message stack. So in addition to having a pretty traditional web app, with all the letters of LAMP (Linux, Apache, MySQL, and PHP) present and happy, we had this real-time message bus, custom software built in a Java tier, which is where all of the pub/sub distribution of the messaging product happened. The only thing that didn't run in us-east-1 was something we called the message proxy, which was really just responsible for terminating SSL at the edge and proxying the WebSocket connection over the WAN back to US East. It's a pretty simple model: five boxes to describe all of Slack.

That was reflected in how the client and server interactions worked. You open the client in the morning, and the first thing it does is make an API call, rtm.start, to the PHP backend, which would download the entirety of the workspace model: every user, all of their preferences, their profiles and avatars, every channel, all the information about them, and who is in every channel. All of that would get dumped to the client, and the client would populate its model of the team's metadata from it. As part of that initial payload we also got a URL, which the client would then use to connect a WebSocket through the nearby message proxy back to the message server. Once you're connected to the service and using it for ten hours a day, pushes would come over that WebSocket: messages would arrive, a user might change their avatar and you'd see the image immediately update. Both data plane and control plane messages were sent over this real-time message service to keep the client up to date.

The backend was scaled out horizontally by a pretty straightforward sharding approach. When we create a new customer, a new workspace in Slack, we assign that workspace to a particular database shard and to a particular message server shard, and that was pretty much it. When you log in, the first API action would look up the metadata row from a database tier we call "domains", the key initial bootstrapping tier. That response would go back to the PHP tier, which would then know which of the database shards and which of the message server shards to route the request to, and from that point on pretty much all interactions for that user, and actually for all of the users in that workspace, would be confined to that one database shard and that one message server.
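To make that routing concrete, here is a minimal sketch of the kind of lookup described above. The table names, hostnames, and structure are illustrative assumptions, not Slack's actual code.

```python
# Illustrative sketch of workspace-based shard routing (not Slack's actual code).
# Assumes a "domains"-style bootstrap store that maps a workspace to its shards.

DOMAINS = {
    # workspace_id -> metadata row assigned at workspace creation time
    "T012AB": {"db_shard": 35, "message_server_shard": 7},
}

DB_SHARD_HOSTS = {35: "db-shard-35.example.internal"}
MSG_SERVER_HOSTS = {7: "msgserver-7.example.internal"}


def route_request(workspace_id: str) -> dict:
    """Look up the workspace's shards once, then route every subsequent call there."""
    row = DOMAINS[workspace_id]  # one bootstrap lookup per login
    return {
        "db_host": DB_SHARD_HOSTS[row["db_shard"]],
        "message_server": MSG_SERVER_HOSTS[row["message_server_shard"]],
    }


print(route_request("T012AB"))
# {'db_host': 'db-shard-35.example.internal', 'message_server': 'msgserver-7.example.internal'}
```

Because the assignment is made once at workspace creation, every later request for any user in that workspace lands on the same pair of servers, which is exactly what made the model simple to operate and, later, hard to scale for very large organizations.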
At the time we managed our servers in a way I'd call a "herd of pets". People have heard the cattle-versus-pets metaphor; we really had a lot of individual pets. Each database server and message server was known by a number, and lots of manual operations were needed to keep the service up to date. The PHP monolith had a big mapping that would say, for database shard number 35, here is the hostname it is currently running on, and there were, like I said, manual processes to keep that mapping current.

The one exception is that we ran our databases in a slightly non-standard way: each shard was a pair of MySQL hosts running active-active master-master replication, meaning each side was available for writes and they replicated to each other. That was a deliberate design choice; we were not dummies. We did it as a pretty pragmatic approach to get site availability, with the obvious caveat that we would sacrifice some consistency. Every workspace, in addition to having one particular numbered pair of MySQL hosts, also had a preferred side that it would try to read from and write to. When everything was up and functioning, every workspace would be pinned to one database host, with the other side really just there as the replication backup. If anything went bad, the application would automatically fail over to the other side. And yes, that did produce some conflict problems, and there were some challenges along the way, but for the most part this model basically worked and the site was able to scale accordingly.
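As a rough illustration of that preferred-side behavior, here is a sketch under assumptions (hostnames and the health check are hypothetical, and conflict resolution is omitted, which is exactly the consistency trade-off just described); it is not Slack's implementation.

```python
# Sketch of preferred-side routing over an active-active MySQL pair.

SHARD_PAIRS = {
    35: ("db-35a.example.internal", "db-35b.example.internal"),
}


def pick_database(shard_id: int, preferred_side: int, is_healthy) -> str:
    """Return the host to read/write for this workspace, failing over if needed."""
    pair = SHARD_PAIRS[shard_id]
    preferred = pair[preferred_side]
    backup = pair[1 - preferred_side]
    if is_healthy(preferred):
        return preferred  # normal case: the workspace stays pinned to one side
    return backup         # failover: writes land on the other master


# Example: the workspace prefers side 0, but that host is down right now.
host = pick_database(35, preferred_side=0,
                     is_healthy=lambda h: h.endswith("b.example.internal"))
print(host)  # db-35b.example.internal
```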
There are really two dimensions to why I think this worked so well for Slack in 2016. The first is that the overall client model enabled a very rich, real-time feel to the product, which I think was a huge part of why Slack was so successful. It really felt like everything happened right away: messages appeared, reactions appeared, and even the metadata updates showed up immediately, which was a key element of the rich feel of the messaging product. The second is the backend. We didn't have a huge engineering team, and five boxes is a pretty simple model to keep in your head. If a particular customer was having a bad day, you could figure out very quickly which database host to go look at and which message server host might be on fire; with only a few places to look for the problem, debugging and troubleshooting stayed simple, and that really helped the team focus on building the product and the user experience.

But this is not 2016, this is 2018, and in the years between we ran into some challenges that made that model no longer work very well for us. There are really two dimensions to this: an increase in the scale and usage of the site, and some pretty fundamental changes in the product model. First, scale. We had great growth, and that growth has continued to the point that in 2018 we have more than eight million daily active users, more than seven million of whom are connected at any one time at peak; that number is hot off the presses. That's a lot of people connected to the service, and we're hosting organizations of more than 125,000 users. So we've doubled our user base and we've increased the size of the largest organizations using Slack by more than ten times. As a result of these scaling challenges, as I'll note throughout this talk, we've adapted our engineering style: we're a bigger team, and we're a little more ambitious in the types of solutions we're willing to embrace, when it comes to complexity, in service of some of these big challenges.

Now, the product model. I mentioned before that there was this nice, tidy workspace bucketing for all of the interactions in Slack. That started to break down, in two phases. The first was the introduction of the enterprise product. I'm not sure how many people are aware of this, but for large organizations we offer a version of Slack in which you can have separate Slack workspaces for each subdivision, geolocation, or other organizational unit within a larger enterprise, so that local teams have control over their channels, their preferences, and which integrations are added, while all of them remain part of an umbrella organization. So Wayne Enterprises can have announcement channels, global channels where the CEO can post a message that everybody in the company can see, while Wayne Security keeps control over its own workspace. But this means that each of those workspaces has its own existence in the backend, with its own database shards and its own message server shards, and a given user no longer belongs to just one nice organizational bucket. We now have different contexts to consider: is this an organization-wide channel or a local channel?

Then things get even more interesting with a feature called shared channels, currently in beta, where two completely separate organizational entities, with their own management and billing, can create a shared channel between them for collaboration. This is heavily used, for example, by design agencies or PR agencies that have relationships with multiple customers who are also using Slack, so they can communicate from within their home workspace with people in the other company. This fundamentally blew up a lot of the assumptions the original architecture was written to serve, because channels no longer necessarily belong to one workspace. So how do we adapt that architectural model to take into account the bigger scale and these shifts in the landscape?

Like I said, the cross-organization shared channel was one challenge, and then there was just dealing with the scale. We had all the great war stories you'd expect, where an action would happen across a large number of users in one workspace and all the client applications would come thundering back into the backend to hammer it for some data. As we got bigger and bigger organizations, those boot payloads I mentioned before got bigger and bigger, so we started to have trouble just getting clients connected to the site. And that herd of pets works fine when you have 20 servers, starts to get bad when you have a hundred, and once you're into hundreds and hundreds of servers, the manual toil it took to keep the site up was really grinding on our service engineering and operations teams.

All right, so now let's turn to what we did about it. I'm going to talk about three separate changes to the service that were in service of a number of these challenges. The first topic is a change to the client-server model, enabled by a caching service called Flannel, which allowed us to really slim down that initial payload.
Before we get to it, let me take a little step back. I did some spelunking in our data warehouse. The original rtm.start payload was a function of the number of users in the workspace times how big their profiles were, plus how many channels there were, how big each channel's topic and purpose were, and how many people were in each channel times the size of a user ID. I picked three random workspaces with different numbers of users: at around a thousand users it's a reasonable payload of about 7 megabytes, not great, but as you get up to tens of thousands or 170,000 users the payload gets to be enormous, basically prohibitively expensive for every laptop or iPhone to pull down whenever a user wants to get onto the site. And remember that's per user: if I have a hundred thousand users each pulling down a hundred-plus megabytes of data, the site is just not going to handle that.

So the biggest change we made to address this was to make that initial payload much, much smaller. We did that by introducing an edge caching service that we call Flannel, deployed in points of presence all over the globe, close to clients, and we changed the WebSocket connection flow so that after making an initial thin connection to the PHP application, the client connects its WebSocket to a nearby Flannel cache, which proxies the connection to the message proxy and on to the real-time pub/sub system behind it.

As I mentioned, Flannel is a globally distributed edge cache. When clients connect, a routing tier in each edge PoP routes the client, with workspace affinity, to a Flannel cache that is likely to already have the model for that workspace. We didn't want to get into a situation where every Flannel had to have all the metadata about every customer, because that really would not scale, so you get routed to a nearby Flannel that has the model for your workspace. Because Flannel is a proxy for that socket connection, it sits in line and is able to man-in-the-middle the updates that occur for the users in that workspace. That allowed us to take the user profiles, the channel information, the initial group membership, all the things that were making the payload huge, and simply remove them from the initial payload altogether, and then expose callback query APIs so the clients could, with very low latency, fetch unknown information from a nearby Flannel. That let us maintain the really real-time feel of the product, where every bit of information you see on the screen shows up immediately and feels like it's instantly there, but without having to download all of it to the client to be pre-staged. Instead we allow a rapid query and response against an edge cache very close to the users, and this was hugely successful in allowing us to scale the product to larger and larger organizations.
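Here is a minimal sketch of that lazy-loading pattern from the client's point of view. The endpoint name, URL, and response shape are hypothetical illustrations, not Flannel's actual API.

```python
# Sketch: a client resolving an unknown user ID against a nearby edge cache
# instead of holding the entire workspace model in memory.
import requests

FLANNEL_EDGE = "https://edge.example.com/flannel"  # assumed edge PoP address
local_cache = {}  # objects the client has already seen


def get_user(user_id: str) -> dict:
    """Return a user profile, querying the nearby edge cache only on a miss."""
    if user_id in local_cache:
        return local_cache[user_id]
    resp = requests.post(f"{FLANNEL_EDGE}/users.query", json={"ids": [user_id]})
    resp.raise_for_status()
    profile = resp.json()["users"][0]
    local_cache[user_id] = profile  # cache it for subsequent renders
    return profile
```

The point of the design is latency: because the cache sits at the edge close to the user, a miss in the client's local model is still fast enough that the UI feels as instant as it did when everything was preloaded.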
One thing to note here: rolling out Flannel was both a backend server refactor and a huge client effort. The clients had been written from day one to expect that the whole model was resident in memory, so if you think about a large JavaScript codebase, or the Swift and Android codebases, they now had to be prepared to make callbacks if an object doesn't exist. We did this in phases, and it obviously took cooperation and a huge uplift from the client teams and the Flannel team. An interesting approach we took as part of the initial rollout: because Flannel sits as a man-in-the-middle between the backend and the front end, and it has this in-memory model of all the users, it initially had a mode where it would sniff the WebSocket messages going across the wire to a given client, and if it saw a user ID or a channel ID that it thought the client might not have, it would just inject that metadata inline into the WebSocket stream, so the object was already in the client's model right before the client needed it. That carried us through the long slog of rewriting the clients to expect objects to not necessarily exist and to make callback API queries. With these kinds of techniques in place, and the long, hard work of all the engineers, we were able to build this Flannel-based metadata model that really enabled the site to scale to larger and larger customers and organizations.
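Here is a rough, purely illustrative sketch of that injection mode: scan messages flowing toward the client and push metadata for any IDs the client has not seen yet, ahead of the message that references them. Field names like "user" and the "user_seeded" event type are assumptions, not the actual protocol.

```python
# Illustrative proxy-side injection (not Flannel's actual implementation).
import json

workspace_profiles = {"U123": {"id": "U123", "name": "jane", "avatar": "..."}}


def inject_metadata(raw_frame: str, seen_by_client: set) -> list:
    """Return frames to forward: seeded metadata first, then the original message."""
    frames = []
    message = json.loads(raw_frame)
    user_id = message.get("user")
    if user_id and user_id not in seen_by_client and user_id in workspace_profiles:
        # Push the profile down before the message that mentions it arrives.
        frames.append(json.dumps({"type": "user_seeded",
                                  "user": workspace_profiles[user_id]}))
        seen_by_client.add(user_id)
    frames.append(raw_frame)
    return frames


seen = set()
for frame in inject_metadata('{"type": "message", "user": "U123", "text": "hi"}', seen):
    print(frame)
```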
The next big change we made was to the data layer behind all of this, and it was a really big, fundamental rethinking of how we wanted to do our data sharding. It stemmed mostly from the continued database hotspots we were seeing, caused by co-locating all of the users and all of the channels for an organization onto a single database shard. I pulled some redacted post-mortem titles from our GitHub repository: a shard gets overwhelmed with queries, or new features get rolled out without being load tested as well as they could have been, and the load comes in and just hammers these database shards. The graph on the left shows the top hundred databases by daily query time, pulled from a random day in our history, and you can see that on that particular day one shard is five to ten times as busy as all the rest. This would just keep happening: someone would do something in a big organization that we didn't expect, maybe provision a whole bunch of new users or delete a bunch of channels, and these unexpected feedback loops would occur where all the clients reacted and sent load to the backend that overwhelmed our servers.

We knew we needed to do something about this, and the observation at the time was that workspace scoping was really useful for keeping things nice and tidy when you had lots of small organizations using Slack, because the load got spread out, but as we got bigger and bigger single organizations it became a problem, because we were funneling all of the activity from all of those users and channels down to one database shard. What we wanted was to change our sharding approach so that we could shard by much more fine-grained objects, like users or channels, and spread the load more evenly across our big database fleet.

In order to do that, we introduced a new data tier and a new technology to our system called Vitess. Vitess is rolled out as a second, fully featured data path for the application to use. The way it works is that it runs MySQL at the core; it is a sharding and topology management solution that sits on top of MySQL. In Vitess, the application connects to a routing tier called VTGate. VTGate knows the set of tables in the configuration, it knows which of the backend servers are hosting those tables, and it knows by which column a given table is sharded. That allowed us to configure the system so that user-scoped tables are sharded by user ID, channel-scoped tables by channel ID, and workspace-scoped tables by workspace ID, and all of that knowledge is no longer in our PHP code; it has been moved out of the application. From the PHP code's standpoint, it queries VTGate using the MySQL protocol as if there were one giant database containing all the tables, and VTGate is responsible for doing all of the routing. Where necessary, it can scatter and gather requests across multiple shards if a given query doesn't route to just one shard. That allowed us to efficiently and effectively roll out the policy we wanted, where tables are spread across multiple servers based on the particulars of each table, without needing to do a bunch of application rework.

Along the way this has also improved our consistency story and our topology management: that active-active master-master model no longer applies. In Vitess we have a single regular master, and we rely on Orchestrator, an open source project on GitHub, to manage failover. I meant to mention at the start that Vitess was originally built at YouTube as their approach to sharding MySQL and scaling that backend, and we've taken that open source project and adapted it for Slack's use. With Vitess in place we've been able to really effectively scale the data tier, and we get the best of both worlds: we're still using MySQL, which we have a lot of know-how in and a lot of confidence in, but we've outsourced the topology management and the sharding decisions about how data is allocated to Vitess, which enabled us to make these transitions and thereby eliminate a lot of the hotspots we had in our databases.
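As a sketch of what that looks like from the application's point of view: Vitess's VTGate speaks the MySQL wire protocol, so any MySQL client library works. The hostname, port, credentials, schema, and queries below are illustrative assumptions, not Slack's configuration.

```python
# Illustrative: the application talks to VTGate as if it were one big MySQL database.
# VTGate routes each query to the right shard based on the table's sharding column
# (e.g. user_id for user-scoped tables, channel_id for channel-scoped tables).
import pymysql  # any MySQL-protocol client library will do

conn = pymysql.connect(host="vtgate.example.internal", port=3306,
                       user="app", password="...", database="slack")

with conn.cursor() as cur:
    # Routed to a single shard: the WHERE clause contains the sharding key.
    cur.execute("SELECT prefs FROM users WHERE user_id = %s", ("U123",))

    # Scatter/gather: no sharding key, so VTGate fans out to shards and merges.
    cur.execute("SELECT COUNT(*) FROM channels WHERE archived = 0")
```

The design point is that the application keeps issuing ordinary SQL while the shard placement, routing, and failover logic live entirely in the Vitess layer.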
One caveat I want to mention is that the migration from our existing system to Vitess has been a little slower than we had hoped. In many ways this is somewhat fundamental, because we're changing not only how the data is stored but also the particulars of how the application wants to use that data, and we've run into the need to go very slowly and carefully to keep the site healthy. So although we are underway, and Vitess has proven itself a very solid database platform to build upon, I think many of us had hoped that this channel-sharding project would have gone faster. We're making good progress and I have high confidence it will succeed; it's actually the project I've spent the most time leading at Slack, so I have a lot of personal stake in it, and it has been successful, like I said, in eliminating a lot of those hotspots that we had before.

OK, so I promised that this talk was not about a transition to microservices. I only slightly lied: there is one part of our stack that needed that kind of change, and that is the real-time messaging service. This was motivated by the need to handle shared channels. The original message server architecture was fundamentally built around the idea that a single message server had all of the data for a given workspace, and so it managed all of the pub/sub interactions for every channel and all of the metadata updates for every user. That just isn't going to fly in a model where a channel no longer belongs solely to one workspace, so we needed to do something about it. We considered a lot of options, such as replicating the shared-channel data between the cooperating message servers, but the team came together and decided that this was actually a case in which decomposing the monolith into a bunch of services did make sense.

So in this world we've refactored the message server system into five main cooperating services. I feel a little bit like I can't call these microservices, but there are five and they each serve a separate purpose. This is how they fit into the architecture: users connect their WebSocket to a process we now call the gateway server, which replaces that message proxy box I had up there before. The gateway server manages subscriptions to a number of channel servers, which are responsible for the core publish/subscribe system. An administrative service tier is responsible for maintaining the cluster topology, and there's a separate presence server tier that's really just responsible for distributing presence status. Each of these services has its own bespoke role, but it turns out we had to keep the legacy message server tier around. There's a feature whereby we can schedule messages for broadcast in the future, used by some of our integrations, that turned out not to have a great home in our service decomposition, so we kept it in the old tier; we're running many, many fewer of those servers and they're much less critical to the operation of the site.

With this refactoring we were able to change the fundamental model of the message server system by decomposing things not by workspace but by channel. The overall pub/sub system is now very generic: everything is a channel in the system, and that's not just the channels that contain messages in Slack. Every user has their own channel ID, so if you update your profile, that gets fanned out according to a subscription for your user ID, and every workspace has profile information with its own fan-out for cross-workspace information. The service has become much more generic and therefore much simpler to reason about, and clients subscribe to the objects they're interested in using this more generic abstraction. Fundamentally, this let us adapt our real-time messaging system to accommodate these shared-channel use cases: just like before, we subscribe to updates for all the relevant objects, but we've decomposed the service into clear responsibilities. Notwithstanding my comments earlier, this actually was a microservice refactor in the service of computer science, not developer philosophy. It's the same development team that works on all of these services; the goal was refactoring to enable the architecture to better meet the needs of the site, as opposed to anything around independent development.
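A small sketch of that "everything is a channel" subscription model is below. The gateway/channel-server split and the topic naming scheme here are illustrative assumptions, not the actual protocol.

```python
# Illustrative: in the decomposed system, a client's gateway subscription is just a
# set of generic topic IDs: message channels, user channels, workspace channels.

def build_subscriptions(my_channels, visible_users, my_workspaces):
    """Everything the client cares about becomes a pub/sub topic ID."""
    topics = set()
    topics.update(my_channels)                               # e.g. "C42" message channels
    topics.update(f"user:{u}" for u in visible_users)        # profile/preference updates
    topics.update(f"workspace:{w}" for w in my_workspaces)   # org-wide metadata fan-out
    return topics


subs = build_subscriptions({"C42", "C99"}, {"U123", "U456"}, {"T012AB"})
print(sorted(subs))

# The gateway server then registers these topics with whichever channel servers own
# them, regardless of which workspace each channel belongs to, which is what makes
# cross-workspace shared channels possible in this model.
```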
Now, one thing that emerged from this is that a given user no longer depends on just one message server being available. As a user I might be subscribed to tens or hundreds of message channels and hundreds more user channels, so one consequence was that failures of the service became much more widespread, and we've had to design some additional systems to ring-fence some of the cascading failures, because each user is now connected to many more of the backend servers. We lost a bit of the failure isolation that we got very naturally in the workspace-sharded model; I'll come back to that when we get to the themes. Overall, though, these three fundamental changes to the Slack architecture have really helped us scale to the needs of the site right now, both in terms of raw scale and in terms of the user requirements on the product.

OK, so looking across this work, there are a few things worth calling out as cross-cutting themes from these changes. The first is that we've really moved away from that herd of pets. In all of these new systems we no longer have hand-tended servers that are individually replaced by humans when they fail; instead, the services that make up this overall architecture self-register with service discovery, and other services are responsible for discovering them. That has left our operations teams with much less manual toil. On the other hand, we now have a much harder dependency on our service discovery system (we use Consul), and we've had some challenges there that we're working through, but really it's been a big change in how we think about the dependency tree for the site, with a much bigger reliance on topology management and service discovery.

Holly Hollander's talk yesterday really touched on this third bullet; if you didn't see it, I encourage you to catch the video. As we introduced these new architectural components and the engineering team grew, it was a big contributing factor in how we had to change the concept of service ownership. Like I said before, those five boxes were really easy to think about, so anybody could jump in and diagnose the system. With all of this refactoring and these more complicated services, we need much more specialized knowledge to isolate, debug, and diagnose problems, and so the engineering culture and the concept of service ownership really followed the architectural design. It's almost the inverse of Conway's law: we did this thing, and then we needed to adapt our organizational design to match the architecture.

I mentioned before how fine-grained sharding really did help eliminate a lot of the load-related hotspots, but we did start to run into challenges with operations that no longer revolve around a single shard. I alluded to this with the subscriptions a given user has for all of the channels and users in their organization, but it applies to the database tier as well. There are cases where you really do need information about a whole bunch of channels at once, for example to render the channel sidebar. That's a surprisingly complicated thing to do: you have to figure out all the channels you're in, the last message you read in each channel, and the last message that was written in each channel, and there's no great way to shard that so it's both scoped to you (what am I subscribed to?) and scoped to the channel (what was last written there?). So we've had to adapt the service to accommodate the fact that part of logging in might be fetching a whole bunch of data from a bunch of different shards.

One of the patterns we've adopted for that is relaxing some of the consistency requirements and adapting the clients to accept partial results. When the client goes to fetch that information, we can bound how much time we're willing to wait for each shard; if a shard exceeds that threshold, we return whatever information we can to the client and trust that it will call back again later to fill in the gaps. This habit of being prepared for partial unavailability has really helped us accommodate the scatter/gather use case without making every client and every action depend on every backend server, which would be pretty bad for resiliency and also for performance, because you're always as slow as your slowest shard: if I have to talk to a hundred of them, all of my operations are as slow as the slowest of the hundred.
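Here is a minimal sketch of that pattern with a hypothetical per-shard fetch function: bound the time spent waiting on each shard, return what arrived, and let the client retry for the rest later. It is an illustration of the idea, not Slack's code.

```python
# Sketch of scatter/gather with a deadline and partial results (illustrative only).
from concurrent.futures import ThreadPoolExecutor, wait


def fetch_sidebar_rows(shards, fetch_from_shard, timeout_s=0.25):
    """Query every shard in parallel; return what finished in time, note the rest."""
    pool = ThreadPoolExecutor(max_workers=len(shards))
    futures = {pool.submit(fetch_from_shard, shard): shard for shard in shards}
    done, not_done = wait(futures, timeout=timeout_s)
    pool.shutdown(wait=False)  # don't block the response on the slow shards
    results = [f.result() for f in done if f.exception() is None]
    missing = [futures[f] for f in not_done]  # client calls back later for these
    return results, missing


# Usage (with a hypothetical query function):
# rows, missing_shards = fetch_sidebar_rows(["shard-1", "shard-2"], query_one_shard)
```

The trade-off is deliberate: the client may briefly render an incomplete sidebar, but no single slow shard can hold up the whole login.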
One thing that hadn't really struck me before making this slide is just how few boxes I actually removed from the architecture diagram I showed you earlier. With only one exception, everything that was on there in 2016 is still part of our production stack today. There are a bunch of reasons for this. Legacy clients are one: some clients still don't connect through the Flannel path, because they're either a third-party integration or some old Windows Phone sitting out there somewhere that we still support (yes, we do support that phone). And, as I mentioned before, the database migrations take time. We have terabytes and terabytes of data, and lots of hiccups along the way in making sure the migration can be done safely when you're trying to both re-shard the data and change the data model. In many cases we've also suffered from a bit of second-system effect: we have a table that works one way, we have a long tail of things we've always wanted to do with that table, and wouldn't it be nice to couple those new features with the data migration we have to do right now anyway? That has caused a little tension in how aggressively we move data off those legacy shards into the new system.

The other thing I think is really important to point out is that while these grand architectural refactorings were really useful, and essential for us to be able to manage the growth of the site, there have also been hundreds of little, super-critical, not-so-glamorous performance and capacity changes in the backend systems as various hotspots came through. So yes, we needed these three refactors, but things like adding additional caching or removing expensive database queries were hugely important in getting from the parade of "oh my god, things are terrible" to "oh, great".

One pattern that has been really important to us is adding jitter to client operations to avoid some of these thundering herds. Some of us have described the Slack client as a nicely organized command-and-control botnet: it's sitting out there, a very efficient pub/sub system, just waiting to take commands from the backend, and sometimes those commands cause it to turn around and DoS our own systems. So one of the things we've needed to do is look at each operation and ask: if a hundred thousand or a million clients did this at the same time, what would it actually do? And then spread that work out over time. Simple jitter, delays, or deferring actions that don't have to happen immediately, that kind of cooperation between the client side and the server side, has for us at least proven really essential to keeping the site healthy while we've been making these bigger, more sweeping changes to the overall architecture.
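A tiny sketch of the jitter idea follows; the window size and function names are made up for illustration. Instead of every client acting the instant a command arrives, each one picks its own delay inside a window.

```python
# Illustrative client-side jitter: spread a fleet-wide action over a time window
# so a million clients don't all hit the backend in the same second.
import random
import time


def run_with_jitter(action, max_delay_s=300):
    """Delay a non-urgent action by a random amount before executing it."""
    time.sleep(random.uniform(0, max_delay_s))  # each client picks its own delay
    return action()


# e.g. run_with_jitter(refresh_workspace_metadata) after receiving a bulk update push,
# where refresh_workspace_metadata is a hypothetical client routine.
```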
So in general, this is the picture of where we are now, the nice happy picture. It's more complicated for sure, but many of these changes, in fact all of them, were super important for us to be able to both meet the big-scale demands of our customers and accommodate these new shared-channel use cases that cut across workspaces.

What are we not doing yet? I thought I'd close by talking about some things we haven't done, some of which maybe future Slack speakers will talk about. I felt like I had to put up this slide: we still have our monolith. We suffer from some of the challenges of a large engineering organization all working on that one body of code, and there are efforts and thinking under way about how we might want to decompose it, along the lines that the great folks who preceded me in this room talked about. At some point we will be on multiple storage backends, and we'll need to figure that out. The asynchronous job queue I mentioned still has some really good room for improvement, again continuing the scale and resiliency work on the product. And then this whole vector of eventual consistency is going to be super important to the future scaling of Slack: the more we can adapt the clients so that they still have that real-time feel for the things that really need to be real-time, but are able to tolerate some latency or some lag from the backend, the better, and I think that's going to be really important for us going forward.

I hope this was interesting. I think I've left more time for questions than I thought I would in practice, so let's do that. [Applause]
Info
Channel: InfoQ
Views: 25,120
Rating: 4.9287534 out of 5
Keywords: Software Architecture, Slack, Case Study, Use Cases, Performance, Agile, Scalability, Project Management, InfoQ, QCon, QCon San Francisco, Transcripts, Vitess
Id: _M-oHxknfnI
Length: 37min 53sec (2273 seconds)
Published: Tue Apr 30 2019