How We've Scaled Dropbox

Captions
Stanford University. Welcome to EE 380, Winter 2011-2012. I'm Andy Freeman; the other course organizer is Dennis Allison. We're approaching the end of the quarter, so if you're taking the class for credit, please be caught up, and remember, no incompletes. We've talked a lot about large systems and scalable systems, but we haven't talked much, if at all, about rapidly growing systems. Rapid growth is the goal of most startups, but it can be extremely hard, both financially and technically. Today's speaker, Kevin Modzelewski of Dropbox, had the good fortune to have extremely rapid demand growth, and today's talk is about how Dropbox dealt with the technical challenges with very minimal resources. Thanks.

Hi everyone, my name is Kevin Modzelewski and I'm the server team lead at Dropbox. "Server team" is a bit of a historical name, as I'll explain later, but we're responsible for the architecture and evolution of the Dropbox back end, which is what I'm here to talk to you about today. The rough structure of the talk: first, an introduction about what this talk is; then some background about Dropbox, what it is, and what kinds of technical challenges we face, which will give some insight into the third part, examples of things we've had to scale over time. I'll go into a fair amount of detail so you can see both what we did and the other things we considered doing, and why we didn't choose those. Then a short wrap-up at the end. If you have questions, feel free to bring them up at any time.

So, jumping right into it: what is this talk? As Andy mentioned, there's a lot of information out there about what big systems look like, what the Googles and Facebooks of the world have at this point. But that doesn't help you a whole lot when you're starting off by yourself, with maybe one other person, and you have nothing and have to get from there to having a lot. If you wanted to build Dropbox now with just two people, one option in theory is to take Google's infrastructure and build on that, but there's only one company in the world that has that option, and that's Google. So what do you do if you're not them? How do you get there? There are a lot of things I could talk about that fall into this category, about what it's like to work at a startup and what you have to worry about; in particular I'm going to talk about the technical ones related to back-end engineering. This is a talk about what it's like to work on a fast-changing back end in a very quickly growing environment, where your resources are growing at the same time as the demands and you can't necessarily start with the final solution. I think this should be interesting; it's the talk I wish I had gotten while still in school. You learn how to build BigTable, you learn how to build GFS, and then you go out and realize it's just you, and you don't have five man-years to invest like they did in one of those projects. If you actually want to start a startup, I hope this gives you a sense of how you might actually go and do that and what it might look like technically. There's a lot more you can get from classes, and I'm sure they have them here at Stanford, about how to actually do startups, but this part, the
technical back-end aspect, is one I think doesn't get covered that much.

First, a little background about Dropbox. By show of hands, how many people here use Dropbox? That's most people. If you don't use Dropbox, that's okay; welcome to Silicon Valley, you will, I assume. So what is Dropbox? Our goal is to make it really simple for you to get your files, your data, anywhere you want them, anytime you want them. The way we do that right now is with our main sync product, a client that runs on your desktop or laptop and uploads changes as you make them to the files in your Dropbox. At our scale there are tens of millions of people using this, syncing hundreds of millions of files a day.

I'm going into this because there are some very interesting implications for what we have to do on the back end to support something like this. We have to make very different back-end choices compared to companies such as Facebook, not to pick on them, just that we offer very different services with very different requirements. There's a lot involved in this: how do you write a small client that doesn't take up too many resources, how do you deal with the low bandwidth to most people's homes, how do you build a mobile app that runs in an even more resource-constrained environment. But again, this is a talk mostly about the back-end challenges, and there are two in particular that I think are most interesting for the back-end architecture.

The first is the write volume. Most applications, especially web applications, have an extremely high read-to-write ratio, just because people consume more content than they produce. Twitter is something like 100-to-1 or 1,000-to-1, I forget which, in tweets read versus tweets written. What's interesting about the way we've built Dropbox is that everyone's computer has a complete copy of their entire Dropbox, which means we basically have a multi-petabyte cache sitting in front of our service. Normal rules of thumb about cacheability get thrown out the window when you're measuring your cache in petabytes. It turns out our read-to-write ratio is roughly 1 to 1, when you measure it in terms of the two main client endpoints, uploading files versus downloading files. Another way of thinking about it: for the same number of servers, we're doing maybe ten to a hundred times as many writes as other companies. That has very interesting implications, because many best practices and standard solutions are designed for a different order of magnitude of writes.

The other interesting thing is that in our service, we can't be wrong. We don't have as much leeway to play with people's expectations as certain other services do. Elsewhere it might be fine to see one person's comment before another and then later see them reordered; no one's going to say your service is broken if that happens. They won't be happy, but they won't say it's broken. But there are lots of horror scenarios you can imagine for Dropbox. Say you delete a file from a folder because you don't want to share it, but you want to share the rest of the folder; then you share that folder, you check, and the file is still in the folder and everyone you just shared with can see it.
That would fundamentally break what people thought Dropbox was doing, and we just can't be wrong in that kind of scenario. In technical terms from the database world, these are the ACID properties: atomicity, consistency, isolation, and durability, and we have to be very careful about how much we trade off any of them. Atomicity: people don't want to put in a large home video, have us say we synced it, and then only get half of it on the other side. Consistency, as I mentioned, we can't really trade off much either, and it's very common that you're updating files in the same Dropbox from multiple computers at the same time. Isolation we're allowed to trade off a little more; we have to, so that you can do offline operation, since offline operation is somewhat opposed to isolation. But durability is something we absolutely cannot trade off. So as a whole we have much higher requirements on correctness than many other services out there, and the combination of these two things, very high consistency and correctness requirements together with very high write throughput, is one of, if not the, hard problems in distributed systems these days. And this is not just something we're building internally for development; it's core to our service, it's what we're providing to other people, and those are expectations we just can't play around with.

So that's some background about Dropbox, what our setup is and what our requirements are. Any questions about that before I go on?

Let me go into some specific examples of things that have evolved over time. The first is the high-level architecture of our back end: what services we have, how many of them, and how they're connected. Let's say it's 2007 and you're starting a startup with someone else, and you want to build a file-syncing startup. What would you make your initial architecture look like? Would you go build GFS and put it on that, or would you do something simpler and faster? (An audience member suggested mapping everything onto absolute directories and partitioning like crazy.) Along those lines, I think what Drew and Arash came up with was actually one of the most elegant architectures I've ever seen: it was just a single server, and that was it. You can't really make it any simpler than that. This is also why it's called the server team, because it used to be "the server," as opposed to everything else. (Asked how many users there were: I don't know the exact figures, but it rounds approximately to zero.) So this was mid-2007, when Drew and Arash started the company, and this one server was doing everything. It was running our application servers; it was running some web server in front of them, I don't even know which one, serving static content; it was running MySQL; and it was storing all the data anyone put on Dropbox on its local disks. It's surprising, but that's how it started. It's not that they didn't know how to build better things; they're both MIT educated, they've read all the interesting papers, and they knew what was better out there.
But this was a conscious choice: you start with what's most important, and that wasn't building out a complicated back-end infrastructure. They needed to prove to themselves and everyone else that this was the right thing to quit their jobs and drop out of school for. So these are the humble beginnings of Dropbox.

Say you've gotten to this point. What would you do from here? You're two guys working, they like to say, in their boxers, coding away, and you want to get this company off the ground. What are the most important things to be improving about this picture? The next slide also says approximately zero users, because I don't know how many there were, but in 2007 all they needed was users. So once there are users, what's going to start breaking, or what's the most important and best use of your time? Because this is how you have to think when you're in this kind of environment. (Audience suggestions: reliability was probably not great; put the data on another box; put some web servers up; bandwidth.) Those are all good answers, and they're all things that happened. The things that happened first were that the server ran out of disk space, so the data had to go somewhere else, and the server became overloaded, so something had to move off of it. They chose to put all the file data on Amazon S3, and they moved the MySQL instance to a different box so the two could run on separate hardware. Just for reference, the bottom part of the diagram is the clients, all the people's computers running Dropbox; this side is our own machines, which used to be managed hosting and is now self-hosted; and the right side is EC2 and AWS, which at this point is only S3.

These were two somewhat controversial choices, not so much at the time, but now they're viewed as a little more controversial, what with the whole MySQL-versus-NoSQL debate, or whatever you want to call it. (The database is MySQL, by the way.) Again, it's not that they didn't know they could have written a custom database if they wanted to; they could have written their own key-value store, run it on their own hardware, and started speccing out new machines optimized for our use cases. But there were still only three people at this point, and it wasn't at all clear where this thing was going to go. I personally believe the choices of MySQL and S3 were extremely valuable, especially in the early stages.

Now it's getting a little harder: can you figure out the next things that are going to break? (Audience: stop sending the files through the server to S3.) That's actually one of the two things that had to be done next. The capacity on the server eventually ran out, and it got to the point that downloading files would push people out of being able to access the website.
So they wanted to separate the downloading and uploading functionality from the website and syncing functionality, so the two wouldn't interfere with each other; it was also one of the easier ways of splitting the work onto multiple servers. The other thing that was fun at this time: you can see that the arrow only goes in one direction, the clients only hit the server. So when new changes happen, the client has to hit the server again; that's polling. Clients sit in a loop and poll the server every now and then, which is usually a bad thing. You can play around with it, increasing the polling interval to reduce load on the servers or decreasing it to make things seem more responsive, but you're still playing that game and you only get one of the two. So one of the next things that was done was to add a new service called the notification servers, or "not servers," which actually push notifications down to the clients. And the server was split into two web servers, one running in managed hosting and one running in AWS. The one in AWS hosts all the file contents and accepts all the uploads, and the one in managed hosting does all the metadata calls. They were called metaserver and blockserver, because our file data API is based around file blocks. This was early 2008, with roughly 50,000 users; Dropbox was in private beta.

I won't try to ask what happens next, because it's too hard to know exactly what will. That's part of the point of this talk: it's pretty easy to screw yourself by over-building, because you don't even know which things are going to fail. It's hard to look at a picture like this and know that the particular problems I'm about to mention are the ones you're going to run into. Maybe some of them are, but you can't know exactly what's going to happen.

The three things that had to change were these. First, obviously, we still have only one each of the metaservers and blockservers, so we needed to add more of those, which is a fairly standard operation. The second is a little more involved. The blockservers were doing database queries directly against the database, because we had taken the routes that were in the metaservers, moved them onto the blockservers, and set them up to do database calls from AWS, which is in Virginia, to our managed hosting cluster, which was in Texas. You can imagine that doing a whole bunch of round-trip calls over that distance eventually becomes a latency bottleneck. There are a bunch of things you can do instead of repeated round trips: make your MySQL usage a lot more sophisticated, use stored procedures, write really complicated queries that embed control-flow logic, add more caching, build a really complicated caching infrastructure. But those are all a lot of work, so what was decided on was having the blockservers do RPCs to the metaservers, because the metaserver can encapsulate all the logic of all the database calls it needs to do.
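To make the latency argument concrete, here is a minimal, purely illustrative sketch of the round-trip arithmetic. The function names, the round-trip time, and the query count are assumptions for illustration, not Dropbox's real APIs or numbers.

# Sketch: N cross-country SQL round trips vs. one RPC that runs the queries locally.
CROSS_DC_RTT_MS = 40.0    # assumed Virginia <-> Texas round trip
QUERIES_PER_REQUEST = 6   # assumed number of metadata queries one request needs
LOCAL_QUERY_MS = 1.0      # assumed cost of a query inside the datacenter

def latency_direct_sql_ms() -> float:
    """Blockserver in AWS issues every query across the country."""
    return QUERIES_PER_REQUEST * CROSS_DC_RTT_MS

def latency_via_metaserver_rpc_ms() -> float:
    """Blockserver makes one cross-country RPC; the metaserver runs the queries locally."""
    return CROSS_DC_RTT_MS + QUERIES_PER_REQUEST * LOCAL_QUERY_MS

if __name__ == "__main__":
    print(f"direct SQL from AWS:   {latency_direct_sql_ms():.0f} ms")       # 240 ms
    print(f"one RPC to metaserver: {latency_via_metaserver_rpc_ms():.0f} ms")  # 46 ms

Under these assumed numbers, batching the queries behind one RPC cuts the cross-country cost from N round trips to one, which is the whole point of the change.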
That was definitely the easiest way to handle it, though maybe not the most sophisticated. The third issue is that we still had only one database, and again there are standard, "right" ways to deal with that: you can shard it, you can partition it. But it turned out to be just so much easier to add memcache and cache everything, or not everything, but start caching the easy things, and that let us avoid dealing with these really complicated database scaling issues for a while. Doing those three things, we ended up with roughly this architecture at launch: a bunch of metaservers and blockservers, a load balancer in front of the metaservers, a memcache tier, and the blockservers now do RPCs through the load balancers.

After this point the base architecture has been pretty stable. The problem now is that there are still a bunch of things here that are a single box, and we need to turn them into stacks like the others. Our architecture today is basically the same, but with those things filled out. There's a lot of other stuff going on too, batch-job machines and so on, but our fundamental architecture for providing sync hasn't changed since that time. Making all of these into stacks is actually relatively difficult. It looks easy on the slide, you just add more, but in practice every one of them was hard. Even memcache, which is designed so you can just add more servers: because we have these really high consistency requirements, we had to modify the memcache library we use. Most memcache libraries, when the server they're trying to hit is unavailable, just move on to the next one. That's great for availability, because if one memcache server dies you just start using another, but it's really bad for consistency, because one web server might think a memcache server is down while another thinks it's up, and if you have any sort of complicated memcache protocols going on, your servers end up talking past each other and you can get cache inconsistencies. So we had to modify the memcache library for that.

The load balancing tier is also supposed to be easy, but it's actually tough for us because we use Python. It seems like those two things are unrelated, using Python and having difficulty scaling your load balancing tier, but Python has this feature called the global interpreter lock, which means, to a first approximation, that you can only run one Python thread at a time. You can have multiple threads, but only one will be scheduled at a time, mostly; if one is doing I/O another can come in, but you can't really get true parallelism. What this means is that for each web server we want there to be exactly one request at a time. Adding a second request makes each request proceed at only about 60% of the speed of a single request, and that roughly 20% improvement in throughput is just not worth it to us. So we want our load balancing tier to respect this.
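To make the GIL point concrete, here is a small, self-contained sketch you could run under CPython: two CPU-bound threads finish in roughly the same wall-clock time as running the same work back to back, which is why packing more than one CPU-bound request into a Python worker buys so little. The loop size is arbitrary.

# Sketch: the CPython GIL keeps CPU-bound threads from running in parallel.
import threading
import time

def spin(n: int = 20_000_000) -> None:
    # Pure-Python CPU-bound loop; it holds the GIL except at periodic check intervals.
    while n:
        n -= 1

def timed(fn) -> float:
    start = time.perf_counter()
    fn()
    return time.perf_counter() - start

def sequential() -> None:
    spin()
    spin()

def two_threads() -> None:
    threads = [threading.Thread(target=spin) for _ in range(2)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()

if __name__ == "__main__":
    print(f"sequential:  {timed(sequential):.2f}s")
    print(f"two threads: {timed(two_threads):.2f}s  # roughly the same, not ~2x faster")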
When you have one load balancer it's easy: it has all the state, and you tell it one connection per web server. But when you have multiple load balancers, you can't really tell them to max out at one connection across the entire load balancing fleet. I guess you could, but there's no load balancing software out there that actually does this. So we had the option of building our own load balancing software and adding that feature, or playing some sort of complicated game, or just accepting that when a load balancer died we lost a whole bunch of capacity, which is what we did for a little while. Where we ended up is that every load balancer now has a hot backup: if the primary dies, it switches over to the backup within a few seconds, using some network-level tricks. This does mean we have twice as many load balancers as we use, but in terms of the work involved it got us a highly available load balancing cluster much more easily than, for instance, writing our own software.

Scaling the database tier is probably one of the more standard of the difficult things to do; there's a lot of discussion out there about how to build a distributed database, how to scale things, how to shard things. But what was interesting for us is that we didn't start with no code and design a sharding scheme in a vacuum. People were actually using the database tier and building the assumption that there is exactly one database into the code. Sometimes it's obvious how they did that: if they put in joins, they're clearly assuming the two tables are on the same machine and joinable; foreign key constraints, same deal. But there are certain assumptions about MySQL usage that are not at all obvious, for instance that a single transaction is a single transaction. Once you move to a sharded environment your transaction model changes, and you can't just look at a single piece of code and tell whether a query is assuming it runs in a single transaction with all the other queries it expects to run with. This is a case where it's remarkably difficult; it's much easier to build this from scratch than to evolve your current system into it, and we had to do a lot of work to hunt down all the places where people had baked in these assumptions.

The not server tier was also interesting, not so much in terms of having to evolve it, but because that system is so high-throughput. There are tens of millions of clients connected to the not servers at any point in time, because you can't just send a message to anyone on the internet, due to firewalls and the like; you have to let them connect to you, and then you can send messages down that connection. So we have tens of millions of connections open to these not servers, and we're sending, I forget the exact figure, a lot of notifications at the same time. We actually had to add a two-level hierarchy for distributing all of this, first to the not servers, which then distribute to the clients, because it was just too expensive to notify a hundred not server processes individually that they had to notify their clients.
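Here is a hypothetical sketch of that two-level fan-out. The class names, the registration scheme, and the callback-based "connections" are all invented for illustration and are not Dropbox's actual notification code; the real not servers hold millions of raw client connections.

# Sketch: a root dispatcher fans a change out only to the not servers that
# hold listeners for that namespace; each not server pings its own clients.
from collections import defaultdict

class NotServer:
    """Second level: holds open client connections and pushes pings down them."""
    def __init__(self) -> None:
        self.listeners = defaultdict(list)  # ns_id -> [client callbacks]

    def register(self, ns_id: int, client_callback) -> None:
        self.listeners[ns_id].append(client_callback)

    def notify(self, ns_id: int) -> None:
        for callback in self.listeners[ns_id]:
            callback(ns_id)

class NotificationRoot:
    """First level: tracks which not servers care about which namespaces."""
    def __init__(self, servers: list) -> None:
        self.servers = servers
        self.interested = defaultdict(set)  # ns_id -> indices of servers with listeners

    def register_client(self, ns_id: int, server_index: int, client_callback) -> None:
        self.servers[server_index].register(ns_id, client_callback)
        self.interested[ns_id].add(server_index)

    def publish_change(self, ns_id: int) -> None:
        # Only the handful of interested not servers are contacted, instead of
        # broadcasting every change to every not server process.
        for index in self.interested[ns_id]:
            self.servers[index].notify(ns_id)

if __name__ == "__main__":
    servers = [NotServer() for _ in range(4)]
    root = NotificationRoot(servers)
    root.register_client(42, server_index=1,
                         client_callback=lambda ns: print(f"ping for namespace {ns}"))
    root.publish_change(42)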
So that's where the high-level architecture mostly stands today. Are there any questions before I move on?

Question: it seems like you have a huge advantage in that my files in general have nothing to do with your files, which suggests a very natural way to shard all of this; is there something non-obvious going on behind the scenes that makes that not the case? In terms of the actual file data, we do block-level deduplication, so if you and I upload the same file, the storage tier of the back end knows they're the same and doesn't store more than one copy. And a lot of things are fairly easily shardable, but some things are not. Shared folders make it very hard, because shared folders cut across users, and it's something we're actively trying to decide how to shard. For instance, the relationship table between users and the shared folders they're in gets queried in both directions: for a user you want their shared folders, and for a shared folder you want all its users. Both of those are queried a lot and always have to be exactly right; for various technical reasons there's no room for a wrong answer there. So that's currently not sharded, and we'll have to invest a fair amount of time, once we do shard it, to get it exactly correct.

Question: how does the block-level deduplication work; are the blocks small? We take a file and divide it up into four-megabyte chunks, and each chunk is a block in terms of deduplication. We take the hash of the chunk, and if two hashes are the same they get mapped to the same object in S3. I think the hash is SHA-256. Asked why four-megabyte chunks: it's completely arbitrary; it's worked pretty well so far, and I don't know if we have the option to change it at this point.

Question: how much deduplication do you see? I think we have measured this. You need some reference to compare against to measure how much you're saving, and depending on where you set that reference, for instance only deduplicating within a single account, I believe it's a double-digit percentage, but I'm not a hundred percent sure.

Question: do you deduplicate on sub-blocks on upload, sending just deltas to the blockserver, or do you ship the whole thing? I believe the client is smart about this: if it believes a file is only a small modification away from the old file, it will use an rsync-style diff and upload that to the server. I'm not fully aware of all the details, though.

Question: how often do the clients poll the servers? It used to be, I think, once a minute, but now that we have the not servers, and now that we have tens of millions of clients, polling would just crush us. Which is actually funny, because we didn't always have good backoff, so when the site went down the clients would effectively DoS us when it came back; that's improved a lot. But now they don't poll at all; they just connect to the not servers and keep that connection open all the time, and when we have a notification for them, it gets sent down.
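A minimal sketch of the block-level deduplication scheme described in the Q&A above, assuming 4 MB blocks and SHA-256 as recalled in the talk: split a file into blocks, hash each block, and store a block only if its hash hasn't been seen before. The in-memory dict stands in for the real block store (S3); everything else is illustrative.

# Sketch: content-addressed 4 MB blocks deduplicated by hash.
import hashlib
import io
from typing import BinaryIO

BLOCK_SIZE = 4 * 1024 * 1024  # 4 MB blocks

block_store: dict = {}  # hash -> block contents (stand-in for S3)

def split_into_blocks(f: BinaryIO):
    while True:
        block = f.read(BLOCK_SIZE)
        if not block:
            return
        yield block

def upload_file(f: BinaryIO) -> list:
    """Store any blocks we have not seen before; return the file's block list."""
    block_hashes = []
    for block in split_into_blocks(f):
        digest = hashlib.sha256(block).hexdigest()
        if digest not in block_store:   # deduplication happens here
            block_store[digest] = block
        block_hashes.append(digest)
    return block_hashes

if __name__ == "__main__":
    a = upload_file(io.BytesIO(b"same contents"))
    b = upload_file(io.BytesIO(b"same contents"))
    print(a == b, len(block_store))  # True 1 -> the second upload stored nothing new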
Question: what kind of connection counts are you getting on a not server these days? I believe we were running a single not server machine at a million connections per machine. We didn't actually hit a limit; they started failing because we hit a kernel bug, which I can talk about if I have time. So it's at least a million; we're not really sure where the limit is, and these things are not fun to push super hard, because even though a single machine can hold a million open connections, it can't open a million connections in any reasonable amount of time. Once they go down they're very hard to bring back up, so we don't want to push them too close to the limit.

Question: is your deduping on a per-user basis, or, say, for a media file that many users have, is it global? At the storage level it's global across the entire service. That doesn't mean that if you put a file in your Dropbox it will necessarily instantly upload just because it's already on Dropbox; there are other things involved in deciding that.

Question: some servers are on Amazon and some are in managed hosting; how do you decide what to put where? At this point it's basically anything that has to touch the actual file data that's on Amazon, and otherwise it lives on our own servers. It's great to co-locate our servers with the actual data when they need it, but otherwise it's more cost-efficient and easier to manage on our own hardware, because all our EC2 instances are on 24/7, so we're missing out on some of the best features of EC2 and it makes more sense to just run things ourselves. I think it's only blockservers in EC2, and every now and then we run analyses over certain subsets of data, which also run in EC2.

Question: any estimate of how much it costs to run on Amazon compared to doing it yourself? Probably more, I should say, but you can see that we're still on Amazon and haven't yet made the decision to move off of them.

Question: how many operations people do you have for your side of the line? Carrying pagers, there are I think six of us, and we also have a network guy, so I guess that makes seven.

Question: is your customer base worldwide or mostly the United States, and are you using Amazon's distributed cloud or is everything in Virginia? Everything is in Virginia. I don't know the exact percentage that's international, but I think the majority is; sixty-five percent international usage. So yes, we serve all the file data out of Virginia, and we serve all the metadata out of San Jose now. And I guess this is another point: we obviously know that if you want better performance you go international, and we'll figure that out, but one thing we've been able to get away with is that since the client behavior is all asynchronous and mostly invisible to the user, it's not hugely performance sensitive. Not that we just neglect performance, but we're not under the same specter of performance that a web-only site would be, where, I forget the exact numbers, there are stats that if you increase page load time by some percent your return rate goes down. So we get to be a little bit more relaxed about that.
Question: how do you handle replication of the data over its history? We just store it in Amazon, and Amazon takes care of the replication and all of that; we upload it once, and you'd have to ask them about what they do internally.

Question: did you get hammered by that recent Amazon problem with S3? At this scale you do see interesting things happening on Amazon's side. They're pretty competent over there, but it's interesting to watch from our side.

Question: do you have a feel for what fraction of S3's usage you are? I wish I knew; they've only publicly released one stat. I actually have a pretty good sense of some of it, but not in terms of public numbers.

Question: could you speak for a few minutes about your evolution of instrumentation; what did you do back in the two-guys-in-boxers days, and what do you have now? At that point, instrumentation and debugging and monitoring are pretty easy. There's a great tool that's already written, it's called top; you go to the server and you run top, and that got us surprisingly far. The service is pretty regular, and you build up a good intuition about what's going wrong when certain things happen. We basically went a long time without building out graphing and trending of metrics and things like that. We have all of that now, so it's much better, but it worked without it for quite a while.

Question: what metrics do you watch now? We watch all the servers' load; we watch how many requests are happening per second from all the different channels; and for the important requests we watch the breakdown of where the time goes. So if it takes 100 milliseconds to commit new files, that's, say, 40 milliseconds of CPU time on the web server, 30 milliseconds talking to the metadata server, and 27 milliseconds dealing with memcache, or something like that. You can see over time how that varies: if one of them spikes with a code push, we know that's something to look into, and if the site goes down and we see those numbers changing, it gives us a lot of insight into what's going wrong. We also track bandwidth as measured by users. There's a ton of this stuff.

Question: what are you doing for security and encryption, to make sure that, say, typing in a random password doesn't let someone read everyone's files, and what have you changed since the problem a while ago? I can't talk too much about any specific thing that has happened, but I can say that in general we take security and privacy very seriously and respond very aggressively whenever something does happen. There's not a whole lot I can go into right now, but we can maybe talk afterwards.
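A minimal sketch of the kind of per-request time breakdown described above, with phase names mirroring the example in the talk (web server CPU, metadata server, memcache). The timer class and the stand-in workloads are invented for illustration; they are not Dropbox's monitoring code.

# Sketch: wrap each phase of a request in a timer and report where the time went.
import time
from contextlib import contextmanager

class RequestTimer:
    def __init__(self) -> None:
        self.phases = {}  # phase name -> milliseconds

    @contextmanager
    def phase(self, name: str):
        start = time.perf_counter()
        try:
            yield
        finally:
            elapsed_ms = (time.perf_counter() - start) * 1000
            self.phases[name] = self.phases.get(name, 0.0) + elapsed_ms

    def report(self) -> str:
        total = sum(self.phases.values())
        parts = ", ".join(f"{name}: {ms:.1f} ms" for name, ms in self.phases.items())
        return f"total {total:.1f} ms ({parts})"

if __name__ == "__main__":
    timer = RequestTimer()
    with timer.phase("webserver_cpu"):
        sum(range(200_000))          # stand-in for application work
    with timer.phase("metadata_server"):
        time.sleep(0.03)             # stand-in for an RPC
    with timer.phase("memcache"):
        time.sleep(0.01)             # stand-in for cache traffic
    print(timer.report())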
Cool, moving on to the next example. This one dives a little deeper into one aspect of the system, the database tier, and in particular into how we store all the metadata about your Dropbox. The way we store the metadata for what's in your Dropbox is as a log of all the edits that have happened to it. Whenever your client notices changes, it uploads those changes to the metaservers, which record them in this log, called the server file journal. I believe there's also a client-side version of this, which is why it's called the server file journal.

This is an abridged schema of the server file journal, the original one we started with, or at least the earliest one I could find, including only the interesting fields. It has an id field, which is just the index in the log; the file name; something called case path, which I don't know what it is; latest, meaning is this the latest entry in the log for that file; and ns_id, which stands for namespace ID, where a namespace is either your Dropbox or a shared folder, so every namespace has its own log associated with it. The one interesting thing is that the primary key is the id. We're using MySQL, and in particular InnoDB, so on disk things are ordered by the primary key; that's effectively what the primary key means here. It's very fast to scan things in id order, and any other order is not as fast, even if you have an index on it; appending in id order is extremely fast, and appending in any other order is not. In this case everything was being appended in id order.

A bunch of things changed over time. One of the first, and I don't even know why this change was made, was getting rid of case path. I think originally it was there to deal with some case-sensitivity issues, and then the protocol, the interface between the client and the server, was changed; my guess is that the clients now take care of case-sensitivity issues rather than the server, or at least that the logic isn't in MySQL, so we're not storing that anymore. That's a case of our requirements changing over time, or just iterating on what we started with.

The next thing, and this is my guess since I wasn't there at the time, is that we didn't yet have the feature where you can click on a file and see all its past revisions. With this schema that's actually kind of expensive: you have to search the entire log of everyone's Dropboxes looking for the right ns_id and file name, and then you can list those entries as revisions. To make this faster, because it was a new feature we wanted to be efficient, we added a new field called prev_rev, which I believe points to the id of the previous entry for that file. So that was added because we added a new feature.

The next thing was that the performance of the system started to get pretty bad. It was all on one machine and this log was getting very big. It works fine with a small number of users, but after a while it doesn't make sense to mix everyone's updates together; it's not very efficient to find a single person's updates. So the primary key was changed so that things are sorted first by ns_id, so everything within a namespace is grouped together; then by latest, which means there are two sections of the log, one with previous entries and one with the current active state of your Dropbox; and then by id, to sort everything into essentially timestamp order, so you still have a log.
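A rough reconstruction, as DDL strings in a Python module, of the abridged server file journal schema and the changes described above. The column types and exact names are guesses pieced together from the talk, not Dropbox's actual definitions; the point is the primary-key evolution.

# Sketch: approximate server file journal schema and its early changes.
ORIGINAL_SCHEMA = """
CREATE TABLE server_file_journal (
    id        BIGINT,        -- index into the log
    ns_id     INT,           -- namespace: a user's Dropbox or a shared folder
    latest    TINYINT,       -- is this the latest entry for the file?
    filename  VARCHAR(260),  -- the file's name
    case_path VARCHAR(260),  -- dropped early on
    PRIMARY KEY (id)         -- InnoDB clusters rows on disk in primary-key order
) ENGINE=InnoDB;
"""

# Change 1: case_path was removed once case handling moved out of the database.
# Change 2: a prev_rev column, pointing at the previous journal id for the same
#           file, was added so per-file revision listings don't scan the whole log.
PREV_REV_COLUMN = "prev_rev BIGINT"

# Change 3: the primary key was regrouped so each namespace's log is contiguous
#           on disk, with active entries separated from superseded ones.
REVISED_PRIMARY_KEY = "PRIMARY KEY (ns_id, latest, id)"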
I think this was pretty good for a while, and at this point the functionality was pretty much set and the major performance work was done; I think this was roughly 2008, though I'm not a hundred percent sure. At that point it started to make sense to go over the schema very carefully and make some careful optimizations.

The next thing that was done was that file name was changed from a length-260 string to a length-255 string. It seems like a kind of random thing to do, and if you don't know a lot about MySQL it's not clear at all why you would, but it turns out MySQL stores varchars with a size of at most 255 more efficiently than larger ones, because with 255 you only need one byte for the length; above that, it uses more bytes to store the length. This was a pretty easy win once you knew you had to do it; it's just one of those cases where reading the manual, and taking time away from building features to do so, actually started to make sense. I think around the same time all the fields were declared NOT NULL, because that also saves another byte per field.

The next change was a little more subtle: getting rid of latest in the primary key. So instead of having two sections of the log, one with the active files and one with the inactive ones, they're all mixed together. The reason latest was originally in the primary key is that it makes reads more efficient: all the entries you're interested in are together and you don't have to skip over deleted entries to get to the ones you actually want. But it means that when you write new things to your Dropbox, you have to shuffle things around in your log a lot. So this change optimizes for writes at the expense of reads, and given that we do so many writes, that's a very good thing for us. People say, and you can take their word for it or not, that writes are harder to scale than reads, and because we have so many, this is an interesting trade-off for us to be making. At this point, that's mostly where the schema lies today. Are there any questions or comments about this?

Question: is there any kind of compaction or discarding of old data, say if I really wanted to delete something, or do the logs just grow? In normal usage they just grow; I'm not sure there are any times that we compact, and I don't know about anything beyond that.

Question: do you have to change the size of the id column? Now that ids are per namespace, we haven't had an issue. We do have more than four billion entries overall, so if it weren't per namespace we would have had to increase it, but it is, so it's not an issue. It did go from being unique to not unique at one point.

Question: do you have a testbed where you can measure these things and see whether a change actually makes a difference? It's actually extremely hard to test these kinds of things. You can test that a change is correct, but it's very hard to generate realistic workloads. There are some simple things we can do, like running a production workload against a production-like box, but it's still hard to make that work.
These changes all happened a while ago, so I can't say precisely, but really the only way, if you want good precision on whether or not a change will be helpful, is to try it in prod; it's just too hard to generate realistic data otherwise.

Question: do you do A/B testing, where you bring up the new build as a canary, have it take part of the load, watch it, and then migrate over? We are increasing our use of staged rollouts and A/B testing and all that kind of thing; it's the only way to find out. But unfortunately you can't really stage-roll-out something like this, at least not very easily. We're increasing our ability to make operational changes incrementally, but at least on a single table it's kind of all-or-nothing whether or not you make the change. From a product standpoint we can do that, but for the database it's very hard.

I think the interesting thing about this evolution is that we've seen, especially with the primary key changes, massive changes in the performance characteristics of this table over time, with what is a very small amount of text to change the primary key. MySQL has to do a lot of work, and you have to be careful about telling MySQL to do it, but conceptually it's not very much work for you to make a very fundamental change to how this is architected. So personally, when it comes to the MySQL-versus-NoSQL debate, I'm very glad that we stuck with MySQL, because we can pivot on the fly and change the performance characteristics without having to completely re-architect our usage of a table. Not that it's effortless with MySQL, but it's something that's not necessarily even possible with other solutions.

Cool, so those are the two main examples I had, and just quickly to close this up: I think one of the main themes I've noticed, and hopefully it was evident in the two examples, is how valuable it is to use your time effectively. That's the key constraint here. If we had n thousand engineers, then yes, we would just build Google's infrastructure and go with that, or build an improved version of it, but we don't, and that's why we made all the choices we did. We always know there's something better we could be doing. And it's interesting, because you can trade your time for other things fairly easily: you can trade it for more users, you can trade it for money, you can trade it for future time by recruiting. But you can't really trade other things back for time that easily. You can not clean your room, you can not do your laundry, and I think some of our co-founders are big fans of those two things, but there's a limit to how much you can not do them, and at some point your time is just the constraint that you have to decide how to turn into other things. Hopefully some of this showed that.

The other reason I brought these things up and wanted to give this talk is that this isn't just a talk about our history; this is still our mentality, this is still where we are. We're still fast-growing, we're still having to make all these trade-offs, knowing all the things we could be building but also that we have fewer people than we want. So here are a couple of examples of decisions we're currently going through that exhibit some of the same properties.
We want to have some sort of batch-processing infrastructure that can run jobs over our metadata. If you just sit down and think about the best way to do it, you could say: let's import it into Hadoop, run some sort of automated batch job system on top of that, and then have some sort of web service that displays the results and emails you, and so on. But that's a lot of work to set up. Instead we found a much more elegant and simpler way of including a lot of that inside the request workflow, so we didn't have to add any additional architecture, and we're getting a lot of the benefits with a lot less work.

A similar thing: the server file journal is currently stored on SAS hard drives, which I think are the fastest drives you can get that are actually spinning disks. But now there are SSDs, and SSD prices are, not plummeting, but decreasing over time, and it's no longer a slam dunk that if you have a lot of data you have to put it on spinning media. So maybe the next evolution of the server file journal, rather than re-architecting it at the MySQL level, will just be to buy a whole bunch of SSDs and put it on those instead. If we can save months and months of engineer time by doing that, then maybe that makes sense. These are both things we haven't done yet, but they show the same kinds of decisions being made, and the rest of the talk was background for how we think about these things in general. So that's it for what I have prepared, and if you have any more questions I can take those now.

Question: as you look forward, what are the next things you think will be your biggest challenges, in the sense that Dropbox doesn't really seem to have any predators? You mean as a company, as opposed to the back end. We always want to get bigger; we want to appeal to more people; we want more people to be happy using us all the time. That's always a goal. But the ultimate goal is that you don't have to think about where your data is. I had to think about the fact that the data for this presentation was on this laptop, and if I couldn't get this laptop to work with the projector system, it just wasn't going to work. It should just be: that's my data, that's my presentation, and anything that can present it, I can hook up to my Dropbox account and have it show the presentation. I should be able to take pictures with my phone and, maybe not put them on a projector, but get them on my TV at home. This is how technology should work, but it doesn't currently, so we want to start building all of this out. In the near term that means better mobile clients, more API usage, and things like that.

Question: are you discouraging people from using the service as a backup, like Carbonite? It's really a data service for documents right now. People probably do use it that way; I don't know if we have any information about how many, but I'm personally happy that people have found a productive way to use Dropbox.
Question: on the flip side, if people are sharing folders and making documents public, there's a risk they're using it for copyright infringement, and certain authorities frown upon that; other services doing exactly the same thing get shut down. How are you, as a business, defending against becoming a transporter, or whatever you want to call it, for that? I don't have all the information on that, but we do explicitly prohibit that kind of thing in our terms, as I'm sure all the legal sites do, and we also follow the DMCA takedown process that we have to. I haven't fully kept track of what's going on there, so I don't want to say anything for fear of being wrong, but I do know that we take that stuff very seriously.

Question: are people paying for this, or is it ad-supported? I think it's almost all user subscriptions. You get a small account for free, but if you want to really use it, people definitely pay. (Audience comment: I think that has two great advantages: great privacy, because you're not selling my information to advertisers, and spammers aren't going to pay for an account when they can find a free one someplace else.) Yeah, though that doesn't stop them from trying; we have things in place to try to detect abuse and all of that.

(Audience comment: I work with a consulting operation with seven different locations in the United States, and we're always using Dropbox to move PowerPoint presentations between them; it works well. You've got a business version where you get a large amount of space?) I don't know what my account costs, maybe a hundred dollars a year, and I don't know how much room I have, but we do have an enterprise product that, again, I don't know a whole lot about, but it does offer some features that are more appealing to the enterprise market. (Audience: the only frustrating thing is being on the phone with a guy who says "I just dropped it in the Dropbox" and it's still not there, and you can't just email it because it's 50 megs.) Well, we're working on increasing the speed of light.

(Audience: that brings up something I was thinking about. If I'm in Japan or Korea, I've got something like a hundred megabits to my house; it must be a very different user experience. Here I upload my family photos and it's "I'll go get them tomorrow," whereas there people expect almost instantaneous behavior, and you're shipping it all to Virginia.) We don't have a whole lot of metrics breaking bandwidth speeds down by geography. One interesting thing, as I mentioned, is that the sync client behavior is a bit more tolerant of latency. But those countries also have a lot of smartphone usage, which is on-demand rather than syncing behavior, so we're seeing the requirements on the back end change: rather than being latency tolerant, it's becoming less latency tolerant over time.

Question: what about your competitors, and how do you think about them? Box.net is currently in the enterprise space, and as a whole our strategy is to build the best product and the best services we can and not get distracted too much by what other people are doing. But yes, other people are doing what we're doing, or are interested in doing it if they're not already, and there's not a whole lot we can do other than continuing to move fast and build the best things we can.
[Host] I think we're all waiting for the camera to go off. All right, we'll turn it off, say goodbye, fade to black, and then we can keep talking. For more, please visit us at stanford.edu
Info
Channel: Stanford
Views: 265,160
Keywords: Engineering, Computer Science, technology, internet, cloud storage, cloud computing, business, enterprise, Google, Yahoo, Data, Programming, Developers, Database, Network, Storage, Server, Dropbox, Hadoop
Id: PE4gwstWhmc
Length: 68min 16sec (4096 seconds)
Published: Mon Sep 10 2012