Seattle Conference on Scalability: YouTube Scalability

Video Statistics and Information

Reddit Comments

I came across this great talk recently and really wish I'd seen it before. Most of the points are still valuable, and I really like how he basically describes a DevOps team before the term was even in use. Also, it's a lot of fun to see how YouTube started and what challenges they faced.

👍︎︎ 2 👤︎︎ u/aerokhin 📅︎︎ Jul 07 2021 🗫︎ replies

As a sysadmin, this video is pure gold. I wish I had been more educated around 2005 so I could have taken advantage of the .com boom.

👍︎︎ 2 👤︎︎ u/Z-80 📅︎︎ Jul 07 2021 🗫︎ replies
Captions
Everyone, Cuong is here from YouTube this afternoon to tell us a little about how they grew that organization very quickly as its popularity skyrocketed over the first couple of years of its existence. With very few people they built an amazingly scalable system, and I think it will be fun to hear how.

Okay, so again, my name is Cuong. I was part of the original engineering team that grew YouTube from its infancy to its current scale. Here's what we'll talk about: I'll go over a little bit of the history, basically where we've been and how long it took us to get here, and then the evolution of the scalability of the main parts of the system.

Over the course of today you've probably seen this kind of graph a number of times. In the context of this presentation it could mean a couple of things. Google has great free food, and my jeans are getting a little tight here; or maybe it's the Google stock price in 2007. What it actually shows is the daily video views on youtube.com since its beginning, and its beginning was a little more than two years ago.

Here's a quick timeline to give you a frame of reference. We were founded in February 2005 and got our funding around October 2005. There are only a couple of numbers I can really share with you, for various reasons: in March 2006 we hit our milestone of 30 million videos served per day, and in July 2006, just four months later, we more than tripled that and were serving an average of a hundred million videos per day. You can extrapolate from those numbers; we've definitely grown very rapidly since then, and we continue to grow very rapidly. In November 2006 we were acquired by Google.

To give you some idea of how small the team really was: we had two sysadmins, more or less two software architects dealing with scalability issues, two feature developers doing everything that the user sees, two network engineers building a network that was tens of gigabits per second, and one DBA. And we hadn't been bought by Google yet, so no chefs.

Here's a quick little algorithm that I came up with to handle rapid growth: you identify and fix bottlenecks, you drink, you celebrate, you sleep, and you continue the process all over again. That really has been our general motto. We have tended to go in very quick cycles; sometimes we do all of these steps within the span of a day, sometimes less than a day, so we don't get much sleep. And if you think I'm kidding about the drinking or sleeping parts, come see me afterwards.

The first area that I'm going to talk about is our web servers. Here's the basic setup; I'm omitting details about how many machines and such because I want to explain the overall flow of requests. First, requests hit the NetScalers, which are load balancers. They're actually really good load balancers that do a number of things: they mark machines down and mark them back up, they cache static content for us, which is good because Apache is not very good at handling large volumes of requests for static content, and they do a couple of other things like connection pooling. The request then hits one of many web servers sitting behind the NetScalers. These web servers run Apache on the front end, because Apache is fairly well tested, it has a huge community, and we know how it works. Apache runs mod_fastcgi, which looks for requests that look dynamic, following certain rules that we defined, and shoves those requests to the Python app server that we created, which does most of the application-specific work. The app server then talks to various databases, pulls information from various sources, gathers it all, and spits back an HTML page, which then gets served to the client through Apache. That's the general flow.
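As a rough illustration of the dynamic-request path just described, here is a minimal sketch using Python's standard-library wsgiref server as a stand-in for the Apache plus mod_fastcgi front end. Every name and value in it (fetch_video_metadata, render_watch_page, the fake metadata) is invented for the example; this shows the shape of the flow, not YouTube's actual code.

```python
from wsgiref.simple_server import make_server

def fetch_video_metadata(video_id):
    # Stand-in for the real work: querying databases and caches.
    return {"title": "example video %s" % video_id, "views": 12345}

def render_watch_page(meta):
    # The app server gathers data from several sources and emits HTML.
    return "<html><body><h1>%s</h1><p>%d views</p></body></html>" % (
        meta["title"], meta["views"])

def app(environ, start_response):
    # Apache forwards requests that "look dynamic" here; static content
    # is handled upstream (NetScaler cache or Apache itself).
    video_id = environ.get("PATH_INFO", "/").lstrip("/") or "unknown"
    html = render_watch_page(fetch_video_metadata(video_id))
    start_response("200 OK", [("Content-Type", "text/html")])
    return [html.encode("utf-8")]

if __name__ == "__main__":
    make_server("localhost", 8000, app).serve_forever()
```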
A few more details here. The web servers run Linux, specifically SUSE Linux, and the nice thing about web servers is that you can generally scale just by adding more machines. It's not always the case, but more often than not it has been.

One thing we've seen on various blogs, and even from potential software engineering candidates we brought in to interview, is the question: why Python? Is it fast enough? One interesting thing to note about web applications is that CPU is often not your bottleneck, at least not CPU on the web servers. It turns out you're often spending time waiting on a number of things: waiting on the database, waiting on your various kinds of caches, waiting on a lot of RPC calls. Granted, web server CPU is important, but it's not the overriding concern for web applications. Another thing is that we're in a particularly fast-moving field; it seems like dozens of competitors are announced every single week, and we have scalability concerns, so we need to be able to move pieces of code around very quickly and rewrite things. Development speed is often actually one of the most critical things, and that's where Python excels. So the answer to the question "is Python fast enough?" is, I guess, in the key word: enough. It's definitely fast enough, and we can add more machines to add capacity that way.

On our web servers we aim for much less than 100 milliseconds per page render. From the moment the app server gets the request and parses it, to generating the HTML and handing it back to the web server, is a lot less than 100 milliseconds in general, which is fairly important because you don't want all the processes on your web server getting too busy.

We use a couple of things to speed up Python. We use Psyco, which is basically a just-in-time compiler for Python, to compile the pieces of code that matter most, namely the tightest loops. We also selectively rewrote certain things, such as some of the encryption code, as C extensions: another way to keep the advantages of Python while getting some of the speed of C.

The last thing we do on the web servers is pre-generate HTML. You can do various forms of caching. At the lowest level you can cache rows from the database; at a slightly higher level you can cache fully formed Python objects, which can be serialized to disk fairly easily; and at the next level you can cache full HTML, since generating HTML is often a pretty expensive operation. So in certain cases, when there are blocks of HTML that are fairly expensive to compute and render, we just cache all the HTML.
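Here is a toy sketch of those three caching levels, with plain dicts standing in for the real stores and invented helper functions supplying the data; in production each level would live in a different place (the database's own cache, disk, the app server's memory).

```python
import pickle

# Invented stand-ins for the real data layer.
def query_database(video_id):
    return {"id": video_id, "title": "video %s" % video_id}

def build_video_object(row):
    return {"row": row, "related": ["a", "b", "c"]}

def render_related_block(obj):
    return "<ul>%s</ul>" % "".join("<li>%s</li>" % r for r in obj["related"])

row_cache, object_cache, html_cache = {}, {}, {}

def get_video_row(video_id):
    # Level 1: cache raw rows from the database.
    if video_id not in row_cache:
        row_cache[video_id] = query_database(video_id)
    return row_cache[video_id]

def get_video_object(video_id):
    # Level 2: cache fully formed Python objects; pickling also makes
    # them easy to serialize to disk.
    if video_id not in object_cache:
        obj = build_video_object(get_video_row(video_id))
        object_cache[video_id] = pickle.dumps(obj)
    return pickle.loads(object_cache[video_id])

def get_related_html(video_id):
    # Level 3: cache the finished HTML, since rendering is the expensive part.
    if video_id not in html_cache:
        html_cache[video_id] = render_related_block(get_video_object(video_id))
    return html_cache[video_id]

print(get_related_html("abc"))  # a second call would be a pure cache hit
```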
The next area is serving video, and there the costs are fairly predictable in a lot of cases. Bandwidth is certainly an issue, the cost of hardware is certainly an issue, and so is power consumption: we have a limited amount of power available in any one particular data center, so we need to be judicious and pick hardware that doesn't consume an excessive amount of power.

In the YouTube system, each video is hosted by a mini-cluster. A mini-cluster is just some small number of machines that serve the exact same set of videos, with various forms of redundancy. The reason we have a couple of machines in each cluster is, one, scalability: you have more disks serving each piece of content. Two, headroom: if a machine goes down, the other ones can take up the slack. And it's just nice to have what are basically online backups of everything, more or less. So it serves multiple purposes.

Originally for video we started with Apache, and Apache was fine for three months, maybe four months. Then it didn't work very well: it had very high load and very high context switching because of all the processes or threads that were active at any given time. So we switched to lighttpd, an open-source web server that is a lot faster. It uses a single process and a single thread, plus some very efficient mechanisms like epoll, which lets you poll a large number of file descriptors very efficiently. Later we switched lighttpd from single-process to multi-process; again, just another iteration of the same idea.

For video serving we have a couple of different paths that a video request can take. We actually put the most popular content into CDNs: we use some third-party CDNs, we also have internal Google-built ones, and there are a couple of other things we're playing around with to see which is the most effective. The moderately played and lesser-played content goes to the regular YouTube servers. What's interesting about this is that, because the most popular content has gone to the CDN, the traffic profile of the requests that hit our regular machines changes. Some of these videos might have between one and ten or twenty views per day, not very much in the scheme of things, but when you aggregate all of it, the percentage of our overall playbacks that it constitutes is a huge deal. And when you have a large amount of data with random access across it, that introduces interesting performance characteristics: suddenly you're emphasizing random disk seeks, so your RAID controller becomes more important and your caching becomes more important, and you want to tune the amount of memory on each system so that it's not too small, so your cache isn't thrashing all the time, but not too large either, because otherwise expanding the number of machines just costs too much money for the benefit you gain. There were other things we were able to tune as well, like some controller settings, but the point remains: our machines have to be tuned very differently from the CDN machines, which are mostly serving out of memory.
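To make the two serving paths concrete, here is a hedged sketch of the routing decision. The popularity threshold, the hash-based cluster choice, and all hostnames are invented; the talk only says that the most popular videos go to a CDN while the long tail goes to YouTube's own mini-clusters.

```python
import hashlib

CDN_HOST = "cdn.example.com"
MINI_CLUSTERS = ["v1.example.com", "v2.example.com", "v3.example.com"]
POPULARITY_THRESHOLD = 100000  # daily views; purely hypothetical cutoff

def video_url(video_id, daily_views):
    if daily_views >= POPULARITY_THRESHOLD:
        # Hot content: served largely out of memory by the CDN.
        return "http://%s/videos/%s.flv" % (CDN_HOST, video_id)
    # Long tail: pick the mini-cluster hosting this video. Each cluster is a
    # few machines holding the same videos, for redundancy and headroom.
    h = int(hashlib.md5(video_id.encode()).hexdigest(), 16)
    return "http://%s/videos/%s.flv" % (
        MINI_CLUSTERS[h % len(MINI_CLUSTERS)], video_id)

print(video_url("popular123", 2500000))  # goes to the CDN
print(video_url("obscure456", 7))        # goes to a mini-cluster
```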
Here are some key points. We keep things simple and cheap, for the most part. We make sure that not too many network devices sit in the path between the client and our video servers. We use commodity hardware, which is important because we found that the more expensive the hardware gets, the more expensive everything else gets: the support contracts get more expensive, and you're less likely to find a random Usenet posting or forum posting about that hardware. So we try to stay generally within the mainstream for servers. And we use simple, common tools, rsync, SSH, and a number of other things that are just bundled with Linux for the most part, and we build layers on top of those, because they're well tested and, if we need to, we can actually modify them.

In case you were wondering whether all the fun was over after Google acquired us, here's a recent email that was sent out. It certainly woke me up: we had three days of video storage left, and we needed to scramble, because apparently we had a spike in uploads and a number of other things going on. We got that resolved, but it's just amazing the kinds of things you run into when you're dealing with a site that grows rapidly and very unexpectedly. The moment we relieve a bottleneck, site traffic goes up and a new bottleneck appears. It's that simple. In some cases we doubled the number of machines and site traffic still caught up within a couple of weeks.

Thumbnails is the next area I want to talk about, and surprisingly, thumbnails are actually a pretty difficult problem, at least when you use most of the conventional methods. The thumbnail images that correspond to each video are, I think, 120 by 90 pixel images. They're small, around 5K each, and there are a lot of them, because each video has multiple thumbnails, about four per video, so suddenly there are four times more thumbnails than there are videos. That becomes an interesting problem, especially since the videos are spread across hundreds or thousands of machines, whereas the thumbnails are concentrated on a small number of machines, for various reasons. So you have the overhead of dealing with a lot of small objects: inode caches, directory entry caches, page caches, all these different caches at the OS level get involved, and things start crawling to a halt when you reach a certain point. You also have a high number of requests per second, because on any given web page at most a single video is playing, whereas if you search for some term like "dogs," you'll see 60 thumbnails on that one page. So suddenly you have a lot more requests per second.

We started pretty simple, like we generally do; we want to get something working first. We started with Apache and a unified pool, meaning the same machines that were serving our web traffic were also serving our static images, or rather our thumbnails. Then we moved to a separate pool, still using Apache, and things were good. Until this one email gets sent at 2:24 a.m.
In case you can't read it, it says: we can't accept any more videos, too many files. After a little bit of digging and frantic googling, we found out that there's actually a per-directory limit on the number of files you can have in a directory. Given the luxury of time we could have researched all this stuff ahead of time, but frankly we were just busy trying to stay alive most of the time. So we ran into this problem, all video uploads stopped, and we had to do a lot of emergency work. It turns out that ext3 in particular had this issue; we've since moved to ReiserFS, and we've since moved to a more hierarchical directory structure, so we don't have this problem anymore. It's definitely not a good idea to have too many files in one directory anyway.

So again, a little bit of the history with thumbnails. We started with Apache. Unfortunately, that resulted in high load and low performance, not a good combination. We moved to Squid, placing it in front of our Apache machines as a reverse caching proxy. It was much more efficient with CPU, it did a much better job of dealing with most requests, it had a nicer threading model, and things like that. But we found that as load increased, performance actually degraded over time, to the point where it might have been serving 300 requests per second at first, and then a day or two later it would only serve maybe 10 or 20 requests per second. So we actually had a script that would automatically kill it and restart it; kind of the Windows approach to scaling, you know, just restart everything.

Then we tried lighttpd, because we were already using it for the video servers. The problem is that out of the box it has only one process, so every time your content is not already in memory, you need to do a disk read, and that stalls the main event loop, and there's only one loop in the whole system. It stalls that loop for a minute amount of time, but the more cache misses you have, the worse it gets. We tried moving to the multi-process model, where a parent forks off a whole bunch of children and the children all call accept() on the one server socket. But then we ran into various issues: each of the processes had its own separate cache, so there was less efficiency, and there was a thundering-herd problem because there were so many connections happening per second. So there were a lot of issues with having multiple processes just calling accept() all the time.

So we decided to make use of one of lighttpd's big advantages, it being open source: we modified it. Instead of having everything happen on a single process, we had a main thread that handles things like the epoll loop and reading content that's already in memory, which is very fast, and we delegated all of the disk reads to separate worker threads. The idea being that as long as you keep all the fast stuff on the main thread, you're serving a lot of requests per second, and all the slow stuff can come whenever it comes; it can come half a second later, it can even come a second later in the worst case. Hopefully not, though.
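The modified lighttpd was C code that the talk doesn't show, so what follows is only a toy Python sketch of the same threading split: the main thread serves anything already in memory, and a small worker pool absorbs the blocking disk reads so a cache miss never stalls the event loop.

```python
from concurrent.futures import ThreadPoolExecutor

memory_cache = {"hot.jpg": b"bytes already in memory"}
disk_workers = ThreadPoolExecutor(max_workers=8)

def read_from_disk(path):
    # The slow path: the blocking read that used to stall the single
    # event loop in stock single-process lighttpd.
    with open(path, "rb") as f:
        data = f.read()
    memory_cache[path] = data
    return data

def serve(path, reply):
    if path in memory_cache:
        reply(memory_cache[path])  # fast path: stays on the main thread
    else:
        # Slow path: delegate to a worker; the reply arrives whenever
        # the disk gets around to it, half a second later if need be.
        future = disk_workers.submit(read_from_disk, path)
        future.add_done_callback(lambda f: reply(f.result()))

serve("hot.jpg", print)  # served immediately, no disk involved
```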
Since then, though, we've actually moved on to something else, because that solution had some fundamental problems that we just hadn't addressed. We were still storing each thumbnail as a separate file on the file system, and even though we had a hierarchical directory structure, with some fan-out at each level so that you don't have a huge fan-out at any one particular level, the fact remained that we had so many images that setting up a new machine, just copying the images from one machine to a new machine, took 24 to 48 hours. That's pretty bad. I don't think rsync ever finished, because rsync tried to hold everything in memory and just exhausted all the memory in the system. There was also the problem that if you had to reboot a machine, say to upgrade the kernel, or because the machine crashed, it would take maybe 6 to 10 hours from boot for enough images to have been read from disk that you weren't stalling on disk all the time, because you can't spawn an arbitrary number of threads; after a certain point you're bottlenecked on disk.

So we moved to something based on BigTable: BTFE, which I believe stands for BigTable front end. As mentioned in Jeff Dean's earlier keynote, BigTable is a Google technology, a distributed data store, and we store our thumbnails in BigTable right now. BigTable has various forms of chunking, so you don't have all these little files floating around, and replication and setting up new machines become much better and easier. It's very fast and fault-tolerant; all the Google stuff was built on the premise that the underlying hardware and the underlying network can be unreliable, so plan for unreliability and work around it. And it has various forms of caching, because with distributed systems one of the enemies is latency, especially the latency caused by the distance between the source of the data and the client, so you cache at multiple levels so that the latency between the user and the content is not so bad.
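BigTable and BTFE are internal Google systems, so the following is not their API. It is only a toy illustration of the underlying win, under the stated assumption that stdlib sqlite3 can stand in for a chunked keyed store: millions of tiny thumbnails become rows in a few large files, so replicating a machine means copying those files rather than millions of inodes.

```python
import sqlite3

class ThumbnailStore:
    """One keyed store instead of one file per thumbnail (illustrative only)."""

    def __init__(self, path):
        self.db = sqlite3.connect(path)
        self.db.execute(
            "CREATE TABLE IF NOT EXISTS thumbs (key TEXT PRIMARY KEY, data BLOB)")

    def put(self, key, data):
        self.db.execute("REPLACE INTO thumbs VALUES (?, ?)", (key, data))
        self.db.commit()

    def get(self, key):
        row = self.db.execute(
            "SELECT data FROM thumbs WHERE key = ?", (key,)).fetchone()
        return row[0] if row else None

store = ThumbnailStore(":memory:")  # a single file on disk in real use
store.put("someVideoId_0", b"...jpeg bytes...")
print(store.get("someVideoId_0"))
```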
Okay, so the final area I wanted to talk about is databases, because that's really been the core of a lot of our scalability work. We use MySQL because it's fairly simple, it's very fast if you know how to tune it correctly, and it's used by a lot of websites, so we know its characteristics. It has some flaws, but they're known flaws; that is, they're not flaws you have to pay somebody hundreds of thousands of dollars to figure out. So that was an easy choice for us. We store metadata there. We don't store videos there, and we don't store thumbnails there; we store metadata such as usernames, passwords, video descriptions and titles, messages in our system, that kind of thing.

So again, we started pretty simple: one main database and one backup, and we were doing fine, until, well, people started using our site. Somehow the word got out, and load increased. The problem with most load profiles is that load generally increases at a linear rate, and then you hit a certain point, for databases usually when you start having to go to disk, and suddenly the load really spikes. We were having issues with that, and the problem was that we didn't have much flexibility, because we were small enough at that time that we were actually leasing most of our hardware, so new hardware could only get to us so fast. We've since remedied this, but the point remains that during that time it was a big problem.

One particular day I remember very well, because the whole site had crawled to a halt. I was trying to figure out why, strace-ing things, running vmstat everywhere, things like that, and it turned out that the main database, the one database that was serving all live traffic, was swapping. So the question is: why is it swapping? We had configured it so that it wasn't using more memory than the system has; the memory was capped at a certain point. So why was it swapping, and why was that bringing down the site? It turned out that in the Linux 2.4 kernels, at least the earlier ones (I don't know if they fixed this later), the kernel thinks it knows better than you: it sometimes thinks its page cache is more important than your application. What happened was that the page cache was a healthy 1.6 gigabytes, 1 gigabyte, something like that, and the application was about 3.5 gigabytes, but there were only 4 gigabytes of physical memory. Rather than reducing the size of the page cache, as I would have expected, the kernel instead swapped out MySQL. But MySQL really needed to be in memory, so later it changed its mind and swapped it back in, and it went back and forth, swapping in and swapping out. It was horrible.

Since a kernel upgrade was a few days out, since new hardware was not available at the time and would have taken at least a day or two, and since the site was really hurting, I took a big gamble. The kernel was swapping out MySQL, so what happens if we take the swap away from the kernel? That's exactly what I did: I deleted the swap partition while the site was running against that database. Believe me, that was not a fun morning; I don't think I tasted anything for days. And after an hour of even more serious pain, because everything was getting swapped back in and there was a lot of disk activity, it turned out that we had about twenty to thirty percent headroom on the system: we certainly had twenty or thirty percent idle CPU. Granted, if our memory usage had spiked for whatever reason, the machine would have hard-crashed, but it was a trade-off we had to make for a couple of days until we upgraded the hardware. It's an interesting risk you can take when the whole world isn't really looking at you yet. That was the nice thing about back then.

Obviously we had to do a number of database optimizations along the way. We optimized queries. We made real-time operations into batch jobs whenever they were too expensive to do on a real-time basis. And we use memcached: instead of going to the database every single time, we use memcached, which is another open-source daemon, basically more or less a hash table that listens on a TCP socket and allows you to get and set data using a string-based key. That gave us a very nice boost, because using memory is a lot better than using disk most of the time.
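Here is a sketch of the read-through pattern that describes: check memcached first, fall back to the database, then populate the cache. It assumes the third-party python-memcached package and a memcached daemon on localhost; get_user_from_db and the five-minute TTL are invented for the example.

```python
import memcache

mc = memcache.Client(["127.0.0.1:11211"])

def get_user_from_db(user_id):
    # Stand-in for the real MySQL query (the slow path we want to avoid).
    return {"id": user_id, "name": "someone"}

def get_user(user_id):
    key = "user:%s" % user_id
    user = mc.get(key)  # one network hop to a big hash table, no disk
    if user is None:
        user = get_user_from_db(user_id)
        mc.set(key, user, time=300)  # cache for five minutes
    return user
```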
And we use app server caching. So again, we have these various tiers of caching: there are the database's caches, there are the database machine's OS caches, there's memcached, and then there's the memory space of the app server process itself. In some cases, when we access data really often, hundreds of times per second maybe, or even dozens of times per second, it sometimes makes sense to pre-calculate the data, replicate it to all the web machines, and just store it in memory, because nothing beats local memory access. Or at least, I don't know, maybe Google's coming up with something.

Another thing that we did was replication. MySQL has this rather nice facility. Most websites start with one database, and that works pretty well for a time; as traffic grows, what happens is that you spread read load across other databases, which are basically replaying the SQL transactions that are happening on the master database. The way it works is that all the writes go to one master database, and they get funneled down to every single replica that's attached to that database. This is an asynchronous process, so there are no guarantees about timeliness; you have to be willing to live with at least a tiny amount of staleness, generally sub-second most of the time. And you have to watch out for the limitations, because all these writes are happening in multiple places: if the writes grow to a certain point, you get into a situation where the writes are hogging all the replica databases and maybe only 20% of your resources are serving reads, and that's kind of a problem.

Another problem with replication that can occur is that if you have too many replicas, you have management issues. You have a single point of failure, and if you have to cut over to a new master database, you somehow have to reconfigure all of those machines. It becomes a kind of brittle system, because having too many replicas is usually a symptom of a larger issue: it means you're not spreading the load evenly enough across multiple machines. Here's a nice quote that you guys can read. I hijacked it and changed a few words here and there, but it's kind of true: you start off small and it works, but then you expand and things get out of hand.
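To make the master/replica split concrete, here is a hedged sketch of the routing rule it implies: all writes go to the master, reads are spread across replicas, and callers that cannot tolerate sub-second staleness can read from the master instead. The classes and the random replica choice are illustrative only.

```python
import random

class FakeConn:
    # Stand-in for a real database connection.
    def __init__(self, name):
        self.name = name

    def execute(self, sql, args=()):
        print("%s <- %s %r" % (self.name, sql, args))

class ReplicatedDB:
    def __init__(self, master, replicas):
        self.master = master      # the single write master
        self.replicas = replicas  # async replicas replaying the master's SQL

    def write(self, sql, args=()):
        # Every write funnels to the master, then is replayed,
        # asynchronously and single-threaded, on each replica.
        return self.master.execute(sql, args)

    def read(self, sql, args=(), need_fresh=False):
        if need_fresh:
            # Reads that cannot tolerate replica staleness hit the master.
            return self.master.execute(sql, args)
        return random.choice(self.replicas).execute(sql, args)

db = ReplicatedDB(FakeConn("master"),
                  [FakeConn("replica1"), FakeConn("replica2")])
db.write("UPDATE users SET name = %s WHERE id = %s", ("x", 1))
db.read("SELECT * FROM videos WHERE id = %s", (42,))
```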
That issue is replication lag, or replica lag, and this is actually one of the main downsides of MySQL replication: it doesn't have something like two-phase commit, it doesn't have any kind of consistent-write mechanism, so replication is completely asynchronous. If you have three replica machines, and each of them has a different CPU speed and different-speed disks, you're going to get slightly different amounts of lag on each of those machines, which is a problem. And here's one of the reasons why it can be a problem. On the master database, updates actually come in through multiple threads; in this case there are three threads, and each of the threads is executing a single update, and what's nice is that if you have a multi-CPU, multi-core machine with multiple disks, you can exploit the parallelism that your hardware and the OS offer you. However, on the replica database, MySQL, in its infinite wisdom I guess, serializes everything onto a single thread. I assume they did it for correctness reasons and for ease of testing and things like that, but the point remains that replication on the replicas can be a bottleneck: if you have a single long update, say the one highlighted in red up there, update three cannot execute until update two is finished. So you have one thing hogging all the resources, and you're stuck, because the OS and the hardware are generally not very intelligent about trying to parallelize a single thread.

Here's another variant of that same problem. We have a replication thread, and it's doing five updates; three of the updates affect database pages that are not in memory, so they result in cache misses, and cache misses stall the replication thread. That can be a problem, because after a while we had done as much tweaking as we could in the application, but it came to the point where the replicas just could not keep up with the master database: the master database was using all of its CPUs and all of its disks, and the replicas could only use, generally speaking, one CPU and one disk.

So we thought about it, and we realized that each update statement in a replication thread involves two steps: reading the affected database pages, and then applying the changes. Obviously I'm glossing over a lot of detail, but those are the high-level steps. MySQL actually has one thread buffering all the SQL coming from the master database and another one executing it. So in the case where the replica has lagged behind the master, the SQL statements that are going to be executed one second from now, five seconds from now, ten seconds from now, have already been written to disk; it's just that the replication thread hasn't had the chance to replay them, because it's blocking. So the idea is to prefetch the pages required to perform those updates. We hacked together a little tool, over the course of a day or two I think: a cache-priming tool that reads ahead in the SQL buffer (this only works, obviously, if the replication thread is lagging) and actually fetches the rows a couple of seconds before the replication thread needs them, thereby ensuring that the replication thread just hums along and executes every statement very quickly, because everything is coming from memory. This was huge: it reduced our replica lag from a huge amount to a very, very small amount, and it bought us more time, because we were working on the longer-term solution. But that was longer-term; it took a while to implement, so we needed to stay alive during that time.
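The actual priming tool isn't shown in the talk, so here is only a rough sketch of the idea under stated assumptions: pending statements can be read ahead of the replication thread, and turning each pending UPDATE into a cheap SELECT over the same WHERE clause pulls the affected pages into memory before the single replication thread reaches them. The regex and the read-ahead interface are invented simplifications.

```python
import re

UPDATE_RE = re.compile(r"UPDATE\s+(\w+)\s+SET\s+.+?\s+WHERE\s+(.+)",
                       re.IGNORECASE | re.DOTALL)

def priming_select(statement):
    # Rewrite "UPDATE t SET ... WHERE w" into "SELECT * FROM t WHERE w".
    m = UPDATE_RE.match(statement.strip())
    if not m:
        return None
    table, where = m.groups()
    return "SELECT * FROM %s WHERE %s" % (table, where)

def prime_ahead(pending_statements, execute):
    # Walk the statements the lagging replica will run a few seconds from
    # now; the throwaway reads warm the same pages the updates will touch.
    for stmt in pending_statements:
        sel = priming_select(stmt)
        if sel:
            execute(sel)

print(priming_select("UPDATE videos SET views = views + 1 WHERE id = 42"))
```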
Even with all this trickery, though, the replicas were still falling behind. There were too many replicas, it was a management nightmare, and we had to add hundreds of thousands of dollars of hardware just to add a little bit of incremental capacity, because the writes were so dominant. And there was replication lag: users were complaining about weird, unreproducible bugs, because since replication lag is inherently unpredictable, it was hard to even reproduce some of the bugs. So we had to take some extraordinary measures to stay alive, and unfortunately some of our team members, I guess, weren't up to the task.

So we actually took drastic measures: we introduced replica pools. Rather than having the whole set of replica databases attached to one master database, we said: wait a minute. We are starting from the assumption that we cannot serve all traffic on the website equally well, because our hardware is a few weeks out, our software is a few weeks out, and we need to stay alive. So we asked: what part of our site matters most to users? That would obviously be playing videos. There's a lot of other social-networking functionality, but playing videos really is the crux of the YouTube site. So we split the replicas into two pools. The watch pool basically serves video watching, and it consists of very nice queries from a database point of view, like single-row lookups, and it only hits a certain subset of the tables, so we can exploit cache locality by confining the queries required to watch videos to a single set of machines. Everything else gets put onto another set of machines, because those features involve a lot more database load and more complicated queries; it's just a lot more work for the databases. So rather than making everyone suffer, including the people watching videos, we decided to contain the damage in the general pool and serve the people watching videos better. Again, just a stopgap measure, but it's one of the more desperate things you need to try when things are falling down and everyone's complaining and things are not going so well.

We also looked at our hardware a number of times; this is just one instance of that. Our databases were actually serving all their data off a monolithic RAID 10 volume that was local to each machine and consisted of ten disks. We noticed that we were spending a lot of time in I/O wait, as measured by vmstat; we were waiting on disk so much. We hypothesized that because Linux was only seeing one volume, it wasn't being very aggressive about parallelizing and issuing the read and write requests. So we played around with the RAID configuration a bit and came up with something that actually turned out to be much better: exposing each of the constituent two-disk RAID 1 mirrors to the operating system, and then relying on Linux's built-in RAID 0 software to stripe them together. It's basically a hybrid hardware/software RAID 10 configuration. We did some tests, they looked good, and then we released it live, and it turned out that this bought us maybe 20 to 30 percent in terms of the throughput, the number of reads and writes that the same set of hard disks could service, as measured by iostat. This was a huge improvement: same hardware, no additional cost, we just had to reconfigure a bunch of machines.

That bought us a little more time, but again, some of this was good engineering and some of it was just hacks, and we really needed to get to the solution that would scale us to 4x or 8x. So I'm going to spare you the detailed diagram of what we were doing at the time, because it was very, very ugly, and it had just gotten untenable. I'll tell you about what happened soon after all these hacks got put into place, because at the same time as some of us were keeping the site alive, some of the rest of us were putting together the long-term solution.
And keep in mind that by "some of the rest of us" I basically mean the same people, just with nighttime as the daytime, or something like that. So some of the rest of us were putting together the long-term solution, which would generally be partitioning. A lot of problems in computer science can be solved through judicious use of partitioning. We had this monolithic database, and it was gigantic; it couldn't fit in memory, it was multiple times the size of memory. We got rid of that and basically split it into multiple shards. Instead of having one database with all the users and all the videos in it, we split it up into multiple pieces, and each shard is completely independent of the others; there's no overlap between the shards. The great thing about this is that you can spread both reads and writes, which addresses one of the key limitations of MySQL replication. You also improve your cache locality, which is critical, because memory is obviously much faster than disk: the closer the size of your working set is to the size of your memory, the better; otherwise you spill over to disk and things get very slow. And it turns out that when we released this change, which was a pretty big change and actually went fairly well considering the extent of the changes, we were able to serve the same amount of traffic with 30% less hardware, and we reduced replica lag from a fairly high number to zero, where it remains to this day. So, very good. And we can create new shards with some amount of work in the future, so we have a way of scaling our database more or less arbitrarily. It's not trivial, but it's a lot better than it used to be. And we obviously have less reliance on clever tricks, because as much fun as it is thinking about clever tricks, you don't want to have to come up with five of them a day to keep the site alive.

Okay, so to recap. We keep things simple, which is good because you are able to move simple code around more easily: you're able to re-architect it, split up the different functions, rewrite it, change the algorithm, do whatever. We constantly iterated on bottlenecks, because whenever you resolve one bottleneck, there's another bottleneck, unless you are willing to spend ridiculous amounts of money on hardware with an unlimited budget; and the bottlenecks are in the software, the OS, and the hardware. And there's one last thing that we do for scalability, one more important ingredient in the recipe, which is the team. It's been a pretty awesome team to work with. I think for the first, I don't know, six months or so, twelve months, we were pretty much it in terms of everyone who was keeping the site alive, and one person would be handling data center stuff, and setting up machines, and setting up the printers for the office, just random stuff. So it's pretty cool to have cross-discipline people, people who understand the whole system, who understand more than just software; they understand the system underneath the software. That's a pretty critical component. And obviously it's nice to have cool people to drink with.

Okay, so, future directions: basically more of everything. We've had pretty rapid growth; every time we relieve a bottleneck that's seriously hurting the site, it seems like we grow 20 to 30 percent kind of overnight, which is a little bit nerve-racking.
Sometimes we think, well, you know, maybe I'll just leave it for one more day so I can get some sleep and not have to deal with yet another bottleneck. So there's that. More videos, more formats: we're branching out into a whole bunch of different areas; I think there was the Apple TV thing that was announced recently, and a number of other non-web experiences. More users: our users keep coming, and we recently localized, so now the users in Japan, for example, can read our web pages for the first time. And more data-intensive features. We've basically scratched the surface of what you can do with video. We obviously have Google search, which is obviously a great search engine, and we have collaborative filtering helping to figure out the co-visitation habits of people and recommend more videos based on the current one you're watching. But there's a lot of work that remains to be done. There are personalized recommendations: based on the last 20 videos I've watched, what other videos am I likely to want to watch after that? There are more group things we can do, there are all sorts of taxonomies we can create, and there's a lot of categorization, data mining, and stuff like that. That's actually a weak point of a traditional SQL database, especially MySQL, which doesn't do parallel queries at all. So we're definitely going to play around with a lot of the Google technologies, things like MapReduce and a number of other things that are highly parallel and can spread computation across many, many machines, because if you're serving hundreds of millions of video views per day, you're not going to be able to mine that data unless you go distributed.

I think that wraps up my presentation. I wanted to keep it reasonably short because obviously it's been a long day, so I'll just leave some time for some questions.

We are still doing replication, but we're not going nearly as wide, which is a great thing, because after a while your replica lag just becomes unmaintainable; it becomes large and you can't do anything to reduce it past a certain point.

Okay, so, basically we started with a monolithic database, and one of the things that we did (this is at a high level; there's a lot more detail I'm leaving out) was to start instrumenting our code. We changed our code and left the database functionality exactly the same: we changed the code so that we could tag each query with the associated user. The idea being, how are we going to know where each query goes unless we know which user that query corresponds to? Especially since a user can have multiple videos, which can have multiple comments, et cetera; there's a hierarchy of tables here, so however far down you go, you need to know exactly which user it corresponds to, and that way you know which shard to direct it to. So that was the first step. We released that code, and pretty much no one noticed, because it was an invisible change.
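Here is a hedged reconstruction of that instrumentation step: wrap the query entry point so every statement is recorded together with the user it belongs to. All names are invented; the talk describes only the idea of tagging queries with users.

```python
import functools

query_log = []  # (user_id, sql) pairs: what a shard split must learn to route

def tagged_query(fn):
    @functools.wraps(fn)
    def wrapper(user_id, sql, args=()):
        query_log.append((user_id, sql))  # invisible to users
        return fn(user_id, sql, args)
    return wrapper

@tagged_query
def run_query(user_id, sql, args=()):
    return None  # stand-in for the real database call

run_query(1001, "SELECT * FROM videos WHERE owner_id = %s", (1001,))
print(query_log)
```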
Then we basically used that information. We started with snapshots of the database in multiple places, and for each of the shards these became the new master database for that shard. We basically took a Python script and used it to divide up the traffic according to the user: we came up with a partitioning scheme and decided, okay, this shard will replay the statements for this fixed set of users, based on our hashing algorithm, and this shard handles another disjoint set of users, and so on and so forth. We kept on replaying the traffic, and we verified for a fairly long period of time, making sure all the data was there; we tested on a small set of users, and then afterwards we basically did the cutover. We took pretty much a hard downtime, or close to a hard downtime, and we switched the writes to the new shards, because they had been replaying all the SQL and were perfectly in sync with the master database, especially after all the site traffic to the master database had stopped, and then we pointed the application to the new databases. So that's it at a high level.

There are a couple of other things. In order to map users to shards, we have a lookup database that's heavily cached: a database that maps user IDs to shards. The reason being, yes, we could use a mathematical algorithm, but the problem is that, as we found out from talking to a couple of people at different companies, the shards then tend to get lopsided, because one user happens to be a particular power user, or one group of users turns out to be a set of power users, and that shard becomes unbalanced. So you have to, at some point, be able to move users between shards arbitrarily.
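Here is a small sketch contrasting the two assignment strategies mentioned: pure hashing is simple but lets one power user unbalance a shard forever, while a heavily cached lookup table costs one extra read and lets you move users at will. The shard names and the in-process dict standing in for the lookup database are illustrative.

```python
import hashlib

SHARDS = ["shard0", "shard1", "shard2", "shard3"]

def shard_by_hash(user_id):
    # Deterministic and lookup-free, but lopsided shards can't be fixed.
    h = int(hashlib.md5(str(user_id).encode()).hexdigest(), 16)
    return SHARDS[h % len(SHARDS)]

user_to_shard = {}  # stand-in for the heavily cached lookup database

def shard_for(user_id):
    # Default to the hash, but honor any explicit reassignment.
    return user_to_shard.get(user_id, shard_by_hash(user_id))

def move_user(user_id, new_shard):
    user_to_shard[user_id] = new_shard  # rebalance a lopsided shard

print(shard_for(42))
move_user(42, "shard3")
print(shard_for(42))
```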
Right there, okay: so, inbound bandwidth. The question is how we dealt with bandwidth, whether inbound or outbound bandwidth was an issue, and how we hosted the service. Inbound traffic was a non-issue; I think we would make a great BitTorrent peer. Outbound certainly had to scale quite a bit. Basically our networking team of two, which grew by a whopping fifty percent after the acquisition, had to fly to all these different places, pre-wire things, negotiate all these peering agreements, et cetera, and that was sometimes an issue because of the speed at which the site was growing. It's like, oh, let's add another ten gigabits here or ten gigabits there; it becomes kind of challenging to keep up with that, especially with a small team. Furthermore, some of the bigger guys, some of the telcos, didn't want to peer with us, for whatever reasons. Those were the main issues, I think. Our costs were actually exceptionally low for the most part; we were able to negotiate fairly competitive rates. I can't mention specific numbers for various reasons, but I can say that as our traffic grew, our cost per megabit per second went down drastically. We also use CDNs to help with that, because CDNs replicate the content in multiple places, so the chances of the content living closer to the user, and therefore going through fewer hops and friendlier networks, tend to be higher.

In terms of hosting, we started early on with managed hosting, because again, when you're living off your personal credit cards, you can only do so much. We didn't have a sysadmin at first; we didn't have anything, we just had a couple of developers and one business person. We went through several managed hosting providers, but it became clear that all these guys just couldn't scale with us. Furthermore, we would get these weird complaints. For example, we'd get a DMCA complaint, which is basically an official complaint by a copyright holder saying you need to take down this video, and of course we'll comply; the problem was the handling of it by the managed hosting providers. We got this one phone call from the hosting provider saying that there's one reported DMCA violation, fix it in one hour or otherwise we're shutting down your site. So, not a very good situation. And obviously you can't control your hardware, you can't control your networking agreements, you can't control anything. So we set up our own colo presence in various Equinix centers, and it turned out to be a wonderful decision: greater flexibility in software and hardware, we can negotiate all of our own networking arrangements, and we can basically customize everything to our heart's content.

Sorry, the question was: does the YouTube infrastructure take care of serving and handling H.264? We're actually using Google infrastructure for that, mainly because we wanted our servers to have a pretty homogeneous profile. We wanted to make sure they had similar bit rates for all the videos and things like that, so that all the tuning we'd done on the caches and the disks still held true, still stayed fairly optimal. So we moved that over, because converting videos, and especially mass-converting older videos into H.264, is obviously a very distributable problem, and that's where the Google infrastructure shines.

Yeah, right there. The question is whether we have multiple data centers, and we do; we have a number of them around the US, we have colo presences in each of them, and I think the number is at five or six right now. We concentrated most of our things in a couple of data centers, so that we don't have to deal with cross-data-center latency and things like that. Obviously that's a bit of a reliability concern, and we're working to address that. The videos actually come from more or less random data centers; they're not assigned according to closeness to the user, and we don't do CDN-like replication ourselves, because that's what the CDN is for. If a video needs that kind of replication, it's probably popular enough to be seeded into the CDN. One of the nice things about videos is that they are not particularly latency-sensitive; they're more bandwidth-intensive. The fatness of your pipe matters more than how fast each packet goes back and forth, so even if, let's say, a video is coming from Virginia and I'm in California, it's not that big of an issue, as long as, obviously, there aren't huge bottlenecks in the middle somewhere. For images it becomes more of an issue, because there are a lot of little round trips; there could be like 60 round trips on a given page. That's one of the advantages of the Google BTFE stuff: it's replicated across different data centers, and Google has a really good way of figuring out which is the closest data center for the user, based on a number of different metrics.
All right, let's take a minute to thank Cuong for coming out and talking to us.
Info
Channel: GoogleTalksArchive
Views: 60,353
Keywords: googlevideo
Id: w5WVu624fY8
Length: 52min 37sec (3157 seconds)
Published: Wed Aug 22 2012