CS75 (Summer 2012), Lecture 9: Scalability. Harvard Web Development. David Malan.

Captions
Welcome back to Computer Science S-75. This is lecture 9, our very last; it has been a pleasure having everyone in the course this semester. Tonight we talk about scalability: we'll revisit some of the topics from earlier in the semester and think about how to deploy applications not just on, say, a virtual machine on your laptop or desktop, as we've been doing with the appliance, but on servers on the Internet, and indeed multiple servers, so that you can handle hundreds or thousands or tens of thousands of users, or in theory even more.

So what issues will we inevitably encounter? When it comes time to put something on the Internet, recall from lecture 0 that we talked about web hosts. This is by no means a list of recommendations per se, just some representative vendors with whom the teaching fellows and I have had prior experience; shop around and you'll find many, many options these days. One takeaway from the summer thus far, though, should be what features to look for, or minimally expect, in any web hosting company you might choose, and not all of these necessarily have them. Scott? Interesting, good: if your country, your workplace, or really any network you or your users happen to be on blocks access to certain IP ranges, GoDaddy's among them in this case (YouTube is a popular thing to block, Facebook is a popular thing to block), that can be a sticking point, so a bit of due diligence or testing first is a good thing.

What else should you look for in a hosting company? Isaac? Good, SFTP, in contrast with what? FTP, and why? Good: the S literally stands for secure, meaning all of your traffic is encrypted. That maybe isn't a big deal for the files you're transferring; after all, if you're uploading GIFs and JPEGs and video files meant to be downloaded by people on the web, who really cares if those are encrypted between your machine and the server? But it is important to have what data encrypted? Exactly: usernames and passwords. One of the biggest failings of FTP, which granted is a fairly dated protocol, is that it also sends your username and password in the clear, which means anyone sniffing wirelessly around you, or anyone sniffing the wired network between points A and B, can see in the clear what your username and password are. Yeah? Price, okay, good. Virtual hosting, good. And if you want to implement a system that can grow, you maybe don't want to share the same computers with everyone else.

Good: DreamHost in particular, I think we pulled up their feature list, and it was ridiculous how many features they give you: unlimited bandwidth, unlimited storage space, unlimited RAM, or something like that. That just can't possibly be real if you're only paying like $9.95 a month, so there's some catch. In general the catch is that they're making that offer to you and to hundreds of other customers, all of whom might be on that same machine, so you're now contending for resources. The reality is they're probably banking on the fact that ninety-something percent of their customers don't need that many resources; but for the one or two customers who do,
going the way of a shared host is not necessarily in your best interest, and it's certainly not something on which to build a bigger business. So there are alternatives to shared web hosting companies. There's the VPS, a virtual private server, which you can essentially rent for yourself. What's the fundamental distinction between a VPS and a shared web host? Axel? Okay, good. To be clear, the operating system is largely irrelevant, since DreamHost and other shared web hosts could also be running Fedora or any operating system; what's key is that you get your own copy of Fedora or Ubuntu or whatever they happen to be running. In the world of VPSes, the vendor generally takes a super fast server with lots of RAM, lots of CPU, lots of disk space, and chops it up into the illusion of multiple servers using something called a hypervisor, a product from VMware or Citrix or other companies; even open-source providers have platforms that let you run multiple virtual machines on one physical machine. So in this world you're still sharing resources, but in a different way: you get some slice of the hardware to yourself, and no one else has user accounts on your particular virtual machine.

With that said, the system administrators, the owners of the VPS company, depending on what hypervisor they're using, might themselves have access to your virtual machine and your files. Frankly, anyone with physical access to the machine undoubtedly has access to your files, because they can always reboot the virtual machine into what's called single-user mode, or diagnostic mode, at which point they're not even prompted for a root password. So realize that even on a VPS, your data might be more private with respect to other customers, but definitely not with respect to the web hosting company itself. If you want even more privacy than that, you'll probably have to operate your own servers that only you or your colleagues have physical access to.

Here's a list of some popular VPS companies. There is one catch: to get these additional properties you generally pay more, so instead of $10 or $20 a month you're probably starting at $50 a month, maybe even in the hundreds, depending on how much you want in the way of resources. Toward the end of today we'll talk about one particular VPS vendor, Amazon Web Services, whose EC2, the Elastic Compute Cloud, essentially lets you self-serve and spawn as many virtual machines as you want, so long as you're willing to pay some number of cents per hour to keep each one running. It's actually a wonderful way to plan for unexpected growth, because you can even automate the process of spawning more web servers or more database servers if you suddenly get popular, even overnight, because you've been slashdotted or posted to reddit or the like, and then have those machines automatically power off when interest in your product or website subsides.

All right, suppose you're the fortunate sufferer of a good problem: your website is all of a sudden super, super popular. The site has some static web content, HTML files and GIFs and the like, dynamic content like PHP code, and maybe some database stuff like MySQL. How do you go about scaling
your site so that it can handle more users? Well, the most straightforward approach is generally what's called vertical scaling: vertical in the sense that, if you're running low on RAM, exhausting your available CPU cycles, or running low on disk space, what's the easiest, most obvious solution? Excellent: get more RAM, more processors, more disk space, and just throw resources, or equivalently money, at the problem. Unfortunately there's a catch: there's a ceiling on what you can do. Why is vertical scaling not necessarily a full solution? Exactly, there are real-world constraints. You can only buy a machine that's, say, three gigahertz these days, with maybe a handful or at most a couple dozen CPUs or cores, so at some point you'll exhaust either your own financial resources or the state of the art in technology; the world simply hasn't made a machine with as many resources as you need. So you need to get a little smarter.

Still, within vertical scaling you have some discretion. In terms of CPUs, most servers these days have at least two, sometimes three or four or more, and each of those CPUs in turn typically has multiple cores. Most of the laptops you have here are at least dual-core, sometimes quad-core, which means you effectively have the equivalent of four CPUs, four brains, inside your computer, even though they're all on the same chip. Concretely, a quad-core machine can literally do four things at once, whereas in yesteryear single-core, single-CPU machines could only do one thing at a time. Even though we humans seem to think we're simultaneously printing something, pulling up a Google map, and getting email, the reality is the operating system is scheduling each of those programs to get just a split second of CPU time, then another program, then another; we're so slow relative to today's processors that we never notice things are actually happening serially rather than in parallel. But with a quad-core server, whereas a single-core machine could handle one web request at a time, you can now handle at least four truly in parallel, and even then a server will typically spawn multiple processes or multiple threads, so in reality you can at least give the impression of handling many more requests per second.

In short, machines these days have gotten more and more CPUs, as well as more cores, and yet the funny thing is that we humans aren't very good at figuring out what to do with all this available hardware. Most of you don't really need a dual-core machine to check your mail or write an essay for school; you could do that five or ten years ago with far fewer computational resources. Now, in fairness, there's bloat in software, and Mac OS and Windows and Office keep getting bigger and bigger, so we are using those resources. But one of the really nice results of this trend is that the world has been able all the more easily to start chopping bigger
servers up into smaller VPSes, and indeed that's how Amazon and other cloud providers, so to speak, can offer people that self-service capability, as we'll discuss a bit later.

Within a single machine there are a few other things you have discretion over. If you've ever built a computer, you might be familiar with parallel ATA (also called IDE), SATA, or SAS. Axel, what do these refer to? Okay, good, SATA has to do with hard drives; in fact all three do. Years ago parallel ATA, or IDE, drives were very much in vogue; you might still have them in older desktop computers, but you wouldn't buy one new these days. Instead you'd most likely get a SATA drive, 3.5 inch for a desktop or 2.5 inch for a laptop. And if you have servers, or lots of money and a fancy desktop computer, you can go with a SAS drive: SAS is Serial Attached SCSI, and the distinction really boils down to speed. Whereas parallel ATA and SATA drives typically spin at 7,200 RPM, revolutions per minute, what do SAS drives typically spin at? (For those unfamiliar, inside a mechanical hard drive are one or more metal platters that literally spin, much like an old-school record player, on which the bits are stored.) It's more than 7,200 RPM. Axel? Yeah, 15,000 RPM is typical, sometimes 10,000, so roughly twice as fast. That alone gives you a bit of a speed bump, though it comes at a price, literally: more money. So oftentimes, for a given website with a database, since databases tend to write to disk quite a bit (every Facebook update requires writing to disk and then reading it back out some number of times), people will put SAS drives in their database servers, so that where disk is touched most, data can be read and written more quickly. And what's even faster than mechanical drives these days? Axel? Yeah, solid-state drives, SSDs, which have no moving parts and as a result perform much better electrically than mechanical drives. But they too cost more money, and they tend to be smaller: whereas you can buy a 4-terabyte 3.5-inch SATA drive for your desktop these days, SSDs max out around 768 gigabytes, typically for a lot more money.

All right, let's skip RAID for now and turn to horizontal scaling. This is in contrast to what we just discussed, throwing money and more of everything at the problem. Horizontal scaling accepts that there's going to be a ceiling eventually, so why not architect the system so that we never hit it, staying below it by using not the state-of-the-art, expensive hardware we could buy, but cheaper hardware, servers that might be a few years old or at least not top of the line, so they'll be less expensive? Rather than one or a few really good machines, get a bunch of slower, or at least cheaper, machines: a plural number of machines. This picture of a data center is just meant to conjure up the idea of scaling horizontally, actually using multiple servers to build out your topology. But what does this mean? If you have a whole bunch of servers now, instead of just one, what's the
relationship to lecture 0, where we talked about HTTP and DNS? Right: the world was very simple a few weeks ago, when you had one server, it had an IP address, and that IP address might have a domain name or hostname associated with it. We told the story of what happens when you type something.com into your laptop, hit Enter, and get back the pages on that single server. But now we have a problem if we have a whole aisle's worth of web servers. Axel? Okay, good: now when an inbound HTTP request arrives, we somehow want to distribute it across the various web servers we have, and whether it's two web servers or 200, the problem is really the same; if it's more than one, we have to figure this out somehow.

So let me put up a fairly generic picture. If we have a whole bunch of servers on the bottom, server 1 through server n, and on top some number of clients, random people on the Internet, we need to interpose some kind of black box, generally called a load balancer, depicted here as a rectangle, so that the traffic coming from people on the Internet is somehow distributed, or balanced, across our back-end servers, so to speak. It might still be the case that servers 1 and 2 and so forth have unique IP addresses, but now when a user types something.com and hits Enter, what IP address should we return? How do we achieve the equivalent of this man in the middle that can balance load across all n servers? Okay, good: instead of DNS returning the IP address of server 1 or server 2 or server n, it could return the equivalent, the IP address of this black box, the load balancer, and then let the load balancer figure out how to route data to the back-end servers. That's actually pretty clean. Now, if the load balancer has a public IP address, the back-end servers technically don't even need public IP addresses; they can have private addresses instead. And what was the key distinction between public and private IPs back in lecture 0? Anyone over here? Louis? Exactly: the rest of the world, by definition of private, can't see private IP addresses. That has some nice privacy properties: if no one else can see those servers, no one can address them, so just by nature of this privacy they can't be contacted, at least directly, by random adversaries, bad guys on the Internet. That's a plus. Moreover, the world has been running out of version-4 IP addresses; the 32-bit IPs have been scarce for some time, so it's hard, or sometimes expensive, to get enough IP addresses for all the servers you might buy. This alleviates that pressure too, since we now need one public IP and not one per server; we can give the back-end servers private addresses like 192.168.x.x (which most of you probably have behind your home routers), or 10.x.x.x, or 172.16.0.0 onward.

So how can the load balancer decide where to send that data, to which back-end server? How could we implement that? Axel? Okay: first figure out which server to send it to, checking which one has available CPU cycles it's not using; once you see a server with available cycles to handle the request, relay the same request, but locally, inside your server network, to that machine, and get back whatever it is the client requested. Excellent.
So this request arrives at the load balancer, and the load balancer decides to whom it wants to send the packet: server 1, or 2, and so forth. It can make that decision based on any number of factors. Axel proposed doing it based on load: who is the busiest versus the least busy? Odds are I should send my request to the least busy server, in the interest of optimizing performance all around. So assume there's some way, demarcated by those black arrows, of talking to the back-end servers and saying, hey, how busy are you, let me know. The load balancer figures out it wants to send this particular request to server 1, so it sends the request there using similar mechanisms, TCP/IP, much as the packet traveled to the load balancer in the first place. The server gets the packet, does its thing, says, oh, they want some HTML file, here it is; the response goes to the load balancer, the load balancer responds to the client, and voila.

So that works. But what are some alternatives to balancing based on the actual load on the server? ("Load" in general refers to how busy a server is.) What's an even simpler approach? Because frankly that sounds a little complex; we've not talked at all about how one device can query another for characteristics like how busy it is, even though it's possible. Axel? Okay: instead of every server containing everything, you could have, say, one server containing all the HTML and PHP and one server containing all the images, and when a client requests an image, it gets sent to the image server. Okay, good. The implication of the earlier story is that servers 1, 2, and so forth all need to be identical, with the same content, which is nice in that it then doesn't matter to whom you send a request; the downside is that you're now using n times as much disk space as you might otherwise need, though that's perhaps the price you pay for this redundancy, for this horizontal scalability. Or, per Axel, you could instead have dedicated servers: these for HTML, these for GIFs, these for movie files and the like. You could do that just by using different URLs, different hostnames, like images.something.com and videos.something.com, and the load balancer could take the Host HTTP header into account to decide which direction to go in. So that could work for us.

All right, what's an even simpler heuristic than asking a back-end server how busy it is right now, if you have no idea how to do that? How else could we balance load across an arbitrary number of servers? Think back to lecture 0; you can do all of this with only lecture 0 under our belts. Let's quickly tell the story: I type something.com into a browser and hit Enter. What happens, Jack? Okay, a packet is sent; to whom do we send it? Okay, good, something that will determine the IP address of where we're sending it. What's that thing called, Isaac? Not a router; routers get involved, but it's the DNS server, the domain name system server, one of a whole bunch of servers in the world whose purpose in life is to translate hostnames to IPs and vice versa. Let me pause the story there; that seems to be an opportunity for us to return something. Yeah, good: maybe this black box, this load balancer, is just a fancy DNS setup, whereby instead of returning the IP address of the load balancer itself, the DNS server just returns the IP address of server 1 the first time someone asks for something.com, then server 2's the next time, then server 3's, and so on, eventually wrapping around to server 1 again.

This is generally called round robin, and you can do it fairly easily. This is just a snippet of the configuration for a popular DNS server called BIND (the Berkeley Internet Name Daemon, I believe the D is for daemon). It suggests that if you want multiple IP addresses for a hostname called www, you mention A, which denotes an address record, the same record type as in lecture 0, and then just enumerate the IP addresses one after the other; by default this very popular DNS server, BIND, will return a different IP address for each request. The snippet looked roughly like the sketch below.
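Roughly, and with an illustrative name and private addresses rather than the slide's actual values:

```
; hypothetical BIND zone fragment: three A records for one name.
; by default, BIND rotates the order of these answers per query,
; yielding round-robin DNS.
www    IN    A    192.168.0.101
www    IN    A    192.168.0.102
www    IN    A    192.168.0.103
```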
So that's nice: it's simple, and again it uses only some knowledge from lecture 0, though granted you have to know how to configure the DNS server. You don't need any fancy bidirectional communication with the back-end servers in this model. But there's a price we pay for the simplicity of round robin, where we just spit out a different IP address each time. Let me make this more concrete, so it isn't quite so abstract: let me open a terminal program and do nslookup, a name server lookup, of google.com. This is in part exactly what Google does; their load balancing solution is more sophisticated than this list suggests, but indeed Google's DNS server returns multiple IP addresses each time. So if this is so simple to implement, what's the catch? Axel? Good: just by bad luck, one of the servers could get a real power user, someone doing something computationally difficult, I don't know, sending lots of mail, while someone else is just logging in and poking around at a much slower rate. We could come up with even more sophisticated examples than that, but over time server 1 might simply happen to get more heavyweight users than the other servers. And the implication? Round robin is still going to keep sending that server more and more users, just by its nature. So that's not so good.

What else causes problems here; what else potentially breaks? Back to the lecture 0 story: I type something.com, hit Enter, my browser, or really my operating system, sends a request to the DNS server, gets back the IP address, which in this model belongs to one of the actual servers, then sends my packet, as Jack proposed, to that particular server and gets back a response. The story ends. But a few seconds later I visit another link on something.com and hit Enter. Which part of the story changes now, Jack? Good question: how does the story change? And that will give us the answer here. Axel? Ideally, yes: if you want a truly uniform distribution across all n servers, then the DNS server has to return another response, and I'd argue the DNS server will return a different response the next time it is queried. But why might the story not get that far? Good: recall caching. Back in lecture 0 we talked about the implications, the good parts, of caching, whereby, as Axel is proposing, there's no reason for Chrome or IE to send the same DNS request every single time you click a link on something.com; that would just be a waste of time. You'd
lose some number of milliseconds, or worse a second or two, every time that happened. So instead your operating system typically caches those responses, and your browser typically caches them as well, so you just don't repeat those lookups. Which means that if you do happen to be that power user doing a heck of a lot of work of whatever sort on server 1, it's the next guy who gets sent to server 2, not you with your subsequent requests. So caching, too, can contribute to a disproportionate amount of load on certain servers, largely due to bad luck. Indeed, in DNS, though we didn't spend much time on this particular detail, there are typically expiration times, TTLs or time-to-live values, associated with each answer from a DNS server: typically an hour, or five minutes, or a day; it depends entirely on whoever controls the DNS server. That suggests that if you're a power user stuck on server 1, it might be minutes or hours or even days until you're assigned to some other server, simply because your TTL hasn't expired until then. So round robin is nice and simple, a simple configuration change, but it doesn't necessarily solve all of our problems.

In fact, the approach Axel proposed first is actually pretty good: don't use DNS-based round robin; rather, the more sophisticated approach is to let the load balancer decide to whom to send you on the back end, using any number of heuristics. It could even use round robin or randomness, because at that point you don't have to worry about caching issues, since the DNS server has returned only one IP, the balancer's. But that still leaves the risk of putting too much load on some server, so we could take server load into account at that point.

There is something else that breaks, though. Fast-forward to mid-semester, when we started talking about cookies and HTTP and sessions in PHP. To spark discussion, I propose that sessions have just broken: if our back-end servers are PHP-based websites using the $_SESSION superglobal, load balancing seems to break this model. Why, Jack? Exactly: sessions, recall, tend to be specific to a given machine. We saw examples involving /tmp, a temporary directory on a Linux system, where sessions are typically saved as serialized text files. That means your session might be sitting on the hard drive of server 1, and yet if by random chance you're sent by round robin to server 2 or server 3 instead, in the worst case you're going to see the same website but be told to log in again, because that server doesn't know you've already logged in. Okay, fine: you kind of bite your tongue, type in your username and password again, hit Enter, and suppose you're a really good sport and do this for all n servers, with no idea why something.com keeps prompting you to log in; eventually you will have a session on every one of those servers. The catch, though, is that if something.com is an e-commerce site and you're adding things to your shopping cart, you've now literally put a book in your cart over here, a different book in this cart, and a different book in that cart, and when you check out, you can't check out the aggregate. So this is a very non-trivial problem. (Concretely, consider the sketch below.)
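A minimal sketch of that failure mode, assuming PHP's default file-based session handler; the cart item is hypothetical:

```php
<?php
// with the default handler, session_start() reads and writes a
// serialized file such as /tmp/sess_<id> on THIS machine only
session_start();

// this lands on whichever back end received the request; a later
// request routed to server 2 will not see it there
$_SESSION["cart"][] = "978-0000000000";  // hypothetical ISBN
```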
Excellent, very true: if we have horizontally scaled in the sense of factoring out disparate services, this is our PHP server, this is our GIF server, this is our video server, then indeed this problem wouldn't arise, because presumably all the PHP traffic gets routed to the one PHP server. But an obvious pushback to that solution is what, Isaac? Okay, good: there's no redundancy, which is not good for uptime if anything breaks. Axel? Good: and then the story is the same; as soon as you get popular, you have too much load for a single PHP server, and we have to solve this problem anyway.

So how do we go about solving it? This seems to be a real pain. To be clear, the problem is that inasmuch as sessions are typically implemented per server, in the form of a text file like we saw in /tmp, you can't really use round robin, and you can't really use true load balancing that takes each server's load into account, because you need to make sure that if Alice is initially sent to server 1, she subsequently gets sent to server 1 again and again, for at least an hour or a day or some amount of time, so that her session remains useful. Jack? Excellent, yes: we could continue this idea of factoring out and factor out not the various types of files, but a service: session state. If we instead had a file server, like a big external hard drive, so to speak, connected to servers 1 and 2 and 3, so that anytime they store session data they store it there instead of on their own hard drives, then we could share state. That could indeed be a solution here. Axel? Okay, that's not bad at all: we already have a man in the middle, the black box, and there's no reason it couldn't be a server with hard disk space of its own, so why not put the sessions on the load balancer? That could absolutely work too. (A sketch of the shared-storage approach follows.)
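A minimal sketch of shared session storage, assuming a hypothetical NFS-style mount at /mnt/sessions that all n back ends share:

```php
<?php
// point PHP's session handler at the shared mount instead of /tmp;
// now every back end reads and writes the same session files
ini_set("session.save_path", "/mnt/sessions");
session_start();

// whichever server handles the next request sees the same cart
$_SESSION["cart"][] = "978-0000000000";  // hypothetical ISBN
```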
Let me be difficult, then. Wherever we put the sessions, whether in the load balancer (which at that point is no longer just a load balancer; it's more of a server that happens to balance load and store sessions) or elsewhere, in a new box on the screen, we seem to have introduced a weakness into our network topology: what if that machine breaks? With n back-end servers, in theory those guys are never all going to die at once (assuming it's not the power, the electricity, or something silly like that, somehow shared by all of them), so we have really good redundancy in our server model right now. But as soon as we introduce a single database or file server for our sessions, if that one box dies, what was the point of spending all this money on all these back-end servers? Our whole site goes down, because we have no ability to remember that people are logged in, and if we can't remember they're logged in, no one can buy anything. So how do we fix that? Think of that old visual of a garden hose with lots of leaks: you plug one of them with one hand, and all of a sudden a new leak springs up elsewhere. That's what's happened here: we've solved the problem of shared state, but we've sacrificed some robustness, some redundancy. How do we now fix the latter? Axel? Okay, good: we could take a different approach to storing our data, and rather than just writing to the hard disk as usual, use something called RAID.

This is actually a good way to tie in the thing we skipped over a moment ago, so let me pull up something to write on. RAID, a redundant array of independent disks, can actually be used in desktop computers these days, even though it's not all that common; some companies like Dell and Apple make it relatively easy to use RAID on your system. RAID comes in a few different forms: there's RAID 0, RAID 1, RAID 5, RAID 6, RAID 10, and more, but these are some of the simplest to talk about. All of these variants of RAID assume you have multiple hard drives in your computer, potentially for different purposes.

In the world of RAID 0, you typically have two hard drives of identical size, 1 terabyte, 2 terabytes, 512 gigabytes, whatever, and you do what's called striping data across them: every time the operating system wants to save a file, especially a big one, it writes a bit to this drive, then to this one, then back to this one, and so on. The motivation is that these hard drives are typically large and mechanical, with spinning platters like we discussed earlier, so it takes a drive a little while to write out some number of bits, a split second in reality, but a split second you don't really have when you're trying to service lots of users. Striping lets me write some data here, then here, then here, then here, effectively doubling the speed at which I can write files, especially large ones, to disk. So RAID 0 is nice for performance.

RAID 1 gives you a very different property: you still have two hard drives, but you mirror data across them, so to speak, so that anytime you write out a file you store it in both places simultaneously. There's a bit of performance overhead to writing it in two places, albeit in parallel, but the upside is that either of these drives can die and your data is still perfectly intact. It's actually amazing technology: even if you just have this in your desktop computer and one of your two drives dies, because of a defect or just age and bad luck, then so long as the other is still working, the theory behind RAID is that you can run to the store, buy another hard drive of the same size or bigger, plug it in, boot back up, and, most typically automatically, the RAID array will rebuild itself: all of the data on the remaining drive copies itself automatically over to the new one, and after a few minutes or hours you're back to a safer place, where even the other drive can now up and die. Sometimes you have to run a command or choose a menu option to induce the rebuild, but typically it's automatic, and on some machines you can even do it while the computer is still on, so you don't have to suffer any downtime. So that's great.

RAID 10 is essentially the combination of the two: you typically use four drives and get both striping and redundancy, sort of the best of both worlds, but it costs twice as much because you need twice as many hard disks.
RAID 5 and RAID 6 are nice middle grounds, nice variants of RAID 1, because RAID 1 is kind of pricey: rather than buy one hard drive, I literally have to spend twice as much and get two. RAID 5 is a little more versatile: you typically have, say, three, four, or five drives, but only one drive's worth of capacity is used for redundancy. So if I get five 1-terabyte drives, I have 4 terabytes of usable space and am sacrificing only one fifth of my available disk capacity, whereas with RAID 1 I'm sacrificing one half, 50 percent. With RAID 5 you get better economies of scale as you grow bigger, and you still have redundancy: if you have three or four or five drives in the array, any one of them can die, you run to the store, put in a new one, and you haven't lost any data. RAID 6 is even better; what does RAID 6 do, do you think, Axel? Exactly: with RAID 6, any two drives can die and you still won't have lost any data, so long as you run to the store fast enough and put in one or both replacements. The price you pay for RAID 6 is, again, literally another hard drive, but at least you can maybe sleep a bit better at night, knowing that two of your drives have to die before you really need to worry.

So these are really nice technologies, and as Axel proposes, the upside of using something like this in whatever file server stores our shared sessions is that we can at least decrease the probability of downtime, at least downtime related to hard disks. Unfortunately the box still has a power cord someone could trip over; the power supply could die; it still has RAM that could go on the fritz, a motherboard that could die; any number of things could still happen. But at least we can put redundancy inside the confines of a single server, and that definitely helps with uptime and robustness. Indeed, with actual servers you'd buy for a data center, not so much the home, it's very common for machines to have not only multiple hard drives and multiple banks of RAM but often multiple power supplies as well, and that's really cool technology too: if one of your power supplies dies, you can literally pull it out while the machine keeps running, put in a new one, and it spreads the amperage across both power supplies once both are back up, all hot-swappable. Amazing technology these days. As an aside, if you still own a desktop computer, there's really no reason not to use RAID these days; it's just very good practice, since it lets you avoid downtime and data loss with higher probability.

Okay, but someone tripped over the power cord; someone tripped over both power cords, in the case of redundant power supplies. Axel's solution, and even mine with redundant power supplies, hasn't solved the problem of shared storage becoming, all of a sudden, a single point of failure. So what else could we do to still get the property of shared state, so that it doesn't matter which back-end server I end up on, while still having the ability to survive some downtime? Well, shared storage can come in a bunch of different forms.
We've talked about it really as a file server, but this can be incarnated with very specific technologies; just to rattle them off, even though we won't discuss them in much technical detail: Fibre Channel, FC, is a very fast, very expensive technology used in offices and data centers, not so much the home, to provide very fast shared storage across servers; that's one type of file server, if you will. iSCSI is another technology, one that uses IP, the Internet Protocol, and Ethernet cables to exchange data with servers, so it's a somewhat cheaper way of having shared storage, though in practice you typically use an iSCSI volume with a single server, so let me retract that: it's not by itself a solution to our current cookie problem. But what about MySQL? We used that for a couple of weeks, and MySQL seems a nice candidate because it's already potentially a separate server. Couldn't the back-end servers just write their session objects to a database? They definitely could: just because we usually store user data and user-generated content in a database doesn't mean we can't also store metadata like our cookie and session information (though that, too, comes from users). And NFS, the Network File System, is just a protocol you can use to implement exactly the idea Axel proposed of a shared file system: one server exposing its hard disk to multiple other computers.

But again, we haven't really solved the problem of downtime. What's the most obvious way of mitigating the risk that your single file server will go down? Axel? Good, right: if you're worried about the one file server going down, the obvious solution, though it costs money and some technical complexity, is to get two. Now you somehow have to figure out how to sync the two, so that each has a copy of the other's data. Let's come back to that issue; it's generally known as replication, and it is something we can potentially achieve.

Before we segue to distributing things, though, let's finish out this load balancer question: how do you go about implementing this black box? These days you actually have a bunch of options. In software, you can do things relatively easily, with a browser, pointing and clicking, using something like Amazon's Elastic Load Balancer, a scenario we'll talk about a bit later. HAProxy, High Availability Proxy, is free, open-source software you can run on a server to do load balancing using either of the heuristics we discussed earlier, round robin or actually taking load into account somehow. Linux Virtual Server, LVS, is another free piece of software you can use. And in the world of hardware, people have made big business out of load balancers: Barracuda, Cisco, Citrix, and F5 are some of the most popular vendors here, most of whom are atrociously overpriced for what they do. Case in point, Citrix is a popular company that sells load balancers: take a guess what a load balancer might cost you these days. It's a highly variable range, with different models, but take a guess how much that black box costs, Isaac. Definitely in the thousands; in fact we have a small one, relatively speaking, on campus that was twenty thousand dollars, and guess what, that one's cheap. You can literally spend (granted, not the kind of costs that await you right after the semester ends) a hundred thousand dollars on a load balancer, or more generally a pair of load balancers, so that either of them can die and the other stays alive. In the world of enterprise hardware, the ideas we're talking about are ridiculously priced, typically because of support contracts and the like. So realize that software is number one on the list for a reason: there are other ways to achieve this much more inexpensively. Indeed, for years one of the courses I teach used HAProxy to balance load, because it was so relatively easy to set up and 100% free; a configuration for it can be as short as the sketch below.
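A hypothetical haproxy.cfg fragment, not the course's actual configuration; the names and private addresses are illustrative. The "cookie" lines implement the sticky-session idea discussed shortly, by having the balancer itself insert a cookie naming the chosen server:

```
frontend www
    bind *:80
    mode http
    default_backend servers

backend servers
    mode http
    balance roundrobin
    # balancer inserts its own cookie so a returning user
    # is routed to the same back end
    cookie SERVERID insert indirect nocache
    server server1 192.168.0.101:80 cookie s1 check
    server server2 192.168.0.102:80 cookie s2 check
```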
So realize these same ideas can be both bought and set up on your own quite readily these days. All right, let's pause here, and when we come back we'll look at issues like caching, replication and databases, and how to speed up PHP a bit. Let's take our five-minute break.

All right, we're back, and I almost forgot: we have one other solution to this problem of the need for sticky sessions. Sticky sessions means that when you visit a website multiple times, your session is somehow preserved even if there are multiple back-end servers; more specifically, you keep ending up on the same back-end server. Shared storage was the idea we really vetted quite a bit, and we didn't quite get to a perfect solution, since even though we factored out the storage and put everyone's cookies or session objects on the same server, it feels like we need some redundancy; we'll come back to that in the context of MySQL in just a bit. But what about cookies? I propose that cookies themselves could offer a solution to the problem of sticky sessions. Excellent: store which server, okay. So storing everything in cookies is probably bad, for a couple of reasons: one, it really starts to violate privacy, because rather than store a big random key you'd be storing, say, the ISBNs of all the books in your shopping cart, and it feels like your roommates and family members don't need to know what's in your cookies. Moreover, cookies typically have a finite size of a few kilobytes, so there will definitely be circumstances where you just can't fit everything you want in the cookie. An interesting idea, but probably not the best.

But you could store the ID of the server in a cookie, so that the second and third and fourth time users visit your website, whether by following links or coming back some other time, they present the equivalent of a hand stamp saying, hey, I was on back-end server 1, send me there again. That's a pretty nice idea, but there's at least one downside. What do you like, or not like, about storing in a cookie, which gets put on the user's browser and subsequently transmitted back to you, the ID of the server to which they should be sent? Axel? Expiration, in what sense? Okay, so eventually the cookie is going to expire, though as we saw a couple of lectures ago we could make it expire in ten years if we really wanted to, and frankly we'd never avoid that even with a single server; cookies can always expire. So I'm not too worried about expiration, because it's not a problem new to us simply because of load balancing. Does anything else not feel right about storing the ID of the server in the cookie? Louis? Yeah: if we just put the back-end private IP address in the cookie, what if the IP changes? That's a little problematic, and it's also one of those matters of principle: you don't really need to reveal to the world what your IP addressing scheme is. It's not necessarily something anyone could exploit, but the whole world just doesn't need to know it.
Moreover, we can implement the same idea, still storing a cookie on the user's computer, by taking the PHP approach: just store a big random number, and have the load balancer remember that this big random number belongs to server 1, and this other big random number belongs to server 2, and so forth. A little more work for the load balancer, but this way we're really not putting any state on the actual user's computer that might change or that's a little privacy-revealing. Moreover, we take away users' ability to spoof that cookie just to get access to some other server; whether or not they could do anything with that trick is unclear, but at least we take the ability away altogether, so there are no surprises. And indeed, cookies are something these black-box load balancers tend to do: you can configure them to insert a cookie themselves, so it doesn't just have to be a back-end web server that generates cookies; the load balancer can similarly insert a cookie, with the Set-Cookie header, which the end user subsequently sends back, so that we remember which back-end server to send the user to. Now, if the user has cookies disabled, this whole system breaks down, but then so does a lot of the functionality we've discussed thus far this semester, and there are sometimes workarounds.

Now, a word on PHP. PHP, and interpreted languages in general, tend to get a bad rap for performance, because they tend not to perform as well as compiled languages like C or C++. But there are ways to mitigate this. There's a notion of PHP acceleration: when you run a PHP program, the source code, through the interpreter, it turns out PHP typically does compile that file, in a sense, down to something more efficiently executed, much as Java compiles down to something called bytecode; but PHP typically throws the result of that compilation away, redoing it again and again for every subsequent request. With relatively straightforward, freely available software, though, you can install a PHP accelerator (here are just four possibilities) that essentially eliminates that discarding of the PHP opcodes and instead keeps them around. In other words, the first time someone visits your site, the PHP file is interpreted and some opcodes are generated for performance, but they're not thrown away, so the next time you or someone else visits, that PHP file doesn't have to be re-parsed and re-interpreted; the opcodes are just executed, and you get the benefit of added performance. The only gotcha is that if you ever change one of your .php files, the cached opcodes have to be thrown away, but these various tools typically do that for you. Python has a similar mechanism, where .py files are your source code and .pyc files are the compiled versions that can be executed more quickly; the same idea is at play. So this is one of those things that is relatively easy and free to enable and gives you all the more performance, specifically the ability to handle all the more requests per second in the context of a PHP-based website.

So what about caching, too? Caching in general is a great thing; it solved some of our DNS concerns early on, even as it introduced others, since caching can be a bad thing when some value has changed but you still have the old one. Caching can be implemented in the context of dynamic websites in a few different ways.
I propose that through .html files, through MySQL, and through something called memcached, we can achieve some caching benefits here. This is a screenshot of one of the most 1990s websites out there, and it wasn't even taken in the 1990s; it was taken a couple of years ago, and I visited today out of curiosity: Craigslist still looks the same. What's interesting about Craigslist, though, is that it is a dynamic website, in that you can fill out a form and post a for-sale ad or the like, and the website does actually change. But if we zoom in (it'll be a little blurry because of the screenshot), the URL up there actually ends in .html, which suggests that Craigslist is apparently accepting user input through forms, as whoever wrote this job advertisement did some time ago, but then spitting it back out as a .html file, as opposed to storing it where, or in what? Yeah: this is in stark contrast to what we've done for project 0 and project 1, where, using PHP as the back end, you store data like this server side, maybe in an XML file or, more realistically, in a MySQL database or similar, and then generate a page like this dynamically. So why is Craigslist doing this? It could just be that they're stuck in the '90s, but there's a compelling reason too. Axel? Yeah, exactly: if they store the HTML file itself, they just don't have to regenerate it every time it's revisited. This itself is caching. It's not caching in any particularly fancy way: you just generate the HTML once, save it as something.html, and store it on disk. The upside is that web servers like Apache are really, really, really good and fast at just spitting out raw static content, a GIF, a JPEG, an HTML file. The performance concerns these days generally relate to languages like PHP and Python and Ruby, where you're trying to fine-tune things; if all you have to do is respond to a TCP/IP request with a bunch of bytes from disk, that's relatively straightforward, so Craigslist is presumably taking advantage of the performance of serving up static content.

But this comes at a cost. What's the downside of this file-based caching approach? Nothing we've done thus far is a complete win; there's always a gotcha. Louis? Okay, space: we're storing the page on disk, and, if you've ever posted on Craigslist, they're also storing it somewhere like a database, because they do let you go back and edit it. Craigslist is just one of those sites where reads are probably much more common than writes: when people visit, they're probably flipping through, reading pages, as opposed to posting lots and lots of ads all at the same time. Still, there's some redundancy there that's unfortunate. Axel? Okay, good, there's more redundancy still: across all these thousands of files, you repeat even the basics, the same html tag, the same body tag, the same link and script tags, in every single page, if they're indeed just static HTML files. Whereas you get some benefits from using something like PHP (recall our MVC discussion, where we factored out template code like the header and footer so it's stored in one place and not thousands), Craigslist is sacrificing that feature and going with this instead. So in the end it's probably a calculated trade-off: you get much better performance, presumably, from just serving up the static content, and the price you pay is more disk space; but for a few hundred dollars you can typically get ever bigger hard drives these days, so maybe that's actually the lesser of the evils. (A sketch of this file-based caching pattern follows.)
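A minimal sketch of the pattern, assuming a hypothetical render_posting() template function and hypothetical paths:

```php
<?php
// generate a posting's page once, then let Apache serve the
// static file on every subsequent request
$id   = 123;                           // hypothetical posting ID
$file = "/var/www/posts/$id.html";

if (!file_exists($file)) {
    $html = render_posting($id);       // expensive dynamic work, done once
    file_put_contents($file, $html);
}

header("Location: /posts/$id.html");   // hand off to static serving
```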
But there's another big gotcha here. If you've generated tens of thousands of Craigslist pages that look like this, what's the implication, and maybe why are they stuck in the '90s? Good, exactly: if you want to change the aesthetics of the page, add a background color, change the CSS, or make the font something other than Times New Roman, it's non-trivial, because, assuming these are fully intact HTML files with no server-side include mechanism, no require mechanism like you have in PHP, you now have to change the background color in tens of thousands of files. Maybe you at least put it in a CSS file, but even then, if it's a less trivial change than color, suppose you want to restructure the HTML of the page, you really have to do a massive find-and-replace or, more realistically, regenerate all 10,000-plus pages (we latched onto 10,000 arbitrarily, but it's a lot of pages in this case). So, upsides and downsides. They're one of the few sites on the Internet that take this particular approach, but it does have some value, and I think the last time I read up on the statistics, they get by with relatively little hardware as a result, which is definitely compelling.

So, the MySQL query cache. This is a mechanism we didn't use, but it's easily enabled: on a typical server with MySQL there's a configuration file called my.cnf, and you can simply add a directive like query_cache_type = 1 and then restart the server to enable the query cache, which pretty much does what it says. If you execute a command like SELECT foo FROM bar WHERE baz = 123, that could be slow the first time you execute it, if you don't have an index or if you have a really huge table; but the next time you execute it, if the query cache is on and that row hasn't changed, the response comes back much more quickly. So MySQL provides this kind of caching for identically executed queries, which might certainly happen a lot if a user is navigating your website, going forward and back quite a bit. (Enabling it is as simple as the sketch below.)
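Roughly, in my.cnf; the cache size shown is an illustrative value, not from the lecture:

```
[mysqld]
query_cache_type = 1     # enable the query cache
query_cache_size = 16M   # hypothetical amount of RAM to devote to it
```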
Memcached is an even more powerful mechanism; Facebook has made great use of this over the years, especially initially. Memcached is a memory cache. It is a piece of software, a server that you run; it can be on the same server as your web server, or it can be on a different box altogether, but it is essentially a mechanism that stores whatever you want in RAM, and it does this, in the PHP context, with code like this. Memcached can be used from all sorts of languages; here is PHP's own interface to it, and you use it as follows. You first connect to the memcached server using memcache_connect, which is very similar in spirit to mysql_connect, which you might recall from a few lectures back. Then, in this example, we try to get a user. The context here is that it's pretty expensive to do SELECT * FROM users on my database table, because I've got millions of users in this table, and I'd really rather not execute that query more often than I have to; I'd rather execute it once, save the results in RAM, and the next time I need that user, go into the cache to get that user, rather than touching the database.

So there's this sort of tiering of performance objectives. Disk is slow, spinning disks especially. So generally, rather than storing something in a flat file on disk, you might want to store it in a table that has indexes, so that you can search it more quickly. Think back to Project 0: the XML file was relatively small, but at the same time, any time you wanted to search it, you had to load it from disk, build up a DOM thanks to the SimpleXML API, and then search it. Kind of annoying; it would be nice if we could skip the disk step so that things would just be faster. And thus was born MySQL, in Project 1. MySQL is a server, which means it's always running and using some RAM, so you have the ability to execute queries on data that's hopefully in RAM; but even if it isn't, you at least have the opportunity to define indexes (primary keys, unique keys, indexed fields) so that you can search that data more readily than you can with, say, XPath on XML. And the next step is to not even use a database, because database queries can be expensive relative to a cache, which is just a key-value store: I give you x = y, and the next time I ask for x, you give me y, and I want it quickly, much faster than a database would return it.

So here we've connected to the memcached daemon, the server. In the second line, I'm trying to get something from the cache: the first argument to memcache_get is a reference to the cache that you want to grab something from, and $id just represents something like 123, the ID of the user that I want to get. If the user is null, what's the implication, apparently? Isaac? Okay, good, but in what case would user be null, do you think? Good: when they're not in the cache. When user 123, or whoever I'm looking for, is not in the cache, that variable is going to be null, and so we do this if condition, as Isaac says. And in here there's some somewhat familiar code: PDO, which relates to MySQL in our case. We connect to the database using that username and password; we then call PDO's query function, in this case with SELECT * FROM users WHERE id = $id. Notice I'm not escaping $id, because in this case I'm assuming I know it's an integer, so it's not a dangerous string, just to be clear. Then I'm calling fetch to get back an associative array of my data: the user's name, email address, ID, whatever else I've got in my database. But then the last thing I do (before apparently nothing else, because this code is out of context), before actually using that user for anything, what am I doing with him? Axel? Exactly: I'm storing a key-value pair in the cache, whereby the key is the user's ID (which implies that there's an ID field in the user object that came back from the database), and the value is the user object itself. So again, memcached in this case is a key-value storage mechanism, and the next time I want to look up this user, I look him up by his ID; case in point, that's what I did in line two up top.
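Here's a reconstruction of the pattern being walked through, using the old memcache PECL extension's procedural API alongside PDO. The host, credentials, and schema are hypothetical; and note one detail of the actual extension: memcache_get returns false, rather than null, on a miss.

    <?php
    // Connect to the memcached daemon (11211 is its default port).
    $memcache = memcache_connect("localhost", 11211);

    $id = 123;                              // assumed to be an integer
    $user = memcache_get($memcache, $id);   // false on a cache miss

    if ($user === false) {
        // Miss: fall back to the (expensive) database lookup.
        $pdo = new PDO("mysql:host=localhost;dbname=site", "jharvard", "crimson");
        $user = $pdo->query("SELECT * FROM users WHERE id = $id")
                    ->fetch(PDO::FETCH_ASSOC);

        // Remember the result in RAM for next time; 0 means no special
        // flags, and 3600 expires the entry after an hour.
        memcache_set($memcache, $id, $user, 0, 3600);
    }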
Now, caches are finite, because RAM is finite and even disk is finite. So what could happen eventually with my cache, just by the nature of those constraints? Axel? Good: eventually the cache could get so big that you can't keep it all on the machine. So what would be a reasonable thing to do at that point, when you've run out of RAM or disk space for your cache? You're the person implementing memcached itself now; what do you do in that case? You could just quit with some unexpected error, but that would be bad, and completely unnecessary. What could you do? Isaac? Yeah, some kind of garbage collection. And which things would you collect; what would you remove from memory? Good: we can essentially expire objects based on when they were put in. If I put user 123 in yesterday, and I haven't touched him or needed him since, and I need more space, well, out goes user 123, and I can reuse that space, that memory, for user 456, if user 456 is the next person I'm trying to insert into the cache. So indeed this is a very common mechanism, whereby the first one in is the first one out, if that object has not been needed since. By contrast, if 123 is one of these power users who's logging in quite a bit, again and again and again, well, I should remember that each time. We don't see it in the code here, but every time I get a cache hit and actually find user 123 in the cache, I could somehow execute another memcache function that just touches the user object, so to speak, thereby updating his timestamp to this moment in time, so that the cache remembers he was just selected. And hopefully memcache_get itself would do that for us; and indeed it does, so I don't need to do this manually. The cache software remembers: oh, you asked for user 123, so I should probably move him back to the front of the line, so that the person at the end of the line is the first one to get evicted next time around. So it's a wonderfully useful mechanism.
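This policy is least-recently-used (LRU) eviction. As a toy sketch of the idea (the real memcached implements this internally, in C), here's what it might look like in PHP, exploiting the fact that PHP arrays remember insertion order:

    <?php
    // Toy LRU cache: the oldest (least recently used) entries sit at the
    // front of the array and get evicted first when capacity is hit.
    class LruCache
    {
        private $capacity;
        private $items = array();   // key => value, least recent first

        public function __construct($capacity)
        {
            $this->capacity = $capacity;
        }

        public function get($key)
        {
            if (!array_key_exists($key, $this->items)) {
                return null;                     // miss
            }
            $value = $this->items[$key];
            unset($this->items[$key]);           // re-insert at the back,
            $this->items[$key] = $value;         // marking it most recent
            return $value;
        }

        public function set($key, $value)
        {
            unset($this->items[$key]);
            if (count($this->items) >= $this->capacity) {
                reset($this->items);                     // oldest entry...
                unset($this->items[key($this->items)]);  // ...gets evicted
            }
            $this->items[$key] = $value;
        }
    }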
So, is Facebook very read-heavy or very write-heavy? If you're a user, it's kind of both these days. Early on it was much more read-heavy than write-heavy, because there were no status updates; you would just have your profile, and that was about it. So these days there are definitely more writes, but I'm going to guess that reads are still more common than not. When you log into your account as a Facebook user and you see your news feed, you might have 10, 20, whatever friends show up in that news feed, and that's potentially 10 or 20 queries of some sort; and yet you're probably not going to update your status 30 times in that same unit of time. So odds are Facebook is still somewhat more read-heavy, which makes caches all the more compelling, because if your own profile isn't changing all that often, you might get 10 or 100 page views by friends or random strangers before you actually update your status or your profile again. That's an opportunity for optimization. So early on, and to this day, Facebook uses things like memcached quite a bit, so that they're not hitting their various databases just to generate your profile; they're instead just getting the results of some previous lookup, unless it has since expired.

Well, on to MySQL optimization, so that you can squeeze all the more performance out of your setup. This table is a little more overwhelming right now than it needs to be, but recall our discussion of MySQL storage engines some time ago, where we talked briefly about MyISAM and InnoDB. Does anyone remember at least one of the distinguishing characteristics of those two storage engines? And again, a storage engine is just the underlying format that's used to store your database's data. Good: InnoDB, which is the default these days (so you haven't really needed to think about this much since Project 1), supports transactions, whereas MyISAM does not; MyISAM uses locks, which are full-table locks. But these engines do tend to have some other properties, and this list here is a very long list of the various distinctions among these several storage engines; transactions is one of them. There are a few other storage engines here, though, that I thought I would draw our attention to.

One: you have a memory engine, otherwise known as a heap engine. This is a table that's intentionally stored only in RAM, which means that if you lose power, or the server dies or whatnot, the entire contents of these memory tables will be lost. But it's still kind of a nice feature if you yourself want to implement a cache relatively easily: by writing keys and values into two columns, you can implement a cache of your own, to avoid having to touch the much larger tables that you have. So that's an option for you.

The archive storage engine: I haven't had to use this, but take a guess as to what it does, besides archiving something. What does this engine do for you, do you think? And what was the last part of your comment? Oh, that it doesn't store anything in cache, so you have to query it every time? Not quite. The property you're actually getting, and you can kind of see it in the footnotes on this list, is that it's compressed by default. So archive tables are actually slower to query, but they're automatically compressed for you, so they take up much, much less space. A common use case for archive tables might be log files, where you want to keep the data around, and you want to write out a whole bunch of values in a row every time someone hits your web server or buys something, but you rarely query that data. You're keeping it for posterity, for research purposes, for diagnostic purposes, but you're not going to do any SELECTs on it anytime soon, so it would just be a waste to use more disk space than you need to. You're willing to sacrifice some future performance, for when you do eventually query it, in exchange for long-term disk savings, and the archive format allows you to do that. NDB, meanwhile, is a network storage engine, which is used for clustering; so there actually is a way of addressing the issue of single points of failure that we discussed earlier with shared storage, but we'll see a simpler approach in just a moment.
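As a rough sketch, choosing an engine is just a matter of the ENGINE clause on CREATE TABLE; the table names and columns here are hypothetical:

    -- A RAM-only key-value table, usable as a homegrown cache
    -- (its contents vanish if the server restarts).
    CREATE TABLE cache (
        k VARCHAR(255) NOT NULL PRIMARY KEY,
        v VARCHAR(4096)
    ) ENGINE = MEMORY;

    -- Compressed, append-oriented storage for rarely queried logs
    -- (smaller on disk, but slower to SELECT from).
    CREATE TABLE hits (
        ts  DATETIME NOT NULL,
        url VARCHAR(255),
        ip  VARCHAR(39)
    ) ENGINE = ARCHIVE;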
So, in the world of databases like MySQL, they typically offer this replication feature that I mentioned earlier. Replication is all about making automatic copies of something, and the terminology generally goes as follows. You generally have a master database, which is where you read data from and write data to; but just for good measure, that master has one or more slave databases attached to it via a network connection, and their purpose in life is to get a copy of every row that's in the master database. You can think of it rather simply as: any time a query is executed on the master, that same query is copied down to one or more slaves, and they do the exact same thing, so that in theory the master and all of these slaves are identical to one another. So what's the upside now of having databases 1, 2, 3, and 4, all of which are copies of one another? What problems does this solve for us, if any? Axel? Good: if database 1 dies, because of human error, someone trips over the cord, a hard drive dies, RAM fizzles out, whatever the case may be, you have literally three backups that are identical. There are no tapes involved, there's no backup server; these are full-fledged databases, and in the simplest case you could just unplug the master, plug in a slave, and voila, you now have a new master. You might have to do a bit of reconfiguration on the databases to promote him to master, so to speak, and then leave servers 3 and 4 as the new slaves while you fix server number 1, but that would be one approach. So you have some redundancy: even though you might have a little bit of downtime, at least you can get back up and running quickly. And indeed you could automate this process: if you notice that the master is down, you could take him offline completely, promote a slave, and reconfigure them all, just by writing a script.

What else? How else could we take advantage of this topology? Let me ask a more leading question: in the context of Facebook, especially in the early days, how might they in particular have made good use of this topology? Excellent. Uh-huh, okay: so if you're getting a lot of queries, maybe you could just load-balance across database servers; and absolutely you could. Load balancers don't have to be used for HTTP alone; you could use them for MySQL traffic. But why do I say Facebook in particular? Early on they didn't get that many queries, but this was still a good paradigm for them. Why? Back to my hypothesis that they're more read-heavy than write-heavy: how can you adapt that reality to this particular topology effectively? Or, put another way, why is this a good topology for a website that is very read-heavy and less write-heavy? Ben? Okay, good: reading can be expedited. So if we combine Ben's and Axel's proposals here, for a read-heavy website like Facebook, certainly in the early days, you could write your code in such a way that any SELECT statements go to databases 2, 3, or 4, and any INSERTs, UPDATEs, or DELETEs have to go to server 1. Even though each such write then has to propagate to servers 2, 3, and 4, writes are less common, and that propagation happens automatically, so code-wise you don't have to worry about it too much. And if you're suffering a bit on performance there, you can just throw more servers at it and have even more read servers to lighten the load further. So this approach of having slaves, which can be used either for redundancy, so that you have a hot spare ready to go, or so that you can balance read requests across them, is a very nice solution.
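A minimal sketch of that read/write split in PHP, with hypothetical hostnames and credentials: writes go to the master (replication then copies them down to the slaves), while reads are spread across the slaves.

    <?php
    // Writes: master only.
    $master = new PDO("mysql:host=db1;dbname=site", "user", "pass");
    $master->exec("UPDATE users SET email = 'alice@example.com' WHERE id = 123");

    // Reads: any slave will do; pick one at random to spread the load.
    $slaves = array("db2", "db3", "db4");
    $host   = $slaves[array_rand($slaves)];
    $slave  = new PDO("mysql:host=$host;dbname=site", "user", "pass");
    $row = $slave->query("SELECT * FROM users WHERE id = 123")
                 ->fetch(PDO::FETCH_ASSOC);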
But of course, every time we solve one problem, we've introduced another, or at least we haven't fixed yet another. What is a fault in this layout still? Be paranoid. We kind of talked about it earlier, but: what if one dies? Right, there's got to be some blip on the radar here, because we have to promote a slave. So you still have a single point of failure, at least for writes. We could keep Facebook alive by letting people browse and read profiles, but status updates, for instance, could be offline for as long as it takes us to promote a slave to a master. It feels like it would be nicer, or at least our probability of uptime would be better, if we had not just a single master; so again, let's just throw hardware at the problem. Another common paradigm is to have a master-master setup, whereby, as the labels imply and as the arrows suggest, this time you can write to either server 1 or server 2, and if you happen to write to server 1, that query gets replicated on server 2, and vice versa. So you could keep it simple and always write to 1, with the query going to number 2 automatically, or you could write to either, thereby load-balancing across the two, and they'll propagate between each other. But in this case, if you've laid out your network connections properly, either 1 or 2 can go down, and you still have a master that you can read from and write to. And you could even implement this in code: recall, very simply, that we had the mysql_connect function weeks ago, or even the PDO constructor, which tries to connect to a database. You could implement this in PHP: if connecting to server 1 fails, then just try server 2. So you yourself can build in some redundancy, so that we can lose server 1 or 2 and not have to intervene as humans just yet, because we at least still have a second master that we can continue writing to, even though server 1 is now offline for some amount of time.
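That failover logic might look like this in PHP; the hostnames are hypothetical, and PDO's constructor conveniently throws an exception when a connection fails:

    <?php
    // Try each master in turn; return a handle to the first one that's up.
    function db_connect()
    {
        $masters = array("db1", "db2");
        foreach ($masters as $host) {
            try {
                return new PDO("mysql:host=$host;dbname=site", "user", "pass");
            } catch (PDOException $e) {
                // That master is down; fall through and try the next one.
            }
        }
        die("all database masters are down");
    }

    $pdo = db_connect();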
All right, but we still have to route traffic there. So in pursuit of this idea of load balancing, here's a more complex picture that starts to unite some of our web ideas and some of our database ideas. At the top there we have some kind of network, we have a load balancer in between, and then we have this front-end tier; web servers are typically called a tier, and a multi-tier architecture would be the jargon here. And those web servers now apparently are routing their requests through what, in order to reach some MySQL slaves for reads? Who's the man in the middle here? Axel? Okay, so for reads we have a second load balancer depicted here. Frankly, in reality they could be one and the same, the same device just listening for different types of connections, but for now they're drawn more simply as separate. We also have one MySQL master, so there are arrows pointing from the web servers to the master, and the master meanwhile has some kind of connection to the slaves. So, not bad; but frankly this is starting to hurt my brain, because what was a very simple class, where you have a nice self-contained appliance on your laptop that does everything (web, database, caching, anything you want it to do), my god, look at all the things we have to wire up now. And it's still not perfect. What could die here; what are our single points of failure? Jack? Oh, sure, okay: the MySQL master; we haven't really solved that problem, so it would be nice to steal part of the previous picture and insert it into here. Jack, same thing? Axel: the load balancers, right. Single points of failure are pretty well defined; just look for any bottlenecks, where things point in and then go out: there's one load balancer here, one load balancer there.

So it turns out that with load balancers, for your hundred thousand dollars you can typically get two of them in the package, and what they tend to do is operate in something similar in spirit to master-master mode; in the context of load balancers it's typically called active-active, as opposed to active-passive. The idea with active-active is that you have a pair of load balancers constantly listening for connections, either one of which can receive packets from the outside world and relay them to back-end servers, and they send heartbeats from left to right and right to left; a heartbeat is just a packet that gets sent every second or so. If this guy ever stops hearing a heartbeat from that guy, he automatically assumes the other must have gone offline, so he's completely in charge now, and he continues to send traffic from the outside world in. Or, if you instead have active-passive mode, and the currently active guy dies, the passive guy similarly detects no more heartbeat, and what he does is promote himself to active, which essentially means he takes over the other guy's IP address, so that all traffic now comes to him. So, in short, we definitely need another load balancer in the picture; how it's implemented is not as important to us right now, but having a single load balancer is probably a bad thing. And this is the tragedy: you can throw money and a lot of brainpower at various tiers here, but if you have a lot of web servers and a lot of MySQL servers, yet only one load balancer, just because it was really expensive or you didn't know how to configure a second properly, the rest of it is pretty much for naught, because you still have something that can die and take down your entire website.

So let's make this more complex still. Let's now introduce two load balancers, and let's introduce the idea of partitioning, which is actually something that Facebook did make good use of early on. Back in the day there was harvard.facebook.com, there was mit.facebook.com, and the earliest partitioning that they used was, as best outsiders could tell, essentially to have a different server for each school. They literally just copied the database, copied the files over to another server, and voila, thus was born MIT's copy of Facebook. Even though this would get kind of messy for 800 million users and thousands and thousands of universities and networks, it's pretty clean early on, because it leverages this idea of partitioning: Facebook didn't have a big enough server to handle Harvard and MIT, so why not just get two, and say Harvard users go here, MIT users go there? Now we've avoided that problem; unfortunately, when BU comes online, we need a third server, but at least we can scale horizontally. There is a catch with partitioning, though: as soon as you wanted to be able to poke someone at MIT, or vice versa, you had to somehow cross that Harvard-MIT boundary, at which point it's kind of a bad thing that they're all in separate databases. So early on there were some features that you could only use within your own network; not until there was more shared state could you send messages and the like. But partitioning can be used even more simply. Suppose you just have a whole bunch of users and you need to scale your architecture horizontally: why not put users whose last names start with A through M on half of my servers, and N through Z on the others, and when they log in, just send them to one or the other based on that? So in general, partitioning is not such a bad idea; it's very common in databases, because you can still have redundancy (a whole bunch of slaves over here, a whole bunch of slaves over there), but you can balance load based on some high-level user information, not based on load, not round-robin: you can actually take into account what someone's name is and then send them to a particular set of servers. So partitioning is a very common paradigm.
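A minimal sketch of that last-name partitioning in PHP, with hypothetical hostnames; here the application itself derives the database host from the user, rather than asking a load balancer to decide:

    <?php
    // Map a user to a database partition by the first letter of their
    // last name: A-M on one set of servers, N-Z on the other.
    function host_for($lastname)
    {
        $letter = strtoupper(substr($lastname, 0, 1));
        return ($letter <= "M") ? "db-am" : "db-nz";
    }

    $host = host_for("Malan");    // "db-am"
    $pdo  = new PDO("mysql:host=$host;dbname=site", "user", "pass");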
And then lastly, just to slap a word on it: high availability, or HA, is the buzzword that refers to what we described in the context of load balancers, though it can apply to databases as well. High availability simply refers to some kind of relationship between a pair (or more) of servers that are somehow checking each other's heartbeats, so that if one of them dies, the other takes on the entire burden of the service being provided, whether that's a database or a load balancer.

So, even though we finally got the iPad working, it's a little small to draw on, so for our final example let me raise the screen here; we're going old school now, with our first and last piece of chalk in the class. Let's start with the middle. All right, let's build ourselves a network here. We have a need for one or more web servers, one or more databases, maybe some load balancers; but we're also going to try to tie in last week's conversation about security, so we'll have to think about firewalling things too. So in very simple form, we have here a web server, which I'll draw as "www". All right, that's our web server, and now my website is doing so well that I need a second web server, so I'll draw it like this. And now we need to revisit the issue of balancing load. So what felt like one of our best options here? I still have the internet, which I'll draw as a cloud, connected to both of these servers somehow, but I want the property of sticky sessions. So what are my options, or what was my best option: how do I implement sticky sessions? Axel? Okay, good: use a load balancer, and store all the sessions in one place. So we can actually do one, but not necessarily both, of those. Let me interpose now the thing we started calling a black box, so this is some kind of load balancer; I still have my back-end servers, and here's another one here, and now this is connected here. But I still want sticky sessions, and you know what, shared state sounded expensive, Fibre Channel sounded complicated; there's a simpler way. How do I ensure that I get sticky sessions using only a load balancer and no shared state yet? How can I ensure that when Alice comes in and is sent to this server the first time, the next time she comes in, she's sent to the same one? Axel? Okay, good: we have the load balancer listen at the HTTP level, and when the response comes back from the first web server (let's give them numbers, so let's call this one 1 and this guy 2), the load balancer can insert some kind of cookie that allows it to remember that this user belongs on server 1. How it does that, I don't know; maybe it's a big random number, and the load balancer has a table, like PHP does for its sessions, and figures out which server to send her to.

All right, so now I need a database. The easiest way I know how to set up a database is to put it on the web server itself, so, much like the CS50 appliance, you have a database in the same box as the web server. If I now have a database here and here, on the same boxes as our web servers, what's the most obvious problem that now arises? Yeah, good: it's just going to be server 1's answer, exactly. So if Alice just happens to end up on server 1, and she updates her profile or credit card information or something persistent (not the shopping-cart thing, because that involves the session, but something persistent), it's going to persist on this database, and that's fine, because sticky sessions are solving all my problems for now. But then she comes back in a week, or she logs in from a different computer, or her cookie expires, whatever, and she ends up over here.
What happened to my credit card information? What happened to my profile? I now have no profile, because I'm on a different database. So clearly this is not going to fly, unless we partition our users and have the load balancer actually take into account who this user is, and send the user, based on Alice's last name, always to the same server. That could be one approach. But for now, let's instead factor out the database and say that it's not on the web servers; it's separate, and it's got some kind of network connection here. Of course, we've solved one problem but introduced a new one, which is what, Isaac? Yeah: a single point of failure again. So how can we mitigate this? Well, we can do a couple of things. We could attach slave databases off of this, and that's kind of nice, but it then involves somehow promoting a slave to a master in the event one died. So maybe the cleanest approach would be something like two master databases; we'll call this DB1 and this DB2. Now, how do I want to connect these? Like this? Isaac, you shook your head. Okay, so we should probably do this, for master-master replication; that's good. But what about these lines: good or bad? Bad; why bad, Jack? Right: the problem we just identified was that a database on the same server as a web server is bad, because each web server talks only to its own database, and if Alice ends up on the other server, it has none of the data you're expecting. Well, functionally this is equivalent: I've just drawn a line, but each web server is still connected to only one database, and the traffic should probably not flow like that. So we at least need some kind of cross-connect. Okay, so I can do this, but now what? Now my load balancing has to be done in code. If those are the only components in my system, the lines suggest that www1 has a network connection to DB1 and DB2, but that means I now have to do something like an if condition in my PHP code, to say: if this database is up, go here; else, if this database is up, go there. And that's not bad, but now your developers have to know something about the topology, and if you ever introduce a third master or something like that (although MySQL wouldn't play nicely with that), you have to change your code. This is not a nice layer of abstraction. So how else could we solve this? Axel? I don't like the idea of connecting each of my web servers to each of the databases, because frankly, this is going to get really ugly if it starts looking like this, right? Very quickly this degrades into a mess. Yeah, okay, good: so we insert a load balancer here, which is connected to both of the www machines and also to the database servers, and then it can be responsible for load balancing across the two masters. It's actually harder for this load balancer to do any kind of intelligent load balancing based on last names at this point, since the MySQL traffic consists of binary messages, not HTTP-style textual messages. The load balancer up top can look at HTTP headers and make intelligent decisions; down here it's harder, and maybe not impossible, but it wouldn't be very common to do load balancing based on application-layer intelligence, so you would probably push that back into the PHP code in that case. But this isn't bad. Except Isaac doesn't like this picture now, because of what? Yeah: we still have a single point of failure.
So you've just cost me even more money, or at least more complexity; even if I'm using free software, this just takes more time. So now we have load balancer 1 and load balancer 2, and I need to do something like this; and even though this looks a little ridiculous, actually it's a little elegant, that's pretty sexy. You would do this with switches, with Ethernet cables all going to some central source. So suppose we actually did that: if you've ever plugged a computer into a network jack, which most of you probably have, even if you have a laptop, you don't connect these computers all to each other; you instead connect them to a big switch that has lots of Ethernet ports. But now, Isaac, what do you not like about this idea, if I'm plugging everything into one switch? Yeah, so welcome to the world of network redundancy. Really the right way to do this is to have two switches, so almost every one of your servers, database and web alike, as well as your load balancers, would typically have at least two Ethernet jacks, and one cable would go to one switch and the other cable to the other switch. You have to be super careful not to create loops of some sort, so switches typically have to be somewhat intelligent, so that you don't create this crazy mess where traffic just bounces around and around your internal network and nothing gets in or out. So there's some care that has to be taken, but in general this really is the theme: in ensuring that you have not only scalability but also redundancy, higher probabilities of uptime, and resilience against failure, you really do start cross-connecting many different things.

But let's push harder. Isaac, suppose I fix the switch issue; suppose I also make this two load balancers and fix that issue. What's something else that could fail now? I can't do this on an iPad very well, but: this is your data center, and here's the door to your data center. Jack? That's good; that's more extreme than what I had in mind, I was thinking the power goes out, but that works too. So the building itself burns down or goes offline, you have some kind of network disconnect between you and your ISP, or the power indeed does go out. And this has happened; in fact, one of the things that happens every time Amazon goes down is that the whole world starts to think cloud computing, so to speak, is a bad thing, because, oh my god, look, you can't keep the cloud up. But the tragedy is in that perception, because cloud computing really just refers to the outsourcing of services and the sharing of resources like power, networking, security, and so forth across multiple customers. Amazon's EC2, Elastic Compute Cloud, is kind of this picture here: you don't own the servers, but you do rent space on them, because they give you VPSes that happen to be housed inside of this building. Amazon offers things called availability zones, whereby this might be an availability zone in the region called us-east-1; that's a building in Virginia, in this particular case. And what they offer are additional zones, which they call A and B and C and D, and what that simply means, in theory, is that there's another building like this one, drawn over there, that does not share the same power source and does not share the same networking cables, so that even if something goes wrong in one building, in theory the other shouldn't be affected. However, Amazon has suffered outages across multiple availability zones, multiple data centers. So, in addition to having servers in Virginia, guess where else they have servers?
Anywhere else? That's actually a pretty hard question; the world's a big place. The west coast, and Asia, and South America, and Europe as well: they have different regions, as they call them, inside of which are different data centers, or availability zones. But this just means that you can really drive yourself crazy thinking through all the possible failure scenarios, because even though Jack's building burning down is a little extreme, things like that do happen: if you have a massive storm, like a tornado or a hurricane, that knocks out power, absolutely a whole building could go down. So what do you do in that case? Well, you have to have a second data center, a second availability zone; I'll draw it much smaller this time, even though it might physically be the same. Suppose that inside of this building is exactly the same topology; so now, really, what we have is the internet outside these boxes, connecting to both buildings; the internet is no longer inside the building. Once you have two data centers, how do you distribute your load across the two? Axel? Yeah: we didn't really spend much time on it, but recall that you can do load balancing at the DNS level, and this is indeed how you can do geography-based, geolocation-based load balancing, whereby now, when someone on the internet requests the IP address of something.com, they might get the IP address of this building, or more specifically of the load balancer in this building, or they might get the IP address of the load balancer in that building. When we did the nslookup on Google, we got a whole bunch of results; that's not because they have one building with lots of load balancers inside of it, it's because they probably have lots of separate buildings, or data centers, in different countries even, that themselves have different entry points, different IP addresses. So you have what's typically called global load balancing. Then the request comes in to a building, and you still have the issue of somehow making sure that subsequent traffic gets to the same place, because odds are Google is not sharing your session across entirely different continents; it could, but that would probably be expensive or slow, so odds are you're going to stay in that building for some amount of time. But again, these ideas we've been talking about just get magnified the bigger you start to think. And even then you have potential downtime, because if a whole building goes offline, and your browser or your computer happens to have cached the IP address of that building, that data center, it could take some minutes or some hours for your TTL to expire, at which point you get rerouted to something else. Not too long ago, just a few weeks back I think, Quora was offline for several hours one night because they use Amazon, and a bunch of other popular websites that use Amazon's services were down altogether, because they were in a building, or a set of buildings, that suffered this kind of downtime. And it's hard: if you are having the fortunate problem of way too many users and lots of revenue, it gets harder and harder to actually scale things out globally. So typically people do what they can; but as Isaac has gotten very good at pointing out, you can at least avoid, as best as possible, these kinds of single points of failure.
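In DNS terms, the global load balancing just described can be as simple as publishing multiple A records for one name; this is a hypothetical zone-file fragment, using documentation IP addresses, and the short TTL bounds how long a dead data center's address stays cached in browsers and resolvers:

    ; round-robin DNS across two data centers
    www.example.com.    300    IN    A    203.0.113.10     ; load balancer, DC 1
    www.example.com.    300    IN    A    198.51.100.10    ; load balancer, DC 2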
Questions? So, a word on security, then. Let's focus only on this picture, not so much on the buildings: what kind of traffic now needs to be allowed in and out of the building? Let me go ahead and just give myself some internet here, connecting to the load balancer somehow. What type of internet traffic should be coming in from the outside world if I'm hosting a LAMP-based website? Yeah, okay, good: so I want TCP, recall, which is one of the transport protocols, on port 80 on the way in. That's good, but you've just compromised my ability to have certain security; why? You're now blocking a very useful type of traffic. Good: so we also want 443, which is the default port used for SSL, for https-based URLs. So that's good; this means now that the only traffic allowed into my data center is TCP 80 and 443. Now, those familiar with SSH: you've also just locked yourself out of your own data center, because you cannot now SSH into it. So you might want to allow something like port 22 for SSH, or you might want an SSL-based VPN, so that you can somehow connect to your data center remotely; and again, this doesn't have to be a data center, it can just be some web hosting or VPS hosting company that you're using. Okay, so we might need one or more other ports for our VPN, but for now that's pretty good.

How about the load balancers; what kind of traffic needs to go from the load balancer to my web servers? Axel? Okay, so you would want to drop the encryption, because inside the data center nobody else is going to be listening in. Good, and that's actually very common: offload your SSL to the load balancer, or some special device, and then keep everything else unencrypted, because if you control this network, it's at least safer. Not a hundred percent, because if someone compromises this, they're going to see your traffic unencrypted, but if you're okay with that, you do the SSL termination here, so that everything is encrypted from the internet down to this point, and everything past it goes over normal, unencrypted HTTP. The upside of that (remember the whole certificate thing) is that you don't need to put your SSL certificate on all of your web servers; you can just put it on the load balancer, or the load balancers. You can buy expensive load balancers to handle the cryptography and the computational cost thereof, and you can buy cheaper web servers, because they don't need to worry as much about that kind of overhead. So that's one option: TCP 80 here and here. How about the traffic between the web servers and the databases, perhaps through these load balancers? This is more of a trivia question, but what kind of traffic is that, even if you don't know the port number? Yeah, queries, or more specifically SQL queries, like SELECT and INSERT and DELETE and so forth. This is generally TCP 3306, which is the port number that MySQL uses by default.

So what does this mean? Well, if you do have firewalling capabilities (and we haven't drawn any firewalls per se, so we do need to insert some hardware into this picture that would allow us to make these kinds of configuration changes; but if we assume we have that ability, in large part because all of these things are plugged, as we said, into some kind of switch, well, the switch itself could be a firewall), we can make these configuration changes and further lock things down. Why bother? I mean, everything just works if I don't firewall things; why would I want to tighten things so that only 80 and 443 are allowed here, and only 3306 is allowed there? And in fact, notice there's no line at all between the outside world and these database guys. Good, exactly: there's just no need for people on the outside to be able to even potentially execute SQL queries or make MySQL connections coming in.
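A sketch of that policy in iptables syntax; the admin and web-tier subnets are assumptions, and in practice the web rules and database rules would live on different boxes (or on the switch/firewall in between):

    # Default deny: drop anything not explicitly allowed.
    iptables -P INPUT DROP

    # Load balancer / web tier: HTTP and HTTPS in from anywhere.
    iptables -A INPUT -p tcp --dport 80  -j ACCEPT
    iptables -A INPUT -p tcp --dport 443 -j ACCEPT

    # SSH for administration, restricted to a trusted subnet (or a VPN).
    iptables -A INPUT -p tcp --dport 22 -s 203.0.113.0/24 -j ACCEPT

    # Database boxes: only the web tier may speak MySQL.
    iptables -A INPUT -p tcp --dport 3306 -s 10.0.1.0/24 -j ACCEPT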
And even if you're not even listening for MySQL connections, it again is sort of the principle of the thing: the principle of least privilege, whereby you only open those doors that people actually have to go through. Otherwise, you're just inviting unexpected behavior, because you left the door ajar, so to speak; you left the port open, and it's not clear whether someone might in fact take advantage of that. Case in point: if somehow you screw up, or Apache screws up, or PHP screws up, and this web server is compromised, it would be kind of nice if the only thing this server can do is talk MySQL to this database server, and cannot, for instance, suddenly SSH to this other server, or poke around, or execute any commands on your network other than MySQL. At least then, if the bad guy takes this machine over, he really can't leave this rectangle that I've drawn. So again, this is beyond the scope of what we've done in the class, and even though the appliance itself actually does have a firewall that allows certain ports in and out (all the ones you need, so we haven't had to fine-tune it for any of the projects), realize that you can do this even on something like a Linux-based operating system. So, in short, as soon as you have the happy problem of having way too many users for your own good, lots of new problems arise, even though thus far we'd focused almost entirely on the software side of things.

So that is it for Computer Science S75. I'll stick around for questions one-on-one, and we still have a final section tonight for those of you who would like to dive into some related topics. Otherwise, realize that the final project's deadline is coming up, and you should have gotten feedback by now from your TFs about Project 1; if not, just drop him or her, or me, a note. Otherwise, it's been a pleasure having everyone in the class, and we will see you after tonight. Thanks; I wasn't trying to build up to that there.
Info
Channel: Jorge Scott
Views: 803,712
Keywords: cs50, cs75, david malan, rob bowden, http, php, xml, sql, mysql
Id: -W9F__D3oY4
Length: 105min 40sec (6340 seconds)
Published: Tue Feb 26 2013