Distributed Systems: Computation With a Million Friends

Captions
[Music] This program is brought to you by Stanford University. Please visit us at stanford.edu. This presentation is delivered by the Stanford Center for Professional Development.

As we know, engineering is the art of dealing with the fact that the world gives us a choice of good, fast, cheap: pick two. Amazon Web Services, which we saw earlier this quarter, is a solution for very large computation that is cheap in a lot of interesting dimensions. Today's speaker has something bigger and cheaper, and it's actually older too, which is sort of interesting: Adam Beberg.

All right, well, thanks. I'll just get started here. My talk is in basically three parts. I'm going to talk briefly about what distributed systems are, in my definition, because "distributed systems" is now a buzzword that's applied to everything in the world, so I'm going to define what I mean. I'm going to talk a bit about designing and running these types of projects, kind of the last ten years of what I've been doing condensed into a handful of slides. And then at the end I'm going to talk about some new work we're going to be launching soon, a project called Storage@home. I'll insert here that I like when people ask questions; this is ten years condensed into these slides, so if you have questions, please jump in and let me know what you want to know.

So, as he said, this is a bigger system than something like Amazon. If you take the 275,000 machines we have in a system like Folding@home, that would be 6,800 racks full of computers. I don't know how many football fields' worth that is, but it's about 100 megawatts, which means you need a dam or a nuclear reactor or something to power it. And if we were in fact renting on Amazon EC2, or doing colocation, it would cost us about a billion dollars a year to do what we're doing. So we're saving a couple of bucks here by doing it this way.

When you have a lot of machines, there are two ways you can think about it. You can either think: I've got lots of machines, that means I can do more stuff; I can serve more web pages, I can do more data mining. Or you can think of it the way I like to think of it, which is that if you have twice as many machines, you can do something that you couldn't normally do for another year. So if you have 275,000 machines and other people have a thousand machines, you can do things that wouldn't normally be possible for another eight years. You can use different algorithms, go deeper into a problem, do research that you wouldn't be able to do any other way, or certainly not without spending a lot of money. And this is a really powerful thing to get through your head: with Folding@home we're starting to use quantum computing and other computational techniques that there's no way we would otherwise be able to use for years to come.

So let's start with the problem: I want to make something big and fast. What you do is start at the bottom. You look at vectorizing the problem; this is SSE, or AltiVec on PowerPC. You take your machine and say: I can do things as vectors instead of as single operations; instead of a * b + c one element at a time, I can do those as vectors, and that speeds things up a great deal. This is something we did in the '90s; everybody started using SSE, everybody started using AltiVec. For example, when Apple converted from PowerPC to Intel, this was the big deal, because the way they did vectorization was different. So this buys you a bunch of speed.
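To make that bottom layer concrete, here is a minimal sketch of the a * b + c idea, written in Python with NumPy purely as an illustration; the real Folding@home inner loops are vectorized C (GROMACS with SSE/AltiVec), not Python, but the shift from element-at-a-time code to whole-vector code is the same.

```python
# A minimal sketch of the "vectorize a*b + c" idea from the talk.
# NumPy is used here purely as an illustration; the real Folding@home
# kernels are hand-tuned C with SSE/AltiVec (GROMACS), not Python.
import numpy as np

def madd_scalar(a, b, c):
    """One multiply-add at a time: the mental model of 'a CPU runs my list of instructions'."""
    out = np.empty_like(a)
    for i in range(len(a)):
        out[i] = a[i] * b[i] + c[i]
    return out

def madd_vector(a, b, c):
    """Whole arrays at once: the form that maps onto SIMD units (several floats per op)."""
    return a * b + c

if __name__ == "__main__":
    n = 1_000_000
    a, b, c = (np.random.rand(n).astype(np.float32) for _ in range(3))
    assert np.allclose(madd_scalar(a, b, c), madd_vector(a, b, c))
```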
The next layer up is threading or streaming, and we've had great talks about CUDA; I don't know if you've had a talk about Brook, which is ATI's equivalent. I'm not going to address this in depth other than to say: instead of putting four ALUs on a chip, let's put hundreds and hundreds of them on there and find a way to get data to where it needs to go. And this speeds you up; we're seeing 50x speedups from this level of technology.

The next is SMP: you have multiple chips in your computer, and you use threading (threading could also be in this layer, depending on how you look at it) to split your problem in two or so and try to make it go even faster. Then you reach the layer of clustering, which is where Amazon and Google live; they rule this world, they have more machines and more of everything than everyone. Here you've got a high-speed network, you're interconnected, your machines are pretty reliable, and you want to do more of the same type of thing; that's the philosophy behind clustering. And the last layer, where everything you're doing promptly breaks if you're just moving up the stream, is going distributed: crossing organizations, asking people to help you, using machines you don't have full control over. This is where you have to start dealing with high rates of failure, high rates of mistrust of the results, and so on. This is where I'm going outside my organization. In the example of Google, it's still all Google; they're not crossing over to anyone else where they have to ask someone to help them out; they have full control over the system.

When you're working your way up this stack, I mentioned you break at some point, and that's because this is really about algorithms: how you're going to approach the problem and what you're going to do at each of these layers. If you start from the bottom, things are going to break each time you go up a level. You have to start designing from the top down: what do I want to do, what's my end outcome, what do I really want to get out of doing this massive computation? That could be a very sequential task, like rendering the frames of a movie, in which case most of these questions are very simple. But often they're not, and odds are that when you start from the top and design down, the algorithms you end up using are completely different from the ones you would arrive at coming from the bottom up. And then of course, once you've done your design, you can optimize at each level however you want.

There's active research in each of these layers as far as getting tools to do this for us. Vectorization is now done by most every compiler I know of; it will just do that for you, it'll see you have a loop, divide it by four, and do four times as much work each iteration. There are compilers coming out and maturing for stream programming that are really getting good: they can take a simple algorithm that you express in a not very elaborate way and spread it over hundreds of ALUs. So that's active research, but that's not really where we're looking right now.
Another thing: when you're taught each of these techniques, you're not taught that they're interconnected. You're taught each one as a separate, individual thing that you optimize for and tune your algorithms around; you're not taught that this is really a whole stack, and that you need to learn and understand how each piece fits in with the others. When we're taught to program, we're generally taught that there's a CPU that will execute the list of instructions we give it; it'll go and do what we want. That doesn't reflect the hardware anymore; that's nothing like the hardware. Now the CPU is the master of hundreds of resources that you can use; the latest NVIDIA card has, I think, 376 ALUs on it, which is ridiculous. If you're thinking of the CPU as this one thing doing the work, you're really not going to get any kind of performance. You have to shift to the mentality that the CPU is just telling everyone else what to do and farming work out to other units. The Cell processor is exactly this model: a main CPU and then eight computational units that do what they're told. And you also have to understand, pretty much from the beginning, that your CPU is not alone: you can use the network, and that is generally taught much later in computer science coursework.

So let's look at the attributes that change as we cross from clustering into distributed computing; what do we start having to address? The first is that when you're doing clustering, you're basically paying someone for the resources, or you've bought the resources; you're dealing with a contract, a service-level agreement, something like that. In distributed systems, you're dealing with untrusted machines that come and go as they please. The reliability of a distributed system is much lower: not just backhoes cutting fiber, but people can leave if they want to, and things happen. These aren't machines in machine rooms; they're probably sitting on someone's desk, and all kinds of crazy things happen to them. In centralized systems or clustering, generally all of your machines are the same thing: a box in a rack running the same Linux operating system. You're not going to have to deal with hundreds of different versions and combinations of DLLs on Windows, or with the latest Linux kernel shipping a new libc so all your code won't link anymore. You do have to deal with those things in distributed systems; we spent so much time trying to get Linux binaries to work on more than one distribution, and it's bad. In centralized systems you can basically trust the result you're getting; you're pretty sure that what you've told the machine to do is what's going to happen, and you don't have to worry too much about hacking or cheating. In distributed systems, cheating is just a fact of life. And the big one I led with is cost. The cost of centralized systems is however much you want to build, rent, or pay. In distributed systems it's how much time you want to put into it: how much time you want to sink into recruiting, into making your website nice, into dealing with the press. It really comes down to time equals scaling in these systems.
So let's say we've got a system and we want to know: is this a distributed system or is it not? Like I said, everything is now called a distributed system, because that's the buzzword, but I consider the three central traits of a distributed system to be these. First, you're crossing organizational boundaries: you're either cooperating with someone in a friendly way, like DNS or email, or you're asking them to do something for you, like Folding@home or any of these other systems. Second, the algorithms are different; you can look at an algorithm and say, this is a distributed algorithm, this takes into account these failures and all these other things. And the third, which I'll touch on briefly later, is adversaries: in a distributed system you have active people working against you, hopefully not very many or very mean, but you do have people trying to cheat your system, and dealing with that takes a lot of time. Of course, the internet used to be basically fully distributed. DNS didn't come along for a little while, and that was the first place there was a little centralization, but it used to be pretty much free rein even back in the '80s and '90s: firewalls hadn't come yet, and port 80 wasn't the only way to talk to a computer. We've centralized more and more over time.

Let's look at what happens if our computation nodes were people, because this is an easy analogy to draw: what if we were doing distributed systems of people? In news reporting, you'd get blogs versus the news networks, CNN versus the army of bloggers in the world. The amount of news by bloggers is far greater, and they're willing to do different things, willing to report on things a major network would not report on, so you get completely different properties out of the blogosphere than you get out of centralized news. In the world of reference, we've got Wikipedia, which is so good now that Encyclopaedia Britannica is releasing their content for free because they just can't sell it anymore. Wikipedia has destroyed the world of encyclopedias, because millions of hours by probably millions of people have been put into a system that is much more decentralized. You also get adversaries: every time I've gone and introduced facts onto Wikipedia, they get immediately rolled back; I don't quite understand the dynamic there, but there is a lot happening there that's really exciting. And in software development you get centralized systems, where a company builds Windows or follows any corporate development model, versus open source, and you can look at the strength of open source in exploring the design space much more rigorously than a centralized company, where somebody is feeding decisions from the top down, ever would.

So now I'm going to use the example of Folding@home, the algorithm we use there, and how it's different from what you would do in a centralized system. A protein is produced basically as a long string of spaghetti, and what it wants to do is fold up into a ball; but it has to fold up into the correct shape, or you die. If you have genetic defects and it doesn't fold right, you get Alzheimer's or one of these other diseases related to protein folding.
So this is important for us to understand: how biology works, how these proteins fold, and more importantly what can go wrong when they're folding and how you might fix that with a drug, or prevent it from happening at all. To go from this long single thread to something like what's in the slide takes a long time; it's incredibly complicated, and it's way beyond what we can simulate with computers. A single computer, even a supercomputer, running for billions of years can't possibly tell you how this is going to work, what the properties of the process are, what some critical step along the way might be.

So what we use are Markov models. If you don't know what that is: say you didn't have maps and you wanted to get from San Francisco to New York. How would you do that? You'd say, OK, there's a local sign that says Sacramento; I'm going to go to Sacramento, and from Sacramento maybe there's another sign that takes you to the next step. If you build up each step, you basically produce a map, and eventually you get to New York, and you can say, hey, there's a path from here to New York. It's really about exploring more randomly, the way a colony of ants would explore: the ants want to find food, they don't already know where everything is, so they go explore. You can explore how things fold in the same way and build up the model afterward. The trick is that it lets you do short little pieces and put them together, because getting to Sacramento is a short little thing you can actually do; you can't just jump to New York, you don't have jet technology yet, to stretch the analogy. But you can get to Sacramento and build up the map, and you can do those short little pieces with the computers we have on the internet right now. That's what Folding@home does, and it's a completely different way of approaching the problem. People who approach this problem from the bottom up, going from vectors to clusters, build dedicated hardware; they spend millions of dollars building machines that just fold proteins and are really, really fast, and you still can't explore these big systems, because it's still a billion times too slow for the really big problems. But from the top down you can explore them. This slide is a ribosome; it's actually still too big for us to do, but we're doing parts of it now with these methods.

The other thing this allows you to do, like the ants, is that once you learn something about what's going on, you can use that to decide what to try next. If ants find food, they lay down a trail and say, hey, there's food over here, let's go this way, and from there we'll look around for more food. They don't say, I'm always going to start here, ignore everybody else and everything we've learned, and just randomly wander out. They learn, and this model lets us learn too: when we're producing jobs, the next thing we want to try in Folding@home, we base that on what we've learned.
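To make the Markov-model idea concrete, here is a toy sketch: many short trajectories between discrete states are counted into a transition matrix, and propagating that matrix answers long-timescale questions that no single short run could. The state names and trajectories below are invented for illustration; the real Folding@home Markov state models are built from molecular dynamics data, not hand-written lists.

```python
# A toy Markov state model: stitch many short hops into long-timescale behavior.
# States and trajectories here are invented purely for illustration.
import numpy as np

states = ["unfolded", "intermediate_A", "intermediate_B", "folded"]
index = {s: i for i, s in enumerate(states)}

# Each short "trajectory" is the kind of thing one volunteer machine could
# produce: a handful of steps, nowhere near a complete folding event.
short_trajectories = [
    ["unfolded", "unfolded", "intermediate_A"],
    ["intermediate_A", "intermediate_A", "intermediate_B"],
    ["intermediate_B", "folded", "folded"],
    ["unfolded", "intermediate_A", "unfolded"],
    ["intermediate_A", "intermediate_B", "intermediate_B"],
]

# Count observed transitions, then row-normalize into probabilities.
counts = np.zeros((len(states), len(states)))
for traj in short_trajectories:
    for a, b in zip(traj, traj[1:]):
        counts[index[a], index[b]] += 1
T = counts / counts.sum(axis=1, keepdims=True)

# Propagating the matrix answers questions no single short run could,
# e.g. the probability of being folded after many steps, starting unfolded.
p = np.zeros(len(states))
p[index["unfolded"]] = 1.0
for _ in range(100):
    p = p @ T
print(dict(zip(states, p.round(3))))
```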
Cloud this is Google Apps um trick to this is this was 1973 and this is a depiction of uh David farber's work on a system called distributed computing system it ran a project ran from 1970 to 1977 um they basically did all the things we do now in what we call cloud computing they studied networking storage failure models how how to compute jobs how to optimize a system like the cartoon um but what we have now is way way more powerful toys Mor's laws come and and done what it does and we have way more powerful tools now but the theories and the things we're doing are the same thing so I started doing this in a public way uh in 1997 we had distributed. net um we had 40,000 active machines and my memory is bad but it was about that um and we tackled some problems that aren't don't necessarily take this Advanced you know ant-like Behavior they just were pretty serial it was we had a list of a you know two to the 56 things we needed to check and we handed out chunks and we did them um the goal here was to fix the crypto laws if you remember back in the early 90s every piece of software had two versions they had the if you're in the United States click here and if you're not in the United States click over here and the one that you clicked over and got the foreign one was using these crypto algorithms and they were really weak um we actually working with the uh eff cracked Dees in 22 hours so that that's not strong encryption anymore that's that's weak so we got the laws changed they went away now soft all has one version it's got good strong encryption in it what we really learned in this phase and this was one of the early projects that gathered a group of people and kept them together for more than one project there were many other projects in the day um there was uh dsch which was just that there was mercine Prime searching which I think is still going on there's tons of these little projects but they were like one project and this we really learned how to keep Volunteers in our project and and learned kind of those lessons of how to gather a group of people together and really you know Channel them into something this is folding at home this is ongoing it launched in in 2000 um I met VJ p in 1999 we collaborated at various levels over over time I'm now in his lab um but this is the 275,000 number this is how many machines we have out there actively working that we're hearing back from this isn't like 275,000 people signed up this is how many people are actually running our work right now so this is a billion dollar system to us um we've got Windows Mac Linux we've got PS3s now which is the 20y PlayStation 3 we've got gpus so we're uh betaing an ATI client and again like the PS3 or the GPU are using you know streaming and other things like that to go even faster than Windows machines would and this is all dedicated to protein fold all this is protein related research um and there's 54 Publications out of this unfortunately not many of those are mine um so let's look at how do you get volunteers how do you gather a group of people together to help you if you want you want to get this billion dollars of value out of people you you've got to be doing something right um so they you have to be motivated to help you um this is really hard like we spend so much time and we learned so I learned so many lessons back in the day but this is a really hard problem dealing with human beings is always a hard problem this is like Beyond NP hard it's just difficult um npart is Trivial compared to do with 
So there's a whole different set of skills you have to learn to do this type of thing, and they're not computer science skills: there's sociology, there's psychology, there's motivation and all these other things. The main vectors here are word of mouth and the scientific press. As you're first starting out it's really just word of mouth, and maybe you get lucky and get slashdotted or something like that. And the first rule of all this is that you have to be absolutely good about what you're doing; as I like to say, "not evil" isn't good enough. If you're a company trying to do this, forget it. If you're not a nonprofit, if there's any profit anywhere near what you're doing, forget it; it's not going to happen. People will not help you if they don't think you're out there to help humanity or the world. There are lots of projects like this they could be helping instead; this is really a competition over how good you are, how much benefit you're bringing to humanity. There are probably a hundred active projects right now. There are four really big ones: two protein projects, SETI@home, and another one whose name I forget, something-at-home. And then there are lots of little ones with maybe a thousand computers each.

To get past that point, you have to actually be doing something good for people, and you have to be absolutely transparent about what you're doing: this is what we're doing, this is what we're going to find out, this is what we're going to publish, this is the data we're going to make available. Just say exactly what you're going to do. Another big thing is that this works better for early, fundamental scientific research. If you were doing really late-stage work where a product was imminent, this really wouldn't work for you. If you were a drug company trying to do this, or working closely with a drug company, nobody would help you. Why would they? You're just helping a company, the evil company; we don't want that. I can't stress enough how important it is to be totally transparent and upfront with people, or they're really just going to go somewhere else.

So why do people help? This is the sociology, and I would love to partner with someone and actually write a paper, with error bars, on what I only sort of know. People like to be interested in something; they love to be hooked. With us it's the research; with SETI it's aliens. What's interesting about what you're going to produce, why is it cool, what does it do? That's very important. Then there's altruism, people just wanting to help: we're using idle computer cycles, the machine is on anyway, so why not? And competition and reward are of course primal for human beings. Competition takes the form of teams and stats: if you help us, we give you points; if you help us more, with more computers, we give you more points; and when you and your friends form a team, your team can compete with other teams. It's very video-game-points kind of motivation, but it also really helps us, because we can tune those points.
For instance, we've got GPUs right now and we've got CPUs, and the GPU client still requires the CPU to do some heavy lifting, so which do we want people running? We tune the points so that we get the most science out of what we're asking people to do. As for rewards, we have things like little printable certificates on the website, and that's nice; it shows that we care about them helping us, and that's important. And then there's the gift-economy aspect: people feel good about helping, and the status of being able to say "I did this good thing" actually does help them. So people are helping us for a wide variety of reasons, but fundamental to all of it is what is actually being produced. If you want to do a project like this, you really have to consider all of these things.

Then there's implementing these systems. You have to actually build it, you have to test it, you have to have servers, you have to have bandwidth, you have to have all these things. It's profoundly time-consuming, but compared to a billion dollars? Yeah, we'll do what we have to do; that balance seems pretty simple. We have to develop the software, test the software, roll out alpha testing, do beta testing, do the eventual rollout; there are bugs, we fix the bugs, we upgrade to new versions, and then we do the whole cycle again. We maintain the website and the statistics, and this is done by many people; we've got moderators on the forum; there's a whole group of people working on this. And then every time something like the PS3 comes along, you have to do it all again. So there's a large amount of work that goes into something like this; it isn't free in any sense, you're investing a lot of time.

Question: Are they willing to have you spend money on this? They're actually thanked at the end; they do fund part of our research. But getting infrastructure money, my understanding is, is a little harder than getting money to just pay people to do things. (From the audience: It's perceived dimly because it's not original work; the NSF does not like to boat-anchor itself with infrastructure.) Right: infrastructure isn't frontier science; running a website and stats is slogging along doing something you have to do.

Question: Out of curiosity, as far as new platforms, why wasn't the Xbox 360 included, or any of the others? Heat is the 360's reason. AMD, for a while, had a line of chips, I forget how long ago it was but it was quite a while ago, that wasn't cooled well enough. We optimize very extensively, all the way up and down the stack from vectors on up; GROMACS is the main piece of software we use, a lot of people have put a lot of effort into it (it's not actually by our group), and it's highly optimized, so the chips get really hot. The Xbox, which has cooling problems just with games, can't possibly handle that. The PS3 was designed a lot better, so the cooling was taken care of; I think you'd have to literally put a blanket over one to get it to overheat at all. So they were able to do this. Every time we do a new piece of hardware, the vendors help us do it. Even when we were doing really early ATI work, the video cards weren't quite up to it.
But ATI has had their stream programming effort going for years now, so they've been designing their cards to handle this, because when you're actually playing a game your GPU is probably only about 40% active at any given time; running it flat out would actually kill the 360, or it would just kind of crash. I know what they do, they go red or something; I don't have one. The other fundamental issue with the 360 is that the processor that has the power is not available to us via APIs; it's something Microsoft guards internally. It does have, I think, an NVIDIA chip that was pretty high-end at the time, but none of the compilers were there; the compilers for stream computing are only now getting mature, so two years ago that wasn't even possible, we'd have had to do assembly or something. There's no tool chain for it.

So we get this system, and it upgrades itself: people are upgrading their machines all the time, they're buying PS3s, they're buying new GPUs, and they're doing that for us. The system self-upgrades, which isn't something you get if you're colocating. And like I said, we have help: there are moderators on the forum, there's a large group of people who work on this, and all of them do their part.

I mentioned adversaries. You're giving people points, so bad people could want more points. You're trusting people to do your computation correctly. We don't have this problem too much, because if they did the computation "incorrectly" and did something better than we're doing, we'd actually like them to do that. But that's not true of all projects: SETI@home, for example, has to have two people do every bit of work, because the answer is yes or no, is there a signal or isn't there, and there's no other way to double-check it. With protein folding, if someone finds a lower energy state, we'd actually like to know. So there are things intrinsic to the problem we're doing that guard against this. Basically, if you're really good, people aren't going to mess with you too much, and that's a great thing, because we only have so much time to fight back. There is a lot we do to check that people aren't cheating; I'm not going to detail it because it's actively in use, which is unfortunate, but there's a lot of checking going on under the hood.

Question: What kinds of measures are in use; have you had cases? I can talk more about distributed.net. Because we were looking for an encryption key, it was a yes/no answer, so people wouldn't ruin it by just saying no all the time. What they'd do is say no to the same thing over and over and try to get it through the stats system, or have multiple people submit the same no for the same set of keys. If you're being good, people may try to cheat on your stats, but they're not going to mess with your research; it doesn't help anybody, and Windows is much more fun to mess with, so they go that way. And then there are things like rootkits and spyware. Spyware will run at a higher priority than our programs; all these distributed computing projects run at the lowest possible priority, so when you go to watch a movie or do anything else, nothing is slowed down. Spyware is busy sending email at a higher priority, so it actually slows our stuff down.
We can't really do anything about that, but it's something people have certainly noticed: they find spyware because they weren't getting enough points. Another cool trick, and we did this at distributed.net a couple of times, is that when people's laptops were stolen, our software would still be there checking in, sending results. They'd say, my laptop was stolen and it's still running, I'm still getting points, somebody turned it on; can you give the IP log to the police? And we've done this. So there are some cool side effects; I guess we didn't put that in there as a design feature.

Platform trends: hardware is changing, as you know. We've gone from mainframes to PCs, we're in the middle of going from PCs to laptops, and we're starting to go from laptops to small ultraportable devices. The problem from our point of view is that laptops aren't on all the time and aren't network-connected all the time, and the iPhone certainly doesn't have enough power to do our work, so as people transition to smaller, more portable devices, we actually lose hosts. The lucky thing is that as this happens, the PS3 and your set-top box and all these other things that are basically home servers are appearing, so in the future I can see clients for more of these set-top-box-type machines as fewer and fewer people buy a full-scale PC. Green is also a factor: nobody can say green is bad, but if the machines shut off all the time, that's not good for us, so the more aggressive people get about being green, the less they may consider doing things like distributed computing. There are still plenty of people, though; we're still seeing linear growth, people are still signing up every day, so we're not at that point yet, but it's something to think about; eventually it's going to happen.

And the performance of those hosts keeps going up. More people sign up, the machines they sign up are getting faster, the machines they upgrade to are getting faster, and the networks, the DSL connections, are all getting faster. And they're maintaining all of it for us, because they want to use their PCs. So we're riding three or four exponential curves here: this isn't just getting more powerful every day, it's getting much more powerful, rapidly. When we started on the PS3, we were seeing maybe 20x faster than a PC; the GPUs are 50 times faster now, and the next generation will be 100 times faster than an Intel chip. That's a huge jump on that Moore's law curve; actually, I think it just brought us up to where we should have been the whole time, but the system keeps growing and becoming more powerful as time goes on.

So now I'm going to talk about a new project we're going to be launching fairly soon, called Storage@home. The goal is to not only do computation in the distributed way, but to do storage that way as well. We'll monitor the system, we'll repair things that get lost, and this really opens up a whole new set of opportunities and things we can do.
The reason this is really important is that there's a step two to science. First you ask your question, you do a bunch of computation, and you produce all this raw data; then there's this other step where you have to ask questions of it, you have to analyze the results. We've got about 300 terabytes of data so far, and we're growing by 75 to 150 terabytes a year, so this is an explosive data problem for us. What do we do with all this data, and more importantly, how do we go through it to ask those questions in any kind of way that gets us where we want to go in a reasonable time? Right now we often have to wait a week, or even a month, for a really detailed question to be answered from the raw data. And then there's just storing this data, processing it, and backing it up. Right now we're shipping a lot of it over to, I think it's the Pittsburgh Supercomputing Center, which is doing backups for us, and getting it over there, even over Internet2, isn't the easiest thing in the world. So backing this data up is a problem.

Let's look at the current topology; all the distributed computing projects use this basic idea. You've got some kind of load balancing up front, you've got work servers that have computational tasks on them, and in the case of Folding@home we've also got a backup server that results can go to if your assigned server is dead. A client appears out of the mist, we give it a job (wait, am I looking at the wrong slide here? I think I'm one ahead of you; I've got two on my screen), and then it falls off the network; it's non-persistent. Eventually it comes back and says, hey, I'm back, I've got a finished job, and we go, yay, they're back, and they give the job results back to us. Then the job is done and the client disappears into the ether again, and this goes on and on; they'll probably immediately come back and say, I want more work, another job will go out, and the whole thing repeats. That's what's going on right now in pretty much all of these projects.

What we need to do is reverse the flow. We need machines that are online and available to us, so we're asking people to open a port so that we can contact them. You can't do distributed storage if you can't get to the storage when you need it; there's not much use in that. This also allows us to do different types of jobs, quicker jobs, and to change priorities. If you've ever done research: the conference deadline is in eight hours and I need X done right now. That's not something we can do today. With this system we'll be able to say, I need these 10,000 jobs done right now, ship them out with a high priority, and whatever those machines are doing will stop, they'll do the immediate thing, and we get our answer back very quickly. That's flexibility you don't have if people are just checking in.

So here's the new topology, and I'll try to stay on the right slide. You have a presence server, which manages who's online, who's not, who's gone down for a reboot, who's just missing in action. You have a coordinator, which manages both your jobs and your data; this is the master control of the system. You have probably many job servers, machines that have things to do. And then you have a bunch of clients that are online and available to you, with various amounts of storage, and with various jobs already running or queued up.
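As a rough sketch of what that reversed flow buys you, here is a minimal coordinator that pushes the most urgent queued jobs to hosts the presence server says are reachable; the class and field names are invented for illustration and are not the actual Storage@home protocol.

```python
# A minimal sketch of priority dispatch in the reversed flow: because the
# coordinator can reach hosts directly, a batch of deadline work can jump
# ahead of everything already queued.  Names are invented for illustration.
import heapq
from dataclasses import dataclass, field

@dataclass(order=True)
class Job:
    priority: int                      # lower = more urgent
    job_id: str = field(compare=False)

class Coordinator:
    def __init__(self, online_hosts):
        self.online_hosts = online_hosts   # supplied by the presence server
        self.queue = []
    def submit(self, job):
        heapq.heappush(self.queue, job)
    def dispatch_round(self):
        """Push the most urgent jobs to whoever is reachable right now."""
        sent = []
        for host in self.online_hosts:
            if not self.queue:
                break
            job = heapq.heappop(self.queue)
            sent.append((job.job_id, host))     # in reality: open a connection and push it
        return sent

coord = Coordinator(online_hosts=["host-17", "host-42"])
coord.submit(Job(priority=5, job_id="routine-0007"))
# Conference deadline: these go out before anything routine.
for i in range(2):
    coord.submit(Job(priority=0, job_id=f"deadline-{i:04d}"))
print(coord.dispatch_round())   # the deadline jobs go first
```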
So the first thing you can do is send jobs out: you move a job from one of these job servers to a client. Then you can move data around: you want to optimize where it is, to decorrelate the data (I'll cover this more extensively later), to equalize the load, things like that. When jobs are done, they go back to the job server they came from; that's the results coming back, a slightly different path but basically the same idea. And when a job is done, you create more jobs based on what you've learned. So this is the new flow, and it's pretty much exactly the opposite of what was there before.

Let's address the storage part of this. We want to turn all these machines into one gigantic RAID server: multiple copies of the data, this huge, vast amount of storage. It also allows us to run those step-two calculations where the data is: we know where the data lives, so we send the computation to the data instead of bringing the data to us. That's a very old idea, and it's a really good one; if it weren't for firewalls we'd still be doing it all the time. We can also set up bigger mirrors, for example a huge repository at Pittsburgh.

As for scale: we're allowing people to give us 10 gigabytes on a machine. With terabyte drives coming out that may seem kind of small, or it may seem big if you've got an older system, but it's actually based on what it takes to get at the data, repair the system, and maintain the redundancy over an average American DSL connection, which is about a hundredth the speed of Europe's or Japan's. That's about how much we can expect to be able to access and migrate from one person when we need to compute on it. With about 100,000 hosts (you have to open a firewall port and do a few other things to offer storage, so we're not expecting everybody to do this), that's a petabyte of raw storage distributed across 100,000 machines. Now, this is where everybody starts talking about distributed hash tables and erasure codes and all of that. We can't do anything fancy: we need to actually compute on the data, so it's just a complete copy of a file on a given machine. Files run from about 10 to about 100 megabytes in our workload, and you'd want to favor files around that size: a good number of files per machine, but not too many, so it's a pretty good balance. We need these files intact, so there's a full copy of a given file on a given machine, and holding them just adds to the points we give you for helping us. There's actually less motivation to game this, because you're only allowed one storage host per connection: you can set up many machines in your home, but your DSL connection doesn't scale, so it doesn't do any good.

So what we want to do is get copies of this data out and remove all the correlations. We don't want two copies in San Francisco, because there could be an earthquake; we don't want them in New Orleans, because that's below sea level; we want them spread around the world. We also don't want the time zones to be the same: ideally we'd like to be able to assign you jobs at night, when you're probably not using your computer and you've probably told it that it's allowed to help us, so we want things spread out across time zones so that whenever we want to do something, it can happen. You also don't want things, from Stanford's point of view, to be on the same internet segment or route: for instance, if you're in Minneapolis, almost all your internet traffic actually goes through Chicago, because there are very few high-speed links that don't go through Chicago to join the backbone. That isn't something we'll weight heavily, but it's an issue. Or take Stanford's network: you wouldn't want tons and tons of machines sitting at Stanford, or at some other big university, because they're all on the same network, and when that network goes down, you lose them all. The same goes for companies and schools generally: you want copies of a file spread across different ones. And of course the same user or the same team: you don't want to give all four copies of a file to the same team, because if they all get annoyed and quit, you lose the file. That happens; whole teams get up and leave these projects and go to another one, so it's not just people dropping off randomly, there are strong correlations. And of course operating systems matter too.
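One way to picture that decorrelation step is as a scoring problem: penalize candidate hosts for every attribute they share with the hosts already holding a copy. The attributes and weights below are invented for illustration; the talk only says that geography, time zone, network, team, and operating system all matter, not how they should be weighted.

```python
# A rough sketch of "decorrelated" replica placement: prefer candidate hosts
# that share as few attributes as possible with the hosts already holding a
# copy.  Attribute names and weights are invented for illustration only.
from dataclasses import dataclass

@dataclass
class Host:
    host_id: str
    city: str
    timezone: int      # UTC offset
    network: str       # e.g. campus or ISP routing region
    team: str
    os: str

# Hypothetical penalty weights: two copies in one city is worse than two
# copies on the same team, for example.
WEIGHTS = {"city": 5, "network": 4, "os": 3, "team": 2, "timezone": 1}

def placement_penalty(candidate, current_replicas):
    """Higher penalty = more correlated with the existing copies."""
    penalty = 0
    for holder in current_replicas:
        for attr, w in WEIGHTS.items():
            if getattr(candidate, attr) == getattr(holder, attr):
                penalty += w
    return penalty

def pick_host(candidates, current_replicas):
    return min(candidates, key=lambda h: placement_penalty(h, current_replicas))

replicas = [Host("a", "San Francisco", -8, "stanford", "team-red", "windows")]
candidates = [
    Host("b", "San Francisco", -8, "comcast-west", "team-blue", "windows"),
    Host("c", "Berlin", 1, "dtag", "team-green", "linux"),
]
print(pick_host(candidates, replicas).host_id)   # "c": different city, zone, network, OS
```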
And that leads into the two types of failures we care about: a machine going offline, and actual data loss. Machines go offline all the time: every time you reboot, every time your DSL goes down, every time you crash. These are very frequent events for the kinds of computers we're talking about, and they're basically random for the most part; there's no way to really predict them, so you just want to spread things out. But there's a coordinated kind too, and this is actually a really big effect: on Patch Tuesday, every Windows box on the planet reboots, going around the time zones of the world. If you're not aware of it, this looks like a massive cascading failure, but it's not; it's totally normal, it happens every month. Sometimes it's not until Thursday that all the machines crash, because the patch wasn't right, but you have to anticipate these correlated events. When a new Red Hat or Ubuntu release comes out, everybody patches and reboots; whenever there's a new Mac security patch, all the Macs reboot. These look like massive failures; with a Windows one, something like 85% of the machines you have contact with just blink out. You have to be able to anticipate that these are not a big deal: when they happen, it's okay, data was not lost. But if a file gets down to a low threshold, like two copies, you'd want to take action anyway; preemptively say, well, this is getting a little scary, I still want to have at least a couple of copies.

The other mode is the complete failure mode, data loss. This actually doesn't happen that often; it's not very often that you lose a hard drive where it's just completely trashed. You're usually able to recover the data; it's usually not that horrible a deal.
The main thing you're really talking about here is people leaving your project. The churn is about 1% a day, but it isn't uniform: people who have been with you a while are probably going to stay with you a while longer, whereas people who come in, do three days' worth, and then decide "what's this stupid icon in my tray?" and turn it off are much more volatile. Question: How do you know a host is really down? Well, you don't, for 24 hours in our case; I'll get to this a little more, but you consider a host to be first class until it demonstrates it's not, and you trust people who have established a reputation more than machines that just came online. The bottom line is that if you're using geography and all these other attributes and you've decorrelated, it's still really reliable; pretty insanely reliable, actually.

We're doing active monitoring. We'd like to have four copies at all times, which gives us about eight nines of reliability if we decorrelate correctly, and we can tune this later if we want to. It's actually much better than that, because we're not just looking once a day; we're watching the system at all times, and we can repair 1% of damage in the system in about 30 minutes. So we can repair things much, much faster than a 24-hour cycle, which gets us even more nines; but after a few nines it doesn't really matter, because Stanford goes down and then everything's offline. We will have redundant sites for the master and all of these other servers, but if Stanford's internet link goes down, the system is effectively at zero availability, and that actually happens quite a lot more often than you would lose files any other way. You lose availability, but again, that's an outage, not a failure.
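As a back-of-the-envelope check on the "four copies, about eight nines" figure, assume (strongly) that decorrelation makes copy losses independent and use the roughly 1%-a-day churn mentioned above. The exact model isn't spelled out in the talk, but the arithmetic shows why the numbers land in that ballpark, and why a 30-minute repair window helps so much.

```python
# Back-of-the-envelope for "four copies ~ eight nines", under the (strong)
# assumption that decorrelation makes copy losses independent.  This is not
# the actual Storage@home model, just a sanity check on the figures quoted.
daily_loss = 0.01        # ~1% of hosts churn away per day (figure from the talk)
copies = 4

# Probability a given file loses all of its copies within one repair window
# (here: one day, i.e. before the monitor notices and re-replicates).
p_file_lost_per_day = daily_loss ** copies
print(f"per-file daily loss probability: {p_file_lost_per_day:.0e}")   # 1e-08 -> "eight nines"

# Faster repair shrinks the window: if 1% damage is repaired in ~30 minutes,
# a copy is only "missing" for a small fraction of a day.  Crude estimate:
# one copy fails, then the other three must all fail inside that window.
repair_window_days = 0.5 / 24
p_with_fast_repair = daily_loss * (daily_loss * repair_window_days) ** (copies - 1)
print(f"with ~30-minute repair: {p_with_fast_repair:.0e}")
```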
We also try to keep hosts holding about the same amount of data, because when new data comes in, you want to be able to put it wherever you want; you don't want to discover that 95% of your hosts are full and you have to store everything on the other 5%. So we're moving things around at all times to keep that equalized.

And of course you have to talk about what we need here at Stanford to do something like this. You need the metadata for the files, which is about 200 to 1, so if we have 250 terabytes out there, the metadata is something that fits on a single hard drive very easily and can be copied around pretty quickly. Then there's the overhead of pushing out the data we already have, and that's actually a big deal: we've got a lot of terabytes of data, and even with Internet2 that isn't all that easy. It is critical that we have Internet2; otherwise pushing out the old data just wouldn't be an option. Stanford's link to the commercial internet is pretty slow, relatively speaking (still fast compared to DSL), and we can't just saturate Stanford's internet connection for three months; that doesn't go over well with the IT folks, and we've warned them. New data will be stored as it's created: when a job is done and has created one of these big files, the file will just be replicated, and the host will be told, go verify that things are where they should be. We don't have to bring it all back anymore; we probably still will, because we want a copy, or we'll just ship it directly to a backup site. But once the system is in steady state, the bandwidth usage is actually pretty minimal.

So that's the end of the storage section; if you have storage questions, now would be the time.

Question: Do you send data between clients directly, as opposed to taking it through Stanford? Yes; the coordinator can tell hosts to move data around, or move jobs, or anything else. It can essentially delegate authority to machines in the network, with standard PKI, nothing too fancy. So you can move things around, or tell hosts to move things, without coming through Stanford.

Question: Was there a reason something like BitTorrent just wasn't relevant? BitTorrent is a completely different thing. BitTorrent is "I want to send the same data to lots of people"; this is "I want four copies in specific places." There are no bandwidth savings to be had; in fact BitTorrent doesn't actually save any bandwidth, it's just really good at doing swarm transmission and at avoiding ISP fees. Something like BitTorrent is designed for a completely different purpose; this is more standard file-system stuff.

Question: Can this work with random NAT boxes all over the place? One of the boxes on an earlier diagram is the presence server, and that's handling who's online and where they are. Every machine has a specific ID and a public key, so when it checks in and says, I'm over here now, because its IP address has changed (which I think is what you're asking), then we know where it is and its location is updated.

Question: Are you also using the methods that allow two boxes that are both behind NAT to communicate directly with each other? If there's no port opened? No, we're not doing the evil Skype algorithm, or the good Skype algorithm; it seems to be tacitly approved now. We don't need to do that, because we're asking people explicitly to open a port, so it can happen without crazy tricks. Most everybody knows how to do that now, because a lot of games, or Skype, or iChat require you to. Well, most everybody; my mother wouldn't have a clue, but people who run projects like this know how to do this kind of thing. And we're not expecting everyone to do it by any means; we know some people won't know how, and some DSL setups just won't allow it. That's life.

Question: When someone right-clicks that icon and says go away, uninstall, do you have any sort of system that says, let us back up your data since you're leaving, and thanks again for your help? Yeah. When you reboot, it automatically lets us know there's been a reboot, and if you close it, it'll ask whether you're going away. And here's another correlated event: in spring quarter, the university population disappears and goes home to slow DSL connections, or somewhere else. It's a massive event at the end of May and beginning of June; people go offline, they won't lose data, but they're probably going to take two or three days to settle in and set up their computers again. So we'll have a way for them to tell us, hey, I'm going away for a week, I'll be back. That is certainly a design element.
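Pulling together those answers, here is a sketch of the bookkeeping a presence server has to do: stable host IDs with keys, check-ins that update a changing address, announced absences, and a timeout after which a silent host is treated as missing in action. The 24-hour timeout comes from the discussion above, but the field names and code are illustrative, not the actual implementation.

```python
# A sketch of presence tracking as described in the Q&A: host IDs with keys,
# IP updates on check-in, announced absences, and a ~24-hour silence timeout.
# Field names are invented for illustration.
import time
from dataclasses import dataclass

@dataclass
class HostRecord:
    public_key: str
    address: str
    last_seen: float
    away_until: float = 0.0

class Presence:
    def __init__(self, timeout=24 * 3600):
        self.hosts = {}
        self.timeout = timeout
    def check_in(self, host_id, public_key, address):
        # The real system would authenticate the check-in against the host's
        # key (standard PKI); here we just record the claimed address.
        self.hosts[host_id] = HostRecord(public_key, address, time.time())
    def going_away(self, host_id, days):
        self.hosts[host_id].away_until = time.time() + days * 86400
    def status(self, host_id):
        rec = self.hosts[host_id]
        if time.time() < rec.away_until:
            return "away (announced)"          # e.g. "back in a week": no need to re-replicate yet
        if time.time() - rec.last_seen > self.timeout:
            return "missing in action"         # treated as offline after ~24 hours of silence
        return "online"

p = Presence()
p.check_in("host-42", "PUBKEY...", "171.64.0.1")   # laptop moved: same ID, new IP
p.going_away("host-42", days=7)                     # spring-quarter move-out
print(p.status("host-42"))
```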
Question: You're doing things in a sort of batch mode, and your jobs coordinator pushes jobs over to other machines; what happens with the network and compute resources, and is any interactivity involved? You're asking whether the thing we're sending them uses the internet, whether there's interaction? There are no interactive jobs in this; I'm going to get to that in a second when we go through what the jobs do.

Question: We've seen a lot of interesting data recently about silent data corruption; are you going to be able to track that in your storage system? Yes. The verification process runs checksums on the data you hold, so if there's data corruption, we'll spot it, mark the copy invalid, and make new copies. Are we going to be tracking how much data corruption we're getting? I'll say yes, because that's a good idea, so we'll do that. (From the audience: You should look at the paper at the FAST conference in February about silent data corruption.) Oh, was that the drive failure paper? (No, not the drive failure paper; that's another interesting one. The one you need to be concerned about is about the rate of silent data corruption in storage systems.) We'll be sweeping the system for data corruption fairly often, because the overhead of doing that is so low. It certainly is a concern that the data is still there but corrupted, without anyone telling us they lost anything, so it's something we're actively sweeping for.

Question: Are you modeling drive failure, like the standard bathtub models? (From the audience: Actually, drives don't follow those; the other paper says they don't.) We're not modeling drive failure, because we're not actually tracking drives. Are we modeling the other failure modes, to try to see new kinds of patterns, like somebody actively going around attacking hosts? Probably not, but you could look at the statistics. We'll certainly have a whole set of data about a large number of hosts on the internet: we'll have internet outage statistics, we'll have reboot statistics, a whole new data set. Luckily we have a distributed system to process that data, so we can look at it and see what's in there; we'll have all those statistics, every time something goes down, every time something is lost. Could we tie those to specific hardware? I guess if we wanted to look more carefully at what kind of hard drive you have, or things like that, we could certainly collect that. Whether Comcast is throttling? Oh, we know they're throttling; ISPs are throttling. But yes, we could certainly collect all that data and make it available to people.

Question: Just to clarify, this is storage for the protein folding work; it's not storage-at-home in a general sense? Yes, Storage@home is specific to the protein folding data. The more general project will come later, when I'm trying to get tenure. (From the audience: Don't go there; there's a huge history of failure of distributed storage systems, going back to the mid '90s.) We'll see if this one works. And you're right, the history of failure in these is very high; the element that's usually lacking is the "be good" part, and you can usually trace the failures back to that. (It's very difficult to get the reward structures to reward the correct kind of behavior.) Right, but we're just talking about jobs and data specific to the protein work. (No, I think what you're doing is fine; I'm thinking of the general case of trying to build a distributed storage system, which has been tried.) Oh, certainly, and it's almost always failed; it always has to be special-purpose.
There is a service, AllMyData, that's trying to get going, and that's based, I believe, on an open-source distributed file system that does many of the things you do and a few more. Generally, though, this is not about technology; it's about people's motivations for the right behavior and the difficulty of preventing people from being motivated to do the wrong thing. Yes, and distributed storage systems like that are very open to malign behavior. That service has some good security features built in to make sure that the wrong people don't get data, and they also have erasure coding, so they don't have to duplicate everything to get all of the nines and so on. Yeah, there are certainly other ways to do this, I'm sure. Are they actually storing your data on other people's machines, or is it centralized? Yes, they store many people's data on many people's machines. It will be interesting to see if that works, but like you said, there's a history of that exact idea not working very well, and motivation is indeed a tricky problem. Yes, there's a system called Samsara which tried to do this and pretty much demonstrated that it's really, really hard to get the motivations right.

So I'm going to move on, because I want to get through a bunch of slides here. Let's talk about the jobs we're doing in this new system, now that we've flipped all the arrows around. In distributed systems you should really just think of these as batch jobs. They're not interactive; there are deadlines, but they're not time-critical or interactive or anything like that, and they run at idle priority. They're batch jobs. Every job in the system has a header that carries a bunch of data about the job, the things the scheduler and other components need to know; after that it's really just data and state, and the state part is important. When I want to do a research project, I create some jobs, I create them in suspended mode, and I dump them into my local collection of jobs. Then the coordinator comes along, you let it know there are some jobs over here, it collects the things it needs to know, and it starts sending them out. Each job is tagged with a bunch of requirements: I need to run on a Windows PC, because the science code has only been ported to that; I need to run on a GPU, because it's not available anywhere else; and, importantly, I need to run on a specific host, because that's where the data is and I created this job to work on that data, so it needs to be one of these four hosts. There are size limits on all of these: how much memory it's going to take, how much temporary storage it's going to take. So there's a bunch of batch job parameters: how long I estimate it will run, based on a benchmark machine, and then the priority relative to other work; if I'm up against one of those conference deadlines, I'd give my jobs a very high priority. The coordinator optimizes all of this from a central location, moving jobs around, moving them from sitting idle somewhere to actually being run somewhere. And one of the things about a distributed system is that you've got to be doing checkpointing all the time; you never know when these random failures are going to happen, so you checkpoint every ten minutes, five minutes, whatever you've decided is reasonable.
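As a rough sketch of what such a job header might contain, here is a Python dataclass. The field names, defaults, and states are illustrative guesses based on the description above, not the real Folding@home or Storage@home job format.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Optional

class JobState(Enum):
    SUSPENDED = "suspended"   # newly created jobs start out suspended
    QUEUED = "queued"
    RUNNING = "running"
    DONE = "done"

@dataclass
class JobHeader:
    """Metadata a coordinator needs to place a batch job; the payload
    (science input and checkpointed state) travels separately."""
    project: str
    os_required: Optional[str] = None    # e.g. "windows" if the code is only ported there
    needs_gpu: bool = False
    host_affinity: list = field(default_factory=list)  # run only where the data lives
    memory_mb: int = 512                 # expected memory footprint
    scratch_mb: int = 1024               # temporary storage needed
    est_hours: float = 1.0               # runtime estimate on a benchmark machine
    priority: int = 0                    # bumped up near a conference deadline
    state: JobState = JobState.SUSPENDED
```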
So checkpointing is always happening, but these checkpoints are actually in the job format, so they can be moved around. You're not checkpointing in the classical process-migration sense, where you take a snapshot of your entire memory and ship that around; these are more like a save format, very small compared to the memory image. The coordinator can also look around and see whether you're still running what it gave you, whether it's still going on, and basically see the job queue of the entire system.

The mirror image of jobs is clients. Each client has a set of capabilities, just as a job has a set of needs: it runs a given operating system, it has one or more CPUs, it might have a CPU and a GPU, and if it has four CPUs you could run an SMP client on it. It has limits the user has set on the amount of memory it's allowed to use; say I only have 512 megabytes of memory because it's an old machine, I don't want to get assigned a job that needs a gigabyte, that doesn't work too well. On the other end, if I have a ton of memory, I want to be able to say, hey, you can use a lot of memory and it's not a big deal. I want to say what times of day it can run; maybe this is a machine at a company, or I actually use my computer during the day and don't want things running then, so I leave it on overnight, switch off the monitor, and let it do whatever it needs to do. And every client has a history: how reliable is this host, does it turn things in on time, is it slower than it should be, in which case we shouldn't give it long-running jobs that may not finish by the deadline. So we build up a model of each client. Then, for instance, if a client is going to go offline at a given time, we can step in beforehand and move the job to another machine, if we've decided it's a high-priority thing and we really want it done soon rather than waiting twelve hours for that host to start up again. Again, the job coordinator is just the brain running all of this, but the key is that you can do global optimization of the system; you get the most throughput if you can move things around, balance data, and so on. It also has a list of projects, saying these projects are more important than others: get these out sooner, get those out later, don't worry about this one. It's a fairly classic job scheduler; there's not a lot of magic here. But we're going to start adding some magic, maybe some machine learning, because we know so much; once we start learning a profile of a given client we can do more advanced things. That will be future work.

The key to all of this is that it gives us new capabilities, things we can't do with the current system. Even though we have lots of hosts, the things we can do with them are kind of routine: you have to set up a project, you have to wait for clients to come in, and so on. With this we can ask the step-two questions: what does the data say, what attributes are in this data we produced? Maybe I just want to pull an energy calculation out of a ton of results; I can do that. The other nice thing is that, because these ports are open, we can migrate data to other researchers too.
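Continuing the job-header sketch above, here is a minimal, hypothetical example of how a coordinator might match a job's needs against a client's advertised capabilities and reliability history. The scoring and field names are invented for illustration; the talk does not specify the actual matching logic.

```python
from dataclasses import dataclass, field

@dataclass
class Client:
    """What a volunteer host advertises, plus what the coordinator has learned about it."""
    host_id: str
    os: str
    cpus: int = 1
    has_gpu: bool = False
    memory_limit_mb: int = 512       # owner-set cap, e.g. an old 512 MB machine
    allowed_hours: set = field(default_factory=lambda: set(range(24)))  # hours it may run
    on_time_rate: float = 1.0        # fraction of past jobs returned before the deadline

def can_run(client: Client, job: "JobHeader") -> bool:
    """Hard constraints: OS, GPU, host affinity, and the owner's memory limit."""
    if job.os_required and client.os != job.os_required:
        return False
    if job.needs_gpu and not client.has_gpu:
        return False
    if job.host_affinity and client.host_id not in job.host_affinity:
        return False
    return job.memory_mb <= client.memory_limit_mb

def pick_client(clients, job):
    """Among eligible hosts, prefer the most reliable one, a stand-in for the
    per-client model the talk describes building up over time."""
    candidates = [c for c in clients if can_run(c, job)]
    return max(candidates, key=lambda c: c.on_time_rate, default=None)
```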
If somebody wants a copy of a data set, we just tell them, hey, you're going to need this much space, and then we run the same data migration process we would use for repairs or load balancing and push them all the data. That's nice, though since they're a single endpoint we actually have to avoid denial-of-servicing them; we can send them data very quickly, and if they have Internet2 connections we can get them all the data very quickly. That matters, because then you have more people working with the data you worked so hard to get. And of course we'll get statistics about the internet itself and about hosts, and hopefully we'll get some cool stuff out of that too.

So that's the end of the jobs part. In summary, we're finally putting storage into our computational infrastructure and enabling this further analysis and exploration, and hopefully the protein research people will be able to work even faster, and they'll be happy, which I'd like them to be. I have to thank, of course, all the contributors, all these people out on the internet who help us; the staff of distributed.net and Folding@home; Vijay Pande, the PI in our lab, who does so much work to make this happen; and the whole lab crew. I'm not going to name anyone, because then I'd leave someone out, but the entire lab puts a lot of time into this. We have forum moderators who do a ton of work interfacing with people who have questions; without them there's no way we could handle that number of people. And of course the funding sources: the donors, NSF, NIH, and a bunch of others. So, more questions?

What kind of monitoring and management tools do you have at your end, and, on the flip side, do you have sophisticated debugging tools; how do you know when things go wrong? If something goes wrong in the field, we'll just get an error code saying this happened. We do a lot of debugging and testing before we send code out to run, but, as we're seeing now with the beta of the GPU client, stuff happens and we don't always know what happened. Debugging in a distributed system is really difficult; you do a lot of testing and hope for the best, and generally you find the bugs and things get pretty stable after a month or two.

I just wonder, if I were running a for-profit organization that wanted to do computation, would it actually make sense to use this model? Presumably I don't get my free Internet2 connection, and I also want things faster and more predictable; and even in your comparison at the beginning, I would think the total number of machines you're using is different from the number of machines you could harness within one cluster, so the equivalence isn't obvious; it might be that with some number of machines in a cluster you get more done, more quickly. So if you're for-profit and you have the kinds of problems that don't need distributed methods or that work well in clusters, and there are tons of those, if I figure I'm going to make a fortune doing protein folding, it's really more effective for me to just build a cluster locally. If I don't have the free internet, I don't have the luxury of taking forever, because I'm paying real salaries to real people, and there's time to market and all that. Generally, if you're a for-profit company, you just build a cluster or you rent time; that is the model. Well, I know that's the practice; what I'm asking is whether it makes sense to do what you're doing in that situation or not. Yeah, I don't think it makes sense to do this type of thing if you're for-profit, because the number of volunteers you're going to get, even if you pay them, is going to be far, far less; if you pay them, their incentive to cheat goes way up.
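A toy sketch, in Python, of the rate-limited bulk push mentioned at the start of this answer, so a single receiving researcher's link isn't overwhelmed when many hosts send at once. The function name and the bandwidth cap are illustrative assumptions, not part of the described system.

```python
import time

def push_chunks(chunks, send, max_bytes_per_sec=10_000_000):
    """Stream data chunks to one receiving endpoint while capping the
    average bandwidth, the same kind of pacing a repair or load-balancing
    transfer might use."""
    sent = 0
    start = time.monotonic()
    for chunk in chunks:
        send(chunk)                      # e.g. an HTTP PUT or a socket write
        sent += len(chunk)
        elapsed = time.monotonic() - start
        min_elapsed = sent / max_bytes_per_sec
        if min_elapsed > elapsed:        # ahead of the allowed rate: pause briefly
            time.sleep(min_elapsed - elapsed)
```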
So it's a social problem, not a technical one? It is a social problem; it has nothing to do with the technology. There's also the problem with payments right now: the transaction costs. You pay a dollar just to transact a dollar. Yeah, that's a killer. The transaction costs are high, so something like Amazon is really good for that situation; they're a good corporation that everybody loves. So even if everybody volunteers at the same level, does it still make sense? Does it not make sense to build your own if you can do this? You know the numbers really well; a billion dollars a year to run this would be a lot more than what we're spending. We've got research grants as funding, and they're a little less than a million dollars. Well, I'm not sure I'd buy that comparison, because if I've got a bunch of multicore blades connected by a 10-gigabit network sitting in my own facility, then I can run these jobs much, much faster, provided the algorithms you're using can exploit everything I have, and I'm not paying the commercial cost of an Internet2 connection. You say you have 250,000 machines, but what fraction of those machines do you actually have? You only have a hundredth of each machine, so you really only have some smaller number. No, those are active. Yeah, those are active machines, but you don't have the whole machine. If I have a data center with 250,000 machines, I've got better bandwidth and I've got the whole machine. Actually, we do have probably a large fraction of each of those machines; maybe that answers your question. We're running at idle priority 24 hours a day, people are trying to meet deadlines, so it's not really ever off, and almost all the software you run uses only a pathetic fraction of your CPU, so we're running all the time. And it depends on where your bottleneck is. Yeah, the bottleneck is the thing: if you have an algorithm that needs intercommunication, you can't do this, it's not possible, so there's no question, the algorithm just won't work. You have to change your algorithm first to one that doesn't need intercommunication, and then you can consider something like this; until you change your algorithm you need a cluster, and tons of algorithms are like that.

There's a middle ground, too, between this and the for-profit question, and that's the grid computing fad, which you don't hear much about these days. It's cloud computing now. What's it called? It's cloud computing now. Cloud computing; there's a whole slide's worth of terms that have been applied to what I do over the last ten years. But within a large pharmaceutical company, one that actually does have 10,000 hosts, they can reach not 275,000 machines but maybe 1% of that, which I'd guess is like $10 million worth of computing power, just with a grid approach internally. And companies do do this; companies that have 10,000 PCs on desks do this. Actually, we do know that you lose half your machines, because you're doing each computation twice. No, we're not; something like SETI@home does that. Okay, but if you were, effectively you're getting half of the machines because you're doing redundant computation, so you lose a factor there, roughly a factor of two, and add the fact that you don't have the whole machine, and there's your upper limit: you really don't have 250,000 machines, you have 50,000 machines.
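The back-of-the-envelope behind this exchange, under the questioner's assumptions; both the redundancy factor and the usable fraction are the questioner's numbers, not measurements, and the speaker disputes the redundancy for Folding@home. They are shown only to reproduce the 250,000 to 50,000 figure.

```python
# Questioner's rough model of "effective" volunteer capacity.
volunteer_hosts = 250_000
redundancy_factor = 2      # if every work unit were computed twice (disputed above)
usable_fraction = 0.4      # assumed share of each machine doing useful work

effective_hosts = volunteer_hosts / redundancy_factor * usable_fraction
print(f"effective hosts: {effective_hosts:,.0f}")   # 50,000 under these assumptions
# Even at that efficiency, roughly $1M/yr of grants versus ~$1B/yr of rented
# equivalent capacity is about a thousand-fold price difference.
```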
Okay, and that's still a large number. There's no "A is better than B" here; there's A and there's B, and depending on what you want to do, one is probably better for you. Right, and the fact that there's, say, 75% overhead pales against being a thousandth or a ten-thousandth of the price. Right, or a hundred-thousandth of the price, depending on what you're talking about; you've got to figure out which is better for a given situation. Good, fast, cheap, exactly; and we forgot control, which is more important than all three of those.

Another issue, if you're doing commercial work, is that you don't want your data out there, and you probably don't want your code out there either; it's probably some internal, top-secret, patented algorithm, and you certainly don't want to send your code out for someone else to be using. Nobody said you couldn't do encrypted computation. It turns out doing encrypted computation is really, really hard and applicable to a very small number of problems. I guess you didn't work on that; I'm surprised, actually. Why encrypt the computation at all; why not do the so-called Russian mafia thing and just go out there and use all the available laptops, who cares whether or not you have permission? So actually the commercial side of this, sadly, and I took the slide out, is the spyware guys: the spammers, the spam-bot networks of 50,000 machines that don't ask and are definitely not good. That's where your spam is coming from. In fact, my email address somehow ended up in one of these Russian networks as a "From" address, so I'm getting thousands of bounced mails an hour. They have a thousand times the resources we have; they're paid a lot of money, there's big money in these spyware and spam-bot networks, and they're certainly doing many of these things, so you have to watch them. I'm not watching them to fight with them; I'm looking for good ideas, because they've got hundreds of people working on these things full time. Yeah, alternative funding ideas. They just have more people thinking about this problem, and their constraint is that they have to be sneaky about it, but some of those sneaky methods you can make not sneaky, and they're really good ideas. So you have to watch what the black hats are up to, because they come up with good stuff just as often as the white hats.

Do you have a market for new projects or ideas or code, even within the protein level? I don't actually do the protein folding part myself, but the people in the group are always working on new algorithms, always coming up with new stuff and pushing the technology further. GROMACS is available, and I know Vijay wants to make more of our custom things available; he's announced this, so I can say it: we're probably going to start releasing more of our code, or the client, more of a complete thing you could actually run on your own, and I think the PS3 code is the first thing we're going to put out there. We need more competition, we need them to speed us up, is how I think about it, and not in a condescending way; Folding@home is really big and powerful, and we're doing stuff that's really on the edge, and we'd like other people to be able to do that too. I think we're out of time, so I'll end it there.
For information on other online Stanford seminars and courses, please visit us at stanford.edu. [Music] The preceding program is copyrighted by Stanford University. Please visit us at stanford.edu.
Info
Channel: Stanford
Views: 46,603
Keywords: science, electrical, engineering, math, computer, technology, distributed, system, algorithm, folding, Markov, model, share, network, data, server, storage, learning, cloud, computing, clustering, processing, failure, client, Storage@home
Id: 7zafB2GkMBk
Length: 77min 58sec (4678 seconds)
Published: Fri Sep 26 2008