Oral History of Leslie Lamport - Part 1

Captions
Levin: My name is Roy Levin. Today is August 12, 2016, and I'm at the Computer History Museum in Mountain View, California, where I'll be interviewing Leslie Lamport for the ACM's Turing Award Winners project. Morning, Leslie.

Lamport: Good morning.

Levin: I want to start with the earliest time that you can remember being aware of computing.

Lamport: Aware of computing? Probably in my junior year in high school. I think I came across a book that discussed Boolean algebra, and it must have discussed computer circuitry a little bit, because I got interested in trying to build a computer. I didn't get very far, but I think it was my junior year in high school, possibly senior year.

Levin: So that would have been what, your '55, '56? And you were not so much aware of the state of computing devices at that time, rather more the mathematics behind computing circuitry?

Lamport: Well, I mean, I must have been aware of the existence of computers. I seem to recall they existed in the movies. Wasn't it The Day the Earth Stood Still? There was, yes, a computer, or maybe just a robot; how the robot worked was another matter. I think the first time computers sort of burst upon the public visibility might have been the election in which they were used for the first time to make predictions. Was that 1960? I don't know. There was a period of time, probably when I was an undergraduate and going through a large amount of graduate school, when I was unaware of television.

Levin: So let's in fact talk about what happened after you became aware of these things in high school. Did you explore computing through high school and into your undergraduate years?

Lamport: Well, I tried to build a little computer. This was while I was in high school, and I visited the IBM headquarters in New York, and they were very helpful to this little kid who wanted to build a computer. They gave me a number of vacuum tubes; in those days they would replace them at regular intervals, and so they were still good. And I think I got to the point, with a friend, of
building a 4-bit counter. But that impressed the interviewers when I applied to MIT, so it had a good effect. But my first serious exposure to computers came the summer after I graduated from high school. Through some kind of city program that I forget, I wound up having a summer job at Con Edison, New York City's electric utility, and I was assigned to do some very boring stuff. But I learned that there was a computer there, and I went up to the computer center and managed to get myself transferred to a job running tapes on their computer system. For the younger generation: tapes used to be a storage medium for computers. They would be read on these large tape drives, and they would also serve as input and output, so you'd have to mount the input tape for a program, run the program, and then remove the output tape and give it to whoever wanted the results. So that was my job: mounting tapes, probably starting the programs, and stuff like that. But in my spare time, and in the computer's spare time, I learned to program it, and the first program I can remember writing computed e to something like 125 decimal digits, that number being chosen because the computer I was running on, which was an IBM 705, had an accumulator of 125 or so digits, and so I was able to do that basically without having to do multi-word arithmetic.

Levin: So I guess it was natural that your first program involved an interesting mathematical entity, since your interest in mathematics, from what I gathered from your resume, goes back even further than your interest in computing.

Lamport: Yes. I guess, in some sense, to the extent that an elementary school student can be a mathematician, I was a mathematician throughout my school years, even though the idea of actually being a mathematician as a profession never occurred to me until around my senior year in high school. But even so, it seemed like not a practical way of making a living, and so I started out being a physics major, and maybe my first or second
year I switched to math, although I'm not sure if the reason was that I really thought that mathematics was a career move. But at MIT in those days, mathematics was the only major that didn't require you to write an undergraduate thesis.

Levin: Practical as always. So you actually were thinking at some level about careers coming out of high school and going into your undergraduate days, it sounds like.

Lamport: Not in a very concrete sense, in terms of "oh, I think if I become a physicist I will be able to get a job doing thus and such," although it was clear that, becoming a mathematician, the only thing I knew that mathematicians did was teach math. So I suppose by the time I switched to a math major in college, my career goal would have been to be a professor of mathematics. And I should mention that throughout my studies I had part-time and summer jobs programming. In my undergraduate years I went back to Con Edison for summers, this time as a programmer, and also worked part-time in the computer lab at MIT, and that was also running tapes. I also remember working as a programmer for someone in the business school.

Levin: So programming and math and physics all were interwoven during your undergraduate days, but you thought of them as sort of separate activities in some way?

Lamport: Well, they were interwoven temporally, but they were completely different worlds as far as I was concerned.

Levin: And when did you first notice, or become aware, that there might be some, let's call it science, related to computing, as distinct from it just being a technology for calculation or something?

Lamport: That's a difficult question. I really can't answer that. There was some point in my career at which I suddenly realized that I was a computer scientist, but I can't say when. It probably happened before I went to SRI, which was in 1977. The first thing I published was a letter to Communications of the ACM commenting on an algorithm that had been published. It was an algorithm for implementing a
hash table, and I discovered some very slight improvement one could make. Then, thinking about the problem, I realized that there was a simpler way to do the hashing, and started collecting data on efficiency or something or other. And when I was in the midst of doing that, someone published a paper in CACM with that method, and for quite a while that was the standard method for implementing hash tables. But I had discovered, I think, three variants, or techniques, for implementing the idea that weren't mentioned in that paper, and I wrote another paper just describing those three methods, and it was returned by the editor saying it was just not interesting enough to be worth publishing. In later years, two of those methods were each published in a separate paper in CACM, but by a different editor. Just an amusing little historical footnote.

Levin: Well, you began to learn pretty early on how the vagaries of editing and review and publication work. That's a topic I actually want to come back to later, but I want to continue now with a more chronological view of things. You mentioned you went to SRI, but before you went to SRI you were at COMPASS for some years. How did you come to be working there?

Lamport: When I went back to graduate school to finish my degree, I needed to get a job to support myself, and somebody, actually a professor at Brandeis, where I was a graduate student, recommended I check with COMPASS, whose official name was Massachusetts Computer Associates. And they gave me a job, which first supported me through the rest of my graduate studies. When I got my degree, I was all set to take a teaching job at the University of Colorado at Colorado Springs, and COMPASS said, "Wait, would you like to keep working for us in the new California office we are going to set up?" And I thought about it and decided, yeah, that sounded good. And so by the time I had actually gotten around to being ready to go to California, they said, "Whoops, the California
office has been delayed, but that's okay, you can go out to California and work for us." Which I did, and I worked for them for another five years, I think. They never did open a California office, but I kept working for them from California and would visit the home office outside of Boston for about a month once a year.

Levin: That's very interesting, an early example of telework, which is of course these days a more common notion. What kind of work were you doing? Maybe it would be useful to talk a little bit about the kinds of projects that COMPASS did and what the business was.

Lamport: Well, COMPASS's main business was building Fortran compilers, but they also had random contracts with the Defense Department for doing different things, and with some companies. One of the things I did was design a file system for a computer that was being built by Foxboro; COMPASS had the contract for doing software for that computer. I don't remember exactly what that was. I suppose it would have been the first concurrent program I actually worked on, if the idea of concurrent programs had been known; instead it actually worked by using interrupts.

Levin: I'm intrigued by the remote working arrangement, which in those days, it seems to me, would have been quite a difficult way to develop software. Can you say more about how you actually did that?

Lamport: Oh, I'm sorry, that Foxboro project was while I was still living in Massachusetts.

Levin: I see.

Lamport: At the time I had a friend who had some land in New Mexico, where we would take a week or two of vacation and stay with him, and what I proposed to COMPASS was that I'd go spend, I forget whether it was one or two months, there, working on this problem. And they said okay. I did that, and while I was there I wrote a tome, I guess you'd call it a paper, although it was handwritten and maybe about this thick in sheets, and presented it to them. And this, I think, blew their minds, the notion of using linear algebra. I learned later, from observation, that this tome that I delivered was
like Moses delivering the stone tablets. It was practically a sacred text that people studied, and, you know, they did use it to build the compiler. But what it did serve to do was reassure the people at COMPASS that I could go off by myself, without supervision, and actually do something useful. So they sent me off to California, and I don't remember what my actual assignments were, if I even had any, but I was being paid on some government contract. It was, I think, about six months after I moved to California that I developed the bakery algorithm, and I think from that point it became clear to COMPASS that I was a researcher that they were going to support. My memory is that I was left pretty much to myself to figure out what I was going to do, and they would find contracts to support me.

Levin: So the bakery algorithm is obviously a very important milestone in your career, and I want to talk about it quite a bit later, but for the moment I want to pursue this notion that COMPASS, even though they didn't really formally have a research organization, certainly not in California, were in effect supporting you as a researcher. Is that right?

Lamport: To the best of my memory, yes.

Levin: Uh-huh, okay. So it seems like it was a fairly open-minded research environment, if you want to call it that, at least as far as your interests were concerned.

Lamport: Well, I think from COMPASS's point of view, they were happy to work for anyone who would pay them, and they figured out that the Defense Department would pay them to have me do research, and so that's what they did.

Levin: So the bakery algorithm, you invented that, this is, we're talking early '70s now, right?

Lamport: Yeah.

Levin: And it got published, if I remember, in '74. Is that correct, something like that?

Lamport: That sounds about right, yes.

Levin: So COMPASS was also comfortable with the notion of independent publication in those times, that you could go off and publish, presumably on your own. Is that right?

Lamport: Oh, sure. This notion of
intellectual property didn't really exist in a practical sense in the computer industry, at least in terms of any kind of research, in those days. I think it was probably even before the days of software patents; I'm not sure of that. It certainly would have been before the days of patenting an algorithm. So no, there was no notion that anything would not be published, and of course anything that was published effectively served as advertisement for COMPASS, so they were quite happy with that.

Levin: In fact, wasn't it in those times that algorithms more or less routinely got published in the Communications of the ACM, as sort of, people throwing out "here's my algorithm"?

Lamport: There was an algorithms section in CACM. That's not where I published. Those weren't algorithms; they were probably somewhere between algorithms and code, because, I mean, you had to submit a working program, and how could I submit a working program on the bakery algorithm when multiprocessor computers essentially didn't exist?

Levin: So let's talk a little bit more about the bakery algorithm at this point. Can you tell us a little bit about how you came to be interested in the problem?

Lamport: Oh, that I can tell you exactly. In 1972 I became a member of the ACM, and one of the first CACM issues that I received contained a mutual exclusion algorithm. I looked at that, and it seemed to me that it was awfully complicated; there should be a really, you know, simpler solution. And I decided, "oh, here's a very simple solution for two processors," wrote it up, and sent it to ACM, and the editor sent it back saying, "Here is the bug in your algorithm." That taught me something. It taught me that concurrency was a difficult problem, and that it was essential to have proofs of correctness of any algorithm that I wrote. And, well, of course it got me mad at myself for being such an idiot, and determined to solve the problem, and in attempting to solve it I came up with the bakery
algorithm. And of course I wrote a careful proof, what for those days was a careful proof, and in the course of writing the proof I discovered that the bakery algorithm had this amazing property: it did not require atomicity of reads and writes. I understood that was a really cool thing, and I'm not sure at which point it made me realize that I had really solved the mutual exclusion problem, because I had written an algorithm that didn't require any underlying mutual exclusion, something that one article by Brinch Hansen said was impossible. So it felt nice having solved an impossible problem.

Levin: So you wrote up the algorithm and submitted it with proof. Do you recall how the editor of the journal reacted at that point? Was it a straightforward "oh great, I'll accept it," or something more complex?

Lamport: It was straightforward. He sent it out to referees. I don't remember any significant comments from the referees; it was published without a problem.

Levin: So this was a piece of work you were doing while you were employed at COMPASS. Did you share it with colleagues there?

Lamport: I did. In fact, the next time I was out at COMPASS I showed it to a colleague, Anatol Holt, and I got a reaction I don't think I've ever gotten from anyone else, which told me that he appreciated that algorithm more than anyone else had. Because when I showed him the algorithm, he said that couldn't be right. And, you know, I showed him the proof, and he listened to the proof, he went over it, and he said, "Well, I don't see any hole in this algorithm, but it must be wrong, and I'm going to go home and think about it." And I'm sure he went home and thought about it, and of course never found anything wrong with it. But I was very impressed by this visceral approach that he took to the issue. Holt, by the way, was not just an advocate but a true believer in Petri nets, and he was also interested in what I guess we would now say were fundamental issues in concurrency. And every
year, for about three years, I would go back to COMPASS, and Holt would have his new theory of concurrency, and he would give some lectures on it. It would start off, you know, in some completely different way, but then it would always wind up at Petri nets. And it was all very lovely, you know, carefully reasoned, but it just struck me as being not very useful, because I realized that Petri nets had nothing to tell me about the bakery algorithm.

Levin: Was that a conversation that you had with him, about your study of the bakery algorithm and the fact that his methodology for talking about concurrency didn't work with it?

Lamport: No, I don't think we ever had a discussion about that. I learned some, I think, important basic things from him, and I'm not sure what they were, but I think the experience of thinking about things in terms of Petri nets was in fact useful. Holt is probably best known for being one of the inventors of marked graphs, which are a particular class of Petri nets that are quite interesting. They have very nice, very simple properties, and in fact are a useful model of a lot of real-world computation. It's just that they weren't going to be useful to me as a general way of looking at concurrency.

Levin: Coming back to the bakery algorithm itself, I've read in one of your perspectives on it that you thought it was the one algorithm that you discovered rather than invented. I'm not sure I understand the distinction. Can you elaborate on that?

Lamport: I'm not sure I understand the distinction either, but there just seemed to be something fundamental about it. Maybe the basic idea of the use of a counter, a sequence of numbers that can in fact increase forever, that you will see elsewhere in things I've done: in the "Time, Clocks" paper, for example, the use of timestamps, that kind of counter, and in the whole state machine approach, and Paxos. Well, in Paxos there is also a counter that's being used, not quite in the same way.
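[Editor's note: the ever-increasing counter Lamport describes is the heart of the bakery algorithm itself: each thread takes a "ticket" one greater than any ticket it can see, and ties are broken by thread id. As a rough illustration only, not code from the interview, here is a minimal sketch in Python; the thread count, iteration count, and variable names are illustrative choices, not anything Lamport specifies.]

```python
import threading

N = 3                    # number of competing threads (illustrative)
choosing = [False] * N   # True while thread i is picking its ticket
number = [0] * N         # thread i's ticket; 0 means "not competing"

def lock(i):
    # Take a ticket one larger than any ticket currently visible.
    choosing[i] = True
    number[i] = 1 + max(number)
    choosing[i] = False
    for j in range(N):
        # Wait for thread j to finish picking its ticket...
        while choosing[j]:
            pass
        # ...then defer to j while its (ticket, id) pair is smaller.
        while number[j] != 0 and (number[j], j) < (number[i], i):
            pass

def unlock(i):
    number[i] = 0        # retire the ticket

counter = 0              # shared state the lock protects

def worker(i):
    global counter
    for _ in range(100):
        lock(i)
        counter += 1     # critical section
        unlock(i)

threads = [threading.Thread(target=worker, args=(k,)) for k in range(N)]
for t in threads:
    t.start()
for t in threads:
    t.join()

assert counter == N * 100
```

[Two threads reading `number` concurrently may draw the same ticket; the comparison on the pair `(number[j], j)` breaks the tie deterministically, which is why, as Lamport notes above, the algorithm needs no lower-level mutual exclusion. In CPython the interpreter lock happens to make the individual list operations atomic anyway, so this sketch demonstrates the logic rather than the non-atomic-reads property.]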
But it seems like there's some fundamental thing going on there, and I don't know what it is exactly, but it's certainly been quite useful to me.

Levin: All those algorithms that you mentioned are things that we'll talk about some more, because I think that's been a very significant thread in your work, starting with the bakery algorithm. You've also said it was the one you were proudest of. Can you say why?

Lamport: No, I can't.

Levin: Well, that doesn't need to have a rational basis, I suppose. So that happened while you were in sort of independent-researcher mode...

Lamport: Let me try to answer the question of why I'm proudest of it. I think with other things that I've done, I can look back and see that an idea developed from something else, sometimes leading back to a previous idea of mine, very often leading to something somebody else had done. But the bakery algorithm just seemed to come out of thin air to me. There was nothing like it that preceded it, so perhaps that's why I'm proudest of it.

Levin: Well, certainly mutual exclusion is an absolutely foundational problem in concurrency, and to have solved it as you did in the bakery algorithm is certainly something worthy of being proud of. As part of the publication of the bakery algorithm you included the proof that you talked about. Was it common for papers at around that time, in the early '70s, to include proofs with algorithms that were published?

Lamport: I don't know about algorithms in general, but mutual exclusion algorithms, which were pretty much the only concurrent algorithms published in those early days, approximately... it's the most significant concurrent programming problem, and yes, Dijkstra's first paper had a proof. Interestingly, the second paper, which I believe was a note submitted to CACM, didn't have a proof; Knuth then published a third algorithm, pointing out the fact that the second one was incorrect, and I'm sure he had a proof as well. And the mutual exclusion algorithms always
contained a proof. So for concurrency, at least in those days, that was the norm, even if it wasn't the norm perhaps for other kinds of algorithms.

Levin: Right. And your work evolved from the bakery algorithm quite a bit, but was often driven by what you had learned in studying it, and in fact the further study of that algorithm, I think you've written, was something you did for a number of years. Did that study lead you into the approaches for verifying correctness that you eventually spent quite a bit of work on?

Lamport: My interest in correctness ran parallel to my interest in concurrent algorithms. I should mention, I don't want to give the impression that I was drawn to concurrency by these fundamental problems. The fact of the matter is that they were just really cool puzzles. When I was at Con Edison, I had a manager who became sort of a mentor to me, and he would give me problems. One problem he posed to me: in those days programs were on punch cards, and you would load the punch cards into the hopper and press a button on the console, and the computer would read the first card, and then, with that card in memory, execute that piece of memory as a program. So the first card would usually be something that had just enough of a program on it to load the rest, you know, a few more cards, so you'd have a decent-size program that you would then execute to load your program, then start that program executing. Well, I remember he posed the problem to me of writing a loader that worked all on a single card, and I worked on that problem, and I would present him with a solution, and he'd say, "Yeah, that's good. Now make your loader also have this property." And we went through a few iterations like that, and I just loved those puzzles. I mean, one thing I'm also proud of is a program I wrote for the IBM 705. When your program crashed, you would
take a memory dump and also, presumably, print out as much as you could of the registers, to kind of find out what went wrong. And of course starting that debugging process meant putting a punch card into the hopper, and so the problem I faced was, well, if you do that, it's going to erase part of what's there, and executing the program is going to erase some of the registers whose contents you want to print out. So, you know, how can you do that? There was one register whose only function was as a destination for a move operation; that is, you say "move" and you move a piece of data from some location to the location specified by that register. So the problem was, how do you read out that register? Because you could put something in the register that could go anywhere in memory, so you'd have to look through all of memory, but it could also land on top of the program that was trying to find it. So I figured out how to write a program such that, even if the data transmitted by the move instruction landed anywhere inside your program, it would still work. And, you know, I still think that was a really neat solution.

Levin: What did your boss think?

Lamport: I don't remember. I must have shown it to somebody, but I don't remember the reaction.

Levin: I was wondering if they actually ended up using it; you could take credit for writing the debugger.

Lamport: Oh, no, I don't think anyone cared enough about that register to worry about it. But the thing about concurrency is that it added a whole new dimension of complexity, and so even the simplest problem became complicated. It's very much like trying to write a program where something could transmit a value into the middle of your program: in concurrency you're trying to get a process to be able to do something even though another process could be doing anything
else that that process might do. So it posed that same kind of difficult puzzle, and that's what attracted me to concurrency, more than a desire for, you know, fundamental theory. Once I started working on concurrency, then I became interested in what I might call fundamental problems in concurrency, but initially it was just fun puzzles.

Levin: Very interesting. That's great. So let's continue this sort of chronological thread a little bit. You worked at COMPASS until '77, is that right?

Lamport: Yes.

Levin: And then you moved to SRI, right? How did that come about?

Lamport: Oh, for some reason that I never understood, the people at COMPASS said that it was a nuisance having me in California, and they said I had to move back to Massachusetts. I didn't want to move back to Massachusetts, so I found a job in the Bay Area.

Levin: And how did you happen on SRI?

Lamport: I knew people at SRI. I think SRI and Xerox PARC were the only places I knew of that did computer science research, and I applied to both places, and PARC didn't make me an offer and SRI did.

Levin: And what was SRI's business at around that time? What kinds of things were going on there?

Lamport: Well, SRI was very much like COMPASS in the sense that if you could find somebody to pay for it, you could do it. But the major project, the obvious project for me to work on, and I suppose they had it in mind when they hired me, was the SIFT project, which was to build a fault-tolerant computer system to fly airplanes, essentially. It was a NASA contract, and that was the contract that led to the Byzantine generals work.

Levin: So let's talk about SIFT a little bit more. That was already underway when you joined SRI, and it seems, when we think back to the state of aviation in the late '70s, the notion of having computers flying an aircraft was pretty audacious. Do you know how that contract came about?

Lamport: Well, I don't think it was such an audacious idea, and the motivation for it, as I
understood it, was that this was not long after the oil crisis of the early '70s, where, because of an embargo, or some kind of embargo, it's not clear exactly what was permitted and what in fact wasn't, there was an oil shortage in the US, and lines at gas stations. That didn't last very long, but it made people think about trying to save gasoline, and one way that they were able to save gasoline, jet fuel, was in designing the airplanes. Well, in designing an airplane you want to minimize the friction, and a large source of friction is the stabilizers, the horizontal and vertical stabilizers that control the plane and keep it flying straight. They realized they could cut down friction and save fuel by reducing the sizes of those control surfaces, but doing so would make the plane aerodynamically unstable, and unflyable by a human being, so only a computer would be able to fly it, because you had to make corrections to the control surfaces, you know, every few milliseconds, and if you didn't, you could get vibrations and the wings would fall off. I think another motivation was that reducing the size of the control surfaces also made the plane really maneuverable, so you could build fighter jets that could do tight turns and all that stuff, if you could get computers to fly them. So I think those were NASA's two motivations.

Levin: And the project was underway, as you've said, when you joined. What was your role in that project when you joined it?

Lamport: Well, my general role was member of the team, you know, engaging in the conversations and lots of decisions and stuff. But perhaps the question you're asking is: which of the results that are still known as the Byzantine generals work did I contribute to, and which were there when I arrived? What was there was the notion of the problem, namely achieving consensus in the face of completely unreliable processors, and the solution that did not
employ digital signatures was known, and also the impossibility result, the result that you couldn't do it with three processors, you needed four. The algorithm for four processors, I believe, was invented by Rob Shostak before I arrived, and the generalization to 3n+1 processors was done by Marshall Pease. That was an amazing piece of work of Pease's; I don't know how he figured out that algorithm. Nobody could understand it. I mean, I could read the proof and follow the proof, and I was convinced that it was correct, but I still couldn't say that I understood it. What I contributed to the original paper was the algorithm with digital signatures, which was actually something that I had done before I got to SRI. I wrote a paper, I think it was published in '77, and nobody noticed it, which actually generalized the "Time, Clocks" paper to the problem of implementing a state machine in the presence of arbitrary failures. Since in those days it wasn't clear what you would consider a failure, I just considered arbitrary failures. These days most work on failures assumes that processes fail by crashing, but there wasn't a clearly right model, or a clearly useful model, in those days. So I had an algorithm that would tolerate arbitrary failures, including processors that acted maliciously. I didn't discover the impossibility result, because it doesn't apply if you have digital signatures: with digital signatures you just need a majority of processors to be working correctly. The other thing that I hadn't done was to pull out the consensus problem. Now it's clear that if you're trying to build a system that works correctly, you have to choose a sequence of operations that the system should perform, and so you have to perform consensus on choosing each of those commands. And I hadn't pulled out the consensus problem as something to be considered separately, as sort of a
building block, and I'm not sure why, but I think in retrospect it's because the solution to that problem, the one using digital signatures, came so quickly to me that I didn't think much about it. The reason the digital signature algorithm came from me is that in those days hardly anybody even knew about digital signatures. I happened to be a friend of Whitfield Diffie, who of course, with Marty Hellman, published around 1976 the seminal paper, "New Directions in Cryptography," that introduced public-key encryption and started the whole modern field of cryptography. It was, I think, around '74 that I was having coffee at a coffee house with Whit, and he proposed the problem to me. He hadn't realized it was an important problem; he didn't know how to solve it. And I said, "Oh, that doesn't sound very difficult," and I thought a minute, and literally on a napkin I wrote out a solution involving one-way functions, and that was the first solution. Whit and Marty published it in their paper, crediting me, of course. So I was one of the few people in the world who knew about the digital signature problem, and I knew that it was possible to create digital signatures. Although, in the context of fault tolerance, creating digital signatures should be trivial, because you're only worried about a signature being forged as a result of a random failure, not as a result of a malicious processor. So it should be fairly easy to devise some algorithm that would guarantee that the probability of a forged signature is infinitesimal, although I've never had occasion to think long and hard about that problem, and I don't think anybody else has either. So it's in some sense an unsolved problem: creating an algorithm that you can prove, in the absence of a malicious failure, has a suitably low probability of being broken by a random failure. At any rate, that was the situation when I got to SRI. To that original paper I contributed the digital signature algorithm, and perhaps
the other important thing that I did was that I got them to write the paper, because I realized that this was an important result, an important problem, and publication, especially outside of academia, didn't seem to have as high a priority. As a member of Xerox PARC you may remember that: I mean, there was a lot of great research being done but not very much publication, and it was known as the black hole of computer science. Anyway, I sat down and I wrote a paper on it, and I remember Rob Shostak not liking the way I wrote the paper at all, and so he sat down and rewrote it. I imagine that something would have gotten published eventually, but I just think it certainly got published sooner because I spurred them into action. In the second paper, the only really new result was an understandable version of the general algorithm without digital signatures, a recursive version, and I think that was one that I discovered. But the real contribution of the second paper was introducing the name "Byzantine generals," which is what it's of course become known as ever since. By the way, you should ask me about the origin of the name at some point. I will. Just to wrap up on SIFT: how did that project conclude? Did the system get built? Did it get deployed? A prototype system was built; it was, I think, run on a simulator. I'm not sure if it ever flew in an airplane. There was a bit of controversy at the end of the project. The project was to include a complete verification of the code, and there was some controversy about that. The short story is that, through carelessness, the paper that was published about it did not make perfectly clear what had been verified, mechanically proved, and what hadn't. I think in the particular case, the code that had been verified didn't include some optimization that was later made, so the final system included code that had not been
mechanically verified. There was this general sense, in the criticism of it, that this was this multi-million dollar project and they had lied about what they had done. The fact of the matter was that this project was used somewhat as a cash cow at SRI in our group, and the actual proof of the code was an onerous task that nobody wanted to do. I managed to get off the project before it became my turn to do that, and finally, I believe it was Michael Melliar-Smith and Rob Shostak who, in a really Herculean effort as the deadline was approaching, wrote the proof and ran it through the prover. It was really, I would say, a baling-wire-and-chewing-gum type of project as far as this proof was concerned, rather than this grand effort. And I think that although the people were careless in what they published, this was not a serious offense, and I think the project was quite successful. Do you know if it had any direct impact on the way avionics systems were subsequently designed? Well, I don't know how avionics systems are designed. The Byzantine generals work certainly did. I was actually curious about that, and sometime, maybe 10 or 15 years afterwards, I happened to be contacted by someone who had been at Boeing, and I asked him if he knew about the Byzantine generals work. He said yes indeed; he said he had been the one who first read our paper and said, "oh, that's what we need." Though he did say that this was on the commercial side, and he did say that Boeing had just bought McDonnell Douglas, I believe, who were doing the military aviation side, and from what he said, at that time the people at McDonnell Douglas were somewhat clueless about these things. I don't know what goes on these days at Boeing; I have no idea what goes on at Airbus. I happened to visit Airbus and was told that they use a primary-backup system rather than the SIFT-type design, which is I guess what you'd call peer-to-peer these days, and that basically handling the faults was the job of the hardware people; it was done in hardware. And you can build a primary-backup system that correctly handles Byzantine faults. On the other hand, I've seen occasions where engineers have very happily presented solutions to provably unsolvable problems, so I don't know; I'm not positive that the engineers who built the Airbus know about the Byzantine generals results. But Airbus is very secretive about what they do; you or I can't find out. And you don't let it affect your flying habits? One thing that I have observed, well, let me give you one little anecdote. As you remember, at Digital's SRC lab, where we were, there was a reliable distributed file system that was built, called Echo, and that actually motivated me to discover Paxos, because I was convinced that what they were trying to do was impossible, and I was trying to find an impossibility proof, and instead I found an algorithm. But there was no algorithm behind that code; there was just a bunch of code. And I'm sure that there were kinds of failures that Paxos would handle that the Echo code didn't handle, but the Echo system ran perfectly reliably for how many years, five years or something like that, and it only crashed once, and the reason for that crash had nothing to do with the algorithms; it had to do with a very low-level programming detail, that buffers got overfull. And so, whether or not the Airbus engineers handle all possible Byzantine faults, I think that in practice they are probably good engineers, and they probably thought about the problem enough that if there are failure modes that would cause their algorithm to fail, they are sufficiently improbable that they're not the most likely cause of an airplane
crash. So, you were at SRI for seven or eight years, and then you moved from an organization that was fundamentally a contract-project organization into what I think we might call a corporate research lab, at Digital Equipment Corporation. How did that move come about? Oh well, the actual details are that we were in a group that had, I don't know, a dozen people or so, and at some point they decided that we needed another level of management: in addition to the lab director and assistant director, there was now, for these twelve people, somebody with some managerial title. I just said that was ridiculous, and Bob Taylor had just started the SRC lab; it was a natural place for me to go. So you came into what must have been a fairly small group of researchers at that time, since the lab was just getting started. It didn't strike me as being small, because it was probably larger than the group at SRI where I was. But there wasn't a single project you were being brought into, as there had been at SRI? No, it was clear that I was being brought in to become a member of the lab. There was much more of a sense of community at SRC than there was at SRI, which may seem strange, because SRC was a much larger place. But as someone said at SRI, "grant proposals are our most important product," and so there was always this question of which project you were working on, and there was no sense of interaction between projects. So in a sense SRI was more splintered than SRC, especially in the beginning, when there was one project, which was building the computing environment. My expectation was that I would be continuing the same line of research I had been doing, but also interacting with other people in the lab, because they were building the computing system, and so they would have concurrency problems, and so I expected to be useful. So how would you characterize the most important product of SRC, by analogy with what SRI said? It clearly wasn't grant proposals. Well, I'm getting out of my depth, because this is a question of why Digital opened the research lab and what Digital expected to get out of it. Of course, as you know, the kernel of the people who started SRC were people who came from Xerox PARC, where they had just created personal computing, and so there was very much a sense that this was what the lab was going to continue doing: creating the computing of the future. That was the purpose of the lab. So there was a qualitative feel to that organization that was different from SRI's? That's right, definitely, but I think that had a lot to do with Bob Taylor, the lab director and lab founder, who was just a wonderful manager, and, as you know, he was your mentor. Indeed. Did you consider, when you were looking to leave SRI, other places to go, perhaps academia? I did talk to Berkeley and Stanford, but I must say I was never really serious about it. I did not see myself as an academic, so it was clear to me that SRC was where I was going to go. Say a little bit more about that: what is it about the difference between SRC and academia that attracted you to SRC more than the university? Well, there were two things at play; one was personal and one was institutional, you might say. Personally, I had never taken computer science seriously as an academic discipline, and in fact, when I started out, I don't think there was enough stuff you could call computer science, certainly not in my field, that merited being taught at a university. This may be just my own background, since I had studied math and physics, and perhaps if I had been an engineering major at MIT I would have had a very different perspective and I would have considered
programming to be a perfectly fine academic discipline, but it never seemed to me that people ought to be going to a university to learn how to program. And by the time I had begun to feel that maybe I actually could go to a university and teach academically respectable things to people, I was too set in my ways; I think that would have been in the 90s or so. So that was one reason why universities didn't interest me. The other thing is that I found industry to be the best source of problems. In the early days of concurrency, starting in the late seventies, and it's the late seventies and early eighties that we're talking about, a lot of academic research was just self-stimulated: people, especially theoreticians, contemplating their navels and deciding what should happen. I've always found that the real problems, the important problems, were motivated by people in industry having things they wanted to do. I mean, SIFT was an example of that; it wasn't sponsored by industry, but it was very much a real-world project. I mentioned Paxos, which was inspired by a real system being built at SRC. The bakery algorithm was not an industrial problem, but it was clearly important in programming. Another one, Disk Paxos for example, another algorithm, happened because there was a DEC engineer in Colorado who said he wanted to build a system where, instead of having to have three processors, he wanted to have three disks instead to get the redundancy, and that led to Disk Paxos. Fast Paxos, sorry, fast mutual exclusion, came because the people at WRL, the Western Research Lab, another DEC lab, were building a computer, and they needed a mutual exclusion algorithm that had certain properties. So yes, I've just always found industry to be the best source of problems. Not all problems; sometimes, I suppose, the most successful problems are ones where you think of the problems before industry realizes they have them. But it's certainly my experience that working with industry, and even before that, programming, has been much more useful than most of what I learned in school. So, since '85, was it, when you joined SRC, until the present day, you've been in research labs, first at DEC and Compaq and now at Microsoft, and that interaction of real-world problems and research has continued through essentially this whole portion of your career. Is that fair to say? Yes. It's sort of an irony, and a failure, that although I look for real-world applications and real-world problems, I didn't interact with the people at SRC that much. I think that I must have a prickly exterior that didn't encourage people to come to me with the problems they were working on, and I wasn't active enough in going out and finding people's problems, because I always had things of my own that I was working on, and it was always easier to stay in my office and keep working on those than to go out and talk to people, extracting their problems from them. More often a problem would come from somebody outside the lab who knew about me, who knew about my work, and came with the problem. You've had a lot of external collaboration around problems and their solutions with people in academia, even though you didn't want to be an academic yourself. So there are evidently at least some people in academia that want to attack real-world problems. I don't feel like I've collaborated with that many people; certainly compared to most computer scientists, I think I have more single-author papers. I collaborated quite a bit with Fred Schneider, and I guess that's because Fred has, I think, a very similar sense of practicality to mine. I collaborated on a couple of papers with Eli Gafni; they both happened while Eli was visiting SRC, and one of them was the Disk Paxos problem, which I said came
externally. The other one was on the mailbox problem, and that was a problem that I had had myself, a problem that had occurred to me many years earlier, and Eli's being there was just an opportunity for it. There was the collaboration with the people at SRI doing SIFT; that was in some sense the one time where I was really part of a group that was actively collaborating on doing research. So I'd like to shift gears now. We've done a kind of somewhat meandering exploration of your career, chronologically, or at least part of it, but I think I'd like to start talking about some of the themes that have run through your career. We've touched on a few of them briefly, but I'd like to get into more detail, and the natural one to start with is probably the one that you worked on earliest, which we talked about a bit: the bakery algorithm, and concurrency and concurrency algorithms. You've said that quite a bit of your work grew out of the bakery algorithm, studying it and learning from it. Well, let me ask you this way: probably the next famous paper that you produced was the "Time, Clocks, and the Ordering of Events in a Distributed System" paper, which is perhaps the one that's most widely cited. Did that grow out of the bakery algorithm study? Not directly. It had a very specific origin. This was while I was working at COMPASS, and from some connection, I don't remember why, I received a technical report written by Bob Thomas and Paul Johnson which was about replicated databases. They had an algorithm in there, and when I looked at the algorithm, I realized that it wasn't quite right, and the reason it wasn't quite right was that it permitted systems to do things in ways that seemed to violate causality; that is, one thing that happened before something else could wind up influencing the system as if they had happened in the opposite order. Now, the reason I realized that has to do with my background, with my interest in physics. I happen to have a very visceral understanding of special relativity, in particular the four-dimensional space-time view of special relativity that was developed by Minkowski in the famous 1908 paper. And I realized that the problems in distributed systems are very much analogous to what's going on in relativity, because in relativity there's no notion of a total ordering of events: two events will appear to different observers to happen in different orders. But there is a notion of a causality ordering between two events, in which one event precedes another if, for every observer, it will appear that that event preceded the other. The basic definition is that if an event happens to one person, and another event happens to another person, then the first event precedes the second if it's possible for a message or a light wave to be sent from the first person when that event happens and arrive at the second person before that second event happens. I realized that there is an obvious analog of that in distributed systems, which says that one event at one processor happened before an event at another processor if there was a message, or a chain of messages, that began at the first processor after the first event and reached the second processor before the second event. And it's that notion of causality, of happened-before, that was being violated in this algorithm. I made a fairly simple modification to the algorithm to correct that, and that's the algorithm presented in that paper, along with the definition of the partial ordering relation. Now, I should mention that one of the things the algorithm used, that the Johnson and Thomas algorithm used, was attaching timestamps to messages: each process has its local clock, and it time-stamped each message with the time on that clock before sending it. People have credited me with inventing timestamps, which I didn't, because I got them from the Johnson and Thomas paper. What I didn't recognize at the time, when I received that paper, was this: I just assumed that timestamps were a standard technique that people used in distributed computing; this was the first time I'd even thought about the problem of processors communicating by sending messages. So I didn't point out that timestamps were used in the Johnson and Thomas paper, and so their paper has been forgotten, and it remains as one of the four citations in my time-clocks paper, one of the others being the Minkowski paper and another one being an earlier Einstein paper. At any rate, that's just a footnote. The other thing I realized, which I believe Johnson and Thomas didn't, is that this algorithm was applicable not just to distributed databases but to anything, and the way to express anything is a state machine. What I introduced in this paper was the notion of describing a system by a state machine. Now, a state machine is something that's really simple: it's started in an initial state, and it then takes steps which change its state, and what it can do next can depend on incoming messages, in this case, or on the environment, and on the current state of the machine. Now, the ideas of state machines, of finite state machines, are ancient, dating to the 50s or so, but they were all used with finite states: the state had to be a fixed bounded set, like a certain collection of registers or something. It was very clear to me and to everybody that finite state machines were not terribly interesting as ways of talking about computation or about problems, but if you remove the restriction about their being finite, then they're completely general.
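The two ideas he has just described, logical clocks that advance past incoming timestamps, and a state machine driven by commands in timestamp order, can be put together in a small sketch. This is an illustrative reconstruction under assumed names, not the paper's own pseudocode:

```python
class LamportClock:
    """Logical clock per the time-clocks rules: tick on each event, and on
    receiving a timestamped message, advance the clock past the timestamp."""
    def __init__(self):
        self.time = 0
    def tick(self):
        self.time += 1
        return self.time
    def receive(self, ts):
        self.time = max(self.time, ts) + 1
        return self.time

class Counter:
    """A trivial (and not finite-state) state machine: a state plus a step rule."""
    def __init__(self):
        self.state = 0
    def apply(self, cmd):
        self.state += 1 if cmd == "inc" else -1

# Each process stamps its commands with (clock, process_id); sorting those
# stamps yields one total order consistent with happened-before, and every
# replica applies the commands in that order, so all replicas agree.
p0, p1 = LamportClock(), LamportClock()
log = [((p0.tick(), 0), "inc"), ((p1.tick(), 1), "inc")]
p1.receive(log[0][0][0])             # p1 learns of p0's command
log.append(((p1.tick(), 1), "dec"))  # so this command is ordered after p0's
replicas = [Counter(), Counter()]
for stamp, cmd in sorted(log):
    for r in replicas:
        r.apply(cmd)
assert replicas[0].state == replicas[1].state == 1
```

The point of the sketch is only the shape of the construction: any deterministic state machine could replace the counter, which is the generality he describes next.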
And you can describe what any system is supposed to do as a state machine. What I realized and said in that paper is that what any single system is supposed to do can, in a sense, be described as a state machine, and you can implement any state machine with this algorithm, so you can solve any problem. As the simplest example I could think of, I used a distributed mutual exclusion algorithm, and since distributed computing was a very new idea, this was the first distributed mutual exclusion algorithm. But I never took that seriously as an algorithm, nor do I take the time-clocks paper's algorithm as necessarily the solution to all problems, because although in principle you can solve any problem this way, that is, building a system without failures, since it didn't deal with failures, it's not clear that the solution would be efficient, and there could be better solutions. In fact, I never thought that my distributed mutual exclusion algorithm would in any sense be an efficient way of doing mutual exclusion. Well, the result of publishing the paper was that some people thought it was about the partial ordering of events, some people thought it was about distributed mutual exclusion, and almost nobody thought it was about state machines. As a matter of fact, on two separate occasions when I was discussing the paper with somebody and I said, you know, the really important thing was the state machines, they said to me, "there's nothing about state machines in that paper," and I had to go back and look at the paper to convince myself that I wasn't going crazy and that it really did mention state machines. I think over the years the state machine concept, thanks in part to Fred Schneider talking about state machines, has gotten across to people. The way I met Fred Schneider is that he sent me a paper, I don't remember what it was about, that he had
written; he may have been a grad student or possibly, I think, a young faculty member. He was working somehow along similar lines, and so I sent him the preprint of the time-clocks paper, which I think subsumed what he had been doing, but I thought, this guy's pretty good, I should stay in touch with him. I was right. So this paper has been very widely cited, obviously because you did a foundational thing, and whether people really understood that or not, somehow they decided it was the place to point to in their own papers. Do people continue to, I mean, this paper was a long time ago, almost 40 years now since it appeared, do people continue to discover the paper and learn things from it? Well, what amazed me is that I recently learned that there are physicists, or at least one physicist, who takes the paper very seriously and says it's very important in developing a theory of the physics of time. Now, this boggles my mind. I regard what I did in that paper, at least the partial ordering that I defined, as being very obvious, because I understood special relativity, and I figured, well, it seems really magical to most computer scientists because they don't know any special relativity. But of course the physicists understand special relativity, and they thought it was a seminal paper. So that got me thinking why, and I finally figured out, for the physicists, and I confirmed it with one, that the advance my paper made was this: Minkowski was talking about, in some sense, messages that could be sent to define his relation, and I changed that to messages that actually were sent. This seems to me to be perfectly obvious, but to the physicists this was a seminal idea. So it seems that things that seem obvious to me do not seem obvious to other people. I've never understood quite
why the paper is considered so important in computer science, because the algorithm it presents is of no particular interest; it has historical significance. But I think that somehow what people get out of it is a way of thinking about distributed systems, one that is so natural to me that I considered it obvious, that is not obvious to them, and it gets them thinking in a different way. This may be related to one difference between me and, I think, almost all researchers in concurrency, certainly the theoreticians, which is that in most of the papers that people write, and for most of the people I talk to, they think of something like mutual exclusion, for example, either as a programming problem or as a mathematical problem. I think of it as a physics problem. The mutual exclusion problem is the problem of getting two people not to be doing something at the same time; that's physics. I look at things from a very physical perspective, which I think gets me looking at the essence of problems and not getting confused by language the way a lot of people do. I've observed something in computer scientists that I call the Whorfian syndrome, after the Whorfian hypothesis. Whorf was a linguist early in the 20th century whose hypothesis was that the language we speak influences the way we think. Well, what I call the Whorfian syndrome is the confusion of language and reality: people tend to invent a language, and they think that that language is reality, and, for example, if something can't be expressed in the language, then it doesn't exist. Let me give you an example. I've talked about state machines; well, you can model a program as a state machine, and if you model it mathematically, basically what you're doing is giving a mathematical description of a state machine, and so the state machine has to consist of the complete state of the system. Now, if you consider a program, part of the state of that state machine describing the program is what I call the program counter: it's the thing that tells you where in the program you're executing, which statement you're executing next. Well, there's no way of talking about the program counter in most modern programming languages, and so to people who are hung up on programming languages, the program counter doesn't exist. And there are cases when, in order to reason about programs, particularly concurrent programs, you have to talk about the program counter. I've talked to people who say that's wrong; one theoretician who works in programming languages said that's wrong, you simply shouldn't do it (the programming-language jargon for "wrong" here is "not fully abstract"). But he was saying that my method of reasoning about programs, which involved the program counter, which you basically have to use, must be wrong. And a lot of people have created methods that, since you do have to talk about the program counter, get around that by talking about things that can act as proxies for the program counter. So that's just an example of what I mean by the Whorfian syndrome, and I think there are a lot of people who for years have worked on problems that were meaningless outside of the context of the language in which they've been talking about them. That has not been my problem, I believe. I mean, the Whorfian hypothesis may apply to me perfectly well, but I don't get hung up on the language I'm talking about; I look at the physical reality that underlies these things. And it's that way in concurrency problems. I think the reason why things that I've done have turned out to be significant years later is not that I have any more foresight than anybody else, no better way than anybody else of predicting what's going to happen in the future; it's the fact that I've dealt with real problems, problems that have a real physical sense, that are independent of a
particular language in which you're talking about them, and it's those problems that have a good chance of turning out to be useful, whereas problems that are just an artifact of the language you're talking about are not going to seem very interesting when people stop using that language. That reminds me of another paper that you did at about the same time, not in the same thread as the time clocks or the bakery algorithm, but about a problem that was very real at the time: the glitch. There, again, it's a very physical thing that enabled you to bring some insight that eluded others. Can you talk about that a little bit? Sure, let me explain the glitch, which is more formally known as the arbiter problem. Imagine, well, okay, it's described in some paper in terms of a housewife, and in those days "housewife" was not politically incorrect: the milkman comes to the front door and the postman comes to the back door, and they both knock at about the same time, and you have to decide which one to answer. So it's a problem of making that decision. Philosophers have known of that problem under the name of Buridan's ass, where instead of the housewife having to decide whether to go to the front door or the back door, you have a donkey who has to decide which of two bales of hay to go to, each equally distant from it. It was realized in the early 70s that computers act like this housewife or this donkey and have to make a choice between two different things. The way it happens in computers is that you have a clock that's inside the computer, and you have some event that's external to the computer, maybe a message comes in or something, and the computer, which is doing one thing at a time, has to decide whether the clock pulse tells it to go to the next instruction or the external interrupt says to stop what you're doing and do something else.
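A toy way to see the arbiter's dilemma: model the circuit as a bistable element left with a small random imbalance when the two inputs arrive nearly together; the imbalance grows exponentially until the output resolves to 0 or 1, and the closer the tie, the longer that takes. The growth law and all parameters below are illustrative assumptions, a Monte Carlo sketch rather than a circuit simulation:

```python
import math
import random

random.seed(0)

# Assumed toy model: two near-simultaneous inputs leave the arbiter balanced
# at a small random offset x0, which then grows as x(t) = x0 * e^(t/tau)
# until |x| crosses a threshold and the output is a clean 0 or 1.
TAU = 1.0        # time constant of the bistable element (assumed)
THRESHOLD = 1.0  # |x| at which the output has resolved (assumed)

def undecided_fraction(wait: float, trials: int = 100_000) -> float:
    """Fraction of trials still metastable after waiting `wait` time units."""
    undecided = 0
    for _ in range(trials):
        x0 = random.uniform(-1.0, 1.0)  # random initial imbalance
        # resolved once |x0| * e^(wait/tau) exceeds the threshold
        if abs(x0) * math.exp(wait / TAU) < THRESHOLD:
            undecided += 1
    return undecided / trials

# The undecided fraction falls roughly as e^(-wait/tau): waiting a few extra
# time constants makes a lingering "one-half" output vanishingly unlikely,
# but no finite wait makes it impossible.
for wait in (1, 3, 5, 8):
    print(wait, undecided_fraction(wait))
```

In this model there is always some x0 small enough to outlast any fixed deadline, which is the unbounded-time difficulty, while the exponential decay is why waiting a few clock pulses works in practice.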
It turns out that if you don't design your circuits correctly, what could happen in that case is that instead of saying either "do A" or "do B," which in computer terms means either produce a one or a zero, it produces a one-half. Computer circuitry is not very happy when it sees one-halves: it causes different parts of the computer to do different, weird things, and your computer crashes. I think it was some DEC engineer who, at a workshop that was held on this problem, got up and said that a computer he designed would crash about every couple of weeks because of this problem. And it turns out this problem is, in a sense, unsolvable. What's unsolvable is this: you can come out with a zero or a one, but you can't do it in a bounded length of time, which means the computer is screwed, because it has to do it by the next clock cycle, and if you do it within a bounded length of time, you have a possibility of getting the one-half. But you can build circuits that will almost always do it pretty quickly; as a matter of fact, the probability of not having made the decision within a certain length of time decreases exponentially with the time. So in practice you just wait a little while, and the chances of getting a one-half are negligible. That's what computers do: they don't try to decide whether the event happened at the same time as the current clock pulse, but whether it happened before the clock pulse three pulses later, or something like that. So if engineers know about the problem, they design the circuits so that it's not an issue; engineers who don't know about the problem design circuits that will fail because of it. I hope that these days all engineers know about the problem, but if
an engineer doesn't know about the problem, he or she will be able to "solve" it: they'll give you a circuit that they say solves it, and of course it doesn't, it just pushes the problem to a different part of the circuit. So that's the glitch, or arbiter, problem. When I wrote the bakery algorithm, someone at COMPASS pointed this problem out to me and said, well, your bakery algorithm doesn't require atomic operations, but it still requires solving this arbiter problem. That's what got me interested in the arbiter. Should I go into the history of the papers, and the publication and non-publication?

Well, I'm interested in what the impact was of the work that you did, and maybe published, on the arbiter problem.

OK. What I did was mention the problem to Dick Palais, who was my de facto thesis advisor at Brandeis and has become a friend. The reason people think you should be able to solve the arbiter problem is that the argument that you can't is based on the real world being continuous, and people will argue, well, the real world isn't continuous, because you can actually build this arbiter that goes to zero or one. Well, it doesn't work; we don't know how to build one within a bounded time, and it still introduces a discontinuity. What Dick realized is that with the right definition of continuity, it in fact is continuous: there are theorems that tell you, basically, that if a system is the solution to a certain kind of differential equation, then it is going to be continuous, and since we believe the world is described by those kinds of differential equations, the result applies. Therefore this is a mathematical proof of the impossibility of building an arbiter that works in a bounded length of time. So we wrote a paper on that and sent it to an IEEE journal, maybe Transactions on Computers or something like that. At any rate, this was mathematics, and the engineers had no idea what to do about it, so they rejected the paper. I should mention that this arbiter problem has a long history of not being recognized. Charlie Molnar, a hardware engineer who worked on the problem, said he submitted a paper on it and it was rejected, and one reviewer wrote: I'm an expert on this subject, and if this were a real problem I would know about it, and since I don't know about it, it can't be a real problem, so the paper should be rejected. Literally, that's the story Charlie tells. So our paper was rejected from Transactions on Computers; I think by that time the engineers knew about the arbiter problem, but they still didn't know what to do with this paper. A few years later a similarly mathematical paper was published in Transactions on Computers, but the result wasn't as general as ours.

I also wrote a paper called Buridan's Principle, which generalized this beyond computer-type issues to arbitrary real-world problems. For example, suppose someone is driving up to a railroad crossing that just has flashing lights, and the lights start to flash. He has to decide whether to stop or to keep going, in a bounded length of time, namely before the train gets there, and that's provably impossible: it is impossible to avoid the possibility that he will be sitting on the railroad tracks when the train arrives. I remember from my youth a train accident in which, I think, a school bus, for no apparent reason, went through a flashing light and was hit by the train, in perfect visibility, and nobody could explain it.
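The impossibility being described can be made concrete with a toy model (an illustration of the idea only, not a construction from the papers): model the decider as a continuous system with positive feedback, and the time it takes to commit to a decision grows without bound as the input approaches the tipping point.

```python
THRESHOLD = 0.5   # hypothetical tipping point between "stop" and "go"

def decision_steps(u, gain=1.1, limit=1_000_000):
    """Steps until a positive-feedback decider commits to stop or go."""
    x = u - THRESHOLD          # continuous internal state
    steps = 0
    while abs(x) < 1.0:        # not yet committed to either outcome
        x *= gain              # feedback amplifies whichever side it's on
        steps += 1
        if steps >= limit:     # still hung: effectively metastable
            return limit
    return steps

# The closer the input is to the threshold, the longer the decision takes;
# there is no bound, and exactly at the threshold it never decides.
for eps in (1e-1, 1e-3, 1e-6, 1e-9):
    print(f"input {THRESHOLD + eps}: {decision_steps(THRESHOLD + eps)} steps")
```

No bounded deadline can be met for all inputs: for any bound, an input close enough to the threshold takes longer, which is the content of the continuity argument above.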
It occurred to me that maybe that could be an explanation. I have no idea whether that kind of thing, which is possible in principle, has a high enough probability to be a real problem, but it occurs everywhere; airline pilots, for example, have to make similar decisions, and they have to make them quickly, and accidents could actually occur. So I wrote a paper, and I wanted to get it to the more general scientific community, so I submitted it to Science, the publication of the AAAS. It was rejected. There were four reviews, and they literally ranged from "this is a really important paper that should be published" to "surely this paper is a joke," with the two others in the middle, one somewhat positive and one somewhat negative, and a fifty percent split was enough to get it rejected from Science. Usually when I get a paper rejected, if there was a good argument for rejecting it, I will not try to publish it again, but I felt this was worth doing, so I sent it to Nature, which is the British equivalent of Science. I got back a letter from the editor, who had read the paper and raised an objection, which showed that he saw it as a reasonable problem to discuss for someone outside the audience I was familiar with. I sent back the answer to his objection; he came back with another issue, and I said, yes, that's a good point, but here is the way to weigh that. There were two or three exchanges like that, and I was never sure whether he was really serious or whether he just thought I was a crackpot and was humoring me. After that exchange I got an email from some other editor saying that the original editor had been reassigned to something else and my paper had been assigned to him, and he came back with basically one of the questions I had already answered, and at that point I just let it drop. Finally, the paper was on my website, and somebody suggested that I submit it to Foundations of Physics. I did, and it was accepted with no problem, but again not the wide audience I was hoping it would get.

So within computing, it sounds like this principle is probably not as widely understood as it should be, still. Perhaps at the level of people who build circuits it's understood, we hope, but perhaps not elsewhere, and these things arise in concurrent systems all the time, right?

The people who build computer hardware are aware of it. I don't know whether engineers who aren't principally computer hardware designers know about it. For example, back in the seventies I tried to explain this problem to the people who were building the computers that were going to run SIFT, because SIFT had to worry about all possible sources of failure, and one possible source was an arbitration failure, and I wanted to ask the engineers whether they understood this problem and what they could tell us about the probability of it happening. Well, in about thirty minutes I was not able to explain the problem to those engineers so that they would understand it and realize it was a real problem. In those days computers were just starting to be used in avionics, and the avionics people knew an awful lot about feedback control systems and such, and nothing about computers or the digital domain. So whether the people who are building avionics hardware these days know about the arbiter problem and know how
to take it into account, I don't know. I sure hope so.

Agreed. Did any other work come out of your interest in arbiter-free problems?

A couple of things. I mentioned Petri nets, and I mentioned marked graphs. People in the small part of the hardware community that deals with asynchronous circuits, who really understand arbiters and try to build circuits without them, know that the kind of synchronization described by marked graphs can be implemented without an arbiter. That's significant because the kind of synchronization marked graphs represent seems to describe what I would call parallel computing, as distinct from concurrent computing. Parallel computing, in my use of the terms, means a bunch of processors that are deliberately joined together to work on a single thing; concurrency is where you have a bunch of processes sort of doing their own thing, but they need some synchronization with the others to keep from stepping on each other's toes, or bad things happening. Parallelism is huge for big-data computation: if you're using a server farm to process, say, all of the queries to Bing in the last 24 hours, you're doing parallel computation, and there is software to do that, for example MapReduce. MapReduce implements a kind of synchronization that is described by marked graphs, at least that's what I presume it does. For a long time I thought that marked graphs described precisely the kind of synchronization that can be done without an arbiter, and so I decided to try to prove that and write a paper about it. Well, I discovered that I was wrong: there is actually a weaker condition, that is, you can describe things that marked graphs can't describe but that can still be done without an arbiter, and so I wrote a paper about that. I also wrote another paper after realizing that parallel computing is important, even though I've generally stayed away from it as not being that interesting, because problems of competition are much more difficult and interesting to solve than problems of cooperation. There was an algorithm sort of in the back of my mind that indicated how you could implement marked-graph synchronization with multiple processors communicating by shared memory, and so I wrote a paper describing that algorithm. That's about the extent of what I've done there.

And roughly when was that work?

I believe it was done since I got to Microsoft, in the early part of this century or so.

Has this whole area of arbiter-free algorithms, or implementations, continued to receive a lot of attention, as far as you know?

I don't think it's ever received a lot of attention. There's a small community of hardware builders. Almost all computers these days, all the computers you can buy in a store, work by having a clock, and they do something on each clock cycle: do something, wait for the gates to settle down, then go on to the next step, with the waiting synchronized by the clock. This is a problem because computers are so fast that propagating a clock signal from one end of a chip to the other takes a significant amount of computing time, but engineers work hard to solve that problem. There's another small band of dedicated hardware builders, Ivan Sutherland is one of them, who believe it's time to stop using clocks and to build asynchronous circuits, avoiding arbitration whenever possible, and those circuits have the advantage of potentially being much faster.
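The kind of arbiter-free synchronization marked graphs describe can be sketched with a fixed producer/consumer pipeline in which every shared variable has exactly one writer, so there is never a contended write and hence never a need for an arbiter. This is an illustrative sketch of the idea, not the published algorithm, and it leans on CPython's GIL to make single-variable reads and writes atomic.

```python
import threading
import time

N = 4                 # ring buffer capacity
buf = [None] * N
head = 0              # written only by the producer
tail = 0              # written only by the consumer
ITEMS = 100
out = []

def producer():
    global head
    for i in range(ITEMS):
        while head - tail == N:   # buffer full: wait, but never contend
            time.sleep(0)
        buf[head % N] = i
        head += 1                 # single writer of `head`

def consumer():
    global tail
    for _ in range(ITEMS):
        while head == tail:       # buffer empty: wait for the producer
            time.sleep(0)
        out.append(buf[tail % N])
        tail += 1                 # single writer of `tail`

t1 = threading.Thread(target=producer)
t2 = threading.Thread(target=consumer)
t1.start(); t2.start(); t1.join(); t2.join()
print("consumed", len(out), "items in order:", out == list(range(ITEMS)))
```

Note that `head` and `tail` here are unbounded counters, echoing the unbounded-counter issue discussed elsewhere in this interview; a real implementation would have to bound them.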
If you're clocked, you have to go at the speed of the slowest component, or rather the worst-case speed of the slowest component, whereas if you're asynchronous, things can be done as fast as the gates can actually do them. Chuck Seitz, another member of that community, who I used to chat with a bit and I'm sure learned a fair amount from, has described building really, really fast circuits for doing signal processing with this asynchronous technique. But so far it hasn't been adopted in the mainstream.

So it sounds like there's at least the potential you might do more work in this area in the future.

I think my days of working in that area are done, but somebody else might take it up.

OK. Let's pick up the thread of your work in concurrency. After that time you continued to work on concurrent algorithms, and I think there was a paper you did not terribly long after that, which maybe appeared around the same time as the "Time, Clocks" paper, about concurrent garbage collection. It may be the first paper, at least the first one I knew of, where you were working in some sense directly with Dijkstra, although I might have gotten that wrong. Can you talk about that a bit?

Oh, the genesis of my involvement. Edsger Dijkstra wrote, with some of the people around him, his disciples, as the other authors (I don't remember whether they were still students at that time), a paper about a concurrent garbage collector. He wrote it as an EWD report, that series of notes that Edsger distributed fairly widely, and he sent me a copy, or I received a copy somehow. I looked at the algorithm and realized it could be simplified in a very simple way. Basically, it treated the free list differently from the rest of the data structure, and since this was a concurrent garbage collector, I realized that you could just treat the free list as another part of the data structure: moving something from the free list to someplace else is basically the same problem as making any other change to the structure. So I sent that suggestion, which to me seemed perfectly trivial, and Edsger thought it was sufficiently novel and sufficiently interesting that he made me an author as a result, something I imagine he regretted doing.

There's an interesting story about that. Edsger wrote a proof of correctness of the algorithm in the prose mathematical style that people used, and still use, to write proofs. Then I got a letter from Edsger (in those days we actually communicated by writing letters and sending them by post): it was a copy of a letter he had sent to the editor withdrawing the paper that had been submitted, because someone had found an error in it. Now, I had found Dijkstra's proof very convincing, and I decided there must have been some minor, insignificant error. I had just developed the method of reasoning formally about concurrent algorithms that I published, I think, in the '77 paper "Proving the Correctness of Multiprocess Programs," so I said, OK, I'll sit down and write a correctness proof, and I'll discover the error that way, and I'm sure it will be a trivial error, easy to fix. So I started writing a proof using the method I had developed, and in about fifteen minutes I discovered the error, and it was a serious error: the algorithm was wrong. I had a hunch that it could be fixed by changing the order of two instructions. I didn't really understand why, or have any very good idea whether it would work, but I decided to give it a try, and of course what I would do is use my method to write a proof. It was hard work, because to write that kind of proof you have to find an inductive invariant, which I won't go into now, and that's hard,
especially since at that time I had no practice doing it. I spent a weekend and was actually able to prove the correctness of my version of the algorithm. It turned out that Edsger had found a correction that also involved changing the order of two instructions, and because it was somewhat more efficient to do it his way, we used his in rewriting the paper. But I insisted that we give a more rigorous proof. Edsger didn't want to give the really detailed kind of proof that I believe is necessary for catching these little errors in algorithms, but we compromised: the invariant appeared in the paper, I believe, not written formally but written precisely, and we had a sketch of the proof. A little while later David Gries (I think he wrote that paper by himself, not with Susan Owicki) basically wrote his version of the exact kind of proof that I had written.

At the same time I was developing my method, David Gries and his student Susan Owicki (I suppose, since it was her thesis, it must have been Susan Owicki and David Gries) were developing what was a logically equivalent method, but it was more politically correct. Remember that I told you that you needed to reason about the PC, the program counter, in order to write these kinds of proofs; they didn't reason about the PC, they used auxiliary variables that would basically capture the same information. But there was another difference in the presentation. This method of reasoning about programs was introduced by Bob Floyd in a paper called "Assigning Meanings to Programs"; he did it in terms of flowcharts, annotating the flowcharts. Tony Hoare then developed another way of doing it, Hoare logic, which is a nicer way of doing things for sequential programs. What Owicki and Gries did was presented in a language that made it look like they were using Hoare's method, but they weren't really: they were using Floyd's method in disguise. As a result, even though their method and mine were logically equivalent, their method became well known, and it took about ten years before people realized that what I was doing was essentially the same thing that they were doing. But people were really confused by their method. In fact, Edsger wrote an EWD presenting my version of the Owicki-Gries method, which he said was a way to explain it simply, and my reaction was, God help the poor people who try to understand it based on his description. When you look at it the way I did it, it's very obvious what's going on: it's clear that you're writing an invariant, and at each step you're checking the invariant. That global invariant was what was hidden in the Owicki-Gries method. These days, well, I can't actually say whether people are still using the Owicki-Gries method; they don't write things in terms of flowcharts. There was a paper by Ashcroft, which preceded both my paper and the Owicki-Gries paper, in which he talked about doing things directly using the global invariant, and I thought I was making an improvement on that by basically distributing the invariant as an annotation of the program, of the flowchart. I've since realized that the correct way of doing it is to think directly in terms of state machines, essentially the same way Ashcroft did. In those days I still thought doing things in terms of programming languages, or things that look like programming languages, was a good idea; I've since learned that if you're not writing a program, you shouldn't be using a programming language, and an algorithm is not a program.
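The state-machine view endorsed here can be mechanized directly: enumerate every reachable state of a small algorithm and check a global invariant in each one. Below is a minimal sketch of that idea (a toy checker for illustration, not any tool mentioned in the interview), applied to mutual exclusion for Peterson's algorithm.

```python
# State: (pc0, pc1, flag0, flag1, turn)
# pcs: 0=noncritical, 1=set flag, 2=set turn, 3=wait, 4=critical section
def steps(state, i):
    pc, flag, turn = list(state[:2]), list(state[2:4]), state[4]
    j = 1 - i
    p = pc[i]
    if p == 0:
        pc[i] = 1                        # leave noncritical section
    elif p == 1:
        flag[i], pc[i] = True, 2         # flag[i] := True
    elif p == 2:
        turn, pc[i] = j, 3               # turn := other process
    elif p == 3:
        if not flag[j] or turn == i:     # wait until it is safe to enter
            pc[i] = 4
        else:
            return []                    # still blocked: no new state
    elif p == 4:
        flag[i], pc[i] = False, 0        # exit the critical section
    return [(pc[0], pc[1], flag[0], flag[1], turn)]

init = (0, 0, False, False, 0)
seen, frontier = {init}, [init]
while frontier:                          # exhaustive state exploration
    s = frontier.pop()
    for i in (0, 1):
        for t in steps(s, i):
            if t not in seen:
                seen.add(t)
                frontier.append(t)

# The global invariant: both processes are never in the critical section.
assert all(not (s[0] == 4 and s[1] == 4) for s in seen)
print(f"checked {len(seen)} reachable states; mutual exclusion holds")
```

Here the invariant is checked in every reachable state rather than distributed as program annotations, which is exactly the "global invariant" style described above.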
That's a very interesting illustration of the tie-in between your work on algorithms, in particular concurrency algorithms, and the methods that you developed for proving the algorithms correct, which I think is another major thread in your work; the two have been interwoven over, as far as I can tell, most of your career.

Yeah. Well, in the early days there were lots of papers written about verifying concurrent programs. What I think distinguished me from most of them was that I was doing this because I actually had algorithms to verify, and I needed things that worked in practice, and I would look at a lot of those methods and see that I couldn't use them even for the little things that I was doing, let alone try to scale them up to something bigger. David Gries has a long background; I don't know whether he started as a mathematician or as an electrical engineer, I think as a mathematician, but the people from the old days in the U.S., at any rate, somehow seemed more tied to real computing than most people were, especially in Europe, back in the seventies. And their method was equivalent to mine; in practical terms it made very little difference which one you were using.

I want to go back to something you mentioned in passing in conjunction with the "Time, Clocks" paper, which was the use of counters, which became an important notion in subsequent work. In that paper the counters are essentially unbounded, and that obviously had practical implications, which you explored further in some subsequent work about what you called registers. Could you tell us a little bit about that?

Well, as you mentioned, there's the bakery algorithm. I sort of said it had no precedents; the precedent, which led to the name, was something from my childhood, where we had a bakery in the neighborhood with one of those little ticket dispensers: you took a ticket and waited for your number to be called. Had I grown up in a later era, it would be the deli algorithm. That's basically what the bakery algorithm does, except each process basically picks its own number based on the numbers that other processes have, and it's possible for the numbers to grow without bound if processes keep trying to enter the critical section. So I realized that the algorithm required multiple registers to store a number, because you could overflow a single register, a 32-bit register for example, in a fairly short time. The fact that the registers in the bakery algorithm didn't need to be atomic meant that it was easy to implement them with multiple-word registers where only the reading of a single word is atomic. So at first glance it sounds like
that solves the problem, because it doesn't matter in what order you read and write the different words; it would still work, except that you'd like to prove that the numbers, although not bounded, have some practical bound, for example that the number will never be greater than the total number of times somebody has been in the critical section, and that was not trivial to implement. So a paper I wrote, I think called "On Concurrent Reading and Writing," gave algorithms for solving that problem. A nice statement of one of the problems solved there: suppose you have a clock in your computer, and the clock counts cyclically, say over one day, cycling from 0 to 24 hours and then back to 0. How do you accomplish that if your clock has to be multiple registers? The first question is what correctness means, and correctness means that if the clock is counting the time, the value a read returns is the value of the time at some point during the interval in which it was reading. That's a nice problem, and if anybody hasn't seen the solution, I suggest it as a nice little exercise.
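To see why the exercise is non-trivial, here is the failure mode of the naive approach: read the clock's digits one at a time while a tick (199 to 200) is being written digit by digit. Enumerating the interleavings shows reads that return values the clock never held. This is a sketch of the problem only; the solutions are in "On Concurrent Reading and Writing."

```python
# The clock ticks 199 -> 200; the writer updates one digit at a time
# (high-order digit first here; other write orders misbehave similarly).
OLD, NEW = [1, 9, 9], [2, 0, 0]

def naive_read_values():
    """Every value a digit-by-digit read can return during the write."""
    results = set()
    # (d0, d1, d2): how many digit-writes have completed when the reader
    # reads digit 0, 1, 2; reads happen in order, so counts are nondecreasing.
    for d0 in range(4):
        for d1 in range(d0, 4):
            for d2 in range(d1, 4):
                digits = [NEW[p] if d > p else OLD[p]
                          for p, d in enumerate((d0, d1, d2))]
                results.add(100 * digits[0] + 10 * digits[1] + digits[2])
    return results

vals = naive_read_values()
bogus = sorted(vals - {199, 200})
print("values a naive read can return but the clock never held:", bogus)
```

A correct read must return 199 or 200, a value the clock actually had during the read interval; the naive word-by-word read fails that specification.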
So it was the bakery algorithm that led me to look at that class of algorithms. There's a funny story there. It turns out that if you have multiple words, you don't have to assume that the reading and writing of an individual word is atomic; there's a weaker condition called regularity, which I won't bother explaining, but it's weaker than atomicity, and for a single bit it's basically trivial to satisfy in hardware. When I published the paper in CACM, I wanted to introduce the definition of a regular register and talk about how regular registers could be used to implement, for example, this kind of cyclic clock, but the editor said that the idea of reading a single bit non-atomically was just too mind-boggling and he didn't want to publish a paper with that, so I was forced to state that you had atomic bits, even though that wasn't a requirement. At any rate, that then led me, actually not eventually but fairly quickly, to think about what kinds of registers are weaker than an atomic register. There are two classes of registers weaker than an atomic register: an atomic register is what you simply think of, reading and writing happening as if they were instantaneous; one weaker class is called safe and the other is called regular. In the paper on concurrent reading and writing, the registers are regular, not atomic. Safe registers are trivial to build: basically, any way you might think of building a register would pretty much have to be safe, since safe just means you get the right answer if you read it while nobody is writing it. So I considered the problem of implementing, first, regular registers using safe registers, and then atomic registers using regular registers. I was able to solve the first problem, not optimally, but at least solving it, and I believe Gary Peterson later published a more efficient algorithm for doing that. For going from regular registers to atomic registers, I was only able to solve it for two processes, a single writer and a single reader, and I proved the interesting result that, with a bounded amount of storage, you couldn't do it unless both processes write something while doing it. It's easy, or somewhat easy, to do it if you have unbounded counters, but not in a bounded way.
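The three register classes just named can be summarized with a little model (a paraphrase of the standard definitions, not code from any paper): for reads that overlap a write replacing `old` with `new`, each class constrains what a pair of successive reads may return.

```python
def allowed_pairs(kind, old, new, domain):
    """Pairs of values two successive reads may return when both reads
    overlap a single write replacing `old` with `new`."""
    if kind == "safe":
        # A safe register only guarantees correct reads when no write
        # overlaps; an overlapping read may return anything in the domain.
        return {(a, b) for a in domain for b in domain}
    if kind == "regular":
        # Each overlapping read returns the old or the new value, but
        # reads may "flicker": new-then-old is still allowed.
        return {(a, b) for a in (old, new) for b in (old, new)}
    if kind == "atomic":
        # As if the write happened instantaneously: once some read
        # returns `new`, no later read may return `old`.
        return {(old, old), (old, new), (new, new)}
    raise ValueError(kind)

dom = {0, 1, 2, 3}
safe = allowed_pairs("safe", 1, 2, dom)
regular = allowed_pairs("regular", 1, 2, dom)
atomic = allowed_pairs("atomic", 1, 2, dom)
assert atomic < regular < safe      # each stronger class permits fewer behaviors
assert (2, 1) in regular and (2, 1) not in atomic   # the flicker atomicity bans
print("atomic < regular < safe: strictly stronger guarantees")
```

The construction problems described in the interview go up this hierarchy: regular from safe, then atomic from regular.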
I was never able to figure out how to do it with multiple readers, and after I published my paper on it there were a couple of papers published: one claimed to have an algorithm that I have no idea why the authors believed could possibly work, and then there were a couple of later algorithms that were really complicated. I realized that the reason I never solved the problem is that the solutions were sufficiently complicated that when things got that complicated I would just give up, a combination of aesthetics and not knowing if I had the mental power to deal with anything that complicated.

Was it your thought that this framework of different kinds of registers, with varying degrees of complexity, or utility maybe I should say, was likely to lead to something else, or were you really focused on the problem you had exposed in the bakery algorithm, that you have these unbounded counters you have to deal with somehow?

Well, as I said, the unbounded counters led me to think about what it means to build one register out of smaller ones, and that led me into considering these three classes of registers. The safe register is the one I said I knew you could build out of hardware; that's clear, so that was the only register that I could say I physically knew how to build, and again, I always regarded this as a physics problem.

Exactly.

When I started doing this stuff, it was fairly early, I think around '75 or so, and I gave a couple of talks about it, but nobody seemed interested, so I put it away, and I thought I had solved the problem of atomic registers. Then sometime much later, I think around '84, Jay Misra published a paper, I think by himself (he published a lot of things with Mani Chandy, but I think that one was by Jay alone), which was heading in the direction of the work I had already done. So I decided to write my work up and publish it, and when I got back to writing it up, I suddenly realized that I couldn't imagine what kind of solution I had had when I was talking about it in the '70s for the atomic register case, because when I was able to look at my slides, what I seem to have been saying was utter nonsense. So I sat down and came up with an algorithm, and fortunately Fred Schneider was the editor of that paper. He was an incredible editor; he doesn't edit anymore, but every paper he published he really read. He went through the paper and came to the proof of correctness of the atomicity result, and he couldn't follow it; he said, I don't understand something or other. So I got on the phone to explain it to him, and I got to this point and said, my God, it doesn't work: the proof was wrong, and in fact the algorithm was wrong. I've always been good at writing careful proofs, but this was before I had learned to write proofs the right way, with hierarchically structured proofs. So I realized the algorithm was incorrect, and I went back to the drawing board. Fortunately there was an easy fix to the algorithm, and I wrote a much more careful proof, and that's what finally got published. So I'm eternally grateful to Fred for saving me from the embarrassment of publishing an incorrect algorithm.

Well, not completely: there was one that I did publish. What happened is that I gave an invited lecture at one of the early PODCs, and it was then decided that I should transcribe the recording that had been made and publish it in the next PODC proceedings, which I did, and on one of the slides there was an algorithm that was nonsense. I hadn't paid attention to it; I don't know how it got into my slides, and it would never have gotten into a paper if I had actually been writing it down, but I just reproduced the slides. It was not a major part of the talk, just a silly piece of nonsense that I had put on a slide without thinking about it. But Fred saved me from what would have been the only, I think, really serious error in anything I published. Although there is an interesting error that was actually in a footnote of the bakery algorithm paper: what I basically wrote there, without using the terminology, was that a single bit is automatically atomic, and I discovered when I started thinking about atomic registers that that was wrong. It's non-trivial even to get a single atomic bit.

The obvious is sometimes not so obvious.

Well, never believe that anything is obvious until you write a proof of it.

A good maxim. I want to pick up on a topic that you talked about a little bit earlier but which became a major theme in your work on concurrency: fault tolerance. It goes back at least to the SIFT work that we were talking about earlier. Byzantine agreement, which of course wasn't called Byzantine agreement, or Byzantine fault, early on, became a topic that was pretty significant in your work, and I suppose you should mention the story of why it became Byzantine.

Yes, that would be a good thing to talk about. Edsger Dijkstra, very early on, in one of his papers, posed a problem called the dining philosophers problem, and a lot of people have used that as an example of how to do things in some language or other, and it received much more attention than I really thought it deserved. I mean, Edsger did some brilliant work, but the dining philosophers didn't seem terribly important to me, yet it
got a lot more attention than things that really deserve more attention and I realize it's because at an acute story about it involving philosophers eating bowls of spaghetti with fire to Forks at each philosopher just had one fork or something like that and so I realized I thought that the Byzantine eat that the problem was really important and that it would be good to have a catchy story to go with it and Jim Gray had described was was very important that was the Byzantine problem it's one that clarified yeah okay and a couple of years earlier I had heard from Jim Gray a description of what he described as the Chinese generals problem and it's it's a basically an impossibility result that's it's quite important that I realized a lot later that Jim was probably the first one to to state it invented the Chinese generals of story to go to go with it so I decided the story of a bunch of generals who had to reach agreement on something and they could only send messengers and stuff like that and I originally called it the Albanian generals because at that time Albania was a the most communist country in the world and it was a black hole and I figured nobody when Albania is never going to object to that unfortunately Jack Goldberg who was my boss at SR I said you really should think of you know something else because you know there are Albania --nz in the world and so I thought and suddenly Byzantine generals and of course with the connotation of intrigue that was the perfect name so that's how that's how it got named but you weren't originally talking about the name you were talking about the problem so what was your question well I think because this is such a fundamental thing the transition from earlier work in which you assume that things didn't fail to work in which failure had to be factored in and was in fact sort of almost fundamental isn't themes can be an important transition and and obviously it's important to deal with failure it becomes ever more important 
in the modern world. So I'd like to hear how you got into that. Well, that was obvious, because I had the time-clocks paper with its algorithm for implementing an arbitrary state machine, but it was clear that once you dealt with communication you had to worry about messages being lost, at the very least. So you had to worry about failures, and that led me to ask: how can I extend the algorithm from the time-clocks paper and make it fault-tolerant? By the time I was through, it didn't look much like the algorithm from the time-clocks paper, but it was an obvious next step, and I wrote this forgotten paper in '77 about implementing an arbitrary state machine. In the SIFT work they didn't talk in terms of state machines; they had already abstracted the problem to its kernel of just reaching agreement. To build a state machine, you just have a series of numbered agreements on what the state machine should do next. In the context of SIFT, though, SIFT was a synchronous system: the system did something every 20 milliseconds or so, so there was a natural separation into an agreement protocol done every 20 milliseconds, ordered by time. When you get to asynchronous systems, you don't have that "every 20 milliseconds," so you use what is effectively a counter: to propose the next action of the state machine, you say, well, action 37 has been done, so we're now agreeing on action 38. That becomes the sequencing mechanism. Byzantine agreement is still the building block; it's just that instead of a real-time clock, you have a logical clock.
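The numbered-agreement idea can be sketched in a few lines of Python. This is my illustration, not code from the interview or from Lamport's papers; the `Consensus` class is a stand-in for a real agreement protocol such as Paxos, and all names here are invented for the example.

```python
# Sketch: state-machine replication built on a sequence of numbered
# agreement ("consensus") instances. The instance number, not a
# real-time clock, is the sequencing mechanism: once instance 37 is
# decided, replicas move on to agreeing on action 38.

class Consensus:
    """Stand-in for one numbered agreement instance.

    A real implementation would run a protocol such as Paxos among
    the replicas; here the first value proposed simply wins.
    """
    def __init__(self):
        self.decided = None

    def propose(self, value):
        if self.decided is None:   # first proposal wins in this toy model
            self.decided = value
        return self.decided

class ReplicatedStateMachine:
    def __init__(self):
        self.instances = {}   # instance number -> Consensus
        self.next_slot = 0    # lowest undecided instance number
        self.log = []         # decided actions, in order

    def submit(self, action):
        """Propose `action` for the next undecided slot and record the
        decided value (which may differ if another proposer won)."""
        inst = self.instances.setdefault(self.next_slot, Consensus())
        decided = inst.propose(action)
        self.log.append(decided)
        self.next_slot += 1
        return decided

rsm = ReplicatedStateMachine()
rsm.submit("deposit 100")
rsm.submit("withdraw 30")
print(rsm.log)   # actions in the agreed order
```

Every replica that applies the same log in slot order reaches the same state; that is the whole point of reducing state-machine replication to a numbered series of agreements.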
But the other thing that changed at the same time, along with this shift of thinking (at least in my thinking) from the synchronous to the asynchronous world, was the failure model: people were worried about computers crashing, not about computers behaving erratically and doing the wrong thing. That can still be a problem in practice, but it's considered sufficiently rare that it can either be ignored, or you assume you have some mechanism outside the system to stop the system and restart it if it happens. I do remember a time, in either the '70s or early '80s, when the ARPANET (or the Internet; I forget whether it was still the ARPANET) went down because some computer didn't just crash, it started doing the wrong thing. The algorithms were presumably tolerant of crash failures, or people worried about making them tolerant of crash failures, but not of malicious failures, or even just failures causing computers to behave randomly; they weren't protected against that, and they had to work to get it back up. So this leads us, I think, to another project which became a whole series of works of yours around building reliable concurrent systems: that was Paxos. You mentioned a little earlier that this was motivated by some work at SRC, where you were trying to show how something couldn't work, and instead you got an algorithm. Can you talk a little more about that, and then tell us how that work came to the attention of the world outside SRC? Oh, sure. First of all, before that, there was a database book, and it had an algorithm that was supposed to deal with fault tolerance. I looked at the algorithm and said that it wasn't an algorithm to deal with fault tolerance, because they assumed there was a leader, and if the leader failed, well, you just get another leader. But that's a piece of magic: how do you select a leader, and select it so that you never have two leaders at the same time? And what
happens if you do? That's just as difficult as the problem they claimed to solve. So I ignored it, which in retrospect was a mistake, because I think if I had thought about the problem, not just about how their algorithm doesn't work but about how to turn what they did into an algorithm that works, I would have found Paxos a couple of years earlier. But at any rate, as I said, the people at SRC were building the Echo file system, and again this was a case where we were just interested in crash failures. People believed that that's the way processors failed almost all the time, so you didn't have to worry about Byzantine failures, which was good, because at the time nobody knew how to handle Byzantine failures except in the synchronous world, and it's not practical to build large synchronous systems. I don't know, maybe it's possible these days, but it certainly wasn't the way things were going in the world, in most of computing, because to build synchronous systems you have to have systems that will respond to a message in a known, bounded length of time, and the entire mechanism of computer communication has evolved such that there is no way to put a bound on how long it's going to take a processor to respond to a message. As you know if you use the net: usually the response comes back very quickly, but it could take a minute to get a response. So we couldn't assume any kind of synchrony; it was an asynchronous system, and I thought that what they were trying to do in an asynchronous system was impossible. Well, let me try to give the possible reason why I thought it was impossible; there's an interesting backstory to that. When I was interviewing at PARC back in '85, I gave a talk, and I must have described some of the Byzantine generals work,
and I made the statement that in order to get the fault tolerance, or to solve the problem I was trying to solve, you needed to use real time, because in the absence of real time there's no way to distinguish between a process that has failed and a process that is just acting slowly. I knew I was trying to slip something under the rug, which I do in lectures, because there's a fixed amount of time and sometimes I have to ignore embarrassing things, not because I want to hide them but just because I want to get through the lecture. At any rate, Butler Lampson caught me on that. He said, well, it's true what you say, that you can't distinguish whether something has failed or is just acting slowly, but that doesn't mean you couldn't write an algorithm that works without really knowing whether a process has failed or not, one that just does the right thing if there aren't too many failures, or whatever. Of course I couldn't answer on the spot. I went back home later, thought about the problem, and came up with some argument that it was impossible; unfortunately that email has been lost. I remember what it was, but I realized it wasn't rigorous, certainly what I was writing wasn't rigorous enough, and it just wasn't very satisfying, and I never carried it any further. I just had this feeling that there was some result there. Some number of years later, and I don't remember when it was published, there was the famous Fischer-Lynch-Paterson result, the FLP result, which essentially proved that what I was saying was impossible was in fact impossible. It was a fantastic, beautiful paper, and what they said was much simpler and much more general than anything I was envisioning. I think it's one of the most important, if not the most important, papers in distributed systems ever written. But at any rate, that experience may have made me think that what the
Echo people were trying to do was impossible, although I think the FLP paper came later; I'm not positive about that. At any rate, I sat down to try to prove that it couldn't be done, and instead of coming up with a proof, I came up with an algorithm. You sort of say, well, in order to do this, it has to do that, and that can't work because of something; well, actually, it could work if it did something else, and following that, suddenly, whoops, there's an algorithm. It's interesting, the relation with the FLP result. FLP says that in an asynchronous system you can't build something that guarantees consistency in reaching consensus, that is, that you're never going to get two different processes that disagree on what value was chosen, and that also guarantees a value eventually will be chosen. People have since shown, and this is a post facto explanation, not something that was going through my mind at the time, that there are randomized solutions to the problem, which don't guarantee termination but guarantee it with high probability: as time goes on, as the algorithm advances, the probability that you won't have reached a decision gets smaller and smaller. What I would say Paxos does is this: it's an algorithm that guarantees consistency, and gets termination if you're lucky. It gives engineering hints as to what you do so that you have a good probability of being lucky; it reduces the problem to one of being lucky, which has engineering solutions. But even if you're not lucky, you're not going to lose consistency. That's what the Paxos algorithm does.
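As a rough illustration of that guarantee, here is a toy single-decree Paxos round in Python. This is my sketch, not code from the interview or the paper; the class and function names are invented, and real Paxos runs over unreliable messages rather than direct calls. Safety (no two different values chosen) holds regardless of timing; a proposer whose ballot loses simply returns `None` and must retry with a higher ballot, which is the "termination if you're lucky" part.

```python
# Sketch: the single-decree Paxos "synod" in miniature. Consistency is
# guaranteed by two rules: acceptors never accept below a promised
# ballot, and a proposer must adopt any value already accepted.

class Acceptor:
    def __init__(self):
        self.promised = 0          # highest ballot promised so far
        self.accepted = (0, None)  # (ballot, value) last accepted

    def prepare(self, ballot):
        """Phase 1: promise not to accept lower ballots; report any
        previously accepted (ballot, value) pair."""
        if ballot > self.promised:
            self.promised = ballot
            return self.accepted
        return None                # reject

    def accept(self, ballot, value):
        """Phase 2: accept unless a higher ballot has been promised."""
        if ballot >= self.promised:
            self.promised = ballot
            self.accepted = (ballot, value)
            return True
        return False

def propose(acceptors, ballot, value):
    """One proposer round. Returns the chosen value, or None if this
    ballot lost (the 'unlucky' case: retry with a higher ballot)."""
    quorum = len(acceptors) // 2 + 1
    promises = [p for p in (a.prepare(ballot) for a in acceptors) if p is not None]
    if len(promises) < quorum:
        return None
    # Key safety rule: if any acceptor already accepted a value, adopt
    # the one accepted at the highest ballot instead of our own.
    prev_ballot, prev_value = max(promises)
    if prev_value is not None:
        value = prev_value
    acks = sum(a.accept(ballot, value) for a in acceptors)
    return value if acks >= quorum else None

acceptors = [Acceptor() for _ in range(3)]
v1 = propose(acceptors, 1, "A")
v2 = propose(acceptors, 2, "B")
print(v1, v2)  # both "A": a later ballot cannot change the chosen value
```

The second proposal illustrates consistency: even with a higher ballot and a different input, the proposer is forced to re-propose the already-chosen value.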
Now, about the paper. Having been successful with the story of the Byzantine generals, I decided to write a story for the Paxos paper too. The storyline is that the paper is written by an archaeologist about this ancient civilization of Paxos: they had this parliament, and this is how their parliament worked. I had a lot of fun with it; the paper is called "The Part-Time Parliament." For examples, as part of the story of how it would be used, I gave the characters Greek names, or more precisely their actual names transliterated into Greek, and I had Leo Guibas help in doing that; I had to add a few non-existent symbols to the Greek alphabet to get some phonemes that weren't in Greek. I thought that people who do any kind of mathematics should know enough Greek letters to know what an alpha and a beta look like, and so come reasonably close, if not to figuring out the names, at least to being able to read them in their heads, so they'd be able to understand the story. But apparently that was beyond the capacity of most of the readers, and so that part confused them. I submitted the paper to TOCS, I believe, Transactions on Computer Systems, and the referees came back and said, well, this is an interesting but not very important paper, which would be worth publishing if you got rid of all of that Greek stuff. I was just annoyed at the referees' lack of humor, and I let it drop. Unfortunately, almost nobody understood what the paper was about, except for Butler Lampson. He understood its significance: that this gives you a way of implementing, in a fairly practical way, any system, or rather the kernel of a system, the part that keeps the system working in unison rather than letting the processors go off their own ways into chaos. You can use Paxos as the kernel of any distributed system, and he went
around giving lectures about it. Also, at roughly the same time, Barbara Liskov and a student of hers, Brian Oki, were working on a distributed file system, I think, and they had buried inside it something they called viewstamped replication, which was essentially the same as the Paxos algorithm. At some point I heard Butler say that viewstamped replication is the same as Paxos, and I looked at the paper they had published and saw nothing in there that looked like Paxos. So when I wrote about Paxos, I would dutifully quote Butler as saying that this algorithm was also devised by Oki and Liskov, without really believing it. Then years later Mike Burrows said, oh, don't read the papers, read Brian's thesis. I looked in Brian's thesis, and there indeed, very clearly, was the same algorithm. But I don't feel guilty about it being called Paxos rather than viewstamped replication, because they never published a proof, and as far as I'm concerned, an algorithm without a proof is a conjecture. I think Barbara has come to realize that, and I should mention that she and her student Miguel Castro, well, Castro's thesis, actually solved the problem of asynchronous Byzantine agreement, Byzantine agreement in an asynchronous system, and they had a very careful proof of correctness, so they essentially have a Byzantine version of Paxos. So, you mentioned that the editors and reviewers were not particularly receptive to the paper. What happened to it? Well, they thought it was really a terrible paper. But at some point, when the system was being built at SRC, Ed Lee, I think it was Ed Lee, was told by somebody that he should read this paper about Paxos, because that's what they needed, and he read it and had no trouble with it. So I feel less guilty about it, and willing
to put more of the responsibility on the reviewers and the readers. At any rate, it was clearly an important paper, and it was finally published around '91, I think. What happened is that Ken Birman had become the editor of TOCS, and he was about to leave his editorship, and he said, gee, it's about time that Paxos paper were published. So he said, I'll publish it, but you should update it to talk about what's been done in the meantime. I didn't really feel like doing that, so I got Keith Marzullo to do that part for me, and to carry on the story of my being an archaeologist: the manuscript had been found in the back offices of TOCS, and the author was away on an archaeological expedition and could not be reached. So Keith wrote this, I forget what it's called, some explanation, which was put in as separate text, as his contribution; he did the annotations, and it was published. So, if I've got my chronology right: the work on Echo in the late '80s inspired you to work on Paxos, the first shot at submitting the Paxos paper was in the early '90s, and the paper didn't actually appear until the late '90s. Is that right? Right, sorry, it's the late '90s that the paper came out; we'd need to check the date. I don't remember the publication date of Paxos, but I recall it being approximately ten years after it was submitted. Yes, early '90s to late '90s. What struck me is that the core idea in Paxos then turned out to have rather long legs, in terms of the work you did subsequently, applying the idea in different situations, with different constraints that led to different solutions. Talk a bit about that space of papers; you mentioned some of them earlier, at least in passing: Disk Paxos
and Fast Paxos and so on. Okay, the next one, as I mentioned, was Disk Paxos. When we were writing the paper, Eli Gafni sort of said, well, it's really exactly the same as Paxos, and I started writing the paper trying to make Disk Paxos look like just a variant of Paxos. In fact, Butler Lampson has written a paper called "The ABCD's of Paxos," in which he claims he has one germ of an algorithm from which he derives ordinary Paxos, Disk Paxos, and Byzantine Paxos, the Castro-Liskov algorithm. I looked at the paper, and I haven't tried to study it in great detail, but I'm really skeptical, because my experience with Disk Paxos is that if you looked at it at a high level it seemed that way, but when you tried to work out the details it didn't work; there was just one basic difference between the two algorithms. So they really are separate algorithms, and there's no ur-algorithm that they could both be derived from. Since that time there was another thing that happened, an algorithm called Cheap Paxos, and that happened because a Microsoft engineer (I had moved to Microsoft by then) needed an algorithm. To tolerate a single failure with ordinary Paxos, you need three processors, but most of the time two processors are enough, and you can think of the third processor as a backup: in particular, it doesn't need to keep up with the other two processors until there's a failure, in which case it has to take over from one of them. I believe it was Mike Massa, the Microsoft engineer; he had an idea for using an asymmetric protocol that most of the time just used those two processors, and then they would have either some weak processor, or just some other computer that's busy doing other things but can come back in when needed. He had a vague idea of how to do it, and I came up with the actual, precise
algorithm and wrote the proof, and that's how that paper came about. Since then, I did some other work with Lidong Zhou and Dahlia Malkhi having to do with reconfiguration. One of the problems in fault tolerance: for Paxos you need three processors to tolerate one fault, and when one processor dies, you need to replace it. Now, if it's really just three processors, then you bring in a new computer, call it by the same name as the old one, and it keeps going. But sometimes you really want to move things to different computers, so you want to do some kind of what's called reconfiguration, where you change the set of processors that are actually running the system. In the state machine approach there's a very simple way of doing that: you let part of the machine state be the set of processors that are choosing the commands. Well, you have a chicken-and-egg problem if you do it the naive way, because you have to know what the current state is to choose the next command, but then how do you know what the current state is? So what was proposed in the original Paxos paper: I said that the Paxons did it by making the legislators who choose command number t be the ones determined by the state at time t minus three, that is, by command number t minus three. Three was a totally arbitrary number, and I figured everybody would realize it's a totally arbitrary number, but a lot of people kept asking, what's three? Why did he choose three? At any rate, we did a few improvements, optimizations, having to do with how to do reconfiguration, and Dahlia said that this method of putting it into the state isn't going to fly with engineers, because they don't understand it, so we worked out a different procedure, which was actually a variant of Cheap Paxos. But I don't think much of that work has had a significant impact.
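The "t minus three" rule can be sketched as follows. This is my illustration, not code from the original paper; `ALPHA`, `members_for_slot`, and the log representation are all invented for the example, and the call that actually runs consensus among the members is stubbed out.

```python
# Sketch: reconfiguration in the state-machine approach. The membership
# that decides command number t is taken from the state as of command
# t - ALPHA. ALPHA = 3 mirrors the arbitrary "three" from the
# Part-Time Parliament; any fixed lag works.

ALPHA = 3
initial_members = frozenset({"p1", "p2", "p3"})

log = []  # decided commands: ("app", ...) or ("reconfig", new_members)

def members_for_slot(t):
    """Membership that runs consensus for slot t: the result of the
    latest reconfiguration command decided at or before slot t - ALPHA."""
    current = initial_members
    for cmd in log[:max(0, t - ALPHA + 1)]:
        if cmd[0] == "reconfig":
            current = cmd[1]
    return current

def decide(cmd):
    # Stand-in for running consensus among members_for_slot(len(log)).
    log.append(cmd)

decide(("app", "x=1"))
decide(("reconfig", frozenset({"p1", "p2", "p4"})))  # decided at slot 1
decide(("app", "x=2"))
decide(("app", "x=3"))
# The new membership only takes effect ALPHA slots after slot 1:
print(members_for_slot(3))  # still the original membership
print(members_for_slot(4))  # the reconfigured membership
```

The fixed lag resolves the chicken-and-egg problem: by the time slot t is being decided, the state through slot t - ALPHA is already agreed upon, so everyone can compute the same membership for slot t.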
I think it's the basic Paxos algorithm that's the one that's still used. So say a little bit about the impact that it's had. Obviously it got used by people who were in your immediate vicinity, because they learned about it quickly, but with the difficulty of the word getting out through this hard-to-read paper and so on, how did Paxos become, shall we say, a standard thing in building large-scale distributed systems? I really don't know what happened. I'm sure Butler had a large effect, going around and telling people about it. I know it's at the point where an Amazon engineer said to me: there are two ways to build a reliable distributed system; one, you can start with Paxos, or two, you can build a system that isn't really reliable. So they obviously believe it, or he did. Are there other companies that you know of that have adopted it? Oh, sure. Google: I think they call it the Chubby lock service, or something like that, and it uses Paxos. And I've been told there are at least three implementations of Paxos in Microsoft systems. It has become standard. There are other algorithms in use that I don't know about. There's an algorithm called Raft, whose creators claim that its advantage over Paxos is its simplicity. I take that with a grain of salt: I heard somebody recently give a lecture on Raft, and it didn't seem too simple to me. I think what's going on is that they're measuring simplicity by showing it to someone and asking whether people say they understand it; to me, simplicity means you show it to people and they can prove it's correct. Somebody else told me that Raft is basically an optimization of Paxos, and there's plenty of room for optimizations in Paxos. Paxos is a fundamental algorithm, and there are lots of optimizations you can make to it. There's
something else called ZooKeeper, but I don't know what its relation to Paxos is. And at least some of this is in the open source world these days, right? Yeah, I'm sure there are open source implementations of Paxos, since the patent, I believe, has expired.
Info
Channel: Computer History Museum
Keywords: Distributed computing, Algorithm specification, Program verification, Mutual exclusion, Fault tolerance, Concurrency, Temporal Logic of Actions
Id: SXt3-iZpQQc
Length: 186min 49sec (11209 seconds)
Published: Wed Jul 12 2017