Elements of Programming Style - Brian Kernighan

Captions
Thank you very much. I can't tell whether you folks can hear me; let's assume you probably can. It says it's on. If all else fails, check the power. OK, so I'm going to assume, if in the back row you can't hear me at all, you know, wave your arms or something like that. We'll just assume it's OK already.

So let's get started. This title here actually has a bit of a history. In some senses it's obviously pretty arrogant and impolite. It's actually a title that I used in a talk in 1973, which you may think is a long time ago, before practically everybody here was born, and I was much younger in those days and had slightly different hair.

The genesis of the whole thing, and in fact of what I want to talk about today: when I went to Bell Labs in 1969, I happened to be lucky enough to have the office next door to Dick Hamming. Dick Hamming was arrogant and irascible and very, very smart, and he used to spend a lot of time complaining about how people programmed. He used to say to me and many others, "Hey kid, the way we teach programming is that we give you a dictionary, we give you a grammar, and we say you're a great writer, and that's nonsense. It doesn't work that way." One day Dick came into my office carrying a book, literally this book, a book called Computer Applications of Numerical Methods. I will not give you the author's name. He opened the book, showed me a particular page, and said, "Look at this. This is terrible." And I looked at the book and I said, "My God, this is terrible." He was looking at the numerical analysis; I was looking at the code. This book was a Fortran numerical analysis book. So let me show you the code that I saw in that book. OK. In fact, I'll ask you the question. This is Fortran; many of you, I suspect all of you, are at least familiar with Fortran, if not
real whiz-bang experts. I'm no longer a whiz-bang Fortran expert either. What's it do? Come on, there's no excuse. This is Fortran as it was in 1958, right? It's not that complicated. So what's it do? Remember that I and J are integers. All right, we're getting closer. It's not a unit vector, but you have the right idea. When I is less than J, that term is zero, right? When J is less than I, that term is zero. However, when I equals J, the term is 1. OK, so what this is doing is creating an identity matrix, right? It's putting ones down the diagonal of the matrix. Kind of clever. And, you know, I've got a room full of really bright people here, and it's not entirely clear what it's doing, right? So there's something wrong with this program; probably it's too clever for its own good. How might you write it? Well, you might write it like that, where it's clear. You've still got the same two nested loops, but now the inner loop clearly sets an entire row to zero and then the outer loop sets the diagonal element back to one. Is that clearer? I think it's clearer. In a modern language I'd probably do it even differently; in some languages there's already a thing called "make me an identity matrix." But at least that's better.

So anyway, the experience with the book that Dick brought in, and his continuous nagging about how people do not program well and why can't we do something about it, led me and Bill Plauger to write a book which was called The Elements of Programming Style, and that's the title which Jim had Scott put on the announcement for this gathering here. The approach that we used in that book was actually ripped off; both the title and the approach were ripped off from another wonderful book called The Elements of Style by Strunk & White, which is a very good, very compact book on how to write English well. The approach in The Elements of Style, and therefore in The Elements of Programming Style, was to say: here's something that's not very well written, here's what's wrong with it,
here's how you might make it better, and here is the rule that you might derive from seeing the process of going from bad to good. So what we did in The Elements of Programming Style was simply take a large number of examples of bad stuff and make them better, and put up a bunch of rules as a result. Now, that was a very long time ago. That book was published in 1974. The world is quite different. In that book we wrote in Fortran 66 and in PL/I, a language which I conjecture none of you have ever used; fortunately, it's dead. Today people write in Fortran 9x mostly, or they write in C or C++ or Java, or they use scripting languages like, pick your favorite. So the world is very different.

The examples in that book came, and I think this is one of the things that made it fun, from programming textbooks like this one. The idea was: here were people telling you how to program well, and they couldn't do it themselves. I guess you could argue this is history repeating itself, but what the heck, I'm arrogant and I'm at the front of the room. So what I'm going to do today is more of the same kind of thing, but I'm going to use examples that are taken from a slightly broader collection of sources. Some come from programming textbooks; some come from real programs that I have found; some of them are fresh, found while Googling around looking for interesting code that might be suitable for a collection of computational astrophysicists; some of them are probably code that was written by computational astrophysicists; some of them were written by students in my class; and, sadly, a couple of handfuls were written by me, just to show you that nobody is perfect.

So anyway, in a way this is old stuff, right? People knew that you should program well a long, long time ago, even before Bill Plauger and I started writing about this in the 70s. But in fact people still don't necessarily write really well all the time, although I think the average standard has gotten a lot better.
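The identity-matrix example described earlier in the talk was shown as Fortran on a slide that the captions do not reproduce. As a rough illustration only, the two versions might look like this in C. The function names `identity_clever` and `identity_clear` and the size `N` are assumptions made for this sketch, not names from the talk or the textbook; the "clever" form relies on integer division of 1-based indices, as the speaker explains.

```c
#define N 4  /* matrix size, chosen arbitrarily for this sketch */

/* "Clever" version: with 1-based integers i and j, the integer
   quotients i/j and j/i are both 1 only when i == j; otherwise at
   least one of them is 0, so the product is 0 off the diagonal. */
void identity_clever(int v[N][N])
{
    for (int i = 1; i <= N; i++)
        for (int j = 1; j <= N; j++)
            v[i - 1][j - 1] = (i / j) * (j / i);
}

/* Clearer version: the inner loop zeroes a whole row, then the
   outer loop sets that row's diagonal element back to 1. */
void identity_clear(int v[N][N])
{
    for (int i = 0; i < N; i++) {
        for (int j = 0; j < N; j++)
            v[i][j] = 0;
        v[i][i] = 1;
    }
}
```

Both functions fill the same matrix; the second simply says what it is doing, which is the point of the example.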
The principles of how to write well haven't changed. The details change, right? The languages are different, and we have a lot more computing horsepower to work with, but the principles of writing clearly and in a form that other people can understand haven't changed a bit in all of those years. So let's look at some of them. OK, most of these are pretty small because they've got to be examples that fit on the screen. Most of the examples, I just went back and looked, are in C, which you could then think of as C++, but nothing very complicated; or Java, nothing very complicated; and some Fortran in various guises. But as I say, although the details would be different in different languages, the principles are pretty much independent across languages.

So let's look at one. Here's one, just to see. OK, so this uses a construct in C which is a nice shorthand sometimes but obviously badly overused or abused from time to time: the question mark colon operator. So what's this do? I had literally never seen this construct before somebody asked me, "What does it do?" maybe a year ago. It's not that it's illegal; it's absolutely perfectly legal. It's like, what the heck does this do? Let me walk you through it. It says: if armed, in other words if armed is nonzero, whatever that is (it's some kind of device control work), then if the count is greater than or equal to threshold, do the dot dot dot. On the other hand, if armed is not true, false, equal to zero, then if the count is less than or equal to threshold, do the dot dot dot. So the comment that goes with this is wrong, right? Because it says "if it exceeds the threshold," but it also does something if it doesn't exceed the threshold. So the comment doesn't agree with the code; we'll come back to that kind of stuff later. The problem with this is it's a little too clever for easy understanding. It's not that you can't figure it out, but you've got to work at it, and you shouldn't have to work at figuring out what a piece of code does when it's this simple. So if
you're willing to blow a few more characters, you can get something that's a little better, right? If it's armed and the count is bigger than the threshold, or if it's not armed and the count is less than the threshold, then do the dot dot dot. Right? That's clearer. I don't have to explain it to you; you can read it. There are still things that you might argue could be done better. I don't have any parentheses there other than the ones around the if. What's the relative precedence of and-and and or-or? Well, I know the right answer there because I've been doing this stuff for a long time, but you could argue that maybe parenthesizing it differently would make it a little clearer. Hard to say. That one I won't argue; I think of that as a second-order issue. But the first one is just a mess. OK.

All righty, so don't be too clever. This is going to be your problem, because you're all very clever people; looking at the mathematics that goes into this stuff, it's clear you're cleverer than I am. On the other hand, it's quite possible for clever people to blow it in the other direction, and you don't want to be too dumb, because you may be clever in one area and dumb in a different area. This is a fine example, I think, of being too dumb. What's going on here? We've got a bunch of stuff. The guy is defining an array of one character, a blank, and then you've got four function calls to various C string comparison and copying things to do something. What on earth is going on here? You know, it's a big heavyweight, cumbersome sort of mechanism to do something. If you look at it carefully, you realize that this is all he's doing. He's saying: if that first character is not a blank, then, if it's a digit, copy it. Right? So you don't need any function calls, none of that stuff. You don't have to go to the manual and remember what the argument order is for strncmp and what it means, and whether it copies the extra byte on the end, and all those things. You don't have to think about it. OK, so there's
a balance that you have to find between too clever and not clever enough, between being really clever and being dumb about what you're doing. Sometimes you can figure that out by a principle that I guess is like "keep it simple, stupid," but it's keep it simple, like this. OK, so here's an example. Again, this is one I literally found while I was poking around looking for new examples last week, while marooned in upper Pennsylvania on the slowest Internet connection known to man. What does this do? Well, it says "remove all spaces," and in fact it does remove all spaces. But you notice there's a lot going on. It calls strlen, give me the length of the string, several times. It calls strpbrk (I don't know how you pronounce these things; this is not my fault), which basically, I think, says find me the next blank in the input. And then it does a memmove, which copies stuff from here one character over. How much do you copy? Well, the length from the pointer plus one. OK, and then, you know, decrement my counter, and when you're done... To me there's a lot going on there. What's it doing? How long do you suppose it took this person to get that right? It must have been an amazing length of time to get it right.

The problem is, libraries are good. Library functions are your friend, because it means that you can use code that somebody else wrote, and once they got it right you don't have to worry about it anymore. So these things like strcpy and memmove and so on are things you want to use, no ambiguity. But there are times when you don't need to use them, and at that point you probably want to back off a little. And this is all that's going on: walk along the string, and if it's not a blank, copy it. Right? I can describe that to you really easily. That's all you have to do and the job is done. It's much simpler this way. I wrote it with pointers here; you could write it with array subscripts if you feel happier with that. The code is almost identical; they both work. In this case it can't matter, it cannot matter, but this code
is going to run a boatload faster, because that other one is somewhere between quadratic and cubic in the number of blanks in the line, because of all of that testing and finding the next blank and then moving things over using the string length of what's left. It's an amazingly inefficient algorithm. It can't matter in that particular context, but I've been told that people who do cosmological simulations or whatever occasionally worry about efficiency.

So anyway, part of the problem with these things is you get the sense that the person who's writing the code doesn't really understand the language properly, or their understanding is a little shaky sometimes. So here's a nice example. This one actually comes from a piece of student code, undergrads, not graduate students; grad students have learned it all. Do you sense a pattern here? You know, I will probably say this again, but computers are really good at doing things that have repetitive patterns, and so if you perceive one you could probably do something better. This one I probably don't even have to show you how to do, but, right, basically all it's doing is converting an integer between 0 and 9 into its ASCII character equivalent, and the way you do that is just add the ASCII value of the character zero. That works fine in the ASCII character set; no reason not to do that. OK, so you've got to know your language, and if you do, you convert ten lines of stuff down into basically one or two lines that are going to be a lot easier to work with.

OK, here's another one. This is a piece of Java code; I drew a little tiny picture so you could get an idea of what it is. This is an example of bit whacking. I think bit whacking is something that doesn't get done very much in scientific computing. It's where you get basically bits and you want to manipulate them as bits, and I think lots of people building tools, network kinds of things, device controllers, all kinds of weird things, do bit whacking. I suspect you don't do nearly as much bit whacking
as numerical stuff in your particular field, but it's still useful to know how to do it. This one is kind of interesting because the task is to say: here is a 16-bit number and I want to knock the top 8 bits to 0. I just want to set them to 0, OK, and preserve the bottom 8 bits; that's the task. So the interesting thing is, if I set you that task, how many of you would go off and basically use an exponential to do it? What the heck is going on? This is the power function. Why is he doing that? How is he doing it? Well, what he's doing is starting over there at the 15th position, the one on the left, and he's saying let's compute 2 to the 15th, and then if the number that I'm trying to reduce is bigger than 2 to the 15th, then I can subtract 2 to the 15th from it, so I've knocked that bit off. And then we'll keep doing that until we get down through the 8th bit, and then we're done. OK. This is kind of interesting. When I checked, because I couldn't remember for sure: the power function in Java actually takes doubles as its arguments, so this is in principle being done in double-precision floating point, although I suspect there's somebody inside saying, "Hey, wait a minute, those are integers." So this is just bizarre. How would you write it? How should you write it? How about that? Take the value and preserve the bits you want with the AND operator and a, probably hexadecimal, constant, because those are the easiest way to represent real bits. That's those bits. OK, not very complicated.

Let me give you one more bit-whacking example, just because I have it sitting here and it's a lead-in to something else that I want to talk about in a moment. So again, imagine that we have a 16-bit quantity. This is close enough for astronomical purposes, right? Think of this as 16. What's that, about 10 to the 50th plus or minus 10, right? Close enough. So anyway, 16 bits like this, and all this
little macro is supposed to be doing is just interchanging the upper half and the lower half of this 16-bit quantity. So how does it do it? Well, it says: take the whole number and then mask off the top part, that's the AND with 0xff we just saw, OK, and left shift that by eight. OK, so we took the right part and moved it left. And then it says add to that what you get by taking the original number, masking off the bottom stuff, and shifting that right by eight, and you're done. Right? So you took the bottom part and moved it to the top, and you took the top part and moved it to the bottom. OK, looks good. But by now you should know not to trust me. What's wrong with it? Yeah, negative numbers: it's one of those things that could happen. In fact, let's talk about that. One of the possibilities here: some machines do sign extension, that is, arithmetic operations like shift extend the sign. So if that quantity a had its top bit turned on, when you shifted it right that would leave a trail of one bits behind. Right? So the masking is done in the wrong place there. You can see you take these one bits, you just lay them out across that, and then you would add to that what you've got by shifting the other thing left. And when you shift the other one left, it fills in with zero bits no matter what the machine architecture is, so you don't need the mask there. So that's, you know, bad. Worse, in fact, those aren't the real problems. The real problem is: what's the relative precedence of plus and the shift operators? Well, I'll tell you: the shift operators are lower precedence than the plus operator. So what this is doing is computing something not very meaningful on the left side, and then it's left shifting it by eight plus whatever that not very meaningful computation on the right side is, and then shifting the whole thing back over eight. So this does nothing sensible at all. This cannot possibly work. But it did come from an article on how to do machine-independent computations. I think
you could argue this is machine independent: it's totally wrong on all machines. OK, so the deal is you can write it like that, and that actually is correct for 16 bits. You notice that shifting it left, you don't need to do anything, and shifting it right, you do the AND at the right place. And you notice that I didn't use the plus; I used the OR, because the OR is at the right level of precedence. So it's kind of a warning sign, at least if you're looking at C or C++ or Java code: the precedence of the logical operators is well down, and so you have to be very careful if you find yourself with mixtures of logical operators down here and arithmetic operators up here, because without parentheses the precedence, the association, is just going to be wrong, guaranteed.

OK, there's another thing that's wrong with this. It's not an actual error in this piece of code, but it's a suspect construct: it uses macros. So the original C definition had the C preprocessor. It's a macro processor with somewhat strange properties sometimes, and there are perils in using macros. Macros are something that you probably do use in your C and C++ code to some extent, or if you don't use them yourselves because you are wise, those who went before you did use them, and they were not so wise. Like this. I mentioned that function called isdigit: it takes a value, an ASCII character, and says whether it's a digit or not. So this is a possible implementation of that as a macro. OK, and you'll find dozens if not hundreds of textbooks that say, sure, you could do it this way. What's the potential problem with that? Well, one potential problem is that the way that macros work is simple textual substitution. So when somebody in the rest of the program, in the dot-dot-dot, says isdigit of something, that something is plugged into two places in the resulting expression, and therefore, depending on how the expression works, it may get evaluated twice. And if it has a side effect, then you've done more than you thought you did.
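The double-evaluation hazard just described can be reproduced with a small C sketch. This is an illustration under stated assumptions, not the code from the slide or from any library: `ISDIGIT_MACRO`, `isdigit_fn`, `advance_with_macro` and `advance_with_function` are hypothetical names introduced here. The macro mentions its argument twice, so an argument with a side effect such as `*p++` can run that side effect twice.

```c
/* Hypothetical function-like macro in the style the talk warns about:
   the argument c appears twice in the expansion. */
#define ISDIGIT_MACRO(c) ((c) >= '0' && (c) <= '9')

/* Function version: the argument is evaluated exactly once. */
static int isdigit_fn(int c)
{
    return c >= '0' && c <= '9';
}

/* Test one character through the macro using *p++ as the argument.
   Returns how many characters p moved past. The string s must hold
   at least two characters, because the second *p++ runs whenever
   the first comparison is true. */
int advance_with_macro(const char *s)
{
    const char *p = s;
    /* Expands to ((*p++) >= '0' && (*p++) <= '9'). */
    (void)ISDIGIT_MACRO(*p++);
    return (int)(p - s);
}

/* Same test through the function: p always moves by exactly one. */
int advance_with_function(const char *s)
{
    const char *p = s;
    (void)isdigit_fn(*p++);
    return (int)(p - s);
}
```

With the input `"12"`, the macro version moves the pointer two characters while the function version moves it one, which matches the "every other character disappears" symptom the talk goes on to describe. The `&&` operator is a sequence point, so the macro's behavior here is well defined, just not what the caller intended.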
And in fact I found this in some piece of code that I think was processing JPEG images, I can't remember now, on the web; found it last week. In this context, what this is doing is saying: every time we want to test a piece of the input, we call isdigit with something that has a side effect. So if, when you call it, that value at jx at the current point is greater than or equal to the character zero, then you decrement jx a second time, and that's comparing something unrelated. And so this code is completely wrong, and it's going to fail in an utterly mysterious way. This is one where I speak from bitter experience. I must have spent two or three days of my life, quite a while ago, trying to figure out why a program I wrote was basically only producing about half the output that it was supposed to. And the reason was that I had an isprint, you know, another in this family, and I was invoking it with something that had a side effect, and so every other character, most of the time, was just disappearing. The only good thing is it wasn't my fault, because the isprint I was using came from the compiler manufacturer, who ought to have known better but didn't. So: mysterious, hard to characterize. OK.

One of the reasons that people use macros in C, by the way, is a reason that I think no longer has any validity whatsoever, but they used to use them because they were more efficient, in the sense that with a macro you could get something that had the semantics, more or less, of a function call but didn't have any function call overhead. On the machines of my vintage (it's like ENIAC), subroutine overhead was a serious factor. For most purposes today I think you can safely ignore it, or at least it's not a first-order consideration at all. But people use macros to try and make things go faster, and here's a wonderful example, again found very recently, in the last few days. That's a huge macro called fast memcpy. OK, so memcpy is one of those
things that says: here's a block of memory, make a copy of it over there. So it's basically a loop. And what it says is there's a break-even point, which is in fact a defined variable, and you're supposed to, for different machines, set that to different values, which is not exactly the definition of portability, but you're supposed to do that. And so what it says is, if you're over the break-even point, then you can just use the memcpy that came with the system, but otherwise you go into this thing where you write out the explicit loop yourself, because that way you're saving the function call overhead. So, question: what particular value might you choose for the break-even point today? Two, that's not a bad guess. I think the right value is probably zero. Because I actually was curious, because I thought maybe I'm just a dinosaur and I have this completely wrong or something, I went and did a bunch of measurements. I actually wrote the code, put it in there, tested this thing and that thing, and was unable to find any measurable difference on any test I did. Now, you know, this is only copying random things of various sizes, but I couldn't find any measurable difference. Modern compilers on modern machines are just smarter than you are, OK, and so for the most part let them do their thing. They mostly do it well; you shouldn't be worrying about it. OK, so you do not want to sacrifice clarity for efficiency. This is another one of these things where somebody had to write a lot of complicated code and had to get it right; it cost them maybe more than you will save in the entire lifetime of the program, and they've left a nightmare for the next grad student or postdoc to pick up.

OK, all righty. So macros, I think, are arguably one of the features of C that was OK in its time, but its time has pretty much passed, and you have alternatives at this point, especially in C++, where you can use inline functions if it really does matter, and you have
constant declarations for numbers, and you have enumerations, all of these kinds of things. There are still places where macros are OK, but mostly not these function-like macros. So I guess there's basically a general principle that says every language has things that are really pretty crappy, and you shouldn't be using them. Right? Not to pick on any particular language, but... I mean, actually Fortran was my second programming language. My first programming language was COBOL, which set kind of a real bottom to the level of crappiness one could put up with. But anyway, I wrote a lot of Fortran when I was more or less your age, and this is Fortran as it came from the factory originally, in roughly 1958, because it's got these arithmetic IF statements. Anybody willing to admit that they've ever written an arithmetic IF statement in this group? Raise your hand. Three, a whole three. Good; we'll beat you guys up later. The arithmetic IF, for those of you who have never used it, although you've probably seen it, basically is a three-way branch. It says: evaluate the expression, and then if it's negative go to the first one of those labels, if it's zero go to the second, and if it's positive go to the third. It's a pretty close mimic or match for an instruction in something like the IBM 704, if I remember correctly, a machine which has long since passed from Earth. So anyway, this does something, but it's kind of hard to see it. And of course one of the things that came along, actually as early as Fortran 66, was that you could write IF statements in a more sensible form; the original Fortran only had the arithmetic IF, if I remember correctly. So you could write this. It's not very good, but at least now you can start to see that what this is doing is some kind of greatest common divisor algorithm. And then finally, and it took until Fortran 90 before somebody realized that a while loop, just, you know,
a conditional go-around-and-around loop, was worth putting into a language. What a concept, you know, roughly thirty years after everybody else figured it out. And so now you could write it like that, and I think that's a reasonable thing to do. So languages have bad features as well as good ones. As languages evolve, and languages evolve a lot (Fortran has evolved enormously since Fortran 77, for example), you've got less reason to use the bad features and lots more reason to use the good ones, and so you should be thinking that way.

Every language has not just bad features but pitfalls, things where the feature is OK but you have to be careful how you use it. Here's one. This comes from a book which I believe is the worst C programming textbook ever written. Almost every single example in that book was wrong in some way or other. The advice in it was at best misleading; it was usually flat wrong. It's hard to believe that the book sold. Fortunately I got a copy and managed to preserve many of the good bits for uses exactly like this. What this function is supposed to be doing: it's basically doing a strcat, you know, string concatenation, but he's trying to show you how you might write it. It's called combine, and it simply takes those two strings and gives you a new string which is the two of them stuck end to end. And what does it do? Well, it says r is a char array with 100 bytes, so he's going to put the output in 100 bytes independent of how big the input is. Good start, you say. OK, then he's going to use strcpy (and it's like, if you're going to tell people how to do these, you want to be consistent, but anyway) to copy the first string, and then he computes some length, and that tells him where the second string is to be stuck in. Then he writes an explicit loop for putting it in there instead of just using strcpy, but OK. And then he says return a pointer to the internal array. This is just appalling. I mean, this is malpractice or something like that, because it's
a mistake, a mistake that all of us at one point or another have made when we write C programs: you have an array and you inadvertently return a pointer to it. It's a local array, and therefore it doesn't exist in a useful sense once that subroutine or function has been exited. Right? It may be sort of residually on the stack until something else happens, but mostly it's gone, and relying on it is a recipe for disaster. And here's this guy telling you this is the way you do it. Ooh, bad move.

OK, so what you're seeing here are examples of things that people do well or badly or whatever in languages. Computer languages are very much like human languages: there's sort of a formal "this is the way you were supposed to do it," but there are also idioms, the way that people actually write code in practice, the way that they do standardized operations or tasks. So consider: how do you set the elements of an array to something in a C program? Here are four different ways that you could do it. You could go from 0 up to n minus 1 and increment in the loop; you could do it up to but not including n and increment in the loop; you could go backwards. But I'm betting, for all of you writing C here, nobody would use those. You'd use the fourth one, right? That's the standard way to write it. If you're doing Fortran, it's like DO I = 1, 10 or whatever; it's the same thing, that's the standard way to write it. So the issue with idioms is that if I give you a piece of code that I wrote and it has that particular idiom in it, you can look at it and say, "Oh, I see what he's doing." You don't have to think about it. Conversely, if you're sitting there saying, "I have to write a loop that sets n elements of an array," you know what to do. You write it down automatically; you don't get it wrong. It's easy because it's an idiom; it's one you use all the time without having to think about it. And so it's a standard medium of communication
between your brain and the machine, or between you and some other programmer. The flip side of it, then, is that when you see something which is not idiomatic, it should sort of raise a little tiny red flag: what's going on here? Whoever wrote this is doing something different. Why are they doing it differently? It may be because they're just not a native speaker of this particular language, or it may be that there's something actively wrong. So if the idiom isn't followed, if you see something that's unusual, look at it. This one: what's happening? calloc is one of the memory allocation functions. It says allocate n items of this particular size and set them to zero. OK, so what this does is allocate an array of n integers. So say n is 10. And then it goes through a loop and sets n plus 1. Oops: you allocated something this big and you wrote stuff this big. And the way that you see that, literally the way I did see it, was that the idiom is wrong. The idiom is 0, less than n, and here it's less than or equal to n. So having seen that, you say OK, there's something wrong, and that's what it is. This is the idiomatic version: the size, and then from 0 to that. OK, so if you write it this way you don't have to pay any attention to it; it's almost surely going to be right.

Now, this shows up in a variety of contexts, one that I'm sure has affected at least some of you; it certainly affected me a number of times. Think about arrays in C, C++, Java, Python, Perl, you name your favorite language: they all start at zero. Think about Fortran: where do arrays start? Oh dear, they start at 1. And so the cognitive dissonance, or whatever, of going from 0 origin to 1 origin or vice versa is just a nightmare. If you're converting a program from one of those languages to another, it's guaranteed you'll get that wrong somewhere, and it'll be hard to find. Right? I'll bet lots of people have been through that one. So that's a place where the idioms are different in
Fortran, let's say, than they would be in C. I'm not up enough on Fortran 9x to know whether I can set the origin properly. I probably can, but do people do that? I honestly don't know, and that just creates cognitive dissonance in a different place, fortunately not for me. Okay, another example; this actually harks back to something we saw a few minutes ago. What's wrong with this? What this is doing is making a space big enough to hold the string and then copying it. Okay, how long is the string? Well, say the string has three bytes, a length of three. However, think about the implementation of strings within a C program or C++ program at the basic level: there's an extra byte on the end, the null byte, that's a terminator. So what you've done is allocate, let's say, three bytes for a three-byte string, except the string is really four bytes, and so when you do this copy it copies the null byte as well, and again you've walked off the end of the array. The idiom, which actually was in one of the earlier examples, is this: strlen plus one. If you don't see that, something's wrong. It's kind of awkward; it's an idiom that's only appropriate for C character-array kinds of things. You don't need it in Fortran or Java or lots of other languages, because strings there are not null-terminated; what you see instead is a length operator, and so you have a different set of idioms that you need to manipulate things in those languages. One other that goes right here; this is a specific instance of a very general issue that you've got to watch out for in programming. Get a character, print it back out again, and then see if it was the end of file. The end of file isn't a real character, at least in the UNIX world (maybe it is in the Windows world); it's a state of being. And so you've copied something to the output that wasn't there in the input, and the reason you did that is that you wrote a loop that has a test at the bottom. A loop with a test at the bottom is one of those things that's a red
flag. You look at it and say, wait a minute, is that appropriate? Because it means you've done something before checking whether you should. So it's almost always the case that you want to write loops with the test at the top: is there anything to do? Then do it. If there's nothing to do, don't do anything. In the original Fortran, and up through (I can't remember whether it was even in Fortran 77), do loops had the property that regardless of what the limits were, it went through the damn loop once, and this was a disaster. You always had to be protecting against that kind of stuff. The newest versions, if I remember correctly, do that properly, but I'll defer to the Fortran experts. That was just bad, and it meant that all kinds of programs didn't work properly when they had these funny cases where the upper limit was less than the lower limit. In C, this is a standard idiom, right? Get a character, store it away, and if it was not the end of file then you can put it out, but otherwise don't do anything. So this is an idiom. The one above is more like the kind of thing that people find themselves writing in languages like Python, where assignments cannot be embedded in expressions like that, and so you have to read once to prime the pump, and then you want a while loop with the test still at the top and a second read at the bottom. I should have had that here but I didn't. Okay, how do you find things like this? How do you find what's going wrong? In an awful lot of cases you can find out where the error is by testing something at its boundary conditions. I realize boundary conditions is a term of art in lots of different scientific fields; it's actually a term of art in testing programs as well, I think, or running programs. Here's another piece of code. It comes from the same guy who was removing spaces, and he says he wants to remove trailing asterisks. The thing that caught my eye here was a comment, a comment that said: should the test be i greater than or equal
to zero? How would you resolve that issue? Well, I think the way I would resolve it is to say, okay, suppose that the input consisted of nothing but an asterisk: the string was one character long and it was an asterisk. What would happen then? I say, okay, what's the strlen? It's one. Subtract one, i becomes 0. Is i greater than zero? Geez, I'm never going to look at that character, so it had better be i greater than or equal to zero, right? And I can reason about that by reasoning about the boundary condition, and I can reason about the boundary condition by finding the simplest possible case that I can work with, which is to put that asterisk right there at the beginning and nothing else. So anyway, that's that. Let's see. Okay, what could possibly go wrong? That's a good question to ask yourself in any program. Here's something, again, that I picked off the web very recently, like last week. It's part of a little tiny (I don't know whether it's meant to be real or a toy) statistical package, doing mean and standard deviation along with a variety of other things. So what it does is compute the mean of the elements in an array, and then it computes a standard deviation by, basically, summing the squared differences, and then returns the square root appropriately. Okay, what could possibly go wrong? Yeah: what happens if m is one? Now, I'm not a statistician; somebody with better training than I can tell me what the standard deviation of a single element is, but it's not division by zero. So yeah, that's the obvious thing. And how do you spot that? You test the parameters and you think about what the boundaries are. The boundary cases are, well, m is one, and in fact there's another boundary case: what happens if m is zero? That sort of raises the question of what happens in the computation of the mean, and I'll tell you, the computation of the mean has the same problem: it divides by zero if there are no elements. So here we have a library that's supposed to be helping
you, but if you don't use it right it isn't going to help you very much; you're going to get mysterious failures in the middle of your code. And so you want to be watching for that kind of thing in your own code: if you wrote the routine, you've got to defend against the stupidity of the people who use it, even if they are you in some other part of the code. Okay, yes? Yeah, actually, if we go back to that for a second, there are lots of different things that could go wrong. The array could actually be a null pointer, in which case you have a totally different class of disaster about to happen. It raises a very interesting question: you can be paranoid, or you can be super paranoid, or you can be unbelievably, unrealistically paranoid, and I don't know where you draw the line. There's no general answer to that, but here's one of my favorite instances of it, which shows that there is a line. Suppose you have a routine that does a binary search. So I have a bunch of elements and they're sorted, and I'm going to do a binary search on them. The binary search algorithm only works if they are in order, right? So I could, if I were super paranoid, go through and check that they are in order before doing the binary search. Would that be prudent? That would replace a logarithmic algorithm by a linear algorithm, so it's probably not the right thing to do. So where do you draw the line? I honestly don't know. I'm trying not to make too many pronouncements about it, but it's an absolutely valid question that you have to ask yourself as you're writing code: where do you have to be paranoid, and where can you say it can't matter? Okay, one place you want to be paranoid (thank you for the lead-in), one place you really want to be paranoid, is when you're dealing with input that might come from real people. This actually comes, I think, from that same horror book, but I'm not absolutely sure; it may be from a different horror. What it does is get a number and then do something with it, and it calls this function gets, which simply goes to the
standard input; you type a number and it returns that value as an array of characters. What's going on here? It says: too many digits, you wiped me out. But what it's checking is not really the number of digits, is it? It's checking the last element of the array. What's going on there? This is really kind of subtle. There's the word static up there under main. What that means is that the variable is not allocated as a local variable on the stack, but rather is statically allocated and initialized to zero when the program starts. So that array, number, which is going to hold the input bytes, is all zeros. So what this does is implicitly set that array to zero; then it reads a number, and then it looks at the last byte of that array and asks, is this still zero? And if it isn't, then, oh dear, you wiped me out. In microcosm, this is exactly the kind of buffer overflow problem that is at the heart of all those exploits, where people write code that is then exploited by various bad guys in various parts of the world to steal information from you or install bad software on your Windows machines or whatever. This is a buffer overrun ready to go. Nobody would do that; no textbook would ever do that today. Except here's a book, a C++ book; I think it's now in its 7th or 8th edition (these guys crank out a new edition every year or something like that), and I haven't got a new one to check whether this is still there. But this is, in effect, the heart of a web server. The way that information comes from a web client to a web server is that all the information from a form or something is wrapped up in a big long string on the standard input, and there's a shell variable called content length which is set to the length of that stuff. Okay, so what does this do? It sets aside an array of 1024 bytes, and then it figures out what the content length of that input is, and then it just reads that many bytes into that array. And so what happens if there are more than 1024 bytes? Your game is over, again.
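A minimal sketch of how the read described here could be bounded, assuming the declared length arrives as text. The slide's code is not in the transcript, so the names `checked_length` and `read_body` are hypothetical and this is not the book's code. The idea is simply to compare the declared length with the buffer's capacity before reading anything.

```c
#include <stdio.h>
#include <stdlib.h>

/* Parse a declared length given as text (hypothetical helper).
   Returns the length, or -1 if the text is missing, malformed, negative,
   or too large for a buffer of `capacity` bytes. One byte of the buffer
   is reserved for a terminating '\0', hence the >= comparison. */
long checked_length(const char *text, size_t capacity)
{
    char *end;
    long n;

    if (text == NULL || *text == '\0' || capacity == 0)
        return -1;
    n = strtol(text, &end, 10);
    if (*end != '\0' || n < 0 || (unsigned long)n >= capacity)
        return -1;
    return n;
}

/* Read at most the declared number of bytes from `in` into `buf`
   (hypothetical helper). Returns the number of bytes actually read,
   or -1 if the declared length was rejected; nothing is read then. */
long read_body(FILE *in, const char *declared, char *buf, size_t capacity)
{
    long n = checked_length(declared, capacity);
    size_t got;

    if (n < 0)
        return -1;
    got = fread(buf, 1, (size_t)n, in);
    buf[got] = '\0';   /* safe: got <= n < capacity */
    return (long)got;
}
```

Under CGI the length is passed in the environment variable `CONTENT_LENGTH`, so a caller with `char buf[1024]` might write `read_body(stdin, getenv("CONTENT_LENGTH"), buf, sizeof buf)` and treat -1 as a rejected request. Whether to reject or truncate an oversized body is a design choice; rejecting is the simpler one to reason about.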
This is the buffer overrun problem in disguise, not even very well disguised. Michael Howard and David LeBlanc wrote actually quite a nice book called Writing Secure Code, and this observation, that the world should be treated as hostile and bent on your destruction, is not too far wrong. Now, I realize the average scientific code doesn't run on the web or something, but there are actually places where scientific codes do run on the web as some kind of service, and you'd rather not be the one who made it possible for a bunch of bad guys to use your machines as the vector for denial-of-service attacks. So look at this; same kind of thing. We have some function that is reading stuff from parameters and setting up some kind of data, and the first thing it does is scan the fields of the parameters and put a file name into fname, an array of 64 bytes. How long can file names be? Well, it depends, but they can be awfully long. I see lots of Macs here; look at the file names on the Macs. They're very long; certainly lots and lots of them, if they're full paths, are going to be more than 64 characters. This is a buffer overrun problem waiting to happen. And sometimes it's just pure inadvertence; it's entirely in your own code. Look at this: there's an array of 64 bytes (64 seems to be a good magic number here), and it says there's no box called this in this schematic type. But if you count the characters, there's something like 48 characters of overhead there before you ever get to plug in the name of the box. So if the box has more than about a 15-character name, you've walked off the end on that too. I'm belaboring this one, but it's actually pretty important, because if you walk off the end of an array in your favorite language like C, or even do it inadvertently in C++, in anything that's taking input from the outside, things aren't going to work anymore, and this really is the root of an awful lot of
troubles externally. And if you're just doing your own numerical analysis, who knows what will happen if you clobber some piece of data in the middle of your program. Okay, enough of that. Let's switch gears slightly to some control-flow idioms, and I'll do them a little faster. This is actually a textbook kind of implementation of the UNIX cp command: copy this file to that file. Look what's going on there. There's a sequence of ifs: if something, if something, if something. So what you're doing is making a sequence of decisions; you haven't done anything yet. And then finally, having made some decisions, you get to do the actual copying of the files. But what happens if one of those ifs fails? How do you find out what you do instead? Where's the else? You have to drop a perpendicular, and then you can find the corresponding else, because this is properly formatted, and then you can figure out what's going on. But the problem with if, if, if as a construct is that in effect you're digging a hole and you have to find your way back out of the hole, and it's kind of a mess. It turns out that the right thing to do is simply to turn things inside out, or rearrange them like this, so that every time you make a decision you do something: you peel off one case and you make the remaining stuff easier. So if the argument count isn't three, you print the error message and you're done. Then you try to open the input file; if that doesn't work, you're done. If that works, then you try to open the output file, and if that doesn't work, you're done. Otherwise you can go on into the work. So you notice that instead of if, if, if, I have if, else if, else if, else, and that is definitely the right way to write these things. This is the way that you write a multi-way decision in almost all of the programming languages that anybody here is going to use. And you notice that it doesn't get indented; it's all down the left, so you don't have the thing migrating off the right side of the
screen on your machine. Keep it to the left, and then it's obvious idiomatically that it is a multi-way decision. Look at the same thing in Fortran. Jim assures me that he didn't write this code; he's an honorable man, so I'm going to take him at his word. This is back to Fortran, whatever, 58 or something like that, not quite but close, and it's kind of interesting. This is actually a sort of multi-way decision, but it's really hard to see what's going on with those arithmetic ifs in there. The other thing that's kind of intriguing, just as an aside: you'll notice the number 32 appears twice there, in two lines. They're different thirty-twos. The one on the first line is a label that refers to a line near the bottom, and the other one is a mask, like the ones we were talking about; it's a bit pattern, because it's two to the fifth. But it's the same 32 to the naked eye. So you don't want to write it that way, but in Fortran 66, or whatever that was, that was essentially all the choice you had. You didn't have to use arithmetic ifs, but in Fortran 90 and onward you can do a lot better, and you can write it like this. You notice that the code becomes a lot simpler. First, you can indent it, which is always a good start, but secondly you can use an if, else if, else to make it clear what the multi-way decision is, and also you can use operators like less than, equal, greater than, and so on, rather than the .EQ. form. I hope we're not being taped here. Well, we'll just downplay the audience commentary, just in case that shows up in some place that people don't really... Okay, I got it from Jim. Okay, so, here's another one. These repetitive things: remember, we've seen repetitive code before. Machines are great at repeating things; why are programmers forced to repeat? What's going on here? This one actually has an interesting lesson buried within it. You notice the code is pretty much the same all the way through: if new equals a 10 care of
CH plus plus, comma, and then a couple of numbers. So everything is essentially identical except for that pair of numbers; everything else is the same. So what's going on? It's fairly complicated control flow; I mean, not complicated complicated, just repetitive stuff with a very, very small amount that's changing, and that's pure data. And so the problem with this code, the thing that could be improved, is that it's got the data and the code kind of intermixed like this. What you'd really like to do is separate them out, and get the weirdnesses into the data, where you can manipulate them easily, and the regularity into the code, because regular code is a lot easier to work with. And so you could write it like this. The data structure is just an array of numbers, which I've written as pairs; they're the magic numbers that showed up in there. And then here is a simple loop that walks along those numbers two at a time and calls that function. So this is much simpler, and it's got that clean separation between the data and the code. This is a general principle, which you can find all kinds of opportunities to apply in your own code. Think: data is the place where you put the dirty stuff; code is the place where you want to keep things as clean as possible. Okay, and speaking of that, this one... I don't have a nice title on this. This is real code. It comes, in fact, from 4ESS, which is the long-distance switching system that AT&T used to sell for a long, long time; it did basically all the long-distance service in the United States for many years. Those days have passed. This was part of the billing code, actually; it wasn't the mainstream switching code. So it's typical grimy but, you know, okay code, as would be seen in any kind of software system, really. Then watch. Okay, this is one piece: it's something, NCP 100, which is a network control process or something, I can't remember. Did you see that? This is 800 inbound WATS. So let me go back again. See that? Not much changing, is there? Let's
go one more: something called MEGACOM 800, which is a different flavor of 800 service. Notice how much changed? Okay, kind of weird. Let's do one more. Okay, so we've got one, two, three, one. Did anybody notice something else? Now watch the bottom, about four lines from the bottom. Astronomers don't need to be told about blink comparators, right? That's exactly what this is. We didn't discover Pluto, but there is something there. What's going on is that this is code that is written the way an enormous amount of real code is written. Here's a piece that does something, and I have to do something that's sort of like that, and rather than trying to factor this into something clean, I'll copy this stuff over to here and I'll start fiddling with it, and then I'll copy again and start fiddling, and after a while I've got a bunch of these things. Suppose that an error is found in one piece of it. I have to go back and kind of reengineer where that error should be fixed in all of them. What are the odds I'll get it right? Not very good. And then you get things like this kind of genetic drift, that little bit there at the bottom where something changed. That's a harmless mutation here, but who knows what would happen in a more complicated organism. So beware; an awful lot of code is written that way, sadly. Okay, so let's see, here's another one, while we're doing blink microscopes or comparators or whatever. This is a function that is supposed to insert a word into a dictionary. So there's an elaborate computation here; the whole thing is figuring out, here's a word, I want to stick it in the dictionary at a place where it will be easy to find, in some sense. You can see there are words like hash and so on, so it's obviously attempting to be efficient about that. So that's the insert. Now, having inserted it, wouldn't it be nice to see whether you can get it back out again? And so on the next page of the book that I took
this from, in fact, is the check version of it, and I've reformatted these slightly so that you can see that there's a lot of similarity there. People have probably been telling you since you were in elementary school that when you write code you should modularize it: you should break it up into suitable functions and subroutines and so on, so that the modules do one thing. You know, you can practically play the whole song yourself. And this thing has been modularized: there's a module for inserting and there's a module (whoops, get that right) for checking. The problem is it's modularized wrong. There's this very elaborate computation that figures out where things are going to be in the dictionary, and that appears twice. Elaborate computations ought to appear only once. So it ought to be there once, as a module, and the insert guy says, tell me where I should put my new word, and the check guy says, tell me where I should look to see whether my word is there. The modularization, if you do it that way, means that the complicated computation is only there once, and that makes it a lot easier and you're more likely to get it right. And it finesses this problem of: suppose somebody fixes the bug on one side; who fixes the bug on the other? Yeah, well, as I was saying before I was so rudely interrupted... I don't actually remember what I was saying when I was interrupted, but it was something about modularization and the two different versions of this program. And actually I don't have a heck of a lot more, and I'm grateful that as many of you have come back as have come back. So let me finish off with a modest number of sort of quickies. Something went by here; you probably didn't notice it because it went by so quickly. But right there at the top of the first of those functions, the dictionary insert function, it says insert, and it says it returns 1 if the word was already in the dictionary and otherwise returns 0. That's what the
comment says. What does the code say? The code says it doesn't return anything. Okay, so the comment and the code disagree. Which one's right? No ambiguity there. There's an aphorism in the military that says: believe the terrain, not the map. So the code is the terrain and the comments are the map, and they may well be wrong. Let's look at some comments; actually, it's just sort of fun. One of the things you want to do is make sure your comments actually tell you something. You know, well-written code doesn't need a heck of a lot of comments, but there are certain critical things. The one thing that code, whether well written or not, doesn't need is comments that don't add anything to what the code already says. So look at this. It says ignore all the signals, et cetera, and then every single one of those things basically says ignore this signal, ignore this signal, ignore this signal. You could presumably do it like that, and that would be a lot simpler. There are times when you see comments and you think whoever was writing them was actually being paid, perhaps, by the line. That probably doesn't work in scientific computing, but in commercial stuff you wonder sometimes. Like this. Not yet. All right, more than doubled my pay. These are real, by the way; I have not made these up. These all come from real code, not even textbook code. Well, you could say these are like COBOL; actually, I guess if you changed the comment part around it would almost be COBOL. And this is probably C++, and this one comes complete with a big box comment. So, not very helpful. Do not write comments like that; they do not add anything except vertical or horizontal space to a program. Do you really want to do that? One place where you do see comments, and probably it's a sign that something isn't right: this is actually a piece of code that I wrote, and it's actually part of AWK. At one point in the evolution of AWK I introduced a horrifying, just an awful, bug into
the thing, and a whole class of programs caused the AWK compiler to go into an infinite loop before it ever got to looking at anybody's data. As I tried to figure out what on earth I was doing, I gradually figured it out, and in this piece of code, which you're not expected to understand, of course, the comment density is extremely high, like one comment per line, and in the rest of the code it's probably one comment every 20 or 30 lines at most, something like that. So when you see that sudden burst of comments, it says something is going wrong. Sometimes what's going wrong is that this is a piece of code that's actively wrong, it's still wrong, and somebody was trying to muddle their way through it by putting in comments. So the deal is that you don't want to document bad code, in other words to say, I did something stupid here. Rather, you would like to rewrite it so you don't have to say that you did something stupid there. There are other kinds of implicit documentation that show up in programs. One of these is various kinds of constants. You see 3.14159 or whatever and you know that that's going to be pi, and it might even be better to see something that says pi and then know that it was properly defined and computed somewhere else. Generally, when you see magic numbers, numbers whose significance is not instantly obvious the way constants of nature like pi are, something is wrong: somebody has missed a chance to make the code easier to understand. So here's a piece that came, actually, from a student in my class a couple of years ago. The comment itself is kind of interesting: this poor child was unable to figure out how to do something without a manual, something pretty basic, actually. It has nothing to do with Java; it could be done in C or C++ as well. What are those constants? What's 65? Ah, it is indeed 64 plus 1; it's a capital A, yes. And so 90 is a capital Z, and 97 is a little a, and so on. So this is basically asking, do we have uppercase letters, lowercase letters; 48 to 57 is digits; and there are a couple of other random characters.
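A small sketch of the same test written three ways, in C rather than the student's Java. The student's code is not in the transcript, so the function names are hypothetical, the upper bound 122 for lowercase z is filled in here and not quoted from the slide, and the "couple of other random characters" are left out. The first two versions assume an ASCII-compatible character set.

```c
#include <ctype.h>

/* Magic-number version: correct on ASCII systems, but the reader has to
   know that 65..90 is A..Z, 97..122 is a..z and 48..57 is 0..9. */
int is_alnum_magic(int c)
{
    return (c >= 65 && c <= 90) || (c >= 97 && c <= 122) || (c >= 48 && c <= 57);
}

/* Same ranges written with character constants: nothing to look up. */
int is_alnum_named(int c)
{
    return (c >= 'A' && c <= 'Z') || (c >= 'a' && c <= 'z') || (c >= '0' && c <= '9');
}

/* Library version: isalnum from <ctype.h> does the classification.
   The cast keeps the argument in the range that isalnum accepts. */
int is_alnum_lib(int c)
{
    return isalnum((unsigned char)c) != 0;
}
```

In the default "C" locale all three agree for the 128 ASCII codes; the library version is the only one that does not depend on letters being contiguous.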
And so what those should be is 'A', in quotes, and so on for all the other characters. That way it's self-documenting, and there's no need for you to figure out what the correct value is, 97, and no need for somebody else later on to figure out what 97 means. So you don't want to do that. The other thing: I was talking outside with people about how you name things in programs and whether there are conventions for naming. The specific conversation was about conventions for things like capitalization or underscores and long names and things like that, but before you even get to that, you actually want to name things like variables in a way that, again, explains what their function is, what their role in the program is. Local variables like i, the iterator in some trivial loop like for i equals 1 to 10, don't need comments, but things like global variables need them, and function names should be self-explanatory. This is a wonderful example of how not to do it, again from my class, six or seven years ago at this point, I guess. You can probably sense who this kid's roommate and perhaps girlfriend were from the names in the middle there. I never figured out who Grace was, but I'm not sure she was properly associated with Earth. Oh well. So anyway, I've given you some rules, sort of arm-waving stuff, and this is not exactly a summary but kind of a repeat of some of the things that I talked about along the way. Cleverness is misplaced almost always, but you don't want to be too dumb about it. It's almost always going to be the case that if you write your code clearly and simply and straightforwardly it will be plenty efficient, and at least you'll be able to work on the parts that aren't efficient and not worry about all the other parts, which is not where the action is. It helps to know your language, including its idioms, of course, on things like control flow. Boundary-condition checking is something that you do while you're writing the code. You think
about it: I just wrote a loop; what's going to happen if I go through the loop the right number or the wrong number of times? What happens if the limit is zero? What happens if the limit is the size of the array? Things like that. By doing that kind of boundary-condition checking right at the time you're writing the code, you can head off an enormous number of bugs before they ever happen, and so that's definitely something that you want to do. And then defensive programming protects you against yourself, or whoever else is using your code, and so that's really important. But the bottom line is that what you want to do is say what you mean as simply and clearly and directly as you possibly can. You really want to write stuff that's clear, because that's the only way you're likely to get it right, it's the only way that you're going to be able to work on it later on, and it's the only way that, when you cease being postdocs, grad students, and so on, and get to be tenured full professors, your students will be able to figure out what the hell you did. And so you really want to do it that way. Now, everybody knows all of this, and sometimes I'm preaching to people who already know it. So you all do well; go tell your friends, the ones who didn't stick around for the rest of the talk, that this is what they should do. There are lots of reasons people give for not having time to do this. You know: who cares about style, the program works. Maybe it does, maybe it doesn't, but that's not a very strong argument. You can say it takes too much time to fix it up, but how much time does it take to fix a program that doesn't work? If you do it better the first time, you're more likely to have it work and not have to spend that unpleasant time fixing it up. People say style rules interfere with programmer freedom. I don't think that's a serious argument at all. You can write really clean code within somebody's set of rules without much problem, and if you go out and work in the real world, let's
say not in the academy, you're going to be forced to use some company's rules anyway, so you have no choice there. The rules are arbitrary: well, they're sort of arbitrary, but they're not that arbitrary. And one that's actually real, I think, for people in academic settings in particular, grad students, postdocs, and even junior faculty, is that you don't get any credit for making the program cleaner and neater and nicer, because that doesn't help you get the paper out, and the paper's the thing you have to get out. But if the paper is based on physics that has to be redone and rethought or retracted or whatever, then maybe you're not so well off. So even for that kind of setting, I think it's worth serious thinking about. If you get a program where all the little things are wrong, you can be confident that the big things are probably a little wrong too, whereas if all of the little things are clean and the program is really clean and straightforward, and you can understand what's going on, then the odds are strong that it's doing the right thing as well. Anyway, that's the end of it. Thank you all for coming back after that. [Applause] Thank you, Brian. We're running a little bit on time [inaudible], so we have time for a few questions. Okay, all right, go ahead. Oh, the question is: when I say, for example, integer plus quote zero quote to convert from an integer value into an ASCII value, is that one portable? Yeah, because that's a universal character set. Even though it's ASCII, American Standard Code, it's also UTF-8, the Unicode encoding, and so that arithmetic for digits in the ASCII character set works, and similar ones for uppercase and lowercase work. I still think it's probably in some ways better to do some other thing, just in case some other representation superficially looks like the same thing but is different. But I think that one is safe, and so I'm comfortable with that one. I wouldn't push it too hard, and as soon as you got outside of the standard sort of North American character set,
I wouldn't push that too hard. To print a character, the proper thing is to use... well, first, are you starting with the numeric value, let's say, 48, or are you starting with a numeric value 0? Because the deal is, if you have a number 0, you know, a binary pattern of all zeros, and you want to see what that is as a digit, then you need to do the conversion. If you already have something which is ASCII inside and you want to print it, then %c in C programs, or in Java for that matter, is the right way to do it, or even in Python for that matter. Is there another question? Yeah. Global variables are, I think, on balance mostly not the right way to do things. There are times when they just save your bacon: it's so much easier to just say the stuff is up there and anybody can look at it, anybody can modify it. The good news is anybody can look at it; the bad news is anybody can modify it, and therefore it's potentially not good. Then there may be certain kinds of settings where the fact that it's global interferes with something else because of the uncontrolled access over time. So if something is a global variable, it means you're in real danger using it, say, in a threaded kind of computation, where two threads might be manipulating it in an inconsistent way and you have to lock it, whereas if that data were stored with each thread, then the locking part of it goes away. But I think the trend over the last number of years has been fewer and fewer global variables, more and more discipline, more and more putting things into classes or, in Fortran, whatever the right word is, modules, and so on, so that they're better controlled. Certainly, Fortran programmers using, you know, unnamed common or something: don't do it. Escape, go get groceries. [Applause]
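The digit conversion from the first question above, integer plus '0', can be sketched in C as follows. The function names are hypothetical, and the range checks are an addition in the spirit of the defensive-programming advice in the talk, not something the speaker showed. The arithmetic itself does not depend on ASCII: the C standard requires the characters '0' through '9' to have consecutive values.

```c
/* Convert a small number 0..9 to the corresponding digit character.
   Returns '?' for anything out of range instead of producing garbage. */
char digit_to_char(int d)
{
    return (d >= 0 && d <= 9) ? (char)('0' + d) : '?';
}

/* Inverse conversion: digit character to its numeric value,
   or -1 if the character is not a decimal digit. */
int char_to_digit(char c)
{
    return (c >= '0' && c <= '9') ? c - '0' : -1;
}
```

With these, `printf("%c", digit_to_char(7))` prints the character 7, which is the distinction drawn in the answer: convert first if you hold the number, and use %c directly if you already hold the character. The same trick for letters ('a' + k) is the part worth not pushing too hard, since the standard makes no contiguity promise for letters.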
Info
Channel: Institute for Advanced Study
Views: 104,800
Rating: 4.9023905 out of 5
Id: 8SUkrR7ZfTA
Length: 70min 42sec (4242 seconds)
Published: Tue Jul 11 2017