Bringing the Unix Philosophy to Big Data by Bryan Cantrill: FutureStack13

Video Statistics and Information

Captions
Thank you very much, good morning, thanks for making it in here. Normally at any given conference I'm the one reaching back into the crypt and dropping history on everybody. I kind of like that role; some people find it a little bit annoying, but I like history, and I've come to cherish it as my own that at any given conference I'm the one who's going to be talking about history. But then Hilary Mason goes 19th century on my ass, so I was not ready for that at all; I thought that was a terrific, terrific talk by Hilary, obviously. So I guess I'm no longer the history guy, because this is not the 19th century: we're going to talk about the 20th century, and in particular about bringing the Unix philosophy to big data.

So, people know what Unix is, right? Show of hands: Unix, that term rings a bell? Good. Unix first started to be written in 1969, well over 40 years ago, and started to appear slowly in the 1970s. Unix was an incredibly important development, because it was not just a new system; it actually represented a new way of thinking about systems. Operating systems prior to Unix, operating systems from the sixties, kind of the primordial era of operating systems, were these big control programs, like the Master Control Program you see in Tron: that big, all-knowing executive. I love the fact that they called them executives; when we're trying to figure out what a metaphor for computation is, it's like, well, maybe it's your boss. So we had these all-knowing control programs, and then Multics came along, and Multics had some really interesting ideas; Multics in particular introduced the idea of a shell, and it had some other very important notions, but it was still this very large, sealed monolith, and Multics was then taken away from Bell
Labs. The best thing that ever happened in the history of computing is that the GE 645 at Bell Labs was taken away from them, and they had to go invent their own system on a much smaller computer, on the PDP-7. They really set out to develop a file system: Ken Thompson wanted to develop a file system, and in what might be the first act of test-driven development, he needed to develop some little programs to test the primordial file system. So ls, rm, cat, mv: these were actually tests for the file system. The fact that we can use them on their own is just kind of an artifact of history. And they had this idea of minimalism, that these things should be very small, this kind of small aesthetic. Indeed, as many of you probably know, the term Unix is actually a homophone of "eunuchs", which is to say the castrati: this was a castrated Multics. As Dennis Ritchie said, "we were a bit oppressed by the big system mentality." I mean, it was the 70s, so these guys were hippies; they had to be oppressed. We were a bit oppressed by the big system mentality, and Ken wanted to do something simple.

So that was Unix as it was first conceived, a kind of test for the file system, and they were building these small little commands, and they could redirect input and redirect output and so on. And then there was Doug McIlroy. I think we think of the fathers of Unix as being Ken Thompson and Dennis Ritchie, rightfully so, because they got the Turing Award for it and so on, but Doug McIlroy may actually be the proper father of Unix, because McIlroy had really the key idea in Unix, and that is the idea of connecting disjoint components, connecting them together. He had this idea of hoses, garden hoses, and he kept trying to persuade Ken Thompson and Dennis Ritchie to do it; in fact he tried to persuade them from 1969
but they had no interest in doing it. McIlroy just would not shut up about this; he was constantly going at them and badgering them, and he did this over three years. You have to admire McIlroy for being very persistent about this. What finally broke through is that he came up with a syntax, a syntax to express this idea of the pipe, of taking the output of one command and making it be the input of another, and finally Ken Thompson says, "I'm gonna do it." Now, history doesn't record the inflection: was it "I'm gonna do it!" or "all right, all right, I'll do it"? I don't know whether he felt badgered and nagged; it's hard to know what he actually felt at the time. But he did it, in classic 1970s fashion: it only took a day to implement. And the next day, in one of my favorite quotes in the history of Unix, "the next morning we had this orgy of one-liners." Man, talk about the sweet spot off the bat: that is an out-of-the-park home run for software systems, the orgy of one-liners. If you are ever developing anything in your life that results in an orgy of one-liners, you nailed it. Speaking for myself, when I developed DTrace at Sun, I recall our own orgy of one-liners; I actually don't recall it quite so erotically. But the discovery of one-liners was very important, and you certainly see it when you look at awk and these other technologies, and indeed we saw it on stage this morning: one-liners are terrific, and they lend themselves very well to orgies. And apparently it was a terrific orgy of one-liners at AT&T Bell Labs; if you've actually seen photos of these people, this visualization becomes very disturbing, because these are, at the end of the day, super nerds. But it was actually the pipe that gave us the Unix philosophy, and the Unix philosophy, as articulated by Doug McIlroy, is that we should write programs that do one thing and do it
well; write programs to work together; and write programs that handle text streams, because that is a universal interface. This is from the early 70s, and I think it is absolutely true today. I think we should strive for all three of these things in the way we write our software and in the way we think about our systems. The interesting thing to me is that this philosophy did not predate the pipe: they didn't develop the pipe because they had this philosophy, they developed the philosophy because they had invented the pipe. And I believe that this is the single most important revolution in the history of software systems, this idea of building small, well-defined tools. Indeed, I've come to see the history of software as a battle between the forces of good and evil, the forces of simplicity and complexity, and I hope I don't offend too many people in the room when I say that the manifestation of this is the battle between frameworks and libraries. Libraries put flow of control with the user: I pick up various components to do something new. Frameworks kind of trap me in someone else's control; the control has been inverted and I'm stuck in their complexity. J2EE is not the Unix philosophy. I feel I can speak ill of J2EE because I worked for Sun for 14 years, and, you know, that's gone down to a watery grave, so I feel I can speak ill of it: J2EE does not represent the Unix philosophy. But the Unix philosophy was a very important revolution, and I think its moment of triumph, its moment of pubescence if you will, came in 1986, when Jon Bentley posed a challenge that I think has become the epic rap battle of computer science history, and that is: read a file of text, determine the n most frequently used words, and print out a sorted list of those words along with their frequencies. This is like an interview question for an intern today, right? And it is an interview question for an intern because Unix has given us the gift of simplicity that allows us to solve this ad hoc problem dynamically and quickly.
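For reference, McIlroy's famous solution to exactly this challenge is six small tools glued together by pipes; the sketch below is essentially that pipeline (the sample input and the top-10 cutoff here are illustrative):

```shell
# A tiny input to demonstrate on (any text file works).
printf 'The cat and the dog and the THE\n' > input.txt

# Print the 10 most frequent words in input.txt with their counts.
tr -cs A-Za-z '\n' < input.txt |  # split into one word per line
  tr A-Z a-z |                    # fold case
  sort |                          # group identical words together
  uniq -c |                       # count each word
  sort -rn |                      # most frequent first
  sed 10q                         # take the top 10
```

Each stage does exactly one thing; the pipe composes them, which is the philosophy the talk is describing.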
Don Knuth, one of the fathers of computer science, obviously a terrific computer scientist, took on this challenge and, and you should go read this paper, I'm not exaggerating, wrote a seven-page paper describing the bespoke algorithm he developed to solve this problem. He did it in a new system he had developed called WEB, this Pascal-like literate programming system; you haven't heard of it for a reason, as they say. It was, I'm sure, beautiful in its complexity, but ultimately it was a ridiculously complicated solution to the problem. And then Doug McIlroy, hero of systems, emerges like Indiana Jones and fires his revolver at this complexity. I realize now that I'm getting so old that that's actually becoming a dated reference; I was horrified by Hilary Mason's implication that some people don't get these references: "I don't get it, why is that funny? Oh my god, I do get it, I'm old." So anyway, Indiana Jones: you should go see it, and then go argue with your older co-workers about whether it's a good movie or not, which it is, by the way; you're wrong if you think otherwise. In Indiana Jones style, McIlroy slays Don Knuth with a very simple pipeline: a beautiful triumph of the Unix philosophy.

But with big data, history seems to be repeating itself, and it seems to be repeating itself almost pitch-perfectly. The original Google MapReduce paper, from Jeff Dean and Sanjay Ghemawat in 2004, poses a problem that looks an awful lot like Bentley's problem: the URL access frequency problem. Process logs of web page accesses and output "URL, 1", and then a reduce function adds all of these together. If you read the paper, it's funny, because the paper is clearly aware of Unix, but the paper desecrates the Unix philosophy, and the solutions that we have today do not adhere at all to the Unix philosophy.
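The URL-access-frequency problem itself fits in a one-liner; here's a sketch assuming a simplified toy log where the URL is the second whitespace-separated field (a real access-log format would just use a different field number):

```shell
# A toy log where the URL is the second field (illustrative format).
printf 'GET /a HTTP/1.1\nGET /b HTTP/1.1\nGET /a HTTP/1.1\n' > access.log

# Count accesses per URL, most frequent first: the whole "MapReduce" job.
awk '{count[$2]++} END {for (url in count) print count[url], url}' access.log |
  sort -rn
```

The awk associative array is the "map" and "reduce" rolled into one; sort handles the final ordering.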
Forget not even using the actual tool set: philosophically, they are divergent, and they're these ridiculously complicated solutions. In particular, look at Appendix A of that OSDI '04 paper: it's got a 71-line word count in C++ that has clearly been grossly simplified so it can fit on the page; you know that sucker is actually a couple hundred lines. A 71-line word count in C++! If I ask you to count the words in a file, you don't think to yourself, "okay, I'll need a C++ compiler." No, no, no: you think to yourself, "okay, I'll just use wc." wc, word count, a command so old it has two characters; it's in the lizard brain of Unix. When we Unix folks need to count words, we use wc, but when we count words in terms of big data, well, then I can't use wc; now you've got to take C++ off the shelf. Why? If you have to do all this complexity, why can't we take the Unix philosophy to big data?

Now, what are the challenges here? First of all, I loved Hilary's definition of big data. My definition of big data, I guess, is that it's bigger than the machine I can currently work on; that definition obviously changes with time. It's quantities of data that dwarf a single machine; you've got to allow for massively parallel execution; you have to allow for multi-tenancy. And to make use of not just the Unix philosophy but also that great tool set, wc and grep and so on, you've got to be able to virtualize the operating system.

So let's talk about how we can go do this, scaling storage first of all. There are three protocols for storing things: block, file, and object. Block, let's say SANs: I won't talk any more about that; far too expensive at scale; it can't possibly work, for a lot of reasons. NAS, file-based protocols: the file system is actually too permissive an abstraction to be able to build scalable big data systems on, I believe, because it implies a level of
coherence that's actually very hard to implement: a file system allows you to modify a file right in the middle of it. If you make a file system available, someone's going to put their database on it. Please don't do that: do not put a transactional load on what needs to be a big, distributed data store. It can't possibly work. Why? Because of CAP. Eric Brewer has given us the gift of CAP; it's a gift because it tells us we can't actually have it all. CAP being consistency, availability, and partition tolerance: you can only pick two of those. Asterisk: you have to pick partition tolerance. "Wait, wait a minute, I was psyched, I was gonna pick C and A." "No, no, you have to pick P." "Oh, P and what? All right, fine." So you pick C and P, or A and P. Fortunately, object storage, like S3, is just similar enough to a file-based abstraction, but because there are no partial writes, we can actually make proper CAP trade-offs.

So what does an object store look like? Traditionally, for both durability and availability, objects are erasure-encoded across nodes. You store an object to S3 and it gets splattered across many different nodes; it gets put into a wood chipper and spread across a bunch of different nodes, so it can survive the failure of any given node and reconstruct the object. Now, a different approach, instead of splatting it across nodes, is to have a highly reliable local file system and splat that object across spindles. That gives you the same kind of durability and resiliency, at the spindle level, inside the compute node, and that's what ZFS does. I don't know if you've ever heard of ZFS; it's a file system we developed at Sun, and thankfully Sun did everyone the favor of putting all of its source code in the lifeboats and sending it out into the world before the vessel actually sank. So we actually have ZFS, we have DTrace and so on, and ZFS represents one of the four foundational technologies in our SmartOS.
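As a concrete sketch of the ZFS approach just described (device names are purely illustrative, and these commands need a system with ZFS and root privileges, so treat this as a config sketch rather than something to paste blindly): a raidz pool stripes data plus parity across spindles, and snapshots give the cheap rollback the talk relies on later.

```shell
# Stripe data + parity across four spindles (illustrative device names).
zpool create tank raidz c0t0d0 c0t1d0 c0t2d0 c0t3d0

# A dataset to hold objects.
zfs create tank/objects

# Snapshot the pristine state...
zfs snapshot tank/objects@pristine

# ...and after a tenant's compute has run, throw everything away.
zfs rollback tank/objects@pristine
```

The snapshot/rollback pair is what makes it safe to drop arbitrary tenant compute next to the data and then reset for the next user.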
So, okay, what about virtualizing the operating system? Well, I mentioned that our requirement, to be able to use the Unix philosophy, is that we have to virtualize at the layer of the operating system. Historically, since the sixties, we have virtualized at the layer of hardware, and this is still true today: if you go spin up something in Amazon, or if you go spin up a Linux virtual machine or a Windows virtual machine at Joyent, you are going to be given virtual hardware. I will give you a virtual microprocessor, virtual x86, virtual DRAM, a virtual NIC, and you are going to put your own operating system kernel on top of that. You only think of it otherwise because all that stuff kind of happens for you automatically and you pop into a shell, but what has actually happened is this horrific slaying of kittens that has allowed us to have this virtual abstraction. And it is a slaying of kittens, a kitten slaughter, and this may be an inappropriate metaphor, because the operating system doesn't really get along well with others. Trust me, I'm an OS kernel guy; I don't get along well with other software engineers. The operating system says, "no, I will control the hardware," and so the virtual operating system believes that it controls its virtual hardware, and you start having a lot of these sitting side by side, and the box is in a lot of pain. That does not lend itself well to multi-tenancy; they're heavyweight; they've got a lot of issues. When we want to go virtualize the operating system, there's another approach: we can instead virtualize one layer above the hardware. Instead of virtualizing the hardware, virtualize at the OS: there's one OS, and I will punch out a container for you that is entirely secure, you can't get into anyone else's container, but your app is actually sitting there on the metal, on the hardware. You don't have the performance effects of having to virtualize hardware, and these things are very lightweight; they spin up and spin down. And this is
pioneered by BSD's jails, then zones, first in Solaris, and now the illumos systems like SmartOS have taken it to the logical extreme. All right, so we've got ZFS, we've got zones, we've got kind of the raw ingredients here: can we combine ZFS and zones and develop an object store that has compute as a first-class citizen? Thanks to ZFS rollback, we can allow for zones to be completely trashed. So we've got an object store where you're going to store your objects, and when you want to execute compute, I will drop your compute right into a zone, right there. You can do whatever you want in this little virtual environment, because when you're done, I'm just going to roll it back and clean it for the next person. There's some job scheduling that we need to go build, and what this allows you to do is actually use the proper Unix tool set. You want to do a word count across a million documents? Use wc. It's what God, or at least McIlroy, intended; McIlroy, you'll be pleased.

Now, this may seem obvious. It really wasn't obvious, and I've kind of told this story like "oh, we were going to bring Unix to big data"; that's actually not the story at all, it's kind of a nice little rewrite of the origin story. The actual origin story is: we wanted to go build an object store, can we go do something interesting? And I was fixated on the ZFS aspects of this, like, what can we go do with ZFS, and I had just totally forgotten about OS-based virtualization and zones, which of course we built our entire business on, so it's kind of ridiculous that I forgot about it. We had an engineer from Amazon, Mark Cavage, and Mark was the one who had the insight of "wow, can we combine these things?" and it just blew my mind: of course, oh my god, genius. And when I went to think about it later, what's strange is that I remember my mind being so blown by that, but I don't remember Mark actually telling me about it. I know Mark told me about it, but I don't remember the conversation, and normally when you have that moment, like we heard on stage this morning, you have that
moment, you just lock everything in: you know exactly where you were. I couldn't remember where I was, and so I went back over my IM logs, like, where the hell was I? Thinking, maybe we'll find the first conversation where we talked about Manta, and then: oh my god, this is why I can't remember, Mark told me over IM. So this is literally a transcript of my mind detonating, and as it turns out, when my mind detonates, it's laden with curse words. Mark says: "hey, we could escalate map/reduce to a first class citizen by way of routing your JavaScript code to some node thing running on a separate zone at the same place as your data; not sure this will be viable or not." And of course over here I'm like, that's a good idea, that's a really good idea, that's why he was asking about zones and ZFS and cross-zone access, and then of course I just begin swearing: why didn't I think of that? That was right in front of me! That was right in front of me, but sometimes it's like the pipe: the power of the pipe feels so obvious, and yet, at least in my case, Mark didn't have to beat me within an inch of my life for three years, which is what Doug McIlroy had to do to Ken Thompson and Dennis Ritchie to get pipes into Unix. At least he didn't have to hit me with a baseball bat; at least I got it. Sometimes these simple, powerful ideas are not obvious, but boy, as soon as he articulated it, it's like, oh my god, that is a terrific idea.

So what we built is this thing called Manta that realizes this dream. We've built a sophisticated distributed system, "sophisticated" being a euphemism for ridiculously complicated and hard to get right, on top of ZFS and zones, and this allows you to have in-situ compute. What it allows you to do is, again, store your objects, and when you want to do something to them, instead of backhauling them to transient compute,
instead of backhauling your objects from an S3 to an EC2, or backhauling them into some infrastructure-as-a-service compute, you actually operate on them in situ. You want to grep your log files for a particular UUID? You just execute grep, and that grep will run in parallel wherever your objects are, so you can go through a redonkulous amount of data in a very short period of time, because you're not moving it. We've fundamentally changed the physics of big data. And by the way, the raw tools that we have in Unix were built to do this: grep is really good at this. And awk! People who program in awk, give me some love. All right, those of you who haven't, you've got to go check out awk, built in 1977. If Unix is the Old Testament, awk is the beginning of the New Testament. awk is an incredibly important language that has inspired Perl and Python and many other higher-level languages beyond it, but awk is so simple, so tight, and so fast: it's a very tight little language that allows you to pattern-match in text files and take action on the matches, and it runs like a bandit. So much of what we do today with these ridiculously complicated, baroque MapReduce frameworks can be implemented in several lines of awk; it is an orgy of one-liners waiting to happen. awk is always down for an orgy; I'm not sure that's the way Peter Weinberger thinks of it, but that's the way I think of it.

So we can do all these things using this terrific tool set. And so, actually, if we go back to Bentley's challenge: this is the McIlroy solution to Bentley's challenge, cast in terms of Manta, and we can literally go cut and paste this, so I'd like to go do that. We're going to cut and paste this line, and what I'm going to do is execute an mfind over my directory, and what I've stored here is a bunch of objects that are the v7 man pages, I think, as long as we're talking history. And that sed ${1}q: I'm
not sure why McIlroy has his solution in terms of sed instead of head, I'm not sure why, but I'm going to change that to 100, so we get the top 100 things. And what we're going to see, assuming the Wi-Fi works, I should have checked that before, is that we've added a bunch of inputs to this job. Let's go check on that; we'll do a little mjob get of this thing. It already finished? That's disappointing: before I could even check its status, it finished. So over here we can see that we had 308 tasks, and now of course 308 are done, and we can see exactly how long it took, and over here is the output. So what has happened here? We've got 307 different objects across the giant Manta cloud, on many, many, many different machines. In parallel, we kicked off this little pipeline, McIlroy's beautiful pipeline, and then we threw in some Aho, Weinberger, and Kernighan in there with some awk; beautiful, beautiful, orgy-tastic. We kicked that off everywhere, and I should say that this is what we kicked off everywhere: this little map phase, the tr, the sort, the uniq; then the -r denotes the reduce. This is our reducer, and our reducer is simply taking all these 307 different outputs and collating them, effectively, and giving you the output. You can see how quickly it ran, and by the way, most of the time it was running was all the bureaucracy required to actually get these jobs launched. If these files, instead of being several K, had been several hundreds of megabytes, the runtime would not have been materially different, because, again, the object is not moving. How long does it take to run a tr and a sort and a uniq on an object that's a couple hundred megabytes? Not long, right? Because you're just reducing the data so much. So let me just show you this as well, to give you a flavor for it; we'll run this mfind
again, and as this kind of implies, we actually have a true file-system-like structure. It's not an actual file system, but we have a hierarchical, file-system-like structure. If you've used S3 or other object stores, you know they violate one of the most important gentlemen's agreements we have in computing, and that is that the forward slash always denotes hierarchy. If you are going to name anything and the forward slash is not going to denote hierarchy, do the world a favor and reject the forward slash, because it's not what the person who created that object believes it is. And that, by the way, is a Multics-ism: Multics developed the forward slash, and then of course DOS had to destroy it. But we actually have a true file-system-like structure, which allows me to list objects and so on; those directories are of course synthetic, it is a distributed system. And let's actually do backgammon: if I want to actually get this object, I can just mget it, and okay: "this program does what you expect it will." I love this; as I love old Unix, it will ask whether you need instructions. End-of-man-page documentation; some things never change. So what I'm going to do now is something that's a little bit ridiculous for such a teensy, teensy, teensy tiny object, but now I'm actually going to mlogin to the object, because another thing that we developed, because we thought it would be useful, allows you to run a job that is interactive. So I'm now running a Manta job, and I have, if you will, logged in to this object, and if I look at my environment variables, I see that I have this Manta input file variable that is this object. Not interesting for an object that is, I mean, how obscenely tiny is this thing, 182 bytes; not really interesting. But for an object that is 182 gigs, much more interesting, because that object you can't actually meaningfully
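Inside an mlogin session, the object shows up as a local file named by an environment variable; here's a sketch (a stand-in for a real session: we set the variable ourselves and point it at a local file, where in Manta the service would set it for you):

```shell
# Stand-in for an mlogin session: in real Manta the service sets this variable;
# here we point it at a local file purely to illustrate.
printf 'hello\n' > object.txt
MANTA_INPUT_FILE=object.txt

# Inside the session, the object is just a local file: inspect it in place.
wc -c < "$MANTA_INPUT_FILE"
head -1 "$MANTA_INPUT_FILE"
```

The point of the talk's demo is that the same commands work whether the object is 182 bytes or 182 gigabytes, because nothing is being copied anywhere.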
kick around: 182 gigabytes takes a long time to move around, even over ten gig. So the ability to actually log into the object, and in fact I did this when I was developing the McIlroy solution, to develop your little pipeline right there and get it working on that one object, and then take a step back and apply it to ten objects, and then a hundred, and a thousand, and a million: that is the Unix philosophy. Start small, start simple, get it working, then build it bigger; that is what has enabled us to build really the most important systems that we've been able to build.

And I know I'm short on time here, but in terms of CAP trade-offs, there are many more details on Dave Pacheco's blog on this. Eric Brewer gave us a totally unsolicited celebrity endorsement when he said it was "a good example of the correct way to apply the CAP theorem." I'm pretty sure I teared up when I saw that; I'm pretty sure Dave, who actually wrote the analysis, had a seizure when he saw it. I think Brewer is somewhat of a celebrity. And in terms of the way we engineered it: we've got full support for CORS headers, all the kinds of things you would expect; we've got a terrific, or at least really interesting, SSH-based HTTP auth mechanism; and I'll have a last slide where you can get a bunch of additional information.

In terms of the future of big data, and this conference is about that, I love the tone of this conference, the future stack: I can tell you this for absolutely certain, the future of big data is compute/data convergence, no question. It is much, much, much easier to bring our compute to the data than it is to bring our data to the compute. No one talks about big programs or big texts; it's not a "big text" movement, it's a big data movement. You want to bring your program text to your data, not the other way around: bring
Mohammed to the mountain, as my mom would say; she grew up in Saudi Arabia. Marketing folks said we weren't going to use that line, but I think we should; I think it's a great expression. And Unix, we believe, is the natural way of expressing this computation: Unix is a great fit for bringing the compute to the data, and OS virtualization is exactly the right way to virtualize it. We think ZFS is the only sane way to do it, but, you know, go ahead and knock yourself out on something else. I don't think Manta is going to be the only system that does this, because this is such an important and powerful idea that others will do it too, which is great; I think it will be a revolution for humanity. But Manta is, I think, actually the first system that allows you to take the general-purpose Unix primitives and apply them to data in the large, and it's changing the way we develop software. So we are developing our own systems based on Manta. In particular, I'm a disaster-porn addict; sadly, I'm like the Charlie Sheen of disaster porn, because I have no actual remorse about it. I look at plane crashes and I get titillated; I'm really sorry, it's true, especially one like the one at SFO: no fatalities, that's the best kind, except for the girl who got run over by the fire truck, but that's kind of interesting in its own right. But I love disaster porn, and in software engineering, disaster porn is core-dump analysis. Core dumps and crash dumps are really big, and you don't want to move them around, so what do we do? We've got a system called Thoth, which is a Manta-based system in which we do all of our core and crash dump analysis. Any time an application dumps core in the Joyent cloud, it goes into Manta, where we can then either debug it or just fetishize about it, if you're into an orgy of disasters like I am. In terms of more information, you can just google Manta; of course there's an npm module, node-manta, that gives you the CLI, and the documentation,
we believe, is actually pretty thorough. You can hit #manta on freenode; the Manta engineers hang out there. And follow Mark Cavage, Dave Pacheco, and Yunong Xiao on Twitter, the core team behind Manta; well, those of the core team that are actually on Twitter; there are of course some other engineers who are such introverts they're not on Twitter, which is great. And with Manta we have been able to deliver on the dream: with Manta, we actually have big data one-liners. So, with that: let the orgy begin, and thank you very much.

All right, exhale, you'll have to exhale; sorry, I give people kind of a fight-or-flight reaction often when I present. Yes? Absolutely, yeah, that's right, I'm sorry, yes: let me put the last slide back up first so folks can take a quick photo of it, and I don't know if I've got any time for questions or not, but if you have any questions, definitely come up and feel free to see me; I would love to talk with you about your thoughts on big data.

Yeah, so the slide with, yeah, this is the modern-day one, and actually, because this is in my public area, when you sign up for Manta you can just cut and paste and run that yourself; you yourself can do a word count over things in the seventh edition man pages and then go appreciate the glory of the backgammon man page.

Yes? Yeah, that's a great question: are we going to open-source Manta? So, first of all, Manta is built on Node; it's built on a lot of components that we have already open-sourced, and I'll get to your question in a second, so many of the components are actually open source. We believe fervently in open source; we're the company behind Node.js, we are a fervent steward of Node.js, we believe in the community, so for virtually all the software we develop we have an open-source trajectory for it, which is a way of saying "not yet, but definitely stay tuned." And a lot of the things we've used to build Manta you can use yourself to
build your own apps: ldapjs, node-restify, node-bunyan, and so on. Any other questions? Super. Thank you so much.
Info
Channel: New Relic
Views: 14,691
Rating: 4.9319148 out of 5
Keywords: apm, New Relic, engineering, Software (Industry), fullstack monitoring, devops tools, enterprise cloud, fs13, engineering tools, full stack, cloud, Unix Philosophy, cloud monitoring, Big Data, software engineer tools, cloud migration, software application testing, monitoring, futurestack, application monitoring, fullstack, fullstack visibility, unix, philosophy, future stack
Id: S0mviKhVmBI
Length: 31min 8sec (1868 seconds)
Published: Mon Nov 11 2013