Release Engineering Keynote | Chuck Rossi | Talks at Google

BRAM ADAMS: Welcome, everybody, on this relatively sunny day at Releng 2014. So there will be lots of people here, around 100. So this was a huge organizational effort. So there's quite some people who worked on this. Not everybody could make it here. Chris and Kim, unfortunately, could not be here. But there are some other people in the room, like Foutse, for example; Stephany, she's still, I think, outside; Boris is just coming in. And then there's Akos as well, who's probably also outside and coming in soon. And I'm Bram, Bram Adams. OK, cool. First of all, we're at Google right now. And this would not have been possible without some people internally who have helped us quite a lot. That's one of them, Boris. And Akos is the other one. So I really would like to thank them and people who are here, like Dominic and Eugene, and so let's thank them first for helping us set this up. Cool, OK. Now, Releng. A while ago, I did my yearly chocolate pilgrimage in Belgium. And I ended up in Brussels at some conference called FOSDEM, which is about open source development. And I wandered the corridors. I could hardly pass. So these are people lined up-- more people lining up to go in the room. And the room was full and blocked by somebody in a Puppet Labs t-shirt-- no correlation there. And then I was trembling with trepidation there, and then I took a picture, which is blurry on purpose. And it shows "Full". And this was a session on configuration management. Now what was that? Well, this is a session about all things release engineering, actually. They're talking about deployment, clouds, enterprise configuration management. It's full. And these are all people who want to learn these open source technologies supporting that. And then I said, yes, this makes sense, because I saw this blog post a bit earlier by this gentleman here, who said, well, you know, continuous delivery is mainstream. And he got a lot of backlash there, which is weird, because we just saw all these people lining up. They want to use continuous delivery and all these fancy [INAUDIBLE] technologies. And this backlash led to an update. And his point was actually that since 2010, people like Facebook, Google, Amazon have been doing continuous delivery-- all these things. So we're four years later, so this should be mainstream right now. But he made some formulations that caused some controversy. And later on, there came this blog post that actually nailed it down, exactly. And I want to zoom in exactly on what this blog post said. What it says is actually yes, people want to really apply continuous delivery and all these [INAUDIBLE] engineering techniques in their situation. And that's where the problem is. Because how do you do that? It works at Google. It works at Facebook. They put lots of effort into this, and made some mistakes along the way. How does their work apply to other companies? OK. For example, how can you actually get buy-in from your management to spend effort to get there? All these kinds of things-- what techniques can you use, what are the tools? So bottom line, you have a bunch of people in the industry who want to apply continuous delivery. Plus you have a whole bunch of these guys-- researchers who want to help you, who want to prove that continuous delivery helps, it improves quality. So what happens if you combine both? Then you're here, at Releng. That's the goal of this workshop-- people talking about experiences, how they can go to rapid release, and all these kinds of things.
We have researchers who want to help, who show some results, want to get ideas for further research. And that's basically why we're here. Right here. OK, cool. So now the workshop. What will we see today? Well, we had lots of submissions. We didn't get five, not ten, not 15-- 18 submissions, which is quite cool. Even cooler, especially given what I just said, is that half of them are from industry and half of them are from research, exactly. So from these 18, 16 will be presented today. You'll be seeing them. And you can interact and discuss and these kinds of things. And now, we had a very busy schedule. So we had one week for people to actually review all these submissions. So some people put in a huge effort here in one week-- they did all this reviewing and even discussed things online. So really, let's thank these people for doing this work. BORIS DEBIC: All right, everybody, I want to welcome you again, and I want to welcome our first speaker today, Mr. Charles Chuck Rossi, of Facebook. I know him-- yes, give him a hand. Give him a hand. He is already a release engineering celebrity of sorts in Silicon Valley for many reasons, which I won't go into. But I know Chuck from Google. He used to work at Google. In 2008, Chuck joined Facebook, and he's been working in the release engineering group at Facebook ever since. As you probably know, Facebook's application on iOS and Android is the most popular application on the planet. And those guys who work with Charles, they release this twice a day. So he has a lot of experience in how to push change lists and bug fixes very fast into production. I would like him to share with this group some of the stories and war stories and some of the approaches that they take at Facebook to make this a success. Chuck, please. CHUCK ROSSI: Thank you, Boris. I hired Boris at Google. I learned a lot about hiring after that. Yeah, sorry for that. So I want to talk a little bit mainly about what we're doing lately at Facebook, and it's all about mobile. I've talked a lot about the front end release process, the facebook.com release process, which Boris says we're a little bit famous for because we do it twice a day, every day. Facebook.com rolls out new code every day, twice a day. I'll give you inside information. It's around 8:00 am Pacific and around 4:00 pm Pacific is when the whole site rolls. Anywhere between 30 and 300 cherry-picks go out per roll. It's a quasi-continuous deployment, with well over 1,000 engineers touching it and over a billion people being affected every time we push that button. Some of my release engineers are in the back. They're a little nervous because it's about time we should be rolling it. Hopefully, everything's good. If you see Facebook go down, somebody wave their hands like, it's not working. Let me know. The genesis of this talk and the talks I give really came-- I've got to give credit to John Allspaw and John Hammond. At that talk, I think we've all seen it, from the Velocity conference from-- when was it-- 2009, where they defined the dev/ops thing. And we had been doing this organically at Facebook since I got there in 2008. And it gave me a voice and name to call this thing and a sledgehammer I could use for the developers coming in saying, this is how we do stuff. It's been validated by these guys. This is the way we're doing it. The thing I got from that talk-- what's the takeaway from that talk? This slide. Right? "No." We're all going to look like that as release engineers.
That pretty much summed up my experience up to that point of being a release engineer and saying this is how I operate. But it changed. And for mobile, it changed a lot. So the dev/ops movement and everything we learned at web-- I consider web delivery a solved problem. For Facebook, it is a solved problem. We've whittled down the team that supports pushing those 300 changes a day down to effectively two people. Two people run facebook.com from a release engineering point of view. Then we decided, OK, we're a mobile company. And this became a problem, because we threw out everything. All the good stuff that we had learned, all the good things that we had from so many years of building up to a continuous delivery system and all this dev/ops crap was great. And then mobile came along, and it's dumb. It threw away everything. We had to start again mostly on the culture side and on the thinking side. And this was unfortunate. Now, what we're dealing with here-- and Boris alluded to this-- is a scale that is big. On iOS, we're the number one app. And we are number one because there are some percentage of a billion people who run that app on their phone. I can't tell you the exact split because I don't want to hurt anyone's feelings. Those are monthly active users. Think about this-- over 700 million people will use the app today, alone. There's about 300 plus engineers working on it. There are many features. I will make this case here. I defy you to find a more complex app on the platform than the Facebook app. I don't think there is an application as complex that uses the full stack of the phone as Facebook does. We've got to support multiple devices, even on iOS. And remember, there is a web component to most heavyweight mobile apps. We are delivering backend endpoints and web endpoints to deliver content and experience to the phone. That's iOS. Same kind of story on Android. It's the number one non-Google app. The number one app people choose to install is Facebook. Again, it's some percentage of the billion monthly and the 700 million daily users. Again, about 300 engineers and again, the same problem. The multiple devices problem is a bit more severe with Android. And I'll get into that. And again, web component, you've got to worry about. So fundamentally, the thing that gets us is if we have problems on the website, if you have a fatal on facebook.com, it looks like this. So you can't see, but there is a fatal on that page. Something didn't render. And that's a PHP bug, and I've got to fix that. That's a fatal. I've got to fix it right now. It's a rendering problem. But you pretty much have an experience. If you crash on any mobile thing, what's it look like? Boom, you're out. You're done. Your user experience is over. And you can crash for any number of reasons on mobile. If you develop on mobile, you know this very well. And it's miserable. And people hate mobile. The user satisfaction numbers of web versus mobile, mobile's in the toilet, because the experience is A, out of our control many times. And B, we can't recover, do exception handling, or gracefully exit when things go wrong. So we're under much more scrutiny on mobile. So what we want to do, though, with our release process-- the main thing is, as release engineers, we are here to make the company successful. We have to maximize the rate at which our company can do great things-- all of us. Our companies want to do awesome things. Our developers want to ship their cool code. We are there to facilitate that. We have to make it happen.
At the same time, we're also responsible. We're the adult supervision. There has to be some sort of quality metric, some supervision, some idea of, are things better or worse if I push this button? And we are all pushing that button to say, I say this is going to be better. We have key metrics on mobile that cannot regress. TTI-- Time To Interaction, crash rate, star rating, things like that cannot go backwards. And that's important to us, and as release engineers we pay attention to that. So mobile is different. Let me talk more about some of the things that bite us in mobile that you're not used to on web. There are no daily releases. Those of us who, if you want to release your packaged software-- as I said, I was at VMware, and we released stuff on a decent schedule as packaged software. It was up to us when we released. Web, we release continuously, right? What do we do on mobile? Nothing. Pick up any iPhone in this room, go to the home screen, and look at that stupid App Store icon-- there will be a double-digit red number in that box. Why do I, like a monkey, got to go push that button every day? To get my little thing back, and say, OK, do that, do that-- mindlessly pushing that button to tell it to update. The worst thing in the world, especially as a release engineer, because you have no control over when they're going to push that button. My mother's is probably a three-digit number in that thing. So this is a major, major crisis for you as a release engineer. Now, iOS 7 got it right in that you can turn on auto update. They didn't make it on by default, which I think was a mistake, but maybe the next release will sneak that on. We have to get away from this. Android has a long way to go to make this better. We release every four weeks on mobile. And that is fast for a mobile company with hundreds and hundreds of developers working on an app for a billion people. I'll get into details of what that flow looks like. But we can talk about what your release schedules look like for your mobile apps, but I think four weeks is relatively quick for mobile. The other problem-- when we release software as release engineers, do we build our bundle, our website, or web stack, whatever it is-- push a button, and 100% of everybody gets that? When I push the button at facebook.com, do 1.25 billion people instantly get my new binary? No. I do a slow rollout. I push the web to 2%. I get data, looks good. I push out the rest. In mobile, what do I do? I push a button. It goes to the app store, the black hole that is Apple. And out comes, in some indeterminate amount of time, my binary. A billion people, or some percentage of a billion people are going to be slam bam, you get this binary. If I have made a mistake, or if there's a fatal, or something silly in the app, it's gone. That bullet has left the barrel. And I'm screwed. There is a little bit of hope on Android in that people who do allow automatic updating-- and this is a huge great feature for Android-- is I can say, update to 5%. We did a push yesterday. We released our Android app. And how we did it-- we said go out to 5%. And we get data. And we say it looks good, ramp it up. But that's only people who opt in and go through the nightmare of checking off all these boxes that are buried in various places in the Android operating system. We have to ask permission to do something. So again, if I want a hotfix, I have a crisis, the first thing you want to do as release engineers, you fix that problem, right? You don't do that in mobile.
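To make that ramp-and-check idea concrete, here is a minimal sketch, in Python, of a staged rollout that only ramps while the key metrics Chuck names (crash rate, TTI) stay within a tolerance of the production baseline. This is not Facebook's tooling; the metric names, thresholds, stages, and the two callback parameters are illustrative assumptions.

```python
# Hypothetical staged-rollout gate: ramp the percentage only while the
# candidate build does not regress the key metrics versus production.
# All numbers and helper callbacks here are made up for illustration.

from dataclasses import dataclass


@dataclass
class Metrics:
    crash_rate: float   # crashes per 1,000 sessions
    tti_ms: float       # median time-to-interaction, in milliseconds


# Baseline observed for the current production release (illustrative).
PRODUCTION_BASELINE = Metrics(crash_rate=1.2, tti_ms=950.0)

# How much regression we tolerate before halting the ramp (illustrative).
MAX_CRASH_REGRESSION = 1.10   # at most +10% crash rate
MAX_TTI_REGRESSION = 1.05     # at most +5% TTI

ROLLOUT_STAGES = [5, 20, 50, 100]   # percent of users, ramped in order


def metrics_acceptable(candidate: Metrics, baseline: Metrics) -> bool:
    """True if the candidate build has not regressed the key metrics."""
    return (candidate.crash_rate <= baseline.crash_rate * MAX_CRASH_REGRESSION
            and candidate.tti_ms <= baseline.tti_ms * MAX_TTI_REGRESSION)


def ramp_rollout(get_candidate_metrics, set_rollout_percent) -> int:
    """Ramp through the stages, halting at the first metric regression.

    get_candidate_metrics(percent) returns Metrics observed at that stage;
    set_rollout_percent(percent) is whatever store console or API call you
    use. Returns the final percentage reached.
    """
    reached = 0
    for percent in ROLLOUT_STAGES:
        set_rollout_percent(percent)
        observed = get_candidate_metrics(percent)
        if not metrics_acceptable(observed, PRODUCTION_BASELINE):
            print(f"Halting ramp at {percent}%: metrics regressed ({observed})")
            return reached
        reached = percent
    return reached


if __name__ == "__main__":
    # Fake telemetry for demonstration: a small, acceptable regression.
    fake_telemetry = lambda pct: Metrics(crash_rate=1.25, tti_ms=960.0)
    final = ramp_rollout(fake_telemetry, lambda pct: print(f"Rollout set to {pct}%"))
    print(f"Release reached {final}% of users")
```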
You make a nice package. If it's Android, you have hope that you can get it into the store if the stupid thing will upload correctly. I'll get into that. And then if it does get there, you can get it out. But then, even if you got it out there quickly, what are you going to do? You just got to wait for people to click the stupid button. And Apple, it's even worse, because if you happen to have your hotfix in the middle of Worldwide Developer Conference, when the intern takes your app and puts it on a USB stick and takes it somewhere to do that-- if it takes them three weeks to do that, your hotfix will sit for three weeks. Anyone work at Apple here? Good. I can keep talking. So as release engineers, these are serious problems for us. And you have to keep this in mind, because now, with continuous delivery, you don't sweat these things. But now, it's the opposite. Like I said, you threw all that away now. And now, you have these problems that are a real nightmare for you. Some idiot shipped the wrong icon for the iOS app. I did that. So there's no worse feeling knowing you did something globally that's in the news, because of a stupid icon. And there's nothing you can do to get that back. So keep that lesson in mind. Permanence-- all those little bullets that you fired are still out there. This is a little slice of what people are running. This is telemetry from our Android apps. Those are the versions of our Android app running in production on phones. What do I want everyone to be running? I want them up in that green section there, in the upper right. What are they running? A vertical slice of crap-- of 20 versions of old stuff I don't want them to run. My mom is somewhere in that red line at the bottom there. And so they complain the experience is terrible. Of course the experience is terrible. You're on a version literally 16 releases ago. So this, again, is going to be your reality. Testing these things-- so I'll talk a little bit about that, but especially on Android, you have this. This is a heat map of the devices sending telemetry back for our Android app. There is a long tail of crappy little Android devices that will never die. Technically, your app needs to be tested and run on all these physical hardware things. And again, out of your control and something you need to consider. I'm not even giving you the vector of which version of Android they're running on these phones-- Froyo, Gingerbread, KitKat, Jelly Bean, ICS-- all those, we could put another matrix in there, and your head would explode. And you know darn well that something that works in ICS is not going to do well on Gingerbread and a million other permutations of that. So this is something else you need to worry about. It's nicer in the iOS environment, because it is a bit more constrained with devices and whatnot. But supporting the iPhone 4 is not as easy as you'd think. So we do need to worry about how this works on older and different iPad and iPhone devices. So how do we ship this code? So what's the process by which we're getting this out? So like I said, the web is well known here. Just one thing on organization-- this is a big thing we could talk a lot about. But the normal thing you do is you have your normal web deployment world and your development environment. And your engineers-- you have your desktop web guys. You've got your product experts for the product itself, and then the mobile guys tend to be platform experts, right? So they're shoehorning stuff in because they know the platform.
But they don't know messages, they don't know photos, they don't know-- whatever functionality they're working on, whatever product feature, they might not be the expert. We started out this way because it's naturally what happens, right? What you've got to get to is obviously this. So we have no more mobile group and web group and all that. If you work on the chat group, you work on the chat group for all platforms-- web, mobile, whatever. And there was a bit of organizational noise as things got shuffled around. But it was worth it in the end, because when it settled, we had this. And we had less cloudiness on the mobile side. And our mobile quality improved greatly, because the features were done by the people who know the features, who know the thing they're trying to do, regardless of platform. When we did that, the number of developers we were supporting on mobile kicked up. And this is an actual graph of the number of unique individuals checking code into the mobile code bases. And after the re-org, bam. So as release engineers, we have just multiplied the volume of stuff that we're dealing with. Just be aware that when you go to this model, that's what you get. So this fixed-date release process-- Facebook uses it, Chrome, a bunch of other people use this process. It's not ideal. You all know trying to get software engineers to hit a date is like trying to give a cat a bath. It's just not-- it's just fighting the whole time, and they never hit the date. So while we don't love it, this really works well for mobile. And we're trying to do things to optimize that. When you have a date-based release system, what are you trying to ship? You have three things you're trying to worry about when you're shipping software. You have the features, the quality of the code, and the schedule that you've got to worry about because it's a date-based system. When you're under this kind of constraint, you've got to pick two of these. Which two do you pick? You pick quality and schedule. Those are the two things, as release engineers, that we focus on. Have we regressed anything, and are we going to hit the date? If it's a feature issue, it's not the priority. Why? Because we ship on time. And again, this is where we have the most conflict with engineering, with developers, because they're crappy at hitting dates. The good news for this is you don't have to wait if you do get your stuff in. So if you do have a press release, or a major feature announcement, or whatever it is you've got to get out the door, you know it's going to go out that day, because we're going to kick out anyone who doesn't fit the bill. So the good news is if you do have your act together, and you do get in, you will be in good shape. Your stuff will go out. We're like doctors-- do no harm. So if we do something, we cannot make things worse. And that's the sum of the criteria we judge every commit by. Are we making things better, or is this just an iffy thing that will possibly make things worse? To engineers, four weeks seems like a lifetime. To PMs, four weeks seems like a lifetime. Four weeks is not that long. And you know there's another one coming. Those trains are always leaving. Do not freak out when we throw your thing out. I had a team come to me at week three of a four-week cycle. And they're like, this is big. We got to get this in. It was literally 15 changes to the main photo flow in the mobile app. And we're like, no, get that out of there. We're not taking this at this late date. We're almost ready to ship.
You're nuts. No, no, it's a high priority. Zuck wants it. It's got to go in. And we escalate. Boom, boom, boom, boom, boom. All right, let's get someone more important than you and me to talk about this. And I eventually get the thing thrown out. So we ship. We're good. The next cycle comes up. I go to that team about two weeks in. I'm like, hey, you guys get that stuff in? We're going to check it out. They're like, nah, we're going to wait for the next one. So it was like, you're killing me. You wanted to get in. It was the most important thing. You're going to wait another cycle because it wasn't ready. So as a release engineer, you have to have that sense of-- these guys are not going to land this. And you've got to assert yourself and say, like, listen, there's the next train. You're on it. Get off my train now. Things that break aren't ready. Get them out. Don't waste time fixing forward or taking more patches on top of more patches. Like, OK, I know I gave you those three diffs and those three cherry-picks, but take these three more. It'll fix it. I promise. Use your judgment. But literally, do not let them walk over you and keep dumping in and fixing forward. Just like, no, you're done. You're on the next train. Get out of here. I got more stuff to worry about. You're just annoying me now. Put your mean man face on from slide number two there. OK, let's talk a bit about the mechanics. This is our web development cycle. So we have our source control system. We use Mercurial, Subversion, Git, I don't care what we use. I hate them all equally, so it's not a big deal. You've seen developers screw up source control many ways dealing with branches and complex things. I think a couple guys from VMware are here. I set up the system at VMware back in the day. That was a hard problem-- many long-lived branches with dot releases of many products. We had a really good system under Perforce, that when you check in, it asked you where you wanted your stuff delivered. It would deliver it, check in, build it, let you know it went in, blah blah blah. It could not be simpler at Facebook. No matter which crappy source control system you use, you check in to master, you're done. OK, that's all they have to do as developers-- get fricking code into master. What we do in web is after a week of development, generally it's Sunday at 6:00 pm, we cut a release branch, a simple release branch. From Sunday to Tuesday, during that blue period there in that blue box, we stabilize. We test it internally. We make sure it's good. If everything's good, Tuesday at around 4:00 pm, that goes out. That is between 4,000 and 6,000 changes that went into trunk that week. For the rest of the week, that's my twice daily push, during that green box there. And that's where I take my 30 to 300 cherry-picks a day. That flow has been the way at Facebook for six years-- has not changed. The big win for this is that we ship twice a day, we're fast. That little blue box in there is like internally, we're dogfooding before anyone sees it. Again, we're not waiting for anyone, because at that rate, it's like, if you don't hit today, there's tomorrow. It's like hours away. It's not the end of the world if you don't make it. The engineers are there supporting their changes. It's true dev/ops. Your change doesn't go out unless you are there. I won't push your web change unless you show up and show me you're still alive. And there's clear rules. We all know it. There's an on-boarding where I brainwash all the new hires. Like, this is how we do it.
This is what you're doing. We're all on the same page. So let's take the desktop web. Let's overlay now what we do on mobile. Not very different, the time scale's changed. So now we have four weeks of development in master. And at the end of four weeks, we cut our simple release branch. And that release branch lives for 3 and 1/2 weeks under our eyes. And that's where we take more cherry-picks, probably between, I want to say, 120, 150 more cherry-picks will come in to stabilize it over those three and a half weeks. And then that green period of soaking-- like, don't change anything for three days internally. Just dogfood it for three days. Let it accumulate state. See what breaks, and see if we're good. At that point, that green line is our fourth week. The last day of our fourth week-- exactly four weeks from that first red line-- it goes out the door. What I want to keep at Facebook is no matter what group you're in, front end, back end, mobile-- this little picture is your life. And for the most part, this is true at Facebook. No matter what group you're with, you will release with a simple branch cherry-pick system. The time scale will change, and some of the mechanics will change. But I'm a big advocate for this. We can't do true continuous where we can just deploy from trunk, but I like this little buffer zone of having the cherry-pick system with a release branch. It's worked very well. So we haven't changed things. Like I said, if you change from web to mobile, you have the same thing, except now some of the times have changed. Otherwise, all the same things you've learned, all the operational awareness you've built up as a developer is still with you. On Android, we have one special tool. And God bless Google and Android for doing this-- it's the alpha and beta program. So I want more eyes on my stuff. And I do this with facebook.com. The website, like I said, I can leak out stuff to 2% of the user base at any time to get feedback of what I'm doing. But I have the beta program on Android. My beta program is a few million people who volunteered to get the beta. The beta comes from the blue line, which is the release branch. I have more beta customers than most people have users. Every Monday, Wednesday, Friday, I ship whatever's in the branch to these people. Obviously, auto update's an important thing for these people. They all have it turned on. So bam, bam, bam. I get that. And I get telemetry back immediately. So I can now analyze what they're seeing, what crash rates, what the logging looks like, what bugs they're reporting-- all this good stuff. We were really happy with the beta program, and then Google announced the alpha program. Wrap your head around this-- I am shipping trunk to a few hundred thousand people every night. That should scare you. So the Android app-- if you're in the alpha program, you will get pushed every night whatever's in trunk. There's some certain things we do to ensure that nothing leaks and that we're in good shape. And I'll talk about some of the safeguards. But that's really cool. All right, let's get into the development. What are we doing in those 3 and 1/2 weeks? Because we take that full time to figure out if this thing's going to ship or not, or if we're in good shape or not. So let's talk about some of the details there. This is more the philosophy in that release branch. How do we keep that release branch in good shape? The biggest thing, like I said, is no features. If you take a feature, you're basically resetting the clock. We've done all this testing.
We've had all the dogfooding. Everything that comes in is resetting and destabilizing what we've done. And honestly, if it didn't make that cut, we're assuming it wasn't ready when the cut came, and you're not going to cram it in later. You can't just worry about native code. And it depends on your app, but if any of you have any kind of app that does anything of significance, you're going to have these issues. We are very picky about design. And there's a design team-- in mobile, you just can't throw in an element, or something, or change the UI without a pretty heavyweight analysis of whether this is the way that we want to go. Did we get logging data? Is your logging in there? Are we getting data from dogfooding that your thing is working and turned on and good? Are there server-side endpoints, or updates to the website, that need to roll out before your thing can be turned on? So make sure we coordinate that. The worst thing in the world is pushing something out, where it starts hammering the backend because they didn't realize the use was going to be like, 10x what they thought it was, or the endpoint's not there, or it's not at the right level to make the right response. We have big privacy and legal issues, as we all do. There is a team for a release that looks at what's going in, and says yes, this is good. If you rush it, or you take in late changes, you put that at risk, because they could derail what they looked at when they first said, OK, this is what's going on for this release. And yeah, basically if you are not testing in master, we want that-- we want it vetted in master before it gets in the release branch. So if you're putting it immediately into the release branch as soon as you check it in, we don't have that window to do our test to make sure that-- we still run the tests, but I want it going through master to release for more sanity. You are guilty until proven innocent is pretty much our motto here. So every time you do ask for something, we need to approve it. And we use a cherry-pick system-- again, this is across all of Facebook. It's called Releeph. It's part of Phabricator, which is-- our whole code review, our whole stack is open sourced under the auspices of Phacility, which is a company that has all our internal tools. They've open sourced it and they maintain it. Within Phabricator is this thing called Releeph. So if your diff is accepted-- and this is a diff that was accepted and is in master-- that link shows up, and it says Releeph Request. And what you're saying is, I want this to go out. I got it into master, but I still want it in the release. And you're in week two, or three, or whatever of the process. You click that button, and out comes this page, where you tell me why we're taking this. And on the right there, you're going to say, yeah, we're taking this because this is a really bad bug with display model blah blah, blah. Boom. My release engineers are going to look at that, and say, yeah, this is legitimate, or this smells fishy. The other thing we have is over here, we have these two lines. The top line says Size. It's the size of the diff. If that diff is a big diff-- number of changes, number of lines added, deleted, moved, whatever-- that bar will grow. The bottom line is Churn. And that's the amount of discussion there was in the diff. How does a diff go? You send out your diff, hey, here's my diff. And they go, your diff sucks and so do you. And you're like, no, you suck and so does your mom-- and back and forth and back and forth. Maybe those are just my diffs.
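As a rough sketch of the kind of size and churn signals just described, the snippet below computes two scores from a generic diff record and flags anything a release engineer would want to eyeball. The field names, weights, and thresholds are assumptions for illustration; this is not Phabricator's or Releeph's actual data model.

```python
# Illustrative "Size" and "Churn" signals on a cherry-pick request:
# size from how much code moved, churn from how much review back-and-forth.
# Field names, weights, and thresholds are made up, not Phabricator's API.

from dataclasses import dataclass


@dataclass
class DiffStats:
    lines_added: int
    lines_removed: int
    files_touched: int
    review_comments: int    # comments left during code review
    revisions_pushed: int   # times the author re-uploaded the diff


def size_score(d: DiffStats) -> int:
    """Bigger diffs (more lines and files) get a bigger bar."""
    return d.lines_added + d.lines_removed + 10 * d.files_touched


def churn_score(d: DiffStats) -> int:
    """More rejections and re-uploads mean more contention in review."""
    return d.review_comments + 5 * d.revisions_pushed


def needs_closer_look(d: DiffStats,
                      size_threshold: int = 500,
                      churn_threshold: int = 30) -> bool:
    """Flag requests whose size or churn bar is big enough to warrant a look."""
    return size_score(d) > size_threshold or churn_score(d) > churn_threshold


if __name__ == "__main__":
    small_fix = DiffStats(12, 3, 1, review_comments=2, revisions_pushed=1)
    contested = DiffStats(400, 250, 14, review_comments=40, revisions_pushed=6)
    for d in (small_fix, contested):
        print(size_score(d), churn_score(d), needs_closer_look(d))
```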
So that will get bigger, and bigger, and bigger as there's more rejections, changes, to that diff. So if I see big bars there, I know there's some contention, and I want to take a look. The other thing, which I've blacked out, is the Karma. And right under there are stars. And all engineers start out with four stars. You can only see your own Karma. But if there is a bad thing that happens, I push this. And I'm the only person on Facebook who has this symbol, which is the Dislike button. So if I push this, it means we got a problem. And a box opens up, I type in what happened, how we can improve ourselves. I click Submit. It goes to me. It goes to you. It goes to your manager. And it goes to our work.com performance review tool. So it's very much a public shaming, not a private-- I'm sorry. It's very much a private shaming, not a public shaming. You really want to avoid public shaming. I do this in my head. You cannot stop me from doing this. OK. When I was at Google, I could walk down the hall and say like, two stars, three stars, four stars, two stars, one star, zero stars. Right. Where's Boris? There. So you know you do this as release engineers. And it's your job. You've got to manage risk. And when you see a room full of engineers, that's a room full of risk. But when it got to 300, 600, 800 engineers, I couldn't do it anymore. So we made this system. Now, this is not like some punishment system. Nobody's ever gotten fired because they got two stars. But it will be really valuable to you as you get to people on your team, and as you start dealing with different teams, to remember-- you look at this diff like, why does that guy have two stars? Oh, yeah. I remember now. So you're going to take a little bit more extra care and see what's going on there. And that is part of Releeph. You can download this and use it as you wish. Again, we are risk-averse on the release branch. We need a reason to approve, not a reason to reject. And then, like I said, the main reasons here are there's pluses and minuses for every change. You have to let your release engineers have this flexibility to be able to make a judgment call, very subjective, to say like, this feels right, this does not feel right. And there can be some criteria that you can spell out for it-- help them with that, or help publicize it. But we've been doing this as release engineers for six years at Facebook, and the push back is very low. We are very well-respected as release engineers. And make sure, in your organization, that you are respected for what you do-- for your judgment and for your skill set. And given that, when release engineering says it's not good, it doesn't go. So make sure your management, your organization, your culture at your company backs you on this. Finally, there's a great quote in the movie "Ronin." Robert De Niro says, "When there is any doubt, there is no doubt." And that's our motto. If we get queasy about anything in that release branch, and we don't feel right in mobile land, it comes out-- an important thing. Let's talk a little bit about the tools. Let me check my time. So the tools-- I need to get master in better shape, especially because on Android, I'm shipping the thing every night. So I need some tools that when the developers land their code into trunk, they have confidence that they're not going to burn themselves. Because they're very attentive now. I think the developers have a very good operational awareness. They want to do right. They want to be happy. They want to land that code.
They don't want to cause trouble. They don't want to lose that precious star and their Push Karma, so they will pay attention. Give them the tools to do it for them. This is our continuous integration stack on mobile. We use Buildbot. You can use whatever you want. Buildbot works for us. But basically, you have the build part, which is we build everything. When you check in to mobile-- what is Facebook mobile? It's not just one app, right? It's the native app for the platform, so iOS or Android. It's Messenger. It's Pages Manager. It's Instagram. It's a bunch of other stuff that's going to launch or has launched. It's new projects. There's a long list of those little boxes of squares on top. For every commit, you might break something across the way. We use a monolithic code base, much like Google and other places. So you really need to check all the builds when you check something. I just fixed something in Facebook for Android, but you just broke Messenger. So we do all those builds. On the way in, there's a whole series of lint/static analysis-- things that check, are our policies being followed? The easiest one is the regex one. Anyone can write a regex to say if you should do this, or you shouldn't do that; the regex will catch it, throw it as a warning, throw it as a fatal so they can't check in. That's simple. And anyone can contribute to it with a simple regex. For more serious stuff, we use Clang, which does the static analysis and checks memory, or dead code, or things like that. Android has some built-in linting that we use on that. This is both platforms, iOS and Android together, so some go with one, some go with the other. But you get the idea, right? So you have that layer protecting your master code base. Nothing gets in unless it gets through that red section there. And finally, the tests-- for each platform, there are various test systems that go in. We also do WebDriver from our UK offices that does end-to-end integration style testing as well. So that stack happens all the time-- all of it. How often? So this often. So during each step of the process that whole stack is run-- every build, every test, everything done. If you're at Google or Facebook, this does not impress you, because basically, I say this kind of boldly, but there are no issues with compute power or storage. Those are infinite as far as we care. You need to get to that. Machine resources should not be the thing keeping you from running this full stack while the person's developing the diff, when they create the diff to send it out to the other developers, when they update it after getting feedback, when they land it in the landing queue to check it out before it gets delivered, and when it gets committed. Each step is going to run through that stack. And machine resources should not be the reason you can't do this. This is the number of builds we're doing per day. It averages around 20,000 to 30,000 builds a day, to give you the scale for our couple dozen mobile apps. So this is all async-- when it's built and tested, what does it look like? So when you do commit, the reviewer-- actually, this isn't a commit. This is for a diff. The reviewer will see, did the stuff pass all the tests? Did it pass all the builds? That's in the diff itself. So the diff tool itself will expose any dirty laundry that you have that didn't pass tests, didn't pass builds-- it will be there for the reviewer to see. They see that box is red, it's an immediate go back, hey, go check that out before I check it out.
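To illustrate the "anyone can write a regex" lint layer mentioned above, here is a toy version in Python: each rule is a pattern plus a severity, warnings are reported, and fatal rules block the check-in. The rules themselves are invented examples, not Facebook's actual policies or tooling.

```python
# Toy regex lint layer: each rule is a pattern plus a severity.
# Warnings are reported; fatal rules block the commit.
# The rules below are made-up examples, not Facebook's real policies.

import re
from typing import List, NamedTuple


class Rule(NamedTuple):
    pattern: str
    message: str
    fatal: bool   # fatal rules block the check-in; others just warn


RULES: List[Rule] = [
    Rule(r"\bSystem\.out\.print", "Use the logging framework, not stdout", fatal=False),
    Rule(r"\bTODO\(urgent\)", "Resolve urgent TODOs before committing", fatal=True),
    Rule(r"http://", "Use https:// for all endpoints", fatal=True),
]


def lint(path: str, source: str) -> bool:
    """Run every regex rule over the source; return False if a fatal rule fires."""
    ok = True
    for lineno, line in enumerate(source.splitlines(), start=1):
        for rule in RULES:
            if re.search(rule.pattern, line):
                level = "FATAL" if rule.fatal else "WARNING"
                print(f"{path}:{lineno}: {level}: {rule.message}")
                if rule.fatal:
                    ok = False
    return ok


if __name__ == "__main__":
    sample = 'url = "http://example.com/api"\nSystem.out.println("debug");\n'
    if not lint("Example.java", sample):
        print("Commit blocked by lint.")
```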
Shrubbery is the thing we put on top of our build system. And this is basically showing us, across the whole matrix of builds that are running, where they failed. So if you go there and look, you can see exactly. Like, down there, that red bar says, oh yeah, a test failed here. You can click through and land at the test console to understand like, OK, where did this go wrong? And the reviewer will do that as well. Dogfooding-- we all know the value of dogfooding. But dogfooding on mobile is a different problem. Within Facebook, I force-- if you're on the Facebook network, or within the VPN, when you go to facebook.com, you're never going to facebook.com. We always redirect you to what we're going to ship-- our dogfood. How do I do that on mobile? Well, I have a mobile builds page that people will go to on their phone, and they can download any version of master, or the release candidate, or a previous release of the various products. And this page scrolls down a bunch. There's one other thing though. People are lazy on mobile. They actually use their mobile phones. They don't want to be bothered. On Android, I force them into the dogfood. If you're a Facebook employee, you will now download the Google Play version of our app. It will always kick back and download and use our dogfood version. On iOS, it's a harder problem, because we don't have the guts to do that. We wrote a wrapper around the app for internal use. So the problem with iOS is you'd fire it up. And you're in the middle of the park, or Burger King, you want to check in at Facebook-- hey, I'm having a burger. Boom. And it's going to come back and say, hey, you're on the wrong build. Upgrade now. You're like, I'm in the middle of Wisconsin. I don't have connectivity. I don't want to do it now. So it's a big pain. What we did is write a wrapper that, in the background, knows when the new package is out. It downloads it, so when you fire up the app on iOS, it's going to just tell you, hey, by the way, I got the new app already waiting for you here on the phone. Just click install. And that is going to boost a lot of the dogfood usage for our internal people. So it'll be very seamless for them to keep up to the latest, because that changes every day, right? We're going to ship that dogfood app every day. The test console-- if we click through on some of those test failures, you're going to see basically the history of what's passing, what's failing, and specifically what failed. So you'll be notified when this stuff fails for you. You're going to go look for your commit. You see your commit. You click through. You're going to get exactly which test failed. Our tests have an automatic quality rating. So if we see tests are failing or flaky, the star rating for tests-- tests have Karma as well. They will lose their Karma. And eventually they'll be discarded if it's a flaky test. Click through here. Again, these are tools that are kind of specific, but you can basically see the history of how this thing failed, exactly when it failed. Point being, you need to have data that gets you down to the rev level of when things went wrong. So again, every commit, every test run every time, it's easy to basically bisect down. And say, here's the point at which we failed. Here you go. With all this, we still have breaks. And if you're committing and something breaks but it's not yours, you can always pull back to a stable point.
So there's a rolling stable label in master, where if you're hopelessly broken at the top of trunk, you could say, listen, give me back something I can build. You can always pull that stable label back. And that automatically updates as things pass. That's a simple thing. We've been doing that for years. I think all of us have done that. Don't forget it on mobile. Test failure bot-- again, nobody likes tests because they can be noisy and brittle and give you noise. The bot tries to take care of a lot of this. It'll assign bugs it sees that are unassigned by doing some analysis, and say, nobody owns this, but I think this guy should own it. Or if it sees things that have been closed, or that the test is working now, it'll go off and close the test, saying like, OK, this clearly works-- no reason to keep this bug open. So get the bots to do a lot of the crappy work of figuring out where tests should go, or if it's open or closed. Because that's the part that developers really fall down on, is just responding to this endless stream of noise about tests failing or passing, and failing and passing. So mitigate that noise to them as best you can. With all that, especially with Mercurial and Git, when you rebase, you could get in a bad state. We use this thing called Landcastle. And what this means is when you commit into master while that stuff's all running, we worry about rebasing your stuff onto the latest-- basically, the tip of the tree. So when you commit, this thing's going to say, yes, I've queued up your thing. Your change is in there. It's going through the stuff. I'll keep rebasing it for you until it lands. If in the process of rebasing it basically finds a problem or a conflict because stuff is coming in, it'll send you a page. And say, listen, I can't land your change in master because this guy just pulled the rug out from underneath you. So go deal with him. So again, taking the onus off the developer to worry about constantly rebasing, constantly checking if that thing gets in-- it's much simpler if the system takes care of that for you. I think we had a similar system at VMware. With all this, we can still break master. What do we do? We have the Sheriff. I do the same thing on web. Like I said, there's two people essentially running web. There's 1,000 developers against two release engineers, and we release in real-time. So we have on-calls, or Sheriffs. I have a page with a rotation for every group-- all their on-calls for that week for all the different groups. When things go bad, and photos doesn't work on the new iOS build, am I going to debug that? Not so much. I'm going to go to the list, and it's going to say, this guy right here is the photos on-call for the week. Here's the problem. Fix it now. Fix it real-time, and get back to me. Get who you need. Back out what you need. Do something. And that's the job of the Sheriff. This is ideal because it gives the developers this operational burden that really opens their eyes. They become allies of you. They feel they're part of the release engineering team. We're part of the gang. And as you spread that-- as more people get to be Sheriff and on-calls, they have this empathy for like, yeah, this sucks. We're really screwing you guys. So they will be better engineers and better operational people if they have this role. Their main role on mobile-- get it working, man. Just revert. Just get me back. Get me back on my feet. They'll look through. I'll send them this link like, hey, these tests are failing.
I can't figure out what's going on. It looks like it's related to photos. They'll go in, and they have the special super confidential tag, where they can get stuff in, bypassing some of the big stack of stuff, because I want that fixed now. So as the Sheriff, you get checked off in the database as a Sheriff. You get a special tag to commit. And you get your stuff in. Generally, reverts go in immediately. The bisect tool is cool. Basically, you can do a live bisect on your phone, trying to find a problem. So you can go here. Tell us what bug you're looking for, basically punch in build numbers. And they'll just suck down from the dogfood page, so you can try different builds, bisect until you find that exact point that things have gone bad. All right. Let me wrap up here. So the big shock for us at mobile was we thought we had things solved as release engineers, and we didn't. The process for us at mobile was I went out and I hired Christian Legnitto, in the back, from Mozilla. And I was busy with the other guys. Just getting web-- was pretty good. I wanted to make sure web stayed on its feet. I said, Christian, that mobile thing's a mess. Go deal with that. I threw him in-- like one guy into this big den of mobile. And he did a great job. But he had to come back, and we all had to figure out we have to change how we develop, how we ship, how we write code. All those things that we had solved already had to be rethought. And the tools had to be modified a bit. And new tools had to be invented. But the important thing is it was not a shock for any developer going from the web world to the jarring cold reality of mobile, because the culture was the same. They all knew this dev/ops culture. They all knew they had the responsibility. It was very bearable. So I can say, quite confidently now, Facebook is mobile. We are a big heavyweight mobile company. We have many, many mobile apps. The team in the back and myself are responsible for shipping those mobile apps. We'd love to hear what other people are doing with mobile. I think we have a lot to learn. We have a lot to share. We have to really lean on our friends at Apple and at Google to help. I don't like to be critical, but they're keeping us back. The systems have not kept up with the reality of the mobile ecosystem. I promised I'd complain about the Google Play Store. We have 17-- when we want to release one version of Facebook for Android, there's 17 packages. Because we build out APKs for individual DPIs, or chips, or whatever. It's no fun to go into a web interface and upload 17 packages every four weeks with release notes. So this is silly. I mean, let's not be amateurs here. Let's get an API. Let's get this thing like industrial strength, so I can get things through. I promise you, I am not happy with four weeks releasing stuff. I'm going to get to two weeks, hopefully by the end of the half. All right. I want to ship both platforms every two weeks, eventually every week shipping mobile. I can do that, but the tools at Apple and Google are, right now, one of my biggest hurdles-- to get that cadence. All right. There's my contact information. I think we have time for questions, if we have maybe a microphone. BORIS DEBIC: Thank you, Chuck. We have time for a few questions. And we'll ask you to speak to the mic, so we get the questions on camera. AUDIENCE: So the Karma stuff-- so, Craig from Wikimedia Foundation, we're actually seriously considering moving to Phabricator right now.
We're in a mix of Gerrit and Bugzilla and it is hell-- on Trello and Mingle and all the other crap that's out there. But so, Phabricator, we've been talking with Evan a lot, and we're thinking about-- oh, is it-- oh, there we go. Swallow the mic. All right. So right, so Wikimedia Foundation-- we're thinking about moving to Phabricator. And one of the things that I liked about the features that you mentioned-- that I didn't know about-- was the Karma thing, because I'm the release manager there and I have those same stars in my head. So the very basic question-- who all-- so you said you have that right to dislike. Is there anyone else like your release team? CHUCK ROSSI: Right, so the question is-- the Karma thing can be sensitive. And I don't want it to be a mean-spirited thing. You can't use it as a club. But it could be a subtle way that you can keep track. And the question is, who has access to that? Well, all my release engineers. So everyone in release engineering is in the database as being able to see that-- just the user themselves and the release engineer. So it is, again, a private thing. And I'll give you an example of how it was used for good. We had a guy in web who was just killing us. Like, every time he'd touch code, [INAUDIBLE] would break, or something would break. Something in platform might break. And he was down to two stars. And our policy is when you're down to two stars, we just don't take your change. Because clearly we've-- you only lose half a star each time. So that's four times. And we only give you a down Karma if you really-- things had to stop working. So we're like, this is ridiculous. This guy, we can't take his change. And what's going on? So we were able to get with him and his manager. We're like, what's going on? Well, it turns out, he'd inherited this awful JavaScript code from 1,000 years ago-- probably written by Zuck himself-- that landed in his lap. And every time he touched it, it was just a hopeless situation. He was just doomed. So we said, OK, we got to step back, get proper resources on this, revamp what we're doing here, get some real-- you can't go on this way. So it really flushed out an issue that was not this person's fault. But something that wasn't getting attention clearly showed up in the Karma scores, and that helped. AUDIENCE: I'm Ryan from Cloud Foundry, and one of the things that really stood out to me was when you said that we as release engineers need to make sure our team is well-respected in the organization. So what are maybe one or two of the most important things we could do to ensure that when we say something important it is heard at the highest levels? CHUCK ROSSI: Right. So the first thing is you've got to have the attitude that you're not there to hinder things. The push back you get is like, if you guys get in control, everything is going to stop, because you're not going to let anything out. And you're going to be grumpy and all that. So you have to balance the idea of like, I'm going to be cautious, but I really want to make things happen. The team wants to enable the company and the developers to get stuff going. But we're going to be-- like I said, some sort of adult supervision. We're going to do a little bit of a sanity test for that. As far as how you can build that, it's tough. You absolutely need an advocate up the management chain.
And I've been lucky here at Google and at Facebook and pretty much everywhere that the organization, from usually the VP of engineering down, understands the value of what we do-- of release engineering-- and learns to trust the experience and the judgment of these people. So I would really advocate up your chain a bit and see if you can get someone to back you. It helps if you have an experienced release engineer on the team somewhere. Or even the other one is-- and I'm sure you can find these people. Every group has a developer who is like a frustrated releng. They're always getting into your stuff. They're always poking around. They always like to help. They're giving you great tools, and they love this stuff. Get that person on your side to help advocate for you. And those developers who are on your side are going to be a big boost for your team and for your respect. The other thing, I'll just say one last thing-- is do the thing I said about on-calls. Get developers in your little world. And say, hey, you're the on-call for the week. And you will get your circle of trust, and you'll get more respect that way, because they'll be like, yeah, you guys do crazy work. I can't believe it. So those things will help. AUDIENCE: So one line that I use that was very effective to get people to accept what I was doing was to say, hey, you're going to have to jump through a few more bureaucratic hoops, but you're only going to do your work once. You're not going to have to go back and deal with this craziness of things breaking and having to be redone. CHUCK ROSSI: Ultimately, the pitch is you're going to help yourself. If you let me do these things for you, I'm going to save you a lot of pain and a lot of redoing things down the road. AUDIENCE: Hey Chuck, one of your diagrams ended with a soak period. So what happens if you get a surprise in the soaking period? Does that affect that cycle? Does it affect the next cycle? What do you do then? CHUCK ROSSI: Yeah, soak is not ideal. And in fact soak doesn't work that well, because if you're a day into soak, and you're like, oh yeah, that doesn't work. So you've got to cherry-pick, and you've got to push that out again. You've got to go collect data again. It stinks. You know what saves that 100%? The beta program. So I could be in soak. I'm in soak the whole stupid time, that four weeks, because I have two, three, whatever million people we have out there using the app. So when I find something, and I'm in soak, and I cherry-pick, I don't have just the 4,000 engineers of Facebook to go give this to. I have a couple million people I can go give this to and instantly get some feedback. AUDIENCE: But then why do you need the soak period? I mean, why have it? CHUCK ROSSI: The question is, why do we need the soak period at all then? And in fact, it's probably less of a thing now that the beta-- I need it on iOS, because there is no beta program on iOS. Maybe if you can give the microphone to Christian sitting behind you-- can Christian just grab the mic there? Christian works on mobile. CHRISTIAN LEGNITTO: Yeah, so the original goal was-- the way we dogfood and the way the beta dogfoods is, you're basically installing an update every night, which is not the way our users will actually run it. They'll install one update and then run it for a month. So there are some sorts of bugs, like local caches growing unbounded, or something like that, where installing every day would hide those bugs.
And of course, we have to push out a new build every day because we want to test the changes, the cherry-picks we've taken. But at the end of the day, we want to test what our users are going to experience, and they're not going to be installing updates every day. CHUCK ROSSI: So we did effectively get a soak through that. AUDIENCE: Hi, I'm Fred from Google. I'm just wondering about how you use that telemetry. What do you do with the telemetry you get back, and how do you interpret it? CHUCK ROSSI: So the telemetry coming back from those various channels will go into our graphing system, our data collection system. We'll basically graph it over the current production values-- so crash rates, TTI, app size, bug rates, all those meta values will be transposed over the known values for production. So if they vary-- and we're all very intimate with what those numbers are-- we can see it. The other thing we're getting is individual results-- specific logging data that is only from those people. We do this on the web. It's fantastic. We have a page for all the log data coming back. It's like, show me only the stuff happening with the new beta release. So it'll flush out like, hey, these are new errors I've never seen before. So instantly it's like, oh, that's all new stuff. We've got to go flush through that and see what's going on. So those are the two main ways we can get that telemetry and figure out if we're in better shape or worse shape. AUDIENCE: Hello. My name is Armand from Mozilla. It was excellent what you showed us. There's one question I have with regards to the home page for each team of developers. If you're backing out up to the last stable state, why would you need a team of sheriffs or on-call people? CHUCK ROSSI: Right. You're specifically like, if master has trouble, why do you need a sheriff to back it out, or-- AUDIENCE: You back out. You get to a stable state. And why would you need a list of people on call if you are back to stable, supposedly? CHUCK ROSSI: Right. Only one part of the job of the Sheriff is to deal with breakage in trunk. It's almost an easy part of the job, because you can just revert. And a bot could almost do that at some point if you get some confidence in it. The real reason I need those on-calls and sheriffs is operationally-- we operate in real time. The data coming in from production, the website going out every day-- something is happening all the time. If something goes wrong, if I look at my graph like Fred was talking about, and I see the production graph for crashes just all of a sudden, bam-- goes up, and I look at the stacks. And I'm saying, why are group messages suddenly fataling? I'm going to go right to that on-call page, find the groups person and say, you-- this is your life right now. You've got to take this, and you've got to figure out right now what's going on. Now you guys at Google championed the use of SREs to do this as the first line, mostly for the web side-- when things go wrong, SREs respond. We're less about that and more about having it go right to the team. The operational people are within the team itself. So that's the real reason. I need their expertise as on-calls and sheriffs to be able to look at a problem-- a stack trace, a code path. They know what went in that week. They're like, oh shit, yeah, we just updated that groups-- that push went out yesterday. And all of a sudden, it's failing now. Somebody changed a gatekeeper, or somebody turned on that code. And now we've got to react.
And he's the best person to do it, because he's in that group and can find it-- they're like, it was that guy. And then you run him down, you get them, and then we do it. So that's the big win. AUDIENCE: Do you have any challenges with the balance of power between product owners and testing and then the release management? Do you ever have your decision get overridden by the product owner saying, that testing, it's not relevant-- the feature being delivered is more important than that bug, ship it anyway? CHUCK ROSSI: Yeah, we probably don't have it as much. But that's the natural order of things, right? So we don't have any QA groups. So there is no QA. That data is evident by itself. Tests pass, or they don't. So that pretty much settles that argument. The clash is the release engineer versus development, or more likely the PMs. Because PMs-- I paint the picture of, on one side you have my team, release engineering, trying to hold down the fort. On the other side, you've got Zuck and the management staff wanting stuff to go. And in the middle you have the developers. And we're like two rocks grinding against each other, right-- and they're the lubrication. So the poor developers and the PMs are these little things between these two rocks that are just grinding away. So yeah, you're going to have issues where-- especially for the PMs, because they have the pressure, like, you've got to ship this thing by X. And they're like, it's ready, I've got to try to cram all of this past the release engineers. They're going to kill me. And they're like, if you don't, Zuck's going to kill you. So what do you do? Those, you have to resolve, I swear to god, on a case-by-case basis. And again, it comes down to a subjective, judgmental thing. You look at who's involved. You look at the code. You look at the risk. You look at the benefit. And in our case, you think, how many users are we going to mess up with this? All right. And you make a decision. And if you still can't settle it, you go up the line. And again, I give a lot of credit to our executive team with Mike [? Schrepp. ?] If it goes to him, he'll make the call. And if he says it goes-- it goes. It'll go. I'll do it. So generally, you alleviate a lot of this grief with a faster release cycle. So with four weeks, we're just at the cusp where people freak out. They can wait, but if they hit it really wrong-- if they just miss the cut-- it could be weeks by the time they get in. So that's why I want to get to that two-week cycle. All that pressure, all that conflict, all the grinding of gears and rocks goes away. I can say, relax. You can go out the next day, the next week. No big deal. We never have these problems on web, where you can push every day. AUDIENCE: You had a really nice diagram about how you integrate testing, static analysis, all these kinds of things into your continuous integration at every stage of the development cycle. You're doing it before the commit, if I understand, before the devs are being evaluated and such. So if you do have all that in place, why does the question of Karma come into existence? Because in theory-- I mean, we specifically designed these steps and integrated them into our cycle to avoid these breakages, or whatever. So if you can talk a little bit-- I'm interested in how much we can actually prevent in that process, and what is the filter? I mean, what gets through? And why do you think it gets through? CHUCK ROSSI: Right. So I don't want to paint too rosy a picture here.
All that-- the slide that I put up was very pretty, with all the steps that we go through. Obviously-- and we all know this, because we've all written tools and have those things in place-- they're not going to catch everything. Human judgment is going to enter into the equation. The thing that makes Karma come into play is we're operating in near real time here. So when things have to happen, and those cherry-picks have to come in for production launch, that's where the judgment counts, both for the developer and for the release engineer. The tests I have will really flush out the obvious things. But if someone checks in a new flow for messaging, or a new flow for group update, you're not going to catch that with tests. It's going to pass, but when it's all together, integrated into the app, and you decide to take that in week three of the release process, your tools aren't going to catch it. That's where Karma comes in. And that's where you say, you should have known better. Why did you even think you could rewrite the messaging stack in week two of a release cycle, where it somehow got by us, got checked in, and derailed the release? So it's more at a higher level where Karma comes in. AUDIENCE: So the input I take from that-- basically, just in the theory of [INAUDIBLE], in a situation like that where we have to update production every week or so, we're always going to be behind on the quality of tests we are putting into our system. So maybe we should concentrate more on exploring that issue, as opposed to putting stars on engineers. Because it doesn't matter how good the engineer is. If there's no test, he won't be able to pinpoint all the problems, especially if it's a huge code base, or a legacy code base, or whatever. So just in terms of the [INAUDIBLE] and practices. CHUCK ROSSI: Yeah, like I said, I don't want Karma to be used as a-- everyone's losing stars. I don't think there's anyone left with four stars. When you move like this, things aren't going to go right. You're not born with this operational awareness that's going to get you through. What we're really looking for is a neglect thing. You didn't follow process. You didn't really think about this. You took off your operational hat long enough to make this mess. AUDIENCE: But as you noted yourself, the neglect things you're able to catch, because these are obvious issues. CHUCK ROSSI: Not always. AUDIENCE: It's the complicated new features, which haven't rolled into your testing framework yet. That's what causes you grief. CHUCK ROSSI: Yeah, and that's where-- generally, those are caught because they're big enough that they'll escalate, where we'll try to discuss that and do the analysis of the big feature. We've had it many times. A big feature will come in late, but it's really important, can't wait the four weeks. The press release is already set up. And that's not a Karma event. That's like-- AUDIENCE: Yeah, exactly. That's what I'm trying to say. If you have a good suite of tests to catch every change, all the neglect things are going to be caught before they even get into master.
CHUCK ROSSI: Yeah. Obviously, on mobile, we can't have this discussion because that's firing bullets. And bullets don't come back. So on web, it's a much better situation. And we don't really have this issue because-- think of the number of machines it takes to serve facebook.com. It's a big number. When we push the button to say go-- we've decided we're going to go 100%-- the binary that is facebook.com is out in about 15 minutes, which I've said is both awesome and terrifying. It's awesome, because in 15 minutes, the fleet is on new code. It's terrifying because if we've made a mistake, there's no time to pull that red lever. The bus is already going off the cliff. So what we do, though, is obviously, we keep the previous binaries on the fleet. So for us, it's a simple matter of, oh, this looks bad. We hit another button. And the send command runs, and it just symlinks back to the-- in our case right now, it loads the previous byte code and brings up the servers. Now, that is not a pleasant experience for the end user. It's a bit like pulling the red cord. There's going to be a bump. And there will be some disconnects. But we will, within 10 minutes, revert that new code. Now, the thing that also makes this work is-- I make this very clear to every developer. You will never, ever run in a homogeneous environment on the front end. All right? So if you make that clear, you don't have issues like, well, we've already rolled forward, we can't roll back because that new API, blah, blah, blah-- never happens, cannot happen. With a fleet as big as we have, with as many backend systems as we have, you cannot guarantee what you'll be talking to. You've always got to be forward and backward compatible. If there's some issue, there's always a gatekeeper that can be turned on or off that turns on or off that new code that's sitting out in production that may or may not be there yet. So we've really solved that problem. In fact, if you want to talk more about that, the guy who helped write that system is Amir, in the back-- he's a deployment expert from Facebook. He can tell you how that tool works to deal with getting that code out and back in so fast. AUDIENCE: So I'm John Oden, formerly Mozilla, and just recently moved to Hortonworks. So when we were at Mozilla, we were doing a lot of mobile stuff, so a lot of these diagrams-- I was like, oh yeah. And I want to plus-one the thing you said about a command-line, programmatic way to be able to upload apps to the store. That was a recurring pain in the blank-blank-blank, having to do this by hand. The fact that we have to upload it manually these days-- I mean, I'm all for the secure encryption stuff. Is there anyone here who has sway in either the Google or Apple app stores who can make this happen, please? MALE SPEAKER: We will make it happen. AUDIENCE: Programmatic, secure encryption, signing, whatever, but can we make it so we can hit a button? And then I would ask for a version of it, which is also, if we upload something-- I know then it's a bullet and users can start picking it up. If we find out early there's a problem, we've many times wanted to go and hit an abort button, take down the thing we just uploaded. And the only way we found we could take it down was to go find the previous changeset, generate a new build with a newer version number, and upload it anew. Whereas, if we could just go say, there was a previous one already there, we just uploaded a new one-- abandon the new one, and let people see the previous one. Something like that, programmatically.
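The roll-back Chuck describes-- keep the previous build on every machine and flip a pointer back to it-- might look roughly like this on a single host. This is a minimal sketch under assumed paths and an assumed restart command; it is not the actual Facebook deployment tool he refers the questioner to.

```python
# Minimal sketch of symlink-style rollback on one host: every release lives in
# its own directory, and a "current" symlink points at the live one. Rolling
# back is just repointing the symlink at the previous release and bouncing the
# servers. The paths and the systemctl unit name are assumptions for this
# sketch, not Facebook's real deployment system.
import os
import subprocess

RELEASES_DIR = "/srv/app/releases"   # assumed layout: one subdirectory per build
CURRENT_LINK = "/srv/app/current"    # the servers load whatever this points to


def rollback_to_previous():
    releases = sorted(os.listdir(RELEASES_DIR))   # e.g. ["2014-04-24", "2014-04-25"]
    if len(releases) < 2:
        raise RuntimeError("no previous release kept on this host")
    previous = os.path.join(RELEASES_DIR, releases[-2])

    # Swap the symlink atomically: create a temporary link, then rename it
    # over the live one so there is never a moment with no "current" at all.
    tmp_link = CURRENT_LINK + ".tmp"
    if os.path.lexists(tmp_link):
        os.remove(tmp_link)
    os.symlink(previous, tmp_link)
    os.replace(tmp_link, CURRENT_LINK)

    # Restart the app servers so they pick up the previous byte code. This is
    # the "bump" mentioned above: users see a brief disconnect.
    subprocess.run(["systemctl", "restart", "app-server"], check=True)


if __name__ == "__main__":
    rollback_to_previous()
```

The forward/backward-compatibility rule in the answer is what makes this safe: because old and new code always coexist on the fleet, repointing a host at the previous release never strands it against an API it cannot speak to, and a gatekeeper flag can disable the new code path in the meantime.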
So there are my two wishes. CHUCK ROSSI: Yeah, I can't stress that enough. It's not reasonable. One of my engineers, Brad, is in the back. And he literally stayed up till 2:00 or 3:00 in the morning just fighting, pushing the stupid upload button on a web page to upload the number one app in the world, worth billions of dollars of revenue, to try to make this thing get out. And we can't do that. That's just silly. BORIS DEBIC: Chuck, thank you very much for-- CHUCK ROSSI: Thank you, Boris. BORIS DEBIC: --the keynote. We'll have much more talking during the day. And we're going to move on with the program. Bram.
Info
Channel: Talks at Google
Views: 25,603
Rating: 4.8640776 out of 5
Keywords: talks at google, ted talks, inspirational talks, educational talks, Release Engineering Keynote, Chuck Rossi, Moving to mobile: The challenges of moving from web to mobile releases, Facebook's web frontend release, lessons from shift to a mobile-centric company
Id: Nffzkkdq7GM
Length: 70min 1sec (4201 seconds)
Published: Sat Apr 26 2014