droidcon SF 2018 - Android CI @ Airbnb

Captions
Hi everyone, my name is Michael Dang. I'm an engineer on what we call the mobile developer infrastructure team at Airbnb, and today I'm here to talk about Android CI at Airbnb. I just want to provide some insight on the kinds of things we do in CI that are specific to Android, as well as our history and journey, to show what we've built out today.

A little bit more about myself: I've been at Airbnb for three and a half years now. I worked on product for about a year and a half, and I've been working on our build tools and CI for the past two years. This is the Android team in 2015, when I joined — there are 15 of us here. I worked with this team for a year and a half, and I've been on the infrastructure team since then, helping to support these amazing folks, because I really want to make sure they have a great development experience. I'm also a man of many hobbies, so if you don't want to talk about CI or Android, come up and talk with me about photography, climbing, video games, keyboards, or Smash Brothers.

Cool, all right, so let's get started on exactly what CI is and the part I'll be talking about. This is a diagram of a typical engineer workflow involving CI: you write some code and push it to your source control server — we use GitHub Enterprise — then you open up a PR, which triggers a series of builds and tests, which then notify you about whether they succeeded or failed. This is the part I'm going to be talking about: the CI. And what are some examples of CI tools? Well, there's a lot: TeamCity, Travis, CircleCI, Buildkite, Jenkins, Bamboo — a lot of different tools you can use for CI. This isn't going to be a talk about which one is the best, or even the best for Android, but just to give you some insight on what we decided on and which factors went into those decisions.
So I want to go a little bit into a short history of Airbnb CI. I think our mobile team is fairly new compared to some other companies, but I want to help paint a picture of how our mobile team's growth has affected the kinds of decisions we made.

So in 2014 we had 15 mobile engineers — about half on each platform, seven or eight on iOS and Android. There was no CI. We did have three Mac minis set up on a stack to build the iOS release app, and the Android release app was built on someone's computer. That was the state of things: pretty raw, very startup-esque, super small team. So what happened if master was broken? You yelled at them. They were probably sitting a desk or two away from you, so you just found the engineer who broke it, they'd probably fix it, and life moved on.

We grew pretty quickly after that. In 2015, which is the year I joined, we grew to 30 mobile contributors, about 15 on each platform. We actually implemented CI, which was TeamCity at this point. We decided on it because it was already being used to build the iOS release app, so we just built on top of that and built our whole CI structure with it. At this point we were doing builds on PRs, we were running master after every merged commit, and we also had some release automation for both platforms. We had 30 build agents total. At one point we had around 12 iOS agents, because the iOS app has to be built directly on macOS and all that — which will be important later, even though this is droidcon — and the rest were Android, which was actually really easy for us to scale because we did everything through Chef on EC2 instances, so you could scale at the press of a button. So in terms of scaling there wasn't too much of an issue for Android. The TeamCity server was also managed by us, the mobile engineers — our infrastructure team was busy supporting the other parts of Airbnb, and we were kind of left out on our own. So we decided to go with TeamCity, because that's the one we were comfortable with.
Cool. So, from 2015 to 2018 — from then to now — I just want to point out that one of the main factors in our CI infrastructure's direction was the growth the mobile team experienced. For a couple of years the number of mobile engineers was actually doubling in size, and this growth led to a verticalization from mobile teams into verticalized feature teams. As you can imagine, this made overall communication and collaboration more difficult, since we were no longer sitting together and you'd lose that interaction with other Android engineers. And so this meant we had to focus more on stability and on enforcing higher standards in our code base, to ensure the development experience doesn't degrade over time as we add more engineers.

All right, so to quantify this: over this period we grew about 5x. We're close to 150 mobile contributors, maybe about 120 of them full-time — we have some web folks who help contribute. We also have a mono repo of our Android, iOS, and React Native code, and as you probably know, we added and are subsequently sunsetting React Native. We also added several more jobs to CI and for release automation. We're no longer just building stuff for PRs and master; we're building multiple types of builds, we have UI tests, we have unit tests, we even generate some documentation, and so on. We're also not just building the release app — we're actually scheduling uploads of our alpha and beta, just to remove some tasks that the release manager would normally have to do. And we've moved over to Buildkite, so our current CI tool is Buildkite, no longer TeamCity — we'll get to that later as well. And now we're close to 250 build agents. I'm not sure if you were here for the earlier talk about Uber CI, but yeah, we have racks of Mac minis that we personally manage, and it's kind of a pain in the butt.
But this is where we're at now: somewhere between one and two build agents per mobile engineer.

Cool, so I'm going to dive into some specifics about TeamCity, the problems we ran into, and why they were an issue for us. These are going to be very specific to us, but if some of you have run into similar problems, maybe you can empathize a little.

Just to start off, this is what it looks like to add a step in TeamCity. I think it's fairly easy — there's this drop-down, form-style UI you can use to define a new step. The downside, which I'll detail later, is that this mixes badly with the fact that there's no code review for any of these changes. If a lot of your build logic lives in TeamCity itself, you have a pretty high potential to break something when you make a change — we'll get into that later as well. But all in all, it's fairly easy to add a new step. This is just an example showing a drop-down menu with some defined parameters that we filled out — these are some secrets that we put in, and you can easily reference them. There's also an idea of build inheritance, which allowed us to easily add certain universal steps to any build, like analytics or logging some sort of event.

So what were the reasons for moving off? This is actually the biggest one: limited PR reporting functionality. This is kind of hard to explain, but it involves the build queue — what happens when you don't have enough agents to run all the PRs that need to be built. At around 2016 I think we had 12 agents for some 30 iOS engineers, and while this is really an iOS problem, it was definitely part of the decision for us to move as a mobile team. I think the next images will help illustrate this. Back to this diagram: the part that was breaking is actually step six, "notifies success or failure."
You'd expect that if you open a PR on GitHub, it will show the status — whether it's pending or finished, and whether it succeeded or failed. The pending part is the key here. This is just a screenshot of sixteen builds in the queue, so for a bunch of these PRs the builds haven't started yet. What you'd expect to see on GitHub is something like this, right? Things are waiting, these are yellow, that's fine — you move on with your life and get back to it later. The weird thing, though, is that this is what people were actually seeing: everything's green, even though nothing has run. Why? Why do you do this to me? And so this was a flaw in functionality — I guess a bug — in TeamCity. This was of course a very old version of TeamCity, like 2017.1.4 maybe, and these things may have changed since then, but at the time this was one of the biggest things causing headaches for us. People would come to the screen, see that their PR was green, and merge. That would break master, because no builds or tests had actually run. We wanted to be able to fix this.

Aside from this, to continue on, there are a few other things. The user experience I mentioned before — configuring everything within a UI — is not super scalable if you have really complex build configurations. Especially with the inheritance and the lack of code reviews, it's really easy to break something, basically. On top of that, we didn't have any support from our internal infrastructure team, and having everything on the mobile engineers added too much overhead for our team to manage. And the rest of the company was on Travis and Solano, so there were three different build systems going on — maybe too much for us to handle.

The last thing is more of a funny story than anything: TeamCity is not emoji-friendly. We've had company-wide incidents caused by emojis.
One of them involved a fire emoji, which was just funny. Here's an example: basically all of TeamCity was red, everything was failing, and we saw this weird failure and weren't really sure why. I think we saw this guy — it's funny because TeamCity can actually render the emoji, but then there's a SQL exception saying it can't insert into the database. We're not entirely sure what was happening, but our guess is that TeamCity was fetching from our GitHub repo and doing some sort of indexing on all the files, and when it got to the file with an emoji, it just blew up and everything went red. It was pretty crazy, but we were able to fix it. This is just one of those funny stories where you don't expect emojis to create company-wide blockages.

All right, so moving on, I want to talk about Buildkite now and why we wanted to move to it. Around 2017 we transitioned our Java and mobile repos to Buildkite. This started as an effort to find a solution to the problems we were encountering with TeamCity and Travis — I don't know the exact details about Travis — but it was also just a way for us to start anew and bring the knowledge from our previous learnings to managing a new CI tool.

Cool, so what were the big Buildkite pros? Like I mentioned, the PR reporting thing worked great with Buildkite: if your build is queued, even if nothing has run yet, it shows on GitHub as yellow — and people stopped merging broken code into master. Then there's the flexible build configuration. Buildkite allows you to write a configuration YAML file with all the specifications you need, and that was just a lot easier for us to handle. You can code review everything, because all the configuration lives in your repo, so it's a lot safer that way. And any change you made was actually tested in a PR — if you're changing a specific pipeline, once you open a PR it'll actually be tested.
That's really, really nice. This was probably possible in TeamCity too, but it was very hard to test changes if a lot of your logic actually lived in TeamCity rather than in a configuration file somewhere. Buildkite also automatically parallelizes jobs — again, I think this is one of those things that's possible in TeamCity, but Buildkite basically gives it to you for free, and that was important for us because not everything needed to be run serially.

A few of the smaller wins: we think it has a nicer, simpler-to-use UI — as simple as that — and the main server is maintained by Buildkite. Maybe this is a bigger win than I'm making it out to be, but we actually have a dedicated support team at Buildkite. It's really easy for us to connect with them, and when we report any incidents or issues, they're able to work on them directly, which is great. For example, since we were managing our own TeamCity server, we'd run out of disk space because all the logs were saved on that single server machine, and then I'd have to start thinking about scaling and all that stuff — just another headache. So it's a great thing to have Buildkite manage the main server for us.

This is an example of a Buildkite pipeline — way simpler, and I guess less configurable in a way than TeamCity, but hard to mess up. Basically, all we do here is upload our configuration file, so no logic actually lives in Buildkite for us; it all lives in our repo, which is really, really nice. And this slide shows an example YAML file. This isn't going to be a Buildkite tutorial or anything, just a few snippets here and there to show how we use it. Highlighted in red is what we call a job — there are two jobs here, one for Gradle and one for Buck, down below — and these are run in parallel. In purple you see the list of commands: in this case we simply cd into the android directory and invoke Gradle to run the builds.
In blue you can see some parameters to enforce the type of agent we want the build to run on. Since all the builds are under one account — we have all of our iOS agents, our Android agents, and some Java-specific stuff — we just want to make sure the builds run on the correct agents, in the proper environment. So this is really easy to configure as well. Everything is in a single configuration file, it's all editable within your repo, and it's super flexible.

All right, so next I want to give an overview of the kinds of things we actually run on CI. I think these are interesting because they show where we put our focus and what we decided were the most important tasks to keep stable for our Android developers. I've also put release automation here instead of CD, because we can't really continuously deploy — we have to go through Google for Android and Apple for iOS — but I'll show those tasks later.

Cool. So on every PR we do the following — I'll go into each in more detail in the following slides, but this is the list of jobs. We build the Airbnb app, making sure it's buildable with no compilation errors. We run Android Lint for resource errors and things like that. We run UI tests — I'll get into why there's only one UI test later. We run some Ruby tests — I'll explain why we have Ruby. And we run some DLS tests, which I'll also explain later; maybe you've heard of it — I think we've written some blog posts on it — but it's essentially what we call our design language system. Every job is run in parallel, by the way, so each job only bottlenecks itself.
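All of those per-PR jobs are declared in the pipeline YAML file mentioned earlier. As a rough sketch — the labels, paths, and queue names here are illustrative, not our actual configuration — a couple of those parallel steps might look like this:

```yaml
steps:
  # Sibling steps with no dependencies between them run in parallel,
  # each on its own agent.
  - label: ":android: Build app"
    command:
      - "cd android"
      - "./gradlew assembleDebug"
    agents:
      queue: "android"   # route to Android-provisioned machines

  - label: ":android: Lint"
    command:
      - "cd android"
      - "./gradlew lint"
    agents:
      queue: "android"
```

Sibling steps without a `depends_on` relationship run concurrently, and the `agents` block is how a build gets routed to machines with the right environment.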
So, to build the app: we actually build the app in three different ways — maybe two and a half, depending on how you count it. The first way is Gradle. We use Gradle because it's the Google-approved method, and we actually use it to release — every release build is built with Gradle. It's fairly stable and the builds are consistent; the problem, maybe, is that it's kind of slow for us. So we also use Buck, and OkBuck. OkBuck is something we wrote to help convert your Gradle files to Buck files, so you don't have to maintain two complete build systems. We use Buck because it's faster — that's about it. Buck is super optimized for build speed, so this is what all Android engineers use for local development at Airbnb. And the last one is what we call Airbnb light builds. This is where we take advantage of flavors to build smaller, usable portions of the app. For example, if you're only working on messaging on a given day, you probably don't need the listing page in your compiled app, right? You just want to test your messaging features. So we built a system around this so that engineers only compile what they actually need.

Next up is lint — a big but, we think, unnecessary pain in the butt. We use Android Gradle Plugin 3.2.0, and one thing to note is that I believe 3.1.0 introduced linting on Kotlin files; we think that's what caused our full lint time to balloon to two and a half hours, which is a bit absurd. We attribute it to Kotlin because, as we've added more Kotlin code, this has been slowly increasing. The current solution we've come up with to tackle this is what I call smart lint. It's basically per-module linting: we detect the changes you've made on your PR, and we only run lint on those specific modules. So far this has proven to be pretty good — I think the average run time is now around 20-25 minutes, though it can vary a lot depending on how many modules you touch. And so far we haven't seen a scenario where the full two-and-a-half-hour lint has caught something that smart lint has not caught.
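As a toy sketch of the module-detection half of an approach like this: map the file paths changed in a PR to the top-level Gradle modules that own them, then lint only those modules. The module layout and paths here are hypothetical, and in CI the file list would come from something like `git diff --name-only origin/master...HEAD`:

```python
# Toy sketch of "smart lint" module detection. Paths and module layout
# are made up for illustration; this is not Airbnb's actual tool.
from pathlib import PurePosixPath

def changed_modules(changed_files: list[str]) -> list[str]:
    """Return the unique top-level module directories touched by a PR."""
    modules = set()
    for path in changed_files:
        parts = PurePosixPath(path).parts
        # Files at the repo root (e.g. README.md) belong to no module
        if len(parts) > 1:
            modules.add(parts[0])
    return sorted(modules)

# The CI job would then invoke something like `./gradlew :<module>:lint`
# for each module returned here.
print(changed_modules([
    "listing/src/main/ListingFragment.kt",
    "messaging/src/InboxFragment.kt",
    "listing/res/values/strings.xml",
    "README.md",
]))  # → ['listing', 'messaging']
```

A real version would also need to check that a directory is actually a Gradle module (say, by looking for a build.gradle file) and to handle nested module paths.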
But this has only been running for a few weeks, so we'll see how it goes — so far it's pretty promising. We also do other Kotlin linting: we use a couple of libraries called ktlint and detekt. That stuff is a lot faster, but it doesn't encompass as many rules as the full Android Lint.

UI tests. Yeah — really, really hard to get right, for us anyway. This is another area where we spent a lot of time, and we've had a few heartfelt attempts, but in the end we've had to let most of it go. To get into it a little more: one of our biggest issues was that UI tests can be dependent on our API servers being up and running, and it's really frustrating for an engineer if their PR is blocked because the API went down. So we came up with this tool called OkReplay, which I believe we've open sourced. It's basically a tool to record API responses and then replay them in the context of the UI test, so that your tests aren't dependent on any servers running — the responses are mocked through OkReplay. While this was great, there were still a few issues with rewriting the recorded responses, and with overall ownership of the tests. The infrastructure team actually wrote a lot of the initial tests, so when something broke we couldn't really depend on the other teams to fix it, and overall this just became way too much overhead for our team — we ended up nixing a lot of it because it was too much for us to handle. Currently we have one UI test. It opens the app and makes sure that you can see the login screen — and it has caught a few things. So it's useful, it's easy, we don't have to change it — that's great. That being said, 2019 is supposed to be the year of UI tests. We actually have a lot of product teams that are super passionate about this now and are stepping up to take ownership, which is going to be really great for us.
Hopefully we'll see a lot of major improvements here soon. To do a quick spiel about Mavericks: Mavericks is an open source library we've released that makes it really easy to write Android screens and takes out a lot of the boilerplate. This should overall make it easier to write UI tests, and make engineers a little more enthusiastic about writing them, without all the stuff that used to be really hard about it.

All right, a little bit about Ruby. We have some Ruby scripts that make it easy for engineers to create modules — just having more modules makes it faster to build with Buck, since it parallelizes everything. So we have scripts to generate a directory, add a build.gradle file with the dependencies you want, and maybe add some Dagger stuff. This is also related to the Airbnb light builds I mentioned earlier. Long story short, we have Ruby scripts to help automate some tasks, and we want to make sure those keep working too, so we run RSpec, which is Ruby's testing framework, as well as RuboCop, which is a static analyzer. We use Ruby for some code generation as well as some release management tasks — just another one of those things you might not think about, but it's important to have tests for.

Before we move on, I want to talk a little more about that DLS system I mentioned earlier, to give you some more context. DLS stands for design language system — I'll refer to it as DLS from now on — and it's essentially a discipline-agnostic collection of view components. It's a way for engineers and designers to speak the same design language. It fixes the problem of multiple designers recreating a similar component for multiple use cases with varying one-off designs, and of engineers copy-pasting the same component or tweaking it to show a slightly different view.
And that's obviously not that great, not that scalable — you're never sure whether you can reuse a component or not. So we developed a system where you have the same set of rules for all engineers and all designers, which just means less duplication for us. As you can imagine, a system like this requires a lot of upkeep: we want to make sure that the components across all platforms are equivalent, and that changing any one component for one screen doesn't break the user experience on other screens.

To give a visual example — this is fairly old, but it illustrates the point — you can see different types of cards, different types of rows, and this thing we call a marquee, which is the big header at the top of a page. So if you wanted to compose a screen, it might look like this: a marquee at the top, a series of rows, maybe a section divider to break off important sections, and maybe some more rows. This is a huge system we've built out, and it's super essential to the developer experience, so as far as CI goes, we want to make sure it works, and works well.

For one thing — and this is actually run in CI — we have a browser app that contains all the components an engineer can build at any given time, and we want to make sure that browser app is stable, so we build it. We also render components and show differences with a tool called Happo. Happo is a CLI tool used to diff UIs, and it's what we use to capture any differences from a change, to make sure everything is intentional — since a component can be used in multiple parts of the app, you have to be very cognizant of that. We also generate some documentation for DLS, I believe, and we do other very specific stuff that may be out of scope for this talk. But there's a lot we do to ensure that our DLS stays stable for all of our developers.
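As a toy illustration of the screenshot-diffing idea — this is not Happo's actual implementation, and all names and the file layout are made up — you can get surprisingly far just by hashing rendered component screenshots against stored baselines and flagging anything that differs:

```python
# Toy sketch of screenshot diffing for a component browser app: hash each
# rendered component's screenshot and compare against a checked-in
# baseline. Any new or changed component gets flagged for human review.
import hashlib
from pathlib import Path

def digest(path: Path) -> str:
    """Content hash of a rendered screenshot."""
    return hashlib.sha256(path.read_bytes()).hexdigest()

def changed_components(baseline_dir: Path, current_dir: Path) -> list[str]:
    """Component names whose current screenshot differs from the baseline."""
    flagged = []
    for current in sorted(current_dir.glob("*.png")):
        baseline = baseline_dir / current.name
        # New components, or any pixel difference, need a human to confirm
        # the change was intentional before the baseline is updated.
        if not baseline.exists() or digest(baseline) != digest(current):
            flagged.append(current.stem)
    return flagged
```

Real tools compare pixels with tolerances and generate visual reports; byte-hashing is the bluntest possible version of the same check.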
Cool. So, for release process automation: we have a weekly rotation of release managers, since we release on a weekly basis, and their job as release manager is basically to check crashes, make sure the app is stable, merge any cherry-picks for fixes, and update translations. These can be very time-consuming and tedious tasks, but they're also very important, so we try to automate as much as possible.

For one thing, we upload strings for translations. We have an internal translation team, and we periodically upload all of our untranslated strings and download any new translations on a daily basis. We do this daily because we have scheduled builds and uploads of alpha, beta, and QA builds: we actually build and upload an alpha build every six hours, and for beta and QA we build once a day — we send the QA build to our QA team, and the beta is uploaded to the beta track.

We also alert engineers about new crashes. This is probably one of the most important parts of the release manager job: to detect and triage new crashes. We use a tool called Bugsnag — I think they're actually upstairs. For a lot of release managers this task can be very cumbersome, and it's also prone to human error, like when you don't find a crash or don't realize it's super high priority to fix. So we built a tool that detects all new Bugsnag crashes that occur in a new version, goes through the stack trace, tries to determine who is the most likely person to have caused the crash, assigns them, and then alerts them via Slack. This has helped a lot with our app's crash rates, and it takes some of that menial task work off the release manager.
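A stripped-down sketch of that triage idea — the frame format and the ownership map here are entirely hypothetical; a real version would consult git blame and a team directory, then ping Slack:

```python
# Toy sketch of crash triage: walk a crash's stack trace top-down and
# assign it to the owner of the first frame that lives in our own code.
# The OWNERS mapping and package names are made up for illustration.
OWNERS = {
    "com.airbnb.android.messaging": "@messaging-team",
    "com.airbnb.android.listing": "@listing-team",
}

def likely_owner(stack_frames: list[str], default: str = "@release-manager") -> str:
    """Pick the owner of the first in-app frame; fall back to the release manager."""
    for frame in stack_frames:
        for prefix, owner in OWNERS.items():
            if frame.startswith(prefix):
                return owner
    return default

print(likely_owner([
    "okhttp3.RealCall.execute",                          # library frame, skip
    "com.airbnb.android.messaging.InboxPresenter.load",  # our code: assign here
    "com.airbnb.android.listing.ListingAdapter.bind",
]))  # → @messaging-team
```

The fallback matters: a crash that only touches third-party frames still lands on someone's desk rather than silently going unassigned.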
We also create release branches. For that weekly release cadence, we basically take a snapshot of master and name it as the release for the following week. It soaks for a week, we do some QA testing, and we make any cherry-picks for crucial fixes. In this case we actually get to interact with GHE — long story short, GitHub's APIs are super useful for making things automated and fast. We run all of these things on CI, through Buildkite, essentially just using it as a scheduler — it's a neat way to utilize it.

Cool, so on to future work. As you can see, our CI and tooling is pretty coupled with a lot of different products. While CI, I think, starts with build stability, you can keep building new tools to improve efficiency — especially with mobile, whose deployment process is fairly complex and not completely under our control, so it's important for every step in between to be robust against potential API changes.

In terms of future work, I think one of the more impactful things we could start working on right now is a proper release dashboard that makes mobile development more accessible to everyone in the company. There are different groups of people who want different builds: a PM might want a specific branch that's pre-production-ready, customer service might want a specific version to reproduce an error, API engineers might just want master builds so they can test their API changes. So there are a lot of different groups here, and right now there's no way for this to happen super easily — you have to go ask an Android engineer to build the app and send the APK your way. It would be great if this were more streamlined, to make it more accessible to everyone outside of the mobile engineers. Something else that would be really interesting is a dashboard to monitor crash rates, modify A/B experiments, and do automatic rollout or halting of releases.
That's what we don't have right now — we do it manually, and it's up to the release manager to track those changes. Build visualization is also something that's really interesting to me. As product grows and engineering teams work more detached from the build tools, it's important to make sure engineers have all the tools necessary to understand and act on build errors. You saw earlier that we have Gradle, we have Buck, we have OkBuck, we have Airbnb light builds — there are a lot of steps between an engineer writing code and seeing it in the app, right? I don't think we can expect every single engineer to understand how every single tool works, which would probably be necessary to debug certain kinds of errors. So I think we have some space here to work on surfacing these errors, and maybe even suggesting things you can do, to make them easier to fix and debug — essentially, making it easier for engineers to understand their builds. And as I mentioned before, UI testing is going to be the big thing for 2019; I think our infrastructure has a lot of room to grow to make everything super stable and really easy for product teams to develop on, without them having to think about the infrastructure behind UI tests. So we think there's a lot of work that could be done here as well. There's probably more we could do, but currently my team has two people — one for iOS and one for Android, and that's me — so we'd love help from anyone who has experience here. But yeah, thanks so much — that was my really fast talk on CI at Airbnb. I just want to open it up for questions now from the audience.

[Applause]

Yeah — so the question was: beyond TeamCity, why did we settle on Buildkite instead of Travis? For one thing, the Java team was also looking to get away from Travis, so together we went toward Buildkite.
I think one of the things — I'm not too sure about the specific details here — is that macOS support is really important for us; we need a good tool that works on Mac, and we thought Buildkite was a great option for that.

Sure. So the question is: why did we decide on a single mono repo for all of our mobile source code? I think the biggest reason was probably React Native. When we decided to go with it, I think in 2016, it made a lot of sense to have the React Native code shared with our iOS and Android code, so we made it a mono repo. Maybe that'll change in the future as we sunset React Native, but I think that's the biggest reason.

Do we plan to open source the smart lint tool? Yeah, it could be possible. We're definitely going to look into it, but first we want to make sure it's complete, because I think there are probably a few edge cases out there that it won't totally catch. We want to keep running it internally to see how it goes.

Has Gradle caught anything that Buck hasn't caught when developing locally? "Caught" is an interesting word — I don't think there's necessarily something wrong per se that Buck wasn't properly catching, but there are some scenarios where Gradle or Buck will fail while the other one succeeds. I think a lot of that has to do with resources, and I think there were also some Kotlin bugs related to Buck. But the short answer is yes, there are some differences, and we've basically worked around them to make sure things are stable on both.

Sure. So, why do we use Ruby instead of, say, Gradle plugins for some of our non-build-related tasks? I think the reason is that we just have a very strong Ruby culture at Airbnb and decided to go with it. There are a lot of internal Ruby gems that we use as well, so it fits with the rest of the engineering culture.

Sure. So the question is: with our Airbnb light flavors, how does the modularization work in terms of features?
modulation work terms of like features so yeah each module is essentially its own feature but each module is not it's not yet their own light builds so I think we combine there's like a number of like basically subsets that you can use right to develop a single flavor so it can include multiple parts of it it's up to the developers to choose what they need you don't generate these or anything like that space on the developer 2x rate the the flavor or the light build for their for their own purposes do you have any performance testing our automation process there's no testing for that I think in terms of performance like measurements that we do it's all within the context of either like an individual developer who's you know focusing on that or within an experiment and we just test the differences with the experiment on and off about like cold start times and time director interactive and all that yeah nothing in terms of C I have you Ryan - let me any limitations with book for library projects I don't think so non the none the recent time period maybe when we had first implemented I was actually not on the team when we first started using book for development so I can't comment too heavily but so far it's been running great gross yeah yeah so the question was what were some of the issues we had with using Colin with puck there are there a number of issues a lot of them are just simply like the CERN like build tools build tool steps we're not yet implemented in buck so for example like common annotation processors were not available for a really long time within buck and we essentially either waited for someone to to work on it and avoided using those kinds of things within our project and for a lot of these things like cat for example they did build them out and then we were able to upgrade our version of bug and start implementing things with calling annotation processors there has been areas where we have you know made committed changes to bug itself to like fix 
some of these problems or maybe fork it and do some sort of like hack to make it work but yeah there's many different ways that we've tried to solve this either by waiting for someone to implemented or doing it ourselves right so if yeah if we have changes in react native code or if you have changes and both Android and iOS code and we run all tests for all for both type ones have there been difficulties going from one bug person to the next bug version yes there have been there's a lot of things that end up breaking a lot of things that come that you've had to work around but yeah the short answer is we've had some breakages here and there and we essentially have to like submit a fix to it or yeah just wait until that bug gets fixed on on Buckmaster or something like that yeah sure can I spend more about Happo I'm not on the team that actually built it out I'm just kind of aware of what they did but essentially I think what it does is it renders each view or each like view component within our deal that system and then does a comparison between how it looks like on master versus how it looks like in your PR and it'll basically send you a link with all the differences and you can visually scan through like this other component like before this other component looked after I think I don't even like highlight the part of it that actually changed and then it's just time for you to be able to like mentally like you know say that this was intentional this is fine otherwise you go back and check out what went wrong the out note the outputs like in this really like nicely formed just like UI comparison so I should like render the views for you and then point out which part of the views I should changed yeah how does it render those views that I'm not too sure about cool well if anyone has any other questions you can come out to me otherwise from prior to send early thank so much everyone for listening you
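The Happo-style flow the speaker describes — render a component on master and on the PR branch, diff the two renders, and highlight the changed region for a human to approve — can be sketched in miniature like this. This is a toy pixel diff with made-up names, not Airbnb's actual implementation; real screenshot tools work on bitmaps and apply fuzzing for anti-aliasing, but the core idea is the same.

```python
# Toy sketch of a Happo-style screenshot diff (hypothetical names,
# not Airbnb's implementation). Each "render" is a 2-D grid of pixel
# values; the diff reports the bounding box of the region that changed
# between the master render and the PR render, so a reviewer can be
# pointed at exactly what moved.

def diff_bounding_box(master, pr):
    """Return (top, left, bottom, right) of the changed region,
    or None if the two renders are identical."""
    changed = [
        (y, x)
        for y, row in enumerate(master)
        for x, pixel in enumerate(row)
        if pr[y][x] != pixel
    ]
    if not changed:
        return None  # nothing to review: component is visually unchanged
    ys = [y for y, _ in changed]
    xs = [x for _, x in changed]
    return (min(ys), min(xs), max(ys), max(xs))

# Two 4x4 "renders": the PR changes a 2x2 patch in the bottom corner.
master = [[0] * 4 for _ in range(4)]
pr = [row[:] for row in master]
pr[2][2] = pr[2][3] = pr[3][2] = pr[3][3] = 1

print(diff_bounding_box(master, pr))      # (2, 2, 3, 3)
print(diff_bounding_box(master, master))  # None
```

A report generator would then crop both renders to that bounding box and show the before/after pair side by side, which matches the "highlight the part that actually changed" behavior described above.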
Info
Channel: droidcon SF
Views: 1,031
Id: HShVflK-lmI
Length: 35min 8sec (2108 seconds)
Published: Sat Dec 01 2018