BazelCon 2019 Day 1: Six learnings in moving to a Bazel-based CI system + Q&A

Captions
So we've got one more talk before we break for lunch. The last talk was super exciting and I'm expecting the same from this one: we've got Or from Wix.com, who is going to give us a talk on moving to a Bazel-based CI system. Come on up, Or.

Hi everyone. So, with a show of hands: who here is familiar with the experience of taking a long break because you're waiting for your build to finish, because the build takes a lot of time? Yeah, we all have it, right? We all have those things that we do at the office while we're waiting on a pending build: taking a coffee break, checking your email. I personally call my mom. In many companies this is what developers do while waiting for the build to happen.

So hello everyone, I'm Or, a back-end engineer at Wix in the CI group. I came here all the way from Tel Aviv, I'm super excited to be here on stage, and I'm really happy to see how the community grows from year to year. A little bit about myself: I was fortunate to start my career working side by side with a classic build manager who taught me the joys of Ant and Makefiles. Since then I've built several Maven CI pipelines, again and again, and then moved to doing some back-end engineering. Three years ago I joined Wix, and I was happy to combine my two passions, doing back-end engineering for the CI system at Wix. We needed to bring distributed back-end systems into the CI domain, because traditional tools did not work well at our size. So we moved the back-end build system from Maven to Bazel, and we're quite happy about it: from long-running builds that made us wait and wait and wait and switch context, and eventually failed, to speedy builds that just work. And I'm here to talk with you about the process of getting there.

Before we begin, let's agree on a simple metaphor for a build system: an assembly line. The input is our code base and the build definitions, the machinery is the build server and the build tool, and the output is the deployables and the feedback back to the developer. It's quite simple to understand, right? We require the system to be fast, reliable and efficient.

But let's look at Wix's back-end build system situation in 2017. We had about five million lines of code, mostly Scala, in over 1,000 git repositories, and about 150 developers. For the build tool we used Maven, with snapshot dependencies between 2,000 Maven modules, and TeamCity for the build server. Each Maven module got its own TeamCity configuration, which is the equivalent of a Jenkins job. We applied pessimistic locking, so whenever a module was building or broken, any module that depended on it was actually blocked. And we had a real big problem, because this did not work well with our size: many builds were broken, breaking downstream builds. In order to get my feedback I sometimes had to wait over an hour; in order to get my deployable ready I had to wait many hours to a day, and this is not exactly what a CI system should do, right? We're supposed to improve velocity. Those are just examples of messages we got on our CI Slack channels, from frustrated developers asking about their builds. CI was marked as one of the biggest bottlenecks in releasing a new feature, and we knew that something had to be done. We knew that we had already exhausted every option to optimize within the system we had, and we needed to look for something new. We considered moving to sbt or to Gradle, but given how Wix is growing from year to year, we wanted something that would still work with 10 times, 20 times more scale.
Now let's snap to the present, 2019. We completely changed the machinery, we completely changed the CI system to work with Bazel, and we're very happy to say that our builds are mostly stable. To get my feedback I now need to wait five to ten minutes, not more; to get my deployable ready, fifteen to twenty-five minutes, and in the next two months it's going to be even half of that, thanks to optimizations we're working on these days. So yeah, we're quite happy about it. We came a long way getting from there to here, moving to Bazel, and we had a lot of failures along the way (thanks, Jeff, for the talk) and a lot of learnings along the way. So I'm here to share with you six key learnings that might help you when you're migrating your system to Bazel.

Let's begin. Number one: use a migration tool. If you want to run Bazel on your code base, you need to have some Bazel files, right? BUILD files, a WORKSPACE file. You can write those manually, so let's understand what that means. Take a very simple Maven repository: two modules, a core and a server, some main code and some test code.

The first thing we want to do is choose the build granularity, that is, how many source files will be included in a single build unit. You can go very coarse-grained, like Maven does, and group together all of the main code and all of the test code; or you can go very fine-grained, with a single build unit per source file. It's all a matter of how much you want to gain from Bazel's parallel execution and caching. At Wix we chose the 1:1:1 strategy, the middle ground: one target per directory, representing a single Java package.

Next, we want to understand which Bazel rule each build unit represents: maybe it's a Java library, a Java test, maybe it's a Docker image, maybe it's the deployable that you want to get at the end. Then we want to start working on the dependencies: internal dependencies, between the internal targets, and external dependencies, in this case on external Maven binaries. And look: here we have dependencies on JUnit and Jetty, but foo-core depends on Guava 20 while foo-server depends on Guava 28, and if you know the JVM world, you know that this is not a healthy situation, because you can get failures in production or at test runtime. So part of the migration for us was to align the third-party dependencies, so that everything uses the same version of each dependency. Now that you know the dependencies and the targets, you're done: you can start writing your WORKSPACE file and BUILD files.
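To make the 1:1:1 idea concrete, here is a minimal sketch of what such BUILD files could look like for a toy foo-core/foo-server layout like the one described above. The package paths, target names and the //third_party alias convention are hypothetical, not Wix's actual files, and the exact third-party labels depend on how Maven dependencies are imported into the workspace:

    # foo-core/src/main/java/com/example/foo/BUILD
    # 1:1:1 strategy: one java_library per directory / Java package.
    java_library(
        name = "foo",
        srcs = glob(["*.java"]),
        visibility = ["//visibility:public"],
        deps = [
            "//third_party:guava",  # single, aligned Guava version
        ],
    )

    # foo-server/src/test/java/com/example/server/BUILD
    java_test(
        name = "FooServerTest",
        srcs = ["FooServerTest.java"],
        deps = [
            "//foo-core/src/main/java/com/example/foo",
            "//third_party:junit",
        ],
    )

Note that both modules now point at one pinned Guava target, which is exactly the third-party version alignment described above.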
But what we understood at Wix is that this is not what a real workspace looks like, this is not what a real code base looks like; it's more like this, right? So we actually wrote a tool that does all of this for us automatically. It generates our BUILD files and WORKSPACE file using some heuristics and code-analysis tools: it generates the code graph, reduces it to a target graph, then deals with all the third-party dependencies and writes the Bazel files, the BUILD files and the WORKSPACE file. I'm happy to say that this year we open-sourced it under the name Exodus, so you are more than welcome to try it out, let us know how it works for you, and contribute to make it work for other languages and other build tools; we'll be very happy to help you out.

But it's a tool that works with heuristics, and it doesn't really change your source files, so in a lot of cases the generated Bazel workspace will not pass bazel build and bazel test. You will still need to bring in some experienced developers, we call them the migration trustees, to massage the migration result and make those steps pass. It's not as hard as writing the Bazel files from scratch, but it does require some skill, because sometimes you need to add missing dependencies that Exodus missed, or you need to change the code: maybe the code no longer compiles once you make all of it work with a single version of Guava, or you find out that you have tests that explicitly read files that are only available during Maven test runtime, like src/test/resources, or tests that try to download stuff from the internet at test runtime, and you know that Bazel runs everything hermetically, so there's no internet access during test runtime. Those are just examples; we have a full page of troubleshooting for Exodus results. The rule of thumb is: the more technical debt your project has, the more work the migration trustees will have to do.

So let's look at the process again. Once the migration trustees have massaged the result, bazel build and bazel test pass. At Wix that wasn't enough for us; we needed a way to make sure that we hadn't lost anything, that we hadn't missed anything along the way. The way we chose to check was to write a comparison script that scraped the Maven test results and the Bazel test results and compared all the test cases. You can imagine how happy the back-end developer was who got the job of writing a script that compares a bunch of XML to a bunch of XML in a slightly different format, but we did it. So, looking at the final process: the comparison script passes, and then we consider the repo Bazel-validated and can start working with Bazel. In many cases we had to reiterate on the process, so, in order for Exodus not to make the same mistakes again and again, we added override mechanisms through which the migration trustees can hint Exodus to add some extra steps at the end of the automatic migration. I really think that if you can use a migration tool to generate your Bazel files, it can really boost your migration project.

Number two: measure your system. It sounds trivial, but you need to remember that you're optimizing a system, so you want to measure your current system and you want to measure the new one. A story from 2017: we took Wix framework, one of the larger repositories we had, with over 100,000 lines of code and about 200 Maven modules. Framework at Wix is basically the set of libraries and utilities that are used throughout all of our microservices. We migrated it with Exodus and got about 2,300 Bazel targets. We were very happy to show that, after running Bazel several times, the average Bazel run took five minutes for a correct, incremental build, while getting an equally correct build in Maven, running Maven from the top-level directory, took us 45 minutes. That's amazing, right? It was a really good selling point in the organization: hey, Bazel is amazing. But the truth is that it's not a good enough way to measure the whole build system, because in the Maven CI we didn't really run Maven from the top level, we did try to optimize and parallelize as much as we could; and besides, framework didn't really have deployables, and a large part of how we wanted to measure the CI system is how much time it takes for a deployable to be ready. So, in hindsight, this is what we would have done to show improvement: for each artifact, for each deployable, check how much time it takes from a push to a ready deployable.
And you want to check it per commit, because you want to compare apples to apples: you don't want to compare how much time it took for a change in a root library that affects the whole build tree against a change in a leaf library that affects only itself.

You also want to check the stability rate, and I'll explain. Let's say that I'm working on a new feature for some artifact, I'm working locally, and then I want to merge my changes to master, and master is broken. I'm basically blocked, right? I need to wait for master to be green, for someone to fix master, before I can merge my change. So, given a time frame, say a month, you want to check for how much of it your master was green and ready to receive changes. We actually used the fact that Bazel is so fast, together with a remote cache, to introduce a pre-check mechanism, and we really improved on that metric alone. So: you will invest a lot of time and effort in your migration project; make sure to also invest some time in a measurement system that measures both the old CI pipeline and the new CI pipeline. It will also benefit you later, when you want to understand where you can improve.

Number three: local dev experience, or, don't forget your users. When we started the project we thought Bazel would be amazing for the CI system, and when you run Bazel from the command line it takes five minutes, right, not 45, so everything would be amazing and developers would be happy. But as we got more and more users, we started to get a lot of pushback: users came to us super angry, saying, hey, the experience with IntelliJ sucks, how do I handle all of these BUILD files, dependency management is so hard now, I get red code, the code completion isn't correct, and refactoring is basically impossible.

We had also assumed that all of our developers were going to share the same remote cache for local builds, but we found out that with macOS it's not that easy: you have to manage all of the tooling, the Xcode version, the toolchain versions. So we actually started the project with developers not sharing the same remote cache, and they all had to build everything locally. One team leader came to me and said: well, it's amazing that you made the CI so fast, but now it takes me three or four times as long to develop a feature locally, so you didn't really improve anything.

Eventually the story had a good ending. At Wix we made sure to run proper Bazel training for the whole R&D group. We made sure we were really on top of any IntelliJ issue, we reported everything to the Bazel IntelliJ plugin repository, and where possible we contributed solutions ourselves; whenever we hit a wall, we developed internal solutions ourselves. Today at Wix we have a whole team in charge of the local dev experience with Bazel: doing tooling alignment, making sure all of the developers can use the dev cache safely, and writing a set of enhancements and tooling to make working with Bazel inside IntelliJ much better. Tomorrow there's going to be a talk just about that, by Ittai Zeidman, and I really recommend you go. I can say that today the local dev experience isn't perfect, it's not fantastic, but it is okay, and it's really worth the trade-off of moving to faster builds on CI and locally. So yeah, number three: think about the local dev experience.

Number four: introduce a new CI pipeline, but keep the old one, and I'll explain. Let's talk for sixty seconds about how a change gets from the developer to production at Wix.
Unlike some other companies, we don't have code drops or iterations; the decision on when to deploy a change to production is solely the developer's. The developer pushes code to git, which kicks off the CI pipeline, which creates an artifact, and then the developer can go to our deployment back-office, a system that we developed in the CI group, and decide whether to stand up a canary of this version, to GA it to all servers, or to roll back to the previous version. So the speed of this process really depends on the speed of this part, the CI pipeline, and this part was bad for us; it was broken and slow. We needed to replace it, and we did.

We thought: okay, let's build a whole new process that is Bazel-based. It's going to use the Bazel remote cache and remote execution; we're going to protect master from being broken; we'll have automatic discovery of every deployable, because unlike with TeamCity, where we had one job per module, we now have one job per repo, so we need to discover all the deployables inside that repo; and we're going to replace the build server with a cloud-based one, so no agent limits, no build queue. Then we took this amazing pipeline and placed it in parallel to the old CI pipeline: any change actually triggered both pipelines and created two separate artifacts, one Bazel-based and one Maven-based. The developer could then go to our deployment back-office and, with the click of a button, choose whether she wants to take the deployable from Bazel or from Maven, and she could switch back at any time. This proved to be very good, because the developer was confident knowing that she could move whenever she was ready, and if there was any issue she could roll back, return to Maven, and fix the Bazel build offline. So: run parallel CI pipelines until you're ready to discard your old one.

Moving on to number five, which is about optimizing your CI for speed; this is actually relevant for those of you who have already moved to Bazel. We all know that Bazel is fast, right? But at Wix we wanted to make it faster, as fast as possible. When we started the CI we said: okay, we're going to use the Bazel remote cache and remote execution, and this is going to make an amazing, super-fast CI. But then we launched the system and noticed that even for a fully cached build we had to wait five to eight minutes. At this point we were devastated: we wanted to show that Bazel is fast, and five to eight minutes for a fully cached build is not what we expected.

The reason was very simple: we had a lot of third-party dependencies, and since every build ran on an ephemeral machine, every build had to re-download all the Maven binaries and Docker images. The solution is simple too: bring in two directories that are pre-populated with the Docker images and the Maven binaries. The repository cache is a feature of Bazel; the Docker cache is a feature that Wix contributed to the open-source rules_docker and its container-registry code. Once you have those folders ready before you start running Bazel, you save the time spent fetching those dependencies.

Another thing: if you dig into the Bazel documentation, you'll find many flags that you can add to your build to save many minutes. For instance, Builds without the Bytes: since we're building with remote execution, we can save time by not downloading intermediate results between actions. Or, if you find that you spend a lot of time computing the digests of your inputs, a multi-threaded digest flag can save you some time there.
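As a rough illustration, a CI-side .bazelrc along these lines would wire up the pre-populated repository cache and the kinds of flags mentioned here. The cache path is hypothetical, and several of these flag names were still experimental and have changed across Bazel versions, so treat this as a sketch rather than a recipe:

    # .bazelrc (sketch; check the flag names against your Bazel version)

    # Point Bazel at a directory pre-populated with the external Maven
    # binaries, so ephemeral CI machines don't re-fetch them every build.
    build --repository_cache=/var/cache/bazel/repository-cache

    # "Builds without the Bytes": with remote execution, don't download
    # intermediate action outputs to the CI machine.
    build --remote_download_minimal

    # Compute digests of action inputs with multiple threads.
    build --experimental_multi_threaded_digest

The pre-populated Docker image cache mentioned above lives on the rules_docker side rather than behind a Bazel flag.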
On that point, I think it was very beneficial for Wix to be a very active member of the community. Wix is a co-maintainer of rules_scala; we make sure that we're very much on top of any issue, any new feature that's about to happen, and whenever there's a new release candidate we make sure to test it against all of our repositories. I really think that if you're planning to use Bazel, become a Bazel activist; it can really work well for you. So number five is: optimize for speed.

And we've come to our last one. Number six talks about migrating a poly-repo, because migrating a single repo is one story, but migrating a poly-repo is a whole different story. It is possible, though, and let me tell you that story. When you read about Bazel usage, you read that it's really good to use Bazel with a monorepo, right? Put all of your code in a single repository: the developers can find everything in a single place, a single Bazel execution will test everything, and you can work with a single CI job, or very few. And we tried it out. We tried to put all of Wix's back-end code into a single git repository, and we got a hard reject from git: just running git status on the repository took us 15 minutes. We know that it's possible, we know that Twitter does it and that Microsoft has a solution for this, but when we started, all of this was very much premature, and, correct me if I'm wrong, I think that even today it's not possible to do on GitHub. So we couldn't do that with git. Then we thought, okay, maybe we'll just move off git, maybe we'll do a monorepo on a different solution, but we got a hard reject from the developers, saying: hey, you're changing too much already, the build tool, the IDE experience, no way you're also going to change git.

So really our only option was to continue working with a lot of repos, with Bazel on git. But of course we didn't want to keep the 1,000-repository model; we wanted to have as much code as possible in a single repository, so that a single Bazel invocation could test as much as possible. After some friendly internal debates (yeah, I want to be with him, no, I want to be with her) we managed to consolidate 1,000 repos into about 50 repositories, and our goal was to work in a virtual monorepo mode. Let me explain what that means. I don't know if all of you know how to read Bazel targets, so let me help you: this is a Java library that has two dependencies. The first dependency is a local dependency; it starts with //, which means that it comes from the same repository. The second is an external source dependency; it starts with @framework, and this @framework name is defined at the workspace level. You can see that @framework is defined as a git repository, with a git URL and a commit. The idea of a virtual monorepo is to have this commit move around automatically.

How did we implement it? Last year at BazelCon there was a talk, again by Ittai Zeidman, that covers our implementation, but in a nutshell: we have a server called the virtual monorepo server that magically knows all of Wix's repositories and their latest HEAD commits, and we have a Bazel wrapper that talks to this server and, before the build begins, generates an external-repositories .bzl file that our WORKSPACE file just loads.
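A minimal sketch of the mechanism being described, with hypothetical file, repository and URL names (the talk doesn't show Wix's actual files): a generated .bzl file pins every sibling repository to a commit, and a static WORKSPACE just loads it.

    # external_repos.bzl -- regenerated before every build by the Bazel
    # wrapper, after asking the virtual-monorepo server for the latest
    # HEAD commit of each Wix repository.
    load("@bazel_tools//tools/build_defs/repo:git.bzl", "git_repository")

    def wix_repositories():
        git_repository(
            name = "framework",
            remote = "git@github.com:example/framework.git",
            commit = "8f0c3a9...",  # pinned HEAD, moved automatically
        )
        # ... one entry per consolidated repository ...

    # WORKSPACE -- static, never regenerated:
    load("//:external_repos.bzl", "wix_repositories")
    wix_repositories()

The indirection keeps the WORKSPACE stable while the generated file carries all the volatility.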
It's very simple to understand, and it's super simplified here. Our CI build regenerates this file with every build, so every build runs with the latest version of everything, and the developers can regenerate the file manually, or whenever they switch a branch.

So back to our story. We didn't really have a single migration with Exodus, we had over 50 migrations with Exodus, and that surfaced another problem. Like I described at the beginning, part of the migration is to extract the external dependencies, all those Maven modules that are not part of the same repo, and treat them as external binary dependencies. This includes common dependencies that you all know, Guava, Commons, JUnit, but it also included internal libraries consumed as snapshot dependencies. So we had all of those labels that represent third-party binary dependencies, like @com_google_guava, but we also had all of those labels that represented what we call second-party dependencies, which were snapshot dependencies. And this was actually good, because it allowed each repo to migrate without depending on the migration of any other repository, and then to onboard the new CI pipeline, get faster builds, get faster feedback.

But the problem was that even if my repo had already migrated to Bazel, I had to keep my Maven build green so that external consumers could still consume my snapshot dependencies; the Bazel builds were dependent on the Maven builds. And on CI every build got a different set of snapshot dependencies, because that's how the snapshot mechanism works with Maven; locally, since Bazel doesn't really have the concept of a snapshot, I would download the snapshot the first time I ran a build and just continue using the same library forever, until I cleaned up those snapshots and forced Bazel to re-download the latest ones. We provided our developers a custom script that would clean all the snapshots, and it created a lot of confusion. It was really not a good situation to be in.

We had to wait until all repositories had gone through the migration in what we called isolation mode before we could start dealing with all of those snapshot dependencies. We wanted to break the connection between our Bazel builds and our Maven builds, and this is basically what we wanted to happen: to take all of those second-party labels that map to snapshot dependencies and change them to external source dependencies, so @framework//some/package. We called it moving from isolation mode to social mode.

One solution is to do exactly that: just go over all of the BUILD files and change those labels according to the mapping we already knew. But this involved touching a lot of BUILD files, it was very complex, and if anything went wrong at the end it would be very hard to revert. So eventually we came up with a cleverer solution. We used the fact that @com_wix_framework is just a logical name defined at the workspace level: we knew we could map this name to an external binary rule that downloads the snapshot, but we could also map it to a different rule. So we wrote our own custom rule that does a very simple thing: it just exports the corresponding external source dependencies. We didn't have to change any BUILD files; it was only a matter of choosing which kind of rule we define at the workspace level.
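The talk doesn't show the actual rule, but the idea could look roughly like the sketch below: the same external-repository name is bound at the workspace level either to a rule that fetches the published snapshot binary (isolation mode) or to a tiny generated repository whose targets just forward to the source targets in @framework (social mode). All names and coordinates here are hypothetical.

    # second_party.bzl (sketch) -- one possible shape of the exporting rule.
    def _source_export_impl(repository_ctx):
        # The generated repo's only job is to forward each known label
        # to the real source target in the @framework git repository.
        repository_ctx.file("BUILD.bazel", """
    alias(
        name = "some_library",
        actual = "@framework//some/package:some_library",
        visibility = ["//visibility:public"],
    )
    """)

    source_export = repository_rule(implementation = _source_export_impl)

    # Workspace-level choice: same name, different rule behind it.
    def com_wix_framework(social_mode):
        if social_mode:
            source_export(name = "com_wix_framework")
        else:
            # Isolation mode: keep consuming the published snapshot.
            # (maven_jar was current in 2019; newer setups would use
            # jvm_maven_import_external or rules_jvm_external instead.)
            native.maven_jar(
                name = "com_wix_framework",
                artifact = "com.wix:framework:1.2.3-SNAPSHOT",
            )

Because BUILD files only ever reference @com_wix_framework//..., flipping a repo between isolation mode and social mode becomes a workspace-level switch, and reverting is just as cheap.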
For each of the second-party labels we had this decision tree: is the current repo part of social mode, part of the virtual monorepo, and does this module come from a different repo that is also part of social mode? Only if the answers to both of those questions were yes did we load the external source dependency; otherwise we kept using the snapshot dependency. And then, with a very small team, like three developers working for three months, we were able to onboard repo by repo onto social mode, onto the virtual monorepo. The process was quite transparent to our developers; they didn't really notice that they were becoming part of social mode. The work included aligning third-party dependencies between all of Wix's repositories and adding some runtime dependencies; sometimes we found out that we needed to define some targets as test-only. But it was all a discrete set of issues that we were able to solve, and today all of Wix's repositories are part of the virtual monorepo, so we don't depend on snapshots anymore. And this is basically the story of migrating Wix's poly-repo.

So I bet you're all asking: where are we now? I feel like we got to a safe haven. All of Wix's repositories are part of social mode and we don't have dependencies on snapshots; all of our deployables are now Bazel-based, and all of the developers switched their deployables to come from Bazel. We finally announced the sunsetting of the old CI pipeline, so no more TeamCity, no more Maven, and we actually started deleting all the pom files. All of the developers work with Bazel, and they're quite happy with the build times. We still have some work to do: we can still make local development better, we can still make the builds a lot faster, and we still need to work on the process of updating third-party dependencies and updating our rule sets. But finally, after steering through choppy water, I feel like we're in a safe haven and we can start walking toward the golden castle of the perfect build system. So those were the six lessons that we learned at Wix from migrating to Bazel. I really hope that some of this can be helpful for you too. Thank you very much. [Music]

Thank you, Or, that was great, I really appreciate it. Questions for Or? Let's line up at the mics. Thanks for making the flight all the way from Tel Aviv, and to everybody who flew in from around the globe. Let's start down here.

Hey, I'm Rob Annabel, I work for Improbable in London. I would like to hear about the measurement statistics you were showing. You're showing a dramatic increase in stability. We have the same statistic, but it's very misleading, because if you have any transitive dependency that is rarely built, it always gets cached, and if it has flaky behavior, then only the top-level targets will actually show you that these transitive dependencies are flaky. I was wondering whether you have any thoughts on this, or whether you see this as well.

So it depends. If you're talking about second-party dependencies, dependencies between the different repos, then for those we also included a pre-check mechanism to check that you don't break other repositories.
What I mean specifically is: if you have a badly cached system, you're likely to rerun the exact same code multiple times, and then you have a statistic on the number of failing tests. If you don't change that code frequently, you will hit the cache 95% of the time and run a given test once per month, so you won't see that that code is actually only flakily passing, because as soon as you have a passing test you will never run the test again. I was just wondering whether you've encountered that or thought about it.

Not really. Regarding flaky tests, we do run a test twice more if it fails, and then we just declare it as flaky, and we have a mechanism to deal with flaky tests. But a test that is flaky yet passing, that's a different problem, and I think handling it is part of the work that we still need to do in order to make the system better.
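For reference, Bazel has built-in support for the retry-and-mark-flaky behavior described in this answer. Whether Wix uses these exact knobs or its own tooling isn't stated in the talk, so this is just an illustrative sketch:

    # .bazelrc: give failing tests up to 3 attempts in total; a test that
    # fails and then passes is reported as FLAKY rather than FAILED.
    test --flaky_test_attempts=3

    # BUILD: alternatively, mark a known-flaky test so Bazel retries
    # just that target automatically.
    java_test(
        name = "FlakyIntegrationTest",
        srcs = ["FlakyIntegrationTest.java"],
        flaky = True,
    )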
Thanks. A question from VMware: do you have any pointers about the dev-experience issues that you ran into, and what you learned, basically with IntelliJ and other tools?

So, basically, go to Ittai Zeidman's talk tomorrow. But, I mean, we had issues with IntelliJ, and we had issues with adding new dependencies: since we moved from Maven granularity to fine-grained granularity, we had to deal with a lot more dependencies. Whenever someone extracted a class to a new folder, even if it was inside the old Maven module's directory, they had to generate a BUILD file and change dependencies. All of this became very complex, and we had to create tooling for the developers to make it a lot easier.

And for the dependencies, you had external tooling to make sure the versions don't get into trouble? Yes.

Thanks. Heidelberg, from Capital One. You mentioned some of the benefits of the remote cache. We've done some initial testing and are seeing that there's a network penalty, especially for infrastructure that's in the cloud. Can you talk about how you optimize for things that build quickly locally versus getting a remote cache hit?

That's a great question. For local developers we established a cache that is only for developers; since they were using Macs and local builds, we found a way to segregate each group of developers according to the tooling, the versions of the tooling they were using. Also, we found that for some targets, some actions, the cache doesn't really work well, so we found a way to exclude those actions from the cache, and they are built locally. For CI builds we still use a cache for everything; we need this cache because we also query it at the end of the build in order to extract our deployables.

And is the remote cache for CI fairly local from a networking perspective? Yes, it's very local. Thank you.

Hi Or, my name is Kyle, from LinkedIn. My question is about lesson four on your slide, where you had parallel CI pipelines. Was it possible for you to do binary comparisons of the outputs of the two CI pipelines in order to validate that Bazel was working correctly, or was there something about the nature of your deployables or built artifacts that didn't allow you to do that?

So, we actually wrote a custom Bazel rule to generate the same deployable as Maven did; we had a custom assembly for Maven, and we had to write a custom rule for Bazel. It still didn't generate exactly the same binary, it was in a different structure with more jars, because it's more fine-grained; but there were cases where we did need to compare at the class level, so we extracted all the jars and made sure that all the classes were the same. Thank you.

Hi, Kyle Cordes. Okay, so before this change, as new developers were onboarded, you know, Java developers, ninety percent plus were going to be familiar with Maven on their first day from their previous work. Right now, if you hire random Java developers, very few are familiar with Bazel. How is your onboarding going for new developers after this change?

Amazing question. At Wix we have a process we call "nothing to prod"; it's basically the documentation of how to write your first service from scratch and how to work with the CI system. We completely changed this process in order to support Bazel development, and for a new developer to understand what it's like to work with Bazel. It's not perfect, but it's a lot better than when we had just moved to Bazel, because back then we didn't think about this guide at all; I think the guide can still be improved, but it really helps us. One challenge that we have in the Bazel community is that there isn't a lot of information on the internet about Bazel, so we really have to produce a lot of internal information: we have our own private Stack Overflow with tons of questions about how to use Bazel, how to use Bazel with IntelliJ, and how to understand the builds on CI, and this is how we support it.

All right, any more questions for Or? Well, this was fantastic, a great talk, I really liked it, and I loved your slides. Let's have a big hand for Or. Thanks so much.
Info
Channel: Google Open Source
Views: 1,323
Rating: 4.8095236 out of 5
Keywords: type: Conference Talk (Full production); pr_pr: Bazel; purpose: Educate
Id: BYg3fDFrTz8
Length: 40min 21sec (2421 seconds)
Published: Thu Jan 16 2020