Michael Sarahan - Making packages and packaging "just work"

Captions
I am one of the people at Continuum who builds packages, and I also work on the tooling at Continuum to build packages. I was going to come here and tell you all about the stuff that goes wrong with packaging, but in talking with some of you in the hallway it became clear that it would be better if I just showed you how to do packaging from the get-go, and not show you all the squirrely stuff that can happen that makes my life fun. So we're going to talk about the anatomy of a package and how to start one from nothing, then I'm going to show you conda-build, which is the tool that I work on to help make some of this easier, and if we have time I'll show you a little bit about what we're doing at Continuum to try to automate our builds and make them scalable. We currently provide about 700 packages, but PyPI has about 80,000 packages. Of course, that's a really long tail of packages that don't get used, but keeping up with the ecosystem is a hard problem and it's something we're working on. And we won't get to the squirrely stuff, I'm sorry.

So, you want to make a package. Let's talk first about what exactly a package is, because I'm going to call it two different things. Most of the time when I'm talking about a package, it's some kind of file that you ship around that has stuff you want people to be able to install. Some kinds of packages, like Python packages, are archive files, but they have formalisms to say where the files should go so that Python knows what to do with them. There's also the term "Python package", meaning a folder that has Python modules in it, and that name overloading is unfortunate, so I'll try to be clear here when I'm talking about package types as files. If you deal with package files you might be familiar with sdists or wheels; those are things you'll see on PyPI. I deal mostly in conda packages, which are .tar.bz2 files that have a particular layout.

To produce one of those — well, if you're dealing with Python — you always start with a Python package. So, who here has made their own Python package? A few people. Did you use a standard layout you found somewhere, or did you just kind of wing it? A tutorial — tutorials are great. This might be even better than a tutorial: cookiecutter is a template generator. It's a pip-installable library, and once you install it you can run cookiecutter for any given kind of project you're going to start. So if I start a quick Docker container that just has Miniconda installed, I can say `pip install cookiecutter`, and then run cookiecutter against a particular repo I have here. We've made this template for conda projects, and it goes through an interactive dialogue — is this big enough? — where you say: my name is Michael, email address, GitHub username. It's filling in the blanks in that package template, so that what you get out is this "llama" folder I've created, which has the nice standard layout you would get from a tutorial. If you take a look at what's in setup.py, it's set you up with a generic thing you can build on. If you have existing code this is still useful; it shows you the model, and you can copy and paste your stuff into it. To actually build a package with this you can just say `pip wheel .`, and that builds a wheel for my package. Similarly, because this particular cookiecutter template of ours includes a conda recipe, what's in that conda.recipe folder is a meta.yaml file, which is the input for conda-build to build a conda package, so I can say `conda build conda.recipe`.
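The sequence just described can be sketched as shell commands — the template URL is a placeholder, since the exact repo wasn't named; substitute whatever cookiecutter conda template you want:

```shell
pip install cookiecutter

# Placeholder URL: point this at a cookiecutter template that includes a conda recipe
cookiecutter https://github.com/<org>/<cookiecutter-conda-template>

cd llama          # the project name entered in the interactive dialogue
pip wheel .       # build a wheel from the generated setup.py

conda install conda-build
conda build conda.recipe   # build a conda package from the recipe folder
```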
It tells me I don't have conda-build installed — not a problem: `conda install conda-build`. Now, what you'll notice is that conda-build has a whole boatload more output, and that's because conda-build is a more complicated tool. What it's doing is leveraging conda's ability to create environments: it creates the build environment for the package, and it can create any number of different configurations. It does that, runs the build, then detects what files have changed in its build prefix — where it created the environment — and those are what it considers files to package. After that it runs a test phase, where it installs the package it created and makes sure it works for you. Compared to wheels, conda recipes and conda packages are totally language agnostic; it's all about this Unix-style filesystem layout. The output package here goes to the miniconda/conda-bld/linux-64 folder, and if we take a look at what's actually in that file, we've got a bunch of metadata that tells conda what's going on in the file and how to deal with it, we've got the Python files that got recorded and precompiled, and here we've got the entry point. You can see this is very imitative of something you would see as /etc, /lib, /usr, whatever — that's how it works. The fundamental key to a conda package is this: conda-build creates the build prefix, and as long as you move stuff where you want it within that build prefix, it gets packaged. So if you've ever fought setuptools to include files outside of Python — setuptools is great at dealing with Python files, but dealing with data files has caused me great pain — in comparison, a conda recipe lets you simply move files into that prefix. So let me go back to my recipe.
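A minimal meta.yaml along the lines of what such a template generates might look like this — the name, version, and license here are illustrative, not the talk's actual file:

```yaml
package:
  name: llama
  version: "0.1.0"

source:
  path: ..

build:
  script: python setup.py install    # inline build instructions

requirements:
  build:
    - python
    - setuptools
  run:
    - python

test:
  imports:
    - llama
```

conda-build reads this file, creates a build environment from the `build` requirements, runs the script, and packages whatever lands in the build prefix.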
This meta.yaml does have the build instructions in it, but if I comment them out, I can also have a separate build.sh file that can then do anything. So we can say `echo llama`, and there are a few important predefined environment variables: `$PREFIX` always points to that build prefix — or, in the test environment, to the test prefix. This is all you need to do to actually install files into a conda package and have conda know what to do with them.

One other quick and easy way to create conda recipes: we've made skeleton generators. We know there are other repositories with boatloads of really great content on them, and we want to make it easy to bring those into the conda ecosystem. So I'm going to create a skeleton here for beautifulsoup4. I know it's on PyPI, so I say `conda skeleton pypi beautifulsoup4`. This goes and looks on PyPI, gets the metadata, reads the setup.py information, and comes up with a recipe for us. Here's our folder with the recipe in it: a build.sh file and a meta.yaml. The build.sh is a lot of the time really nothing more than `python setup.py install`, and if that's all you have, you can roll it into the meta.yaml, which is what I showed you before from our template. But you see it's gone to PyPI, recorded all that information, got you the md5, and it makes your life a whole lot easier to start from something on PyPI. We've got the same thing for CRAN and CPAN, and we just recently added support for RPMs if you need them. What we're moving towards more and more is having really low-level system packages: people had been moving towards building things like X11 themselves, and that's just really hard, and it ends up giving you really poor hardware acceleration.
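A sketch of what such a build.sh can look like — the file names are illustrative, but the mechanism is exactly as described: anything moved under `$PREFIX` gets packaged.

```shell
#!/bin/bash
# build.sh runs inside the build environment that conda-build creates.
# $PREFIX points at the build prefix (or the test prefix during the test phase).

echo llama

# Install whatever you like into the prefix; conda-build packages it all —
# e.g. a data file that setuptools would make you fight for.
mkdir -p "$PREFIX/share/llama"
cp data.txt "$PREFIX/share/llama/"
```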
So what we're trying to move towards instead is to encourage people to rebuild the operating-system-level packages and use those as they see fit.

All right — what I've shown you is a gross oversimplification, and if there's ever been a part of programming with more edge cases, I haven't seen it yet. Every given package takes special love, especially when you have compiled content in there. Another big thing is: how do you keep all of your dependencies in mind, and how do you keep them compatible? You don't want to be too restrictive, and you also don't want to be too loose with your constraints. If you're too restrictive, you unnecessarily mean that other packages can't play nicely with you; if you're too loose, things can change out from under you and your software breaks. I'll show you a bit more about that later — we've added a new feature in our latest version of conda-build that's really exciting.

Once you have your own package, there are a lot of different places to put it. Everybody's probably familiar with PyPI — who knows about anaconda.org? A few people, yeah. Anaconda.org is Continuum's community repository: people get free channels, and you can put your own packages there as much as you want. Unlike PyPI, where somebody can lay down a claim to a package name and you get this namespace battle over who the actual owner of a package is, it doesn't matter on anaconda.org: you are you, anybody can install from your channel, and you manage what's there. I've found it very refreshing, and what has really been incredible to see is the community that has grown around anaconda.org. People have recognized how powerful that ability to have your own, really purpose-built channel is, and several communities of people have come together. They each have their own way of building packages, but they all end up uploading them to anaconda.org.
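The upload-and-install workflow around anaconda.org is a couple of commands with the anaconda-client tool. A sketch — the package file name, path, and channel are placeholders, not actual values from the talk:

```shell
conda install anaconda-client

# Upload a freshly built package to your own channel
# (path and file name are illustrative)
anaconda upload ~/miniconda3/conda-bld/linux-64/llama-0.1.0-py36_0.tar.bz2

# Anybody can then install from that channel
conda install -c <your-username> llama

# Search across all anaconda.org channels for a conda package
anaconda search -t conda llama
```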
Bioconda does all kinds of biology-related stuff — DNA sequencing and things I don't understand; Omnia is molecular dynamics; Menpo is computer vision. What this means is that even if Continuum doesn't have a package for you, you can say `anaconda search` for what you're looking for, and a lot of the time somebody on one of these other channels will have built it — that's pretty awesome. The biggest community by far is one called conda-forge, and this has been a unifying effort. They have a really nice CI setup where you submit a recipe, it tries to build a package, and when your PR is merged it becomes part of this giant aggregate repository of recipes. I would really strongly encourage you to check out their website, and if you're using conda and can't find a package, the first place to look is conda-forge. As I said, we've got about 700 packages in our repository; they've got 2,600, they're growing fast, and we're a part of them, to be honest — I participate in conda-forge, and we have two or three other people who help with it, so it's really nice to see it grow. One other kind of silly, fun trivia thing to show you: I'm pretty sure this is the organization on GitHub with the most repositories, because every single recipe becomes its own repository, so this organization has two thousand six hundred and forty-four plus a few repos on it. There's a consolidated view that gets automatically generated once in a while, called feedstocks — it's all corny and blacksmithing-themed — and these are all submodules; GitHub refuses to show you anywhere close to all of them. When we started seeing this, it was a fun day.

So, I mentioned that there are a lot of ways things can break. There are some things we can control as packagers, and a lot of stuff we can't.
When it comes to these three things — API, ABI, and constraints: the API is the actual code in a program, and people may change arguments, change return types, change input types — all kinds of stuff that's obviously going to break you. ABI is more subtle; it comes about due to things like what compiler you used, or what OS you were on when you built something. You get binary symbols that may or may not be supported on other people's computers. Who has tried to install TensorFlow from PyPI and had it not work? OK — this was your problem, because Google was evil when they published it. There is a standard for publishing wheels on PyPI for Linux, and that standard says you have to build your compatibility back to an old OS, CentOS 5. They did not do that: they built on a newer system and did not make it compatible, but they labeled it manylinux1-compatible anyway. So give them a hard time if you see anybody.

Finally, constraints are both a way to make stuff work and a way to make it break, and they're the part we really have control over to help make sure things work. As far as ABI compatibility, it's boring — there's not a lot of secret to it. It's just a matter of laying down good standards for what your worker machines are going to be, and understanding the limits of those worker machines. The hard part is that it's really hard to make everybody happy: you've got some people in your organization for whom stability is the ultimate goal — "we have to stay on CentOS 5" — and other people saying, "I want to use C++14". How do you do both? It's hard, and the answer is not always good. On Windows it's really boring: we just follow upstream, and upstream says Python 2.7 is built with Visual Studio 2008, so that's what we do. As a result we can't support things that are done with C++11 for Python 2.7 on Windows, so in that case we sometimes patch packages to remove C++11.
There's been a really hotly debated idea to build Python 2.7 with a different compiler, but it's really hairy and there's not much support from upstream, so that doesn't seem like it's going to happen any time soon. macOS is boring also: you use the right SDK. When it comes to Fortran, we ship GCC packages that also include gfortran; that's just a little bit tricky because clang++ is not binary compatible with g++. Good times. Linux is finally the most interesting. We've gotten really far by building on CentOS 5 with the newest compiler that was available there — if you've used Anaconda over the past three or four years, this is what it's been built with. The manylinux people kind of reverse-engineered this, and it's also their standard; they build with a slightly newer compiler, which I'll talk about in a bit. But we really hit a lot of problems with this, because we're not getting any of the performance or security improvements that have come out of newer compilers. What you really want to do on Linux, in order to build a package that's as good in terms of compatibility and performance as it can be, is to build on as old a Linux distribution as you can tolerate. Where you cease to be able to tolerate it is where the kernel symbols have changed too much and you start to have trouble with stuff that builds against them — TensorFlow, for example: don't try to do it on CentOS 5, which is coincidentally why Google didn't try to do it on CentOS 5. For those of you following along, there's a fantastic website describing what you should do for this process, with the reasons why it works — a very, very good article. Really, what you want to be able to do is custom-build your own compiler, or pull a newer one in from somewhere else, and use it on an ancient OS. The manylinux guys have a Docker image that does exactly that: they use CentOS 5 together with the Red Hat devtoolset compiler.
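That manylinux build image is published by the PyPA and can be pulled directly; a sketch of using it (the image name is the one the PyPA published for manylinux1):

```shell
# CentOS 5 plus the Red Hat devtoolset compiler, as described above
docker pull quay.io/pypa/manylinux1_x86_64

# Start a shell in it with your source tree mounted at /io,
# and build with the devtoolset toolchain on the PATH
docker run -it --rm -v "$PWD":/io quay.io/pypa/manylinux1_x86_64 bash
```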
That devtoolset compiler is specifically patched a lot by Red Hat to maintain this backwards compatibility, and it's also what conda-forge uses. One of our experts, Ray Donnelly — who's famous for doing a lot of the MSYS2 project — just doesn't trust it, and so we're kind of wary of it. We don't have any hard evidence against it; he's just queasy about it. So we tried to custom-build a new GCC into a CentOS 5 Docker image. I did GCC 5.2, and it just had too many strange issues; I tried to make it multilib, so that the 64-bit compiler could compile 32-bit code, and nobody was really happy. Anyway, what we're doing now is making conda packages for the compilers. These are built with a tool called crosstool-NG, which is the absolute coolest tool ever for creating portable toolchains. It includes both the kernel headers and glibc — or musl libc, or any other libc you want — and because it encompasses all those things, it no longer matters what build platform you're on. That gives us tremendous freedom to use whatever build platform is convenient for any other reason. It comes at a cost: we need to ship libstdc++, and people have to use our libstdc++, because our compiler is so new. libstdc++ is totally backwards compatible, but if you build with too new a compiler and you have an ancient libstdc++, you're going to have binary symbols that aren't available. You can imagine this has caused a lot of headaches over the years, and it's been wonderful that our platform has been so standard for so long; but because people want to change and adapt to newer compilers, we have to figure out a way to solve this. It really comes down to: the information is there, you actually have to record it, and you've got to use it to make decisions. Our term for this is CDT, which stands for core dependency tree, and conda-build is going to start recording this kind of stuff:
all the information related to the compiler that was used for your system, all the stuff that's going to affect its bounds of compatibility, and then, more for informational purposes, things like the compiler flags. What conda will do with this — because it's embedded in the conda package itself — is let you specify to conda, or even have conda detect, what your system is compatible with. So if you say, "I'm on a CentOS 6 system, I cannot use anything built with a glibc newer than 2.12," conda can then make recommendations: this package is uninstallable because of that, or, you're actually running a newer OS and we can turn on these extra optimizations. It's going to give us a lot of flexibility to make things work better.

When it comes to you writing recipes, just a tiny bit of advice: avoid constraints as much as you can, and keep your recipes as general as you can. What I see a lot of the time, especially from people coming from the pip world, is requirements.txt files with exact pins. Exact pins are fine as long as you're creating an environment, but if you are installing two packages from two separate requirements files, and they each have different `==` constraints, what do you think happens? It should blow up, but it doesn't — pip has really strange rules, and it's kind of like the first one in wins.

So: we just released conda-build 3 this morning. This has been about a year-long effort to cram Jinja2 into as many corners of conda-build as I could, because what we allow through templating is for all of that pinning stuff to be done dynamically, later, and from external sources. Something like conda-forge can have a canonical copy of a recipe that doesn't define any versions at all, and then somebody else can define a separate outside file that maps in their own versions — their own CDT, or whatever you want to call it.
There is an IPython notebook I'd like to run through real quick with you to show you some of the new features. This is also a blog post that went out this morning — check Continuum's developer blog if you want to look at it. What this shows is conda rendering, which means conda-build takes that input meta.yaml file and does stuff to it: fills in templates, evaluates the selectors (which include or exclude particular lines). I'll show you the extra flexibility here. This is the Jinja template syntax for "here's a variable", and I've got my config file, which is a dead-simple YAML file with a list of values. What happens is conda-build substitutes those values into each of those template slots: here it's rendering what the actual output files would be, and here's what the actual recipes would be — you see it's done the pinning. Kind of cute. This works for Python versions, it works for NumPy versions, it works for anything. One thing I'll point out about conda-build is that it really tries to enhance reproducibility: when it renders a recipe to build it, it records exactly what was in the build environment, so that you're able to rebuild exactly that recipe based on what was there in the past, even if new versions of things have come out since. I'm going to skip this next bit — it just says the old way of doing things still works. The other interesting thing is that whatever you build now carries this new thing that looks a lot like a git hash — because for all intents and purposes it is kind of a git hash. It's formed from all kinds of important information about your recipe, and what it gives us is the ability to differentiate those variants on disk in a totally generic way. The old way was to say, oh, your variant is Python, and when you have Python, that gets added to the file name — something similar up here: you get py27, np-whatever.
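The variant mechanism described here can be sketched with an illustrative config file (these are not the notebook's actual contents). Each key in conda_build_config.yaml is a variable; each list entry is one variant:

```yaml
# conda_build_config.yaml -- a dead-simple YAML file of values
python:
  - 2.7
  - 3.6
libpng:
  - "1.2"
  - "1.5"
```

In meta.yaml, a requirement written with Jinja syntax such as `- libpng {{ libpng }}` then renders once per listed value, and conda-build builds one package per combination of variants.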
That does not scale: what if you have GDAL 1 and GDAL 2? That's what the hashes are for. And there's a way, which I skipped over here, to have it tell you what actually went into that hash, because it's pretty obscure — "h-whatever" doesn't tell you anything about what went into it. This is kind of primitive at this point, but it's the kind of thing where we can probably build a library query tool to say: tell me all the packages that meet this condition.

One other thing: we've enabled pinning to be relative. What you'd see in a requirements.txt file is just static text; what you see now in conda-build is: what version of libpng did I have at build time, and let me express a constraint based on that version. The way this works is there are default arguments: if I just say `pin_compatible('libpng')`, it looks for libpng up in the build requirements, and here it's found version 1.6.27. The default pinning expression is: as many places of precision as you have in the version number for the lower bound, and the next major version for the upper bound. This is totally customizable — you can change `max_pin` to be, say, two places, and then it increments that last place instead. You can also drop either bound, or specify exact values: if you specify either of them as None, it leaves that constraint out completely, so we'd have no min pin; and if we say the upper bound is 5, that totally overrides the relative pinning.

One other thing that's really key to conda-build 3 is that we want to enable people to use dynamic compilers and dynamic compiler settings. Move away from the thought of "I need a compiler configured on my system" — you don't; you just need to tell conda-build how to get a compiler, or how to set one up. If you're using these MSVC things, conda-build has in the past run the activate script for you.
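The default pinning rule described here — full precision on the lower bound, increment the place(s) named by `max_pin` for the upper bound — can be illustrated with a tiny re-implementation. This is not conda-build's actual code, just a sketch of the semantics:

```python
def pin_expression(version: str, max_pin: str = "x") -> str:
    """Sketch of conda-build's default relative pinning.

    Lower bound keeps every place of precision in the build-time version;
    upper bound increments the last place covered by `max_pin`
    ("x" = one place, "x.x" = two places, ...).
    """
    places = max_pin.count("x")
    upper = version.split(".")[:places]
    upper[-1] = str(int(upper[-1]) + 1)
    return ">={},<{}".format(version, ".".join(upper))

# pin_compatible('libpng') with libpng 1.6.27 in the build requirements:
print(pin_expression("1.6.27"))         # >=1.6.27,<2
print(pin_expression("1.6.27", "x.x"))  # >=1.6.27,<1.7
```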
It'll still do that if you want, but we're moving towards this: you will make, or use, a package that declaratively defines your compiler settings. This is fantastic — Intel, for example, is really excited; they want to use the Intel compiler to build all this stuff, but they don't want to hack it into conda-build. Now they don't have to: they just use recipes that are written against this compiler metadata, configure in their conda-build config what the exact compiler is, and that's it. Tremendous flexibility — and there you see it rendered with different compilers.

One other kind of cute thing, and then I'll be done here: people oftentimes forget, or don't know, what their actual dependencies are. What conda-build 3 frees people up to do is move the pinning — the actual addition of the constraints — to the upstream package creators, rather than the downstream person using the package. What we're hoping this irons out is a lot of missing dependencies on shared objects at runtime, especially libstdc++, because people are really used to that one being on the system. Anyway: we've got a recipe doing this thing called run_exports, and it is exporting something compatible with whatever we had at build time. We also have a recipe that's just going to use that. We build the first one, keep it here locally, and when we see what would be fully rendered for the second one, it has inserted that new bzip2 requirement, based on the used package having run_exports. I'm really hopeful that will turn out well; if it blows up in our faces, there is a way to turn it off. So please play with it — I'd be very curious to know what you think.

I guess I have a little bit of time for this. We have not had a CI system for packaging. We script everything, of course, but it requires us to SSH or
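A sketch of run_exports in a library's meta.yaml — the package name, version, and pinning expression are illustrative:

```yaml
# meta.yaml for a library package, e.g. bzip2
package:
  name: bzip2
  version: "1.0.6"

build:
  run_exports:
    # Any downstream recipe that lists bzip2 as a build requirement
    # automatically gets a compatible runtime constraint inserted
    # at render time -- the upstream author decides the pin.
    - bzip2 >=1.0.6,<2
```

The downstream recipe only says `- bzip2` in its build requirements; the runtime constraint appears in its rendered output without the downstream author writing it.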
VNC or RDP or whatever into some build worker and build stuff, and that manual tedium is just terrible. So we set out to create a CI system. It needed to be cross-platform, we want people to be able to volunteer machines as workers, and — since we've got hundreds or thousands of recipes — we can only build what changed; building the whole system all the time is totally intractable, and that's what killed earlier efforts. We experimented with a lot of CI tools and settled on Concourse CI, which was started by Pivotal. The really great thing about it is that it's totally based on YAML files, and you can dynamically create pipelines from pipelines. So our process: somebody commits recipe changes, then we have a job on the CI service — this is kind of a static job — that computes what the build system should do, which recipes should be built, and submits a new pipeline. This is what that looks like in Concourse: these are jobs, nothing fancy; it gets the stuff from up above, runs the examination process, rsyncs the recipes elsewhere, and creates these really cool-looking build trees. This is what goes into Miniconda — thank you — just the basic fundamentals, but for Python 2.7 and 3.6, for just CentOS. As this thing goes along, things light up like Christmas trees: green when they're good to go, red when they're bad. It's a lot of fun to watch, and it's also amazing because each of these things is a totally separate process and massively scalable — each of those nodes can do a job — so we're hoping that's a major boon for us. This is also much better at doing testing: we've got a DAG (a directed acyclic graph), and we can navigate that DAG to do much better testing. So we can say: whenever we rebuild numpy, make sure you also test scikit-image, scikit-learn, whatever — make sure the core stuff
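A rough sketch of what one generated Concourse job could look like — the resource names, repository URL, and image are illustrative, not Continuum's actual pipeline:

```yaml
resources:
  - name: recipes
    type: git
    source:
      uri: https://github.com/example/recipes.git   # placeholder repo

jobs:
  - name: build-numpy
    plan:
      - get: recipes
        trigger: true            # fire when recipe changes land
      - task: conda-build
        config:
          platform: linux
          image_resource:
            type: docker-image
            source: {repository: continuumio/miniconda3}
          inputs:
            - name: recipes
          run:
            path: sh
            args: [-exc, "conda install -y conda-build && conda build recipes/numpy"]
```

Because each job is an isolated, containerized task, any worker node can pick it up — which is where the "massively scalable" property comes from.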
doesn't break. So, thanks — this is my team, and they're great — and thank you all for listening.

[Applause]

Q: I'm a bit new to conda — I've been following it for the last three years or so, but haven't actually been using it until this conference. The really compelling case for it appears to be dealing with binary and Python-version compatibility across systems, which is why I was attending this talk — and I came away a little bit terrified from what I just saw. Is there anything you can say that would help us understand the issues around binary compatibility going forward?

A: Tell me what terrifies you — I've certainly said that it's a lot of awful work, but there's never been a better time to dig into it, because conda-forge especially is this monumental source of knowledge, with really, really great volunteers who will help you sort through things. If you submit a recipe to conda-forge, you will absolutely end up having it work, and learning a lot.

Q: OK, so basically it's the community — this really dynamic, good community that we've known and loved for years — that's going to help me use the system to deal with, well, the technical issue of binary compatibility.

A: That issue is not new at all — it has not changed. The only thing making it harder is that we're at a crossroads: C++11 and 14 have improved the situation enormously, but they require modernization of your stack, and there are rivers you've got to cross.

Q: Fair enough, thank you.

A: I get my exercise this way.

Q: A question about missing dependencies. I saw there's this clever idea of picking up whatever is on the machine at build time and then generating your dependency file based on that. It wasn't clear to me how you knew which dependencies actually mattered and which were noise.

A: For reproducibility, I would argue that they all matter — you want to be able to recreate the same package at a later date. I think it's an intractable problem to figure out which ones really matter — am I actually linking against this, or is it a trivial thing that Python is bringing in? I don't think it's worth the effort.

Q: Got it — so there's no static analysis or anything; you're literally saying anything on the build machine could affect the final product, so let's just take it all.

A: Not quite — it's not the build machine, it is the build prefix itself. It's saying: I asked conda, "here are my specs, what are you going to install for these specs for the build environment?" If you ever get system stuff in your package, you're going to have a bad time, and that's one thing conda is really pretty good at: system isolation and keeping your packages pure. That means you're more confident they're going to work on the other end.
Info
Channel: PyData
Views: 7,021
Id: Kamld5Z-xx0
Length: 40min 28sec (2428 seconds)
Published: Mon Jul 24 2017