Elana Hashman - The Black Magic of Python Wheels - PyCon 2019

Reddit Comments

Resolving dependencies is always a challenge in any language ecosystem:

Mixing naive apt dependencies with pip dependencies for source packaging is so problematic for portability.

Living inside Conda solves so many problems, until you need a dependency that exists only on pip with source-only packaging, or yet another framework like Pipenv or tox.

2 points · u/hkanything · Dec 03 2020
Captions
OK, thank you for bearing with us. Next we have Elana Hashman with The Black Magic of Python Wheels.

Hello, PyCon, can you hear me? Great. OK, sorry for the initial technical difficulties. I'm Elana Hashman; welcome to my talk, The Black Magic of Python Wheels. If you want to tweet, my handle is @ehashdn.

So today I'm going to put on my Python Packaging Authority hat and tell you a little about Linux wheels, or Python wheels particularly for the Linux platform. Now, you may be thinking, "Oh, Elana, I'm not so sure about this black magic thing," to which I'll say: sometimes the greater good requires a bit of sacrifice. I'll admit I had a lot of doubts myself when I was doing the research for this talk. There used to be so many problems with Python eggs; obtaining reagents for your Python potions is rough. Now, if you are a witch, then I'm delighted to welcome you here at the hour of gathering. The topic today is the Python native extension and its distribution, a most curious spell.

Before we jump into things, let's do a quick survey of the coven. How many of you are familiar with Python packaging and distribution? Great. How many of you have heard of ELF, or Executable and Linkable Format, files? Wow, great. How many of you have heard of dynamic linking? So, a lot of people. Stretch goal: how many of you have heard of application binary interfaces or symbol versioning? Ah, fewer hands. Great, well, you're in the right place, and don't worry too much if some of these concepts aren't familiar to you. By the end of this talk I'm hoping you'll come away with a better understanding of modern Python packaging and learn how each of these concepts works together under the hood, unlocking some of the witchcraft behind how computers really work.

So what are we going to cover? First, a very brief history of Python packaging formats and an overview of the wheel. Next, the motivation behind native extensions and why binary wheels are useful. We'll spend the bulk of the talk discussing how native extensions even work, really, including a discussion of how the Python packaging tools manylinux and auditwheel fit into this picture. And last, I'll close with how you might get involved in some of this wheel building yourself, if you are interested.

As I mentioned earlier, before the Python wheel there was the egg. Now, eggs served the community as best they could, but they had some problems. They were organically adopted without the guidance of a PEP, and hence there were many conflicting ways to do the same things. Without a standard, there was nothing to coordinate the thousands of Python developers trying to ship software to end users, and incompatibility was inevitable. Eggs were also designed to be directly imported, and as such they could contain some, or even exclusively, compiled .pyc files, which might not actually be compatible with the version of Python you have installed.

So wheels were designed by the Python community, via a PEP, to provide that standard and implement many of the existing PEPs that eggs did not comply with. Among other things, wheels provide much better metadata than eggs and are designed to be more portable, as they are primarily a means of distribution, not importing. Python wheels cannot contain .pyc files, although they can contain other pre-compiled resources.

Now, there are three kinds of wheels worth mentioning. First, pure wheels: these wheels consist of just Python code. They may target a specific version of Python, such as Python 3.7. Next, universal wheels, which are a special kind of pure wheel: they are Python 2 and 3 compatible. For these first two kinds of wheels, I've got great news: they're not much different than building eggs. You just have to run the following two commands and you get a wheel.
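The transcript doesn't capture the slide's exact commands, so here is a minimal sketch of what such a pure-wheel build might look like, assuming setuptools; the package name "example_pkg" is hypothetical:

    # A minimal sketch of a setup.py for a pure wheel, assuming
    # setuptools; the package name "example_pkg" is hypothetical.
    from setuptools import setup, find_packages

    setup(
        name="example_pkg",
        version="0.1.0",
        packages=find_packages(),
        # For a universal (Python 2 and 3 compatible) wheel, you would
        # also set "universal = 1" under [bdist_wheel] in setup.cfg.
    )

    # The "two commands" are then likely along the lines of:
    #     pip install wheel
    #     python setup.py bdist_wheel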
Now, unfortunately, not all Python packages fall into one of these two categories, or else the talk would be over and I could spare you from the dark knowledge of this other kind of wheel. This last kind is called an extension wheel, which contains a Python extension. How many of you have heard of a Python extension? Yeah. Well, rather than trying to explain to you what a Python extension is, I think the best way to introduce them is to tell you about this rite of passage many early Python programmers go through, myself included, and through this example we'll see how wheels make it easier to roll out Python.

We're going to pretend that I am a newbie Python developer and I'm trying to get my Python environment set up to work on this cool web app. So here I am running pip install -r requirements.txt, and one of my requirements is cryptography, because I know security is important and I want to use SSL. So here I go: I'm going to install cryptography and get to work, and... oh, the installation fails because I'm missing Python.h. What's Python.h? Why do I have a #include in my error here? I have no idea; remember, I'm a new Python programmer. But I do a little hunting on Stack Overflow and learn I need to install the python-dev package. So OK, let's do that. Problem solved, right? Well, actually, it turns out I'm missing ffi.h as well, whatever that is. So I do another search on my favorite search engine and find that I need to install libffi-dev, and I should be good to go, right? Who thinks this is going to work? Ah, well, the nays have it: now I'm missing OpenSSL. I still don't know what the heck is going on, but I figure, hey, every time I install a thing I get a little further; why not install another thing? How many more things could there possibly be? Do you think this is going to work now? Yay, it finally worked this time, and it took 16 seconds overall to install, not including all of my frantic Stack Overflow searches.

So that's kind of slow. An extra 16 seconds on every fresh build is maybe a lot of time I could be shaving off of my CI runs. Who thinks we can do better? Oh, some people; good enthusiasm. What if I told you that the solution is a prebuilt extension wheel? Well, let's try it out. Here I'm going to install a pre-built wheel, avoiding all of those system package installations, and on top of that I shaved 15 seconds off the install time. What sort of black magic is this? What's the catch? What's going on here?

This accomplishment is a really big deal for the Python ecosystem. Historically, it was very painful and user-unfriendly to pip install Python extensions, as demonstrated by our earlier example. Now, the Conda package format was developed to address this gap, particularly for the scientific Python ecosystem, and it's done a great job of that; Conda's very popular. So why bother with, shall we say, reinventing the wheel here? Well, Conda, like eggs, was not adopted via a PEP and suffers from similar incompatibility issues. As well, the Conda package format can be used to package anything, not just Python code, and hence isn't supported by the Python Packaging Authority or PyPI. And Conda packages are only compatible with Conda environments, meaning that if you want to use just one Conda Python package, the rest of your Python packages must be Conda too; you can't mix and match. Hence the need for the Python Packaging Authority to address this gap. Conda is a great solution and works for a lot of people, but it doesn't work with pip and PyPI. And while Conda packages don't work with non-Conda environments, you can install wheels in a Conda environment, which is great news for Conda users.
So Python extension wheels allow end users to, for example, safely pip install numpy, whereas a few years ago that would have been very inadvisable. I'd say binary wheels are an amazing change for the better.

So what is a Python extension? Well, it's short for Python native extension, and "native" means that this code was built specifically for my operating system and my version of Python; my OS is its home. If I try to run this cryptography wheel that I downloaded for my Linux machine on my Windows laptop, that won't work. If I try to run this wheel against Python 3.4 and it was built for 3.5, that won't work either. So if I want to distribute pre-compiled versions of my Python native extension, I have to build a version that covers every single operating system, Python version, and CPU type combination. The "extension" bit refers to the fact that we're extending Python's functionality with some code that wasn't actually written in Python.

You may have guessed by now that our example package, cryptography, is a Python native extension. If we look inside cryptography's code, indeed we see that a small component of it is written in C, and many of the files have #include statements declaring C dependencies. The setup.py file inside cryptography indicates that it provides a CFFI extension package; CFFI is a library used for declaring and interacting with C foreign function interfaces.

So what does all this mean? It turns out a lot of Python code depends on code that is not Python at all. C is the lingua franca of the modern operating system, for better or for worse. Now, not all Python extensions depend on C libraries; for example, many of the scientific Python libraries depend on gfortran under the hood. But more or less everything has a dependency on the C runtime, including gfortran (I checked). With C compatibility we can harness the power of thousands of existing libraries without having to re-implement them in Python, which may be very time-consuming or even impossible. Python native extensions allow us to harness this power, but now we're not just in the business of managing Python; we're in the business of managing C code as well, and that's where things start to get messy.
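As a rough illustration of what a C foreign function interface looks like from Python, here is a sketch using CFFI's in-line ABI mode. This is not cryptography's actual build setup, just an illustration; it assumes cffi is installed and a Unix-like system where dlopen(None) exposes the C standard library:

    # A sketch of calling a C function from Python via CFFI, the library
    # cryptography builds on. Illustrative only, not cryptography's
    # actual setup; assumes cffi is installed on a Unix-like system.
    from cffi import FFI

    ffi = FFI()
    ffi.cdef("int puts(const char *s);")  # declare the C signature we need
    C = ffi.dlopen(None)                  # load the C standard library
    C.puts(b"hello from C, called from Python")

Conveniently, puts is exactly the symbol that comes up again in the next section.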
In order to understand how extension wheels work, we must first understand how C works. C is a compiled language; that is, the code I write must be fed to a program called a compiler, and the compiler turns that into a bunch of machine code that can be run directly by my CPU. On this slide I have an example "hello world" program written in C. It calls puts, which stands for "put string," from the standard I/O library, in order to print "hello world". In order to run the code, I first need to compile it, so I invoke the GNU C Compiler, GCC, and by default it produces an executable called a.out, since I didn't specify an output name. The output on the right here is a byte-by-byte printout of my compiled application in hexadecimal form. This executable consists of a bunch of native machine code, zeros and ones, that can be run directly on my specific CPU. If you work with a lot of ASCII, those first four bytes might look a little familiar. The format of this compiled program is called an ELF file, which stands for Executable and Linkable Format, and indeed, the second through fourth bytes literally spell out "ELF".

So now that we've got hexes and elves on our hands, I suppose we've passed the point of no return, so we may as well take a closer look inside. "Executable" refers to the fact that this binary contains machine code that we can execute, but what does "linkable" mean? I'm going to use a tool called readelf to try to make sense of this binary file; the -a flag means to print all sections of the file. On the next few slides I'm going to explain some pieces of the produced output.

The first chunk of output displayed here on this slide is part of the header of the ELF file. The ELF format is standardized, and the first thing I want to point out to you is that I have not been exaggerating: the first piece of metadata in the header is literally called "Magic," and if you remember from the last slide, these magic first bytes are the same as what was displayed there. I also want to point out that this file is aware of the machine architecture it was compiled for, in this case AMD x86-64. Different machine architectures require different compiled code, since they have different CPUs with different instruction sets; for example, I can't run 64-bit code on a 32-bit machine, because the instructions won't fit.

Next we'll take a brief walk through the program headers. In particular, I want to point out the program interpreter, whose location was hard-coded into this binary at compile time. The program interpreter, also called an ELF interpreter, is the program needed to make this binary run on my operating system. The ELF interpreter is responsible for making sense out of this pile of binary, which is a very good thing, because if I had to do it, I imagine I'd go mad.

Next we'll take a look at the relocation section. This is where we record any symbols that our code relies on but that don't have a corresponding implementation in our binary. In our case, we called puts from the standard library rather than defining its implementation in our program, so that implementation needs to be filled in later. There is only one entry in this table: puts@GLIBC_2.2.5. Now, I remember we called puts, but what's this @GLIBC bit all about? This is what we call a symbol version. In the same way that we can version APIs and import or call a specific version of a dependency in our code, an application binary interface, or ABI, can also have versions. The compiler attaching this version onto our puts symbol ensures that when we go to find the implementation of puts, we don't accidentally call the wrong version, which might have an incompatible interface with what our program was compiled against. And unlike with semantically versioned APIs, where typically only one version of the library is installed or loaded at a given time, application binary interfaces usually contain many versioned implementations, in order to ensure backwards compatibility.

Last, let's take a look at the version needs section of our ELF file. The "_r" in .gnu.version_r stands for "required," and this section tells us what versions are needed for each library file we depend on. So here we see we depend on only one file, libc.so.6, with version name GLIBC_2.2.5. And this makes sense: this is the version of the puts symbol we saw on our last slide, and we know we called puts from the C standard library, a.k.a. libc.

So let's illustrate how this all works together. In my C code, the file hello.c, I depend on puts from stdio, a part of the C standard library. When my code is compiled into the binary file a.out, the C compiler resolves the puts symbol to puts with version GLIBC_2.2.5. The version-required table in this binary then tells us we can get this symbol, with that version, from libc.so.6.
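You can see those magic bytes for yourself. A tiny sketch, assuming a gcc-compiled a.out sits in the current directory on a Linux machine:

    # A tiny sketch that reads the ELF magic from a compiled binary;
    # it assumes an "a.out" produced by gcc in the current directory.
    with open("a.out", "rb") as f:
        magic = f.read(4)

    print(magic)                    # b'\x7fELF' on a Linux binary
    assert magic == b"\x7fELF"      # 0x7f, then the letters E, L, F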
If we look inside libc.so.6, and I'm sparing you from having to read any more readelf output, we see that it has a section called .gnu.version_d, or version definitions, and GLIBC_2.2.5 is one of those declared versions, as well as a bunch of more recent versions. If we take a look at .dynsym, the symbol table for dynamic linking, we find the implementation of our function puts, with the correct version.

So now that we have all these pieces, how do they work together when we try to run this program? Well, when we run the file a.out, before the code can actually execute on our CPU, we first have to brave a bunch of eldritch horrors. Under the hood, what happens looks something like this. First, our OS parses the magic ELF bytes at the beginning of the binary; this is our arcane invocation. Next, the OS invokes the ELF interpreter specified by the binary; only the ELF interpreter can unlock the powers within. The ELF interpreter is also a native program, so if this ELF file wasn't compiled for my CPU, I won't be able to run it, as the right program interpreter won't be found. Assuming the correct ELF interpreter is there, it will load any required files with valid versions as specified by the binary, and it will move things around and lay everything out all nicely in memory to ensure this code can actually run. Only then can our CPU execute the binary instructions loaded in memory and breathe life into this program, printing the ancient letters upon our screen: "hello world". This rather frightening process is usually referred to as dynamic linking, and it's how almost every program on your Linux system works, including Python programs.

OK, so now we understand how C programs express and load their binary dependencies. How do our Python programs do it? Well, the good news is that the Python interpreter handles most of the heavy lifting for us in that respect, dealing with all the dynamic linking and the like.
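As a rough peek at that heavy lifting, here is a sketch that asks Python to dynamically load libc and resolve puts by hand, using ctypes from the standard library. It is Linux-only and not how extensions are normally written, just an illustration of the machinery:

    # A sketch of dynamic loading from Python, using the standard
    # library's ctypes; Linux-only (it loads glibc's libc.so.6).
    import ctypes

    libc = ctypes.CDLL("libc.so.6")   # ask the dynamic linker to load libc
    libc.puts(b"hello world")         # resolve and call the puts symbol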
Our responsibility, as the user of the software, is to ensure that any C dependencies our Python program has are available on our system for the program to use, and that our Python extension is compiled to be able to link against them. There are two ways to do this.

The old way, which is still completely valid, is to provide a source-only download of the Python extension and ask our users to compile it from source. We give users the Python package code, and then finding all the foreign dependencies is their problem. When a Python extension gets installed, it will build against the system-installed versions of its dependencies, which ensures that the output binary depends on the right system library application binary interface. This is what we saw earlier with cryptography: the hard way. But we also saw there is a shiny new way: an extension wheel pre-built by a package developer. We can remove the compilation burden from users by bundling precompiled binary dependencies into our Python wheel. Everything the end user needs will be available without them having to install or compile anything outside of their desired package, assuming wheels are available for all of its dependencies.

And as we saw earlier trying to install cryptography, the old way has many problems. It's slow, because we need to compile everything from source, and that's computationally intensive. This also means the end user needs to install not just runtime dependencies but also any build-time dependencies needed by our extensions, which is a ton of extra work for end users that just want to use my library. It's also possible that the user runs into version mismatches, where they've installed a version of a dependency that's different than the developer intended, because that's what was available in their OS package manager, resulting in subtle bugs or unintended behavior that they might not have the expertise to diagnose. And last, as we saw earlier, this will frequently require knowledge of the system package manager, or at least Stack Overflow, which is a really bad experience for new users that just want to try out Python stuff. Now, we still need the old way, because someone is always going to have to compile things from source in order to produce a compiled binary, but isn't it better to leave this in the hands of experts, rather than requiring every end user to become an expert on compiling C and Fortran?

So binary Python wheels solve this problem. They ensure that the dependencies provided inside the wheel are always the right versions, and they come precompiled, so users don't have to worry about any compilation steps. This also means installations are much faster. And since wheels are Python-native, you can just pip install them; no knowledge of outside package management required.

But how can we ensure the precompiled binaries are compatible with my system? The cryptography developers might be running the latest Ubuntu 18.10; I'm stuck in the Stone Age because I'm running 14.04, because I hate upgrading my laptop. This is a really hard problem. How would we know if our precompiled binaries will be compatible with whatever random version of libc an arbitrary user has installed? Well, from the example earlier in the talk, you may remember that we had a simple C program that depended on the puts symbol, and once it was compiled, it had a symbol version attached. Recall, too, that while C libraries add new implementations, they also keep the older versions to support backwards compatibility. So what if we just depended on a really old version of libc when we built our dependencies for distribution? Wouldn't that maximize compatibility for everyone?

This gets at the question at the heart of this talk: how can we ship compiled Python extensions that are compatible with as many systems as possible? Now we have all the magic we need to answer that question. The answer: symbol versioning and dependency bundling, achieved with the Python packaging tools manylinux and auditwheel.

So what are these things? In order to ensure widespread compatibility of compiled binaries, PEPs 513 and 571 define a minimal set of permitted libraries and symbol versions that can be assumed to be present on the average Linux system, for built wheels. Any other symbols that a wheel depends on must be declared inside that wheel. This ensures that Python wheels don't use any cool new GCC features that aren't available on older systems, or assume that certain shared objects will be available. You may remember that in my cryptography installation example, the first thing I had to install was the python-dev header package; for this reason, even the Python development libraries are not included in this minimal base. Thus we ensure that many Linux systems are compatible with this standard, hence the name: manylinux.

manylinux is both the name of the policy and a Docker image used to implement these policies. The original policy, from PEP 513, and its corresponding Docker image were called manylinux1, based on the CentOS 5 distribution; the newest manylinux policy and Docker image, called manylinux2010, is based on CentOS 6, which was released in 2010. Both images are available for the amd64 and i386 architectures.
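All of this compatibility information ends up encoded in a wheel's filename, via PEP 425's compatibility tags. A small sketch of how those tags read, with a hypothetical filename:

    # A sketch unpacking the compatibility tags in a wheel filename
    # (PEP 425); this particular filename is hypothetical.
    name = "cryptography-2.6.1-cp37-cp37m-manylinux1_x86_64.whl"
    dist, version, python_tag, abi_tag, plat_tag = name[:-len(".whl")].split("-")

    print(python_tag)  # cp37              -> CPython 3.7
    print(abi_tag)     # cp37m             -> the CPython 3.7 (pymalloc) ABI
    print(plat_tag)    # manylinux1_x86_64 -> manylinux1 policy, x86-64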
By building your extension wheels inside this Docker container, you can ensure that you use a build environment that's compatible with the aforementioned PEPs, making it easier to produce a compliant binary. Once you've produced a wheel through some means, manylinux or not, you can inspect it with auditwheel. auditwheel investigates any symbols and versions contained inside your built wheel and determines if the wheel is policy-compliant. Given the set of manylinux policies and a priority ordering, auditwheel finds all the external symbols and versions that your wheel depends on and labels the wheel with the strictest policy it complies with, if any.

But that's not all auditwheel can do; it can do much more than just checking compliance against the manylinux policies. Since it already has a very dark understanding of the inner workings of wheels, in order to determine whether or not they comply with policies, auditwheel can use that arcane knowledge to actually locate external versions of dependencies on your system, make copies of them, perform name mangling, and then update paths to bundle these dependencies directly into your built wheel for distribution. This is extremely spooky. It empowers Python developers to build manylinux wheels without having to make substantial changes to their build processes: all they need to do is build and install dependencies inside the appropriate manylinux image and then build their wheel like they normally would. Running auditwheel repair on a built wheel will bundle all the necessary binary dependencies inside and produce a policy-compliant wheel, installable on many Linux systems.

To illustrate and summarize this process from start to finish, I'll share this picture. As developers, we start with our excellent Python extension. First, we build it inside the manylinux image, against each version of Python we support and for each architecture we support. Once our wheels are built, we then repair them with auditwheel, which will bundle copies of any native dependencies required by the package. We then use auditwheel to inspect that output, ensuring that all the included symbol versions comply with the desired policy. And last, we upload our wheels to PyPI, giving users the option of downloading and installing binary extension wheels, speeding up their installs and reducing their dependency overhead.
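A rough sketch of that start-to-finish loop as a Python script. It assumes it runs inside a manylinux container with the project mounted at /io (the image's documented convention) and auditwheel installed; the specific interpreter paths and the two Python versions chosen are illustrative:

    # A rough sketch of the build-and-repair loop, assuming it runs
    # inside a manylinux container with the project mounted at /io and
    # auditwheel installed; interpreter paths follow the image layout.
    import glob
    import subprocess

    # Build a wheel for each supported CPython shipped in the image.
    for pip in ("/opt/python/cp36-cp36m/bin/pip", "/opt/python/cp37-cp37m/bin/pip"):
        subprocess.run([pip, "wheel", "/io/", "-w", "/io/dist/"], check=True)

    # Bundle native dependencies and retag each wheel as manylinux.
    for whl in glob.glob("/io/dist/*-linux_x86_64.whl"):
        subprocess.run(["auditwheel", "repair", whl, "-w", "/io/wheelhouse/"], check=True)

Running auditwheel show on a repaired wheel then reports which manylinux policy, if any, it complies with.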
So now that you understand a little more about how this works, and what you need to do to build a wheel, how can you get involved? We would love to see you build wheels of your package, if you're not doing that already. Feedback is enthusiastically welcomed: please let us know if it doesn't work, and file bugs; we want to make it better. pythonwheels.com has some information about wheels as well, including which of the top 360 packages on PyPI ship wheels. I didn't cover how to make a Python extension, because that's out of the scope of this talk, but that site is a great place to see how and why different projects are doing it. You could also use it to find an example package and see how they build wheels, for inspiration. We didn't cover Windows or macOS wheel building in this talk, so that could be a good resource if you want to target those systems too. We also have a manylinux demo repo with a more straightforward example to play around with, if you just want to try out a more "hello world" sort of wheel build.

There's another way you can get involved: auditwheel needs a new maintainer, because after three years I am stepping down. There's also a lot of work to be done on the new manylinux2014 spec and the existing manylinux Docker images, which don't currently have a maintainer. If you're at all interested, do come find me and I'll try to put you in touch with the right folks in the PyPA to help you help us.

Do we have time for questions? No, we do not. So thank you! Thanks to my employer for letting me attend, and thanks to these folks who reviewed my talk. If you want to check out some resources I posted related to my talk, you can visit this link on my website; it includes a copy of these slides, links to all the referenced PEPs, supplemental readings, and more. Thank you so much for attending!

[Applause]
Info
Channel: PyCon 2019
Views: 10,763
Keywords: Elana Hashman, pycon, python, coding, tutorial
Id: 02aAZ8u3wEQ
Length: 26min 22sec (1582 seconds)
Published: Sun May 05 2019