Christian Heimes - File descriptors, Unix sockets and other POSIX wizardry - PyCon 2016

Video Statistics and Information

Captions
(presenter) It's confusing, oh, here we go. Welcome, everyone, to the first talks of PyCon 2016. In our first session we'll be having Christian Heimes, who is a longtime core developer of CPython and an employee of Red Hat. He will be telling us about file descriptors, Unix sockets and other POSIX magic. So, please join me in giving him a warm welcome. [applause]

(Christian Heimes) Yeah, good morning everybody. Welcome to PyCon and welcome to my talk. This is actually the first time I'm talking in English in front of a public audience, so excuse me if my English is not perfect. So, let's go on. I've been working for over a year at Red Hat as a senior software engineer in the Security and Identity Management department. We develop software to make computers more secure, and one of the things I'm currently doing research on is called Custodia. It's a way to get secrets, like passwords and keys, into containers to make that more secure. And that is behind a lot of the things I'm going to explain in the next 25 minutes. Because I only have 25 minutes, I unfortunately can't give you ready-to-use recipes. I'm going to introduce you to a couple of concepts and a couple of tools you can use in your applications. Most of the stuff I'm going to explain is focused on Linux and works only on Unix-like operating systems -- so, POSIX. And all the examples, of course, are in Python 3. Nobody uses Python 2 anymore, hopefully. [audience chuckles]

So, the agenda for today: I'm going to explain file descriptors and later on go a bit into the operating system and the Linux kernel. Next up is how file descriptors and processes interact with each other, then a bit about networking, and finally Unix sockets, containers and sandboxing. And depending on how fast I'm able to talk today, I have a little bonus track. Simplifications ahead: sorry, I have to lie in a couple of places, just because I don't have the time to get into all the details, so I'm going to skip over a couple of parts. If you want to know more, I'm having an Open Space later today at 4 o'clock.

File descriptors. Maybe you've heard the term in Unix, "everything is a file" -- like the /dev file system or the /proc file system. You can use them to interact with your hardware or to get information about processes, networking, settings, etc. And every time you do input/output -- read some stuff, write some stuff, even interact with other resources -- you use a file descriptor. File descriptors are used for all sorts of things: reading and writing files, obviously -- even directories. We can interact with hardware, we do inter-process communication (talking between processes), networking, I/O multiplexing -- that's the heart of async I/O -- file system monitoring, and lots more. A file descriptor is internally a bit like a ticket. You go to the kernel, you ask the kernel for access to a resource, and you get back a ticket. Every time you want to do something with this resource, you show the kernel the file descriptor -- the number, like a ticket -- and the kernel does something for you. There are a couple of standard numbers: 0, 1, and 2 for standard in, standard out, and standard error output. There's also -1, which is used for error indication. You don't have to care about that in Python; Python raises an exception for you. As a C developer you have to check for -1 -- in Python, no. A very simple example, you've probably seen it: a "with" statement, open a file, read from the file, print it.
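Here is a minimal sketch of that simple example, assuming a file named example.txt in the current directory (the name is borrowed from the strace walkthrough later in the talk); the fileno() calls just make the two file descriptors visible.

```python
import sys

# Open a file: the kernel hands back a file descriptor, typically 3,
# because 0, 1 and 2 are already taken by stdin, stdout and stderr.
with open("example.txt") as f:
    print("file object uses fd", f.fileno())             # usually 3
    print("print() writes to fd", sys.stdout.fileno())   # 1, standard out
    print(f.read())   # read() uses the file's fd, print() uses stdout's fd
```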
This example already uses two file descriptors once the process is running: the "open" creates a file descriptor to read from the file, and the "print" writes to another file descriptor to send the text out to the shell. That's a very simple example of file descriptors.

You can do even more with file descriptors. You can use a file descriptor to reference a file that's already open, which is rather useful in certain cases and also has security implications -- it's usually more secure. If you already have a file open, you can get the status information of the file, and you can change the file's permission settings. A couple of years ago, Python also gained the ability to use dir file descriptors: you can use a file descriptor to a directory as a location indicator. I think Python 3.4 introduced that feature. You can even control your hardware with a file descriptor, and that's my only demo for today, and I hope it works. [clicking sound] Did you hear that? For those who are hearing impaired, we have closed captions. Um, oh, wrong one. I'll hide that. That's what happens. [laughter] A prerecording. Actually, I use a file descriptor to open a hardware device, /dev/cdrom, and give it a command to eject the tray -- and pull it back in. It works only on Linux, by the way. Every operating system has its own set of magic codes for that.
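A rough sketch of that eject demo, assuming a Linux machine with a /dev/cdrom device: the ioctl request numbers below are the CDROMEJECT and CDROMCLOSETRAY constants from <linux/cdrom.h> and are Linux-specific, so this will not work anywhere else.

```python
import fcntl
import os

# Linux-specific ioctl request codes from <linux/cdrom.h>.
CDROMEJECT = 0x5309
CDROMCLOSETRAY = 0x5319

# Ask the kernel for a file descriptor to the hardware device.
fd = os.open("/dev/cdrom", os.O_RDONLY | os.O_NONBLOCK)
try:
    fcntl.ioctl(fd, CDROMEJECT)        # eject the tray
    # fcntl.ioctl(fd, CDROMCLOSETRAY)  # ...and pull it back in
finally:
    os.close(fd)
```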
Operating System 101. Before I can show you some of the other tricks, I have to explain how an operating system actually works internally. We used to have the Dark Ages: there was no isolation between processes, everybody could just read and write hardware and memory, and if one program crashed, it could tear down the whole system. If you used DOS like me in the old days, or even older computers, a crash of a program usually meant you had to restart the computer. In modern operating systems, we have a layer between the hardware and the software called the kernel. The kernel does a lot of things. I'm not going to explain all the hardware drivers and the way the kernel makes it much easier to talk to hardware. The kernel is also a very important layer to isolate processes: a process can't directly interact with another process except by going through the kernel in one way or another, at least to set up a communication channel. The kernel also does lots of security checks. You have file permissions, you have users and groups in your system, for networking you maybe have firewalls, and you have physical memory in your computer and virtual memory in the processes. How the kernel maps this virtual memory to physical memory is something I wanted to explain, but unfortunately there's no time today -- twenty-five minutes is very short. So, to sum that up: every time you do something, like reading or writing, you have to talk to the kernel. The way you talk to the kernel is called a syscall. A system call is the way you switch the context from user space into kernel space. Maybe you've heard the term "context switch". Context switches are rather slow -- they take up to a couple hundred CPU cycles on a modern system -- and there are a lot of syscalls, currently almost 400. Let's go back to the simple example where we just opened the file, read from it, and printed it out. There's a common tool called strace, with which you can see which system calls a program makes.

So in this strace output, the "with" statement opens the file and gets back file descriptor number 3 -- the kernel usually hands out the next free file descriptor. Then Python does some stuff, like checking that it's actually a regular file to prevent you from opening directories, and it seeks to the first position of the file in case something is -- yeah, I don't know why, really. Benjamin might know, he wrote that. [chuckles] Then it uses number 3 again to read, and reads a couple of bytes -- the number 19 at the end means it actually read 19 bytes. It tries again: OK, the file has no more data. Then Python uses another file descriptor, number 1 -- that's standard out -- to write the string to the shell. And finally, whenever we have a file descriptor, we should close it, because file descriptors are a scarce resource and you don't want to waste them. If you have a long-running program and run out of file descriptors, you can't do anything anymore.

The kernel internally maintains a couple of tables of information. There is the global open file table: one gigantic table in the kernel that holds all open resources. And every process has a file descriptor table of its own, which refers into that global table. The file descriptor table in the process is actually very small; it just maps numbers to entries in the kernel. So, a couple of examples: you open a file and get a file descriptor that points to an entry in the global table and eventually ends up at a file. You can open the same file again and get an independent file descriptor. But you can also duplicate the file descriptor and get a new number that points to the same entry. And there's another operation that's useful, which we're going to look at later: we can overwrite a file descriptor and effectively rename it. Finally, there's a way to pass a file descriptor to another process, so that suddenly another process uses the same global entry. The file descriptor table itself just contains this mapping plus one flag, close-on-exec or cloexec, which I'm going to explain in a couple of minutes. The open file table, the global one, contains everything that's stateful: the position in the file, the mode (whether you opened the file for reading, writing, or both), who the owner is, file locking, credentials, reference counting, etc. So an entry in the global open file table is a bit like this old device on the slide: it has one shared counter, and even if multiple programs read from the same resource through it, they don't get independent positions.

Next up: how Unix creates processes. That's a bit strange for people who aren't used to it. There are actually two steps in creating a new process. First, we have fork. Fork just creates a clone of the current process -- basically an almost perfect copy, except for threading. The child process you create inherits the file descriptor table from the parent process, so its descriptors point to the same global entries. Before the second step, a small example: after a fork, the parent process reads a bit from example.txt, and almost at the same moment the child process reads from the same file descriptor -- the same entry in the global table -- so it continues right where the parent left off, up to the end of the file.
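To make the difference between "open the file again" and "duplicate the descriptor" concrete, here is a small sketch, again assuming an example.txt with some content: a descriptor from os.dup() shares the open file table entry and therefore the file position, while a second os.open() gets an independent entry.

```python
import os

fd1 = os.open("example.txt", os.O_RDONLY)

# Opening the file again creates a *new* entry in the global open
# file table, so its position is independent of fd1.
fd2 = os.open("example.txt", os.O_RDONLY)

# dup() only adds a new number in this process's fd table that points
# to the *same* global entry, so the position is shared with fd1.
fd3 = os.dup(fd1)

os.read(fd1, 5)                          # advance fd1 by 5 bytes
print(os.lseek(fd2, 0, os.SEEK_CUR))     # 0 -- independent position
print(os.lseek(fd3, 0, os.SEEK_CUR))     # 5 -- shared with fd1

for fd in (fd1, fd2, fd3):
    os.close(fd)
```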
Just forking the process and getting a copy is not very useful if you want to run a different program. The second step is replacing the current code with different code, and that's called exec. Even after an exec, you keep all the file descriptors from the original process, which can be a security issue. That's where cloexec comes in: all file descriptors marked as cloexec are automatically closed on exec. And thanks to Victor Stinner, in Python you don't have to care about that either.

So, a quick summary: every time you do something, you go through the kernel; we have these different kinds of tables; and a new process is created with fork and exec. Now you might wonder why it's useful that the child process gets the same file descriptors as the parent process. That's where something like subprocess.PIPE, or piping in a shell, comes in. A pipe really is like a water pipe: you have one end where the data flows in and another end where the data flows out. It's a unidirectional pipe. The way this piping works is that you have a call like os.pipe -- Python's standard library is pretty awesome -- which gives you the two ends as file descriptors. When subprocess creates a new process, it first forks itself, and then in the parent process it says, "We are not interested in the write end," so it just closes it and is done with it. In the child process we close the read end, because the child isn't supposed to mess with both ends. Then we use dup2 -- that's the rename operation -- to rename the write end to file descriptor 1, standard out. And lastly, we exec a program like ls. The parent process can then read from its end of the pipe, and that's how piping works in subprocess.
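A minimal sketch of that dance, roughly what subprocess does under the hood (ls -l is just a stand-in for any program): pipe, fork, close the unused ends, dup2 the write end onto fd 1, exec, and read in the parent.

```python
import os

read_end, write_end = os.pipe()   # two connected file descriptors

pid = os.fork()
if pid == 0:
    # Child: only writes, so close the read end, rename the write end
    # to fd 1 (standard out) and replace ourselves with another program.
    os.close(read_end)
    os.dup2(write_end, 1)
    os.close(write_end)
    os.execvp("ls", ["ls", "-l"])
else:
    # Parent: only reads, so close the write end and drain the pipe.
    os.close(write_end)
    chunks = []
    while True:
        data = os.read(read_end, 4096)
        if not data:
            break
        chunks.append(data)
    os.close(read_end)
    os.waitpid(pid, 0)
    print(b"".join(chunks).decode())
```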
But, of course, just talking between processes on one machine is not the whole story -- in this connected world you want to talk to other computers. That's what network sockets are for. Network sockets are a bit like a parcel delivery system: like in a sorting center, packages are sent through a couple of stations -- that's where routing and addressing come in. In order to send a package to somebody else, you need to know their address, and the other person of course needs to know whom to send the answer back to, a bit like a letter. This addressing and routing is IPv4 and IPv6, the basic internet protocols most of you have probably heard of. Then there's a second thing, flow control: do you just want to send out packages, or do you want to send out packages and get a receipt that each package was actually received by the other peer? That's the difference between UDP and TCP. A quick example of how a server looks for a network socket: you bind to a port -- like an address, a street address or an apartment number -- you listen to wait for incoming connections, and you finally accept connections; both the server socket and the conn (the connection) are file descriptors internally. The client, in turn, creates a new socket and connects to the peer to send data. And what probably not everybody knows: IPv4 and IPv6 are incompatible, so you have to use a different address family and a different kind of address.

Now, to the promised Unix sockets. Unix sockets are a bit like a mix of pipes and network sockets, but they are limited: they only work on the current computer. It's a bit like an old-style pneumatic tube delivery system where you have pipes running through your building. The building in this analogy is the operating system, and the tubes connect different parts -- the processes. Because it's all in-house, inside one kernel, we get additional features and additional security settings, and the kernel guarantees that these pipes are protected and nobody can mess with them. To send a message, you have this kind of fancy capsule you put your data into, but you can also attach some information on the outside -- you can glue a tag onto it -- and that's called ancillary data for Unix sockets.

The way to create a Unix socket is to just exchange the addressing: instead of IPv4 or IPv6 you use AF_UNIX, the Unix address family, and then you use a path to a file as the location, and the client can connect to that file. Because these are regular files in the file system, you can use all the usual permission settings -- users, groups, read and write bits -- and you can use that for authentication and protection. There's also a way to create a socket pair; it's a bit like a pipe, but bidirectional. And there's also something called the abstract namespace, which I'm not going to cover here. So, a couple of things we can do with Unix sockets that don't work with normal sockets. We can get the peer credentials: the kernel tells us who is on the other side of the pipe. Here's a small example -- that's currently not in the standard library; I'm planning to add it for Python 3.6. You can get the PID (the process ID), the user ID, and the group ID of the other process, and because that's guaranteed by the kernel, nobody can mess with it. You can also get the SELinux context, if you're running a system that has SELinux. Again, I'm going to add that feature to the standard library too.
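A sketch of an AF_UNIX server plus the peer-credential lookup, with both ends in one process for brevity; the socket path is made up for this example, SO_PEERCRED is Linux-specific, and the "three native ints" layout of struct ucred (pid, uid, gid) is an assumption of this sketch, not something spelled out in the talk.

```python
import os
import socket
import struct

SOCK_PATH = "/tmp/demo.sock"   # hypothetical path, just for this sketch

# Server: a Unix socket is addressed by a path in the file system.
server = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
if os.path.exists(SOCK_PATH):
    os.unlink(SOCK_PATH)
server.bind(SOCK_PATH)
server.listen(1)

# Client (normally a different process) connects to that file.
client = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
client.connect(SOCK_PATH)
conn, _ = server.accept()

# Peer credentials: the kernel fills in a struct ucred (pid, uid, gid)
# for the process on the other side, so the peer cannot lie about them.
creds = conn.getsockopt(socket.SOL_SOCKET, socket.SO_PEERCRED,
                        struct.calcsize("3i"))
pid, uid, gid = struct.unpack("3i", creds)
print("peer pid/uid/gid:", pid, uid, gid)
```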
We can use both of these features to do something very fancy with containers. That's one of the main things I'm using for Custodia: we can use Unix sockets even between containers, because containers all share the same kernel, unlike virtual machines. SELinux might make it a bit harder, for very good reasons; Dan Walsh has a couple of blog posts about SELinux and containers. Another thing: when you run a container, you have different namespaces, so the container doesn't see other processes. But with Unix sockets the kernel actually translates the PID into the correct namespace -- (audience member) What about [inaudible]? -- Later, okay. You also get these multi-category security (MCS) separation labels, so every container is guaranteed to have a unique label, at least on systems that use SELinux's secure virtualization. Why is that useful? We can actually get the Docker ID from a PID. We have the cgroup file, with the control groups, and if you look at that file it looks a bit like this -- I've shortened the ID a bit -- and that big hash is actually your Docker container ID. When you look closer, you can check whether the SELinux label matches, and if you're running something like Kubernetes or OpenShift, you can even get at the information about the container, the pod, the namespace, and so on. And because the kernel prevents any process from messing with that, as long as neither the kernel nor the Docker container is compromised, it's secure.

You can also send file descriptors over Unix sockets. That uses the ancillary data, just glued onto the capsule. These features are used, for example, in multiprocessing, and the standard library documentation for the socket module has an example of how to do it. It looks rather ugly and depends on the operating system; I'm also planning to add that to the standard library socket module. Why is it useful to send a file descriptor from one process to another? It's very useful, for example, for sandboxing. There's a feature called seccomp, used by lots of programs, where you put your process in a sandbox, and the sandbox prevents the process from doing any kind of forbidden syscall. So you say, "This process is not allowed to create new processes. This process is not allowed to open any files, mess with other files, or create network connections." But if you have, say, a browser, of course you still want to talk to a web server. That's where the so-called broker comes in. You have another process that's very, very simple and can be audited and checked for issues much, much better than Flash or a very complex browser or video renderer. The sandboxed process asks the broker, "Please open that file for me," and the broker opens the file and sends back the file descriptor. Or, in case the sandboxed process is compromised and does something evil, the broker is able to just kill the malicious instance, and you're safe.
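Here is a compact sketch of that descriptor-passing trick in the spirit of the socket module documentation's example, using a socketpair so both ends live in one process; in the broker pattern the two ends would belong to different processes, and example.txt is again just a placeholder file.

```python
import array
import os
import socket

# The two ends of a connected Unix socket pair; in the broker pattern
# these would be held by two different processes.
broker_end, sandbox_end = socket.socketpair(socket.AF_UNIX, socket.SOCK_STREAM)

# Broker side: open the file and ship the descriptor as ancillary data
# (SCM_RIGHTS) glued onto an ordinary one-byte message.
fd = os.open("example.txt", os.O_RDONLY)
broker_end.sendmsg([b"x"],
                   [(socket.SOL_SOCKET, socket.SCM_RIGHTS,
                     array.array("i", [fd]))])

# Sandbox side: receive the message plus the ancillary data and unpack
# the descriptor, which is now a valid fd in the receiving process.
fds = array.array("i")
msg, ancdata, flags, addr = sandbox_end.recvmsg(1, socket.CMSG_LEN(fds.itemsize))
for level, ctype, data in ancdata:
    if level == socket.SOL_SOCKET and ctype == socket.SCM_RIGHTS:
        fds.frombytes(data[:len(data) - (len(data) % fds.itemsize)])
received_fd = fds[0]
print(os.read(received_fd, 64))
```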
I have a couple more topics I would really, really like to cover, but I'm a bit out of time -- like 30 seconds left before the questions and answers start. Memory-mapped I/O: you can do more than just read from a file; you can map a file into memory. That's very efficient if you have, say, multiple processes opening the same file, or a program that reads and writes a lot; the kernel copies data from the file into memory and writes it back eventually. NumPy has support for that. There's something very new called memfd, where you can create a file-like thing in memory that you can seal: you write your data, seal the box so nobody can change the data anymore, and then give the file descriptor to another process. It's a better way to do temporary files. And you can do much more efficient I/O with zero copying. Every time you copy data from the kernel into user space and back into kernel space, you have a context switch, which is slow, and you have to copy the data multiple times. There are better ways to do that, like sendfile -- Python uses that already -- and copy_file_range, which is going to be added to Python 3.6. There's also splicing, and a very new thing where an SSL/TLS socket does most of the expensive TLS inside the kernel itself, which is very useful for high-performance file servers. And finally, event-driven I/O: the features used for async I/O where you have hundreds or thousands of connections, and you use these kinds of calls to wait, and every time a pipe or a socket connection is ready for reading or writing, the process gets informed. asyncio uses that. If you want to know more about file descriptors, or want to talk about Custodia and the proof of concepts I'm currently researching, please come to my Open Space at 4 o'clock today. It's in Room C 120 downstairs. Thank you very much. [applause]

(Christian Heimes) Almost in time. Three minutes left for questions. (audience member) I have a question. What about chroot? (Christian Heimes) Change root? Can you please go to the microphone? It is over there. (audience member) All right, oh sorry. This might not be quite germane to your topic at hand, but, sorry: if I start up Red Hat, I open Python 3, and I open a file with a non-Unicode character in the name -- (Christian Heimes) Can you speak up a bit? It's hard to understand you. (audience member) So, I have a plain Red Hat Linux running, I start Python 3, I open a file, and in the file name there's a non-Unicode character. What happens, what do I have to care about? Will it always work? (Christian Heimes) Okay, that's actually not covered by -- that's already handled by Python internally; it's not related to file descriptors. Python uses something called surrogate escapes to translate bytes that don't decode as UTF-8 into something it can represent. I'm not an expert on that part of the encoding machinery. But that's not handled by anything in the kernel or a syscall; that's handled in Python, because in Unix, file names are actually bytes, but Python treats them as text. Is there nobody at that microphone? Okay, so, the other side again.

(audience member) Hello, sorry for shouting earlier. Thank you for the talk. My question is: you said you can pass Unix sockets between containers, between containerized processes, but containerized processes often have a chroot, right? They have a different view of what the file system is, so I wonder how that can work. (Christian Heimes) Yeah, that works. You can create a file-based Unix socket, so your Unix socket is a file in the file system, and then you can bind-mount that directory into another container. It even works when you bind-mount the directory read-only, which prevents the process from replacing or removing the socket file. The socket file is a bit like a device file, like the stuff you have in the /dev directory: you can open that file and connect through it. What's currently not working is having one container that creates the socket and another container that wants to connect to it, because the MCS labeling prevents you from exchanging information between unrelated containers. You either have to use some settings to put them in the same context, or that's something that's going to be handled by kdbus. Kdbus is Unix sockets on steroids, with additional features, and it can also do this cross-container communication in a different way. (audience member) OK, so is the short answer that I have to bind-mount a common directory into both containers? (Christian Heimes) Yes, and basically it's easier to use the broker approach: you have a super-privileged container that runs in the host PID namespace and creates the socket, and all the other containers then connect to that. The container example I wrote for Custodia actually is a privileged container using the host PID namespace. It doesn't currently work with SELinux, for security reasons -- for good security reasons. (audience member) OK, thank you. (presenter) Thank you. That's all the time we have for questions, but remember Christian's Open Space at 4 p.m. (Christian Heimes) Thank you. [applause]
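As a footnote to the file-name question in the Q&A, here is a tiny sketch of the surrogateescape behaviour Christian refers to, assuming a UTF-8 locale; the byte string is an arbitrary example.

```python
import os

# A file name containing a byte that is not valid UTF-8.
raw = b"caf\xe9.txt"          # latin-1 style bytes, not valid UTF-8

# Python smuggles the undecodable byte into the str as a lone
# surrogate (the "surrogateescape" error handler)...
name = os.fsdecode(raw)       # 'caf\udce9.txt'

# ...and turns it back into the exact same bytes on the way out,
# so file names round-trip losslessly even though they look like text.
assert os.fsencode(name) == raw
print(repr(name))
```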
Info
Channel: PyCon 2016
Views: 22,528
Rating: 4.9076924 out of 5
Id: Ftg8fjY_YWU
Length: 32min 2sec (1922 seconds)
Published: Mon May 30 2016