(presenter)
It's confusing, oh, here we go. (presenter)
Welcome, everyone, to our first talk of PyCon 2016. In our first session,
we'll be having Christian Heimes who is a longtime core developer
of CPython and an employee of Red Hat. He will be telling us
about file descriptors, Unix sockets
and other POSIX magic. So, please join me
in giving him a warm welcome. [applause] Yeah, good morning everybody. Welcome to PyCon
and welcome to my talk. This is actually the first time
I'm talking in English in front of a public audience, so, excuse me if my English
is not that perfect, but yeah. So, let's go on. I've been working for over a year at Red Hat as a senior software engineer in the Security and Identity Management department. We are developing software to make computers more secure, and one of the things I'm currently doing research on is called Custodia. It's a way to get secrets, like passwords and keys, into containers, to make that more secure. And that's behind a lot of the things I'm going to explain in the next 25 minutes. Because I just have 25 minutes, unfortunately I can't give you
ready-to-use recipes. I'm going to introduce you
to a couple of concepts, a couple of tools you can use
in your application. Most of the stuff I'm going to explain
is focused on Linux and works only on Unix-like
operating systems -- so, POSIX. And all the examples, of course,
are in Python 3. Yeah, nobody uses Python 2 anymore,
hopefully. [audience chuckles] So, agenda for today:
I'm going to explain file descriptors and, later on, go a bit into the operating system and the Linux kernel. Next up is how file descriptors and processes interact with each other, then a bit about networking, and finally Unix sockets,
containers, sandboxing. And depending on how fast
I'm able to talk today, I have a little bonus track. So, simplifications ahead. Sorry, I have to lie
in a couple of places, just because I don't have the time to get into all the details, so I'm going to skip over
a couple of parts. If you want to know more, I'm going to have an Open Space
later on today at 4 o'clock. File descriptors. Maybe you've heard the term in Unix,
"everything's a file," like the default system,
the prog file system. You can use it just to interact
with your hardware or to get information of processes,
networking, settings, etc., etc. And everything you do with that --
every time you do input/output, read some stuff, write some stuff,
even interact with resources, you use a file descriptor. File descriptors
are used for all sorts of things like reading/writing files, obviously --
even directories. We can interact with hardware,
and we do inter-process communication, like talking between processes, networking, I/O multiplexing -- that's at the heart of async I/O -- file system monitoring, and lots more. A file descriptor is internally
a bit like a ticket. So you go to the kernel, you ask the kernel
for access to a resource, get back a ticket, and every time
you want to do something with this resource, you show the kernel again
the file descriptor, the number (like a ticket),
and the kernel does something. There are a couple of
standard numbers, 0, 1, and 2 for standard in, standard out,
and standard error output. There's also the number -1, used for error indication. You don't have to care about it in Python; Python raises an exception for you. If you're a C developer, you have to take care of -1 -- in Python, no. A very simple example, you've probably seen that: a "with" statement, open a file, read from the file, print it. This example already uses two file descriptors once the process is running. The "open" creates a file descriptor that it reads from, and the "print" writes to another file descriptor that goes out to the shell. That's a very simple example of file descriptors.
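Roughly, the example looks like this (a minimal sketch; example.txt is just a placeholder name):

with open("example.txt") as f:   # open() asks the kernel for a file descriptor
    data = f.read()              # read() uses that descriptor
print(data)                      # print() writes to another descriptor, fd 1 (standard out)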
You can do even more with file descriptors. You can use a file descriptor to reference a file that's already open. That's rather useful in certain cases, and there are also security implications: it's usually more secure. If you already have a file open, you can get the status information of the file, and you can change the file permission settings. A couple of years ago, Python also gained support for directory file descriptors: you can use a file descriptor to a directory as a location anchor. I think Python 3.4 introduced that feature.
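A small sketch of those calls, assuming a placeholder file example.txt and the directory /tmp (containing another example.txt) exist:

import os, stat

fd = os.open("example.txt", os.O_RDONLY)    # a raw file descriptor
info = os.fstat(fd)                         # status information via the descriptor
print(stat.S_ISREG(info.st_mode), info.st_size)
os.fchmod(fd, 0o640)                        # change permissions of the already-open file
os.close(fd)

dir_fd = os.open("/tmp", os.O_RDONLY)       # a descriptor for a directory
st = os.stat("example.txt", dir_fd=dir_fd)  # resolved relative to /tmp, not the cwd
os.close(dir_fd)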
Then you can even control your hardware with a file descriptor. And that's my only demo for today, and I hope it works. [clicking sound] Did you hear that? For those who are hearing impaired, we have closed captions. Um, oh, wrong one. I'll hide that. That's what happens. [laughter] A prerecording. Actually, I use a file descriptor to open a hardware device, /dev/cdrom, and give it a command to eject the tray -- and pull it back in. It works only on Linux, by the way. Every operating system has its own set of magic codes for that.
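The demo boils down to something like this sketch; CDROMEJECT and CDROMCLOSETRAY are the Linux-specific magic numbers from <linux/cdrom.h>, so this is not portable:

import fcntl, os

CDROMEJECT = 0x5309        # Linux-only ioctl codes
CDROMCLOSETRAY = 0x5319

fd = os.open("/dev/cdrom", os.O_RDONLY | os.O_NONBLOCK)   # open the drive, even without media
fcntl.ioctl(fd, CDROMEJECT)        # ask the kernel to eject the tray
fcntl.ioctl(fd, CDROMCLOSETRAY)    # ...and pull it back in
os.close(fd)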
Operating System 101. Before I can teach you
some of the other tricks, I have to explain how an operating system
actually works internally. Back then, in the Dark Ages, we had no kind of isolation between processes. Everybody could just read and write from hardware, from software. If one program crashed, it could just tear down the whole system. If you used DOS like me in the old days, or even older computers, a crash of a program usually meant you had to restart the computer. In modern operating systems, we have the layer
between the hardware and software called the kernel. The kernel does a lot of things. I'm not going to explain
all the hardware drivers, the way the kernel makes it
much easier to talk to hardware. Besides that, the kernel is also a very important layer for isolating processes. So, a process can't directly interact with another process except by using the kernel
in one way or another at least to set up
a communication channel. The kernel also does lots of checks
regarding security. So, you have file permissions, you have users and groups
in your system. For networking, you have,
maybe, firewalls, and you have physical memory
in your computer, and you have virtual memory in the processes, and how the kernel maps this virtual memory to physical memory is something I wanted to explain, but unfortunately there's no time for that today. Twenty-five minutes is very short. So, to sum that up,
every time you do something, you have to talk to the kernel
if you read or write. The way you talk to the kernel
is called a syscall. A system call -- that's the way
you switch the context from a user-space program
to a kernel-space program. Maybe you've heard the term
"context switch." Context switches are rather slow. Take a couple of -- up to a couple hundred of use cycles
in a modern system, and when you use a lot of them,
like currently almost 400. And we go back to the simple example
where you just opened the file, read from the file,
and printed it out. They're commonly called systrace
or s-trace, where you can see which kind
of system calls the program does. So in this example
the "with" statement opens the file and you get
file descriptor number 3. The kernel usually takes
the next open file descriptor. And then Python does some stuff like check if it's actually
a real file to prevent you
from opening directories, in a way. It seeks to the first position
of the file in case something is -- yeah, I don't know why really. Benjamin might know. He wrote that. [chuckles] And then, we use again number 3
to read and read a couple of screens. The number 19 at the end means it actually read
19 characters on the screen. It tries again. OK, the file's empty. And then Python uses
another file descriptor, number 1 -- that's the center out -- to actually
write the string to the shell. And finally, every time
we have a file descriptor, we should close
the file descriptor, because file descriptors
are actually very scarce resources, so you don't want to waste them. If you have a long-running program
and run out of file descriptors, yeah, you can't do anything anymore. And the kernel internally maintains
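Roughly the same sequence, spelled out with the low-level os calls that map onto those syscalls (a sketch, not Python's actual implementation):

import os

fd = os.open("example.txt", os.O_RDONLY)   # open()  -> e.g. file descriptor 3
os.fstat(fd)                               # fstat() -> is it a regular file?
os.lseek(fd, 0, os.SEEK_CUR)               # lseek() -> where are we in the file?
data = os.read(fd, 8192)                   # read()  -> returns e.g. 19 bytes
os.read(fd, 8192)                          # read()  -> b'' means end of file
os.write(1, data)                          # write() to descriptor 1, standard out
os.close(fd)                               # give the ticket back to the kernel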
And the kernel internally maintains a couple of tables of information. We have the global open file table. That's one gigantic table
in the kernel that has all open resources. Every process
has a file descriptor table of its own that refers into that global table. The file descriptor table in the process is actually very small. It just maps some of the numbers
into the kernel. So, a couple of examples. You open a file,
you get a file descriptor that points to an entry
in the global table and eventually ends up
in a file. You can open the same file again and get
an independent file descriptor. But you can also
duplicate the file descriptor and get a new file descriptor that points to the same entry. And there's another operation that's useful -- we're going to look at that later: we can override a file descriptor, in effect renaming it. Finally, there's a way to pass a file descriptor to another process, so that suddenly another process
uses the same global entry. The per-process file descriptor table itself just contains this mapping, plus one flag, close-on-exec or cloexec, which I'm going to explain in a couple of minutes. The open file table actually contains everything that's stateful, like the position in the file, the mode (whether you opened the file for reading, writing, or both), who the owner is, file locking, credentials, reference counting, etc., etc. So, the entry in the open file table, the global table, is a bit like this old device: there's one shared counter, and even if you have multiple programs reading from the same resource through the same entry, they can't move independently.
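A small sketch of that difference (example.txt is a placeholder with at least a few bytes in it): opening the file twice gives independent entries, while dup gives a second number for the same entry, so the file position is shared:

import os

fd1 = os.open("example.txt", os.O_RDONLY)
fd2 = os.open("example.txt", os.O_RDONLY)   # second open: independent entry, own position
fd3 = os.dup(fd1)                           # duplicate: same entry as fd1, shared position

os.read(fd1, 5)
print(os.lseek(fd2, 0, os.SEEK_CUR))        # 0 -- fd2 did not move
print(os.lseek(fd3, 0, os.SEEK_CUR))        # 5 -- fd3 moved together with fd1

for fd in (fd1, fd2, fd3):
    os.close(fd)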
Next up,
how Unix creates processes. That's a bit strange for people
who are not used to that. So, there are actually two steps
in creating a new process. For one, we have fork. Fork just creates a clone of the current process -- basically an almost perfect copy, except for threading. And the child process you create inherits the file descriptor table from the parent process, so it actually points to the same global entries. And the second step you do -- oh no, sorry, first a small example. If you run fork, you have the parent process, which reads a bit from example.txt, and almost at the same moment the child process reads from the same file descriptor. Because both point to the same entry in the global table, the child just continues where the parent left off and reads up to the end.
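A sketch of that example: after fork, the parent's and the child's descriptors point at the same global entry, so they share the file position (example.txt is a placeholder):

import os

fd = os.open("example.txt", os.O_RDONLY)

pid = os.fork()
if pid == 0:                              # child: the inherited fd points to the same entry
    os.read(fd, 5)                        # moves the shared file position forward
    os._exit(0)
else:                                     # parent
    os.waitpid(pid, 0)                    # wait until the child has read
    print(os.lseek(fd, 0, os.SEEK_CUR))   # 5 -- the parent's position moved too
    print(os.read(fd, 4096))              # continues where the child left off
    os.close(fd)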
Now, the second step. Just forking the process and getting a copy is not very useful if you want to run a different program. The second step is replacing the current code with different code,
and that's called exec. Even in exec,
you get all the file descriptors from the original process, which can be a security issue. That's where cloexec comes in: all file descriptors marked as cloexec are automatically closed on exec. And thanks to Victor Stinner, in Python you don't have to care about that yourself. So, quick summary: every time you do something, you go to the kernel; the kernel has these different kinds of tables; and a new process is created with fork and exec. Now you might wonder
why this is useful. The child process actually gets the same file descriptors as the parent process, and that's where something like subprocess.PIPE, or piping in a shell, comes in. A pipe is really like a water pipe: you have one end where the data flows in and the other end where the data flows out. It's a unidirectional pipe. And the way this piping works is, you have a call like os.pipe -- Python's standard library is pretty awesome -- where you get the two ends. When subprocess creates a new process, it first forks itself and then, in the parent process, says: we are not interested in the write end, so we just close it and are done with it. In the child process, we close the read end, because the child isn't supposed to mess with both ends. And then we use dup2 -- that's how you rename the file descriptor to number 1. And lastly, we run a program like ls. The parent process itself can then read from the read end, and that's how piping works in subprocess.
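A stripped-down sketch of what subprocess does under the hood (not its actual code):

import os, sys

r, w = os.pipe()                 # two file descriptors: read end and write end

pid = os.fork()
if pid == 0:                     # child process
    os.close(r)                  # the child only writes
    os.dup2(w, 1)                # "rename" the write end to descriptor 1, standard out
    os.close(w)
    os.execvp("ls", ["ls"])      # replace the child's code with ls
else:                            # parent process
    os.close(w)                  # the parent only reads
    chunks = []
    while True:
        chunk = os.read(r, 4096)
        if not chunk:
            break
        chunks.append(chunk)
    os.close(r)
    os.waitpid(pid, 0)
    sys.stdout.write(b"".join(chunks).decode())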
But, of course,
just talking between processes is not very interesting. You want to talk, in this global world,
to other computers. Now we have network sockets. Network sockets are a bit like a parcel delivery system: like in a sorting center, packages are sent through a couple of hops -- that's where routing and addressing come in. So, in order to send a package
from you to somebody else, you need to know the address. And the other person
you sent the package to, of course, needs to know
to whom to send back an answer, so, a bit like a letter. This addressing and routing, that's IPv4 and IPv6, so, the basic internet protocols
most of you have probably heard of. There's also a second thing
called flow control. It's like -- do you want
to send out just packages or do you want
to send out packages and get a receipt
that the package was actually received
by the other peer? That's TCP and UDP. So, quick example. That's how a server looks for a socket server: you bind to a port and an address, like a street address and an apartment number, you listen to wait for incoming connections, and then you finally accept connections -- and both the server socket and the conn (the connection) are file descriptors internally. The client itself also creates a new socket and connects to the peer to send data. And what probably not everybody knows: IPv4 and IPv6 are incompatible, so you have to use a different kind of addressing and a different address family.
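The two sides look roughly like this; they would run in separate processes, and the address and port are only examples:

import socket

# server process
server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
server.bind(("127.0.0.1", 8080))      # the street address and apartment number
server.listen(5)
conn, addr = server.accept()          # server and conn are both file descriptors internally
data = conn.recv(1024)

# client process
client = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
client.connect(("127.0.0.1", 8080))   # an IPv6 peer would need AF_INET6 and a different address
client.sendall(b"hello")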
Now, to the promised Unix sockets. Unix sockets are a bit like a mix of pipes and network sockets -- they work like network sockets, but they are local, so they only work on the current computer. It's a bit like an old-style pneumatic tube delivery system, where you have pipes running through your building. The building in this example is like the operating system, and the pipes connect different parts, like processes. Because it's all in-house -- so inside one kernel -- we have additional features and additional security settings, and the kernel guarantees that these pipes are protected and nobody can mess with them. To send a message, we have this kind of fancy capsule where you put your data in, but you can also put some information on the outside. You can glue a tag on it, and that's called ancillary data
for Unix sockets. So, the way to create Unix sockets is that you just exchange the way you do the addressing. Instead of the IPv4 or IPv6 address families, you use AF_UNIX, the Unix address family, and then you use a path to a file as the location, and the client can then connect to that file. Because these are regular files in the file system, you can use all sorts of permission settings, like the users and groups and read and write bits you already have, and you can use that feature for authentication and for protection. There are also ways to create a socket pair. It's a bit like a pipe, but with bidirectional ends. Unix also has something called the abstract namespace; I'm not going to cover that here.
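A sketch, with /tmp/demo.sock as a made-up path:

import os, socket

path = "/tmp/demo.sock"
if os.path.exists(path):
    os.unlink(path)

server = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
server.bind(path)            # creates a socket file; ordinary permission bits apply
os.chmod(path, 0o660)        # e.g. only owner and group may connect
server.listen(1)

# a client (normally another process) connects to the same path
client = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
client.connect(path)

# socketpair: like a pipe, but with two bidirectional ends
a, b = socket.socketpair(socket.AF_UNIX, socket.SOCK_STREAM)
a.sendall(b"ping")
print(b.recv(4))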
So, a couple of things we can
actually do with Unix sockets that don't work with normal sockets. We can get the peer credentials: the kernel tells us who is on the other side of the pipe. Here's a bit of an example -- that's currently not in the standard library; I'm planning to add it for Python 3.6. You can get the PID (the process ID), the user ID, and the group ID of the other process. Because that's guaranteed by the kernel, nobody can mess with it. You can also get -- if you are running a system that has SELinux -- the SELinux context. Again, I'm going to add that feature to the standard library too.
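A sketch of what that looks like today with getsockopt on a connected Unix socket; the struct layout and the SO_PEERSEC number are Linux-specific assumptions:

import socket, struct

SO_PEERCRED = getattr(socket, "SO_PEERCRED", 17)   # 17 on Linux
SO_PEERSEC = 31                                    # Linux-only, not exposed by the socket module

def peer_credentials(conn):
    # struct ucred on Linux: pid, uid, gid as three 32-bit integers
    data = conn.getsockopt(socket.SOL_SOCKET, SO_PEERCRED, struct.calcsize("3i"))
    return struct.unpack("3i", data)               # (pid, uid, gid)

def peer_selinux_context(conn):
    data = conn.getsockopt(socket.SOL_SOCKET, SO_PEERSEC, 256)
    return data.rstrip(b"\x00").decode()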
And we can use both these features to do something very fancy
with containers. That's one of the main things I am using for the Custodia project: we can use Unix sockets even between containers, because containers actually share one kernel, unlike virtual machines. SELinux might make it a bit harder, for very good reasons -- Dan Walsh has a couple of blog posts about SELinux and containers. Another thing: when you run a container, you have different namespaces, so the container doesn't see other processes. But with Unix sockets, the kernel actually translates the PID to the correct --
(audience member)
What about [inaudible]?
(Christian Heimes)
Later, okay. And you also get these multi-category security (MCS) separation labels, so every container is guaranteed to have a unique label. Every currently [inaudible] container has a unique label on [inaudible] systems that use secure virtualization.
So, why is that useful? We can actually get the Docker ID from a PID. We have the cgroup file, with the control groups. If you look at the control groups file of a process, it looks a bit like this -- I shortened the ID a bit -- and the hash in there is actually your Docker container ID. And when you look closer, you can check if the SELinux label matches, and if you're running, like, Kubernetes or OpenShift, you can even get the information about the container, pod, and namespace.
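A sketch of that lookup; the exact cgroup line format varies between Docker and kernel versions, so the pattern here is only an assumption:

import re

def docker_id_from_pid(pid):
    # lines look roughly like: 4:devices:/docker/<64-hex-digit-container-id>
    with open("/proc/%d/cgroup" % pid) as f:
        for line in f:
            match = re.search(r"docker[/-]([0-9a-f]{64})", line)
            if match:
                return match.group(1)
    return None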
And because the kernel prevents any process from messing with that -- as long as neither the kernel nor Docker is compromised --
it's secure. You can also send file descriptors over Unix sockets. That's using the ancillary data, just glued onto the capsule. These features are used, for example, in multiprocessing in the standard library. The socket module documentation has an example of how to do that. It looks rather ugly and it depends on the operating system; I'm also planning to add that, actually, to the standard library socket module.
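This is roughly the recipe from the socket module documentation, slightly condensed; example.txt and the socketpair stand in for a real file and two real processes:

import array, os, socket

def send_fds(sock, msg, fds):
    ancdata = [(socket.SOL_SOCKET, socket.SCM_RIGHTS, array.array("i", fds))]
    return sock.sendmsg([msg], ancdata)

def recv_fds(sock, msglen, maxfds):
    fds = array.array("i")
    msg, ancdata, flags, addr = sock.recvmsg(msglen, socket.CMSG_LEN(maxfds * fds.itemsize))
    for cmsg_level, cmsg_type, cmsg_data in ancdata:
        if cmsg_level == socket.SOL_SOCKET and cmsg_type == socket.SCM_RIGHTS:
            fds.frombytes(cmsg_data[:len(cmsg_data) - (len(cmsg_data) % fds.itemsize)])
    return msg, list(fds)

a, b = socket.socketpair(socket.AF_UNIX, socket.SOCK_STREAM)
fd = os.open("example.txt", os.O_RDONLY)
send_fds(a, b"here you go", [fd])
msg, received = recv_fds(b, 1024, 1)    # received[0] is a new descriptor on the receiving side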
Why is it useful to send a file descriptor from one process to another? That's very useful, for example,
for sandboxing. There is a feature called seccomp, and it can be used
by lots of programs where you put your process
in a sandbox, and the sandbox actually
prevents the process from doing any kind of
forbidden syscall. So, you say, "This person's not allowed
to create new process." "This person's not allowed
to open any files, "mess with other files,
or create network connections. But if you have an H connection
or a browser, of course, you want to talk
to another rec server. There comes in the so-called broker. So you have another process
that's very, very simple and can be audited
and checked for issues much, much better than the flash or a very complex browser renderer
or VU renderer. So, the sandbox asks the broker,
"Please open that file for me," and the broker then opens the file
and sends back the file descriptor. Or, in case the process is compromised
and does something evil, the broker is able to
just kill the malicious instance and yeah, you're safe. I have a couple of topics I would
really, really like to cover here, but I'm a bit out of time,
like 30 seconds left before all the
questions and answers start. Memory mapped I/O. You can actually do more
than just reading from a file; you can map a file into memory. It's very efficient if you have, for example, multiple processes opening the same file, or if a renderer reads and writes it. The kernel actually copies the data from the file into memory and removes it again eventually. NumPy has support for that.
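A minimal sketch (example.txt again as a placeholder, and it must not be empty):

import mmap

with open("example.txt", "rb") as f:
    mm = mmap.mmap(f.fileno(), 0, prot=mmap.PROT_READ)   # map the whole file
    print(mm[:10])            # slicing reads straight from the page cache
    print(mm.find(b"\n"))     # behaves like a bytes-like object
    mm.close()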
There is something very new
called memfd, where you can actually create a file-like thing in memory that you can seal, so you can basically write data,
seal the box, and nobody can change
the data anymore and then give this file descriptor
to another process. It's a better way
to do temporary files and you can do much more efficient I/O
with zero copy. Every time you copy data from the kernel into user space and back into kernel space, you have a context switch, which is very slow, and you have to copy the data multiple times. There are better ways to do that -- like sendfile; Python uses that already. And copy_file_range, which is going to be added to Python 3.6.
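A sketch of zero-copy with os.sendfile: the kernel moves the bytes from the file straight into the socket, without a round trip through user space (conn is assumed to be a connected socket):

import os

def send_file_over_socket(conn, path):
    with open(path, "rb") as f:
        size = os.fstat(f.fileno()).st_size
        offset = 0
        while offset < size:
            # arguments: out_fd, in_fd, offset, count -- returns the bytes actually sent
            offset += os.sendfile(conn.fileno(), f.fileno(), offset, size - offset)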
Then there's splicing, and a very new thing where you have an SSL/TLS socket in the kernel, which is very useful for high-performance file servers: they do most of the expensive TLS work
inside the kernel itself. And finally, event-driven I/O. These are the features used for async I/O: when you have hundreds or thousands of connections, you use these kinds of calls to wait, and every time a pipe or a socket connection is ready for reading or writing, the process gets informed of that. So async I/O uses that.
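A toy sketch of that idea with the selectors module (epoll or kqueue underneath), handling many connections in one process; the address is a placeholder:

import selectors, socket

sel = selectors.DefaultSelector()

server = socket.socket()
server.bind(("127.0.0.1", 8080))
server.listen(100)
server.setblocking(False)
sel.register(server, selectors.EVENT_READ)

while True:
    for key, events in sel.select():            # block until some descriptor is ready
        if key.fileobj is server:
            conn, _ = server.accept()           # ready: accept without blocking
            conn.setblocking(False)
            sel.register(conn, selectors.EVENT_READ)
        else:
            data = key.fileobj.recv(1024)       # ready: read without blocking
            if not data:                        # peer closed the connection
                sel.unregister(key.fileobj)
                key.fileobj.close()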
If you want to know more
about file descriptors or want to talk about
the Custodia project and the proofs of concept I am currently researching, please come to my Open Space
at 4 o'clock today. It's in Room C 120 downstairs. Thank you very much. [applause] Almost in time. Three minutes left for questions. (audience member)
I have a question. What about chroot? (Christian Heimes)
Change root? (Christian Heimes)
Can you please go to the microphone? It is over there. (audience member)
All right, oh sorry. This might be a bit germane
to your topic at hand, but, sorry,
if I start up Red Hat, I open Python 3,
and I open a file with a non-Unicode character
in the string -- (Christian Heimes)
Can you speak up a bit? It's hard to understand you. (audience member)
So, I have a blank Red Hat Linux running, I start Python 3,
I open the file and in the file name,
there's a non-Unicode character. What happens,
what do I have to care about? Would it always work? (Christian Heimes)
Okay, that's actually not covered by -- that's already
handled by Python internally. That's not related
to file descriptors. Python tries to use something called surrogate escapes to kind of translate bytes that don't match UTF-8 into something. I'm not an expert on that part of the encoding. But that's actually not handled by anything in the kernel or a syscall. That's handled in Python, because in Unix, file names are actually bytes, but Python tries to handle them
as text. There's nobody? Okay, so, on the other side again. (audience member)
Hello, sorry for shouting earlier. Thank you for the talk. My question is -- you said you can
pass Unix sockets between containers, between containerized processes, but often containerized processes
have a chroot, right? They have a different view
of what the file system is, so I wonder how that could work. (Christian Heimes)
Yeah that works. You can create
a file-based Unix socket, say your Unix socket is a file in the file system. Then you can bind-mount that directory into another container. It even works when you bind-mount it into the container read-only, and that prevents the process from replacing or removing the socket file. The socket file actually is a bit like a device file, the stuff you have in the /dev directory: you can open that file and connect to get the file descriptor. What's currently not working is if you have one container that creates a socket file and another container that wants to open it, because the MCS labeling prevents you from exchanging information between unrelated containers. You either have to use some settings to put them in the same context, or that's something that's going to be handled by kdbus. Kdbus is Unix sockets on steroids
with additional features. We also can do, like,
this cross container communication in a different way. (audience member)
OK, so is the short answer that I have to bind-mount
a common directory to both containers? (Christian Heimes)
Yes, and if you -- basically, it's easier to use the broker approach. You have, like, a super-privileged container that's running in the host PID namespace and creates the socket, and all the other containers then connect to that. So, the container example I wrote for Custodia actually is a privileged container using the host namespaces. It doesn't currently work with SELinux enabled, for security reasons -- for good security reasons.
OK, thank you. (presenter)
Thank you. That's all the time we have
for questions, but remember
Christian's Open Space at 4 p.m. (Christian Heimes)
Thank you. [applause]