Zero Size Files - Computerphile

Captions
One of our loyal viewers sent in a question the other day. He was asking why, when he creates an empty text document on his computer, it has a file size of zero bytes. Well, there's no text in it, so that sort of makes sense, but he realized - well, actually, there's some information associated with this thing - it's got a file name, there's how big it is, there's the time it was created, and so on. Where's all that information stored? Why does it have zero bytes, and yet we know there's some information stored alongside it?
So we can demonstrate what this actually means if we use the computer. Let's just create a simple text document. This document's got nothing in it, so let's save this out to disk... ...and we'll call it "Empty.txt". So we've created an empty text file, and if we look on our system, we can see that we have created a file called "Empty.txt" with zero bytes in it. It's an empty file, there's nothing in there, and if we try and cat the file - there's nothing in there. That's an empty document.
Just to prove that a real text document would have some information in it, let's create one that says "Hello, Computerphile!" and save that out; we'll call this "NotEmpty.txt". And if we look at that one now, we see we've got two files - "Empty", which has got 0 bytes associated with it, and "NotEmpty", which has 21 bytes associated with it. If we just look at what's inside "NotEmpty", we see that those 21 bytes form the ASCII codes for "Hello, Computerphile!"
So we have these two files - one's "Empty", which has nothing in it, and one of them has 21 characters in it and the line feed at the end for "Hello, Computerphile!" But neither of them actually has the filename stored in it, they don't have the date there, they don't have how big the file is. So where is all that stored? What's going on there? Well, actually, we need to think about these bits of information as being two things.
We have one bit of information, which is the document. In this case, it's either empty or it's got some ASCII characters in it. But the other information isn't really part of the document. It's describing that document. So it's information which tells us what we want to call the document, how big it is, when we edited it, and so on. It isn't actually part of the document. And the easiest way to think about that is this: if you rename a document, you don't change the document. So if I rename the file "NotEmpty.txt" to be "StillNotEmpty.txt"... We've changed the filename, but the file is still the same, even though we've given it a longer filename. So if we hexdump this one, the bytes match between the two different ones.
The thing we have to think about first is that we have our document, and if we draw that out, we have a computer icon for the document, and we also have alongside that the information about what we called it. In this case, "Empty.txt". So this document is called "Empty.txt", and it has a size, which is zero bytes, and it'll have a date we created it. And as we change things about this, as we move it around, this information will change. So we might change the filename, so we call it "StillEmpty.txt", we get rid of the old filename, but we haven't actually changed the document. That stayed the same.
Because just as we think about these things as being separate - we have the name which describes the document, but isn't part of the document - the file system in the computer does exactly the same. So if you remember back to the videos we did on how data is stored on a disk, we divide the disk up into a set of tracks, and we break those tracks up into single sectors. The documents are stored on those tracks and sectors; even if it's an SSD, it's still, a lot of the time, emulating this old system that hard disks tend to use.
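The experiment described so far (an empty file, a 21-byte file, and a rename that leaves the bytes untouched) can be reproduced with a short Python sketch. The function name `run_demo` is made up for illustration, and the sketch assumes the text is written without a trailing line feed, which is what gives exactly 21 bytes.

```python
import os
import tempfile


def run_demo(directory):
    """Create the two files from the video, rename one, and report what changed."""
    empty = os.path.join(directory, "Empty.txt")
    not_empty = os.path.join(directory, "NotEmpty.txt")
    renamed = os.path.join(directory, "StillNotEmpty.txt")

    # An empty document: open for writing and close again without writing anything.
    open(empty, "wb").close()

    # A document containing the ASCII codes for the greeting (no line feed added).
    with open(not_empty, "wb") as f:
        f.write(b"Hello, Computerphile!")

    # The size comes from the file system's metadata, not from the file's contents.
    sizes = {
        "Empty.txt": os.stat(empty).st_size,
        "NotEmpty.txt": os.stat(not_empty).st_size,
    }

    with open(not_empty, "rb") as f:
        before = f.read()

    # Renaming changes the directory information only.
    os.rename(not_empty, renamed)

    with open(renamed, "rb") as f:
        after = f.read()

    return sizes, before, after


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        sizes, before, after = run_demo(tmp)
        print(sizes)
        print("bytes unchanged by rename:", before == after)
```

The name, size and timestamps all come back from `os.stat` or the directory listing, while reading the file only ever returns the document's own bytes.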
As well as storing the data, we also store a directory, or catalog, that represents where that information is stored. I'm going to use the FAT file system as an example here, because it's relatively straightforward to understand. Systems like NTFS, ext2, ZFS, etc. will all use different variations, but they have similar concepts that use more efficient ways of representing the catalog.
So in a FAT file system, the directory is really just a special type of file. There's a special directory called the "root" directory, which is the one that the system knows where to find. But inside that you'll find entries that point to other directories, and also point to files. And each of those entries in the original FAT system is made up of 32 bytes of data. And these are stored consecutively after each other, so if we had another one, it would immediately follow these 32 bytes.
So the first 8 bytes, for example, are used to store the first part of the filename. So if we had a file named - let's call it "Empty", which is what we used - we have E-M-P-T-Y, so that's five characters, and we store the other three as spaces. The next three bytes store the extension, so T-X-T. We don't store the dot, so we have the name padded out with spaces, then we have that extension there. We then have various other flags and so on, some of which tell it whether it's a directory or a special file, and so on. And then towards the end, we have the size, and there are four bytes which are specified for that, which means you can't have a file bigger than four gigabytes on the FAT file system. And there are also two bytes which say where the file starts. So at the beginning of a disk, we have this information which describes the file. Most importantly, it tells us where on the disk to find it, where it starts, and how big it is.
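The 32-byte entry described above can be sketched with Python's `struct` module. This is a simplified illustration, not a full FAT implementation: the helper names `pack_entry` and `unpack_entry` are made up, and the 14 bytes between the attribute flags and the start cluster (timestamps and other fields) are treated as one opaque block filled with zeros.

```python
import struct

# Little-endian layout of one simplified directory entry:
#   8s  name, padded with spaces
#   3s  extension, padded with spaces (the dot is not stored)
#   B   attribute flags (directory, special file, and so on)
#   14s other fields, treated as opaque in this sketch
#   H   start cluster (two bytes)
#   I   file size in bytes (four bytes)
ENTRY_FORMAT = "<8s3sB14sHI"
ENTRY_SIZE = struct.calcsize(ENTRY_FORMAT)  # 32 bytes

# Largest value a four-byte size field can hold: just under four gigabytes.
MAX_FILE_SIZE = 2**32 - 1


def pack_entry(name, ext, start_cluster, size, attributes=0):
    """Build the 32 raw bytes for one entry."""
    raw_name = name.upper().encode("ascii").ljust(8, b" ")
    raw_ext = ext.upper().encode("ascii").ljust(3, b" ")
    if len(raw_name) != 8 or len(raw_ext) != 3:
        raise ValueError("name must fit in 8 characters and extension in 3")
    return struct.pack(
        ENTRY_FORMAT, raw_name, raw_ext, attributes, bytes(14), start_cluster, size
    )


def unpack_entry(raw):
    """Decode 32 raw bytes back into the fields the video talks about."""
    name, ext, attributes, _other, start_cluster, size = struct.unpack(
        ENTRY_FORMAT, raw
    )
    return {
        "name": name.decode("ascii").rstrip(" "),
        "ext": ext.decode("ascii").rstrip(" "),
        "attributes": attributes,
        "start_cluster": start_cluster,
        "size": size,
    }


if __name__ == "__main__":
    entry = pack_entry("Empty", "txt", start_cluster=0, size=0)
    print(len(entry), entry[:11])
    print(unpack_entry(entry))
```

For the empty file, both the size and the start cluster are zero, so everything the file system knows about it lives in these 32 bytes.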
Now, we need to know how big it is, because we can only allocate a whole number of sectors - or technically we use clusters, which are multiple sectors joined together - to store this file. So we know where the first one is, and there's another section of the system which tells you which of the clusters are joined together, forming a linked list of the ones that the file uses; we'll go into that in another video in some more detail. So we have the size, so we know exactly where to stop in the last cluster that we're using. So in this case, this will be zero, because the file is empty, and we have the name of it stored here. Interestingly, if the file size is zero, you can say that the start cluster is also zero, so you don't actually have to take up a whole cluster storing nothing. So in terms of actual disk space, this empty file will take up no disk space at all, because all the information it needs is part of the directory.
Now there's one caveat to that. As we've said, each of these entries takes up 32 bytes, and eventually you'll fill up the cluster that is being used to store that directory. When you do that, the next file will need to start using another cluster, and so that will take up a whole kilobyte of disk, or 512 bytes of disk, or whatever it is, depending on how the file system is set up. So at some point, you will create a file that will use some disk space.
So that margin is dependent on how many other files you've got, as to whether your next file takes any space up or not?
Yes. So it depends on it, exactly. So as we add more files into a directory, more empty files, they won't take up any space, and then suddenly you'll add an empty file and it will take up a whole kilobyte of space on disk. And then you'll keep adding more, and they won't take up any extra space, then you'll add another one and it'll take up a whole kilobyte. So you could, theoretically, fill up your whole hard disk with empty files and have no space left on it.
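The arithmetic behind that caveat can be worked through in a few lines of Python. This is a rough model under stated assumptions: a 1024-byte cluster (the "whole kilobyte" case), a directory that grows one cluster at a time, and exactly one 32-byte entry per file. The function names are made up for illustration.

```python
ENTRY_SIZE = 32  # bytes per directory entry, as described in the video


def clusters_for_file(size_bytes, cluster_size=1024):
    """Whole clusters needed for a file's data; an empty file needs none."""
    return -(-size_bytes // cluster_size)  # ceiling division


def clusters_for_directory(entry_count, cluster_size=1024):
    """Whole clusters needed to hold this many directory entries."""
    return -(-(entry_count * ENTRY_SIZE) // cluster_size)


def extra_clusters_for_new_empty_file(existing_entries, cluster_size=1024):
    """Directory clusters added by creating one more empty file."""
    return clusters_for_directory(
        existing_entries + 1, cluster_size
    ) - clusters_for_directory(existing_entries, cluster_size)


if __name__ == "__main__":
    print("data clusters for a 21-byte file:", clusters_for_file(21))
    print("data clusters for an empty file:", clusters_for_file(0))
    for existing in (30, 31, 32, 33):
        extra = extra_clusters_for_new_empty_file(existing)
        print(f"{existing} entries already, one more empty file costs {extra} cluster(s)")
```

With these assumptions a cluster holds 32 entries, so the 33rd file in a directory is the one that suddenly costs a whole kilobyte, and the same jump happens again at every further multiple of 32.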
If you want to try it, do so, but don't blame me if you suddenly find your computer doesn't work properly. You'll have to find some way of deleting the empty files, but your computer probably wouldn't boot properly. And no one uses FAT these days anyway.
He created the empty .txt file and it didn't have anything in it, and it took up zero bytes. Now we discussed why they actually take up some space, but he also created a rich text file, an RTF document.
Info
Channel: Computerphile
Views: 333,454
Keywords: computers, computerphile, computer, science, computer science, Dr Steve Bagley, University of Nottingham, File Size, FAT, FAT32
Id: kiTTAbeqQKY
Length: 7min 50sec (470 seconds)
Published: Fri Oct 21 2016