What is a File Format?

Captions
Check this out, this is really cool! I have here a file which clearly seems to be a .pdf. We can open it, and it’s a print from my website. BUT! Let me rename it to .zip. Open the zip file, and that works! This zip contains a test.txt file which says “This is a text file”. So wait… Is this now a PDF file or is this a ZIP file? What exactly is a file format anyway? Let’s talk about it.
I grew up with Windows, so my understanding of files came from the intuition I got just by using the Explorer. I was using the computer a lot as a teenager, so I also learned about the view option that hides known file extensions, and a typical setting for me was to disable it. Another thing you might come across as a GAMER is changing some game settings. For example, League of Legends has a config folder with a game.cfg, which Windows doesn’t know how to open if you double click it. But you can force open it in Notepad, and suddenly there is plain text you could modify to change your settings. So in the Windows world, how a file is opened when you double click it depends on the file extension. The .pdf file opens for me in the browser, and the zip file is opened directly in the Explorer. Which program is used you can find in the Windows settings. Here is a big list of extensions and what program is associated with each.
BUT! If you are older or you just grew up with something like Linux, or more precisely grew up with a command line, you probably have a very different intuition about what a file is. Because there is no double click. On the command line you always specify a program to execute, and then you can give that program some arguments. For example ls, to list the current folder content, and the arguments -la, to show it in this list view. And so when I want to work with zip files, I need to explicitly use the zip program. In this case I just want to list the files contained in this zip. But I could also just specify a totally different program.
For example hexdump, which reads the content, the raw bytes of the file, and displays it here as hex values. With -C it also tries to decode these raw bytes to plain text.
If you are a basic Windows user, maybe even viewing files without extensions, then files feel very abstract. Sometimes when you click on one, it opens with one program, but another file opens with another program. Files seem complicated. But if you have a background in the command line, you have a MUCH better understanding of what a file is, because there is no magic. There is nothing that opens the file with whatever it should be opened with. YOU decide how to read this file. You specify the program. And this means you are a step closer to understanding fun file tricks.
So when you force open a zip file in Notepad, you can see some binary data. But you can also see readable text. The text file included in this zip is visible here: the file name, including its content. In this case the text file was so small that it wasn’t even compressed by zip. Actually the zip file is much larger than the raw text file because of all the stuff around it.
So let’s open this file in a fancy hex editor called 010 Editor. This hex editor has a cool feature called templates. And we can apply the ZIP template to it, which then dissects this raw data for us, so we can understand the zip file. Similar to how Wireshark dissects packets into their individual components, we can now explore the zip file components. We can for example see here a ZIPFILERECORD and a ZIPFILEENTRY. The RECORD shows stuff like the version or the compression type, which in this case is 0, indicating no compression was used. Then some date and time, as well as size information about the included file, followed by the file name and the actual data of the file. And you can always see how these fields correspond to the actual raw bytes above.
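To make that record layout a bit more concrete, here is a minimal Python sketch that builds a tiny zip in memory and then decodes the first record by hand, roughly the way the template does. The function name `parse_local_header` and the dictionary keys are made up for this illustration; the field order follows the ZIP local file header as I understand it, and things like encrypted or streamed entries are ignored.

```python
import io
import struct
import zipfile

# Build a tiny ZIP in memory with one uncompressed ("stored") text file.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w", compression=zipfile.ZIP_STORED) as zf:
    zf.writestr("test.txt", "This is a text file")
data = buf.getvalue()

# Local file header: 30 fixed bytes, little-endian, followed by the
# file name, an optional extra field, and then the file data.
LOCAL_HEADER = struct.Struct("<4sHHHHHIIIHH")


def parse_local_header(raw: bytes, offset: int = 0) -> dict:
    """Decode one local file header (hypothetical helper for illustration)."""
    (sig, version, flags, method, mtime, mdate,
     crc32, comp_size, uncomp_size, name_len, extra_len) = LOCAL_HEADER.unpack_from(raw, offset)
    if sig != b"PK\x03\x04":
        raise ValueError("not a ZIP local file header")
    name_start = offset + LOCAL_HEADER.size
    data_start = name_start + name_len + extra_len
    return {
        "version": version,
        "compression": method,  # 0 means stored, i.e. no compression
        "compressed_size": comp_size,
        "uncompressed_size": uncomp_size,
        "name": raw[name_start:name_start + name_len].decode("utf-8"),
        "data": raw[data_start:data_start + comp_size],
    }


record = parse_local_header(data)
print(record)
print("zip size:", len(data), "payload size:", len(record["data"]))
```

The last line also shows the size overhead mentioned above: the whole zip is noticeably larger than the few bytes of text it carries.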
So a program or library that works with zip files needs code that actually understands this binary structure and parses the file. And as you can see, there are a lot of different values you can screw with.
The first time I heard about file format tricks was in 2013 at a small Berlin conference called BerlinSides. Julia Wolf is a malware researcher and gave a talk about “Stupid ZIP file tricks”. I can’t find slides for this talk and there were no recordings of this conference. But it was about crafting zip files such that different programs that can read zip files would understand the file differently. Because parsing this data is not straightforward, different implementations might understand it slightly differently. And this could be abused, for example to have an antivirus that scans the files in a zip file read different files than the user sees when opening the zip file with the Explorer. The antivirus sees a harmless text file. And the user sees malware. And this was mind-blowing for me when I heard it the first time.
In 2014 I saw a great talk by Ange Albertini - Corkami - about Funky File Formats. This has a recording, so I highly recommend you watch it. And I’m going to steal and adapt an analogy from this talk. Let’s say there are two people. Person A loves dogs. And Person B looooves cats. And then you show them a statue from the German fairy tale “Town Musicians of Bremen”. As you can see, there are four animals stacked on top of each other. You point at it and ask Person A, is this a dog? And Person A looks at it and says, YEAH, that is a dog! Then you ask Person B, is this a cat? And Person B says, “Of course, that is a cat”. But you shake your head in disbelief and disappointment. And you respond to them, “You are idiots. You are so dumb. This is not a cat or a dog. It’s a statue of a donkey, a dog, a cat and a rooster.” And this is basically how file formats work. Different programs look at the same thing, but see something else.
Ange Albertini is somewhat famous for his playful exploration of file formats. And he recently started to write a tool to help create binary polyglots. Polyglots are maybe better known in computing as this idea that you can write a script that is actually valid source code in multiple languages. The Wikipedia article about polyglots has an example here. If I asked you what programming language this is, what would you say? Well, it’s a trick question. This is valid C, PHP and Bash. You can see traces of all these languages in here. This is a C define directive, but also a Bash comment, and here is the start of a PHP script. So depending on YOUR preferences or experience, YOU might identify this as C or PHP code. But somebody else might interpret it as something else. Truth is, it is all three languages at the same time.
So this polyglot here is source code, but file formats work exactly the same way. Source code is interpreted by a program that we sometimes call a compiler or runtime, the same way a regular zip file is interpreted by a program like the zip program. Now here is a bit of a mind-bending take. Concentrate and think about this carefully, I’m going to twist some words around. Some researchers would even say a zip file is source code. This binary data is binary source code. And when you create a zip file with this binary structure, you are actually programming in a weird zip file language, and the zip program is an interpreter for this language. The zip program simply runs your code and does whatever you wrote. So instead of saying this zip file contains this text file, you could say you programmed a zip file, and that code displays a text file when executed by the zip program. *whoosh*
This all ties back into the langsec point of view with weird machines. I have talked about this in the return oriented programming video before, and I can also recommend to you again the talk “The Science of Insecurity” by Meredith L. Patterson. Anyway.
Now that we have a much deeper understanding of file formats, let’s look at the tool from Corkami, called mitra. This can create binary polyglots for us. So here I have this zip file with the text file. And here I have a PDF that is simply a print from my website. Let’s also have a quick peek at what a PDF file looks like, by opening it in Notepad. As you can see, a PDF file is mostly plain text. You can almost imagine this like HTML code. But it’s very complex, you don’t really want to do this stuff by hand. Unless your name is Ange Albertini. Anyway, let’s also have a brief look with the 010 Editor, and select the PDF template. And again we can see here how these bytes can be understood and parsed into PDF objects.
Anyway. Mitra can combine them now. We can execute mitra and pass in the zip and the pdf. This now created a file I call test.zip.pdf. So is it a PDF file? Let’s open it, and yes. This looks like our pdf. But if I rename it to .zip, and force it to be opened as a zip file, I suddenly find a test.txt file within it. Now it looks like the zip from before. This is awesome, right? We have a file that is a valid PDF and a valid ZIP file at the same time. And depending on which program you use to interpret this file, you get different results. The same way you get different results when you execute this code with Bash or PHP.
And that’s why I wanted to do the example with the statue. Yes, if you look at this statue very accurately, it’s a statue of 4 animals. But if you explicitly ask Person A, or program A, to look at this statue, or a file, you know it will look for a dog. You know it will look for a PDF file. If you ask Person B, or another Program B, you know it will look for cats, or zip files, and tell you everything about the cat. Both of them kinda ignore all the other surrounding stuff. So let’s look at the whole statue, by opening our weird file in the 010 Editor. And let’s apply the zip template.
You can find here two records and two entries. The first entry is HUUUGE. And looking at the content of that file, we can see it looks like the full PDF file. Here is the start of the PDF file. But it doesn’t show up when we open it as a zip file. I’m not sure why the pdf is not shown, but I think this is because the file name length is set to 0. And then we have this second entry, which is the file we know, test.txt. So this is the world view of a ZIP program.
But let’s look at the same data as a PDF program. Apply that template, and we can see that there is some unknown garbage at the start, but it identified the PDF header here. So the PDF file really only starts here. And then we can see it being followed by all the PDF objects and whatever. So it turns out that the PDF program will look at this file the same way Person A and B looked at the statue. They started looking from the top until they saw what they were looking for. Person A found the dog. The PDF program found the start of the PDF file.
There are a lot of fun things you can do with files, so please check out more of Ange’s research. But there is one thing that is NOT FUN about this. And that is when you hide files within other files and make it a CTF challenge. For example, this text file could contain the flag, you combine it with mitra, and you share out the .pdf version as a challenge. That is annoying, stop doing these kinds of steganography challenges. This is a great example where the research is awesome and I want everybody to learn more about file format tricks, but naively putting this into a challenge is stupid. If you want to do a file format challenge, do it the other way around. I have given an example at the end of the last video, but to summarize: don’t have players trying to find out what you hid in the file. Instead, have a challenge where the players need to hide something themselves.
It can be literally as simple as “submit a file that is a zip file that contains a test.txt, but is also a valid .pdf”. You give them an upload field. And if the file checks pass, they get the flag. This is much better because the instructions are clear, and the educational value is much greater. No guessing involved.
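As a rough idea of what such an upload check could look like, here is a small Python sketch. Everything in it is an assumption for illustration: the helper names (`looks_like_pdf`, `zip_contains`, `check_upload`) are invented, the PDF check only looks for a `%PDF-` header within the first 1024 bytes (a lenient rule some readers are said to follow) instead of really parsing the PDF, and the test input is a crude stand-in made by gluing PDF-looking bytes in front of a zip, which is not the layout mitra produces.

```python
import io
import zipfile


def make_zip(files: dict) -> bytes:
    """Build a zip archive in memory from a {name: content} mapping."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        for name, content in files.items():
            zf.writestr(name, content)
    return buf.getvalue()


def looks_like_pdf(blob: bytes) -> bool:
    # Assumption: accept the header anywhere near the start, like a
    # lenient PDF reader. A real challenge should use a real PDF parser.
    return b"%PDF-" in blob[:1024]


def zip_contains(blob: bytes, wanted: str) -> bool:
    # Python's zipfile locates the directory from the END of the data,
    # so extra bytes in front of the archive do not stop it.
    try:
        with zipfile.ZipFile(io.BytesIO(blob)) as zf:
            return wanted in zf.namelist()
    except zipfile.BadZipFile:
        return False


def check_upload(blob: bytes) -> bool:
    """Hypothetical challenge check: PDF-ish header AND a zip with test.txt."""
    return looks_like_pdf(blob) and zip_contains(blob, "test.txt")


plain_zip = make_zip({"test.txt": "This is a text file"})
# Crude stand-in for a polyglot, NOT a real PDF and NOT mitra's output.
fake_polyglot = b"%PDF-1.4\n% placeholder body\n" + plain_zip

print(check_upload(plain_zip))
print(check_upload(fake_polyglot))
print(check_upload(b"%PDF-1.4\n"))
```

The two checks mirror the statue analogy: the PDF check scans from the top until it finds what it is looking for, while the zip check starts from the other end, and neither cares about the rest of the bytes.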
Info
Channel: LiveOverflow
Views: 177,622
Keywords: Live Overflow, liveoverflow, hacking tutorial, how to hack, exploit tutorial, file formats, 010 editor, fileformat, what is a file, funky file formats, file tricks, polyglots, steganography, ctf, capture the flag, mitra, corkami, ange albertini, julia wolf, berlinsides, conference, zip files, pdf, hide zip, hide pdf, challenge
Id: VVdmmN0su6E
Length: 12min 58sec (778 seconds)
Published: Mon Oct 26 2020