This is the music video to Toxic being played
entirely off a 3.5” 1.44MB floppy. Seriously. You can just take this disk, pop it into a computer
running Windows, run the program, and watch the ever-iconic music video get played out right in
your terminal. Really. But, I pretend to hear you ask, how does one play a video from a floppy disk?
Well, even if it is old, a floppy is still a storage device. All you need to do is take
a video, and copy it onto the disk. Well, so long as that video is smaller than 1.44MB. And
therein lies the problem. Floppy disks are small, and video is… not. My videos now generally render at around 20Mbps, which means that a floppy couldn’t even hold a second of HD video, let
alone three minutes. That’s at 60FPS 1080p though, and as you might expect, lower quality implies
lower bitrate implies a smaller file size. When it comes to video encoding, there are
a bunch of dials to control resolution, framerate, audio sample rate, the number of
audio channels, what codec to use, and so on. Turn these knobs all down, and we get a video that
easily fits on a floppy, but it looks like this. Redditor RichB93 from the LGR subreddit
used these encoding settings, along with a better disk formatting scheme, and was able
to encode an entire five minute episode of LGR down to a file small enough to be played off of
a floppy disk. Clint even made a Blerb about it, good stuff. Turns out, that’s all you
need to fit a video onto a floppy. Except, well, I dunno, that feels too
easy in my opinion. The quality isn’t bad, but it’s all dependent on modern video codecs
like H.265, and if the computer you stick it in doesn’t have the hardware or software to decode
that, the video file is as good as useless. I thought maybe I could try to include a playback
program on the disk, but most video players like VLC here are already way too big to fit on a
floppy, even before we throw the video file in. So, how do we actually get a video onto a floppy
that we can playback on almost any computer? We’re going to want something compact, efficient to
decode, and portable. And immediately my thoughts went to text. We can take every frame of the video
and turn it into ASCII art, and that’s perfect. Text is relatively easy to compress, almost
every computer has some type of terminal output, and as a rule, using a terminal-based interface
over a graphical one automatically increases your hacker cred by at least a few points.
Converting video into text isn’t that new of a concept. In fact, there’s a
really cool demo from 1997 called BB, which plays a procedurally generated animation
using only text in your terminal. Neat stuff, but by far the coolest thing about the BB project is
that they released their image to text conversion tools under the name aalib. That means that
converting a video into ASCII art is as simple as loading each frame in from the mp4, and pushing
them all through aalib. But, something spoke to me. Maybe it was the naive programmer’s desire
to “build everything all by myself,” maybe it was just me skipping ahead a bit and realizing that a
more complex character set would kill any attempts at compression we might need to use. Whatever the
case, I decided not to use aalib, and instead, make the video to text conversion all by myself.
So… How does one make a video-to-ASCII converter? Well, it’s important to remember that a video is just a sequence of frames. And each frame is a set of three monochrome images, one per color channel. And each image is a
2D grid of intensity values, one for each pixel. In ASCII art, for our purposes anyways,
each character is essentially one pixel, so our first step is to scale down each frame to
the resolution the text output is going to have. Then, for each pixel, we need to decide what
character we’re going to use to represent that color, and here is where things get interesting.
You’ve probably already noticed from some of the ASCII art examples I’ve shown that some characters
like spaces or commas work well at representing dark areas, since the character doesn’t have a
lot of white pixels, and similarly, hashes and at signs are good for bright colors since they’re
mostly filled in with white pixels. But what about everything else? There are a lot of colors between
black and white. Well, rule #1, for now, I’m not going to be dealing with RGB color. Don’t get
me wrong, I could totally do it, but color makes things more complicated, it’ll take up more space
on the floppy, and it makes my program much harder to port between Linux and Windows, so for now,
we’re just gonna change every frame to grayscale. Grayscale ends up being nice to work in anyhow.
Each color exists on a brightness scale from zero to one, and we can pick which character to use based on which range it falls in. To get the ranges, I made a program to print out all the characters to the terminal, took their average brightness, plotted them out, and selected seven characters whose brightnesses were pretty evenly spaced apart.
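In code, that mapping ends up tiny. Here’s a minimal Python sketch, with a hypothetical seven-character ramp standing in for the characters I actually measured:

```python
import numpy as np

# Hypothetical dark-to-bright ramp; the real one was picked by measuring
# average character brightness in my terminal and keeping seven that were
# evenly spaced.
RAMP = " .,:=*@"
LEVELS = len(RAMP)  # 7

def frame_to_ascii(gray):
    """gray: 2D numpy array of floats in [0, 1], already scaled down."""
    idx = np.clip((gray * LEVELS).astype(int), 0, LEVELS - 1)
    return "\n".join("".join(RAMP[i] for i in row) for row in idx)
```

And, when we run the video through this process, this is what we get. It’s at least recognizable, but it does come with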
a slight problem. We only have seven brightness levels here, which means that we lose detail
between the levels, especially on the darkest one. But this is a problem dithering can easily
fix. If pixels alternate between two brightness levels in a checkerboard type pattern, we can
give the illusion of more colors than we have, since our eyes blend the colors together. I’m
using Floyd-Steinberg dithering here, which distributes any error accumulated when picking
the color of a pixel into its neighbors. It’s a simple algorithm, but this step single handedly
makes our text render of the video frames look a lot better, especially in the darker frames.
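For the curious, here’s what the Floyd-Steinberg step looks like as a minimal Python sketch, quantizing to the same seven levels as before (the names here are mine, not the actual script’s):

```python
import numpy as np

def dither(gray, levels=7):
    """Floyd-Steinberg dither a 2D float image in [0, 1] down to `levels`
    brightness steps; returns an array of indices into the character ramp."""
    img = gray.astype(np.float64).copy()
    h, w = img.shape
    out = np.zeros((h, w), dtype=np.uint8)
    for y in range(h):
        for x in range(w):
            # snap this pixel to the nearest of the seven levels
            idx = min(max(round(img[y, x] * (levels - 1)), 0), levels - 1)
            out[y, x] = idx
            err = img[y, x] - idx / (levels - 1)
            # push the rounding error onto pixels we haven't visited yet
            if x + 1 < w:
                img[y, x + 1] += err * 7 / 16
            if y + 1 < h:
                if x > 0:
                    img[y + 1, x - 1] += err * 3 / 16
                img[y + 1, x] += err * 5 / 16
                if x + 1 < w:
                    img[y + 1, x + 1] += err * 1 / 16
    return out
```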
As it turns out, a scaling step, a grayscale step, and a dithering step are all you need to convert a frame of video into a block of ASCII text, and those steps are all easy enough to work into a short Python script. Before I move on to the next section, I should
probably answer the question I’m sure a number of you are wondering… “Why Toxic?” Originally, the idea was just to use ANY music video, since I wanted something recognizable that was under 4 minutes. At first, I was thinking of using the Rickroll video, or Take on
Me, but those videos don’t have a lot of contrast. When you try to convert them to text, they come
out looking mostly gray, and it’s hard to make out much in terms of details. On the complete opposite
side of this problem is the Bad Apple video, which, barring some anti-aliasing, is entirely black and white.
Pretty boring for this project, in my opinion, plus just about every possible port of this has
been made by programmers far more talented than me. Which brings us to the music video for Toxic.
Turns out, this video is perfect for this kind of text rendering. Almost all the scenes have strong
contrast between bright lights and dark shadows, and there are lots of closeups on peoples’ faces
to show off the detail I’m able to preserve. Plus, I just like the song, ok? Free Britney.
I did tease color video to text a minute ago, so I should probably talk about that a bit. First
off, how is this even possible? Most terminals, at least in Linux-land, support a set of
formatting codes called the ANSI escape codes. These are a sequence of bytes that appear in
the output stream. When the terminal sees them, it goes “oh, this must be formatting
information” and it follows the encoded command to move the cursor, or switch the
text color, or start writing in bold text. When it comes to setting colors, you get to change
the foreground color and the background color, and depending on how advanced your terminal is, you
have either 8, 16, or 256 colors to choose from. The cool thing about the 256-color set is that the standard palette includes a 6x6x6 cube of RGB colors, with six brightness levels per channel. Generating
color video is as simple as printing a bunch of spaces, and formatting the background to whatever
terminal color is the closest to our pixel. The result is recognizable, albeit a bit pixelated.
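As a sketch, assuming the standard xterm palette, where the color cube sits at indices 16 through 231 arranged as 16 + 36r + 6g + b with each channel in 0–5, the mapping looks something like this:

```python
def cube_index(r, g, b):
    """Map an 8-bit RGB pixel to a nearby 256-color cube entry.
    (The real cube levels aren't quite linear, so this is approximate.)"""
    to6 = lambda v: round(v / 255 * 5)
    return 16 + 36 * to6(r) + 6 * to6(g) + to6(b)

def colored_space(r, g, b):
    # ESC[48;5;Nm sets the background to palette entry N; ESC[0m resets
    return f"\x1b[48;5;{cube_index(r, g, b)}m \x1b[0m"

print(colored_space(200, 30, 140))  # one magenta-ish "pixel"
```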
But, if we’re able to blend black to white using text, we can blend between the colors we’re given
to get an even better color depth. I wrote a brute-force program to try every foreground, background, and character combination in the ANSI palette, and determine the best combination for each RGB color, with each channel going from 0 to 255. In a sense, I generated a much more advanced 3D version of
the grayscale bounds we talked about before. The result is a table: give it any RGB color, and it tells you the exact foreground, background, and character combination to match
that color the closest in an ANSI terminal.
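The heart of that brute force is simple, if slow. Here’s a Python sketch, with made-up coverage numbers standing in for the measured brightness of each character, and PALETTE assumed to be the 256 palette entries as RGB tuples:

```python
# COVERAGE: fraction of each character cell that's "ink" (made-up values;
# the real ones came from measuring my terminal font).
COVERAGE = {" ": 0.0, ".": 0.08, ",": 0.12, ":": 0.16, "=": 0.3, "*": 0.45, "@": 0.65}

def best_combo(target, palette):
    """Try every foreground/background/character combo and keep whichever
    blended color lands closest to the target RGB."""
    best, best_dist = None, float("inf")
    for fi, fg in enumerate(palette):
        for bi, bg in enumerate(palette):
            for ch, cov in COVERAGE.items():
                # approximate on-screen color: fg over bg, weighted by coverage
                c = [f * cov + b * (1 - cov) for f, b in zip(fg, bg)]
                d = sum((a - t) ** 2 for a, t in zip(c, target))
                if d < best_dist:
                    best, best_dist = (fi, bi, ch), d
    return best  # (foreground index, background index, character)
```

Run that once per RGB value, caching as you go, and you end up with the lookup table. And, it works. I’m able to draw absolutely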
beautiful color images in my terminal, still using nothing but colored text. The only drawback for
video, is that the Windows terminal isn’t great at parsing these ANSI codes, and with multiple
formatting sequences per character, it’s no longer able to keep up with our framerate. But that’s
something I can fix through the magic of editing. Alright, that’s been enough tangents, let’s
get down to the real business of this video. How do we get this text encoding of Toxic onto
a floppy disk? If your answer was “drag and drop the text onto the floppy,” you are unfortunately
incorrect. While the source video is 18MB, the text encoding of Toxic, rendered at 160 columns
like I’ve been showing, comes out to just about 45MB. That’s about 30 floppies’ worth of data. So,
first things first. Let’s drop our resolution down by half in each dimension. I needed to do this
anyways to match the width of the old 80-column terminal in Windows, but that step alone gets us
to 11MB, which is now only eight-ish floppies, but still more than one. To bring this down
further, we’ll need some form of compression. I want this thing to be as simple as inserting
the floppy and clicking the program, so ZIP archives aren’t gonna cut it. Instead, I tried
a few different methods for compressing images with small color palettes, like the classic Run-Length Encoding. The idea is that in most drawn images, you’ll find long horizontal strips of the exact
same color, so, what’s easier to write down? 50 individual blue pixels, or one blue pixel times
50. In our case, since we only have 7 characters, there are really only 3 bits of significant data
per byte, and we can use the spare 5 bits to contain the run length for each character. In the
optimal case, one byte can stand for 32 pixels.
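The packing itself is just bit twiddling. A quick Python sketch, assuming a stream of ramp indices from 0 to 6:

```python
def rle_encode(indices):
    """Pack ramp indices (0-6) into run-length bytes: the low 3 bits hold
    the character index, the high 5 bits hold (run length - 1)."""
    out = bytearray()
    run_char, run_len = None, 0
    for c in indices:
        if c == run_char and run_len < 32:
            run_len += 1
        else:
            if run_char is not None:
                out.append(((run_len - 1) << 3) | run_char)
            run_char, run_len = c, 1
    if run_char is not None:
        out.append(((run_len - 1) << 3) | run_char)
    return bytes(out)
```

In practice, it’s not even close. The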
dithering patterns kill the RLE here, since on average, each run only covers two or
three characters before we need to start another. That didn’t work, so here’s another approach more
suited for text: Huffman Coding. In this scheme, we build up a decision tree to decode
bits from the compressed output. Most importantly though, different characters can
have different encoding lengths and we can exploit this to give the most common characters the
shortest encodings. It’s a brilliant technique, but, in the case where letters are used
fairly evenly, the results are only as good as the 3-bit encoding we started with.
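If you’ve never seen one built, here’s a compact, generic Huffman-code builder in Python; it’s not my exact implementation, but the idea is the same:

```python
import heapq
from collections import Counter

def huffman_codes(counts):
    """counts: {symbol: frequency}. Returns {symbol: bit-string}."""
    # each heap entry: (weight, tiebreaker, {symbol: code-so-far})
    heap = [(w, i, {sym: ""}) for i, (sym, w) in enumerate(counts.items())]
    heapq.heapify(heap)
    tie = len(heap)
    while len(heap) > 1:
        w1, _, c1 = heapq.heappop(heap)
        w2, _, c2 = heapq.heappop(heap)
        # merging two subtrees prepends one more bit to every leaf below
        merged = {s: "0" + c for s, c in c1.items()}
        merged.update({s: "1" + c for s, c in c2.items()})
        heapq.heappush(heap, (w1 + w2, tie, merged))
        tie += 1
    return heap[0][2]

codes = huffman_codes(Counter("aaaabbc"))  # e.g. {'a': '1', 'b': '01', 'c': '00'}
```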
One thing I started to play around with was constructing a Markov matrix of the transitions
between characters as we print out a line. Basically, what are the probabilities of one
character following another in our image? As it turns out, for each character, it’s highly likely that we stay on that character or go to a character of similar brightness, due to dithering. Given the previous character, we can rank each
choice of what comes next based on its likelihood, and I ended up encoding the image this way. Now,
instead of a one representing brightness level one, it means “pick the most likely next character.” A two means “pick the second most likely,” and so on. In this encoding, most of the pixels turn
into ones or twos, and since one is way more common than the other numbers, it encodes
nicely on a Huffman tree as a single bit. Theoretically, this algorithm can produce an 8x
compression rate, and in practice it gets about 5x. This method gets us all the way down to 2MB,
and the egotist in me would like to point out, that’s actually better than ZIP compression.
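Here’s a sketch of that rank transform, assuming a transition-count table freq[prev][next] has already been gathered from the whole video:

```python
def rank_encode(indices, freq):
    """Rank transform: each pixel becomes the rank of its character among
    the likely successors of the previous one (1 = most likely)."""
    ranks = []
    prev = 0  # assumption: each frame starts from a known character
    for c in indices:
        # successors of `prev`, most probable first
        order = sorted(range(7), key=lambda s: freq[prev][s], reverse=True)
        ranks.append(order.index(c) + 1)  # 1 = most likely, 2 = second, ...
        prev = c
    return ranks  # mostly 1s and 2s, which Huffman-code down to ~1 bit each
```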
2MB is still too big for a formatted floppy, but I feel like I’ve worked enough here, so I’m gonna
cut one last corner and switch the playback frame rate from 30 FPS to 15. I’m sure there are going
to be a bunch of 240FPS pro gamers in the comments saying otherwise, but honestly, when rendering
to text, I can barely tell the difference, and it cuts our data in half. With that, the entire
18MB video rendered in text can be compressed all the way down to just 1.1MB of data. Now
we just have to get it off of the floppy. This step involves a C program to extract
and print our compressed text. Writing it was mostly straightforward, though getting it to run
well on older hardware required a few fixes. To improve speed, I needed to cut my program down to
just one printf call per frame. The older terminal also had a hard time scrolling all the text my
program put out, so now instead of starting a new line when drawing a frame, I just put the cursor
back at the top left and draw over the last one.
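The real player is a C program, but the core loop is simple enough that a Python sketch captures it: decompress a frame, home the cursor with the ANSI escape, and push the whole thing out in one write:

```python
import sys
import time

def play(frames, fps=15):
    """frames: list of pre-rendered frame strings (already decompressed)."""
    for text in frames:
        # "\x1b[H" homes the cursor, so we overdraw instead of scrolling
        sys.stdout.write("\x1b[H" + text)
        sys.stdout.flush()  # one write per frame, like the single printf
        time.sleep(1 / fps)
```

With those fixes in place, let’s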
see it. Here I have my TC1100 tablet running Windows XP and the Toxic floppy. All we have to
do is pop it in, and run the program. It takes a little time to load the program in, but once
it’s done, we see the Toxic video. Pretty slick, if I do say so myself. There’s no sound, since I
was focusing mostly on the video for this project, but I’m sure with the right bitrate, and
maybe some more compression on the whole disk, you could probably squeeze in some audio
data to get the full Toxic experience. The cool thing is, since this is a 32-bit
executable, it should be able to run on Windows versions as far back as Windows 95. Testing
in a virtual machine, I can confirm that, yes, it definitely still runs on Windows
95. But, the output does have a hard time keeping up. Even so, that is still definitely
ASCII Spears on the screen, so I’ll take it. My original goal with this project was to be able to fit a reasonable-length video, and the software to play it back, all on a single floppy disk, and that’s exactly what I have in my hand here. All it took was converting the
video into a stream of ASCII characters, then figuring out a way to compress all that text.
I compiled it here for Windows, but this video should be displayable on just about anything with
an 80-column terminal. If anyone wants to make a port to some really obscure hardware, or make some
improvements to my program, I’d love to see it. Like always, I’ll have a link in the description
to the GitHub repo with the code necessary to textify and floppify any video you want, so be
sure to check that out. Anyways, there now exists a floppy disk port of the Toxic video. I’m not
sure who was asking for this, but you're welcome.