Nothing strikes fear into the heart of a Python programmer like a for loop in the midst of some performance-critical code. Today we're going to see how Python scales when looping over 1 billion rows of data. In this 13GB text file, each row contains a location and a temperature measurement, separated by a semicolon. Our goal is to group these measurements by location and compute the min, max, and average values for each location as fast as possible. This is called the 1 billion row challenge. Though it started in the Java community, several other languages, including Python, have since unofficially thrown their hats into the ring.

In this video, I'm going to show you how Python stacks up against its compiled competition. What strategies did our community use? How well did they perform? And can I tweak them to make them even faster? By the end of this video, I'll walk you through what I think is the fastest pure Python approach using only built-in libraries. And as a bonus, if you're willing to stretch your definition of Python a bit, I'll show you two shockingly easy and blazingly fast approaches that are just one pip-installable package away. Let's dive in.

Let's start with a dead simple approach first. I got it from this blog post here, but it's more or less the first thing that anyone would write. Open the file, read it line by line, and build up a nested dictionary mapping the city name to the sum, count, min, and max. We can compute the average from the sum and count when we eventually print the results in the required format at the end.
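For reference, here's a minimal sketch of that kind of naive baseline, assuming the input file is named measurements.txt and using my own guess at the output formatting rather than the blog post's exact code:

```python
# Naive single-process baseline: read line by line and keep a nested dict of
# running statistics per city. File name and output format are assumptions.
def run(path="measurements.txt"):
    stats = {}
    with open(path, "r", encoding="utf-8") as f:
        for line in f:
            city, measurement = line.split(";")
            value = float(measurement)
            entry = stats.get(city)
            if entry is None:
                stats[city] = {"min": value, "max": value, "sum": value, "count": 1}
            else:
                entry["min"] = min(entry["min"], value)
                entry["max"] = max(entry["max"], value)
                entry["sum"] += value
                entry["count"] += 1

    # The average is only computed from the sum and count when printing.
    print(
        "{"
        + ", ".join(
            f"{city}={s['min']:.1f}/{s['sum'] / s['count']:.1f}/{s['max']:.1f}"
            for city, s in sorted(stats.items())
        )
        + "}"
    )


if __name__ == "__main__":
    run()
```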
Using Python 3.12 on an M1 Max MacBook Pro, this approach takes 9 minutes and 28 seconds, which is hilariously awful, especially because the baseline Java implementation included within the challenge runs in 3 minutes and 12 seconds. So why did the author even write this blog post? It certainly wasn't for the 9-minute time. Well, they wanted to see how CPython compares with the JIT-compiled PyPy interpreter. Swapping out CPython 3.12 for PyPy, the time drops to 5 minutes and 3 seconds. For the rest of our pure Python approaches, we're going to use the PyPy interpreter, because this free performance improvement is hard to pass up.

Still, this sucks. Thankfully, there are a few easy things we can do. Rather than using a dictionary to store the statistics for each city, replacing it with a simple list brings the runtime down from 5 minutes and 3 seconds to 3 minutes and 43 seconds. And using byte strings instead of Unicode again cuts our time down to 2 minutes and 42 seconds, which finally puts us ahead of the challenge baseline.

But none of these tiny tweaks or micro-optimizations have any chance of fixing the root cause of our performance problem: we're only using a single CPU core. Now, I'm no mathematician, but if one core is good, more cores is better. So let's take a look at some implementations that use every core our system has to offer.

The two fastest pure Python implementations that I found were submitted by GitHub users Ifnessi and Booty. They both follow a similar approach. First, break the file into chunks, defined by their start and end points in bytes. Each chunk should be roughly the size of the file divided by the number of CPU cores, so the work is distributed evenly, but they need to make sure that every chunk ends on a newline character. Otherwise, things will get messed up. Then create a multiprocessing pool with one worker for each core, give each process the start and end points of its chunk, and let it read and process its assigned portion of the file.

Now, some minor details differ between these two approaches. Let's look at Ifnessi's code to start. Running the code with PyPy, this approach runs in about 30-ish seconds, or about a 5x speedup from our single-process approach. But Ifnessi's implementation leaves some obvious performance improvements on the table. If we parse the file using bytes instead of Unicode, we bring the runtime down to 23 seconds, which is a 7x speedup on my 10-core system.

Let's hop over to Booty's approach. Booty uses memory mapping to map the file into the process's virtual address space, rather than directly manipulating the file handle. In principle, this can make certain I/O operations faster, so it's definitely worth a shot. Beyond that, Booty asynchronously dispatches jobs to the multiprocessing pool. He does this as soon as each chunk's start and end points have been found, rather than waiting around until the full list of chunks has been produced. This could let the first process get a bit of a head start, but we'll have to measure it to see if it pays off. Finally, Booty's code had a subtle bug in it, so it didn't run out of the box, and I had to dig through the densest, most annoying memory mapping logic with a fine-tooth comb to spot the error. To me, this is kind of a reminder that the more code you write to squeeze out performance, the more likely you are to introduce bugs. But after patching the bug, Booty's code runs in about 21 seconds on my machine, so maybe a bit better than Ifnessi's, but not by a huge margin.

That said, I did spot a few things that could be improved on my first read-through. In the performance-critical function that processes each chunk, we can defer decoding the location from bytes to Unicode until printing the final result. Also, I'm not a fan of these global variables used to encode which element of the list is the sum, count, min, max, or average, so I removed them, and I also removed the average element from the list because we don't use it. Elsewhere in the code, I wanted to do a bit of housekeeping. We can use the find method on our memory-mapped file to replace this clunky loop-based search for the next newline character. And finally, in the reduction step, I managed to save three dictionary lookups per city by creating a temporary variable, but this shouldn't matter because it's not in the hot loop of the code. After all these changes, I had what I like to call Doug Booty version 1. It ran in about 18 to 20-ish seconds, which isn't anything earth-shattering, but a second or three is progress, and we've firmly pulled ahead of Ifnessi's 23-second runtime.

So what's the difference between these two approaches? Well, it's almost surely not the asynchronous submission to the pool, as generating all of the chunk boundaries takes about one four-thousandth of a second, which is firmly in the realm of who cares. So in the interest of simpler code, for Doug Booty version 2, I'll use the simpler approach of initializing the pool using a context manager and dispatching all the jobs at once using starmap, similar to Ifnessi's approach. On the topic of context managers, I might as well use mmap's context manager while I'm at it, to help clean things up. Doing a bit of testing, I didn't find any difference between using an if statement or the min and max functions when updating the statistics, so for no other reason besides liking the look of them better, in they go.

Finally, to wrap up this round of refactoring, I'll un-hardcode our input file to make this function a bit more flexible, and fix another subtle bug that was hiding in the code. After making all of these changes, nothing has changed with respect to performance. But that's fine, I just wanted cleaner code.
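To make the overall shape concrete, here's a rough sketch of this kind of chunked, memory-mapped, multiprocessing implementation. It's a simplified illustration under my own assumptions about the file name, function names, and output formatting; it isn't Ifnessi's or Booty's actual code, and it isn't exactly mine either.

```python
import mmap
import multiprocessing as mp

FILE = "measurements.txt"  # assumed file name


def chunk_boundaries(path, n_chunks):
    """Split the file into byte ranges that each end on a newline."""
    with open(path, "rb") as f, mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
        size = len(mm)
        target = size // n_chunks
        boundaries, start = [], 0
        while start < size:
            end = min(start + target, size)
            # Push the end forward to the next newline so no line is split.
            end = mm.find(b"\n", end)
            end = size if end == -1 else end + 1
            boundaries.append((start, end))
            start = end
        return boundaries


def process_chunk(path, start, end):
    """Build per-city [min, max, total, count] stats for one byte range."""
    stats = {}
    with open(path, "rb") as f, mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
        # Slicing copies the chunk into memory; fine for a sketch.
        for line in mm[start:end].splitlines():
            city, _, measurement = line.partition(b";")
            value = float(measurement)
            entry = stats.get(city)
            if entry is None:
                stats[city] = [value, value, value, 1]
            else:
                entry[0] = min(entry[0], value)
                entry[1] = max(entry[1], value)
                entry[2] += value
                entry[3] += 1
    return stats


def merge(results):
    """Reduce the per-chunk dictionaries into one."""
    totals = {}
    for stats in results:
        for city, entry in stats.items():
            existing = totals.get(city)  # one lookup, held in a temporary variable
            if existing is None:
                totals[city] = entry
            else:
                existing[0] = min(existing[0], entry[0])
                existing[1] = max(existing[1], entry[1])
                existing[2] += entry[2]
                existing[3] += entry[3]
    return totals


def main():
    n_workers = mp.cpu_count()
    jobs = [(FILE, start, end) for start, end in chunk_boundaries(FILE, n_workers)]
    with mp.Pool(n_workers) as pool:
        results = pool.starmap(process_chunk, jobs)
    stats = merge(results)
    # Decode city names only here, outside the hot loop.
    print(
        "{"
        + ", ".join(
            f"{city.decode()}={mn:.1f}/{total / count:.1f}/{mx:.1f}"
            for city, (mn, mx, total, count) in sorted(stats.items())
        )
        + "}"
    )


if __name__ == "__main__":
    main()
```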
So is 19 seconds the best that we can do? Well, I almost thought so. But before giving up, I wanted to look through some of the top submissions from other languages, and there I stumbled across a really neat trick. Since our temperature measurements all have exactly one digit after the decimal point, if we imagine multiplying them by 10, we'd have integer-valued measurements. And in fact, we can directly parse each measurement as an integer by concatenating the two slices of the byte string before and after the decimal point, and then casting that as an integer instead of a float. We just need to make sure that we adjust our measurements by a factor of one tenth when printing our final results, but since that's outside of our hot loop, it's basically free. So where does this little trick get us? Does it eke out another second or two? No. It actually gets us all the way down to 11 seconds. I am pretty confident that this is the fastest pure Python implementation that anyone wrote for the one billion row challenge, and I don't think it can get much better.

Well, I did think that, but I was wrong, and I came up with a better idea while I was editing, so let me show you that now. We can roll our own parser. There are four possible cases: ignoring the decimal point for a moment, our number is either two or three digits, and either positive or negative. It's easy to find the sign of our number by looking at the first character of our input, so a quick test for that tells us if we need to multiply our final result by minus one. But we also learn something else: we know that the first digit of our number starts on the next character. So if we keep track of where our first numeric digit is located, all that remains is to distinguish between two- and three-digit numbers. The cheapest way to do that is to check whether the character after our first digit is a decimal point. Based on the outcome, we can multiply our digits by the appropriate powers of 10, add them up, and multiply by our sign. We do need to be careful, though, because the character "0" has a byte value of 48, so each raw digit byte is 48 larger than the digit it represents. Adding up the place-value weights (10 + 1, or 100 + 10 + 1), we can subtract off 11 times 48 for two-digit numbers, or 111 times 48 for three-digit numbers. This speeds up our average running time by about 1.2 seconds, giving us a sub-10-second running time. I'm going to put a link to my implementation in the video description in case anyone can find other ways to speed this up.
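Concretely, the slice-and-concatenate trick looks something like this; the helper name is mine, and it returns the measurement in tenths of a degree:

```python
# Parse b"-12.3" as the integer -123 by splitting on the decimal point and
# gluing the digit bytes back together. The helper name is illustrative.
def parse_tenths(measurement: bytes) -> int:
    head, _, tail = measurement.partition(b".")
    return int(head + tail)  # b"-12" + b"3" -> b"-123" -> -123

# When printing the final results, divide by 10 to get back to degrees,
# e.g. f"{mn / 10:.1f}" for the min and f"{total / (10 * count):.1f}" for the mean.
```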
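And here's a sketch of the hand-rolled parser just described, meant as a drop-in replacement for the version above. The exact structure and naming are my own; the important part is the four cases and the ASCII offset bookkeeping.

```python
# Measurements look like b"7.2", b"42.9", b"-7.2", or b"-42.9", so there are
# only four shapes to handle. Indexing a bytes object gives raw byte values:
# ASCII "-" is 45, "." is 46, and "0" is 48, so every digit byte over-counts
# by 48 -- 11 * 48 total for the two-digit weights (10 + 1), and 111 * 48 for
# the three-digit weights (100 + 10 + 1). Returns the value in tenths.
def parse_tenths(measurement: bytes) -> int:
    if measurement[0] == 45:  # leading "-"
        sign, first = -1, 1
    else:
        sign, first = 1, 0
    if measurement[first + 1] == 46:  # "." right after the first digit: d.d
        value = measurement[first] * 10 + measurement[first + 2] - 11 * 48
    else:  # two digits before the point: dd.d
        value = (
            measurement[first] * 100
            + measurement[first + 1] * 10
            + measurement[first + 3]
            - 111 * 48
        )
    return sign * value
```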
But I'm getting ahead of myself. Let's get back to the video.

So I'm declaring this pure Python approach finished. That said, this approach was freaking complicated. So just to be clear, if I ever needed to chunk through billions of rows of data in my day job, I definitely wouldn't choose to write this code to do it. Python is incredibly fortunate to have a ton of high-performance libraries written in compiled languages that are just one pip install away. So as much fun as this jitted, memory-mapped, multiprocessing monstrosity was, if we're willing to stretch our definition of what Python code is just a little bit, I think we can do a lot better in practice.

Polars is a data frame library written in Rust. It has Python bindings and is easy to install with pip. So what does a Polars implementation of the 1 billion row challenge look like? First, scan the file, which is kind of like reading the file except it lets us stream the data instead of loading it all into memory at once. Then group the data by city, aggregate the values into columns, and sort the results. Finally, collect the results to execute the computation and print them in the desired format using standard Python. I'll put a rough sketch of both this and the DuckDB version below. Several authors found that messing with the streaming chunk size helped, so I tweaked it slightly to get the best possible performance on my machine. In the end, the Polars approach ran in about 11 or 12-ish seconds, which is just as fast as our best pure Python approach. So given a choice between the two, I'd take Polars in a heartbeat because it's just so much simpler.

But it's not actually just a choice between the two. I have one more approach to show you. DuckDB is a really cool SQL database library. If you're just looking for quick and dirty data analysis, you can easily read data from CSV or Parquet files into an in-memory database and slice and dice it with SQL syntax. Or if you want a persistent store of your data, you can write the database to files, so that you basically have SQLite on steroids. DuckDB has a command line interface, a REPL, and a bunch of language bindings, including Python. So that makes it Python, right?

So here's what that implementation looks like. We first connect to an in-memory DuckDB database. We'll select the station name as well as the min, average, and max values. We parse the text file using the very helpful read_csv function and group our results into rows by station name. Once DuckDB has done its job, we can format and print the results using vanilla Python. This runs in about nine-ish seconds, which finally breaks the sub-10-second boundary. And just to show off DuckDB a bit more, imagine for a moment that we live in a more civilized world, and instead of storing the data in a delimited file, the challenge happened to store it in a Parquet file. DuckDB could carve through that data in a speedy five-ish seconds, which is very impressive for how flexible and easy to use it is.
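Here's roughly what that Polars version looks like. The column names, the chunk-size value, and the output formatting are my own choices, and the Polars API has shifted a bit across versions, so treat this as a sketch rather than a definitive implementation.

```python
import polars as pl

# Several submissions tuned the streaming chunk size; this value is just an
# example, not a recommendation.
pl.Config.set_streaming_chunk_size(8_000_000)

result = (
    pl.scan_csv(
        "measurements.txt",  # assumed file name
        separator=";",
        has_header=False,
        new_columns=["station", "measure"],
    )
    .group_by("station")
    .agg(
        pl.col("measure").min().alias("min"),
        pl.col("measure").mean().alias("mean"),
        pl.col("measure").max().alias("max"),
    )
    .sort("station")
    .collect(streaming=True)  # execute the lazy query with the streaming engine
)

# Format the output with plain Python.
print(
    "{"
    + ", ".join(
        f"{row['station']}={row['min']:.1f}/{row['mean']:.1f}/{row['max']:.1f}"
        for row in result.iter_rows(named=True)
    )
    + "}"
)
```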
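And a sketch of the DuckDB version. Again, the column names and the final formatting are my own choices; the aggregation itself runs entirely inside DuckDB.

```python
import duckdb

# Connect to an in-memory database and let DuckDB parse and aggregate the file.
con = duckdb.connect(":memory:")
rows = con.sql(
    """
    SELECT station,
           MIN(measure) AS min_measure,
           AVG(measure) AS mean_measure,
           MAX(measure) AS max_measure
    FROM read_csv(
        'measurements.txt',
        header=false,
        delim=';',
        columns={'station': 'VARCHAR', 'measure': 'DOUBLE'}
    )
    GROUP BY station
    ORDER BY station
    """
).fetchall()

# Format the output with plain Python.
print(
    "{"
    + ", ".join(f"{station}={mn:.1f}/{mean:.1f}/{mx:.1f}" for station, mn, mean, mx in rows)
    + "}"
)
```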
But how good was the challenge winner? Well, on my machine, the winning Java submission took about 1.4 seconds, and another highly optimized C implementation took about 2.2 seconds. There's simply not much performance left to squeeze out once you're down to low single-digit running times.

So what does this mean for you? Well, if you aren't a Python interpreter purist, the best pure Python approach is about eight times slower than the fastest Java implementation. All things considered, I was pleasantly surprised, and as far as I'm concerned, PyPy is just magic. If you're even more flexible with your definition of the word Python and consider extension libraries fair game, then you should absolutely reach for Polars or DuckDB for your data processing tasks. They were fast, simple to write, and I am perfectly happy to run code that's six times slower if it means I never have to read or write Java.

Writing fast and efficient code is a hands-on and iterative process where we try things, we get feedback, and we use what we learn to guide our next steps. I find this workflow really satisfying and helpful for learning about what works with Python performance tuning. If you do too, I think you'll really enjoy today's video sponsor, Brilliant.org. Brilliant is where you learn by doing, with thousands of interactive lessons in math, data analysis, programming, and AI.

As a learning platform, Brilliant is incredibly effective because their first-principles approach helps you build an understanding from the ground up. The hands-on problem solving in each lesson lets you play with concepts yourself. And like optimizing code, it's far more effective to actively engage with concepts and receive feedback than it is to passively go through the motions, like only listening to a lecture or only making changes to your code without knowing which parts are slow. Brilliant helps you build critical thinking skills. You don't just memorize things, you become a better problem solver. Brilliant's fun lessons are a great replacement for mindlessly doomscrolling on your phone. Just a few minutes every day can build real knowledge that will help you achieve both personal and professional growth.

Recently, I've enjoyed working through How LLMs Work, which gives you a hands-on look at what's going on under the hood of these otherwise opaque models. You explore how LLMs build their vocabulary, choose their next word, and more. The course interactively illustrates its key concepts by letting you play with models trained on different data sets, and even gives you a chance to tune an LLM to tailor it for different applications. To try everything Brilliant has to offer for free for a full 30 days, visit brilliant.org/DougMercer or click the link in the video description. You'll also get 20% off an annual premium subscription.

If you enjoyed this video, you'll also enjoy this one, where I pit four compiled Python approaches against C++ on a dynamic programming problem. Or, if you'd like to hire me for one-on-one Python coaching, code review, or pair programming, you can check out my website at dougmercer.dev to find out more. Thanks for watching, and why don't you hit up the plus sign on the way out. Thumbs up, whatever it is. Thanks.