Blame Me: The Windows Progress Dialog!

Video Statistics and Information

Captions
Hey, I'm Dave. Welcome to my shop! Today we're putting the Windows shell progress dialog to the test. Love it or hate it, I used to own the progress dialog when I worked at Microsoft. Recently I was on Facebook, of all places, when someone sent me a cartoon from one of my very favorite comic strips, xkcd. That's nothing new, but the fact that I was apparently the SUBJECT of this cartoon is what made it interesting to me! Let's have a look: "The Author of the Windows File Copy Dialog Visits Some Friends." "I'm just outside town, so I should be there in fifteen minutes." "Actually, it's looking more like six days." "No wait, thirty seconds!" I love it. It's funny because it's true, and because I can take a joke, even if it's somewhat at my expense. But that's all right, it's partly deserved, and at least I get the joke. I'm Dave Plummer, retired operating system engineer from Microsoft going back to the MS-DOS and Windows 95 days, and today I'll tell you about the secrets behind how that progress dialog works, why it's so wildly wrong sometimes, and why it's such a hard problem to get right in the first place! We'll also test the accuracy of the shell by putting it to a real-world test - copying the entire system32 folder from one SSD to another. And we'll do it on a PCIe 4 SSD RAID set capable of 10 gigabytes a second, just for good measure! Now I need to be really clear that I'm just one of many devs across the ages who has owned, worked on, designed, or otherwise contributed to the progress dialog. I think the very original came out of the Windows 95 project, of course, and it evolved with Windows XP and on down the line. I started working on it periodically around 1995, when I added support for a number of things that XP would have that Windows 95 did not, such as multiple streams in a single file, compressed files, encrypted files, and so on. At some point we added Yes to All and No to All and Skip and so on, all intended to make life a little easier. I worked on various aspects of it off and on through about 2003, but only sporadically as things came up. The last thing I worked on, though, was in fact the time estimate dialog, so the cartoon seemed particularly prescient. At the time, the first and most pressing problem was to improve the shell's time estimate for long operations like copying a large number of files. Even back then it was the butt of jokes at times. Now off the cuff, you might think this is a fairly simple problem, but I can assure you that it's anything but. Here's the single biggest reason that the shell's estimation can be so wrong: as Raymond Chen has aptly noted in his blog, the shell is trying to predict the future. And it is trying to predict the future solely based on what has happened so far, such that early in the process it's even more likely that the estimation will be wildly wrong, because it's based on only a tiny subset of the entire process. You might think, hey, I know my drive reads at 100MB per second and I'm copying 500MB of files, so it should take about 5 seconds. Experience has likely shown you, however, that it's not usually quite that simple. A console app like xcopy doesn't provide any estimate at all, so it doesn't pre-walk the tree of files and folders that you've selected in order to make an inventory of work to be done. It simply starts at the beginning and works away until it is done, at best telling you which file it's currently working on.
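For reference, here's that naive back-of-the-envelope estimate as a tiny sketch. The numbers are just the illustrative ones from above, and the code is purely mine for illustration - it's not anything from the shell:

```cpp
#include <cstdint>
#include <iostream>

int main()
{
    // The naive estimate: total bytes divided by an assumed, fixed throughput.
    // Figures are the illustrative ones from the example, not real measurements.
    const std::uint64_t totalBytes     = 500ull * 1000 * 1000;  // 500 MB selection
    const std::uint64_t bytesPerSecond = 100ull * 1000 * 1000;  // assume 100 MB/s reads

    const double estimatedSeconds =
        static_cast<double>(totalBytes) / static_cast<double>(bytesPerSecond);

    std::cout << "Naive estimate: " << estimatedSeconds << " seconds\n";  // ~5 seconds
    return 0;
}
```

The rest of the video is essentially about why that arithmetic, reasonable as it looks, falls apart in practice.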
But the desktop shell wants to provide much more feedback than that, so it walks the entire selection in advance to count the files and folders and add up their sizes and so on. That means the shell also knows the total number of bytes to be copied, but the problem is that the total size of the files is only one small factor in the outcome. The shell does not always need to look back at the entire copy operation; instead it could bias the calculations towards what has happened most recently. But the reality even so is that it cannot see into the future - it doesn't know, because it can't know, how much bandwidth will be free a minute from now. It doesn't know how busy the disk or the bus will be, or if writes are saturated on the SSD, or if the cache needs to be written back. It can only assume that the future will be like the past, so it bases all its predictions about the future on past performance. That sounds like a reasonable thing, but things can change along the way. Network conditions can change, multitasking apps can slow down I/O, your multi-tier SSD can slow down significantly as large copies progress, and so on. Let's imagine something like the cartoon, where you're taking a trip from Seattle to New York to visit a friend, and I'm the Windows shell, trying to predict how long it will take you to get there. You begin your journey by walking out to the taxi. Well, you're only going about 2 miles per hour while walking, so I can assume it will take you several months to reach New York. Once we're in the cab on the freeway, however, we're going 60 miles per hour, so it might be only a week or two to get to New York. But then you're back on the little train in the terminal going 20, and the estimate gets longer again. Soon I update my estimate as you board the aircraft: three weeks to fly to New York. Once you finally get underway on the aircraft, now you are exceeding 600 miles per hour. Because the shell is smart, maybe it pays more attention to how fast you've been going recently than at the start of the trip, but that also means it predicts that you will be at your friend's apartment in Manhattan about 30 seconds after landing at JFK, because after all, you've been averaging about 95% of the speed of sound for the last several hours, and that apartment is now only a few miles away! It's a little like that. Just when the shell thinks it has a reasonable estimate and stakes its reputation on providing you that guess, conditions change enough to make the prediction look downright silly sometimes. But notice that if you used Google Maps or Waze to calculate your trip from the airport to your friend's house, it can be surprisingly precise, even with traffic and conditions changing on the fly. In both cases what the computer is being asked to do is still to predict the future. So why can Google predict a drive uptown with far better accuracy than the shell can a file copy from one drive to another? It's because Google has access to vast amounts of historical data that it can draw on from other users driving those same streets in the past. But that model isn't very useful for computers, because the amount of time it took for you to download and install Flight Simulator might be very different than it would be for me. Odds are the first part of my download is more predictive than knowing your entire download, so there's not a lot of point in using your data to estimate my download once mine has been underway for a little bit.
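One simple way to "bias the calculations towards what has happened most recently," as described above, is an exponentially weighted moving average of observed throughput. This is only a sketch of that general technique under my own assumptions - it is not the shell's actual implementation, and the class name, smoothing factor, and units are made up for illustration:

```cpp
#include <cstdint>

// Sketch of a recency-biased throughput estimator using an exponentially
// weighted moving average (EWMA). Not the shell's actual code.
class ThroughputEstimator
{
public:
    explicit ThroughputEstimator(double alpha = 0.3) : m_alpha(alpha) {}

    // Call once per progress tick with the bytes moved during that tick
    // and how long the tick took, in seconds.
    void Update(std::uint64_t bytesThisTick, double secondsThisTick)
    {
        if (secondsThisTick <= 0.0)
            return;
        const double observed = bytesThisTick / secondsThisTick;  // bytes per second
        if (m_bytesPerSecond == 0.0)
            m_bytesPerSecond = observed;  // the very first sample seeds the average
        else
            m_bytesPerSecond = m_alpha * observed + (1.0 - m_alpha) * m_bytesPerSecond;
    }

    // Seconds remaining, assuming the near future resembles the recent past.
    // Returns a negative value until we have at least one sample ("Estimating...").
    double SecondsRemaining(std::uint64_t bytesLeft) const
    {
        return (m_bytesPerSecond > 0.0) ? bytesLeft / m_bytesPerSecond : -1.0;
    }

private:
    double m_alpha;                 // higher alpha puts more weight on recent samples
    double m_bytesPerSecond = 0.0;  // smoothed throughput estimate
};
```

Note how the very first sample seeds the estimate entirely, which is exactly why the earliest guess is the one most likely to be wildly wrong, and why recency weighting can still be fooled - averaging 95% of the speed of sound says nothing about the cab ride from JFK.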
You're probably better off using the current speed than the historical average from other users on different ISPs. For a single large copy operation, once it has been underway for a while such that the shell has been able to gather some data about the speed of the progress, it can be refreshingly accurate with its predictions, but this is only one type of shell operation. Consider the case of moving a large folder, like an offline copy of the Windows folder itself, from your internal SSD to an external data drive. Now it's not as simple as just keeping track of how long the first part of the operation took, because measurements based on that first part might not have anything to do with the last half of the actual operation. What do I mean by that? Well, let's look at all the different types of files and folders you might find in a Windows directory. There's quite a mix - tiny .ini files, small configuration files, medium-sized font files, large dll files, and even a few huge files among them. It depends on where the operation starts. Let's say it starts with the fonts folder, which is mostly smaller files of regular, predictable size, usually under a megabyte. You can tear through them fairly easily, and at a consistent rate. So you think, well, I'll just measure the number of bytes per second of throughput that my copy can do at the beginning and extrapolate that. And at its most basic level, the shell is doing something just like that. But anyone who's done a large copy and watched the performance knows that the speed can vary dramatically throughout. Copying large, monolithic files is a breeze, and you can crank the bytes through easily. But when it's small files, there's a lot of filesystem overhead. For example, a file entry has to be created and populated and linked into the volume structure in the filesystem, and space allocated from the disk bitmap or whatever system it uses, all before any bytes can even be accepted, and that's done for each and every little file. This really slows things down, at least relatively. That's why copying the large dlls in system32 might breeze along at 100MB per second, yet while copying the many fonts it drops to a mere 10MB per second. Same disk, same filesystem. The folders are right next to each other on the disk, but copying from them takes wildly different amounts of time per megabyte. That's why knowing the raw speed of your disk, or keeping a record of it, isn't all that helpful. This is a problem for moving and copying, of course, but it's also a problem for operations such as changing attributes or deleting subfolders, for similar reasons - not to beat a dead horse here, but just as you hear every time a stockbroker commercial comes on TV: measurements of past performance are no guarantee of future results. How can we improve the estimate? How can we make it smarter? Well, when I was working on it, one of the first things I realized is that different operations can take markedly different times to complete. Renaming a file is faster than deleting it, and deleting a file is faster than moving it. Moving a file is faster when it stays on the same volume than when it goes to a different one. Moving a file to a different volume is just slightly slower than copying it, because the original must be removed and that's extra work. So, as you can see, it starts to get complicated with all the combinations and permutations.
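To put the small-file overhead argument into numbers, here's a hypothetical cost model: every file pays a fixed bookkeeping cost (create the entry, link it into the volume, allocate space) on top of the time for its bytes. The overhead and throughput figures below are assumptions chosen purely for illustration, not measurements of any real disk:

```cpp
#include <cstdint>
#include <iostream>

// Illustrative cost model: time = files * per-file overhead + bytes / raw throughput.
// The overhead and throughput figures are assumptions, not measurements.
double EstimateSeconds(std::uint64_t fileCount, std::uint64_t totalBytes,
                       double perFileOverheadSec, double bytesPerSecond)
{
    return fileCount * perFileOverheadSec + totalBytes / bytesPerSecond;
}

int main()
{
    const double overhead   = 0.005;  // assume 5 ms of filesystem bookkeeping per file
    const double throughput = 100e6;  // assume 100 MB/s of raw sequential throughput

    // One big 500 MB file: the per-file overhead is negligible.
    std::cout << EstimateSeconds(1, 500'000'000, overhead, throughput) << " s\n";    // ~5 s

    // 5,000 small files totaling the same 500 MB: the bookkeeping now dominates.
    std::cout << EstimateSeconds(5000, 500'000'000, overhead, throughput) << " s\n"; // ~30 s
    return 0;
}
```

Same 500 MB either way, but once thousands of little files are involved, the per-file bookkeeping swamps the raw transfer time - which is the fonts-versus-dlls effect described above.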
What I wound up doing in my initial design was to break every large shell operation down into discrete tasks, time them as they happened, and keep track of averages. Let's take moving 100 files of 100K each from one disk to another as an example. In our various categories of actions, we have things like creating a new file, removing an existing file, renaming a file, deleting a file, and so on. We also want to keep track of the time needed to move or copy any actual data as well, so we add another category for "copying data bytes". Then we undertake to begin the operation. We'll let it run for a while, display a message like 'Estimating' in place of an actual time estimate, collect statistics, and then use those statistics to predict the future. We'll continually feed information about the ongoing operation back into the copy engine, and it can use those statistics to update its predictions for future performance. It will run in a closed loop and continually get better. Heck, at least as it approaches the end of a long operation it should be fairly accurate. If we're going to play junior operating system designer here, we have to treat some problems as practical rather than theoretical - like how long do we leave up the "Estimating" or "Calculating" message? 10 seconds? But then you'd never get information for any operations shorter than 10 seconds! What about until 10% of the data is complete, so we at least have some historical context to guess from? But all these rules have edge cases - like for a 4K video copy that takes an hour... do you really want to wait more than 5 or 6 minutes for the very first estimate? So perhaps you must make more complicated rules - 10 percent, but not longer than 20 seconds, something like that. No matter what heuristic you decide on, however, be prepared to be wrong sometimes. And when you're wrong, people will be harsh. Try not to be too wrong too often. Thinking about this a little further, you could maybe store some statistics about each disk volume in the registry. That way, when starting off a copy from the C drive, you're not working blind. You can use some data from the last big operation on the C drive to seed the estimate for next time. But if last time you were copying from C to a floppy drive, does it really tell you anything informative about copying from C to an external SSD? Well, perhaps you're thinking you should store your prior stats based on both source and destination. But remember, not all networks have symmetrical performance. And some media, whether it's a CD-R blank or a RAID 6 NAS volume, writes at a wholly different speed than you can read from it. So, I guess you really have to store those statistics by source and target pair, PER DIRECTION. And the thing is, if you were to just repeat the very same operation, the disk cache may come into play that second time around and yield completely different performance yet again. So perhaps you can see how trying to keep historical data, if not exactly pointless, is much more complicated than it seems at first blush. The copy engine and the progress dialog, not to mention the confirmation dialog, have all evolved continually since those early days, and today I'm curious to see just how accurate it is. So, let's put it to the test. I'm going to time some actual I/O operations and see how accurate the current shell actually is. Let's look at a real-world example.
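Here's a rough sketch of the kind of bookkeeping that design implies: each discrete action gets its own running average, and the remaining work is priced against those averages as they accumulate, so the estimate improves as the operation proceeds. All the names and the structure here are my own reconstruction for illustration - the real copy engine is far more involved:

```cpp
#include <array>
#include <cstdint>

// Rough sketch of per-category timing as described above. The categories,
// names, and structure are illustrative reconstructions, not shell code.
// Trailing underscores just avoid clashing with Windows API macro names.
enum Category { CreateFile_, DeleteFile_, RenameFile_, CopyBytes_, CategoryCount };

struct RunningAverage
{
    double        total   = 0.0;  // accumulated cost: seconds per action, or seconds per byte for CopyBytes_
    std::uint64_t samples = 0;

    void   Add(double value) { total += value; ++samples; }
    double Average() const   { return samples ? total / samples : 0.0; }
};

class OperationEstimator
{
public:
    // Record the measured cost of one completed action; for CopyBytes_,
    // record the observed seconds per byte.
    void Record(Category c, double cost) { m_stats[c].Add(cost); }

    // Price the remaining work against the averages gathered so far.
    // Returns a negative value while there are too few samples ("Estimating...").
    double SecondsRemaining(const std::array<std::uint64_t, CategoryCount>& remainingCounts,
                            std::uint64_t remainingBytes) const
    {
        if (m_stats[CopyBytes_].samples == 0)
            return -1.0;  // still estimating

        double estimate = 0.0;
        for (int c = 0; c < CategoryCount; ++c)
            if (c != CopyBytes_)  // per-action categories: count * average seconds
                estimate += remainingCounts[c] * m_stats[c].Average();

        estimate += remainingBytes * m_stats[CopyBytes_].Average();  // average seconds per byte
        return estimate;
    }

private:
    std::array<RunningAverage, CategoryCount> m_stats{};
};
```

Until the "copy bytes" category has at least one sample, a caller would show "Estimating..." instead of a number, which is exactly where the practical question above comes in: how long do you wait before committing to a guess - ten percent of the data, twenty seconds, or some combination?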
I'm going to copy my entire system32 folder from the boot SSD over to my data SSD, which is actually a RAID set of three fast PCIe 4 SSDs. To get a sense of the raw I/O speed the source and destination drives are capable of, I'm going to first test them both with CrystalDiskMark. Let's do that now. [CrystalDiskMark Test] Now I've increased the transaction size to 2 gigabytes to get around some caching, and note that I've sped this footage up, as it actually takes a while to run. This is the boot drive, and I've heard that there are I/O problems under Windows 11, but that seems about what I had before, so I'm not sure whether I'm just not experiencing them or... in any event, 2560 on the read side. We're going to be looking at about 10 gigabytes maximum read on the drive we'll be writing to, and a maximum write of 4200, so obviously the bottleneck will be the source drive at 2560 megabytes per second, going on down as your queue size and thread count decrease. Now that we know we can write at least a gigabyte a second, and read even faster, you'd think we could get a reasonable estimate. After all, we're using fast SSDs on both source and destination, so it's not like small files are going to cause the heads to thrash around. But if we check the size of system32, it's about six and a quarter gigabytes. So six gig at a gig a second... carry the one... that's like six seconds. But we both know there's no chance it completes anywhere near that quickly. Rather than speculate, I broke out the stopwatch and manually copied system32 so that we could compare actual time with the estimated time. Let's have a look! [ System32 copy ] Alright, as we watch this copy my system32 to the data drive, we can see the first estimate that comes in is one minute and 45 seconds for an estimated 21,000 items. Now, had it stuck with its six minutes 30, it would have been actually pretty close to correct, but a broken clock is right twice a day too, so it's hard to know if that's luck or accuracy. You're a busy person, so we don't have time to watch this in real time - I had to watch it once, and it was already painful watching my system32 copy - so as we jump ahead, we'll see the next estimate runs up to 9 minutes and 30 seconds all of a sudden. A little further along, we'll see this estimate gets revised all the way up to 15 minutes... oh wait, now I'm at 4 minutes. 4 minutes becomes 7 minutes - at seven minutes you could make a cup of coffee - oh, but wait, it's only 30 seconds now. It continues to oscillate around depending on the size and the nature of the files it's encountering, and we can see that by the end here it says there are two minutes left at the six-minute mark, and then it actually just ends. So it was actually surprised by the end of the copy, it appears. As we just saw, the shell can be wildly wrong. And I picked this example for two reasons - because I felt it represented a real-world use case of copying a mixed payload of files, and because I knew it would be very, very wrong. As we saw, the transitions from a bunch of small files to a few larger files and back and forth continually caused it to be wrong in its estimates. Now in fairness, I don't know that the mechanism I described earlier, where it measures and tracks historical data from past atomic events, is still in there. Changes in I/O might have meant it needed to be completely reworked. I have no idea. I just know it's a hard problem, and it's been solved a few different times by a few different people, and it's still far from perfect.
But when the problem is a little more straightforward, the shell can be surprisingly fast and accurate. We saw that it took several minutes to copy a mixed payload totaling 6 gigabytes. But if my drive can sustain 1200 MB/sec raw speed, why can't the shell copy files at the full transfer rate of the drive? Ah, but it can, and I'll show you... Watch as I copy a single 6GB file from the data SSD to the boot SSD. At a little better than 1200 megabytes a second, it should only take 4-5 seconds. But what's the reality? [ Single 6GB copy slower ] There you go. It happens so fast the shell isn't even done estimating yet. But from watching the stopwatch, I can see it took right around that 4-5 second mark. Impressive, to be honest! What about the other way, reading from the boot drive and writing to the fast data drive? It should in theory be even faster. Let's give it a shot. [ Single 6GB copy faster ] Sure enough, this time it copies that 6GB in under 3 seconds! It's smokin'! So as you can see, the filesystem structure itself, and not the drive, seems to be the majority of the overhead. When it's just data being written to a flat file that is growing on the disk, it can come close to saturating the maximum transfer rate of the drive. But when there are many individual files to contend with, the operation is much, much slower. Particularly when measured by the number of actual bytes handled per second! The fundamental conclusion, then, is that the shell is really bad at predicting a highly variable and chaotic future, whereas it's much better at predicting a stable one where the future is much like the past. In the end, this should come as no surprise! The other conclusion is one that is confirmed by my earlier CrystalDiskMark results - that manipulating lots of small files means many small I/O operations, and even if everything else were equal, which it isn't, the drive is significantly slower when working with small I/O blocks. It's kind of a double whammy, and that's why it's so much slower! Now I don't have any Patreons, and I'm not selling anything; I'm just in this for the subs and likes. So I'd be honored if you'd consider subscribing to my channel if you're not already. And if you are, heck, drop a like or comment on this video so I know you're out there! Let me know what you'd like me to feature next. Thanks for joining me out here in the shop today. In the meantime, and in between time, I hope to see you next time, right here in Dave's Garage.
Info
Channel: Dave's Garage
Views: 25,395
Keywords: windows progress bar, windows progression, windows progress bar c#, windows 11 secrets, windows 11, windows 11 insider, windows insider
Id: 9gTLDuxmQek
Length: 16min 26sec (986 seconds)
Published: Fri Dec 17 2021