How One Line of Code Almost Blew Up the Internet

Video Statistics and Information

Captions
February 18, 2017, at the Cloudflare HQ. Work began winding down this Friday afternoon as the weekend approached. Morale was high, and no one was prepared for the disaster that was about to happen. The engineers were excited to go home early and enjoy a beautiful weekend free of any operational issues. Suddenly, at 4:11 PM Pacific time, over at the friendly neighborhood Google complex, one of the Googlers working in Project Zero, a security research team, discovered a severe issue with Cloudflare's systems. He immediately reached out through the most sensible channel for something urgent like this, and first contact was made. Minutes later, at 4:32 PM, the alarming details of the report were made clear to Cloudflare, suggesting a possible widespread data leak. Always a Friday afternoon; there goes my weekend.

You may have seen Cloudflare's DDoS mitigation service before. It is built on top of their primary product, a content delivery network, or CDN. CDNs came into existence in the 1990s to speed up the delivery of internet content. They're kind of like distribution centers: Amazon isn't just going to have a single warehouse in the middle of the United States that every delivery driver starts from; there are many spread all across the country, and they store, or should I say cache, commonly sold items to minimize delivery time. Similarly, it makes no sense to deliver internet content to all users across the world from a single centralized source. A CDN will have many points of presence across the world, with edge servers that cache content from the origin server. When a user makes a request for a particular website, the request is directed to the nearest edge server, where the content is most likely already cached.

It was here that Cloudflare not only returned the requested website, but also cookies, keys, and other sensitive customer data. This is what it would look like, and plenty of useful information could be extracted from the leaked memory: full HTTPS requests, IP addresses, responses, passwords, and who knows how long this exploit had been out there. Bad actors could have already compromised thousands of companies, and Cloudflare's monitoring evidently did not self-detect this issue, as a third party had to identify it and reach out to them. Data leakage like this can come with hefty consequences: FTC fines, lawsuits, and increased audits. But most important of all, it degrades customer trust. No customer trust, no customers; no customers, no revenue; no revenue, no taco Tuesdays. To make matters worse, search engines like Google also regularly index and cache websites, so this leaked data could also be accessed through Google's cache.

4:40 PM. Now this was serious business. Everyone immediately assembled in San Francisco, maybe even with some cross-company action with the Google employees. The engineers noticed in the dashboards that the occurrence of this bug seemed to correlate with usage of the email obfuscation feature, which was also immediately suspect, as a recent deployment had partially migrated it to a new HTML parser. Either way, every feature that Cloudflare ships comes with a feature flag, and engineers immediately flipped what they called the global kill, which would prevent all customers from using the feature. By 5:22 PM PST, about an hour after the initial report, email obfuscation had been disabled worldwide. However, the bug was still occurring. On the other side of the Atlantic, the London team had joined the call. All hands on deck: it was time to spend the Friday night debugging and rethinking life.
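(For illustration only: the "global kill" described above is essentially a feature flag that operators can flip for every customer at once, so a dangerous feature can be disabled without a redeploy. Here is a minimal C sketch of the idea; the struct, names, and hard-coded values are invented for this example and are not Cloudflare's actual mechanism.)

    #include <stdbool.h>
    #include <stdio.h>

    /* Every shipped feature is gated behind a flag. */
    struct feature_flags {
        bool email_obfuscation;
        bool automatic_https_rewrites;
        bool server_side_excludes;
    };

    /* In a real system this would be read from a central configuration
     * store that operators can update at runtime. */
    static struct feature_flags load_flags(void) {
        struct feature_flags f = {
            .email_obfuscation        = false,  /* global kill flipped */
            .automatic_https_rewrites = true,
            .server_side_excludes     = true,
        };
        return f;
    }

    int main(void) {
        struct feature_flags flags = load_flags();
        if (flags.email_obfuscation)
            puts("rewriting email addresses in the response");
        else
            puts("email obfuscation globally disabled");
        return 0;
    }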
8:24 PM PST, four hours in. Another two features were found to be problematic: Automatic HTTPS Rewrites and Server-Side Excludes. Automatic HTTPS Rewrites was shut down immediately with its global kill, but Server-Side Excludes was such an old feature that it predated the practice of deploying with global kills. The engineers were at a crossroads here. They could release a patch for this feature to allow it to be turned off, but that would take some time for implementation and deployment. Alternatively, they could spend the time root-causing the issue and deploy a single proper fix, but the root cause was not apparent. Thus, the engineers began working on the global kill for Server-Side Excludes and readied it for deployment.

11:22 PM PST, seven hours in. As the night progressed, the streets outside the San Francisco office grew quieter; daybreak in London, where the engineers were more than ready to sign off and get some much-needed sleep. The patch to turn off Server-Side Excludes was finally deployed worldwide, but there was still much work to be done. Cached data from search engines still needed to be purged, and without knowing the true root cause, recurrence was still within the realm of possibility.

But what could have caused this? Well, edge servers contain software to perform all kinds of operations on the content they deliver, and this was the clear common denominator among the three aforementioned features: they all parsed and modified the returned HTML content in some way. Email obfuscation would erase any email addresses in the returned webpage if the requester's source IP was deemed suspicious. Server-Side Excludes is very similar: it can automatically hide content wrapped in a special tag from suspicious source IPs. Automatic HTTPS Rewrites would simply rewrite any HTTP links embedded in the returned website to HTTPS. Furthermore, these three features all used the new HTML parser mentioned earlier, cf-html. The engineers, however, found nothing suspicious in the code despite thorough verification, and it wasn't until the next few days that the root cause was made clear.

Now, Cloudflare had originally been using a parser generated using Ragel, and they were looking to migrate to something simpler and more maintainable. It was in this self-described ancient piece of software that the bug took root. Ragel is a parser language, which no one knows how to pronounce, that works by defining finite state machines with regular expressions and performing various actions based on the match results. You can think of it like those flowcharts where we start at one state and transfer to different states based on various conditions; for example, here is a machine which matches consecutive numbers and letters. In practice, you can see Ragel code is embedded within C here using the double percent signs, and it can then be compiled down to C, C++, Java, and more. It's actually fairly readable and concise after a bit of getting used to, and I'd imagine it's very performant. Or maybe one engineer a long time ago thought it was a fun language and implemented it themselves with minimal communication, it ended up just working fine, everyone else figured it wasn't broken so why fix it, and it was untouched until now.

So, the HTML webpage consumed by the Ragel parser is represented by a series of data buffers, with each buffer containing a portion of the HTML code. Each time the Ragel parser is invoked to consume a buffer, the user needs to pass in data pointers initialized to the beginning and end of the buffer: Ragel uses p to iterate through the buffer and pe to tell when the buffer has been fully parsed.
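(To make the p/pe convention concrete, here is a minimal C sketch of how a caller might drive a Ragel-style parser over a sequence of buffers. The function and variable names are made up for this example; the stand-in parse loop just counts angle brackets instead of running real state transitions.)

    #include <stdio.h>
    #include <string.h>

    /* Stand-in for the generated parser: walks the data pointer p until
     * it reaches the end pointer pe. The state cs would normally be
     * carried between calls so matches can span buffer boundaries. */
    static int parse_html_chunk(int cs, const char *p, const char *pe) {
        while (p < pe) {         /* iterate until the buffer is fully parsed */
            if (*p == '<')
                cs++;            /* placeholder for real state transitions   */
            p++;
        }
        return cs;
    }

    int main(void) {
        /* An HTML page split across three buffers, as described above. */
        const char *bufs[] = { "<html><scr", "ipt type=", "\"text/javascript\">" };
        int cs = 0;
        for (int i = 0; i < 3; i++) {
            const char *p  = bufs[i];                    /* start of this buffer   */
            const char *pe = bufs[i] + strlen(bufs[i]);  /* one past its last byte */
            cs = parse_html_chunk(cs, p, pe);
        }
        printf("saw %d opening angle brackets\n", cs);
        return 0;
    }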
In Cloudflare's case, one of the things they wanted to parse was HTML attributes within script tags, such as type or src. Taking a look at the Ragel code, this script_consume_attr machine will try to match this regular expression: attribute characters followed by a space, slash, or closing angle bracket. Then we have a few actions. This is an entering action, which will be performed when starting the machine; it simply logs that the machine is running. This @ symbol is a finishing action, which is performed when the machine completes successfully; here we call fhold, which is equivalent to p-- and will move the pointer back by one. This is likely because the script tag parsing machine it proceeds to jump to expects to consume the space, slash, or closing angle bracket character that the attribute machine would have already matched, as those are also part of the tag. There is also a local error action, which is performed when an attribute fails to match: there's a log here for the failure, and then it recurses and tries to parse the next attribute. Going back to the success case: after exiting back to the script tag parsing machine, the many parser machines will continue until the end of the buffer is reached. But how do we know if we've reached the end of the buffer? Well, if the data pointer p is equal to the data end pointer pe, then we have surely reached the end of the buffer.

So it turns out that something very bad happens if there is an unfinished attribute at the very end of a webpage. When this happens, the failure to match occurs when the data pointer p is already equal to the data end pointer pe. The parser then re-invokes itself, now at risk of parsing undefined heap memory. Let's see if the buffer end check saves us... oh man, the pre-increment causes p to skip over pe and never be equal to it. A rookie mistake.

But wait, this is a bug in the old parser, which has been in use for years. Has Cloudflare been leaking data all this time? No. It was actually the migration to the new parser that triggered the issue. Going back to the buffer overrun we were talking about before: if there are more buffers to come, the unfinished tag could just be due to the rest of the element being in the next buffer, so the error action will not be invoked. The error action is only triggered on an unfinished match within the very last buffer, as there is no more data at that point to complete the match. This is why, in the example, the unfinished attribute is at the very end of the page, that is, at the very end of the last possible buffer. However, the key here is that historically, when only the old parser was used, it would always receive an extra dummy last buffer that had no content. Why? No particular reason; it just did. This meant that for a website that ended with an unfinished tag, the unfinished tag would be in the second-to-last buffer, so the error action would not be triggered; and since the last buffer was empty, the parser would not trigger it there either. After the new parser was introduced, this behavior changed: the empty last buffer was no longer present in the buffer sequence passed to Ragel, causing the unfinished tag to land in the last buffer and making the overrun possible. Perhaps the new parser cleaned up the empty last buffer before passing data to the old one. This also meant that the bug could only occur when a customer enabled features which, in combination, used both the old and new parsers.
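(To see why the pre-increment matters, here is a contrived C sketch, not the actual generated code, of the end-of-buffer check. If the machine is re-entered when p already equals pe, a check written as ++p == pe steps past pe and never fires, while a plain bounds comparison cannot be skipped.)

    #include <stdio.h>
    #include <string.h>

    /* The parser is re-entered with the data pointer p already equal to
     * the end pointer pe (an unfinished attribute at the very end of the
     * last buffer). */
    static void buggy_scan(const char *p, const char *pe) {
        int steps = 0;
        for (;;) {
            if (++p == pe) {             /* p was already pe, so ++p is pe + 1 */
                puts("reached end of buffer");
                return;                  /* ...and this branch is never taken  */
            }
            /* consuming *p here would now read memory beyond the buffer */
            if (++steps > 16) {          /* artificial cap so the demo halts   */
                puts("overran the buffer!");
                return;
            }
        }
    }

    static void fixed_scan(const char *p, const char *pe) {
        while (p < pe)                   /* a bounds check cannot be skipped   */
            p++;
        puts("reached end of buffer safely");
    }

    int main(void) {
        const char *page = "<script type=";     /* page ends mid-attribute */
        const char *pe   = page + strlen(page);
        buggy_scan(pe, pe);   /* re-invoked with p == pe, as in the bug    */
        fixed_scan(pe, pe);
        return 0;
    }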
So what can be learned from this failure? Well, here we see a classic example of backwards compatibility: no matter how dumb the behavior of something is, if it's been set in stone for a long time and you change it, something is definitely going to break. However, it's not always so easy to maintain backwards compatibility. Obviously, Microsoft can easily choose not to deprecate the ability for Windows to run 32-bit programs, but cf-html removing the last buffer (or perhaps more accurately, not adding the extra dummy buffer back for no reason) is something that can easily be overlooked. And it was not just this, but also a bug in the existing code, plus a very specific type of input, that in combination caused the data leak. When you consider even larger systems with dozens of interlocking components, each with millions of possible inputs, it's clear that there will inevitably be bugs in all software.

So what can be done to minimize impact? Cloudflare mentions fuzzing the generated code to search for pointer overruns, as well as building test cases for malformed web pages. There are also various memory management techniques that can reduce the impact, and this could likely have been caught by static code analysis as well. Perhaps another thing worth pointing out is best practices. The coding standards for Ragel are not very clear, but from my limited experimentation, I don't think it is possible for Ragel to naturally overrun the buffer. It's possible to underrun the buffer by spamming fhold, but Ragel's default behavior seems to make overrunning impossible: there's no Ragel command to force iteration of the data pointer, and when Ragel iterates the data pointer forward naturally, it always explicitly checks whether it has reached the data end. This points to Cloudflare potentially going in and modifying the generated C code rather than the Ragel code itself, something that would obviously not be Ragel best practice.

Two days later, pointer checks to detect memory leaks were rolled out, and three days later the engineers determined it was safe enough to re-enable the three aforementioned features. Cloudflare then worked with the various search engines to purge their caches of affected websites. In terms of overall impact, evidence suggests that it was quite small. There were quite a few conditions that needed to be met for the bug to manifest, and Cloudflare claims that there is no evidence of the bug being leveraged for any attacks. We know that 0.6 percent of Cloudflare websites ended with unfinished tags and that the bug occurred more than 18 million times, so it is reasonable to say that Cloudflare just got really lucky. In fact, one of the features which could trigger this bug was available as far back as November 2016. Had this exploit fallen into the wrong hands, or occurred more recently now that Cloudflare is so much bigger, there may not have been such a happy ending.
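(As a closing sketch of the fuzzing mitigation mentioned above: this is roughly what a harness for the parser could look like. It uses the standard libFuzzer entry point; parse_html_buffer is a hypothetical stand-in for the generated parser, not Cloudflare's actual code. Built with AddressSanitizer, any read past the end of an exactly-sized buffer, for example a page ending in an unfinished tag, would be reported.)

    #include <stdint.h>
    #include <stddef.h>
    #include <stdlib.h>
    #include <string.h>

    /* Hypothetical stand-in for the generated parser entry point. */
    static void parse_html_buffer(const char *p, const char *pe) {
        while (p < pe)               /* placeholder parse loop */
            p++;
    }

    /* libFuzzer entry point; build with: clang -fsanitize=fuzzer,address */
    int LLVMFuzzerTestOneInput(const uint8_t *data, size_t size) {
        if (size == 0)
            return 0;
        /* Copy the input into an exactly-sized heap allocation so that
         * AddressSanitizer flags even a single-byte overrun by the parser. */
        char *buf = (char *)malloc(size);
        if (!buf)
            return 0;
        memcpy(buf, data, size);
        parse_html_buffer(buf, buf + size);
        free(buf);
        return 0;
    }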
Info
Channel: Kevin Fang
Views: 973,729
Id: GEbn3nHyKnA
Length: 13min 46sec (826 seconds)
Published: Sun Feb 19 2023