Faster, Faster! Improving regex performance with atomic grouping, possessive quantifiers and more!

Video Statistics and Information

Video
Captions Word Cloud
Reddit Comments
Captions
foreign who's met me before or have I had on a training course or something like that okay excellent lovely lovely to have you back um for those that don't know me oh I know someone do a welcome for those that are dialing in remotely especially internationally especially my son Quentin in Christchurch hi cute for those that don't know me my name is Peter Lovett I've been a programmer for 41 years I've been being paid to be a programmer for 41 years nowadays I run training courses predominantly in C C plus plus and of course Python and you can find me at plusplus.com this is not an introduction talk this is I'm expecting that you already know regex and we're going to be exploring um particularly performance I'd also have to say this is incomplete I have missed a lot of things um because benchmarking is tricky it's a tricky thing to get right it's a tricky thing to repeat here's some of the things that I have found specifically I've done these benchmarks on C python 3114 on Windows 10 they've been timed with IPython timer there are a number of issues that are outstanding with some of the things that I'll be looking at particularly the possessive quantifiers and if anything goes wrong it's your problem I would also say content warning and I hope you've learned to sanitize your database inputs this may uh highlight some deficiencies that you have in your reg X use I'm getting a bit of feedback there is that thank you and this is not a security talk I'm not talking about security directly but the things that we look at May well have some security or denial of service particularly uh implications what we're going to look at is rule number one we're going to look at compiling quantifiers character classes and flags backtracking and then Atomic grouping and possessive quantifiers that are new in 311. so if you're using 310 probably get to 311. rule number one does everyone know the first rule of regex shout it out [Music] rule number one is don't use regex to quote Jamie zwinski some people when confronted with a problem think I know I'll use regular expressions now they have two problems well this one applies to me some people when confronted with a problem think I know I'll quote Jamie zielinski I'm not specifically talking about readability or other issues apart from performance in this talk but I want to just I couldn't help myself I had to say readability counts as it says in pep 20. so first up don't use regex one don't use match use starts with 500 percent faster applaud now and that's the end of the course that's the end of the talk I hope you've got benefit out whenever you see a percent faster that you're happy with I want some Applause don't use search use in don't use rejects unless you need to I could do a match of it does it start with an a b or a c or I could do a starts with starts with or starts with or a little known but it's actually pretty handy starts with takes a tuple and you get a tiny little bit of improvement there go Pete uh don't use rejects unless you need to oh but Pete it's got to start with a letter and I'd rather do that but it's 500 faster so now you've got a trade-off between readability and performance but I want to talk about performance oh actually why don't you just go and do a range check so don't use regex if you're going to use rejects you could you should compile that's a tiny bit of the code from the re module and we can see some interesting things there that there is a compile function that calls a single underscore compile and it actually has a cache of all of the patterns nonetheless doing a search took that long but pre-compiling it and then searching still gave us a drastic improvement I haven't quite drilled down exactly what I'm getting there but I did repeat it and I was happy with that let's talk about anchors don't use hat you're doing a search from the beginning and you shouldn't do that you should do a match and that gives you five percent I did not go I should have worked up shouldn't I should have started with the five percent and ended up with the 500 I've got more though stay tuned if you used a match with the um the Hat then you do get um it's a tiny bit slower Flags you say but Pete I need Flags so that I can do ignore case no you don't instead of you interesting instead of using ignore case why don't you use a character class and get 17 faster really uh there is some do I get some descent there yeah yeah uh I I would point out that in non-ascii there um there are more details in case than just b or B but just looking at a straight ASCII uh alphabet even if I had to do each one of those letters it still worked out cheaper it still worked out faster than straight Flags what about quantifiers um I've got a long string there lots of A's if I did a search for one or more any things oh an A and then one or more anything's and then a c now I'd want to point out watch out for your failed cases failed case is often a very Mark has a very different a marked difference in your performance compared to oh it found it so this Benchmark is a I didn't find it and that gave me a that speed but if I went and clamped it I got 280 percent faster because uh you probably don't mean plus because plus goes from one two Infinity and Infinity's big number so Pete's Maxim don't use Plus don't use star consider exactly what we're talking about there consider how many you're after let's look at laziness now here tonight that is almost tonight is indeed a unique occasion in the history of pycon I'm looking for the word I'm looking for a space in stuff and then a space now the sharp amongst you will go yeah Pete but that's going to grab too much and in fact it does because as I I didn't wear I need to get a t-shirt that says quantifiers are greedy quantifiers by default are greedy they grab the most they can not the least they can um one option that grabs all of that through to the space a nice option there though is to make the quantifier not greedy also known as lazy also known as reluctant and that then stops at the first one that it finds not the last one like that that doesn't actually have anything uh doesn't have a lot to do with performance but stay with me if I did a search for a greedy quantifier I get 467. whatevers make it lazy it's a little bit slower you can now boo boo that's what we're here for Pete we didn't want uh we didn't want slower we wanted faster well maybe you don't mean Dot a negated character class is a nice solution and for uh eight percent faster ah it's not 500 faster but it's faster yeah I'll give that if you get an eight percent here and a five percent there you'll get a nice result which leads to Pete's next Maxim which is don't use Dosh you probably don't mean anything uh reminder dot doesn't actually match anything it matches most things what does dot not match new lines unless you've got the dot all modifier I didn't Benchmark that um here's a example of something that I'll be using for the subsequent exercises which is I'm looking for a sentence and I've defined a sentence as word characters followed by a space Maybe white space maybe one or more words and an actual real actual dot at the end these sharp amongst will go no Pete you've just committed a fatal error and if you can't see that I'll be getting to that it'll work it's just got a performance implication and if I looked for something that doesn't have a full stop on the end I didn't get the sentence no match one thing that I could do is use non-apturing groups so parens parentheses have three different uses they're used for grouping which is actually what I'm using it for there they use for alternation this or that and they're also used for capturing now I'm not doing any capturing here I don't want to capture I'm not using backslash one or doing a match group so why not use a non-capturing group non-capturing groups are groups they do exactly the same as groups but they don't capture it and it gave me a little bit faster hmm to be honest it's a bit uglier but if you want it beautiful um this is the wrong talk now in order to go further we need to understand backtracking and how the engine actually goes looking for things if I was looking for a one or more A's and a b it works if I was looking for one or more A's and a b and it can't find it it actually has multiple bytes at the Cherry it goes back and has another go let's do a demo love live demos what could possibly go wrong so this is um regex 101 awesome totally awesome because they've got this awesome thing down here called the regex debugger I mean it tells you whether you've got a match or not that's interesting but the regex debugger gives you an enormous amount of insight into what's really going on here there's my data there's my pattern and here's the steps that it's going through it found a match in three steps it did the A's see that highlight got the B happy well that was that wasn't a great demo well actually I want to show you what happens if it doesn't find it if it doesn't find it it failed and took 35 steps we'll be shouting out what because it did that and it didn't find it so it goes backwards and you might be it's hard to see there's a tiny little Thin Blue Line there now it's going to start there and have another go and didn't find it so it goes back and has another go I'm seeing some uh oh my regex is just a disaster probably probably is and it keeps on having a go until eventually after 35 stabs at it it points out that it didn't find it that was my demo [Laughter] so what about bad backtracking backtracking this this process of the regex engine going back to the beginning having another go at it moving along a bit is not too bad in my None Shall Pass to find the sentence and it got me a result of that but if it doesn't find it it took a heck of a lot longer but it's it's it's how do I say this it's worse than that it's catastrophic that's the only way I could describe it because if the end of the pattern causes it to not match if this if the string that I'm searching is that big it took that long add one character to that it's twice as long oh and if it was one more character long it's going to take that long will I keep going and if it was one more it's going to take that long and if it's one more and that's uh I don't know I'll tell you that that's not a long string I'd call that a short string you can see where I'm going on this so we'll call that catastrophic we could also have a look at what the regex engine is going to give me for the number of steps it's doing and uh yeah [Laughter] for and then it just gave up sorry didn't find it couldn't find it catastrophic backtracking so if I have a look at another demo there of again this with a doesn't match or if there's a yeah it doesn't match I got 19 000 steps as it keeps on going and that will be the end of the talk hopefully I've got a solution for you I do have a solution for you that was a demo uh so it's if the input string length isn't too long uh there's no it's fairly flat for if it finds it but if it doesn't find it that's when the backtracking happens and then it goes well that's not too bad at the bottom in fact found was taking longer but when the input string gets longer we've got ourselves a quadratic performance o of N squared and that's how do I say this bad um one thing this one actually suffers from the same problem but in a slightly different way I don't know if you can see it but backslash W word character includes digit characters and when there's an overlap of those character classes you'll invoke far more backtracking one solution in this particular case is make sure that you don't use overlapping character classes make sure they're mutually exclusive if they're mutually exclusive then that particular fail won't happen and even in a simple case it was 78 faster or we could look at Atomic grouping whoa again this is new in 311. the big thing with an atomic group and possessive quantifiers as well is that they don't backtrack so they don't go back and have another go so the pattern of uh that's all the way up a b c or B and a c would match both ABC and ABC and the sea but the atomic grouping after it's eaten after it's chewed through after it's consumed the BC there's no C left in ABC so we'll only be a true it would only give me a match on a BC and C hmm because there's no backtracking and it's slower well again it's slower in the simple case but I'm much more interested in the fail case where for that example I've got a 26 percent faster and they both don't find it so no problem uh notice the regex extension there paren question mark is the whole selection of um regex extensions which python 311 now supports more of because it supports Atomic grouping so the quantifiers we've got straight straight quantifiers I'm just using star there but applies to plus and applies to curlies by default greedy Trail it with a question mark to make it lazy stop at the first one Trail it with a plus to make it possessive so why or where would possessive not be good if I had a pattern like that r dot star spam because the backtracking I get eggs and spam and it matches eggs and spam that is the dot star goes and consumes everything and then it backtracks so that it doesn't consume it all so that it does get the and spam make that possessive by chucking A Plus on it and it doesn't match because the dot star has consumed all of the data string um this parrot is dead I'm looking for a quoted string and if I use a possessive quantifier after a DOT plus before the quotes this is the same one I just showed you it doesn't get the match because the dot plus consumes it so just a reminder those pluses are well they're diff well they're the same key on the keyboard but they've got completely different meanings the first plus is the this is the one or more quantifier and the second plus is the previous quantifier the plus is possessive so from my maximum earlier don't use Dot then I could use a negated character class if not quote make it possessive I get the right match and I get well I got six percent faster there was a bit of effort for six percent Stay With Me let's go and look at something more more detailed my um my my scary one so this particular one looking for the sentence in the None Shall Pass one two three five took six thousand seven hundred and ninety microseconds make it possessive uh yeah [Applause] make it possessive and we get it down from six thousand to two and if I made this one possessive as well I got a little bit faster so I got a [Applause] I got a two million percent speed Improvement I thought that was pretty good actually so that's like if you were flying to London instead of it taking 23 hours and 10 minutes we could do it in six minutes now I should put a caveat in there um not every occurrence of this is going to give you a two million percent Improvement don't come back to me if you're not getting a two million percent Improvement in your regex it's lost more time um too long didn't listen don't use regex uh but if you're going to use regex don't use Dot consider a negated character classes uh clamp your quantifiers don't use plus don't use star use curly brace from two or a specific number use non-capturing group groups paren question mark colon watch out for nested quantifiers I don't think I used the phrase but I should have when I was looking for the sentence that was the problem was that I had a nested quantifier I will just go back to that because I think that was in well that one's there that's the problem is that we have a nested quantifier we have an unbounded quantifier inside an unbounded quantifier and it's the nested quantifiers that gave us the catastrophic backtracking and the quadratic performance so used on watch out for nested quantifiers use mutually exclusive patterns that is that was where I had the backslash W followed by the backslash D where there's an overlap of the patterns used mutually exclusive patterns and if appropriate and if it works for your problem use possessive quantifiers and atomic grouping technically they're actually the same thing that's just different syntax giving you the same or they're almost the same thing as long as you've got 311 which you should have because 312 comes out next month I didn't mention it earlier but I should mention it test your regex as maybe obvious it's very easy to get a regex that is wrong um test them and of course use your own discretion readability counts uh the details of that the benchmarks were done on win 10 with C python that particular version and I used IPython time it I'm aware of the the deficiencies of that as a benchmarking tool but if you're going to get 2 million or 2.2 million I'm happy with either uh your mileage may vary there's a wide variation in the sort of performance profiles that you will get with these and it behooves me to mention that there are different regex engines um in fact backtracking engines like python users and pearl users and Ruby users are notorious for this particular fail there are other regex engines here I'm just looking at C Python's built-in re module but consider other regex engines there's someone pip for more information look at the regular expression denial of service page that's a thing that any of you that are using use of regular Expressions that pars user data need to be well aware of regular Expressions info is a fabulous website a very thorough coverage and um the or the author of that site kindly added on some python 311 things when I raised it with him thank you 101 and of course mastering redirects regular Expressions by the Jeffrey freidle and some of those images were from him the others were from Quentin so thank you thanks to you for coming to my talk um thank you to the pycon Au organizers volunteers my wife who whenever she sees me writing rejects uh in my study she says it looks like I fell asleep and my head hit the keyboard and I now have a regex Cody for a review and sorry I embarrassed you again Quentin buddy the Jeffrey freidle who I had a nice interaction with Jan govets for adding python 311 there was an XKCD in there and some other images and that's the end of my talk now we have a uh a couple of minutes if there's anyone that has a specific question that I can cover in a couple of minutes we do have a comment from Quentin I believe so guessing that possessive quantifiers are also in 3-1-1 and later is that correct that's correct yes yeah interestingly the engine actually uh because the python regex engines written in C and is actually from Secret Labs I think originally um it's actually had this functionality but we just didn't have the python layer that passed it through so we've actually had it in the engine for a long time the question first question here uh just a real quick one um compiled regular Expressions do they have the same exponential Behavior I really hope they don't because it's unnecessary uh when you've compiled them uh yes yes so they still fail they still have that same performance profile in fact um the call to uh our research actually Returns the underscore compile of that compiled caches by default I think it's the last 50 regular Expressions so I was kind of surprised to see that performance if I separately compiled and called it that I actually got a noticeable um performance Improvement in doing that traditionally I'd always be very careful about doing a regex Insider loop I would always compiled before but the code there now is caching that so I wouldn't see as good a performance Improvement as I used to up the back somewhere um could you explain possessive regex again I didn't quite understand sure so it's a funny uh word because what it really means is the best one for that is this one so this particular one dot star whoops dot star means zero or more of anything followed by a Spam when it parses eggs and spam it'll actually read or eat or consume all of it and then go I didn't find when it gets to the end then it says I'm I didn't get a Spam so it'll start at the front again and go up to one before the end and then it'll go back and have another go and then go back and have another go until it finds that it can get a match that's the backtracking that happens it's kind of obvious to us as humans like don't do that but um by the design of the engine that's the way that it works and it's quite amazing that it works anyway the possessive quantifier um possesses it and doesn't the simplest way of explaining it is that it doesn't backtrack so once it's consumed everything that it could for the dot star there's nothing left so a possessive quantifier tells it to not backtrack in this particular example it um it doesn't give me the match but the one with the um this one here is a good example because that will execute faster because I'm using a negated character class all right thank you that's all we have time for I'm afraid uh I'm sorry if I can just interject um I'll be hanging around uh so if you've got any more questions you're welcome to come and see me um after and or um there's my contact at the end if you would like to contact me post course or post it thank you very much here's your speakers here yay [Applause] [Music]
Info
Channel: PyCon AU
Views: 1,563
Rating: undefined out of 5
Keywords: PeterLovett, pyconau, pyconau_2023
Id: L5Vqm-REhAU
Channel Id: undefined
Length: 29min 54sec (1794 seconds)
Published: Thu Aug 24 2023
Related Videos
Note
Please note that this website is currently a work in progress! Lots of interesting data and statistics to come.