Al Sweigart Yes, It's Time to Learn Regular Expressions PyCon 2017

Captions
Good afternoon, everyone. We now have Al Sweigart, who will be talking on "Yes, It's Time to Learn Regular Expressions." Hello. Hi, I'm Al Sweigart. I'm probably best known for writing Python books, including one called Automate the Boring Stuff with Python, which is released under a Creative Commons license, so you can read it online for free. You can check out the book and also my slides from bit.ly/yesregex if you want to follow along with the slide deck. I really encourage you to release your own content under an open license like Creative Commons. One of the benefits that I have is that I can take a look at my web traffic, and I noticed that the regular expression chapter in this book has a lot of traffic, so I thought, hey, this would be a good idea for a talk. So regular expressions are kind of this intimidating topic. A lot of software developers pretty much put off learning them for most of their careers. It seems really cryptic; there's a bunch of weird punctuation marks that have special meaning. It can be kind of hard to learn. In fact, just 20 minutes before this talk I overheard somebody in the hallway say this quote, and so I had to update my slides with that. But in general, regular expressions are really powerful if you have to do text pattern matching, and I really encourage you to learn them. And so yes, it's time for me to start this presentation, which is entitled "Yes, It's Time to Learn Regular Expressions." So the first thing that you'll learn is that they're also called regex for short. So here are two numbers. One of these numbers is a phone number and the other is the population of Asia. So which one is which? Well, if you're like me, you've memorized every single American phone number that exists, and you can tell that that top number is one of them. No, that's not how you actually do that at all. You know it's a phone number because you know a phone number when you see it, and you know it when you see it because you know there's a specific
pattern that all American phone numbers follow: it's a few digits for the area code, with dashes in the middle of it. So that's what regular expressions do. You can specify a pattern of text that you're looking for, even if you don't know the exact numbers or the exact text that you're looking for. And it's really easy. Python makes this really simple; it's just about three lines of code. The first is to import the regular expression module, which is import re. Then you want to call the compile function and pass it the regular expression pattern that you're looking for. Then you're going to call the search method on the regular expression object that that returns. You'll pass it the haystack string. I call it the haystack string: if you're looking for one string inside of another, I'd say you're looking for the needle string inside the haystack string. And that's going to return a match object, and you call the group method on that, and that will give you the exact text that you're looking for. So did I say three lines? I meant four lines. But it's still basically just three steps: compile, search, and group. And also, if you're busy scrambling to write all of this down, remember you can download these slides at bit.ly/yesregex. So just calm down, stop taking notes, and listen to my soothing voice. Let's focus in on each one of these parts, starting with compile and this regular expression string of the pattern that you're sending to compile. This is the main complicated part of Python's regular expressions: figuring out the syntax for this. So let's go back to that phone number example. What exactly is a phone number? How do you know a phone number when you see it? Well, it starts off with a digit character, which is one of the ten numerals 0 through 9, followed by another digit character and another digit character, and then a dash, and so on. So we need to translate this into a regular expression string in the regular expression syntax, and that's what we're going to
pass to compile. Remember, there's compile, search, and group. So we're going to use \d, which is the regular expression syntax for a digit. You really want to pass this as a raw string, because otherwise, if you don't have a raw string, you're going to have to escape that backslash, and you're going to have all of these backslashes in your regular expression string. So I tend to like to always send raw strings to the compile function. So this \d will match a digit character, and our phone numbers are going to have three of these for the area code, followed by a dash, and so on. This is the regular expression string to match a phone number. After that we get the regular expression object, and we can call search on that and pass it our haystack string. This is the string that we're searching for the pattern we're looking for. It will return a match object if it finds the pattern; otherwise it'll return None. So we have to check whether the match object is None or not. And when we call group on that match object (remember: compile, search, group), that's going to give us the actual text that matches the regular expression pattern that we were looking for, just like that. Now, I know I said that it's a lot easier than you think, and this is still kind of complicated, remembering compile, search, group. If you're just calling the find string method, that's a lot more straightforward; it's just one line of code. But imagine if you didn't have regular expressions to do this find-phone-number example. You would have to write Python code that looks like this, which isn't that hard to follow, but there's a lot of it and it's really gnarly. And phone numbers are actually pretty simple patterns. Once you get into really much more complicated regular expressions, the Python code that you would have to write to do the same thing as a small regular expression would just explode across pages and pages of text, and we don't want that. So let's examine that \d thing that
I was talking about earlier, that matches a digit. This is something called a character class. It represents a range of characters, or a class of characters, that you're looking for, and \d is the one saying I'm looking for a digit, one of the numbers. There are a few other character classes. There's \w for word characters; this is letters and numbers and, I believe, also the underscore character. There's \s for space characters like space and tab and newline. And there are also the capitalized versions of these for everything that's not a digit, or not a word character, or not a space character. Character classes are how you specify what exactly you're looking for. So whenever you're searching through a string, try not to think of the actual semantic meaning of whatever text you're looking for. Just consider that string to be a sequence of symbols, and these character classes match a certain group of symbols. You can also create your own character classes by putting a bunch of characters inside square brackets. So if we wanted to, say, create a character class that matched all the vowels in the English language, we just type a, e, i, o, u in lower case and upper case inside the square brackets, and now we have a character class matching vowel characters. We can also do the same thing as the uppercase versions of those previous character classes by just adding a caret character to the very front of this character class, and now this is matching everything that's not a vowel character. So that's going to be consonants, but it'll also be something like punctuation marks or numbers. That's something that you have to keep in mind: it's everything that is literally not those 10 characters. And we can also have ranges of characters. If you want to match, say, all 26 letters or all 10 numbers or both of those things, you can just use a dash character to have, say, 0-9, and that's going to be all 10 characters. And this character class basically does the same thing as the \w character class.
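The compile, search, and group steps and the character classes described so far can be sketched in a few lines of Python. This is an illustrative sketch, not the code from the slides: the sample strings and variable names are made up, and `findall` (a standard `re` method the talk does not mention) is used only to list every match.

```python
import re

# Step 1: compile the pattern. A raw string keeps the backslashes intact.
phone_regex = re.compile(r'\d\d\d-\d\d\d-\d\d\d\d')

# Step 2: search the haystack string. The result is a match object or None.
haystack = 'Call me at 415-555-1234 tomorrow.'  # made-up sample text
match = phone_regex.search(haystack)

# Step 3: group() returns the text that matched.
phone = match.group() if match is not None else None
print(phone)

# Custom character classes: vowels, everything that is not a vowel,
# and a range-based class for ASCII letters and digits.
vowel = re.compile(r'[aeiouAEIOU]')
not_vowel = re.compile(r'[^aeiouAEIOU]')
alnum = re.compile(r'[a-zA-Z0-9]')

vowels_found = vowel.findall('Regex!')
non_vowels_found = not_vowel.findall('Regex!')   # includes the '!'
alnum_found = alnum.findall('a_1!')
word_found = re.findall(r'\w', 'a_1!')           # \w also matches '_'
print(vowels_found, non_vowels_found, alnum_found, word_found)
```

Note that the range-based class is only roughly equivalent to \w: \w also matches the underscore, and in Python 3 it matches non-ASCII letters as well.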
It's matching all the numbers and letters. So in these examples we had the caret character at the very beginning, or those dash characters. We're not literally looking for those characters. Punctuation marks in regular expression syntax tend to have very specific meanings, so if we're actually looking for any of these punctuation mark characters, be sure to escape them with a backslash in front of them. So if, say, we wanted to create a character class that matches opening and closing parentheses, we can just add a backslash character in front of those parentheses, and now we have a character class that matches the set of parentheses, either the opening or the closing character. So character classes, as I said, are what you're looking for inside of your regular expression. That's how you tell Python, I'm looking for these characters. You can also specify a quantity of these characters. We kind of did this with the phone number example just by repeating the \d over and over again, but there's a shortcut for this. You can use curly braces and a number in between the curly braces to say, hey, I'm looking for three of these digit characters, followed by a dash, and so on for the rest of the phone number example. This matches the exact same thing as the previous regular expression, but it's a bit more compact. And here's the pattern that we're following right here: we have \d, that's the character class, that's what we're looking for, coming first, and then after that we have the quantity that we're looking for. You're going to see this repeated over and over again: what we're looking for, and the quantity that we're looking for. And that curly brace 3, for matching three of the thing that it comes after, is really handy, but there are a bunch of these others, and there's a huge list of punctuation marks. You don't actually have to memorize these. Very few people do memorize these; you'll just end up going back to a cheat sheet or looking them up in the documentation. But you have all these punctuation
marks that mean things. Like, hey, I have \d, I'm looking for a digit, and the question mark, which means I'm looking for zero or one of these digits, or the plus sign, which means I'm looking for one or more of these digit characters. You can change the character class that you're looking for too, and suddenly it's, you know, I'm looking for zero or one space, or I'm looking for one or more spaces. Or use your own character classes, like we did with our vowel example: now we're looking for zero or one vowel, and one or more vowels. All those punctuation marks, you don't have to memorize them right now. Just remember we have character classes as the first part, that's what we're looking for, followed by the quantity. And now for something completely different. So Japanese words are composed of letters, and Japanese letters usually follow this consonant-vowel combination. So take a Japanese word like sayonara. It's made up of four letters, and each of those is a consonant and a vowel. It's consonant-vowel, consonant-vowel; it's following that pattern. We can create a regular expression that will match this. Remember, we have a character class for matching vowels, and then we also have that caret one for matching everything that's not a vowel. Technically this will match not just consonant letters but also numbers and punctuation marks; we'll just ignore that for right now. But anyway, say we wanted to match a Japanese word, which is made up of several of these Japanese letters. We can have the plus sign, which means one or more of the thing that comes before it. But technically this isn't going to work, because remember, it's just going to match only one of those consonant patterns. We need to specify how many of those we want to match, but we also sort of want to group these together. So this regular expression is going to end up matching something like this: it's a consonant character followed by a bunch of vowels, and it doesn't have to be the same vowel, it can just be any vowel. That's the
character class that we're looking for: one or more vowel characters. I'm not even going to attempt to pronounce this word. But what we want to do is sort of just group that consonant and vowel together, and then have the plus sign mean one or more of these things. So we can use parentheses to group these together, and it forms a sort of one giant character class out of all these other character classes. It's still the same thing: we're specifying what we want to look for, followed by how many of them we want to look for. So this is going to match something like sasasasasasasasa, or it'll match an actual Japanese word like sayonara. So this would be a pretty good point for me to thematically end my talk, on sayonara, but actually there's a little bit more. We're going to go into just a small example right here with regexes for a comma-formatted number. So if you're an American, you usually split up your numbers into groups of three with a comma in between them. So let's create a regular expression for this. We have to figure out what exactly the pattern that we want to match here is. So it's generally going to be something like one, two, or three digits for the lead part, followed by groups of the sort of comma-and-three-digits afterwards, and we'll have, you know, zero or more of those groups. So a number like 12 is going to have zero of those comma groups, because there are no commas, and a number like twelve thousand will have one of those groups. And I could just sit here and show you the regular expression for this, but there are actually a lot of nifty tools online where you can try it out. These are called regex buddy websites or regex tester websites. And now I'm going to do something that is highly ill-advised for anybody doing a presentation, and that is a live demo, in which anything can go wrong, especially one that requires using the Internet. So hopefully the Wi-Fi is going to hold and I can just use this website. So this is regexr.com.
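Before the demo, the grouping idea and the comma-formatted number pattern just described can be sketched in Python. This is a hedged illustration rather than the exact expressions from the slides: the patterns are written from the spoken description (a consonant-vowel pair repeated, and one to three digits followed by zero or more comma-plus-three-digit groups), and the variable names are made up.

```python
import re

# Without parentheses, the + applies only to the vowel class.
ungrouped = re.compile(r'[^aeiouAEIOU][aeiouAEIOU]+')
# With parentheses, the + applies to the whole consonant-vowel pair.
grouped = re.compile(r'([^aeiouAEIOU][aeiouAEIOU])+')

ungrouped_hit = ungrouped.search('sayonara').group()
grouped_hit = grouped.search('sayonara').group()
print(ungrouped_hit, grouped_hit)

# One to three leading digits, then zero or more groups of
# a comma followed by exactly three digits.
comma_number = re.compile(r'\d{1,3}(,\d{3})*')

samples = ['12', '12,000', 'I have a 64,000,000 year old egg']
comma_hits = [comma_number.search(text).group() for text in samples]
print(comma_hits)
```

The grouped pattern consumes the whole word, while the ungrouped one stops after the first pair, which is the difference the parentheses make.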
Technically, it's using JavaScript-style regular expressions, but regular expressions across multiple programming languages are so similar that it's going to work for Python as well. So let's try it out. Let's say I have my text, "I have a 64 million year old egg," and we want to find a regular expression that matches. Let's see how much I can move this. Not that. OK, there we go. I want to write a regular expression that matches that comma-delimited number. So let's see, what was this? This was a digit, and then I want between 1 and 3 of them, so I'll try curly brace 1, comma, 3. And you can see, as I'm typing, in real time these websites are going to update and show you exactly what they're matching. So these are great if you're trying to construct a regular expression and you sort of want to just build it up step by step. It's a lot easier than just running your code, seeing what it matches, and trying to go back and change it. This is a much more direct method. So let me finish this up. Let's see, we're going to have that group of the comma-three-digit thing, so I'll need another \d and then a {3} afterwards. Great. And I can see, well, oh, this isn't quite working, because it's not matching everything right here. And now I remember, oh right, I wanted not one of these comma groups but zero or more. So I'm going to add a star to mean zero or more, and I can see, OK, it matches the comma-delimited number. All right, hey, I got through the entire live demo and nothing went wrong. Yes! Live demo complete. Every time I do a live demo in a talk, I feel like I'm angering the live demo god, who's going to start chasing me from conference to conference, waiting for me to let my guard down, and just ruin the next live demo I try. Right, not today. OK, next: pipes. This is a way that you can sort of provide alternate groups to choose from, and the example that I want to use is, let's say we want to create a regular expression to match sentences of Monty Python words. And Monty Python words,
I'm going to use underscores instead of spaces, just to make them more visible, and just to make it easier and more consistent, all the Monty Python words always end with an underscore. But Monty Python words will be something like egg and spam, or egg bacon and spam, or egg bacon sausage and spam, or spam egg spam spam bacon and spam. That one only has a little bit of spam in it. So you might think, OK, let's create a regular expression. We're going to call compile, that's the first step of compile, search, group. And let's see, we'll put these in a group for each word, and we'll have that plus sign to mean one or more of this. So we're going to match one or more egg and one or more bacon. Except that's not going to quite work, because what if bacon comes before the egg, and what if sausage comes before egg or bacon, or something like that? This isn't actually going to work as a regular expression. We just need some way of choosing one or another. And we kind of do this, if you think about it, with character classes. Character classes say, hey, we're going to match a or e or i or o or u. Except character classes only work with individual characters; rather, we want to match groups, saying egg or bacon or sausage, etc. So we can do that with the pipe character, which means or. If you've ever programmed in a language like JavaScript or something else, they also use the pipe character for the Boolean or operator, so it's a bit easier to remember that way. So you can just put the pipe character in between these groups to say, I want to match egg or bacon or sausage or and or spam, and then put all of those groups into one giant group, and then add a plus sign at the end so that it's matching one or more of those. And that will match something like spam spam spam spam spam spam spam. It really stops sounding like a word if you say it over and over again. And the last bit that I want to cover... Actually, this is pretty much all you need for the basics of regular expressions: character classes, quantities, groups, pipes.
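The pipe example can be sketched in Python as follows. This is an illustration under stated assumptions: the patterns are reconstructed from the spoken description (each word ends with an underscore, the words are joined with pipes, and the whole alternation is wrapped in one group followed by a plus), and the variable names and sample strings are made up.

```python
import re

# First attempt: fixed order, one or more egg_ then one or more bacon_.
naive = re.compile(r'(egg_)+(bacon_)+')

# Pipe version: any of the words, in any order, one or more times.
monty = re.compile(r'((egg_)|(bacon_)|(sausage_)|(and_)|(spam_))+')

# The fixed-order pattern fails as soon as bacon comes before egg.
naive_hit = naive.search('bacon_egg_')

sentence = 'spam_egg_spam_spam_bacon_and_spam_'
monty_hit = monty.search(sentence).group()

print(naive_hit)
print(monty_hit)
```

The inner parentheses around each word are not strictly needed here; `((egg|bacon|sausage|and|spam)_)+` would match the same sentences.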
That's essentially it. This thing, regular expressions, that you've been putting off learning for years and years and years, is actually not that bad. The rest of the stuff is just gravy, but it's pretty cool. I really love this: the dot, which basically means match any character except for the newline. This is great because it's a great character class. You can also set a flag so that it also includes the newline. But you can combine it with the star character from several slides back, which means zero or more, and when their powers combine they form the dot star, which just means match whatever. There's a second version of this, dot star question mark, which means match the least amount of whatever. Because dot just means match whatever character, and we're looking for zero or more of whatever, you can match whatever. This is great. So say you're looking for text in between angle brackets. This looks a lot like HTML, and so let's say we want to get a regular expression that matches this HTML-like text. That's pretty simple. We can just have <.*?>, so we're just looking for whatever text is in between angle brackets, the least amount of it, and then that would match a string that's sort of like <to serve humans>. But remember, there's the one with the question mark, that's the least amount, that's the non-greedy version, and then there's just dot star, which is the greedy version. It's going to try to match the longest bit of text, and that's when you find out the dark secret that really the string was <to serve humans> for dinner>. Whew, plot twist. And the reason this happens is because dot star, the greedy version, is going to match the first part, <to serve humans>, and that technically matches the pattern that it's looking for, but dot star is going to continue looking forward to see if there's an even larger string that it can match. So if you want
to match the most amount of text for this pattern, you can use dot star, and for the least amount you can use dot star question mark. That's pretty much it. You now have this solid foundation. You can go into the Python documentation and sort of look up all the rest of the neat little tricks that you can do with regular expressions. I kind of just want to end on some best practices and some limitations of regular expressions, the first of which is really important, and that is: don't ever parse HTML with regular expressions. I know I went back in this slide and said, hey, that looks like HTML. Don't actually do that with regular expressions. You'll end up creating a regular expression that sort of matches HTML, and then you'll realize, oh wait, it also needs to be case insensitive, so then you change it a little bit, and then you realize, oh wait, there's some attribute that's out of order or something weird like that, and you'll have to change the regular expression again. You'll end up making a regular expression that doesn't really match HTML. Instead, what you should do is use an HTML parsing tool like the Beautiful Soup module. Same thing with JSON: you want to use a JSON parser to match JSON text. The second thing is, I used to use this as an interview question: come up with a regular expression that matches a strong password, you know, something that has lowercase and uppercase letters and numbers and special characters. The regular expression to do this turns out to be this huge giant thing, especially because you have to cover all possible orderings of lowercase and uppercase and numbers and everything, and it's really awful. If your regular expression starts blowing up into this giant thing, it's probably a good time to just break it up into smaller regular expressions. You can just use multiple regular expressions on the same bit of text: something that looks for a lowercase letter, something that looks for an uppercase character. And then finally, this is kind of going into the computer
science of regular expressions and regular languages: matching nested parentheses. Regular expressions can't do this. Matching parentheses relies on having the same number of open and closed parentheses, and they have to be in a certain order. And technically, regular expressions aren't Turing-complete, which is a computer science term that I'm just going to gloss over, but you can think of this as: regular expressions aren't programming languages. They don't have flow control or loops or variables or things like that. And the reason I know this is because at my last job we were doing a code review for some web app where the user can type a regular expression into a text field. We wanted to validate that, to make sure they were typing in a valid regular expression, and a co-worker said, hey, can you actually come up with a regular expression that matches regular expressions? And I leapt forward and said, no, you can't, because regular expression strings have groups, which require nested parentheses, and that requires a stack data structure, which means it requires a context-free grammar, and that's beyond the capabilities of regular languages. And I was so excited, because it was the only time I've ever used my computer science degree for something in the real world. But yeah, so those are just some best practices with regular expressions. They do have their limitations, but they are so incredibly powerful and useful to have, so I definitely encourage all of you to keep reading about them. You can find more in the Python documentation. I also have them in a chapter in Automate the Boring Stuff with Python, which you can read for free online. But I really advise that, whatever you do, you go ahead and find out more about regular expressions and use them, because yes, it is time to learn regular expressions. Thank you very much. [Music] So do we have time for questions? Yes, I guess line up at the microphones. So we do have time for some questions, so if you have some questions, please line up here, and please keep
the questions short and simple, because we have only five minutes left. So, yep. Hello. Hey, I believe you have a good use case for a lookahead. I do, I remember. So one of the fancy things that you can do with regular expressions... Wow, I've been saying "regular expressions" a lot today. One of the things that you can do with them is not only find, but also find and replace. So let's say you had some text like "Agent Alice told Agent Bob the info." So lookaheads are basically when you want to use the pattern that you found in a regular expression later on in that same regular expression. So let's say you wanted to do something where you want to find every case of the pattern: the word Agent followed by some other name, you know, and then using the space as telling it when to stop, and you want to replace it with just star star star, but keeping that first letter. So this is sort of redacting it with a find and replace using regular expressions. This is the example that I use for lookaheads, which, I'm going to mess this up, I know. Let's see, you would basically be looking for Agent and then, let's see, \w, I guess one or more of these. Oh, haha, that's why everybody's looking at me weird. There we go. Oh, now I can't see. There we go. Yeah, we're looking for, like, Agent Alice, and we want to find this text and then replace it. We can use that group, that's the first group that we found, so we would replace it with this text: Agent \1, I believe, meaning just use the text of the first group. Oh, oh wait, no, I would want that first character, so that's in our group, followed by the rest of that name. Let me see if I got this right. And so this is called a lookahead, where you use \1, \2 to refer to the groups of characters in your regular expression pattern that you've matched before. Yeah. Anyway, let's move on to the next question. So in an API that my team's writing, we use regular expressions and we need negative lookbehind, and Python only seems to support fixed
width groups. So in your pattern, if you do something like "something or something or something," they all have to be the same width, which seems bizarre to us. And so our solution is to just shell out to get it done properly, in a proper... Oh yeah, a proper regular expression, yeah. So do you have any other ideas? I mean, that is one way of doing it. There are slight variations between different languages and different command-line tools in how they handle regular expressions, especially the more advanced features like these lookahead things. Basically, if you find something that Python can't do, you could find other regular expression modules. If you really, really need that behavior, there are other regular expression modules on PyPI that you can download. Do we have time for more questions, or are we out? Oh, hi. This question may be against the general spirit of your talk, to learn regular expressions: is there a tool that lets you accept a before version and an after version of a text and suggests the regular expression, or suggests some of the regular expressions we could come up with to do that? That's kind of, sort of, the exact project that I wanted to work on during the sprints. This was going to be like a learning tool where you can type in a regular expression and it'll just start spitting out some example strings of what it could potentially match. From the brief looks that I did around the internet for a tool like this, I haven't been able to find anything that does something like that. But that's a great idea. Somebody should make that, or you can help me make that during the sprints. Thanks. Thanks. Uh, do we have time for one more question? Yeah. Hey, this is just, this is intellectual vandalism. [inaudible] It's not a question. Oh yeah, OK. Go ahead, you're the last person. How do I handle Unicode? Uh, oh wait, Unicode. Oh, Unicode, right. So I think I've tested this out, and Python's regular expression
module does handle Unicode. I think I know this because I have this lod.exe program I made, which copies the look of disapproval emoji to my clipboard so I can just paste it. Oh, I've already erased that. But I tried pasting Unicode characters and just all sorts of weird stuff in, and it seems to work just fine. So bravo, Python. Yeah, well, I tested that in Python 3 anyway. I'm not sure, who knows, with Python 2. But yeah, anything else? Oh, that's it. All right, thank you very much. Thank you for your great talk.
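Three points from the end of the talk and the Q&A (greedy versus non-greedy matching, the find-and-replace redaction, and Unicode handling) can be sketched in Python 3. This is a hedged illustration, not the speaker's on-screen code: the redaction pattern is reconstructed from the spoken description, the sample strings for the Unicode check are made up, and `re.sub`, `re.findall`, and the `re.ASCII` flag are standard `re` features the talk does not name. In the Python documentation, `\1` in a pattern or replacement string is described as a backreference to a group, while lookahead refers to the separate `(?=...)` syntax.

```python
import re

# Greedy vs. non-greedy: .* takes the longest match, .*? the shortest.
text = '<to serve humans> for dinner>'
greedy = re.search(r'<.*>', text).group()
non_greedy = re.search(r'<.*?>', text).group()
print(greedy)
print(non_greedy)

# Redaction: keep the first letter of each name (group 1) and
# replace the rest with asterisks. \1 refers back to group 1.
redacted = re.sub(r'Agent (\w)\w*', r'Agent \1***',
                  'Agent Alice told Agent Bob the info')
print(redacted)

# Unicode: in Python 3, \w matches non-ASCII letters for str patterns
# unless the re.ASCII flag is set.
sample = 'na\u00efve caf\u00e9'  # "naive cafe" with accented letters
unicode_words = re.findall(r'\w+', sample)
ascii_words = re.findall(r'\w+', sample, flags=re.ASCII)
print(unicode_words, ascii_words)
```

For the variable-width lookbehind raised in the Q&A, the standard `re` module does require fixed-width lookbehind patterns, which is why the answer points to third-party modules.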
Info
Channel: PyCon 2017
Views: 35,898
Rating: 4.9447236 out of 5
Id: abrcJ9MpF60
Length: 30min 31sec (1831 seconds)
Published: Sun May 21 2017