Trey Hunner - Readable Regular Expressions - PyCon 2017

Captions
[Music] I will be using Python 3, and I'd recommend it for this tutorial, but you're welcome to use Python 2. Everything I type should work in both; you'll just see slightly different output. Python 3 is a little more helpful in some ways. [Music] ...wants to remind me, that would be excellent, because I think I forgot last year. You can fill out a survey to rate this tutorial afterwards, so at the end of this tutorial I'm going to try to remember to put that URL up again. [Music]

All right, so we're going to get started. Who has heard of regular expressions before, but you haven't really written them? You've never written a regular expression, you think. Okay. Who has written a regular expression, but you didn't really know what you were doing? Okay. Who's written regular expressions and you feel somewhat confident in your abilities? Okay. So we have a lot of different people sitting in the room right now. Regular expressions mean different things to different people, and your ability to write a regular expression means a different thing to different people. Who's ever written a positive lookahead before? Okay, that's surprising. Maybe you can come teach those to us afterwards.

So, regular expressions. I'm going to mostly teach regular expressions themselves. In the process we're going to learn about how regular expressions work in Python. If you are already comfortable with regular expressions, this is going to drag a little bit in the beginning. If you've never seen regular expressions, this is going to be a little bit confusing. That's because we have three hours and I'm going to try to appease everyone in the room a little bit.

Everything that we're going to go through is on this first link, this first URL here: pycon2017.regex.training. We are not going to get through all that material. There's probably about four hours of material there and we have three hours, so we will get through quite a bit of it. There's also that survey
monkey link there; I'm going to try to remember to put that up again at the end. After this tutorial you can rate the tutorial, I believe, or fill out a survey about it, on that link.

So if you could all make sure you have internet, go to your web browser and get pycon2017.regex.training up in your browser. Also make sure you have Python installed. If you're having trouble getting to the website or getting Python installed, put up a red sticky. I've got two TAs for this tutorial: David and Steven are going to help me out here. None of us are regular expression experts, because I don't know if there are any regular expression experts outside of the few people who wrote the implementations of regular expressions in the various languages, but we can try to help in whatever ways we know how.

Yep? Yes, that zip file is on the website; it's linked on the website as well. So I emailed a zip file. Who has not downloaded that zip file yet? Okay, if you could go to the website... Yeah, it's not working well for me either. If you're having trouble with the internet, I've been told you can jump on PyCon 5G and it should help out. However, I've also been told PyCon 5G is in fact slower than the other one, so I don't know which one is the better one to be on. I tried both of them out. I'm actually not getting any internet at all at the moment, which is exciting. You know, I think I forgot my USB sticks. Does anyone have a USB stick you wouldn't mind me borrowing and passing around with a zip file on it? Okay. Yes, I did, I sent it out by email. It was not an attachment, though, it was a link. So if you haven't downloaded it at this point, you're going to have to deal with the internet here. Those exercises are for the first exercise break; hopefully we can get this sorted out before that point.

I'm going to be writing code, so I'm not going to use any slides. You can watch. If you'd like to
follow along, you're welcome to type, but remember this is on the website, so you don't need to type anything. During the exercise breaks, that's your turn to type. I want you to be working through exercises, because if you don't do something hands-on, if you don't write something, you didn't learn anything; you just watched something happen.

All right, so let's jump into this even though the internet doesn't work. Hopefully it'll magically start working by the time the exercises start. Excellent. Could someone put the zip file on that USB stick and then pass it around the back of the room? Actually, David, could you help out with that? Thank you. There's a link on the website. The problem is I don't remember where it is and I can't access the website. Someone next to you might know where it is, but it is linked, I believe, at the end of the exercises page.

So we're going to back up a moment. I've been switching between Python 2 and Python 3 a lot lately, because I teach Python and companies want me to teach different things, so I'm going to be messing up that print function a lot. I'm in Python 3, and I'm going to occasionally type a print statement. You're going to have to bear with me.

What is going on here? Why didn't it print out the thing that I typed in? Yeah, that backslash n. That backslash n represents what? A newline character. In a lot of programming languages, backslash n is the newline character. In Python, if we want this to not be a newline character, there's a couple of things we could do. We could escape those backslashes by putting two backslashes. We could also put an r before that string. What does r stand for? It stands for raw string. I want you to think of it as "regular expression" for now: this is a regular expression string. It's a raw string because it ignores backslash escapes. The place that you'll see these used very often in Python is regular expressions, because in regular expressions we use a lot of backslashes. So all of the
regular expressions I type today, I'm going to try to remember to use that r there.

All right: import re. What does re stand for? Regular expressions. This is an interactive tutorial, by the way, so feel free to shout things out when I ask questions. I've got a greeting here. I'm going to search in this greeting for x. Notice I'm using an r there even though I don't need to, because that is a regular expression. So I'm searching for x in greeting, and it gives me nothing. It returns None because there is no x in greeting. If I search for x in "exit", I get this thing. If you're in Python 2, this thing is going to look a little different; it's not as helpful. In Python 3 it says we matched x, from character 1 to character 2.

Now, this is a silly thing to do. How else could I do this in Python instead? Is there any way to do this without using a regular expression? x in... yeah. In Python we have substring matching with in. You don't need to use regular expressions most of the time you are looking for something in something else, when those somethings are strings.

So let's look at a case where regular expressions might actually be useful. This takes me a while to write: if I were to look for every vowel in greeting, I'd have to type something like this. A regular expression could be handy in this case. We could say re.search with [aeiou] in greeting. We are looking for any of those characters.

Regular expressions, and we're going to see this as we go on today, are an extremely difficult language to read. Regular expressions are a programming language. They are a programming language within another programming language, in this case Python. They're a programming language where every character tends to represent one statement, and you write all of your statements on one line of code without comments or whitespace. Who writes all their Python code on one line without comments or whitespace? No one. It's actually impossible to write all your Python code on one line without comments or whitespace; Python
doesn't allow you to. With regular expressions, we have to. We're going to see a way to write more readable regular expressions later, but in general the reason they're so difficult to read is that every character is a statement.

All right, so this is a character class here. This goes by a lot of names; I'm going to call it a character class. The reason it goes by a lot of names is that regular expressions are not a Python-specific thing. Perl, Java, JavaScript, Python: a lot of programming languages, in fact almost every programming language, has a regular expression parser. They're an old notion, in fact a linguistic notion that was borrowed and taken into programming, which means we have different names for the same things, and sometimes different symbols for the same things, in different languages. Usually if you learn regular expressions in the Python world, they're going to work the same everywhere else, for the most part.

All right, so we could also search for numbers. We get None for "rhythm" because there are no digits in there. We get 1 for "$100". Why do we get 1? Yes, this is matching a single character, the first character that it finds that matches zero through nine, and it stops at that point. So re.search only matches the first thing, and the first thing in this case is a single character. There's another way to write this: [0-9] gives us the same thing, a match of 1 there. And in fact we aren't limited to just numbers for that zero-to-nine range. We can do that with letters: we can do it with lowercase letters, with uppercase letters. This one here is going to match nothing. And we can combine these, so this would match all alphanumeric characters; actually, only the first alphanumeric character.

Yes? So that span there: if you're in Python 2, you're not going to see that. This is a Python 3 thing that it prints out at the REPL. You can call start and end on the match; that is where it starts matching and stops matching that character. And you notice that for "$100" it said 1, 2: start at
character 1, stop at character 2.

These character classes can also be inverted. That caret I just put there is giving me the dollar sign. It's giving me the dollar sign because it is the first character in the string that is not 0 to 9. I've mentioned this already: regular expressions are dense and difficult to read. That caret is inverting this character class. It only works at the beginning: if I put 0-9 and then a caret, it's matching 0 through 9 or a caret character.

All right, so we're going to run through a lot of regular expression syntax really quickly. There is a cheat sheet on the website, so you do not need to write down notes on each of these things. If you look at the cheat sheet and feel it isn't quite descriptive enough for something, feel free to take some notes, but you probably want to open up that cheat sheet when you're working through the exercises.

So this gives me a match object here. It's matching that last a, from 3 to 4. What does that caret do? What did I say the caret does? It inverts a character class. There we were inside those square brackets, which match any one character in those square brackets. That caret there does one thing; then this caret here, what does this one do? Yeah, this is matching an a only if it starts the string. So this is something we haven't seen before. This is strange for two reasons. First, we've already seen caret, but it did something different from this caret. A caret inside of a character class, at the very beginning, inverts the character class. A caret at the beginning of a regular expression, outside of a character class, matches the beginning of the string. So symbols in regular expressions are very context-sensitive. This is one of those cases. Question mark is another one that we're going to see that's very context-sensitive.

This is weird for another reason. This caret here: what character is it matching? Is it matching a character? Yeah, it's not matching a
particular character. It's matching a notion: the beginning of the string. So this is actually not matching a single character here. Unlike that a, and unlike that character class we saw before that matches a single character, this caret is matching the space between characters. It is an anchor; anchor is one of the words we use for this. This is anchoring itself exclusively to the beginning of the string. Dollar sign here is the opposite of this: that's the end of the string. So there has to be an a and then the end of the string. It's the only acceptable place to find this a. What would that match, caret a dollar sign? Just a. So the only acceptable string is "a", which is a strange thing to search for.

All right, any questions so far? Yep? I did not. So this order doesn't matter. I was about to say it doesn't matter at all; I believe it doesn't matter at all. There may be a performance reason it matters in terms of matching, but it is doing a match on all of those things: it could be 0 to 9, a to z, A to Z. In fact we could have put another 0 to 9 in here, or 1 to 4, which is an odd thing for us to do. Other questions? All right, we're going to go through...

Yep, that's a great question. So for example, like this: it does not. It's matching a single a, and the next thing that must follow that single a is the end of the string. We haven't gotten to this yet because we've only matched a single character so far, but when you match multiple characters, it's matching them contiguously.

So if I wanted to not match h, e, l, or o, if I wanted to match that literal word with brackets around it, I can use backslashes. So this does not match here, while this does. The only thing this matches is a string that contains this. So: open square bracket, close square bracket, caret. What else have we seen so far that's special? Dollar sign. That's the last one I think we've seen so far. Those are metacharacters. Those are characters that do not literally
match what they mean in a regular expression, at least not in all cases. Whenever you see a metacharacter, you may want to backslash it just to make sure that you're matching it literally, if you need to. And we're going to see a lot of metacharacters; regular expressions have quite a few of them. For example, if we want to match a literal dollar sign, this doesn't work, because you can't have a 1 after the end of the string. It's looking for the end of the string and then 1 0 0, because it's not smart. We need a backslash there to match a literal dollar sign.

What does this match? Yeah, any character, any single character. So it's matching h. Why is it matching h in particular? It's the first one it found. It's the first string that matched this regular expression. What does this match? Yeah: a, any single character, and a z. Will this match? Who thinks it will match? Who thinks it won't match? So it does match. It does match here because that space is seen as a character even though it's whitespace. So it's an a, a space, and then a z. What about this one? Who thinks this will match? Who thinks it won't match? A lot fewer hands that time. Was that more confusing than the last one? Maybe it's because I'm asking trick questions. All right, I'll try to shuffle up the trick questions and non-trick questions.

Yeah? So that 1 comma 4: it's starting matching at 1 and ending matching at 4. First I want to show this here: it does not match. It doesn't match because there's nothing between the a and the z; there has to be exactly one character. All right, so this 1, 4 is the start and end. If we were to slice this string from 1 to 4, we get the exact match there. So that's what that represents. There's actually another way to do that, group; we're going to look at that a little bit later. That is the way to get the actual string that it matched. So it matched exactly a, space, z. Does that answer your question? Okay. Yep? Yes, exactly, that's the first character that it
found when it started matching that regular expression. So regular expressions are a matching language. They don't just match single characters; they match a single substring, essentially, but they're not a straight substring match. This is describing a substring using a specific grammar, a regular grammar, and it is matching that grammar, starting that match at character 1 and ending it at character 4.

Yep? Right, so why does it end at 4? This is a little bit odd. This is also behavior that's inherent to the slicing syntax in Python. If you slice from 1 to 4, you get a, space, z. If you slice from 1 to 1, you get the empty string. So this is non-inclusive of that second number; it stops just before it. So it's a slicing behavior, and you don't need to worry about those numbers. The thing that you actually want to worry about here is what it matched. It matched a, space, z. That's the easier one.

All right, so earlier you asked about a as the first character and last character, and it did not match "anna" here. How could I write a regular expression that matched an a, two characters, and an a? What could I type? Yep. Yeah, that matches there. Okay, how could I do a, any number of characters, and an a? Yes. So we haven't done this yet. Dot matches a single character; dot again matches another single character. We want a a, a dot a, a dot dot a, and any number of dots after that point; we want any of those to match. With an asterisk, this is a, any number of characters, and an a. This asterisk here is a quantifier. It is modifying the thing that came before it. The thing that came before it matches any single character. This asterisk says: the thing that came before me, match that zero or more times. So if you see "aa", is that going to match? It will match; there are zero characters in there. If we have "ana", that also matches. Two n's, that matches. If we put two different characters in there, in fact it also matches. It is any character, and it's matching that zero or more times. Questions about that?

Yeah, why is it dot star? So this dot
here: if instead I made it an n, it matches something different. "aa" works, "ana" works, "anna" works, but if we put any other character in here, it doesn't work. So that star there is not a generic wildcard. It's a modifier. It's taking the thing that came before it, which happens to be kind of a wildcard: this period, which matches any one character. That star is saying do that thing any number of times, zero or more. Yeah, this is a tricky part of regular expressions, because if you're working from the shell, that star is the equivalent of this dot star. You have more control over what your wildcard means, or what you're matching, in regular expressions, because you've got both the dot and the star.

Okay, so plus does something slightly different, if I can find the plus on my keyboard. There we go. Having a slightly different angle on my keyboard ruins everything. Period plus: this is matching what? Anyone know? Yeah, so "aa" does not match here. All of those others do, though. So asterisk is zero or more; plus is one or more.

Okay, we've got one more thing I want to show you here real quick, and then we're going to jump into our first set of exercises. Who spells color this way? No one spells color this way? This way over here, this first one? We can't all be from the US. Okay, so some of you spell color this way, some of you spell color this way. If we want to match both of these, I could put a question mark there. This is matching the thing that came before, which happens to be a u here, that single character, zero or one times. In this case, zero or one is basically giving us presence or absence: it's either not there or it is there. And if it's there multiple times, it doesn't match, because that r has to follow a u that follows an o, and that u is optionally present there, but only once.

Yep? The question mark and a plus sign? So what would you like me to type here? Right here? That gives me an error. However, plus question mark does something, which is a little weird. We're going
to talk about what that means later. That question mark here: I said before that question mark is another one of those characters that means different things in different contexts. These question marks mean two different things. We're going to talk about this question mark later. This is actually changing it to be non-greedy, which is a strange concept that is specific to regular expressions.

So we're going to exercise this now. We're at the validation exercises. Hopefully the internet works for you by now. If it does not, put up a red sticky and we'll try to work it out. The first exercise is has_vowel. If you have the zip file downloaded, I want to show you how to use it. Unfortunately I don't have easy access to it, so I'm going to go find that zip file and do what you're doing right now, getting that set up.

Okay, so I've got my zip file here. You should be able to run python, either Python 2 or Python 3, with test.py has_vowel, and it fails. Why does it fail? Yeah, this is failing because I haven't done anything here. So validation_test.py has some tests defined in there. If I open up validation.py, I will find a has_vowel exercise. So we're in validation.py: has_vowel, is_integer, is_fraction. These functions are stub functions. Your job is to fill out these first few functions. You're going to want to import re, and you're going to want to do something here, probably return an re.search. In fact, there's a hint on the website that if you convert this to a boolean, you'll get a True or False value. So this is not the answer to this problem, but it gives us a slightly different result here; it gives us a little bit of feedback.

If you have any questions while you're working through these exercises, or if what I just said is confusing, which it probably was because you haven't used this test.py file before, please put up a red sticky, and David, Steven, or I will come by and help you out. I really want you to work through these exercises
today, because you can watch me as much as you want, but working through these exercises while we're here together is going to help you actually get up to speed on writing your own regular expressions, so you don't have to go to Stack Overflow and copy-paste them all the time. All right, any questions at all, put up a red sticky.

So, something I forgot to mention: we are practicing test-driven development. Who practices test-driven development already, where you write tests before you write your code? Okay, so most of you don't, which is perfectly fine. We're practicing test-driven development, but I have done the hard part. The hard part of test-driven development is writing the tests. I've written the tests; your job is to write the code that gets the tests to pass. When I run this python test.py has_vowel, the test runner that I've written is a little bit clever, maybe a little bit too clever. It goes into validation_test.py, and it knows about these has_vowel tests, which take the has_vowel that you are implementing in validation and run various tests against it. So if you want to see the tests I'm running against your function, you can go in there. It should say that "rhythm" doesn't have a vowel, so y doesn't count as a vowel; e and i do, o does. You can see those tests in validation_test if you would like. So we're practicing TDD: I've written the tests, your job is to write the code. If you have any questions about those, any questions at all, put up a red sticky.

All right, I realize that was not very long. Hopefully you got at least some tests failing, and maybe you've written a little bit of code. We're going to solve this first one together, though. So what did you find for has_vowel that you thought might work but didn't seem to work? Did you type anything that you thought might work but didn't? Does anyone want to volunteer anything that didn't seem to work? Would it work if I typed aeiou? That's going to match specifically aeiou.
So it's going to match those characters sitting next to each other in a string, so that doesn't work. What could I do to get this to match? Put brackets around it. This makes a character class. Instead of matching five characters, it's matching a single character, which is either an a, an e, an i, an o, or a u. All right, and that passes those tests. These tests were not the most picky: they looked for lowercase vowels, and they didn't test searching for uppercase vowels. All right, any questions so far?

We move on to the next section, which my browser is not loading. Bear with me a moment as I deal with the internet.

All right, so how could I match a single digit? What could I type? How can I match any number of digits? Bracket, zero through nine. What is that going to match? Single digits, so one character. This matches a single character, because a character class only ever matches one character. How can I get that to match one or more of these digits? Dot plus? Dot star? So this actually works here, but it works for that too, which isn't what we were looking for: zero, a, zero. Why is it matching the a? Like this? Remember this one? Oh, backslash d, yeah. So let's talk about backslash d. This star here matches zero or more. How could I make this one or more? Plus. So that is matching only 1 0. You just mentioned a shorthand for this. So [0-9]+ matches one or more digits. This is a pretty common thing to want to match; digits are a frequent thing for us to want to match, so we have a shorthand. When I mentioned before that regular expressions use a lot of backslashes, this is one of those backslashes. Backslash d means something special when we're in regular expression land. It matches zero to nine, exactly what we typed before. It's a shorthand for getting that character class. Oh, so when I used that plus, it actually didn't work here. It didn't work because it matched 1
0 a 0, because this dot plus here is matching one or more of dot, which is any character, whereas this is matching one or more of that digit there. So this backslash d plus matches one or more digits. This matches a single digit if I just type backslash d.

How can I match something that isn't a digit? How would I do this with a character class? Caret. Could I say caret backslash d? Might work. It does work; it's matching a non-digit there. There's another way to do this: we capitalize the D. With regular expressions, they decided to make them hard to read by making lowercase d mean a different thing from uppercase D. Lowercase d matches a digit. When you uppercase something, it inverts its meaning, right? Like in English, right, when we uppercase words their meaning just totally reverses. So uppercase D matches non-digits; lowercase d matches digits. Confusing, but that is the way regular expressions work. And this pattern is actually a somewhat fortunate one, because we're going to see it multiple times.

Backslash w: this matches word characters. So this matches h. What will this match, backslash w plus? What do you think? Will it match "hello"? Just an h? Something else? So it matches all of "hello". What about this one? Who thinks it's going to match "hello"? Who thinks it's going to match "hello there"? Who thinks it's going to match something else? So it stops at "hello". Word characters don't include whitespace, and that includes spaces. The plus does mean one or more. What it's doing here is looking for a word character: h is one. It's looking for another one: e is one, l is one, l is one, o is one. Then when it gets to the space, it stops, because the match has to be contiguous. It's kind of like when we had that $100 with an a before the last zero: this found 10, stopped at the a, and didn't find the next zero. It's asking for contiguous word characters, and that space doesn't happen to be a word character.

What about this one? Where do you think that'll stop? Who thinks it's going to match h, e, l, l?
Who thinks it's going to match h, e, l, l, 0? Who thinks it's going to match something else? So just like in English, we're allowed to put digits in our words. It's a little bit odd. So "hell0" is a word, or rather that zero is a word character. This backslash w doesn't necessarily do what you might expect. It's matching uppercase letters, lowercase letters, zero to nine, and underscore. So that entire string is a valid word in regular expression land. Who thinks this is a little bit odd? Yeah. Underscores and digits: we don't normally have them in words, but in regular expression land backslash w matches underscores and digits.

What does capital W do, do you think? Non-words. What non-word characters might it find in here, if any? Space. That is actually the only non-word character in here, so backslash capital W finds the space. What would this find? Nothing. So would I get a match object that's an empty string? I get no match object here. Asterisk gives me an empty string. Why does asterisk give me an empty string? Zero or more. Zero or more is always going to give you something back if the whole pattern is zero or more, because the empty string matches zero or more. Questions?

Okay, so we've got lowercase d, uppercase D, lowercase w, uppercase W. Uppercase W is not a great way to match that space there; a better way would probably be to use a space. What other whitespace characters are there? Newline. What else? Tab. Any others? A slash? Another type of slash? Oh, a backslash r, maybe. Yeah, maybe a vertical tab, because we type those all the time now. So there are a lot of different whitespace characters. If we want to match all whitespace characters, we can type backslash s. That is any type of whitespace character, so if I put a newline here, that would have matched as well, if it got there. What is backslash capital S? Non-whitespace, so it's matching that h there.

So lowercase d, capital D, lowercase w, capital W, lowercase s, capital S: those are all shorthands for character classes. We could have typed
those in manually using those square brackets. I'm going to show you a shorthand that is not equivalent to a character class: this backslash b here. Backslash b, hello, backslash b. It matches "hello". It also matches hello here. It doesn't match hello there. What is it looking for? What's your guess? A break. What kind of break? Whitespace, maybe? But here there's no whitespace, so maybe either the end of a string or whitespace, something like that. All right, let's try that out and see if maybe we can get any non-whitespace to work. So caret worked too. It seems like there is some kind of break going on; it's not on whitespace, though. Any other guesses? A boundary. What kind of boundary? For words. Okay, so maybe a non-word character, or a word character next to a non-word character. Let's see. Caret is a non-word character. Is underscore a word character? It is, strangely. So is zero. So neither of those match. What about hyphen? It is a word character in English; it isn't a word character in regular expressions.

So yeah, this is a word boundary. It is a break of sorts; it is a boundary, a word boundary. This backslash b here is actually another type of anchor. This is anchoring itself not to the beginning of the string, not to the end of the string, but to the transition between non-word and word. So either we're moving from a non-word hyphen to an h, which is perfectly fine, or we're moving from the beginning of the string to h. But we can't be moving from a word character to a word character; that does not match there. So backslash b is a way of enforcing word boundaries around the thing that you're matching.

This r I put before my string hasn't mattered up to now. It actually matters here: if I remove that r, it does not match anymore. Does anyone know what backslash b means when you don't have a raw string? Yeah. Well, it's actually not a literal b. It is a literal character, though; or rather, I guess it's not a literal character, it is a
character; it doesn't happen to be a literal b, though. It is \x08. I assume none of you have ASCII tables in your head; I know that I don't. It doesn't print, so this is a non-printable character. This \b is a backspace. This is an ASCII backspace character: it backspaces over the character that came before it and starts printing again. So you're probably never going to use a backspace character; they just get in the way when you're writing regular expressions. So always remember to put that r before your regular expression strings to make them raw strings. All right, any questions so far?

How could I write a regular expression that matches US ZIP codes, at least this form of US ZIP codes? A curly brace? So we haven't looked at curly braces yet. If I wanted to do this manually, I could do either [0-9] or \d one, two, three, four, five times. And actually that's not quite right, because that's looking for them anywhere in the string; I want to make sure the entire string is exactly that. So caret: it has to start at the beginning of the string and end at the end of the string. I'm validating that this is a ZIP code. Ah yeah, if we're doing correct ZIP codes, which I'm not going to go that far, but yeah, if I was doing actual correct ZIP codes I would need to do a little bit further validation on here. I'm going to leave that up to you, though. This \d here, if I did \d+, why is this wrong? Why doesn't this quite match the right ZIP codes? What would it match that it shouldn't, if anything? Yes, it could match something that's too long. Is there anything that it should match that it wouldn't? In this case I don't think there is. These are the two problems that you will always be battling with regular expressions: matching things that you shouldn't match, and not matching things that you should. There's a further meta-battle in there of writing a regular expression that is readable but wrong versus a regular expression that is correct but completely unreadable. You
can't always write a readable regular expression that is also exactly what you're looking for. So often you make compromises and say, for example, I don't care about ZIP codes that start with a zero, I'm going to assume that someone's not going to type those in, or I don't care about ZIP codes that are over a certain number of characters. There's always a compromise that you're making in the world of regular expressions between correctness and readability.

Okay, so if I don't want to type \d five times, I can do this: \d{5}. So this is matching exactly five digits. Do you think this will match? What's your guess? \w, that's a word character, {3,5}. We haven't talked about this comma. Does this not match three to five? So what do you think, will this match then? It does match. All right, let's try out five. Will this match? Maybe it's less than five? It's actually inclusive of five there, so unlike the slicing syntax. So it matches three, it matches four, it matches five, and it doesn't match that because that's six. What would this match? What's your guess? Three or more. So it matches hello, does not match hi. So that is three or more. What does this plus do? What does it match? Would it match zero? Would it match one? Yeah, so it is one or more, so it is equivalent to {1,}. So you can think of the plus as the shorthand for {1,}. What about asterisk? What is that equivalent to? Let's try that out: {0,}. Yeah, that matches the empty string there. So {0,} is equivalent to the asterisk, {1,} is equivalent to the plus. What would question mark be? Zero or one. So that does match. What's going on there? Now, that does match because it's zero, that makes sense. That matches. That doesn't match. So question mark is {0,1}, asterisk is {0,}, plus is {1,}. I believe that is on the cheat sheet, so if you don't remember that, feel free to reference the cheat sheet. Those curly braces there, this is the
long notation for that asterisk and plus. If you need to do something more specific than those, reach for the curly braces. Those go by a lot of names: repetitions is how I think of them; quantifiers is another. Asterisk and plus, these are repetitions.

Will this match anything? It will not. So why didn't that match this? Yeah, it's a case-sensitive match. We can alter the way the regular expression engine works in Python to make this match case-insensitively. It takes an optional third argument, which is a flag: re.IGNORECASE tells the regular expression engine, whenever you see a letter, ignore its case; allow any case. What will this match? Which letter? The first one, because we stopped at that first one. What do you think this will match, if anything? It also matches the first one. So this is case-insensitive even for character classes, not just for literal characters.

All right, we're going to look at one more flag here. If you already know regular expressions and you're bored at this point because you already knew all of this, hopefully at least this will be a takeaway for you. This is the one thing I'd like you to remember out of this. Realistically, when you need to write a regular expression, normally you shouldn't write your own, because you're probably going to write it wrong. You should go to Stack Overflow and look for the regular expression you're looking for, and if someone else has already written it and it is upvoted enough, hopefully it is correct. If it's not correct, then you just copy-pasted the wrong regular expression. But writing your own is something that should be reserved for niche use cases. US
ZIP codes: someone's written a regular expression for that; there's various ones out there. Something that's more domain-specific to something that you're working on, that's when you need to write your own. Regardless of whether you're writing your own or copy-pasting someone else's, I want you to know about this next flag here.

All right, so I'm going to copy-paste in a regular expression. This matches UUIDs. UUIDs are universally unique identifiers. They are made up of hexadecimal digits, and this is not actually specific enough: they're hexadecimal digits that have to be the right digits in the right places, but it matches roughly a UUID. So it is matching a to f or a digit, eight times, so eight hexadecimal digits, a literal hyphen, four hexadecimal digits, a literal hyphen, another four digits, hyphen, four digits, and twelve digits, with a hyphen between them. Who finds this readable? So this is a really long line of code. Not only is it a long line, there are a lot of characters on this line. This is incredibly information-dense, because every character in a regular expression is a statement. So we've got our entire program here, our regular expression program, written on one line of code without whitespace or comments. If we want to add whitespace or comments, we can. So normally you might see someone try to break this up by doing something like this: splitting this up over multiple lines and using string concatenation to make this a little bit more readable. That works; it's a step in the right direction. But there is another way to do this if you turn on verbose mode. And notice I'm using a pipe here. This is something you can look up in the documentation: if you need to use two flags, you need to use a pipe. This is a bitwise or. Unfortunately we're using bitwise flags here, which is a little bit confusing if you're not used to bitwise flags. Use an or when you need multiple flags. If you have questions on that, feel free to ask me during the next exercise break. So I'm going to make this a multi-line string.
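As a rough sketch of the UUID pattern described above (eight, four, four, four, and twelve hexadecimal digits separated by hyphens), here is the one-line form next to a verbose, commented form. The exact pattern typed in the talk isn't shown in the transcript, so the [a-f0-9] character class, the ^ and $ anchors, the variable names, and the sample value are all assumptions for illustration.

```python
import re

# One-line form: 8-4-4-4-12 hexadecimal digits (assumed spelling of the pattern).
UUID_ONE_LINE = re.compile(
    r"^[a-f0-9]{8}-[a-f0-9]{4}-[a-f0-9]{4}-[a-f0-9]{4}-[a-f0-9]{12}$",
    re.IGNORECASE,
)

# Same pattern with verbose mode on. Two flags are combined with the
# bitwise-or operator (|). In verbose mode, whitespace in the pattern is
# ignored and "#" starts a comment that runs to the end of the line.
UUID_VERBOSE = re.compile(r"""
    ^
    [a-f0-9]{8}    # 8 hexadecimal digits
    -
    [a-f0-9]{4}    # 4 hexadecimal digits
    -
    [a-f0-9]{4}    # 4 more
    -
    [a-f0-9]{4}    # 4 more
    -
    [a-f0-9]{12}   # 12 hexadecimal digits
    $
""", re.IGNORECASE | re.VERBOSE)

sample = "123E4567-e89b-12d3-a456-426655440000"  # made-up value, not from the talk
print(bool(UUID_ONE_LINE.search(sample)))
print(bool(UUID_VERBOSE.search(sample)))
```

Both compiled patterns should accept the sample value; the verbose one just spreads the same pieces over several commented lines.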
If you're new to Python, this is going to be a little confusing. These triple quotes here, they allow me to break this string over multiple lines, and Python's not going to complain. Now I'm going to turn off verbose mode here. Will this match? Will this match a UUID? What do you think? Why wouldn't it match a UUID? What would it match instead, or need to match? Yeah, it's expecting newline characters. It's also expecting these indents. So we would have to have the beginning of a string, a newline character, a whole bunch of spaces, and then hexadecimal digits, a newline character, a whole bunch of spaces, and a hyphen. That's not what we're looking for. So I'm going to turn verbose mode back on, and we're going to actually try this out. Let me generate a GUID real quick. Hmm, it doesn't work. I'm just going to type in a UUID. What did I do wrong? Oh, thank you. Something else wrong? That's seven. There we go. It's hard to make up UUIDs. So eight, four, four, four, twelve: this is a UUID.

So if you have one takeaway from this tutorial, it is: turn on verbose mode for all of your regular expressions, break them up over multiple lines, and more than that, comment them. With verbose mode on, if it finds an octothorpe character, a hash symbol, it will ignore everything from there up to the newline character, as if it was a comment, like in Python. So this should work just the same. All right, any questions about this? Who has seen verbose mode before? So if you are working in another programming language, like Perl for example, there might be an x mode. The x mode is pretty much the same thing as verbose mode; in fact, you can say re.X there and it does the same thing in Python.

I think we have a break at some point, but I don't remember when it is. Does anyone remember what time the break is? Maybe there isn't a break. Yeah, I would guess halfway through. Actually, let me check this. Well, it doesn't say on my little piece of paper here, so at some point we may just take a break. Let me check how we're doing on time real
quick. Yeah, if anyone finds out when the break is, feel free to let me know, because it'd be great to take it at the same time as all the other tutorials.

So I flew from San Diego. "I flew from SAN": SAN, like every other airport code, is three characters, typically three uppercase alphabetic characters, so A through Z. How could I match an airport code in this sentence? What do you think? What regular expression might I try to start with? We can always improve it. Curly brace three. Let's see. Yeah, so that matched it there. What if we typed it like that? Is that an airport code? Will it match it? Well, it matched something. So what could we do to improve this? {3,3}? Let's try it. So that did the same thing there. What's the problem here? What is it matching? Yeah, so it's matching SAN, but it stops because it doesn't know that we don't want to keep matching. So after this, what could I put? \b, so a word boundary. And before it too? Oh yeah, in fact it matched there, so it found a match it wasn't supposed to. So if I put a \b before it and after it, it requires that to be a whole word unto itself.

Okay, so if we have a match, we have an arbitrary sentence someone typed in and we found a match, typically we want to actually grab the thing that we found. So what we've been doing so far is mostly validation: we've been saying we want to make sure that something exactly matches a pattern. Here we're not validating, we're searching: we're looking for something inside of something else. So regular expressions can be used for validating that something matches a given structure or given condition; they can be used for searching, popping something out of a string that you're looking for; they can also be used for substituting things, for replacing one thing with another. Those are the main use cases of regular expressions. So we're searching here. This match object: if I want to grab the actual SAN that we see out of it, I can call group, and we get the actual airport code that it
matched. Ah, let's see. So if I say, for example, something like that, what do you think m.group() is going to give me? Just SAN. Why does it give me just SAN? It stops with the first one, yeah. So there is a way to get more that we haven't seen yet; we're going to talk about that. But this is a feature of re.search, this is fundamental to re.search: re.search stops at the first one.

So groups. Let's talk about groups a little bit more. We have already made a search for a ZIP code, and we did it this way. Is there any other way you can write a ZIP code? Yeah, you can put four digits at the end, so you could do a dash. So this is a more specific ZIP code. The post office accepts either of these: you can be really specific or you can be not very specific; they're fine with either way of writing ZIP codes at this point. How could I match this form of ZIP codes? What could I type? Yeah, so I need those four digits, like this. Okay. Ah, so right, if I type this, will this match? At this point it doesn't match. How can I match either of these? So question mark: I put a question mark here and a question mark here. That still doesn't match here. Why doesn't it match there? So this doesn't match even though it should, and that's because this question mark actually changes the greediness. We're going to talk about that later. Question marks are confusing in regular expressions. I can use parentheses to group this. So far, every time we've used an asterisk, a plus, a question mark, curly braces, anything that modifies things that came before it, we've been modifying a single character: an a one or more times, any character one or more times, or zero or more times. This is taking this question mark and taking a group of characters and matching them zero or one times. So it matches this, which is zero times. It matches this, which is one time. Does it match this? It doesn't match this, because it did not see this whole string here, a hyphen and four digits, one time, and it didn't see it zero times, and it needs to see an end
of the string after whatever it found. Oh, you mean afterwards, like here? Let's try it. So it doesn't match. So the reason it doesn't match there is this is zero or one. Yes, if I put an asterisk here, that'd be zero or more. Yep.

All right, so these parentheses here, these form a group. Groups are a little bit odd because they're used for two different things. What we've just used it for here is grouping characters together to be modified, to be repeated or to be modified in some other way. Typically repetitions are the way you modify either groups or individual characters. They can also be used for capturing something out of a string, so let's take a look at that. When I typed m.group() before, it gave me the whole match, so in this case it's matching all of that. I'm going to go back to this question mark here. m.group() matches everything. I can pass group a number: if I say m.group(1), it's giving me this. What is this? It's the thing inside parentheses, so it is the first thing inside parentheses. So we're actually starting counting at 1 here. If I put in 0, it gives me the whole string. So 0 represents the whole match, just like if I don't pass anything; 1 represents the first group. So groups are capturing by default. So these groups are used for grouping to be modified; they're also used for pulling something out of the string, to capture it. Any questions about groups? Oh yeah, so how would I get just the first part? So I could put parentheses around it, and then if I say m.group(1) it gives me the first part, and 2 gives me the second part. What would happen if I did this? What would group 1 be? So group 1 is the same as before, it's 90210. What would group 2 be? So it doesn't exist at all. It's not an empty string, it's None. It didn't even find group 2, so it's giving me None. If I've got a hyphen, group 1 in this case is the same. What is group 2 here? Yeah, -4873. So capturing groups, they can be optional, but you get them
captured regardless: it's either going to give you None or it's going to give you the actual capture, depending on whether it found them. Other questions? How can you know how many groups there are? Let's try it. So if we've got this m here, if I type m.groups(), it gives me all of them. That is one way you could do it: you can ask for the length of this thing. It is a tuple, which works the same way as a list, but it's immutable, you can't change it.

So I have a sentence here. This sentence has airport codes in it. I can get the first one of these airport codes by doing the same thing we did before: matching an uppercase letter, three of these, surrounded by word boundaries. So search only gives us one thing back. If I wanted multiple things back, what could I do? Use what? searchall, you think? It doesn't exist. There's something similar that I think you're thinking of: findall. Before I show you findall, I'm going to show you something else: finditer. So finditer gives us an iterable of match objects. So if I call group, just like I did before, this gives us each of those matches. So it starts searching, once it finds one it stops, gives you the match, and then starts up again from where it left off. So if I took this airport matches thing and I converted it to a list, what type of things would be in this list, do you think? Will they be strings? They are not strings. What are these? They are SRE_Match objects, so these are these strange match objects we've been working with. So finditer isn't much of a shorthand; this is the long form, for getting match objects back. You still need to take those match objects and do something with them to grab a match out of them. So findall, that you mentioned a moment ago, does something slightly different. This is a shorthand: this gives us strings that represent each of the matches. Any questions about finditer or findall? With capturing groups? No, it gives you the entire match. So it's sort of like there's an implicit parentheses around the
whole thing: it gives you the whole string that was matched. It's the same, actually, as if we said group(0). Why didn't that work? So this is not something I'm going to talk about in this tutorial: this airport matches is a single-use object, so you can only loop over it one time. See how group(0) there gives us the same thing as just group(). Okay, so this findall thing here is pretty handy when you want to find multiple things. You will mostly be using findall, not finditer.

All right, let's say I have ZIP codes and I want to find all of the ZIP codes in a string. What will this give me? How many matches will I get back, or how many strings? What about here, how many matches will I get back? Also two. And here? Also two, two that I probably shouldn't have gotten back, because our regular expression is not all that great here. So I probably want to put word boundaries around this. That's at least somewhat helpful, but if we saw something like this, that's probably not a ZIP code we're dealing with, so still not all that great. How could I match the long form of ZIP codes as well as the short form here? Yeah, I'm going to do a group, going to put a question mark. What do I put within it? A hyphen and \d four times. So this question mark here matches this zero or one times. So how many matches is this going to give me back? It gave me two, but those are not what I was looking for. What did it give me? What was in the parentheses. So it gave me -4873 and an empty string. So findall has a catch: when you use findall, if you use capturing groups, it gives you the groups that were matched, as a tuple when there's more than one. Whoa, what happened here? Oops, I put that in between the backslash and the d; that's not allowed. So if we wanted a tuple of the optional hyphen part and the other part, we could use two sets of capturing groups there. So this is a little bit annoying: if we want it to match this whole string and not the capturing groups, findall doesn't allow us to do that. Yeah, so like this: so we have parentheses with parentheses inside
them, and then I would put a parenthesis right after it. Is this valid? It is actually valid; this does work. What we get back, though, is a tuple with the entire match, the first part of the match, and the second part of the match. So we would have to loop over this and grab the first thing from each of these. So this is a little bit annoying and a little confusing.

So there's another way to do this. Groups are used for two things. What are the two things groups are used for? What are we wanting to use this second group for here? Why do I have a group at all? So this second group, I don't care about the fact that groups capture things here. The thing I'm using this group for is to modify, to make it zero or one times. I want this set of characters zero or one times. So I'm modifying this group; I really don't care about the fact this group is capturing. So we have to tell this group that it should be non-capturing. So we can write a non-capturing group here. Now, the syntax I'm about to type is awful. I mean, regular expression syntax in general is a little bit hard to read; this is particularly hard to read. This is particularly hard to read because at some point in regular expression history they decided it would be nice to have non-capturing groups, but we have no way of doing that easily in the language without breaking some kind of backwards compatibility. We want existing regular expressions to keep working, so what we'll do is we'll introduce new syntax that would have previously been invalid and make it valid syntax. So it would have been invalid to put a question mark after an opening parenthesis. So they put a question mark there, and then you put a colon, and it magically becomes a non-capturing group. So open parenthesis, question mark, colon is a non-capturing group. If you find that hard to read, it's because it is hard to read. Seeing a question mark after an open parenthesis means there is trouble going on: you have to go look in the documentation and see what the
thing after that question mark means and how it is modifying that group. So this is a case where you might want to turn on verbose mode and add some spaces in here to at least make things a little bit more readable, so that someone could see that this question mark, this colon, and that parenthesis go together somehow. All right, any questions?

So we're going to another exercise break, where it's the search exercises on the website. So if you go to the exercises page, you can scroll down to the search exercises. search.py contains these exercises as well; there are six of them. So I should have mentioned before: well, this last exercise break you might have been able to get through all of them, but in general you should not be able to get through all the exercises in an exercise section. So do not get discouraged if you don't even get through one, or if you only get through one, because these are meant for you to work through after this tutorial. I want to give you more exercises than you could get through. There's thirty of them, we have three hours, and I'm doing a lot of talking, so there's no way you're going to get through ten every hour. Okay, I don't know when our break is, so we're going to do an exercise break right now, and I'm going to figure out when that break is, and we're possibly going to take a half-hour break; otherwise we might just buy ourselves some more time. But if you have any questions at all, put up a red sticky, and David, Stephen, or I will come by and help you out: questions about anything we went through, the exercises, or anything else.

All right, we're going to combine the exercise break and the actual break. There is food outside at the moment, so feel free to go eat something, drink something, go use the bathroom, and we will reconvene, I'm planning on 3:10. Yes, 3:10. Any questions during the exercise break, put up a red sticky; happy to help.

If you haven't taken your break yet, you have two minutes.

So if you were working through the end of validation.py, you
probably found some exercises that were really hard to do, because we haven't learned how to do them yet. Only the first few in validation.py have we learned how to do. We'll come back to that after this search section here. I would like to solve this first one together: get_extension. Cool. So in search.py there are also exercises that are more advanced than we've learned how to do. This next exercise section, once we get to it, you're going to be flipping between validation and search, so you might actually want to get access to the website for this next section if you haven't already.

All right, so get_extension. So it is returning None. Why is it returning None? Yeah, we haven't returned anything, so the default return value for functions in Python is None. So we want to return something here. Let's import re first, because we know we're going to need to use the re module. What did you try to do here that didn't work, or didn't seem to work? Did anyone try something out that didn't work for them the first time? So re.search or findall, what should we try to use? Or finditer? Search. All right, what's the regular expression? What is a regular expression that doesn't work? All right, let's see what's going on here. So we're getting match objects. How do we get a string from that match object? group. So we can take the match object we get back and call the group method on it. All right, so we've got .zip, .xhtml, .jpg, .tar, and we expected gz. How could we improve this? Remove the dots. How? Yeah, how could we remove that dot? Parentheses around this. All right, which group do I want? One. So that's an improvement: we got rid of the dots on xhtml, zip, and everything else there. We are failing at this one, though: tar.gz is supposed to return gz. How could I do this? Dollar sign at the end here. Let's see what that does. Ah, there we go. So \w+. Did anyone try something here that didn't work? Yeah, how did you get just the last one? Did anyone find extensions that were more than the last one?
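One way the get_extension exercise could come out, following the steps above: search for a literal dot, capture the word characters after it in a group, and anchor the pattern to the end of the string. This is only a sketch; the file names below are made up, and returning None when there is no extension is an assumption rather than something the exercise states.

```python
import re


def get_extension(filename):
    """Return the last extension of filename, without the leading dot."""
    # \.      a literal dot
    # (\w+)   group 1: one or more word characters (the extension itself)
    # $       end of the string, so only the final extension can match
    match = re.search(r"\.(\w+)$", filename)
    if match is None:
        return None  # assumption: no extension means None
    return match.group(1)


# Hypothetical file names, in the spirit of the ones mentioned in the talk.
for name in ["archive.zip", "index.xhtml", "photo.jpg", "archive.tar.gz"]:
    print(name, "->", get_extension(name))
```

Without the dollar sign, re.search would stop at the first dot it can match, which is why archive.tar.gz came back as tar in the first attempt.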
What did you end up typing that did that? A word break, a word boundary rather. Let's see if that does it. So the word boundary didn't do it: tar is not the same as gz. So the reason a word boundary didn't do it is this would match the change from r to dot, because dot is a non-word character and r is a word character. If we did word boundary, dollar sign, that would work, but this is a little redundant because that dollar sign is already matching the end of the string. Yeah, a lot of you probably tried this word boundary here. A word boundary along with a caret or a dollar sign is an uncommon thing to see; you usually use one anchor at a time. So what if I did dot here? Would that work? What would that match with archive.tar.gz? tar.gz, yeah. Why did it match tar.gz? It found the first dot. So regular expressions by default are greedy. Not only are they greedy, they're greedy in two ways. They're greedy in the greediness way: that is something the regular expression engine does when you put an asterisk, a plus, or a question mark. It tries to go as far as it can, so it tries to match as many as it can, and then it backpedals if it has to. This is greedy in another way, though, in the sense that this period here, it's matching the very first one. So it's really lazy in this way: archive, dot, it finds the dot, and then we have .+;
period matches what? One or more of any character. So it matches a t and an a and an r, a dot, a g, a z, and then it sees the end of the string and goes, okay, that's fine. This matches just fine. So \w would fix this. Is there another way you could fix this? Did anyone find another solution to this one besides \w+? Also not dot: anything but a dot. Yeah, that matches as well. So this is anything but a dot, one or more times. So this works, and this works. There's multiple ways to solve this one. In general with regular expressions there's not usually one right answer, because there's a whole bunch of different edge cases that these would work differently on. Any questions?

re.split? Yeah, we haven't talked about that yet. So split: could we split on a period and maybe grab the last one? I think we could. In fact, I think we could just do that with plain old strings in Python. Let's try that real quick. Yeah, I think we could just grab the last one here. So most, actually all, of these exercises you do not need regular expressions to solve. Regular expressions are a tool, but they are not the only tool to solve these problems. They sometimes make things easier to read, sometimes not. This one, probably, honestly, I would probably use this to solve: split based on period and grab the last one, because it might be more readable, or at least a co-worker who doesn't understand regular expressions might have an easier time understanding what this does. And that's an unfortunate part about simple regular expressions: often the simple regular expressions are the ones that you could equally solve another way. So re.split is a little different, though. Why did that split that way? So this is splitting on any character; I want to split on just dot. So if I'm splitting on just dot, this happens to not actually need a regular expression, because strings have split as well. Strings do not accept regular expressions when you're splitting. If you need
it to split on a regular expression, you could, though. So when we look at split we'll see a couple of cases where it's actually useful to use re.split in particular and not just the string split. Inside of square brackets? Oh, interesting. Yeah, that is one way to do it. So that would be splitting on any number of a set of characters, which you can't do with the built-in split in Python, but you can do with re.split. But yeah, re.split actually can be used for a whole bunch of things: anything you can think of a regular expression for to split on, you can use re.split for.

What does this match, that regular expression? What is that describing? Time? Could be a range. Literally, what is it describing? Two digits, a colon, two digits. So this could be time, this could be a range of some sort, could be something else. We're going to match time; we're going to match 24-hour time. Is this a valid time? That is not a valid time, and we are matching that. How could we attempt to... well, actually, is this a valid time here, 23:60? That's not a valid time either. So minutes should be 00 to 59, hours should be 00 to 23. Okay, how can we restrict minutes to stop at 59? How can we make this not match? Zero to five. So instead of \d, you do [0-5] and then a \d. That doesn't match, this does, that does. Will 59 match here? Yeah, that second digit is still matching the same way it was before; it's just that first digit we changed. All right, what about this 24? How could I make 24 not match? So this one's a little trickier. How could I make 30 not match? What could I do? Right, so [0-2] here would make 30 not match, whereas 00 still matches and 23 still matches. We still have the problem of 24, though. So at this point we have to use something we haven't talked about yet. The reason we have to use something we haven't talked about yet is we've got a fork of sorts: we need some kind of condition to say, if we have 0 or 1, we want to add any digit afterwards;
if we've got a 2, it can only validly be followed by a 0, 1, 2, or 3; you can't have anything else after that. So regular expressions are not well suited for this problem in general, because we're working with numbers. We're not working with characters here, we're working with numerals, and regular expressions have no notion of numerals; they only understand characters. If we were to write a regular expression to do this, though, we could do so by making this a 0 or 1, adding a group, and using a pipe. What does pipe do? Or. So this is 0 or 1 and any digit, or... what would the other thing I'd type here be? A 2. And what would the second digit be? 0 to 3. So this is 0 or 1 and any digit, or a 2 with a 0, 1, 2, or 3 after it. So 24 doesn't match, 23 does, 18 does, so does 10. So this pipe here is an or: this is allowing us to alternate between one of two things. Yeah, so the problem here is if we did [0-2], we want a different behavior for two than for zero, and we can't put a \d inside here, because the character class matches a single character. So that's a good point, though. You're on the right track here, in the sense that character classes are used for alternation: they're used for alternating a single character, whereas this pipe here is used for alternating between any number of characters. You can type any sub-regular-expression, a pipe, and then some other sub-regular-expression, whereas a character class is essentially a shorthand for using this pipe. So this [01] and this [0-3]: this is kind of like I typed 0 or 1 or 2 or 3 with pipes, which is a really weird thing to see, because we have this character class that allows us to do this, or make a range like this. So character classes are the character-based alternator; this pipe here is the whole-regular-expression, multiple-characters-at-once alternation. Other questions?

All right, so that's pipes. Let's talk about that split that you mentioned earlier. What will this give me? How many things? So we are splitting on comma. It gives me three things. This second one
starts with a space. Why does it start with a space? Yeah, I've got a space after the comma. So I'm splitting on comma; I'd like to be splitting on comma and any number of spaces afterwards. So we could attempt to write a findall like this. How many things is this going to give me back? I'm looking for any number of characters, a comma, and any number of spaces. What do you think? So it gives me one here. Why does it give me one? So this period here, it is matching that comma. So I could say not comma any number of times, comma, any number of spaces. I only get two here. Why don't I get the third one? Yeah, well, actually it's because I don't even have a comma. So column three here, it doesn't end with a comma. So if I said a comma b, it only gives me a; if I put a comma after b, it gives me b. So I'm really not splitting here. This isn't a split, this is a findall. So I could say comma, any number of spaces, or the end of the string. This is getting a little complicated at this point, though, and in fact that findall is giving us more than we want. So sometimes it can be convenient to not model your problem as a regular expression search but as a split. So with re.split, if you think of the world in terms of splitting, you could instead say what here? What could we split on? A comma and what? Or would just comma work? So we've still got a space there with comma. How many spaces do I want after that comma? Zero or more. So I'm going to say whitespace characters, zero or more of those. So I'm splitting using a regular expression. So if you find yourself using the split built into Python strings and you'd like to use a regular expression with it, reach for re.split. If you're using a regular expression and you realize this problem is well suited for splitting and not for finding, use split instead of findall. All right, any questions about split? So I want to mention real quick here that there is, built into the regular expression module, a compile. The thing it gives me back is a compiled regular
expression, which conveniently has on it search, split, and the other features that are built into the regular expression module. So it doesn't make sense for us to split here; it makes sense for us to search. We might want to findall, we might want to split, we might want to search; it really depends on the regular expression we're writing. Compile is a way to write a regular expression one time and use it many times throughout your program. You could just write a string and use it many times throughout your program; this goes a little bit beyond that and actually does the work of compiling that regular expression, optimizing it so that it'll search faster. So I want to mention compile there. You don't have to use this during the exercises, but I do want to show you that it exists, because, especially when you're using verbose mode, it's a very common thing to write a regular expression one time in one place, comment it well, give it a variable name, and start using it throughout your code. All right, I have a quote here, and this regular expression is looking for a double quote, any number of characters, a double quote. What is this going to match? How many things will we get back, if any? Who thinks we'll get nothing, no matches at all here? Who thinks we're going to get one match? Who thinks two? Who thinks something else? All right, so we get one match; we get what you'd expect here. How many matches do you think we'll get here? Who thinks we'll get zero matches this time? Who thinks we'll get one match? Who thinks we'll get two matches? Three? Four matches? So we get an error. Why do we get an error? Ah, I typed a single quote inside of a single quote. I could backslash that; instead I'm going to make this triple quotes. And that didn't work, and I'm just going to add a period to fix that. So we got a single thing matched there. In fact, I'm using search instead of findall, and I could be using findall here. Why do we get a single match? Yeah, dot matches everything. So dot asterisk, this is matching any
number of characters. It goes all the way to the end, and then it backtracks until it finds a quote, here just after the W and before the period. So there are double quotes inside of here. So I could fix this by saying not double quote, and I get two matches there. That might be one way for me to fix that. Okay, so there's another way to fix this. This here is a little bit hard to read. Also hard to read, but maybe a little bit more clear sometimes, is this. What does question mark normally do? What has question mark done so far when we've seen it? Yeah, it matches the thing before it zero or one times. So it modifies the thing before it. Here it is not doing that; the asterisk is what's modifying the thing before it. This asterisk is saying match the thing before it zero or more times. This question mark is modifying the asterisk and making it non-greedy. So regular expressions by default are greedy, meaning when they see an asterisk, they don't settle for zero; they go as far as they can, matching as much as possible. Non-greedy regular expressions try to match as little as possible. They're conservative; they match the shortest string they possibly could that ends in a double quote. So this question mark here is telling that asterisk, I want you to be a non-greedy asterisk. Now, it's probably confusing that there's a question mark after an asterisk. More confusing is that you can put a question mark after a question mark. Let's take a look at these. So we can say this. What is this going to match? How many i's will this match? Three i's, so it's matching all three. How many i's will this match? One i. And this one? Three i's. What about this? All three. What about this? Two. So it's going as far as it can; all of these are greedy. All right, let's do this: asterisk question mark. How many do you think this will match? So it matched three before; this time it matches zero. It's as short as it can go. What about question mark question mark, how many will this match?
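The greedy and non-greedy behaviour described above can be sketched in a few lines of Python. This is only an illustration under assumptions: the sample sentence, the test string "iii", and the helper name `matched_length` are made up for the example and are not the exact inputs typed during the talk.

```python
import re

# Hypothetical sentence containing two quoted phrases.
sentence = 'She said "hello" and then "goodbye".'

# Greedy: .* runs to the end of the string and backtracks to the LAST quote.
greedy = re.findall(r'".*"', sentence)

# Negated character class: anything except a double quote between the quotes.
negated = re.findall(r'"[^"]*"', sentence)

# Non-greedy: .*? stops at the FIRST closing quote it can reach.
lazy = re.findall(r'".*?"', sentence)

print(greedy)   # one long match spanning both quotations
print(negated)  # two separate matches
print(lazy)     # the same two matches


def matched_length(pattern, text="iii"):
    """Return how many characters the pattern consumes at the start of text."""
    return len(re.match(pattern, text).group())


# Greedy quantifiers take as much as they can...
greedy_lengths = {p: matched_length(p) for p in [r"i*", r"i?", r"i+", r"i{1,2}"]}

# ...while adding a trailing ? makes each one take as little as it can.
lazy_lengths = {p: matched_length(p) for p in [r"i*?", r"i??", r"i+?", r"i{2,3}?", r"i{1,2}?"]}

print(greedy_lengths)
print(lazy_lengths)
```

The negated character class and the non-greedy quantifier give the same result on this input; the talk's preference for the negated class is a readability choice rather than a difference in behaviour here.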
None, no i's at all. Plus question mark, how many? One. So one or more; that question mark is making it non-greedy. All right, let's do two like this. How many will that match? Two, as the shortest it can go. And one to two, what do you think? Yeah, just one. So in general you do not need to know about greediness. It is rare that you absolutely have to use non-greedy matching with a regular expression. Usually you can get around needing it by writing something like this: not double quotes, not the thing I'm trying to match, to make something not non-greedy but match precisely what you're looking for. There are times this isn't good enough; there are times where you can't get away with simply negating something and matching less. You can often negate something to match less. When you can't, remember greediness. Remember greediness, go look up how to make something non-greedy, look at the cheat sheet, put a question mark afterwards. In general, turning off greediness, making something non-greedy, makes your regular expression harder to read, so avoid using non-greedy regular expressions unless you need to, or non-greedy matching rather. All right, so we have more exercises now. It looks like we actually might get to the last exercise break, because we're doing pretty well on time. These ones are just called more regular expression exercises. On the website you will see the actual things that we're working through: is number, which is in validation; abbreviate, which is in search; is hex color, which again is in validation; and is valid date, which is in validation. So the reason I have alternated these between sections is because I wanted to confuse you. I didn't actually want to confuse you. I did confuse you, but I alternated these because when you are working through exercises, when you're getting practice at something new, it is a good idea to interleave multiple topics at once. If you just work on validation exercises and then you go to search exercises, you will get better at validation exercises, but you're fooling yourself
because you're getting better at validation exercises knowing that's the task that you're trying to solve. Typically when you're writing a regular expression you don't know exactly what it is that you're trying to do. It's difficult to write a regular expression unless you have context, and with context you're cheating yourself a little bit, because usually you're lacking that context. So after this tutorial (you're not going to get through all these exercises), I would recommend you stripe the ones you're working through; you interleave different types of exercises next to each other. If you have any questions at all during the exercise break, put up a red sticky, whether about the material, exercises, or anything else. If you get stuck on any of these, don't be afraid to put up a red sticky. Let me give you a little over five more minutes. All right, let's work through this first one together. So is number: None is not True. So I'm getting None because that's the default return value for functions in Python. I have the re module imported. So I want to return a boolean, whether or not this is true. So the regular expression I'm writing here should be validating whether or not this is a number. Okay, what did you start with? It didn't work, but it got you in the right direction? Dash, question mark, backslash d... a backslash d asterisk. Okay, so this here is what? Zero or one dashes, so an optional dash there. This is zero or more digits. And then what? Backslash dot. So why not just a dot? You know this is a metacharacter, so we've got a literal period there. Another way to write that would be this; either of those works. And then what? Asterisk, dollar sign. The dollar sign there would be zero or one... I'm going to put an asterisk here. So this doesn't work. We've got negative... so with a period... ooh, this is actually saying True. So this is saying True. Why is it matching negative one two three point eight five nine point? Yeah, this is a substring. How can I make this match the whole string? Caret and dollar sign. So this is only matching a string that exactly matches this
thing. All right, we still have a problem. So period is a number; that's an issue. Five is not a number; that's an issue. Those are our only two failing. So how can we make 5 into a number? Make this one optional? Put a pipe? Could I put a question mark here? See, so that works for five. It did not help us at all for this period. So that's one way to do it; I think we could have used a pipe there as well. Okay, so period. How do we make period not a number? Parentheses? Where does the opening parenthesis go? Like that? See, still doesn't work. Why doesn't it work? So why does period match here? Everything else is optional. So we have an optional dash, zero or more digits, zero or more digits. What if we required one of these to be one or more? All right, so that failed. That failed because point five apparently is supposed to be a valid number. So let's try the other way. All right, five point is supposed to be a valid number too. So we need five point and point five, but we don't want just point. Did anyone find an answer for this one that seemed to work? What did you do? Okay, that's where the pipe is. So basically the same thing we just did here, with one of these with the plus, which worked except it fails for the point five case. So you had this, or, and then what was on the other side of your pipe? All right, so that worked. I think that actually fails for one case that I'm not testing. What would that fail for? So your code, when you are practicing TDD, is only as good as the creator of the tests, and I failed to write some tests here. The test that I failed to write that I can think of here... yeah, dash dot 5. So if we move this minus sign to the outside, dash dot 5 would work, but I didn't test for that. So this is valid; it successfully passes the tests here. But yeah, this pipe here, this is a way for us to handle the issue of having one or more digits before or one or more after. In fact, this also fails if we had any digits before the period. However, that doesn't matter; that doesn't
matter, because if we have any digits before the period, this one will match. So there are multiple ways to write this one. You could do it this way, you could do it this way; either of those passes, and they both match the same thing. Any questions about alternation or anything else? The abbreviate one? Yeah, that one's a tricky one. So abbreviate: I'm not going to do abbreviate. I will say that for abbreviate, one thing you could look for is spaces, or rather word breaks. You can look for word boundaries. So that backslash b would be a word boundary, looking for a letter that has a non-word character before it, or the beginning of the string. That's not going to work for that case of JavaScript, though, because J and S are both part of JSON, JavaScript Object Notation. And so you're probably going to need a pipe to say either we're looking for a word boundary and a capital, or rather a word boundary and any letter, or a lowercase letter and then a capital letter that we're grabbing there. It's going to be difficult to write that pipe and actually capture it. Yeah, if you want to look through the answers of any of them in particular together, feel free to stay after. There is material we're probably not going to get through. Actually, did anyone come to this tutorial specifically because you wanted to know how lookaheads and lookbehinds work? Okay, good. We're going to skip over that. That's at the end. That is something you hopefully will never need to do, because it's painful and ugly looking and it makes your regular expressions harder for everyone else to read. You usually don't need to actually rely on lookahead and lookbehind; it's a very rare time that you really need them. Usually you can get away without them. So there is material on that at the very end. Feel free to go through it and send me an email if you have questions about anything. I think you all have my email address at this point, right? Like, I've sent you an email. If you have questions about anything after this tutorial, feel
free to ask. If you didn't ask during, feel free to stay after for a bit and ask me questions about abbreviate or other exercises. Well, let's talk about substitutions. Has anyone ever used LaTeX before? So I use LaTeX sometimes, and LaTeX wants you to use smart quotes, meaning two backticks and two single quotes, in order to actually make smart double quotes, which turn into smart quotes using Unicode smart quote characters. So we could replace these and change them to simple double quotes using Python like this: replace the backticks and replace the doubled single quotes. That works. We could also use a regular expression for this. So in the re module there's a sub; sub stands for substitution. So we can do backtick backtick, pipe, single quote single quote, substitute that. And what am I going to replace that with? So replace that with a single double quote character there. So that does the same thing; it accomplishes the same task. So re.sub is useful for a number of things. One of the things that it's particularly useful for is data normalization. So for example, we have this row. What were we splitting this row on before? Yeah, commas and any number of spaces. Normally if you're parsing a CSV file, it doesn't care about spaces; it cares about commas. So if we wanted to take this and make it into more of a typical CSV format, we could convert comma and any number of spaces to just comma. So we could say re.sub, comma and any number of spaces, replace that with a single comma, and that space was removed there. Would this have done anything different if I used a plus there? So we get the same string back. The reason we get the same string is that commas that don't have a space after them weren't replaced, because they were already what we were looking for. So either of these works just fine. All right, so substitutions: this is pretty much the third tier of what we're looking at here today. We've done validation, we've done searching, and now
we're doing substituting, mostly for the purpose of normalizing data, or not just normalizing data but also changing the format of data. So for example, we have dates. What format are these dates in? Month/day/year. So these are US-style dates. If we're sending this string here to someone who is from pretty much anywhere else in the world, they might be confused; they might expect day, month, year. We could make this unambiguous by using year-month-day, by using an international format. So I'm going to substitute here. What am I going to match? So I want... how many digits do I want? Two. Then what do I want? A slash, and then two more digits, a slash, and four more digits. So substituting here, I've got an issue: I need to somehow capture the year, the month, and the day and use those in my substitution. So when normalizing data, this isn't always just about taking commas and spaces and replacing those with just commas. Sometimes you don't know what it is you're actually substituting; we're matching any digits. So we talked about groups before. What two things can groups be used for? Matching a value, so grabbing, capturing something for use later. Yep, you can use them for capturing. What other thing can groups be used for? Yeah, well, so I was considering that capturing. The other thing they're used for, which maybe was what you were saying first, is taking a group and modifying it to repeat any number of times. So taking an asterisk or question mark, taking one of our quantifiers, and modifying our group. So we can use them for modifying, which is not what we're doing here. We're going to use these for the other behavior, capturing. So we're going to grab that first thing. What does that first thing represent? Month. Second one's day, third one is year. So I want to do the third one first: I want year, and then month, which is the first one, and then day, which is the second one. So I can use backslash and then the number of the group in my substitution. So groups have a number; that number can also be used when substituting. Any
questions about this? Right. Yeah, so I am not adding new text here; I'm taking existing text and substituting it with parts of itself. I'm taking three of the parts of this text, basically removing the slashes, rearranging it, and adding dashes instead. But yeah, I am not really normalizing, or rather I am normalizing this, but I'm not replacing anything with anything novel or new. So you can use substitutions both for the case of taking something like commas and any number of spaces and replacing them with commas, or taking data and replacing it with parts of itself. Other questions? So we can actually use these backslash-number things. These are called backreferences (well, there's multiple names for them; backreferences is one of them). We can use these in not just substitutions; we can also use these when searching. So I want to show you an example of this. For example, earlier we talked about looking for quotes. So here we're looking for a double quote or a single quote, any string in a non-greedy fashion, and then a double quote or a single quote. So this gives me... not really... that gives me... ooh, what's going on there? Why is it giving me "I don"? So I'm looking for single quotes or double quotes. I am not making sure the end quote matches the beginning quote. If I put this in a capture group, I could use a backreference here. What backreference would I use if I wanted to match the same quote? Backslash one or two? Backslash one. And that gave us something interesting. Why did findall give us this? So what is findall giving you when you have capturing groups? It gives you a tuple of all of the groups. So we don't actually care about this group. Could I make this non-capturing like we did before? What do you think? That did not do what we wanted. What is that doing? So there's a problem here. The problem is backreferences rely on the capturing nature of groups. When I make a group non-capturing, it's trying to look for exactly the same quote, because that question
mark colon makes the group non-capturing, so the first capturing group is the second group that we wrote there. So we have to have this group capturing, but findall doesn't give us what we're looking for if it is capturing. So there's not really an easy way out of this. One way would be to use finditer and fall back to doing this manually. Say for m in finditer, print m.group. Which group do I care about? So group one is the quote that we used; group two is the actual quotation. So on the website there are a couple of other solutions for working with this situation when your findall has to use capturing groups. You can use a list comprehension, you could use finditer, you can use findall, you can even use zip sometimes. There's not really a great solution to this problem. Sometimes regular expressions are not the most elegant tool to solve your problem. In fact, probably most of the time when you think you should use a regular expression, you shouldn't. That's maybe the second takeaway from this tutorial: stop using regular expressions so much. They are very useful and it's nice to know cases where regular expressions can be used, but they are not always the best tool to solve your problems. If you are using them, remember to turn on verbose mode, though. Okay, so we have another exercise section. If you have any questions at all, if you found what I just did confusing, feel free to put up a red sticky. Happy to either talk through an exercise with you if you get stuck somewhere and just want a little bit of help, or work through one of the exercises in some other way, or answer your question. So we are in capture exercises. On the website we are mostly at the end of the search module; actually all three of these exercises are in the search module. We're working through palindrome five, double double, and repeaters. So any questions at all, put up a red sticky. [Music] All right, let's try to solve this first one together: palindrome five. So this is supposed to find all five letter palindromes
in a given string. All right, how could we use a regular expression to get a list of matches? What would we use from the re module? re dot what? What could we use? Does search give us a list? findall. So search gives us a single match object; findall would give us a list of strings. We could also use finditer and iterate through it and get strings out of it. We're going to try to use findall, though. So we're going to make a regular expression, and we're matching on our dictionary that is passed into our function. All right, so now we need to write the regular expression: five letter palindromes. So I'm going to just say all words with five letters. This matched APU. Why did it match APU? Space... A, space, PU; that's not a word. Yes, I've got a dot. All right, let's make that backslash w. It also matched pul Lu. Why did it match that? And race si. Would this work? So this doesn't work; this limits it too much. Why did this not work? What does caret do? It's actually the beginning of the string here. Yeah, so caret... caret and question mark are the two things we've seen so far with multiple uses. Well, question mark has three uses: non-capturing groups, non-greediness, and zero or one. This caret here, outside of the character class, at the beginning, this matches the beginning of the string. What does the dollar sign match? The end of the string. So we're matching the beginning and end of the string. We do not want that. What do we want to match? What anchor? Boundary, a word boundary. So we want to match five word characters on their own, so there should be spaces around them, or dashes, or some kind of word boundary, or the end of the string or the beginning of the string. Okay, now let's make this into a palindrome. How could we look for a palindrome? What is a palindrome? Yeah, it's the same word backwards and forwards. What could we do for five letter palindromes? So put this in parentheses. Okay. Why two and one and not one and two? Yeah, it should be flipped. So the last letter should be the same as the first, the second-to-last
should be the same as the second. So that's weird. What is it giving me here? 'le' is not 'level'; 'st' is not 'stats'. So this is a problem with findall. You could have used finditer here and looped through these and grabbed exactly what you wanted out of it, group zero. You could have also surrounded this all in parentheses. That didn't work. If we surround it all in parentheses, we've got to bump these up; it changes the numbers. So now we get 'level', 'le', 'stats', 'st'. Still not what we're looking for. How can I get rid of the 'le' and the 'st'? Can I make them non-capturing? No. Why couldn't I make them non-capturing? We need them; the one and the two rely on the fact that they're capturing. So I'm going to use finditer for this. I'm going to say matches equals this, and I'm going to use a list comprehension here (you can use a for loop just the same): m.group() for m in matches. So we're grabbing the entire match from each of these. As a for loop, this would be for m in matches, just make some variable here. So equally valid as a for loop, but the list comprehension is a little bit shorter. All right, any questions so far? Ah, do we have to use finditer? We could use findall. Did anyone use findall for this one and get it to work? So if you use findall, you'd have to surround everything inside of parentheses, and that changes the numbers here, so we have to do two and three, and then matches would be tuples, and I think it's the first one we'd want out of this. See if that works. Yeah, that works as well. I think this is a little bit more confusing. Personally, I think at this point it's probably best to fall back to using finditer, because match objects are a little bit easier to work with than these awkward tuples. Other questions? So we are approaching the end of our time together here. We have two more exercise sections. One of them is on lookaheads, though, and you can skip over that if you'd like. The other one I would encourage you to look at after this tutorial is over. I do want to mention real quick the material
that leads up to that section, some of which is not essential but can be valuable for solving those exercises. So we've talked about groups, we've talked about backreferences: backslash one gives you the first capture group, backslash two gives you the second. You can also use this weird question mark thing here in your groups: a question mark, a P, an open angle bracket (a less-than sign), a name, and then a closing angle bracket, to give a name to your group. And then your groups can be referred to using these: backslash g year, backslash g month, backslash g day. This makes things easier to understand in a sense, but also harder to read in a sense. If you are using these, I would recommend using verbose mode. If you've ever used Django's URLs, you may have seen these occasionally mentioned in documentation or in code. So these are named capturing groups. You can look up named capturing groups on your own. Know that they exist and that they can be handy. They're really not essential; they are a handy tool to have in your tool chain now. Now, the other thing I want to mention has to do not with capturing groups but with substitutions. Any questions about this? So this one's quite a complicated one. This is a function, replace date, that takes a match object, takes the groups out of it, month, day, and year, and checks to see if the year is four digits long. If it is, it uses that year. If it's not, it does some logic to figure out whether it's probably a 19-something year or a 2000s year, and then returns a year. So it tries to convert two-digit years to four-digit years using some complex logic here, which we really couldn't do easily in a regular expression. I'm using compile here, but you don't have to use compile. You could use sub with a function. So I've used sub so far with a string; you can use it with a function as well. So I'm going to write this as well without using compile: re.sub, our regular expression, our replace date function, and our sentence. So if you are having a lot of
trouble figuring out how to write your substitution to actually do what you want, you can always fall back to using a Python function. It's a little bit hard, or a little bit complex sometimes, because your function has to take in a match object and return a string, but it is a nice thing to know that you can do. Because you can't do everything with regular expressions, it is nice to be able to fall back to Python code. So substitution exercises are the next thing that we would be working through here. We only have two minutes left, so I'm going to wrap it up at this point. Any final questions at this point? Any questions about regular expressions in general that you feel like we didn't get answered? If so, feel free to stay after and talk to me personally about it. I do want to mention real quick, I didn't really do an introduction of myself at all. My name is Trey. Hi, nice to meet you. I do Python and Django training, so I do this kind of stuff for a living. I don't teach regular expressions usually for a living. Don't think of me as a regular expressions person; I don't want to be pigeonholed in that way. I don't like these nearly as much as you might think I do. I know these because I learned regular expressions from the Perl world, not knowing at the time that they were a weird thing in most programming languages, and then coming to Python I took some of that baggage with me. They are useful in Python; they are not the end-all be-all. I do want to mention, outside of doing training, I also do a live chat every week, a live chat on Python called Weekly Python Chat. I have done a video on regular expressions before, which is pretty much the same content as here. If you have any questions about Python, I love doing videos on these things, I love writing blog posts on these, so feel free to ask me afterwards if you'd like to see me do a video on something in particular. In general I appreciate emails. If you have an email with a question about Python and regular expressions, do not be afraid to email me or approach me in the
hallway and talk to me. I am an introvert, so if I see you in the hallway I may or may not talk to you, but don't be afraid to, you know, come up and talk to me and ask me a question about pretty much anything. If you have any questions, or if there was anything after this that you feel like really didn't get answered, please let me know, because I'd love to add that to my curriculum next year. It won't help you out, because you just finished this tutorial, but maybe I can point you to some resources that might be particularly handy, though. I should probably put up that survey link again. I actually am not convinced that this is the survey for this tutorial, because it mentions things about the morning and not the afternoon, so you might be filling out a survey for someone else's tutorial, but that is not my problem at this point, because this is the only URL that I have. I think this is the right tutorial, or the right URL, but if it isn't, let me know and I can ask someone if there's another one. Or maybe it asks you which tutorial you went through; I actually haven't been to that URL. It has the name at the top? Excellent, it's the right link. Good to know. So yeah, you can go there and fill out a survey. I believe the program committee wants people to fill out surveys to find out how these tutorials went. So I'm going to stick around for a couple of minutes afterwards. We might be shifted into the hallway, but feel free to ask me questions regardless. Thanks a lot.
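For readers following along without the slides, the substitution ideas from the last part of the talk (numbered backreferences in the replacement, and passing a function to re.sub) can be sketched roughly as below. This is a minimal sketch under assumptions, not the code shown in the tutorial: the sample sentence, the constant names, and in particular the cutoff of 50 for deciding between 19xx and 20xx are made up for illustration, and the function name `replace_date` simply follows the name as spoken.

```python
import re

# Hypothetical sentence with one four-digit-year and one two-digit-year date.
sentence = "Invoices dated 05/18/2017 and 12/01/99 are due."

# Month/day/year with a four-digit year; each part is a capturing group.
US_DATE = r"\b(\d{2})/(\d{2})/(\d{4})\b"

# Backreferences in the replacement: group 3 (year), group 1 (month),
# group 2 (day), joined with dashes.
iso_simple = re.sub(US_DATE, r"\3-\1-\2", sentence)

# Same shape, but the year may be two or four digits.
US_DATE_ANY_YEAR = r"\b(\d{2})/(\d{2})/(\d{4}|\d{2})\b"


def replace_date(match):
    """Build a year-month-day string from a month/day/year match object."""
    month, day, year = match.groups()
    if len(year) == 2:
        # Assumption for this sketch only: two-digit years of 50 or more
        # are treated as 19xx, anything lower as 20xx.
        century = "19" if int(year) >= 50 else "20"
        year = century + year
    return "{}-{}-{}".format(year, month, day)


# re.sub accepts a function in place of a replacement string; the function
# receives each match object and returns the replacement text.
iso_with_function = re.sub(US_DATE_ANY_YEAR, replace_date, sentence)

print(iso_simple)
print(iso_with_function)
```

The string replacement is enough when the output is just a rearrangement of captured groups; the function form is only needed once the replacement depends on logic, such as the year expansion here.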
Info
Channel: PyCon 2017
Views: 7,934
Id: 0sOfhhduqks
Length: 199min 41sec (11981 seconds)
Published: Thu May 18 2017