Regular Expressions in Python - FULL COURSE (1 HOUR) - Programming Tutorial

Captions
Hey guys, welcome to a new Python tutorial. Today I want to show you how we can work with regular expressions in Python. Regular expressions, in short RE or regex, are a powerful method for searching for matching text patterns. Typical patterns that can be extracted from large text files with regular expressions are emails or domain names. At the end of this tutorial you will be able to understand what this regular expression here does. There is a lot to cover, but don't be overwhelmed: I promise that once you have understood the concepts it's not so hard anymore, and it can simplify and speed up your search tasks a lot. If you watch the whole tutorial, you will be able to understand any pattern that you want to look up.

Let me quickly show you what we will cover in this video. We will see how we work with the re module in Python. Then I will show you what methods we have to search for matches and what we can do with a match object. Then we will talk about metacharacters and more special sequences that can be used in patterns, then about sets, quantifiers, conditions, and grouping, then about modifications, so how we can modify strings with regular expressions, and at the end I will show you some different compilation flags. So let's start.

As I already said, Python has a built-in module called re which we can use to work with regular expressions. We import re and then we can start working with regular expressions. Let me show you a very simple example first. I already have a test string here, so let me copy and paste it. Now let's say we want to search for the pattern abc, which appears three times here. We create a pattern: pattern equals, then we use the re module and the compile method, and then we pass r and the string abc. I will explain what the r means in a
second. Then we can use this pattern to find matches. We say matches equals pattern.finditer, and we want to find the matches in the test string. This gives us an object that we can iterate over, so we can say for match in matches and simply print the match. If we run this, we see we have two matches. Each one is a match object, and we can see more details. For example, we can see the span, which is the start and the end position, so this one covers indices three, four, and five, and this is our match abc. The second match is at index 12 in our string. We also see that our regular expression is case-sensitive, so it doesn't include the uppercase ABC in our matches. This is one thing that we must know.

One thing I want to mention is that instead of compiling our pattern explicitly, we can use the finditer function directly on the re module. We could also just write matches equals re.finditer, then the raw string abc, and then our test string. If we run this, we get the same results. There is not much of a difference, but I prefer to explicitly compile the pattern and bind it to this object. This improves readability and it's also a little bit more flexible. You should know that you can use both ways.

Now let's briefly talk about why I'm using this r here. It means that this is a raw string. For example, say I have a string a that includes some special characters, like backslash t for a tab or backslash n for a new line, followed by some text. If I print this, you will see the tab at the beginning, so it didn't print the backslash t. In a pattern I usually want to look for the actual characters in
my pattern. If I write an r in front, this is a raw string, so Python will print it exactly as it is specified here. I recommend always using a raw string for your patterns. You can use a normal string, but remember that you should use a raw string. So this is a short example of how a regular expression is used: typically we come up with our pattern, compile it, and then use the pattern to find our matches.

Now let's go over the methods to search for matches. We have already seen the finditer method, which gives us match objects, and I will show you what we can do with a match object in a second. There are three other methods: match, search, and findall.

Let's look at findall first. If we say pattern.findall, we simply get the strings. Here I was printing the whole match object; if I want just the strings, I can use findall, and if I run this it prints just the two strings that I'm looking for.

The match method determines whether the expression matches at the beginning of the string, so it returns at most one match. Here I can say match equals pattern.match and print the match. If we run this, we see None, because match only looks for the pattern at the beginning of our string, and abc is not at the beginning. If I use 123 as the pattern, it is at the beginning, so this returns one match. We also have that pattern later in the string, but again, match only returns the match if it is at the beginning of the string.

We also have the search method. The search method scans through the string
and looks for any location where the regular expression matches. Let's look for abc again: with match this returns None, because abc has to be at the beginning, and if we use the search method it finds a match object and simply returns the first match. So we have search, match, findall, and finditer. finditer is my preferred method, so from now on I will only use this one. We also have some functions that can be used to modify a string, split and sub, and I will come to them later.

Now let's continue with the finditer method and have a look at what we can do with the match object. Again we say matches equals pattern.finditer and iterate over it, so for match in matches we print the match, and again we see the whole match object. We can use four different methods on it: group, start, end, and span. Let's start with span. This gives me the start and the stop index where the pattern is located. If we print match.span(), we simply get a tuple, here three and six. We can also get just the start and the end by printing match.start() and match.end(), so then we get the start and the stop index. Now the group method: if we call match.group(), we get the actual string of the match. We can also give the group method arguments to get group zero, one, or two, and we will talk about grouping later. For now, if you just want the string from the match, call match.group() or match.group(0), which is the same. So these are the four methods that we can use on a match object. And now let's come to the metacharacters in regular expressions.
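To recap the search methods and the match object methods described so far, here is a minimal sketch. The test string is an assumption, since the captions never show the string that was pasted in the video; it is chosen so that the spans agree with the ones mentioned above.

```python
import re

# Assumed test string (not shown in the captions).
test_string = "123abc456789abc123ABC"

pattern = re.compile(r"abc")

# finditer yields one match object per match; collect the spans while printing.
spans = []
for match in pattern.finditer(test_string):
    print(match.span(), match.start(), match.end(), match.group())
    spans.append(match.span())

# findall returns only the matched strings.
found = pattern.findall(test_string)

# match only succeeds if the pattern matches at the beginning of the string.
at_start = pattern.match(test_string)
digits_at_start = re.compile(r"123").match(test_string)

# search returns the first match anywhere in the string.
first = pattern.search(test_string)
```

With this string, match returns None for abc, while search and finditer both report the first occurrence at span (3, 6).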
There are metacharacters that have a special meaning. These are all the metacharacters we must know. You don't have to know them by heart, so I recommend that you keep a cheat sheet somewhere with all this stuff. I will also provide a cheat sheet on my website, python-engineer.com. Now let's go through these metacharacters one by one and I will show you what they mean.

The first one is the dot. The dot matches any character except a newline character. The caret means that we want a pattern at the start of the text, for example a text that starts with the string hello. The dollar sign is the opposite: we use it to look for a string at the end of our text. Then we have some quantifiers, the asterisk, the plus, the question mark, and curly braces, and I will talk about them later in more detail. Then we have the set operator with square brackets, which I'll also cover later. Then we have conditions, and grouping with parentheses, which I will also talk about later. And of course we have the backslash. With the backslash we can get more special sequences, or we can escape characters. For example, if you actually want to search for a dot, we have to escape it in our pattern.

Now let's look at the first three with some examples, and later we will cover the other metacharacters in more detail. First let's use the dot as the pattern and print all the matches. We get every character in our string, because the dot matches any character except a newline. Now let's say we have a dot at the end of our string and we actually want to find this dot. Then we escape it with a backslash, and if we run this we get just the dot. If we print the whole match object, we see the dot and its position. So this is the dot.
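The difference between the dot metacharacter and an escaped dot can be sketched like this. The sample string is an assumption made up for illustration.

```python
import re

# Assumed sample string that ends in a literal dot.
text = "123abc."

# An unescaped dot matches every character except a newline.
any_char = re.findall(r".", text)

# An escaped dot matches only the literal dot character.
literal_dots = [m.span() for m in re.finditer(r"\.", text)]

print(any_char)
print(literal_dots)
```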
Then let's have a look at the caret. Let's say we want to check whether the string starts with 123: then we get one match object. If we look for abc instead, it returns nothing, because abc is not at the beginning. The opposite is when we want to check whether something is at the end. Then we put a dollar sign at the end of the pattern. If we run this with lowercase abc, it finds nothing, because, as I said, it is case-sensitive. If I look for uppercase ABC with the dollar at the end, it finds the match at the end. We will talk about the other metacharacters later.

Now let's look at some more special characters. These start with a backslash. Backslash lowercase d matches any digit, so 0 to 9. Backslash capital D matches any non-digit character. Backslash lowercase s matches any whitespace character, for example space, tab, or newline. Backslash capital S matches any non-whitespace character. For all of these special characters, the capital version is the opposite of the lowercase one. Backslash lowercase w matches any word character, so the characters from a to z, all the capital letters, digits, and the underscore. Capital W is the opposite, so any non-word character. Then we have backslash b, which matches where the specified characters are at the beginning or at the end of a word, and again we have the opposite, capital B, where this is not at the beginning or end of a word.

Let's have a look at them in detail with another test string. If we want to look for any digit here, we can simply use backslash d as the pattern. Now if you run this then we will see
we have three matches, the digits 1, 2, and 3. If we use the opposite, capital D for any non-digit, it finds all the characters except 1, 2, and 3. Next, whitespace: backslash s finds any whitespace character, so here we see a space here, a space here, and a space here. The opposite, any non-whitespace character, is every other character. Then the w characters: if I put in a lowercase w, it finds all the word characters, and the opposite, capital W, finds just the spaces in this example.

Now let's look at backslash b. If I look for hello with a backslash b in front, it finds it, because it is at the beginning of a word. That is not only the beginning of the string but the beginning of any word that follows a whitespace character. For example, if we look for hey, it also finds the hey, but only this occurrence and not this one, because it is looking for matches at the beginning of a word. If we put a space in front of this one, it finds this match too. And the opposite: if we use capital B and look for hey again, it finds this hey, because it is not at the beginning of a word, whereas this one is. So these are the special sequences that we should know.

Now let's continue with sets. We use square brackets for sets, and let me show you what this means. Let's say we only have this string now, and we only want to look for the non-numeric characters, so only for these ones. A set is a pattern between square brackets, and inside it we can put multiple characters that we want to look up. For example, we want to look for an l and an o. If we run this then it will find all these
characters. You must be careful here, because it doesn't look for the sequence lo but for any single character that we put into the set. Let's say we also want the h and the e; then it finds every letter here, but not the numbers and not the underscore. Instead of listing the characters, we can also specify a range. A very common example in regular expressions is a-z, so all the lowercase characters. (Sometimes it's not saving this file automatically.) If you run this now, we see that it finds all the letters. We can also look for digits: let's say we want only the digits two and three, and again we can use a range, say 1 to 9, or let's say 0 to 9, and this finds all the digits. This is the same as using backslash d to find a digit. So if you want to specify a range, the dash is used to define it. If you put the dash after a range, it stands for an actual dash, so if you also want to look up a dash, we can find it this way. If we put it between two characters, it is a range, so be careful here.

We can also write different ranges back to back. For example, say we have HELLO here in uppercase letters. First we only match the lowercase letters, and then we also want all the uppercase characters from A to Z. We can write this back to back: lowercase a-z, then capital A-Z, and this also includes all the uppercase characters. We can again go back to back and include numbers, and then it also finds the digits. So this is how we can use sets with square brackets.

Now let's talk about quantifiers. We have these quantifier metacharacters. The asterisk, the multiplication sign, means zero or more. Then we have the plus; this
means one or more. Then we have the question mark, which means zero or one. It can be used when we want to look for an optional character, so it may be there but it may also not be there. If we want to look for an exact number, we can use curly braces with a number, and this looks for exactly that many. We can also specify a range with a minimum and a maximum: if we put two numbers between the curly braces, it looks for a count in that range.

Let's have a look at them in detail. Let's say we have the string hello_123 and we want to find the digits. Remember, we can do this with backslash d, and it finds all the digits. Now let's say we want zero or more digits, so we add an asterisk. Then it also matches at all the other characters, because there is no digit there but we were looking for zero or more. In those cases our match is just an empty string, then again an empty string, empty string, empty string, and then here we have the digits, and it combines them into one match. If we use backslash d without a quantifier, every single digit is its own match. In this case a plus is better: we want one or more, and then we see it has only one match, with all the digits combined into it.

Now let's say we want to look for a digit that has an underscore in front of it. We look for underscore and then backslash d, and it finds the underscore with the one. But let's say we don't know whether there is an underscore or not. If the string has no underscore and we run it, it doesn't find a match. We can say that the underscore is optional by using the question mark. If we run it now, it finds all the digits, even though the string doesn't have an underscore. And now if we do
it like this, with the underscore back in the string, it still finds the matches, because the pattern can also have an underscore here. So this is the question mark.

Now let's talk about a specific number of characters. If we want to look for three digits, we can say backslash d and then curly braces with a three, and it finds our match. If we look for four digits and run it, we don't have a match. We can also use a range here, say between one and three, and then it also finds the match. So these are the quantifiers.

Now let's stop for a second with all the concepts and do an example. Let me copy this string here and use some of the concepts that we already know. Our string is now a dates string, with dates in different formats. For example, here we have the day, the month, and then the year, separated by a colon. Here it's the year first, again with a colon. Here we have year, month, and day separated by a dash, here by a slash, and here by an underscore. Let's say we only want to extract the dates with this format: year, month, and day, with a dash in between.

The first thing we can do is look for this pattern of four, two, and again two digits. We write backslash d four times, and first of all we allow any character in between. Remember, the dot is a metacharacter that matches any character except a newline. Then we have two digits, so backslash d backslash d, then again any character, and then backslash d backslash d. Let's also say our string has some text in it. If we run this on the dates string, it finds all the dates, but only in this format of four, two, and two digits. For example, it didn't put the text here, the hello
text in here, and it didn't put this date in here because it has a different format. So this is our first try.

The next thing we want to do is find only the dates in this format. Let's exchange the dot for a dash, so now I'm looking for an actual dash, and then we have only the dates in this format: four, two, and two digits separated by a dash. Let's say this one may also be a valid date, so we also accept a slash as a separator. For that we can use a set. Remember, a set is defined in square brackets, and inside we define the characters that may be at this position. So we have a dash and we may also have a slash, then we close our set, and we use the same set at the second separator. If we run this again, we see that this date is also included in the matches.

Now let's say we are looking only for dates in May or June. How do we do that? The month is now not just any digit. We are only looking for the months 05 and 06, so we always have a zero here, and then we can again use a set with only five and six. If we run this, we only have the dates in May or June. Remember, we can also use a range here: if we want May, June, and July, we can say five to seven, and then we have all the dates from May to July.

Now let's use a quantifier. Instead of writing backslash d four times, we can write backslash d and then curly braces with the quantifier four, so we want exactly four digits here, and here we want exactly two digits. So this finds all the dates in May, June, or July in this format. This is one typical example of how regular expressions are useful. We already covered a lot, so let's continue and talk about conditions next. Let me copy another string.
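The three stages of the date example above can be sketched as follows. The dates string is invented for illustration, because the captions do not show the string used in the video, so the separators and values here are assumptions.

```python
import re

# Invented sample data; the video's actual string is not in the captions.
dates = """
hello
01.04.2020
2020-04-01
2020-05-23
2020-06-11
2020-07-11
2020-08-11
2020/04/02
2020_04_04
"""

# First try: four, two, and two digits with any single character in between.
first_try = re.findall(r"\d\d\d\d.\d\d.\d\d", dates)

# Only a dash or a slash as separator, written with a set and quantifiers.
dash_or_slash = re.findall(r"\d{4}[-/]\d{2}[-/]\d{2}", dates)

# Only the months May to July: a literal 0 followed by the range 5-7.
may_to_july = re.findall(r"\d{4}[-/]0[5-7][-/]\d{2}", dates)

print(first_try)
print(dash_or_slash)
print(may_to_july)
```

Inside the set the dash is placed first, so it is read as a literal dash rather than as part of a range.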
Let's do another example. Here I have another string with some names, so let me copy and paste it. This is my new string: we have a Mr Simpson, a Mrs Simpson, a Mr. Brown, a Miss Smith, and a Mr. T. Sometimes we have a dot after Mr and sometimes not. Now let's extract all the different names. Let's say there is some more text in our file, for example hello world, 123, and a date, and we want to extract only the names, and we want the whole name.

Let's build up our pattern, starting with Mr. First we look for M r, then we have a whitespace, so backslash s, and then one or more word characters, so backslash w followed by a plus. Remember, this is a quantifier meaning one or more. I don't write an actual space here because I have the backslash s. If I run this on my string, we see that we have one match, our Mr Simpson: the M r, then a space, then one or more word characters.

As a next step, let's also include a Mr where we have the dot. I have to use backslash dot, of course, because I want to look for the actual dot. Now it only finds Mr. Brown and Mr. T, but not Mr Simpson anymore. As we just learned, we have the optional quantifier, the question mark, so let's make our dot optional. If we run this, we have all the Mr names.

Now let's talk about where conditions are useful. In this case we may not only have Mr, but also Miss or Mrs, so we can use a condition. We use parentheses and separate the alternatives with this metacharacter, which is the either/or. Using this we can write Mr
or Miss or Mrs. If we run this, we see that it extracted all the names from this text. So this is where a condition is useful, and as we have just seen, we grouped this condition together with parentheses, which is again a metacharacter.

Now let's talk about grouping a little bit more with another typical example. Let's copy some emails into our text and say we only want to extract the emails from this string. Again, let's build up our pattern. We can use sets for this. First we want some characters: these may be letters, but also a dash and numbers. So let's use a set with back-to-back ranges: lowercase a-z, capital A-Z, the digits 0-9, and also a dash. Now we are looking for any of these characters, and we want one or more of them, which combines them into one match. This is followed by an @ sign. If we compile and run this, we see that it extracted all the parts made of letters, numbers, or dashes followed by an @ sign. So this is the name part of the email.

Then our email can have different domains, for example @gmail.com, @gmx.de, or @my-domain.org, and we want to extract all the different domains. Let's say the domain doesn't have a digit in it, so for the allowed characters we use another set, again with a-z, capital A-Z, and a dash, and then we have the dot. Of course there are again one or more of these characters, so I have to add a plus here. If we run this, we see our match also includes the domain name and the dot. And then at the very end let's do another set, so here we say
our ending. Sorry, again I was not looking for an actual dot here. This is a typical mistake that I make. As it was, the pattern would also have matched this one here, which is not a valid email address, so I have to look for the actual dot by using the backslash. Then let's say I'm looking only for .com, but it can also be .de or .org. I can use a group with parentheses and a condition inside: com or de or org. Now it would only find these endings. I just wanted to show you the condition again, but we can also just use a set here. So let's use a set again with a-z and capital A-Z, one or more, and no digits. If you run this, it extracts all the emails for us. This is a typical regular expression pattern to look for emails, and it is what I showed you in the beginning, so now you understand what it means.

Now let's talk about grouping a little bit more. There was one case where I used the condition and had to use parentheses, but we can also explicitly group our match into different substrings. For example, I can put everything before the @ sign into a group with parentheses. Then comes the @ sign, then the domain name up to the dot as one group, and then one group for the ending. Now we have three groups. If we run this, it gives the same results. Here we are printing the whole match object, and we can use group to return the actual string. By default this is group zero, the whole matched string, but now we can also print the single groups that we just defined, so groups one, two, and three. Now if we run this and print this, then we see, let's
just print group one for now and comment this out, then we see it only prints this group, so only the name of the email before the @ sign. This here is the second group, so if we print group two, this is the domain name, and if we want the ending, we print group three. So this is where grouping is useful: if we only want to look at specific parts of our match, we can use parentheses.

Now let's move on. We talked about grouping, so now let's talk about modification. We have two methods to modify a string: the split method and the sub method. The split method splits the string into a list, and it splits wherever our regular expression matches. The sub method finds all substrings where the regular expression matches and replaces them with a different string.

Let's look at two examples. Let me grab a string here. Sorry, let's use a different one; I don't have it here, so let me write it myself: abc, 123, ABCDEF in capital letters, again 123, and abc. This is our test string. Now we say pattern equals re.compile with a raw string, and let's say our pattern is 123. Then we say splitted equals pattern.split with the test string as the argument, and we print it. This will be a list. Sorry, this was a bad example, so let's use abc as the split pattern. Then we see where it split our string into different substrings, using each match of the pattern as the split point. Here it found abc, so it split off this part, where we have the 123, then this part, then it found our pattern abc again and split the string again, and at the end we have the rest of the string. So this
is the third substring that it found and returned with the split method. So this is the split method.

Now the sub method. With the sub method we find all the substrings where our pattern matches and replace them with a different string. Let's say our test string is hello world, you are the best world, so we use the word world two times. We want to look for the pattern world, so we say pattern equals re.compile with the raw string world. Then we say subbed_string equals pattern.sub, then what we want to put in as the replacement, let's say planet, and then the test string. It takes our test string, looks for all the places where the pattern matches, so for world, and replaces them with planet. This returns a new, modified string, and if we print it we see hello planet, you are the best planet. So this is the sub method.

Now let's do another example to combine all that we have learned and use the sub method again. Let me grab this string here. This is our URLs string. Let's say we have different things in it and we are only looking for URLs, but they may have different formats. For example, we have an http URL and an https URL, sometimes we have a www and sometimes not, and then we have the typical domain name and ending. Let's extract these by building up our pattern again: pattern equals re.compile and then a raw string. We start by saying it begins with http, then a colon and two slashes, then www, and then a dot, so an actual dot. Then we have one or more word characters. We can use a set here again with a-z, uppercase A-Z, and also a dash, and then we have a plus, so one or more.
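As a brief aside before the URL pattern continues, the split and sub calls described above can be sketched like this. The first test string is an assumption, since the captions do not show the exact string; the second follows the hello world sentence from the explanation.

```python
import re

# Assumed test string for split (not shown in the captions).
test_string = "123abc456789abc123ABC"

# split cuts the string wherever the pattern matches and returns a list.
parts = re.compile(r"abc").split(test_string)

# sub replaces every match of the pattern with the replacement string.
sentence = "hello world you are the best world"
subbed = re.compile(r"world").sub("planet", sentence)

print(parts)
print(subbed)
```

Both calls leave the original strings unchanged and return new objects.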
let's put this into a group here right away. This will return the same thing, and then we can later use this group here. And the next thing, we again have a dot, so backslash dot, and then again we can use a set here, a to z and capital A to Z. And now let's try this out. So let's say matches equals pattern.finditer, and our string is called urls, and then for match in matches we want to print the match. Let's run this, and then we see we made a mistake here, and this is because here I have to say plus of course, one or more. And now it only found this URL. It didn't find this one because we have https here, and this one doesn't have www. So the first thing we can do here is to use an s, and this is an optional s, remember, s question mark, so this is optional then. And if we don't put this into a group, then the question mark will only refer to this character here. So now let's try this out, and now we see it also found the HTTPS URL. And now the same thing with the www, so this may be there or may not be there. So again let's put this into a group and then use a question mark to make this optional. And now if you run this again, then it still doesn't find it, and this is because it must be www and the backslash dot together that are optional, so the dot goes into the group and then I don't need it here anymore. So let's run this, and then we see that it found all of the URLs and extracted them. And now let's say our string has only the URLs here, and let's say we want to return a new string where we replaced all of these optional beginnings, so it should only print the actual domain name with the ending. As we have learned, we can use the sub method. So instead of just finding the matches, what we want to do here is, let's also print this, and then let's say our subbed URLs equals, and then we use pattern and then sub, and then what we want to put in as the replacement here, so for example if we just say hello, and then urls as the string, and then print the subbed URLs. So then we see that this
is the new string here, so it replaced all of the matches with hello. And now let's say we only want to put this in our string, and only this. Then what we can do here is we can group this, and we already did this, so we have a group here, we have a group here, and let's also put this into a group. And then what we can do is we can use back references to replace them. So here we can say backslash two, and we must use a string, so a raw string, and then we say backslash two and backslash three. And now if we run this, then this is our new string. So if I comment this out, then we see this is our new string. And what happened here? Again, let's have a look at the groups, so let's print all the different groups. We have match.group, so this will be the whole match, so this is group zero again. Now let's have a look at what is group one. For example here, this is group one, the first one in parentheses, and because this is optional, this may also be None. So the first URL has None as the first group because it doesn't have www, and this is the first group. Now let's print the second group, so this is the actual name of the domain, and then group three is the ending, so .com, .com, .net. And now here we use group two and group three with this back reference and then replace the whole found pattern with only the domain name. So this is what happens here, and this is also very often used in regular expressions, and now you know what this means. And I guess now we are almost through with all the concepts, all the things that I wanted to show you, and now as a last thing let's quickly talk about compilation flags. So when we compile the pattern, then we also have the option to use different compilation flags. Here I listed them, and again, you don't have to remember them, just keep a cheat sheet. So here we have the different compilation flags: ASCII, DOTALL, IGNORECASE, LOCALE, MULTILINE, or VERBOSE. So I recommend that you check out the
official documentation to see what all of them mean in detail. And now I just want to show you the ignore case compilation flag, so this is also a very common use case. Let's say we have the string, my string equals "Hello World", and then we want to look for the string world. And now if we compile this and then try to find the matches and print them, so print the match, now if you run this, sorry, this is called my string. And now if we run this, then it doesn't find a match, because remember, this is case sensitive. Now if we make a capital W, then it finds the match World. And let's say we don't know what our string is, so it may be uppercase but it may not be uppercase, or it doesn't matter for us. Then we can just use the compilation flag re.IGNORECASE. So we can write this out, so we can say IGNORECASE, or we can just say re.I, and then if we use a small w, then it will still find the match, because now it ignores the case. So this is the ignore case compilation flag. And now, yeah, you also have these other compilation flags, so check them out for yourself. I will provide a link to the official documentation in the description. And now I think we are done, and now you should be able to understand all the different regular expressions. I hope it wasn't too complicated for you, and I hope you enjoyed this tutorial. If you liked this, then please consider subscribing to the channel and leave me a like, and see you next time. Bye.
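The modification steps from the last part of the video (split, sub, back references in the replacement, and the IGNORECASE flag) can be sketched roughly as follows. The sample strings, domain names, and variable names are assumptions made up for illustration, since the exact ones typed on screen cannot be recovered from the captions. The URL pattern is a reconstruction of the one described in the walkthrough, not a verified copy.

```python
import re

# Hypothetical test strings (not the ones from the video).
test_string = "123abc456789abc123ABC"

# split(): cut the string wherever the pattern matches.
# The match is case sensitive, so the trailing "ABC" is not a separator.
split_parts = re.compile(r"abc").split(test_string)
print(split_parts)

# sub(): replace every match with a different string.
greeting = "hello world, you are the best world"
planet_text = re.compile(r"world").sub("planet", greeting)
print(planet_text)

# URL example with three groups:
#   group 1: optional "www." (None when absent)
#   group 2: domain name
#   group 3: ending such as ".com"
urls = """
http://example.com
https://www.example.org
http://www.my-site.net
"""
url_pattern = re.compile(r"https?://(www\.)?([a-zA-Z-]+)(\.[a-zA-Z]+)")

url_groups = [m.groups() for m in url_pattern.finditer(urls)]
for groups in url_groups:
    print(groups)

# Back references \2 and \3 keep only the domain name and the ending.
stripped_urls = url_pattern.sub(r"\2\3", urls)
print(stripped_urls)

# IGNORECASE (short form: re.I) makes the lowercase pattern match "World".
ci_pattern = re.compile(r"world", re.IGNORECASE)
ci_matches = [m.group() for m in ci_pattern.finditer("Hello World")]
print(ci_matches)
```

The replacement string is written as a raw string so that `\2` and `\3` reach the regex engine as back references instead of being read as Python escape sequences.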
Info
Channel: Python Engineer
Views: 17,143
Rating: 4.9384613 out of 5
Keywords: Python, Regex, Regular Expressions, Python Tutorial
Id: AEE9ecgLgdQ
Length: 64min 47sec (3887 seconds)
Published: Sun Apr 19 2020