Regular Expression Tutorial Python | Python Regex Tutorial

Captions
Are you building your career as a programmer or a data scientist, and do you want to know about one technical skill that can differentiate you from the rest of the crowd? Well, that skill is regular expressions. Assume you are working in a finance company and you want to extract information from Tesla's financial report. In this video I'm going to show you how you can do that in Python using regular expressions, and make sure you practice the exercise that I'm going to give you at the end of the video. I googled Tesla's company filings, which takes me to this website where you can see the company's annual and quarterly filing reports. I'm going to apply a filter here and click on this 10-Q PDF, which has information on the company's financial numbers. I'm going to keep things simple and extract the titles of these two note sections: Note 1 is Overview, and Note 2 is Summary of Significant Accounting Policies. Let's say you are a Python engineer, you have already used OCR to extract the text from this document, and now you are using regular expressions to get these two titles. That's your end goal. We are going to open regex101.com, which is an amazing website that I personally love. I use it all the time to build my regular expression first, and then I use that in my code. It is sort of a test pad. On the right-hand side you will see the special tokens that you use in a regular expression for pattern matching. For pattern matching you can of course use simple string matching in Python, but using regex is going to make your life much easier, and we'll see that in a few seconds. I'm going to come back to the note section a little later; just for practice, let's say this is a simple string.
I'm giving you Elon Musk's number in case you have any questions on Dogecoin. By the way, this is a dummy number, so don't even try. Here I want to extract a phone number from this text. The text also says Tesla's revenue is 40 billion (it's not really 40 billion), so there are a couple of numbers in this text, and you want to extract only the phone number. A phone number is 10 digits, so whenever you see a 10-digit sequence in your text you want to extract it. This one is 10 digits, but that one is two digits, so you want to extract this but not that. How do you do that? Look here: backslash d. Any digit is represented by \d. So let's try \d. When you look at the match information, it has so many matches, because it matches a single digit. How do I say that I want to match multiple digits? Well, you can write \d\d. Now it matches two consecutive digits: 99, 91 and so on. How do I say three consecutive digits? \d\d\d: 999, 111, 666. Then four consecutive digits, five, six, seven, eight, nine, ten. This is one way you can extract the phone number, because this regular expression says: extract any text which is 10 consecutive digits. There is a better way to write the same thing. In this help, if you click on common tokens, check this: a{3} means exactly three occurrences of a. I want to do the same thing: \d{10} means exactly 10 occurrences of a digit 0 to 9, and if you look at the match information, it matches exactly that. So this is how you use these tokens. What if I have a number in a different format? Sometimes in the US, numbers are written with the first three digits in brackets and hyphens between the groups. Let's say Tesla's phone number is in that format. It looks a little different, but it's still a valid US number. If you want to extract this one as well as that one, what do you do? First, let me put down the expression we already have.
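The ten-digit match described above can be tried outside regex101 with a few lines of Python. This is a minimal sketch: the sample sentence is reconstructed from the narration (the digits 9991116666 follow the 999, 111, 666 groups read out in the video), and the variable names are only illustrative.

```python
import re

# Sample text reconstructed from the narration; the phone number is a dummy.
text = (
    "Elon Musk's phone number is 9991116666, call him if you have any "
    "questions on Dogecoin. Tesla's revenue is 40 billion."
)

# Long form: \d written ten times in a row.
long_form = re.findall(r"\d\d\d\d\d\d\d\d\d\d", text)

# Short form: \d{10} means exactly ten digits in a row.
short_form = re.findall(r"\d{10}", text)

print(long_form)
print(short_form)
```

Both patterns skip the two-digit 40, because neither can match fewer than ten consecutive digits.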
This is the expression for the simple number. Now let's write a regular expression to extract the second number. In this number we have a bracket first. Normally, when you want a literal match you just type the text: if I type Elon, see, it matches that. But a bracket has a special meaning. See here: (...) means capture everything enclosed. I will get to that in a bit; for now, assume the bracket is a special character, and whenever you want a literal match of a special character you put a backslash in front of it. Doing that matches the opening bracket exactly. Then I have exactly three digits; if you have 10 digits you write \d{10}, and if you have three digits you write \d{3}. Cool! Then you have the closing bracket, but again the bracket is a special character, so you need to put a backslash here too. This looks complex, but if you build it gradually it's not that complex. Then you have a hyphen, then exactly three digits, then a hyphen, then exactly four digits. See, now it matches this particular pattern. If you have anything less, it will not match. It is highlighted, which means it is matching, and you can see it in the match information. So I have this regular expression for this number and that expression for the other number. How do I say that I want either of these to match? For or, I have this character, the pipe: a|b means a or b. So I can now write the first expression, a pipe, and the second expression. Cool, now it matches this number and that number. Now let me do the same thing in Python code. In my Jupyter notebook I have imported a module called re, which is what we use for regular expressions. I'll copy and paste the text here, so my text is this and my pattern is this, and then you call re.findall. findall will find all your matches. You supply your pattern first, then the text, and then you use the matches.
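The notebook step described above might look roughly like this. It is a sketch under assumptions: the second phone number, (999)-333-7777, is a made-up dummy in the bracket-and-hyphen format the narration describes, and the sample sentence is reconstructed rather than copied from the video.

```python
import re

# Reconstructed sample text; both phone numbers are dummies.
text = (
    "Elon Musk's phone number is 9991116666, call him if you have any "
    "questions on Dogecoin. Tesla's revenue is 40 billion. "
    "Tesla's phone number is (999)-333-7777"
)

# Left of the pipe: escaped brackets around three digits, then 3-4 digit groups.
# Right of the pipe: ten digits in a row.
pattern = r"\(\d{3}\)-\d{3}-\d{4}|\d{10}"

matches = re.findall(pattern, text)
print(matches)
```

Because the brackets are escaped, they are matched literally and do not create a capture group, so findall returns the full matched strings.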
See, it matched both of these numbers. So this is a very effective way. If you want to write the same thing in plain Python without regular expressions, try it out; it's going to be much more difficult. All right, awesome. Now let's go back to our main problem, which is extracting the note titles. I'm going to copy and paste this huge text block, and again my goal is to extract the note titles, which are Overview and Summary of Significant Accounting Policies. Let's try this out. To match a note you write Note, then a space, and after that there can be any digit, 1, 2 and so on. You already know what we use for a digit: \d. So you write \d, then a space, then a hyphen, then a space. So far we have matched Note 1 and Note 2, and in the match information you can see the character range of each match in my text block. Now think about it. You want to capture everything that starts from here until you find \n. \n is a new line, so you want to capture anything that comes before you encounter the new line. How do you say that in regular expression language? Check this: [^abc] means any character except a, b or c. So when you want to say any character except one particular character, you put that character inside the brackets after the caret. Let me remove the text for a moment, and just to keep things simple I will type something in. Let's say I have all these characters, and you want to say any character except semicolon and hyphen. To do that you write [^;-]. Now it matches all these characters, a, s, j, l, f, l, whatever, but it did not match the semicolon or the hyphen. And if you want a sequence you can add plus as well. Plus means one or more of those characters, so this is one or more characters that are not a semicolon or a hyphen.
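To see the difference between the negated set on its own and the same set followed by plus, here is a small sketch. The sample string is invented for illustration; the video only shows some random characters separated by a semicolon and a hyphen.

```python
import re

# Invented sample: letters split by one semicolon and one hyphen.
sample = "asjlfl;qwe-rty"

# [^;-] matches one character that is neither ';' nor '-'.
# A hyphen placed last inside the brackets is taken literally.
single_chars = re.findall(r"[^;-]", sample)

# Adding + groups consecutive allowed characters into runs.
runs = re.findall(r"[^;-]+", sample)

print(single_chars)
print(runs)
```

Without the plus every letter is its own match; with the plus the semicolon and hyphen act as separators between three runs.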
Now look at the match information: it matched this, this and that, but it did not match the semicolon or the hyphen. Okay, back to our note example; sorry for the back and forth. So Note, then a space, then \d, then space, hyphen, space. Now any character except \n, so you write [^\n]. You're saying any character but a new line. When you do that, see, it is matching O here and S here, but I want repetition. I don't want only one character; I want a sequence of characters that are not \n. When you want to match one or more of a character, you write a+, so I will add plus. If you want to match zero or more of character a, you write a*. So if a title could be blank and you still want to get it, the right thing to use here is star. Hooray! Look at this, I have my titles: Note 1 - Overview and Note 2 - Summary of Significant Accounting Policies. I matched those two lines, but I want only the titles. There are two things in the match information: one is the full match, which is sort of like a string match, but you want to extract only a portion of that match. That portion is anything that starts after this space. So you put a bracket there. When you put a bracket, it performs the match, but from that match it captures everything that is within the brackets. See here, (...) means capture everything enclosed. So now I am going to copy and paste this pattern into a pattern variable and call re.findall(pattern, text), and what I get is Overview and Summary of Significant Accounting Policies.
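The note-title extraction above can be sketched as follows. The two heading lines follow the narration; the body lines are placeholders, not text from the actual filing.

```python
import re

# Heading lines as described in the video; body lines are placeholders.
text = """
Note 1 - Overview
Placeholder body text for the first section.
Note 2 - Summary of Significant Accounting Policies
Placeholder body text for the second section.
"""

# Without brackets, findall returns the whole matched line.
full_matches = re.findall(r"Note \d - [^\n]*", text)

# With brackets, findall returns only the captured part: the title.
titles = re.findall(r"Note \d - ([^\n]*)", text)

print(full_matches)
print(titles)
```

Using star rather than plus means a heading with an empty title would still match and yield an empty string, which is the case the narration has in mind.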
When you look at the expression itself it looks difficult, at least to me, but if you build it slowly, step by step, it is not that hard. The next thing I am going to do is take a text block from this document and extract the company's financial periods. A financial period is anything that starts with FY; after that there is a year, then a space, and then a quarter. It could be Q1, Q2, Q3 or Q4. One quarter is three months, so there cannot be a Q5. From this text I want to extract FY2021 Q1 and FY2020 Q4. Let's do this in our test pad first, as usual. I'm going to copy and paste the text here and remove the old pattern. So what is the pattern? It always starts with FY, so let's put FY first. Then there should be exactly four digits. How do you do that? You already know: \d{4}. So now match one is this and match two is this. After that there is a space; regex101 displays a space as a dot. So a space, and then Q. Now, I cannot use \d here. \d matches these, but it will also match things like Q5, and Q5 is not a valid financial quarter. The number has to be one, two, three or four. If you look at this help, [abc] means a single character of a, b or c. So if you want to list your choices explicitly, you can write [1234], which matches either one or two or three or four. See, it matched these two, but Q5 did not match; you can see this in the match information. This is one way. The other way is to specify a range: [1-4] means any digit in the range one to four. Friends, this is not hard; stay with me, I'm telling you this is easy. Take a pause and think about it. One to four, and it matched. See, this is super easy.
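A short sketch of the financial-period pattern, comparing \d with the explicit set and the range. The sample sentences are invented; only the period labels and the Q5 counter-example come from the narration.

```python
import re

# Invented sample text; FY2019 Q5 is deliberately invalid.
text = (
    "In FY2021 Q1 the figure was $4.85 billion. In the previous quarter, "
    "FY2020 Q4, it was $3 billion. FY2019 Q5 is not a real period."
)

# \d after Q is too permissive: it also accepts Q5.
too_loose = re.findall(r"FY\d{4} Q\d", text)

# Listing the choices and using a range are equivalent here.
explicit = re.findall(r"FY\d{4} Q[1234]", text)
ranged = re.findall(r"FY\d{4} Q[1-4]", text)

print(too_loose)
print(explicit)
print(ranged)
```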
It's not hard at all. I'm going to copy and paste this pattern into the notebook, and by the way, you can store the result in a variable called matches and print matches instead. See, it matched both of these. Now, this is case sensitive. What if the text has fy in lower case, for example? If you want to handle lower case, there are flags that you can pass: flags is equal to re dot something. If you search for python regex flags you will find all these flags in the documentation. I'm going to use re.IGNORECASE, and when you ignore case it will match capital FY as well as small fy. Now, when I extract this financial information, sometimes I want only 2021 Q1 and 2020 Q4, without the FY. One option is to extract the full match and then remove the FY characters explicitly, or you can be a little smarter and use a bracket. We already saw that after you match something, you can extract part of that match using a bracket. So I put a bracket around the part after FY, and when I do, group 1 is 2021 Q1 and 2020 Q4. Let me do the same in the notebook: I just put the bracket there, and now I have only the information that I need. Now, instead of the financial periods, let's extract the actual values for those periods, which were 4.85 billion and 3 billion. How do you do that? Well, the text can contain any number of numbers. I can have things like Tesla's employee count is, let's say, 5400. I don't want to extract that; I want to extract only what starts with a dollar sign. So how about we put a dollar sign in the pattern?
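The case-insensitive flag and the bracket around the period can be combined as in this sketch. The sample sentence is invented, with one period deliberately written in lower case.

```python
import re

# Invented sample text; the second period uses lower-case "fy" on purpose.
text = (
    "In FY2021 Q1 the figure was $4.85 billion. In the previous quarter, "
    "fy2020 Q4, it was $3 billion."
)

# The bracket captures only the year and quarter, leaving out FY.
pattern = r"FY(\d{4} Q[1-4])"

# Case-sensitive by default: the lower-case period is missed.
case_sensitive = re.findall(pattern, text)

# re.IGNORECASE lets FY match fy as well.
ignoring_case = re.findall(pattern, text, flags=re.IGNORECASE)

print(case_sensitive)
print(ignoring_case)
```

Note that the flag applies to the whole pattern, so a lower-case q1 would also be accepted.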
Well, if you put a dollar sign, dollar is again a special character: $ means end of the string, so it matched the end of the text. I don't want that, so we put a backslash in front. The backslash is an escape, and now you are doing a literal match, meaning you are not using dollar in a special way; if there is a dollar sign in your actual string, that is what you match, and it found these two matches. After that you want any digit, and any digit is \d. But when you do that it matches only the first digit. I can add another \d, but now it does not match, because there is a decimal point. So instead I can use a character set: any digit or a decimal point. Dot is again a special character, so I put a backslash before it, and then I add plus. Plus matches one or more characters; without it, the set matches only the first character. You want to match all the repeating characters until you reach a space or something else. So any digit or dot, repeated. Cool. You can also write 0-9 inside the set; [0-9] means the same thing as \d, a character in that range. I am going to put this in the notebook now, and see, it matched these. Once again, you don't want the dollar sign in your end result, so you can use a bracket for a capturing group. When you add the bracket, in the match information the match is $4.85 but the group is 4.85. So it removed the dollar. Whenever you put something in brackets, the group will have only the content inside those brackets; it will not include the dollar sign. So I add the bracket here and, see, the dollar signs are gone. Now let's take a slightly more difficult task: I want to extract both the financial period and the value, so FY2021 Q1 together with 4.85 and FY2020 Q4 together with 3. I want to extract both. How do I do that?
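Here is a minimal sketch of the dollar-amount extraction just described. The sample text is invented around the two amounts and the employee count mentioned in the narration.

```python
import re

# Invented sample text built around the figures mentioned in the video.
text = (
    "In FY2021 Q1 the figure was $4.85 billion. In the previous quarter, "
    "FY2020 Q4, it was $3 billion. Tesla's employee count is 5400."
)

# \$ is a literal dollar sign; [0-9\.]+ is a run of digits and dots.
with_dollar = re.findall(r"\$[0-9\.]+", text)

# The bracket keeps only the number, dropping the dollar sign.
amounts = re.findall(r"\$([0-9\.]+)", text)

print(with_dollar)
print(amounts)
```

The employee count is ignored because it has no dollar sign in front. Inside square brackets a dot is already literal, so the backslash before it is optional. One limitation of this simple set: an amount that ends a sentence, such as $3. followed by nothing, would pick up the trailing full stop.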
Well, first you write an expression to extract the financial period. What was our expression for the financial period? It was this, right? So I will just copy and paste it. After the financial period there can be any number of characters, and then there is a dollar sign. So the pattern after the financial period is any character but dollar, repeated. How do you do that? You already saw it: any character but dollar is [^\$]. We put the backslash because dollar is a special character, but this is really a dollar; the caret says any character except, the square brackets are just syntax, and then you add plus. Plus means repetition. Let me make this a little bigger. So we start from here and go all the way up to the dollar. Then you match the dollar itself, which is \$. Friends, go slowly, this is not complex, I'm telling you. Don't get confused, focus. \$ matches the dollar sign. After that, you know the expression for matching the number, so I'm just going to copy and paste it. See, we are building these blocks one by one and just copying and pasting. I remove the extra dollar, and the number part is in a bracket. As a result, group one is the period and group two is the value. So now I'm saying the 2021 Q1 number is 4.85 and the 2020 Q4 number is 3. I can now copy and paste this pattern into the notebook. Now matches is a list of tuples, and this is awesome: the 2021 Q1 number is this, and the 2020 Q4 number is that.
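Putting the building blocks together, the combined pattern with two capture groups might look like this. The sample text is invented; the pattern follows the steps in the narration.

```python
import re

# Invented sample text built around the figures mentioned in the video.
text = (
    "In FY2021 Q1 the figure was $4.85 billion. In the previous quarter, "
    "FY2020 Q4, it was $3 billion. Tesla's employee count is 5400."
)

# Group 1: year and quarter. [^\$]+ skips everything up to the next dollar.
# Group 2: the number after the dollar sign.
pattern = r"FY(\d{4} Q[1-4])[^\$]+\$([0-9\.]+)"

# With two groups, findall returns a list of (period, value) tuples.
matches = re.findall(pattern, text)
print(matches)

# The pairs convert naturally into a lookup table.
value_by_period = dict(matches)
print(value_by_period["2021 Q1"])
```

The captured values are strings; converting them with float() would be a separate step.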
Other than findall there is a method called search. search has a different response: it returns a match object, so you have to call matches.groups(), and it finds only the first occurrence, whereas findall finds all the occurrences. That is the difference. All right, that's all I had for this tutorial. Now let's move on to the most important part of this video, which is an exercise. I have given the link to this exercise in the video description below. You have this notebook where you need to extract a few things from this text, and all you need to do is fill in the blank with the right regular expression for the given problem. So read the problem, try to write the regular expression, and make sure you use regex101.com. Once you have attempted it on your own, you can click on the solution link. Don't click on the solution link if you have not tried it, otherwise it will download a special virus and your computer will start burning in fire! All right, I hope you enjoyed this. If you did, please give it a thumbs up, share it with your friends, and make sure you practice. Practice is the most important thing when it comes to coding. Thank you!
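For reference, the difference between search and findall mentioned at the start of this closing section can be sketched as follows. The sample text is invented, and the None check is an addition that the video does not discuss.

```python
import re

# Invented sample text built around the figures mentioned in the video.
text = (
    "In FY2021 Q1 the figure was $4.85 billion. In the previous quarter, "
    "FY2020 Q4, it was $3 billion."
)
pattern = r"FY(\d{4} Q[1-4])[^\$]+\$([0-9\.]+)"

# findall returns every occurrence as a list of tuples.
all_matches = re.findall(pattern, text)

# search stops at the first occurrence and returns a match object,
# or None when nothing matches.
first = re.search(pattern, text)
first_groups = first.groups() if first is not None else ()

print(all_matches)
print(first_groups)
```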
Info
Channel: codebasics
Views: 112,176
Keywords: yt:cc=on, regular expression examples, regular expression matching python, python regex, python regex search, python regular expression tutorial, regular expressions, regular expression, regular expression python, python regular expressions, regex, regex python
Id: sHw5hLYFaIw
Length: 25min 29sec (1529 seconds)
Published: Wed Dec 01 2021