How to Write Regular Expressions Like a Pro [Regex Tutorial]

Captions

welcome to the regular expressions aka regex master class in this video we're going to be covering all things regex we're going to touch on python javascript all the intricacies of regular expressions definitions you know over the last 10 years of my career i have become fluent with regex and i've noticed that it is this super powerful tool in my arsenal what is regex regex is a way to perform powerful pattern matching and search and replace on strings of text and it really can be used in a wide range of different scenarios so in this video we're not only going to lay out how regular expressions work and the definitions and how to build them but we're going to show you tons of real life examples like password validation or web scraping so you're really going to understand how this works in action this is one of those videos that if you put some upfront investment and watch the entire video this is going to save you tons of time in the future because it's going to accelerate your ability to develop at a faster pace anyways let's get started okay so we're going to start off using a publicly available regex validator and towards the end of this video we're going to actually use python and javascript to show real-world implementations of regex but just to get you acquainted with the terminology and all the different syntax and patterns we're going to use one of these validators here now keep in mind that every major runtime has a regex engine and they're all very similar i would say there's subtle different flavors between them but i would say they they utilize 95 percent of the same exact syntax so if you learn regex in one language you're really learning it in all the languages so the first thing is a regular expression is denoted by two forward slashes and you put your pattern between them to the right of the second forward slash is uh the modifiers or flags which will introduce global rules to the entire regex for instance i is a flag that says make the entire 
regex case insensitive and we'll go over the remaining flags later in the video so one thing i want you to keep in mind is that a string is also a regex it's just an exact search regex so if i type html into the regular expression it will match on all instances of html and it will be case sensitive so when we do this we're using literal characters when i type in h i'm literally looking for another h these are called literals in addition to the literal characters there's a special set of characters called meta characters which have a specific meaning and when you type them into a regular expression you don't actually look for those characters you're looking for something more generic so we're going to start by looking at the various meta characters so the first and most powerful meta character is the dot the dot will match on a single instance of any character so it is the most broad generic regular expression that you're going to have at your disposal so if i go ahead and type in a dot here you will see it matching on every character so the next meta character is the question mark the question mark says for any pattern that is followed by a question mark make the pattern optional and this is useful if we want to match on a pattern that might contain a character or a substring and we're not sure so for instance in this example say we want to match on both open and closing html tags well closing html tags have a forward slash in them and the open do not so we can utilize the optional character in order to match on both and so this would look like the following and now we have a regex that is generic enough to match on both open and closing html tags next we have the pipe and the pipe just says or so say we wanted to match on two terms we could just put the first term to the left of the pipe and the second term to the right of the pipe and now we'd match on both instances of the string now you can see that we're matching on instances of h1 and 
instances of html as well and we can continue to expand that sequence as long as we'd like if you're liking this video go ahead and turn that like button blue so we can get out to a wider audience thank you so now i want to talk about regex shorthands shorthands are little sequence regular expressions that tackle common use cases for instance matching on a word or a digit or a space so let me go ahead and show you those so the first one to match on a word character meaning a letter a digit or an underscore is backslash w so now you can see we're matching on all the letters in the test string to match on a digit is backslash d so now we found all the digits in the string and then to match on a space or other white space we use backslash s so let's talk about anchors for a second because i find that these are pretty much necessary for writing comprehensive regular expressions so an anchor is for any given test string you have lines usually so new lines but you also have a start of the whole blob of text right the very first character and then you have an end of the entire blob of text so how do we denote these different orientations within the test string and the answer is we use anchors so the start of a line is going to be denoted by a caret and that looks like this the end of a line is denoted by a dollar sign and that looks like this the start of the entire blob of text is denoted by backslash capital a see that only matches once on the very start of the test string likewise if we want to match on the very end of the test string we use backslash uppercase z this is useful because it allows us to place the regex in a certain position especially when the test string is very long take like for example an entire html document we might want to find the first or nth instance of something and in order to do that we need to know where we are and be anchored to the start or the end of the string okay now i want to introduce a concept called repetition so say for instance i wanted to match on two digits i could do that with backslash d 
backslash d but there's a more concise way to set that up i could do one backslash d and then i can define the number of digits that i expect to show up and to do this we use curly braces to indicate the number of times we want the preceding regex to appear so i'm going to do open curly brace 2 close curly brace and now you can see it only matches on the 16 because that's the only instance in this test string where there are two digits in sequential order now this rep now this repetition syntax can get even more comprehensive we could create a range so we could say find digits that fall between uh a min and a max for the sake of this example here let's find instances of digits that have at least three to five characters so i'm going to use three for the min and five for the max and now you can see we're matching on one two three we're at we're also matching on one two three four five so again these repetition functions allow us to just create more concise regular expressions if we wanted to match on five digits we could do the following or we can just do this and as you can see this just shortens the regular expression it's more efficient a couple other little notes here is if we don't want a min or we don't want a max you can just leave it blank so say we wanted to do find digits that have three or more characters i'll just do three comma blank and now you can see we're matching on all instances where there are three or more characters now instead of using the curly braces there are a couple shortcuts that have been made just for convenience and the first one is the star the asterisk so the asterisk says that look for zero or more instances of whatever the preceding regular expression is so if i do backslash d star it looks for any number of digits the next one is the plus sign the plus sign says look for one or more and now you can see we're matching on all instances where there are one or more digits the asterisk and the plus sign will usually be interchangeable 
but for a small set of narrow use cases and the more generic one is the asterisk so i'd stick to using the asterisk and you'll probably be good most of the time there's just a couple scenarios where it matters if you use the asterisk or the plus sign namely when the asterisk lets the pattern match zero characters okay let's talk about capture groups so capture groups are denoted by parentheses and they do two things one is they allow us to create a sub regex which is very powerful two they allow us to store whatever is captured in that sub regex in memory so that we can reference it later maybe for search and replace or maybe for back references so let's create a little capture group and you can see i use the pipe to denote those two strings okay so to give you a real feel for how capture groups work i'm going to write a regex here that matches on the h1 tag and we're going to extract the inner text of the h1 tag so you can see this regular expression is using the open h1 tag and the closed h1 tag to kind of anchor the dot star question mark which will eat up everything between those two anchors and then we use a capture group because we want to store that text in memory so that we can reference it as a capture group so i'll show you how to do that so if i pull out the match information here you can see the first part is the match which is the entire regex but the second part is what the capture group referenced and in code if we wanted to reference this capture group we would do so by using dollar sign index and in this case it's the only capture group so it'd be dollar sign one but as you build more capture groups and you might have nested capture groups you need to know that they are referenced based on their sequence within the regular expression and specifically the sequence of the open parentheses because it's going to get very complicated as you have nested uh regular expressions but all you need to keep track of is the open parentheses and that will give you the index of the capture group so that you can reference it 
later say for instance if you do you're doing a replace or a text extraction function so say i also wanted to extract the text from the paragraph tag we can modify this regular expression to be generic enough to match on both instances so i'm going to go ahead and do that okay so you can see i introduced a couple more capture groups and the or the pipe in order to make this regex more generic but say for instance i don't want to use the first and last capture group to store anything in memory i just want to use them for the ability to create a sub-expression we can actually cast it to a non-capture group by using a question mark colon in the beginning of the regex so you can see on the right all the different groupings that it captured but again we don't want to capture the tag name we just want to capture the inner text so we're going to use that question mark colon to uh to convert the first and last capture group to non-capture groups okay and now you can see that we're only storing that capture group that we care about in memory and the other ones are just used for the subexpression function now i don't think every regex engine supports this but some of them do and what you can do is instead of using dollar sign one you can actually name the capture group so say for instance we wanted to name our capture group inner text we could just add a question mark and between two left and right arrows we can add the name and now you can see the group is referenced by its name not by its index okay now let's talk about character classes so character classes are denoted by brackets and anything that falls within the bracket it will attempt to match on so it's kind of a way for us to create a very generic expression so say for instance i wanted to match on any letter or digit we could do that by using the following [Music] and again this only matches on single instances but say we wanted to do any number of letters or digits we can add the repetition flag plus to say match 
on n number of characters or digits and then if we add the plus it will match on any number of characters or digits and you can see how that changes what is actually matched on now i told you that the backslash d and the backslash w were actually shorthand and that is true there's actually a different way to match on digits and characters that is sometimes useful so if we want to match on any number of digits you can use the range function to determine which digit you want to match on so say we wanted to match on 0 through 9 you could do 0 hyphen 9 and that indicates it'll match on that entire range of digits [Music] and intuitively if we wanted to match on only one through five we could just create a range that indicates that likewise for letters you can use a range for the alphabet so we could if we want to match on any letter we could do a through z and this will be lowercase letters if we wanted to match on uppercase as well we would just do capital a through capital z but you can also put any character in here so say we want to match on forward slashes we could throw those literals in here as well we could also do spaces we could also do something interesting which is called a negated character class which means it will match on anything except so say we wanted to match on anything except a white space so to create the negated character class we just start with a caret followed by the character that we're not trying to match on so now you can see we're matching on everything except for spaces the reason negated character classes are useful is because they will allow you to look for everything except the next instance of something so in a way it functions very similar to dot star question mark so let's actually redo the regular expression to find the inner text but instead of using dot star question mark let's use the negated character class so once again we're matching on everything between the right arrow and left arrow but we didn't need to use the dot star 
in this case so with the dot star or the dot plus there's two concepts i want you to keep in mind one is lazy versus greedy and lazy just says find the next instance and greedy says find the last instance so say we wanted to capture all the text leading up to the first anchor tag we would use the lazy approach so i'll go ahead and create that regular expression so you can see because i use the question mark we're capturing all the text up to the first instance of the anchor tag but say i wanted to get all the text up to the second instance of the anchor tag i would just have to remove the question mark to make it greedy and then we would actually include that additional text and now you can see we're going up to the last instance of the anchor tag so just keep that in mind when you're writing your regular expressions okay and then one thing i quickly want to gloss over i don't think it'll be used very much but there's a concept called back reference so say we create a sub expression with a capture group that looks for a b c so it's looking for actually sorry let's make it more generic let's just look for three letters and three letters followed by a hyphen just to make it more specific so say we're looking for three letters followed by hyphen but then we want to check later on if that same exact expression whatever matched so in this case it's abc hyphen say we want to check for abc hyphen later in the regular later in the string but we don't know it's abc hyphen we only have this pattern here well you can do that by referencing the capture group as a back reference and the way you do that is you do backslash index so so say we want to then check if that shows up elsewhere in the string and actually let's just throw it in the string so that it does in fact match um we can do that so we're gonna do dot star question mark and then we're gonna look for the back reference see we're saying look for three letters followed by a hyphen anything in between leading to the 
same exact pattern and we do and that criteria does fit this here because we have abc hyphen abc hyphen and as you have more capture groups you would just change the the uh the index on the back reference okay and then i just want to show you how to use the capture groups the other tool doesn't really display that very well but let's just modify this let's say we want just the we want the first three words so we'll anchor it to the start okay um and actually we want the next word and we'll get rid of this complex guy here we'll just do backslash w plus all right and then we're capturing each of the words in a capture group and then say i want to rearrange them i want to reverse them dollar three that's the third word right dollar two that's the second word and then dollar one and so we can put it in any order that we want so again word one two three regexer was created and then we're reversing it created was regexer so that's how you use actual capture groups it's usually in the context of search and replace okay next is probably the most advanced concept which is positive lookaheads and negative look arounds so a positive look ahead says find a regular expression that is immediately followed by some other expression but only include the initial regular expression in the matching so say for instance we wanted to match on any letter that is followed by a white space but we only wanted the letter to be part of the regular expression itself we can do that using a positive look ahead so the positive look ahead is denoted by a capture group with a question mark and an equal sign and you can see that this pattern is matching on uh letters that are followed by a space so that's the qualification but it only matches on the letter itself not the space next is negative look arounds if you want to create a regular expression that says match on uh text that does not contain some subtext that's where you want to use negative look arounds so the negative look around is denoted 
by the question mark and then the exclamation point and what we're saying here is look for right arrows that are not immediately followed by blank white space so as a result we only capture those right arrows that are immediately followed by another character and not a white space now if we wanted to say does not contain i'm going to show you that regex i know it's going to be kind of intimidating but i will show you how to do that so say we wanted to match on any line that does not contain the phrase html we can use a negative look around to accomplish that so we'll start by using the start of line and end of line anchor and then we'll use dot star question mark to match on anything and then we'll use the negative look around to say but does not contain html okay so now you can see we're matching on every line that does not contain the word html and this is really really useful and again we can expand this out by creating a sub capture group and what we could do is we could say any line that does not contain html or body and now you can see we're only matching on lines that don't contain html or body this is a really useful one to have it might be something you have to copy and paste but it's good to be aware of this one so the final thing we did not talk about yet are the flags or the modifiers so the biggest one i think is just g for global and i for case insensitive say we want to match on any letter and we want to do that using a through z in a character class this matches on the first instance but if we add a flag to make it global now we're matching on every instance so i basically always want global if we wanted to include if we wanted our regex to be case insensitive we can just go ahead and add the i flag to it and now it's case insensitive and so i think there are a number of different flags but really those are the most important for most use cases okay so let's do like a real world example here so say we wanted to construct a pattern that matches on 
email addresses how would we go ahead and do that well we can use some of the key concepts that we just came up with so the first thing is every email address has to have an at sign in it right so we can start with that and then it's followed by a domain name a period and then a top level domain so for the domain name things a domain name can have any letter or number in it so we could do something like that but i guess what if the domain name had a hyphen in it then we'd no longer be matching so why don't we just do it super generically and do dot star question mark and then we'll do a literal period so now we're matching on anything that's followed by a period because we know that's a requirement and now we just have to match on the top level domain so i believe top level domains are fairly rigid in that they are always characters um but they could be different numbers of characters so why don't we just do backslash w plus for that and i think that will suffice most of the time okay so now we just need the actual email handle itself so why don't we go ahead and just use we want to be able to match on hyphens special characters so we could do something like start and then we could just do dot star question mark and then we'll just look for the at sign and so now when we are validating a string we can determine if it's an email address or not now this is not perfect this is what i just came up with very quickly but this is something that you might actually write with real code when you're building an application so now that we've done a bunch of examples of constructing regexes let's actually dive into some real world use cases and get our hands dirty with some actual code and we're going to see how this is actually implemented with actual code examples so i'm going to go ahead and just search for websites and i'm going to click on one of these ads here and the reason i'm going to do that is because when i do there's going to be these url 
parameters that are passed in the url that denote various marketing campaigns and what i want to do i'm going to pop open the javascript console here and now we can write some code to maybe extract these url parameters here when we're writing a regex with javascript there's two ways to do it one is you can use a RegExp object the second is to use a regex literal so let's create our regex real quick and now there are different functions that we can evaluate against the regular expression so one thing we want to first do is just grab the url so i'm going to do var url equals window.location and just so we can ensure that we have that we're going to go ahead and log it to the console and we can see there that we do in fact have the url including the query parameters so now let's combine these two and run a regular expression against the url and match will return an array of all the different instances where that regular expression was found so let's go ahead and use that so again i'm just going to do window dot location dot href and then i'm going to do dot match and we're going to pass in our regular expression so again we could just do backslash w plus and now we are getting an array of every single instance of essentially uh kind of like a word phrasing so and then if we wanted to grab a particular one so say for instance we only wanted the protocol we could just index it as zero and we're only grabbing the protocol there now we can also create different regular expressions like maybe we don't want to grab all the digits uh and we can do that as well and it's super easy but say we want to grab the parameter utm underscore medium and we just want the value because url parameters follow a key value pair so we just want the value so what we're going to do is we're going to say look for utm underscore medium followed by an equal sign and then there's different ways to get the next word we could do backslash w plus so this is now capturing 
the entire regex but we only want this substring here so that's where capture groups come in so we're just going to throw a capture group around this and actually for the capture groups to work we just have to remove the global which is fine and now the capture group is going to be at the first index of the returned match so uh if we don't we're still getting an array we're getting a match of the regular expression and then we're getting the capture group so so with the single line of code we're able to extract all the um utm uh underscore medium uh campaign values so you can see that's pretty powerful now here's one other thing that's good to have in your repertoire a lot of the times when folks are writing code and they want to determine if a string contains a substring they use index of so for example they might do something like window dot location dot href dot index of and then they pass the substring so maybe they want to check if you know utm medium is in there and if it's not in there it will return uh negative one so like if we uh if we change the phrase to utm not found it returns negative one and then in their if logic they say greater than or they just say not equal to negative 1 they can determine if if the string is contained within the if the substring is contained within the string well this is actually not the right way to write this code the right way to do it is to use a function called test and it's more concise so the way you would actually want to do this is you would write a regular expression and then you'd run the javascript method test against the string in this case it's window.location.href and then the um the string that we or the regular expression or string that we care about we would just pass that in like so and this will always evaluate to a boolean so again if we wanted to write code that said um you know console.log utm medium is found the way you properly write that is you do the regex and then test against the test string and that's how you do that and if you wanted 
to say not found um so say utm medium is not found you could just invert the result by prefacing it with an exclamation point or bang so this evaluated to false because we do in fact have utm medium and that's the better way to check for sub strings within strings okay so i want to show you one more way you can create a regular expression in javascript which is to use the RegExp object so we could do var pattern equals new RegExp and then the first argument is the regular expression itself so we could do any letter one or more and then the second argument is the flags so we could do gi for global case insensitive and then we can just test it using test and then we'll throw a test string in here and this should evaluate to true and we can also see what the regex is by just logging it like that the only time i use the object is when i want to pass a variable into the regular expression so for instance if i did var user equals tim say i wanted to create a regular expression out of the variable user the only way to do that that i know is to use that RegExp object so you could do um but say we wanted to like add something else to it we could do um you know right so that's the only way to create a regex dynamically is to use the RegExp object but short of that if i'm writing a regex i just write it in line like this and then you know i can test it against the string just like that okay so i want to show you another example here so we have a CodePen script up here that is just showing a very generic input login form and where you might want to use regex is to validate on the client side whether the password is valid or not so if the user types in something that we know doesn't follow our password requirements then we don't even have to send it to the server we can just notify the user that the password is invalid so that's what this code is doing right here every two seconds it's running it's taking the password field and then if it's greater 
than four characters we will run an evaluation sequence here so i'm just going to modify this code to check that we have digits that the length is greater than eight and that we have an uppercase letter and if those are all satisfied then we allow the user to proceed if they're not then we show an error message okay so what i did with this code is i said don't evaluate the password unless it's greater than four characters just so we're not annoying the user before they've typed anything but once they have typed uh greater than four characters we check that the password is at least eight characters eight or more characters we check that um there's a capital character one or more and then there's a digit one or more and if any of those are not true that's why i put the bang in front of it if any of them are not true separated by the or then we throw a validation error so let's just do pass username is refactored and then password will be my name is tim and you can see we didn't meet the uh the digit requirement we didn't meet the capital uh requirements so let's see if we can't rectify that so we'll add a capital letter there and there so we're still out of compliance because we don't have a digit but if i throw a digit in there then the error message goes away and we're able to proceed and log it so you can see how regular expressions can be used for client-side form field validation okay so say we want to do some web scraping so i have an app script here that is going to make a url fetch to reddit.com and say we want to get the titles of every post on this page well first we got to see what does that actually look like in the html of the page so first we'll start off by just logging the response so basically what this code does is it just makes a request for reddit.com and it logs the initial html source and then from that html source using regular expressions we should be able to extract what we want so let's first just take a look at how this is constructed so 
i'm going to refresh reddit.com and then we'll just look for one of the posts title and then we should be able to determine what's what so there's a post called tell me your favorite video game so i'm just going to look in the html that's the wrong one tell me all right so it actually looks like there's kind of like a data layer here that we might be able to use so let's take a look at this all right here we go so there's a data layer that we can match on so we can write a regular expression to find that so we're going to come over here and we're going to say take that response html and let's evaluate our regex against it so a couple things here is i don't know if there's always going to be a white space so let's do backslash s star and then what we actually want is what's contained between the two double quotes so we're going to use a negated character class to say match on anything that is not a double quote and let's go ahead and see what this returns okay so it looks like actually we did get all the titles so let's um let's just write a little bit of code to make these a little bit uh prettier so what we're going to do is let's iterate over some of this stuff so this is an array and we're going to do dot for each [Music] and we have two parts here i forget what they're called so i'm gonna do a and b then we'll add a new line in there and hopefully that will make it a little bit prettier but actually we don't want console.log we want logger.log because we're using an app script let's see if this helps us uh read the data better so it does so this looks pretty good um so these are these are the titles of the posts but we don't really care about uh this part here so what we can do is we can actually remove that if we want um so let's go ahead and do the following so i think a is the actual search term let me confirm that yeah so we can use regular expression actually again to remove some of the text that we don't care about so we're going to replace we're going to 
look for a regular expression that says title [Music] so okay so we're looking for a regular expression to match on that text that we don't care about and then we're replacing it with nothing so i think this is going to give us exactly what we're looking for yep so you can see um so not everything was a title so it got some other stuff in there but uh tell me your favorite video game and i will guess your age just an idiot interested in getting a new perspective okay and we're getting some of those titles of those different um those different posts so it's very easy to extract data based on a pattern from even complex or unstructured test strings such as just the html of a page so you can see this is pretty powerful here and i just want to show you that regular expressions are used very extensively so i'm just going to youtube.com and i'm going to go over to sources i'm going to do a deep search using command option f and we're just going to look for the regex object so i'm going to search for regex and we get a bunch of javascript files that all reference regex so you can see it's used by all these different files because it's it's helpful so this isn't just some one-off function this is a core part of programming that you should really add to your repertoire alright so i want to give you a quick glimpse into how regex might work in python so there's a module called re and you run this re.compile function to create a regex and then you can use that object to create or use that object's methods there's search findall match i think findall is the most useful and it will return uh the um results of that regex against the test string so we're looking for any number of characters lowercase and you can see we return those proper characters and then maybe we throw in uppercase as well and that should slightly modify the output so now we're capturing every word and really it's it's pretty much you know very similar to how we did with javascript you can take some search 
string and find some needle within that haystack so really accessible i hope this video was helpful and as always if you want to stay apprised of the latest emerging tech around ai machine learning iot and general web development go ahead and click that subscribe button thank you
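
The syntax walked through in the first half of the video (the optional question mark, repetition with curly braces, capture groups, lazy versus greedy matching, back references and search and replace) can be sketched with Python's re module roughly as follows. The sample strings are assumptions, since the on-screen test string is not part of the captions, and the \b word boundaries in the two-digit pattern are an addition that was not shown in the video.

```python
import re

# Hypothetical sample text; the on-screen test string is not in the captions.
html = "<h1>Regex Tutorial</h1>\n<p>Posted 16 May, id 12345</p>"

# The optional slash matches both opening and closing tags.
tags = re.findall(r"</?h1>", html)

# \d{2} is the concise form of \d\d; \b (an addition) stops it matching inside 12345.
two_digits = re.findall(r"\b\d{2}\b", html)

# Non-capturing groups for the tag names, one named group for the inner text.
tag_pattern = re.compile(r"<(?:h1|p)>(?P<inner_text>.*?)</(?:h1|p)>")
inner = [m.group("inner_text") for m in tag_pattern.finditer(html)]

# The same extraction with a negated character class instead of the lazy .*?
inner_negated = re.findall(r"<(?:h1|p)>([^<]*)<", html)

# Lazy stops at the first <a>, greedy runs to the last one.
links = "see <a>one</a> and <a>two</a> done"
lazy = re.match(r".*?<a>", links).group()
greedy = re.match(r".*<a>", links).group()

# Back reference: \1 must repeat whatever the first group captured.
repeated = re.search(r"([a-z]{3}-).*?\1", "abc-xyz abc-")

# Search and replace: Python writes \3 \2 \1 where the video's tool uses $3 $2 $1.
reversed_words = re.sub(r"^(\w+) (\w+) (\w+)", r"\3 \2 \1", "regexer was created")

print(tags, two_digits, inner, lazy, greedy, reversed_words, sep="\n")
```

Python names a group with (?P<name>...) rather than the (?<name>...) form shown in the validator, which is one of the small flavor differences mentioned at the start of the video.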
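
The lookahead patterns, the does-not-contain trick, the quick email pattern and the password checks described in the video might look like this in Python. This is a sketch under stated assumptions: the sample strings and the helper name is_valid_password are hypothetical, and the does-not-contain pattern is written in the common form with the negative lookahead placed right after the start-of-line anchor, which may differ from what was typed on screen.

```python
import re

# \w(?=\s): a word character followed by white space; the space is not consumed.
before_space = re.findall(r"\w(?=\s)", "ab cd e")

# >(?!\s): a right angle bracket NOT immediately followed by white space.
tight_brackets = [m.start() for m in re.finditer(r">(?!\s)", "<b>x</b> <i> y")]

# Lines containing neither "html" nor "body" (assumed sample document).
doc = "<html>\n<body>\n<h1>Title</h1>\n</body>\n</html>"
clean_lines = re.findall(r"^(?!.*(?:html|body)).*$", doc, flags=re.MULTILINE)

# The quick email pattern from the video: anything, @, anything, a literal
# period, then word characters. As the speaker says, it is not perfect.
email_pattern = re.compile(r"^.*?@.*?\.\w+")


def is_valid_password(password: str) -> bool:
    """Hypothetical helper mirroring the three checks described:
    eight or more characters, an uppercase letter, and a digit."""
    return (
        len(password) >= 8
        and re.search(r"[A-Z]", password) is not None
        and re.search(r"\d", password) is not None
    )


print(before_space, tight_brackets, clean_lines)
print(bool(email_pattern.search("tim@example.com")))
print(is_valid_password("my name is tim"), is_valid_password("My Name is Tim1"))
```

The re.MULTILINE flag is what makes the caret and dollar sign anchor to each line rather than to the whole string.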
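
The JavaScript console examples (pulling utm_medium out of a URL, checking for a substring with test, building a pattern from a variable, and scraping titles out of a page's data layer) translate to Python's re module along these lines. The video does these steps in JavaScript and Apps Script; Python is swapped in here, and the URL, the data layer shape and the sample sentence are all hypothetical stand-ins for values that are not in the captions.

```python
import re

# Hypothetical ad-click URL; the real one from the video is not in the captions.
url = "https://example.com/landing?utm_source=google&utm_medium=cpc&utm_campaign=fall"

# Every word-like run, similar to url.match(/\w+/g) in JavaScript.
words = re.findall(r"\w+", url)
protocol = words[0]

# The capture group isolates the value after utm_medium=.
match = re.search(r"utm_medium=(\w+)", url)
medium = match.group(1) if match else None

# Boolean check, playing the role of JavaScript's regex.test(string).
has_medium = re.search(r"utm_medium", url) is not None

# Building a pattern from a variable; re.escape guards any special characters.
user = "tim"
user_pattern = re.compile(re.escape(user) + r"\w*", re.IGNORECASE)
user_hits = user_pattern.findall("Tim and timothy met Kim")

# Assumed shape of the page's data layer: "title": "..." pairs, with optional
# white space after the colon and a negated class for the quoted value.
page = '{"title": "Tell me your favorite video game", "score": 10, "title":"Second post"}'
titles = re.findall(r'"title":\s*"([^"]*)"', page)

print(protocol, medium, has_medium, user_hits, titles, sep="\n")
```

Because findall returns only the capture group when the pattern has exactly one, the titles come back without the surrounding key, so the second cleanup replace shown in the video is not needed in this version.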

Info

Channel: Refactored

Views: 746

Keywords: regex, regular expressions, regex tutorial, regex python, regex javascript, regex java, regex explained, regular expression, python, tutorial, javascript, regexp, regex js, match patterns, python re, regular expressions tutorial, python regex, regular expression tutorial, regular expression python

Id: saABx34CsBE

Length: 40min 37sec (2437 seconds)

Published: Tue Sep 14 2021