Regular Expression Basics

Captions
In this video lecture we're going to talk about regular expressions. We'll also take a look at the command line utilities grep and sed.

So what are regular expressions? Regular expressions are patterns that we can build to match against a specific series of characters: words, or in a programming sense, strings. We can use these patterns to pull out specific pieces of information from a larger set of information that we're interested in. If we're looking at something like a very large log file, we can use a regular expression to isolate just those aspects of the log file that we're interested in. This is really great when you've got a lot of data to pick through on the command line. We can also use regular expressions for finding data and then acting upon it, and we'll take a brief look at sed, which is the stream editor, later on in this video lecture. For now we're mostly going to look at a tool called grep, which allows us to show or not show specific parts of a file that conform to one of our regular expression patterns.

The command line grep utility is used to apply patterns to either standard input or a specific file. We might have a very large file, and what we're interested in doing is just taking a look at certain aspects of it. There are actually two forms of grep: grep, whose name stands for "global regular expression print", and egrep. What you're going to find is that because Unix is so old, it has had a lot of time to evolve, and sometimes people have gone on and extended aspects of Unix in the sense of making it better. So the idea of grep and egrep is that grep is the original grep and egrep is extended grep, and extended grep supports extended regular expressions. Anytime a language has been around for a long time, people start to add to it when they notice failings in that language. We'll look at both extended and regular regular expressions. There are also
Perl regular expressions, which I'm most familiar with. I throw that out there because anytime I make an error when I'm talking about regular expressions, I will probably be thinking in terms of Perl regular expressions.

So enough about what regular expressions are and what utilities we're going to use. What does a regular expression do, and what does it look like? Let's think about a typical pattern that we might want to apply. Let's say that I have asked someone to look through the phone book or a directory of employees and show me everybody whose last name starts with the letter S. I've given you a pattern, a few bits of information: I want you to look at the last name, and I want that last name to start with the letter S. If I have somebody whose name is John Smith, this would match my pattern, because I look at the last name and the first letter of that last name is an S. If their name was Steve John (I guess we'll stick with people with two first names), the first name starts with the letter S but the last name doesn't, so it doesn't match my pattern. So we define patterns, we do so programmatically, and we use those patterns to filter or match against the information that we're most interested in.

First of all, let's look at some really basic ways that we can apply patterns on the command line. I'm currently in my home directory, and there are a number of folders and at least two files in it. There are actually some basic pattern matching capabilities built into the bash shell. For example, say I want to see all of the folders that start with the capital letter D. Notice that if I just put the letter D, I get an error, because ls says it can't find a folder or file named D. But if I add a star after the D, I get a lot of information. In fact, if I wanted to look just at the folders, I would add -d, which gives me information about the directories themselves as opposed to the items in the directories. This star character on the
shell is called the file glob operator, and basically you can think of it like a wildcard match. The pattern that I've essentially given the ls command here is that it should display all folders that start with a capital D followed by any number of characters. When I say wildcard, I mean any number from zero to an infinite amount, and I mean any type of character. So basically this would match any folder that starts with D and has a bunch of stuff after it.

You can also use a wildcard at the front of a string. So I'm going to see any folder that ends with the letter s, and notice it shows me Documents, Downloads, Pictures, Templates, and Videos, because all of those end with a lowercase s. Remember that the command line is case sensitive, so if I use a capital S I get nothing. You can also put two stars on the command line. I could say: show me all directories that have any number of characters, the letter l, and then any number of characters on the other side. In this case it shows me Downloads, examples.desktop, Public, and Templates, because each of these has a letter l in it, and on either side there is any number of characters that match the wildcard pattern.

So this is built into the shell, and it can be really useful if you wanted to show, say, all files of one type in a directory. If we look back up in the shell, you'll notice that in my Desktop directory I have a couple of .txt files. Let's clear this, and let's say I wanted to do ls -l Desktop/*.txt. What this says is: go into my Desktop directory and show me files that have any name and then a .txt extension. If I hit enter, I see those two files sitting in my Desktop directory. It's pretty cool that this capability exists, but what we want to do is look at how we can utilize grep
to get more detailed information about things like directory listings and about the files themselves. ls is great when you're looking for information about some listings, and the wildcard is helpful for getting some basic pattern matching going. But to learn regular expressions, we're going to look at a very specific file on the Unix system: the American English dictionary, which is used for spell checking and other things by the system.

I'm going to pass the American English dictionary file, which is found in /usr/share/dict/american-english, into the less command so we can see what's in there. What you'll notice is that this is a very large file that goes on and on; there are lots of words in it, so it's really useful for our pattern matching exercises. I'm going to hit q, and we're going to start to talk about some really basic regular expressions.

The way we're going to do this is with the grep command. Remember that I mentioned there are actually grep and egrep. We'll use grep until it breaks; in other words, we'll use grep until we run into a feature that is an extended regular expression, and then we'll jump over to the egrep command. Usually it's a better habit to just use egrep all the time, so that you don't have to worry about what is and isn't an extended regular expression, but I want to show you that at some point some of these features will not work.

So I'm going to set this up so that we can grep this file. The way grep works is that we give the grep command a pattern, and then we match it against a file. The easiest pattern I can put in here is to look for words that have c-a-t in them, so let's use cat and see what we get. What you'll notice is that a lot of
words scroll by. I gave the pattern cat, and maybe we were thinking I'd only see the word cat, but notice that it shows me every word in the file that has the letters c-a-t somewhere in the word. What's really nice with the color-coded command line is that we can actually see the matches, which is a nice feature for learning how to use regular expressions. You'll also notice that this is a really big file, so we're always going to see words from the end of the alphabet, just because of the way we're doing this.

Let's clear this and take a look at that pattern again. My pattern is cat, and for the rest of the exercises I'm going to put the patterns in double quotes, because it makes it a little easier to see where the pattern is in these lectures. It also helps us deal with any non-standard characters; in other words, any characters that we want to match in our pattern but that have special meaning to the shell. We want to make sure those are not processed by the shell in the way the shell wants to process them, and putting them in double quotes kind of fixes that.

So what does this pattern say? It says: look for a c, followed by an a, followed by a t. What you found out when I ran that command is that it doesn't care whether there are any characters before or after the word cat; it just looks for those three characters in order.

That's pretty interesting. What if we wanted to look for not just cat, but you were wondering whether there's a word that looks like cat with another letter in the middle? Maybe you're thinking of cut or cot, and you want to find all of these. Well, regular expressions come with a special character, the dot, and the dot will match one of any character. So if there's actually a word like c2t, then putting a dot there would match it. What this will match is a c followed by any one
character followed by the letter t. Let's hit enter and see what we get. You can see in just this little bit of output that I didn't even think about words like yachts, which has cht, or watchtower. So we matched cht, cat, cut, and cot, and that's pretty cool. Why? Because the pattern used a dot.

Let's look at some other patterns that we have available to us. What we just looked at was this idea of basic elements: you can literally type a string, and that will match that series of characters. We can also use dots in our series of characters to represent one instance of any given character. So notice my first example matches bat, cat, and mat. Finally, if I just give the letters d-o-g, it would match the literal string dog, found anywhere in a word.

What's cool with regular expressions is that we can start to get tricky. In our previous example we used the dot to find any character, but let's say we wanted to limit ourselves to a subset of characters. We can create what are called classes of characters by putting them into square brackets. [a-c]at will search for an a or a b or a c, followed by an a and a t, and this would match bat and cat but not rat. We'll take a look at this on the command line in a second. You can also use character classes to find uppercase letters, lowercase letters, and numbers within your strings. Now again, [0-9] is not looking for the number 9; it is looking for the character 9, the symbol that represents the value 9. So [0-9][0-9] is really just looking for a character 0 through 9 followed by another character 0 through 9, and this would match 42, 37, 99, 01. It would basically match any two digits next to each other anywhere in a word.

So let's take a look at that. Previously, when we used c.t, we matched cot, cat, and cut: the o, the a, and the u. Let's say I don't want these
watchtower cht matches; I just want a c followed by a vowel followed by a t. We can do that: instead of using the dot, I can create a character class. I want to try to match any vowel. I don't know what we'll get, but we'll see. What this says is: look for the letter c, followed by a, e, i, o, or u (just one of those, but it can choose amongst all of the values in that set), followed by the letter t. Now if I look at my output, I get veracity, wainscot, vocatives, and wildcats. So I'm matching all the vowels between the letters c and t, all because of this character set that I've made.

So those are character sets, and by the way, you can use these in other ways, and you'll see them in a second. If you want to match all lowercase characters, you would use a through z. You can also say find all lowercase and all uppercase characters, which might be useful if you're looking for user names or something along those lines. If you wanted to, you could also add 0 through 9 and, say, the underscore, and that would match any of those characters. You can have these sets of characters be as small or as large as you want them to be.

What I'm going to do is change this. Let's see if there are any words with 0 through 9 in them. I doubt it, and nothing matches that pattern, so that's what happens when you put in a pattern that doesn't match. Let's see if there's anything in the dictionary that matches against a number. Nope. So there are no numbers in the American English dictionary on the Unix system, which is kind of helpful to know. There are a couple of other files we could use if we wanted to grep for numbers. One we could grep for a number would be the /etc/passwd file. If I grep that, I know there are a lot of numbers in there, so I'm going to get a lot of matches. You'll notice that it matched all of these lines, because all of the lines have a number in them. That's because the
password file, /etc/passwd, just shows you all the user accounts on the system, and these numbers represent their user and group IDs. So I knew there would be a lot of numeric matches in this file. That's just to demonstrate that you can match against numbers as well as words, which is kind of helpful.

Let's go back to an example where we can look at the English dictionary, and let's look at some other options for this feature. Built into Unix, alongside the ability to use character classes, are what I think are pretty ugly named character classes. You'll see these sometimes built into the system. So [[:alpha:]] is the same as building your own character class [a-zA-Z]. Basically, the things on the left are just ways of writing these classes out using words as opposed to ranges like a through z. If you haven't figured this out at this point, in Unix there are 50 ways to do everything. These character class abbreviations can be helpful, and they make things a little more readable in scripts. The other thing, if you haven't noticed, is that regular expressions can get pretty big and pretty scary pretty fast.

I would say the most useful character class we're looking at right here is the one called space, because it will match any whitespace character, which includes tabs and spaces. That can be really helpful when trying to match sentences that have spaces. In my first example we talked about last names, and it would be really helpful for finding somebody's last name. I'm mostly familiar with the other set of character class abbreviations. Notice there are two sets: the previous ones with words like alpha, upper, and lower, and then these abbreviations, \d, \w, and \s, that make things a little more readable when we start to use them in larger regular
expressions. I'll also mention that there are ways to negate character classes. If you notice the caret inside the bracket-based classes at the bottom here, that means "does not match this character". And if you use \D, \W, or \S, that means you do not want to match that kind of character in that position. I don't usually teach those when I introduce regular expressions, because it's hard enough getting your head around regular expressions, let alone getting your head around negative logic at the same time. So we'll just mention that they're there and ignore them for now.

Let's look at these two character class sets in operation. Let's say I want to find a word that has any digit in it. Well, notice what I just did: I said that I wanted to find a \d, but it went ahead and found all of these words with the character d. That's a problem, and you're thinking, "Wait, Jason, you just showed us that \d is a character class." But here's the thing: remember, earlier on I said there's grep and there's extended grep. Well, the \d character sets are actually part of extended grep. If I switch to egrep, it still doesn't work, which is a problem because I'm trying to explain this. What you'll notice is that \d doesn't work; it goes ahead and finds the letter d. \d is actually part of Perl regular expressions, so if I add -P, you'll note that it returns nothing. That says: use \d like it's a Perl regular expression, and that will parse numbers. So \d doesn't actually work in grep or egrep. But if I get rid of the -P and just go with regular extended grep, what you'll find is that I can utilize \w, and that matches any character in any word that we might be looking at, so that one does work. And if we go ahead and look at \s, what you'll notice is that it also returns nothing. So it looks like \d is one of those cases where
it's more of a Perl thing than an extended grep thing. I'll apologize for that and keep going. Remember, \d is the same as typing [0-9], so you don't need to worry about the fact that \d is not supported in egrep.

Let's take a look at some other ways that we can build patterns. We'll start with what are called anchors, or positional anchors. In other words, we can say that a certain character has to match at a certain point on a line, either at the beginning of the line or at the end of the line. The characters that we're going to use for these beginning and end of line matches are a caret and a dollar sign. In the example in front of you, putting a caret before the c means to match a c only where the c is the first character on the line, so it would match car, cattle, and canine where those are the first words on a line. In the second example, floating and sailing would match because the g is at the end of the line, if those are the only words on the line. And if you'd like to match situations where a word is the only word on a given line, such as cat, you would want a line that starts with the letter c and ends with the letter t. We'll take a look at these in one second.

Let's look at another anchor that we can use, which is a word boundary. Remember, the caret character and the dollar sign character have to do with the beginning and the end of a line, but word boundaries give us the ability to match on spaces, tabs, and basically those positions where there might be breaks between words. In the example that I've got at the bottom, if I wanted to match "Jason the Prof", s-o-n is anchored on a word boundary, because there's a space between my name and the word "the". So son\b would match that line. Actually, there's no other instance of s-o-n on that line, but let's say I was in a file that had a lot of people named Jason and I was the only one with the surname "the Prof": it would match.
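As a rough sketch of the anchors and word boundary just described, something like the following could be run in a shell. The sample file contents and the count helper are made up for illustration and are not the lecture's files, and \b is assumed to behave as it does in GNU grep.

```shell
#!/bin/sh
# Sketch: anchors and word boundaries with grep -E on a made-up sample file.
# The file contents and the count helper are illustrative assumptions.

sample=$(mktemp)
trap 'rm -f "$sample"' EXIT

cat > "$sample" <<'EOF'
car
cattle
canine
floating
sailing
cat
concat
Jason the Prof
Jasonium
EOF

# count PATTERN: print how many lines of the sample match PATTERN.
count() {
    grep -E -c -- "$1" "$sample"
}

# ^ anchors the match to the start of the line.
echo "start with c:       $(count '^c')"     # car, cattle, canine, cat, concat

# $ anchors the match to the end of the line.
echo "end with g:         $(count 'g$')"     # floating, sailing

# Both anchors together: the whole line must be exactly "cat".
echo "exactly cat:        $(count '^cat$')"  # cat only, not concat or cattle

# Without a boundary, Jason also matches inside Jasonium.
echo "contain Jason:      $(count 'Jason')"

# \b (a GNU grep extension, assumed here) requires a word break after "son".
echo "son at a word end:  $(count 'son\b')"  # only "Jason the Prof"
```

Using grep -c keeps the output short; dropping the -c shows the matching lines themselves, which is closer to what the lecture does on screen.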
So let's take a look at how word boundaries and anchors work in real life. To match items on word boundaries, we really need a file to do this with, so I'm going to take another look at the /etc/passwd file. I've also bumped up the font, so hopefully things are a little easier to see.

What you'll notice is that in the /etc/passwd file there are a number of lines: I've got root, daemon, bin, sys. Let's say I only want to see lines that start with the letter s. There are a couple of them. I'm going to use egrep, and I'm going to say that I want lines that start with the letter s in the /etc/passwd file. What you'll notice is that it shows me just the lines that start with the letter s. The letter s might be in other parts of the line; if I take out that position qualifier, the letter s appears on many more lines, but I'm only interested in those lines where the letter s is at the beginning.

If I want to see which lines have the letter n at the end, I can put a dollar sign after it, and it'll show me only the lines that end in the letter n. Let's try h. I was thinking bin, but I think I meant bash. What it shows you is that many of these user accounts use the bash shell. Notice in this case I'm seeing both bash and sh. Let's say I'm only concerned about those users in this file that have the login shell bash. Let's see how we can fix that. Instead of just saying h, I could say bash. What this says is that the last letter on the line must be an h, but before it must come b-a-s. Now I get exactly the information I want: lines that end with bash. So that's the concept behind positional anchors within regular expressions.

What I have here is a file called demo that contains two lines. One is "Jason the Prof" and the other line is "Jasonium", so maybe I have my own element. If I grep this file for my name, what you're going to notice is that it shows me both lines. But maybe I
only want to show those lines where it's actually my name and not my name embedded in another word. The way I can do this is to add a word boundary to my regular expression. Remember, \b means word boundary, and so now it will only show me the Jason that is up against a word boundary. It's pretty helpful when you have spaces in a file and you want to find words that you know are next to a space.

The last thing for us to review is the concept of quantifiers in a regular expression. Sometimes you want to look for zero or one or two or three of a certain character to appear in a certain spot. Other times you know the exact number of items that you want to appear in a given spot in your regular expression. Regular expressions come with the plus, question mark, and star operators. While the star operator seems to work like it does on the shell, it's important to note that there is a slight difference between the file glob operator on the bash command line and the star operator within a regular expression.

So how would we utilize these within some regular expressions? Let's take a look at some sample regular expressions, and then we'll go to the command line. Here's a sample regular expression that matches a number of the form 555-555-5555, so a typical U.S.
phone number: three digits, a dash, three digits, a dash, four digits. You'll notice that I've got some positional notation here, so the line starts with the caret and ends with a dollar sign. Notice that I've got a character class, 0 through 9, and since I know I want to match three characters, I could type [0-9][0-9][0-9]-. Or, notice, in this case I can use {3}, which says match exactly three of whatever comes before it. So we can use this curly bracket, number, curly bracket notation to make our regular expressions a little more condensed. We'll look at this on the command line in a second, and we'll also talk a little bit about the \b word boundary and the \w character class.

You'll notice that I've written a really poor regular expression here that matches an email address. Regular expression email address matching can be helpful, but it's not something I would rely on, because there are a number of email address formats and it's too complicated to really catch all of them. If you do a Google search for "email address regular expression", take a look at some of the answers that you get; they're pretty long and detailed. What you'll notice is that this email address pattern is also positionally anchored: it's got a caret and a dollar sign, so the only thing on the line should be an email address. It says to find any number of word characters. The star in this case, and the same goes for the curly bracket 3 notation, affects whatever is to the left of it. So this says to look for any letter or number or valid email address character, and the star after it means any number of those characters, followed by an @, followed by any number of valid characters, letters or numbers, followed by a dot, followed by any number of valid letters or characters. Notice also that
because I actually wanted a literal dot here, I had to escape it. Remember that a dot has a special meaning within regular expressions. Anytime you want to tell a regular expression to use the actual character and not interpret it as a special regular expression character, you just put a backslash in front of it.

Let's take a look at how some of these components work on the command line. I have a file that might be the file you're using for your lab. If we look at the file lab3test.txt, what you'll notice is that it's a file full of names and phone numbers. We're just going to look at matching the phone numbers; since I've already given you a regular expression that does that, let's see how it works. I want to build a regular expression to match just the valid phone numbers in this file. I'm going to use egrep, I'm going to put my pattern in here, and I'm going to use lab3test.txt as my file.

Before, I said that if I really wanted to match a phone number I could write a regular expression that looks like this. If I just run that, notice it matches any line that has three numbers in a row, or any instance of three numbers in a row. Notice it also gets these user IDs over here. So then I could say, what if I match a dash? Let me clear this and add a dash. Now we're getting closer, because it's no longer matching these IDs on the left, but it is matching these two differently formatted phone numbers. By the way, if anybody has dealt with large amounts of data that was input by humans over a period of time, you've probably run into situations like this where we have inconsistent data entry. So again you say, I just want these dash formats, because these are the only ones I really want to identify. Let's clear this and go back and look at how to fix the regular expression. What I could do is just add one more number, and it's getting better, but notice it's still matching on
these other numbers. You'll notice that these are greedy matches; in other words, it keeps trying to match things. I can keep adding these until I get the total number of characters I want, but this is getting long and hard to read. So let's use what we just talked about. I know my phone number is going to start with three characters. What this says is: look for a value that could be anything between 0 and 9, and look for three of them in a row. They don't need to be the same number, like 222; it could be 215 or whatever. Then follow that with a dash, then look for another series of numbers 0 through 9, and look for three of those. Let's run it. It's still being a little too greedy, so we're going to have to extend it out to get these four numbers. Also notice that this file has a broken phone number in it; that's there on purpose, but we'll skip that for this exercise. I'm going to put in another dash, 0 through 9, and the number 4. What this says is: find three numbers, followed by a dash, followed by three numbers, followed by a dash, followed by four numbers.

By the way, notice I'm using egrep for this, and notice that it gave me the valid answer. If I use plain grep, I just want to point out that you'll run into a problem because of these curly brackets. There is a way to make them work in grep, but it's just easier to use egrep, because these curly brackets are part of the extended grep syntax. So just be sure that when you utilize numeric quantifiers in curly brackets, you utilize egrep to do that.

Now let's look at the plus, question mark, and star operators, and we'll go back to searching through the dictionary file. Let's say I wanted to find some words that have a b, then an o, and then the letter t, like bot. I get a lot of things: saboteurs, robots, lobotomy. But that's definitely more information than I want. So let's say I'm curious
about whether there are words that have zero or one o's in that spot, and I put in a question mark. A question mark says, in this specific case, that there could be zero o's or there could be one o. Let's hit enter and see if anything changes. I do notice that it matches subtropical and it matches turbot. It says: match any characters where there is an o or not an o; there could be zero o's or there could be one o. The plus sign will match one or more of a character, so in this case it brings in tollbooth as well as sabotage and robot. Notice there are two o's there and one o there. And then finally we could use the star, which means zero to infinity, and notice that the star gives us the two o's, zero o's, and one o.

Up to this point we've been using grep to filter information so we can target the stuff that we want in more detail, but we can also use regular expressions to find things and modify them. The sed command is the stream editor, and while it can be a very powerful tool, what I want to do here is just introduce how it works very basically, so we have an idea of how we can use regular expressions in a new and different way. What sed allows us to do is take a stream of data, match against a pattern, and then modify the information that matches that pattern in some way. One obvious way you could use this is to go through a bunch of files, find the old boss's name, and replace it with the new boss's name. That's one way that I've actually used it, and there are a couple of other things that are useful. So you can use sed to find and replace stuff. It's a really powerful tool, and we're only going to look at one aspect of it: a command very similar to the one in this example.

Let's jump over to the command line and take a look at how sed works. For this to work we're going to need a file to edit, but we're not going to actually edit the file in
place today. We're just going to dump data out to standard output, and what we're going to do is modify what's printed there. We could obviously redirect that to another file, or we could edit the file in place, but those are more advanced things; we just want to look at using regular expressions in a new way.

I'm going to use the /etc/passwd file for this. If we just look at that file real quick, what you'll notice is that everything in here is separated by colons. Let's say I want to get rid of those colons, so let's use sed to do that. So: sed -e, and -e means I'm going to give sed an expression. In this case I'm going to use single quotes, and I'm going to set this up and then talk a little bit about what it means. Notice I've got 's///g' in single quotes. Whatever I want to find goes between the first two slashes, so in this case I'm saying find the colon. Now I'm going to say replace it with three dashes, which go between the second and third slashes. What this will do is sed will go through the file and find every instance; by the way, g stands for global, and that means it will find every instance of a colon in the file and replace it with three dashes. If you look at my output now, instead of colons you see a bunch of three dashes. This can be really helpful for a bunch of reasons.

One of the things we can do too is use all of our newfound regular expression powers here. If we take another look at that password file, one of the things you'll notice at the top is a user called bin, but you'll also notice that the word bin shows up a lot of times in this file. Let's say I just want to change that user bin, but I don't want to change any other instance of bin. What I could do is go back up to my previous command, and instead of just matching bin, which would match everything, I'm going to go in and change it. What am I going to replace bin with? I'm
going to replace it with JASON, really big. If I just do that, notice it replaces every instance of bin in this file with the word JASON. It replaced the one I want, but then it also replaced all these other instances. So what I want to do is use an anchor: I'm going to say only match the bin that's at the start of the line. Now you'll notice that all the other bins are left in place, and if I scroll back up and look, it only replaced the one that was at the start of the line.

So sed's a really powerful tool. We'll take a look at it more as the semester progresses, but in this case I just wanted to show you that you can use your newfound regular expression powers not just to filter data for viewing but to filter data for editing.
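To recap the quantifiers and the sed substitutions from the lecture, here is a minimal sketch that can be run as a shell script. The word list, phone list, and passwd-style lines are invented stand-ins for the dictionary, the lab file, and /etc/passwd, and the matches helper is a hypothetical convenience, so treat the data as assumptions rather than the lecture's actual files.

```shell
#!/bin/sh
# Sketch: quantifiers with grep -E and substitution with sed, on made-up data.
# All file contents and the matches helper are illustrative assumptions.

tmpdir=$(mktemp -d)
trap 'rm -rf "$tmpdir"' EXIT

printf '%s\n' bt bot boot robot bat > "$tmpdir/words"

cat > "$tmpdir/phones" <<'EOF'
1001 Ann 555-123-4567
1002 Bob 555.123.4567
1003 Cy 55-1234-567
EOF

cat > "$tmpdir/passwd" <<'EOF'
root:x:0:0:root:/root:/bin/bash
bin:x:2:2:bin:/bin:/usr/sbin/nologin
sys:x:3:3:sys:/dev:/bin/sh
EOF

# matches PATTERN FILE: print the matching lines joined by single spaces.
matches() {
    grep -E -- "$1" "$2" | paste -s -d ' ' -
}

echo "bo?t (zero or one o):  $(matches 'bo?t' "$tmpdir/words")"   # bt bot robot
echo "bo+t (one or more o):  $(matches 'bo+t' "$tmpdir/words")"   # bot boot robot
echo "bo*t (zero or more o): $(matches 'bo*t' "$tmpdir/words")"   # bt bot boot robot

# {n} repeats the preceding item exactly n times (extended syntax, so grep -E).
echo "dash-formatted phone numbers:"
grep -E '[0-9]{3}-[0-9]{3}-[0-9]{4}' "$tmpdir/phones"

# s/find/replace/g: replace every colon with three dashes.
echo "colons replaced:"
sed -e 's/:/---/g' "$tmpdir/passwd"

# Anchoring with ^ changes only the bin at the start of a line.
echo "only the leading bin replaced:"
sed -e 's/^bin/JASON/' "$tmpdir/passwd"
```

Both sed commands write to standard output and leave the temporary files untouched, which matches the approach taken in the lecture.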
Info
Channel: Jason Wertz
Views: 110,859
Rating: 4.9450622 out of 5
Keywords: Linux
Id: KJG1dETacLI
Length: 36min 58sec (2218 seconds)
Published: Fri Sep 13 2013