Using Regular Expressions - Computerphile

Captions
First thing we've got to talk about today, Sean, because otherwise I'll get it in the neck, and so will you for not nudging me and reminding me: yes, everybody, I know. This thing here is still using Windows 7, and the useful guys at work (thanks for doing this, chaps) have assured me that the moment this filming session is finished, if I bring this in, they will upgrade that one to Windows 10 as well.

We've been talking about regular expressions, basically about the theory of them and the idea of them, but we've not actually seen them in practice.

Yeah. Regexes, REs: they are a very good illustration of where theory can meet practice. I think in the previous one we did a little bit of theory; what we ought to do now is just see them in action. The difficult thing for me to tackle is this (and I'm just trying to get sympathy here): the span of what you, the audience, either know or don't know is huge on this topic. Some of you know way more than I do; some of you really are beginners, struggling to get used to the somewhat abstract notation and so on. So apologies up front: this one will seem very simple and very straightforward to those of you who have got some expertise. But I think it's important that we regroup and say, look, this is the notation, we all agree on this, because I have for the future got a very good example lined up of something where regular expressions, if you like, can only just cope.

I am doing my examples here in Lex, because I hope that some of these examples later on will transfer into being part of a little compiler of some sort, and it's software I'm used to. But it's very straightforward: you give a piece of regular expression for a pattern you want to match, and then you give, if you like, an action that you want to take. Now very often, having recognized a piece of regular expression, all you want to do is echo it back, perhaps with a bit of explanation as to what it means.

So here's my simple exercise. I'm going to declare about seven reserved words in my language, and my language is ultimately going to end up as an elementary computer graphics language, just like Brian Kernighan's pic. So my reserved words are all things like circus... circus? Circle! In fact, I should put in both circle and circus, to see if it can distinguish between the two. Line, arc, spline, box, that sort of thing. I want those to be picked up as being reserved words. But then, if it isn't a reserved word in my scheme and there's some other bunch of characters, is it a bunch of characters that would do good service as a variable name? Now, as I think we've said many a time, variable names in many languages follow the pattern that they must begin with a letter, but after that they can have any mixture of upper- and lower-case letters and digits in the name, zero or more of them. That's your variable name. So reserved words, and named variables of that particular reserved-word type: that's all we're going to do today.

I have here a Lex script which has got seven specific lines in it for recognizing circle, line, arrow, spline, box, arc or, as a bonus at the bottom, circus (which also starts c-i-r-c). So when you look at it, you see it's either going to match that, or that, or that, or that; those are all the reserved words. If it won't match any of those, it keeps coming down, trying to use the next regular expression to get a match. And below here I just give zero to nine, in square brackets, plus; that's a piece of regular expression notation that says any combination of digits nought to nine, in any order, going on arbitrarily long. For the moment here, those a-to-z or A-to-Z choices mean anything in that range, literally those characters, in however many combinations are possible.

So I've put all this together, I've fed it into Lex, I've compiled it all for you, and I won't bore you by doing it in front of you. But believe me, I have saved this as a binary executable. It's called testre, for "test regular expression", but it only handles these regular expressions. I think we're all ready to go. I just type in the name of the executable binary, testre. Let's see if it works. Right, silence signifies it's happy. Yes, it's waiting. So go on, tell me something to try.

Let's use the name Bob.

Bob? You just want Bob, all on its own? Bob. OK, what will Bob do? Would you agree with that? Bob is a variable name; in other words, it's a valid identifier for a variable of some sort. Fine?

Yes.

Yeah, there's nothing to stop you calling your integers or your circles or your lines... you can call them Bob if you want to. That's fine. But I'm saying that this thing, as advertised, really does treat words like circle and line as being special. Let's see if it filters those out and gets it right. So I'll just say circle on its own, lowercase. Look at that: as part of my pattern matching, circle is one of my entries in reserved words that must be recognized just as is, lowercase, notice. And it's worked. It basically says, yeah, got you, circle, it's a reserved word.

And just to emphasize: now, this time I'm spelling it with a capital C, and my guess, my hope, is that it will recognize the first circle as being a reserved word. The second Circle can't be a reserved word, because it's case sensitive, right? It's circle, all lowercase, that has been reserved, but the version with an uppercase C isn't, so therefore, who knows, it should be a variable. Let's see if that works. Yeah: circle, all lowercase, is reserved. It echoes it back just to be sure, "yeah, I got it, it's circle". But Circle is a variable name, which I think sounds right. Think of something else that might break it, Sean.

Gosh. Well, we talked earlier and you kind of mentioned the idea of putting the word circus in there, to throw it, because it's so similar.

Yes, that's a good point. Let's just try circus. It's happy with that. I did make it a reserved word, but it hasn't sort of come up with "oh, I can't decide between circle and circus". Part of what I was saying in the episode last time is that one of Lex's jobs is to say: despite the fact that circle and circus have a common beginning, I'm very clever and I very efficiently factorize that beginning out, and then say, well, after that, if it ends "le" it's a reserved word; if it ends "us" it's also a reserved word. But it's happy.

So a good thing to do now, I think, would be: can you name a circus? Yeah. Better still, perhaps, how about this: I want to name another circus, but I'm going to call it circus1. Now that should be no problem, because it's not saying "circus circus"; it's saying circus, reserved word, and in that category circus1 can only be a variable name.

So it's using the space to delineate?

Yes. The way I've got it set up at the moment, I haven't told it to ignore spaces. I've left them in because, at the moment, a space is a very handy break between these various things, which can then be analyzed separately.

This, then, if you like, aligns with the history of Lex and regular expressions: Mike Lesk put them in this front end to enable you to do reserved words, variables, all sorts of things like that. But historically they then migrated out into things that have nothing to do with compilers. Many of you will have heard of UNIX awk, and that was the great-granddaddy of all sorts of things that you're more familiar with, like Perl, PHP, Python and so on. Awk's characteristic was that it just did regex pattern matches, then actions. There was no context; it was interpretive. With awk, you gave it the thing to do and it came straight back at you; you didn't have to recompile it every time.

So here are the first beginnings of what we need for a longer example. We've got the ability to take a set of characters in any combination, zero or more of them, to name variables. Fixed sets of characters of a certain variety, like circle, line, box, are dealt with first. So I think the thing to take away from this is that in programs like awk and Lex, you've got to remember that the various possibilities you give will be tried in that order. You've got to imagine that between the lines there's almost an "or" operation. You start up at the top and you say: it will either match circle, or it will match box, or it will match line, or it will match spline, or it will match arrow, and so it goes on. And then down at the bottom the catch-all is: if it doesn't match any of those, let's just see if it could be a legal variable. And then you just run out. And I have to accept this: if you put in a line of punctuation, I think it would just echo it back at me and not do anything with it. Let's just see: quote, pound, dollar. It just echoes it back. It takes no action; it just says, "Meh, I don't know what that is."

I think this has set us up now, I hope, to do a longer example than this. But to me at least, regular expressions come into their own for this kind of thing: one-liners to name things, you know, match a pattern, do that with it, all on one line. They're not all that well suited to doing very long-range, big, strategic structure. So many of you have said to me, "Oh, cover why regexes can't do XML properly." Well, I might get on to that. But yes, you all know XML has got a big tree-like structure, which regexes do not, of themselves, find easy to handle.

Can I ask one question? If you had a real circus, what would it be called? Would it be like "The Great Brailsfordini"?

Wasn't there a Circus Maximus in Rome? No? There we go.

I just think it needs a bit more of a showman's title.

Oh, Barnum & Bailey, is that right? Do you want me to try Barnum & Bailey?
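The rule ordering described in the transcript can be sketched in Python. This is only an illustration, not the Lex script shown in the video: the function name `classify`, the label strings, and the "unmatched" tag for echoed characters are assumptions made for this sketch. It follows Lex's standard disambiguation, which prefers the longest match at each position and falls back to the earliest rule only on a tie; that is why circus1 comes out as a variable name even though circus is reserved.

```python
import re

# Rules in the order described: the seven reserved words first, then the catch-alls.
RESERVED = ["circle", "line", "arrow", "spline", "box", "arc", "circus"]
RULES = [(re.compile(re.escape(word)), "reserved word") for word in RESERVED]
RULES.append((re.compile(r"[a-zA-Z][a-zA-Z0-9]*"), "variable name"))
RULES.append((re.compile(r"[0-9]+"), "number"))  # label is an assumption


def classify(text):
    """Return (lexeme, label) pairs, taking the longest match at each position."""
    tokens = []
    pos = 0
    while pos < len(text):
        best_len, best_label = 0, None
        for pattern, label in RULES:
            match = pattern.match(text, pos)
            # Only a strictly longer match replaces the current best,
            # so on a tie the earlier rule is kept.
            if match and match.end() - pos > best_len:
                best_len, best_label = match.end() - pos, label
        if best_label is None:
            # No rule applies: pass the character through untouched,
            # mimicking the echo behaviour seen with spaces and punctuation.
            tokens.append((text[pos], "unmatched"))
            pos += 1
        else:
            tokens.append((text[pos:pos + best_len], best_label))
            pos += best_len
    return tokens


if __name__ == "__main__":
    for lexeme, label in classify("circle Circle circus circus1 Bob"):
        print(repr(lexeme), label)
```

Under these assumptions, `classify("circus circus1")` reports circus as a reserved word, the space as unmatched, and circus1 as a variable name, which mirrors the behaviour demonstrated in the session.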
Info
Channel: Computerphile
Views: 107,408
Rating: 4.879899 out of 5
Keywords: computers, computerphile, computer, science, Computer Science, University of Nottingham, Professor Brailsford, RegEx, Regular Expressions, Applied RegEx, Lex, Parsing
Id: 6gddK-cOxYc
Length: 11min 38sec (698 seconds)
Published: Tue Jan 28 2020