Best of Fluent 2012: /Reg(exp){2}lained/: Demystifying Regular Expressions

Video Statistics and Information

Reddit Comments

I've tried learning this before and this video really helped! Thanks for this

👍︎︎ 2 👤︎︎ u/Things_I_Said 📅︎︎ Aug 09 2017 🗫︎ replies

This is awesome! Thanks~

👍︎︎ 2 👤︎︎ u/helleeo 📅︎︎ Aug 23 2017 🗫︎ replies

Great talk, thanks for sharing

👍︎︎ 1 👤︎︎ u/Shabz_ 📅︎︎ Aug 09 2017 🗫︎ replies

I came here searching for help understanding RegEx and this was a massive help. Thanks.

👍︎︎ 1 👤︎︎ u/PerryDigital 📅︎︎ Sep 08 2017 🗫︎ replies

Captions

Hi there, I'm Lea Verou. It's great to be here today. You can see my Twitter username over there; if you have any questions and we don't have time at the end, you can always ping me on Twitter and ask me.

So, regular expressions. Many people use many different names to refer to regular expressions, such as regular expression, regex, regexp; some even use the term "ASCII puke". Hopefully, by the end of this presentation, if you're one of them, you will have started to change your mind a bit, so I'm going to ignore that one for now. The first three names can be described in a much more concise way. At least if we lived in a world where everybody was a developer, we could refer to all three of them with just this line, which is basically a regular expression itself. And that's what regular expressions are: they're a way to describe a set of strings.

Why is that useful? We can use them to check if a string matches a certain format. We can use them in HTML5 for form validation with the pattern attribute. We can use them to extract certain parts out of strings, or to replace certain parts with something else. We can use them when refactoring in our text editor or IDE, or even in command-line tools or databases, and many more tools. Luckily, the syntax used in most of these tools is pretty much the same. JavaScript has some limitations compared to other languages, but it's pretty much consistent.

So, to start with the syntax. The very basic rule is that every regular expression literal starts with a slash and ends with a slash. What's between the slashes is the pattern we're trying to match, and after the slashes there are the so-called flags, which in JavaScript are just a combination of g, i and m, in no particular order.

While I'm explaining the syntax, you can go to this web application I made for this talk to test it for yourself if you want. There are going to be some small challenges scattered throughout the talk, in which you can participate through Twitter, and whoever wins gets an
O'Reilly regular expressions book.

The most basic regular expression is just a few letters or symbols, which match that exact string. This, for example, matches "a" anywhere in the string. The string doesn't even need to start or end with "a"; it matches anywhere, unless of course you restrict it, which we'll see how to do later. By default it's case-sensitive, so this won't match anything, but you can use the i flag to change this. That's very useful when testing, for example, whether an element is of a particular type, because the node name can have any kind of capitalization.

If you're not very familiar with regular expressions, you'd assume that this would match something like 1.5. You can probably guess that it would match it anywhere in the string, but if you're not very familiar with them, you probably wouldn't expect it to match this, or this, or even this. The dot matches pretty much any character except line breaks. You can change this behavior if you escape it with a backslash, and now it only matches the dot. This is called a metacharacter, which means it's a character that doesn't only match itself, or that might not even match itself at all; it has a special meaning. We have 12 metacharacters in total, and if you want to use them literally, in most cases you need to escape them. Not in every case: regular expression engines are quite smart, and in some cases they can figure out on their own that you're just interested in matching these characters literally.

In this case we have a regular expression that matches four consecutive a's, like this. But what if we wanted to match 12 of them, or 100 of them? Would we have to keep typing "a" a hundred times? We can use something called a quantifier to specify this in a much more concise way. This is exactly equivalent to the previous regular expression but much shorter, especially if we want it to match more a's. You might think that this is very strict: what if I don't want to match exactly
ten a's? What if I want to match, say, from zero to ten, or from ten to 100? If you include a comma after the number, it basically means "at least that number", so a four followed by a comma means at least four. As you've noticed, the regular expression engine matches as much as it can. That's called greediness: if you give the regular expression engine a choice, it will match as many as it can. You can set an upper bound as well, like from four to five, which only matches up to five a's.

There are certain shortcuts as well, for example if you want to match any number of a's, from zero to whatever. It's quite interesting in this case to notice that you can actually match the empty string. This regular expression matches the empty string: we don't have anything, but we want to match "a" from zero times to whatever, so since we don't have any a's, it matches the "a" zero times. Due to greediness, if we have any a's it will match all of them, as many as it can consume. So we can use this shortcut to match exactly this: from zero times to whatever, to infinity. If we want the "a" to be present at least one time, to avoid matching the empty string (in most cases we don't want to match the empty string), we can use a plus sign instead, which means from one time to infinity. Or if we want something that may be present but not necessarily, we can use the question mark. This matches from zero to one time, and due to greediness it always matches one in this case, but if we don't have any a's it matches the empty string. Like here: it doesn't find any "a", so it matches the empty string. If you recall, our original regular expression about what we call regular expressions had something like this, which means it matches both "regex" and "regexp", because the "p" is optional.

Greediness can be really annoying in some cases. For example, assume you want to match HTML tags, and you know that it's pretty simple HTML. Most commonly you wouldn't
want to use regular expressions to match HTML, but in some cases, when you know what to expect, you can do that. So here you want to match HTML tags in order to strip them out, for example from a comment, if you don't want to allow HTML. You think of writing a regular expression like this: an angle bracket, anything, any number of times, and then a closing angle bracket. But it doesn't match what you expect. You just want it to match this and the closing tag, but it doesn't do that; it matches the entire thing. Why? Because it matches the first angle bracket here, then it goes and matches as many things as it can, and here it finds this angle bracket. So technically it's correct, it's not wrong, but it's not what you wanted. You can change this behavior and reverse greediness by putting a question mark after the quantifier. Now, instead of matching as many as it can, it matches as little as it can, the fewest characters it can. So now it does exactly what you want: it matches both the start tag and the end tag. You can even do things like this: if you want something from zero to one times, you can have two question marks, one after the other.

In case you want to use alternation between certain characters, you can use square brackets. This basically means any character that's either a or b or c, so this matches any of them; it doesn't match d, for example. You can even use the plus sign to say any number of these characters. And if you want to match multiple characters, for example assuming you want to match a letter, you could start doing this sort of thing and write the entire alphabet in the character class. But there's something better you can do: you can use ranges, from a to z, and now it matches every letter you can think of. It doesn't match numbers or symbols, but it matches letters. You can even concatenate multiple of these ranges to produce a union, like if you want to match letters
and numbers, you can do something like this. You can even add single characters after them: this matches letters, numbers and the underscore, like this. Another thing to note is that most metacharacters don't need escaping in square brackets. There are very few that do, such as the caret, or the closing square bracket, because otherwise the regular expression engine won't know when the character class ends, but most of them don't really need escaping.

We also get some shortcuts here for very commonly needed character classes. For example, \w basically means something similar to the character class we did before: letters, numbers and the underscore. It's basically equivalent to a to z, both lowercase and uppercase, 0 to 9, and the underscore. So I can type any letter and it keeps matching it, or numbers, or underscores, or any combination of them. One bad thing is that it's not Unicode-aware, so if you want to match Greek letters, for example, that won't work. There's also another character class that's a bit more restrictive than this: it only matches digits. It's basically equivalent to a character class from 0 to 9, so this would match any integer; it won't match decimals. And there's also a character class that matches whitespace, \s: any kind of whitespace, tabs, line breaks, spaces. It's actually much wider than what I'm showing there, and that's why I have not the equals sign but the "approximately equals" sign, because it's not exactly equal to that character class. It's Unicode-aware, so it supports all the weird whitespace characters that Unicode has.

You can even combine those character classes to form something more complex. For example, if you want to match something that's letters, digits, underscores or hyphens, you can do this, which combines the word character class and the hyphen. So now it will match even if we have a hyphen there, which is kind of useful for matching things like telephone numbers. We can use this to count words in some text.
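
As a rough illustration of the quantifiers and character classes described up to this point, here is a minimal JavaScript sketch. The sample strings and variable names are invented for illustration; they are assumptions, not the patterns shown on the slides.

```javascript
// Sample strings are invented for illustration; they are not the talk's slides.

// Quantifiers are greedy by default; a trailing ? makes them lazy.
const greedy = "aaaaaa".match(/a{4,5}/)[0]; // as many a's as allowed
const lazy = "aaaaaa".match(/a{4,5}?/)[0];  // as few a's as allowed

// * can match zero times, so it can match the empty string.
const emptyMatch = "bbb".match(/a*/)[0];

// An optional character: the pattern accepts both "regex" and "regexp".
const names = ["regex", "regexp"].map((s) => /regexp?/.test(s));

// Greedy versus lazy on simple markup of a known shape.
const html = "<b>bold</b> text";
const greedyTag = html.match(/<.+>/)[0];
const lazyTags = html.match(/<.+?>/g);

// Character classes: a range, a shorthand, and a shorthand combined with "-".
const sample = "abc123_-";
const letters = sample.match(/[a-z]+/)[0];
const wordChars = sample.match(/\w+/)[0]; // \w is [a-zA-Z0-9_]
const withHyphen = sample.match(/[\w-]+/)[0];
const integers = "order 66 of 1138".match(/\d+/g);

// \w is not Unicode-aware, so a Greek letter is not a word character.
const greekIsWord = /\w/.test("λ");

console.log({ greedy, lazy, emptyMatch, names, greedyTag, lazyTags });
console.log({ letters, wordChars, withHyphen, integers, greekIsWord });
```

Running this in Node or a browser console prints each match, which makes it easy to compare the greedy and lazy results side by side.
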
These are two different ways, each with its own advantages. The first one matches all the words and counts how many words it matched. The second one splits the text where it has whitespace and counts how many words you have according to that. The second method is much better, because even though it will count some things that might not be words, like a stray dollar sign for example, how common is that? The first one has a much bigger bug: it won't match any non-English word, because it only matches English letters. Not even accented letters, just plain old ASCII English letters, and many texts don't only have English words.

So this is the first of the challenges I mentioned at the beginning of the talk. If you haven't noted the URL of the web app, you can see it at the top of this slide. This is just a test challenge; it won't matter in the competition at all. It's just so that you can test the system and see how it works. You have a minute for that, and the tweets you post are going to appear here. But don't be disappointed if your tweet doesn't appear; it doesn't mean that it won't be counted, because there's a bit of a lag. It takes about 25 seconds for something to appear there. Basically this counts tweets with the regexplained hashtag, so if you tweet directly from the web app, that will be added automatically. Okay, I guess it works.

So this is the first of the challenges: you should write a regular expression that matches a hex color. These are some examples of the hex colors you need to match. They might be uppercase or lowercase, three-digit hex codes or six-digit hex codes. By the way, if you're interested in who won the book, I'll tweet about that after the talk; I can't really decide on that right at this moment. Whoa, eleven tweets, and some of them got quite close. So one first thought might be something like this. The problem with this is not
that it doesn't match some of the hex codes; the problem is that it matches too many of them. It will match hex codes with four characters or five characters, which are invalid. You need to only match hex codes with three or six characters, so that's not exactly correct. This is more like it: it matches any letter from a to f, or a digit, three times, and then this pattern is repeated once or twice. So by combining these quantifiers you can basically create some sort of quantifier that's either three or six times. It's impossible to accidentally match hex codes with four or five characters with this.

Character classes can also be negated. In this case, this matches any letter from a to f. This could also be written in an alternative way: a letter from g to z, negated. That's not exactly equivalent to the first one, because it will also match symbols, for example; there's nothing that says it should only match letters. This matches anything that's not a letter between g and z. We even have shortcuts for the negated versions of the character classes we mentioned before. For example, the negated version of this could be written more simply as this, which matches anything that's not a letter or number or underscore. See, for example, here it matches only the percent sign, and now it matches both of these symbols. It also matches whitespace, pretty much anything that's not a letter or number or underscore. Similarly, the negated version of the digit class is backslash capital D, which matches anything except digits, so it will match every character of this string except the digit. And the negated version of whitespace will match anything that's not whitespace.

An interesting fact is that even the dot itself is a character class. It's basically this negated character class. What does this mean? It means anything that's not a line break character, but
that's exactly what the dot is. See, this matches every character in the string, and the same happens if we use the dot.

Another interesting thing you can do with negated character classes is to provide an alternative to patterns like the ones we discussed before, for example stripping HTML. You could either use lazy quantifiers, by using the question mark, or you can use a negated character class, which basically means anything that's not a closing angle bracket, as many times as you want. These are basically equivalent, and the second one is slightly faster. I feel I need to remind you again that it's an anti-pattern to parse arbitrary HTML with regular expressions. I need to say this because otherwise I'll get people after the talk telling me you shouldn't do that. Well, if you know what to expect you can do it, but with arbitrary HTML you shouldn't.

You can use parentheses to group alternatives, and the pipe character as basically some sort of "or". In this case it matches either "ab" or "ba". It matches both of them, and here it matches either of them. An interesting fact about parentheses is that they don't just do grouping, they also capture. In this case the entire regular expression matches "cba", but what's matched by the parentheses is actually stored by the regular expression engine, and you can retrieve it if you use the proper methods. So here you get the entire match, "cba" or "cab", but the submatch of "ab" is also stored. This is very useful in some cases, because otherwise you'd have to use multiple regular expressions on the same thing. But in many cases you don't really need it, and it consumes extra memory and it's slower. For example, in this case, if you just want to match both "JavaScript" and "ECMAScript", and you don't really care about capturing the substring that's either "Java" or "ECMA", you don't really need capturing here, do you? It's pointless; it just consumes memory for no reason. So what can
you do? You can opt out of capturing by using a question mark and a colon, and now you just have the entire match. This group is ignored when capturing.

This is the second of these challenges. It's about matching numbers: negative integers, positive integers. They can have a sign in front of them or they may not; they might be decimals; they might not have a part before the decimal point; they might not have a part after the decimal point. And that's why you have two minutes.

Okay, time's up. That got quite a few tweets, and some of them are very, very close. So one first thought might be something like this. It matches the optional sign, and it matches any number of digits or decimal points. By the way, an interesting thing is that here the decimal point, the dot, doesn't need to be escaped even though it's a metacharacter, because it's inside a character class. And it matches any number of them, in any order. The problem is it's too lax: it does match the numbers we're interested in, but it also has many false positives. It allows any number of decimal points, even consecutive dots, even consecutive dots with no digits at all.

This is something that's much closer, and it's what I used to do. I think it's quite good. The only problem is it has one significant false negative: it doesn't allow numbers that just have a part before the decimal point and the decimal point, with nothing after that. It depends on whether you want to allow these numbers; many people are okay with not allowing them. Another alternative is this, which is basically the same as the previous one, except there's a star here instead of a plus sign. That means it matches any number of digits after the decimal point, even zero. That solves the problem of the previous one, but it allows many false positives, like just a dot, or just a sign and a dot, and things like that. Something that's
accurate is the last one. It matches exactly the kinds of numbers we want, and it doesn't have any false positives that I can think of, at least. But is it really worth it? Sometimes it's better to allow some false positives or some false negatives, depending on your application, rather than writing a huge regular expression that matches exactly what you want, especially on the client side, for data validation for example, since you're going to verify the data on the server side anyway.

If we want to match something that's explicitly at the beginning of the string, for example an "a" but only if it's at the beginning of the string, we can use the caret character. Here the "a" doesn't match because it's inside the string; it needs to be at the beginning of the string to match. There's also something similar we can do for the end of the string, with the dollar sign. Here the "a" will only match if it's at the end of the string and at the beginning, because we also have the caret here. So if we have both of them, this will only match a literal "a", and it's kind of pointless to use a regular expression in this case; just compare the two strings.

You can also change the way these anchors behave if you use the m flag. Here, since we don't have the m flag, the "a" needs to be both at the beginning and at the end of the string. If we have a line break and we use the multiline flag, the caret character matches at the beginning of every line and the dollar sign matches at the end of every line. So we can have anything here, and as long as the "a" is on its own line, it will match.

These anchors are very useful in the polyfill for String.prototype.trim. That's the shortest polyfill you can write for that function. Basically it replaces whitespace that's either at the beginning or at the end of the string. It's more performant to split this regular expression into two and do two replaces, one for the whitespace in the
beginning and one for the whitespace at the end, but I think that's a bit of premature optimization in most cases, and sometimes being concise is better than saving one millisecond.

This is called an assertion, because it matches a position; it never consumes any characters. For example, if you have something like this, it matches at this point, between the dollar sign and the five. The reason is that word boundaries match at any point between a word character and a non-word character. Remember when we explained what these predefined character classes do: one matches letters and digits and the underscore character, and the other one is the exact opposite. So the word boundary matches at any point where you have one character that's not a word character and it's next to another that is. For example, here it matches two times, because the order isn't significant. It will also match at the beginning of the string if you have a word character at the beginning of the string, or at the end of the string if there's one at the end. The beginning and the end of the string are basically treated like non-word characters in this case.

That's very useful. If you remember, before we got classList in HTML5, we needed to write our own functions for adding classes, replacing classes or removing classes, and usually those functions created a regular expression on the fly that had the class name between word boundaries. So it matched when it was the entire string, or when it was part of it, or if it was in the middle. Of course, it's consistent with everything we've seen so far: there's also a negated word boundary, a non-word boundary, which is basically the opposite. A non-word boundary matches, whoops, a non-word boundary matches between two word characters or between two non-word characters, like this for example.

Assertions like the ones we showed are always zero-width. In most cases, if you
just care about testing whether a string fits a particular format, you don't need to use assertions. Assertions are very useful when you care about what matched, not only whether it matched. There are also much more complex assertions, which are called lookaheads. In this case you want to match a "b" that's after an "a", and this does exactly what you want. But what if you want to match an "a" only if it precedes a "b", and you don't want to actually match the "b"? You can use a lookahead for that. That's exactly what this says: I want to match the "a" when a "b" follows, but the lookahead itself doesn't take part in the match.

What I find quite interesting is that you can even include capturing groups inside the lookahead. So in this case you will match a zero-length string, because this entire regular expression matches the empty string, but you will also capture a "b", which is outside the full match, because it's included in capturing parentheses. This should actually be under the "b"; it appears I found a bug in this little app, but it should actually match the "b". There's also a negative version of a lookahead. This basically means match the "a" when it's not followed by "b", so anything can be after it and it will still match, except "b". This is incredibly useful.

In most languages there's also the opposite concept, the lookbehind, which matches things only if they're after other things. Unfortunately, ECMAScript doesn't have lookbehind. There's some discussion about adding it in ECMAScript.next. I really hope it makes it, but right now there's no browser that supports lookbehind in ECMAScript, sadly.

This is the third of these challenges. It's about matching dates. It's actually almost impossible to match any valid date with a regular expression; there's always going to be some false positives. Basically, the one who wins this challenge is whoever gets closest. Okay, time's up. Good, this
one seems to be the closest [unclear: "croft dracula"], from a first glance at least; it's really hard to tell so quickly. So, one very loose regular expression for it could be this. Like I said, that's quite loose: it will match months like 99 or days like 99, which don't really exist. A closer one would be something like this, which at least matches the correct months and the correct days, but it also has some flaws. It will match dates like the 31st of February, for example, which never exists, or dates like the 29th of February, which only exists in certain years. But if we really try to go that deep with regular expressions, either it's impossible to do or you'll end up with a huge regular expression that nobody will be able to read. So it's basically the same thing I was saying before: sometimes you need to know when to stop.

A very interesting thing you can do with lookaheads is to mimic certain patterns, like intersection, when you want a certain string to match multiple regular expressions. Usually you do that at the code level, like "if our string matches this regular expression, and it matches this regular expression, and it matches this regular expression", in JavaScript. But you can actually do that at the regular expression level, using just one regular expression, because it takes advantage of the fact that lookaheads don't change the matching position. So if you have something after the lookahead, it matches at the same position the lookahead started from. You can have any number of lookaheads after the first one and they will match on the same string; they don't advance the matching position. In this case, for example, you want to have a password of six characters or more, with at least one number, one letter and one symbol. Doing this without lookaheads would be really hard, so most people would resort to doing it with multiple regular expressions. But if you use
lookaheads, you can have this lookahead that checks that there's at least one digit. Then this one starts matching at the same position, since the first one doesn't consume anything, and it checks that there's at least one letter, and this one checks that there's at least one symbol. Finally, if you want to actually match the string, you have this, which matches anything that has at least six characters. So basically you have one regular expression, one line, that matches all these conditions.

Similarly, you can do subtraction, like any number that's not divisible by fifty: something that matches this pattern but doesn't match this pattern. That's basically a very similar thing; you're just using a negative lookahead instead of a positive lookahead. And of course you can use a special case of subtraction, which is negation: anything that doesn't match one pattern. That's basically a negative lookahead, and then you have something like a dot that can match any number of characters. This one matches pretty much anything; if you also want to match line breaks, it's quite easy to wrap it in a character class. So that will match any kind of string that doesn't contain "foo". Again, that would be really hard to do with plain regular expressions without lookahead.

And the last part of the syntax. If you want to write something like a code highlighter, you'll probably encounter something like this. This is supposed to match strings, for example this. The problem with the regular expression I have here is that it will also match strings like this, where the quote marks are mismatched. That's really bad, because you might actually have double quote marks inside a string with single quote marks, for example 'he said "boo"', and it only matches part of the string. That's wrong; it could seriously break an application. You can use backreferences, which are basically a backslash and a number, where the number refers to the index of the capturing group. Remember, parentheses create
capturing groups. The regex engine remembers what the parentheses matched, and you can actually use what it remembers in the same regular expression by using this. This basically means "I want to match here whatever this matched". If it matched a single quote, this is equivalent to a single quote; if it matched a double quote, it's equivalent to a double quote. So you can match strings properly. Of course, this has some problems as well. It doesn't account for escaped quote marks like this: it's treated like a quote mark that delimits the string, where it actually should be ignored; it should be treated as a regular character.

Which brings us to the last challenge, which is to improve that regular expression (I remind you, the regular expression was this) so that it actually accounts for this case, escaped quote marks, and it also accounts for double backslashes, escaped backslashes. I think the answer to that is pretty much the most elegant regular expression I've seen, as you'll see after this.

Okay, time's up. So the answer is basically this. It's very similar to the previous one. It kind of takes advantage of how regex engines work: if they can match a backslash with something, they will match it, but they also have to match the quote mark at the end. So that basically takes into account all the cases that are of interest to us, and it matches exactly what we want. I think it's really elegant for what it does. I didn't come up with it; I came up with something very close. It's Steven Levithan who came up with it.

So, to wrap up, some best practices. Like I said many times throughout this talk, sometimes practicality wins over precision. Sometimes you need to accept the fact that you will allow some false positives or some false negatives, otherwise it will drive you crazy. This regular expression at the end is what you need to use to match email addresses if you want to be really, really precise. It's not practical, I don't think so.
That's huge; I don't think any sane developer would actually use this. So when you're matching email addresses, anything shorter will allow some false positives or some false negatives, and it depends on your application which of the two you will allow. For example, if you're doing form validation, it's usually better to allow false positives than false negatives, because if somebody has an email address you didn't foresee when you were crafting your regular expression, they'll go through a very distressing experience: their very valid email address won't be allowed. And it's pointless anyway, because they may as well enter an email address that doesn't exist but has a very valid format. So it's better to allow some degrees of freedom there.

Also, keep it simple. If you can do it without using regular expressions, in an amount of code that's not insane, don't use regular expressions. Obviously, if the alternative is using ten lines of code to do it with string functions, regular expressions are better. But if you can do it with just an indexOf, for example, just use the string function; it's faster.

Speaking about performance, some tips. Avoid greedy quantifiers, if it suits your use case; prefer lazy ones. Don't forget anchors, obviously if it suits your use case. I try to use anchors because that means the regular expression engine can realize the match failed much earlier, so it doesn't need to try many different combinations. This is basically a special case of the third point: be as specific as possible. For example, don't use the dot when you really need to match a letter; just use a character class, which is much more specific. Prefer non-capturing groups if you don't need to capture; just add this question mark and colon, it's faster. And minimize backtracking. Backtracking is what the regular expression engine does when it starts matching and goes on and on and on and
then it realizes, "Oops, I went too far", so it backtracks. It backtracks one position, one character, and then another character, until it finds a match. If you can avoid this behavior, try to avoid it. It can be very costly; in some cases it can even make your application hang completely. There are some cases of that; google the term "catastrophic backtracking". I think there are some regular expressions that are so bad in that regard that they can completely hang your application even with very short strings.

So thank you. Here are my contact details. I hope you learned something.
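
The later parts of the talk (the challenges, anchors, lookaheads and backreferences) can be sketched in JavaScript as follows. Every pattern below is an illustrative reconstruction written for this sketch, not the regular expression shown on the slides, and the variable names, the sample inputs and the definition of "symbol" in the password check are assumptions.

```javascript
// Every pattern here is an illustrative reconstruction, not a slide from the talk.

// Hex colours: three hex digits, repeated once or twice (3 or 6 digits).
const hexColor = /^#([0-9a-f]{3}){1,2}$/i;

// Signed numbers with an optional part on either side of the decimal point.
const number = /^[-+]?(?:\d+\.?\d*|\.\d+)$/;

// Non-capturing group: groups the alternatives without storing a submatch.
const script = /^(?:java|ecma)script$/i;

// Anchors: a trim-style replace for leading or trailing whitespace.
const trim = (s) => s.replace(/^\s+|\s+$/g, "");

// Lookahead: match "a" only when "b" follows, without consuming the "b".
const aBeforeB = "ab".match(/a(?=b)/)[0];

// Intersection with lookaheads: 6+ characters with a digit, a letter and a symbol.
// Assumption: "symbol" means any non-word character or the underscore.
const password = /^(?=.*\d)(?=.*[a-z])(?=.*[\W_]).{6,}$/i;

// Negation with a negative lookahead: strings that do not contain "foo".
const noFoo = /^(?!.*foo).*$/;

// Backreference: \1 must repeat whatever quote mark group 1 matched.
// The middle part accepts an escaped character or any non-backslash character.
const quoted = /(["'])(?:\\.|[^\\])*?\1/;

console.log(hexColor.test("#fff"), hexColor.test("#ffff"));
console.log(number.test("-3.14"), number.test("."));
console.log(script.test("ECMAScript"), trim("  hi there \n"), aBeforeB);
console.log(password.test("abc12!"), noFoo.test("a foo b"));
console.log("'he said \"boo\"'".match(quoted)[0]);
```

In keeping with the talk's advice about practicality, these patterns aim to be readable rather than exhaustive; for example, the number pattern accepts "5." but rejects a lone dot or a lone sign.
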

Info

Channel: O'Reilly

Views: 208,247

Rating: 4.9334722 out of 5

Keywords: Lea Verou, Regular Expressions, explained, Programming, code, OReilly, OReilly Media, OReillyMedia, o'reilly, O'Reilly Media, O'Reilly Webcast

Id: EkluES9Rvak

Length: 48min 18sec (2898 seconds)

Published: Mon Apr 08 2013