Functional Parsing - Computerphile

Video Statistics and Information

Video
Captions Word Cloud
Reddit Comments
Captions
okay so today we're going to do some more live coding and we're going to talk about something which is quite close to my heart because i wrote some of the early papers on it many years ago and that's something called functional parsing or combinator parsing before we get started with the actual live coding i want to begin with a question and the question is what actually is a parser so for me a parser is a program which takes a string of characters as its input and as its output it produces some form of tree and the idea is that the tree makes explicit the structure in the input string so that's a bit of a mouthful so let's have a simple example to explain what's going on with this so what we've got here is we've got a string of five characters two plus three times four and that's our input string but when we look at this we know that that's not just a random sequence of five characters it's actually got some structure to it so in particular we've got three numbers here two three and four and we've got two arithmetic operators we've got the plus and the times and one of the things we learned at school is that the times happens before the plus so you really do three times four here and then you add the two on at the end so that's our input string so what a parser does is it takes a string of characters like that and it tries to recognize the structure in the form of a tree and the tree we would get from a parser is something like this here and you see we've got some leaves in the tree the leaves are the numbers 2 3 and 4 and the nodes in the tree are the two arithmetic operators that's the plus symbol and the time symbol and you can see the structure of the tree reflects the fact that we do the multiplication first we multiply 3 by 4 and then we do the addition second so that's the basic idea of what a parser is it takes as input a string of characters and as output it produces a tree so it's really a function it's taking an input as a string and an output as a tree so we've seen what a parser is it's a function that takes a string as an input and produces a tree as an output so now we want to think how can we actually implement this how can we implement the idea of a parser and i'm going to do this in haskell today but it doesn't matter if you don't know anything about haskell because i'll be explaining everything as i go along and actually nothing i'm going to show you today is specific to language like haskell you can do it in any general purpose programming language so if you do a web search for combinator parsing or functional parsing and then whichever language you're interested in you'll find the same kind of stuff which i'm going to show you today so it's not specific to the haskell setting so what we're going to do then first is we're going to define precisely in our programming language what it means to be a parser we're going to define a new type called parser and it's simply going to be a function which takes a string as an input and produces a tree as an output so it captures the very simple idea of what we're wanting to do a parser is a function that takes a string as an input and gives a tree as an output and the arrow here just means that we have a function from one thing into the other so this captures the basic idea of what a parser is but unfortunately it's not sufficient to program with we need to refine this a little bit to actually write programs with this so the first little refinement we're going to make is that a parser might not consume all of its input so for example if we're trying to parse a number like two we maybe we find the two and then maybe we've got some more stuff left in the input string that we need to parse and if we want to chain parsers together we're going to need to have access to the remaining input string that we didn't manage to consume so the first little refinement that we'll make is rather than just returning a single tree we're actually going to return a pair now we're going to return two things we're going to return a tree as before and we're also going to return the unconsumed part of the input string so that's the first refinement we've made the second refinement we're going to make is that a parser may not always succeed we may be trying to parse a number and we don't find the number we find something else so we need to have a way of representing that parser can fail so the way we're going to do that is we're actually going to make a parser return a list of results rather than a single result so lists in haskell are denoted using square brackets so this simply means that rather than returning one pair of results we could return zero or one or two or as many as we like and the idea is going to be if our parser can't parse so it doesn't succeed we'll return an empty list of results and if it does succeed we'll return a list with one pair in it we'll return a tree that represents the structure of the input string and we'll return the unconsumed part of the input but because we're working with lists here we could actually be more flexible we could return two three or four or five or as many as we like parses and this is actually quite a good flexibility because for some languages the input string may be ambiguous maybe we're trying to parse english and english sentences don't always have one parse they can be interpreted in many ways so this type here is giving us the flexibility to to return many results if we wanted we're not actually going to use that flexibility today but it's nice to actually have it so i haven't told you what the tree data type is here and that's because i'm actually going to get rid of that now sometimes you may want to return a number or a program or some kind of other structure so we're going to replace that specific type of trees by some arbitrary type a and i'll make this a parameter of my type this is our final type which we're going to work with today what we're saying is that a parser whose results have type a is simply a function which takes a string as an input and then it gives a list of results and each result is a pair comprising a single value of type a maybe a tree maybe a number maybe something else and then an unconsumed part of the input string okay so this is our final type and if you look in any of the articles or books about these kind of parser combinators or functional parsers you'll find the type very similar to this there it's quite a mouthful again so let's think about how we could understand this in a simpler way and actually we can write a little rhyme to understand what's going on with this type so let me write the rhyme out for you as a comment so what we can say is a parser for things is a function from strings to lists of pairs of things and strings this is a little dr zeus rhyme to tell you what a parser is or what a functional parser is and that's actually how i remember this type so that's our basic type now we've seen so far a parser is basically a function from strings to trees but in order to actually program with these kind of things we need to refine the type a little bit and this is the type we'll be working with today but you don't need to worry about the details of it just basically think it's a function from strings into trees or some other kind of structure what we're going to do now is we're going to load up the parsing library so i'll start up the compiler this is a library which contains a whole bunch of parsing stuff which allows us to program with parsers of this form and this is a parsing library i wrote myself i'll see if i can get sean to upload it as part of the video a parsing library comes with any programming language you can think of again if you just search for parser combinators functional parsing whichever language you like you'll find a library which gives you all sorts of basic ways of building parsers and that's what i'm going to show you now all of these libraries work in the same way so you have some basic primitives or basic building blocks for parsers and then you have a way of combining parsers to build bigger parsers so it's like a kind of lego kit or a construction kit you have some basic breaks that you have that you can do things with and then you can put those bricks or components or primitives together in all sorts of different ways i'm going to show you a few of the primitives and a few of the combining forms and then we'll do an example the first primitive i want to show you is very simple it's just a way of parsing a single digit so what the digit parser does is it takes a string of characters and it tries to consume a single numeric digit off the start of that string okay and you might think well what does pars do here what pars does is it takes a parser which in this case is just digit and it takes an input string to that parser and it just applies one to the other so of course this parser here is going to succeed because we do have a digit at the start of the input string so we get exactly the expected result we get a list with one pair and the first thing in the pair is the actual digit we get the character one and the second thing in the pair is the unconsumed part of the input and that's something we could then try and parse subsequently with another parser we can test does this thing fail properly so if i give it an input string that doesn't have a single digit at the start then it's going to fail so if i give it the input string abc there's no digit at the beginning so we're just going to get the empty list of results so i'll show you one more quick primitive if i parse a single character say an a from that string then that will do the right thing if i parse single character a and i didn't have an a at the beginning then it will fail okay so we've seen two basic parsing primitives here we've seen a way of parsing a digit a way of parsing a specific character and we've seen that these things can succeed or fail and in the library or in any of these libraries there'll be a bunch of these basic building blocks or basic bricks or primitives that you can use to build up your parsers where things get more interesting is when you think how do you combine these kind of things how do you use these basic bricks to build actually useful parsers let me show you an example of this so there's a parsing combining form called sum and what it does is it takes a parser as its input and it tries to apply it one or more times as many times as possible so if we're trying to parse some digits what we're trying to do is consume one digit then two then three and as many as we can until we don't find any more digits okay so if we apply the sum digit parser to the string one two three then it will do the expected thing it will consume all three of the digits and then we'll get the empty string left here so that sum it gives us a form of repetition and we also have a very simple way of making a choice as well so if i do i want to make a choice here between a digit and a letter and let me parse the string a b c one two three so what we've got here is this funny symbol here with the three symbols that's a choice operator it says do that or do that so if i try to parse a digit or a letter what it's going to try first is it will take the first character in the input string and say is it a digit if so i'll parse it and if it's not a digit then i'll go over to the other side and say is it a letter and i will try and parse that you can see what's happening with this particular example here if i look at the first character it's not a digit it's a letter so when i apply the digit parser it would fail and then the or operator or the choice operator here will give it to the other side and say well is a letter instead and of course it is so we can parse the single a off the front here and then we get everything else as unconsumed and of course if we wanted to be a bit more clever we could combine some of these things so i could say sum digit or letter get my brackets right abc123 and then that will parse everything because all i'm doing here is i'm repeating or iterating the choice between either parsing a single digit or a single letter and i've got a string of digits and letters here so i can parse the whole thing so i've consumed them all and then i get nothing left at the end so what we've seen so far some basic building blocks and we've seen a couple of combining forms we've seen a way of doing repetition which is some and we've seen a way of making a choice which is the funny operator in the middle there what i haven't shown you so far is how to do some form of sequencing and this is the most common thing you typically want to do with parsers you want to say do this and then do that and maybe do that as well you want to sequence things together so i'll actually show you a bit of the parsing library here so here's the parsing library and i don't want to go through all the details of this but one thing i want to know is it's quite short if i kind of scroll down here i think it's about four and a half screenflows and i've got quite a big font here and this is actually already quite a sophisticated parsing library so it shows you the power of this method that you don't need hundreds of lines of code to write parsers four and a half screenflows is a library which is fully fledged and you can basically implement any parser you like using this so i'll show you a couple of examples of sequencing the first example i want to show you is a parser for natural numbers so what's a natural number it's just a non-negative integer like 0 1 2 3 or 10 or something like that so you think how do you parse a natural number well i'm going to use the sequencing notation for parsers which is the do notation and the do notation is very simple you write the word do and then you have a whole bunch of parsers one after the other and it just runs them each in sequence so the first thing here is we're going to parse some digits because that's the basics of what a natural number is it's just some digits and if that succeeds i'm going to call all those digits x's so x's is just going to be a list of all the digits and then what i'm going to do here to be a bit more flexible probably when i parse a number i don't want a string of characters back i actually want the number so i'm going to pass the string in to a little function called read which just converts the string into number and then i'm going to simply return it so the basic idea here is we're sequencing two things together we're reading some digits or parsing some digits and then we're translating those into a normal number and then we're returning it and we're sequencing those things together using the do notation here so just one more little example because we'll use this in a minute here's how you could parse an integer so an integer is either a negative number or a positive number so there's a choice there so we're going to use the choice operator so here's the or operator which we've seen a few minutes ago and the two parts here just say have we got a negative number or have we got a positive number so the parser for a negative number we use the do notation because we need to do three things so the three things are here so if we're trying to parse a negative number the first thing we do is we parse a minus sign so we're using the char primitive that we saw previously that will parse a minus sign then we're going to parse a number and call it n and then we need to remember that we need to make it negative so we we negate it and then we return it so again here we're just using the simple idea of sequencing three parsers one after the other and then the or here says or we can just have a simple natural number okay so this illustrates the idea of sequencing and if you've seen some of my previous videos you may recognize the do notation here and this is because parsers form an example of what's known as a monad and in fact for me parser's being magnetic is one of the key ways to understand what a monad is so if you've seen the monad video or even if you haven't seen it maybe have a look back at that and if you find it interesting maybe look up some of the work which people have done on magnetic parsing and it's a really good way to get a very good feel for what's going on with both these kind of parsers and one ads as well we've seen that parsers are basically functions they take a string as an input and they produce essentially a tree as an output we've seen some basic primitives for consuming single digits and single characters and things like that and we've seen some basic combining forms we can have repetition with some we can have choice and we can have sequencing so we've got our basic kind of building blocks for making larger parsers so let's wrap this up now by doing a little example an example i want to do is to build a really simple parser for the kind of expressions or arithmetic expressions that we saw back at the start so things like 2 plus three times four so what i'm doing here is i'm writing a haskell program which is going to implement this parser what i've got in the first line here is simply importing the parsing library which we've just seen it's just four and a half pages of code it's very straightforward and what we've got here in the comments is a simple way of writing down what the syntax or form or structure of expressions are and this is what's known as a grammar but it doesn't matter if you don't know what a grammar is because the basic idea is very simple here the first line says an expression can be one of two things so this means or here in grammars so an expression can either be a term plus an expression or it can be a term and then in turn a term can either be a factor times a term or a factor and finally a factor can be a bracketed expression or an integer so there's three simple rules here which explain the form or structure of what a simple arithmetic expression can be there's actually quite a lot of things going on here but the only key thing i want to point out is we've got three rules because there's three different levels of priority in an expression so the highest level of priority is brackets so that's one thing you learn in school you learn that you do the brackets first that's the highest priority thing the middle level of priority here is multiplication so that's sitting in the middle rule and the lowest level of priority is you do addition and that's sitting at the top rule and again this priority order is something you learn at school you do brackets first then you do multiplication second and then you do things like addition last and these three rules are just making that precise but again if you want to know more about grammars you can you can search on that and you'll find lots of information about that online so what we want to do now is take this little grammar and implement it as an actual parser and this is very straightforward to do because we're using this functional parsing idea essentially we just take those three grammatical rules and we just implement them using the combining forms and the primitives that we've seen so it's a very straightforward translation so the first one is we want to say an expression can either be a term plus an expression or a term so let's do the first part of that so we've got term plus expression so what we're going to do is parse a term and if that succeeds color x then we're going to parse a plus character then we're going to parse a expression and call it y and then we're going to return x plus y so there's four things going on in sequence here we're first of all parsing a term then we're parsing the plus character then we're parsing an expression and we're getting the values here x and y these will be numbers and then we're going to simply add those two numbers together so you can see here we're actually doing more than just parsing we're actually evaluating the expression as well and that's one of the advantages of this approach to parsing that it's not just about building a tree you can actually process things as you're going along and we're actually processing them here by doing complete evaluation so x and y will be numbers which result from this term and this expression and then we just add them together here and return the result then the last part of parsing an expression is we can either be a term plus an expression or we could be a term so we just use the choice operator and we get a term and these five lines here are our full parser for expressions so we had this one rule up here an expression could be a term plus an expression or a term and we just translate it directly into our parsing notation and again the key observation here is that the grammatical rule here looks basically the same as the parser so let's look at the second rule and in fact it's pretty much the same as the first row except a few symbols are changed so let me just copy it and i can just change it so i can say a term can be a factor times a term and then i can do a multiply there or i get a factor okay so in just a few key presses i've managed to implement my parser for terms and again the point to note here is that the grammatical rule here looks basically exactly the same as the parser okay i've just got a few more symbols in here because i'm actually writing a program to do parsing or actually evaluation but it's the same basic structure and then just to wrap things up um i can implement what a factor is so a factor is either a bracketed expression or it's an integer so let me write the parser for that so a bracketed expression i just parse a character and then i parse an expression and call x and then i parse a closed bracket and then i return the x or i can parse an integer and again if you look at the structure of the rule up here a factor is a bracketed expression or an integer i've got exactly the same thing down here here i'm parsing a bracketed expression and here i'm parsing an integer and this is actually our entire parser and we've got three lines up here and we've just got kind of 10 or 15 lines down here and this is actually a complete parser and evaluator for arithmetic expressions and again the beauty of this approach is the parser looks basically the same as the grammar so let's try and see if this works and hopefully i haven't made any mistakes so let's load it into the system okay that's great no errors and we can see that we've loaded two files in now we've loaded the parsing file in which is that four and a half pages of definitions and we've loaded the example program in now so now we can check does our parser actually do what we want so let's try our parser with the little example that we had at the start 2 plus 3 times 4. so we're going to parse an expression and the expression is two plus three times four and we press return and we hope we get the right result and we do so remember from school you do the multiply first you do the three times four so you get twelve and then you add the two on at the end and you get fourteen so we've managed to get the result 14 here and there's no portion of the input string left so we've got a successful result we've got a list and we've got one result value we've got 14 and we've managed to consume the whole thing and we could check does this actually work with more sophisticated examples so let's try putting some brackets in let's put brackets around the two plus three so we get two plus three times four so we hope then that we do the addition first and we get five and then times by four to get twenty yes and it works we can try more sophisticated examples so let's do something like 2 plus 7 times 10 plus 8 times 20. and if i've got the brackets right yes then it works fine we can also check what happens if you give it something which doesn't parse so suppose i do something like a partisan expression so two plus three times and i forget to write the four at the end what's going to happen well the parser will still manage to succeed because it will manage to parse the two plus three and we'll get the five out but it doesn't know what to do with the symbol sitting on its own so you get that back as an unconsumed part of the input and again we can try another example suppose i forget to close the brackets so i do something like two plus three and i forget to close the brackets then it won't know what to do with that at all and we'll just get the empty string that's basically it this is the idea of functional parsing or combinator parsing idea is very simple parsers are basically functions you define a library with some basic building blocks or primitives some combining forms that let you put these things together and then you can end up writing parsers as we've seen that look very similar to the grammars that you write to describe languages the man goes to town as we would say to town the man goes sounded to me 20 years ago when i first stumbled on this very much like yoda the jedi master for those of you coming into this cold and direct because you saw the word yoda and grepped over the entire universe
Info
Channel: Computerphile
Views: 134,608
Rating: undefined out of 5
Keywords: computers, computerphile, computer, science, Computer Science, University of Nottingham, Professor Graham Hutton, Functional Programming, Haskell, FP, Parsing, Combinator Parsing
Id: dDtZLm7HIJs
Channel Id: undefined
Length: 22min 45sec (1365 seconds)
Published: Wed Feb 05 2020
Related Videos
Note
Please note that this website is currently a work in progress! Lots of interesting data and statistics to come.