ANTLR v4 with Terence Parr

Video Statistics and Information

Captions
Thank you. I had to stop making photos like this because apparently in Central and South America this means your wife is cheating on you. Why do they have a hand signal to indicate this? Is this so common that we need a signal? I don't know. Literally, I had to quit doing that. Okay, well, anyway. Thank you very much for the opportunity to talk to you about parsing tonight; I like nothing more than talking about programming and parsing and grammars and that sort of stuff.

People know this quote I had years ago: why program by hand in five days what you can spend five years of your life automating? I've had to make a small modification to that. It's now 25 years later, but I finally have the parser generator I was looking for, so now I just have to remember what I was doing in graduate school in the late '80s and get back to it.

So basically I'm going to introduce ANTLR 4 to you. I promise this is my last version; I have no more ANTLR rewrites in me. I'm going to go over a little bit of the big picture so we're all on the same page with regard to terminology, and refresh some of your memories on parsing. One of my goals with this version of ANTLR was to address a very common situation, which is that you're not really a language person. How many language and parsing theory people have we got here? Okay, we've got a couple of freaks in the audience, but most of us just want to parse Java or C# or whatever; you don't care how it's done, you just want to get notified when you see a variable declaration, a method declaration, whatever. You want to divorce yourself from the grammatical difficulties and just focus on living in your comfortable Java space or whatever. So, without really looking at the grammar, we're going to solve a real language problem, and as we go along I'll show you a little bit about the new version of ANTLRWorks, written by Sam Harwell, my co-author on ANTLR 4. As part of this whole attempt to make it easier to use ANTLR and to build language applications, I'm going to show you something about the listener and visitor mechanisms that ANTLR can automatically build for you, to help you build language applications that feed off of input streams. Then we'll do a really quick comma-separated-value grammar just for fun, I'll show you the various tools and how they work, and then I will strut my stuff on adaptive LL(*) grammars and show you why that is so cool.

Okay, so a couple of terms. A language application's job is really to recognize input and to respond to it in some way, and "respond" means you load in a configuration file, or you interpret it, or you translate it, something like that. To recognize something means to be able to identify its pieces. It's kind of like identifying the parts of speech in English: that's a subject, that's a predicate, this is the object. (Oh, you got it? Okay, cool. I need a little exercise anyway.) So you need to be able to identify the pieces of the input, and you need to be able to differentiate, say, an assignment from a return statement. That's what recognition is about. The job of the recognizer is to take in a character stream and break it up into a sequence of symbols called tokens, vocabulary words, like in English. It then applies a grammatical analysis, if you will: it checks the syntax. And as it does this it doesn't just throw the input away,
because you could read in 500 megabytes of Fortran code and go, hey, it parses, but that doesn't do anything for you; a boolean yielded out of that is useless. You have to do something with the input, and what you want to know, again, is what the parts of speech are, effectively: how do you get those pieces out of there so you can do something with them?

You may not realize it, because we're all so good at reading, but remember when you first started: it was "oh, go, dog," and you keep going, looking at the characters and putting them together. That's exactly what the human brain does, and that's typically what we do when we do language processing by computer. If you look at humuhumunukunukuapua'a, or whatever that is, you realize that your brain is actually looking for that damn space, so you really are still breaking the input up, and if you look at Morse code it's even more obvious.

Okay, so the result of this parse, in most of the applications we're going to care about, is a parse tree, and a parse tree is a representation of the input. You can see in my leaves, on the bottom there, I've got the input, and the tree also names the parts of speech for us: the nodes interior to the tree are the names of the rules that were applied to match that input. So it's not only the entire input; it tells you how it was matched, and if you know how it was matched, then you know how to respond to it. On the right side there I've got a description of some of the classes used by ANTLR in its runtime, and I've tried to be as parsimonious as possible. The parse tree points into a token stream, which is the sequence of symbols produced by the tokenizer, and those tokens don't copy the characters either: they point into the original character stream. That's cool, because when you're in this parse tree you can go all the way back to the exact character in the input stream where something happened. It's all nicely tied together with a bunch of pointers.

How many of you have played with ANTLR 3? There's some work involved if you want to build a tree, right? And there's even more work if you want to walk it with a grammatical tree parser, things like that. Kind of a long story, but I was asked to evaluate... does anybody remember the Oracle v. Google Android trial? I was an expert witness for that, and they asked me to review the source code. I went in and I was really, really surprised. I was thinking to myself, oh my god, this is going to be a mess on the inside, really complicated bytecode-manipulation stuff. It turns out it was really clean, and they used a visitor pattern to go through it. Normally I don't like the word "pattern," because to me it's just programmer experience, and then you have to have a book about it and read up on all of that, but they really did something nice there, and I understood what was going on because I knew what they meant by a visitor pattern. Anyway, I started thinking more about the ANTLR thing, and I realized that what I do is not build compilers. I don't build compilers; that's the way I started, which is why I was thinking in terms of those kinds of trees, but in the end I translate, I do little interpreters, things like that. None of us builds compilers. So I should modify ANTLR so that it focuses on what we do most of the time, which is to build these little things.
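To make that pointer chain concrete, here is a minimal sketch using the ANTLR 4 Java runtime: from any parse-tree node you can get back to its first and last tokens, and from their character offsets back to the exact original text. The class is illustrative; any generated parser's rule contexts work the same way.

    // Sketch: parse tree -> tokens -> original characters (ANTLR 4 Java runtime).
    import org.antlr.v4.runtime.CharStream;
    import org.antlr.v4.runtime.ParserRuleContext;
    import org.antlr.v4.runtime.Token;
    import org.antlr.v4.runtime.misc.Interval;

    public class OriginalText {
        // Return the exact source text a rule matched, whitespace and comments included.
        public static String of(ParserRuleContext ctx, CharStream chars) {
            Token start = ctx.getStart();   // first token this rule matched
            Token stop  = ctx.getStop();    // last token this rule matched
            // Tokens carry character offsets into the original stream,
            // so we can slice the raw input back out of it.
            return chars.getText(Interval.of(start.getStartIndex(), stop.getStopIndex()));
        }
    }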
And the beauty is that now it will automatically build these trees for me, just automatically (you can turn it off if you want), and it will automatically generate the infrastructure, the scaffolding, for listeners and visitors, so that I can have it notify me when things of interest happen.

Okay, so imagine you go to a new job here at Boundary and your boss says to you: I want you to read in Java code and extract an interface, like the refactoring in Eclipse or IntelliJ. What that means is you pull out the public method signatures and get rid of the method bodies and all that kind of stuff. (This is an example from the book.) Your first step, for most of you, is panic, right? What are you going to do? Then your coffee kicks in and you realize, oh, well, let's just compile the stuff, and maybe look at the bytecode output and see if we can pull it out of there. Then some of you realize: hey, let's just use the javap program, because that will literally print out those methods for me, and then I can do a little Perl or something to translate it into the right form. And then all of a sudden the boss says, oh yeah, and by the way, keep all of the whitespace and comments exactly the way they are. You're like, hmm, now what? All right, we're back to step one: panic. How do you do this? There's nothing for it; you have to parse that code.

Well, let's solve that problem. You can imagine going to a web site and saying, give me a Java grammar (you don't want to build it yourself), so that means we get a Java.g4, and I have the ANTLR tool, and I have its runtime so I can execute the generated parsers. And then I don't care about the grammar: let me stay in my Java world and just call a function. Give me a callback when you find a method declaration; that's all I want to know about. And then, oh, you have to pass me the context: you have to tell me what that little piece of the tree looks like, so I can print out the right stuff, because I need to know more than just that it's a method declaration; I need to know which one. So we do need to know a little bit about the grammar; we're going to have to take a quick peek at it. There's a rule called methodDeclaration, and obviously a method is, you know, type, identifier, formal parameters, and some junk, so we need to know at least those three things. Oh, and I'm going to parse a whole file, so I need to know what the starting rule in the grammar is; that's called compilationUnit. And then I need to know the concept of a listener: it's a SAX-like thing that just fires off these entry and exit events. We don't even really care what the details are here; just note that you've got a type, an identifier, and formal parameters, and everything else we don't care about, because we're trying to make an interface. Okay, maybe I need the throws clause on there or something, but this is close enough.

Okay, let me give you a quick view of this thing, ANTLRWorks. It's a nice little environment: you can run it, you can develop grammars with it, it'll bring up trees (I'll show you all that in a second), and it's based on NetBeans, so it's also available as a plug-in if you want it that way. So we've got a Java grammar; we run ANTLR on it (that's just a little alias for whatever the Java incantation is), and the result of running the ANTLR tool is to generate the parser, the syntax analyzer, and the lexer, the tokenizer, and it automatically builds
two things: the listener interface, which looks like this (you'll note that it's got enter and exit events, method calls, for all of the grammar rules), and something called a base listener, which is a default, blank implementation of all of those. So if all you want to do is implement one method, you just subclass that and override the one method. Very easy. Here is the one method that we need to override: when the parser has seen (well, we're actually walking the parse tree after the parse) a method declaration, this is the code that's going to get executed. Notice it passes me a particular kind of node; this happens to be a subclass of the generic parse tree node, specific to a method declaration match, because I need to get the identifier, the type, and the formal parameters out of there. It specially generates one context object for each rule in your grammar, and it contains everything matched by that rule. You've got it all: you know that the rule was matched, and you have the information to get all the details from the input. In the end I'm just going to print it back out: write the type, the name, the args, a semicolon, instead of the body. Done.

Well, we actually need a little bit of code so we can call this thing. I'm starting out here (this was automatically generated for me), so I say: give me a parser, start parsing at the start rule. There's a little bit of setup here for the token stream, but the result of that is to give me the parse tree back. Then I say: give me a standard tree walker; this is just a class in the library, something we'll see in a second, that walks the tree for me. And this is the object I just defined for you, the one that implements the enter-method-declaration method. Then I tell the walker to walk the tree with that listener, and it just fires events as it sees them. So I compile all the crap, and if I run this little thing on the input file (the demo, which I'll show you), we get the right stuff.

So let's see now. Okay, I'm going to run ANTLR on this. Pardon me, actually, I can do it even shorter than that. Do I have the right version of this? Tell you what: I'm going to run "generate recognizer," and I don't want a visitor, just the listener, and finish. Okay, so I've got all these things in here. This extract-interface tool is the code that calls the parser and the walker. Anyway, now something's got to compile all this crap... well, that would be unfortunate; I just tried this. Oh, you know, I probably haven't set this path up correctly; let me do a little ThighMaster here until my things are properly set up. Now let's see... it's still coming up with the wrong one; how is that possible? Oh, I know why. Okay, now in principle everybody is happy: it compiled. So now I'm going to run that extract-interface-tool thingy, and the file that I've got is just that little Java file, which I gave a different extension so the compiler didn't complain all the time. And you can see (well, the people in the front can see) that it spit out all of this stuff with all the appropriate comments in there, because I said: take the start of this method that we matched and basically print out to the end. It got everything in between because, remember, it points back into the original character stream, so I know everything between two points. Okay, any questions so far? Comments?
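The listener and driver being described look roughly like this. It is a hedged sketch based on the talk and the book's extract-interface example, assuming ANTLR has generated JavaLexer, JavaParser, and JavaBaseListener from a Java.g4 grammar; the exact rule and context names depend on that grammar.

    // ExtractInterfaceListener.java: override just the one callback we care about.
    import org.antlr.v4.runtime.*;
    import org.antlr.v4.runtime.tree.*;

    public class ExtractInterfaceListener extends JavaBaseListener {
        private final TokenStream tokens;
        public ExtractInterfaceListener(JavaParser parser) {
            this.tokens = parser.getTokenStream();
        }
        @Override
        public void enterMethodDeclaration(JavaParser.MethodDeclarationContext ctx) {
            // Print "type name(args);" and drop the method body.
            String type = ctx.type() == null ? "void" : tokens.getText(ctx.type());
            String args = tokens.getText(ctx.formalParameters());
            System.out.println("\t" + type + " " + ctx.Identifier() + args + ";");
        }
    }

    // ExtractInterfaceTool.java: lex, parse from the start rule, then walk the tree.
    public class ExtractInterfaceTool {
        public static void main(String[] args) throws Exception {
            CharStream input = CharStreams.fromFileName(args[0]); // newer API; 2013-era code used ANTLRFileStream
            JavaLexer lexer = new JavaLexer(input);
            CommonTokenStream tokens = new CommonTokenStream(lexer);
            JavaParser parser = new JavaParser(tokens);
            ParseTree tree = parser.compilationUnit();            // start rule
            new ParseTreeWalker().walk(new ExtractInterfaceListener(parser), tree);
        }
    }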
So, for commentary: yeah, it's not a perfect analogy; you do parse the entire input first, though you can still put actions in the grammar. So if you're parsing a network stream, okay, you can say: don't make me a parse tree, and put actions directly in the grammar that go bang, bang, bang as I see stuff. You can be as efficient as you want to be, but for the common case, where you just don't care, this is the way to do it. It's kind of a mix of the SAX and DOM approaches: it actually builds the entire tree, the equivalent of the DOM tree, and then you can walk it with a visitor, or you can use the automatic listener to have it fire the events.

The reason you like this intermediate form, this parse tree, is that it's often the case that you want to walk something multiple times. My first task as a postdoc (well, it wasn't a real job) was to take this hideous Fortran 77 and turn it into parallel Thinking Machines Fortran. Anybody remember the Thinking Machines boxes? We called them blinking machines, because they had nothing but a bunch of LEDs going; it was in Jurassic Park or whatever. So, blinking machines, or stinking machines, or whatever. I had to parallelize this thing, and it was such a mess that I had a 15-pass translator, because each pass was so hard you could only think about one thing, and so I gradually morphed the program as I went along; each pass did something else. That's why we like a tree: otherwise you've got to re-parse it, right? But wait a minute, if you're morphing it, what are you reading back in? You've got to have some tree that is a representation of this.

Okay, we've kind of already gone over this, but you've got this parse tree that's built, and the standard parse tree walker is just a depth-first walk. As it discovers a node, it fires the appropriate enter event based on that node (each of these node classes gets automatically generated by ANTLR: here, here, and here), so when the tree walker sees one of these, it just hits it with a notify message and it fires off an enter event. After it has walked all of that node's children, it fires the associated exit event. So it's very much like a SAX thing: you just fill in the methods you want, and the list of methods that you have is really the interface to your surrounding application. It could be something really trivial, where you just load a bunch of properties into your application, or it could be that you're interpreting a calculator language in here, or a translator; but this is how the parser connects to your surrounding application.

Now, it's often the case that you don't want some brain-dead parse tree walker doing the depth-first walk for you; you want to do the walking yourself. For example, if you're traversing a tree that represents a program and you have something that's "if false" followed by 30 megabytes of nodes, and you're trying to be efficient while doing flow analysis or something, you might decide not to walk the children, because you know that damn thing is statically false. So there are reasons you might want to walk this thing yourself. It's also the case that if I'm making the calls myself, I can return values, so if you're building a calculator this is really easy. I've extended the standard notion of a visitor with a type parameter that gives the return value, so every method in the visitor can return whatever you want. I like it; it's very nice.
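A sketch of that typed-visitor idea for a little calculator. It assumes ANTLR has generated ExprParser and ExprBaseVisitor<T> from a hypothetical expression grammar whose alternatives are labeled #Add, #Mul, and #Int; all of those names are illustrative.

    // Each visit method returns a value, so evaluation falls out naturally.
    public class EvalVisitor extends ExprBaseVisitor<Integer> {
        @Override
        public Integer visitAdd(ExprParser.AddContext ctx) {
            return visit(ctx.expr(0)) + visit(ctx.expr(1));  // evaluate both operands, then add
        }
        @Override
        public Integer visitMul(ExprParser.MulContext ctx) {
            return visit(ctx.expr(0)) * visit(ctx.expr(1));
        }
        @Override
        public Integer visitInt(ExprParser.IntContext ctx) {
            return Integer.valueOf(ctx.INT().getText());     // leaf: just convert the token text
        }
    }
    // Usage: Integer result = new EvalVisitor().visit(tree);  // you drive the walk yourself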
Let's see here... so then, on your visitor, you just say visit and give it the root, and off it goes. Okay, how are we doing on time? All right; usually I have an internal sense, but I've consumed enough salt tonight that I think I'm having an aneurysm or something at the moment. Oh, that's perfect, perfect. Yes, well, you guys say something, yeah, exactly; when I start to get up and leave, then we're done. All right.

Let's look at an actual grammar, a simple little grammar: we want to parse simple comma-separated values. By the way, if you actually start looking at the files that spreadsheets dump out, that stuff is really complicated; whoever decided that two quotes in a row is an escaped quote should be publicly beaten. So basically all I want to do is parse a little bit of input. (You know that remark is going to come back to haunt me: I'm going to be in a trial some time, trying to defend myself, and they'll whip out this video and say, "So, Professor Parr, you're into beating random humans, I see." Yes, only if they deserve it.) I don't know, maybe I'll try to get this font size bigger; let's see, what can we do, maybe 24... whoa. Okay, well, anyway, as I move this around you can see that there are syntax diagrams and all that automatically put in here, but we don't care about that. Oh, this is small enough? Yeah, we've got it. Okay.

Just to identify the elements here: the rules that start with a capital letter are lexical rules, an artifact of yacc from 30 years ago, and the parser rules start with a lowercase letter. I've defined whitespace (you version 3 people will note the set notation, thankfully), and also there's no action here. Rather than an action in some random implementation language, it's now understood that all of the common things you want a lexer to do can be expressed in a language-independent fashion. My goal is to get actions out of the grammar, so that not only is it application independent, it's language independent: I can take the same Java grammar and generate a parser for it in Python, C, C#, whatever. The way to do that is to keep all of the actions out of there, so this one just says: throw that crap out. This is the non-greedy operator, to match anything in between; it basically says, while you haven't hit the closing quote, keep matching, and that's how I match a string. And here's my start rule; just by convention I put it at the top. I'm going to match a bunch of records, and a record is a field, optionally followed by zero or more comma-field pairs, with a newline on the end, or carriage return plus newline if you're a Windows programmer. My fields are very simple: I've just got ints and strings.

All right, what can we do with this damn thing? We could do a couple of things. The first thing I'm going to do is run it in the test rig; maybe that's the only thing I'm going to do. That's the input CSV file; what does that look like? Yeah, the look on the 20-year-olds in my class when I open up this terminal thing: what the hell is that? He's typing in there and stuff is happening. I sent a student an SSH command the other day to get into one of my Amazon boxes, and this person writes back and says, "It's asking me for a credit card." And I said, that's interesting; SSH has just never asked me for a credit card. It turns out they were putting this SSH command, meant to connect to the remote server, into the browser URL window.
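Staying with that CSV grammar for a second: a listener over it might collect each record's fields into a list of rows, which is the use he alludes to again a bit later. A hedged sketch, assuming generated CSVParser and CSVBaseListener classes and rules named record and field; the real names depend on how the grammar is written.

    import java.util.ArrayList;
    import java.util.List;

    // Fires once per record; gathers that record's field texts into a row.
    public class CSVLoader extends CSVBaseListener {
        public final List<List<String>> rows = new ArrayList<>();

        @Override
        public void exitRecord(CSVParser.RecordContext ctx) {
            List<String> row = new ArrayList<>();
            for (CSVParser.FieldContext f : ctx.field()) {   // every field matched by this record
                row.add(f.getText());
            }
            rows.add(row);
        }
    }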
Yeah, it is weird; I'm trying to teach my students that you're just not going to launch a server by hitting Run in Eclipse. Sometimes you have to look through a little hole across the net to a server. But anyway. Okay, so we've got a bunch of this data, and I'm telling it to read that in and print out a bunch of crap, and this gives me a parse tree: the overall thing is a file, then I have multiple records, and in each record there are multiple fields. I paid a really smart German guy to implement an algorithm invented by an even smarter German guy that does optimal tree layout, so it's much more compact; well, on this one it's not as obvious, but it does a really good job if you have a complicated tree. Yeah, not mine, no. One could imagine... one could imagine no way to do that. I said he was smart; I didn't say he was Einstein. It also prints out (you can see I told it to print out all the tokens) the index of the token, the character positions, the text, the token type; no, that's the index, those are the character indexes, and that's the line and character position. And then I told it to print out the tree in LISP format. So you can get all kinds of stuff from this thing.

There's also this command-line tool; let me see if it's actually set up... yes, it is. So let's see, what have I got? I've got this grammar thingy here, CSV.g4, which built all this stuff you see sitting here, and now, just by compiling it (I don't have any main program or anything), I can say: grun, the CSV grammar, starting at rule file, match input.csv. One more time, let's see... hmm, okay, let's see what happens. Ah, it's probably the grun alias again; I can fix that. You'd think I would have tried this earlier. I think that's going to be okay... there we go. Okay, that doesn't print anything, because I didn't tell it to do anything, but I didn't get any syntax errors. Then I can say print out the tokens, and it gives me that; I can say print out the tree, and it gives me that; and I can say show me the GUI tree, the German guy's tree, and there it is. So there's lots and lots of stuff.

Yeah, sure; I could name that whitespace rule anything I want, since I never refer to it. That's right, because all of these lexical rules are kind of special. The lexer says, okay, I've got a whole bunch of rules; figure out which one matches at the current moment, and if it sees the whitespace it goes, hey, I matched the whitespace, and then it just throws it away. Now, what if you had a rule that was whitespace followed by some other stuff? Both are then considered: it speculates that both might match, and then it finds that one will match more characters, so it matches the longest pattern. So if you had whitespace followed by, say, a bang or something (like Python, where whitespace is meaningful), there are reasons why you might actually not tell it to skip whitespace. But yeah, it's just a bunch of rules that end up getting matched, and the tokens get sent to the parser.

So how does it know that this rule isn't just global, applied any time you see it, ever? What happens is, imagine there's a big decision that happens when you say "give me the next word in the input," and then it starts into a rule; but once it gets into a rule, it follows only the grammatical structure within that rule. So I could match whitespace explicitly in here, I could do anything I want in here, and this is not affected.
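What the test rig's -tokens and -tree flags print can also be obtained programmatically. A small hedged sketch, again assuming generated CSVLexer and CSVParser classes and a start rule named file.

    import org.antlr.v4.runtime.*;

    public class DumpCSV {
        public static void main(String[] args) throws Exception {
            CharStream input = CharStreams.fromFileName(args[0]);
            CSVLexer lexer = new CSVLexer(input);
            CommonTokenStream tokens = new CommonTokenStream(lexer);
            tokens.fill();                                   // run the lexer to EOF
            for (Token t : tokens.getTokens()) {
                System.out.println(t);                       // roughly what -tokens shows
            }
            CSVParser parser = new CSVParser(tokens);
            ParserRuleContext tree = parser.file();          // start rule
            System.out.println(tree.toStringTree(parser));   // LISP-style tree, like -tree
        }
    }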
Okay, and I don't know if it's working in this thing, but Sam has this select-an-input-file... okay, I didn't really want to do that, did I? Oh, okay, there's the input, and where's my... okay. So this is a little tokenizer-debugger thingy that Sam put together. It shows you all the tokens and how they were matched, and it can tell you where each token was matched in here. Some of you won't always need that, but if you've got a really complicated lexer, for, say, a wiki, which is just super filthy, horrible stuff (oh yeah, star means bold, unless it doesn't; "four times five" ain't bold, man, but you still have to match it), that's when this will come into play.

Okay, meanwhile, back at our little talk here. Oh yeah: if you wanted to use the parse tree visitor thingy, every time it entered one of those records, one of the lines, it could give you the parse tree node associated with that, and then you could pull all the fields out of it and add them to some list if you want. So that's another example of a listener. Is there... what's that? Okay.

So what is all this adaptive LL(*) crap? The previous version of ANTLR looked at your grammar and tried to figure out, ahead of time, how it was going to make decisions. When we just have a grammar, it shows you all the possible sentences in a language; but when we have to implement a parser, we put restrictions on the grammar so that it actually fits our model of the world. So you have the top-down parsers and the bottom-up parsers; you have yacc's LALR(1), and all those little parameters mean something, and LL(k): all these things mean we cannot handle all grammars. Then people came up with things like the Earley parser and Tomita's parser, GLR, which you may have heard of: it takes any grammar you give it, but it can be quite slow and you can't debug it; it gives you a bunch of integers in a table, multiple tables, a very complicated thing. Whereas everybody gets recursive descent: we all had to build one in undergrad or whatever; we all know how to build a little parser. And that's exactly what adaptive LL(*) is: it's exactly what you build by hand, except it's not LL(1), meaning it doesn't say "switch on token type and that's how I make decisions." A decision is like: you're in an expression, and is it an integer, is it a string, is it some other kind of expression? You're making a decision every time you have alternatives. It's like you're in a maze and you come to a fork; your goal is to find a path to the exit. If you only have k symbols of lookahead, you can only look so far down this path and that one to make a decision about which way to go. What LL(*) does is say: ah, the hell with it, pretend we have little sub-parsers and we're going to let them race off and figure out which alternative path through the maze is eventually going to win. It's like you're going to the movies with your little brother and there are two really long lines, and you can't really see which line is for which movie: one of them is for Borat (high five!) and the other one is for The Bodyguard. I don't know about you, but I'm not seeing The Bodyguard; I'm going to Borat, and I definitely don't want to get in the wrong line. So what you do is send your little brother racing ahead to figure out what's actually at the end of the line. Now, of course, you've got to be efficient, so you remember: oh yeah, the next time I come here, the left line is Borat. Very nice, right?
So that's really efficient: you just get in the left line; you don't have to race ahead again. Okay, so it is exactly what you would expect of a top-down parser generator, except it has this magic function that says "do what I need you to do right now"; you have to mimic backtracking without actually having its problems. Here is a problem: if ANTLR 3 were forced to backtrack across this, or if you've used a PEG before, it basically says "choose the first alternative that matches." Notice what it didn't say: it didn't say "choose the path that eventually leads to the exit of the maze," in other words the path that actually matches all the input; it said "choose the alternative that matches the current input." So when it finishes here it says, oh hey, I found an identifier: identifier, done. It doesn't match (excuse me, it doesn't check) all the way to the end, so this second alternative is dead code. If your input is, say, "x++", that alternative will never match in a PEG, and if ANTLR were forced to backtrack in this situation, that's also what it would do. That really pisses me off. So, don't worry about how, but basically it does an initial race ahead only as far as it needs to, and then records what it found, so that the next time it comes through it can immediately make a decision without any kind of complexity. Once this warms up, it's kind of like a JIT; once it warms up, like a just-in-time compiler, it turns out that the new version of ANTLR, at least on the Java grammar, is faster than the old one by a little bit. And we have an experimental version that Sam built: you can't read the code, but goddamn is it fast. That's why I kept my version, the one I can read, but his is just crushingly fast. So it takes a little bit to warm up, but once it does... you could even train a parser, for example, on 500 gigabytes of Java code and then serialize the internal data structures into the generated parser; that's a possibility. But anyway.

Okay, so the cool thing is it always gets the right answer. You can give it any damn grammar you want, and, as we'll see in a second, it can even take left-recursive rules. I take it you all know the honey badger: honey badger takes what it wants; it doesn't give a damn. That has got to be the world's best video; you owe it to yourself to take a picture of that URL, take it home, and try it out. We'd be here for 20 minutes watching it ten times; I can't start it up here. So there's only one restriction. This is what we call left recursion: it's like f immediately calls f, and you'd wait a long time for that function f to return, right? It just makes no sense in our recursive-descent world. But I do this grammar transformation: I can always get rid of left recursion, and it always uses a notion of precedence so that I can do things like expressions properly. It assumes that the alternative specified first has higher precedence, so multiply and divide have higher precedence than this one, and so on, and it mimics this long chain of rules; if you've done grammars before, or hand-built parsers, you know there's this long chain of stuff. In the Java grammar I went from 172 lines to 91, and now all the expressions fit in one rule. Yep, it does not do indirect left recursion; that is the only caveat, and it is an engineering compromise, because if I can get away with that small caveat, then I go from this horrible n-cubed thing down to just quasi-linear. So it was a good trade-off.
And I look at this and I go: oh, that smells very much like a left-recursive rule. So as long as you follow this basic pattern (it's a binary operator, a ternary operator, a unary operator, or something else, which more or less covers it), then I know how to do the rewrite automagically, so that this becomes a standard grammar underneath the covers, and it uses predicates to check precedence. If you've got 3 + 4 * 5, you want the multiply to happen first, right? Well, the code I generate actually has predicates in there to check what the precedence of the current operator and the next operator are, so that it does the right thing, because this is a horribly ambiguous grammar. A little kid starts out and says: 3 plus 4 times 5, why not do the plus first? Actually, I don't know why not, but that's the way we do it. So this is a huge win; it's a much, much more natural way to do it, and the cool thing is the parse tree looks exactly like what this flat rule would generate, as opposed to something like this, where just to match an integer you have to go down like 20 levels of precedence, making 20 function calls. And guess what: "x = 5" is the most common expression you're going to see, so this will match it instantly.

So these grammars: you can give it pretty much whatever you want. There's a little bit of a wrinkle with that left-recursion thingy, but you can pretty much give it what you want. And it does full context, which is most likely meaningless to you, but: normally parsers make a decision based on "where am I and what are the next characters or tokens"; ANTLR 4 can also take into consideration the call stack of how you got there. So if I'm in a method declaration and I'm trying to decide whether I'm going to match a method body or just a semicolon, I can actually check the stack to see if I'm in an interface definition or a class definition. That is technically exponentially complex, but, you know, engineering: in the real world, most of the time it ain't so, and it works out great. And I automatically build trees for you, I automatically generate the infrastructure that walks all that stuff, and since we've gotten the actions out of the grammar, you really can get a grammar that you can reuse. It's sort of the holy grail, right? You don't have to go, "oh, I found a yacc grammar for Python, but it's got all these weird actions I've got to cut out, and all kinds of weird special-case stuff." The grammar is now clean. We can generate in different languages and reuse it for different applications; even in the same application you might use a Java grammar in one place to translate, say for refactoring to get interfaces, and in another to execute little bits of code, things like that. All of that can happen from just one grammar. And then, of course, the shameless plug for the extremely expensive twenty-something-dollar book that all the cool kids are buying, as far as I understand it. And, well, that's it. Questions? Okay.
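To illustrate the precedence trick mentioned a moment ago (the code generated for a left-recursive expression rule compares the precedence of the operator it is sitting on against the next one, so 3 + 4 * 5 groups the multiplication first), here is a tiny hand-written precedence-climbing parser over integers, +, and *. It is only a conceptual sketch of the idea, not what ANTLR actually generates.

    // Conceptual sketch of precedence climbing for "3 + 4 * 5"-style input.
    public class PrecClimb {
        private final String[] toks;   // pre-split tokens, e.g. {"3","+","4","*","5"}
        private int i = 0;

        PrecClimb(String[] toks) { this.toks = toks; }

        private static int prec(String op) { return op.equals("*") ? 2 : 1; }

        // Parse an expression whose operators all have precedence >= minPrec.
        int expr(int minPrec) {
            int left = Integer.parseInt(toks[i++]);           // primary: an integer
            while (i < toks.length && prec(toks[i]) >= minPrec) {
                String op = toks[i++];
                // For a left-associative operator, the right operand may only
                // contain strictly higher-precedence operators.
                int right = expr(prec(op) + 1);
                left = op.equals("*") ? left * right : left + right;
            }
            return left;
        }

        public static void main(String[] args) {
            int v = new PrecClimb(new String[]{"3", "+", "4", "*", "5"}).expr(1);
            System.out.println(v);   // prints 23, because * binds tighter than +
        }
    }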
Oh yeah, what does star mean, right? What does star mean, is it bold or not? And you've been pulling Wikipedia text out with wikiextractor, that Python program? Indeed it is; I can confirm that you are not crazy, sir, at least in this area; I can't speak outside of it. But, you know, I'm doing some statistics now, so I'm learning not to extrapolate. What you've described is sort of modes, right? It's like XML: am I inside a tag or outside the tag? It's very similar to that, except more complicated, because you can have nesting; the Wikipedia stuff has nested double brackets, nested crap, right? And I'm unfortunately intimately familiar with this, because literally just yesterday I was working with some students doing a bunch of Wikipedia parsing, and somebody actually built an ANTLR 3 grammar for it that I started to look at, and I'm like, you know what, this is just as bad as all those regular expressions. But wait, there's something new. It's not my idea; it's an old idea: lexical modes. You can define, effectively, sub-lexers, and what you do is, if you're in the tag mode, you switch to a different mode, and then those rules apply only in that context. What we've done is add a stack to that, which effectively gives you the power of a pushdown automaton, which makes you as powerful as the parser. Yes, which is what you kind of need. So in ANTLR 4 you can have these modes, and you can say "switch to the inside-tag mode," or you can say "push mode" and later "pop mode," all without special actions; you don't have to put special actions in that grammar.

And of course we've got, let's see... oh, we can even do some really crazy fuzzy parsing, like ignore everything inside of a method and match the stuff outside. Now, I'm not saying it's going to be super fast, but you can do some really crazy fuzzy parsing with this non-greedy operator, because we even have the non-greedy operator in the parser, so you can just scarf up the entire middle of a method if you want to ignore it. Anyway, there's some fun stuff in this new version, and I think I've finally got it figured out for most cases; I think it's going to work out.

What's that? Why this version, and will there be another? Because I can't do another one: no v5, no ANTLR IV Part III. When I started version 4, I actually started by fixing version 3, because I'm like, oh my god, this is a mess and I can't leave a mess for people to carry on with. So I started kind of fixing it, and then I'm like, oh, wouldn't it be interesting... and then all of a sudden this new adaptive LL(*) thing came around, because what I really want to do is leave behind a parser generator where you can just give it whatever you want, more or less, and have it be fast enough and useful. There are a couple of other things I want to do, but I don't know if I've got the stomach for it; we'll see. But I'm going to keep maintaining the code, and I'm doing an academic paper on the parsing algorithm, and then I've got to get back to the eighties and back to the machine learning stuff. So, yeah, any other questions, comments, suggestions?

No, it is always deterministic and exact. Imagine... okay, so everybody gets the decision issue, right? The analogy is: you come to a fork in a maze and you've got three or four paths in front of you, and you've got to figure out which one leads you to the exit. A maze is an amazing analogy for parsing. Sorry; I've got a few more of those, stick around, those are gold, man, gold. So imagine parsing as analogous to a maze: you've got a passphrase written on your hand and there are words on the floor of the maze; you match the passphrase up against the words on the ground, and that tells you which path to take. But what if you come to a fork and the next word is, say, "dog," but every path starts with "dog" on it, right?
Oh, now that's got me intrigued, ha ha. All right, you tell me later. Okay. So what you need is... the first symbol of lookahead, we would say, does not tell you which path to take. So then you turn up your flashlight and you go, oh, look at the second word ahead, and maybe that'll tell you. But ultimately, in the worst case, you might actually have to send your kid brother running through the maze to see if it reaches the exit. So the concept is: for every alternative path, launch a sub-parser, like a thread or something, just launch it, and eventually they'll come to dead ends and die off, and eventually a single path, a single thread, will have reached the end, and you know you have success. But of course that's pretty inefficient, because what if you get ten words ahead and everybody dies off but one alternative? Then you already know which one to predict. Remember, we're not parsing, we're predicting, because then we're going to resume the regular parse process and use the thing you can step through with the debugger, the normal parser. So the trick is pseudo-parallelism: in lockstep, have all of these sub-parsers march ahead, and eventually they're all going to die except one (ignore for now the case where more than one survives), and that's your prediction process. The alternative associated with the sub-parser that is the sole survivor: that is the prediction you make. So if you're at an expression and the only sub-parser that survives is the one associated with a function call, then you know that's the one to predict.

Now, remember the fact that the last time you saw that exact sequence, at that location, with that exact call stack (because maybe you were called from an assignment, which was called from a method, which was called from a class): you record all of that crap and you freeze-dry it, and the next time you come back you say, oh, is it the same stack? Am I in the same spot? Do I have the same, like, ID, left parenthesis, right parenthesis? I do? Oh, I know what the answer is. So just think of a little map: a map from input sequence to predicted alternative, at every spot. The next time I come through, it's a hash-table lookup; I don't have to run these sub-parsers. So you kind of take it in the shorts the first time you hit a decision, but then you've recorded all this information, and the next time you come back it's really fast. That's why it warms up; it warms up like a just-in-time compiler. You're caching all of the hard work into a data structure, and that's how you get it to go fast in the long term.

Well, it could get exponentially big, so it turns out you don't actually record the call stack, because that's one of your biggest offenders; you only do that if you need to. There's a hybrid approach. You start out with a quick LL(1) pass over the grammar statically, so when you actually generate the parser you figure out which decisions require only one symbol of lookahead and just generate a switch for those in the generated code; you don't even consider them in the analysis at runtime. And the vast majority of decisions are LL(1), LL(2), or LL(3); just a few symbols of lookahead will get you there. So most of the time I'm recording little one-, two-, and three-token sequences, and I'm doing that in a DFA, so it's optimal in speed and compression, and I don't record the stack.
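The caching being described can be pictured as a memo table keyed by the decision point plus the few lookahead tokens that were needed: the first time you pay for the sub-parser race, and afterwards it is a hash lookup. This is a conceptual illustration only, not ANTLR's actual DFA machinery; runSubParsers() in the usage comment is a stand-in for the full simulation.

    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;
    import java.util.function.Supplier;

    // Toy memo: (decision point, lookahead snippet) -> predicted alternative number.
    public class PredictionCache {
        private final Map<String, Integer> memo = new HashMap<>();

        public int predict(int decision, List<String> lookahead, Supplier<Integer> simulate) {
            String key = decision + ":" + String.join(" ", lookahead);
            // First time: pay for the sub-parser race; afterwards: a hash lookup.
            return memo.computeIfAbsent(key, k -> simulate.get());
        }
    }
    // Usage sketch: cache.predict(3, List.of("ID", "(", ")"), () -> runSubParsers());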
However, if I find an ambiguity (more than one sub-parser succeeds), then I'm like, bro, I don't know which one to take. At that point I don't know if it's a true ambiguity in your grammar or a weakness in the algorithm that doesn't consider the call stack, so I must fail over and restart with the full meal deal, where I do take the call stack into consideration. In my version of the algorithm I don't actually record the call stack at that point, so I would interpret it again the next time. But the cool thing is, it's almost never the case that you need that. It's like a security blanket: if you need it, you've got it, and, with a lot of caveats, in general I will never try to store exponentially large data structures, so you can rest assured that that's not going to be the problem. Sam's algorithm does something really interesting: in the DFA, he doesn't say "okay, this stack plus this input matches this prediction"; he puts that stack onto the end of the DFA. So if you match ID, left parenthesis, ID, right parenthesis, he then says, oh, and by the way, this is the stack: expression, statement, class, and hooks it into the DFA. It turns out there's a lot of merging and stuff that goes on, and yeah, it can be done very efficiently, but that code is really hard to read.

So, let's see: there's a complicated sequence of hybrids. I try the simplest possible algorithm that works, and when that fails I keep ratcheting up the sophistication until eventually I hit the maximum. At that point (okay, another caveat) I ratchet it up to the full power of this thing, and I can always tell you whether there's an ambiguity; I never miss an ambiguity. So if, like in C++, "T(i)" could mean five different things, the parser will tell you that: it notifies a listener, and you can ignore it if you want, or whatever. So I've traded away static grammar analysis, where you get warnings; the other thing is, with ANTLR 4 you don't get any warnings at all, right? You run the grammar through and it goes "okay," and at runtime, if there is an ambiguity, it notifies a listener. It's kind of like a statically versus dynamically typed language, except it turns out that the thing I'm doing at runtime is undecidable beforehand, and that's why this is much more powerful: I've shifted everything over. I literally take a copy of your grammar, turn it into an internal graph data structure, serialize it with your parser, and then do the grammar analysis on the fly.

So, maybe any other quick questions, and then we can get Dietrich up here? Okay, yeah, well, we can talk offline, but Dietrich, we should probably get you started... oh, okay, good. Ah, no, that's a good question. The simple answer: you go as high as you can, and if still nothing succeeds, then you have a no-viable-alternative situation, and it's a syntax error. So the question is what you tell the user, and, let's see, what do I do at the moment: I think I basically say "here's as far as I got" in the furthest case and report that as a syntax error. In some cases I can be smarter, because if I eventually get a syntax error but I went through a bunch of rules successfully to get there, then I can let it continue and eventually fail closer to the actual error. And yeah, the error handling is actually better in this version in a number of ways.
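The runtime ambiguity reporting he mentions is exposed through the parser's error-listener hooks in the ANTLR 4 Java runtime. A minimal sketch; MyLexer, MyParser, and startRule stand in for whatever your grammar generates.

    import org.antlr.v4.runtime.*;
    import org.antlr.v4.runtime.atn.PredictionMode;

    public class AmbiguityDemo {
        public static void main(String[] args) throws Exception {
            MyLexer lexer = new MyLexer(CharStreams.fromFileName(args[0]));
            MyParser parser = new MyParser(new CommonTokenStream(lexer));
            parser.addErrorListener(new DiagnosticErrorListener());   // get notified about ambiguities
            // Exact full-context ambiguity detection, the "full meal deal":
            parser.getInterpreter().setPredictionMode(PredictionMode.LL_EXACT_AMBIG_DETECTION);
            parser.startRule();   // hypothetical start rule
        }
    }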
Right, in general you would want to have something on the front end that chunkified it. For a protocol like NNTP or FTP, all these things are going to be line-based, so that's easy enough: you can break it at the line level. You can't always do that, in which case I couldn't build the tree for you, but as long as you can give me a finite chunk, and usually that's the case, you're fine. It's rare that you have to see all the way to the end of the universe to really decide what the errors are. All right then, I guess we'll call it quits and move on. [Applause]
Info
Channel: PragProg
Views: 66,209
Rating: 4.8899674 out of 5
Keywords: terence, parr, antlr4, 640x360
Id: q8p1voEiu8Q
Length: 59min 17sec (3557 seconds)
Published: Thu Feb 14 2013