How to write a parser in C++ (Part 1)

Video Statistics and Information

Video
Captions Word Cloud
Reddit Comments
Captions
all right um good that's working uh so i have about two hours to sit down and do a little bit of a tutorial so i thought i'd just take the opportunity because recently someone asked how one does a um a writes a parser in c plus plus and the example was a c like language which you know is any language um that you uh you know like which is a good start for a language so um i've created a new project i'm not going to go into how to compile a c plus project or how that how to do that with your ide i will be creating a bunch of files that will show up somewhere in the left like you know one file per class or something like that um that's about it that you need to know about this ide this is a standard hello world thing um and so the first thing we will do is we will create a file just an empty file like a test file so we'll call this i don't know um test dot my c or something like that and just a test file is fine okay and so in this we will write the things that we want our programming language to parse so for example we probably wanted to parse um well let's leave away include statements for now because those are something else let's just go for a regular program so let's say void main void curly bracket printf hello it has been person d years since i learned c um backslash n and then we'll go and put a little equation in there as well so let's say uh current year minus i think 1995 or something i learned c something like that um so this is actually something that might run as a c program if you didn't mention the include statements and things like that so it's a good example to start with so that's what we want to parse so how do we do this well first um we need to read this program word by word or rather character by character because that's how we get it from the file system and since we sometimes might want to look ahead a little bit to you know like if we see void main then the question is okay we know void is a type um although and then we see main and we go okay this is some identifier so probably a name of something that is this type but now we want to know is it a variable or is it a function and so we need to look at the next letter and decide if it's an opening bracket then it's um you know the start of a function um and if it's a semicolon then it's actually the end of the line and it's probably a declaration for some sort of global variable or something um so we need to do this um and we might you know actually have to look ahead a little more for example if we see the word long then of course next we will want to check is it long int or is it long long int or is it just long which is short for long end but also permitted things like that so we might look ahead and then decide no we don't actually need this other token and so if we scanned over each character each time and then you know um repeated that you know over and over for each thing that we want to test then that would be a bit slow and a bit annoying and so what we are going to do first is we're going to break up this script into like the individual parts so that means identifiers like void and main operators like the opening bracket another identifier whoops undo you know and you know operators like the opening curly bracket closing curly bracket that thing and of course strings so a string for us we don't really care at this point where we write the parser what is inside a string all we need to know it's a sequence of characters um so um and we just store that away somewhere and then we pass it to whatever function and in you know a you know traditional c compiler a string passed to printf is just that a string the compiler knows nothing of course in modern compilers we actually have usually special code that looks at the percent sequences in a printf and can provide you with additional error messages but you know that's not the common case so we'll ignore this that's just some extra fanciness that you can add later if you want um but so that means so so what we'll build is an array of um these individual um identifiers and token and those are called tokens so like a token is an identifier an operator a string literal a number literal um and even you know like a floating point number would be another literal um maybe that's something else uh let's do um double foo equals 1.5 just so we have that as well so we will build an array of tokens that are equivalent to our script and then we'll keep working with that so the first thing that we want to create is ac plus plus class that we'll call tokenizer all right and let's just prepare everything so i usually use pragma once because all the compilers i use support that and you don't need to do the hash if def and some compilers can be a lot smarter about pragma once than they can be about you know with if you use the hash if include guard then uh that usually means um it it doesn't really know that this is an include card so it has to run the whole preprocessor over the entire file whereas if you use pragma once the compiler knows okay you just want me to only include this file once if it's included a second time we'll use just to be very clean we use a namespace this project is called simple parser so that's what we'll call our namespace and then we put our tokenizer class into oops i just lost all right luckily undo let me recover okay so give it a public method so what the tokenizer does is it um has a pars method so we include vector because that's our list to which we can add stuff um and then we say wait inside the namespace using namespace standard so we don't have to type as much and then we can just by the way you may notice the the main advantage of this namespace beyond just having a namespace is that it gives us a scope in which we can use using namespace standard so everyone who includes tokenizer.hpp doesn't get our using namespace standard leaked into their own code so that's one of the reasons why i'll be using the namespace here and of course no we don't need to do that here since it's already a namespace simple parser so we say vector and now what we need of course is a class to represent each token of which we can have a list we'll fill that out in a second let's just write okay token parse um const string in program all right so that's um the method we're going to implement in a moment of course we also need to i think we also need to include string to get the string class [Music] and now what are we going to do with the token well for one thing the token needs a type so we have different kinds of tokens so we say enum token type um and one type is so here's the thing we're going to use this for the parser itself so um one token type will be white space [Music] which you know is not something we will ever create a token for um but it's a token type nonetheless um just because we'll be using it during parsing kind of as the unset state for our token you'll see that in a second if that isn't clear yet but the actual token types for which we will create tokens is of course identifier integer literal string literal and now we have several choices um for right now i'll just go with an operator token you can also of course do this more fine-grained or what i've done sometimes is i've done a token type and then a token subtype so i can say here this token needs to be an identifier and then the identifier subtypes would be things like um you know like the for the if um the do identifier using namespace kind of like identifiers that you might have in your programming language um or the same you can do for operators you know you can have operator um you know like and then have you know like curly bracket as a separate operator um then regular parenthesis um so that one here opening parenthesis or um you know semicolon would be a different operator or the ampersand would be a different one with operators um you could also use a character constant here because usually they are one character but of course as soon as you go to less than equal or something even that doesn't make sense anyway so um you can do uh lots of uh things here you can also have like separate types for like you know like you can have a type for curly bracket because it usually opens a scope you know like so you could have scope open operator scope close operator for those you could have you know like um ampersand and other operators um under generic operator and then have semicolon separate because it's usually not you know that's fine do whatever you think makes sense right now i'll just go for operator maybe i'll reconsider later we'll see um but anyway so and then we have a token of course that has a token type um then we have a string with a token text so we'll just put a copy of our tokens text in there for now that that makes it easier to debug because you if you have a token you see what its original text was you can also do fun things like have [Music] you know and for strings of course we need it but um you know you can also just put the operators in there and then compare the string once you know it's an operator or something like that if you want to um so that's why and what i usually do for error reporting is um i put the start offset and end offset of a token in the token and you can even go so far and put um the line number in there if you want to um since we're in c plus plus let's initialize them all here and um let's see start out with white space which is basically invalid um okay so now we have a token and we have a tokenizer with a function parse um now we'll create a definition for that function my ide kind of uh generates a stub for me which is a little verbose so we go with namespace simple parser again because then we can shorten this a little to make it more readable using namespace standard all right and now we can never get rid of the tokenizer but the rest we can get rid of okay so we have a tokens list here that we will fill and we will return that list so what does this function do well this function just loops over the text in the program and we build a basically we're doing a standard state machine state machine sounds scary don't worry if you have never heard the term or you know have only heard it in complicated contexts basically a state machine is kind of a way to save yourself from writing a lot of booleans so usually you have a boolean you know for like if you read text and you encounter two slashes for instance then you would say okay two slashes that means it's the start of the comment of a comment and then um you kind of would have a boolean somewhere saying i'm in a comment now so that you just ignore any text that comes there even if it's an opening bracket even you know whatever it looks um and kind of to keep that state that's what we do um for that we will abuse our token because we're using tokens anyway so we'll just have a current token variable and that is empty right now and then we're going to loop over our um so current character in in program and now we just switch over the character now there is one thing i should say here um this assumes that our program is like c basically pure ascii this doesn't really handle unicode because you know if you wanted certain unicode characters to be valid you know for example if instead of exclamation mark equal you wanted to use the not equals letter the you know the equal sign with the strikethrough um that is like two bytes so two cares when encoded as utf-8 so if you wanted that to work then you would have to write this code slightly differently because you would have to recognize that this is a multibyte sequence unicode letter and extract it sadly there is not much unicode support in um in uh in c plus right now in the standard library there is a u8 string class that is supposed to be for that but it doesn't really do anything right now um so that's why right now this is all just ascii and that's it and now we just look at the characters so we go for example um case zero case one so you see that's that's character constants so that's just that character that was switching over of course um i'm trying to remember um i don't actually i'll just do it manually for now i think there was some syntax like 0 to 9 or something like that but i'm not sure if that's i might be mixing that up with with pascal or something so um let's just do each one explicit six seven we need two more eight nine of course do a break and so now here um know it's a number and now we look at the current token so if current token type equals white space which is you know like the normal case um then let me change the current token type to [Music] integer literal and current token m text append the character that should work right why is it saying it's not happy oh is it one and character yeah count okay all right uh so this um shows me the la the name of the variable here so i can actually tell whether i got it the right way around you know because it might be character or count or count character so that's just my ide showing me some text extra that's why it's on a lighter background and in grey that's not actually code i've written here just if you see that just it's a help from the ide and otherwise for now we'll just append it as well that means that you know if we're not in a starting an integer oh wait of course we can do that for uh where is it for integer literal as well no actually we don't have to okay so um so all we do is if it's white space and it's one of those numbers then we make an integer then we start an integer literal token and append that character otherwise we just append that character because whatever else it is it might be happy there is one more case we have to cover here um or no wait there isn't yeah right so if it's an identifier for instance that can continue with numbers but it can never start with the parser is why that rule exists there um all right so what else do we want to say we want to do some operators so we say you know like what do we need curly brackets now uh one thing i'll do here just for practicality is i'll put those near each other because not all uh co source code editors are that good um at uh recognizing you know that those brackets are just in strings and um if you just um if you put them in pairs then you know it will be confused only from here to here and that will still look like a complete case statement and things so um oops um what else do we want for our source file let's see what does it show we need an equal sign we need a semicolon and we need a minus sign equal sign semicolon minus sign that's our only operator so you see i'm not treating certain other characters equal sign minus sign semicolon oh right we need a comma as well and so here the the thing that you do here is um you do something similar um but you just say if token type is not string literal because like an operator even when it's directly after an integer literal um will uh you know will um end that token so we'll say let's just make an end token current token we'll write that call in a second just because we'll be doing it a lot um so we set the type to operator and append that character and then immediately end token again and otherwise like if we're in a string literal or something we'll just append it as we did in the others other cases um so now what um are we doing create new function and token yes create even created it nice um so my ide is really nice here i haven't used this ide much and it's it's nice to be able to use all these features that i can't use in some other projects in the day job okay so what does end token what is it supposed to do well for the first thing all right i forgot an argument um we also need to give it the token list i'm thinking it might actually be a good idea to use let's leave it like that we could make these instance variables so let's say vector [Music] token tokens and that needs to be a reference and of course token needs to be one as well the reason is of course that we plan to modify it um same here and those are not part of the public api so we'll make that end token function private um that's just because we only use it internally and people you know who want to use our class shouldn't be confused by it all right so what do we do um so basically we just say if token type is not white space because we don't want to generate any tokens for our white space because you know in c apart from splitting apart certain identifiers white space is not really relevant in c proper at least if you go and try to re-implement the preprocessor then of course whether there is larger white space or not makes a difference in in like hash define you know you say define foo bar it the i think the white space is actually relevant so so there are some some cases where the c preprocessor actually works differently but we're not bothering off about this we're building a simple c parser for now without the preprocessor um so yeah so and rather we're just building a tokenizer right now so what do we do um if it's not white swiss because if it's white space then there's nothing you know we want to enqueue otherwise tokens push back token so we add a new token for this and then we say tokens token type equals white space token type no text um erase because um you know we we append to this string so we we need it to be clear again um yeah that's pretty much what we have right now um i think i'll fill in the offsets later or you can do that as an exercise or something we'll see how we do that um so what else do we look at of course we need to handle um white space which could be a space just to clarify so what does this do this you know puts our finished token if there is one in the token list and then resets the token to a you know clean state basically um so that we can start a new token with it that's all this is supposed to do in token and then we support all sorts of white space characters so backslash r and backslash n backslash n is of course the unix line break backslash r is used in very old mac os programs or backslash r backslash n is the combination used on windows um in windows native files you can kind of you know be a little clever about handling them by handling backslash n kind of specially um you know just checking if you're a backslash n um and your current token is a line break um that had a backslash r then you just don't count the backslash n as a line break because you've already covered it for the backslash r um but if it's not then the backslash n gets counted you could do something like that i'm not going to do this right now but what we can do is current token um line number um just add one to it for now wait uh of course not for the white space and not for the tab and so what what do we do with this well with white space all we really do is we end token oh it's current token and with this one we end token and increment the line number so that everything after it has a different line number and of course um we need to initialize the line number to one because they're generally one based you don't generally have a line zero um because this is something that you show to users after all like with offsets you can do it differently um okay so this way we already count line numbers which is kind of nice for error reporting um and otherwise we just end tokens um which is fine uh what else do we want to do okay and now um we do the default case um which we'll just assume is identifiers for now if you want it to be stricter you know you could go um and um you know do this um like filter out that it's only a2z in uppercase or lowercase but really this way by just saying the default case you can actually get away with using unicode because everything that's not an operator or white space or a line break gets automatically treated as a as part of an identifier all right of course we can't forget uh almost forgot that we have to do string of course as well so this is a little more clever because the start of a string and the end of the string use the same character [Music] so what you do is basically something similar as up here which is in the case of the quote is if the current token is not string literal then we end token just to be sure you know like if there you know was an operator or a whatever um or a number or something that immediately got followed by a quote then this will work and we'll make sure that we don't lose that token and then we set it to string and append that character else if um [Music] if oops what is it current token type equal string literal then we end the string literal token here and we just ignore the quote character so that way we've already removed the quote characters so there's of course one more case that now we haven't done yet which we can do which is what do we do with a backslash i'm writing two backslashes here for the simple reason that you know one backslash would be an escape sequence in the character literal so that's why i need two and here of course what we do is again depending on the token type if the token type is string literal then we treat this character specially otherwise we just you know take the backslash and put it um um yeah the thing is what do you do with a loose backslash in your code um usually what it does is tell you to ignore the next character i guess i don't even remember what a backslash outside of a string does in c like in the preprocessor it's used to ignore line breaks um so in this case what we should probably do is um we should end token and just make it an operator and then do something with it when we encounter it or should we yeah i don't really know what it does so i guess i guess for now we'll just um uh we'll just do it as a operator and then we can always see type equals operator and then we've ended that token so right now we only support single character operators um there are different cases how you can handle this like you could glue together certain operators that you know but if you remember in early c plus plus if you wrote a template so like if you wrote a template of templates so like template um foo bar of t you know something like that um or rather you know t1 and bar of t2 then you had the problem that this was parsed as a right shift operator and so um because it just you know kept took the longest possible match but it's actually two um you know greater than signs or you know end of template brackets operators so what you always had to do is put that extra space in there um so since we don't you know c plus these days doesn't have that problem anymore one way you can do this is just by saying well multi you know we only generate one character for each operator and the right shift and left shift operator is a special case when we look for an operator where we go okay if it's a greater than sign and there's another greater than sign operator behind it so we have two token operators basically um that's one way to handle it you don't have to do it like that um but it's one choice that i think i'll do for now um okay so now um of course we haven't yet handled what to do so we switch all right um so what do we do now um we don't append the backslash here of course and uh what we rather have to do is we have to invent another token type so we'll do some no token ever has this type um and now we do one that is string escape sequence so that that is the state we will go into once we've encountered a backslash so that we know the next character is that and while we're at it also uh we will need a uh one line uh wait we'll need a i'll do that in a second um okay string escape sequence so that's the state we will set and now um right um now we need to handle this case especially so one way to do this would be to make each of these characters um handle it specially but really um we can just do it here at the start of current token type because it doesn't matter what character is escaped you know we need to treat it that way so um string escape sequence um if it is that then switch the current character and then so case n means append backslash n so that's basically the same you know that our source code is using here we're re-implementing that now so we need n we need r for return we need t for tab we need um what else do we have backslash for inserting a single backslash um there are others but we won't really need them right now so we'll just add a default right now throw runtime error and write unknown escape sequence um all right we need string one current character and um let's put the backslash into the error message as well in string on line and now what is the line the line is 2 string current token m line number and put a full stop at the end all right and we need to include the exception header all right now it should be happy yes what was that whoops okay i accidentally locked myself out by using the wrong shortcut um [Music] was it this no that wasn't what i wanted jesus um code reformat file ctrl alt shift oh file okay that looks better anyway so um so that means at the start if we're in an escape sequence we handle it or we are out that's nice um here we handle numbers oh we've forgotten of course the case of floating point numbers so let's do that and there it's a little more interesting because like a floating point number if you look at it to our parser will first look like an integer and then there will be a dot and then basically another integer so every floating point number token will start its life as an integer um or i guess since the us has this crazy for zero dot they just allow dot so that case we also have to cover so either it starts with white space that's actually annoying let's i'm not sure i want to cover this case right now like you you can i guess we'll just do it um although i'm i'm kind of i don't want to spend too much time on this parser on this tokenizer i want to get to the parser so anyway so what you do is if case is white space or no wait so integer literal is a special case we have to handle so if it's a dot in white space that really should be a special case again so let's say we will add a new case potential double okay and now we say if current token type equals white space then we go into state potential double and append the dot if current token type is integer literal then we know that the dot means it's a floating point number so we go into uh do we have a state for double yet no we don't have one yet so we make a double literal okay so we set the type to double literal because we now know it will be a double but we do not end the token here so if it's white space we don't need to end the token really um if it's an integer we just morph it over that's why we don't end the token and otherwise we of course do our the same dance that we do for operators because dot is an operator in c so we say we end whatever the current token was we set the time to operator we attend a period and end that again why is it unhappy here oh because it doesn't have tokens as its parameter i forgot that all right white space double oh and of course in this case we need to handle string literal especially as just append or rather maybe we should swap these to make the else case just the append for the simple reason that there's a few no wait most types no it was the wrong way okay so in most cases we want to append to just you know to to use this as an operator so if you have an uh i don't know an identifier or whatever then you want to end that one when a period happens but if it's a string or a comment then of course you don't um all right so here we are case dot is i think covered and then one thing we need to do is at the very end so that's at the end of the loop um outside the loop before we return we need to call n tokens one last time for the simple reason that we might have you know like an identifier or something that goes up to the end of the document and that would otherwise not get close because there's no other token behind it that will end it okay and of course we need to cover the case of identifiers so everything else um if um current token type is white space or integer literal um actually that's a bit incorrect um like integers and doubles can usually have like an f for example uh lowercase or uppercase f at the end of a double to indicate it's you know a double or a float um i'm not going to cover that right now i'm too lazy um anyway integer double white space then we say end token uh it's current token again um and then we say current token type equals identifier and we append that character else um for the other ones we don't really need to end the token we just append the letter to whatever it is and so that means it either ends up inside a string or as part of an identifier so i think we should be good so that's the situation of a start so i think we're good with this function now um one thing um right we haven't no we have handled the almost the string escape sequence one thing we have to do is once we've handled it we need to set current token m type to string literal again and of course we need to not execute anything else um don't we have what is the potential double case right that we need to cover as well so if there's a number and it's white space then that turns it into an integer literal so copy paste if current token type is potential double then now that we have a number we know that it's a double so we say double literal and append that letter so we definitely know it is one now um and now we need to handle the case for everything else um so one thing we would have to handle is if we encounter a dot in potential double then of course we would need to raise an error um i'll just do that this way for now two dots i know wait um [Music] no then it would just cause an or be a dot operator at the end that sounds fine um so multiple dots are actually handled but this would be two dots in a row so i guess what we'd have to do is if we end up here with a potential double else if if token type equals potential double what do we do with a potential double so a potential double would be a number with a dot and no other number behind it so one point one dot i guess for now let's just handle it as double literal because i think that's what the compiler will do like i think it lets you say dot five and it lets you say one dot um [Music] but it doesn't let you say dot though so if it's potential double if um token m text compare dot equals zero so if this is actually a dot then we turn it into a operator and that i think is now we should have had all those cases covered and now we have one more case to cover and that is comments and again comments are a little complicated because a slash is a division operator two slashes are a comment so again we go case slash and now we need to add another fake type here potential comment and comment and comment is also a type that will really never make a token off but um we'll still um you know use it here so case slash if current token m type equals uh let's see which ones do we want to cover in general if we hit a slash anywhere so if we hit it in a string literal then we just append it else let's see um we'll need at least one more case so let's see if um current token type is potential comment so that means we've already hit one slash then that means current token m type equals comment and we will empty the text the reason for this is we will collect the slash the first slash just in case it's a solo slash you know we've appended that we will append that actually here um and set its type to current token type equals potential comment all right so the default case if we come in is we're in white space or after some other token or something but not in a string in that case the first slash we see will switch us into potential comment mode then we append that to the token and we of course end token because the comment could be right after you know a identifier um okay um so then we append this character and now yeah and so if we then hit a second slash we hit this case we're in a potential comment so we say yes it's a comment and clear the two slashes out so we collect everything else and now what is the end indicator for a comment it's a line break so we go up to here to the line break and it already ends the token and we for all other token types we collect the characters so basically at this point when we call end token we will end up here and so we'll just say um if not white space let's see so if it equals comment then we don't push back the token and really um what we'll do is just for fun we'll print out um what do we say um ignoring comment token m text oh wait like this uh saw this brackets all right and now we need to include iostream of course all right so this will make us not generate tokens for comments [Music] and then you know we'll just reset to white space at the start of the next line which is exactly what you want um because really comments cause you know like a break in any token that they are next to you know they're basically like like just the return at the end of the comment so now we have one more problem what if you know there was you know it's any other case so again i'll just do that here at the start um else if current token type equals potential comment okay and now a potential comment and cur character not slash then that's obviously my character so so what we do in this case is then it's just a slash operator we've already ended the token so current token m-type equals operator and now the single slash is already in there so that's fine um no no if it's potential comment that's fine because it's not inside a string that can never happen inside a string and type and we end token again where do we have an end token to copy so i don't have to type it here okay and so we'll end token and not execute the rest because all we want to do is add this operator okay we're good here next thing we need to do is um [Music] go okay we've used all our types that's a good sign we haven't used the start and end offsets yet never mind those um let's just delete them for now it's you know it sort of works like the line number like at the start of any token you have to write the number in there and then when you end the token you have to write the current offset there so you know okay so now we have two things one i want to do a quick debug dump or let's call it debug print function um wait let's be clean here generator definition and now say um so uh one trick that is very useful when you write a parser is you will have a lot of constants that correspond to text and so the simplest thing that you can do is um you can do s token type strings there are much more clever ways of doing this that i use everywhere but this is a good start oops uh here's the quote so there are ways with macros and stuff to make this um kind of like keep those lists in sync automatically and everything um i've blogged about it look up x macros if you're interested ctrl shift alt l whole file i don't know why it insists on uh yeah this works um all right so we have the the important thing here is that this array has the index here is zero and the value of white space is zero you know and this is one two three they get automatically assigned so you can just plug any token type in as the array index and now you can go into the debug print and say c out token and now [Music] token type strings type space and now we have the text which we should probably put at least in quotes so we can recognize it so those are as you see that they're just quotes that we print out they're not quotes in the actual text so everything will be quoted and then after that we should put um what do we want at the end the line number of course oh and we need a closing bracket okay so now we have a function that we can call that's a const function by the way all right so we have a function that we can call on each token now we can go and run our tokenizer for the first time so uh simple parser 0.1 just so we have something um we only need one endle because we don't need to flush twice here actually don't need to oh well no that's good um okay so now we need to read the file how do you read a file in c plus plus uh i have no idea let's do it like we do it in c um file handle f open and now i need to figure out what this one's path is i think we had file path uh wasn't there a copy file path or something copy absolute path that sounds good okay so now we have this path and we only need to be able to read it all right and then we go um what is it see f seek right yeah and their offset origin okay fh zero seek end that's what we want to do and that's the file size and then see set as the start so go back to the start and now we do [Music] how do we create oh yeah right we create a string file contents and now we say file size comma so i know let's use spaces okay so now we have a buffer and now we have read into that file contents i think file contents had a data element size file size file handle okay we could do some error checking here but i'm really lazy so um first what we'll do is just output this and now we need to run our program it builds and it didn't work okay so i guess we need error handling if not fh cr can't find file let's see if that's already what we need no okay so it can open the file let's just step through it with the debugger what am i even doing okay i have a file handle all right start over fck files i zero oh right this is not i'm an idiot of course f seek just goes to the location and what we need is f tell fh that should be it right yeah okay now you see it printed out our pseudo program here so that means we can read the text [Music] let's just do it this way all right and now we include our tokenizer by creating a tokenizer and a simple parser dot pars file contents and vector token tokens and now we loop over those and print them card token debug print why did i uppercase debug print by the way refactor rename oops all right that's more consistent now and so now when we run this it should show us our list of tokens okay and now you see what did it do with our program oh it didn't it wasn't quite correct but void main and then so those are operators uh identifiers that's right then in an opening bracket that's an operator again void an identifier closing bracket curly bracket operating you see there's no white space or anything between those double foo equals as an operator again it recognized 1.5 as a double literal and then okay with the string apparently we're including the opening quote which we shouldn't so i'll have to fix that one oh it actually didn't recognize the string at all that the space cut off oh we haven't covered strings in this base case i guess okay um so let's go to our tokenizer and see how we handle spaces white space in general yeah it just does n token unconditionally which of course is wrong if current token equals um string literal or om type type equal string literal or of course comment else and token and so what do we do in string literal we just append the character and line break is fine we don't really support line breaks in strings we could though if we wanted to so this is still not quite correct so we still have the opening quote in there by the way but at least we've grabbed oh no this is correct actually because we had a backslash n at the end and it actually output that as a proper token okay so yeah here of course we don't want to append the quote that was the mistake underway if not string literal not right we make it a string literal but we leave it empty now it's correct sorry and now as you see like we have here so these quotes like you see the operator is also so there are no quotes anymore around this string we've already removed them as we wanted to and we have the string and we've converted the escape sequence backslash n into a line break um now here we have 2020 minus 1995 closing bracket that looks correct pretty much all of it so now we have a list of tokens so now how do we parse this how do we write a parser so that's the next step now that we have these um let's create our parser class new c plus class parser okay pragma once all right so the parser of course has a parse function that takes a vector token tokens and names space simple parser i always hit command c instead of command v and that causes this editor to just copy nothing if nothing is selected i didn't know who thought that was a good feature and of course we needed to know about tokens so include tokenizer hpp i guess we could put token in its own header to make it a little less of a dependency but hey um general definition for function parse namespace simple parser all right you'll probably need a namespace declaration soon so might as well write it here as well as long as we're thinking of it okay so how do we do this now well first we need to think about how we could parse a program so to parse a program never forget to drink occasionally especially when you're talking a lot so we have different levels at which we can parse stuff so if you think about it in c the top level is where functions live or where global variables live so that's the first level we read at and then of course there is the level you know where like function contents live there's also the level where you know parameter lists live um all of these we'll have to deal with somehow so let's start at the top level and say how do we do this well first we can go and say um while so we need a concept of a current token so that means vector token iterator current token and that's tokens dot begin and now um while current token not tokens end so we loop over all the tokens sort of but as you see i'm using a while loopy i'm not using a for loop here because we do not always advance one token at a time you know we depending on whether it's a function or whatever you know we advance at how many tokens are make up that thing like the entire function body might get skipped in one of these loops okay so first um we'd say we need an identifier to introduce something so that would usually be um the type of whatever we're about to declare so if we had you know for example if we supported structs or something then you know the keyword struct would be something but in general in let's see it's an identifier of some sort so why not make it a little easier for us and say make a function um let's say bull um no let's say token expect identifier usually people use it something uh call it something like expect um and we can even give it a specific string um name okay and um of course that needs to be an optional so we can also signal that we don't have an identifier okay let's create this function generate definition okay so now we go oh right and we need the current token that should really be an instance variable now the other should have probably been one as well in the tokenizer but i guess we got away with it because we only had one function um okay let's put current token here okay now say if um current token oh and tokens needs to be one as well m tokens equals tokens i guess now let's say um and token that's what we really need equals tokens and because we really only need to know which when it's an end so if current token equals end token then return null opt so that means we have no match um just so we know you know we don't crash or run over the end of our array when we expect something now let's see um if i'm current token type not identifier and we also don't have a much match so we also return null opt do those two right at the start and now we know that it's an identifier token that is the next one so we say if current token text equals name then now we need to remember the token of course let's say if it's not name that's easier also return null opt and so now that this is left we say where can we copy the type here we need another iterator so we say return token equals m current token and we skip that token once we found it and return the old token all right [Music] so that's what our expect identifier can do um return token like this oh so we could probably just say token m current token all right so um now we can say here if expect identifier oh one thing um if name empty or okay so what this does is it lets us expect any identifier by passing an empty string or um lets us match the actual identifier um and now we can say here if yeah and here actually we want and so let's say token possible identifier right it's an optional token expect identifier if um possible identifier has value then we know we have a type we have a type um then we'll get the next one which should be the name let's call it possible type instead of possible identifier and then we say here possible name expect oh and we need to have a default give it an empty string empty string means match any identifier all right possible name and now here we can say um now we need the same but operators so expect operator type this operator no wait if not name empty and current token text not name that's what i want here of course i wrote it the wrong way around and so otherwise if it matches then we're okay and now um do the same thing again operator and here we say we expect a bracket we have a type we have a name possible operator i have a function okay possible name expect okay and so that means we know now we have a function so let's just print that out we have a function and say print its name which is possible name m text all right um so we will probably get fun errors or fun mistakes after the first parsing but uh i will probably get an endless loop so let's just say exit 0 here and let that run and see what happens oh we are not calling the parser aren't are we okay so parser parser pars tokens we have a function main okay so that worked we got to that point um so yeah and that's basically how the start of a parser works because you can just you know keep going there and now you know as you aggregate all these things um you can actually work with them so the problem with parsing c that you might already see here is we're pretty much blindly just running forward we know that pretty much everything starts with a type um so um one of the things that you'll end up doing is probably build a little table of um types that you have encountered like when someone defines a type um and make sure that whatever you encountered here really is a type and then if you know that it's a type or if you know you know if it's not in your list of types then in that case you know okay i um this has to be some other construct um things like that um of course you know a type is not just an identifier in c you know because you might have a type that is a template type and of course you might have to uh so so we should probably make this a more complex function so let's go back here and say um [Music] um well it's we're only doing c right so let's not go crazy with c plus types so that means we still need a we could do an um expect type function i guess and that means we need some data structure for our types so class type and now how do we model this well we could just say you know in the end well every type gets some sort of name and every type gets [Music] um and let's see what what do we do here i guess we in the end we can reduce all types to other types so let's do an enum [Music] built-in type and so let's say we have and 8 u and 8 and 16 so that's the bit length of those you and 16 things like that let's um let's maybe just do the types we will need for now so that's int and signed int that's signed and unsigned care for characters um then we have double and then we have what else do we have um of course we have struct so enum built in type equals m type and then if it's a struct it's a vector of type can we do that okay let's see if c will c plus plus will let us do that actually not sure because that's sort of a recursive definition isn't it but i think if if the size of vector doesn't depend on yeah it doesn't depend on the size of its fields i guess because that's on the heap so it should be fine okay um never mind um oh so let's say we return a type from parse type uh again let's not say pars type let's call it expect type and of course we need a map of string to type types okay and now we'll create a little constructor okay and now we'll go and say int equals type and we need a constructor for type of course oh wait can we can we just do this [Music] um name type fields equals and type equals n32 does that work no okay then we'll assume sorry um okay now let's give type the constructor it needs type const string name come out in built-in type type um name name type type okay so now we can create types if we want to and unsigned equals unsigned and and let's say care equals um and eight signed care um okay we don't really have that one yet you and it t i guess is what we can call this one and uh as we have multi character names we might need to special case that you could probably do that by getting type uh you know giving type keys that have multiple words and dealing with those somehow various things that you could do but anyway so as an example we initialize our list of types here and then we go to expect type and create a function for that and now we say um [Music] map string comma type we need to include map of course so we can look up types so the main reason why i'm using this simple naming scheme is because we will add our own types and those then we'll get um you know we'll get mapped to these structs as well so like if you declare a struct or something then you would need this table and would need to be able to add new types in there for the built-in types it wouldn't be that important um iterator found type equals m types so first we should expect identifier expect identifier if not possible type then return all opt so that means if we don't have a type have an identifier we can't have a type found types equals my types find [Music] possible type text if font type not m types end or actually if it is and types end then return null opt and else return found type second which is the actual type we found so now we have a nice expect type optional type expect type and now all right what we shouldn't forget of course is that void is a type as well all right what okay and so now if we run this again will it work no no appropriate default constructor oh right of course we we created our own constructor so built-in type type equals void just so array can create default items if it needs to okay so it still finds a function which means it found our type so if we now modify our test program and say instead of void we say foo it should say it didn't find anything yeah in fact it will endless loop here because we haven't got a case where you know we don't have a type and we don't advance in the token list in that case so we should make an error but anyway this works and now if we say int main void say we have a function again so it already knows about types now um let's switch that back to void um good uh what else do we want to do in the parser so um the thing is of course um it gets a bit unwieldy um if you do [Music] um if you do the everything here in the parser so it's totally fine to go you know to try to call pars function and put all of this code into parse function but of course in that case what you have to do is you have to backtrack so that means um this one has no value no harm no foul it didn't consume that first token because expect type um if it returns null opt nothing then of course expect type does not give you anything but so if there was no identifier here we would have already advanced past the type so in this case we need to deal with the else case and say minus minus current token similarly here where we're inside the operator we would have to backtrack by two tokens but once we do that we can just grab this entire thing and say um pool parser expect function let's say function definition because that's what we have in our code and then here we say return true wish i could type and in any other case we say return false and so now if expect function definition then i don't know do something else um i'm current token m text so cr unknown identifier current token and text and line okay so now we should have our error message so first if we run this it should still run like before no oh we should have declared can i create new function yes okay now put it in the header right yeah it edited here it's in the wrong place but okay and now if we run it okay it's unhappy and it didn't exit but let's just say m plus m current token right now so it will just skip usually you would just throw an error here or something okay um [Music] anyway and now we change our script oh wait unknown identifier it still gives that seems a little weird oh no that's the rest of the function okay yeah that makes sense so it's it successfully parses void main opening bracket and then of course it chokes on everything else that we're not reading yet and locks it as an error that's what we want basically that's good we we have achieved our beginning and now what what happens if we say foo main void now it should basically log the entire script but it doesn't huh okay so our backtracking is still wrong so let's see what would that be um expect function expect type and it would already fail here and return false wouldn't it okay let's just walk through i guess run it in the debugger and see where it's going wrong okay we're stopped here so we step into expect function definition and that does expect type and expect type expects an identifier oh that already consumed the identifier so of course we need to backtrack here as well so if not possible type that so okay and here we don't find the type so what we need to do here is a little more than just bail out minus minus current token all right now it should work stop and rerun and what's happening now yeah unknown identifier foo so it already fails really early and we don't have a function so basically now we've implemented backtracking and you know we have a basis for parsing types so what you would do now is um you know you would expand that with more function like instead of expect function definition you would write another one that checks is there a semicolon after it or an equal sign then that means it's a variable declaration and but it's basically the same code and of course then you would parse the expression afterwards with a new function because expressions are basically always the same so that's [Music] that's basically how you would write a parser i didn't really get to the part of nesting this time uh i'll see if i have some more time i might actually um create a new uh create a second episode of this i hope this has already been a little bit helpful um so hopefully see you another time bye bye
Info
Channel: uliwitness
Views: 16,086
Rating: 4.9497485 out of 5
Keywords: programming, programming languages, c++, parser
Id: Ql4sG1Aem-I
Channel Id: undefined
Length: 123min 23sec (7403 seconds)
Published: Sat Oct 31 2020
Related Videos
Note
Please note that this website is currently a work in progress! Lots of interesting data and statistics to come.