Episode 1: A basic expression evaluator

Video Statistics and Information

Video
Captions Word Cloud
Reddit Comments
Captions
[Music] hello friends of darknet I'm in will entereth and you can find me on Twitter at Tarot jobs I'm super excited about this event I've rambled about this on Twitter I think I was a year ago when I said how about I do a video series on how to build a compiler from scratch and then I never did anything with that because it turns out I was super busy doing other stuff but thankfully you guys have all pinked me on regular basis on Twitter and asked what's going on there I want to see this amazing talk of yours so I finally had the time to put something together and today is the first episode where we will start writing the compiler some of you have asked how long will it take to write the compiler that's a good question I don't quite know because my goal is to write this live right so the today's episode will be 60 to 90 minutes I will do a demo of so I should say I already have written the compiler so one of the things that we'll do today is demo what the end result would look like so you get an idea whether it's worth you know keeping keeping up with it or whether you want to do something else with your time yeah before we go into the details of a few notes I want to go through it to make sure we're on the same page so we will be writing the compiler from scratch so there will be no magic compiler tools no parser generators is just plain c-sharp the other thing is I don't require any computer science knowledge even though I have a computer science degree I think compilers are just you know as any other software if you know how to write it it's just a lot of fun we will be using dotnet Korn Visual Studio code both are free applications so you should be able to download them and just follow along without any any major requirements and the cool thing is also both of them are cross-platform so even if you're not on Windows you should be able to keep up with the talk um yeah one thing I should say I am NOT a compiler guy so I work at Microsoft at work on the lunette team so I think of myself as being a decent developer but I am definitely not a compiler expert so if you're a compiler expert you might be bored or highly amused I don't know which game you will fall into I'd like to hear your feedback though so if you're interested keep watching so the one thing I would say is the why do I do this course so my prime motivation is to convince you the compilers of fun because I've written a few compilers in my spare time and I think those are cool things to do and there's plenty of stuff that is actually very useful in whatever job you are doing because turns out compilers are really mostly about you know how to handle trees how to walk trees how to do interesting stuff with trees and trees are just a very interesting data structure that if you've never used them you probably should because they're really cool so my goal is I will be streaming every Wednesday 8:00 a.m. Pacific Standard Time which will probably not work for the entirety of the world but hopefully for enough people on the world so it makes sense to do it in a live fashion dependent it doesn't matter where you're watching whether it's twitch or mixer or youtube I stream to all three of them at the same time so just feel free to use the chat in there and ask questions and if I don't forget I will actually check the chat window every once in a while and try to answer your questions the recordings of these episodes will go to YouTube so even if you can't make the life's not you will be able to follow later and find the stuff on YouTube as well which is super useful yeah of course the source code that we going to write is going to github my goal would be that at the end of every episode I have one PR that adds all the code we wrote in that episode and I think on how much time and how much energy I'm willing to spend after the episode hopefully also get some notes so even if it didn't follow the episode or if you're not a video person you should still be able to catch up by just reading the notes yeah this brings him back to what's the goal for today so today I want to start the project we want to you know create the github repo get the initial code up and running and then write the first part of it so today what I would like to do is start with something simple like an expression evaluator where we handle priorities already you know basic parsing of basic structures and then maybe even print the parse tree so that we have some foundation that we can play with all right send without further ado let's just jump into the coding portion of this I mean really the fun part isn't it so let me first say we need to pick a name as any given project is and you know I have type of plenty of names we have been I wrote the compiler you know tear drops compiler in most compiler our compiler and like I've never was really happy if you ever considered MIPS come again have you ever considered Minsk I'm not entirely sold on the name Minsk yet I think maybe we should pick maybe a better name all right fair enough fair enough that's not upset the Klingon Minsk it is so let's call our compiler and our language for now Minsk and then see where this leads us so what I have here is the visual studio environment for the finished product so I want to give you a quick demo on what that would look like so that you can decide whether you will run around the annoying person that I am i named it visual minsk because why not so we have a small program here that we can load and then you have an editor let me just maximize this and zoom in a little bit so you have code colorization which again is actually done by our compiler infrastructure you can hover over identifiers and it tells you what it is so it knows that guesses is a variable of type int so we have type checking going the language that I have I was very uninspired basically what it does it's just JavaScript or a little bit of c-sharp maybe I just wanted to have some language that is something we can actually play with like I'm not a language designer even though I'm going to see sharp language design team I have my opinions on what I would like to see there but for me was mostly about writing the language so I just picked something relatively easy to parse and you know we can play with this probably as we're going as well let me open up the chat window so that when I'm going I am actually able to see what's going on zoom out here there's already plenty of questions yep all right so one thing so what it does right now I can actually run this program and I wrote a little guessing game so you can say I would like to get hints and then it will basically say I guess that I'm a between 1 and 100 and now you know what's your guess and it basically does the normal binary search so we can say 50 and then it says well the number is actually higher maybe you want to get 75 or the number is lower and then eventually hopefully after a few guesses you get to the actual number so this is a very basic program here you know we have functions in here the functions take parameters I have the ability to print to the console we have string interpolation going we have basic control structures like wire loops and if statements and you can assign variables and you know what do you were basically expect from a from a from a from a boomer decent programming language the idea that we have here is actually it doesn't have code completion yet although that is something that I would like to add as well but you get life type checking so if I make this in a float you need to get an error message that says this variable here we cannot convert float to an int because startgame takes an int right you also get some sort of symbol browser here where you can actually navigate between different parts of the language so basically what you would expect I guess from anything that has the title IDE or something as basic as that the other thing that I did because this is actually compiled so this is literally producing a dotnet executable so we can actually go to the folder here if I remember where I put it I think it's here under samples yep here's our or XE and we can actually open up this guy you know it's fine which is actually kind of fun to do because you know look at the the decompiled c-sharp version of our language which is kind of fun so you can see for example that I use string interpolation here but it just compels down to a normal string concatenation for example right which I think in I actually call string concatenation if yeah that's what it does here but you can see it in C sharp of course you can switch to all language because nothing knows about our language right that would be maybe even fun to do maybe somebody wants to do the D compiler for IELTS why for our language that would be super sweet but that's not what I'm focusing on and the other thing we can do is we can actually debug there or or a code so I can actually hit f5 and no I did not write a debugger all it did I omit a PDB file that actually has debugging information in it so that allows us to actually step through our code so all I have to do here now as I can had f11 and then Visual Studio will happily load our source file I mean it doesn't know anything about our language right so there's no syntax highlighting no nothing but I can actually do what you normally do it you can do an f10 you know you can say do I want to jump into the method ask for hints or they want to jump over that do an f10 so it asks for hints I say no and then for the run game I can decide whether I want to run into that I can do this and as you can see here is we even see the variables because they are actually part of the of the debug file so the four videos three notes what local variables exist of course what you can't do you cannot divide expressions in here because again Visual Studio knows nothing about our language so there's no expression evaluate or anything like that but you are at least able to debug your code right so long your visual studio installed but you get the idea right basically we just admitted Samael and then had a debug information to it and that's all it took all right so that is what I want to demo about the actual finished product yeah somebody's asking what language is the compiler written in I will be using c-sharp somebody asks for VB but I don't know VB anymore it's a long time yes somebody was asking what y'all stands for again I'm uninspired it's yet another language so not really super curiously interesting so I think yeah that's pretty much it what we have for questions right now so alright let's get started so let's actually closed room studio for now and let's actually start so let's let's start by creating our or bare-bones folder so we have a Minsk folder here let me populate this with some basic basic file to you so we have at least to get attribute to get ignore get license let me do it get in it so we have an actual repository to work with and then I think I want to create the Roppongi to make it public let's do this let's come with what we have initial commit yeah it helps if you add everything I know let's just keep forgetting yep let's do that ooh yeah so we have at least the basics well it's actually open this folder in Visual Studio code maybe the first thing we should actually add is a readme file just so we know what's going on this repo is a compiler maybe we should also add have you considered Minsk Worf naming things more details follow up so we at least know what this thing will contain let's add this guy let's add let's do this alright so nobody at least have something to work with so the first thing I will do is I will create our actual project as a set up I'll be using donut core so anybody can follow along on any operating system so let's say we want to get a console app and we want to call this MC for the Minsk compiler and now let's actually open that guy again in code alright so here we are alright so now we have our or basic compiler I think I should be able to run this now maybe not yet yep I wanted on the core environments yes please I do not want to use the internal console because that never worked well for me so I will use the external terminal so for no controller 5 this that will hopefully launch this yep it does that I mean Lee quits all right so this is our thing so the first thing I will do because it's a much easier for us to work with it's not an actual compiler but more like a wrapper right where we enter some expression and then you really get some feedback on that so let's do this let's say while true then let me say var align I may say just console dot readline and let me say if line if spring is null or empty I guess not all white space is also fine line then we just quit otherwise we say if line equals 1 plus 2 times 3 then we say console.writeline 7 actually let me add a prompt here console right in this case so we have something that we can actually see them what the output was and what the input was else console dot write line invalid expression sit see all these girls one plus two times three seven Wow I say anything else anybody expression seems I've already done it's that easy right yeah not quite so let's actually do a little bit more so the first thing I want to do is when we talk about a compiler right the first thing that a compiler does if you just look at this source snippet here is it will break this thing from characters into verts so if you think about the parser the parts that doesn't actually deal with individual characters the parts that deals with verts right so you can think of the first part of the compiler is the lexer which creates words and then there's the parser who creates sentences and sentences are usually entry form so the very first thing I'm going to do is do that and so in our case because we you know lets us start with the class lexer which basically will actually does this work here sort of extra string text okay we need a field oh god no once that naming convention underscores for the win I cannot stand this so the first thing we need probably a position right so let's say private in position and then we have say syntax token next token let's say right so now the question is what's the syntax token that it's basically what represents a vert in our language right so in our case it would say that starts simple and say we have what do we need what would be useful I guess what we sold is the position in the input file and maybe the text and maybe the kind like is it a number or is it an operator right like that sort of thing syntax kind kind and let me just say our syntax kind is just an enum for now we have nothing in it and then it's just initialize properties for those guys so we can actually store them alright this should get us going so so the idea here is that this guy like every time you call next token we basically yep you know at the current position in the file and if you find the next word take that return that and keep going so in our case so what we'll be looking for so what we were looking for a fun expression evaluator our numbers right I guess plus minus times divided and of course you know white space right space let's just call this white space right these are the three kinds of words that maybe parentheses would also be great of what we can detect for now so how about we start with numbers so we say if char actually before we do that maybe what we should do is public you know private char current which basically says the current character let's start with something like so if the position is basically outside the bounds of the text then we return this guy here which is basically the zero Terminator second tensional I guess for most systems to say if it's invalid you return that guy otherwise you return text position so this gives us the current character and then we probably want another method avoid next which basically just does this like we just increment the position so no we can say if char is letter or digit no sorry it's digit current then what we will do is we will remember where we are tomorr start is position we say wild char is digit current next and then we say okay a lengthy I guess it's not really nice that we have this leaky abstraction here and we have a position on them kind of to know what's going on but let's deal with that for now so we say in length would be position minus start and then we can save our text is text substring for start and length so basically the idea here is that we will keep reading numbers and then at the end we just create the word that represents the number so we would say var so I returned news and text token no not like that so I cannot type this morning what's going on here ah I see what's going on all right look at a new subjects token we say the syntax kind would be it's called as number token for now and let me say the position is starred and then the text is text let's generate the enum member all right so now we have the number as text which is probably for later faces not super useful so normally what you do for tokens as well as you would say the lecture is the party that is responsible for actually you know reading it into the actual value so let's do this let's just say in tryparse on the text and let me say out of our value and we don't really I don't really care right now whether it succeeds or not of course we need to add error handling boy we'll do this later so let me actually go here and that's actually object value let's accept this guy as well so we had the actual representation I'd use object for now because we probably have other shapes later that we want to have like strings or floats so type in this is in this probably not going to be a winning proposition all right then we can say okay bar is white space let's say current and then this guy basically repeats we do something like that again say as long as this guy's white space keep reading and then we say white space token for now and I think I just lost the light yeah I may have run out of battery so if later on it might get darker a lot my face that might not be my mood or depending on how the last you because it might be my mood we'll see all right so now we have the white spaces covered which gives us most of them we need and then I think you only have the special cases now if current equals let's say plus then we do something much much simpler we would say position let's just cheat here c-sharp for the win position plus plus text is this guy and we don't have any value and we call this the plus token and then it's handled - as well while we edit and then let's do actually let's do them all while we edit / open paren close paren and then we call this guy the - token it's called this star token this guy would be the slash token this guy would be the open / and this is token if I get this ever ride I suck at typing that's why intelligence is such an important thing for me I just wish I had intelligence for English all right so now we have most of these guys I think and of course we screwed up the text yeah of course I could use the original text on the input but you know what for these kind of tokens it feels wasteful to do the substring operations because they are you know fixed and the nice thing with literals is that they're all in turn so you don't actually allocating your string objects for that quite handy all right so now the questions if you get to this point then we probably found something that we don't know what we should do with that and so one thing that you can do is you can just say new syntax token syntax kind it's called this bat token and yeah I guess at this point we should have I guess yeah I guess we can achieve again bad language semantics for the win so we can say this would be text position minus one because this guy has already been incremented so we go back yeah I guess we didn't need to do a substring here sub-string for the previous position only one character and then we have no value I guess this is a comma not a semicolon all right they're talking we add to our list all right now we have our lecture that can recognize numbers + - / + whitespace so before we go ahead let's actually make our repple just print it back to us so we actually know what's going on so let's actually allocate a lecturer no lecturer for the line and then we say while tool bar a token enclosed excerpt reads good how do we call it next okay all right next token oh yeah one thing we should probably do that's also conventional is that there's this magic token called end of file so at the very beginning we should probably say if the position is text length or larger than that then we just return the magic new syntax token and we call this in kind and of file token the position is the end of file it has no text well I guess we could cheat and say this is the magic thing here and let me say has no value so this is effectively done by pretty much any parser in the universe because it allows the reader to know when there's no more stuff coming so this is basically a virtual token it doesn't actually exist in the input file unless you think and see and you think the string is always 0 terminated in which case you can think of it as the 0 terminator alright so now we have the token now let's put this guy so let's say console.writeline so i guess we could say if token kind equals syntax kind and a file token and we can just say stop it otherwise obscene otherwise we write the token kind and yeah let's print the text as well talking text in fact let's also say if token value is not now then console right yeah let's do it into a belated string here as well let's say what is that token dot radio and then we can say console dot write line so if you're thinking to yourself I know this all looks horribly inefficient that is kind of okay because we still write our compiler right so we want to not waste too much time doing stuff we don't care about yet all right so now let's see how bad I did the moment of truth so let's say one plus two times three that's not bad so far so we have a number token one we have a whitespace token we have a plus token we have a whitespace token find out the whitespace tokens all have values it seems how did that happen oh because I'm needed because there's really no reason to parse those guys so now we're going to say let's say something similar like 123 - 456 so now we have one token that is 1 2 3 we have whitespace talk we have a - token another whitespace token number number token and that's the end so that's pretty good so now we at least know the detective actual Virts if you will in the in the input so that's pretty decent but this is obviously not a parser yet so let's check the next step here so know we had a lecture now let's do a parser so the lexer produces tokens which you can think of the leaves in your tree and then the parser produces the the actual sentences which are trees and they're basically syntax notes so let's I'm fighting way too much of this I don't know why the code folding doesn't work the way I want alright let's do a parser so there's multiple ways you can do parser like if you read the most of the compiler books from the 70s or 80s I'd say they actually really highly optimized for streaming so they try to not Lex the entire input they will usually like consume one token at a time and then maybe ever look ahead or whatever but one of computers are really good and so what I just will do is I will just say screw this I would just actually in it it's a parser I would take the text here as well and then what I will do is the first step is I will literally just run the lecture on the entire input so I will create a lecture here new extra but that will give the text this is one of those moments where I think I wonder why you a do loop I will probably regret this choice because every single time I did that I later on we voted into a while true loop token equals lecture good next next okay I'm bad at naming did I mention that yeah next okay and let me say if token let's say is syntax actually no that's the well part right while token kind is not syntax kind and all file we keep doing this now what we need here is we need a list of tokens so let's say of our tokens equals new list syntax token and then we will just say if token kind is not syntax kind whitespace because really don't care about whitespace for the most part right and if it's not a bad token token kind bad token then we say tokens dot at token I'm probably be using for that guy yep all right now let's say tokens equals tokens to array because why not and then we say create me if we'd only filled tokens sweet I'll so much the other one we need a position which is basically our current token yeah what's important is if you read this correctly end of file is part of the token list all right so that means I always know even on an empty input I will always have one token which is end of file so what we can do is we can say we have a pub no private private syntax token let's call this peak end offset and I will later on tell you why I think this is useful so we say index equals position plus offset if index is outside the bounds we will just return tokens tokens length minus 1 we just returned the last one otherwise we return tokens index so this means I don't have to think too hard when I want to peek ahead if you will and then we can say we have a private syntax token current which basically just is peek 0 basically give me the give me what's at position and everything else is just slightly ahead of that so this is the nice thing and you legs everything at once because it means you can look ahead at the next token to make this how you want to parse it the entire input file which tends to work pretty well all right now so we have a class syntax token yep what we also need is a syntax note right so let's do this so let's say we have a class syntax note syntax note and again this guy needs and public abstract syntax kind kind you will need this later to decide what kind is this this guy will be abstract because it's basically just the base type for all my syntax notes and all let's think about how you want to represent the input so this is some of these areas where maybe a picture would be more useful so if you have an expression like this right if you say 1 plus 2 times 3 what you really want is a tree representation which roughly looks like this so you have one here you have the plus here and then you based on operator pressure precedences the tree will look like this right so in other words you do if you look at this from a tree strong I'm actually really bad at drawing trees yeah that's not horrible and then this is two this is three so what you see here is basically a tree where the leaf node is effectively the tokens in the input file and then you have a binary operator plus in this case and then so if you want to evaluate this thing you just basically say well plus will give me the left value one give me the right value okay that's another expression 2 times 3 and then you add the result right so basically in our case this would be a node probably number a number node this would be a binary operator a node because it has two inputs and everything else is just chained together right so if you do stuff like this it will look even simpler it would look like this right you would have to I'd so like this is what would happen if you just parse this input here so this is the kind of tree you want to produce it will show later how this wool how this will work so for now we need basically a number a number node which represents our leaves and then our binary operators which are our interior nodes for that so let's start with the fact that given that we will have expressions and statements later let's just start with an of an expression syntax which will for now just derive from syntax node and does nothing special but it does simplify our design a little bit I think and then we have actual instances so let's say we have the number syntax which is of type expression syntax and now we can say let's have a constructor it's called as number syntax so what does this guy need well it only needs the syntax token which represents the number so that's the number token that be there we got all right now let's implement the abstract class in our case we say this would be syntax kind number expression again this doesn't exist right now so we edit and then we just store this guy as monitors the fields let's make this a property [Music] okay now let's do a sealed class binary expression syntax in fact let's call this number expression syntax just for consistency reasons and then let's fix the typo we have what did I say expression syntax right so what would this guy need well it would get the left expression right so we have an expression syntax that represents the left side we have a syntax token that represents the OP rate or token and then we have an expression syntax right and again I suck at typing let me fix this so let's store these guys property property and property oh yeah property explosion alright let's do this as well and let's say this is syntax kind binary expression so the only reason we have these kinds by the way is just so that we can just switch over them which is slightly more efficient than doing an ask us all the time and have to check for the actual instance it's basically if you do f-sharp you probably think less of me right now because it's much easier an f-sharp because they have Union types and c-sharp you have to do this by hand all right so we have left right so that seems like we have our data structures at least somewhat in order for now and all the part everybody has waited for and how do you parse this stuff so let's start with a public method public and say we call this expression but it's called this parse so how do we do this so logically let's first think about how the tree structures work right so at the very bottom we have numbers so the first thing you want to parse logically is the leaves and then you build the other structures on top of that and then you know think of the method structure basically as mirroring the source structure at the tree structure so what do I mean by this okay let's say that started with the with the with the idea that first you parse a primary expression maybe T's ever just write it out the primary expression then we can say while the current is syntax kind let's call this plus token 1 - token oh yeah there's one method that we really want a half which is private syntax token next token which basically says if our current is current and then you say position plus plus and then you return current so then we can save our operator token equals next token and then we can save our let's call this right parse time ring let's call this parse for now actually is there any way to do the snow cross for me nevermind that's primary expression and then we can say primary is no new binary operator expression on binary expression I guess we've let's call this left left operator token right and then we say return left yes and this needs to be kind and then primary expression in our kids right now is literally just we expected in number so let's do this for now and let's add another method called match subjects some kind kind so the idea is now if current kind is kind then return next token [Music] otherwise let's fabricate one so I will explain later whether this is super useful in the parser I mean you basically create tokens but let's now say we manufacture syntax Koken of the kind that was asked at current position it has no text because it's manufactured and it has no value all right what's going on here why does that not work oh because type expression syntax here we go so we say var number token equals match syntax kind number token and let me say return new number expression for the number tokens all right this should be happening here come over from syntax token to syntax note interesting did I mess it up yep and didn't mess it up that is it token not not a note all right so so far so good now the very first thing you will probably find when you are parsing is you really want to see the parse tree because if you messed that one up that is usually very annoying so I suggest we go to the subjects node and we add the notion of children because once we have that we can actually walk the tree in a generic way so let's say we have an ienumerable of syntax node and we call this get children and let's make this guy abstract what helps right now again this is probably not what the production compiler would look like but for now let's just say syntax tokens I just notes because if we do that then we can basically pretend that some text tokens are actually the leaves in our in our tree so this means this guy does nothing so we say return in your mobile thought empty or syntax note yes I need that all right no we can say our number expression syntax just returns our number token Jesus I'm really bad at typing today usually not that bad then a binary expression will first yield return our left then I'll operate a token and then we say we want to return our right so you have never used yield it's a very convenient way to write stateful iterators where effectively this produces an enumerable where the first item is left I could also just literally create an array here that would probably work as well but I don't know I find these a little bit easier because they can shuffle them around and again efficiency right now it's not all primary goal all right now we can traverse the tree so now let's actually change our main method here let's add a new method static void pretty print and it takes let's say a syntax note note we probably want the indent what we have written so far actually if we want this why everyone wants a indent and then starts with an empty string and then we can literally just say console dot right let's say no dot kind and let's say this if note is syntax token T let's say console right [Music] now actually let's just do this and T values not know in which case we add the token value here and then we just say indent plus equals what do we want for spaces 1 2 3 4 I guess that works and then we can say for h-bar child in know to get children we can say pretty print the child with the end and I should do it for now all right now let's go back here so instead of doing this we save our parse our new parser for the line let me say expression equals parser dot parse and then let me just say our color equals console foreground let's say and let me say console dot foreground console what is the console color dark grey I don't know I know why I did this but it seems cool and then we can just restore the color here I guess little swap damn I didn't type everything here and there is just a pretty print for our what is that expression let me just let me just create it alright so we have that part alright I'm inclined to run it now and see how bad I did all right let's do one plus two plus three for now oh yeah we did really well I guess you forgot or yes you should have done this console dot write line I guess it makes it slightly more that is interesting oh yeah yeah console dot right indent it's cool when you construct in and but you also need to write it to the console otherwise it doesn't do much one plus two plus three there you go so the binary expression that is a binary expression that has a number expression with the number token and a plus two Oakland and number expression so that's effectively the tree that I drew earlier I the only thing that annoys me slightly is what does that called tree it would be nicer if you would get and I'll put that looks more like this so let's actually get these because he'll be looking at those guys a lot so he might as well make it actually pretty um so many desk guy and there's one more right there's the the center guy that we need yeah something like that no we have to think about how we construct these guys how did that work again I think I need something like is last I think the default is false [Music] so [Music] so how do we indent you basically say if is lost then we say indent is Plus this to do this right it is this and it's that otherwise if it is not the last one then it would be that guy and then how do we know something is to laugh you save lost is no get children thought lost that's just cheat here and then we just say you are it's called this last child lost child something like that and then I think I only need to [Music] and as this guy come from that mmm let me quickly ones I'm not entirely sure I might have missed up the what happened why did this disappear so fast what what is happening here one plus two plus three and that seems reasonable oh we crashed that is true last are the falls to the rescue alright so this looks pretty ugly so I think so what is happening here is I believe that it I think we create the England too quickly right so boom boom boom boom boom boom I think it's like this right so let's call this bar marker so we need to write the indent and then we say write our marker something like that right and then we say by the way this is indent one two three I think I got that right if not almost [Music] yep and it actually helps if you're--if check is not screwed up of course he has to change this guy no I think we have something that is actually borderline readable except it's not the the the why is that again I guess I guess it's like this right it's it's lost then it is 1 2 3 4 otherwise it's man this is like trial and error programming right here close-ish except for the first one yeah which I think the default is messed up the default should be true and then I think I have it man pretty printing trees the hardest problem of compiler construction apparently alright so now we can actually read this thing not too shabby I'd say alright so now in order for us to actually make something useful out of this yeah it's a bit of a comment on my horrible string concatenation but that's just the way it is for now so now we have these guys here now I think we're the first thing I would like to do is deal with arrows because right now so far we have ignored any arrows we don't print anything we just silently ignore them and again if you want to build a compiler that's probably the first thing you want to pay attention to because otherwise it will be lost forever in why stuff doesn't work the way you wanted to when it turns out that your parts are just did it a little heart attack in the middle and you never knew about it so the first thing I want to do is I want to introduce the notion of arrows and I'm lazy right now so I will just call them literally just a bunch of strings so let's do this and let's say we have a list of string which we call Diagnostics new list of string and we will expose this to the public by saying public I know mobile string Diagnostics they just returned our Diagnostics right so now whatever we find something that is bad like for example this we can say Diagnostics ad and we can say error bad token a bad character in input which in our case we can probably say it's too current here good enough so this way we at least know what's going on of course not everyone would record the position but let's leave this for a later state now let's do the same thing at the parser so the parser where is it should have that as well so private list string Diagnostics new pistol string so of course at the beginning we should say Diagnostics range lecture Diagnostics so we don't forget what the legs are reported and then everything the time you see something bad here like for example here we can say Diagnostics add arrow unexpected so can let's do this and let's call this what is that current current expected what did I say kind I guess all right that's pretty decent no that's actually if we find any arrows let's report them so we can say if parser I think we have an exposed to in the process okay did it uh no we have not so let's say there's a public what is happening with my tab here what is happening I never had that public public ienumerable off string Diagnostics let me say this is you guys all right Tipton will focus no what's that I would ever do that alright so if the person has any Diagnostics parser Diagnostics any then we say you know let's change this color to something dark red maybe I might regret dark red it's probably unreadable well we'll find out for each of our diagnostic in presser Diagnostics console.writeline diagnostic all right let's check that out so if we say let's do a bunch of garbage so we get better parse tree which I would like to highlight so we get a valid parse tree back it's just because we expected a number expression right so we got that one and we have a bunch of arrows and then we get parse errors as well so now if we say 1 plus we get an arrow that says unexpected token and a file and expected a number token all right because it was after the plus unfortunately we also have this situation where right now we are not reporting an error because what happened is we parsed an expression and then we didn't assert that when parsing was done we're actually having end-of-file which is a problem because now you can have basically trading garbage can you consider that legal which is not what you want so in order to address that the cleanest way I think is for us to use a new type which represents the entire input and we call this guy the syntax tree and so the syntax tree will basically hold on to our expression syntax we call this root and let's say syntax token and a file token why would I want to do that make this a property make this guy a property and then we can say also what we expect here is and I enumerable of string and we call this Diagnostics that's also the thing let me return and yeah for the time being let's just do that even though that is also bad but good enough you just store the annual as is you know what not worth it let's do the right way let's say we store this as and I read only list for now and so we just say to array so we don't have to worry about any craziness here then we will do we let our parser instead of returning an expression for the root they return a syntax tree let's move this whole thing into another method let's call this method Paris expression yeah obviously I should return an expression subject still then we can save our expression equals parse expression then we can say match syntax kind and afar token because that's what we really expect the law and of it's okay and then I can say return new syntax tree basically our Diagnostics the expression and our end of our token now this is slightly cleaner so instead of saying expression we say syntax tree then we can say syntax tree root and then we can say it's syntax tree Diagnostics you walk over these guys sweet all right so far so good no let's run this guy now we say 1 1 we get unexpected token number token expected end of our token so those things seem to work quite nicely actually nice all right now let's think about priorities for a second so if you do this that is ok actually know before we do that let's actually evaluate let's do that first so how will be how about we use this output now to actually compute a number right that's probably the maybe not doing this just for the for the giggles here you actually want to do something with the end result so let's call this an evaluate or let's say and let's say we give this guy how do we do this Jesus Christ we give this let's say the root object root and then we have public and evaluate and then we just say this is return evaluate expression with this root with our route all right how does this work well we have basically what kind of expressions do we have we have a binary expression right now and we have a number expression right now it's the only two things we have so let's do this let's say if root is number expression and then the result is the number token and the value so now that is the problem because we said earlier and the parser that we may not actually have a value right because we didn't actually validate that so let me actually go back to the lexer really quick remember when we parse this guy here and we said where is it here we said oh we don't know what's going on well this is the moment where you want to make sure that you report an error so you would say if not then we would report an error here that says Diagnostics and the number cannot be represented by and in 32 all right because we know it's only numbers for now right I choose this day isn't valid in 32 good enough because now we can say you can only call evaluate if you don't have any arrows otherwise it's illegal to call this method which simplifies your error handling quite a bit all right now we can say if root is binary expression syntax B well we can say var left equals evaluate expression for B dot left do the same thing for right right and then we say well if P dot operator token kind equals syntax kind plus token then we say return left + right bum-bum-bum gonna say this is the elsif out of out serve else throw new exception unexpected binary operator that will be B operator token kind right because that shouldn't happen right so we have + right - right star would be multiplication and then this would be division and say whether it's slash talking yeah and then same thing here we would say it warned you unexpected note let's call this you know all right so this one should give us something to work with inter Denton if you have any Diagnostics and we report them otherwise we will for our syntax tree expression no root I think it's gone yeah and then we save our result equals evaluate and then we say console dot write line is that can invert if super nice because that makes a little bit easier to read let's try what happens one plus two is three two times three well actually we don't parse that yet okay two plus three minus one it's four seems right and if we say really really really really large number then we get it's not a valid int alright so this way we actually have some decent error handling and we actually evaluate stuff now let's actually go down to our parser and try to handle multiplication because how hard can it be right we already handle plus and minus so all we have to do here are these guys right I would say what does that start token and then / token right now let's say 2 times 3 6 all right it seems good 6 divided by 2 is 3 well 1 plus 2 times 3 it's not 9 so the reason this is saying 9 is because the parse tree it does not actually honor our normal priority rules right normally you would say 2 plus 3 binds stronger then this one here but right now because the way it's being parsed is we first parse 1 plus 2 and then we just say plus 2 times 3 so we first evaluate 1 plus 2 which is 3 and then we say times 3 which is 9 but that's not the correct way to do the math so we have to parse this differently so the way we handle that is by just changing the way the parser is organized so if you think about the parser for a second right so this is a what's called a recursive descent parser because basically what you do is you call these methods recursively right and so this is how you get the tree structure going I know marques I'm not having really recursion right now because I just have a primitive while loop so let's actually do a recursive part here so the trick is you just split these guys into two and then this is what people call parse more to multiplicative expression let's say I think the correct term for that is parts factor which is basic to say you know the things that are stronger together let me say parse factor as the primary expression times was divided by another primary expression and let me say this is a this is a we call this parse Terim I believe it was the usual term and internet that they use where we say this is a bunch of factors that are strung together and then this guy only handles these guys so let's see whether this actually solves our opponent well first it would help if it compiles so we now say 1 plus 2 times 3 now I get 7 and now I get a different press 3 right so now what ends up happening is we say 1 plus and then the operand 4 plus is another expression which is now our our factor effectively that represents the result of the computation so now we actually have the correct priorities in our in our evaluator so let's do one more thing because right now what annoys me a bit yeah I guess doesn't matter um so we have that now the other thing that we want to do probably is support parenthesized expressions so let me actually add another note for that so what would that look like so we have a sealed clasp parenthesized expression syntax which is an expression syntax there's no way I typed it correctly prevent this I get look looks good I'm syntax token so what do we have syntax token we say open parenthesis token then we say we have the expression syntax which is the expression and then we have another syntax token which is the closed parenthesis token and then I just store them in properties we have to override a certain thing so we say this is some kind X what do we say parenthesized you know what I copy that this guy here generate the you November and then we say what do we do we say yield return open parentheses token closed parentheses token and there is any yield return an expression in the middle alright so this is our thing you know we only have to parse it so the way we would parse this is at the very very bottom part why driven right now say primary expression we now have two kinds that we can expect we can say if current kind equals send kind left sorry open parentheses token then I would say var left equals next token then I save our expression equals parse this is one of those things where you want to have helper method so you don't have to understand the structure of your stuff so what we would do is it would just have a private message private expression subjects parse expression and that just says return parse turn so you have sensible names you don't have to remember this thread the structure and we can say this is part expression and then I can say bar write is match for syntax kind close parenthesis children let me say we return a new parenthesis this sized expression which is left expression and write does that own it let's find out so now let's say we do 1 plus 2 times 3 right and I think I crashed so let's attach a debugger 1 plus 2 times 3 oh I know what the problem is look it's disk right here so we haven't handled our note in the evaluator so if note is parenthesized expression syntax p in that case it is just return evaluate expression for p dot expression right because parenthesized expressions are just you just evaluate the child of it so now let's do this again 1 plus 2 times 3 and it's 9 because yes we first do this which is 3 and then times 3 and so we have the proper note structure here so what I also want to show here right now is if I say 1 plus 2 and I don't provide a closing parenthesis we manufactured one here right so that's a closed parenthesis token here that we edited the tree so what's nice is that when you when you process your trees you don't have to deal with my reference exceptions for for parts of the tree that you that you just don't have right so the the idea there is that you just have synthesized tokens for that and you just rely on the fact that you have reported errors for those and then this way things tend to work much better now the downside of inserting tokens is that you have to be careful how you structure the parser because when you insert tokens you might end up in loop situations where you keep inserting tokens and keep inserting tokens and you never you need to be very careful that you ride your purse on such a way you always consume tokens so that eventually you will hit the end of file to open and your parser terminates otherwise you will create a lot of memory and nothing happens all right so this one I think so far so good the one thing I would like to add in our repple just forget shits and giggles is yeah make the make the API is a little bit nicer so for example we don't really want to deal with the lexer and the parser anyway realistically because what we operate on is the syntax tree right so what we can do is we can just have a message here public static syntax tree parse and we can just say give a text and then we can just do our parser equals new parser for the text and then we can say return parser thought parse and it makes the api's I don't know not a whole lot cooler but I know I like this one a lot more than having to deal with the parser it also beats slightly more logical subjects tree dot parse you get back a subjects tree right seems reasonable yeah so the done thing I would like to add here is some pseudo commands so we could say if line equals so tree first of all we say continue more importantly let's have aesthetic here actually doesn't have to be static volume bull show tree falls I'd say and then we can say show tree equals not show tree and then we can say console dot write line show tree showing parse trees otherwise not showing past phrase so this way everyone our thing it looks a little bit cleaner oh yeah it helps if we actually handle that thing - where's the pretty print here if show tree do that otherwise don't do that I guess you need to do that here now as well you can sell one close to you get a result I can say show T samples - I get the treat so this way we have some diagnostic information without without anything and then I think the other one that I probably want out of line equals CLS they say console dot clear and then this way this way we can have a little bit nicer as far as the console is concerned alright so I think we now have at least a ripple that kind of sort of works for us now let's quickly look at the project structure that you have so far because right now everything is in one giant file which I know some people prefer everything in one giant file I am NOT one of those people because it makes version control pretty ugly first thing is the texture we name our namespace here into Minsk let me create a folder here which I will call code analysis and yes if you have ever used was nine there wasn't a P is the ones that are just produced will look very familiar let's just say so let me just add a new file here stuff dot CS and let me take everything we just built and put it into an in space min starts code analysis and put the stuff in here now of course we need a ton of usings and now I can just move these guys to separate files so let's just go this one to this file to type move type type type and yes do that actually let me do a save all for now because I cannot deal with that many open windows all right let's keep doing overtime type move type I know this is last one so let's just rename the file all right then I guess here we can just take the delete hmm what we select everything again that also works don't wanna try to using for code analysis and booyah now we have just our pretty print and the repple in here and everything else is neatly distributed in a bunch of smaller files I'd say I'm at this point so far quite happy with the result now let me actually because I haven't done this in a while look over the questions that I've gone what do we have all right so both of the questions were related to what language are we using it very answered yeah so some person asked why do we duplicate the note information in an enum and the class hierarchy so right now what I what I have done here is I have cheated a lot right I've done a lot of instance checking so if you look at the evaluator for example we just quickly see where that guy is now here we just do this Rd basically just say you know if this if this instance is this instance here and now with the new c-sharp pattern matching is actually pretty concise you can just do that but for the most part you really want is you want a switch statement because that is just much more efficient especially when you walk larger trees these things are much much cheaper because what you do is you basically ask the instance give me an int which is you know the enum value and then the switch statement is just a jump table that directly dispatches to the to the block and then you have of course the cost still occurring in that in their block but you only have one cost as opposed to you try this one no you try this one nope you try this one nopes we do a plenty of costs before you actually find the object so instead of doing this you just directly jump to the corresponding code block for three or four instances like we have here no point but I wanted to set the pattern already so that we don't have to duplicate so much information later yep somebody says you could do different no colors for the output that's actually good idea we should probably do that any plan writing a new CLR that will be a lot of work right he new CLR a new language for the CLR might actually be something in fact that's what we do here might be compiled onto il and that's basically new language yeah somebody mentioned lob em I have tried LLVM before I've not succeeded yet in getting my back-end up and running on a VM but it is something I would like to look into because my language is you saw doesn't really deal with object allocations so it should be fairly straightforward to to take the language subset that I have and omit actually native code for that might make the debugging a lot harder though yeah so somebody suggests maybe we should have done commits for the individual items that's a good idea I generally do that but when I get carried away I'm not doing this but generally just how I work I would I would commit logical chunks and we can do this right now actually if we wanted to let me think about that for a second is that easy probably not right now because you already have interdependencies everywhere but I would say for our first episode um you know it's not do that and that's do it the right way to check out it's called as episode 1 yeah I guess episode 1 lexer and parser right just for good measure and then we just say add these guys why don't we go to what we were to commit I'm not sure I want to commit deviance code once but yeah it is what it is all right let's do the tune so here's all the stuff we're about to commit yeah good enough alright add lexer and parser this adds the basics maybe we should say that expression evaluate or evaluate or that's the basics we need specifically lexer parser and diagnostics mistakes diagnostics that will lay the foundation for our compiler I'd say good enough so I think for our first episode that is a pretty decent start alright so yeah my my lights are all fading so I look very interesting probably very dark it's a very critical thing here so I think the next episodes will focus a little bit more on getting more stuff from the language going maybe some statements and then also we have to think about how we and we want to run this guy probably start first with an e value of an interpreter because it's much easier to do an interpreter than it is to omit il and then we will walk our way slowly up alright so this we are pretty much in time I was hoping for 90 minutes we're now at 100 minutes but
Info
Channel: Immo Landwerth
Views: 79,191
Rating: 4.9649124 out of 5
Keywords:
Id: wgHIkdUQbp0
Channel Id: undefined
Length: 98min 35sec (5915 seconds)
Published: Wed Oct 03 2018
Related Videos
Note
Please note that this website is currently a work in progress! Lots of interesting data and statistics to come.