Writing a Lexer - Building a Programming Language in Rust

Hello, hello, hello. Let's just tweet this out. Cool. Hey Oliver, how are you doing? Thanks for coming along. Can you all hear me all right? Let's check OBS real quick. All seems to be good. Out of interest, what does the quality look like? Because last time I streamed it only went up to 720p, I believe, and I've updated the... oh, is there echo? Why is there echo? Is there echo for anybody else? "No echo for me, but a bit quiet." Okay, we're turning up the mic. How does that sound? I just turned up the gain a little bit. I don't want my keyboard to be too loud because it's mechanical. Sorry, I'm one of those people. Hey Kevin. No, I don't want to shut down OBS, I want to minimize it. Cool, we've got a couple of people here, that's good. Let's make sure Twitter saw the tweet. Yeah, cool. I've noticed a lot of people come back after the fact to watch this; the stream from the other week has... "Yay for mechanical keyboards." Indeed, yeah, the clickety-clack. Hey Dan.

Cool, let's just jump right into it. So I'm going to be doing some Rust. If you're not familiar with Rust, it is a systems language. It's quite low level, not as low level as C. It does some things for you, like memory management, so we haven't got to put up with malloc for memory allocation and stuff like that. It's an interesting language, it's fun to write, lots of cool features. Not very friendly for beginners: if you've never done C or C++ before, it's quite weird. There's something called the borrow checker, which scares a lot of people. But yeah, so I'm going to create a new file here. I'm going to be working on a lexer, and if you don't know what a lexer is... "I've seen a little Rust code but haven't written any myself." Yeah, I think that's the case for a lot of people. A lot of people have seen Rust or heard of Rust but they've never actually written it. So I'm just going to jump straight into it. I'll be streaming for half an hour, 45 minutes.
Let's see how it goes. Hopefully we can get something good going during the stream, but we'll see where we get to.

So, programming languages. That's basically what I'm going to be building: a programming language, a scripting language to be precise. We're going to start with a lexer. If you don't know how programming languages generally work, you have a stream of input, and that goes into something called the lexer, which spits out a bunch of tokens. I've got Alexa talking to me in the background as well, which is great. So you get input, that goes into the tokenizer or the lexer, and that spits out a bunch of tokens. You can kind of think of that as just an array of tokens, and then you pass that into a parser. I'm going to be building a very simple interpreter, called a tree-walk interpreter, and this will be over a couple of streams, probably. But this is an experiment, I'm writing Rust, so we're going to have a bit of fun with it. The parser can do a couple of things. It can transform those tokens into some sort of semantic structure, and you would generally call that an AST, an abstract syntax tree. If you've used PHP-Parser before, that's what that does: it takes PHP code, it takes the tokens and generates an AST. Then you can whack that through an interpreter, which will go through the AST and look at each node, or each branch in the AST if you like, and evaluate it and do something based on the type of node.

So we're going to start at the very beginning today. We're going to be taking some input, just a string of random characters, and we're going to be running that through a lexer and spitting out a bunch of tokens. So let's get started, let's come up with a grammar. How many people have we got here? Six people, seven people, nice. Is the quality good? How's the quality for everyone? Hopefully it's 1080p, if you can stream it at 1080p; it was 720 last time. I'm going to increase my font size a little bit as well. So I'm just calling
this pl, for programming language. I'll come up with a better name at some point, but this is just an experiment so it doesn't really matter. "1080 works fine." Perfect.

We'll jump into a bit of basic Rust. Rust has a command line tool called Cargo, and this is very similar to npm or Composer. It manages packages, it can run and build your Rust app, it can run tests, it can benchmark, it can initialize new projects. So to build out this structure here, all I did was cargo init pl, and that created a Cargo.toml file, a .gitignore, and then a src folder with a main.rs file. Let's jump into the Cargo.toml as well. TOML is kind of like YAML; it's just a configuration language, there's nothing special, it's purely for config. It's got some keys: this is package information, and all of the dependencies go down here. There's a nice .gitignore which hides the target folder, which you can see here, and a main.rs. Like many low-level compiled languages, basically non-scripting languages, there's a main function which is the entry point for the program. So if I go into the terminal and do cargo run, it's going to build our application in debug mode, so it's not fully optimized, it's just for quick builds, and it's going to run this main function. I don't need to go down here and call main, because this is the entry point. So this is all working, this is great.

A couple of other things. This might look weird to you: this println here has an exclamation mark. This is known as a macro. If you caught my PHP source code stream, you might have seen some C macros, like PHP_FUNCTION. All this does is basically expand out into another statement or expression. You can see that here: macro_rules! println. It defers to print with a line break at the end, but it takes in a variadic number of arguments and it evaluates to this io::_print. It just hides some of the internal logic, just like C preprocessor macros do.
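For anyone following along, here is a minimal sketch of what a src/main.rs along these lines might look like. It is not the exact file from the stream: the greeting helper is a hypothetical addition, included only to show that format! accepts the same format-string syntax as println!.

```rust
// Roughly what src/main.rs looks like in a freshly initialised binary crate.
// `greeting` is a hypothetical helper, not something from the stream.

fn greeting(name: &str) -> String {
    // format! is a macro too: it takes the same format-string syntax as
    // println!, but returns a String instead of printing.
    format!("Hello, {}!", name)
}

fn main() {
    // main is the entry point of the program; nothing calls it explicitly.
    println!("{}", greeting("world"));
}
```

Running cargo run from the project folder builds this in debug mode and then executes main.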
So let's define a grammar. What do we want to start with? Let's do something simple, like variable declarations. I'm thinking we'll use let, just like JavaScript, and we can probably change the language mode to JavaScript just to get some syntax highlighting. Hopefully the font's big enough; if it's not, let me know and I'll bump it up. So we'll do something similar to JavaScript: we can have a let statement here, and we'll give it a variable name, let's call it foo, and we'll use bar. I'm going to actually omit semicolons. I don't want any semicolons in my language. I don't use semicolons in my JavaScript and I don't want to use them here, so we'll omit the semicolons.

What we want to do is loop through this string character by character and say: do we know what this word means? And if we do, we want to spit out some sort of token. So we might call this T_LET. This isn't exactly what the Rust is going to look like, because there are a few things in Rust that can help us out, but this is going to turn into T_LET, which is semantically the token name: it's a let token. foo is an identifier, so we might have something called T_IDENTIFIER. Equals might just be T_EQUALS, although this might be a bit misleading, because when you think of equals you think of two equals signs, or two equals symbols, or whatever you want to call them. So we might call this T_ASSIGN instead; we'll get to that. And then we have a string, so we might have T_STRING to represent a string token, or a string literal in this case. So that's the basic idea: we want to turn this random stream of characters into some tokens which semantically mean something to us, to our interpreter.

So I'm going to use this file, and where is the best place to start? I want to say cargo run and then be able to pass through a path to this hello world file. It doesn't do anything at the minute, but I want to read in the contents of this file.
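The token idea described above can be sketched in a few lines of Rust. This is an illustration under stated assumptions, not the code written on the stream: the TokenKind, Token and lex names are hypothetical, the variants simply mirror the T_LET, T_IDENTIFIER, T_ASSIGN and T_STRING labels, and string literals are assumed to use double quotes with no escape handling.

```rust
// Hypothetical token kinds mirroring the labels used in the talk.
#[derive(Debug, Clone, PartialEq)]
enum TokenKind {
    Let,        // T_LET
    Identifier, // T_IDENTIFIER
    Assign,     // T_ASSIGN
    String,     // T_STRING
}

#[derive(Debug, Clone, PartialEq)]
struct Token {
    kind: TokenKind,
    literal: String,
}

// Walk the input character by character and emit a token for each word,
// `=` sign, or double-quoted string. Whitespace is skipped.
fn lex(source: &str) -> Vec<Token> {
    let chars: Vec<char> = source.chars().collect();
    let mut tokens = Vec::new();
    let mut i = 0;

    while i < chars.len() {
        let c = chars[i];
        if c.is_whitespace() {
            i += 1;
        } else if c == '=' {
            tokens.push(Token { kind: TokenKind::Assign, literal: "=".to_string() });
            i += 1;
        } else if c == '"' {
            // Collect everything up to the closing quote (no escapes).
            let start = i + 1;
            let mut end = start;
            while end < chars.len() && chars[end] != '"' {
                end += 1;
            }
            let literal: String = chars[start..end].iter().collect();
            tokens.push(Token { kind: TokenKind::String, literal });
            i = end + 1; // step past the closing quote
        } else if c.is_alphabetic() || c == '_' {
            // Read a whole word, then decide: keyword or identifier?
            let start = i;
            while i < chars.len() && (chars[i].is_alphanumeric() || chars[i] == '_') {
                i += 1;
            }
            let word: String = chars[start..i].iter().collect();
            let kind = if word == "let" { TokenKind::Let } else { TokenKind::Identifier };
            tokens.push(Token { kind, literal: word });
        } else {
            panic!("unexpected character {:?} at index {}", c, i);
        }
    }

    tokens
}

fn main() {
    for token in lex("let foo = \"bar\"") {
        println!("{:?}", token);
    }
}
```

For the input let foo = "bar" this yields four tokens: a Let, an Identifier holding foo, an Assign, and a String holding bar. Panicking on an unknown character is just the simplest choice for a sketch; a real lexer would more likely report an error.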
So Rust has a very extensive standard library, thankfully, because you don't want to write this stuff yourself, and it's all namespaced as well. We have a use expression, or a use statement, a bit like PHP, but instead of having something like, you know, Illuminate\Support, we have std, which is the standard library namespace, and we have fs for the file system. This is going to allow us to access any functions or structs or traits inside of this namespace using the fs binding. So we can say fs:: and access all of the functions that are inside of this namespace. You can see there are some structs here; we'll get to that later. For now we just want to read in the file contents, so we're going to need the file system, and we're also going to want the env module, because with env, if I just type this out here, you can see we get access to args, which is going to be everything that's passed in here.

So let's just print out the env args. This is how you declare something in Rust: you use let and the name, and we'll set this equal to env::args(). Rust has types, everything is typed, so I can put a colon on here and put a type if I want, but I'm not going to, because we'll let Rust infer the type through type inference. It knows what args is returning, so it knows that args must be of a certain type. If we look at this, it is std::env::Args, which is great. And let's use that println, so we can do some formatting. A bit like you do with sprintf or printf in PHP, you can use format strings, or template strings, kind of, inside of these literals, and Rust will look at this and say, I need to format this value in a certain way. The colon question mark here means that it doesn't have a particular standard format, so instead we use its debug formatting, which for a lot of things is just going to print out the name of the struct and any public fields that it makes available. So let's run this and see what happens. You can see it's printed out Args and the inner
value: we've got the script name, or the executable name, and then our argument. I'm not going to spend too much time going into the ins and outs of Rust, because that's a whole video series, a whole video course on its own. So env::args() has a couple of convenience methods. Let me know if you've got any questions in the chat as well, happy to answer. We can use dot, just like an object in JavaScript or any C-style language really, and we can use the nth method, which returns the nth element of the Args struct, the iterator. We want to get the first argument, or sorry, not the first argument, index one: this is zero and this is one. So if we add a println back in and put file in there, it's having a go at us.

This is where you might get caught out if you're not familiar with Rust. Rust doesn't have null. There's no null like in PHP or JavaScript, there's no nil like in Go, and there's no NULL like in C. Instead Rust has something called Option. There are PHP libraries that do this sort of thing, and it's actually something that I could see happening in the future if we get enums in PHP that can hold values. We'll get to that later. But this is how Rust handles null, essentially. An Option can either be Some or None. So if something is nullable, like a function argument or a field on a struct, you can say it's actually an Option and generically type it. Again, generics are something that we're probably all quite excited to have in PHP at some point, whether that's runtime generics or statically analyzed generics as a syntax in the language. This is saying that whatever this variable or property or field on the struct is, it might have a String, it might not have a String. In this case we know that we're passing in a file, so we can unwrap, and this is going to take whatever's inside of that Option and return it. So the inner value, it will return it, and if
for whatever reason the Option is None, meaning it's null, you can think of it like that, this will panic. This will basically throw an error and stop the program. I can show you an example if I just run this without any arguments: you can see thread 'main' panicked at 'called Option::unwrap() on a None value', and the file where it happened. That's because we didn't pass a file. If I do it now with a file, we get the file name printed out.

"Similar to C++." Yeah, similar to C++. I mean, Rust is kind of like C++ plus plus, you know. That's a good way to think of it. If you know C++, Rust is probably going to be fairly easy to pick up. "std::vector<bool> sieve(n + 1)." I don't really know what a sieve is in C++, but yeah, I guess if you unwrap that... I know C++ has null pointers, so you've got a pointer that can be null, but from what you've put in there, Hamid, it looks a lot cleaner in Rust. "Just a sample for typing." Right, yeah, we'll get to some typing and stuff eventually.

So let's not hang about, let's get the contents of this file. We're going to reach into fs and call read_to_string. This is going to take the file path that we provide and read it into a string, so we'll pass file, and again we need to call unwrap this time. Again, for PHP developers this is going to be a bit weird, even for JavaScript developers it's going to be a bit weird: Rust doesn't have the concept of exceptions in the same sense that we're familiar with. You can't turn around and say throw new Exception. Rust has something called a Result type, and this is kind of similar to Option: it's either Ok or an Err, and it's also generic. So in the case of contents, if we hover, we get an io::Result<String>. If everything's okay we'll get a String, otherwise we'll get an instance of Error. So again, for now I'm just going to unwrap this. If I
don't provide a file, it's not going to get to this point. If I provide a file that doesn't exist, so let's try that, let's just put an s on the end, it's also going to panic, because we're trying to unwrap an error, and it gives us the error, which is no such file or directory. So let's println again. We know we've got a string, so we don't need to use this debug formatting like we did before; because it's a string, it can be printed and formatted as is. So let's just print out the contents. This file doesn't exist... so there we go. We've just read a file in Rust, and we now have the contents. Great, so we're making progress, we've successfully read a file. In PHP you're probably going to do $argv or something like that to get the first argument, and then file_get_contents or fopen and fread. But here the standard library is namespaced, unlike PHP where everything in the standard library is global, which is a good thing or a bad thing depending on how you see it. How many people have we got? Six, seven, nice, cool.

Okay, where should we go next? Let's create a struct. Rust doesn't have classes. It's not an OOP language in that sense: there are no classes, no abstract classes, no interfaces in that sense. But it does have structs, which are kind of like classes. Structs have fields, like a class, so I'm going to create a contents field, and this is going to be a String. So this struct Lexer is going to take in the contents that we read here. And a bit like a class, structs can have methods, and we use this impl keyword, implement. So we're going to write the implementation real quick. "What happens on error? Like, do you just validate it exists before unwrapping?" Yeah, we'll get to that in a second, actually, I'll show you how that works. Let's just write out a basic lexer implementation. So yeah, we can implement Lexer, and this is where we're going to put
all of the methods. For a method to be public we use the pub keyword, and then fn, which is the function keyword. In Rust we either say default or new, typically, so we're going to say new, because we're going to be constructing a new Lexer. This is going to be a static method, and this is again really strange for anyone who hasn't seen Rust before. We're going to take in the contents here, and we're going to say String. Because we're returning something from this function we have to declare a return type: what you're returning has to match what's declared as the return type. Just like PHP, where you can say self or static, Rust has some shorthands for that, so we can say Self. We could also say Lexer here, this works perfectly fine, but we'll say Self. This is good practice, because if you change the name of Lexer to something else in the future, you haven't then got to go through all of these return types and change them; you can keep it as Self.

In here we're going to say Self, and to create a struct, if you've seen other languages, you use curly braces and pass in the properties, or the fields. This works because Rust has implicit returns, so we don't need to explicitly say return Self. Rust will take the last expression in a function, if you omit the semicolon from the end, and use that as the return value. So it's implicit, which is great. Also, a bit like JavaScript, because the variable name and the field name are the same, we don't need to do the assignment; we can let the compiler figure that out for itself.

Okay, so let's create a lexer real quick. We can say let lexer equal, and this is in the current scope, so we can reference Lexer directly. It's not inside of a namespace, it's in the current scope; it's like writing two classes in the same file in PHP. This is a static method, I'll show you why that is in a second, so we can use colon colon, just like
PHP, even Java I believe. We can say new and pass in the contents. If I println this, it's not going to work, and that's because I need to add this. I'll explain this later, or in another stream. So if I print this out now, you'll see that we've got our instance of a... of the Lexer. Alexa, stop. She's listening. And we've got our contents field, and Rust has done some magic with this up here, which I won't go into, because again these are things that you can learn in a Rust course. So Sam, if you're still here, here is the answer. Why is Dan laughing? What's funny? Is it because I said that I'm not going to go into this, this is magic under the hood, I won't explain how it... oh yeah, she's annoying. I won't say her name because I'll trigger everyone's.

So Sam, if you're still here: you can validate if this exists. unwrap is almost like saying, I know this is going to exist, but in the case that it doesn't, I want you to panic, I want you to throw an error and stop running the program. If we didn't do this, let's rename this to maybe_contents, because this is a Result, we can actually use a match statement or an if. Again, this is going to be really weird for anyone who hasn't seen Rust, but basically everything in Rust is an expression. An if statement isn't an if statement, it's an if expression, or it can be used as an expression. So we can go down here and say let contents equal if, and we're going to want an if else here. This is really bizarre. When I first learned Rust this baffled me, because this is like unseen in other languages. It's like a ternary, but in long if, else, else if form, because it's an expression. So this is evaluated as an expression, and the value inside of the blocks is assigned to contents. We don't actually need parentheses here. "Wow, didn't know you were live, glad..." Yeah, hey man. We don't need parentheses; Rust doesn't enforce this. You don't need to add parentheses
around conditions, which is nice as well, and something that I'm not going to add to this language. So we can actually say here if let, or sorry, if maybe_contents.is_ok(). If you remember, this is a Result, so it's either Ok or Err. So in this if expression we can say, if maybe_contents is Ok, then we'll take maybe_contents and unwrap. We don't want this semicolon on here, because it's going to be used as the value for contents. And there's a little error here. Like I said before, everything has to have a compatible type, and in this case if and else have incompatible types, because up here I'm implicitly returning a String, whereas down here I'm not returning anything. So instead I'll put panic, which is another one of those macros. It's special, the implementation under the hood is magic: it takes this string that I provide and outputs it at runtime as an error. So I'll put a message in here, could not open file for reading, and now if I go back and run this with a non-existent file, you can see my error message is being shown instead. So that's how that works: you've got if expressions, and you don't have to unwrap up here. "No try catch, but instead check that it's okay, then implicit return." Yeah, basically. There's no real try catch; this is your try catch, an if expression to say if it's okay do this, otherwise do that. is_ok will be true if it's Ok, or false if it's an error.

You can do the same with this up here; let's do the same with this just for demonstration purposes. This is maybe_file, we don't know yet. So up here we can say let file equal, and we'll do an if else again, but this time we're not checking if it's Ok, we want to know if it is some value. Remember, Option is either Some or None. So in this case, if it's Some, we want to unwrap it and return the string, which will be the file name; otherwise we want to panic, we want to say, you know, expected a file. So we can use let inside of an if,
similar to PHP, if you said if ($error = $this->getError()), or this error, right. This is basically a let statement, or an assignment, inside of a condition, which is perfectly fine; you can do that in Rust. So we're going to say if let Some equals maybe_file, and this is Rust pattern matching. Like I said before, Option is generic, so if you were to return Some like this, you would have to pass in a value; you might pass in a string literal like hi. So now Some has an inner value. Again, sorry if this is going over your head, it's complex. Some is holding a value, but we need to get that value out. So up here we can do some pattern matching, and internally Rust is going to say, does maybe_file match this pattern? Is it Some, and does it hold a value? I'm just going to call this f. So now when Rust evaluates this, it's going to say if let, maybe_file, Some pattern like this: is maybe_file Some, and does it hold a value? And if it does, it will assign it to f. So down here I can say f, otherwise I can panic: expected a file. Ask me any questions if you've got them as well, by the way, I'll try my best to answer.

This is probably confusing. If we said Some string like this, say maybe_file equals Some string, Rust is going to say, well, it is Some, so that's true, and it holds the value string, so it will take whatever's inside of here and assign it to f in this case. So if we run this, it should work fine, and if we don't pass a real file, it should error still. And if we get rid of the file path completely, because maybe_file is None, this is failing and we're dropping down into panic. Cool, so let's get this back again, and we've got our output.

So let's actually get into some of the juicy stuff. Let's see how much we can get done in 15 minutes, let's say 15 minutes. Cool. I should have got a glass of water, the Pepsi is not doing it. Right, so we've got our lexer. I'm just looking for any questions in the chat. How many people are here? Seven, eight, cool. Right, so we've got
a lexer... Alexa's going off again. I'm just going to unplug her so she doesn't get on my nerves. RIP Alexa. So we've got a lexer, and we can add a method to the lexer, and we'll call it lex; I guess we want the lexer to lex something. So we'll create this method called lex, and like I said, this is where it gets a bit funny, because Rust doesn't have any sense of this or self in the generic sense. I can't come down here and say this.contents like you would in PHP or JavaScript; there's no this binding. Instead, if you want it to be an instance method, so if you want it to have access to whatever we've returned up here, you need to take in a reference to self, just like this. So this is the same as saying this, but we're saying self, and it's a reference to the instance, so we have access to it completely, which is great. If I came down here and said println self.contents, and then down here we say lexer.lex(), if we run this you can see that the contents get output, because this is saying we require self, so it's an instance method, and therefore we can access all of the fields on the struct.

Another interesting fact is Rust variables are immutable by default, meaning I can't come in here and say lexer.contents equals something. This won't work, because lexer isn't mutable, and to do that we can put the mut keyword in front. This is creating a mutable binding, or declaration, and it's having a go at me at the minute because the variable doesn't need to be mutable. There are a lot of compile-time checks that happen here. The variable doesn't need to be mutable, and that's because I'm not mutating it anywhere, but we will do at some point. So again, if I came up here and tried to change self.contents, let's do String::from hello, and compile this, it's going to have a go at me, because self is a normal reference; it cannot be written to. "So let acts as const?" Yeah, that's exactly it, Dan. Yeah, let by itself is the same as const; everything is
immutable by default. If you want to mutate something, you have to make it mutable using mut. And we need to do a similar thing inside of here. I've said that it's mut, it's mutable down here, but that's not good enough, because I'm only asking for a normal reference, I'm not asking... "Even if you're mutating object properties?" Yeah, even if you're mutating object properties, that's right, because you're still mutating something; even internally, you're still mutating something. And that's the issue here: self is an ampersand reference, it's not a mutable reference. So down here I need to say mut, and this will give me a mutable reference, which means I can say self.contents equals something else.

"Pass by reference in C works like this, and then work with b. Is there such a thing in Rust?" Pass by reference, yeah, you can pass references through to functions. It's a similar thing there: you pass through a literal reference, and then b ends up being a pointer to a, or to the memory address of a. Dan: "Yeah, just comparing it to JS, which allows object property assignment with const, I think." Yeah, that is right, because JavaScript is rubbish, and it doesn't make sense, because yeah, if you've got an array or an object in JavaScript under const, you can change it still, which is baffling.

Cool, so now we've got this mutable reference, so we're going to be doing something with the contents up here. There are a couple more things that a lexer actually needs. We've got the lexer itself and we're storing the contents, and what we're going to do is loop through this letter by letter, so we need to figure out what position we're currently at. So up here we're going to need some sort of pointer. What do we call it? Let's just call it counter for now. And we need to type this, so I'm going to say this is a usize, which is just an unsigned integer. It's a primitive type, and because that's a field I have to
declare it down here. We'll start at zero, because we're going to be looking at the zero index. What else does a lexer need to do? Let's think. So it's got the source. I'm guessing, actually, the source shouldn't just be stored as a String; we probably want it to be a vector or some sort of iterable type. So I'm going to go up here and use a generic type. We're going to say Vec, which is a vector; it allows you to push and pop and get the length. And we're going to be storing chars, so characters, because we're going to be looping through each individual character in here, and we can store each one as a char, a single character. So we've basically got a vector of characters here, which does mean down here we need to change this slightly. We need to say contents is contents.chars(), and what does this return? It's going to have a go at me, because it's returning a special type, Chars. So can we say to_vec? Oh no, I'm being silly, there is a method for this: contents.chars().collect(). So, a bit like PHP, you get a vector, which is basically a collection, an array object if you like. This is going to return a vector of characters, or chars. So in this case you're turning the string into an iterator and then collecting it. So this is great. Let's check that this compiles, which it does. Great.

What else do we need for a lexer? Hmm, I think this is fine for now. To get a very basic lexer going I think this is fine, because we've got a counter, so we know what our current position is, and we've got all of the chars. I'm going to rename this to source as well, because it's the source of everything, it's the meaning of life.

Cool, so let's do something else, let's introduce some tokens. I'm going to do this using a struct again. So we're going to have a Token struct, we're going to have a token type, and we're going to have a lexeme, which is basically the literal, it's the literal
form. Maybe I'll call it the token literal instead, and the literal can be a String as well. The token type is going to be some sort of enum. Enums are coming to PHP 8.1 and they're very similar. So we're going to say TokenKind, and this is where we're going to define whether it's an identifier, whether it's a string, whether it's assign, whether it's let. So let's start with identifier. Again, if you're not familiar with enums, it's just a way of structuring, well, not necessarily static, but named concepts. It's giving a name and a type to something, and you can type-hint these in a function. So we're going to have an Identifier, we're going to have Assign, we're going to have the Let token, and we're also going to have the String token. Great. So token type: we can come down here and rename this to kind, so that it's in line with the name of this enum. Yep, this sounds good. We don't need to do anything else here necessarily. We can implement this struct and give it a new method like we did with the lexer, and we're going to pass in the kind, which is a TokenKind, and pass in the literal String, and just return it, so we can say kind, literal. So now when we want to make a token, we can go Token::new, and we need to say the kind, so if it's an identifier we might say TokenKind::Identifier. This is how you create an instance of an enum. And we pass in a string; we'll use let. This isn't actually a String, so we can do as... oh sorry, to_string or to_owned, like this. I'll explain this a little bit later, but this is how we're going to use this method.

Cool, so let's try and see how fast we can get this into a token stream. Let's start with a vector. We're going to say let tokens equal Vec::new(), so we're creating a new Vec. Type annotations needed, which is fine. We need to make this generic, so I'm going to come in here; Vec is generic, so I'm going to say this is a Vec of Tokens, and because this variable's got a type hint, Rust can
automatically infer this here because i've said this is going to be a vec of tokens when i say vec new rust knows that this is going to be a vector of tokens because it looks back here i could also put the generic here um and call new like that and this is going to be the same thing uh this would mean i can remove the type hint or the the type declaration on on the variable uh but we'll do it like this just for fun i wouldn't i wouldn't normally do this i'd either do it one way or the other um but this is fine rust is gonna throw errors if we give it the wrong type anyway cool so let's think let's think we basically we want to loop through all of the characters from source and for each of them we want to match them up against some sort of pattern right um so let's create a while loop and we want to say while counters or sorry while uh self.source.length is greater than self.counter we want to do something down here uh and we'll just we'll just plus plus for now or sorry plus equals one if we run this it doesn't have an infinite loop because we're incrementing counter at the end and we're checking against the length of the source so let's start with a very very simple one let's use a match expression so match is also in php now uh it allows you sort of like a switch case but a lot more efficient it allows you to pattern match so we want to get the current token or sorry the current character at this index or at self.counter so let's do let's create a method for this we'll say char and we'll pull in self and we'll also put in actually let's create a current char method this is going to return a char because this is a vector of chars um so we can just say if self dot sorry counter is greater than if self.counter is greater than self.source.length we can return none this is a good example actually of using option we'll come in here and we'll say option so we're returning none so this won't be able to unwrap uh otherwise we'll return some char which will be self dot
source dot get we're going to get this particular index and we'll say self.cursor this is also an option um so technically we can we can actually just do this we can just say self.source.get self dot counter i think uh expected char found a reference to char uh can we just dereference it no do we have to unwrap it and then dereference yeah let's say char okay there you go so this is what we're going to do we're always going to return a char we're not going to bother with any error handling right now so while self.source.length is greater than self.counter uh we'll match self.current char and with a match we need to make sure that all possible cases of current char are covered which in the case of char is like you know every alphanumeric character any any character really so for now let's just use underscore which is like a wild card and let's just print line um let's actually assign this so i'll say c is self dot current char um and we'll match against c let's just print out c and let's make sure we don't get any infinite loops plus equals one and if we run this you can see we get every single character because we're just looping over them you know we're saying get the current char print it out and move on so we get let we get some white space as well which we'll handle uh because white space won't make a difference we get a single uh quote and this seems to be working fine so let's start with assign in rust chars are represented using single quotes like this and if i hover over it it will have a go at me because it's empty it's a character literal so i want to say if this is equals then we're going to drop down into a block down here and let's just put this underscore rust has a nice unimplemented macro which if it encounters something that's not implemented it will panic and let you know uh we've still got quite a few people here which is nice i'm glad you're still here sticking around again let me know if you've got any questions um if anything's not
making sense let me know i'll try to explain it um so we're matching against c we want to say if it matches this pattern in this case just a single equal sign or the assign token in our case we're gonna return uh a we're not gonna return sorry we're gonna self.push let's actually remove this as well let's do this we'll leave this empty so that it doesn't panic and but we'll move this inside of this match arm yeah so when you when you match a statement you can drop into this block of code here just this this block um and running this won't do anything because we're not uh incrementing down here uh so we'll just do the same down here as well um so it finishes fine but what we want to do is we want to generate or build a new token and we want to push it up into tokens up here so we'll say tokens.push and this is going to have a go at us because it's expecting some arguments and we need to push a token so we can say token new this is the method that we made earlier we're going to have token kind assign and the literal is going to be an equals to or how do we do this uh we'll do equals to owned uh this is going to have a go at us now as well because i'm trying to mutate tokens but it's not declared as mutable so i need to say mut and if i run this and it would help if i printed out the vector we're just going to print it out like this and tokens uh token doesn't yeah okay let's do all this um let's do it up here as well uh derive debug um in short this debug up here adds some methods to the enum or the struct that lets it print nicely uh when we're trying to output it using this debug format down here so if i print this out now we can see that we've got like an array-like structure that has a token and the kind is assign and the literal is equals and so this equals to owned bit is probably a bit confusing as well in php and javascript you have a string in rust you have a string and a string slice so this is a string slice it's not an instance of string it's not of
this type if you wanted to type hint this you'd use this instead of this i'm not going to go into too much detail but that's why i'm calling to owned rust is going to say well this is a string slice it's going to look at the definition for this function and see that it needs to be a string it's going to say i know how to do that i'll turn this string slice into a string instance instead or of type string with a capital s cool this is great so this is basically the lexer we're going to go through and we're going to match against certain tokens um and depending on what type of character we're matching against we're going to change the logic inside of here so let's start doing keywords um i think we can get this done pretty quickly to be fair um any questions any comments i might be going too fast i don't actually know um yeah just just let me know um cool can rust be integrated with php since the nature of is interpreted or compiled is different that's a good question php does have something called ffi which is a foreign function interface i believe and that basically means that you can use any c functions you know standard c um you can use any c libraries with php as long as they've got like a particular header file and some declarations and you can actually build php extensions with rust as well because rust has interop with c you can pull in external c modules into rust and you can compile rust into like a c header file equivalent um so yeah you can kind of use rust with php but that might be interesting to explore on a stream so yeah that is possible it's not something i've played around with i know um larry garfield who's working on the auto capturing multi-line closures with nuno i know he's uh done a bit of work with that and some blog posts but yeah it's definitely something that's possible uh the performance probably isn't great um or not as good as rust on its own obviously but yeah you can definitely do it um cool so let's let's start parsing out keywords
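Editor's sketch of the code described up to this point: a token kind enum, a token struct with a new method, and a loop over a vector of chars that only recognises the assign token so far. The names Lexer, lex, source and counter follow what is said on stream where possible, but the exact definitions are assumptions. One deliberate difference: current_char here returns an Option rather than unwrapping, so the loop stops cleanly at the end of the input.

```rust
/// Token kinds mentioned so far on the stream.
#[derive(Debug, Clone, PartialEq)]
pub enum TokenKind {
    Identifier,
    Assign,
    Let,
    String,
}

/// A token pairs a kind with the literal text it was built from.
#[derive(Debug, Clone, PartialEq)]
pub struct Token {
    pub kind: TokenKind,
    pub literal: String,
}

impl Token {
    pub fn new(kind: TokenKind, literal: String) -> Self {
        Token { kind, literal }
    }
}

/// Hypothetical lexer shape: the source as a vector of chars plus a counter.
pub struct Lexer {
    source: Vec<char>,
    counter: usize,
}

impl Lexer {
    pub fn new(source: &str) -> Self {
        Lexer { source: source.chars().collect(), counter: 0 }
    }

    /// Character at the current position, or None once we run off the end.
    /// (Assumption: the stream's version unwraps and returns a plain char.)
    fn current_char(&self) -> Option<char> {
        self.source.get(self.counter).copied()
    }

    pub fn lex(&mut self) -> Vec<Token> {
        // The type annotation lets Rust infer the generic for Vec::new().
        let mut tokens: Vec<Token> = Vec::new();
        while let Some(c) = self.current_char() {
            match c {
                // "=" is a &str; to_owned() turns it into an owned String.
                '=' => tokens.push(Token::new(TokenKind::Assign, "=".to_owned())),
                // Everything else is ignored for now.
                _ => {}
            }
            self.counter += 1;
        }
        tokens
    }
}
```

Calling Lexer::new("let foo = 'bar'").lex() with this sketch yields a single assign token, which matches the debug output described on stream at this stage.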
let's do keywords um since rust is very fast it seems yeah rust is very fast but i think the with the the foreign function stuff with php i think there's a performance penalty for sure uh because you're invoking it from an interpreted language and you've got to parse out header files and stuff like that there's there's an extra layer of complexity on top uh let's actually start with strings because strings are kind of interesting um so i'm gonna support single quotes for strings for now so we need to escape this single quote just because it's inside of single quotes just like you do in any language um and what we need to do with the string is we basically need to loop over it until we reach the end of the string so we're going to say have we seen this well if we have we're going to keep on looping until we hit another single quote i'm not going to deal with you know i'm not going to support escapes and stuff like that for now i'll just do string literals so we need to increase our counter by one and this is going to consume this it's going to put us past this onto the next letter and we can say while self dot current char so this is going to get the current character and we can say is alpha or is alphabetic or is ascii alphabetic uh we'll say alphabetic so we're going to loop here i missed a semicolon so what we're going to say is we're going to say whilst uh whilst it's alphabetic meaning it's not um or actually we probably want to say is alphanumeric or even while self.current char is not equal to one of those because we're gonna parse everything in between those two uh single quotes right uh so we're going to loop over the string until then so until it hits here uh we also need to bring up um we need to build up another vector so let's just call this buffer and this is going to be a vector of chars and we're going to say vec we'll do it like we've done above with the generic syntax and this needs to be mutable because we're going to be pushing to it so
while self.current char is not equal to this we're going to say buffer dot push self dot current char and we're gonna increment the uh counter as well so yeah this is gonna skip past the first uh quote it's going to create a new vector that will allow us to push things together or push characters into it and we'll say until we hit another single quote we want to get the current char and push it to this buffer because it's part of the string and we're going to say self dot counter plus equals one so that we skip over that character and self.current char then moves on um and then we'll say self.counter plus equals one as well which will get rid of the closing uh the the closing quote mark there and we'll create uh let string token equal token new token kind string and the literal will be the buffer joined by an empty string i believe uh sorry that's not an empty string that's an empty string or can we just say join on its own and this is going to have a go at us because the following trait bounds are not satisfied method cannot be called on vec char due to unsatisfied trait bounds uh okay is that because we're using chars maybe char as okay let's let's think about this do we want to do a vector of string slices maybe let's have a think [Music] let's think so why isn't this working it's a vector of chars right so um oh actually a better way of doing this instead of yeah instead of doing a vector of chars we could actually make this a string and we'll set it to string new so just an empty string and then down here we can actually say buffer dot push like this and push is now expecting a char we're we're pushing a single character to the end uh so we don't need this buffer here because it's expecting a string uh and in fact we won't have the temporary variable either we'll say tokens dot push like this and we've already skipped over uh that there we'll do that there instead so we're pushing the token and we're setting the counter to one more so if we run this now uh this
seems to have worked because we've got our assign and we're parsing out strings very naively mind because we're only accepting you know anything between two single quotes there's no escape codes or anything we can do that later um but that seems to be working um all right let's parse out some of these so how do we do this we're gonna want some sort of wild card again i think um so in pattern matching we can say wildcard and then have an if statement so we can say only run this wildcard block if um c is alphabetic uh so if we just print line uh test here if i run this it's gonna output test because l e t they're all uh alphabetic um we can we can kind of see this in action if i do this you can see it prints out test for l e t um f o o for um so this is good so we're gonna fall into here and we're gonna say um we'll we'll build up a string right so we'll say just like we did before uh let buffer equal string new and we'll put a type here why not doesn't need that but we'll do that we'll give the compiler some help so we'll say let buffer equal a new string um we'll make it mutable and we'll push we'll push c to it so that we can skip over it um this isn't cursor is it this is counter uh so we'll skip over that first character because we've already pushed it we'll say while self dot current char dot is alphabetic like this um we'll push self.current char we'll increment the counter by one and let's just print line uh the buffer like this and this should be good so there we go uh because we're not we're not accepting spaces so when it gets here it's not alphabetic anymore so it stops building the string and it just returns us the the word that we've built up so far eventually you know we'll have something like you can use underscores or dollar sign or something like that in a variable name but for now we'll keep it very simple so now that we've got the word we need to see if the word is reserved so uh down here we'll say match um we'll say let kind equal match uh buffer
like this so we can use uh we can use match as an expression as well so whatever we return from this match will be assigned to the kind variable and we'll say kind needs to be a token kind and again we've got uh certain patterns that we must match uh as a string so let's do um i don't want to do to owned everywhere because it gets messy so we'll say as ref and this will return um bytes basically i mean that's what it returns it just it returns bytes um so i'm guessing actually what we need to say do we need to collect no we don't need to collect let's say as str because now we'll be able to use these sort of ones here um so if it if the buffer equals let we'll return token kind let um if it doesn't match any reserved keywords we can kind of assume that it's an identifier so we'll return identifier instead and if i print line here again it's a it's an enum so we need to use the debug formatting we'll print out kind and we run this we've got the let and the identifier because remember we're returning those here it's implicit if the expression here is a single statement or a single expression it will just use this as a return value so now we need to do tokens dot push uh token new kind and then the literal will just be the buffer perfect so you can see we've got our let token we've got our foo token we've got our assign token and we've also got our string token um it's getting on it's 20 past 11.
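Editor's sketch of where the lexer stands at this point in the stream: single-quoted strings collected into a String buffer, and alphabetic words matched against reserved keywords with as_str. It is written as a free function named lex for brevity, which is an assumption, as are the bounds checks inside the inner loops (the stream's version relies on an unwrapping current char method instead).

```rust
#[derive(Debug, Clone, PartialEq)]
pub enum TokenKind {
    Identifier,
    Assign,
    Let,
    String,
}

#[derive(Debug, Clone, PartialEq)]
pub struct Token {
    pub kind: TokenKind,
    pub literal: String,
}

/// Hypothetical free-function version of the lexer loop.
pub fn lex(source: &str) -> Vec<Token> {
    let chars: Vec<char> = source.chars().collect();
    let mut counter = 0;
    let mut tokens = Vec::new();

    while counter < chars.len() {
        let c = chars[counter];
        match c {
            '=' => {
                tokens.push(Token { kind: TokenKind::Assign, literal: "=".to_owned() });
                counter += 1;
            }
            '\'' => {
                counter += 1; // skip the opening quote
                let mut buffer = String::new();
                // Collect everything up to the closing quote (no escapes yet).
                while counter < chars.len() && chars[counter] != '\'' {
                    buffer.push(chars[counter]);
                    counter += 1;
                }
                counter += 1; // skip the closing quote
                tokens.push(Token { kind: TokenKind::String, literal: buffer });
            }
            // Wildcard with a guard: only runs when c is alphabetic.
            _ if c.is_alphabetic() => {
                let mut buffer = String::new();
                while counter < chars.len() && chars[counter].is_alphabetic() {
                    buffer.push(chars[counter]);
                    counter += 1;
                }
                // match is an expression, so its result is assigned to kind.
                let kind = match buffer.as_str() {
                    "let" => TokenKind::Let,
                    _ => TokenKind::Identifier,
                };
                tokens.push(Token { kind, literal: buffer });
            }
            // Whitespace and anything unrecognised is skipped.
            _ => counter += 1,
        }
    }
    tokens
}
```

With the input let foo = 'bar' this sketch produces the four tokens listed on stream: let, the identifier foo, assign, and the string bar.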
um [Music] before i finish then let's um let's add support for double quotes as well because why not let's change this to double quotes so if we head back up here we're saying if it is a single quote and we can match multiple patterns a bit like in php we want to say or if it's a double quote um we'll get the let mut char equal self dot current char and we'll say uh char is not equal to that and char is not um actually no a better way of doing this we can we can leave this as is if i just undo this uh because we've got this c variable up here which we can access because it's still in scope um oh well we've still got seven or eight people who knows um because we've got this c variable up here and it's in the scope of this block we don't need to have like a if if self.current char is not equal to single quote and it's not equal to double quote because if we start with a single quote we don't want it to end in a double quote so because we've got the char we can say if self.current char is not equal to c which remember is pointing up here um we'll continue to read it so this should just add support for double quotes and it seems to work um great okay that was easy enough um one more thing then let's add support for uh escaping quotes like this um so effectively what this needs to do is if we encounter a backslash and the next character in the string is the same as the opening character we just need to ignore the backslash and push the the character after so we can say if or if um if this is really not optimized in a minute it's very naive but if self.current char equals a backslash we need to double escape that uh so if it equals a backslash we'll put this into the else block because that's what we want to do by default uh if it equals a backslash we'll skip over it by incrementing the counter by one i think that makes sense um and we'll [Music] push push the next character because it's escaped that works but it doesn't work um [Music] yeah so how do we want this to
work we can actually get rid of that else block for now because by doing this we'll we'll skip over the backslash and it will end up on on the double quote anyway and because we're already inside of the block it'll push that skip over it and carry on i think just make sure we're not broken anything why is self.counter not let's just print out self.current char inside of there so we do get inside of it um oh i think i think it's because it's a string when rust is outputting it it outputs it like that um i think we can test that just by doing this and printing out buffer yeah okay so it does just push the the double quote on its own it's only in the debug view that it escapes it's still in the output okay and now this this will kind of work like i said it will work for double quotes or single quotes if we go back to single quotes uh it will do this um you see it's got the the single quote there um yeah it's got it's got the single quote there but in the case of this where we're doing backslash double quote this doesn't need to be escaped uh so we need to add the check for that as well we'll say if we can say match self.current char again we'll do a bit of a nested match yeah it's not not the prettiest but it does work uh so we'll match self.current char if um we don't want to do this here we want to move this out we want to say if it matches c meaning it needs to be escaped uh we'll do this so we'll increment counter by one so we'll skip over this um otherwise it won't do anything we'll say underscore and just do an empty expression like this perfect i think this is working uh so let's just go down here and we'll say print line uh buffer so if i print this out the is still being skipped um so if it equals that we don't need to escape it um oh hang on no let's do it like this instead let's print line again this is sort of nested spaghetti now but we'll print that line okay so it is doing that um do you because it's a backslash is it behaving weirdly maybe um don't
really know let's have a think okay a moment of silence to think that i think we can say if in here we'll say match self um let's just let's just keep it really naive for now we'll just increment counter by one i think that's fine we'll increment counter by one and then that way um it will just skip over it if we if we escape that's fine for now you know this is very basic lexer stuff it's it's nothing complex um so what we've what have we achieved um we've parsed things out if i just go in here and say bar um where's cargo there is we're getting that token string let me just remove that print line quickly um we're getting that token string which is great if we go back we're successfully tokenizing identifiers or keywords um identifiers single tokens or sigils and strings before we finish up let's parse out numbers as well so my plan is to treat it a bit like javascript where one two three four five uh isn't necessarily an integer it's a number um and one two three four dot five is also a number so all numbers are represented as uh 64 bit floating point numbers or an f64 so let's do number as a new token kind uh literal can stay as a string because we're not going to parse it out yet that doesn't really matter we're just tokenizing we can add another bit here and we can say if c is numeric like this let's add an example in here we'll say one two three four five so this is numeric we can basically copy this entire block here but we can say is numeric um we want we don't need this and this is going to be token kind number and this should technically work uh option unwrap on a none value line 114 which is down here um okay um so if it's numeric we push the character let's just print line hi so we get to hi just move this about to figure out where it's going wrong we'll just print out c like this so it gets to one but it's not getting past here it's not incrementing it so while self current char is numeric so whilst it's numeric we push the current char we set the
counter and we go back to the top do we get down here with a buffer no we don't get there and let's put it there a bit of debugging okay um oh this okay this this is because it it's getting to the end here and it's trying to get the current char um so this this is kind of where the the option safety comes into play because we have some and none um so let's say self self.current char dot is oh wow um okay let's let's refactor this a little bit let's do option char let's not unwrap it let's return a char reference this is fine um we'll say so while the source.length is greater than self.counter so that shouldn't ever happen should it um any ideas any ideas any ideas i think because because basically we want to say you know whilst there are some characters in place run this loop um why don't we change this into a loop which is like an infinite loop just a while true uh we can say let c equal self dot current char if c dot is none we can break break out the loop uh and we'll say let c equal c dot unwrap um and we we kind of know that there's nothing here um we should probably up here say let current equal self dot current char and here we can say while current uh dot or while current is some and current dot unwrap equals that if current.unwrap is not implemented for a reference reverse out of all this let's go back to where we were with the loop before i'm sure i'm just being silly so whilst it is numeric increment counter it pushes up yeah i guess we kind of need a check here to say if if self.counter is less than self dot source dot length very naive uh we can say self.counter plus equals one and i'll take this down here as well so let's fix it it does not um hmm plus equals one let's go back to here and let's go back to there as well so we just want to we want to parse out a number um hmm okay it's the last thing i'm gonna do and then i'm i'm i'm gonna end the stream uh because we've made good progress i just want to figure this out uh i don't like leaving a bug or
an issue untackled so do we need to kind of change how we're looping maybe because what i'm thinking is up here we're just saying while while the length is greater than the counter um the length is going to be one more than the index so we'll we'll at some point when we reach the last character we will be going higher than the index but because we're still in this loop um where is it because we're still in this loop it's going to go back up and it's trying to unwrap current char so actually let's let's change this to a loop um and we can say if self.counter i don't need the parentheses if self.counter is greater than or equal to self.source.length and we can break out of the loop and again very naive we can refactor this later otherwise we'll push the current char and we'll increment the counter by one yeah that seems to work cool so we've got let foo equal one two three four five very quickly we will um check that it works with floats which it should because it's just a string so we've got one two three four five six perfect uh and something that i actually want to implement now is number separators uh so we can check so this is a hundred thousand so this underscore is a number separator um so this should parse it still because we've we've removed that is numeric check um so we can say if self.counter is greater okay and we can also say if self.current char if it's not uh is numeric or self.current char uh if it's not numeric we can uh break that won't work for underscores and periods so we need to put this up here we need to say if self.current char equals a dot or self dot current char equals an underscore um if it equals an underscore we can just skip over the underscore uh which will work in this case and we end up we end up with a hundred thousand which is fine um if if it's not numeric and it's not equal to a period that should work too so let's just do dot zero zero perfect cool so we're parsing uh sorry we're tokenizing or lexing numbers with number separators
as well as floats uh you know with a period and then some decimal places afterwards cool so i'm i'm gonna put this on github and i'll i'll put the link on twitter i'll put it in the description as well to the repository so that we can uh come back to this in the next stream and carry on the next step will be sort of uh making this a bit more efficient you know we're calling self.current char everywhere uh which is doing an index lookup and unwrapping it and dereferencing it so we'll do some optimization and we'll implement some new tokens uh like a function keyword um some parentheses and stuff like that and then we'll we'll move on to the next step of parsing this into an abstract syntax tree so to the seven or five people that are still here thank you very much for tuning in and i'll see you next time cheers
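Editor's sketch of the number lexing described in the last part of the stream: digits are pushed to a string buffer, underscore separators are skipped, and a period is kept so floats come through. The function name lex_number and its slice-plus-index signature are hypothetical. Two choices here are assumptions rather than what is shown on stream: is_ascii_digit is used in place of is numeric so the literal always parses as an f64, and only one decimal point is accepted. Using get instead of unwrapping avoids the panic on a none value that was hit when a number sits at the very end of the source.

```rust
/// Reads a number literal starting at `start`.
/// Returns the literal text (separators removed) and the index just past it.
pub fn lex_number(chars: &[char], start: usize) -> (String, usize) {
    let mut buffer = String::new();
    let mut counter = start;
    let mut seen_dot = false; // assumption: allow at most one decimal point

    // get() returns None past the end, so the loop stops without panicking.
    while let Some(&c) = chars.get(counter) {
        match c {
            // Number separator: skip it without storing it.
            '_' => {}
            // First period: keep it as the decimal point.
            '.' if !seen_dot => {
                seen_dot = true;
                buffer.push(c);
            }
            _ if c.is_ascii_digit() => buffer.push(c),
            // Anything else ends the number.
            _ => break,
        }
        counter += 1;
    }
    (buffer, counter)
}
```

Because every number is meant to be an f64, the returned literal can later be converted with the standard parse method on strings, for example literal.parse::<f64>(), once the parser stage needs the value.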
Info
Channel: Ryan Chandler
Views: 2,472
Id: HrQZBExoV3w
Length: 93min 36sec (5616 seconds)
Published: Sat May 01 2021