GopherCon 2021: Alan Shreve - Becoming the Metaprogrammer Real World Code Generation

Captions

Hello GopherCon, I'm delighted to be here today to talk to you about one of my favorite topics in software engineering, which is code generation. We have a lot to get through, so let's jump right in.

What we're going to be going over today: we're going to start off by talking a little bit about what code generation is and why you might choose to use it. Then we'll get into the meat of the talk, surveying some of the practical applications of code generation and some of the popular tools in the ecosystem, and talk a little bit about how to use them and why you might use them. We'll also jump out a little bit and survey some of the other ecosystems, and code generators that expand beyond Go. Then we'll work on actually writing a code generator together, and we'll wrap up by going through some best practices.

So let's talk a little bit about what code generation is. But before that, I just wanted to introduce myself. I'm Alan Shreve, the founder and CEO of ngrok. If you're looking for me on the internet, I am @inconshreveable.

So, code generation. Very succinctly, code generation is about building programs that produce other programs as their output. For newcomers to programming, that can sound like a very intimidating definition, and it really shouldn't be. Code generation at its heart is actually quite simple. When you think about it, you might be saying: well, if I'm generating a program as an output, isn't a program just text? Isn't it source code? If you put aside the meta discussion about what a program is and focus more on the concrete implementation of it, our programs are source code. So building a program that writes other programs is really just about producing the text for that other program. And even when you're really just getting started, you're very used to writing programs that produce
text outputs. So if you're just getting started, that's a really easy place to find a point to hold on to and really get into this.

So this is the simplest version of a code generator: it is a program that produces some output, which is text that would compile as a program. But there are very few code generators that work like this, that you just run and get a program as an output. Instead, when we're talking about real-world, practical code generators, they tend to have two main sources of input. One is the source code that they're using as the reference input, and the other is some kind of configuration. So a code generator will often read source code and then use some configuration to decide what to output. A really common source of that configuration would be command-line parameters saying what to generate, but a configuration file, or some combination of the two, is just as popular a choice.

Not all code generators actually read source code for input. Sometimes a generator really is just reading from some kind of configuration, and in that case the configuration serves as both the source material and the configuration knobs on it. A good example of this might be something from the Rails ecosystem, where you're generating scaffolds and templates as a really quick way to get started without having to write the boilerplate. This is a good example of a piece of code that runs to very quickly generate a stub for you to start filling in.

So before we actually get into surveying some of the different types of code generation, I want to talk about why we would choose code generation. We're going to come back to this at the end. Trying to understand why we do code generation is a little bit challenging without having seen some of the examples, so we'll do this to start with and then we'll circle back at the end. But at a
very high level, one of the main reasons we choose code generation is to write code that we as humans don't want to write. There are a bunch of second-level objectives in using code generation tools and writing code generators. These would be things like code reuse, getting better performance, or supporting multiple languages. You can do all of these things as humans writing code, but it can get very verbose and tricky compared with writing something that will write the code for you. So, like I said, we'll circle back to these and tie them back to the examples that we're going to go through.

Let's get into some of those examples. We're going to go through a survey of practical code generators. We're going to start just with code generators in Go, and then from that we'll jump off to some of the other code generation tools that you're probably familiar with in other popular ecosystems.

Getting started, we want to talk about pretty much the de facto intro code generator for any tutorial in Go that introduces you to code generation, and that is the stringer tool. What stringer does is this: given an enum, or as much of an enum as you have in Go, which is a set of typed constants, you often want the ability to render a constant as a string, where the string value you want is the name of the constant itself. Of course you can write that routine yourself, but as you add and remove constants, it's just annoying to keep it in sync with the method that turns it into a string. So instead we can write a program that will do that for us. We don't have to, though, because there is a first-party tool that will do that for us, and it is called stringer. Down here at the bottom I've got an example of using stringer, where you specify the
type in a particular file that you want to generate String methods for, and then over on the right you're able to call that String method to return the name of, in this case, Ibuprofen. If you actually look at the code that it generates, this is it. You don't really have to understand it. It looks a lot more complicated than you would expect if you had written it as a human, and that's because it's space-optimized in a way that humans wouldn't normally write it. It comes back to one of the things that we were talking about earlier, about code generators allowing us to write code that is more performant than we as humans might write ourselves.

Let's get into another code generation tool. This one is called gomock, or mockgen. There are a number of tools in the ecosystem that look like this; this is one that we use at ngrok. When you're working on testing code, you want to mock out all of the dependencies so that you can test just the piece of code that you're interested in and not have to worry about calls to external systems like databases, or working with the file system. The way that you can do that is to call all of those other services through interfaces. What that means is that when it comes to testing time, you can substitute in a mock object for those interfaces that will not talk to a remote service and will instead do something else. Of course, you have to write that mock object, and that is often a lot of work, especially because as you change interfaces you have to continually update your mocks to match. So what you can do instead is use a tool like mockgen, which you point at an interface, and it will then generate an object that matches that interface for you. The nice part about it, of course, is that it also generates a whole bunch of methods that allow you to quickly define what it should do when those methods are called, as well as
what to expect. So when you're writing a test harness like we have over on the right, what you can do as part of it is instruct the mock what you expect to be called in the running of the test, so you can guarantee that the external dependency was exercised correctly. If you take a look at the code that's generated behind it, this is just a snippet; there's actually quite a bit of code that's generated to make this happen, which is one of the reasons that you obviously don't want to write it as a human. If you squint, you can get a sense of what's going on here: the code generator has defined methods that record the calls that are going in and record what you expect, so that at test completion it can match what was called against what you expected to be called.

Let's continue on. This is a tool that goes into the category we talked about, getting to better performance. When you're using the marshalling and unmarshalling functions of Go's JSON encoding and decoding routines in the standard library, those routines do their work through reflection. That means that they are dynamically inspecting the object at runtime to determine what JSON to output. But doing that kind of reflection can be slow, so what we can do instead is write a tool that will look at the structures that we want to serialize and deserialize, and generate ahead of time optimized routines for the serialization and deserialization of those objects. Since those don't need to be calculated at runtime, it allows us to create more efficient marshalling and unmarshalling routines. The ffjson tool is one example of this; there are many that look like it. It generates those routines for you. They're actually quite large, which is rather common in performance-oriented generated code, so I've omitted that from the slide here, but that hopefully
gives you an idea of where that code fits in. It fits into those functions that are known by the json package to indicate an overridden method that provides a specialized implementation of JSON marshalling.

Moving on to another tool: this is the first one that we're going to look at that doesn't take Go code as an input. The previous tools we've looked at take Go source code as input and produce Go source code as output. This tool is a little bit different. It is called sqlc, one of my favorite code generators, and what it does is read SQL code as input and produce Go source code as output. For those of you who have written SQL code in Go, you know that it can be quite frustrating, for two reasons. One is, again, that it's quite repetitive, especially when you're working with large result sets: you need to iterate over many rows, and the code that you write to walk over those rows objects and check all the errors is very repetitive. The second is that it lacks a lot of type safety. It's very common to change the columns in a database, or change a type, and forget to update your code so that they mismatch, or even just change the order you query columns out in a SQL query, and run into runtime errors.

So what sqlc does is take a definition like the one that you see on the left of this slide, which is the DDL of your table. It constructs an understanding of what the types are in the table and the types of the queries, and then generates code that allows you to write calling code like you see on the right-hand side. If you take a look down at the code that actually calls GetAuthor, you're just passing in a context and an ID, and it's running that query for you without all of the ceremony, which I'll show you in the next slide, that you typically have to do around querying that row out and scanning it into the individually named columns. Anyway, we use sqlc all the time at ngrok, and it has made working with the
databases tremendously easier.

Let's move on into some of the other ecosystems. This is the C preprocessor. The C preprocessor is also a code generator. It is a program that runs before you invoke a C compiler, and it is essentially doing processing to determine the output program that is then going to be handed off to the C compiler. This is an extraordinarily simple use of the C preprocessor, just switching between two different statements that will be included in the output program depending on some definition that's set on the command line. But actual C preprocessor directives can be quite advanced, including all sorts of other files together and making complex decisions even before the program has been compiled.

Jumping back into the Go ecosystem, I want to start branching out into other areas of code generation that aren't necessarily thought about as code generation, or that diverge a little bit from the things that we've talked about already. gofmt is a really good example of this. gofmt is a really interesting code generator in that we don't normally think of it as one, but it is: it produces a program as its output after reading a program as input. It doesn't take a whole lot of configuration and, interestingly, it doesn't add anything new to the program; it simply restructures the text of the program. But at the end of the day it is still, in fact, a code generator.

Branching out even further, one of the very popular code generators, used especially in the Go ecosystem but multi-ecosystem and multi-language, is protoc. protoc is part of the Protocol Buffers and gRPC ecosystem. It is the proto compiler, which takes as input an IDL, an interface definition language, that describes structures that you want to compile and service definitions of RPCs. From that it can generate marshalling and unmarshalling code, as well as the actual server and client
side stubs: client-side stubs to call remote services that meet that interface definition, and server-side stubs that can be used to handle calls for those APIs. It allows you, as the application developer, to focus on just the application code that you need to call a service or implement a service. The really interesting thing about it, compared to some of the other code generators that we've talked about so far, is that it outputs to multiple different languages. The code generators we've looked at so far output to one language or another, but protoc, and what we'll look at afterwards, generate code for multiple languages. This is a really interesting property of some code generators: they allow you to write, in some ways, at a higher level than the languages themselves by allowing you to generate across many different languages. With some code generation tools you're often working around limitations in the compiler or limitations in the language definition, but these are really working across multiple ecosystems, which is quite fascinating.

I'd be remiss if, after mentioning protoc, I did not mention Swagger and OpenAPI. When you zoom out to a very high level, they look very similar. At the end of the day, Swagger and OpenAPI are about defining an RPC layer; it just happens to be REST, HTTP, and JSON instead of protobuf. But you're still generating from it the same kind of client libraries, server stubs, documentation, and other things that you can generate out of that definition language. So there are a lot of interesting parallels between those two.

Let's start talking about some other code generators that maybe are not as well known as code generators. A really good example of this is go test. go test is a code generator, although from using it you may not know it, because the generated code is
ephemeral. You've written your tests in your Go program, but nothing is calling those functions. When you run go test, what it does is read the source code of all of your tests, identify the names of those tests, and take some input from command-line parameters about which tests you want to exercise. Then it writes a new program, a new piece of source code that invokes those tests, which is the test harness; it compiles it and then runs it. An interesting way to look at this: all of the Go toolchain has options to show you what it is doing under the hood, and go test works like that as well. If you pass it the -x flag, it will show you the raw instructions that are running underneath, and if you dig through the output you can find the place where it actually links together that test program and ultimately runs it, to run the code that it generated.

Sticking in the testing world, we'll also talk about another code generation step inside of testing, which is the cover tool. The cover tool, being part of go test, is of course doing the exact same code generation we just talked about in the last slide, but it also does another piece of code generation, which is quite fascinating. The coverage tool's goal is to tell you which pieces of your code ran when a particular test was run, and there are many ways to do this. You can instrument the runtime to let you know when a particular statement was called. But another way to do it is what the Go tool does here: it takes your original source code and basically creates a new copy of the program with a whole bunch of statements injected that mark when particular statements or blocks of code were run. This is a really interesting solution. In solving this coverage problem, you could implement that in the
runtime, or you could implement it at compile time, and which you choose really depends on what is most practical. In this case the authors found that rewriting the input source code was a more practical way to solve the problem than trying to add new hooks and instrumentation to the runtime to measure the same thing.

Let's branch out into some other ecosystems and talk about other types of code gen. For those of you who do browser development, I'm sure you're familiar with tools like webpack and esbuild, which take many different JavaScript or CSS files and combine them all into one. These tools are large and expansive and have many more use cases than just that, but it is certainly one of the popular ones. Ultimately what such a tool is doing is taking JavaScript or CSS as input, across many different files, compiling them all together, and generating a new program as an output that is sometimes optimized or minified or combined. It doesn't really cause a behavioral change; it is more of an optimization change. So that's another example of a different type of code generator, one that's moving between two different places without changing the meaning of the program.

In the same ecosystem, Babel is a similar program, which reads one version of JavaScript and compiles it to another version of JavaScript. This is really common when you want to use the latest JavaScript features but also want to support older browsers. What you can do with Babel is write your code using the latest JavaScript, and it will compile it into code that works in all of the older browsers. It does so by doing code generation: understanding what needs to be done, rewriting the code to make that work, and outputting a new program. It can do that for other types of transformations as well, like moving between React .jsx files and the raw JavaScript that
browsers need to interpret, but it is fundamentally also a code generator. This is an example of what that looks like for Babel: over on the left-hand side we're using a fancy new spread operator that older browsers don't understand natively, so it needs to be compiled into something that they do understand.

That type of transformation, between one form of a program and a different version of it or a different language, is sometimes called a transpiler, like a translating compiler. Another example of that is TypeScript. TypeScript, in the same ecosystem, takes a typed version of JavaScript, which allows you to add type annotations, and recompiles it into JavaScript. So again it's taking a piece of source code and translating it to another one, and it also falls into the transpiler bucket of code generators.

As we've zoomed out into these different kinds of code generators, things like transpilers, the go cover tool, Babel, the C preprocessor, and gofmt, you start to wonder: what is code generation, really? Those are very different programs serving very different purposes, and they're a long way away from where we started with something like stringer, which generates additional code to compile into your Go program. It seems very broad, and there are other things that seem like they should fit into that category. Isn't the Go compiler itself a code generator? It reads your Go source code as input and it outputs assembly or machine language as its output, and that's another program, so that too should be a code generator. There are programs that do that at runtime, like V8 or the HotSpot VM, that are interpreting code and rewriting machine code on the fly, doing that translation and outputting a new program. Ultimately, that's code generation as well. There are
other examples, like Rust macros and C++ templates: are those code generators? Markdown is not a Turing-complete translation, but you're still moving from one type of markup to another; is that a code generator? You start to wonder, is everything a code generator? Are you a code generator? Am I a code generator? I interpret instructions and then produce code as an output, so doesn't that kind of make me a code generator? Probably. That's a fun little meta tangent to be on, but not ultimately a practical one.

So, pulling it back down to why we're using code generators, and focusing more on the ones that we started with, generators that produce source code that we will then use in our future Go compilation steps, we can find a number of places where the tools that we've looked at fit into the benefits that we were talking about up front. I will tell you that these categorizations are fuzzy. Saying that protobuf is only about multi-language support does it a disservice; it also improves type safety and code reuse. What benefits you get out of code generation really depends on your alternatives, on what you would do instead. So it's hard to pin down what categories all these things fall into, but hopefully that gives you a good idea of where these tools fit in, the benefits that they're providing, and why we're looking at them for code generation.

We've looked at a bunch of tools that already exist in the ecosystem for code generation, so now it's time to write one of our own. I really want everyone to walk away from this talk understanding that code generation is an easy thing to get started with and an eminently practical thing to do. To do that, we're going to write a code generator together. We're going to motivate it with a very practical use case that we actually have at ngrok, which is user-facing errors. What I mean by user-facing errors is, I
want you to imagine that you are implementing a piece of a web service application, something really simple where you have to take in an email address, and before you work with that email address you obviously need to validate it. That's a very common thing to do in programs. Maybe you need to make sure the email address isn't too long, so that it can fit into your database. Maybe you have to make sure that it looks like an email address by matching it against a regular expression, things like that. When you're first starting to write code for this, you will typically use the standard library functions for working with errors, like fmt.Errorf, to create errors when the input isn't valid.

However, when you're working in organizations and trying to create reusable systems, you have users who use your system and they will say things like, "My email is invalid. Why doesn't it work?" This is a very frustrating thing to hear, because you don't know why it didn't work. Someone just told you that it didn't work, but it's not clear what problem they ran into. So one of the things that we've really tried to do is to make it a lot easier for users to communicate to us about why a particular error happened and why they ran into the failure that they did.

One example of how you might do that is to add some kind of unique string to all of your errors, something that is very easily identifiable. In here I've added "error email too long" or "error invalid email", and these are unique strings that users can latch on to and see: this is a very unique error code that indicates the problem that I have. When you do this kind of thing, folks who are using your software will often take that error and plug it into Google, or they'll send it to support, something like that. This is a very common practice if you're working with large-scale software systems like a database or
an operating system, which tend to have defined error codes for all the different errors that they can encounter.

However, jumping back, one thing is that this is very ad hoc. This is just a string that you've embedded inside of another string. There's no indication to anyone else who's going to write the next error message that they need to create a unique error code. So how can we start to get more structured about this? One way is to create our own function that takes the unique error code as a parameter. This starts to get into a place where the libraries that you create help the developers who are building the software understand that they need to follow a convention to create these kinds of errors. That's a good start, but these are still strings. We can still do things that are unsafe with them, and it's still easy to not be regimented about doing this correctly.

So how can we take this to the next level? One example would be to actually create unique functions for every single error that you want to create in your software, and have those functions output the string errors that we were seeing earlier, with the appropriate formatting and the unique error codes. When you do this, it also allows you to return specialized error types that can include the code, so that you can potentially inspect them for observability purposes, or have higher-level pieces of the code actually work with those errors and switch on the particular error conditions.

Writing each of these functions, though, would be really time-consuming. If you have hundreds or thousands of errors in your program, you don't want to hand-write a new function every time you want one of these errors. So what's a way that we can get around that with code generation? One example would be to define your errors in a reusable format. A YAML file, for example, is a place to start, and that's what we're going to work with here. This allows
you to define up front the set of errors that you want to use. It's very easy to add a new one, and out of it you can create those functions that we were working with. Interestingly, it means that you can also create functions for any number of languages that you want to work with and have a unified error directory of all the errors that your software produces.

So what does this look like? We're going to write together the code generator that reads that YAML file and outputs the error definitions that we want to use in our software. This is the very simplest version of it; we're not going to go into the entire thing. What I've done here is define two types that I expect to get from reading that errors YAML file, including an error definition that has the name of the error, the message of the error, and the variables that I can interpolate into each error message. Parsing out those variables isn't too hard, but that's what we expect this load defs function to do. Then at the end, all we do is execute a template with all of our error definitions in it.

To see what that looks like, let's take a look at the template. This is the template that generates the output from that errors YAML file. It's a standard Go template, which outputs the type that we're going to use for all of our generated errors, and then it ranges over every one of our error definitions and outputs a unique function for each one of them, where each of the parameters into the function is guaranteed to be correct and type-safe. Reading this template definition can get a little hairy; it's a little difficult to parse really quickly. So, just to walk you through it, this is really the meat of it, where we're doing this for each error in our definitions file. This is where one of those names substitutes in, which is the name of the function and also the name of the struct that we create. The parameters then get
substituted in, which is both the type and the name of the parameter in some cases, and just the name in the actual formatting of the message, and then finally the message itself. Since there's only one parameter in each of the errors that we were looking at, the loops also disappear. You'll see that there are some extra commas in there, but don't worry about that; that's still valid. What that means is that at the end you call these functions, which are type-safe and generated from your unified error directory. You can generate them for all the different pieces of software and all the different languages that you use, and still have a unified definition for them all.

So that was a really quick way to get started with code generation, a quick example of how just a few lines of Go code and a standard definition in a language like YAML or JSON can be used as the source for that code generation.

Now I want to talk about some of the best practices when you're writing code generators. You may have seen some of those in the example that we looked at, but I want to go over them in more detail here. There are a number of different best practices, and we're going to walk over each one of them relatively quickly.

The first one is one that you probably saw in the code generator, which was a comment at the top of the file that said something like this: Code generated by "stringer -type=Pill"; DO NOT EDIT. This is the generation comment that the stringer utility we talked about earlier leaves, and it says do not edit. This is pretty important: it instructs humans not to edit the code. If you don't include it, it can make for a poor developer experience. When developers look at code, they assume that they can change it or modify it to their needs, but then it becomes out of sync with the source that was used to generate it. As a best practice, this comment should tell humans not to edit the code; it should
redirect them to the source of truth, where they would want to work to actually make a change to the generated file, and then ideally also help them understand the command that they're supposed to invoke to actually run that code generation step. If the definition comes from a single unique file, including the source file itself that was used to generate it can be a really helpful addition as well.

Moving on, and in a similar vein of really focusing on developer experience: isolating generated files serves a purpose both for developer experience and for making it easy to write the code generator. I suggest that when you do code generation, the files you output use a distinct suffix or are put in a separate directory, away from human-written code. Isolating code files in this way really helps developers quickly understand what is machine generated and what is human written, and that has a number of benefits. One is that when you're working with tooling like grep, for instance, it's really easy to filter out files that match a certain pattern or sit in a certain directory, because you're often looking for things that are human written and not code generated. So that's one reason, the good developer experience. There's a side benefit when you're writing code generators: every time you run the generator you need to remove all of the old files, so making it really easy to remove all the old files becomes a benefit in the construction of these generators as well.

Let's talk about the actual implementation of the generator. In our errors example, what you saw is that we used a string template to actually serialize out the code that we were generating. It's very tempting, when you start going down the code generation rabbit hole, to find a couple of packages in the standard library like go/ast and go/printer and say: ah, an abstract syntax tree, that's what I
want, and to try to construct the code that you want to generate as a set of AST nodes and then use the printer package to output them. It's often not the right thing to do, and almost never the right thing to do when you're getting started. String templates usually are a better choice to get started with. That's not always the case: there are some cases where you do want to be working at the AST level, and transformers like gofumpt are a really good example. But the thing to keep in mind is that code generation can often be difficult to read. We walked through that very simple template one piece at a time because it's difficult for humans to parse. So trying to keep things as simple as possible to reason about is paramount for the maintainability of your generators: the simpler you can make them, the better. That often means that just starting with string templates for writing them out is the best choice. The nice part is that you can use those string templates without having to worry so much about the format of your code, because you can always gofmt it afterwards, which should always be the last step of any code generation tool you write in Go.

I would be remiss in a talk about code generation not to mention the go generate tool. This is the canonical way to tell the Go toolchain about a code generation step that has to run before you compile. Before the go generate tool existed, the difficulty was that code generation steps exist outside of the normal toolchain of compiling a Go program. The Go toolchain is go build and go install, but if you have to run something like stringer or a mock generator beforehand, it's not clear that as a developer you're supposed to do that. So what we added to the language was a directive that the Go toolchain understands: it names a program to run, a particular step that can run before the
compilation step, and there's a command, go generate, that you invoke to go find those directives and run them. There's an excellent blog post and documentation about it. It is kind of functionally equivalent to make, but in a very Go-specific way, and one that works cross-platform without having to make sure that you have any other tools installed. So if you're building these things into your Go programs, it's definitely the preferred way to make sure that those generation steps are easy to access for other Go developers.

The last point I want to touch on is the question: should I check in my generated code? There's no one-size-fits-all answer here; there are pros and cons to going in each direction. On the plus side, when you check in generated code, someone who is contributing to or working on the project can clone the code and just invoke the Go toolchain to build or install that program, without having to first figure out what other steps they need to run to generate the code they want to work with. In some cases, if you didn't check in the generated code, your program wouldn't even compile. So that's one thing to certainly trade off. It also avoids adding generation steps to the build process, where those can be more manual, human steps if they're things that change very rarely. On the other hand, it does make pull request review quite challenging. When you're looking at incoming changes and there is one line that actually needs to be looked at, but you see that there are 10 or 50 files that have changed because of generation, that can make reviewing those kinds of changes quite challenging. To address that problem, there is a file that you can use called .gitattributes, which allows you to annotate a specific set of glob patterns as being generated, and some source control managers
like GitHub, for example, will pick up on this to know that there is a certain set of files that are generated, and will automatically collapse and hide them, so that your human reviewers know that they're not things that they need to look at and can instead focus on the pieces of code that do matter, which are the ones that were written by humans.

All right, so just quickly recapping what we learned: code generation is at its heart about programs that output other programs. There is a wide variety of use cases for it. It can help you solve a tremendous number of problems and makes it really easy to produce more code with fewer lines of code. The other thing that I hope everyone took away is that code generation is eminently easy: it is easy to get started with, it is not really that complicated, and you can do it too, to really start scratching your own itches and solving your own problems.

It's been a delight to be here at GopherCon. I'm so happy to be here and to share about code generation with you, and I'm looking forward to seeing you all at a future GopherCon, hopefully in person one day.

Info

Channel: Gopher Academy

Views: 201

Id: RpmYXh0ppRo

Length: 42min 9sec (2529 seconds)

Published: Fri Dec 17 2021