How does PostgreSQL actually work?

Captions
All right, good morning. I'm Peter. I have been a Postgres contributor since 1999, and I work at 2ndQuadrant right now; we have a booth and are also a sponsor, so you can visit us there, and if you want to talk to me later on, once we've vacated the room, you can probably find me there.

This talk is a bit different, because I just checked yesterday that Postgres now has 880,000 lines of code, and of those, 730,000 lines are C, and who knows what the other 150,000 lines of code are. So what does all that do? You download Postgres, or maybe nowadays you just provision it in the cloud, you get a prompt, you load some data, you run a query, and data comes back within milliseconds. Wow. What actually happened there? There are two ways you can think about that. One way is: well, it works for me, that's cool, and I don't want to know, and then you shouldn't be here. The other way is: let's learn a little bit about what actually goes on inside, and that can help when you have to work a little bit more with Postgres, when you have to do some tuning, or when you read some blog posts and want to understand what some of these terms mean. So we're going to take a tour of how a query is processed. The subtitle I mentally use to describe this talk to myself is: from the query to the disk and back again, and what happens in Postgres along the way.

So here's a user query. It doesn't matter what the query is, but you've seen these things. It's a little bit small here perhaps, so let me change the window; I'm going to use psql here and pull up a query. This is from some sample database that has customers and some orders, very standard, and you do a little bit of a report here, the sum of the orders by month, and you can filter a little bit. So it's very simple, it doesn't matter what it is, and it comes back within 15 milliseconds. Wow, it all worked. So what actually happened? That's basically what I'm going to explain in this talk: you type something, something comes back, and you have 880,000 lines of code at your service.

Let's start with what psql does. You've all used psql, the command-line tool. This is roughly the code of psql; it's actually much bigger, but this is basically what's inside. You start it up, it gives you a prompt, you type something in, and then it sends it to the backend. psql doesn't know anything about SQL, it doesn't know anything about your tables, it doesn't know anything about what your query is doing. All it does is give you a prompt, and it checks: if you type backslash-q, it quits, and if you type something like backslash-d, it does some interesting things, but other than that it basically just takes your string and sends it over the network. Messing around in psql was actually my first development project, in 1999, so I'm a little bit fond of psql. So that's what it does: there's no logic in there that knows anything about your database. This also means you could take any other tool you like, for example pgAdmin, or one of the alternative command-line tools that are now available, and they work just the same. psql is linked against the library libpq, which many of you have probably seen.
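To make that concrete, here is a minimal sketch of the kind of read-eval loop psql implements on top of the public libpq API. This is not the actual psql source, just an illustration of its shape; the prompt, the connection string, and the error handling are simplified placeholders.

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <readline/readline.h>
#include <libpq-fe.h>

int main(void)
{
    /* Connect; libpq takes host/port/user/dbname from the connection
     * string or from environment variables. */
    PGconn *conn = PQconnectdb("dbname=postgres");
    if (PQstatus(conn) != CONNECTION_OK)
    {
        fprintf(stderr, "%s", PQerrorMessage(conn));
        return 1;
    }

    char *line;
    while ((line = readline("sketch=# ")) != NULL)
    {
        if (strcmp(line, "\\q") == 0)          /* the only backslash command here */
            break;

        /* Ship the string to the server as-is and wait for the result. */
        PGresult *res = PQexec(conn, line);
        if (PQresultStatus(res) == PGRES_TUPLES_OK)
        {
            for (int i = 0; i < PQntuples(res); i++)
            {
                for (int j = 0; j < PQnfields(res); j++)
                    printf("%s\t", PQgetvalue(res, i, j));
                printf("\n");
            }
        }
        else if (PQresultStatus(res) == PGRES_COMMAND_OK)
            printf("%s\n", PQcmdStatus(res));
        else
            fprintf(stderr, "%s", PQresultErrorMessage(res));

        PQclear(res);
        free(line);
    }

    PQfinish(conn);
    return 0;
}
```

Build it against libpq and readline (for example with -lpq -lreadline). Note that nothing in this loop understands SQL; the string goes to the server untouched, which is exactly the point being made here.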
Here is a macOS tool that shows how a program is linked together; in this case it shows me all the libraries this program is linked with, and this will differ depending on the system, but basically all that psql is linked against is libpq, for communicating with the Postgres server, then GNU gettext for internationalization, the readline library, which is responsible for the command-line editing, and then some system library.

So next you go into libpq. libpq is the client library for Postgres. Again, libpq doesn't know anything about SQL, doesn't know anything about tables, doesn't know anything about what goes on in the server; it's basically a network abstraction. The API that a program like psql calls consists of functions like these: you connect to the database, then you execute a string, and you get something back. What libpq basically does is wrap that in operating system calls: it opens a socket somewhere, hopefully does a bunch of error checking, writes something to the socket, reads something back, and then unpacks that information, so that from the return value of PQexec you can get row information back and things like that. So it's just a network abstraction; it doesn't really know anything about what's going on in your database or what your query is trying to do. The real knowledge in libpq is that it speaks the frontend/backend protocol.

The protocol that the client and the server speak is a custom protocol. Nowadays you might build on existing building blocks like Protocol Buffers or Thrift; there are ways now to build protocols without inventing everything yourself. But Postgres has been around, in one form or another, since the 80s, and back then people just made up their own protocols. Here's how the protocol looks on the wire; I'll explain in a second how you might sniff it yourself. You would see stuff like this flying around. The first column here just shows the direction: this line means the client is talking to the server, then the server is talking to the client, then the client is talking to the server again. The rest are the actual bytes. The first byte of a protocol message is always a single letter that describes the message; a Q, for example, is a query. So the first thing that happens is that the client sends the query to the server. There is a handshake at the beginning, when you establish the session, which does authentication and things like that, and that's a little more complicated, but once you have a session established it's pretty simple: the client says Q, then there's a length field, so the server knows where the message ends, and then it just sends the query string. That's what you see on the wire. Then the server does some thinking, which is what the rest of this talk is about, and sends something back. The first thing the server sends back is a T message, which stands for tuple: it describes the rows that are coming back, basically how many fields am I going to send, what are the types of the fields, and what are the names of the fields, and that's all encoded in here.
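As an illustration of that framing, here is a rough sketch of how a client could assemble a simple-query message by hand: one type byte ('Q'), a four-byte big-endian length that includes itself, and the NUL-terminated query string. This only shows the shape of one message; a real client has to do the startup and authentication handshake first, and should check the write results, both of which are omitted here.

```c
#include <stdint.h>
#include <string.h>
#include <arpa/inet.h>   /* htonl */
#include <unistd.h>      /* write */

/* Send a simple-query ('Q') message on an already-established session.
 * 'sock' is assumed to be a connected socket on which the startup and
 * authentication handshake has already completed. */
static void send_simple_query(int sock, const char *query)
{
    uint32_t body_len = 4 + strlen(query) + 1;   /* length field + text + NUL */
    uint32_t len_be   = htonl(body_len);

    char header[1 + sizeof(len_be)];
    header[0] = 'Q';                             /* message type: simple query */
    memcpy(header + 1, &len_be, sizeof(len_be));

    write(sock, header, sizeof(header));
    write(sock, query, strlen(query) + 1);       /* query text, NUL-terminated */
}
```

The server's responses, the 'T' row description just mentioned and the data and completion messages described next, use the same type-byte-plus-length framing.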
There's more information in there, like how many fields there are going to be, what the types are, and then all the names, and that's it. Then you get all the data rows, and this could be many or few, depending on what your query returns. D is for data row; then you have a length field again, and then you get the actual data. The data is usually sent in text format; there are variants of this for binary format, but generally it's just a bunch of text, whatever your table data is. libpq's job, at that point, is basically to record this tuple-descriptor message, so it knows what all the data that follows means, then unpack it a little, copy it into local memory, and return it to the client through those API calls. So this protocol is not super hard; there are alternative implementations of it. For example, the JDBC driver for Java implements the protocol itself, it doesn't use libpq, and there are some other drivers that implement the protocol themselves. So it's not very hard.

When the server is done (it sends a bunch of rows, which could take a while), it sends a message at the end that says it's done, and then it sends another little byte here, which in this case is the transaction status: depending on whether you're in a transaction block or not, it tells you, in this case, "I'm idle right now." That's optional status information, more or less. And then you're done. So that's the flow: the client sends a query, then there's a tuple descriptor, then a bunch of data, and then the server says it's done, and then the client, depending on what it wants to do, can send another query and more stuff comes back. When the client is done with the session, it sends an X, which is the terminate message, and then everything gets shut down. Maybe you've seen this in the log: if you forget that at the end, if the client just closes the connection or you rip the connection out somehow, then the server gets upset and you get log messages like "unexpected end of client connection" or something like that. It usually means the client didn't send a proper shutdown message. It doesn't really hurt anything; it just means the software somehow isn't behaving the way it should.

This protocol is documented in the PostgreSQL documentation, toward the end, in the internals part, if you want to learn more about it. There are obviously a lot more protocol messages if you go really deep in there; if you use cursors and prepared statements and such, there's more stuff happening, but for most applications this is really all that happens. If you want to learn a little bit about this, use tcpdump or, for example, Wireshark to sniff the protocol. Wireshark is actually nice because it has built-in knowledge of the Postgres protocol; if you select it among the many options it has for decoding protocols, it will show you, sort of on the fly, something that looks pretty much like this, and you can watch the data going around. Every once in a while, in my business of doing debugging and support cases for customers, I actually do that, because something is wrong somewhere in their network or application setup, and then we dig in here and see what's actually happening.
So sometimes that's useful, and collecting tcpdump captures and having them decoded and so on is all entirely possible; it's mostly a text-ish protocol, so you can kind of read along.

Once the query is sent to the server, it has to be parsed, and this is straight out of compiler theory. When I started getting involved with the Postgres project I was actually still in college and taking a compiler theory course, so I was hacking on the Postgres parser at home at night, and during the day I was taking the course that explained how it all worked, so that worked out well. This part is pretty standard: Postgres uses flex and bison, the modern versions of lex and yacc, which have been around forever, to do lexical analysis and parsing. First, lexical analysis takes the string and splits it into tokens, so in this case it would look like this; these are internal symbols that are being used. It goes through the string and splits it up: the first token is a keyword, and it knows about keywords, so it records that; the next one it doesn't know, so it just says this is probably an identifier; then there's a parenthesis, a string constant (because it has quotes on it), a comma, and another identifier. It doesn't know what that refers to; it's some column name from later on, but at this point it's just an identifier. And more symbols, and so on and so on. So it's basically a long list of stuff, produced using regular expressions. That's usually not very interesting, because flex takes care of all of it; we just feed it the regular expressions and it does the rest.

After that it goes to the actual parser, and I'm going to hand-wave past this, because again we don't really have to worry about how it works: we use bison, which has all the smarts, we give it the tokens, and at the end we get out a parse tree. The parse tree is basically a bunch of C structs linked together, so it looks like this; this is a sort of YAML-like representation that I made up. At the top level is a list, because you could have multiple statements in one string: you could write SELECT something, semicolon, SELECT something, and that works; it's not often used, but it's possible. So you have a list of statements, and in this case it's a select statement, and the select statement has all these attributes or properties: a target list, a from clause (which has other things inside it), and then a where clause, which has an expression subtree in it. This is really quite standard: the parser goes through things, creates these structs, and hands you back a pointer to a struct that has all this information in it. At that point you've gotten rid of the string and you have an internal representation.

To learn more about this (and this is really how I did it), you don't need to know anything about databases, because flex and bison are standard tools; there's literature about them, and they have really good documentation that I often work with. These two source files contain the lexer and the grammar (scan.l and gram.y), and they're very self-contained, so you can just have the documentation of those tools handy and read through them. It's pretty straightforward, there's no magic, and this part doesn't really have anything to do with what happens afterwards.
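Coming back to the parse tree for a moment: the raw parse tree really is just C node structs. Here is a simplified sketch of the kind of node the parser builds for a SELECT; the field names loosely follow the real SelectStmt definition in src/include/nodes/parsenodes.h, but this is an abbreviated illustration, not the actual declaration.

```c
/* Simplified sketch of a raw-parse-tree node for a SELECT statement.
 * The real SelectStmt in parsenodes.h has many more fields. */
typedef struct SelectStmt
{
    NodeTag   type;          /* identifies this node as a SelectStmt */
    List     *targetList;    /* the select list: expressions to return */
    List     *fromClause;    /* table references and joins */
    Node     *whereClause;   /* expression subtree for WHERE, or NULL */
    List     *groupClause;   /* GROUP BY expressions */
    List     *sortClause;    /* ORDER BY items */
} SelectStmt;
```

The parser hands back a List of such statement nodes, a list because one string can contain several statements separated by semicolons.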
At the end you get a parse tree, but the parse tree doesn't necessarily make sense yet, because you have all these symbols in there. You then have to go through what we call parse analysis and check that the query actually makes sense. You do some additional syntax checks that bison doesn't know how to do, or that would be too complicated to express in the bison grammar. The big thing is that at this point you go through and make sure all these tables actually exist as tables, you make sure the columns that are being referred to match up with the tables, and you check the types: do the types match, and if they don't, you try to do a little bit of casting to make them match. That all happens at this point, and if something can't be resolved, you error out here. You use all the system catalogs to look things up: you look in pg_class for what the tables are, and at the types, and there are constraints and things like that; you have to look all of that up and make sure everything you wrote actually makes sense. The result is basically another parse tree, one that has been vetted, so to speak, and that tree is then handed off to the next step.

After that comes the rewriter, and this is also nicely organized in the source code, which is why I'm pointing these subdirectories out: we had the parser, and now we have the rewriter. The rewriter mostly comes in when you have views. If you know a little bit more about the history of Postgres, there's something called the rule system, which is much more general than views; it can take queries and rewrite them according to rules to do arbitrary things that nowadays we would normally do with triggers. The rule system in all its generality is a little bit deprecated, but we do use it for views, because views are essentially macros that get expanded into the query. So you have a query, you have it parsed, but then you go through it and you see: actually, some of these things I thought were tables are really views. So the rewriter goes in, takes the table out, puts another subtree for the view in its place, and mangles things that way. It can get pretty complicated, and it's fairly elaborate code because of the historical generality of the rule system, but basically, if you just want one way to think about it: at this point, views get expanded. Our query didn't have any views, so pretty much nothing would happen in this case.

Then it gets to the interesting, database-specific part. Now we have a parse tree in memory that describes what the query the user typed actually means, and we have checked that it makes sense; and remember, this all happens within those few milliseconds. Now we have to think about what we could actually do with it. We had a SELECT with a join, an ORDER BY, and a GROUP BY in there; how are we going to do that? The terminology is a bit mixed up here: it's sometimes called the planner and sometimes the optimizer. I think most people say query planner, because it creates a plan, but the subdirectory is called optimizer, because it also does some optimization; if you have redundancies and such, it can throw some things out. But that's not really its main job. The main job is to take the parse tree and create an execution plan, which is another tree, and a completely different one.
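To give a feel for the shape of such a plan tree before walking through it, here is roughly what EXPLAIN prints for a query of this kind: joined customer and order data, grouped by month, then sorted. This is an illustration of the structure only, not the real output for the demo query; actual EXPLAIN output also shows cost estimates and row counts, and the table and column names here are made up to match the description.

```
Sort
  Sort Key: month
  ->  HashAggregate
        Group Key: month
        ->  Hash Join
              Hash Cond: (orders.customer_id = customers.customer_id)
              ->  Seq Scan on orders
              ->  Hash
                    ->  Seq Scan on customers
```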
So for our query, when I tried it (and this can come out differently), it would look like that; let me try to make it a little bigger here, that works. This is YAML-ish again, and you can actually get it from the EXPLAIN command: if you type EXPLAIN and then that query, or your own query, you get something like this; I cut it down a little to make it fit. At the top level you have a plan, and in this case it decided that the first thing, or rather the last thing (it kind of reads upward), is a sort: this is my sort key. The sort gets its input from the aggregate, and the aggregate in this case decided it's going to do a hash aggregate; there are different ways to do aggregation, and in this case it said, OK, I'm going to do a hash aggregate. The aggregate gets its input from the hash join, and the hash join gets its input from these two plans: one is a sequential scan of this table, and the other is a sequential scan of the other table. That's roughly where it hits the disk.

So how does it get from one tree to the other? That's the smarts of the planner, and it's pretty big and complicated, because that's really where a lot of the performance is gained. It takes input from statistics collected about the tables, collected by vacuum, or by ANALYZE more precisely, and it has cost parameters that model your hardware. It takes that information into account and then it looks at the possible plans for how this query could be run: is it going to do a hash join or a nested-loop join, is it going to join this table first or that one, can it aggregate first, what is it going to sort, is it going to do a merge join so the data comes out already sorted and it doesn't have to sort, how much memory is there for the sorting? It has to take all of these things into consideration, and then it says: well, this is the best plan I could come up with, here you go. If this goes wrong, you get a bad plan and a slow query, and then you go to the user forums and say "my query is slow," and the first thing people will say is "how about posting your EXPLAIN plan," and then you post it and people go through it: hmm, this looks weird, you have a really big table but you don't have much memory; and then you get this whole query-performance debugging going on. There's more documentation about that out there, and it's not what this talk is about, but that is one of the really complex and important parts of Postgres query processing.

There are a lot of these good READMEs around in the Postgres backend; each subdirectory usually has a README that's worth reading if you want to dig into an area a little more. And we do need more people to dig into the planner, so if you're interested, there's work to be done.
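Incidentally, the cost parameters that model your hardware, mentioned above, are ordinary configuration settings. As a hedged illustration, these are a few of the planner-related parameters you would find in postgresql.conf, shown with their usual default values rather than tuning recommendations:

```
# Planner cost parameters (postgresql.conf).  They model the relative cost
# of I/O and CPU work and feed directly into which plan the planner picks.
seq_page_cost = 1.0            # cost of a sequentially fetched disk page
random_page_cost = 4.0         # cost of a non-sequentially fetched page
cpu_tuple_cost = 0.01          # cost of processing one row
effective_cache_size = 4GB     # planner's guess at available OS + shared cache
work_mem = 4MB                 # memory available to each sort or hash operation
```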
All right, so then we have a plan, and we are going to execute the plan. If you use prepared statements, you kind of skip ahead to this point: the prepared statement basically saves this somewhere, and the next time you execute it, you skip right ahead to here and you save all the parsing and all the planning, which in our case didn't really matter much, but the planning for really big queries can take a while relative to the execution. Execution is handled by the executor, which has its own subdirectory, and I believe there's a README file in there as well.

The executor takes the execution plan that was just produced and basically just works through it. It works with what we call a pull model, in the sense that there's a driving piece of code at the entry point that says: give me one tuple. Then we go through the plan: OK, we need one tuple, what's the top plan node? Oh, it's the sort in this case, so we're going to call this function; there's basically a big switch statement somewhere: if it's a sort node, call this function. The sort node then calls whatever its input is, and then it does the sorting. First it says, OK, give me some tuples, and then there's actual sorting code in there that does the sorting; there's always a talk at this conference about sorting, so that could be the side road if you want to learn about the actual sorting. So the model is "give me one tuple," but the sort can't do its sorting unless it has all the tuples, so in this case it has to wait until all of its input has finished. So it goes to the aggregation: it calls the aggregation function and says, give me all the tuples you have. The aggregation function looks at the plan, and the plan decided it's going to do hash aggregation, so it actually does some hashing in there; there's hash code. Then it has to call the join function, in this case the hash join function; these are all actual C functions in the source code, in this subdirectory. The join then has to call two places: first it does one sequential scan and builds a hash table, and then, based on that, it does the other sequential scan. This is all pretty standard database material, hash joins and merge joins and how all those things work; there's standard literature about it, and README files that explain it.

That's how the executor works, and once you get a tuple out at the top, it's sent pretty much straight back to the client; it's formatted a little, into the network protocol format, and sent back over the wire. Depending on what kind of plan you have, these tuples are sent straight back as soon as they're available. In this case that wouldn't work, because the sort has to finish first. So there's a distinction between fast-starting plans, which can produce their first results early, and plans where you have to wait until everything is computed. That makes a difference if you use LIMIT, for example; then you want to prefer a plan that is fast-starting. There is also some discussion now that this model is actually kind of inefficient and slow, because you don't take advantage of batching and of locality in the CPU. This model of "give me one tuple, OK, here's one tuple" goes through the entire stack, depending on the plan, calling all these different functions for every single tuple, which is very inefficient for the CPU, cache locality and all that. So there is work going on right now, as soon as you all go home and go back to work, to make this a little more batch-oriented, so that bigger chunks of tuples get sent around, and that will hopefully improve performance, mainly CPU utilization. There's a README file for this part too.
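Here is a heavily simplified sketch of that pull model: a dispatcher asks the top plan node for one tuple, and each node type is handled by a function that in turn pulls from its children. The function names mirror real executor entry points, and the real dispatcher lives in the executor source (execProcnode.c), but the code below is only an illustration, not the actual implementation.

```c
/* Illustrative only: a stripped-down "give me one tuple" dispatcher in
 * the style of the executor.  Each Exec* routine pulls tuples from its
 * child node(s), which ends up back in this switch. */
static TupleTableSlot *
ExecProcNodeSketch(PlanState *node)
{
    switch (nodeTag(node))
    {
        case T_SortState:
            /* needs *all* of its input before it can return the first tuple */
            return ExecSort((SortState *) node);

        case T_AggState:
            /* hash aggregation: build a hash table over the input */
            return ExecAgg((AggState *) node);

        case T_HashJoinState:
            /* build a hash table from one child, probe it with the other */
            return ExecHashJoin((HashJoinState *) node);

        case T_SeqScanState:
            /* fetch the next tuple from the table via the access methods */
            return ExecSeqScan((SeqScanState *) node);

        default:
            elog(ERROR, "unrecognized node type: %d", (int) nodeTag(node));
            return NULL;
    }
}
```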
OK, then the next level below is what we call the access methods, and the access methods are basically how you actually access the table, or the index perhaps, because the execution plan just says "do a sequential scan." But what does a sequential scan actually do? That's code that lives here, and this is also where all the index code is, the B-tree code, which is more complex than a sequential scan. A sequential scan has functions like this: you open a table, which is identified by its OID, and there's a certain lock level that has to be taken, so it opens the table; then it begins the scan, and it has some scan keys, depending on what you want to look up, which are basically the search conditions, if you have any; and then it runs in a loop until it's done: heap_getnext gets you a tuple, that tuple is sent up the stack and back to the client if that's all you need, and it keeps running in that loop; then you end the scan and close some things to release resources. That's basically it; that's the interface you have. If you do an index scan, it's similar: you have index_open, and then it goes on much the same way, and below that there's the actual indexing code that knows what a B-tree is and so on. So that's what the access methods are, and at that point you're basically done: you're all the way down at the disk and you get your data back.

Then there are supporting pieces of code, like the function and operator calls that filter results. If we look at what the execution plan was, we had a filter here, by age for example, and that would be put into the scan key, perhaps, depending on how the plan works out. The nice thing is that all of this is, in a way, running in user space. There's no hardwired knowledge of what greater-than or less-than means; obviously it's built in, in the sense that it's shipped with the code, but there isn't really any hardwired knowledge of what all these operators mean. Everything is extensible: you can write your own data types that have the same operators, and they work the same way as the built-in ones. All of this is implemented in C functions that look, for example, like this: in this case we're implementing the greater-than-or-equal operator for the int2 data type. This is a piece of code that exists pretty much like that in the backend. It has some special macros to pass things around, but basically what it does is: get the arguments I was passed, do some computation (this one is very simple, because it just uses whatever the C compiler provides), and return a value. If you write your own custom data type, you can write your own code and return anything you want, within reasonable limits.
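That function looks roughly like this. It uses the fmgr macros to fetch its arguments and return the result, and it is essentially what the built-in int2ge function in the backend looks like, though this version is reproduced from memory as a sketch rather than copied from the source.

```c
#include "postgres.h"
#include "fmgr.h"

/* ">=" operator implementation for the int2 (smallint) data type */
Datum
int2ge(PG_FUNCTION_ARGS)
{
    int16 arg1 = PG_GETARG_INT16(0);   /* left operand */
    int16 arg2 = PG_GETARG_INT16(1);   /* right operand */

    PG_RETURN_BOOL(arg1 >= arg2);
}
```

An extension's operators for a custom data type are written the same way; the system finds them through the catalogs, which is exactly the lookup described next.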
The way the system knows how to use this function is through the system catalogs. Take the filter that the plan produced: the plan said, OK, you have an int2 column here, you have the greater-than-or-equal operator here, and some constant there. So how are we going to find out what this operator really means? First we look up the types: we go into pg_type and get the OID of the int2 type, which is 21 in this case. Then we look in the pg_operator table, and you can run this yourself; this is pretty much exactly how the system does it internally: you look for the operator with this name whose left argument is of type 21 and whose right argument is of type 21, and the result gives you its procedure name, basically. Then you go into pg_proc and look up that procedure, which in this case is called int2ge, and look at its source. In this case the source is a built-in C function, but it could be your extension function in PL/Python or something like that. You look up what the source of this operator is and then execute that. So all of this is changeable, or at least inspectable, by users. That's how the filters in a sequential scan work.

Now, very much simplified, here is how the actual scan works. You have your data directory (I just have a placeholder here), and in the data directory you have subdirectories. One is called base, and under that you have some numbers: the first number is the OID of the database, and the second number is the OID of the actual table; well, it's not actually the OID, it's the relfilenode, but you know what I mean. Based on that, the system opens that file using the normal operating system calls. The files are organized in blocks, 8 kB by default, and if it needs to read something from a file, it pulls in one of these blocks. The sequential scan starts at the beginning and says, OK, give me the first block, and loads it into memory; it goes through the shared buffer cache, which many of you have hopefully heard of and have configured appropriately; and then it decodes that 8 kB block. So it doesn't pull in one tuple individually; it always pulls in the entire block and then decodes it. The block internally has a structure with checksums, some transaction information, and things like that, and the actual tuple data sits somewhere in there. It pulls that out, you get basically a blob of memory, and that is decoded a little, sent up, and sent back over the wire. That's pretty much all it is. There are tools to inspect these files; pg_filedump is one of them, for example, and you can look at what's actually in these blocks, which is sometimes good for forensics or debugging. If this were an index scan, it would be a little more complicated, but it would still be a file like this, and it would still be 8 kB blocks; the blocks would just mean more complicated things, for example parts of a B-tree.

Another important thing goes on, not in this query, because this was a SELECT, but when you have an UPDATE. In our case we just had a loop doing heap_getnext; if you actually write something, the executor calls heap_update, for example, to update a tuple. First you read the tuple and get a tuple handle, so you do a scan and you get a handle, and with that you modify the structure, with something called heap_modify_tuple; you change it somehow, and then you call heap_update, and heap_update writes it back to disk. The way Postgres works, as you've probably heard, is that it doesn't actually overwrite the old tuple: it makes a new tuple and creates a link between the two; that's the non-overwriting storage manager, and there's HOT, and there's VACUUM to clean all that mess up later.
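Pulling the scan and update interfaces together, here is a rough sketch of what that looks like in backend code, based on the pre-v12 heap access method functions mentioned above (heap_open, heap_beginscan, heap_getnext, heap_modify_tuple). It is illustrative backend-internal code, not something you could compile on its own, and the table OID and the values/nulls/replace arrays are placeholders.

```c
/* Sketch: scan a table and write a new version of every tuple.
 * Based on the pre-v12 heap AM interface; snapshots, triggers, error
 * handling, and WAL details are all glossed over here. */
void
update_all_rows_sketch(Oid table_oid,
                       Datum *values, bool *nulls, bool *replace)
{
    Relation     rel  = heap_open(table_oid, RowExclusiveLock);
    HeapScanDesc scan = heap_beginscan(rel, GetActiveSnapshot(),
                                       0, NULL);        /* no scan keys */
    HeapTuple    tup;

    while ((tup = heap_getnext(scan, ForwardScanDirection)) != NULL)
    {
        /* build a modified copy of the tuple ... */
        HeapTuple newtup = heap_modify_tuple(tup, RelationGetDescr(rel),
                                             values, nulls, replace);

        /* ... and write it back: this creates a new tuple version and
         * links it to the old one (the non-overwriting storage manager) */
        simple_heap_update(rel, &tup->t_self, newtup);
    }

    heap_endscan(scan);
    heap_close(rel, RowExclusiveLock);
}
```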
So heap_update basically just writes a new tuple somewhere else and creates a link between the versions, but what it also has to do is WAL logging, so that if there's a crash, the system can recover without having to fsync all the files all the time. Inside heap_update (and these are actual C functions again) there's a sub-function it calls, named log_heap_update, which produces a WAL record: a chunk of bytes that describes what the update was. It basically assembles a bunch of bytes that say, I did this update to this tuple and I changed these values; it's a kind of very low-level logical description of what the change was. That WAL record is then passed to the function XLogInsert, which writes it to the WAL, meaning first it goes into the WAL buffer and then it gets flushed to disk; that's where all the WAL business you've heard about happens. This happens in every place in the Postgres source code that makes a change: up to certain optimizations, every place that makes a change has to somehow make sure WAL logging happens, so all over the code there are these side calls: OK, at this point I'm going to write a WAL record, and then I'll keep doing the rest of my work. So it works a little bit on the side and is its own thing. And of course, as you know, this all ties into replication, because these change descriptions can be shipped to other hosts and replayed, and replaying all these changes gives you replicas. The WAL code has a README that explains it, and xlog.c is where all the action happens; it's one of those huge, endless files that everyone agrees needs to be refactored but nobody dares to touch. So that's where the WAL code is.
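As a rough sketch of what such a side call looks like with the current WAL insertion API: the record assembly and buffer registration in the real log_heap_update are considerably more involved, and the xlrec and page variables below are placeholders.

```c
/* Sketch: how a change gets WAL-logged before the caller carries on.
 * Loosely modeled on log_heap_update() in heapam.c; heavily simplified. */
xl_heap_update xlrec;       /* describes old/new tuple locations, flags, ... */
XLogRecPtr     recptr;

/* ... fill in xlrec and register the data pages being modified ... */

XLogBeginInsert();
XLogRegisterData((char *) &xlrec, SizeOfHeapUpdate);
recptr = XLogInsert(RM_HEAP_ID, XLOG_HEAP_UPDATE);

/* stamp the modified data page with the WAL position, so that recovery
 * can tell whether this change has already been applied to the page */
PageSetLSN(page, recptr);
```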
So those are basically all the pieces of how this query works. You start at psql, which is a very straightforward command-line program. libpq is the network abstraction, basically, and there's a custom protocol that the pieces use to communicate, which you can inspect yourself. The lexer and the parser are standard tools. The planner has a lot of magic in it: it takes a parse tree and creates an execution plan based on statistics and things like that. The executor basically just works through that plan and actually contains the code that does the merging, the hashing, the sorting, and all of that. Tied into it is the function manager, which knows about all the types, the functions, and the operators, and how to call them, including extensions that might implement them. Then you have the heap and the indexes, which manage the low-level blocks, and they call into the WAL. Below that, which I didn't really go into, is the storage manager, which basically just manages the file descriptors: it opens a file, gets you bytes, closes the file, and makes sure all the files are closed and fsynced at the right time, and things like that. That's the storage manager; that's also a term you'll see around. So that's Postgres; that's 880,000 lines of code. Any questions about that? Yes, please.

[Audience question about how Postgres uses the kernel page cache.] The page cache is the kernel's page cache; Postgres doesn't really do much with it directly. Where the page cache comes in is this: if you have a lot of memory in your system and you want to access this file, to do a sequential scan or even other kinds of scans, and it's already in memory, you don't have to hit the disk; you basically just read memory. And if you do updates, you basically write to memory and let the kernel put it on disk later on. The WAL logging, on the other hand, has to, well, the whole point of it, originally at least, is to fsync straight to disk to make sure it's durable. So any write to the WAL, again up to some optimizations, goes straight to disk. It still goes through the kernel page cache, because that's basically the only way to do it unless you use special options, but it has to go straight to disk, so you have to wait for the disk. And the WAL is never really read, at least on the writing system, so that's not an optimized case; it doesn't usually sit in the page cache where you could read it from. You usually read it straight from disk, or you read it from your remote archive, which copies it to the local disk, and you read it from there; it's usually not in the page cache. So basically the answer is: it doesn't do anything special with the page cache.

[Follow-up about when it syncs.] It syncs when the transaction is finished; those are the default settings, and there are some ways to fiddle with that, which get you more performance with worse reliability, or better reliability, but basically the default mode is: whenever you commit, it fsyncs the WAL up to that point to disk. There was a great talk by Bruce yesterday about Postgres hardware selection, which goes into exactly this point in much detail, because it talks about how to select your hardware and your storage system to optimize exactly that, and that's really important for performance. So there's more detail to be had there. Yes, please.

[Question: are indexes versioned the way tuples are?] Tuples are versioned in the sense that when you update one, you get a new version; is that what you mean? Roughly speaking it works the same way, except that indexes don't have transaction information in them. This goes a bit beyond what I covered, but just to quickly address it: if you make a new version of a tuple, you record which transaction it belongs to, so the next transaction knows which version of the tuple it should look at. Indexes don't have that information, so they actually have to go down and look into the heap: you have two versions of the tuple, but from the index you don't know which is which, so you have to go into the heap and look up which one is the right one. That's the basic way it works, and then there are many optimizations, such as hint bits, that make it work a little better than the way I just described, but that's the basic scheme. There are also different reasons why things are this way: in some cases the indexes are lossy, so they don't actually give you fully accurate information, in which case they just tell you "this tuple might be the one you're looking for" and you still have to check whether it's the right one, depending on the index type. So it can mean different things depending on the index. The question was over here, yes, please.

[Question about how the planner decides what to do.] Determining what the work is, is probably half the work. It's an important piece of the puzzle of what makes a database system, really, because the whole point of relational databases is that you can write an abstract description of what you want and the system figures out how to do it.
Otherwise you could just write your own code that does the sorts and lookups and so on, so that's really where the value is. And it could always be better. It obviously works well enough for many, many people, but there are always edge cases, and you could make this part simpler and refactor that part, and parallel query, which is new, opens up a lot of new opportunities. The planner knows how to handle parallel query now, but I'm sure we could have more knowledge of how expensive it is to spin up additional workers, how expensive it is to move data between CPUs, and you could model this endlessly. So it's very hard stuff; that's why I mentioned there's always more to be done. Determining exactly what to do without breaking everything else is really the hard part. Yes, please.

[Question about adding protocols.] Did I say you can add more protocols? I don't think that's quite right; JDBC, sorry, what I'm saying is that you can implement one side of the protocol yourself. If you write your own driver or your own client program, you don't have to use libpq. Most of the work is already done, so normally nobody would do that, but when somebody implements a driver, well, the most recent driver written from scratch was perhaps the one for the Go language, which was a new thing a couple of years ago; whoever wrote that database driver had the choice of linking against libpq to do all the work or doing it themselves, and they chose to do it themselves, because then they can take advantage of language-specific optimizations and things like that. And in the talk yesterday about PgBouncer, we also mentioned that I wrote my own PgBouncer clone, basically by implementing the protocol myself, so it's not very hard. All I'm saying is that the protocol is no mystery: you can work with it, you can look at it, and you can interact with it yourself if you want to. In the back, yes?

[Question: where does EXPLAIN happen, and EXPLAIN ANALYZE?] So the question was where EXPLAIN happens and where EXPLAIN ANALYZE happens. EXPLAIN is pretty much here; this is EXPLAIN output, cut down a little to fit. If you run EXPLAIN, it basically has to do the parsing and all the checking to make sure the query makes sense, then it does the view expansion, and then it does the planning, so the full planner is in action, and at that point it just stops and prints the plan out, and that's it; that's literally how it works. It calls the planner and then it has a function to print this out. If you do EXPLAIN ANALYZE, which actually executes the query and then collects information about what actually happened, it just keeps going here and runs the whole thing, and these pieces are instrumented to know: oh, am I actually running under EXPLAIN ANALYZE? In that case I'm going to collect some information and store it; there's another tree, the execution state, which mirrors the plan, and all this information is collected in that execution state. After that it doesn't send any results back; it just presents the information it collected in the execution state. So it's pretty straightforward, actually, if you look at how it works in the code; it really just takes these internal structures and exposes them to you. Yes?

[Question about psql tab completion.] That is kind of hacked in; that was actually my very first project.
Yeah, as I mentioned, it was 1999, and I was doing an internship, in a way, at a company that was running Postgres for their production system, which was outrageous in 1999. So it was like: well, here's a database, do something, run it, fix it. And I was typing in psql and always hitting tab completion, and it always just did the default completion, which is filenames, the behavior it inherits from readline and bash. So I really got fed up with the tab completion in psql, and I sat down one weekend, figured it out, and hacked it in. So psql does, of course, have some knowledge of the database, but the tab completion is really a separate thing: if you hit tab, it goes into this module, which calls into the database and looks at what the tables are, but it doesn't record any of that; it just uses it at that point for the completion. It's a fairly ugly module, really, because of course it doesn't have a full query that it can parse; that's the whole point. It has only partial information about what you're trying to type, and then it figures something out. So that's a separate thing that does connect to the database and collect some information from it. Yeah. Going once, twice. OK, thank you. [Applause]
Info
Channel: Citus Data
Views: 13,244
Rating: 4.9096045 out of 5
Id: OeKbL55OyL0
Length: 50min 34sec (3034 seconds)
Published: Wed Dec 07 2016