PostgreSQL In-Depth Training: Fundamentals Part 1

Video Statistics and Information

Captions
Greetings, guys. I will be talking about a quick architecture overview; as Jeremy says, we will be covering the foundations at a higher level in terms of architecture. My plan is to go through these contents. From the general overview, what I am calling here the backend flow — we will see what backend means here, but it is the process that manages your connection — and what it does; then the auxiliary processes, which here means everything else that is not the backend process handling that connection. Then we will talk a bit about the disk layout and how things are organized internally in Postgres, and because we are talking about the disk layout we will talk about pages and tuples in Postgres — we usually say pages when they are in memory, but they can also be the data blocks inside the data file, the table's data file. Then we will talk about multi-version concurrency control, MVCC, which we can argue is one of the core components inside the Postgres database, because of how it handles not only concurrency, as the name says, but the locking system, the contention across multiple connections trying to do multiple things at the same time. Let's say I am managing a bank: I need to withdraw money from my account and deposit that money into Jeremy's account, and I want that operation to occur in the most atomic way — if the money goes out of my account I must be sure it goes into Jeremy's account, otherwise it should not leave my account. MVCC is there to ensure that in the cheapest way possible, and by cheapest I mean from the resource utilization viewpoint, in how the algorithm handles it and the possible ways to control it. But we will see that this brings a few issues in the long run: it creates old versions of tuples that will require vacuum, and that is a concept that is very specific to Postgres. Then we will glue everything together talking about the shared memory and the write-ahead log mechanism, and then the checkpointer and the background writer. That will cover those things at a very high level. I will try not to take more than 60 minutes talking about this, but for sure we could spend a whole day on each component here, so the idea is to give a general picture. I will share this deck after the presentation, and it has a few links as well for each one of those topics.
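Coming back to that bank-transfer example for a second, the kind of all-or-nothing operation being described can be sketched like this (the accounts table and its columns are made up for illustration, not something from the talk):

    BEGIN;
    UPDATE accounts SET balance = balance - 100 WHERE owner = 'me';
    UPDATE accounts SET balance = balance + 100 WHERE owner = 'jeremy';
    COMMIT;  -- either both updates become visible to other sessions, or neither does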
OK, so just a brief word here: what is Postgres, and why is it called Postgres? It is not something new — probably a few people imagine Postgres is a hype thing that has only happened in the last five years — but it is a very stable and solid project that was created at Berkeley in 1985, and the name came from an old database named Ingres, so Postgres is "post-Ingres". There is a vibrant community, online and in person, where you can find people around the world to discuss ideas, develop together and share your ideas. It is free software, but by free people usually have the idea that free means "I don't need to pay anything". You don't need to pay for the license, but you still need to pay for support, you need to pay your DBAs, you need to pay for taking care of your database; so we can better translate that as a low cost of ownership due to the Postgres licensing. And yes, there is professional commercial support, as you will find for any other engine like SQL Server, Oracle or Sybase — it is another database in that sense.

Another thing I could mention here, which I think is worth mentioning in an opening slide, is that Postgres has a very clear versioning policy, development cycle and release notes, so it is very predictable how new features are released, how bugs are patched, why you should stay on the latest minor version, things like that, which we can discuss later on.

OK, so to start talking about the architecture we need to understand how the client, the application, connects with the database, and there is a very simple diagram here. The clients, the application, will use libpq, the client library, and libpq will open a connection and try to authenticate with the postmaster. This authentication forks another process inside Linux, and that other process will be responsible for that connection. Here we are talking about libpq, but it could be the JDBC driver, for example. Then there is that backend process — and of course, if you have 50 connections, those 50 connections must know what each one is doing: they need to understand that if I have a table open and another session has the same table open, what that session is doing, whether I am allowed to lock that table or not. That integration and communication between those processes is done through the shared memory that is pre-allocated when the postmaster is launched. Then, of course, you have the kernel stack that handles the buffers and the storage itself.

One detail that is important to mention here is that the Postgres protocol is from the 80s, so it is very detailed in terms of how it is implemented, and at that time TCP was not assumed to be a standard yet, so you somehow have a re-implementation of a few things that you take as given in TCP/IP reimplemented in libpq. So you will see that each connection is handled by a different process ID, a different backend, through those forked processes, and each backend is also responsible for its own authentication: you kind of fork first to have that process authenticate you, and if that authentication is wrong — your password is wrong, your user is wrong — then the process terminates itself. So yes, libpq is a client-server protocol, and nowadays you use the socket interface to communicate over TCP/IP. We will also talk about the auxiliary processes and how the auxiliary processes and those backend processes are tied together through the shared memory.

On the disk layout and how files are distributed, we will see that there is the database cluster, and the word cluster specifically can create some confusion for a user seeing it for the first time in the Postgres community, because cluster here does not mean a set of instances on different machines, as you might wish; it means a set of databases inside the same instance. That is the terminology used in the Postgres documentation, and sometimes, reading it for the first time, you see the word cluster — for example, pg_ctl restarts the cluster — and you say: but I don't want to restart a cluster, I want to restart just this instance. Yet your instance is the cluster, because it handles multiple databases inside the same instance. It is a concept Postgres has had pretty much from the beginning: you can have multiple databases inside the same instance; you don't need multiple instances to have multiple databases.
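As a quick, hedged illustration of that idea (the database names are just examples): you connect to one database, but the catalog shows every database the same cluster, the same instance, serves.

    -- list all databases that live in this one instance (cluster)
    SELECT datname FROM pg_database;
    -- in psql the shortcut \l shows the same list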
And this is something very specific to Postgres. Inside the cluster — and we can replace the word cluster with instance to make it easier to understand — we have the database instance, which has multiple databases: DB1, DB2, up to DBn. Those databases are isolated from each other, so you cannot easily join a table in database one with a table in database two. You can do it, but you will need a link, a dblink, between those two databases, just as you would do a dblink to another instance, because they really are isolated from each other. But then, inside DBn here, you can have multiple schemas, and by schemas we can understand logical containers. So I have my public schema, which is the default schema, but I can have an HR schema, an accounting schema, a marketing schema, each one of those schemas holding objects for its own business area, and I can join tables from different schemas inside the same database — that is a given — and I can have foreign keys between them.

Another thing that creates some confusion for people coming from Oracle is that there, sometimes, schema means user and user means schema; here they are not related to the user that connects to the database. A schema is just a logical container. We can make an analogy that it is like a directory where I put my files, my objects, but this is just an analogy — a schema is not a directory, it is just a container, a logical container, which I can use to divide my objects with a logical meaning, like accounting, marketing, HR, and put objects in different schemas to keep them organized. Or, depending on my data model, I could even have different schemas for different companies or different business units and then merge them together; it really depends, it is very flexible. Internally, on the host, each database is a directory — there is a specific directory with an ID — which contains the data files for each table, each index or each sequence inside that database, and those files are used by the postmaster and the backends.

Then we need to start talking about the backends: how the backend works. Remember that the backend here is the process inside Unix that handles your connection, so it is responsible for your session. To start, I can already say that, because one process is designed to handle one connection, if you have 1,000 connections you will have 1,000 processes to deal with each one of your connections. There is a very good slide from Bruce Momjian here that shows how the backend behaves when a statement comes in from the application. It is another cut of what we saw on the previous slide, where libpq is responsible for connecting the application with Postgres, which we can see as "main" here; this green square is the backend at work, and the workflow inside that green square shows the sub-modules inside that backend. So we will talk about the parsing phase; we also have the traffic cop, which is used to separate what is a complex query from what is a simple command. By complex query here we mean SELECTs, INSERTs, UPDATEs and DELETEs — not complex in terms of how big your query is or whether it has joins; complex is everything that is not a command, and by command here we mean a COPY, a CREATE TABLE, an ALTER TABLE — those go through the command utility path. Then we have the rewrite phase, which nowadays is used mostly to expand views into queries.
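Just to sketch what that expansion means (the schema, table and view names are invented for illustration): when you query a view, the rewriter replaces the view reference with its defining query before planning.

    CREATE VIEW accounting.paid_invoices AS
        SELECT id, customer, amount FROM accounting.invoices WHERE paid;
    -- internally rewritten into a query against accounting.invoices
    SELECT count(*) FROM accounting.paid_invoices;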
Then we have the generate-paths and generate-plans phases, which take into consideration the statistics that are constantly collected by the autovacuum workers and, of course, whenever you run ANALYZE. Then you have the execution phase, the execution plan; this process can invoke itself recursively if you have subqueries, and if you have stored procedures or views. And then, finally, when the executor finishes, it gives the result back to your client. What I try to do in my next few slides is dive a bit deeper into each one of those phases, and as I mentioned before, each one of those slides has a few references and links, sometimes to the docs or to the Postgres source code for the implementation of that part inside the backend, so that might help us dive deeper later as well.

One thing to have in mind about libpq, here on the client side, is that it is a custom protocol, so it must take care of pretty much the whole life of the network communication even though it is implemented on top of TCP/IP: you must control network timeouts, you must control whether it is a secure connection — it uses a library for the SSL connection, but it must call those functions itself inside the libpq protocol. So there are a lot of settings, both on the client side and on the server side, to manage how this protocol behaves. It has a few other features as well, for example asynchronous notifications — the LISTEN and NOTIFY commands in Postgres — which can be used to do queuing; this is a very good feature of libpq itself. There is native SSL support based on SSL libraries — that is what I was saying: we have a library that implements the SSL connection, but how those functions are called is handled by libpq itself. libpq also implements bulk import and export through the COPY command, the COPY statement, which is very useful for large operations. When you are dealing with a large table — say a million rows that I want to take from a CSV and import into a table — I do not do that with one million INSERTs, because each one carries the overhead of parsing the statement and doing a lot of extra work. If I already know beforehand that I need to insert one million rows, or records, from a CSV file, I should use the COPY command to make it faster.
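A minimal sketch of that difference, with made-up table and file names (server-side COPY reads a file on the database host; from a client, psql's \copy streams the file over the connection):

    CREATE TABLE staging_orders (id bigint, customer text, amount numeric);
    -- one statement loads the whole file instead of a million individual INSERTs
    COPY staging_orders FROM '/tmp/orders.csv' WITH (FORMAT csv, HEADER true);
    -- client-side equivalent in psql:
    -- \copy staging_orders FROM 'orders.csv' WITH (FORMAT csv, HEADER true)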
Then, following the arrow here, we see that libpq connects with the backend, and what the backend — or the backends, the multiple processes, one per connection — will do, simplifying that flow, is the parser, the optimizer and the executor phases.

So, the parser. The first thing the parser does is a lexical analysis: it makes sure that the syntax you are using is valid, that it is valid grammar, that you are not misspelling some command, and that the command makes sense according to the SQL grammar implemented inside Postgres. That is done using a few things here: there are files for the lexical analysis and for the grammar itself, which use bison and flex, open-source tools that implement those parsing languages. The result of this phase is a parse tree, and we can think of it as decoding the SQL language into structures: if I am doing a SELECT, which C functions inside the source code must be called to do that SELECT? Sometimes we take a simple SELECT — SELECT * FROM table A — for granted, as if some database magic internally does it for us, but we need to remember there is source code doing that for us, and to do those things we need to decode the simple SELECT into C operations, C functions inside the database code; that is done in this phase. Of course, if instead of SELECT I misspell it and add an extra E at the end, that command is invalid — the SQL grammar does not implement a SELECT with an extra E at the end — so it returns an error saying the command is invalid; that also happens in this phase. It will generate a parse tree — a tree data structure in memory that represents the command you are trying to execute — but at that point you are not validating objects yet, so you do not really know whether the table exists or not, whether the column exists or not; it is the parse analysis that adds those things, such as whether the function exists if you are calling a function to convert a timestamp to a string. That is done in this phase.

Then you have the rewrite phase which, as I mentioned before, is mostly used nowadays to replace views with normal queries, but you can also manipulate the rewrite rules directly — you meaning the developer, the DBA — and do some tricky things, which is not recommended nowadays, but you still can. For example, if I try to insert something into a table, instead of inserting into that table I insert into another table; that is possible, and sometimes it is useful to know it is possible. That is done in the rewrite phase; it is something very specific to Postgres — it is not triggers, it is something I would say lower level than triggers — and it is interesting.

So now we have the parse tree and we need to know what to do with it. We already know the command exists and that it maps to those C functions, and those C functions will try to access, say, table A with columns c1 and c2. Now we need to understand how to do that, and by how I mean: should I use an index to access it? Am I reading the entire table or just one row in that table? If I should use an index, which index should I use? Those decisions are made here by the optimizer, which decodes the parse tree to produce an execution plan. If I am doing a join, which join method will I use — a hash join, a nested loop? If I am using an aggregation, how will it take place? In that phase the optimizer takes the parse tree plus the table statistics plus the cost parameters and produces the execution plan tree. That part is important for a DBA: what are the cost parameters? We can imagine the cost parameters as the factors that represent your host, your environment, and by factors I mean things like exactly what the difference is between a random access and a sequential access when I am trying to fetch a row in your database. It is a way to control the optimizer itself by manipulating those factors; it is a kind of external knob you have to shape how the optimizer behaves. So it is important for the DBA to understand the optimizer's behavior and how those knobs interfere with that behavior for the query being run, and there are multiple cost parameters for the multiple different actions the optimizer can choose. Those actions make up the execution plan tree, and once the execution plan tree is done it is pretty much how Postgres will access your data: you have, ready, the final plan to execute.

You know there is the EXPLAIN command that allows you to observe that plan tree; it is pretty much the output of this operation. With EXPLAIN we generate the plan and you observe those costs, you observe how much time it should take, and if you run EXPLAIN ANALYZE it gets the real data after the real execution. When you run EXPLAIN ANALYZE on a query you pretty much get everything: the time spent in each one of those phases, how much time you take joining the tables, how much time you take in the WHERE clause, how much time you take aggregating the values you receive from that join — you get all those metrics there. Usually EXPLAIN ANALYZE gives you the best idea of how much time your query will take. Developers constantly talk about test-driven development, for example, and it is interesting that sometimes we forget to add an EXPLAIN ANALYZE for the queries the application runs to get those metrics, and it is pretty rich in terms of metrics. So it is a good practice as well to collect the EXPLAIN ANALYZE of the most common queries that run, interpret that data, and see whether it is skewed compared with what you would expect.
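A small sketch of that practice, reusing the made-up staging_orders table from the COPY example (the cost settings shown are real planner parameters; the values are only defaults, not recommendations):

    -- the knobs mentioned above: relative cost of sequential vs. random page reads
    SHOW seq_page_cost;       -- 1.0 by default
    SHOW random_page_cost;    -- 4.0 by default; often lowered on SSD storage
    -- estimated plan only
    EXPLAIN SELECT customer, sum(amount) FROM staging_orders GROUP BY customer;
    -- actually runs the query and reports real timings and buffer usage per plan node
    EXPLAIN (ANALYZE, BUFFERS)
        SELECT customer, sum(amount) FROM staging_orders GROUP BY customer;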
Then there is the executor. Once you have the execution tree, the executor gets a copy of the plan tree together with a state tree that controls which step is being executed. Of course, the executor needs to know about memory management, transactions and access methods — how I access a row internally in the page — and then the access methods themselves, how access is implemented for the different types of objects I can reach. For example, if it is an index, I have different types of indexes — that is something very good in Postgres; I even have the possibility of extending how an index works through those index access methods: B-tree, hash, GIN, GiST, spatial access methods for indexes. And then the heap itself — when we see heap in the source code, it means table access, the raw table access. That is the end of the execution phase: you get the data from disk or from memory, depending on where your data already is, and give that data back to your application. So we see there are a lot of things implemented internally in this simple backend that is handling your connection; it is not just handling your connection, it is doing all the database work you want done.

Now we need to talk a bit about the auxiliary processes, and I will try to cover them as quickly as possible, because there are a lot of them. An auxiliary process here is everything that is not the backend handling our connection. We have autovacuum and the stats collector, which collects activity from your database: information about how much a table was accessed, the number of dead rows — dead tuples — inside that object. Then we see the WAL writer, which is responsible for the redo mechanism in Postgres: we have the backend process here accessing the shared memory, and the WAL buffers are written, through the WAL writer mechanism, to the WAL files, which are afterwards used by the archiver to ship them to archive storage.
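To make that concrete, the kind of activity the statistics machinery collects is exposed in views like pg_stat_user_tables (only a handful of its columns are shown here, and the exact column set varies a little across versions):

    SELECT relname, seq_scan, idx_scan, n_live_tup, n_dead_tup, last_autovacuum
    FROM pg_stat_user_tables
    ORDER BY n_dead_tup DESC;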
Now, going a bit deeper into pages and the memory itself: what is the data structure those processes work on, what are they trying to do, and what is the finest granularity of data internally in Postgres? We already know the backend process, and each process has its own memory structures — temp buffers, work_mem and maintenance_work_mem. We can highlight work_mem here, used for hash joins or any sort of internal operation the backend is doing — joins, as I mentioned, aggregations, sorts; if your query requires a lot of memory for those operations, it is a parameter you can increase. Then you have the shared memory, a global pool that all those processes talk to, so each process knows what the others are doing, and that shared memory is synced with the data files themselves.

Going a bit deeper into the data structure, we see that there is the data directory, a directory for each database, and one file for each object — it could be an index, it could be a table. That file is divided into 8 kB pages. If we zoom into those 8 kB pages, we have the page header and the item pointers, which point to tuples at the end of that data structure. So we have the header and the item pointers at the top pointing towards the end of the data structure, and they meet in the middle; once they meet in the middle, another page is created. Of course, this is roughly speaking — there are a lot of things to understand when we go deep into how these structures are manipulated; we can control, for example, how much empty space is tolerated and how soon a new page needs to be created. Zooming in a bit more: if we take that previous slide, with multiple 8 kB pages and the item pointers pointing to tuples, and zoom into those tuples, we see that the tuple itself has a header — not the header of the page, but the header of the tuple itself — which has a few interesting columns we can discuss later, and then the values themselves, values for integers, for text, and then the next tuple (physically the next tuple sits before it, since tuples grow from the end of the page, but reading top to bottom the next tuple comes after). And of course it is possible to learn more, and there are a few interesting ways to do it: the pageinspect extension, pg_filedump — a third-party tool that lets you look at the file itself — and the storage folder inside the Postgres source code.
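A hedged example of that kind of inspection (pageinspect ships as a contrib extension and needs a privileged user; the table name reuses the made-up example from earlier):

    CREATE EXTENSION pageinspect;
    -- item pointers and tuple headers of page 0, the first 8 kB page of the table
    SELECT lp, lp_off, lp_len, t_xmin, t_xmax
    FROM heap_page_items(get_raw_page('staging_orders', 0));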
Now that we have talked about pages and those data structures, we need to understand what multi-version concurrency control is. Saying "understand" is actually a bit presumptuous, because I do not claim that I myself fully understand MVCC — it is a very complex algorithm and how it is implemented is very, very complex; there are a few videos I can share with you that I am using to learn about it myself — but we can talk about its purpose and some of the strategies MVCC uses to do its job.

The purpose is to maintain data consistency internally: if I am changing a row, I must ensure that the change is consistently applied, once I commit it, for the other sessions that are observing that change. I also must ensure that readers will not be blocking writers and writers will not be blocking readers, at least for DML — if I am inserting, updating or deleting, SELECTs will not be blocked by those operations. Then, of course, depending on the operation — for example changing the table itself, adding an index — the object will be blocked for some period of time, and that is a write operation. So yes, readers not blocking writers and writers not blocking readers in day-to-day operations. Of course there are operations that might block your database, and we need to keep that in mind when we are talking as, and with, DBAs, because sometimes we see people say: OK, I am creating an index and the CRM application is down — because creating an index without the CONCURRENTLY option blocks your table for everything else. That is something you should know. Other things MVCC does as well: it provides transaction isolation for concurrent transactions and prevents transactions from viewing inconsistent data. It is that example we were talking about before: if I want to do a cash transaction, if I want to transfer money from one entity to another, I want to do it in the most atomic way, and MVCC is controlling that. To do that it needs a few locks and locking modes internally; we will also see how it locks a shared buffer so other processes cannot get at the data — what we call a latch, or a lightweight lock. Of course it needs to persist the state of each transaction, whether the transaction was already committed or not, and whether that transaction was aborted, for example, and there are a few mechanisms to do that for the transaction itself.

Also remember, when I was talking about the pages and the tuple header, there are a few concepts there about how this is implemented in Postgres: for example, the xmin internal column records the ID of the transaction that created that row, and xmax records the ID of the transaction that deleted that row. MVCC uses that mechanism to understand that a row was deleted — how does the backend know that the row was deleted? Because there is an xmax on that tuple. I like to use tuple for the physical row and row for the logical row that you, or your application, see — but your application might not see a physical row that was already deleted. That is interesting because it raises the question: so you are telling me that if I delete a row, the row is not really deleted? And yes, that is true: if I delete a row, another version of the row is kept, the previous version, and you need a mechanism to clean that up, which is vacuum. Vacuum keeps the pages clean, removing those rows — those tuples, better said. The premise here is that when a row is deleted, the tuple stays there and is just marked dead: I do not need that tuple anymore. The same thing happens on an update. Just a piece of advice for architects and developers: if you update constantly, it means you need vacuum constantly as well, because an update creates a new version of the tuple and vacuum will have to clean your pages. There is a cost involved — it requires I/O and CPU as well — but it is a cost worth paying, because the way MVCC works makes it very fast to roll back an operation and even to insert a new row, so the trade-off is very good.

There are four types of vacuum operation. There is the normal VACUUM, which reclaims dead tuples from pages by removing those pointers. There is VACUUM FULL, which also reclaims dead tuples from pages, but in fact the word is not really good here, because what VACUUM FULL does is recreate the entire table and its indexes from scratch — it is pretty much a deep copy of that object, physically, into another object, and then it switches the catalog; that operation holds an exclusive lock from the beginning to the end.
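In command form (the table name is the made-up one from earlier; VACUUM FULL is the variant that takes the exclusive lock just described):

    VACUUM staging_orders;            -- reclaims dead tuples inside the existing pages
    VACUUM (ANALYZE) staging_orders;  -- same, and also refreshes the planner statistics
    VACUUM FULL staging_orders;       -- rewrites table and indexes; exclusive lock until it finishes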
There are a few alternatives to VACUUM FULL. There is also VACUUM FREEZE, which matters a lot for how MVCC itself works: VACUUM FREEZE is used to prevent the transaction ID wraparound phenomenon, and that will be very important when we do a few labs.

Then the shared memory here glues everything together; the shared memory is, of course, where everything comes together. The write-ahead log is the redo mechanism that governs how changes are made to those data files and ensures the durability of those transactions: if you commit, you can rest assured that the commit is already on disk, but it also returns that commit to you as fast as possible, so the commit does not take a long time flushing to disk. With that previous slide in mind, those phases: the executor itself needs to do a few things to ensure the commit is done. For example, in the case of an update that you commit, it must update the shared memory and then it must ensure that the change is physically in the WAL files — or redo logs, if you are used to Oracle terminology. The implementation of each one of those phases is linked here; you have XLogInsert, which is how a transaction is always appending to the WAL file. For every change that you make — it could be a maintenance change like a vacuum, it could be when you are creating an index or altering a table — you are always appending to the WAL file, the transaction journaling mechanism. If you compare it with a journaling file system, it is a fair comparison: the WAL file keeps the transactions much like a journaling mechanism keeps writes in a file system.

Then we have the checkpointer and the background writer, which sync the dirty pages in the shared memory with the data files themselves. So we have here an XLogInsert appending to the WAL file after the shared memory is updated; in parallel you will see the checkpointer writing a record, a register, into the WAL file saying: hey, I synced all the dirty pages into my data files — so in case of a crash it is safe to recover from there, and it marks the WAL file at that point. And the background writer is constantly flushing the shared memory to the data files. The background writer was not introduced at exactly the same time as the checkpointer; it was introduced to make the checkpointer's life easier and to speed up checkpoint operations.

So, in summary, I tried to cover multiple aspects of the architecture here. For each one of them, as I mentioned, there are very specific points that I think Jeremy will want to cover and dive deeper into with you, and we will see what more we can learn from each one of those.
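To tie the WAL and checkpointer discussion back to something observable, a small sketch (CHECKPOINT needs a superuser; the statistics columns shown match the releases current around the time of this talk, and the layout changes in later versions):

    SELECT pg_current_wal_lsn();   -- current write position in the write-ahead log
    CHECKPOINT;                    -- ask the checkpointer to flush all dirty pages now
    SELECT checkpoints_timed, checkpoints_req, buffers_checkpoint, buffers_clean
    FROM pg_stat_bgwriter;         -- checkpoint and background-writer activity counters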
Info
Channel: Amazon Web Services
Views: 42,248
Keywords: AWS, Amazon Web Services, Cloud, cloud computing, AWS Cloud, PostgreSQL, Amazon, Aurora, RDS
Id: ysG3x2QOu0c
Length: 52min 54sec (3174 seconds)
Published: Mon Jun 17 2019