Postgres Open 2016 - Non-Relational Postgres

Video Statistics and Information

Video

Captions Word Cloud

Reddit Comments

Captions

okay everyone last session of the day hahahaha oh he's always a challenge to stay focused but I promise to be interesting how many of you were at my talk yesterday okay good the talk yesterday was about indexing this one is much broader it's about the non-relational features of Postgres I'll start talking about some of the generalities of relational systems some of the history doesn't launch into a ten-point list of non-relational features of Postgres and sort of explain why that's a really exciting thing for for the project my name is Bruce mom Jen I been working for Postgres for 20 years well I work for Enterprise DB who of course supplies support and training and tooling and custom version of Postgres I'm enjoying my 10th year there and I'm also one of the Postgres core team members so I've been doing a lot of work with Postgres that URL right there at the bottom is actually where all the slides are located so if you want to look at the slides now or some other time it's not very bright I don't know why oh I think it's the screen it's the way the screen is anyway that's the URL and you can see that this presentation as long as a large number of other ones on that website I will be taking questions as I go so obviously I'll take breaks and answer any questions you have about about post Chris we do have a large number of slides here 71 slides for 50 minutes so that's always somewhat of a challenge but I guarantee we'll finish one time so let's talk about let's talk about Postgres and relational storage honestly you know I've been a bit with post Chris for 20 years and obviously when you think of the features of post grass you think of oh we added that seven years ago we added that five years ago we added that three years ago you think of it Karl and one of the interesting things about why I created this talk is I started to think back of some of the stuff we had done in the past 20 years and I realized it was quite a body of knowledge quite a body of features that had been developed over that time that makes Postgres unique in the database space and I think that's what I'm gonna try and highlight here and I'll explain why I think that's that's a pretty revolutionary thing and more is something that will continue to define Postgres even more the years to come historically Postgres has been sort of playing catch-up to a lot of the more traditional enterprise databases but in the past couple years we've pretty much reached parity at that at that level and what this presentation is going to highlight is Postgres kind of going beyond what these other relational systems can do and starting to talk about some non-traditional workloads that actually work very well in as part of or combined with a relational database and I think there's going to be increased adoption of Postgres because of these features because of the fact that Postgres does things that no one else does rather than post Chris doing just the same as everyone else does so anyway I'm gonna get into that in a couple of minutes any questions before we start cool ok so let's talk a little bit about history relational storage relational storage originally proposed by EF Cod in 1970 what is that 30 46 years ago that's a long time I wasn't even involved in databases I was nine years old so it has really stood the test of time it's very flexible but you know it's not always perfect we kind of get into these paradigms where we think oh everything has to be object-oriented oh everything's got to be relational Oh everything's got to be you know this or that well yeah it's not really true there's always exceptions and with the sort of explosion of web apps and GIS and and of things and and bringing even more data into databases than you traditionally have in the past things like purchase data and data analytics and stuff you start to see a lot of cases where Malaysia storage doesn't always work great so what is relational storage relational storage by EF Cod is basically the idea that everything 0 in a column they're combined into tables you have constraints and then you normalize everything now you normalize in the you know first normal form second normal form third normal form second normal modified form whatever there's a URL there that explains it but you know the first normal form this is like the base form effectively defines where that what relational storage it represents the idea that each column or attribute contains an atomic indivisible value now what the comic indivisible is is up to debate I'll talk about that later but that's the concept you don't have repeating groups in a relational storage you create a separate table for each related say the data and you identify each set of data with a primary key okay and again I have some URLs there if you're interested but you know first normal form is not always ideal yes it's very flexible yes it's probably the way you should start most of your data modeling but you need to identify when that data modeling really doesn't fit any of the normal forms and I'm going to explain why first query performance first normal form again very flexible very easy to represent things very easy to understand from an from a DBA or a developer perspective but it can suffer from query performance issues queries can become very complex as you atomized the data as you normalize it storage can be inflexible so a lot of times storing things atomized is very inefficient you have a lot of overhead and they're also indexing limitations which you might not identify right away but as I get to further in the presentation I think those index and limitations will become clearer okay so that's just kind of a teaser of sort of how I'm looking at it why I think you know this is a really interesting aspect of Postgres in an area that's gonna be more and more interesting so this is what we're going to focus on against six minutes into the talk we're gonna focus on the idea of eight major non-relational storage types and what I'm going to be highlighting for these eight is the idea that because these are non relational storage they actually give us advantages over a normalized data environment and I'm gonna I'm gonna kind of hammer that home for it each data type I'll show you the data type I'll show you how it's used I'll try and explain why traditional normalization doesn't work for this data type and I'll also highlight indexing which actually works really well in Postgres for these non-relational data types doesn't work so well to law the other relational systems out there and this is again an area where Postgres continues to shine again if you were then talk yesterday and I talked about Gini indexes and Geist indexes and SPG stand and so forth you have an inkling a point of what I'm getting at that the the Russian developers particularly have been very adept at creating some of these very unusual methods for indexing the non-relational storage and I'm going to highlight that so let me take questions okay great ok so let's start with the first one this is the very brain-dead kind of very simple one an array okay Postgres has a built in array type is this relational no okay did it come from Berkeley yes so hey Kim it's not relational but it came from a university known to be you know very foundational in database and operating system design one of the reasons rays are not relational is effectively the idea that every column is supposed to contain one piece of data and by definition arrays contain multiple okay why were they added well frankly uh they were added not only because where they thought users would use them but Postgres internally actually uses a race a lot of this couple of system tables PG Class PG proc actually have arrays in them to represent things like function arguments variable number things like permissions variable number okay could we have taken those input that P those in those pieces of information and place them in another table yes would have been cumbersome and slow to access that other table while you're running a query to look up the function arguments and look up the you know the permissions on an object yes so there were practical performance reasons practical simplicity reasons we're bunching a bunch of data inside a one column when when practical and when it makes sense is actually a good thing okay so here's an example I have an employee table I basically create the deploy table and I also create a certification column or attribute where it's called text but it's got little brackets after it now I could have put a number in there that number would represent the length of the array but it's only for documentation it isn't it isn't enforced at all so I just left it blank but effectively this certification field is now an array of strings okay not one string but an array of strings could I put the certification in another table yes so every time I went the employee I joined to the certification table and I put his primary key on it and do all sorts of validation against it and I was like you know we just want to throw certifications we don't want to validate against it we don't want to do reports against it we kind of just want a little column I can throw some for certifications in there and I think that that makes sense again you have to know what your needs are but there's some practical reasons we're just having it already makes sense so if I want to insert into that array it's actually very easy I basically use single quotes around a string and then embraces curly brackets I'm starting operations curly brackets I actually have these double quoted strings CCAA CSP and see ssp I believe yes cissp those are my those are my three certifications for that particular employee I can do a select star and it displays them kind of in one big bunch here I can also do this which is kind of cool this is the containment operator it's a it's a case where I actually want to take a field certifications and I want to say does this field contain this particular value you're gonna see this containment operator over and over again it's used for a whole bunch of different data types but it basically means am i inside of this other thing is effectively what it is and you can see it's very clear in terms of how to write the query give me all the employees who have that certification and it comes out just fine okay I can do other things I can look at the first element of the array arrays are one based for because that's me the SQL standard defines it so here I'm saying give me the first certification I can unnecessary and like make it go instead of a cross like make it separate rows for each array each value in the array I can go and I can use unnecessary mand repeat it for every unnecessarily set of four rows in this take case from PG class and I can go the opposite direction so unnecessary over a tag takes me from a number of rows into an array that's a coin go in the opposite direction okay so you can kind of see not only can you store stuff in array but you can kind of move them in and out of row based at analysis very easily any questions okay number two range types I talked about this briefly yesterday it's kind of a cryptic data type that a lot of people don't really see the value for I do have some stuff on my blog about why this is really useful but let me just walk you through why I think it's useful right up here at the top we create a car rental table and we create a column for the car I assume and then we create a time span which in this case is very cryptically called TS TZ range which represents time stamp with time zone range okay and the second line in red is what it looks like it's effectively a start and stop time in one field okay remember I told you normally you would atomize this you'd have a column that's start and you have a separate column let's stop the problem with having a separate column would start instead of calling a stop is a your queries are very complicated B the optimizer can't make a whole lot of use of it because if you say start greater than this and stop greater than that it's kind of hard for the optimizer to optimize that and third it's very hard to index it because effectively you have one index one start over here one index that stop over here and they really can't work together so one of the exciting things I'm going to show you a little later is you can actually use indexing on this which is really cool so again all people are going to use this from day one but it does it does have a lot of uses so here's an example of very clean syntax we don't have to play with separate fields it kind of documents itself I'm basically saying go and take my car rental table and give me the time span that contains this particular time point so I've got a point in time but I'm saying give me the span that spans this there may be more than one in this case there's just one I only put one row in there if I ask for a time spans doesn't match I get nothing which you you kind of get right that makes sense here's a more complicated example I'm actually creating another of adding to this table somewhere and I'm creating over I believe a thousand different car rental events that are going in from 2001 to 2010 okay so actually I guess it would be three thousand thirty five hundred events because it's one day a year and it's ten years roughly so what I've done is I put a roll in for every day from 2001 to 2002 to 2010 and if I run this query I say give me the car rental that's fans this time I get one row and if you notice it's actually doing a sequential scan that's kind of yucky because it's reading the whole table to find out where a particular car was in this particular time frankly if your data type doesn't index well it's a toy data type in a lot of ways if you you know you can put a couple hundred rows in but you start with millions of rows if you have an index that really understands this data type you're not really gonna write any kind of really high-performing application with it fortunately all the data sets I want to list your haves indexing support so what I actually can do is I can create a just index those of you remember from yesterday I mentioned just I basically say create index on the time span field notice I'm not specifying starter stop I'm just giving it that TS T Z range column and I'm saying pretty just index on that and then BOOM all of a sudden when I asked to create a find me the row that contains this particular point in time now I'm using an index scan okay using special sort of gist logic very similar to the way we do Geographic points using our tree we have special code to do range type indexing and it's able to pull up that row really really fast and all you can imagine if you had billions of rows sequential scan and billions of rows bad bad bad indexing support good good good right and again a lot more powerful than having a start column in a stop column and trying to index both other than trying to get the index to kind of use both them at the same time just doesn't really work very well right whereas this system understands the particular value the way there's two things in that field and can take very good use of that okay another cool thing you can do with it this is kind of really crazy is you can actually add something called an exclusion constraint which I'm not going to go into in a detail but why can't I there's an exclusion constraint like that and effectively I'm using a special overlaps operator there at the end with two ampersands and what that does is it says do not allow any overlapping intervals inside this column very hard to do in an application very hard to do with some kind of trigger because you have concurrent people inserting into the table at the same time so doing this individually is kind of hard but hey creating what appears to be an exclusion constraint real easy use the index handles it for you you don't creating triggers you don't have to deal with concurrent access problems it basically says don't allow any conflicting overlaps in there and we could have used a different type of operator overlaps is just probably the most natural one to use in this particular case questions yes sir so the question is if you have an insert into this particular field and you're inserting for multiple time zones can you specify that the answer is yes when you do the insert in this case notice I'm using this kind of string you know it's as an ISO format string I can actually put the time zone specification here at the end I can also set the time zone for the session and therefore all of the time going in would be assumed to be within that time zone that your session is has there's a environment variable called time zone I think and you basically set it to EST or GMT or Asia / Tokyo and then all of your data that you every insert that doesn't specify x m is assumed to use it's used to use the session time zone now you might and you might wonder well how did they play together in the same table and the answer is they play together very cleanly what effectively happens is the data is always stored in GMT okay so even though you might mix a whole bunch of stuff in different time zones when you query it it's going to come out in your local time zone there's not going to be any like entries that are gonna be mixed they're all gonna come out in the exact same time zone in fact when you insert the data once that date is inserted the storage does not know the original time zone you used there's no record if you needed a record that actual times when you use you'd have to have a separate field and actually insert probably the current time zone setting for that session into that field and then you'd have your time stamp where timestamp range to column and then you'd have a separate column that just had the original x and for that particular entry because obviously if you stored different time zones in in original format in the table then effectively you'd have to be converting them as you went from row to row I'd like to forget that right imagine what a conclusion constraint would look like you know so the data basically as it goes in it effectively is stored the current time zone database specification offset and convert to GMT at insert time other questions yes sir oh I'm so glad you asked that so the question is because it's a range how would you specify things like I don't know the end time or for example you didn't ask this question but I will put it in your mouth the question is how do i specify whether the begin time or end time is exclusive or inclusive okay so if you take a look at the example I have right here you will notice that there is a bracket right here okay and there is a parenthesis here at the end and what that's telling me is that the insert Tod they begin time is inclusive because it's a bracket and the end time is exclusive that's why it's a parenthesis okay in addition either the starter end time could be specified as infinity or negative infinity at which point that means that there is no specification for the start and the stop and therefore it would span either from the beginning to infinity or from infinity to the stop time which would be the natural case you can also put in null in there and they behave slightly differently and I really don't want to get into that but there's certainly specification and we spent a lot of time designing how to handle unknown pieces of this data and a lot of discussion and we we did this probably five years ago and we're very happy with with the outcome so please read up on the docs to read how to do the details of how null is different from infinity and and how there's some philosophical things of how do you you know I'm not I don't want to get into it but there's an interesting first couple of questions of how you handle various of conflicts in that area other questions yes ma'am yes yes right so so the question is if we had more than one row and we ran the query like with the time have to match exactly well the highlight here is I'm using that containment operator the same containment operator are used in the previous array example and what this is telling me is that hey I want to see a row where any range spans across that moment in time right there okay so it's similar to the array it's not exactly the same because the array you're right the array was saying give me any element that the array saying give me any element that is in the array and it's treating them all kind of the same the the ampersand great event pays a little differently here where it's saying give me any span where this point is part of the span okay so the traditional case in this example is tell me the record that was active at midnight on that date is effectively is effectively what we're doing because we kind of just want to know where it was so it turned out in this case that it matched to start time but it didn't have to it could have been any time in that range it thank you all might maybe I'll change the example to show to show that other questions okay so let's try another one now we're at geometry again a completely different case although it has a lot of similarities to the range type because range type had two values right start and stop right this one has two values except they're x and y but you know the indexing is kind of the same in a way way--just does indexing of points and wait those indexing of of ranges is actually similar so here I actually create and and now I'm showing you geometry here but post GIS uses this exact same system the exact same kind of layout so here I do create table it's a dart table I have a dart number and then I have a point which is made up of x and y on this thing and then I actually create I insert into the table a case where I'm gonna place one thousand random darts on the dart board again I'm not trying to get in the center they're just randomly put on the dart board I know it's kind of hokey but this is what it looks like here's a list of five of the darts and you can see the X&Y points for the various starts I can use the diamond operator although I'll notice that it kind of goes the opposite direction it's less than an at sign instead of epsilon greater than somebody's just going the opposite way I can say give me all the darts which are within four unit distance of the center which happens to be 50/50 and the way I do that is I actually take a circle which is another geometric type that we support that's the syntax for a circle Center 50/50 radius for I'm casting that to a circle and I'm saying give me the locations which are contained within that circle similar exactly to where I said can forgive me the ranges which are contained which contain this particular point okay so now I've got a circle I'm saying give me the ones that are within a circle it turns out there are five rows five darts within that for unit circle of the center but you know I want this to be a production dartboard I'm imagining billions of darts being thrown and boo-boo-boo it's doing a sequential scan that's not going to help me I'm going to be very slow I'm gonna have to search through the every dart to find ones within a certain distance so yeh comes to the just comes to the rescue we basically can create an index and just give it the point location which it knows is a point it knows how to index it and now all of a sudden if I say give me all the darts within four units of the center I get an index scan which is really really really cool if that's not cool enough you can do some even more cool stuff I did allude to this yesterday so here I'm actually saying give me the two closest darts to the center and that's usually very hard to do in a relational database normally you have to create like a circle give me everything within ten and then if I have more than two then I have to look at those everything within ten units and then like sort by distance that's usually way you have to do it in a relational you have to do two steps because if you do a square of ten or a circle of ten and then you don't have two darts and you got to do a circle of 100 and if you don't have two darts and there you got to do a circle of a thousand you have to keep making your circle bigger until you get enough darts that you can then compute the distances and sort of that's usually the way I have to do it fortunately Postgres has something called nearest neighbor search which allows you to do it in one query we basically kind of walk around the index until we get to I know it's kind of crazy you can talk to Oleg Bart doing off who's here the big Russian guy kalmyk Russian guy who who was working on this and they basically walk around the index until they get two points they compute the distances they go and then they return to heap rows that match it's just amazing stuff this is this is a game changer if you're doing a lot of GIS stuff this is really cool now of course the dartboard examples completely ridiculous but you know give me all the houses within 10 give me all the bars within 10 these kind of things the the closest bar seems to be a big query in every demo I've heard of but anyway I like the darts better any questions yes sir so the point is that this is assuming a flat projection because it's just geometric type it assumes the units are the same width and so forth if we were using post GIS yes indeed it would know the distance based on the coordinate system that you used when you store the data and it would convert various coordinate systems to a uniform coordinate system to do this correct the operator is exactly the same yeah in fact you'll notice the operator we're using here is the spaceship operator with a distance operator not the greater maybe that one's containment contains in a circle here we're saying give me it's kind of an awkward syntax but it works really great again an example of Postgres kind of going that extra mile where we've actually got a case where we're our actual query or actual restriction is in the order by clause right like how often do you see that you know it's usually where something this is order by something and that's how you that's how you you instantiate or call the nearest neighbor operator if I didn't use the limit clause here I would get all of the points in order of distance from the center all right I mean this is cool but what's also cool is when use the limit it doesn't compute all the points they give you the top two it knows to do two and stop this is pretty pretty crazy stuff and I believe Postgres may be the only one that does this kind of feature other questions although I expect all the databases to copy them as they have copied other things we've done okay hmm okay very odd so we're gonna let's talk a bit about XML Postgres does allow XML storage I'm not going to go into a huge amount detail on it because XML frankly is just not super popular I'm not like dissing s XML but we've had it for a bunch of years it just doesn't seem to do a whole we don't get a lot of questions about it so I'm gonna kind of just buzz through I have a lot of slides and again you can look at them a detail so what I did is I basically looked at the foo Matic printer driver on my debian server I pulled a bunch XML files and I loaded them into an XML flat file and I basically created a table with an XML column I just loaded in the XML but you can do cool stuff with XML you can do XPath queries here's an XPath query looking for the short name of particular printers I can actually get rid of the array reference by looking at the first element I can convert them to text so now they're text strings I can I can actually look at the first element and then I can order by them so now I'm taking the text I've converted to text now I'm ordering by and giving me the first five I can pull in like five random I can do XPath where the actual attribute is not at the root so it's looking within the path and finding anything in there so there's an example of the printers names I can unless that's all that with a raise these these are raised here right because I got the braces so I can uh nest it and I can do that I could even do this crazy thing I'm using common cable expressions I'm making all the arrays on nesting them I'm filling them in to another query and I'm saying give me all the HP particular drivers that which are in this XML document okay again if your eyes just glazed over that's okay it's just showing that this is another type of non-relational data that we can store questions okay so let's take a look at JSON this type of type was developed about five years ago this is not JSON B and I'm gonna totally JSON and be intimate this is not the JSON B now it's a little confusing because we have a JSON type which stores JSON and then we reduce OnBeat which stores JSON kind of confusing I'm sorry about that there are some good reasons for it I'll explain why the JSON data type is of is basically similar to the XML data type in that you just store the JSON unadulterated okay take the JSON document store in the database okay pretty easy to understand it's torus stores it effectively as a text string nothing wrong with that it's compressed if it's long if it's a long text string but it's stored the same way we store xml the stanley stores text strings okay the only difference is that we have validation on that string so you have to store valid JSON in there and we have over a hundred JSON functions which allow you to manipulate and query different pieces of the JSON okay let's see what I mean I went out to macaron comm which is sort of a random data generating website who knew they existed it fortunately dump stayed out in JSON so I effectively create a table called friend and I copied right here my data over into from the maca root into my database and I have a thousand records we do here's an example the first two records my friend's name it seems I have friends Eugene Reed and Amanda something and those are my friends but I have I have 998 more so good for me but you can see it's a standard JSON type of document I can convert that JSON to look more like what I would use in a JavaScript program where typically when I'm doing JavaScript I'll have braces and then I'll you know kind of put up one on each line with commas much easier to understand I can pull individual fields out of the JSON so I can say go into the JSON document if there is an email key give me the value so here's an example of five email addresses of my friends okay I can concatenate things together I'm saying here's first name and last name with a space between them let's display those I can you even use the this kind of JSON interpolation inside of a where Clause so give me the first name where the last name equals banks turns out I have two friends who have a last name of banks I can even do it this way this is that containment operator again remember where I'm actually saying last name banks and I'm using the containment operator and I'm converting the JSON data to JSON B I'll talk about that in a minute but again I get the same I get their same result if I find that I need to look up a particular field a lot I can create an index but in JSON on like JSON B and JSON I can only index one key at a time okay I can't index all of them unless I want to create an index for each one each key that I could possibly have which sounds horrendous solution coming in the next couple slides but what I'm doing is I'm saying you know I'm really always looking up my last name so I'm just gonna index the last name so now when I run when I'm sorry we're getting ahead of myself here now when I run this query and I say last name of banks because I've created an expression index those you met for yesterday I create an expression index the last name all of a sudden yeah I get an index scan for my key that I'm interested in that's that's just great okay I can also do calculations I can say give me my friends who are in this IP address block I can even do aggregates on these keys so here's give me my gender layout or my gender breakdown sums how many male and female friends do I have and it gives me kind of a layout there any questions yes ma'am I'm sorry yes or the keys case-sensitive and the answer is I can't remember yes no it's what is in JSON they're not case-sensitive J songs are case-sensitive they're the keys keys sensitive in JavaScript anybody I think I think they are maybe they are either feeling they are so I think they are they're my guess but whatever whatever it is in JSON it would be this I can't remember how we implemented but it follows the JSON spec no because effectively JSON document doesn't have any knowledge of keys so it would just it would it would it would think that's a completely different key it would have no idea it was related to the other world yeah yes is there a way for right data to go into the database yes typically what you would do is when you create the table you're going to add a check constraint so when you create the table here right here you would say check and in parentheses you would say data - last name not null for example or exists this key exists in the field I'm about to enter so you can actually put constraints around the JSON coming in using check constraints and will validate against those checks every time you do an insert or an update right and that's a lot of cases where people would use it because it's kind of annoying that there's no there's no like schema on top of JSON that can be enforced similar to what XML has a schema but in this case there's no schema so the only way they'd force that schema is effectively to put it as check constraint yeah other questions yes I'm sorry this is a is just a text index because it's really just text value yeah just the text index is the question yeah it's a b-tree it's a b-tree index of text yeah okay all right let's go to json be json be is kind of json on steroids there's not a whole lot of reasons to you json when you have json be honestly and i'll start to talk about some of the amazing stuff about this so first off one of the cool things about json be is the values are native javascript datatypes that means that a value associated with the key can be a text string but it could also be a number it can also be a boolean it can also be a null it also be its own object a sub object technically okay you can create an index of every key and every value in the column not just the last name but everyone okay it's compressed the keys are binary searchable so it's very quick to parse and look up keys in in in json be the only downside it doesn't preserve key order it doesn't preserve whitespace and it doesn't retain last duplicate okay this is very similar to the H store if you've ever used that that we have support here's an example of the downsize you can see that the whitespace has gotten removed this is the JSON up here does the JSON be example notice that the whitespace is removed here the duplicate name has been removed so there is it does kind of canonicalize your JSON a little bit but again there's a lot of positives I would say you would use JSON if you just want to store the thing in and pull it back if you're just wanting to input and output JSON use the JSON data type if there's any type of manipulation or indexing you want to use you probably want to use JSON B because the ability to manipulate that without we having to reparse the JSON string every time you access it is huge here's an example the same data I'm going to load into a JSON B table I'm going to put a Gini index on it there's two types indexes for G if for a JSON B but I'm just going to use the default one so here now I can look up last name banks okay and I use this index this index that I created right here notice when I created this index I didn't specify a key at all I just said data I just said go in index everything so what you can actually do is you can say I can I can look up last name I can look up first name same index I can look up IP address same index okay so that's just really powerful in the way you can kind of do that that the fact that just index is everything and you don't sort of have to decide what your queries are going to need and this is the kind of thing that got people very excited about Postgres as a replacement for new sequel type of stores yes does this make your index really at large yeah the index is gonna be kind of big although the key and you might I don't know if you were in the clip thing yesterday but as you remember a genie index only stores the key once right so you're not storing the string last name over and over again you're saying last name and here's all the rows that have a last name right and those pointers are also compressed so actually we've the index has gotten pretty good I'll tell you it's like a tenth the size of mongos at least in version 2.4 among those so it's pretty good but but I guess it can get pretty bad yet because you are having to store a lot of pointers particularly JSON documents are very large yeah yes sir overheads for inserts and updates yeah I mean the Djinn index has been optimized so effectively when you insert into a Djinn index it actually inserts into what we call a holding area and the holding area effectively holds the changes and we then flush them later to the actual index and we look it up as we we check the holding area anytime we look in the index and so forth so so the problem with Djinn traditionally is if an individual inserts and updates were very expensive but because we have that holding area and we way we batch the changes into the index the overhead is not as bad as you would you would suspect which is traditionally the problem with with with bitmap indexes do they have that problem where it's just an insert can make it go berserk you know if you got to expand a bitmap and all of those stuff so because we're batching things it's not that big a deal ago but that was identified very early as a problem we we know we always had a holding area and it's gotten better over the years and even in 95 we made some improvements to gin they've made it much faster okay all right two more to go row types this is a bizarre data type it's actually a row you can store in a column so here's an example in the way you create a road datatype is you say create a type and the name of the date of the road data type and then you put the column values and the labels you want the column names and the date that you want in the column and the row type so here I've got a driver's license row type and now I can create a trucker table with a license of type driver's license and effectively I can insert jimbo begins as a truck driver and I can specify his entire license information inside a single string now the problem and I can do cool things I can select star and it comes out fine I can select just the the row type I can even it can even look inside of the row type and get a particular column from inside the row type okay keep in mind this is gonna be a problem if you need to like send it to a client cuz the clients gonna be completely confused if you do this like it's gonna be like what is this it's just like this string it's really hard to understand so just don't go berserk with this one you see it's a lot used a lot in server-side functions where people are having to kind of setup they want to pass a row into a function and my tip you later you see that a lot so there's one more to go unfortunate this is a pretty complicated one so I'm gonna have to kind of buzz through this but philosophically a character string can be thought of as a multivalued type because effectively is it a collection of letters is it a collection of words is it a collection elections like it's kind of unclear as a collection of prefixes kind of a string is this amorphous thing like a number is a number but a string it's kind of can be looked at a whole bunch of different ways so in this case I just took the fortune fall for FreeBSD loaded it in got fifty-nine thousand rows into a particular strand tricular text field and you can do a whole bunch of cool things with string manipulation and again if you want to look at the slides later I can do things like looking at underdog but in fact there's no underdog but there's an uppercase so I can actually call a function on that but that might not be cool for indexing so if I create an index online it can't use a lower of line I talked about this yesterday so I'm not gonna go into it too much but you can actually create an expression index on lower and then when I actually call underdog even though it's all lowercase it's gonna find a match and use the index because I've created an expression index there okay um I can do prefix searches with like fine give you all the lines that begin with mop unfortunate that gives me a sequential scan boo however I can create a special index called text pattern opps and again you can read this to understand why it's important but if I create a text pattern ops index on that line I can now find all of the lines that begin with mo P using that index okay that still doesn't look at individual words it's just looking at the beginning of the string okay I can combine those together if what if I want to find the lower of the prefix I can create a text pattern ops with lower and I get an index still okay but that might not be enough maybe you want to search for words you want to break the string up into words Postgres supports full text index with a whole bunch of 15 languages it supports stemming which means remove things like ing and edie and plural from-from words and bring them down to their base you know their base word you can remove stop words you can put synonyms we're in there we have prefix search coming in 9 6 so here's how you kind of set it up you use a TS vector ts vector effectively takes a word and breaks it up into its component parts it gets rid of the stop words it turns out that I and and can or actually stop words here okay you can use a TS query and you can basically look and you can say from this string does the word hardly and wait exists yes does the heard word softly and wait exists no okay I can creat an index on it with a TS vector and now I can say give me all of the lines that have the word pandas and the word panda peers here I can do this kind of query and in fact that panda search uses an index now man remember I said if it doesn't use index it's no good here's another one give me the one the word the lines without the word cat and sleep I got a line give me the words that have the word cat and sleep or nap I get two lines cool I can also do prefix searches with this give me all of the lines that have the word that words that begin with zip and in fact it uses an index which is a a very good but that might not be enough for you maybe you want to look at letters within a word letters within the word not the word not the prefix but the letters in the word here I have something called here's an I like query where I'm actually doing that but you know it doesn't use an index boo but we do have something called PG trigram it's in contribute a special Gini index on PG trigram and now when I go to look for the word Verret you find it uses an index it uses this new trigram index that i've created you can even use the trigram index to do prefixes if you're really crazy enough to do what i've just done here and it returns the data I don't use an index cool you can eat you can also control similarity you can say give me all the lines that are similar to this with a string that's a feature of of the the trigram capability here and we also have there's a uses an index like similarity and we also have soundex meta phone and so forth if you want to use those so these are all the indexes I created in the in the text in this last section these are all indexes I created on text columns and these are all of the data types that support me the containment operator that that at sign greater that we saw over and over again there's the one with circle there's the one that json b there's the one with TS query there's the one with it right and that is it folks we have run out of time so I will be up here to answer any questions I didn't get any questions stored at the end I realize I kinda had to run through it I kind of knew that the the dialogue about this is non-relational is great but don't get carried away you're gonna end up looking like this guy right don't just stuff everything into one field relational school it's just we thought be understand when you want to use it and when you probably don't so thanks very much I'll be up here to ask questions

Info

Channel: Postgres Open

Views: 1,021

Rating: 5 out of 5

Keywords:

Id: nfBeNHFw_z4

Channel Id: undefined

Length: 52min 25sec (3145 seconds)

Published: Wed Oct 12 2016