ElixirConf 2021 - Todd Resudek - I can do all things through Postgresql: Lessons for the Elixir dev

Video Statistics and Information

Video
Captions Word Cloud
Reddit Comments
Captions
[Music] [Music] okay well imagine i played some song there okay god bless there's no words in it so i could hum it okay all right well let's pretend that song happened um great entrance okay i'll walk up and that'll be a song yeah all right that was a great song by van halen all right if you haven't figured it out i am a linux user so video and audio neither one is working today but it's extremely um extremely common so all right this is me my name is todd rezadek i'm a staff engineer at a company called jackpocket um anybody heard of jackpocket yeah hey look at adam that's good he works there too and i'm a hex core team member so uh yeah i work on build tools and that's my twitter handle there i don't have as many followers as mickey but you could follow me if you wanted to all right it's been a long since i've long time since i've seen you friends i just want to let you know i haven't forgotten you and i'm really proud of all the things you've done all right so um if anybody saw the the title of this talk some of you might recognize it some of you don't but it's taken from this verse of the bible um i just didn't want to take credit if you thought it was like a really catchy title or something um so the the thesis here is that different problems i don't think this is controversial but different problems can be solved better with different tools and since most of our applications are already using sql i just want to show a class of problems that i think might be better solved in the database versus in the application layer [Music] somehow i got blank slides in there all right so let's go through the wayback machine so postgres like this could be what you consider the beginning like the genesis of of the postgres project so it started as an ingress project called ingress at uc berkeley and it was actually it's actually not used for what it's intended for uh it turns out they applied for a grant and got funding for this geographic image service and so that's why it's called this was called ingress but it i guess sort of took a turn so in so they worked on that at uc berkeley in 1982 they started working on a proprietary version of ingress and in 1985 one of the original leads from the ingress project started to work on a project he was calling post ingress so you kind of see where we're how we're getting there um and then in 94 uc berkeley officially ends their development of the post ingress which at that point was just being shortened to postgres project and they open source the uh all the source code for under the mit license um so 94 is you know we're starting to get closer uh later that year um two researchers added support for the sql query language so up until that point it was using um its own version of its own unique query language uh called nql i think so this is uh 94 is really the genesis of the postgres that we know now um and in 1996 they officially renamed it to postgresql so if you talk to anybody that's been around or was at uc berkeley and you say yeah use postgres they might be like no you use postgresql postgres was this thing i worked on in grad school that use this other query language so um so this 96 is when we officially get postgresql and jumping ahead today um i mean the wikipedia page honestly just didn't have a whole lot between 96 and here so i don't know what you expect from me um so today uh postgresql is the fourth most popular uh well fourth most popular relational database system uh behind oracle mysql and mssql um and so i some of us have probably used postgresql for a few years and every version or or or minor version something new comes out it's just constantly constantly improving all right is this obvious enough okay all right so there's some caveats to this so i'm gonna give some advice and keep in mind i'm an idiot so it may or may not work for you um so first caveat are you already using postgresql okay if you're yeah okay we've got i didn't it wasn't a survey sorry um so if you are if you're currently working on a project that uses a document store or something this doesn't apply to you because you'll have to add postgresql it's not really doing any favors it's probably causing you more harm i mean i would recommend moving to postgresql but this may this is not advice you can take home and like start putting prs up on your project for um the second thing is like look at your your load on your sql database so a lot of the applications that i've worked on if you look at the cpus like the database cpu is constantly at three or five percent right and your application cpu is most likely much higher than that um if you were to have the inverse moving logic out of your application and putting more burden on your postgresql would not be recommended um so keep that in mind i i think that's an unusual situation but you definitely wouldn't want it to put more work on your postgres uh servers i guess if uh if they're already taxed all right let's get into some examples all right so the the theme of the talk is cars if i would have played the song at the beginning it would have kind of all tied together so just keep that in mind um uh so okay so complex constraints so this is like uh this is based on all these are based on real world examples but i've simplified them um so it's not to you know reveal anything about what i do so this so let's imagine we have this schema so we have a car schema and we say engine type and then we have miles per gallon in kilowatt hours um but so in a car you know your engine type let's say you have an internal combustion engine or an electric motor so that would be like the field the values of engine type and electric motor doesn't really have mpg and internal combustion engine doesn't really have kilowatt hours so this is an example where you'd want a complex constraint so you basically want to say hey if so this is what your change set could look like like a custom validator in there to say hey take a look at the the change set and look at the the engine type and if the engine type is electric we want to make sure that the second the mpg can be nil and the capacity is not nil so that's validating so essentially you want to say hey if this value is this i want to make sure this column's nil and this column is definitely not nil and so you know this would be your your validation that you could put in in your change sets which is you know fine but if we run it through the postgres machine the transmogrifier i don't know what this is by the way does anybody know what that is it's a post-sequel machine postgresql machine so okay hopefully it's not like a kitten shredder or something um so sorry if we want if we transmogrify that into like what would that change that look like in postgres we can create a um a migration like this so uh and put the check constraint directly into postgresql so in this case it's it's uh it's at that that validation is happening at the database layer and when you run a i think i have a slide for this so yeah when you add a check constraint um using ecto code um it's going to check this constraint that you've defined on the database with that last migration and if it comes back it'll throw this error so in with a check constraint function you can give it a custom error if you want to error message after that um and so what what happens here is we've just taken logic from what was in the change set and moved it into the database layer and so the the advantage if you well first of all it's less code to look at um secondly i look at this as sort of like a belt and suspenders approach to things so if uh if you've done web development you probably have forms that use front-end validations like javascript validations to make sure you know this is definitely not going to work and you're going to get a back-end validation if you try to send this so let's just stop it on the client side but then you also have back-end validations in your change set to say hey that doesn't look like a number or that you know your name is required and so front end is the the belt and back end is the suspenders and like moving adding this to the database level as well is like an elastic waistband on top of that so even if it were to get past your javascript validation and get past your back-end validation it will never go in your database if somebody pulls up tunnels to your database and tries to enter data manually if somebody tries to import a sql dump file uh there's no way that invalid and invalid data or invalid row can get into the database at this point so i think of it as just like the next level of protection and i mean there are definitely times when people will be like oh i'm going to just put this in on the console or whatever and they won't go around you know they'll go around your change set for instance and so this is uh this is like a good way of adding that extra level of protection um all right the next example problem would be case sensit case sensitivity so in spongebob knows letter letter case matters right like this means something a lot different than if it was in a different case i think um and so postgres has a solution for that it's called scitext but in if you were going to do this in your application code you might have something like this so let's say you're going to put a car in there and you want to get the vin number from the user but some users make capital letters and some users make lower case letters and you just want to normalize the data so that when you're searching for it you know you're going to be able to to find the right match and so you might have this this function in your change set that says okay well when i get a vin string just down case it before you put it into the database now with postgres sorry the kitten shredder again um you want to see it again okay all right um so in in postgres there's an extension available called scitex uh many of you probably have already used it so if you just modify the type of that column the vin column the scitex essentially what it's going to do is anytime you query it it's going to do a drop case a down case on both sides of the query automatically so the data that's in the row looks like however the user input it but anytime you're querying it it essentially normalizes the case for you um and so that's like a pretty easy solution if you do this you don't need to put anything in the change set at all um and so if you like look at this this was the example of like what it looked like before and after you just can get rid of that function completely another one is adjacency so the common i think the most common example of this is like a leaderboard for something so if you're making like a game and you want to show the user's position on the leaderboard but you don't want to say like you're in 69th position nice you want to show who's in 67th and 68th and 70th and 71st as well right so you might have to do something like this so in this case for cars let's say you're looking for a car by price so you say well i want to show the user the car that is that price but i also want to show them the car that's a little bit less and the car that's a little bit more and so you know in this example you're running three queries um you know one that says get me you know order them descending by price and get me the the two before it and order them ascending by price and give me the two that's after that and then give me the exact match and then you know you build that list that way but if you think about it the way postgres indexes work is kind of like you know if you look at this car lot if you were to index on price essentially postgres is organizing that data in this way already so it would be analogous to saying like oh that you know green card in the middle is at the let's say 11 000 price point that i entered but i you know the gray card next to it is 10 900 and the uh dark car after it is eleven thousand one hundred so postgres already keeps them in an order for you um and so if you were to just use these these window functions that are already built into postgres uh the lag and the lead function um what that's what that is gonna do is say like um hey i wanna get the the price that's 5000 but also select the lag and the lead on that so it's just saying like the window function is saying like find that row but then like just expand your window a little bit and grab me the ones that are next to it as well um and so if the top there with the the width base that's actually a common table expression i don't know how many people have used them before but it's essentially a good way of doing recursion in in sql um so yeah if you if you were to just write this query here instead it you just do one query it uses the indexes it's extremely efficient extremely fast and um because postgres is natively organizing things that way anyway all right the next one is is string searches um so and specifically fuzzy matching you like the picture okay no oh sorry no these cats weren't shredded uh no but nobody take a picture of this okay it's an nft all right i'm glad you like it it cost me thirty seven hundred dollars uh all right what was i okay so okay so fuzzy string searching so users are going on your car site they're like yeah i want a civic you know i got lots of money now i'm gonna buy a civic and um but i don't know how to spell civic you know i so i type in civ ic and search for that term so you can't just use uh like an i like or a like statement in postgres because it doesn't account for misspellings so if you were to you know drop the k off you know it would help you with that assuming you put the wild cards in the right place so in this case okay we're saying yeah this guy wants a civic or this lady wants a civic but does apparently doesn't know how to spell it and so in application code we could actually look for close matches using elixir string module there's a gyro distance yarrow distance anybody i don't know i don't know who that guy who the person yaro is that it's named for but um if he's or she is here sorry if i'm mispronouncing your name so you could uh you could calculate the the gyro distance um of a string natively in elixir and then find all the uh so but first you'd have to find all the cars and then you'd have to calculate all the the makes and models of the cars those strings and calculate their distances and then you could go down here and say okay sort this by the ones that are the closest match and and the furthest match um just for you i like cats by the way i have a cat named taco it was an accident i was like we were on the way to the shelter and all that crap i got to find a name for this cat because her original name was celine which i don't know if you like celine dion or don't admit it if you do but i was like her name's not celine and then i was thinking well what do i like i like tacos so she'll be taco okay um okay so we ran it through the transmogrifier and so this is the query that you could use so instead of all that application code pulling down doing mapreduce uh counting the distances and then displaying the closest matches there's a a very similar to yarrow distance function that's built into postgres called levenstein that'll calculate the levenstein distance of a string and so you could use a fragment like this you know so this is all ecto code essentially the fragment function is kind of the cheat code to drop into sql and find the the lowest levenstein distance either on the model or the make so they can search for honda they can search for civic um and it'll search for both and find you the the match that's the closest distance so uh levenstein distance is just like gyro distance is measured between uh zero point zero and one and um the lower the number the least distance it is and so you can see this is like a bit less code it's actually going to be a lot more efficient there's a slight difference between levenshtein calculations in yarrow distance calculations just in that levenstein is very particular about the position of the string so like civic with the k at the end would be a close match in levenstein in yarrow you could actually like transpose two letters so if you were to put civ ci instead of ic the jro calculation would technically think that that's a closer match because it looks for similar letters but they're in the wrong position so it's probably not going to affect you but just keep it in mind i guess all right the kind of the next level of fuzzy matching searching are trigrams um or engrams i guess so anybody use elastic search for engrams yeah it's pretty popular but um so just for an example if you haven't used ngrams before let's say you have this list of strings and you want to make in this case they're trigrams tri meaning three so it's sections of three um so this is how trigrams are built so basically you give a string to a trigram index and it turns it into a list or a vector of every um chunk of three letters that exist in the string and so you're essentially matching on any one of those and so if you wanted to do trigrams natively in application code you could calculate the trigram distance like this and so let's say we've got a list of models up there all starting with e and then we can make the trigrams of those and then essentially like then we could search the list through through elixir and so that long list is actually becomes these vectors in the in the application code that you're searching but let's say you're a baller right baller all right and you want to buy one of these anybody nobody knows what this is right this is a 1979 cadillac elegante okay so this is ballers only stuff right here um so let's say you want to buy a cadillac elegant incidentally this thing had a had like a 350 cubic inch engine that managed 170 horsepower um and i'm sure it weighs like 8 000 pounds okay so let's you just ran into a bunch of money it's baller time you're going to treat yourself cadillac elegante crap how do you spell elegante right so you're typing in searches and so that's where the triagram comes in handy okay hey it's nouveau riche you know you're not you didn't get rich because you're smart um but you have good taste in cars okay so what we can do instead of all that application code we can just use this extension this trigram extension pgt rgm uh extension in postgres it'll automatically set up trigram indexes for you there's two different ways of building those indexes one's using the the gin and one's using gist this is kind of in depth but essentially like a gist index is going to be faster but it's lossy so essentially what happens with just indexes is they can be built really fast but they cause a lot of false positives and so it ends up actually executing another query on top of that just to validate that it wasn't a false positive gen indexes are about three times slower but they're not lossy so you don't have that additional query on top of it and again this is just informative you might be like why is this trigram so slow and your particular use case might um allow you to use the just index just fine um so for instance if you were to since these indexes take three times longer to build you might be getting um data new data in all the time and so rebuilding those indexes might actually slow you down more than having to do that additional query with all the false positives um so a gen index would be really good for data that's more static so if you look at like our example is car models manufacturers don't come up with new models every year um and so like it's it's reasonable to think like it might be 12 months before i have to even rebuild this index at all um so a gen index is a good way to do it is a good way to do that so um and yeah this kind of reiterates what i said so if uh yeah gen indexes are slower but lookups are faster just indexes are the opposite and so once you've built the trigram indexes this is how you query them now i've never seen this operator in any other language does anybody use any weird languages mark no okay all right so this yeah five character operator is actually the the distance operator for the pg trigram extension and so this is just like the levenstein is going to give you a zero to one um match or number on that there's you know the opposite operator which is like the closeness and then there are also boolean operators so you could say give it a threshold and use the boolean operator and just say like give me the matches that are you know within these boundaries um so that's pretty cool and so if we ran our list through and we search for elegante in there like essentially this is the output of that this is the distance from elegante of all the known models that we have [Music] all right next next level uh bloom filters so a bloomfield does anybody know about bloom filters all right we got some machine learning people here probably yeah okay so a bloom filter is a space efficient uh probabilistic data structure conceived by by burton howard bloom uh to test whether an element is a member of a set and so the language is actually pretty important it will it will always be able to it will definitively tell you that it is not a member of a set or it is it could be a member of the set um and so i i don't have my speaker's notes because linux so i'm going off of memory on this so let's see how this goes so essentially you tell it i want to set up a an index of n length so n is uh should be proportional to the amount of data that you want to store in the set i want to build x number of indexes indices on that so the indices are hashing functions that you run through and they spit out a value so in this case i'm using three hashing functions and so when you run your uh your value through the hashing functions each one spits out a value and in this this example let's say it spit out one two and eight so it flips the bits sorry what okay okay i'll ignore mickey okay one two and eight so it flips those bits in this data efficient space efficient set uh and says okay that value is is there and then the next one you run through let's say those hashing functions result in 0 to 7 and so it flips 0 and it flips 7 2 was already flipped um and so that's this is kind of where you get that um that unsureness that it might say oh yeah all three of these bits have been flipped so i think it's in the set it might actually have just been overlap essentially so that's why you have to be you really have to tune the size of the set based on the size of or the amount of data that's in there or the yeah essentially there's a really big formula that i don't know how to make a lot of the symbols on my keyboard so i didn't put it in here but there's actually like an algorithm to determine you know based on the order of magnitude of the set like this is probably how big your index should be um and so so yeah just keep that in mind it's it's always going to tell you accurately that it's not in the set but sometimes it'll tell you that it is in the set when it's not and so there's a an extension called bloom that exists for postgres that will essentially manage this data structure for you so you can create the extension bloom and tell the indexes the length there is you know the length of the of the vector that you're using to store those bits and then the columns are mapped directly to the hashing functions and how much space that each hashing function will take and so oh sorry i'm hitting the wrong button okay so that's a bloom filter the next level is a ribbon filter now has anybody used a ribbon filter or heard of a ribbon filter okay none of you work at facebook that's good okay all right so a ribbon filter is the even more space efficient version of a bloom filter um it was i believe created inside of facebook there's like one white paper on it that says essentially this is like a bloom filter but even more space efficient and that's like the extent of the documentation so there's no um there's no support really for it yet but i i do want you to know about it because probably in like two years there will be like a ribbon extension for postgres and then forget all what i said about the bloom extension and just scrap that and use ribbon instead um all right we've come to the the epic conclusion um and so this is the way i look at problems first thing i do is i assume postgresql can do this until i've absolutely proven otherwise almost every time i find some native function in postgres that can do this and uh and probably do it better than i could do in application code consider the trade-offs okay so this is a big one so moving there's one there's a specific reason why i didn't put any triggers in this example as triggers everybody familiar with triggers okay yeah so i mean triggers are great they're very powerful it's essentially like if you come from the rails world like a callback method which is like why the hell did that happen and oh there's a callback mounted some in some weird place that does this automatically triggers are kind of like that but in the database saying oh hey if this event happens go ahead and do all these other things extremely opaque if you're the only person that's managing the project use triggers you'll know where they are but no one else will ever know and future you also might not know so consider the trade-offs um if you are going to move application logic into the database it's fine put comments in the place where you think somebody would go to look for that logic so if there's a schema that's doing something in the database go in the schema file write a comment write documentation that says like this is what's happening and it's happening inside the database using this extension or link to the migration file that's in it so that the next person or future you will come along and and be able to figure that out because it putting things in the database is not the first logical conclusion you're going to draw for that and it never hurts to use those belts and suspenders and elastic waistband approach so even if you're doing it well in application logic uh it never hurts to also validate things or put protection in at the database layer uh just to like i said your data science team gets in there f's things up uh or somebody you know cowboy codes and and messes things up in the databases gives you an extra level of security on that um all right i want to just thank my employer uh jackpocket for supporting me and doing this and thank jim and all the organizers of elixir conf for letting me be here and that's it [Applause] i think we have a little time for questions from tim okay i'll take sorry i'll take questions and compliments if you have any comments let's leave them to yourself okay oh tim do you have a question yeah so i think everything in here either went through the change set anyway um well i think everything did go through the change set that i showed and so you should be able to test this logic by proxy of testing the change sets so i think that is where the i think that's where i would i would test it via that yeah i mean effectively the interface to all this stuff is through the change set so i think the tests probably belong in the change set test this is the guy that co-wrote the elixir testing book uh and he gave me a thumbs up so two thumbs up so that's the most thumbs you can give and so that's good uh anybody else have kate did you have do i like spongebob uh i think he's kind of funny my kids like spongebob i used to go with the my little pony theme on most of my talks but the last time i did that nobody had any idea what my little pony was and so it fell pretty flat so um i went with spongebob in you come up cars a new repo and you're trying to figure out how the database is set up like you're trying to help new developers learn where their stuff is what ui do you want to use the ui postgres how do you figure out if you're trying to explore the schema what's your favorite me what tool do i use to find out like this new project how is the database how is the database schema set up uh yeah i use a an application called post bird which is a ui for postgres essentially and i don't i think there's some popular ones for mac and windows as well but yeah that's how i look at the database uh yeah i don't know i'm not like i said i'm not that smart there's probably a better way all right yes sir i've i come from very modest means uh my my mom did buy a cadillac el dorado uh in 1991 it was a 1984 cadillac eldorado and so i have driven that i proud to say i drove it when i was 16. i had my license for i think like eight days crashed the car the good news is in 1984 they didn't have such a thing as crumple zones so you you should have seen the other car so no i'm not baller status yet so yeah i got to get into crypto or something maybe i can sell that nft for like a lot anybody want to buy a cat nft chad you want to buy a cat nft okay uh i'll look up the current going rate for a 79 elegante and i'll let you know how much it is yardas the cab machine is proprietary that's i i can't sell the the cat shredder sorry but i'm sorry i think it's italian for elegant it's oh we have an italian here elegante she verifies it oh is it spanish as well elega but it's got an accent right no okay i think cadillac i think cadillac put elegante um i i do know how to spell it i was gonna talk about on accents but i dropped that slide because i didn't i didn't think i was gonna have time for it but there's yeah okay ask me about it later if you're really interested in the unaccent function okay all right thanks everybody [Applause]
Info
Channel: ElixirConf
Views: 1,667
Rating: undefined out of 5
Keywords: elixir
Id: mliO5_XhAKE
Channel Id: undefined
Length: 35min 47sec (2147 seconds)
Published: Sat Oct 23 2021
Related Videos
Note
Please note that this website is currently a work in progress! Lots of interesting data and statistics to come.