All you wanted to know about collations — Erland Sommarskog

Video Statistics and Information

Video

Captions Word Cloud

Reddit Comments

Captions

take it away thank you so welcome to the sessions and title all you wanted to know about collages and whether it is all or you wanted to even more even more or if if this was far more than you wanted to know well we'll see anyway i'd like to thank you to everyone who voted to have this sessions group by i'm very glad of that this was a session i developed this year because i like classes one of my favorite topics but i think figured maybe this will not be too popular by people who organize conferences and so once i'm very glad that i was get the chance to present this one um let's see uh yeah yeah so before we move on i'd like to remind you that we have a group by that sorry we have a oh we have a code of conduct here at group by and uh well which simply means that if you ask questions in the chat or if you well mute et cetera et cetera you're like you're supposed to behave nice and respectful don't say dirty things about anyone else et cetera et cetera so because if you do what you would get you will get kicked out so anyway my name is alan somasko i live in stockholm where i work workers independent consultant i've been an mp for many years and i have my website so must go to sc which you can say well a small collection of large articles however there is no articles on collations there yet maybe i will write one um here's my email address so if you have any questions i don't answer right now you're more welcome to get in touch with me i also like to point out that slides and scripts for this presentation is available on my website right now so must go to the s8 present then you have a list of all my presentations and if you want about this picture here to the left well that's a swedish summer for us and then you have to figure out the rest yourself also um we have sponsors uh very planned to have to give i like to give big thanks at these because i i have the suspicion that without these sponsors we will not be doing this event so big thanks to these redgate solomon's straight paths solomon's against yes q storage and dallas dba um so here's my agenda i will start to talk about collation basics of what what is the collision and where we can define it in this concern i will then divert a little bit and talk about unicode and code pages because that is something which is affected by or well involves with galatians i will then go and pick apart the well the autonomy of of the collision name you know all these cias was all this stands for that will be quite big part of the presentation and they will will eventually lead us to discussion about duty fade collisions that were added with escrow 2019. i will then talk about two pain points you could say with collisions for pain points for two different reasons the first one is metadata and the other one is conflicts i will have some show you some useful tricks with glaciers where you maybe didn't even think of using collisions and finally i will talk a little bit how collations can affect performance so what is circulation then well a collision that is a set of rules for how to handle string data depending on human language i mean because people speak different languages and expect a to be handled according to their language so for example comparing data is this one an uppercase i without any dot equal to a lowercase i with a dot it's single v equal to w it's this one or over the dots over or er as i would call it greater than z or maybe smaller is plus does it come before or after minus now you might think that if any of these that these are tried it's always the same this can have a change beware you will see examples of all these four will be different during this session now once we have the rules for sorting that also defines grouping sorry once we have the rules for comparison that also defines the rules for grouping and sorting because if these two are equal well they will they will group together in the group i if they were unequal they will be in different groups and same thing for sorting if this one is greater than this one well then it will sort after that else it will sort before that the rules for upper and lower are also affected by a collagen i guess not all all languages have the same rules you will see example of this like ranges is also affected by collisions this is some something that sometimes bites people and we'll come back that towards the end um the collation also controls which code points in unicode that actually are available and that are defined this may not register right away but don't worry i will talk more about that later and also the collation controls which code page to use for vaca that is which characters are available when we use the varchar data type and here's an interesting tidbit not all collisions support the varchar data type you will see one example of this now in sql server we can define the collation no less on four different levels and the first one is the server collision and which was set from install sql server this one controls the collision for the system databases and it is the default relation for string columns in temp tables and the server correlations also serves as a default for the database collision but when we create a database we can say no no i don't want the server collision i want that collision over there over there in the right corner and then the database relation well that sets the collision for string variables and the literals and it's also set to the default collision and user tables and table variables a difference to 10 tables uh then when you create the table you can say no no for that column i i want that collision on the area on the third shelf from the top please give me that one and finally you can when you have an expression you can cast the collision on the fly in this session i will use these three quite a bit more than once you will not see me see me however see me installing esco server that would take too much time but there's one thing though i like to call out when it comes to the the server collection and that is when you uh runs it up and you come to this page the server configuration you have the collation it's tucked away here on the secondary tab so it's not splashed in your face but always always make it a habit to look what you have there because it may not be what you expect the default in the setup wizard is taken from the windows system locale and that may be different from your personal regional settings and even if the wii um system locale is the same as your regal settings well the collision may not still not be possible oh sorry it may not still not be the be that one for your language so if you have a polish system collision nowhere you will get a polish system sorry foolish system to kale you will get suggested the polish collation if you have a norwegian system or kale no no you will not get an ouija collision so it's always a good idea to check what you have there because changing the collation after the fact it's a pain um here is more of a reference so if you want to find out what is the server collision you can use this function server property or if you use the database database collection you can use the database property x give the database name and collation or you can look assist databases for both of them because the system databases will have you will see those there and for the column for you can simply look usb help on the table and the and that will be the rightmost column in the second result set or you can you can system columns so let's now talk about unicode this is the character set well that windows is based on unique all modern operating system all modern computing is based on unicode which is a character set that supports all human languages in total it's possible in unicode to define 1.1 million code points that takes up 21 bits today there is as many as one and four to three thousand defined which kind of scary number um anyway and every once in a while they come up with new versions of unicode to define more and more code points and the collagen esco server is tied to a certain version of unicode and what the impact of that i'm going to talk about that later on um now one thing you should know is that all living languages they can be expressed with the so-called base plane that is the code point from 0 to 65 535 that is the lower 16 bits also good to know is that the code points from 32 to 126 that is the ascii we have known and love with computers since way back in the at least the 70s and also the code points from 160 to 255 is the fairly common standard isolating one which is good for western europe and which has been around since well mid 80s and here are some sample code points from unicode and you always in in unicode the convention is to write them as u plus and the code in hex so a capital a is o plus 0041 which you may recognize also 65 as an ascii code here we have a polish character there's a greek character here's a japanese character and here's a playing card which is one of these characters beyond the base plane and it's only when the characters is beyond the base plane we use more than than four hex to denote the character now we can encode unicode in more than one ways because these are only the code points where we put them in the computer we can do that in different ways so one encoding is ugif32 and this is very very simple-minded encoding there's four bytes for every code point but it takes up space and it's not very commonly used very popular in the western world is utf-8 which is used as one it's variable length encoding so there's one to four bytes per code point and we'll get see more details about that later on and it's also commonly used i think linux for example uses ugf8 many systems uses ugf8 however not windows windows uses youtube 16 which has two bytes for every code point in the base plane and then four bytes per code point for the rest and we will also see how that works later on and as i said windows is based on youtube 16 and from that follows in sql server the data type in varchar is utef 16. now we have not always had unicodes do there were computers before unicode and before that we have smaller character sets and windows supports these and the terms in windows is code pages and all these code pages in windows as far as i know have ascii in the range from 32 to 126 and then the range from 127 up to 255 is used to support one or more languages but far from all languages so each of these code pages is targeted for a certain group of languages so for example cp850 is for western europe and this code page was originally designed for ms-dos and it's just appropriate for microsoft and it's not interoperable with other systems because way back in the 80s about every vendor had their own or 8-bit character set copied 932 is for japanese and then we have 1250 and 1252 which both are drawn from this from the 8-bit standard iso 8 uh 1859 so 1252 is latin one for western europe 12 11 2 for eastern europe and then there's also coke bridge 65 0001 65001 utf-8 because as far as windows is concerned youtube 8 is just another multi-byte character set yes that is the terminology and you might think 8-bit multibyte yeah well we're going to see that just in a second so anyway these code pages they are the foundation for the varchar data type in sql server so let's have a look at this so first of all we're going to look at the server property sorry oh at the server collation from my instance so you know where i am and also the code page for that one let's use collision i can use the function collision property give it a collision name and i need to use converter because this one returns this call variant and then the property code page so we can see this i have the collision finish swedish 100 csas well i'm swedish so that's why i have a swedish collection and the code is 1252 for western europe so now here i've got to select what three words all uh the same three words in two different ways here there is an end in front of the literal which means that the data type of the literal is enva car here there is no answer the data type is a barca which means that not all characters are available so there are three words and we'll look at these so this words rex murgus it's a swedish word meaning shrimp sandwich and include i like this word because it includes all the three characters in the swedish alphabet that are not an ascii and you can see also in vodka we get back as it was originally because i got a swedish co-pitch here i've got the polish word swansea which means sun but here if you look closely you can see that um there is a slash on the l and there is an accent on the end which is missing down here because these two characters are not available in 1252 so they are being replaced with fallback characters which means characters that look similar basically the original character without the extra theocritic remark and here we have a japanese word or a name nakamura that's a district in tokyo but there is nothing like like that looks similar so we get the gene genera generic fallback character the crested mark three of them now let's move over to database called japanese cias all databases i'm going to work with today will have the name of collision and use that collision so you can see here i'm saying create database collate i want that collision for this database and we're also going to look at the code page and now it's 932 a japanese code co-page i'm going to look at these three words again and now you may see that react smokers now comes like a smart small gas well if you don't know swedish you might think hey that's the same what's the difference hey i'm swedish i know the difference this means well nothing but anyway because these two these characters uh i know are not available in japanese so they get replaced with similar ones for some reason here uh the polish characters get replaced with question marks but now nakamura comes back as nakamiguru also in barca now let's have look at some more of this because let's give this a think thought so if you think that um these are just a 8-bit character set now remember these are geographic characters this character means middle it has a translation directly into english so do you think that only one and 27 of those no of course there's about thousands of them so how do they do that well i'm going to put this into a table not going to be good and also yokohama it asks you and i'm going to look at the length of the word in characters that's what the length function it returns the number of characters include excluding training spaces and data length which returns the number of bytes so and it's a varchar8 and we can see here this one is three characters but it's six bytes whereas yokohama is eight characters eight bytes because that's that's the ascii part of the code page but this is the japanese parts of the code bridge and what they do is that they take two bytes so they have you have to combine two bytes to get one of these characters so it's not 255 characters only now this is varchar8 so we have another japanese word or well name it's a railway station to talk about one two three four five characters so it should fit shouldn't it or no we get string a binary data would be truncated in table and create the value s and as you can see the last character is missing because varchar 8 does not meet eight characters it means eight bytes yep that's the way it is and both the japanese people have had to put up with this since sq2000 was introduced oh sorry since 2000 introduced japanese galatians uh one more thing we're going to look at while i'm at it is i'm going to create this table called india and here i'm going to force the collision of this column to be in the general 90 cias and the data type is vac and varchar and that worked well now we're going to create a table called india2 and i'm going to have the same collision but now a varchar column but i get an error message collation in the general 90cis is support on unicode data types only and cannot be applied to car varchar and text data types so this is one of these collisions that do not support barcode we can just see here that i got the same error message if i try to create a database with that collision because it's kind of silly i have a default collision that i can't use for markup so what is the code page for this collision well it's zero there's none at all and what that why there is not a code page for indic languages i mean after all there are quite a few people in india i don't know but maybe they never come around to do one but there are there are a couple of glaciers that do not have a code project i can only be used for envaka quite a few are like not only indic i think the bengali has a mess then the political agents are like that of the european glaciers i think maltese is the only one in memory says so yes varkari is not always 8-bit that's a good thing to know let's talk about all the collisions that aren't as conserving and what the how and what there are there are no less than 508 or actually that's 5681 including also the deprecating ones that's quite a lot and they fall into two groups and the main group is the windows glaciers that are which there are 5431 or 56603 if you encounter include the deprecated ones they are based on the windows system locales the documentation says exactly what that means i haven't really figured out i guess it means that well if we have let's say um the poly collision well then service is going to work with with the pilot data like windows would work with the power data more or less although not necessarily like in this in file explorer because windows explorer is doing some kind of smart sorting that would not be good to do with large amounts of data what is important to know about windows collation is that even if you work with varchar all operations are carried out in utics extinct so they will do conversion to utv16 and possibly back if needed to do the operations then we also have the smaller group this sql collisions these are legacy way up to sql seven square server in a single instance you could only have one single collision or rather sort order as the terminology was and that was applied to the entire instance you could not set one on server one on database one and column one on the expression level it was one size of it all do this and for these s calculations the varchar has no relation to youtube 16. it's based on a specific code page with that own library basically that goes back to even earlier days 6.5 and earlier when this conserver did not support unicode at all that was though nvac or they detected but for end of our car while there's calculations work like windows glacier although these are legacy so you'd stay to stay away them you should never use them well the most commonly used collisions in this world is this one sql latino one general c1 cias because this is the default question if you install escrow server with the system the kl of english united states is the default collision on linux and docker where you have to always specify it on the command line when you install it's also the default for anything you do in azure as a matter of fact if you install an azure v in fresco server you cannot even choose the collection and install them you have to change that afterwards but i will mainly talk about the windows creations but i will show some funny things about this one as well now we have a collision name here polish 100 ci ai ks ws sc utf-8 what does all this mean so here's a quick summary the polish that is the collision family 100 that's a version number well c i and ai that's about case and access sensitivity ksws vss right now i'm not going to say japanese i'm going to talk more about that later sc stands for surrogate compatible that can also be bin and bin two for binary collisions and finally we have youtube eight fugitive eight support so we're going to take this now more one by one let's start with a collision of families so a collision family can be for a oh single language like polish or is for example some languages have more than one malaysian family so there's traditional spanish and modern spanish why you will learn a little bit later some languages share a collagen family like finnish and swedish we have the same rules could be because finnish finland was part of sweden for quite a long time um and then we have that one general which is a bigger family for languages that uses the latin alphabet to have common rules in common not extremely a lot of languages but quite a few widely spoken ones like english german portuguese italian and indonesian even well dutch is not really that widely spoken but why it's one of these um there also the cyrillic general for languages that use the cyrillic alphabet mainly russian but there's also bulgarian uh belarusian and not ukrainian they have their own collation but also quite a few languages spoken in russia like uh mordovian uh chuvas etc um and the croatian family determines well for the basic first of all the basic sorting and comparison rules because of of course they come from the alphabet of the language and what carrick what rules the language have but then they are further refined by whether it's case or accident sensitive et cetera also the version number can have a minor impact you will see one example of that later on and also the roots were upper lower because that follows comes from by the language and the code page for varco with the exception for ugf8 glaciers now we're going to have a look at some collisions families and this demo is more for fun just to show you how different languages do things differently before we go there though i just like to show you this function systole f and underscore help collisions that lists all the collisions that are on sql servers so here first albanian ones here's arabic ones etc and we can see down here that yes the number is oops five thousand five hundred uh okay sorry about that five thousand five hundred eight so anyway i'm going to create a date base here called that one general c-i-a-s and move to that one i'm going to create a table called sword words and which has a number and a word and a remark the purpose of remark is that if you wonder what any of these words means you can read the remark column i'm not going to spell that out i'm going to add a couple of languages and the number here will be according to what one you what you will get in english or in portuguese or italian that uses lag one general so we can run this as from the beginning as without any with a or rather with the collision of the database and you can see one two three four five six seven eight nine ten eleven and you can see here that this a will ring over source after a1 but this one aimed with the tilde oversales before that well that is because this accent as we can call it because this ring doesn't mean anything for english so the english thinks oh yes it's just like an a but that's because that accent only matters if nothing else differs so that's because this one comes after this one simply because else comes after a and this one comes before the new comes before never before netherlands because a comes before e now anyway let's now force the collision when i saw it to be um traditional spanish this is what i do i say create word collate traditional spanish and now again one two three five six four eight seven so and nine ten eleven twelve so chico's move to be between centre and charters to be right before netherlands this is because in the spanish language up to 1994 ch was considered a separate letter and had a separate entry in the dictionary then in 1994 uh the royal spanish academy said no no wait but this is too difficult for the computer range now the computer will never never be able to sort this out so they decided that ch and ll will now sort so ch will sort between cj and ck and ll will also sorts right in l so ll was also considered a separate letter and that's so this is traditional spanish modern spanish would put chicas up here or also up here still though nyan do because n with the till over is still considered a separate letter in the spanish language such a separate translation and therefore a distorts between n and o now let's try hungarian and see what rules what we get now one two three four six sevens and ten and nine so chico's is back in place and i thought charlotte does is between cycle and the undo because in hungarian cs which is pronounced like ch in spanish counts as a separate letter still they haven't changed that it comes as a separate letter and comes in the dictionary between c and d or yeah well as i forgot some other letter and also they have overdose over which in hungarian also counts as a separate letter and which they saw has a separate entry between o and p so that's why these two change places now let's go to finish swedish and see what happens now one three four six etcetera ten eight ten oh two nine at the end so three swedish has as you might you might recall from the first lemma three accelerators that are not an ascendant you see two of them here a with ring over and over dots over or o as we would say and which is this is the third and last letter in our alphabet and this is the last one yes because in difference to spanish and hungarian we did not place the them right after the similar similarly looking characters now we decided to put them at the end of our alphabet so that's why you have them down here and then we're going to move over to my our neighbors our well our swedish neighbors danish and green anik well these roots probably got mainly from danish and that starts on three four and we have nine two one at the end so danish also has three extra letters with about the same pronunciation as the swedish ones but they look differently so they don't actually have a withdrawal server they have a with slash but they it's pronounced the same way and therefore they consider oh adult server to be a while kind of an accent of the oh with slash but why have these changed places well in swedish we introduce this character in the end towards somewhere in the 15th century i think in spanish sorry in danish they added this one not onto 1948. yeah they bit behind so before that they used to spell their sound as a double a and they still do in some proper names so that's why double a co-sorts with a ring in a danish collision and you get them here so let's now move over to the version number and the version number can be done at all which we then can say stands for 80 and those are the original collisions that were introduced in sql 2000 then in esco 2005 they added support for a few more languages and then then in sql 2008 they added support for even more languages and also they supplied new versions of the original glaciers from mexico 2008 and in some cases they also changed names et cetera then well they added also nucleations in one another 2012 but they had no version of it because they were only new features but then in esco 2017 the introduced collision with the version number 140 but these are only for japanese now it's not so much here the version number for uh for sql server that is of interest what is interest is that each of these ties to certain version of unicode although i've not been able to find the mapping but this in this turn defines which code points that are defined now that may not ring a bell what does this mean you might think so let's show you i'm going to show this i'm going to go to my database later on general cias and are going to declare an animal here called sun with the value of swansea which is police for son however if you've been paying attention or if you know polish oops that was the robot um if you've been paying attention you will see that [Music] this wood is misspelt because this is an end with a grave accent but polish doesn't have this character to have an end with an acute accent so anyway i'm going to look at the uppercase of this version i'm going to try to replace the battery character with the correct one i'm also going to try to find it with carous and i'm going to try to compare it to swatzer without any n at all and the outcome of all this is not what you might expect this one is still a lowercase and it's still a grave accent and current index does not find the character it's equal what this sounds crazy well now let's move to a database called let lat1 general 100 cias so we're going to move to newer newer version of unicode and now we're going to do the same exercise and now we get it all uppercase replace does it work it we have a complete correct spelling of it karen nix finds it and now it's not able to swatch it so this is the story way back in this version of unicode they have not yet defined this code point u plus one f9 it was just the proper representation in a in a font would that would be based on this for version of unicode would just be an an empty square or sorry empty rectangle but ssms uses a much newer uses a font which is based on a lot newer version of of unicode so and sms does not understand what's coming back what does for server things about it so they that's why it's kind of confusing but for sql server this means nothing so up of course since there is no it's not a letter nothing happens without but and replacing karman they says oh this is an undefined component i don't know what to do with this i do nothing and same thing here because it doesn't mean anything well it is equal to also because it's just white noise but here it's probably it's being handled properly and if you wonder where this character is used i've seen it used in the opinion the way to write chinese without the letters now we're going to look at the coded table called sword casts also one thing here about correlation and version numbers so in this one i got a number i've got a ch column and nch column for car one and end car one column and i'm going to add the same characters to both of these columns so we're first going to sort this according to the database relations which is a version 100 collation by the car column and you can see okay one two three four five six seven eight nine ten so that's the number reflects the original sorting and if i switch this to be encar i get the same sorting i always do always do that i will always get the same result with windows creation no matter whether i use varchar and vaca for the same characters because as i said all operations for varchar are actually performed in utf-16. now i'm going to try this collection instead elect one general c i here so it's version 80 and the older version of unicode and now for some reason i don't know why but you see one two three five four the zero character now sorts in a different place than it did before that's exactly why but for some reason they found it apt to change that um now we can also look at this collision here sql latte one cp1 cias and this is the nch column we saw this is the one we sold by so we get the same result i i don't know for sure but i suspect that these for enva car and car these two collisions are perfectly equivalent but i don't know for sure but if i sort by scar now and remember this one has a separate library for varchar lot and car one five two three four eight nine six seven ten so the plus character now comes before the minus and these two letters as hatchet uppercase and lowercase now comes before the digit why well i don't know one theory that i have is that originally in this code page there were different characters here some punctuation characters because these although this is latin one this these are in a part of latin one which is actually not defined so these are extensions that microsoft's added to 1252 so they are not with the standard of that one so let's now talk about case sorry case and access insensitivity so well this is quite simple ci sends the case insensitive and then insert in all lower case is equal to insert into one case and if it's case sensitive well then they're different then we have access and sensitivity so if it's exit insensitive these two english words resume and resume i'd be considered the same which is quite fortunate maybe for the english people because you know english people can't spell their own language they basically even though their resume is supposed to spell it's being spelled this way oh they don't do that at least the the ones the left pandas are oh yeah oh i'm not sure not to defend anyone so let's get that anyway if they are exits insensitive they are be considered being different so let's have a look at this and what some interesting impacts of this so first of all we're going to move to this database later on general 100 cias and yes they are being equal not no surprise there and now we're going to create database called lat1 100 csas it's case sensitive and they're being different now while we're at it let's have a look at how case sensitive sorting works because this is something that sometimes surprises people so i'm going to create this table called case sorting and there is a number column and the word and the data so and i'm going to sort by both word and data so in case the the um in case the word doesn't sort or is it's in case the words are not considered the same the data will have well sort of the data so we're first going to do this according so this data according to that one general cias and then we can see where i get the order that i have with the numbering so andrew are searching australia nothing surprising that of course i got uppercase and lowercase mixed and then i got mixed plus case mixed case and mixed case small and various sort of com ver weird combinations of uppercase and lowercase no particular pattern but of course it's case insensitive and we can see here though that six three thirty seven eight so it sorts by the data column but mix plus case here the plus is considered a separate letter now that why does mixed underscore case come here at the end oh we'll talk about that later on now let's change this to latin one general csas and see what result we get first of all these three words are still sought in the same order and do a certain australia it doesn't matter that a is considered to be different from up lowercase a is considered to be different from uppercase a because case sensitive sorting is the case will only matter with nothing else at diffus so and of course this is how we want to find things in the dictionary we don't want to oh here are the lowercase words and i want to find okay the uppercase words i would have a separate entity for them no that's not what you want and we can see another mixed case now there's some sort of logic here mix plus case first comes comes first but then we have mixed case or lower case then mixed case with only capital m then with both capital m and c and then mixed case at the end of all uppercase and mixed case comes here so in a case sensitive collision a windows case sensitive collision lowercase source before uppercase and a hyphen the hyphen also only has a secondary weight when sorting and this is also how we want to find things in the dictionary because some words are being spelled with the iphone but we may not know so it's more easy if we just ignore that and then if nothing else differs whereas on the plus on the other on the other hand is normally used in part of the word so that's why the plus is just considered like any other letters and plus holds before c and finally let's try the collision sql lat one general cp1 csas so this is an sql collision and we're going to this is a varchar column remember and you can see they do it differently these these are the same but here mixed case and all uppercase comes first so in sql collisions they think that an uh uppercase should come first and this is just the convention i think the reason lowercase comes first in windows creation is that there's some general not part of unicode but some convention type unicode that says lowercase confessed and also this girl collision does not have any particular handling of the on of the hyphen so they will sort that before d well sorry before c now let's see gonna try this collision are going to move says escol uh latin one general cp1 ca a s and because this one has more fun with the with one character i'm going to look at the word kosich which is a city in slovakia and i'm going to compare kosache to in all uppercase in almost all uppercase but this one lowercase and then the same thing for enva car and here is a weird thing because this one is not equal although this collision is case insensitive it's not case insensitive for this particular letter but as i speculated this might have been thrown in latest originally maybe this was something else in this creation or sort order now let's have some even more fun we're going to create a database called turkish cias with that collision and then we're going to compare insert to insert so what do you expect now different what yeah so let's have a look at the upper and lower of these to see what's going on and look here there is a dot over the eye i look here there is no dot over the eye you see turkish have two eye letters the dot and i and the doctor's eye and that might seem like why do they do something like that that's stupid well hey they did this was designed in the early 1900s when there was small basically only not laid their love that's never what the computer was so no one cared about that from a phonological perspective it makes perfect sense but i'm not going to go into those details anyway so here i have istanbul properly written in turkish it's a should be a delta knife compared with the leading eye leading application to it all uppercase and i got the word kazil which means red in lowercase and uppercase and it's circuit insensitive and yes they are equal because they are versions uppercase level case of the same letter now if i go back to my latter one general 100 cis they are not equal because it's case insensitive but it's exit sensitive and these are now different letters but oh different accents on them so to speak but now i'm going to move to the database within a case insensitive and accident sensitive and finished with each one sorry i need some water i think and again i'm going to compare these and they're equal and so because now it's accident sensitive and we're going to have some more interesting observations about axis insensitive so i'm going to create this table with the left word and the right word and they call them english where i just going to say uh compare the words according to the rules of english i'm going to cast the collision to that one general c i ai i'm going to do the same thing for polish using polish collision and for swedish i don't need to force the collision because i already have swedish creation and now we're going to add some words and we're going to look at the outcome so the first pair is assume a resume and they are considered being equal in all three languages so at least english and swedish uses this character i'm not sure about polish but it's not considered a separate letter it's just in swedish for example it's used to mark stress in some in some words so it's considered it is just an accent then we have here we have swanson it's a polish word and here's slonsa which has maybe our english swedish people will write it because this english or swedish person doesn't understand this this slash of this and it's just some free craps oh yeah so they are the same so english and swedish think they're equal but polish does not because they are separate letters in the polish alphabet and here we have the english word cooperate being spelt in two different ways this is the most common spelling someone's some people spell it with a hyphen and some people not very common but some deals that put drs's on the top of the second o to mark that it's cooperate and not cooperate and english and polish writer thinks that they're equal but in swedish no no no no they are not equal because well this is a our last letter in our alphabet we think co-operate but in this case it's not mark mark and um they just mark help the pronunciation and then we another two different english words wine and vine and of course english thinks they're not equal also polish although pogba just doesn't use the v character but hey it's swedish yes they're equal because the swedish the w just some sort of well accented v we pronounce well it's just spelling variation we pronounce it the same thing you know the english people they go and say www we see we simply say vivid as it had been spelled vvv so yes this is not consider this is not a separate letter in our alphabet oh well at least it wasn't when this collision was defined and finally here this one is also kind of funny you might how come these two dissimilar this is you without silver and a y but they are considered to be an exit insensitive collision the same in swedish but this character here is known in swedish as a german why they're being pronounced the same so uh let's see i'm gonna check out in my time well that seems snow i'm prefer probably behind let's see anyway um so just just a summary what i talked about in the case sensitive collation case only has a secondary weight when sorting case insensitive watch out for turquoise and axis insensitive comes with surprises the more languages you try you can find more interesting things like that things you might not expect so let's not talk about the japanese matters the ksws and vss so in japanese they have we previously previously we saw nakamura written in kanji but japanese also has two syllable scripts hiragana and katakana and here we have naguro written here to the left in hiragana and they had to write in katakana now you probably would never write an encounter because basically i think catacomb is used for foreign words anyway are they the two the same or not well as long as the collision name does not include ks it's canon sensitive but if it says ks they are being considered different and well then we have ws this is maybe a little bit more weird but but when you write things text in these languages you may note here that these characters are actually quite a bit wider than the latin letters so what they do is they like to render latin letters and wider so they match up uh in space with the hiragana katakana or the idiographic characters and they've also coded that into their character sets and that has also made its way into unicode so here to the left we have paris and half width or normal width and here we have it in full width and also for katakana there are also half width versions of katakana because maybe you have a text which mainly latin then they want to throw in some catacombs i don't know anyway are these equal yes they are equal as long as the collagen name does not include ws for with sensitive but then they're if there's if ws they're considered different finally via says stands for variation selectors and since sensitivity i will have to pass on this one i've not been able to track figure this out very closely uh it's not really only for japanese but it's probably most important for japanese because the only collisions to have ess are the japanese collisions with the version number 140. i'm disappointed i thought you knew this stuff yeah well i yeah i tried i know something what where invasion selectors are about but not enough to do it in a demo and i was not able to figure out whether actually this made a difference so i i will need to write up more read up more on japanese to figure this out let's talk about binary collisions so they uh sought to compare by the co point they are case accident kana everything else sensitive which means that they may not be that user friendly but they are fast but to confuse matters there's two of them bin and bin too why well let's have a look at this and i need to show a demo to demonstrate this um so we're going to create move to my database i'm going to create the table called binary test and i'm going to enter two characters three character combinations d e d and accent and l slash n accent or n if i want to do that in polish so uh and then we're going to look at the code points both in decimal and in hex and we can see here they're sorting this orders by code 0.6869321323 and 004.0045.01451.0143 in binary now i might think okay why do i need binary representation can i just cast the binary and sort them that way let's try that and we get them back as l slash comes before the d an n accent comes before the e which well okay when we look at the bond representation of course that makes sense because it's 4 1 0 1 comes before 4 4 0. but why are the bytes swapped well that has nothing to do with the correlations that has to do with how computers are built the winter platform is a so-called little endian platform which means that when you dress a word you will get the bytes swap that is just the way things are so then if you have let one general bin 2 you get things in the order we expect because in when you sort by this collision they do swap all the bytes all the words so you get the bytes in the right order and you get them by coping so what then is a ben collision well let's see if you can figure this one out well the d comes before the l slash but the n accent comes before the e so what they do is this they do swap the bytes of the first word and the rest they sort by the rule binary and please don't ask me what they were thinking when they came up with this idea maybe they thought it would be faster but i have not been able to find any differences anyway the bin two galatians are newer they were introduced i think in this 2005 and these are older and they are legacy collisions i would say so that's why you have this let's now move over to sc well where the documentation says supplement supplementary characters i tend to read this as surrogate compatible in utm16 so i so i told you characters beyond the base plane they are encoded in um you need four bytes and the way it's done is that they encode it in so-called surrogate pairs that is just the the name surrogate pass don't ask me why it's called that way so the high word will then be in the range from u plus d80 to u plus db bbb and the low world will be either range from u plus d c o o to u plus d f f f and then unicode does not define any characters on these code points so unicode is designed with util 16 in mind um the regular collisions that were added up to 2008 they only supported the unicode base plane from zero to sixty five five three uh sixty five thousand five and thirty five then in sq2012 they added sc collisions that support due to 616 in four and they added s equations all collisions with the version 80 and later sorry version 90 and later not the version 80 equations so and uh if you look at the version 104 articulations you will not see sc in lane because they are by definition always surrogate compatible and binary collisions on the other and on the other hand are not really circuit where so um gonna move to my database lagman general 100 csas which that is not surrogate aware and i got this variable here called gothic which i gotta assign well five letters from the ancient and um historic gothic alphabet no one speaks gothic these days their language is extinct and also thrown in the energy and then i'm going to look at the length of the word in characters the length in the bytes and i'm going to look at the new substring to extract the first three characters from that word and you can see i get back to word correctly but as closer says no there are 12 characters in this word which is kind of wrong and the first three well we can see we get the wall the first one which is canon a but then we get this square with a question mark in it this clip means encoded area because management studio thinks but all the font representation like wait a minute we got half of a surrogate pair that's that's wrong but let's now create a database called check 100 cs assc and if you wonder why check well because it's shorter than that one general 100 that's all the reason and then we're going to look at these again and now we get six characters and substitute returns correctly the first three characters a b and y gamma so to speak uh let's look at one more thing here i'm going to declare a table variable and the word is envirocar 10 i'm going to try to put in my word here well the sixth character so it should fit in shouldn't it but i get stringer boundary data will be truncated in table column word value gothic characters but the gothic did not fit because although it says enva carten that is not 10 characters that is 10 bite pairs and that is what the gothic people have had to put up with since cesco 2012 or they would have put up had to put up with if they had still existed now let's move over to ucf eight and these questions were introduced as 2019 and you could simply easily recognize them by the fact that the suffix is q-tip eight and what they did was they took all the sc collations and tacked on youtube eight so there's no version 80 collisions for youtube eight uh in the huge of eight glaciers the copic vodka is always sixty-five thousand one newton eight so it does not follow from the creation family one thing you should be aware of is that these collisions do not support the text and text data types as a matter of fact this is true for all surrogate regulations and there's one single binary utility collisions that one general 100 bin 2 you defeat i think they forgot to add a turkish one so one reason youtube aid is popular is because you it's perceived that you can save space with you to effect let's have a look at this so in utf-8 the range from 0 to 127 takes up one single byte so that's asking this means that if you have an ascii file it's also youtube8 file if it's just pure simple ascii file and it also means that compared to youtube 16 well for english you save up to 50 space then we have the range from 128 to 25 2004 to 7. that takes two bytes so that's all other latin characters that are not fitting taski so that means for english and polish we may not say fifty percent space compared to your dev60 but maybe 40 to 45 space then we have all grips uh other scripts in the european area so to speak it's a really greek arabic and a few more ones and well you would still save some space of youtube with these scripts because although most of this really characters well they take up two bytes but then they always have some punctuations of digits and all typically all sorts of latin words thrown in so they might say five to ten percent space but then we have the range from 2048 to 65 535. you need three bytes so these are the language of india chinese japanese thai so you can remember you can guess that in east asia now beauty of eight is probably not that big hit you need fifty percent more space and finally we have the range beyond the base plank you need four bytes to code every character but that is the same as youtube 16. so let's have a look at this i'm going to start with a database called finish footage 100 csa ssc so it's sorry to wear but it's not compatible and i'm going to look at summer words again both in enva car and in varchar these are the first three we looked at before so here we've got surrogate characters here what we have parents are full with and now we get it's being replaced by parrot in half with that makes reasonable and this gothic word we can for some reason i get 12 question marks i should only get six but as you see i can only use the characters in codebridge 1252. now i'm going to create a youtube a database finnish swedish 100 cs assc huge f8 and look at these strings again and lo and behold i get back them all well of course it's difficult to see that this is filled with this font but nevertheless um and this you can see the gothic characters and you can see nakamura in kanji now let's look at this a little more we're going to create a table called words and insert them and i'm going to look at the again look at the length of the words in characters in the bytes and also make run a substring on them see the first three and the english well seven characters seven bytes and the first three are egg rex motors ten characters thirteen bytes because the three letters that are not in ascii but i get back the first three's although that's four bytes swanson six characters eight bytes and small still again that's six bytes that can be good three characters now it's nine bytes because it's utv8 and paris in full width it's still is also 50 bytes because the full width characters are also way back in the east asian range over there and the gothic string is small still 24 bytes like it was with ut16 but still first three characters is coming back correctly now let's try this var class six we've got a column called sun and it's a volca six column and i'm going to try to put in the word swansea which is six characters so is this going to work out well if you have been paying attention you know this is going to end yes in tears string of boundaries will be truncated in table column sun truncate the value swan because six characters means six fights now people in let's say well sweden poland france who never worked with japanese or gothic never seen this before until they come to this youtube collision completely freaks out and says what this must be about what mike what was michael myself thinking well it's completely crazy but yes this is just the way things are it takes fights and as i said the japanese people they have had to put up with this since 2000. uh i think the answer is there are permits to saying van gaal six could be six characters and then you can also have an option to which you want to do and maybe microsoft's gonna change that in the future but i think that would be quite a bit of an architectural change i'm not holding my breath so yes bark arthur it means 30 bytes so this means if you want to permit for 30 characters well you may hate go for varchar 60 and you would still have to take the chances that you don't need to store chinese or images if you really want to be sure you need marker 1 and 20. then again keep this in mind a 10 byte value in the varchar 60 does not take up more space than than in the valkyr 30. but it's just that it looks odd and it can give you some shivers because what if someone actually adds 60 characters to this one and we're going to print this on a form which could give us trouble finally one thing i like to discuss is if we gonna design a new application and we would want to add international support and of course if you design a new application you should always think international to be future proof anything else is just silly if you excuse me so my take on this is i would go for envaka uh the reason for this first of all it's more predictable if i have nvakar 30 that means 30 characters yes i guess i can expect that people are not going to enter names in gothic and for a name column i will not accept images so i will know that that would be 30 characters for a free text fill i am i might have to accept uh gothic oh sorry not maybe not coffee but at least emojis but yeah then it's maybe not a big deal if i say n workout n 2 000 yes sure sorry you put an emoji so you only have space for one of my name knight tonight eight characters i can live with that also i get better performance not a big deal but and you will see this towards the end um and when it comes to saving space you can do that with road compression i took a table with uh and put it both in utf-8 and youtube 16 without compression no difference in size and it had data in beyond the ascii but this is absolutely a matter of preference i'm not saying if you like think that you defeat this better for example because you think you have you're going to have the talk with other systems that you use f8 yeah you may want to do it but it doesn't really matter because that will still be converted on the wire then again if you have an existing application that was built with varchar without out no thinking about being international and you now have to be international for example let's say you're in france and your bustle tells you that oh we have acquired a company in poland and i tell that they use a different code page well then having to change the varco to end of our car can be quite a bit of the pain moving to youtube eight collisions it's probably less painful you still have the author on the tables you might have to consider character lengths so it still can be still quite a bit of a work but it's probably a smoother walking smooth a slightly smoother path than having it envaka now let's move to one of maybe the biggest pain point with collisions in sql server because due to legacy collisions also control meter data um and this is because you might remember way back in sql seven as the server only supported one single collision or sort order so that also affected the system table and how they were solved and you know table names etc are stored in tables so yes they have to do it that way anyway the way it works today is this you have a server collision and that controls the name of the server level objects so database logins etc the server level collision also defines the sets of the names for temp tables and column names in term tables and table variables well that is because they live in temp to be and temp to be system databases that uses the system collation and the server collision also controls the names of variables whereas the values of the variables follow the database correlation and the database collation is it that controls the names of the database label objects that is tables columns uses etc although there's one exception that added service broker objects they always have a ban regulation because for or in interoperability reasons because it would be a big deal so it has to work across servers and multiple systems i have a demo down here but it's not that thrilling and i'm a little late in my schedule so i'm going to skip this one but it's not that extremely exciting you're just going to summarize some tips for writing international software so you what you should do i think is you should use a k if you writing a software that's going to be used in many countries you should develop with a case sensitive collation both on server level and database level or else just forget about the turkish market because else you have you have tables being that use babies you first created them with an uppercase i then you referred them to the lowercase i and trying to sort out that best no and also and the color of this is you only use lowercase for identifiers because else you would have a if you remember oh did we call this one uh mixed case both m and capital m m and c or w which way did it do it so if you use everything lowercase i know this i could work in with the case sensitive creation developers for the for many years and we use locations for identifiers also avoid very short names like v or rw i've occasionally seen posts from people and they were not swedish who being burned by a collage like finnish swedish ciai where they had two variables at v and that w and all of a sudden they clashed and i mean whether it's to be case insensitive or case sensitive that's a matter of preference some languages like pascal basic they do case insensitive other languages pearl python they are case sensitive what is complete mess in this conserver is that it's depending on the on the setting that's just ugly i think they should fix this by providing an option having a fixed fixed creation now here's another paint book with collisions although this one is more natural one so if you have two columns that meets an expression well you will get a conflict and the way to resolve this i think we better look at this demo so again i'm going to move to my database later on general cias and i'm going to move to my places database and i'm going to add some places to new york london paris munich and buenos aires and i'm also going to create a temp table and i'm going to put london there and also paris in all uppercase and now i'm going to run this join i want to find find the places in my places table that also are my playstation in my in my temp table now you might remember that my server collision is not this one it's uh finished with this case sensitive one so which creation should this person use there i mean the comparison rules are not the same for these particular words that would not be an issue but let's say that there are well some that where it actually would be different rules for swedish for these two to um to um for example oh yeah there is a difference because we have the one is case sensitive ones in this case insensitive so is this conservative we're going to toss a coin or no esco server just raises the white flag and says cannot resolve the collision conflict between latino and general cias and the finnish regis 100 csas in the equity operation of course when you get this particularly not prepared for this it's kind of a pain the way to resolve this is to force the collation on one side and typically you do that on the temp table because you probably want to work with the data you have in your table so you're going to say t place clay network general cias and now i get back my data now you might not be comfortable with hard code initialization because okay okay this is that one general but tomorrow this collision is gonna run in the czech republic so we're gonna have a check collision well there is a way out we can say collate database defaults which means well the database default for the current database not uh temp2b so then it will always be the formulation for that database and i get back get it back that works of course does not work of course if this one has a different different collision from the database collision but the many systems you will get for one collision all over the database it's only if you have special needs that you vary the equation now what do you think about this one this one i'm forcing the collision on the place column on the my places table to be a french case sensitive equation so you might think here now there's the service now i'm gonna go and say i don't know which rules gonna use i'm gonna say this is a conflict because the reason you get this conflict there is that this conserver doesn't want to make the choice it just simply tells us you need to resolve this because i don't know but it turns out this one runs but i only get back london this is because when you force a collision on one side this actually forces the collision on the entire expression so you force the collision on the temp table as well and it was a case of sensitive collisions and now i only got back london not paris now what if i compare with the variable so here i got at place it paris in all lower case i'm going to try to compare it both with it with the uh places table and the temp table well i get an error message no i don't i get here i get back paris here i don't get backpacks when you have a variable or a literal or constant the column will always win and the variable and the literal will be cursed to the collision of the column and finally before i leave this demo i'd like to show the proper way of defining the temp tables you should rather than casting it in the expression it's better to do this saying place collate database default and you will get the default collision of the current database and it works as long as you have the same collision all over the database and now when i run this i don't get this error and here's a second tip if you actually want to make sure that you do this make sure that you have a different circulation and data basically so you don't forget then again if you expect that this database will only be alone on the server so we will always have the same collision everywhere and then you can forget about it but if you think that maybe we might move this database elsewhere this can be a good habit to always have this in place because if you don't have to add this afterwards it's a lot more pain than they have to do it directly so just to summarize so if you have a collision conflict you need to do that by adding the collate keyword to force one of the columns to the common collision you want to do and this cast the collision for both columns one thing to be aware of if you cast the collision of the column that will kill in the index of the column because the index is organized according to the correlation you have specified so if you cast a different collision that index is now useless and best practice when you define temp tables use collate database to fold four string columns to avoid the correlation conflicts later on because you might not have them today you might get them today with the database on the different server and when it comes to columns they always win so when it comes to variables and constants they always lose against columns now here are some tricks we can do with collisions where you might not think about using collisions in the first place so removing accents and quotes finding words that start in order case and removing invisible characters so um i'm going to create a table here called more words and that's both a word and an original and i will store the same word in both the word and the original and oh let's see look at the words and they say well there's a danish name a swedish word a polish word a danish name again and some polish from french some japanese and some german what you might notice though that all of these words include letters that are not in the ascii character set and some people say oh we want to do this we don't have we don't want to have these special characters special characters hey look there's nothing special about this character as far as i'm concerned at least no no anyway i hate this but still people want to do this so here's a trick you can do to get rid of all these non-ascii characters you cast the word to equation for a language with a different script of latin for example greek then you cast that ivarka because in the code pitch for greek there are no latin letters beyond those in ascu because in the upper part they're only greek letters so look what happens now and we're going to retain the original column and you can see look boom all of them are removed the only one that did not really work out really well is this danish character the ao literal it came back as a question mark rather than ae but else yes you can do this although i need to add the cave yet that if you also have punctuation characters and other characters it may not always work out but the alternatives is a whole bunch of nested replace or maybe maybe the translate function could work for you if you're on excel 2017 later but so this is a very quick thing to do this anyway while now we have this data let's see we want to find all the words that start with uppercase and some people they try to do this um just like 80 percent because they they used to do this from blood so unix or text editors but when i run this i get back all of them because well the like range is also affected by the collision the range from a to set includes well all the lowercase letters because it's casey and sensitive okay when people realize they say okay i'm going to do this case sensitive instead but the result is not that much different i got rid of one the lowercase a because now the range starts at uppercase a but go all the way down to uppercase z but includes lowercase b lowercase at zero lowercase z but not lowercase a now the way to sort this out is to use a binary collection so we get by co point and now boom all of the words that's done with uppercase if you also have your own if we do this on the original set we would have to add the all the extra characters to this range to include those obviously uh here's another one that also sometimes causes confusion we have are gonna form a string over a b zed the null character this is something you know want to do but that can happen sometimes due to processing maybe you get data from other platforms etc so there is some invisible character here and if i look at this so there's abc a null character and xyz and the first may seem that oscar sure is only storing the first three characters no it's not it's storing all seven there's seven characters now the one who's uh misbehaving here is management studio management issued is written in the she sharp and all c based languages they think that hey an old boat that's a string terminator so as management refuses to show the rest of the string now we want to get rid of this sucker so we do try to do replace car zero with nothing at all but much to a disappointment it didn't help this is again because car zero and unicode is not a defined code point so replace says i don't know the rules for this character i'm giving up i do nothing and again the way to resolve this is to use a binary equation and now i get rid of that invisible character and i see all six could you replace n car zero oh let's see that should be the same after all it's it's a it's a windows collision so everything here is being being is being um being conducted in in um is being performed in youtube 16. in an sql collision though right yeah in the sql collection though you would not have this problem for vaca because it's a different completely different drill set and it has a real rule for for car zero so um this is just a summary of what i've talked about so i'm not gonna repeat um because i'm thinking i'm quite better short on time 23 minutes yeah yeah well that should be enough for the performance of so collisions and performance how does collisions affect performance well in most cases a little bit but not maybe too much but there are two situations where though uh collisions has a big impact on performance and we're going to look at these so um we have 23 minutes but that includes uh i think we should have some break as well i want to give space for that so my aim is to be stay within 75 minutes so anyway i'm going to have a table here this i'm going to be intend to be and i've got a table here called words that i'm going to run the create statement and also populate it while i'm talking about it because it takes a little bit of minutes so anyway this table called grids and there is the in the windows envirocar column which has an envelope 200 column and it has a collision of latitude and general cias and then there is an spl nmr column which has also invoked 2000 and it has the equation sql latte one general c1 cas and then there's two varchar columns with the same pair of galatians and all of these are labeled as unique so there is an index of all of these columns and then i'm just filling this up with well grits so each of them will be a string of i think one and four to seven characters and there's one on a thousand rows now what i'm going to do here is i'm going to do a kind of a point lookup it's not an equal it's a like operation without a leading percent so it's going to find a few characters but this is a very quick operation so what i'm going to do is i'm going to loop for 500 milliseconds to see how many iterations do they do and the first one is going to do this for the windows envelope and this takes 8 000 milliseconds let's now do this for their sql column oh sorry not eight thousand milliseconds it took i was able to do eight thousand five iterations that's a little lower but maybe my laptop is busy by zoom this one was actually faster usually they are the same but it was a little faster this time um but they usually should be about the same now when i go to the windows of our car column see i only do half 4 000 103. so why is this well so remember this i have our car value and i'm comparing to the end of our car value by the type rules in this conserver the far car column will be converted to envelope that usually does not mean good for indexes however because all operations of our car are actually conducted in ut16 the index is still usable so is consumer is still seeking the index but it has to add a few extra operators to the plan so that's why we get a get a speed reduction so it's cost usually i measure this to a factor of two sometimes up to a factor of five which may sound like a lot but hey these are quick operations so depending on your workload you may not even notice for windows collision but now we're going to do this for an sql collision and remember in this calculation i completely different routes in varchar so i was able to do only 15. so here there was a big performance impact and here we can't talk about a factor of two to five because the index is dead the index confused it has to be a scan so now it only depends on the size of the table or the penalties it could be a factor of hundred it could be a factor of a thousand it could be a factor of a million it could be even more so this can be a real disaster of course this is a kind of a programming arrow you wouldn't shouldn't do this but this can easily happen if you use a well a dba design i don't use vancouver's all i don't know about international stuff i don't want that so i'm going to use utf8 but the developers decided to use an rm for example which is basically anything modern program is basically we're going to use unicode so they have unicode you get this clash then you can and they because no one think thought about this so it can quite easily happen and with a win if you use a windows collision the damage is limited but this calculation will get the revenge now we're going to look at doing this we're going to do a search now for the varchar column but now with leading percent so in this case the index cannot be used at all it cannot be seen we can scan it of course but it has to go through every columns and we're going to do this for the windows all of these four columns and we're going to do this for all four ones oops if i am if i'm only able to do my selection correctly we should do this and of course let's see here oh this one took two half minutes this one was two half minutes this one i up and this one was a lot faster but these three were about this all the same and this was a lot faster why well when you do this kind of search sq server does not only have to scan the table it also has to scan the strings from start to end more or less it has to go at least oh not the last two case but the rest so it has to look and compare every character now for the unicode of youtube 16 there are very complex table because there are very many things it has to consider things i haven't shown you like combining accents etc so it has to consider those so and that starts to add up but as the sql collision on the other hand it only has to think about the 255 characters that are in that clay and that collision yes there are no sql classes for japanese there all those collisions off of our car 255 characters so it's a lot easier a lot a lot smaller rules so that's why it's so much faster now so is that the lesson that we should use for these kind of searches in sql collection with vodka not really because maybe we need to end our car because well we won't need to support multiple languages well we can do this with a boundary collision so i can say windows and vaca collate let one general bin two and let's see how this goes you might remember that we had round here you know split slightly below 300 milliseconds for this calculation well half the speed although yeah there is a problem zero because this is now case sensitive and my search string is lower case but all these grids the carriage representation are all uppercase but there is an easy fix i can do an upper on my varchar column and on my search string normally you shouldn't do this because that kills the index but it doesn't matter because the index is already dead because of the operation now well it was not as fast as previously but i get back the correct count so this is a kind of a trick to get these searches faster you will not get exactly the same result as as with the true case case sensitive equations there are some corner cases but you might very well get away with them but if your users also say that they want access insensitive searches well then you would probably have to go with a normal like so these two slides just summarize what i've talked about in the demo so i'm just going to not repeat that there's actually one more case that i was reminded of just before that i got a mail but that's a bug that has been fixed with cu7 where you before that there was some nasty case where you could get a problem with with you to fade cliches but that's that's a bug that had been fixed anyway um what i also did was i actually ran a test for all five thousand five or eight places that kept my laptop busy for nine days i tested five different operations both of our car and my car to see what the result might be and i did this in school 2019 and just because i'm crazy i did this all way back in 2008 on this 2008. and i'm going to show you some of that data i mean the most and as i point out how to note i mean the effect on real workload is likely to be far less than the number indicates because in the real world what you have scans you have numeric comparisons all sorts of stuff so this is just looking at this specific operations so here are some data the first thing you might think case sensitive that should be faster than case insensitive turns out that not a big difference but case insensitive accident sensitive is slightly faster don't know why but maybe because this is more common so they optimized the tables for that kind of these operations um i found out that the version 100 collisions they can be up to 30 but slower than 30 slower than the version 80 collisions and that has to do with i guess with because the character tables are larger they're more characters to consider um for windows glaciers it can fracture for varchar it can be up to ninety percent slower than end vacation particularly with sorting operations other operations there were hardly any notice noticeable difference at all and that is because of that conversion to youtube 16 and back but that does not take in consideration that you might save space with the with the varchar and saving space often means saving execution time because you make the scans smaller for sql collisions varchar is always faster because you have simple rules but the difference is not always that extreme as would like and yes yes there are differences depending on the language or the coalition family um particularly slowest ones i think with vietnamese and hungarian simply because they have more complex complex rules but hey if you aren't in hungary and you're going to mean you're not going to pick a boss slovak romanian collision just because it's faster because your users want to get want to get hungarian rules so you have no choice uh and also finally i like to i was figured it's interesting to look at youtube eight so i think it as long as i don't need the extra characters for version 100 and as long as you're in europe you can get away with the version 80 places if you also need support language in india for example then you will need the version 100 creation for a version 90 at least um but anyways i figured i'm going to compare you to fate with varco and the end of our coalition collision with version 80. i look here it's it's slower and for the difference operations that i drive and i want to buy it's even twice the speed so it comes with a cost and which is even bigger than the normal x overhead you get for farka but again take this with a grain of salt it does not take account for the space reduction you might get with the uge8 so don't put too much into this uh before i close this i like to this i got this with the speaker deck i like this slide a lot thank you for organizers the moderators and that and these for being here and it's a really nice picture that they put in um this is my final slide just to get to get some references my name is alan somasko my email address is here again if you have any questions i don't answer right now please feel free to drop me a line at any point you have slides and scripts on my website someone's got it as e slash percent and if you do the demos there here's a cleanup script that will just drop all the databases with collision names so that was my i had to say today any are any questions somewhere i'm i'm i'm just amazed and we were a couple of us were joking in the beginning like uh you have a high performance mode on and will you make it in time but you did no i think we're all just gobsmacked [Music] what was there there was something about smurga's torta something about no lock and poop emojis hugo says i still want a set option to make collate database default uh the default for all temporary objects i i agree with them that's a great idea yeah it could be that could be useful yes but i think it's even better though is to have simply uh yeah okay that's that doesn't that means ladies and gentlemen that does not address that they uh that would be helped but then again if you have well if you sort of mix class in the database and sometimes you have to do that because you might okay generally do things case insensitive but uh to my business rules i said this one has to be case sensitive and actually i have the idea that if you have things like codes like current scale for example that should be a binary collection because why would you ever want to have that in ever anything else first of all it's faster and why would you have to see sometimes see circus applications sometimes lowercase etc yeah yeah totally i tend to install the dev developments uh where i have a different server collation to compared to the database collation because inevitably you will run into stuff where you create temp table objects and then you sort of yeah you get something something in a stored procedure somewhere wrong and and that collation error would wouldn't show in your dev development if you didn't uh it would show up at the client and at which point it's harder to fix yeah that's also so the reason as i said i would even set it to case sensitive actually but that's just me being terrible on my colleagues yeah so i worked worked with the developer with the product it was not for inter really international market but we we used a case-sensitive collation exactly because and this was there was this was a decision making already in this 6.5 time so you do so if the server says we we go back oh sorry if the client says if you go by case sensitive well then we have developed cases of you you're you're kind of host so that's why we made the decision to always use case sensitive and once you get used to it well i i can't say why i ever don't download queries from the forums and people have mixed cases with what's the point of sometimes spelling the customer with a capital scene sometimes in lowercase and sometimes all uppercase i mean i don't see them it's it's just sloppy right but if you use linux or unix for any length of time or or code in c for all that for that matter you get used to the case sensitive thing or you you actually just make a general rule to have everything in lower case which i have opinions on but but okay nagi says wow that this was incredible you went at three times three x speed but this has been a great eye opener uh my notepad is full and my mind is at capacity uh i think i'll have to lie down thank you ireland yeah yeah yeah so many great details thanks a lot adlan yeah it was a big cramp but uh well i'm usually not known for going particularly slow in my presentations but yeah i think it was an eye-opener really i i had no idea that sort of collations could take up more than say 15-20 minutes i don't know i was i stand corrected yeah that's more things i could have shown you for example that could confuse people for example combining access to have a unicode but you don't come across on these very often although i actually had someone on the forums who actually had a problem with combining accents um that was but they're not very commonly used but they when they decide unicode the reason they want to have these combining actions was well to reduce the number of code points i was i remember i was on the mailing list that was because they had there was two initiatives unicode and iso 10 646 and dense for six cents 10 six for six was aiming at 32-bit standard and unico tried to make it in 60 bits and then the end result was one kind of a compromise but at least there was only one standard it was not way back in the 80s when about every vendor had their own 8-bit character set and it was a complete mess yikes yeah so uh what say we take an eight-minute bio break uh over in the dba channel mikey bronowski is up in a couple of minutes

Info

Channel: GroupBy

Views: 187

Rating: 5 out of 5

Keywords: sqlserver, groupby, SQL Server

Id: rfguGBy_exw

Channel Id: undefined

Length: 82min 40sec (4960 seconds)

Published: Wed Jun 16 2021