Code Pages, Character Encoding, Unicode, UTF-8 and the BOM - Computer Stuff They Didn't Teach You #2

Captions
Hey friends, I'm Scott Hanselman, and this is Things They Didn't Teach You. Some of the folks in the last video commented that maybe they did teach you this stuff. I don't care, that's the title of the thing. So maybe you learned these things and maybe this isn't the video for you, but it's a fun title and I like it. The idea is stuff that maybe you forgot or didn't know, or they didn't have a class on this, or maybe you just didn't pick it up. There's lots of folks out there who are learning from boot camps, learning to be a programmer or an IT person by picking it up, and sometimes you just don't pick these things up as you go about your life.

In episode 1 we talked about carriage return and line feed, and we just touched on a bit of character encoding. I thought it would be interesting to talk a little bit about character encoding right here. Every once in a while you may find yourself on the internet and you find a character like this, like a Chinese character or a non-Latin character. Then you go into Notepad, or maybe you've opened a text file from somewhere, and you paste it in and you go, oh my, what's happening, it's all nonsense, and you become frustrated. Or perhaps you open up a text file and it's all black squares and you don't know what's going on. Well, that gets us into the question of character encoding.

In our last episode we talked a little bit about the ASCII character table. In that episode I said the word ASCII, A-S-C-I-I, the American Standard Code for Information Interchange, and I said it all casual, like someone would know what that meant. Well, here's the deal. Back in the day we had only a few bytes, and every byte mattered. In fact, every bit mattered. So someone came up with a way to put the first 128 characters that you might need into 7 bits. Not 8 bits, but 7 bits. They made all that fit in 7 bits; that's 2 to the 7th, which is 128. If you added a
single extra bit, if you had all the space in the world, you would get 256. But let's talk about 2 to the 7th. It's an agreement that this byte means this and that byte means that. So they got around and said, you know, A is gonna mean 65 in base 10, or it's gonna be 41 in hex. Let's take a look at this.

The way that I'm going to teach you this is I'm gonna do something silly: I'm gonna write a computer program. Let's make a directory and we'll call it write some bytes. It doesn't matter what language I use; I'm going to use C#, but you can use whatever makes you happy. So I'm gonna say dotnet new console, and we're gonna make a little stupid console application. I want to make this point really, really clear. I'll open Visual Studio Code with "code ." and we're gonna go to our program right here. We've got hello world; we're gonna get rid of that. We're gonna say, hey, we're gonna need some bytes. We'll call them bytes, and we will make 128 of them, so we've got a byte array of 128 bytes. Then we're gonna make a for loop from the number 0 up until, what, 128. Then we're gonna say bytes, put in i. We're gonna take i, which is an integer, and instead make it a byte, and shove it in there. So we're gonna make a bunch of bytes from 0 to 127. And then we need some I/O, so we'll say File.WriteAllBytes, we'll call it 128bytes.txt, and then we're gonna give it our bytes. Boom. Cool. That's our app, very simple.

Now I'm gonna pop out to the terminal here and run it, and we're gonna go look at those bytes. Watch on the left-hand side here, because if I did it right, boom, 128 bytes. I'm gonna open it, and this thing says this is using an unsupported text encoding. What's happening? Well, that's because the first byte is 0 and the second byte is 1, et cetera. So if we go back and look at the table
here, right, zero is null. Nulls are not interesting, but let's open it anyway, shall we? It's a bunch of schmutz until we get into recognizable characters. In fact, if I delete all of this stuff and hit save, and then right-click and look at it in a hex dump, we can see that the interesting bits start around 21, and when we talked about A, there's 41 for uppercase and there's 61 hex for lowercase.

All right, now what if we took 256 bytes? 256: look, it's bigger than a byte, you're gonna go too big. So we're gonna switch this back to an int, and then when we're done here we're gonna shove it into a byte. We're gonna spin through from the number 0 to 255, we're gonna shove it into here, and I'm gonna say dotnet run. All right, it got bigger, but it's all on one line. What's all this crap? Look at all these things here. What's happening? Let's go look at it in a hex dump. We can see we went from 0 all the way up to FF. But does this stuff on the right actually reflect reality? Who decides that this number means A? It depends. That's called character encoding.

There's lots of different character encodings. All this time I was saying ASCII; ASCII is just a 7-bit character encoding, which means it's values all the way up to 127. There's a lot of code pages out there. The Windows code page, for whatever reason, is called code page 1252; it's the one for graphical apps in Windows. Code page 437 is the one for console applications, and there's a bunch of other code pages. They're identical until you get up past 127. So, for example, one code page might say that's a euro character, another one might say that's this cool C, another one might say that's a non-breaking space, and another might say this is an A with an accent on top. It all depends. These cool DOS-looking console deals will only show up like that if you apply the right code page. So you've got to have a font that supports it and you've got to know what the code page is. So another way to
think about this is that the string that you have doesn't mean anything unless you have an associated code page.

Now if we take one of these files and we open it up in Notepad, what's that? It looks like crap. What happened here is Notepad took a guess. It said, I think this is UTF-16 (we'll talk about that in a second), and it got it wrong. Let's open it up in Notepad2, a different Notepad-like application. It also took a guess. See, if I double-click on that in Notepad2, ANSI is in fact code page 1252, and the one that we saw that was for the console is called OEM 437. Who knows why they named them those numbers, it's silly. Here's the deal, though: these are all different views on how you can present this stuff. Another common one is ISO 8859-1. If I click that, it'll say, wait a second, if I switch, things might go south: if we find any characters here we don't recognize, we're gonna turn them into something else, we're gonna turn them into default characters. Now in this case nothing happened, which is a good thing.

But what if I switch it to something like Unicode, and we go and grab that Chinese character again? I'm just gonna grab the character for mother and throw it in here a couple of times. And then we're gonna switch this to ANSI, in this case the most basic 8-bit encoding, and it's gonna warn you. Hey, what? Everything went bad. Now what if I just made a new file? See, look, I can't even paste it in there. What if I make a new file, click here where it says ANSI, and say Unicode UTF-8? I'm gonna paste in the character for mother, hit save, and put it on my desktop. Go out to the command line and look at that: in this case I got three bytes. It doesn't look Chinese and it's wrong. But is it? What could we do to guarantee that folks got this right? What if we saved a signature in front of it, a byte order mark? We're gonna save it again. I want to
point something out. I'm gonna go ahead and say this is a nine-byte file right now. I'm gonna hit save; now it is a six-byte file. If I switch it to Unicode and save it again, it's a three-byte file. Change it back to Unicode with signature: back to six bytes. It's still wrong in the DOS box, because that's how things are going to work for a while, but there's three bytes in front of it giving me information that I maybe didn't know about.

What if I said I want to change the code page? I'm in DOS, and what I said was: display this character using this code page. I could say change code page to 1252. That doesn't look right. I could change it to 437; that's where we were at the beginning, remember, that's the default code page. Or I could change it to Unicode, which enables that. But what's that first character? What's going on there? Remember when we saved that stuff, we said save it with a signature. Let's open it up and find out what's going on. Let's go to a hex dump. Those three bytes are called the BOM, the byte order mark. It's the Unicode byte order mark. The idea was, if you had this magic string here, it would tell you what to expect. It says, expect things to look like this, and bytes to be in this order, from this point forward. So that byte order mark would get carried around, and once I have that byte order mark in my text file, it assumes that everything from that point on is stored as Unicode code points, each encoded as a sequence of one to four bytes that expresses a point in a map that could be any character that Unicode supports.

In fact, Unicode has this lovely website where you can go and find all of these characters. If you're in Windows, hit Windows+R and type in charmap, and you get this old and wonderfully fabulous application. Pick any font, I'll just pick a regular font like Arial, click on a character, and you can see the Unicode code point for that character. And this thing is interesting: it
says Keystroke: Alt+0233. If I run Notepad and pull it down here, and I use the number pad on my keyboard, I'm gonna hold down Alt with my left finger and with my right hand type, say, 0 2 3 3, and I just typed that symbol. If I want to type the registered trademark: Alt+0174. Makes sense. When you get way down farther you can't type these yourself, but if you're looking for a character that you can't type, you can grab it, select it, hit copy, and paste it in there. But again, if you don't watch your encoding when you save it, you will potentially lose information, because anything over 255, anything over about 241, anything right there, that's your cutoff right there.

What you need to understand, though, is once you've got this BOM, this byte order mark, you see it works immediately, and I can put ASCII before and after it. Actually, let's go ABC, ABC. I look at the file here and we can see that it was loaded correctly as UTF-8 with BOM. I can right-click on it and I can see the byte order mark, ABC, the Chinese character (this is interesting, see right there, my hex inspector actually points out the string at the bottom there), then ABC again. Cool. Without that byte order mark, things would go south.

Last thing we'll do: what if I made a little bit of room and I made 256 bytes with a BOM? This is not how you would do this, to be clear. What I'm going to do is hard-code the first three bytes. I'm gonna say bytes at 0, I'm gonna make it EF, and bytes 1 and 2 are gonna be BB and BF. I'm hard-coding the BOM. Then I will do my 256 other characters, and then we'll run this. It shows up over here, boom, there's my byte order mark. Now we're gonna open these with Notepad2. Our 128-bytes text file got confused. Our 256 one got confused, opened as ANSI. Looks pretty decent, though, remembering that the first 30-odd characters are kind of trashy; they're just control characters for doing stuff. But 256 bytes with BOM: watch right here where it says ANSI. When I drop it,
see how it says UTF-8 Signature? It recognized that we wrote out that byte order mark, and it was smart enough to even give us the characters for those higher bytes that we wrote out, those higher than 128, so everything from here down.

So I realize that there are lots of different ways to express this information, and maybe this wasn't the easiest for you. I'm doing the best I can, but I want folks to get a general sense of character encoding: what it means, how it works, and that you need to know about it. Because when you get a string and you don't know the encoding of the string, the best you can do is guess. If you have a byte order mark, then you have a lot more to go on. But not all bytes are made equal.

If you have any more comments or questions, please put them in the comments below, and if you have an idea for a future video, please holler at me and I'll do the best I can to make one. Thank you very much, and please do subscribe.
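
The demo program described in the captions can be sketched roughly as follows. This is a reconstruction from the spoken description, not the code shown on screen: the class name, method names, and output file names are assumptions, and the "character for mother" is assumed to be 母 (U+6BCD). The last three lines of Main go slightly beyond the demo to show the same three UTF-8 bytes read under a different encoding, and the signature that .NET itself reports for UTF-8.

```csharp
using System;
using System.IO;
using System.Linq;
using System.Text;

// Hypothetical reconstruction of the "write some bytes" console app.
public static class WriteSomeBytes
{
    // Returns the bytes 0, 1, 2, ... count-1. The loop counter is an int
    // because a byte counter could never reach 256; each value is cast
    // to a byte as it is stored.
    public static byte[] MakeBytes(int count)
    {
        var bytes = new byte[count];
        for (int i = 0; i < count; i++)
        {
            bytes[i] = (byte)i;
        }
        return bytes;
    }

    // Prepends the three-byte UTF-8 signature (EF BB BF) to a payload,
    // mirroring the hard-coded BOM in the video. As the speaker says,
    // this is not how you would normally do it.
    public static byte[] WithUtf8Bom(byte[] payload)
    {
        byte[] bom = { 0xEF, 0xBB, 0xBF };
        return bom.Concat(payload).ToArray();
    }

    public static void Main()
    {
        // File names are assumptions based on how the files are referred to.
        File.WriteAllBytes("128bytes.txt", MakeBytes(128));
        File.WriteAllBytes("256bytes.txt", MakeBytes(256));
        File.WriteAllBytes("256byteswithbom.txt", WithUtf8Bom(MakeBytes(256)));

        // One code point (U+6BCD, assumed character) becomes three bytes in UTF-8.
        byte[] mother = Encoding.UTF8.GetBytes("\u6BCD");
        Console.WriteLine(BitConverter.ToString(mother)); // E6-AF-8D

        // Read as ISO 8859-1, the same three bytes become three unrelated characters.
        string misread = Encoding.GetEncoding("iso-8859-1").GetString(mother);
        Console.WriteLine(misread.Length); // 3

        // The signature .NET reports for UTF-8 matches the hard-coded bytes.
        Console.WriteLine(BitConverter.ToString(Encoding.UTF8.GetPreamble())); // EF-BB-BF
    }
}
```

Opening the three output files in a hex viewer should show 00 through 7F, 00 through FF, and EF BB BF followed by 00 through FF, which is what the speaker inspects in the hex dump.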
Info
Channel: Scott Hanselman
Views: 146,984
Id: jeIBNn5Y5fI
Length: 17min 18sec (1038 seconds)
Published: Tue Nov 19 2019