#7 What is a Character Encoding?
How is text represented in software?
In this episode, we talk about how text gets from a file onto your screen.

Speaker B:
All right, Dave, today's question: what is a character encoding? Or, another way of asking this: how does a computer represent text?

Speaker A:
Well, as we discussed in episode three, everything that you use on a computer is ultimately stored as a number, whether that be a color, whether that be a location on a screen, whether that be an actual number, whether that be a bit of code, or whether that actually be the characters in the text that we read. Each individual character has a different number that represents it for the purposes of storage. For example, in most character encodings that we use today, the number 65 is an uppercase letter A. So when we store the letter A, we actually store the number 65. Now, that's the number 65 in binary, so it's 1000001. We store that sequence, we read it in, and we know that 65 is an A. And so that's the point where we should show an A on the screen or store an A in a file on disk.

Every single character that could be in a text document has to have a different number. So the number of numbers that we can represent delineates how many different characters we can represent. So if we think about the Latin alphabet, the letters A through Z that we use in the English language, well, there's 26 different letters, of course, from A through Z. So to represent those 26 different characters, we would need 26 different numbers. And it just so happens in most character encodings, for the uppercase letters A through Z, we use the numbers 65 through 90.

Now of course, there's many other characters beyond just the Latin alphabet. There are special characters, like a dollar sign or an ampersand or a hash sign. There are numbers themselves, so the digits 0 through 9, right? There are also special characters that we don't even see. So something like a carriage return, which tells a text document, hey, I'm at the end of a line, I want to go to the next line; or a tab character, which means, hey, I want to have a bit of an indent here; or a space character, which means I want to have a little bit of separation between the last character and the next character.
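As a quick illustration of the idea above, Python's built-in `ord` and `chr` functions expose exactly this character-to-number mapping:

```python
# Each character is stored as a number; ord() and chr() convert
# between the two. The first 128 numbers match ASCII.
print(ord("A"))   # 65 -- the letter A is stored as the number 65
print(chr(65))    # A  -- and the number 65 is shown as the letter A
print(bin(65))    # 0b1000001 -- the actual bits written to disk
print(ord("Z"))   # 90 -- uppercase A through Z occupy 65 through 90
```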
So there's all these special characters as well, and we also need numbers to represent them. So there's a lot of different parts to a text document, and all those different parts need to be represented individually by some number that we can use every time that we mean that part. And so the purpose of a character encoding is to have a standardized set of numbers that represent each of the different characters. So we need to always know that this number represents this character, and that's the purpose of a character encoding.

Speaker B:
Who comes up with character encodings?

Speaker A:
Character encodings actually have a pretty long history. The most common character encoding that people have heard about historically is ASCII, and I believe it stands for the American standard character interchange or something like that. I'll do some live research and tell you exactly what it is. It's the American Standard Code for Information Interchange. So I was pretty close. And this is a way of delineating the first 128 characters that most computing uses, and it goes all the way back to teletypes.

What were teletypes? Well, before we had computers with monitors giving us live interaction, we actually used to have something that looked kind of like a typewriter, almost, and new information would kind of print out on those teletypes. And just like we need a standard way of showing characters on our monitors, there had to be a standard way of printing the characters on the teletypes. So actually, the character encoding that we use today still goes back to those teletypes. And believe it or not, it was influenced, going even further back, by the telegraph. For anyone who doesn't know what a telegraph is, it was pre-telephone. So that's how old we've been continuously working with some standardized character encodings.

But anyway, the teletypes had all kinds of special characters. They needed things like, okay, move back to the front of the line. Carriage return was one of them. You might have always wondered why we have both enter and return as characters, and we use them interchangeably today in most computer programs. But on the teletype, they meant two different things. There was something called a line feed, which was different from a carriage return. But again, we kind of think about both of them today as just new line characters. So there were all kinds of special control characters for working with the teletype that we're still stuck with today.
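Those teletype-era control characters are still in every string we type today; a minimal Python sketch:

```python
# The teletype control characters survive as ordinary characters
# with their own numbers, just like letters do.
print(ord("\r"))  # 13 -- carriage return (back to the start of the line)
print(ord("\n"))  # 10 -- line feed (down to the next line)
print(ord("\t"))  # 9  -- tab
# Windows text files still end lines with both, Unix files with just one:
windows_line = "hello\r\n"
unix_line = "hello\n"
```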
And they were part of those 128 standardized characters. But most of the other standardized characters were things like the Latin alphabet: A through Z in capital letters, a through z in lowercase letters, the digits, some standard signs like the dollar sign, ampersand, hash sign. These were the characters that people working in the Western world, in English-speaking languages or languages with a character set similar to English, would need to represent those languages. So, fortunately or unfortunately, depending on your perspective, our standard character sets are highly influenced by English, and kind of standard American English, because that happens to be where a lot of computing devices and software were invented. And so the standard that became really big was this American Standard Code for Information Interchange: ASCII.

So we had this way of encoding these 128 different characters. But as you might imagine, that's not enough for the whole world, and that's also not even enough for every single different kind of application you might want to use on a computer. For example, a lot of early personal computers did most of their display in what we call text mode, and some early personal computers didn't even have a way of doing anything but displaying text on the screen. So we wanted to maybe do some more interesting text than just the standard characters. For example, the original IBM PC had some characters for smiley faces, which you might almost think of as an early form of emoji. Maybe we'll come back to emoji later in the episode. And so, to come back to your question, the people coming up with these different character formats: first there was this historical ASCII standard, but then it was the computer manufacturers themselves.
People like IBM, people like Apple, who were actually going and taking those original 128 ASCII characters and adding on more characters depending on the needs of their particular system. So for example, on the Mac in the 1990s, they used a character encoding called Mac Roman, which was backwards compatible with ASCII, but it also had some special characters used on the Mac, things like the Apple logo. And that character didn't exist on the original IBM PC, of course; there was no Apple logo there. So each of the different manufacturers would come up with their own character encoding, and they would usually still be able to communicate with each other, because ASCII, those first 128 characters, was a standardized set that everyone understood. But it could create real problems, and maybe we'll talk more about those differences.

Speaker B:
Do all computers use the same character encoding?

Speaker A:
Yeah, so as I was just saying, not every computer uses the same character encoding. I mentioned Mac Roman was the standard character encoding on the Mac, and that was a different character encoding than what was used on Windows, and on IBM PCs running DOS before then. So what you would end up with is, if you used some of those Mac Roman specific characters and then you sent your text file over to your Windows-using friend, and they opened the text file without having a program that knew how to convert from the Mac Roman character set to what they called the Windows Latin character set, then you'd actually get some garbage text. Because those numbers that are supposed to represent, let's say, and I don't know if this is true off the top of my head, but say the number 255 represented the Apple logo, and on Windows maybe that represented some kind of e with an accent mark on top of it. So instead of seeing the Apple logo as the original writer of the document intended, now you're going to see the e with the accent on top of it when you look at it on your Windows computer. And you're going to be kind of like, okay, now this kind of looks like garbage, because the original person didn't mean to have that e with the accent on top of it. They were expecting an Apple logo, and now it's not really mutually intelligible. And the same would happen vice versa with a file from Windows going back to the Mac. And of course there's not just Windows and Mac; there's a lot of other systems with their own different character encodings. And so this was a big problem, and it took many years to standardize on just a single character encoding that basically most computers use today.

Speaker B:
So could early personal computers display non-Latin alphabets? Like, what if I wanted to read Japanese, right?

Speaker A:
So we spoke so far about ASCII, which defines the Latin alphabet, and how Mac Roman and the Windows character set built on top of it in the '80s and '90s. That of course is not enough, those 256 different characters, to represent all the languages of the world. So there's no room in there for Cyrillic characters to represent the Russian language, and there's no room in there for Kanji, or for simplified Chinese, or for any other interesting language that's not from, basically, Western Europe. So there were different character encodings for different languages. At least that was the case back in the '90s. You might remember that when you installed macOS or Windows or a Unix-based operating system, whatever operating system you were installing, there would often be language packs that you could install additionally that would give you additional character encodings to access while using the computer. Now they used up a lot of space, of course, relative to the size of hard drives back then, so they usually weren't default installs. And so it wasn't totally uncommon that you would receive a file that used a character encoding that you didn't have installed, and then it would just look like gibberish to you.

So, for example, if somebody was writing using a Cyrillic alphabet, using a Russian character encoding, and then they sent that file over to you, and you didn't have that character encoding installed on your computer, you would just see some combination of garbage ASCII characters, or Mac Roman or Windows Latin characters, and it wouldn't look like anything meaningful to you because you didn't have that character encoding installed. So this was a real problem, and it created real barriers for international communication. It was not the kind of seamless situation we have today. Even when browsing the web, you might remember if you browsed the web in the late '90s, you would go to some web pages from countries that used non-Latin alphabets.
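That gibberish effect is easy to reproduce today, because Python still ships codecs for these legacy encodings (the codec names `mac_roman` and `cp1252`, i.e. Windows Latin, are Python's names for them; the example text is ours):

```python
# The same bytes mean different characters in different encodings.
data = "café".encode("mac_roman")   # text as saved on a classic Mac

print(data.decode("mac_roman"))     # café -- decoded with the right encoding
print(data.decode("cp1252"))        # cafŽ -- decoded as Windows Latin: garbage

# And Mac Roman really did have a byte for the Apple logo (byte 0xF0,
# which maps to a private-use Unicode code point):
print(bytes([0xF0]).decode("mac_roman"))
```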
And again, the web pages would just look like garbage to you because they were using a different character encoding that you didn't have installed on your computer. So this was a real problem, and we had to find a way to solve it by unifying behind a standard character encoding.

Speaker B:
What character encoding do we use today?

Speaker A:
Today the character encodings that we use are all built around a standard called Unicode. Unicode doesn't just define 256 characters like Mac Roman or Windows Latin did, but it is backwards compatible with ASCII in the same way that Mac Roman and Windows Latin were, in that those first 128 characters are still the same. So American English is still getting some primacy there. But Unicode defines well over a hundred thousand different characters. There's every modern, widely spoken language in there, and actually a lot of not so widely spoken languages. Something that a lot of people find surprising is that Unicode actually includes Egyptian hieroglyphics. So there are characters built in that are standard from ancient Egyptian hieroglyphics. Basically any language that has a significant number of speakers has characters in Unicode. And since every modern computer today, whether it be running Android, iOS, Mac, Windows, Linux, whatever, uses Unicode, we're able to all share the same documents and see them correctly using all these different languages, without having to install any additional language packs on our computers like we used to back in the bad old days.

So Unicode is a really expansive standard, and it's a living standard, actually. They left enough room in the standard, they made it expandable and expansive enough, that they could continue to add new characters, continue to add new languages, continue to expand existing languages, and create additional capabilities for making text documents richer. And they were really brilliant to be backwards compatible, because it means we didn't lose all of our older documents. We still have access to this treasure trove of text document history. And that's what's amazing about this ASCII standard we talked about earlier, how it goes all the way back to teletype machines, right?
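You can poke at the breadth of the standard with Python's `unicodedata` module, which knows the official Unicode name of every assigned character, hieroglyphs included:

```python
import unicodedata

# Every Unicode character has a number (a code point) and a standard name.
print(unicodedata.name("A"))           # LATIN CAPITAL LETTER A
print(unicodedata.name("Я"))           # CYRILLIC CAPITAL LETTER YA
print(unicodedata.name(chr(0x13000)))  # EGYPTIAN HIEROGLYPH A001
print(ord("A"))                        # 65 -- the ASCII range is unchanged
```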
If you have a file that you saved in the 1970s in ASCII, you can still open it on a computer today that uses Unicode, and you can actually see that file. That's pretty amazing. That is a testament to the longevity of carefully designed standards and why it's so important to have standards, right? We'll talk about this more, I'm sure, in other episodes. But when the industry really gets behind a software standard, that creates a really powerful ability for human beings to hold on to information, to communicate, to work together seamlessly with one another. And when we get too proprietary, sometimes we go the other direction, and we actually create friction between different computing systems, and therefore between people collaborating. But ASCII, and Unicode after it, which, like I said, is a standard that all the major software companies have really gotten behind, is a beautiful example of us working together in the computing industry for the benefit of everybody.

Anyway, so Unicode is the standard that we use today, and there are multiple different Unicode encodings based on the width of the character. That original ASCII encoding that goes all the way back to the 60s is a seven-bit encoding. And for those of you that remember from episode three, with seven bits we can represent 128 different characters. Now, with Mac Roman and Windows Latin, they added one more bit, so they went from seven bits to eight bits. And when we're in binary, every time we add another bit, we double how many different things we can represent. So we went from being able to do 128 different characters with ASCII to 256 different characters with Mac Roman or Windows Latin or any of those eight-bit character encoding systems that built on top of ASCII.
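The doubling is just powers of two, which you can check directly:

```python
# Every additional bit doubles the number of distinct values,
# and therefore the number of distinct characters we can encode.
print(2 ** 7)  # 128 -- 7-bit ASCII
print(2 ** 8)  # 256 -- 8-bit Mac Roman, Windows Latin, etc.
```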
Now, Unicode has an eight-bit format as well that, again, is backwards compatible with ASCII, but it also has some understood prefix bytes that allow it to expand beyond just one byte. So in what's called UTF-8, which is the eight-bit Unicode character encoding, we can actually use more than one byte to represent a single character, which allows us to have much, much more than just 256 different characters. Like I mentioned, Unicode actually encodes well over a hundred thousand different characters, and it's still expanding as we speak. They actually meet each year and decide, okay, here's what the new characters we're going to add to Unicode are. So we've really standardized on Unicode. And there's UTF-16, which uses two bytes per character (with pairs of two-byte units for the rarer characters), and there's UTF-32, which uses four bytes per character. But UTF-8, which starts at one byte, and then sometimes puts multiple bytes together to get all the different characters, is by far the most common format today.

Speaker B:
How do emojis work?

Speaker A:
So, believe it or not, emojis are actually characters. Emojis are part of the Unicode standard. Now, there are proprietary emoji standards as well, but the majority of the emojis you see today are coming from the Unicode standard. So the same people who are deciding what languages we can support on our computers, by what they add to Unicode each year, are actually the people deciding what emojis we can have on our phones and on our computers. And of course, the big tech companies are part of this Unicode Consortium. So Apple sends a representative, Google and Microsoft send representatives, and they discuss each year what characters to add, what emojis to add.

And so emojis are actually just text. On most modern systems, any place that you can put the letter A, you can also put an emoji. And that's really a beautiful thing, because it's allowed people to be more expressive in text documents. Some people would say it's allowed us to be less expressive, because, as they say, a picture is worth a thousand words. But if you're always using the same picture, you're always using, like, the thumbs up emoji, are you really being very specific about what you're saying? Are you really being very descriptive about what you're saying?

But anyway, emojis really are just text, so emojis don't actually use up any more storage than just another Unicode character does. If you think about an image, like a GIF or PNG or JPEG, whatever, for the same amount of expressiveness that we get from a small image, we would use a lot more memory than we do for a single emoji character. And that's because there's only so many emojis. They're just part of the Unicode standard, and they just use up a few bytes each, because we're always using the same number, quote unquote, that represents that emoji each time. So ultimately, all an emoji is is some number in the Unicode standard.
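As a sketch of how little space an emoji takes, here is the thumbs up emoji's code point, and how many bytes UTF-8 spends on a few different kinds of characters:

```python
# An emoji is just a number in the Unicode standard, like any letter.
print(ord("👍"))        # 128077 -- the code point U+1F44D, THUMBS UP SIGN
print(hex(ord("👍")))   # 0x1f44d

# UTF-8 is variable width: 1 byte for ASCII, up to 4 bytes for emoji.
for ch in "Aé漢👍":
    print(ch, len(ch.encode("utf-8")), "byte(s)")
```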
Just as I told you earlier in the episode that the capital letter A is the number 65, right? Well, the thumbs up emoji is just a number from the Unicode standard as well. And we know every time we see that number, we expect to display a thumbs up emoji. So that's really the basis of everything we talked about today. We just have numbers that represent characters. And that's what a character encoding is. It's saying, here's some specific set of numbers, and here's the specific set of characters that that set of numbers is supposed to represent. And then if I'm using that character encoding, then every time I see that number, I show that character. And that's really the simplest way to think about it. And it's wonderful that we've standardized on a single character encoding now, Unicode, that we use across all computers, so that we don't anymore have these problems of interchange that we had in the '90s, where we would take a text file from one computer, try to display it on another, and it would look like just total gibberish unless we had a converter or some kind of language pack installed. So we're really living in nice times, where communication is a lot smoother between our various computing systems.

Well, thanks so much for listening. We look forward to having you with us again next week. Don't forget to leave us a review on Apple Podcasts, on Overcast, Spotify, wherever you're listening. Don't forget to subscribe, and we hope you have a wonderful week.
Computers are not just great for calculating, they’re also great for storing, manipulating, and viewing text. In fact, the majority of the work we do on a computer is “text work.” But, how does a computer actually store text? How is text represented in software? In this episode we dive into the world of character encodings, the way that software represents text.
Find out more at http://kopec.live