Thursday, January 17, 2008

HTML charset (character sets and character encoding) - TUTORIAL

Audience: Those with a basic understanding of HTML.

If you view the HTML source code of a web page, you may see the following bit of text (or something very similar), located somewhere between the <head> and </head> tags:

<meta http-equiv="Content-type" content="text/html; charset=utf-8" />

A tag that starts with <meta, like the one above, doesn't actually contain anything that will form part of the content of a web page ("content" being the stuff that appears in the web browser). Instead, "meta" tags contain information that is used for other purposes. Some meta tags, for example, contain technical information that gets used by the web browser. Other types of meta tags contain information that might be useful for a search engine.

In this tutorial, we are focusing on the meta tag shown earlier (the one that begins with <meta http-equiv="Content-type"..). This is a meta tag that contains useful information that a web browser can use. You will notice part of this meta tag contains the text charset=utf-8. If you look at the HTML source code of different web pages, you may also see charset=ISO-8859-1. These two are the most common, but you may see others.

This tutorial will attempt to give you an understanding about what this all means, and why it's important you include it in the HTML code of your web page.

If you don't correctly use this charset stuff in your HTML code, then there is a good chance that the words, numbers or symbols used on your web page could look different, or even change, when different people are viewing your website. This is especially true for visitors to a web page who are in different parts of the world. For example, a sentence you've written as "That's cool!" could change into "That?s coolA`" when someone is looking at it in their web browser. You may have seen similar things happen when someone sends you an email from another country, or when you copy and paste text from one place to another. Something's obviously going wrong, somewhere, for this to happen.

To understand why this happens, and to make sure it doesn't happen to your web page, you first need to understand how computers store information.

And to make it easier to understand exactly how computers store information, I'll give an example.

Let's say that Mark wants to write a message, on a piece of paper, to Larry. However, Mark is only allowed to write the numbers 1 and 0 in his message. Mark cannot use any letters of the alphabet, any numbers other than 1 or 0, or any other symbols. Could Mark still write a meaningful message to Larry?

Yes, he could. Mark could create a type of "code" that converts the 1s and 0s into meaningful letters and symbols. For example, Mark could come up with a list of all the letters and symbols he might like to use in his message, and assign each of these "characters" its own sequence of 1s and 0s. A sample of this list might be:

H = 00100001
E = 11000110
L = 01001100
O = 11100010
! = 10011110

Using this code, Mark could write "HELLO!" to Larry by writing down 1s and 0s in the following way:

00100001 = H
11000110 = E
01001100 = L
01001100 = L
11100010 = O
10011110 = !

What Mark has done here is "encoded" his message using his special list of characters and codes. His list contains a set of characters (in this case, all the letters of the English alphabet, plus the exclamation mark), and each character has been assigned a "code" (a string of 1s and 0s). Mark could give his list of characters and codes a name: "Mark's super character list". In technical terms, "Mark's super character list" is a "character set". The characters on the list, taken on their own without their codes, are known as the "character repertoire".

Larry can decode the message, as long as he knows which character set Mark used to encode the message.
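
Mark's scheme can be sketched in a few lines of Python. This is only an illustration: the table is the sample list from above, and the names MARKS_CHARSET, CODE_TO_CHAR, encode and decode are made up for this example.

```python
# Mark's sample list: each character is assigned an 8-bit code.
MARKS_CHARSET = {
    "H": "00100001",
    "E": "11000110",
    "L": "01001100",
    "O": "11100010",
    "!": "10011110",
}

# Larry needs the same list, read in the other direction.
CODE_TO_CHAR = {code: char for char, code in MARKS_CHARSET.items()}


def encode(message):
    """Turn a message into one long string of 1s and 0s."""
    # Mark's list only has capital letters, so convert first.
    return "".join(MARKS_CHARSET[char] for char in message.upper())


def decode(bits):
    """Split the bits into groups of 8 and look each group up."""
    groups = [bits[i:i + 8] for i in range(0, len(bits), 8)]
    return "".join(CODE_TO_CHAR[group] for group in groups)


encoded = encode("Hello!")
print(encoded)
print(decode(encoded))  # HELLO!
```

If Larry tried to decode with a different list, the lookups would either fail or give him the wrong characters, which is the whole problem this tutorial is about.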

By grouping the 1s and 0s together to represent different characters, Mark can still write his message.

Well, guess what? This is exactly how computers store, and even send, information - as strings of 1s and 0s. In computer terms, these 1s and 0s are known as "binary digits", or "bits" for short.

It's therefore also how a computer saves a web page. Every time a web designer clicks "save" in whatever program they are using to create their web page, the computer saves the web page file as a very long string of 1s and 0s. It does this by taking each character that appears in the HTML code of the web page and assigning it a specific sequence of 1s and 0s - exactly the same way that Mark encoded his message to Larry. So for example, it might encode each "<" character as 10110100, and each letter "p" as 01011100.

The computer converts each character of HTML into 1s and 0s by using a specific character set - just like the character set Mark created. In fact, in the world of computers, there are hundreds of different character sets. Some contain many thousands of characters, some contain only a few. Different character sets sometimes encode the same symbol in a different way. For example, the sequence of 1s and 0s that is used to represent the letter 'e' in one character set could be used to represent the '@' symbol in another character set.
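
You can see real character sets at work with a short Python sketch. The helper name bits_of is made up for this example; the charset names (ascii, utf-8, latin-1, cp1251) are ones Python ships with. Note that the real codes differ from the made-up ones in the paragraph above.

```python
import unicodedata


def bits_of(text, charset):
    """Encode text with the given charset and show the result as bits."""
    return " ".join(format(byte, "08b") for byte in text.encode(charset))


# Plain characters such as "<" and "p" in an ASCII-based charset.
print(bits_of("<", "ascii"))  # 00111100
print(bits_of("p", "ascii"))  # 01110000

# The same character can get different bits in different charsets.
print(bits_of("é", "latin-1"))  # 11101001
print(bits_of("é", "utf-8"))    # 11000011 10101001

# And the same bits can mean different characters.
raw = bytes([0b11101001])
for charset in ("latin-1", "cp1251"):
    char = raw.decode(charset)
    print(charset, unicodedata.name(char))
# latin-1 LATIN SMALL LETTER E WITH ACUTE
# cp1251 CYRILLIC SMALL LETTER SHORT I
```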

So although you may not realise it, every time you have "saved" a web page that you've been working on, your computer has chosen a character set, and encoded your web page into 1s and 0s. How can you tell which character set your web page is encoded with when you save it? Well, this depends on which software you are using to build your web page. In the text editor that I use, Crimson Editor (version 3.70), I can select which character set I'd like to use by clicking 'Document' from the main menu, and then clicking 'Encoding type'.

If you are using different software to build your web page, just click around for anything that says "encoding type" or "charset". You should be able to find it.

With the software I'm using, there are only about five different character encoding options. (You may notice I have two UTF-8 options: "with BOM" and "without BOM". If you encounter this with your own HTML editor, just choose "without BOM").
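
In case you are curious what the "BOM" option actually does: it adds three extra bytes to the very start of the saved file. The sketch below shows this in Python, where "utf-8-sig" is Python's name for "UTF-8 with BOM". The sample text is made up for this example.

```python
import codecs

text = "<p>Hello!</p>"

without_bom = text.encode("utf-8")
with_bom = text.encode("utf-8-sig")  # Python's name for "UTF-8 with BOM"

print(without_bom)  # b'<p>Hello!</p>'
print(with_bom)     # b'\xef\xbb\xbf<p>Hello!</p>'

# The only difference is the three-byte marker at the front.
print(with_bom[:3] == codecs.BOM_UTF8)  # True
print(with_bom[3:] == without_bom)      # True
```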

UTF-8 is a good character set, and is recommended by many as a good one to use. Unless your web page is going to include letters or symbols that are very rare, UTF-8 should do the job for you. This is because it can represent every character in Unicode, which covers well over 100,000 characters from almost every written language in the world. Examples of websites that use this character set include Facebook and Yahoo. If you're looking for a character set to use for your web page, you should try UTF-8 first.

So, once you know which character set your web page has been encoded with, you then need to put that information into your HTML code, so that a web browser can find it. How do you do that? First, just copy and paste the following meta tag into your HTML code (you should put it somewhere between the <head> and </head> tags).

<meta http-equiv="Content-type" content="text/html; charset=utf-8" />

Then, make sure that you change utf-8 to whichever character set you saved your web page with when you were building it!

One common misunderstanding about this charset stuff is the idea that you can change the character set of your web page just by changing what's written in the meta tag.

NO!

The character set of your web page is determined when you save your web page to your computer while you are building it. The meta tag is only a label. It is up to you to make sure that you write the correct charset information into your HTML code.
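
Here is a small Python sketch of that point. The page content is made up for this example, and the two encode calls stand in for choosing an encoding in your editor's "save" options.

```python
html = ('<meta http-equiv="Content-type" '
        'content="text/html; charset=utf-8" /><p>café</p>')

# The 1s and 0s are decided by the encoding chosen at save time...
saved_as_latin1 = html.encode("latin-1")
saved_as_utf8 = html.encode("utf-8")

# ...not by the label written inside the meta tag.
print(b"charset=utf-8" in saved_as_latin1)  # True: the label says UTF-8
print(saved_as_latin1 == saved_as_utf8)     # False: the bytes are not UTF-8

# Trusting the label on the wrongly labelled file goes badly.
try:
    saved_as_latin1.decode("utf-8")
    decoded_ok = True
except UnicodeDecodeError:
    decoded_ok = False
print(decoded_ok)  # False
```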

You should hopefully now know which character set was used to encode your web page. So, let's say you upload your web page to the internet, and then someone else comes along and views your web page. Their web browser will begin downloading the long stream of 1s and 0s that your web page has been encoded with, and will then decode those 1s and 0s into readable HTML code. In order for the web browser to know which character set to use for the decoding process, it will look for a <meta http-equiv="Content-type" tag, and then check to see what is written next to charset=. If it says charset=utf-8, the web browser will decode the 1s and 0s using the UTF-8 character set.
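
A very rough Python sketch of that lookup is shown below. Real browsers follow a much more detailed procedure; the function name sniff_meta_charset, the 1024-byte limit and the sample page are all assumptions made for this example. The lookup can work on undecoded bytes because the meta tag itself is written with plain ASCII characters, which most common character sets encode the same way.

```python
import re


def sniff_meta_charset(raw):
    """Look for charset=... near the start of the raw bytes of a page.

    Returns the charset name in lower case, or None if no label is found.
    """
    match = re.search(rb'charset=["\']?([A-Za-z0-9_:.-]+)',
                      raw[:1024], re.IGNORECASE)
    if match is None:
        return None
    return match.group(1).decode("ascii").lower()


page = ('<head><meta http-equiv="Content-type" '
        'content="text/html; charset=utf-8" /></head>'
        '<p>café</p>').encode("utf-8")

charset = sniff_meta_charset(page)
print(charset)                   # utf-8
print(page.decode(charset))      # the page decodes cleanly with the label
print(sniff_meta_charset(b"<p>no label here</p>"))  # None
```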

So what happens if you don't include this meta tag in the HTML code of your web page, and hence, don't include any charset information?

Well, the web browser is then forced to guess which character set has been used. The way a web browser makes this guess will vary from browser to browser.

The other thing that could happen is that you put this meta tag into your HTML code, but specify the wrong character set (by writing the wrong thing next to charset=). In that case, the web browser could start decoding the web page using the wrong character set!

This explains why you sometimes might see strange things appear on a website, such as "That?s coolA`" when you can probably tell it should say "That's cool!". The web browser displaying the page was either given incorrect charset information in the meta tag, or, it wasn't given any charset information and was forced to make a guess. The web browser has decoded the 1s and 0s using the wrong character set, and that's what causes the weird "That?s coolA`" stuff to appear on a web page.

As long as you include correct charset information in the appropriate <meta tag of your HTML code, then the web browser should have no problems decoding each character correctly, and displaying the web page correctly.

One last thing:
Even if you put the wrong character set information into your HTML code, a web browser will sometimes still be able to display the web page properly. The reason it can do this is that when a web browser gets sent a web page from a server, the web browser actually receives more information than what you can see in the HTML code. In fact, before a web server starts sending the HTML, it sends this other thing called an HTTP header. An HTTP header contains technical information that the web browser can use. Some web servers will send the name of the character set for the web page in an HTTP header (the Content-Type header), and if a server does this, your web browser will most likely ignore any charset information that appears in the HTML code, and just use whichever character set was specified in the HTTP header. However, not every web server will do this, so the safest thing is to make sure that the correct charset information is included in your HTML code.
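
The order of preference described above can be sketched in Python like this. The function names charset_from_header and pick_charset are made up for this example, and a real browser does more than this; the sketch only shows the "HTTP header first, meta tag second" idea.

```python
from email.message import Message


def charset_from_header(content_type):
    """Pull the charset out of a Content-Type header value, if present."""
    msg = Message()
    msg["Content-Type"] = content_type
    return msg.get_content_charset()  # lower-cased name, or None


def pick_charset(header_value, meta_charset):
    """Prefer the charset from the HTTP header over the one in the meta tag."""
    header_charset = charset_from_header(header_value) if header_value else None
    return header_charset or meta_charset


# The server names a charset: the meta tag is ignored.
print(pick_charset("text/html; charset=ISO-8859-1", "utf-8"))  # iso-8859-1

# The server does not name a charset: the meta tag is used.
print(pick_charset("text/html", "utf-8"))  # utf-8
```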
