Audience: Those with a basic understanding of the SQL language and a reasonable understanding of how to use the WHERE clause.
I'm sure most of you who are familiar with SQL have seen the term "left join" around the place. Up until now, you have probably gotten away with never having to use it.
I recently had to learn about left joins, because a simple query I was using was not returning a complete set of records, as I had intended it to.
As with all innovations, the "left join" exists to solve a problem. I'll highlight the problem a left join can help us solve, using an example.
Let's say you have a database that stores people's names, and their favourite colours. The database could have two tables, and look like this:
Names (table)
Colours (table)
As you can see, the two tables are linked on the fave_colour column in the Names table, which relates to a colour_id in the Colours table.
So, let's say we want to run a query that will retrieve everyone's name, and their favourite colour. You might think the proper way to do this would be easy:
SELECT name, colour FROM names, colours WHERE fave_colour = colour_id
The above query would return the following results:
Looks ok... but hang on. There are some names missing. Have a look at the Names table again, and you will see a "Simon" and a "Carol". Why haven't they shown up in the query result?
To get an answer, let's take another look at the query we used:
SELECT name, colour FROM names, colours WHERE fave_colour = colour_id
This query is asking to retrieve all names and colours "where the value in the fave_colour column matches a value in the colour_id column".
Have a look at the fave_colour column in the Names table, next to Simon and Carol. You will see that Simon and Carol have a fave_colour value of 4.
Now have a look at the Colours table. There is no colour_id of 4. So if we run a query that asks only for records where the fave_colour value matches a colour_id value, there will be no match for Simon and Carol, and their records will not be returned.
The way to solve this problem is to slightly change our query. Here, again, is the query we started with:
SELECT name, colour FROM names, colours WHERE fave_colour = colour_id
We need to make two changes to this query. For the first change, I am going to remove the comma between names and colours, and put the words LEFT JOIN in its place. For the second change, I am going to remove the word WHERE, and put the word ON in its place. Here's what the new query would look like:
SELECT name, colour FROM names LEFT JOIN colours ON fave_colour = colour_id
So, what do these changes do?
In the new query, you will notice on each side of the words LEFT JOIN are the names of two tables. In this case, names is on the immediate left, and colours is on the immediate right. The table on the left hand side of LEFT JOIN will have all of its selected records returned, regardless of whether there is a match in the table on the right hand side. Where there is no match, the columns that come from the right hand table are simply returned as NULL (empty). Then, just like a WHERE clause, we need to say which two columns we want to look for a match on. When using LEFT JOIN, we just replace the word WHERE with ON.
So running the new query will give us the following results:
SELECT name, colour FROM names LEFT JOIN colours ON fave_colour = colour_id
There we go. All relevant records from the Names table have now been returned, along with the favourite colour (if properly recorded).
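If you'd like to try the two queries side by side, here is a small sketch using Python's built-in sqlite3 module. The rows are made up for illustration: only Simon and Carol (with a fave_colour of 4) come from the example above, while Alice, Bob and the three colours are assumptions of mine.

```python
import sqlite3

# An in-memory database, so nothing is written to disk.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE names (name TEXT, fave_colour INTEGER);
    CREATE TABLE colours (colour_id INTEGER, colour TEXT);
    -- Hypothetical rows. Only Simon and Carol (fave_colour 4) come from the tutorial.
    INSERT INTO names VALUES ('Alice', 1), ('Bob', 2), ('Simon', 4), ('Carol', 4);
    INSERT INTO colours VALUES (1, 'Red'), (2, 'Blue'), (3, 'Green');
""")

# The original query: only rows with a matching colour_id come back.
inner_rows = conn.execute(
    "SELECT name, colour FROM names, colours "
    "WHERE fave_colour = colour_id ORDER BY name"
).fetchall()

# The LEFT JOIN query: every row from names comes back, matched or not.
left_rows = conn.execute(
    "SELECT name, colour FROM names LEFT JOIN colours "
    "ON fave_colour = colour_id ORDER BY name"
).fetchall()

print(inner_rows)  # [('Alice', 'Red'), ('Bob', 'Blue')]
print(left_rows)   # [('Alice', 'Red'), ('Bob', 'Blue'), ('Carol', None), ('Simon', None)]
```

Python shows the missing colour as None, which is how sqlite3 represents a NULL.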
Saturday, January 26, 2008
Thursday, January 17, 2008
HTML charset (character sets and character encoding) - TUTORIAL
Audience: Those with a basic understanding of HTML.
If you view the HTML source code of a web page, you may see the following bit of text (or something very similar), located somewhere between the <head> and </head> tags:
<meta http-equiv="Content-type" content="text/html; charset=utf-8" />
A tag that starts with <meta, like the one above, doesn't actually contain anything that will form part of the content of a web page ("content" being the stuff that appears in the web browser). Instead, "meta" tags contain information that is used for other purposes. Some meta tags, for example, contain technical information that gets used by the web browser. Other types of meta tags contain information that might be useful for a search engine.
In this tutorial, we are focusing on the meta tag shown earlier (the one that begins with <meta http-equiv="Content-type"..). This is a meta tag that contains useful information that a web browser can use. You will notice part of this meta tag contains the text charset=utf-8. If you look at the HTML source code of different web pages, you may also see charset=ISO-8859-1. These two are the most common, but you may see others.
This tutorial will attempt to give you an understanding about what this all means, and why it's important you include it in the HTML code of your web page.
If you don't correctly use this charset stuff in your HTML code, then there is a good chance that the words, numbers or symbols used on your webpage could look different, or even change, when different people are viewing your website. This is especially true for visitors to a web page who are in different parts of the world. For example, a sentence you've written as "That's cool!" could change in to "That?s coolA'" when someone is looking at it in their web browser. You may have seen similar things like this happen when someone sends you an email from another country. You may have seen it happen when you copy and paste text from one place to another. Something's obviously going wrong, somewhere, for this to happen.
To understand why this happens, and to make sure it doesn't happen to your web page, you first need to understand how computers store information.
And to make it easier to understand exactly how computers store information, I'll give an example.
Let's say that Mark wants to write a message, on a piece of paper, to Larry. However, Mark is only allowed to write the numbers 1 and 0 in his message. Mark cannot use any letters of the alphabet, any numbers other than 1 or 0, or any other symbols. Could Mark still write a meaningful message to Larry?
Yes, he could. Mark could create a type of "code" that converts the 1s and 0s in to meaningful letters and symbols. For example, Mark could come up with a list of all the letters and symbols he might like to use in his message, and assign all of these "characters" to a sequence of 1s and 0s. A sample of this list might be:
H = 00100001
E = 11000110
L = 01001100
O = 11100010
! = 10011110
Using this code, Mark could write "Hello!" to Larry by writing down 1s and 0s in the following way:
00100001 = H
11000110 = E
01001100 = L
01001100 = L
11100010 = O
10011110 = !
What Mark has done here is "encoded" his message using his special list of characters and codes. His list contains a set of characters (in this case, all the letters of the English alphabet, plus the exclamation mark), and each character has been assigned a "code" (a string of 1s and 0s). Mark could give his list of characters and codes a name: "Mark's super character list". In technical terms, "Mark's super character list" is a "character set". (Strictly speaking, the list of characters on its own is known as the "character repertoire", and the rule that assigns each character its 1s and 0s is the "character encoding".)
Larry can decode the message, as long as he knows which character set Mark used to encode the message.
By grouping the 1s and 0s together to represent different characters, Mark can still write his message.
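To make Mark's scheme concrete, here is a minimal Python sketch of it. The five codes are the ones from the sample list above. The function names encode and decode, and the decision to treat upper and lower case letters the same, are my own assumptions.

```python
# Mark's sample list from the text: each character maps to an 8-digit code.
MARKS_CHARACTER_SET = {
    "H": "00100001",
    "E": "11000110",
    "L": "01001100",
    "O": "11100010",
    "!": "10011110",
}

def encode(message, charset):
    """Turn a message into one long string of 1s and 0s."""
    # Assumption: the list only has capital letters, so fold the message to upper case.
    return "".join(charset[ch] for ch in message.upper())

def decode(bits, charset):
    """Split the bits into groups of 8 and look each group up in the list."""
    reverse = {code: ch for ch, code in charset.items()}
    groups = [bits[i:i + 8] for i in range(0, len(bits), 8)]
    return "".join(reverse[group] for group in groups)

bits = encode("Hello!", MARKS_CHARACTER_SET)
print(bits)
print(decode(bits, MARKS_CHARACTER_SET))  # HELLO!
```

Notice that decode only works because it is given the same list that encode used. Hand Larry a different list and he will read a different message.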
Well, guess what? This is exactly how computers store, and even send, information - as strings of 1s and 0s. In computer terms, these 1s and 0s are known as "binary digits", or "bits" for short.
It's therefore also how a computer saves a web page. Every time a web designer clicks "save" in whatever program they are using to create their web page, the computer will save the web page file as a very long string of 1s and 0s. The way it does this is to take each character that appears in the HTML code of the web page, and assign it to a specific sequence of 1s and 0s - exactly the same way that Mark encoded his message to Larry. So for example, it might encode each opening angle bracket "<" as 10110100, and each letter "p" as 01011100. The computer will convert each character of HTML in to 1s and 0s by using a specific character set - just like the character set Mark created.
In fact, in the world of computers, there are hundreds of different character sets. Some contain many thousands of characters, some contain only a few. Different character sets sometimes encode the same symbol in a different way. For example, the sequence of 1s and 0s that is used to represent the letter 'e' in one character set could be used to represent the '@' symbol in another character set.
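You can see this for yourself with real character sets. The sketch below uses Python's built-in codecs. ISO-8859-1 (which Python also calls latin-1) is mentioned in this tutorial, but cp1251 (a Windows character set for Cyrillic) and the letter "é" are just examples I picked.

```python
# One byte: the same eight 1s and 0s every time.
raw = bytes([0b11101001])

# Decoded with two different character sets, it becomes two different letters.
as_latin1 = raw.decode("latin-1")  # Western European
as_cp1251 = raw.decode("cp1251")   # Cyrillic
print(as_latin1)  # é
print(as_cp1251)  # й

# And going the other way, the same letter becomes different 1s and 0s.
print("é".encode("latin-1"))  # b'\xe9'
print("é".encode("utf-8"))    # b'\xc3\xa9'
```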
So although you may not realise it, every time you have "saved" a web page that you've been working on, your computer has chosen a character set, and encoded your web page in to 1s and 0s. How can you tell which character set your web page is encoded with when you save it? Well, this depends on which software you are using to build your web page. In the text editor that I use, Crimson Editor (version 3.70), I can select which character set I'd like to use by clicking 'Document' from the main menu, and then clicking 'Encoding type' (see below).
If you are using different software to build your web page, just click around for anything that says "encoding type" or "charset". You should be able to find it.
With the software I'm using, there are only about 5 different character encoding options. (You may notice I have two UTF-8 options; "with BOM" and "without BOM". If you encounter this with your own HTML editor, just choose "without BOM").
UTF-8 is a good character set, and is recommended by many. Unless your webpage is going to include letters or symbols that are very rare, UTF-8 should do the job for you. This is because it can encode every character in Unicode, which covers well over 100,000 characters from almost every language in the world. Examples of websites that use this character set include Facebook and Yahoo. If you're looking for a character set to use for your web page, you should try UTF-8 first.
So, once you know which character set your web page has been encoded with, you then need to put that information in to your HTML code, so that a web browser can find it. How do you do that? First, just copy and paste the following meta tag in to your HTML code (you should put it somewhere between the <head> and </head> tags).
<meta http-equiv="Content-type" content="text/html; charset=utf-8" />
Then, make sure that you change utf-8 to whichever character set you saved your web page with when you were building it!
One common misunderstanding about this charset stuff, is that some people think you can change the character set of your web page, just by changing what's written in the meta tag.
NO!
The character set of your web page is determined when you save your web page to your computer when you are building it. It is up to you to make sure that you write the correct charset information in to your HTML code.
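Here is a rough Python sketch of that point. The same page text, with the same meta tag, is "saved" in two different character sets. The page content and the helper name decodes_cleanly are made up for illustration, and cp1252 (a common Windows character set) is my choice of second encoding.

```python
# The same page text, including a meta tag that claims utf-8.
# "\u2019" is a curly apostrophe, as in "That's cool!".
page = ('<meta http-equiv="Content-type" content="text/html; charset=utf-8" />\n'
        "<p>That\u2019s cool!</p>")

saved_as_utf8 = page.encode("utf-8")
saved_as_cp1252 = page.encode("cp1252")

# The meta tag reads the same in both files...
print(b"charset=utf-8" in saved_as_utf8, b"charset=utf-8" in saved_as_cp1252)  # True True
# ...but the 1s and 0s that were actually saved are different.
print(saved_as_utf8 == saved_as_cp1252)  # False

def decodes_cleanly(raw, declared_charset):
    """Return True if the saved bytes can be decoded with the declared charset."""
    try:
        raw.decode(declared_charset)
        return True
    except UnicodeDecodeError:
        return False

print(decodes_cleanly(saved_as_utf8, "utf-8"))    # True
print(decodes_cleanly(saved_as_cp1252, "utf-8"))  # False
```

Keep in mind that a clean decode does not prove the label is right (some character sets will accept any bytes at all). A failed decode, though, is a sure sign that the label and the saved file disagree.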
You should hopefully now know which character set was used to encode your web page. So, let's say you upload your web page to the internet, and then someone else comes along and views your web page. Their web browser will begin downloading the long stream of 1s and 0s that your web page has been encoded with, and once it's done that, will begin to decode those 1s and 0s, in to readable HTML code. In order for the web browser to know which character set to use for the decoding process, it will look for a <meta http-equiv="Content-type" tag, and then check to see what is written next to charset=. If it says charset=utf-8, the web browser will decode the 1s and 0s using the UTF-8 character set.
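The lookup described above can be sketched in a few lines of Python. This is a big simplification and not how any real web browser is written: the function names, the 1024-byte search window and the utf-8 default are all assumptions of mine.

```python
import re

# Looks for charset=... in the raw bytes. This works before decoding because
# the meta tag itself only uses plain ASCII characters.
META_CHARSET = re.compile(rb'charset\s*=\s*["\']?([A-Za-z0-9_.:-]+)', re.IGNORECASE)

def sniff_meta_charset(raw, default="utf-8"):
    """Return the charset named in the page's meta tag, or a default guess."""
    match = META_CHARSET.search(raw[:1024])  # only look near the top of the page
    if match:
        return match.group(1).decode("ascii")
    return default

def decode_page(raw):
    """Decode the downloaded bytes using whatever the meta tag claims."""
    return raw.decode(sniff_meta_charset(raw), errors="replace")

# A pretend download: a page that was saved as ISO-8859-1 and labelled correctly.
downloaded = ('<head><meta http-equiv="Content-type" '
              'content="text/html; charset=ISO-8859-1" /></head>'
              "<p>caf\u00e9</p>").encode("latin-1")

print(sniff_meta_charset(downloaded))  # ISO-8859-1
print(decode_page(downloaded)[-11:])   # <p>café</p>
```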
So what happens if you don't include this meta tag in the HTML code of your web page, and hence, don't include any charset information?
Well, the web browser is then forced to guess which character set has been used. The way a web browser makes this guess will vary from browser to browser.
The other thing that could happen is that you put this meta tag in to your HTML code, but specify the wrong character set (by writing the wrong thing next to charset=). The web browser could then start decoding the web page using the wrong character set!
This explains why you sometimes might see strange things appear on a website, such as "That?s coolA`" when you can probably tell it should say "That's cool!". The web browser displaying the page was either given incorrect charset information in the meta tag, or, it wasn't given any charset information and was forced to make a guess. The web browser has decoded the 1s and 0s using the wrong character set, and that's what causes the weird "That?s coolA`" stuff to appear on a web page.
As long as you include correct charset information in the appropriate <meta tag of your HTML code, then the web browser should have no problems decoding each character correctly, and displaying the web page correctly.
One last thing:
Even if you put the wrong character set information in to your HTML code, a web browser will sometimes still display the web page properly. This is because when a web browser gets sent a web page from a server, it actually receives more information than what you can see in the HTML code. Before a web server starts sending the HTML, it sends something called an HTTP header, which contains technical information that the web browser can use. Some web servers will send the name of the character set for the web page in this HTTP header, and if they do, your web browser will most likely ignore any charset information that appears in the HTML code, and just use whichever character set was specified in the HTTP header. However, not every web server will do this, so the safest thing is to make sure that the correct charset information is included in your HTML code.
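As a sketch of that order of priority, the Python below reads the charset out of a Content-Type header value and prefers it over the one from the meta tag. The function names and the utf-8 fallback are assumptions of mine, and a real web browser's rules are more involved than this.

```python
from email.message import Message

def charset_from_header(content_type):
    """Pull the charset out of a header value like 'text/html; charset=utf-8'."""
    msg = Message()
    msg["Content-Type"] = content_type
    return msg.get_content_charset()  # lower-cased name, or None if not given

def choose_charset(http_content_type, meta_charset, fallback="utf-8"):
    """HTTP header first, then the meta tag, then a guess."""
    header_charset = charset_from_header(http_content_type) if http_content_type else None
    return header_charset or meta_charset or fallback

# The server names a charset, so the meta tag is ignored.
print(choose_charset("text/html; charset=ISO-8859-1", "utf-8"))  # iso-8859-1
# The server stays silent about the charset, so the meta tag is used.
print(choose_charset("text/html", "utf-8"))  # utf-8
```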
Wednesday, January 9, 2008
What is a DOCTYPE? Part 2 of 2 - TUTORIAL
Audience: Those who have at least a basic understanding of HTML and CSS, and who have read or understand part 1 of this tutorial.
In part one of this tutorial, we learned that the strange bit of text that appears at the top of the HTML code of some web pages (see below) is a "DOCTYPE declaration".
Sample DOCTYPE declaration:
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
We also learned its purpose: to inform the web browser which version of HTML a particular web page is written in.
A question you may ask is, "But I've seen plenty of web pages that don't have a DOCTYPE declaration at the top of the HTML code. Why do those pages still seem to work ok?".
There are indeed many web pages on the internet with no DOCTYPE declaration. In fact, even www.google.com doesn't have one! (At least as of January 9, 2008). Yet the page still seems to display properly in any web browser.
The reason why pages like www.google.com will still load in a web browser, despite not having a DOCTYPE declaration, is that web browsers are very forgiving of badly written HTML. What I mean by "badly written" is any HTML code that does not properly comply with the rules by which HTML is supposed to be written. These rules are specified in the "Document Type Definition" (DTD) of a given version of HTML (if you are unsure what this means, please refer to part 1 of this tutorial series).
So, even if a web page does not tell the web browser which version of HTML it is written in, a web browser will still attempt to display it. When this happens, the web browser is said to enter "quirks mode" (as opposed to "standards mode", which just means that the web browser knows which version of HTML is being used). Quirks mode isn't good, because there is no clearly defined set of rules for how a web browser is supposed to interpret HTML when in quirks mode. This means a web page with no DOCTYPE declaration can look very different in one web browser compared to another.
Unfortunately, using a DOCTYPE declaration still won't guarantee that a web page looks the same in different web browsers. One reason is that web browsers do not all interpret the same version of HTML in exactly the same way (there is no good reason for this; that's just the way it is). However, it is generally accepted that using a proper DOCTYPE declaration makes it easier to build a web page that looks the same (or close enough) in any web browser (also known as a web page having "cross browser support"), because different web browsers are at least trying to interpret your HTML code in the same way.
I will now go through the available HTML DOCTYPEs:
(Note: HTML versions 2 and 3 both have DOCTYPEs. But these versions of HTML are so old, they are not worth focusing on).
The earliest version of HTML worth worrying about is HTML 4.01, which comes in 3 types:
HTML 4.01 Strict:
"Style" elements are not allowed to appear within the HTML code of Strict web pages. Style elements are things like colors, font sizes and background images. These should all be reserved for the CSS (Cascading Style Sheet).
HTML 4.01 Transitional:
Some style elements are allowed in the HTML code of a HTML 4.01 Transitional web page, but the FRAMESET tag is not allowed.
HTML 4.01 Frameset:
Very similar to Transitional, except that the FRAMESET tag is allowed to be used.
You may be wondering, "Why are there 3 different types of the same version?". The answer is that the people in charge of releasing new versions of HTML (the W3C) understand how difficult it is to build a web page that will work properly in every web browser. It is therefore considered "ideal" to use the Strict DOCTYPE, but the reality is that you may find it impossible to get your web page working in all web browsers if you're not allowed to use style tags in your HTML code. The eventual goal is to have all style elements separated out in to the CSS, with no style elements at all in the HTML code.
(For a more complete list of what is and isn't allowed with Strict and Transitional web pages, see Roger Johansson's article 'Transitional vs. Strict Markup'. You can just scroll straight to the section titled 'Elements that are not allowed in Strict DOCTYPEs', and start from there).
Another important version of HTML is XHTML 1.0. This version of HTML comes in the same 3 types as HTML 4.01: Strict, Transitional and Frameset.
The major difference between HTML and XHTML is not what tags are or aren't allowed. The difference lies in how the HTML code should be written. XHTML code must be written according to some very specific rules. For example:
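Some XHTML rules:
Tags must be closed in the reverse of the order in which they were opened (this is known as "proper nesting").
GOOD:
<p><i>Watch as I close the italics tag before the paragraph tag, just as XHTML requires me to</i></p>
BAD:
<p><i>This is bad! I am closing the p tag before the i tag!</p></i>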
(For a more complete list of the rules you must follow when writing XHTML, see Linda Roeder's article 'Basics of XHTML - Why, What and How'. The list of rules starts at about the third paragraph).
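The nesting rule is easy to check mechanically. Below is a rough Python sketch that uses the standard html.parser module to keep track of which tags are open and complain when one is closed out of order. The class and function names are my own, and this only checks nesting. It is not a full XHTML validator.

```python
from html.parser import HTMLParser

class NestingChecker(HTMLParser):
    """Tracks open tags and records any tag that is closed out of order."""

    def __init__(self):
        super().__init__()
        self.open_tags = []
        self.errors = []

    def handle_starttag(self, tag, attrs):
        self.open_tags.append(tag)

    def handle_startendtag(self, tag, attrs):
        pass  # self-closing tags like <br /> open and close in one go

    def handle_endtag(self, tag):
        if self.open_tags and self.open_tags[-1] == tag:
            self.open_tags.pop()
        else:
            self.errors.append("</%s> closed out of order" % tag)

def is_properly_nested(markup):
    checker = NestingChecker()
    checker.feed(markup)
    checker.close()
    # Properly nested means no out-of-order closes and nothing left open.
    return not checker.errors and not checker.open_tags

print(is_properly_nested("<p><i>Closed in the right order</i></p>"))  # True
print(is_properly_nested("<p><i>Closed in the wrong order</p></i>"))  # False
```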
As you might be able to guess, one major reason that XHTML was invented is to try and get everyone to write their HTML code in the same way. This would have major benefits for web browsers (in their attempts to display web pages properly) and for web designers (whose code will now look very similar, if not identical, to any other web designer's).
The last version of HTML that you need to worry about is XHTML 1.1. This version of HTML does not come in 3 types as you have seen previously; it's just plain old XHTML 1.1. This version is very similar to XHTML 1.0 Strict. It must also follow the same rules about how XHTML code is to be written as the other XHTML versions (lowercase tags, etc). So what's the difference between XHTML 1.0 Strict and XHTML 1.1? Unfortunately, this question can't really be answered if you don't know what XML is (that's not a typo. XML is not the same thing as XHTML). But don't worry. At this stage, XHTML 1.0 is still more popular than XHTML 1.1, so just stick with one of the earlier versions for now.
Below I have again listed all of the HTML versions that I have talked about, along with their DOCTYPE declarations. You might have seen elsewhere that you can just "copy and paste" these bits of text in to the top of your own HTML code - and indeed, there is nothing wrong with that. In fact, it's probably better to do so, because if you get just one character wrong in your DOCTYPE declaration, it will probably throw your web browser in to "quirks mode", possibly without you even realising. So, once you have decided which version of HTML you want to write your web page in, just copy and paste the relevant DOCTYPE as it appears below. I have listed each DOCTYPE under a bolded heading - the heading is NOT part of the DOCTYPE declaration and should NOT be copied and pasted in to your HTML code! (Please note, when copying and pasting these DOCTYPE declarations, they should be the first thing that appears in your HTML code).
HTML 4.01 Strict
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN"
"http://www.w3.org/TR/html4/strict.dtd">
HTML 4.01 Transitional
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"
"http://www.w3.org/TR/html4/loose.dtd">
HTML 4.01 Frameset
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Frameset//EN"
"http://www.w3.org/TR/html4/frameset.dtd">
XHTML 1.0 Strict
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
XHTML 1.0 Transitional
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
XHTML 1.0 Frameset
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Frameset//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-frameset.dtd">
XHTML 1.1
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.1//EN"
"http://www.w3.org/TR/xhtml11/DTD/xhtml11.dtd">
The people responsible for HTML are working towards a single HTML version that all web browsers will rely on. As long as you build your web page in that version, you shouldn't encounter any "cross browser support" issues. Unfortunately, that day is probably still some time away. The idea behind encouraging web designers to start using DOCTYPEs, and to follow the respective rules, is that one day a common standard can be reached.
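To make the "first thing in your HTML code" point concrete, here is a minimal sketch of a complete page that uses the HTML 4.01 Transitional DOCTYPE from the list above. The title and paragraph text are made up for illustration; only the first two lines are the DOCTYPE declaration itself.

```html
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"
"http://www.w3.org/TR/html4/loose.dtd">
<html>
<head>
<!-- Placeholder title, made up for this example -->
<title>My first page with a DOCTYPE</title>
</head>
<body>
<p>The DOCTYPE declaration is the very first thing in this file.</p>
</body>
</html>
```

To switch to a different version of HTML, you would swap only those first two lines for another DOCTYPE from the list.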
---
Please comment on this tutorial, and let me know if there's any way I could have improved it. Most importantly, what could I have done to make it easier to understand?
In part one of this tutorial, we learned that the strange bit of text that appears at the top of the HTML code of some web pages (see below) is a "DOCTYPE declaration".
Sample DOCTYPE declaration:
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
We also learned its purpose: to inform the web browser which version of HTML a particular web page is written in.
A question you may ask is, "But I've seen plenty of web pages that don't have a DOCTYPE declaration at the top of the HTML code. Why do those pages still seem to work OK?"
There are indeed many web pages on the internet with no DOCTYPE declaration. In fact, even www.google.com doesn't have one! (At least as of January 9, 2008). Yet the page still seems to display properly in any web browser.
The reason why pages like www.google.com will still load in a web browser, despite not having a DOCTYPE declaration, is that web browsers are very forgiving of badly written HTML. What I mean by "badly written" is any HTML code that does not properly comply with the rules by which HTML is supposed to be written. These rules are specified in the "Document Type Definition" (DTD) of a given version of HTML (if you are unsure what this means, please refer to part 1 of this tutorial series).
So, even if a web page does not tell the web browser which version of HTML it is written in, a web browser will still attempt to display it. When this happens, the web browser is said to enter "quirks mode" (as opposed to "standards mode", which just means that the web browser knows which version of HTML is being used). Quirks mode isn't good because there is no clearly defined set of rules for how a web browser is supposed to interpret HTML when in quirks mode. This means a web page with no DOCTYPE declaration can look very different in one web browser compared to another.
Unfortunately, using a DOCTYPE declaration still won't guarantee that a web page looks the same in different web browsers. One reason is that web browsers do not all interpret the same version of HTML in exactly the same way (there is no good reason for this; that's just the way it is). However, it is generally accepted that using a proper DOCTYPE declaration makes it easier to build a web page that looks the same (or close enough) in any web browser (also known as a web page having "cross browser support"). The reason is that, with a proper DOCTYPE declaration, different web browsers are at least trying to interpret your HTML code in the same way.
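If you want to see which mode your own browser has chosen for a page, one option is the document.compatMode property that browsers make available to JavaScript: it reads "CSS1Compat" when the page is in standards mode and "BackCompat" when it is in quirks mode. Below is a small sketch of a test page, assuming you save it as a file and open it in your browser. The page title, the id "mode" and the wording are made up for the example.

```html
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN"
"http://www.w3.org/TR/html4/strict.dtd">
<html>
<head>
<title>Which mode is my browser in?</title>
</head>
<body>
<p id="mode">This page is being displayed in: </p>
<script type="text/javascript">
// document.compatMode is "CSS1Compat" in standards mode
// and "BackCompat" in quirks mode.
var label = document.compatMode === "CSS1Compat" ? "standards mode" : "quirks mode";
// Add the result to the end of the paragraph above.
document.getElementById("mode").appendChild(document.createTextNode(label));
</script>
</body>
</html>
```

If you delete the first two lines (the DOCTYPE declaration) and reload the page, the text should change to "quirks mode".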
I will now go through the available HTML DOCTYPEs:
(Note: HTML versions 2 and 3 both have DOCTYPEs. But these versions of HTML are so old, they are not worth focusing on).
The earliest version of HTML worth worrying about is HTML 4.01, which comes in 3 types:
- HTML 4.01 Strict
- HTML 4.01 Transitional
- HTML 4.01 Frameset
HTML 4.01 Strict:
"Style" elements are not allowed to appear within the HTML code of Strict web pages. Style elements are tags and attributes that only control how the page looks, such as colors and font sizes (for example, the <font> and <center> tags, or the bgcolor attribute). These should all be reserved for the CSS (Cascading Style Sheet).
HTML 4.01 Transitional:
Some style elements are allowed in the HTML code of an HTML 4.01 Transitional web page, but the FRAMESET tag is not allowed.
HTML 4.01 Frameset:
Very similar to Transitional, except that the FRAMESET tag is allowed to be used.
You may be wondering, "Why are there 3 different types of the same version?" The answer is that the people in charge of releasing new versions of HTML (the W3C) understand how difficult it is to build a web page that will work properly in every web browser. It is considered "ideal" to use the Strict DOCTYPE, but the reality is that you may find it impossible to get your web page working in all web browsers if you're not allowed to use style tags in your HTML code. The long-term goal is to have all style elements separated into the CSS, with no style elements at all in the HTML code.
(For a more complete list of what is and isn't allowed with Strict and Transitional web pages, see Roger Johansson's article 'Transitional vs. Strict Markup'. You can just scroll straight to the section titled 'Elements that are not allowed in Strict DOCTYPEs', and start from there).
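As a rough illustration of the difference, the sketch below shows the same red heading written two ways. The first page uses the <font> tag, which is a style element and so is only allowed with a Transitional DOCTYPE. The page text is made up for the example.

```html
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"
"http://www.w3.org/TR/html4/loose.dtd">
<html>
<head>
<title>Transitional example</title>
</head>
<body>
<!-- The font tag is a style element: allowed in Transitional, not in Strict -->
<h1><font color="red">Welcome to my page</font></h1>
</body>
</html>
```

The second page is Strict: the HTML only says "this is a heading", and the color is set in a separate CSS file. The file name style.css is an assumption made for this example.

```html
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN"
"http://www.w3.org/TR/html4/strict.dtd">
<html>
<head>
<title>Strict example</title>
<!-- style.css is a made-up file name; it would contain: h1 { color: red; } -->
<link rel="stylesheet" type="text/css" href="style.css">
</head>
<body>
<h1>Welcome to my page</h1>
</body>
</html>
```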
Another important version of HTML is XHTML 1.0. This version of HTML comes in the same 3 types as HTML 4.01:
- XHTML 1.0 Strict
- XHTML 1.0 Transitional
- XHTML 1.0 Frameset
The major difference between HTML and XHTML is not what tags are or aren't allowed. The difference lies in how the HTML code should be written. XHTML code must be written according to some very specific rules. For example:
Some XHTML rules:
- All tags must be written in lowercase (eg, <h1>, <p>, and not <H1> or <P>).
- All tags that are opened must be closed (eg, <h1> must have a corresponding </h1>).
- If more than one tag is open at the same time, they must be closed in the reverse order they were opened. For example:
GOOD:
<p><i>Watch as I close the italics tag before the paragraph tag, just as XHTML requires me to</i></p>
BAD:
<p><i>This is bad! I am closing the p tag before the i tag!</p></i>
(For a more complete list of the rules you must follow when writing XHTML, see Linda Roeder's article 'Basics of XHTML - Why, What and How'. The list of rules starts at about the third paragraph).
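Putting these rules together, a minimal XHTML 1.0 Strict page might look like the sketch below. The title and text are made up for the example. Two things appear here that are not in the rules listed above: the xmlns attribute on the <html> tag, which XHTML pages are expected to carry, and the <br /> tag, which shows how a tag with no closing partner is closed within itself.

```html
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<title>A small XHTML page</title>
</head>
<body>
<!-- All tags are lowercase, and every opened tag is closed -->
<h1>Hello from XHTML</h1>
<!-- Nested tags are closed in reverse order: i first, then p -->
<p><i>This sentence is in italics.</i><br />
This sentence is on a new line.</p>
</body>
</html>
```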
As you might be able to guess, one major reason that XHTML was invented is to try to get everyone to write their HTML code in the same way. This would have major benefits for web browsers (in their attempts to display web pages properly) and for web designers (whose code will now look very similar, if not exactly the same, as any other web designer's).
What topics would you like to see a tutorial on?
I'm open to suggestions for the topic of my next tutorial. The areas that I have at least some knowledge in are:
HTML and XHTML
CSS
PHP
MySQL
Are there any areas within these fields that you just don't quite get, and would like to see a tutorial on? If so, leave a comment in the "Comments" section below. In the comments section, ignore the bit that says "Sign in using Blogger/Google"; just scroll down and choose one of the other options (in other words, there is no need to sign in to leave a comment!).
Monday, December 17, 2007
What is a DOCTYPE? Part 1 of 2 - TUTORIAL
Audience: Those with at least a basic understanding of HTML and CSS.
If you view the HTML code of a webpage, there is a good chance you will see the following text (or something similar) at the top:
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
Why is it there, and what does it mean?
To get a clear understanding, we need to go back in time..
When HTML first arrived in 1993, it was in a much simpler form than it is today. For example, the first version of HTML didn’t have as many tags (there was no <div> and no <span>, among others), and there was no CSS at all. HTML’s creator (Tim Berners-Lee) specified what tags could be used in HTML, and what they would mean. For example, he said the <h1> tag could be used in HTML to mean a heading. This list of tags and their meanings became known as the “HTML specification”.
As the internet became more popular, competition between the web browsers to be the most popular was also growing. In order to try and separate themselves from others on the market, some web browsers began introducing new tags that were not part of the original “HTML specification”. This caused a major problem: some web pages were being created with tags that not all web browsers could understand. Eventually, many people agreed that the original “HTML specification” should be updated to a second version, and should include as many of the new tags as possible.
This is exactly what happened. HTML was upgraded to version 2.0, which added many new tags and removed some old ones. This created a new problem for web browsers. If HTML was going to keep getting upgraded to newer versions, some (older) web pages written in older versions of HTML would still exist on the internet. How was a web browser supposed to know which version of HTML a web page was written in?
This problem was solved by asking web designers to place some special text at the top of a web page’s HTML code. This special piece of text lets the web browser know which version of HTML the web page is written in. This bit of text is called, in technical terms, a document type declaration, because it is declaring which type (or version) of HTML is being used.
If a tag is found in a particular HTML version, then it is considered to be a “valid” HTML tag (for that version). Each time HTML is upgraded to a new version, the list of “valid” tags for that version is compiled together in a document called a document type definition (or DTD). It has this name because the DTD defines which tags are valid for the type (or version) of HTML being used.
So that special bit of text that you sometimes see at the top of HTML code is a document type declaration, which is letting the web browser know which version of HTML the webpage is written in. In the example at the top of this tutorial, it's XHTML 1.0 Transitional, one of the latest versions of HTML.
You will also notice a link to the document type definition for XHTML 1.0 Transitional (a list of all valid tags for this version of HTML). It is the web address on the second line of the declaration.
You can even type that web address into your browser and view the DTD, but please note that figuring out how to properly read it would require a tutorial of its own!
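To show how the pieces of the declaration fit together, here is the same DOCTYPE again, followed by a comment labelling each part. The labels are informal shorthand, a rough guide rather than the official wording of the specification.

```html
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<!--
  DOCTYPE  : marks this as a document type declaration
  html     : the outermost tag of the page (the <html> tag)
  PUBLIC   : the DTD is a publicly available one
  "-//W3C//DTD XHTML 1.0 Transitional//EN"
           : the name of the DTD (published by the W3C, for the version
             XHTML 1.0 Transitional; EN means the DTD is written in English)
  "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd"
           : the web address where the DTD itself can be found
-->
```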
Additional tips:
* DOCTYPE is short for 'Document Type'.
NEXT: Part 2 of the 'What is a DOCTYPE?' tutorial.
---
Please comment on this tutorial, and let me know if there's any way I could have improved it. Most importantly, what could I have done to make it easier to understand?
Five simple rules that online tutorials should follow
Although this blog (for now) is going to focus only on web development tutorials, I believe that the qualities that make a tutorial easy to understand are universal. So here we go:
5 Simple Rules that every person who writes online tutorials should follow:
*Always define your audience, and write only with them in mind
Decide who you are writing the tutorial for, and any required skills/knowledge they should already have. When deciding on this, remember that you want your tutorial to appeal to as many people as possible.
*Do not introduce any new concept or idea unless you are certain that the target audience knows precisely what you are talking about
This is the number one failing of almost every online tutorial. You're reading a tutorial on an introduction to HTML, and you're seeing terms like "web standards" and "cross browser support" without any useful idea what these terms actually mean.
*Don't use words that some people may not know the exact meaning of
This refers to words in the language that the tutorial is being written in (in this case, English). For example, don't use words like "ambiguous" or "ubiquitous". Instead use "confusing" and "everywhere".
*Don't use any unnecessary words or sentences
Keep sentences as short as possible. Eliminate words that don't need to be there.
*Make it clear and obvious what problem is being solved by the new skill that is being taught
Another major common failing of online tutorials is that they begin explaining a concept without giving the reader a clear understanding of what problem it is solving. Every new technology and innovation in the history of mankind was invented to solve a problem. Before teaching the new concept, make sure the reader has a clear and practical idea (use examples) of what the problem is that is being solved. This will almost certainly involve explaining some history or evolution of the concept.
Coming up in my next post - my attempt at writing the beginning of a tutorial that will follow these rules. The tutorial will be on HTML DOCTYPEs, given that I touched on them in my previous post. The intended audience will be those with a basic understanding of HTML and CSS.
Can you help me improve this list of rules that every online tutorial should follow? Please make suggestions in the Comments section of this post.
Most online tutorials are very poor quality
I work in web development, and am often having to learn about new web technologies. Generally the first thing I do when I want to learn something new is Google for some online tutorials. Over the time I have spent doing this, one thing has become very apparent: Experts in a given field are not necessarily experts at teaching it.
How many times have you started reading a tutorial, and it begins introducing new concepts without explaining them? Or it uses words and terms that you don't understand?
I think it's obvious what the main problem is. The authors of the tutorials assume that the reader knows a lot more about the topic than they actually do.
Some time ago, I wanted to learn about DOCTYPEs (in relation to web sites). If you view the HTML source of a webpage, look right at the top and you'll see something like:
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
What is it? Googling around, I found out it had something to do with "XHTML" and "web standards". But what actually is it? What is it doing there? Why is it there? What problem is it solving? Don't bother trying to get your answer from an online tutorial about DOCTYPEs.
Let's take the opening paragraph from this online tutorial about what DOCTYPEs are (from ZVON):
"An XML document is valid if it has an associated document type definition and if the document complies with the constraints expressed in it. The document type definition must appear before the first element in the document. The name following the word DOCTYPE in the document type definition must match the name of root element."
WHAT?!
How would someone who knows nothing about DOCTYPEs read the paragraph above? Probably something like:
What is an XML document? What do you mean "is valid"? I don't know what "document type definition" is, let alone what an "associated" one is. As for the last sentence... no clue.
I should of course be gentle. That sentence obviously made perfect sense to the author who wrote it, because he/she would understand all of the terms used. Perhaps the tutorial is aimed at people who have an existing thorough knowledge of related areas. But to a novice trying to learn about DOCTYPEs from that "explanation", it would be of no help at all.
In my next post, I'll list a simple set of rules that I think every online tutorial should follow (but almost none do). I hope this website/blog can be a reliable place for novices to get quality tutorials, examples of good tutorials, and links to other sites carrying good tutorials!