Saturday, January 26, 2008

Left Joins in SQL - TUTORIAL

Audience: Those with a basic understanding of the SQL language, and have a reasonable understanding of how to use the WHERE clause.

I'm sure most of you familiar with SQL have all seen the term "left join" around the place. Up until now, you have probably gotten away with never having to use it.

I recently had to learn about left joins, because a simple query I was using was not returning a complete set of records, as I had intended it to.

As with all innovations, the "left join" exists to solve a problem. I'll highlight the problem a left join can help us solve, using an example.

Let's say you have a database that stores people's names, and their favourite colours. The database could have two tables, and look like this:

Names (table)



Colours (table)


As you can see, the two tables are linked on the fave_colour column in the Names table, which relates to a colour_id in the Colours table.

So, let's say we want to run a query that will retrieve everyone's name, and their favourite colour. You might think the proper way to do this would be easy:

SELECT name, colour FROM names, colours WHERE fave_colour = colour_id

The above query would return the following results:



Looks ok.. but hang on. There are some names missing. Have a look at the Names table again, and you will see a "Simon" and a "Carol". Why haven't they showed up in the query result?

To get an answer, let's take another look at the query we used:

SELECT name, colour FROM names, colours WHERE fave_colour = colour_id

This query is asking to retrieve all names and colours "where the value in the fave_colour column matches a value in the colour_id column".

Have a look at the fave_colour column in the Names table, next to Simon and Carol. You will see that Simon and Carol have a fave_colour value of 4.

Have a look at the Colours table. There is no colour_id of 4. Hence, if we run a query that asks only to return records where the fave_colour column matches a colour_id column, there will be no match for Simon and Carol! Hence, Simon and Carol's records will not be returned.

The way to solve this problem is to slightly change our query. Here, again, is the query we started with:

SELECT name, colour FROM names, colours WHERE fave_colour = colour_id

We need to make two changes to this query. For the first change, I am going to remove the comma between names and colours, and put the words LEFT JOIN in its place. For the second change, I am going to remove the word WHERE, and put the word ON in its place. Here's what the new query would look like:

SELECT name, colour FROM names LEFT JOIN colours ON fave_colour = colour_id

So, what do these changes do?

In the new query, you will notice on each side of the words LEFT JOIN are the names of two tables. In this case, names is on the immediate left, and colours is on the immediate right. The table on the left hand side of LEFT JOIN will have all of its selected records returned, regardless if there is a match in the table on the right hand side. Then, just like a WHERE clause, we need to say which two columns we want to look for a match on. When using LEFT JOIN, we just replace the word WHERE with ON.

So running the new query will give us the following results:

SELECT name, colour FROM names LEFT JOIN colours ON fave_colour = colour_id



There we go. All relevant records from the Names table have now been returned, along with the favourite colour (if properly recorded).

Thursday, January 17, 2008

HTML charset (character sets and character encoding) - TUTORIAL

Audience: Those with a basic understanding of HTML.

If you view the HTML source code of a web page, you may see the following bit of text (or something very similar), located somewhere between the <head> and </head> tags:

<meta http-equiv="Content-type" content="text/html; charset=utf-8" />

A tag that starts with <meta, like the one above, doesn't actually contain anything that will form part of the content of a web page ("content" being the stuff that appears in the web browser). Instead, "meta" tags contain information that is used for other purposes. Some meta tags, for example, contain technical information that gets used by the web browser. Other types of meta tags contain information that might be useful for a search engine.

In this tutorial, we are focusing on the meta tag shown earlier (the one that begins with <meta http-equiv="Content-type"..). This is a meta tag that contains useful information that a web browser can use. You will notice part of this meta tag contains the text charset=utf-8. If you look at the HTML source code of different web pages, you may also see charset=ISO-8859-1. These two are the most common, but you may see others.

This tutorial will attempt to give you an understanding about what this all means, and why it's important you include it in the HTML code of your web page.

If you don't correctly use this charset stuff in your HTML code, then there is a good chance that the words, numbers or symbols used on your webpage could look different, or even change, when different people are viewing your website. This is especially true for visitors to a web page who are in different parts of the world. For example, a sentence you've written as "That's cool!" could change in to "That?s coolA'" when someone is looking at it in their web browser. You may have seen similar things like this happen when someone sends you an email from another country. You may have seen it happen when you copy and paste text from one place to another. Something's obviously going wrong, somewhere, for this to happen.

To understand why this happens, and to make sure it doesn't happen to your web page, you first need to understand how computers store information.

And to make it easier to understand exactly how computers store information, I'll give an example.

Let's say that Mark wants to write a message, on a piece of paper, to Larry. However, Mark is only allowed to write the numbers 1 and 0 in his message. Mark can not use any letters of the alphabet, any numbers other than 1 or 0, or any other symbols. Could Mark still write a meaningful message to Larry?

Yes, he could. Mark could create a type of "code" that converts the 1s and 0s in to meaningful letters and symbols. For example, Mark could come up with a list of all the letters and symbols he might like to use in his message, and assign all of these "characters" to a sequence of 1s and 0s. A sample of this list might be:

H = 00100001
E = 11000110
L = 01001100
O = 11100010
! = 10011110

Using this code, Mark could write "Hello!" to Larry by writing down 1s and 0s in the following way:

00100001 = h
11000110 = e
01001100 = l
01001100 = l
11100010 = o
10011110 = !

What Mark has done here is "encoded" his message using his special list of characters and codes. His list contains a set of characters (in this case, all the letters of the english alphabet, plus the exclamation mark), and each character has been assigned a "code" (a string of 1s and 0s). Mark could give his list of characters and codes a name: "Mark's super character list". In technical terms, "Mark's super character list" is a "character set", and can also be known as a "character repertoire".

Larry can decode the message, as long as he knows which character set Mark used to encode the message.

By grouping the 1s and 0s together to represent different characters, Mark can still write his message.

Well, guess what? This is exactly how computers store, and even send, information - as strings of 1s and 0s. In computer terms, these 1s and 0s are known as "binary digits", or "bits" for short.

It's therefore also how a computer saves a web page. Every time a web designer, who is working on a web page, clicks "save" in whatever program they are using to create their web page, the computer will save the web page file as a very long string of 1s and 0s. The way it does this is to take each character that appears in the HTML code of the web page, and assign it to a specific sequence of 1s and 0s - exactly the same way that Mark encoded his message to Larry. So for example, it might encode each opening tag "<" as 10110100, and each letter "p" as 01011100. The computer will convert each character of HTML in to 1s and 0s by using a specific character set - just like the character set Mark created. In fact, in the world of computers, there are hundreds of different character sets. Some contain many thousands of characters, some contain only a few. Different character sets sometimes encode the same symbol in a different way. For example, The sequence of 1s and 0s that is used to represent the letter 'e' in one character set, could be used to represent the '@' symbol in another character set.

So although you may not realise it, every time you have "saved" a web page that you've been working on, your computer has chosen a character set, and encoded your web page in to 1s and 0s. How can you tell which character set your web page is encoded with when you save it? Well, this depends on which software you are using to build your web page. In the text editor that I use, Crimson Editor (version 3.70), I can select which character set I'd like to use by clicking 'Document' from the main menu, and then clicking 'Encoding type' (see below).



If you are using different software to build your web page, just click around for anything that says "encoding type" or "charset", you should be able to find it.

With the software I'm using, there's only about 5 different character encoding options. (You may notice I have two UTF-8 options; "with BOM" and "without BOM". If you encounter this with your own HTML editor, just choose "without BOM").

UTF-8 is a good character set, and is recommended by many as a good one to use. Unless your webpage is going to include letters or symbols that are very rare, UTF-8 should do the job for you. This is because it contains over 10,000 characters, from almost every language in the world. Examples of websites that use this character set include Facebook and Yahoo. If you're looking for a character set to use for your web page, you should try UTF-8 first.

So, once you know which character set your web page has been encoded with, you then need to put that information in to your HTML code, so that a web browser can find it. How do you do that? First, just copy and paste the following meta tag in to your HTML code (you should put it somewhere between the <head> and </head> tags).

<meta http-equiv="Content-type" content="text/html; charset=utf-8" />

Then, make sure that you change utf-8 to whichever character set you saved your web page with when you were building it!

One common misunderstanding about this charset stuff, is that some people think you can change the character set of your web page, just by changing what's written in the meta tag.

NO!

The character set of your web page is determined when you save your web page to your computer when you are building it. It is up to you to make sure that you write the correct charset information in to your HTML code.

You should hopefully now know which character set was used to encode your web page. So, let's say you upload your web page to the internet, and then someone else comes along and views your web page. Their web browser will begin downloading the long stream of 1s and 0s that your web page has been encoded with, and once it's done that, will begin to decode those 1s and 0s, in to readable HTML code. In order for the web browser to know which character set to use for the decoding process, it will look for a <meta http-equiv="Content-type" tag, and then check to see what is written next to charset=. If it says charset=utf-8, the web browser will decode the 1s and 0s using the UTF-8 character set.

So what happens if you don't include this meta tag in the HTML code of your web page, and hence, don't include any charset information?

Well, the web browser is then forced to guess which character set has been used. The way a web browser makes this guess will vary from browser to browser.

The other thing that could happen, is if you put this meta tag in to your HTML code, but specify the wrong character set (by writing the wrong thing next to charset=), then the web browser could start decoding the web page using the wrong character set!

This explains why you sometimes might see strange things appear on a website, such as "That?s coolA`" when you can probably tell it should say "That's cool!". The web browser displaying the page was either given incorrect charset information in the meta tag, or, it wasn't given any charset information and was forced to make a guess. The web browser has decoded the 1s and 0s using the wrong character set, and that's what causes the weird "That?s coolA`" stuff to appear on a web page.

As long as you include correct charset information in the appropriate <meta tag of your HTML code, then the web browser should have no problems decoding each character correctly, and displaying the web page correctly.

One last thing:
Even though you might put the wrong character set information in to your HTML code, a smart web browser will sometimes be able to correct your mistake, and display a web page properly. The reason it can do this is because when a web browser gets sent a web page from a server, the web browser actually receives more information than what you can see in the HTML code. In fact, before a web server starts sending the HTML, it sends this other thing called a HTTP header. A HTTP header contains technical information that the web browser can use. Some web servers will send the name of the character set for the web page in this HTTP header, and if it does, your web browser will most likely ignore any charset information that appears in the HTML code, and just use whichever character set was specified in the HTTP header. However, not every web server will do this, so the safest thing is to make sure that the correct charset information is included in your HTML code.

Wednesday, January 9, 2008

What is a DOCTYPE? Part 2 of 2 - TUTORIAL

Audience: Those with at least a basic understanding of HTML and CSS, and have read or understand part 1 of this tutorial.

In part one of this tutorial, we learned that the strange bit of text that appears at the top of the HTML code of some web pages (see below) is a "DOCTYPE declaration".

Sample DOCTYPE declaration:
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">

We also learned its purpose: to inform the web browser which version of HTML a particular web page is written in.

A question you may ask is, "But I've seen plenty of web pages that don't have a DOCTYPE declaration at the top of the HTML code. Why do those pages still seem to work ok?".

There are indeed many web pages on the internet with no DOCTYPE declaration. In fact, even www.google.com doesn't have one! (At least as of January 9, 2008). Yet the page still seems to display properly in any web browser.

The reason why pages like www.google.com will still load in a web browser, despite not having a DOCTYPE declaration, is because web browsers are very forgiving of badly written HTML. What I mean by "badly written" is any HTML code that does not properly comply with the rules in which HTML is supposed to be written. These rules are specified in the "Doctype Definition" (DTD) of a given version of HTML (if you are unsure what this means, please refer to part 1 of this tutorial series).

So, even if a web page does not tell the web browser which version of HTML it is supposed to be written in, a web browser will still attempt to display it. This is also known as a web browser entering "quirks mode" (as opposed to being in "standards mode", which just means that the web browser knows which version of HTML is being used). Quirks mode isn't good because there is no clearly defined set of rules for how a web browser is supposed to interpret HTML when in quirks mode. This means a web page with no DOCTYPE declaration can look very different in one web browser compared to another. Unfortunately, using a DOCTYPE declaration still won't guarantee a web page to look the same in different web browsers. One reason for this is that web browsers do not interpret the same version of HTML in exactly the same way (there is no good reason for this, that's just the way it is). However, it is generally considered to be true that using a proper DOCTYPE declaration will make it easier to build a web page that will look the same (or close enough) in any web browser (also known as a web page having "cross browser support"). The reason for this is that with a proper DOCTYPE declaration, different web browsers are at least trying to interpret your HTML code in the same way.

I will now go through the available HTML DOCTYPEs:

(Note: HTML versions 2 and 3 both have DOCTYPEs. But these versions of HTML are so old, they are not worth focusing on).

The earliest version of HTML worth worrying about is HTML 4.01, which comes in 3 types:
  • HTML 4.01 Strict
  • HTML 4.01 Transitional
  • HTML 4.01 Frameset
It would take up too much time to list every single difference between these versions. However the general differences are as follows:

HTML 4.01 Strict:
"Style" elements are not allowed to appear within the HTML code of Strict web pages. Style elements are things like colors, font sizes and images. These should all be reserved for the CSS (Cascading Style Sheet).

HTML 4.01 Transitional:
Some style elements are allowed in the HTML code of a HTML 4.01 Transitional web page, but the FRAMESET tag is not allowed.

HTML 4.01 Framset:
Very similar to Transitional, except that the FRAMSET tag is allowed to be used.

You may be wondering, "Why are there 3 different types of the same version?". The answer is that the people in charge of releasing new versions of HTML (the W3C) understand how difficult it is to build a web page that will work properly in every web browser. It is therefore considered "ideal" to use the Strict DOCTYPE, but the reality is that you may find it impossible to get your web page working in all web browsers if you're not allowed to use style tags in your HTML code. The goal trying to be reached is to have all style elements separated in to the CSS, with no style elements at all in the HTML code.

(For a more complete list of what is and isn't allowed with Strict and Transitional web pages, see Roger Johansson's article 'Transitional vs. Strict Markup'. You can just scroll straight to the section titled 'Elements that are not allowed in Strict DOCTYPEs', and start from there).

Another important version of HTML is XHTML 1.0. This version of HTML also appears in the same 3 types that HTML 4.01 did:
  • XHTML 1.0 Strict
  • XHTML 1.0 Transitional
  • XHTML 1.0 Frameset
The difference between the Strict, Transitional and Framset versions of XHTML 1.0 are pretty much the same as the differences between them in HTML 4.01. The important thing here is the difference between HTML 4.01 (all versions), and XHTML 1.0 (all versions).

The major difference between HTML and XHTML is not what tags are or aren't allowed. The difference lies in how the HTML code should be written. XHTML code must be written according to some very specific rules. For example:

Some XHTML rules:
  • All tags must to be written in lowercase (eg, <h1>, <p>, and not <H1> or <P>).
  • All tags that are opened must be closed (eg, <h1> must have a corresponding </h1>)
  • If more than one tag is open at the same time, they must be closed in the reverse order they were opened. For example:
Good:
<p><i>Watch as I close the italics tag before the paragraph tag, just as XHTML requires me to</i></p>

BAD:
<p><i>This is bad! I am closing the p tag before i tag!</p></i>

(For a more complete list of the rules you must follow when writing XHTML, see Linda Roeder's article 'Basics of XHTML - Why, What and How'. The list of rules starts at about the third paragraph).

As you might be able to guess, one major reason that XHTML was invented is to try and get everyone to write their HTML code in the same way. This would have major benefits for web browsers (in their attempts to display web pages properly) and for web designers (whose code will now look very similar, if not exactly the same, as any other web designer).

The last version of HTML that you need to worry about is XHTML 1.1. This version of HTML does not come in 3 types as you have seen previously, it's just plain old XHTML 1.1. This version is very similar to XHTML 1.0 Strict. It must also follow the same rules about how XHTML code is to be written as the other XHTML versions (lowercase tags, etc). So what's the difference between XHTML1.0 Strict and XHTML1.1? Unfortunately, this question can't really be answered if you don't know what XML is (that's not a typo. XML is not the same thing as XHTML). But don't worry. At this stage, XHTML 1.0 is still more popular than XHTML 1.1, so just stick with one of the earlier versions for now.

Below I have again listed all of the HTML versions that I have talked about, along with their DOCTYPE declarations. You might have seen elsewhere that you can just "copy and paste" these bits of text in to the top of your own HTML code - and indeed, there is nothing wrong with that.
In fact, it's probably better to do so, because if you get just one character wrong in your DOCTYPE declaration, it will probably throw your web browser in to "quirks mode", possibly without you even realising. So, once you have decided which version of HTML you want to write your web page in, just copy and paste the relevant DOCTYPE as it appears below. I have listed each DOCTYPE under a bolded heading - the heading is NOT part of the DOCTYPE declaration and should NOT be copied and pasted in to your HTML code! (Please note, when copying and pasting these DOCTYPE declarations, they should be the first thing that appear in your HTML code).

HTML 4.01 Strict
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN"
"http://www.w3.org/TR/html4/strict.dtd">

HTML 4.01 Transitional
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"
"http://www.w3.org/TR/html4/loose.dtd">

HTML 4.01 Frameset
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Frameset//EN"
"http://www.w3.org/TR/html4/frameset.dtd">

XHTML 1.0 Strict
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">

XHTML 1.0 Transitional
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">

XHTML 1.0 Framset
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Frameset//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-frameset.dtd">

XHTML 1.1
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.1//EN"
"http://www.w3.org/TR/xhtml11/DTD/xhtml11.dtd">

The W3C have come up with a great way of testing whether a web page has been written properly, as per the rules of its DOCTYPE. It's called the W3C markup validator, and you can use it to test your own web pages (or even any web page that is on the internet).

The people responsible for HTML are trying to achieve the goal of having one HTML version that all web browsers will rely on, and as long as you build your web page in that version, you shouldn't encounter any "cross browser support" issues. Unfortunately that day is probably still some time away. But the idea of encouraging web designers to start using DOCTYPEs and to follow the respective rules is so that one day, a common standard can be reached.

---

Please comment on this tutorial, and let me know if there's any way I could have improved it. Most importantly, what could I have done to make it easier to understand?

What topics would you like to see a tutorial on?

I'm open to suggestions for the topic of my next tutorial. The areas that I have at least some knowledge in are:

HTML and XHTML
CSS
PHP
MySQL

Are there any areas within these fields that you just don't quite get, and would like to see a tutorial on? If so, leave a comment in the "Comments" section below. In the comments section, ignore the bit that says "Sign in using Blogger/Google", just scroll down and choose one of the other options; (in other words, there is no need to sign in to leave a comment!).