I am an XML disciple

Everyone knows about HTML; it’s possibly the most popular language on the internet. It’s utterly ubiquitous in its domination of web-based documents, and for some very good reasons too. However, if you haven’t heard already, I’m afraid to inform you that it’s dead. And you know what? I’m glad it is! HTML was flawed from its beginnings at CERN, when it was still just a twinkle in Tim Berners-Lee’s eye.

My God what were you thinking?!

First things first: it shouldn’t be called HTML at all. Initially it wasn’t, but it was decided that later versions of HTML (from version two and up) would become an application of SGML. This is where it gets its naming problems: it should not have the ML suffix, as it isn’t a markup language! It’s an application of SGML, which means it should’ve been named along the lines of RSS (which is an application of XML, but we’ll get onto that later).

Of course, if everyone had realised the fault with HTML and then named all their other applications of SGML differently, it would be fine. But the ML suffix has created a whole bucketload of confusion. For example, while applications such as RSS, Atom and SOAP have gotten the right idea, all too commonly you see incorrectly named applications such as MathML. What MathML should be called is XMath. Prefixing X onto the beginning of your application name seems to be quite popular at the moment, and it’s perfectly correct!

It should be noted that I’m not bashing Tim Berners-Lee here; he’s openly admitted to his mistake in calling it HTML. I’m just trying to highlight some little-known mistakes.

SGML and all that rubbish

The people who created SGML had good intentions. They wanted an open metalanguage in which people could define markup languages for documents, and that’s what they created. However, they made a few minor mistakes, and one major mistake that ensured SGML had no killer application until HTML, years after SGML’s release.

You probably know that one of the most obvious differences between HTML and XHTML is that in HTML, some elements’ end tags can be omitted. For example, the following code is perfectly fine HTML, even though I’ve left out several closing tags.

<p>This is a paragraph
<p>This is another paragraph

<ul> <li>This is a list item <li>This is another list item </ul>

Apart from the closing </ul> tag, none of the other end tags are required. You might say that it’s obvious where an element ends (just before the beginning of the next element), so there is no need for a closing tag on every element. However, there are situations where ambiguity arises, and what happens then? Your SGML parser is screwed! Actually, you should technically never fall into this category if you follow the SGML bible; it specifies that you can omit end tags as long as it doesn’t lead to ambiguity. I’m sorry, but that is just fucking ridiculous! Any benefit of saving storage space is completely killed when you try to write an SGML parser. Anyone who has written a compiler before will tell you that optionally omittable end tags make writing a parser a horribly complicated task. I won’t go into the details, but that is one of the main reasons why SGML parsers still cost over five figures in the late eighties.
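To give a rough feel for the extra work involved, here is a minimal sketch in Python that infers the missing end tags for the snippet above. It leans on the standard library’s html.parser purely as a tokenizer. The IMPLICITLY_CLOSES table and the EndTagInferrer class are my own hypothetical, heavily simplified stand-ins; a real SGML parser would have to derive rules like these from the DTD for every element, which is where the complexity comes from.

```python
from html.parser import HTMLParser

SNIPPET = """<p>This is a paragraph
<p>This is another paragraph

<ul> <li>This is a list item <li>This is another list item </ul>"""

# Hypothetical, heavily simplified rules (an assumption for this sketch):
# opening the KEY element implicitly closes an open element listed in the VALUE.
IMPLICITLY_CLOSES = {
    "p": {"p"},
    "li": {"li"},
    "ul": {"p"},
}


class EndTagInferrer(HTMLParser):
    """Records every tag, including the end tags the author left out."""

    def __init__(self):
        super().__init__()
        self.stack = []   # elements that are currently open
        self.events = []  # full tag sequence, with inferred end tags

    def _close_top(self):
        self.events.append("</%s>" % self.stack.pop())

    def handle_starttag(self, tag, attrs):
        # Close whatever the new element is not allowed to sit inside.
        while self.stack and self.stack[-1] in IMPLICITLY_CLOSES.get(tag, ()):
            self._close_top()
        self.stack.append(tag)
        self.events.append("<%s>" % tag)

    def handle_endtag(self, tag):
        if tag not in self.stack:
            return  # stray end tag: ignored in this sketch
        # An explicit end tag also closes everything still open inside it.
        while self.stack[-1] != tag:
            self._close_top()
        self._close_top()

    def close(self):
        super().close()
        # End of input closes anything that is still open.
        while self.stack:
            self._close_top()


def infer_end_tags(markup):
    parser = EndTagInferrer()
    parser.feed(markup)
    parser.close()
    return parser.events


print(" ".join(infer_end_tags(SNIPPET)))
# <p> </p> <p> </p> <ul> <li> </li> <li> </li> </ul>
```

Even this toy version needs a table of per-element rules and a stack, and it covers only three elements and none of the ambiguous cases.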

Come on XML

Something obviously needed to be done to fix the mess of SGML and HTML. Browsers were starting to create their own proprietary HTML elements, such as Netscape’s <blink> element which, as the name suggests, repeatedly made whatever was inside it flicker on and off. The Internet Explorer team at Microsoft disagreed with Netscape about <blink> and refused to implement it in IE. Fortunately, it’s not in any of the official W3C specifications.

XML is the solution to the problems and arguments between SGML purists and the browser vendors. You see, browsers aren’t technically SGML parsers. They are far more lenient towards badly formed markup, and they even try to correct those mistakes. For example, the following code snippet is badly formed HTML markup, yet every major browser will be able to correct the nesting problem and display it correctly.

<p><b>Some bold text</p></b>

If the browser vendors followed the SGML purists’ route and turned their browsers into fully fledged SGML parsers, then ninety per cent of all websites in the world wouldn’t display. Instead, you’d get an error message complaining about badly formed markup. Well, at least the sites that did work would fully comply with the SGML specifications, and there are endless possibilities with error messages!
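You can get a taste of that strictness today. The sketch below feeds the badly nested snippet to Python’s xml.etree.ElementTree, which I’m using as a stand-in for a strict parser (it is an XML parser, not an SGML one, so treat it as an illustration only). The helper name strict_parse_error is my own invention.

```python
import xml.etree.ElementTree as ET


def strict_parse_error(markup):
    """Return None if the markup parses cleanly, else the parser's error message."""
    try:
        ET.fromstring(markup)
    except ET.ParseError as exc:
        return str(exc)
    return None


misnested = "<p><b>Some bold text</p></b>"
corrected = "<p><b>Some bold text</b></p>"

print(strict_parse_error(misnested))  # an error message instead of a document
print(strict_parse_error(corrected))  # None
```

No attempt at recovery, no guessing what the author meant: the strict parser just gives up at the first badly nested tag.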

Of course, the SGML people were never going to agree with the browser vendors’ viewpoint, and vice versa. So, together with Tim B, they devised a new standard, XML. XML is essentially an optimised subset of SGML. Most importantly, all elements must be explicitly closed, either by having a self-closing tag or a separate closing tag. XML also made some other minor improvements, such as better support for internationalisation.
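As a quick illustration of the closing rule, the following sketch (again using Python’s xml.etree.ElementTree, with a made-up helper called canonical) checks that the two ways of closing an element produce the same tree, while an element left open is rejected outright.

```python
import xml.etree.ElementTree as ET


def canonical(markup):
    """Parse the markup as XML and serialise it again, so equal trees compare equal."""
    return ET.tostring(ET.fromstring(markup))


self_closing = "<p>line one<br/>line two</p>"
separate_close = "<p>line one<br></br>line two</p>"
left_open = "<p>line one<br>line two</p>"

# Both closing styles describe exactly the same document.
print(canonical(self_closing) == canonical(separate_close))  # True

# Leaving the element open is an error, not something to be guessed at.
try:
    canonical(left_open)
    rejected = False
except ET.ParseError:
    rejected = True
print(rejected)  # True
```

Because every element is closed one way or the other, the parser never needs per-element rules to work out where an element ends.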

In contrast to SGML, XML currently has many applications: RSS, Atom, XHTML and SOAP, just to name a few. In the next article, I will be bringing this short series of articles about markup languages to a close by talking about proper usage of XHTML with XML. Stay tuned!