I am an XML disciple

Everyone knows about HTML; it’s possibly the most popular language on the internet. It’s utterly ubiquitous in its domination of web-based documents, and for some very good reasons too. However, if you haven’t heard already, I’m afraid to inform you that it’s dead. And you know what? I’m glad it is! HTML was flawed from its beginnings at CERN, when it was still just a twinkle in Tim Berners-Lee’s eye.

My God what were you thinking?!

First things first: it shouldn’t be called HTML at all. Initially it wasn’t, but it was later decided that HTML (from version two onwards) would become an application of SGML. This is where it gets its naming problems: it should not have the ML suffix, as it isn’t a markup language! It’s an application of SGML, which means it should’ve been named along the lines of RSS (which is an application of XML, but we’ll get onto that later).

Of course, if everyone had realised the fault with HTML and named all their other applications of SGML differently, it would be fine. But using the ML suffix has created a whole bucket load of confusion. For example, while applications such as RSS, Atom and SOAP have gotten the right idea, all too commonly you see incorrectly named applications such as MathML. What MathML should be called is XMath. Prefixing X onto the beginning of your application name seems to be quite popular at the moment, and it’s perfectly correct!

It should be noted that I’m not bashing Tim Berners-Lee here; he’s openly admitted to his mistake in calling it HTML. I’m just trying to highlight some little-known mistakes.

SGML and all that rubbish

The people who created SGML had good intentions. They wanted an open metalanguage in which people could define markup languages for documents, and that’s what they created. However, they made a few minor mistakes, and one major mistake that ensured SGML had no killer application until HTML, years after SGML’s release.

You probably know that one of the most obvious differences between HTML and XHTML is that in HTML, some elements can have omittable end tags. For example, the following code is perfectly fine HTML, even though I’ve missed several closing tags.

<p>This is a paragraph
<p>This is another paragraph

<ul> <li>This is a list item <li>This is another list item </ul>

Apart from the closing </ul> tag, none of the other end tags is required. You might say that it’s obvious where an element ends (just before the beginning of the next element), so there is no need for a closing tag on every element. However, there are situations where ambiguity arises, and what happens then? Your SGML parser is screwed! Actually, you should technically never fall into this category if you follow the SGML bible; it specifies that you can omit end tags as long as it doesn’t lead to ambiguity. I’m sorry, but that is just fucking ridiculous! Any benefit of saving storage space is completely killed when you try to write an SGML parser. Anyone who has written a compiler before will tell you that having optionally omittable end tags makes writing a parser a horribly complicated task. I won’t go into the details, but that is one of the main reasons why SGML parsers still cost five figures in the late eighties.
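To make the problem concrete, here is a minimal sketch in Python using only the standard library. The names `TagLogger`, `html_events` and `is_well_formed_xml` are made up for this illustration. Python’s `html.parser` is a lenient tokenizer rather than an SGML parser, so it simply reports the tags it sees and leaves the job of working out where each <li> ends to whoever consumes the events, while an XML parser refuses the same markup outright.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser

# The list from the example above, with both </li> end tags omitted.
HTML_SNIPPET = "<ul><li>This is a list item<li>This is another list item</ul>"


class TagLogger(HTMLParser):
    """Hypothetical helper: records start and end tags in document order."""

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        self.events.append(("start", tag))

    def handle_endtag(self, tag):
        self.events.append(("end", tag))


def html_events(markup):
    """Tokenize markup leniently and return the tag events that were seen."""
    logger = TagLogger()
    logger.feed(markup)
    logger.close()
    return logger.events


def is_well_formed_xml(markup):
    """Return True only if a strict XML parser accepts the markup."""
    try:
        ET.fromstring(markup)
        return True
    except ET.ParseError:
        return False


if __name__ == "__main__":
    # No ("end", "li") event appears: the tokenizer never invents one.
    print(html_events(HTML_SNIPPET))
    print(is_well_formed_xml(HTML_SNIPPET))
```

The event list contains two <li> start tags and no <li> end tag at all, so any code that wants a proper tree has to carry the rules for implied end tags itself. That extra rule book is exactly the complexity being complained about here.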

Come on XML

Something obviously needed to be done to fix the mess of SGML and HTML. Browsers were starting to create their own proprietary HTML elements, such as Netscape’s <blink> element which, as the name suggests, repeatedly made whatever was inside it flicker on and off. The Internet Explorer team at Microsoft disagreed with Netscape about <blink> and refused to implement this element into IE. Fortunately, it’s not in any of the official W3C specifications.

XML is the solution to the problems and arguments between SGML purists and the browser vendors. You see, browsers aren’t technically SGML parsers. They are far more lenient towards badly formed markup, and they even try to correct those mistakes. For example, the following code snippet is badly formed HTML markup, yet every major browser will correct the nesting problem and display it correctly.

<p><b>Some bold text</p></b>

If the browser vendors followed the SGML purists’ route and turned their browsers into fully fledged SGML parsers, then ninety per cent of all websites in the world wouldn’t display. Instead, you’d get an error message complaining about badly formed markup. Well, at least the sites that did work would fully comply with the SGML specifications, and there are endless possibilities with error messages!
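For a taste of what that strict behaviour looks like, here is a small Python sketch. It uses the standard library’s XML parser as a stand-in for a strict parser (it is not an SGML parser), and `strict_check` is a made-up helper name.

```python
import xml.etree.ElementTree as ET

BADLY_NESTED = "<p><b>Some bold text</p></b>"
CORRECTLY_NESTED = "<p><b>Some bold text</b></p>"


def strict_check(markup):
    """Return None if the markup is well formed, else the parser's complaint."""
    try:
        ET.fromstring(markup)
    except ET.ParseError as error:
        # A strict parser stops at the first error instead of guessing a fix.
        return str(error)
    return None


if __name__ == "__main__":
    for markup in (BADLY_NESTED, CORRECTLY_NESTED):
        problem = strict_check(markup)
        print(markup, "->", problem or "fine")
```

The badly nested paragraph produces an error message and no document at all, which is what a visitor would see in place of the page if browsers behaved this way.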

Of course, the SGML people were never going to agree with the browser vendors’ viewpoint, and vice versa. So, together with Tim B, they devised a new standard, XML. XML is essentially an optimised subset of SGML. Most importantly, all elements must be explicitly closed, either by having a self-closing tag or a separate closing tag. XML also made some other minor improvements, such as better support for internationalisation.
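The closing rule is easy to check for yourself. The sketch below uses Python’s standard XML parser; the `summary` helper and the sample strings are made up for illustration. It shows that the two ways of closing an empty element are equivalent, and that leaving the element open is an error.

```python
import xml.etree.ElementTree as ET


def summary(element):
    """Hypothetical helper: reduce an element to its tag, attributes, text and child count."""
    return (element.tag, dict(element.attrib), element.text or "", len(element))


# Two spellings of the same empty element.
self_closed = ET.fromstring("<br />")
separately_closed = ET.fromstring("<br></br>")

# The same element left open, HTML style, inside a paragraph.
UNCLOSED = "<p>line one<br>line two</p>"

try:
    ET.fromstring(UNCLOSED)
    unclosed_accepted = True
except ET.ParseError:
    unclosed_accepted = False

if __name__ == "__main__":
    print(summary(self_closed) == summary(separately_closed))
    print(unclosed_accepted)
```

Both spellings give the parser the same empty element, so choosing between them is a matter of style rather than meaning, while the unclosed version is rejected.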

In contrast to SGML, XML currently has many applications: RSS, Atom, XHTML and SOAP, just to name a few. In the next article, I will be bringing this short series of articles about markup languages to a close by talking about proper usage of XHTML with XML. Stay tuned!

15 Comments

  1. Stevie December 7, 2005 at 11:01 pm

    I think if everything is going to be standardised with XML, then they should standardise the use of closing tags. I think that all tags should have separate closing tags.

    The use of XML also worries me: every tag would need defining I assume. So I guess that would mean websites would become bigger, not smaller. So that means mobile phone bills would go up… Fantastic for Vodafone, but bad news for us!

    Unless I have got the idea of this thread COMPLETELY wrong!

  2. Weiran Zhang December 7, 2005 at 11:59 pm
    “I think if everything is going to be standardised with XML, then they should standardise the use of closing tags. I think that all tags should have separate closing tags.”

    It is standardised, you must have an end tag. There are two ways of closing tags because, for example, you had this:

    <script type="text/javascript" src="javascript.js"></script>

    While it does work, it isn’t semantic coding. Having a separate closing tag gives the appearance that there should be something between the opening and closing tags, when there isn’t, and never will be. This is the correct way of writing it:

    <script type="text/javascript" src="javascript.js" />

    However, some browsers such as IE have problems interpreting that, so currently it’s best to stick with separate closing tags for the script element.

    “The use of XML also worries me: every tag would need defining I assume.”

    Well no, I never said in the article that we should be using XML to code our pages. While it is possible, XSLT (the styling language) is hard to use and there isn’t enough cross-browser support. What I’m advocating (and you’ll find this out in the next article) is the proper use of XHTML, if possible.

  3. Stevie December 8, 2005 at 12:24 am

    <ramblings class="mad">I would rather have a closing tag than a self-closing opening tag. It isn’t that bad, it’s like having an address book with no one in the Z entry… apart from you… bad example… Q then!

    And even after we all find out how to use XHTML properly, I reckon I could count every single website with the absolute correct usage on one hand! The Internet is one screwed place! And that is why, ladies and gentlemen, we have porn popups!

    </ramblings>

  4. Weiran Zhang December 8, 2005 at 1:05 am

    Yes, but it’s not semantically correct, is it? And that’s what XML is good at doing: describing a document very semantically.

  5. Stoyan December 8, 2005 at 4:56 am

    Somewhere I’ve read some stats - what amount of an average browser’s source code goes into handling weirdnesses like non-closed or improperly nested or just mistyped tags (or attributes or their values). I don’t remember the exact figure but it was something ridiculous - like 70%! What a waste! And all that just to be more forgiving and to say - hey, my browser is good because your page works in it.

    “The road to hell is paved with good intentions” :)

  6. Stevie December 8, 2005 at 3:19 pm

    Yeah, it would be better if browsers stopped saying "well we know roughly what you mean, hang on a minute and we’ll sort it for you!". If browsers only accepted correctly written code, it would put pressure on web developers to produce their work properly.

    Like that will ever happen!

  7. Stoyan December 8, 2005 at 5:19 pm

    Steve, you’re right. BTW, I’ve recently installed a Tidy extension for Firefox. It gives you warning and error icons at the bottom right corner and you can click for details. It’s alarming the amount of warnings I get from the sites I go to. Oh well.

  8. James December 8, 2005 at 7:59 pm
    “Like that will ever happen!”

    It kinda already does, most browsers have two rendering modes, “Quirks” and “Standards-compliance”. As you can tell by the titles, Quirks is a lot more forgiving with badly formed code, and Standards compliance tries to stick to the standard you define in the DOCTYPE declaration.

    For example, my site looked fine in quirks mode in IE, but in standards compliance mode it degraded less than gracefully ;)

  9. Stevie December 9, 2005 at 9:22 pm

    <ramblings class="mad">Maybe we should build our own browser and make the Internet a strict dictatorship regime… all websites look the same… perhaps even make all the subject matter the same: all hail our glorious leader… ME! (sorry, meant us…)</ramblings>

  10. Stevie December 10, 2005 at 1:27 am

    I accidentally posted twice earlier. Well, not really accidentally… I posted the second to correct the first. Then I posted a third time to get Waz to delete the third and the first posts. Now, the first remains, but the second and third have gone (well in Waz…!). So, SPOT THE MISTAKE! And it’s very apt!

  11. Weiran Zhang December 10, 2005 at 1:26 pm

    And you call yourself an XML disciple!?

  12. Stevie December 10, 2005 at 6:23 pm

    Hey! I spotted it! And it was you who deleted the correct code! Call yourself a computer scientist?!

  13. Weiran Zhang December 10, 2005 at 8:23 pm

    A proper disciple wouldn’t have made the mistake in the first place! I just thought you double posted.

  14. Stevie December 11, 2005 at 1:30 am

    I made a post telling you which posts to delete! And being a disciple yourself… I bet that not every page you have made was perfect first time!

    AND I bet you have never had to write phantom markup like: <ramblings class="mad">

  15. Weiran Zhang December 11, 2005 at 1:34 am

    Yeah, I only read the beginning!