XML Basics
XML is a markup language that allows data to be stored and transmitted in a structured, hierarchical manner. It has similarities in markup style to HTML, but whereas HTML has a fixed list of element definitions and is designed primarily to allow you to define how a document should be displayed, XML elements may be defined within a particular XML document to suit the data being described there.
In common with HTML, markup elements (normally referred to as tags) enclosed by < and > are used to annotate the contents of a text file, describing the information it contains.
Note
The similarities between XML and HTML are not accidental. Both are based on SGML (Standard Generalized Markup Language), a system for organizing the elements of a document. SGML was developed and standardized by the International Organization for Standardization (ISO).
Unlike the tags in HTML, though, whose definitions are fixed, XML tags can be defined to be anything you want, allowing you to describe virtually any kind of data. Consider this example of an XML document:
<race>
  <yacht raceNo='74'>
    <name>Wanderer</name>
    <skipper>Walter Jeffries</skipper>
    <helm>Sally Jacobs</helm>
  </yacht>
  <yacht raceNo='22'>
    <name>Free Spirit</name>
    <skipper>Jennifer Scully</skipper>
    <helm>Paul Thomas</helm>
  </yacht>
</race>
This short XML document describes a yacht race, including the two competing yachts and their respective personnel. Note how the tag names describe the data they contain, and how the tag structure is hierarchical. You may also notice that XML tags, like those of HTML, can have attributes. The end effect is that the XML file is quite readable; that is, the meaning of the data may be readily inferred by a human reader.
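To see how a program might read the same hierarchy, here is a minimal sketch using Python's standard xml.etree.ElementTree module. The names RACE_XML and list_yachts are chosen for illustration and are not part of the original example.

```python
import xml.etree.ElementTree as ET

# The yacht race document from the text, held in a string for convenience.
RACE_XML = """\
<race>
  <yacht raceNo='74'>
    <name>Wanderer</name>
    <skipper>Walter Jeffries</skipper>
    <helm>Sally Jacobs</helm>
  </yacht>
  <yacht raceNo='22'>
    <name>Free Spirit</name>
    <skipper>Jennifer Scully</skipper>
    <helm>Paul Thomas</helm>
  </yacht>
</race>
"""


def list_yachts(xml_text):
    """Return (raceNo, name, skipper, helm) for each <yacht> under the root."""
    root = ET.fromstring(xml_text)  # root is the <race> element
    return [
        (
            yacht.get("raceNo"),        # attribute value
            yacht.findtext("name"),     # text of the nested <name> element
            yacht.findtext("skipper"),
            yacht.findtext("helm"),
        )
        for yacht in root.findall("yacht")
    ]


yachts = list_yachts(RACE_XML)
for entry in yachts:
    print(entry)
```

The attribute is read with get(), whereas the nested elements are reached by tag name, mirroring the distinction between attributes and child elements in the markup.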
Caution
Unlike HTML, tag names in XML are case sensitive, so <yacht> and <Yacht> would be treated as two distinct elements.
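Case sensitivity is easy to check with a parser. The following sketch, again assuming Python's standard xml.etree.ElementTree, counts <yacht> and <Yacht> elements separately and shows that an opening tag cannot be closed by a tag that differs only in case.

```python
import xml.etree.ElementTree as ET

# <yacht> and <Yacht> are different element names as far as XML is concerned.
root = ET.fromstring("<race><yacht/><Yacht/><yacht/></race>")
lower = root.findall("yacht")
upper = root.findall("Yacht")
print(len(lower), len(upper))

# A closing tag must match its opening tag exactly, including case.
try:
    ET.fromstring("<yacht></Yacht>")
    mismatch_accepted = True
except ET.ParseError:
    mismatch_accepted = False
print(mismatch_accepted)
```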
Tip
XML uses the same syntax as HTML for comments. Any text beginning with the character string <!-- and ending with the string --> will be ignored: <!-- This is a comment -->
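As a quick illustration, the default parser in Python's xml.etree.ElementTree discards comments, so they never appear among an element's children. This is a sketch of that one parser's default behavior; other tools may choose to expose comments.

```python
import xml.etree.ElementTree as ET

doc = "<race><!-- This is a comment --><yacht raceNo='74'/></race>"
root = ET.fromstring(doc)

# Only the <yacht> element survives; the comment is dropped by the parser.
children = [child.tag for child in root]
print(children)
```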
XML Document Structure
The permitted structure of an XML document has only one mandatory element, the so-called document element. In the preceding yacht race example, this would be the <race> element.
Note
The document element need not necessarily have elements nested within it; the following is an allowable XML document: <competition>Farlington Summer Cup</competition>
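A parser treats this one-element document like any other. In the sketch below (Python's standard xml.etree.ElementTree), the document element simply has text and no children.

```python
import xml.etree.ElementTree as ET

root = ET.fromstring("<competition>Farlington Summer Cup</competition>")

print(root.tag)    # name of the document element
print(root.text)   # its character data
print(len(root))   # number of nested elements
```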
Document Prolog
Other information may be optionally included before the document element, forming the document's prolog. An example is the XML declaration:
<?xml version="1.0" ?>
Caution
If such a declaration exists, it must be the first thing in the document. Not even white space is allowed before it.
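The rule can be demonstrated by feeding a parser the same document with and without a leading space. This is a minimal sketch using Python's standard xml.etree.ElementTree; the helper name parses is hypothetical.

```python
import xml.etree.ElementTree as ET


def parses(xml_text):
    """Return True if the text parses as XML, False if the parser rejects it."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False


good = '<?xml version="1.0" ?><race/>'
bad = " " + good  # one space before the XML declaration

print(parses(good))
print(parses(bad))
```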
The prolog may also contain, in addition to various comments and processing instructions, a Document Type Declaration.
Document Type Declaration
The optional Document Type Declaration (often referred to as a DOCTYPE declaration) is a statement of the permitted structure of an XML document. It usually contains (or refers to another file that contains) information about the names of the elements in the document and the relationships between those elements.
Caution
Take care not to confuse the Document Type (DOCTYPE) Declaration with the Document Type Definition (DTD). The DTD comprises both the markup declarations contained in the DOCTYPE Declaration and those contained in any external file to which the DOCTYPE Declaration refers.
Let's look at an example Document Type Declaration for the yacht race document:
<!DOCTYPE race SYSTEM "race.dtd">
This declaration, which would appear in the document before the <race> element, specifies that the document element will be called <race> and that document structure definitions may be found in an external file, race.dtd, which would perhaps contain something like the following:
<!ELEMENT race (yacht+) >
<!ELEMENT yacht (name, skipper, helm) >
<!ATTLIST yacht raceNo CDATA #REQUIRED >
<!ELEMENT name (#PCDATA) >
<!ELEMENT skipper (#PCDATA) >
<!ELEMENT helm (#PCDATA) >
Alternatively, this information could be quoted in the DOCTYPE Declaration itself, placed between [ and ] characters:
<!DOCTYPE race [
<!ELEMENT race (yacht+) >
<!ATTLIST yacht raceNo CDATA #REQUIRED >
<!ELEMENT yacht (name, skipper, helm) >
<!ELEMENT name (#PCDATA) >
<!ELEMENT skipper (#PCDATA) >
<!ELEMENT helm (#PCDATA) >
]>
In either case we define five elements (namely race, yacht, name, skipper, and helm) and one attribute list.
Tip
DOCTYPE Declarations can contain both internal and external references, known as the internal and external subsets of the DTD.
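A program can inspect the DOCTYPE Declaration without validating against it. The sketch below assumes Python's standard xml.dom.minidom module, which exposes the declared document element name, the external file reference, and the text of the internal subset. The variable names are illustrative, and the file race.dtd is only named here, never opened.

```python
from xml.dom import minidom

# External subset: the DTD lives in a separate file named by a quoted literal.
EXTERNAL = '<!DOCTYPE race SYSTEM "race.dtd"><race/>'

# Internal subset: the declarations sit between [ and ].
INTERNAL = """<!DOCTYPE race [
<!ELEMENT race (yacht+) >
<!ELEMENT yacht (name, skipper, helm) >
<!ATTLIST yacht raceNo CDATA #REQUIRED >
]>
<race/>"""

external_dt = minidom.parseString(EXTERNAL).doctype
internal_dt = minidom.parseString(INTERNAL).doctype

print(external_dt.name, external_dt.systemId)
print(internal_dt.internalSubset)
```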
Element Declarations
The line
<!ELEMENT race (yacht+) >
declares that the <race> element will contain elements of type <yacht>; the + character indicates that there may be one or more occurrences of such <yacht> elements. Alternatively, we could use the character * to indicate any number of occurrences including zero, or the character ? to indicate zero or one occurrence. The absence of all of these characters indicates that there should be exactly one <yacht> element within <race>.
The <yacht> element is declared to contain three further elements, <name>, <skipper>, and <helm>, in that order. The #PCDATA term contained in the declarations for those elements stands for parsed character data and indicates that these elements may contain only character data, not further elements. Other possible content specifications include mixed content (text interspersed with named elements, declared in the form (#PCDATA | name)*), EMPTY (no content at all), and ANY (any mix of text and declared elements).
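The occurrence indicators behave much like the quantifiers of the same name in regular expressions. The following sketch uses that analogy to show which numbers of <yacht> children each indicator would permit. It is an illustration in Python's re module only, not how a DTD processor is implemented, and the helper allowed_counts is hypothetical.

```python
import re


def allowed_counts(indicator, max_count=3):
    """Return the child counts (0..max_count) that the indicator permits.

    Children are modeled as the string "yacht " repeated n times, and the
    content model (yacht<indicator>) as an equivalent regular expression.
    """
    pattern = "(yacht )" + indicator
    return [
        n for n in range(max_count + 1)
        if re.fullmatch(pattern, "yacht " * n) is not None
    ]


for indicator in ("", "+", "*", "?"):
    print(repr(indicator), allowed_counts(indicator))
```

With no indicator only a count of one matches; + admits one or more, * admits zero or more, and ? admits zero or one.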
Attribute List Declarations
Our example also contains the line
<!ATTLIST yacht raceNo CDATA #REQUIRED >
Such declarations are used to specify what attributes are permitted or required for any given element. In our example, we specify that the <yacht> element has an attribute called raceNo, the value of which is of type CDATA (character data). Note that, unlike #PCDATA, the CDATA keyword takes no # prefix.
The term #REQUIRED indicates that, in this example, the <yacht> element must have such an attribute. Other possibilities include #IMPLIED, specifying that such an attribute is optional; a value in quotation marks on its own, specifying a default value for the attribute should none be given in the XML document; or #FIXED followed by a value in quotation marks, fixing the value of the attribute to that quoted.
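Attribute defaults are visible to programs that read the document. In the sketch below, the division attribute and its default value "cruiser" are hypothetical additions made purely for illustration; the parser behind Python's standard xml.etree.ElementTree reads the internal subset and supplies the default where the attribute is omitted.

```python
import xml.etree.ElementTree as ET

# "division" is a made-up attribute with a declared default of "cruiser".
DOC = """<!DOCTYPE race [
<!ATTLIST yacht raceNo   CDATA #REQUIRED
                division CDATA "cruiser" >
]>
<race>
  <yacht raceNo="74"/>
  <yacht raceNo="22" division="dinghy"/>
</race>"""

root = ET.fromstring(DOC)

# The first yacht omits the attribute, so the declared default is reported;
# the second yacht supplies its own value, which takes precedence.
divisions = [yacht.get("division") for yacht in root.findall("yacht")]
print(divisions)
```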
Valid XML
If an XML document contains a DOCTYPE Declaration and complies fully with the declarations it contains, it is said to be a valid XML document.
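Not every parser checks validity. As a sketch, Python's standard xml.etree.ElementTree reads the document below without complaint even though it breaks two of its own declarations (the <yacht> element lacks the required raceNo attribute and the <skipper> and <helm> children), so a successful parse there should not be taken as proof that a document is valid. Checking validity would need a validating parser, which is outside the scope of this sketch.

```python
import xml.etree.ElementTree as ET

# Syntactically sound, but not valid against its own DTD.
NOT_VALID = """<!DOCTYPE race [
<!ELEMENT race (yacht+) >
<!ELEMENT yacht (name, skipper, helm) >
<!ATTLIST yacht raceNo CDATA #REQUIRED >
<!ELEMENT name (#PCDATA) >
<!ELEMENT skipper (#PCDATA) >
<!ELEMENT helm (#PCDATA) >
]>
<race>
  <yacht>
    <name>Wanderer</name>
  </yacht>
</race>"""

root = ET.fromstring(NOT_VALID)  # no error: this parser does not validate
yacht = root.find("yacht")

print(yacht.get("raceNo"))  # the required attribute is simply absent
print(yacht.find("helm"))   # as is the required <helm> child
```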