XML in a Nutshell

Third Edition, October 2004
ISBN 978-0-596-00764-5
712 pages
EUR37.00



Table of Contents

Chapter 1: Introducing XML
Content preview
XML, the Extensible Markup Language, is a W3C-endorsed standard for document markup. It defines a generic syntax used to mark up data with simple, human-readable tags. It provides a standard format for computer documents that is flexible enough to be customized for domains as diverse as web sites, electronic data interchange, vector graphics, genealogy, real estate listings, object serialization, remote procedure calls, voice mail systems, and more.
You can write your own programs that interact with, massage, and manipulate the data in XML documents. If you do, you'll have access to a wide range of free libraries in a variety of languages that can read and write XML so that you can focus on the unique needs of your program. Or you can use off-the-shelf software, such as web browsers and text editors, to work with XML documents. Some tools are able to work with any XML document. Others are customized to support a particular XML application in a particular domain, such as vector graphics, and may not be of much use outside that domain. But the same underlying syntax is used in all cases, even if it's deliberately hidden by the more user-friendly tools or restricted to a single application.
End of preview. The rest of this section is not available here.
The Benefits of XML
Content preview
XML is a metamarkup language for text documents. Data are included in XML documents as strings of text. The data are surrounded by text markup that describes the data. XML's basic unit of data and markup is called an element. The XML specification defines the exact syntax this markup must follow: how elements are delimited by tags, what a tag looks like, what names are acceptable for elements, where attributes are placed, and so forth. Superficially, the markup in an XML document looks a lot like the markup in an HTML document, but there are some crucial differences.
Most importantly, XML is a metamarkup language. That means it doesn't have a fixed set of tags and elements that are supposed to work for everybody in all areas of interest for all time. Any attempt to create a finite set of such tags is doomed to failure. Instead, XML allows developers and writers to invent the elements they need as they need them. Chemists can use elements that describe molecules, atoms, bonds, reactions, and other items encountered in chemistry. Real estate agents can use elements that describe apartments, rents, commissions, locations, and other items needed for real estate. Musicians can use elements that describe quarter notes, half notes, G-clefs, lyrics, and other objects common in music. The X in XML stands for Extensible. Extensible means that the language can be extended and adapted to meet many different needs.
Although XML is quite flexible in the elements it allows, it is quite strict in many other respects. The XML specification defines a grammar for XML documents that says where tags may be placed, what they must look like, which element names are legal, how attributes are attached to elements, and so forth. This grammar is specific enough to allow the development of XML parsers that can read any XML document. Documents that satisfy this grammar are said to be
End of preview. The rest of this section is not available here.
What XML Is Not
Content preview
XML is a markup language, and it is only a markup language. It's important to remember that. The XML hype has gotten so extreme that some people expect XML to do everything up to and including washing the family dog.
First of all, XML is not a programming language. There's no such thing as an XML compiler that reads XML files and produces executable code. You might perhaps define a scripting language that used a native XML format and was interpreted by a binary program, but even this application would be unusual. XML can be used as a format for instructions to programs that do make things happen, just as a traditional program may read a text config file and take different actions depending on what it sees there. Indeed, there's no reason a config file can't be XML instead of unstructured text. Some more recent programs use XML config files; but in all cases, it's the program taking action, not the XML document itself. An XML document by itself simply is. It does not do anything.
At least one XML application, XSL Transformations (XSLT), has been proven to be Turing complete by construction. See http://www.unidex.com/turing/utm.htm for one universal Turing machine written in XSLT.
Second, XML is not a network transport protocol. XML won't send data across the network, any more than HTML will. Data sent across the network using HTTP, FTP, NFS, or some other protocol might be encoded in XML; but again there has to be some software outside the XML document that actually sends the document.
Finally, to mention the example where the hype most often obscures the reality, XML is not a database. You're not going to replace an Oracle or MySQL server with XML. A database can contain XML data, either as a VARCHAR or a BLOB or as some custom XML data type, but the database itself is not an XML document. You can store XML data in a database on a server or retrieve data from a database in an XML format, but to do this, you need to be running software written in a real programming language such as C or Java. To store XML in a database, software on the client side will send the XML data to the server using an established network protocol such as TCP/IP. Software on the server side will receive the XML data, parse it, and store it in the database. To retrieve an XML document from a database, you'll generally pass through some middleware product like Enhydra that makes SQL queries against the database and formats the result set as XML before returning it to the client. Indeed, some databases may integrate this software code into their core server or provide plug-ins to do it, such as the Oracle XSQL servlet. XML serves very well as a ubiquitous, platform-independent transport format in these scenarios. However, it is not the database, and it shouldn't be used as one.
End of preview. The rest of this section is not available here.
Portable Data
Content preview
XML offers the tantalizing possibility of truly cross-platform, long-term data formats. It's long been the case that a document written on one platform is not necessarily readable on a different platform, or by a different program on the same platform, or even by a future or past version of the same program on the same platform. When the document can be read, there's no guarantee that all the information will come across. Much of the data from the original moon landings in the late 1960s and early 1970s is now effectively lost. Even if you can find a tape drive that can read the now obsolete tapes, nobody knows what format the data is stored in on the tapes!
XML is an incredibly simple, well-documented, straightforward data format. XML documents are text and can be read with any tool that can read a text file. Not just the data, but also the markup is text, and it's present right there in the XML file as tags. You don't have to wonder whether every eighth byte is random padding, guess whether a four-byte quantity is a two's complement integer or an IEEE 754 floating point number, or try to decipher which integer codes map to which formatting properties. You can read the tag names directly to find out exactly what's in the document. Similarly, since element boundaries are defined by tags, you aren't likely to be tripped up by unexpected line-ending conventions or the number of spaces that are mapped to a tab. All the important details about the structure of the document are explicit. You don't have to reverse-engineer the format or rely on incomplete and often unavailable documentation.
A few software vendors may want to lock in their users with undocumented, proprietary, binary file formats. However, in the long term, we're all better off if we can use the cleanly documented, well-understood, easy to parse, text-based formats that XML provides. XML lets documents and data be moved from one system to another with a reasonable hope that the receiving system will be able to make sense out of it. Furthermore, validation lets the receiving side check that it gets what it expects. Java promised portable code; XML delivers portable data. In many ways, XML is the most portable and flexible document format designed since the ASCII text file.
End of preview. The rest of this section is not available here.
How XML Works
Content preview
Example 1-1 shows a simple XML document. This particular XML document might be seen in an inventory-control system or a stock database. It marks up the data with tags and attributes describing the color, size, bar-code number, manufacturer, name of the product, and so on.
Example 1-1. An XML document
<?xml version="1.0"?>
<product barcode="2394287410">
  <manufacturer>Verbatim</manufacturer>
  <name>DataLife MF 2HD</name>
  <quantity>10</quantity>
  <size>3.5"</size>
  <color>black</color>
  <description>floppy disks</description>
</product>
This document is text and can be stored in a text file. You can edit this file with any standard text editor such as BBEdit, jEdit, UltraEdit, Emacs, or vi. You do not need a special XML editor. Indeed, we find most general-purpose XML editors to be far more trouble than they're worth and much harder to use than simply editing documents in a text editor.
Programs that actually try to understand the contents of the XML document—that is, do more than merely treat it as any other text file—will use an XML parser to read the document. The parser is responsible for dividing the document into individual elements, attributes, and other pieces. It passes the contents of the XML document to an application piece by piece. If at any point the parser detects a violation of the well-formedness rules of XML, then it reports the error to the application and stops parsing. In some cases, the parser may read further in the document, past the original error, so that it can detect and report other errors that occur later in the document. However, once it has detected the first well-formedness error, it will no longer pass along the contents of the elements and attributes it encounters.
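As a rough illustration of the parser's role, the following Python sketch reads the document from Example 1-1 with the standard library's xml.etree.ElementTree module. The helper names read_product and is_well_formed are invented for this example, and ElementTree builds a complete tree rather than handing the document over piece by piece, so this is only one of several possible parser styles. What it shares with the description above is the key behavior: a well-formedness error stops parsing and is reported to the calling code.

```python
import xml.etree.ElementTree as ET

# The document from Example 1-1, stored as ordinary text.
PRODUCT_XML = """<?xml version="1.0"?>
<product barcode="2394287410">
  <manufacturer>Verbatim</manufacturer>
  <name>DataLife MF 2HD</name>
  <quantity>10</quantity>
  <size>3.5"</size>
  <color>black</color>
  <description>floppy disks</description>
</product>"""


def read_product(xml_text):
    """Parse a product document and return its data as a plain dict."""
    root = ET.fromstring(xml_text)  # raises ET.ParseError if not well-formed
    record = {"barcode": root.get("barcode")}
    for child in root:
        record[child.tag] = child.text
    return record


def is_well_formed(xml_text):
    """Return True if the parser accepts the text, False on a parse error."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False


if __name__ == "__main__":
    print(read_product(PRODUCT_XML))
    # Dropping an end-tag breaks well-formedness, so the parser gives up.
    print(is_well_formed(PRODUCT_XML.replace("</name>", "")))
```

The application code never looks at angle brackets itself; it only sees element names, attribute values, and text that the parser has already separated.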
Individual XML applications normally dictate more precise rules about exactly which elements and attributes are allowed where. For instance, you wouldn't expect to find a G_Clef element when reading a biology document. Some of these rules can be precisely specified with a schema written in any of several languages, including the W3C XML Schema Language, RELAX NG, and DTDs. A document may contain a URL indicating where the schema can be found. Some XML parsers will notice this and compare the document to its schema as they read it to see if the document satisfies the constraints specified there. Such a parser is called a
End of preview. The rest of this section is not available here.
The Evolution of XML
Content preview
XML is a descendant of SGML, the Standard Generalized Markup Language. The language that would eventually become SGML was invented by Charles F. Goldfarb, Ed Mosher, and Ray Lorie at IBM in the 1970s and developed by several hundred people around the world until its eventual adoption as ISO standard 8879 in 1986. SGML was intended to solve many of the same problems XML solves in much the same way XML solves them. It is a semantic and structural markup language for text documents. SGML is extremely powerful and achieved some success in the U.S. military and government, in the aerospace sector, and in other domains that needed ways of efficiently managing technical documents that were tens of thousands of pages long.
SGML's biggest success was HTML, which is an SGML application. However, HTML is just one SGML application. It does not have or offer anywhere near the full power of SGML itself. Since it restricts authors to a finite set of tags designed to describe web pages—and describes them in a fairly presentation oriented way at that—it's really little more than a traditional markup language that has been adopted by web browsers. It doesn't lend itself to use beyond the single application of web page design. You would not use HTML to exchange data between incompatible databases or to send updated product catalogs to retailer sites, for example. HTML does web pages, and it does them very well, but it only does web pages.
SGML was the obvious choice for other applications that took advantage of the Internet but were not simple web pages for humans to read. The problem was that SGML is complicated—very, very complicated. The official SGML specification is over 150 very technical pages. It covers many special cases and unlikely scenarios. It is so complex that almost no software has ever implemented it fully. Programs that implemented or relied on different subsets of SGML were often incompatible with each other. The special feature one program considered essential would be considered extraneous fluff and omitted by the next program.
End of preview. The rest of this section is not available here.
Chapter 2: XML Fundamentals
Content preview
This chapter shows you how to write simple XML documents. You'll see that an XML document is built from text content marked up with text tags such as <SKU>, <Record_ID>, and <author> that look superficially like HTML tags. However, in HTML you're limited to about a hundred predefined tags that describe web page formatting. In XML, you can create as many tags as you need. Furthermore, these tags will mostly describe the type of content they contain rather than formatting or layout information. In XML you don't say that something is italicized or indented or bold, you say that it's a book or a biography or a calendar.
Although XML is looser than HTML in regard to which tags it allows, it is much stricter about where those tags are placed and how they're written. In particular, all XML documents must be well-formed. Well-formedness rules specify constraints such as "Every start-tag must have a matching end-tag," and "Attribute values must be quoted." These rules are unbreakable, which makes parsing XML documents easier and writing them a little harder, but they still allow an almost unlimited flexibility of expression.
End of preview. The rest of this section is not available here.
XML Documents and XML Files
Content preview
An XML document contains text, never binary data. It can be opened with any program that knows how to read a text file. Example 2-1 is close to the simplest XML document imaginable. Nonetheless, it is a well-formed XML document. XML parsers can read it and understand it (at least as far as a computer program can be said to understand anything).
Example 2-1. A very simple yet complete XML document
<person>
  Alan Turing
</person>
In the most common scenario, this document would be the entire contents of a file named person.xml, or perhaps 2-1.xml. However, XML is not picky about the filename. As far as the parser is concerned, this file could be called person.txt, person, or Hey you, there's some XML in this here file! Your operating system may or may not like these names, but an XML parser won't care. The document might not even be in a file at all. It could be a record or a field in a database. It could be generated on the fly by a CGI program in response to a browser query. It could even be stored in more than one file, although that's unlikely for such a simple document. If it is served by a web server, it will probably be assigned the MIME media type application/xml or text/xml. However, specific XML applications may use more specific MIME media types, such as application/mathml+xml, application/xslt+xml, image/svg+xml, text/vnd.wap.wml, or even text/html (in very special cases).
For generic XML documents, application/xml should be preferred to text/xml, although many web servers come configured out of the box to use text/xml. text/xml uses the ASCII character set as a default, which is incorrect for most XML documents.
End of preview. The rest of this section is not available here.
Elements, Tags, and Character Data
Content preview
The document in Example 2-1 is composed of a single element named person. The element is delimited by the start-tag <person> and the end-tag </person>. Everything between the start-tag and the end-tag of the element (exclusive) is called the element's content. The content of this element is the text:
  Alan Turing
The whitespace is part of the content, although many applications will choose to ignore it. <person> and </person> are markup. The string "Alan Turing" and its surrounding whitespace are character data. The tag is the most common form of markup in an XML document, but there are other kinds we'll discuss later.
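To make the point about whitespace concrete, here is a small Python sketch using the standard xml.etree.ElementTree module (one parser among many; the variable names are invented for illustration). It shows that the parser reports the line breaks and indentation as part of the element's content, and that discarding them is a decision the application makes.

```python
import xml.etree.ElementTree as ET

# Example 2-1, written as a Python string with explicit line breaks.
person = ET.fromstring("<person>\n  Alan Turing\n</person>")

# The parser hands over the content exactly as written,
# including the surrounding whitespace.
content = person.text

# Ignoring that whitespace is the application's choice, not the parser's.
trimmed = content.strip()

if __name__ == "__main__":
    print(repr(content))
    print(repr(trimmed))
```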
Superficially, XML tags look like HTML tags. Start-tags begin with < and end-tags begin with </. Both of these are followed by the name of the element and are closed by >. However, unlike HTML tags, you are allowed to make up new XML tags as you go along. To describe a person, use <person> and </person> tags. To describe a calendar, use <calendar> and </calendar> tags. The names of the tags generally reflect the type of content inside the element, not how that content will be formatted.

Section 2.2.1.1: Empty elements

There's also a special syntax for empty elements, elements that have no content. Such an element can be represented by a single empty-element tag that begins with < but ends with />. For instance, in XHTML, an XMLized reformulation of standard HTML, the line-break and horizontal-rule elements are written as <br /> and <hr /> instead of <br> and <hr>. These are exactly equivalent to <br></br> and <hr></hr>, however. Which form you use for empty elements is completely up to you. However, what you cannot do in XML and XHTML (unlike HTML) is use only the start-tag—for instance <br> or <hr>—without using the matching end-tag. That would be a well-formedness error.
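The equivalence of the two empty-element forms can be checked with a short Python sketch based on the standard xml.etree.ElementTree module. The describe and is_well_formed helpers and the sample paragraph are assumptions made for this illustration only.

```python
import xml.etree.ElementTree as ET


def is_well_formed(xml_text):
    """Return True if the parser accepts the text, False on a parse error."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False


def describe(element):
    """Summarize an element as (name, text content, number of children)."""
    return (element.tag, element.text or "", len(element))


# The same paragraph written with each of the two empty-element forms.
short_form = ET.fromstring("<p>line one<br />line two</p>")
long_form = ET.fromstring("<p>line one<br></br>line two</p>")

if __name__ == "__main__":
    # Both br elements look identical to the application.
    print(describe(short_form[0]), describe(long_form[0]))
    # A lone start-tag, as permitted in HTML, is rejected.
    print(is_well_formed("<p>line one<br>line two</p>"))
```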
End of preview. The rest of this section is not available here.
Attributes
Content preview
XML elements can have attributes. An attribute is a name-value pair attached to the element's start-tag. Names are separated from values by an equals sign and optional whitespace. Values are enclosed in single or double quotation marks. For example, this person element has a born attribute with the value 1912-06-23 and a died attribute with the value 1954-06-07:
<person born="1912-06-23" died="1954-06-07">
  Alan Turing
</person>
This next element is exactly the same, as far as an XML parser is concerned. It simply uses single quotes instead of double quotes, puts some extra whitespace around the equals signs, and reorders the attributes.
<person died = '1954-06-07'  born = '1912-06-23' >
  Alan Turing
</person>
The whitespace around the equals signs is purely a matter of personal aesthetics. The single quotes may be useful in cases where the attribute value itself contains a double quote. Attribute order is not significant.
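A quick way to convince yourself that the two person elements above are the same to a parser is to compare what the parser reports for each. The sketch below uses Python's standard xml.etree.ElementTree module; the variable names are invented for this illustration.

```python
import xml.etree.ElementTree as ET

# Double quotes, no extra whitespace, born before died.
double_quoted = ET.fromstring(
    '<person born="1912-06-23" died="1954-06-07">\n  Alan Turing\n</person>'
)

# Single quotes, whitespace around the equals signs, attributes reordered.
single_quoted = ET.fromstring(
    "<person died = '1954-06-07'  born = '1912-06-23' >\n  Alan Turing\n</person>"
)

if __name__ == "__main__":
    # The attributes arrive as name-value pairs; quoting style, spacing,
    # and order in the source text are not part of the information.
    print(double_quoted.attrib == single_quoted.attrib)
    print(double_quoted.get("born"), single_quoted.get("born"))
```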
Example 2-4 shows how attributes might be used to encode much of the same information given in the record-like document of Example 2-2.
Example 2-4. An XML document that describes a person using attributes
<person>
  <name first="Alan" last="Turing"/>
  <profession value="computer scientist"/>
  <profession value="mathematician"/>
  <profession value="cryptographer"/>
</person>
This raises the question of when and whether one should use child elements or attributes to hold information. This is a subject of heated debate. Some informaticians maintain that attributes are for metadata about the element while elements are for the information itself. Others point out that it's not always so obvious what's data and what's metadata. Indeed, the answer may depend on where the information is put to use.
What's undisputed is that each element may have no more than one attribute with a given name. That's unlikely to be a problem for a birth date or a death date; it would be an issue for a profession, name, address, or anything else of which an element might plausibly have more than one. Furthermore, attributes are quite limited in structure. The value of the attribute is simply undifferentiated text. The division of a date into a year, month, and day with hyphens in the earlier code snippets is at the limits of the substructure that can reasonably be encoded in an attribute. An element-based structure is a lot more flexible and extensible. Nonetheless, attributes are certainly more convenient in some applications. Ultimately, if you're designing your own XML vocabulary, it's up to you to decide when to use which.
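The one-attribute-per-name rule is easy to demonstrate. In this Python sketch (standard xml.etree.ElementTree module; the two sample documents and the is_well_formed helper are made up for the illustration), repeating a child element is fine, while repeating an attribute name on one element is a well-formedness error.

```python
import xml.etree.ElementTree as ET


def is_well_formed(xml_text):
    """Return True if the parser accepts the text, False on a parse error."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False


# Several professions as repeated child elements: allowed.
repeated_children = """<person>
  <profession>computer scientist</profession>
  <profession>mathematician</profession>
  <profession>cryptographer</profession>
</person>"""

# Two attributes with the same name on one element: not allowed.
repeated_attribute = (
    '<person profession="computer scientist" profession="mathematician"/>'
)

professions = [
    element.text for element in ET.fromstring(repeated_children).findall("profession")
]

if __name__ == "__main__":
    print(professions)
    print(is_well_formed(repeated_attribute))
```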
End of preview. The rest of this section is not available here.
XML Names
Content preview
The XML specification can be quite legalistic and picky at times. Nonetheless, it tries to be efficient where possible. One way it does that is by reusing the same rules for different items where possible. For example, the rules for XML element names are also the rules for XML attribute names, as well as for the names of several less common constructs. Collectively, these are referred to simply as XML names.
Element and other XML names may contain essentially any alphanumeric character. This includes the standard English letters A through Z and a through z as well as the digits 0 through 9. XML names may also include non-English letters, numbers, and ideograms, such as ö, ç, and Ω. They may also include these three punctuation characters:
_ The underscore
- The hyphen
. The period
XML names may not contain other punctuation characters such as quotation marks, apostrophes, dollar signs, carets, percent symbols, and semicolons. The colon is allowed, but its use is reserved for namespaces as discussed in Chapter 4. XML names may not contain whitespace of any kind, whether a space, a carriage return, a line feed, a nonbreaking space, and so forth. Finally, all names beginning with the string "XML" (in any combination of case) are reserved for standardization in W3C XML-related specifications.
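One informal way to probe these rules is to ask a parser whether it accepts a candidate name as an element name. The Python sketch below does that with the standard xml.etree.ElementTree module. The function is_usable_name is a hypothetical helper, not part of any library, and it is only a rough probe: it does not check the reservation of names beginning with "XML", and it deliberately avoids colons because they involve namespaces.

```python
import xml.etree.ElementTree as ET


def is_usable_name(name):
    """Return True if a parser accepts <name/> as an element called name."""
    try:
        root = ET.fromstring("<" + name + "/>")
    except ET.ParseError:
        return False
    # Guard against input such as 'a b="c"', which parses as element "a".
    return root.tag == name


if __name__ == "__main__":
    candidates = [
        "first_name", "first-name", "first.name", "größe",
        "first name", "first$name", "first%name", "first;name",
    ]
    for candidate in candidates:
        print(candidate, is_usable_name(candidate))
```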
The primary new feature in XML 1.1 is that XML names may contain characters only defined in Unicode 3.0 and later. XML 1.0 is limited to the characters defined as of Unicode 2.0. Additional scripts enabled for names by XML 1.1 include Burmese, Mongolian, Thaana, Cambodian, Yi, and Amharic. (All of these scripts are legal in text content in XML 1.0. You just can't use them to name elements, attributes, and entities.) XML 1.1 offers little to no benefit to developers who don't need to use these scripts in their markup.
XML 1.1 also allows names to contain some uncommon symbols such as the musical symbol for a six-string fretboard and even a million or so code points that aren't actually mapped to particular characters. However, taking advantage of this is highly unwise. We strongly recommend that even in XML 1.1 you limit your names to letters, digits, ideographs, and the specifically allowed ASCII punctuation marks.
End of preview. The rest of this section is not available here.
References
Content preview
The character data inside an element must not contain a raw unescaped opening angle bracket (<). This character is always interpreted as beginning a tag. If you need to use this character in your text, you can escape it using the entity reference &lt;, the numeric character reference &#60;, or the hexadecimal numeric character reference &#x3C;. When a parser reads the document, it replaces any &lt;, &#60;, or &#x3C; references it finds with the actual < character. However, it will not confuse the references with the starts of tags. For example:
<SCRIPT LANGUAGE="JavaScript">
  if (location.host.toLowerCase().indexOf("ibiblio") &lt; 0) {
    location.href="http://ibiblio.org/xml/";
  }
</SCRIPT>
Character data may not contain a raw unescaped ampersand (&) either. This is always interpreted as beginning an entity reference. However, the ampersand may be escaped using the &amp; entity reference like this:
<company>W.L. Gore &amp; Associates</company>
The ampersand is code point 38 so it could also be written with the numeric character reference &#38;:
<company>W.L. Gore &#38; Associates</company>
Entity references such as &amp; and character references such as &#60; are markup. When an application parses an XML document, it replaces this particular markup with the actual character or characters the reference refers to.
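The replacement of references by the parser can be observed directly. The following Python sketch relies on the standard xml.etree.ElementTree module for parsing and on xml.sax.saxutils.escape for going the other way; the sample element name t and the helper names are assumptions made for this illustration.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape


def text_of(xml_text):
    """Parse a one-element document and return its character data."""
    return ET.fromstring(xml_text).text


def is_well_formed(xml_text):
    """Return True if the parser accepts the text, False on a parse error."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False


# Three ways of writing the same less-than sign.
less_than_forms = [
    text_of("<t>a &lt; b</t>"),
    text_of("<t>a &#60; b</t>"),
    text_of("<t>a &#x3C; b</t>"),
]

# Two ways of writing the same ampersand.
ampersand_forms = [
    text_of("<company>W.L. Gore &amp; Associates</company>"),
    text_of("<company>W.L. Gore &#38; Associates</company>"),
]

if __name__ == "__main__":
    print(less_than_forms)
    print(ampersand_forms)
    # Raw < and & in character data are rejected.
    print(is_well_formed("<t>a < b</t>"), is_well_formed("<t>a & b</t>"))
    # escape() produces the references when writing XML.
    print(escape("a < b & c"))
```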
XML predefines exactly five entity references. These are:
&lt;
The less-than sign, a.k.a. the opening angle bracket (<)
&amp;
The ampersand (&)
&gt;
The greater-than sign, a.k.a. the closing angle bracket (>)
&quot;
The straight, double quotation marks (")
&apos;
The apostrophe, a.k.a. the straight single quote (')
Only &lt; and &amp; must be used instead of the literal characters in element content. The others are optional. &quot; and &apos; are useful inside attribute values where a raw " or ' might be misconstrued as ending the attribute value. For example, this image tag uses the
End of preview. The rest of this section is not available here.
CDATA Sections
Content preview
When an XML document includes samples of XML or HTML source code, the < and & characters in those samples must be encoded as &lt; and &amp;. The more sections of literal code a document includes and the longer they are, the more tedious this encoding becomes. Instead you can enclose each sample of literal code in a CDATA section. A CDATA section is set off by <![CDATA[ and ]]>. Everything between the <![CDATA[ and the ]]> is treated as raw character data. Less-than signs don't begin tags. Ampersands don't start entity references. Everything is simply character data, not markup.
For example, in a Scalable Vector Graphics (SVG) tutorial written in XHTML, you might see something like this:
<p>You can use a default <code>xmlns</code> attribute to avoid
having to add the svg prefix to all your elements:</p>
<pre><![CDATA[
       <svg xmlns="http://www.w3.org/2000/svg"
            width="12cm" height="10cm">
         <ellipse rx="110" ry="130" />
         <rect x="4cm" y="1cm" width="3cm" height="6cm" />
       </svg>
     ]]></pre>
The SVG source code has been included directly in the XHTML file without carefully replacing each < with &lt;. The result will be a sample SVG document, not an embedded SVG picture, as might happen if this example were not placed inside a CDATA section.
The only thing that cannot appear in a CDATA section is the CDATA section end delimiter, ]]>.
CDATA sections exist for the convenience of human authors, not for programs. Parsers are not required to tell you whether a particular block of text came from a CDATA section, from normal character data, or from character data that contained entity references such as &lt; and &amp;. By the time you get access to the data, these differences will have been washed away. No code you write should depend on the difference between them.
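The way these differences wash away can be seen in a few lines of Python. This sketch uses the standard xml.etree.ElementTree module, which happens to be one of the parsers that does not report where text came from; the two sample documents are shortened versions of the SVG snippet above and the variable names are invented.

```python
import xml.etree.ElementTree as ET

# The same literal code written once in a CDATA section
# and once with entity references.
with_cdata = ET.fromstring('<pre><![CDATA[<ellipse rx="110" ry="130" />]]></pre>')
with_escapes = ET.fromstring('<pre>&lt;ellipse rx="110" ry="130" /&gt;</pre>')

if __name__ == "__main__":
    # Both arrive as plain character data; neither pre element
    # has an ellipse child element.
    print(with_cdata.text)
    print(with_cdata.text == with_escapes.text)
    print(len(with_cdata), len(with_escapes))
```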
End of preview. The rest of this section is not available here.
Comments
Content preview
XML documents can be commented so that coauthors can leave notes for each other and themselves, documenting why they've done what they've done or items that remain to be done. XML comments are syntactically similar to HTML comments. Just as in HTML, they begin with <!-- and end with the first occurrence of -->. For example:
<!-- I need to verify and update these links when I get a chance. -->
The double hyphen -- must not appear anywhere inside the comment until the closing -->. In particular, a three-hyphen close like ---> is specifically forbidden.
Comments may appear anywhere in the character data of a document. They may also appear before or after the root element. (Comments are not elements, so this does not violate the tree structure or the one-root element rules for XML.) However, comments may not appear inside a tag or inside another comment.
Applications that read and process XML documents may or may not pass along information included in comments. They are certainly free to drop them out if they choose. Do not write documents or applications that depend on the contents of comments being available. Comments are strictly for making the raw source code of an XML document more legible to human readers. They are not intended for computer programs. For this purpose, you should use a processing instruction instead.
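As an example of a tool that drops comments, Python's standard xml.etree.ElementTree module discards them with its default settings. The sketch below (the links document and helper names are made up for illustration) shows that the comments before and inside the root element never reach the application, and that a double hyphen inside a comment is rejected.

```python
import xml.etree.ElementTree as ET


def is_well_formed(xml_text):
    """Return True if the parser accepts the text, False on a parse error."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False


document = """<!-- A comment may appear before the root element. -->
<links>
  <!-- I need to verify and update these links when I get a chance. -->
  <link>http://ibiblio.org/xml/</link>
</links>"""

root = ET.fromstring(document)
# With the default parser settings only the real element is in the tree.
child_tags = [child.tag for child in root]

if __name__ == "__main__":
    print(child_tags)
    print(is_well_formed("<a><!-- two -- hyphens --></a>"))
    print(is_well_formed("<a><!-- three-hyphen close ---></a>"))
```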
End of preview. The rest of this section is not available here.
Processing Instructions
Content preview
In HTML, comments are sometimes abused to support nonstandard extensions. For instance, the contents of the script element are sometimes enclosed in a comment to protect it from display by a nonscript-aware browser. The Apache web server parses comments in .shtml files to recognize server-side includes. Unfortunately, these documents may not survive being passed through various HTML editors and processors with their comments and associated semantics intact. Worse yet, it's possible for an innocent comment to be misconstrued as input to the application.
XML provides the processing instruction as an alternative means of passing information to particular applications that may read the document. A processing instruction begins with <? and ends with ?>. Immediately following the <? is an XML name called the target, possibly the name of the application for which this processing instruction is intended or possibly just an identifier for this particular processing instruction. The rest of the processing instruction contains text in a format appropriate for the applications for which the instruction is intended.
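The split between target and data can be seen with Python's standard xml.dom.minidom module. This is only a sketch: the target name myapp and its contents are hypothetical, chosen for illustration.

```python
from xml.dom import minidom

# Hypothetical document: the target "myapp" and its data are made up.
SOURCE = '<?myapp mode="draft" reviewer="none"?><doc/>'

document = minidom.parseString(SOURCE)
instruction = document.firstChild  # the instruction precedes the root element

target = instruction.target  # the XML name right after <?
data = instruction.data      # everything after the target, left uninterpreted
is_pi = instruction.nodeType == minidom.Node.PROCESSING_INSTRUCTION_NODE
```

The parser hands the data over as a single string. Interpreting it, for instance as pseudo-attributes, is left to the application the instruction is meant for.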
For example, in HTML, a robots META tag is used to tell search-engine and other robots whether and how they should index a page. The following processing instruction has been proposed as an equivalent for XML documents:
<?robots index="yes" follow="no"?>
The target of this processing instruction is robots. The syntax of this particular processing instruction is two pseudo-attributes, one named index and one named follow, whose values are either yes or no. The semantics of this particular processing instruction are that if the index attribute has the value yes, then search-engine robots should index this page. If index has the value no, then robots should not index the page. Similarly, if follow has the value yes, then links from this document will be followed; if it has the value
The XML Declaration
XML documents should (but do not have to) begin with an XML declaration. The XML declaration looks like a processing instruction with the name xml and with version, standalone, and encoding pseudo-attributes. Technically, it's not a processing instruction, though; it's just the XML declaration, nothing more, nothing less. Example 2-7 demonstrates.
Example 2-7. A very simple XML document with an XML declaration
<?xml version="1.0" encoding="ASCII" standalone="yes"?>
<person>
  Alan Turing
</person>
XML documents do not have to have an XML declaration. However, if an XML document does have an XML declaration, then that declaration must be the first thing in the document. It must not be preceded by any comments, whitespace, processing instructions, and so forth. The reason is that an XML parser uses the first five characters (<?xml) to make some reasonable guesses about the encoding, such as whether the document uses a single-byte or multibyte character set. The only thing that may precede the XML declaration is an invisible Unicode byte-order mark. We'll discuss this further in Chapter 5.
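The placement rule is easy to observe with a parser. The following sketch uses Python's standard xml.etree.ElementTree module, which is built on the nonvalidating expat parser. The helper name parses is an assumption of this example.

```python
import xml.etree.ElementTree as ET

DECLARATION = '<?xml version="1.0"?>'
BODY = "<person>Alan Turing</person>"

def parses(text):
    """Return True if the parser accepts the text, False if it reports an error."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False

declaration_first = parses(DECLARATION + BODY)              # declaration at the very start
whitespace_first = parses("\n" + DECLARATION + BODY)        # a newline precedes it
comment_first = parses("<!-- note -->" + DECLARATION + BODY)  # a comment precedes it
no_declaration = parses(BODY)                               # declaration omitted entirely
```

Only the first and last variants are accepted: the declaration is optional, but when present it must come first.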
The version attribute should have the value 1.0. Under very unusual circumstances, it may also have the value 1.1. Since specifying version="1.1" limits the document to the most recent versions of only a couple of parsers, and since all XML 1.1 parsers must also support XML 1.0, you don't want to casually set the version to 1.1.
Don't believe us? First answer a couple of questions:
  1. Do you speak Cambodian, Burmese, Amharic, Mongolian, or Divehi?
  2. Does your data contain obsolete, nontext C0 control characters such as vertical tab, form feed, or bell?
If you answered no to both of these questions, you have absolutely nothing to gain by using XML 1.1. If you answered yes to either one, then you may have cause to use XML 1.1. XML 1.0 allows Cambodian, Burmese, Amharic, etc. to be used in character data and attribute values. XML 1.1 also allows these scripts to be used in element and attribute names, which XML 1.0 does not. XML 1.1 also allows C0 control characters (except null) to be used in character data and attribute values (provided they're escaped as numeric character references like
Checking Documents for Well-Formedness
Every XML document, without exception, must be well-formed. This means it must adhere to a number of rules, including the following:
  1. Every start-tag must have a matching end-tag.
  2. Elements may nest but may not overlap.
  3. There must be exactly one root element.
  4. Attribute values must be quoted.
  5. An element may not have two attributes with the same name.
  6. Comments and processing instructions may not appear inside tags.
  7. No unescaped < or & signs may occur in the character data of an element or attribute.
This is not an exhaustive list. There are many, many ways a document can be malformed. You'll find a complete list in Chapter 21. Some of these involve constructs that we have not yet discussed, such as DTDs. Others are extremely unlikely to occur if you follow the examples in this chapter (for example, including whitespace between the opening < and the element name in a tag).
Whether the error is small or large, likely or unlikely, an XML parser reading a document is required to report it. It may or may not report multiple well-formedness errors it detects in the document. However, the parser is not allowed to try to fix the document and make a best-faith effort of providing what it thinks the author really meant. It can't fill in missing quotes around attribute values, insert an omitted end-tag, or ignore the comment that's inside a start-tag. The parser is required to return an error. The objective here is to avoid the bug-for-bug compatibility wars that plagued early web browsers and continue to this day. Consequently, before you publish an XML document—whether that document is a web page, input to a database, or something else—you'll want to check it for well-formedness.
The simplest way to do this is by loading the document into a web browser that understands XML documents, such as Mozilla. If the document is well-formed, the browser will display it. If it isn't, then it will show an error message.
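A parser library can perform the same check without a browser. Here is a minimal sketch using Python's standard xml.etree.ElementTree module. The function name and sample documents are assumptions of this example, and each sample breaks one of the rules listed earlier in this section.

```python
import xml.etree.ElementTree as ET

def well_formedness_error(text):
    """Return None if text is well-formed, otherwise the parser's error message."""
    try:
        ET.fromstring(text)
    except ET.ParseError as error:
        return str(error)
    return None

SAMPLES = {
    "well-formed": '<person id="p1"><name>Alan Turing</name></person>',
    "missing end-tag": "<person><name>Alan Turing</person>",
    "overlapping elements": "<b><i>text</b></i>",
    "two root elements": "<person/><person/>",
    "unquoted attribute": "<person id=p1/>",
    "duplicate attribute": '<person id="p1" id="p2"/>',
    "comment inside tag": "<person <!-- note --> />",
    "unescaped ampersand": "<company>AT&T</company>",
}

results = {label: well_formedness_error(text) for label, text in SAMPLES.items()}
```

The parser reports the first error it finds and stops. It makes no attempt to repair the document.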
Chapter 3: Document Type Definitions (DTDs)
While XML is extremely flexible, not all the programs that read particular XML documents are so flexible. Many programs can work with only some XML applications but not others. For example, Adobe Illustrator can read and write Scalable Vector Graphics (SVG) files, but you wouldn't expect it to understand a Platform for Privacy Preferences (P3P) document. And within a particular XML application, it's often important to ensure that a given document adheres to the rules of that XML application. For instance, in XHTML, li elements should only be children of ul or ol elements. Browsers may not know what to do with them, or may act inconsistently, if li elements appear in the middle of a blockquote or p element.
XML 1.0 provides a solution to this dilemma: a document type definition (DTD). DTDs are written in a formal syntax that explains precisely which elements may appear where in the document and what the elements' contents and attributes are. A DTD can make statements such as "A ul element only contains li elements" or "Every employee element must have a social_security_number attribute." Different XML applications can use different DTDs to specify what they do and do not allow.
A validating parser compares a document to its DTD and lists any places where the document differs from the constraints specified in the DTD. The program can then decide what it wants to do about any violations. Some programs may reject the document. Others may try to fix the document or reject just the invalid element. Validation is an optional step in processing XML. A validity error is not necessarily a fatal error like a well-formedness error, although some applications may choose to treat it as one.
Validation
A valid document includes a document type declaration that identifies the DTD that the document satisfies. The DTD lists all the elements, attributes, and entities the document uses and the contexts in which it uses them. The DTD may list items the document does not use as well. Validity operates on the principle that everything not permitted is forbidden. Everything in the document must match a declaration in the DTD. If a document has a document type declaration and the document satisfies the DTD that the document type declaration indicates, then the document is said to be valid. If it does not, it is said to be invalid.
There are many things the DTD does not say. In particular, it does not say the following:
  • What the root element of the document is
  • How many instances of each kind of element appear in the document
  • What the character data inside the elements looks like
  • The semantic meaning of an element; for instance, whether it contains a date or a person's name
DTDs allow you to place some constraints on the form an XML document takes, but there can be quite a bit of flexibility within those limits. A DTD never says anything about the length, structure, meaning, allowed values, or other aspects of the text content of an element or attribute.
Validity is optional. A parser reading an XML document may or may not check for validity. If it does check for validity, the program receiving data from the parser may or may not care about validity errors. In some cases, such as feeding records into a database, a validity error may be quite serious, indicating that a required field is missing, for example. In other cases, rendering a web page perhaps, a validity error may not be so important, and a program can work around it. Well-formedness is required of all XML documents; validity is not. Your documents and your programs can use validation as you find needful.
Element Declarations
Every element used in a valid document must be declared in the document's DTD with an element declaration. Element declarations have this basic form:
<!ELEMENT name content_specification>
The name of the element can be any legal XML name. The content specification indicates what children the element may or must have and in what order. Content specifications can be quite complex. They can say, for example, that an element must have three child elements of a given type, or two children of one type followed by another element of a second type, or any elements chosen from seven different types interspersed with text.
The simplest content specification is one that says an element may only contain parsed character data, but may not contain any child elements of any type. In this case the content specification consists of the keyword #PCDATA inside parentheses. For example, this declaration says that a phone_number element may contain text but may not contain elements:
<!ELEMENT phone_number (#PCDATA)>
Such an element may also contain character references and CDATA sections (which are always parsed into pure text), as well as comments and processing instructions (which don't really count in validation). It may contain entity references only if those entity references resolve to plain text without any child elements.
Another simple content specification is one that says the element must have exactly one child of a given type. In this case, the content specification consists of the name of the child element inside parentheses. For example, this declaration says that a fax element must contain exactly one phone_number element:
<!ELEMENT fax (phone_number)>
A fax element may not contain anything else except the phone_number element, and it may not contain more or less than one of those.
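To make the meaning of these two declarations concrete, the sketch below checks them by hand in Python. It is not a validating parser (Python's standard library does not include one). It only mirrors the two rules just described, and the function names and sample phone numbers are assumptions of this example.

```python
import xml.etree.ElementTree as ET

def satisfies_phone_number(element):
    """<!ELEMENT phone_number (#PCDATA)>: text is allowed, child elements are not."""
    return element.tag == "phone_number" and len(element) == 0

def satisfies_fax(element):
    """<!ELEMENT fax (phone_number)>: exactly one phone_number child and no text."""
    if element.tag != "fax" or len(element) != 1:
        return False
    only_child = element[0]
    # Whitespace between the tags is tolerated; any other text is not.
    stray_text = (element.text or "").strip() or (only_child.tail or "").strip()
    return not stray_text and satisfies_phone_number(only_child)

good = ET.fromstring("<fax><phone_number>617-555-0100</phone_number></fax>")
two_numbers = ET.fromstring(
    "<fax><phone_number>1</phone_number><phone_number>2</phone_number></fax>")
with_text = ET.fromstring("<fax>call <phone_number>1</phone_number></fax>")
nested = ET.fromstring("<fax><phone_number><area>617</area></phone_number></fax>")
```

Only the first sample passes. The others have too many children, stray text in the fax element, or a child element inside phone_number.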
In practice, a content specification that lists exactly one child element is rare. Most elements contain either parsed character data or (at least potentially) multiple child elements. The simplest way to indicate multiple child elements is to separate them with commas. This is called a
Attribute Declarations
In addition to declaring its elements, a valid document must declare all the elements' attributes. This is done with ATTLIST declarations. A single ATTLIST can declare multiple attributes for a single element type. However, if the same attribute is repeated on multiple elements, then it must be declared separately for each element where it appears. (Later in this chapter you'll see how to use parameter entity references to make this repetition less burdensome.)
For example, this ATTLIST declaration declares the source attribute of the image element:
<!ATTLIST image source CDATA #REQUIRED>
It says that the image element has an attribute named source. The value of the source attribute is character data, and instances of the image element in the document are required to provide a value for the source attribute.
A single ATTLIST declaration can declare multiple attributes for the same element. For example, this ATTLIST declaration not only declares the source attribute of the image element, but also the width, height, and alt attributes:
<!ATTLIST image source CDATA #REQUIRED
                width  CDATA #REQUIRED
                height CDATA #REQUIRED
                alt    CDATA #IMPLIED
>
This declaration says the source, width, and height attributes are required. However, the alt attribute is optional and may be omitted from particular image elements. All four attributes are declared to contain character data, the most generic attribute type.
This declaration has the same effect and meaning as four separate ATTLIST declarations, one for each attribute. Whether to use one ATTLIST declaration per attribute is a matter of personal preference, but most experienced DTD designers prefer the multiple-attribute form. Given judicious application of whitespace, it's no less legible than the alternative.
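The difference between #REQUIRED and #IMPLIED in the four-attribute declaration above can be sketched as a hand-written check in Python. This stands in for what a validating parser would report and is not one itself. The helper names and the sample attribute values are assumptions of this example.

```python
import xml.etree.ElementTree as ET

# Mirrors the ATTLIST: three #REQUIRED attributes and one #IMPLIED attribute.
REQUIRED = ("source", "width", "height")
IMPLIED = ("alt",)

def missing_required(image):
    """Names of #REQUIRED attributes that this image element does not provide."""
    return [name for name in REQUIRED if name not in image.attrib]

def undeclared(image):
    """Names of attributes that the ATTLIST does not declare at all."""
    return sorted(set(image.attrib) - set(REQUIRED) - set(IMPLIED))

complete = ET.fromstring('<image source="bus.jpg" width="152" height="345"/>')
incomplete = ET.fromstring('<image source="bus.jpg" alt="Turing getting off a bus"/>')
extra = ET.fromstring('<image source="bus.jpg" width="152" height="345" border="0"/>')
```

Omitting alt is not an error, omitting width or height is, and border is a validity error because it was never declared.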
In merely well-formed XML, attribute values can be any string of text. The only restrictions are that any occurrences of
General Entity Declarations
As you learned in Chapter 2, XML predefines five entities for your convenience:
&lt;
The less-than sign, a.k.a. the opening angle bracket (<)
&amp;
The ampersand (&)
&gt;
The greater-than sign, a.k.a. the closing angle bracket (>)
&quot;
The straight, double quotation marks (")
&apos;
The apostrophe, a.k.a. the straight single quote (')
The DTD can define many more entities. This is useful not just in valid documents, but even in documents you don't plan to validate.
Entity references are defined with an ENTITY declaration in the DTD. This gives the name of the entity, which must be an XML name, and the replacement text of the entity. For example, this entity declaration defines &super; as an abbreviation for supercalifragilisticexpialidocious:
<!ENTITY super "supercalifragilisticexpialidocious">
Once that's done, you can use &super; anywhere you'd normally have to type the entire word (and probably misspell it).
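Even a nonvalidating parser expands entities declared in the internal DTD subset. The sketch below shows this with Python's standard xml.etree.ElementTree module. The surrounding line element and its text are hypothetical.

```python
import xml.etree.ElementTree as ET

# The ENTITY declaration sits in the internal DTD subset of the document.
SOURCE = """<!DOCTYPE line [
  <!ENTITY super "supercalifragilisticexpialidocious">
]>
<line>Say &super; twice: &super;!</line>"""

root = ET.fromstring(SOURCE)
text = root.text  # the parser has already replaced each &super; reference
```

By the time the application sees the text, each reference has been replaced, and nothing in the tree shows that an entity was involved.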
Entities can contain markup as well as text. For example, this declaration defines &footer; as an abbreviation for a standard web page footer that will be repeated on many pages:
<!ENTITY footer '<hr size="1" noshade="true"/>
<font CLASS="footer">
<a href="index.html">O&apos;Reilly Home</a> |
<a href="sales/bookstores/">O&apos;Reilly Bookstores</a> |
<a href="order_new/">How to Order</a> |
<a href="oreilly/contact.html">O&apos;Reilly Contacts</a><br/>
<a href="http://international.oreilly.com/">International</a> |
<a href="oreilly/about.html">About O&apos;Reilly</a> |
<a href="affiliates.html">Affiliated Companies</a>
</font>
<p>
<font CLASS="copy">
Copyright 2004, O&apos;Reilly Media, Inc.<br/>
<a href="mailto:webmaster@oreilly.com">webmaster@oreilly.com</a>
</font>
</p>
'>
The entity replacement text must be well-formed. For instance, you cannot put a start-tag in one entity and the corresponding end-tag in another entity.
The other thing you have to be careful about is that you need to use different quote marks inside the replacement text from the ones that delimit it. Here we've chosen single quotes to surround the replacement text and double quotes internally. However, we did have to change the single quote in "O'Reilly" to the predefined general entity reference
External Parsed General Entities
The footer example is about at the limits of what you can comfortably fit in a DTD. In practice, web sites prefer to store repeated content like this in external files and load it into their pages using PHP, server-side includes, or some similar mechanism. XML supports this technique through external general entity references, although in this case the client, rather than the server, is responsible for integrating the different pieces of the document into a coherent whole.
An external parsed general entity reference is declared in the DTD using an ENTITY declaration. However, instead of the actual replacement text, the SYSTEM keyword and a URL to the replacement text are given. For example:
<!ENTITY footer SYSTEM "http://www.oreilly.com/boilerplate/footer.xml">
Of course, a relative URL will often be used instead. For example:
<!ENTITY footer SYSTEM "/boilerplate/footer.xml">
In either case, when the general entity reference &footer; is seen in the character data of an element, the parser may replace it with the document found at http://www.oreilly.com/boilerplate/footer.xml. References to external parsed entities are not allowed in attribute values. Most of the time this shouldn't be too big a hassle because attribute values tend to be small enough to be easily included in internal entities.
Notice we wrote that the parser may replace the entity reference with the document at the URL, not that it must. This is an area where parsers have some leeway in just how much of the XML specification they wish to implement. A validating parser must retrieve such an external entity. However, a nonvalidating parser may or may not choose to retrieve the entity.
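This leeway is visible in Python's standard xml.parsers.expat module, a nonvalidating parser that loads an external entity only if the application installs a handler for it. The sketch below is an illustration under stated assumptions: the page element is hypothetical, and the dictionary FAKE_FILES stands in for real retrieval so that nothing is fetched over the network.

```python
import xml.parsers.expat

SOURCE = """<!DOCTYPE page [
  <!ENTITY footer SYSTEM "/boilerplate/footer.xml">
]>
<page>&footer;</page>"""

# Stand-in for the file system or network; the content is made up.
FAKE_FILES = {"/boilerplate/footer.xml": "<p>Sample footer</p>"}

def parse(text, retrieve):
    """Parse text and record events, loading external entities only if asked to."""
    events = []
    parser = xml.parsers.expat.ParserCreate()

    def on_start(name, attributes):
        events.append(("start", name))

    def on_text(data):
        events.append(("text", data))

    def on_external_entity(context, base, system_id, public_id):
        events.append(("fetch", system_id))
        # A child parser handles the entity's content in the current context.
        child = parser.ExternalEntityParserCreate(context)
        child.Parse(FAKE_FILES[system_id], True)
        return 1  # nonzero tells expat the entity was handled

    parser.StartElementHandler = on_start
    parser.CharacterDataHandler = on_text
    if retrieve:
        parser.ExternalEntityRefHandler = on_external_entity
    parser.Parse(text, True)
    return events

skipped = parse(SOURCE, retrieve=False)
retrieved = parse(SOURCE, retrieve=True)
```

Without the handler, the reference is silently skipped and the application sees only the page element. With it, the footer's markup appears as if it had been typed in place.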
Furthermore, not all text files can serve as external entities. In order to be loaded in by a general entity reference, the document must be potentially well-formed when inserted into an existing document. This does not mean the external entity itself must be well-formed. In particular, the external entity might not have a single root element. However, if such a root element were wrapped around the external entity, then the resulting document should be well-formed. This means, for example, that all elements that start inside the entity must finish inside the same entity. They cannot finish inside some other entity. Furthermore, the external entity does not have a prolog and, therefore, cannot have an XML declaration or a document type declaration.
External Unparsed Entities and Notations
Not all data is XML. There are a lot of ASCII text files in the world that don't give two cents about escaping < as &lt; or adhering to the other constraints by which an XML document is limited. There are probably even more JPEG photographs, GIF line art, QuickTime movies, MIDI sound files, and so on. None of these are well-formed XML, yet all of them are necessary components of many documents.
The mechanism that XML suggests for embedding these things in documents is the external unparsed entity. The DTD specifies a name and a URL for the entity containing the non-XML data. For example, this ENTITY declaration associates the name turing_getting_off_bus with the JPEG image at http://www.turing.org.uk/turing/pi1/busgroup.jpg:
<!ENTITY turing_getting_off_bus
         SYSTEM "http://www.turing.org.uk/turing/pi1/busgroup.jpg"
         NDATA jpeg>
Since the data in the previous code is not in XML format, the NDATA declaration specifies the type of the data. Here the name jpeg is used. XML does not recognize this as meaning an image in a format defined by the Joint Photographic Experts Group. Rather this is the name of a notation declared elsewhere in the DTD using a NOTATION declaration like this:
<!NOTATION jpeg SYSTEM "image/jpeg">
Here we've used the MIME media type image/jpeg as the external identifier for the notation. However, there is absolutely no standard or even a suggestion for exactly what this identifier should be. Individual applications must define their own requirements for the contents and meaning of notations.
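A parser does not interpret these declarations. It only reports them to the application. The following sketch shows this with the handlers provided by Python's standard xml.parsers.expat module. The doc root element and the handler function names are assumptions of this example.

```python
import xml.parsers.expat

SOURCE = """<!DOCTYPE doc [
  <!NOTATION jpeg SYSTEM "image/jpeg">
  <!ENTITY turing_getting_off_bus
           SYSTEM "http://www.turing.org.uk/turing/pi1/busgroup.jpg"
           NDATA jpeg>
]>
<doc/>"""

notations = {}
unparsed_entities = {}

def on_notation(name, base, system_id, public_id):
    # Called once for each NOTATION declaration.
    notations[name] = system_id

def on_unparsed_entity(name, base, system_id, public_id, notation_name):
    # Called once for each ENTITY declaration that carries NDATA.
    unparsed_entities[name] = (system_id, notation_name)

parser = xml.parsers.expat.ParserCreate()
parser.NotationDeclHandler = on_notation
parser.UnparsedEntityDeclHandler = on_unparsed_entity
parser.Parse(SOURCE, True)
```

The parser never retrieves the image. What to do with the URL and the notation's identifier is entirely up to the application.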
The DTD only declares the existence, location, and type of the unparsed entity. To actually include the entity in the document at one or more locations, you insert an element with an ENTITY type attribute whose value is the name of an unparsed entity declared in the DTD. You do not use an entity reference like &turing_getting_off_bus;
Parameter Entities
It is not uncommon for multiple elements to share all or part of the same attribute lists and content specifications. For instance, any element that's a simple XLink will have xlink:type and xlink:href attributes, and perhaps xlink:show and xlink:actuate attributes. In XHTML, a th element and a td element contain more or less the same content. Repeating the same content specifications or attribute lists in multiple element declarations is tedious and error-prone. It's entirely possible to add a newly defined child element to the declaration of some of the elements but forget to include it in others.
For example, consider an XML application for residential real estate listings that provides separate elements for apartments, sublets, coops for sale, condos for sale, and houses for sale. The element declarations might look like this:
<!ELEMENT apartment (address, footage, rooms, baths, rent)>
<!ELEMENT sublet    (address, footage, rooms, baths, rent)>
<!ELEMENT coop      (address, footage, rooms, baths, price)>
<!ELEMENT condo     (address, footage, rooms, baths, price)>
<!ELEMENT house     (address, footage, rooms, baths, price)>
There's a lot of overlap between the declarations, i.e., a lot of repeated text. And if you later decide you need to add an additional element, available_date for instance, then you need to add it to all five declarations. It would be preferable to define a constant that can hold the common parts of the content specification for all five kinds of listings and refer to that constant from inside the content specification of each element. Then to add or delete something from all the listings, you'd only need to change the definition of the constant.
An entity reference is the obvious candidate here. However, general entity references are not allowed to provide replacement text for a content specification or attribute list, only for parts of the DTD that will be included in the XML document itself. Instead, XML provides a new construct exclusively for use inside DTDs, the
Conditional Inclusion
XML offers the IGNORE directive for the purpose of "commenting out" a section of declarations. For example, a parser will ignore the following declaration of a production_note element, as if it weren't in the DTD at all:
<![IGNORE[
  <!ELEMENT production_note (#PCDATA)>
]]>
This may not seem particularly useful. After all, you could always simply use an XML comment to comment out the declarations you want to remove temporarily from the DTD. If you feel that way, the INCLUDE directive is going to seem even more pointless. Its purpose is to indicate that the given declarations are actually used in the DTD. For example:
<![INCLUDE[
  <!ELEMENT production_note (#PCDATA)>
]]>
This has exactly the same effect and meaning as if the INCLUDE directive were not present. However, now consider what happens if we don't use INCLUDE and IGNORE directly. Instead, suppose we define a parameter entity like this:
<!ENTITY % notes_allowed "INCLUDE">
Then we use a parameter entity reference instead of the keyword:
<![%notes_allowed;[
  <!ELEMENT production_note (#PCDATA)>
]]>
The notes_allowed parameter entity can be redefined from outside this DTD. In particular, it can be redefined in the internal DTD subset of a document. This provides a switch individual documents can use to turn the production_note declaration on or off. This technique allows document authors to select only the functionality they need from the DTD.
Two DTD Examples
Some of the best techniques for DTD design only become apparent when you look at larger documents. In this section, we'll develop DTDs that cover the two different document formats for describing people that were presented in Examples 2-4 and 2-5 of the last chapter.
DTDs for record-like documents are very straightforward. They make heavy use of sequences, occasional use of choices, and almost no use of mixed content. Example 3-6 shows such a DTD. Since this is a small example, and since it's easier to understand when both the document and the DTD are on the same page, we've made this an internal DTD included in the document. However, it would be easy to extract it and store it in a separate file.
Example 3-6. A DTD describing people
<?xml version="1.0"?>
<!DOCTYPE person  [
  <!ELEMENT person (name+, profession*)>
  <!ELEMENT name EMPTY>
  <!ATTLIST name first CDATA #REQUIRED
                 last  CDATA #REQUIRED>
  <!-- The first and last attributes are required to be present
       but they may be empty. For example,
       <name first="Cher" last=""/> -->
  <!ELEMENT profession EMPTY>
  <!ATTLIST profession value CDATA #REQUIRED>
]>
<person>
  <name first="Alan" last="Turing"/>
  <profession value="computer scientist"/>
  <profession value="mathematician"/>
  <profession value="cryptographer"/>
</person>
The DTD here is contained completely inside the internal DTD subset. First a person ELEMENT declaration states that each person must have one or more name children, and zero or more profession children, in that order. This allows for the possibility that a person changes his name or uses aliases. It assumes that each person has at least one name but may not have a profession.
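A content model such as (name+, profession*) reads much like a regular expression over the names of an element's children. The sketch below makes that analogy literal in Python. It is an illustration of the idea, not how a validating parser is implemented, and the helper name and sample documents are assumptions of this example.

```python
import re
import xml.etree.ElementTree as ET

# (name+, profession*) rewritten as a regular expression over
# space-terminated child element names.
PERSON_MODEL = re.compile(r"(name )+(profession )*")

def matches_person_model(person):
    """Check the order and number of a person element's children."""
    child_names = "".join(child.tag + " " for child in person)
    return PERSON_MODEL.fullmatch(child_names) is not None

one_of_each = ET.fromstring(
    '<person><name first="Alan" last="Turing"/>'
    '<profession value="mathematician"/></person>')
names_only = ET.fromstring(
    '<person><name first="Alan" last="Turing"/>'
    '<name first="A." last="Turing"/></person>')
no_name = ET.fromstring('<person><profession value="mathematician"/></person>')
wrong_order = ET.fromstring(
    '<person><profession value="mathematician"/>'
    '<name first="Alan" last="Turing"/></person>')
```

The first two samples satisfy the model. The third lacks the mandatory name, and the fourth puts a profession ahead of the name.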
This declaration also requires that all name elements precede all profession elements. Here the DTD is less flexible than it ideally would be. There's no particular reason that the names have to come first. However, if we were to allow more random ordering, it would be hard to say that there must be at least one
Locating Standard DTDs
DTDs and validity are most important when you're exchanging data with others; they let you verify that you're sending what the receiver expects and vice versa. Of course, this works best if both ends of a conversation agree on which DTD and vocabulary they will use. There are many standard DTDs for different professions and disciplines and more are created every day. It is often better to use an established DTD and vocabulary than to design your own. However, there is no agreed-upon, central repository that documents and links to such efforts. The largest list of DTDs online is probably Robin Cover's list of XML applications at http://www.oasis-open.org/cover/siteIndex.html#toc-applications.
The W3C is one of the most prolific producers of standard XML DTDs. It has moved almost all of its future development to XML, including SVG, the Platform for Internet Content Selection (PICS), the Resource Description Framework (RDF), the Mathematical Markup Language (MathML), and even HTML itself. DTDs for these XML applications are generally published as appendixes to the specifications for the applications. The specifications are all found at http://www.w3.org/TR/.
However, XML isn't just for the Web, and far more activity is going on outside the W3C than inside it. Generally, within any one field, you should look to that field's standards bodies for DTDs relating to that area of interest. For example, the American Institute of Certified Public Accountants has published a DTD for the Extensible Financial Reporting Markup Language (XFRML). The Object Management Group (OMG) has published a DTD for describing Unified Modeling Language (UML) diagrams in XML. The Society of Automotive Engineers has published an XML application for emissions information as required by the 1990 U.S. Clean Air Act. Chances are that in any industry that makes heavy use of information technology, some group or groups, either formal or informal, are already working on DTDs that cover parts of that industry.
Chapter 4: Namespaces
Namespaces have two purposes in XML:
  1. To distinguish between elements and attributes from different vocabularies with different meanings that happen to share the same name
  2. To group all the related elements and attributes from a single XML application together so that software can easily recognize them
The first purpose is easier to explain and grasp, but the second purpose is more important in practice.
Namespaces are implemented by attaching a prefix to each element and attribute. Each prefix is mapped to a URI by an xmlns:prefix attribute. Default URIs can also be provided for elements that don't have a prefix. Default namespaces are declared by xmlns attributes. Elements and attributes that are attached to the same URI are in the same namespace. Elements from many XML applications are identified by standard URIs.
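What matters to software is the URI, not the prefix. As a sketch, Python's standard xml.etree.ElementTree module reports each element name in the form {URI}local-name, so the same vocabulary written with a prefix, with a different prefix, or with a default namespace produces identical names. The URI and element names below are hypothetical.

```python
import xml.etree.ElementTree as ET

URI = "http://www.example.com/paintings"  # hypothetical namespace name

PREFIXED = f'<p:catalog xmlns:p="{URI}"><p:title>The Swing</p:title></p:catalog>'
OTHER_PREFIX = f'<art:catalog xmlns:art="{URI}"><art:title>The Swing</art:title></art:catalog>'
DEFAULTED = f'<catalog xmlns="{URI}"><title>The Swing</title></catalog>'

def qualified_tags(source):
    """Return the {URI}local-name form of every element, in document order."""
    return [element.tag for element in ET.fromstring(source).iter()]

prefixed_tags = qualified_tags(PREFIXED)
other_prefix_tags = qualified_tags(OTHER_PREFIX)
defaulted_tags = qualified_tags(DEFAULTED)
```

All three documents produce the same list of names.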
In an XML 1.1 document, an Internationalized Resource Identifier (IRI) can be used instead of a URI. An IRI is just like a URI except it can contain non-ASCII characters such as é and π. In practice, parsers don't check that namespace names are legal URIs in XML 1.0, so the distinction is mostly academic.
The Need for Namespaces
Some documents combine markup from multiple XML applications. For example, an XHTML document may contain both SVG pictures and MathML equations. An XSLT stylesheet will contain both XSLT instructions and elements from the result-tree vocabulary. And XLinks are always symbiotic with the elements of the document in which they appear since XLink itself doesn't define any elements, only attributes.
In some cases, these applications may use the same name to refer to different things. For example, in SVG a set element sets the value of an attribute for a specified duration of time, while in MathML, a set element represents a mathematical set such as the set of all positive even numbers. It's essential to know when you're working with a MathML set and when you're working with an SVG set. Otherwise, validation, rendering, indexing, and many other tasks will get confused and fail.
Consider Example 4-1. This is a simple list of paintings, including the title of each painting, the date each was painted, the artist who painted it, and a description of the painting.
Example 4-1. A list of paintings
<?xml version="1.0" encoding="ISO-8859-1" standalone="yes"?>
<catalog>

  <painting>
    <title>Memory of the Garden at Etten</title>
    <artist>Vincent Van Gogh</artist>
    <date>November, 1888</date>
    <description>
      Two women look to the left. A third works in her garden.
    </description>
  </painting>

  <painting>
    <title>The Swing</title>
    <artist>Pierre-Auguste Renoir</artist>
    <date>1876</date>
    <description>
      A young girl on a swing. Two men and a toddler watch.
    </description>
  </painting>

  <!-- Many more paintings... -->

</catalog>
Now suppose that Example 4-1 is to be served as a web page and you want to make it accessible to search engines. One possibility is to use the Resource Description Framework (RDF) to embed metadata in the page. This describes the page for any search engines or other robots that might come along. Using the Dublin Core metadata vocabulary (
Namespace Syntax
Namespaces distinguish between elements with different meanings but the same name by assigning each element a URI. Generally, all the elements from one XML application are assigned to one URI, and all the elements from a different XML application are assigned to a different URI. These URIs are called namespace names. Elements with the same name but different URIs are different kinds of elements. Elements with the same name and the same URI are the same kind of element. Most of the time a single XML application has a single namespace URI for all its elements, though a few applications use multiple namespaces to subdivide different parts of the application. For instance, XSL uses different namespaces for XSL Transformations (XSLT) and XSL Formatting Objects (XSL-FO).
Since URIs frequently contain characters such as /, %, and ~ that are not legal in XML names, short prefixes such as rdf and xsl stand in for them in element and attribute names. Each prefix is associated with a URI. Names whose prefixes are associated with the same URI are in the same namespace. Names whose prefixes are associated with different URIs are in different namespaces. Prefixed elements and attributes in namespaces have names that contain exactly one colon. They look like this:
rdf:description
xlink:type
xsl:template
Everything before the colon is called the prefix. Everything after the colon is called the local part. The complete name, including the colon, is called the qualified name, QName, or raw name. The prefix identifies the namespace to which the element or attribute belongs. The local part identifies the particular element or attribute within the namespace.
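The split between prefix and local part, and the rule that the URI rather than the prefix decides the namespace, can be sketched in a few lines of Python. This is only an illustration using the standard library's xml.etree.ElementTree; the helper split_qname and the prefixes svg, mathml, and m are names chosen for this sketch, not part of any specification.

```python
import xml.etree.ElementTree as ET

# Namespace URIs of SVG and MathML; the prefixes used below are arbitrary.
SVG_NS = "http://www.w3.org/2000/svg"
MATHML_NS = "http://www.w3.org/1998/Math/MathML"


def split_qname(qname):
    """Split a qualified name into (prefix, local part); prefix is None if absent."""
    prefix, colon, local = qname.rpartition(":")
    return (prefix if colon else None, local)


DOC = f"""<doc xmlns:svg="{SVG_NS}" xmlns:mathml="{MATHML_NS}" xmlns:m="{MATHML_NS}">
  <svg:set/>
  <mathml:set/>
  <m:set/>
</doc>"""

# ElementTree reports every element name as "{namespace URI}local part".
tags = [child.tag for child in ET.fromstring(DOC)]

print(split_qname("xsl:template"))
print(tags)
```

The second and third set elements carry different prefixes but the same URI, so the parser reports them under the same name; the first differs because its URI differs.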
In a document that contains both SVG and MathML set elements, one could be an svg:set element, and the other could be a mathml:set element. Then there'd be no confusion between them. In an XSLT stylesheet that transforms documents into XSL formatting objects, the XSLT processor would recognize elements with the prefix
How Parsers Handle Namespaces
Namespaces are not part of XML 1.0. They were invented about a year after the original XML specification was released. However, care was taken to ensure backward compatibility. Thus, an XML parser that does not know about namespaces should not have any trouble reading a document that uses namespaces. Colons are legal characters in XML element and attribute names. The parser will simply report that some of the names contain colons.
A namespace-aware parser does add a couple of checks to the normal well-formedness checks that a parser performs. Specifically, it checks to see that all prefixes are mapped to URIs. It will reject documents that use unmapped prefixes (except for xml and xmlns when used as specified in the XML or "Namespaces in XML" specifications). It will further reject any element or attribute names that contain more than one colon. Otherwise, it behaves almost exactly like a non-namespace-aware parser. Other software that sits on top of the raw XML parser—an XSLT engine, for example—may treat elements differently depending on which namespace they belong to. However, the XML parser itself mostly doesn't care as long as all well-formedness and namespace constraints are met. Many parsers let you turn namespace processing on or off as you see fit.
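As a rough illustration of the difference, the following Python sketch feeds the same documents to two standard-library parsers. It assumes that xml.etree.ElementTree performs namespace processing and that xml.sax leaves it switched off by default, which is how the CPython standard library behaves; the helper names are invented for this example.

```python
import xml.etree.ElementTree as ET
import xml.sax


def is_namespace_well_formed(text):
    """Return True if the namespace-aware ElementTree parser accepts the text."""
    try:
        ET.fromstring(text)
    except ET.ParseError:
        return False
    return True


class NameCollector(xml.sax.ContentHandler):
    """Collect raw element names as reported without namespace processing."""

    def __init__(self):
        super().__init__()
        self.names = []

    def startElement(self, name, attrs):
        self.names.append(name)


def raw_element_names(text):
    """Parse with xml.sax, whose namespace feature is off by default."""
    handler = NameCollector()
    xml.sax.parseString(text.encode("utf-8"), handler)
    return handler.names


bound = '<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"/>'
unbound = "<rdf:RDF/>"

print(is_namespace_well_formed(bound))
print(is_namespace_well_formed(unbound))
print(raw_element_names(unbound))
```

The unmapped prefix is rejected only by the namespace-aware parser; the other one simply reports a name that happens to contain a colon.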
Namespaces and DTDs
Namespaces are completely independent of DTDs and can be used in both valid and invalid documents. A document can have a DTD but not use namespaces or use namespaces but not have a DTD. It can use both namespaces and DTDs or neither namespaces nor DTDs. Namespaces do not in any way change DTD syntax nor do they change the definition of validity. For instance, the DTD of a valid document that uses an element named dc:title must include an ELEMENT declaration properly specifying the content of the dc:title element. For example:
<!ELEMENT dc:title (#PCDATA)>
The name of the element in the document must exactly match the name of the element in the DTD, including the prefix. The DTD cannot omit the prefix and simply declare a title element. The same is true of prefixed attributes. For instance, if an element used in the document has xlink:type and xlink:href attributes, then the DTD must declare the xlink:type and xlink:href attributes, not simply type and href.
Conversely, if an element uses an xmlns attribute to set the default namespace and does not attach prefixes to elements, then the names of the elements must be declared without prefixes in the DTD. The validator neither knows nor cares about the existence of namespaces. All it sees is that some element and attribute names happen to contain colons; as far as it's concerned, such names are perfectly valid as long as they're declared.
Requiring DTDs to declare the prefixed names, instead of the raw names or some combination of local part and namespace URI, makes it difficult to change the prefix in valid documents. The problem is that changing the prefix requires changing all declarations that use that prefix in the DTD. However, with a little forethought, parameter entity references can alleviate the pain quite a bit.
The trick is to define both the namespace prefix and the colon that separates the prefix from the local name as parameter entities, like this:
Chapter 5: Internationalization
We've told you that XML documents contain text, but we haven't yet told you what kind of text they contain. In this chapter we rectify that omission. XML documents contain Unicode text. Unicode is a character set large enough to include all the world's living languages and a few dead ones. It can be written in a variety of encodings, including UCS-2 and the ASCII superset UTF-8. However, since Unicode text editors are not ubiquitous, XML documents may also be written in other character sets and encodings, which are converted to Unicode when the document is parsed. The encoding declaration specifies which character set a document uses. You can use character references, such as &#x03B8;, to insert Unicode characters like θ that aren't available in the legacy character set in which a document is written.
Computers don't really understand text. They don't recognize the Latin letter Z, the Greek letter γ, or the Han ideograph 齵. All a computer understands are numbers such as 90, 947, or 40,821. A character set maps particular characters, like Z, to particular numbers, like 90. These numbers are called code points. A character encoding determines how those code points are represented in bytes. For instance, the code point 90 can be encoded as a signed byte, a little-endian unsigned short, a 4-byte, two's complement, big-endian integer, or in some still more complicated fashion.
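The distinction between a code point and its byte representation can be made concrete with a short Python sketch. It uses only the built-in ord function and the standard struct module; the variable names are chosen for this illustration.

```python
import struct

# A character set maps characters to code points (plain integers).
code_points = {ch: ord(ch) for ch in ("Z", "\u03b3")}  # Latin Z, Greek gamma

# A character encoding decides how one code point becomes bytes.
as_signed_byte = struct.pack("b", 90)      # one signed byte
as_le_ushort = struct.pack("<H", 90)       # little-endian unsigned short
as_be_int32 = struct.pack(">i", 90)        # 4-byte two's complement, big-endian

print(code_points)
print(as_signed_byte, as_le_ushort, as_be_int32)
```

The same number, 90, yields one, two, or four bytes depending on the encoding chosen.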
A human script like Cyrillic may be written in multiple character sets, such as KOI8-R, Unicode, or ISO-8859-5. A character set like Unicode may then be encoded in multiple encodings, such as UTF-8, UCS-2, or UTF-16. However, most simpler character sets, such as ASCII and KOI8-R, have only one encoding.
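To see how one script lands in different character sets and encodings, the sketch below encodes the same Cyrillic word four ways with Python's built-in codecs. The sample word and the dictionary name are assumptions made for the example.

```python
# "Pushkin" written in Cyrillic, spelled with escapes to keep the source ASCII.
text = "\u041f\u0443\u0448\u043a\u0438\u043d"

# Two single-byte Cyrillic character sets and two encodings of Unicode.
encoded = {
    name: text.encode(name)
    for name in ("koi8_r", "iso8859_5", "utf_8", "utf_16_be")
}

for name, data in encoded.items():
    print(name, len(data), data.hex())
```

The single-byte sets use one byte per letter but disagree on which byte, while the two Unicode encodings both need two bytes per letter here and still produce different byte sequences.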
Character-Set Metadata
Some environments keep track of which encodings particular documents are written in. For instance, web servers that transmit XML documents precede them with an HTTP header that looks something like this:
HTTP/1.1 200 OK
Date: Sun, 28 Oct 2001 11:05:42 GMT
Server: Apache/1.3.19 (Unix) mod_jk mod_perl/1.25 mod_fastcgi/2.2.10
Connection: close
Transfer-Encoding: chunked
Content-Type: text/xml; charset=iso-8859-1
The Content-Type field of the HTTP header provides the MIME media type of the document. This may, as shown here, specify which character set the document is written in. An XML parser reading this document from a web server should use this information to determine the document's character encoding.
Many web servers omit the charset parameter from the MIME media type. In this case, if the MIME media type is text/xml, then the document is assumed to be in the US-ASCII encoding. If the MIME media type is application/xml, then the parser attempts to guess the character set by reading the first few bytes of the document.
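The rules in the last two paragraphs can be sketched as a small Python function. This is a minimal illustration, not how any particular parser is implemented; the function name charset_from_content_type is hypothetical, and header parsing is delegated to the standard email.message module.

```python
from email.message import Message


def charset_from_content_type(header_value):
    """Return the character set implied by a Content-Type value.

    Returns None when the parser has to inspect the document itself.
    """
    msg = Message()
    msg["Content-Type"] = header_value
    charset = msg.get_param("charset")
    if charset is not None:
        return charset          # an explicit charset parameter wins
    if msg.get_content_type() == "text/xml":
        return "us-ascii"       # text/xml without charset is assumed US-ASCII
    return None                 # application/xml: guess from the first bytes


print(charset_from_content_type("text/xml; charset=iso-8859-1"))
print(charset_from_content_type("text/xml"))
print(charset_from_content_type("application/xml"))
```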
Since ASCII is almost never an appropriate character set for an XML document, application/xml is much preferred over text/xml. Unfortunately, most web servers including Apache 2.0.36 and earlier are configured to use text/xml by default. If you're running such a version you should probably upgrade before serving XML files.
We've focused on MIME types in HTTP headers because that's the most common place where character set metadata is applied to XML documents. However, MIME types are also used in some filesystems (e.g., the BeOS), in email, and in other environments. Other systems may provide other forms of character set metadata. If such metadata is available for a document, whatever form it takes, the parser should use it, although in practice this is an area where not all parsers and programs are as conformant as they should be.
The Encoding Declaration
Every XML document should have an encoding declaration as part of its XML declaration. The encoding declaration tells the parser in which character set the document is written. It's used only when other metadata from outside the file is not available. For example, this XML declaration says that the document uses the character encoding US-ASCII:
<?xml version="1.0" encoding="US-ASCII" standalone="yes"?>
This one states that the document uses the Latin-1 character set, although it uses the more official name ISO-8859-1:
<?xml version="1.0" encoding="ISO-8859-1"?>
Even if metadata is not available, the encoding declaration can be omitted if the document is written in either the UTF-8 or UTF-16 encodings of Unicode. UTF-8 is a strict superset of ASCII, so ASCII files can be legal XML documents without an encoding declaration. Note, however, that this only applies to genuine, pure 7-bit ASCII files. It does not include the extended ASCII character sets that some editors produce with characters like ©, ç, or “.
Even if character set metadata is available, many parsers ignore it. Thus, we highly recommend including an encoding declaration in all your XML documents that are not written in UTF-8 or UTF-16. It certainly never hurts to do so.
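A quick experiment shows why the declaration matters. The sketch below parses the same Latin-1 bytes with and without an encoding declaration using Python's xml.etree.ElementTree; the element name and variable names are made up for the example, and other parsers may report the failure differently.

```python
import xml.etree.ElementTree as ET

# The letter e-acute is the single byte 0xE9 in ISO-8859-1.
latin1_bytes = "\u00e9".encode("iso-8859-1")

declared = (b'<?xml version="1.0" encoding="ISO-8859-1"?>'
            b"<name>" + latin1_bytes + b"</name>")
undeclared = b"<name>" + latin1_bytes + b"</name>"

# With the declaration the parser decodes the byte correctly.
declared_text = ET.fromstring(declared).text

# Without it the parser assumes UTF-8, where a lone 0xE9 is malformed.
try:
    ET.fromstring(undeclared)
    undeclared_ok = True
except ET.ParseError:
    undeclared_ok = False

print(repr(declared_text), undeclared_ok)
```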
Text Declarations
XML documents may be composed of multiple parsed entities, as you learned in Chapter 3. These external parsed entities may be DTD fragments or chunks of XML that will be inserted into the master document using external general entity references. In either case, the external parsed entity does not necessarily use the same character set as the master document. Indeed, one external parsed entity may be referenced in several different files, each of which is written in a different character set. Therefore, it is important to specify the character set for an external parsed entity independently of the character set that the including document uses.
To accomplish this task, each external parsed entity should have a text declaration. If present, the text declaration must be the very first thing in the external parsed entity. For example, this text declaration says that the entity is encoded in the KOI8-R character set:
<?xml version="1.0" encoding="KOI8-R"?>
The text declaration looks like an XML declaration. It has version info and an encoding declaration. However, a text declaration must not have a standalone declaration. Furthermore, the version information may be omitted. A legal text declaration that specifies the encoding as KOI8-R might look like this:
<?xml encoding="KOI8-R"?>
However, this is not a legal XML declaration.
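The difference between the two declarations can be approximated with regular expressions. The Python sketch below is a simplification written for this illustration (the pattern and function names are hypothetical): an XML declaration requires version information and may carry a standalone declaration, while a text declaration requires the encoding and forbids standalone.

```python
import re

# Simplified building blocks; real parsers apply the full grammar.
_VERSION = r"""\s+version\s*=\s*(?:"1\.[0-9]+"|'1\.[0-9]+')"""
_ENCODING = r"""\s+encoding\s*=\s*(?:"[A-Za-z][A-Za-z0-9._-]*"|'[A-Za-z][A-Za-z0-9._-]*')"""
_STANDALONE = r"""\s+standalone\s*=\s*(?:"(?:yes|no)"|'(?:yes|no)')"""

# XML declaration: version required, encoding and standalone optional.
XML_DECL = re.compile(rf"<\?xml{_VERSION}(?:{_ENCODING})?(?:{_STANDALONE})?\s*\?>")
# Text declaration: version optional, encoding required, no standalone.
TEXT_DECL = re.compile(rf"<\?xml(?:{_VERSION})?{_ENCODING}\s*\?>")


def is_xml_declaration(s):
    return XML_DECL.fullmatch(s) is not None


def is_text_declaration(s):
    return TEXT_DECL.fullmatch(s) is not None


for decl in ('<?xml version="1.0" encoding="KOI8-R"?>',
             '<?xml encoding="KOI8-R"?>',
             '<?xml version="1.0" encoding="US-ASCII" standalone="yes"?>'):
    print(decl, is_xml_declaration(decl), is_text_declaration(decl))
```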
Example 5-1 shows an external parsed entity containing several verses from Pushkin's The Bronze Horseman in a Cyrillic script. The text declaration identifies the encoding as KOI8-R. Example 5-1 is not a well-formed XML document because it has no root element. It exists only for inclusion in other documents.
Example 5-1. An external parsed entity with a text declaration identifying the character set as KOI8-R
                  
External DTD subsets reside in external parsed entities and, thus, may have text declarations. Indeed, they should have text declarations if they're written in a character set other than one of Unicode's variants. Example 5-2 shows a DTD fragment written in KOI8-R that might be used to validate Example 5-1 after it is included as part of a larger document.
XML-Defined Character Sets
An XML parser is required to handle the UTF-16 and UTF-8 encodings of Unicode (about which more follows). However, XML parsers are allowed to understand and process many other character sets. In particular, the specification recommends that processors recognize and be able to read these encodings:
UTF-8
UTF-16
ISO-10646-UCS-2
ISO-10646-UCS-4
ISO-8859-1
ISO-8859-2
ISO-8859-3
ISO-8859-4
ISO-8859-5
ISO-8859-6
ISO-8859-7
ISO-8859-8
ISO-8859-9
ISO-2022-JP
Shift_JIS
EUC-JP
Many XML processors understand other legacy encodings. For instance, processors written in Java often understand all character sets available in the Java virtual machine. For a list, see http://java.sun.com/products/j2se/1.4.2/docs/guide/intl/encoding.doc.html. Furthermore, some processors may recognize aliases for these encodings; both Latin-1 and 8859_1 are sometimes used as synonyms for ISO-8859-1. However, using these names limits your document's portability. We recommend that you use standard names for standard encodings. For encodings whose standard name isn't given by the XML 1.0 specification, use one of the names registered with the Internet Assigned Numbers Authority (IANA), listed at ftp://ftp.isi.edu/in-notes/iana/assignments/character-sets. Knowing the name of a character set and saving a file in that set does not mean that your XML parser can read such a file, however. XML parsers are only required to support UTF-8 and UTF-16. They are not required to support the hundreds of different legacy encodings used around the world.
Unicode
Unicode is an international standard character set that can be used to write documents in almost any language you're likely to speak, learn, or encounter in your lifetime, barring alien abduction. Version 4.0.1, the current version as of June, 2004, contains 96,447 characters from most of Earth's living languages as well as several dead ones. Unicode easily covers the Latin alphabet, in which most of this book is written. Unicode also covers Greek-derived scripts, including ancient and modern Greek and the Cyrillic scripts used in Serbia and much of the former Soviet Union. Unicode covers several ideographic scripts, including the Han character set used for Chinese and Japanese, the Korean Hangul syllabary, and phonetic representations of these languages, including Katakana and Hiragana. It covers the right-to-left Arabic and Hebrew scripts. It covers various scripts native to the Indian subcontinent, including Devanagari, Thai, Bengali, Tibetan, and many more. And that's still less than half of the scripts in Unicode 4.0. Probably less than one person in a thousand today speaks a language that cannot be reasonably represented in Unicode. In the future, Unicode will add still more characters, making this fraction even smaller. Unicode can potentially hold more than a million characters, but no one is willing to say in public where they think most of the remaining million characters will come from.
The Unicode character set assigns characters to code points; that is, numbers. These numbers can then be encoded in a variety of schemes, including:
  • UCS-2
  • UCS-4
  • UTF-8
  • UTF-16
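A short Python sketch makes the difference between these schemes visible for a single character. It relies on Python's built-in codecs and on two assumptions worth stating: utf_32_be stands in for big-endian UCS-4, and utf_16_be stands in for UCS-2, which is only valid for characters with code points up to 65,535.

```python
# Capital Greek sigma, code point 931 (hexadecimal 3A3).
ch = "\u03a3"

schemes = {
    "UCS-2": "utf_16_be",   # stand-in: identical for code points up to 65,535
    "UCS-4": "utf_32_be",   # stand-in: four bytes per character
    "UTF-8": "utf_8",
    "UTF-16": "utf_16_be",
}

encoded = {label: ch.encode(codec).hex() for label, codec in schemes.items()}
print(ord(ch), encoded)
```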
UCS-2, also known as ISO-10646-UCS-2, represents each character as a two-byte, unsigned integer between 0 and 65,535. Thus the capital letter A, code point 65 in Unicode, is represented by the two bytes 00 and 41 (in hexadecimal). The capital letter B, code point 66, is represented by the two bytes 00 and 42. The two bytes 03 and A3 represent the capital Greek letter
ISO Character Sets
Unicode has only recently become commonplace. Previously, the space and processing costs associated with Unicode files caused vendors to prefer smaller, single-byte character sets that could only handle English and a few other languages of interest, but not the full panoply of human language. The International Organization for Standardization (ISO) has standardized 15 of these character sets as ISO standard 8859. For all of these single-byte character sets, characters 0 through 127 are identical to the ASCII character set, characters 128 through 159 are the C1 controls, and characters 160 through 255 are the additional characters needed for scripts such as Greek, Cyrillic, and Turkish.
ISO-8859-1 (Latin-1)
ASCII plus the accented letters and other characters needed for most Latin-alphabet Western European languages, including Danish, Dutch, Finnish, French, German, Icelandic, Italian, Norwegian, Portuguese, Spanish, and Swedish.
ISO-8859-2 (Latin-2)
ASCII plus the accented letters and other characters needed to write most Latin-alphabet Central and Eastern European languages, including Czech, German, Hungarian, Polish, Romanian, Croatian, Slovak, Slovenian, and Sorbian.
ISO-8859-3 (Latin-3)
ASCII plus the accented letters and other characters needed to write Esperanto, Maltese, and Turkish.
ISO-8859-4 (Latin-4)
ASCII plus the accented letters and other characters needed to write most Baltic languages, including Estonian, Latvian, Lithuanian, Greenlandic, and Lappish. Now deprecated. New applications should use 8859-10 (Latin-6) or 8859-13 (Latin-7) instead.
ISO-8859-5
ASCII plus the Cyrillic alphabet used for Russian and many other languages of the former Soviet Union and other Slavic countries, including Bulgarian, Byelorussian, Macedonian, Serbian, and Ukrainian.
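The shared layout of the ISO-8859 family can be checked with Python's built-in codecs. The sketch below is only a spot check on four parts of the standard, chosen arbitrarily: it confirms that bytes 0 through 159 decode identically and that a byte from the upper range does not.

```python
# Bytes 0-127 are ASCII and 128-159 are the C1 controls in every part.
ascii_and_c1 = "".join(chr(n) for n in range(160))
parts = ("iso8859_1", "iso8859_2", "iso8859_5", "iso8859_9")

shared_layout = all(
    bytes(range(160)).decode(name) == ascii_and_c1 for name in parts
)

# The same byte in the upper range means different things in different parts.
upper = {name: b"\xe9".decode(name) for name in parts}

print(shared_layout)
print(upper)
```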
Platform-Dependent Character Sets
In addition to the standard character sets discussed previously, many vendors have at one time or another produced proprietary character sets to meet the needs of their specific platform. Often, they contain special characters the vendor saw a need for, such as Apple's trademarked open apple or the box-drawing characters used for cell boundaries in early DOS spreadsheets. Microsoft, IBM, and Apple are the three most prolific inventors of character sets. The single most common such set is probably Microsoft's Cp1252, a variant of Latin-1 that replaces the C1 controls with more graphic characters. Hundreds of such platform-dependent character sets are in use today. Documentation for these ranges from excellent to nonexistent.
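The relationship between Cp1252 and Latin-1 is easy to observe with Python's built-in codecs. In this small sketch (the sample bytes are invented for the example), the same two bytes become curly quotation marks under Cp1252 and invisible C1 control characters under ISO-8859-1.

```python
# Bytes 0x93 and 0x94 fall in the range 128-159.
raw = b"\x93XML\x94"

as_cp1252 = raw.decode("cp1252")     # graphic characters: curly quotes
as_latin1 = raw.decode("iso8859_1")  # C1 control characters

print(repr(as_cp1252))
print(repr(as_latin1))
```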
Platform-specific character sets like these should be used only within a single system. They should never be placed on the wire or used to transfer data between systems. Doing so can lead to nasty surprises in unexpected places. For example, displaying a file that contains some of the extra Cp1252 characters ‰, ˆ, ƒ, „, †, …, ‡, œ, Œ, •, ‘, ’, “, ”, –, —, Ÿ, š, ™, and ˜ on a VT-220 terminal can effectively disable the screen. Nonetheless, these character sets are in common use and often seen on the Web, even when they don't belong there. There's no absolute rule that says you can't use them for an XML document, provided that you include the proper encoding declaration and your parser understands it. The one advantage to using these sets is that existing text editors are likely to be much more comfortable with them than with Unicode and its friends. Nonetheless, we strongly recommend that you don't use them and stick to the documented standards that are much more broadly supported across platforms.
The most common platform-dependent character set, and the one you're most likely to encounter on the Internet, is Cp1252, also (and incorrectly) known as
Converting Between Character Sets
The ultimate solution to this character set morass is to use Unicode in either UTF-16 or UTF-8 format for all your XML documents. An increasing number of tools support one of these two formats natively; even the unassuming Notepad offers an option to save files in Unicode in Windows NT 4.0, 2000, and XP. Microsoft Word 97 and later saves the text of its documents in Unicode, although unlike XML documents, Word files are hardly pure text. Much of the binary data in a Word file is not Unicode or any other kind of text. However, Word 2000 and later can actually save plain text files into Unicode. To save as plain Unicode text in Word 2000, select the format Encoded Text from the Save As Type: Choice menu in Word's Save As dialog box. Then select one of the four Unicode formats in the resulting File Conversion dialog box. In Word 2003, select the plain text format. When you save, Word will pop up a dialog box that prompts you for the encoding. Choose Other Encoding and then select one of the four Unicode formats in the list box on the right.
Most current tools are still adapted primarily for vendor-specific character sets that can't handle more than a few languages at one time. Thus, learning how to convert your documents from proprietary to more standard character sets is crucial.
Some of the better XML and HTML editors let you choose the character set you wish to save in and perform automatic conversions from the native character set you use for editing. On Unix, the native character set is likely one of the standard ISO character sets, and you can save into that format directly. On the Mac, you can avoid problems if you stick to pure ASCII documents. On Windows, you can go a little further and use Latin-1, if you're careful to stay away from the extra characters that aren't part of the official ISO-8859-1 specification. Otherwise, you'll have to convert your document from its native, platform-dependent encoding to one of the standard platform-independent character sets.
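When an editor offers no such option, the conversion itself is simple to script. The following Python sketch shows one possible approach with a hypothetical transcode helper; it only converts the bytes, so any encoding declaration inside the document would still have to be updated by hand.

```python
def transcode(data, source, target="utf-8"):
    """Decode bytes from the source character set and re-encode them in the target."""
    return data.decode(source).encode(target)


# Sample Cp1252 bytes: an accented letter plus two curly quotes.
cp1252_bytes = b"Caf\xe9 \x93Etten\x94"
utf8_bytes = transcode(cp1252_bytes, "cp1252")

print(utf8_bytes)
```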
The Default Character Set for XML Documents
Before an XML parser can read a document, it must know which character set and encoding the document uses. In some cases, external metainformation tells the parser what encoding the document uses. For instance, an HTTP header may include a Content-type header like this:
Content-type: text/html; charset=ISO-8859-1
However, XML parsers generally can't count on the availability of such information. Even if they can, they can't necessarily assume that it's accurate. Therefore, an XML parser will attempt to guess the character set based on the first several bytes of the document. The main checks the parser makes include the following:
  • If the first two bytes of the document are #xFEFF, then the parser recognizes the bytes as the Unicode byte-order mark. It then guesses that the document is written in the big-endian, UTF-16 encoding of Unicode. With that knowledge, it can read the rest of the document.
  • If the first two bytes of the document are #xFFFE, then the parser recognizes the little-endian form of the Unicode byte-order mark. It now knows that the document is written in the little-endian, UTF-16 encoding of Unicode, and with that knowledge it can read the rest of the document.
  • If the first four bytes of the document are #x3C3F786D, that is, the ASCII characters <?xm, then it guesses that the file is written in a superset of ASCII. In particular, it assumes that the file is written in the UTF-8 encoding of Unicode. Even if it's wrong, this information is sufficient to continue reading the document through the encoding declaration and find out what the character set really is.
Parsers that understand EBCDIC or UCS-4 may also apply similar heuristics to detect those encodings. However, UCS-4 isn't really used yet and is mostly of theoretical interest, and EBCDIC is a legacy family of character sets that shouldn't be used in new documents. Neither of these sets is important in practice.
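The three main checks listed above can be sketched as a small Python function. This is a simplified illustration rather than any real parser's detection code; the name guess_encoding is hypothetical, and the EBCDIC and UCS-4 cases are deliberately left out.

```python
def guess_encoding(first_bytes):
    """Guess an encoding from the first bytes of a document, or return None."""
    if first_bytes.startswith(b"\xfe\xff"):
        return "UTF-16BE"   # big-endian byte-order mark
    if first_bytes.startswith(b"\xff\xfe"):
        return "UTF-16LE"   # little-endian byte-order mark
    if first_bytes.startswith(b"<?xm"):
        # Some superset of ASCII; assume UTF-8 until the
        # encoding declaration says otherwise.
        return "UTF-8"
    return None


print(guess_encoding("\ufeff<?xml".encode("utf-16-be")))
print(guess_encoding("\ufeff<?xml".encode("utf-16-le")))
print(guess_encoding(b'<?xml version="1.0"?>'))
```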
Character References
Unicode contains more than 96,000 different characters covering almost all of the world's written languages. Predefining entity references for each of these characters, most of which will never be used in any one document, would impose an excessive burden on XML parsers. Rather than pick and choose which characters are worthy of being encoded as entities, XML goes to the other extreme. It predefines entity references only for characters that have special meaning as markup in an XML document: <, >, &, ", and '. All these are ASCII characters that are easy to type in any text editor.
For other characters that may not be accessible from an ASCII text editor, XML lets you use character references. A character reference gives the number of the particular Unicode character it stands for, in either decimal or hexadecimal. Decimal character references look like &#1114;; hexadecimal character references have an extra x after the &#; that is, they look like &#x45A;. Both of these references refer to the same character, њ, the Cyrillic small letter "nje" used in Serbian and Macedonian. For example, suppose you want to include the Greek maxim "σοφός έαυτόν γιγνώσκει" ("The wise man knows himself") in your XML document. However, you only have an ASCII text editor at your disposal. You can replace each Greek letter with the correct character reference, like this:
<maxim>
  &#x3C3;&#x3BF;&#x3C6;&#x3CC;&#x3C2;
  &#x3AD;&#x3B1;&#x3C5;&#x3C4;&#x3CC;&#x3BD;
  &#x3B3;&#x3B9;&#x3B3;&#x3BD;&#x3CE;&#x3C3;&#x3BA;&#x3B5;&#x3B9;
</maxim>
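Writing such references by hand is tedious, so a few lines of Python can generate them. The sketch below uses a hypothetical helper, to_char_refs, which replaces only non-ASCII characters and does not escape markup characters such as < and &. It also checks with xml.etree.ElementTree that the decimal and hexadecimal forms mentioned above resolve to the same character.

```python
import xml.etree.ElementTree as ET


def to_char_refs(text):
    """Replace every non-ASCII character with a hexadecimal character reference."""
    return "".join(ch if ord(ch) < 128 else f"&#x{ord(ch):X};" for ch in text)


# The first word of the maxim, spelled with escapes to keep the source ASCII.
sophos = "\u03c3\u03bf\u03c6\u03cc\u03c2"
refs = to_char_refs(sophos)

# Decimal 1114 and hexadecimal 45A name the same code point.
decimal = ET.fromstring("<c>&#1114;</c>").text
hexadecimal = ET.fromstring("<c>&#x45A;</c>").text

print(refs)
print(decimal == hexadecimal)
```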
To the XML processor, a document using character entity references referring to Unicode characters that don't exist in the current encoding is equivalent to a Unicode document in which all character references are replaced by the actual characters to which they refer. In other words, this XML document is the same as the previous one:
<maxim>
xml:lang
Since XML documents are written in Unicode, XML is an excellent choice for multilingual documents, such as an Arabic commentary on a Greek text (something that couldn't be done with almost any other character set). In such multilingual documents, it's useful to identify in which language a particular section of text is written. For instance, a spellchecker that only knows English shouldn't try to check a French quote.
Each XML element may have an xml:lang attribute that specifies the language in which the content of that element is written. For example, the previous maxim might look like this:
<maxim xml:lang="el">
  &#x3C3;&#x3BF;&#x3C6;&#x3CC;&#x3C2; &#x3AD;&#x3B1;&#x3C5;&#x3C4;&#x3CC;&#x3BD;
  &#x3B3;&#x3B9;&#x3B3;&#x3BD;&#x3CE;&#x3C3;&#x3BA;&#x3B5;&#x3B9;
</maxim>
This identifies it as Greek. The specific code used, el, comes from the Greek word for Greek, ελληνικά.
The value of the xml:lang language attribute should be one of the two-letter language codes defined in ISO-639, "Codes for the Representation of Names of Languages," found at http://lcweb.loc.gov/standards/iso639-2/langhome.html, if such a code exists for the language in question.
For languages that aren't listed in ISO-639, you can use a language identifier registered with IANA; currently, about 20 of these identifiers exist, including i-navajo, i-klingon, and i-lux. The complete list can be found at ftp://ftp.isi.edu/in-notes/iana/assignments/languages. All identifiers begin with i-. For example:
<maxim xml:lang="i-klingon">Heghlu'meH QaQ jajvam</maxim>
If the language you need still isn't present in these two lists, you can create your own language tag, as long as it begins with the prefix x- or X- to identify it as a user-defined language code. For example, the title of this journal is written in J. R. R. Tolkien's fictional Quenya language:
<journal xml:lang="x-quenya">Tyalië Tyelelliéva</journal>
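Software that wants to act on these attributes has to read them through the namespace bound to the xml prefix. The Python sketch below shows one way to do so with xml.etree.ElementTree; the wrapper element quotes and the element contents are placeholders invented for the example.

```python
import xml.etree.ElementTree as ET

# The xml prefix is permanently bound to this namespace URI.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

DOC = """<quotes>
  <maxim xml:lang="el">&#x3C3;&#x3BF;&#x3C6;&#x3CC;&#x3C2;</maxim>
  <maxim xml:lang="i-klingon">Heghlu'meH QaQ jajvam</maxim>
  <journal xml:lang="x-quenya">Tyalie Tyelellieva</journal>
  <note>No language given</note>
</quotes>"""

# Map each child element to its declared language, or None if undeclared.
languages = [(el.tag, el.get(XML_LANG)) for el in ET.fromstring(DOC)]
print(languages)
```

A spellchecker built on top of this could, for instance, skip every element whose language it does not know.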
Chapter 6: XML as a Document Format
XML is first and foremost a document format. It was always intended for web pages, books, scholarly articles, poems, short stories, reference manuals, tutorials, textbooks, legal pleadings, contracts, instruction sheets, and other documents that human beings would read. Its use as a syntax for computer data in applications such as order processing, object serialization, database exchange and backup, and electronic data interchange is mostly a happy accident.
Most computer programmers are better trained in working with the rigid structures one encounters in record-like applications than in the more free-form environment of an article or story. Most writers are more accustomed to the more free-form format of a book, story, or article. XML is perhaps unique in addressing the needs of both communities equally well. This chapter describes by both elucidation and example the structures encountered in narrative documents that are meant to be read by people instead of computers. Subsequent chapters will look at web pages in particular, then address technologies—such as XSLT, XLinks, and stylesheets—that are primarily intended for use with documents that will be read by human beings. Once we've done that, we'll look at XML as a format for more or less transitory data meant to be read by computers, rather than semipermanent documents intended for human consumption.
End of preview. The rest of this section is not available here.
SGML's Legacy
Preview
XML is a simplified form of the Standard Generalized Markup Language (SGML). The language that would eventually become SGML was invented by Charles F. Goldfarb, Ed Mosher, and Ray Lorie at IBM in the 1970s and developed by many people around the world until its eventual adoption as ISO standard 8879 in 1986. SGML was intended to solve many of the same problems XML solves, and in much the same way. It was and is a semantic and structural markup language for text documents. SGML is extremely powerful and achieved some success in the U.S. military and government, in the aerospace sector, and in other domains that needed ways of efficiently managing technical documents that were tens of thousands of pages long.
SGML's biggest success was HTML, which was and is an SGML application. However, HTML is just one SGML application. It does not have anything close to the full power of SGML itself. SGML has also been used to define many other document formats, including DocBook and TEI, both of which we'll discuss shortly.
However, SGML is complicated—very, very complicated. The official SGML specification is over 150 very technical pages. It covers many special cases and unlikely scenarios. It is so complex that almost no software has ever implemented it fully. Programs that implement or rely on different subsets of SGML are often incompatible. The special feature that one program considers essential is all too often considered extraneous fluff and omitted by the next program. Nonetheless, experience with SGML taught developers a lot about the proper design, implementation, and use of markup languages for a wide variety of documents. Much of that general knowledge applies equally well to XML.
One thing all this should make clear is that XML documents aren't just used on the Web. XML can easily handle the needs of publishing in a variety of media, including books, magazines, journals, newspapers, and pamphlets. XML is particularly useful when you need to publish the same information in several of these formats. By applying different stylesheets to the same source document, you can produce web pages, speaker's notes, camera-ready copy for printing, and more.
End of preview. The rest of this section is not available here.
Narrative Document Structures
Preview
All XML documents are trees. However, trees are very general-purpose data structures. If you've been formally trained in computer science (and very possibly even if you haven't been), you've encountered binary trees, red-black trees, balanced trees, B-trees, ordered trees, and more. However, when working with XML, it's highly unlikely that any given document matches any of these structures. Instead, XML documents are the most general sort of tree, with no particular restrictions on how nodes are ordered or how or which nodes are connected to which other nodes. Narrative XML documents are even less likely than record-like XML documents to have an identifiable structure beyond their mere treeness.
So what does a narrative-oriented XML document look like? Of course, there's a root element. All XML documents have one. Generally speaking, this root element represents the document itself. That is, if the document is a book, the root element is book. If the document is an article, the root element is article, and so on.
Beyond that, large documents are generally broken up into sections of some kind, perhaps chapters for a book, parts for an article, or claims for a legal brief. Most of the document consists of these primary sections. In some cases, there'll be several different kinds of sections; for instance, one for the table of contents, one for the index, and one for the chapters of a book.
Generally, the root element also contains one or more elements providing metainformation about the document—for example, the title of the work, the author of the document, the dates the document was written and last modified, and so forth. One common pattern is to place the metainformation in one child of the root element and the main content of the work in another. This is how HTML documents are written. The root element is html. The metainformation goes in a
End of preview. The rest of this section is not available here.
TEI
Preview
The Text Encoding Initiative (TEI, http://www.tei-c.org/ ) is an XML (originally SGML) application designed for the markup of classic literature, such as Vergil's Aeneid or the collected works of Thomas Jefferson. It's a prime example of a narrative-oriented DTD. Since TEI is designed for scholarly analysis of text rather than more casual reading or publishing, it includes elements not only for common document structures (chapter, scene, stanza, etc.) but also for typographical elements, grammatical structure, the position of illustrations on the page, and so forth. These aren't important to most readers, but they are important to TEI's intended audience of humanities scholars. For many academic purposes, one manuscript of the Aeneid is not necessarily the same as the next. Transcription errors and emendations made by various monks in the Middle Ages can be crucial.
Example 6-1 shows a fairly simple TEI document that uses the "Lite" version of TEI, a subset of full TEI that includes only the most commonly needed tags. The content comes from the book you're reading now. Although a complete TEI-encoded copy of this manuscript would be much longer, this simple example demonstrates the basic features of most TEI documents that represent books. (In addition to prose, TEI can also be used for plays, poems, missals, and essentially any written form of literature.)
Example 6-1. A TEI document
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<!DOCTYPE TEI.2 SYSTEM "xteilite.dtd">
<TEI.2>

  <teiHeader>
    <fileDesc>
      <titleStmt>
        <title>XML in a Nutshell</title>
        <author>Harold, Elliotte Rusty</author>
        <author>Means, W. Scott</author>
      </titleStmt>
      <publicationStmt><p></p></publicationStmt>
      <sourceDesc><p>Early manuscript draft</p></sourceDesc>
    </fileDesc>
  </teiHeader>

  <text id="HarXMLi">

    <front>
      <div type='toc'>
        <head>Table Of Contents</head>
        <list>
          <item>Introducing XML</item>
          <item>XML as a Document Format</item>
          <item>XML on the Web</item>
        </list>
      </div>

    </front>

    <body>

      <div1 type="chapter">
        <head>Introducing XML</head>
        <p></p>
      </div1>

      <div1 type="chapter">
        <head>XML as a Document Format</head>
        <p>
          XML is first and foremost a document format. It was always
          intended for web pages, books, scholarly articles, poems,
          short stories, reference manuals, tutorials, texts, legal
          pleadings, contracts, instruction sheets, and other documents
          that human beings would read. Its use as a syntax for computer
          data in applications like syndication, order processing,
          object serialization, database exchange and backup, electronic
          data interchange, and so forth is mostly a happy accident.
        </p>

        <div2 type="section">
          <head>SGML's Legacy</head>
          <p></p>
        </div2>

        <div2 type="section">
          <head>TEI</head>
          <p></p>
        </div2>

        <div2 type="section">
          <head>DocBook</head>
          <p>
            DocBook (<hi>http://www.docbook.org/</hi>) is an
            SGML application designed for new documents, not old ones.
            It's especially common in computer documentation. Several
            O'Reilly books have been written in DocBook including
            <bibl><author>Norm Walsh</author>'s <title>DocBook: The
            Definitive Guide</title></bibl>. Much of the <abbr
            expan='Linux Documentation Project'>LDP</abbr>
            (<hi>http://www.linuxdoc.org/</hi>) corpus is written in
            DocBook.
          </p>
        </div2>

      </div1>

      <div1 type="chapter">
        <head>XML on the Web</head>
        <p></p>
      </div1>

    </body>

    <back>
      <div1 type="index">
        <list>
          <head>INDEX</head>
          <item>SGML, 8, 89</item>
          <item>DocBook, 95-98</item>
          <item>TEI (Text Encoding Initiative), 92-95</item>
          <item>Text Encoding Initiative, See TEI</item>
        </list>
      </div1>
    </back>

  </text>
</TEI.2>
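Because the chapters and sections are explicit div1 and div2 elements, a program can recover the outline of a TEI document directly from its markup. The sketch below uses Python's standard ElementTree module on an abbreviated copy of Example 6-1; the DOCTYPE is omitted so the code runs without xteilite.dtd, and the outline function is a hypothetical helper, not part of any TEI toolkit.

```python
import xml.etree.ElementTree as ET

# Abbreviated, DOCTYPE-free stand-in for Example 6-1 (an assumption made
# so the sketch is self-contained).
TEI_SAMPLE = """<TEI.2>
  <teiHeader>
    <fileDesc>
      <titleStmt>
        <title>XML in a Nutshell</title>
        <author>Harold, Elliotte Rusty</author>
        <author>Means, W. Scott</author>
      </titleStmt>
    </fileDesc>
  </teiHeader>
  <text>
    <body>
      <div1 type="chapter">
        <head>Introducing XML</head>
      </div1>
      <div1 type="chapter">
        <head>XML as a Document Format</head>
        <div2 type="section"><head>SGML's Legacy</head></div2>
        <div2 type="section"><head>TEI</head></div2>
        <div2 type="section"><head>DocBook</head></div2>
      </div1>
    </body>
  </text>
</TEI.2>"""


def outline(tei_text):
    """Return (title, [(chapter head, [section heads])]) for the document."""
    root = ET.fromstring(tei_text)
    # Metainformation lives in the header, content in the text element.
    title = root.findtext("teiHeader/fileDesc/titleStmt/title")
    chapters = []
    for div1 in root.findall("text/body/div1[@type='chapter']"):
        sections = [div2.findtext("head") for div2 in div1.findall("div2")]
        chapters.append((div1.findtext("head"), sections))
    return title, chapters


title, chapters = outline(TEI_SAMPLE)
print(title)
for head, sections in chapters:
    print(" ", head, sections)
```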
End of preview. The rest of this section is not available here.
DocBook
Preview
DocBook (http://www.docbook.org/ ) is an SGML application designed for new documents, not old ones. It's especially common in computer documentation. Several O'Reilly books have been written in DocBook, including Norm Walsh and Leonard Muellner's DocBook: The Definitive Guide. No special tools are required to author it. Much of the Linux Documentation Project (LDP, http://www.linuxdoc.org/ ) corpus is written in DocBook. The current version of DocBook, 4.3, is available as both an SGML and an XML application. Example 6-2 shows a simple DocBook XML document based on the book you're reading now. Needless to say, the full version of this document would be much longer.
Example 6-2. A DocBook document
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<!DOCTYPE book PUBLIC "-//OASIS//DTD DocBook XML V4.3//EN"
                      "docbook/docbookx.dtd">
<book>
  <title>XML in a Nutshell</title>
  <bookinfo>
    <author>
      <firstname>Elliotte Rusty</firstname>
      <surname>Harold</surname>
    </author>
    <author>
      <firstname>W. Scott</firstname>
      <surname>Means</surname>
    </author>
  </bookinfo>

  <toc>
    <tocchap><tocentry>Introducing XML</tocentry></tocchap>
    <tocchap><tocentry>XML as a Document Format</tocentry></tocchap>
    <tocchap><tocentry>XML as a "better" HTML</tocentry></tocchap>
  </toc>

  <chapter>
    <title>Introducing XML</title>
    <para></para>
  </chapter>

  <chapter>
    <title>XML as a Document Format</title>

    <para>
      XML is first and foremost a document format. It was always intended
      for web pages, books, scholarly articles, poems, short stories,
      reference manuals, tutorials, texts, legal pleadings, contracts,
      instruction sheets, and other documents that human beings would
      read. Its use as a syntax for computer data in applications like
      syndication, order processing, object serialization, database
      exchange and backup, electronic data interchange, and so forth is
      mostly a happy accident.
    </para>

    <sect1>
      <title>SGML's Legacy</title>
      <para></para>
    </sect1>
    <sect1>
      <title>TEI</title>
      <para></para>
    </sect1>

    <sect1>
      <title>DocBook</title>
      <para>
        <ulink url="http://www.docbook.org/">DocBook</ulink>
        is an SGML application designed for new documents, not old ones.
        It's especially common in computer documentation. Several
        O'Reilly books have been written in DocBook including
        <citation>Norm Walsh and Leonard Muellner's
        <citetitle>DocBook: The Definitive
        Guide</citetitle></citation>. Much of the <ulink
        url="http://www.linuxdoc.org/">Linux Documentation Project
        (LDP)</ulink> corpus is written in DocBook. </para>
    </sect1>

  </chapter>

  <chapter>
    <title>XML on the Web</title>
    <para></para>
  </chapter>

  <index>
    <indexentry>
      <primaryie>SGML, 8,  89</primaryie>
    </indexentry>
    <indexentry>
      <primaryie>DocBook, 95-98</primaryie>
    </indexentry>
    <indexentry>
      <primaryie>TEI (Text Encoding Initiative), 92-95</primaryie>
    </indexentry>
    <indexentry>
      <primaryie>Text Encoding Initiative</primaryie>
      <seeie>TEI</seeie>
    </indexentry>
  </index>

</book>
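One practical consequence of DocBook's finer-grained markup is that author names and index cross-references are separate elements rather than punctuation inside a string. As a rough illustration, the following sketch reads them back out with Python's standard ElementTree module. The sample is an abbreviated copy of Example 6-2 without the DOCTYPE, and the authors and index_entries helpers are hypothetical names chosen for this example.

```python
import xml.etree.ElementTree as ET

# Abbreviated, DOCTYPE-free stand-in for Example 6-2 (an assumption made
# so the sketch runs without the DocBook DTD).
DOCBOOK_SAMPLE = """<book>
  <title>XML in a Nutshell</title>
  <bookinfo>
    <author>
      <firstname>Elliotte Rusty</firstname>
      <surname>Harold</surname>
    </author>
    <author>
      <firstname>W. Scott</firstname>
      <surname>Means</surname>
    </author>
  </bookinfo>
  <index>
    <indexentry>
      <primaryie>DocBook, 95-98</primaryie>
    </indexentry>
    <indexentry>
      <primaryie>Text Encoding Initiative</primaryie>
      <seeie>TEI</seeie>
    </indexentry>
  </index>
</book>"""


def authors(docbook_text):
    """Return author names as 'firstname surname' strings."""
    root = ET.fromstring(docbook_text)
    names = []
    for author in root.findall("bookinfo/author"):
        names.append("%s %s" % (author.findtext("firstname"),
                                author.findtext("surname")))
    return names


def index_entries(docbook_text):
    """Return (primary entry, see-reference or None) pairs."""
    root = ET.fromstring(docbook_text)
    return [(entry.findtext("primaryie"), entry.findtext("seeie"))
            for entry in root.findall("index/indexentry")]


print(authors(DOCBOOK_SAMPLE))
print(index_entries(DOCBOOK_SAMPLE))
```

In the TEI version of the same index, the cross-reference is only the text "See TEI" inside an item, so a program would have to parse the string to find it.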
End of preview. The rest of this section is not available here.
OpenOffice
Preview
While you can write markup by hand in a text editor, many non-programmers prefer a friendlier, more WYSIWYG approach. There's no reason a standard word processor can't save its data in XML, and indeed several now do, including Microsoft Word 2003 and OpenOffice.org Writer. Harold also wrote a much smaller book in XML using OpenOffice.org Writer (Effective XML, Addison-Wesley).
For what it's worth, in hindsight I regret that decision. If I were doing it again, I would write the XML by hand in DocBook, as I did with Processing XML with Java, rather than using OpenOffice. As much as good GUI tools can improve productivity, bad GUI tools can hinder it. A poorly designed GUI is no guarantee of ease of use.
Scott and I wrote this book in Microsoft Word, but mostly because the early editions predated the availability of high-quality XML publishing tools. That decision is hurting us now. For instance, the complicated tables in Chapter 27 are well beyond what Word can comfortably handle. In DocBook, they'd be a no-brainer. If we were starting from scratch, we'd write in DocBook.
Example 6-3 shows a fairly simple OpenOffice document. Again, the content comes from the book you're reading now. This differs from TEI and DocBook in several ways—for instance, it uses namespaces. TEI and DocBook don't. The title of the book and the names of the authors are not included because they'd normally be stored in a separate XML document containing only the metadata. Indexes and tables of contents are generated from the internal structure, content, and markup rather than being added explicitly. Perhaps the most unusual distinction is the lack of section elements of any kind. Instead, different chapters, sections, and subsections are identified by text:h elements with different levels. The contents of the section are everything that follows the text:h element until the next
End of preview. The rest of this section is not available here.
WordprocessingML
Preview
Beginning with Microsoft Office 2003 for Windows (but not Office 2004 for the Mac), Microsoft gave Word and the other Office components the ability to save all documents in XML, although by default it still picks a binary format. The XML application saved by Microsoft Word is named WordprocessingML. Unlike DocBook, TEI, and OpenOffice, all of which were designed from scratch without any legacy issues, WordprocessingML was designed more as an XML representation of an existing binary file format. This makes it a rather unusual example of a narrative document format. We would not recommend that you emulate its design in your own applications. Nonetheless, it can be educational to compare it to the other three formats.
Example 6-4 shows the same document as in the previous three examples, this time encoded in WordprocessingML. The WordprocessingML version seems the most opaque and cryptic of the four formats discussed in this chapter. This example makes it pretty obvious that XML is not magic pixie dust you can sprinkle on an existing format to create clean, legible, maintainable data.
The root element of a WordprocessingML document is w:wordDocument . Here, the w prefix is mapped to the namespace URI http://schemas.microsoft.com/office/word/2003/wordml. Several other namespaces are declared for different content that can be embedded in a Word file.
This root element can contain several different chunks of metadata. Here I've used three: o:DocumentProperties for basic metadata like author and title, a w:fonts element that lists the fonts used in the document and their metrics, and a w:styles element that lists the styles referenced in the document. All of these are optional. However, a document saved by Microsoft Word itself would include all of these and several more.
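To make the role of the namespace concrete, here is a small sketch in Python's standard ElementTree module that checks the root element and reports which of the optional w:fonts and w:styles chunks are present alongside w:body. The sample document is a hypothetical, heavily abbreviated one written for this illustration, not something Word would save, and describe is an invented helper name. The o:DocumentProperties element is left out because its namespace URI is not given in the text above.

```python
import xml.etree.ElementTree as ET

# Namespace URI that the text above gives for the w prefix.
W_NS = "http://schemas.microsoft.com/office/word/2003/wordml"

# Hypothetical minimal document; real files saved by Word declare more
# namespaces and carry far more metadata.
SAMPLE = """<w:wordDocument xmlns:w="%s">
  <w:fonts/>
  <w:styles/>
  <w:body/>
</w:wordDocument>""" % W_NS


def describe(xml_text):
    """Report which of w:fonts, w:styles, and w:body the root contains."""
    root = ET.fromstring(xml_text)
    # ElementTree exposes qualified names as {namespace-uri}local-name.
    if root.tag != "{%s}wordDocument" % W_NS:
        raise ValueError("root element is not w:wordDocument")
    prefixes = {"w": W_NS}
    present = {}
    for name in ("fonts", "styles", "body"):
        present[name] = root.find("w:" + name, prefixes) is not None
    return present


print(describe(SAMPLE))
```

The prefix itself is irrelevant to the check: a document that bound the same URI to a different prefix would be described identically.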
The actual content of the document is stored in a w:body element. The body is divided into sections (
End of preview. The rest of this section is not available here.
Document Permanence
Preview
XML documents that are intended for computers to read are often transitory. For instance, a SOAP document that represents a request to a Windows server running .NET exists for just as long as it takes the client to send it to the server and for the server to parse it into its internal data structures. After that's done, the document will be discarded. It probably won't be around for two minutes, much less two years. It's an ephemeral communication between two systems, with no more permanence than any of billions of other messages that computers exchange on a daily basis, most of which are never even written to disk, much less archived for posterity.
Some applications do store more permanent computer-oriented data in XML. For instance, XML is the native file format of the Gnumeric spreadsheet. On the other hand, this format is really only understood by Gnumeric and perhaps the other Gnome applications. It's designed to meet the specific needs of that one program. Exchanging data with other applications, including ones that haven't even been invented yet, is a secondary concern.
XML documents meant for humans tend to be more permanent and less software bound, however. If you encode the Declaration of Independence in XML, you want people to be able to read it in 2, 200, or 2,000 years. You also want them to be able to read it with any convenient tool, including ones not invented yet. These requirements have some important implications for both the XML applications you design to hold the data and the tools you use to read and write them.
The first rule is that the format should be very well documented. There should be a schema, and that schema should be very well commented. Furthermore, there should be a significant amount of prose documentation as well. Prose documentation can't substitute for the formal documentation of a schema, but it's an invaluable asset in understanding the schema.
End of preview. The rest of this section is not available here.
Transformation and Presentation
Preview
The markup in a typical XML document describes the document's structure, but it tends not to describe the document's presentation. That is, it says how the document is organized but not how it looks. Although XML documents are text, and a person could read them in native form if they really wanted to, much more commonly an XML document is rendered into some other format before being presented to a human audience. One of the key ideas of markup languages in general and XML in particular is that the input format need not be the same as the output format. To put it another way, what you see is not what you get, nor is it what you want to get. The input markup language is designed for the convenience of the writer. The output language is designed for the convenience of the reader.
Of course this requires a means of transforming the input format into the output format. Most XML documents undergo some kind of transformation before being presented to the reader. The transformation may be to a different XML vocabulary like XHTML or XSL-FO, or it may be to a non-XML format like PostScript or RTF.
XML's semiofficial transformation language is Extensible Stylesheet Language Transformations (XSLT). An XSLT document contains a list of template rules. Each template rule has a pattern noting which elements and other nodes it matches. An XSLT processor reads the input document. When it sees something in the input document that matches a template rule in the stylesheet, it outputs the template rule's template. The template can tell the processor which content from the input to include in the output. This allows, for example, the text of the output document to be the same while all the markup is changed. For instance, you could write a stylesheet that would transform DocBook documents into TEI documents. XSLT will be discussed in much more detail in Chapter 8.
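The idea of template rules can be imitated in a few lines of ordinary code. The sketch below is not XSLT and does not use an XSLT processor. It is a deliberately simplified Python analogy in which each rule's pattern is just an element name and each template is just the name of the element to output. The rule table, mapping a handful of DocBook elements to TEI ones, is an assumption made for illustration.

```python
import xml.etree.ElementTree as ET

# Pattern (input element name) -> template (output element name).
# Real XSLT patterns and templates are far richer than this table.
RULES = {
    "chapter": "div1",
    "sect1": "div2",
    "title": "head",
    "para": "p",
}


def transform(node):
    """Apply the matching rule to node, then recurse into its children.

    Elements with no rule keep their name, and attributes are dropped;
    both are simplifications of this sketch, not XSLT behavior.
    """
    out = ET.Element(RULES.get(node.tag, node.tag))
    out.text, out.tail = node.text, node.tail  # the text is left unchanged
    for child in node:
        out.append(transform(child))
    return out


DOCBOOK = ("<chapter><title>XML as a Document Format</title>"
           "<sect1><title>TEI</title><para>Text.</para></sect1></chapter>")

result = ET.tostring(transform(ET.fromstring(DOCBOOK)), encoding="unicode")
print(result)
# <div1><head>XML as a Document Format</head><div2><head>TEI</head><p>Text.</p></div2></div1>
```

Every character of text survives while every tag changes, which is the property the paragraph above describes.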
However, XSLT is not the only transformation language you can use with your XML documents. Other stylesheet languages such as the Document Style Sheet and Semantics Language (DSSSL,
End of preview. The rest of this section is not available here.
Chapter 7: XML on the Web
Preview
XML began as an effort to bring the full power and structure of SGML to the Web in a form that was simple enough for nonexperts to use. Like most great inventions, XML turned out to have uses far beyond what its creators originally envisioned. Indeed, there's a lot more XML off the Web than on it. Nonetheless, XML is still a very attractive language in which to write and serve web pages. Since XML documents must be well-formed and parsers must reject malformed documents, XML pages are less likely to have annoying cross-browser incompatibilities. Since XML documents are highly structured, they're much easier for robots to parse. Since XML element and attribute names reflect the nature of the content they hold, search-engine robots can more easily determine the true meaning of a page.
XML on the Web comes in three flavors. The first is XHTML, an XMLized variant of HTML 4.0 that tightens up HTML to match XML's syntax. For instance, XHTML requires that every start-tag have a matching end-tag and that all attribute values be quoted. XHTML also adds a few bits of syntax to HTML, such as the XML declaration and empty-element tags that end with />. Most of XHTML can be displayed quite well in legacy browsers, with a few notable exceptions.
The second flavor of XML on the Web is direct display of XML documents that use arbitrary vocabularies in web browsers. Generally, the formatting of the document is supplied either by a CSS stylesheet or by an XSLT stylesheet that transforms the document into HTML (perhaps XHTML). This flavor requires an XML-aware browser and is not supported by older web browsers such as Netscape 4.0.
A third option is to mix raw XML vocabularies, such as MathML and SVG, with XHTML using Modular XHTML. Modular XHTML lets you embed RDF cataloging information, MathML equations, SVG pictures, and more inside your XHTML documents. Namespaces sort out which elements belong to which applications.
End of preview. The rest of this section is not available here.
XHTML
Preview
XHTML is an official W3C recommendation. It defines an XML-compatible version of HTML, or rather it redefines HTML as an XML application instead of as an SGML application. Just looking at an XHTML document, you might not even realize that there's anything different about it. It still uses the same <p>, <li>, <table>, <h1>, and other tags you're familiar with. Elements and attributes have the same, familiar names they have in HTML. The syntax is still basically the same.
The difference is not so much what's allowed but what's not allowed. <p> is a valid XHTML tag, but <P> is not. <table border="0" width="515"> is legal XHTML; <table border=0 width=515> is not. A paragraph prefixed with a <p> and suffixed with a </p> is legal XHTML, but a paragraph that omits the closing </p> tag is not. Most existing HTML documents require substantial editing before they become well-formed and valid XHTML documents. However, once they are valid XHTML documents, they are automatically valid XML documents that can be manipulated with the same editors, parsers, and other tools you use to work with any XML document.
Most of the changes required to turn an existing HTML document into an XHTML document involve making the document well-formed. For instance, given a legacy HTML document, you'll probably have to make at least some of these changes to turn it into XHTML:
  • Add missing end-tags like </p> and </li>.
  • Rewrite elements so that they nest rather than overlap. For example, change <p><em>an emphasized paragraph</p></em> to <p><em>an emphasized paragraph</em></p>.
  • Put double or single quotes around attribute values. For example, change <p align=center> to <p align="center">.
  • Add values (which are the same as the name) to all minimized Boolean attributes. For example, change <input type="checkbox" checked> to <input type="checkbox" checked="checked">.
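Each of the changes listed above fixes a well-formedness error, so any XML parser can be used to find the places that still need work. A minimal sketch, using Python's standard ElementTree parser and an invented is_well_formed helper, is shown here. It checks well-formedness only, not validity against the XHTML DTDs.

```python
import xml.etree.ElementTree as ET


def is_well_formed(markup):
    """Return True if an XML parser accepts the markup, False otherwise."""
    try:
        ET.fromstring(markup)
        return True
    except ET.ParseError:
        return False


# Legacy HTML fragments from the list above, each paired with its repair.
CASES = [
    ("<ul><li>one<li>two</ul>",
     "<ul><li>one</li><li>two</li></ul>"),
    ("<p><em>an emphasized paragraph</p></em>",
     "<p><em>an emphasized paragraph</em></p>"),
    ("<p align=center>text</p>",
     '<p align="center">text</p>'),
    ('<input type="checkbox" checked>',
     '<input type="checkbox" checked="checked" />'),
]

for before, after in CASES:
    print(is_well_formed(before), is_well_formed(after), before)
```

The last repair also closes the input element with an empty-element tag, since a start-tag without a matching end-tag would itself be rejected.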
End of preview. The rest of this section is not available here.
Direct Display of XML in Browsers
Preview
Ultimately, one hopes that browsers will be able to display not just XHTML documents but any XML document as well. Since it's too much to ask that browsers provide semantics for all XML applications both current and yet-to-be-invented, stylesheets will be attached to each document to provide instructions about how each element will be rendered.
The current major stylesheet languages are:
  • Cascading Style Sheets Level 1 (CSS1)
  • Cascading Style Sheets Level 2 (CSS2)
  • XSL Transformations 1.0
Eventually, there will be more versions of these, including at least CSS 2.1, CSS Level 3, and XSLT 2.0. However, let's begin by looking at how and how well existing style languages are supported by existing browsers.
The stylesheet associated with a document is indicated by an xml-stylesheet processing instruction in the document's prolog, which comes after the XML declaration but before the root element start-tag. This processing instruction uses pseudo-attributes to describe the stylesheet (that is, they look like attributes but are not attributes because xml-stylesheet is a processing instruction and not an element).

Section 7.2.1.1: The required href and type pseudo-attributes

There are two required pseudo-attributes for xml-stylesheet processing instructions. The value of the href pseudo-attribute gives the URL, possibly relative, where the stylesheet can be found. The type pseudo-attribute value specifies the MIME media type of the stylesheet: text/css for cascading stylesheets or application/xml for XSLT stylesheets. In Example 7-3, the xml-stylesheet processing instruction tells browsers to apply the CSS stylesheet person.css to this document before showing it to the reader.
Example 7-3. An XML document associated with a stylesheet
<?xml version="1.0"?>
<?xml-stylesheet href="person.css" type="text/css"?>
<person>
  Alan Turing
</person>
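Because pseudo-attributes are not real attributes, an XML parser hands the application the processing instruction's data as one undivided string, and the application must split it up itself. The sketch below shows one way to do that with Python's standard minidom module and a regular expression. The stylesheets helper and the pattern are assumptions made for this illustration, and the pattern does not handle character or entity references inside the values.

```python
import re
from xml.dom import minidom

DOC = """<?xml version="1.0"?>
<?xml-stylesheet href="person.css" type="text/css"?>
<person>
  Alan Turing
</person>"""

# name="value" or name='value' pairs inside the instruction's data.
PSEUDO_ATTR = re.compile(r"""([\w-]+)\s*=\s*(?:"([^"]*)"|'([^']*)')""")


def stylesheets(xml_text):
    """Return one dict of pseudo-attributes per xml-stylesheet
    processing instruction found in the prolog."""
    doc = minidom.parseString(xml_text)
    found = []
    for node in doc.childNodes:
        if node.nodeType == node.ELEMENT_NODE:
            break  # the prolog ends where the root element starts
        if (node.nodeType == node.PROCESSING_INSTRUCTION_NODE
                and node.target == "xml-stylesheet"):
            attrs = {}
            for name, double_quoted, single_quoted in PSEUDO_ATTR.findall(node.data):
                attrs[name] = double_quoted or single_quoted
            found.append(attrs)
    return found


print(stylesheets(DOC))
```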
Microsoft Internet Explorer uses type="text/xsl" for XSLT stylesheets. However, the
End of preview. The rest of this section is not available here.
Authoring Compound Documents with Modular XHTML
Preview
XHTML 1.1 divides the three XHTML DTDs into individual modules. Parameter entities connect the modules by including or leaving out particular modules. Modules include:
Structure module, %xhtml-struct.module;
The absolute bare minimum of elements needed for an HTML document: html, head, title, and body
Text module, %xhtml-text.module;
The basic elements that contain text and other inline elements: abbr, acronym, address, blockquote, br, cite, code, dfn, div, em, h1, h2, h3, h4, h5, h6, kbd, p, pre, q, samp, span, strong, and var
Hypertext module, %xhtml-hypertext.module;
Elements used for linking, that is, the a element
List module, %xhtml-list.module;
Elements used for the three kinds of lists: dl, dt, dd, ul, ol, and li
Applet module, %xhtml-applet.module;
Elements needed for Java applets: applet and param
Presentation module, %xhtml-pres.module;
Presentation oriented markup, that is, the b, big, hr, i, small, sub, sup, and tt elements
Edit module, %xhtml-edit.module;
Elements for revision tracking: del and ins
Bi-Directional Text module, %xhtml-bdo.module;
An indication of directionality when text in left-to-right languages, like English and French, is mixed with text in right-to-left languages, like Hebrew and Arabic
Basic Forms module, %xhtml-basic-form.module;
Forms as defined in HTML 3.2 using the form, input, select, option, and textarea elements
Forms module, %xhtml-form.module;
Forms as defined in HTML 4.0 using the form, input, select, option, textarea, button, fieldset, label, legend, and optgroup elements
Basic Tables module, %xhtml-basic-table.module;
Minimal table support including only the table, caption, th, tr, and td elements
Tables module, %xhtml-table.module;
More complete table support including not only the table, caption, th, tr, and td elements, but also the col, colgroup, tbody, thead, and tfoot elements
Image module, %xhtml-image.module;
The img element
Client-Side Image Map module, %xhtml-csismap.module;
The map and area elements, as well as extra attributes for several other elements needed to support client-side image maps
End of preview. The rest of this section is not available here.
Prospects for Improved Web Search Methods
Preview
Part of the hype of XML has been that web search engines will finally understand what a document means by looking at its markup. For instance, you can search for the movie Sneakers and just get back hits about the movie without having to sort through "Internet Wide Area `Tiger Teamers' mailing list," "Children's Side Zip Sneakers Recalled by Reebok," "Infant's `Little Air Jordan' Sneakers Recalled by Nike," "Sneakers.com—Athletic shoes from Nike, Reebok, Adidas, Fila, New," and the 32,395 other results that Google pulled up on this search that had nothing to do with the movie.
In practice, this is still vapor, mostly because few web pages are available on the frontend in XML, even though more and more backends are XML. The search-engine robots only see the frontend HTML. As this slowly changes, and as the search engines get smarter, we should see more and more useful results. Meanwhile, it's possible to add some XML hints to your HTML pages that knowledgeable search engines can take advantage of using the Resource Description Framework (RDF), the Dublin Core, and the robots processing instruction.
The Resource Description Framework (RDF, http://www.w3.org/RDF/ ) can be understood as an XML encoding for a particularly simple data model. An RDF document describes resources using triples. Each triple says that a resource has a property with a value. Resources are identified by URIs. Properties can be identified by URIs or by element-qualified names. The value can be a string of plain text, a chunk of XML, or another resource identified by a URI.
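As an informal illustration of the triple model, the following Python sketch reads a small RDF document and lists its (resource, property, value) triples. It assumes the common RDF/XML layout in which each resource is described by an rdf:Description element with an rdf:about attribute. The example.com URL, the choice of Dublin Core properties, and the triples helper are all assumptions introduced for this example, and the code handles only string values and rdf:resource references.

```python
import xml.etree.ElementTree as ET

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
DC_NS = "http://purl.org/dc/elements/1.1/"

# Hypothetical description of a book; the resource URL is illustrative.
SAMPLE = """<rdf:RDF xmlns:rdf="%s" xmlns:dc="%s">
  <rdf:Description rdf:about="http://www.example.com/nutshell/">
    <dc:title>XML in a Nutshell</dc:title>
    <dc:creator>Elliotte Rusty Harold</dc:creator>
    <dc:creator>W. Scott Means</dc:creator>
  </rdf:Description>
</rdf:RDF>""" % (RDF_NS, DC_NS)


def triples(rdf_text):
    """Return (resource URI, property name, value) for each property."""
    root = ET.fromstring(rdf_text)
    result = []
    for description in root.findall("{%s}Description" % RDF_NS):
        resource = description.get("{%s}about" % RDF_NS)
        for prop in description:
            # The value is either another resource or a plain string.
            value = prop.get("{%s}resource" % RDF_NS)
            if value is None:
                value = (prop.text or "").strip()
            # prop.tag is the property's qualified name, {uri}local-name.
            result.append((resource, prop.tag, value))
    return result


for triple in triples(SAMPLE):
    print(triple)
```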
The root element of an RDF document is an RDF element. Each resource the RDF element describes is represented as a Description element whose about attribute contains a URI pointing to the resource described. Each child element of the Description element represents a property of the resource. The contents of that child element are the value of that property. All RDF elements like
End of preview. The rest of this section is not available here.
Chapter 8: XSL Transformations (XSLT)
Preview
The Extensible Stylesheet Language (XSL) is divided into two parts: XSL Transformations (XSLT) and XSL Formatting Objects (XSL-FO). This chapter describes XSLT. Chapter 14 covers XSL-FO.
XSLT is an XML application for specifying rules by which one XML document is transformed into another XML document. An XSLT document—that is, an XSLT stylesheet—contains template rules. Each template rule has a pattern and a template. An XSLT processor compares the elements and other nodes in an input XML document to the template-rule patterns in a stylesheet. When one matches, it writes the template from that rule into the output tree. When it's done, it may further serialize the output tree into an XML document or some other format like plain text or HTML.
This chapter describes the template rules and a few other elements that appear in an XSLT stylesheet. XSLT uses the XPath syntax to identify matching nodes. We'll introduce a few pieces of XPath here, but most of it will be covered in Chapter 9.
End of preview. The rest of this section is not available here.
An Example Input Document
To demonstrate XSL Transformations, we first need a document to transform. Example 8-1 shows the document used in this chapter. The root element is people, which contains two person elements. The person elements have roughly the same structure (a name followed by professions and hobbies) with some differences. For instance, Alan Turing has three professions, but Richard Feynman only has one. Feynman has a middle_initial and a hobby, but Turing doesn't. Still, these are clearly variations on the same basic structure. A DTD that permitted both of these would be easy to write.
Example 8-1. An XML document describing two people
<?xml version="1.0"?>
<people>
  <person born="1912" died="1954">
    <name>
      <first_name>Alan</first_name>
      <last_name>Turing</last_name>
    </name>
    <profession>computer scientist</profession>
    <profession>mathematician</profession>
    <profession>cryptographer</profession>
  </person>
  <person born="1918" died="1988">
    <name>
      <first_name>Richard</first_name>
      <middle_initial>P</middle_initial>
      <last_name>Feynman</last_name>
    </name>
    <profession>physicist</profession>
    <hobby>Playing the bongoes</hobby>
  </person>
</people>
Example 8-1 is an XML document. For purposes of this example, it will be stored in a file called people.xml. It doesn't have a DTD, but that doesn't matter here: XSLT works equally well with valid and invalid (but well-formed) documents. This document doesn't use namespaces either, although it could; XSLT works just fine with namespaces. Unlike DTDs, XSLT pays attention to namespace URIs rather than prefixes. Thus, it's possible to use one prefix for an element in the input document and different prefixes for the same namespace in the stylesheet and output documents.
End of preview. The rest of this section is not available here.
xsl:stylesheet and xsl:transform
An XSLT stylesheet is an XML document. It can and generally should have an XML declaration. It can have a document type declaration, although most stylesheets do not. The root element of this document is either stylesheet or transform ; these are synonyms for each other, and you can use either. They both have the same possible children and attributes. They both mean the same thing to an XSLT processor.
The stylesheet and transform elements, like all other XSLT elements, are in the http://www.w3.org/1999/XSL/Transform namespace. This namespace is customarily mapped to the xsl prefix so that you write xsl:transform or xsl:stylesheet rather than simply transform or stylesheet.
This namespace URI must be exactly correct. If even so much as a single character is wrong, the stylesheet processor will output the stylesheet itself instead of either the input document or the transformed input document. There's a reason for this (see Section 2.3 of the XSLT 1.0 specification, Literal Result Element as Stylesheet, if you really want to know), but the bottom line is that this weird behavior looks very much like a bug in the XSLT processor if you're not expecting it. If you ever do see your stylesheet processor spitting your stylesheet back out at you, the problem is almost certainly an incorrect namespace URI.
In addition to the xmlns:xsl attribute declaring this prefix mapping, the root element must have a version attribute with the value 1.0. Thus, a minimal XSLT stylesheet, with only the root element and nothing else, is as shown in Example 8-2.
Example 8-2. A minimal XSLT stylesheet
<?xml version="1.0"?>
<xsl:stylesheet version="1.0"
                xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

</xsl:stylesheet>
Perhaps a little surprisingly, this is a complete XSLT stylesheet; an XSLT processor can apply it to an XML document to produce an output document. Example 8-3 shows the effect of applying this stylesheet to Example 8-1.
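Because stylesheet and transform are synonyms, the minimal stylesheet of Example 8-2 can equally be written with transform as the root element. The sketch below also swaps the customary xsl prefix for t, purely to illustrate that the processor cares about the namespace URI and not the prefix; the choice of t is arbitrary.

```xml
<?xml version="1.0"?>
<!-- Same meaning as Example 8-2: the root element is transform rather
     than stylesheet, and the prefix is t rather than the usual xsl. -->
<t:transform version="1.0"
             xmlns:t="http://www.w3.org/1999/XSL/Transform">

</t:transform>
```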
End of preview. The rest of this section is not available here.
Stylesheet Processors
An XSLT processor is a piece of software that reads an XSLT stylesheet, reads an XML document, and builds an output document by applying the instructions in the stylesheet to the information in the input document. An XSLT processor can be built into a web browser, just as MSXML is in Internet Explorer 6. It can be built into a web or application server, as in the Apache XML Project's Cocoon (http://xml.apache.org/cocoon). Or it can be a standalone program run from the command line like Michael Kay's SAXON (http://saxon.sourceforge.net) or the Apache XML Project's Xalan (http://xml.apache.org/xalan-j).
Internet Explorer 5.0 and 5.5 partially support a very old and out-of-date working draft of XSLT, as well as various Microsoft extensions to this old working draft. They do not support XSLT 1.0, and indeed no XSLT stylesheets in this book work in IE5. Stylesheets that are meant for Microsoft XSLT can be identified by their use of the http://www.w3.org/TR/WD-xsl namespace. IE6 supports both http://www.w3.org/1999/XSL/Transform and http://www.w3.org/TR/WD-xsl. Good XSLT developers don't use http://www.w3.org/TR/WD-xsl and don't associate with developers who do.
The exact details of how to install, configure, and run the XSLT processor naturally vary from processor to processor. Generally, you have to install the processor in your path, or add its jar file to your class path if it's written in Java. Then you pass in the names of the input file, stylesheet file, and output file on the command line. For example, using Xalan, Example 8-3 is created in this fashion:
% java org.apache.xalan.xslt.Process -IN people.xml -XSL minimal.xsl -OUT 8-3.txt
= = = = = = = = = Parsing file:D:/books/xian/examples/08/minimal.xsl = = = = = = = = = =
Parse of file:D:/books/xian/examples/08/minimal.xsl took 771 milliseconds
= = = = = = = =  = Parsing people.xml = = = = = = = = = =
Parse of people.xml took 90 milliseconds
= = = = = = = = = = = = = = = = = = = = = = = = = = = = =
Transforming...
transform took 20 milliseconds
XSLProcessor: done
End of preview. The rest of this section is not available here.
Templates and Template Rules
To control what output is created from what input, you add template rules to the XSLT stylesheet. Each template rule is represented by an xsl:template element. This element has a match attribute that contains a pattern identifying the input it matches; it also contains a template that is instantiated and output when the pattern is matched. The terminology is a little tricky here: the xsl:template element is a template rule that contains a template. An xsl:template element is not itself the template.
The simplest match pattern is an element name. Thus, this template rule says that every time a person element is seen, the stylesheet processor should emit the text "A Person":
<xsl:template match="person">A Person</xsl:template>
Example 8-4 is a complete stylesheet that uses this template rule.
Example 8-4. An XSLT stylesheet with a match pattern
<?xml version="1.0"?>
<xsl:stylesheet version="1.0"
                xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <xsl:template match="person">A Person</xsl:template>

</xsl:stylesheet>
Applying this stylesheet to the document in Example 8-1 produces this output:
<?xml version="1.0" encoding="utf-8"?>

 A Person

 A Person
There were two person elements in the input document. Each time the processor saw one, it emitted the text "A Person". The whitespace outside the person elements was preserved, but everything inside the person elements was replaced by the contents of the template rule, which is called the template.
The text "A Person" is called literal data characters, which is a fancy way of saying plain text that is copied from the stylesheet into the output document. A template may also contain literal result elements, i.e., markup that is copied from the stylesheet to the output document. For instance, Example 8-5 wraps the text "A Person" in between <p> and </p> tags.
Example 8-5. A simple XSLT stylesheet with literal result elements
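The listing for Example 8-5 is not included in this preview. Going by the description above, a stylesheet of roughly this shape would wrap the literal text in a p element. Treat it as a sketch consistent with that description rather than the book's exact listing.

```xml
<?xml version="1.0"?>
<xsl:stylesheet version="1.0"
                xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <!-- p is a literal result element: it is not in the XSLT namespace,
       so the processor copies it to the output along with its content. -->
  <xsl:template match="person">
    <p>A Person</p>
  </xsl:template>

</xsl:stylesheet>
```

Because the stylesheet is itself an XML document, every literal result element in a template has to be well-formed, with a matching end-tag or empty-element syntax.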
End of preview. The rest of this section is not available here.
Calculating the Value of an Element with xsl:value-of
Most of the time, the text that is output is more closely related to the text that is input than it was in the last couple of examples. Other XSLT elements can select particular content from the input document and insert it into the output document.
One of the most generally useful elements of this kind is xsl:value-of . This element calculates the string value of an XPath expression and inserts it into the output. The value of an element is the text content of the element after all the tags have been removed and entity and character references have been resolved. The element whose value is taken is identified by a select attribute containing an XPath expression.
For example, suppose you just want to extract the names of all the people in the input document. Then you might use a stylesheet like Example 8-6. Here the person template outputs only the value of the name child element of the matched person in between <p> and </p> tags.
Example 8-6. A simple XSLT stylesheet that uses xsl:value-of
<?xml version="1.0"?>
<xsl:stylesheet version="1.0"
                xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <xsl:template match="person">
    <p>
      <xsl:value-of select="name"/>
    </p>
  </xsl:template>

</xsl:stylesheet>
When an XSLT processor applies this stylesheet to Example 8-1, it outputs this text:
<?xml version="1.0" encoding="utf-8"?>

  <p>
      Alan
      Turing
    </p>

  <p>
      Richard
      P
      Feynman
    </p>
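The select attribute is not limited to a single child element name. As a sketch, the variation below outputs only the last name and the born attribute of each person. The path name/last_name and the @born syntax are XPath constructs that the book covers in more detail later. The wording "(born ...)" is an illustrative choice.

```xml
<?xml version="1.0"?>
<xsl:stylesheet version="1.0"
                xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <xsl:template match="person">
    <p>
      <!-- String value of the last_name element inside the person's name -->
      <xsl:value-of select="name/last_name"/>
      <!-- Literal text, then the value of the person's born attribute -->
      (born <xsl:value-of select="@born"/>)
    </p>
  </xsl:template>

</xsl:stylesheet>
```

Applied to Example 8-1, this should give one paragraph per person reading, apart from whitespace, "Turing (born 1912)" and "Feynman (born 1918)".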
End of preview. The rest of this section is not available here.
Applying Templates with xsl:apply-templates
By default, an XSLT processor reads the input XML document from top to bottom, starting at the root of the document and working its way down using preorder traversal. Template rules are activated in the order in which they match elements encountered during this traversal. This means a template rule for a parent will be activated before template rules matching the parent's children.
However, one of the things a template can do is change the order of traversal. That is, it can specify which element or elements should be processed next. It can specify that elements should be processed in the middle of processing another element. It can even prevent particular elements from being processed. In fact, Examples 8-4 through 8-6 all implicitly prevent the child elements of each person element from being processed. Instead, they provide their own instructions about what the XSLT processor is and is not to do with those children.
The xsl:apply-templates element makes the processing order explicit. Its select attribute contains an XPath expression telling the XSLT processor which nodes to process at that point in the output tree.
For example, suppose you wanted to list the names of the people in the input document; however, you want to put the last names first, regardless of the order in which they occur in the input document, and you don't want to output the professions or hobbies. First you need a name template that looks like this:
<xsl:template match="name">
  <xsl:value-of select="last_name"/>,
  <xsl:value-of select="first_name"/>
</xsl:template>
However, this alone isn't enough; if this were all there was in the stylesheet, not only would the output include the names, it would also include the professions and hobbies. You also need a person template rule that says to apply templates to name children only, but not to any other child elements like profession or hobby. This template rule does that:
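The rule itself falls outside this preview. A minimal sketch consistent with the description gives xsl:apply-templates a select attribute, so that only name children are processed. It is shown here together with the name rule from above as a complete stylesheet.

```xml
<?xml version="1.0"?>
<xsl:stylesheet version="1.0"
                xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <!-- Last name first, as in the name rule shown earlier -->
  <xsl:template match="name">
    <xsl:value-of select="last_name"/>,
    <xsl:value-of select="first_name"/>
  </xsl:template>

  <!-- Process only the name children of each person. The profession and
       hobby children are never visited, so they produce no output. -->
  <xsl:template match="person">
    <xsl:apply-templates select="name"/>
  </xsl:template>

</xsl:stylesheet>
```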
End of preview. The rest of this section is not available here.
The Built-in Template Rules
There are seven kinds of nodes in an XML document: the root node, element nodes, attribute nodes, text nodes, comment nodes, processing instruction nodes, and namespace nodes. XSLT provides a default built-in template rule for each of these seven kinds of nodes that says what to do with that node if the stylesheet author has not provided more specific instructions. These rules use special wildcard patterns to match all nodes of a given type. Together these template rules have major effects on which nodes are activated when.
The most basic built-in template rule copies the value of text and attribute nodes into the output document. It looks like this:
<xsl:template match="text( )|@*">
  <xsl:value-of select="."/>
</xsl:template>
The text( ) node test is a pattern matching all text nodes, just as first_name is a pattern matching all first_name element nodes. @* is a pattern matching all attribute nodes. The vertical bar combines these two patterns so that the template rule matches both text and attribute nodes. The rule's template says that whenever a text or attribute node is matched, the processor should output the value of that node. For a text node, this value is simply the text in the node. For an attribute, this value is the attribute value but not the name.
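For comparison, the XSLT 1.0 specification (Section 5.8, Built-in Template Rules) defines the remaining built-in rules along the following lines. The first recurses into the children of the root node and of every element; the second matches processing instructions and comments and outputs nothing. They are written out here only for illustration, since a processor behaves as if they were present without your adding them.

```xml
<?xml version="1.0"?>
<xsl:stylesheet version="1.0"
                xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <!-- Elements and the root node: keep walking down to the children -->
  <xsl:template match="*|/">
    <xsl:apply-templates/>
  </xsl:template>

  <!-- Processing instructions and comments: produce no output -->
  <xsl:template match="processing-instruction()|comment()"/>

</xsl:stylesheet>
```

Taken together with the text and attribute rule, this is why a stylesheet with no template rules of its own still outputs the text content of the input. Attribute values do not appear in that case: an xsl:apply-templates element without a select attribute visits only child nodes, and attributes are not children of their element.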
Example 8-10 is an XSLT stylesheet that pulls the birth and death dates out of the born and died attributes in Example 8-1. The default template rule for attributes takes the value of the attributes, but an explicit rule selects those values. The @ sign in @born and @died indicates that these are attributes of the matched element rather than child elements.
Example 8-10. An XSLT stylesheet that reads born and died attributes
<?xml version="1.0"?>
<xsl:stylesheet version="1.0"
                xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <xsl:template match="people">
    <html>
      <head><title>Famous Scientists</title></head>
      <body>
        <dl>
          <xsl:apply-templates/>
        </dl>
      </body>
    </html>
  </xsl:template>

  <xsl:template match="person">
    <dt><xsl:apply-templates select="name"/></dt>
    <dd><ul>
      <li>Born: <xsl:apply-templates select="@born"/></li>
      <li>Died: <xsl:apply-templates select="@died"/></li>
    </ul></dd>
  </xsl:template>

</xsl:stylesheet>
End of preview. The rest of this section is not available here.
Modes
Sometimes the same input content needs to appear multiple times in the output document, formatted according to a different template each time. For instance, the titles of the chapters in a book would be formatted one way in the chapters themselves and a different way in the table of contents. Both xsl:apply-templates and xsl:template elements can have optional mode attributes that connect different template rules to different positions. A mode attribute on an xsl:template element identifies in which mode that template rule should be activated. An xsl:apply-templates element with a mode attribute only activates template rules with matching mode attributes. Example 8-12 demonstrates with a stylesheet that begins the output document with a list of people's names. This is accomplished in the toc mode. Then a separate template rule, as well as a separate xsl:apply-templates element in the default mode (really no mode at all), outputs the complete contents of all person elements.
Example 8-12. A stylesheet that uses modes
<?xml version="1.0"?>
<xsl:stylesheet version="1.0"
                xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <xsl:template match="people">
    <html>
      <head><title>Famous Scientists</title></head>
      <body>
        <ul><xsl:apply-templates select="person" mode="toc"/></ul>
        <xsl:apply-templates select="person"/>
      </body>
    </html>
  </xsl:template>

  <!-- Table of Contents Mode Templates -->
  <xsl:template match="person" mode="toc">
    <xsl:apply-templates select="name" mode="toc"/>
  </xsl:template>

  <xsl:template match="name" mode="toc">
    <li><xsl:value-of select="last_name"/>,
    <xsl:value-of select="first_name"/></li>
  </xsl:template>

  <!-- Normal Mode Templates -->
  <xsl:template match="person">
    <p><xsl:apply-templates/></p>
  </xsl:template>

</xsl:stylesheet>
Example 8-13 shows the output when this stylesheet is applied to
End of preview. The rest of this section is not available here.
Attribute Value Templates
It's easy to include known attribute values in the output document as the literal content of a literal result element. For example, this template rule wraps each input person element in an HTML span element that has a class attribute with the value person:
<xsl:template match="person">
  <span class="person"><xsl:apply-templates/></span>
</xsl:template>
However, it's trickier if the value of the attribute is not known when the stylesheet is written, but instead must be read from the input document. The solution is to use an attribute value template. An attribute value template is an XPath expression enclosed in curly braces that's placed in the attribute value in the stylesheet. When the processor outputs that attribute, it replaces the attribute value template with its value. For example, suppose you want to write a name template that changes the input name elements to empty elements with first, initial, and last attributes like this:
<name first="Richard" initial="P" last="Feynman"/>
This template accomplishes that task:
<xsl:template match="name">
  <name first="{first_name}"
        initial="{middle_initial}"
        last="{last_name}" />
</xsl:template>
The value of the first attribute in the stylesheet is replaced by the value of the first_name element from the input document. Likewise, the value of the initial attribute is replaced by the value of the middle_initial element, and the value of the last attribute is replaced by the value of the last_name element.
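An attribute value template can also mix literal text with several expressions inside one attribute. The sketch below is a variation on the rule above, applied to person elements from Example 8-1. The summary element and its label attribute are hypothetical names chosen for illustration, not part of the book's examples.

```xml
<?xml version="1.0"?>
<xsl:stylesheet version="1.0"
                xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <!-- Each {...} is an XPath expression evaluated relative to the matched
       person element. Text outside the braces is copied literally. -->
  <xsl:template match="person">
    <summary label="{name/last_name}, {name/first_name} ({@born}-{@died})"/>
  </xsl:template>

</xsl:stylesheet>
```

For the Richard Feynman element, this should produce label="Feynman, Richard (1918-1988)". When an expression selects nothing, as {middle_initial} does for Alan Turing in the name rule, it contributes an empty string, so that rule still emits an initial attribute, with an empty value.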
End of preview. The rest of this section is not available here.
XSLT and Namespaces
Match patterns, as well as select expressions, identify elements based on their local part and namespace URI. They do not consider the namespace prefix. Most commonly, the same namespace prefix is mapped to the same URI in both the input XML document and the stylesheet. However, this is not required. For instance, consider Example 8-14. This is exactly the same as Example 8-1, except that now all the elements have been placed in the namespace http://www.cafeconleche.org/namespaces/people.
Example 8-14. An XML document that uses a default namespace
<?xml version="1.0"?>
<people xmlns="http://www.cafeconleche.org/namespaces/people">

  <person born="1912" died="1954">
    <name>
      <first_name>Alan</first_name>
      <last_name>Turing</last_name>
    </name>
    <profession>computer scientist</profession>
    <profession>mathematician</profession>
    <profession>cryptographer</profession>
  </person>

  <person born="1918" died="1988">
    <name>
      <first_name>Richard</first_name>
      <middle_initial>P</middle_initial>
      <last_name>Feynman</last_name>
    </name>
    <profession>physicist</profession>
    <hobby>Playing the bongoes</hobby>
  </person>

</people>
Except for the built-in template rules, none of the rules in this chapter so far will work on this document! For instance, consider this template rule from Example 8-8:
<xsl:template match="name">
  <p><xsl:value-of select="last_name"/>,
  <xsl:value-of select="first_name"/></p>
</xsl:template>
It's trying to match a name element in no namespace, but the name elements in Example 8-14 aren't in no namespace. They're in the http://www.cafeconleche.org/namespaces/people namespace. This template rule no longer applies. To make it fit, we map the prefix pe to the namespace URI http://www.cafeconleche.org/namespaces/people. Then instead of matching name, we match pe:name. That the input document doesn't use the prefix pe is irrelevant as long as the namespace URIs match up. Example 8-15 demonstrates by rewriting Example 8-8 to work with Example 8-14 instead.
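Example 8-15 is not part of this preview. The essential change can be sketched as follows: declare the pe prefix on the stylesheet's root element, then use it on every element name in the match and select attributes. The exclude-result-prefixes attribute is an addition that the text above does not mention; it is a standard XSLT 1.0 attribute that keeps the otherwise unused pe declaration from being copied onto the p elements in the output.

```xml
<?xml version="1.0"?>
<xsl:stylesheet version="1.0"
                xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
                xmlns:pe="http://www.cafeconleche.org/namespaces/people"
                exclude-result-prefixes="pe">

  <!-- pe:name matches name elements in the people namespace, whatever
       prefix (or default namespace) the input document uses. -->
  <xsl:template match="pe:name">
    <p><xsl:value-of select="pe:last_name"/>,
    <xsl:value-of select="pe:first_name"/></p>
  </xsl:template>

</xsl:stylesheet>
```

The prefix is needed inside the select attributes as well. An unprefixed name in an XPath expression always means an element in no namespace, even if the stylesheet declares a default namespace.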
End of preview. The rest of this section is not available here.
Other XSLT Elements
This is hardly everything there is to say about XSLT. Indeed, XSLT does a lot more than the little we've covered in this introductory chapter. Other features yet to be discussed include:
  • Named templates
  • Numbering and sorting output elements
  • Conditional processing
  • Iteration
  • Extension elements and functions
  • Importing other stylesheets
These and more will all be covered in Chapter 24. Since XSLT is itself Turing complete and since it can invoke extension functions written in other languages like Java, chances are very good you can use XSLT to make whatever transformations you need to make.
Furthermore, besides these additional elements, you can do a lot more simply by expanding the XPath expressions and patterns used in the select and match attributes of the elements with which you're already familiar. These techniques will be explored in Chapter 9.
However, the techniques outlined in this chapter do lay the foundation for all subsequent, more advanced work with XSLT. The key to transforming XML documents with XSLT is to match templates to elements in the input document. Those templates contain both literal result data and XSLT elements that instruct the processor where to get more data. Everything you do with XSLT is based on this one simple idea.
End of preview. The rest of this section is not available here.
Chapter 9: XPath
XPath is a non-XML language for identifying particular parts of XML documents. XPath lets you write expressions that refer to, for example, the first person element in a document, the seventh child element of the third person element, the ID attribute of the first person element whose contents are the string "Fred Jones", all xml-stylesheet processing instructions in the document's prolog, and so forth. XPath indicates nodes by position, relative position, type, content, and several other criteria. XSLT uses XPath expressions to match and select particular elements in the input document for copying into the output document or further processing. XPointer uses XPath expressions to identify the particular point in or part of an XML document to which an XLink links. The W3C XML Schema Language uses XPath expressions to define uniqueness and identity constraints. XForms relies on XPath to bind form controls to instance data, express constraints on user-entered values, and calculate values that depend on other values.
XPath expressions can also represent numbers, strings, or Booleans. This lets XSLT stylesheets carry out simple arithmetic for purposes such as numbering and cross-referencing figures, tables, and equations. String manipulation in XPath lets XSLT perform tasks such as making the title of a chapter uppercase in a headline or extracting the last two digits from a year.
End of preview. The rest of this section is not available here.
The Tree Structure of an XML Document
An XML document is a tree made up of nodes. Some nodes contain one or more other nodes. There is exactly one root node, which ultimately contains all other nodes. XPath is a language for picking nodes and sets of nodes out of this tree. From the perspective of XPath, there are seven kinds of nodes:
  • The root node
  • Element nodes
  • Text nodes
  • Attribute nodes
  • Comment nodes
  • Processing-instruction nodes
  • Namespace nodes
Note the constructs that are not included in this list: CDATA sections, entity references, and document type declarations. XPath operates on an XML document after all these items have been merged into the document. For instance, XPath cannot identify the first CDATA section in a document or tell whether a particular attribute value was directly included in the source element start-tag or merely defaulted from the declaration of the element in a DTD.
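As a small illustration, the three middle_initial elements below all look identical to XPath: each is an element node containing a single text node whose value is P. The variants wrapper element is hypothetical and exists only so that the listing is a well-formed document.

```xml
<?xml version="1.0"?>
<variants>
  <!-- Literal character -->
  <middle_initial>P</middle_initial>
  <!-- Character reference, resolved by the parser before XPath sees it -->
  <middle_initial>&#x50;</middle_initial>
  <!-- CDATA section, merged into ordinary text -->
  <middle_initial><![CDATA[P]]></middle_initial>
</variants>
```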
Consider the document in Example 9-1. This exhibits all seven kinds of nodes. Figure 9-1 is a diagram of the tree structure of this document.
Example 9-1. The example XML document used in this chapter
<?xml version="1.0"?>
<?xml-stylesheet type="application/xml" href="people.xsl"?>
<!DOCTYPE people [
 <!ATTLIST homepage xlink:type CDATA #FIXED "simple"
                  xmlns:xlink CDATA #FIXED "http://www.w3.org/1999/xlink">
 <!ATTLIST person id ID #IMPLIED>
]>
<people>

  <person born="1912" died="1954" id="p342">
    <name>
      <first_name>Alan</first_name>
      <last_name>Turing</last_name>
    </name>
    <!-- Did the word computer scientist exist in Turing's day? -->
    <profession>computer scientist</profession>
    <profession>mathematician</profession>
    <profession>cryptographer</profession>
    <homepage xlink:href="http://www.turing.org.uk/"/>
  </person>

  <person born="1918" died="1988" id="p4567">
    <name>
      <first_name>Richard</first_name>
      <middle_initial>&#x50;</middle_initial>
      <last_name>Feynman</last_name>
    </name>
    <profession>physicist</profession>
    <hobby>Playing the bongoes</hobby>
  </person>

</people>
End of preview. The rest of this section is not available here.
Location Paths
The most useful XPath expression is a location path. A location path identifies a set of nodes in a document. This set may be empty, may contain a single node, or may contain several nodes. These can be element nodes, attribute nodes, namespace nodes, text nodes, comment nodes, processing-instruction nodes, root nodes, or any combination of these. A location path is built out of successive location steps. Each location step is evaluated relative to a particular node in the document called the context node .
The simplest location path is the one that selects the root node of the document. This is simply the forward slash /. (You'll notice that a lot of XPath syntax is deliberately similar to the syntax used by the Unix shell. In the shell, / is the root of the filesystem; in XPath, / is the root node of an XML document.) For example, this XSLT template rule uses the XPath pattern / to match the entire input document tree and wrap it in an html element:
<xsl:template match="/">
  <html><xsl:apply-templates/></html>
</xsl:template>
/ is an absolute location path because no matter what the context node is—that is, no matter where the processor was in the input document when this template rule was applied—it always means the same thing: the root node of the document. It is relative to which document you're processing, but not to anything within that document.
The second simplest location path is a single element name. This path selects all child elements of the context node with the specified name. For example, the XPath profession refers to all profession child elements of the context node. Exactly which elements these are depends on what the context node is, so this is a relative XPath. For example, if the context node is the Alan Turing person element in Example 9-1, then the location path profession refers to these three
End of preview. The rest of this section is not available here.
Compound Location Paths
The XPath expressions you've seen so far—element names, @ plus an attribute name, /, comment( ), text( ), and processing-instruction( )—are all single location steps. You can combine these with the forward slash to move around the hierarchy from the matched node to other nodes. Furthermore, you can use a period to refer to the context node, a double period to refer to the parent node, and a double forward slash to refer to descendants of the context node. With the exception of //, these are all similar to Unix shell syntax for navigating a hierarchical filesystem.
Location steps can be combined with a forward slash (/) to make a compound location path. Each step in the path is relative to the one that preceded it. If the path begins with /, then the first step in the path is relative to the root node. Otherwise, it's relative to the context node. For example, consider the XPath expression /people/person/name/first_name. This begins at the root node, then selects all people element children of the root node, then all person element children of those nodes, then all name children of those nodes, and finally all first_name children of those nodes. Applied to Example 9-1, it indicates these two elements:
<first_name>Alan</first_name>
<first_name>Richard</first_name>
To indicate only the textual content of those two nodes, we have to go one step further. The XPath expression /people/person/name/first_name/text( ) selects the strings "Alan" and "Richard" from Example 9-1.
These two XPath expressions both began with /, so they're absolute location paths that start at the root. Relative location paths can also count down from the context node. For example, the XPath expression person/@id selects the id attributes of the person child elements of the context node.
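As a rough illustration of the compound paths above, the following sketch uses Python's standard xml.etree.ElementTree module, which implements only a subset of XPath. The sample document is a hypothetical stand-in for Example 9-1, which is not reproduced in this excerpt, and the id value p342 is invented for the illustration. ElementTree does not accept absolute paths on an element, has no text( ) node test, and cannot select attribute nodes, so the sketch starts from the people element as the context node and reads text and attributes through the element API instead.

```python
import xml.etree.ElementTree as ET

# Hypothetical stand-in for Example 9-1 (an assumption, not the book's listing).
PEOPLE_XML = """<people>
  <person born="1912" died="1954" id="p342">
    <name>
      <first_name>Alan</first_name>
      <last_name>Turing</last_name>
    </name>
    <profession>computer scientist</profession>
    <profession>mathematician</profession>
    <profession>cryptographer</profession>
  </person>
  <person born="1918" died="1988" id="p4567">
    <name>
      <first_name>Richard</first_name>
      <last_name>Feynman</last_name>
    </name>
    <profession>physicist</profession>
  </person>
</people>"""

# The people element plays the role of the context node.
people = ET.fromstring(PEOPLE_XML)

# Relative compound path: person/name/first_name
first_name_elements = people.findall("person/name/first_name")

# ElementTree has no text() node test, so read the .text property instead.
first_names = [element.text for element in first_name_elements]

# person/@id: select the person children, then read the attribute from each.
person_ids = [person.get("id") for person in people.findall("person")]

print(first_names)
print(person_ids)
```

A full XPath 1.0 engine would accept /people/person/name/first_name/text( ) and person/@id directly; the workarounds here only reflect the limits of the standard library module.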
A double forward slash (//) selects from all descendants of the context node, as well as the context node itself. At the beginning of an XPath expression, it selects from all of the nodes in the document. For example, the XPath expression
End of preview. The rest of this section is not available here.
Predicates
In general, an XPath expression may refer to more than one node. Sometimes this is what you want, but sometimes you want to further winnow the node-set. You want to select only some of the nodes the expression returns. Each step in a location path may (but does not have to) have a predicate that selects from the node-set current at that step in the expression. The predicate contains a Boolean expression, which is tested for each node in the context node list. If the expression is false, then that node is deleted from the list. Otherwise, it's retained.
For example, suppose you want to find all profession elements whose value is "physicist". The XPath expression //profession[. = "physicist"] does this. You can use single quotes around the string instead of double quotes, which is often useful when the XPath expression appears inside a double-quoted attribute value, for example, <xsl:apply-templates select="//profession[.= 'physicist']" />.
If you want to ask for all person elements that have a profession child element with the value "physicist", you'd use the XPath expression //person[profession="physicist"]. If you want to find the person element with id p4567, put an @ in front of the name of the attribute, as in //person[@id="p4567"].
In addition to the equals sign, XPath supports a full complement of relational operators, including <, >, >=, <=, and !=. For instance, the expression //person[@born<=1976] locates all person elements in the document with a born attribute whose numeric value is less than or equal to 1976. Note that if this expression is used inside an XML document, you still have to escape the less-than sign as &lt;, for example, <xsl:apply-templates select="//person[@born &lt;= 1976]"/>. XPath doesn't get any special exemptions from the normal well-formedness rules of XML. However, if the XPath expression appears outside of an XML document, as it may in some uses of XPointer, you may not need to escape the less-than sign.
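The predicates discussed above can be tried out with a small sketch in Python's xml.etree.ElementTree. The sample document and the helper name born_on_or_before are assumptions made for this illustration. ElementTree understands equality predicates on text and attributes (the [.='text'] form needs Python 3.7 or later) but not relational operators such as <=, so the numeric comparison is done in ordinary Python.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample data, loosely modeled on the people document in the text.
PEOPLE_XML = """<people>
  <person born="1912" id="p342">
    <profession>mathematician</profession>
  </person>
  <person born="1918" id="p4567">
    <profession>physicist</profession>
  </person>
</people>"""

people = ET.fromstring(PEOPLE_XML)

# //profession[. = "physicist"]
physicist_professions = people.findall(".//profession[.='physicist']")

# //person[profession="physicist"]
physicists = people.findall(".//person[profession='physicist']")

# //person[@id="p4567"]
person_p4567 = people.findall(".//person[@id='p4567']")


def born_on_or_before(root, year):
    """Emulate //person[@born<=year]; ElementTree has no <= in predicates."""
    return [
        person
        for person in root.findall(".//person")
        if person.get("born") is not None and float(person.get("born")) <= year
    ]


print([p.get("id") for p in physicists])
print([p.get("id") for p in born_on_or_before(people, 1915)])
```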
End of preview. The rest of this section is not available here.
Unabbreviated Location Paths
Up until this point, we've been using what are called abbreviated location paths . These are easy to type, not usually verbose, and very familiar to most people. They're also the kind of XPath expression that works best for XSLT match patterns. However, XPath also offers an unabbreviated syntax for location paths, which is more verbose but perhaps less cryptic and definitely more flexible than abbreviated location paths.
Every location step in a location path has two required parts, an axis and a node test, and one optional part, the predicates. The axis tells you which direction to travel from the context node to look for the next nodes. The node test tells you which nodes to include along that axis, and the predicates further reduce the nodes according to some expression.
In an abbreviated location path, the axis and the node test are combined, while in an unabbreviated location path, they're separated by a double colon (::). For example, the abbreviated location path people/person/@id is composed of three location steps. The first step selects people element nodes along the child axis. The second step selects person element nodes along the child axis. The third step selects id attribute nodes along the attribute axis. When rewritten using the unabbreviated syntax, the same location path is child::people/child::person/attribute::id.
These full, unabbreviated location paths may be absolute if they start from the root node, just as abbreviated paths can be. The full form /child::people/child::person, for example, is equivalent to the abbreviated form /people/person.
Unabbreviated location paths may be used in predicates as well. For example, the abbreviated path /people/person[@born<1950]/name[first_name="Alan"] becomes /child::people/child::person[ attribute::born < 1950 ] /child::name[ child::first_name = "Alan" ] in the full form.
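The correspondence between the two syntaxes is mechanical enough to sketch in a few lines of Python. The function names below are invented for this illustration, and the sketch deliberately handles only simple steps: it rejects paths that contain // or predicates rather than guessing at them.

```python
def expand_step(step):
    """Rewrite one abbreviated location step in unabbreviated form."""
    if step == ".":
        return "self::node()"
    if step == "..":
        return "parent::node()"
    if step.startswith("@"):
        return "attribute::" + step[1:]
    # Any other step (an element name, *, text(), comment(), ...) is on the child axis.
    return "child::" + step


def expand_path(path):
    """Expand a simple abbreviated location path, step by step."""
    if "//" in path or "[" in path:
        raise ValueError("this sketch handles only simple steps without // or predicates")
    absolute = path.startswith("/")
    steps = [step for step in path.split("/") if step]
    expanded = "/".join(expand_step(step) for step in steps)
    return "/" + expanded if absolute else expanded


print(expand_path("people/person/@id"))
print(expand_path("/people/person"))
```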
Overall, the unabbreviated form is quite verbose and not used much in practice. However, it does offer one crucial ability that makes it essential to know: it is the only way to access most of the axes from which XPath expressions can choose nodes. The abbreviated syntax lets you walk along the child, parent, self, attribute, and descendant-or-self axes. The unabbreviated syntax adds eight more axes:
End of preview. The rest of this section is not available here.
General XPath Expressions
So far we've focused on the very useful subset of XPath expressions called location paths. Location paths identify a set of nodes in an XML document and are used in XSLT match patterns and select expressions. However, location paths are not the only possible type of XPath expression. XPath expressions can also return numbers, Booleans, and strings. For instance, these are all legal XPath expressions:
  • 3.141529
  • 2+2
  • 'Rosalind Franklin'
  • true( )
  • 32.5 < 76.2
  • position( )=last( )
XPath expressions that aren't node-sets can't be used in the match attribute of an xsl:template element. However, they can be used as values for the select attribute of xsl:value-of elements, as well as in the location path predicates.
There are no pure integers in XPath. All numbers are 8-byte, IEEE 754 floating-point doubles, even if they don't have an explicit decimal point. This format is identical to Java's double primitive type. In addition to representing floating-point numbers ranging from 4.94065645841246544e-324 to 1.79769313486231570e+308 (positive or negative) and 0, this type includes special representations of positive and negative infinity and a special not a number (NaN) value used as the result of operations like dividing zero by zero.
XPath provides the five basic arithmetic operators that will be familiar to any programmer:
+ Addition
- Subtraction
* Multiplication
div Division
mod Taking the remainder
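The behavior of div and mod on IEEE 754 doubles can be approximated in Python, whose float type is also an IEEE 754 double. This is only a sketch: the function names xpath_div and xpath_mod are made up here. Two differences from plain Python arithmetic motivate the helpers. Python raises an exception on division by zero where IEEE 754 arithmetic produces an infinity or NaN, and Python's % operator takes the sign of the divisor, whereas a truncating remainder (math.fmod) takes the sign of the dividend.

```python
import math


def xpath_div(a, b):
    """Divide as IEEE 754 doubles, returning infinities or NaN instead of raising."""
    a, b = float(a), float(b)
    if math.isnan(a) or math.isnan(b):
        return math.nan
    if b == 0.0:
        if a == 0.0:
            return math.nan  # zero divided by zero is not a number
        sign = math.copysign(1.0, a) * math.copysign(1.0, b)
        return math.copysign(math.inf, sign)
    return a / b


def xpath_mod(a, b):
    """Remainder of a truncating division; the result takes the sign of a."""
    a, b = float(a), float(b)
    if math.isnan(a) or math.isnan(b) or math.isinf(a) or b == 0.0:
        return math.nan
    return math.fmod(a, b)


print(xpath_div(6.5, 1.5))
print(xpath_mod(6.5, 1.5))
print(xpath_mod(-5, 2), -5 % 2)  # truncating remainder versus Python's %
```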
The more common forward slash couldn't be used for division because it's already used to separate location steps in a location path. Consequently, a new operator had to be chosen, div. The word mod was chosen instead of the more common % operator to calculate the remainder. Aside from these minor differences in syntax, all five operators behave exactly as they do in Java. For instance, 2+2 is 4, 6.5 div 1.5 is 4.33333333, 6.5 mod 1.5 is 0.5, and so on. The element
End of preview. The rest of this section is not available here.
XPath Functions
XPath provides many functions that you may find useful in predicates or raw expressions. All of these are discussed in Chapter 23. For example, the position() function returns the position of the current node in the context node list as a number. This XSLT template rule uses the position( ) function to calculate the number of the person being processed, relative to other nodes in the context node list:
<xsl:template match="person">

  Person <xsl:value-of select="position( )"/>,

  <xsl:value-of select="name"/>

</xsl:template>
Each XPath function returns one of these four types:
  • Boolean
  • Number
  • Node-set
  • String
There are no void functions in XPath; it is not nearly as strongly typed as languages like Java or even C. You can often use any of these types as a function argument regardless of which type the function expects, and the processor will convert it as best it can. For example, if you insert a Boolean where a string is expected, then the processor will substitute one of the two strings "true" or "false" for the Boolean. The one exception is functions that expect to receive node-sets as arguments. XPath cannot convert strings, Booleans, or numbers to node-sets.
Functions are identified by the parentheses at the end of the function name. Sometimes functions take arguments between the parentheses. For instance, the round() function takes a single number as an argument. It returns the number rounded to the nearest integer. For example, <xsl:value-of select="round(3.14)"/> inserts 3 into the output tree.
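To make the rounding and conversion rules concrete, here is a small Python sketch. The names xpath_round and boolean_to_string are hypothetical helpers, not part of any library. Python's built-in round( ) rounds halves to the nearest even integer, so the sketch uses floor(x + 0.5), which sends halves toward positive infinity as XPath 1.0 specifies for round( ). Signed-zero details are ignored.

```python
import math


def xpath_round(x):
    """Round to the nearest integer, with halves going toward positive infinity."""
    x = float(x)
    if math.isnan(x) or math.isinf(x):
        return x  # NaN and the infinities are returned unchanged
    return float(math.floor(x + 0.5))


def boolean_to_string(value):
    """Convert a Boolean to one of the two strings "true" or "false"."""
    return "true" if value else "false"


print(xpath_round(3.14))
print(xpath_round(2.5), round(2.5))  # the built-in round() gives 2 here
print(boolean_to_string(3 > 2))
```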
Other functions take more than one argument. For instance, the starts-with( ) function takes two arguments, both strings. It returns true if the first string starts with the second string. For example, this XSLT apply-templates element selects all name elements whose last name begins with the letter T:
<xsl:apply-templates select="name[starts-with(last_name, 'T')]"/>
In this example the first argument to the
End of preview. The rest of this section is not available here.
Chapter 10: XLinks
XLinks are an attribute-based syntax for attaching links to XML documents. XLinks can be simple Point A-to-Point B links, like the links you're accustomed to from HTML's A element. XLinks can also be bidirectional, linking two documents in both directions so you can go from A to B or B to A. XLinks can even be multidirectional, presenting many different paths between any number of XML documents. The documents don't have to be XML documents—XLinks can be placed in an XML document that lists connections between other documents that may or may not be XML documents themselves. Web graffiti artists take note: these third-party links let you attach links to pages you don't even control, like the home page of the New York Times or the C.I.A. At its core, XLink is nothing more and nothing less than an XML syntax for describing directed graphs, in which the vertices are documents at particular URIs and the edges are the links between the documents. What you put in that graph is up to you.
Current web browsers at most support simple XLinks that do little more than duplicate the functionality of HTML's A element. Many browsers, including Internet Explorer, don't support XLinks at all. However, custom applications may do a lot more. Since XLinks are so powerful, it shouldn't come as a surprise that they can do more than make blue underlined links on web pages. XLinks can describe tables of contents or indexes. They can connect textual emendations to the text they describe. They can indicate possible paths through online courses or virtual worlds. Different applications will interpret different sets of XLinks differently. Just as no one browser really understands the semantics of all the various XML applications, so too no one program can process all collections of XLinks.
End of preview. The rest of this section is not available here.
Simple Links
A simple link defines a one-way connection between two resources. The source or starting resource of the connection is the link element itself. The target or ending resource of the connection is identified by a Uniform Resource Identifier (URI). The link goes from the starting resource to the ending resource. The starting resource is always an XML element. The ending resource may be an XML document, a particular element in an XML document, a group of elements in an XML document, a span of text in an XML document, or something that isn't a part of an XML document, such as an MPEG movie or a PDF file. The URI may be something other than a URL, perhaps a book ISBN number like urn:isbn:1565922247.
A simple XLink is encoded in an XML document as an element of arbitrary type that has an xlink:type attribute with the value simple and an xlink:href attribute whose value is the URI of the link target. The xlink prefix must be mapped to the http://www.w3.org/1999/xlink namespace URI. As usual, the prefix can change as long as the URI stays the same. For example, suppose this novel element appears in a list of children's literature and we want to link it to the actual text of the novel available from the URL ftp://archive.org/pub/etext/etext93/wizoz10.txt:
<novel>

  <title>The Wonderful Wizard of Oz</title>

  <author>L. Frank Baum</author>

  <year>1900</year>

</novel>
We give the novel element an xlink:type attribute with the value simple, an xlink:href attribute that contains the URL to which we're linking, and an xmlns:xlink attribute that associates the prefix xlink with the namespace URI http://www.w3.org/1999/xlink like so:
<novel xmlns:xlink = "http://www.w3.org/1999/xlink"

       xlink:type = "simple"

       xlink:href = "ftp://archive.org/pub/etext/etext93/wizoz10.txt">

  <title>The Wonderful Wizard of Oz</title>

  <author>L. Frank Baum</author>

  <year>1900</year>

</novel>
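An application that wants to follow such a link has to look for the attributes by namespace URI rather than by prefix. The following Python sketch shows one way to do that with xml.etree.ElementTree, which exposes namespaced attributes under keys of the form {namespace-URI}local-name. The helper name simple_link_target and the alternative prefix xl are assumptions chosen for the example.

```python
import xml.etree.ElementTree as ET

XLINK_NS = "http://www.w3.org/1999/xlink"


def simple_link_target(element):
    """Return the link target of a simple XLink element, or None if it is not one."""
    if element.get("{%s}type" % XLINK_NS) != "simple":
        return None
    return element.get("{%s}href" % XLINK_NS)


novel = ET.fromstring("""<novel xmlns:xlink="http://www.w3.org/1999/xlink"
       xlink:type="simple"
       xlink:href="ftp://archive.org/pub/etext/etext93/wizoz10.txt">
  <title>The Wonderful Wizard of Oz</title>
</novel>""")

# The same link written with a different prefix mapped to the same URI.
novel_other_prefix = ET.fromstring("""<novel xmlns:xl="http://www.w3.org/1999/xlink"
       xl:type="simple"
       xl:href="ftp://archive.org/pub/etext/etext93/wizoz10.txt">
  <title>The Wonderful Wizard of Oz</title>
</novel>""")

# An element with no XLink attributes is not a link at all.
plain_novel = ET.fromstring("<novel><title>The Wonderful Wizard of Oz</title></novel>")

print(simple_link_target(novel))
print(simple_link_target(novel_other_prefix))
print(simple_link_target(plain_novel))
```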
End of preview. The rest of this section is not available here.
Link Behavior
So far, we've been careful to talk in the abstract. We've said that an XLink describes a connection between two resources, but we haven't said much about how that connection is presented to the end user or what it makes software reading the document do. That's because there isn't one answer to these questions. For instance, when the browser encounters a novel element that uses an http URL, clicking the link should probably load the text of the novel from the URL into the current window, thereby replacing the document that contained the link. Then again, maybe it should open a new window and show the user the new document in that window. The proper behavior for a browser encountering the novel element that uses an isbn URN is even less clear. Perhaps it should reserve the book with the specified ISBN at the local library for the user to walk in and pick up. Or perhaps it should order the book from an online bookstore. In other cases something else entirely may be called for. For instance, the content of some links is embedded directly in the linking document, as in this image element:
<image width="248" height="173" xlink:type="simple"

       xlink:href="http://www.turing.org.uk/turing/pi1/sark.jpg" />
Here, the author most likely intends the browser to download and display the image as soon as it finds the link. And rather than opening a new window for the image or replacing the current document with the image, the image should be embedded into the current document.
Just as XML is more flexible than HTML in the documents it describes, so too is XLink more flexible in the links it describes. An XLink indicates that there's a connection between two documents, but it's up to the application reading the XLink to decide what that connection means. It's not necessarily a blue, underlined phrase the user clicks in a browser to jump from the source document to the target. It may indeed be that, just as an XML document may be a web page, but it may be something else, too.
End of preview. The rest of this section is not available here.
Link Semantics
A link describes a connection between two resources. These resources may or may not be XML documents; but even if they are XML documents, the relationships they have with each other can be quite varied. For example, links can indicate parent-child relationships, previous-next relationships, employer-employee relationships, customer-supplier relationships, and many more. XLink elements can have xlink:title and xlink:role attributes to specify the meaning of the connection between the resources. The xlink:title attribute contains a small amount of plain text describing the remote resource such as might be shown in a tool tip when the user moves the cursor over the link. The xlink:role attribute contains a URI that somehow indicates the meaning of the link. For instance, the URI http://www.isi.edu/in-notes/iana/assignments/media-types/text/css might be understood to mean that the link points to a CSS stylesheet for the document in which the link is found. However, there are no standards for the meanings of role URIs. Applications are free to assign their own meaning to their own URIs.
For example, this book element is a simple XLink that points to Scott's author page at O'Reilly. The xlink:title attribute contains his name, while the xlink:role attribute contains the URI for the Dublin Core creator property, thereby indicating he's an author of this book.
<book xlink:type="simple"

 xlink:href="http://www.oreillynet.com/cs/catalog/view/au/751"

 xlink:title="W. Scott Means"

 xlink:role="http://purl.org/dc/elements/1.1/creator" >

  XML in a Nutshell

</book>
As with almost everything else related to XLink, exactly what browsers or other applications will do with this information or how they'll present it to readers remains to be determined.
End of preview. The rest of this section is not available here.
Extended Links
Whereas a simple link describes a single unidirectional connection between one XML element and one remote resource, an extended link describes a collection of resources and a collection of paths between those resources. Each path connects exactly two resources. Any individual resource may be connected to one of the other resources, two of the other resources, zero of the other resources, all of the other resources, or any subset of the other resources in the collection. It may even be connected back to itself. In computer science terms, an extended link is a directed, labeled graph in which the paths are arcs, the documents are vertices, and the URIs are labels.
Simple links are very easy to understand by analogy with HTML links. However, there's no obvious analogy for extended links. What they look like, how applications treat them, what user interfaces present them to people, is all up in the air. No simple visual metaphors like "click on the blue underlined text to jump to a new page" have been invented for extended links, and no browsers support them. How they'll be used and what user interfaces will be designed for them remains to be seen.
In XML, an extended link is represented by an extended link element; that is, an element of arbitrary type that has an xlink:type attribute with the value extended. For example, this is an extended link element that refers to the novel The Wonderful Wizard of Oz:
<novel xlink:type="extended">

  <title>The Wonderful Wizard of Oz</title>

  <author>L. Frank Baum</author>

  <year>1900</year>

</novel>
Although this extended link is quite spartan, most extended links contain local resources, remote resources, and arcs between those resources. A remote resource is represented by a locator element, which is an element of any type that has an xlink:type attribute with the value locator. A local resource is represented by a resource element, which is an element of any type that has an
End of preview. The rest of this section is not available here.
Linkbases
One of the most revolutionary features of XLinks is the ability to define links between documents you don't control. For instance, Example 10-1 is an extended link that describes and links three documents that neither of the authors of this book has anything to do with. Links between purely remote resources are called third-party links . A third-party link is created when an arc's xlink:from and xlink:to attributes both contain labels for locator elements. Links from a remote resource to a local resource are called inbound links . An inbound link is created when an arc's xlink:from attribute contains the label of a locator element and its xlink:to attribute contains the label of a resource element. Links from a local resource to a remote resource are called outbound links . An outbound link is established when an arc's xlink:from attribute contains the label of a resource element and its xlink:to attribute contains the label of a locator element. Simple links are also outbound links.
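The three kinds of link can be told apart mechanically from the arc's endpoints. The Python sketch below does so for a hypothetical extended link; Example 10-1 is not reproduced in this excerpt, so the element names, labels, and URLs in the sample are invented. It assumes each xlink:label names exactly one locator or resource element, and it reports "local" (a term chosen here, not taken from the text) for arcs between two resource elements.

```python
import xml.etree.ElementTree as ET

XLINK_NS = "http://www.w3.org/1999/xlink"

# Hypothetical extended link; names and URLs are made up for the illustration.
SAMPLE = """<series xmlns:xlink="http://www.w3.org/1999/xlink" xlink:type="extended">
  <summary xlink:type="resource" xlink:label="intro">Notes on the Oz books</summary>
  <book xlink:type="locator" xlink:label="oz1" xlink:href="http://example.org/oz1.xml"/>
  <book xlink:type="locator" xlink:label="oz2" xlink:href="http://example.org/oz2.xml"/>
  <next xlink:type="arc" xlink:from="oz1" xlink:to="oz2"/>
  <about xlink:type="arc" xlink:from="intro" xlink:to="oz1"/>
  <cited xlink:type="arc" xlink:from="oz2" xlink:to="intro"/>
</series>"""

ARC_KINDS = {
    ("locator", "locator"): "third-party",
    ("locator", "resource"): "inbound",
    ("resource", "locator"): "outbound",
}


def xlink(name):
    """Build the ElementTree key for an attribute in the XLink namespace."""
    return "{%s}%s" % (XLINK_NS, name)


def classify_arcs(extended_link):
    """Return (arc element name, kind) pairs for the arcs of an extended link."""
    endpoint_types = {}
    for child in extended_link:
        if child.get(xlink("type")) in ("locator", "resource"):
            endpoint_types[child.get(xlink("label"))] = child.get(xlink("type"))
    arcs = []
    for child in extended_link:
        if child.get(xlink("type")) == "arc":
            pair = (endpoint_types[child.get(xlink("from"))],
                    endpoint_types[child.get(xlink("to"))])
            arcs.append((child.tag, ARC_KINDS.get(pair, "local")))
    return arcs


def is_linkbase(extended_link):
    """A linkbase contains at least one inbound or third-party link."""
    return any(kind in ("inbound", "third-party")
               for _, kind in classify_arcs(extended_link))


link = ET.fromstring(SAMPLE)
print(classify_arcs(link))
print(is_linkbase(link))
```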
An XML document that contains any inbound or third-party links is called a linkbase . A linkbase establishes links from documents other than the linkbase itself, perhaps documents that the author of the linkbase does not own and cannot control. Exactly how a browser or other application will load a linkbase and discover the links there is still an open question. It will probably involve visiting a web site that provides the linkbase. When the browser sees the extended link that attempts to establish links from a third web site, it should ask the user whether he wishes to accept the suggested links. It might even use the xlink:role and xlink:title attributes to help the user make this decision, although if past experience with cookies, Java applets, and ActiveX controls is any guide, the initial user interfaces are likely to be quite poor and the choices offered quite limited.
Once a browser has loaded a linkbase and arrived at a page that's referenced as the starting resource of one or more of the links in the linkbase, it should make this fact known to the user somehow and give them a means to traverse the link. Once again, the user interface for this activity remains to be designed. Perhaps it will be a pop-up window showing the third-party links associated with a page. Or perhaps it will simply embed the links in the page but use a different color underlining. The user could still activate them in exactly the same way they activate a normal HTML link.
End of preview. The rest of this section is not available here.
DTDs for XLinks
For a document that contains XLinks to be valid, all the XLink attributes that the document uses have to be declared in a DTD just like any other attributes. In most cases some of the attributes can be declared #FIXED . For example, this DTD fragment describes the novel element seen earlier:
<!ELEMENT novel  (title, author, year)>

<!ATTLIST novel  xmlns:xlink CDATA   #FIXED 'http://www.w3.org/1999/xlink'

                 xlink:type (simple) #FIXED 'simple'

                 xlink:href  CDATA   #REQUIRED>

<!ELEMENT title  (#PCDATA)>

<!ELEMENT author (#PCDATA)>

<!ELEMENT year   (#PCDATA)>
Given this DTD to fill in the fixed attributes xmlns:xlink and xlink:type, a novel element only needs an xlink:href attribute to be a complete simple XLink:
<novel xlink:href = "urn:isbn:0688069444">

  <title>The Wonderful Wizard of Oz</title>

  <author>L. Frank Baum</author>

  <year>1900</year>

</novel>
Documents that contain many XLink elements often use parameter entity references to define the common attributes. For example, suppose novel, anthology, and nonfiction are all simple XLink elements. Their XLink attributes could be declared in a DTD like this:
<!ENTITY % simplelink

  "xlink:type (simple) #FIXED 'simple'

   xlink:href  CDATA   #REQUIRED

   xmlns:xlink CDATA   #FIXED 'http://www.w3.org/1999/xlink'

   xlink:role  CDATA   #IMPLIED

   xlink:title CDATA   #IMPLIED

   xlink:actuate (onRequest | onLoad | other | none) 'onRequest'

   xlink:show (new | replace | embed | other | none) 'new'"

>

<!ATTLIST anthology   %simplelink;>

<!ATTLIST novel       %simplelink;>

<!ATTLIST nonfiction  %simplelink;>
Similar techniques can be applied to declarations of attributes for extended XLinks.
End of preview. The rest of this section is not available here.
Base URIs
Relative URL references such as sark.jpg, ../pi1/sark.jpg, and turing/pi1/sark.jpg must be resolved relative to an absolute base URI before being retrieved. When relative URLs are found in XLinks, xml-stylesheet processing instructions, system identifiers, and other locations in XML documents, they are normally resolved relative to the absolute base URL of the document or entity that contains them. For instance, if you find the element <image xlink:type="simple" xlink:href="pi1/sark.jpg" /> in a document at the URL http://www.turing.org.uk/turing/index.html, you would expect to find the file sark.jpg at the URL http://www.turing.org.uk/turing/pi1/sark.jpg. This isn't a surprise. It's pretty much how links have worked in HTML for over a decade.
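Resolution of this kind is easy to experiment with. As a quick sketch, Python's standard urllib.parse.urljoin resolves a relative reference against a base URL; the variable names are just for illustration.

```python
from urllib.parse import urljoin

# Base URL of the document that contains the link.
document_url = "http://www.turing.org.uk/turing/index.html"

# Each relative reference is resolved against the document's base URL.
resolved = {
    reference: urljoin(document_url, reference)
    for reference in ["pi1/sark.jpg", "sark.jpg", "../pi1/sark.jpg"]
}

for reference, absolute_url in resolved.items():
    print(reference, "->", absolute_url)
```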
However, XML does add a couple of new wrinkles to this procedure. First, an XML document may be composed of multiple entities loaded from multiple different URLs, even on different servers. If this is the case, then a relative URL is resolved relative to the base URL of the specific entity in which it appears, not the base URL of the entire document.
Secondly, the base URL may be reset or changed from within the document by using xml:base attributes. Such an attribute may appear on the XLink element itself or on any ancestor element in the same entity. For example, this XLink points to ftp://ftp.knowtion.net/pub/mirrors/gutenberg/etext93/wizoz10.txt:
<novel xmlns:xlink = "http://www.w3.org/1999/xlink"

       xml:base="ftp://ftp.knowtion.net/pub/mirrors/gutenberg/etext93/"

       xlink:type = "simple"

       xlink:href = "wizoz10.txt">

  <title>The Wonderful Wizard of Oz</title>

  <author>L. Frank Baum</author>

  <year>1900</year>

</novel>
So does this one:
<novel xmlns:xlink = "http://www.w3.org/1999/xlink"

       xml:base="ftp://ftp.knowtion.net/"

       xlink:type = "simple"

       xlink:href = "/pub/mirrors/gutenberg/etext93/wizoz10.txt">

  <title>The Wonderful Wizard of Oz</title>

  <author>L. Frank Baum</author>

  <year>1900</year>

</novel>
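The effect of xml:base in these two examples can be checked with a short Python sketch. It is a simplification under stated assumptions: the helper resolve_href is invented here, it looks only at an xml:base attribute on the link element itself and not on ancestor elements, and the document URL http://www.example.com/books/list.xml is a placeholder.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urljoin

XLINK_NS = "http://www.w3.org/1999/xlink"
XML_NS = "http://www.w3.org/XML/1998/namespace"


def resolve_href(element, document_url):
    """Resolve xlink:href against the element's own xml:base, if it has one."""
    # An xml:base value may itself be relative, so resolve it against the document URL.
    base = urljoin(document_url, element.get("{%s}base" % XML_NS, ""))
    return urljoin(base, element.get("{%s}href" % XLINK_NS))


first = ET.fromstring("""<novel xmlns:xlink="http://www.w3.org/1999/xlink"
       xml:base="ftp://ftp.knowtion.net/pub/mirrors/gutenberg/etext93/"
       xlink:type="simple"
       xlink:href="wizoz10.txt"/>""")

second = ET.fromstring("""<novel xmlns:xlink="http://www.w3.org/1999/xlink"
       xml:base="ftp://ftp.knowtion.net/"
       xlink:type="simple"
       xlink:href="/pub/mirrors/gutenberg/etext93/wizoz10.txt"/>""")

# Placeholder location of the document that contains the links.
DOCUMENT_URL = "http://www.example.com/books/list.xml"

print(resolve_href(first, DOCUMENT_URL))
print(resolve_href(second, DOCUMENT_URL))
```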
End of preview. The rest of this section is not available here.
Chapter 11: XPointers
XPointers are a non-XML syntax for identifying locations inside XML documents. An XPointer is attached to the end of the URI as its fragment identifier to indicate a particular part of an XML document rather than the entire document. XPointer syntax builds on the XPath syntax used by XSLT and covered in Chapter 9. To the four fundamental XPath data types—Boolean, node-set, number, and string—XPointer adds points and ranges, as well as the functions needed to work with these types. It also adds some shorthand syntax for particularly useful and common forms of XPath expressions.
End of preview. The rest of this section is not available here.
XPointers on URLs
A URL that identifies a document looks something like http://java.sun.com:80/products/jndi/index.html. In this example, the scheme http tells you what protocol the application should use to retrieve the document. The authority, java.sun.com:80 in this example, tells you from which host the application should retrieve the document. The authority may also contain the port to connect to that host and the username and password to use. The path, /products/jndi/index.html in this example, tells you which file in which directory to ask the server for. This may not always be a real file in a real filesystem, but it should be a complete document that the server knows how to generate and return. You're already familiar with all of this, and XPointer doesn't change any of it.
You probably also know that some URLs contain fragment identifiers that point to a particular named anchor inside the document the URL locates. This is separated from the path by the octothorpe, #. For example, if we added the fragment download to the previous URL, it would become http://java.sun.com:80/products/jndi/index.html#download. When a web browser follows a link to this URL, it looks for a named anchor in the document at http://java.sun.com:80/products/jndi/index.html with the name download, such as this one:
<a name="download"></a>
It would then scroll the browser window to the position in the document where the anchor with that name is found. This is a simple and straightforward system, and it works well for HTML's simple needs. However, it has one major drawback: to link to a particular point of a particular document, you must be able to modify the document to which you're linking in order to insert a named anchor at the point to which you want to link. XPointer endeavors to eliminate this restriction by allowing authors to specify where they want to link to using full XPath expressions as fragment identifiers. Furthermore, XPointer expands on XPath by providing operations to select particular points in or ranges of an XML document that do not necessarily coincide with any one node or set of nodes. For instance, an XPointer can describe the range of text currently selected by the mouse.
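The pieces of a URL named above (scheme, authority, path, and fragment identifier) can be pulled apart with Python's standard urllib.parse.urlsplit, as in this brief sketch. Python calls the authority netloc.

```python
from urllib.parse import urlsplit

parts = urlsplit("http://java.sun.com:80/products/jndi/index.html#download")

print(parts.scheme)    # protocol used to retrieve the document
print(parts.netloc)    # authority: host plus optional port
print(parts.hostname, parts.port)
print(parts.path)      # which document to ask the server for
print(parts.fragment)  # fragment identifier, without the leading #
```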
End of preview. The rest of this section is not available here.
XPointers in Links
Obviously, what an XPointer points to depends on which document it's applied to. This document is specified by the URL that the XPointer is attached to. For example, if you wanted a URL that pointed to the first name element in the document at http://example.org/people.xml, you would type:
http://example.org/people.xml#xpointer(//name[position( )=1])
If the XPointer uses any characters that are not allowed in URIs—for instance, the less than sign <, the double quotation mark ", or non-ASCII letters like é—then these must be hexadecimally escaped as specified by the URI specification before the XPointer is attached to the URI. That is, each such character is replaced by a percent sign followed by the hexadecimal value of each byte in the character in the UTF-8 encoding of Unicode. Thus, < would be written as %3C, " would be written as %22, and é would be written as %C3%A9.
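The escaping rule can be sketched with Python's standard urllib.parse.quote, which percent-encodes each byte of a character's UTF-8 encoding. The helper attach_xpointer and the SAFE set are assumptions made for this illustration: the set of characters left unescaped simply mirrors the unescaped characters in this chapter's own examples and is not taken from the URI specification.

```python
from urllib.parse import quote

# Characters left as-is in the XPointer (an assumption; see the note above).
SAFE = "/()[]=@*'"


def attach_xpointer(document_url, xpath_expression):
    """Append an xpointer() fragment, percent-encoding characters outside SAFE."""
    escaped = quote(xpath_expression, safe=SAFE)
    return document_url + "#xpointer(" + escaped + ")"


print(quote("<"), quote('"'), quote("é"))
print(attach_xpointer("http://example.org/people.xml",
                      '//name[first_name="André"]'))
```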
In HTML, the URLs used in a elements can contain an XPointer fragment identifier. For example:
<a href = "http://www.example.org/people.xml#xpointer(//name[1])">

  The name of a person

</a>
If a browser followed this link, it would likely load the entire document at http://www.example.org/people.xml and then scroll the window to the beginning of the first name element in the document. However, no browsers yet support the XPointer xpointer scheme, so the exact behavior is open for debate. In some situations it might make sense for the browser to show only the specific element node(s) the XPointer referred to rather than the entire document.
Mozilla 1.4 and later supports the xpath1( ) XPointer scheme proposed by Simon St.Laurent. xpath1( ) is essentially the same as the xpointer( ) scheme discussed here. However, xpath1( ) does not include the XPath extensions for points and ranges that the xpointer( ) scheme does. It only supports pure XPath 1.0 expressions, simplifying implementation.
Shorthand Pointers
XPointers provide a number of convenient extensions to XPath. One of the simplest is the shorthand pointer . A shorthand pointer is similar to an HTML named anchor; that is, a shorthand pointer identifies the element it's pointing to by that element's ID. The ID is supplied by an ID type attribute of the element being pointed at rather than by a special a element with a name attribute. To link to an element with a shorthand pointer, append the usual fragment separator # to the URL followed by the ID of the element to which you're linking. For example, http://www.w3.org/TR/1999/REC-xpath-19991116.xml#NT-AbsoluteLocationPath links to the element in the XPath 1.0 specification that has an ID type attribute with the value NT-AbsoluteLocationPath.
The ID attribute is an attribute declared to have an ID type in the document's DTD. It does not have to be named ID or id. Shorthand pointers cannot be used to link to elements in documents that don't have DTDs because such a document cannot have any ID type attributes.
The inability to use IDs in documents without DTDs is a major shortcoming of XML. Work is ongoing to attempt to remedy this, perhaps by defining a generic ID attribute such as xml:id or by defining a namespace that identifies ID type attributes.
For example, suppose you wanted to link to the Motivation and Summary section of the Namespaces in XML recommendation at http://www.w3.org/TR/1999/REC-xml-names-19990114/xml-names.xml. A quick peek at the source code of this document reveals that it has an id attribute with the value sec-intro and that indeed this attribute is declared to have an ID type in the associated DTD. Its start-tag looks like this:
<div1 id='sec-intro'>
Thus, http://www.w3.org/TR/1999/REC-xml-names-19990114/xml-names.xml#sec-intro is a URL that points to this section. The name does not need to be (and indeed should not be) enclosed in
Child Sequences
Another very common form of XPointer is one that descends exclusively along the child axis, selecting elements by their positions relative to their siblings. For example, xpointer(/child::*[position( ) = 1]/child::*[position( ) = 2]/child::*[position( ) = 3]) selects the third child element of the second child element of the root element of the document. The element( ) scheme allows you to abbreviate this syntax by providing only the numbers of the child elements separated by forward slashes. This is called a child sequence. For example, the previous XPointer could be rewritten using the element( ) scheme in the much more compact form element(/1/2/3).
For example, the aforementioned Motivation and Summary section of the "Namespaces in XML" recommendation at http://www.w3.org/TR/1999/REC-xml-names-19990114/xml-names.xml is given as a div1 element. It so happens that this div1 element is the first child element of the second child element of the root element. Therefore, http://www.w3.org/TR/1999/REC-xml-names-19990114/xml-names.xml#element(/1/2/1) points to this section.
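To make the counting rule concrete, here is a small sketch that walks a child sequence by hand using Python's standard xml.etree.ElementTree module. The function name resolve_child_sequence and the miniature document are hypothetical; the document is only a stand-in with the same shape as the one described above, not the real source of the recommendation.

```python
import xml.etree.ElementTree as ET

def resolve_child_sequence(root, sequence):
    """Follow an element()-style child sequence such as "/1/2/3".

    The first step selects the root element itself, so it must be 1. Each
    later step selects the n-th child element of the current element,
    counting from 1. Returns None if any step cannot be satisfied.
    """
    steps = [int(step) for step in sequence.strip('/').split('/')]
    if steps[0] != 1:
        return None  # a document has exactly one root element
    current = root
    for position in steps[1:]:
        children = list(current)  # child elements only; text is not counted
        if position < 1 or position > len(children):
            return None
        current = children[position - 1]
    return current

# Hypothetical miniature document with the same overall shape.
doc = ET.fromstring(
    "<spec><header/><body>"
    "<div1 id='sec-intro'/><div1 id='sec-next'/>"
    "</body></spec>"
)

target = resolve_child_sequence(doc, "/1/2/1")
print(target.tag, target.get("id"))
```

Because only element children are counted, whitespace and other text between the elements does not shift the numbering.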
Namespaces
Since XPointers may appear in places that are not XML documents (HTML documents, database fields, magazine pages, etc.), they require their own mechanism for binding namespace prefixes to namespace URIs. This is done by placing one or more xmlns parts before the xpointer part. The syntax is xmlns( prefix = URI ). For example, this XPointer maps the svg prefix to the http://www.w3.org/2000/svg namespace and then searches out all rect elements in that namespace:
xmlns(svg=http://www.w3.org/2000/svg) xpointer(//svg:rect)
As with most other uses of namespaces, only the URI matters in an XPointer, not the prefix. The previous XPointer finds all rect elements in the namespace http://www.w3.org/2000/svg, regardless of what prefix they use or whether they're in the default namespace.
There is no way to define a default, unprefixed namespace for an XPointer. However, prefixed names in an XPointer can refer to unprefixed but namespace-qualified elements in the targeted document. For example, this XPointer finds the third div element in an XHTML document:
xmlns(html=http://www.w3.org/1999/xhtml) xpointer(//html:div[3])
It uses the prefix html to identify the XHTML namespace, even though XHTML documents never use prefixes themselves.
More than one namespace prefix can be used simply by adding extra xmlns parts. For example, this XPointer seeks out svg elements in XHTML documents by declaring one prefix each for the SVG and XHTML namespaces:
xmlns(svg=http://www.w3.org/2000/svg)
xmlns(h=http://www.w3.org/1999/xhtml) xpointer(/h:html//svg:svg)
If an XPointer is included in an XML document, the namespace bindings established by that document do not apply to the XPointer. Only the bindings established by the xmlns parts apply to the XPointer. If the xpointer parts contain XPath expressions that refer to elements or attributes in a namespace, they must be preceded by xmlns parts declaring the namespaces.
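The same principle, that only the URI matters and that the bindings come from outside the document, can be seen in most XPath-aware APIs. The following sketch uses Python's standard xml.etree.ElementTree module, whose findall( ) method takes its own prefix-to-URI mapping much as an XPointer takes its xmlns parts. ElementTree implements only a subset of XPath and does not evaluate XPointers; the sample document is hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical document: one rect is in the SVG namespace via a default
# namespace declaration, one via the prefix "g", and one is in no namespace.
doc = ET.fromstring(
    '<drawing>'
    '<svg xmlns="http://www.w3.org/2000/svg"><rect id="a"/></svg>'
    '<g:svg xmlns:g="http://www.w3.org/2000/svg"><g:rect id="b"/></g:svg>'
    '<rect id="c"/>'
    '</drawing>'
)

# Plays the role of xmlns(svg=http://www.w3.org/2000/svg): the binding is
# supplied by the caller, not read from the document.
bindings = {"svg": "http://www.w3.org/2000/svg"}

matches = doc.findall(".//svg:rect", bindings)
print([rect.get("id") for rect in matches])
```

The two rect elements in the SVG namespace are found even though neither uses the prefix svg in the document; the unqualified rect is not.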
Points
XPaths, shorthand pointers, and child sequences can only point to entire nodes or sets of nodes. However, sometimes you want to point to something that isn't a node, such as the third word of the second paragraph or the year in a date attribute that looks like date="01/03/1950". XPointer adds points and ranges to the XPath data model to make this possible. A point is the position preceding or following any tag, comment, processing instruction, or character in the #PCDATA. Points can also be positions inside comments, processing instructions, or attribute values. Points cannot be located inside an entity reference, although they can be located inside the entity's replacement text. A range is the span of parsed character data between two points. Nodes, points, and ranges are collectively called locations ; a set that may contain nodes, points, and ranges is called a location set . In other words, a location is a generalization of the XPath node that includes points and ranges, as well as elements, attributes, namespaces, text nodes, comments, processing instructions, and the root node.
A point is identified by its container node and a non-negative index into that node. If the node contains child nodes (that is, if it's a document or element node), then there is a point before and after each of its children; except at the two ends, the point after one child node is also the point before the next child node. If the node does not contain child nodes (that is, if it's a comment, processing instruction, attribute, namespace, or text node), then there's a point before and after each character in the string value of the node, and again the point after one character is the same as the point before the next character.
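One consequence of this definition is that a container with n children (or n characters) has n + 1 points, indexed 0 through n. The sketch below counts them with Python's standard xml.dom.minidom module, which, like the XPath data model, treats text as child nodes. The function name point_indexes and the tiny document are hypothetical, and the DOM is only an approximation of the XPath data model.

```python
from xml.dom import minidom

def point_indexes(node):
    """Return the valid point indexes inside a DOM node (a sketch)."""
    if node.nodeType in (node.ELEMENT_NODE, node.DOCUMENT_NODE):
        count = len(node.childNodes)   # points sit between child nodes
    else:
        count = len(node.nodeValue)    # points sit between characters
    return list(range(count + 1))

# Hypothetical document: three child elements separated by whitespace,
# which gives the root element seven child nodes in all.
doc = minidom.parseString(
    "<novel>\n <title>Oz</title>\n <author>Baum</author>\n"
    " <year>1900</year>\n</novel>"
)
novel = doc.documentElement
title_text = novel.getElementsByTagName("title")[0].firstChild

print(len(novel.childNodes), point_indexes(novel))
print(repr(title_text.nodeValue), point_indexes(title_text))
```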
Consider the document in Example 11-1. It contains a novel element that has seven child nodes, three of which are element nodes and four of which are text nodes containing only whitespace.
Ranges
A range is the span of parsed character data between two points. It may or may not represent a well-formed chunk of XML. For example, a range can include an element's start-tag but not its end-tag. This makes ranges suitable for uses such as representing the text a user selected with the mouse. Ranges are created with four functions XPointer adds to XPath:
  • range( )
  • range-to( )
  • range-inside( )
  • string-range( )
The range() function takes as an argument an XPath expression that returns a location set. For each location in this set, the range( ) function returns a range exactly covering that location; that is, the start-point of the range is the point immediately before the location, and the end-point of the range is the point immediately after the location. If the location is an element node, then the range begins right before the element's start-tag and finishes right after the element's end-tag. For example, consider this XPointer:
xpointer(range(//title))
When applied to Example 11-1, it selects a range exactly covering the single title element. If there were more than one title element in the document, it would return one range for each such title element. If there were no title elements in the document, then it wouldn't return any ranges.
Now consider this XPointer:
xpointer(range(/novel/*))
If applied to Example 11-1, it returns three ranges, one covering each of the three child elements of the novel root element.
The range-inside( ) function takes as an argument an XPath expression that returns a location set. For each location in this set, it returns a range exactly covering the contents of that location. This will be the same as the range returned by range( ) for anything except an element node. For an element node, this range includes everything inside the element, but not the element's start-tag or end-tag. For example, when applied to Example 11-1,
Chapter 12: XInclude
XInclude is a new technology developed at the W3C for combining multiple well-formed and optionally valid documents and fragments thereof into a single document. It's similar in effect to using external entity references to assemble a document from several component pieces. However, XInclude can assemble a document from resources that are themselves fully well-formed documents that include XML declarations and even document type declarations. It can also use XPointers to extract only a piece of an external document, rather than including the entire thing.
XInclude defines two elements, xi:include and xi:fallback, both in the http://www.w3.org/2001/XInclude namespace. An xi:include element has an href attribute that points to a document. An XInclude processor replaces all the xi:include elements in a master document with the documents they point to. These documents can be other XML documents or plain text documents like Java source code. If the xi:include element has an xpointer attribute, then the xi:include element is replaced by only those parts of the remote document that the XPointer indicates. If the processor cannot find the external document the href attribute points to, then it replaces the xi:include element with the contents of the element's xi:fallback child element instead.
This chapter is based on the April 13, 2004 2nd Candidate Recommendation of XInclude. We think this draft is pretty stable, but it's possible some of the details described here may change before the final release. The most current version of the XInclude specification can be found at http://www.w3.org/TR/xinclude/.
The include Element
The key component of XInclude is the include element. This must be in the http://www.w3.org/2001/XInclude namespace. The xi or xinclude prefixes are customary, although, as always, the prefix can change as long as the URI remains the same. This element has an href attribute that contains a URL pointing to the document to include. For example, this element includes the document found at the relative URL AlanTuring.xml:
<xi:include xmlns:xi="http://www.w3.org/2001/XInclude"
            href="AlanTuring.xml"/>
Of course, you can use absolute URLs as well:
<xi:include xmlns:xi="http://www.w3.org/2001/XInclude"
  href="http://cafeconleche.org/books/xian3/examples/12/AlanTuring.xml"
/>
Technically, the href attribute contains an IRI rather than a URI or URL. An IRI is like a URI except that it can contain non-ASCII characters such as é. These characters are normally encoded in UTF-8, and then each byte of the UTF-8 sequence is percent-escaped to convert the IRI to a URI before resolving it. If you're working in English, and you're not writing an XInclude processor, you can pretty much ignore this. All standard URLs are legal IRIs. If you are working with non-English, non-ASCII IRIs, this just means you can use them exactly as you'd expect without having to manually hex-encode the non-ASCII characters yourself.
Normally, the namespace declaration is placed on the root element of the including document, and not repeated on each individual xi:include element. Henceforth in this chapter, we will assume that the namespace prefix xi is bound to the correct namespace URI.
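To see the replacement step in action without a full XInclude processor, you can experiment with the small xml.etree.ElementInclude module in Python's standard library. The sketch below is only an illustration: the loader function and the in-memory DOCUMENTS table are hypothetical stand-ins for real files, and ElementInclude implements just a subset of XInclude.

```python
import xml.etree.ElementTree as ET
from xml.etree import ElementInclude

# Hypothetical in-memory stand-ins for the files the href attributes name.
DOCUMENTS = {
    "AlanTuring.xml": "<person><name>Alan Turing</name></person>",
    "RichardPFeynman.xml": "<person><name>Richard P. Feynman</name></person>",
}

def loader(href, parse, encoding=None):
    """Return a parsed element for parse="xml", or raw text otherwise."""
    text = DOCUMENTS[href]
    return ET.fromstring(text) if parse == "xml" else text

# The namespace is declared once on the root, as the text recommends.
master = ET.fromstring(
    '<people xmlns:xi="http://www.w3.org/2001/XInclude">'
    '<xi:include href="AlanTuring.xml"/>'
    '<xi:include href="RichardPFeynman.xml"/>'
    '</people>'
)

# Replace every xi:include child, in place, with the document it points to.
ElementInclude.include(master, loader=loader)
result = ET.tostring(master, encoding="unicode")
print(result)
```

After the call, the xi:include elements are gone and the two person elements stand in their place.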
Example 12-1 shows a document similar to Example 8-1 that contains two xi:include elements. The first one loads the document found at the relative URL AlanTuring.xml. The second loads the document found at the relative URL RichardPFeynman.xml.
Example 12-1. A document that uses XInclude to load two other documents
Including Text Files
By default, the XInclude processor assumes the document pointed to by an href attribute is a well-formed XML document. This document is parsed, and the content of the included document replaces the xi:include element in the including document. However, it is also nice to be able to include unparsed text when assembling a larger document. For instance, the program and XML examples in this book could be included directly from their source form. If you add a parse attribute to an xi:include element with the value text, then the document will be loaded as plain text and not parsed. For example, this element includes Example 12-1 as plain text, without parsing it:
<xi:include
  href="http://cafeconleche.org/books/xian3/examples/12/12-1.xml"
  parse="text"
/>
When parse="text", it is no longer necessary for the referenced document to be well-formed. Indeed, it need not be an XML document at all. It can be C source code, an email message, a classic HTML document, or almost anything else. The only restriction is that the included document must not contain any completely illegal characters, such as an ASCII NUL, or an unmatched half of a surrogate pair.
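The difference between parsed and unparsed inclusion shows up clearly when the included text contains markup characters. This sketch again uses Python's standard xml.etree.ElementInclude module with a hypothetical in-memory loader; the file name fragment.c and its content are made up for the example.

```python
import xml.etree.ElementTree as ET
from xml.etree import ElementInclude

# Hypothetical non-XML resource: a line of C that is not well-formed XML.
SOURCE = "if (a < b && b > 0) return 1;"

def loader(href, parse, encoding=None):
    """Hand back the raw text; parse is expected to be "text" here."""
    assert href == "fragment.c" and parse == "text"
    return SOURCE

master = ET.fromstring(
    '<listing xmlns:xi="http://www.w3.org/2001/XInclude">'
    '<xi:include href="fragment.c" parse="text"/>'
    '</listing>'
)
ElementInclude.include(master, loader=loader)

# The text becomes character data of the parent, not child elements.
print(repr(master.text))
serialized = ET.tostring(master, encoding="unicode")
print(serialized)
```

The included characters arrive as ordinary character data, so the < and & are escaped when the result is serialized rather than being interpreted as markup.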
XInclude processors make use of any protocol metadata such as HTTP headers to determine the encoding of a referenced document so they can transcode it into Unicode before including it. If external metadata is not available, but the MIME media type is text/xml, application/xml, or some type that ends in +xml, then the processor will look inside the document for common signatures like byte-order marks or XML declarations that help it guess the encoding. If these standard mechanisms won't suffice, the document author can add an encoding attribute to the xi:include element, indicating the expected encoding of the document. For example, this element tries to load Example 12-1 using the Latin-1 encoding:
<xi:include
  href="http://cafeconleche.org/books/xian3/examples/12/12-1.xml"
Content Negotiation
HTTP clients and servers support a variety of accept headers that indicate which kinds of content the client is prepared to receive. For example, this browser request indicates that the client prefers French but is willing to read English; can handle HTML, plain text, and JPEG images; knows how to decode gzipped data; and recognizes the ASCII, Latin-1, and UTF-8 character sets:
GET /index.html HTTP/1.1
User-Agent: Mozilla/4.6 [en] (WinNT; I)
Host: www.cafeaulait.org
Accept: text/html, text/plain, image/jpeg
Accept-Encoding: gzip
Accept-Language: fr, en
Accept-Charset: us-ascii, iso-8859-1,utf-8
Connection: close
If-Modified-Since: Sun, 31 Oct 1999 19:22:07 GMT
The server that receives this request uses these headers to decide which version of a resource to send to the client. The same URL can return different content depending on how these headers are set. In browsers, this is normally controlled through preferences, but XInclude allows documents to control two of these headers, Accept and Accept-Language, through attributes. Each xi:include element can have an accept and/or accept-language attribute. The values of these attributes should be legal values for the corresponding HTTP header fields. If one or both of these attributes are present, then the XInclude processor will add the relevant accept headers to the HTTP request it sends to the server. For example, this xi:include element indicates you want to include the French HTML version of Google's home page:
<xi:include href="http://www.google.com" parse="text"
  accept-language="fr" accept="text/html"
/>
This xi:include element indicates you want to include the English XML version of Google:
<xi:include href="http://www.google.com"
  accept-language="en" accept="application/xml"
/>
Both accept and accept-language can be used with parse="xml" and parse="text".
It's not necessarily true, of course, that any given URL will have a version with the language and content type you request. Most servers simply return the same page in the same language regardless of the accept headers. However, for those servers that do provide different translations and formats of the same resource, these two attributes enable you to specify which is preferred.
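As a rough illustration of what a processor does with these two attributes, the following Python sketch copies them into the headers of an HTTP request object. The helper name build_request is hypothetical, and the request is only constructed, never sent. Note that urllib normalizes the capitalization of header names when it stores them, which is harmless because HTTP header names are case-insensitive.

```python
import xml.etree.ElementTree as ET
from urllib.request import Request

def build_request(include):
    """Build (but do not send) the HTTP request for one xi:include element."""
    headers = {}
    if include.get("accept") is not None:
        headers["Accept"] = include.get("accept")
    if include.get("accept-language") is not None:
        headers["Accept-Language"] = include.get("accept-language")
    return Request(include.get("href"), headers=headers)

include = ET.fromstring(
    '<xi:include xmlns:xi="http://www.w3.org/2001/XInclude" '
    'href="http://www.google.com" parse="text" '
    'accept-language="fr" accept="text/html"/>'
)

request = build_request(include)
# Compare header names case-insensitively.
sent = {name.lower(): value for name, value in request.header_items()}
print(sent)
```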
Fallbacks
Documents that reference resources on other sites are subject to all the usual problems of the Web: documents are deleted, documents move, servers crash, DNS records aren't updated fast enough, and more. The examples so far all fail completely if the resource at the end of an href attribute can't be found. However, XInclude offers authors a means to provide alternate content in the face of a missing document. Each xi:include element can contain a single xi:fallback child element. If the remote document can't be loaded, the contents of the xi:fallback element replace the xi:include element instead of the contents of the remote resource. For example:
<xi:include href="AlanTuring.xml">
  <xi:fallback>
    Oops! Could not find Alan Turing!
  </xi:fallback>
</xi:include>
There's no limit to what an xi:fallback element can contain. It can hold plain text, a child element, mixed content, or even another xi:include element to be resolved if the top one can't be. For example, this xi:include element tries to load the same document from three different sites:
<xi:include href="http://www.example.us/data.xml">
  <xi:fallback>
    <xi:include href="http://www.example.fr/data.xml">
      <xi:fallback>
        <xi:include href="http://www.example.cn/data.xml">
          <xi:fallback>
            Could not find the document in the U.S., France, or China.
          </xi:fallback>
        </xi:include>
      </xi:fallback>
    </xi:include>
  </xi:fallback>
</xi:include>
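The order in which a processor works through such a chain can be sketched in a few lines of Python. This is a deliberately simplified model, not a real XInclude processor: the function resolve and the two loader functions are hypothetical, a failed load is signalled with OSError, loaded documents are represented by plain strings, and a fallback is assumed to hold either one nested xi:include or plain text.

```python
import xml.etree.ElementTree as ET

XI = "{http://www.w3.org/2001/XInclude}"

def resolve(include, load):
    """Return what should replace one xi:include element (simplified)."""
    try:
        return load(include.get("href"))
    except OSError:
        fallback = include.find(XI + "fallback")
        if fallback is None:
            raise  # no fallback child: the failure stays an error
        nested = fallback.find(XI + "include")
        if nested is not None:
            return resolve(nested, load)  # try the next site in the chain
        return (fallback.text or "").strip()

chain = ET.fromstring("""
<xi:include xmlns:xi="http://www.w3.org/2001/XInclude"
            href="http://www.example.us/data.xml">
  <xi:fallback>
    <xi:include href="http://www.example.fr/data.xml">
      <xi:fallback>
        <xi:include href="http://www.example.cn/data.xml">
          <xi:fallback>
            Could not find the document in the U.S., France, or China.
          </xi:fallback>
        </xi:include>
      </xi:fallback>
    </xi:include>
  </xi:fallback>
</xi:include>
""")

def only_china_is_up(href):
    """Hypothetical loader: only the third site answers."""
    if ".cn/" in href:
        return "data loaded from " + href
    raise OSError("cannot reach " + href)

def nothing_is_up(href):
    """Hypothetical loader: every site is unreachable."""
    raise OSError("cannot reach " + href)

print(resolve(chain, only_china_is_up))
print(resolve(chain, nothing_is_up))
```

The innermost fallback text is used only when all three loads fail; as soon as one load succeeds, the remaining fallbacks are never consulted.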
An xi:include element may not contain more than one xi:fallback child, and may not contain any xi:include or other child elements from the XInclude namespace. Otherwise, any children of the xi:include element not in the XInclude namespace are ignored, and do not appear in the result document after inclusion. The xi:fallback element is ignored if the resource specified by the parent xi:include element's href attribute is successfully loaded. An xi:fallback element may only appear as the child of an
XPointers
For various obscure architectural reasons, the URLs used in xi:include href attributes must not have fragment identifiers. Indeed, it is a fatal error if one does, in which case the XInclude processor will simply throw up its hands and give up. Instead, each xi:include element may have an xpointer attribute. This attribute contains an XPointer indicating what part of the document referenced by the href attribute should be included. For example, this xi:include element loads today's news from Cafe con Leche (which is delimited by a today element in the http://www.w3.org/1999/xhtml namespace), but not the rest of the page:
<xi:include href="http://www.cafeconleche.org/"
    xpointer="xmlns(pre=http://www.w3.org/1999/xhtml)
    xpointer(//pre:today)"/>
You could also use the element( ) scheme:
<xi:include href="http://www.cafeconleche.org/"
            xpointer="element(/1/2/4/1/1/4)"/>
If the href attribute is absent, then the XPointer refers to the current document.
XInclude processors are not required to support all XPointer schemes. In particular, they are not required to support the xpointer( ) or xmlns( ) schemes, although some processors, notably libxml2, do support them. All processors are required to support the element( ) scheme as well as bare-name XPointers, although in practice some implementations, especially those based on streaming APIs like SAX, do not support XPointers at all.
A syntax error in the XPointer is a resource error, which will cause the xi:fallback child element to be processed if present. It is not necessarily a fatal error.
Since XPointers only apply to XML documents, they may only be used when parse="xml". It is a fatal error if an xi:include element has an xpointer attribute and parse="text".
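The two fatal errors mentioned in this section are simple enough to check mechanically before any document is fetched. The following Python sketch does so; the function name check_include is hypothetical, and signalling the error with ValueError is our choice, not something the XInclude specification prescribes.

```python
import xml.etree.ElementTree as ET

def check_include(include):
    """Raise ValueError for the two fatal errors described in the text."""
    href = include.get("href")
    if href is not None and "#" in href:
        raise ValueError("href must not contain a fragment identifier")
    has_xpointer = include.get("xpointer") is not None
    # parse defaults to "xml" when the attribute is absent.
    if has_xpointer and include.get("parse", "xml") == "text":
        raise ValueError('xpointer cannot be combined with parse="text"')

def is_acceptable(markup):
    """Convenience wrapper: True if the element passes both checks."""
    try:
        check_include(ET.fromstring(markup))
        return True
    except ValueError:
        return False

good = '<include href="http://www.cafeconleche.org/" xpointer="element(/1/2/4/1/1/4)"/>'
fragment = '<include href="http://www.cafeconleche.org/#today"/>'
text_pointer = '<include href="notes.txt" parse="text" xpointer="element(/1)"/>'

print(is_acceptable(good), is_acceptable(fragment), is_acceptable(text_pointer))
```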
Chapter 13: Cascading Style Sheets (CSS)
The names of most elements describe the semantic meaning of the content they contain. Often, however, this content needs to be formatted and displayed to users. For this to occur, there must be a step where formatting information is applied to the XML document, and the semantic markup is transformed into presentational markup. There is a variety of choices for the syntax of this presentation layer. However, two are particularly noteworthy:
  • Cascading Style Sheets (CSS)
  • XSL Formatting Objects (XSL-FO)
CSS is a non-XML syntax for describing the appearance of particular elements in a document. CSS is a very straightforward language; no transformation is performed. The parsed character data of the document is presented more or less exactly as it appears in the XML document, although, of course, you can always transform the document with XSLT and then apply a CSS stylesheet to it if you need to rearrange the content of a document before showing it to the user. A CSS stylesheet does not change the markup of an XML document at all; it merely applies styles to the content that already exists.
By way of contrast, XSL-FO is a complete XML application for describing the layout of text on a page. It has elements that represent pages, blocks of text on the pages, graphics, horizontal rules, and more. You do not normally work with this application directly. Instead, you write an XSLT stylesheet that transforms your document's native markup into XSL-FO. The application rendering the document reads the XSL-FO and displays it to the user.
In this chapter and the next, we'll demonstrate the features of the two major stylesheet languages by applying them to the simple well-formed XML document shown in Example 13-1. This document does not have a document type declaration and is not valid, although a DTD or schema could be added easily enough. In general, DTDs and schemas don't have any impact on stylesheets, except insofar as they change the document content through entity declarations, default attribute values, and the like.
The Levels of CSS
At the time of this writing, there are several versions of CSS. CSS Level 1 was an early W3C Recommendation from 1996 for HTML only, although the extension to XML was obvious. The CSS Level 1 specification was incomplete and led to inconsistent browser implementations.
The next version, CSS Level 2, added many additional style properties. It also placed XML on an equal footing with HTML. Indeed, CSS Level 2 often works better with XML than with HTML because CSS styles don't have to interact with any predefined rendering semantics. For the most part, CSS Level 2 is a superset of CSS Level 1. That is, all CSS Level 1 stylesheets are also CSS Level 2 stylesheets that mean pretty much the same thing.
The current version is CSS 2.1. CSS 2.1 adds a few minor values to existing properties—for instance, orange is now recognized as a color—but mostly it removes those features of CSS Level 2 that have not been implemented by browsers. It also corrects a few bugs in the CSS2 specification.
The W3C is now working on CSS Level 3. When complete, it will modularize the CSS specification so software can implement particular subsets of CSS functionality without having to implement everything. For instance, an audio browser could implement audio stylesheets but ignore the visual formatting model. Furthermore, CSS Level 3 adds a number of features to CSS, including multi-column layouts, better support for non-Western languages—such as Arabic and Chinese—XML namespace support, more powerful selectors, paged media, and more. However, CSS Level 3 is not yet implemented by any browsers.
CSS Syntax
CSS syntax isn't XML syntax, but the syntax is so trivial this hardly matters. A CSS stylesheet is simply a list of the elements you want to apply the styles to, normally one to a line. If the element is in a namespace, then the qualified name like recipe:dish must be used. The prefix must be the same in the stylesheet as in the XML document. Each element name is followed by the list of styles you want to apply to that element. Comments can be inserted using the /*...*/ format familiar to C programmers. Whitespace isn't particularly significant, so it can be used to format the stylesheet. Example 13-2 is a simple CSS stylesheet for the recipe document in Example 13-1. Figure 13-1 shows the recipe document as rendered and displayed by the Opera 4.01 browser with this stylesheet.
Example 13-2. A CSS stylesheet for recipes
/* Defaults for the entire document */
recipe  {font-family: "New York", "Times New Roman", serif;
         font-size: 12pt }

/* Make the dish look like a headline */
dish    {
  display: block;
  font-family: Helvetica, Arial, sans-serif;
  font-size: 20pt;
  font-weight: bold;
  text-align: center
}

/* A bulleted list */
ingredient  {display: list-item; list-style-position: inside }

/* Format these two items as paragraphs */
directions, story {
  display: block;
  margin-top: 12pt;
  margin-left: 4pt
}
Figure 13-1: A semantically tagged XML document after a CSS stylesheet is applied
This stylesheet has four style rules. Each rule names the element(s) it formats and follows that with a pair of curly braces containing the style properties to apply to those elements. Each property has a name, such as font-family, and a value, such as "New York", "Times New Roman", serif. Properties are separated from each other by semicolons. Neither the names nor the values are case-sensitive. That is, font-family is the same as FONT-FAMILY or Font-Family. CSS 2.1 defines over 100 different style properties. However, you don't need to know all of these. Reasonable default values are provided for all the properties you don't set.
Associating Stylesheets with XML Documents
CSS stylesheets are primarily intended for use in web pages. Web browsers find the stylesheet for a document by looking for xml-stylesheet processing instructions in the prolog of the XML document. This processing instruction should have a type pseudo-attribute with the value text/css and an href pseudo-attribute whose value is an absolute or relative URL locating the stylesheet document. For example, this is the processing instruction that attaches the stylesheet in Example 13-2 (recipe.css) to the file in Example 13-1 (cornbread.xml), if both are found in the same directory:
<?xml-stylesheet type="text/css" href="recipe.css"?>
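Because pseudo-attributes are not real attributes, XML parsers hand the whole content of the processing instruction back as a single string, and the application has to pick it apart itself. Here is one way this might be done in Python, using the standard xml.dom.minidom module and a regular expression. The function name stylesheet_instructions is hypothetical, and the pattern handles only simple quoted values without character or entity references.

```python
import re
from xml.dom import minidom

# name = "value" or name = 'value'
PSEUDO_ATTRIBUTE = re.compile(r'([\w-]+)\s*=\s*("[^"]*"|\'[^\']*\')')

def stylesheet_instructions(xml_text):
    """Return one dict of pseudo-attributes per xml-stylesheet instruction."""
    document = minidom.parseString(xml_text)
    found = []
    for node in document.childNodes:  # the prolog lives at document level
        if (node.nodeType == node.PROCESSING_INSTRUCTION_NODE
                and node.target == "xml-stylesheet"):
            pairs = PSEUDO_ATTRIBUTE.findall(node.data)
            # Strip the surrounding quotes from each value.
            found.append({name: value[1:-1] for name, value in pairs})
    return found

sample = '<?xml-stylesheet type="text/css" href="recipe.css"?><recipe/>'
instructions = stylesheet_instructions(sample)
print(instructions)
```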
Including the required type and href pseudo-attributes, the xml-stylesheet processing instruction can have up to six pseudo-attributes:
type
This is the MIME media type of the stylesheet; text/css for CSS and application/xml (not text/xsl!) for XSLT.
href
This is the absolute or relative URL where the stylesheet can be found.
charset
This names the character set in which the stylesheet is written, such as UTF-8 or ISO-8859-7. There's no particular reason this has to be the same as the character set in which the document is written. The names used are the same ones used for the encoding pseudo-attribute of the XML declaration.
title
This pseudo-attribute names the stylesheet. If more than one stylesheet is available for a document, the browser may (but is not required to) present readers with a list of the titles of the available stylesheets and ask them to choose one.
media
Printed pages, television screens, and computer displays are all fundamentally different media that require different styles. For example, comfortable reading on screen requires much larger fonts than on a printed page. This pseudo-attribute specifies the media types this stylesheet should apply to. There are 10 predefined values:
Selectors
CSS provides limited abilities to select the elements to which a given rule applies. Many stylesheets only use element names and lists of element names separated by commas, as shown in Example 13-2. However, CSS provides some other basic selectors you can use, although they're by no means as powerful as the XPath syntax of XSLT.
The asterisk matches any element at all; that is, it applies the rule to everything in the document that does not have a more specific, conflicting rule. For example, this rule says that all elements in the document should use a large font:
* {font-size: large}
An element name A followed by another element name B matches all B elements that are descendants of A elements. For example, this rule matches quantity elements that are descendants of ingredients elements, but not other ones that appear elsewhere in the document:
ingredients quantity {font-size: medium}
If the two element names are separated by a greater-than sign (>), then the second element must be an immediate child of the first in order for the rule to apply. For example, this rule gives quantity children of ingredient elements the same font-size as the ingredient element:
ingredient > quantity {font-size: inherit}
If the two element names are separated by a plus sign (+), then the second element must be the next sibling element immediately after the first element. For example, this style rule sets the border-top-style property for only the first story element following a directions element:
directions + story {border-top-style: solid}
Square brackets allow you to select elements with particular attributes or attribute values. For example, this rule hides all step elements that have an optional attribute:
step[optional] {display: none}
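If you want to check which elements a selector would pick out, the selectors shown so far have close counterparts in the XPath subset supported by Python's standard xml.etree.ElementTree module. The sketch below is only an approximation for experimenting: the sample recipe is a hypothetical stand-in rather than the book's example document, browsers do not evaluate selectors this way, and the sibling selector (+) is left out because ElementTree's path syntax has no direct equivalent.

```python
import xml.etree.ElementTree as ET

# Hypothetical recipe fragment.
recipe = ET.fromstring(
    "<recipe>"
    "<ingredients>"
    "<ingredient><quantity>1 cup</quantity> flour</ingredient>"
    "<note><quantity>a pinch</quantity> of salt</note>"
    "</ingredients>"
    "<directions><step>Mix.</step><step optional='yes'>Rest.</step></directions>"
    "<quantity>serves 4</quantity>"
    "</recipe>"
)

# ingredients quantity   (descendant selector)
descendants = recipe.findall(".//ingredients//quantity")
# ingredient > quantity  (child selector)
children = recipe.findall(".//ingredient/quantity")
# step[optional]         (attribute selector)
optional_steps = recipe.findall(".//step[@optional]")

print([q.text for q in descendants])
print([q.text for q in children])
print([s.text for s in optional_steps])
```

The quantity element that sits directly under the root is matched by neither of the first two paths, just as it would be skipped by the corresponding selectors.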
This rule hides all elements that have an optional attribute regardless of the element's name:
End of preview. The rest of this section is not available here.
The Display Property
Display is one of the most important CSS properties. This property determines how the element will be positioned on the page. There are 18 legal values for this property. However, the two primary values are inline and block. The display property can also be used to create lists and tables, as well as to hide elements completely.
Setting the display to inline, the default value, places the element in the next available position from left to right, much as each word in this paragraph is positioned. (The exact direction can change for right-to-left languages like Hebrew or top-to-bottom languages like traditional Chinese.) The text may be wrapped from one line to the next if necessary, but there won't be any hard line breaks between inline elements. In Examples 13-1 and 13-2, the quantity, step, person, city, and state elements were all formatted as inline. This didn't need to be specified explicitly because it's the default.
In contrast to inline elements, an element set to display:block is separated from its siblings, generally by a line break. For example, in HTML, paragraphs and headings are block elements. In Examples 13-1 and 13-2, the dish, directions, and story elements were all formatted with display:block.
CSS 2.1 adds an inline-block value that formats the element's contents as if it were a block-level element, but formats the element itself as if it were an inline element. This normally just means there are extra margins and padding around the element's content, but no line breaks before or after it.
An element whose display property is set to list-item is also formatted as a block-level element. However, a bullet is inserted at the beginning of the block. The list-style-type, list-style-image, and list-style-position properties control which character or image is used for a bullet and exactly how the list is indented. For example, this rule would format the steps as a numbered list rather than rendering them as a single paragraph:
End of preview. The rest of this section is not available here.
Pixels, Points, Picas, and Other Units of Length
Many CSS properties represent lengths. Some of the most important (though far from all) of these include:
  • border-width
  • margin-bottom
  • font-size
  • left
  • line-height
  • top
  • margin-left
  • height
  • margin-top
  • width
  • margin-right
CSS provides many different units to specify length. They fall into two groups:
  • Absolute units of length, such as inches, centimeters, millimeters, points, and picas
  • Relative units, such as ems, exes, pixels, and percentages
Absolute units of length are appropriate for printed media (that is, paper), but they should be avoided in other media. Relative units should be used for all other media, except for pixels, which probably shouldn't be used at all. For example, this style rule sets the dish element to be exactly 0.5 centimeters high:
dish { height: 0.5cm }
However, documents intended for display on screen media like television sets and computer monitors should not be set to fixed sizes. For one thing, the size of an inch or other absolute unit can vary depending on the resolution of the monitor. For another, not all users like the same defaults, and what looks good on one monitor may be illegible on another. Instead, you should use units that are relative to something, such as an em, which is relative to the width of the uppercase letter M in the current font, or an ex, which is relative to the height of the lowercase letter x in the current font. For example, this rule sets the line-height property of the story element to 1.5 times the height of the letter x:
story { line-height: 1.5ex}
Pixel is also a relative unit, although what it's relative to is the size of a pixel on the current display. This is generally somewhere in the vicinity of a point, but it can vary from system to system. In general, we don't recommend using pixels unless you need to line something up with a bitmapped graphic displayed at exactly a 1:1 ratio. Web pages formatted with pixel lengths invariably look too small or too large on most users' monitors.
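The absolute units listed earlier are fixed multiples of one another (1in = 2.54cm = 72pt, and 1pc = 12pt), which is why they suit paper and not screens of unknown size. As a rough sketch, the Python helper below converts an absolute length to points. The name to_points is hypothetical, and the function assumes a two-letter unit suffix.

```python
# Points per absolute CSS unit: 1in = 72pt, 1pc = 12pt, 1in = 2.54cm = 25.4mm.
POINTS_PER_UNIT = {
    "pt": 1.0,
    "pc": 12.0,
    "in": 72.0,
    "cm": 72.0 / 2.54,
    "mm": 72.0 / 25.4,
}

def to_points(length):
    """Convert a string such as '0.5cm' to points (hypothetical helper)."""
    number, unit = length[:-2], length[-2:]
    if unit not in POINTS_PER_UNIT:
        raise ValueError("not an absolute unit: %r" % unit)
    return float(number) * POINTS_PER_UNIT[unit]

# The 0.5cm height given to the dish element is a little over 14 points.
print(round(to_points("0.5cm"), 2))  # 14.17
```

Relative units such as em, ex, and percentages cannot be converted this way, because their size depends on the font or the parent element at rendering time.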
End of preview. The rest of this section is not available here.
Font Properties
Fonts are one of the most basic things designers want to set with CSS. Is the text italic? Is it bold? What typeface and size are used? CSS provides properties to set all these basic characteristics of text. In particular, you can set these properties:
font-family
This is a list of font names, separated by commas, in order of preference. The last name in the list should always be one of the generic names: serif, sans-serif, monospace, cursive, or fantasy. Multiword names like "Times New Roman" should be enclosed in quotes.
font-style
The value italic indicates that an italic version of the font should be used if one is available. The value oblique suggests that the text should be algorithmically slanted, as opposed to using a specially designed italic font. The default is normal (no italicizing or slanting). An element can also be set to inherit the font-style of the parent element.
font-size
This is the size of the font. This should be specified as one of the values xx-small, x-small, small, medium, large, x-large, or xx-large. Alternately, it can be given as a percentage of the font-size of the parent element. It can also be specified as a length like 0.2cm or 12pt, but this should only be done for print media.
font-variant
If this property is set to small-caps, then lowercase text is rendered in smaller capitals like this instead of normal lowercase letters.
font-weight
This property determines how bold or light the text is. It's generally specified as one of the keywords normal (the default), bold, bolder, or lighter. It can also be set to any multiple of 100 from 100 (lightest) to 900 (darkest). However, not all browsers provide nine different levels of boldness.
font-stretch
This property adjusts the space between letters to make the text more or less compact. Legal values include
End of preview. The rest of this section is not available here.
Text Properties
Text properties cover those aspects of text formatting other than what can be adjusted merely by changing the font. These include how far the text is indented, how the paragraph is aligned, and so forth. The most common of these properties include:
text-indent
The text-indent property specifies how far in to indent the first line of the block. (Indents of all lines are generally applied via margin properties.) Hanging indents can be specified by making text-indent negative. This property only applies to block-level elements. For example, this style rule indents the first line of the story element by 0.5 inches from the left side:
story { text-indent: 0.5in }
text-align
The text-align property can be set to left, right, center, or justify to align the text with the left edge of the block or the right edge of the block, to center the text in the block, or to spread the text out across the block. This property only applies to block-level elements.
text-decoration
The text-decoration property can be set to underline, overline, line-through, or blink to produce the obvious effects. Note, however, that the CSS specification specifically allows browsers to ignore the request to make elements blink. This is a good thing.
text-transform
The text-transform property has three main values: capitalize, uppercase, and lowercase. Uppercase changes all the text to capital letters LIKE THIS. Lowercase changes all the text to lowercase letters like this. Capitalize simply uppercases the first letter of each word Like This, but leaves the other letters alone. The default value of this property is none, which performs no transformation. It can also be set to inherit to indicate that the same transform as used on the parent element should be used.
Changing the case in English is fairly straightforward, but this isn't true of all languages. In particular, software written by native English speakers tends to do a very poor job of algorithmically changing the case in ligature-heavy European languages, like Maltese, or context-sensitive languages, like Arabic. Outside of English text, it's best to make the transformations directly in the source document rather than relying on the stylesheet engine to make the correct decisions about which letters to capitalize.
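To see why case changes are harder outside English, consider this small Python sketch. It is not how any browser implements text-transform. The capitalize_words function is a hypothetical stand-in for the capitalize value described above, and the non-English samples use Python's default, locale-independent case mapping.

```python
def capitalize_words(text):
    """Uppercase the first letter of each space-separated word, leaving the rest alone."""
    return " ".join(word[:1].upper() + word[1:] for word in text.split(" "))

# English behaves as expected.
print(capitalize_words("southern corn bread"))  # Southern Corn Bread

# German: the single letter sharp s becomes two letters in uppercase.
print("straße".upper())  # STRASSE

# Turkish expects a dotted capital for this letter, but the default
# mapping produces the plain Latin capital I.
print("i".upper())  # I
```

Mappings like these change the length of the text or depend on the language in use, which a stylesheet engine may not know.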
End of preview. The rest of this section is not available here.
Colors
CSS has several properties for changing the color of various items:
color
The color of the text itself (black on this page)
background-color
The color of the background behind the text (white on this page)
border-color
The color of a visible box surrounding the text
CSS uses a 24-bit color space to specify colors, much as HTML does. Always keep in mind, however, that just because you can specify a color doesn't mean any given device can render it. A black-and-white printer isn't going to print red no matter how you identify it; it might give you some nice shades of gray though. Like many other properties, color depends on the medium in which the document is presented.
The simplest way to choose a color is through one of these 16 named constants: aqua, black, blue, fuchsia, gray, green, lime, maroon, navy, olive, purple, red, silver, teal, white, and yellow. CSS 2.1 adds orange to this list. There are also a number of colors that are defined to be the same as some part of the user interface. For instance, WindowText is the same color as text in windows on the local system.
Beyond this small list, you can specify the color of an item by specifying the three components—red, green, and blue—of each color, much as you do for background colors on HTML pages. Each component is given as a number between 0 and 255, with 255 being the maximum amount of the color. Numbers can be given in decimal or hexadecimal. For example, these rules use hexadecimal syntax to color the dish element pure red, the story element pure green, and the directions element pure blue:
dish       { color: #FF0000 }
story      { color: #00FF00 }
directions { color: #0000FF }
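Each pair of hexadecimal digits in these values is one component. A minimal Python sketch of the arithmetic follows; hex_to_components is a hypothetical helper name, and it assumes the six-digit form used above.

```python
def hex_to_components(color):
    """Split '#RRGGBB' into decimal (red, green, blue), each from 0 to 255."""
    digits = color.lstrip("#")
    if len(digits) != 6:
        raise ValueError("expected six hexadecimal digits: %r" % color)
    return tuple(int(digits[i:i + 2], 16) for i in (0, 2, 4))

for name, color in [("dish", "#FF0000"), ("story", "#00FF00"), ("directions", "#0000FF")]:
    print(name, hex_to_components(color))
# dish (255, 0, 0)
# story (0, 255, 0)
# directions (0, 0, 255)

# Three components of 256 levels each give the 24-bit color space.
print(256 ** 3)  # 16777216
```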
If you prefer, you can specify the color as decimals separated by commas inside an rgb( ) function. For example, white is rgb(255,255,255); black is rgb(0,0,0). Colors in which each component is equal form various shades of gray. These rules use decimal syntax to color the
End of preview. The rest of this section is not available here.
Chapter 14: XSL Formatting Objects (XSL-FO)
The previous chapter covered CSS; this chapter discusses XSL-FO. In distinct contrast to CSS, XSL-FO is a complete XML application for describing the precise layout of text on a page. It has elements that represent sequences of pages, blocks of text on the pages, graphics, horizontal rules, and more. Most of the time, however, you don't write XSL-FO directly. Instead, you write an XSLT stylesheet that transforms your document's native markup into XSL-FO. The application rendering the document reads the XSL-FO and displays it to the user. Since no major browsers currently support direct rendering of XSL-FO documents, there's normally a third step in which another processor transforms the XSL-FO into a readable format, such as PDF or TeX.
Once again, we demonstrate the features of XSL-FO by applying it to the simple well-formed XML document shown in Example 13-1 (in the last chapter) and repeated here in Example 14-1 for convenience.
Example 14-1. Marjorie Anderson's recipe for Southern Corn Bread
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<recipe source="Marjorie Anderson">
  <dish>Southern Corn Bread</dish>
  <ingredients>
    <ingredient>
      <quantity>1 cup</quantity>
      <component>flour</component>
    </ingredient>
    <ingredient>
      <quantity>4 tablespoons</quantity>
      <component>Royal Baking Powder</component>
    </ingredient>
    <ingredient>
      <quantity>1/2 teaspoon</quantity>
      <component>salt</component>
    </ingredient>
    <ingredient>
      <quantity>1 cup</quantity>
      <component>corn meal</component>
    </ingredient>
    <ingredient>
      <quantity>1 1/2 cups</quantity>
      <component>whole milk</component>
    </ingredient>
    <ingredient>
      <quantity>4 tablespoons</quantity>
      <component>melted butter</component>
    </ingredient>
  </ingredients>

  <directions>
    <step>Sift flour, baking powder, sugar &amp; salt together.</step>
    <step>Add 1 cup corn meal.</step>
    <step>
      Beat egg in cup and add beaten egg and 1
End of preview. The rest of this section is not available here.
XSL Formatting Objects
An XSL-FO document describes the layout of a series of nested rectangular areas (boxes, for short) that are placed on one or more pages. These boxes contain text or occasionally other items, such as an external image or a horizontal rule. There are four kinds of areas:
  • Block areas
  • Inline areas
  • Line areas
  • Glyph areas
Block and inline areas are created by particular elements in the formatting objects document. Line and glyph areas are created by the formatter as necessary. For the most part, the rendering engine decides exactly where to place the areas and how big to make them based on their contents. However, you can specify properties for these areas that adjust both their relative and absolute position, spacing, and size on a page. Most of the time, the individual areas don't overlap. However, they can be forced to do so by setting the properties absolute-position, left, bottom, right, and top.
Considered by itself, each box has a content area in which its content, generally text but possibly an image or a rule, is placed. This content area is surrounded by a padding area of blank space. An optional border can surround the padding. The size of the area is the combined size of the border, padding, and content. The box may also have a margin that adds blank space outside the box's area, as diagramed in Figure 14-1.
Figure 14-1: Content, padding, border, and margin of an XSL-FO area
Text properties—such as font-family, font-size, alignment, and font-weight—can be applied by attaching the appropriate properties to one of the boxes that contains the text. Text takes on the properties specified on the nearest enclosing box. Properties are set by attaching attributes to the elements that generate the boxes. With the exception of a few XSL-FO extensions, these properties have the same semantics as the CSS properties of the same name. Only the syntax for applying the properties to particular ranges of text is different.
End of preview. The rest of this section is not available here.
The Structure of an XSL-FO Document
The root element of all XSL-FO documents is fo:root. This element normally declares the fo prefix mapped to the http://www.w3.org/1999/XSL/Format namespace URI. As always, the prefix can change as long as the URI stays the same. In this chapter, we assume that the prefix fo has been associated with http://www.w3.org/1999/XSL/Format. Thus, a typical FO document looks like this:
<?xml version="1.0" encoding="UTF-8"?>
<fo:root xmlns:fo="http://www.w3.org/1999/XSL/Format">
  <!-- Formatting object elements -->
</fo:root>
Of course, normally this isn't written as directly as it is here. Instead, it's formed by an XSLT template like this one:
<xsl:template match="/">
  <fo:root>
    <xsl:apply-templates/>
  </fo:root>
</xsl:template>
The fo:root element must contain two things: a fo:layout-master-set and one or more fo:page-sequence elements. The fo:layout-master-set contains elements describing the overall layout of the pages themselves; that is, how large the pages are, whether they're in landscape or portrait mode, how wide the margins are, and so forth. The fo:page-sequence contains the actual text that will be placed on the pages, along with the instructions for formatting that text as italic, 20 points high, justified, and so forth. It has a master-reference attribute identifying the particular page master that will be used to lay out this content. Adding these elements, a formatting objects document looks like this:
<?xml version="1.0" encoding="UTF-8"?>
<fo:root xmlns:fo="http://www.w3.org/1999/XSL/Format">
  <fo:layout-master-set>
    <!-- page masters -->
  </fo:layout-master-set>
  <fo:page-sequence master-reference="first">
    <!-- data to place on the page -->
  </fo:page-sequence>
</fo:root>
The formatting engine uses the layout master set to create a page. Then it adds content to the page from the fo:page-sequence until the page is full. Then it creates the next page in the sequence and places the next batch of content on that page. This process continues until all the content has been positioned.
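The same skeleton can also be produced programmatically. The sketch below is one possible approach using Python's standard xml.etree.ElementTree module, not something the chapter itself prescribes. It builds only the outer structure (the page master has no regions and the page sequence has no content yet), and the helper name fo is hypothetical.

```python
import xml.etree.ElementTree as ET

FO = "http://www.w3.org/1999/XSL/Format"
ET.register_namespace("fo", FO)  # serialize with the conventional fo prefix

def fo(name):
    """Return the namespace-qualified name that ElementTree expects."""
    return "{%s}%s" % (FO, name)

root = ET.Element(fo("root"))
masters = ET.SubElement(root, fo("layout-master-set"))
ET.SubElement(masters, fo("simple-page-master"), {"master-name": "first"})
sequence = ET.SubElement(root, fo("page-sequence"), {"master-reference": "first"})

# The master-reference must name one of the page masters in the set.
master_names = {master.get("master-name") for master in masters}
print(sequence.get("master-reference") in master_names)  # True

print(ET.tostring(root, encoding="unicode"))
```

Checking master-reference against the declared master-name values catches a common slip before the document ever reaches a formatter.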
End of preview. The rest of this section is not available here.
Laying Out the Master Pages
XSL-FO 1.0 only defines one kind of master page, the fo:simple-page-master. This represents a standard rectangular page with margins on all four sides. This page master also has a unique name given by a master-name attribute. For example, this element describes a page master named first that represents an 8.5-by-11-inch page with 1-inch margins on all four sides:
<fo:simple-page-master margin-right="1in"  margin-left="1in"
                       margin-bottom="1in" margin-top="1in"
                       page-width="8.5in"  page-height="11in"
                       master-name="first">
  <!-- Separate parts of the page go here -->
</fo:simple-page-master>
The part of the page inside the margins is divided into five regions: the start region, the end region, the before region, the after region, and the body region. Where these fall on a page depends on the writing direction. In left-to-right, top-to-bottom languages like English, start is on the lefthand side, end is on the righthand side, before is on top, and after is on bottom, as diagramed in Figure 14-2. However, if the text were Hebrew, then the start region would be on the righthand side of the page, and the end region would be on the lefthand side of the page. If the text were traditional Chinese, then the start would be on top, the end on bottom, the before on the righthand side, and the after on the lefthand side. Other combinations are possible.
Figure 14-2: The five regions in a left-to-right, top-to-bottom writing system
These regions are represented by fo:region-start, fo:region-end, fo:region-before, fo:region-after, and fo:region-body child elements of the fo:simple-page-master element. You can place different content into each of the five regions. For instance, the after region often contains a page number, and the before region may contain the title of the book or chapter.
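For the page master named first shown earlier, the area inside the margins that these five regions share can be worked out by simple subtraction. The Python sketch below does only that arithmetic. The inches helper is hypothetical and handles only lengths written in inches.

```python
def inches(value):
    """Parse a length such as '8.5in' into a number of inches."""
    if not value.endswith("in"):
        raise ValueError("only inch lengths are handled in this sketch")
    return float(value[:-2])

# Attribute values copied from the page master named "first".
page = {
    "page-width": "8.5in", "page-height": "11in",
    "margin-left": "1in", "margin-right": "1in",
    "margin-top": "1in", "margin-bottom": "1in",
}

width = inches(page["page-width"]) - inches(page["margin-left"]) - inches(page["margin-right"])
height = inches(page["page-height"]) - inches(page["margin-top"]) - inches(page["margin-bottom"])

print(width, height)  # 6.5 9.0
```

So the five regions of this page master must fit within a 6.5-by-9-inch rectangle.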
The body region and the corresponding fo:region-body
End of preview. The rest of this section is not available here.
XSL-FO Properties
The finished document shown in Figure 14-3 is quite spartan. It simply breaks the original XML document into a few separate paragraphs. After quite a lot of work, it still hasn't reached the polish that was achieved much more simply with CSS (back in the last chapter in Example 13-2 and Figure 13-1). Adding the sparkle of different fonts, bold headlines, bulleted lists, and other desirable features requires setting the relevant properties on the individual formatting objects. These are set through optional attributes of the formatting object elements like fo:block. The good news is that most of the property names and semantics are exactly the same as they are for CSS. For example, to make the text in an fo:block element bold, add a font-weight attribute with the value bold, like this:
<fo:block font-weight="bold">Southern Corn Bread</fo:block>
The similarity with the equivalent CSS rule is obvious:
dish { font-weight: bold }
The property name is the same. The property value is the same. The meaning of the property is the same. Similarly, you can use all the font-weight keywords and values that you learned for CSS, like lighter and 100, 200, 300, 400, etc. Only the syntactic details of how the value bold is assigned to the property font-weight and how that property is then attached to the dish element have changed. When XSL-FO and CSS converge, they do so closely.
Many other properties come across from CSS by straight extrapolation. For instance, in Example 13-2 the dish element was formatted with this rule:
dish    {
  display: block;
  font-family: Helvetica, Arial, sans-serif;
  font-size: 20pt;
  font-weight: bold;
  text-align: center
}
In XSL-FO, it will be formatted with this XSLT template:
<xsl:template match="dish">
  <fo:block font-family="Helvetica, Arial, sans-serif" font-size="20pt"
            font-weight="bold" text-align="center">
    <xsl:apply-templates/>
  </fo:block>
</xsl:template>
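The correspondence between the CSS rule and the template is mechanical enough to sketch in a few lines of Python. This is only an illustration: it assumes every declaration other than display shares its name and meaning with an XSL-FO property, which holds for the four used here but not for every CSS property. The function name css_to_fo_block is hypothetical.

```python
import xml.etree.ElementTree as ET

FO = "http://www.w3.org/1999/XSL/Format"
ET.register_namespace("fo", FO)

# The declarations from the dish rule above.
css_rule = {
    "display": "block",
    "font-family": "Helvetica, Arial, sans-serif",
    "font-size": "20pt",
    "font-weight": "bold",
    "text-align": "center",
}

def css_to_fo_block(declarations, text):
    """Build an fo:block whose attributes mirror the CSS declarations."""
    # display: block is expressed by choosing fo:block, not by an attribute.
    attributes = {name: value for name, value in declarations.items() if name != "display"}
    block = ET.Element("{%s}block" % FO, attributes)
    block.text = text
    return block

block = css_to_fo_block(css_rule, "Southern Corn Bread")
print(ET.tostring(block, encoding="unicode"))
```

In a real stylesheet the text would come from xsl:apply-templates, as in the template above, not from a literal string.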
End of preview. The rest of this section is not available here.
Choosing Between CSS and XSL-FO
CSS is a very straightforward, easy-to-learn, easy-to-use language for formatting web pages. To the extent that CSS has gotten a reputation as buggy and difficult to use, that's mostly because of inconsistent, nonstandard browser implementations. Opera 4.0 and later, Netscape 6.0 and later, Mozilla, and Safari provide extensive support for most features of CSS Level 2, with only a few minor bugs. Internet Explorer's support is much weaker, but it borders on usable.
It's hard to imagine any text-based web site you can't produce by using XSLT to transform a document into HTML and then applying a CSS stylesheet. Alternately, you can transform the XML document into another XML document and apply the CSS stylesheet to that. If the element content in the original XML document is exactly what you want to display in the output document, in the correct order, you can even omit the XSLT transformation step, as we did in Examples 13-1 and 13-2 in the previous chapter.
Perhaps most importantly, CSS is already well-understood by web designers and well-supported by current browsers. XSL-FO is not directly supported by any browsers. To view an XSL-FO document, you must first convert it into the inconvenient PDF format. PDF does not adjust as well as HTML to the wide variety of monitors and screen sizes in use today. Viewing it inside a web browser requires a special plug-in. The limited open source tools that support XSL-FO are beta quality at best. Personally, we see little reason to use anything other than CSS on the Web.
On the other hand, XSL-FO does go beyond CSS in some respects that are important for high-quality printing. For example, XSL-FO offers multiple column layouts; CSS doesn't. XSL-FO can condition formatting on what's actually in the document; CSS can't. XSL-FO allows you to place footnotes, running headers, and other information in the margins of a page; CSS doesn't. XSL-FO lets you insert page numbers and automatically cross-reference particular pages by number; CSS doesn't. And for printing, the requirement to render into PDF is much less limiting and annoying since the ultimate delivery mechanism is paper anyway. CSS Level 3 will add some of these features, but it will still focus on ease-of-use and web-based presentation rather than high-quality printing. Once the software is more reliable and complete, XSL-FO should be the clear choice for professionally typeset books, magazines, newspapers, and other printed matter that's rendered from XML documents. It should be very competitive with other solutions like Quark XPress, TeX, troff, and FrameMaker. CSS does not even attempt to compete in this area.
End of preview. The rest of this section is not available here.
Chapter 15: Resource Directory Description Language (RDDL)
RDDL, the Resource Directory Description Language, is an XML application invented by Jonathan Borden, Tim Bray, and various other members of the xml-dev mailing list to describe XML applications identified by namespace URLs. A RDDL document lives at the namespace URL for the application it describes. RDDL is a hybrid of XHTML Basic and one custom element, rddl:resource. A rddl:resource element is a simple XLink that points to a resource related to the application the RDDL document describes. Humans with browsers can read the XHTML parts to learn about the application. Software can read the rddl:resource elements.
End of preview. The rest of this section is not available here.
What's at the End of a Namespace URL?
The people who wrote the namespaces specification couldn't agree on what should be put at the end of a namespace URL. Should it be a DTD, a schema, a specification document, a stylesheet, software for processing the application, or something else? All of these are possible, but none of them are required for any particular XML application. Some applications have DTDs; some don't. Some applications have schemas; some don't. Some applications have stylesheets; some don't. Thus, for the most part, namespaces have been purely formal identifiers. They do not actually locate or identify anything.
"Namespaces in XML" specifically states that "The namespace name, to serve its intended purpose, should have the characteristics of uniqueness and persistence. It is not a goal that it be directly usable for retrieval of a schema (if any exists)." That is, it is not required that there be anything in particular, such as a DTD or a schema, at the end of the namespace URL. Indeed, it's not even required that the namespace name be potentially resolvable. It might be an irresolvable URN such as urn:isbn:1565922247. On the other hand, this doesn't say that there can't be anything at the end of a namespace URL, just that there doesn't have to be.
Nonetheless, this hasn't stopped numerous developers from typing namespace URLs into their web browser location bars and filling the error logs at the W3C and elsewhere with 404 Not Found errors. It hasn't stopped weekly questions on the xml-dev mailing list about whether it's possible to parse an XML document on a system that's disconnected from the Net. Eventually, the membership of the xml-dev mailing list reached consensus that it was time to put something at the end of namespace URLs, even if they didn't have to.
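The answer to that recurring question is yes, as a short experiment shows. In the Python sketch below, the standard xml.etree.ElementTree parser treats the namespace name purely as a string and never tries to retrieve it. The book and title element names are invented; the URN is the irresolvable one mentioned above.

```python
import xml.etree.ElementTree as ET

# The namespace name is an ISBN URN, so there is nothing to download.
document = '<book xmlns="urn:isbn:1565922247"><title>Example</title></book>'

root = ET.fromstring(document)  # parses without any network access

# ElementTree reports names as {namespace-name}local-name.
print(root.tag)  # {urn:isbn:1565922247}book

title = root.find("{urn:isbn:1565922247}title")
print(title.text)  # Example
```

The namespace name only has to match character for character; whether anything lives at that address is irrelevant to the parser.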
However, the question still remained, what to put there? All the reasons for not choosing any one thing to put at the end of a namespace URL still applied. Rick Jelliffe suggested fixing the problem by introducing an additional layer of indirection, and Tim Bray proposed doing it with XHTML and XLinks. Instead of putting just one of these at the end of the namespace URL, an XML document containing a list of all the things related to the XML application identified by that particular URL could be put at the end of the namespace URL.
End of preview. The rest of this section is not available here.
RDDL Syntax
A RDDL document is an XHTML Basic document, plus one new element: rddl:resource. XHTML Basic is a subset of XHTML that includes the Structure, Text, Hypertext, List, Basic Forms, Basic Tables, Image, Object, Metainformation, Link, and Base modules. There are no frames or deprecated presentational elements like font and bold. However, this is enough to write pretty much anything you'd reasonably want to write about an XML application.
In addition, a RDDL document contains one new element, resource, which is placed in the http://www.rddl.org/ namespace. This URL is normally mapped to the rddl prefix. The prefix can change as long as the URL remains the same. However, the RDDL DTD declares the resource element with the name rddl:resource, so a RDDL document will be valid only if it uses the prefix rddl.
A rddl:resource element is a simple XLink whose xlink:href attribute points to the related resource and whose xlink:role and xlink:arcrole attributes identify the nature and purpose of that related resource. The rddl:resource element can appear anywhere a p element can appear and contain anything a div element can contain. Web browsers generally ignore the rddl:resource start- and end-tags, but will display their content. Automated software searching for related resources only pays attention to the rddl:resource elements and their attributes, while ignoring all the XHTML.
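As a rough sketch of what such automated software might do, the Python code below walks a RDDL fragment and collects the XLink attributes of each rddl:resource element while ignoring the XHTML around it. The fragment is invented for illustration, and people.xsl is a hypothetical file name. The role value is the XSLT namespace URL.

```python
import xml.etree.ElementTree as ET

RDDL = "http://www.rddl.org/"
XLINK = "http://www.w3.org/1999/xlink"

# Invented fragment: XHTML for humans, one rddl:resource for software.
fragment = """
<div xmlns="http://www.w3.org/1999/xhtml"
     xmlns:rddl="http://www.rddl.org/"
     xmlns:xlink="http://www.w3.org/1999/xlink">
  <p>Introductory text that software skips.</p>
  <rddl:resource xlink:type="simple"
                 xlink:href="people.xsl"
                 xlink:role="http://www.w3.org/1999/XSL/Transform">
    <p>A stylesheet for the people vocabulary.</p>
  </rddl:resource>
</div>
"""

root = ET.fromstring(fragment)
resources = [
    {
        "href": element.get("{%s}href" % XLINK),
        "role": element.get("{%s}role" % XLINK),
        "arcrole": element.get("{%s}arcrole" % XLINK),  # None when omitted
    }
    for element in root.iter("{%s}resource" % RDDL)
]
print(resources)
```

Because the lookup uses the namespace URL and not the rddl prefix, this code would still find the element if a document bound a different prefix, even though such a document would not be valid against the RDDL DTD.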
Recall the person vocabulary used several times in this book. When last seen in Chapter 8, it looked as shown in Example 15-1. All elements in this document are in the default namespace http://www.cafeconleche.org/namespaces/people.
Example 15-1. An XML document describing two people that uses a default namespace
<?xml version="1.0"?>
<people xmlns="http://www.cafeconleche.org/namespaces/people">

  <person born="1912" died="1954">
    <name>
      <first_name>Alan</first_name>
      <last_name>Turing</last_name>
    </name>
    <profession>computer scientist</profession>
    <profession>mathematician</profession>
    <profession>cryptographer</profession>
  </person>

  <person born="1918" died="1988">
    <name>
      <first_name>Richard</first_name>
      <middle_initial>P</middle_initial>
      <last_name>Feynman</last_name>
    </name>
    <profession>physicist</profession>
    <hobby>Playing the bongoes</hobby>
  </person>

</people>
End of preview. The rest of this section is not available here.
Natures
The nature of a related resource says what the resource is. For example, the nature of a web page might be HTML, and the nature of an image might be JPEG. The nature is indicated by a URL. Normally, this nature URL is a namespace URL for XML applications and a MIME media type URL for everything else. For instance, the XSLT nature is written as http://www.w3.org/1999/XSL/Transform. The JPEG nature is written as http://www.isi.edu/in-notes/iana/assignments/media-types/image/jpeg.
The RDDL specification specifies 24 natures that can be used in xlink:role attributes. In addition, you are welcome to define your own, but, when possible, you should use the standard natures so that automated software can understand your documents and locate the necessary related resources. These are the standard natures and their URLs:
Purposes
The purpose of a related resource indicates what the resource will be used for. Purposes distinguish between resources with the same natures used for different things. For example, DocBook has multiple XSLT stylesheets for transforming DocBook documents into HTML, XHTML, chunked HTML, and XSL-FO. These are all related resources with the same nature but different purposes. Unlike natures, purposes are optional. You don't have to use them if you don't need to distinguish between resources with the same nature, but you can if you'd like.
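To make the distinction concrete, the sketch below filters a list of related resources first by nature and then, optionally, by purpose. It is only an illustration: the hrefs and the purpose fragment names (to-html, to-xsl-fo) are invented for this example and are not taken from the RDDL specification's list of well-known purposes.

```python
XSLT_NATURE = "http://www.w3.org/1999/XSL/Transform"
JPEG_NATURE = "http://www.isi.edu/in-notes/iana/assignments/media-types/image/jpeg"
PURPOSES = "http://www.rddl.org/purposes#"

# Hypothetical resources: hrefs and purpose fragment names are made up.
resources = [
    {"href": "to-html.xsl", "nature": XSLT_NATURE, "purpose": PURPOSES + "to-html"},
    {"href": "to-fo.xsl", "nature": XSLT_NATURE, "purpose": PURPOSES + "to-xsl-fo"},
    {"href": "logo.jpg", "nature": JPEG_NATURE, "purpose": None},
]


def select(resources, nature, purpose=None):
    """Filter by nature; narrow by purpose only when one is requested."""
    return [
        r["href"]
        for r in resources
        if r["nature"] == nature and (purpose is None or r["purpose"] == purpose)
    ]


all_stylesheets = select(resources, XSLT_NATURE)
fo_stylesheets = select(resources, XSLT_NATURE, PURPOSES + "to-xsl-fo")
```

Because the purpose is optional, a resource such as the image above can be found by nature alone.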
Purpose names are URLs. These URLs are placed in xlink:arcrole attributes of a rddl:resource element. The RDDL specification defines 21 different well-known purpose URLs, mostly in the form http://www.rddl.org/purposes#purpose. In addition, you are welcome to define your own, but you should use the standard URLs for the standard purposes so that automated software can understand your documents and locate the necessary related resources. These are the well-known purposes:
Chapter 16: XML as a Data Format
Despite the intentions of XML's inventors, who mostly envisioned XML as a format for web pages and other narrative documents to be read by people, the most common applications of XML today involve the storage and transmission of information for use by different software applications and systems. New technologies and frameworks (such as Web Services) depend heavily on XML content to communicate and negotiate between dissimilar applications. The structures appropriate for such applications differ from those used for the more traditional narrative documents in XML. They are more rigid in some ways (for instance, they tend to favor strongly typed element content and rarely allow mixed content) and less rigid in others (for example, the order of child elements rarely matters). Thus, in many applications, the elements tend to look more like database records and less like web pages or books.
The appropriate techniques used to design, build, and maintain a record-like XML application vary greatly, depending on the required functionality and intended audience. This chapter discusses a variety of concerns, techniques, and technologies that should be considered when designing a new record-like XML application.
Why Use XML for Data?
Before XML, individual programmers had to invent a new data format every time they needed to save a file or send a message. In most cases, the data was never intended for use outside the original program, so programmers would store it in the most convenient format they could devise, which was often very tightly coupled to the program's internal data structures. Indeed, the earliest versions of Microsoft Word wrote at least part of their files by dumping memory straight to disk, and then opened those files by reading the data back into memory. This made understanding the data format and loading it into any other program extremely difficult. A few de facto file formats evolved over the years (RTF, CSV, ASN.1, and the ubiquitous Windows .ini file format), but in too many cases, the data written by one program could be read only by that same program. In fact, often only that specific version of the program could read the data.
In recent years, however, XML has begun to solve this problem and make data a lot more portable. The rapid proliferation of free XML tools throughout the programming community has made XML the obvious choice for programmers who need to select a data-storage or transmission format for an application. For all but the most trivial applications, the benefits of using XML to store and retrieve data far outweigh the additional overhead of including an XML parser in your application. The unique strengths of using XML as a software data format include:
  • Simple syntax: Easy to generate and parse.
  • Support for nesting: Nested elements allow programs to represent complex structures easily.
  • Easy to debug: Human-readable data format is easy to explore and create with a basic text editor.
  • Language- and platform-independent: XML and Unicode guarantee that your data files will be portable across virtually every popular computer architecture and language combination in use today.
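The first two strengths can be seen in a short round trip. The following Python sketch writes a nested settings structure to XML and reads it back with the standard library; the element names and values are hypothetical, and a real application would add error handling and probably a schema.

```python
import xml.etree.ElementTree as ET

# Hypothetical application settings; the element names are made up.
settings = {
    "window": {"width": "800", "height": "600"},
    "user": {"name": "Zoë"},
}


def to_xml(settings):
    """Serialize a two-level dictionary as nested elements."""
    root = ET.Element("settings")
    for group, values in settings.items():
        group_el = ET.SubElement(root, group)
        for key, value in values.items():
            ET.SubElement(group_el, key).text = value
    return ET.tostring(root, encoding="unicode")


def from_xml(text):
    """Rebuild the dictionary from the nested elements."""
    root = ET.fromstring(text)
    return {group.tag: {item.tag: item.text for item in group} for group in root}


text = to_xml(settings)
restored = from_xml(text)
```

The resulting text can be opened and corrected in any editor, which is the debugging advantage the list mentions.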
Developing Record-Like XML Formats
Despite the mature status of most of XML's core technologies, XML application development is only now being recognized as a distinct discipline. Many architects and XML developers are attempting to apply existing design methodologies (like UML) and design patterns to the problem of constructing markup languages, but a widely accepted design process for creating XML applications still does not exist.
The term "XML application" is often used in XML contexts to describe an XML vocabulary for a particular domain rather than the software used to process it. This may seem a little strange to developers who are used to creating software applications, but it makes sense if you think about integrating a software application with an XML application, for instance.
XML applications can range in scope from a proprietary vocabulary used to store a single computer program's configuration settings to an industry-wide standard for storing consumer loan applications. Although the specifics and sometimes the sequence will vary, the basic steps involved in creating a new XML application are as follows:
  1. Determine the requirements of the application.
  2. Look for existing applications that might meet those requirements.
  3. Choose a validation model.
  4. Decide on a namespace structure.
  5. Plan for expansion.
  6. Consider the impact of the design on application developers.
  7. Determine how old and new versions of the application will coexist.
The following sections explore each of these steps in greater depth.
The first step in designing a new XML application is like the first step in many design methodologies. Before the application can be designed, it is important to determine exactly what needs the application will fulfill. Some basic questions must be answered before proceeding.

Section 16.2.1.1: Where and how will new documents be created?

Documents that will be created automatically by a software application or database server can be structured differently than those that need to be created by humans using an editor. While software wouldn't have a problem generating 100 elements with attributes that indicate cross-references, a human being probably would find those expectations frustrating.
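For software, producing and checking that many cross-referenced elements takes only a few lines. This Python sketch generates 100 elements whose attributes point at one another and then confirms that no reference dangles; the names items, item, id, and next are invented for the illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical vocabulary: "items", "item", "id", and "next" are illustrative names.
root = ET.Element("items")
for n in range(1, 101):
    item = ET.SubElement(root, "item", id=f"i{n}")
    if n < 100:
        item.set("next", f"i{n + 1}")  # cross-reference to the following item

# Verify that every cross-reference points at an existing id.
ids = {item.get("id") for item in root}
dangling = [
    item.get("next")
    for item in root
    if item.get("next") is not None and item.get("next") not in ids
]
```

A person typing the same document by hand would have to keep all 100 identifiers consistent without this kind of check.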
Sharing Your XML Format
Creating a data format is often only the first step in making it useful. If an XML vocabulary is used only for a particular process inside a software application, there may not be much reason to publish information about how it works, except for future developers who may work on that application. If, on the other hand, the data format is intended for widespread use by people or organizations who may not normally interact with each other beyond the exchange of messages, then it probably makes sense to provide much more support for the format.
Several kinds of information about a data format are frequently worth sharing:
  • Human-readable documentation, perhaps even in a variety of languages
  • Schemas and DTDs formally defining the structures and content
  • Stylesheets and transformations for presenting the data or converting it from one format to another
  • Code for processing the data, perhaps even in a variety of languages or environments
The first two approaches—human-readable documentation and schemas—are typically the foundations. Formal definitions and rough understandings of what goes where often work for formats that are used by individual programmers or small groups, but sharing formats widely often requires further explanation. Stylesheets and code are additional options that may simplify adoption for developers.
The appropriate level of publicity for an XML vocabulary can vary widely: from no publicity at all, through publishing a RDDL document or a support site, to registering the format in one of the XML application registries or creating a working group at a standards body or consortium.
Chapter 17: XML Schemas
Although document type definitions can enforce basic structural rules on documents, many applications need a more powerful and expressive validation method. The W3C developed the XML Schema Recommendation to address these needs. Schemas can describe complex restrictions on elements and attributes. Multiple schemas can be combined to validate documents that use multiple XML vocabularies. This chapter provides a rapid introduction to key W3C XML Schema concepts and usage, starting with the fundamental structures that are common to all schemas. We begin with a very simple schema and proceed to add more functionality to it until every major feature of XML Schemas has been introduced.
Overview
An XML Schema is an XML document containing a formal description of what constitutes a valid XML document. A W3C XML Schema Language schema is an XML Schema written in the particular syntax recommended by the W3C.
In this chapter, when we use the word "schema" without further qualification, we are referring specifically to a schema written in the W3C XML Schema language. However, there are numerous other XML schema languages, including RELAX NG and Schematron, each with its own strengths and weaknesses.
An XML document described by a schema is called an instance document. If a document satisfies all the constraints specified by the schema, it is considered to be schema-valid. The schema document is associated with an instance document through one of the following methods:
  • An xsi:schemaLocation attribute on an element contains a list of namespaces used within that element and the URLs of the schemas with which to validate elements and attributes in those namespaces.
  • An xsi:noNamespaceSchemaLocation attribute contains a URL for the schema used to validate elements that are not in any namespace.
  • A validating parser may be instructed to validate a given document against an explicitly provided schema, ignoring any hints that might be provided within the document itself.
DTDs provide the capability to do basic validation of the following items in XML documents:
  • Element nesting
  • Element occurrence constraints
  • Permitted attributes
  • Attribute types and default values
However, DTDs do not provide fine control over the format and data types of element and attribute values. Other than the various special attribute types (ID, IDREF, ENTITY, NMTOKEN, and so forth), once an element or attribute has been declared to contain character data, no limits may be placed on the length, type, or format of that content. For narrative documents (such as web pages, book chapters, newsletters, etc.), this level of control is probably good enough.
Schema Basics
This section will construct, step-by-step, a simple schema document representing a typical address book entry, introducing different features of the XML Schema language as needed. Example 17-1 shows a very simple, well-formed XML document.
Example 17-1. addressdoc.xml
<?xml version="1.0"?>
<fullName>Scott Means</fullName>
Assuming that the fullName element can only contain a simple string value, the schema for this document would look like Example 17-2.
Example 17-2. address-schema.xsd
<?xml version="1.0"?>
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:element name="fullName" type="xs:string"/>
</xs:schema>
It is also common to associate the sample instance document explicitly with the schema document. Since the fullName element is not in any namespace, the xsi:noNamespaceSchemaLocation attribute is used, as shown in Example 17-3.
Example 17-3. addressdoc.xml with schema reference
<?xml version="1.0"?>
<fullName xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
  xsi:noNamespaceSchemaLocation="address-schema.xsd">Scott Means</fullName>
Validating the simple document against its schema requires a validating XML parser that supports schemas, such as the open source Xerces parser from the Apache XML Project (http://xml.apache.org/xerces2-j/). Xerces is written in Java and includes a command-line program called dom.Writer that can be used to validate addressdoc.xml, like this:
% java dom.Writer -V -S addressdoc.xml
Since the document is valid, dom.Writer will simply echo the input document to standard output. An invalid document will cause the parser to generate an error message. For instance, adding b elements to the contents of the fullName element violates the schema rules:
<?xml version="1.0"?>
<fullName xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
  xsi:noNamespaceSchemaLocation="address-schema.xsd">Scott <b>Means</b>
</fullName>
If this document were validated with dom.Writer, the following validity errors would be detected by Xerces:
Working with Namespaces
So far, namespaces have only been dealt with as they relate to the schema processor and schema language itself. But the schema specification was written with the intention that schemas could support and describe XML namespaces.
Associating a schema with a particular XML namespace is extremely simple: add a targetNamespace attribute to the root xs:schema element, like so:
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema"
  targetNamespace="http://namespaces.oreilly.com/xmlnut/address">
It is important to remember that many XML 1.0 documents are not associated with namespaces at all. To validate these documents, it is necessary to use a schema that doesn't have a targetNamespace attribute. When developing schemas that are not associated with a target namespace, you should always explicitly qualify schema elements (like xs:element) to keep them from being confused with global declarations for your application.
However, making that simple change impacts numerous other parts of the example application. Trying to validate the addressdoc.xml document as it stands (with the xsi:noNamespaceSchemaLocation attribute) causes the Xerces schema processor to report this validity error:
General Schema Error: Schema in address-schema.xsd has a different target
namespace from the one specified in the instance document :.
To rectify this, it is necessary to change the instance document to reference the new, namespace-enabled schema properly. This is done using the xsi:schemaLocation attribute, like so:
<fullName xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
  xsi:schemaLocation="http://namespaces.oreilly.com/xmlnut/address
    address-schema.xsd"
  language="en">Scott Means</fullName>
Notice that the schemaLocation attribute value contains two tokens. The first is the target namespace URI that matches the target namespace of the schema document. The second is the physical location of the actual schema document.
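The pairing of tokens can be made explicit in code. The Python sketch below reads the xsi:schemaLocation attribute from the instance shown above and splits it into namespace and location pairs; the function name schema_locations is an assumption for this example, and the code only inspects the hint, it does not load or apply the schema.

```python
import xml.etree.ElementTree as ET

XSI_NS = "http://www.w3.org/2001/XMLSchema-instance"

DOC = """<fullName xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
  xsi:schemaLocation="http://namespaces.oreilly.com/xmlnut/address
    address-schema.xsd"
  language="en">Scott Means</fullName>"""


def schema_locations(element):
    """Map each namespace URI in xsi:schemaLocation to its schema location."""
    tokens = element.get(f"{{{XSI_NS}}}schemaLocation", "").split()
    if len(tokens) % 2:
        raise ValueError("xsi:schemaLocation must hold namespace/location pairs")
    return dict(zip(tokens[0::2], tokens[1::2]))


locations = schema_locations(ET.fromstring(DOC))
```

An attribute that lists several namespaces simply contains more pairs, which the same function handles.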
Complex Types
A schema assigns a type to each element and attribute it declares. In Example 17-5, the fullName element has a complex type. Elements with complex types may contain nested elements and have attributes. Only elements can contain complex types. Attributes always have simple types.
Since the type is declared using an xs:complexType element embedded directly in the element declaration, it is also an anonymous type, rather than a named type.
New types are defined using xs:complexType or xs:simpleType elements. If a new type is declared globally with a top-level element, it needs to be given a name so that it can be referenced from element and attribute declarations within the schema. If a type is defined inline (inside an element or attribute declaration), it does not need to be named. But since it has no name, it cannot be referenced by other element or attribute declarations. When building large and complex schemas, data types will need to be shared among multiple different elements. To facilitate this reuse, it is necessary to create named types.
To show how named types and complex content interact, let's expand the example schema. A new address element will contain the fullName element, and the person's name will be divided into a first- and last-name component. A typical instance document would look like Example 17-6.
Example 17-6. addressdoc.xml after adding address, first, and last elements
<?xml version="1.0"?>
<addr:address xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    xsi:schemaLocation="http://namespaces.oreilly.com/xmlnut/address
      address-schema.xsd"
    xmlns:addr="http://namespaces.oreilly.com/xmlnut/address"
    addr:language="en">
  <addr:fullName>
    <addr:first>Scott</addr:first>
    <addr:last>Means</addr:last>
  </addr:fullName>
</addr:address>
To accommodate this new format, fairly substantial structural changes to the schema are required, as shown in Example 17-7.
Example 17-7. address-schema.xsd to support address element
Empty Elements
In many cases, it is useful to declare an element that cannot contain anything. Most of these elements convey all of their information via attributes or simply by their position in relation to other elements (e.g., the br element from XHTML).
Let's add a contact-information element to the address element that will be used to contain a list of ways to contact a person. Example 17-8 shows the sample instance document after adding the new contacts element and a sample phone entry.
Example 17-8. addressdoc.xml with contact element
<?xml version="1.0"?>
<addr:address xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    xsi:schemaLocation="http://namespaces.oreilly.com/xmlnut/address
      address-schema.xsd"
    xmlns:addr="http://namespaces.oreilly.com/xmlnut/address"
    addr:language="en">
  <addr:fullName>
    <addr:first>William</addr:first>
    <addr:middle>Scott</addr:middle>
    <addr:last>Means</addr:last>
  </addr:fullName>
  <addr:contacts>
    <addr:phone addr:number="888.737.1752"/>
  </addr:contacts>
</addr:address>
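From a program's point of view, an empty element such as addr:phone carries its data entirely in attributes. A minimal Python sketch, using an abbreviated copy of the instance above with the schema reference omitted, might read the phone numbers like this; the variable names are arbitrary.

```python
import xml.etree.ElementTree as ET

ADDR_NS = "http://namespaces.oreilly.com/xmlnut/address"

# Abbreviated copy of Example 17-8 (schema reference and name omitted).
DOC = """<addr:address xmlns:addr="http://namespaces.oreilly.com/xmlnut/address"
    addr:language="en">
  <addr:contacts>
    <addr:phone addr:number="888.737.1752"/>
  </addr:contacts>
</addr:address>"""

root = ET.fromstring(DOC)
phones = root.findall("addr:contacts/addr:phone", {"addr": ADDR_NS})

# The number attribute is prefixed, so it must be looked up by qualified name.
numbers = [p.get(f"{{{ADDR_NS}}}number") for p in phones]

# An empty element has no text and no child elements.
all_empty = all(p.text is None and len(p) == 0 for p in phones)
```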
Supporting this new content requires further modifications to the schema document. Although it would be possible to declare the new element inline within the existing address-element declaration, for clarity it makes sense to create a new global type and reference it by name:
<xs:element name="address">
  <xs:complexType>
    <xs:sequence>
      <xs:element name="fullName">
. . .
      </xs:element>
      <xs:element name="contacts" type="addr:contactsType" minOccurs="0"/>
    </xs:sequence>
    <xs:attributeGroup ref="addr:nationality"/>
  </xs:complexType>
</xs:element>
The declaration for the new contactsType complex type looks like this:
<xs:complexType name="contactsType">
  <xs:sequence>
    <xs:element name="phone" minOccurs="0" maxOccurs="unbounded">
      <xs:complexType>
        <xs:attribute name="number" type="xs:string"/>
      </xs:complexType>
    </xs:element>
  </xs:sequence>
</xs:complexType>
The syntax used to declare an empty element is actually very simple. Notice that the
Simple Content
Earlier, the xs:simpleContent element was used to declare an element that could only contain simple content:
<xs:element name="fullName">
  <xs:complexType>
    <xs:simpleContent>
      <xs:extension base="xs:string">
        <xs:attribute name="language" type="xs:language"/>
      </xs:extension>
    </xs:simpleContent>
  </xs:complexType>
</xs:element>
The base type for the extension in this case was the built-in xs:string data type. But simple types are not limited to the predefined types. The xs:simpleType element can define new simple data types, which can be referenced by element and attribute declarations within the schema.
To show how new simple types can be defined, let's extend the phone element from the example application to support a new attribute called location. This attribute will be used to differentiate between work and home phone numbers. This attribute will have a new simple type called locationType, which will be referenced from the contactsType definition:
<xs:complexType name="contactsType">
  <xs:sequence>
    <xs:element name="phone" minOccurs="0">
      <xs:complexType>
        <xs:attribute name="number" type="xs:string"/>
        <xs:attribute name="location" type="addr:locationType"/>
      </xs:complexType>
    </xs:element>
  </xs:sequence>
</xs:complexType>

<xs:simpleType name="locationType">
  <xs:restriction base="xs:string"/>
</xs:simpleType>
Of course, a location type that just maps to the built-in xs:string type isn't particularly useful. Fortunately, schemas can strictly control the possible values of simple types through a mechanism called facets.
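As a rough idea of what such a restriction could look like, the Python sketch below embeds one possible refinement of locationType that uses the schema language's xs:enumeration facet, then mimics the check a schema processor would make. The allowed values work and home are assumed from the stated goal of distinguishing work and home numbers, and the helper functions are illustrative; they are not a substitute for a real schema validator.

```python
import xml.etree.ElementTree as ET

XS_NS = "http://www.w3.org/2001/XMLSchema"

# One possible refinement of locationType; the values "work" and "home" are assumed.
TYPE_DEF = """<xs:simpleType name="locationType"
    xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:restriction base="xs:string">
    <xs:enumeration value="work"/>
    <xs:enumeration value="home"/>
  </xs:restriction>
</xs:simpleType>"""


def allowed_values(type_def):
    """Collect the values listed by the enumeration facets."""
    root = ET.fromstring(type_def)
    return [e.get("value") for e in root.iter(f"{{{XS_NS}}}enumeration")]


def is_valid_location(value, type_def=TYPE_DEF):
    # Mimics only the enumeration facet; a real schema processor does much more.
    return value in allowed_values(type_def)
```

With this definition, is_valid_location("home") succeeds while a value such as "mobile" is rejected.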
In schema-speak, a facet is an aspect of a possible value for a simple data type. Depending on the base type, some facets make more sense than others. For example, a numeric data type can be restricted by the minimum and maximum possible values it could contain. But these types of restrictions wouldn't make sense for a
Mixed Content
XML 1.0 provided the ability to declare an element that could contain parsed character data (#PCDATA) and unlimited occurrences of elements drawn from a provided list. Schemas provide the same functionality plus the ability to control the number and sequence in which elements appear within character data.
The mixed attribute of the complexType element controls whether character data may appear within the body of the element with which it is associated. To illustrate this concept, Example 17-9 gives us a new schema that will be used to validate form-letter documents.
Example 17-9. formletter.xsd
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:element name="letter">
    <xs:complexType mixed="true"/>
  </xs:element>
</xs:schema>
This schema seems to declare a single element called letter that may contain character data and nothing else. But attempting to validate the following document produces an error, as shown in Example 17-10.
Example 17-10. formletterdoc.xml
<letter xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
  xsi:noNamespaceSchemaLocation="formletter.xsd">Hello!</letter>
The following error is generated:
The content of element type "letter" must match "EMPTY".
This is because there's no complex content for the letter element. Setting mixed to true is not the same as declaring an element that may contain a string. The character data may only appear in relation to other complex content, which leads to the subject of relative element positioning.
You have already seen the xs:sequence element, which dictates that the elements it contains must appear in exactly the same order in which they appear within the sequence element. In addition to xs:sequence, schemas also provide the xs:choice and xs:all elements to control the order in which elements may appear. These elements may be nested to create sophisticated element structures.
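The difference between the three compositors can be sketched as three small predicates over a list of child element names. This Python illustration assumes the simplest case, in which every declared element must occur exactly once (no minOccurs or maxOccurs adjustments and no nesting); the function names and the element name email are invented for the example.

```python
def matches_sequence(children, names):
    """xs:sequence: every declared name, in exactly the declared order."""
    return list(children) == list(names)


def matches_choice(children, names):
    """xs:choice: exactly one of the declared names."""
    return len(children) == 1 and children[0] in names


def matches_all(children, names):
    """xs:all: every declared name exactly once, in any order."""
    return sorted(children) == sorted(names)


declared = ["first", "last"]
in_order = matches_sequence(["first", "last"], declared)
reversed_sequence = matches_sequence(["last", "first"], declared)
reversed_all = matches_all(["last", "first"], declared)
```

Reversing the children breaks the sequence rule but still satisfies the all rule, which is the practical distinction between the two.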
Expanding the form-letter example, a sequence adds support for various letter components to the
Allowing Any Content
It is often necessary to allow users to include any type of markup content they see fit. Also, it is useful to tell the schema processor to validate the content of a particular element against another application's schema. Incorporating XHTML content into another document is an example of this usage.
These applications are supported by the xs:any element. This element accepts attributes that indicate what level of validation should be performed on the included content, if any. Also, it accepts a target namespace that can be used to limit the vocabulary of included content. For instance, going back to the address-book example, to associate a rich-text notes element with an address entry, you could add the following element declaration to the address element declaration:
<xs:element name="notes" minOccurs="0">
  <xs:complexType>
    <xs:sequence>
      <xs:any namespace="http://www.w3.org/1999/xhtml"
           minOccurs="0" maxOccurs="unbounded"
           processContents="skip"/>
    </xs:sequence>
  </xs:complexType>
</xs:element>
The attributes of the xs:any element tell the schema processor that zero or more elements belonging to the XHTML namespace (http://www.w3.org/1999/xhtml) may occur at this location. Notice that this is done by setting minOccurs to 0 and maxOccurs to unbounded. It also states that these elements should be skipped. This means that no validation will be performed against the actual XHTML namespace by the parser. Other possible values for the processContents attribute are lax and strict. When set to lax, the processor will attempt to validate any element it can find a declaration for and silently ignore any unrecognized elements. The strict option requires every element to be declared and valid per the schema associated with the namespace given.
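With processContents set to skip, the one thing still checked about each wildcard match is its namespace. The Python sketch below approximates that single check for the children of a notes element; it is an illustration only, the notes element is left unqualified to keep the sample short, and the memo element is a made-up counterexample.

```python
import xml.etree.ElementTree as ET

XHTML_NS = "http://www.w3.org/1999/xhtml"


def only_xhtml_children(notes):
    """True when every child element of notes is in the XHTML namespace."""
    return all(child.tag.startswith(f"{{{XHTML_NS}}}") for child in notes)


# Children in the XHTML namespace are accepted; their content is not inspected.
good = ET.fromstring(
    '<notes><p xmlns="http://www.w3.org/1999/xhtml">Call <b>after</b> 5.</p></notes>'
)

# A child outside the XHTML namespace would not match the wildcard.
bad = ET.fromstring("<notes><memo>Call after 5.</memo></notes>")

good_ok = only_xhtml_children(good)
bad_ok = only_xhtml_children(bad)
```

An empty notes element also passes, which corresponds to minOccurs being 0.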
There is also support in schemas to declare that any attribute may appear within a given element. The xs:anyAttribute element may include the
Controlling Type Derivation
Just as some object-oriented programming languages allow the creator of an object to dictate the limits on how an object can be extended, the schema language allows schema authors to place restrictions on type extension and restriction.
The abstract attribute applies to type and element declarations. When it is set to true, that element or type cannot appear directly in an instance document. If an element is declared as abstract, a member of a substitution group based on that element must appear. If a type is declared as abstract, no element declared with that type may appear in an instance document.
Until now, the schema has placed no restrictions on how other types or elements could be derived from its elements and types. The final attribute can be added to a complex type definition and set to either #all, extension, or restriction. On a simple type definition, it can be set to #all or to a list containing any combination of the values list, union, and/or restriction, in any order. When a type is derived from another type that has the final attribute set, the schema processor verifies that the desired derivation is legal. For example, a final attribute could prevent the physicalAddressType type from being extended:
<xs:complexType name="physicalAddressType" final="extension">
Since the main schema in address-schema.xsd attempts to redefine the physicalAddressType in an xs:redefine block, the schema processor generates the following errors when it attempts to validate the instance document:
ComplexType 'physicalAddressType': cos-ct-extends.1.1: Derivation by extension is forbidden by either the base type physicalAddressType_redefined or the schema.
Attribute "addr:latitude" must be declared for element type "physicalAddress".
Attribute "addr:longitude" must be declared for element type "physicalAddress".
The first error is a result of trying to extend a type that has been marked to prevent extension. The next two errors occur because the new, extended type was not parsed and applied to the content in the document. Now that you've seen how this works, removing this particular "feature" from the
Chapter 18: Programming Models
This chapter briefly explains the most popular programming techniques for parsing, manipulating, and generating XML data. XML support is available for virtually every programming platform in use today, from supercomputer to cell phone. If you can't find XML support built into your programming environment, a quick Google search will likely locate a library.
Common XML Processing Models
Inhaltsvorschau
XML's structured and tagged text can be processed by developers in several ways. Programs can look at XML as plain text, as a stream of events, as a tree, or as a serialization of some other structure. Tools supporting all of these options are widely available.
At their foundation, XML documents are text. The content and markup are both represented as text, and text-editing tools can be extremely useful for XML document inspection, creation, and modification. XML's textual foundations make it possible for developers to work with XML directly, using XML-specific tools only when they choose to.
One of the original design goals of XML was for documents to be easy to parse. For very simple documents that do not depend on features such as attribute defaulting and validation, it is possible to parse tags, attributes, and text data using standard programming tools such as regular expressions and tokenizers, but the complexity of processing grows rapidly as documents use more features. Unless the application can completely control the content of incoming documents, it is almost always preferable to use one of the many high-quality XML parsers that are freely available for most programming languages.
Textual tools are a key part of the XML toolset, however. Many developers use text editors such as vi, Emacs, Notepad, WordPad, BBEdit, and UltraEdit to create or modify XML documents. Regular expressions, in environments such as sed, grep, Perl, and Python, can be used for search and replace or for tweaking documents prior to XML parsing or XSLT processing. Some standards also apply regular expression matching after a document has been parsed. The W3C's XML Schema recommendation, for instance, includes regular-expression matching as one mechanism for validating data types, as discussed in Chapter 17.
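As a rough illustration of tweaking a document before parsing, the following Java sketch replaces the HTML-only entity reference &nbsp; (which a plain XML parser rejects unless a DTD declares it) with a numeric character reference, and then hands the result to a parser. The class name PreParseTweak and the sample document are made up for this example; it assumes the JAXP parser bundled with the JDK.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical helper: a regular-expression fix applied to the raw text
// before any XML parser sees it.
public class PreParseTweak {

    // Replace the undeclared entity reference with a character reference.
    static String fixNbsp(String rawXml) {
        return rawXml.replaceAll("&nbsp;", "&#160;");
    }

    // Parse the (already repaired) text and return the root element's text.
    static String parseText(String xml) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
        return doc.getDocumentElement().getTextContent();
    }

    public static void main(String[] args) throws Exception {
        String raw = "<note>10&nbsp;kg</note>";   // sample input, not well-formed as is
        String text = parseText(fixNbsp(raw));
        System.out.println(text.length());        // "10", a no-break space, "kg"
    }
}
```

A textual fix like this is only safe when you know the pattern cannot occur somewhere it should be left alone, such as inside a CDATA section.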
End of preview. The remaining content of this section is not shown here.
Common XML Processing Issues
As with any technology, there are several ways to accomplish most design goals when developing a new XML application, as well as a few potential problems worth knowing about ahead of time. An understanding of the intended uses for these features can help ensure that new applications will be compatible not only with their intended target audience, but also with other XML processing systems that may not even exist yet.
The XML specification provides several loopholes that permit XML parsers to play fast and loose with your document's literal contents, while retaining the semantic meaning. Comments can be omitted and entity references silently replaced by the parser without any warning to the client application. Non-validating parsers aren't required to retrieve external DTDs or entities, although the parser should at least warn applications that this is happening. While reconstructing an XML document with exactly the same logical structure and content is possible, guaranteeing that it will match the original in a byte-by-byte comparison generally is not.
XML Canonicalization defines a more consistent form of XML and a process for producing it that permits a much higher degree of predictability in reconstructing a document from its logical model. For details, see http://www.w3.org/TR/xml-c14n.
Authors of simple XML processing tools that act on data without storing or modifying it might not consider these constraints particularly restrictive. The ability to reconstruct an XML document precisely from in-memory data structures, however, becomes more critical for authors of XML editing tools and content-management solutions. While no parser is required to make all comments, whitespace, and entity references available from the parse stream, many do or can be made to do so with the proper configuration options.
The only real option to ensure that a parser reports documents as you want, and not just the minimum required by the XML specification, is to check its documentation and configure (or choose) the parser accordingly.
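What "configuring the parser" looks like depends on the parser. As one hedged example, the JAXP DocumentBuilderFactory that ships with Java exposes switches for some of these reporting choices. The sketch below, with a made-up class name and sample document, counts the comment nodes a DOM parser reports with comment reporting left on and with it turned off.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical demonstration of parser configuration through JAXP.
public class ParserReporting {

    // Returns how many comment nodes appear directly under the root element.
    static int countComments(String xml, boolean ignoreComments) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setIgnoringComments(ignoreComments); // drop or keep comments
        factory.setCoalescing(false);                // keep CDATA sections as separate nodes
        DocumentBuilder builder = factory.newDocumentBuilder();
        Document doc = builder.parse(new InputSource(new StringReader(xml)));

        NodeList children = doc.getDocumentElement().getChildNodes();
        int count = 0;
        for (int i = 0; i < children.getLength(); i++) {
            if (children.item(i).getNodeType() == Node.COMMENT_NODE) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) throws Exception {
        String xml = "<doc><!-- keep me --><item/></doc>";
        System.out.println(countComments(xml, false)); // comments reported
        System.out.println(countComments(xml, true));  // comments discarded
    }
}
```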
End of preview. The remaining content of this section is not shown here.
Generating XML Documents
One area of XML development that isn't often addressed is that of generating XML documents for consumption by other applications. Although there are several approaches for processing XML documents, there are relatively few techniques currently used to create new documents.
One of the simplest (and most common) approaches is to use the string and/or file processing facilities of your target development environment to construct the XML document directly. This approach has the benefit of being easy to understand, efficient, and readily accessible to every programmer. These Java statements emit a simple XML document to a file output stream:
// requires java.io.FileWriter; write() and close() can throw IOException
FileWriter out = new FileWriter("message.xml");
out.write("<message>Hello, world!</message>");
out.close();  // flush the buffered text to the file
It's not hard to see how this approach would be implemented in any other programming language. For example, in C++, the following statements create the desired result:
// requires <fstream> and using namespace std
ofstream fout;
fout.open("message.xml");  // default mode truncates, so a rerun cannot append a second root element
fout << "<message>Hello, world!</message>";
fout.close();
This is a completely valid approach, and it should be considered when the XML document is not overly complex and the structure of the document will not change substantially over the lifetime of the application. The disadvantage of this approach is that it is much easier to generate a document that is not well-formed or is invalid, since no validation or verification of the structure of the document occurs as it is generated. When using this technique, you of course have to make sure that both your code and the data coming in will produce well-formed XML.
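One way to keep hand-built output well-formed is to escape every piece of incoming data before concatenating it into markup. The following is a minimal sketch of that idea; the class and method names are invented for this example, and it deliberately does not deal with characters that are illegal in XML 1.0 (such as most control characters), which would need to be rejected or removed separately.

```java
// Hypothetical helper for string-based XML generation.
public class XmlText {

    // Replace the five characters that have predefined XML entities.
    static String escape(String s) {
        StringBuilder sb = new StringBuilder(s.length());
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            switch (c) {
                case '&':  sb.append("&amp;");  break;
                case '<':  sb.append("&lt;");   break;
                case '>':  sb.append("&gt;");   break;
                case '"':  sb.append("&quot;"); break;
                case '\'': sb.append("&apos;"); break;
                default:   sb.append(c);
            }
        }
        return sb.toString();
    }

    // Build the same <message> document as above from untrusted text.
    static String message(String text) {
        return "<message>" + escape(text) + "</message>";
    }

    public static void main(String[] args) {
        System.out.println(message("Tom & Jerry <3"));
        // prints: <message>Tom &amp; Jerry &lt;3</message>
    }
}
```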
If all of that data validation sounds like a hassle, you may want to explore Genx (http://www.tbray.org/ongoing/genx/docs/Guide.html), a C library created by Tim Bray, one of the editors of the XML specification, that generates Canonical XML.
Another common approach involves using a tree-based API, such as the DOM, to create an XML document tree dynamically. The benefit of this approach is that the library enforces well-formedness constraints, and in the case of DOM Level 3, it can be configured to enforce validity constraints as well.
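A tree-based version of the earlier message example might look like the following sketch. It assumes the JAXP classes bundled with Java (DocumentBuilderFactory for the empty tree, and an identity Transformer for serialization); the class name DomMessageWriter is made up. Because the text is stored in a text node rather than pasted into markup, the serializer takes care of escaping.

```java
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

// Hypothetical example: build a document as a DOM tree, then serialize it.
public class DomMessageWriter {

    static String buildMessage(String text) throws Exception {
        // create an empty document and populate it
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .newDocument();
        Element message = doc.createElement("message");
        message.appendChild(doc.createTextNode(text));
        doc.appendChild(message);

        // an identity transformation copies the tree to the output as XML
        Transformer serializer = TransformerFactory.newInstance().newTransformer();
        serializer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
        StringWriter out = new StringWriter();
        serializer.transform(new DOMSource(doc), new StreamResult(out));
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        System.out.println(buildMessage("Fish & chips < 5"));
    }
}
```

The cost of this approach is that the whole document is held in memory before anything is written, which matters for very large outputs.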
End of preview. The remaining content of this section is not shown here.
Chapter 19: Document Object Model (DOM)
The Document Object Model (DOM) defines an API for accessing and manipulating XML documents as tree structures. The DOM is defined by a set of W3C Recommendations that describe a programming language-neutral object model used to store hierarchical documents in memory. The most recently completed standard, DOM Level 3, provides models for manipulating XML documents, HTML documents, and CSS stylesheets. This chapter covers only the parts of the DOM that are applicable to processing XML documents.
This chapter is based on the DOM Level 3 Core Recommendation, which was released on April 7, 2004. This version of the recommendation, along with any errata that have been reported, is available on the W3C web site (http://www.w3.org/TR/DOM-Level-3-Core/). Level 3 introduces several key features that were lacking from earlier DOM Levels, including:
  • Validation—it is now possible to enforce validity constraints during programmatic manipulation of the DOM tree.
  • Type information—post-validation element and attribute type information is now available through standard DOM interfaces.
  • Support for XML 1.1—allows the developer to select which version of the XML recommendation a given DOM document will conform to.
DOM Foundations
At its heart, the DOM is a set of abstract interfaces. Various DOM implementations use their own objects to support the interfaces defined in the DOM specification. The DOM interfaces themselves are specified in modules, making it possible for implementations to support parts of the DOM without having to support all of it. XML parsers, for instance, aren't required to provide support for the HTML-specific parts of the DOM, and modularization has provided a simple mechanism that allows software developers to identify which parts of the DOM are supported or not supported by a particular implementation.
Successive versions of the DOM are defined as levels. The Level 1 DOM was the W3C's first release, and it focused on working with HTML and XML in a browser context. Effectively, it supported dynamic HTML and provided a base for XML document processing. Because it expected documents to exist already in a browser context, Level 1 only described an object structure and how to manipulate it, not how to load a document into that structure or reserialize a document from that structure.
Subsequent levels have added functionality. DOM Level 2, which was published as a set of specifications, one per module, includes updates for the Core and HTML modules of Level 1, as well as new modules for Views, Events, Style, Traversal, and Range. DOM Level 3 added Abstract Schemas, Load, Save, XPath, and updates to the Core and Events modules.
Other W3C specifications have defined extensions to the DOM particular to their own needs. Mathematical Markup Language (MathML), Scalable Vector Graphics (SVG), Synchronized Multimedia Integration Language (SMIL), and SMIL Animation have all defined DOMs that provide access to details of their own vocabularies.
For a complete picture of the requirements these modules are supposed to address, see http://www.w3.org/TR/DOM-Requirements. For a listing of all of the DOM specifications, including those still under development, see
End of preview. The remaining content of this section is not shown here.
Structure of the DOM Core
The DOM Core interfaces provide generic access to all supported document content types. The DOM also defines a set of HTML-specific interfaces that expose specific document structures, such as tables, paragraphs, and img elements, directly. Besides using these specialized interfaces, you can access the same information using the generic interfaces defined in the core.
Since XML is designed as a venue for creating new, unique, structured markup languages, standards bodies cannot define application-specific interfaces in advance. Instead, the DOM Core interfaces are provided to manipulate document elements in a completely application-independent manner.
The DOM Core is further segregated into the Fundamental and Extended Interfaces. The Fundamental Interfaces are relevant to both XML and HTML documents, whereas the Extended Interfaces deal with XML-only document structures, such as entity declarations and processing instructions. All DOM Core interfaces are derived from the Node interface, which provides a generic set of methods for accessing a document or document fragment's tree structure and content.
To simplify different types of document processing and enable efficient implementation of DOM by some programming languages, there are actually two distinct methods for accessing a document tree from within the DOM Core: through the generic Node interface and through specific interfaces for each node type. Although there are several distinct types of markup that may appear within an XML document (elements, attributes, processing instructions, and so on), the relationships between these different document features can be expressed as a typical hierarchical tree structure. Elements are linked to both their predecessors and successors, as well as their parent and child nodes. Although there are many different types of nodes, the basic parent, child, and sibling relationships are common to everything in an XML document.
End of preview. The remaining content of this section is not shown here.
Node and Other Generic Interfaces
The Node interface is the DOM Core class hierarchy's root. Though never instantiated directly, it is the root interface of all specific interfaces, and you can use it to extract information from any object within a DOM document tree without knowing its actual type. It is possible to access a document's complete structure and content using only the methods and properties exposed by the Node interface. As shown in Table 19-1, this interface contains information about the type, location, name, and value of the corresponding underlying document data.
Table 19-1: The Node interface
Name                       Type             Read-only   2.0   3.0
Attributes
attributes                 NamedNodeMap
baseURI                    DOMString
childNodes                 NodeList
firstChild                 Node
lastChild                  Node
localName                  DOMString
namespaceURI               DOMString
nextSibling                Node
nodeName                   DOMString
nodeType                   unsigned short
nodeValue                  DOMString
ownerDocument              Document
parentNode                 Node
prefix                     DOMString
previousSibling            Node
textContent                DOMString
Methods
appendChild                Node
cloneNode                  Node
compareDocumentPosition    unsigned short
getFeature                 DOMObject
getUserData                DOMUserData
hasAttributes              boolean
hasChildNodes              boolean
insertBefore               Node
isDefaultNamespace         boolean
isEqualNode                boolean
isSameNode                 boolean
isSupported                boolean
lookupNamespaceURI         DOMString
lookupPrefix               DOMString
normalize                  void
removeChild                Node
replaceChild               Node
setUserData                DOMUserData
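To show that the generic Node interface really is enough to reach a whole document, here is a small sketch that walks a parsed tree using only Node attributes from Table 19-1 (nodeName, nodeValue, firstChild, and nextSibling, in their Java binding as getter methods). The class name NodeWalker and the sample document are invented; parsing uses the JAXP parser bundled with Java. Note that attributes are not children of their element, so this walk does not visit them.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Hypothetical example: traverse a DOM tree through the Node interface alone.
public class NodeWalker {

    // Append one indented line per node, then recurse into its children.
    static void walk(Node node, int depth, StringBuilder out) {
        for (int i = 0; i < depth; i++) {
            out.append("  ");
        }
        out.append(node.getNodeName());
        if (node.getNodeValue() != null) {
            out.append(" = ").append(node.getNodeValue().trim());
        }
        out.append('\n');

        // follow the firstChild/nextSibling links instead of a NodeList
        for (Node child = node.getFirstChild(); child != null;
                child = child.getNextSibling()) {
            walk(child, depth + 1, out);
        }
    }

    static String describe(String xml) throws Exception {
        Node doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
        StringBuilder out = new StringBuilder();
        walk(doc, 0, out);
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        System.out.print(describe("<a><b>hi</b><!--c--></a>"));
    }
}
```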
Since the Node interface is never instantiated directly, the nodeType attribute contains a value that indicates the given instance's specific object type. Based on the nodeType, it is possible to cast a generic Node reference safely to a specific interface for further processing. Table 19-2 shows the node type values and their corresponding DOM interfaces, and Table 19-3 shows the values they provide for the nodeName, nodeValue, and attributes attributes.
Table 19-2: The DOM node types and interfaces
Node type             DOM interface
ATTRIBUTE_NODE        Attr
CDATA_SECTION_NODE
End of preview. The remaining content of this section is not shown here.
Specific Node-Type Interfaces
Although it is possible to access the data from the original XML document using only the Node interface, the DOM Core provides a number of specific node-type interfaces that simplify common programming tasks. These specific node types can be divided into two broad types: structural nodes and content nodes.
Within an XML document, a number of syntax structures exist that are not formally part of the content. The following interfaces provide access to the portions of the document that are not related to element data.

Section 19.4.1.1: DocumentType

The DocumentType interface provides access to the XML document type definition's notations, entities, internal subset, public ID, and system ID. Since a document can have only one DOCTYPE declaration, only one DocumentType node can exist for a given document. It is accessed via the doctype attribute of the Document interface. The definition of the DocumentType interface is shown in Table 19-6.
Table 19-6: The DocumentType interface, derived from Node
Name              Type            Read-only
Attributes
entities          NamedNodeMap
internalSubset    DOMString
name              DOMString
notations         NamedNodeMap
publicId          DOMString
systemId          DOMString
Using additional fields available since DOM Level 2, it is now possible to fully reconstruct a parsed document using only the information provided within the DOM framework. No programmatic way to modify DocumentType node contents currently exists.
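As a brief, hedged illustration, the following sketch reads the DocumentType attributes from Table 19-6 for a document with a small internal DTD subset. The class name and the sample document are made up, and the exact text returned for internalSubset depends on the parser; this assumes the JAXP parser bundled with Java.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.DocumentType;
import org.xml.sax.InputSource;

// Hypothetical example: inspect the DOCTYPE of a parsed document.
public class DoctypeInfo {

    // Returns the document's DocumentType node, or null if it has no DOCTYPE.
    static DocumentType doctypeOf(String xml) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
        return doc.getDoctype();
    }

    public static void main(String[] args) throws Exception {
        String xml = "<!DOCTYPE note ["
                + "<!ELEMENT note (#PCDATA)>"
                + "<!ENTITY sig \"A. Author\">"
                + "]><note>&sig;</note>";
        DocumentType dt = doctypeOf(xml);

        System.out.println("name:            " + dt.getName());
        System.out.println("public ID:       " + dt.getPublicId());   // no external subset here
        System.out.println("system ID:       " + dt.getSystemId());
        System.out.println("internal subset: " + dt.getInternalSubset());
        System.out.println("entities:        " + dt.getEntities().getLength());
    }
}
```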

Section 19.4.1.2: ProcessingInstruction

The ProcessingInstruction node type provides direct access to a processing instruction's contents. Though processing instructions appear in the document's text, they may also appear before or after the root element, as well as in DTDs. Table 19-7 describes the ProcessingInstruction node's attributes.
End of preview. The remaining content of this section is not shown here.
The DOMImplementation Interface
The DOMImplementation interface could be considered the highest level interface in the DOM. It exposes the hasFeature( ) method, which allows a programmer using a given DOM implementation to detect if specific features are available. In DOM Level 2, it introduced facilities for creating new DocumentType nodes, which can then be used to create new Document instances.
The only method added to the DOMImplementation interface for Level 3 was the getFeature( ) method. This method allows DOM implementers to provide access to extended functionality, which is not part of the DOM specification itself, through the use of extension objects. These objects implement the DOMObject interface, which generally maps to the generic object (e.g., the Java Object) type in the underlying programming language (if the language is object-oriented). Table 19-15 describes the DOMImplementation interface.
Table 19-15: The DOMImplementation interface
Name                  Type            2.0   3.0
Methods
createDocument        Document
createDocumentType    DocumentType
getFeature            DOMObject
hasFeature            boolean
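The methods in Table 19-15 can be exercised with a few lines of Java. In this sketch, the DOMImplementation is obtained from a JAXP DocumentBuilder, which is one convenient route rather than the only one. The class name, the element name message, and the system ID message.dtd are placeholders chosen for the example.

```java
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.DOMImplementation;
import org.w3c.dom.Document;
import org.w3c.dom.DocumentType;

// Hypothetical example: feature testing and document creation.
public class ImplementationDemo {

    static DOMImplementation implementation() throws Exception {
        return DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .getDOMImplementation();
    }

    // Create an empty document with a DOCTYPE and a root element.
    static Document newMessageDocument() throws Exception {
        DOMImplementation impl = implementation();
        // qualified name, public ID (none), system ID (placeholder)
        DocumentType doctype = impl.createDocumentType("message", null, "message.dtd");
        // namespace URI (none), root element name, document type
        return impl.createDocument(null, "message", doctype);
    }

    public static void main(String[] args) throws Exception {
        DOMImplementation impl = implementation();
        System.out.println("Core 2.0 supported: " + impl.hasFeature("Core", "2.0"));

        Document doc = newMessageDocument();
        System.out.println(doc.getDocumentElement().getTagName());
        System.out.println(doc.getDoctype().getSystemId());
    }
}
```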
End of preview. The remaining content of this section is not shown here.
DOM Level 3 Interfaces
DOM Level 3 includes several new interfaces that support features such as:
  • XML Version 1.1
  • Dynamic DOM implementation selection
  • Generic post-validation document type information
  • Dynamic error handling
The following sections describe the new interfaces that were introduced in DOM Level 3.
The DOMStringList interface models a simple utility class that contains an ordered list of DOMString objects. Table 19-16 describes the DOMStringList interface.
Table 19-16: The DOMStringList interface
Name        Type             Read-only
Attribute
length      unsigned long
Methods
contains    boolean
item        DOMString

Section 19.6.1.1: NameList

The NameList interface models an ordered collection of names and corresponding namespace URIs. One use of this interface is in modeling the linkage between namespace prefixes and namespace URIs. Table 19-17 describes the NameList interface.
Table 19-17: The NameList interface
Name               Type             Read-only
Attribute
length             unsigned long
Methods
contains           boolean
containsNS         boolean
getName            DOMString
getNamespaceURI    DOMString

Section 19.6.1.2: DOMImplementationList

The DOMImplementationList interface models a list of DOMImplementation objects, as shown in Table 19-18.
Table 19-18: The DOMImplementationList interface
Name      Type                 Read-only
Attribute
length    unsigned long
Method
item      DOMImplementation

Section 19.6.1.3: DOMImplementationSource

The DOMImplementationSource interface, shown in Table 19-19, allows a DOM client to dynamically select a particular DOM implementation from a list of available implementations based on a requested feature set. It also allows the client to retrieve a complete list of all DOMImplementation objects that are available at runtime.
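In the Java binding, client code normally reaches the available DOMImplementationSource objects through the org.w3c.dom.bootstrap.DOMImplementationRegistry class rather than directly. The sketch below asks the registry for an implementation by feature string and then lists every implementation that supports the Core module. The class name is made up, and which implementations are found depends on the Java runtime and classpath.

```java
import org.w3c.dom.DOMImplementation;
import org.w3c.dom.DOMImplementationList;
import org.w3c.dom.bootstrap.DOMImplementationRegistry;

// Hypothetical example: select a DOM implementation by requested features.
public class ImplementationLookup {

    // Returns the first implementation supporting the features, or null.
    static DOMImplementation find(String features) throws Exception {
        DOMImplementationRegistry registry = DOMImplementationRegistry.newInstance();
        return registry.getDOMImplementation(features);
    }

    public static void main(String[] args) throws Exception {
        // feature names and versions are separated by spaces
        DOMImplementation impl = find("XML 3.0");
        System.out.println("XML 3.0 implementation found: " + (impl != null));

        DOMImplementationRegistry registry = DOMImplementationRegistry.newInstance();
        DOMImplementationList all = registry.getDOMImplementationList("Core");
        for (int i = 0; i < all.getLength(); i++) {
            System.out.println(all.item(i).getClass().getName());
        }
    }
}
```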
End of preview. The remaining content of this section is not shown here.
Parsing a Document with DOM
Although DOM Level 2 doesn't specify an actual interface for parsing a document, most implementations provide a simple parsing interface that accepts a reference to an XML document file, stream, or URI. After this interface successfully parses and validates the document (if it is a validating parser), it generally provides a mechanism for getting a reference to the Document interface's instance for the parsed document. The following code fragment shows how to parse a document using the Apache Xerces XML DOM parser:
// create a new parser
DOMParser dp = new DOMParser();

// parse the document and get the DOM Document interface
dp.parse("http://www.w3.org/TR/2000/REC-xml-20001006.xml");
Document doc = dp.getDocument();
DOM Level 3 adds standard mechanisms for loading XML documents and reserializing (saving) DOM trees as XML. JAXP also provides standardized approaches for these processes in Java, although JAXP and DOM Level 3 offer different approaches.
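For comparison with the Xerces-specific fragment above, here is a hedged sketch of the DOM Level 3 Load and Save interfaces in Java. It parses a document from a string with an LSParser and writes it back out with an LSSerializer. The class name and sample document are invented for this example, and the code assumes that the runtime's DOM implementation supports the LS feature, as the one bundled with Java does.

```java
import org.w3c.dom.Document;
import org.w3c.dom.bootstrap.DOMImplementationRegistry;
import org.w3c.dom.ls.DOMImplementationLS;
import org.w3c.dom.ls.LSInput;
import org.w3c.dom.ls.LSParser;
import org.w3c.dom.ls.LSSerializer;

// Hypothetical example: load and reserialize a document with DOM Level 3 LS.
public class LoadAndSave {

    static String roundTrip(String xml) throws Exception {
        // ask the registry for an implementation that supports Load and Save
        DOMImplementationLS ls = (DOMImplementationLS)
                DOMImplementationRegistry.newInstance().getDOMImplementation("LS");

        // load: wrap the text in an LSInput and parse it synchronously
        LSInput input = ls.createLSInput();
        input.setStringData(xml);
        LSParser parser = ls.createLSParser(DOMImplementationLS.MODE_SYNCHRONOUS, null);
        Document doc = parser.parse(input);

        // save: serialize the tree back to a string, without an XML declaration
        LSSerializer serializer = ls.createLSSerializer();
        serializer.getDomConfig().setParameter("xml-declaration", Boolean.FALSE);
        return serializer.writeToString(doc);
    }

    public static void main(String[] args) throws Exception {
        System.out.println(roundTrip("<message>Hello, world!</message>"));
    }
}
```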
End of preview. The remaining content of this section is not shown here.
A Simple DOM Application
Example 19-1 illustrates how you might use the interfaces discussed in this chapter in a typical programming situation. This application takes a document that uses the furniture.dtd sample DTD from Chapter 21 and validates that the parts list included in the document matches the actual parts used within the document.
Example 19-1. Parts checker application
/**
 * PartsCheck.java
 *
 * DOM Usage example from the O'Reilly _XML in a Nutshell_ book.
 *
 */

// we'll use the Apache Software Foundation's Xerces parser.
import org.apache.xerces.parsers.*;

// import the DOM and SAX interfaces
import org.w3c.dom.*;
import org.xml.sax.*;

// get the necessary Java support classes
import java.io.*;
import java.util.*;

/**
 * This class is designed to check the parts list of an XML document that
 * represents a piece of furniture for validity.  It uses the DOM to
 * analyze the actual furniture description and then check it against the
 * parts list that is embedded in the document.
 */
public class PartsCheck {
  // static constants
  public static final String FURNITURE_NS =
      "http://namespaces.oreilly.com/furniture/";

  // contains the true part count, keyed by part number
  HashMap<String, Integer> m_hmTruePartsList = new HashMap<String, Integer>();

  /**
   * The main function that allows this class to be invoked from the command
   * line.  Check each document provided on the command line for validity.
   */
  public static void main(String[] args) {
    PartsCheck pc = new PartsCheck();

    try {
      for (int i = 0; i < args.length; i++) {
        pc.validatePartsList(args[i]);
      }
    } catch (Exception e) {
      System.err.println(e);
    }
  }

  /**
   * Given a system identifier for an XML document, this function compares
   * the actual parts used to the declared parts list within the document.  It
   * prints warnings to standard error if the lists don't agree.
   */
  public void validatePartsList(String strXMLSysID) throws IOException,
      SAXException
  {
    // start from an empty count so earlier documents don't affect this one
    m_hmTruePartsList.clear();

    // create a new parser
    DOMParser dp = new DOMParser();

    // parse the document and get the DOM Document interface
    dp.parse(strXMLSysID);
    Document doc = dp.getDocument();

    // get an accurate parts list count
    countParts(doc.getDocumentElement(), 1);

    // compare it to the parts list in the document
    reconcilePartsList(doc);
  }

  /**
   * Updates the true parts list by adding the count to the current count
   * for the part number given.
   */
  private void recordPart(String strPartNum, int cCount)
  {
    if (!m_hmTruePartsList.containsKey(strPartNum)) {
      // this part isn't listed yet
      m_hmTruePartsList.put(strPartNum, Integer.valueOf(cCount));
    } else {
      // update the count
      Integer cUpdate = m_hmTruePartsList.get(strPartNum);
      m_hmTruePartsList.put(strPartNum,
          Integer.valueOf(cUpdate.intValue() + cCount));
    }
  }

  /**
   * Counts the parts referenced by and below the given node.
   */
  private void countParts(Node nd, int cRepeat)
  {
    // start the local repeat count at 1
    int cLocalRepeat = 1;

    // make sure we should process this element
    if (FURNITURE_NS.equals(nd.getNamespaceURI())) {
      Node ndTemp;

      if ((ndTemp = nd.getAttributes().getNamedItem("repeat")) != null) {
        // this node specifies a repeat count for its children
        cLocalRepeat = Integer.parseInt(ndTemp.getNodeValue());
      }

      if ((ndTemp = nd.getAttributes().getNamedItem("part_num")) != null) {
        // start the count at 1
        int cCount = 1;
        String strPartNum = ndTemp.getNodeValue();

        if ((ndTemp = nd.getAttributes().getNamedItem("count")) != null) {
          // more than one part needed by this node
          cCount = Integer.parseInt(ndTemp.getNodeValue());
        }

        // multiply the local count by the repeat passed in from the parent
        cCount *= cRepeat;

        // add the new parts count to the total
        recordPart(strPartNum, cCount);
      }
    }

    // now process the children
    NodeList nl = nd.getChildNodes();
    Node ndCur;

    for (int i = 0; i < nl.getLength(); i++) {
      ndCur = nl.item(i);

      if (ndCur.getNodeType() == Node.ELEMENT_NODE) {
        // recursively count the parts for the child, using the local repeat
        countParts(ndCur, cLocalRepeat);
      }
    }
  }

  /**
   * This method reconciles the true parts list against the list in the document.
   */
  private void reconcilePartsList(Document doc)
  {
    Iterator<String> iReal = m_hmTruePartsList.keySet().iterator();

    String strPartNum;
    int cReal;
    Node ndCheck;

    // loop through all of the parts in the true parts list
    while (iReal.hasNext()) {
      strPartNum = iReal.next();
      cReal = m_hmTruePartsList.get(strPartNum).intValue();

      // find the part list element in the document
      ndCheck = doc.getElementById(strPartNum);

      if (ndCheck == null) {
        // this part isn't even listed!
        System.err.println("missing <part_name> element for part #" +
            strPartNum + " (count " + cReal + ")");
      } else {
        Node ndTemp;

        if ((ndTemp = ndCheck.getAttributes().getNamedItem("count")) != null) {
          int cCheck = Integer.parseInt(ndTemp.getNodeValue());

          if (cCheck != cReal) {
            // counts don't agree
            System.err.println("<part_name> element for part #" +
                strPartNum + " is incorrect:  true part count = " + cReal +
                " (count in document is " + cCheck + ")");
          }
        } else {
          // they didn't provide a count for this part!
          System.err.println("missing count attribute for part #" +
              strPartNum + " (count " + cReal + ")");
        }
      }
    }
  }
}
End of preview. The remaining content of this section is not shown here.
Chapter 20: Simple API for XML (SAX)
The Simple API for XML (SAX) is an event-based API for reading XML documents. Many different XML parsers implement the SAX API, including Xerces, Crimson, the Oracle XML Parser for Java, and Ælfred. SAX was originally defined as a Java API and is primarily intended for parsers written in Java. Therefore, this chapter focuses on the Java version of the API. However, SAX has been ported to most other major object-oriented languages, including C++, Python, Perl, and Eiffel. The translation from Java is usually fairly obvious.
The SAX API is unusual among XML APIs because it's an event-based push model rather than a tree-based pull model. As the XML parser reads an XML document, it sends the program information from the document in real time. Each time the parser sees a start-tag, an end-tag, character data, or a processing instruction, it tells your program. The document is presented to your program one piece at a time from beginning to end. You can either save the pieces you're interested in until the entire document has been read, or process the information as soon as you receive it. You do not have to wait for the entire document to be read before acting on the data at the beginning of the document. Most importantly, the entire document does not have to reside in memory. This feature makes SAX the API of choice for very large documents that do not fit into available memory.
This chapter covers SAX2 exclusively. In 2004, all major parsers that support SAX also support SAX2. The major change in SAX2 from SAX1 is the addition of namespace support, which necessitated changing the names and signatures of almost every method and class in SAX. The old SAX1 methods and classes are still available, but they're now deprecated, and you shouldn't use them.
SAX is primarily a collection of interfaces in the org.xml.sax package. One such interface is XMLReader. This interface represents the XML parser. It declares methods to parse a document and configure the parsing process, for instance, by turning validation on or off. To parse a document with SAX, first create an instance of
End of preview. The remaining content of this section is not shown here.
The ContentHandler Interface
ContentHandler, shown in stripped-down form in Example 20-1, is an interface in the org.xml.sax package. You implement this interface in a class of your own devising. Next, you configure an XMLReader with an instance of your implementation. As the XMLReader reads the document, it invokes the methods in this object to tell your program what's in the XML document. You can respond to these method invocations in any way you see fit.
This ContentHandler interface has no relation to the moribund java.net.ContentHandler class. However, you may encounter a name conflict if you import both java.net.* and org.xml.sax.* in the same class. It's better to import just the java.net classes you actually need, rather than the entire package.
Example 20-1. The org.xml.sax.ContentHandler interface
package org.xml.sax;

public interface ContentHandler {

    public void setDocumentLocator(Locator locator);
    public void startDocument() throws SAXException;
    public void endDocument() throws SAXException;
    public void startPrefixMapping(String prefix, String uri)
     throws SAXException;
    public void endPrefixMapping(String prefix) throws SAXException;
    public void startElement(String namespaceURI, String localName,
     String qualifiedName, Attributes atts) throws SAXException;
    public void endElement(String namespaceURI, String localName,
     String qualifiedName) throws SAXException;
    public void characters(char[] text, int start, int length)
     throws SAXException;
    public void ignorableWhitespace(char[] text, int start, int length)
     throws SAXException;
    public void processingInstruction(String target, String data)
     throws SAXException;
    public void skippedEntity(String name) throws SAXException;

}
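To make the callback flow concrete, here is a minimal sketch of a handler that records the calls it receives. The class name EventLogger and its log() helper are hypothetical names chosen for illustration, and the parser is simply whatever the JAXP default factory returns. It extends org.xml.sax.helpers.DefaultHandler, which implements ContentHandler with do-nothing methods, so only the callbacks of interest need to be overridden.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical demo class: records each ContentHandler callback as a string.
final class EventLogger extends DefaultHandler {

  final List<String> events = new ArrayList<>();

  @Override
  public void startDocument() {
    events.add("startDocument");
  }

  @Override
  public void endDocument() {
    events.add("endDocument");
  }

  @Override
  public void startElement(String namespaceURI, String localName,
      String qualifiedName, Attributes atts) {
    events.add("startElement:" + qualifiedName);
  }

  @Override
  public void endElement(String namespaceURI, String localName,
      String qualifiedName) {
    events.add("endElement:" + qualifiedName);
  }

  @Override
  public void characters(char[] text, int start, int length) {
    // A parser may deliver one run of text in several chunks.
    events.add("characters:" + new String(text, start, length));
  }

  // Parses the XML text and returns the callbacks in the order received.
  static List<String> log(String xml) {
    try {
      XMLReader parser =
          SAXParserFactory.newInstance().newSAXParser().getXMLReader();
      EventLogger handler = new EventLogger();
      parser.setContentHandler(handler);
      parser.parse(new InputSource(new StringReader(xml)));
      return handler.events;
    } catch (Exception ex) {
      throw new IllegalStateException(ex);
    }
  }

  public static void main(String[] args) {
    log("<greeting>Hello</greeting>").forEach(System.out::println);
  }
}
```

The handler keeps no reference to the parser; the parser drives everything by calling back into the handler, which is the essence of the SAX design.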
Every time the XMLReader reads a piece of the document, it calls a method in its ContentHandler. Suppose a parser reads the simple document shown in Example 20-2.
Example 20-2. A simple XML document
Features and Properties
SAX uses properties and features to control parser behavior. Each feature and property has a name that's an absolute URI. Like namespace URIs, these URIs are only used to name things and do not necessarily point to a real page you can load into a web browser. Features are either true or false; that is, they're Booleans. Properties have values of an appropriate Object type. Different parsers support different groups of features and properties, although there are a few standard ones most parsers support.
The http://xml.org/sax/features/validation feature controls whether a parser validates. If this feature is true, then the parser will report validity errors in the document to the registered ErrorHandler; otherwise, it won't. This feature is turned off by default. To turn a feature on, pass the feature's name and value to the XMLReader's setFeature( ) method:
try {
  parser.setFeature("http://xml.org/sax/features/validation", true);
}
catch (SAXNotSupportedException ex) {
  System.out.println("Cannot turn on validation right now.");
}
catch (SAXNotRecognizedException ex) {
  System.out.println("This is not a validating parser.");
}
Not all parsers can validate. If you try to turn on validation in a parser that doesn't validate or set any other feature the parser doesn't provide, setFeature( ) throws a SAXNotRecognizedException. If you try to set a feature the parser does recognize but cannot change at the current time (e.g., you try to turn on validation when the parser has already read half of the document), setFeature( ) throws a SAXNotSupportedException. Both are subclasses of SAXException.
You can check a feature's current value using XMLReader's getFeature( ) method. This method returns a boolean and throws the same exceptions for the same reasons as setFeature( ). If you want to know whether the parser validates, you can ask in the following manner:
try {
  boolean isValidating =
   parser.getFeature("http://xml.org/sax/features/validation");
}
catch (SAXException ex) {
  System.out.println("This is not a validating parser");
}
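The two calls can be wrapped in small helpers that turn the exceptions into return values. The following is only a sketch: FeatureProbe, trySetFeature, and queryFeature are hypothetical names, the parser is whatever the JAXP default factory provides, and the example.com feature URI is made up purely to show how an unknown feature is rejected.

```java
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.SAXNotRecognizedException;
import org.xml.sax.SAXNotSupportedException;
import org.xml.sax.XMLReader;

// Hypothetical helper class for probing SAX features.
final class FeatureProbe {

  static final String VALIDATION = "http://xml.org/sax/features/validation";

  // Obtains whatever XMLReader the JAXP default factory provides.
  static XMLReader newReader() {
    try {
      return SAXParserFactory.newInstance().newSAXParser().getXMLReader();
    } catch (Exception ex) {
      throw new IllegalStateException(ex);
    }
  }

  // Returns true if the parser accepted the setting, false if the feature
  // is unknown to it or cannot be changed at this time.
  static boolean trySetFeature(XMLReader parser, String name, boolean value) {
    try {
      parser.setFeature(name, value);
      return true;
    } catch (SAXNotRecognizedException | SAXNotSupportedException ex) {
      return false;
    }
  }

  // Returns the feature's current value, or null if the parser cannot
  // report it.
  static Boolean queryFeature(XMLReader parser, String name) {
    try {
      return parser.getFeature(name);
    } catch (SAXNotRecognizedException | SAXNotSupportedException ex) {
      return null;
    }
  }

  public static void main(String[] args) {
    XMLReader parser = newReader();
    System.out.println("validation accepted: "
        + trySetFeature(parser, VALIDATION, true));
    System.out.println("validation value: "
        + queryFeature(parser, VALIDATION));
    // This feature URI is invented, so the parser should not recognize it.
    System.out.println("made-up feature accepted: "
        + trySetFeature(parser, "http://example.com/features/no-such-feature", true));
  }
}
```

Whether validation is accepted depends on the parser in use, which is why the sketch reports the outcome instead of assuming it.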
Filters
A SAX filter sits between the parser and the client application and intercepts the messages that these two objects pass to each other. It can pass these messages unchanged or modify, replace, or block them. To a client application, the filter looks like a parser, that is, an XMLReader. To the parser, the filter looks like a client application, that is, a ContentHandler.
SAX filters are implemented by subclassing the org.xml.sax.helpers.XMLFilterImpl class. This class implements all the required interfaces of SAX for both parsers and client applications. That is, its signature is as follows:
public class XMLFilterImpl implements XMLFilter, XMLReader,
 EntityResolver, ContentHandler, DTDHandler, ErrorHandler
Your own filters will extend this class and override those methods that correspond to the messages you want to filter. For example, if you wanted to filter out all processing instructions, you would write a filter that would override the processingInstruction() method to do nothing, as shown in Example 20-5.
Example 20-5. A SAX filter that removes processing instructions
import org.xml.sax.helpers.XMLFilterImpl;

public class ProcessingInstructionStripper extends XMLFilterImpl {

  public void processingInstruction(String target, String data) {
    // Because this does nothing, processing instructions read in the
    // document are *not* passed to client application
  }

}
If instead you wanted to replace a processing instruction with an element whose name was the same as the processing instruction's target and whose text content was the processing instruction's data, you'd call the startElement( ), characters( ), and endElement( ) methods from inside the processingInstruction() method after filling in the arguments with the relevant data from the processing instruction, as shown in Example 20-6.
Example 20-6. A SAX filter that converts processing instructions to elements
import org.xml.sax.*;
import org.xml.sax.helpers.*;

public class ProcessingInstructionConverter extends XMLFilterImpl {

  public void processingInstruction(String target, String data)
   throws SAXException {

    // AttributesImpl is an adapter class in the org.xml.sax.helpers package
    // for precisely this case. We don't really want to add any attributes
    // here, but we need to pass something as the fourth argument to
    // startElement().
    Attributes emptyAttributes = new AttributesImpl();

    // We won't use any namespace for the element
    startElement("", target, target, emptyAttributes);
    // converts String data to char array
    char[] text = data.toCharArray();
    characters(text, 0, text.length);

    endElement("", target, target);

  }

}
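A filter only takes effect once it is placed between a real parser and the client's handler. The sketch below shows one way to wire that up, assuming the JAXP default parser; FilterChainDemo, PIStripper, and Recorder are hypothetical names, and PIStripper simply repeats the idea of Example 20-5 so that the block is self-contained.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;
import org.xml.sax.helpers.XMLFilterImpl;

// Hypothetical demo class showing how a filter is installed.
final class FilterChainDemo {

  // Same idea as Example 20-5: swallow every processing instruction.
  static final class PIStripper extends XMLFilterImpl {
    @Override
    public void processingInstruction(String target, String data) {
      // Intentionally empty: the event is not forwarded to the client.
    }
  }

  // Client-side handler that records the elements and PIs it is told about.
  static final class Recorder extends DefaultHandler {
    final List<String> seen = new ArrayList<>();

    @Override
    public void startElement(String uri, String localName,
        String qualifiedName, Attributes atts) {
      seen.add("element:" + qualifiedName);
    }

    @Override
    public void processingInstruction(String target, String data) {
      seen.add("pi:" + target);
    }
  }

  // Parses the text either directly or through the filter.
  static List<String> parse(String xml, boolean useFilter) {
    try {
      XMLReader parser =
          SAXParserFactory.newInstance().newSAXParser().getXMLReader();
      XMLReader source = parser;
      if (useFilter) {
        PIStripper filter = new PIStripper();
        filter.setParent(parser); // the filter reads from the real parser
        source = filter;          // the client talks only to the filter
      }
      Recorder recorder = new Recorder();
      source.setContentHandler(recorder);
      source.parse(new InputSource(new StringReader(xml)));
      return recorder.seen;
    } catch (Exception ex) {
      throw new IllegalStateException(ex);
    }
  }

  public static void main(String[] args) {
    String xml = "<doc><?robots index?><p/></doc>";
    System.out.println("unfiltered: " + parse(xml, false));
    System.out.println("filtered:   " + parse(xml, true));
  }
}
```

Because the filter is itself an XMLReader, the client code is identical in both cases apart from which object it calls parse() on.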
Chapter 21: XML Reference
This chapter is intended to serve as a comprehensive reference to the Extensible Markup Language (XML) W3C recommendations for both XML 1.0 and 1.1. We have made every effort to cover the contents of the official W3C document exhaustively. However, if you are implementing an XML parser, editor, or other tool, you should also review the latest revision of these recommendations on the Web at http://www.w3.org/TR/REC-xml and http://www.w3.org/TR/xml11/. This book refers to the XML 1.0 Third Edition dated 04 February 2004 and the XML 1.1 Recommendation dated 04 February 2004, which was edited in place 15 April 2004.
The endorsement of the Extensible Markup Language (XML) 1.1 Recommendation in February of 2004 has introduced some challenges within the XML community. The markup language described by 1.1 is not precisely a superset of the language described by Version 1.0, which means that some documents that are well-formed under 1.0 rules will not be well-formed under 1.1 rules. The main narrative of this chapter adheres to the rules laid out by the 1.0 Recommendation. Notes such as this one will appear when necessary to outline the differences between XML 1.0 and XML 1.1.
When deciding which version of XML is appropriate for your application, consider that unless you specifically need to use markup names that contain characters not available in Unicode 2.0, XML 1.0 will most likely be the correct choice.
How to Use This Reference
This chapter consists of examples of XML documents and DTDs, followed by detailed reference sections that describe every feature of the XML specification and a listing of possible well-formedness and validity errors. The syntax items of XML are introduced in the rough order in which they appear in an XML document. Each entry explains the syntactic structure, where it can be used, and the applicable validity and well-formedness constraints. Each reference section contains a description of the XML language structure, an informal syntax, and an example of the syntax's usage where appropriate.
Annotated Sample Documents
These examples are intended as a mnemonic aid for XML syntax and as a quick map from a specific instance of an XML language construct to its corresponding XML syntax reference section. The sample document and DTD incorporate features defined in the XML 1.0 and Namespaces in XML recommendations.
The sample XML application describes the construction of a piece of furniture. Within the figures, each distinct language construct is enclosed in a box, with the relevant reference section name provided as a callout. By locating a construct in the sample, then locating the associated reference section, you can quickly recognize and learn about unfamiliar XML syntax. Four files make up this sample application:
bookcase.xml
The document shown in Figure 21-1 uses furniture.dtd to describe a simple bookcase.
Figure 21-1: bookcase.xml
furniture.dtd
The XML document type definition shown in Figure 21-2 provides a simple grammar for describing components and assembly details for a piece of furniture.
Figure 21-2: furniture.dtd
bookcase_ex.ent
The external entity file shown in Figure 21-3 contains additional bookcase-specific elements for the bookcase.xml document.
Figure 21-3: bookcase_ex.ent
parts_list.ent
Figure 21-4 contains an external parsed general entity example that contains the parts list for the bookcase example document.
Figure 21-4: parts_list.ent
XML Syntax
For each section of this reference that maps directly to an XML language structure, an informal syntax reference describes that structure's form. The following conventions are used with these syntax blocks:
Format          Meaning
DOCTYPE         Bold text indicates literal characters that must appear as written within the document (e.g., DOCTYPE).
encoding-name   Italicized text indicates that the user must replace the text with real data. The item indicates what type of data should be inserted (e.g., encoding-name = en-us).
|               The vertical bar indicates that only one out of a list of possible values can be selected.
[ ]             Square brackets indicate that a particular portion of the syntax is optional.
Every XML document is broken into two primary sections: the prolog and the document element. A few documents may also have comments or processing instructions that follow the root element in a sort of epilog (an unofficial term). The prolog contains structural information about the particular type of XML document you are writing, including the XML declaration and document type declaration. The prolog is optional, and if a document does not need to be validated against a DTD, it can be omitted completely. The only required structure in a well-formed XML document is the top-level document element itself.
The following syntax structures are common to the entire XML document. Unless otherwise noted within a subsequent reference item, the following structures can appear anywhere within an XML document.
Characters
XML documents are inherently text documents, which are composed of characters. To ensure that documents are portable across disparate computer systems and can contain content in as many written human languages as possible, XML parsers are required to implement the Unicode standard. This does not mean that all XML documents must be saved and edited in Unicode, but it does mean that the XML parser must be able to convert your document from its native character encoding to Unicode. All XML parsers are required to support, as a minimum, both UTF-8 and UTF-16 as input encoding formats. For more information on encoding formats and Unicode, see Chapter 27.
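As a small illustration of this conversion, the sketch below encodes the same document once as UTF-8 and once as UTF-16 and checks that a parser hands back the same Unicode text either way. EncodingDemo and its methods are hypothetical names, and the parser is whatever the JAXP default factory provides.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;

// Hypothetical demo class: same document, two byte encodings.
final class EncodingDemo {

  // Prefixes an XML declaration naming the charset, then encodes the
  // whole document in that charset.
  static byte[] encode(String body, Charset charset) {
    String declaration =
        "<?xml version=\"1.0\" encoding=\"" + charset.name() + "\"?>";
    return (declaration + body).getBytes(charset);
  }

  // Parses raw bytes and returns the text content of the root element.
  static String textOf(byte[] xmlBytes) {
    try {
      Document doc = DocumentBuilderFactory.newInstance()
          .newDocumentBuilder()
          .parse(new ByteArrayInputStream(xmlBytes));
      return doc.getDocumentElement().getTextContent();
    } catch (Exception ex) {
      throw new IllegalStateException(ex);
    }
  }

  public static void main(String[] args) {
    String body = "<greek>\u03C0</greek>"; // Greek small letter pi
    byte[] utf8 = encode(body, StandardCharsets.UTF_8);
    byte[] utf16 = encode(body, StandardCharsets.UTF_16);
    System.out.println("byte lengths: " + utf8.length + " and " + utf16.length);
    System.out.println("same text: " + textOf(utf8).equals(textOf(utf16)));
  }
}
```

The byte sequences differ, but the application sees identical characters, because the parser has already decoded both inputs to Unicode.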
Constraints
In addition to defining the basic structures used in documents and DTDs, XML 1.0 defines a list of rules regarding their usage. These constraints put limits on various aspects of XML usage, and documents cannot in fact be considered to be "XML" unless they meet all of the well-formedness constraints. Parsers are required to report violations of these constraints, although only well-formedness constraint violations require that processing of the document halt completely. Namespace constraints are defined in Namespaces in XML, not XML 1.0.
Well-formedness refers to an XML document's physical organization. Certain lexical rules must be obeyed before an XML parser can consider a document well-formed. These rules should not be confused with validity constraints, which determine whether a particular document is valid when parsed using the document structure rules contained in its DTD. The Backus-Naur Form (BNF) grammar rules must also be satisfied. The following sections contain all well-formedness constraints recognized by XML Version 1.0 parsers, including actual text from the 1.0 specification.
PEs in Internal Subset
Text from specification
In the internal DTD subset, parameter entity references can occur only where markup declarations can occur, not within markup declarations. (This does not apply to references that occur in external parameter entities or to the external subset.)
Explanation
It is only legal to use parameter entity references to build markup declarations within the external DTD subset. In other words, within the internal subset, parameter entities may only be used to include complete markup declarations.
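The difference between the two uses can be checked with any parser. The following sketch feeds two small documents to the JAXP default SAX parser: in one, a parameter entity supplies a complete declaration; in the other, it is referenced inside a declaration in the internal subset. InternalSubsetCheck and both sample documents are hypothetical and exist only for illustration.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical demo class for the "PEs in Internal Subset" constraint.
final class InternalSubsetCheck {

  // The parameter entity expands to one complete markup declaration.
  static final String LEGAL =
      "<!DOCTYPE doc [\n"
      + "  <!ENTITY % decl \"<!ELEMENT doc (#PCDATA)>\">\n"
      + "  %decl;\n"
      + "]>\n"
      + "<doc/>";

  // The parameter entity is referenced inside a markup declaration.
  static final String ILLEGAL =
      "<!DOCTYPE doc [\n"
      + "  <!ENTITY % model \"(#PCDATA)\">\n"
      + "  <!ELEMENT doc %model;>\n"
      + "]>\n"
      + "<doc/>";

  // Returns false if the parser reports a fatal (well-formedness) error.
  static boolean isWellFormed(String xml) {
    try {
      SAXParserFactory.newInstance().newSAXParser()
          .parse(new InputSource(new StringReader(xml)), new DefaultHandler());
      return true;
    } catch (SAXException ex) {
      return false;
    } catch (Exception ex) {
      throw new IllegalStateException(ex);
    }
  }

  public static void main(String[] args) {
    System.out.println("whole declaration: " + isWellFormed(LEGAL));
    System.out.println("inside declaration: " + isWellFormed(ILLEGAL));
  }
}
```

If the second DTD were moved into an external subset, the same reference would be permitted, which is exactly the exception the specification text spells out.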
External Subset
Text from specification
The external subset, if any, must match the production for extSubset.
Explanation
The extSubset production constrains what type of declaration may be contained in the external subset. This constraint generally means that the external subset of the DTD must only include whole declarations or parameter entity references. See the
XML 1.0 Document Grammar
The Extended Backus-Naur Form (EBNF) grammar, shown in the following section, was collected from the XML 1.0 Recommendation, Third Edition. It brings all XML language productions together in a single location and describes the syntax that is understood by XML 1.0-compliant parsers. Each production has been numbered and cross-referenced using superscripted numbers.

Section 21.5.1.1: Document

[1] document ::= prolog²² element³⁹ Misc²⁷*

Section 21.5.1.2: Character range

[2] Char ::= #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF] /* any Unicode character, excluding the surrogate blocks, FFFE, and FFFF. */

Section 21.5.1.3: Whitespace

[3] S ::= (#x20 | #x9 | #xD | #xA)+

Section 21.5.1.4: Names and tokens

[4] NameChar ::= Letter⁸⁴ | Digit⁸⁸ | '.' | '-' | '_' | ':' | CombiningChar⁸⁷ | Extender⁸⁹
[5] Name ::= (Letter⁸⁴ | '_' | ':') (NameChar⁴)*
[6] Names ::= Name⁵ (#x20 Name⁵)*
[7] Nmtoken ::= (NameChar⁴)+
[8] Nmtokens ::= Nmtoken⁷ (#x20 Nmtoken⁷)*

Section 21.5.1.5: Literals

[9] EntityValue ::= '"' ([^%&"] | PEReference⁶⁹ | Reference⁶⁷)* '"' | "'" ([^%&'] | PEReference⁶⁹ | Reference⁶⁷)* "'"
[10] AttValue ::= '"' ([^<&"] | Reference⁶⁷)* '"' | "'" ([^<&'] | Reference⁶⁷)* "'"
[11] SystemLiteral ::= ('"' [^"]* '"') | ("'" [^']* "'")
[12] PubidLiteral ::= '"' PubidChar¹³* '"' | "'" (PubidChar¹³ - "'")* "'"
[13] PubidChar ::= #x20 | #xD | #xA | [a-zA-Z0-9] | [-'()+,./:=?;!*#@$_%]

Section 21.5.1.6: Character data

[14] CharData ::= [^<&]* - ([^<&]* ']]>' [^<&]*)

Section 21.5.1.7: Comments

[15] Comment ::= '<!--' ((Char² - '-') | ('-' (Char² - '-')))* '-->'
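Productions such as [2] and [3] translate almost mechanically into code. The following sketch shows one possible rendering in Java; Xml10Chars and its method names are hypothetical. The lower bound of the main range is #x20 (the space character), as in the W3C recommendation.

```java
// Hypothetical helper class: hand-written checks for two XML 1.0 productions.
final class Xml10Chars {

  // Production [2] Char: tab, line feed, carriage return, and every
  // code point from #x20 up, minus surrogates, #xFFFE, and #xFFFF.
  static boolean isChar(int codePoint) {
    return codePoint == 0x9 || codePoint == 0xA || codePoint == 0xD
        || (codePoint >= 0x20 && codePoint <= 0xD7FF)
        || (codePoint >= 0xE000 && codePoint <= 0xFFFD)
        || (codePoint >= 0x10000 && codePoint <= 0x10FFFF);
  }

  // Production [3] S: one or more of space, tab, carriage return, line feed.
  static boolean isS(String text) {
    if (text.isEmpty()) {
      return false;
    }
    for (int i = 0; i < text.length(); i++) {
      char c = text.charAt(i);
      if (c != 0x20 && c != 0x9 && c != 0xD && c != 0xA) {
        return false;
      }
    }
    return true;
  }

  // True if every code point in the text matches Char. An unpaired
  // surrogate is reported by codePoints() as itself and is rejected.
  static boolean isAllChars(String text) {
    return text.codePoints().allMatch(Xml10Chars::isChar);
  }

  public static void main(String[] args) {
    System.out.println(isChar(' '));            // space is a legal character
    System.out.println(isChar(0x1));            // a C0 control is not
    System.out.println(isS(" \t\r\n"));
    System.out.println(isAllChars("caf\u00E9"));
  }
}
```

Working on code points rather than char values matters for the last range, [#x10000-#x10FFFF], which Java strings store as surrogate pairs.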

Section 21.5.1.8: Processing instructions

XML 1.1 Document Grammar
The following grammar provides the EBNF productions for the XML 1.1 recommendation.

Section 21.6.1.1: Document

[1] document ::= (prolog²² element³⁹ Misc²⁷*) - (Char²* RestrictedChar²ᵃ Char²*)

Section 21.6.1.2: Character range

[2] Char ::= [#x1-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF] /* any Unicode character, excluding the surrogate blocks, FFFE, and FFFF. */
[2a] RestrictedChar ::= [#x1-#x8] | [#xB-#xC] | [#xE-#x1F] | [#x7F-#x84] | [#x86-#x9F]

Section 21.6.1.3: Whitespace

[3] S ::= (#x20 | #x9 | #xD | #xA)+

Section 21.6.1.4: Names and tokens

[4] NameStartChar ::= ":" | [A-Z] | "_" | [a-z] | [#xC0-#xD6] | [#xD8-#xF6] | [#xF8-#x2FF] | [#x370-#x37D] | [#x37F-#x1FFF] | [#x200C-#x200D] | [#x2070-#x218F] | [#x2C00-#x2FEF] | [#x3001-#xD7FF] | [#xF900-#xFDCF] | [#xFDF0-#xFFFD] | [#x10000-#xEFFFF]
[4a] NameChar ::= NameStartChar⁴ | "-" | "." | [0-9] | #xB7 | [#x0300-#x036F] | [#x203F-#x2040]
[5] Name ::= NameStartChar⁴ (NameChar⁴ᵃ)*
[6] Names ::= Name⁵ (#x20 Name⁵)*
[7] Nmtoken ::= (NameChar⁴ᵃ)+
[8] Nmtokens ::= Nmtoken⁷ (#x20 Nmtoken⁷)*

Section 21.6.1.5: Literals

[9] EntityValue ::= '"' ([^%&"] | PEReference⁶⁹ | Reference⁶⁷)* '"' | "'" ([^%&'] | PEReference⁶⁹ | Reference⁶⁷)* "'"
[10] AttValue ::= '"' ([^<&"] | Reference⁶⁷)* '"' | "'" ([^<&'] | Reference⁶⁷)* "'"
[11] SystemLiteral ::= ('"' [^"]* '"') | ("'" [^']* "'")
[12] PubidLiteral ::= '"' PubidChar¹³* '"' | "'" (PubidChar¹³ - "'")* "'"
[13] PubidChar ::= #x20 | #xD | #xA | [a-zA-Z0-9] | [-'()+,./:=?;!*#@$_%]

Section 21.6.1.6: Character data

[14] CharData ::= [^<&]* - ([^<&]* ']]>' [^<&]*)
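One visible consequence of the wider Char range in production [2] is that a character reference to a C0 control such as #x1 is acceptable in an XML 1.1 document but not in an XML 1.0 document. The sketch below probes this with the JAXP default parser. It assumes that parser supports XML 1.1, which is not guaranteed for every implementation; VersionCheck and its method names are hypothetical.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical demo class comparing XML 1.0 and 1.1 character rules.
final class VersionCheck {

  // An element whose content is a character reference to U+0001.
  static final String BODY = "<a>&#x1;</a>";

  // Prefixes the body with an XML declaration for the given version.
  static String document(String version) {
    return "<?xml version=\"" + version + "\"?>" + BODY;
  }

  // Returns false if the parser reports a fatal (well-formedness) error.
  static boolean isWellFormed(String xml) {
    try {
      SAXParserFactory.newInstance().newSAXParser()
          .parse(new InputSource(new StringReader(xml)), new DefaultHandler());
      return true;
    } catch (SAXException ex) {
      return false;
    } catch (Exception ex) {
      throw new IllegalStateException(ex);
    }
  }

  public static void main(String[] args) {
    System.out.println("as XML 1.0: " + isWellFormed(document("1.0")));
    System.out.println("as XML 1.1: " + isWellFormed(document("1.1")));
  }
}
```

Note that the control character is written as a reference; production [2a] RestrictedChar is what prevents it from appearing literally in a 1.1 document.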

Section 21.6.1.7: Comments

[15] Comment ::= '<!--' (( Char
Chapter 22: Schemas Reference
The W3C XML Schema Language (schemas) is a declarative language used to describe the allowed contents of XML documents by assigning types to elements and attributes. The schema language includes several dozen standard types and allows you to define your own custom types. The combination of the information in an XML document instance and the types applied to that information by the schema is sometimes called the Post Schema Validation Infoset (PSVI).
A schema processor reads both an input XML document and a schema (which is itself an XML document because the W3C XML Schema Language is an XML application) and determines whether the document adheres to the constraints in the schema. A document that satisfies all the schema's constraints, and in which all the document's elements and attributes are declared, is said to be schema-valid, although in this chapter we will mostly just call such documents valid. A document that does not satisfy all of the constraints is said to be invalid.
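In Java, the javax.xml.validation package can play the role of the schema processor. The following is a minimal sketch, not a complete validation setup: SchemaCheck, the one-element schema, and the quantity element are all hypothetical and chosen only to show a valid document, an invalid value, and an undeclared element.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.XMLConstants;
import javax.xml.transform.stream.StreamSource;
import javax.xml.validation.Schema;
import javax.xml.validation.SchemaFactory;
import org.xml.sax.SAXException;

// Hypothetical demo class: checks documents against a tiny schema.
final class SchemaCheck {

  // A made-up schema that declares a single element.
  static final String SCHEMA =
      "<xs:schema xmlns:xs=\"http://www.w3.org/2001/XMLSchema\">"
      + "<xs:element name=\"quantity\" type=\"xs:nonNegativeInteger\"/>"
      + "</xs:schema>";

  // Returns true if the instance is schema-valid. A schema that cannot
  // be compiled also makes this return false.
  static boolean isValid(String schemaText, String instanceText) {
    try {
      Schema schema = SchemaFactory
          .newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI)
          .newSchema(new StreamSource(new StringReader(schemaText)));
      schema.newValidator()
          .validate(new StreamSource(new StringReader(instanceText)));
      return true;
    } catch (SAXException ex) {
      return false;
    } catch (IOException ex) {
      throw new IllegalStateException(ex);
    }
  }

  public static void main(String[] args) {
    System.out.println(isValid(SCHEMA, "<quantity>42</quantity>"));
    System.out.println(isValid(SCHEMA, "<quantity>-1</quantity>"));
    System.out.println(isValid(SCHEMA, "<amount>42</amount>"));
  }
}
```

The third document fails even though its content is a perfectly good integer, because its root element is not declared anywhere in the schema.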
The Schema Namespaces
All standard schema elements are in the http://www.w3.org/2001/XMLSchema namespace. In this chapter, we assume that this URI is mapped to the xs prefix using an appropriate xmlns:xs declaration. This declaration is almost always placed on the root element start-tag:
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
In addition, several attributes are used in instance documents to associate schema information with them, including schemaLocation and type. These attributes are in the http://www.w3.org/2001/XMLSchema-instance namespace. In this chapter, we assume that this URI is mapped to the xsi prefix with an appropriate xmlns:xsi declaration on either the element where this attribute appears or one of its ancestors.
In a few cases, schema elements may contain elements from other arbitrary namespaces or no namespace at all. This occurs primarily inside xs:appinfo and xs:documentation elements, which provide supplementary information about the schema itself or the documents it describes, either to systems that are not schema validators or to people reading the schema.
Finally, most schema elements can have arbitrary attributes from other namespaces. For instance, this allows you to make an xs:attribute element a simple XLink by giving it xlink:type and xlink:href attributes or to identify the language of an xs:notation using an xml:lang attribute. However, this capability is not used much in practice.
Schema Elements
The W3C XML Schema Language defines 42 elements, which naturally divide into several categories:
One root element
xs:schema
Three declaration elements
xs:element, xs:attribute, and xs:notation
Eight elements for defining types
xs:complexContent, xs:complexType, xs:extension, xs:list, xs:restriction, xs:simpleContent, xs:simpleType, and xs:union
Seven elements for defining content models
xs:all, xs:any, xs:anyAttribute, xs:attributeGroup, xs:choice, xs:group, and xs:sequence
Five elements for specifying identity constraints
xs:field, xs:key, xs:keyref, xs:selector, and xs:unique
Three elements for assembling schemas out of component parts
xs:import, xs:include, and xs:redefine
Twelve facet elements for constraining simple types
xs:enumeration, xs:fractionDigits, xs:length, xs:maxExclusive, xs:maxInclusive, xs:maxLength, xs:minExclusive, xs:minInclusive, xs:minLength, xs:pattern, xs:totalDigits, and xs:whiteSpace
Three elements for documenting schemas
xs:appinfo, xs:annotation, and xs:documentation
Elements in this section are arranged alphabetically from xs:all to xs:whiteSpace. Each element begins with a sample implementation in the following form:
<xs:elementName
   attribute1 = "allowed attribute values"
   attribute2 = "allowed attribute values"
>
  <!-- Content model -->
</xs:elementName>
Most attribute values can be expressed as one of the 44 XML Schema built-in simple types, such as xs:string, xs:ID, or xs:integer. Values that should be replaced by an instance of the type are italicized. Values that take a literal form are listed in regular type. Some attribute values are specified as an enumeration of the legal values in the form ( value1 | value2 | value3 | etc. ). In this case, the default value, if there is one, is given in boldface.
Element content models are given in a comment in the form they might appear in an ELEMENT declaration in a DTD. For example, an xs:all element may contain a single optional xs:annotation child element followed by zero or more
Built-in Types
The W3C XML Schema Language provides 44 built-in simple types for text strings. Each type has a value space and a lexical space. The value space is the set of unique meanings for the type, which may or may not be text. In some sense, the value space is composed of Platonic forms. The lexical space is the set of text strings that correspond to particular points in the value space. For example, the xs:boolean type has the value space true and false. However, its lexical space contains four strings: true, false, 0, and 1. true and 1 both map to the same value true, while false and 0 map to the single value false. In cases like this where multiple strings in the lexical space map to a single value, then one of those strings is selected as the canonical lexical representation. For instance, the canonical lexical representations of true and false are the strings true and false.
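The relationship between lexical space, value space, and canonical representation can be sketched for xs:boolean in a few lines. This is a hand-written illustration, not an API of any schema library; XsBoolean and its methods are hypothetical names, and String.trim() is used as an approximation of the whitespace collapsing that applies to this type.

```java
// Hypothetical helper class modeling the xs:boolean type by hand.
final class XsBoolean {

  // Maps a string from the lexical space to a point in the value space.
  static boolean parse(String lexical) {
    // Leading and trailing whitespace is not significant for xs:boolean.
    switch (lexical.trim()) {
      case "true":
      case "1":
        return true;
      case "false":
      case "0":
        return false;
      default:
        throw new IllegalArgumentException(
            "not in the xs:boolean lexical space: " + lexical);
    }
  }

  // Maps a value back to its canonical lexical representation.
  static String canonical(boolean value) {
    return value ? "true" : "false";
  }

  public static void main(String[] args) {
    for (String lexical : new String[] {"true", "1", "false", "0"}) {
      System.out.println(lexical + " -> " + canonical(parse(lexical)));
    }
  }
}
```

Four lexical forms collapse onto two values, and converting a value back always yields the canonical string, so 1 round-trips to true rather than to 1.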
The built-in types are organized in a hierarchy. All simple types descend from an abstract ur-type called xs:anySimpleType, which is itself a descendant of an abstract ur-type called xs:anyType that includes both simple and complex types. Simple types are derived from other simple types by union, restriction, or listing. For example, the xs:nonNegativeInteger type is derived from the xs:integer type by setting its minInclusive facet to 0. The xs:integer type is derived from the xs:decimal type by setting its fractionDigits facet to 0. Figure 22-1 diagrams the complete hierarchy of built-in types. The xs:simpleType element allows you to apply facets to these types to create your own derived types that extend this hierarchy.
Figure 22-1: The simple type hierarchy
The types are organized alphabetically in the following section. For each type, the value and lexical spaces are described, and some examples of permissible instances are provided.
xs:anyURI
The xs:anyURI type indicates a Uniform Resource Identifier. This includes not only Uniform Resource Locators (URLs), but also Uniform Resource Names (URNs). Both relative and absolute URLs are allowed. Legal
Instance Document Attributes
The W3C XML Schema Language defines four attributes in the http://www.w3.org/2001/XMLSchema-instance namespace (here mapped to the xsi prefix), which are attached to elements in the instance document rather than elements in the schema. These are as follows: xsi:nil, xsi:type, xsi:schemaLocation, and xsi:noNamespaceSchemaLocation. All four of these attributes are special because the schemas do not need to declare them.
xsi:nil
The xsi:nil attribute indicates that a certain element does not have a value or that the value is unknown. This is not the same as having a value that is zero or the empty string. Semantically, it is equivalent to SQL's null. For example, in this full_name element, the last_name child has a nil value:
<full_name xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
  <first_name>Cher</first_name>
  <last_name xsi:nil="true"/>
</full_name>
It is not relevant whether an empty-element tag or a start-tag/end-tag pair is used to represent the nil element. However, a nil element may not have any content.
In order for this document to be valid, the element declaration for the last_name element must explicitly specify that nil values are allowed by setting the nillable attribute to true. For example:
<xs:element name="last_name" type="xs:string" nillable="true"/>
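The effect of nillable can be observed by validating the full_name example against two schemas that differ only in that attribute. The sketch below uses javax.xml.validation; NilDemo, its helper methods, and the surrounding schema for full_name and first_name are assumptions made for illustration, since the text shows only the last_name declaration.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.XMLConstants;
import javax.xml.transform.stream.StreamSource;
import javax.xml.validation.Schema;
import javax.xml.validation.SchemaFactory;
import org.xml.sax.SAXException;

// Hypothetical demo class for xsi:nil and nillable.
final class NilDemo {

  static final String XSI = "http://www.w3.org/2001/XMLSchema-instance";

  // Builds an assumed schema for full_name; the flag affects last_name only.
  static String schema(boolean lastNameNillable) {
    return "<xs:schema xmlns:xs='http://www.w3.org/2001/XMLSchema'>"
        + "<xs:element name='full_name'><xs:complexType><xs:sequence>"
        + "<xs:element name='first_name' type='xs:string'/>"
        + "<xs:element name='last_name' type='xs:string' nillable='"
        + lastNameNillable + "'/>"
        + "</xs:sequence></xs:complexType></xs:element>"
        + "</xs:schema>";
  }

  // Wraps the given last_name markup in a full_name element.
  static String instance(String lastNameElement) {
    return "<full_name xmlns:xsi='" + XSI + "'>"
        + "<first_name>Cher</first_name>"
        + lastNameElement
        + "</full_name>";
  }

  // Returns true if the instance is valid against the schema.
  static boolean isValid(String schemaText, String instanceText) {
    try {
      Schema schema = SchemaFactory
          .newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI)
          .newSchema(new StreamSource(new StringReader(schemaText)));
      schema.newValidator()
          .validate(new StreamSource(new StringReader(instanceText)));
      return true;
    } catch (SAXException ex) {
      return false;
    } catch (IOException ex) {
      throw new IllegalStateException(ex);
    }
  }

  public static void main(String[] args) {
    String nil = instance("<last_name xsi:nil='true'/>");
    String nilWithContent =
        instance("<last_name xsi:nil='true'>Sarkisian</last_name>");
    System.out.println("nillable=true:  " + isValid(schema(true), nil));
    System.out.println("nillable=false: " + isValid(schema(false), nil));
    System.out.println("nil plus content: "
        + isValid(schema(true), nilWithContent));
  }
}
```

The third case shows the rule stated above: an element marked nil may not carry content, even when its declaration is nillable.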
xsi:noNamespaceSchemaLocation
The xsi:noNamespaceSchemaLocation attribute locates the schema for elements that are not in any namespace. (Attributes that are not in any namespace are assumed to be declared in the same schema as their parent element.) Its value is a relative or absolute URL where the schema document can be found. It is most commonly attached to the root element but can appear further down the tree. For example, this person element claims that it should be validated against the schema found at http://example.com/person.xsd:
<person xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
        xsi:noNamespaceSchemaLocation="http://example.com/person.xsd">
  <name>
    <first_name>Alan</first_name>
    <last_name>Turing</last_name>
  </name>
  <profession>computer scientist</profession>
  <profession>mathematician</profession>
  <profession>cryptographer</profession>
</person>
Chapter 23: XPath Reference
XPath is a non-XML syntax for expressions that identify particular nodes and groups of nodes in an XML document. It is used by both XPointer and XSLT, as well as by some native XML databases and query languages.
The XPath Data Model
XPath views each XML document as a tree of nodes. Each node has one of seven types:
Root
Each document has exactly one root node, which is the root of the tree. This node contains one comment node child for each comment outside the document element, one processing-instruction node child for each processing instruction outside the root element, and exactly one element node child for the root element. It does not contain any representation of the XML declaration, the document type declaration, or any whitespace that occurs before or after the root element. The root node has no parent node. The root node's value is the value of the root element.
Element
An element node has a name, a namespace URI, a parent node, and a list of child nodes, which may include other element nodes, comment nodes, processing-instruction nodes, and text nodes. An element node also has a collection of attributes and a collection of in-scope namespaces, none of which are considered to be children of the element. The string-value of an element node is the complete, parsed text between the element's start- and end-tags that remains after all tags, comments, and processing instructions are removed and all entity and character references are resolved.
Attribute
An attribute node has a name, a namespace URI, a value, and a parent element. However, although elements are parents of attributes, attributes are not children of their parent elements. The biological metaphor breaks down here. xmlns and xmlns:prefix attributes are not represented as attribute nodes. An attribute node's value is the normalized attribute value.
Text
Each text node represents the maximum possible contiguous run of text between tags, processing instructions, and comments. A text node has a parent node but does not have children. A text node's value is the text of the node.
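Several of these rules can be checked directly with the javax.xml.xpath API. The sketch below evaluates a few expressions against a small sample document; DataModelDemo and the sample document are hypothetical, and every result is requested as a string for simplicity.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical demo class probing the XPath data model.
final class DataModelDemo {

  // One attribute, and three children: a text node, a comment, an element.
  static final String XML =
      "<doc id=\"a\">Hello <!-- note --><b>World</b></doc>";

  // Evaluates the expression against the document and returns the
  // result converted to a string.
  static String eval(String xml, String expression) {
    try {
      DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
      factory.setNamespaceAware(true);
      Document doc = factory.newDocumentBuilder()
          .parse(new InputSource(new StringReader(xml)));
      XPath xpath = XPathFactory.newInstance().newXPath();
      return xpath.evaluate(expression, doc);
    } catch (Exception ex) {
      throw new IllegalStateException(ex);
    }
  }

  public static void main(String[] args) {
    String[] expressions = {
      "count(/node())",      // children of the root node
      "count(/doc/node())",  // children of the doc element
      "count(/doc/@*)",      // attributes are counted separately
      "name(/doc/@id/..)",   // yet the attribute's parent is the element
      "string(/doc)"         // string-value with markup stripped
    };
    for (String expression : expressions) {
      System.out.println(expression + " = " + eval(XML, expression));
    }
  }
}
```

The attribute is reachable only through the attribute axis and is not among the element's child nodes, while its parent axis still leads back to the element.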
Data Types
Each XPath expression evaluates to one of four types:
Boolean
A binary value that is either true or false. In XPath, Booleans are most commonly produced by using the comparison operators =, !=, <, >, <=, and >=. Multiple conditions can be combined using the and and or operators, which have their usual meaning in logic (e.g., 3>2 or 2>1 is true). XPath does not offer Boolean literals. However, the true( ) and false() functions fill that need.
Number
All numbers in XPath are IEEE 754-compliant, 64-bit floating-point numbers. This is the same as the double type in Java. Nonzero magnitudes range from 4.94065645841246544e-324 to 1.79769313486231570e+308, and each value is either positive or negative. Numbers also include the special values Infinity (positive infinity), -Infinity (negative infinity), and NaN (not a number), which is used for the results of undefined operations, such as dividing zero by zero. XPath provides all the customary operators for working with numbers, including:
+
Addition
-
Subtraction; however, this operator should always be surrounded by whitespace to avoid accidental misinterpretation as part of an XML name
*
Multiplication
div
Division
mod
Taking the remainder
String
Sequence of zero or more Unicode characters. String literals are enclosed in either single or double quotes, as convenient. Unlike Java, XPath does not allow strings to be concatenated with the plus sign. However, the concat( ) function serves this purpose.
Node-set
Collection of zero or more nodes from an XML document. Location paths produce most node-sets. A single node-set can contain multiple types of nodes: root, element, attribute, namespace, comment, processing instruction, and text.
Some standards that use XPath also define additional data types. For instance, XSLT defines a result tree fragment type that represents the result of processing an XSLT instruction or instantiating a template. XPointer defines a location set type that extends node-sets to include points and ranges.
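The four basic types can be observed from Java through the javax.xml.xpath API. The following is a minimal sketch rather than a definitive recipe: the class name, the helper methods, and the sample document are all hypothetical, and each helper simply asks the XPath engine to return the result as one particular type.

```java
import java.io.StringReader;
import javax.xml.namespace.QName;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

final class XPathTypesSketch {
    // Hypothetical sample document; it only gives the expressions something to run against.
    static final String XML =
        "<people><person>Eunice</person><person>Celeste</person></people>";

    // Evaluates an expression against the sample document, requesting one result type.
    private static Object eval(String expr, QName type) throws XPathExpressionException {
        XPath xpath = XPathFactory.newInstance().newXPath();
        return xpath.evaluate(expr, new InputSource(new StringReader(XML)), type);
    }

    static boolean asBoolean(String expr) throws XPathExpressionException {
        return (Boolean) eval(expr, XPathConstants.BOOLEAN);
    }

    static double asNumber(String expr) throws XPathExpressionException {
        return (Double) eval(expr, XPathConstants.NUMBER);
    }

    static String asString(String expr) throws XPathExpressionException {
        return (String) eval(expr, XPathConstants.STRING);
    }

    static int countNodes(String expr) throws XPathExpressionException {
        return ((NodeList) eval(expr, XPathConstants.NODESET)).getLength();
    }

    public static void main(String[] args) throws XPathExpressionException {
        System.out.println(asBoolean("3>2 or 2>1"));          // Boolean
        System.out.println(asNumber("7 mod 3"));              // number
        System.out.println(asNumber("1 div 0"));              // positive infinity
        System.out.println(Double.isNaN(asNumber("0 div 0"))); // NaN
        System.out.println(asString("concat('Eu', 'nice')")); // string
        System.out.println(countNodes("/people/person"));     // node-set size
    }
}
```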
Location Paths
Node-sets are returned by location-path expressions. Location paths consist of location steps. Each location step contains an axis and a node test separated by a double colon. That is, a location step looks like this:
               axis::node test
The axis specifies in which direction from the context node the processor searches for nodes. The node test specifies which nodes along that axis are selected. These are some location steps with different axes and node tests:
child::set

descendant-or-self::node( )

ancestor-or-self::*

attribute::xlink:href
Each location step may be suffixed with predicates enclosed in square brackets that further winnow the node-set. For example:
child::set[position( )=2]

descendant-or-self::node( )[.='Eunice']

ancestor-or-self::*[position( )=2][.="Celeste"]

attribute::xlink:href[starts-with(., 'http')]
Each individual location step is itself a relative location path. The context node against which the relative location path is evaluated is established by some means external to XPath—for example, by the current matched node in an XSLT template.
Location steps can be combined by separating them with forward slashes. Each step in the resulting location path sets the context node (or nodes) for the next step in the path. For example:
ancestor-or-self::*/child::*[position( )=1]

child::document/child::set[position( )=2]/following-sibling::*

descendant::node( )[.='Eunice']/attribute::ID
An absolute location path is formed by prefixing a forward slash to a relative location path. This sets the context node for the first step in the location path to the root of the document. For example, these are all absolute location paths:
/descendant::ship/ancestor-or-self::*/child::*[position( )=1]

/child::document/child::set[position( )=2]/following-sibling::*

/descendant::node( )[.='Eunice']/attribute::ID
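To see absolute and relative location paths in action, the sketch below evaluates a few unabbreviated paths from Java with javax.xml.xpath. The document, the class name, and the select( ) helper are assumptions made for illustration. The relative path is evaluated against a context node that the Java code chooses, which stands in for whatever external mechanism normally establishes the context.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

final class LocationPathSketch {
    // Hypothetical document with no whitespace between elements.
    static final String XML =
        "<document>"
      + "<set ID='s1'>Eunice</set>"
      + "<set ID='s2'>Celeste</set>"
      + "<set ID='s3'>Eunice</set>"
      + "</document>";

    static Document parse() throws Exception {
        return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(XML)));
    }

    // Returns the text content of every node the path selects from the given context.
    static List<String> select(String path, Object context) throws Exception {
        XPath xpath = XPathFactory.newInstance().newXPath();
        NodeList nodes = (NodeList) xpath.evaluate(path, context, XPathConstants.NODESET);
        List<String> values = new ArrayList<>();
        for (int i = 0; i < nodes.getLength(); i++) {
            values.add(nodes.item(i).getTextContent());
        }
        return values;
    }

    public static void main(String[] args) throws Exception {
        Document doc = parse();
        // Absolute paths: the leading slash starts the first step at the root node.
        System.out.println(select(
            "/child::document/child::set[position()=2]/following-sibling::*/attribute::ID", doc));
        System.out.println(select("/descendant::node()[.='Eunice']/attribute::ID", doc));
        // Relative path: the context node is the second set element.
        Node second = doc.getDocumentElement().getChildNodes().item(1);
        System.out.println(select("following-sibling::set/attribute::ID", second));
    }
}
```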
Multiple location paths can be combined with the union operator (|) to form an expression that selects a node-set containing all the nodes identified by any of the location paths. For example, this expression selects a node-set containing all the
Predicates
Each location step may have zero or more predicates. A predicate is an XPath expression enclosed in square brackets that follows the node test in the location step. This expression most commonly, but not necessarily, returns a Boolean value. In the following location path:
/person[1]/profession[.="physicist"][position( )<3]
[1], [.="physicist"], and [position( )<3] are predicates. An XPath processor works from left to right in an expression. After it has evaluated everything that precedes the predicate, it's left with a context node list that may contain no nodes, one node, or more than one node. For most axes, including child, following-sibling, following, and descendant, this list is in document order. For the ancestor, preceding, and preceding-sibling axes, this list is in reverse document order.
The predicate is evaluated against each node in the context node list. If the expression returns true, then that node is retained in the list. If the expression returns false, then the node is removed from the list. If the expression returns a number, then the node being evaluated is left in the list if and only if the number is the same as the position of that node in the context node list. If the expression returns a non-Boolean, nonnumber type, then that return value is converted to a Boolean using the boolean() function, described later, to determine whether it retains the node in the set.
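The left-to-right filtering described above can be checked with a small experiment. This sketch assumes a hypothetical person document whose profession elements carry an n attribute recording their original position. It then compares the same two predicates in both orders, a numeric predicate, and a predicate on a reverse axis.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

final class PredicateSketch {
    // Hypothetical document; n records each profession's original position.
    static final String XML =
        "<person>"
      + "<profession n='1'>physicist</profession>"
      + "<profession n='2'>computer scientist</profession>"
      + "<profession n='3'>physicist</profession>"
      + "<profession n='4'>physicist</profession>"
      + "</person>";

    // Returns the n attribute of every element the path selects.
    static List<String> positions(String path) throws XPathExpressionException {
        XPath xpath = XPathFactory.newInstance().newXPath();
        NodeList nodes = (NodeList) xpath.evaluate(
            path, new InputSource(new StringReader(XML)), XPathConstants.NODESET);
        List<String> result = new ArrayList<>();
        for (int i = 0; i < nodes.getLength(); i++) {
            result.add(((Element) nodes.item(i)).getAttribute("n"));
        }
        return result;
    }

    public static void main(String[] args) throws XPathExpressionException {
        // Keep physicists first, then the first two survivors: professions 1 and 3.
        System.out.println(positions("/person[1]/profession[.='physicist'][position()<3]"));
        // Keep the first two professions first, then the physicists among them: profession 1.
        System.out.println(positions("/person[1]/profession[position()<3][.='physicist']"));
        // A numeric predicate keeps the node whose context position equals the number.
        System.out.println(positions("/person/profession[.='physicist'][2]"));
        // On a reverse axis, position 1 is the nearest preceding sibling.
        System.out.println(positions("/person/profession[4]/preceding-sibling::*[1]"));
    }
}
```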
XPath Functions
XPath 1.0 defines 27 built-in functions for use in XPath expressions. Various technologies that use XPath, such as XSLT and XPointer, also extend this list with functions they need. XSLT even allows user-defined extension functions.
Every function is evaluated in the context of a particular node, called the context node. The higher-level specification in which XPath is used, such as XSLT or XPointer, decides exactly how this context node is determined. In some cases, the function operates on the context node. In other cases, it operates on the argument, if present, and the context node, if no argument exists. The context node is ignored in other cases.
In the following sections, each function is described with at least one signature in this form:
               return-type function-name(type argument, type argument, ...)
Compared to languages like Java, XPath argument lists are quite loose. Some XPath functions take a variable number of arguments and fill in the arguments that are omitted with default values or the context node.
Furthermore, XPath is weakly typed. If you pass an argument of the wrong type to an XPath function, it generally converts that argument to the appropriate type using the boolean( ), string( ), or number( ) functions, described later. The exceptions to the weak-typing rule are the functions that take a node-set as an argument. Standard XPath 1.0 provides no means of converting anything that isn't a node-set into a node-set. In some cases, a function can operate equally well on multiple argument types. In this case, its type is given simply as object.
boolean( )
                     boolean boolean(object o)
The boolean() function converts its argument to a Boolean according to these rules:
  • Zero and NaN are false. All other numbers are true.
  • Empty node-sets are false. Nonempty node-sets are true.
  • Empty strings are false. Nonempty strings are true.
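The conversion rules above can be tried out from Java. This is an illustrative sketch only: the class name, the toBoolean( ) helper, and the tiny sample document are assumptions, and the helper just wraps its argument in a call to boolean( ) before evaluating it.

```java
import java.io.StringReader;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.xml.sax.InputSource;

final class BooleanConversionSketch {
    // Wraps the argument in boolean( ) and evaluates it against a tiny sample document.
    static boolean toBoolean(String argument) throws XPathExpressionException {
        XPath xpath = XPathFactory.newInstance().newXPath();
        InputSource doc = new InputSource(new StringReader("<doc><item/></doc>"));
        return (Boolean) xpath.evaluate(
            "boolean(" + argument + ")", doc, XPathConstants.BOOLEAN);
    }

    public static void main(String[] args) throws XPathExpressionException {
        String[] arguments = {
            "0", "0 div 0", "-1.5",  // numbers: zero, NaN, a nonzero number
            "//missing", "//item",   // node-sets: empty, nonempty
            "''", "'false'"          // strings: empty, nonempty
        };
        for (String argument : arguments) {
            System.out.println("boolean(" + argument + ") = " + toBoolean(argument));
        }
    }
}
```

Note that the string 'false' converts to true, because only the length of the string matters.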
Chapter 24: XSLT Reference
Extensible Stylesheet Language Transformations (XSLT) is a functional programming language used to specify how an input XML document is converted into another text document—possibly, although not necessarily, another XML document. An XSLT processor reads both an input XML document and an XSLT stylesheet (which is itself an XML document because XSLT is an XML application) and produces a result tree as output. This result tree may then be serialized into a file or written onto a stream. Documents can be transformed using a standalone program or as part of a larger program that communicates with the XSLT processor through its API.
The XSLT Namespace
All standard XSLT elements are in the http://www.w3.org/1999/XSL/Transform namespace. In this chapter, we assume that this URI is mapped to the xsl prefix using an appropriate xmlns:xsl declaration somewhere in the stylesheet. This mapping is normally declared on the root element, like this:
<xsl:stylesheet version="1.0"
   xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <!-- XSLT top-level elements go here -->
</xsl:stylesheet>
XSLT Elements
XSLT defines 36 elements, which break down into three overlapping categories:
  • Two root elements:
    xsl:stylesheet
    
    xsl:transform
  • Twelve top-level elements, which may appear as immediate children of the root and are the following:
    xsl:attribute-set          xsl:decimal-format
    
    xsl:import                 xsl:include
    
    xsl:key                    xsl:namespace-alias
    
    xsl:output                 xsl:param
    
    xsl:preserve-space         xsl:strip-space
    
    xsl:template               xsl:variable
  • Twenty-two instruction elements, which appear in the content of elements that contain templates. Here, we don't mean the xsl:template element. We mean the content of that and several other elements, such as xsl:for-each and xsl:message, which are composed of literal result elements, character data, and XSLT instructions that are processed to produce part of the result tree. These elements are as follows:
    xsl:apply-imports          xsl:apply-templates
    
    xsl:attribute              xsl:call-template
    
    xsl:choose                 xsl:comment
    
    xsl:copy                   xsl:copy-of
    
    xsl:element                xsl:fallback
    
    xsl:for-each               xsl:if
    
    xsl:message                xsl:number
    
    xsl:otherwise              xsl:processing-instruction 
    
    xsl:sort                   xsl:text
    
    xsl:value-of               xsl:variable
    
    xsl:with-param             xsl:when
Most XSLT processors also provide various nonstandard extension elements and allow you to write your own extension elements in languages such as Java and JavaScript.
Elements in this section are arranged alphabetically from xsl:apply-imports to xsl:with-param. Each element begins with a synopsis in the following form:
<xsl:elementName
   attribute1 = "allowed attribute values"
   attribute2 = "allowed attribute values"
>
  <!-- Content model -->
</xsl:elementName>
Most attribute values are one of the following types:
expression
An XPath expression. In cases where the expression is expected to return a value of a particular type, such as node-set or number, it is prefixed with the type and a hyphen; for example,
XSLT Functions
XSLT supports all functions defined in XPath. In addition, it defines 10 extra functions. Most XSLT processors also make several extension functions available and allow you to write your own extension functions in Java or other languages. The extension API is nonstandard and processor-dependent.
XPath and XSLT functions are weakly typed. Although one type or another is occasionally preferred, the processor normally converts any type you pass in to the type the function expects. Functions that only take node-sets as arguments are an exception to the weak-typing rule. Other data types, including strings, numbers, and Booleans, cannot be converted to node-sets automatically.
XPath and XSLT functions also use optional arguments, which are filled in with defaults if omitted. In the following sections, we list the most common and useful variations of each function.
current( )
                     node-set current( )
The current( ) function returns a node-set containing a single node, the current node. Outside of an XPath predicate, the current node and the context node (represented by a period in the abbreviated XPath syntax) are identical. However, in a location step predicate, the context node changes according to the location path, while the current node stays the same.
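The difference between the current node and the context node is easiest to see inside a predicate. The sketch below is a hypothetical illustration: it runs a small stylesheet through the JAXP transformation API, and the library document, the class name, and the authorOf( ) helper are all assumptions. Inside the predicate on the author step, a period refers to the author element being tested, while current( ) still refers to the book element that the template matched.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

final class CurrentFunctionSketch {
    // Hypothetical input: the book refers to its author by ID.
    static final String XML =
        "<library>"
      + "<author id='a1'>Eunice</author>"
      + "<author id='a2'>Celeste</author>"
      + "<book author='a2'>Second Book</book>"
      + "</library>";

    // Builds a stylesheet whose book template looks up an author with the given predicate.
    static String stylesheet(String predicate) {
        return "<xsl:stylesheet version='1.0'"
             + " xmlns:xsl='http://www.w3.org/1999/XSL/Transform'>"
             + "<xsl:output method='text'/>"
             + "<xsl:template match='/'>"
             + "<xsl:apply-templates select='library/book'/>"
             + "</xsl:template>"
             + "<xsl:template match='book'>"
             + "<xsl:value-of select='../author[" + predicate + "]'/>"
             + "</xsl:template>"
             + "</xsl:stylesheet>";
    }

    // Transforms the sample document and returns the text output.
    static String authorOf(String predicate) throws TransformerException {
        Transformer transformer = TransformerFactory.newInstance()
            .newTransformer(new StreamSource(new StringReader(stylesheet(predicate))));
        StringWriter out = new StringWriter();
        transformer.transform(new StreamSource(new StringReader(XML)), new StreamResult(out));
        return out.toString();
    }

    public static void main(String[] args) throws TransformerException {
        // current( ) is the matched book, so its author attribute selects the second author.
        System.out.println("[" + authorOf("@id = current()/@author") + "]");
        // A period is the author under test, which has no author attribute, so nothing matches.
        System.out.println("[" + authorOf("@id = ./@author") + "]");
    }
}
```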
document( )
                     node-set document(string uri)
                     node-set document(node-set uris)
                     node-set document(string uri, node-set base)
                     node-set document(node-set uris, node-set base)
The document( ) function loads the XML document at the URI specified by the first argument and returns a node-set containing that document's root node. The URI is normally given as a string, but it may be given as another type that is converted to a string. If the URI is given as a node-set, then each node in the set is converted to a string, and the returned node-set includes root nodes of all documents referenced by the URI argument.
TrAX
Unfortunately, there is no standard API for XSLT that works across languages and engines: each vendor provides its own unique API. The closest thing to a standard XSLT API is the Transformations API for XML (TrAX), included in JAXP. However, this is limited to Java and is not even supported by all Java-based XSLT engines. Nonetheless, since it is the closest thing to a standard there is, we will discuss it here.
Code that transforms an XML document using an XSLT stylesheet through TrAX follows these six steps. All of the classes mentioned are in the javax.xml.transform package, a standard part of Java 1.4 and later and a separately installable option in earlier versions.
  1. Call the TransformerFactory.newInstance( ) factory method to create a new TransformerFactory object.
  2. Construct a Source object from the XSLT stylesheet.
  3. Pass this Source object to the TransformerFactory object's newTransformer( ) method to create a Transformer object.
  4. Construct a Source object from the input XML document you wish to transform.
  5. Construct a Result object into which the transformed XML document will be output.
  6. Pass the Source and the Result to the Transformer object's transform( ) method.
The source can be built from a DOM Document object, a SAX InputSource, or an InputStream—represented by the javax.xml.transform.dom.DOMSource, javax.xml.transform.sax.SAXSource, and javax.xml.transform.stream.StreamSource classes, respectively. The result of the transform can be a DOM Document object, a SAX ContentHandler, or an OutputStream. These are represented by the javax.xml.transform.dom.DOMResult, javax.xml.transform.sax.SAXResult, and javax.xml.transform.stream.StreamResult classes, respectively.
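The six steps listed above can be sketched in a few lines of Java. This is a minimal illustration under stated assumptions, not the book's own example: the class name, the transform( ) helper, and the in-memory stylesheet and document are hypothetical, and StreamSource and StreamResult are built from a Reader and a Writer so that the sketch needs no files or network access.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.Result;
import javax.xml.transform.Source;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

final class TraxSketch {
    // Hypothetical stylesheet that produces plain text.
    static final String STYLESHEET =
        "<xsl:stylesheet version='1.0'"
      + " xmlns:xsl='http://www.w3.org/1999/XSL/Transform'>"
      + "<xsl:output method='text'/>"
      + "<xsl:template match='/'>Hello, <xsl:value-of select='greeting/@to'/>!</xsl:template>"
      + "</xsl:stylesheet>";

    static String transform(String xml, String stylesheet) throws TransformerException {
        // Step 1: create the factory.
        TransformerFactory factory = TransformerFactory.newInstance();
        // Step 2: wrap the stylesheet in a Source.
        Source xslt = new StreamSource(new StringReader(stylesheet));
        // Step 3: compile the stylesheet into a Transformer.
        Transformer transformer = factory.newTransformer(xslt);
        // Step 4: wrap the input document in a Source.
        Source input = new StreamSource(new StringReader(xml));
        // Step 5: prepare a Result that collects the output.
        StringWriter out = new StringWriter();
        Result result = new StreamResult(out);
        // Step 6: run the transformation.
        transformer.transform(input, result);
        return out.toString();
    }

    public static void main(String[] args) throws TransformerException {
        System.out.println(transform("<greeting to='world'/>", STYLESHEET));
    }
}
```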
For example, this code fragment uses the XSLT stylesheet found at http://www.cafeconleche.org/books/xian/examples/08/8-8.xsl to transform the file
Chapter 25: DOM Reference
This reference section documents the W3C Document Object Model (DOM) Level 3 Core Recommendation dated 07 April 2004. The latest version of this recommendation, along with any errata that have been reported, is available on the W3C DOM Activity's web site (http://www.w3.org/DOM/DOMTR). The symbols (2) and (3) will be used throughout this chapter to indicate in which DOM level a feature became available.
The Document Object Model (DOM) is a language- and platform-independent object framework for manipulating structured documents (see Chapter 19 for additional information). Just as XML is a generic specification for creating markup languages, the DOM Core defines a generic library for manipulating markup-based documents. The W3C DOM is actually a family of related recommendations that provide functionality for many types of document manipulation, including event handling, styling, traversing trees, manipulating HTML documents, and so forth. But most of these recommendations are built on the basic functionality provided by the Core DOM.
The DOM presents a programmer with a document stored as a hierarchy of Node objects. The Node interface is the base interface for every member of a DOM document tree. It exposes attributes common to every type of document object and provides a few simple methods to retrieve type-specific information without resorting to downcasting. This interface also exposes all methods used to query, insert, and remove objects from the document hierarchy. The Node interface makes it easier to build general-purpose tree-manipulation routines that are not dependent on specific document element types.
Object Hierarchy
The following table shows the DOM object hierarchy:
Object                  Permitted child objects
Document                Element (one is the maximum), ProcessingInstruction, Comment, DocumentType (one is the maximum)
DocumentFragment        Element, ProcessingInstruction, Comment, Text, CDATASection, EntityReference
DocumentType            None (leaf node)
EntityReference         Element, ProcessingInstruction, Comment, Text, CDATASection, EntityReference
Element                 Element, Text, Comment, ProcessingInstruction, CDATASection, EntityReference
Attr                    Text, EntityReference
ProcessingInstruction   None (leaf node)
Comment                 None (leaf node)
Text                    None (leaf node)
CDATASection            None (leaf node)
Entity                  Element, ProcessingInstruction, Comment, Text, CDATASection, EntityReference
Notation                None (leaf node)
Object Reference
This section details the XML DOM Level 3 Core objects. The reference sections detail the descriptions, attributes, and methods of each object in the language-independent IDL specification. Java examples and bindings are presented to illustrate usage.
Attr
The Attr interface represents the value assigned to an attribute of an XML element. The parentNode, previousSibling, and nextSibling of an Attr are always null. Although the Attr interface inherits the Node base interface, many basic Node methods are not applicable.
An XML element can acquire an attribute in several ways. An element has an attribute value if:
  • The XML document explicitly provides an attribute value.
  • The document DTD specifies a default attribute value.
  • An attribute is added programmatically using the setAttribute( ) or setAttributeNode() methods of the Element interface.
An Attr object can have EntityReference objects as children. The value attribute provides the expanded DOMString representation of this attribute.
// Get the element's size attribute as an Attr object
Attr attrName = elem.getAttributeNode("size");
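For a version of this fragment that can be run on its own, the following sketch builds a one-element document in memory, adds an attribute programmatically, and then inspects the resulting Attr. The shirt element, the class name, and the sizeAttribute( ) helper are hypothetical. It also shows that the parent of an Attr is null even though the attribute has an owner element.

```java
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Attr;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

final class AttrSketch {
    // Builds a hypothetical shirt element and returns its size attribute as an Attr.
    static Attr sizeAttribute() throws ParserConfigurationException {
        Document doc = DocumentBuilderFactory.newInstance()
            .newDocumentBuilder().newDocument();
        Element elem = doc.createElement("shirt");
        doc.appendChild(elem);
        // Add the attribute programmatically.
        elem.setAttribute("size", "large");
        return elem.getAttributeNode("size");
    }

    public static void main(String[] args) throws ParserConfigurationException {
        Attr attr = sizeAttribute();
        System.out.println(attr.getName() + " = " + attr.getValue());
        // An Attr is not a child of its element, so it has no parent node.
        System.out.println("parent is null: " + (attr.getParentNode() == null));
        // The owning element is still reachable.
        System.out.println("owner: " + attr.getOwnerElement().getTagName());
    }
}
```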
Attributes
The following attributes are defined for the Attr object:
isId: boolean(3)
Returns true if this attribute contains a unique identifier for the parent element node. Attributes are tagged as identifiers by the DTD, the schema, or by using one of the setIdAttribute( ) methods of the Element interface. Read-only.
Java binding
public boolean isId( );
name: DOMString
The name of the attribute. Read-only.
Java binding
public String getName( );
Java example
// Dump element attribute names
Attr attr;

for (int i = 0; i < elem.getAttributes( ).getLength( ); i++) {
    // temporarily alias the attribute
    attr = (Attr)(elem.getAttributes( ).item(i));
    System.out.println(attr.getName( ));
}
ownerElement: Element(2)
This property provides a link to the Element object that owns this attribute. If the attribute is currently unowned, it equals
Chapter 26: SAX Reference
SAX, the Simple API for XML, is an event-based API used to parse XML documents. David Megginson, SAX's original author, placed SAX in the public domain. SAX is bundled with all parsers that implement the API, including Xerces, MSXML, Crimson, the Oracle XML Parser for Java, and Ælfred. However, you can also get it and the full source code from http://sax.sourceforge.net/.
SAX was originally defined as a Java API and is intended primarily for parsers written in Java, so this chapter will focus on its Java implementation. However, its port to other object-oriented languages, such as C++, C#, Python, Perl, and Eiffel, is common and usually quite similar.
This chapter covers SAX2 exclusively. In 2004, all major parsers that support SAX support SAX2. The major change from SAX1 to SAX2 was the addition of namespace support. This addition necessitated changing the names and signatures of almost every method and class in SAX. The old SAX1 methods and classes are still available, but they're now deprecated and shouldn't be used. SAX 2.0.2 is a minor update to SAX2 that adds a few extra optional classes, features, and properties without really affecting the core API. These additions were carefully designed to be backward compatible with SAX 2.0 and 2.0.1. Some, but not all, current parsers support SAX 2.0.2. When something in this chapter is only available in SAX 2.0.2, it will be clearly noted.
The org.xml.sax Package
The org.xml.sax package contains the core interfaces and classes that comprise the Simple API for XML.
The Attributes Interface
An object that implements the Attributes interface represents a list of attributes on a start-tag. The order of attributes in the list is not guaranteed to match the order in the document itself. Attributes objects are passed as arguments to the startElement( ) method of ContentHandler. You can access particular attributes in three ways:
  • By number
  • By namespace URI and local name
  • By qualified name
This list does not include namespace declaration attributes (xmlns and xmlns:prefix) unless the http://xml.org/sax/features/namespace-prefixes feature is true. It is false by default.
If the http://xml.org/sax/features/namespace-prefixes feature is false, qualified name access may not be available; if the http://xml.org/sax/features/namespaces feature is false, local names and namespace URIs may not be available.
package org.xml.sax;

public interface Attributes {

  public int    getLength(  );
  public String getURI(int index);
  public String getLocalName(int index);
  public String getQName(int index);
  public int    getIndex(String uri, String localName);
  public int    getIndex(String qualifiedName);
  public String getType(int index);
  public String getType(String uri, String localName);
  public String getType(String qualifiedName);
  public String getValue(String uri, String localName);
  public String getValue(String qualifiedName);
  public String getValue(int index);

}
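The three access styles can be sketched with a small handler. Everything here that is not part of SAX or JAXP is an assumption made for illustration: the class name, the field names, the sample book element, and the example.org URL. The parser is made namespace aware so that namespace URIs and local names are reported. Lookup by qualified name works with the parser bundled in the JDK, but as noted above it may not be available everywhere.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

final class AttributesAccessSketch extends DefaultHandler {
    static final String XLINK_NS = "http://www.w3.org/1999/xlink";

    final List<String> byIndex = new ArrayList<>();
    String hrefByNamespace;
    String idByQName;

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        // By number: walk the list; the order is not guaranteed to match the document.
        for (int i = 0; i < atts.getLength(); i++) {
            byIndex.add(atts.getQName(i) + "=" + atts.getValue(i));
        }
        // By namespace URI and local name.
        hrefByNamespace = atts.getValue(XLINK_NS, "href");
        // By qualified name (assumes the parser reports qualified names).
        idByQName = atts.getValue("id");
    }

    static AttributesAccessSketch parse(String xml) throws Exception {
        SAXParserFactory factory = SAXParserFactory.newInstance();
        factory.setNamespaceAware(true);
        AttributesAccessSketch handler = new AttributesAccessSketch();
        factory.newSAXParser().parse(new InputSource(new StringReader(xml)), handler);
        return handler;
    }

    public static void main(String[] args) throws Exception {
        AttributesAccessSketch handler = parse(
            "<book xmlns:xlink='" + XLINK_NS + "' id='b1' xlink:href='http://example.org/'/>");
        System.out.println(handler.byIndex);
        System.out.println(handler.hrefByNamespace);
        System.out.println(handler.idByQName);
    }
}
```

Because the namespace-prefixes feature is left at its default, the xmlns:xlink declaration should not show up in the list, so only the two ordinary attributes are reported.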
The ContentHandler Interface
ContentHandler is the key piece of SAX. Almost every SAX program needs to use this interface. ContentHandler is a callback interface. An instance of this interface is passed to the parser via the setContentHandler( ) method of XMLReader. As the parser reads the document, it invokes the methods in its
The org.xml.sax.helpers Package
The org.xml.sax.helpers package contains support classes for the core SAX classes. These include factory classes used to build instances of particular org.xml.sax interfaces and default implementations of those interfaces.
The AttributesImpl Class
AttributesImpl is a default implementation of the Attributes interface that SAX parsers and filters may use. Besides the methods of the Attributes interface, this class offers manipulator methods so the list of attributes can be modified or reused. These methods allow you to take a persistent snapshot of an Attributes object in startElement( ) and construct or modify an Attributes object in a SAX driver or filter:
package org.xml.sax.helpers;

public class AttributesImpl implements Attributes {

    public AttributesImpl(  );
    public AttributesImpl(Attributes atts);

    public int    getLength(  );
    public String getURI(int index);
    public String getLocalName(int index);
    public String getQName(int index);
    public String getType(int index);
    public String getValue(int index);
    public int    getIndex(String uri, String localName);
    public int    getIndex(String qualifiedName);
    public String getType(String uri, String localName);
    public String getType(String qualifiedName);
    public String getValue(String uri, String localName);
    public String getValue(String qualifiedName);
    public void   clear(  );
    public void   setAttributes(Attributes atts);
    public void   addAttribute(String uri, String localName,
     String qualifiedName, String type, String value);
    public void   setAttribute(int index, String uri, String localName,
     String qualifiedName, String type, String value);
    public void   removeAttribute(int index);
    public void   setURI(int index, String uri);
    public void   setLocalName(int index, String localName);
    public void   setQName(int index, String qualifiedName);
    public void   
SAX Features and Properties
Absolute URIs are used to name a SAX parser's properties and features. Features have a Boolean value; that is, for each parser, a recognized feature is either true or false. Properties have object values. SAX 2.0 defines six core features and two core properties that parsers should recognize. SAX 2.0.2 adds nine more. In addition, most parsers add features and properties to this list.
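Reading and setting a feature might look like the following sketch. The class name, the helper methods, and the unrecognized example.org feature URI are hypothetical. The reader is obtained through JAXP with namespace awareness switched on explicitly, because a JAXP factory does not enable it by default.

```java
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.SAXException;
import org.xml.sax.SAXNotRecognizedException;
import org.xml.sax.SAXNotSupportedException;
import org.xml.sax.XMLReader;

final class SaxFeaturesSketch {
    static final String NAMESPACES = "http://xml.org/sax/features/namespaces";
    static final String NAMESPACE_PREFIXES = "http://xml.org/sax/features/namespace-prefixes";

    // Obtains a namespace-aware XMLReader through JAXP.
    static XMLReader newReader() throws ParserConfigurationException, SAXException {
        SAXParserFactory factory = SAXParserFactory.newInstance();
        factory.setNamespaceAware(true);
        return factory.newSAXParser().getXMLReader();
    }

    // Tries to set a feature; returns false if the parser rejects the name or the value.
    static boolean trySetFeature(XMLReader reader, String name, boolean value) {
        try {
            reader.setFeature(name, value);
            return true;
        } catch (SAXNotRecognizedException | SAXNotSupportedException e) {
            return false;
        }
    }

    public static void main(String[] args) throws Exception {
        XMLReader reader = newReader();
        System.out.println("namespaces: " + reader.getFeature(NAMESPACES));
        System.out.println("set namespace-prefixes: "
            + trySetFeature(reader, NAMESPACE_PREFIXES, true));
        System.out.println("namespace-prefixes: " + reader.getFeature(NAMESPACE_PREFIXES));
        // A made-up feature name that no parser should recognize.
        System.out.println("set unknown feature: "
            + trySetFeature(reader, "http://example.org/no-such-feature", true));
    }
}
```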
SAX Core Features
All SAX parsers should recognize six core features. Of these six, two (http://xml.org/sax/features/namespaces and http://xml.org/sax/features/namespace-prefixes) must be implemented by all conformant processors. The other four are optional and may not be implemented by all parsers:
http://xml.org/sax/features/namespaces
When true, this feature indicates that the startElement( ) and endElement( ) methods provide namespace URIs and local names for elements and attributes. When false, the parser provides prefixed element and attribute names to the startElement( ) and endElement( ) methods. If a parser does not provide something it is not required to provide, then that value will be set to the empty string. However, most parsers provide all three (URI, local name, and prefixed name), regardless of the value of this feature. This feature is true by default.
http://xml.org/sax/features/namespace-prefixes
When true, this feature indicates that xmlns and xmlns:prefix attributes will be included in the attributes list passed to startElement( ). When false, these attributes are omitted. Furthermore, if this feature is true, then the parser will provide the prefixed names for elements and attributes. The default is false unless http://xml.org/sax/features/namespaces is false, in which case this feature defaults to true. You can set both
The org.xml.sax.ext Package
The org.xml.sax.ext package provides optional interfaces that parsers may use to provide further functionality. Not all parsers support these interfaces, although most major ones do.
The Attributes2 Interface
SAX 2.0.2 adds Attributes2, a subinterface of Attributes that provides extra methods to determine whether a given attribute was declared in the DTD and/or specified in the instance document (as opposed to being defaulted in from the DTD). A parser that supports Attributes2 will pass an Attributes2 object to startElement( ) instead of a plain Attributes object. Using the extra methods requires a cast. Before casting, you may wish to check whether the cast will succeed by getting the value of the http://xml.org/sax/features/use-attributes2 feature. If this feature is true, the parser passes Attributes2 objects.
package org.xml.sax.ext;

public interface Attributes2 extends Attributes {

  public boolean isDeclared(int index);
  public boolean isDeclared(String qualifiedName);
  public boolean isDeclared(String  namespaceURI, String  localName);
  public boolean isSpecified(int index);
  public boolean isSpecified(String qualifiedName);
  public boolean isSpecified(String  namespaceURI, String  localName);

}
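The cast described above might be written as in the following sketch. The class name, the fields, and the sample note document with its internal DTD subset are hypothetical. Instead of querying the use-attributes2 feature, this version tests the object with instanceof, which is a simpler substitute that reaches the same decision at the point of use.

```java
import java.io.StringReader;
import java.util.HashMap;
import java.util.Map;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.ext.Attributes2;
import org.xml.sax.helpers.DefaultHandler;

final class Attributes2Sketch extends DefaultHandler {
    // Hypothetical document: id is given in the start-tag, lang is defaulted from the DTD.
    static final String XML =
        "<!DOCTYPE note ["
      + "<!ELEMENT note EMPTY>"
      + "<!ATTLIST note id CDATA #IMPLIED lang CDATA 'en'>"
      + "]>"
      + "<note id='n1'/>";

    boolean supported;
    final Map<String, Boolean> specified = new HashMap<>();

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        // Only cast when the parser actually handed over an Attributes2 object.
        if (atts instanceof Attributes2) {
            supported = true;
            Attributes2 atts2 = (Attributes2) atts;
            for (int i = 0; i < atts.getLength(); i++) {
                specified.put(atts.getQName(i), atts2.isSpecified(i));
            }
        }
    }

    static Attributes2Sketch parse(String xml) throws Exception {
        Attributes2Sketch handler = new Attributes2Sketch();
        SAXParserFactory.newInstance().newSAXParser()
            .parse(new InputSource(new StringReader(xml)), handler);
        return handler;
    }

    public static void main(String[] args) throws Exception {
        Attributes2Sketch handler = parse(XML);
        System.out.println("Attributes2 supported: " + handler.supported);
        System.out.println("specified in the document: " + handler.specified);
    }
}
```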
The DeclHandler Interface
DeclHandler is a callback interface that provides information about the ELEMENT, ATTLIST, and parsed ENTITY declarations in the document's DTD. To configure an XMLReader with a DeclHandler, pass the name http://xml.org/sax/properties/DeclHandler and an instance of the handler to the reader's setProperty( ) method:
try {
  parser.setProperty(
   "http://xml.org/sax/properties/DeclHandler",
    new YourDeclHandlerImplementationClass(  ));
}
catch(SAXException ex) {
  System.out.println("This parser does not provide declarations.");
}
If the parser does not provide declaration events, it throws a SAXNotRecognizedException. If the parser cannot install a
End of preview. The rest of this section cannot be viewed here.
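To make the DeclHandler callbacks more concrete, here is a minimal sketch of an implementation that records the names of the declarations it is told about. The class name DeclarationLister, its list( ) method, and the sample document are invented for illustration. The sketch installs the handler under the standard SAX2 property ID http://xml.org/sax/properties/declaration-handler and assumes a parser, such as the one bundled with the JDK, that reports declarations.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.ext.DeclHandler;

// Hypothetical handler that records one line per DTD declaration.
class DeclarationLister implements DeclHandler {

  final List<String> declarations = new ArrayList<>();

  @Override
  public void elementDecl(String name, String model) {
    declarations.add("ELEMENT " + name);
  }

  @Override
  public void attributeDecl(String eName, String aName, String type,
                            String mode, String value) {
    declarations.add("ATTLIST " + eName + " " + aName);
  }

  @Override
  public void internalEntityDecl(String name, String value) {
    declarations.add("ENTITY " + name);
  }

  @Override
  public void externalEntityDecl(String name, String publicId,
                                 String systemId) {
    declarations.add("ENTITY " + name + " (external)");
  }

  // Parses the document and returns the declarations that were reported.
  static List<String> list(String xml) {
    DeclarationLister lister = new DeclarationLister();
    try {
      XMLReader reader =
          SAXParserFactory.newInstance().newSAXParser().getXMLReader();
      reader.setProperty(
          "http://xml.org/sax/properties/declaration-handler", lister);
      reader.parse(new InputSource(new StringReader(xml)));
    } catch (Exception ex) {
      throw new IllegalStateException("Could not list declarations", ex);
    }
    return lister.declarations;
  }

  public static void main(String[] args) {
    String xml = "<!DOCTYPE doc [<!ELEMENT doc (#PCDATA)>"
        + "<!ATTLIST doc lang CDATA \"en\">"
        + "<!ENTITY pub \"XML\">]><doc>&pub;</doc>";
    list(xml).forEach(System.out::println);
  }
}
```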
Chapter 27: Character Sets
Preview
By default, an XML parser assumes that XML documents are written in the UTF-8 encoding of Unicode. However, documents may be written instead in any character set the XML processor understands, provided that there's either some external metadata like an HTTP header or internal metadata like a byte-order mark or an encoding declaration that specifies the character set. For example, a document written in the Latin-5 character set would need this XML declaration:
<?xml version="1.0" encoding="ISO-8859-9"?>
Most good XML processors understand many common character sets. The XML specification recommends the character set names shown in Table 27-1. When using any of these character sets, you should use these names. Of these character sets, only UTF-8 and UTF-16 must be supported by all XML processors, although many XML processors support all character sets listed here, and many support additional character sets besides. When using character sets not listed here, you should use the names specified in the IANA character sets registry at http://www.iana.org/assignments/character-sets.
Table 27-1: Character set names defined by the XML specification
Name
Character set
UTF-8
The default encoding used in XML documents, unless an encoding declaration, byte-order mark, or external metadata specifies otherwise; a variable-width encoding of Unicode that uses one to four bytes per character. UTF-8 is designed such that all ASCII documents are legal UTF-8 documents, which is not true of UTF-16. This character set is normally the best encoding choice for XML documents that don't contain a lot of Chinese, Japanese, or Korean text.
UTF-16
A two-byte encoding of Unicode in which all Unicode characters defined in Unicode 3.0 and earlier (including the ASCII characters) occupy exactly two bytes. However, characters from planes 1 through 14, added in Unicode 3.1 and later, are encoded using surrogate pairs of four bytes each. This encoding is the best choice if your XML documents contain substantial amounts of Chinese, Japanese, or Korean.
End of preview. The rest of this section cannot be viewed here.
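The size trade-off between UTF-8 and UTF-16 described in Table 27-1 can be checked with a few lines of Java. This is only a sketch: the class name EncodingWidths and its helper methods are invented here, and the UTF-16 figure is computed with the big-endian form and no byte-order mark.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper for comparing encoded sizes.
class EncodingWidths {

  // Number of bytes the string occupies in UTF-8.
  static int utf8Length(String s) {
    return s.getBytes(StandardCharsets.UTF_8).length;
  }

  // Number of bytes the string occupies in UTF-16 (big-endian, no BOM).
  static int utf16Length(String s) {
    return s.getBytes(StandardCharsets.UTF_16BE).length;
  }

  public static void main(String[] args) {
    String[] samples = {
      "A",                                    // ASCII letter
      "\u00E9",                               // Latin-1 letter e with acute
      "\u4E2D",                               // Han ideograph
      new String(Character.toChars(0x1D11E))  // plane 1: musical G clef
    };
    for (String s : samples) {
      System.out.println("U+" + Integer.toHexString(s.codePointAt(0))
          + " UTF-8=" + utf8Length(s) + " UTF-16=" + utf16Length(s));
    }
  }
}
```

The ASCII letter takes one byte in UTF-8 and two in UTF-16, the Han ideograph takes three bytes in UTF-8 and two in UTF-16, and the plane 1 character takes four bytes in both.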
Character Tables
Preview
The XML specification divides Unicode into five overlapping sets:
Name characters
Characters that can appear in an element, attribute, or entity name. These characters are letters, ideographs, digits, and the punctuation marks _, -, ., and :. In the tables that follow, name characters are shown in bold type, such as A, Å, Ą, Д, ئ, 1, 2, 3, α, , and _.
One of the major differences between XML 1.0 and 1.1 is in which characters are name characters. All XML 1.0 name characters are also XML 1.1 name characters. However, XML 1.1 also promotes many other characters to name characters. Some of these, such as the Burmese and Mongolian letters, reasonably deserve to be name characters. However, XML 1.1 also allows many problematic characters including ligatures such as ij, currency symbols such as the Greek drachma sign, letter-like symbols such as ©, number forms such as Roman numerals, and presentation forms. Finally, it allows all characters not defined as of Unicode 3.1.1 and all characters from beyond the basic multilingual plane, including such strange things as the musical symbol for a six-string fretboard. Unless you are working in a language such as Burmese or Mongolian that requires these new characters, we recommend that you restrict your markup to characters that are legal in XML 1.0. The tables that follow are based on XML 1.0 rules.
Name start characters
Characters that can be the first character of an element, attribute, or entity name. These characters are letters, ideographs, and the underscore _. In the tables that follow, these characters are shown with a gray background, such as A, Å, Ą, Д, ئ, α, , and _. Because name start characters are a subset of name characters, they are also shown in bold.
Character data characters
End of preview. The rest of this section cannot be viewed here.
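One way to spot-check the name and name start rules described in this section is to let a parser decide. The sketch below wraps a candidate string in an empty-element tag and asks the JDK's parser, which is not namespace aware by default, whether the result is well-formed. The class name NameCheck and the method isElementName( ) are invented for this example, and the check is only as strict as the parser that happens to be installed.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper; it delegates the real work to the XML parser.
class NameCheck {

  // Returns true if the parser accepts <candidate/> as a document whose
  // root element is named exactly candidate.
  static boolean isElementName(String candidate) {
    try {
      DocumentBuilder builder =
          DocumentBuilderFactory.newInstance().newDocumentBuilder();
      // DefaultHandler rethrows fatal errors without printing them.
      builder.setErrorHandler(new DefaultHandler());
      Document doc = builder.parse(
          new InputSource(new StringReader("<" + candidate + "/>")));
      // Guards against input such as a name followed by an attribute.
      return doc.getDocumentElement().getNodeName().equals(candidate);
    } catch (SAXException ex) {
      return false; // not well-formed, so not a legal name
    } catch (ParserConfigurationException | IOException ex) {
      throw new IllegalStateException("Parser could not be used", ex);
    }
  }

  public static void main(String[] args) {
    for (String name : new String[] {"_note", "note-1.a", "1note", "-note"}) {
      System.out.println(name + " " + isElementName(name));
    }
  }
}
```

Names that begin with an underscore or a letter are accepted, while names that begin with a digit, a hyphen, or a period are rejected because those characters are name characters but not name start characters.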
HTML4 Entity Sets
Preview
HTML 4.0 predefines several hundred named entities, many of which are quite useful. For instance, the nonbreaking space is &nbsp;. XML, however, defines only five named entities:
&amp;
The ampersand (&)
&lt;
The less-than sign (<)
&gt;
The greater-than sign (>)
&quot;
The straight double quote (")
&apos;
The straight single quote (')
Other needed characters can be inserted with character references in decimal or hexadecimal format. For instance, the nonbreaking space is Unicode character 160 (decimal). Therefore, you can insert it in your document as either &#160; or &#xA0;. If you really want to type it as &nbsp;, you can define this entity reference in your DTD. Doing so requires you to use a character reference:
<!ENTITY nbsp "&#160;">
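The equivalence of the three forms can be confirmed with a short Java sketch. The class name NbspReferences and the method textOf( ) are invented for this example, which parses a document whose internal DTD subset declares nbsp as shown above and then reads back the element's text.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical helper that returns the text of the root element.
class NbspReferences {

  static String textOf(String xml) {
    try {
      Document doc = DocumentBuilderFactory.newInstance()
          .newDocumentBuilder()
          .parse(new InputSource(new StringReader(xml)));
      return doc.getDocumentElement().getTextContent();
    } catch (Exception ex) {
      throw new IllegalStateException("Could not parse the document", ex);
    }
  }

  public static void main(String[] args) {
    // Decimal reference, hexadecimal reference, and the declared entity.
    String xml = "<!DOCTYPE p [<!ENTITY nbsp \"&#160;\">]>"
        + "<p>&#160;&#xA0;&nbsp;</p>";
    String text = textOf(xml);
    for (int i = 0; i < text.length(); i++) {
      System.out.println("U+00" + Integer.toHexString(text.charAt(i)).toUpperCase());
    }
  }
}
```

All three references resolve to the same character, U+00A0.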
The XHTML 1.0 specification includes three DTD fragments that define the familiar HTML character references:
Latin-1 characters (http://www.w3.org/TR/xhtml1/DTD/xhtml-lat1.ent)
The non-ASCII, graphic characters included in ISO-8859-1 from code points 160 through 255, shown in Table 27-3
Special characters (http://www.w3.org/TR/xhtml1/DTD/xhtml-special.ent)
A few useful letters and punctuation marks not included in Latin-1
Symbols (http://www.w3.org/TR/xhtml1/DTD/xhtml-symbol.ent)
The Greek alphabet, plus various arrows, mathematical operators, and other symbols used in mathematics
Feel free to borrow these entity sets for your own use. They should be included in your document's DTD with these parameter entity references and PUBLIC identifiers:
<!ENTITY % HTMLlat1 PUBLIC
    "-//W3C//ENTITIES Latin 1 for XHTML//EN"
    "http://www.w3.org/TR/xhtml1/DTD/xhtml-lat1.ent">
%HTMLlat1;

<!ENTITY % HTMLspecial PUBLIC
    "-//W3C//ENTITIES Special for XHTML//EN"
    "http://www.w3.org/TR/xhtml1/DTD/xhtml-special.ent">
%HTMLspecial;

<!ENTITY % HTMLsymbol PUBLIC
    "-//W3C//ENTITIES Symbols for XHTML//EN"
    "http://www.w3.org/TR/xhtml1/DTD/xhtml-symbol.ent">
%HTMLsymbol;
However, we do recommend saving local copies and changing the system identifier to match the new location, rather than downloading them from the
End of preview. The rest of this section cannot be viewed here.
Other Unicode Blocks
Preview
So far we've accounted for a little over 300 of the more than 90,000 Unicode characters. Many thousands are still unaccounted for. Outside the ranges defined in XHTML and SGML, standard entity names don't exist. You should either use an editor that can produce the characters you need in the appropriate character set or you should use character references. Most of the 90,000-plus Unicode characters are either Han ideographs, Hangul syllables, or rarely used characters. However, we do list a few of the most useful blocks later in this chapter. Others can be found online at http://www.unicode.org/charts/ or in The Unicode Standard 4.0 by the Unicode Consortium (Addison Wesley).
In the tables that follow, the upper lefthand corner contains the character's hexadecimal Unicode value, and the upper righthand corner contains the character's decimal Unicode value. You can use either value to form a character reference so as to use these characters in element content and attribute values, even without an editor or fonts that support them.
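For instance, a character reference can be built from either value with a one-line helper. The class name CharacterReferences and its two methods are invented for this sketch; the sample code point, U+0100, is the first character of the Latin Extended-A block.

```java
// Hypothetical helper for building XML character references.
class CharacterReferences {

  // Decimal form, for example &#256;
  static String decimal(int codePoint) {
    return "&#" + codePoint + ";";
  }

  // Hexadecimal form, for example &#x100;
  static String hex(int codePoint) {
    return String.format("&#x%X;", codePoint);
  }

  public static void main(String[] args) {
    int aWithMacron = 0x0100; // LATIN CAPITAL LETTER A WITH MACRON
    System.out.println(decimal(aWithMacron)); // prints &#256;
    System.out.println(hex(aWithMacron));     // prints &#x100;
  }
}
```

Either string can be pasted into element content or an attribute value, and a parser will report the same character for both.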
The 128 characters in the Latin Extended-A block of Unicode are used in conjunction with the normal ASCII and Latin-1 characters. They cover most European Latin letters missing from Latin-1. The block includes various characters you'll find in the upper halves of the other ISO-8859 Latin character sets, including ISO-8859-2, ISO-8859-3, ISO-8859-4, and ISO-8859-9. When combined with ASCII and Latin-1, this block lets you write Afrikaans, Basque, Breton, Catalan, Croatian, Czech, Esperanto, Estonian, French, Frisian, Greenlandic, Hungarian, Latvian, Lithuanian, Maltese, Polish, Provençal, Rhaeto-Romanic, Romanian, Romany, Sami, Slovak, Slovenian, Sorbian, Turkish, and Welsh. See Table 27-7.
Table 27-7: Unicode's Latin Extended-A block
The Latin Extended-B block of Unicode is used in conjunction with the normal ASCII and Latin-1 characters. It mostly contains characters used for transcription of non-European languages not traditionally written in a Roman script. For instance, it's used for the Pinyin transcription of Chinese and for many African languages. See Table 27-8.
End of preview. The rest of this section cannot be viewed here.
	
