Second Edition September 2003
ISBN 978-0-596-00420-0
416 pages
EUR32.00
Table of Contents
- Chapter 1: Introduction
- Preview: Anywhere there is information, you'll find XML, or at least hear it scratching at the door. XML has grown into a huge topic, inspiring many technologies and branching into new areas. So priority number one is to get a broad view and ask the big questions, so that you can find your way through the dense jungle of standards and concepts.

A few questions come to mind.

What is XML? We will attack this from different angles. It's more than the next generation of HTML. It's a general-purpose information storage system. It's a markup language toolkit. It's an open standard. It's a collection of standards. It's a lot of things, as you'll see.

Where did XML come from? It's good to have a historical perspective. You'll see how XML evolved out of earlier efforts like SGML, HTML, and the earliest presentational markup.

What can I do with XML? A practical question, again with several answers: you can store and retrieve data, ensure document integrity, format documents, and support many cultural localizations. And what can't I do with XML? You need to know about the limitations, as it may not be a good fit with your problem.

How do I get started? Without any hesitation, I hope. I'll describe the tools you need to get going with XML and test the examples in this book. Between authoring, validating, checking well-formedness, transforming, formatting, and writing programs, you'll have a lot to play with.

So now let us dive into the big questions. By the end of this chapter, you should know enough to decide where to go from here. Future chapters will describe topics in more detail, such as core markup, quality control, style and presentation, programming interfaces, and internationalization.

End of preview. The rest of this section is not shown here.
- What Is XML?
- Preview: XML is a lot like the ubiquitous plastic containers of Tupperware®. There is really no better way to keep your food fresh than with those colorful, airtight little boxes. They come in different sizes and shapes so you can choose the one that fits best. They lock tight so you know nothing is leaking out and germs can't get in. You can tell items apart based on the container's color, or even scribble on it with magic marker. They're stackable and can be nested in larger containers (in case you want to take them with you on a picnic). Now, if you think of information as a precious commodity like food, then you can see the need for a containment system like Tupperware®.

XML contains, shapes, labels, structures, and protects information. It does this with symbols embedded in the text, called markup. Markup enhances the meaning of information in certain ways, identifying the parts and how they relate to each other. For example, when you read a newspaper, you can tell articles apart by their spacing and position on the page and the use of different fonts for titles and headings. Markup works in a similar way, except that instead of spaces and lines, it uses symbols.

Markup is important to electronic documents because they are processed by computer programs. If a document has no labels or boundaries, then a program will not know how to distinguish a piece of text from any other piece. Essentially, the program would have to work with the entire document as a unit, severely limiting the interesting things you can do with the content. A newspaper with no space between articles and only one text style would be a huge, uninteresting blob of text. You could probably figure out where one article ends and another starts, but it would be a lot of work. A computer program wouldn't be able to do even that, since it lacks all but the most rudimentary pattern-matching skills.

XML's markup divides a document into separate information containers called

End of preview. The rest of this section is not shown here.
- Where Did XML Come From?
- Preview: XML is the result of a long evolution of data packaging reaching back to the days of punched cards. It is useful to trace this path to see what mistakes and discoveries influenced the design decisions.

Early electronic formats were more concerned with describing how things should look (presentation) than with document structure and meaning. troff and TeX, two early formatting languages, did a fantastic job of formatting printed documents, but lacked any sense of structure. Consequently, documents were limited to being viewed on screen or printed as hard copies. You couldn't easily write programs to search for and siphon out information, cross-reference information electronically, or repurpose documents for different applications.

Generic coding, which uses descriptive tags rather than formatting codes, eventually solved this problem. The first organization to seriously explore this idea was the Graphic Communications Association (GCA). In the late 1960s, the GenCode project developed ways to encode different document types with generic tags and to assemble documents from multiple pieces.

The next major advance was Generalized Markup Language (GML), a project by IBM. GML's designers, Charles Goldfarb, Edward Mosher, and Raymond Lorie, intended it as a solution to the problem of encoding documents for use with multiple information subsystems. Documents coded in this markup language could be edited, formatted, and searched by different programs because of its content-based tags. IBM, a huge publisher of technical manuals, has made extensive use of GML, proving the viability of generic coding.

Inspired by the success of GML, the American National Standards Institute (ANSI) Committee on Information Processing assembled a team, with Goldfarb as project leader, to develop a standard text-description language based upon GML. The GCA GenCode committee contributed their expertise as well. Throughout the late 1970s and early 1980s, the team published working drafts and eventually created a candidate for an industry standard (GCA 101-1983) called the Standard Generalized Markup Language (SGML). This was quickly adopted by both the U.S. Department of Defense and the U.S. Internal Revenue Service.

End of preview. The rest of this section is not shown here.
- What Can I Do with XML?
- Preview: Let me tackle that question by sorting the kinds of problems for which you would use XML.

Just about every software application needs to store some data. There are look-up tables, work files, preference settings, and so on. XML makes it very easy to do this. Say, for example, you've created a calendar program and you need a way to store holidays. You could hardcode them, of course, but that's kind of a hassle since you'd have to recompile the program if you need to add to the list. So you decide to save this data in a separate file using XML. Example 1-4 shows how it might look.

Example 1-4. Calendar data file

    <caldata>
      <holiday type="international">
        <name>New Year's Day</name>
        <date><month>January</month><day>1</day></date>
      </holiday>
      <holiday type="personal">
        <name>Erik's birthday</name>
        <date><month>April</month><day>23</day></date>
      </holiday>
      <holiday type="national">
        <name>Independence Day</name>
        <date><month>July</month><day>4</day></date>
      </holiday>
      <holiday type="religious">
        <name>Christmas</name>
        <date><month>December</month><day>25</day></date>
      </holiday>
    </caldata>

Now all your program needs to do is read in the XML file and convert the markup into some convenient data structure using an XML parser. This software component reads and digests XML into a more usable form. There are lots of libraries that will do this, as well as standalone programs. Outputting XML is just as easy as reading it. Again, there are modules and libraries people have written that you can incorporate in any program.

XML is a very good choice for storing data in many cases. It's easy to parse and write, and it's open for users to edit themselves. Parsers have mechanisms to verify syntax and completeness, so you can protect your program from corrupted data. XML works best for small data files or for data that is not meant to be searched randomly. A novel is a good example of a document that is not randomly accessed (unless you are one of those people who peek at the ending of a novel before finishing), whereas a telephone directory

End of preview. The rest of this section is not shown here.
- How Do I Get Started?
- Preview: By now you are chomping at the bit, eager to gallop into XML coding of your own. Let's take a look at how to set up your own XML authoring and processing environment.

The most important item in your XML toolbox is the XML editor. This program lets you read and compose XML, and often comes with services to prevent mistakes and clarify the view of your document. There is a wide spectrum of quality and expense in editors, which makes choosing one that's right for you a little tricky. In this section, I'll take you on a tour of different kinds.

Even the lowliest plain-text editor is sufficient to work with XML. You can use TextEdit on the Mac, Notepad or WordPad on Windows, or vi on Unix. The only limitation is whether it supports the character set used by the document. In most cases, it will be UTF-8. Some of these text editors support an XML "mode" which can highlight markup and assist in inserting tags. Some popular free editors include vim, elvis, and, my personal favorite, emacs.

emacs is a powerful text editor with macros and scripted functions. Lennart Staflin has written an XML plug-in for it called psgml, available at http://www.lysator.liu.se/~lenst/. It adds menus and commands for inserting tags and showing information about a DTD. It even comes with an XML parser that can detect structural mistakes while you're editing a document. Using psgml and a feature called "font-lock," you can set up xemacs, an X Window version of emacs, to highlight markup in color. Figure 1-5 is a snapshot of xemacs with an XML document open.

Figure 1-5: Highlighted markup in xemacs with psgml

Morphon Technologies' XMLEditor is a fine example of a graphical user interface. As you can see in Figure 1-6, the window sports several panes. On the left is an outline view of the book, in which you can quickly zoom in on a particular element, open it, collapse it, and move it around. On the right is a view of the text without markup. And below these panes is an attribute editing pane. The layout is easy to customize and easy to use. Note the formatting in the text view, achieved by applying a CSS stylesheet to the document. Morphon's editor sells for $150 and you can download a 30-day demo at

End of preview. The rest of this section is not shown here.
- Chapter 2: Markup and Core Concepts
- Preview: There's a Far Side cartoon by Gary Larson about an unusual chicken ranch. Instead of strutting around, pecking at seed, the chickens are all lying on the ground or draped over fences as if they were made of rubber. You see, it was a boneless chicken ranch.

Just as skeletons give us vertebrates shape and structure, markup gives shape and structure to text. Take out the markup and you have a mess of character data without any form. It would be very difficult to write a computer program that did anything useful with that content. Software relies on markup to label and delineate pieces of data, the way suitcases make it easy for you to carry clothes with you on a trip.

This chapter focuses on the details of XML markup. Here I will describe the fundamental building blocks of all XML-derived languages: elements, attributes, entities, processing instructions, and more. And I'll show you how they all fit together to make a well-formed XML document. Mastering these concepts is essential to understanding every other topic in the book, so read this chapter carefully.

All of the markup rules for XML are laid out in the W3C's technical recommendation for XML version 1.0 (http://www.w3.org/TR/2000/REC-xml-20001006). This is the second edition of the original, which first appeared in 1998. You may also find Tim Bray's annotated, interactive version useful. Go and check it out at http://www.xml.com/axml/testaxml.htm.

End of preview. The rest of this section is not shown here.
- Tags
- Preview: If XML markup is a structural skeleton for a document, then tags are the bones. They mark the boundaries of elements, allow insertion of comments and special instructions, and declare settings for the parsing environment. A parser, the front line of any program that processes XML, relies on tags to help it break down documents into discrete XML objects. There are a handful of different XML object types, listed in Table 2-1.

Table 2-1: Types of tags in XML

| Object | Purpose | Example |
| empty element | Represent information at a specific point in the document. | <xref linkend="abc"/> |
| container element | Group together elements and character data. | <p>This is a paragraph.</p> |
| declaration | Add a new parameter, entity, or grammar definition to the parsing environment. | <!ENTITY author "Erik Ray"> |
| processing instruction | Feed a special instruction to a particular type of software. | <?print-formatter force-linebreak?> |
| comment | Insert an annotation that will be ignored by the XML processor. | <!-- here's where I left off --> |
| CDATA section | | |

End of preview. The rest of this section is not shown here.
- Documents
- Preview: An XML document is a special construct designed to archive data in a way that is most convenient for parsers. It has nothing to do with our traditional concept of documents, like the Magna Carta or Time magazine, although those texts could be stored as XML documents. It simply is a way of describing a piece of XML as being whole and intact for parsing.

It's important to think of the document as a logical entity rather than a physical one. In other words, don't assume that a document will be contained within a single file on a computer. Quite often, a document may be spread out across many files, and some of these may live on different systems. All that is required is that the XML parser reading the document has the ability to assemble the pieces into a coherent whole. Later, we will talk about mechanisms used in XML for linking discrete physical entities into a complete logical unit.

As Figure 2-2 shows, an XML document has two parts. First is the document prolog, a special section containing metadata. The second is an element called the document element, also called the root element for reasons you will understand when we talk about trees. The root element contains all the other elements and content in the document.

Figure 2-2: Parts of an XML document

The prolog is optional. If you leave it out, the parser will fall back on its default settings. For example, it automatically selects the character encoding UTF-8 (or UTF-16, if detected) unless something else is specified. The root element is required, because a document without data is just not a document.

End of preview. The rest of this section is not shown here.
- The Document Prolog
- Preview: Being a flexible markup language toolkit, XML lets you use different character encodings, define your own grammars, and store parts of the document in many places. An XML parser needs to know about these particulars before it can start its work. You communicate these options to the parser through a construct called the document prolog.

The document prolog (if you use one) comes at the top of the document, before the root element. There are two parts (both optional): an XML declaration and a document type declaration. The first sets parameters for basic XML parsing while the second is for more advanced settings. The XML declaration, if used, has to be the first line in the document. Example 2-1 shows a document containing a full prolog.

Example 2-1. A document with a full prolog
    <?xml version="1.0" standalone="no"?>            The XML declaration
    <!DOCTYPE                                        Beginning of the DOCTYPE declaration
      reminder                                       Root element name
      SYSTEM "/home/eray/reminder.dtd"               DTD identifier
    [                                                Internal subset start delimiter
      <!ENTITY smile "<graphic file='smile.eps'/>">  Entity declaration
    ]>                                               Internal subset end delimiter
    <reminder>                                       Start of document element
      &smile;                                        Reference to the entity declared above
      <msg>Smile! It can always get worse.</msg>
    </reminder>                                      End of document element
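To make the role of the prolog more concrete, here is a minimal Python sketch that parses a trimmed-down version of the example and reports what the parser learned from the prolog. It assumes Python's standard `xml.dom.minidom` module; the function name `describe_prolog` and the choice to leave out the internal subset are illustrative assumptions, not part of the book's example.

```python
from xml.dom import minidom

# Trimmed-down version of the prolog example. The internal subset is left
# out so the sketch stays self-contained (an assumption for illustration).
SOURCE = """<?xml version="1.0" standalone="no"?>
<!DOCTYPE reminder SYSTEM "/home/eray/reminder.dtd">
<reminder>
  <msg>Smile! It can always get worse.</msg>
</reminder>"""


def describe_prolog(xml_text):
    """Return the prolog settings that minidom exposes after parsing."""
    doc = minidom.parseString(xml_text)
    return {
        "version": doc.version,             # from the XML declaration
        "standalone": doc.standalone,       # from the XML declaration
        "root": doc.doctype.name,           # root element name in the DOCTYPE
        "system_id": doc.doctype.systemId,  # DTD identifier in the DOCTYPE
    }


prolog = describe_prolog(SOURCE)
print(prolog)
```

The parser used here records the DTD identifier but does not fetch the file it points to, so the sketch runs even though `/home/eray/reminder.dtd` does not exist on your machine.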
The XML declaration is a small collection of details that prepare an XML processor for working with a document. It is optional, but when used it must always appear in the first line. Figure 2-3 shows the form it takes. It starts with the delimiter <?xml (1), contains a number of parameters (2), and ends with the delimiter

End of preview. The rest of this section is not shown here.
- Elements
- Preview: Elements are the building blocks of XML, dividing a document into a hierarchy of regions, each serving a specific purpose. Some elements are containers, holding text or elements. Others are empty, marking a place for some special processing such as importing a media object. In this section, I'll describe the rules for how to construct elements.

Figure 2-9 shows the syntax for a container element. It begins with a start tag consisting of an angle bracket (1) followed by a name (2). The start tag may contain some attributes (3) separated by whitespace, and it ends with a closing angle bracket (4). After the start tag is the element's content and then an end tag. The end tag consists of an opening angle bracket and a slash (5), the element's name again (2), and a closing bracket (4). The name in the end tag must match the one in the start tag exactly.

Figure 2-9: Container element syntax

An empty element is very similar, as seen in Figure 2-10. It starts with an angle bracket delimiter (1), and contains a name (2) and a number of attributes (3). It is closed with a slash and a closing angle bracket (4). It has no content, so there is no need for an end tag.

Figure 2-10: Empty element syntax

An attribute defines a property of the element. It associates a name with a value, which is a string of character data. The syntax, shown in Figure 2-11, is a name (1), followed by an equals sign (2), and a string (4) inside quotes (3). Two kinds of quotes are allowed: double (") and single ('). Quote characters around an attribute value must match.

Figure 2-11: Form of an attribute

End of preview. The rest of this section is not shown here.
- Entities
- Preview: Entities are placeholders in XML. You declare an entity in the document prolog or in a DTD, and you can refer to it many times in the document. Different types of entities have different uses. You can substitute characters that are difficult or impossible to type with character entities. You can pull in content that lives outside of your document with external entities. And rather than type the same thing over and over again, such as boilerplate text, you can instead define your own general entities.

Figure 2-17 shows the different kinds of entities and their roles. In the family tree of entity types, the two major branches are parameter entities and general entities. Parameter entities are used only in DTDs, so I'll talk about them later, in Chapter 4. This section will focus on the other type, general entities.

Figure 2-17: Entity types

An entity consists of a name and a value. When an XML parser begins to process a document, it first reads a series of declarations, some of which define entities by associating a name with a value. The value is anything from a single character to a file of XML markup. As the parser scans the XML document, it encounters entity references, which are special markers derived from entity names. For each entity reference, the parser consults a table in memory for something with which to replace the marker. It replaces the entity reference with the appropriate replacement text or markup, then resumes parsing just before that point, so the new text is parsed too. Any entity references inside the replacement text are also replaced; this process repeats as many times as necessary.

Recall from Section 2.3.2 earlier in this chapter that an entity reference consists of an ampersand (&), the entity name, and a semicolon (;). The following is an example of a document that declares three general entities and references them in the text:

    <?xml version="1.0"?>
    <!DOCTYPE message SYSTEM "/xmlstuff/dtds/message.dtd"
    [
      <!ENTITY client "Mr. Rufus Xavier Sasperilla">
      <!ENTITY agent "Ms. Sally Tashuns">
      <!ENTITY phone "<number>617-555-1299</number>">
    ]>
    <message>
      <opening>Dear &client;</opening>
      <body>We have an exciting opportunity for you! A set of
      ocean-front cliff dwellings in Piñata, Mexico, have been
      renovated as time-share vacation homes. They're going fast!
      To reserve a place for your holiday, call &agent; at &phone;.
      Hurry, &client;. Time is running out!</body>
    </message>
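The replacement process described above can be observed with a few lines of Python. This is a minimal sketch, assuming the standard `xml.etree.ElementTree` module, whose underlying parser expands general entities declared in the internal subset. The document is shortened, and the SYSTEM identifier is left out so the sketch is self-contained; both changes are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

# Shortened version of the message document, without the external DTD
# identifier (an assumption so the sketch needs no files on disk).
DOC = """<?xml version="1.0"?>
<!DOCTYPE message [
  <!ENTITY client "Mr. Rufus Xavier Sasperilla">
  <!ENTITY agent "Ms. Sally Tashuns">
  <!ENTITY phone "<number>617-555-1299</number>">
]>
<message>
  <opening>Dear &client;</opening>
  <body>Call &agent; at &phone;. Hurry, &client;.</body>
</message>"""

root = ET.fromstring(DOC)

# A text entity is replaced by its replacement text.
opening = root.find("opening").text

# An entity whose value contains markup is parsed too, so <number>
# shows up as a real child element of <body>.
number = root.find("body/number").text

# All character data inside <body>, with every reference expanded.
body_text = "".join(root.find("body").itertext())

print(opening)
print(number)
print(body_text)
```

Note that `&client;` is referenced twice and expanded both times, which is the point of declaring it once in the prolog.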
End of preview. The rest of this section is not shown here.
- Miscellaneous Markup
- Preview: Rounding out the list of markup objects are comments, processing instructions, and CDATA sections. They all have one thing in common: they shield content from the parser in some fashion. Comments keep text from ever getting to the parser. CDATA sections turn off the tag resolution, and processing instructions target specific processors.

Comments are notes in the document that are not interpreted by the XML processor. If you're working with other people on the same files, these messages can be invaluable. They can be used to identify the purpose of files and sections to help navigate a cluttered document, or simply to communicate with each other.

Figure 2-21 shows the form of a comment. It starts with the delimiter <!-- (1) and ends with the delimiter --> (3). Between these delimiters goes the comment text (2), which can be just about any kind of text you want, including spaces, newlines, and markup. The only string not allowed inside a comment is two or more dashes in succession, since the parser would interpret that string as the end of the comment.
Figure 2-21: Comment syntax

Comments can go anywhere in your document except before the XML declaration and inside tags. The XML processor removes them completely before parsing begins. So this piece of XML:

    <p>The quick brown fox jumped<!-- test -->over the lazy dog. The quick brown <!-- test --> fox jumped over the lazy dog. The<!-- test -->quick brown fox jumped over the lazy dog.</p>

will look like this to the parser:

    <p>The quick brown fox jumpedover the lazy dog. The quick brown fox jumped over the lazy dog. Thequick brown fox jumped over the lazy dog.</p>
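You can check this behavior yourself. The following is a minimal Python sketch, assuming the standard `xml.etree.ElementTree` module with its default settings, in which comments are dropped and the text on either side is joined. The one-sentence sample is an abbreviated version of the example above.

```python
import xml.etree.ElementTree as ET

# One sentence from the example, with a comment and no surrounding spaces.
SOURCE = "<p>The quick brown fox jumped<!-- test -->over the lazy dog.</p>"

paragraph = ET.fromstring(SOURCE)

# With the default tree builder the comment leaves no trace: the element
# has no children and the text on both sides runs together.
text = paragraph.text
child_count = len(paragraph)

print(text)
print(child_count)
```

Because the comment leaves nothing behind, not even a space, the words "jumped" and "over" end up fused in the program's view of the text.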
Since comments can contain markup, they can be used to "turn off" parts of a document. This is valuable when you want to remove a section temporarily, keeping it in the file for later use. In this example, a region of code is commented out:

End of preview. The rest of this section is not shown here.
- Chapter 3: Modeling Information
- Preview: Designing a markup language is a task similar to designing a building. First, you have to ask some questions: Who am I building it for? How will it be constructed? How will it be used? Do I give it many small rooms or a few large ones? Will the rooms be generic and interchangeable or specialized? Is there a role for the building, like storage, office space, or factory work? It takes a lot of planning to do it right.

When designing a markup language, there are many questions to answer: What constitutes a document? How detailed do you need it to be? How will it be generated? Is it flexible enough to handle every expected situation? Is it generic enough to support different formatting options and modes? Your decisions will help answer the most basic question: how can you represent a piece of information as XML? This problem is part of the important topic of data modeling.

In this chapter, we look at the ways in which different kinds of data are modelled using XML. First, I'll show you the most basic kinds of documents, simple collections of preferences for software applications. The next category covers narrative documents with characteristics such as text flows, block and inline elements, and titled sections. Lastly, under the broad umbrella of "complex" data, I'll talk about the myriad specialized markup languages for everything from vector graphics to remote procedure calls.

End of preview. The rest of this section is not shown here.
- Simple Data Storage
- Preview: XML can be used like an extremely basic database. Since the early days of computer operating systems, data has been stored in files as tables, like the venerable /etc/passwd file:

    nobody:*:-2:-2:Unprivileged User:/nohome:/noshell
    root:*:0:0:System Administrator:/var/root:/bin/tcsh
    daemon:*:1:1:System Services:/var/root:/noshell
    smmsp:*:25:25:Sendmail User:/private/etc/mail:/noshell
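As a rough illustration, a record in this colon-delimited format can be read with a simple split. The Python sketch below is an assumption-laden example rather than a real passwd reader: the field names in `FIELDS` and the function `parse_passwd_line` are hypothetical labels chosen for this illustration.

```python
# Hypothetical labels for the seven colon-separated fields of a record.
FIELDS = ("name", "password", "uid", "gid", "gecos", "home", "shell")


def parse_passwd_line(line):
    """Split one record into a dictionary keyed by the labels above."""
    parts = line.rstrip("\n").split(":")
    if len(parts) != len(FIELDS):
        # A stray colon inside a field shifts every later field, so the
        # only safe reaction is to reject the record.
        raise ValueError(f"expected {len(FIELDS)} fields, got {len(parts)}")
    return dict(zip(FIELDS, parts))


record = parse_passwd_line("root:*:0:0:System Administrator:/var/root:/bin/tcsh")
print(record["name"], record["shell"])
```

The length check hints at the format's weak spot: the delimiter cannot appear inside a value, and a record that breaks this rule can only be detected by counting fields.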
Data like this isn't too hard to parse, but it has problems, too. Certain characters aren't allowed. Each record lives on a separate line, so data can't span lines. A syntax error is easy to create and may be difficult to locate. XML's explicit markup gives it natural immunity to these types of problems.

If you are writing a program that reads or saves data to a file, there are good reasons to go with XML. Parsers have been written to parse it already, so all you need to do is link to a library and use one of several easy interfaces: SAX, DOM, or XPath. Syntax errors are easy to catch, and that too is automated by the parser. Technologies like DTDs and Schema even check the structure and contents of elements for you, to ensure completeness and ordering.

A dictionary is a simple one-to-one mapping of properties to values. A property has a name, or key, which is a unique identifier. A dictionary is kind of like a table with two columns. It's a simple but very effective way to serialize data.

In the Macintosh OS X operating system, Apple selected XML as its format for preference files (called property lists). For the Chess program, the property list is in a file called com.apple.Chess.plist, shown here:

    <?xml version="1.0" encoding="UTF-8"?>
    <!DOCTYPE plist SYSTEM "file://localhost/System/Library/DTDs/PropertyList.dtd">
    <plist version="0.9">
      <dict>
        <!-- KEY VALUE -->
        <key>BothSides</key> <false/>
        <key>Level</key> <integer>1</integer>
        <key>PlayerHasWhite</key> <true/>
        <key>SpeechRecognition</key> <false/>
      </dict>
    </plist>

End of preview. The rest of this section is not shown here.
- Narrative Documents
- Preview: Now let's look at an important category of XML. A narrative document contains text meant to be read by people rather than machines. Web pages, books, journals, articles, and essays are all narrative documents. These documents have some common traits. First, order of elements is inviolate. Try reading a book backward and you'll agree it's much less interesting that way (and it gives away the ending). The text runs in a single path called a flow, which the reader follows from beginning to end.

Another key feature of narrative documents is specialized element groups, including sections, blocks, and inlines. Sections are what you would imagine: elements that break up the document into parts like chapters, subsections, and so on. Blocks are rectangular regions such as titles and paragraphs. Inlines are strings inside those blocks specially marked for formatting. Figure 3-2 shows how a typical formatted document would render these elements.

Figure 3-2: Flows, blocks, inlines

A narrative document contains at least one flow, a stream of text to be read continuously from start to finish. If there are multiple flows, one will be dominant, branching occasionally into short tangential flows like sidebars, notes, tips, warnings, footnotes, and so on. The main flow is typically formatted as a column, while other flows are often in boxes interrupting the main flow, or moved to the side or the very end, with some kind of link (e.g., a footnote symbol).

Markup for flows is varied. Some XML applications like XHTML do not support more than one flow. Others, like DocBook, have rich support for flows, encapsulating them as elements inside the main flow. The best representations allow flows to be moved around, floated within the confines of the formatted page.

The main flow is broken up into sections

End of preview. The rest of this section is not shown here.
- Complex Data
- Preview: XML really shines when data is complex. It turns the most abstract concepts into concrete databases ready for processing by software. Multimedia formats like Scalable Vector Graphics (SVG) and Synchronized Multimedia Integration Language (SMIL) map pictures and movies into XML markup. Complex ideas in the scientific realm are just as readily coded as XML, as proven by MathML (equations), the Chemical Markup Language (chemical formulae), and the Molecular Dynamics Language (molecule interactions).

The reason XML is so good at modelling complex data is that the same building blocks used for narrative documents (elements and attributes) can apply to any composition of objects and properties. Just as a book breaks down into chapters, sections, blocks, and inlines, many abstract ideas can be deconstructed into discrete and hierarchical components. Vector graphics, for example, are composed of a finite set of shapes with associated properties. You can represent each shape as an element and use attributes to hammer down the details.

SVG is a good example of how to represent objects as elements. Take a gander at the simple SVG document in Example 3-7. Here we have three different shapes represented by as many elements: a common rectangle, an ordinary circle, and an exciting polygon. Attributes in each element customize the shape, setting color and spatial dimensions.

Example 3-7. An SVG document
<?xml version="1.0"?>
<svg>
  <desc>Three shapes</desc>
  <rect fill="green" x="1cm" y="1cm" width="3cm" height="3cm"/>
  <circle fill="red" cx="3cm" cy="2cm" r="4cm"/>
  <polygon fill="blue" points="110,160 50,300 180,290"/>
</svg>
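The mapping of shapes to elements and of shape details to attributes can be seen by reading Example 3-7 with a generic XML parser. This is a minimal sketch in Python using the standard library's ElementTree; the function name `list_shapes` is invented for illustration, and treating every child other than `desc` as a shape is an assumption of the sketch, not a rule of SVG.

```python
import xml.etree.ElementTree as ET

# The SVG document from Example 3-7, unchanged.
SVG_DOC = """<?xml version="1.0"?>
<svg>
  <desc>Three shapes</desc>
  <rect fill="green" x="1cm" y="1cm" width="3cm" height="3cm"/>
  <circle fill="red" cx="3cm" cy="2cm" r="4cm"/>
  <polygon fill="blue" points="110,160 50,300 180,290"/>
</svg>"""


def list_shapes(svg_text):
    """Return (element name, attribute dict) for each shape element."""
    root = ET.fromstring(svg_text)
    # Assumption: every child except <desc> is a shape.
    return [(child.tag, dict(child.attrib)) for child in root if child.tag != "desc"]


shapes = list_shapes(SVG_DOC)
for name, properties in shapes:
    # Each shape is an element; its details live in attributes.
    print(name, properties["fill"])
```

No SVG-specific library is involved here, which is the point: because the format is XML, a generic parser is enough to get at the objects and their properties.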
Vector graphics are scalable, meaning you can stretch the image vertically or horizontally without any loss of sharpness. The image processor just recalculates the coordinates for you, leaving you to concentrate on higher concepts like composition, color, and grouping.

SVG adds other benefits too. Being an XML application, it can be tested for well-formedness, can be edited in any generic XML editor, and is easy to write software for. DTDs and schemas are available to check for missing information, and they provide an easy way to distinguish between versions.

End of content preview. The rest of this section is not viewable here.
- Documents Describing Documents
- Content preview: Many XML documents contain metadata, information about themselves that helps search engines to categorize them. But not everyone takes advantage of the possibilities of metadata. And, unless you're using an exhaustive program that spiders through an entire document collection, it's difficult to summarize the set and choose a particular article from it. Making matters worse, not all documents have the capability to describe themselves, such as sound and graphics files. To address these problems, a class of documents evolved that specialize in describing other documents.

To fully describe different kinds of documents, these markup languages have some interesting features in common. They list the time documents have been updated using standard time formats. They label the content type, be it text, image, sound, or something else. They may contain text descriptions for a user to peruse. For international documents, they may track the language encodings. Also interesting is the way documents are uniquely identified: using a physical address or some nonphysical identifier.

Rich Site Summary (or Really Simple Syndication, depending on whom you talk to) was created by Netscape Corp. to describe content on web sites. They wanted to make a portal that was customizable, allowing readers to subscribe to particular subject areas or channels. Each time they returned to the site, they would see updates on their favorite topics, saving them the trouble of hunting around for this news on their own. Thus was born the service known as content aggregation.

Since the time when there were a few big content aggregators like Netscape and Userland, the landscape has shifted to include hundreds of smaller, more granular services. Instead of subscribing to channels that mix together lots of different sources, you can subscribe to individual sites for an even higher level of customization. Everything from the BBC to a swarm of one-person weblogs is at your disposal. Publishing has never been easier.

End of content preview. The rest of this section is not viewable here.
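The common features of these descriptive formats (update times in a standard format, content types, text descriptions, language, and unique identifiers) show up clearly in a small feed. The following is a minimal sketch in Python, assuming a hypothetical RSS 2.0 channel: the feed contents, the example.com URLs, and the helper name `summarize_feed` are all invented for illustration.

```python
import xml.etree.ElementTree as ET
from email.utils import parsedate_to_datetime

# Hypothetical RSS 2.0 feed; every title and URL here is made up.
FEED = """<rss version="2.0">
  <channel>
    <title>Example Weblog</title>
    <link>http://www.example.com/</link>
    <description>A one-person weblog</description>
    <language>en-us</language>
    <item>
      <title>First post</title>
      <link>http://www.example.com/posts/1</link>
      <guid>http://www.example.com/posts/1</guid>
      <pubDate>Mon, 01 Sep 2003 09:00:00 GMT</pubDate>
      <enclosure url="http://www.example.com/audio/1.mp3" length="1024" type="audio/mpeg"/>
    </item>
  </channel>
</rss>"""


def summarize_feed(feed_text):
    """Collect the descriptive metadata of a channel and its items."""
    channel = ET.fromstring(feed_text).find("channel")
    items = []
    for item in channel.findall("item"):
        enclosure = item.find("enclosure")
        items.append({
            "id": item.findtext("guid"),        # unique identifier
            "title": item.findtext("title"),
            # RSS 2.0 dates use the RFC 822 format, parsed here by the stdlib.
            "updated": parsedate_to_datetime(item.findtext("pubDate")),
            # Content type of an attached file, if the item has one.
            "content_type": enclosure.get("type") if enclosure is not None else None,
        })
    return {
        "title": channel.findtext("title"),
        "language": channel.findtext("language"),
        "items": items,
    }


summary = summarize_feed(FEED)
print(summary["title"], summary["language"], len(summary["items"]))
```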
- Chapter 4: Quality Control with Schemas
- Content preview: Up until now, we have been talking about the things all XML documents have in common. Well-formedness rules are universal, ensuring perfect compatibility with generic tools and APIs. This syntax homogeneity is a big selling point for XML, but equally important is the need for ways to distinguish XML-based languages from each other. A document usually attempts to conform to a language of some sort, and we need methods to test its level of conformance.

Schemas, the topic of this chapter, are the shepherds of markup languages. They keep documents from straying outside of the herd and causing trouble. For instance, an administrator of a web site can use a schema to determine which web pages are legal XHTML, and which are only pretending to be. A schema can also be used to publish a specification for a language in a succinct and unambiguous way.

End of content preview. The rest of this section is not viewable here.
- Basic Concepts
- Content preview: In the general sense of the word, a schema is a generic representation of a class of things. For example, a schema for restaurant menus could be the phrase "a list of dishes available at a particular eating establishment." A schema may resemble the thing it describes, the way a "smiley face" represents an actual human face. The information contained in a schema allows you to identify when something is or is not a representative instance of the concept.

In the XML context, a schema is a pass-or-fail test for documents. A document that passes the test is said to conform to it, or be valid. Testing a document with a schema is called validation. A schema ensures that a document fulfills a minimum set of requirements, finding flaws that could result in anomalous processing. It also may serve as a way to formalize an application, being a publishable object that describes a language in unambiguous rules.

An XML schema is like a program that tells a processor how to read a document. It's very similar to a later topic we'll discuss called transformations. The processor reads the rules and declarations in the schema and uses this information to build a specific type of parser, called a validating parser. The validating parser takes an XML instance as input and produces a validation report as output. At a minimum, this report is a return code, true if the document is valid, false otherwise. Optionally, the parser can create a Post Schema Validation Infoset (PSVI) including information about data types and structure that may be used for further processing.

Validation happens on at least four levels:
- Structure: The use and placement of markup elements and attributes.
- Data typing: Patterns of character data (e.g., numbers, dates, text).
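The idea of validation as a pass-or-fail test that returns true or false can be sketched with a hand-written checker covering the two levels listed so far. This is only an illustration in Python, not how a real validating parser is built from a schema: the `task` vocabulary, the rule tables, and the function name `validate` are all assumptions made up for the example.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical rules for a tiny "task" language, invented for this sketch.
REQUIRED_CHILDREN = ["description", "hours"]        # structure
PATTERNS = {"hours": re.compile(r"\d+(\.\d+)?")}    # data typing


def validate(xml_text):
    """Return True if the document passes every check, False otherwise."""
    try:
        root = ET.fromstring(xml_text)
    except ET.ParseError:
        return False  # not even well-formed
    if root.tag != "task":
        return False
    # Structure: the children must be exactly the required ones, in order.
    if [child.tag for child in root] != REQUIRED_CHILDREN:
        return False
    # Data typing: character data must match the pattern declared for it.
    for child in root:
        pattern = PATTERNS.get(child.tag)
        if pattern and not pattern.fullmatch((child.text or "").strip()):
            return False
    return True


good = validate("<task><description>Edit chapter</description><hours>2.5</hours></task>")
bad_structure = validate("<task><hours>2.5</hours></task>")
bad_type = validate("<task><description>Edit chapter</description><hours>lots</hours></task>")
print(good, bad_structure, bad_type)
```

A schema language replaces the hard-coded rule tables with declarations, so the same generic processor can test any language.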
End of content preview. The rest of this section is not viewable here.
- DTDs
- Content preview: The original XML document model is the Document Type Definition (DTD). DTDs actually predate XML; they are a reduced hand-me-down from SGML with the core syntax almost completely intact. The following describes how a DTD defines a document type.
- A DTD declares a set of allowed elements. You cannot use any element names other than those in this set. Think of this as the "vocabulary" of the language.
- A DTD defines a content model for each element. The content model is a pattern that tells what elements or data can go inside an element, in what order, in what number, and whether they are required or optional. Think of this as the "grammar" of the language.
- A DTD declares a set of allowed attributes for each element. Each attribute declaration defines the name, datatype, default values (if any), and behavior (e.g., if it is required or optional) of the attribute.
- A DTD provides a variety of mechanisms to make managing the model easier, for example, the use of parameter entities and the ability to import pieces of the model from an external file.
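The first three points (vocabulary, content models, attribute declarations) can be observed by letting a parser report the declarations it reads. The sketch below uses Python's standard `xml.parsers.expat` module, whose `ElementDeclHandler` and `AttlistDeclHandler` callbacks fire for declarations in the internal subset. The worklog vocabulary and the function name `read_declarations` are assumptions made for this example, and note that expat only reports the declarations; it does not validate the document against them.

```python
import xml.parsers.expat

# A small document with an internal DTD subset (hypothetical vocabulary).
DOC = """<?xml version="1.0"?>
<!DOCTYPE worklog [
  <!ELEMENT worklog (day*)>
  <!ELEMENT day (task*)>
  <!ELEMENT task (#PCDATA)>
  <!ATTLIST day date CDATA #REQUIRED>
]>
<worklog><day date="2003-09-01"><task>Write</task></day></worklog>"""


def read_declarations(xml_text):
    """Return the declared element names and attribute declarations."""
    elements = []
    attributes = []

    def on_element_decl(name, model):
        # model describes the content model; only the name is kept here.
        elements.append(name)

    def on_attlist_decl(elname, attname, type_, default, required):
        attributes.append((elname, attname, type_, bool(required)))

    parser = xml.parsers.expat.ParserCreate()
    parser.ElementDeclHandler = on_element_decl
    parser.AttlistDeclHandler = on_attlist_decl
    parser.Parse(xml_text, True)
    return elements, attributes


elements, attributes = read_declarations(DOC)
print("vocabulary:", elements)
print("attributes:", attributes)
```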
According to the XML Recommendation, all external parsed entities (including DTDs) should begin with a text declaration. It looks like an XML declaration except that it explicitly excludes the standalone property. If you need to specify a character set other than the default UTF-8 (see Chapter 9 for more about character sets), or to change the XML version number from the default 1.0, this is where you would do it.

If you specify a character set in the DTD, it won't automatically carry over into XML documents that use the DTD. XML documents have to specify their own encodings in their document prologs.

After the text declaration, the resemblance to normal document prologs ends. External parsed entities, including DTDs, must not contain a document type declaration.

End of content preview. The rest of this section is not viewable here.
- W3C XML Schema
- Content preview: DTDs are chiefly directed toward describing how elements are arranged in a document. They say very little about the content in the document, other than whether an element can contain character data. Although attributes can be declared to be of different types (e.g., ID, IDREF, enumerated), there is no way to constrain the type of data in an element.

Returning to the example in Section 4.2.3, we can see how this limitation can be a serious problem. Suppose that a census taker submitted the document in Example 4-5.

Example 4-5. A bad CensusML document
<census-record taker="9170">
  <date><month>?</month><day>110</day><year>03</year></date>
  <address>
    <city>Munchkinland</city>
    <street></street>
    <county></county>
    <country>Here, silly</country>
    <postalcode></postalcode>
  </address>
  <person employed="fulltime" pid="?">
    <name>
      <last>Burgle</last>
      <first>Brad</first>
    </name>
    <age>2131234</age>
    <gender>yes</gender>
  </person>
</census-record>

There are a lot of things wrong with this document. The date is in the wrong format. Several important fields were left empty. The stated age is an impossibly large number. The gender, which ought to be "male" or "female," contains something else. The personal identification number has a bad value. And yet, to our infinite dismay, the DTD would pick up none of these problems.

It isn't hard to write a program that would check the data types, but that's a low-level operation, prone to bugs and requiring technical ability. It's also getting away from the point of DTDs, which is to create a kind of metadocument, a formal description of a markup language. Programming languages aren't portable and don't work well as a way of conveying syntactic and semantic details. So we have to conclude that DTDs don't go far enough in describing a markup language.

To make matters worse, the things the DTD will reject as bad markup are often trivial. For example, the contents of date and

End of content preview. The rest of this section is not viewable here.
- RELAX NG
- Content preview: RELAX NG is a powerful schema validation language that builds on earlier work including RELAX and TREX. Like W3C Schema, it uses XML syntax and supports namespaces and data typing. It goes further by integrating attributes into content models, which greatly simplifies the structure of the schema. It offers superior handling of unordered content and supports context-sensitive content models.

In general, it just seems easier to write schemas in RELAX NG than in W3C Schema. The syntax is very clear, with elements like zeroOrMore for specifying optional repeating content. Declarations can contain other declarations, leading to a more natural representation of a document's structure.

Consider the simple schema in Example 4-7, which models a document type for logging work activity. It's easy to read this schema and understand the structure of a typical document.

Example 4-7. A simple RELAX NG schema

<element name="worklog"
         xmlns="http://relaxng.org/ns/structure/1.0"
         xmlns:ann="http://relaxng.org/ns/compatibility/annotations/1.0">
  <ann:documentation>A document for logging work activity, broken down into days, and further into tasks.</ann:documentation>
  <zeroOrMore>
    <element name="day">
      <attribute name="date">
        <text/>
      </attribute>
      <zeroOrMore>
        <element name="task">
          <element name="description">
            <text/>
          </element>
          <element name="time-start">
            <text/>
          </element>
          <element name="time-end">
            <text/>
          </element>
        </element>
      </zeroOrMore>
    </element>
  </zeroOrMore>
</element>

The same thing would look like this as a DTD:

<!ELEMENT worklog (day*)>
<!ELEMENT day (task*)>
<!ELEMENT task (description, time-start, time-end)>
<!ELEMENT description (#PCDATA)>
<!ELEMENT time-start (#PCDATA)>
<!ELEMENT time-end (#PCDATA)>
<!ATTLIST day date CDATA #REQUIRED>
Although the DTD is more compact, it relies on a special syntax that is decidedly not XML-ish. RELAX NG accomplishes the same thing with more readability.

End of content preview. The rest of this section is not viewable here.
- Schematron
- Content preview: Schematron takes a different approach from the schema languages we've seen so far. Instead of being prescriptive, as in "this element has the following content model," it relies instead on a series of Boolean tests. Depending on the result of a test, the schema will output some predetermined message.

The tests are based on XPath, which is a very granular and exhaustive set of node examination tools. Relying on XPath is clever, taking much of the complexity out of the schema language. XPath, which is used in places such as XSLT and some implementations of DOM, can scratch an itch that more blunt tools like DTDs can't reach. As the creator of Schematron, Rick Jelliffe, says, it's like "a feather duster for the furthest corners of a room where the vacuum cleaner (DTD) cannot reach."

The basic structure of a Schematron schema is this:
<schema xmlns="http://www.ascc.net/xml/schematron">
  <pattern>
    <rule context="XPath Expression">
      <assert test="XPath Expression">
        message
      </assert>
      <report test="XPath Expression">
        message
      </report>
      ...more tests...
    </rule>
    ...more rules...
  </pattern>
  ...more patterns...
</schema>

A pattern in Schematron does not carry the same meaning as patterns in RELAX NG. Here, it's just a logical grouping of rules. If your schema is testing books, one pattern may hold rules for chapters while another groups rules for appendixes. So think of this as more of a higher-level, conceptual testing pattern, rather than as a specific node-matching pattern.

The context for each test is determined by a rule. Its context attribute contains an XSLT pattern that matches nodes. Each node found becomes the context node, on which all tests inside the rule are applied.

The children of a rule, report and assert, each apply a test to the context node. The test is another XPath expression, stored in a test attribute. report's contents will be output if its XPath expression evaluates to "true."

End of content preview. The rest of this section is not viewable here.
- Schemas Compared
- Content preview: Each of the schemas we've looked at has compelling features and significant flaws. Some of the important points are listed in Table 4-2.
Table 4-2: A comparison of schema

Feature                    DTD   W3C Schema   RELAX NG   Schematron
XML syntax                 No    Yes          Yes        Yes
Namespace compatible       No    Yes          Yes        Yes
Declares entities          Yes   No           No         No
Tests datatypes            No    Yes          Yes        Yes
Default attribute values   Yes   Yes          No         No
Notations                  Yes

End of content preview. The rest of this section is not viewable here.
- Chapter 5: Presentation Part I: CSS
- Content preview: As the Web exploded in popularity in the mid-1990s, everyone wanted their own web site. I remember learning HTML from friends and the excitement I felt when I saw my virtual homestead suddenly become accessible to thousands of computer users. Back then, I had only a very limited understanding of things like "good design," "standards," and "best practices." They seemed like lofty concepts with little relation to me and my happy experiments. So, like everyone else, I cut corners, sacrificed good taste, and ignored rules because all that mattered was seeing something display reasonably well in a browser.

Since then, the novelty has worn off and my situation is vastly different. My concern has shifted from "How can I get something to display at all?" to "How can I make my information available to everyone who tries to look at it, regardless of what software they are using, on what platform, and in which media format?" And instead of asking "How can I create an HTML page?", I now ask, "How can I make this vast amount of information easier to update, store, and publish?" And where before I might wonder how to achieve some effect in HTML, like making some lines of text larger than other lines, I now have to cope with a variety of different XML formats and extremely detailed design needs.

Cascading Style Sheets (CSS) are the first piece of this puzzle. They have been around for a long time, but for several reasons they were slow to take off. Now sites like wired.com are totally based on CSS and they actually look pretty good, while sites like http://csszengarden.com/ show off more CSS capabilities. Although originally designed to augment HTML, CSS can complement XML as well. In this chapter, we will see how it can be used for web pages as well as XML documents for human consumption.

End of content preview. The rest of this section is not viewable here.
- Stylesheets
- Content preview: XML and stylesheets go together like naked people and clothes. Let's take a moment to familiarize ourselves with the general concepts behind stylesheets. First, why do you want them? Second, how do they work? Finally, are there limitations, and what can we do about them?

I can rant about why it's important to keep information pure and separate presentation into stylesheets, but this would ignore a critical question: if it's easier to write in presentational markup (and I admit that it is), why would you want to bother with stylesheets? After all, the Web itself testifies to the fact that presentational markup is working quite well for what it was designed to do. For that matter, what's wrong with plain text?

If you are already familiar with the sermon, then skip this section, because I'm going to preach the religion of stylesheets now.

XML was inspired, to a large extent, by the limitations of HTML. Tim Berners-Lee, the inventor of HTML, always had stylesheets in mind, but for some reason, they had been forgotten in the huge initial surge of webification. Although HTML had only limited presentational capabilities built in, it was enough to satisfy the hordes of new web authors. Easy to implement and even easier to learn, HTML was soon stretched far beyond its original intentions as a simple report-formatting language, forced to encode everything from product catalogs to corporate portals. But the very thing that led to its rapid uptake, presentational markup, is also holding HTML back.

Here are some problems associated with presentational markup and some solutions provided by stylesheets:
- Low information content: Presentational markup is not much better than plain text. A machine can't understand the difference between a typical body paragraph and a poem or code sample. Nor does it know that one thing is marked bold because it's a stock price and another is bold because it's the name of a town. Consequently, you can't easily mine pages for information. Search engines can only try to match character strings, since any sense of context is missing from the markup.
End of content preview. The rest of this section is not viewable here.
- CSS Basics
- Content preview: While CSS is a field of endeavor all its own, we'll get started with some foundations.

Cascading Style Sheets (CSS) is a recommendation developed by the World Wide Web Consortium (W3C). It originated in 1994 when Håkon Wium Lie, working at CERN (the birthplace of HTML), published a paper titled Cascading HTML Style Sheets. It was a bold move at the right time. By then, the Web was four years old and growing quickly, yet there was still no consensus for a standard style description language. The architects of HTML knew that the language was in danger of becoming a style description language if something like CSS wasn't adopted soon.

The goal was to create a simple yet expressive language that could combine style descriptions from different sources. Another style description language, DSSSL, was already being used to format SGML documents. Though very powerful, DSSSL was too big and complex to be practical for the Web. It is a full programming language, capable of more precision and logical expression than CSS, which is a simple language, focused on the basic needs of small documents.

While other stylesheet languages existed when CSS was proposed, none offered the ability to combine multiple sources into one style description set. CSS makes the Web truly accessible and flexible by allowing a reader to override the author's styles to adapt a document to the reader's particular requirements and applications.

The W3C put forward the first CSS recommendation (later called CSS1) in 1996. A short time later, a W3C working group formed around the subject of "Cascading Style Sheets and Formatting Properties" to add missing functionality. Their recommendation, CSS2, increased the language's properties from around 50 to more than 120 when it was released in 1998. It also added concepts like generated text, selection by attribute, and media other than screen display. CSS3 is still a work in progress.

Below is a sample CSS stylesheet:

End of content preview. The rest of this section is not viewable here.
- Rule Matching
- Content preview: We will delve now into the details of selector syntax and all the ways a rule can match an element or attribute. Don't worry yet about what the properties actually mean. I'll cover all that in the next section. For now, concentrate on how rules drive processing and how they interact with each other.

Figure 5-7 shows the general syntax for selectors. They typically consist of an element name (1) followed by some number of attribute tests (2) in square brackets, which in turn contain an attribute name (3) and value (4). Note that only an element or attribute is required. The other parts are optional. The element name can contain wildcards to match any element, and it can also contain chains of elements to specify hierarchical information. The attribute tests can check for the existence of an attribute (with any value), the existence of a value (for any attribute), or in the strictest case, a particular attribute-value combination.
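To make the parts of a selector concrete, here is a toy matcher in Python. It is a sketch under stated assumptions, not how a CSS processor is implemented: the `Selector` class and the `matches` function are invented names, selectors are built as objects rather than parsed from CSS text, and only three cases are covered (element name with wildcard, attribute existence, and an exact attribute-value combination).

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass
from typing import Optional


@dataclass
class Selector:
    element: str = "*"                # element name, or "*" for any element
    attribute: Optional[str] = None   # attribute that must be present
    value: Optional[str] = None       # value the attribute must have


def matches(selector, node):
    """Return True if the element node satisfies every part of the selector."""
    if selector.element not in ("*", node.tag):
        return False
    if selector.attribute is None:
        return True                   # element-only selector
    if selector.attribute not in node.attrib:
        return False                  # attribute existence test failed
    # With no value given, existence is enough; otherwise compare values.
    return selector.value is None or node.get(selector.attribute) == selector.value


para = ET.fromstring('<para class="intro">Hello</para>')
results = [
    matches(Selector("para"), para),                      # element name
    matches(Selector("*", "class"), para),                # any element with the attribute
    matches(Selector("para", "class", "intro"), para),    # attribute-value combination
    matches(Selector("para", "class", "summary"), para),  # wrong value
    matches(Selector("emphasis"), para),                  # wrong element
]
print(results)
```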
Figure 5-7: Syntax for a CSS selector

Matching an element is as simple as writing its name:

emphasis {
  font-style: italic;
  font-weight: bold;
}

This rule matches any emphasis element in the document. This is just the tip of the iceberg. There are many ways to qualify the selection. You can specify attribute names, attribute values, elements that come before and after, and even special conditions such as whether the cursor is currently hovering over a link, or in what language the document claims to be written.

A list of names is also allowed, letting you apply the same properties to many kinds of elements. Here, a set of three properties applies to any of the four elements, name, phone, email, and address:

name, phone, email, address {
  display: block;
  margin-top: 2em;
  margin-bottom: 2em;
}

Besides using definite element names, you can use an asterisk (

End of content preview. The rest of this section is not viewable here.
- Properties
- Content preview: The three levels of CSS define so many properties, I can't cover them all here. There are over 120 in level 2 alone. Instead, I'll cover the basic categories you are likely to encounter and leave more exhaustive descriptions to books specializing in the topic.

CSS properties can be passed down from a container element to its child. This inheritance principle greatly simplifies stylesheet design. For example, in the document element rule, you can set a font family that will be used throughout the document. Wherever you want to use a different family, simply insert a new property for a rule and it will override the global setting.

In Figure 5-9, a para inherits some properties from a section, which in turn inherits from an article. The properties font-family and color are defined in the property set for article, and inherited by both section and para. The property font-size is not inherited by section because section's explicit setting overrides it. para does inherit this property from section.
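The inheritance chain just described can be sketched as a small lookup in Python. This is an illustration under assumptions, not a model of a real CSS engine: the property values, the file name `paper.png`, the `RULES` table, and the function name `computed_style` are invented, and the set of non-inherited properties is limited to a few examples.

```python
# Properties this sketch treats as never inherited.
NOT_INHERITED = {"background-image", "display", "margin-top", "margin-bottom"}

# Hypothetical stylesheet: element name -> explicitly set properties.
RULES = {
    "article": {
        "font-family": "sans-serif",
        "color": "black",
        "font-size": "12pt",
        "background-image": "url(paper.png)",
    },
    "section": {"font-size": "10pt"},
    "para": {},
}


def computed_style(path):
    """Resolve properties for the last element in path, outermost ancestor first."""
    inherited = {}
    for name in path[:-1]:
        for prop, value in RULES.get(name, {}).items():
            if prop not in NOT_INHERITED:
                inherited[prop] = value  # nearer ancestors override farther ones
    # The element's own settings override anything inherited.
    return {**inherited, **RULES.get(path[-1], {})}


section_style = computed_style(["article", "section"])
para_style = computed_style(["article", "section", "para"])
print(section_style)
print(para_style)
```

Here `para` ends up with the font family and color of `article` but the font size of `section`, and the background image stays on `article` alone.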
Figure 5-9: Element-inheriting properties

Inheritance is forbidden for some properties where it wouldn't make sense to pass that trait on. For example, the background-image property, which causes an image to be loaded and displayed in the background, is not inherited. If every element did inherit this property, the result would be a complete mess, with every paragraph and inline element trying to display its own copy of the image in its rectangular area. It looks much better if only one element has this property and its children don't. Display type and margins are other examples.

Many properties involve some kind of measurement: the width of a rule, a font size, or a distance to indent. These lengths can be expressed in several different kinds of units.

End of content preview. The rest of this section is not viewable here.
- Examples
- Content preview: Let's put what we know to use now and format a document with CSS. XHTML is a good place to start, so let's take the document in Example 3-4. To maximize the possibilities for formatting it, we should add some structure with div elements, and use span elements to increase the granularity of inlines. Example 5-1 is the improved result.

Example 5-1. An XHTML document with DIVs and SPANs

<html>
  <head>
    <title>CAT(1) System General Commands Manual</title>
    <link rel="stylesheet" type="text/css" href="style1.css" />
  </head>
  <body>
    <h1>CAT(1) System General Commands Manual</h1>
    <div class="section">
      <h2>NAME</h2>
      <p>cat - concatenate and print files</p>
    </div>
    <div class="section">
      <h2>SYNOPSIS</h2>
      <p class="code">cat [-benstuv] [-] [<em>file</em>...]</p>
    </div>
    <div class="section">
      <h2>DESCRIPTION</h2>
      <p>
        The <span class="command">cat</span> utility reads files
        sequentially, writing them to the standard output. The file
        operands are processed in command line order. A single dash
        represents the standard input.
      </p>
      <p>
        The options are as follows:
      </p>
      <dl>
        <dt><span class="option">-b</span></dt>
        <dd>
          Implies the <span class="option">-n</span> option but doesn't
          number blank lines.
        </dd>
        <dt><span class="option">-e</span></dt>
        <dd>
          Implies the <span class="option">-v</span> option, and displays
          a dollar sign (<span class="symbol">$</span>) at the end of each
          line as well.
        </dd>
        <dt><span class="option">-n</span></dt>
        <dd>Number the output lines, starting at 1.</dd>
        <dt><span class="option">-s</span></dt>
        <dd>
          Squeeze multiple adjacent empty lines, causing the output to be
          single spaced.
        </dd>
        <dt><span class="option">-t</span></dt>
        <dd>
          Implies the <span class="option">-v</span> option, and displays
          tab characters as <span class="symbol">^I</span> as well.
        </dd>
        <dt><span class="option">-u</span></dt>
        <dd>
          The <span class="option">-u</span> option guarantees that the
          output is unbuffered.
        </dd>
        <dt><span class="option">-v</span></dt>
        <dd>
          Displays non-printing characters so they are visible. Control
          characters print as <span class="symbol">^X</span> for control-X;
          the delete character (octal 0177) prints as
          <span class="symbol">^?</span> Non-ascii characters (with the high
          bit set) are printed as <span class="symbol">M-</span> (for meta)
          followed by the character for the low 7 bits.
        </dd>
      </dl>
      <p>
        The <i>cat</i> utility exits 0 on success, and >0 if an error occurs.
      </p>
    </div>
    <div class="section">
      <h2>BUGS</h2>
      <p>
        Because of the shell language mechanism used to perform output
        redirection, the command
        <span class="command">cat file1 file2 > file1</span> will cause the
        original data in file1 to be destroyed!
      </p>
    </div>
    <div class="section">
      <h2>SEE ALSO</h2>
      <ul>
        <li><a href="head.html">head(1)</a></li>
        <li><a href="more.html">more(1)</a></li>
        <li><a href="pr.html">pr(1)</a></li>
        <li><a href="tail.html">tail(1)</a></li>
        <li><a href="vis.html">vis(1)</a></li>
      </ul>
      <p>
        Rob Pike, <span class="citation">UNIX Style, or cat -v Considered
        Harmful</span>, USENIX Summer Conference Proceedings, 1983.
      </p>
    </div>
    <div class="section">
      <h3>HISTORY</h3>
      <p>
        A <i>cat</i> utility appeared in Version 6 AT&amp;T UNIX.
      </p>
      <p>3rd Berkeley Distribution, May 2, 1995</p>
    </div>
  </body>
</html>

End of content preview. The rest of this section is not viewable here.
- Chapter 6: XPath and XPointer
- Content preview: XML has often been compared to a database because of the way it packages information for easy retrieval. Ignoring the obvious issues of speed and optimization, this isn't a bad analogy. Element names and attributes put handles on data, just as SQL tables use table and field names. Element structure supplies even more information in the form of context (e.g., element A is the child of element B, which follows element C, etc.). With a little knowledge of the markup language, you can locate and extract any piece of information.

This is useful for many reasons. First, you might want to locate specific data from a known location (called a path) in a particular document. Given a URI and path, you ought to be able to fetch that data automatically. The other benefit is that you can use path information to get really specific about processing a class of documents. Instead of just giving element name or attribute value to configure a stylesheet as with CSS, you could incorporate all kinds of extra contextual details, including data located anywhere in the document. For example, you could specify that items in a list should use a particular kind of bullet given in a metadata section at the beginning of the document.

To express path information in a standard way, the W3C recommends the XML Path Language (also known as XPath). Quickly following on the heels of the XML recommendation, XPath opens up many possibilities for documents and facilitates technologies such as XSLT and DOM. The XML Pointer Language (XPointer) extends XPath into a wider realm, allowing you to locate information in other documents.

End of content preview. The rest of this section is not viewable here.
- Nodes and Trees
- Content preview: Remember in Chapter 2 when we talked about trees and XML? I said that every XML document can be represented graphically with a tree structure. The reason that is important will now be revealed. Because there is only one possible tree configuration for any given document, there is a unique path from the root (or any point inside) to any other point. XPath simply describes how to climb the tree in a series of steps to arrive at a destination.

By the way, we will be slipping into some tree-ish terminology throughout the chapter. It's assumed you read the quick introduction to trees in Chapter 2. If you hear me talking about ancestors and siblings and have no idea what that has to do with XML, go back and refresh your vocabulary.

Each step in a path touches a branching or terminal point in the tree called a node. In keeping with the arboreal terminology, a terminal node (one with no descendants) is sometimes called a leaf. In XPath, there are seven different kinds of nodes:
- Root
-
The root of the document is a special kind of node. It's not an element, as you might think, but rather it contains the document element. It also contains any comments or processing instructions that surround the document element.
- Element
-
Elements and the root node share a special property among nodes: they alone can contain other nodes. An element node can contain other elements, plus any other node type except the root node. In a tree, it would be the point where two branches meet. Or, if it is an empty element, it would be a leaf node.
- Attribute
-
For simplicity's sake, XPath treats attributes as separate nodes from their element hosts. This allows you to select the element as a whole, or merely the attribute in that element, using the same path syntax. An attribute is like an element that contains only text.
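The three node kinds described so far can be inspected with a short sketch. It relies on Python's xml.dom.minidom, so it shows the DOM's view of a document, which is close to but not identical to the XPath data model. The sample document is invented for illustration.

```python
from xml.dom import minidom, Node

# Hypothetical sample document, invented for this sketch.
DOC = '<!-- draft --><?render fast?><memo id="m1"><to>Sam</to></memo>'

doc = minidom.parseString(DOC)

KIND = {
    Node.COMMENT_NODE: "comment",
    Node.PROCESSING_INSTRUCTION_NODE: "processing instruction",
    Node.ELEMENT_NODE: "element",
}

# The root (the DOM document node) holds the comment, the processing
# instruction, and the document element as its children.
root_children = [KIND[n.nodeType] for n in doc.childNodes]

memo = doc.documentElement
# Child elements are child nodes; the attribute is reached separately.
memo_children = [n.nodeName for n in memo.childNodes]
id_attr = memo.getAttributeNode("id")

print(root_children, memo_children, id_attr.value)
```

Note that the id attribute does not show up among the children of memo: it is a node in its own right, attached to its host element rather than contained in it.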
- Finding Nodes
- There are still a few cultures on earth who can name their ancestors back ten generations or further. "Here is Sam, the son of Ben, the son of Andrew, the son of..." This chain of generations helps establish the identity of a person, showing that he or she is a member of such and such a clan or related to another person through some shared great-great-uncle.
XPath, too, uses chains of steps, except that they are steps in an XML tree rather than an actual family tree. The terms "child" and "parent" are still applicable. A location path is a chain of location steps that get you from one point in a document to another. If the path begins with an absolute position (say, the root node), then we call it an absolute path. Otherwise, it is called a relative path because it starts from a place not yet determined.
A location step has three parts: an axis that describes the direction to travel, a node test that specifies what kinds of nodes are applicable, and a set of optional predicates that use Boolean (true/false) tests to winnow down the candidates even further.
The axis is a keyword that specifies a direction you can travel from any node. You can go up through ancestors, down through descendants, or linearly through siblings. Table 6-1 lists all the types of node axes.
Table 6-1: Node axes
Axis type | Matches
Ancestor | All nodes above the context node, including the parent, grandparent, and so on up to the root node.
Ancestor-or-self | The ancestor node plus the context node.
- XPath Expressions
- Location paths are a subset of a more general concept called XPath expressions. These are statements that can extract useful information from the tree. Instead of just finding nodes, you can count them, add up numeric values, compare strings, and more. They are much like statements in a functional programming language. There are five types, listed here:
- Boolean
-
An expression type with two possible values, true and false.
- Node set
-
A collection of nodes that match an expression's criteria, usually derived with a location path.
- Number
-
A numeric value, useful for counting nodes and performing simple arithmetic.
- String
-
A fragment of text that may be from the input tree, processed or augmented with generated text.
- Result tree fragment
-
A temporary node tree that has its own root node but cannot be indexed into using location paths.
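To make the idea of moving between these types concrete, here is a minimal sketch of how XPath 1.0 converts a value to a Boolean. Python values stand in for XPath values (a list of nodes for a node set), the function name xpath_boolean is hypothetical, and result tree fragments are left out.

```python
import math

def xpath_boolean(value):
    """Approximate the XPath 1.0 conversion of a value to a Boolean."""
    if isinstance(value, bool):
        return value
    if isinstance(value, (int, float)):
        # A number is true unless it is zero or NaN.
        return not (value == 0 or math.isnan(value))
    if isinstance(value, str):
        # A string is true unless it is empty.
        return len(value) > 0
    if isinstance(value, list):
        # A node set (modeled here as a list) is true unless it is empty.
        return len(value) > 0
    raise TypeError("unsupported stand-in type: " + type(value).__name__)

print(xpath_boolean([]), xpath_boolean(["node"]), xpath_boolean("false"))
```

The last call is a common surprise: the string "false" is not empty, so it converts to true.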
In XPath, types are determined by context. An operator or function can transform one expression type into another as needed. For this reason, there are well-defined rules for how a value maps when it is converted to another type.
XPath has a rich set of operators and functions for working with each expression type. In the following sections, I will describe these and the rules for switching between types.
Boolean expressions have two values: true or false. As you saw with location step predicates, anything inside the brackets that does not result in a numerical value is forced into a Boolean context. There are other ways to coerce an expression to behave as Boolean. The function ...
- XPointer
- Closely related to XPath is the XML Pointer Language (XPointer). It uses XPath expressions to find points inside external parsed entities, as an extension to uniform resource identifiers (URIs). It could be used, for example, to create a link from one document to an element inside any other.
Originally designed as a component of the XML Linking Language (XLink), XPointer has become an important fragment identifier syntax in its own right. The XPointer Framework became a recommendation in 2003 along with the XPointer element( ) Scheme (allowing basic addressing of elements) and the XPointer xmlns( ) Scheme (incorporating namespaces). The xpointer( ) scheme itself is stuck at Working Draft, getting no further development.
An XPointer instance, which I'll just call an xpointer, works much like the fragment identifier in HTML (the part of a URL you sometimes see on the right side of a hash symbol). It's much more versatile than HTML's mechanism, however, as it can refer to any element or point inside text, not just to an anchor element (<a name="..."/>). By virtue of XPath, it has a few advantages over HTML fragment identifiers:
-
You can create a link to the target element itself, rather than to a proxy element (e.g., <a name="foo"/>).
-
You don't need to have anchors in the target document. You're free to link to any region in any document, whether the author knows about it or not.
-
The XPath language is flexible enough to reach any node in the target document.
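As a small illustration of linking to an element without any anchor in the target, the sketch below resolves the simplest form of the element( ) scheme, a child sequence such as element(/1/2/1), against a parsed document. It is only a partial sketch: ID-based pointers and the xpointer( ) scheme are not handled, and the function name, sample document, and URI are all hypothetical.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urldefrag

def resolve_child_sequence(root, pointer):
    """Resolve an element() pointer made only of a child sequence."""
    if not (pointer.startswith("element(/") and pointer.endswith(")")):
        raise ValueError("only element(/n/n...) child sequences are handled")
    steps = [int(s) for s in pointer[len("element("):-1].split("/")[1:]]
    if steps[0] != 1:
        return None  # a document has exactly one document element
    node = root
    for n in steps[1:]:
        children = list(node)  # child elements only, in document order
        if n < 1 or n > len(children):
            return None
        node = children[n - 1]
    return node

# Hypothetical target document and URI, invented for this sketch.
DOC = ("<manual><parts-list><part label='A'/></parts-list>"
       "<instructions><step>Glue</step><step>Affix</step></instructions>"
       "</manual>")
uri = "http://example.com/manual.xml#element(/1/2/1)"

_, fragment = urldefrag(uri)
target = resolve_child_sequence(ET.fromstring(DOC), fragment)
print(target.tag, target.text)
```

Here /1/2/1 reads as: the document element, then its second child element, then that element's first child element.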
XPointer actually goes further than XPath. In addition to nodes, it has two new location types. A point is any place inside a document between two adjacent characters. Whereas XPath would only locate an entire text node, XPointer can be more granular and locate a spot in the middle of any sentence. The other type introduced by XPointer is a range ...
- Chapter 7: Transformation with XSLT
- Transformation is one of the most important and useful techniques for working with XML. To transform XML is to change its structure, its markup, and perhaps its content into another form. A transformed document may be subtly altered or profoundly changed. The process is carefully shaped with a configuration document variously called a stylesheet or transformation script.
There are many reasons to transform XML. Most often, it is to extend the reach of a document into new areas by converting it into a presentational format. Alternatively, you can use a transformation to alter the content, such as extracting a section, or adding a table of numbers together. It can even be used to filter an XML document to change it in very small ways, such as inserting an attribute into a particular kind of element.
Some uses of transformation are:
-
Changing a non-presentational application such as DocBook into HTML for display as web pages.
-
Formatting a document to create a high-quality presentational format like PDF, through the XSL-FO path.
-
Changing one XML vocabulary to another, for example transforming an organization-specific invoice format into a cross-industry format.
-
Extracting specific pieces of information and formatting them in another way, such as constructing a table of contents from section titles.
-
Changing an instance of XML into text, such as transforming an XML data file into a comma-delimited file that you can import into Excel as a spreadsheet.
-
Reformatting or generating content. For example, numeric values can be massaged to turn integers into floating point numbers or Roman numerals as a way to create your own numbered lists or section heads.
-
Polishing a rough document to fix common mistakes or remove unneeded markup, preparing it for later processing.
- History
- In the very early days of markup languages, the only way to transform documents was by writing a custom software application to do it. Before SGML and XML, this was excruciating at best. Presentational markup is quite difficult to interpret in any way other than the device-dependent behavior it encodes.
SGML made it much easier for applications to manipulate documents. However, any transformation process was tied to a particular programming platform, making it difficult to share with others. The SGML community really needed a portable language specifically designed to handle SGML transformations, and which supported the nuances of print publishing (the major use of SGML at the time). The first solution to address these needs was the Document Style Semantics and Specification Language (DSSSL).
DSSSL (pronounced "dissel") was completed in 1996 under the auspices of the ISO working group for Document Description and Processing Languages. It laid out the fundamental rules for describing the parts of a formatted document that inspired later efforts including XSL and CSS. Concepts such as bounding boxes and font properties are painstakingly defined here.
If you look at a DSSSL script, you'll see that it is a no-fooling-around programming language. The syntax is Scheme, a dialect of Lisp. You have to be a pretty good programmer to be able to work with it, and the parentheses might drive some to distraction. There is really nothing you can't do with DSSSL, but for most transformations, it may be overly complex. I certainly don't miss it.
As XML gained prominence, the early adopters and developers began to map out a strategy for high-quality formatting. They looked at DSSSL and decided it suffered from the same problems as SGML: too big, too hard to learn, not easy to implement. James Clark, a pioneer on the text processing frontier who was instrumental in DSSSL development, took what he had learned and began to work on a slimmed-down successor. Thus was born the Extensible Stylesheet Language (XSL).
- Concepts
- Before we jump into specifics, I want to explain some important concepts that will help you understand how XSLT works. An XSLT processor (I'll call it an XSLT engine) takes two things as input: an XSLT stylesheet to govern the transformation process and an input document called the source tree. The output is called the result tree.
The XSLT stylesheet controls the transformation process. While it is usually called a stylesheet, it is not necessarily used to apply style. This is just a term inherited from the original intention of using XSLT to construct XSL-FO trees. Since XSLT is used for many other purposes, it may be better to call it an XSLT script or transformation document, but I will stick with the convention to avoid confusion.
The XSLT processor is a state engine. That is, at any point in time, it has a state, and there are rules to drive processing forward based on the state. The state consists of defined variables plus a set of context nodes, the nodes that are next in line for processing. The process is recursive, meaning that for each node processed, there may be children that also need processing. In that case, the current context node set is temporarily shelved until the recursion has completed.
The XSLT engine begins by reading in the XSLT stylesheet and caching it as a look-up table. For each node it processes, it will look in the table for the best matching rule to apply. The rule specifies what to output to build its part of the result tree, and also how to continue processing. Starting from the root node, the XSLT engine finds rules, executes them, and continues until there are no more nodes in its context node set to work with. At that point, processing is complete and the XSLT engine outputs the result document.
Let us now look at an example. Consider the document in Example 7-1.
Example 7-1. Instruction guide for a model rocket
<manual type="assembly" id="model-rocket">
  <parts-list>
    <part label="A" count="1">fuselage, left half</part>
    <part label="B" count="1">fuselage, right half</part>
    <part label="F" count="4">steering fin</part>
    <part label="N" count="3">rocket nozzle</part>
    <part label="C" count="1">crew capsule</part>
  </parts-list>
  <instructions>
    <step>
      Glue <part ref="A"/> and <part ref="B"/> together to form the
      fuselage.
    </step>
    <step>
      For each <part ref="F"/>, apply glue and insert it into slots in the
      fuselage.
    </step>
    <step>
      Affix <part ref="N"/> to the fuselage bottom with a small amount of
      glue.
    </step>
    <step>
      Connect <part ref="C"/> to the top of the fuselage. Do not use
      any glue, as it is spring-loaded to detach from the fuselage.
    </step>
  </instructions>
</manual>
- Running Transformations
- There are several strategies for performing a transformation, depending on your needs. If you want a transformed document for your own use, you could run a program such as Saxon to transform it on your local system. With web documents, the transformation is performed either on the server side or the client side. Some web servers can detect a stylesheet declaration and transform the document as it's being served out. Another possibility is to send the source document to the client to perform the transformation. Internet Explorer 5.0 was the first browser to implement XSLT, opening the door to this procedure. Which method you choose depends on various factors such as how often the data changes, what kind of load your server can handle, and whether there is some benefit to giving the user your source XML files.
If the transformation will be done by the web server or client, you must include a reference to the stylesheet in the document as a processing instruction, similar to the one used to associate documents with CSS stylesheets. It should look like this:
<?xml-stylesheet type="text/xml" href="mytrans.xsl"?>
The type attribute is a MIME type. The attribute href points to the location of the stylesheet.
- The stylesheet Element
- As mentioned before, XSLT is an XML application, so stylesheets are XML documents. The document element is stylesheet, although you are also allowed to use transform if the term stylesheet bugs you. This element is where you must declare the XSLT's namespace and version. The namespace identifier is http://www.w3.org/1999/XSL/Transform. Both the namespace and version attributes are required.
XSLT can be extended by the implementer to perform special functions not contained in the specification. For example, you can add a feature to redirect output to multiple files. These extensions are identified by a separate namespace that you must declare if you want to use them. And, just to make things clear for the XSLT engine, you should set the attribute extension-element-prefixes to contain the namespace prefixes of extensions.
As an example, consider the stylesheet element below. It declares namespaces for XSLT control elements (prefix xsl) and implementation-specific elements (prefix ext). Finally, it specifies the version 1.0 of XSLT in the last attribute.
<xsl:stylesheet
  xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
  xmlns:ext="http://www.myxslt.org/extentions"
  extension-element-prefixes="ext"
  version="1.0"
>
The namespace, represented here by the prefix xsl, is used by the transformation processor to determine which elements control the process. Any elements or attributes that are in neither that namespace nor the extensions namespace will be interpreted as data to be output in the result tree.
- Templates
- XSLT stylesheets are collections of templates. Each template associates a condition (e.g., an element in the source tree with a particular attribute) with a mixture of output data and instructions. These instructions refine and redirect processing, extending the simple matching mechanism to give you full control over the transformation.
A template does three things. First, it matches a class of node. The match attribute holds an XSLT pattern which, much like an XPath expression, matches nodes. When an XSLT processor is told to apply templates to a particular node, the processor runs through all the templates in the stylesheet and tests whether the node matches the template's pattern. All the templates that match this node are candidates for processing, and the XSLT processor must select one.
Second, the template contributes a priority value to help the processor decide which among eligible templates is the best to use. The template that matches the current node with the highest import precedence, or highest priority, is the one that will be used to process it. Different factors contribute to this priority. A template with more specific information will overrule one that is more generic. For example, one template may match all elements with the XPath expression *. Another may match a specific element, while a third matches that element and further requires an attribute. Alternatively, a template can simply state its precedence to the processor using a priority attribute. This is useful when you want to force a template to be used where otherwise it would be overlooked.
The third role of a template is to specify the structure of the result tree. The template's content actually contains the elements and character data to be output in the result tree. So it is often possible to see, at a glance, how the result tree will look. XSLT elements interspersed throughout this content direct the processing to other templates.
This model for scripting a transformation has strong benefits. Templates are (usually) compact pieces of code that are easy to read and manage, like functions in a programming language. The ...
- Formatting
- Since XSLT was originally intended for producing human-readable formatted documents, and not just as a general transformation tool, it comes with a decent supply of formatting capabilities.
A global setting you may want to include in your stylesheet is the <output> element. It controls how the XSLT engine constructs the result tree by forcing start and end tags, handling whitespace in a certain way, and so on. It is a top-level element that should reside outside of any template.
Three choices are provided: XML, HTML, and text. The default output type, XML, is simple: whitespace and predefined entities are handled exactly the same in the result tree as in the input tree, so there are no surprises when you look at the output. If your result document will be an application of XML, place this directive in your stylesheet:
<xsl:output method="xml"/>
HTML is a special case necessary for older browsers that do not understand some of the new syntax required by XML. It is unlikely you will need to use this mode instead of XML (for XHTML); nevertheless it is here if you need it. The exact output conforms to HTML version 4.0. Empty elements will not contain a slash at the end and processing instructions will contain only one question mark. So in this mode, the XSLT engine will not generate well-formed XML.
Text mode is useful for generating non-XML output. For example, you may want to dump a document to plain text with all the tags stripped out. Or you may want to output to a format such as troff or TeX. In this mode, the XSLT engine is required to resolve all character entities rather than keep them as references. It also handles whitespace differently, preserving all newlines and indentation.
XPath introduced the notion of a node's string value. All the text in an element is assembled into a string and that is what you get. So in this element: ...
- Chapter 8: Presentation Part II: XSL-FO
- XSL-FO rounds out the trio of standards that make up XSL. The FO stands for Formatting Objects, which are containers of content that preserve structure and associate presentational information. XSLT prepares a document for formatting using XPath to disassemble it and produce an XSL-FO temporary file to drive a formatter.
Under a W3C charter, the XSL Working Group started to design a high-powered formatting language in 1998. XML was still new, but XSL was understood early on to be a factor in making it useful. The group split its efforts on two related technologies, a language for transformations (XSLT) and another for formatting (XSL-FO). XSLT, the first to become a recommendation in 1999, demonstrated itself to be generally useful even outside of publishing applications. XSL followed as a recommendation in 2001.
Cascading Style Sheets, a jewel in the crown of the W3C, had been around for a few years and was a strong influence on the development of XSL. Its simple but numerous property statements make it easy to learn. You will see that quite a few of these properties have been imported into XSL-FO. The CSS box model, elegant and powerful, is the basis for XSL-FO's area model. Mostly what has been added to XSL-FO are semantics for handling page layout and complex writing systems.
The principal advantages of XSL over CSS are:
-
Print-specific semantics such as page breaking. CSS happens to be moving in this direction too, so this distinction is less important.
-
Out-of-order processing. With CSS, all the elements in a document are processed in order from start to finish. It provides no means of selecting some sections and rejecting others, of pulling information from various parts of a document, or of processing the same element more than once.
-
XML syntax. CSS uses its own syntax, making it harder to process.
XSL is not always preferable to CSS. You would choose CSS when you want to keep formatting simple and fast. Most web browsers have CSS processing built in, but none can do anything with XSL-FO. XSL-FO processing is also very resource intensive. You would probably not want to push the burden of formatting onto a fickle and impatient user when it could be done ahead of time on your server.
- How It Works
- Unlike CSS, which simply applies styles directly to elements in a single pass through the document, XSL gives you the opportunity to do major reorganization of your document. This capability comes at the cost of simplicity. To ease the burden on developers, the designers of XSL have split the process into two parts: transformation and formatting.
The transformation alters the structure of the input document and adds presentational information in a hybrid format composed of formatting objects. A formatting object is a container for content that associates styles and rendering instructions with the content. It is compact and easy for a person to understand. The formatting objects are arranged in a tree that retains structure used in building the final presentation.
The result of this transformation is a temporary file that you feed to an XSL-FO formatter. Through a complex series of steps, the formatter calculates the final geometry and appearance of the presentation and churns out a file suitable for printing or viewing on a screen. When it's finished, the formatting objects are flushed from memory and you can discard the temporary FO file. It is important to understand that you are not meant to write your own XSL-FO markup. Make all of your stylistic corrections in the XSLT stylesheet and let the tools do the rest.
Inside the formatter a complex operation in multiple phases takes place, illustrated in Figure 8-1. We start with a result tree (recall from Chapter 7 that a result tree is the product of an XSLT transformation) in the XSL-FO namespace. In phase one of formatting, the formatter translates this document into an object representation in memory in a process called objectification. This structure, a formatting object tree, is structurally similar to the result tree, but with some details changed. For example, all the character data will be replaced with fo:character node objects. By making the tree more verbose like this, later processing will run more efficiently.
- A Quick Example
- To give you a good overview of the whole process, let us take a look at a short, quick example. This humble XML document will be the source:
<mydoc>
  <title>Hello world!</title>
  <message>I am <emphasis>so</emphasis> clever.</message>
</mydoc>
The first step in using XSL-FO is to write an XSLT stylesheet that will generate a formatting object tree. Example 8-1 is a very simple (for XSL-FO) stylesheet. There are five templates in all. The first creates a page master, an archetype of real pages that will be created as text is poured in, setting up the geometry of content regions. The second template associates a flow object with the page master. The flow is like a baggage handler, throwing suitcases into a compact space that fits the geometry set up in the page master. The rest of the templates create blocks and inlines to be stacked inside the flow.
Example 8-1. An XSLT stylesheet to turn mydoc into a formatting object tree
<?xml version="1.0"?>
<xsl:stylesheet
  xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
  xmlns:fo="http://www.w3.org/1999/XSL/Format"
  version="1.0"
>
  <xsl:output method="xml"/>

  <!-- The root node, where we set up a page with a single region.
       <layout-master-set> may contain many page masters, but here
       we have defined only one. <simple-page-master> sets up a basic
       page type with width and height dimensions, margins, and a name
       to reference later with a flow.
  -->
  <xsl:template match="/">
    <fo:root>
      <fo:layout-master-set>
        <fo:simple-page-master master-name="the-only-page-type"
            page-height="4in" page-width="4in"
            margin-top="0.5in" margin-bottom="0.5in"
            margin-left="0.5in" margin-right="0.5in">
          <fo:region-body/>
        </fo:simple-page-master>
      </fo:layout-master-set>
      <xsl:apply-templates/>
    </fo:root>
  </xsl:template>

  <!-- The first block element, where we insert the document flow.
       <page-sequence> sets up an instance of the page type we defined
       above. <flow> contains all the stackable block objects, shaping
       them so they fit in the page region we defined. The flow contains
       a block that defines font name, size, text alignment, and
       surrounds its content in a 0.25 inch buffer of padding.
  -->
  <xsl:template match="mydoc">
    <fo:page-sequence master-reference="the-only-page-type">
      <fo:flow flow-name="xsl-region-body">
        <fo:block
            font-family="helvetica, sans-serif"
            font-size="24pt"
            text-align="center"
            padding="0.25in"
        >
          <xsl:apply-templates/>
        </fo:block>
      </fo:flow>
    </fo:page-sequence>
  </xsl:template>

  <!-- The second block element, a title, is bold, 10 point type, and
       inserts 1 em of space below itself.
  -->
  <xsl:template match="title">
    <fo:block
        font-weight="bold"
        font-size="10pt"
        space-after="1em"
    >
      <xsl:apply-templates/>
    </fo:block>
  </xsl:template>

  <!-- The last block element, a message body element. The padding
       is set to 0.25 inches and the border is visible.
  -->
  <xsl:template match="message">
    <fo:block
        padding="0.25in"
        border="solid 1pt black"
    >
      <xsl:apply-templates/>
    </fo:block>
  </xsl:template>

  <!-- The inline emphasis element is set to be italic. -->
  <xsl:template match="emphasis">
    <fo:inline font-style="italic">
      <xsl:apply-templates/>
    </fo:inline>
  </xsl:template>

</xsl:stylesheet>
- The Area Model
- In XSL all the positioning, shaping, and spacing of elements on the page takes place inside of areas. An area is an abstract framework used to represent a piece of the formatted document. It contains all the geometric and stylistic information needed to position and render its children correctly. Areas are nested, leading to the concept of an area tree. From the root node, all the way down to the areas containing glyphs and images, a formatted document is completely described by an area tree.
A formatting object tree produces an area tree the way an architectural model leads to a final set of blueprints. Strictly speaking, there is not a one-to-one mapping between formatting objects and areas. An FO can create zero or more areas, and each area is usually produced by a single FO. There are exceptions, however, as in the case of ligatures, which are single glyphs created through the contribution of two or more character objects.
Associated with an area is a collection of details that completely describe its geometry and rendering. These traits, as they are called, are derived either directly from formatting object properties or indirectly as a result of calculations involving other traits. These traits are the final, precise data that drive the rendering process in the formatter.
Areas are divided into two types: block and inline. Blocks and inlines have been described thoroughly in previous chapters and they behave essentially the same here. An area may have block area children or inline area children but never both. One common subtype of block areas is the line area, the children of which are all inline areas (for example, a traditional paragraph). A subtype of inline area is the glyph area, a leaf in the area tree containing a single glyph image as its content.
The area model of XSL is strongly reminiscent of the CSS box model. In fact, areas are a superset of CSS boxes. Areas are somewhat more general, allowing for alternative writing modes that require generic terminology like "start" and "before" instead of "top" and "left." Still, you will find XSL's areas very familiar if you've worked with CSS before. Figure 8-3 shows the area model and its components.
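One way to picture the area tree is as a small data structure. The sketch below is a toy model, not how any real formatter represents areas: the Area class, its field names, and the trait keys are all invented for illustration. It captures only the nesting of areas and the rule that an area holds block children or inline children but never both.

```python
from dataclasses import dataclass, field

@dataclass
class Area:
    kind: str                                   # "block" or "inline"
    traits: dict = field(default_factory=dict)  # geometry and rendering details
    children: list = field(default_factory=list)

    def add(self, child):
        # All children of one area must be of the same kind.
        if self.children and self.children[0].kind != child.kind:
            raise ValueError("block and inline children cannot be mixed")
        self.children.append(child)
        return child

# A block area holding one line area, whose children are glyph areas.
paragraph = Area("block")
line = paragraph.add(Area("block", {"subtype": "line"}))
for ch in "Hi":
    line.add(Area("inline", {"subtype": "glyph", "glyph": ch}))

try:
    line.add(Area("block"))  # mixing kinds is rejected
    mixed_allowed = True
except ValueError:
    mixed_allowed = False

print(len(line.children), mixed_allowed)
```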
- Formatting Objects
- Formatting objects (FOs) are the building blocks of the transformation result tree that drives the formatter. They are compact containers of content and style with all the information necessary to generate a presentable formatted document.
There are two major kinds of FO. Flow objects create areas and appear inside flows. (A flow is a continuous stream of text that may be broken across pages.) Layout objects, or auxiliary objects, help produce areas by contributing parameters.
A block object creates a region of content to be inserted into a flow, so it qualifies as a flow object. In contrast, the initial-page-number FO resets the count of page numbering. Since it only contributes some information to aid in processing, rather than create regions on its own, it is a layout object.
An FO document structure is a tree, like any other XML document. Every element in it is a formatting object, so we call it an FO tree. The root of this tree is a root element. Its children include:
- layout-master-set
-
This element contains page layout descriptions.
- declarations
-
Optional, this element contains global settings that will affect overall formatting.
- page-sequence
-
One or more of these elements contain flow objects that hold the content of the document.
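The skeleton described by this list can be assembled programmatically. The following sketch builds a minimal FO tree with Python's xml.etree.ElementTree, using the XSL-FO namespace and element names that appear in Example 8-1. The helper name fo and the master name only-page are arbitrary choices for this example, and the optional declarations element is left out.

```python
import xml.etree.ElementTree as ET

FO_NS = "http://www.w3.org/1999/XSL/Format"
ET.register_namespace("fo", FO_NS)

def fo(tag):
    """Qualify a formatting object name with the XSL-FO namespace."""
    return "{%s}%s" % (FO_NS, tag)

root = ET.Element(fo("root"))

# Page layout descriptions live in layout-master-set.
masters = ET.SubElement(root, fo("layout-master-set"))
page = ET.SubElement(masters, fo("simple-page-master"),
                     {"master-name": "only-page"})
ET.SubElement(page, fo("region-body"))

# Content lives in a page-sequence, inside a flow.
sequence = ET.SubElement(root, fo("page-sequence"),
                         {"master-reference": "only-page"})
flow = ET.SubElement(sequence, fo("flow"), {"flow-name": "xsl-region-body"})
ET.SubElement(flow, fo("block")).text = "Hello world!"

children = [child.tag.split("}")[1] for child in root]
print(children)
print(ET.tostring(root, encoding="unicode"))
```

In practice this tree would be produced by an XSLT stylesheet rather than written by hand or by a script; the sketch is only meant to show how the pieces nest.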
In the coming sections, I will break down this structure further, starting with page layout. From there, we will move to flows, blocks, and finally inlines.
Contained in the layout-master-set object are specifications for pagination and layout. There are two types.
- An Example: TEI
- Let us take a break now from terminology and see an actual example. The document to format is a Shakespearean sonnet encoded in TEI-XML, a markup language for scholarly documents, shown in Example 8-2. It consists of a header section with title and other metadata, followed by the text itself, which is broken into individual lines of poetry.
Example 8-2. A TEI-XML document
<?xml version="1.0"?>
<!DOCTYPE TEI.2 SYSTEM "http://www.uic.edu/orgs/tei/lite/teixlite.dtd">
<TEI.2>

  <!-- The metadata. TEI has a rich vocabulary for describing a
       document, which is important for scholarly work.
  -->
  <teiHeader>
    <fileDesc>
      <titleStmt>
        <title>Shall I Compare Thee to a Summer's Day?</title>
        <author>William Shakespeare</author>
      </titleStmt>
      <publicationStmt>
        <p>
          Electronic version by Erik Ray 2003-03-09. This transcription
          is in the public domain.
        </p>
      </publicationStmt>
      <sourceDesc>
        <p>Shakespeare's Sonnets XVIII.</p>
      </sourceDesc>
    </fileDesc>
  </teiHeader>

  <!-- The body of the document, where the sonnet lives. <lg> is a
       group of lines, and <l> is a line of text.
  -->
  <text>
    <body>
      <lg>
        <l>Shall I compare thee to a summer's day?</l>
        <l>Thou art more lovely and more temperate:</l>
        <l>Rough winds do shake the darling buds of May,</l>
        <l>And summer's lease hath all too short a date:</l>
        <l>Sometime too hot the eye of heaven shines,</l>
        <l>And often is his gold complexion dimm'd;</l>
        <l>And every fair from fair sometime declines,</l>
        <l>By chance or nature's changing course untrimm'd;</l>
        <l>But thy eternal summer shall not fade</l>
        <l>Nor lose possession of that fair thou owest;</l>
        <l>Nor shall Death brag thou wander'st in his shade,</l>
        <l>When in eternal lines to time thou growest:</l>
        <l>So long as men can breathe or eyes can see,</l>
        <l>So long lives this and this gives life to thee.</l>
      </lg>
    </body>
  </text>
</TEI.2>
A typical TEI document consists of two parts: a metadata section in a ...
- A Bigger Example: DocBook
- The last example was relatively simple, lacking complex structures like lists, tables, and footnotes. It also was quite short, taking up only one page. Clearly, this lack of a challenge is an insult to any self-respecting XSL formatter. So let us set our sights on something more challenging.

Let us return to our old friend DocBook for inspiration. We will include support for most of the common elements you would find in technical prose, including some objects we have not covered yet, such as tables, lists, and footnotes. To flex the muscles of pagination, we will also explore the use of different page masters and conditional master sequences.

To save you from having to look at a huge, monolithic listing, I'll break up the stylesheet into manageable chunks interspersed with friendly narrative.

We will start by setting up the page masters. This is a very verbose piece of the stylesheet, since we want to create a page master for each type of page. I chose to cover these types:

- Lefthand (verso) page starting a chapter. It has an even page number, which we will use to identify it in the page sequence master. The layout will include a header that shows the chapter title and number and a body that is pushed down further than on a non-starting page.
- Righthand (recto) page starting a chapter. The main difference is that its page number is odd, and the header is right-justified instead of left-justified.
- Verso page that does not start a chapter. It has a plain header and a body that reaches up almost to the top.
- Recto page that does not start a chapter. Again, the distinctive layout involves right-justifying the header and footer.

Below is the stylesheet portion containing page master definitions. The template matching the root node in the source document calls a named template that sets up the page masters and also the page sequence masters.
- Chapter 9: Internationalization
- XML was built from the ground up to support information in many different languages. While you and your application may not understand markup and content in foreign languages, XML makes certain that XML-compliant tools let more of the world work in and share information created in whatever language they want.
- Character Encodings
- Throughout the book, I have treated characters as a sort of commodity, just something used to fill up documents. But understanding characters and how they are represented in documents is of great importance in XML. After all, characters are both the building material for markup and the cargo it was meant to carry.

Every XML document has a character encoding property. I'll give you a quick explanation now and a more complete description later. In a nutshell, it is the way the numerical values in files and streams are transformed into the symbols that you see on the screen. Encodings come in many different kinds, reflecting the cultural diversity of users, the capabilities of systems, and the inevitable cycle of progress and obsolescence.

Character encodings are probably the most confusing topic in the study of XML. Partly, this is because of a glut of acronyms and confusing names: UTF-8, UCS-4, Shift-JIS, and ISO-8859-1-Windows-3.1-Latin-1, to name a few. Also hampering our efforts to understand is the interchangeability of incompatible terms. Sometimes a character encoding is called a set, as in the MIME standard, which is incorrect and misleading.

In this section, I will try to explain the terms and concepts clearly, and describe some of the common character encodings in use by XML authors.

If you choose to experiment with the character encoding for your document, you will need to specify it in the XML declaration. For example:

    <?xml version="1.0" encoding="encoding-name"?>

encoding-name is a registered string corresponding to a formal character encoding. No distinction is made between uppercase and lowercase letters, but spaces are disallowed (use hyphens instead). Some examples are UTF-16, ISO-8859-1, and Shift_JIS.

A comprehensive list of encoding names is maintained by the Internet Assigned Numbers Authority (IANA), available on the Web at http://www.iana.org/assignments/character-sets. Many of these encoding names have aliases. The aliases for US-ASCII include [...]
- MIME and Media Types
- The Multipurpose Internet Mail Extensions (MIME) standard is a means of specifying media types such as images, program data, audio files, and text. Described in Internet Engineering Task Force (IETF) Request for Comments (RFC) documents 2045 through 2049, it includes a comprehensive list of known types and has inspired a registry for many more.

MIME was developed originally to extend the paradigm of email from plain text to a rich array of media. Email transport systems, such as the Simple Mail Transfer Protocol (SMTP), can only deal with 7-bit ASCII text. You cannot simply append a binary file to the end of a message and have it bounce happily across the Internet. The data has to be encoded in an ASCII-compatible way. There are other requirements as well, such as a maximum line length and the absence of certain control characters. MIME introduces methods to transform data into a safe form. It also describes how to package this data in a recognizable way for mail transfer agents and clients to work with.

One of the ways MIME describes a resource is by assigning it a media type (or content-type), which names the general category that best describes the data. Each type includes a set of subtypes that exactly identify the resource. The type and subtype are usually written together, joined by a slash character (/). For example, image/jpeg denotes a graphical resource in the JPEG format. The major types include:

- text
Textual information that can be read in a traditional text editor without any special processing. text/plain is as simple as you can get: just ASCII characters without any kind of formatting other than whitespace.
- image
Graphical data requiring some display device such as a printer or display terminal. image/gif is a popular image subtype on the Web.
- Specifying Human Languages
- Specifying a character encoding is crucial for correctly processing and displaying an XML document in a multilingual world. But there is a higher level to address than just the symbols on the page. Different languages may use the same characters. If a document is encoded with UTF-8, how can you know if it is speaking Vietnamese or Italian?

You may wonder why it matters if software should handle all documents the same way no matter what the language. The push for globalization is not a dream shared by everybody. Sure, we all want equal access to resources, but we would also like to keep our uniqueness intact. So many developers would love a way to know in advance what language a reader prefers to use, and have some automatic means to serve that preference.

XML and many related standards have included some devices to allow special handling based on language. You can use labels to create variations on a document and to customize its appearance and behavior. I will describe a few of the important mechanisms in this section.

XML defines the attribute xml:lang as a language label for any element. There is no official action that an XML processor must take when encountering this attribute, but we can imagine some future applications. For example, search engines could be designed to pay attention to the language of a document and use it to categorize its entries. The search interface could then provide a menu for languages to include or exclude in a search. Another use for xml:lang might be to combine several versions of a text in one document, each version labeled with a different language. A web browser could be set to ignore all but a particular language, filtering the document so that it displays only what the reader wants. Or, if you're writing a book that includes text in different languages, you could configure your spellchecker to use a different dictionary for each version.
- Chapter 10: Programming
- XML was designed to bridge the gap between humans and computers, making data easily grappled by both. If you aren't able to find an existing application to take care of your XML needs, you may find writing your own a good option.

XML has great possibilities for programmers. It is well suited to being read, written, and altered by software. Its syntax is straightforward and easy to parse. It has rules for being well-formed that reduce the amount of software error checking and exception handling required. It's well documented, and there are many tools and code libraries available for developers in just about every programming language. And as an open standard accepted by financial institutions and open source hackers alike, with support from virtually every popular programming language, XML stands a good chance of becoming the lingua franca for computer communications.

We begin the chapter by examining the issues around working with XML from a developer's point of view. From there, we move to common coding strategies and best practices. The two main methods, event streams and object trees, will be described. And finally, we visit the two reigning standards in XML programming: SAX and DOM. I will include examples in Java and Perl, my two favorite programming environments, both of which have excellent support for XML wrangling.
- Limitations
- Like any good technology, XML does not try to be the solution to every problem. There are some things it just cannot do well, and it would be foolish to try to force it to do them. The foremost requirement of XML is that it be universally accessible, a lowest common denominator for applications. This necessarily throws out many optimizations that are necessary in some situations.

Let's review some of these limitations:

- XML is not optimized for access speed. XML documents are meant to be completely loaded, and then used as a data source. The parser is required to do a syntax check every time it reads in the markup. In contrast, modern databases are optimized for quick data lookups and updates.
- XML is not compact. There is no official scheme for compressing XML. XML parsers expect uncompressed text. You either have to put up with large text files, or you have to create a complex mechanism for compressing and decompressing on the fly, which will add to your processing overhead. Most proprietary applications, like Microsoft Word and Adobe FrameMaker, save data in a binary format by default, saving disk space and perhaps speeding up file transfers. (HTTP offers compression for transmission, which helps reduce this cost.)
- Many kinds of data are not suited for embedded markup. XML is most useful for text data with a hierarchical structure. It does not offer much for binary data. For example, raster graphic images are long streams of binary data, unreadable to anyone until passed through a graphic viewing program. This binary data may contain dangerous characters that would need to be escaped. Binary data is optimized for size and speed of loading, two qualities that would be hindered by XML.
- XML may raise expectations too high. Quite often, software vendors tout XML support as a great new feature in their product, only to disappoint users with poor implementation. For example, the early version of Adobe FrameMaker's XML export capability was nearly unusable, as much of the data was structured badly, was missing information, changed figure filenames, and so on. Instead of viewing it as a magic bullet, developers should approach XML as a framework in which intelligent design focusing on the quality of markup structures can achieve magnificent results.
- Streams and Events
- The stream approach treats XML content as a pipeline. As it rushes past, you have one chance to work with it, no look-ahead or look-behind. It is fast and efficient, allowing you to work with enormous files in a short time, but depends on simple markup that closely follows the order of processing.

In programming jargon, a stream is a sequence of data chunks to be processed. A file, for example, is a sequence of characters (one or more bytes each, depending on the encoding). A program using this data can open a filehandle to the file, creating a character stream, and it can choose to read in data in chunks of whatever size it chooses. Streams can be dynamically generated too, whether from another program, received over a network, or typed in by a user. A stream is an abstraction, making the source of the data irrelevant for the purpose of processing.

To summarize, here are a stream's important qualities:

- It consists of a sequence of data fragments.
- The order of fragments transmitted is significant.
- The source of data (e.g., file or program output) is not important.

XML streams are more clumpy than character streams, which are just long sequences of characters. An XML stream emits a series of tokens or events, signals that denote changes in markup status. For example, an element has at least three events associated with it: the start tag, the content, and the end tag. The XML stream is constructed as it is read, so events happen in lexical order. The content of an element will always come after the start tag, and the end tag will follow that.

Parsers can assemble this kind of stream very quickly and efficiently thanks to XML's parser-friendly design. Other formats often require some look-ahead or complex lookup tables before processing can begin. For example, SGML does not have a rule requiring nonempty elements to have an end tag. To know when an element ends requires sophisticated reasoning by the parser, making code more complex, slowing down processing speed, and increasing memory usage.
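The event order described above can be observed with a small sketch. It is only an illustration, assuming the SAX support bundled with the Java platform (javax.xml.parsers and org.xml.sax); the class name EventTrace and the method trace are hypothetical names made up for this example.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper: records each stream event as a string, in arrival order.
public class EventTrace {

    static List<String> trace(String xml) throws Exception {
        final List<String> events = new ArrayList<>();
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void startElement(String uri, String local, String qName, Attributes attrs) {
                events.add("start " + qName);
            }

            @Override
            public void characters(char[] ch, int start, int length) {
                // A parser may deliver text in several chunks; skip whitespace-only ones.
                String text = new String(ch, start, length).trim();
                if (!text.isEmpty()) {
                    events.add("text " + text);
                }
            }

            @Override
            public void endElement(String uri, String local, String qName) {
                events.add("end " + qName);
            }
        };
        // The source is a string here, but any character stream would do.
        SAXParserFactory.newInstance().newSAXParser()
            .parse(new InputSource(new StringReader(xml)), handler);
        return events;
    }

    public static void main(String[] args) throws Exception {
        for (String event : trace("<item>toothpaste</item>")) {
            System.out.println(event);
        }
    }
}
```

The handler never looks ahead or behind: it sees the start tag first, then the content, then the end tag, and anything it wants to remember it has to store itself.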
- Trees and Objects
- The tree method is luxurious in comparison to streams. To use an analogy, think of a stream generator as a hose gushing out XML. A tree is that same XML frozen into an ice sculpture. You can peruse it at your leisure, returning to any point in the document when you need it. This structure requires more resources to build and store, so you will only want to use it when the stream method cannot help.

There are many reasons why a tree structure representing a piece of XML is a handy thing to have. Since a tree is acyclic (it has no circular links), you can use simple traversal methods that won't get stuck in infinite loops. Like a filesystem directory tree, you can represent the location of a node easily in simple shorthand. Like real trees, you can break a piece off and treat it like a smaller tree. Most important, you have all the information in one place for as long as you need it.

This persistence is the key reason for using trees. If you can live with the overhead of memory and time to construct the tree, then you will enjoy luxuries like being able to pull data from anywhere in the document at any point of the processing. With streams, you are forced to work with events as they arrive, perhaps storing bits of data for later use.

Tree processing is usually object-oriented. The data structure representing the document is composed of objects whose methods allow you to traverse in different directions, pull out data, or modify values. DOM, as we will see later in the chapter, is a standard that defines the interfaces of objects used to build document trees. Encapsulating XML data in objects is as natural as using markup, with as many benefits.
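To make the persistence argument concrete, here is a minimal sketch that builds a tree and then reads it out of document order. It assumes the DOM implementation that ships with the Java platform (javax.xml.parsers and org.w3c.dom); the class TreeLookup and its methods are hypothetical names used only for illustration.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper: loads a document into memory and visits nodes in any order.
public class TreeLookup {

    static Document load(String xml) throws Exception {
        return DocumentBuilderFactory.newInstance().newDocumentBuilder()
            .parse(new InputSource(new StringReader(xml)));
    }

    // Reads the last <item> before the first one, which a pure stream cannot do
    // without storing data itself.
    static String lastThenFirst(Document doc) {
        NodeList items = doc.getElementsByTagName("item");
        Element last = (Element) items.item(items.getLength() - 1);
        Element first = (Element) items.item(0);
        return last.getTextContent() + "," + first.getTextContent();
    }

    public static void main(String[] args) throws Exception {
        Document doc = load(
            "<shoppinglist><item>toothpaste</item><item>caviar</item></shoppinglist>");
        System.out.println(lastThenFirst(doc));
    }
}
```

The whole document stays in memory for as long as the Document object is referenced, which is exactly the overhead the text warns about.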
- Pull Parsing
- Tim Bray, lead editor of XML 1.0, calls pull parsing "the way to go in the future." Like event-based parsing, it's fast, memory efficient, streamable, and read-only. The difference is in how the application and parser interact. SAX implements what we call push parsing. The parser pushes events at the program, requiring it to react. The parser doesn't store any state information (contextual clues that would help in decisions for how to parse), so the application has to store this information itself.

Pull parsing is just the opposite. The program takes control and tells the parser when to fetch the next item. Instead of reacting to events, it proactively seeks out events. This allows the developer more freedom in designing data handlers, and greater ability to catch invalid markup. Consider the following example XML:

    <catalog>
      <product id="ronco-728">
        <name>Widget</name>
        <price>19.99</price>
      </product>
      <product id="acme-229">
        <name>Gizmo</name>
        <price>28.98</price>
      </product>
    </catalog>

It is easy to write a SAX program to read this XML and build a data structure. The following code assembles an array of products composed of instances of this class:

    class Product {
        String name;
        String price;
    }

Here is the code to do it:

    import org.xml.sax.Attributes;
    import org.xml.sax.SAXException;
    import org.xml.sax.helpers.DefaultHandler;

    public class CatalogHandler extends DefaultHandler {
        StringBuffer cdata = new StringBuffer();
        Product[] catalog = new Product[10];
        int index;

        public void startDocument() {
            index = 0;
        }

        // Note: "local" is only filled in by a namespace-aware parser.
        public void startElement( String uri, String local, String raw,
                                  Attributes attrs ) throws SAXException {
            cdata.setLength(0);                  // discard text left over from earlier elements
            if( "product".equals(local) ) {
                catalog[index] = new Product();  // create the record before filling it in
            }
        }

        public void characters( char ch[], int start, int length )
                throws SAXException {
            cdata.append( ch, start, length );
        }

        public void endElement( String uri, String local, String raw )
                throws SAXException {
            if( "product".equals(local) ) {
                index ++;
            } else if( "name".equals(local) ) {
                catalog[index].name = cdata.toString();
            } else if( "price".equals(local) ) {
                catalog[index].price = cdata.toString();
            } else if( ! "catalog".equals(local) ) {
                throw new SAXException( "Unexpected element: " + local );
            }
        }
    }
- Standard APIs
- Nowadays, most programs are written as layered components, where libraries provide functions or objects to take care of routine tasks like parsing and writing XML. An application programming interface (API) is a way of delegating routine work to a dedicated component. An automobile's human interface is a good example. It is essentially the same from car to car, with an ignition, steering wheel, gas and brake pedals, and so on. You do not have to know anything about the engine itself, such as how many cylinders are firing and in what order. Just as you never have to fire the spark plugs manually, you should never have to write your own XML parser, unless you want to do something really unusual.

Linking to another developer's parser is a good idea not just because it saves you work, but because it turns the parser into a commodity. By that I mean you can unplug one parser and plug in another. Or you could unplug a parser and plug in a driver from a database or some real-time source. XML does not have to come from files, after all. None of this would be possible, however, without the use of standard APIs.

This chapter will demonstrate a few examples of this. For event streams, the standard is SAX. DOM is a standard for object tree interfaces. Most programming languages have a few conforming implementations of each. When possible, it is always a good idea to use SAX or DOM.
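As a rough illustration of the "unplug one parser and plug in another" idea, the sketch below is written only against the standard SAX interfaces, so it does not care which parser implementation it is handed or where the XML comes from. It assumes the SAX classes bundled with Java; ElementCounter and countElements are hypothetical names invented for this example.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper: depends only on the XMLReader interface, not on a specific parser.
public class ElementCounter {

    static int countElements(XMLReader reader, InputSource source) throws Exception {
        final int[] count = {0};
        reader.setContentHandler(new DefaultHandler() {
            @Override
            public void startElement(String uri, String local, String qName, Attributes attrs) {
                count[0]++;
            }
        });
        reader.parse(source);
        return count[0];
    }

    public static void main(String[] args) throws Exception {
        // The factory picks whatever SAX parser is installed; the counting code is unchanged.
        XMLReader reader = SAXParserFactory.newInstance().newSAXParser().getXMLReader();
        // The XML comes from a string rather than a file.
        InputSource source = new InputSource(new StringReader("<a><b/><b/></a>"));
        System.out.println(reader.getClass().getName() + " saw "
            + countElements(reader, source) + " elements");
    }
}
```

Any object implementing XMLReader could be passed in, including one that generates events from a database or another non-file source.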
- Choosing a Parser
- After choosing your programming strategy (SAX, DOM, XMLPULL, etc.), the next step in writing an XML application is to select a parser. There is no reason to write your own parser when so many excellent ones already exist. Some qualities to look for are API support, speed and efficiency, and robustness. Table 10-1 lists some of the best, although there are so many out there today that I could not hope to list them all.

Table 10-1: Some popular XML parsers

Name | Language | APIs | Web Site
Expat | C, Perl (via XML::Parser module), Python (via xml.parsers.expat) | Low-level stream parser | http://www.jclark.com/xml/expat.html
XP | Java | Low-level stream parser | http://www.jclark.com/xml/xp/
libxml2 | C++, Perl (via XML::LibXML module) | DTD validation, SAX (minimal), DOM2 (core, need gdome2 library for the API), XPath, Relax NG, XML Schemas (data types) | http://xmlsoft.org/
Xerces2 | Java | [...]
- PYX
- PYX is an early XML stream solution that converts XML into character data compatible with text applications like grep, awk, and sed. Its name reflects the fact that it was the first XML solution in the programming language Python. XML events are separated by newline characters, fitting nicely into the line-oriented paradigm of many Unix programs. Table 10-2 summarizes the notation of PYX.

Table 10-2: PYX notation

Symbol   Represents
(        An element start tag
)        An element end tag
-        Character data
A        An attribute
?        A processing instruction

For every event coming through the stream, PYX starts a new line, beginning with one of the five event symbols. The symbol is followed by the element name or whatever other data is pertinent. Special characters are escaped with a backslash, as you would see in Unix shell or Perl code.

Here's what the output of a parser converting an XML document into PYX notation looks like. The following XML is the parser's input:

    <shoppinglist>
      <!-- brand is not important -->
      <item>toothpaste</item>
      <item>rocket engine</item>
      <item optional="yes">caviar</item>
    </shoppinglist>

As PYX, it would look like this:

    (shoppinglist
    -\n
    (item
    -toothpaste
    )item
    -\n
    (item
    -rocket engine
    )item
    -\n
    (item
    Aoptional yes
    -caviar
    )item
    -\n
    )shoppinglist
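The notation is simple enough that a converter fits in a few lines. The following is a sketch only, not the original PYX tool: it assumes Java's bundled SAX parser as the event source, and the class PyxWriter, its escape helper, and toPyx are hypothetical names. Because a SAX content handler is not told about comments, the comment in the input disappears, just as in the listing above; whitespace between elements is passed through as the parser reports it, so indentation spaces may appear after the escaped newline.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical converter: writes one PYX line per SAX event.
public class PyxWriter extends DefaultHandler {
    private final StringBuilder out = new StringBuilder();

    // Escape backslashes first, then newlines and tabs, so each event stays on one line.
    private static String escape(String s) {
        return s.replace("\\", "\\\\").replace("\n", "\\n").replace("\t", "\\t");
    }

    @Override
    public void startElement(String uri, String local, String qName, Attributes attrs) {
        out.append('(').append(qName).append('\n');
        for (int i = 0; i < attrs.getLength(); i++) {
            out.append('A').append(attrs.getQName(i)).append(' ')
               .append(escape(attrs.getValue(i))).append('\n');
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        out.append('-').append(escape(new String(ch, start, length))).append('\n');
    }

    @Override
    public void endElement(String uri, String local, String qName) {
        out.append(')').append(qName).append('\n');
    }

    @Override
    public void processingInstruction(String target, String data) {
        out.append('?').append(target).append(' ')
           .append(data == null ? "" : escape(data)).append('\n');
    }

    static String toPyx(String xml) throws Exception {
        PyxWriter writer = new PyxWriter();
        SAXParserFactory.newInstance().newSAXParser()
            .parse(new InputSource(new StringReader(xml)), writer);
        return writer.out.toString();
    }

    public static void main(String[] args) throws Exception {
        System.out.print(toPyx("<item optional=\"yes\">caviar</item>"));
    }
}
```

Piping the output of such a program into grep or sed is then an ordinary line-oriented text job.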
- SAX
- The Simple API for XML (SAX) is one of the first and currently the most popular method for working with XML data. It evolved from discussions on the XML-DEV mailing list and, shepherded by David Megginson, was quickly shaped into a useful specification.

The first incarnation, called SAX Level 1 (or just SAX1), supports elements, attributes, and processing instructions. It doesn't handle some other things like namespaces or CDATA sections, so the second iteration, SAX2, was devised, adding support for just about any event you can imagine in generic XML. Since there's no good reason not to use SAX2, you can assume that SAX2 is what we are talking about when we say "SAX."

SAX was originally developed in Java in a package called org.xml.sax. As a consequence, most of the literature about SAX is Java-centric and assumes that is the environment you will be working in. Furthermore, there is no formal specification for SAX in any programming language but Java. Analogs in other languages exist, such as XML::SAX in Perl, but they are not bound by the official SAX description. Really they are just whatever their developer community thinks they should be.

David Megginson has made SAX public domain and has allowed anyone to use the name. An unfortunate consequence is that many implementations are really just "flavors" of SAX and do not match in every detail. This is especially true for SAX in other programming languages where the notion of strict compliance would not even make sense. This is kind of like the plethora of Unix flavors out today; they seem much alike, but have some big differences under the surface.

SAX describes a universal interface that any SAX-aware program can use, no matter where the data is coming from. Figure 10-1 shows how this works. Your program is at the right. It contacts the ParserFactory object to request a parser that will serve up a stream of SAX events. The factory finds a parser and starts it running, routing the SAX stream to your program through the interface.
- DOM
- DOM is a recommendation by the World Wide Web Consortium (W3C). Designed to be a language-neutral interface to an in-memory representation of an XML document, versions of DOM are available in Java, ECMAScript, Perl, and other languages.

While SAX defines an interface of handler methods, the DOM specification calls for a number of classes, each with an interface of methods that affect a particular type of XML markup. Thus, every object instance manages a portion of the document tree, providing accessor methods to add, remove, or modify nodes and data. These objects are typically created by a factory object, making it a little easier for programmers who only have to initialize the factory object themselves.

In DOM, every piece of XML (the element, text, comment, etc.) is a node represented by a node object. The Node class is extended by more specific classes that represent the types of XML markup, including Element, Attr (attribute), ProcessingInstruction, Comment, EntityReference, Text, CDATASection, and Document. These classes are the building blocks of every XML tree in DOM.

The standard also calls for a couple of classes that serve as containers for nodes, convenient for shuttling XML fragments from place to place. These classes are NodeList, an ordered list of nodes, like all the children of an element; and NamedNodeMap, an unordered set of nodes. These objects are frequently required as arguments or given as return values from methods. Note that these objects are all live, meaning that any changes done to them will immediately affect the nodes in the document itself, rather than a copy.

When naming these classes and their methods, DOM merely specifies the outward appearance of an implementation and leaves the internal specifics up to the developer. Particulars like memory management, data structures, and algorithms are not addressed at all, as those issues may vary among programming languages and the needs of users. This is like describing a key so a locksmith can make a lock that it will fit into; you know the key will unlock the door, but you have no idea how it really works. Specifically, the outward appearance makes it easy to write extensions to legacy modules so they can comply with the standard, but it does not guarantee efficiency or speed.
- Other Options
As XML has spread, more and more people have had creative (and often useful) ideas about how to process it.
The XPath language provides a convenient method to specify which nodes to return in a tree context. A parser written as a hybrid will only need to return a list of nodes that match an XPath expression. A stream parser efficiently searches through the document to find the nodes, then passes the locations to a tree builder that assembles them into object trees. XPath's advantage is that it is a very rich language for specifying nodes, giving the developer a lot of control and flexibility. The parsers libxml2 and MSXML are two that come with XPath interfaces.
Despite the name, JDOM is not merely a Java implementation of DOM. Rather, it is an alternative to SAX and DOM that is described by its developers as "lightweight and fast . . . optimized for the Java programmer." It doesn't actually replace other parsers, but uses them to build object representations of documents with an interface that is easy to manipulate. It is designed to integrate with SAX and DOM, supplying a simple and useful interface layer on top.
The proponents of JDOM say it is needed to reduce the complexity of the factory-based specifications for SAX and DOM. For that reason, the JDOM specification itself is defined with classes and not interfaces. In addition to substituting its own new API, JDOM includes the fabulous XPath API.
If streams and trees are the two extremes on a spectrum of XML processing techniques, then the middle ground is home to solutions we might call hybrids. They combine the best of both worlds, the low resource overhead of streams with the convenience of a tree structure, by switching between the two modes as necessary. The idea is, if you are only interested in working with a small slice of a document and can safely ignore the rest, then you only need to work with a subtree. The parser scans through the stream until it sees the part that you want, then switches to tree building mode.
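As a rough illustration of the hybrid idea described above, the following Python sketch uses iterparse from the standard library's xml.etree.ElementTree module, which reads a document as a stream but hands back a complete subtree whenever an element closes. This is only one possible way to get the effect, not the approach of any parser named in the text. The sample document, its element names, and the titles_of_books function are invented for this example.

```python
import io
import xml.etree.ElementTree as ET

# Hypothetical sample document; the element names are made up for illustration.
CATALOG = b"""<catalog>
  <book id="b1"><title>Learning XML</title><price>32.00</price></book>
  <book id="b2"><title>XML in a Nutshell</title><price>39.95</price></book>
  <note>Everything outside the book elements is ignored.</note>
</catalog>"""


def titles_of_books(source, wanted_tag="book"):
    """Stream through the document, working with a small tree only for each
    wanted element, then discard that subtree once it has been processed."""
    titles = []
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == wanted_tag:
            # At the "end" event the subtree for this element is complete,
            # so tree-style lookups (a limited XPath subset) work on it.
            titles.append((elem.get("id"), elem.findtext("title")))
            # Drop the subtree's children so memory use stays low.
            elem.clear()
    return titles


result = titles_of_books(io.BytesIO(CATALOG))
print(result)
```

The design choice here is to pay for tree building only on the slice of interest: the note element and anything else outside the book elements are never queried, and each book subtree is cleared as soon as it has been read.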
- Appendix A: Resources
The resources listed in this appendix can help you learn even more about XML.
Section A.1: Online
Section A.2: Books
Section A.3: Standards Organizations
Section A.4: Tools
Section A.5: Miscellaneous
- Online
- XML.com
-
The web site http://www.xml.com is one of the most complete and timely sources of XML information and news around. It should be on your weekly reading list if you are learning or using XML.
- XML.org
-
Sponsored by OASIS, http://www.xml.org has XML news and resources, including the XML Catalog, a guide to XML products and services.
- The XML Cover Pages
-
Edited by Robin Cover, http://xml.coverpages.org/ is one of the largest and most up-to-date lists of XML resources.
- Cafe Con Leche
-
Elliotte Rusty Harold provides almost daily news, along with a quote of the day, at http://ibiblio.org/xml.
- XMLHack
-
For programmers itching to work with XML, http://www.xmlhack.com is a good place to go for news on the latest developments in specifications and tools.
- DocBook
-
OASIS, the maintainers of DocBook, have a web page devoted to the XML application at http://www.docbook.org/. You can find the latest version and plenty of documentation here.
- A Tutorial on Character Code Issues
-
Jukka Korpela has assembled a huge amount of information related to character sets at http://www.cs.tut.fi/%7Ejkorpela/. The tutorial is well written and very interesting reading.
- XSL mailing list
-
Signing up with the XSL mailing list is a great way to keep up with the latest developments in XSL and XSLT tools and techniques. It's also a forum for asking questions and getting advice. The traffic is fairly high, so you should balance your needs with the high volume of messages that will be passing through your mailbox. To sign up, go to
- Books
- XML in a Nutshell, 2nd Edition, Elliotte Rusty Harold and W. Scott Means (O'Reilly & Associates)
-
A comprehensive desktop reference for all things XML.
- The XML Bible, 2nd Edition, Elliotte Rusty Harold (Hungry Minds)
-
A solid introduction to XML that provides a comprehensive overview of the XML landscape.
- HTML and XHTML, the Definitive Guide, Chuck Musciano and Bill Kennedy (O'Reilly & Associates)
-
A timely and comprehensive resource for learning about HTML.
- Developing SGML DTDs: From Text to Model to Markup, Eve Maler and Jeanne El Andaloussi (Prentice Hall)
-
A step-by-step tutorial for designing and implementing DTDs. While this book is about SGML, much of its advice is still excellent for XML.
- The SGML Handbook, Charles F. Goldfarb (Oxford University Press)
-
A complete reference for SGML, including an annotated specification. Like its subject, the book is complex and hefty, so beginners may not find it a good introduction.
- Java and XML, 2nd Edition, Brett McLaughlin (O'Reilly & Associates)
-
A guide to combining XML and Java to build real-world applications.
- SAX2, David Brownell (O'Reilly & Associates)
-
A complete guide to using the SAX2 API, in Java.
- Processing XML with Java, Elliotte Rusty Harold (Addison-Wesley)
- Standards Organizations
- ISO
-
Visit the International Organization for Standardization, a worldwide federation of national standards organizations, at http://www.iso.ch.
- W3C
-
The World Wide Web Consortium at http://www.w3.org oversees the specifications and guidelines for the technology of the World Wide Web. Check here for information about CSS, DOM, (X)HTML, MathML, XLink, XML, XPath, XPointer, XSL, and other web technologies.
- Unicode Consortium
-
The organization responsible for defining the Unicode character set can be visited at http://www.unicode.org.
- OASIS
-
The Organization for the Advancement of Structured Information Standards is an international consortium that creates interoperable industry specifications based on public standards such as XML and SGML. See the web site at http://www.oasis-open.org.
- IETF
-
The Internet Engineering Task Force is a less formal organization devoted to the creation of specifications for Internet information exchange. The IETF focuses primarily on protocols, notably HTTP, DNS, and SMTP. It also does some XML work in its MIME type efforts and through its BEEP protocol work. See the web site at http://www.ietf.org.
- Tools
- GNU Emacs
-
An extraordinarily powerful text editor, and so much more. Learn all about it at http://www.gnu.org/software/emacs/emacs.html.
- psgml
-
An Emacs major mode for editing XML and SGML documents that is available at http://www.lysator.liu.se/~lenst/.
- SAX
-
Information on SAX, the Simple API for XML, is available at http://www.saxproject.org. Here, you will find the Java source code and some helpful documentation.
- Xalan
-
A high-performance XSLT stylesheet processor that fully implements XSLT and XPath. You can find out more about it at the Apache XML Project web site, http://xml.apache.org.
- Xerces
-
A fully validating parser that implements XML, DOM levels 1 and 2, and SAX2. Find out more about it at the Apache XML Project, http://xml.apache.org.
- XT
-
A Java implementation of XSLT, at http://www.jclark.com/xml/xt.html.
- Miscellaneous
- User Friendly, Illiad
-
Starring the formidably cute dust puppy and a gaggle of computer industry drones, this comic strip will inject much-needed jocularity into your bloodstream after a long day of hacking XML. The whole archive is available online at http://www.userfriendly.org, and in two books published by O'Reilly: User Friendly, the Comic Strip, and Evil Geniuses in a Nutshell.
- The Cathedral and the Bazaar, Eric S. Raymond (O'Reilly & Associates)
-
In this philosophical analysis and evangelical sermon about the grassroots open source computer programming movement, Raymond extols the virtues of community, sharing, and that warm feeling you get when you're working for the common good.
- Appendix B: A Taxonomy of Standards
The extensibility of XML is clearly demonstrated when you consider all the standards and specifications that have blossomed from the basic XML idea. This appendix is a handy reference to various XML-related activities.
Section B.1: Markup and Structure
Section B.2: Linking
Section B.3: Addressing and Querying
Section B.4: Style and Transformation
Section B.5: Programming
Section B.6: Publishing
Section B.7: Hypertext
Section B.8: Descriptive/Procedural
Section B.9: Multimedia
Section B.10: Science
- Markup and Structure
XML 1.0
Status: XML 1.0 (second edition) became a Recommendation in October 2000. You can read the specification at http://www.w3.org/TR/REC-xml.
Description: XML is a subset of SGML that is designed to be served, received, and processed on the Web in the way that is now possible with HTML. XML has the advantages of easy implementation and compatibility with both SGML and HTML.
XML 1.1
Status: XML 1.1 became a Candidate Recommendation in October 2002. You can read the specification at http://www.w3.org/TR/xml11/.
Description: XML 1.1 updates the character tables and whitespace rules of XML 1.0 to reflect changes to the Unicode specification since XML 1.0 became a Recommendation.
Namespaces in XML
Status: Namespaces became a Recommendation in January 1999, and the specification is published at http://www.w3.org/TR/REC-xml-names/.
Description: XML namespaces provide a simple method for qualifying element and attribute names used in XML documents by associating them with namespaces identified by URI references.
Namespaces in XML 1.1
Status: Namespaces in XML 1.1 became a Candidate Recommendation in December 2002, and the specification is published at http://www.w3.org/TR/xml-names11/.
Description: The 1.1 specification cleans up rules for declaring namespaces by making a provision for undeclaring namespaces, making it possible to reduce the number of unused declarations that apply to a given document framework.
- Linking
XLink
Status: XLink became a W3C Recommendation in June 2001. The specification is published at http://www.w3.org/TR/xlink/.
Description: XLink allows elements to be inserted into XML documents that create and describe links between resources. It uses XML syntax to create structures to describe links, from the simple unidirectional hyperlinks of today's HTML to more sophisticated links.
XML Base
Status: XML Base became a W3C Recommendation in June 2001. The specification is published at http://www.w3.org/TR/xmlbase/.
Description: XML Base describes a mechanism for providing base URI services to XLink. The specification is modular so that other XML applications can make use of it.
XInclude
Status: XInclude became a Candidate Recommendation in September 2002 and is published at http://www.w3.org/TR/xinclude/.
Description: XInclude specifies a processing model and syntax for general-purpose inclusion. Inclusion is accomplished by merging a number of XML infosets into a single composite infoset. Specification of the XML documents (infosets) to be merged and control over the merging process is expressed in XML-friendly syntax (elements, attributes, and URI references).
- Addressing and Querying
XPath
Status: XPath became a W3C Recommendation in November 1999. The specification is published at http://www.w3.org/TR/xpath/. XPath 2.0 Working Drafts include the XQuery 1.0 and XPath 2.0 Data Model at http://www.w3.org/TR/xpath-datamodel/, XQuery 1.0 and XPath 2.0 Functions and Operators at http://www.w3.org/TR/xquery-operators/, XQuery 1.0 and XPath 2.0 Formal Semantics at http://www.w3.org/TR/xquery-semantics/, and XML Path Language (XPath) 2.0 at http://www.w3.org/TR/xpath20.
Description: XPath is a language for addressing parts of an XML document, designed to be used by both XSLT and XPointer. XPath 2.0 was expanded considerably to add support for W3C XML Schema information and XQuery.
XPointer
Status: Most of XPointer became a W3C Recommendation in March 2003. The specification is published in four parts. The XPointer Framework can be found at http://www.w3.org/TR/xptr-framework/. The XPointer xmlns Scheme can be found at http://www.w3.org/TR/xptr-xmlns/, and the XPointer element() Scheme can be found at http://www.w3.org/TR/xptr-element/. The XPointer xpointer() Scheme remains a Working Draft, and can be found at http://www.w3.org/TR/xptr-xpointer/.
Description: XPointer is designed to be used as the basis for a fragment identifier for any URI reference that locates a resource of Internet media type text/xml or application/xml. Based on the XML Path Language (XPath), XPointer supports addressing into the internal structures of XML documents. It allows for examination of a hierarchical document structure and choice of its internal parts based on properties such as element types, attribute values, character content, and relative position.
- Style and Transformation
CSS
Status: CSS Level 2 became a W3C Recommendation in May 1998, and it is published at http://www.w3.org/TR/REC-CSS2/. CSS Level 1 became a W3C Recommendation in 1996, with a revision in January 1999. It is published at http://www.w3.org/TR/REC-CSS1. Work on CSS Level 3 is ongoing, and links to specifications may be found at http://www.w3.org/Style/CSS/current-work.
Description: CSS is a stylesheet language that allows authors and users to attach styles (e.g., fonts, spacing, and sounds) to structured documents such as HTML documents and XML applications. By separating the presentation style from the content of a document, CSS simplifies web authoring and site maintenance.
CSS2 builds on CSS1, and with few exceptions, all stylesheets valid in CSS1 are also valid in CSS2. CSS2 supports media-specific stylesheets, so authors can tailor the presentation of their documents to visual browsers, aural devices, printers, Braille devices, hand-held devices, etc. This specification also supports content positioning, downloadable fonts, table layout, internationalization features, automatic counters and numbering, and some user interface properties.
XSL
Status: XSL became a W3C Recommendation in October 2001. The specification is published at http://www.w3.org/TR/xsl/.
Description: XSL is a language for formatting XML documents. It consists of an XML vocabulary of formatting objects (XSL-FO) and a language for transforming XML documents into those formatting semantics (XSLT). An XSL stylesheet specifies the presentation of a class of XML documents by describing how an instance of the class is transformed into an XML document that uses the formatting vocabulary.
- Programming
DOM
Status: DOM Level 2 became a W3C Recommendation in November 2000, and is composed of five specifications:
- DOM2 Core Specification
-
http://www.w3.org/TR/DOM-Level-2-Core/
- DOM2 Views Specification
-
http://www.w3.org/TR/DOM-Level-2-Views/
- DOM2 Events Specification
-
http://www.w3.org/TR/DOM-Level-2-Events/
- DOM2 Style Specification
-
http://www.w3.org/TR/DOM-Level-2-Style/
- DOM2 Traversal and Range Specification
-
http://www.w3.org/TR/DOM-Level-2-Traversal-Range/
Work on DOM Level 3 is in progress. More information on DOM Level 3 (which notably adds XPath and Load and Save support) is available at http://www.w3.org/DOM/.
Description: DOM Level 2 is a platform- and language-neutral interface that allows programs and scripts to dynamically access and update the content and structure of documents. The DOM Level 2 Core builds on the DOM Level 1 Core, and consists of a set of core interfaces that create and manipulate the structure and contents of a document. The Core also contains specialized interfaces dedicated to XML.
- Publishing
DocBook
Status: The latest SGML version of DocBook is 4.2; the latest XML version of DocBook is 4.3b2. DocBook is officially maintained by the DocBook Technical Committee of OASIS, and you can find the official home page at http://www.oasis-open.org/docbook/index.html.
Description: DocBook is a large and robust DTD designed for technical publications, such as documents related to computer hardware and software.
- Hypertext
XHTML
Status: XHTML 1.0 became a W3C Recommendation in January 2000, with a Second Edition in August 2002. The specification is published at http://www.w3.org/TR/xhtml1/. Modularization of XHTML, which became a Recommendation in April 2001, is published at http://www.w3.org/TR/xhtml-modularization/. XHTML 1.1, module-based XHTML, which became a Recommendation in May 2001, is published at http://www.w3.org/TR/xhtml11/. A subset of XHTML 1.1, XHTML Basic, became a W3C Recommendation in December 2000, and is published at http://www.w3.org/TR/xhtml-basic. Finally, work on XHTML 2.0 has started, and working drafts are available at
Description: XHTML 1.0 is a reformulation of HTML 4 as an XML 1.0 application. The specification defines three DTDs corresponding to the ones defined by HTML 4. The semantics of the elements and their attributes are defined in the W3C Recommendation for HTML 4, and provide the foundation for future extensibility of XHTML. Compatibility with existing HTML user agents is possible by following a small set of guidelines. XHTML 1.1 reformulates HTML as a set of modules, and XHTML Basic creates a subset of XHTML for use on smaller devices. XHTML 2.0 is now under development, and represents the first major changes to the HTML vocabulary since HTML 4.0.
HTML
Status: HTML 4.01 is the latest version of the W3C Recommendation, dated December 1999. The specification is published at http://www.w3.org/TR/html401/.
Description: In addition to the text, multimedia, and hyperlink features of previous versions, HTML 4 supports more multimedia options, scripting languages, and stylesheets, as well as better printing facilities and documents that are more accessible to users with disabilities. HTML 4 also takes great strides towards the internationalization of documents.
- Descriptive/Procedural
SOAP
Status: The SOAP 1.2 specification (formerly the Simple Object Access Protocol) became a W3C Recommendation in June 2003. SOAP 1.2 Part 0: Primer is published at http://www.w3.org/TR/soap12-part0/, SOAP 1.2 Part 1: Messaging Framework is published at http://www.w3.org/TR/soap12-part1/, and SOAP 1.2 Part 2: Adjuncts is published at http://www.w3.org/TR/soap12-part2/. SOAP Version 1.2 Specification Assertions and Test Collection, also a Recommendation, is published at http://www.w3.org/TR/soap12-testcollection.
Description: SOAP is an XML-based protocol for exchanging information in a decentralized, distributed environment. It consists of three parts: an envelope that defines a framework for describing what is in a message and how to process it, a set of encoding rules for expressing instances of application-defined data types, and a convention for representing remote procedure calls and responses.
RDF
Status: The RDF Model and Syntax Specification became a W3C Recommendation in February 1999, and it is published at http://www.w3.org/TR/REC-rdf-syntax/. The RDF Schema Specification became a W3C Candidate Recommendation in March 2000, and it is published at http://www.w3.org/TR/rdf-schema/.
More recently, a number of RDF Working Drafts revising those specs have been published. Resource Description Framework (RDF): Concepts and Abstract Syntax is published at http://www.w3.org/TR/rdf-concepts/. RDF Semantics is published at http://www.w3.org/TR/rdf-mt/. An RDF Primer is published at http://www.w3.org/TR/rdf-primer/. RDF Vocabulary Description Language 1.0: RDF Schema is published at http://www.w3.org/TR/rdf-schema. RDF/XML Syntax Specification (Revised) is published at http://www.w3.org/TR/rdf-syntax-grammar.
- Multimedia
SVG
Status: The SVG 1.0 specification became a W3C Recommendation in September 2001. The specification is published at http://www.w3.org/TR/SVG/. SVG 1.1, like XHTML 1.1, modularized SVG, and the January 2003 Recommendation is at http://www.w3.org/TR/SVG11/. That modularization was used to produce the SVG Mobile Profiles: SVG Tiny and SVG Basic, also published in January 2003 at http://www.w3.org/TR/SVGMobile/. Work on SVG 1.2 is ongoing, and the latest drafts can be found at http://www.w3.org/TR/SVG12.
Description: SVG is a language for describing two-dimensional vector and mixed vector/raster graphics in XML.
SMIL
Status: The SMIL 1.0 specification became a W3C Recommendation in June 1998. It is published at http://www.w3.org/TR/REC-smil/.
Description: SMIL allows a set of independent multimedia objects to be integrated into a synchronized multimedia presentation. While SMIL itself hasn't caught on, some important pieces of SMIL are now in SVG.
- Science
MathML
Status: MathML 2.0 became a W3C Recommendation in February 2001. The specification is published at http://www.w3.org/TR/MathML2/.
Description: MathML is an XML application for describing mathematical notation and capturing its structure and content. The goal of MathML is to enable mathematics to be served, received, and processed on the Web, just as HTML has done for text.
