We've made the program a bit more realistic by making the input file an argument that you can specify on the command line (retrieved from args[0]), and by creating the underlying SAX Parser using the ParserManager class that we introduced earlier. It's still not a production-quality program (for example, it falls over if called without an input argument), but it's getting closer. Once you have set up the classpath (remember that to use ParserManager, the file ParserManager.properties must also be on the classpath), you can run this program from the command line, for example:
java Indenter file:///c:/data/books.xml
The output appears nicely indented. Because the argument is a URL, you can format any XML file on the web.
The End of the Pipeline: Generating XML
Very often, as in the previous example, the final output of the pipeline will be a new XML document. So you will often need a DocumentHandler that uses the events coming out of the pipeline to generate an XML document: a sort of parser in reverse.
Surprisingly we couldn't find a DocumentHandler on the web that does this, so we've written one and included it here.
Here is the class. It's reasonably straightforward, except for the code that generates entity and character references for special characters, which uses some of Java's less intuitive methods for manipulating Strings and arrays.
import org.xml.sax.*;
import java.io.*;

/**
 * XMLOutputter is a DocumentHandler that uses the notified events to
 * reconstruct the XML document on the standard output
 */
public class XMLOutputter implements DocumentHandler
{
    private Writer writer = null;

    /**
     * Set Document Locator. Provided merely to satisfy the interface.
     */
    public void setDocumentLocator(Locator locator) {}

    /**
     * Start of the document.
     * Make the writer and write the XML declaration.
     */
    public void startDocument () throws SAXException
    {
        try
        {
            writer = new BufferedWriter(new PrintWriter(System.out));
            writer.write("<?xml version='1.0' ?>\n");
        }
        catch (java.io.IOException err)
        {
            throw new SAXException(err);
        }
    }

    /**
     * End of the document. Close the output stream.
     */
    public void endDocument () throws SAXException
    {
        try
        {
            writer.close();
        }
        catch (java.io.IOException err)
        {
            throw new SAXException(err);
        }
    }

    /**
     * Start of an element.
     * Output the start tag, escaping special characters.
     */
    public void startElement (String name, AttributeList attributes)
        throws SAXException
    {
        try
        {
            writer.write("<");
            writer.write(name);
            // output the attributes
            for (int i=0; i<attributes.getLength(); i++)
            {
                writer.write(" ");
                writeAttribute(attributes.getName(i),
                               attributes.getValue(i));
            }
            writer.write(">");
        }
        catch (java.io.IOException err)
        {
            throw new SAXException(err);
        }
    }

    /**
     * Write attribute name=value pair
     */
    protected void writeAttribute(String attname, String value)
        throws SAXException
    {
        try
        {
            writer.write(attname);
            writer.write("='");
            char[] attval = value.toCharArray();
            // worst case scenario
            char[] attesc = new char[value.length()*8];
            int newlen = escape(attval, 0, value.length(), attesc);
            writer.write(attesc, 0, newlen);
            writer.write("'");
        }
        catch (java.io.IOException err)
        {
            throw new SAXException(err);
        }
    }

    /**
     * End of an element. Output the end tag.
     */
    public void endElement (String name) throws SAXException
    {
        try
        {
            writer.write("</" + name + ">");
        }
        catch (java.io.IOException err)
        {
            throw new SAXException(err);
        }
    }

    /**
     * Character data.
     */
    public void characters (char[] ch, int start, int length)
        throws SAXException
    {
        try
        {
            char[] dest = new char[length*8];
            int newlen = escape(ch, start, length, dest);
            writer.write(dest, 0, newlen);
        }
        catch (java.io.IOException err)
        {
            throw new SAXException(err);
        }
    }

    /**
     * Ignorable whitespace: treat it as characters
     */
    public void ignorableWhitespace(char[] ch, int start, int length)
        throws SAXException
    {
        characters(ch, start, length);
    }

    /**
     * Handle a processing instruction.
     */
    public void processingInstruction (String target, String data)
        throws SAXException
    {
        try
        {
            writer.write("<?" + target + ' ' + data + "?>");
        }
        catch (java.io.IOException err)
        {
            throw new SAXException(err);
        }
    }

    /**
     * Escape special characters for display.
     * @param ch The character array containing the string
     * @param start The start position of the input string
     * within the character array
     * @param length The length of the input string
     * within the character array
     * @param out Character array to receive the output. In the worst case,
     * this should be 8 times the length of the input array.
     * @return The number of characters used in the output array
     */
    private int escape(char ch[], int start, int length, char[] out)
    {
        int o = 0;
        for (int i = start; i < start+length; i++)
        {
            if (ch[i]=='<')
            {
                ("&lt;").getChars(0, 4, out, o); o+=4;
            }
            else if (ch[i]=='>')
            {
                ("&gt;").getChars(0, 4, out, o); o+=4;
            }
            else if (ch[i]=='&')
            {
                ("&amp;").getChars(0, 5, out, o); o+=5;
            }
            else if (ch[i]=='\"')
            {
                // numeric reference for the double quote (5 characters)
                ("&#34;").getChars(0, 5, out, o); o+=5;
            }
            else if (ch[i]=='\'')
            {
                // numeric reference for the apostrophe (5 characters)
                ("&#39;").getChars(0, 5, out, o); o+=5;
            }
            else if (ch[i]<127)
            {
                out[o++]=ch[i];
            }
            else
            {
                // output character reference
                out[o++]='&';
                out[o++]='#';
                String code = Integer.toString(ch[i]);
                int len = code.length();
                code.getChars(0, len, out, o); o+=len;
                out[o++]=';';
            }
        }
        return o;
    }
}
Now you can see how SAX can be used to write XML documents as well as to read them. In fact, you can run SAX back-to-front: instead of the Parser being standard software that someone else writes, and the DocumentHandler being your specific application code, you can write an implementation of org.xml.sax.Parser that contains your application logic for generating XML, and couple it to this off-the-shelf DocumentHandler for writing XML output!
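To make the back-to-front idea concrete, here is a minimal sketch of such a "parser". The class name BookListGenerator and the books/book vocabulary are our own inventions for illustration; the only real API used is the SAX 1.0 Parser and DocumentHandler interfaces plus the AttributeListImpl helper class. It ignores the input source entirely and generates its events from an array of titles.

```java
import org.xml.sax.*;
import org.xml.sax.helpers.AttributeListImpl;
import java.io.IOException;
import java.util.Locale;

/**
 * Hypothetical "parser" that reads no XML at all: it generates SAX events
 * from application data (here, an array of book titles).
 */
class BookListGenerator implements Parser
{
    private final String[] titles;
    private DocumentHandler handler = new HandlerBase(); // default: ignore events

    BookListGenerator(String[] titles) { this.titles = titles; }

    public void setDocumentHandler(DocumentHandler handler) { this.handler = handler; }

    // The remaining setters are not needed by a generator; they exist only
    // to satisfy the org.xml.sax.Parser interface.
    public void setLocale(Locale locale) {}
    public void setEntityResolver(EntityResolver resolver) {}
    public void setDTDHandler(DTDHandler dtdHandler) {}
    public void setErrorHandler(ErrorHandler errorHandler) {}

    // Both parse() methods ignore their argument and simply emit the events.
    public void parse(InputSource source) throws SAXException, IOException { generate(); }
    public void parse(String systemId) throws SAXException, IOException { generate(); }

    private void generate() throws SAXException
    {
        handler.startDocument();
        handler.startElement("books", new AttributeListImpl());
        for (int i = 0; i < titles.length; i++)
        {
            AttributeListImpl atts = new AttributeListImpl();
            atts.addAttribute("id", "CDATA", Integer.toString(i + 1));
            handler.startElement("book", atts);
            handler.characters(titles[i].toCharArray(), 0, titles[i].length());
            handler.endElement("book");
        }
        handler.endElement("books");
        handler.endDocument();
    }

    /** Demonstration: print the name of each generated element. */
    public static void main(String[] args) throws Exception
    {
        Parser parser = new BookListGenerator(new String[] {"Emma", "Persuasion"});
        parser.setDocumentHandler(new HandlerBase() {
            public void startElement(String name, AttributeList atts)
            {
                System.out.println("start of " + name);
            }
        });
        parser.parse("ignored"); // the argument is not used by this generator
    }
}
```

To produce real XML output, you would pass an XMLOutputter to setDocumentHandler() in place of the anonymous handler used in main().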
Other Useful ParserFilters
NamespaceFilter
This ParserFilter implements the XML Namespaces recommendation, described in Chapter 7. It is available from John Cowan's web site at http://www.ccil.org/~cowan/XML/.
SAX was defined before the XML Namespaces recommendation was published, and takes no account of it. If an element name is written in the source document as <html:table>, then the element name passed to the startElement() method will be "html:table". There is no simple way for the application to determine which namespace "html" is referring to.
The NamespaceFilter solves this problem. It keeps track of all the namespace declarations in the document (that is, the "xmlns:xxx" attributes), and when a prefixed element or attribute name is reported by the SAX parser, it substitutes the full namespace URI for the prefix before passing it on down the pipeline. For example, if the element start tag is <html:table xmlns:html="http://www.w3.org/TR/REC-html40"> then the element name passed on to the next DocumentHandler will be "http://www.w3.org/TR/REC-html40^table". The circumflex character was chosen to separate the namespace URI from the local part of the element name because it's a character that can't appear in URIs or in XML names.
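The bookkeeping involved can be sketched as follows. This is not John Cowan's code: the class PrefixResolver and its methods are hypothetical, and to keep the sketch short it takes each element's namespace declarations as a ready-made map rather than extracting them from an AttributeList. It also ignores the default namespace.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.Map;

/** Hypothetical helper, not the real NamespaceFilter: maps prefixed names to "URI^local". */
class PrefixResolver
{
    // One map of prefix -> URI per open element; the innermost element is first.
    private final Deque<Map<String, String>> scopes = new ArrayDeque<Map<String, String>>();

    /** Call at each start tag, passing that element's xmlns:prefix declarations. */
    void startElement(Map<String, String> declarations)
    {
        scopes.push(new HashMap<String, String>(declarations));
    }

    /** Call at each end tag: the element's declarations go out of scope. */
    void endElement()
    {
        scopes.pop();
    }

    /** Replace a declared prefix with its URI; other names pass through unchanged. */
    String expand(String qname)
    {
        int colon = qname.indexOf(':');
        if (colon < 0) return qname;             // unprefixed name: not handled here
        String prefix = qname.substring(0, colon);
        for (Map<String, String> scope : scopes) // innermost declaration wins
        {
            String uri = scope.get(prefix);
            if (uri != null) return uri + "^" + qname.substring(colon + 1);
        }
        return qname;                            // undeclared prefix: leave as written
    }
}
```

A filter built on this helper would call startElement() with the declarations found on each start tag, pass expand(name) down the pipeline in place of the original name, and call endElement() at each end tag.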
Sometimes applications want to know the prefix as well as the namespace URI (for example, for use in error messages). NamespaceFilter doesn't provide this information, but it could easily be extended to do so.
InheritanceFilter
This is also available from John Cowan's web site at http://www.ccil.org/~cowan/XML/.
Many XML document designs use the concept of an inheritable attribute. The idea is that if a particular attribute is not present on an element, the value is taken from the same attribute on a containing element. The XML standard itself uses this idea for the special attributes xml:lang and xml:space, and it is extensively used in some other standards such as the XSL Formatting Objects proposal.
InheritanceFilter is a ParserFilter that extends the attribute list passed to the startElement() method to include attributes that were not actually present on that element, but were inherited from parent elements. The InheritanceFilter needs to be primed with a list of attribute names that are to be treated as inherited attributes.
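The following sketch shows one way the inheritance logic could work. It is our own illustration rather than the real InheritanceFilter: the class InheritedAttributes is hypothetical, and attributes are represented as simple maps instead of AttributeList objects.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.Map;
import java.util.Set;

/** Hypothetical sketch of attribute inheritance; not the real InheritanceFilter. */
class InheritedAttributes
{
    // The attribute names the filter has been primed with.
    private final Set<String> inheritable;

    // Inheritable values in force for each open element, innermost first.
    private final Deque<Map<String, String>> stack = new ArrayDeque<Map<String, String>>();

    InheritedAttributes(Set<String> inheritable) { this.inheritable = inheritable; }

    /** Returns the element's own attributes plus any inherited ones it lacks. */
    Map<String, String> startElement(Map<String, String> own)
    {
        Map<String, String> inherited = stack.isEmpty()
            ? new HashMap<String, String>()
            : new HashMap<String, String>(stack.peek());

        // Start from the inherited values, then let the element's own
        // attributes override them.
        Map<String, String> effective = new HashMap<String, String>(inherited);
        effective.putAll(own);

        // Only the primed attribute names are visible to child elements.
        for (String name : inheritable)
        {
            if (own.containsKey(name)) inherited.put(name, own.get(name));
        }
        stack.push(inherited);
        return effective;
    }

    /** Call at each end tag: the element's values go out of scope. */
    void endElement()
    {
        stack.pop();
    }
}
```

Primed with xml:lang, for example, this reports xml:lang on every descendant of an element that specifies it, until a descendant overrides the value.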
XLinkFilter
This ParserFilter provides support for the draft XLink specification for creating hyperlinks between XML documents. It is published by Simon St. Laurent at http://www.simonstl.com/projects/xlinkfilter/.
Unlike most ParserFilters, an XLinkFilter passes all the events through unchanged. While doing so, however, it constructs a data structure reflecting the XLink attributes encountered in the document. This data structure can then be interrogated by subsequent stages in the pipeline.
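The general shape of such a pass-through filter is sketched below. We haven't reproduced the real XLinkFilter API here: the class LinkCollector, its getHrefs() method, and the assumption that links are marked by an attribute named xlink:href are all ours, chosen purely for illustration.

```java
import org.xml.sax.*;
import java.util.ArrayList;
import java.util.List;

/**
 * Hypothetical pass-through filter (not the real XLinkFilter API): forwards
 * every event unchanged and records the href of each linking element.
 */
class LinkCollector extends HandlerBase
{
    private final DocumentHandler next;
    private final List<String> hrefs = new ArrayList<String>();

    LinkCollector(DocumentHandler next) { this.next = next; }

    /** Later pipeline stages can interrogate the collected links. */
    List<String> getHrefs() { return hrefs; }

    public void startElement(String name, AttributeList atts) throws SAXException
    {
        String href = atts.getValue("xlink:href"); // assumed attribute name
        if (href != null) hrefs.add(href);
        next.startElement(name, atts);             // event passed on untouched
    }

    // Every other event is simply forwarded to the next stage.
    public void setDocumentLocator(Locator locator) { next.setDocumentLocator(locator); }
    public void startDocument() throws SAXException { next.startDocument(); }
    public void endDocument() throws SAXException { next.endDocument(); }
    public void endElement(String name) throws SAXException { next.endElement(name); }

    public void characters(char[] ch, int start, int length) throws SAXException
    {
        next.characters(ch, start, length);
    }

    public void ignorableWhitespace(char[] ch, int start, int length) throws SAXException
    {
        next.ignorableWhitespace(ch, start, length);
    }

    public void processingInstruction(String target, String data) throws SAXException
    {
        next.processingInstruction(target, data);
    }
}
```

A subsequent stage that holds a reference to the LinkCollector can call getHrefs() at any point to see the links encountered so far.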
One kind of link defined in the XLink specification is a so-called "inclusion" link where the linked text is designed to appear inline within the main document - rather like a preprocessor #include directive in C. The XLink syntax for this is show="parsed". This is very similar to an external entity reference, except that the application has some control over the decision whether and when to include the text: for example, the user might have a choice to display the long or short forms of a document. It would be quite possible, of course, to implement a filter that expanded such links directly, presenting an included document to subsequent pipeline stages as if it were physically embedded in the original document.
Pipelines with Shared Context
One potential difficulty with a pipeline is that each filter in the pipeline has to work out for itself things that other filters already know; a common example is knowing the parent of the current element. If one filter is already maintaining a stack of elements so that it can determine this, it is wasteful for another filter to do the same thing.
You can get round this by allowing one filter to access data structures set up by a previous filter, either directly or via public methods. However, this requires that the filters in the pipeline know rather more about each other than the pure pipeline model suggests, which reduces your ability to plug filters together in any order. Arguably, when processing reaches this level of complexity, it might be better to forget event-based processing entirely and use the DOM (with a navigational design pattern) instead.
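As a rough illustration of sharing context via public methods, the sketch below has one filter maintain the stack of open elements and a later stage query it. Both classes (ContextFilter and ParentReporter) are hypothetical, and for brevity only the element events are forwarded; a complete filter would pass on the other events in the same way.

```java
import org.xml.sax.*;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical filter that maintains the stack of open elements for the whole pipeline. */
class ContextFilter extends HandlerBase
{
    private final List<String> open = new ArrayList<String>();
    private DocumentHandler next = new HandlerBase(); // default: discard events

    void setNext(DocumentHandler next) { this.next = next; }

    /** Name of the parent of the current element, or null at the root. */
    String getParent()
    {
        int n = open.size();
        return n >= 2 ? open.get(n - 2) : null;
    }

    public void startElement(String name, AttributeList atts) throws SAXException
    {
        open.add(name);                 // update the context first...
        next.startElement(name, atts);  // ...so later stages see the new element
    }

    public void endElement(String name) throws SAXException
    {
        next.endElement(name);
        open.remove(open.size() - 1);
    }
}

/** Downstream stage: asks the shared context instead of keeping its own stack. */
class ParentReporter extends HandlerBase
{
    private final ContextFilter context;
    final StringBuilder report = new StringBuilder();

    ParentReporter(ContextFilter context) { this.context = context; }

    public void startElement(String name, AttributeList atts)
    {
        report.append(name).append(" in ").append(context.getParent()).append('\n');
    }
}
```

Notice the cost described above: ParentReporter only works if a ContextFilter sits somewhere upstream of it, so the two stages can no longer be plugged together in an arbitrary order.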