Alternative Input Sources
In our examples so far, the XML document to be parsed has been described in the form of a URL. This is usually adequate, given the range of resources that a URL can describe. It allows the document to be held in a file locally or remotely, or for it to be generated dynamically by a web server.
Taking Input from a Byte Stream or Character Stream
Sometimes you want to supply the parser with a stream of XML that is generated by another program rather than being held in a file. For example, the XML might be stored in a relational database, or it might be output by an EDI message translation program, or it might be an XML section embedded within a file or message in some non-XML format. You don't want to have to write the XML to the file store (or to install a web server) just so that the parser can read your document.
To handle this situation, SAX allows you to supply the XML input in the form of a character stream or a byte stream. It provides the InputSource class to generalize all these possible sources of input.
For example, let's suppose your program wants to parse XML held in a character string that has just been read from a relational database using JDBC. The following code will do the job:
public void parseString(String s) throws SAXException, IOException
{
    // Wrap the string in a Reader so the parser receives characters, not bytes
    StringReader reader = new StringReader(s);
    InputSource source = new InputSource(reader);
    parser.parse(source);
}
InputSource is a class (not an interface) provided with the SAX distribution. The application can set various details of the input source, some of which are mutually exclusive. These include supplying a URL, a Reader (as here), an InputStream, an encoding name, and a "public identifier". (Public identifiers, however, are as enigmatic in SAX as in the XML specification itself: there are no clues as to what the parser should actually do with the public identifier. But as we will see later, the application can use it.)
Why does SAX need to provide two options for in-memory data, an InputStream and a Reader?
An InputStream is a stream of bytes. The XML standard provides many rules about how a stream of bytes can be translated into a stream of Unicode characters, including for example the encoding attribute (which is part of the XML declaration at the start of the document content). To translate bytes to characters, it's not good enough to leave the work to the standard Java libraries, because they don't understand these rules, and they certainly can't be expected to read the encoding attribute. If the XML comes from a binary source, complete with encoding attribute, we want to hand the stream of bytes to the parser for it to interpret directly.
A Reader, by contrast, is a stream of Unicode characters. If we already have the data in the form of characters, we don't want to have to encode it first as a stream of bytes (say in the UTF-8 encoding) just so that the parser can decode it again. Better to hand the character stream to the parser directly. (Actually, there was some debate about the desirability of providing this option in SAX. While it's obviously useful, it's not entirely in the spirit of the XML specification, which defines an XML document strictly as a sequence of bytes. It's perhaps best to think of the input character stream not as an XML document, but as a preprocessed XML document in which the first stage of processing, namely character decoding, has already been done.)
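To make the byte-stream case concrete, here is a minimal sketch in which the document arrives as raw bytes in ISO-8859-1 and the parser works out the encoding from the XML declaration. The class name ByteStreamDemo and the sample document are invented for illustration, and obtaining a parser through JAXP's SAXParserFactory is an assumption of this sketch rather than something the text prescribes.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.HandlerBase;
import org.xml.sax.InputSource;

// Hypothetical demo class: hands a byte stream to the parser and collects the text.
public class ByteStreamDemo {

    public static String parseBytes(byte[] bytes) throws Exception {
        final StringBuilder text = new StringBuilder();
        // SAX 1 handler that accumulates character data
        HandlerBase handler = new HandlerBase() {
            @Override
            public void characters(char[] ch, int start, int length) {
                text.append(ch, start, length);
            }
        };
        // No Reader here: the parser decodes the bytes itself,
        // guided by the encoding named in the XML declaration.
        InputSource source = new InputSource(new ByteArrayInputStream(bytes));
        SAXParserFactory.newInstance().newSAXParser().parse(source, handler);
        return text.toString();
    }

    public static void main(String[] args) throws Exception {
        String xml = "<?xml version=\"1.0\" encoding=\"ISO-8859-1\"?>"
                   + "<name>J\u00fcrgen</name>";
        // Encode with the same encoding that the declaration announces
        byte[] bytes = xml.getBytes(StandardCharsets.ISO_8859_1);
        System.out.println(parseBytes(bytes));
    }
}
```

If the same bytes were first decoded by a Reader using the wrong character set, the parser would have no opportunity to correct the mistake, which is the reason for passing the InputStream through untouched.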
Whether we use a byte stream or a character stream, there is one snag you need to be aware of: the parser has no way of resolving a relative URL that appears in the document source. Suppose the document source contains the line
<!DOCTYPE books SYSTEM "books.dtd">
Where is books.dtd to be found? The XML specification says (in effect) that it should be found in the same directory as the source document, but of course we don't have a directory for the source document because it was in memory when parsing started.
SAX gets round this by allowing a system identifier (in other words, a URL) to be supplied as well as a byte stream or character stream. This URL is not used to read the source document, only as a base for resolving any relative URLs found in the source document.
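The following sketch shows one way of supplying a base URL alongside a character stream. The class name BaseUrlDemo and the URL file:///data/catalog/books.xml are hypothetical, and the use of JAXP's SAXParserFactory to obtain a parser is an assumption. The handler records the system identifier that the parser asks it to resolve, and returns an empty DTD so that nothing needs to exist on disk.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.HandlerBase;
import org.xml.sax.InputSource;

// Hypothetical demo class: parses in-memory XML with a base URL attached.
public class BaseUrlDemo {

    // Returns the system identifier the parser tried to fetch for the DTD.
    public static String resolvedDtdId(String xml, String baseUrl) throws Exception {
        final String[] seen = new String[1];
        HandlerBase handler = new HandlerBase() {
            @Override
            public InputSource resolveEntity(String publicId, String systemId) {
                seen[0] = systemId;  // remember what the parser asked for
                // Supply an empty DTD so no file access is attempted
                return new InputSource(new StringReader(""));
            }
        };
        InputSource source = new InputSource(new StringReader(xml));
        // The base URL is not read; it only anchors relative URLs in the document
        source.setSystemId(baseUrl);
        SAXParserFactory.newInstance().newSAXParser().parse(source, handler);
        return seen[0];
    }

    public static void main(String[] args) throws Exception {
        String xml = "<!DOCTYPE books SYSTEM \"books.dtd\"><books/>";
        System.out.println(resolvedDtdId(xml, "file:///data/catalog/books.xml"));
    }
}
```

With the base URL set, the identifier reported for books.dtd should sit in the same directory as the (notional) source document.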
Specifying a Filename rather than a URL
Another common source of input is a file name: for example, command-line interfaces generally use file names as arguments rather than URLs, and you may well want to use this form of argument in the interface to your application.
The SAX InputSource class does not directly allow you to specify a filename for the input; you have to convert the filename into a URL so that the parser can process it. If you are using Java 2, this is simplicity itself: the Java File class has a suitable method. So to parse the file c:\sample.xml, you can write:
parser.parse((new File("c:\\sample.xml")).toURL().toString());
(Note that the parse() method expects the URL as a string, not as a Java URL object, hence the need to call toString() to achieve the conversion.)
With Java 1.1, the translation of a filename to a URL is a little more difficult than you might expect if you want the code to work equally on Windows and on UNIX, because of the wide variety of filename formats. Here's a method that handles most cases, though the error handling leaves something to be desired.
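public String CreateURL(File file)
{
    String path = file.getAbsolutePath();
    try
    {
        // If the absolute path happens to parse as a URL, use it unchanged
        return (new URL(path)).toString();
    }
    catch (MalformedURLException ex)
    {
        // Otherwise build a file: URL by hand, using '/' as the separator
        String fs = System.getProperty("file.separator");
        char sep = fs.charAt(0);
        if (sep != '/') path = path.replace(sep, '/');
        // Windows paths such as c:/sample.xml need a leading '/'
        if (path.charAt(0) != '/') path = '/' + path;
        return "file://" + path;
    }
}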
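For comparison, on Java releases later than the ones discussed here, File.toURL() is deprecated because it does not escape characters that are illegal in URLs, and File.toURI() is the usual replacement. The sketch below assumes such a release; the class name FileUrlDemo and the filename are invented for illustration.

```java
import java.io.File;

// Hypothetical demo class: converts a filename to a URL string via File.toURI().
public class FileUrlDemo {

    public static String toUrlString(File file) {
        // toURI() makes the path absolute and escapes characters such as spaces
        return file.toURI().toString();
    }

    public static void main(String[] args) {
        // The result depends on the current directory and platform
        System.out.println(toUrlString(new File("my sample.xml")));
    }
}
```

The resulting string can be passed to parser.parse() in the same way as the output of CreateURL().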
Input from Non-XML Sources
One of the more surprising ways in which SAX has been used is to feed applications with data that is not stored in XML at all. So long as the data is in a hierarchic format that can be mapped reasonably well to the XML data model, you can write a driver that behaves in every way like an XML parser. Your driver sends events such as startElement() and endElement() to the application's DocumentHandler just as if the data originated in an XML document, when in reality there is no XML document there to be parsed.
Why would you want to do this? It allows you to take advantage of applications that were written to accept XML data, without going through the clumsy process of writing your data in XML format and then parsing it again. For example, if you have an application designed to process incoming XML-EDI messages for electronic commerce transactions, you might want also to write a translator that feeds this application with messages arriving in older proprietary formats. One way to do this is for your translator to create an XML file and supply this file to the application. But a neat shortcut, if the target application is written to use SAX, is for your translator to call the application directly, pretending to be an XML parser.
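As a rough illustration of the idea, here is a sketch of a driver that reads records in a made-up "name=value" line format and reports them to a DocumentHandler as if they had been parsed from XML. The class RecordDriver, the record format, and the element name "record" are all hypothetical. A fuller driver would probably implement the SAX Parser interface so that applications could configure it like any other parser.

```java
import org.xml.sax.AttributeList;
import org.xml.sax.DocumentHandler;
import org.xml.sax.HandlerBase;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.AttributeListImpl;

// Hypothetical driver: turns "name=value" lines into SAX 1 events.
public class RecordDriver {
    private final DocumentHandler handler;

    public RecordDriver(DocumentHandler handler) {
        this.handler = handler;
    }

    // Fires the events a parser would fire for
    // <record><name>value</name>...</record>
    public void drive(String[] lines) throws SAXException {
        AttributeList none = new AttributeListImpl();  // no attributes
        handler.startDocument();
        handler.startElement("record", none);
        for (String line : lines) {
            int eq = line.indexOf('=');
            if (eq < 1) continue;  // skip lines with no usable name
            String name = line.substring(0, eq);
            char[] value = line.substring(eq + 1).toCharArray();
            handler.startElement(name, none);
            handler.characters(value, 0, value.length);
            handler.endElement(name);
        }
        handler.endElement("record");
        handler.endDocument();
    }

    // Stand-in "application": rebuilds markup from the events (no escaping).
    public static String toXml(String[] lines) throws SAXException {
        final StringBuilder out = new StringBuilder();
        HandlerBase app = new HandlerBase() {
            @Override
            public void startElement(String name, AttributeList atts) {
                out.append('<').append(name).append('>');
            }
            @Override
            public void endElement(String name) {
                out.append("</").append(name).append('>');
            }
            @Override
            public void characters(char[] ch, int start, int length) {
                out.append(ch, start, length);
            }
        };
        new RecordDriver(app).drive(lines);
        return out.toString();
    }

    public static void main(String[] args) throws SAXException {
        System.out.println(toXml(new String[] { "order=1234", "currency=EUR" }));
    }
}
```

The application in toXml() never learns that no XML document existed: it sees the same sequence of calls it would receive from a real parser.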
The section below on SAX Filters discusses some of the possibilities using this approach.
Handling External Entities
We often think of XML entities as the markers like &aumlaut; appearing in the text of a document. That's not quite accurate: &aumlaut; isn't strictly an entity, but an entity reference. The entity is the thing that &aumlaut; refers to, that is, the definition in the DTD that associates the name "aumlaut" with its expanded text "ä".
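The distinction can be seen in a small sketch. The document below declares the entity "aumlaut" in its internal DTD subset and then refers to it once in the text; the parser replaces the reference with the entity's expanded text before the application sees it. The class name EntityDemo and the sample document are invented for illustration, and obtaining the parser through JAXP's SAXParserFactory is an assumption.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.HandlerBase;
import org.xml.sax.InputSource;

// Hypothetical demo class: shows an entity reference being expanded by the parser.
public class EntityDemo {

    public static String expand(String xml) throws Exception {
        final StringBuilder text = new StringBuilder();
        HandlerBase handler = new HandlerBase() {
            @Override
            public void characters(char[] ch, int start, int length) {
                // Text may arrive in several chunks, so accumulate it
                text.append(ch, start, length);
            }
        };
        InputSource source = new InputSource(new StringReader(xml));
        SAXParserFactory.newInstance().newSAXParser().parse(source, handler);
        return text.toString();
    }

    public static void main(String[] args) throws Exception {
        String xml =
            // The entity: a declaration binding the name to its expanded text
            "<!DOCTYPE word [<!ENTITY aumlaut \"&#228;\">]>"
            // The entity reference: a use of that name in the content
            + "<word>M&aumlaut;dchen</word>";
        System.out.println(expand(xml));
    }
}
```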