You can find out about XML at the World Wide Web Consortium web site: http://www.w3.org. To find the XML specifications, look at the XML Core page. For our purposes, the key specification is Extensible Markup Language (XML) 1.0.
Basically, XML addresses two problems: the character encoding problem (how do you know what the bytes of a document mean?) and the document structure problem (how do you represent a document's tree structure?).
The character encoding problem has two parts. The first is that there are literally hundreds of different character sets, many still in use. The Internet register of character sets lists 850 names for 254 character sets, and it is far from complete. More precisely, what it names is not character sets, but encodings: UTF-32, UTF-16, UTF-8, SCSU (Simple Compression Scheme for Unicode), and BOCU (Binary Ordered Compression for Unicode) are all ways of encoding sequences of (21-bit) Unicode code-points as streams of bytes. The basic issue is that you cannot read the words until you can read the letters, and you can't read the letters without knowing the encoding.
How can you know what the encoding of a document is? One answer is the operating system's locale convention. On an older UNIX machine of mine we find

setenv LC_CTYPE en_NZ.ISO8859-1
which says that my (LoCale)'s Character TYPE rules are those for the dialect of en(glish) written in New Zealand using member number 1 of the ISO 8859 family of 8-bit coded character sets. On my new Macintosh we find
setenv LANG en_GB.UTF-8
which says that I use the variety of English written in Great Britain, encoded in UTF-8. All of this is fine for files you have written, or that other people have written using the same conventions. It won't do for documents from other sources: they might be fetched over the internet, and even users sharing one file system may have set different preferences.
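If a program wants to ask what the current locale implies, POSIX provides nl_langinfo. A minimal sketch (standard POSIX calls; the exact codeset names printed vary from system to system):

#include <locale.h>
#include <langinfo.h>
#include <stdio.h>

int main(void)
{
    /* Adopt the user's locale settings, then report the codeset. */
    setlocale(LC_CTYPE, "");
    printf("codeset: %s\n", nl_langinfo(CODESET));  /* e.g. "UTF-8" */
    return 0;
}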
Documents fetched over the Web come with a MIME header that can include a line

Content-Type: type; charset=character-set

where type is something like text/html, and character-set is something like ISO-8859-1, which is the default that a client must assume if the charset is not specified.
If that information is there, wonderful. If not, you can't really rely on it being ISO Latin 1, because the server might not know, or (if a Microsoft server) might lie (Windows 1252 is often misreported as Latin 1). Of course, once you save the file to disc, this information may be lost.
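Here is a small sketch of pulling the charset parameter out of a Content-Type value. It is deliberately naive: real header parsing must cope with quoting, mixed case, and whitespace, and the function name charset_of is just my label for the example.

#include <stdio.h>
#include <string.h>

/* A naive extraction of the charset parameter.
   Ignores quoting, case, and surrounding whitespace. */
static const char *charset_of(const char *content_type)
{
    const char *p = strstr(content_type, "charset=");
    return p != NULL ? p + strlen("charset=") : "ISO-8859-1";  /* the default */
}

int main(void)
{
    printf("%s\n", charset_of("text/html; charset=UTF-8"));  /* UTF-8 */
    printf("%s\n", charset_of("text/html"));                 /* ISO-8859-1 */
    return 0;
}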
One old trick was to begin a file with a

/* 0123456789.-+ABCDEFGHIJKLMNOPQRSTUVWXYZ_&=$ */

line listing all the characters that might be used in the file (here, the characters used in Fortran 66, plus dollar) in a known order, so that if you didn't see those characters, you knew what translation table to build.
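To show how such a fingerprint line could be exploited, here is a sketch that builds a byte translation table from it; build_table and its assumptions (the line is complete, in the agreed order, with no duplicated codes) are mine, not part of any standard.

/* Build a byte translation table from a known-order fingerprint line.
   expected[] is the agreed character sequence; seen[] is what the
   file's first line actually contained.  A sketch: no checking for
   short lines or duplicated codes. */
static void build_table(const char *expected, const unsigned char *seen,
                        unsigned char table[256])
{
    int i;
    for (i = 0; i < 256; i++)
        table[i] = (unsigned char)i;            /* identity by default */
    for (i = 0; expected[i] != '\0'; i++)
        table[seen[i]] = (unsigned char)expected[i];
}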
These days, XML files begin with an XML declaration that says which version of XML they use (there are two so far: XML 1.0 (fourth edition) and XML 1.1 (second edition); XML 1.1's major change is to be folded back into XML 1.0 fifth edition). The XML declaration also says what the encoding is. For example,
<?xml version="1.0" encoding="ASCII"?>
Of course, in order to read the encoding declaration, you have to be able to read the characters, which seems to mean that you have to know what the encoding is, and if you know that, why do you need the encoding declaration? The answer is that there is a limited number of ways that an XML file can begin. The first character should normally be either a Unicode "Byte Order Mark" or a "<". While there are hundreds of 8-bit character sets, almost all of them are extensions or revisions of either ASCII or EBCDIC, and the "<" character code will tell you which of those it is. The encoding declaration is written in the ASCII subset of Unicode (using only characters that are also in EBCDIC). Annex F of the XML specification says how to auto-detect enough of the encoding to read the rest.
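A minimal sketch of that auto-detection, covering only a few of the cases Annex F lists (the UTF-8 and UTF-16 byte order marks, BOM-less UTF-16, and a bare '<' in an ASCII-family or EBCDIC-family encoding); sniff is my name for it:

#include <stddef.h>

/* Guess the encoding family from the first bytes of an XML file,
   in the spirit of Annex F.  A sketch: a real detector also handles
   UTF-32 and then reads the encoding declaration itself. */
static const char *sniff(const unsigned char *b, size_t n)
{
    if (n >= 3 && b[0] == 0xEF && b[1] == 0xBB && b[2] == 0xBF)
        return "UTF-8 with BOM";
    if (n >= 2 && b[0] == 0xFE && b[1] == 0xFF) return "UTF-16 big-endian";
    if (n >= 2 && b[0] == 0xFF && b[1] == 0xFE) return "UTF-16 little-endian";
    if (n >= 2 && b[0] == 0x00 && b[1] == 0x3C) return "UTF-16BE, no BOM";
    if (n >= 2 && b[0] == 0x3C && b[1] == 0x00) return "UTF-16LE, no BOM";
    if (n >= 1 && b[0] == 0x3C) return "ASCII family";   /* '<' in ASCII */
    if (n >= 1 && b[0] == 0x4C) return "EBCDIC family";  /* '<' in EBCDIC */
    return "unknown";
}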
For your assignments, you may assume UTF-8.
The second way that XML helps with character encoding issues is by providing a way to write any Unicode character, no matter what character set the document itself uses. This means that documents can be translated from any character set to any other: characters that do not exist in the target character set can be written as character references instead.
One method is to use names, written &name; in the document (so &amp; stands for the ampersand). There are five standard names in XML:
amp    &    the ampersand
apos   '    the apostrophe
gt     >    greater than
lt     <    less than
quot   "    the double quotation mark
Applications of XML may declare other named characters in things called DTDs (Document Type Definitions). XHTML defines hundreds, such as &uarr; for the up arrow, ↑.
Less conveniently, but more generally, characters may be written in decimal "&#ddd;" or hexadecimal "&#xhhh;" form.
Any program that reads XML must be prepared to decode the five standard named characters and decimal and hexadecimal character references.
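A sketch of that decoding, assuming the text between '&' and ';' has already been isolated into body; a real parser must also check that a numeric value names a legal XML character.

#include <stdlib.h>
#include <string.h>

/* Decode the text found between '&' and ';'.  Returns the character
   (code point), or -1 for an unknown name.  A sketch: no overflow
   checking, no validity checking on the numeric forms. */
static long decode_reference(const char *body)
{
    if (strcmp(body, "amp")  == 0) return '&';
    if (strcmp(body, "lt")   == 0) return '<';
    if (strcmp(body, "gt")   == 0) return '>';
    if (strcmp(body, "apos") == 0) return '\'';
    if (strcmp(body, "quot") == 0) return '"';
    if (body[0] == '#')
        return body[1] == 'x' ? strtol(body + 2, NULL, 16)
                              : strtol(body + 1, NULL, 10);
    return -1;
}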
The other thing XML deals with is document structure. An XML document is basically a tree, where some leaves are text (with named characters, numeric character references, and perhaps even comments embedded in it), and the nodes (elements) have labels (historically called Generic Identifiers, sometimes called element type names) and unordered attribute=value sets. Bending the idea of a tree a bit, there may also be cross-links. A declaration
<!ATTLIST element attribute ID #IMPLIED>
says that any instance of element may but need not have the given attribute, and that that attribute is a unique identifier (ID) for that instance. All the ID values in a document must be different, and they must look like words. A declaration
<!ATTLIST element attribute IDREF #IMPLIED>
declares an identifier reference attribute; its value must be one of the IDs in a document. For example, we might have
<!ELEMENT person (#PCDATA)>
<!ATTLIST person id     ID    #REQUIRED
                 spouse IDREF #IMPLIED>
...
<person id="joe99" spouse="mary12">Mengele, Joseph</person>
<person id="mary12" spouse="joe99">Mallon, Mary</person>
<person id="rod.6">Borgia, Rodrigo</person>
There are two basic ways of processing XML: you can load the whole document into memory as a tree and walk it, or you can have the parser hand you a stream of events (element starts, element ends, runs of text) as it reads.
The event stream is a very simple way to deal with XML, or would be if the interface weren't barking mad.
Many XML parsers will give you such a stream of events as a text file in ESIS (Element Structure Information Set) format. My qh program does this, for example.
The Wall Street Journal collection is made of files like this:
<DOC>
<DOCNO> document number </DOCNO>
<DOCID> document identifier </DOCID>
... other stuff ...
</DOC>
You want to remember the document number for reporting; you want to look for words in the "... other stuff ...".
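As an illustration, here is a sketch that walks an nsgmls-style ESIS stream, remembering the DOCNO text and leaving a hook for the word-finding. The conventions assumed here ('(' opens an element, ')' closes it, '-' carries character data) are the common ESIS ones and may differ in detail from what qh emits.

#include <stdio.h>
#include <string.h>

int main(void)
{
    char line[4096], docno[256] = "";
    int in_docno = 0;

    while (fgets(line, sizeof line, stdin) != NULL) {
        line[strcspn(line, "\n")] = '\0';
        switch (line[0]) {
        case '(':                 /* element starts */
            in_docno = strcmp(line + 1, "DOCNO") == 0;
            break;
        case ')':                 /* element ends */
            if (strcmp(line + 1, "DOC") == 0)
                printf("finished %s\n", docno);
            in_docno = 0;
            break;
        case '-':                 /* character data */
            if (in_docno)
                snprintf(docno, sizeof docno, "%s", line + 1);
            else {
                /* hand line + 1 to the word breaker here */
            }
            break;
        }
    }
    return 0;
}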
If you want to process text, you have to learn about regular expressions. If you want to be an effective UNIX or Java programmer, you have to learn about regular expressions. On a UNIX system, you should read the lex(1) and/or flex(1) manual pages, also regcmp(3) -- the traditional UNIX regular expression functions -- and regcomp(3) -- the POSIX.2 regular expression functions -- and of course grep(1). For Java, it's all there in the Java API JavaDocs. There is a freely available C library providing extended regular expressions; it's called PCRE, for Perl-Compatible Regular Expressions.
In the lecture I describe the syntax of basic regular expressions and how to map them directly onto C provided they satisfy a one-character lookahead condition. Here I am going to be a little bit more formal about it. I'm going to define Empty(e) --- can e match the empty string --- and First(e) --- the set of characters that any non-empty match for e must begin with.
For the C translations, I shall assume the following declarations:
int c;           /* unsigned character or EOF */
int get(void);   /* return next character, or EOF if none */
void fail(void); /* report match failure */
The C translation of a regular expression will start with the current character (or EOF if there is none) in c.
The empty pattern, written 0, matches only the empty string.
Empty(0) = true.
First(0) = {}, the empty set of characters.
C code: no code required.
A single character k matches exactly that character.
Empty(k) = false.
First(k) = {k}.
C code:
if (c != 'k') fail();
c = get();
The dot matches any single character.
Empty(.) = false.
First(.) = the whole character set.
C code:
if (c == EOF) fail();
c = get();

You may also provide a set of characters between square brackets.
Empty([k1,...,kn]) = false.
First([k1,...,kn]) = {k1,...,kn}.
C code:
if (c != k1 && ... && c != kn) fail();
c = get();
The classification functions described in 'man ctype' are also available; [[:alpha:]] is the set of letters, for example.
You can also ask for any character that is not in a list of characters by placing ^ (a circumflex) at the beginning of the list.
Empty([^k1,...,kn]) = false.
First([^k1,...,kn]) = the whole character set \ {k1,...,kn}.
C code:
if (c == k1 || ... || c == kn) fail();
if (c == EOF) fail(); /* must match a real character */
c = get();
For the concatenation p1p2:
Empty(p1p2) = Empty(p1) & Empty(p2).
First(p1p2) = if Empty(p1) then First(p1) U First(p2) else First(p1).
This clearly generalises to any number of patterns.
C code (where if Empty(p1), we require that First(p1) and First(p2) have no element in common):
code for p1
code for p2
For the alternation (p1|p2):
Empty((p1|p2)) = Empty(p1) | Empty(p2).
First((p1|p2)) = First(p1) U First(p2).
This clearly generalises to any number of patterns.
The C translation I'm telling you about has a one-character lookahead requirement. That is, whenever we have a choice, we must be able to tell just by looking at one next character. Regular expressions as such do not have any such requirement; it is only the "direct" translation technique I'm describing that has this limitation. If we were talking about context-free grammars, this would be the LL(1) condition.
C code, provided not Empty(p1) and not Empty(p2) and First(p1) has no elements in common with First(p2):
if (p1 can begin with c) {
    code for p1
} else if (p2 can begin with c) {
    code for p2
} else {
    fail();
}
For the optional pattern p?:
Empty(p?) = true.
First(p?) = First(p).
C code:
if (p can begin with c) { code for p }
For a fixed number of repetitions, written p^n (n copies of p):
Empty(p^n) = Empty(p).
First(p^n) = First(p).
C code (same conditions as p?):
for (i = 0; i < n; i++) { code for p }
For zero or more repetitions, p*:
Empty(p*) = true.
First(p*) = First(p).
C code (same conditions as p?):
while (c in First(p)) { code for p }
For one or more repetitions, p+:
Empty(p+) = Empty(p).
First(p+) = First(p).
C code (same conditions as p?):
do { code for p } while (c in First(p));
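As a worked example of these rules (mine, for illustration), take the pattern (a|b)?c. Empty(a) = Empty(b) = Empty(c) = false. First((a|b)) = {a,b}; Empty((a|b)?) = true and First((a|b)?) = {a,b}. For the concatenation, Empty((a|b)?c) = true & false = false, and because the first factor can be empty, First((a|b)?c) = {a,b} U {c} = {a,b,c}. The one-character lookahead condition holds because {a,b} and {c} are disjoint, so the translation (with the alternation simplified by hand) is:

if (c == 'a' || c == 'b')   /* First((a|b)?) = {a,b} */
    c = get();              /* code for (a|b), simplified by hand */
if (c != 'c') fail();       /* code for c */
c = get();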
Suppose we want to handle simple XML markup. (This is really simple, OK? It is just an example.) Simple XML is a sequence of start tags <name> (possibly self-closing, <name/>), end tags </name>, the character references &lt; and &amp;, and ordinary characters other than < and &.
To strip out markup, we want to turn each tag into a newline, decode each character reference, and copy every other character through unchanged.
To satisfy the one-character lookahead condition, we have to express this as
(<(/[a-zA-Z][a-zA-Z0-9]*> |[a-zA-Z][a-zA-Z0-9]*/?> ) |&(lt;|amp;) |[^<&] )*
This translates to the following C code:
c = get();   /* don't forget this! */
while (c != EOF) {
    if (c == '<') {
        c = get();
        if (c == '/') {
            c = get();
            if (!isalpha(c)) fail();
            c = get();
            while (isalnum(c)) c = get();
            if (c != '>') fail();
            c = get();
        } else {
            if (!isalpha(c)) fail();
            c = get();
            while (isalnum(c)) c = get();
            if (c == '/') c = get();
            if (c != '>') fail();
            c = get();
        }
        putchar('\n');
    } else if (c == '&') {
        c = get();
        if (c == 'a') {
            c = get();
            if (c != 'm') fail(); c = get();
            if (c != 'p') fail(); c = get();
            if (c != ';') fail(); c = get();
            putchar('&');
        } else if (c == 'l') {
            c = get();
            if (c != 't') fail(); c = get();
            if (c != ';') fail(); c = get();
            putchar('<');
        } else {
            fail();
        }
    } else {
        putchar(c);
        c = get();
    }
}
Here the putchar lines are the actions we want to perform; everything else follows from the direct translation.
Of course we can do this in less hand-written code using a tool that understands regular expressions. For example,
%%
"<"[a-zA-Z][a-zA-Z0-9]*">"   |
"<"[a-zA-Z][a-zA-Z0-9]*"/>"  |
"</"[a-zA-Z][a-zA-Z0-9]*">"  { putchar('\n'); }
"&lt;"                       { putchar('<'); }
"&amp;"                      { putchar('&'); }
.                            { putchar(yytext[0]); }
is a complete lex(1) program for this problem. It's pretty dumb, but it's good enough for the body of this web page. (It can't handle the DOCTYPE line, the comment at the top, or the META tag, which has parameters.)
Of course, any amount of white space is allowed before the closing '>' of a tag, and tag names may include underscores, hyphens, and dots, so a better version would be
%%
"<""/"?[a-zA-Z][-_.a-zA-Z0-9]*[ \t\n]*"/"?">"  { putchar('\n'); }
"&lt;"                                         { putchar('<'); }
"&amp;"                                        { putchar('&'); }
.                                              { putchar(yytext[0]); }
Changing the lex code is much easier than changing the C code above (but you should try it as an exercise). Even when changing the C code, it is handy to have the regular expression for reference.
Beware: this is good enough for the body of this web page, but that is all. For serious work, you'll have to extend it.
The theory of turning unrestricted regular expressions into code works through several stages: build a nondeterministic finite automaton (NFA) from the regular expression, convert the NFA to a deterministic finite automaton (DFA) by the subset construction, optionally minimise the DFA, and then emit the DFA as a transition table driven by a simple loop.
The point of a DFA is that the code is very simple:
state = INITIAL_STATE;
do {
    c = get();
    switch (state = new_state[state][c]) {
    case SOME_STATE:
        user_action();
        break;
    ...
    }
} while (state != FINAL_STATE);
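To make that concrete, here is a self-contained sketch of a two-state word/non-word DFA with its table filled in from isalpha; a generated scanner would have a much larger table, but the driver loop has the same shape. The state names and the report strings are mine.

#include <ctype.h>
#include <stdio.h>

enum { OUT, IN };   /* outside a word, inside a word */

int main(void)
{
    static unsigned char new_state[2][256];
    int state = OUT, c, i;

    for (i = 0; i < 256; i++) {          /* fill in the table once */
        new_state[OUT][i] = isalpha(i) ? IN : OUT;
        new_state[IN][i]  = isalpha(i) ? IN : OUT;
    }
    while ((c = getchar()) != EOF) {     /* the driver loop */
        int next = new_state[state][c];
        if (state == OUT && next == IN)  printf("word starts\n");
        if (state == IN  && next == OUT) printf("word ends\n");
        state = next;
    }
    return 0;
}

Both rows of the table happen to be identical here, because the transition depends only on the input character; in general each state gets its own row.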
There is an important distinction here concerning who is in charge. In the translation into ordinary code I offered above, that code is in charge of reading characters: characters are pulled into the regexp code, and the state of the automaton is, in effect, the program counter. In many applications, it is easier for the system as a whole if the characters are pushed into the regexp code, and then the state of the automaton has to be explicit as a variable.
Consider breaking a file into words very very crudely. We are going to say that a word is a sequence of letters that is as long as possible. The regular expression is
([[:alpha:]]+|[^[:alpha:]])*
In pull mode, we'd write
c = get();
while (c != EOF) {
    if (isalpha(c)) {
        begin_word();
        do {
            add_to_word(c);
            c = get();
        } while (isalpha(c));
        end_word();
    } else {
        c = get();
    }
}
In push mode, we'd have to write
enum {NOT_IN_WORD, IN_WORD} state = NOT_IN_WORD;

void accept(int c)
{
    switch (state) {
    case NOT_IN_WORD:
        if (isalpha(c)) {
            begin_word();
            add_to_word(c);
            state = IN_WORD;
        } else {
            state = NOT_IN_WORD;  /* optimise away! */
        }
        break;
    case IN_WORD:
        if (isalpha(c)) {
            add_to_word(c);
            state = IN_WORD;      /* optimise away! */
        } else {
            end_word();
            state = NOT_IN_WORD;
        }
        break;
    }
}
In your situation, it is easiest to use pull mode for recognising and discarding XML markup, and push mode for breaking the remaining text into words.
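A sketch of how the two modes fit together, using the names from above; skip_tag and decode_ref are assumed helpers that wrap the tag-skipping and reference-decoding code shown earlier (each consumes its input and leaves the next character in c).

/* skip_tag() and decode_ref() are assumed helpers wrapping the
   tag-skipping and reference-decoding code shown earlier; each
   consumes its input and leaves the next character in c. */
void strip_and_break(void)
{
    c = get();
    while (c != EOF) {
        if (c == '<') {
            skip_tag();               /* pull mode: consume the markup */
            accept(' ');              /* a tag ends any word in progress */
        } else if (c == '&') {
            accept(decode_ref());     /* pull mode: consume the reference */
        } else {
            accept(c);                /* push mode: feed the word breaker */
            c = get();
        }
    }
    accept(' ');                      /* flush a final word at EOF */
}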