COSC 463 Assignment 1 — 2010
The aims of this assignment
The aims include:
- This is a Computer Science paper; where would we be without
some programming?
- Regular expressions are a very powerful notation for
describing strings. I want you to get some practice doing
something that's not too hard with them.
- XML is a very important and widely used way of holding
semi-structured data these days. I want you to know what it
looks like and to do something simple with it, and
the simplest thing is to strip it out.
- This is an information retrieval paper. I want you to
do something interesting, or at least mildly
interesting, with words.
- There will be a second assignment in which you will
have to extract words from XML. You might as well
get used to it now.
- Zipf's law is an interesting fact about natural language
(and some other systems too, although that's outside our
scope). I want you to find out what it is, and to see for
yourselves on real data just how much you should believe in it.
Resources
For this assignment, I have provided three books marked
up in XML. I got the books as plain text from
Project
Gutenberg, and added the XML markup myself. (Some of
the markup was automated, but by no means all of it.)
These books were chosen to be different in topic and style,
not because I endorse the contents of any of them.
I'll ask you to count all words, all content words, and
all function words. I've provided a little AWK program
filter.awk that in its present
form selects the function words from a token stream on
standard input and copies them to standard output. You can
easily change this, by adding a "!" in the right place, to
a program that selects content words, and by removing a
test, to a program that selects alphabetic words whatever
they are.
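If you would like to see the general shape of such a filter
before you start editing, it might look roughly like this.
This is only a sketch, not the provided file; among other
things, the real function-word list is far longer:
BEGIN {
    # a handful of function words; filter.awk's real list is much longer
    n = split("the of and a to in is it that he was for on", w)
    for (i = 1; i <= n; i++) fw[w[i]] = 1
}
/^[a-zA-Z]/ {
    # adding "!" to make this !($1 in fw) selects content words instead;
    # deleting the test (keeping just the print) passes every alphabetic word
    if ($1 in fw) print $1
}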
I'm also providing a part-of-speech tagger called QTag.
It's a Java program, and the author does not provide sources
for it. There is a newer version than the one I provide,
but this version seems to be simpler to use.
You are asked to plot some graphs. You may use any
software you like for this, but I recommend
R, which is already on the
lab machines.
- book.dtd — the Document Type
Definition for the books. The lazy man's way to strip out the
XML markup is to use an XML parser and just catch the PCDATA
items as they fly past; if you want to handle the character
entity names used in these books correctly you will need this
file to tell you (and your parser) what they are. I recommend
that you not use a parser, but write your own code.
- doyle-history.xml —
Sir Arthur Conan Doyle's History of Spiritualism, Volume 1.
This is popular (rather than academic) history.
- wallace-frog.xml —
Edgar Wallace's The Fellowship of the Frog. This is
a “thriller”. Beware: this file contains the
letter é. You must decide what to do with
this letter.
- witt.xml — Wittgenstein's
Tractatus Logico-Philosophicus, in English translation.
This is philosophy. The numbers on each paragraph are original;
I didn't add them. You should decide what to do about them,
because they do show up in the statistics.
- qtag.jar — Oliver Mason's
part-of-speech tagger. Its input must be one token
per line, where a token is a word or a punctuation mark.
Em-dashes should be treated as a single punctuation mark.
- qtag-eng.lex and
qtag-eng.mat are
two data files needed by qtag.jar. They are compressed; do
not decompress them, or QTag won't work.
- R should already be installed on the lab machines.
What you must do
- Decide what is to count as a word. Tell me in plain text,
and again by writing a regular expression. In particular, you
must decide what to do about hyphens, what to do about
apostrophes, what to do about abbreviation points, what to do
about alphabetic case, what to do about numbers, and if you
decide to treat numbers as words, what to do about decimal
points, thousands separators, and fractions like ½ and
22/7.
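For example (one defensible choice, not the required answer),
taking a word to be letters possibly joined by internal hyphens
or apostrophes gives the egrep-style expression
[A-Za-z]+([-'][A-Za-z]+)*
which keeps "well-known" and "don't" whole and deliberately
rejects numbers. Your decisions may well differ; the point is
that your regular expression must record them.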
- Convert each of the books to a token sequence, one word
or one punctuation mark per line. Note that some punctuation
marks (like dashes) may be written as more than one character
(two dashes, for example).
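A rough starting point (a sketch only; it assumes no tag spans
a line break, and it ignores entity references, including the
one behind é) might be
sed 's/<[^>]*>//g' <book.xml |
grep -oE "[A-Za-z]+([-'][A-Za-z]+)*|--|[[:punct:]]" >tokens
Here grep -o prints each match on its own line, and the
alternation takes the longest match, so a two-dash em-dash
comes out as a single token. You will have to do better than
this sketch.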
- Count Words
Given a file with one word per line, we can
select those lines that are not numbers or punctuation
marks by doing
grep '^[a-zA-Z]' <tokens >words
Alternatively, as noted above, you can use
a copy of filter.awk with the
is-a-function-word test deleted:
awk -f filter.awk <tokens >words
To find out how often things occur, we can bring equal
ones together using sort(1) and then count them using
uniq(1) with the -c option. To find out the commonest
ones, we can then sort the output of uniq(1) into descending
order by number, and take the top N. So to find the top
30 words,
grep '^[a-zA-Z]' <tokens | sort | uniq -c |
sort -nr | head -30
I want you to count all words, to count function words, and
to count content words (defined as any words other than
function words). Make three copies of
filter.awk: filter-all.awk,
filter-function.awk, and filter-content.awk. Make the
appropriate changes to filter-content.awk and filter-all.awk.
I want you to
- For each book, find the 30 most frequent words.
Produce a single page showing
        Wallace          Doyle            Wittgenstein
        %1   $1          %1'  $1'         %1"  $1"
        ...
        %30  $30         %30' $30'        %30" $30"
where the %i entries are the percentage of all word occurrences
represented by the ith word, and the $i entries are
the ith word, for each document.
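One way to compute a single book's column, assuming
words holds one word per line:
total=$(wc -l <words)
sort <words | uniq -c | sort -nr | head -30 |
awk -v t="$total" '{printf "%6.2f%% %s\n", 100*$1/t, $2}'
Pasting the three columns onto one page is left to you.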
Comment on what you find.
- For each book, find, display, and comment on, the 30 most
common content words.
- For each book, find, display, and comment on, the 30 most
common function words.
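The counting pipeline above serves for both of these items;
only the filter changes:
awk -f filter-function.awk <tokens | sort | uniq -c | sort -nr | head -30
awk -f filter-content.awk <tokens | sort | uniq -c | sort -nr | head -30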
- For each book, find the full list of all words with their
counts (grep, sort, uniq -c, sort) and throw away the words,
just keeping the counts.
You can use
awk '{print $1}'
to copy the counts to standard output. According to
Zipf's law,
the frequency of the kth word in the list should be
proportional to 1/k.
For each of the three books, plot a graph with
log10(number of occurrences) as the vertical axis and
log10(position in list) as the horizontal axis.
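If you take the R suggestion, one quick way to get such a plot
(a sketch; run non-interactively like this, R writes the plot
to Rplots.pdf):
sort <words | uniq -c | sort -nr | awk '{print $1}' >counts
Rscript -e 'y <- scan("counts"); plot(log10(seq_along(y)), log10(y), xlab="log10 rank", ylab="log10 frequency")'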
Do the graphs have the same shape? What are their shapes?
How well do these documents obey Zipf's law?
- Look up "Heaps' law". What does it predict about these
documents? Is that prediction true?
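Heaps' law is about how the number of distinct words grows as
more tokens are read. One sketch for collecting that data,
again assuming words holds one word per line:
awk '!($1 in seen) { seen[$1] = 1; v++ } { print NR, v }' <words >heaps
Each output line pairs the tokens read so far with the
vocabulary size so far; a log-log plot of heaps will show you
how well the prediction holds.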
- For each of the tokenised documents, run it through the
part-of-speech tagger. We just want the tags.
your-program <book |
java -jar qtag.jar qtag-eng -f tab /dev/stdin |
awk '{print $2}' >book.pos
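Counting the tags afterwards is the familiar pipeline:
sort <book.pos | uniq -c | sort -nr | head -30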
For each book, find, display, and comment on, the 30 most
common part-of-speech tags. The question I'm interested in
here is "each book has its own vocabulary; does it look as
though each has its own (sub)grammar?"
- Do the part-of-speech tags follow Zipf's law? What is
it about the set of tags that means we should expect that?