COSC 463 Assignment 1 — 2010
The aims of this assignment
The aims include:
- This is a Computer Science paper; where would we be without
some programming?
- Regular expressions are a very powerful notation for
describing strings. I want you to get some practice doing
something that's not too hard with them.
- XML is a very important and widely used way of holding
semi-structured data these days. I want you to know what it
looks like and to do something simple with it, and
the simplest thing is to strip it out.
- This is an information retrieval paper. I want you to
do something interesting, or at least mildly
interesting, with words.
- There will be a second assignment in which you will
have to extract words from XML. You might as well
get used to it now.
- Zipf's law is an interesting fact about natural language
(and some other systems too, although that's outside our
scope). I want you to find out what it is, and to see for
yourselves on real data just how much you should believe in it.
Resources
For this assignment, I have provided three books marked
up in XML. I got the books as plain text from
Project
Gutenberg, and added the XML markup myself. (Some of
the markup was automated, but by no means all of it.)
These books were chosen to be different in topic and style,
not because I endorse the contents of any of them.
I'll ask you to count all words, all content words, and
all function words. I've provided a little AWK program
filter.awk that in its present
form selects the function words from a token stream on
standard input and copies them to standard output. You can
easily change this, by adding a "!" in the right place, to
a program that selects content words, and by removing a
test, to a program that selects alphabetic words whatever
they are.
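If you would like to see the general shape of such a filter
before you start editing, it might look roughly like this.
This is only a sketch, not the provided file; among other
things, the real function-word list is far longer:
BEGIN {
    # a handful of function words; filter.awk's real list is much longer
    n = split("the of and a to in is it that he was for on", w)
    for (i = 1; i <= n; i++) fw[w[i]] = 1
}
/^[a-zA-Z]/ {
    # adding "!" to make this !($1 in fw) selects content words instead;
    # deleting the test (keeping just the print) passes every alphabetic word
    if ($1 in fw) print $1
}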
I'm also providing a part-of-speech tagger called QTag.
It's a Java program, and the author does not provide sources
for it. There is a newer version than the one I provide,
but this version seems to be simpler to use.
You are asked to plot some graphs. You may use any
software you like for this, but I recommend
R, which is already on the
lab machines.
- book.dtd — the Document Type
Definition for the books. The lazy man's way to strip out the
XML markup is to use an XML parser and just catch the PCDATA
items as they fly past; if you want to handle the character
entity names used in these books correctly you will need this
file to tell you (and your parser) what they are. I recommend
that you not use a parser, but write your own code.
- doyle-history.xml —
Sir Arthur Conan Doyle's History of Spiritualism, Volume 1.
This is popular (rather than academic) history.
- wallace-frog.xml —
Edgar Wallace's The Fellowship of the Frog. This is
a “thriller”. Beware: this file contains the
letter é. You must decide what to do with
this letter.
- witt.xml — Wittgenstein's
Tractatus Logico-Philosophicus, in English translation.
This is philosophy. The numbers on each paragraph are original;
I didn't add them. You should decide what to do about them,
because they do show up in the statistics.
- qtag.jar — Oliver Mason's
part-of-speech tagger. Its input must be one token
per line, where a token is a word or a punctuation mark.
Em-dashes should be treated as a single punctuation mark.
- qtag-eng.lex and
qtag-eng.mat are
two data files needed by qtag.jar. They are compressed; do
not decompress them, or QTag won't work.
- R should already be installed on the lab machines.
What you must do
- Decide what is to count as a word. Tell me in plain text,
and again by writing a regular expression. In particular, you
must decide what to do about hyphens, what to do about
apostrophes, what to do about abbreviation points, what to do
about alphabetic case, what to do about numbers, and if you
decide to treat numbers as words, what to do about decimal
points, thousands separators, and fractions like ½ and
22/7.
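For example (one defensible choice, not the required answer),
taking a word to be letters possibly joined by internal hyphens
or apostrophes gives the egrep-style expression
[A-Za-z]+([-'][A-Za-z]+)*
which keeps "well-known" and "don't" whole and deliberately
rejects numbers. Your decisions may well differ; the point is
that your regular expression must record them.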
- Convert each of the books to a token sequence, one word
or one punctuation mark per line. Note that some punctuation
marks (like dashes) may be written as more than one character
(two dashes, for example).
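A rough starting point (a sketch only; it assumes no tag spans
a line break, and it ignores entity references, including the
one behind é) might be
sed 's/<[^>]*>//g' <book.xml |
grep -oE "[A-Za-z]+([-'][A-Za-z]+)*|--|[[:punct:]]" >tokens
Here grep -o prints each match on its own line, and the
alternation takes the longest match, so a two-dash em-dash
comes out as a single token. You will have to do better than
this sketch.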
- Count Words
Given a file with one word per line, we can
select those lines that are not numbers or punctuation
marks by doing
grep '^[a-zA-Z]' <tokens >words
Alternatively, as noted above, you can use
a copy of filter.awk with the
is-a-function-word test deleted:
awk -f filter.awk <tokens >words
To find out how often things occur, we can bring equal
ones together using sort(1) and then count them using
uniq(1) with the -c option. To find out the commonest
ones, we can then sort the output of uniq(1) into descending
order by number, and take the top N. So to find the top
30 words,
grep '^[a-zA-Z]' <tokens | sort | uniq -c |
sort -nr | head -30
I want you to count all words, to count function words, and
to count content words (defined as any words other than
function words). Make three copies of
filter.awk: filter-all.awk,
filter-function.awk, and filter-content.awk. Make the
appropriate changes to filter-content.awk and filter-all.awk.
I want you to
- For each book, find the 30 most frequent words.
Produce a single page showing
        Wallace          Doyle            Wittgenstein
        %1   $1          %1'  $1'         %1"  $1"
        ...
        %30  $30         %30' $30'        %30" $30"
where the %i entries are the percentage of all word occurrences
represented by the ith word, and the $i entries are
the ith word, for each document.
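One way to compute a single book's column, assuming
words holds one word per line:
total=$(wc -l <words)
sort <words | uniq -c | sort -nr | head -30 |
awk -v t="$total" '{printf "%6.2f%% %s\n", 100*$1/t, $2}'
Pasting the three columns onto one page is left to you.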
Comment on what you find.
- For each book, find, display, and comment on, the 30 most
common content words.
- For each book, find, display, and comment on, the 30 most
common function words.
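The counting pipeline above serves for both of these items;
only the filter changes:
awk -f filter-function.awk <tokens | sort | uniq -c | sort -nr | head -30
awk -f filter-content.awk <tokens | sort | uniq -c | sort -nr | head -30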
- For each book, find the full list of all words with their
counts (grep, sort, uniq -c, sort) and throw away the words,
just keeping the counts.
You can use
awk '{print $1}'
to copy the counts to standard output. According to
Zipf's law,
the frequency of the kth word in the list should be
proportional to 1/k.
For each of the three books, plot a graph with
log10(number of occurrences) as the vertical axis and
log10(position in list) as the horizontal axis.
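If you take the R suggestion, one quick way to get such a plot
(a sketch; run non-interactively like this, R writes the plot
to Rplots.pdf):
sort <words | uniq -c | sort -nr | awk '{print $1}' >counts
Rscript -e 'y <- scan("counts"); plot(log10(seq_along(y)), log10(y), xlab="log10 rank", ylab="log10 frequency")'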
Do the graphs have the same shape? What are their shapes?
How well do these documents obey Zipf's law?
- Look up "Heaps' law". What does it predict about these
documents? Is that prediction true?
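Heaps' law is about how the number of distinct words grows as
more tokens are read. One sketch for collecting that data,
again assuming words holds one word per line:
awk '!($1 in seen) { seen[$1] = 1; v++ } { print NR, v }' <words >heaps
Each output line pairs the tokens read so far with the
vocabulary size so far; a log-log plot of heaps will show you
how well the prediction holds.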
- For each of the tokenised documents, run it through the
part-of-speech tagger. We just want the tags.
your-program <book |
java -jar qtag.jar qtag-eng -f tab /dev/stdin |
awk '{print $2}' >book.pos
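Counting the tags afterwards is the familiar pipeline:
sort <book.pos | uniq -c | sort -nr | head -30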
For each book, find, display, and comment on, the 30 most
common part-of-speech tags. The question I'm interested in
here is "each book has its own vocabulary; does it look as
though each has its own (sub)grammar?"
- Do the part-of-speech tags follow Zipf's law? What is
it about the set of tags that means we should expect that?