PS-5

The goal of this problem set is to build a personal digital assistant named "Sudi". Okay, we'll stop short of that, but we will tackle part of the speech understanding problem that a digital assistant such as Sudi would have to handle.

Background

A part of speech (POS) tagger labels each word in a sentence with its part of speech (noun, verb, etc.). For example:

I fish
Will eats the fish
Will you cook the fish
One cook uses a saw
A saw has many uses
You saw me color a fish
Jobs wore one color
The jobs were mine
The mine has many fish
You can cook many

===>

I/PRO fish/V
Will/NP eats/V the/DET fish/N
Will/MOD you/PRO cook/V the/DET fish/N
One/DET cook/N uses/V a/DET saw/N
A/DET saw/N has/V many/DET uses/N
You/PRO saw/VD me/PRO color/V a/DET fish/N
Jobs/NP wore/VD one/DET color/N
The/DET jobs/N were/VD mine/PRO
The/DET mine/N has/V many/DET fish/N
You/PRO can/MOD cook/V many/PRO

Here the tags follow the slashes, indicating, e.g., that in the first sentence "I" is a pronoun (PRO) and "fish" is a verb (V); in the second sentence, "Will" is a proper noun (NP), "eats" a verb, "the" a determiner (DET), and "fish" a noun (N). As you may notice, I purposely constructed these somewhat strange sentences so that the same word serves as a different part of speech in different places ("fish" as both a verb and a noun, "Will" as both a person's name and a future-tense modal, "saw" as both a past-tense verb and a tool, etc.). A good tagger makes use of context to disambiguate the possibilities and determine the correct label. The tagging is in turn a core part of having a "Sudi" understand what a sentence means and how to respond.

The tags that we'll use for this problem set are taken from the book Natural Language Processing with Python:

| Tag | Meaning            | Examples                             |
|-----|--------------------|--------------------------------------|
| ADJ | adjective          | new, good, high, special, big, local |
| ADV | adverb             | really, already, still, early, now   |
| CNJ | conjunction        | and, or, but, if, while, although    |
| DET | determiner         | the, a, some, most, every, no        |
| EX  | existential        | there, there's                       |
| FW  | foreign word       | dolce, ersatz, esprit, quo, maitre   |
| MOD | modal verb         | will, can, would, may, must, should  |
| N   | noun               | year, home, costs, time, education   |
| NP  | proper noun        | Alison, Africa, April, Washington    |
| NUM | number             | twenty-four, fourth, 1991, 14:24     |
| PRO | pronoun            | he, their, her, its, my, I, us       |
| P   | preposition        | on, of, at, with, by, into, under    |
| TO  | the word to        | to                                   |
| UH  | interjection       | ah, bang, ha, whee, hmpf, oops       |
| V   | verb               | is, has, get, do, make, see, run     |
| VD  | past tense         | said, took, told, made, asked        |
| VG  | present participle | making, going, playing, working      |
| VN  | past participle    | given, taken, begun, sung            |
| WH  | wh determiner      | who, which, when, what, where, how   |

In addition, each punctuation symbol has its own tag (written using the same symbol). I left punctuation out of the above example.

We'll be taking the statistical approach to tagging, which means we need data. Fortunately the problem has been studied a great deal, and there exist good datasets. A prominent one is the Brown corpus (extra credit: outdo it with a Dartmouth corpus), covering a range of works from the news, fictional stories, science writing, etc. It was annotated with a more complex set of tags, which have subsequently been distilled down to the ones above.

The goal of POS tagging is to take a sequence of words and produce the corresponding sequence of tags.

POS tagging via HMM

We will use a hidden Markov model (HMM) approach, applying the principles covered in class. Recall that in an HMM, the states are the things we don't see (hidden) and are trying to infer, and the observations are what we do see. So here the observations are the words in a sentence and the states are the tags, since the text we observe is not annotated with part of speech tags (producing those is our program's job). We proceed through the model by moving from state to state, producing one observation per state. In this "bigram" model, each tag depends on the previous tag, and each word depends on its tag. (For extra credit, you can go to a "trigram" model, where each tag depends on the previous two tags.) Let "#" be the tag "before" the start of the sentence.

An HMM is defined by its states (here part of speech tags), transitions (here tag to tag, with weights), and observations (here tag to word, with weights). Here is an HMM that includes tags, transitions, and observations for our example sentences. There is a special start state tag "#" before the start of the sentence. Note that we're not going to force a "stop" state (ending with a period, question mark, or exclamation point) since the texts we'll use from the Brown corpus include headlines that break that rule.

[Figure POS.png: the example HMM, with tags as states, weighted tag-to-tag transitions, and weighted tag-to-word observations; "#" is the start state.]

Do you see how the various sentences above follow paths through the HMM? For example, "I fish": transition from start to PRO (score -1.2); observe "I" (score -1.9); transition to V (-0.5); observe "fish" (-2.1). Likewise "Will eats the fish": start to NP (-1.6); "Will" (-0.7); to V (-0.7); "eats" (-2.1); to DET (-0.2); "the" (-1.0); to N (0); "fish" (-1.0), for a total path score of -7.3.

Of course, we as people know the parts of speech these words play in these sentences, and can figure out which paths are being followed. The goal of tagging is to have an algorithm determine what is the best path through the POS states to have produced the observed words. Given the many different out edges, even in this relatively simple case, along with the ambiguity in how words can be used, a proper implementation of the Viterbi algorithm is required to efficiently find the best path.

It could be fun to take a random walk on the HMM to generate new sentences, mad libs style.
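Before getting into the algorithm, it may help to have a concrete picture of how such a model could be held in code. Here is a minimal sketch, assuming a representation by nested maps keyed by tag; the class, field, and method names are illustrative, not a required design.

```java
import java.util.HashMap;
import java.util.Map;

/**
 * A minimal sketch of one way to hold the HMM: nested maps from a tag to a map of
 * next-tag -> log probability (transitions) and of word -> log probability (observations).
 */
public class HMMSketch {
    // transitions.get("PRO").get("V") would be log P(next tag V | current tag PRO)
    Map<String, Map<String, Double>> transitions = new HashMap<>();
    // observations.get("PRO").get("i") would be log P(word "i" | tag PRO)
    Map<String, Map<String, Double>> observations = new HashMap<>();

    /** Record a transition score from one tag to another. */
    void addTransition(String from, String to, double logProb) {
        transitions.computeIfAbsent(from, k -> new HashMap<>()).put(to, logProb);
    }

    /** Record an observation score for a word under a tag. */
    void addObservation(String tag, String word, double logProb) {
        observations.computeIfAbsent(tag, k -> new HashMap<>()).put(word, logProb);
    }
}
```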

POS Viterbi

Let's assume for now we have the model pictured above, and want to find the best path for a given sentence. We'll soon turn to how to obtain the transition and observation probabilities. The Viterbi algorithm starts at the # (start) state, with a score of 0, before any observation. Then to handle observation i, it propagates from each reached state at observation i-1, following each transition. The score for the next state as of observation i is the sum of the score at the current state as of i-1, plus the transition score from current to next, plus the score of observation i in the next. As with Dijkstra, that score may or may not be better than the score we already have for that state and observation — maybe we could get there from a different state with a better overall score. So we'll propagate forward from each current state to each next and check whether or not we've found something better (as with Dijkstra relax).

Let's start by considering "will eats the fish". Starting from #, we can transition to NP, DET, PRO, or MOD. There's an observation score for "will" only in NP and MOD, so what score do we give for observing it in the others? Maybe we don't want to completely rule out something that we've never seen (e.g., "will" can be a noun too, just not in the sentences used to build this model). So let's just give it a low log probability, call it the "unseen" score. It should be a negative number, perhaps worse than the observed ones but not totally out of the realm. For this example, I'll make it -10. So as of word 0, we could be at DET (0 - 0.9 - 10 = -10.9), MOD (0 - 2.3 - 0.7 = -3.0), NP (0 - 1.6 - 0.7 = -2.3), or PRO (0 - 1.2 - 10 = -11.2).
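In code, the relax-style update for a single current-state to next-state step, including the unseen penalty, might look roughly like this sketch (all names here, such as relax and UNSEEN_SCORE, are illustrative assumptions, not a required design):

```java
import java.util.Map;

/**
 * Sketch of the score update for one current-state -> next-state pair at one word,
 * with the "unseen" penalty for words never observed under a tag.
 */
public class ViterbiStepSketch {
    static final double UNSEEN_SCORE = -10.0;   // low log-probability stand-in for unseen words

    /** Relax-style update: keep the better of the existing and newly computed score. */
    static void relax(String curr, String next, String word,
                      Map<String, Double> currScores, Map<String, Double> nextScores,
                      Map<String, String> backpointers,
                      Map<String, Map<String, Double>> transitions,
                      Map<String, Map<String, Double>> observations) {
        double score = currScores.get(curr)                                        // best score to curr so far
                     + transitions.get(curr).get(next)                             // transition curr -> next
                     + observations.get(next).getOrDefault(word, UNSEEN_SCORE);    // observe word in next, or penalty
        if (!nextScores.containsKey(next) || score > nextScores.get(next)) {
            nextScores.put(next, score);       // found a better way to reach next for this word
            backpointers.put(next, curr);      // remember where it came from
        }
    }
}
```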

The next word, word 1, is "eats". We consider each transition from each of the current states, adding the current score plus the transition plus the observation. From DET we can go to N (-10.9 + 0 - 10 = -20.9). From MOD we can go to PRO (-3.0 - 0.7 - 10 = -13.7) or to V (-3.0 - 0.7 - 2.1 = -5.8). From NP we can go to V (-2.3 - 0.7 - 2.1 = -5.1) or to VD (-2.3 - 0.7 - 10 = -13.0). And from PRO we can go to V (-11.2 - 0.5 - 2.1 = -13.8), to VD (-11.2 - 1.6 - 10 = -22.8), or to MOD (-11.2 - 1.6 - 10 = -22.8).

Note that there are several different ways to get to V for word 1 — from MOD (score -5.8), from NP (score -5.1), or from PRO (score -13.8). We keep just the best score -5.1, along with a backpointer saying that V for word 1 comes from NP for word 0. Similarly VD for word 1 could have come from NP or PRO, and we keep its best score of -13.0 with a backpointer to NP for word 0.

To continue on, I think it's easier to keep track of these things in a table, like the following, where I use strikethrough to indicate scores and backpointers that aren't the best way to get to the next state for the observation. (Apologies if I added anything incorrectly when I manually built this table — that's why we have computers; let me know and I'll fix it.)

| #     | observation | next state | curr state | scores: curr + transit + obs | next score |
|-------|-------------|------------|------------|------------------------------|------------|
| start | n/a         | #          | n/a        | n/a                          | 0          |
| 0     | will        | DET        | #          | 0 - 0.9 - 10                 | -10.9      |
|       |             | MOD        | #          | 0 - 2.3 - 0.7                | -3.0       |
|       |             | NP         | #          | 0 - 1.6 - 0.7                | -2.3       |
|       |             | PRO        | #          | 0 - 1.2 - 10                 | -11.2      |
| 1     | eats        | N          | DET        | -10.9 + 0 - 10               | -20.9      |
|       |             | PRO        | MOD        | -3.0 - 0.7 - 10              | -13.7      |
|       |             | V          | ~~MOD~~    | -3.0 - 0.7 - 2.1             | ~~-5.8~~   |
|       |             |            | ~~PRO~~    | -11.2 - 0.5 - 2.1            | ~~-13.8~~  |
|       |             |            | NP         | -2.3 - 0.7 - 2.1             | -5.1       |
|       |             | VD         | NP         | -2.3 - 0.7 - 10              | -13.0      |
|       |             |            | ~~PRO~~    | -11.2 - 1.6 - 10             | ~~-22.8~~  |
|       |             | MOD        | PRO        | -11.2 - 1.6 - 10             | -22.8      |
| 2     | the         | DET        | V          | -5.1 - 0.2 - 1.0             | -6.3       |
|       |             |            | ~~VD~~     | -13.0 - 1.1 - 1.0            | ~~-15.1~~  |
|       |             | MOD        | PRO        | -13.7 - 1.6 - 10             | -25.3      |
|       |             | PRO        | V          | -5.1 - 1.9 - 10              | -17.0      |
|       |             |            | ~~VD~~     | -13.0 - 0.4 - 10             | ~~-23.4~~  |
|       |             |            | ~~MOD~~    | -22.8 - 0.7 - 10             | ~~-33.5~~  |
|       |             | V          | ~~N~~      | -20.9 - 0.3 - 10             | ~~-31.2~~  |
|       |             |            | PRO        | -13.7 - 0.5 - 10             | -24.2      |
|       |             |            | ~~MOD~~    | -22.8 - 0.7 - 10             | ~~-33.5~~  |
|       |             | VD         | ~~N~~      | -20.9 - 1.4 - 10             | ~~-32.3~~  |
|       |             |            | PRO        | -13.7 - 1.6 - 10             | -25.3      |
| 3     | fish        | DET        | V          | -24.2 - 0.2 - 10             | -34.4      |
|       |             |            | ~~VD~~     | -25.3 - 1.1 - 10             | ~~-36.4~~  |
|       |             | MOD        | PRO        | -17.0 - 1.6 - 10             | -28.6      |
|       |             | N          | DET        | -6.3 + 0 - 1.0               | -7.3       |
|       |             | PRO        | ~~MOD~~    | -25.3 - 0.7 - 10             | ~~-36.0~~  |
|       |             |            | ~~V~~      | -24.2 - 1.9 - 10             | ~~-36.1~~  |
|       |             |            | VD         | -25.3 - 0.4 - 10             | -35.7      |
|       |             | V          | ~~MOD~~    | -25.3 - 0.7 - 2.1            | ~~-28.1~~  |
|       |             |            | PRO        | -17.0 - 0.5 - 2.1            | -19.6      |
|       |             | VD         | PRO        | -17.0 - 1.6 - 10             | -28.6      |
Important: We are not deciding for each word which tag is best. We are deciding for each POS tag that we reach for that word what its best score and preceding state are. This will ultimately yield the best path through the graph to generate the entire sentence, disambiguating the parts of speech of the various observed words along the way.

After working through the sentence, we look at the possible states for the last observation, word 3, to see where we could end up with the best possible score. Here it's N, at -7.3. So we know the tag for "fish" is N. We then backtrace to where this state came from: DET. So now we know the tag for "the" is DET. Backtracing again goes to V, for "eats". Then back to NP, for "will". Then #, so we're done. The most likely tags for "Will eats the fish" are: NP V DET N
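Putting the steps together, a decoder along these lines might look like the following sketch, assuming the nested-map representation of transitions and observations sketched earlier (the method and variable names are mine, not a required design):

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Sketch of Viterbi decoding over maps of log probabilities, following the description above. */
public class ViterbiSketch {
    Map<String, Map<String, Double>> transitions;   // tag -> (next tag -> log prob)
    Map<String, Map<String, Double>> observations;  // tag -> (word -> log prob)
    double unseenScore = -10.0;                     // penalty for a word never observed under a tag
    static final String START = "#";                // start state before the sentence

    /** Returns the best tag sequence for the given words. */
    List<String> viterbi(String[] words) {
        // backTrace.get(i).get(tag) = previous tag on the best path reaching tag at word i
        List<Map<String, String>> backTrace = new ArrayList<>();
        Map<String, Double> currScores = new HashMap<>();
        currScores.put(START, 0.0);                 // start at # with score 0, before any word

        for (int i = 0; i < words.length; i++) {
            Map<String, Double> nextScores = new HashMap<>();
            Map<String, String> pointers = new HashMap<>();
            for (String curr : currScores.keySet()) {
                if (!transitions.containsKey(curr)) continue;   // no outgoing transitions recorded
                for (String next : transitions.get(curr).keySet()) {
                    double score = currScores.get(curr)
                                 + transitions.get(curr).get(next)
                                 + observations.get(next).getOrDefault(words[i], unseenScore);
                    if (!nextScores.containsKey(next) || score > nextScores.get(next)) {
                        nextScores.put(next, score);    // better way to reach next at word i
                        pointers.put(next, curr);       // remember the backpointer
                    }
                }
            }
            backTrace.add(pointers);
            currScores = nextScores;
        }

        // Find the best-scoring state for the last word, then trace backpointers to the start.
        String best = null;
        for (String tag : currScores.keySet()) {
            if (best == null || currScores.get(tag) > currScores.get(best)) best = tag;
        }
        List<String> tags = new ArrayList<>();
        for (int i = words.length - 1; i >= 0; i--) {
            tags.add(0, best);                      // prepend so tags end up in sentence order
            best = backTrace.get(i).get(best);      // follow the backpointer
        }
        return tags;
    }
}
```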

POS Training

Let's consider training the pictured model from the example sentences and corresponding tags. Basically, we just go through each sentence and count up, for each tag, how many times each word was observed with that tag, and how many times each other tag followed it. So we essentially want to collect tables (or really maps) like these:

Transitions (rows: state; columns: next state, plus a row total):

| state | DET | MOD | N | NP | PRO | V | VD | total |
|-------|-----|-----|---|----|-----|---|----|-------|
| #     |     |     |   |    |     |   |    |       |
| DET   |     |     |   |    |     |   |    |       |
| MOD   |     |     |   |    |     |   |    |       |
| N     |     |     |   |    |     |   |    |       |
| NP    |     |     |   |    |     |   |    |       |
| PRO   |     |     |   |    |     |   |    |       |
| V     |     |     |   |    |     |   |    |       |
| VD    |     |     |   |    |     |   |    |       |

Observations (rows: state; columns: word, plus a row total):

| state | cook | eats | fish | i | the | will | you | ... | total |
|-------|------|------|------|---|-----|------|-----|-----|-------|
| DET   |      |      |      |   |     |      |     |     |       |
| MOD   |      |      |      |   |     |      |     |     |       |
| N     |      |      |      |   |     |      |     |     |       |
| NP    |      |      |      |   |     |      |     |     |       |
| PRO   |      |      |      |   |     |      |     |     |       |
| V     |      |      |      |   |     |      |     |     |       |
| VD    |      |      |      |   |     |      |     |     |       |

The first sentence "I/PRO fish/V" puts a 1 in the transition from # to PRO, along with a 1 in that from PRO to V. It also puts a 1 in the observation of "i" in PRO and that of "fish" in V. The next sentence "Will/NP eats/V the/DET fish/N" puts a 1 in transitions # to NP, NP to V, V to DET, and DET to N. It likewise puts 1 in observations "will" in NP, "eats" in V, "the" in DET, and "fish" in N. The third sentence "Will/MOD you/PRO cook/V the/DET fish/N" puts a 1 in transitions # to MOD and MOD to PRO; it increments by 1 the transition PRO to V (already had 1 from the first sentence) along with V to DET and DET to N (both already had 1 from the second sentence). It also puts a 1 for "will" in MOD, "you" in PRO, and "cook" in V, while incrementing by 1 both "the" in DET and "fish" in N (already had 1 from the second sentence). Thus we have:

Transitions (after the first three sentences):

| state | DET | MOD | N | NP | PRO | V | VD | total |
|-------|-----|-----|---|----|-----|---|----|-------|
| #     |     | 1   |   | 1  | 1   |   |    | 3     |
| DET   |     |     | 2 |    |     |   |    | 2     |
| MOD   |     |     |   |    | 1   |   |    | 1     |
| N     |     |     |   |    |     |   |    |       |
| NP    |     |     |   |    |     | 1 |    | 1     |
| PRO   |     |     |   |    |     | 2 |    | 2     |
| V     | 2   |     |   |    |     |   |    | 2     |
| VD    |     |     |   |    |     |   |    |       |

Observations (after the first three sentences):

| state | cook | eats | fish | i | the | will | you | ... | total |
|-------|------|------|------|---|-----|------|-----|-----|-------|
| DET   |      |      |      |   | 2   |      |     |     | 2     |
| MOD   |      |      |      |   |     | 1    |     |     | 1     |
| N     |      |      | 2    |   |     |      |     |     | 2     |
| NP    |      |      |      |   |     | 1    |     |     | 1     |
| PRO   |      |      |      | 1 |     |      | 1   |     | 2     |
| V     | 1    | 1    | 1    |   |     |      |     |     | 3     |
| VD    |      |      |      |   |     |      |     |     |       |

Finally, we convert these counts to log probabilities by dividing each count by the total for its row and taking the log. With the limited experience of only 3 sentences, this HMM has seen only one type of transition from each state other than the start state, and has only a limited vocabulary of one or two words for each state. So the transition probability from # to each of MOD, NP, and PRO is log(1/3), while that from DET to N is log(2/2) = 0. Likewise, the observation probability of each of "cook", "eats", and "fish" in V is log(1/3), while that of "i" and "you" in PRO is log(1/2). But if we trained on the rest of the example sentences, we'd have a somewhat more interesting set of transition and observation counts, whose conversion yields the numbers I gave in the figure (rounded off).
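In code, the counting and the conversion to log probabilities might look roughly like the following sketch, again assuming the nested-map representation; countSentence and toLogProbabilities are illustrative names, and a real solution would read the training files line by line, lower-case the words, and call something like these methods.

```java
import java.util.HashMap;
import java.util.Map;

/**
 * Sketch of training: count transitions and observations from parallel arrays of
 * words and tags for one sentence, then convert counts to (natural) log probabilities.
 */
public class TrainingSketch {
    Map<String, Map<String, Double>> transitions = new HashMap<>();   // counts, later log probs
    Map<String, Map<String, Double>> observations = new HashMap<>();
    static final String START = "#";

    /** Count the transitions and observations contributed by one tagged sentence. */
    void countSentence(String[] words, String[] tags) {
        String prev = START;
        for (int i = 0; i < words.length; i++) {
            increment(transitions, prev, tags[i]);            // prev tag -> this tag
            increment(observations, tags[i], words[i]);       // this tag -> this word
            prev = tags[i];
        }
    }

    static void increment(Map<String, Map<String, Double>> table, String row, String col) {
        table.computeIfAbsent(row, k -> new HashMap<>()).merge(col, 1.0, Double::sum);
    }

    /** Divide each count by the total for its row and take the log. */
    static void toLogProbabilities(Map<String, Map<String, Double>> table) {
        for (Map<String, Double> row : table.values()) {
            double total = 0;
            for (double count : row.values()) total += count;
            for (String col : row.keySet()) row.put(col, Math.log(row.get(col) / total));
        }
    }
}
```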

Testing

To assess how good a model is, we can compute how many tags it gets right and how many it gets wrong on some test sentences. (Even tougher: how many sentences does it get entirely right vs. at least one tag wrong.) It wouldn't be fair to test it on the sentences that we used to train it (though if it did poorly on those, we'd be worried).

Provided in texts.zip are sets of files: one pair with the sentences and the corresponding tags to be used for training, and another pair with the sentences and tags for testing. Each line is a single sentence (or a headline), cleanly separated by whitespace into words/tokens, with punctuation also thus separated out. Note that the lines correspond token-to-tag: the 0th token on a line in the sentences file goes with the 0th tag on the same line in the tags file, ..., the ith token with the ith tag, .... (That also applies to punctuation, corresponding token-to-tag, mixed in among the words.)

So use the train sentences and train tags files to generate the HMM. Then apply the HMM to each line in the test sentences file, and compare the results to the corresponding test tags line. Count the numbers of correct vs. incorrect tags.
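A file-based evaluation might look roughly like the following sketch; the file names and the viterbi(...) call are placeholders (my assumptions) for whatever your own classes and the texts.zip files actually provide.

```java
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.List;

/**
 * Sketch of file-based testing: read corresponding lines from a test sentences file
 * and a test tags file, tag each sentence, and count matching vs. mismatching tags.
 */
public class EvaluateSketch {
    public static void main(String[] args) throws IOException {
        int right = 0, wrong = 0;
        // Placeholder file names; substitute the actual test files from texts.zip.
        try (BufferedReader sentences = new BufferedReader(new FileReader("test-sentences.txt"));
             BufferedReader tags = new BufferedReader(new FileReader("test-tags.txt"))) {
            String sentenceLine, tagLine;
            while ((sentenceLine = sentences.readLine()) != null && (tagLine = tags.readLine()) != null) {
                String[] words = sentenceLine.toLowerCase().split("\\s+");
                String[] expected = tagLine.split("\\s+");
                List<String> predicted = viterbi(words);   // replace with your tagger, e.g. hmm.viterbi(words)
                for (int i = 0; i < expected.length && i < predicted.size(); i++) {
                    if (predicted.get(i).equals(expected[i])) right++;
                    else wrong++;
                }
            }
        }
        System.out.println(right + " tags right, " + wrong + " wrong");
    }

    // Placeholder so the sketch compiles on its own; replace with your actual decoder.
    static List<String> viterbi(String[] words) { throw new UnsupportedOperationException(); }
}
```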

Implementation Notes

Training Viterbi tagging

Exercises

For this assignment, you may work either alone, or with one other partner. Same discussion as always about partners.

  1. Write a method to perform Viterbi decoding to find the best sequence of tags for a line (sequence of words).
  2. Test the method on simple hard-coded graphs and input strings (e.g., from the programming drill, along with others you make up). In your report, discuss the tests and how they convinced you of your code's correctness.
  3. Write a method to train a model (observation and transition probabilities) on corresponding lines (sentence and tags) from a pair of training files.
  4. Write a console-based test method to give the tags from an input line (a minimal version of such a loop is sketched after this list).
  5. Write a file-based test method to evaluate the performance on a pair of test files (corresponding lines with sentences and tags).
  6. Train and test with the example cases provided in texts.zip: "example" is the set of sentences and tags above, "simple" is just some made-up sentences and tags, and "brown" is the full corpus of sentences and tags. The sample solution got 32 tags right and 5 wrong for simple, and 35109 right vs. 1285 wrong for brown, with an unseen-word penalty of -100. (Note: these are using natural log; if you use log10 there might be some differences.) In a short report, provide some example new sentences that are tagged as expected and some that aren't, discussing why. Also discuss your overall testing performance, and how it depends on the unseen-word penalty (and any other parameters you use).
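For the console-driven test in exercise 4, something as simple as the following sketch would do; the viterbi(...) call is a placeholder for your own decoder.

```java
import java.util.List;
import java.util.Scanner;

/** Sketch of a console-driven test: read a line, tag it, print the tags. */
public class ConsoleTestSketch {
    public static void main(String[] args) {
        Scanner in = new Scanner(System.in);
        System.out.println("Enter a sentence (or an empty line to quit):");
        String line;
        while (!(line = in.nextLine()).isEmpty()) {
            String[] words = line.toLowerCase().split("\\s+");
            List<String> tags = viterbi(words);            // your decoder here
            System.out.println(String.join(" ", tags));    // print tags in sentence order
        }
    }

    // Placeholder so the sketch compiles on its own; replace with your actual decoder.
    static List<String> viterbi(String[] words) { throw new UnsupportedOperationException(); }
}
```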

Some ideas for extra credit.

Submission Instructions

Turn in a single zipfile containing all your code and your short report (testing on hard-coded graphs, training/testing on provided datasets).

Acknowledgment

Thanks to Professor Chris Bailey-Kellogg for a rewrite of this problem that greatly clarifies the details around scoring. Also thanks to Sravana Reddy for consulting in the development of the problem set, providing helpful guidance and discussion as well as the processed dataset.

Grading rubric

Total of 100 points

Correctness (70 points)

 5  Loading data, splitting into lower-case words
15  Training: counting transitions and observations
 5  Training: log probabilities
15  Viterbi: computing scores and maintaining back pointers
 5  Viterbi: start; best end
10  Viterbi: backtrace
 7  Console-driven testing
 8  File-driven testing

Structure (10 points)

4  Good decomposition into objects and methods
3  Proper use of instance and local variables
3  Proper use of parameters

Style (10 points)

3  Comments for classes and methods
4  Good names for methods, variables, parameters
3  Layout (blank lines, indentation, no line wraps, etc.)

Testing (10 points)

5  Example sentences and discussion
5  Testing performance and discussion