Due at 11:59 PM Tuesday, Nov 21 via Canvas. PDF format required.

This assignment has two parts -- first, an exercise in computing the Bayesian score used in structure learning, and then a coding exercise on Expectation-Maximization.

Part 1

(3 points) See hm4_part1.pdf.

Part 2

In this part of the assignment, you'll complete a partially written EM implementation for a "set expansion" task. You'll want to download the code and data files.
The Bayes Net you're performing EM over has the structure word->topic->context, where the topics are never observed. The topic variable is often referred to as "z" in the code. The words (see wordDictionary.txt) are the names of either companies or countries, and the contexts (see contextDictionary.txt) are two-word phrases observed to the left or the right of the words in a set of sentences on the Web (the original text is in corpusSelection.txt). The actual occurrences of words in contexts are listed (using the IDs from the dictionaries) in data.txt.
The code already has routines for reading/writing these files, so you won't need to write any code to process them.
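
Concretely, the model says that the probability of seeing word w in context c is P(w, c) = P(w) * sum_z P(z | w) P(c | z), where the sum runs over the hidden topics. Below is a minimal sketch of that computation; the parameter names are made up for illustration and are not the actual fields in TextMasterEM.java.

// Sketch only: scoring an observed (word, context) pair under the
// word -> topic -> context model by marginalizing over the hidden topic z.
// The tables pWord, pTopicGivenWord, and pContextGivenTopic are illustrative
// names, not the actual fields in TextMasterEM.java.
double pairProbability(int w, int c,
                       double[] pWord,                  // P(w)
                       double[][] pTopicGivenWord,      // P(z | w), indexed [w][z]
                       double[][] pContextGivenTopic) { // P(c | z), indexed [z][c]
    double sum = 0.0;
    for (int z = 0; z < pTopicGivenWord[w].length; z++) {
        sum += pTopicGivenWord[w][z] * pContextGivenTopic[z][c];
    }
    return pWord[w] * sum; // P(w, c) = P(w) * sum_z P(z | w) P(c | z)
}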

You can think of training the model as a dimensionality reduction task: the goal is to summarize all of the contextual information about each word w in data.txt in a small vector of numbers P(z | w). With those summaries, we can perform set expansion: we efficiently expand a set of "seed" examples (for example, UK, Libya, Myanmar) by searching for words w' whose P(z | w') is similar to those of the seeds (for this example, the result is hopefully a list of countries).
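
As a sketch of that test-time search, one plausible choice is to rank candidates by distance to the seeds' mean topic distribution; the distance measure here (L1) is an assumption, not necessarily what the test code in TextMasterEM.java uses.

// Sketch only: rank a candidate word by how close its topic distribution
// P(z | w') is to the average topic distribution of the seed words.
// The L1 distance is one plausible similarity measure, chosen for illustration.
double similarityToSeeds(double[] candidateTopicDist, double[][] seedTopicDists) {
    int numTopics = candidateTopicDist.length;
    double[] seedMean = new double[numTopics];
    for (double[] seed : seedTopicDists) {
        for (int z = 0; z < numTopics; z++) {
            seedMean[z] += seed[z] / seedTopicDists.length;
        }
    }
    double l1 = 0.0;
    for (int z = 0; z < numTopics; z++) {
        l1 += Math.abs(candidateTopicDist[z] - seedMean[z]);
    }
    return -l1; // larger (closer to zero) means more similar
}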

The Code

Even if you don't know Java, you should still be able to complete this assignment. The code is in the file TextMasterEM.java, and can be compiled using
javac TextMasterEM.java
on any machine with a recent Java JDK installation. You can then run the code in two modes, "train" or "test." For example:
java TextMasterEM train data.txt 10000 model.txt 6
java TextMasterEM test model.txt wordDictionary.txt companies.txt Shell BT Dow
The first example trains a model on data.txt for 10,000 iterations with 6 topics; the second tests the model, using companies.txt as the ground truth and the three "seed" examples Shell, BT, and Dow. The test script outputs the words in decreasing order of similarity to the seeds, along with the "average precision" of that ranked list and the baseline performance of a random list. A sample model is included in the download, so you can try the test script before you fix the training routine.
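
For reference, "average precision" is the standard ranked-retrieval measure: walk down the ranked list, record the precision each time a ground-truth word is reached, and average those precisions. A minimal sketch of that definition follows; the implementation in TextMasterEM.java may differ in details (e.g., how the seed words themselves are handled).

// Sketch only: average precision of a ranked list against a ground-truth set.
// Assumes every ground-truth word appears somewhere in the ranked list.
double averagePrecision(java.util.List<String> ranked, java.util.Set<String> truth) {
    int hits = 0;
    double sumPrecision = 0.0;
    for (int i = 0; i < ranked.size(); i++) {
        if (truth.contains(ranked.get(i))) {
            hits++;
            sumPrecision += (double) hits / (i + 1); // precision at this position
        }
    }
    return hits == 0 ? 0.0 : sumPrecision / hits;
}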

Exercises

  1. (3 points) Complete the training code. Search for the string

    //BEGIN code to be changed

    in the TextMasterEM.java file; use the comments there and your knowledge of the EM algorithm to fix the training routine. If you use the skeleton code that's there, only four lines in total need to be edited. Hand in your new code and the output the training run prints to the screen (average log likelihood, etc.) for 500 iterations. (A sketch of the EM updates for this model appears after this list.)

  2. (2 points) Experiment with local maxima. For a model with four topics, train ten separate models, using training runs of 10,000 iterations each (depending on your hardware, these runs may take a while!). For each run, list the final average log likelihood reported by the training script and the "average precision" of the ranked list in testing. Test the models in the following way:
    java TextMasterEM test model.txt wordDictionary.txt companies.txt Shell BT Dow
    java TextMasterEM test model.txt wordDictionary.txt countries.txt Libya United+Kingdom Myanmar
    Answer the following questions:
    1. Which of the statistics you report -- training log likelihood or average precision of the ranked list -- is the better evidence of whether local maxima are causing problems for EM?
    2. Based on your results, are local maxima evident?
    3. Does accuracy on the set expansion task correlate with the model's likelihood on the training set?
  3. (2 points) Do some other experiment you find interesting. Some ideas: try expanding a different subset of the words (e.g., developing countries, words that begin with A, etc.), train for a much longer time and measure performance, or see if you can interpret a hidden topic's "meaning" in terms of its words or contexts.
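
For Exercise 1, the EM updates you need are the usual ones for a model with a single hidden variable: in the E-step, each observed (word, context) occurrence gets a posterior over topics, P(z | w, c) proportional to P(z | w) P(c | z); in the M-step, those posteriors are summed into expected counts and renormalized into new tables P(z | w) and P(c | z). (P(w) is estimated directly from the data and does not change.) The sketch below states this against made-up data structures -- parallel arrays of occurrences and the parameter tables shown -- so treat it as a reminder of the math rather than a drop-in replacement for the skeleton in TextMasterEM.java.

// Sketch only: one EM iteration for the word -> z -> context model.
// words[i] and contexts[i] give the i-th observed occurrence; pZgivenW and
// pCgivenZ are the current parameter tables. These names and shapes are
// assumptions for illustration, not the skeleton's actual variables.
void emIteration(int[] words, int[] contexts,
                 double[][] pZgivenW,   // P(z | w), indexed [w][z]
                 double[][] pCgivenZ) { // P(c | z), indexed [z][c]
    int numTopics = pCgivenZ.length;
    double[][] countZW = new double[pZgivenW.length][numTopics];
    double[][] countCZ = new double[numTopics][pCgivenZ[0].length];

    // E-step: posterior over the hidden topic for each occurrence,
    // accumulated as expected counts.
    for (int i = 0; i < words.length; i++) {
        int w = words[i], c = contexts[i];
        double[] post = new double[numTopics];
        double norm = 0.0;
        for (int z = 0; z < numTopics; z++) {
            post[z] = pZgivenW[w][z] * pCgivenZ[z][c];
            norm += post[z];
        }
        for (int z = 0; z < numTopics; z++) {
            countZW[w][z] += post[z] / norm;
            countCZ[z][c] += post[z] / norm;
        }
    }

    // M-step: renormalize expected counts into new conditional distributions.
    for (int w = 0; w < countZW.length; w++) {
        double total = 0.0;
        for (int z = 0; z < numTopics; z++) total += countZW[w][z];
        if (total > 0) {
            for (int z = 0; z < numTopics; z++) pZgivenW[w][z] = countZW[w][z] / total;
        }
    }
    for (int z = 0; z < numTopics; z++) {
        double total = 0.0;
        for (int c = 0; c < countCZ[z].length; c++) total += countCZ[z][c];
        if (total > 0) {
            for (int c = 0; c < countCZ[z].length; c++) pCgivenZ[z][c] = countCZ[z][c] / total;
        }
    }
}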