Research Ideas

Homework 5, Problem 2: Markov Text Generation
[50 points; individual or pair]

Submission: Submit your hw5pr2.py file to the submission server

Your goal in this section is to write a program that is capable of generating "meaningful" text all by itself! You will accomplish this goal by writing a Markov text generation algorithm.

Note: Some of the functions shown in the optional problem 4 (hw5pr4.py) below might be very useful for this problem.

Markov Text Generation

Here's the basic idea: English is a language with a lot of structure. Words have a tendency (indeed, an obligation) to appear only in certain sequences. Grammatical rules specify legal combinations of different parts of speech. E.g., the phrase "The cat climbs the stairs" obeys a legal word sequence. "Stairs the the climbs cat", does not. Additionally, semantics (the meaning of a word or sentence) further limits possible word combinations. "The stairs climb the cat" is a perfectly legal sentence, but it doesn't make much sense and you are very unlikely to encounter this word ordering in practice.

Even without knowing the formal rules of English, or the meaning of English words, we can get idea of what word combinations are legal simply by looking at well-formed English text and noting the combinations of words that tend to occur in practice. Then, based on our observations, we could generate new sentences by randomly selecting words according to commonly occurring sequences of these words. For example, consider the following text:

"I love roses and carnations. I hope I get roses for my birthday."

If we start by selecting the word "I", we notice that "I" may be followed by "love", "hope" and "get" with equal probability in this text. We randomly select one of these words to add to our sentence, e.g. "I get". We can repeat this process with the word "get", necessarily selecting the word "roses" as the next word. Continuing this process could yield the phrase "I get roses and carnations". Note that this is a valid English sentence, but not one that we have seen before. Other novel sentences we might have generated include "I love roses for my birthday," and "I get roses for my birthday".

More formally, the process we use to generate these sentences is called a first order Markov process. A first-order Markov process is a process in which the state at time t+1 (i.e. the next word) depends only on the states at times t (i.e., the previous word). In a second-order Markov process, the next word would depend on the two previous words, and so on. Our example above was a first order process because the choice of the next word depended only on the current word. Note that the value of the next word is independent of that word's position and depends only on its immediate history. That is, it doesn't matter if we are choosing the 2nd word or the 92nd. All that matters is what the 1st or the 91st word is, respectively.

Your text-analyzer and generator

In the first part of this assignment you will implement a first-order Markov text generator. Writing this function will involve two functions: (1) one to process a file and create a dictionary of legal word transitions and (2) another to actually generate the new text. Here are details on each:

createDictionary

createDictionary( nameOfFile ) takes in a string, the name of a text file containing some sample text. It should return a dictionary whose keys are words encountered in the text file and whose entries are a list of words that may legally follow the key word. Note that you should determine a way to keep track of frequency information. That is, if the word "cheese" is followed by the word "pizza" twice as often as it is followed by the word "sandwich", your dictionary should reflect this trend. For example, you might keep multiple copies of a word in the list.

The dictionary returned by createDictionary will allow you to choose word t+1 given a word at time t. But how do you choose the first word, when there is no preceding word to use to index into your dictionary?

To handle this case, your dictionary should include the string "$" representing the sentence-start symbol. The first word in the file should follow this string. In addition, each word in the file that follows a sentence-ending word should follow this string. A sentence-ending word will be defined to be any raw, space-separated word whose last character is a period ., a question mark ?, or an exclamation point !

Checking your code...

To check your code, paste the following text into a plain-text file (for example, into a new file window in IDLE):

A B A. A B C. B A C. C C C.

Save this file as t.txt in the same directory where your hw5pr2.py lives. Then, see if your dictionary d matches the sample below:

>>> d = createDictionary( 't.txt' )

>>> d

{'A': ['B', 'B', 'C.'], 'C': ['C', 'C.'], 'B': ['A.', 'C.', 'A'], '$': ['A', 'A', 'B', 'C']}

The elements within each list need not be in the same order, but they should appear in the quantities shown above for each of the four keys, 'A', 'C', 'B', and '$' .

generateText

generateText( d, n ) will take in a dictionary or word transitions d (generated in your createDictionary function, above) and a positive integer, n. Then, generateText should print a string of n words.

The first word should be randomly chosen from among those that can follow the sentence-starting string "$". Remember that random.choice will choose one item randomly from a list! The second word will be randomly chosen among the list of words that could possible follow the first, and so on. When a chosen word ends in a period ., a question mark ?, or an exclamation point !, the generateText function should detect this and start a new sentence by again choosing a random word from among those that follow "$".

For this problem, you should not strip the punctuation from the raw words of the text file. Leave the punctuation as it appears in the text -- and when you generate words, don't worry if your generated text does not end with legal punctuation, i.e., you might end without a period, which is ok. The text you generate won't be perfect, but you might be surprised how good it is!

If your process ever encounters a word that has only appeared as the last word in your training set (i.e., it does not have any words that follow it), you should simply start generating text again with the sentence-start symbol "$" . Here are two examples that use the dictionary d, from above. Yours will differ because of the randomness, but should be similar in spirit.

>>> generateText( d, 20 )

B C. C C C. C C C C C C C C C C C. C C C. A

>>> generateText( d, 20 )

A B A. C C C. B A B C. A C. B A. C C C C C C.

The text you want to use is up to you.

IRS documents like this one (http://www.irs.gov/newsroom/article/0,,id=159925,00.html may be good.

Random scenes from Shakespeare might also strike your fancy (http://www-tech.mit.edu/Shakespeare/)

If you have gotten to this point, you have completed problem2! You should submit your hw5pr2.py file at the Submission Site.

Next

hw5pr3

Lab 5

Homework 5