Homework 5, Problem 2: Markov
Text Generation
[50
points; individual or pair]
Submission: Submit your hw5pr2.py file to the submission server
Your
goal in this section is to write a program that is capable of generating
"meaningful" text all by itself! You will accomplish this goal by
writing a Markov text generation algorithm.
Note: Some of the functions shown in the optional problem 4 (hw5pr4.py) below might be very useful
for this problem.
Markov
Text Generation
Here's the basic idea: English is a language with a lot
of structure. Words have a tendency (indeed, an obligation) to appear only in
certain sequences. Grammatical rules specify legal combinations of different
parts of speech. E.g., the phrase "The cat climbs the stairs" obeys a
legal word sequence. "Stairs the the climbs cat", does not.
Additionally, semantics (the meaning of a word or sentence) further limits
possible word combinations. "The stairs climb the cat" is a perfectly
legal sentence, but it doesn't make much sense and you are very unlikely to
encounter this word ordering in practice.
Even without knowing the formal rules of English, or
the meaning of English words, we can get idea of what word combinations are
legal simply by looking at well-formed English text and noting the combinations
of words that tend to occur in practice. Then, based on our observations, we
could generate new sentences by randomly
selecting words according to commonly occurring sequences of these words. For
example, consider the following text:
"I love roses and carnations. I hope I get roses
for my birthday."
If we start by selecting the word "I", we
notice that "I" may be followed by "love", "hope"
and "get" with equal probability in this text. We randomly select one
of these words to add to our sentence, e.g. "I get". We can repeat
this process with the word "get", necessarily selecting the word
"roses" as the next word. Continuing this process could yield the
phrase "I get roses and carnations". Note that this is a valid
English sentence, but not one that we have seen before. Other novel sentences
we might have generated include "I love roses for my birthday," and
"I get roses for my birthday".
More formally, the process we use to generate these
sentences is called a first order Markov process.
A first-order Markov process is a process in which the state at time t+1 (i.e. the next word) depends only on
the states at times t (i.e., the previous word).
In a second-order Markov process, the next word would depend on the two previous words, and so on. Our
example above was a first order process because the choice of the next word
depended only on the current word. Note that the value of the next word is
independent of that word's position and depends only on its immediate history.
That is, it doesn't matter if we are choosing the 2nd word or the 92nd. All
that matters is what the 1st or the 91st word is, respectively.
Your
text-analyzer and generator
In the first part of this assignment you will implement
a first-order Markov text generator. Writing this function will involve two
functions: (1) one to process a file and create a dictionary of legal word
transitions and (2) another to actually generate the new text. Here are details
on each:
createDictionary( nameOfFile )
takes in a string, the name of a text file containing some sample text. It
should return a dictionary whose keys are words encountered in the text file
and whose entries are a list of words that may legally follow the key word.
Note that you should determine a way to keep track of frequency information.
That is, if the word "cheese" is followed by the word
"pizza" twice as often as it is followed by the word
"sandwich", your dictionary should reflect this trend. For example,
you might keep multiple copies of a word in the list.
The dictionary returned by createDictionary
will allow you to choose word t+1
given a word at time t. But
how do you choose the first word, when there is no preceding word to use
to index into your dictionary?
To handle this case, your dictionary should include the
string "$" representing the sentence-start symbol. The first
word in the file should follow this string. In addition, each word in the file
that follows a sentence-ending word should follow this string. A
sentence-ending word will be defined to be any raw, space-separated word whose
last character is a period ., a question mark ?,
or an exclamation point !
Checking your code...
To check your code, paste the following text into a plain-text file (for
example, into a new file window in IDLE):
A B A. A B C. B A C. C C C.
Save this file as t.txt in the same directory where your hw5pr2.py lives. Then, see if your dictionary d matches the sample
below:
>>> d = createDictionary( 't.txt'
)
>>> d
{'A': ['B', 'B', 'C.'], 'C': ['C',
'C.'], 'B': ['A.', 'C.', 'A'], '$': ['A', 'A', 'B', 'C']}
The elements within each list
need not be in the same order, but they should appear in the quantities shown
above for each of the four keys, 'A', 'C', 'B', and '$' .
generateText
generateText( d, n )
will take in a dictionary or word transitions d (generated in your
createDictionary function, above) and a positive integer, n. Then, generateText
should
print a string of n
words.
The
first word should be randomly chosen from among those that can follow the
sentence-starting string "$".
Remember
that random.choice will choose one
item randomly from a list!
The second word will be randomly chosen among the list of words that could
possible follow the first, and so on. When a chosen word
ends in a period ., a question mark
?,
or an exclamation point !, the generateText function should detect this and start a new sentence by
again choosing a random word from among those that follow "$".
For
this problem, you should not strip the punctuation from the raw words of the
text file. Leave the punctuation as it appears in the text -- and when you
generate words, don't worry if your generated text does not end with legal
punctuation, i.e., you might end without a period, which is ok. The text you
generate won't be perfect, but you might be surprised how good it is!
If
your process ever encounters a word that has only appeared as the last word in
your training set (i.e., it does not have any words that follow it), you should
simply start generating text again with the sentence-start symbol "$" . Here are two examples that use the dictionary d, from above. Yours will differ because of the
randomness, but should be similar in spirit.
>>> generateText( d, 20 )
B C. C C C. C C C C C C C C C C C. C C C. A
>>> generateText( d, 20 )
A B A. C C C. B A B C. A C. B A. C C C C C C.
The
text you want to use is up to you.
IRS
documents like this one (http://www.irs.gov/newsroom/article/0,,id=159925,00.html
may be
good.
Random
scenes from Shakespeare might also strike your fancy (http://www-tech.mit.edu/Shakespeare/)
If
you have gotten to this point, you have completed problem2! You should submit your hw5pr2.py
file at the Submission Site.
Next