Homework 5, Problem 4: A program that reads
[20 extra points; individual or pair]
Submission: Submit your hw5pr4.py file to the submission server
The Flesch Index (FI) is a numerical
measure of the readability of a particular piece of text. Textbook publishers
and editors often use it to ensure that material is written at a level
appropriate for the intended audience.
For a given piece of text, the Flesch readability, or
Flesch index (FI), is given by:
FI = 206.835 - 84.6 * (numSyls / numWords) - 1.015 * (numWords / numSents)
where:
1. numSyls is the total number of syllables in the text
2. numWords is the total number of words in the text
3. numSents is the total number of sentences in the text
The FI is simply a constant offset minus a linear combination of two
text-related ratios: syllables per word and words per sentence. FI is usually
reported as an integer, and casting the floating-point result to an integer
(which drops the fractional part) is completely fine for this purpose.
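As a quick check of the formula, here is a minimal sketch that computes the index from the three counts. The function name fleschIndex is hypothetical and not something the assignment requires. The float() conversions are there because dividing two integers in Python 2 throws away the fractional part.

```python
def fleschIndex(numSyls, numWords, numSents):
    """ returns the Flesch Index as an integer
        (fleschIndex is a hypothetical helper name)
    """
    # float() keeps Python 2 from truncating the integer ratios
    sylsPerWord = float(numSyls) / numWords
    wordsPerSent = float(numWords) / numSents
    fi = 206.835 - 84.6 * sylsPerWord - 1.015 * wordsPerSent
    return int(fi)  # int() drops the fractional part
```

For a text with 16 syllables, 14 words, and 1 sentence, fleschIndex(16, 14, 1) returns 95.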
Here are some resulting readability scores for a few
example publications or styles of writing (cited from Cay Horstmann's book, p.
266). It is, in fact, possible to have a negative readability score:
In this problem you will write a program to compute the
Flesch readability score or Flesch Index (FI) for different pieces of text. Because the
rules for counting words, sentences, and syllables can be tricky, it is
important to allow the user to play with various inputs and see how they
decompose into syllables, sentences, and words. For this reason, you will
structure this program in a function named flesch() that provides a menu of options for
analyzing text.
Example run from a complete flesch program
Here is what the user should see when they run your flesch() function -- changes to the text are OK, but
please keep to the option numbers listed here and their functionality, because
it will make your program much easier to grade!
Welcome to the text readability calculator!
Your options include:
(1) Count sentences
(2) Count words
(3) Count syllables in one word
(4) Calculate readability
(9) Quit
What option would you like?
The first three options will help you troubleshoot
errors in computing the three components of the Flesch readability Index. In
addition, they suggest three functions that you should write to support this
menu:
You are welcome to write additional helper functions,
as well!
Punctuation, whitespace, and some functions to get you started
The input text to options 1, 2, and 4 in the
readability menu may be any string at all. Because of this, there need to be
some guidelines that define what constitutes a sentence, a word, and a syllable
within a word.
We will use the term raw word to represent a
space-separated string of characters that, itself, contains no spaces. Python's
string objects contain a useful method (function) that can split a string into
space-separated raw words: it is called split().
For example,
>>> s = "This is a sentence."
>>> s.split()
['This', 'is', 'a', 'sentence.']
>>> s = "This is \n a sentence." # \n is a newline
>>> print s
This is
a sentence.
>>> s.split()
['This', 'is', 'a', 'sentence.']
Thus split returns a list of raw
words, with all whitespace removed.
The following function might be useful -- feel free to
copy it to your hw5pr4.py file and use it in order
to extract a list of raw words from a string of input text:
def makeListOfWords( text ):
    """ returns a list of words in the input text """
    L = text.split()
    return L
Admittedly, you could avoid using this function simply
by calling split as needed.
After whitespace, punctuation is a second cue that we
will use to define what constitutes a sentence and a word.
The following function will be useful in stripping
non-alphabetic characters from raw words -- you should also use this in your hw5pr4.py file:
def dePunc( rawword ):
    """ de-punctuationifies the input string """
    L = [ c for c in rawword if 'A' <= c <= 'Z' or 'a' <= c <= 'z' ]
    # L is now a list of alphabetic characters
    word = ''.join(L) # this _strings_ the elements of L together
    return word
Because most of the non-alphabetic characters we need
to remove will be punctuation, the function is called dePunc. It uses a list comprehension that creates a
list of all and only alphabetic characters (hence the if, which is allowed in list comprehensions).
That list L, however, is not usable as a string, so
the "magic" line word = ''.join(L) converts
the list L into a string held by the variable word. That line is really not magic -- join is simply a method of all string objects.
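To see the two steps separately, here is a small sketch that runs the same list comprehension and join on the raw word one-way. The variable dashed is only there for comparison and is not part of dePunc.

```python
rawword = "one-way"

# keep only the characters in the ranges A-Z and a-z
L = [c for c in rawword if 'A' <= c <= 'Z' or 'a' <= c <= 'z']
# L is now ['o', 'n', 'e', 'w', 'a', 'y']

# the string that join is called on is placed between the elements
word = ''.join(L)      # empty separator: 'oneway'
dashed = '-'.join(L)   # dash separator:  'o-n-e-w-a-y'
```

The string before .join is the separator, which is why the empty string '' glues the letters directly together.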
Definition of "a word":
For this problem, we will define a word to be a
raw word with all of the non-alphabetic characters removed. For example, if the
raw word is one-way, then the corresponding
word (with punctuation removed) is oneway. A raw word will never
create more than one punctuation-removed word.
However, a raw word could have only
non-alphabetic characters. For example, the raw word 42! will disappear entirely
when the non-alphabetic characters are removed. We will insist that a word
has at least one alphabetic character (thus, the empty string is not a word).
Here's an example of dePunc in action:
>>> dePunc( "I<342,don'tYou?" )
'IdontYou'
So, with these two functions as background, the number
of words in a body of text is defined to be the number of non-empty,
alphabet-only, space-separated raw words in that text. Using makeListOfWords and dePunc will help to write a
function words( text ) that takes any string as
input and returns the number of words (numWords in the formula), as defined above, as output.
Here are some examples -- note that this is not an
exhaustive list of possibilities! You may want to test some other cases, as
well.
>>> words( 'This sentence has 4 words.' )
4
>>> words( 'This sentence has five words.' )
5
>>> words( '42 > 3.14159 > 2.71828' )
0
Note that these rules have their limitations! The first
example probably should be considered to have 5 words, but because it would
take a person to disambiguate all of the many possibilities (and different
people might disagree!), we will stick with this definition, despite its
limitations.
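One way to put the definition into code is sketched below. It is only one possible approach, not a required solution. dePunc is repeated from above so that the sketch runs on its own, and split is called directly instead of going through makeListOfWords.

```python
def dePunc(rawword):
    """ removes the non-alphabetic characters from rawword """
    L = [c for c in rawword if 'A' <= c <= 'Z' or 'a' <= c <= 'z']
    return ''.join(L)

def words(text):
    """ returns the number of words in text, where a word is a
        raw word that still has at least one letter after dePunc
    """
    count = 0
    for rawword in text.split():
        if dePunc(rawword) != '':   # skip raw words with no letters
            count += 1
    return count
```

With this sketch, words( '42 > 3.14159 > 2.71828' ) returns 0 because every raw word in that text is empty after dePunc.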
Counting sentences
The rules used for counting sentences within a string
are
Thus, as long as there is at least one sentence in the
text, an unpunctuated fragment at the end of the text does not count as an
additional sentence. This may seem a bit much, but in fact it means you don't have to create a special case for
the last raw word of the text.
Here are some examples -- again, this is not an
exhaustive list of possibilities! You may want to test some other cases, as
well.
>>> sentences( 'This sentence has 4 words.' )
1
>>> sentences( 'This sentence has no final punctuation' )
1
>>> sentences( 'Hi. This sentence has no final punctuation' )
1
# Note! Fragments don't count unless there are no sentences at all
>>> sentences( 'Wow!?! No way.' )
2
>>> sentences( 'Wow! ? ! No way.' )
4
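The sketch below is one reading of the sentence rules that reproduces the five examples above. It rests on two assumptions inferred from those examples rather than stated rules: a sentence is counted for every raw word that ends in a period, question mark, or exclamation point, and a text with no such raw word counts as one sentence.

```python
def sentences(text):
    """ returns the number of sentences in text, under the
        assumptions described above (inferred from the examples)
    """
    count = 0
    for rawword in text.split():
        # assumption: a raw word ending in . ? or ! ends a sentence
        if rawword[-1] in '.?!':
            count += 1
    # assumption: text with no sentence-ending raw word counts as 1
    return max(count, 1)
```

Returning at least 1 also keeps numSents from being zero, which would otherwise cause a division by zero in the FI formula.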
You will only need to count syllables in
punctuation-stripped words (not raw words with non-alphabetic characters). When
writing your code that counts syllables in a word, you should assume:
Here are some examples -- this definitely is not an exhaustive
list of possibilities! You will want to test some other cases, as well.
>>> syllables( 'syllables' )
3
>>> syllables( 'one' )
1
>>> syllables( 'science' ) # it's not always correct...
1
As with words and sentences, these rules do not always
match the number of syllables that English speakers would agree on. For
computing readability, however, the errors tend not to impact the final score
too much, since they vary in both directions.
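As an illustration only, here is a sketch of one set of syllable rules that agrees with the three examples above. The rules in it are assumptions, not necessarily the official ones: the vowels are a, e, i, o, u, and y; a vowel counts as a syllable when it starts the word or follows a non-vowel; a lone e at the very end of the word does not count; and every word has at least one syllable.

```python
def syllables(word):
    """ returns a syllable count for a punctuation-stripped word,
        using the assumed rules described above
    """
    vowels = 'aeiouyAEIOUY'
    count = 0
    prevIsVowel = False
    for i in range(len(word)):
        isVowel = word[i] in vowels
        if isVowel and not prevIsVowel:
            # assumed exception: a lone final e is silent
            isFinalE = (i == len(word) - 1) and word[i] in 'eE'
            if not isFinalE:
                count += 1
        prevIsVowel = isVowel
    return max(count, 1)   # assumed: at least one syllable per word
```

Under these assumed rules, 'science' comes out as 1 because the i is the only vowel that follows a non-vowel and is not a final e.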
How should the input be collected?
For this problem, use input in order to take in the
numeric menu choices from the user.
However, use raw_input to take in the text for
options 1, 2, 3, and 4. This way, the user will not have to type quotes. It is
possible to paste in large amounts of text at the raw_input prompt
-- I tried the entirety of Romeo and Juliet
and it worked fine on a PC and Mac, though you may have to hit return an
additional time.
In addition to the counting capabilities mentioned
above, you should also implement the overall Flesch readability index as option
#4.
A few details:
Here are two example runs (just option #4):
Choose an option: 4
Type some text: The cow is of the bovine ilk;
one end is moo, the other milk.
Number of sentences: 1
Number of words: 14
Number of syllables: 16
Readability index: 95
Choose an option: 4
Type some text: Fourscore and seven years ago our
fathers brought forth on this continent a new nation conceived in Liberty and
dedicated to the proposition that all men are created equal.
Number of sentences: 1
Number of words: 29
Number of syllables: 48
Readability index: 37
Computing the readability of one of your papers
Finally, copy-and-paste one of your own papers (or
other works) into the readability scorer. Include a comment or triple-quoted
string at the TOP of your file that mentions what you tested (you don't need to
include all the text, just what text it was) and what its score was... we look
forward to seeing the results!
If you have gotten to this point, you have completed problem 4! You should submit your hw5pr4.py
file at the Submission Site.