
Homework 5, Problem 4: A program that reads
[20 extra points; individual or pair]

Submission: Submit your hw5pr4.py file to the submission server

The Flesch Index (FI) is a numerical measure of the readability of a particular piece of text. Textbook publishers and editors often use it to ensure that material is written at a level appropriate for the intended audience.

For a given piece of text, the Flesch readability, or Flesch index (FI), is given by:

FI = 206.835 - 84.6 * numSyls/numWords - 1.015 * numWords/numSents

where:

1. numSyls is the total number of syllables in the text

2. numWords is the total number of words in the text

3. numSents is the total number of sentences in the text

The FI is a linear combination of two text-related ratios, subtracted from a constant offset. FI is usually reported as an integer, and casting the floating-point result to an integer is completely fine for this purpose.
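As a rough sketch of the formula above, the function below turns the three counts into an integer score. The name flesch_index and its parameter names are illustrative only, not names the assignment requires. The float() calls are there because Python 2 discards the fractional part when it divides one integer by another, which would distort both ratios.

```python
def flesch_index(num_syls, num_words, num_sents):
    """Return the Flesch Index as an int (illustrative helper, not a required name)."""
    # float() forces true division; Python 2 would otherwise truncate int/int
    syls_per_word = float(num_syls) / num_words
    words_per_sent = float(num_words) / num_sents
    score = 206.835 - 84.6 * syls_per_word - 1.015 * words_per_sent
    return int(score)  # int() drops the fractional part (truncates toward zero)


# Counts from the first sample run in this problem:
# 16 syllables, 14 words, 1 sentence
print(flesch_index(16, 14, 1))  # prints 95
```

Working it by hand: 84.6 * 16/14 is about 96.69 and 1.015 * 14/1 is 14.21, so the score is about 206.835 - 96.69 - 14.21 = 95.94, which int() turns into 95.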

Here are some resulting readability scores for a few example publications or styles of writing (cited from Cay Horstmann's book, p. 266). It is, in fact, possible to have a negative readability score:

95 - Comics
82 - Advertisements
65 - Sports Illustrated
57 - Time Magazine
39 - New York Times
10 - Auto insurance policy
-6 - Internal Revenue Code

In this problem you will write a program to compute the Flesch readability score or Flesch Index (FI) for different pieces of text. Because the rules for counting words, sentences, and syllables can be tricky, it is important to allow the user to play with various inputs and see how they decompose into syllables, sentences, and words. For this reason, you will structure this program in a function named flesch() that provides a menu of options for analyzing text.

Example run from a complete flesch program

Here is what the user should see when they run your flesch() function -- changes to the text are OK, but please keep to the option numbers listed here and their functionality, because it will make your program much easier to grade!

Welcome to the text readability calculator!

Your options include:

(1) Count sentences

(2) Count words

(3) Count syllables in one word

(4) Calculate readability

(9) Quit

What option would you like?

The first three options will help you troubleshoot errors in computing the three components of the Flesch readability Index. In addition, they suggest three functions that you should write to support this menu:

sentences( text ), which takes in any string of text and returns the number of sentences present, according to the rules below
words( text ), which takes in any string of text and returns the number of words present, according to the rules below
syllables( oneword ), which takes in one word of text, called oneword, and returns the number of syllables present, according to the rules below

You are welcome to write additional helper functions, as well!

Punctuation, whitespace, and some functions to get you started

The input text to options 1, 2, and 4 in the readability menu may be any string at all. Because of this, there need to be some guidelines that define what constitutes a sentence, a word, and a syllable within a word.

We will use the term raw word to represent a space-separated string of characters that, itself, contains no spaces. Python's string objects contain a useful method (function) that can split a string into space-separated raw words: it is called split(). For example,

>>> s = "This is a sentence."
>>> s.split()
['This', 'is', 'a', 'sentence.']
>>> s = "This is \n a sentence." # \n is a newline
>>> print s
This is
 a sentence.
>>> s.split()
['This', 'is', 'a', 'sentence.']

Thus split returns a list of raw words, with all whitespace removed.

The following function might be useful -- feel free to copy it to your hw5pr4.py file and use it in order to extract a list of raw words from a string of input text:

def makeListOfWords( text ):
    """ returns a list of words in the input text """
    L = text.split()
    return L

Admittedly, you could avoid using this function simply by calling split as needed.

After whitespace, punctuation is a second cue that we will use to define what constitutes a sentence and a word.

The following function will be useful in stripping non-alphabetic characters from raw words -- you should also use this in your hw5pr4.py file:

def dePunc( rawword ):
    """ de-punctuationifies the input string """
    L = [ c for c in rawword if 'A' <= c <= 'Z' or 'a' <= c <= 'z' ]
    # L is now a list of alphabetic characters
    word = ''.join(L) # this _strings_ the elements of L together
    return word

Because most of the non-alphabetic characters we need to remove will be punctuation, the function is called dePunc. It uses a list comprehension that creates a list of only the alphabetic characters (hence the if, which is allowed in list comprehensions). That list L, however, is not usable as a string, so the "magic" line word = ''.join(L) converts the list L into a string held by the variable word. That line is really not magic -- join is simply a method of all string objects.

Definition of "a word":

For this problem, we will define a word to be a raw word with all of the non-alphabetic characters removed. For example, if the raw word was one-way, then the corresponding word (with punctuation removed) would be oneway. A raw word will never create more than one punctuation-removed word.

However, a raw word could have only non-alphabetic characters. For example, the raw word 42! will disappear entirely when the non-alphabetic characters are removed. We will insist that a word has at least one alphabetic character (thus, the empty string is not a word).

Here's an example of dePunc in action:

>>> dePunc( "I<342,don'tYou?" )
'IdontYou'

Counting words

So, with these two functions as background, the number of words in a body of text is defined to be the number of non-empty, alphabet-only, space-separated raw words in that text. Using makeListOfWords and dePunc will help to write the function words( text ), which takes any string as input and returns the number of words, as defined above, as output.

Here are some examples -- note that this is not an exhaustive list of possibilities! You may want to test some other cases, as well.

>>> words( 'This sentence has 4 words.' )
4
>>> words( 'This sentence has five words.' )
5
>>> words( '42 > 3.14159 > 2.71828' )
0

Note that these rules have their limitations! The first example probably should be considered to have 5 words, but because it would take a person to disambiguate all of the many possibilities (and different people might disagree!), we will stick with this definition, despite its limitations.
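One possible way to put the word definition into code is sketched below. It is only a minimal sketch, not the expected solution: it repeats a condensed dePunc so that it runs on its own, and it calls split() directly rather than going through makeListOfWords.

```python
def dePunc(rawword):
    """Keep only the alphabetic characters of rawword (condensed copy of the helper above)."""
    return ''.join([c for c in rawword if 'A' <= c <= 'Z' or 'a' <= c <= 'z'])


def words(text):
    """Count the raw words that still contain at least one letter after dePunc."""
    count = 0
    for rawword in text.split():
        if dePunc(rawword) != '':  # an empty result means no alphabetic characters
            count += 1
    return count


print(words('This sentence has 4 words.'))  # prints 4
print(words('42 > 3.14159 > 2.71828'))       # prints 0
```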

Counting sentences

The rules used for counting sentences within a string are:

1. A sentence occurs any time a raw word ends in a period (.), question mark (?), or exclamation point (!). Note that this means that a raw word consisting of just a period, question mark, or exclamation point counts as a sentence.

2. The empty string is the only string that has 0 sentences. Any other string should be considered to have at least one sentence.

Thus, as long as there is at least one sentence in the text, an unpunctuated fragment at the end of the text does not count as an additional sentence. This may seem a bit much, but in fact it means you don't have to create a special case for the last raw word of the text.

Here are some examples -- again, this is not an exhaustive list of possibilities! You may want to test some other cases, as well.

>>> sentences( 'This sentence has 4 words.' )
1
>>> sentences( 'This sentence has no final punctuation' )
1
>>> sentences( 'Hi. This sentence has no final punctuation' )
1
# Note! Fragments don't count unless there are no sentences at all
>>> sentences( 'Wow!?! No way.' )
2
>>> sentences( 'Wow! ? ! No way.' )
4
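The sentence rules above could be sketched as follows. This is one reading of the rules rather than a definitive implementation; in particular, it treats a string of only whitespace as non-empty, so it reports one sentence for it.

```python
def sentences(text):
    """Count raw words ending in . ? or ! with a minimum of 1 for non-empty text."""
    if text == '':
        return 0  # the empty string is the only string with 0 sentences
    count = 0
    for rawword in text.split():
        # split() never returns empty strings, so rawword[-1] is always valid
        if rawword[-1] in '.?!':
            count += 1
    return max(count, 1)  # an unpunctuated non-empty string still counts as 1


print(sentences('Wow!?! No way.'))    # prints 2
print(sentences('Wow! ? ! No way.'))  # prints 4
```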

Counting syllables

You will only need to count syllables in punctuation-stripped words (not raw words with non-alphabetic characters). When writing your code that counts syllables in a word, you should assume:

A vowel is a capital or lowercase a, e, i, o, u, or y.
A syllable occurs in a punctuation-stripped word whenever:

Rule 1: a vowel is at the start of a word
Rule 2: a vowel follows a consonant in a word
Rule 3: there is one exception: if a lone vowel e or E is at the end of a (punctuation-stripped) word, then that vowel does not count as a syllable.
Rule 4: finally, everything that is a word must always count as having at least one syllable.

Here are some examples -- this definitely is not an exhaustive list of possibilities! You will want to test some other cases, as well.

>>> syllables( 'syllables' )
3
>>> syllables( 'one' )
1
>>> syllables( 'science' ) # it's not always correct...
1

As with words and sentences, these rules do not always match the number of syllables that English speakers would agree on. For computing readability, however, the errors tend not to impact the final score too much, since they vary in both directions.
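The four syllable rules might translate into code along these lines. The sketch assumes its input is already punctuation-stripped, so every non-vowel character is a consonant, and it reads "lone" in Rule 3 as "immediately preceded by a consonant"; both are assumptions made for illustration, and the VOWELS constant is a name introduced here.

```python
VOWELS = 'aeiouyAEIOUY'  # illustrative constant; y counts as a vowel


def syllables(oneword):
    """Count syllables in a punctuation-stripped word using Rules 1-4."""
    count = 0
    for i, c in enumerate(oneword):
        # Rule 1: vowel at the start; Rule 2: vowel right after a consonant
        if c in VOWELS and (i == 0 or oneword[i - 1] not in VOWELS):
            count += 1
    # Rule 3: a final e/E preceded by a consonant was counted above; undo it
    if len(oneword) > 1 and oneword[-1] in 'eE' and oneword[-2] not in VOWELS:
        count -= 1
    return max(count, 1)  # Rule 4: every word has at least one syllable


print(syllables('syllables'))  # prints 3
print(syllables('science'))    # prints 1
```

A word such as "the" shows why Rule 4 matters: its only vowel is a lone final e, so Rules 1 through 3 leave a count of 0, and Rule 4 raises it to 1.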

How should the input be collected?

For this problem, use input in order to take in the numeric menu choices from the user.

However, use raw_input to take in the text for options 1, 2, 3, and 4. This way, the user will not have to type quotes. It is possible to paste in large amounts of text at the raw_input prompt -- I tried the entirety of Romeo and Juliet and it worked fine on a PC and Mac, though you may have to hit return an additional time.
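The overall menu loop could be organized roughly as below. This is a sketch under several assumptions, not the required structure: run_menu, ask, show, and handlers are hypothetical names, and the menu choice is read as a string and compared with '9' (unlike the input call described above, which yields a number in Python 2). Passing the prompt and display functions in as parameters lets the loop be exercised with scripted answers instead of a live keyboard.

```python
def run_menu(ask, show, handlers):
    """Repeat the menu until option 9 is chosen.

    ask      -- hypothetical stand-in for raw_input: takes a prompt, returns a string
    show     -- hypothetical stand-in for print: takes one value to display
    handlers -- dict mapping an option string such as '1' to a one-argument function
    """
    while True:
        choice = ask('What option would you like? ').strip()
        if choice == '9':
            return  # option 9 quits
        if choice not in handlers:
            show('Unrecognized option: ' + choice)
            continue  # go back to the menu
        text = ask('Type some text: ')
        show(handlers[choice](text))


# Scripted demo: choose option 2, supply some text, then quit.
answers = iter(['2', 'two words', '9'])
shown = []
run_menu(lambda prompt: next(answers), shown.append,
         {'2': lambda text: len(text.split())})
print(shown)  # prints [2]
```

In the real program, ask would be raw_input, show would print, and handlers would map '1' through '4' to functions built on sentences, words, and syllables.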

Sample readability scores

In addition to the counting capabilities mentioned above, you should also implement the overall Flesch readability index as option #4.

A few details:

In addition to the overall readability score, be sure to print the number of sentences, words, and total number of syllables in the input text.

Cast the floating-point Flesch readability score to an int.
If the denominator numWords is zero, you should print a warning and continue to provide the readability menu.

Here are two example runs (just option #4):

Choose an option: 4

Type some text: The cow is of the bovine ilk;

one end is moo, the other milk.

Number of sentences: 1

Number of words: 14

Number of syllables: 16

Readability index: 95

Choose an option: 4

Type some text: Fourscore and seven years ago our fathers brought forth on this continent a new nation conceived in Liberty and dedicated to the proposition that all men are created equal.

Number of sentences: 1

Number of words: 29

Number of syllables: 48

Readability index: 37
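The option-4 details above (print all three counts, cast the score to an int, and warn when there are no words) could be handled by a small helper like this one. The name readability_lines and the wording of the warning are placeholders, not text the assignment prescribes. Under the sentence rules, any text with at least one word is non-empty and so has at least one sentence, which is why checking the word count alone is enough to avoid dividing by zero.

```python
def readability_lines(num_sents, num_words, num_syls):
    """Return the lines option 4 would print (hypothetical helper name)."""
    if num_words == 0:
        # Avoid dividing by zero; the caller prints this and returns to the menu
        return ['Warning: the text has no words, so there is no readability score.']
    score = int(206.835
                - 84.6 * float(num_syls) / num_words
                - 1.015 * float(num_words) / num_sents)
    return ['Number of sentences: %d' % num_sents,
            'Number of words: %d' % num_words,
            'Number of syllables: %d' % num_syls,
            'Readability index: %d' % score]


# Counts from the second example run: 1 sentence, 29 words, 48 syllables
for line in readability_lines(1, 29, 48):
    print(line)
```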

Computing the readability of one of your papers

Finally, copy-and-paste one of your own papers (or other works) into the readability scorer. Include a comment or triple-quoted string at the TOP of your file that mentions what you tested (you don't need to include all the text, just what text it was) and what its score was... we look forward to seeing the results!

If you have gotten to this point, you have completed problem 4! You should submit your hw5pr4.py file at the Submission Site.
