Roger Bilisoly, PhD, Department of Mathematical Sciences, Central Connecticut State University
Homework Problem: Find the Proportion of Each Letter of the Alphabet in Dickens's A Christmas Carol.
Are there any letter frequency anomalies? For example, does the letter J appear more often than average due to the name Jacob Marley? This novel was originally published in 1843. How do its letter frequencies compare to American English in 1961, i.e., to the Brown Corpus? How do its letter frequencies compare to German frequencies, e.g., to Goethe's Die Leiden des jungen Werther?
Complications: Other languages using the Latin alphabet often employ diacritical marks (e.g., German has umlauts) and sometimes add new letters (e.g., German has ß, the Eszett, which stands for a double s). Hence alphabets are more complex than one might first suppose.
This SAS Code Introduces both Character Data and Frequency Tables.
data carol;
  infile "C:\A_Christmas_Carol.txt";
  input char $1. @@;          /* read characters one at a time */
  lowchar = lowcase(char);
run;

data letters_carol;
  set carol;
  if anyalpha(lowchar) > 0;   /* keep letters only */
run;

proc freq data=letters_carol order=freq;
  tables lowchar / out=carolfreq;
run;

The above code can be introduced early in a programming class, and the ability to read in external files is important for applications.
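To put two sources side by side, as the homework questions ask, one possibility is to tag each character with its source and let PROC FREQ report column percentages. The following is only a minimal sketch: the two short sentences are placeholder samples rather than real corpora, the data set and variable names (letters, source, text) are hypothetical, and characters are extracted with SUBSTR from inline data instead of the `$1. @@` input used above.

```sas
/* Sketch: letter proportions for two sources in one table.          */
/* Assumption: the two sentences below are placeholders, not corpora. */
data letters;
  length source $8 text $60 lowchar $1;
  infile datalines truncover;
  input source $ text $char60.;
  do i = 1 to length(text);
    lowchar = lowcase(substr(text, i, 1));
    if anyalpha(lowchar) > 0 then output;   /* keep letters only */
  end;
  keep source lowchar;
datalines;
sample_a Marley was dead to begin with
sample_b The jury said an investigation took place
;

/* Column percentages give each letter's share within its source */
proc freq data=letters;
  tables lowchar*source / norow nopercent nofreq;
run;
```

With full texts in place of the samples, a letter whose column percentage differs sharply between sources (J, for instance) would be a candidate anomaly worth checking.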
Top 12 letters in frequency order for several sources:
Christmas Carol   ETOAHI NSRDLU
Brown Corpus      ETAOIN SRHLDU
Studying initial consonant clusters restricts attention to a single syllable, so syllable boundaries are not a problem. Let's compare English and German.
First, note that English and German phonology (the sounds the letters make) differ. For example, a German v is pronounced like the English f. Second, these two languages have different constraints on initial letters. For example, almost no words in German start with c, but z is pronounced like ts, which is a common starting letter (ranks 6th above) in German. Third, the frequencies of initial letters do not match the overall letter frequencies found earlier.
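Initial letters and initial consonant clusters can be tabulated with the same tools used for single letters. The sketch below is illustrative only: the ten words are placeholders rather than a real wordlist, the names (initials, lang, cluster) are hypothetical, and the cluster is defined purely by spelling as everything before the first of a, e, i, o, u, which ignores pronunciation and treats y as a consonant.

```sas
/* Sketch: initial letters and initial consonant clusters by language. */
/* Assumption: the words below are placeholders, not a real wordlist.  */
data initials;
  length lang $8 word $30 initial $1 cluster $10;
  input lang $ word $;
  word = lowcase(word);
  initial = substr(word, 1, 1);
  pos = findc(word, 'aeiou');        /* position of first vowel, 0 if none */
  if pos > 1 then cluster = substr(word, 1, pos - 1);
  drop pos;                          /* cluster stays blank for vowel-initial words */
datalines;
english string
english three
english school
english apple
english blue
german schnee
german zwei
german zeit
german pferd
german sprache
;

/* Column percentages compare the two languages directly */
proc freq data=initials order=freq;
  tables initial*lang cluster*lang / norow nopercent;
run;
```

Words that begin with a vowel have a blank cluster, which PROC FREQ treats as missing and leaves out of the cluster table by default.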
Here are the SAS solutions to the crossword and hangman problems.
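data one;
  length word $30;
  infile "C:\crosswd.txt";
  input word;
  len = length(word);
run;

/* Crossword: 7 letters, b in position 4, u in position 7 */
data two;
  set one;
  if len = 7;
  if substr(word,4,1) = 'b';
  if substr(word,7,1) = 'u';
run;
proc print data=two; run;

/* Hangman: 7 letters, none of t, a, o, i, n */
data three;
  set one;
  if len = 7;
  if findc(word,'taoin') = 0;
  /* first and last e both in position 2, i.e., exactly one e */
  if findc(word,'e') = 2 and findc(word,'e',-30) = 2;
  /* first and last s both in position 7, i.e., exactly one s */
  if findc(word,'s') = 7 and findc(word,'s',-30) = 7;
run;
proc print data=three; run;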
Obs   word      len
  1   bedbugs     7
  2   bedrugs     7
  3   bedumbs     7
  4   begulfs     7
  5   ferrums     7
  6   peplums     7
  7   rebuffs     7
  8   redbuds     7
  9   redbugs     7
 10   regulus     7
 11   vellums     7
 12   zephyrs     7
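The hangman conditions above can also be written as a single pattern. The sketch below uses the Perl regular expression functions PRXPARSE and PRXMATCH as one possible alternative to the FINDC tests; it is not the solution given above. The six-word list is a placeholder (three words taken from the output above, three chosen to fail), and the data set names are hypothetical.

```sas
/* Sketch: the hangman filter as one regular expression.            */
/* Assumption: a small inline list stands in for the full wordlist. */
data hangman_words;
  length word $30;
  input word $;
datalines;
bedbugs
zephyrs
regulus
beseech
tedious
letters
;

data hangman;
  set hangman_words;
  retain re;
  /* Position 2 is e, position 7 is s, and the other five positions  */
  /* exclude t, a, o, i, n as well as any further e or s.            */
  /* \s*$ allows for the trailing blanks that pad the $30 variable.  */
  if _n_ = 1 then re = prxparse('/^[^taoines]e[^taoines]{4}s\s*$/');
  if prxmatch(re, word) > 0;
  drop re;
run;

proc print data=hangman; run;
```

On this sample, bedbugs, zephyrs, and regulus pass the filter, while the other three words fail on position 1, position 3, or the final letter.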
Word Inflections
A complete analysis of adverbs would be quite complicated. However, the exceptions noted earlier (happily, etc.) were easy to find by reading in a wordlist and then checking each word that ends in -ly to see if it is still a word after removing the -ly. Regular expressions provide a general way to find text patterns. They are implemented in version 9 of SAS through functions such as PRXPARSE and PRXMATCH. English is not very inflected, but this varies from language to language. For example, English is less inflected than German, and Finnish is heavily inflected. Moreover, there are many other word structures (morphemes) to analyze: plurals, verb conjugations, compound nouns, etc.
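The -ly check described above could be sketched as follows. This is one possible implementation, not necessarily the one used for the results mentioned: it assumes a small placeholder wordlist, uses hypothetical names (sample_words, ly_exceptions, stem), and looks up each stem with a DATA step hash object, which requires SAS 9.1 or later.

```sas
/* Sketch: find words ending in -ly whose stem is not itself a word. */
/* Assumption: the inline list stands in for a real wordlist.        */
data sample_words;
  length word $30;
  input word $;
datalines;
happy
happily
quick
quickly
easy
easily
sad
sadly
fly
;

data ly_exceptions;
  length stem $30;
  retain re;
  if _n_ = 1 then do;
    re = prxparse('/ly\s*$/');              /* ends in ly, ignoring padding */
    declare hash h(dataset: 'sample_words'); /* lookup table of all words   */
    h.defineKey('word');
    h.defineDone();
  end;
  set sample_words;
  if prxmatch(re, word) > 0 and length(word) > 2;
  stem = substr(word, 1, length(word) - 2);  /* remove the final ly */
  if h.check(key: stem) ne 0;                /* keep if stem is not a word */
  drop re;
run;

proc print data=ly_exceptions; run;
```

On this sample the step keeps happily, easily, and fly. The last of these is not an adverb at all, which illustrates why a complete analysis needs more than a suffix test.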
Current Status
I used language examples in CCSU's STAT 456 (Fundamentals of SAS) in Spring 2009 for the first time. Initial feedback is mixed. The language examples were difficult for non-native speakers of English. Would this be helpful in an introductory class? I plan to ask my future classes about their interest in word games to judge whether this is worth pursuing at the introductory level.