
Syllable to Word:
Relative Collocational Syllable Entropy vs. Transitional Probabilities
Phillip Potamites
June 1, 2013
Abstract
This article uses the hypothesis that language must balance predictability and informativeness to segment syllable streams into words. It shows that this method works somewhat better, and more intuitively, than the competing candidate for how children achieve word cognition. It further shows that syllable statistics fall within a specified range and argues that borderline cases are either victims of over-fitting or likely candidates for language change.

introduction

This article explores the hypothesis that entropy, in the sense of a quantitative scale measuring informativeness vs. predictability as defined by Claude Shannon (1948), is a fundamental property of natural languages, essential to their learning, and governing their possible manifestations. Given a random variable, entropy (often abbreviated H, as the capital Greek letter eta) is the negative sum of the probabilities of all its possible states times their log probabilities.
(1) H = −Σ p log p    (where Σ p = 1)
Thus, random variables with more states or more evenly distributed probabilities have higher entropies. The base of the log may be arbitrarily chosen: if e, the resulting units are referred to as nats; if 2, as bits. Bits will be used here. Using bits, the entropy of a fair coin toss is 1; of 2 coins, treating different orders as different states, 2; and so on. Not distinguishing order (either heads-and-tails or tails-and-heads), the entropy of 2 coins is 1.5, while the entropy of a 6-sided die is 2.58.
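As a quick check on these figures, here is a short Python sketch (illustrative only, not part of the study) that evaluates equation (1) in bits:

from math import log2

def entropy(probs):
    # Shannon entropy in bits: H = -sum(p * log2(p)), assuming the probabilities sum to 1
    return -sum(p * log2(p) for p in probs if p > 0)

print(entropy([0.5, 0.5]))           # fair coin toss: 1.0
print(entropy([0.25] * 4))           # two coins, orders distinguished: 2.0
print(entropy([0.25, 0.5, 0.25]))    # two coins, order ignored: 1.5
print(entropy([1/6] * 6))            # fair 6-sided die: about 2.58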
The research here examines the probability of syllables and their collocations. Conditional transitional probabilities between syllables are known to vary within and across
word boundaries, and Saffran et al. (2007) have shown that infants as young as 8 months
old are sensitive to these probabilities. This paper will present these boundary variations
for both probabilities and entropy, show how they compare in predicting word boundaries,
and consider overall limits on collocational entropy.
The fundamental hypothesis proposed here is that collocational entropy in natural
language must be limited within some moderate range, because language learning and
robustness should require repetition and redundancy (low entropy), while also remaining
efficient and informative (high entropy) by eliminating excessive repetitiveness. Genzel and Charniak (2002) have already argued that streams of text display a roughly constant rate of entropy given sentence and word context. Keller (2004) further shows that entropy corresponds to processing difficulty, while Levy and Jaeger (2007) show its ability to predict the insertion of optional syntactic markers (specifically the complementizer "that" in complement clauses).
Those studies all worked at the sentence and word level with complex language models.
This study demonstrates effective results with a simple bigram model at the syllable
level. The data here will show a number of violations of its intuitive hypothesis, but
these will lead to further predictions regarding possible applications of the theory to
language change and/or data interpretation.

materials

This research used the text of the Penn Tree Bank, and replaced the written words
there with corresponding pronunciations from the Carnegie Mellon phonetic dictionary
done in ARPAbet format. That dictionary actually provides pronunciations in segment-delineated form without syllable information, so syllables were automatically inferred following general principles implied by cross-linguistic tendencies and the sonority hierarchy. That is to say, syllables were constructed around vowels: vowel-preceding consonants (and also sonorants) were attached to those vowels, vowel-following consonants were then attached, and possible attachments for any remaining consonants were considered in turn until every consonant was attached to some vowel. Numbers and month
abbreviations were replaced with appropriate pronunciations. Words lacking entries in
the CMU dictionary were replaced with an unknown marker. This method produces
a corpus of 1,411,528 syllable tokens and 7,462 types from 897,975 word tokens and
41,213 types, with only 9,191 unknown tokens. This procedure yields a consistent system of syllabification that generally matches English conventions, as shown by the example
below.
(2) That could cost him the chance to influence the outcome and perhaps join the winning bidder.
DHAET # KUHD # KAAST # HHIHM # DHAH # CHAENS # TUW # IHN FLUW AHNS # DHAH # AWT KAHM # AEND # PER HHAEPS # JHOYN # DHAH # WIH NIHNG # BIH DER
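For concreteness, a minimal Python sketch of such a syllabifier is given below. It is not the study's code: the sonority ranks and the rule of taking the longest strictly rising-sonority suffix of a medial cluster as the next onset are assumptions introduced here, though they do reproduce the splits in example (2).

# One possible reconstruction of the vowel-centered syllabifier described above.
# The sonority ranks and the "longest strictly rising onset" rule are assumptions
# of this sketch, not details given in the paper.

VOWELS = {"AA", "AE", "AH", "AO", "AW", "AY", "EH", "ER", "EY",
          "IH", "IY", "OW", "OY", "UH", "UW"}
SONORITY = {}
for c in ("P", "B", "T", "D", "K", "G", "CH", "JH"):
    SONORITY[c] = 1   # stops and affricates
for c in ("F", "V", "TH", "DH", "S", "Z", "SH", "ZH", "HH"):
    SONORITY[c] = 2   # fricatives
for c in ("M", "N", "NG"):
    SONORITY[c] = 3   # nasals
for c in ("L", "R"):
    SONORITY[c] = 4   # liquids
for c in ("W", "Y"):
    SONORITY[c] = 5   # glides

def onset_split(cluster):
    """Index at which a medial consonant cluster splits: the longest suffix with
    strictly rising sonority becomes the onset of the following vowel, and the
    rest stays with the preceding vowel as its coda."""
    split = len(cluster)
    for i in range(len(cluster) - 1, -1, -1):
        suffix = cluster[i:]
        if all(SONORITY[a] < SONORITY[b] for a, b in zip(suffix, suffix[1:])):
            split = i
        else:
            break
    return split

def syllabify(phones):
    """Group ARPAbet phones (CMU dictionary stress digits stripped) into syllables."""
    base = [p.rstrip("012") for p in phones]
    vowel_positions = [i for i, p in enumerate(base) if p in VOWELS]
    if not vowel_positions:
        return ["".join(base)]          # no vowel: keep the word as a single chunk
    syllables, start = [], 0
    for j, v in enumerate(vowel_positions):
        if j + 1 < len(vowel_positions):
            cluster = base[v + 1:vowel_positions[j + 1]]
            end = v + 1 + onset_split(cluster)
        else:
            end = len(base)             # word-final consonants join the last vowel
        syllables.append("".join(base[start:end]))
        start = end
    return syllables

print(syllabify("IH1 N F L UW0 AH0 N S".split()))   # ['IHN', 'FLUW', 'AHNS']
print(syllabify("P ER0 HH AE1 P S".split()))        # ['PER', 'HHAEPS']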
From files of line-delineated sentences, with #-delineated word boundaries and space-delineated syllables, conditional probabilities of following syllables given preceding syllables¹ and of preceding syllables given following syllables were collected, and entropies for both the forward and backward sets were calculated.
For possible manifestations of a random variable, one can define a value of maximum entropy, where all possibilities are equally likely. From this value, one can define relative entropy (rH) by dividing the observed entropy by the maximum possible entropy, to derive a number between 0 and 1. Here, maximum possible entropy was calculated as a uniform distribution over the observed collocations.²

¹ Probability of word2 given word1 = p(w2 | w1) = p(w1 w2) / p(w1)
² Assuming the actual population of possible English collocations is larger than that observed in the sample, this number would necessarily underestimate the maximum entropy, and thus, possibly, overestimate the relative entropy.
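Concretely, the collocational statistics just defined can be sketched in Python as follows (illustrative code, not the study's own). Here `pairs` stands for the list of adjacent (preceding, following) syllable tokens read from such files; running the same functions on the reversed pairs yields the backward statistics.

from collections import Counter, defaultdict
from math import log2

def transitional_probs(pairs):
    """p(next | prev) = count(prev, next) / count(prev) for each observed pair."""
    pair_counts = Counter(pairs)
    prev_counts = Counter(prev for prev, _ in pairs)
    return {(prev, nxt): c / prev_counts[prev]
            for (prev, nxt), c in pair_counts.items()}

def relative_entropies(pairs):
    """Forward relative entropy rH for each preceding syllable: the entropy of its
    observed follower distribution divided by the maximum entropy of a uniform
    distribution over those observed followers."""
    followers = defaultdict(Counter)
    for prev, nxt in pairs:
        followers[prev][nxt] += 1
    rh = {}
    for prev, counts in followers.items():
        total = sum(counts.values())
        h = -sum((c / total) * log2(c / total) for c in counts.values())
        h_max = log2(len(counts))
        # a syllable with a single observed follower has h = h_max = 0;
        # treating its rH as 0 is an assumption of this sketch
        rh[prev] = h / h_max if h_max > 0 else 0.0
    return rh

# backward statistics: the same functions applied to the reversed pairs, e.g.
# rh_bwd = relative_entropies([(nxt, prev) for prev, nxt in pairs])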
Finally, words were segmented from syllable files using these various statistics and the
results scored as presented below.

results

Given known syllable pairs, word breaks are extremely consistent, so a supervised learning system trained to predict breaks from pairs, by simply calculating the most likely outcome, can easily achieve an accuracy of over 99% (testing/training on the same data provides the maximum accuracy; a fair application to new data would raise questions of coverage, over-fitting, etc.). The consistency of word breaks given syllable pairs can be seen in the histogram showing the frequency of syllable pairs with various probabilities. In fact, a large majority of pair types always imply breaks (189,138/212,635 = 89%), while only about 10% (21,612) account for almost all non-breaks; only about 1% of pairs (1,885) are ever ambiguous.
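This supervised ceiling amounts to a lookup table over syllable pairs. A minimal sketch (hypothetical code, where `data` is a list of (syllable1, syllable2, is_break) triples read off the corpus) might be:

from collections import Counter, defaultdict

def train_majority(data):
    """Map each syllable pair to its most frequent outcome (break or no break)."""
    votes = defaultdict(Counter)
    for s1, s2, is_break in data:
        votes[(s1, s2)][is_break] += 1
    return {pair: counts.most_common(1)[0][0] for pair, counts in votes.items()}

def resubstitution_accuracy(model, data):
    """Score the table on the same data it was built from (the >99% figure above)."""
    return sum(model[(s1, s2)] == b for s1, s2, b in data) / len(data)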

However, this kind of supervised learning approach, where desired answers are part of the input to the learning system, offers little insight into how words arise naturally as an intuitive aspect of language (though there may be some debate as to their flexibility and/or the degree to which they are taught), nor into how we might begin analyzing a completely new and alien language.
Saffran et al. (2007) hypothesize that infants use transitional probabilities (TP) to begin identifying words, and they provide evidence that infants are sensitive to these probabilities. The plots below show how probabilities and relative entropy vary within or across words, and the table lists the corresponding means and t-test results. They show that word-external probabilities are generally somewhat lower, and relative entropies much higher, than those for word-internal transitions. Though the distributions are not all entirely normal, t-tests suggest it is almost impossible that the mean differences could be due to chance.

                  mean log10 prob.   mean rel. ent.
word-internal          -1.93              .25
word-external          -2.39              .62
t-test p(null)           0                 0
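The p(null) row corresponds to a two-sample t-test over the per-transition values. With SciPy this might look like the sketch below; the exact test variant used in the paper is not stated, so the unequal-variance (Welch) form here is an assumption.

from scipy import stats

def boundary_t_test(word_internal_values, word_external_values):
    """Two-sample t-test comparing a per-transition statistic (log10 transitional
    probability or relative entropy) between word-internal and word-external
    transitions; returns SciPy's result object with the t statistic and
    two-sided p-value."""
    return stats.ttest_ind(word_internal_values, word_external_values, equal_var=False)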

Considering a range of possible thresholds for each statistic provides varying accuracy for word segmentation, as shown in the tables below.³ Transitional probabilities include information about both preceding and following syllables, so, to also include following-syllable information in the entropy statistic, the square root of the product of the forward and backward relative entropies was analyzed. In fact, this combination also improved the transitional probability performance, though it was necessary to focus on particularly low probabilities (< .01) to find the maximum accuracy. Pelucchi et al. (2009) have, in fact, shown that infants also appear capable of tracking backward transitional probabilities. Baseline segmentation systems attain roughly 60% accuracy when inserting word breaks at every syllable boundary, and 40% if inserting none. The entropy method achieves a high of 85% at exactly .5 rH, while the probability method reaches 79% at .05.

³ Insert a word break under a given threshold for probabilities and over a given threshold for entropies.
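A sketch of the thresholding step itself, assuming statistics computed as in the earlier sketch: `rh_fwd` keyed by preceding syllable, `rh_bwd` by following syllable, and the transitional-probability dictionaries keyed by the (preceding, following) pair. These names and the handling of unseen items are assumptions of the sketch, not details from the paper.

from math import sqrt

def entropy_breaks(pairs, rh_fwd, rh_bwd, threshold=0.5):
    """Predict a word break wherever the geometric mean of the forward rH of the
    preceding syllable and the backward rH of the following syllable exceeds
    the threshold (.5 gave the 85% accuracy reported above)."""
    return [sqrt(rh_fwd.get(prev, 0.0) * rh_bwd.get(nxt, 0.0)) > threshold
            for prev, nxt in pairs]

def probability_breaks(pairs, tp_fwd, tp_bwd, threshold=0.05):
    """Predict a word break wherever the geometric mean of the forward and
    backward transitional probabilities falls below the threshold (the maximum
    reported accuracy, 79%, falls near .05)."""
    return [sqrt(tp_fwd.get(pair, 0.0) * tp_bwd.get(pair, 0.0)) < threshold
            for pair in pairs]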
Surprisingly, though these methods provide similar accuracies, the two statistics are not highly correlated, as shown in the next set of tables. More possibilities that are more evenly distributed necessarily imply lower transitional probabilities and higher entropies, but entropy, and particularly relative entropy, can still vary independently, since they range over sets of probabilities instead of a single value. The Pearson r correlation between the log probability product and absolute entropy is -.52, but with relative entropy it is only -.05. The two prediction methods show about 80% agreement in their predictions, but they do disagree over 262,745 transitions, and in these cases the entropy method is correct 64% of the time vs. probability's 36%.

Plotting the relative predictive entropies of single syllables vs. their frequencies, as shown in the plot below, reveals a pattern that mostly confirms the hypothesis that language balances redundancy and informativeness. Though items that only appear a few times often manifest relative entropies of 0 or 1, almost any item with sufficient representation falls into a range that narrows as frequency increases. We might even make the predictions of our hypothesis more specific by suggesting that items necessarily tend towards .5 rH as their frequency increases (though the mean is actually slightly higher, at .55, as shown by the green line). However, such a prediction, as represented by the red lines, faces a number of exceptions.

Two syllables fall above the predicted range: ah (/ə/) and aend (/ænd/). It seems fair to say that these items are followed by word breaks, and all the diversity that comes with them, more than language typically expects. In fact, in the data here, aend is never
followed by anything but a word break. We might further note that largely analytic
languages like Chinese generally dispense with terms like articles and conjunctions, while
synthetic languages such as German actually diversify their articles. The informative theory
would predict that morphologically diverse languages should require more basic function
words to hold down their entropy, while such function words should be less necessary
in non-inflectional languages where content words could be more freely combined. It
is generally accepted that English has degenerate inflectional paradigms, and we could
predict from this evidence that the frequency of these items will decrease. Thus, the
informative theory suggests ways in which we may predict likely historical changes.
Examining syllables with log10 probability greater than -3 and rH less than .2, we find twehn (/twɛn/), ther (/θər/), fihf (/fɪf/), sihk (/sɪk/), hhahn (/hən/), maar (/mɑr/), mihs (/mɪs/), and nae (/næ/). The first five all derive from numbers (twenty, etc., and hundred), which are presumably over-represented in the Penn Treebank since it comes from the Wall Street Journal, though we may still want to ask whether numbers or other items such as months more generally tend towards over-redundancy (and what techniques of abbreviation or variation natural language then develops). Maar results from the overabundance of maar-kahts (markets), mihs from mihs-ter (mister), and nae from nae-shah as in national. In other contexts, it is likely that words such as thirsty, sick, and nasty might be more probable sources of these syllables.
These observations lead to a further practical application of the informative theory.
Learning systems need to beware of over-fitting their models to their training data in
ways that will not generalize well to new data. Items outside the predicted frequency-information range should be viewed suspiciously and regarded as most likely misrepresented in some way. Smoothing their frequencies to bring them into a more reasonable
range ought to provide more generalizable models.

discussion

Science is built on falsifiable theories. When exceptions to a theory are discovered, people
must decide to re-evaluate the data or to revise or reject the theory. Exceptions to
the theory proposed here have been revealed. Possibly they warrant rejection of the
theory. However, we have instead proposed that they point to likely future changes
and/or limitations of the data.
Mainstream linguistic theories currently emphasize universal, experience-independent accounts of language learning. To have any empirical content, these theories must interface in some way with the specific, experience-dependent aspects of diverse languages. The informative theory proposes a general cross-cultural constraint on manifestations and a premise of learning that acts on overt data, as shown above. There is no appeal to debatable underlying representations.
Both the segmentation results and range constraints provide evidence that entropy
is a significant factor in language learning and manifestations as initially proposed. The
differences within and across word boundaries also suggest that they balance each other
as required by the hypothesis: words provide more robustness, while word combinations
provide more information. Further research will also examine these relations for words
and segments in English, as well as other languages.
Though the entropy method provides only a slight improvement over probability segmentation, there are a number of reasons it could be argued to be theoretically superior. For one, it is based on an intuitively rational assumption of informativeness. Furthermore, the scale it operates on is more appealing: the difference of means is proportionally much larger, and the centering of the threshold is more aesthetically attractive.
In Saffran et al. (2007), within-word transitional probabilities were all 1 and between-word transitional probabilities were all .33. Thus, within-word rH values were all 0 and between-word rH values were all 1, so Saffran's study surely confounded these two variables, though in the real world this correlation is not absolutely necessary. Further research is needed to determine whether humans rely on one, the other, or both.

References
Genzel, D. and Charniak, E. (2002). Entropy rate constancy in text. In Proceedings of
ACL.
Keller, F. (2004). The entropy rate principle as a predictor of processing effort: An
evaluation against eye-tracking data. In Proceedings of the Conference on Empirical
Methods in Natural Language Processing, pages 317–324.

Levy, R. and Jaeger, T. F. (2007). Speakers optimize information density through syntactic reduction. In Schölkopf, B., Platt, J., and Hoffman, T., editors, Advances in Neural Information Processing Systems 19, pages 849–856, Cambridge, MA. MIT Press.
Pelucchi, B., Hay, J. F., and Saffran, J. R. (2009). Learning in reverse: 8-month-old
infants track backward transitional probabilities. Cognition, 113(2):244–247.
Saffran, J. R., Aslin, R. N., and Newport, E. L. (2007). Statistical learning by 8-month-old
infants. Science, 274:1926–1928.
Shannon, C. E. (1948). A Mathematical Theory of Communication. CSLI Publications.
