
Speaking: A Critical Skill and a Challenge


Kathleen B. Egan
Federal Language Learning Laboratory
ABSTRACT

Speaking is at the heart of second language learning but has been somewhat ignored in teaching and testing for a number of logistical reasons. Automatic Speech Recognition (ASR) can give speaking a central role in language instruction. This article describes plans and efforts to shape speech-interactive Computer-Assisted Language Learning (CALL) programs. Current proficiency guidelines provide a practical framework for this development. Although questions and challenges remain, current implementations of ASR provide some solutions now, and on-going research holds great promise for future implementations.

KEYWORDS

Speaking, Automatic Speech Recognition, Speech-Interactive CALL, National Foreign Language Standards, Second Language Acquisition

INTRODUCTION

Speaking is at the heart of second language learning. It is arguably the most important skill for business and government personnel working in the field, yet it appears particularly vulnerable to attrition. Despite its importance and its fragility, speaking was until recently largely ignored in schools and universities, primarily for logistical and programmatic reasons, such as emphasis on grammar and culture and unfavorable teacher-student ratios. Speaking was also absent from testing because of the difficulty in evaluating it objectively and the time it takes to conduct speaking tests (Clifford, 1987). Finally, speaking has been neglected in Computer-Assisted Language Learning (CALL) technology. Until recently, CALL programs engaged students in listening, reading, and filling in blanks but not in producing oral language.
1999 CALICO Journal

Volume 16 Number 3

277



The current situation is different. An increased emphasis on the acquisition of communicative language skills calls for language learning software that is speech-enabled and engages learners in interactive speaking activities. Developing this software is now feasible with the deployment of automated speech recognition (ASR) on PC platforms. In this article, I will share our experiences in the Federal Language Training Laboratory and collaborating organizations in the U.S. government in shaping speech-interactive CALL. I will propose ways to drive development in terms of sound pedagogy and not solely technology. While I am enthusiastic about integrating ASR into language learning, I am also conscious of the potential for misuses of this powerful capability. Gaps in CALL systems are not always due to limitations inherent in the technology. They are also due to the lack of clearly formulated learning goals and pedagogical principles as well as to an incomplete understanding of the capabilities and complexities of the technology. The perspective of my colleagues and me stems from personal experience in teaching and managing government programs. With the emerging use of ASR in CALL products, we quickly observed that users' reactions to the technology ranged from eagerness to rejection. I would point out here that the rejection came mostly from administrators and teachers rather than from learners. My colleagues and I also observed that generalizations about the technology were made across systems regardless of the products' claims and/or objectives. We soon discovered that the language community lacked a common framework for the design and evaluation of multimedia CALL.

A PROFICIENCY FRAMEWORK

Proficiency goals can direct the design and development of quality learning activities. Foreign language proficiency is measured by the ability to communicate in the language. This ability is demonstrated in the understanding of authentic aural and written materials and in the ability to generate spoken and written language for real-life purposes. Proficiency in a language is a complex concept. Second language acquisition scholars are still trying to define it and identify what its components are. Practitioners such as the American Council on the Teaching of Foreign Languages (ACTFL) and the Interagency Language Roundtable (ILR) continue working toward a framework for describing language proficiency at different levels that incorporates the elusive components of communicative ability. Still, the oral proficiency tests developed by ACTFL and ILR have achieved high reliability and are a good anchor for the concept of speaking proficiency that CALL seeks to foster. ACTFL and ILR have defined speaking proficiency for testing purposes
as the ability of an individual to "carry out in appropriate ways communicative tasks which are typically encountered where the language is natively spoken." The proficiency tests present different topics at various levels of complexity that require the individual to handle vocabulary, structure, pronunciation, pragmatics, and sociolinguistic functions. Thus, rather than selecting one language acquisition theory or another as a framework for developing speech-interactive CALL, we see the proficiency definitions as the best practical framework. The proficiency guidelines focus on the goal rather than on the process and are useful for guiding software development (Egan & Kulman, 1998).

Communication in the National Foreign Language Standards

Enriching the concept of language proficiency are the new national foreign language standards (Phillips, 1998), organized around five goal areas: communication, culture, connections, comparisons, and communities. In all areas the focus is on what students should be able to do in a foreign language. This view departs from the linear sequential separation of skills and also goes beyond the traditional distinction between skill-getting and skill-using. In the communication standard, speaking is still key; however, speaking is for a purpose. The intent is to have learners engage in realistic tasks rather than just practicing linguistic material. There is consensus in the profession that learning a second language is not only for cultural and literary knowledge but also, and primarily, for practical and/or professional reasons. Language is as much a skill in which the individual engages interactively with others as it is a tool to extract information from written or aural materials. Using the language implies that the speaker is able to progressively perceive, understand, present, negotiate, persuade, hypothesize, and interpret in that language. These functions of real-life communication need to be represented in multimedia software for language learners. CALL is faced with a fantastic opportunity to provide help toward reaching the communication goal of the national foreign language standards by incorporating available technologies in ways that promote acquisition of communicative competence.

Assessing and Advancing Technology Options in Language Learning

The proficiency framework was further elaborated at the 1998 Symposium on Assessing and Advancing Technology Options in Language Learning (AATOLL), which the Federal Language Learning Laboratory sponsored together with the National Foreign Language Resource Center at the University of Hawaii, under the direction of Dr. Irene Thompson. The



symposium brought together the viewpoints of different disciplines as they affect the conceptualization and evaluation of multimedia CALL. Participants from academe, industry, and government collectively aimed at creating a dialogue among experts and fostering an exchange between specialists in foreign language pedagogy and in technology. Papers from the AATOLL Symposium were published in a special issue of Language Learning and Technology entitled "The Design and Evaluation of Multimedia Software" (available at http://polyglot.cal.msu.edu/llt/vol2num1/default.html). In addition, a web site was created with a data bank of over 600 multimedia language programs in 45 languages as well as an extensive taxonomy for their evaluation (available at http://nts.lll.hawaii.edu/flmedia).

SPEAKING AND TECHNOLOGY


The Problem of Integrating Speaking into CALL

At the same time as the profession shifted towards communication in foreign languages, CALL technology began to flourish. However, the technology available to CALL developers limited their ability to promote speaking proficiency. I am not referring here to the ability to e-mail in the target language, participate in chatrooms, or communicate in the language via the web. While these activities are critical to generating language and reflecting on one's learning process, they nevertheless keep the learner mute. Integrating speaking into the technology is problematic on many grounds: funds for developing the technology for language learning are limited, research projects have often remained at the prototype stage, and the commercial side has not fully absorbed the lessons learned from experimental efforts. We started our experimental work in 1993 with a simple question: How can we help adult learners improve their speaking skills? Adult learners have a hard time distinguishing the sounds of the language, are often inhibited in speaking in front of others, and usually do not get undivided attention from their instructors (Egan, 1996; Eskenazi, 1999; this issue). Our vision then and now is that computers have a role to play in learning to speak; however, interaction with the computer remains mainly via keyboard and mouse. Most commercial software provides learners with practice in filling blanks or choosing the correct answer. A small proportion of available software offers learners practice in reading and listening to authentic written and spoken language. An even smaller proportion lets learners produce language by repeating words or sentences, recording their responses, and comparing them to native models (see Wachowicz & Scott, this issue).

However, getting learners to produce spoken language cannot be limited to recording one's voice and comparing it to native models. Speech needs to be an integral part of the instructional design, and production possibilities need to be expanded. We share the view of Chapelle (1998) that multimedia CALL must have learners produce in meaningful ways in the target language. Like input, which can be either uncomprehended noise or valuable for acquisition, output can be produced mindlessly or it can be created by the learner under conditions that facilitate acquisition. In the past four years, we have seen speech technologies and, in particular, ASR begin to be integrated into CALL. While this integration signals an emerging awareness on the part of software designers that CALL products can no longer ignore speaking skills, we are still at the very beginning of using this technology well. As Ehsani and Knodt (1998) pointed out in their comprehensive overview of speech technology in CALL, we need to exploit the strengths of the technology while working around its limitations.

Advantages of Technology for Learning

As many have said (Clark & Sugrue, 1991; Hubbard, 1987, 1998; Clifford, 1998), the best of technology does not by itself create a productive learning environment. The technology offers access, authenticity, and insights (Phillips, 1998). I would add that advances in intelligent and adaptive technologies also offer a world of illusion, games, and simulations. Technology can stimulate the playfulness of learners and immerse them in a variety of scenarios. Technology gives learners a chance to engage in self-directed actions, opportunities for self-paced interactions, privacy, and a safe environment in which errors get corrected and specific feedback is given. Feedback by a machine offers additional value by its ability to track mistakes and link the student immediately to exercises that focus on specific errors. Studies are emerging that show the importance of qualitative feedback in CALL software. When links are provided to locate explanations, additional help, and reference, the value of CALL is further augmented.

SPEECH TECHNOLOGY IN LANGUAGE LEARNING


A Communicative Focus

The inclusion of ASR in software programs leads to a reexamination of multimedia CALL. Little research exists on how CALL can be used to develop speaking skills. What does it mean to the learner to have a speech-interactive system in the learning process? Where should ASR be used? Why should we use it? What role does it play, if any? The ILR and ACTFL definitions of speaking proficiency imply communicating in context. Proficiency tests (with some variations) have the speaker demonstrate functional ability for different topics and levels of complexity. How well the individual handles linguistic variables in the service of communication determines the individual's standing on the proficiency scale. The same can be said of CALL. According to Nunan (1991), the optimum CALL program focuses on meaning rather than on form; it involves learners in comprehending, manipulating, producing, or interacting in the target language while focusing their attention principally on meaning. However, because speaking involves both sound and meaning in real-life situations, complete speaking instruction must address both discourse issues in context and phonemes as isolated and contrasting sounds. Levelt (1989) has put together a framework that characterizes speaking as a multilevel process, moving from intention to articulation and involving acoustic, linguistic, social, pragmatic, and functional characteristics. Thus, rehearsed sounds, memorized sentences, reciting of read prose, repetition of words and sentences, and spontaneous conversations all have relevance. Sound and meaning are interrelated yet distinct entities. Some work in CALL has focused on production of sounds in isolation from context, with remediation of speech, accent modification, and speech analysis (Delmonte, 1998). Now, with emerging speech technologies, the shift is toward relating pronunciation and other exercises on linguistic form to communication (Chun, 1998) and placing them in a communicative context.

Tailoring ASR to Learning Objectives

From our experience in developing speech-interactive CALL, we realized that it is important to clarify the role of speech technology in the learning process. Is it to diagnose and improve pronunciation? Is it to develop confidence and fluency? Is it to model and mimic native speech? Is it to use speech to navigate through a program? Is it to acquire new vocabulary and language structures, to solve problems, or to role-play? All these objectives can be integrated into a CALL program, but each requires specific adaptation of the technologies and design.

Human Versus Machine Tutoring

Another important factor in developing CALL is to determine clearly the differences between the human tutor and the machine. In an earlier
paper (Egan & Kulman, 1998) we stated, "While technology can mirror the human tutor and/or instructor, it is not necessarily a replacement for the human interaction in language learning." Machines are better at storing large quantities of data, a variety of resources, and links to access other resources. The machine can give students control over their learning and provide privacy and flexibility in time and space, but the machine is limited in perception, understanding, and decision-making. As Pinker (1994) has noted, understanding a sentence exemplifies the kind of problem that is hard for machines and easy for humans. And if machines have a hard time understanding a sentence from a text, the challenge is even bigger when the source of information is acoustic.

Limitations of ASR

Meador et al. (1998) pointed out that as the technology for speech recognition has matured, the possibility of building interesting and meaningful exercises for language learners with this technology has become real. However, ASR cannot do everything users may want. Task definition (complexity, vocabulary size), acoustic models (speaker dependent, independent, or adapted), input quality (noise levels, microphone, sound card), and input modality (discrete or continuous input) have an impact on speech recognition performance (Bernstein & Franco, 1996; Ehsani & Knodt, 1998). All these factors need to be taken into account when designing a learning activity or promising the learner feedback on their discourse.
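The design variables listed above can be made concrete in a short sketch. The following Python fragment is purely illustrative: the factor names follow the discussion here, but the thresholds and warning messages are invented for the example and do not come from any cited system.

```python
from dataclasses import dataclass

@dataclass
class RecognitionTask:
    vocabulary_size: int    # task complexity
    acoustic_model: str     # "speaker_dependent", "speaker_independent", or "speaker_adapted"
    snr_db: float           # input quality: signal-to-noise ratio of microphone/sound card
    continuous_input: bool  # input modality: continuous vs. discrete speech

def design_warnings(task: RecognitionTask) -> list:
    """Flag design choices that tend to hurt recognition performance.

    Thresholds are hypothetical values chosen for illustration.
    """
    warnings = []
    if task.vocabulary_size > 5000:
        warnings.append("vocabulary larger than typical CALL tasks (1k-5k words)")
    if task.acoustic_model == "speaker_independent":
        warnings.append("speaker-independent models are hardest for nonnative speech")
    if task.snr_db < 20:
        warnings.append("noisy input: check microphone and sound card")
    if task.continuous_input:
        warnings.append("continuous speech is harder to recognize than discrete input")
    return warnings
```

A designer could run such a check on a proposed activity before promising the learner any particular kind of feedback.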

CHALLENGES IN DEVELOPING ASR FOR CALL VERSUS OTHER APPLICATIONS

When we got interested in ASR in 1993, we were faced with challenges in applying the technology to language learning that had not been faced in other ASR applications. At the same time, we realized that we had common challenges with the larger ASR research community. Many issues confronting ASR for CALL were no different from those tackled in research on large vocabulary continuous speech recognition (LVCSR), as summarized by Young (1996). In some ways the application of ASR to language learning is easier and in some ways harder than traditional LVCSR. Table 1 presents relevant differences and similarities. These and other differences are discussed in the following paragraphs.




Table 1
Comparison of Variables for Large Vocabulary Continuous Speech Recognition (LVCSR) and Language Learning

Variables               LVCSR                    Language Learning
Vocabulary Size         Large (20k-100k words)   Moderate (1k-5k words)
Speaker's Disposition   Noncooperative           Cooperative
Recording Environment   Noisy                    Noiseless
Speaker's Proficiency   Native                   Nonnative*
Speech Assessment       Not important            Very important*
Speaking Style          Conversational           Conversational
Recognition Speed       Real time                Real time

*These two variables are critical for language learning ASR and are harder to deal with than in traditional LVCSR.

Nonnative Speakers and Speech Patterns

LVCSR technology, even when applied to other languages, was in 1993 and is still today used mainly by native speakers and for native speakers. This native-speaker basis is also the case for commercial developments stemming from LVCSR research, for example, dictation systems, command and control, and telephony. In contrast, CALL is designed for nonnative speakers, whose early learning is characterized by errors, disfluencies, and mispronunciations of far greater scope, and whose speech is qualitatively different from that of native speakers.

Speech Assessment

The focus of LVCSR technology was, when we started our work, and still remains, on recognition. In contrast, learners' speech in CALL needs not only to be recognized but also to be diagnosed and corrected, with learners given meaningful, validated feedback to improve their speech. Assessment demands more than just the statistically based recognizers used in LVCSR. Pronunciation assessment needs to be able to make distinctions between possible pronunciations of words, while traditional ASR tries to recognize commonalities between different pronunciations of words, i.e., to distinguish between words, not pronunciations. Therefore, the best tools for speech recognition may not be the most appropriate for pronunciation assessment.


Languages Other Than English

When we started our work, most LVCSR research had been done in English, whereas our interest was in languages other than English (including the less commonly taught languages), which were not addressed in commercial products.

Evaluating the Success of an Application

The criteria for successfully applying ASR in CALL are not the same as for LVCSR. In LVCSR applications such as dictation and telephony, accuracy, speed, and user interface are the primary determinants of ASR performance, as seen in the LVCSR research and evaluations sponsored by the Defense Advanced Research Projects Agency. While these variables also play a role in CALL, it is more important to determine whether users are learning the language and whether their learning is better served by ASR than by other means. Thus, we faced the challenge of how to measure specific aspects of language learning.

DEALING WITH THE CHALLENGES OF DEVELOPING ASR FOR CALL

Some of the challenges from 1993 are still with us. Below I consider progress made in meeting these challenges and the work that remains to be done, providing illustrations from systems featured in this issue of the journal. A companion paper by LaRocca et al. (this issue) further considers the adaptations needed to apply ASR to CALL.

Improving Recognition of Nonnative Speakers: Modeling Utterances for Acceptance and Rejection

The error rate in speech recognition is still high unless the system is speaker-dependent or speaker-adapted, as in dictation systems for natives. (These systems typically start as speaker- and text-independent and become dependent on or adapted to the speaker in order to work well.) At this point, enough data have been collected to show how to create acoustic models that include both native and nonnative speech. However, accounting for all pronunciation disfluencies and grammatical variants and knowing when to reject utterances that fall outside the expected vocabulary are unresolved problems (Byrne et al., 1998). Language learning applications demand very low false-acceptance rates so that learners who make errors are not misled into believing they are correct. We need more research on rejection modeling so as to be able to signal the learner appropriately: "You weren't understood" or "You need to repeat." Rejection modeling helps CALL avoid forcing the categorization of a questionable utterance into one of a small set of expected responses. The software engineer aims at the recognition of all utterances to minimize word error rates, while the educator aims at accepting certain utterances and rejecting others. I believe that in some activities it is best to accept mispronounced utterances for the purpose of building confidence rather than frustrating the learner. In other formative activities, it is best to reject questionable utterances to ensure comprehensible and acceptable pronunciation. The capability needed in CALL in this case is not to recognize but rather to know when and why to reject. This challenge requires research and strong partnerships between speech engineers, on the one hand, and pedagogues and language testing experts, on the other.
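The accept/reject trade-off between confidence-building and formative activities can be sketched as a simple decision rule. This is a hypothetical illustration: the threshold values and mode names are invented, and a real system would derive its confidence scores from the recognizer and its rejection model rather than receive them as inputs.

```python
def judge_utterance(confidence: float, expected_match: bool, mode: str = "fluency") -> str:
    """Decide whether to accept a learner utterance or ask for a repeat.

    confidence:     recognizer confidence in [0, 1] for the best hypothesis
    expected_match: whether the hypothesis falls inside the expected responses
    mode:           "fluency"   -> lenient, accepts imperfect speech to build confidence
                    "formative" -> strict, rejects questionable speech to enforce
                                   comprehensible pronunciation
    The thresholds below are illustrative, not empirically derived.
    """
    threshold = 0.40 if mode == "fluency" else 0.75
    if not expected_match:
        return "reject: utterance outside expected responses, please repeat"
    if confidence < threshold:
        return "reject: you weren't understood, please repeat"
    return "accept"
```

The same recognizer output thus yields different pedagogical decisions depending on the activity's goal, which is the distinction drawn above between building confidence and enforcing acceptable pronunciation.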

Developing Instructional Designs to Surmount Limitations of ASR

Another challenge involves restricting communicative domains and predicting the possible branching that learners will follow in simulated dialogues with a computer character. Predicting possible branching paths is feasible but not necessarily quick and easy for CALL developers. Ehsani and Knodt (1998) differentiate between closed- and open-response designs and consider how these designs affect response prediction for speech-interactive CALL dialogues.

CLOSED-RESPONSE DESIGNS

Closed-response designs display a few utterance choices for learners to say. The ECHOS version of the Voice-Interactive Language Training System (VILTS) described by Rypa and Price (this issue) and Rypa (1996) is a good example of closed-response design. It lets learners choose from three possible responses displayed on the screen in a simulated conversation. One purpose of the system is to collect sufficient samples of speech data from students to give them an accurate pronunciation score. The words in the conversation are carefully chosen to represent difficult phones for Americans learning French. These segments were studied, and the automated scores were found to be highly correlated with the scores of expert human raters (Neumeyer et al., 1998). Closed-response designs are also used, in the interest of assuring accurate recognition, in the Virtual Conversations program of Harless, Zier, and Duncan, the multiple choice exercises described by LaRocca, Morgan, and Bellinger, and the MILT microworld described by Holland, Kaplan, and Sabol (all in this issue).
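A closed-response turn of this kind can be sketched in a few lines. The French prompt and choices below are invented examples, and exact string matching on the decoded hypothesis stands in for the constrained recognition grammar a real system such as VILTS would compile.

```python
# Hypothetical closed-response dialogue turn: the learner sees the prompt
# and three utterances, and the recognizer only has to pick among them.
TURN = {
    "prompt": "Le serveur demande : « Vous désirez ? »",
    "choices": [
        "Un café, s'il vous plaît.",
        "L'addition, s'il vous plaît.",
        "Rien, merci.",
    ],
}

def recognize_closed(hypothesis: str, turn: dict):
    """Constrain 'recognition' to the displayed choices.

    Returns the index of the matched choice, or None when the utterance
    falls outside the closed set and should be rejected.
    """
    for i, choice in enumerate(turn["choices"]):
        if hypothesis == choice:
            return i
    return None
```

Restricting the search space this severely is what makes closed-response designs reliable enough for pronunciation data collection.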

OPEN-RESPONSE DESIGNS

Open-response designs do not tell learners what to say. These designs are time-consuming to develop and require a multilevel network grammar based on data collected from students, natural language processing capabilities, and strategies for recovering from misunderstandings. An example of an open-response design is Subarashii, a version of the Interactive Spoken Language Education (ISLE) program described by Bernstein, Najmi, and Ehsani (this issue), Ehsani et al. (1997) and Meador et al. (1998). Designed to teach Japanese, Subarashii supports multiturn, open conversations focused on a very specific task; the constrained task delineates the grammar network. The goal of Subarashii is not to provide students a score or feedback on their pronunciation but rather to have them engage in a human-like dialogue with the system. Subarashii was developed to build confidence and to have the learner role-play and solve simple contextualized problems. Similarly, an extension of VILTS called SOLVIT (Rypa & Price, this issue) employs open-response exercises that are built up from closed-response exercises.
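The constrained, multiturn grammar behind an open-response task can be sketched as a small finite-state network over word classes. The states, phrases, and task below are invented for illustration and are not taken from Subarashii or SOLVIT; a real system would also need recovery strategies for out-of-grammar speech.

```python
# Hypothetical "multilevel network grammar" for a tiny asking-directions
# task: states are dialogue positions, arcs are word classes, and an
# utterance sequence is accepted only if it reaches the END state.
NETWORK = {
    # state -> {word_class: next_state}
    "START": {"GREETING": "OPENED", "REQUEST": "REQUESTED"},
    "OPENED": {"REQUEST": "REQUESTED"},
    "REQUESTED": {"POLITE": "END"},
}

WORD_CLASSES = {
    "GREETING": {"sumimasen"},
    "REQUEST": {"eki wa doko desu ka"},
    "POLITE": {"onegaishimasu"},
}

def accepts(phrases: list) -> bool:
    """Walk the network over a sequence of recognized phrases."""
    state = "START"
    for phrase in phrases:
        for word_class, next_state in NETWORK.get(state, {}).items():
            if phrase in WORD_CLASSES[word_class]:
                state = next_state
                break
        else:
            return False  # no arc matches: utterance is out of grammar
    return state == "END"
```

Because the task delineates the network, the recognizer never has to consider the full space of possible learner responses, which is how open-response designs remain tractable.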

MOVING TOWARD FREER DIALOGUE

In language learning environments, our trials with some of the systems mentioned above show that the recognizer works correctly in close to 99% of the cases if the vocabulary is small, the network well defined, and the scope of the conversation limited and guided. The challenge now is to build full systems that apply ASR to CALL. Most designs, even the most interactive ones, are largely of the closed-response type with fixed multiple-choice utterances. Enabling open, multiturn discourse is far more complicated, and the number of possibilities at each step of the exchange can run extremely high. Interim solutions such as Subarashii and SOLVIT, which bridge the gap between closed and open design, must rely on constrained recognition through clever use of specified-domain tasks. In these situations, it is expected that the learner will not make too many grammatical mistakes or deviate too much from the expected pattern of responses. Data must be gathered on how learners acquire patterns of language for certain situations. Good teachers can intuitively predict how a student will respond, and it is this kind of knowledge that needs to be captured in our systems. Machines can learn, and the goal is to create transitional prototypes with limited dialogue capability which collect student-interaction data for future, freer interactions. Complex authoring tools must also be developed that allow seamless integration between activity design and recognizer control.



Developing Robust and Transparent Language Assessment Tools

A variety of experimental work on pronunciation assessment is reported in this special issue of the journal. Work at SRI (Bernstein et al., 1990; Neumeyer et al., 1996; Price, 1998; Rypa & Price, this issue), at Carnegie Mellon University (Eskenazi, 1999; this issue), and at Indiana University and Communications Disorders (Dalby & Kewley-Port, this issue) has enabled ASR to give learners reliable pronunciation scores and some corrective feedback. However, as discussed by Wachowicz and Scott (this issue), most pronunciation graders purchased off the shelf depend on the student to diagnose shortcomings. For reasons stated earlier (and detailed in this issue by Dalby & Kewley-Port as well as by Rypa & Price), current commercial speech recognizers are not appropriate for pronunciation training without adaptation. Relevant here is the discussion by Mostow and Aist (this issue) on the kinds of ASR adaptations needed to track oral reading. Current speech recognizers may offer scores, but students need more than just scores to help them improve pronunciation. They need systems that diagnose specific pronunciation problems and communicate these deficiencies clearly. The required technologies for such a venture are not yet widely available beyond the prototype work just mentioned. These technologies need to be further developed, drawing from theories of speech perception and production as well as from research on how corrective feedback and modeling of articulation can help students improve their speech.
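One common way to turn recognizer output into a pronunciation score, broadly in the spirit of the SRI work cited above, is to compare the acoustic likelihood of the target phone against competing phones and average over the utterance. The sketch below assumes per-phone log-likelihoods are already available from a recognizer; the function names and scoring scheme are illustrative, not the cited systems' actual algorithms.

```python
import math

def phone_score(target_loglik: float, competitor_logliks: list) -> float:
    """Posterior-like score in (0, 1] for one phone segment.

    The score is high when the acoustics favor the target phone over
    its competitors, and low when a competitor fits better.
    """
    denom = sum(math.exp(l) for l in [target_loglik] + competitor_logliks)
    return math.exp(target_loglik) / denom

def pronunciation_score(segments: list) -> float:
    """Average per-phone scores over an utterance.

    segments: list of (target_loglik, competitor_logliks) pairs,
    one per phone segment in the forced alignment.
    """
    scores = [phone_score(t, c) for t, c in segments]
    return sum(scores) / len(scores)
```

A score alone, as noted above, is not diagnosis; but per-phone scores like these at least indicate which segments pulled the average down, which is a first step toward communicating specific deficiencies.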
Developing Systems Suitable for Realistic Platforms

From 1993 to 1996, speech-interactive CALL prototypes using continuous speech recognition were developed and delivered on UNIX workstations; ASR on PC platforms was a novelty. Whereas the platform issue is now resolved (see, in this issue, Bernstein et al.; LaRocca, Morgan, & Bellinger; and Mostow & Aist), the vocabulary for language learning applications is still small to medium in size, and authoring tools are limited. The integration of continuous ASR with commercial authoring tools such as Macromedia's Authorware has been demonstrated with Subarashii at the Federal Language Training Laboratory. At the United States Military Academy, LaRocca et al. (this issue) have shown the WinCalis courseware authoring tool to work with continuous ASR. The developmental Global Authoring System (GLAS) by BlueShoe Technologies has also proved capable of incorporating recognition and other multilingual processing tools. The necessary tools have been assembled, but building easy-to-use authoring systems and robust realistic dialogues for learners of foreign languages is still a challenge.

Collection of Speech Data in Multiple Languages

Building speech-enabled systems requires two separate types of speech data: data for acoustic modeling and data to inform learning activities. The first type of data are not necessarily the same as the acoustic models sought by LVCSR researchers. Speech data for language learning have to be a mix of native and nonnative samples. It is preferable, moreover, for the corpus to include a range of male and female voices at all age levels. The second type of data, the input for learning activities, should represent a range of situationally and culturally authentic spontaneous dialogues containing natural disfluencies, redundancies, and ellipses. There are advantages for the language education community to collect these data and add them to the existing Linguistic Data Consortium (LDC) database at the University of Pennsylvania. This kind of collection effort amounts to a one-time effort per language. In addition, the collected data allow flexibility in instructional design, and the same data can be reused for a variety of learning activities, as discussed by LaRocca et al. (this issue). Creating data sets for education is a necessity not only for the use of speech samples but also for other learning activities. It would be ideal to have a library of videos, texts, graphics, and audio organized by languages, functions, topics, and skill levels. Links in the data set itself and hyperlinks with lexical and other linguistic resources need to be structured in ways that make them accessible and easy to manipulate by designers, teachers, and learners.
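The kind of indexing called for here can be sketched as a simple metadata record per speech sample. The field names below are hypothetical; an actual shared corpus would follow whatever cataloguing conventions the LDC and the education community agreed on.

```python
from dataclasses import dataclass

@dataclass
class SpeechSample:
    """Illustrative metadata for one sample in a language-learning corpus."""
    language: str
    speaker_native: bool  # corpora need both native and nonnative speech
    speaker_sex: str
    speaker_age: int
    topic: str            # situational indexing, e.g. "restaurant"
    function: str         # communicative function, e.g. "request", "negotiate"
    skill_level: str      # e.g. an ILR level label

def balance_report(samples: list) -> dict:
    """Count native vs. nonnative samples to check corpus balance."""
    counts = {"native": 0, "nonnative": 0}
    for s in samples:
        counts["native" if s.speaker_native else "nonnative"] += 1
    return counts
```

Indexing by language, function, topic, and skill level is what lets the same recordings be reused across many learning activities.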

Metrics for Evaluating Speech-interactive CALL

Evaluating the effectiveness of developed systems is still a challenge. We have a proposed proficiency framework and a taxonomy for multimedia foreign language evaluation, but we still need studies to validate that framework. Also needed are answers to driving questions: (a) Which instructional designs are best suited for inclusion of ASR? (b) How can we maximize the use of ASR and enhance learning at the same time? (c) How good does the speech recognizer have to be for the system to be functional and useful? (d) Should ASR always assess pronunciation? (e) How can CALL build up speaking confidence, expand lexical mastery, and improve pronunciation and overall facility with the spoken language? Research on human-machine interaction has not focused on second language acquisition, and the role of the multimedia interface in the learning process is just beginning to be examined (Plass, 1998).



NEW DIRECTIONS IN DESIGN

Based on the challenges outlined here, it is critical to define clearly the role of ASR in the development of CALL. Given the current state of the art, including a speech recognizer in a CALL program does not necessarily mean that students will receive meaningful feedback on the quality of their speech. Current recognizers cannot discriminate between good and bad pronunciation unless the system includes a pronunciation assessment tool in addition to the recognizer; students have to mispronounce badly for an utterance to be rejected. Therefore, I would suggest that current speech-interactive CALL use ASR for building confidence and fluency rather than for pronunciation assessment, except in narrow applications such as the minimal pair drills used in speech therapy and specific error correction (e.g., in this issue, Dalby & Kewley-Port and LaRocca et al.).

The research, validation, and testing needed to adapt ASR to pronunciation assessment are time consuming and difficult. As more research accumulates and testing progresses (see Rypa & Price, this issue), we can look forward to the inclusion of more general pronunciation assessment in field-tested systems. Even a human teacher cannot give students effective feedback on their communication skills without knowing their learning styles or without the experience of working with many students who make similar mistakes. The machine needs the same kinds of data in order to make a reliable assessment that correlates highly with the judgments of human experts. It is of paramount importance to know what goes on behind the scenes when the machine provides scores, feedback, and comments.

The proficiency goals from the ILR and ACTFL can direct the design of pedagogically sound activities. The simplest principle is "practice makes perfect": the best way to improve reading and listening skills is to read and listen extensively.
While reading and listening materials abound and learners can easily find them in newspapers, on radio and TV, on the web, in textbooks, and on CD-ROMs, the same cannot be said of speaking. For a beginner who finds speaking in full sentences difficult, practicing with a native speaker or with a teacher whose time is limited can be frustrating. A machine that listens is a good practice tool; a machine that listens, responds, and engages in conversation is better; and a machine that simulates multiturn dialogue is better still. Best of all is a system that does all of this, assesses the quality of the learner's speech, and directs ways to improve it. An ideal system, still a future vision, would behave intelligently: track students' progress by accumulating data on their speech, detect and trace patterns of mistakes, and adapt feedback and guidance accordingly. Some progress has been made on adaptive prototypes for language learning that perform some of these functions (Holland, 1995, 1997; LaRocca et al., this issue).
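The adaptive behavior envisioned above can be sketched in a few lines. The error categories and the threshold for what counts as a "pattern" are invented for illustration; a real tutor would derive them from the recognizer's analysis and from validated pedagogy.

```python
from collections import Counter

class LearnerModel:
    """Toy sketch of a tutor that accumulates a learner's errors
    and adapts its guidance to recurring patterns."""

    def __init__(self, pattern_threshold=3):
        self.errors = Counter()
        self.pattern_threshold = pattern_threshold  # assumed cutoff

    def record(self, error_type):
        """Log one detected error, e.g., from a recognizer's analysis."""
        self.errors[error_type] += 1

    def recurring_patterns(self):
        """Error types seen often enough to count as a pattern."""
        return [e for e, n in self.errors.items()
                if n >= self.pattern_threshold]

    def feedback(self):
        """Adapt guidance: target recurring patterns first."""
        patterns = self.recurring_patterns()
        if patterns:
            return "Focus on: " + ", ".join(sorted(patterns))
        return "No recurring error patterns yet; keep practicing."

# Simulated session: three detections of one error, one of another.
model = LearnerModel()
for err in ["r/l confusion", "r/l confusion",
            "dropped article", "r/l confusion"]:
    model.record(err)
```

The design choice worth noting is the separation of detection (`record`) from adaptation (`feedback`): the same accumulated data could drive different guidance strategies without changing how errors are logged.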
SUMMARY

CALL systems that include ASR can help develop proficiency. Learners exposed to large quantities of speech from different native speakers will train their ears to discriminate sounds and constructions better. Learners who also get to produce speech will improve their speaking skills. While we do not know whether this improvement affects fluency, confidence, or fully functional communicative skills, we believe it is critical that CALL not keep learners mute. Creativity and multidisciplinary partnership are the keys to making CALL fully communicative.

We face challenges when applying ASR in language learning. We need research on recognizing nonnative conversational speech, on accommodating the variant grammatical and sociolinguistic constructions characteristic of learners, and on modeling and predicting multiturn spontaneous dialogues. We also need more and better training data to model nonnative speech. Authoring tools with embedded multilingual and multimedia capabilities need to become more widely available and user-friendly. By developing and testing more ASR-based CALL, we will be better positioned to know empirically what works and what does not.

REFERENCES
Bernstein, J., Cohen, M., Murveit, H., Rtischev, D., & Weintraub, M. (1990). Automatic evaluation and training in English pronunciation. In Proceedings of the International Conference on Spoken Language Processing (ICSLP), Kobe, Japan.

Bernstein, J., & Franco, H. (1996). Speech recognition by computer. In N. Lass (Ed.), Principles of experimental phonetics. St. Louis: Mosby.

Byrne, W., Knodt, E., Khudanpur, S., & Bernstein, J. (1998). Is automatic speech recognition ready for non-native speech? A data collection effort and initial experiments in modeling conversational Hispanic-English. In Proceedings of the Workshop on Speech Technology in Language Learning (StiLL), Stockholm, Sweden.

Chapelle, C. A. (1998). Multimedia CALL: Lessons to be learned from research on instructed SLA. Language Learning & Technology [on-line serial], 2 (1), 22-34. Available: http://polyglot.cal.msu/llt

Chun, D. (1998). Signal analysis software for teaching discourse intonation. Language Learning & Technology [on-line serial], 2 (1), 61-77. Available: http://polyglot.cal.msu/llt

Clark, R. E., & Sugrue, B. M. (1991). Research on instructional media, 1978-1988. In G. Anglin (Ed.), Instructional technology. Englewood Cliffs, NJ: Prentice Hall.

Clifford, R. T. (1987, March). Language teaching in the federal government: A personal perspective. Annals, AAPSS, 490.

Clifford, R. T. (1998). Mirror, mirror, on the wall: Reflections on computer assisted language learning. CALICO Journal, 16 (1), 1-10.

Delmonte, R. (1998). Prosodic modeling for automatic language tutors. In Proceedings of the Workshop on Speech Technology in Language Learning (StiLL), Stockholm, Sweden.

Egan, K. B. (1996). Speech recognition application to language learning: ECHOS. Paper presented at the Annual Symposium of the Computer Assisted Language Instruction Consortium, Albuquerque, NM.

Egan, K. B., & Kulman, A. H. (1998). A proficiency-oriented analysis of computer-assisted language learning. In Proceedings of the Workshop on Speech Technology in Language Learning (StiLL), Stockholm, Sweden.

Ehsani, F., Bernstein, J., Najmi, A., & Todic, O. (1997). Subarashii: Japanese interactive spoken language education. In Proceedings of the EuroSpeech Conference, Rhodes, Greece.

Ehsani, F., & Knodt, E. (1998). Speech technology in computer-assisted language learning: Strengths and limitations of a new CALL paradigm. Language Learning & Technology [on-line serial], 2 (1), 46-60. Available: http://polyglot.cal.msu/llt

Eskenazi, M. (1999). Using automatic speech processing for foreign language pronunciation tutoring: Some issues and a prototype. Language Learning & Technology [on-line serial], 2 (2), 62-76. Available: http://polyglot.cal.msu/llt

Holland, V. M. (1995). The case for intelligent CALL. In V. M. Holland, J. D. Kaplan, & M. R. Sams (Eds.), Intelligent language tutors: Theory shaping technology. Mahwah, NJ: Lawrence Erlbaum.

Holland, V. M. (1997). Translating linguistic research into teaching: Precaution and promise in the application of natural language processing. In K. Murphy-Judy (Ed.), Nexus (pp. 52-65). Durham, NC: CALICO.

Hubbard, P. L. (1987). Language teaching approaches, the evaluation of CALL software, and design implications. In W. F. Smith (Ed.), Modern media in foreign language education: Theory and implementation. Lincolnwood, IL: National Textbook Company.

Hubbard, P. L. (1998). An integrated framework for CALL courseware evaluation. CALICO Journal, 16 (1), 51-72.

Levelt, W. J. M. (1989). Speaking: From intention to articulation. Cambridge, MA: The MIT Press.

Meador, J., Ehsani, F., Egan, K., & Stokowski, S. (1998). An interactive dialog system for learning Japanese. In Proceedings of the Workshop on Speech Technology in Language Learning (StiLL), Stockholm, Sweden.

Neumeyer, L., Franco, H., Weintraub, M., & Price, P. (1996). Automatic text-independent pronunciation scoring of foreign language student speech. In Proceedings of the International Conference on Spoken Language Processing (ICSLP), Philadelphia, PA.

Neumeyer, L., Franco, H., Abrash, V., Julia, L., Ronen, O., Bratt, H., Bing, J., Digalakis, V., & Rypa, M. (1998). WebGrader: A multilingual pronunciation practice tool. In Proceedings of the Workshop on Speech Technology in Language Learning (StiLL), Stockholm, Sweden.

Nunan, D. (1991). Language teaching methodology. New York: Prentice Hall International.

Phillips, J. K. (1998). Media for the message: Technology's role in the standards. CALICO Journal, 16 (1), 25-36.

Pinker, S. (1994). The language instinct: How the mind creates language. New York: Harper Perennial.

Plass, J. L. (1998). Design and evaluation of the user interface of foreign language multimedia software: A cognitive approach. Language Learning & Technology [on-line serial], 2 (1), 35-45. Available: http://polyglot.cal.msu/llt

Price, P. (1998). How can speech technology replicate and complement good language teachers to help people learn language? In Proceedings of the Workshop on Speech Technology in Language Learning (StiLL), Stockholm, Sweden.

Rypa, M. (1996). ECHOS: A voice interactive language training system. Paper presented at the Annual Symposium of the Computer Assisted Language Instruction Consortium, Albuquerque, NM.

Young, S. (1996, September). A review of large-vocabulary continuous-speech recognition. IEEE Signal Processing Magazine, 45-57.

AUTHOR'S BIODATA

Kathleen B. Egan, whose Ph.D. is from the University of Wisconsin at Madison, is a pioneer in the application of speech recognition to language learning. She partners with several federal government agencies to develop experimental projects and to integrate new technologies in the field of language education. A member of the CAPITAL Special Interest Group, she was recently elected to serve on the Executive Board of CALICO.

AUTHOR'S ADDRESS

Kathleen Egan
Federal Language Training Laboratory
801 Randolph St. #201
Arlington, VA 22203
E-Mail: EganKB@aol.com

