KEYWORDS Speaking, Automatic Speech Recognition, Speech-Interactive CALL, National Foreign Language Standards, Second Language Acquisition
INTRODUCTION

Speaking is at the heart of second language learning. It is arguably the most important skill for business and government personnel working in the field, yet it appears particularly vulnerable to attrition. Despite its importance and its fragility, speaking was until recently largely ignored in schools and universities, primarily for logistical and programmatic reasons, such as the emphasis on grammar and culture and unfavorable teacher-student ratios. Speaking was also absent from testing because of the difficulty of evaluating it objectively and the time it takes to conduct speaking tests (Clifford, 1987). Finally, speaking has been neglected in Computer Assisted Language Learning (CALL) technology. Until recently, CALL programs engaged students in listening, reading, and filling in blanks but not in producing oral language.
1999 CALICO Journal
Volume 16 Number 3
A PROFICIENCY FRAMEWORK

Proficiency goals can direct the design and development of quality learning activities. Foreign language proficiency is measured by the ability to communicate in the language. This ability is demonstrated in the understanding of authentic aural and written materials and in the ability to generate spoken and written language for real-life purposes. Proficiency in a language is a complex concept. Second language acquisition scholars are still trying to define it and identify its components. Practitioners such as the American Council on the Teaching of Foreign Languages (ACTFL) and the Interagency Language Roundtable (ILR) continue working toward a framework for describing language proficiency at different levels that incorporates the elusive components of communicative ability. Still, the oral proficiency tests developed by ACTFL and ILR have achieved high reliability and are a good anchor for the concept of speaking proficiency that CALL seeks to foster. ACTFL and ILR have defined speaking proficiency for testing purposes
as "the ability of an individual to carry out in appropriate ways communicative tasks which are typically encountered where the language is natively spoken." The proficiency tests present different topics at various levels of complexity that require the individual to handle vocabulary, structure, pronunciation, pragmatics, and sociolinguistic functions. Thus, rather than selecting one language acquisition theory or another as a framework for developing speech-interactive CALL, we see the proficiency definitions as the best practical framework. The proficiency guidelines focus on the goal rather than on the process and are useful for guiding software development (Egan & Kulman, 1998).
Enriching the concept of language proficiency are the new national foreign language standards (Phillips, 1998), organized around five goal areas: communication, culture, connections, comparisons, and communities. In all areas the focus is on what students should be able to do in a foreign language. This view departs from the linear, sequential separation of skills and also goes beyond the traditional distinction between skill-getting and skill-using. In the communication standard, speaking is still key; however, speaking is for a purpose. The intent is to have learners engage in realistic tasks rather than just practice linguistic material. There is consensus in the profession that learning a second language is not only for cultural and literary knowledge but also, and primarily, for practical and/or professional reasons. Language is as much a skill in which the individual engages interactively with others as it is a tool to extract information from written or aural materials. Using the language implies that the speaker is able progressively to perceive, understand, present, negotiate, persuade, hypothesize, and interpret in that language. These functions of real-life communication need to be represented in multimedia software for language learners. CALL is faced with a fantastic opportunity to help reach the communication goal of the national foreign language standards by incorporating available technologies in ways that promote acquisition of communicative competence.
The proficiency framework was further elaborated at the 1998 Symposium on Assessing and Advancing Technology Options in Language Learning (AATOLL), which the Federal Language Learning Laboratory sponsored together with the National Foreign Language Resource Center at the University of Hawaii, under the direction of Dr. Irene Thompson.
At the same time as the profession shifted towards communication in foreign languages, CALL technology began to flourish. However, the technology available to CALL developers limited their ability to promote speaking proficiency. I am not referring here to the ability to e-mail in the target language, participate in chatrooms, or communicate in the language via the web. While these activities are critical to generating language and reflecting on one's learning process, they nevertheless keep the learner mute. Integrating speaking into the technology is problematic on many grounds: funds for developing the technology for language learning are limited, research projects have often remained at the prototype stage, and the commercial side has not fully absorbed the lessons learned from experimental efforts. We started our experimental work in 1993 with a simple question: How can we help adult learners improve their speaking skills? Adult learners have a hard time distinguishing the sounds of the language, are often inhibited in speaking in front of others, and usually do not get undivided attention from their instructors (Egan, 1996; Eskenazi, 1999; this issue). Our vision then and now is that computers have a role to play in learning to speak; however, interaction with the computer remains mainly via keyboard and mouse. Most commercial software provides learners with practice in filling blanks or choosing the correct answer. A small proportion of available software offers learners practice in reading and listening to authentic written and spoken language. An even smaller proportion lets learners produce language by repeating words or sentences, recording their responses, and comparing them to native models (see Wachowicz & Scott, this issue).
However, getting learners to produce spoken language cannot be limited to recording one's voice and comparing it to native models. Speech needs to be an integral part of the instructional design, and production possibilities need to be expanded. We share the view of Chapelle (1998) that multimedia CALL must have learners produce in meaningful ways in the target language. Like input, which can be either uncomprehended noise or valuable for acquisition, output can be produced mindlessly or it can be created by the learner under conditions that facilitate acquisition. In the past four years, we have seen speech technologies and, in particular, ASR begin to be integrated into CALL. While this integration signals an emerging awareness on the part of software designers that CALL products can no longer ignore speaking skills, we are still at the very beginning of using this technology well. As Ehsani and Knodt (1998) pointed out in their comprehensive overview of speech technology in CALL, we need to exploit the strengths of the technology while working around its limitations.
As many have said (Clark & Sugrue, 1991; Hubbard, 1987, 1998; Clifford, 1998), the best of technology does not by itself create a productive learning environment. The technology offers access, authenticity, and insights (Phillips, 1998). I would add that advances in intelligent and adaptive technologies also offer a world of illusion, games, and simulations. Technology can stimulate the playfulness of learners and immerse them in a variety of scenarios. Technology gives learners a chance to engage in self-directed actions, opportunities for self-paced interactions, privacy, and a safe environment in which errors get corrected and specific feedback is given. Feedback by a machine offers additional value by its ability to track mistakes and link the student immediately to exercises that focus on specific errors. Studies are emerging that show the importance of qualitative feedback in CALL software. When links are provided to locate explanations, additional help, and reference, the value of CALL is further augmented.
The inclusion of ASR in software programs leads to a reexamination of multimedia CALL. Little research exists on how CALL can be used to develop speaking skills. What does it mean to the learner to have a speech-interactive program?
From our experience in developing speech-interactive CALL, we realized that it is important to clarify the role of speech technology in the learning process. Is it to diagnose and improve pronunciation? Is it to develop confidence and fluency? Is it to model and mimic native speech? Is it to use speech to navigate through a program? Is it to acquire new vocabulary and language structures, to solve problems, or to role-play? All these objectives can be integrated into a CALL program, but each requires specific adaptation of the technologies and design.
Another important factor in developing CALL is to determine clearly the differences between the human tutor and the machine. In an earlier
paper (Egan & Kulman, 1998) we stated, "While technology can mirror the human tutor and/or instructor, it is not necessarily a replacement for the human interaction in language learning." Machines are better at storing large quantities of data, a variety of resources, and links to access other resources. The machine can give students control over their learning and provide privacy and flexibility in time and space, but the machine is limited in perception, understanding, and decision making. As Pinker (1994) has noted, understanding a sentence exemplifies the kind of problem that is hard for machines and easy for humans. And if machines have a hard time understanding a sentence from a text, the challenge is even bigger when the source of information is acoustic.
Limitations of ASR
Meador et al. (1998) pointed out that as the technology for speech recognition has matured, the possibility of building interesting and meaningful exercises for language learners with this technology has become real. However, ASR cannot do everything users may want. Task definition (complexity, vocabulary size), acoustic models (speaker dependent, independent, or adapted), input quality (noise levels, microphone, sound card), and input modality (discrete or continuous input) have an impact on speech recognition performance (Bernstein & Franco, 1996; Ehsani & Knodt, 1998). All these factors need to be taken into account when designing a learning activity or promising the learner feedback on their discourse.
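These interacting factors can be treated as a design-time checklist. The sketch below (modern Python; the class, thresholds, and wording are invented for illustration and not drawn from any real ASR toolkit) shows one way a development team might record the factors for a planned activity and flag risky combinations before promising learners detailed feedback:

```python
from dataclasses import dataclass

@dataclass
class RecognitionTask:
    """Design-time factors that bear on ASR performance (illustrative names)."""
    vocabulary_size: int   # task definition: how much must be recognized
    acoustic_model: str    # "speaker-dependent", "speaker-independent", or "speaker-adapted"
    input_mode: str        # "discrete" (isolated words) or "continuous"
    noisy_input: bool      # microphone, sound card, and ambient-noise quality

    def risk_notes(self) -> list:
        """Flag combinations that call for caution before promising feedback."""
        notes = []
        if self.vocabulary_size > 1000 and self.acoustic_model == "speaker-independent":
            notes.append("large vocabulary + speaker-independent model: expect higher error rates")
        if self.input_mode == "continuous":
            notes.append("continuous input is harder to recognize than discrete input")
        if self.noisy_input:
            notes.append("poor input quality degrades recognition regardless of modeling")
        return notes

task = RecognitionTask(vocabulary_size=200, acoustic_model="speaker-adapted",
                       input_mode="continuous", noisy_input=False)
print(task.risk_notes())
```

The specific cutoffs are arbitrary; the point is that each factor named by Bernstein and Franco (1996) and Ehsani and Knodt (1998) can be checked explicitly against the feedback an activity promises.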
CHALLENGES IN DEVELOPING ASR FOR CALL VERSUS OTHER APPLICATIONS

When we became interested in ASR in 1993, we were faced with challenges in applying the technology to language learning that had not been faced in other ASR applications. At the same time, we realized that we shared challenges with the larger ASR research community. Many issues confronting ASR for CALL were no different from those tackled in research on large vocabulary continuous speech recognition (LVCSR), as summarized by Young (1996). In some ways the application of ASR to language learning is easier than traditional LVCSR, and in some ways harder. Table 1 presents relevant differences and similarities. These and other differences are discussed in the following paragraphs.
LVCSR technology, even when applied to other languages, was in 1993 and is still today used mainly by native speakers and for native speakers. This native-speaker basis is also the case for commercial developments stemming from LVCSR research, for example, dictation systems, command and control, and telephony. In contrast, CALL is designed for nonnative speakers, whose early learning is characterized by errors, disfluencies, and mispronunciations that are both far greater in scope and qualitatively different from those of native speakers.
Speech Assessment
The focus of LVCSR technology was, when we started our work, and still remains, on recognition. In contrast, learners' speech in CALL needs not only to be recognized but also to be diagnosed and corrected, with learners given meaningful, validated feedback to improve their speech. Assessment demands more than just the statistically based recognizers used in LVCSR. Pronunciation assessment needs to be able to make distinctions between possible pronunciations of words, while traditional ASR tries to recognize commonalities between different pronunciations of words, i.e., to distinguish between words, not pronunciations. Therefore, the best tools for speech recognition may not be the most appropriate for pronunciation assessment.
Languages Other Than English
When we started our work, most LVCSR research had been done in English, whereas our interest was in languages other than English (including the less commonly taught languages), which were not addressed in commercial products.
The criteria for successfully applying ASR in CALL are not the same as for LVCSR. In LVCSR applications such as dictation and telephony, accuracy, speed, and user interface are the primary determinants of ASR performance, as seen in the LVCSR research and evaluations sponsored by the Defense Advanced Research Projects Agency. While these variables also play a role in CALL, it is more important to determine whether users are learning the language and whether their learning is better served by ASR than by other means. Thus, we faced the challenge of how to measure specific aspects of language learning.
DEALING WITH THE CHALLENGES OF DEVELOPING ASR FOR CALL

Some of the challenges from 1993 are still with us. Below I consider the progress made on these challenges and the work that remains to be done, providing illustrations from systems featured in this issue of the journal. A companion paper by LaRocca et al. (this issue) further considers the adaptations needed to apply ASR to CALL.
Improving Recognition of Nonnative Speakers: Modeling Utterances for Acceptance and Rejection

The error rate in speech recognition is still high unless the system is speaker-dependent or speaker-adapted, as in dictation systems for native speakers. (These systems typically start as speaker- and text-independent and become dependent on or adapted to the speaker in order to work well.) At this point, enough data have been collected to show how to create acoustic models that include both native and nonnative speech. However, accounting for all pronunciation disfluencies and grammatical variants and knowing when to reject utterances that fall outside the expected vocabulary are unresolved problems (Byrne et al., 1998). Language learning applications demand very low false-acceptance rates so that learners who make errors are not misled into believing they are correct. We need more research on rejection modeling.
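The false-acceptance problem can be made concrete with a small sketch. Assuming a recognizer that attaches a confidence score to each utterance (the function, scores, and data below are invented for illustration), one simple rejection model picks the lowest confidence threshold that keeps the false-acceptance rate, the proportion of erroneous utterances that would still be accepted, under a chosen ceiling:

```python
def tune_rejection_threshold(scores, max_false_accept=0.05):
    """Choose the lowest confidence threshold whose false-acceptance rate
    (erroneous utterances scoring at or above it) stays under max_false_accept.

    scores: list of (confidence, is_correct) pairs from held-out learner speech.
    Names and data are illustrative; real systems tune on large corpora.
    """
    candidates = sorted({c for c, _ in scores})
    errors = [c for c, ok in scores if not ok]
    for threshold in candidates:
        false_accepts = sum(1 for c in errors if c >= threshold)
        rate = false_accepts / len(errors) if errors else 0.0
        if rate <= max_false_accept:
            return threshold
    return max(candidates)  # reject almost everything if no threshold suffices

# Invented held-out data: (recognizer confidence, was the utterance correct?)
held_out = [(0.95, True), (0.90, True), (0.85, False),
            (0.80, True), (0.70, False), (0.60, False)]
t = tune_rejection_threshold(held_out, max_false_accept=0.0)
print(t)  # → 0.9
```

The tradeoff is visible even in this toy data: the threshold that rejects every erroneous utterance also rejects one correct utterance (confidence 0.80), which is one reason rejection modeling for learner speech remains a research problem.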
Another challenge involves restricting communicative domains and predicting the possible branching that learners will follow in simulated dialogues with a computer character. Predicting possible branching paths is feasible but not necessarily quick and easy for CALL developers. Ehsani and Knodt (1998) differentiate between closed- and open-response designs and consider how these designs affect response prediction for speech-interactive CALL dialogues.
CLOSED-RESPONSE DESIGNS
Closed-response designs display a few utterance choices for learners to say. The ECHOS version of the Voice-Interactive Language Training System (VILTS) described by Rypa and Price (this issue) and Rypa (1996) is a good example of closed-response design. It lets learners choose from three possible responses displayed on the screen in a simulated conversation. One purpose of the system is to collect sufficient samples of speech data from students to give them an accurate pronunciation score. The words in the conversation are carefully chosen to represent phones that are difficult for Americans learning French. These segments were studied, and the automated scores were found to be highly correlated with the scores of expert human raters (Neumeyer et al., 1998). Closed-response designs are also used, in the interest of assuring accurate recognition, in the Virtual Conversations program of Harless, Zier, and Duncan, the multiple-choice exercises described by LaRocca, Morgan, and Bellinger, and the MILT microworld described by Holland, Kaplan, and Sabol (all in this issue).
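A closed-response turn can be sketched in a few lines. In the sketch below (illustrative Python; the prompt and French choices are invented, and the call to an actual recognizer is simulated by a string match), the displayed choices themselves define the recognizer's entire search space, which is what makes recognition so accurate in such designs:

```python
def closed_response_turn(prompt, choices, recognized):
    """One closed-response exchange: the learner may say only one of the
    displayed choices, so the recognizer's search space is tiny.
    (Illustrative sketch; a real system would call an ASR engine here.)"""
    print(prompt)
    for i, c in enumerate(choices, 1):
        print(f"  {i}. {c}")
    # Constrained "recognition": accept only utterances in the displayed set.
    if recognized in choices:
        return choices.index(recognized)
    return None  # out-of-set utterance: reject and ask the learner to retry

choices = ["Bonjour, madame.", "Je voudrais un café.", "Où est la gare?"]
picked = closed_response_turn("Greet the shopkeeper:", choices, "Bonjour, madame.")
print(picked)  # → 0
```

Returning an index rather than free text is the essence of the design: the branching dialogue can proceed deterministically from whichever displayed utterance was matched.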
OPEN-RESPONSE DESIGNS
Open-response designs do not tell learners what to say. These designs are time-consuming to develop and require a multilevel network grammar based on data collected from students, natural language processing capabilities, and strategies for recovering from misunderstandings. An example of an open-response design is Subarashii, a version of the Interactive Spoken Language Education (ISLE) program described by Bernstein, Najmi, and Ehsani (this issue), Ehsani et al. (1997), and Meador et al. (1998). Designed to teach Japanese, Subarashii supports multiturn, open conversations focused on a very specific task; the constrained task delineates the grammar network. The goal of Subarashii is not to provide students a score or feedback on their pronunciation but rather to have them engage in a human-like dialogue with the system. Subarashii was developed to build confidence and to have the learner role-play and solve simple contextualized problems. Similarly, an extension of VILTS called SOLVIT (Rypa & Price, this issue) employs open-response exercises that are built up from closed-response exercises.
In language learning environments, our trials with some of the systems mentioned above show that the recognizer works correctly in close to 99% of the cases if the vocabulary is small, the network well defined, and the scope of the conversation limited and guided. The challenge now is to build full systems that apply ASR to CALL. Most designs, even the most interactive ones, are largely of the closed-response type with fixed multiple-choice utterances. Enabling open, multiturn discourse is far more complicated, and the number of possibilities at each step of the exchange can run extremely high. Interim solutions such as Subarashii and SOLVIT, which bridge the gap between closed and open design, must rely on constrained recognition through clever use of specified-domain tasks. In these situations, it is expected that the learner will not make too many grammatical mistakes or deviate too much from the expected pattern of responses. Data must be gathered on how learners acquire patterns of language for certain situations. Good teachers can intuitively predict how a student will respond, and it is this kind of knowledge that needs to be captured in our systems. Machines can learn, and the goal is to create transitional prototypes with limited dialogue capability which collect student-interaction data for future, freer interactions. Complex authoring tools must also be developed that allow seamless integration between activity design and recognizer control.
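The way a constrained task delineates a grammar network can be illustrated with a toy phrase grammar (the task, rule names, and phrases below are invented, and the expansion routine is a generic sketch, not the mechanism of Subarashii or SOLVIT). Because the domain is narrow, the full set of expected utterances can be enumerated and handed to the recognizer; the same enumeration also shows why open designs explode, since every added slot multiplies the count:

```python
from itertools import product

# A toy phrase grammar for one narrow task (buying a train ticket).
# Keys are nonterminals; each rule is a sequence of nonterminals or phrases.
grammar = {
    "REQUEST": [["POLITE", "TICKET", "DEST"]],
    "POLITE":  [["I would like"], ["Please give me"]],
    "TICKET":  [["a ticket"], ["one ticket"]],
    "DEST":    [["to Kyoto"], ["to Osaka"]],
}

def expand(symbol):
    """Enumerate every word string a symbol can produce."""
    if symbol not in grammar:          # terminal phrase
        return [symbol]
    results = []
    for rule in grammar[symbol]:
        for combo in product(*(expand(s) for s in rule)):
            results.append(" ".join(combo))
    return results

utterances = expand("REQUEST")
print(len(utterances))  # → 8  (2 polite forms x 2 ticket forms x 2 destinations)
```

With two alternatives per slot, the recognizer's language is just eight sentences; a genuinely open design, with unconstrained vocabulary and grammar at each turn, has no such enumerable bound, which is the gap the interim systems described above are bridging.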
A variety of experimental work on pronunciation assessment is reported in this special issue of the journal. Work at SRI (Bernstein et al., 1990; Neumeyer et al., 1996; Price, 1998; Rypa & Price, this issue), at Carnegie Mellon University (Eskenazi, 1999; this issue), and at Indiana University and Communication Disorders Technology (Dalby & Kewley-Port, this issue) has enabled ASR to give learners reliable pronunciation scores and some corrective feedback. However, as discussed by Wachowicz and Scott (this issue), most pronunciation graders purchased off the shelf depend on the student to diagnose shortcomings. For reasons stated earlier (and detailed in this issue by Dalby & Kewley-Port as well as by Rypa & Price), current commercial speech recognizers are not appropriate for pronunciation training without adaptation. Relevant here is the discussion by Mostow and Aist (this issue) of the kinds of ASR adaptations needed to track oral reading. Current speech recognizers may offer scores, but students need more than scores to improve pronunciation. They need systems that diagnose specific pronunciation problems and communicate these deficiencies clearly. The required technologies for such a venture are not yet widely available beyond the prototype work just mentioned. These technologies need to be further developed, drawing from theories of speech perception and production as well as from research on how corrective feedback and modeling of articulation can help students improve their speech.
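What diagnosis beyond a single score might look like can be sketched as follows. All numbers and phone symbols below are invented, and the scoring scheme is only loosely inspired by the per-phone scores used in the prototype work cited above; the point is that comparing learner scores against native reference statistics lets a system name the specific phones to work on rather than report one opaque number:

```python
def diagnose_pronunciation(learner_scores, native_means, tolerance=1.0):
    """Flag phones whose learner score falls more than `tolerance` below the
    native reference mean. All values here are invented for illustration;
    real systems derive such statistics from large scored corpora."""
    problems = []
    for phone, score in learner_scores.items():
        gap = native_means[phone] - score
        if gap > tolerance:
            problems.append((phone, round(gap, 2)))
    # Worst phones first, so feedback can focus on the biggest problems.
    return sorted(problems, key=lambda p: p[1], reverse=True)

native_means = {"r": 4.0, "u": 3.5, "ø": 3.8}   # hypothetical reference means
learner = {"r": 1.5, "u": 3.2, "ø": 2.2}        # hypothetical per-phone scores
print(diagnose_pronunciation(learner, native_means))
# → [('r', 2.5), ('ø', 1.6)]
```

A ranked list like this could then drive the corrective feedback and articulation modeling the paragraph above calls for, with each flagged phone linked to targeted exercises.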
Developing Systems Suitable for Realistic Platforms
From 1993 to 1996, speech-interactive CALL prototypes using continuous speech recognition were developed and delivered on UNIX workstations; ASR on PC platforms was a novelty. Whereas the platform issue is now resolved (see, in this issue, Bernstein et al.; LaRocca, Morgan, & Bellinger; and Mostow & Aist), the vocabulary for language learning applications is still small to medium in size, and authoring tools are limited. The integration of continuous ASR with commercial authoring tools such as Macromedia's Authorware has been demonstrated with Subarashii at the Federal Language Training Laboratory. At the United States Military Academy, LaRocca et al. (this issue) have shown the WinCalis courseware authoring tool to work with continuous ASR. The developmental Global Authoring System (GLAS) by BlueShoe Technologies has also proved capable of incorporating recognition and other multilingual processing tools. The necessary tools have been assembled, but building easy-to-use authoring systems and robust, realistic dialogues for learners of foreign languages is still a challenge.
Collection of Speech Data in Multiple Languages
Building speech-enabled systems requires two separate types of speech data: data for acoustic modeling and data to inform learning activities. The first type is not necessarily the same as the acoustic models sought by LVCSR researchers. Speech data for language learning have to be a mix of native and nonnative samples. It is preferable, moreover, for the corpus to include a range of male and female voices at all age levels. The second type of data, the input for learning activities, should represent a range of situationally and culturally authentic spontaneous dialogues containing natural disfluencies, redundancies, and ellipses. There are advantages for the language education community in collecting these data and adding them to the existing Linguistic Data Consortium (LDC) database at the University of Pennsylvania. This kind of collection amounts to a one-time effort per language. In addition, the collected data allow flexibility in instructional design, and the same data can be reused for a variety of learning activities, as discussed by LaRocca et al. (this issue). Creating data sets for education is a necessity not only for the use of speech samples but also for other learning activities. It would be ideal to have a library of videos, texts, graphics, and audio organized by language, function, topic, and skill level. Links in the data set itself and hyperlinks with lexical and other linguistic resources need to be structured in ways that make them accessible and easy to manipulate by designers, teachers, and learners.
Evaluating the effectiveness of developed systems is still a challenge. We have a proposed proficiency framework and a taxonomy for multimedia foreign language evaluation, but we still need studies to validate that framework. Also needed are answers to driving questions: (a) Which instructional designs are best suited for inclusion of ASR? (b) How can we maximize the use of ASR and enhance learning at the same time? (c) How good does the speech recognizer have to be for the system to be functional and useful? (d) Should ASR always assess pronunciation? (e) How can CALL build up speaking confidence, expand lexical mastery, and improve pronunciation and overall facility with the spoken language? Research on human-machine interaction has not focused on second language acquisition, and the role of the multimedia interface in the learning process is just beginning to be examined (Plass, 1998).
SUMMARY

CALL systems that include ASR can help develop proficiency. Learners exposed to large quantities of speech from different native speakers will develop a trained ear that better discriminates sounds and constructs. Learners who also get to produce speech will improve their speaking skills. While we do not know whether this improvement affects fluency, confidence, or fully functional communicative skills, we believe it is critical that CALL not keep learners mute. Creativity and multidisciplinary partnership are the key to making CALL fully communicative. We face challenges when applying ASR in language learning. We need research on recognizing nonnative conversational speech, on accommodating the variant grammatical and sociolinguistic constructions characteristic of learners, and on modeling and predicting multiturn spontaneous dialogues. We also need more and better training data to model nonnative speech. Authoring tools with embedded multilingual and multimedia capabilities need to become more widely available and user-friendly. By developing and testing more ASR-based CALL, we will be better positioned to know empirically what works and what does not.

REFERENCES
Bernstein, J., Cohen, M., Murveit, H., Rtischev, D., & Weintraub, M. (1990). Automatic evaluation and training in English pronunciation. In Proceedings of the International Conference on Spoken Language Processing (ICSLP), Kobe, Japan.

Bernstein, J., & Franco, H. (1996). Speech recognition by computer. In N. Lass (Ed.), Principles of experimental phonetics. St. Louis: Mosby.

Byrne, W., Knodt, E., Khudanpur, S., & Bernstein, J. (1998). Is automatic speech recognition ready for non-native speech? A data collection effort and initial experiments in modeling conversational Hispanic-English. In Proceedings of the Workshop on Speech Technology in Language Learning (STiLL), Stockholm, Sweden.

Chapelle, C. A. (1998). Multimedia CALL: Lessons to be learned from research on instructed SLA. Language Learning & Technology [on-line serial], 2 (1), 22-34. Available: http://polyglot.cal.msu/llt

Chun, D. (1998). Signal analysis software for teaching discourse intonation. Language Learning & Technology [on-line serial], 2 (1), 61-77. Available: http://polyglot.cal.msu/llt

Clark, R. E., & Sugrue, B. M. (1991). Research on instructional media, 1978-1988. In G. Anglin (Ed.), Instructional technology. Englewood Cliffs, NJ: Prentice Hall.

Clifford, R. T. (1987, March). Language teaching in the federal government: A personal perspective. Annals, AAPSS, 490.
Neumeyer, L., Franco, H., Abrash, V., Julia, L., Ronen, O., Bratt, H., Bing, J., Digalakis, V., & Rypa, M. (1998). WebGrader: A multilingual pronunciation practice tool. In Proceedings of the Workshop on Speech Technology in Language Learning (STiLL), Stockholm, Sweden.

Nunan, D. (1991). Language teaching methodology. New York: Prentice Hall International.

Phillips, J. K. (1998). Media for the message: Technology's role in the standards. CALICO Journal, 16 (1), 25-36.

Pinker, S. (1994). The language instinct: How the mind creates language. New York: Harper Perennial.

Plass, J. L. (1998). Design and evaluation of the user interface of foreign language multimedia software: A cognitive approach. Language Learning & Technology [on-line serial], 2 (1), 35-45. Available: http://polyglot.cal.msu/llt

Price, P. (1998). How can speech technology replicate and complement good language teachers to help people learn language? In Proceedings of the Workshop on Speech Technology in Language Learning (STiLL), Stockholm, Sweden.

Rypa, M. (1996). ECHOS: A voice interactive language training system. Paper presented at the Annual Symposium of the Computer Assisted Language Instruction Consortium, Albuquerque, NM.

Young, S. (1996, September). A review of large-vocabulary continuous-speech recognition. IEEE Signal Processing Magazine, 45-57.
AUTHOR'S BIODATA

Kathleen B. Egan, whose Ph.D. is from the University of Wisconsin at Madison, is a pioneer in the application of speech recognition to language learning needs. She partners with several federal government agencies in the development of experimental projects and the integration of new technologies in the field of language education. A member of the CAPITAL Special Interest Group, she was recently elected to serve on the Executive Board of CALICO.
AUTHOR'S ADDRESS

Kathleen Egan
Federal Language Training Laboratory
801 Randolph St. #201
Arlington, VA 22203
E-Mail: EganKB@aol.com
CALICO '99
1-5 June 1999

Tuesday-Wednesday: Preconference Workshops
Thursday-Saturday: Opening Plenary, Sessions, Exhibits, Luncheon, Courseware Showcase, SIG Meetings, Banquet, Closing Plenary

Plenary speakers: G. Richard Tucker, Carnegie Mellon University; Diane Birckbichler, Ohio State University; Gary Strong, National Science Foundation

This year's conference does not have a designated conference hotel. Lodging is available in residence halls on campus and at motels in the area. For more information, visit CALICO's web site. Ascot Travel is the official travel agency for CALICO '99 and offers special discount fares on Delta Airlines. Visit CALICO's web site or call Ascot Travel at 800/460-2471. Be sure to mention you are part of the CALICO group.