
IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS-I: REGULAR PAPERS, VOL. 53, NO. 1, JANUARY 2006

Scalable Architecture for Word HMM-Based Speech Recognition and VLSI Implementation in Complete System
Shingo Yoshizawa, Student Member, IEEE, Naoya Wada, Student Member, IEEE, Noboru Hayasaka, Student Member, IEEE, and Yoshikazu Miyanaga, Senior Member, IEEE

Abstract--This paper describes a scalable architecture for real-time speech recognizers based on word hidden Markov models (HMMs), which provide high recognition accuracy for word recognition tasks. However, the recognition vocabulary of such systems is usually small because their extremely high computational costs cause long processing times. To achieve high-speed operation, we developed a VLSI system that has a scalable architecture. The architecture effectively uses parallel computations on the word HMM structure. It can reduce processing time and/or extend the word vocabulary. To explore the practicality of our architecture, we designed and evaluated a complete system recognizer, including speech analysis and noise robustness parts, on a 0.18-μm CMOS standard cell library and a field-programmable gate array. In the CMOS standard-cell implementation, the total processing time is 56.9 μs per word at an operating frequency of 80 MHz in a single system. The recognizer gives a real-time response using an 800-word vocabulary.

Index Terms--Hidden Markov model (HMM), scalable architecture, speech recognition, VLSI implementation.

I. INTRODUCTION

HIDDEN Markov model (HMM)-based speech recognition technologies have developed considerably and can now obtain high recognition performance. Voice dictation systems, spoken dialogue systems, and speech input interfaces are representative speech applications that use these sophisticated technologies. These developments lead us to expect speech input interfaces to be embedded in practical applications. The development of speech input interfaces embedded in mobile terminals requires recognition accuracy, miniaturization, and low power consumption. Hardware-based speech recognition systems meet these requirements. Previous research on custom hardware described the implementation of the HMM algorithm using application-specific integrated circuits (ASICs) [1], [2] and field-programmable gate arrays (FPGAs) [3], [4]. Word speech recognition employs a word HMM or a phoneme HMM in acoustic models. In particular, the word HMM is adopted in [2], [4], and the phoneme HMM is adopted in [1], [3]. We adopted a word HMM for our hardware recognition system, which performs isolated word recognition tasks. This word HMM accurately expresses coarticulation effects and maintains high

Manuscript received June 2, 2004; revised February 7, 2005 and June 7, 2005. This work was supported in part by the Semiconductor Technology Academic Research Center (STARC), Program 112, and in part by the Ministry of Education, Science, Sports and Culture under Grant B215300010, 2003. This paper was recommended by Associate Editor P. Nilsson. The authors are with the Graduate School of Information Science and Technology, Hokkaido University, Sapporo 060-0814, Japan. Digital Object Identifier 10.1109/TCSI.2005.854408

recognition accuracy in variable environments. For isolated word recognition, dynamic time warping (DTW) is an effective technique, particularly for speaker-dependent tasks. It is difficult to decide whether the word HMM has better recognition performance than DTW, because recognition results vary considerably with experimental conditions. Rabiner et al. reported experimental results for isolated digit recognition [5]. They reported that DTW performs better in speaker-dependent tasks, whereas the word HMM outperforms DTW in speaker-independent tasks. In general, the word HMM is disadvantageous in terms of computation costs compared with the phoneme-level HMM algorithm. A word HMM-based recognition system has extremely high computation costs and requires a long processing time because it has to calculate the likelihood scores for all reference models. For example, 335 million arithmetic operations are required for the output probability calculation in the unpruned 800-word vocabulary task described in Section V. Word HMM applications have been developed for numerical and alphabetical recognition tasks [5], [6]. The recognition vocabulary, however, is small (50 words or fewer) because of the high computation costs. Since the latest circuit technologies have reached an operating performance of about 10 GIPS, we believe that a word HMM-based system can deal with a middle-sized vocabulary of up to 1000 words using dedicated hardware architecture that decreases the processing time. In this paper, we focus on the word HMM structure and present effective parallel computations to achieve high-speed operations. We propose a new architecture based on these computations. It achieves high throughput and low-power operation. Furthermore, the proposed architecture provides scalability.
To implement speech recognition systems on hardware, variable conditions, such as vocabularies, recognition rates, and types of recognition words, should be considered. Because the required computational costs vary with these conditions, the optimum number of parallel computations must be changed. Fixed circuit structures either require redundant circuit resources for excessive parallel operations or degrade system performance, such as response time and recognition accuracy, resulting in an insufficient operating performance. A scalable technique always provides optimum hardware resources that can cope with the variable conditions by making small modifications to the hardware architecture. Namely, the proposed architecture reduces the processing time and/or easily extends the word vocabulary. Our proposed scalable architecture covers not only the HMM computations in speech recognition, but also the complete recognition system, including robust processing and

1057-7122/$20.00 © 2006 IEEE


Fig. 2. Left-right HMM.

Fig. 1. Flowchart of a speech recognition system.

speech analysis processing. In related works, such as [1]-[4], the authors realized hardware architectures that implemented only a part of the speech recognition algorithm, e.g., the Viterbi algorithm or the output probability calculation. Our work provides unified parallel computations and scalable techniques in a word HMM-based speech recognition system, and evaluates its effectiveness by implementing the complete recognition system using CMOS technologies. We verified the complete system on an FPGA board in actual environments, such as computer rooms, offices, and exhibition halls.

II. OUTLINE OF SPEECH RECOGNITION SYSTEM

Fig. 1 shows a flowchart of a speech recognition system. This flowchart is based on our developed complete recognition system. In the speech analysis part, speech feature vectors are extracted from a time series of short-duration speech signals. Traditional speech recognition systems handed these feature vectors directly from speech analysis to speech recognition. Currently, many systems employ robust processing that removes noise interference because the raw data is very sensitive to noise. In the robust processing part, the feature vectors are re-generated by shaping, e.g., subtracting the noise components. In the speech recognition part, the recognizer computes the likelihood scores and finds the best match for the test utterances. Reference models are generated from HMM training in advance. Because the training is assumed to have been executed in software, the system does not include a training function. The complete system unifies speech analysis, robust processing, and speech recognition.

III. WORD-LEVEL HMM ALGORITHM

The HMM is a statistical modeling approach that is robust to temporal variations in speech and speaker differences [7], [8], and is defined by a state transition probability matrix $A$, a symbol output probability matrix $B$, and an initial state probability $\pi$.
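As a hypothetical sketch, the three stages of the flowchart can be modeled as a simple chain. The feature and scoring functions below are placeholders standing in for the actual MFCC analysis, RSF/DRA robust processing, and HMM likelihood computation of the real system:

```python
import math

def speech_analysis(signal, frame_len=256, frame_shift=128):
    """Split the waveform into overlapping short frames and emit one
    feature vector per frame (a placeholder for windowing + MFCC)."""
    features = []
    for start in range(0, len(signal) - frame_len + 1, frame_shift):
        frame = signal[start:start + frame_len]
        # Stand-in feature: log frame energy; a real front end would
        # compute a multidimensional MFCC vector here.
        energy = sum(s * s for s in frame)
        features.append([math.log(energy + 1e-10)])
    return features

def robust_processing(features):
    """Re-generate the feature vectors by removing a noise estimate;
    simple mean subtraction stands in for RSF/DRA here."""
    dim = len(features[0])
    mean = [sum(f[k] for f in features) / len(features) for k in range(dim)]
    return [[f[k] - mean[k] for k in range(dim)] for f in features]

def recognize(features, reference_scorers):
    """Score the utterance against every reference word model and
    return the index of the best match (the scorers are hypothetical
    stand-ins for the HMM likelihood computation)."""
    scores = [scorer(features) for scorer in reference_scorers]
    return scores.index(max(scores))
```

The chain mirrors the flowchart: analysis produces feature vectors, robust processing reshapes them, and recognition selects the best-matching reference model.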
The probability $P(O|\lambda)$ of the observation sequence $O$ is given by the multidimensional observation sequences $x_1, \dots, x_T$, known as feature vectors, and an HMM expression $\lambda = (A, B, \pi)$, which is the compact notation of the three sets $A$, $B$, and $\pi$. For the word-level HMM, the recognizer computes and compares all the $P(O|\lambda_v)$ ($v = 1, \dots, V$), where $V$ is the number of word models. For left-to-right HMMs, $\log P(O|\lambda)$ is computed using the log-Viterbi algorithm as follows.

1) Initialization:
$$\delta_1(1) = \log \pi_1 + \log b_1(x_1) \quad (1)$$

2) Recursion:
$$\delta_t(j) = \max_i \left[ \delta_{t-1}(i) + \log a_{ij} \right] + \log b_j(x_t), \quad 2 \le t \le T,\ 1 \le j \le N \quad (2)$$

3) Termination:
$$\log P(O|\lambda) = \delta_T(N) \quad (3)$$
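The three steps above can be sketched as follows for the strict left-to-right topology of Fig. 2, in which a state is entered only from itself or from its predecessor, so the recursion's max is over just two terms. This is a floating-point illustration; the hardware uses fixed-point arithmetic:

```python
import math

def log_viterbi(log_b, log_a_self, log_a_next):
    """Log-Viterbi for a strict left-to-right HMM (sketch).
    log_b:      T x N table of log output probabilities log b_j(x_t)
    log_a_self: N self-loop log probabilities  log a_{jj}
    log_a_next: N-1 forward log probabilities  log a_{j,j+1}
    Returns log P(O|lambda) = delta_T(N)."""
    T, N = len(log_b), len(log_b[0])
    # (1) Initialization: the model must start in state 1.
    delta = [log_b[0][0]] + [-math.inf] * (N - 1)
    for t in range(1, T):                      # (2) Recursion
        prev = delta
        delta = [prev[0] + log_a_self[0] + log_b[t][0]]
        for j in range(1, N):
            # Sparse transitions (4): reach state j from j or j-1 only.
            best = max(prev[j] + log_a_self[j],
                       prev[j - 1] + log_a_next[j - 1])
            delta.append(best + log_b[t][j])
    return delta[N - 1]                        # (3) Termination
```

Because each state depends only on two predecessors, the inner loop over $j$ is what the architecture later parallelizes across process elements.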

Here, $N$ is the number of states, $T$ is the number of frames of the feature vectors $x_1, \dots, x_T$, $a_{ij}$ is the state transition probability between states $i$ and $j$, $A$ denotes their $N$-by-$N$ matrix, $B$ is an $N$-by-$T$ matrix of log output probabilities, and $\delta_t(j)$ is the likelihood value at time index $t$ and state $j$. We restrict the HMM structure to the strict left-to-right connection topology shown in Fig. 2 for use in hardware architecture. Hence, a sparse matrix gives the state transition probability:

$$a_{ij} \ne 0 \ \text{if}\ j = i \ \text{or}\ j = i + 1, \quad a_{ij} = 0 \ \text{otherwise}, \quad 1 \le i, j \le N \quad (4)$$

Discrete HMM (DHMM), semi-continuous HMM (SCHMM) [9], [12], and continuous HMM (CHMM) are utilized to compute the output probabilities. The DHMM and SCHMM can reduce the output probability computation costs. However, their recognition rates are lower than that of the CHMM. Our system employs the CHMM to give priority to recognition accuracy. In the CHMM, the output probability is typically based on a Gaussian distribution. For an uncorrelated single Gaussian distribution, the output probability is expressed as follows:

$$b_j(x_t) = \prod_{k=1}^{D} \frac{1}{\sqrt{2\pi\sigma_{jk}^2}} \exp\!\left( -\frac{(x_{tk} - \mu_{jk})^2}{2\sigma_{jk}^2} \right) \quad (5)$$

where $\mu_{jk}$ and $\sigma_{jk}^2$ are the mean vectors and diagonal covariance matrices, respectively, for state index $j$ and dimension index $k$. The feature vector at frame number $t$ is expressed as $x_t = (x_{t1}, \dots, x_{tD})$, where $x_{tk}$ is the $k$th component of the $t$th feature vector and $D$ is the number of dimensions. The log output probability is simplified as follows:

$$\log b_j(x_t) = \omega_j - \sum_{k=1}^{D} (x_{tk} - \mu_{jk})^2\, \nu_{jk} \quad (6)$$
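A minimal sketch of this simplification, assuming the precomputed constants $\omega_j$ and $\nu_{jk}$ of (7) and (8), and verifying that it matches the direct diagonal-Gaussian log-likelihood of (5):

```python
import math

def precompute(mu, var):
    """Precompute the model parameters of (7) and (8) from the mean
    and diagonal variance of one Gaussian state (done at training)."""
    nu = [1.0 / (2.0 * v) for v in var]                                 # (8)
    omega = -sum(math.log(math.sqrt(2.0 * math.pi * v)) for v in var)   # (7)
    return nu, omega

def log_output_prob(x, mu, nu, omega):
    """Simplified log output probability of (6): for each of the D
    dimensions, one subtraction, one squaring, one multiplication by
    nu_jk, and one accumulation -- about 4D operations in total."""
    acc = 0.0
    for k in range(len(x)):
        d = x[k] - mu[k]          # subtract
        acc += d * d * nu[k]      # square, multiply, accumulate
    return omega - acc
```

The per-dimension loop body (subtract, square, multiply, accumulate) is exactly the 4-stage pipeline that PE1 implements in the hardware architecture of Section IV.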


We explained the output probability calculation using a single Gaussian distribution. However, a mixture Gaussian distribution is also available for our hardware recognition system. The log output probability of a mixture distribution with $N$ states and $M$ mixtures is given by

$$\log b_j(x_t) = \operatorname*{addlog}_{m=1,\dots,M} \left[ \log c_{jm} + \log b_{jm}(x_t) \right] \quad (9)$$

where

$$\operatorname{addlog}(x, y) = \log(e^x + e^y) \quad (10)$$
$$\log b_{jm}(x_t) = \omega_{jm} - \sum_{k=1}^{D} (x_{tk} - \mu_{jmk})^2\, \nu_{jmk} \quad (11)$$

The vector $c_{jm}$ gives the mixture weights for mixture index $m$ and state index $j$. The vectors $\mu_{jm}$ and $\nu_{jm}$ are computed in the same way as for the single Gaussian distribution. The mixture Gaussian distribution with $N$ states and $M$ mixtures has the same computational processing as the single Gaussian distribution with $NM$ states when the addlog operation (10) is ignored and the maximum function replaces it. Approximations of the addlog operation using the maximum function [11] and a log table function, e.g., [12], have been proposed. Our recognition system can employ the maximum function.

IV. SCALABLE ARCHITECTURE
Fig. 3. Flowchart of HMM computation.

where

$$\omega_j = -\sum_{k=1}^{D} \log \sqrt{2\pi\sigma_{jk}^2} \quad (7)$$
$$\nu_{jk} = \frac{1}{2\sigma_{jk}^2} \quad (8)$$

In (7) and (8), $\omega_j$ and $\nu_{jk}$ can be computed beforehand, i.e., during HMM training. The matrix/vectors $\mu$, $\nu$, and $\omega$ are called HMM model parameters in this paper. These parameters are stored in the hardware recognition system memory. Fig. 3 shows a flowchart of the whole computation. The output probability is the most computationally expensive part of the procedure. For each output probability, the number of arithmetic operations for (6) can be represented by about $4D$, which indicates one addition, one subtraction, and two multiplications over $D$ repetitions. Because it repeats as Loop A, Loop C, and Loop D, it requires $NTV$ repetitions. The total computation cost, excluding the other calculation parts, is represented by $4DNTV$. As a measure of the processing time, we use the number of clock cycles. We assume that one arithmetic operation requires one clock cycle. The number of clock cycles thus comes to $4DNTV$ for the above computation. The processing time is proportional to the number of frames, the feature vector dimensions, the HMM states, and the word models. Large numbers of HMM states and feature vector dimensions are required when long words are expected in word recognition tasks.
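The operation count can be checked directly; with the Section V task parameters, it reproduces the 335 million figure quoted in the Introduction:

```python
# Per output probability, (6) costs about 4 operations per dimension
# (subtract, square, multiply, accumulate) over D dimensions; Loops
# A, C, and D repeat this over N states, T frames, and V word models.
def output_prob_ops(D, N, T, V):
    return 4 * D * N * T * V

# Section V parameters: N = 32 states, D = 38 dimensions,
# T = 86 frames, V = 800 word models.
ops = output_prob_ops(D=38, N=32, T=86, V=800)
print(ops)  # 334643200, i.e., about 335 million operations
```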

Our scalable architecture for high-speed computation is based on parallel and concurrent processing. We applied two methods to the scalable architecture. In the first, multiple process elements (PEs) are implemented inside the HMM computation module. The HMM computation module executes all arithmetic operations in the word HMM algorithm. In the other, we employ a master-slave operation in the recognition systems. The system consists of speech recognition, speech analysis, and noise robust processing, and controls data transfer. The master-slave operation is done using instruction sets designed for the recognition system. This master-slave operation reduces processing time, or can extend the word vocabulary, by simply arraying two or more systems.

A. Multiple Process Elements

In the word HMM structure, some computation parts can be executed concurrently by partitioning the HMM states and word models. We considered the following three points in the HMM computation structure.

a) During Loop C, the same HMM model parameters $\mu$, $\nu$, and $\omega$ are used repeatedly, because the parameters are independent of frame number $t$.
b) Loops A and B are divided into parallel computations in each HMM state $j$.
c) Loops A and B are computed simultaneously if Loop A precedes Loop B by one frame.

Case (a) indicates block processing of the HMM model parameter data. When all the data in a word model is transferred to an internal memory, data fetches from an external memory are unnecessary during Loop C. Note that the HMM model parameters of all word models are stored in the external memory in advance, because the data size is large. Block processing reduces


Fig. 5. Structure of HMM circuit.

Fig. 4. Modified flowchart for parallel computing.

data transfer by a factor of $T$. Case (b) enables the maximum number of parallel computations. However, the parallel computations require many data ports in the arithmetic units. The values $N$ and $T$ correspond to the number of HMM states and the number of feature vector frames, respectively. The number of parallel computations is no more than $NT$. To obtain the maximum performance using parallel computation within one frame, we consider the number of parallel computations to be $N$. The flowchart shown in Fig. 3 can be modified to the new flowchart shown in Fig. 4, which is suitable for parallel computing. It is difficult to directly connect the arithmetic units and the external memory, which exists outside the chip. We effectively utilize the internal memory to solve this issue. The internal memory structure can be modified inside a circuit module or a chip. When the internal memory has multiple output ports, the model parameter data can be supplied to all the arithmetic units. Fig. 5 shows the HMM circuit structure for the single Gaussian distribution. PE1 and PE2 are the process elements of the output probability calculation (6) and the Viterbi algorithm (1)-(3), respectively. The model parameters are partitioned by the HMM states. The data $\mu$ and $\nu$ are transferred to all the PE1s. The data $\omega$ and $a_{ij}$ are transferred to all the PE2s. The addition of $\omega$ is executed in the Viterbi algorithm in this circuit structure. The data port of the feature vectors is shared by all the PE1s. Each PE1 operates a 4-stage pipeline process, consisting of add, square, multiply, and accumulate operations using fixed-point arithmetic. The PE1s generate the absolute values of the log-likelihoods and treat zero as the maximum value in their fixed-point format1. Due to this use of absolute values, the maximum functions in (2) and (3) change to minimum functions in the actual hardware processing. Case (c) realizes pipeline chaining between Loops A and B. Because
1We assumed that all the log-likelihoods $\delta_t(j)$ in (1)-(3) are negative. The values of $\omega_j$ are adjusted by subtracting a constant value so that the log-likelihoods do not become positive. The constant values can be pre-computed. The hardware architecture uses the absolute values of the likelihoods to cut a sign bit.

Fig. 6. Implementation of mixture Gaussian distribution.

the operation cycles of Loop A in one frame surpass those of Loop B, the requirement of pipeline chaining is satisfied. Consequently, the scalable architecture obtains an $N$-times faster total computation, excluding the data transfer of the model parameters. The processing time barely increases by arraying the PEs because the data transfer time does not depend on the number of PEs. Conventional architectures that array multiple arithmetic units are limited by the memory bandwidth or the number of pins connected to the external memory LSIs. The proposed architecture solves this problem by connecting internal memory units with arithmetic units inside the chip. The block processing that transfers the HMM model parameters thus becomes important when implementing the internal memory. We have explained the hardware architecture for the single Gaussian distribution. The mixture Gaussian distribution can also be applied by inserting the addlog approximation unit. Fig. 6 shows a simple example of a Gaussian distribution with two mixtures and two states. The input ports of the addlog unit are connected to the PE1s. The output port is connected to PE2.

B. Complete Recognition System

This system executes the whole speech processing required for a speech recognition system. It includes not only the speech recognition part but also the parts for speech analysis, noise robustness, and system control. Fig. 7 shows a block diagram of the complete recognition system. The speech analysis algorithm consists of Hanning windowing and Mel frequency cepstral coefficient (MFCC) analysis [10]. For the noise robustness algorithm, we adopted running spectrum filtering/dynamic range adjustment (RSF/DRA) [13]. This method has improved robust performance compared with


Fig. 7. Complete recognition system.

TABLE I
INSTRUCTION SETS

conventional methods, such as spectral subtraction [14] and RASTA [15]. The RSF/DRA improved recognition accuracy by an average of 5% in car noise environments at a signal-to-noise ratio (SNR) of 0 dB, and by an average of 10% in white noise environments at an SNR of -10 dB. The right interfaces shown in the figure are connected to an external memory. The memory stores the HMM model parameters and feature vectors. The top and bottom interfaces are used for the master-slave operation. When the system is set as the master, the top interface is connected to the microprocessor. The microprocessor controls the recognition system using instruction sets. These instructions execute speech analysis, noise robustness, and speech recognition processing. They are also used to control communication between the microprocessor and the complete systems, to call the recognition results, and to execute the master-slave operation. Table I shows the instruction sets for the complete system.

C. Master-Slave Operation

The master-slave operation can realize speedups of up to $S$ times, where $S$ is the number of systems. Fig. 8 shows the master-slave operation. Fig. 8(a) illustrates the arrangement of the master and slave systems. The master system is directly connected to the microprocessor. The master and slave systems each have an external memory that stores the HMM model parameters. The master/slave assignment is done by giving constant values to the left ports of the recognition system in Fig. 7. Zero is assigned to the master system, and the other values are assigned to the slave systems. Data and instructions from the microprocessor are transferred to individual master/slave systems by switching the Chip Select value. If Chip Select is set to the maximum value, the data and instructions are transferred to all the slave systems.
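A behavioral sketch of this Chip Select dispatch follows. The class and the sentinel for the maximum Chip Select value are hypothetical names for illustration, not the actual instruction encoding:

```python
BROADCAST = None  # hypothetical sentinel standing in for the
                  # maximum Chip Select value

class RecognitionSystem:
    """One master/slave recognition system; system_id 0 is the master."""
    def __init__(self, system_id):
        self.system_id = system_id
        self.received = []

    def accept(self, instruction):
        self.received.append(instruction)

def dispatch(systems, chip_select, instruction):
    """Route an instruction from the microprocessor.  A specific
    Chip Select value addresses one system; the maximum value
    (modeled as BROADCAST) addresses every slave system at once."""
    if chip_select is BROADCAST:
        for s in systems:
            if s.system_id != 0:       # all slave systems
                s.accept(instruction)
    else:
        systems[chip_select].accept(instruction)
```

Broadcasting to all slaves in one dispatch is what keeps the feature-vector transfer time nearly constant as the number of slave systems grows.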

Fig. 8. Procedure of the master-slave operation.

Before the master-slave operation, the HMM reference models are transferred to each memory, as shown in Fig. 8(b). The data size per memory is equally partitioned depending on $S$. In the master-slave operation, speech analysis and robust processing are first executed in the master system while the speaker is uttering. The feature vectors are then stored in the external memory of the master system, as shown in Fig. 8(c). Note that the feature vectors are constant for all the word models during Loop D. Because the feature vectors are utilized in common by all the systems, having only the master system execute the speech analysis and robust processing reduces power consumption. Second, the feature vectors are transferred from the master system to the slave systems using the BROADCAST instruction, as shown in Fig. 8(d). The broadcast transmission does not require a handshake between the microprocessor and the slave systems, thus reducing the data transfer time even as the number of slave systems increases. Third, speech recognition processing is simultaneously executed in all the systems by calling the feature vectors and the reference HMM models from the memory, as shown in Fig. 8(e). Finally, the microprocessor gathers the recognition scores from all the systems and searches for the best recognition result, as in Fig. 8(f).

V. EVALUATIONS

A. System Implementation

The HMM computation circuit and the complete recognition system were designed on a CMOS 0.18-μm standard cell library using Verilog-HDL RTL-level description. The number of gates in the HMM computation circuit and the complete system


TABLE II ARITHMETIC OPERATION IN HMM COMPUTATION

TABLE III INSTRUCTION TYPES AND NUMBER OF INSTRUCTION CYCLES

was 340 k and 400 k, respectively. The circuit executed 32 parallel operations. The number of parallel operations was equal to the number of HMM states. The maximum operating frequency of the circuit and the recognition system was 128 MHz.

B. System Performance

The hardware-based recognition system based on the proposed architecture was evaluated on processing time and power dissipation. The processing time of the proposed HMM circuit was much smaller than that of a single arithmetic logic unit (ALU), but a fair comparison must also include power dissipation. We estimated power dissipation in the arithmetic units on both the hardware recognition system and a fixed-point DSP using a software solution. Most software implementations use pruning to reduce the computational load. Two popular forms of pruning are Gaussian selection [16] and Gaussian pruning [17], [18]. These techniques reduce the computational loads to 20-40% in HMM-based recognition systems. For example, Gaussian pruning can reduce the computation loads in the output probability calculation in (6). The summation over $k$ in (6) accumulates additions while incrementing the dimension index $k$. During this summation, if the calculated value falls below a certain threshold, the computation can stop halfway and be replaced by an approximate value, because the likelihood value is assumed to be far from the center of the Gaussian distribution. This indicates that the computation loads in the summation are reduced using threshold pruning. In the evaluations, we assumed an 800-word vocabulary task that could be handled with a single system. The parameters in the recognition task were $N = 32$, $D = 38$, $T = 86$, and $V = 800$ (i.e., 32 HMM states, 38-dimensional feature vectors, 86 speech frames corresponding to a speech length of 1.0 s, and 800 word HMMs). Table II shows the number of arithmetic operations in the output probability calculation (6) and the Viterbi search algorithm (1)-(3). The computational cost of the Viterbi algorithm was a small percentage of the total.
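This threshold pruning of the summation in (6) can be sketched as follows; the threshold and the approximate floor value are hypothetical tuning parameters:

```python
def pruned_log_output_prob(x, mu, nu, omega, threshold, floor):
    """Threshold-pruned version of the summation in (6): accumulate
    the weighted squared distance dimension by dimension and stop
    early once the partial log-likelihood falls below the threshold,
    returning an approximate floor value instead.  The threshold and
    floor values are hypothetical tuning parameters."""
    acc = 0.0
    for k in range(len(x)):
        d = x[k] - mu[k]
        acc += d * d * nu[k]
        if omega - acc < threshold:
            # Far from the Gaussian center: abandon the remaining
            # dimensions and substitute an approximate value.
            return floor
    return omega - acc
```

Because the partial sum only grows, the early exit never changes which models score well; it only truncates hopeless candidates, which is why software implementations rely on it to cut the computational load.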
To simplify the comparison, we evaluated the system performance only for the output probability calculation, which is common to the hardware and software implementations. The hardware implementation requires 335 million arithmetic operations. In the software implementation, we estimated the required computational costs using Gaussian pruning for the vector threshold with heuristic estimation [18]. In this case, the arithmetic cost was reduced to 117 million operations, or 34.9% of the full computation. In the hardware implementation, the proposed architecture was measured using only a single-system operation and not the master-slave operation. The clock frequency was set to 80 MHz. The processing time of the output probability calculation was 32.7 ms, excluding the HMM training data transfer from external to internal memory. The total processing time came to 45.5 ms, including the Viterbi algorithm and data transfer. Consequently, this recognizer took 56.9 μs per word for the single word HMM at an 80-MHz clock frequency. The processing time

TABLE IV EVALUATIONS IN THE HARDWARE AND SOFTWARE IMPLEMENTATIONS

and recognition time were measured from RTL-level circuit simulation. The power dissipation value was measured using a power estimation CAD tool that uses switching activities in gate-level simulations, including data loading/storing in an internal memory. The hardware system consumed 421.5 mW at an 80-MHz clock frequency and a 1.8-V power supply. This measurement includes the Viterbi algorithm unit, but its power dissipation percentage was low. The software implementation utilizes a Texas Instruments TMS320VC5416 fixed-point digital signal processor (DSP). It had a 160-MIPS operating performance at a 160-MHz clock frequency and took one cycle per instruction. The number of instruction cycles was measured from a DSP compiler tool. Table III shows an example of the required instructions for the one addition, one subtraction, and two multiplications inside the summation shown in (6). The total output probability calculation with the above Gaussian pruning required 241 million instruction cycles. We used the power dissipation value from the TMS320VC5416 data sheet [19]. This DSP consumed 96 mW at a 160-MHz clock frequency using a 1.6-V power supply2. Table IV gives the evaluation results of the software and hardware implementations. The processing time represents the time length of the output probability calculation, excluding the HMM training data transfer. Note that most software implementations use a time-synchronous search [20] to reduce the latency between the end of the utterance and obtaining the recognition result. The time-synchronous search executes recognition processing during the utterance. In contrast, the hardware implementation cannot use a time-synchronous search. The processing time directly results in the system response delay because HMM computation starts after the end of the utterance. If very long utterances are recognized, this causes an unacceptably long delay. The dissipated energy is given by

$$E\ [\text{mW} \cdot \text{s}] = P\ [\text{mW}] \times t\ [\text{s}] \quad (12)$$
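Applying (12) to the power and time figures quoted in the text gives a rough back-of-envelope comparison; these are illustrative values, not the exact Table IV entries:

```python
# Dissipated energy per (12): power [mW] times processing time [s],
# i.e., energy in mW*s (millijoules).
def dissipated_energy_mj(power_mw, time_s):
    return power_mw * time_s

# Hardware: 421.5 mW over the 32.7-ms output probability calculation.
hw_energy = dissipated_energy_mj(421.5, 32.7e-3)   # about 13.8 mJ

# DSP software: 241 million single-cycle instructions at 160 MHz,
# consuming 96 mW (data-sheet value).
dsp_time_s = 241e6 / 160e6                          # about 1.51 s
dsp_energy = dissipated_energy_mj(96.0, dsp_time_s) # about 144.6 mJ
```

The comparison shows why peak power alone is misleading: the DSP draws far less power, but its much longer processing time dominates the total energy.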

The proposed hardware system requires a higher peak power than the DSP-based system, even though it considerably outperforms the DSP system in terms of total dissipated
2The value from the data sheet was obtained using 50% MAC and 50% NOP instructions. Otherwise, the actual value would have been higher.


Fig. 10. FPGA board system.

Fig. 9. Measurement results.

energy. A software implementation could reduce power dissipation and processing time by applying additional techniques to reduce computational costs. For example, beam pruning can decrease the cost by as much as a factor of two, but the proposed hardware system is still better than the DSP system in terms of total dissipated energy. Although the hardware system does not always have a shorter system response time, even when its processing time is much shorter than that of the DSP system, its response time is fully acceptable for real-time applications. The proposed architecture realized both low power consumption and real-time system response in the 800-word vocabulary task.

C. Measurement Results for Scalability

Fig. 9 shows the measurement results for scalability in HMM states and the master-slave operation. For the speech recognition performance evaluation, we used 100 Japanese city names from the Japan Electronic Industry Development Association (JEIDA) database [21] for speaker-independent recognition experiments. Speech was sampled at 11.025 kHz with 16-bit quantization. For the speech analysis, the MFCC features were extracted after pre-emphasis and Hanning windowing. They were converted to 38-dimensional feature vectors. The frame length and shift were 23.2 ms and 11.6 ms, respectively. The feature vectors consisted of 12 MFCCs, 12 delta MFCCs, 12 delta-delta MFCCs, delta log energy, and delta-delta log energy. Two hundred gender-dependent models were trained on a speech corpus of 24 000 words, collected from 40 males and 40 females. Speech from the training speakers was not

included in the test data. The word models were set at 4- to 32-state HMMs with a single Gaussian distribution. The speech data from 10 males and 10 females was tested for recognition, and the noise of a running car was added to the original speech data under a 10-dB SNR condition. The experimental results indicate that a large number of HMM states improves recognition performance in noisy environments. Because using more than 32 HMM states barely increased the recognition accuracy, 32-state HMMs provided the best recognition performance in this test set. With regard to circuit performance, the processing time was measured according to the conditions in Section V-B. The clock frequency was set to 25 MHz. The evaluated processing included both the output probability and the other calculations. The circuit area was proportional to the number of recognition words. However, the recognition time increased slightly for large numbers of states because of the required data transfer from external to internal memory. The data size was proportional to the number of HMM states. In master-slave operation, the total processing time is inversely proportional to the number of systems. When the number of systems is more than two, the feature vectors incur a data transfer time between the microprocessor and the recognition systems. The transfer time is no more than 20 ms in a five-system operation. The master-slave operation can thus considerably reduce the total processing time.

VI. FPGA IMPLEMENTATION

We implemented the complete recognition system on an FPGA to verify the various system operations and evaluate the entire system in a realistic environment. Fig. 10 shows the FPGA board recognition system using an Altera APEX20KE running at 10 MHz. The sampling clock generator, A/D converter, serial port interface, and external SRAM were connected to the FPGA board. The sampling rate was 11.025 kHz with 12-bit quantization. A sequential control circuit substitutes for the microprocessor.
Speech detection starts when a switch on the board is pushed and ends automatically after 1.5 seconds. A more standard push-to-talk interface (e.g., the user starts an utterance by pushing down a button and halts by releasing it) or automatic voice activity detection (VAD) [22] should be used in practical applications in future developments. The HMM model parameters were transferred from a PC to the FPGA board via the serial port before speech recognition testing. The FPGA board system enabled users to utter speech using


a microphone and to observe the recognition results as word numbers displayed on an LED.

VII. CONCLUSION

In this paper, we described a new scalable architecture for the word HMM computation in speech recognition. The high computation costs of word HMMs cause excessively long processing times and restrict applications to small vocabularies. To solve this problem, we applied new parallel computation methods to the hardware architecture. The proposed architecture provides scalability that can reduce processing time and/or extend the word vocabulary. This scalability is realized by employing multiple process elements inside the HMM computation circuit and the master-slave operation between complete recognition systems. To evaluate the proposed architecture, we designed the complete system using CMOS standard library cells, and the evaluations demonstrated that the system is adequate for operating larger vocabularies.

ACKNOWLEDGMENT

The authors would like to thank the Research and Development Headquarters, Yamatake Corporation, and the VLSI Design Education and Research Center (VDEC), University of Tokyo, for their cooperation in this work.

REFERENCES
[1] J. Pihl, T. Svendsen, and M. H. Johnsen, "A VLSI implementation of pdf computations in HMM based speech recognition," in Proc. IEEE TENCON '96, 1996, pp. 241-246.
[2] W. Han, K. Hon, and C. Chan, "An HMM-based speech recognition IC," in Proc. IEEE ISCAS '03, vol. 2, 2003, pp. 744-747.
[3] S. J. Melnikoff, S. Quigley, and M. J. Russell, "Implementing a simple continuous speech recognition system on an FPGA," in Proc. IEEE Symp. FPGAs for Custom Computing Machines (FCCM '02), 2002, pp. 275-276.
[4] F. Vargas, R. Fagundes, and D. Barros, "A FPGA-based Viterbi algorithm implementation for speech recognition systems," in Proc. IEEE ICASSP '01, vol. 2, May 2001, pp. 1217-1220.
[5] L. R. Rabiner, "Recognition of isolated digits using hidden Markov models with continuous mixture densities," AT&T Tech. J., vol. 64, no. 6, pp. 1211-1234, 1985.
[6] M. Karnjanadecha and S. A. Zahorian, "Signal modeling for isolated word recognition," in Proc. IEEE ICASSP '99, vol. 1, Mar. 1999, pp. 293-296.
[7] X. Huang, Spoken Language Processing. Englewood Cliffs, NJ: Prentice-Hall, 2001.
[8] L. R. Rabiner, "A tutorial on hidden Markov models and selected applications in speech recognition," Proc. IEEE, vol. 77, pp. 257-285, Feb. 1989.
[9] T. Watanabe, K. Shinoda, K. Takagi, and E. Yamada, "Speech recognition using tree-structured probability density function," in Proc. ICSLP '94, 1994, pp. 223-226.
[10] L. R. Rabiner and B. H. Juang, Fundamentals of Speech Recognition. Englewood Cliffs, NJ: Prentice-Hall, 1993.
[11] P. Beyerlein, "Fast log-likelihood computation for mixture densities in a high-dimensional feature space," in Proc. ICSLP '94, vol. S0722, 1994, pp. 53-54.
[12] S. Sagayama and S. Takahashi, "On the use of scalar quantization for fast HMM computation," in Proc. IEEE ICASSP '95, vol. 1, 1995, pp. 213-216.
[13] S. Yoshizawa, N. Wada, N. Hayasaka, and Y. Miyanaga, "Noise robust speech recognition focusing on time variation and dynamic range of speech feature parameters," in Proc. ISPACS '03, 2003, pp. 484-487.
[14] S. Boll, "Suppression of acoustic noise in speech using spectral subtraction," IEEE Trans. Acoust., Speech, Signal Process., vol. 27, no. 1, pp. 113-120, Feb. 1979.
[15] H. Hermansky and N. Morgan, "RASTA processing of speech," IEEE Trans. Speech Audio Process., vol. 2, no. 5, pp. 578-589, Oct. 1994.
[16] K. M. Knill, M. J. F. Gales, and S. J. Young, "Use of Gaussian selection in large vocabulary continuous speech recognition using HMMs," in Proc. ICSLP '96, 1996, pp. 470-473.
[17] A. Lee, T. Kawahara, and K. Shikano, "Gaussian mixture selection using context-independent HMM," in Proc. IEEE ICASSP '01, May 2001, pp. 69-72.
[18] A. Lee, T. Kawahara, K. Takeda, and K. Shikano, "A new phonetic tied-mixture model for efficient decoding," in Proc. IEEE ICASSP '00, 2000, pp. 1269-1272.
[19] TMS320VC5416 Fixed-Point Digital Signal Processor Data Manual, Texas Instruments, Literature Number SPRS095O, 1999.
[20] H. Ney, D. Mergel, A. Noll, and A. Paeseler, "Data driven search organization for continuous speech recognition," IEEE Trans. Signal Process., vol. 40, no. 1, pp. 272-281, Feb. 1992.
[21] S. Itahashi, "A Japanese language speech database," in Proc. IEEE ICASSP '86, 1986, pp. 321-324.
[22] J. Sohn, N. S. Kim, and W. Sung, "A statistical model-based voice activity detection," IEEE Signal Process. Lett., vol. 6, no. 1, pp. 1-3, Jan. 1999.

Shingo Yoshizawa (S'00) received the B.E. and M.E. degrees in electrical engineering from Hokkaido University, Sapporo, Japan, in 2001 and 2003, respectively. He is currently working toward the Ph.D. degree at the Graduate School of Information Science and Technology, Hokkaido University. His research interests are speech processing, wireless communication systems, and VLSI architecture.

Naoya Wada (S'00) received the B.E. and M.E. degrees in electrical engineering from Hokkaido University, Sapporo, Japan, in 2001 and 2003, respectively. He is currently working toward the Ph.D. degree at the Graduate School of Information Science and Technology, Hokkaido University. His research interests are digital signal processing, speech analysis, and speech recognition.

Noboru Hayasaka (S'00) received the B.E. and M.E. degrees in electrical engineering from Hokkaido University, Sapporo, Japan, in 2002 and 2004, respectively. He is currently working toward the Ph.D. degree at the Graduate School of Information Science and Technology, Hokkaido University. His research interests are digital signal processing, speech analysis, and speech recognition.

Yoshikazu Miyanaga (S'80-M'83-SM'03) received the B.S., M.S., and Dr.Eng. degrees from Hokkaido University, Sapporo, Japan, in 1979, 1981, and 1986, respectively. Since 1983, he has been with Hokkaido University, where he is a Professor in the Graduate School of Information Science and Technology. His research interests are adaptive signal processing, nonlinear signal processing, and parallel-pipelined VLSI systems. Prof. Miyanaga is a member of the IEICE, the Information Processing Society of Japan, and the Acoustical Society of Japan.
