

A Neural Network-based Time Series Prediction Approach for Feature Extraction in a Brain-Computer Interface

Damien Coyle, Girijesh Prasad, and Thomas M. McGinnity, Member, IEEE

Manuscript received March 19, 2004. The first author is supported by a William Flynn Scholarship. The authors are with the Intelligent Systems Engineering Laboratory, School of Computing and Intelligent Systems, Faculty of Engineering, Magee Campus, University of Ulster, Northland Road, Derry, Northern Ireland, BT48 7JL, UK. (phone: +44 (0)28 7137 5745; fax: +44 (0)28 7137 5570; email: dh.coyle@ulster.ac.uk)

Abstract—This paper presents a novel feature extraction procedure (FEP) for extracting features from the electroencephalogram (EEG) recorded from subjects producing right and left motor imagery. EEG recorded from two electrodes positioned over the motor cortex has been used to provide an alternative communication channel in brain-computer interface (BCI) systems. In this work two neural networks (NNs) are trained to perform one-step-ahead predictions for the EEG time series data, where one NN is trained on right motor imagery and the other on left motor imagery. Subsequently, each type of data is fed into both prediction NNs. Features are extracted based on each NN's ability to predict each type of time series data. Features are derived from the mean squared error in prediction (MSE) or the mean square of the predicted signal (MSY), calculated over a segment of each predicted signal. Each prediction NN is trained on the EEG time series recorded from two electrodes; therefore, at least four features can be extracted from each trial. Separability of features is achieved due to the morphology of the EEG signals and each NN's specialisation to the type of data on which it is trained. The features are normalised before classification is performed using linear discriminant analysis (LDA). This FEP is tested on two subjects off-line; classification accuracy (CA) rates reach 92% and information transfer (IT) rates approach 18 bits/min. The approach does not require large amounts of training data, and minimal subject-specific data analysis is required. Preliminary results show good potential for online signal classification and autonomous system adaptation.

Index Terms—Augmentative communication, brain-computer interface, electroencephalogram, time-series prediction

I. INTRODUCTION

Research undertaken over the last fifteen years indicates that humans can learn to control certain characteristics of their EEG, such as amplitude and certain frequency rhythms. A person's ability to control their EEG may enable him/her to communicate without the prerequisite of being able to control their voluntary muscles. As EEG-based communication does not require neuromuscular control, people with neuromuscular disorders who may have no control over any of their conventional communication channels may still be able to communicate via this approach. EEG-based communication can be realised through a direct brain-computer interface (BCI), which does not depend on the peripheral nerves or muscles [1]. A BCI replaces the use of nerves and muscles, and the movements they produce, with electrophysiological signals in conjunction with the hardware and software that translate those signals into actions [2]. BCI technology is still developing but is already contributing to the improvement of living standards of disabled people [3][4][5]. Nearly two million people in the United States [2] are affected by neuromuscular disorders. A conservative estimate of the overall prevalence is that 1 in 3500 of the world's population may be expected to have a disabling inherited neuromuscular disorder presenting in childhood or in later life [6]. In light of these figures, the successful development of BCI technology is of crucial importance and has the potential to improve the quality of life for millions of people worldwide.

A BCI involves extracting information from the highly complex EEG. This is usually achieved by extracting features from EEG signals recorded from subjects performing specific mental tasks. A class of features is usually obtained from a number of signals recorded for each mental task, and subsequently a classifier is trained to learn which features belong to which class. This ultimately leads to a BCI system that can determine which thoughts belong to which EEG signals [7] and associate those thoughts with the user's intended communication. Techniques for feature extraction and classification involve computational models, some of which are based on the architecture of the human brain and inspired by the way neuronal information processing is carried out. Approaches involving applied multivariate statistical data analysis, regression and signal processing techniques are also often used.

The underlying generator of the EEG is the activity of a large number of neurons communicating and interacting through low-voltage electrical signalling. The highly nonstationary EEG signal is produced by the temporal and spatial summation of electrical currents that arise from pre- and postsynaptic potentials [8] of millions of parallel and synchronously active neurons. The normal geometry of
synaptic distributions over the pyramidal cells makes it impossible to know whether an EEG event at the scalp (i.e. a frequency or amplitude change) is due to an inhibitory or excitatory postsynaptic potential, or exactly which neuron is active. In general the EEG can be considered as an information-carrying neural signal that is not used internally as a neural code [9]. Noise, artefacts and background EEG distort the signal, making it exceedingly difficult to extract reliable information and features. Therefore the task of extracting information using an EEG-based BCI system is complex and requires that the components of the system be thoroughly investigated.

This work demonstrates a novel FEP which carries out NN-based time series prediction and focuses on features extracted from the time domain only. EEG is recorded from two electrodes (C3 and C4) attached to the scalp over the motor cortex. The EEG time series data is configured so that data from both electrodes can be predicted by a single network. Two NNs are used to predict a value of the EEG time series at point t using past values. The system is configured in three stages. The first stage involves training two NNs separately to perform one-step-ahead prediction using four previous measurements of each time series. These NNs are labelled L for left and R for right, corresponding to the type of EEG data on which they are trained (i.e. either left or right motor imagery). As these NNs perform prediction they are referred to as pNNs. The second stage involves inputting each type of training data into both pNNs, trial by trial. Each pNN provides a one-step-ahead prediction for the data in each trial. The MSE (mean square of the actual signal minus the predicted) or MSY (mean square of the predicted signal), measured over a segment of the prediction, provides a feature for each signal that is input to the pNN.
Because each pNN is trained to predict the values of two EEG signals, at least two features can be extracted from each pNN. Also, each type of data is fed into both pNNs; therefore, four or more features can be extracted for each type of data (i.e. two from each channel prediction of the LNN and two from the RNN). Due to the nature of this approach, the MSE or MSY of the prediction over a number of segments from each channel provides the option of extracting more than one feature per channel. By combining the features obtained from left data input to both pNNs, a left feature vector is formed and, similarly, for right data a right feature vector is obtained. In some cases, features are normalised to reduce intra-class variance. A comparison of normalised and non-normalised features shows that normalisation helps improve the performance of the system. The third stage is classification, which involves training a linear or nonlinear classifier on the features obtained. In summary, when applying the system, unknown data is fed into both pNNs and a feature vector is extracted and normalised. Subsequently, these features are fed into a classifier and a class prediction is made. The classification abilities of linear discriminant analysis (LDA) and NNs are compared in a number of initial tests.

LDA is selected for further experimentation, where performance is evaluated using three performance quantifiers. The measurement of BCI performance is very important for comparing different systems and measuring improvements in systems. There are a number of techniques used to quantify the effectiveness and performance of a BCI system. These include measuring the CA and/or the IT rate [10]. The latter quantifier considers both the CA and the time required to classify each mental task. A third and relatively new quantifier of performance for a BCI system is the mutual information (MI) [11][12], a measure of the average amount of information the classifier output contains about the input signal. In this case the classifier should produce a distance value, D, where the sign of D indicates whether the classification is left or right and the magnitude expresses the distance to the separating hyperplane. Also, to estimate the time course of the CA rate, IT rate and MI, the system described in this work allows features to be extracted and classified with a time resolution as high as the sampling rate. This allows selection of the point(s) in a trial at which the signals are most separable and system performance is optimal. To the best of the authors' knowledge, a NN-based time series prediction approach has not previously been employed for feature extraction in BCI applications. Results show that the proposed FEP compares well to existing approaches. Results ranging from 70-95% are reported for experiments carried out on similar EEG recordings [13][14][15]. In some cases, results are subject-specific and are based on a 10×10 cross-validation, which provides a more general view of the classification ability [13] but may not accurately represent the abilities of the system for online data processing. Current BCIs have IT rates ranging between 5-25 bits/min [16].
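The IT rate quoted above can be computed directly from the CA and the classification time. A minimal sketch follows, assuming the standard per-trial information formula for N equally likely classes used in the BCI literature cited here [10]; function names are illustrative:

```python
import numpy as np

def bits_per_trial(p, n_classes=2):
    """Information per trial (bits) for accuracy p over n equally likely classes."""
    if p >= 1.0:
        return np.log2(n_classes)
    return (np.log2(n_classes) + p * np.log2(p)
            + (1 - p) * np.log2((1 - p) / (n_classes - 1)))

def it_rate(p, trial_time_s, n_classes=2):
    """Information transfer rate in bits/min for a given classification time."""
    return bits_per_trial(p, n_classes) * 60.0 / trial_time_s
```

With the figures reported later in this paper (85% CA at a classification time of approximately 1.56 s), this gives roughly 15 bits/min, consistent with the rates quoted.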
The types of EEG signals used have a significant influence on the IT rates, and the suitability of many signals for BCI use by impaired patients is speculative (see [5] for a review). For one particular parameter setup in this work, IT rates approach 18 bits/min, and CA rates approaching 92% are achieved on unseen data without using cross-validation. Classification speed and accuracy are of crucial importance in the application of BCI systems and are fundamental to new techniques for feature extraction and classification. However, other requirements, such as algorithms which can perform autonomous adaptation and online feature extraction and classification, are essential for the successful progress of BCI technology. In a recent review article [2], Wolpaw et al. specify three levels of adaptation which must be confronted for successful progress in the development of BCI systems. The first level, initial adaptation to the individual user, is extremely important. Most BCI systems have concentrated on this level, but there are many issues which must be addressed, such as the ease with which initial adaptation of the system is realised. Does the system require detailed EEG analysis before the system parameters can be selected? A minimum of subject-specific data analysis is desirable for ease of transferability of the BCI

from user to user. How much training time is required before the system achieves an acceptable performance? This depends on the user and the system, but ideally the system should require as little training data as possible to be effectively adapted to any individual's EEG. Is noise and artifact removal of crucial importance? Is classification time delayed by the necessity to reduce the effects of artifacts? In developing this time series prediction approach, all of these requirements have been considered, and the resultant system, in some measure, fulfils them. It shows good potential for further development in these areas and for confronting the requirements of the second level of adaptation outlined in [2]: continuing adaptation to spontaneous changes in the user's performance. Development of this system to enable autonomous adaptation to changes in the user's EEG caused by environmental factors, the user's frame of mind and other factors is ongoing. The approach was developed with the second level of adaptation as a fundamental criterion, and preliminary results based on a derivative version of the system outlined in this work are encouraging. That system is based on the same conceptual ideas as the NN approach described in this paper but utilises a processing algorithm which is more amenable to the second-level adaptation requirements. At the Second International Meeting on BCI held in 2002 [16] there was some discussion of the fact that most existing signal processing methods were developed for man-made signals and may not be appropriate for BCI. Therefore, electrophysiological signals must be considered in new and different ways.
This work outlines a new signal processing technique which is well suited to EEG signals recorded over the motor cortex, but its ability to effectively process other types of EEG/electrophysiological signals remains to be established. The paper is organised as follows. Section II describes the data acquisition procedure and the data configuration; the pNN configuration, FEP and classification methods are also detailed. Section III provides a detailed description of the results obtained from three performance measures. Section IV is a discussion of the results and methods. Section V concludes the paper and suggests some future work.

II. METHODS

A. Data Acquisition

The EEG data used to demonstrate this approach was recorded by a research group at the Institute for Biomedical Engineering, University of Technology Graz, Austria [13][14][15]. The Graz group has developed a BCI which uses µ (8-12 Hz) and central β (18-25 Hz) EEG rhythms recorded over the sensorimotor cortex. Several factors suggest that µ and/or β rhythms may be good signal features for EEG-based communication: these signals are associated with those cortical areas most directly connected to the brain's normal motor output channels [2]. The data is recorded from a subject in a timed experimental recording procedure where the subject is instructed to imagine moving the left and right hand in

accordance with a directional cue displayed on a computer monitor. In each recording session a number of EEG patterns relating to imagined right or left arm movement are produced by the subject over a number of trials. The recordings were made using a g.tec amplifier and Ag/AgCl electrodes. All signals are sampled at 128 Hz and filtered between 0.5 and 30 Hz. Three bipolar EEG channels (anterior +, posterior -) were measured over C3, Cz and C4. A detailed description of similar experimental setups for recording these EEG signals is available in [11][12][13][14][15][17].

B. Data Configuration

The recorded EEG data is structured so that the four previous measurements of the time series recorded from each electrode are used to predict the value of the next measurement in the series from each electrode. Each training input exemplar thus contains four values from the data recorded from C3 and four from C4, forming an eight-element input vector. The training output contains the subsequent value of each of the two series. The extracted input-output data vector for the time series C3 and C4 has the following format:

[c3(t-4), c3(t-3), c3(t-2), c3(t-1), c4(t-4), c4(t-3), c4(t-2), c4(t-1); c3(t), c4(t)]
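This data configuration can be sketched as follows, assuming the two channels are supplied as equal-length arrays; the function name is illustrative:

```python
import numpy as np

def make_exemplars(c3, c4, lag=4):
    """Form input/output pairs: the previous `lag` samples of C3 and C4
    (an 8-element vector) predict the next sample of each channel."""
    X = [np.concatenate([c3[t - lag:t], c4[t - lag:t]]) for t in range(lag, len(c3))]
    Y = [[c3[t], c4[t]] for t in range(lag, len(c3))]
    return np.array(X), np.array(Y)
```

A 640-sample trial yields 636 such input/output pairs, matching the count given in this section.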

An analysis of the data was carried out to determine the optimum number of past samples of each signal to use for prediction. A quantitative measure of the average amount of information that several sequential measurements of a signal provide about the next measurement can be obtained by estimating the mutual information (referred to as redundancy for more than two variables) among different samples of the signal, and subsequently estimating the marginal redundancy each time a new measurement is added [18][19]. Depending on the value of the marginal redundancy, the optimum embedding dimension, D (i.e. the number of points of the signal to use for prediction), and lag, m (i.e. the interval between samples used for prediction), can be obtained. Results from this analysis showed that at m=1 the marginal redundancy increases the most when a new measurement is added, indicating that sequential measurements of the time series provide more information about the predicted point than using every second or third (and so on) measurement. All lags up to 10 were analysed and the marginal redundancies varied, depending on which parts of the signals were being analysed. In many cases, at a lag of 1 the marginal redundancy increased as D increased. Results for 1 ≤ D ≤ 5 were calculated at each lag. The redundancy for D > 5 was not computed due to the extensive computational load required to estimate the joint probability (i.e. the probability that two or more outcomes are realised jointly) among more than five dimensions. From this analysis, D=5 (i.e. four regressed measurements) and m=1 were chosen for optimum predictability of each time series. Depending on the points of

each signal and the type of signal analysed, the information about the predicted point varied and could probably increase further by increasing D, although it was observed that the marginal redundancy became constant at D = 3 or 4 in some cases. When the marginal redundancy becomes constant, further increases in D provide redundant information. Therefore, it was decided that four predictor samples were sufficient for prediction and would keep the dimensionality of each input exemplar reasonably low (i.e. 8 elements). Also, four predictors (i.e. 4 lagged variables of each time series) are used in many of the approaches to chaotic time series prediction described in the literature [20][21][22][23]. Each trial consists of approximately 5 seconds of task-related data. The data was recorded from two subjects (S1 and S2) during two different sessions. For subject S1 a total of 280 trials were recorded (140 trials of each type of movement imagery). For subject S2 there were 320 trials (160 trials of each type of movement imagery). Each trial consists of 640 samples (5 s × 128 Hz = 640); therefore, the training input/output data pairs for each trial consist of 636 samples (samples 636-639 are used to predict sample 640).

C. Prediction NN Architecture and Training Procedure

Two feed-forward multilayer perceptron NNs are used to perform prediction. One pNN is trained on the left EEG data and the other on the right EEG data. For training purposes, 50% of the trials are used as the training data set. By using separate NNs for each type of data, it is expected that each trained NN has a certain uniqueness, in that it is better suited to its own type of time series data. It is assumed that combining the data from both electrodes (C3 and C4) also enhances each pNN's suitability to the type of data on which it is trained.
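A minimal sketch of one such prediction network is given below, with a single hidden layer of 10 neurons (one of the architectures compared in Table 1). Plain gradient descent is used here for brevity; the authors trained their pNNs with the Levenberg-Marquardt method, and all names are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

class PredictorNN:
    """Minimal one-hidden-layer MLP: 8 lagged inputs -> 10 tanh units -> 2 outputs
    (the predicted next samples of C3 and C4)."""

    def __init__(self, n_in=8, n_hid=10, n_out=2):
        self.W1 = rng.normal(0.0, 0.3, (n_in, n_hid))
        self.b1 = np.zeros(n_hid)
        self.W2 = rng.normal(0.0, 0.3, (n_hid, n_out))
        self.b2 = np.zeros(n_out)

    def predict(self, X):
        self.H = np.tanh(X @ self.W1 + self.b1)   # hidden activations, cached
        return self.H @ self.W2 + self.b2

    def train(self, X, Y, epochs=200, lr=0.05):
        for _ in range(epochs):
            P = self.predict(X)
            E = P - Y                              # one-step-ahead prediction error
            # backpropagate the mean-squared-error gradient
            dW2 = self.H.T @ E / len(X)
            dH = (E @ self.W2.T) * (1 - self.H ** 2)
            dW1 = X.T @ dH / len(X)
            self.W2 -= lr * dW2
            self.b2 -= lr * E.mean(0)
            self.W1 -= lr * dW1
            self.b1 -= lr * dH.mean(0)
```

In the full system, one such network would be trained on the left-imagery exemplars and a second on the right-imagery exemplars.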
According to Bishop [24], there are a number of techniques that can be used to determine a suitable NN architecture for a specific task. For prediction, the number of hidden layers, the number of neurons in each layer and the type of activation function are important adjustable parameters, and the best choice depends on the problem to be learned by the network. A network architecture which is too complex for the problem will over-fit the training data. Conversely, a network architecture which is too simple will generalise poorly and provide poor predictions for new data. The complexity of the NN model can be controlled by varying the number of adaptive parameters in the network; this is called structural stabilisation [24]. There are a number of techniques for structural stabilisation, and one of the simplest is to experiment with a number of NNs with different architectures and choose the model with the best performance. This approach was used in this work. To obtain a suitable NN structure, each set of pNNs was adjusted experimentally during training. In many cases it was observed that the size of the pNN (i.e. the number of hidden layers and the number of neurons in each layer) did not significantly affect the overall prediction performance. For this approach, the quality of the features extracted is directly related to the types of predictions provided by the pNNs and not simply the best

prediction accuracy (a full description is given in Section II-D). The prediction accuracy was therefore not used as a criterion for the choice of pNN architecture; instead the quality of the features determined the best architecture. To assess the quality of the features, the overall CA of the system was used. It was found that pNNs with different architectures produced significantly different features, and this was reflected in the CA. The pNN weights were updated using the Levenberg-Marquardt method. This method allowed fast convergence to a minimum error plateau but has the disadvantage of being computationally expensive. To determine which sets of pNNs provide features allowing the highest CA, a comparative analysis was performed for the pNNs trained on data recorded from subject S1. Many pNN architectures were investigated, of which six are presented in this work. Three sets of pNNs were trained for 50 epochs regardless of convergence times, and three were trained using validation data to perform early stopping. Considering that the criterion for the best pNN architecture was the quality of the features provided, over-fitting was not a concern and in some cases improved the quality of the features. For each training duration, one set had one hidden layer containing 10 neurons, one set had two hidden layers with 6 and 8 neurons in the first and second hidden layers, respectively, and another had three hidden layers with 8, 10 and 6 neurons in the first, second and third hidden layers, respectively.

D. Feature Extraction Procedure

After each set of pNNs has been trained to perform one-step-ahead prediction, the training data (the same data used to train the pNNs) is input to both pNNs, trial by trial. When a trial is input to both pNNs, features can be extracted in two ways. The first method involves calculating the MSE of the prediction for a segment of the trial.
This is also a measure of the prediction accuracy. The second method involves calculating the mean square of the predicted signal (MSY). As these calculations reduce the predictions over a segment to a scalar value, features based on the error in the first case and on the predicted signal in the second case can be obtained. Equations (1) and (2) are used for obtaining each feature:
f_yk = (1/M) * sum_{t=1}^{M} ( y(t) - y_k(t) )^2    for MSE    (1)

or

f_yk = (1/M) * sum_{t=1}^{M} ( y_k(t) )^2    for MSY    (2)

where y(t) and y_k(t) are the values of the actual and predicted signals (i.e. either C3 or C4) at time t, respectively. The index k indicates whether the signal is the output of the left or right pNN (i.e. k can be l or r). M is the number of prediction samples used for the MSE or MSY calculation. Equation (1) is used for calculating the error-type (MSE)

features and (2) is used for calculating the predicted-type (MSY) features. The left data is fed into both the right and left pNNs and each pNN provides a prediction for two signals (C3 and C4); therefore four features can be extracted. Similarly, right data is fed into both pNNs and four features can be extracted. The feature vector, fv, is shown in Fig. 1. For each trial a four-element feature vector is obtained, and classes of features for right and left data can be obtained by entering all trials of training data into both pNNs. Optionally, the features can be normalised by dividing each feature vector by the sum of its components. The main goal of any FEP is to extract features that keep intra-class variance at a minimum and inter-class variance at a maximum. Normalising the features can reduce the variance within each class. Inter-class variance arises when a pNN is less suited to a particular data type (i.e. the LNN should predict left data better than the RNN does, and vice versa). It is expected that a left feature vector should contain smaller error values from the LNN predictions than from the RNN, and a right feature vector should have smaller error values from the RNN predictions than from the LNN. This expectation is due to the fact that the LNN is trained on left data and the RNN on right data. In the case of MSY features, the power of the signals being predicted has a direct effect on the features produced. All aspects of the types of features extracted are compared and discussed in Sections III and IV.
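The feature extraction just described can be sketched as follows, assuming each pNN exposes a predict method returning the predicted C3 and C4 columns and a trial is held as a dict of its lagged inputs and actual signals; all names are illustrative:

```python
import numpy as np

def mse_feature(y, y_pred, s, M):
    """MSE over prediction samples s..M, per eq. (1)."""
    return np.mean((y[s:M] - y_pred[s:M]) ** 2)

def msy_feature(y_pred, s, M):
    """Mean square of the predicted signal over s..M, per eq. (2)."""
    return np.mean(y_pred[s:M] ** 2)

def feature_vector(trial, left_nn, right_nn, s=0, M=200, kind="mse", normalise=True):
    """Feed one trial into both pNNs and form fv = {fc3l, fc4l, fc3r, fc4r}."""
    fv = []
    for nn in (left_nn, right_nn):           # LNN first, then RNN
        pred = nn.predict(trial["X"])        # columns: predicted C3, C4
        for ch, col in (("c3", 0), ("c4", 1)):
            if kind == "mse":
                fv.append(mse_feature(trial[ch], pred[:, col], s, M))
            else:
                fv.append(msy_feature(pred[:, col], s, M))
    fv = np.asarray(fv)
    return fv / fv.sum() if normalise else fv
```

The optional normalisation divides each feature vector by the sum of its components, as described above.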
[Figure 1: schematic of the complete system. New data, as the lagged inputs c3(t-4), c3(t-3), c3(t-2), c3(t-1), c4(t-4), c4(t-3), c4(t-2), c4(t-1), is fed into the Left NN and the Right NN, which output the predictions c3l(t), c4l(t) and c3r(t), c4r(t). Features fc3l, fc4l, fc3r and fc4r are computed from these predictions, forming the feature vector fv = {fc3l, fc4l, fc3r, fc4r}, optionally normalised as fv = fv / Σfv, which is passed to the classifier to produce a left or right TSD classification.]

Fig. 1. Illustration of the FEP and the complete system.

E. Classification

To classify the features, initial experimentation was carried out to determine which of two classifiers had the ability to discriminate between the features from each class most effectively. LDA works on the assumption that different classes of features can be separated linearly. NNs, utilised as classifiers, attempt to find a nonlinear separating hyperplane between the classes and are investigated to determine if they can outperform LDA. Linear classifiers are generally more robust than their nonlinear counterparts, since they have only limited flexibility (fewer free parameters to tune) and are less prone to over-fitting [25]. The initial experimentation was carried out on non-normalised MSE-type features extracted from specific segments of the trials. The CA for an LDA classifier and a NN classifier is calculated at every two hundred time points (i.e. approximately every 1.56 s, arbitrarily chosen). Based on the results of the initial tests, and the fact that training an LDA classifier does not require a great deal of time, the LDA classifier is chosen to determine the time course of the CA, MI and IT rate. The main experimentation involved extraction and classification of features at every time point in a trial, allowing selection of the optimum time point(s) at which to perform feature extraction and classification for effective employment of the system. It is shown that performance can be significantly improved when analysis is performed at each time point in the trial rather than at arbitrarily chosen segments. The full system is illustrated in Fig. 1.

III. RESULTS

The classification technique described was tested on unseen data: 140 trials for subject S1 and 160 trials for subject S2. Table 1 shows the CA results for subject S1. These results, obtained using an LDA and a NN classifier, were used to compare the abilities of both classifiers based on MSE-type features at arbitrarily chosen M values for (1) and (2). The first and second columns specify the number of hidden layers and the number of neurons in each layer, respectively. The third column specifies the number of epochs used for training (xval indicates that the pNNs were prevented from over-fitting the training data by using cross-validation for early stopping). The fourth column specifies the number of points, M, involved in the MSE calculation and the number of feature sets extracted for each channel. Using a 200-point MSE calculation, the maximum number of features that can be extracted from each trial is 3, as each trial contains 640 data points. A 200-point MSE calculation was arbitrarily chosen for this initial experimentation. There are a number of ways the features can be extracted and, depending on the subject's EEG characteristics, the results may differ. For example, using a smaller number of points for the MSE calculation, a greater number of features may be extracted from each trial, and this may improve the CA. As can be seen from Table 1, using 2 x 200-point MSE calculations for each electrode helped improve the CA in some cases, although this improvement is obtained at the expense of increasing

classification time. For example, if only one feature from each channel prediction is used, then the classification time is approximately 1.56 s (200 × 1/128 s). If two features from each channel prediction are used, then classification takes approximately 3.1 s. IT rates approach 15 bits/min with a classification time of approximately 1.56 s and 85% CA. BCI systems must have the ability to classify signals rapidly (ideally in real time) and accurately; therefore a trade-off must be made. For example, the first and second pNNs of Table 1 can provide features which can be classified with 84% accuracy, using LDA, with only one feature from each channel, thus requiring a reasonably short classification time. For the fifth and sixth pNNs of Table 1 the CA is increased by using 2 features from each channel.

Table 1: A Comparison of Classification Results for Different pNN Architectures, Training Durations and Feature Extraction Parameters
Hidden Layers | Neurons | T. Dur. | M     | LDA % | NN %
3             | 8-10-6  | 50      | 200x1 | 84.3  | 82
              |         |         | 200x2 | 75.7  | 80
              |         |         | 200x3 | 71.4  | 81.4
2             | 6-8     | 50      | 200x1 | 84.3  | 83.6
              |         |         | 200x2 | 75.7  | 82
              |         |         | 200x3 | 75    | 82.9
1             | 10      | 50      | 200x1 | 76.4  | 82.1
              |         |         | 200x2 | 75.7  | 83.6
              |         |         | 200x3 | 80.7  | 83.6
3             | 8-10-6  | xval    | 200x1 | 79.3  | 82.9
              |         |         | 200x2 | 79.3  | 81.4
              |         |         | 200x3 | 75.7  | 80.7
2             | 6-8     | xval    | 200x1 | 80    | 80.7
              |         |         | 200x2 | 78.6  | 85
              |         |         | 200x3 | 77.1  | 81.4
1             | 10      | xval    | 200x1 | 76.4  | 83
              |         |         | 200x2 | 77.8  | 85
              |         |         | 200x3 | 80    | 85

The two smaller pNN architectures trained with xval provided the best CA overall. This CA was obtained using a NN classifier. The fastest classification was obtained using either of the largest over-trained pNNs (50 epochs) along with an LDA classifier. This classification is twice as fast as that of the other two at the expense of losing a percentage point in CA; therefore, either the first or second pNN of Table 1 for feature extraction, conjoint with an LDA classifier, would be the best choice as an outcome of this comparative analysis. Due to the random weight initialisation of NNs, there are always discrepancies in the results from retrained NNs, even when the same architectures, parameters and training data are used. Therefore, the NN classifiers were trained a number of times and the NN that achieved the best results was chosen. It may have been possible to improve the results achieved by the NN classifiers by continually fine-tuning the parameters of the NNs. Achieving the results documented in this work required a substantial amount of time, and sufficient conclusions could be drawn without further optimisation of the NN. A GA could have been used to

optimize the network parameters or some other form of automatic neuron adding and pruning procedure but this would have required a significant amount of development time. Therefore, based on the time aspect alone the LDA classifier, which usually achieves a performance that requires very little training time and no parameter tuning, outperforms the NN classifier. For this FEP it is concluded that the LDA classifier is definitely the best choice. The choice is based on the best CA and time required to achieve that CA as well as the best training times, ease of parameter setup and degree of complexity. This conclusion complements the conclusions made by Muller et al., [25] at the Second International Meeting on BCIs (June 2002) where a formal debate was held on the pros and cons of linear and nonlinear methods in BCI research. They concluded that simplicity should be preferred and that linear methods should be chosen when limited knowledge about the data is available. They also stated that it is desirable to avoid reliance on nonlinear classifiers as much as possible because often a lot of parameter tuning is necessary and it may only be possible to optimize a nonlinear classifier when there is sufficient prior knowledge of the data. The LDA classifier was therefore selected for further detailed investigation. The best result obtained for subject S2 was 92% CA using an LDA classifier with a sliding window-360 point MSE feature calculation. To extract a new set of features for every successive time point in a trial using the sliding window approach, t in (1) and (2) ranges from t=s to M (initially s=1 and M=360) and s and M are incremented before the next set of features is extracted. This means that data at the beginning of a trial does not have any effect on the features as the window slides away from the start of the trial. Table 2 illustrates the results for subject S1, obtained using an LDA classifier. 
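Equations (1) and (2) are not reproduced in this excerpt, so the definitions below are assumptions: MSE is taken as the mean squared one-step prediction error over the window [s, M] and MSY as the mean squared predicted output. A sketch of both the sliding-window extraction described above and the one-step accumulation used elsewhere in the text:

```python
def window_features(y, y_hat, s, m):
    """MSE/MSY over samples s..m (1-based, inclusive). y holds the recorded
    EEG samples of one channel of one trial and y_hat the pNN's one-step-ahead
    predictions; the mean-square forms assume equations (1) and (2)."""
    w_y, w_hat = y[s - 1:m], y_hat[s - 1:m]
    n = len(w_y)
    mse = sum((a - b) ** 2 for a, b in zip(w_y, w_hat)) / n  # prediction error power
    msy = sum(b ** 2 for b in w_hat) / n                     # predicted signal power
    return mse, msy

def sliding_window_features(y, y_hat, width=360):
    """New feature set per time point: s and M both advance by one sample."""
    return [window_features(y, y_hat, s, s + width - 1)
            for s in range(1, len(y) - width + 2)]

def accumulated_features(y, y_hat):
    """One-step accumulation: t ranges from the trial start to M, and M is
    incremented by one for each new feature set."""
    return [window_features(y, y_hat, 1, m) for m in range(1, len(y) + 1)]
```

In a real deployment `y_hat` would come from the trained prediction networks; here it is simply any sequence of the same length as `y`.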
The evolution of every performance measure over time was calculated, and the results obtained at the best time points are presented. The first column specifies the number of hidden layers and the second the number of neurons in each hidden layer of the pNNs. The third column specifies the number of epochs each pNN was trained for (xval means training was stopped by cross-validation). The fourth column specifies the classifier type (LDA only) and the fifth whether the features are extracted using MSE or MSY; an 'n' indicates that the results in that row were obtained using normalised features. Experimentation with each pNN setup was carried out for four feature types. All FEPs use a one-step accumulation, so the MSE or MSY at each time point is calculated from the current time point and all the previous predicted values from the beginning of the trial (i.e., t in (1) and (2) ranges from the start of the trial to M, and M is incremented by one for extraction of each new set of features). The CA, the number of predictions required to achieve that CA, the corresponding time and the IT rates are specified in columns 6-9, respectively. The maximum MI and the time required to achieve it are detailed in columns 10 and 11, respectively. The last column

Table 2: A comparative analysis of the pNNs based on results from three performance measures: classification accuracy, information transfer rate and mutual information. The time course of each performance measure was calculated; all results are shown for the optimum time points.
Hidden Layers  Neurons  Dur.  C-Type  Feature Type  CA %   No. of Samples  Max. C-Time [s]  IT Rate [bpm]  Max. MI [bit]  Max. MI-Time [s]  Best
3              8-10-6   50    LDA     MSE           87.14  271             2.1172           12.651         0.3996         3
3              8-10-6   50    LDA     MSY           88.57  410             3.2031           9.1270         0.5296         2.8125            X
3              8-10-6   50    LDA     MSE n         85.71  267             2.0859           11.7451        0.4870         3
3              8-10-6   50    LDA     MSY n         86.49  391             3.0547           8.389          0.5884         2.8672
2              6-8      50    LDA     MSE           84.29  358             2.7969           7.9926         0.3462         2.9844
2              6-8      50    LDA     MSY           88.57  377             2.9453           9.9268         0.5238         2.5156
2              6-8      50    LDA     MSE n         84.29  366             2.8594           7.8179         0.47393        2.875
2              6-8      50    LDA     MSY n         87.86  360             2.8125           9.953          0.6034         2.8203            X
1              10       50    LDA     MSE           85     346             2.7031           8.6602         0.4170         2.5391
1              10       50    LDA     MSY           87.86  391             3.0547           9.1639         0.5302         3.1094
1              10       50    LDA     MSE n         85     325             2.5391           9.2198         0.4753         3.0938
1              10       50    LDA     MSY n         90     365             2.8516           11.173         0.5978         3.0547            X
3              8-10-6   xval  LDA     MSE           85.71  177             1.3828           17.717         0.3756         2.8984
3              8-10-6   xval  LDA     MSY           90     398             3.1094           10.2465        0.5190         2.5313
3              8-10-6   xval  LDA     MSE n         87.14  343             2.6797           9.9973         0.5286         2.4844
3              8-10-6   xval  LDA     MSY n         90     362             2.8281           11.2655        0.6218         2.8672            X
2              6-8      xval  LDA     MSE           86.42  330             2.5781           9.9397         0.4519         2.6328
2              6-8      xval  LDA     MSY           88.57  389             3.0391           9.6205         0.5074         2.5156
2              6-8      xval  LDA     MSE n         85.71  307             2.3984           10.215         0.55598        2.6172
2              6-8      xval  LDA     MSY n         86.49  355             2.7734           9.2397         0.6025         2.8203            X
1              10       xval  LDA     MSE           87.14  246             1.9219           13.9393        0.4908         2.6250
1              10       xval  LDA     MSY           87.86  395             3.0859           9.0711         0.4918         2.6875
1              10       xval  LDA     MSE n         86.43  209             1.6328           15.694         0.52019        2.5               X
1              10       xval  LDA     MSY n         86.43  361             2.8203           9.0861         0.5854         3.0781

specifies the best choice of feature type (column 5) for each pNN setup, based on a trade-off between the results from all three performance measures. Results in bold specify the best results obtained for each performance quantifier, for each pNN setup.

IV. DISCUSSION

By comparing the CA results for the non-normalised MSE features of Table 2 to those of Table 1, it can be seen that a significant improvement could be made by estimating the time course of the CA at the rate of the sampling interval and selecting the time point at which CA was maximised. A comprehensive analysis of the time course of the performance measures can provide useful information for more effective deployment of the system. All IT rates were calculated using the time required to obtain the maximum CA. The IT rate can be much higher if it is calculated at the beginning of a trial, when the CA may not be as high. A reasonable CA in reduced time may be feasible in a system which can tolerate high error rates in communication, but for the intended purposes of a BCI high error rates may not be tolerable. Erroneous communications can be counteracted more quickly by a fast system, although the correcting communication may also be erroneous. This would result in a system which could be frustrating or exhausting to use and may not be efficient enough for practical use, especially by patients with severe neuromuscular disorders. Frustration, fatigue and other such feelings and emotions can also result in adverse changes in the EEG which seriously affect the user's ability to communicate, making it more difficult for the BCI system to discriminate between signals. Kubler et al. [5] refer to this as interference and distraction: it results not only from superfluous emotions and feelings but can also occur when imagery associated with the different cognitive tasks that the patient performs to produce specific EEG frequency patterns interferes with the thoughts the patient intends to communicate. The MI results were selected at the point at which MI is maximised, which usually occurs around the point where CA is maximised. The MI reaches more than 0.62 bits, which is a practical result and indicates that the SNR is sufficiently high

to facilitate using the classifier output to perform limited modulated control of a cursor. However, an increase in MI would indicate a more reliable FEP and classifier, so in this approach it was decided to increase the MI as much as possible. The MSE type features were affected by noise and artefacts in the data more than the MSY features were: irregular transients in the signals increased the scalar value of an MSE type feature because they produced a larger prediction error, causing increased fluctuations and variance in the classifier output from trial to trial. The MSY calculation was less affected by noise and artefacts because the pNNs did not predict irregular transients in the signal, indicating that the pNNs helped to remove artefacts and noise; MSY type features are therefore more stable and vary less from trial to trial. As can be seen from Table 2, the CA and MI show significant improvements on all tests using the MSY features, although these results were obtained at the expense of increased classification time, which can reduce the IT rate, depending on the CA. A prime example of the effect that classification time and CA have on the IT rate can be seen by comparing the MSE and MSY results for the three-layer pNN with xval: the CA using the MSE features is over 4% lower than that of the MSY features, but the IT rate is more than 7 bits higher, due to the difference in classification time. IT rates approaching 18 bits/min are quite high and compare well to results reported for many BCIs, demonstrating the potential of this approach. Figure 2 shows the time course of all three performance quantifiers for the three-layer pNN with xval. The graph at the bottom shows a large increase in the IT rate due to a reasonably high CA achieved in reduced time.
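The MI figures in Table 2 are computed from the one-dimensional classifier output following the measure the paper cites from Schlogl et al. [11]. The exact estimator is not reproduced in this excerpt; the sketch below assumes the variance-ratio form MI = 0.5*log2(SNR + 1) with SNR = total variance / within-class variance - 1:

```python
import math

def schloegl_mi(outputs_class1, outputs_class2):
    """Mutual information of a scalar classifier output for two classes.

    Assumes the SNR-based estimate associated with [11]:
    MI = 0.5 * log2(var_total / var_within), i.e. 0.5 * log2(SNR + 1).
    """
    def var(xs):
        m = sum(xs) / len(xs)
        return sum((x - m) ** 2 for x in xs) / len(xs)

    within = 0.5 * (var(outputs_class1) + var(outputs_class2))  # "noise" power
    total = var(list(outputs_class1) + list(outputs_class2))    # signal + noise
    return 0.5 * math.log2(total / within)
```

Well-separated class outputs give a high MI; identical class distributions give an MI of zero, matching the intuition that the output then carries no class information.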
IT rates can be very high if errors are tolerable; selecting a classification point within the first half-second of the trial may result in a much faster system, but this may not be appropriate for the intended use of the BCI. For this analysis the optimum classification point is marked by an X in Figure 2, chosen as the point at which CA and MI are at a maximum. Normalising the MSE and MSY type features produced significant variations in the results compared to those obtained with raw features. Normalising the features reduces the intra-class variability: when the features are normalised, the relations between features within each class and the separability between features from the left and right classes can be visualised much more clearly than with features that are not normalised. Figures 3 and 4 graphically illustrate the normalised MSY features extracted from the left and right three-layer pNNs with xval, respectively. The features are those extracted at time point X (Figure 2). For each trial four features are extracted. Left C3 data has a larger normalised MSY than left C4 data because the C3 channel is ipsilateral to the imagined movement and therefore the signal shows increased (non-desynchronised) rhythmic activity, which is generally much higher in amplitude than the desynchronised signal occurring contralateral to the movement. The left C4 signal for this subject shows an ERD of the mu rhythm and an ERS of the beta rhythm

Figure 2: Time course of the CA rate, IT rate and MI for the three-layer pNNs with xval. The line at X indicates the points of maximum CA and MI and the corresponding IT rate.
Figure 3: Normalised features for 70 left trials. Index ll denotes left-trial data input to the left pNN (C3 and C4) and index lr denotes left-trial data input to the right pNN (C3 and C4).
Figure 4: Normalised features for 70 right trials. Index rl denotes right-trial data input to the left pNN (C3 and C4) and index rr denotes right-trial data input to the right pNN (C3 and C4).

thus, the EEG from the C4 channel is much lower in amplitude, and the normalised MSY values are relatively lower than those extracted for the C3 channel. The opposite occurs for the right movement imagery data. A frequency analysis for this subject is detailed in [26] and the ERD/ERS phenomenon is described in [15] and [27]. It can be seen from Figures 3 and 4 that the features ll3 and lr3 generally have much higher values than features rl3 and rr3, whereas the opposite is true for the features marked with index 4. On both graphs, in any trial where the C3 features overlap with the C4 features it can be expected that the separability is reduced, which may make it more difficult for the classifier to discriminate and allocate those features to the correct class. Trials whose features show these characteristics are the cause of reduced overall performance. From a visual perspective the normalised features appear to be much more separable than the non-normalised features (non-normalised features not shown), but CA and IT rates are improved in some cases and degraded in others. However, in all cases the MI is increased, which suggests that normalising the features helps to increase the signal-to-noise ratio (SNR). All the results shown in Table 2 were selected following analysis of the results from all time points. Selection of the best pNNs, the best feature extraction method (MSE or MSY, normalised or not) and the best time points can be carried out from this comparative analysis. The choice depends on the desired characteristics of the system and whether a loss of performance in one area, to increase performance in another, is tolerable. Once the optimum system has been chosen, further investigation should be carried out on new EEG data recorded from the same subject.

V. CONCLUSION

The proposed neural network based time series prediction approach shows good potential and compares well to existing approaches. Different results are achieved depending on the choice of pNN parameters (i.e., training duration and architecture); it has been shown that the number of layers, the number of neurons in each layer and the training duration have significant effects on the features. The documented results are based on a number of variations of the FEP, and the basic concept was significantly improved by additional feature analysis and further processing, increasing the potential applicability of this approach. The proposed BCI system can discriminate between complex patterns and classify thought processes for a two-class problem with accuracy rates of 92% and IT rates approaching 18 bits/min. This BCI can be easily adapted to different subjects, adaptability being a fundamental requirement for a BCI. The method meets the requirements for the first level of adaptation outlined in [2]: it is easily adapted to each individual subject and does not require a subject-specific frequency analysis or any other form of EEG analysis. No artefact removal or noise reduction was carried out on the raw EEG data, which suggests that this approach is fairly robust. However, further

work will be carried out to determine whether some form of data preprocessing, such as performing independent component analysis (ICA) on the data, will result in increased CA, IT rate and/or MI. This approach has many advantages, and the overall structure and concept provide plenty of options for further development. Firstly, because the predicted signal can be used to produce features t steps ahead in the future, without requiring the original signal at that point in the future, significant improvement of the IT rate may be possible. For example, if the pNNs are trained to perform 100-step-ahead prediction then all features are extracted approximately 100/128 s = 0.78 s faster than those using one-step-ahead prediction. Further investigation is ongoing to establish whether the features from 100-step-ahead or larger predictions are reliable. A number of preliminary results show that the MSE or MSY predictions do not have to be accumulated from the beginning of each trial. Tests were carried out using a sliding window, where MSY type features are extracted from data within a window which moves along the data. The CA rates are maintained, emphasising the potential of this system for online feature extraction; in the case of subject S2 the CA rate increased by 5% using a sliding window. Initial results from using a self-organising fuzzy neural network (SOFNN) [20][21], which can self-organise the network architecture by adding and pruning neurons as required, also appear promising. The advantage of using an SOFNN in this system is that the NN architecture does not have to be adjusted manually for each individual user and the structure can be optimised online. A system that can autonomously add neurons to accommodate variations in the EEG, as the user learns to control his/her EEG better, and/or prune neurons if older EEG characteristics are no longer used, would have advantages for continual online adaptation.
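The time saving from multi-step-ahead prediction quoted above is simple to express: with features built from predictions k steps ahead, the feature for a given time point is available roughly k sampling intervals earlier. A sketch of this arithmetic (the 128 Hz rate follows from the text; the helper name is illustrative only):

```python
def feature_lead_time(steps_ahead: int, fs_hz: float = 128.0) -> float:
    """Approximate time (s) by which k-step-ahead features become available
    earlier than one-step-ahead features, as in the 100-step example in the
    text: 100/128 s is roughly 0.78 s."""
    return steps_ahead / fs_hz

print(round(feature_lead_time(100), 2))  # prints 0.78
```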
This approach, along with the sliding window concept, is being developed further to allow continuous feature extraction and online autonomous adaptation of the feature extraction procedure, thus confronting the second level of adaptation requirements outlined in [2].

ACKNOWLEDGMENT

The authors would like to acknowledge the Institute for Biomedical Engineering, University of Technology Graz, Austria, for providing the EEG data.

REFERENCES
[1] J. R. Wolpaw, N. Birbaumer, W. J. Heetderks, D. J. McFarland, P. H. Peckham, G. Schalk, E. Donchin, L. A. Quatrano, C. J. Robinson and T. M. Vaughan, "Brain-Computer Interface Technology: A Review of the First International Meeting", IEEE Trans. on Rehab. Eng., vol. 8, no. 2, pp. 164-173, June 2000.
[2] J. R. Wolpaw, N. Birbaumer, D. J. McFarland, G. Pfurtscheller and T. M. Vaughan, "Brain-computer interfaces for communication and control" (invited review), J. Clinical Neurophysiology, Elsevier, vol. 113, pp. 767-791, 2002.
[3] G. Pfurtscheller, C. Guger, G. Muller, G. Krausz and C. Neuper, "Brain oscillations control hand orthosis in a tetraplegic", Neuroscience Letters, vol. 292, pp. 211-214, 2000.
[4] A. Kubler, B. Kotchoubey, T. Hinterberger, N. Ghanayim, J. Perelmouter, M. Schauer, C. Fritsch, E. Taub and N. Birbaumer, "The thought translation device: a neurophysiological approach to communication in total motor paralysis", Exp. Brain Res., Springer-Verlag, vol. 124, pp. 223-232, 1999.
[5] A. Kubler, B. Kotchoubey, J. Kaiser, J. R. Wolpaw and N. Birbaumer, "Brain-Computer communication: unlocking the locked-in", Psychological Bulletin, vol. 127, no. 3, pp. 358-375, 2001.
[6] A. E. H. Emery, "Population Frequencies of Inherited Neuromuscular Diseases: A World Survey", Neuromuscular Disorders, vol. 1, no. 1, pp. 19-29, 1991.
[7] D. Coyle, G. Prasad and T. M. McGinnity, "EEG-Based Communication: A Time Series Prediction Approach", IEEE Systems, Man and Cybernetics UK&RI Chapter, pp. 142-147, Sept. 2003.
[8] B. J. Fisch, Fisch and Spehlmann's EEG Primer: Basic Principles of Digital and Analogue EEG, Elsevier, 1999.
[9] W. W. Lytton, From Computer to Brain: Foundations of Computational Neuroscience, Springer, 2002.
[10] J. R. Wolpaw, H. Ramouser, D. J. McFarland and G. Pfurtscheller, "EEG-Based Communication: Improved Accuracy by Response Verification", IEEE Trans. on Rehab. Eng., vol. 6, no. 3, 2000.
[11] A. Schlogl, C. Keinrath, R. Scherer and G. Pfurtscheller, "Estimating the Mutual Information of an EEG-based Brain-Computer Interface", Biomedizinische Technik, Band 47, pp. 3-8, 2002.
[12] A. Schlogl, C. Keinrath, R. Scherer and G. Pfurtscheller, "Information Transfer of an EEG-based Brain-Computer Interface", Proc. 1st International IEEE EMBS Conference on Neural Engineering, pp. 641-644, March 2003.
[13] C. Guger, A. Schlogl, C. Neuper, T. Strein, D. Walterspacher and G. Pfurtscheller, "Rapid Prototyping of an EEG-Based BCI", IEEE Trans. on Neural Sys. and Rehab. Eng., vol. 9, no. 1, pp. 49-57, March 2001.
[14] E. Haselsteiner and G. Pfurtscheller, "Using Time-Dependent NNs for EEG Classification", IEEE Trans. on Rehab. Eng., vol. 8, no. 4, pp. 457-462, Dec. 2000.
[15] G. Pfurtscheller, C. Neuper, A. Schlogl and K. Lugger, "Separability of EEG Signals Recorded During Right and Left Motor Imagery Using Adaptive Autoregressive Parameters", IEEE Trans. on Rehab. Eng., vol. 6, no. 3, pp. 316-324, Sept. 1998.
[16] T. M. Vaughan et al., "Guest Editorial: Brain-Computer Interface Technology: A Review of the Second International Meeting", IEEE Trans. on Neural Sys. and Rehab. Eng., vol. 11, no. 2, pp. 94-109, June 2003.
[17] A. Schlogl, D. Flotzinger and G. Pfurtscheller, "Adaptive Autoregressive Modelling used for Single-trial EEG Classification", Biomedizinische Technik, Band 42, pp. 162-167, 1997.
[18] G. P. Williams, Chaos Theory Tamed, Taylor and Francis, 1997.
[19] M. Palus, "Kolmogorov Entropy from Time Series using Information-Theoretic Functionals", Neural Network World, vol. 3, pp. 268-292, 1997.
[20] G. Leng, G. Prasad and T. M. McGinnity, "A New Approach to Generate a Self-Organising Fuzzy Neural Network", Proc. IEEE Int. Conference on Systems, Man and Cybernetics, October 2002.
[21] G. Leng, G. Prasad and T. M. McGinnity, "A Design for a Self-Organising Fuzzy Neural Network based on the Genetic Algorithm", Proc. IEEE Int. Conference on Systems, Man and Cybernetics, October 2003.
[22] N. K. Kasabov and Q. Song, "DENFIS: Dynamic Evolving Neural-Fuzzy Inference System and its Application for Time Series Prediction", IEEE Trans. on Fuzzy Systems, vol. 10, no. 2, April 2002.
[23] J. S. R. Jang, "ANFIS: Adaptive-Network-based Fuzzy Inference System", IEEE Trans. on Systems, Man and Cybernetics, vol. 23, no. 3, pp. 665-685, May 1993.
[24] C. M. Bishop, Neural Networks for Pattern Recognition, Oxford University Press, 1995.
[25] K.-R. Muller, C. W. Anderson and G. E. Birch, "Linear and Nonlinear Methods for Brain-Computer Interfaces", IEEE Trans. on Neural Systems and Rehab. Eng., vol. 11, no. 2, June 2003.
[26] E. Niedermeyer and F. L. Da Silva, Electroencephalography: Basic Principles, Clinical Application and Related Fields, 4th Ed., Chap. 53, Williams and Wilkins, 1998.
[27] D. Coyle, G. Prasad and T. M. McGinnity, "A Time-Frequency Approach to Feature Extraction for a Brain-Computer Interface with a Comparative Analysis of Linear and Neural Network Classifiers and Performance Quantifiers", EURASIP Journal on Applied Signal Processing, submitted for publication.

Damien Coyle was born in Letterkenny, Co. Donegal, Republic of Ireland, on July 26, 1980. He graduated from the University of Ulster in 2002 with a first class honours degree in electronics and computer engineering and a diploma in industrial studies. He is currently undertaking research as a PhD student in the Intelligent Systems Engineering Laboratory (ISEL) at the University of Ulster. His research interests include non-linear signal processing, biomedical signal processing, chaos theory, information theory and neural and adaptive systems.

Girijesh Prasad is a Lecturer in the School of Computing and Intelligent Systems at the Magee campus of the University of Ulster and is an active member of ISEL. Previously he worked as a Digital Systems Engineer in Uptron India Ltd at Lucknow, India, and then as a Power Plant Engineer in a 5x200 MW thermal power station of UPSEB at Obra, India. He also worked as a Research Fellow on a UK EPSRC/industry funded research project at the Queen's University of Belfast. He holds a first class honours bachelor's degree in Electrical Engineering and a first class honours master's degree in Computer Science and Technology, and obtained his PhD degree from the Queen's University of Belfast. He is a Chartered Engineer and a member of the IEE. At present, Dr. Prasad is jointly supervising five PhD students and one research associate funded under various grants/awards from the university, local industry and the regional funding agency Invest Northern Ireland (INI). These include a Collaborative Award in Science & Technology (CAST) on Intelligent Inferential Modeling and Control, a William Flynn Research Scholarship on Brain-Computer Interfaces and a Department of Education and Learning (DEL) Scholarship on Hardware Realization of Biologically Realistic Spiking Neural Networks. His research interests include neural networks, fuzzy-neural hybrid techniques, self-organizing systems, principal component analysis, model predictive control, performance monitoring and optimization, and applied research in thermal power plants, embedded systems and healthcare systems.

Martin McGinnity has been a member of the University of Ulster academic staff since 1992 and holds the post of Professor of Intelligent Systems Engineering within the Faculty of Engineering. He has a first class honours degree in physics and a doctorate from the University of Durham, is a Fellow of the IEE, a member of the IEEE and a Chartered Engineer. He has 25 years' experience in teaching and research in electronic engineering, leads the research activities of the Intelligent Systems Engineering Laboratory at the Magee campus of the University, and is Head of the School of Computing and Intelligent Systems. His current research interests relate to the creation of intelligent computational systems in general, particularly hardware and software implementations of neural networks, fuzzy systems, genetic algorithms, embedded intelligent systems utilising re-configurable logic devices and bio-inspired intelligent systems.
