
Computers and Electronics in Agriculture 39 (2003) 3-22
www.elsevier.com/locate/compag

Comparison of a Bayesian classifier with a multilayer feed-forward neural network using the example of plant/weed/soil discrimination
J.A. Marchant *, C.M. Onyango
Image Analysis and Control Group, Silsoe Research Institute, Wrest Park, Silsoe, Bedford MK45 4HS, UK Received 13 May 2002; received in revised form 4 November 2002; accepted 13 November 2002

Abstract

The feed-forward neural network has become popular as a classification method in agricultural engineering as well as in other applications. This is despite the fact that statistically based alternatives have been in existence for a considerable time. This paper compares a Bayesian classifier with a multilayer feed-forward neural network in a task from the area of discriminating plants, weeds, and soil in colour images. The principles behind and the practical implementation of Bayesian classifiers and neural networks are discussed, as are the advantages and problems of each. Experimental tests are conducted using the same set of training and test data for each classifier. Because the Bayesian classifier is optimal in the sense of total misclassification error, it should outperform the neural network. It is shown that this is generally the case. There are significant similarities in the performance of each classifier. Understanding why this should be the case gives insight into the operation of each classifier and so the paper explores this aspect. In this work, the Bayesian classifier is implemented as a look-up table. Thus any probability function can be represented and the decision surfaces can be of any shape, i.e. the classifier is not restricted to a linear form. On the other hand, it does require a relatively large amount of memory. However, memory requirement is no longer such a major issue in modern computing. Thus, it is concluded that if the number of features is small enough to require a feasible amount of storage, a Bayesian classifier is preferred over a feed-forward neural network. © 2002 Elsevier Science B.V. All rights reserved.
Keywords: Classification; Bayes; Neural networks; Image analysis; Machine vision; Weeds; Precision agriculture

* Corresponding author. Tel.: +44-1525-860000; fax: +44-1525-860156. E-mail address: john.marchant@bbsrc.ac.uk (J.A. Marchant).
0168-1699/02/$ - see front matter © 2002 Elsevier Science B.V. All rights reserved. doi:10.1016/S0168-1699(02)00223-5

1. Introduction

Classification is concerned with assigning an object to one of a number of possible classes based on measurements we can make on the object. An obvious application in agriculture is quality grading of produce, where the measurements could be size, shape, defect count and so on. A slightly different use of classification, where the object is not a physical entity, is addressed in this work, where the properties of each pixel in a digital image are used as a pre-processing step to classify parts of the image into regions.

Statistically based classification methods have been in existence for a long time. These are based on probabilities that a given set of measurements comes from objects belonging to a certain class. Often the probabilities are estimated by training, where examples of objects are presented to an algorithm, measurements are made, and the true class is indicated to the algorithm by a human classifier. After training, the classifier is able to operate automatically on measurements made on the object.

A more recent phenomenon is the artificial neural network. The term 'artificial' is present to distinguish it from a real (biological) neural network but is often dropped. An artificial network shares some of the physical and behavioural aspects of a biological one. In particular, it can be trained to classify objects from measurements and so can perform the same function as a statistical classifier. Although there are many types of artificial neural networks, we concentrate here on the most commonly used type, the multilayer feed-forward network. In agricultural engineering, just as in other branches of physical investigation, there has been a large interest in neural networks in recent times. This could be caused by a hope that, because there are similarities between artificial and biological networks, it may be possible to emulate human intelligence in an artificial system.
It could also be due to the easy availability of neural networks in packages such as Matlab, and possibly due to (dare we say it) fashion. Here we do not so much compare a statistical classifier with a neural network on the basis of performance. Rather, we examine how each works and draw out the similarities as well as the differences between the two. Thus, we form an impression of how the two types should compare, as well as comparing classification performance. In this work, we use the example of detecting weeds among crops (cauliflower) grown in a predefined planting pattern, i.e. a nominally square grid. The example is drawn from our work on what we have termed 'plant scale husbandry', which involves developing the sensing and mechanical technology to treat crops on a highly accurate basis. Examples of our past work can be found in Hague et al. (1997), Southall et al. (1998) and Tillett and Hague (1999). As we are primarily concerned with comparing classification methods, it has not been our aim here to solve the problem of plant/weed discrimination per se. In particular, we have only classified individual pixels, which could be seen as a pre-processing stage for a more complete analysis. This complete analysis could include grouping spatially close pixels into discrete regions, and size and shape analysis on those regions. Examples can be found in Brivot and Marchant (1996), Blasco et al. (1998), Slaughter et al. (1999), Tian et al. (1999) and Perez et al. (2000).

2. The Bayesian classifier

2.1. Principle

The fundamental principle of a Bayesian classifier is a combination of Bayes theorem and Bayes rule (James, 1985). The practical use of Bayes theorem is to turn probabilities that can be estimated from a training set into those required for classification. We wish to classify into i groups, G_i, by measuring features, f_1, f_2, ..., at each pixel. The conditional probability P(f_1, f_2, ... | G_i) can be estimated from training data (how to do this is covered below). This is the probability of finding that particular vector of features, given that the pixel comes from class i. In principle, we can estimate this by hand-classifying the training data into our three components, plants, weeds, and soil, and counting how many times particular combinations of features occur in each component. Bayes theorem allows the calculation of the posterior probability from the above conditional probability and the prior probability. It is stated formally as

P(G_i | f_1, f_2, ...) = \frac{P(f_1, f_2, ... | G_i)\, P(G_i)}{\sum_{\text{all } i} P(f_1, f_2, ... | G_i)\, P(G_i)}   (1)
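As a concrete illustration, Eq. (1) and the Bayes rule decision can be sketched in a few lines of Python; the conditional probabilities and priors below are purely illustrative, not values from this study.

```python
def posteriors(conditionals, priors):
    """Bayes theorem, Eq. (1): turn class-conditional probabilities
    P(f1, f2, ... | Gi) and priors P(Gi) into posteriors P(Gi | f1, f2, ...)."""
    numerators = [c * p for c, p in zip(conditionals, priors)]
    total = sum(numerators)  # the denominator of Eq. (1), common to all classes
    return [n / total for n in numerators]

def bayes_rule(conditionals, priors):
    """Bayes rule: assign to the class with the greatest posterior."""
    post = posteriors(conditionals, priors)
    return max(range(len(post)), key=post.__getitem__)

# Hypothetical conditionals for one pixel (plant, weed, soil), equal priors:
winner = bayes_rule([0.6, 0.3, 0.1], [1/3, 1/3, 1/3])
```

Because the denominator is common to all classes, comparing the numerators alone gives the same decision.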

The posterior probability, P(G_i | f_1, f_2, ...), is the one needed for classification. It is the probability of that pixel being component i after measuring the feature vector f_1, f_2, .... The prior probability, P(G_i), is the probability of the pixel being component i given no information about its feature values, i.e. before a measurement is made. Bayes rule does the classification: it assigns a pixel to the class having the greatest posterior probability. A useful property of the Bayesian classifier is that it is optimal in the sense that it minimises the expected misclassification rate (Ripley, 1996).

2.2. Practical implementation

It has been claimed that the Bayesian classifier as specified above is almost unusable (James, 1985) because it is difficult to collect enough data to estimate P(f_1, f_2, ... | G_i). While this may be true in, say, quality control applications, it is not the case in pixel classification as there are generally vast amounts of data. The classifier is often implemented by making sweeping assumptions about the conditional probability. For example, if P(f_1, f_2, ... | G_i) is assumed to be a multivariate normal distribution, only the parameters of the distributions (means and covariance matrices) need to be estimated and the classifier can be expressed in a compact form. Indeed, it is sometimes implied (erroneously) (Shahin et al., 2001)


that this is essential. If, in addition to the normality assumption, all class covariance matrices are assumed equal, the classifier becomes a hyperplane dividing the feature space (the linear discriminant), simplifying the rule even more. It is possible that these views of the Bayesian classifier were motivated by the limited ability of early computers, especially with regard to storage capacity. In our current application area, we have decided that the assumption of normality is unacceptably restrictive. As well as there being no reason to suppose that our distributions are normal, we know that there will be significant exceptions to this. For example, sometimes we use hue as a feature value to describe pixel colour. This quantity is represented as an angle (Wyszecki and Stiles, 1982) with red at 0°. Thus, as well as extending above 0°, statistical variation about red would extend below, being interpreted, e.g. as in the range 340-359°. Thus the hue of a red object could appear to have two distinct peaks in its probability distribution, one around say 10° and the other around 350°, and could not be represented by a normal distribution. A simple method of estimating the conditional probability is to divide the range of each feature into discrete levels, thus dividing the feature space into cells. On measuring a feature vector, the appropriate cell is located and the count in that cell is increased by 1. After training, the probability of finding a feature in a given cell is estimated as n_j/m, where n_j is the count in cell j and m the total count. Thus a multidimensional histogram is formed. A problem with this method is how to decide on the cell size. Too small a size means that there may be holes in the histogram. Too large means over-smoothing of the distribution. The traditional objection to this approach is the amount of storage space required. If there are n features, each divided into r levels, the storage requirement is r^n locations per class.
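The hue wrap-around described above is easy to demonstrate with Python's standard colorsys module; the two RGB triples are illustrative reds, one slightly orange and one slightly purple.

```python
import colorsys

# colorsys returns hue in [0, 1); multiply by 360 for degrees, red at 0 degrees
h_orange_red = colorsys.rgb_to_hsv(1.0, 0.10, 0.05)[0] * 360  # just above 0
h_purple_red = colorsys.rgb_to_hsv(1.0, 0.05, 0.10)[0] * 360  # just below 360
```

Both colours are perceptually red, yet one hue lies a few degrees above 0° and the other a few degrees below 360°, so a histogram of red hues is bimodal and cannot be captured by a single normal distribution.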
Below, we use four features with 21 levels and three classes, giving a requirement of 583 443 4-byte storage locations. With modern computers, this is almost trivial, although it must be said that the requirement increases rapidly with the number of features. The problem of holes in the histogram can be tackled by smoothing, e.g. assuming that the distribution is locally Gaussian (Ripley, 1996). This Parzen estimate requires parameters for the smoothing kernel. As these are usually derived in an ad hoc manner, these methods seem to offer little advantage over the simple histogram, where smoothing is achieved as a by-product of finite cell size. It is interesting to note that Ripley (1996) attributes to Specht (1990) the re-labelling of this type of kernel smoothing method, without any biological motivation, as a neural network ('probabilistic' or 'general regression'). In this work, we will use the histogram method. After training (forming the histogram), classification of new images consists of measuring the features at each pixel, deciding which cell the feature vector fits in, and retrieving the conditional probabilities for each class, plant, weed, and soil. After multiplying by the prior probabilities (Eq. (1)), the pixel is assigned to the class with the greatest posterior probability. As the denominator of Eq. (1) is the same for all classes, only the numerators need to be compared. Although speed of operation is not part of this paper, an initial inspection showed that most of the time was taken up in locating the histogram cell. This could easily be reduced by scaling such that the cell size was 1 unit in each direction. Location would then reduce to truncating the feature values (offsets caused by truncation would be avoided by loading the


histogram in the same way). Thus, classification is a table look-up followed by a simple comparison and is potentially very fast. Choosing the priors is often seen as a problem with Bayesian classification. The reason that this problem is not raised with a neural network is simply that there is no way of dealing with prior information. As will be demonstrated in Section 5.2, aspects of the way the network is trained have to be chosen with no formal rules and act in much the same way as adjusting the priors. It cannot be said, therefore, that a Bayesian classifier is inferior in any way to a neural network in this respect. A common way of estimating priors is by equating them to the proportions in the training set. However, there are problems with this approach either when the class populations are very unbalanced (see Section 5.1) or when the priors are variable.
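The look-up-table implementation described above can be sketched as follows; the class labels and feature values in the usage test are illustrative, and features are assumed pre-scaled so that one histogram cell is 1 unit wide.

```python
from collections import defaultdict

class HistogramClassifier:
    """Bayesian classifier as a multidimensional histogram (look-up table)."""

    def __init__(self, n_classes, levels):
        self.levels = levels
        self.counts = [defaultdict(int) for _ in range(n_classes)]
        self.totals = [0] * n_classes

    def _cell(self, features):
        # Features are assumed scaled so a cell is 1 unit in each
        # direction: locating the cell reduces to truncation.
        return tuple(min(int(f), self.levels - 1) for f in features)

    def train(self, features, true_class):
        self.counts[true_class][self._cell(features)] += 1
        self.totals[true_class] += 1

    def classify(self, features, priors):
        cell = self._cell(features)
        # Only the numerators of Eq. (1) need comparing.
        scores = [priors[i] * self.counts[i][cell] / max(self.totals[i], 1)
                  for i in range(len(self.totals))]
        return max(range(len(scores)), key=scores.__getitem__)
```

Here a defaultdict stands in for the dense r^n table of Section 2.2, trading a fixed-size array for storage proportional to the number of occupied cells; either representation supports the same truncation-based look-up.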

3. The feed-forward neural network

An excellent review of the use of feed-forward neural networks for classification (and of many aspects of classification in general) is given by Zhang (2000). However, Zhang does imply at various points that the restrictions imposed by Gaussian assumptions in a Bayesian classifier apply generally. As discussed above, this is not true with the histogram representation used in the present work. To quote from Ripley (1996), 'A great deal of hyperbole has been devoted to neural networks...'. Some of this hyperbole arises from the understandable excitement that we might be able to design better artificial systems by learning from nature. It is certainly true that humans and other animals can solve problems quickly without resorting to mathematics, probably by learning. These problems are often difficult or impossible for artificial systems. The biological motivation for feed-forward neural networks starts with McCulloch and Pitts (1943), whose model of a biological neuron is a weighted sum of several inputs followed by a binary threshold. Networks of artificial neurons based on this model can learn from training examples, and the similarity to natural neural systems (which are also organised as repeated simple units and can also learn) must be noted. The most popular method of training is back-propagation (Rumelhart and McClelland, 1986). This method requires all transfer functions in the network to be differentiable, and so a function softer than a simple threshold must be used. Networks of artificial neurons are constructed using an 'input layer', used to distribute the inputs to a number of 'hidden layers', the outputs of which are connected to an 'output layer'. The outputs of units are connected to the inputs of the next via 'connection weights'. The function at the node output of an artificial neuron, analogous to the McCulloch and Pitts threshold, is often called the 'activation function'.
However, even if this type of network can be related to biology, it must be remembered that it can also be represented as a function

y_k = f_k\left( a_k + \sum_{j \to k} w_{jk}\, f_j\left( a_j + \sum_{i \to j} w_{ij}\, x_i \right) \right)   (2)


that maps the network inputs x_i to the target training outputs y_k (Ripley, 1996). Eq. (2) is for a network with one hidden layer. w_ij are the weights that connect the input to the hidden layer, and the suffix i → j indicates that the sum is over all connections from input to hidden layer. a_j are the biases in the hidden layer nodes. f_j are the activation functions at the outputs of hidden layer nodes. Suffixes j and k are for connections between hidden and output layers. The idea can easily be extended to more hidden layers. Training methods, including back-propagation, are optimisation processes. Typically, vectors of input features will be presented to the network along with target values of the network outputs. Optimisation seeks to reduce some measure of error between actual and target outputs by adjusting the weights and biases. Like all optimisation methods (excepting a global search, which is generally not feasible), practical problems include getting stuck in local minima and knowing when the procedure has converged. There are no principled methods of choosing the network parameters (number of hidden layers, number of nodes in the hidden layer, and form of activation functions). This aspect is no different from choosing the parameters of any other approximating function, e.g. a spline, where we must choose its type, degree, and the number and position of the knots. If the network is too complex (say a large number of hidden layers or nodes in each layer), it may fit well to the actual training data but not generalise to other data. This is analogous to fitting, say, a 20th-order polynomial to 21 data points: the fit will be exact but it is unlikely to deal with new data. How to choose the degree of smoothing necessary to avoid this problem is (just like choosing cell size for the Bayesian classifier) not obvious.
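Eq. (2) can be transcribed directly into code; the sketch below assumes one hidden layer and sigmoid activations throughout, and the weights and biases used in any test are illustrative.

```python
import math

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def forward(x, w_in, a_hidden, w_out, a_out):
    """Eq. (2): w_in[j][i] are the input-to-hidden weights w_ij,
    a_hidden[j] the hidden biases a_j, w_out[k][j] the hidden-to-output
    weights w_jk, and a_out[k] the output biases a_k."""
    hidden = [sigmoid(a_hidden[j] +
                      sum(w_in[j][i] * x[i] for i in range(len(x))))
              for j in range(len(a_hidden))]
    return [sigmoid(a_out[k] +
                    sum(w_out[k][j] * hidden[j] for j in range(len(hidden))))
            for k in range(len(a_out))]
```

With sigmoid output activations, every y_k lies strictly between 0 and 1, which is why target outputs are usually coded in that range.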
Because a neural network is represented by parameters, its requirement for memory is much smaller than the tabular Bayesian classifier outlined in Section 2. For example, the network used below can be stored in less than 30 kilobytes which is about 1.3% of the storage of the above Bayesian classifier. However, computer memory is cheap and extensive, and comparisons on the basis of memory requirement may not be relevant.

4. Experimental data and choice of features

Seven images were collected from an experimental plot containing cauliflower and some weeds. The images were captured with a clear view of about 1.8 metre square at a resolution of 1700 × 1700 pixels. A section of each image was divided into tiles (3 × 3 tiles), forming nine sub-images from each main image, each one having 512 × 512 pixels. An example image to be divided into nine tiles is shown in Fig. 1 (left) with two example tiles in Fig. 1 (middle and right). Thus there were 63 images used with the classifier, 31 used for training and 32 for testing. Images were allocated to either the training set or the test set by using a random number generator within the computer program. Ground truth was obtained by laboriously drawing round each plant and weed in each image using the Adobe Photoshop package. This process is quite difficult (and extremely tedious) and so we cannot claim that the ground truth


Fig. 1. Left, example of original (large) image. The white circles are the centres of the fitted grid points. Middle and right, two sub-images: the middle is from the left image, the right from another.

is perfect. However, the same data were used for both Bayesian and neural network classifiers and so the comparisons should be valid. The features used were the colour and position of each pixel. In order to normalise the colour with respect to possible changes in light intensity, we used the chromaticities (Wyszecki and Stiles, 1982)

r = \frac{R}{R+G+B}, \quad g = \frac{G}{R+G+B}, \quad b = \frac{B}{R+G+B}   (3)

where R, G, and B are the red, green, and blue pixel values. At first sight, there would seem to be no advantage in using all three chromaticities, as they sum to 1.0 and therefore one value is redundant. If the Bayesian classifier could use these values exactly, there would indeed be no advantage. However, as the classifier uses the values categorised into discrete cells, they are not completely dependent: they may well not sum exactly to 1.0. Thus using all three chromaticities could give an advantage and, in fact, this was observed (though only to a small extent using 21 cells for each feature) in preliminary experiments. The neural network uses continuous values of the chromaticities, so no advantage should be gained; this was also demonstrated. Thus, with the neural network we used just the red and green chromaticities. The plants in our images are established by transplanting into a nominally uniform grid pattern, as is usual in horticultural practice. However, due to planting errors and the natural variability of plant growth, the grid will not be precise. Also, it is possible that some plants will be missing, as is the case in Fig. 1 at the bottom right. Southall et al. (1998) have shown that the grid can be tracked in an automatic system using an extended Kalman filter. The filter estimates the row spacing (horizontal in the image) and the plant spacing along the rows (vertical in the image). The estimated grid position makes it possible to use the distance from the grid points as an extra feature: pixels close to the grid points are more likely to be plants than weeds or soil. Here we do not have the benefit of a continuous sequence of images required by the Kalman filter. It is not too important for this comparative study that we obtain the same result as would be obtained with the Kalman filter. However, we



have tried to make the result similar, in that we fit a grid to the images where the row and plant spacing are the parameters to be estimated. In our case, we locate the plant centres by hand. The model for the grid is

x_c = x_0 + n r, \quad y_c = y_0 + n l,   (4)

where x_c and y_c are the hand-located plant positions and n is an integer, 0, 1, or 2, depending on whether the plant is in the left, middle, or right row (for x) or top, middle, or bottom column (for y). The parameters to be estimated are x_0 and y_0, the grid offset, and r and l, the spacing in each direction. As there are more measurements than unknowns, the parameters can be estimated by least-squares. Fig. 1 (left) shows the grid points fitted to this image. Although we restrict ourselves to pixel classification, we found that very simple pixel grouping techniques improved classification somewhat. The poorer performance without spatial grouping was due to small holes in the continuous areas of the components. Thus, we applied a 5 × 5 median filter (Gonzales and Woods, 1992) to the classified image from both Bayesian and neural network classifiers before calculating classification rates.
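The least-squares fit of Eq. (4) and the distance feature d can be sketched as follows; one direction is shown, since x and y are fitted independently, and the hand-located coordinates in the test are hypothetical.

```python
def fit_spacing(n_vals, coords):
    """Least-squares fit of Eq. (4) in one direction:
    coords = offset + n * spacing, where n is the row (or column)
    index of each hand-located plant centre."""
    m = len(n_vals)
    sn, sc = sum(n_vals), sum(coords)
    snn = sum(n * n for n in n_vals)
    snc = sum(n * c for n, c in zip(n_vals, coords))
    spacing = (m * snc - sn * sc) / (m * snn - sn * sn)
    offset = (sc - spacing * sn) / m
    return offset, spacing

def distance_to_grid(px, py, grid_points):
    """The extra feature d: distance from a pixel to the nearest grid point."""
    return min(((px - gx) ** 2 + (py - gy) ** 2) ** 0.5
               for gx, gy in grid_points)
```

Fitting both directions gives the offsets (x_0, y_0) and spacings (r, l), from which the nine grid points of a sub-image follow and d can be computed for every pixel.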

5. Results

5.1. Bayesian classifier

A number of comparison runs were carried out where conditions were varied around a standard set. In the standard set, the features were the chromaticities r, g, and b, and the distance to the nearest grid point d; each feature was divided into 21 discrete levels; the prior probabilities were assumed equal. The results are expressed as total misclassification rate (E) along with a confusion matrix where extra detail needs to be added (Ripley, 1996). E can be estimated as R/m (Ripley, 1996), where R is the total number of misclassifications and m the total number of feature vectors classified. It is in the sense of minimum E that the Bayesian classifier is optimal, but it must be remembered that R/m is an estimate of the error rate and so is itself subject to statistical errors. The confusion matrix adds more detail on how components are misclassified to the performance measure, and an element e_ij is formally defined as e_ij = P(decision j | class i) (Ripley, 1996). Total misclassification rates and confusion matrices are always calculated and shown for the whole of the test set in this work, even if particular images are used for illustration. Table 1 shows the effect of using different numbers of discrete feature levels. As discussed in Section 2.2, too small a number leads to over-smoothing, which should give a poorer performance. This is seen in Table 1, where E increases for small numbers of levels. An increase can also be seen as we change the number of levels towards 91. For those numbers, the histogram is not smooth enough: the classifier is responding to the particulars of the data in the training set rather than conforming to the general trends, i.e. its ability to generalise is worsening. As there seems little to be gained with more than 21 levels, and the storage requirement increases by a

Table 1
Effect of dividing the features into different numbers of levels

Levels                     7       11      21      51      91
Total misclassification    0.077   0.074   0.069   0.067   0.113


Features r , g , b , and d with equal priors.

factor of about 35 times on changing to 51 levels, it was decided to fix the number of levels at 21. Table 2 shows the effect of changing the values of the prior probabilities, either by estimating them from the training set (left), which gives 0.355, 0.020, and 0.624 for the plant, weed, and soil priors, respectively, or assuming they are equal (middle). Fig. 2 (left and middle) shows classification results for the same conditions for the particular image of Fig. 1 (middle). We note that the priors are very different from being equal. In particular, the weed prior is very small. As the Bayesian classifier is optimal, if the priors are known, we would expect that E would be smallest for this condition. In fact, E for the equal prior case is slightly smaller. Here we recall that E is an estimate of the total misclassification rate and is subject to statistical variation. The standard error on E is (James, 1985)

SE(E) = \sqrt{\frac{E(1-E)}{m}},   (5)

where m is the total number of samples (i.e. pixels) in the test set. As m is very large, the standard error is very small (about 0.0001), which gives the impression that E for equal priors is less, to a statistically significant level, than E for estimated priors. However, this impression is erroneous as the data are not independent samples: pixels in images can be very correlated. In addition, as stated by Ripley (1996), it is important not to read too much into statistical significance when differences are actually small. In fact, if 11 levels are used instead of 21, E for estimated priors is smaller. Here we take the pragmatic view that the total misclassification rate, E, is very insensitive to the choice of priors. In fact, E is an extremely blunt instrument with which to assess performance. Greater insight is obtained from the confusion matrices (Table 2). Here we see that many of the weed pixels have been classified as plant.
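The performance measures used here, E = R/m with its standard error from Eq. (5) and the confusion matrix e_ij = P(decision j | class i), can be sketched as follows; the label sequences in the test are illustrative.

```python
def error_rate_and_se(true_labels, predicted):
    """Total misclassification rate E = R/m and its standard error, Eq. (5)."""
    m = len(true_labels)
    E = sum(t != p for t, p in zip(true_labels, predicted)) / m
    return E, (E * (1.0 - E) / m) ** 0.5

def confusion_matrix(true_labels, predicted, n_classes):
    """Rows are true classes i, columns decisions j;
    each entry estimates e_ij = P(decision j | class i)."""
    counts = [[0] * n_classes for _ in range(n_classes)]
    for t, p in zip(true_labels, predicted):
        counts[t][p] += 1
    return [[c / max(sum(row), 1) for c in row] for row in counts]
```

Note that the standard error formula assumes independent samples; as discussed above, neighbouring pixels are correlated, so it understates the true uncertainty here.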
Estimating the priors gives better performance with the largest component (soil), about the same with the next largest (plant), and very poor performance with the smallest (weed). The principle is that because the weed component is very small, its classification has been sacrificed for the greater good of the larger components. While this may make sense mathematically, a human assessor may well claim that the classification of Fig. 2 (middle) is better than that to the left. In doing this, he is implicitly claiming that


Table 2
Confusion matrices and total misclassification rates for the whole test set

                    Classed as
Actual              Plant   Weed   Soil   Total misclassification
Left:
  Plant             0.96    0.00   0.04
  Weed              0.44    0.40   0.16   0.071
  Soil              0.06    0.00   0.94
Middle:
  Plant             0.96    0.01   0.03
  Weed              0.18    0.82   0.00   0.069
  Soil              0.07    0.01   0.92
Right:
  Plant             0.91    0.06   0.03
  Weed              0.09    0.91   0.00   0.084
  Soil              0.06    0.02   0.92

Left, using features r, g, b, and d with priors estimated from training set. Middle, same features with equal priors. Right, using features r, g, and b with equal priors.


Fig. 2. Examples of Bayesian classifier performance where grey is plant, white is weed, and black is soil. The original image is shown in Fig. 1 (middle). Left, using features r, g, b, and d with priors estimated from training set. Middle, same features with equal priors. Right, using features r, g, and b with equal priors.



misclassifying a weed pixel is a more severe mistake than misclassifying a soil or a plant pixel. The relationship of the separate costs of misclassification to the human perception of a good result can be seen as follows. Bayes rule consists of assigning to the class where P(f_1, f_2, ... | G_i) P(G_i) is highest. If c_ik is the cost of classifying a pixel whose true class is i to class k, then the classification rule for least cost consists of assigning to the class k for which

\sum_{\text{all } i \neq k} P(f_1, f_2, ... | G_i)\, P(G_i)\, c_{ik}

is lowest. If c_ik is inversely proportional to P(G_i), then this reduces to the equal priors case of Bayes rule. Hence, using equal priors implies attaching more significance to misclassifying classes with low populations; in fact, misclassifying a given proportion of weeds is seen as having the same significance as misclassifying the same proportion of, say, soil, which sounds like a reasonable rule of thumb. A further justification for using equal priors is that the proportions of plants and weeds may well change between test images, hence the priors are not really known for any individual test. The classification error can be thought of as a variable that changes as the assumed priors change. A criterion known as 'minimax' can be used, where it is assumed that the priors are such that they make the performance as bad as possible. The criterion minimises the maximum possible error. Assuming equal priors is equivalent to using the minimax rule (James, 1985; van Trees, 1968). From here on, we use equal priors. Fig. 2 (right) shows the effect of using only the colour features, i.e. not using the distance feature. The most apparent error is that the centre parts of the plant are misclassified as weed. This may be because the illumination here is more subject to inter-reflections, as the plant leaves are more vertical. This would make the leaf surfaces appear greener and possibly increase confusion with weeds.
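The least-cost rule above can be sketched as follows; the conditionals, priors, and cost matrix are illustrative. With costs inversely proportional to the priors, the rule picks the same class as Bayes rule with equal priors.

```python
def least_cost_class(conditionals, priors, costs):
    """Assign to the class k minimising the expected cost
    sum over i != k of P(f | Gi) * P(Gi) * c_ik."""
    n = len(priors)
    expected = [sum(conditionals[i] * priors[i] * costs[i][k]
                    for i in range(n) if i != k)
                for k in range(n)]
    return min(range(n), key=expected.__getitem__)

# Unequal priors, with the cost of misclassifying class i set to 1 / P(Gi),
# so that rare classes (here weed) carry proportionally higher costs:
priors = [0.355, 0.020, 0.624]                 # plant, weed, soil
costs = [[1.0 / p] * 3 for p in priors]
chosen = least_cost_class([0.2, 0.5, 0.3], priors, costs)
```

With c_ik = 1/P(G_i), each expected cost reduces to the sum of the other classes' conditionals, so minimising it is the same as maximising P(f | G_k), i.e. Bayes rule with equal priors.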
Without d , the greater number of plant pixels being classified as weed is reflected in an increased value of E to 0.084. However, using the distance feature, d , does not improve all aspects of classification. In particular, using d actually causes a drop in performance in classifying weeds where more of them are classified as plants. This is not so apparent in the particular results of Fig. 2 (middle and right) but some misclassifications of weeds as plants can be seen where weed pixels are relatively close to the plant centre. The effect can be seen for the whole of the test set in the confusion matrices of Table 2. Fig. 3 shows another example where using d is not completely beneficial. These classifications are shown for the images in Fig. 1 (right). Although the centre of the plant in the left image of Fig. 3 is better classified than that in the middle, some of the pixels away from the centre are wrongly classified as weed. This is caused by two factors. (1) The plant is significantly larger than average and some of its pixels occupy the area normally occupied by weeds. (2) The plants are not in completely regular positions and the grid points, which are the origins for measuring d , cannot lie exactly at plant centres, in fact the grid point in Fig. 3 is 125 pixels above the plant centre. Table 3 and Fig. 3 (right) show better performance if the actual plant centres are temporarily substituted for grid positions in training and testing. Comparing Table 3 with Table 2 (middle and right), it can be seen that using actual plant centres


Fig. 3. Example of varying performance when using feature d. The original image is shown in Fig. 1 (right). Left, using features r, g, b, and d measured from grid points. Middle, using features r, g, and b. Right, using features r, g, b, and d measured from plant centres.



Table 3
Confusion matrix and total misclassification rate for the whole test set when using plant centres instead of grid points

                Classed as
  Actual        Plant   Weed   Soil
  Plant         0.97    0.01   0.02
  Weed          0.06    0.93   0.01
  Soil          0.06    0.02   0.92

Total misclassification: 0.064

generally outperforms the other feature combinations. Hence, using d gives an advantage, but poor grid positioning or inaccurate planting can reduce it.
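The distance feature d itself is simple to compute. The grid coordinates and spacing below are invented, and `distance_feature` is a hypothetical helper, not the paper's implementation:

```python
import numpy as np

def distance_feature(pixel_rc, grid_points):
    """Distance d from a pixel to the nearest grid point, the origin
    assumed to approximate the nearest plant centre."""
    diffs = np.asarray(grid_points, float) - np.asarray(pixel_rc, float)
    return float(np.hypot(diffs[:, 0], diffs[:, 1]).min())

# Hypothetical 3 x 3 planting grid with 100-pixel spacing.
grid = [(r, c) for r in (50, 150, 250) for c in (50, 150, 250)]
print(distance_feature((60, 45), grid))  # ~11.2 pixels from (50, 50)
```

If actual plant centres are substituted for the grid points, as in Table 3, only the list of origins changes; the feature computation is identical.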

5.2. Neural network

As mentioned in Section 3, there is no principled way of designing a neural network. As a three-layer network can implement a decision surface of arbitrary complexity (Gonzalez and Woods, 1992), we fixed on three layers. The number of input nodes was fixed to equal the number of features, i.e. 3 (features r, g, and d) or 2 (r and g). The number of output nodes was fixed as the number of classes, i.e. 3. In an example problem, Gonzalez and Woods (1992) suggest using the average of the number of input and output nodes as the number of nodes in the hidden layer. We follow this example and use three nodes. The form of the activation function is another choice faced by the designer; once again we follow Gonzalez and Woods (1992) and use a sigmoid function for the whole network. We used the multilayer feed-forward network available in the Matlab Neural Network toolbox (Demuth and Beale, 2000) with the logsig activation function and the trainrp training method. This method is a variation on the back-propagation algorithm (Ripley, 1996) where only the sign of the derivative is used to determine the direction of the update. This avoids problems of slow convergence when using sigmoid activation functions, which can have very small derivatives in the latter part of training. It is recommended as being suitable for classification problems (Demuth and Beale, 2000). Inputs were scaled to fill reasonably well the range 0.0-1.0. An initial problem was that training with the full training set used an extremely large amount of memory and could not be done with the PC available (256 MB). As we have a very large number of training pixels, we trained with a random sample of 1/20th of the full set. Even with the sample, the training used all the available memory plus about half as much again of swap space.
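The sign-only update at the heart of trainrp can be illustrated in isolation. The sketch below follows the usual resilient back-propagation (Rprop) scheme; the step limits and scale factors are common defaults, assumed here rather than taken from the toolbox:

```python
import numpy as np

def rprop_step(w, grad, prev_grad, step,
               eta_plus=1.2, eta_minus=0.5, step_max=50.0, step_min=1e-6):
    """One resilient back-propagation (Rprop) update. The per-weight step
    grows while the gradient keeps its sign and shrinks when it flips;
    only the sign of the gradient, never its magnitude, moves the weight."""
    same = grad * prev_grad
    step = np.where(same > 0, np.minimum(step * eta_plus, step_max), step)
    step = np.where(same < 0, np.maximum(step * eta_minus, step_min), step)
    return w - np.sign(grad) * step, step

w = np.array([0.5, -0.3])
step = np.full(2, 0.1)
grad = np.array([0.02, -0.4])   # one tiny and one large derivative...
prev = np.array([0.01, -0.2])   # ...both with unchanged sign
w, step = rprop_step(w, grad, prev, step)
print(w, step)  # both weights move by the same 0.12 step despite
                # very different gradient magnitudes
```

This is why the method sidesteps the vanishing-derivative problem of sigmoid units described above: a near-zero gradient still produces a full-sized step as long as its sign is stable.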
In order to check that the results would still compare with those using the Bayesian classifier, we re-ran the Bayesian classifier with the same (1/20th) training data. The results were very close to those with all the data and are reproduced in Table 4 for comparison with Table 2 (middle). We concluded that using 1/20th of the data when training made



Table 4
Confusion matrix and total misclassification rate for the Bayesian classifier using 1/20th of the training data

                Classed as
  Actual        Plant   Weed   Soil
  Plant         0.96    0.01   0.03
  Weed          0.19    0.81   0.00
  Soil          0.07    0.01   0.92

Total misclassification: 0.068

little difference, and so results from the neural network could be compared with those from the Bayesian classifier. The network was trained by presenting the input with feature vectors and setting the relevant output node (plant, weed, or soil, depending on the training value) to 1.0, with the other two at 0.0. We used 2000 training epochs, which took approximately 10 h for three features (5 h for two) on a 400 MHz Pentium II processor. These timings are significantly influenced by using swap space instead of memory; we estimate a halving of training times if enough memory were available. Three training cases were investigated. (1) The features r, g, and d with training samples in proportion to the number of pixels in each class in the training set. (2) The same features with equal numbers of training samples from each class. (3) Features r and g with equal numbers of training samples from each class. During training, the mean-squared error at the output nodes for case 1 fell from 0.4876 on initialisation to 0.0382 when stopped. Most of the minimisation was complete by 1000 epochs, when the mean-squared error was 0.0387. We therefore concluded that the training had converged. The corresponding figures for case 2 were 0.4806 and 0.0749 with a halfway value of 0.0776, and for case 3, 0.4704 and 0.0970 with a halfway value of 0.0985. The results (example shown in Fig. 4 with performance figures in Table 5) are very similar to those for the Bayesian classifier (Fig. 2 and Table 2). Misclassifications are of a similar amount and are made in much the same way. The discussion in Section 1 concerning optimality of the Bayesian classifier leads us to believe that the neural network should never outperform it in terms of total misclassification rate, E. However, this expectation must be tempered by the fact that E is an estimate, as discussed in Section 5.1. It is interesting to investigate the reasons for the similarity in performance.
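The 3-3-3 logsig network and the one-hot target convention can be written out directly. This is a minimal sketch with random weights, purely to show the shape of the mapping and the target encoding, not a reimplementation of the toolbox training:

```python
import numpy as np

rng = np.random.default_rng(0)

def logsig(x):
    # Sigmoid activation, used for the whole network as described above.
    return 1.0 / (1.0 + np.exp(-x))

# 3 input features (r, g, d), 3 hidden nodes, 3 output nodes (plant,
# weed, soil), matching the topology above; the weights here are random,
# standing in for whatever training would produce.
W1, b1 = rng.normal(size=(3, 3)), np.zeros(3)
W2, b2 = rng.normal(size=(3, 3)), np.zeros(3)

def forward(features):
    hidden = logsig(features @ W1 + b1)
    return logsig(hidden @ W2 + b2)

# One-hot training target for a weed pixel: the weed output node is set
# to 1.0 and the other two to 0.0; training minimises the squared error
# between this target and the network output.
target = np.array([0.0, 1.0, 0.0])
out = forward(np.array([0.3, 0.5, 0.2]))
print(out.shape, float(np.mean((out - target) ** 2)))
```

At classification time, the pixel is assigned to whichever of the three output nodes is largest, as discussed below.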
Despite much discussion in the literature on biological motivation, we must remember that a feed-forward neural network is merely a mathematical function that maps an input vector to a number of outputs (see Section 3). In order to compare the training of a Bayesian classifier with that of a feed-forward neural network, consider the following simplified argument. For illustration, we consider that the input vector is one-dimensional and concentrate on one class, say weed. Imagine a string of training values of the single feature x. The corresponding value of the classifier target output

Fig. 4. Examples of neural network performance where grey is plant, white is weed, and black is soil. The original image is shown in Fig. 1 (middle). Left, using features r, g, and d with training samples in proportion to the number of pixels in each class in the training set. Middle, same features with equal number of training samples from each class. Right, using features r and g with equal number of training samples from each class.


Table 5
Confusion matrices and total misclassification rates for the whole test set

Left: features r, g, and d with training samples in proportion to the numbers of each class in the training set
                Classed as
  Actual        Plant   Weed   Soil
  Plant         0.97    0.00   0.03
  Weed          0.33    0.53   0.14
  Soil          0.06    0.00   0.94
Total misclassification: 0.067

Middle: same features with equal numbers of training samples from each class
                Classed as
  Actual        Plant   Weed   Soil
  Plant         0.94    0.03   0.03
  Weed          0.12    0.88   0.00
  Soil          0.05    0.03   0.92
Total misclassification: 0.074

Right: features r and g with equal numbers of training samples from each class
                Classed as
  Actual        Plant   Weed   Soil
  Plant         0.90    0.07   0.03
  Weed          0.05    0.95   0.00
  Soil          0.05    0.03   0.92
Total misclassification: 0.088




Fig. 5. Left, target input/output relationship of a neural network and smoothed version. Right, conditional probability histogram for the same training data.

is 1 if the actual class is weed and 0 if not. Fig. 5 (left) plots the target output of the neural network against the input (x), where 16 of the training values are weed. This is the function that the neural network is trying to approximate at the weed output node. The raw output is very spiky as it can only take the values 0 or 1. Network training fits a function to the input/output relationship. The greater the network complexity, the more accurately the function will fit the data. An appropriate complexity is one where the data are followed reasonably well, yet there is sufficient smoothing to allow generalisation. Purely for illustration, the training data have been smoothed with a simple rolling average. The smoothed relationship, also shown in Fig. 5 (left), represents in qualitative terms the type of input/output relationship that might be derived by training the network. Fig. 5 (right) shows the histogram for the weed component after training the Bayesian classifier with the same data. The feature space has been divided into 10 cells, six of which contain values. The histogram is an estimate of the conditional probability P(weed | x). From this illustration, it becomes obvious that the histogram on the right is a smoothed version of the raw data on the left, the smoothing obtained by collecting the data into histogram bins. It should therefore be no surprise that the illustrative network input/output relationship (the smoothed data in Fig. 5, left) is very similar in shape (not necessarily in magnitude) to the histogram. If the neural network is trained with an equal number of examples from each class then, assuming the spread of feature values within each class is similar, the magnitudes of the input/output relationship will be similar for each class. The magnitudes of the histograms for each class will also be similar, as the area under each one must sum to 1.0. Now consider how each classifier is used.
In the case of the neural network, the outputs of the network for each class are compared and the classification is assigned to the largest. In the case of the Bayesian classifier, the conditional probability for each class is multiplied by the relevant prior to form a value proportional to the posterior probability, and the classification is assigned to the largest. If the priors are equal, this amounts to classifying on the largest conditional probability, i.e. the histogram. Thus, as the magnitudes of the input/output relationships are similar for each class



and so are the histograms, the operation of both classifiers should be very similar. If the priors are not equal, then the histograms are weighted in proportion to the priors before classifying. In a similar way, if the neural network is trained with class sample sizes in proportion to the populations in the training set, the density of the data in Fig. 5 (left) is greater for the more populous classes and hence the input/output relationship is of higher magnitude. Thus, once again, the relative magnitudes of the input/output relationships are in the same proportion as those of the histograms. Hence operation of the Bayesian classifier with priors determined from the training set should be very similar to operation of the neural network with training samples in proportion to class populations.
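The correspondence argued above between the raw 0/1 targets and the histogram estimate of P(weed | x) is easy to reproduce. The synthetic one-dimensional feature values below are invented, with weed pixels clustered at higher x:

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic single feature: weed pixels cluster around 0.6, non-weed
# pixels around 0.3; the target is 1 for weed and 0 otherwise.
x = np.concatenate([rng.normal(0.6, 0.08, 200), rng.normal(0.3, 0.08, 200)])
t = np.concatenate([np.ones(200), np.zeros(200)])

# Histogram estimate of P(weed | x): the fraction of training samples in
# each of 10 cells that are weed. Binning is the Bayesian classifier's
# form of smoothing, analogous to the rolling average in Fig. 5 (left).
bins = np.linspace(x.min(), x.max(), 11)
idx = np.clip(np.digitize(x, bins) - 1, 0, 9)
p_weed = np.array([t[idx == k].mean() if np.any(idx == k) else 0.0
                   for k in range(10)])

# The fraction rises with x, just as the smoothed network target would.
print(p_weed.round(2))
```

The resulting array rises from near 0 in the low-x cells to near 1 in the high-x cells, which is exactly the shape a smoothed network input/output relationship would take for the weed node.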

6. Conclusions

As the Bayesian classifier is optimal in terms of total misclassification error, there should be no advantage in using a neural network for a similar classification problem. We have demonstrated this in an example problem, classifying images into plant, weed, and soil components using colour and position features. The neural network can approach the performance of the Bayesian classifier and in particular cases may slightly outperform it. This latter characteristic arises because any measured total misclassification error is a statistical estimate and is itself subject to uncertainty. We have shown, using an illustrative example, why the performance of a Bayesian classifier should be very similar to that of a feed-forward neural network. Both classifiers approximate very similar input/output functions, the Bayesian by building up a histogram, and the neural network by function fitting using an optimisation algorithm. The neural network has a much lower storage requirement as it represents the fitted function in an analytical form whose parameters are the weights, biases, and network topology. The neural network smoothes the input/output relationship by fitting a function, whereas the Bayesian classifier does so by collecting data into a finite number of cells. Both classifiers can trade storage against smoothing, but there is no way of determining which handles the trade-off better. It must be remembered that memory density has increased dramatically and continues to do so, which makes a Bayesian classifier more realisable with the passage of time. However, the storage requirement increases rapidly with the number of features, and more than five or so features may make the Bayesian classifier infeasible. We did not carry out tests to determine relative speeds of operation; it is always difficult to conduct such tests as the competing methods may not be designed optimally.
However, the Bayesian classifier can easily be implemented as a simple table look-up, and proper scaling of the features makes the indexing trivial. Therefore, it is intrinsically fast. The neural network requires a number of multiply and add operations, which will make it comparatively slow. Training of a neural network is generally lengthy and can suffer from a number of problems. Like all practical optimisation procedures, there is no guarantee that the



solution will converge to a global minimum. Also, there is no principled way to design a neural network and no way of knowing whether the design is near-optimal (or even good), or whether an improvement gained by altering the design will extend to new data. On the other hand, training of a Bayesian classifier is trivial and fast. Although the cell size of the histogram does have to be chosen in an ad hoc way and is subject to some of the same criticism, there are far fewer tuning parameters in a Bayesian classifier than in a neural network. Our general conclusion is that, if the number of features is small enough to require a feasible amount of storage, a Bayesian classifier is preferred over a feed-forward neural network.
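A minimal sketch of the look-up-table implementation described in the conclusions, assuming two features already scaled to the range 0-1 and an arbitrary bin count; `build_table` and `classify` are hypothetical helpers, not the paper's code:

```python
import numpy as np

N_BINS = 8  # cells per feature axis; an assumed value, not the paper's

def build_table(features, labels, n_classes):
    """Train a Bayesian look-up-table classifier: count class occurrences
    per cell of the quantised feature space, normalise by class totals to
    estimate P(cell | class), and store the winning class per cell
    (equal priors assumed, so the largest conditional estimate wins)."""
    idx = np.clip((features * N_BINS).astype(int), 0, N_BINS - 1)
    counts = np.zeros((N_BINS, N_BINS, n_classes))
    for (i, j), c in zip(idx, labels):
        counts[i, j, c] += 1
    cond = counts / np.maximum(counts.sum(axis=(0, 1)), 1)
    return cond.argmax(axis=-1)

def classify(table, features):
    # Classification is a single table indexing -- intrinsically fast.
    idx = np.clip((features * N_BINS).astype(int), 0, N_BINS - 1)
    return table[idx[:, 0], idx[:, 1]]

# Toy two-class problem: class 1 whenever the first feature exceeds 0.5.
rng = np.random.default_rng(2)
f = rng.random((1000, 2))
y = (f[:, 0] > 0.5).astype(int)
table = build_table(f, y, n_classes=2)
print(classify(table, np.array([[0.9, 0.5], [0.1, 0.5]])))
```

With more features the table simply gains dimensions, which is exactly why the storage requirement grows so rapidly with the number of features.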

Acknowledgements This work was funded by the Biotechnology and Biological Sciences Research Council.

References
Blasco, J., Benlloch, J.V., Agusti, M., Molto, E., 1998. Machine vision for precise control of weeds. Proc. SPIE 3543, 336-343.
Brivot, R., Marchant, J.A., 1996. Segmentation of plants and weeds using infrared images. IEE Proc. Vision, Image Signal Process. 143 (2), 118-124.
Demuth, H., Beale, M., 2000. Neural Network Toolbox: User's Guide. MathWorks, Inc., Natick, MA.
Gonzalez, R.C., Woods, R.E., 1992. Digital Image Processing. Addison-Wesley, Reading, MA.
Hague, T., Marchant, J.A., Tillett, N.D., 1997. A system for plant scale husbandry. In: Proceedings of the First European Conference on Precision Agriculture, September 7-10, Warwick, UK, pp. 635-642.
James, M., 1985. Classification Algorithms. Collins, London.
McCulloch, W.S., Pitts, W., 1943. A logical calculus of the ideas immanent in nervous activity. Bull. Math. Biophys. 5, 115-133.
Perez, A.J., Lopez, F., Benlloch, J.V., Christensen, S., 2000. Colour and shape analysis techniques for weed detection in cereal fields. Comput. Electron. Agric. 25 (3), 197-212.
Ripley, B.D., 1996. Pattern Recognition and Neural Networks. Cambridge University Press, Cambridge, UK.
Rumelhart, D.E., McClelland, J.L., 1986. Parallel Distributed Processing: Explorations in the Microstructure of Cognition, vol. 1. MIT Press, Cambridge, MA.
Shahin, M.A., Tollner, E.W., McClendon, R.W., 2001. Artificial intelligence classifiers for sorting apples based on watercore. J. Agric. Eng. Res. 79 (3), 265-274.
Slaughter, D.C., Chen, P., Curley, R.G., 1999. Vision guided precision cultivation. Precision Agric. 1 (2), 199-216.
Southall, B., Marchant, J.A., Hague, T., Buxton, B.F., 1998. Model based tracking for navigation and segmentation. In: Proceedings of the Fifth European Conference on Computer Vision, June 2-6, Freiburg, pp. 797-811.
Specht, D.F., 1990. Probabilistic neural networks. Neural Netw. 3, 109-118.
Tian, L., Reid, J.F., Hummel, J.W., 1999. Development of a precision sprayer for site-specific weed management. Trans. Am. Soc. Agric. Eng. 42 (4), 893-900.
Tillett, N.D., Hague, T., 1999. Computer-vision-based hoe guidance for cereals: an initial trial. J. Agric. Eng. Res. 74 (3), 225-236.
van Trees, H.L., 1968. Detection, Estimation, and Modulation Theory. Wiley, New York.
