Meteorol Atmos Phys 87, 167–196 (2004)
DOI 10.1007/s00703-003-0070-7

Air quality model performance evaluation

J. C. Chang and S. R. Hanna

School of Computational Sciences, George Mason University, Fairfax, VA, USA

Received April 16, 2003; accepted September 22, 2003
Published online June 2, 2004; © Springer-Verlag 2004

Summary

This paper reviews methods to evaluate the performance of air quality models, which are tools that predict the fate of gases and aerosols upon their release into the atmosphere. Because of the large economic, public health, and environmental impacts often associated with the use of air quality model results, it is important that these models be properly evaluated.

A comprehensive model evaluation methodology makes use of scientific assessments of the model technical algorithms, statistical evaluations using field or laboratory data, and operational assessments by users in real-world applications. The focus of the current paper is on the statistical evaluation component. It is important that a statistical model evaluation exercise should start with clear definitions of the evaluation objectives and specification of hypotheses to be tested. A review is given of a set of model evaluation methodologies, including the BOOT and the ASTM evaluation software, Taylor's nomogram, the figure of merit in space, and the CDF approach. Because there is not a single best performance measure or best evaluation methodology, it is recommended that a suite of different performance measures be applied. Suggestions are given concerning the magnitudes of the performance measures expected of "good" models. For example, a good model should have a relative mean bias less than about 30% and a relative scatter less than about a factor of two.

In order to demonstrate some of the air quality model evaluation methodologies, two simple baseline urban dispersion models are evaluated using the Salt Lake City Urban 2000 field data. The importance of assumptions concerning details such as minimum concentration and pairing of data is shown. Typical plots and tables are presented, including determinations of whether the difference in the relative mean bias between the two models is statistically significant at the 95% confidence level.

1. Introduction and background review

Air quality models are powerful tools to predict the fate of pollutant gases or aerosols upon their release into the atmosphere. The models account for the dilution effects of wind speed and turbulent diffusion. Pollutants can be contaminants routinely emitted by industrial sources (such as the sulfur dioxide emissions from power plants), hazardous chemicals released due to accidents (such as the rupture of a railroad car containing chlorine), or chemical and biological warfare agents disseminated by weapon systems. It is imperative that these dispersion models be properly evaluated with observational data before their predictions can be used with confidence, because the model results often influence decisions that have large public-health and economic consequences.

There can be three components to the evaluation of air quality models: scientific, statistical, and operational. In a scientific evaluation, the model algorithms, physics, assumptions, and codes are examined in detail for their accuracy, efficiency, and sensitivity. This exercise usually requires in-depth knowledge of the model. For statistical evaluation, model predictions (such as concentrations and cloud widths) are examined to see how well they match observations. It is possible for a model to produce the right answers, but as a result of compensating errors.
The operational evaluation component mainly considers issues related to the user-friendliness of the model, such as the user's guide, the user interface, error checking of model inputs, diagnostics of interim (or internal) model calculations, and processing and display of model outputs. The focus of this paper is mainly on statistical model evaluation.

Dispersion is primarily controlled by turbulence in the atmospheric boundary layer. Turbulence is random by nature and thus cannot be precisely described or predicted, other than by means of basic statistical properties such as the mean and variance. As a result, there is spatial and temporal variability that naturally occurs in the observed concentration field. On the other hand, uncertainty in the model results can also be due to factors such as errors in the input data, model physics, and numerical representation. Because of the effects of uncertainty and its inherent randomness, it is not possible for an air quality model to ever be "perfect", and there is always a base amount of scatter that cannot be removed.

Decision makers should be able to make use of available information on air quality model performance. An example of a decision maker is a person in a regulatory agency who uses the results of photochemical models to decide emission control strategies, which clearly will have large economic and social impacts. As another example, for an accidental release of hazardous materials, emergency responders need to use the model results to decide which neighborhoods to evacuate. Or, when under attack by chemical or biological weapons in a battlefield, a military commander would need to decide, based on the model results, whether to order troops to put on protective gear or to order evacuation.

Model evaluation methods and the issue of model uncertainty are further reviewed in the following. Most of the materials presented in this paper, except for the test case application discussed in Sect. 3, are based on a chapter of the lead author's Ph.D. thesis (Chang, 2002).

1.1 Model evaluation definitions and methods

The goal of a model evaluation study must first be well-defined. Different goals could lead to different variables to evaluate and different performance measures to use. For example, for the Environmental Protection Agency (EPA) regulatory applications to standard pollutants such as SO2, the highest and second highest hourly-average concentrations at ground level, for all hours of a year, are typically of interest. In this case, air quality models are evaluated to find out whether they can correctly predict the high end of the concentration distribution, preferably for the right reasons of course. Whether the locations of high concentrations are correctly predicted is usually less important for these regulatory applications.

For military applications, on the other hand, it is typically the hazard areas, as defined by the contours at ground level of certain concentration or dosage (concentration integrated over time) thresholds, that are of interest. In this case, a model should correctly predict both the location and the size of a hazard area. In contrast to regulatory applications, it may be more important for a model in this case to correctly predict the middle or even the low end of the concentration or dosage distribution.

Dosages are also important in EPA (2002) assessments of the effects of toxic pollutants such as benzene. These assessments look at annual average concentrations over broad areas and integrate over the population distribution to calculate health effects such as numbers of excess cancers. The relative impacts of various toxic chemicals are then compared to enable decision makers to prioritize nation-wide emissions control strategies.

The terms verification and validation (V&V) are often used, especially by the military community, to describe activities aimed at demonstrating the credibility of numerical models. However, according to Oreskes et al (1994), verification and validation of numerical models of natural systems are impossible, because natural systems are never closed and because model solutions are always non-unique. The random nature of the process leads to a certain irreducible inherent uncertainty. Oreskes et al (1994) suggest that models can only be confirmed or evaluated by the demonstration of good agreement between several sets of observations and predictions. Following this guidance, the term evaluation is used instead of verification throughout this paper.
Different model evaluation methodologies have been recommended and developed for various disciplines (e.g., air quality models, water quality models, weather and climate prediction models). Depending on the intended goals associated with each discipline, the focus and approach are also different.

Fox (1984), Hanna (1989), Hanna et al (1991; 1993) and ASTM (2000) propose some comprehensive model performance measures for air quality and dense-gas dispersion models. Measures such as the fractional bias, the normalized mean square error, the geometric mean, the geometric variance, the correlation coefficient, and the fraction of predictions within a factor of two of observations are suggested (see Sect. 2 for more details). The bootstrap resampling method (Efron, 1987) is used by Hanna (1989) and ASTM (2000) to estimate the confidence limits on the performance measures. The methodology has been traditionally used to study whether air quality models can correctly predict quantities such as the maximum concentration over the entire receptor network, the maximum concentration across a given sampling arc, the concentration integrated on a given sampling arc, or the cloud width on a given sampling arc. However, the methodology is generic and is also applicable to other quantities such as wind speed, temperature, or even stock price, as long as predictions and observations are paired (e.g., in space only, in time only, or in both space and time). The Hanna et al (1991; 1993) methodology has been widely used in the air quality modeling community. For example, in its Initiative on Harmonization within Atmospheric Dispersion Modeling for Regulatory Purposes, the National Environmental Research Institute (NERI) of Denmark developed the Model Validation Kit (MVK) based on the methodology (see Olesen, 2001, and www.harmo.org).

Traditionally, model predictions are directly compared to observations. As described by John Irwin, the lead author of the American Society for Testing and Materials (ASTM, 2000) model evaluation guidelines, this direct comparison method may produce misleading results, because (1) air quality models almost always predict ensemble means, but observations represent single realizations from an infinite ensemble of cases under the same conditions; and (2) the uncertainties in observations and model predictions arise from different sources. For example, the uncertainty in observations may be due to random turbulence in the atmosphere and measurement errors, whereas the uncertainty in model predictions may be due to input data errors and model physics errors. Therefore, an alternative approach has been proposed by the ASTM to compare observations and model predictions for dispersion models (ASTM, 2000). The approach calls for properly averaging the observations before comparison.

Evaluation methods for regional Eulerian grid models for photochemical pollutants or for fine particles have developed along a slightly different path. For example, Seigneur et al (2000) describe guidance for the performance evaluation of regional Eulerian modeling systems for particulate matter (PM) and visibility, where four levels of evaluation efforts are suggested, including operational, diagnostic, mechanistic, and probabilistic. An operational evaluation uses statistical performance measures to test the ability of the model to estimate PM concentrations or other quantities used to characterize visibility. A diagnostic evaluation tests the ability of the model to predict the components of PM or visibility, including PM precursors and associated oxidants, particle size distribution, temporal and spatial variations, and light extinction. A mechanistic evaluation tests the ability of the model to predict the response of PM concentrations or visibility to changes in meteorology and emissions. A probabilistic evaluation takes into account the uncertainty associated with model predictions and observations of PM and visibility. A set of statistical measures for model evaluation is also recommended by Seigneur et al (2000). In a similar effort, McNally and Tesche (1993) developed the Model Performance Evaluation, Analysis, and Plotting Software (MAPS) to evaluate three-dimensional urban- and regional-scale meteorological, emissions, and photochemical models.

The discipline of numerical weather prediction (NWP) also has a long history of evaluating NWP models (e.g., Pielke and Pearce, 1994; Pielke, 2002; Seaman, 2000). Hamill et al (2000) suggest that a suite of performance measures should be considered to judge the performance of an analysis or forecast made by mesoscale meteorological models.
A single evaluation metric is generally not adequate, since it provides only limited information on model performance.

Many sophisticated dispersion modeling systems now routinely use outputs from NWP models. For example, the Third-Generation Air Quality Modeling System (Models-3; EPA 1999) of the Environmental Protection Agency (EPA) is coupled to the Pennsylvania State University-National Center for Atmospheric Research (PSU-NCAR) Fifth-Generation Mesoscale Model (MM5; Grell et al, 1994). Similarly, the National Atmospheric Release Advisory Center (NARAC) system (Nasstrom et al, 2000) of the Department of Energy uses the outputs from the Naval Research Laboratory (NRL) Coupled Ocean/Atmosphere Mesoscale Prediction System (COAMPS; Hodur, 1997).

This close integration between meteorological and dispersion models is quite promising for the future development of air quality modeling. However, there is a noteworthy irony: mesoscale models have traditionally been evaluated when there are significant weather systems to speak of, in which case the winds tend to be moderate to strong. On the contrary, dispersion or air quality modeling is usually more concerned with the so-called worst-case scenario (i.e., leading to the highest pollutant concentrations), when the winds tend to be light and variable. The study by Hanna and Yang (2001) is among the first to systematically evaluate mesoscale meteorological model predictions of near-surface winds, temperature gradients, and mixing depths, quantities that are most crucial to dispersion modeling applications.

1.2 Model uncertainty definitions and methods

As previously mentioned, because of random turbulence in the atmospheric boundary layer, observed concentrations are expected to fluctuate around some sort of average (e.g., Wilson, 1995). The uncertainty will increase for a smaller spatial scale or a shorter averaging period, and is approximately equal to the mean concentration for averaging times of 10 seconds or less. Venkatram (1979) developed an empirical formula for the expected deviation of observed concentrations from ensemble means, and estimated the deviation to be larger for shorter sampling time and under unstable conditions. Some advanced dispersion models such as SCIPUFF (Sykes et al, 1998) were developed based on higher-order turbulence closure schemes, so that basic statistical properties (e.g., mean and variance) of concentration predictions can be estimated.

According to Fox (1984), Anthes et al (1989), Hanna et al (1991), and Beck et al (1997), uncertainty in air quality modeling generally can be due to (1) variability because of random turbulence, (2) input data errors, and (3) errors and uncertainties in model physics. It has already been mentioned that there are natural fluctuations in the concentration fields due to random turbulence in the atmosphere. Input data errors can be due to uncertain source terms, instrument errors, unrepresentative instrument siting, and many other factors. Furthermore, there might be errors in the physical formulations of a model. Even if these physical formulations were correct, there are still uncertainties in the parameters used in the formulations.

The current paper does not go into detailed analyses of air quality model uncertainty. This rapidly-growing field is following guidelines developed for other types of environmental models. For example, Morgan and Henrion (1990) and Helton (1997) suggest formal statistical frameworks for uncertainty analysis. Helton (1997) recommends that two kinds of uncertainty be differentiated, stochastic and subjective. Cullen and Frey (1998) focus on the uncertainty issue in the field of environmental exposure assessment. Saltelli et al (2000) provide a comprehensive collection of methodologies for sensitivity analysis, which is closely related to uncertainty analysis. In the traditional meteorological literature, uncertainty is often synonymous with predictability, i.e., the sensitive dependence of the solution of a nonlinear system on its initial conditions (Lorenz, 1963).

2. Description of statistical model performance evaluation methods

This section describes several statistical approaches used in air quality model performance evaluation, and is mainly based on a chapter of the lead author's Ph.D. thesis (Chang, 2002).
The approaches and the resulting BOOT model evaluation software package represent the work previously developed by the authors, but with additional methods suggested for interpretation of the results (Sects. 2.1–2.4). Four innovative evaluation methodologies (Sects. 2.5–2.8) developed by other researchers are also described, and the existing BOOT software is further upgraded to include three of these new capabilities (Sects. 2.5–2.7).

It is important that air quality models be properly evaluated in order to demonstrate their fidelity in simulating the phenomena of interest. In the following, a set of procedures for the statistical evaluation of air quality models is described. Even though the emphasis here is somewhat biased towards the air quality branch of the environmental sciences, the same procedures and approaches are also applicable to other disciplines and models. For example, the BOOT model evaluation software could just as well be applied to the question of whether a certain drug produces a significant improvement in health, or to whether a certain new type of corn seed produces significantly higher corn yields than the old type of seed.

For statistical evaluation, model predictions are evaluated against some reference states, which in most cases are simply "observations." Observations can be directly measured by instruments, or are themselves products of other models or analysis procedures. It is important to recognize that different degrees of uncertainty are associated with different types of observations. Furthermore, it is important to define how predictions are to be compared with observations. For example, should observations and predictions be paired in time, in space, or in both time and space? Different conclusions can be reached depending on the type of pairing chosen.

2.1 Definition of evaluation objective

The evaluation objective must be clearly defined before doing any model performance evaluation. For example, for EPA regulatory applications, the primary objective might be how well an air quality model simulates the maximum one-hour averaged concentration anywhere on the sampling network. In this case, the location of the maximum impact is less important. For environmental justice applications, it might be important to evaluate model predictions of 24-hour averaged PM at specific locations such as heavily populated areas with poor housing. For military applications, the location of the dosage footprint of a chemical agent cloud is an important piece of information. For flammable substances, the instantaneous maximum concentration is more important than the one-hour average concentration. For a forensic study involving decisions concerning population impacts, it might be of interest to correctly estimate the cloud arrival and departure times. Moreover, for any actual field experiment, there are practical constraints that allow only a limited number of the above evaluation objectives to actually be considered.

When conducting a statistical test, a null hypothesis must first be defined. Depending on the goals and emphases of the study, there are a number of potential outputs of an air quality model that could be evaluated, such as:

(1) For a given averaging time:
– The overall maximum concentration over the entire domain,
– The maximum concentration along a sampling line,
– The cross-line integrated concentration along a sampling line,
– The location of a contour (i.e., cloud footprint) for a certain concentration threshold (e.g., toxicity limit or flammability limit),
– The cloud width along a sampling line,
– The cloud height along a vertical tower.

(2) Dosage (integration of concentration over time):
– The maximum dosage (concentration integrated with time) along a sampling line,
– The cross-wind integrated dosage along a sampling line,
– The total dosage over an area,
– The location of a contour for a certain dosage.

(3) Cloud timing:
– The cloud arrival and departure times and the effective velocity.

Once the model output that is to be evaluated is decided, it is useful to pose a hypothesis to be tested by the statistical evaluation.
Such a hypothesis might be, for example, "Can we say with 95% confidence that the model's mean bias is not significantly different from zero?" Or, "Can we say with 95% confidence that the Normalized Mean Square Error of Model 1 is not significantly different from that for Model 2?" Statistical model evaluation software such as BOOT or the ASTM method is capable of addressing these questions.

2.2 Exploratory data analysis

Before beginning the calculation of various statistical performance measures, it is extremely useful to perform exploratory data analysis by simply plotting the data in different ways. The human eye can often glean more valuable insights from these plots than from pure statistics, especially since many of the statistical measures depend on linear relations. These plots can also provide clues as to why a model performs in a certain way. Some of the commonly-used plots are scatter plots, quantile–quantile plots, residual box plots, and simply plots of predictions and observations as a function of time or space. Some of these plots are demonstrated in Sect. 3 with data from Intensive Operating Period 9 (IOP09) from the Salt Lake City Urban 2000 field data.

Scatter plot: In a scatter plot, the paired observations and predictions are plotted against each other. Visual inspection can reveal the magnitude of the model's over- or under-predictions. Also, as implied by its name, the scatter of the points can be quickly seen and estimated by eye (factor of 2, 5, or 10?). Because of the obvious impacts on public health due to high pollutant concentrations or dosages, the high end of the plots can be studied. On the other hand, correct predictions of low concentrations may sometimes also be important for highly toxic chemicals.

Quantile–quantile plot: The quantile–quantile plot begins with the same paired data as the scatter plots, but removes the pairing and instead ranks each of the observed and predicted data separately from lowest to highest. Thus the 3rd lowest predicted concentration would be plotted versus the 3rd lowest observed concentration. It is often of interest to find out whether a model can generate a concentration distribution that is similar to the observed distribution. Biases at low or high concentrations are quickly revealed in this plot.

Residual plots employing box diagrams: The scatter and quantile–quantile plots mentioned above clearly do not provide a complete understanding of the physical reasons why a model performed in a certain way. This issue can be addressed using residual analyses, combined with box diagrams if necessary. In this plot, model residuals, defined as the ratio of predicted (Cp) to observed (Co) concentrations (or dosages or other outputs), are plotted, in the form of a scatter plot, versus independent variables such as hour of day, downwind distance, ambient wind speed, mixing height, or atmospheric stability. If there are many points, it is not effective to plot all of them; instead, the residuals are binned according to different ranges of the independent variables, and the distribution of all data points in each bin is represented by a box diagram. The significant points for each box diagram represent the 2nd, 16th, 50th, 84th, and 98th percentiles of the cumulative distribution of the n points considered in the box. A good performing model should not show any trend of the residuals when they are plotted versus independent variables.
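The following minimal sketch illustrates the first two of these plots. It is not part of the BOOT package; the paired arrays co and cp are hypothetical stand-ins for observed and predicted concentrations.

```python
import numpy as np
import matplotlib.pyplot as plt

# Hypothetical paired data standing in for observed and predicted concentrations.
rng = np.random.default_rng(1)
co = np.exp(rng.normal(0.0, 1.0, size=200))        # "observations"
cp = co * np.exp(rng.normal(0.2, 0.7, size=200))   # "predictions" with bias and scatter

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(9, 4))
lim = [min(co.min(), cp.min()), max(co.max(), cp.max())]

# Scatter plot: pairing retained; 1:1 and factor-of-2 reference lines.
ax1.loglog(co, cp, "k.", markersize=3)
ax1.loglog(lim, lim, "b-", label="1:1")
ax1.loglog(lim, [2.0 * v for v in lim], "b--", label="factor of 2")
ax1.loglog(lim, [0.5 * v for v in lim], "b--")
ax1.set(xlabel="observed", ylabel="predicted", title="Scatter plot")
ax1.legend()

# Quantile-quantile plot: pairing removed; each sample ranked separately.
ax2.loglog(np.sort(co), np.sort(cp), "k.", markersize=3)
ax2.loglog(lim, lim, "b-")
ax2.set(xlabel="observed quantiles", ylabel="predicted quantiles",
        title="Quantile-quantile plot")

plt.tight_layout()
plt.show()
```

A residual box plot would, in the same spirit, bin cp/co by an independent variable (e.g., downwind distance) and draw the 2nd, 16th, 50th, 84th, and 98th percentiles of each bin.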
2.3 Quantitative performance measures implemented in BOOT software

Hanna et al (1991; 1993) recommended a set of quantitative statistical performance measures for evaluating models, and implemented the procedures in a software package called BOOT. The performance measures have been widely used in many studies (e.g., Ichikawa and Sada, 2002; Nappo and Essa, 2001; Mosca et al, 1998), and have been adopted as a common model evaluation framework for the European Initiative on "Harmonisation within Atmospheric Dispersion Modelling for Regulatory Purposes" (Olesen, 2001).

In addition to the standard statistical procedures, the BOOT software includes the capability to produce the scatter, quantile–quantile, and residual plots used in the exploratory data analysis described in the previous subsection. The quantitative performance measures used in the BOOT procedure are described below, together with some recent enhancements.

2.3.1 Definitions and properties of performance measures

In order to evaluate the predictions of a model against observations, Hanna et al (1991; 1993) recommend the use of the following statistical performance measures, which include the fractional bias (FB), the geometric mean bias (MG), the normalized mean square error (NMSE), the geometric variance (VG), the correlation coefficient (R), and the fraction of predictions within a factor of two of observations (FAC2):

FB = \frac{\overline{C_o} - \overline{C_p}}{0.5\,(\overline{C_o} + \overline{C_p})},   (1)

MG = \exp\left(\overline{\ln C_o} - \overline{\ln C_p}\right),   (2)

NMSE = \frac{\overline{(C_o - C_p)^2}}{\overline{C_o}\;\overline{C_p}},   (3)

VG = \exp\left[\,\overline{(\ln C_o - \ln C_p)^2}\,\right],   (4)

R = \frac{\overline{(C_o - \overline{C_o})(C_p - \overline{C_p})}}{\sigma_{C_p}\,\sigma_{C_o}},   (5)

FAC2 = fraction of data that satisfy 0.5 \le \frac{C_p}{C_o} \le 2.0,   (6)

where:
Cp: model predictions;
Co: observations;
overbar (\overline{C}): average over the dataset; and
\sigma_C: standard deviation over the dataset.

A perfect model would have MG, VG, R, and FAC2 = 1.0; and FB and NMSE = 0.0. Of course, as noted earlier, because of the influence of random atmospheric processes, there is no such thing as a perfect model in air quality modeling. Note that since FB and MG measure only the systematic bias of a model, it is possible for a model to have predictions completely out of phase with observations and still have FB = 0.0 or MG = 1.0 because of canceling errors.
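Equations (1)–(6) translate almost line for line into code. The sketch below is not the BOOT implementation, only a direct transcription of the definitions for hypothetical paired arrays co and cp:

```python
import numpy as np

def performance_measures(co, cp):
    """FB, MG, NMSE, VG, R, and FAC2 of Eqs. (1)-(6) for paired arrays.

    Both arrays must be positive for the logarithmic measures MG and VG;
    in practice a floor such as the limit of quantitation is applied first.
    """
    co = np.asarray(co, dtype=float)
    cp = np.asarray(cp, dtype=float)
    fb = (co.mean() - cp.mean()) / (0.5 * (co.mean() + cp.mean()))  # Eq. (1)
    mg = np.exp(np.log(co).mean() - np.log(cp).mean())              # Eq. (2)
    nmse = np.mean((co - cp) ** 2) / (co.mean() * cp.mean())        # Eq. (3)
    vg = np.exp(np.mean((np.log(co) - np.log(cp)) ** 2))            # Eq. (4)
    r = np.mean((co - co.mean()) * (cp - cp.mean())) / (co.std() * cp.std())  # Eq. (5)
    fac2 = np.mean((cp >= 0.5 * co) & (cp <= 2.0 * co))             # Eq. (6)
    return {"FB": fb, "MG": mg, "NMSE": nmse, "VG": vg, "R": r, "FAC2": fac2}
```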
The six performance measures defined above are by no means exhaustive. Depending on the purpose and emphasis of a study, other measures can be defined and can be easily incorporated into the BOOT software. The confidence limits on these alternate performance measures can be calculated by the same algorithms in BOOT that are used to calculate the confidence limits on the six standard performance measures.

In the above discussions, it is simply assumed that the "dataset" contains pairs of Cp and Co (or dosages or other model outputs), and that they represent averages over an averaging time, Ta. BOOT allows sets of Cp for several alternate models. The pairing is completely generic, and can be:

– Pairing in time only, such as the time series of the maximum pollutant concentrations anywhere in the domain of interest (i.e., no penalty is given if the model predicts the maximum concentration at a wrong place);
– Pairing in space only, such as the spatial distribution of the maximum pollutant concentrations over a time period (i.e., no penalty is given if the model predicts the maximum concentration at a wrong time);
– Pairing in both time and space.

Pairing in both time and space is clearly most stringent. Concentration fields often exhibit complex spatial patterns. As Weil et al (1992) point out, because of typical variations in wind direction of 20 to 40 degrees (or more in light winds), predicted plumes often completely fail to overlap observed plumes, even though the magnitudes and patterns may be similar. This difficulty of separating the effects of winds and plume dispersion is a common challenge for model evaluation.

Multiple performance measures should be applied and considered in any model evaluation exercise, as each measure has advantages and disadvantages and there is not a single measure that is universally applicable to all conditions. The relative advantages of each performance measure are partly determined by the characteristics and distributions of the model predictions and observations. The distribution is close to log-normal for most atmospheric pollutant concentrations. In this case, the linear measures FB and NMSE may be overly influenced by infrequently occurring high observed and/or predicted concentrations, whereas the logarithmic measures MG and VG may provide a more balanced treatment of extreme high and low values. Therefore, for a dataset where both predicted and observed concentrations vary by many orders of magnitude, MG and VG may be more appropriate. FAC2 is the most robust measure, because it is not overly influenced by outliers.

MG and VG may be overly influenced by extremely low values, near the instrument thresholds, and are undefined for zero values. These low and zero values are not uncommon in dispersion modeling, where a low concentration value might occur at a receptor that the plume has missed. Therefore, when calculating MG and VG, it is recommended that a minimum threshold be assumed for data values. For example, the instrument threshold, such as the limit of quantitation (LOQ), could be used as the lower bound for both Cp and Co. The sensitivity of MG and VG to various assumptions regarding minimum values should be determined as part of the evaluation exercise, as will be demonstrated later with the Urban 2000 field data.

FB and MG are measures of mean relative bias and indicate only systematic errors, whereas NMSE and VG are measures of mean relative scatter and reflect both systematic and unsystematic (random) errors. For FB, which is based on a linear scale, the systematic bias refers to the arithmetic difference between Cp and Co; and for MG, which is based on a logarithmic scale, the systematic bias refers to the ratio of Cp to Co. Because FB is based on the mean bias, it is possible for a model whose predictions are completely out of phase with observations to still have FB = 0 because of compensating errors. An alternative is to consider a slightly modified version of FB where the two error components (i.e., overprediction and underprediction) are separately treated. This approach is described in detail in Sect. 2.6.

The correlation coefficient, R, reflects the linear relationship between two variables and is thus insensitive to either an additive or a multiplicative factor. Because of the linear implication, if there is a clear non-linear relation (say, a parabolic relation) in the scatter plots, it is not revealed by R. Also, R is sensitive to a few aberrant data pairs. For example, a scatter plot might show generally poor agreement; however, the presence of a good match for a few extreme pairs will greatly improve R. As a result, Willmott (1982) discourages the use of R, because it does not consistently relate to the accuracy of predictions. It is often suggested that the more robust ranked correlation, Rrank (often called the Spearman correlation coefficient), be considered, where the ranks of Cp and Co are correlated instead of their values. Rrank is not influenced by extreme outliers.

Moreover, it is typical for short-range dispersion field experiments to have concentration data measured along concentric arcs at several discrete distances from the source, and it is also customary to evaluate model performance based on the maximum concentration or the cross-line integrated concentration along each sampling arc. In this case, the value of R can be deceivingly high, mainly reflecting the fact that concentration decreases with downwind distance, which any reasonable dispersion model will simulate. Therefore, the correlation coefficient is less useful in a typical evaluation exercise for dispersion models. On the other hand, it might be more useful when gridded fields are involved (e.g., McNally and Tesche, 1993).

Since NMSE accounts for both systematic and random errors, it is helpful to partition NMSE into the component due to systematic errors, NMSEs, and the unsystematic component due to random errors, NMSEu. It can be shown that

NMSE_s = \frac{4\,FB^2}{4 - FB^2}.   (7)

The above expression gives the minimum NMSE, i.e., without any unsystematic errors, for a given value of FB (Hanna et al, 1991). The total NMSE is the sum of NMSEs and NMSEu.

Similarly, VG can also be partitioned into the systematic component, VGs, and the random (unsystematic) component, VGu. The systematic component of VG is given by

VG_s = \exp\left[\left(\overline{\ln C_o} - \overline{\ln C_p}\right)^2\right] = \exp\left[(\ln MG)^2\right],   (8)

or,

\ln VG_s = (\ln MG)^2.   (9)

The above equation gives the minimum possible VG, without any unsystematic errors, given a value of MG (Hanna et al, 1991). The total VG is the product of VGs and VGu. Or, the total ln VG is the sum of ln VGs and ln VGu.
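Continuing the sketch above (and reusing its hypothetical performance_measures helper), the partitions of Eqs. (7)–(9) follow directly from FB, MG, and the totals:

```python
def partition_errors(co, cp):
    """Systematic/unsystematic split of NMSE and VG, Eqs. (7)-(9)."""
    m = performance_measures(co, cp)
    nmse_s = 4.0 * m["FB"] ** 2 / (4.0 - m["FB"] ** 2)  # Eq. (7): minimum NMSE for this FB
    nmse_u = m["NMSE"] - nmse_s                         # total NMSE = NMSEs + NMSEu
    vg_s = np.exp(np.log(m["MG"]) ** 2)                 # Eqs. (8)-(9): minimum VG for this MG
    vg_u = m["VG"] / vg_s                               # total VG = VGs * VGu
    return {"NMSEs": nmse_s, "NMSEu": nmse_u, "VGs": vg_s, "VGu": vg_u}
```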
2.3.2 Relations between FB and NMSE, and MG and VG

The FB, MG, NMSE, and VG defined in Eqs. (1)–(4) can be used to quantitatively define a certain aspect of model performance. However, direct quotations of their values are often not that informative. For example, without experience, it will be difficult for a user to discern the meaning of, say, NMSE = 9 and VG = 13. It is shown below how the values of FB, NMSE, MG, and VG can be further interpreted in terms of a measure that is more easily comprehended, such as the equivalent ratio of Cp to Co.

For example, Eq. (1) can be expressed as

\frac{\overline{C_p}}{\overline{C_o}} = \frac{1 - \frac{1}{2}\,FB}{1 + \frac{1}{2}\,FB}.   (10)

Therefore, FB = 0.67 would imply a factor of two mean underprediction, and FB = −0.67 would imply a factor of two mean overprediction.

MG is simply the ratio of the geometric mean of Co to the geometric mean of Cp:

\frac{\langle C_p \rangle}{\langle C_o \rangle} = \frac{1}{MG},   (11)

where the angle brackets indicate a geometric mean. Consequently, a factor of two mean bias would imply that MG = 0.5 or 2.0, and a factor of four mean bias would imply that MG = 0.25 or 4.0.

To interpret NMSE, assume that the mean of the observed concentrations equals the mean of the predicted concentrations. Then NMSE = 1.0 implies that the root-mean-square error is equal to the mean. As NMSE becomes much larger than 1.0, it can be inferred that the distribution is not normal but is closer to log-normal (e.g., many low values and a few large values).

VG expresses the scatter of a log-normal distribution, which can be expressed as, say, "plus or minus 20%", or "plus or minus a factor of 10". For example, a factor of 2 scatter would imply VG = 1.6 and a factor of 5 scatter would imply VG = 12.

2.4 Confidence limits on performance measures

A model might appear to have a certain amount of skill as reflected by, for example, a small fractional bias, FB. A model might also appear to have a better performance than other models based on, for example, a smaller FB. However, these results may not be significant in a statistical sense. To investigate the significance question, there are several hypotheses that could be tested. Two examples are:

– When compared to observations, is a model's relative mean bias performance measure, FB, significantly different from zero at the 95% confidence level? For the case of geometric measures such as MG, it is necessary to consider whether MG is significantly different from 1.0, since a perfect model has MG = 1.0.
– When comparing the performance of two models, are the differences in the performance measures for the two models (e.g., FB for Model 1 minus FB for Model 2) significantly different from zero at the 95% confidence level?

If the distribution of the quantity follows a normal distribution or can be transformed to a normal distribution, then significance tests such as the Student's t test can be applied to the problem. However, in the case of general distributions, random resampling methods such as the bootstrap method (Efron, 1987; Efron and Tibshirani, 1993) can be used to estimate the mean, μ, and standard deviation, σ, of the distribution of each performance measure. Each random sample will yield one estimate of the performance measure. After, say, 1000 samples, there will be 1000 estimates of any given performance measure. These 1000 estimates can be used to estimate the μ and σ for the performance measure. As suggested by Efron (1987), the 95% confidence intervals are given by

\mu \pm t_{95\%}\;\sigma\left(\frac{n}{n-1}\right)^{1/2},   (12)

where n is the number of resamples, t95% is the Student's t value at the 95% confidence level with NP − 1 degrees of freedom, and NP is the number of observation-prediction pairs. Alternatively, the 1000 estimates could be used to directly determine the 95% confidence intervals based on the 5th and 95th percentiles of the distribution. The rule of thumb on the minimum number of observation-prediction pairs is around 20, so that meaningful information on the confidence level can be obtained.
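A bare-bones version of this resampling procedure, using the percentile form of the interval, might look as follows; it is a sketch of the general idea rather than the BOOT implementation, and the statistic is passed in as a function (here the fractional bias of Eq. (1)):

```python
import numpy as np

def fb(co, cp):
    """Fractional bias, Eq. (1)."""
    return (co.mean() - cp.mean()) / (0.5 * (co.mean() + cp.mean()))

def bootstrap_ci(co, cp, stat=fb, n_resamples=1000, seed=0):
    """Percentile bootstrap confidence interval for a performance measure.

    Observed and predicted values are resampled concurrently (pairs are
    kept together) and with replacement.
    """
    rng = np.random.default_rng(seed)
    n = len(co)
    estimates = np.empty(n_resamples)
    for i in range(n_resamples):
        idx = rng.integers(0, n, size=n)          # draw n pairs with replacement
        estimates[i] = stat(co[idx], cp[idx])
    low, high = np.percentile(estimates, [5.0, 95.0])  # 5th/95th percentiles, as in the text
    return stat(co, cp), (low, high)
```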
Some practical considerations concerning resampling are given below:

– The evaluation dataset may sometimes appear in blocks. For example, one block of data may be from one experiment trial and another block of data may be from a second experiment trial. Or, one block of data may be from one monitoring arc, and another block of data may be from a second monitoring arc. It may also be appropriate to block data by other independent variables such as wind speed and stability class. In this case, resampling should be restricted to within each block; otherwise, artificial block-to-block variance will be introduced.
– Resampling should be done with replacement. That is, once a sample is drawn, it is allowed to be drawn again.
– Observational and predicted values are sampled concurrently, in order to maintain the relationship between them.

Because the blocking procedure is clearly arbitrary and can have an effect on the resulting confidence limits, some experimentation with one or two options is advised in any operational evaluation.

2.5 Taylor's single nomogram method

Taylor (2001) and Gates et al (1999) recommend a nomogram that can summarize three performance metrics "in a single diagram". The method was first developed for applications to weather forecast and climate models, but also has potential for other types of models, such as air quality models. The three performance measures used by Taylor are the normalized standard deviation (NSD), the normalized root mean square error (NRMSE), and the correlation coefficient (R). NSD and NRMSE are defined below, whereas R has been previously defined by Eq. (5):

NSD = \frac{\sigma_{C_p}}{\sigma_{C_o}},   (13)

NRMSE = \frac{\sqrt{\overline{\left[(C_p - \overline{C_p}) - (C_o - \overline{C_o})\right]^2}}}{\sigma_{C_o}},   (14)

where all variables on the right hand side of Eqs. (13) and (14) are similarly defined as in Eqs. (1)–(6). Note that NSD, NRMSE, and R account for only unsystematic errors, i.e., their values do not depend on mean bias. Therefore, R = 1.0, NSD = 1.0, and NRMSE = 0.0 are only necessary but not sufficient conditions for a perfect model. As a result, if a model systematically overpredicts or underpredicts, but still has the same scatter as the observations, it would yield a perfect NSD of 1.0.

Taylor (2001) shows that a relationship exists among the three performance measures (NSD, NRMSE, and R) based on the law of cosines. It is this relationship that allows the three measures to be plotted in a single diagram. It can be shown that

R = \frac{1}{2\,NSD}\left(NSD^2 + 1 - NRMSE^2\right).   (15)

The above equation is in the form of the law of cosines, i.e.,

\cos\theta = \frac{1}{2BC}\left(B^2 + C^2 - A^2\right),   (16)

where A, B, and C are the three sides of a triangle, and A is the side opposite the angle θ.

An example of the nomogram format proposed by Taylor (2001) is given in Fig. 1, where three hypothetical models are used for illustration. A model's performance is indicated by a unique point on the diagram. In Fig. 1, values of constant NSD are indicated by concentric arcs (dashed curves, except that a solid curve is used for NSD = 1.0), and values of R are indicated by the polar dial along the edge of the figure. A perfect model (NSD = R = 1.0, and NRMSE = 0.0) is indicated by the large solid circle. In the figure, Model-A is located closest to the large solid circle, and is thus the overall best performer.

Fig. 1. Nomogram proposed by Taylor (2001). The radial distance from the center of the model name to the origin indicates NSD. The distance from the center of the model name to the large solid circle indicates NRMSE. The cosine of the angle formed by the horizontal axis and segment NSD indicates R. The inset further depicts the relationship among NSD, NRMSE, and R explicitly. The large solid circle indicates a perfect model, i.e., NSD = R = 1.0, and NRMSE = 0.0. NSD is measured by concentric arcs (dashed curves, except that a solid curve is used for NSD = 1.0), and R is indicated by the polar dial along the edge of the nomogram. In the above example, Model-A has NSD = 1.2, NRMSE = 0.74, and R = 0.79; Model-B has NSD = 1.25, NRMSE = 1.02, and R = 0.61; and Model-C has NSD = 0.92, NRMSE = 1.36, and R = 0.

Taylor's diagram has so far been primarily used for gridded fields. The method should be further reviewed to determine its application to short-range dispersion field experiments where concentrations are typically measured along concentric arcs, and where R is less relevant.
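The three quantities in a Taylor diagram, and the law-of-cosines identity of Eq. (15) that ties them together, can be checked numerically in a few lines (the paired arrays here are hypothetical):

```python
import numpy as np

def taylor_stats(co, cp):
    """NSD and NRMSE of Eqs. (13)-(14), and R of Eq. (5)."""
    nsd = cp.std() / co.std()                                        # Eq. (13)
    nrmse = np.sqrt(np.mean(((cp - cp.mean())
                             - (co - co.mean())) ** 2)) / co.std()   # Eq. (14)
    r = np.mean((co - co.mean()) * (cp - cp.mean())) / (co.std() * cp.std())
    return nsd, nrmse, r

co = np.array([1.0, 2.0, 3.0, 4.0, 5.0])   # hypothetical observations
cp = np.array([1.5, 1.8, 3.6, 3.9, 6.1])   # hypothetical predictions
nsd, nrmse, r = taylor_stats(co, cp)

# Eq. (15): the three measures are not independent (law of cosines),
# which is what allows them to share a single diagram.
assert np.isclose(r, (nsd ** 2 + 1.0 - nrmse ** 2) / (2.0 * nsd))
```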
2.6 Figure of Merit in Space (FMS) and Measure of Effectiveness (MOE) methods

Another method that is sometimes used to evaluate model performance is the so-called figure of merit in space (FMS), defined as

FMS = \frac{A_p \cap A_o}{A_p \cup A_o},   (17)

where Ap is the predicted contour area based on a certain threshold, and Ao is the observed contour area based on the same threshold (Fig. 2). Contour areas can be defined, for example, by a threshold concentration for toxic or flammable materials for hazardous gas models, or by areas of precipitation for weather forecasts. It is clear from the above definition that the FMS does not depend on the detailed distribution within the contour. Therefore, the FMS is more subjective and qualitative than the statistical evaluation procedures described above.

Fig. 2. Three possible scenarios of evaluating model performance using the figure of merit in space (FMS), (A_p \cap A_o)/(A_p \cup A_o), where Ap (shaded area) is the predicted contour area, and Ao (dotted area) is the observed contour area. The contour can be defined, for example, by a concentration threshold of the released chemical for dispersion modeling, or by areas of precipitation for weather predictions. The sizes of Ao and Ap are the same for all three frames, but the corresponding values of the FMS are quite different. In the above example, the FMS values for (a), (b), and (c) are 0.19, 0.61, and 0.04, respectively.

In addition to its routine use to verify precipitation forecasts, the FMS has been used to evaluate the performance of long-range transport and dispersion models for the Chernobyl accident (Klug et al, 1992) and for the European Tracer Experiment (ETEX) (Mosca et al, 1998). It is also often called the threat score (Wilks, 1995; McNally and Tesche, 1993).

The FMS has not often been used for evaluating short-range or mesoscale-range dispersion models. Because of the difficulty of simulating plume directions to within 20 to 40 deg, the observed and predicted contours seldom overlap (Weil et al, 1992). Thus, the FMS is usually quite low for plume models applied to point sources, whose concentration contours frequently have a cigar shape.

Emergency response personnel are interested in whether a model is more likely to predict false negatives than false positives. In Fig. 2, the area under Ao but not under Ap ∩ Ao can be defined as the area of "false negative" predictions (area AFN) for the model, whereas the area under Ap but not under Ap ∩ Ao can be defined as the area of "false positive" predictions (area AFP) for the model. False negative means that the model did not predict any impact but observations showed otherwise; false positive means that the model predicted impact but observations showed otherwise. Generally, false negatives are more worrisome to emergency responders than false positives. Equation (17) can then be rewritten as

FMS = \frac{A_p \cap A_o}{A_p \cap A_o + A_{FN} + A_{FP}}.   (18)

One shortcoming of the FMS is that if Ap and Ao are nearly identical in shape, but do not overlap at all, perhaps due to incorrect wind direction inputs, then the FMS will be zero and no credit will be given to the fact that the model has done a satisfactory job in predicting the shape.

Warner et al (2001) suggest a more general approach whereby a user can specify the relative importance of the areas of false negative and false positive predictions by attaching weighting factors to AFN and AFP in the above equation.
They define the more general metric as the user-oriented one-dimensional (1-D) measure of effectiveness (MOE), i.e.,

\text{1-D MOE} = \frac{A_p \cap A_o}{A_p \cap A_o + C_{FN}\,A_{FN} + C_{FP}\,A_{FP}},   (19)

where CFN and CFP are weighting factors or constants to be determined by the user. A family of 1-D MOE estimates can be defined based on different combinations of CFN and CFP. The 1-D MOE is the same as the FMS when CFN and CFP both equal 1.0.

Warner et al (2001) also suggest a two-dimensional (2-D) MOE, where two components are used to indicate model performance:

\text{2-D MOE} = (MOE_x, MOE_y) = \left(\frac{A_p \cap A_o}{A_o},\; \frac{A_p \cap A_o}{A_p}\right) = \left(\frac{A_p \cap A_o}{A_p \cap A_o + A_{FN}},\; \frac{A_p \cap A_o}{A_p \cap A_o + A_{FP}}\right).   (20)

Note that the two components are normalized differently, i.e., one by Ao and the other by Ap.

With appropriate configuration, a model can always generate sufficiently-detailed outputs for contouring. However, sufficiently-detailed spatial coverage of observations is needed to permit the accurate drawing of contours to calculate the FMS (or MOE). When assessing precipitation forecasts, good coverage is often provided by networks of rain gauges. However, the monitoring instruments for short-range dispersion field experiments are typically arranged in concentric arcs. The data from these arcs are adequate to measure the plume's cross-wind distribution, but are usually insufficient to plot contours. This is especially so if the number of sampling arcs is limited. And in the case of routine air quality monitoring networks, the ten or so monitors are generally scattered around at various angles and distances, and any contours that may be attempted would be highly uncertain.
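When gridded fields are available, the areas in Eqs. (17), (18), and (20) reduce to counts of grid cells above the threshold. A minimal sketch with boolean masks follows; the field arrays and threshold are hypothetical, and at least one cell is assumed to exceed the threshold:

```python
import numpy as np

def fms_and_2d_moe(c_obs, c_pred, threshold):
    """FMS (Eqs. 17-18) and 2-D MOE (Eq. 20) from gridded fields.

    Every grid cell at or above the threshold contributes one unit of area.
    """
    ao = c_obs >= threshold            # observed contour area, as a mask
    ap = c_pred >= threshold           # predicted contour area
    overlap = np.sum(ap & ao)          # Ap ∩ Ao
    a_fn = np.sum(ao & ~ap)            # false-negative area
    a_fp = np.sum(ap & ~ao)            # false-positive area
    fms = overlap / (overlap + a_fn + a_fp)   # Eq. (18), identical to Eq. (17)
    moe_x = overlap / (overlap + a_fn)        # (Ap ∩ Ao) / Ao
    moe_y = overlap / (overlap + a_fp)        # (Ap ∩ Ao) / Ap
    return fms, (moe_x, moe_y)
```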
Warner et al (2001) recommend two alternate methods to compute the MOE under the condition where limited data are available (see Fig. 3 for an illustration).

Fig. 3. Two methods to estimate the areas required to compute the figure of merit in space (FMS), or the measure of effectiveness (MOE), for a typical short-range dispersion field experiment, where concentrations are measured along concentric arcs. Solid curve is observations along a sampling arc, expressed in azimuth angle. Dashed curve is model predictions. Thick solid horizontal line represents a threshold (60 concentration units in this case). Both the FMS and MOE require estimates of the overlap area (Ap ∩ Ao), the false-positive area (AFP), and the false-negative area (AFN), which can be given by the following two methods. Method 1: The AFP, AFN, and Ap ∩ Ao are determined by the three line segments (or azimuthal distances along the arc) at the bottom of the figure, where AFP indicates the locations where predictions are higher than the threshold and observations are lower than the threshold, AFN is conversely defined, and Ap ∩ Ao indicates the locations where both observations and predictions are higher than the threshold. Method 2: The AFP, AFN, and Ap ∩ Ao pertain to the cross-hatched, horizontally-hatched, and shaded areas above the threshold, respectively. In actual implementation, AFP, AFN, and Ap ∩ Ao are simply estimated by the summation of the data.

Method 1 applies when sufficient monitoring data are available on a crosswind arc, and these data can be used to define crosswind distance ranges where the concentration (or dosage) exceeds some threshold. In this case, the positions of the predicted and observed distance ranges are used as before to calculate AFP, AFN, and Ap ∩ Ao. The difference is that the "areas" are now "line lengths" in Fig. 3. The MOE that is calculated does not depend on the detailed distributions of observations and predictions. For example, with an assumed concentration threshold of 60, the MOE would remain the same regardless of whether the observed maximum concentration were 650 or 6500 in Fig. 3. It is obvious that a perfect MOE (or FMS) under this method does not necessarily guarantee a perfect agreement between observations and predictions.

Method 2 is applied on a point by point basis, similar to FB or MG, and therefore applies to cases where receptors are not distributed nicely along a sampling arc. It is possible to apply Method 2 to, for example, a set of randomly distributed receptors, as illustrated in Fig. 3. Under Method 2, AFP is given by the sum of the differences between predictions and observations for all those locations where predictions are higher than observations (i.e., false positive), AFN is given by the sum of the differences between predictions and observations for all those locations where predictions are lower than observations (i.e., false negative), and Ap ∩ Ao is given by the differences between (1) the smaller of predictions and observations and (2) the threshold at all locations. (Recall that all the above operations are with respect to the threshold value.) For Method 2, a perfect MOE (or FMS) would imply a perfect agreement between observations and predictions.

The BOOT software has incorporated the philosophy of the 2-D MOE determined by Method 2 above by breaking up FB into the false positive (FBfp) and false negative (FBfn) components (or 2-D FB):

FB_{fp} = \frac{\overline{0.5\left[\,|C_o - C_p| + (C_p - C_o)\,\right]}}{0.5\left(\overline{C_o} + \overline{C_p}\right)},   (21)

where the numerator amounts to considering only those data pairs with Cp > Co (i.e., overpredicting or false positive), and

FB_{fn} = \frac{\overline{0.5\left[\,|C_o - C_p| + (C_o - C_p)\,\right]}}{0.5\left(\overline{C_o} + \overline{C_p}\right)},   (22)

where the numerator amounts to considering only those data pairs with Co > Cp (i.e., underpredicting or false negative). The difference between FBfn and FBfp yields the original FB, whereas the sum of FBfn and FBfp is the normalized absolute error (NAE):

NAE = \frac{\overline{|C_o - C_p|}}{0.5\left(\overline{C_o} + \overline{C_p}\right)}.   (23)

Model evaluation experts such as McNally and Tesche (1993) prefer NAE over NMSE because it is less susceptible to outliers. MG can also be similarly divided into the false positive and false negative components. The following two equations show the relationship between the 2-D FB defined in Eqs. (21) and (22), and the 2-D MOE defined in Eq. (20) and estimated by Method 2 above (Chang, 2002):

FB_{fp} = \frac{2\,MOE_x\,(1 - MOE_y)}{MOE_x + MOE_y},   (24)

FB_{fn} = \frac{2\,MOE_y\,(1 - MOE_x)}{MOE_x + MOE_y}.   (25)
2.7 ASTM Statistical Model Evaluation method

The qualitative and quantitative procedures mentioned in the above sections typically involve direct comparisons of model predictions with field observations. However, many studies (e.g., Fox, 1984; Venkatram, 1984 and 1988; Weil et al, 1992) point out that there is a fundamental difficulty in that most dispersion models generate an ensemble-mean prediction (either explicitly or implicitly), whereas an observation corresponds to a single realization of the ensemble. Here, an ensemble is defined as "a set of experiments corresponding to fixed external conditions" (Lumley and Panofsky, 1964). Therefore, some researchers have been advocating a framework through which atmospheric dispersion model predictions can be compared with a grouping of a number of field observations. One such framework was proposed as a standard guide by the American Society for Testing and Materials (ASTM, 2000).

Following the definitions proposed by Venkatram (1984), the basic premise of the ASTM procedure is that a single realization of the observed concentration, Co, can be expressed as

C_o = \overline{C_o}(\alpha) + C_o'(\alpha) + C_o''(\alpha),   (26)

where α represents the set of model input parameters, \overline{C_o}(\alpha) is the ensemble average of the observations (which dispersion models are ideally supposed to predict), C_o'(\alpha) represents measurement errors due to calibration or unrepresentative siting, and C_o''(\alpha) represents stochastic fluctuations due to turbulence.

Similarly, the concentration, Cp, predicted by most models can be considered to have the following three components:

C_p = \overline{C_p}(\alpha) + C_p'(\alpha) + C_p''(\alpha),   (27)

where \overline{C_p}(\alpha) is the ensemble average predicted by most models; C_p'(\alpha) represents the errors due to model input uncertainty; and C_p''(\alpha) represents errors due to factors such as incorrect model physics, unrepresentativeness (such as comparing grid-volume averages with point measurements), and parameters (other than α) not accounted for by the model.

The problem of directly comparing observations (which are sets of realizations of ensembles) with model predictions (which are ensemble averages) arises because Co is compared with \overline{C_p}(\alpha). The ASTM (2000) procedure argues that if the effects of C_o'(\alpha), C_o''(\alpha), C_p'(\alpha), and C_p''(\alpha) all somehow average to zero, then it is more appropriate to first separately average observations and modeling results over a number of regimes (or ensembles), which can be defined by independent factors such as downwind distance and stability parameter, and then do the comparison. The goal is to group experiments into regimes of similar conditions. Averaging observations over each of these regimes provides an estimate of what most dispersion models attempt to predict, \overline{C_o}(\alpha). These regime averages of observations and predictions can then be paired to calculate, for example, the performance metrics defined in Sect. 2.3. Like the BOOT software, the ASTM procedure uses the bootstrap resampling technique to calculate the confidence limits of the performance metrics, where resampling is done within each regime, and where the observed and predicted values for the same experiment are sampled concurrently.
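As a sketch of the regime-averaging idea (not the ASTM reference implementation), suppose each observation–prediction pair carries a regime label, e.g. a hypothetical (distance bin, stability class) tuple; the pairs are averaged within each regime, and the measures of Sect. 2.3 are then computed on the regime means:

```python
import numpy as np
from collections import defaultdict

def regime_averages(co, cp, labels):
    """Average observations and predictions within each regime.

    Returns one (mean observed, mean predicted) pair per regime; these
    regime averages, rather than individual realizations, would then be
    passed to the FB/MG/NMSE/VG routines. Bootstrap resampling would
    likewise be carried out within each regime.
    """
    groups = defaultdict(list)
    for o, p, lab in zip(co, cp, labels):
        groups[lab].append((o, p))
    co_bar = np.array([np.mean([o for o, _ in pairs]) for pairs in groups.values()])
    cp_bar = np.array([np.mean([p for _, p in pairs]) for pairs in groups.values()])
    return co_bar, cp_bar
```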
field program. Therefore, the regime-average the method. First, it is important to note
represents a surrogate of a true ensemble that the plume must have been well-captured
average, and is justified because each regime by the sampling arc before any further analysis
consists of experiments conducted under similar can be conducted. This usually involves plotting
conditions. all observations along the arc and carefully in-
The ASTM (2000) procedure was initially developed with short-range dispersion field experiments in mind, but can be extended to other types of experiments with appropriate considerations. Traditionally, short-range dispersion experiments have receptors arranged in concentric arcs to maximize plume capture. Previous researchers have often used the centerline concentration to assess model performance. (This was partly motivated by regulatory requirements.) In addition to providing the rationale for the need to combine data within a regime for analysis, the ASTM procedure also suggests that because of wind shifts and concentration fluctuations, the cross-wind concentration distribution along a sampling arc is unlikely to be perfectly Gaussian, the lateral distribution assumed by most air dispersion models. These departures from the ideal Gaussian shape lead to uncertainty in defining the plume centerline position. As a result, the ASTM procedure further recommends treating all "near-centerline" observed concentrations as if they were representative of the plume centerline concentration.

One way to define near-centerline concentrations is described below, as suggested by the ASTM (2000) procedure. Figure 4 illustrates the method. First, it is important to note that the plume must have been well-captured by the sampling arc before any further analysis can be conducted. This usually involves plotting all observations along the arc and carefully inspecting these plots. Once the data are quality-assured, then consider all those measurements that are within a certain range of the plume center-of-mass (or centroid) location. Due to uncertainty in the plume centerline position, the ASTM procedure suggests that the concentration at a receptor that is within 0.67σy of the centroid location is a representative sample of the plume centerline concentration. For a Gaussian distribution, the value at a lateral distance of 0.67σy from the centerline would equal 80% of the centerline maximum value.

Fig. 4. Schematic of the definition of a near-centerline region for the ASTM procedure. A sample cross-wind concentration distribution is shown. The first moment of the distribution defines the center-of-mass (or centroid) location, yc. The second moment defines the spread, σy, of the distribution. The region, marked by dotted lines, that is within ±0.67σy of the centroid location is the near-centerline region
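The centroid and spread can be estimated from the first two moments of the arc data, after which the near-centerline receptors are selected. The sketch below is an illustration of that calculation (the names are invented, and this is not code from the ASTM guide); note that exp(−0.67²/2) ≈ 0.80, which is the source of the 80% figure quoted above:

import numpy as np

def near_centerline(y, c, half_width=0.67):
    # y: cross-wind receptor positions along the arc; c: concentrations
    yc = np.sum(y * c) / np.sum(c)                              # first moment
    sigma_y = np.sqrt(np.sum(c * (y - yc) ** 2) / np.sum(c))    # second moment
    mask = np.abs(y - yc) <= half_width * sigma_y               # near-centerline region
    return yc, sigma_y, c[mask]

# Synthetic arc profile: for an ideal Gaussian, the value at 0.67 sigma_y
# off the centroid is exp(-0.67**2 / 2) ~ 0.80 of the maximum.
y = np.linspace(-500.0, 500.0, 21)
c = np.exp(-0.5 * (y / 150.0) ** 2)
print(near_centerline(y, c))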
The BOOT software has been extended to include the ASTM method. There are many similarities between the ASTM (2000) procedure and the BOOT software described earlier, such as

– the calculation of statistical performance measures such as FB,
– the use of the bootstrap resampling technique to estimate the confidence level,
– the paired sampling between observational and predicted values, and
– the grouping of the data in blocks in BOOT and in regimes in ASTM.

However, the ASTM procedure proceeds further by

– calculating the performance measures based on regime-averages (i.e., averaging over all experiments within a regime), rather than the values for individual experiments, and
– if the variable to be evaluated is the centerline concentration, considering the near-centerline observations as representative samples of the centerline value, where the near-centerline region can be defined as within a distance (e.g., 0.67 times the lateral distance scale of the concentration distribution) from the cloud centroid location.

The ASTM (2000) procedure represents a promising approach. However, there are still some issues to be resolved and investigated before the procedure can be fully useful.
– There is a need to study the sensitivity of the evaluation results to the definition of regimes (i.e., how data are stratified).
– There is always only a limited number of regimes (e.g., 20 to 40) that can be defined, regardless of the size of the dataset. As a result, the performance measures are always determined by this limited number of regime averages. It is necessary to carefully examine the implication of accounting for only the variance in regime averages, rather than the full variance in the complete dataset.
– The ASTM procedure has so far only been demonstrated for short-range dispersion experiments with concentric sampling arcs, where multiple observed concentrations near the plume centerline are assumed to represent all possible values of the centerline values. However, many other mesoscale or long-range field experiments do not have similar arc-wise configurations.
function (PDF) of Cp, which is assumed to be a random
variable. The PDF can be estimated by higher-order turbu-
2.8 Cumulative Distribution Function lence closure models, or by a Monte Carlo analysis that
(CDF) method involves many sensitivity model runs, in addition to the
base model run. Solid curve represents the model residuals,
Some researchers such as Lewellen et al (1985), Co  Cp , for the base run. Each Monte Carlo run creates a
Lewellen and Sykes (1989), and Weil et al (1992) new version of the residual CDF. A sufficient number of
suggest that, since observations are sampling a Monte Carlo runs then allow the estimation of the 95%
confidence bounds, as indicated by dashed curves
random process and almost all dispersion models
predict some sort of ensemble averages, a deter-
ministic evaluation framework is not appropriate. obtained by replacing all negative values in a
This argument is similar to the premise for the general Gaussian distribution with zeros.
ASTM approach mentioned above. Lewellen et al Besides the use of various analytical distribu-
(1985) devised a method to test the hypothesis tion functions to define the PDF, a second option
that, at any location and time, the observed con- is to generate the required PDF by means of a
centration, Co, is simply a sample taken from the Monte Carlo analysis (e.g., Hanna and Davis,
probability density function (PDF) of the pre- 2002). Figure 5 provides a schematic of the
dicted concentration, Cp. In this method, Cp is CDF approach based on a Monte Carlo analysis
treated as a random variable, whereas most air (Hanna and Davis, 2002). The solid curve repre-
dispersion models give ensemble mean predic- sents the CDF of a series of model residuals,
tions, Cp . If a model predicts large differences Co  Cp . In this example, the CDF is constructed
(or residuals) between Co and Cp , then it is desir- after rank ordering the differences between the
able to check whether the differences are within overall maximum observed and predicted hourly
the expected 95% uncertainty range of Cp. One ozone concentrations at N sampling stations for a
method to generate the required PDF and its 95% high-ozone episode. Each Monte Carlo sensitiv-
range is to use higher-order turbulence closure ity run creates a new version of the CDF. With a
models such as SCIPUFF (Sykes et al, 1998), sufficient number (say, 100) of Monte Carlo
which provide predictions of the concentration runs, the 95% confidence bounds of the CDF
mean and variance. The shape of the PDF is can then be estimated, shown by dashed curves
assumed by Lewellen and Sykes (1986) to in Fig. 5. Since the solid curve is bounded by
follow the clipped normal distribution, which is dashed curves in Fig. 5, it can be concluded that
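A minimal sketch of drawing samples from a clipped normal is given below (the parameters are invented for illustration; matching the underlying Gaussian parameters to a predicted clipped mean and variance requires solving moment equations, which is omitted here):

import numpy as np

rng = np.random.default_rng(0)
m, s = 1.0, 1.5                 # assumed parameters of the underlying Gaussian
samples = np.maximum(0.0, rng.normal(m, s, size=100_000))   # clip negatives to zero
lo, hi = np.percentile(samples, [2.5, 97.5])
print(f"95% range of the clipped normal: [{lo:.2f}, {hi:.2f}]")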
Besides the use of various analytical distribution functions to define the PDF, a second option is to generate the required PDF by means of a Monte Carlo analysis (e.g., Hanna and Davis, 2002). Figure 5 provides a schematic of the CDF approach based on a Monte Carlo analysis (Hanna and Davis, 2002). The solid curve represents the CDF of a series of model residuals, Co − Cp. In this example, the CDF is constructed after rank ordering the differences between the overall maximum observed and predicted hourly ozone concentrations at N sampling stations for a high-ozone episode. Each Monte Carlo sensitivity run creates a new version of the CDF. With a sufficient number (say, 100) of Monte Carlo runs, the 95% confidence bounds of the CDF can then be estimated, shown by dashed curves in Fig. 5. Since the solid curve is bounded by the dashed curves in Fig. 5, it can be concluded that model performance is consistent with the estimated uncertainty, or that the model is performing as well as expected in light of the stochastic or uncertain values of Cp.

Fig. 5. Schematic of the use of the cumulative-distribution-function (CDF) approach for model evaluation. Co is observation, and Cp is model prediction. In this approach, Co is treated as a sample taken from the probability density function (PDF) of Cp, which is assumed to be a random variable. The PDF can be estimated by higher-order turbulence closure models, or by a Monte Carlo analysis that involves many sensitivity model runs, in addition to the base model run. The solid curve represents the CDF of the model residuals, Co − Cp, for the base run. Each Monte Carlo run creates a new version of the residual CDF. A sufficient number of Monte Carlo runs then allows the estimation of the 95% confidence bounds, as indicated by the dashed curves
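The envelope construction can be sketched with synthetic data as follows (an assumed setup, not the Hanna and Davis (2002) analysis; the lognormal perturbations below merely stand in for sensitivity runs over uncertain inputs):

import numpy as np

rng = np.random.default_rng(1)
n_stations, n_runs = 30, 100
co = rng.lognormal(4.0, 0.5, n_stations)            # synthetic observations
cp_base = co * rng.lognormal(0.0, 0.4, n_stations)  # base-run predictions

base_cdf = np.sort(co - cp_base)                    # rank-ordered residuals
mc_cdfs = np.empty((n_runs, n_stations))
for k in range(n_runs):
    cp_k = cp_base * rng.lognormal(0.0, 0.3, n_stations)  # one perturbed run
    mc_cdfs[k] = np.sort(co - cp_k)                       # its residual CDF

lower, upper = np.percentile(mc_cdfs, [2.5, 97.5], axis=0)  # 95% bounds
inside = np.all((base_cdf >= lower) & (base_cdf <= upper))
print("base-run residual CDF inside 95% bounds:", inside)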
The CDF approach mainly checks whether model residuals are consistent with our expectation of model uncertainty. It does not provide quantitative information on model performance, and should be applied in conjunction with other evaluation techniques, such as those mentioned above. There are also other issues that need to be investigated further. For example, the turbulence closure models only address concentration fluctuations due to turbulent wind fields, and cannot adequately address the important spatial and temporal correlation information (Hanna and Davis, 2002; Sykes, 2002). Monte Carlo analysis is more robust in accounting for this correlation information and for uncertainties due to other model inputs such as emissions and chemical reaction rate constants. On the other hand, Monte Carlo analysis cannot easily estimate turbulent contributions from smaller-scale processes.
2.9 Summary of model evaluation methods

Various methodologies for evaluating atmospheric dispersion model performance have been presented and discussed. It is recommended that any model evaluation exercise should start with clear definitions of the evaluation goal and the variables to be evaluated, followed by exploratory data analysis, and then statistical performance evaluation. Exploratory data analysis involves the use of many types of plots, including scatter plots, quantile–quantile plots, box-residual plots, and scatter-residual plots, where the residual refers to the ratio of the predicted to observed values. The first two types of plots give an overall assessment of model performance. The last two types of plots are useful in identifying potential flaws in model physics, as indicated by trends of model residuals with independent variables.

The BOOT software package calculates a set of performance measures (or metrics), including the fractional bias (FB), the geometric mean (MG), the normalized mean square error (NMSE), the geometric variance (VG), the correlation coefficient (R), and the fraction of data where predictions are within a factor of two of observations (FAC2). The FB and MG measure systematic bias, whereas NMSE and VG measure systematic bias and random scatter. There is not a single performance measure that is universally applicable to all situations, and a balanced approach is required to look at a number of performance measures. For dispersion modeling, where concentrations can easily vary by several orders of magnitude, MG and VG are preferred over FB and NMSE. However, MG and VG may be strongly influenced by very low values, and are undefined for zero values. It is recommended that the instrument threshold, such as the limit of quantitation (LOQ), be used as a lower threshold in calculating MG and VG. The R is probably not a very robust measure because it is sensitive to a few outlier data pairs. Furthermore, because measurements are commonly available in concentric arcs for short-range dispersion field experiments, there is already a pattern in the dataset, i.e., concentration decreasing with downwind distance. Since any reasonable dispersion model would be able to reproduce this pattern, R often mainly reflects this agreement, and is thus not that informative. The FAC2 is probably the most robust performance measure, because it is not overly influenced by either low or high outliers.
Since NMSE and VG measure both systematic bias and random (unsystematic) scatter, it is recommended that they be further partitioned to determine the fractional contributions of the systematic and unsystematic components. It is also recommended that FB, NMSE, MG, and VG be further interpreted by translating them into a quantity (e.g., the equivalent factor-of-N difference between predictions and observations) that is more easily understood.

Bootstrap resampling can be used to estimate the confidence limits of the performance measures, in order to address questions such as (1) whether the FB for Model-A is significantly different from zero, and (2) whether the FB for Model-A and the FB for Model-B are significantly different.
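For instance, a percentile-bootstrap confidence interval for FB can be sketched as follows (illustrative only; the resampling scheme in BOOT differs in detail):

import numpy as np

def bootstrap_fb_ci(co, cp, n_boot=1000, seed=0):
    rng = np.random.default_rng(seed)
    co, cp = np.asarray(co, float), np.asarray(cp, float)
    n = co.size
    fbs = np.empty(n_boot)
    for k in range(n_boot):
        i = rng.integers(0, n, n)   # resample (Co, Cp) pairs together
        fbs[k] = (co[i].mean() - cp[i].mean()) / (0.5 * (co[i].mean() + cp[i].mean()))
    return np.percentile(fbs, [2.5, 97.5])

The FB of a model is judged significantly different from zero if the interval excludes zero; for two models, the same resampling can be applied to the paired difference of their FBs.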
In addition to the six performance measures included in the BOOT software, there are also other performance measures that can be defined, such as the normalized standard deviation (NSD) and the normalized root mean square error (NRMSE). It can be shown that NSD, NRMSE, and R satisfy a geometric relationship based on the law of cosines. This relationship allows the plotting of the three performance measures in a two-dimensional nomogram.
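Assuming NSD = σp/σo and taking NRMSE as the centered (bias-removed) RMSE normalized by σo (exact definitions vary between studies), the law-of-cosines relationship (Taylor, 2001) can be written as

\[
\mathrm{NRMSE}^2 \;=\; 1 \;+\; \mathrm{NSD}^2 \;-\; 2\,\mathrm{NSD}\,R .
\]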
The figure of merit in space (FMS) and the measure of effectiveness (MOE) can also be used to measure model performance. The FMS and MOE are naturally more appropriate for fields that can be easily contoured. As previously mentioned, short-range dispersion field experiments typically have the concentration data measured along concentric arcs. This type of data is often not suitable for contouring. In order to implement the MOE, one method suggested by Warner et al (2001) is to replace area estimates with straight data summation. In this case, it can be shown that the MOE and FB are in fact mathematically related. The definitions of the false negative and false positive parts of FB allow their sum to equal the normalized absolute error (NAE).
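Under the data-summation option, the relationship takes the following form, consistent with Eq. (25) (a restatement; here the sums run over all paired data points):

\[
\mathrm{MOE}_x = \frac{\sum \min(C_o, C_p)}{\sum C_o},\qquad
\mathrm{MOE}_y = \frac{\sum \min(C_o, C_p)}{\sum C_p},
\]
\[
\mathrm{FB} = \frac{2\,(\mathrm{MOE}_y - \mathrm{MOE}_x)}{\mathrm{MOE}_x + \mathrm{MOE}_y},\qquad
\mathrm{FB}_{\mathrm{fn}} + \mathrm{FB}_{\mathrm{fp}} = \mathrm{NAE}.
\]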
Many researchers have suggested the inadequacy of a deterministic evaluation framework, because observations are realizations of ensembles and model predictions often represent ensemble averages. The American Society for Testing and Materials (ASTM, 2000) approach suggests (1) grouping experiments under similar conditions (regimes), (2) averaging predictions and observations over each regime, and (3) calculating the performance measures based on these regime averages. The ASTM procedure also recommends the use of bootstrap resampling to estimate the confidence limits. Because there are many similarities between the BOOT and ASTM methodologies, the former can be easily extended to also include the latter.

Another way to evaluate model performance in a stochastic framework is to assume that the observed concentration is simply a random sample taken from the probability density function (PDF) of the predicted concentration. The PDF can be estimated by such techniques as higher-order turbulence closure schemes and Monte Carlo analysis. The cumulative distribution function (CDF) of model residuals (observations minus predictions) can then be plotted to see whether it lies inside the confidence bounds given by the PDF. If so, then large model residuals are said to be consistent with predictions after accounting for uncertainties. The CDF approach is qualitative, and should be used in conjunction with other more quantitative techniques.
2.10 Typical magnitudes of performance measures

There have been many applications of various subsets of the above model evaluation methodologies to a variety of models and scenarios. This experience allows "typical magnitudes" of performance measures to be estimated, which could be used to develop model acceptance criteria. Because a given model's performance usually varies from one field site to the next (typically there may be a 20% mean bias towards overprediction at one site and a 30% mean bias toward underprediction at another site), the most useful model evaluation studies are those that look at a number of models and a number of field sites. Another important consideration is that model performance will vary with factors such as complexity of the scenario, uncertainty in the source term, and type and amount of meteorological data. Most of the reported model evaluation studies are associated with "research grade" field experiments with good instruments, uncomplicated terrain, and simple source scenarios. And the experimentalists usually pack up and go home if it starts raining. Only a few of the reported model evaluation studies make use of routine monitoring networks and long time periods (e.g., one year).

One of the more comprehensive recent model evaluation documents is the EPA's evaluation of their new model, AERMOD (American Meteorological Society/Environmental Protection Agency Regulatory Model Improvement Committee Dispersion Model), with numerous field data sets, representing both research-grade tracer experiments and routine monitoring networks around industrial stacks (Paine et al, 1998). The evaluations also include the existing model, ISC (Industrial Source Complex Dispersion Model), which AERMOD is intended to replace. Because of the emphasis of the EPA regulations on the highest concentrations, the Paine et al (1998) evaluations are focused more on the high-concentration region of quantile–quantile plots. Concentration averaging times are 1 hr and 24 hrs. It is seen that AERMOD has better agreement than ISC at higher concentrations, and also improved performance over the middle (near the 50th percentile) region of the concentration distribution. In general, the new model is able to predict the maximum concentrations within about ±20 to ±50%, although there is variation from site to site.

Hanna et al (2000) also evaluate AERMOD and ISC, as well as ADMS (Atmospheric Dispersion Modeling System), using data from five field studies (four tracer experiments and one routine monitoring network). This study brought up the issue of how to use multiple model performance measures from multiple sites to arrive at a single conclusion. It was decided to list the performance measures and then rank the models for each experiment and performance measure, and come up with a final "score". The types of results that would go into this ranking are listed as an example in Table 1. The data represent the maximum concentration on a monitoring arc and are therefore not paired in space. It is seen that ADMS and AERMOD both overpredict by about 20 or 30% at the high end and at the median, on average, and their scatter is slightly more than a factor of two. These two models' predictions are within a factor of two of the observations about 50% of the time. The ISC model, on the other hand, does not have as good performance across all performance measures except MG (related to mean bias). However, the fairly good mean bias for ISC is actually caused by compensating errors.

Table 1. Median performance measures over five field experiments for three models (from Hanna et al, 2000)

                 ISC3    ADMS    AERMOD
Max Cp/Max Co    6.7     0.80    0.77
MG               0.70    1.22    1.7
VG               7.7     2.4     2.9
FAC2             0.33    0.53    0.46

Similar conclusions are found in the review paper by Weil et al (1992) on air quality model evaluation. They stress that, because of variations in wind direction, it is almost fruitless to attempt to compare predictions and observations paired in space and time. They point out that the predicted and observed plume shapes and contours may be in very good agreement, but a displacement of 20 degrees may cause the two sets of contours to not overlap at all. For this reason, they stress the use of quantile–quantile plots applied to data unpaired in time or space (perhaps paired by downwind distance arc). For regulatory applications, where the highest and second highest concentrations, regardless of their locations, are typically of interest, the unpaired in space or time comparisons are usually sufficient. However, it is also recognized that there are applications such as emergency response, homeland security, and environmental justice, where it is necessary to predict the exposure to a population, in which case not only the magnitude, but also the location and shape of the concentration field are important. This is the reason why, despite the challenge due to uncertainty in wind direction, there is nevertheless an interest in the paired in space or time comparisons.

Olesen (2001) reviews the numerous applications of the Model Validation Kit, which includes the BOOT software, the ASTM (2000) software, and many field experiment data sets. These studies show that, for the better models, the typical relative mean bias is about ±40% and the amount of random scatter is about a factor of two of the mean. Most of these field data sets involved short-range (up to the order of 10 km or so) dispersion experiments, maximum concentrations along a sampling arc but paired in time, and flat to rolling terrain.

Chang et al (2003) investigated the performance of three models over two mesoscale tracer experiments, where it was shown that wind direction variations over the 30 km domain caused errors in model predictions at specific monitor locations. The general conclusion was that the two better models had mean biases within ±35%, relative random scatter of about a factor of three or four, and 40 to 60% of the predictions (of the maximum concentration on an arc) within a factor of two of the observations.

Hanna (1988; 1993) presented overviews of many model evaluation exercises (about 20 models and about 20 field sites) over short ranges (less than a few km) to conclude that the typical relative mean bias was about ±20 to ±50% and the typical relative scatter was about 60 to 80%. It was suggested that the minimum values of about ±20% for mean bias and about 60% for relative scatter represent the best achievable performance of air quality models. Random turbulence (natural or inherent uncertainty) prevents any further improvement in models.

Klug et al (1992) evaluated the performance of many long-range transport models applied to the ETEX tracer release experiment in Europe, where the tracer cloud was tracked for over 1000 km. It was found that the unpaired relative biases and scatter were similar to those found for the short-range experiments reported above, but that the wind shears, fronts, and rain patterns associated with "real weather" at those scales could cause large displacements in the positions of the predicted and observed clouds.

To conclude, "good" performing models appear to have the following typical characteristics, based on unpaired-in-space comparisons:

– The fraction of model predictions within a factor of two of observations is about 50%.
– The mean bias is within ±30% of the mean.
– The random scatter is about a factor of two to three of the mean.

Of course these numbers will be revised as more evidence appears from new model evaluation exercises.
3. Test case – Salt Lake City Urban 2000 IOP09

The application of the model performance evaluation methods described in Sect. 2 can best be illustrated using a test case. Two versions of a simple urban baseline dispersion model (Britter and Hanna, 2003; Hanna et al, 2003) are applied to a single Intensive Operating Period (IOP) during the Salt Lake City Urban 2000 field experiment.

3.1 Description of field experiment

The Urban 2000 flow and dispersion experiment in the Salt Lake City area took place in September and October, 2000 (Allwine et al, 2002). There were six nights of experiments; on each night, three one-hour SF6 releases were made at two-hour intervals from a source near street level in the downtown area. The current test case involves only one of the six Intensive Operating Periods (IOPs) – IOP09, which took place the night of 20–21 October 2000, with 2 g/s releases at 2100, 2300, and 0100 LST. IOP09 was the experiment with the highest average wind speed – 2.64 m/s, when averaged over six downtown anemometers. All the anemometers were mounted on top of buildings, whose heights are between 4 and 20 m, except for one building that is 121 m tall. Figure 6 shows the test domain, which covers a 14 km square area. The release point is marked by a star near the middle of the domain, and the SF6 monitors are marked as black dots. Three SF6 sampling arcs are visible at distances of about 2, 4, and 6 km to the northwest of the release point. In addition, in the 1.3 km square area known as the "Downtown Domain", there were grids of monitors located at block intersections and midway along the blocks. These monitors were used to define four additional "arcs" at distances from about 0.15 km to 1 km. The majority of SF6 concentrations were reported as 30 minute averages over a six-hour period during each night, allowing data from all three releases to be reasonably captured. Some of the meteorological monitors are also shown in Fig. 6.

Fig. 6. Map of the Salt Lake City domain used in the Urban 2000 study, showing the locations of the release, the surface meteorological monitors, the upper air sites, and the SF6 samplers
The Salt Lake City (SLC) National Weather Service (NWS) anemometer is at the airport in the northwest corner of the figure. The N01 surface anemometer, the N02 sodar, and the N03 profiler sites are located in a rural area about 6 km upwind of the urban area. The M02 anemometer is at the top of a 121 m building, and the D11 square marks a sodar at the top of a 36 m building. The average building height, Hb, is about 15 m, and the surface roughness length, zo, is estimated to be about 2 m.

The distributions of the observed 30 minute-averaged concentrations on each of the seven monitoring arcs were plotted, and the maximum concentration, Cmax, and the lateral standard deviation of the concentration distribution, σy, were calculated if there were sufficient data. In some cases there were problems, because the concentrations were all below the threshold of about 45 ppt, or there was perhaps only a single high observation, or the plume was obviously on the edge of the network. The concentration data were also analyzed for continuity in space and time.

The focus of this test case is on two types of comparisons:

(1) the normalized one-hour averaged arc-maximum concentration, Cmax/Q, anywhere on an arc during the passage of the cloud from each of the three trials;
(2) the normalized one-hour averaged maximum concentration, C/Q, at each monitor location during each of the three trials.

Comparisons with data set 1 are called "unpaired in time and space", while comparisons with data set 2 are called "unpaired in time but paired in space". Actually there is some pairing in time, since the observed one-hour average is taken from a two to three hour sampling period when the SF6 cloud was passing through the network. The observed maximum could have been from 2130 to 2230 LST, while the predicted maximum is assumed to be steady-state in this exercise. Also, in Comparison 1, there is pairing in along-wind distance (i.e., by monitoring arc), even though there is not necessarily pairing in cross-wind distance.

Table 2 contains the observed and predicted Cmax/Q values, in units of 10⁻⁶ s/m³, for each monitoring arc in the three trials in IOP09. This table would be used for the evaluations made as part of Comparison 1 above. The right side of the table contains the observed and predicted lateral dispersion parameter σy. Concentration data were also available from 66 monitor locations, for use in the evaluations made as part of Comparison 2 above, but these numbers are not listed in this paper.

3.2 Model evaluation results

Table 2 contains the predictions of Cmax/Q and σy for two simple urban baseline dispersion models, both of which have a Gaussian structure. Hanna et al (2003) describe the model equations. Both models assume neutral stability in urban areas. It is not necessary to know the precise structure of the two models, since they are used only for demonstration of the model performance measures. Because the wind speed cancels out of Model A, its predictions of Cmax/Q and σy are the same for each trial. Model B does account for variations in wind speed and therefore yields different predictions of Cmax/Q for each trial.

The results of the evaluation of Models A and B with the IOP09 SF6 tracer data are given in Fig. 7 and Table 3.
Table 2. Test case observations and predictions of arc-maximum normalized concentration, Cmax/Q, for IOP09 from Salt Lake City Urban 2000. The numbers 1, 2, and 3 refer to the separate release trials in IOP09. Models A and B are from the Hanna et al (2003) study. Model A assumes that the predictions of C/Q are mostly independent of u and employs the all-IOP average wind speed of 1.39 m/s. Model B assumes the actual wind speed for each release trial. For IOP09, the wind speeds are 2.69 m/s, 2.47 m/s, and 3.23 m/s for trials 1, 2, and 3, respectively (trial average 2.64 m/s). Observed and predicted σy are listed

Cmax/Q (10⁻⁶ s/m³):

Arc  x (m)   Obs      Obs      Obs      Obs        Model A pred     Model B pred  Model B pred  Model B pred
             trial 1  trial 2  trial 3  trial avg  (all-trial avg)  trial 1       trial 2       trial 3
1    156     129      244      115      163        229              141.7         154.4         118
2    394     49.7     56.8     37.3     47.9       52.3             30            32.6          25
3    675     44.9     32.5     11       29.5       21.2             12.2          13.3          10.2
4    928     4.77     n/a      7.56     6.17       12.5             7.17          7.81          5.98
5    1974    n/a      n/a      2.63     2.63       3.71             2.14          2.33          1.78
6    3907    0.63     1.06     1.5      1.06       1.36             0.784         0.854         0.653
7    5998    0.49     1.01     n/a      0.75       0.76             0.439         0.478         0.365

σy (m):

Arc  Obs      Obs      Obs      Obs        Model A pred     Model B pred
     trial 1  trial 2  trial 3  trial avg  (all-trial avg)  (all-trial avg)
1    76       56       49       60         34.7             29
2    137      123      120      127        73.4             66.2
3    153      170      150      158        115              103
4    224      189      228      214        150              134
5    n/a      256      n/a      256        273              244
6    503      494      n/a      499        447              398
7    820      1075     n/a      948        593              529

Table 3. Results of model performance evaluation for Models A and B for Comparison 1 (Arc Max, last row) and for four alternate methods of addressing the data in Comparison 2. High and 2nd high C/Q values are in units of 10⁻⁶ s/m³

Data selection                                  Model  FB      FBfn   FBfp   NMSE  MG     VG      FAC2  R     High  2nd high  N
All Co and Cp, paired in space                  A      −0.043  0.210  0.253  3.47  3.23   233     0.57  0.93  229   229       147
                                                B       0.518  0.544  0.026  3.47  4.28   561     0.55  0.95  154   142
Co > 45 ppt and Cp > 45 ppt, paired in space    A      −0.134  0.131  0.265  0.86  1.19   4.1     0.56  0.92  229   229       34
                                                B       0.431  0.459  0.028  0.85  2.70   21.4    0.47  0.94  154   142
Co > 45 ppt and all Cp, paired in space         A      −0.041  0.210  0.253  1.68  8.92   48974   0.28  0.93  229   229       71
                                                B       0.518  0.544  0.026  1.68  15.52  306520  0.24  0.95  154   142
All Co and Cp > 45 ppt, paired in space         A      −0.135  0.130  0.265  0.91  1.08   4.3     0.53  0.92  229   229       36
                                                B       0.431  0.459  0.028  0.90  2.47   18.4    0.47  0.94  154   142
Arc Max                                         A      −0.230  0.066  0.296  0.60  0.84   1.2     0.89  0.91  229   229       21
                                                B       0.294  0.318  0.024  0.46  1.48   1.4     0.78  0.95  154   142
Figure 7 contains scatter plots of predicted (for Models A and B) and observed C/Q for paired-in-space C/Q (top) and for arc-maximum Cmax/Q (bottom) comparisons. The scatter plots for paired C/Q on the top consist of 34 points where both the observed (Co/Q) and predicted (Cp/Q) values exceed 0.12 × 10⁻⁶ s/m³ (or C > 45 ppt). The scatter plots of Cmax/Q on the bottom consist of 18 points (seven monitoring arcs times three trials equals 21, minus three trial-arcs with insufficient data). It can be seen that the plots for the 18 arc-maximum values, Cmax/Q, show less scatter than the plots for the many monitors, because the arc-maximum values do not impose pairing in space and they include only the highest concentrations in the center of the plume. Model A appears to exhibit a slight overprediction and Model B a slight underprediction. The scatter is seen to be generally within a factor of five to ten for the paired data (top) and within a factor of two for the arc maxima (bottom). However, it must be remembered that the seven monitoring arcs cover a distance range from about 160 m to 6000 m, and so the predictions and observations both show a three-order-of-magnitude decrease from close distances (with high concentrations) to far distances (with low concentrations).

Fig. 7. Scatter plots of observed (Co/Q) and predicted (Cp/Q) concentrations (10⁻⁶ s/m³) for IOP 9 of Urban 2000: a Model A paired in space, and b Model B paired in space, when Co/Q and Cp/Q are both greater than 0.12 × 10⁻⁶ s/m³ (or 45 ppt when not normalized by the emission rate); c Model A arc maximum, and d Model B arc maximum. The number of points (N) is also indicated in each frame

The plots with all 147 points, including many cases with Co and/or Cp less than 45 ppt, are not shown here. They exhibit much more scatter, especially at low concentrations, since there are many monitors on the edges of the observed or predicted plumes. The model performance measures are influenced by assumptions regarding the minimum or threshold concentration. For this experiment, concentrations below 45 ppt (0.12 × 10⁻⁶ s/m³ when normalized by the emission rate) were considered to be rather uncertain. Also, because the global background concentration of SF6 is 3 ppt (0.01 × 10⁻⁶ s/m³ when normalized by the emission rate), and that value was added to all concentrations predicted by the models, there are 63 points at 0.01 × 10⁻⁶ s/m³. These 63 points will be seen to affect some of the statistical performance measures. It can be shown that the model performance measures are strongly affected by whether we add the global background of 3 ppt to the model predictions or subtract 3 ppt from the observations.
Many of the statistical model performance measures discussed in Sect. 2 are calculated for the data in the scatter plots. Table 3 contains the results for Models A and B for Comparison 1 (Arc Max) and for four alternate methods of addressing the data in Comparison 2:

– Method 1 (top row of the table) includes "all Co and Cp, paired in space", no matter how low the concentrations. There are 147 points.
– Method 2 (second row of the table) includes only monitors where both Co and Cp exceed the threshold of 45 ppt (or C/Q = 0.12 × 10⁻⁶ s/m³). There are 34 points.
– Method 3 (third row of the table) includes only monitors where Co > 45 ppt, but Cp can have any value. There are 71 points.
– Method 4 (fourth row of the table) includes only monitors where Cp > 45 ppt, but Co can have any value. There are 36 points.

These four methods are applied in order to show the range of the results depending on the assumptions concerning low concentrations. Note that there are 71 points with Co > 45 ppt but only 36 points with Cp > 45 ppt, reflecting the fact that the predicted plume width is too small. Probably the one-sided comparisons are less valid in rows three and four of the table (where one of Co or Cp is not allowed to drop below 45 ppt, while the other one is allowed to have any value).
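The four selection rules can be sketched as masks over the paired data (illustrative names; thr corresponds to 45 ppt, i.e., C/Q = 0.12 × 10⁻⁶ s/m³):

import numpy as np

def threshold_masks(co, cp, thr):
    co, cp = np.asarray(co), np.asarray(cp)
    return {
        "method 1: all pairs":          np.ones(co.size, dtype=bool),
        "method 2: Co>thr and Cp>thr":  (co > thr) & (cp > thr),
        "method 3: Co>thr, any Cp":     co > thr,
        "method 4: Cp>thr, any Co":     cp > thr,
    }

# Each mask selects the (Co, Cp) pairs passed to the performance measures,
# e.g., boot_metrics(co[m], cp[m]) for each mask m.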
Despite the range in some of the performance measures in Table 3, due to the various methods of considering low concentrations, the overall results in the table appear to be fairly conclusive. Overall, the values of FB suggest that Model A overpredicts by a slight amount (about 10 to 20%), while Model B underpredicts by about 20 to 40%. The fraction of predictions within a factor of two, or FAC2, is slightly less for Model B than for Model A, and is about 0.8 for the arc-maximum values and about 0.5 for all monitors. The NMSE less than unity indicates that the magnitude of the scatter is less than the mean concentration. The scatter, as measured by NMSE or VG, appears to be more strongly influenced by the assumption of which monitors to use and which low-concentration threshold to select. This result has been found in other studies, and is caused by the fact that outliers have a stronger influence when they are squared in the calculation of the performance measure. In the case of VG, where the logarithm of C is taken, the assumption regarding the threshold can dominate the calculated VG, causing variations in calculated VG of several orders of magnitude. The correlation, R, is always between 0.90 and 0.95, due to the dominant influence of the decrease in concentrations with distance from the source, and is therefore not very useful as a method to discern model performance. As mentioned previously, the sum of the false negative and false positive components of FB (i.e., FBfn and FBfp) yields the normalized absolute error (NAE). The values of NAE (not shown in Table 3) indicate that both Models A and B have a comparable absolute error that is about 30 to 40% of the mean; Model B's NAE is due mostly to persistent underpredictions, whereas Model A's NAE is due more to overpredictions than to underpredictions.

Table 3 also contains the predicted "high" and "2nd high" C/Q values, to show whether the models can match the worst-case observed concentrations. The observed high and 2nd high C/Q are 244 and 129 × 10⁻⁶ s/m³ (see Table 2), while the Model A predictions are 229 and 229, and the Model B predictions are 154 and 142 (see Table 3). These predictions are relatively close to the observations, when compared to previous model evaluations. The relative difference between Models A and B is consistent with that found for the mean relative bias calculations.

The relative magnitudes of the false negatives (underpredictions) and false positives (overpredictions) can be seen in Table 3 by looking at the FBfn and FBfp. As shown in Sect. 2, FB = FBfn − FBfp. On the other hand, the sum FBfn + FBfp is the normalized absolute error, which is a more robust measure of scatter than NMSE or VG.
Table 4 addresses the question of whether the performance measures are significantly different from zero. This table focuses on the fractional bias, FB, although any performance measure could be used, and the numerical value in question may be different for different performance measures. The models (A and B), data set, and other assumptions are the same in Table 4 as in Table 3. Note that there are five rows in the table, related to the four ways of defining the minimum C for the paired-in-space data set, plus the arc-maximum data in the last row, where the minimum C is never reached and so is not of concern. Because there is not a "*" in any box for Model A, it follows that the FB for Model A is not significantly different from zero (at the 95% confidence level). However, because there is a "*" in all the boxes for Model B, the FB for that model is significantly different from zero. Also of interest is the model-to-model comparison in the last column, where it is concluded that the FB for Model A is significantly different from the FB for Model B. This information is routinely output from BOOT.

Table 4. Results of significance testing to determine (1) whether the FB for Models A and B (FB_Model A and FB_Model B) are significantly different from zero, and (2) whether the difference in FB between Models A and B (ΔFB) is significantly different from zero, for Comparison 1 (Arc Max, last row) and for four alternate methods of addressing the data in Comparison 2. A "*" indicates that the result is significant at the 95% confidence level

Data selection                                  FB_Model A  FB_Model B  ΔFB
All Co and Cp, paired in space                              *           *
Co > 45 ppt and Cp > 45 ppt, paired in space                *           *
Co > 45 ppt and all Cp, paired in space                     *           *
All Co and Cp > 45 ppt, paired in space                     *           *
Arc Max                                                     *           *

Three types of plots commonly used in statistical model evaluation exercises are given in Figs. 8–10.

Figure 8 contains quantile–quantile plots for Model A (left) and Model B (right), where only monitors with both Co and Cp exceeding 45 ppt (or C/Q of 0.12 × 10⁻⁶ s/m³) are considered. To create these plots, all observed data are rank-ordered, all predicted data are rank-ordered, and each point on the plot represents a particular rank number (e.g., the maximum Co and the maximum Cp). As previously mentioned, the goal of the quantile–quantile plot is to see whether the distributions of all the observed and predicted values as a whole are comparable, where it is not necessary to require that, for example, the maximum Co and the maximum Cp take place under the same conditions. The general tendency for underprediction by Model B at all parts of the distribution can be seen, although Model B does well with the three highest concentrations. The distribution suggests little bias for Model A over most of the range.

Fig. 8. Quantile–quantile (Q–Q) plots of observed (Co/Q) and predicted (Cp/Q) concentrations (10⁻⁶ s/m³) when Co/Q and Cp/Q are separately ranked: (a) Model A for all sampler locations, and (b) Model B for all sampler locations, when Co/Q and Cp/Q are both greater than 0.12 × 10⁻⁶ s/m³ (or 45 ppt when not normalized by the emission rate). The number of points (N) is also indicated in each frame
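The rank-ordering construction can be sketched as follows (illustrative code with synthetic data, not the plotting script used for Fig. 8):

import numpy as np
import matplotlib.pyplot as plt

def qq_plot(co, cp, ax):
    co_r, cp_r = np.sort(co), np.sort(cp)   # rank-order each sample separately
    ax.loglog(co_r, cp_r, "o")              # pair by rank, not by monitor
    lo = min(co_r.min(), cp_r.min())
    hi = max(co_r.max(), cp_r.max())
    ax.plot([lo, hi], [lo, hi], "k-")       # one-to-one line
    ax.set_xlabel("Co/Q")
    ax.set_ylabel("Cp/Q")

fig, ax = plt.subplots()
rng = np.random.default_rng(2)
co = rng.lognormal(0.0, 1.0, 34)                         # synthetic observations
qq_plot(co, 0.7 * co * rng.lognormal(0.0, 0.3, 34), ax)  # biased predictions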
Figure 9 contains residual plots of Cp/Co as a function of arc distance from the source, for the same data set as used in Fig. 8. Both figures suggest little overall trend with distance, although some arc distances (e.g., 400 m) show an underprediction and other arc distances (e.g., 6 km) show an overprediction. The underprediction at the 400-m arc might be due to the fact that the arc is in the downtown area, with a number of tall buildings. Nevertheless, these plots also clearly show the points where the predictions are within a factor of two of the observations (i.e., the points within the range defined by the dashed lines).

Fig. 9. Residual (ratio of model prediction to observation) plots for IOP 9 of Urban 2000 as a function of arc distance (m): (a) Model A for all sampler locations, and (b) Model B for all sampler locations, when Co/Q and Cp/Q are both greater than 0.12 × 10⁻⁶ s/m³ (or 45 ppt when not normalized by the emission rate). Dashed lines indicate factor-of-two scatter

Figure 10 is an "FB versus NMSE" plot, which is often used as a single plot to indicate overall relative model performance. The perfect model would have FB = NMSE = 0.0. Models A and B are the same ones used before, but this time the emphasis is only on data monitors where both Co and Cp exceed 45 ppt (the same as in the previous two plots). The figure suggests that Model A and Model B have similar scatter (NMSE) but that Model A has a lower relative mean bias (FB).

Fig. 10. FB, together with the 95% confidence intervals, and NMSE for Models A and B, when Co/Q and Cp/Q are both greater than 0.12 × 10⁻⁶ s/m³ (or 45 ppt when not normalized by the emission rate). The parabola indicates the minimum NMSE for a given FB
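The parabola can be derived from the definitions of FB and NMSE (a sketch of the algebra; the curve plotted in Fig. 10 is assumed to follow the same relation). If all random scatter is removed so that only a systematic bias remains, then

\[
\mathrm{NMSE}_{\min}
= \frac{\left(\overline{C_o}-\overline{C_p}\right)^{2}}{\overline{C_o}\,\overline{C_p}}
= \frac{4\,\mathrm{FB}^{2}}{4-\mathrm{FB}^{2}}
\;\approx\; \mathrm{FB}^{2}\quad\text{for small }\mathrm{FB}.
\]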
4. Conclusions

This paper investigates methodologies for evaluating the performance of dispersion (air quality) models. Dispersion models are used to predict the fate of gases and aerosols after they are released into the atmosphere. For example, these pollutants can come from routine industrial emissions, accidental releases, and intentional releases of harmful chemical and biological agents. The dispersion model results have been used in applications such as granting permits and preparing risk management programs for industrial facilities, designing emissions control strategies for regions that are not in compliance with existing air quality standards, and conducting forensic studies of historical events (e.g., dosage reconstruction for areas surrounding a nuclear fuel processing site, and possible exposures due to the derailment of rail cars containing toxic chemicals). Because of the large economic and public-health impacts, these atmospheric dispersion models should be properly evaluated before they can be used with confidence.

As is the case for all other kinds of computer models, there are also uncertainties in the dispersion model results. For example, random turbulence in the atmosphere causes the concentration field to fluctuate. As a result, the concentration field can only be described through gross statistical properties (e.g., mean and variance). In addition, the input data used to run dispersion models are subject to errors due to such factors as instrument accuracy and representativeness. Physical parameterizations included in the models can also contain errors, approximations, and uncertainties. Therefore, it is critical to understand, and quantify if possible, the uncertainty in the dispersion model results. The current paper mainly focuses on the issue of model evaluation, and does not go into detailed analyses of model uncertainty.

Section 2 provided a review of various methodologies for evaluating the performance of atmospheric dispersion models. Any model evaluation exercise should start with clear definitions of the evaluation goal (i.e., the statistical null hypothesis to test) and the types of model outputs to be evaluated. Before conducting any statistical performance evaluation, it is useful to first conduct exploratory data analysis by means of different graphical displays, including scatter plots, quantile–quantile plots, box-residual plots, and scatter-residual plots. (Here the residual refers to the ratio of the predicted to observed values.) Scatter and quantile–quantile plots provide a general assessment of model performance. Residual plots are useful in identifying potential problems in the model physics.

One commonly-used evaluation methodology is the BOOT software package previously co-developed by the authors (Hanna et al, 1993). It calculates a set of quantitative performance measures, including the fractional bias (FB), the geometric mean (MG), the normalized mean square error (NMSE), the geometric variance (VG), the correlation coefficient (R), and the fraction of data where predictions are within a factor of two of observations (FAC2). Most previous studies where the BOOT software had been used simply quoted the values of these measures, which are not that informative to readers in general. It was suggested that measures such as FB, MG, NMSE, and VG be further expressed in terms of a quantity, such as the equivalent ratio of the predicted to observed values, that is easier to interpret.

Depending on the situation, some performance measures might be more appropriate than others. Hence, there is not a single best measure, and it is necessary to use a combination of the performance measures. The FAC2 is the more robust measure, because it is not sensitive to the distribution of the variables to be evaluated. For dispersion modeling, where concentrations or dosages often vary by several orders of magnitude, MG and VG are better than FB and NMSE. However, a lower threshold was suggested when calculating MG and VG, because these two measures are strongly influenced by very low values and are undefined for zero values. The R is not a very robust measure, because it is sensitive to a few aberrant data pairs, and often mainly reflects the obvious pattern (i.e., concentration decreasing with downwind distance) that already exists in the dataset.

The confidence limits of the performance measures can be estimated with bootstrap resampling (Efron, 1987) to answer questions such as (1) whether the FB for one model is significantly different from zero, and (2) whether the FBs for two models are significantly different at the 95% confidence level.

In addition to the six performance measures described above, other measures can also be defined, including the normalized standard deviation (NSD) and the normalized root mean square error (NRMSE). It has been shown that NSD, NRMSE, and R can be concisely plotted on a nomogram, based on the law of cosines (Taylor, 2001). Note that these three measures only account for random scatter, not systematic bias.
The figure of merit in space (FMS) and the measure of effectiveness (MOE) are similar measures that are essentially, in their original definitions, based on the ratio of the intersection to the union of the predicted and observed contour areas. The two-dimensional (2-D) version of the MOE separates the false-negative (or underpredicting) and false-positive (or overpredicting) components of predictions. For situations where contours are not easily defined (because, for example, receptors are arranged in concentric arcs or the number of receptors is simply too small), Warner et al (2001) recommend using straight data summation as one of the options to provide the area estimates necessary for calculating the MOE. It can be shown that the 2-D MOE is closely related to a 2-D FB that is adapted from the traditional one-dimensional FB (Chang, 2002). One advantage of the 2-D FB is that, like the 2-D MOE, it separates the underpredicting and overpredicting components, so that cases of compensating errors will be easily detected.

Nevertheless, it is emphasized that the MOE has traditionally been calculated for each sampling arc of each trial, where the data at all sampler locations are paired in space. The average of these individual MOEs then provides a summary of model performance for all the experiments as a whole. On the other hand, the FB has traditionally been based on the arc-wise maximum concentrations (or dosages) for all experiments, so it directly gives an assessment of the overall model performance. Therefore, even though the MOE and FB are mathematically related, it is important to discern how they are actually implemented in order to interpret the overall model performance for all the experiments as a whole.

Because observations are realizations of infinite ensembles, and model predictions mostly represent some sort of ensemble average, it has been recommended that observations and predictions in similar regimes (e.g., as defined by similar atmospheric stability and downwind distance) should first be averaged before conducting any performance evaluation (ASTM, 2000).

The ASTM procedure demonstrated the inadequacy of performing model evaluation in a deterministic manner. Lewellen et al (1985) recommend an evaluation methodology based on the assumption that the observed concentration is simply a random sample taken from the probability density function (PDF) of the predicted concentration. As a result, the cumulative distribution function (CDF) of model residuals (e.g., observations minus predictions) should be inspected to see if it is within the confidence bounds given by the PDFs of predicted concentrations. The PDF can be estimated by such techniques as higher-order turbulence closure schemes (e.g., Sykes, 1984) or Monte Carlo analysis (e.g., Hanna and Davis, 2002).

The performance measures reported in many model evaluation exercises with many sets of field data were reviewed. It was concluded that a "good" model would be expected to have about 50% of the predictions within a factor of two of the observations, a relative mean bias within ±30%, and a relative scatter of about a factor of two or three.

To demonstrate the use of many of the performance measures, a model evaluation exercise was carried out using one six-hour period (IOP09) of the Salt Lake City Urban 2000 field experiment (Allwine et al, 2002). Two alternate baseline urban dispersion models (Britter and Hanna, 2003; Hanna et al, 2003) were run for this test case, and standard plots were presented and discussed, such as scatter plots, quantile–quantile plots, residual plots, and "FB versus NMSE" plots. Tables are presented containing the performance measures, as well as significance tests on whether the fractional bias, FB, is significantly different from zero, and whether the difference in FB between the two models is significantly different from zero. These performance measures were calculated using several alternate interpretations of the data (e.g., maximum C/Q on each downwind-distance arc, and C/Q "paired in time and space" at each monitor), using various assumptions concerning the minimum C that is allowed. It is found that the assumption regarding the minimum C can have a large effect on some of the performance measures.

Acknowledgments

This research was supported by the Defense Threat Reduction Agency (DTRA) and the Department of Energy (DOE). Major Brian Beitler and Mr. John Pace have been DTRA's technical contract representatives, and Dr. Michael Brown of Los Alamos National Laboratory has managed the DOE portion of the research. We thank Dr. K. Jerry Allwine of Battelle Pacific Northwest Laboratory for providing the Salt Lake City Urban 2000 field data.

References

Allwine KJ, Shinn JH, Streit GE, Clawson KL, Brown M (2002) Overview of Urban 2000. Bull Am Meteor Soc 83(4): 521–536
Anthes RA, Kuo Y-H, Hsie E-Y, Low-Nam S, Bettge TW (1989) Estimation of skill and uncertainty in regional numerical models. Quart J Royal Meteor Soc 115: 763–806
ASTM (2000) Standard guide for statistical evaluation of atmospheric dispersion model performance. American Society for Testing and Materials, Designation D 6589-00. ASTM, 100 Barr Harbor Drive, West Conshohocken, PA 19428-2959
Beck MB, Ravetz JR, Mulkey LA, Barnwell TO (1997) On the problem of model validation for predictive exposure assessments. Stoch Hydrol Hydraulics 11: 229–254
Britter RE, Hanna SR (2003) Flow and dispersion in urban areas. Ann Rev Fluid Mech 35: 469–496
Chang JC (2002) Methodologies for evaluating performance and assessing uncertainty of atmospheric dispersion models. Ph.D. thesis, George Mason University, Fairfax, 277 pp. Available from: www.lib.umi.com/dissertations/search (Pub Number 3068631)
Chang JC, Franzese P, Chayantrakom K, Hanna SR (2003) Evaluations of CALPUFF, HPAC, and VLSTRACK with two mesoscale field data sets. J Appl Meteor 42: 453–466
Cullen AC, Frey HC (1998) Probabilistic techniques in exposure assessment: A handbook for addressing variability and uncertainty in models and inputs. Plenum, 352 pp
Efron B (1987) Better bootstrap confidence intervals. J Am Stat Assoc 82: 171–185
Efron B, Tibshirani RJ (1993) An introduction to the bootstrap. Monographs on Statistics and Applied Probability, vol. 57. New York: Chapman & Hall, 436 pp
EPA (1999) User manual for the EPA third-generation air quality modeling system (Models-3 Version 3.0). Office of Research and Development, U.S. Environmental Protection Agency, Washington, D.C. 20460. EPA-600/R-99/055. Report available from: www.epa.gov/asmdnerl/models3/doc/user/user.html
EPA (2002) Example application of modeling toxic air pollutants in an urban area. EPA-454/R-02-003, OAQPS/EPA, RTP, NC 27711, www.epa.gov/scram001/tt25htm#toxics (91 pages + appendix)
Fox DG (1984) Uncertainty in air quality modeling. Bull Amer Meteor Soc 65: 27–36
Gates WL, Boyle JS, Covey C, Dease CG, Doutriaux CM, Drach RS, Fiorino M, Gleckler PJ, Hnilo JJ, Marlais SM, Phillips TJ, Potter GL, Santer BD, Sperber KR, Taylor KE, Williams DN (1999) An overview of the results of the Atmospheric Model Intercomparison Project (AMIP I). Bull Amer Meteor Soc 80: 29–55
Grell GA, Dudhia J, Stauffer DR (1994) A description of the fifth-generation Penn State/NCAR mesoscale model (MM5). NCAR Tech. Note NCAR/TN-398+STR, 138 pp. Available from NCAR, P.O. Box 3000, Boulder, CO 80307
Hamill TM, Mullen SL, Snyder C, Toth Z, Baumhefner DP (2000) Ensemble forecasting in the short to medium range: Report from a workshop. Bull Amer Meteor Soc 81: 2653–2664
Hanna SR (1988) Air quality model evaluation and uncertainty. J Air Poll Control Assoc 38: 406–412
Hanna SR (1989) Confidence limits for air quality model evaluations as estimated by bootstrap and jackknife resampling methods. Atmos Environ 23: 1385–1398
Hanna SR (1993) Uncertainties in air quality model predictions. Bound Layer Meteorol 62: 3–20
Hanna SR, Britter RE, Franzese P (2003) A baseline urban dispersion model evaluated with Salt Lake City and Los Angeles tracer data. Atmos Environ (submitted)
Hanna SR, Chang JC, Strimaitis DG (1993) Hazardous gas model evaluation with field observations. Atmos Environ 27A: 2265–2285
Hanna SR, Egan BA, Purdum J, Wagler J (2000) Evaluation of the ADMS, AERMOD, and ISC3 dispersion models with the Kincaid, Indianapolis, Lovett, Sweeny, and Duke Forest field data sets. Int J Environ Poll
Hanna SR, Strimaitis DG, Chang JC (1991) Hazard response modeling uncertainty (A quantitative method), vol. I: User's guide for software for evaluating hazardous gas dispersion models; vol. II: Evaluation of commonly-used hazardous gas dispersion models; vol. III: Components of uncertainty in hazardous gas dispersion models. Report no. A119/A120, prepared by Earth Tech, Inc., 196 Baker Avenue, Concord, MA 01742, for Engineering and Services Laboratory, Air Force Engineering and Services Center, Tyndall Air Force Base, FL 32403; and for American Petroleum Institute, 1220 L Street, N.W., Washington, D.C., 20005
Hanna SR, Yang R (2001) Evaluations of mesoscale model predictions of near-surface winds, temperature gradients, and mixing depths. J Appl Meteor 40: 1095–1104
Hanna SR, Davis JM (2002) Evaluation of a photochemical grid model using estimates of concentration probability density functions. Atmos Environ 36: 1793–1798
Helton JC (1997) Uncertainty and sensitivity analysis in the presence of stochastic and subjective uncertainty. J Stat Comp Simul 57: 3–76
Hodur RM (1997) The Naval Research Laboratory's Coupled Ocean/Atmosphere Mesoscale Prediction System (COAMPS). Mon Wea Rev 125: 1414–1430
Ichikawa Y, Sada K (2002) An atmospheric dispersion model for the environmental impact assessment of thermal power plants in Japan – A method for evaluating topographical effects. J Air Waste Manage Assoc 52: 313–323
Klug W, Graziani G, Grippa G, Pierce D, Tassone C (1992) Evaluation of long range atmospheric transport models using environmental radioactivity data from the Chernobyl accident. The ATMES Report. Elsevier Science Publishing
Lewellen WS, Sykes RI, Parker SF (1985) An evaluation technique which uses the prediction of both concentration mean and variance. Proc. DOE/AMS Air Pollution Model Evaluation Workshop, Savannah River Lab. Report No. DP-1701-1, Section 2, 24 pp
Lewellen WS, Sykes RI (1986) Analysis of concentration fluctuations from lidar observations of atmospheric plumes. J Clim Appl Meteor 25: 1145–1154
Lewellen WS, Sykes RI (1989) Meteorological data needs for modeling air quality uncertainties. J Atmos Ocean Tech 6: 759–768
Lorenz E (1963) Deterministic nonperiodic flow. J Atmos Sci 20: 130–141
Lumley JL, Panofsky HA (1964) The structure of atmospheric turbulence. New York: Wiley Interscience, 239 pp
McNally D, Tesche TW (1993) MAPS sample products. Alpine Geophysics, 16225 W. 74th Dr., Golden, CO 80403
Morgan MG, Henrion M (1990) Uncertainty. A guide to dealing with uncertainty in quantitative risk and policy analysis (with a chapter from M. Small). Cambridge University Press, 332 pp
Mosca S, Graziani G, Klug W, Bellasio R, Bianconi R (1998) A statistical methodology for the evaluation of long-range dispersion models: an application to the ETEX exercise. Atmos Environ 32: 4307–4324
Nappo CJ, Essa KSM (2001) Modeling dispersion from near-surface tracer releases at Cape Canaveral, FL. Atmos Environ 35: 3999–4010
Nasstrom JS, Sugiyama G, Ermak D, Leone JM Jr (2000) A real-time atmospheric dispersion modeling system. Proc. 11th Joint Conf. on the Application of Air Pollution Meteorology with the A&WMA, Amer Meteor Soc, Boston, MA
Olesen HR (2001) Ten years of harmonization activities: past, present, and future. 7th Int. Conf. on Harmonisation within Atmospheric Dispersion Modelling for Regulatory Purposes, Belgirate, Italy. National Environmental Research Institute, Roskilde, Denmark (www.harmo.org)
Oreskes N, Shrader-Frechette K, Belitz K (1994) Verification, validation, and confirmation of numerical models in the earth sciences. Science 263: 641–646
Paine RJ, Lee RF, Brode R, Wilson RB, Cimorelli AJ, Perry SG, Weil JC, Venkatram A, Peters WD (1998) Model evaluation results for AERMOD. USEPA, RTP, NC 27711
Pielke RA (2002) Mesoscale meteorological modeling, 2nd ed. Academic Press, 676 pp
Pielke RA, Pearce RP (1994) Mesoscale modeling of the atmosphere. Meteorological Monographs No. 47. Amer Meteor Soc, Boston, MA, 167 pp
Saltelli A, Chan K, Scott EM (2000) Sensitivity analysis. Wiley, 475 pp
Seaman NL (2000) Meteorological modeling for air quality assessments. Atmos Environ 34: 2231–2259
Seigneur C, Pun B, Pai P, Louis J-F, Solomon P, Emery C, Morris R, Zahniser M, Worsnop D, Koutrakis P, White W, Tombach I (2000) Guidance for the performance evaluation of three-dimensional air quality modeling systems for particulate matter and visibility. J Air Waste Manage Assoc 50: 588–599
Sykes RI (1984) The variance in time-averaged samples from an intermittent plume. Atmos Environ 18: 121–123
Sykes RI (2002) The use of concentration fluctuation variance predictions in model evaluation. Proc. 12th Joint Conf. on the Applications of Air Pollution Meteorology with A&WMA, 20–24 May 2002, Norfolk, VA, Amer Meteor Soc, Boston, MA
Sykes RI, Parker SF, Henn DS, Cerasoli CP, Santos LP (1998) PC-SCIPUFF Version 1.2PD, Technical Documentation. Titan Corporation, Titan Research and Technology Division, ARAP Group, P.O. Box 2229, Princeton, NJ 08543-2229, 172 pp
Taylor KE (2001) Summarizing multiple aspects of model performance in a single diagram. J Geophys Res 106(D7): 7183–7192
Venkatram A (1979) The expected deviation of observed concentrations from predicted ensemble means. Atmos Environ 13: 1547–1549
Venkatram A (1984) The uncertainty in estimating dispersion in the convective boundary layer. Atmos Environ 18: 307–310
Venkatram A (1988) Topics in applied dispersion modeling. Lectures on Air Pollution Modeling, chap. 6. Amer Meteor Soc, Boston, MA, 390 pp
Warner S, Platt N, Heagy JF, Bradley S, Bieberbach G, Sugiyama G, Nasstrom JS, Foster KT, Larson D (2001) User-oriented measures of effectiveness for the evaluation of transport and dispersion models. Institute for Defense Analyses, IDA Paper P-3554, 815 pp. IDA, 1801 N. Beauregard Street, Alexandria, VA 22311-1772
Weil JC, Sykes RI, Venkatram A (1992) Evaluating air-quality models: review and outlook. J Appl Meteor 31: 1121–1145
Wilks DS (1995) Statistical methods in the atmospheric sciences. New York: Academic Press, 467 pp
Willmott CJ (1982) Some comments on the evaluation of model performance. Bull Amer Meteor Soc 63: 1309–1313
Wilson DJ (1995) Concentration fluctuations and averaging times in vapor clouds. American Institute of Chemical Engineers, 345 East 47th Street, New York, NY 10017, 208 pp

Corresponding author's address: J. C. Chang, School of Computational Sciences, George Mason University, MS 5B2, Fairfax, VA 22030-4444, USA (E-mail: jchang4@scs.gmu.edu)