ME-EM 3220 ENERGY LABORATORY


Statistics, Uncertainty, and Regression Analysis
Background
Statistic:
A quantity (such as the mean, mode, or median) that is computed from sample data.
Uncertainty:
An estimate of the probable error in a reported value. Uncertainty analysis is the process of identifying the errors in a measurement and quantifying their effects.
Regression Analysis:
The part of statistics that investigates the relationship between two or more variables related in a nondeterministic fashion, for example by analyzing a set of experimental data (y vs. x).
Objective
1. To analyze experimental data using statistical methods.
2. To estimate measurement error from the statistics through uncertainty analysis.
3. To analyze data obtained from experiments by identifying the sources and magnitudes of error.
4. To estimate the minimum number of samples needed to achieve an acceptable uncertainty.
5. To perform regression analysis by fitting a curve to a set of data points with a minimum amount of error.
Introduction
We begin by considering the measurement problem of estimating the true mean value, $x'$, based on the information derived from repeated measurements of x. A sample of the variable x under controlled, fixed operating conditions renders a finite number of data points. We use these data to infer $x'$. If the number of data points is very small, our estimate of $x'$ from the data set could be heavily influenced by the value of any one data point. If the data set were larger, the influence of any one data point would be offset by the larger influence of the rest of the data. From a practical view, only finite-sized data sets are possible, so the data can provide only an estimate of the true value.
From a statistical analysis of the data set and an analysis of the sources of error that influence these data, we can estimate $x'$ as
$$x' = \bar{x} \pm u_x$$ [1]
where $\bar{x}$ (the sample average) represents the most probable estimate of $x'$ based on the available data and $\pm u_x$ is the confidence interval or uncertainty in that estimate at some probability level, P%. The confidence interval (or uncertainty) is based on estimates of both the precision error and the bias error in the measurement of x.
Statistical Measurement Theory
A sample of data refers to a set of N data points obtained during repeated measurements of a variable x under fixed operating conditions. The measured variable is also known as the measurand. Fixed operating conditions imply that the external constraints that control the process from which the measured value is obtained are held at fixed values while the sample is obtained. In actual engineering practice, holding the constraints at truly fixed conditions may be impossible, and the term "fixed operating conditions" should be understood in a nominal sense: the process conditions are maintained as closely as possible.
Definitions:
Mode:
The mode is the value that occurs with the highest frequency.
Example: [1]
Twenty readings of a variable were collected as 26, 25, 28, 23, 25, 24, 24, 21, 23, 26, 28, 26, 24, 23, 24, 32, 25, 27, 24, and 22. Find the mode.
Solution:
Among these numbers, 21, 22, 27, and 32 each occur once; 28 occurs twice; 23, 25, and 26 each occur three times; and 24 occurs five times. Thus, 24 is the modal reading.
Note: If no value occurs more than once, the mode does not exist.
Median:
To find the median of N values, first arrange the data according to size (in either ascending or descending order). When N is odd, the median is the value of the middle item. When N is even, the median is the mean of the two items nearest the middle.
Example: [2]
In a recent month, a state Game and Fish Department reported 53, 31, 67, 53, and 36 hunting or fishing violations for five different regions. Find the median number of violations for these regions.
Solution:
The median is not 67, the third (or middle) item as listed, because the figures must first be arranged according to size. Thus, we get 31, 36, 53, 53, 67, and the median is 53.
Average:
The average of N numbers is the sum of all their values divided by N.
$$\bar{x} = \frac{1}{N}\sum_{i=1}^{N} x_i = \frac{x_1 + x_2 + \cdots + x_N}{N}$$ [2]
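The three measures of central tendency defined above can be checked with a short Python sketch. It reuses the data from Examples [1] and [2]; the helper names `mode_or_none` and `average` are illustrative choices and do not come from the handout.

```python
from collections import Counter
from statistics import median


def mode_or_none(values):
    """Return the most frequent value, or None when no value repeats.

    If several values tie for the highest count, the one seen first is returned.
    """
    value, count = Counter(values).most_common(1)[0]
    return value if count > 1 else None


def average(values):
    """Average as in Equation [2]: the sum of the values divided by N."""
    return sum(values) / len(values)


# Data from Example [1] (20 readings) and Example [2] (violations in 5 regions).
readings = [26, 25, 28, 23, 25, 24, 24, 21, 23, 26,
            28, 26, 24, 23, 24, 32, 25, 27, 24, 22]
violations = [53, 31, 67, 53, 36]

print("mode of readings:", mode_or_none(readings))    # 24
print("median of violations:", median(violations))    # 53
print("average of readings:", average(readings))      # 25.0
```

The `median` function sorts the data internally, so the values do not need to be arranged by size first.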
Example: [3]
On a certain day, nine students received 1, 4, 2, 0, 1, 5, 2, 1, and 3 pieces of mail. Find the average.
Solution:
The total number of pieces of mail that the nine students received is 1 + 4 + 2 + 0 + 1 + 5 + 2 + 1 + 3 = 19. Since $19/9 \approx 2.11$, the average number of pieces of mail per student is about 2.11.
Mean:
The mean value is that which would be obtained if every x in the population could be averaged together. In other words, the sample average is used to estimate the mean value.
Note: In certain cases, the median can also be used to estimate the mean value.
Standard Deviation:
The standard deviation describes the dispersion of a data set. If the data are closely bunched about their mean, the standard deviation is small. If the data are scattered widely about their mean, the standard deviation is large.
If a set of numbers $x_1, x_2, x_3, \ldots, x_N$, constituting a population, has the average (mean) $\bar{x}$, the differences
$$x_1 - \bar{x},\; x_2 - \bar{x},\; x_3 - \bar{x},\; \ldots,\; x_N - \bar{x}$$ [3]
are called the deviations from the average (mean). The standard deviation, $S_x$, of such a discrete data set is given by
$$S_x = \sqrt{\frac{\sum (x - \bar{x})^2}{N - 1}}$$ [4]
where the term N - 1 is called the degrees of freedom, $\nu$, of the sample, and $S_x^2$ is called the sample variance.
Example: [4]
On six consecutive Sundays, a tow-truck operator received 9, 7, 11, 10, 13, and 7 service calls. Calculate the standard deviation, $S_x$.
Solution:
First calculating the average, we get
$$\bar{x} = \frac{9 + 7 + 11 + 10 + 13 + 7}{6} = \frac{57}{6} = 9.5$$ [5]
and the work required to find $\sum (x - \bar{x})^2$ may be arranged as in Table [1]. Dividing by 6 - 1 = 5 and taking the square root, we get
$$S_x = \sqrt{\frac{27.50}{5}} = \sqrt{5.5} \approx 2.3$$ [6]
Note that in Table [1] the total of the middle column is zero; since this must always be the case, it provides a check on the calculations.
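The calculation in Example [4] can be reproduced with a minimal Python sketch. The function name `sample_statistics` is a hypothetical helper introduced here for illustration; it follows Equations [4] and [5] and includes the zero-sum check on the deviations.

```python
import math


def sample_statistics(values):
    """Return (average, sample variance, sample standard deviation)."""
    n = len(values)
    x_bar = sum(values) / n
    deviations = [x - x_bar for x in values]
    # The deviations from the average must always sum to zero (within rounding).
    assert math.isclose(sum(deviations), 0.0, abs_tol=1e-9)
    variance = sum(d * d for d in deviations) / (n - 1)  # N - 1 degrees of freedom
    return x_bar, variance, math.sqrt(variance)


calls = [9, 7, 11, 10, 13, 7]  # service calls from Example [4]
x_bar, variance, s_x = sample_statistics(calls)
print(x_bar, variance, round(s_x, 2))  # 9.5 5.5 2.35
```

Carried to two decimal places the result is 2.35, which agrees with the value of about 2.3 quoted in the example.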
Table [1]
$x$      $x - \bar{x}$    $(x - \bar{x})^2$
9        -0.5             0.25
7        -2.5             6.25
11        1.5             2.25
10        0.5             0.25
13        3.5             12.25
7        -2.5             6.25
Total     0.0             27.50
Finite Statistic
Statistical values obtained from finite-sized data sets should be regarded only as estimates of the true statistics of the measurand. Finite statistics describe only the behavior of the finite data set, whereas the true behavior of a variable is described by its infinite statistics.
Finite-sized data sets provide the statistical estimates known as the sample mean value and the sample variance, defined by
$$\bar{x} = \frac{1}{N}\sum_{i=1}^{N} x_i$$ [7]
$$S_x^2 = \frac{1}{N - 1}\sum_{i=1}^{N} (x_i - \bar{x})^2$$ [8]
The sample mean value provides the most probable estimate of the true mean value, $x'$. The sample variance represents a measure of the precision of a measurement. These equations are robust and provide reasonable statistical estimates regardless of the probability density function of the measurand.
In general, the number of degrees of freedom, $\nu$, in a data set is the number of independent measurements available for estimating a statistical value. However, in N measurements having a central tendency, the data will be scattered about a mean value. The freedom of any data point to assume any value therefore becomes restricted by this mean value. Hence the number of degrees of freedom in the measure of data scatter is reduced by one to N - 1, as seen in Equation [8].
The predictive utility of infinite statistics can be extended to data sets of finite sample size with only some modification. That discussion is not included in these notes.
For a normal distribution of x about the sample mean value, $\bar{x}$, one can state that statistically
$$x_i = \bar{x} \pm t_{\nu,P} S_x \quad (P\%)$$ [9]
Here, $x_i$ is the estimated interval of values at P%, $\bar{x}$ is the sample mean of the finite number of samples, and $t_{\nu,P}$ is obtained from a weighting function used for finite data sets. The value of the t estimator is a function of the probability, P, and the degrees of freedom, $\nu$, in the standard deviation.
Student's t-Distribution:
The definition of the t estimator is beyond the scope of this discussion. In short, this distribution is used in predicting the mean value of a Gaussian (normally distributed) population when only a small sample of data is available. Values of t can be obtained from Table [2], which is a tabulation of the Student's t-distribution as developed by William S. Gosset.
Standard Deviation of the Means
We must now recognize that the sample mean value itself has some degree of inherent uncertainty. The amount of variation possible in the sample means depends on two values: the sample variance, $S_x^2$, and the sample size, N, such that the discrepancy tends to increase with variance and decrease with $N^{1/2}$. The variation in the distribution of mean values that could be expected can be estimated from a single finite data set through the standard deviation of the means, $S_{\bar{x}}$:
$$S_{\bar{x}} = \frac{S_x}{N^{1/2}}$$ [10]
The standard deviation of the means represents a measure of the precision in a sample mean. The range over which the possible values of the true mean value might lie at some probability level, P, based on the information from a sample data set is given as
$$\bar{x} \pm t_{\nu,P} S_{\bar{x}} \quad (P\%)$$ [11]
where $\pm t_{\nu,P} S_{\bar{x}}$ represents a precision interval, at the assigned probability P%, within which one should expect the true value of x to fall. As such, the precision interval is a quantified measure of the precision error in the estimate of the true value of variable x.
The estimate of the true mean value based on a finite data set is now stated as
$$x' = \bar{x} \pm t_{\nu,P} S_{\bar{x}}$$ [12]
Example: [5]
Statistics, Value Interval, and True Mean Value. Consider the sample of variable x in Table [3].
a) Compute the sample statistics for this data set.
b) Estimate the interval of values over which 95% of the measurements of the measurand should be expected to lie (that is, calculate the precision interval of each measurement).
c) Estimate the true mean value of the measurand at 95% probability based on this finite data set (that is, calculate the precision interval of the mean of the measurements).
Known: N = 20, $x_i$
Find: $\bar{x}$, $\bar{x} \pm t S_x$, and $\bar{x} \pm t S_{\bar{x}}$
Table [2] Student's t-Distribution
$\nu$      $t_{50}$   $t_{90}$   $t_{95}$   $t_{99}$
1          1.000      6.314      12.706     63.657
2          0.816      2.920      4.303      9.925
3          0.765      2.353      3.182      5.841
4          0.741      2.132      2.776      4.604
5          0.727      2.015      2.571      4.032
6          0.718      1.943      2.447      3.707
7          0.711      1.895      2.365      3.499
8          0.706      1.860      2.306      3.355
9          0.703      1.833      2.262      3.250
10         0.700      1.812      2.228      3.169
11         0.697      1.796      2.201      3.106
12         0.695      1.782      2.179      3.055
13         0.694      1.771      2.160      3.012
14         0.692      1.761      2.145      2.977
15         0.691      1.753      2.131      2.947
16         0.690      1.746      2.120      2.921
17         0.689      1.740      2.110      2.898
18         0.688      1.734      2.101      2.878
19         0.688      1.729      2.093      2.861
20         0.687      1.725      2.086      2.845
21         0.686      1.721      2.080      2.831
30         0.683      1.697      2.042      2.750
40         0.681      1.684      2.021      2.704
50         0.680      1.676      2.010      2.679
60         0.679      1.671      2.000      2.660
$\infty$   0.674      1.645      1.960      2.576
Solution:
a) The sample mean value is computed for the N = 20 values by the relation
$$\bar{x} = \frac{1}{20}\sum_{i=1}^{20} x_i = 1.02$$ [13]
This, in turn, is used to compute the sample standard deviation
$$S_x = \left[\frac{1}{19}\sum_{i=1}^{20} (x_i - 1.02)^2\right]^{1/2} = 0.16$$ [14]
The degrees of freedom in the standard deviation are $\nu = N - 1 = 19$.
b) From Table [2] at 95% probability, $t_{19,95}$ is 2.093. Then, the interval of values in which 95% of the measurements of x should lie is given by Equation [9]:
$$x_i = \bar{x} \pm (2.093)(0.16) = 1.02 \pm 0.33 \quad (95\%)$$ [15]
Accordingly, if a 21st data point were taken, there is a 95% probability that its value would lie between 0.69 and 1.35.
c) The true mean value is estimated by the sample mean value. However, the precision interval for this estimate is $\pm t_{19,95} S_{\bar{x}}$, where
$$S_{\bar{x}} = \frac{S_x}{N^{1/2}} = \frac{0.16}{(20)^{1/2}} = 0.04$$ [16]
Then, from Equation [12],
$$x' = \bar{x} \pm t_{19,95} S_{\bar{x}} = 1.02 \pm 0.08$$ [17]
Accordingly, the true mean value should lie between 0.94 and 1.10 with 95% probability.
Note: The difference between parts (b) and (c) is that part (b) estimates the interval for an individual measurement, whereas part (c) estimates the interval for the mean of the measurements (the true mean value).
Table [3] Sample of Variable x
i     $x_i$    i     $x_i$    i     $x_i$    i     $x_i$
1     0.98     6     0.68     11    1.02     16    1.11
2     1.07     7     1.34     12    1.26     17    0.99
3     0.86     8     1.04     13    1.08     18    0.78
4     1.16     9     1.21     14    1.02     19    1.06
5     0.96     10    0.86     15    0.94     20    0.96
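The three parts of Example [5] can be reproduced with a short Python sketch using the standard library. The t value is taken directly from Table [2] rather than computed, and the variable names are illustrative assumptions. The script carries full precision, so the half-width for the mean comes out near 0.07; the 0.08 in the worked solution results from rounding $S_{\bar{x}}$ to 0.04 before multiplying.

```python
import math
from statistics import mean, stdev

# Table [3]: the 20 measurements of x, in order i = 1..20.
x = [0.98, 1.07, 0.86, 1.16, 0.96, 0.68, 1.34, 1.04, 1.21, 0.86,
     1.02, 1.26, 1.08, 1.02, 0.94, 1.11, 0.99, 0.78, 1.06, 0.96]

T_19_95 = 2.093                      # t estimator for nu = 19, P = 95% (Table [2])

n = len(x)
x_bar = mean(x)                      # sample mean, Equation [7]
s_x = stdev(x)                       # sample standard deviation (N - 1 in the denominator)
s_mean = s_x / math.sqrt(n)          # standard deviation of the means, Equation [10]

half_width_single = T_19_95 * s_x    # part (b): interval for one measurement
half_width_mean = T_19_95 * s_mean   # part (c): interval for the true mean

print(f"x_bar = {x_bar:.3f}, S_x = {s_x:.3f}, S_mean = {s_mean:.3f}")
print(f"(b) x_i = {x_bar:.2f} +/- {half_width_single:.2f} (95%)")
print(f"(c) x'  = {x_bar:.2f} +/- {half_width_mean:.2f} (95%)")
```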
Number of Measurements Required to Achieve a Given Precision
Statistics can be used to assist in the design and planning of a test program. For example, how many measurements, N, are required to estimate the true mean value, $x'$, with acceptable precision? To answer this question, begin with Equation [12], which expresses the true value based on a sample mean and its precision interval:
$$x' = \bar{x} \pm t_{\nu,P} S_{\bar{x}}$$ [18]
where
$$S_{\bar{x}} = \frac{S_x}{N^{1/2}}$$ [19]
Therefore, we can rewrite Equation [18] as
$$x' = \bar{x} \pm t_{\nu,P} \frac{S_x}{N^{1/2}}$$ [20]
We can express the precision interval in Equation [20] as the confidence interval, CI:
$$CI = \pm t_{\nu,P} \frac{S_x}{N^{1/2}} \quad (P\%)$$ [21]
To evaluate CI, we must assign a value to $S_x$. $S_x$ should be a conservative estimate based on previous test data, prior experience, or the manufacturer's information.
The precision interval is two-sided about the mean, defining a range from $-t_{\nu,P} S_x / N^{1/2}$ to $+t_{\nu,P} S_x / N^{1/2}$. We introduce the one-sided precision value d as
$$d = \frac{CI}{2} = \frac{t_{\nu,P} S_x}{N^{1/2}}$$ [22]
Then, it follows that the required number of measurements is estimated by
$$N \geq \left(\frac{t_{\nu,P} S_x}{d}\right)^2 \quad (P\%)$$ [23]
The use of the inequality serves as a reminder that this expression is based on an assumed value for $S_x$. The accuracy of Equation [23] depends on how well the assumed value for $S_x$ approximates the actual standard deviation.
The obvious deficiency in this method is that an estimate of the sample variance is needed. One way around this is to make a preliminary small number of measurements, $N_1$, to obtain an estimate of the sample variance, $S_1^2$, to be expected. Then $S_1$ is used to estimate the number of measurements required. The total number of measurements, $N_T$, is estimated by
$$N_T \geq \left(\frac{t_{N_1 - 1,P} S_1}{d}\right)^2 \quad (P\%)$$ [24]
This is an iterative process, as will be demonstrated next. This establishes that $N_T - N_1$ additional measurements will be required.
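Because t in Equation [23] depends on N through the degrees of freedom, the estimate has to be iterated. The following Python sketch shows one way to do this under stated assumptions: it looks up $t_{95}$ from a few rows of Table [2], uses the next lower tabulated $\nu$ between rows, and falls back to the $\nu = \infty$ value of 1.960 beyond $\nu = 60$. The function names and the inputs ($S_x$ = 0.16, d = 0.025, starting guess N = 20) are illustrative.

```python
import math

# Selected (nu, t_95) rows from Table [2]; 1.960 is the nu = infinity entry.
T95_ROWS = [(19, 2.093), (20, 2.086), (21, 2.080),
            (30, 2.042), (40, 2.021), (50, 2.010), (60, 2.000)]
T95_INFINITY = 1.960


def t95(nu):
    """Look up t_95; between rows use the next lower tabulated nu (larger t)."""
    if nu < T95_ROWS[0][0]:
        raise ValueError("nu is below the rows included in this sketch")
    if nu > T95_ROWS[-1][0]:
        return T95_INFINITY
    t_value = T95_ROWS[0][1]
    for row_nu, row_t in T95_ROWS:
        if row_nu <= nu:
            t_value = row_t
    return t_value


def required_measurements(s_x, d, n_guess, max_iter=50):
    """Iterate N = ceil((t * S_x / d)^2) until the value of N repeats."""
    n = n_guess
    history = [n]
    for _ in range(max_iter):
        n_next = math.ceil((t95(n - 1) * s_x / d) ** 2)
        history.append(n_next)
        if n_next == n:
            return n, history
        n = n_next
    raise RuntimeError("iteration did not converge")


n_required, history = required_measurements(s_x=0.16, d=0.025, n_guess=20)
print(n_required, history)  # 158 [20, 180, 158, 158]
```

Rounding up at each step keeps the estimate on the safe side of the inequality in Equation [23].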
Example: [6]
Consider Example [5]. Determine the number of measurements required to reduce the precision interval of the mean value of the variable to within 5%. Assume P = 95%.
Known: CI = 5% = 0.05, P = 95%
$$d = \frac{CI}{2} = \frac{0.05}{2} = 0.025$$ [25]
where, from Example [5], $S_x = 0.16$.
Solution:
Because Equation [23] has two unknowns, N and t, begin by guessing a value for N. Using this guess, find the t variable at the desired probability level. An updated value for N can then be found from
$$N = \left(\frac{t_{\nu,95} S_x}{d}\right)^2 \quad (95\%)$$ [26]
Then iterate by trial and error to converge on a value for N. We begin with N = 20. Then
$$\nu = 20 - 1 = 19, \qquad t_{19,95} = 2.093$$
$$N = \left(\frac{2.093 \times 0.16}{0.025}\right)^2 = 179.43 \;\Rightarrow\; N = 180$$
This yields 180 samples.
So, now guess N = 180. Then
$$\nu = 180 - 1 = 179, \qquad t_{179,95} = 1.96$$
$$N = \left(\frac{1.96 \times 0.16}{0.025}\right)^2 = 157.35 \;\Rightarrow\; N = 158$$
This yields 158 samples.
Then, guess again at N = 158, where
$$\nu = 158 - 1 = 157, \qquad t_{157,95} = 1.96$$
$$N = \left(\frac{1.96 \times 0.16}{0.025}\right)^2 = 157.35 \;\Rightarrow\; N = 158$$
This yields 158 samples.
We have converged on N = 158. Thus, at least 158 measurements must be made (138 more than the 20 already available) to achieve the desired precision interval in the measured variable. An analysis of the results after 158 measurements should also be made to ensure that the variance level used was representative of the actual data set.
Error Sources
As a guide to looking for measurement errors, it is possible to consider the measurement process as consisting of three distinct steps: calibration, data acquisition, and data reduction. Errors that enter during each of these steps will be grouped under their respective error source heading:
Calibration errors
Data acquisition errors
Data reduction errors
Within each of these three error source groups, an objective should be to list the types of errors encountered. Such errors are the elemental errors of the measurement.
In each elemental error under the above categories, one can also have bias error and/or precision error; each is a difference between the value indicated by a measurement system and the actual value measured.
Bias Error:
The constant offset between the average indicated value and the actual value measured.
Precision Error:
A statistical measure of the variation of the measured value during repeated measurements.
Uncertainty Analysis:
Uncertainty:
An estimate of the range of a possible error or errors; an estimate of the probable error in a reported value.
Uncertainty Analysis:
A process of identifying the errors in a measurement and quantifying their effects.
During a measurement, we cannot really know whether the system indicates the true value. However, from the calibration we can estimate the probable error in any subsequent measurement. From that we can infer how closely the measured value should agree with the true value.
In the previous part, we stated that the best estimate of the true value sought in a measurement is provided by its sample mean value and the uncertainty in that value,
$$x' = \bar{x} \pm u_x \quad (P\%)$$ [27]
where $u_x$ is called the uncertainty.
Note: Comparing Equations [27] and [12], we observe that the terms $u_x$ and $t_{\nu,P} S_{\bar{x}}$ play the same role. We now combine the precision error with the bias error and call the result $u_x$, so that the uncertainty analysis accounts for the different kinds of error that may arise in an experiment.
Uncertainty analysis is the method used to quantify the $u_x$ term, where, in the case of a single error source,
$$u_x = \pm\sqrt{B^2 + (t_{\nu,P} E)^2}$$ [28]
Note:
B = Bias error
E = Precision error ($S_{\bar{x}}$ from the previous discussion)
$t_{\nu,P}$ = t estimator from Table [2]
If multiple elemental errors exist as sources of bias and precision error, then
$$B = \sqrt{B_1^2 + B_2^2 + \cdots + B_n^2}$$ where n = 1, 2, 3, ... [29]
and
$$E = \sqrt{E_1^2 + E_2^2 + \cdots + E_n^2}$$ where n = 1, 2, 3, ... [30]
Here, n is the number of elemental errors. For instance, in measuring the density of a gas, if only pressure and temperature data are used, then n = 2.
Estimating the degrees of freedom in the precision index E requires some discussion, since E is composed of elements that usually have different degrees of freedom. In this case, the degrees of freedom in the measurement precision index are estimated using the Welch-Satterthwaite formula:
$$\nu = \frac{\left(\sum_{i=1}^{n} E_i^2\right)^2}{\sum_{i=1}^{n} \left(E_i^4 / \nu_i\right)}$$ where i = 1, 2, 3, ... [31]
Certain assumptions are implicit in an uncertainty analysis.
1. The test objectives are known.
2. The measurement itself is a clearly defined process in which all known calibration corrections for bias error have already been applied.
3. Data are obtained under fixed operating conditions.
4. Some system component experience is available.
Component experience is defined as an estimate of component bias and precision errors based on some evidence, such as personal experience through previous or simulated tests and calibrations, or someone else's experience, such as the manufacturer's performance literature, a NIST bulletin, a professional test code, or performance information discussed in the technical literature.
Example: [7]
Find the best estimate of the true value sought in a measurement, which is provided by its sample mean value and the uncertainty in that value. Consider Example [5] again.
This time the bias error is known to come from a single source with a value of $\pm 0.05$, as given by the manufacturer and recorded on the machine used to measure the $x_i$ values. Determine the best estimate of the true value of x by performing the uncertainty analysis.
Solution:
Bias Error, B = 0.05
Precision Error, $S_{\bar{x}}$ = 0.04 = E
Sample Data, N = 20
Degrees of freedom, $\nu$ = N - 1 = 19
Mean Value, $\bar{x}$ = 1.02
Number of errors, n = 1
We seek the statement $x' = \bar{x} \pm u_x$ (95%), where all the needed information is listed above. The uncertainty estimate in this measurement is obtained from the source error statements:
B = 0.05
E = 0.04
$\nu$ = 19
The t estimator, $t_{19,95}$, is found from Table [2]: $t_{19,95} = 2.093$. The uncertainty estimate is found using Equation [28]:
$$u_x = \pm\sqrt{(0.05)^2 + (2.093 \times 0.04)^2} = \pm 0.0975$$ [32]
The best estimate is given in the form of Equation [27] as
$$x' = 1.02 \pm 0.0975 \quad (95\%)$$
This measurement has an uncertainty of about $(0.0975 / 1.02) \times 100 \approx 9.6\%$.
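The combination of bias and precision errors in Equations [28] to [31] can be sketched in Python as follows. The function names are hypothetical helpers introduced for illustration, and the t value is read from Table [2] rather than computed. The inputs are those of Example [7].

```python
import math


def root_sum_square(errors):
    """Combine elemental errors as in Equations [29] and [30]."""
    return math.sqrt(sum(e * e for e in errors))


def welch_satterthwaite(precision_errors, degrees_of_freedom):
    """Effective degrees of freedom of the combined precision index, Equation [31]."""
    numerator = sum(e ** 2 for e in precision_errors) ** 2
    denominator = sum(e ** 4 / nu
                      for e, nu in zip(precision_errors, degrees_of_freedom))
    return numerator / denominator


def uncertainty(bias_errors, precision_errors, t_value):
    """Uncertainty u_x from Equation [28], using the combined B and E."""
    b = root_sum_square(bias_errors)
    e = root_sum_square(precision_errors)
    return math.sqrt(b ** 2 + (t_value * e) ** 2)


# Example [7]: one bias source (0.05), one precision source (0.04, nu = 19).
nu = welch_satterthwaite([0.04], [19])          # reduces to 19 for a single source
u_x = uncertainty([0.05], [0.04], t_value=2.093)
percent = 100 * u_x / 1.02

print(round(nu), round(u_x, 4), round(percent, 1))  # 19 0.0975 9.6
```

With several elemental errors, the effective degrees of freedom returned by `welch_satterthwaite` would be used to pick the row of Table [2] before calling `uncertainty`.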
Regression Analysis
Objective
1. To show how to analyze a set of experimental data using regression analysis.
2. To generate a curve (line) that represents all the data points with minimum error, that is, so that the deviation of the experimental data from the polynomial curve is minimal.
Introduction
The regression analysis for a single variable of the form y = f(x) provides an mth-order polynomial fit of the data in the form
$$y_c = f(x) = C_0 + C_1 x + C_2 x^2 + \cdots + C_m x^m$$ [33]
where $y_c$ refers to the value of the dependent variable obtained directly from the polynomial equation for a given value of x.
For n different values of the independent variable included in the analysis, the highest order, m, of the polynomial that can be determined is restricted to $m \leq n - 1$. The values of the m + 1 coefficients $C_0, C_1, \ldots, C_m$ are determined analytically.
The most common form of regression analysis in engineering applications is the method of least squares. The least-squares technique minimizes the sum of the squares of the deviations between the actual data and the polynomial fit of a stated order by adjusting the values of the coefficients.
An mth-order polynomial relationship is to be found for a set of N data points of the form (x, y), in which x and y are the independent and dependent variables, respectively. Consider the situation in which N values of y exist, $y_i$, where i = 1, 2, ..., N, over n values of x. The task is to find the m + 1 coefficients, $C_0, C_1, \ldots, C_m$, of the polynomial of Equation [33]. Define the deviation between any dependent variable $y_i$ and the polynomial as $y_i - y_{ci}$, where $y_{ci}$ is the value of the polynomial evaluated at the data point $(x_i, y_i)$. The sum of the squares of this deviation for all values of $y_i$ is
$$D = \sum_{i=1}^{N} (y_i - y_{ci})^2$$ [34]
The goal is to reduce D to a minimum for a given order of polynomial. Combining Equations [33] and [34], one can write
$$D = \sum_{i=1}^{N} \left[y_i - (C_0 + C_1 x_i + C_2 x_i^2 + \cdots + C_m x_i^m)\right]^2$$ [35]
The total differential of D depends on the m + 1 coefficients through
$$dD = \frac{\partial D}{\partial C_0} dC_0 + \frac{\partial D}{\partial C_1} dC_1 + \cdots + \frac{\partial D}{\partial C_m} dC_m$$ [36]
To minimize the sum of the squares of the deviations, one wants dD to be zero. This is accomplished by setting each of the partial derivatives equal to zero:
$$\frac{\partial D}{\partial C_0} = 0 = \frac{\partial}{\partial C_0} \left\{ \sum_{i=1}^{N} \left[y_i - (C_0 + C_1 x_i + C_2 x_i^2 + \cdots + C_m x_i^m)\right]^2 \right\}$$ [37]
$$\frac{\partial D}{\partial C_1} = 0 = \frac{\partial}{\partial C_1} \left\{ \sum_{i=1}^{N} \left[y_i - (C_0 + C_1 x_i + C_2 x_i^2 + \cdots + C_m x_i^m)\right]^2 \right\}$$ [38]
$$\frac{\partial D}{\partial C_m} = 0 = \frac{\partial}{\partial C_m} \left\{ \sum_{i=1}^{N} \left[y_i - (C_0 + C_1 x_i + C_2 x_i^2 + \cdots + C_m x_i^m)\right]^2 \right\}$$ [39]
This yields m + 1 equations, which are solved simultaneously to yield the unknown regression coefficients, $C_0, C_1, \ldots, C_m$.
Example: [8]
Least-Squares Regression Analysis. The data in Table [4] are suspected to follow a linear relationship. Find an appropriate equation of the first-order form.
Table [4] x and y data
x      y
1.0    1.2
2.0    1.9
3.0    3.2
4.0    4.1
5.0    5.3
Known
Independent variable, x
Dependent measured variable, y
N = 5
Assumptions
Linear relationship.
Find
$y_c = C_0 + C_1 x$
Solution:
We seek a polynomial of the form $y_c = C_0 + C_1 x$, which minimizes the term
$$D = \sum_{i=1}^{N} (y_i - y_{ci})^2$$ [40]
Setting the partial derivatives of D with respect to the coefficients equal to zero gives
$$\frac{\partial D}{\partial C_0} = 0 = -2 \sum_{i=1}^{N} \left[y_i - (C_0 + C_1 x_i)\right]$$ [41]
$$\frac{\partial D}{\partial C_1} = 0 = -2 \sum_{i=1}^{N} \left[y_i - (C_0 + C_1 x_i)\right] x_i$$ [42]
yielding
$$\sum_{i=1}^{N} \left[y_i - (C_0 + C_1 x_i)\right] = 0$$ [43]
$$\sum_{i=1}^{N} \left[y_i - (C_0 + C_1 x_i)\right] x_i = 0$$ [44]
Solving simultaneously for the coefficients $C_0$ and $C_1$ yields
$$C_0 = \frac{\sum x_i \sum (x_i y_i) - \sum x_i^2 \sum y_i}{\left(\sum x_i\right)^2 - N \sum x_i^2}$$ [45]
$$C_1 = \frac{\sum x_i \sum y_i - N \sum (x_i y_i)}{\left(\sum x_i\right)^2 - N \sum x_i^2}$$ [46]
From the data set, one finds $C_0$ = 0.02 and $C_1$ = 1.04. Hence,
$$y_c = 0.02 + 1.04x$$
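The closed-form coefficients of Equations [45] and [46] can be checked against the data of Table [4] with a minimal Python sketch. The function name `fit_line` is an illustrative choice; the code handles only the first-order (straight-line) case.

```python
def fit_line(xs, ys):
    """Least-squares coefficients (C0, C1) of y_c = C0 + C1*x, Equations [45] and [46]."""
    n = len(xs)
    sum_x = sum(xs)
    sum_y = sum(ys)
    sum_xx = sum(x * x for x in xs)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    denominator = sum_x ** 2 - n * sum_xx
    if denominator == 0:
        raise ValueError("all x values are identical; the line is undefined")
    c0 = (sum_x * sum_xy - sum_xx * sum_y) / denominator
    c1 = (sum_x * sum_y - n * sum_xy) / denominator
    return c0, c1


# Table [4] data.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [1.2, 1.9, 3.2, 4.1, 5.3]

c0, c1 = fit_line(xs, ys)
print(f"y_c = {c0:.2f} + {c1:.2f} x")  # y_c = 0.02 + 1.04 x
```

For these data the sums are $\sum x_i = 15$, $\sum y_i = 15.7$, $\sum x_i^2 = 55$, and $\sum x_i y_i = 57.5$, which reproduce the coefficients quoted in the example.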
