
STATISTICS WITH MATLAB FOR ENGINEERS:

DESCRIPTIVE STATISTICS
Paul Razafimandimby

Montanuniversitat Leoben

October 28, 2015

Contents
Introduction
Organizing and visualization of the data
Visualization of correlation
Measures of Central Tendency/Location
Measures of variation or dispersion
Appendix: Calculation of some parameters for grouped data.

Introduction
Roughly speaking, Statistics is the science of gaining knowledge from
numerical and categorical data. It deals with the collection, analysis and
interpretation of data, and with drawing conclusions from the collected data.
A population is the collection or set of all individuals under consideration
in a statistical study. A sample is a part or subset of the population from
which information is collected.

One can distinguish two branches of Statistics.

1. Descriptive Statistics is the methodology of organizing and summarizing
information. This branch of statistics deals with the construction of
the distribution of the sample/population (calculation of frequencies), the
visualization of data (graphs, charts, histograms), and the calculation of
various descriptive measures (averages, standard deviation, percentiles).
2. Inferential Statistics is the science of drawing conclusions about a
population from information collected from a sample of that population, and
of measuring the reliability of these conclusions. Inferential statistics
deals with point estimation, interval estimation and hypothesis testing,
which rely heavily on probability theory.

Descriptive and inferential statistics are interrelated: before inferring
conclusions from a statistical investigation, it is necessary to organize and
summarize the information collected from a sample. Moreover, the knowledge
gained from descriptive statistics usually suggests the appropriate method
or approach to be used for the inferential statistics.

In a statistical study, whether descriptive or inferential, the properties of
a population are usually described by numerical parameters. In many cases
these parameters are unknown, and a statistical study is very often oriented
towards the investigation/estimation of these parameters. For this purpose,
one usually uses statistical samples to make inferences about these unknown
parameters. A numerical value calculated from, and characterizing, a
statistical sample is called a statistic; statistics are used to make
inferences about the unknown parameters of the whole population. Statistics
finds applications in numerous applied sciences, among others economics,
political science and medicine. Of course, Statistics also plays an important
role in many branches of the engineering sciences. For instance, assuming
that a factory producing light bulbs keeps using the same equipment, raw
materials and methods of production, then using statistics we can infer the
quality of the light bulbs produced in the future.

Usually a statistical study has the following steps:

1. Describe the research problem. For instance, we want to know the average
age of MUL students.
2. Define the population and the sample on which we will conduct the study. In
very simple terms, a population is the collection or set of all
individuals under consideration in a statistical study. In our example, the
population is the set of all MUL students (from 1st-year to PhD students).
A sample is a part or subset of the population from which information
is collected. The sample could be a set of 100 students randomly interviewed
by 10 volunteers at 5 building entrances of the university tomorrow
between 7:00 and 9:00 am.
3. Collect the data. We send 10 volunteers to interview 100 students at 5
building entrances of the university tomorrow between 7:00 and 9:00 am.
4. Conduct a descriptive data analysis. After collecting the data we need to
organize it. For instance,
we could form a table containing the (relative) frequency and cumulative
(relative) frequency of each class of the sample.
We could plot the data to visualize some of its properties.
We could study the central tendency of the population/sample by calculating
its measures of location, such as the mean, median and mode.
We could also study the dispersion of the population/sample through the
calculation of the range, variance, standard deviation, coefficient of
skewness, kurtosis, interquartile range, ... All of these terms will be, or
have been, defined appropriately.

Organizing and visualization of the data
As defined above, this branch of statistics deals with the organization and the
summary of information from the collected data. But before we organize our
data we need to specify our variate or (random) variable.

Variate/Variable: a characteristic that varies from one individual of the
population to another. In our example, the variable is the age of each MUL
student. One can distinguish three types of variables or data:

1. Qualitative data/variable: This type of variable is also known as a
categorical or nominal data/variable, and it can only be described by a word,
letter or phrase. For example, the sex, marital status or blood type of the
MUL students.
2. Quantitative or numerical data/variable: a variable that can be
quantified or numerically described. For instance, the height, weight and age
of MUL students.
3. Ordinal data/variable: a variable that cannot be numerically described,
i.e. is not a quantitative variable, but whose values can be ordered. For
instance, quality of moral behavior (bad manners, good manners) or performance
of a football team (winner, runner-up, semi-finalist, ...).
After properly defining the variable, one can organize the observed values into
classes $C_i$, $i = 1, \dots, n$, and form a table containing the count $N_i$
of individuals belonging to each class (class frequency). One can also insert
in the table the relative frequency of each class. The relative frequency of a
class $C_i$ is defined by

\[ RF(C_i) = \frac{N_i}{\sum_{j=1}^{n} N_j}. \]

The relative frequencies of all classes sum to 1, or 100%. The cumulative
(relative) frequency of a class $C_i$ is the sum of the relative frequencies
of all classes up to and including the class $C_i$:

\[ CF(C_i) = \sum_{j=1}^{i} RF(C_j). \]

Note that cumulative frequency makes sense only for quantitative and ordinal
variables.
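
The two definitions above can be sketched in a few lines of base Matlab. This is only an illustration: the vector x holds made-up toy values (not the Age.txt data), and the names classes, N_i, RF and CF are chosen here for readability.

```matlab
% Toy sample (illustrative values only, not the Age.txt data)
x = [17 18 18 20 20 20 23 23 23 23];

% Distinct observed values (the classes) and the class index of each observation
[classes, ~, idx] = unique(x(:));

N_i = accumarray(idx, 1);   % class counts N_i
RF  = N_i / sum(N_i);       % relative frequencies, which sum to 1
CF  = cumsum(RF);           % cumulative relative frequencies

% One row per class: value, count, relative and cumulative frequency
disp([classes, N_i, RF, CF])
```

The last entry of CF is always 1, which is a quick sanity check on the table.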

Example: To simulate our statistical study on the students' ages, we generate
100 random integers between 17 and 40 (I did it for you in this note, but you
should learn how to do it). First, we load the data into Matlab

Age=load('Age.txt');
and create a frequency table from it

tabulate(Age);

Value Count Percent


1 0 0.00%
2 0 0.00%
3 0 0.00%
4 0 0.00%
5 0 0.00%
6 0 0.00%
7 0 0.00%
8 0 0.00%
9 0 0.00%
10 0 0.00%
11 0 0.00%
12 0 0.00%
13 0 0.00%
14 0 0.00%
15 0 0.00%
16 0 0.00%
17 5 5.00%
18 4 4.00%
19 3 3.00%
20 7 7.00%
21 2 2.00%
22 5 5.00%
23 9 9.00%
24 5 5.00%
25 7 7.00%
26 3 3.00%
27 3 3.00%
28 2 2.00%
29 1 1.00%
30 1 1.00%
31 3 3.00%
32 4 4.00%
33 6 6.00%
34 4 4.00%
35 4 4.00%
36 5 5.00%
37 2 2.00%
38 6 6.00%

39 5 5.00%
40 4 4.00%
But the table also lists the values 1-16, which have zero count and which we
do not want. To get the right table we have to remove these rows. For this
purpose, let us store the table in a 40 × 3 matrix called T

T=tabulate(Age);
and remove the rows T(i, :), for i = 1, ..., 16.

T(1:16,:)=[];
Now we recreate the frequency table

Freq_Table=table(T(:,1),T(:,2),T(:,3),'VariableNames',{'Age','Count','Percent'})

Freq_Table =

Age Count Percent


___ _____ _______

17 5 5
18 4 4
19 3 3
20 7 7
21 2 2
22 5 5
23 9 9
24 5 5
25 7 7
26 3 3
27 3 3
28 2 2
29 1 1
30 1 1
31 3 3
32 4 4
33 6 6
34 4 4
35 4 4
36 5 5
37 2 2
38 6 6
39 5 5
40 4 4

We can also export our table into a txt, xls, ... file.

writetable(Freq_Table,'Freq_Table_Age.txt','Delimiter',' ');
To visualize our data we can plot the frequencies versus the classes. For a
qualitative or ordinal variable we usually use a pie chart or a bar graph.
Note that in a bar graph the bars do not touch each other. A bar graph is also
used to visualize discrete quantitative data, i.e., data in which each class
is described by a single number. For the visualization of continuous
quantitative data, i.e., data in which each class is an interval, we usually
draw a histogram. The bars of a histogram do touch each other.
The usual method to form the frequency table of continuous quantitative data
is as follows.

1. Find the min and max values of the observed data.
2. Form disjoint intervals of the same length covering the range between the
min and the max values. In general, 5 to 15 intervals are satisfactory.
3. Count the number of individuals falling in each interval. This is the
frequency distribution.
4. Form the relative frequency of each class.
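
The four steps above can be sketched with base Matlab functions as follows. The data in x are made-up toy measurements, the number of intervals k = 5 is an arbitrary choice within the suggested range, and the convention that each interval is closed on the left (with the maximum assigned to the last interval) is an assumption of this sketch.

```matlab
% Toy continuous measurements (illustrative values only)
x = [1.2 3.4 2.2 5.9 4.1 0.7 3.3 2.8 4.9 5.1];

k     = 5;                              % number of intervals (assumed)
xmin  = min(x);  xmax = max(x);         % step 1: min and max
edges = linspace(xmin, xmax, k + 1);    % step 2: k intervals of equal length
width = edges(2) - edges(1);

% Step 3: interval index of each observation. Intervals are [e_j, e_{j+1});
% the maximum value is assigned to the last interval.
bin    = min(floor((x - xmin) / width) + 1, k);
counts = accumarray(bin(:), 1, [k 1]);

relfreq = counts / numel(x);            % step 4: relative frequencies

% One row per interval: lower edge, upper edge, count, relative frequency
disp([edges(1:end-1)', edges(2:end)', counts, relfreq])
```

When the Statistics-independent function histcounts is available, it performs steps 2 and 3 in one call; the explicit version is shown here only to mirror the steps.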
For our example, a lazy way of visualizing the frequencies of our data is
simply to draw a bar plot of the second column of the matrix T

bar(T(:,2))

But in this case the x-axis shows the row indices 1-24 of T rather than the
ages 17-40 of our variable classes. To remedy this we can specify the
locations of the bars along the x-axis as follows

bar(17:1:40,T(:,2))

Which is equivalent to
bar(T(:,1),T(:,2));

We can also draw a histogram for our data. For instance, we cover the range
between the min and max values of our observations by disjoint intervals of
the same length, say, [17,22), [22,27), ..., [37,42]. Here is how we do it in
Matlab

histage=histogram(Age,[17:5:45]);

We could also draw a pie chart

pie(Age);

Well! This looks awful. Let us just do the pie chart of the first 5 students and
label them

pie(Age(1:5), {'Stud1', 'Stud2', 'Stud3', 'Stud4', 'Stud5'});

Visualization of correlation
Graphs are also very useful to give an intuition of the correlation between
variables. For example, we want to know whether smoking is one of the factors
of cancer, and which cancer type is mostly caused by smoking. For this, let us
download a data set from

http://lib.stat.cmu.edu/DASL/Stories/cigcancer.html

I named the data file smoke_cancer.txt and loaded it into Matlab by using the
dataset command.

smokeds=dataset('File','smoke_cancer.txt');
We can now visualize the correlation between smoking and, let us say, bladder
cancer and lung cancer

subplot(2,1,1)
scatter(smokeds.CIG,smokeds.BLAD)
title('CIG vs BLAD')
subplot(2,1,2)
scatter(smokeds.CIG,smokeds.LUNG)
title('CIG vs LUNG');

It seems that CIG and LUNG have a positive linear correlation. Let us see
whether we can draw something from a bar chart of the two variables

bar(smokeds.LUNG, 'c')
hold on
bar(smokeds.BLAD, 'r')
hold off
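
A visual impression of linear correlation can be complemented by a number. The sketch below uses the base Matlab function corrcoef on made-up toy vectors x and y (these are NOT the cigarette data); with the real data one would pass the corresponding columns instead, which is left as an assumption here.

```matlab
% Toy paired observations (illustrative values, NOT the cigarette data)
x = [1 2 3 4 5 6];
y = [2.1 3.9 6.2 8.1 9.8 12.2];

R = corrcoef(x, y);   % 2-by-2 matrix of correlation coefficients
r = R(1, 2);          % sample correlation coefficient between x and y

fprintf('r = %.3f\n', r);
```

A value of r close to 1 indicates a strong positive linear relation, a value close to -1 a strong negative one, and a value near 0 no linear relation.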

Measures of Central Tendency/Location
A measure of location is a typical or central value which describes well the
location of the data. We mainly have three measures of location.

Mean: Let $X_i$, $i = 1, \dots, N$, be our observed values; then the mean is
defined by

\[ \bar{X} = \frac{1}{N} \sum_{i=1}^{N} X_i. \]

Note that when the data are grouped in classes $C_i$, $i = 1, \dots, n$, the
mean is defined by

\[ \bar{X} = \frac{1}{N} \sum_{i=1}^{n} f_i X_i, \]

where $X_i$ is the midpoint of the class $C_i$ (of course $X_i = C_i$ if the
variable is discrete), $f_i$ is the count of the class $C_i$ (or $X_i$), and
$N = \sum_{i=1}^{n} f_i$ is the total number of observations.
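
As a quick check of the grouped formula, the following sketch computes the mean of a small discrete sample both from the class values and counts and directly from the raw observations. The vector x contains made-up toy ages, and the variable names are chosen for this illustration only.

```matlab
% Toy discrete sample (illustrative values only)
x = [20 21 21 23 23 23 25];

[X, ~, idx] = unique(x(:));      % class values X_i
f = accumarray(idx, 1);          % class counts f_i
N = sum(f);                      % total number of observations

mean_grouped = sum(f .* X) / N;  % (1/N) * sum_i f_i * X_i
mean_raw     = mean(x);          % (1/N) * sum of all observations

fprintf('%g %g\n', mean_grouped, mean_raw);
```

For a discrete variable the two values coincide; when the classes are intervals and X_i are midpoints, the grouped mean is only an approximation of the raw mean.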

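Median or Middle: the middle value, which divides the observations into two
equal parts. If the data are ungrouped and the $n$ observed values are sorted
in increasing order, $X_1 \le X_2 \le \dots \le X_n$, then the median is
defined by

\[ \mathrm{Med} = X_{(n+1)/2} \]

if $n$ is odd, and

\[ \mathrm{Med} = \left( X_{n/2} + X_{n/2+1} \right) / 2 \]

if $n$ is even.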

Example: This is the list of ages of 7 MUL students

age7=[23,24,16,19,30,28,33];
age7s=sort(age7);
Medage7=age7s((length(age7)+1)/2);

Example again! Now let us look at ungrouped data with an even number of
observations. For this, take 8 MUL students

age8=[23,24,16,19,30,28,33,40];
age8s=sort(age8);
Medage8=(age8s((length(age8))/2)+age8s(length(age8)/2+1))/2;
Warning: The above formula/procedure for the median does not work well for
grouped data (especially when the observed values are grouped into intervals).
For grouped data, the formula/procedure for finding the median is more
complicated and gives only an estimate of the median; we will give the method
in the appendix. Nevertheless, it is relatively simple to find the median
class, which is the first class whose cumulative frequency reaches N/2.
However, we can apply the above procedure in our example of 100 MUL students.

Mode or modal class: the most frequently occurring class, i.e., the class
which has the highest count. In our example, the mode or modal class is the
age with the highest frequency (which is 9), i.e., 23. For data grouped into
intervals we only have a more complicated formula/procedure, which is given in
the appendix. Fortunately, with Matlab we do not need to worry about these
formulas; the software will do the computation for us (but you should read
books and understand the procedure).

Example: The average age in our example is given by

Mean_age2=mean(Age);
Mean_age2 is equivalent to the second definition of the mean, i.e.,

\[ \text{Mean\_age2} = \frac{1}{100} \sum_{i=17}^{40} f_i \, i, \]

where $f_i$ is the number of students of age $i$. It rounds to 28.

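Let us calculate the median of the observations.

Med_age=median(Age);
which returns 26.5: according to the frequency table, the 50th and 51st sorted
ages are 26 and 27. (Note that median(T(:,1)) would instead return 28.5, the
median of the 24 distinct class values, which ignores the counts.)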

Now let us calculate the mode

Mode=mode(Age);
which gives us 23. This is also the modal class, as we grouped our data in a
discrete way.

Measures of variation or dispersion
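The measures of dispersion given in the first lecture note are valid for
ungrouped data, and their meaning is the same for grouped data. For grouped
data we give them below. Let $X_i$ and $f_i$, $i = 1, \dots, n$, be the class
midpoints and class counts as in the grouped mean, and let
$N = \sum_{i=1}^{n} f_i$ be the sample size. The variance and the standard
deviation of the sample are respectively defined by:

\[ S^2 = \frac{1}{N-1} \sum_{i=1}^{n} f_i (X_i - \bar{X})^2, \]

\[ S = \sqrt{S^2}. \]

Sometimes, we use the shortcut formula

\[ S^2 = \frac{1}{N-1} \left[ \sum_{i=1}^{n} f_i X_i^2
       - \frac{1}{N} \left( \sum_{i=1}^{n} f_i X_i \right)^2 \right], \]

\[ S = \sqrt{S^2}. \]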
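
The definition and the shortcut formula can be compared numerically. In the sketch below, X and f are made-up class values and counts, N denotes the sample size (the sum of the counts), and the grouped data are expanded back into raw observations so that the result can be checked against the built-in var; all names are illustrative.

```matlab
% Toy grouped data (illustrative values only)
X = [20; 21; 23; 25];   % class values / midpoints X_i
f = [1; 2; 3; 1];       % class counts f_i
N = sum(f);             % sample size

Xbar = sum(f .* X) / N;                                        % grouped mean
S2_def      = sum(f .* (X - Xbar).^2) / (N - 1);               % definition
S2_shortcut = (sum(f .* X.^2) - sum(f .* X)^2 / N) / (N - 1);  % shortcut
S = sqrt(S2_def);                                              % std deviation

% Expand the grouped data into raw observations to compare with var()
raw = [];
for i = 1:numel(X)
    raw = [raw; repmat(X(i), f(i), 1)];
end
S2_builtin = var(raw);

fprintf('%g %g %g\n', S2_def, S2_shortcut, S2_builtin);
```

The shortcut avoids computing the deviations from the mean, but it subtracts two large numbers and can therefore lose precision when the variance is small relative to the mean.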

As for ungrouped data, we can also define the r-th moment and the r-th central
moment. With $N = \sum_{i=1}^{n} f_i$ the total number of observations, they
are respectively defined by

\[ M'_r = \frac{1}{N} \sum_{i=1}^{n} f_i X_i^r, \]

\[ M_r = \frac{1}{N} \sum_{i=1}^{n} f_i (X_i - \bar{X})^r. \]

These parameters can now be used to define the coefficients of skewness and
kurtosis, whose definitions are exactly the same as for ungrouped data.

Let us calculate the kurtosis and the skewness of our data.

First we try our formula.

Kurtf=mean((Age-Mean_age2).^4)/(mean((Age-Mean_age2).^2))^2;
Skewf=mean((Age-Mean_age2).^3)/(mean((Age-Mean_age2).^2))^(3/2);
We compare them with the values returned by the Matlab functions kurtosis and
skewness; both differences should be zero up to rounding error

Kurtf-kurtosis(Age)
Skewf-skewness(Age)
Interquartile range: The k-th percentile is the value of the observed variable
which has a cumulative frequency equal to k/100.

The first quartile, the second quartile and the third quartile correspond to the
values with cumulative frequencies 25%, 50% and 75%, respectively.

The interquartile range is the difference between the third quartile and the
first quartile. It is the length of the range within which the middle half of
the data lies.
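
The cumulative-frequency definition of a percentile can be sketched directly with base Matlab. The example reuses the eight ages from the median example; the helper pct and the rule "smallest observed value whose cumulative frequency reaches k/100" are assumptions of this sketch. This is only one of several conventions: the toolbox functions prctile and iqr interpolate between observations and may return different values.

```matlab
% The eight ages from the median example
x  = [23 24 16 19 30 28 33 40];
xs = sort(x(:));
N  = numel(xs);
CF = (1:N)' / N;   % cumulative relative frequency at each sorted value

% Smallest observed value whose cumulative frequency reaches k/100
% (the small tolerance guards against rounding in the comparison)
pct = @(k) xs(find(CF >= k/100 - 1e-12, 1, 'first'));

Q1 = pct(25);              % first quartile
Q3 = pct(75);              % third quartile
IQR_value = Q3 - Q1;       % interquartile range

fprintf('Q1 = %g, Q3 = %g, IQR = %g\n', Q1, Q3, IQR_value);
```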

Appendix: Calculation of some parameters for grouped data.


This appendix explains how to calculate some parameters of grouped data, in
particular when the range of the observed values of the variable is covered by
disjoint intervals. Many of the calculations in this section require the
knowledge of the lower/upper boundaries of the classes, which in turn requires
the knowledge of the gap between classes.

The gap between classes is the difference between the lower limit of one class
and the upper limit of the previous class. For example, assume that our
classes are the intervals $(a_i, b_i)$, $i = 1, \dots, n$. The gap is

\[ \text{gap} = a_{i+1} - b_i. \]

Having the gap at hand, we can form the class boundaries.

The lower class boundaries are

\[ \tilde{a}_i = a_i - \text{gap}/2 \]

and the upper class boundaries are

\[ \tilde{b}_i = b_i + \text{gap}/2. \]

For instance, for the classes 1-10 and 11-20 the gap is 11 - 10 = 1, and the
class boundaries are 0.5-10.5 and 10.5-20.5.

Now we are ready to estimate the median, quartiles and interquartile range
of grouped data. Follow the steps below to calculate the median:

1. Form the cumulative frequency table and insert in it the ranges of the
class boundaries. Call N the total frequency, which is also the total number
of observations or individuals in the sample.
2. Locate the median class, i.e., find the class which contains the N/2-th
individual. Call it $C_m = (a_m, b_m)$, and call
$\tilde{C}_m = (\tilde{a}_m, \tilde{b}_m)$ the interval between its lower and
upper class boundaries.
3. Apply the following formula to find the median:

\[ \text{Median} = \tilde{a}_m + \frac{N/2 - F_b}{f_m}
   (\tilde{b}_m - \tilde{a}_m), \]

where

$f_m$ is the frequency of the median class,

$F_b$ is the cumulative frequency before the median class.

A similar argument can be used to compute the first quartile (position N/4)
and the third quartile (position 3N/4). Let $k \in \{1, 3\}$; then

\[ Q_k = \tilde{a}_{Q_k} + \frac{kN/4 - F_b}{f_{Q_k}}
   (\tilde{b}_{Q_k} - \tilde{a}_{Q_k}), \]

where the subscript $Q_k$ refers to the class containing the kN/4-th
individual and $F_b$ is the cumulative frequency before that class.
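
The interpolation procedure for the median and the quartiles can be sketched as follows. The class limits a, b and the frequencies f are hypothetical values chosen for illustration, and lo, hi denote the lower and upper class boundaries; none of these names come from the notes.

```matlab
% Hypothetical grouped data (illustrative values only)
a = [0; 10; 20; 30];      % lower class limits
b = [9; 19; 29; 39];      % upper class limits
f = [5; 8; 12; 5];        % class frequencies

gap = a(2) - b(1);        % gap between consecutive classes
lo  = a - gap/2;          % lower class boundaries
hi  = b + gap/2;          % upper class boundaries
N   = sum(f);             % total frequency
F   = cumsum(f);          % cumulative frequencies

% Target positions N/4, N/2 and 3N/4 for Q1, the median and Q3
targets = [0.25 0.5 0.75] * N;
est = zeros(size(targets));
for t = 1:numel(targets)
    m  = find(F >= targets(t), 1, 'first');  % class containing the target
    Fb = F(m) - f(m);                        % cumulative frequency before it
    est(t) = lo(m) + (targets(t) - Fb) / f(m) * (hi(m) - lo(m));
end
Q1 = est(1);  Med = est(2);  Q3 = est(3);

fprintf('Q1 = %.3f, Median = %.3f, Q3 = %.3f\n', Q1, Med, Q3);
```

The same loop handles all three quantities because the quartile formula differs from the median formula only in the target position.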

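For the mode we can use the following formula

\[ \text{Mode} = \tilde{a}_{mo} + \frac{f_{mo} - f_b}{2 f_{mo} - (f_a + f_b)}
   (\tilde{b}_{mo} - \tilde{a}_{mo}), \]

where

$f_{mo}$ is the frequency of the modal class,

$f_b$ and $f_a$ are respectively the frequencies of the class before and the
class after the modal class,

$\tilde{a}_{mo}$ and $\tilde{b}_{mo}$ are the lower and upper class boundaries
of the modal class.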
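
A minimal sketch of the mode formula is given below. The class limits and frequencies are hypothetical, lo and hi denote the class boundaries, and treating a missing neighbouring class as having frequency 0 (when the modal class is the first or the last one) is an assumption of this sketch, not something stated in the notes.

```matlab
% Hypothetical grouped data (illustrative values only)
a = [0; 10; 20; 30];      % lower class limits
b = [9; 19; 29; 39];      % upper class limits
f = [5; 8; 12; 5];        % class frequencies

gap = a(2) - b(1);        % gap between consecutive classes
lo  = a - gap/2;          % lower class boundaries
hi  = b + gap/2;          % upper class boundaries

[fmo, m] = max(f);        % frequency and index of the modal class

% Frequencies of the neighbouring classes (0 if there is no neighbour)
fbefore = 0;  fafter = 0;
if m > 1,        fbefore = f(m-1); end
if m < numel(f), fafter  = f(m+1); end

ModeEst = lo(m) + (fmo - fbefore) / (2*fmo - (fbefore + fafter)) ...
          * (hi(m) - lo(m));

fprintf('Estimated mode = %.3f\n', ModeEst);
```

The estimate lies inside the modal class and is pulled towards the neighbour with the larger frequency.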
Exercise: Find the median, the interquartile range and the mode of the
following grouped data.

Time to travel to work Frequency

1-10 8
11-20 14
21-30 12
31-40 9
41-50 7
