
Lecture Note 1: Introduction to Empirical Relationships

V31.0380. Topics in Econometrics, Fall 2009


by
Professor Bryan Graham
Table 1 reports monthly earnings and years of schooling data for a random sample of
4,400 non-farm male Honduran workers between the ages of 20 and 50 years old. Along the top
of the table are listed the twenty-one possible years of schooling values ($x_j$, $j = 1, \ldots, J = 21$).
These range from $x_1 = 0$ years of schooling to $x_{21} = 20$ years of schooling. On the left-hand
axis are listed $k = 1, \ldots, K = 11$ distinct brackets for monthly earnings (in thousands
of Honduran Lempiras). The lowest bracket is for earnings less than 1,000 Lempiras per
month, while the highest is for earnings in excess of 10,000 Lempiras per month.[1]
It will be convenient to treat all workers with earnings in a given bracket as if they
received a single, common, earnings level. Workers with actual earnings between 0 and 1,000
Lempiras, for example, are treated as if their earnings are 500 Lempiras, those between 1,000
and 2,000, as if they are 1,500 and so on (with earnings for the highest bracket set equal
to 10,500). This discretization of the data has been done for pedagogical reasons.
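The bracket-to-level rule described above can be sketched in a few lines of Python. This is only an illustration: the function name `bracket_midpoint` is hypothetical, and the treatment of earnings that fall exactly on a bracket boundary (assigned to the higher bracket here) is an assumption, since the text does not say how ties are handled.

```python
def bracket_midpoint(earnings: float) -> int:
    """Map monthly earnings (in Lempiras) to the common level assigned to its bracket."""
    if earnings < 0:
        raise ValueError("earnings must be non-negative")
    # Brackets are 1,000 Lempiras wide. Everything at or above 10,000 is pooled
    # into the top bracket (index 10), which is assigned 10,500.
    # Assumption: an amount exactly on a boundary goes to the higher bracket.
    bracket = min(int(earnings // 1000), 10)
    return bracket * 1000 + 500


for amount in (250, 1999, 2000, 12345):
    print(amount, "->", bracket_midpoint(amount))
```

Each worker's recorded earnings are thereby replaced by one of eleven values: 500, 1,500, and so on up to 10,500.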
1 Frequency distributions and sample means
What can we learn about the relationship between schooling and earnings from these data?
In total, there are $J \times K = 21 \times 11 = 231$ cells in the table. In the $(j, k)$ cell is the fraction of
workers (out of the 4,400 in the sample) with exactly $x_j$ years of schooling and $y_k$ Lempiras
of earnings. That is,
$$\hat{f}(x_j, y_k) = \frac{1}{N} \sum_{i=1}^{N} \mathbf{1}(X_i = x_j)\, \mathbf{1}(Y_i = y_k), \tag{1}$$
where $i = 1, \ldots, N$ indexes the $N = 4{,}400$ individual paired observations of years of schooling,
$X_i$, and monthly earnings, $Y_i$. $\mathbf{1}(\cdot)$ denotes the indicator function, which
takes a value of one when the condition inside the parentheses is true and zero otherwise.
For example, $\mathbf{1}(X_i = x_j)$ equals one if the $i$th individual in the sample has exactly $x_j$ years
of schooling. The product $\mathbf{1}(X_i = x_j)\, \mathbf{1}(Y_i = y_k)$ therefore takes on a value of one only
if the $i$th individual has both years of schooling equal to $x_j$ and monthly earnings equal to
$y_k$.[2] Equation (1) is called the joint frequency distribution function of $X$ and $Y$; for each
unique $(x_j, y_k)$ pair it computes the fraction of sample observations where $X_i = x_j$ and
$Y_i = y_k$. Put differently, it computes the frequency of the event $X_i = x_j$ and $Y_i = y_k$ in the
sample. Thus 7.07 percent of the workers in the sample have six years of schooling and earn
between two and three thousand Lempiras per month.[3]

[1] In 2004 one US$ bought about 18 Honduran Lempiras.
If we sum (1) over all possible $(j, k)$ pairs we get
$$\sum_{j=1}^{J} \sum_{k=1}^{K} \hat{f}(x_j, y_k) = 1,$$
i.e., the sum of all the individual frequencies equals one. Since Table 1 enumerates all possible
joint realizations of $X$ and $Y$, the individual event frequencies must sum to one.
This is just a more complex manifestation of the fact that for any number of coin flips, the
fraction of tails and the fraction of heads always sum to one!
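To make equation (1) concrete, the following minimal sketch computes joint frequencies as averages of products of indicator functions. The five-worker sample is invented purely for illustration (it is not drawn from Table 1), and the function name `joint_frequency` is hypothetical.

```python
from itertools import product


def joint_frequency(xs, ys):
    """Return {(x, y): fraction of observations with X_i == x and Y_i == y}."""
    n = len(xs)
    x_values = sorted(set(xs))
    y_values = sorted(set(ys))
    freq = {}
    for x, y in product(x_values, y_values):
        # Average of the product of the two indicator functions, as in equation (1).
        # In Python, True and False behave as 1 and 0 under multiplication.
        freq[(x, y)] = sum((xi == x) * (yi == y) for xi, yi in zip(xs, ys)) / n
    return freq


# Hypothetical toy sample of five workers (not taken from Table 1).
schooling = [0, 6, 6, 12, 6]
earnings = [500, 1500, 2500, 2500, 1500]

f_hat = joint_frequency(schooling, earnings)
print(f_hat[(6, 1500)])
print(sum(f_hat.values()))
```

Because every observation falls in exactly one cell, the cell frequencies add up to one, which is the property stated above.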
Table 1 completely characterizes the joint distribution of schooling and earnings in our
sample. What can it tell us about the relationship between these two variables? The table is
a bit bewildering, but with a little work we can discern an important regularity. Note that
most of the cells in the lower left-hand and upper right-hand regions of the table are close
to zero. There are few workers in our sample with few years of schooling and high monthly
earnings; conversely, there are few workers in the sample with many years of schooling and
low monthly earnings. Most of the frequency mass is concentrated on the diagonal of cells
running from the upper left-hand to the lower right-hand portions of the table. This suggests
that higher levels of schooling tend to be associated with higher levels of earnings in our
sample.
What does Table 1 tell us about the distribution of years of schooling in our sample,
irrespective of a worker's earnings? The marginal frequency distribution function of $X$ is
$$\hat{f}(x_j) = \sum_{k=1}^{K} \hat{f}(x_j, y_k). \tag{2}$$
Equation (2) computes the marginal frequency of the event $X_i = x_j$ in our sample. By
marginal we mean that we are interested in the frequency distribution of $X$ alone,
without regard to any possible relationship between $X$ and $Y$. To calculate the fraction of workers
with exactly $X_i = x_j$ years of schooling in the sample we sum over the $k = 1, \ldots, K$
frequencies for earnings and schooling combinations where schooling is held fixed at $x_j$. Looking at
Table 1, we simply sum up all the elements in a given column. For example, the marginal
frequency of six years of school equals
$$\hat{f}(6) = \sum_{k=1}^{K} \hat{f}(6, y_k) = 0.0439 + 0.0616 + 0.0707 + \cdots + 0.0068 = 0.29.$$
A total of $0.29 \times 100 = 29$ percent of the workers in our sample have exactly 6 years of
schooling. The marginal frequency distribution provides important information about the
distribution of schooling in our sample. The bottom row of Table 1 reports the marginal
frequency of each of the $J$ possible years of schooling values. We see that 10.5 percent of
our sample has no formal schooling. A large number of workers are also lumped into the six
years of schooling cell (29 percent). In Honduras six years of schooling marks the completion
of primary school and the end of compulsory schooling. We also see a little lumping at
the nine (5.6 percent) and twelve (9.9 percent) year levels, corresponding to the completion
of the equivalent of junior high school and high school respectively. Overall, the sample is
characterized by a low level of educational attainment. The majority of the workers in the
sample have six or fewer years of schooling.

[2] A comment on notation before we continue: a capitalized variable indicates a random draw from the
population. For example, $X_i$ denotes the schooling of the $i$th (randomly) surveyed worker in our dataset.
Lowercase variables denote a possible realized value for these draws (e.g., 3, 7, or 10 years of schooling). If
this is not clear right now, it will become so later in the course.

[3] In order to enhance readability, the actual numbers reported in Table 1 equal equation (1) multiplied by
100.
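The column sum in equation (2) can be checked directly against Table 1. The sketch below uses the six-years-of-schooling column as printed (entries are percentages); the helper name `marginal_frequency` is hypothetical. Because the table entries are rounded, the result agrees with the 0.29 quoted in the text only up to rounding.

```python
# Column of Table 1 for six years of schooling.
# Entries are percentages, i.e. joint frequencies multiplied by 100.
joint_six_pct = [4.39, 6.16, 7.07, 5.07, 2.30, 1.73, 0.66, 0.50, 0.32, 0.11, 0.68]


def marginal_frequency(joint_column_pct):
    """Sum one column of joint frequencies (in percent) and return a fraction."""
    return sum(joint_column_pct) / 100.0


f_six = marginal_frequency(joint_six_pct)
print(round(f_six, 3))
```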
Returning to the relationship between schooling and earnings, the conditional frequency
distribution function of $Y$ (earnings) given $X$ (schooling) is
$$\hat{f}(y_k \mid x_j) = \frac{\hat{f}(x_j, y_k)}{\hat{f}(x_j)}. \tag{3}$$
Equation (3) gives the fraction of workers earning exactly $y_k$ Lempiras per
month, conditional on also having exactly $x_j$ years of schooling. The marginal frequency
of workers with six years of schooling is 0.29. The joint frequency of workers with earnings
between zero and one thousand Lempiras per month and six years of schooling is 0.0439.
Therefore the conditional frequency of earning between zero and one thousand Lempiras,
where the conditioning is on schooling being equal to exactly six years, is $0.0439 / 0.29 \approx
0.151$. Among those workers in our sample with exactly six years of schooling, 15.1 percent
earn between zero and one thousand Lempiras per month. The conditional frequency does
not measure the frequency of an event across our entire sample; rather, it measures the
frequency within the subsample for which the conditioning criterion holds. In our sample
$0.29 \times 4{,}400 = 1{,}276$ workers have six years of schooling (our conditioning criterion); the
conditional frequency distribution of earnings refers to this subsample only.
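Equation (3) amounts to dividing each cell of a Table 1 column by that column's total. The sketch below does this for the six-years column; the function name `conditional_frequencies` is hypothetical, and the inputs are the rounded percentages printed in Table 1, so the outputs match the text only approximately.

```python
# Column of Table 1 for six years of schooling (percent of the full sample).
joint_six_pct = [4.39, 6.16, 7.07, 5.07, 2.30, 1.73, 0.66, 0.50, 0.32, 0.11, 0.68]
N = 4400  # sample size reported in the text


def conditional_frequencies(joint_column_pct):
    """Divide each joint frequency by the column total (the marginal frequency)."""
    marginal = sum(joint_column_pct)
    return [cell / marginal for cell in joint_column_pct]


cond_six = conditional_frequencies(joint_six_pct)
subsample_size = sum(joint_six_pct) / 100.0 * N

print(round(cond_six[0], 3))   # share of six-year workers in the lowest bracket
print(round(sum(cond_six), 6)) # conditional frequencies add up to one
print(round(subsample_size))   # approximate number of workers with six years
```

Note that the units cancel in the ratio, so it does not matter that the table reports percentages rather than fractions.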
The conditional frequency distribution of earnings for each schooling level, calculated
using equation (3) and the joint and marginal frequencies reported in Table 1, is reported in
Table 2. As we look at Table 2 we see the relationship between schooling and earnings with
greater clarity. Among workers with no schooling, 43.3 percent have earnings in the lowest
bracket and a full 43.3 + 26.8 = 70.1 percent have earnings in the lowest two brackets. If
we look at workers with 6 years of completed schooling we see that only 15.1 percent have
earnings in the lowest bracket. Among workers with twenty years of schooling, none have
earnings in the lowest bracket and a majority (62.5 percent) have earnings in the highest
bracket. These patterns confirm that workers in our sample with more schooling tend to
earn more per month.
The "tend to" part of this statement is important. Even among workers with no schooling,
1.9 percent have earnings in the highest bracket. For each schooling level we observe a
distribution of earnings. Among workers with identical schooling levels, we see a wide range
of realized earnings. There are relatively rich workers who are uneducated, and relatively
poor workers who are educated. Overall, however, the frequency of high earnings is greater
among more educated workers.
Figure 1 plots the conditional frequency distribution of earnings for zero, six and nine
years of schooling respectively. Along the horizontal axis of each panel is monthly earnings.
The vertical height of each point gives the frequency of the earnings level listed on the
horizontal axis, conditional on a particular level of schooling. In the first panel, the first
few dots from left to right have heights of 0.433, 0.268 and 0.165, corresponding to the
conditional frequencies of observing a worker with no schooling in each of the three lowest
earnings brackets.
An important property of conditional frequency distributions is that for each $j = 1, \ldots, J$
we have
$$\sum_{k=1}^{K} \hat{f}(y_k \mid x_j) = \frac{\sum_{k=1}^{K} \hat{f}(x_j, y_k)}{\hat{f}(x_j)} = \frac{\hat{f}(x_j)}{\hat{f}(x_j)} = 1.$$
The fractions of workers with a given number of years of schooling in each of the $k = 1, \ldots, K$
earnings brackets sum to 1. Graphically, the vertical heights in each panel of Figure 1 sum to
one. Figure 1 reinforces our conclusion that workers in our sample with more years
of schooling tend to earn more. As we go from the top to the bottom panel we can observe the
frequency mass shifting rightward toward higher levels of earnings.
Table 2 and Figure 1 clarify the relationship between schooling and earnings in our
sample; however, it would be nice to have an even simpler characterization of this relationship.
To this end consider the marginal mean of $X$:
$$\hat{\mu}_X = \sum_{j=1}^{J} x_j \hat{f}(x_j). \tag{4}$$
Equation (4) can be used to calculate the (marginal) mean years of schooling among workers
in our sample. Using the data from the bottom row of Table 1 we have
$$\hat{\mu}_X = 0 \times 0.105 + 1 \times 0.028 + \cdots + 20 \times 0.004 = 6.597.$$
So the average years of schooling for workers in our sample is 6.597.
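The weighted sum in equation (4) can be reproduced approximately from the bottom row of Table 1. In the sketch below the helper `weighted_mean` is a hypothetical name, and dividing by the total weight is a choice made here because the printed percentages are rounded and do not add up to exactly 100. For the same reason the result is close to, but not identical to, the 6.597 reported in the text.

```python
# Bottom row of Table 1: marginal frequency (percent) for 0, 1, ..., 20 years of schooling.
marginal_pct = [10.5, 2.8, 5.9, 8.1, 4.9, 4.2, 29.0, 2.3, 2.7, 5.6, 1.6,
                3.2, 9.9, 1.5, 1.3, 1.0, 2.0, 2.4, 0.7, 0.3, 0.4]
years = list(range(21))


def weighted_mean(values, weights):
    """Frequency-weighted mean; weights are renormalized to sum to one."""
    total = sum(weights)
    return sum(v * w for v, w in zip(values, weights)) / total


mean_schooling = weighted_mean(years, marginal_pct)
print(round(mean_schooling, 2))
```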
Notice how we used the marginal frequencies for each possible schooling level as weights
in computing $\hat{\mu}_X$. We can also use the conditional frequencies as weights to calculate the
conditional mean of earnings for a given level of schooling. The conditional mean of $Y$ given
$X = x_j$ in the sample is
$$\hat{\mu}_{Y|X}(x_j) = \sum_{k=1}^{K} y_k \hat{f}(y_k \mid x_j). \tag{5}$$
Using the conditional frequencies given in Table 2 we can compute the average earnings for
workers with exactly six years of schooling as
$$\hat{\mu}_{Y|X}(6) = \sum_{k=1}^{K} y_k \hat{f}(y_k \mid 6) = 500 \times 0.151 + 1{,}500 \times 0.213 + \cdots + 10{,}500 \times 0.024 = 2{,}956.$$
The average monthly earnings among workers with exactly six years of schooling in our
sample is 2,956 Lempiras. The conditional mean is simply an average among individuals for
which the conditioning criterion is true. It is an average over a subsample. From Table 2 and
the second panel of Figure 1 we know that some workers with six years of schooling earn more
and others less, but the conditional mean is equal to 2,956 Lempiras. If we calculate mean
earnings among workers with no schooling in our sample using (5) we get 1,721 Lempiras;
for workers with nine years of schooling we get 3,617 Lempiras. These three conditional
means are plotted as dashed vertical lines in Figure 1. Among workers with higher levels of
schooling we observe a higher mean level of earnings (i.e., the conditional mean vertical lines
shift rightward as we move from the first to the third panel).
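Equation (5) can be evaluated for the three schooling levels shown in Figure 1 using the corresponding columns of Table 2. This is a sketch under stated assumptions: the helper `weighted_mean` is a hypothetical name, weights are renormalized because the printed percentages are rounded, and the results therefore differ by a few Lempiras from the 1,721, 2,956 and 3,617 reported in the text.

```python
# Common earnings level assigned to each of the 11 brackets (Lempiras).
midpoints = [500 + 1000 * k for k in range(11)]

# Columns of Table 2 (conditional frequencies, in percent) for 0, 6 and 9 years of schooling.
cond_pct = {
    0: [43.3, 26.8, 16.5, 7.1, 1.5, 2.2, 0.0, 0.6, 0.0, 0.0, 1.9],
    6: [15.1, 21.3, 24.4, 17.5, 7.9, 6.0, 2.3, 1.7, 1.1, 0.4, 2.4],
    9: [10.9, 16.5, 21.8, 20.6, 6.5, 8.1, 4.0, 3.6, 2.8, 0.8, 4.4],
}


def weighted_mean(values, weights):
    """Frequency-weighted mean; weights are renormalized to sum to one."""
    total = sum(weights)
    return sum(v * w for v, w in zip(values, weights)) / total


cond_mean = {x: weighted_mean(midpoints, w) for x, w in cond_pct.items()}
for x in sorted(cond_mean):
    print(x, round(cond_mean[x]))
```

The ordering of the three means is what matters for the argument: mean earnings rise with schooling.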
Conditional means provide a simple way to summarize the relationship between schooling
and earnings in our sample. Figure 2 plots $\hat{\mu}_{Y|X}(x_j)$ for each of the $x_1, \ldots, x_J$ possible
schooling levels. The figure clearly shows that, in our sample, mean earnings are higher for
workers with higher levels of schooling. I have connected the individual points in Figure 2
to clarify the sample trend.
The conditional mean function provides a simple summary of the relationship between
schooling and earnings among workers in our sample. It is certainly far clearer than the 231
joint frequencies we began with in Table 1! It is also easier to read than Table 2 and Figure
1. However, this clarity comes at a cost. The conditional frequency distribution, in addition
to showing how earnings tend to increase with schooling in our sample, also shows the range
of earnings realizations among workers with the same level of schooling. By simply plotting
$\hat{\mu}_{Y|X}(x_j)$ for $j = 1, \ldots, J$ we lose information on the variability of earnings at each schooling
level.
The conditional sample variance of earnings given schooling is
$$\hat{\sigma}^2_{Y|X}(x_j) = \sum_{k=1}^{K} \left( y_k - \hat{\mu}_{Y|X}(x_j) \right)^2 \hat{f}(y_k \mid x_j). \tag{6}$$
Equation (6) gives the average of the squared differences between an individual's actual
earnings and the mean earnings of individuals with the same schooling level; this is a
common measure of variability around the mean. For workers with six years of schooling we
have, using the information in Table 2, a conditional sample variance of
$$\hat{\sigma}^2_{Y|X}(6) = (500 - 2{,}956)^2 \times 0.151 + (1{,}500 - 2{,}956)^2 \times 0.213 + \cdots + (10{,}500 - 2{,}956)^2 \times 0.024 \approx 4.53 \times 10^6.$$
The square root of (6) gives what is called the root mean square error (rmse). For six years of
schooling we have an rmse of $\sqrt{4.53 \times 10^6} \approx 2{,}128$ Lempiras.
Figure 2, in addition to plotting $\hat{\mu}_{Y|X}(x_j)$ for $x_1, \ldots, x_J$, also plots
$\hat{\mu}_{Y|X}(x_j) \pm 0.675 \sqrt{\hat{\sigma}^2_{Y|X}(x_j)}$ for $j = 1, \ldots, J$. These dashed lines give us a sense of the sample variability in
earnings around each conditional mean earnings level. Later in the course we will learn how
to interpret these dashed lines in a precise way and why, for example, I chose to multiply
rmse by 0.675 and not some other number, or no number at all.
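The conditional variance in equation (6), its square root, and the band plotted in Figure 2 can be sketched for the six-years column of Table 2 as follows. The function name `conditional_moments` is hypothetical, the weights are renormalized because the printed percentages are rounded, and the mean used is the one implied by these rounded inputs, so the output should be read as approximate.

```python
import math

# Common earnings level assigned to each of the 11 brackets (Lempiras).
midpoints = [500 + 1000 * k for k in range(11)]

# Column of Table 2 for six years of schooling (conditional frequencies, percent).
cond_six_pct = [15.1, 21.3, 24.4, 17.5, 7.9, 6.0, 2.3, 1.7, 1.1, 0.4, 2.4]


def conditional_moments(levels, weights):
    """Return the frequency-weighted mean and variance of the given levels."""
    total = sum(weights)
    mean = sum(y * w for y, w in zip(levels, weights)) / total
    variance = sum((y - mean) ** 2 * w for y, w in zip(levels, weights)) / total
    return mean, variance


mean_six, var_six = conditional_moments(midpoints, cond_six_pct)
rmse_six = math.sqrt(var_six)

# Band of the kind drawn as dashed lines in Figure 2: mean plus or minus 0.675 times rmse.
band_six = (mean_six - 0.675 * rmse_six, mean_six + 0.675 * rmse_six)

print(round(mean_six), round(var_six), round(rmse_six))
print(tuple(round(b) for b in band_six))
```

The variance is measured in squared Lempiras, which is why its square root, measured in Lempiras, is the quantity used to draw the band.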
Our goal was to characterize the relationship between schooling and earnings in our
sample of 4,400 Honduran male workers. We began with the joint frequency distribution
of schooling and earnings in Table 1. This table gives the fraction of workers in our sample in each
of the $J \times K = 21 \times 11 = 231$ possible schooling-by-earnings cells. Using these joint
frequencies we calculated the marginal frequency distribution of years of schooling in our
sample. The conditional frequency distribution of earnings given schooling helped to clarify
the relationship between schooling and earnings in our sample. We found that workers with
more years of schooling tended to be in the higher earnings brackets, while workers with
few years of schooling tended to be in the lower earnings brackets. We then computed the
conditional mean of earnings given schooling. We found that the average or mean earnings of
individuals with little schooling was substantially lower than the mean earnings of individuals
with lots of schooling. Finally, we ended by defining and calculating the conditional sample
variance, a measure of earnings variability. Our final product is Figure 2, which provides
a compact summary of the relationship between schooling and earnings in our sample.
2 Samples, populations and the analogy principle
All the information in Tables 1 and 2 and Figures 1 and 2 pertains to our sample of 4,400
Honduran workers. Ideally we'd like to learn about more than just the workers actually in
our sample; we'd like to learn about the entire population of Honduran male workers aged
20 to 50.
In this course we will use random samples to learn about characteristics of an underlying
population. The basic idea is quite straightforward. Say we are interested in computing
the average years of schooling for individuals aged 30 to 35 who live in the United States.
One way to compute this would be to conduct a census: we could interview every single
individual between the ages of 30 and 35 living in the United States, ask them about their
educational attainment, and average the results. This would give us the population mean
years of schooling, where our population includes all individuals between the ages of 30
and 35 living in the United States. Alternatively we could draw a random sample of such
individuals and ask them about their schooling. The average years of schooling for individuals
in our (random) sample (i.e., the sample mean) provides an estimate of the true or actual
population mean.
Let $S$ equal years of schooling and let
$$\mu_0 = E[S_i]$$
equal the expected years of schooling for the $i$th randomly drawn individual from the
population of interest. $E[\cdot]$ denotes the expectations operator. Imagine we randomly sampled
an individual from the population we are interested in learning about (to make this process
concrete we can imagine every individual in the population is assigned a unique number and
that we use a computer to randomly generate numbers and hence select one of these
individuals).[4] Once we sample an individual we observe her years of schooling. Prior to making
this observation, however, we would expect her years of schooling to equal $\mu_0$. By "expect"
we mean that if we were to repeatedly (randomly) sample individuals an infinite number of
times, the average of their observed years of schooling would equal $\mu_0$ (we will define the
expectation of a random variable more carefully in the lectures that follow).
Say we sample $N$ individuals. This gives us the random sample of schooling observations
$$S_1, S_2, \ldots, S_N.$$
If we take the sample average we get
$$\hat{\mu}_N = \frac{1}{N} \sum_{i=1}^{N} S_i.$$
Our sample average, $\hat{\mu}_N$, is an estimate of the population mean, $\mu_0$. It is intuitive that as $N$
gets very large (approaches infinity) our estimate will be close to $\mu_0$. Unfortunately, for
any given random sample of individuals we do not know for certain whether our estimate, $\hat{\mu}_N$,
is close to the population mean, $\mu_0$, that we are actually interested in. Intuitively, our belief
is that if we randomly sample a large (but finite) number of individuals from the relevant
population, our estimate will be pretty close to the truth (i.e., to $\mu_0$), but there is
always some chance that this will not be the case. Later in the course we will learn how to
(approximately) characterize the sampling variability of our estimate.
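The idea that a sample mean from a large random sample tends to be close to the population mean can be illustrated with a small simulation. Everything about the setup below is an assumption made for illustration: the marginal frequencies from Table 1 are treated as if they described a population, the seed is arbitrary, and the names `population_mean` and `sample_mean` are hypothetical. Sampling is with replacement, matching the assumption stated in the notes.

```python
import random

# Hypothetical population: treat Table 1's marginal frequencies (percent)
# as if they were population weights for 0, 1, ..., 20 years of schooling.
years = list(range(21))
weights = [10.5, 2.8, 5.9, 8.1, 4.9, 4.2, 29.0, 2.3, 2.7, 5.6, 1.6,
           3.2, 9.9, 1.5, 1.3, 1.0, 2.0, 2.4, 0.7, 0.3, 0.4]


def population_mean(values, weights):
    """Mean of the hypothetical population (weights renormalized to sum to one)."""
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)


def sample_mean(values, weights, n, rng):
    """Draw n individuals with replacement and average their schooling."""
    draws = rng.choices(values, weights=weights, k=n)
    return sum(draws) / n


rng = random.Random(0)  # fixed seed so the run is reproducible
mu_0 = population_mean(years, weights)
estimates = {n: sample_mean(years, weights, n, rng) for n in (10, 1000, 100000)}

print("population mean:", round(mu_0, 3))
for n, estimate in estimates.items():
    print(n, round(estimate, 3))
```

Rerunning with different seeds would give different estimates each time, which is the sampling variability the notes refer to; the spread of those estimates shrinks as the sample size grows.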
Our estimator for the average years of schooling in the population of interest is the sample
mean of a random sample of schooling observations. This estimator is an example of the
analogy principle in action. We are interested in the expectation (i.e., population mean) of
the random variable $S$. To estimate $\mu_0 = E[S]$ we replace the population mean $E[S]$
with its sample analog, the sample mean $N^{-1} \sum_{i=1}^{N} S_i$. Intuitively, we believe that the
sample mean years of schooling is a sensible estimate of the true population mean years of
schooling. All of the estimators we will learn about in this course are based on the idea of
replacing population means (i.e., expectations of random variables) with sample means (i.e.,
averages of a random sample of observations).

[4] For technical reasons we assume that we sample with replacement. That is, once we sample an individual
and observe her years of schooling she is placed back into the population and can in theory be selected
again by our computer.
We have already used the analogy principle extensively in this lecture. Consider the
population joint frequency distribution function or joint probability mass function (pmf)
associated with our sample of Honduran workers:
$$f(x_j, y_k) = \Pr(X_i = x_j, Y_i = y_k). \tag{7}$$
Equation (7) gives the probability that a randomly sampled male Honduran worker between
the ages of 20 and 50 has both years of schooling equal to $x_j$ and earnings equal to $y_k$.
Put differently, $f(x_j, y_k)$ gives the fraction of workers in the entire population for which
$X_i = x_j$ and $Y_i = y_k$. The joint frequency distribution associated with our sample is an
analog estimator for the true joint probability mass function underlying the population
from which our sample was collected. To see this, note that
$$\begin{aligned}
f(x_j, y_k) &= \Pr(X_i = x_j, Y_i = y_k) \\
&= \Pr\left( \mathbf{1}(X_i = x_j)\, \mathbf{1}(Y_i = y_k) = 1 \right) \\
&= E\left[ \mathbf{1}(X_i = x_j)\, \mathbf{1}(Y_i = y_k) \right].
\end{aligned}$$
The last equality holds because the expectation of a variable that takes only the values zero
and one equals the probability that it equals one.
Then observe that (1) is simply the sample analog of $E\left[ \mathbf{1}(X_i = x_j)\, \mathbf{1}(Y_i = y_k) \right]$.
3 Introduced concepts
1. joint, marginal and conditional frequency distribution
2. marginal and conditional mean
3. conditional sample variance and root mean square error (rmse)
4. population
5. random samples
6. expected value
7. the analogy principle
Table 1: Joint frequency distribution (jfd) of Y (= Monthly Wages) and X (= Years of Schooling)
Rows: monthly wage bracket Y, in thousands of Lempiras (10+ is the top bracket). Columns: years of schooling X.
Y \ X        0    1    2    3    4    5    6    7    8    9   10   11   12   13   14   15   16   17   18   19   20
0-1       4.55 1.07 2.05 2.57 1.36 1.05 4.39 0.23 0.14 0.61 0.07 0.05 0.23 0.05 0.05 0.00 0.02 0.00 0.00 0.00 0.00
1-2       2.82 0.68 1.66 1.86 1.30 1.07 6.16 0.57 0.41 0.93 0.39 0.34 0.68 0.05 0.00 0.05 0.05 0.05 0.00 0.02 0.00
2-3       1.73 0.55 1.00 1.68 1.02 1.05 7.07 0.45 0.73 1.23 0.39 0.41 1.86 0.23 0.05 0.14 0.11 0.02 0.00 0.00 0.00
3-4       0.75 0.16 0.61 1.02 0.50 0.45 5.07 0.45 0.73 1.16 0.20 0.57 1.77 0.39 0.20 0.09 0.18 0.11 0.05 0.02 0.00
4-5       0.16 0.14 0.30 0.45 0.23 0.32 2.30 0.16 0.25 0.36 0.23 0.34 1.41 0.20 0.20 0.16 0.16 0.16 0.05 0.00 0.02
5-6       0.23 0.07 0.07 0.20 0.05 0.07 1.73 0.14 0.16 0.45 0.11 0.23 0.91 0.16 0.14 0.14 0.11 0.16 0.07 0.02 0.00
6-7       0.00 0.05 0.09 0.00 0.18 0.05 0.66 0.07 0.07 0.23 0.07 0.20 0.89 0.11 0.20 0.05 0.14 0.18 0.00 0.00 0.02
7-8       0.07 0.02 0.02 0.02 0.05 0.02 0.50 0.05 0.09 0.20 0.09 0.30 0.66 0.16 0.16 0.09 0.14 0.23 0.00 0.00 0.02
8-9       0.00 0.00 0.05 0.07 0.05 0.00 0.32 0.00 0.00 0.16 0.00 0.11 0.30 0.02 0.05 0.05 0.07 0.02 0.07 0.02 0.02
9-10      0.00 0.02 0.00 0.02 0.02 0.07 0.11 0.05 0.07 0.05 0.02 0.14 0.34 0.05 0.09 0.00 0.09 0.16 0.07 0.05 0.05
10+       0.20 0.02 0.07 0.18 0.14 0.05 0.68 0.11 0.02 0.25 0.07 0.48 0.84 0.09 0.16 0.25 0.89 1.27 0.36 0.16 0.23
Marginal  10.5  2.8  5.9  8.1  4.9  4.2 29.0  2.3  2.7  5.6  1.6  3.2  9.9  1.5  1.3  1.0  2.0  2.4  0.7  0.3  0.4
Notes: Empirical frequencies calculated from a random sample of 4,400 non-farm Honduran male workers aged 20 to 50. Raw frequencies are
multiplied by one hundred. Monthly wage data are expressed in thousands of Lempiras. In May of 2004, when the data were collected, 1 US$ bought
approximately 18 Honduran Lempiras. Data collected as part of the biennial Encuesta Permanente de Hogares de Propósitos Múltiples (EPHPM).
Table 2: Conditional frequency distribution (cfd) of Y (= Monthly Wages) given X (= Years of Schooling)
Rows: monthly wage bracket Y, in thousands of Lempiras (10+ is the top bracket). Columns: years of schooling X. Entries are percentages.
Y \ X     0    1    2    3    4    5    6    7    8    9   10   11   12   13   14   15   16   17   18   19   20
0-1    43.3 38.5 34.6 31.7 27.9 25.0 15.1 10.0  5.1 10.9  4.2  1.4  2.3  3.0  3.5  0.0  1.2  0.0  0.0  0.0  0.0
1-2    26.8 24.6 28.1 23.0 26.5 25.5 21.3 25.0 15.4 16.5 23.6 10.8  6.9  3.0  0.0  4.5  2.3  1.9  0.0  7.7  0.0
2-3    16.5 19.7 16.9 20.8 20.9 25.0 24.4 20.0 27.4 21.8 23.6 12.9 18.9 15.2  3.5 13.6  5.8  1.0  0.0  0.0  0.0
3-4     7.1  5.7 10.4 12.6 10.2 10.9 17.5 20.0 27.4 20.6 12.5 18.0 17.9 25.8 15.8  9.1  9.3  4.8  6.9  7.7  0.0
4-5     1.5  4.9  5.0  5.6  4.7  7.6  7.9  7.0  9.4  6.5 13.9 10.8 14.3 13.6 15.8 15.9  8.1  6.7  6.9  0.0  6.2
5-6     2.2  2.5  1.2  2.5  0.9  1.6  6.0  6.0  6.0  8.1  6.9  7.2  9.2 10.6 10.5 13.6  5.8  6.7 10.3  7.7  0.0
6-7     0.0  1.6  1.5  0.0  3.7  1.1  2.3  3.0  2.6  4.0  4.2  6.5  9.0  7.6 15.8  4.5  7.0  7.7  0.0  0.0  6.2
7-8     0.6  0.8  0.4  0.3  0.9  0.5  1.7  2.0  3.4  3.6  5.6  9.4  6.7 10.6 12.3  9.1  7.0  9.6  0.0  0.0  6.2
8-9     0.0  0.0  0.8  0.8  0.9  0.0  1.1  0.0  0.0  2.8  0.0  3.6  3.0  1.5  3.5  4.5  3.5  1.0 10.3  7.7  6.2
9-10    0.0  0.8  0.0  0.3  0.5  1.6  0.4  2.0  2.6  0.8  1.4  4.3  3.4  3.0  7.0  0.0  4.7  6.7 10.3 15.4 12.5
10+     1.9  0.8  1.2  2.2  2.8  1.1  2.4  5.0  0.9  4.4  4.2 15.1  8.5  6.1 12.3 25.0 45.3 53.8 55.2 53.8 62.5
Notes: See notes to Table 1 for details of the data.
Figure 1: Conditional frequency distribution of earnings given
schooling equal to zero, six and nine years respectively.
Notes: Based on conditional frequencies reported in columns 1, 7, and 10 of Table 2.
Figure 2: Conditional mean of monthly earnings given years of schooling
Notes: Calculated using data in Table 2 and equations (5) and (6).