
Econometrics

Chapter 1: Economic questions and data

In common usage, an action is said to cause an outcome if the outcome is the direct result, or consequence, of that action. Causality means that a specific action (applying fertilizer) leads to a specific, measurable consequence (more tomatoes). A randomized controlled experiment is an experiment in which there are both a control group that receives no treatment and a treatment group that receives treatment. It is randomized in the sense that the treatment is assigned randomly. In this book, the causal effect is defined to be the effect on an outcome of a given action or treatment, as measured in an ideal randomized controlled experiment. In such an experiment, the only systematic reason for differences in outcomes between the treatment and control groups is the treatment itself. Experimental data come from experiments designed to evaluate a treatment or policy or to investigate a causal effect. Data obtained by observing actual behaviour outside an experimental setting are called observational data. Observational data are collected using surveys, such as a telephone survey of consumers, and administrative records, such as historical records on mortgage applications maintained by lending institutions. Whether the data are experimental or observational, data sets come in three main types: cross-sectional data, time series data, and panel data. Data on different entities (workers, consumers, firms, governmental units, and so forth) for a single time period are called cross-sectional data. With cross-sectional data, we can learn about relationships among variables by studying differences across people, firms, or other economic entities during a single time period. Time series data are data for a single entity (person, firm, country) collected at multiple time periods. Our data set on the rates of inflation and unemployment in the US is an example of time series data. Panel data, also called longitudinal data, are data for multiple entities in which each entity is observed at two or more time periods. Panel data can be used to learn about economic relationships from the experiences of the many different entities in the data set and from the evolution over time of the variables for each entity.

Chapter 2: Review of probability

The mutually exclusive potential results of a random process are called the outcomes. The probability of an outcome is the proportion of the time that the outcome occurs in the long run. The set of all possible outcomes is called the sample space. An event is a subset of the sample space; that is, an event is a set of one or more outcomes. Some random variables are discrete and some are continuous. As their names suggest, a discrete random variable takes on only a discrete set of values, like 0, 1, 2, ..., whereas a continuous random variable takes on a continuum of possible values. The probability distribution of a discrete random variable is the list of all possible values of the variable and the probability that each value will occur. These probabilities sum to 1. The cumulative probability distribution is the probability that the random variable is less than or equal to a particular value. A cumulative probability distribution is also referred to as a cumulative distribution function, a c.d.f., or a cumulative distribution. An important special case of a discrete random variable is when the random variable is binary, that is, the outcomes are 0 or 1.
A binary random variable is called a Bernoulli random variable, and its probability distribution is called the Bernoulli distribution.


Because a continuous random variable can take on a continuum of possible values, the probability distribution used for discrete variables, which lists the probability of each possible value of the random variable, is not suitable for continuous variables. Instead, the probability is summarized by the probability density function (p.d.f.). The area under the p.d.f. between any two points is the probability that the random variable falls between those two points. The expected value of a random variable Y, denoted E(Y), is the long-run average value of the random variable over many repeated trials or occurrences. The expected value of Y is also called the expectation of Y or the mean of Y and is denoted μY. For a discrete random variable, E(Y) = y1p1 + y2p2 + ... + ykpk, where y1, ..., yk are the possible values of Y and p1, ..., pk are the probabilities with which they occur.

An important special case of this general formula is the mean of a Bernoulli random variable. Let G be a Bernoulli random variable with success probability p. The expected value of G is E(G) = 1 × p + 0 × (1 − p) = p.
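To make the formula concrete, here is a short Python sketch (an illustrative addition, not from the text) that computes the mean of a made-up discrete distribution and checks that a Bernoulli variable with success probability p has mean p:

import numpy as np

# Hypothetical discrete distribution: possible values and their probabilities
values = np.array([0.0, 1.0, 2.0, 3.0])
probs = np.array([0.1, 0.2, 0.4, 0.3])        # must sum to 1

mean_Y = np.sum(values * probs)               # E(Y) = sum of y_i * p_i
print("E(Y) =", mean_Y)

# Bernoulli special case: values 0 and 1 with probabilities 1 - p and p
p = 0.3
mean_G = 0 * (1 - p) + 1 * p                  # equals p
print("E(G) =", mean_G)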

The variance and standard deviation measure the dispersion or the spread of a probability distribution. The variance of a random variable Y, denoted var(Y), is the expected value of the square of the deviation of Y from its mean: var(Y) = E[(Y − μY)²]. Because the variance involves the square of Y, the units of the variance are the units of the square of Y, which makes the variance awkward to interpret. It is therefore common to measure the spread by the standard deviation, which is the square root of the variance and is denoted σY. The variance of a Bernoulli random variable G is var(G) = σ²G = (0 − p)² × (1 − p) + (1 − p)² × p = p(1 − p). Thus the standard deviation is σG = √(p(1 − p)). If Y = a + bX, then the mean and variance of Y are μY = a + bμX and σ²Y = b²σ²X. The mean and standard deviation measure two important features of a distribution: its centre (the mean) and its spread (the standard deviation). This section discusses measures of two other features of a distribution: the skewness, which measures the lack of symmetry of a distribution, and the kurtosis, which measures how thick, or heavy, its tails are. The mean, variance, skewness, and kurtosis are all based on what are called the moments of a distribution. Skewness = E[(Y − μY)³] / σ³Y; the skewness of a distribution provides a mathematical way to describe how much a distribution deviates from symmetry. Kurtosis = E[(Y − μY)⁴] / σ⁴Y; the kurtosis of a distribution is a measure of how much mass is in its tails and, therefore, is a measure of how much of the variance of Y arises from extreme values. An extreme value of Y is called an outlier. The greater the kurtosis of a distribution, the more likely are outliers. A distribution with kurtosis exceeding 3 is called leptokurtic or, more simply, heavy-tailed. Like skewness, the kurtosis is unit free, so changing the units of Y does not change its kurtosis. The joint probability distribution of two discrete random variables, say X and Y, is the probability that the random variables simultaneously take on certain values, say x and y. The probabilities of all possible (x, y) combinations sum to 1. The joint probability distribution can be written as the function Pr(X = x, Y = y). The marginal probability distribution of a random variable Y is just another name for its probability distribution. The distribution of a random variable Y conditional on another random variable X taking on a specific value is called the conditional distribution of Y given X. In general, the conditional distribution of Y given X = x is Pr(Y = y | X = x) = Pr(X = x, Y = y) / Pr(X = x).
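The moment formulas above can be checked numerically; the following Python sketch (an addition, using simulated data) computes the variance, standard deviation, skewness, and kurtosis of a skewed distribution directly from the definitions:

import numpy as np

rng = np.random.default_rng(0)
y = rng.exponential(scale=2.0, size=1_000_000)   # a skewed, heavier-tailed variable

mu = y.mean()
var = np.mean((y - mu) ** 2)                     # var(Y) = E[(Y - mu_Y)^2]
sd = np.sqrt(var)                                # sigma_Y
skew = np.mean((y - mu) ** 3) / sd ** 3          # E[(Y - mu_Y)^3] / sigma_Y^3
kurt = np.mean((y - mu) ** 4) / sd ** 4          # E[(Y - mu_Y)^4] / sigma_Y^4

print(var, sd, skew, kurt)                       # for the exponential: skewness near 2, kurtosis near 9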

Two random variables X and Y are independently distributed, or independent, if knowing the value of one of the variables provides no information about the other. Specifically, X and Y are independent if the conditional distribution of Y given X equals the marginal distribution of Y. That is, X and Y are independently distributed if, for all values of x and y, Pr(Y = y | X = x) = Pr(Y = y). One measure of the extent to which two random variables move together is their covariance. The covariance between X and Y is the expected value E[(X − μX)(Y − μY)], where μX is the mean of X and μY is the mean of Y. The covariance is denoted cov(X, Y) or σXY. If X can take on l values and Y can take on k values, then the covariance is given by the formula cov(X, Y) = σXY = E[(X − μX)(Y − μY)] = ΣΣ (xj − μX)(yi − μY) Pr(X = xj, Y = yi), where the double sum runs over all combinations of the l values of X and the k values of Y. Finally, if X and Y are independent, then the covariance is zero. Because the covariance is the product of X and Y, deviated from their means, its units are, awkwardly, the units of X multiplied by the units of Y. This units problem can make numerical values of the covariance difficult to interpret. The correlation is an alternative measure of dependence between X and Y that solves the units problem of the covariance. Specifically, the correlation between X and Y is the covariance between X and Y divided by their standard deviations: corr(X, Y) = σXY / (σXσY). The random variables X and Y are said to be uncorrelated if corr(X, Y) = 0. The correlation is always between −1 and 1. The mean of the sum of two random variables, X and Y, is the sum of their means: E(X + Y) = E(X) + E(Y) = μX + μY. The variance of the sum of X and Y is the sum of their variances plus two times their covariance: var(X + Y) = var(X) + var(Y) + 2cov(X, Y) = σ²X + σ²Y + 2σXY. If X and Y are independent, then the covariance is zero and the variance of their sum is the sum of their variances: var(X + Y) = var(X) + var(Y) = σ²X + σ²Y. A continuous random variable with a normal distribution has the familiar bell-shaped probability density. The normal density with mean μ and variance σ² is symmetric around its mean and has 95% of its probability between μ − 1.96σ and μ + 1.96σ. The standard normal distribution is the normal distribution with mean μ = 0 and variance σ² = 1 and is denoted N(0, 1). The chi-squared distribution is the distribution of the sum of m squared independent standard normal random variables. This distribution depends on m, which is called the degrees of freedom of the chi-squared distribution. The Student t distribution with m degrees of freedom is defined to be the distribution of the ratio of a standard normal random variable to the square root of an independently distributed chi-squared random variable with m degrees of freedom divided by m. That is, let Z be a standard normal random variable, let W be a random variable with a chi-squared distribution with m degrees of freedom, and let Z and W be independently distributed. Then the random variable Z/√(W/m) has a Student t distribution with m degrees of freedom.
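As an illustrative addition, the covariance, correlation, and variance-of-a-sum formulas earlier in this section can be verified on simulated data with a few lines of Python:

import numpy as np

rng = np.random.default_rng(1)
n = 1_000_000
x = rng.normal(size=n)
y = 0.5 * x + rng.normal(size=n)                 # X and Y are positively correlated by construction

cov_xy = np.mean((x - x.mean()) * (y - y.mean()))
corr_xy = cov_xy / (x.std() * y.std())           # corr(X, Y) = sigma_XY / (sigma_X * sigma_Y)

lhs = np.var(x + y)                              # var(X + Y)
rhs = np.var(x) + np.var(y) + 2 * cov_xy         # var(X) + var(Y) + 2 cov(X, Y)
print(corr_xy, lhs, rhs)                         # the identity holds exactly in the sample as well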

The F distribution with m and n degrees of freedom, denoted Fm,n, is defined to be the distribution of the ratio of a chi-squared random variable with m degrees of freedom, divided by m, to an independently distributed chi-squared random variable with n degrees of freedom, divided by n. The act of random sampling, that is, randomly drawing a sample from a larger population, has the effect of making the sample average itself a random variable. Because the sample average is a random variable, it has a probability distribution, which is called its sampling distribution. This section concludes with some properties of the sampling distribution of the sample average. When Yi has the same marginal distribution for i = 1, ..., n, then Y1, ..., Yn are said to be identically distributed. Under simple random sampling, knowing the value of Y1 provides no information about Y2, so the conditional distribution of Y2 given Y1 is the same as the marginal distribution of Y2. In other words, under simple random sampling, Y1 is distributed independently of Y2, ..., Yn. When both these conditions are met, Y1, ..., Yn are said to be independently and identically distributed (i.i.d.). The law of large numbers states that, under general conditions, Ȳ will be near μY with very high probability when n is large. This is sometimes called the law of averages. When a large number of random variables with the same mean are averaged together, the large values balance the small values and their sample average is close to their common mean. The property that Ȳ is near μY with increasing probability as n increases is called convergence in probability or, more concisely, consistency. The law of large numbers states that, under certain conditions, Ȳ converges in probability to μY or, equivalently, that Ȳ is consistent for μY. The conditions for the law of large numbers that we will see in this book are that Yi, i = 1, ..., n, are i.i.d. and that the variance of Yi, σ²Y, is finite. If the data are collected by simple random sampling, then the i.i.d. assumption holds. The assumption that the variance is finite says that extremely large values of Yi (that is, outliers) are unlikely and observed infrequently; otherwise, these large values could dominate Ȳ and the sample average would be unreliable. The central limit theorem says that, under general conditions, the distribution of Ȳ is well approximated by a normal distribution when n is large. Recall that the mean of Ȳ is μY and its variance is σ²Ȳ = σ²Y / n. According to the central limit theorem, when n is large, the distribution of Ȳ is approximately N(μY, σ²Ȳ).

Chapter 3: Review of statistics

The sample average Ȳ is a natural way to estimate μY, but it is not the only way. There are many possible estimators, so what makes one estimator better than another? Because estimators are random variables, this question can be phrased more precisely: What are desirable characteristics of the sampling distribution of an estimator? In general, we would like an estimator that gets as close as possible to the unknown true value, at least in some average sense; in other words, we would like the sampling distribution of an estimator to be as tightly centered on the unknown value as possible. This observation leads to three specific desirable characteristics of an estimator: unbiasedness, consistency, and efficiency.
A desirable property of an estimator is that the mean of its sampling distribution equals μY; if so, the estimator is said to be unbiased. To state this concept mathematically, let μ̂Y denote some estimator of μY, such as Ȳ or Y1. The estimator μ̂Y is unbiased if E(μ̂Y) = μY, where E(μ̂Y) is the mean of the sampling distribution of μ̂Y; otherwise, μ̂Y is biased. Another desirable property of an estimator μ̂Y is that, when the sample size is large, the uncertainty about the value of μY arising from random variations in the sample is very small. Stated

more precisely, a desirable property of μ̂Y is that the probability that it is within a small interval of the true value μY approaches 1 as the sample size increases; that is, μ̂Y is consistent for μY. A final desirable property is that the estimator is efficient: an estimator is said to be efficient if it has a smaller variance than competing estimators. The starting point of statistical hypothesis testing is specifying the hypothesis to be tested, called the null hypothesis. Hypothesis testing entails using data to compare the null hypothesis to a second hypothesis, called the alternative hypothesis, which holds if the null does not. The most general alternative hypothesis is that E(Y) ≠ μY,0, which is called a two-sided alternative hypothesis because it allows E(Y) to be either less than or greater than the hypothesized value μY,0. The p-value is the probability of drawing a statistic at least as adverse to the null hypothesis as the one you actually computed in your sample, assuming the null hypothesis is correct. In the case at hand, the p-value is the probability of drawing Ȳ at least as far in the tails of its distribution under the null hypothesis as the sample average you actually computed. When σY is unknown and Y1, ..., Yn are i.i.d., the p-value is calculated using the formula p-value = 2Φ(−|(Ȳact − μY,0) / SE(Ȳ)|), where Φ is the standard normal c.d.f. and Ȳact is the sample average actually computed. The sample variance (s²Y) is an estimator of the population variance (σ²Y), the sample standard deviation is an estimator of the population standard deviation, and the standard error of the sample average is an estimator of the standard deviation of the sampling distribution: s²Y = (1/(n − 1)) Σ (Yi − Ȳ)². The sample variance is a consistent estimator of the population variance. In other words, the sample variance is close to the population variance with high probability when n is large. The standard deviation of the sampling distribution of Ȳ is σȲ = σY / √n. The estimator of σȲ, sY / √n, is called the standard error of Ȳ and is denoted SE(Ȳ) or σ̂Ȳ. A scatterplot is a plot of n observations on Xi and Yi, in which each observation is represented by the point (Xi, Yi). The sample covariance and correlation are estimators of the population covariance and correlation. Like the estimators discussed previously in this chapter, they are computed by replacing a population mean (the expectation) with a sample mean. The sample covariance, denoted sXY, is sXY = (1/(n − 1)) Σ (Xi − X̄)(Yi − Ȳ). The sample correlation coefficient, or sample correlation, is denoted rXY and is the ratio of the sample covariance to the sample standard deviations: rXY = sXY / (sXsY). The sample correlation measures the strength of the linear association between X and Y in a sample of n observations. Like the population correlation, the sample correlation is unit free and lies between −1 and 1.
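The hypothesis-testing calculations above can be reproduced in a few lines; the Python sketch below (an addition, with simulated data) computes the sample average, its standard error, the t-statistic for a hypothesized mean, and the two-sided p-value 2Φ(−|t|):

import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(2)
y = rng.normal(loc=5.2, scale=3.0, size=400)     # a sample of n i.i.d. observations

n = y.size
y_bar = y.mean()
s_y = y.std(ddof=1)                              # sample standard deviation (divides by n - 1)
se_ybar = s_y / np.sqrt(n)                       # SE(Ybar) = s_Y / sqrt(n)

mu_0 = 5.0                                       # hypothesized value under H0
t_stat = (y_bar - mu_0) / se_ybar
p_value = 2 * norm.cdf(-abs(t_stat))             # p-value = 2 * Phi(-|t|), large-sample approximation

print(y_bar, se_ybar, t_stat, p_value)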

Chapter 4: Linear regression with one regressor

Yi = β0 + β1Xi + ui. This equation is the linear regression model with a single regressor, in which Y is the dependent variable and X is the independent variable or the regressor. The first part of the equation, β0 + β1Xi, is the population regression line or the population regression function. This is the relationship that holds between Y and X on average over the population. The intercept β0 and the slope β1 are the coefficients of the population regression line, also known as the parameters of the population regression line. The slope β1 is the change in Y associated with a unit change in X. The intercept is the value of the population regression line when X = 0; it is the point at which the population regression line intersects the Y axis. The term ui in the equation is the error term. The error term incorporates all of the factors responsible for the difference between the ith district's average test score and the value predicted by the population regression line. This error term contains all the other factors besides X that determine the value of the dependent variable, Y, for a specific observation, i. The OLS estimator chooses the regression coefficients so that the estimated regression line is as close as possible to the observed data, where closeness is measured by the sum of the squared mistakes made in predicting Y given X. The OLS estimator extends this idea to the linear regression model. Let b0 and b1 be some estimators of β0 and β1. The regression line based on these estimators is b0 + b1X, so the value of Yi predicted using this line is b0 + b1Xi. Thus the mistake made in predicting the ith observation is Yi − (b0 + b1Xi). The sum of these squared prediction mistakes over all n observations is Σ (Yi − b0 − b1Xi)². The estimators of the intercept and slope that minimize the sum of squared mistakes are called the ordinary least squares (OLS) estimators. The OLS regression line, also called the sample regression line or sample regression function, is the straight line constructed using the OLS estimators: β̂0 + β̂1X. There are both practical and theoretical reasons to use the OLS estimators. Because OLS is the dominant method used in practice, it has become the common language for regression analysis throughout economics, finance, and the social sciences more generally. The R² and the standard error of the regression measure how well the OLS regression line fits the data. The R² ranges between 0 and 1 and measures the fraction of the variance of Yi that is explained by Xi. The standard error of the regression measures how far Yi typically is from its predicted value. The regression R² is the fraction of the sample variance of Yi explained by Xi. The definitions of the predicted value and the residual allow us to write the dependent variable Yi as the sum of the predicted value, Ŷi, plus the residual ûi: Yi = Ŷi + ûi. Mathematically, the R² can be written as the ratio of the explained sum of squares to the total sum of squares. The explained sum of squares (ESS) is the sum of squared deviations of the predicted values Ŷi from their average, and the total sum of squares (TSS) is the sum of squared deviations of Yi from its average: ESS = Σ (Ŷi − Ȳ)²; TSS = Σ (Yi − Ȳ)². The R² is the ratio of the explained sum of squares to the total sum of squares: R² = ESS/TSS = 1 − (SSR/TSS), where the sum of squared residuals is SSR = Σ ûi².
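As an illustrative sketch (not part of the text, using made-up data), the OLS coefficients and the R² decomposition can be computed directly from the formulas above in Python:

import numpy as np

rng = np.random.default_rng(3)
x = rng.uniform(10, 30, size=200)                # e.g., a student-teacher ratio
u = rng.normal(scale=10, size=200)
y = 700 - 2.0 * x + u                            # "true" line used only to generate data

beta1_hat = np.cov(x, y, ddof=1)[0, 1] / np.var(x, ddof=1)   # OLS slope
beta0_hat = y.mean() - beta1_hat * x.mean()                  # OLS intercept

y_hat = beta0_hat + beta1_hat * x
residuals = y - y_hat

ess = np.sum((y_hat - y.mean()) ** 2)            # explained sum of squares
tss = np.sum((y - y.mean()) ** 2)                # total sum of squares
ssr = np.sum(residuals ** 2)                     # sum of squared residuals

print(beta1_hat, beta0_hat)
print(ess / tss, 1 - ssr / tss)                  # the two expressions for R^2 coincide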

The R² ranges between 0 and 1. If β̂1 = 0, then Xi explains none of the variation of Yi, and the predicted value of Yi based on the regression is just the sample average of Yi. In this case, the explained sum of squares is zero and the sum of squared residuals equals the total sum of squares; thus the R² is zero. In contrast, if Xi explains all of the variation of Yi, then Ŷi = Yi for all i and every residual is zero, so that ESS = TSS and R² = 1. In general, the R² does not take on the extreme values of 0 or 1 but falls somewhere in between. An R² near 1 indicates that the regressor is good at predicting Yi, while an R² near 0 indicates that the regressor is not very good at predicting Yi. The standard error of the regression (SER) is an estimator of the standard deviation of the regression error ui: SER = sû, where s²û = (1/(n − 2)) Σ ûi² = SSR/(n − 2). The least squares assumptions:
1. The error term ui has conditional mean zero given Xi: E(ui | Xi) = 0.
2. (Xi, Yi), i = 1, ..., n, are independent and identically distributed (i.i.d.) draws from their joint distribution.
3. Large outliers are unlikely: Xi and Yi have nonzero finite fourth moments.

Chapter 5: Regression with a single regressor: Hypothesis tests and confidence intervals

The test of the null hypothesis H0 against the two-sided alternative proceeds in the three steps summarized in chapter 3. The first is to compute the standard error of the estimator, which is an estimator of the standard deviation of its sampling distribution. The second step is to compute the t-statistic, which has the general form t = (estimator − hypothesized value) / standard error of the estimator. The third step is to compute the p-value, which is the smallest significance level at which the null hypothesis could be rejected, based on the test statistic actually observed; equivalently, the p-value is the probability of obtaining a statistic, by random sampling variation, at least as different from the null hypothesis value as is the statistic actually observed, assuming the null hypothesis is correct. Regression analysis can also be used when the regressor is binary, that is, when it takes on only two values, 0 or 1. For example, X might be a worker's gender (= 1 if female, = 0 if male). A binary variable is also called an indicator variable or sometimes a dummy variable. The error term ui is homoskedastic if the variance of the conditional distribution of ui given Xi is constant for i = 1, ..., n and in particular does not depend on Xi. Otherwise, the error term is heteroskedastic. If the least squares assumptions hold and the errors are homoskedastic, then the OLS estimators β̂0 and β̂1 are efficient among all estimators that are linear in Y1, ..., Yn and are unbiased, conditional on X1, ..., Xn. This result is called the Gauss-Markov theorem. If the error term is homoskedastic, then the formulas for the variances of β̂0 and β̂1 simplify. Consequently, if the errors are homoskedastic, then there is a specialized formula that can be used for the standard errors of β̂0 and β̂1. The homoskedasticity-only standard error of β̂1 is SE(β̂1) = √(s²û / Σ (Xi − X̄)²). At a general level, however, economic theory rarely gives any reason to believe that the errors are homoskedastic. It therefore is prudent to assume that the errors might be heteroskedastic unless you have compelling reasons to believe otherwise.
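Since the text advises allowing for heteroskedasticity, here is a minimal Python sketch (an addition, assuming the statsmodels package) that estimates a regression and reports heteroskedasticity-robust (HC1) standard errors:

import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(4)
x = rng.uniform(0, 10, size=500)
u = rng.normal(scale=1 + 0.5 * x)                # error variance depends on x: heteroskedastic
y = 1.0 + 2.0 * x + u

X = sm.add_constant(x)                           # adds the intercept column
robust_fit = sm.OLS(y, X).fit(cov_type="HC1")    # heteroskedasticity-robust standard errors
print(robust_fit.bse)                            # compare with .fit() for homoskedasticity-only SEs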
The Gauss-Markov theorem states that, under a set of conditions known as the Gauss-Markov conditions, the OLS estimator β̂1 has the smallest conditional variance, given X1, ..., Xn, of all linear conditionally unbiased estimators of β1; that is, the OLS estimator is BLUE (the best linear unbiased estimator). The Gauss-Markov conditions are implied by the three least squares assumptions plus the assumption that the errors are homoskedastic.

Chapter 6: Linear regression with multiple regressors

If the regressor (the student-teacher ratio) is correlated with a variable that has been omitted from the analysis (the percentage of English learners) and that determines, in part, the dependent variable (test scores), then the OLS estimator will have omitted variable bias. Omitted variable bias occurs when two conditions are true: (1) the omitted variable is correlated with the included regressor, and (2) the omitted variable is a determinant of the dependent variable. Omitted variable bias is a problem whether the sample size is large or small. Because β̂1 does not converge in probability to the true value β1, β̂1 is biased and inconsistent; that is, β̂1 is not a consistent estimator of β1 when there is omitted variable bias. Whether this bias is large or small in practice depends on the correlation between the regressor and the error term. The larger the correlation is, the larger the bias. The direction of the bias in β̂1 depends on whether X and u are positively or negatively correlated. The multiple regression model extends the single-variable regression model of chapters 4 and 5 to include additional variables as regressors. This model permits estimating the effect on Yi of changing one variable (X1i) while holding the other regressors (X2i, X3i, and so forth) constant. Yi = β0 + β1X1i + β2X2i + ui. The function β0 + β1X1i + β2X2i is the population regression line or population regression function in the multiple regression model. The coefficient β0 is the intercept; the other coefficients are the slope coefficients. The coefficient β1 is the effect on Y of a unit change in X1, holding X2 fixed. Another phrase used to describe β1 is the partial effect on Y of X1, holding X2 fixed. Three commonly used summary statistics in multiple regression are the standard error of the regression, the regression R², and the adjusted R² (also known as R̄²). The standard error of the regression (SER) estimates the standard deviation of the error term ui. Thus the SER is a measure of the spread of the distribution of Y around the regression line. In multiple regression, SER = sû, where s²û = (1/(n − k − 1)) Σ ûi² = SSR/(n − k − 1). The regression R² is the fraction of the sample variance of Yi explained by (or predicted by) the regressors. The mathematical definition of the R² is the same as for regression with a single regressor: R² = ESS/TSS = 1 − (SSR/TSS), where ESS = Σ (Ŷi − Ȳ)² and TSS = Σ (Yi − Ȳ)². Because the R² increases when a new variable is added, an increase in the R² does not mean that adding a variable actually improves the fit of the model. In this sense, the R² gives an inflated estimate of how well the regression fits the data. One way to correct for this is to deflate or reduce the R² by some factor, and this is what the adjusted R², or R̄², does: R̄² = 1 − ((n − 1)/(n − k − 1)) × (SSR/TSS) = 1 − s²û / s²Y. There are four least squares assumptions in the multiple regression model. The first three are those of chapter 4 for the single-regressor model, extended to allow for multiple regression:

1. ui has conditional mean zero given X1i, X2i, ..., Xki; that is, E(ui | X1i, X2i, ..., Xki) = 0.
2. (X1i, X2i, ..., Xki, Yi), i = 1, ..., n, are independently and identically distributed (i.i.d.) draws from their joint distribution.
3. Large outliers are unlikely: X1i, X2i, ..., Xki and Yi have nonzero finite fourth moments.
4. There is no perfect multicollinearity.

Imperfect multicollinearity means that two or more of the regressors are highly correlated in the sense that there is a linear function of the regressors that is highly correlated with another regressor. Regression with a single regressor is vulnerable to omitted variable bias: If an omitted variable is a determinant of the dependent variable and is correlated with the regressor, then the OLS estimator of the slope coefficient will be biased and will reflect both the effect of the regressor and the effect of the omitted variable. Multiple regression makes it possible to mitigate omitted variable bias by including the omitted variable in the regression. The coefficient on a regressor, X1, in multiple regression is the partial effect of a change in X1, holding constant the other included regressors. The statistical theory of multiple regression builds on the statistical theory of regression with a single regressor. The least squares assumptions for multiple regression are extensions of the three least squares assumptions for regression with a single regressor, plus a fourth assumption ruling out perfect multicollinearity. Because the regression coefficients are estimated using a single sample, the OLS estimators have a joint sampling distribution and therefore have sampling uncertainty. This sampling uncertainty must be quantified as part of an empirical study, and the ways to do so in the multiple regression model are the topic of the next chapter.

Chapter 7: Hypothesis tests and confidence intervals in multiple regression

Testing joint hypotheses: In this case, the null hypothesis restricts the value of two of the coefficients, so as a matter of terminology we can say that the null hypothesis imposes two restrictions on the multiple regression model. In general, a joint hypothesis is a hypothesis that imposes two or more restrictions on the regression coefficients. We consider joint null and alternative hypotheses of the form H0: βj = βj,0, ..., βm = βm,0, for a total of q restrictions, and H1: one or more of the q restrictions under H0 does not hold. The F-statistic is used to test joint hypotheses about regression coefficients. We first discuss the case of two restrictions and then turn to the general case of q restrictions. When the joint null hypothesis has the two restrictions β1 = 0 and β2 = 0, the F-statistic combines the two t-statistics t1 and t2 using the formula F = (1/2) × (t1² + t2² − 2ρ̂t1,t2 t1t2) / (1 − ρ̂²t1,t2), where ρ̂t1,t2 is an estimator of the correlation between the two t-statistics. The overall regression F-statistic tests the joint hypothesis that all the slope coefficients are zero. When q = 1, the F-statistic tests a single restriction. Then the joint null hypothesis reduces to the null hypothesis on a single regression coefficient, and the F-statistic is the square of the t-statistic. The homoskedasticity-only F-statistic is given by the formula F = [(SSRrestricted − SSRunrestricted)/q] / [SSRunrestricted/(n − kunrestricted − 1)], or equivalently F = [(R²unrestricted − R²restricted)/q] / [(1 − R²unrestricted)/(n − kunrestricted − 1)]. A control variable is not the object of interest in the study; rather, it is a regressor included to hold constant factors that, if neglected, could lead the estimated causal effect of interest to suffer from omitted variable bias. Our approach to the challenge of potential omitted variable bias is twofold.
First, a core or base set of regressors should be chosen using a combination of expert judgment, economic theory,

and knowledge of how the data were collected; the regression using this base set of regressors is sometimes referred to as a base specification. This base specification should contain the variables of primary interest and the control variables suggested by expert judgment and economic theory. Expert judgment and economic theory are rarely decisive, however, and often the variables suggested by economic theory are not the ones on which you have data. Therefore the next step is to develop a list of candidate alternative specifications, that is, alternative sets of regressors. There are four potential pitfalls to guard against when using the R² and R̄²:
1. An increase in the R² or R̄² does not necessarily mean that an added variable is statistically significant.
2. A high R² or R̄² does not mean that the regressors are a true cause of the dependent variable.
3. A high R² or R̄² does not mean that there is no omitted variable bias.
4. A high R² or R̄² does not necessarily mean that you have the most appropriate set of regressors, nor does a low R² or R̄² necessarily mean that you have an inappropriate set of regressors.

Chapter 8: Nonlinear regression functions

A nonlinear function is a function with a slope that is not constant: The function f(X) is linear if the slope of f(X) is the same for all values of X, but if the slope depends on the value of X, then f(X) is nonlinear. If a straight line is not an adequate description of the relationship between variables, what is? One way to approximate such a curve mathematically is to model the relationship as a quadratic function. That is, we could model Y as a function of the variable X and the square of the variable: Yi = β0 + β1Xi + β2Xi² + ui. This equation is called the quadratic regression model because the population regression function is a quadratic function of the independent variable. Because the quadratic regression model is a variant of multiple regression, its unknown population coefficients can be estimated and tested using the OLS methods described in chapters 6 and 7. The general approach to modelling nonlinear regression functions taken in this chapter has five elements:
1. Identify a possible nonlinear relationship. The best thing to do is to use economic theory and what you know about the application to suggest a possible nonlinear relationship.
2. Specify a nonlinear function and estimate its parameters by OLS.
3. Determine whether the nonlinear model improves upon a linear model.
4. Plot the estimated nonlinear regression function.
5. Estimate the effect on Y of a change in X.
The first method discussed in this section is polynomial regression, an extension of the quadratic regression used in the last section to model the relationship between test scores and income. The second method uses logarithms of X, of Y, or of both X and Y. Although these methods are presented separately, they can be used in combination. One way to specify a nonlinear regression function is to use a polynomial in X. In general, let r denote the highest power of X that is included in the regression. The polynomial regression model of degree r is Yi = β0 + β1Xi + β2Xi² + ... + βrXi^r + ui.
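As a sketch (an addition, with simulated data and assuming statsmodels), a polynomial regression is simply an OLS regression that includes X, X², ..., X^r as regressors; the quadratic case looks like this:

import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(5)
x = rng.uniform(5, 55, size=400)                       # e.g., district income
y = 600 + 4 * x - 0.04 * x ** 2 + rng.normal(scale=8, size=400)

X = sm.add_constant(np.column_stack([x, x ** 2]))      # regressors: 1, X, X^2
quad_fit = sm.OLS(y, X).fit(cov_type="HC1")
print(quad_fit.params)                                 # estimates of beta0, beta1, beta2
print(quad_fit.tvalues[2])                             # t-statistic on X^2: is the quadratic term needed?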


When r = 2, the polynomial regression model is the quadratic regression model discussed before. When r = 3, so that the highest power of X included is X³, it is called the cubic regression model. If the population regression function is linear, then the quadratic and higher-degree terms do not enter the population regression function. Accordingly, the null hypothesis (H0) that the regression is linear and the alternative (H1) that it is a polynomial of degree r correspond to H0: β2 = 0, β3 = 0, ..., βr = 0 and H1: at least one of β2, ..., βr is nonzero. The null hypothesis that the population regression function is linear can be tested against the alternative that it is a polynomial of degree r by testing H0 against H1. The answer to the question of how many terms to include is that you should include enough to model the nonlinear regression function adequately, but no more. Unfortunately, this answer is not very useful in practice. A practical way to determine the degree of the polynomial is to ask whether the coefficient associated with the largest value of r is zero. If so, then this term can be dropped from the regression. This procedure, which is called sequential hypothesis testing because individual hypotheses are tested sequentially, is summarized in the following steps:
1. Pick a maximum value of r and estimate the polynomial regression for that r.
2. Use the t-statistic to test the hypothesis that the coefficient on X^r is zero. If you reject this hypothesis, then X^r belongs in the regression, so use the polynomial of degree r.
3. If you do not reject βr = 0 in step 2, eliminate X^r from the regression and estimate a polynomial regression of degree r − 1. Test whether the coefficient on X^(r−1) is zero. If you reject, use the polynomial of degree r − 1.
4. If you do not reject βr−1 = 0 in step 3, continue this procedure until the coefficient on the highest power in your polynomial is statistically significant.
The exponential function of x is e^x, where e is the constant 2.71828...; the exponential function is also written as exp(x). The natural logarithm is the inverse of the exponential function; that is, the natural logarithm is the function for which x = ln(e^x) or, equivalently, x = ln[exp(x)]. Note that the logarithm function is defined only for positive values of x. The logarithm function has a slope that is steep at first and then flattens out. Yi = β0 + β1ln(Xi) + ui, i = 1, ..., n. Because Y is not in logarithms but X is, this is sometimes referred to as a linear-log model. ln(Yi) = β0 + β1Xi + ui. Because Y is in logarithms but X is not, this is referred to as a log-linear model. In the log-linear model, a one-unit change in X (ΔX = 1) is associated with a 100 × β1 % change in Y. ln(Yi) = β0 + β1ln(Xi) + ui. Because both Y and X are specified in logarithms, this is referred to as a log-log model. In the log-log model, a 1% change in X is associated with a β1 % change in Y. Thus in this specification β1 is the elasticity of Y with respect to X: ΔY/Y = β1 × (ΔX/X), so β1 = (ΔY/Y) / (ΔX/X) = percentage change in Y / percentage change in X. Thus in the log-log specification β1 is the ratio of the percentage change in Y to the percentage change in X.
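A brief Python sketch (an addition, with made-up data, assuming statsmodels) of the log-log specification, in which the slope estimate is interpreted as an elasticity:

import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(6)
x = rng.uniform(10, 60, size=400)                             # strictly positive regressor
y = 20 * x ** 0.8 * np.exp(rng.normal(scale=0.1, size=400))   # true elasticity is 0.8

X = sm.add_constant(np.log(x))
loglog_fit = sm.OLS(np.log(y), X).fit(cov_type="HC1")
print(loglog_fit.params[1])                                   # estimated elasticity of Y with respect to X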

Which of the log regression models best fits the data? As we saw in the discussion, the R² can be used to compare the log-linear and log-log models; as it happened, the log-log model had the higher R². But how can we compare the linear-log model and the log-log model? Unfortunately, the R² cannot be used to compare these two regressions because their dependent variables are different. Recall that the R² measures the fraction of the variance of the dependent variable explained by the regressors. Because the dependent variables in the log-log and linear-log models are different, it does not make sense to compare their R²s.

Chapter 9: Assessing studies based on multiple regression

Internal and external validity distinguish between the population and setting studied and the population and setting to which the results are generalized. The population studied is the population of entities (people, companies, school districts, and so forth) from which the sample was drawn. The population to which the results are generalized, or the population of interest, is the population of entities to which the causal inferences from the study are to be applied. Internal validity has two components. First, the estimator of the causal effect should be unbiased and consistent. For example, if β̂STR is the OLS estimator of the effect on test scores of a unit change in the student-teacher ratio in a certain regression, then β̂STR should be an unbiased and consistent estimator of the true population causal effect of a change in the student-teacher ratio, βSTR. Second, the hypothesis tests should have the desired significance level, and confidence intervals should have the desired confidence level. For example, if a confidence interval is constructed as β̂STR ± 1.96 SE(β̂STR), this confidence interval should contain the true population causal effect, βSTR, with probability 95% over repeated samples. Potential threats to external validity arise from differences between the population and setting studied and the population and setting of interest. Differences between the population studied and the population of interest can pose a threat to external validity. More generally, the true causal effect might not be the same in the population studied and the population of interest. This could be because the population was chosen in a way that makes it different from the population of interest, because of differences in characteristics of the populations, because of geographical differences, or because the study is out of date. Even if the population being studied and the population of interest are identical, it might not be possible to generalize the study results if the settings differ. More generally, examples of differences in settings include differences in the institutional environment, differences in laws, or differences in the physical environment. Studies based on regression analysis are internally valid if the estimated regression coefficients are unbiased and consistent, and if their standard errors yield confidence intervals with the desired confidence level. This section surveys five reasons why the OLS estimator of the multiple regression coefficients might be biased, even in large samples: omitted variables, misspecification of the functional form of the regression function, imprecise measurement of the independent variables (errors in variables), sample selection, and simultaneous causality.
Solutions to omitted variable bias when the variable is observed or there are adequate control variables: The decision whether to include a variable involves a trade-off between the bias and the variance of the coefficient of interest. In practice, there are four steps that can help you decide whether to include a variable or set of variables in a regression. The first step is to identify the key coefficient or coefficients of interest in your regression. In the test score regressions, this is the coefficient on the student-teacher ratio, because the question originally posed concerns the effect on test scores of reducing the student-teacher ratio.

The second step is to ask yourself: What are the most likely sources of important omitted variable bias in this regression? The results of this step are a base regression specification, the starting point for your empirical regression analysis, and a list of additional questionable variables that might help to mitigate possible omitted variable bias. The third step is to augment your base specification with the additional questionable control variables identified in the second step. If the coefficients on the additional control variables are statistically significant or if the estimated coefficients of interest change appreciably when the additional variables are included, then they should remain in the specification and you should modify your base specification. If not, then these variables can be excluded from the regression. The fourth step is to present an accurate summary of your results in tabular form. This provides full disclosure to a potential sceptic, who can then draw his or her own conclusions. Solutions to omitted variable bias when adequate control variables are not available: The first solution is to use data in which the same observational unit is observed at different points in time. Using panel data makes it possible to control for unobserved omitted variables as long as those omitted variables do not change over time. The second solution is to use instrumental variables regression. This method relies on a new variable, called an instrumental variable. Instrumental variables regression is discussed in chapter 12. The third solution is to use a study design in which the effect of interest is studied using a randomized controlled experiment. Randomized controlled experiments are discussed in chapter 13. If the true population regression function is nonlinear but the estimated regression is linear, then this functional form misspecification makes the OLS estimator biased. This bias is a type of omitted variable bias, in which the omitted variables are the terms that reflect the missing nonlinear aspects of the regression function. Suppose that in our regression of test scores against the student-teacher ratio we had inadvertently mixed up our data, so that we ended up regressing test scores for fifth graders on the student-teacher ratio for tenth graders in that district. Although the student-teacher ratio for elementary school students and tenth graders might be correlated, they are not the same, so this mix-up would lead to bias in the estimated coefficient. This is an example of errors-in-variables bias because its source is an error in the measurement of the independent variable. This bias persists even in very large samples, so the OLS estimator is inconsistent if there is measurement error. The best way to solve the errors-in-variables problem is to get an accurate measure of X. If this is impossible, however, econometric methods can be used to mitigate errors-in-variables bias. One such method is instrumental variables regression. It relies on having another variable that is correlated with the actual value Xi but is uncorrelated with the measurement error. This method is studied in chapter 12. A second method is to develop a mathematical model of the measurement error and, if possible, to use the resulting formulas to adjust the estimates. So far, we have assumed that causality runs from the regressors to the dependent variable (X causes Y). But what if causality also runs from the dependent variable to one or more regressors (Y causes X)?
If so, causality runs backward as well as forward; that is, there is simultaneous causality. If there is simultaneous causality, an OLS regression picks up both effects, so the OLS estimator is biased and inconsistent. A study based on multiple regression is internally valid if the estimated coefficients are unbiased and consistent, and if standard errors are consistent. Threats to the internal validity of such a study include omitted variables, misspecification of functional form (nonlinearities), imprecise measurement of the independent variables (errors in variables), sample selection, and simultaneous causality. Each of these introduces correlation between the regressor and the error term, which in turn makes OLS estimators biased and inconsistent. If the errors are correlated across observations,

as they can be with time series data, or if they are heteroskedastic but the standard errors are computed using the homoskedasticity-only formula, then internal validity is compromised because the standard errors will be inconsistent. A study using regression analysis, like any statistical study, is externally valid if its findings can be generalized beyond the population and setting studied. Sometimes it can help to compare two or more studies on the same topic. Whether or not there are two or more such studies, however, assessing external validity requires making judgments about the similarities of the population and setting studied and the population and setting to which the results are being generalized.

Chapter 11: Regression with a binary dependent variable

The linear multiple regression model applied to a binary dependent variable is called the linear probability model: linear because it is a straight line and probability model because it models the probability that the dependent variable equals 1. The linear probability model is the name for the multiple regression model when the dependent variable is binary rather than continuous. Because the dependent variable Y is binary, the population regression function corresponds to the probability that the dependent variable equals 1, given X. The population coefficient β1 on a regressor X is the change in the probability that Y = 1 associated with a unit change in X. Similarly, the OLS predicted value Ŷi, computed using the estimated regression function, is the predicted probability that the dependent variable equals 1, and the OLS estimator β̂1 estimates the change in the probability that Y = 1 associated with a unit change in X. A probability cannot be less than 0 or greater than 1, yet the linear probability model can produce predicted probabilities outside this range. This nonsensical feature is an inevitable consequence of the linear regression. To address this problem, we introduce new nonlinear models specifically designed for binary dependent variables, the probit and logit regression models. Probit and logit regression are nonlinear models specifically designed for binary dependent variables. Because a regression with a binary dependent variable Y models the probability that Y = 1, it makes sense to adopt a nonlinear formulation that forces the predicted values to be between 0 and 1. Because cumulative probability distribution functions (c.d.f.s) produce probabilities between 0 and 1, they are used in logit and probit regressions. Probit regression uses the standard normal c.d.f. Logit regression, also called logistic regression, uses the logistic c.d.f. The probit regression model with a single regressor X is Pr(Y = 1 | X) = Φ(β0 + β1X), where Φ is the cumulative standard normal distribution function. The probit coefficient β1 is the change in the z-value associated with a unit change in X. If β1 is positive, an increase in X increases the z-value and thus increases the probability that Y = 1; if β1 is negative, an increase in X decreases the probability that Y = 1. Although the effect of X on the z-value is linear, its effect on the probability is nonlinear. Thus in practice the easiest way to interpret the coefficients of a probit model is to compute the predicted probability, or the change in the predicted probability, for one or more values of the regressors. When there is just one regressor, the predicted probability can be plotted as a function of X.
The logit regression model is similar to the probit regression model except that the cumulative standard normal distribution function in the equation for the probit model is replaced by the cumulative standard logistic distribution function, which we denote by F. The logistic cumulative distribution function has a specific functional form, defined in terms of the exponential function, which is given as the final expression in the following formula: Pr(Y = 1 | X1, X2, ..., Xk) = F(β0 + β1X1 + β2X2 + ... + βkXk) = 1 / (1 + e^−(β0 + β1X1 + β2X2 + ... + βkXk)).
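A minimal Python sketch (an addition, with simulated data, assuming statsmodels) of probit and logit estimation for a binary dependent variable:

import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(7)
x = rng.normal(size=1000)
prob = 1 / (1 + np.exp(-(-0.5 + 1.2 * x)))              # true logistic probabilities
y = rng.binomial(1, prob)                               # binary outcome

X = sm.add_constant(x)
probit_fit = sm.Probit(y, X).fit(disp=False)            # Pr(Y=1|X) = Phi(b0 + b1*X)
logit_fit = sm.Logit(y, X).fit(disp=False)              # Pr(Y=1|X) = 1 / (1 + exp(-(b0 + b1*X)))

print(probit_fit.params, logit_fit.params)
print(probit_fit.predict(X)[:5])                        # predicted probabilities, always between 0 and 1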

Nonlinear least squares is a general method for estimating the unknown parameters of a regression function when, like the probit coefficients, those parameters enter the population regression function nonlinearly. The nonlinear least squares estimator extends the OLS estimator to regression functions that are nonlinear functions of the parameters. Like OLS, nonlinear least squares finds the values of the parameters that minimize the sum of squared prediction mistakes produced by the model. That is, the nonlinear least squares estimators of the probit coefficients are those values of b0, b1, ..., bk that minimize the sum of squared prediction mistakes Σ [Yi − Φ(b0 + b1X1i + ... + bkXki)]². The nonlinear least squares estimator shares two key properties with the OLS estimator in linear regression: It is consistent and it is normally distributed in large samples. The likelihood function is the joint probability distribution of the data, treated as a function of the unknown coefficients. The maximum likelihood estimator (MLE) of the unknown coefficients consists of the values of the coefficients that maximize the likelihood function. Because the MLE chooses the unknown coefficients to maximize the likelihood function, which is in turn the joint probability distribution, in effect the MLE chooses the values of the parameters to maximize the probability of drawing the data that are actually observed. In this sense, the MLEs are the parameter values most likely to have produced the data. For n = 2 i.i.d. observations of Bernoulli random variables, the likelihood function is f(p; Y1, Y2) = p^(Y1 + Y2) × (1 − p)^(2 − (Y1 + Y2)). The maximum likelihood estimator of p is the value of p that maximizes this likelihood function. As with all maximization or minimization problems, this can be done by trial and error; that is, you can try different values of p and compute the likelihood f(p; Y1, Y2) until you are satisfied that you have maximized this function. You could, however, also use calculus, which produces a simple formula for the MLE in this example: the MLE is p̂ = ½(Y1 + Y2). In other words, the MLE of p is just the sample average. In fact, for general n, the MLE p̂ of the Bernoulli probability p is the sample average; that is, p̂ = Ȳ. In this example, the MLE is the usual estimator of p, the fraction of times Yi = 1 in the sample. Two measures of fit for models with binary dependent variables are the fraction correctly predicted and the pseudo-R². The fraction correctly predicted uses the following rule: If Yi = 1 and the predicted probability exceeds 50%, or if Yi = 0 and the predicted probability is less than 50%, then Yi is said to be correctly predicted. Otherwise, Yi is said to be incorrectly predicted. The fraction correctly predicted is the fraction of the n observations Y1, ..., Yn that are correctly predicted. An advantage of this measure of fit is that it is easy to understand. A disadvantage is that it does not reflect the quality of the prediction: If Yi = 1, the observation is treated as correctly predicted whether the predicted probability is 51% or 90%. The pseudo-R² measures the fit of the model using the likelihood function.
Because the MLE maximizes the likelihood function, adding another regressor to a probit or logit model increases the value of the maximized likelihood, just as adding a regressor necessarily reduces the sum of squared residuals in linear regression by OLS. This suggests measuring the quality of fit of a probit model by comparing the value of the maximized likelihood function with all the regressors to the value of the likelihood with none (that is, with only the intercept). This is, in fact, what the pseudo-R² does.
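To illustrate the maximum likelihood discussion, the short Python sketch below (an addition) maximizes the Bernoulli log-likelihood over a grid of values of p and confirms that the maximizer is essentially the sample average; the grid search stands in for the calculus argument:

import numpy as np

rng = np.random.default_rng(8)
y = rng.binomial(1, 0.37, size=500)                     # i.i.d. Bernoulli draws

def log_likelihood(p, y):
    # log of prod p^y_i * (1-p)^(1-y_i), the joint Bernoulli probability of the sample
    return np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

grid = np.linspace(0.001, 0.999, 999)
loglik = np.array([log_likelihood(p, y) for p in grid])
p_mle = grid[np.argmax(loglik)]                         # value of p that maximizes the likelihood

print(p_mle, y.mean())                                  # the MLE matches the sample average (up to grid spacing)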


Chapter 12: The IV estimator with a single regressor and a single instrument

Instrumental variables regression has some specialized terminology to distinguish variables that are correlated with the population error term u from ones that are not. Variables correlated with the error term are called endogenous variables, while variables uncorrelated with the error term are called exogenous variables. A valid instrumental variable (instrument) must satisfy two conditions, known as the instrument relevance condition and the instrument exogeneity condition:
1. Instrument relevance: corr(Zi, Xi) ≠ 0.
2. Instrument exogeneity: corr(Zi, ui) = 0.
If an instrument is relevant, then variation in the instrument is related to variation in Xi. If in addition the instrument is exogenous, then that part of the variation of Xi captured by the instrumental variable is exogenous. Thus an instrument that is relevant and exogenous can capture movements in Xi that are exogenous. This exogenous variation can in turn be used to estimate the population coefficient β1. The two conditions for a valid instrument are vital for instrumental variables regression, and we return to them repeatedly throughout this chapter. If the instrument Z satisfies the conditions of instrument relevance and exogeneity, the coefficient β1 can be estimated using an IV estimator called two stage least squares (TSLS). As the name suggests, the two stage least squares estimator is calculated in two stages. The first stage decomposes X into two components: a problematic component that may be correlated with the regression error and another, problem-free component; the second stage uses the problem-free component to estimate β1. The first stage begins with a population regression linking X and Z: Xi = π0 + π1Zi + vi. The second stage of TSLS is easy: Regress Yi on X̂i, the predicted values of Xi from the first-stage regression, using OLS. The resulting estimators from the second-stage regression are the TSLS estimators, β̂0^TSLS and β̂1^TSLS. Although the two stages of TSLS make the estimator seem complicated, when there is a single X and a single instrument Z, as we assume in this section, there is a simple formula for the TSLS estimator. Let sZY be the sample covariance between Z and Y and let sZX be the sample covariance between Z and X. As shown in Appendix 12.2, the TSLS estimator with a single instrument is β̂1^TSLS = sZY / sZX. That is, the TSLS estimator of β1 is the ratio of the sample covariance between Z and Y to the sample covariance between Z and X. The general IV regression model has four types of variables: the dependent variable, Y; problematic endogenous regressors, like the price of cigarettes, which are correlated with the error term and which we will label X; additional regressors, called included exogenous variables, which we will label W; and instrumental variables, Z. In general, there can be multiple endogenous regressors (Xs), multiple included exogenous variables (Ws), and multiple instrumental variables (Zs). For IV regression to be possible, there must be at least as many instrumental variables (Zs) as endogenous regressors (Xs). The relationship between the number of instruments and the number of endogenous regressors has its own terminology. The regression coefficients are said to be exactly identified if the number of instruments (m) equals the number of endogenous regressors (k); that is, m = k. The coefficients are overidentified if m > k. They are underidentified if the number of instruments is less than the number of endogenous regressors; that is, m < k.
The coefficients must be either exactly identified or overidentified if they are to be estimated by IV regression.
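As an illustrative addition (simulated data, assuming statsmodels), the following Python sketch carries out the two stages of TSLS by hand and checks that the result matches the single-instrument formula β̂1^TSLS = sZY / sZX:

import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(9)
n = 5000
z = rng.normal(size=n)                                  # instrument: relevant and exogenous by construction
u = rng.normal(size=n)
x = 0.8 * z + 0.5 * u + rng.normal(size=n)              # endogenous regressor: correlated with u
y = 2.0 + 1.5 * x + u                                   # true beta1 = 1.5; OLS of y on x is biased upward

# First stage: regress X on Z and form the fitted values X-hat
first = sm.OLS(x, sm.add_constant(z)).fit()
x_hat = first.fittedvalues

# Second stage: regress Y on X-hat
second = sm.OLS(y, sm.add_constant(x_hat)).fit()
print(second.params[1])                                 # TSLS estimate of beta1

# Same answer from the single-instrument formula s_ZY / s_ZX
print(np.cov(z, y, ddof=1)[0, 1] / np.cov(z, x, ddof=1)[0, 1])

Note that the standard errors reported by the manual second-stage regression are not valid TSLS standard errors; a dedicated IV routine should be used for inference.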

