
The Analysis of Panel Data Under a Markov Assumption
Author(s): J. D. Kalbfleisch and J. F. Lawless
Source: Journal of the American Statistical Association, Vol. 80, No. 392 (Dec., 1985), pp. 863-871
Published by: American Statistical Association
Stable URL: http://www.jstor.org/stable/2288545

The Analysis of Panel Data Under a Markov Assumption

J. D. KALBFLEISCH and J. F. LAWLESS*

Methods for the analysis of panel data under a continuous-time Markov model are proposed. We present procedures for obtaining maximum likelihood estimates and associated asymptotic covariance matrices for transition intensity parameters in time homogeneous models, and for other process characteristics such as mean sojourn times and equilibrium distributions. Generalizations to handle covariance analysis and to the fitting of certain nonhomogeneous models are presented, and an example based on a longitudinal study of the smoking habits of school children is discussed. Questions of embeddability and estimation are examined.

KEY WORDS: Markov processes; Maximum likelihood estimation; Regression analysis; Longitudinal data; Embeddability.

1. INTRODUCTION

Continuous-time Markov models have found wide application in the social sciences, especially in the study of data that record life history events (e.g., social mobility studies) for individuals. A recent review of this work is given by Bartholomew (1983); see also, for example, Bartholomew (1982), Singer and Spilerman (1976a,b), and Tuma et al. (1979). In many instances, a full history of the individual processes is not available. In this article we consider "panel data" in which the observations consist of the states occupied by the individuals under study at a sequence of discrete time points; we assume no information to be available about the timing of events between observation times.

Methods for the analysis of panel data under a continuous-time Markov model have been discussed by Bartholomew (1983), Singer and Spilerman (1976a,b), Wasserman (1980), and others. Most of the methodology so far proposed has severe limitations, such as (a) inability to handle observation times that are not equally spaced; (b) inability to produce standard errors, tests, and interval estimates for all model characteristics of interest; and (c) inability to generalize easily to handle covariance analysis. We propose here algorithms for maximum likelihood estimation that overcome these difficulties and that furthermore provide a very efficient way of obtaining maximum likelihood estimates (MLE's).

Section 2 sets notation and reviews a number of results associated with continuous-time Markov processes. Section 3 presents the basic algorithm. Estimates of asymptotic covariance matrices for parameters and estimates of other process characteristics such as equilibrium distributions are also discussed. Some further aspects of maximum likelihood estimation are discussed in Section 4, and Section 5 considers generalizations to covariance analysis and to the fitting of certain nonhomogeneous models. The remainder of the article contains an example based on a longitudinal study of the smoking habits of school children (Section 6), a discussion of the embedding problem for continuous-time Markov processes and its relationships to estimation (Section 7), and some concluding comments.

* J. D. Kalbfleisch and J. F. Lawless are Professors, Department of Statistics and Actuarial Science, University of Waterloo, Waterloo, Ontario N2L 3G1, Canada. Research was supported in part by grants from the Natural Sciences and Engineering Research Council of Canada. The authors thank K. S. Brown for the data on school children in Section 6 and for discussions concerning these data and the methods proposed here. They also thank W. M. Vollmer and E. Kelly for work on the development of the computer algorithms.

2. CONTINUOUS-TIME MARKOV PROCESSES

Suppose individuals independently move among k states, denoted by 1, . . . , k, according to a continuous-time Markov process. Let X(t) be the state occupied at time t by a randomly chosen individual. For 0 <= s <= t, let P(s, t) be the k x k transition probability matrix with entries

    pij(s, t) = Pr{X(t) = j | X(s) = i},    i, j = 1, . . . , k.

As is well known (e.g., see Cox and Miller 1965), this process can be specified in terms of the transition intensities

    qij(t) = lim_{Δt→0} pij(t, t + Δt)/Δt,    i ≠ j.

For convenience, we also define

    qii(t) = -Σ_{j≠i} qij(t),    i = 1, . . . , k,

and let Q(t) be the k x k transition intensity matrix with entries qij(t).

This article is primarily concerned with time-homogeneous models in which qij(t) = qij, independent of t. In this case, the process is stationary and we write

    P(t) = P(s, s + t) = P(0, t)

and Q = (qij) to denote the transition intensity matrix. Note that qij >= 0 for i ≠ j and that Σ_j qij = 0. It is well known (e.g., see Cox and Miller 1965) that

    P(t) = e^{Qt} = Σ_{r=0}^∞ Q^r t^r / r!.

Consider the time-homogeneous model, and suppose that qij = qij(θ) depends on b functionally independent parameters θ = (θ1, . . . , θb), for each i, j = 1, . . . , k. As a simple example, consider a Markov model for marital stability discussed by Tuma et al. (1979, p. 824). Here, an individual is in state 1 (not married), state 2 (married), or state 3 (has "emigrated" from the study for some reason);
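As a concrete check of these definitions, the sketch below builds a small intensity matrix and evaluates P(t) by truncating the series for e^{Qt}. The rates used are hypothetical illustrations, not values from the article.

```python
import numpy as np

# Hypothetical 3-state intensity matrix Q: off-diagonal entries are
# transition rates qij >= 0; each diagonal entry is minus its row sum,
# so every row of Q sums to zero.
Q = np.array([[-0.3, 0.2, 0.1],
              [0.4, -0.5, 0.1],
              [0.0, 0.6, -0.6]])

def transition_matrix(Q, t, terms=60):
    """P(t) = exp(Qt), computed by truncating the series sum_r (Qt)^r / r!."""
    P = np.eye(Q.shape[0])
    term = np.eye(Q.shape[0])
    for r in range(1, terms):
        term = term @ (Q * t) / r
        P += term
    return P

P = transition_matrix(Q, 1.5)
print(P.sum(axis=1))  # each row sums to 1, as a stochastic matrix must
```

The truncated series is adequate for small examples; Section 3 of the article describes the eigendecomposition route, which is what one would use in practice.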


state 3 is absorbing. The transition intensity matrix has the form

         ( -(λ1 + μ1)    λ1            μ1 )
    Q =  (  λ2           -(λ2 + μ2)    μ2 ),    (2.1)
         (  0             0            0  )

and we have θ = (λ1, μ1, λ2, μ2), say. The matrix P(t) is a complicated function of these parameters. For example (see Tuma et al. 1979),

    p11(t) = [(σ1 + δ2)e^{δ1 t} - (σ1 + δ1)e^{δ2 t}]/(δ2 - δ1),    (2.2)

where σ1 = λ1 + μ1, σ2 = λ2 + μ2, and

    δ1 = -(1/2)(σ1 + σ2) + (1/2)[(σ1 - σ2)^2 + 4λ1λ2]^{1/2},
    δ2 = -(1/2)(σ1 + σ2) - (1/2)[(σ1 - σ2)^2 + 4λ1λ2]^{1/2}.    (2.3)

We comment further on this model below.

3. MAXIMUM LIKELIHOOD ESTIMATION

Suppose for now that each of a random sample of n individuals is observed at times t0, t1, . . . , tm. (We will discuss the possibility of immigration or emigration from the group of individuals observed later.) If nijl denotes the number of individuals in state i at t_{l-1} and j at t_l, and if we condition on the distribution of individuals among states at t0, then the likelihood function for θ is

    L(θ) = Π_{l=1}^m { Π_{i,j=1}^k pij(t_{l-1}, t_l)^{nijl} }.    (3.1)

In the time-homogeneous case with w_l = t_l - t_{l-1}, l = 1, . . . , m, (3.1) gives the log-likelihood

    log L(θ) = Σ_{l=1}^m Σ_{i,j=1}^k nijl log pij(w_l).    (3.2)

Throughout, we suppress the dependence of pij(w_l; θ) on θ. The MLE θ̂ or, equivalently, the MLE Q̂ = Q(θ̂) of Q, is obtained by maximizing (3.2). One approach (see Wasserman 1980) utilizes a numerical algorithm that requires no derivatives of log L(θ). There are various algorithms that could be used (e.g., see Chambers 1977, chap. 6). We propose here a more efficient quasi-Newton procedure that uses first derivatives of log L(θ). This leads to faster convergence and an estimate of the asymptotic covariance matrix of θ̂. Our approach is made feasible by the provision of an efficient algorithm for the computation of P(t; θ) and its derivatives with respect to θ. This leads to simple computation of log L(θ) and its derivatives.

To compute P(t; θ) = exp{Q(θ)t} for a given θ, we use a canonical decomposition: if, for the given θ, Q(θ) has distinct eigenvalues d1, . . . , dk, and A is the k x k matrix whose jth column is a right eigenvector corresponding to dj, then Q = ADA^{-1}, where D = diag(d1, . . . , dk). Then

    P(t) = A diag(e^{d1 t}, . . . , e^{dk t}) A^{-1},    (3.3)

where the dependence of Q, P(t), A, and the dj's on θ is suppressed for notational convenience. When Q has repeated eigenvalues, an analogous decomposition of Q to Jordan canonical form is possible (see Cox and Miller 1965). This is rarely necessary, since for most models of interest, Q(θ) has distinct eigenvalues for almost all θ.

Derivatives can be computed in a similar way. In particular, the matrix with entries ∂pij(t; θ)/∂θu can be obtained as

    ∂P(t)/∂θu = A V^{(u)} A^{-1},    u = 1, . . . , b,    (3.4)

where V^{(u)} is a k x k matrix with (i, j) entry

    g_ij^{(u)} (e^{di t} - e^{dj t})/(di - dj),    i ≠ j,
    g_ii^{(u)} t e^{di t},    i = j,

and g_ij^{(u)} is the (i, j) entry in G^{(u)} = A^{-1}(∂Q/∂θu)A. A derivation of this result, which also appears in Jennrich and Bright (1976), is given in Appendix A. The derivatives ∂Q/∂θu are usually very simple, and calculation of (3.4) is easy once A and D are obtained. It is generally the case that P(t) is a complicated function of θ; the above formulas, however, allow us to avoid use of any explicit representation of the elements pij(t) as functions of θ.

3.1 The Scoring Procedure

The quasi-Newton (or scoring) procedure described below is an efficient and simple method for obtaining the MLE of θ. From (3.2) we find

    Su(θ) = ∂ log L/∂θu = Σ_{l=1}^m Σ_{i,j=1}^k nijl [∂pij(w_l)/∂θu]/pij(w_l),    u = 1, . . . , b.    (3.5)

Direct use of a Newton-Raphson algorithm would require the evaluation of both first and second derivatives. We use instead the scoring device in which the second derivatives are replaced by estimates of their expectations. This leads to an algorithm in which only first derivatives are required. Let Ni(t_{l-1}) = Σ_j nijl represent the number of individuals in state i at time t_{l-1}. By first taking the expectation of nijl conditional on Ni(t_{l-1}) and then using the fact that Σ_{j=1}^k ∂^2 pij(w_l)/∂θu∂θv = 0, we find that

    E(-∂^2 log L/∂θu∂θv) = Σ_{l=1}^m Σ_{i,j=1}^k E{Ni(t_{l-1})} [∂pij(w_l)/∂θu][∂pij(w_l)/∂θv] / pij(w_l),

which can be estimated by

    Muv(θ) = Σ_{l=1}^m Σ_{i,j=1}^k Ni(t_{l-1}) [∂pij(w_l)/∂θu][∂pij(w_l)/∂θv] / pij(w_l).    (3.6)

Computation of (3.5) and (3.6) for any given θ is facilitated by the results (3.3) and (3.4). The algorithm is now simply described. Let θ0 be an initial estimate of θ, S(θ) be the b x 1 vector (Su(θ)), and M(θ) the b x b matrix (Muv(θ)). Then an updated estimate is obtained as

    θ1 = θ0 + M(θ0)^{-1} S(θ0),

where it is assumed that M(θ0) is nonsingular. The process is now repeated with θ1 replacing θ0. With a good initial estimate θ0, the algorithm produces θ̂ upon convergence.
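The computations in (3.3)-(3.6) can be sketched as follows. This is a minimal numerical illustration, not the authors' FORTRAN routine; it assumes Q(θ) has distinct eigenvalues and that all pij(w) are strictly positive, and the model functions passed in are supplied by the caller.

```python
import numpy as np

def decompose(Q):
    """Canonical decomposition Q = A diag(d) A^{-1}; distinct eigenvalues assumed."""
    d, A = np.linalg.eig(Q)
    return d, A, np.linalg.inv(A)

def P_and_derivs(theta, Q_fn, dQ_fn, w):
    """P(w) from (3.3) and dP(w)/dtheta_u from (3.4) for one interval length w."""
    d, A, Ai = decompose(Q_fn(theta))
    P = ((A * np.exp(d * w)) @ Ai).real          # A diag(e^{d w}) A^{-1}
    derivs = []
    k = len(d)
    for u in range(len(theta)):
        G = Ai @ dQ_fn(theta, u) @ A             # G^{(u)} = A^{-1} (dQ/dtheta_u) A
        V = np.empty((k, k), dtype=complex)
        for i in range(k):
            for j in range(k):
                if np.isclose(d[i], d[j]):       # includes the diagonal i == j
                    V[i, j] = G[i, j] * w * np.exp(d[i] * w)
                else:
                    V[i, j] = G[i, j] * (np.exp(d[i] * w) - np.exp(d[j] * w)) / (d[i] - d[j])
        derivs.append((A @ V @ Ai).real)
    return P, derivs

def scoring_step(theta, Q_fn, dQ_fn, counts, widths):
    """One Fisher-scoring update theta + M^{-1} S built from (3.5) and (3.6)."""
    b = len(theta)
    S, M = np.zeros(b), np.zeros((b, b))
    for n, w in zip(counts, widths):             # n is the k x k count matrix for one interval
        P, dP = P_and_derivs(theta, Q_fn, dQ_fn, w)
        N = n.sum(axis=1)                        # N_i(t_{l-1})
        for u in range(b):
            S[u] += np.sum(n * dP[u] / P)
            for v in range(b):
                M[u, v] += np.sum(N[:, None] * dP[u] * dP[v] / P)
    return theta + np.linalg.solve(M, S), M
```

For the two-state model of Section 4, for example, Q_fn(θ) = [[-α, α], [β, -β]] and dQ_fn returns the constant matrices ∂Q/∂α and ∂Q/∂β; the step leaves θ unchanged when the counts equal their expectations, since the score then vanishes.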


The matrix M(θ̂)^{-1} is an estimate of the asymptotic covariance matrix of θ̂. If the true value of θ is an interior point of the parameter space, √n(θ̂ - θ) has a limiting multivariate normal distribution as the number of individuals (n) under observation approaches infinity.

3.2 Illustration

Consider the model (2.1) with θ = (λ1, μ1, λ2, μ2). To implement the algorithm, we determine the eigenvalues and eigenvectors of Q(θ0), where θ0 is the trial value. In this case, the eigenvalues can be obtained algebraically; they are 0, δ1, and δ2, where δ1 and δ2 are defined in (2.3). The eigenvectors are more complicated functions of θ. It is important to note, however, that no use is made of these algebraic expressions; for the given θ0 the numerical values of the eigenvectors and eigenvalues can be easily obtained with standard software.

Once the canonical decomposition of Q(θ0) is obtained, the matrices P(w_l; θ0) are obtained from (3.3) and the derivatives from (3.4). The matrices G^{(u)} in (3.4) are simply computed; for example,

                                        ( -1  1  0 )
    G^{(1)} = A^{-1}(∂Q/∂θ1)A = A^{-1}  (  0  0  0 ) A.
                                        (  0  0  0 )

The entire procedure is easily programmed and executes quickly on a computer. A listing of a FORTRAN subroutine that implements the general methodology of this article is available from the authors. An example is given in Section 6. As is remarked further below, there is an advantage here to parameterize in terms of θ = (log λ1, log μ1, log λ2, log μ2), since the parameter space for θ is then unrestricted. This guarantees that the result of an iteration will not fall outside the parameter space. Estimates that lie on the boundary of the parameter space do cause difficulty and are discussed in Section 4.

3.3 Estimates of Derived Quantities

Often, specific characteristics of the model are of interest, such as -qii(θ)^{-1}, the mean sojourn time in state i. In general, such quantities are estimated by replacing θ with θ̂, and the asymptotic variance is estimated from the multivariate delta theorem (see Rao 1973, p. 388). For example, the mean sojourn time in state i is estimated by -qii(θ̂)^{-1} and its estimated asymptotic variance is qii(θ̂)^{-4} times

    Σ_{u=1}^b Σ_{v=1}^b [∂qii(θ̂)/∂θu][∂qii(θ̂)/∂θv] M^{uv}(θ̂),

where M^{uv}(θ̂) is the (u, v) element of M(θ̂)^{-1}. In like manner, for example, the transition probabilities pij(t; θ) for fixed i, j, and t can be estimated. The required derivatives are computed using (3.3) and (3.4).

One important characteristic of ergodic models is the equilibrium probability vector π' = (π1, . . . , πk), specified as the unique solution to the equation Q'π = 0 that satisfies Σ_{i=1}^k πi = 1. In order to determine an asymptotic covariance matrix for π̂ via the delta theorem, we require the derivatives ∂πi/∂θj. In Appendix B we prove that the following procedure yields an asymptotic covariance matrix for π̂.

Let Q1(θ) be the k x (k - 1) matrix obtained by deleting the last column of Q, and define the (k - 1) x (k - 1) matrix B(θ) with (i, j) element

    Bij(θ) = qji(θ) - qki(θ),    i, j = 1, . . . , k - 1.

In addition, let C(θ) be the (k - 1) x b matrix whose jth column is given by

    (∂Q1(θ)'/∂θj) π,    j = 1, . . . , b.    (3.7)

The derivatives of π1, . . . , π_{k-1} with respect to θ1, . . . , θb are contained in the matrix W with (i, j) element ∂πi/∂θj, where

    W(θ) = -B(θ)^{-1}C(θ).    (3.8)

An estimate of the asymptotic covariance matrix of π̂' = (π̂1, . . . , π̂_{k-1}) is

    V1 = W(θ̂) V W(θ̂)',    (3.9)

where V = M(θ̂)^{-1} is the estimated covariance matrix for θ̂. It will be noticed that the matrices B(θ̂) and C(θ̂) are easily calculated from quantities already computed when obtaining θ̂.

3.4 Some Simple Extensions

In many studies, some of the individuals are observed only over part of the observation period. If individuals who enter or leave the observation period are like those who stay in the study in all relevant respects (the same transition probabilities apply to both groups), then the methods apply without change. Note, however, that Ni(t_{l-1}) of (3.5) is the number of individuals in state i at t_{l-1} for whom the state occupied at t_l is known. It is similarly unnecessary that all individuals be observed over the same set of time points. The amount of computation, however, increases linearly with the number of distinct time intervals w_l in the sample.

These methods can be extended in a simple way to fit certain nonhomogeneous Markov models. Suppose, for example, that {X(t)} is a Markov process with time-dependent intensity matrix Q(t) = Q0 g(t; λ), where Q0 is a fixed intensity matrix with unknown entries (qij), and g(t; λ) is a known function of time up to an unspecified parameter λ. In this case, g(t; λ) defines, for given λ, an operational time. For given λ, let s = ∫_0^t g(u; λ) du and define Y(s) = X(t). The process {Y(s): 0 <= s < ∞} is then a homogeneous Markov process with intensity matrix Q0. Thus for any given λ we replace t_l with s_l = ∫_0^{t_l} g(u; λ) du and w_l with w_l* = s_l - s_{l-1}. The parameters of Q0 are estimated by the methods outlined above. In addition, the maximized log-likelihood can be obtained for that λ. By varying λ, this additional parameter can be estimated by observing the effect on the maximized log-likelihood. Our approach to variance estimation and confidence intervals is to consider λ as fixed (usually at the value λ̂, which maximizes the maximized log-likelihood) and then to use the estimated covariance matrix M(θ̂)^{-1} for the corresponding estimate θ̂ of θ.

The associate editor has pointed out the similarity between the interpretation and treatment of the λ parameter here and that of the "conditional" approach to the estimation of the transformation parameter λ in the transformation family of Box and Cox (1964); see Hinkley and Runger (1984) for a review and discussion of this topic.
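For the equilibrium vector π of Section 3.3, a minimal numerical sketch solves Q'π = 0 with the normalization Σ πi = 1 by replacing one row of the singular system with the constraint. The intensity matrix below is a hypothetical ergodic example, not from the article.

```python
import numpy as np

# Hypothetical ergodic 3-state intensity matrix (each row sums to zero).
Q = np.array([[-0.5, 0.3, 0.2],
              [0.2, -0.4, 0.2],
              [0.1, 0.5, -0.6]])
k = Q.shape[0]

# Solve Q'pi = 0 subject to sum(pi) = 1: take the first k-1 rows of Q'
# (the last is redundant) and append the normalization as the final equation.
M = np.vstack([Q.T[:-1], np.ones(k)])
rhs = np.zeros(k)
rhs[-1] = 1.0
pi = np.linalg.solve(M, rhs)
print(pi)           # equilibrium probabilities, summing to 1
print(Q.T @ pi)     # approximately the zero vector
```

With the derivative matrices of (3.7)-(3.8) in hand, the same π feeds the delta-theorem variance estimate (3.9).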

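The operational-time device of Section 3.4 is easy to apply in practice. The sketch below uses g(t; λ) = e^{λt} (the same form considered in Section 6), for which s(t) = (e^{λt} - 1)/λ; the observation times are those of the smoking study, while the value of λ is an arbitrary illustration.

```python
import numpy as np

def operational_times(t, lam):
    """Transform observation times t to operational times s(t) = (e^{lam t} - 1)/lam,
    the integral of g(u; lam) = e^{lam u}; lam = 0 recovers the homogeneous model."""
    t = np.asarray(t, dtype=float)
    if lam == 0.0:
        return t
    return (np.exp(lam * t) - 1.0) / lam

t = np.array([0.0, 0.15, 0.75, 1.10, 1.90])   # observation times from Section 6
s = operational_times(t, lam=0.4)             # lam chosen only for illustration
w_star = np.diff(s)                           # transformed interval lengths w_l*
print(w_star)
```

For each trial λ one refits the homogeneous model with the w_l* in place of the w_l and records the maximized log-likelihood, as described above.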

A second type of time-dependent model can also be treated. In this, the transition matrix Q(t) is allowed to change at the observation times but is constant between observation times. Thus, for example, we might consider

    Q(t) = Q1,    t0 <= t < t_{m1};
    Q(t) = Q2,    t_{m1} <= t < t_m,

for some specified m1 less than m. Estimation of Q1 and Q2 proceeds by treating the first m1 and the second m - m1 time intervals separately.

4. SOME FURTHER ASPECTS OF ESTIMATION

In this section we discuss some points associated with implementing the algorithm for maximum likelihood estimation. Many of these comments have to do with the shape of the likelihood function and the information available about the parameters.

1. An initial estimate θ0 of θ can usually be obtained in an ad hoc way by examining the transition counts nijl; an example is given in Section 6. An alternative approach would involve a preliminary tabulation of the likelihood surface.

2. There is an advantage to parameterizing the model by writing qij = exp(αij), i ≠ j. This is because the parameters αij can take any real value whereas qij > 0. This reparameterization avoids problems that can arise when an iteration results in parameter vectors outside the parameter space. Note also that it is possible to have qij = 0. When this happens, successive iterates of αij will typically become large and negative. In this situation we have found it useful to set the corresponding qij = 0 and fit the model, again using the αij's, with one less parameter. The resulting estimate is then compared with one in which qij is taken as a small positive value.

3. When the times w_l between successive observations are large, it is clear on intuitive grounds that not all parameters will be well estimated. The result can be a likelihood surface for the qij's or αij's that has ridges defined by certain parameters that are imprecisely estimated. For example, if the w_l's are very large and the process is ergodic, then pij(w_l) ≈ πj, where π' = (π1, . . . , πk) is the vector of equilibrium probabilities. In this case, a large number of individuals under study will allow precise estimation of π, but individual qij's will be imprecisely estimated.

4. If w_l = w for all l, it may be possible to determine θ̂ in a relatively simple way. The empirical transition matrix P̃(w) with p̃ij(w) = nij./ni.., where nij. = Σ_{l=1}^m nijl and ni.. = Σ_j nij., provides an estimate of P(w). If the equation P̃(w) = exp(Qw) admits a solution Q̂ = Q(θ̂), then θ̂ is a MLE of θ. The conditions under which this approach can be used are very restrictive. There is a close relationship with the so-called embeddability problem (see Singer and Spilerman 1976a), which is discussed in Section 7.

An Illustrative Example. Consider an ergodic two-state process with

    Q = ( -α   α  ),
        (  β  -β  )

where α, β > 0. It is easily shown that

    P(w) = (pij(w)) = ( 1 - π(1 - e^{-νw})      π(1 - e^{-νw})     ),    (4.1)
                      ( (1 - π)(1 - e^{-νw})    π + (1 - π)e^{-νw} )

where π = α/(α + β) and ν = α + β. Suppose that w_l = w (l = 1, . . . , m), so that the likelihood (3.1) can be written as

    L = [1 - p12]^{n11.} p12^{n12.} p21^{n21.} [1 - p21]^{n22.},    (4.2)

where pij = pij(w). For fixed w, the parameter space {(π, ν): 0 < π < 1, ν > 0} is mapped one to one by (4.1) onto A = {(p12, p21): 0 < p12 < 1, 0 < p21 < 1, p12 + p21 < 1}. We can maximize L for (p12, p21) in A and then use the relationship (4.1) to obtain π̂, ν̂.

Let p̃12 = n12./n1.. and p̃21 = n21./n2.., which is the unconstrained maximum of (4.2). If (p̃12, p̃21) is in A, (4.1) can be used to show that

    π̂ = p̃12/(p̃12 + p̃21),    ν̂ = -(1/w) log[1 - p̃12 - p̃21].

If, however, p̃12 + p̃21 >= 1, the supremum of L lies on the boundary p12 + p21 = 1, and we find

    p̂12 = (n12. + n22.)/n...,    p̂21 = (n11. + n21.)/n...,

where n... = Σ_{i,j} nij.. Thus π̂ = (n12. + n22.)/n... and ν̂ = ∞. The parameter π (which incidentally determines the equilibrium distribution) has an admissible MLE, but the likelihood is, for fixed π̂, bounded but increasing as ν → ∞.

Note that p12(w) + p21(w) = 1 - e^{-νw}. If νw is large, then p̃12 + p̃21 will be close to one, and it is clear that observations for which p̃12 + p̃21 >= 1 will arise much more frequently than when νw is small. Finally, we remark that for this two-state case there is a close connection with the embeddability problem in that there is no finite MLE of ν precisely when the empirical transition matrix P̃(w) is nonembeddable, but this does not appear to be a general phenomenon. We discuss this further in Section 7.
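The estimates in the two-state example can be sketched directly. The helper below is a hypothetical implementation following (4.1)-(4.2) and the boundary derivation above; it returns ν̂ = ∞ when the unconstrained estimates fall outside A.

```python
import numpy as np

def two_state_mle(n, w):
    """MLE for the two-state model of Section 4 from pooled transition counts
    n = [[n11., n12.], [n21., n22.]] observed over intervals of length w.
    Returns (pi_hat, nu_hat); nu_hat is inf in the boundary case."""
    p12 = n[0][1] / (n[0][0] + n[0][1])      # unconstrained estimate n12./n1..
    p21 = n[1][0] / (n[1][0] + n[1][1])      # n21./n2..
    if p12 + p21 < 1.0:
        pi_hat = p12 / (p12 + p21)
        nu_hat = -np.log(1.0 - p12 - p21) / w
    else:
        # Supremum lies on the boundary p12 + p21 = 1: nu -> infinity.
        total = n[0][0] + n[0][1] + n[1][0] + n[1][1]
        pi_hat = (n[0][1] + n[1][1]) / total
        nu_hat = np.inf
    return pi_hat, nu_hat
```

In the interior case the map (4.1) is simply inverted, since p12 + p21 = 1 - e^{-νw} and p12/(p12 + p21) = π.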

5. THE INCORPORATION OF COVARIATES

In many applications, there are measured covariates on each individual under study, and interest centers on the relationship between these covariates and the intensities qij in the Markov model. One advantage of the methods in Section 3 is that they generalize in a straightforward way to allow for the regression modeling of Q, though with many distinct covariate values in the sample, computations may be too extensive to be easily implemented directly. Implementation may require that the covariates be grouped.

Suppose that each individual has an associated vector of s covariates, z' = (z1, z2, . . . , zs), where z1 = 1. For given z, we suppose that the process is homogeneous Markov with transition intensity matrix Q(z) = (qij(z)), where

    qij(z) = exp(z'βij),    i ≠ j,    (5.1)

and qii(z) = -Σ_{j≠i} qij(z). In (5.1), βij' = (βij1, . . . , βijs) is a vector of s regression parameters relating the instantaneous rate of transitions from state i to state j to the covariates z. In most applications, some of the regression parameters are specified (usually taken to be zero) and only a subset is to be estimated; the relationship between components of z and a subset of the transition rates would usually be of interest.

The particular parametric form in (5.1), a log-linear model for the Markov rates qij(z), is chosen primarily for analytical convenience; other parameterizations may be more appropriate in particular applications. This model has, however, the attractive feature of yielding nonnegative transition intensities for any z and βij's, and it has been suggested by several authors. Tuma and Robins (1980, pp. 1034-1035) provided an example.

The algorithm of Section 3 requires a separate canonical decomposition of Q(z) for each of the r distinct covariate vectors z in the sample. Let these be denoted by zh = (z1h, . . . , zsh) with z1h = 1, h = 1, . . . , r, and let

    Qh = Q(zh) = (qij(zh))

be as defined in (5.1). Let n(h)ijl be the number of individuals with covariate values zh that are in state i at t_{l-1} and state j at t_l. The likelihood is then a product of terms like (3.1), where the hth term arises from data collected on a homogeneous model with intensity matrix Qh. Thus the log-likelihood is

    log L(θ) = Σ_{h=1}^r Σ_{l=1}^m Σ_{i,j=1}^k n(h)ijl log pij(w_l; zh),

where

    Ph(t) = exp(Qh t) = (pij(t; zh)).

The parameter θ is being used to indicate the vector of parameters in βij (i ≠ j) that are to be estimated. The maximum likelihood equations involve the sum of r terms, one for each zh. Thus the score vector is

    S(θ) = Σ_{h=1}^r S^{(h)}(θ),

where S^{(h)}(θ) is computed as outlined in Section 3 [see Equation (3.5)]. In like manner, the Fisher scoring matrix M(θ) = Σ M^{(h)}(θ) is obtained by calculating the scoring matrix M^{(h)}(θ) for each h using (3.6) and formulas (3.3) and (3.4). The algorithm requires the derivatives of Ph(t) with respect to the elements of θ, so a separate diagonalization of each Qh is required. Note that the computation of G^{(u)} in (3.4) requires calculation of the derivatives of Q(zh) with respect to θu. Since θu = βijc for some c, i, j, these derivatives are easily calculated; we find

    ∂Q(zh)/∂βijc = zch qij(zh) L,

where L is a k x k matrix with all elements 0 except Lij = 1 and Lii = -1. The specification and fitting of models with covariates is discussed further in Section 6.

6. ILLUSTRATIONS

In a study on smoking behavior, children from two Ontario counties (Waterloo and Oxford) who were entering sixth grade at time t0 = 0 were surveyed at times (in years) t1 = .15, t2 = .75, t3 = 1.10, and t4 = 1.90. The study was designed to compare a control group from each county with a treatment group who received educational material on smoking during the first two months of study. A part of the information obtained at each follow-up time was the "smoking status" of each child, which is classified into three states:

State 1: child has never smoked.
State 2: child is currently a smoker.
State 3: child has smoked, but has now quit.

We consider a homogeneous Markov process with intensity matrix

         ( -θ1    θ1     0   )
    Q =  (  0    -θ2     θ2  ),
         (  0     θ3    -θ3  )

where θi > 0 (i = 1, 2, 3), as a model for the underlying continuous-time process for "smoking status." The purpose of our discussion is to illustrate the methodology developed in earlier sections, and not to present a thorough analysis of the data.

Figure 1 presents the transition counts nijl (i, j = 1, 2, 3; l = 1, . . . , 4) for the "Oxford treatment" group.

(t0, t1)
        1           2           3
  1     93 (96.0)   3 (1.7)     2 (.3)
  2     0 (0)       8 (12.9)    10 (5.1)
  3     0 (0)       1 (.5)      8 (8.5)

(t1, t2)
        1           2           3
  1     89 (85.7)   2 (4.2)     2 (3.1)
  2     0 (0)       7 (4.0)     5 (8.0)
  3     0 (0)       5 (2.8)     15 (17.2)

(t2, t3)
        1           2           3
  1     83 (84.9)   3 (2.9)     3 (1.2)
  2     0 (0)       9 (6.8)     5 (7.2)
  3     0 (0)       2 (2.3)     20 (19.7)

(t3, t4)
        1           2           3
  1     76 (74.5)   3 (4.3)     4 (4.3)
  2     0 (0)       6 (3.7)     8 (10.3)
  3     0 (0)       0 (4.3)     28 (23.7)

Figure 1. Transition Counts for Four Time Intervals. Expected transition counts are in parentheses.
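The construction of Q(z) under the log-linear model (5.1) can be sketched as follows. The coefficient values below are hypothetical (chosen to resemble the simulation of Section 6), and only the transitions given a βij vector receive a nonzero rate.

```python
import numpy as np

def Q_of_z(beta, z):
    """Intensity matrix under the log-linear model (5.1): qij(z) = exp(z' beta_ij)
    for i != j. `beta` maps 0-based (i, j) pairs to coefficient vectors; the
    covariate vector z has z[0] = 1 (intercept). Diagonals are minus row sums."""
    k = 1 + max(max(i, j) for i, j in beta)
    Q = np.zeros((k, k))
    for (i, j), b in beta.items():
        Q[i, j] = np.exp(np.dot(z, b))
    Q -= np.diag(Q.sum(axis=1))
    return Q

# Hypothetical two-covariate, three-state example with state 3 absorbing
# (no beta entries for transitions out of state index 2).
beta = {(0, 1): [-2.3, 0.5, -0.5], (0, 2): [-2.3, 0.0, 0.0],
        (1, 0): [-1.9, 0.5, 0.5],  (1, 2): [-2.3, 0.0, 0.0]}
print(Q_of_z(beta, z=[1.0, 1.0, 0.0]))
```

Each distinct covariate vector zh yields its own Qh, which is then diagonalized separately for the score and information calculations, as described in Section 5.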


From these data, the scoring algorithm of Section 3.1 gives the MLE of Q as

          ( -.136    .136    0    )
    Q̂ =  (  0      -2.28    2.28 ).
          (  0       .470   -.470 )

Initial estimates for the algorithm can be obtained in several ways. One possibility is to use a single interval (e.g., t2, t3) and to assume that individuals make 0 or 1 transitions in the interval. Thus, for example, we obtain an initial estimate of q11 from exp(q11(t3 - t2)) = 83/89. The corresponding estimates, .20, 1.26, and .27 for θi (i = 1, 2, 3), are adequate. The asymptotic covariance matrix for θ̂ can also be obtained from (3.6) and used to give approximate confidence intervals for the θ's.

Questions of fit of the Markov model can, to some extent, be assessed by comparing observed transition frequencies, nijl, with expected frequencies, eijl = ni.l p̂ij(w_l), where ni.l = Σ_j nijl. A likelihood ratio, or asymptotically equivalent Pearson chi-squared, statistic to test the fit of the Markov model is readily obtained by methods similar to those used in Markov chains (e.g., Anderson and Goodman 1957). If none of the pij(w_l)'s is restricted to be zero, the likelihood ratio statistic is

    Λ = 2 Σ_{l=1}^m Σ_{i,j=1}^k nijl log(nijl/eijl),

which is asymptotically (m fixed, n → ∞) a chi-squared variate on mk(k - 1) - b degrees of freedom. The related Pearson statistic is

    χ² = Σ_{l=1}^m Σ_{i,j=1}^k (nijl - eijl)²/eijl.

For the example at hand, p21(w_l) = p31(w_l) = 0, and the degrees of freedom become 4m - b = 13. We find that Λ = 33.2 and χ² = 19.9. Although these are based on some small frequencies, it is apparent that there is fairly strong evidence against the Markov model.

It is common to find that time-homogeneous Markov models are not strictly appropriate. Fitting Markov models is, nonetheless, a useful and often necessary first step in developing a suitable model. The time-homogeneous Markov model is a convenient baseline; insight can be obtained by examining the nature of departures from this model.

In the present situation, other models provide a somewhat better fit to the data. The population is very heterogeneous, and one possibility is to examine more homogeneous subgroups. Data were collected on each child that pertained to the exposure he/she received from relatives and friends that smoked. On this basis, each child was assigned to a "high risk" or a "low risk" group. The observed and expected transition counts for the high-risk group are given in Figure 2. We find Λ = 22.4, and the fit of the Markov model is somewhat better.

(t0, t1)
        1           2           3
  1     61 (62.4)   1 (1.4)     2 (.2)
  2     0 (0)       8 (11.5)    8 (4.5)
  3     0 (0)       1 (.5)      7 (7.5)

(t1, t2)
        1           2           3
  1     59 (57.3)   1 (2.6)     1 (1.2)
  2     0 (0)       7 (4.7)     3 (5.3)
  3     0 (0)       3 (1.9)     14 (15.1)

(t2, t3)
        1           2           3
  1     56 (56.1)   2 (2.1)     1 (.8)
  2     0 (0)       8 (5.9)     3 (5.1)
  3     0 (0)       2 (1.7)     16 (16.3)

(t3, t4)
        1           2           3
  1     51 (51.2)   2 (2.9)     3 (1.9)
  2     0 (0)       6 (4.4)     6 (7.6)
  3     0 (0)       0 (2.6)     20 (17.4)

Figure 2. Transition Counts for High-Risk Individuals. Expected transition counts are in parentheses.

Other possibilities also arise. Examination of Figures 1 and 2 shows that the number of transitions from State 2 to State 3 is unusually high in the first interval, which immediately follows the treatment program. A Markov model describes much more accurately the subsequent transition patterns in these data. Another possibility is to consider time dependence in the transition intensities. One approach is to fit separate models to different time intervals and to compare the estimated transition intensities. Another approach is to fit a parametric time-dependent model. For example, we considered models with

    qij(t) = qij e^{λt}    for i, j = 2, 3

to look for a monotone trend in the propensity of individuals to alter their smoking habits as they get older. This gives no significant improvement in this instance.

Comparisons among the four treatment-county groups are also of interest. These can be done using likelihood ratio tests or by the methods of Section 5, which utilize a regression model.

Table 1. Simulated Data From Model (6.1)

[For each covariate group z = (0, 0), (0, 1), (1, 0), (1, 1), the table gives the 3 x 3 transition counts over each of the five intervals (t0, t1), . . . , (t4, t5).]

NOTE: Entries are the numbers of transitions over the corresponding time interval.
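The fit statistics Λ and χ² above are simple to compute from observed and expected counts. A minimal sketch (the helper name is ours; cells with pij restricted to zero, which have eijl = 0, are excluded as in the degrees-of-freedom adjustment above):

```python
import numpy as np

def fit_statistics(observed, expected):
    """Likelihood ratio statistic 2 * sum n log(n/e) and Pearson statistic
    sum (n - e)^2 / e over cells with expected count e > 0; observed cells
    with n = 0 contribute zero to the likelihood ratio term."""
    obs = np.asarray(observed, dtype=float)
    exp_ = np.asarray(expected, dtype=float)
    mask = exp_ > 0
    o, e = obs[mask], exp_[mask]
    lr = 2.0 * np.sum(np.where(o > 0, o * np.log(np.where(o > 0, o / e, 1.0)), 0.0))
    pearson = np.sum((o - e) ** 2 / e)
    return lr, pearson
```

Summing these contributions over all intervals (and, for Figure 2, over the high-risk tables) reproduces statistics of the form reported in this section.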


Table 2. True Values, Maximum Likelihood Estimates, and Estimated Standard Errors for Model (6.1) Fitted to the Data in Table 1 Estimated Standard Error

869

where
k

pij(w)

= n j.

I
j=1

nij.,

i,j

1, . ..

k.

Parameter
a12

True Value -2.30

MLE

If there exists a matrix Q E Q such that P(w) = exp(Qw), (7.2) if follows that Q is a maximumlikelihoodestimateof Q. When applicable, this methodgives a simple procedurefor maximizing the likelihood (7.1) with respect to Q E Q. This approachleads one to consider those conditions on a stochastic matrixP underwhich the equation exp(Q) = P (7.3)

#1
X2

.50
-

.50

a13
a21 #3 fl4

-2.30
-1.90

-2a177 .700 - .772 -2.659


-1.389

.356 .406 .406 .235


.278

a23

.50 .50 -2.30

.284 .111 -2.246

.307 .304 .254

model. The Markov model, however, does not fit the data sufficiently well that one would want to base comparisons solely on it.

To illustrate the methods of Section 5, we consider data simulated from a three-state Markov process with transition intensity matrix

    Q(z) = ( q11(z)                     e^(α12 + β1·z1 + β2·z2)   e^(α13)
             e^(α21 + β3·z1 + β4·z2)    q22(z)                    e^(α23)
             0                          0                         0       ),   (6.1)

which is a natural regression generalization of the marital status example leading to (2.1). For the simulation, the parameters were α12 = α13 = α23 = -2.30, α21 = -1.90, β1 = β3 = β4 = .50, and β2 = -.50. It was assumed that 15 individuals began in each of States 1 and 2 at time t0 = 0, for each of four groups with regression vectors (z1, z2) = (0, 0), (0, 1), (1, 0), (1, 1), respectively. The simulated data are presented in Table 1.

Convergence to the MLE's was rapid, and the region of convergence appeared to be reasonably broad. Convergence occurred from starting values obtained by a method similar to that used in the previous example. The MLE's and their estimated standard errors are given in Table 2. These results allow simple expression of approximate confidence intervals for the parameters. The overall likelihood ratio statistic for testing the fit of the model is Λ = 78.69 on 80 - 8 = 72 df, indicating no lack of fit.

7. EMBEDDABILITY AND ESTIMABILITY

Consider again the homogeneous case of Section 2, and suppose that the observation times are equally spaced so that w_l = w (l = 1, . . ., m). In addition, suppose that the model is saturated so that the elements of θ are the intensities q_ij (i ≠ j) and the parameter space is

    Q = {Q = (q_ij): q_ij ≥ 0 (i ≠ j), q_ii = -Σ_{j≠i} q_ij},

the set of admissible intensity matrices. In this case, the likelihood (3.1) can be written as

    L(θ) = Π_{i,j} p_ij(w)^{n_ij},   (7.1)

where n_ij = Σ_l n_ijl is the total number of recorded transitions from i to j. Since, for each i, the likelihood (7.1) is of multinomial form, a natural estimate of P(w) is P̂(w) = (p̂_ij(w)), with p̂_ij(w) = n_ij/n_i. and n_i. = Σ_j n_ij. An estimate of Q can then be sought as a solution to

    exp(Qw) = P̂(w).   (7.2)

The stochastic matrix P̂ is said to be embeddable if the equation

    exp(Qw) = P̂   (7.3)

admits a solution Q ∈ Q. This is known as the embeddability problem for homogeneous Markov processes and has been discussed by many authors; for example, see Kingman (1962), Kingman and Williams (1973), Frydman (1980), who extended the problem to include nonhomogeneous processes, and Singer and Spilerman (1976a), who reviewed the area. If k = 2, P̂ is embeddable if and only if tr(P̂) > 1. Simple necessary and sufficient conditions for general k are not known. Equation (7.3) can, however, admit 0, 1, or many solutions in Q for a given P̂.

Although characterization of embeddability is an interesting and challenging mathematical problem, the implications for the analysis of panel data would seem to be few. There are several reasons for this:

1. If P̂(w) is nonembeddable, this does not imply that the continuous-time Markov model gives a poor fit to the data, nor that the parameters in that model are poorly estimated. Although P̂ is nonembeddable, there might be stochastic matrices "close" to P̂ that are embeddable (see Example 7.1 below).
2. If P̂(w) is embeddable, it is quite possible that many of the parameters of the continuous-time model are nonetheless very poorly estimated (see Example 7.2).
3. The approach of estimating Q by solving (7.2) is applicable only if the model is saturated (the parameter space is Q) and the observation times are equally spaced. Models and sampling plans used in many applications do not meet these requirements. In addition, this approach allows no possibility of generalization to incorporate covariates.

These observations suggest that considerations of embeddability of P̂(w) are largely irrelevant with respect to important statistical questions. In addition, even when the approach embodied in (7.2) is possible, it is preferable to consider maximization of (7.1) over the space Q directly. By so doing, estimates are obtained even if P̂ is nonembeddable, and the covariance matrix M^{-1} provides estimates of precision. We now give three examples that illustrate and extend points already made.
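For k = 2, both the embeddability criterion tr(P̂) > 1 and the solution of (7.2) have closed forms, so the whole calculation can be sketched directly. In the minimal pure-Python sketch below, the transition counts are hypothetical, chosen only for illustration:

```python
import math

# Hypothetical 2-state transition counts over one interval of length w = 1:
# n[i][j] = number of individuals observed to move from state i to state j.
n = [[51, 49],
     [49, 51]]

# Multinomial estimate P-hat(w): divide each row by its row total n_i.
P = [[nij / sum(row) for nij in row] for row in n]

# For k = 2, P-hat is embeddable if and only if tr(P-hat) > 1.
trace = P[0][0] + P[1][1]
assert trace > 1, "nonembeddable: (7.2) has no solution among intensity matrices"

# For a 2x2 stochastic matrix the logarithm is available in closed form:
# log P = alpha * (P - I), where lam = tr(P) - 1 is the non-unit eigenvalue
# and alpha = log(lam) / (lam - 1).  Then Q-hat = (1/w) log P-hat.
w = 1.0
lam = trace - 1.0
alpha = math.log(lam) / (lam - 1.0)
Q = [[alpha * (P[i][j] - (1.0 if i == j else 0.0)) / w for j in range(2)]
     for i in range(2)]

print(Q)  # off-diagonal entries are the estimated transition intensities
```

For these counts the off-diagonal entries of Q̂ come out near 1.96, and each row sums to zero, as an intensity matrix requires.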

870

Journal of the American Statistical Association, December 1985

Example 7.1. (See remark 1 above.) Suppose that 100 individuals, beginning in each of k = 3 states, are observed over one transition of length w = 1, and the observed transition matrix is

    P̂ = P̂(1) = ( .8  .2   0
                  .1  .8  .1
                  .1  .2  .7 ).

For any irreducible Markov process with intensity matrix Q, it is clear that all entries of exp(Q) are nonzero, and since p̂13 = 0, it follows that P̂ is nonembeddable. The approach to estimation embodied in (7.2) fails. The algorithm outlined in Section 2, however, converges to

    Q̂ = ( -.237   .237     0
            .111  -.231   .120
            .102   .262  -.364 ),

which corresponds to

    exp(Q̂) = ( .800  .189  .011
                .100  .810  .090
                .100  .200  .700 ),

in general good agreement with the observed transition rates. In addition, all elements of Q̂ are well estimated. Although P̂ is nonembeddable, there are stochastic matrices "close" to P̂ that are embeddable.

Example 7.2. (See remark 2 above.) Suppose that

    P̂ = ( .51  .49
           .49  .51 )

is an estimate based on 100 individuals beginning in each of States 1 and 2 and observed for a single transition over a period of length w = 1. Since tr(P̂) > 1, P̂ is embeddable, and there exists a unique matrix

    Q̂ = log P̂ = ( -1.956   1.956
                    1.956  -1.956 ),

the maximum likelihood estimate of Q. Application of the algorithm in Section 2 leads to this same estimate along with estimates of standard errors. It is easily seen that, although Q̂ is unique, only the equilibrium distribution is estimated with precision. In effect, any Q with equilibrium distribution (π1, π2) = (.5, .5) and with large entries will yield a corresponding P that is consistent with the observed transition frequencies.

Example 7.3. There have been many remarks in the literature concerning the possibility that (7.3) may admit multiple solutions Q ∈ Q. When this occurs, there are multiple maximum likelihood estimates of Q. In this example, we examine conditions under which this can occur, in the case k = 3.

Let Q = (q_ij) be the intensity matrix of a three-state ergodic process, and let π = (π1, π2, π3) be the equilibrium distribution. Since π is the left eigenvector corresponding to the latent root λ = 0 of Q, it follows, after some algebra, that

    Q = γ H^{-1} ( 0      0           0
                   0   -1 - b    (a - b²)/c
                   0      c        -1 + b   ) H,

where

    H = ( π1  π2  π3
           1   0  -1
           0   1  -1 ),    γ = -(q11 + q22 + q33)/2,

and a, b, and c are determined by

    -1 - b = (q11 - q31)/γ,   (a - b²)/c = (q12 - q32)/γ,   c = (q21 - q31)/γ.

If a < 0, the remaining two latent roots are complex, and if a1 = (-a)^{1/2}, the relevant 2 × 2 submatrix is

    ( -1 - b   (a - b²)/c
        c        -1 + b   ),

with eigenvalues λ = -1 ± a1·i. The left eigenvectors are the rows of

    K = ( c   b + a1·i
          c   b - a1·i ),

with

    K^{-1} = (1/(2·a1·c·i)) ( -b + a1·i   b + a1·i
                                  c          -c     ).

For a fixed t, we let P(t) = exp(Qt) and determine a family of matrices Q_k (k = 0, ±1, ±2, . . .) that give rise to the same P(t) = exp(Q_k t). From the above, writing K now for the 3 × 3 matrix diag(1, K),

    Q = γ (KH)^{-1} D (KH),

where D = diag(0, -1 + a1·i, -1 - a1·i), so that P(t) = (KH)^{-1} exp(γtD) (KH). Let k be any integer (k ≠ 0) and define a1* = a1 + 2kπ/(tγ). It can then be seen that exp(γtD*) = exp(γtD), where D* = diag(0, -1 + a1*·i, -1 - a1*·i). Define also b* = b·a1*/a1 and c* = c·a1*/a1. In an obvious notation, it follows that K* = K, and the intensity matrix Q_k, say, corresponding to a1*, b*, c*, H, and γ satisfies exp(Q_k t) = exp(Qt).

Some additional calculations verify that

    Q_k = Q + (2kπ/(γ a1 t))(Q + γG),

where

    G = ( π2 + π3    -π2        -π3
            -π1     π1 + π3     -π3
            -π1       -π2     π1 + π2 ).

The matrices Q_k therefore satisfy P(t) = exp(Q_k t), k = . . ., -1, 0, 1, . . . . For Q_k ∈ Q we require that all entries of Q_k have the proper sign. In a numerical example with π1 = π2 = π3 = 1/3, b = 0, and c = 1, this happens for fixed k < 0 (k > 0) only if t > -16kπ/6 (t > 24kπ). Thus, if k = -1, Q_{-1} ∈ Q only for t > 16π/6 ≈ 8.38. There are multiple solutions in Q to P(t) = e^{Qt} only for large t values.
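The family Q_k can be checked numerically. In the pure-Python sketch below, the circulant intensity matrix Q is our own illustrative choice, not one from the text: it has uniform equilibrium distribution π = (1/3, 1/3, 1/3), γ = 3, and a complex pair of nonzero eigenvalues with a1 = √3/6. With k = 1 and t = 10, Q_1 is itself a valid intensity matrix, yet exp(Q_1 t) and exp(Qt) agree to machine precision:

```python
import math

def mat_mul(A, B):
    """Product of two 3x3 matrices."""
    return [[sum(A[i][r] * B[r][j] for r in range(3)) for j in range(3)]
            for i in range(3)]

def expm(A, squarings=12, terms=20):
    """exp(A) for a 3x3 matrix, by scaling and squaring with a Taylor series."""
    s = 2.0 ** squarings
    B = [[a / s for a in row] for row in A]
    E = [[float(i == j) for j in range(3)] for i in range(3)]   # identity
    term = [row[:] for row in E]
    for n in range(1, terms):
        term = [[v / n for v in row] for row in mat_mul(term, B)]
        E = [[E[i][j] + term[i][j] for j in range(3)] for i in range(3)]
    for _ in range(squarings):
        E = mat_mul(E, E)
    return E

# Illustrative circulant intensity matrix (our choice, not from the text).
Q = [[-2.0, 1.5, 0.5],
     [0.5, -2.0, 1.5],
     [1.5, 0.5, -2.0]]
pi = [1.0 / 3.0] * 3                  # uniform equilibrium distribution
gamma = 3.0                           # gamma = -(q11 + q22 + q33)/2
a1 = math.sqrt(3.0) / 6.0             # nonzero eigenvalues are gamma(-1 +/- a1*i)
G = [[(1.0 if i == j else 0.0) - pi[j] for j in range(3)] for i in range(3)]

t, k = 10.0, 1
c = 2.0 * k * math.pi / (gamma * a1 * t)
Qk = [[Q[i][j] + c * (Q[i][j] + gamma * G[i][j]) for j in range(3)]
      for i in range(3)]

# Qk has nonnegative off-diagonals and zero row sums, so it is a genuine
# intensity matrix distinct from Q -- yet the two transition matrices agree.
P1 = expm([[q * t for q in row] for row in Q])
P2 = expm([[q * t for q in row] for row in Qk])
diff = max(abs(P1[i][j] - P2[i][j]) for i in range(3) for j in range(3))
print(diff)  # tiny: exp(Qk * t) = exp(Q * t)
```

Rerunning the same check with a smaller t (say t = 5) makes an off-diagonal entry of Q_1 negative, illustrating the sign restriction that confines the family Q_k to large t.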

For t as large as 8.38, P(t) is observationally indistinguishable from the equilibrium matrix with all rows (1/3, 1/3, 1/3); for example,

    P(t) = ( .33350  .33331  .33319
             .33344  .33332  .33324
             .33326  .33357  .33317 )

to five figures. This and similar examples suggest that multiple roots of (7.2) will not occur in practical applications, at least in the 3 × 3 case. Equivalently, when P̂ is embeddable there will not be multiple maximum likelihood estimates of Q unless P̂ is close to the equilibrium matrix. At this time, we have not undertaken a detailed study of higher-dimensioned problems (k ≥ 4). If, however, multiple MLE's occur, methods similar to those outlined above could be used to determine all of them. Multiple MLE's indicate that the likelihood is badly behaved, and point estimates are bound to be misleading.

APPENDIX A: DERIVATION OF (3.4)

Suppose that Q has distinct eigenvalues d1, . . ., dk for all θ in some open set. The matrix obtained by differentiating each entry of P(t) with respect to θ_u is, from (3.3) and the fact that Q = ADA^{-1},

    ∂P(t)/∂θ_u = Σ_{s=1}^∞ (t^s/s!) ∂Q^s/∂θ_u
               = Σ_{s=1}^∞ (t^s/s!) Σ_{l=0}^{s-1} Q^l (∂Q/∂θ_u) Q^{s-1-l}
               = Σ_{s=1}^∞ (t^s/s!) Σ_{l=0}^{s-1} A D^l G_u D^{s-1-l} A^{-1},

where G_u = A^{-1} (∂Q/∂θ_u) A. Thus

    ∂P(t)/∂θ_u = A V_u A^{-1},

where V_u is a k × k matrix with (i, j) element

    g_ij^(u) Σ_{s=1}^∞ (t^s/s!) Σ_{l=0}^{s-1} d_i^l d_j^{s-1-l}
        = g_ij^(u) (e^{d_i t} - e^{d_j t})/(d_i - d_j),   i ≠ j,
        = g_ii^(u) t e^{d_i t},   i = j,

where g_ij^(u) is the (i, j) element of G_u. This establishes (3.4).

APPENDIX B: DERIVATIONS OF (3.8) AND (3.9)

The k × 1 equilibrium probability vector π' = (π1, . . ., π_{k-1}, 1 - π1 - . . . - π_{k-1}) can be obtained as the unique solution to the system of equations

    Q1'π = 0,   (B.1)

where Q1 = Q1(θ) is the k × (k - 1) matrix obtained by dropping the last column of Q(θ). Now π is not a particularly simple function of θ or Q, so we will develop derivatives ∂π_i/∂θ_j by treating Q1'π = F(θ, π) as defining π1, . . ., π_{k-1} implicitly in terms of θ1, . . ., θ_b. To obtain the derivatives ∂π_i/∂θ_j (i = 1, . . ., k - 1) for fixed j, we use implicit differentiation of (B.1) with respect to θ_j to obtain the system of equations

    B(θ) (∂π1/∂θ_j, . . ., ∂π_{k-1}/∂θ_j)' + C_j(θ) = 0,   (B.2)

where B(θ) is a (k - 1) × (k - 1) matrix with (i, j) element

    ∂F(θ, π)_i/∂π_j = ∂(Q1'π)_i/∂π_j = q_ji(θ) - q_ki(θ),

and where C_j(θ) is the (k - 1) × 1 vector given by

    C_j(θ) = ∂F(θ, π)/∂θ_j = (∂Q1'/∂θ_j) π.

From (B.2), the derivatives ∂π_i/∂θ_j (i = 1, . . ., k - 1) are given by

    (∂π1/∂θ_j, . . ., ∂π_{k-1}/∂θ_j)' = -B(θ)^{-1} C_j(θ).

The matrix giving all derivatives of π1, . . ., π_{k-1} with respect to each of θ1, . . ., θ_b is thus

    W(θ) = (∂π_i/∂θ_j) = -B(θ)^{-1} C(θ),

where C(θ) is the (k - 1) × b matrix with jth column C_j(θ). This establishes formula (3.8). Formula (3.9) then follows by an application of the multivariate delta theorem (e.g., Rao 1973, p. 388).

[Received April 1984. Revised June 1985.]

REFERENCES

Anderson, T. W., and Goodman, L. A. (1957), "Statistical Inference About Markov Chains," Annals of Mathematical Statistics, 28, 89-110.
Bartholomew, D. J. (1982), Stochastic Models for Social Processes (3rd ed.), London: John Wiley.
——— (1983), "Some Recent Developments in Social Statistics," International Statistical Review, 51, 1-9.
Box, G. E. P., and Cox, D. R. (1964), "An Analysis of Transformations" (with discussion), Journal of the Royal Statistical Society, Ser. B, 26, 211-252.
Chambers, J. M. (1977), Computational Methods for Data Analysis, New York: John Wiley.
Cox, D. R., and Miller, H. D. (1965), The Theory of Stochastic Processes, London: Methuen (chap. 4).
Frydman, H. (1980), "The Embedding Problem for Markov Chains With Three States," Mathematical Proceedings of the Cambridge Philosophical Society, 87, 285-294.
Hinkley, David V., and Runger, G. (1984), "The Analysis of Transformed Data," Journal of the American Statistical Association, 79, 302-309.
Jennrich, Robert I., and Bright, Peter B. (1976), "Fitting Systems of Linear Differential Equations Using Computer Generated Exact Derivatives," Technometrics, 18, 385-392.
Kingman, J. F. C. (1962), "The Imbedding Problem for Finite Markov Chains," Zeitschrift fur Wahrscheinlichkeitstheorie und Verwandte Gebiete, 1, 14-24.
Kingman, J. F. C., and Williams, D. (1973), "The Combinatorial Structure of Nonhomogeneous Markov Chains," Zeitschrift fur Wahrscheinlichkeitstheorie und Verwandte Gebiete, 26, 77-86.
Rao, C. R. (1973), Linear Statistical Inference and Its Applications (2nd ed.), New York: John Wiley.
Singer, B., and Spilerman, S. (1976a), "The Representation of Social Processes by Markov Models," American Journal of Sociology, 82, 1-54.
——— (1976b), "Some Methodological Issues in the Analysis of Longitudinal Surveys," Annals of Economic and Social Measurement, 5, 447-474.
Tuma, N. B., Hannan, M. T., and Groeneveld, L. P. (1979), "Dynamic Analysis of Event Histories," American Journal of Sociology, 84, 820-854.
Tuma, N. B., and Robins, P. K. (1980), "A Dynamic Model of Employment Behavior: An Application to the Seattle and Denver Income Maintenance Experiments," Econometrica, 48, 1031-1052.
Wasserman, S. (1980), "Analyzing Social Networks as Stochastic Processes," Journal of the American Statistical Association, 75, 280-294.
