
A Bayesian Approach in Medicine

Konstantinos-Michail Mylonas
September 9, 2016

Contents

1 Introduction
  1.0.1 Introduction
  1.0.2 Literature Overview

2 Main Body

3 Survival Analysis
  3.1 Basic Knowledge
    3.1.1 Definitions
    3.1.2 Residuals
  3.2 Introductory Survival Analysis
  3.3 Proportional Hazard Model
    3.3.1 Semi-parametric Cox Model
  3.4 Accelerated Failure Models
    3.4.1 Exponential distribution
    3.4.2 Weibull Distribution
    3.4.3 Proportional Odds Model
    3.4.4 Log-normal Distribution
  3.5 Bayesian Non Parametric Survival Analysis
    3.5.1 Basic Knowledge
    3.5.2 Explanation of the basic Ideas
    3.5.3 An Initial Approach
    3.5.4 Formulation of the Dirichlet Process
    3.5.5 An Intuitive Look
    3.5.6 Construction of the Dirichlet Process
    3.5.7 Dependent Dirichlet Process

4 Epidemiology
  4.1 Definitions
  4.2 Overview of the Data
  4.3 Deterministic Analysis of Epidemiological Data
  4.4 Bayesian Approach on Epidemiology

Appendices

A Code Appendix
  A.1 One
  A.2 Two
  A.3 Three
  A.4 Four

List of Figures

Figure 3.1  Kaplan-Meier Survival Curve: It is reasonable to assume that the Proportional Hazards hypothesis holds, since the two curves of the corresponding groups neither cross each other nor converge.
Figure 3.2  In both figures it is evident that a Weibull distribution would provide a satisfactory fit for the data at hand. The complementary log transformation of the Kaplan-Meier estimate plotted against time falls on the least-squares regression line, indicating that the hazard function is linear.
Figure 3.3  The residual plot indicates a good fit of the data under the framework of the Proportional Hazards model.
Figure 3.4  Martingale Residuals: The martingale residuals plotted against age show some signs of non-linearity, whereas plotted against the treatment groups they seem to be distributed evenly. A possible remedy to the non-linearity of age is to transform the variable or to stratify. In this case a log transformation was employed; however, it did not show a change worth mentioning.
Figure 3.5  Schoenfeld Residuals: The shape of the smoothed (lowess) curve is an estimate of the difference parameter as a function of time, which appears to increase followed by a slight reduction. Overall, there is no evidence to reject the proportional hazards hypothesis.
Figure 3.6  Jacknife Residuals: The plot indicates that there are patients with significant overall influence on the model. Indeed, four patients appear to be influential observations.
Figure 3.7  Deviance Residuals: Age might have a non-linear pattern, whereas the treatment-group variable appears to be evenly distributed.
Figure 3.8  Jacknife Residuals: The plot agrees with the former model that the same patients seem to have significant influential power.
Figure 3.9  Cox-Snell: The plot indicates that the assumption of exponentially distributed survival times does not hold.
Figure 3.10 Deviance Residuals: The treatment groups seem to be modelled successfully, whereas age exhibits problems similar in nature to those seen when the Cox model was fitted to the data.
Figure 3.11 Jacknife Residuals: From the Jacknife plot we can see that patients 46, 68 and 114 are deemed influential. They were removed, and the data did not exhibit any novel pattern suggesting that the removal of influential observations would result in a better-fitting model or in a decrease in the amount of non-linearity.
Figure 3.12 Cox-Snell: The Cox-Snell residual plot reveals that the Weibull distribution provides an adequate fit for the data.
Figure 3.13 Deviance Residuals: The plot depicts no novel pattern that has not been observed in the previous models.
Figure 3.14 Jacknife Residuals: The model singles out the same patients for their influence as the previous models did.
Figure 3.15 Cox-Snell: The plot reveals that although the log-logistic model provides a better fit than the exponential model, the comparison between the log-logistic and Weibull models favours the latter. Thus the Weibull seems to provide the best available option among accelerated failure time models.
Figure 3.16 The autocorrelation plot indicates that the Markov chain is stationary, meaning that the posterior must have converged to a Normal distribution, as can be seen from the histogram.
Figure 3.17 The autocorrelation plot indicates that the Markov chain is stationary, meaning that the posterior must have converged to a left-skewed distribution, as can be seen from the histogram.
Figure 3.18 The autocorrelation plot indicates that the Markov chain is stationary, meaning that the posterior must have converged to a Normal distribution, as can be seen from the histogram.
Figure 3.19 The autocorrelation plot indicates that the Markov chain has not converged to a distribution, as can be seen from the histogram.
Figure 3.20 Initially the Nelson-Aalen hazard estimates seem to follow the decrease of the model-estimated hazard; however, the pattern then changed, with the Nelson-Aalen estimate starting to increase while the estimated hazard continued to decrease.
Figure 3.21 The estimated survival function under the non-parametric Bayesian approach for the treatment type is similar to the Kaplan-Meier curve, indicating that the non-parametric Bayesian model suits the data of this study.
Figure 4.1  The number of infected rises sharply to 0.06 of the total population before the 10th day and subsequently decreases at a symmetric rate, returning to the initial state. The second plot illustrates the course of the number of susceptible over the period in which the outbreak took place: the number remains stable before starting to decrease in the same period in which the incidence of infected starts to grow. The last plot describes the same pattern as the infected plot.
Figure 4.2  The Monte Carlo costs seem to grow rapidly after the 10th day, peaking at approximately 2000.
Figure 4.3  The curve of susceptible is smoother than the curve produced by the stochastic SIR and does not decrease as sharply; the number of infected does not exhibit the spike it does when there is vaccination, and the peak is much higher than where it is located under a vaccination policy, leading the number of recovered finally to rise to the level of susceptible individuals at the start.
Figure 4.4  The fluctuations of susceptible, infected and removed as time moves forward under the optimal vaccination policy: as the number of susceptible decreases the number of recovered increases before remaining stable at 200 individuals, and the number of infected initially increases, peaking at fewer than 200 individuals at 7 days, only to diminish again, returning to 0. The second plot illustrates the cost of a no-vaccination policy for the English boarding school data: after a rapid growth at approximately the 10th day the cost remains stable well above 1500. The third plot portrays a sudden growth in the total fraction of individuals vaccinated some time before the tenth day, peaking at 80 per cent of the total population before plunging. The last plot depicts the threshold fraction that has to be vaccinated.
Figure 4.5  The graph depicts how the numbers of susceptible, infected and recovered change under more realistic circumstances, where there is no perfect information, requiring the epidemic model parameters to be estimated as well as an optimal management strategy to be found. In particular, the curve of susceptible exhibits a sharper decline, and the number of infected does not have the peak it has with no vaccination policy, which leads to a reduced number of recovered individuals.
Figure 4.6  Under these circumstances there has been an endeavour to calculate posterior distributions of the estimated parameters (clockwise from the left: transmission rate, overdispersion parameter, mortality rate, recovery rate). True parameter values are indicated by a dot, posterior means by an x, and the central 95% region of the distribution is shaded. From the plots, the transmission rate and mortality rate resemble a Normal distribution, although the tails betray such an assumption; on the other hand, the skewed distributions of the remaining parameters may suggest that a chi-squared distribution could be a good approximation.

Abstract

In the modern realm of Medicine, practitioners and researchers face the daunting task of drawing inferences from ever larger amounts of data, which will most often give a distorted view of reality due to a number of factors.
Fortunately, recent advances in computational statistics have facilitated this task, giving way to the development of Bayesian Statistics. Thus, in recent years there has been a growing literature focusing on implementing advances in Bayesian Non-Parametric Statistics to create new models with enhanced predictive power and robustness.
In this thesis a number of techniques and proposals concerning different areas of Medicine will be discussed and evaluated. In particular, problems in survival analysis, which give rise to more flexible models, will be brought under scrutiny.
In later parts of this dissertation the focus will be shifted to epidemiology, following recent advancements pertaining to the implementation of Bayesian methods in contrast to more traditional approaches such as the SIR model.

Acknowledgements

At this point I would like to express my sincere gratitude to my advisor Dr Mark Fiecas for his patience, support, motivation and immense knowledge, and to my supervisor tutor Dr Ewart Shaw for his support and patient guidance throughout my dissertation study. Many thanks to the committee for reading my dissertation. Last but not least, I would like to thank all the professors at the Department of Mathematics of the University of Patras for their admirable work and for introducing me to the realms of Statistics and Mathematics.

Chapter 1

Introduction

1.0.1 Introduction

In recent years, advancements in the domain of computational statistics have instigated a rapid growth of the Bayesian literature in Statistics. Thus applied Bayesian Statistics became a prominent feature of this expanding realm, and has since been utilised to draw more accurate inferences in many areas where Statistics is employed. In this dissertation survival data will be discussed and analysed under both the frequentist and the Bayesian non-parametric paradigm. Under the latter regime a number of Bayesian non-parametric or semi-parametric models will be created which allow more flexibility and accuracy, permitting survival curves to cross or not as the data dictate, unlike parametric models. Attention will then be shifted to Epidemiology and the development of a stochastic equivalent of the SIR model[25]. Data will be analysed under both regimes to decipher whether the recent approaches provide better analyses. In chapter 1, the motivation for developing the Bayesian non-parametric approach is presented, with an outline of the limitations and strengths of Bayesian non-parametric models. Chapter 3 focuses on Survival Analysis, illustrating the analysis of data under both paradigms; moreover, at the outset of each analysis, definitions and explanations of the more complex structures are provided to assist readers' understanding. Lastly, chapter 4 is dedicated to the recent developments of Bayesian Statistics in Epidemiology. Data were analysed under both approaches, outlining the benefits of Bayesian approaches over more traditional ones such as SIR models.

1.0.2 Literature Overview

The idea of a Bayesian non-parametric tool was first introduced by Susarla and Van Ryzin[37]; however, it was Ferguson who, by defining the Dirichlet process, paved the way for Bayesian Non-Parametric Statistics. The next advance came from Ferguson and Phadia[16], who obtained the Bayesian estimate of the survivor function and derived its posterior distribution. Other touchstone works in the research on Bayesian non-parametrics are those of Christensen and Johnson[7], who developed corresponding results for life tables, and Kuo and Smith[27], who proposed a Gibbs sampler for right-, left- and interval-censored data. Then Doss and Huffer[11] and Doss and Narasimhan[12] discussed the implementation of a mixture of Dirichlet priors for the survival function in the presence of right-censored data using the Gibbs sampler.

1.0.2.0.1 Abbreviations

pmf: probability mass function
DP: Dirichlet Process
DDP: Dependent Dirichlet Process
HDP: Hierarchical Dirichlet Process
SIR: Susceptible-Infected-Recovered
RPM: Random Probability Measure

Chapter 2

Main Body

2.0.2.1 Datasets

In the present dissertation the following datasets were analysed to draw conclusions about the role of Bayesian methods in modern statistical analysis: the pharmacoSmoking dataset and the English boarding school influenza dataset. Both datasets were chosen for their simplicity and for giving the chance to implement a number of parametric models before jumping to the Bayesian methodology; their number of observations safeguarded against the possibility of compromising the validity of the deductions through lack of statistical power. Most importantly, it was thought that these datasets would provide fertile ground to display the gain in knowledge from the implemented Bayesian techniques compared to more traditional approaches.

2.0.2.1.1 pharmacoSmoking Dataset

The first dataset which will be put under scrutiny is the pharmacoSmoking dataset from the asaur package in R[32]. The dataset comprises 125 observations on 14 variables, which include some background information such as age, race (black, hispanic, white, or other), employment (ft (full-time), pt (part-time), or other), ageGroup2 (age group with levels 21-49 or 50+) and ageGroup4 (age group with levels 21-34, 35-49, 50-64, or 65+), together with variables containing information about the treatment and medical record of the patients concerning their smoking patterns, including ttr (time in days until relapse), relapse (indicator of relapse, i.e. return to smoking), grp (randomly assigned treatment group with levels combination or patchOnly), yearsSmoking (number of years the patient had been a smoker), levelSmoking (heavy or light), priorAttempts (the number of prior attempts to quit smoking) and longestNoSmoke (the longest period of time, in days, that the patient has previously gone without smoking). This information was gathered from a clinical trial described in Steinberg et al. (2009)[42], a randomized trial of triple therapy against the patch for smoking cessation. Moreover, one reason for choosing this particular dataset was the opportunity to fit a number of parametric techniques and to explore some techniques in residual diagnostics and model building. Lastly, it was hoped that the examination under the Bayesian framework would result in a deeper understanding of the problem.

2.0.2.1.2 English Boarding School Influenza

As for the second dataset[41], it comes from a study of an influenza epidemic outbreak in an English boarding school in 1978, published in the British Medical Journal. The school had a population of 763 boys, of whom 512 were confined to bed during the epidemic, which lasted from 22 January until 4 February. It seems that one infected boy initiated the epidemic. At the outbreak of the epidemic none of the boys had previously had influenza, so no resistance to the infection was present. In particular, the data comprise two columns: the cells in the first column give the number of infected boys, while the cells in the second column give the week in which the corresponding number of infected was recorded. The dataset was chosen because a boarding school is a closed environment, the population is rather homogeneous, and actions such as quarantine or vaccination could be taken to avert further expansion of the epidemic, allowing the differences of the two analyses to be studied on a small scale, without external factors which would confound the analysis under one of the two paradigms.

2.0.2.2 Bayesian Motivation

[20] The literature on statistical modelling has been dominated by parametric models, leaving the non-parametric approach without much exploration until 40 years ago, when the seminal work of Ferguson shed light on Bayesian non-parametrics by laying the foundations of a now ubiquitous tool in non-parametric modelling. In doing so, it endeavoured to ameliorate common errors that emerged from the limitations of parametric modelling. Its significance lies in the fact that it proposed an alternative way of modelling survival data when the proportional hazards assumption is not met, or when no distribution seems to provide an adequate fit to the data, something which was considered uncharted waters at that time. For instance, when the treatment interacts with time, parametric modelling fails to accommodate the situation, since such models are constrained in a fashion that does not allow the survival curves to cross. Thus non-parametric Bayesian methods came into the spotlight, being responsible for some ground-breaking advancements, such as modelling more complex distributions of survival times (e.g. bathtub shapes) while incorporating uncertainty about the estimates, because they allow prior distributions to be placed on the parameters in question; most importantly, they enable models that can handle vast datasets and whose complexity adjusts as new data are collected. However, despite the popularity it has enjoyed, the Bayesian non-parametric approach carries its drawbacks: as the number of observations grows large, so does the model's complexity, since parameters are added to ensure that the model would catch every possible curve that might be observed as data continue to flow into the model.

definition 1. A non-parametric model is a model in which the number of parameters increases with the data.

definition 2. A non-parametric model is a family of distributions that is dense in some large space relevant to the problem at hand[43].

Despite these disadvantages, the power of its virtues forces researchers to consider Bayesian non-parametric models for handling demanding cases. The following recapitulates some of their strengths:
- following parametric approaches inevitably leads to model selection or model averaging, which is computationally expensive;
- they serve as a safeguard against over-fitting and under-fitting;
- non-parametric Bayesian models have interesting and useful properties, such as exchangeability and power-law (Heaps' law) behaviour;
- they provide a flexible way of building complex models[43].

Chapter 3

Survival Analysis

3.1 Basic Knowledge

3.1.1 Definitions

definition 3. The survivor function is the probability of surviving up to time t: S(t) = P(T > t) = 1 - F(t).

definition 4. The hazard function is the instantaneous death rate given survival up to time t:
h_T(t) = \lim_{\delta \to 0} P(t \le T < t + \delta \mid T \ge t) / \delta
where h_T is the hazard function of T.

definition 5. The integrated hazard function is H_T(t) = \int_0^t h_T(u) \, du, where h_T is the hazard function.

definition 6. The Kaplan-Meier estimator is \hat{S}(t) = \prod_{t_i \le t} (1 - d_i / r_i), where d_i is the number of deaths at time t_i and r_i is the number of individuals at risk just before time t_i.

definition 7. The Nelson-Aalen estimator is \tilde{H}(t) = \sum_{t_i \le t} d_i / r_i, where d_i is the number of deaths at time t_i and r_i is the number of individuals at risk just before time t_i.
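
To make definitions 6 and 7 concrete, the following minimal sketch computes both estimators by hand in R on a handful of invented follow-up times (the numbers are purely illustrative):

time   <- c(2, 3, 3, 5, 7, 8)        # follow-up times (invented)
status <- c(1, 1, 0, 1, 0, 1)        # 1 = death, 0 = censored

event.times <- sort(unique(time[status == 1]))
d <- sapply(event.times, function(t) sum(time == t & status == 1))  # deaths d_i at t_i
r <- sapply(event.times, function(t) sum(time >= t))                # r_i at risk just before t_i

KM <- cumprod(1 - d / r)   # Kaplan-Meier estimate of S(t) at each event time
NA.hat <- cumsum(d / r)    # Nelson-Aalen estimate of H(t) at each event time
data.frame(t = event.times, d = d, r = r, KM = KM, NelsonAalen = NA.hat)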

3.1.2 Residuals

Residual plots provide a momentous tool for assessing the fit of a model, the existence of influential observations in the data and the functional form of the covariates. In particular, the Cox-Snell residuals are used to examine the overall fit of the model. As for influential observations, jackknife residuals are utilised to track them. Martingale residuals in the Cox model, or more often deviance residuals in accelerated failure time models (owing to the inclination of martingale residuals to be asymmetric), are employed to search for non-linearity amongst the covariates. Another type of residual is Schoenfeld's, which is preferred for assessing the proportional hazards assumption.

definition 8. The martingale residual for subject i is m_i = \delta_i - \hat{H}_0(t_i) e^{z_i' \hat{\beta}}, where \delta_i is the event indicator. Martingale residuals lie in (-\infty, 1], they sum to 0, and their expected value is 0.

definition 9. The deviance residual is
d_i = \mathrm{sign}(m_i) \sqrt{-2 [ m_i + \delta_i \log(\delta_i - m_i) ]}.
Deviance residuals are approximately symmetrically distributed with expected value 0; the sum of squares of the deviance residuals is the value of the likelihood ratio test; and they have properties analogous to the deviance residuals of generalised linear models.

definition 10. The jackknife residual is the difference between the value of a coefficient estimated using all of the data and its value when the subject is deleted from the data set.

definition 11. The partial log-likelihood is
l(\beta) = \sum_{i \in D} z_i' \beta - \sum_{i \in D} \log \sum_{k \in R_i} \exp(z_k' \beta),
and its derivative, the score function, is
l'(\beta) = \sum_{i \in D} ( z_i - \bar{z}(t_i) ), where \bar{z}(t_i) = \sum_{k \in R_i} z_k e^{z_k' \beta} / \sum_{k \in R_i} e^{z_k' \beta}.
The Schoenfeld residuals are the individual terms of the score function; each term is the observed value of the covariate for a patient minus its expected value \bar{z}(t_i). A plot of these residuals versus time will yield a pattern of points centred at zero if the proportional hazards assumption is correct. Schoenfeld residuals are defined only for the failure times, not for the censoring times.

3.2 Introductory Survival Analysis

This section focuses on the first steps that have to be taken when survival analysis is conducted. Here the pharmacoSmoking data are subjected to analysis; however, only the survival time (ttr), relapse and group (grp) variables are employed initially, the aim being to see whether the proportional hazards assumption holds and to compare the survival times of the two groups without exploring other covariates, since we first want to decipher which treatment assists patients in quitting smoking, and hence to examine treatment effects on survival times before going into much depth. Thus the Kaplan-Meier[2] estimator of the survival curve is fitted to assess the proportional hazards assumption and to note the differences between the survival times of the two groups; subsequently a log-rank test[29] is computed to verify that the two groups have significantly different survival times.[6][9][23]¹ Below is given the output of two variants of the log-rank test.

Figure 3.1: Kaplan-Meier Survival Curve: It is reasonable to assume that the Proportional Hazards hypothesis holds, since the two curves of the corresponding groups neither cross each other nor converge.
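
A minimal sketch of how this curve and the tests below could be reproduced, assuming the pharmacoSmoking data from the asaur package and the survival package; the complementary log-log plotting scale is an assumption, since the exact plotting options of Figure 3.1 are not recorded:

library(survival)
library(asaur)
data <- pharmacoSmoking

# Kaplan-Meier curves by treatment group; fun = "cloglog" gives the
# log(-log S) versus log(time) scale used to eyeball proportional hazards
km <- survfit(Surv(ttr, relapse) ~ grp, data = data)
plot(km, fun = "cloglog", col = c(1, 2), xlab = "survival time (days)")

# Two variants of the log-rank test: rho = 1 is the Peto-Peto
# modification, rho = 0 the usual log-rank test
survdiff(Surv(ttr, relapse) ~ grp, data = data, rho = 1)
survdiff(Surv(ttr, relapse) ~ grp, data = data, rho = 0)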

Call:
survdiff(formula = Surv(data$ttr, data$relapse) ~ data$grp, data = data,
    rho = 1)

                      N Observed Expected (O-E)^2/E (O-E)^2/V
data$grp=combination 61     23.1     32.1      2.53      8.01
data$grp=patchOnly   64     35.8     26.8      3.04      8.01

 Chisq= 8  on 1 degrees of freedom, p= 0.00465

Call:
survdiff(formula = Surv(data$ttr, data$relapse) ~ data$grp, data = data,
    rho = 0)

                      N Observed Expected (O-E)^2/E (O-E)^2/V
data$grp=combination 61       37     49.9      3.36      8.03
data$grp=patchOnly   64       52     39.1      4.29      8.03

 Chisq= 8  on 1 degrees of freedom, p= 0.00461

¹ The books cited here facilitated the understanding of the interpretation of the models below and of the techniques employed to tackle problems in Survival Analysis.

Both variants of the log-rank test, Peto-Peto[35] and Breslow[5], bear evidence that there is a significant difference between the relapse times of the two groups, since the p-values of both tests are less than 0.05. Therefore one of the groups is thought to have statistically better survival times than the other, which is a very satisfactory sign for proceeding to a more thorough examination of the covariates and the data.
Figure 3.2: In both figures it is evident that a Weibull distribution would provide a satisfactory fit for the data at hand. The complementary log transformation of the Kaplan-Meier estimate plotted against time falls on the least-squares regression line, indicating that the hazard function is linear.

3.3 Proportional Hazard Model

The implementation of non-parametric methods is thwarted by covariates, since it requires categorical predictors. When we have several prognostic variables we must use multivariate approaches; however, neither multiple linear nor logistic regression can handle censored observations. Thus the aid of the Cox model[10] must be enlisted to fit survival data in the presence of censoring. Two fairly popular models under this umbrella are the Cox model and the parametric proportional hazards model. In both of them the hazard at time t of an individual with covariates x is assumed to be proportional to a baseline hazard. In particular, the baseline hazard function portrays the risk for an individual with x = 0, who serves as a reference, and e^{\beta} is the relative proportionate increase or reduction in risk associated with the set of characteristics x. For instance, consider two samples identified by a dummy variable taking the values one and zero; then e^{\beta} represents the ratio of the risk in group one relative to group zero at any time. More practically, if the exponentiated form of a coefficient is less than one, this indicates that the covariate delays the event and increases survival time. The difference between the two families of models lies in the fact that the baseline hazard function is assumed to follow a specific distribution when a fully parametric proportional hazards model is fitted to the data, whereas the Cox model makes no assumption about the form of h_0(t) (the non-parametric part of the model) but assumes a parametric form for the effect of the predictors on the hazard (the parametric part of the model). The model is therefore referred to as a semi-parametric model. The coefficients are estimated by partial likelihood in the Cox model, or by maximum likelihood in the parametric proportional hazards model. Other than this, the two types of models are equivalent: hazard ratios have the same interpretation, and proportionality of hazards is still assumed. A number of different parametric proportional hazards models may be derived by choosing different hazard functions; the commonly applied models are the exponential, Weibull and Gompertz models.

3.3.1 Semi-parametric Cox Model

definition 12. The Cox model assumes that the hazards are proportional:
h_i(t) = h_0(t) e^{\beta x_i}
where \beta is the corresponding coefficient of the covariate x and h_0 is the baseline hazard function.

3.3.1.1 Model Building

In this section a roadmap is given of how the final model was obtained, through a procedure proposed by Collett [Model Selection in Survival Analysis]. At all stages of the procedure, decisions about the inclusion or exclusion of covariates were taken with regard to the likelihood ratio test.

1. Fit a univariate model for each covariate, and identify the predictors significant at some level p1 (0.10 or 0.20). Please see section One of the Appendix, and the sketch below.
2. Fit a multivariate model with all significant univariate predictors, and use backward selection to eliminate non-significant variables at some level p2 (0.10 or 0.05). Please see section Two of the Appendix.
3. Starting with the final model of step (2), consider each of the non-significant variables from step (1) using forward selection, with significance level p3 (0.10). Please see section Three of the Appendix.
4. Do final pruning of the main-effects model (omit variables that are non-significant, add any that are significant), using stepwise regression with significance level p4 (0.05). At this stage, interactions between any of the main effects currently in the model may be considered, under the hierarchical principle. Please see section Four of the Appendix.
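
As a minimal sketch of step 1 (the full code is in the Appendix), the univariate screening could look as follows; the covariate names are taken from the dataset description above:

# Step 1: one univariate Cox model per covariate; keep those whose
# likelihood ratio test has p-value below p1 = 0.20
covariates <- c("grp", "age", "employment", "yearsSmoking", "levelSmoking",
                "priorAttempts", "longestNoSmoke")
p.values <- sapply(covariates, function(v) {
  fit <- coxph(as.formula(paste("Surv(ttr, relapse) ~", v)), data = data)
  summary(fit)$logtest["pvalue"]
})
names(p.values)[p.values < 0.20]   # candidates for the multivariate model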
A possible criticism of backward stepwise methods is that the obtained model is conditional, in ways that are hard to fathom, upon all previous choices of variables and their order of inclusion. Another issue is that the conclusion that a covariate is independently associated with the outcome cannot easily be drawn. Higher-quality clinical journals now more frequently require better statistical analyses, often with a separate review by a statistician, a trend that is to be encouraged. The shortcomings of stepwise methods, such as bias in parameter estimation, inconsistencies among model selection algorithms, an inherent (but often overlooked) problem of multiple hypothesis testing, and an inappropriate focus on, or reliance upon, a single best model with biased (inflated) coefficients, biased (deflated) p-values and inflated model fit statistics, have largely been brought to light in the statistical realm. Thus, to remedy these drawbacks, Collett suggested a second stepwise variable selection in the forward direction, initialised at the last step of the backward stepwise search. Then it is common to investigate possible interactions of the included covariates to acquire the final form of the model in terms of the covariates to include.
res.cox <- coxph(Surv(data$ttr, data$relapse) ~ data$grp + data$age + data$employment)
summary(res.cox)

Call:
coxph(formula = Surv(data$ttr, data$relapse) ~ data$grp + data$age)

  n= 125, number of events= 89

                       coef exp(coef)  se(coef)      z Pr(>|z|)
data$grppatchOnly  0.558663  1.748334  0.216674  2.578  0.00993
data$age          -0.023018  0.977245  0.009605 -2.397  0.01655

                  exp(coef) exp(-coef) lower .95 upper .95
data$grppatchOnly    1.7483      0.572     1.143    2.6734
data$age             0.9772      1.023     0.959    0.9958

Concordance= 0.625  (se = 0.034)
Rsquare= 0.105   (max possible= 0.998)
Likelihood ratio test= 13.82  on 2 df,   p=0.0009956
Wald test            = 13.48  on 2 df,   p=0.001183
Score (logrank) test = 13.74  on 2 df,   p=0.00104

3.3.1.1.1 Interpretation

At first glance, all the covariates included in the model are found significant. The exponentiated coefficients in the second column give the multiplicative effect on the baseline hazard function. In more detail, being in the patchOnly group raises the hazard of relapse and so seems to shorten the time before starting to smoke again. On the other hand age, whose exponentiated coefficient is less than one, lowers the hazard and lengthens the time during which someone is not smoking. The likelihood ratio, Wald and log-rank tests all test the global null hypothesis beta = 0; in this particular case the p-values of these tests are close to 0, so the null hypothesis that the beta coefficients are zero cannot be accepted. The concordance of 0.625 gives the model decent predictive power.

3.3.1.2 Residuals Diagnostic

Cox-Snell Residuals[8]

Figure 3.3: The residual plot indicates a good fit of the data under the framework of the Proportional Hazards model.

Martingale Residuals[46]

Figure 3.4: Martingale Residuals: The martingale residuals plotted against age show some signs of non-linearity, whereas plotted against the treatment groups they seem to be distributed evenly. A possible remedy to the non-linearity of age is to transform the variable or to stratify. In this case a log transformation was employed; however, it did not show a change worth mentioning.

Schoenfeld Residuals[40]

Figure 3.5: Schoenfeld Residuals: The shape of the smoothed (lowess) curve is an estimate of the difference parameter as a function of time, which appears to increase followed by a slight reduction. Overall, there is no evidence to reject the proportional hazards hypothesis.

Grambsch and Therneau's test for the Proportional Hazards assumption

                      rho  chisq     p
data$grppatchOnly  0.0574 0.2924 0.589
data$age           0.0243 0.0635 0.801
GLOBAL                 NA 0.3805 0.827

The test validates the conclusion drawn from the Schoenfeld residuals that the proportional hazards hypothesis holds for the data. Another conclusion that can be drawn from the test is that both variables individually agree with the proportional hazards hypothesis. Thus there is no need to investigate whether the variables are time-dependent.
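
For reference, this test is implemented in the survival package as cox.zph; a minimal sketch, assuming res.cox is the Cox model fitted above:

# Grambsch-Therneau test of the proportional hazards assumption
zph <- cox.zph(res.cox)
print(zph)   # rho, chisq and p per covariate, plus a global test
plot(zph)    # smoothed Schoenfeld residuals against time, as in Figure 3.5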
Jacknife Residuals

Figure 3.6: Jacknife Residuals: The plot indicates that there are patients with significant overall influence on the model. Indeed, four patients appear to be influential observations.
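
The jackknife residuals plotted above can be approximated by the dfbeta residuals of the fitted model; a minimal sketch, again assuming res.cox from above (the cut-off used to flag subjects is illustrative only):

# Change in each coefficient when one subject is removed (approximate jackknife)
dfb <- residuals(res.cox, type = "dfbeta")
plot(dfb[, 2], xlab = "Observation", ylab = "Change in coefficients")
abline(h = 0, lty = 2)
which(abs(dfb[, 2]) > 0.002)   # flag unusually influential subjects for age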

3.4 Accelerated Failure Models

Despite the dominance of proportional hazards models in analysing survival data, their usefulness is limited by the fact that only a few probability distributions can accommodate the proportional hazards property. Dropping this assumption paves the way for a family of models called Accelerated Failure Time models. Under this framework the effect of covariates on survival time is measured directly, allowing for an easier interpretation of results², since covariate effects are assumed to act multiplicatively on the time scale, impacting the mean survival time by a constant factor. The time ratio comparing two levels of a covariate x (x = 1 against x = 0), after controlling for all the other covariates, is e^{\beta}, which is interpreted as the estimated ratio of the expected survival times for the two groups. A time ratio above 1 for the covariate implies that the covariate prolongs the time to the event, while a time ratio below 1 indicates that an earlier event is more likely. Therefore Accelerated Failure Time models can be interpreted in terms of the speed of progression of a disease. The effect of the covariates in an accelerated failure time model is to change the scale, and not the location, of a baseline distribution of survival times. In other words, the probability that a subject with covariates x will be alive at time t is the same as the probability that a reference subject will be alive at time t e^{-x'\beta}. Alternatively, it can be viewed as an ordinary regression model for log survival times, of the form log T = x'\beta + \sigma W, where the error term W has a suitable extreme value, generalised extreme value, Normal or logistic distribution, leading to the Weibull, generalised gamma, log-normal and log-logistic distributions respectively. For example, if W is distributed according to an extreme value distribution, then T has a Weibull distribution with log \lambda = -x'\beta and shape p = 1/\sigma, where p is assumed to be constant; this model has an accelerated life interpretation. In this formulation the error term is viewed as a standard or reference distribution, which the covariates rescale on the time axis: the probability that a reference subject will be alive at time t is S_0(t) = P(T_0 \ge t), and the covariates act by rescaling t. In terms of risk, this means that an exposed subject faces, at any given age, the risk that the reference subject would face at the correspondingly rescaled age.

definition 13. The Accelerated Failure Time model assumes a scaling of the time axis:
S_1(t) = S_0(\psi t)
where \psi is the acceleration factor.

² Note that the proportional hazards model has a natural interpretation if there are external factors; for example, more people tend to die in the winter, so a proportional hazards model might be more appropriate.

3.4.1 Exponential distribution

Probability density function: f(t) = \lambda e^{-\lambda t}
Cumulative distribution function: F(t) = 1 - e^{-\lambda t}
Survivor function: S(t) = e^{-\lambda t}
Hazard function: h(t) = \lambda
Cumulative hazard: H(t) = \lambda t
where \lambda is the parameter of the exponential model.

The exponential model is the parametric model inducing a constant risk over time, which reflects the lack-of-memory property of the exponential distribution: the probability of dying within a particular time interval depends only on the length, and not on the location, of the interval. The exponential distribution has a constant hazard rate and was widely used in early work on the reliability of electronic components and technical systems. The model is very sensitive to even modest variation because it has only one adjustable parameter, the inverse of which is both the mean and the standard deviation. Recent work has overcome this limitation by using more flexible distributions.

3.4.1.1 How to identify if the Exponential is a valid distribution for modelling the survival data

One may plot the hazard rate against time t: if the result is a straight line parallel to the x-axis, or if the cumulative hazard function is a straight line passing through the origin, then the exponential model is deemed appropriate.
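
A minimal sketch of this check for the data at hand, using the Nelson-Aalen estimate of the cumulative hazard; the overlaid reference line uses the exponential maximum likelihood rate (events divided by total follow-up time):

# Under a constant hazard, the cumulative hazard is a line through the origin
fit <- survfit(Surv(ttr, relapse) ~ 1, data = data)
H <- cumsum(fit$n.event / fit$n.risk)                 # Nelson-Aalen estimate
plot(fit$time, H, type = "s", xlab = "Time", ylab = "Cumulative hazard")
abline(a = 0, b = sum(data$relapse) / sum(data$ttr), lty = 2)  # exponential fit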
Call:
survreg(formula = Surv(data$ttr, data$relapse) ~ data$grp + data$age +
    data$longestNoSmoke, dist = "exponential")

                        Value Std. Error     z        p
(Intercept)          3.480797   0.510143  6.82 8.90e-12
data$grppatchOnly   -0.681830   0.215909 -3.16 1.59e-03
data$age             0.028984   0.010001  2.90 3.75e-03
data$longestNoSmoke  0.000211   0.000128  1.64 1.01e-01

Scale fixed at 1

Exponential distribution
Loglik(model)= -493.6   Loglik(intercept only)= -506.4
        Chisq= 25.57 on 3 degrees of freedom, p= 1.2e-05
Number of Newton-Raphson Iterations: 5

3.4.1.2 Interpretation of the Exponential Model

Most of the variables included in the model were found to be significant, with longestNoSmoke falling short at p of roughly 0.10. In particular, the coefficient of the patchOnly group seems to shorten the time to the event of starting smoking again, while the two other variables seem to prolong the time to the event slightly. The shape parameter implied by the data differs from unity, so the exponential distribution is not the most appropriate amongst the known distributions. The log-likelihood is -493.6 compared to -506.4 for the null model, and the Chi-square statistic with 3 degrees of freedom, with p-value less than 0.05, indicates that the model coefficients are significantly different from zero compared to the intercept-only model.

3.4.1.3 Residual Analysis of the Exponential Model

Deviance Residuals

Figure 3.7: Deviance Residuals: Age might have a non-linear pattern, whereas the treatment-group variable appears to be evenly distributed.

Jacknife Residuals

Figure 3.8: Jacknife Residuals: The plot agrees with the former model that the same patients seem to have significant influential power.

Cox-Snell Residuals

Figure 3.9: Cox-Snell: The plot indicates that the assumption of exponentially distributed survival times does not hold.

3.4.2 Weibull Distribution

[36]
Probability density function: f(t) = \lambda \gamma t^{\gamma - 1} e^{-\lambda t^{\gamma}}
Cumulative distribution function: F(t) = 1 - e^{-\lambda t^{\gamma}}
Survivor function: S(t) = e^{-\lambda t^{\gamma}}
Hazard function: h(t) = \lambda \gamma t^{\gamma - 1}
Cumulative hazard: H(t) = \lambda t^{\gamma}
where \lambda and \gamma are the parameters of the Weibull model.

The Weibull model (introduced by Waloddi Weibull in 1939) is an important generalisation of the exponential model, with two positive parameters. The second parameter allows great flexibility of the model and different shapes of the hazard function. The convenience of the Weibull model for empirical work stems from this flexibility and from the simplicity of its hazard and survivor functions. The Weibull hazard has been theoretically derived for cancer incidence by Pike (1966), but it is unknown whether it has relevance for other diseases. The Weibull distribution is inappropriate when the hazard rate is indicated to be unimodal or bathtub-shaped; a generalisation of the Weibull distribution to include such shapes was proposed by Mudholkar[33] (1996). If T has a Weibull distribution, then log T has an extreme value (Gumbel) distribution.

3.4.2.1 How to identify if the Weibull is a valid distribution for modelling the survival data

Again, one may plot the log cumulative hazard against log time: if the result is a straight line (with slope equal to the shape parameter p), then the Weibull model is deemed appropriate.
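
A minimal sketch of this check: under a Weibull model, log(-log S(t)) is linear in log t, which is the plot shown in Figure 3.2 (km.all is the overall Kaplan-Meier fit):

# Weibull check: log(-log S(t)) against log t should be roughly straight
km.all <- survfit(Surv(ttr, relapse) ~ 1, data = data)
x <- log(km.all$time)
y <- log(-log(km.all$surv))
ok <- is.finite(x) & is.finite(y)   # drop points where S(t) is 0 or 1
plot(x[ok], y[ok], xlab = "log time", ylab = "complementary log-log of survival")
abline(lm(y[ok] ~ x[ok]))           # the slope estimates the shape parameter p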
Call:
survreg(formula = Surv(data$ttr, data$relapse) ~ data$grp + data$age,
    dist = "weibull")

                    Value Std. Error     z        p
(Intercept)        2.9242     0.9740  3.00 2.68e-03
data$grppatchOnly -1.1327     0.4292 -2.64 8.32e-03
data$age           0.0471     0.0191  2.46 1.38e-02
Log(scale)         0.6733     0.0905  7.44 1.03e-13

Scale= 1.96

Weibull distribution
Loglik(model)= -458.7   Loglik(intercept only)= -466.1
        Chisq= 14.87 on 2 degrees of freedom, p= 0.00059
Number of Newton-Raphson Iterations: 5
n= 125

3.4.2.2 Interpretation of the Weibull Model

The accelerated failure time model based on the Weibull distribution differs in the magnitude of the coefficients and standard errors. All covariates in the model are found significant, and the model's coefficients are significantly different from those of the intercept-only model. The first column gives the coefficient, whose exponentiated form increases or decreases the survival time; the columns which follow give the estimated standard errors of the estimates, along with the z-statistics and p-values. Moreover, the scale relates to the acceleration factor increasing or decreasing the treatment group's survival time. Lastly, the log-likelihood validates that the model is significantly different from the null model. The distribution seems to fit the data, qualifying this model as the best candidate for analysing the data from the accelerated failure time model class.

3.4.2.3 Residual Analysis of the Weibull Model

Deviance Residuals

Figure 3.10: Deviance Residuals: The treatment groups seem to be modelled successfully, whereas age exhibits problems similar in nature to those seen when the Cox model was fitted to the data.

Jacknife Residuals

Figure 3.11: Jacknife Residuals: From the Jacknife plot we can see that patients 46, 68 and 114 are deemed influential. They were removed, and the data did not exhibit any novel pattern suggesting that the removal of influential observations would result in a better-fitting model or in a decrease in the amount of non-linearity.

Cox-Snell Residuals

Figure 3.12: Cox-Snell: The Cox-Snell residual plot reveals that the Weibull distribution provides an adequate fit for the data.

3.4.3 Proportional Odds Model

3.4.3.0.1 Introduction

The Proportional Odds model[2] arises from the hypothesis that the effect of covariates is to increase or decrease the odds of dying by a proportionate amount at any given duration. Mathematically this relationship can be written as

(1 - S(x, t)) / S(x, t) = e^{x'\beta} (1 - S_0(t)) / S_0(t)

where S_0(t) is the baseline survivor function, taken from a suitable distribution, and e^{x'\beta} is a multiplier reflecting the proportionate increase in the odds associated with the covariate values x. The model is particularly useful when modelling survival data in which the mortality rates converge.

definition 14. Proportional Odds Model: the proportional odds model assumes that each explanatory variable exerts the same effect on the log-odds regardless of the duration t:

log[(1 - S(x, t)) / S(x, t)] = \alpha(t) + x'\beta.

For a log-logistic baseline, S_0(t) = 1 / (1 + \lambda t^p), so that (1 - S_0(t)) / S_0(t) = \lambda t^p, and the ratio of the odds of failure for two covariate settings,

[(1 - S_1(t)) / S_1(t)] / [(1 - S_2(t)) / S_2(t)],

does not depend on t. Hence the log-logistic model is a proportional odds model.

3.4.3.1 Log-logistic Distribution

[26]
Probability density function: f(t) = \lambda p t^{p-1} / (1 + \lambda t^{p})^{2}
Cumulative distribution function: F(t) = \lambda t^{p} / (1 + \lambda t^{p})
Survivor function: S(t) = 1 / (1 + \lambda t^{p})
Hazard function: h(t) = \lambda p t^{p-1} / (1 + \lambda t^{p})
Cumulative hazard function: H(t) = log(1 + \lambda t^{p})

A major impediment to utilising the log-normal distribution to model survival data is the requirement of no censoring, something that may not be tenable. An alternative approach is to consider the log-logistic distribution, which is analogous to the log-normal distribution but with heavier tails. More importantly, it benefits from closed forms for the survivor and hazard functions. The log-logistic distribution has a fairly flexible functional form: it is one of the parametric survival time models in which the hazard rate may be decreasing, or may give a hump-shaped curve that initially increases and then decreases. The log-logistic distribution can be obtained as a mixture of Gompertz distributions with a gamma-distributed mixing variable with mean and variance equal to one.
3.4.3.2 How to identify if the Log-logistic is a valid distribution for modelling the survival data

To examine this assumption, one may plot log[(1 - S(t)) / S(t)] against log t. If the plot is linear with slope p, then the survival time follows a log-logistic distribution.
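
A minimal sketch of this check, reusing the overall Kaplan-Meier fit km.all from the Weibull check above:

# Log-logistic check: the log-odds of failure should be linear in log t
s <- km.all$surv
ok <- s > 0 & s < 1
plot(log(km.all$time[ok]), log((1 - s[ok]) / s[ok]),
     xlab = "log time", ylab = "log-odds of failure")
abline(lm(log((1 - s[ok]) / s[ok]) ~ log(km.all$time[ok])))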

Call:
survreg(formula = Surv(data$ttr, data$relapse) ~ data$grp + data$age,
    dist = "loglogistic")

                    Value Std. Error     z        p
(Intercept)        2.0987     1.0177  2.06 3.92e-02
data$grppatchOnly -1.3522     0.4745 -2.85 4.38e-03
data$age           0.0473     0.0197  2.40 1.62e-02
Log(scale)         0.3907     0.0881  4.44 9.19e-06

Scale= 1.48

Log logistic distribution
Loglik(model)= -456.5   Loglik(intercept only)= -463.6
        Chisq= 14.26 on 2 degrees of freedom, p= 8e-04
Number of Newton-Raphson Iterations: 4
n= 125

3.4.3.2.1 Interpretation of the Log-logistic Model

The log-logistic model's estimates change only infinitesimally in magnitude, and again all covariates are found significant. Perhaps the only quantity that differs is the scale. The model is significant compared to the null (intercept-only) model.

3.4.3.2.2 Residuals Analysis

Deviance Residuals

Figure 3.13: Deviance Residuals: The plot depicts no novel pattern that has not been observed in the previous models.

Jacknife Residuals

Figure 3.14: Jacknife Residuals: The model singles out the same patients for their influence as the previous models did.

Cox-Snell Residuals

Figure 3.15: Cox-Snell: The plot reveals that although the log-logistic model provides a better fit than the exponential model, the comparison between the log-logistic and Weibull models favours the latter. Thus the Weibull seems to provide the best available option among accelerated failure time models.

3.4.4 Log-normal Distribution

[4]
Probability density function: f(t) = \frac{1}{\sigma t \sqrt{2\pi}} e^{-(\log t - \mu)^2 / (2\sigma^2)}
Cumulative distribution function: no closed form
Survivor function: S(t) = 1 - \Phi\!\left(\frac{\log t - \mu}{\sigma}\right)
Hazard function: no useful closed form
Cumulative hazard: no useful closed form
where \mu and \sigma are the parameters of the log-normal model.

If survival times are assumed to have a log-normal distribution, then the natural logarithm of the lifetime is assumed to be normally distributed. A log-normal distribution results when the variable is the product of a large number of independent, identically distributed variables, in the same way that a normal distribution results when the variable is the sum of a large number of independent, identically distributed variables. The survivor and hazard functions involve the incomplete normal integral. Under the accelerated failure framework the log-normal distribution may be convenient to employ with non-censored data, but when this distribution is applied to censored data the computations quickly become formidable. Because of the decreasing form of the hazard function at older ages, the distribution seems implausible as a lifetime model in most situations; nevertheless, it makes sense if interest is focused on periods of younger ages. Despite these unattractive features, the log-normal distribution has been widely used as a failure distribution in diverse situations, such as the analysis of electrical insulation or the time to occurrence of lung cancer among smokers. Furthermore, the log-normal distribution has often been used as a frailty (mixing) distribution. Especially in the context of unobserved normally distributed covariates in the Cox model, log-normal frailty distributions provide an appealing interpretation of the model. Unfortunately, the Laplace transform is intractable, and therefore numerical integration is needed for probability results.

3.5 Bayesian Non Parametric Survival Analysis

3.5.0.0.3 Bayesian Paradigm

In this section the data are subjected to analysis under the Bayesian paradigm, which imposes an amount of uncertainty on all parameters of the models, since they are treated as random variables. In particular, the pharmacoSmoking data were examined under a Bayesian non-parametric regime. Below can be found some definitions concerning the tools whose aid will be enlisted in the analysis, such as the definition of the Dirichlet process and definitions of other, more complex cardinal structures.

3.5.1

Basic Knowledge

definition 15. Dirichlet Process Given a measurable set S, a base probability distribution H and a positive
real number, the Dirichlet process DP(,P) is a stochastic process whose sample path (or realization, i.e. an
infinite set of random variates drawn from the process) is a probability distribution over S and the following
holds. For any measurable finite partition of S, s
if X DP (, P )
then (X(B1 ), X(B2 ) . . . , X(Bn )) Dirichlet(P (B1 ), P (B2 ) . . . , P (Bn ))
definition 16. A hierarchical Dirichlet process[45] is a distribution over a set of random probability measures
over(, ). The process defines a set of random probability measures P , one for each group, and a global
random probability measure P0 . The global measure P0 is distributed as a Dirichlet process with concentration
parameter and base probability measure H P0 DP (, H) and the random measures P are conditionally
independent given P0 , with distributions given by a Dirichlet process with base probability measure H: The
hyper parameters of the hierarchical Dirichlet process consist of the baseline probability measure H, and
the concentration parameters and . The baseline H provides the prior distribution for the factors .
The distribution P varies around the prior H, with the amount of variability governed by . The actual
distribution P over the factors in the th group deviates from P0 , with the amount of variability governed
by 0 . If we expect the variability in different groups to be different, we can use a separate concentration
parameter j for each group . In this paper, following Escobar and West(1995)[14], we put vague gamma
prior on and 0 . A hierarchical Dirichlet process can be used as the prior distribution over the factors for
grouped data. For each let ,1 , 2 ,. . . be independent identical distribution random variables distributed as
P . Each is a factor corresponding to a single observation x . The likelihood is given by: P P

27

x F ( ). (15) This completes the definition of a hierarchical Dirichlet process mixture model
definition 17. Dependent Process is defined by the relation where the parameter follow the restrictions
described previously Z , U for = 1, 2, . . . are mutually independent realisations from stochastic processes
Zx , Ux . The processes and V are determined by the transformations Z , U respectively. At each x ,the
P
distribution Fx is defined to be discrete distribution characterised by Px (A) = |z A px or equivalently
P
Fx (A) = |z, px
definition 18. Suppose $y_{ij}$, for $i = 1, \dots, n_j$, are observations for different subjects within centre $j$, for $j = 1, \dots, J$. For example, $y_j = (y_{1j}, \dots, y_{n_j j})$ may represent patient outcomes within the $j$th hospital or hospital-level outcomes within the $j$th state. Although covariates $x_{ij} = (x_{ij1}, \dots, x_{ijp})$ are typically available, we initially assume that subjects are exchangeable within centres, with

$$y_{ij} \sim F_j \ \text{ for } j = 1, \dots, J, \qquad F_j = \int p(\cdot \mid \theta, \phi)\, dG_j(\theta), \qquad G_j = \sum_{k=1}^{\infty} \pi_{jk}\, G_k^*(\cdot), \qquad G_k^*(\cdot) = \sum_{l=1}^{\infty} w_{lk}\, \delta_{\theta_{lk}}(\cdot).$$

3.5.2 Explanation of the basic Ideas

3.5.2.0.4 Introduction Although contemporary Bayesian Non-Parametric Survival Analysis may rely on more complex structures than a Dirichlet process, Dirichlet processes [15] remain a challenging topic to fathom and master. Bereft of a deep understanding of the knotty nuances that are involved, it may not be feasible to lay the foundations for more advanced methods. Thus, it is considered more beneficial to give an overview of Dirichlet processes before the focus is shifted back to the implementation of a Bayesian Non-Parametric framework in Survival Analysis. The introduction to Dirichlet processes is deemed to lay the cornerstone of modern Non-Parametric Statistics because it provides a non-parametric prior specification over the space of all possible distribution functions for a random variable y. Before proceeding, though,
assume a random vector $(y_1, \dots, y_{k-1})$ with a $(k-1)$-dimensional Dirichlet distribution, $y \sim \text{Dirichlet}_{k-1}(\alpha_1, \alpha_2, \dots, \alpha_k)$, so that $(y_1, y_2, \dots, y_{k-1})$ have joint density

$$f(y_1, y_2, \dots, y_{k-1}) = \frac{\Gamma\big(\sum_{j=1}^{k} \alpha_j\big)}{\prod_{j=1}^{k} \Gamma(\alpha_j)} \prod_{j=1}^{k-1} y_j^{\alpha_j - 1} \Big(1 - \sum_{j=1}^{k-1} y_j\Big)^{\alpha_k - 1}.$$

3.5.3 An Initial Approach

A motivating example that gave rise to one of the most useful tools in Bayesian Non-Parametric Statistics is the acquisition of a Bayesian estimate of a density. The quest for a Bayesian density estimator swiftly took a likely turn, that of creating a probabilistic version of the histogram of the density in question. From the origins of Statistics, statisticians have relied heavily on histograms to decipher the identity of a density. To that end, assume we are given a specified set of knots $\xi = (\xi_0, \dots, \xi_k)$, with $\xi_0 \le \dots \le \xi_k$ and $y \in (\xi_0, \xi_k]$, to define the histogram estimate. A probability model that portrays the histogram is

$$f(y) = \sum_{h=1}^{k} \pi_h\, \frac{1(\xi_{h-1} < y \le \xi_h)}{\xi_h - \xi_{h-1}}, \qquad y \in \mathbb{R},$$

with $\pi = (\pi_1, \pi_2, \dots, \pi_k)$ an unknown probability vector. A prior distribution is added to these probabilities, namely $\pi \sim \text{Dirichlet}(a_1, \dots, a_k)$. The posterior distribution of $\pi$ is

$$p(\pi \mid y) \propto \prod_{h=1}^{k} \pi_h^{a_h - 1} \prod_{h=1}^{k} \pi_h^{n_h} = \prod_{h=1}^{k} \pi_h^{a_h + n_h - 1} = \text{Dirichlet}(a_1 + n_1, \dots, a_k + n_k),$$

where

$$n_h = \sum_{i=1}^{n} 1(\xi_{h-1} < y_i \le \xi_h)$$

is the number of observations falling in the $h$th bin of the histogram. At first glance, a Bayesian histogram has a striking resemblance to a Dirichlet Process. Despite the similarities, it should be borne in mind that a Dirichlet Process requires, for every possible partition $(B_1, \dots, B_n)$ of its sample space, that there exist a probability measure making the bin probabilities distributed according to a Dirichlet distribution. The formulation of the Bayesian histogram raised the question of bypassing the need to explicitly specify bins, so that the multivariate case would be easier to handle, because an infinite number of bins would otherwise be needed [19].
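To make the conjugate update concrete, the following short R sketch draws from the posterior Dirichlet of the bin probabilities using the counts $n_h$; it is illustrative only, and the data, knots and prior counts below are hypothetical choices:

# Bayesian histogram: Dirichlet prior on bin probabilities, conjugate update
set.seed(1)
y     <- rgamma(100, shape = 2, rate = 0.5)        # hypothetical data
knots <- seq(0, ceiling(max(y)), length.out = 11)  # knots xi_0, ..., xi_k
a     <- rep(1, length(knots) - 1)                 # Dirichlet prior counts a_h
n_h   <- hist(y, breaks = knots, plot = FALSE)$counts  # bin counts n_h
# Posterior is Dirichlet(a + n_h); sample it via normalised Gammas
post  <- rgamma(length(a), shape = a + n_h, rate = 1)
post  <- post / sum(post)                          # one posterior draw of pi
round(post, 3)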

3.5.4 Formulation of the Dirichlet Process

Let $X$ be a set, and let $\mathcal{B}$ be a $\sigma$-algebra on $X$, i.e. a non-empty collection of subsets of $X$ such that:

$\emptyset, X \in \mathcal{B}$;

if $B \in \mathcal{B}$ then its complement $B^c \in \mathcal{B}$;

if $B_1, B_2, \dots$ is a countable collection of sets in $\mathcal{B}$ then their union $\bigcup_{k=1}^{\infty} B_k \in \mathcal{B}$.

Assume that $P$ is a probability measure over $(X, \mathcal{B})$, where $\mathcal{B}$ is the collection of all possible subsets or bins (any random partition of $X$); for example, the bins can be disjoint intervals. Recall that the Dirichlet distribution is a probability distribution over probability mass functions, and we say a random probability mass function has a Dirichlet distribution with parameter $\alpha$. A Dirichlet process is a collection of random variables whose index set is the $\sigma$-algebra $\mathcal{B}$. The Dirichlet Process's goal was to specify a distribution over $P$ that was manageable, yet useful, and it achieved this task with but one restriction. Let $\alpha$ be a finite non-zero measure (just a measure, not necessarily a probability measure) on the original measurable space $(X, \mathcal{B})$.

3 The example was taken from the book cited at the end to assist readers in a smoother transition to the world of Bayesian Non-Parametric Statistics.


Ferguson termed $P$ a Dirichlet process with parameter $\alpha$ on $(X, \mathcal{B})$ if, for any finite measurable partition $(B_1, \dots, B_k)$ of $X$, the random vector $(P(B_1), \dots, P(B_k))$ has a Dirichlet distribution with parameters $(\alpha(B_1), \dots, \alpha(B_k))$. Because $P$ is a random probability measure, the probabilities it assigns to these bins are random variables, and the Dirichlet distribution acts as a simple conjugate prior for them. Writing $\alpha = \alpha_0 P_0$, the probability measure $P_0$ is utilised as an initial guess and $\alpha_0$ is a prior concentration parameter controlling the degree of shrinkage of $P$ towards $P_0$; that is, $\alpha_0$ is an estimate of how sure one is of the initial guess. Thus the Dirichlet Process has the intuitive interpretation of a histogram without having to depend on a partition into bins. A fixed partition does not induce a fully specified prior for the random probability measure $P$; sensitivity to the choice of partition is eliminated, and a fully specified prior on $P$ is induced, by assuming the Dirichlet condition holds for all possible partitions and for all $k$. For these specifications to be coherent, there must exist a random probability measure $P$ such that the probabilities assigned to any measurable partition by $P$ satisfy a certain consistency condition; the resulting random probability measure $P$ is referred to as a Dirichlet Process.

We call $B_1, \dots, B_k$ a finite measurable partition of $X$ if $B_i \in \mathcal{B}$ for all $i = 1, \dots, k$, $B_i \cap B_j = \emptyset$ if $i \ne j$, and $\bigcup_{i=1}^{k} B_i = X$. If $P$ is a Dirichlet process with parameter $\alpha$, then its distribution $D$ is called a Dirichlet measure.

As a consequence of Ferguson's restriction, $D$ has support only on atomic distributions with infinitely many atoms, and assigns zero probability to any non-atomic distribution (e.g. Gaussian) and to atomic distributions with finitely many atoms. That is, a realization drawn from $D$ is a measure $P = \sum_{k=1}^{\infty} p_k\, \delta_{y_k}$, where $\delta_x$ is the Dirac measure on $X$ ($\delta_x(A) = 1$ if $x \in A$ and $0$ otherwise), $p_k$ is a sequence of weights and $y_k$ is a sequence of points in $X$.

3.5.5 An Intuitive Look

3.5.5.0.5 Introductory Example A random probability mass function resembles a bag full of dice, and a realization from the Dirichlet gives a specific die. The Dirichlet distribution is limited because it assumes a finite set of events. The Dirichlet process enables us to work with an infinite set of events, and hence to model probability distributions over infinite sample spaces. Another analogy that illustrates the Dirichlet Process is to imagine that someone is asked on the street about her/his favourite colour. Suppose that the choices are black, pink, blue, green, orange, white. The question can lead to different answers depending on his/her mood, so we model the probability that he/she chooses each of these colours as a probability mass function. Thus, we are modelling each person as a probability mass function over the six colours, and we can think of each person's probability mass function over colours as a realization of a draw from a Dirichlet distribution over the set of six colours. But what if we didn't force people to choose one of those six colours? What if they could name any colour they wanted? There would be an infinite number of colours they could name. To model the individuals' probability mass functions (of infinite length), we need a distribution over distributions over an infinite sample space. One solution is the Dirichlet process, which is a random distribution whose realizations are distributions over an arbitrary (possibly infinite) sample space.

3.5.5.0.6 Development of the Dirichlet Process Through an Example The set of all probability distributions over an infinite sample space is unmanageable. To deal with this, the Dirichlet process restricts the class of distributions under consideration to a more manageable set: discrete probability distributions over the infinite sample space that can be written as an infinite sum of weighted indicator functions. An infinite sample space can be perceived as a dartboard, and a realization from a Dirichlet process is a probability distribution on the dartboard marked by an infinite set of darts of different lengths (weights). The $k$th indicator $\delta_{y_k}$ marks the location of the $k$th dart-of-probability, such that $\delta_{y_k}(B) = 1$ if $y_k \in B$ and $\delta_{y_k}(B) = 0$ otherwise. The locations of the darts are independent, and the probability weight associated with the $k$th dart is independent of its location. However, the weights on the darts are not independent of one another. Instead of a vector with one component per event, as in our six-colour sample space, the Dirichlet process is parametrized by a function (specifically, a measure) $\alpha$ over the sample space of all possible colours. Note that this is a finite positive function, so it can be normalized to be a probability distribution. The locations of the darts $y_k$ are drawn independently and identically distributed from this normalized measure. The weights on the darts $p_k$ form a sequence, depending in a somewhat complicated way on the total mass $\alpha(X)$, whose sum is 1. Each realization of a Dirichlet process has a different and infinite set of these dart locations. Further, the $k$th dart has a corresponding probability weight $p_k \in [0, 1]$ and $\sum_{k=1}^{\infty} p_k = 1$. So, for some set $B$ of the infinite sample space, a realization of the Dirichlet process will assign probability $P(B)$ to $B$, where $P(B) = \sum_{k=1}^{\infty} p_k\, \delta_{y_k}(B)$. However, because realizations from the Dirichlet process are atomic, they are not a useful model for many continuous scenarios. For example, instead of picking colours from a list, someone can pick her/his favourite colour from a continuous range of colours, and we would like to model the probability distribution over that space. A realization of the Dirichlet process might give a positive probability to a particular shade of dark blue, but zero probability to adjacent shades of blue, which feels like a poor model for this case. However, in some cases the Dirichlet process might be a fine model. For instance, if we ask colour professionals to name their favourite colour, then it would be reasonable to assign a finite atom of probability to Coca-cola can red, but zero probability to the nearby colours that are more difficult to name. Similarly, darts of probability would be needed for international Klein blue, 530 nm pure green, all the Pantone colours, etc., all because of their name-ability.

4 [17] offered a great description of all the subtle nuances involved in Bayesian Non-Parametric Statistics.


3.5.6 Construction of Dirichlet Process

[seat2010] Turning to more practical issues, one may wonder how a sample from this distribution can be drawn. Here three methods are displayed:

• Polya's urn [3];

• a stick-breaking approach [1], which can be thought of as iteratively breaking off pieces of (and hence dividing) a stick of length one in such a way that the vector of the lengths of the pieces is distributed according to a Dirichlet distribution;

• a method based on transforming Gamma-distributed random variables.5

3.5.6.1 Polya's Urn

Suppose it is of interest to generate a sample from a Dirichlet distribution with parameter $\alpha = (\alpha_1, \dots, \alpha_k)$ using an urn. Bearing in mind that $\alpha_i > 0$ is not necessarily an integer, the urn may contain a fractional or even an irrational number of balls of colour $i$. At each iteration, draw one ball uniformly at random from the urn, and then place it back into the urn along with an additional ball of the same colour. As the procedure is iterated more and more times, the proportions of balls of each colour converge to a probability mass function that is a sample from the Dirichlet distribution. Mathematically, a sequence of balls with colours $(x_1, x_2, \dots, x_n)$ is generated as follows:

Step 1: Set a counter $n = 1$. Draw $x_1 \sim \alpha / \alpha_0$, where $\alpha_0 = \sum_{i=1}^{k} \alpha_i$. (Note that $\alpha / \alpha_0$ is a non-negative vector whose entries sum to 1, so it is a probability mass function.)

Step 2: Update the counter to $n + 1$. Draw $X_{n+1} \mid X_1, \dots, X_n \sim \alpha_n / \alpha_{n,0}$, where $\alpha_n = \alpha + \sum_{i=1}^{n} \delta_{X_i}$ and $\alpha_{n,0} = \alpha_0 + n$ is the sum of the entries of $\alpha_n$. Repeat this step an infinite number of times.

Once Step 2 has been iterated, calculate the proportions of the different colours: let $Q_n = (Q_{n1}, Q_{n2}, \dots, Q_{nk})$, where $Q_{ni}$ is the proportion of balls of colour $i$ after $n$ balls are in the urn. Then $Q_n \to Q \sim \text{Dir}(\alpha)$ as $n \to \infty$, where $\to$ denotes convergence in distribution. That is,

$$P(Q_{n1} \le z_1, \dots, Q_{nk} \le z_k) \to P(Q_1 \le z_1, \dots, Q_k \le z_k) \quad \text{as } n \to \infty,$$

5 The construction of the Dirichlet employs the help of a collapsed Gibbs Sampler; however, here only an overview of the algorithms utilised in the process of construction is given. The books [21] and [38] will provide readers with a good command of the field of Bayesian Statistics and Monte Carlo methods.
6 $\alpha_0$ is a parameter of the Dirichlet Process and is the sum of the entries of $\alpha$.
32

for all $(z_1, z_2, \dots, z_k)$. Note that this does not mean that, in the limit as the number of balls in the urn goes to infinity, the probability of drawing balls of each colour is given by the probability mass function $\alpha / \alpha_0$. Instead, asymptotically, the probability of drawing balls of each colour is given by a probability mass function that is itself a realization of the Dirichlet distribution, resulting in the generation of a sample from the Dirichlet. The proof relies on the Martingale Convergence Theorem.

The Polya sequence and the Chinese restaurant process are two names for the same process, due to Blackwell-MacQueen, which asymptotically produces a partition of the natural numbers; this partition can be employed to produce the realisations. The Polya sequence is analogous to the Polya urn for generating samples from a Dirichlet distribution, except that now there is an infinite number of ball colours and the procedure starts with an empty urn.
Step 1: Let $n = 1$, and pick a new colour with probability distribution $\bar{\alpha} = \alpha / \alpha(X)$ from the set of infinite ball colours. Paint a new ball that colour and add it to the urn.

Step 2: With probability $\frac{n}{n + \alpha(X)}$, pick a ball out of the urn, put it back with another ball of the same colour, and repeat Step 2. With probability $\frac{\alpha(X)}{n + \alpha(X)}$, go to Step 1.

That is, one draws a random sequence $(x_1, x_2, \dots)$, where each $x_i$ is a random colour from the emerging set of colours $y_1, y_2, \dots$. The elegance of the Polya sequence is that we do not need to specify the set of colours ahead of time. The random sequence $(x_1, x_2, \dots)$ has the distributions

$$x_1 \sim \bar{\alpha}, \qquad x_{n+1} \mid x_1, \dots, x_n \sim \frac{\alpha_n}{\alpha_n(X)}, \quad \text{where } \alpha_n = \alpha + \sum_{i=1}^{n} \delta_{x_i},$$

since $\alpha_n(X) = \alpha(X) + n$. Equivalently, if the first $n$ draws result in $K$ different colours $y_1, \dots, y_K$, and the $k$th colour shows up $m_k$ times, then

$$x_{n+1} \mid x_1, \dots, x_n \sim \sum_{k=1}^{K} \frac{m_k}{\alpha(X) + n}\, \delta_{y_k} + \frac{\alpha(X)}{\alpha(X) + n}\, \bar{\alpha}.$$

The balls of the $k$th colour produce the weight $p_k$, and we can equivalently write the distribution of $(x_1, \dots, x_n)$ in terms of $K$.
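As an illustration, the following R sketch draws a Blackwell-MacQueen sequence from the predictive distribution above; it is a minimal simulation, and the choices of a standard normal base measure and total mass $\alpha(X) = 2$ are hypothetical:

# Blackwell-MacQueen / Polya sequence: predictive draws from a DP
set.seed(2)
alphaX <- 2          # total mass alpha(X), hypothetical
n      <- 500        # length of the sequence
x      <- numeric(n)
x[1]   <- rnorm(1)   # base measure: standard normal (arbitrary choice)
for (i in 1:(n - 1)) {
  if (runif(1) < alphaX / (alphaX + i)) {
    x[i + 1] <- rnorm(1)            # new colour drawn from the base measure
  } else {
    x[i + 1] <- sample(x[1:i], 1)   # old colour, prob. proportional to counts
  }
}
length(unique(x))    # number of distinct "colours" after n draws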

3.5.6.2 The Stick-breaking Approach

The stick-breaking approach to generating a random vector with a Dirichlet distribution involves iteratively breaking a stick of length 1 into $k$ pieces in such a way that the lengths of the $k$ pieces follow a Dirichlet distribution. Assume that the knowledge of how to generate random variables from a Beta distribution has already been assimilated. In the case where the stick is broken into 2 pieces, simulating from the Dirichlet is equivalent to simulating from the Beta distribution, so henceforth in this section it is assumed that $k \ge 3$. Over the course of the stick-breaking process, track is kept of a set of intermediate values $u_i$, which are used to ultimately calculate the realization $q$. For ease of exposition, first take $k = 3$. To begin with, we generate $u_1 \sim \text{Beta}(\alpha_1, \alpha_2 + \alpha_3)$ and set $q_1 = u_1$. Then we generate $u_2 \sim \text{Beta}(\alpha_2, \alpha_3)$ and set $q_2 = (1 - u_1)\, u_2$. The resulting vector $q = (u_1, (1 - u_1) u_2, (1 - u_1)(1 - u_2))$ comes from a Dirichlet distribution with parameter vector $(\alpha_1, \alpha_2, \alpha_3)$. The general procedure for $k \ge 3$ is:

Step 1: Simulate $u_1 \sim \text{Beta}(\alpha_1, \sum_{i=2}^{k} \alpha_i)$ and set $q_1 = u_1$. This is the first piece of the stick. The remaining piece has length $1 - u_1$.

Step 2: For $2 \le i \le k - 1$, if $i - 1$ pieces have been broken off, the length of the remaining stick is $\prod_{j=1}^{i-1}(1 - u_j)$. We simulate

$$u_i \sim \text{Beta}\Big(\alpha_i, \sum_{j=i+1}^{k} \alpha_j\Big)$$

and set $q_i = u_i \prod_{j=1}^{i-1}(1 - u_j)$. The length of the remaining part is $\prod_{j=1}^{i-1}(1 - u_j) - u_i \prod_{j=1}^{i-1}(1 - u_j) = \prod_{j=1}^{i}(1 - u_j)$.

Step 3: The length of the remaining piece is $q_k = \prod_{j=1}^{k-1}(1 - u_j)$.


Note that at each step, if $i - 1$ pieces have been broken off, the remainder of the stick, with length $\prod_{j=1}^{i-1}(1 - u_j)$, will eventually be broken up into $k - i + 1$ pieces with proportions distributed according to a $\text{Dirichlet}(\alpha_i, \alpha_{i+1}, \dots, \alpha_k)$ distribution. The reason why the stick-breaking method generates random vectors from the Dirichlet distribution relies on a property of the Dirichlet called neutrality, whose definition is given below.

definition 19. Neutrality Let $Q = (Q_1, Q_2, \dots, Q_k)$ be a random vector. Then we say that $Q$ is neutral if for each $i = 1, \dots, k$, $Q_i$ is independent of the random vector $\frac{Q_{-i}}{1 - Q_i}$, where $Q_{-i}$ is simply the vector $Q$ with the $i$th component removed and the remaining components scaled by the sum of the remaining elements. This property holds in the case where $Q \sim \text{Dirichlet}$ [Bayesian Nonparametrics].
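A minimal R sketch of the stick-breaking recursion follows; the parameter vector passed at the end is an arbitrary illustrative choice:

# Stick-breaking draw from Dirichlet(alpha)
rdirichlet_stick <- function(alpha) {
  k <- length(alpha)
  q <- numeric(k)
  remaining <- 1                       # length of the unbroken stick
  for (i in 1:(k - 1)) {
    u <- rbeta(1, alpha[i], sum(alpha[(i + 1):k]))
    q[i] <- remaining * u              # break off a u-fraction
    remaining <- remaining * (1 - u)
  }
  q[k] <- remaining                    # last piece is what is left
  q
}
set.seed(3)
rdirichlet_stick(c(2, 3, 5))           # illustrative parameters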

3.5.6.3 Generating the Dirichlet from Gamma random variables

We will argue that generating samples from the Dirichlet distribution using Gamma random variables is more computationally efficient than both the urn-drawing method and the stick-breaking method. This method has two steps, which we explain in more detail and prove in this section:

Step 1: Generate Gamma realizations: for $i = 1, \dots, k$ draw a number $z_i$ from $\Gamma(\alpha_i, 1)$.

Step 2: Normalize them to form a probability mass function: for $i = 1, \dots, k$ set $q_i = \frac{z_i}{\sum_{j=1}^{k} z_j}$.

Then $q$ is a realization of the Dirichlet. The Gamma distribution $\Gamma(\alpha, \beta)$ is defined by the probability density

$$f(x; \alpha, \beta) = \frac{x^{\alpha - 1} e^{-x/\beta}}{\Gamma(\alpha)\, \beta^{\alpha}},$$

where $\alpha$ is the shape and $\beta$ is the scale.
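In R the two steps are one line each; this is a small sketch and the parameters are again arbitrary:

# Dirichlet draw via normalised Gamma(alpha_i, 1) variables
rdirichlet_gamma <- function(alpha) {
  z <- rgamma(length(alpha), shape = alpha, rate = 1)
  z / sum(z)
}
set.seed(4)
rdirichlet_gamma(c(2, 3, 5))   # same illustrative parameters as above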

Here a proof is cited that the above method provides Dirichlet realizations. To prove that the above procedure creates Dirichlet samples from Gamma random variables, we use the change of variables formula to show that the density of $Q$ is the density corresponding to the Dirichlet distribution. First, recall that the original variables are $Z_1, \dots, Z_k$, and the new variables are $Z = \sum_{i=1}^{k} Z_i$ and $Q_1, \dots, Q_{k-1}$. We relate them using the transformation $T$:

$$(z_1, \dots, z_k) = T(z, q_1, \dots, q_{k-1}) = \Big(z q_1, \dots, z q_{k-1}, \; z\big(1 - \textstyle\sum_{i=1}^{k-1} q_i\big)\Big).$$

The Jacobian matrix of this transformation is

$$J(T) = \begin{pmatrix} q_1 & z & \cdots & 0 \\ \vdots & & \ddots & \\ q_{k-1} & 0 & \cdots & z \\ 1 - \sum_{i=1}^{k-1} q_i & -z & \cdots & -z \end{pmatrix},$$

which has determinant $z^{k-1}$ in absolute value (adding the first $k - 1$ rows to the last row reduces the last row to $(1, 0, \dots, 0)$, and expanding along it leaves the diagonal block $z I_{k-1}$). From the change of variables formula,

$$f(z, q_1, \dots, q_{k-1}) = g\big(T(z, q_1, \dots, q_{k-1})\big)\, |\det J(T)|,$$

where

$$g(z_1, z_2, \dots, z_k) = \prod_{i=1}^{k} \frac{z_i^{\alpha_i - 1} e^{-z_i}}{\Gamma(\alpha_i)}$$

is the joint density of the original (independent) Gamma random variables. Substituting into our change of variables formula,

$$f(z, q_1, \dots, q_{k-1}) = \left[\prod_{i=1}^{k-1} \frac{(z q_i)^{\alpha_i - 1} e^{-z q_i}}{\Gamma(\alpha_i)}\right] \frac{\Big(z\big(1 - \sum_{i=1}^{k-1} q_i\big)\Big)^{\alpha_k - 1} e^{-z(1 - \sum_{i=1}^{k-1} q_i)}}{\Gamma(\alpha_k)}\; z^{k-1} = \frac{\prod_{i=1}^{k-1} q_i^{\alpha_i - 1} \big(1 - \sum_{i=1}^{k-1} q_i\big)^{\alpha_k - 1}}{\prod_{i=1}^{k} \Gamma(\alpha_i)}\; z^{\sum_{i=1}^{k} \alpha_i - 1}\, e^{-z}.$$

Integrating over $z$, the marginal distribution of $Q$ is

$$f(q) = f(q_1, \dots, q_{k-1}) = \int_0^{\infty} f(z, q_1, \dots, q_{k-1})\, dz = \frac{\Gamma\big(\sum_{i=1}^{k} \alpha_i\big)}{\prod_{i=1}^{k} \Gamma(\alpha_i)} \prod_{i=1}^{k-1} q_i^{\alpha_i - 1} \Big(1 - \sum_{i=1}^{k-1} q_i\Big)^{\alpha_k - 1}, \tag{3.1}$$

which is the Dirichlet density.

We know that the Dirichlet Process realizations are characterized by atomic distributions of the form above. So we will characterise the distribution of the pairs $(p_k, y_k)$, which will enable us to generate samples from $D$. However, it is difficult to characterize the distribution of the $(p_k, y_k)$ directly, and instead we characterize the distribution of a different set $(\beta_k, y_k)$, where the $\beta_k$ will allow us to generate the $p_k$. Consider the countably infinite random sequence $((\beta_k, y_k))$ taking values in $[0, 1] \times X$. All the $\beta_k$ and $y_k$ are independent; each $\beta_k$ has Beta distribution $\text{Beta}(1, \alpha(X))$ and each $y_k$ is distributed according to the normalized base measure $\bar{\alpha} = \alpha / \alpha(X)$. The Kolmogorov existence theorem maintains that there is a probability space $(\Omega, \mathcal{A}, \mathbb{P})$, where $\mathcal{A}$ is a $\sigma$-algebra on $\Omega$ and $\mathbb{P}$ is a probability measure on $\mathcal{A}$, supporting a random sequence $((\beta_k, y_k))$ with this joint distribution. This is wonderful because, as soon as we connect the $\beta_k$ to the $p_k$, we will be able to generate samples from $D$. Draw a realization from $((\beta_k, y_k))$ as described above. Then, to form a corresponding probability distribution from this sequence, we can use the stick-breaking method. Let $p_1 = \beta_1$; that is, break off a $\beta_1$ portion of a unit-long stick. What remains has length $1 - \beta_1$. Break off a $\beta_2$ fraction of the remaining stick; that is, let $p_2 = \beta_2 (1 - \beta_1)$. What is left after this step is a stick of length $(1 - \beta_1)(1 - \beta_2)$. In the $k$th step, we have a stick of length $\prod_{l=1}^{k-1}(1 - \beta_l)$ remaining, and to produce $p_k$ we break off a $\beta_k$ portion of it, so $p_k = \beta_k \prod_{l=1}^{k-1}(1 - \beta_l)$. The result is a sequence $(p_k)$ with $p_k \ge 0$, and we can use this sequence directly to produce a probability distribution $P$: we can think of $P$ as a stochastic process on $(\Omega, \mathcal{A}, \mathbb{P})$ taking values in the space of probability measures, whose index set is the $\sigma$-algebra $\mathcal{B}$.
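A truncated version of this construction is easy to simulate; the R sketch below returns the weights and atoms of an approximate DP draw, where the truncation level, total mass and base measure are all illustrative assumptions:

# Truncated stick-breaking (Sethuraman) draw from a DP
set.seed(5)
alphaX <- 2                       # total mass alpha(X), hypothetical
K      <- 100                     # truncation level of the infinite sum
beta_k <- rbeta(K, 1, alphaX)     # stick-breaking fractions
p_k    <- beta_k * cumprod(c(1, 1 - beta_k[-K]))  # weights p_k
y_k    <- rnorm(K)                # atoms from a N(0,1) base measure
sum(p_k)                          # close to 1 for large K
# P assigns probability sum(p_k[y_k <= t]) to the set (-Inf, t]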


3.5.6.4 Comparison of the methods

At this point all the aforementioned methods are brought under scrutiny to reveal the most reliable and computationally least intense method. Because the Polya urn scheme depends on an asymptotic convergence result, it is deemed to be the least efficient: it is most probable that the urn-drawing scheme has to be iterated many times to provide good results, and even in the best-case scenario, assuming the procedure is left to run for any finite number of iterations, the resulting probability mass function is not perfectly accurate. Both the stick-breaking and the Gamma-based approaches produce better simulations of the Dirichlet, since the simulated probability mass functions are distributed exactly according to a Dirichlet distribution. However, assuming that it takes the same amount of time to generate a Gamma random variable as it does a Beta random variable, the stick-breaking approach is considered more computationally burdensome. The reasoning for this lies in the fact that at each iteration of the stick-breaking procedure, the additional intermediate step of summing the tail of the parameter vector has to be carried out before drawing from the Beta distribution, followed by multiplying by $\prod_{j=1}^{i-1}(1 - u_j)$. With the Gamma-based approach, all we need to do after drawing the Gamma random variables is divide them all by their sum. The Chinese restaurant process is typically found in the literature of Dirichlet Processes, a standard culinary metaphor that vividly illustrates how the Dirichlet Process operates. In this metaphor, the atom locations $\theta_k$ are referred to as dishes in a restaurant, and observations that are clustered together are viewed as customers sitting at the same table, therefore eating the same dish. In this generative process, customers enter the restaurant one at a time, and they can either sit at an existing table, with probability proportional to the number of previous customers already sitting at that table, or open a new table, with probability proportional to $\alpha$. In the latter case, they sample a new dish from the prior [17].

3.5.7 Dependent Dirichlet Process

Let us imagine the following problem: there are related studies, and a suitable prior to model the data is needed which borrows strength across the studies, in such a fashion that patients under study 1 inform about patients enrolled in another related study 2. Two extreme options would be to pool all patients and assume one common random effects distribution, or to assume distinct random effects distributions with independent priors. The characterization as extremes lies in the fact that in the latter model there is no borrowing of strength between the studies, while in the former there is maximum borrowing of strength. However, the most desirable prior model would be something that lies between these two extremes. In that direction, it is wise to define a hyperparameter for both cases, to ensure that even in the worst-case scenario there would be a trade-off of strength between the studies. In addition, the nature of the learning is much affected by the choice of parametric form of the hyperparameter. Subsequently, a straightforward advance is to consider more complex choices for the hyperparameter. The Dependent Dirichlet Process [28] is a stochastic process that generalizes the Dirichlet Process and that can be applied to clustering groups of data. At each covariate level $s$, we have an infinite mixture of point masses of the form

$$P_s = \sum_{l=1}^{\infty} w_l(s)\, \delta_{\theta_l(s)},$$

with atoms given by independent realizations from a centring stochastic process $P_0$ defined on $S$, and stick-breaking weights defined through independent realizations $z_{l,s} = \{z_l(s) : s \in S\}$, $l = 1, 2, \dots$, from a stochastic process on $S$ with marginals $z_l(s) \sim \text{Beta}(1, \alpha(s))$. The stochastic processes that define the dependent atoms and weights of $P_s$ would typically arise through transformations of Gaussian processes. Ideally, when the centring measure is a Random Probability Measure itself, we could potentially achieve arbitrary learning across the studies. This is exactly the construction of the hierarchical and nested processes, known as special cases of the Dependent Dirichlet Process. Here $m_h = \{m_x(h) : x \in X\}$ are independent realizations from a stochastic process. Keeping the $m_x(h)$ independent across $h$ ensures that each $P_x$ marginally follows a Dirichlet Process prior. The simple, yet powerful idea of the Dependent Dirichlet Process construction is to introduce dependence over $x$, i.e., to link the $P_x$ through dependent locations of the point masses. Implicit in the notation used is the assumption of weights $w_h$ that are common across $x$. However, the proposal in MacEachern (1999) is more general than that, allowing also varying weights. In particular, we focus on the Dependent Dirichlet Process, a generalization of the DP for groups of data, and two of its variants:
• The Hierarchical Dirichlet Process [44] is a special case of the Dependent Dirichlet Process used to cluster groups of data sharing mixture components. It uses a Dirichlet Process for each group of data, with the Dirichlet Processes for all groups sharing a base distribution which is itself a realisation drawn from a Dirichlet Process. Unlike Dirichlet Process models where the prior is a parametric distribution, this allows for greater flexibility:

$$P_j = \sum_{h=1}^{\infty} w_{jh}\, \delta_{\theta_h}, \qquad P_0 = \sum_{h=1}^{\infty} \beta_h\, \delta_{\theta_h}, \qquad \text{where } \theta_h \sim P_{00}.$$

To fathom the structure of a Hierarchical Dirichlet Process, imagine there exist several Dirichlet processes to model $J$ groups; each group is described by a different distribution, and the parameters $\theta_k$ are shared across all groups. There is, however, an odd consequence of assigning a Dirichlet Process prior to the common baseline measure, which is that each $P_j$ utilises identical atoms instead of introducing new atoms, since the baseline measure, and hence all the group-specific random probability measures, are discrete, whereas the weights on these atoms are allowed to deviate across groups (the Franchise of Chinese Restaurants Process). Furthermore, this method allows groups to share statistical strength via the atom weights $\beta_k$ of the base distribution $P_0$. Indeed, the vector of weights for each group can be obtained as $w_j \sim DP(\alpha, \beta)$. More simply, we first draw a base distribution from a Dirichlet Process as $P_0 \sim DP(\gamma, H)$, where $P_0 = \sum_{k=1}^{\infty} q_k\, \delta_{\theta_k}$, and for each group $j$ we draw a distribution from a DP using $P_0$ as the base distribution, $P_j \sim DP(\alpha, P_0)$. To illustrate this structure, suppose that

$$f_j = \int N(y \mid \mu, \tau^{-1})\, dP_j(\mu, \tau), \qquad P_j \sim HDP(\alpha, \gamma, P_{00}).$$

In this case the model introduces a common global dictionary of normal kernels with varying locations and mixture weights. There is a central density $f_0$, characterized by mixing over this dictionary with weights $\beta$ ($P_0 = \sum_{h=1}^{\infty} \beta_h\, \delta_{\theta_h}$, where $\theta_h \sim P_{00}$ parametrises a normal kernel), and the group-specific densities are expressed using the same dictionary but with weights drawn from a stick-breaking process centred on $\beta$. The hyperparameter $\alpha$ controls the variability across groups in the weights, with small $\alpha$ implying that the group-specific densities are Gaussian, with clusters of groups sharing the same mean and precision in the Gaussian kernel; at the other extreme, for large $\alpha$, one obtains pooling across groups, with $f_j = f_0$ and $f_0$ modelled as a Dirichlet Process location-scale mixture of Gaussians. The recycling of the same atoms across groups, while allowing a simple structure accommodating variability in the weights, gives the Hierarchical Dirichlet Process its broad impact. Without such sharing we cannot, in general, relate the samples drawn from one process to samples drawn from another process. For example, suppose $X = [0, 1]$ and $\alpha$ is the Lebesgue measure, and we have two Chinese restaurant processes on $[0, 1]$: the probability of seeing the same sample in both processes is 0. The Chinese restaurant franchise interpretation of the hierarchical Dirichlet process is the following: suppose there is a chain of Chinese restaurants with a central menu (with dishes specified by the darts of $P_0$), and at each restaurant each table is associated with a dish from this menu. The popularity of different dishes can vary (different corresponding atom weights) from restaurant to restaurant, but the probability that two restaurants will order the same dish (two processes share a sample) is non-zero in

$$P_j = \sum_{k=1}^{\infty} w_{jk}\, \delta_{\theta_k},$$

where the atom locations $\theta_k$ do not depend on the group $j$ (a truncated simulation of this atom-sharing mechanism is sketched after this list).
• Nested Dirichlet Process [39]: Unlike the Hierarchical Dirichlet Process, the nested Dirichlet Process does not assume that the grouping structure is known to the system. There is even some suspicion as to what the structure might be, expressed in the prior distribution, so observations which are in a group must be clustered using a second-level mixture model, leading to the nested Dirichlet Process model

$$P_j \sim Q, \qquad Q = \sum_{h=1}^{\infty} \pi_h\, \delta_{P_h^*}, \qquad \pi \sim \text{stick}(\alpha), \qquad P_h^* \sim DP(\beta, P_{00}),$$

so that $Q \sim DP(\alpha, DP(\beta, P_{00}))$.
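As promised above, a short truncated simulation of the hierarchical case is sketched here; the truncation level, the concentration parameters and the normal base measure are all assumptions made only for illustration:

# Truncated HDP sketch: groups share atoms, weights vary per group
set.seed(6)
K      <- 50                          # truncation level
gamma0 <- 1; alpha0 <- 5              # concentration parameters (hypothetical)
theta  <- rnorm(K)                    # shared atoms from base measure P00
b      <- rbeta(K, 1, gamma0)         # global stick-breaking fractions
beta_w <- b * cumprod(c(1, 1 - b[-K]))            # global weights beta
group_weights <- function() {         # group weights ~ Dirichlet(alpha0 * beta)
  z <- rgamma(K, shape = alpha0 * beta_w, rate = 1)
  z / sum(z)
}
w1 <- group_weights(); w2 <- group_weights()
# Both groups place mass on the SAME atoms theta, with different weights:
head(cbind(theta, w1, w2), 5)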


3.5.7.1 Bayesian Non Parametric Analysis of the Data 7

Here the BGPhazard package [34] is employed, which, instead of a Dirichlet Process, introduces a Gamma Process. A Gamma Process [22] is a collection of infinitely many random variables whose underlying distribution is a Gamma. Employing this framework to find the underlying distribution of the survival times, it is certain that the resulting model will be more flexible than the previous ones, since instead of trying to fit the data to a parametric distribution, the data will dictate the form of the survival times through that process.
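The flavour of a gamma prior on a piecewise-constant hazard can be conveyed with a toy conjugate update. The sketch below uses independent gamma priors per interval, which is not the dependence structure actually used by BGPhazard, and the data, grid and hyperparameters are all hypothetical:

# Toy piecewise-constant hazard with independent Gamma(a, b) priors:
# given d_k events and E_k person-time at risk in interval k, the
# posterior of the hazard lambda_k is Gamma(a + d_k, b + E_k)
set.seed(7)
t_obs  <- rexp(125, rate = 0.05)          # hypothetical survival times
status <- rbinom(125, 1, 0.7)             # 1 = event, 0 = censored
grid   <- seq(0, max(t_obs), length.out = 11)   # 10 intervals
a <- 0.01; b <- 0.01                      # vague Gamma(a, b) prior
k   <- findInterval(t_obs, grid, rightmost.closed = TRUE)
d_k <- as.vector(table(factor(k[status == 1], levels = 1:10)))  # events per bin
E_k <- sapply(1:10, function(j)           # person-time at risk in bin j
  sum(pmax(0, pmin(t_obs, grid[j + 1]) - grid[j])))
post_mean <- (a + d_k) / (b + E_k)        # posterior mean hazard per bin
round(post_mean, 4)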
[Figure: trace plot, ergodic mean, histogram and autocorrelation function for λ₁]

Figure 3.16: The autocorrelation plot indicates that the Markov Chain is stationary, meaning that the posterior must have converged to a Normal distribution, as can be seen from the histogram

[Figure: trace plot, ergodic mean, histogram and autocorrelation function for u₁]

Figure 3.17: The autocorrelation plot indicates that the Markov Chain is stationary, meaning that the posterior must have converged to a left-skewed distribution, as can be seen from the histogram
7 An excellent resource which was an inspiration for developing this chapter was [13].


[Figure: trace plot, ergodic mean, histogram and autocorrelation function for c₁]

Figure 3.18: The autocorrelation plot indicates that the Markov Chain is stationary, meaning that the posterior must have converged to a Normal distribution, as can be seen from the histogram

[Figure: trace plot, ergodic mean, histogram and autocorrelation function for ε₁]

Figure 3.19: The autocorrelation plot indicates that the Markov Chain has not converged to a distribution, as can be seen from the histogram


[Figure: estimate of hazard rates, showing the model hazard function, its 95% confidence band and Nelson-Aalen based estimates over time]

Figure 3.20: Initially, the Nelson-Aalen hazard estimates seem to follow the decrease of the model-estimated hazard; however, there was a change in the pattern, after which the Nelson-Aalen estimates started to increase whereas the estimated model hazard continued to decrease

[Figure: estimate of the survival function, showing the model estimate, its 95% confidence bound, the Kaplan-Meier curve and the Kaplan-Meier 95% confidence bound]

Figure 3.21: The estimated survival function under the non-parametric Bayesian approach for the treatment type is similar to the Kaplan-Meier curve. It indicates that the non-parametric Bayesian model suits the data for this study

On the whole, the resulting plot suggests that the survival curve falls within the Kaplan-Meier bounds, indicating that the initial estimate was indeed reliable, whereas the corresponding Nelson-Aalen estimator digresses from the corresponding Bayesian estimate. Most importantly, posterior distributions of the parameters have been acquired, and the identity of each distribution has been revealed with some certainty, as the diagnostic plots have asserted in most cases. However, it remains a tenuous assumption to maintain wholeheartedly that, for example, a given parameter is Normally distributed without running some tests to verify it. It is important to understand, though, that a Normal approximation can perhaps be employed for future inferences in the worst-case scenario.

3.5.7.2 Concluding Remarks

As the end of this chapter looms large, it is somewhat unsatisfactory that the Non-parametric Bayesian model did not provide any new information about the data compared to the parametric and semi-parametric models that have been developed. However, what should be borne in mind is that the illustrated example might not be the optimal field in which to exemplify the virtues of the non-parametric Bayesian paradigm.

Chapter 4

Epidemiology

4.1 Definitions

definition 20. The SIR model [24] is a continuous-time deterministic epidemiological model which obeys the following differential equations:

$$\frac{ds(t)}{dt} = -\frac{\beta}{N}\, s(t)\, i(t), \qquad \frac{di(t)}{dt} = \frac{\beta}{N}\, s(t)\, i(t) - \gamma\, i(t), \qquad \frac{dr(t)}{dt} = \gamma\, i(t),$$

so that

$$\frac{ds(t)}{dt} + \frac{di(t)}{dt} + \frac{dr(t)}{dt} = 0, \qquad \text{where}$$

s(t): susceptible individuals at time t
i(t): infected individuals at time t
r(t): recovered individuals at time t
γ: recovery rate
β: contact rate
N: population (considered constant)

4.2 Overview of the Data

In this chapter of the dissertation, the English Boarding School influenza data is brought under scrutiny. It was first examined by Sobal, Jeff and Loveland, Frank C. [41], and it came from a study of an influenza epidemic outbreak in an English boarding school in 1978, published in the British Medical Journal. The school had a population of 763 boys. Of these, 512 were confined to bed during the epidemic, which lasted from 22 January until 4 February. It seems that one infected boy initiated the epidemic. At the outbreak of the epidemic, none of the boys had previously had influenza, so no resistance to the infection was present. The actual dataset is comprised of two columns: the first column gives the number of infected individuals, while the second corresponds to the week in which the number of infected is observed.

4.3 Deterministic Analysis of Epidemiological Data

The objective of the study of epidemiological data is to fathom how infections are spread, and to discover ways to control the spread of a disease. Some of the most common methods for intervening in the spread of an infectious disease are to either remove susceptible individuals or apply treatment to infected individuals. Here, data on an influenza epidemic drawn from a closed environment (an English boarding school) is considered for analysis. Firstly, a SIR model is implemented to scrutinize the dynamics of the infectious disease, as sketched below.
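A minimal deterministic SIR solver in R using the deSolve package is sketched below; the values of β, γ and the initial state are illustrative choices in the spirit of the boarding-school outbreak, not fitted estimates:

# Deterministic SIR model solved with deSolve::ode
library(deSolve)
sir <- function(t, y, parms) {
  with(as.list(c(y, parms)), {
    dS <- -beta * S * I / N
    dI <-  beta * S * I / N - gamma * I
    dR <-  gamma * I
    list(c(dS, dI, dR))
  })
}
N     <- 763                                   # boarding school population
y0    <- c(S = 762, I = 1, R = 0)              # one initial infective
parms <- c(beta = 1.7, gamma = 0.45, N = N)    # illustrative rates
out   <- ode(y = y0, times = seq(0, 20, 0.1), func = sir, parms = parms)
head(out)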
[Figure: four panels over time — fraction infected (total prevalence and newly infected per day), fraction susceptible with the time at which S = 1/R0 marked, and log(fraction infected) panels showing the initial exponential rise]

Figure 4.1: The number of infected seems to rise sharply, to 0.06 of the total population before the 10th day, and subsequently decreases at a symmetric rate, returning to the initial state. The second plot illustrates the course of the number of susceptibles over the period in which the outbreak took place: the number remains stable before starting to decrease in the same period in which the incidence of infections starts to grow. The last plot describes the same pattern as the infected plot

4.4 Bayesian Approach on Epidemiology

Although deterministic models have predominantly been utilised in Epidemiology, the computational complexity has now been overcome, something that facilitates the implementation of stochastic models. Here amei [31] is introduced, an R package (R Development Core Team 2009) that implements the statistical framework introduced by Merl [30]. The novelty of amei is that statistical inferences are now drawn on the SIR model parameters, conditional on sequential observations of the numbers of susceptible, infected, and recovered individuals in the population. Given the discrete-time approximation presented below, it is possible to do this via straightforward parametric Bayesian methods. In particular, Markov Chain Monte Carlo methods (e.g. Gamerman and Lopes 2006) [18] are employed to learn about the posterior distributions of the transmission rate b, the over-dispersion (or clumpiness) parameter k, the death rate μ and the rate of recovery to the immune class ν, conditioned on the evolution of the epidemic observed so far. Thus it is possible to take Gibbs and Metropolis-Hastings samples for b, k, μ and ν, so long as appropriate hyper-parameters and gamma distributions for b and k can be found to represent our prior beliefs. This framework allows one to respond to an emerging epidemic, considering vaccination strategies.

1 [13] gave rise to the more traditional part of this chapter.

Bayesian methods in epidemiology turn into explicit model components the assumptions that practitioners previously treated as unassailable truths in their mathematical formulations, assumptions which, under the rigid framework of the deterministic paradigm, could never be implemented due to their inherent complexity and stochastic nature. Thus, due to their efficiency and the assumptions they can incorporate, Bayesian methods seem inevitably to be the next step in studying the evolution of an epidemic. Some of the advantages of the method discussed are cited below:

• it incorporates the uncertainty about the parameters;

• it takes into account that in reality the boundaries between exposed, infectious and recovered are fuzzy, because the ability to transmit is not binary (on-off); the health status of the host is therefore irrelevant: it is not important whether the individual is showing symptoms, since an individual who feels perfectly healthy can be excreting large amounts of pathogens;

• it also accommodates complications due to variability in responses between different individuals and variability in pathogen levels over the infectious period.

It is thought that a negative binomial model for the transmission function would represent the stochastic component adequately. The SIR model is then described by the equations below:
$$\frac{dS}{dt} = -k S \log\!\Big(1 + \frac{b I}{k}\Big) \tag{1}$$

$$\frac{dI}{dt} = k S \log\!\Big(1 + \frac{b I}{k}\Big) - (\nu + \mu) I \tag{2}$$

$$\frac{dR}{dt} = \nu I \tag{3}$$

$$\frac{dD}{dt} = \mu I \tag{4}$$

2 amei [31] gave much inspiration in developing this chapter, along with the examples cited in that article.

The model parameters are: the transmission rate b, the over-dispersion (or clumpiness) parameter k, the death rate μ and the rate of recovery to the immune class ν. The negative binomial distribution can be interpreted as a compound stochastic process in which encounters between infected and susceptible individuals occur randomly (i.e. according to a Poisson process), such that the encounter rate varies according to a Gamma distribution with coefficient of variation $k^{-\frac{1}{2}}$. Thus, via k, the negative binomial transmission can account for social interactions and/or network factors in disease transmission, without requiring explicit characterization of the population structure. This SIR formulation leads to a natural discrete-time approximation for the numbers of infections ($\tilde{I}$), recoveries ($\tilde{R}$), and deaths ($\tilde{D}$) arising in the unit time interval from t to t + 1. Assuming the total number of infected individuals, I, is approximately constant and integrating equation (1) over a unit time interval gives

$$S(t + 1) = S(t)\Big(\frac{k}{k + b I(t)}\Big)^{k}, \qquad \tilde{I} = S(t)\Big(1 - \Big(\frac{k}{k + b I(t)}\Big)^{k}\Big).$$

Therefore, if $S(t) = s$ and $I(t) = i$, we may sensibly take the new infections $\tilde{I}$ at time t + 1 to follow

$$\tilde{I} \mid s, i \sim \text{Bin}(s, p_i(i, b, k)), \qquad \text{where } p_i(i, b, k) = 1 - \Big(\frac{k}{k + b i}\Big)^{k},$$

and $\text{Bin}(n, p)$ is the binomial distribution with size n and success probability p. Similarly, by integrating equations (2)-(4), the numbers of recoveries and deaths occurring between time t and t + 1 can be described by

$$\tilde{R} \mid i \sim \text{Bin}(i, p_r), \qquad \tilde{D} \mid i, \tilde{r} \sim \text{Bin}(i - \tilde{r}, p_d), \qquad \text{where } p_r = 1 - e^{-\nu}, \quad p_d = 1 - e^{-\mu},$$

$$S(t + 1) = S(t) - \tilde{I}, \qquad I(t + 1) = I(t) + \tilde{I} - \tilde{R} - \tilde{D}.$$
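These binomial transitions are straightforward to simulate; the following R sketch advances the chain one day at a time, with parameter values that are illustrative rather than posterior estimates:

# Discrete-time stochastic SIR with negative-binomial transmission
set.seed(8)
b <- 0.002; k <- 1; nu <- 0.4; mu <- 0.01   # illustrative parameters
S <- 762; I <- 1; R <- 0; D <- 0
path <- data.frame(t = 0, S = S, I = I, R = R, D = D)
for (t in 1:30) {
  p_i   <- 1 - (k / (k + b * I))^k          # infection probability
  new_I <- rbinom(1, S, p_i)
  new_R <- rbinom(1, I, 1 - exp(-nu))
  new_D <- rbinom(1, I - new_R, 1 - exp(-mu))
  S <- S - new_I
  I <- I + new_I - new_R - new_D
  R <- R + new_R; D <- D + new_D
  path <- rbind(path, data.frame(t = t, S = S, I = I, R = R, D = D))
}
tail(path)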

Thus the Bayesian approach depends on simulation to provide results [23] rather than on evaluating the likelihood function directly. Firstly, a Gamma prior is assigned to both β and γ, so their full conditionals are Gamma again; the unknown infection times are updated using a Metropolis-Hastings step:

$$\pi(\beta, \gamma, i_1, \mathbf{i} \mid \mathbf{r}) \propto \pi(\mathbf{i}, \mathbf{r} \mid \beta, \gamma, i_1)\, \pi(\beta, \gamma, i_1),$$

where the likelihood function, for infection times $\mathbf{i}$ and removal times $\mathbf{r}$ in a population of size N, takes the form

$$\pi(\mathbf{i}, \mathbf{r} \mid \beta, \gamma, i_1) = \left[\prod_{j=2}^{n} \frac{\beta}{N}\, S(i_j^-)\, I(i_j^-)\right] \left[\prod_{j=1}^{n} \gamma\, I(r_j^-)\right] \exp\left\{-\int_{i_1}^{T} \Big(\frac{\beta}{N}\, S(t)\, I(t) + \gamma\, I(t)\Big)\, dt\right\}, \tag{4.1}$$

in which the product over susceptibles contributes the factor $(N - 1)(N - 2) \cdots (N - n + 1)$ and the integral $\int I(t)\, dt$ reduces to the sum $\sum_{k=1}^{n} (r_k - i_k)$.

[Figure: Monte Carlo costs over time]

Figure 4.2: The Monte Carlo costs seem to grow rapidly after the 10th day, peaking at approximately 2000

[Figure: Monte Carlo epidemics — numbers of susceptible, infected and recovered individuals over time]

Figure 4.3: The curve of susceptibles is smoother than the curve produced by the stochastic SIR and does not decrease as sharply; the number of infected does not exhibit the spike it does when there is vaccination, and its peak is much higher than where it is located under a vaccination policy, leading the number of recovered to finally rise to the level of the susceptible individuals at the start


[Figure: four panels over time — Monte Carlo epidemics, Monte Carlo costs, Monte Carlo fraction vaccinated, and Monte Carlo stopping threshold]

Figure 4.4: Here the fluctuations of susceptible, infected and removed individuals as time moves forward under the optimal vaccination policy are shown; it is easy to notice that as the number of susceptibles decreases the number of recovered increases, before remaining stable at 200 individuals, and the number of infected initially increases, peaking at fewer than 200 individuals at 7 days, only to start diminishing again and return to 0. The second plot illustrates the cost under no vaccination policy for the English boarding school data: after a rapid growth at approximately the 10th day, the cost remains stable at well above 1500. The third plot portrays a sudden growth in the total fraction of individuals vaccinated at some time before the tenth day, peaking at 80 per cent of the total population before plunging; the last plot depicts the stopping threshold for vaccination

[Figure: Monte Carlo epidemics and Monte Carlo costs over time]

Figure 4.5: The graph depicts how the numbers of susceptible, infected and recovered individuals change under more realistic circumstances, where there is no perfect information, resulting in estimating the epidemic model parameters as well as finding an optimal management strategy. In particular, the curve of susceptibles exhibits a sharper decline, and the number of infected does not have the peak it has with no vaccination policy, which leads to a reduced number of recovered individuals


[Figure: posterior density plots for the four estimated parameters]

Figure 4.6: Under these circumstances there has been an endeavour to calculate the posterior distributions of the estimated parameters (clockwise from the left: transmission rate, over-dispersion parameter, mortality rate, recovery rate). True parameter values are indicated by a dot, mean posterior values are indicated by an x, and the central 95% region of the distribution is shaded. From the plots, it is apparent that the transmission rate and mortality rate resemble a Normal distribution, although the tails betray such an assumption; on the other hand, the skewed distributions of the rest of the parameters may suggest that a χ² could be a good approximation


Appendices

Appendix A

Code Appendix

A.1 One

res.cox <- coxph(Surv(data$ttr, data$relapse) ~ data$grp)
summary(res.cox)
Call:
coxph(formula = Surv(data$ttr, data$relapse) ~ data$grp)

  n= 125, number of events= 89

                    coef exp(coef) se(coef)   z Pr(>|z|)
data$grppatchOnly 0.6050    1.8313   0.2161 2.8  0.00511

                  exp(coef) exp(-coef) lower .95 upper .95
data$grppatchOnly     1.831     0.5461     1.199     2.797

Concordance= 0.581  (se = 0.029)
Rsquare= 0.062   (max possible= 0.998)
Likelihood ratio test= 7.99  on 1 df,   p=0.004708
Wald test            = 7.84  on 1 df,   p=0.005112
Score (logrank) test = 8.07  on 1 df,   p=0.004489

res.cox <- coxph(Surv(data$ttr, data$relapse) ~ data$gender)
summary(res.cox)
Call:
coxph(formula = Surv(data$ttr, data$relapse) ~ data$gender)

  n= 125, number of events= 89

                   coef exp(coef) se(coef)      z Pr(>|z|)
data$genderMale -0.1944    0.8233   0.2289 -0.849    0.396

                exp(coef) exp(-coef) lower .95 upper .95
data$genderMale    0.8233      1.215    0.5257      1.29

Concordance= 0.506  (se = 0.028)
Rsquare= 0.006   (max possible= 0.998)
Likelihood ratio test= 0.74  on 1 df,   p=0.3905
Wald test            = 0.72  on 1 df,   p=0.3958
Score (logrank) test = 0.72  on 1 df,   p=0.3951

res.cox <- coxph(Surv(data$ttr, data$relapse) ~ data$race)
summary(res.cox)
Call:
coxph(formula = Surv(data$ttr, data$relapse) ~ data$race)

  n= 125, number of events= 89

                     coef exp(coef) se(coef)      z Pr(>|z|)
data$racehispanic -0.3858    0.6799   0.4853 -0.795    0.427
data$raceother    -0.9921    0.3708   1.0178 -0.975    0.330
data$racewhite    -0.2513    0.7778   0.2305 -1.090    0.276

                  exp(coef) exp(-coef) lower .95 upper .95
data$racehispanic    0.6799      1.471   0.26264     1.760
data$raceother       0.3708      2.697   0.05044     2.726
data$racewhite       0.7778      1.286   0.49507     1.222

Concordance= 0.54  (se = 0.029)
Rsquare= 0.018   (max possible= 0.998)
Likelihood ratio test= 2.23  on 3 df,   p=0.5252
Wald test            = 2.07  on 3 df,   p=0.5574
Score (logrank) test = 2.12  on 3 df,   p=0.547

res.cox <- coxph(Surv(data$ttr, data$relapse) ~ data$employment)
summary(res.cox)
Call:
coxph(formula = Surv(data$ttr, data$relapse) ~ data$employment)

  n= 125, number of events= 89

                       coef exp(coef) se(coef)     z Pr(>|z|)
data$employmentother 0.1982    1.2192   0.2371 0.836    0.403
data$employmentpt    0.4500    1.5683   0.3229 1.394    0.163

                     exp(coef) exp(-coef) lower .95 upper .95
data$employmentother     1.219     0.8202    0.7661     1.940
data$employmentpt        1.568     0.6376    0.8328     2.953

Concordance= 0.541  (se = 0.03)
Rsquare= 0.016   (max possible= 0.998)
Likelihood ratio test= 2.06  on 2 df,   p=0.357
Wald test            = 2.17  on 2 df,   p=0.3376
Score (logrank) test = 2.2  on 2 df,   p=0.3333

res.cox <- coxph(Surv(data$ttr, data$relapse) ~ data$yearsSmoking)
summary(res.cox)
Call:
coxph(formula = Surv(data$ttr, data$relapse) ~ data$yearsSmoking)

  n= 125, number of events= 89

                       coef exp(coef)  se(coef)      z Pr(>|z|)
data$yearsSmoking -0.016237  0.983894  0.009081 -1.788   0.0738 .

                  exp(coef) exp(-coef) lower .95 upper .95
data$yearsSmoking    0.9839      1.016    0.9665     1.002

Concordance= 0.548  (se = 0.034)
Rsquare= 0.026   (max possible= 0.998)
Likelihood ratio test= 3.23  on 1 df,   p=0.07214
Wald test            = 3.2  on 1 df,   p=0.07377
Score (logrank) test = 3.21  on 1 df,   p=0.07308

res.cox <- coxph(Surv(data$ttr, data$relapse) ~ data$levelSmoking)
summary(res.cox)
Call:
coxph(formula = Surv(data$ttr, data$relapse) ~ data$levelSmoking)

  n= 125, number of events= 89

                         coef exp(coef) se(coef)     z Pr(>|z|)
data$levelSmokinglight 0.0385    1.0393   0.2308 0.167    0.868

                       exp(coef) exp(-coef) lower .95 upper .95
data$levelSmokinglight     1.039     0.9622    0.6611     1.634

Concordance= 0.494  (se = 0.027)
Rsquare= 0   (max possible= 0.998)
Likelihood ratio test= 0.03  on 1 df,   p=0.8679
Wald test            = 0.03  on 1 df,   p=0.8675
Score (logrank) test = 0.03  on 1 df,   p=0.8675

res.cox <- coxph(Surv(data$ttr, data$relapse) ~ data$priorAttempts)
summary(res.cox)
Call:
coxph(formula = Surv(data$ttr, data$relapse) ~ data$priorAttempts)

  n= 125, number of events= 89

                         coef exp(coef)  se(coef)      z Pr(>|z|)
data$priorAttempts -6.984e-05 9.999e-01 1.100e-03 -0.063    0.949

                   exp(coef) exp(-coef) lower .95 upper .95
data$priorAttempts    0.9999          1    0.9978     1.002

Concordance= 0.514  (se = 0.034)
Rsquare= 0   (max possible= 0.998)
Likelihood ratio test= 0  on 1 df,   p=0.9488
Wald test            = 0  on 1 df,   p=0.9494
Score (logrank) test = 0  on 1 df,   p=0.9494

res.cox <- coxph(Surv(data$ttr, data$relapse) ~ data$longestNoSmoke)
summary(res.cox)
Call:
coxph(formula = Surv(data$ttr, data$relapse) ~ data$longestNoSmoke)

  n= 125, number of events= 89

                          coef exp(coef)  se(coef)      z Pr(>|z|)
data$longestNoSmoke -0.0001938 0.9998062 0.0001220 -1.588    0.112

                    exp(coef) exp(-coef) lower .95 upper .95
data$longestNoSmoke    0.9998          1    0.9996         1

Concordance= 0.577  (se = 0.034)
Rsquare= 0.024   (max possible= 0.998)
Likelihood ratio test= 2.97  on 1 df,   p=0.08459
Wald test            = 2.52  on 1 df,   p=0.1122
Score (logrank) test = 2.57  on 1 df,   p=0.1091

res.cox <- coxph(Surv(data$ttr, data$relapse) ~ data$ageGroup2)
summary(res.cox)
Call:
coxph(formula = Surv(data$ttr, data$relapse) ~ data$ageGroup2)

  n= 125, number of events= 89

                     coef exp(coef) se(coef)      z Pr(>|z|)
data$ageGroup250+ -0.7159    0.4887   0.2207 -3.244  0.00118

                  exp(coef) exp(-coef) lower .95 upper .95
data$ageGroup250+    0.4887      2.046    0.3171    0.7532

Concordance= 0.583  (se = 0.029)
Rsquare= 0.084   (max possible= 0.998)
Likelihood ratio test= 10.97  on 1 df,   p=0.0009284
Wald test            = 10.52  on 1 df,   p=0.001178
Score (logrank) test = 10.97  on 1 df,   p=0.0009277

res.cox <- coxph(Surv(data$ttr, data$relapse) ~ data$ageGroup4)
summary(res.cox)
Call:
coxph(formula = Surv(data$ttr, data$relapse) ~ data$ageGroup4)

  n= 125, number of events= 89

                       coef exp(coef) se(coef)      z Pr(>|z|)
data$ageGroup435-49  0.0293    1.0297   0.3093  0.095   0.9245
data$ageGroup450-64 -0.7914    0.4532   0.3361 -2.355   0.0185
data$ageGroup465+   -0.3173    0.7281   0.4435 -0.715   0.4744

                    exp(coef) exp(-coef) lower .95 upper .95
data$ageGroup435-49    1.0297     0.9711    0.5616    1.8880
data$ageGroup450-64    0.4532     2.2066    0.2345    0.8757
data$ageGroup465+      0.7281     1.3734    0.3053    1.7367

Concordance= 0.593  (se = 0.032)
Rsquare= 0.093   (max possible= 0.998)
Likelihood ratio test= 12.22  on 3 df,   p=0.006664
Wald test            = 11.36  on 3 df,   p=0.009937
Score (logrank) test = 11.93  on 3 df,   p=0.007628

A.2 Two

result.step <- step(modelAll.coxph, scope=list(data$grp+data$age+data$longestNoSmoke),

Start:  AIC=762.87
Surv(data$ttr, data$relapse) ~ data$grp + data$age + data$longestNoSmoke

                      Df    AIC
- data$longestNoSmoke  1 762.48
<none>                    762.87
- data$age             1 766.07
- data$grp             1 767.42

Step:  AIC=762.48
Surv(data$ttr, data$relapse) ~ data$grp + data$age

           Df    AIC
<none>        762.48
- data$age  1 766.32
- data$grp  1 767.25

A.3 Three

result.step <- step(modelAll.coxph, scope=list(lower=data$grp+data$age+data$grp, upper=

Start:  AIC=762.87
Surv(data$ttr, data$relapse) ~ data$grp + data$age + data$longestNoSmoke

                      Df    AIC
- data$longestNoSmoke  1 762.48
<none>                    762.87
- data$age             1 766.07
- data$grp             1 767.42

Step:  AIC=762.48
Surv(data$ttr, data$relapse) ~ data$grp + data$age

           Df    AIC
<none>        762.48
- data$age  1 766.32
- data$grp  1 767.25

A.4 Four

res.cox <- coxph(Surv(data$ttr, data$relapse) ~ data$grp + data$age + (data$age * data$grp))
summary(res.cox)
Call:
coxph(formula = Surv(data$ttr, data$relapse) ~ data$grp + data$age +
    (data$age * data$grp))

  n= 125, number of events= 89

                               coef exp(coef) se(coef)      z Pr(>|z|)
data$grppatchOnly          -0.07976   0.92334  0.93332 -0.085   0.9319
data$age                   -0.03052   0.96994  0.01441 -2.118   0.0341
data$grppatchOnly:data$age  0.01349   1.01359  0.01924  0.701   0.4831

                           exp(coef) exp(-coef) lower .95 upper .95
data$grppatchOnly             0.9233     1.0830    0.1482    5.7519
data$age                      0.9699     1.0310    0.9429    0.9977
data$grppatchOnly:data$age    1.0136     0.9866    0.9761    1.0525

Concordance= 0.625  (se = 0.034)
Rsquare= 0.108   (max possible= 0.998)
Likelihood ratio test= 14.32  on 3 df,   p=0.002505
Wald test            = 12.98  on 3 df,   p=0.004674
Score (logrank) test = 13.74  on 3 df,   p=0.003285

Bibliography
[1]

In: ().

[2] Steve Bennett. "Analysis of survival data by the proportional odds model". In: Statistics in Medicine 2.2 (1983), pp. 273–277.

[3] David Blackwell and James B. MacQueen. "Ferguson distributions via Pólya urn schemes". In: The Annals of Statistics (1973), pp. 353–355.

[4] John W. Boag. "Maximum Likelihood Estimates of the Proportion of Patients Cured by Cancer Therapy". In: Journal of the Royal Statistical Society. Series B (Methodological) 11.1 (1949), pp. 15–53. issn: 0035-9246. url: http://www.jstor.org/stable/2983694.

[5] Norman E. Breslow. "Analysis of survival data under the proportional hazards model". In: International Statistical Review / Revue Internationale de Statistique (1975), pp. 45–57.

[6] Göran Broström. Event history analysis with R. CRC Press, 2012.

[7] Ronald Christensen and Wesley Johnson. "Modelling accelerated failure time with a Dirichlet process". In: Biometrika 75.4 (1988), pp. 693–704. doi: 10.1093/biomet/75.4.693. url: http://biomet.oxfordjournals.org/content/75/4/693.abstract.

[8] David R. Cox and E. Joyce Snell. "A general definition of residuals". In: Journal of the Royal Statistical Society. Series B (Methodological) (1968), pp. 248–275.

[9] David Roxbee Cox and David Oakes. Analysis of survival data. Vol. 21. CRC Press, 1984.

[10] David R. Cox. "Regression models and life tables (with discussion)". In: Journal of the Royal Statistical Society 34 (1972), pp. 187–220.

[11] Hani Doss and Fred W. Huffer. "Monte Carlo methods for Bayesian analysis of survival data using mixtures of Dirichlet process priors". In: Journal of Computational and Graphical Statistics 12.2 (2003), pp. 282–307.

[12] Hani Doss and B. Narasimhan. "Dynamic Display of Changing Posterior in Bayesian Survival Analysis". In: Practical Nonparametric and Semiparametric Bayesian Statistics. Ed. by Dipak Dey, Peter Müller, and Debajyoti Sinha. New York, NY: Springer New York, 1998, pp. 63–87. isbn: 978-1-4612-1732-9. doi: 10.1007/978-1-4612-1732-9_4.

[13] Epidemic modelling with compartmental models using R. 2012. url: http://sherrytowers.com/2012/12/11/simple-epidemic-modelling-with-an-sir-model/ (visited on 07/16/2016).

[14] Michael D. Escobar and Mike West. "Bayesian density estimation and inference using mixtures". In: Journal of the American Statistical Association 90.430 (1995), pp. 577–588.

[15] Thomas S. Ferguson. "A Bayesian Analysis of Some Nonparametric Problems". In: The Annals of Statistics 1.2 (1973), pp. 209–230. issn: 0090-5364. url: http://www.jstor.org/stable/2958008.

[16] Thomas S. Ferguson and Eswar G. Phadia. "Bayesian Nonparametric Estimation Based on Censored Data". In: The Annals of Statistics 7.1 (1979), pp. 163–186. doi: 10.1214/aos/1176344562.

[17] Bela A. Frigyik, Amol Kapila, and Maya R. Gupta. Introduction to the Dirichlet distribution and related processes. Tech. rep. UWEETR-2010-0006. Department of Electrical Engineering, University of Washington, 2010.

[18] Dani Gamerman and Hedibert F. Lopes. Markov chain Monte Carlo: stochastic simulation for Bayesian inference. CRC Press, 2006.

[19] Andrew Gelman et al. Bayesian data analysis. Vol. 2. Boca Raton, FL: Chapman & Hall/CRC, 2014.

[20] Zoubin Ghahramani. "Nonparametric Bayesian methods". In: Tutorial presentation at the UAI Conference. 2005.

[21] Peter D. Hoff. A first course in Bayesian statistical methods. Springer Science & Business Media, 2009.

[22] Joseph G. Ibrahim, Ming-Hui Chen, and Debajyoti Sinha. Bayesian survival analysis. Wiley Online Library, 2005.

[23] John D. Kalbfleisch and Ross L. Prentice. The statistical analysis of failure time data. Vol. 360. John Wiley & Sons, 2011.

[24] William O. Kermack and Anderson G. McKendrick. "A contribution to the mathematical theory of epidemics". In: Proceedings of the Royal Society of London A: Mathematical, Physical and Engineering Sciences. Vol. 115. 772. The Royal Society, 1927, pp. 700–721.

[25] W.O. Kermack and A.G. McKendrick. "Contributions to the mathematical theory of epidemics II. The problem of endemicity". In: Bulletin of Mathematical Biology 53.1 (1991), pp. 57–87. issn: 0092-8240. doi: 10.1016/S0092-8240(05)80041-2.

[26] P. Kumaraswamy. "A generalized probability density function for double-bounded random processes". In: Journal of Hydrology 46 (1980). doi: 10.1016/0022-1694(80)90036-0.

[27] Lynn Kuo et al. "Bayesian Computations in Survival Models via the Gibbs Sampler". In: Survival Analysis: State of the Art. Ed. by John P. Klein and Prem K. Goel. Dordrecht: Springer Netherlands, 1992, pp. 11–24. isbn: 978-94-015-7983-4. doi: 10.1007/978-94-015-7983-4_2.

[28] Steven N. MacEachern. "Dependent nonparametric processes". In: ASA Proceedings of the Section on Bayesian Statistical Science. Alexandria, VA: American Statistical Association, 1999, pp. 50–55.

[29] Nathan Mantel. "Evaluation of survival data and two new rank order statistics arising in its consideration". In: Cancer Chemotherapy Reports, Part 1 50.3 (1966), pp. 163–170.

[30] Daniel Merl and Robert B. Gramacy. Package "amei". R package documentation, 2013.

[31] Daniel Merl et al. "amei: an R package for the Adaptive Management of Epidemiological Interventions". In: Journal of Statistical Software 36.6 (2010), pp. 1–32.

[32] Dirk F. Moore. Applied Survival Analysis Using R. Springer, 2016.

[33] Govind S. Mudholkar, Deo Kumar Srivastava, and Georgia D. Kollia. "A generalization of the Weibull distribution with application to the analysis of survival data". In: Journal of the American Statistical Association 91.436 (1996), pp. 1575–1583.

[34] L.E. Nieto-Barajas and J.A. García Bueno. Package "BGPhazard". R package documentation, 2016.

[35] Richard Peto and Julian Peto. "Asymptotically efficient rank invariant test procedures". In: Journal of the Royal Statistical Society. Series A (General) (1972), pp. 185–207.

[36] M.C. Pike. "A Method of Analysis of a Certain Class of Experiments in Carcinogenesis". In: Biometrics 22.1 (1966), pp. 142–161. issn: 0006-341X. url: http://www.jstor.org/stable/2528221.

[37] Kamta Rai, V. Susarla, and John Van Ryzin. "Shrinkage estimation in nonparametric Bayesian survival analysis: a simulation study". In: Communications in Statistics - Simulation and Computation 9.3 (1980), pp. 271–298. doi: 10.1080/03610918008812154.

[38] Christian Robert and George Casella. Monte Carlo statistical methods. Springer Science & Business Media, 2013.

[39] Abel Rodriguez, David B. Dunson, and Alan E. Gelfand. "The nested Dirichlet process". In: Journal of the American Statistical Association (2012).

[40] David Schoenfeld. "Partial residuals for the proportional hazards regression model". In: Biometrika 69.1 (1982), pp. 239–241.

[41] Jeff Sobal and Frank C. Loveland. "Infectious disease in a total institution: a study of the influenza epidemic of 1978 on a college campus". In: Public Health Reports 97.1 (1982), p. 66.

[42] Michael B. Steinberg et al. "Triple-combination pharmacotherapy for medically ill smokers: a randomized trial". In: Annals of Internal Medicine 150.7 (2009), pp. 447–454.

[43] Yee Whye Teh. Bayesian Nonparametrics 1,2,3 - Yee Whye Teh - MLSS 2013 Tübingen. YouTube, 2013. url: https://www.youtube.com/watch?v=dNeW5zoNJ7g.

[44] Yee Whye Teh. "Dirichlet process". In: Encyclopedia of Machine Learning. Springer, 2011, pp. 280–287.

[45] Yee Whye Teh et al. "Hierarchical Dirichlet processes". In: Journal of the American Statistical Association (2012).

[46] Terry M. Therneau, Patricia M. Grambsch, and Thomas R. Fleming. "Martingale-based residuals for survival models". In: Biometrika 77.1 (1990), pp. 147–160.
