Of the total sample (n=1066), pathsize has 997 valid values (93.5%) and 69 missing values
(6.5%). The cases with missing values are still needed for inference to the population, so we
decide to recode the missing values using the series mean.
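The percentages and the series-mean replacement can be sketched in Python as follows. The list of tumour sizes is made up for illustration, and `impute_series_mean` is a hypothetical helper, not the statistical package's own procedure.

```python
from statistics import mean

# Counts reported above for pathsize.
n_total, n_valid, n_missing = 1066, 997, 69
pct_valid = round(100 * n_valid / n_total, 1)      # 93.5
pct_missing = round(100 * n_missing / n_total, 1)  # 6.5


def impute_series_mean(values):
    """Replace None entries with the mean of the observed (non-missing) entries."""
    observed = [v for v in values if v is not None]
    fill = mean(observed)
    return [fill if v is None else v for v in values]


# Made-up tumour sizes; None marks a missing value.
sizes = [1.0, None, 2.0, 3.0, None]
imputed = impute_series_mean(sizes)  # both missing entries become 2.0
```

Replacing missing values with the mean keeps all 1066 cases but leaves the mean unchanged and shrinks the spread of the variable, which is worth keeping in mind when comparing results before and after recoding.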
a) Type of distribution for duration of having cancer
Based on visual inspection, the histogram fails to show a normal curve and the boxplot shows an
asymmetrical tail, which indicates a non-normal distribution. The mean and the median also
differ. In the Shapiro-Wilk normality test, the significance value is less
than 0.05. Thus, the duration of having cancer is not normally distributed.
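The mean-versus-median comparison can be illustrated with a short Python sketch. The durations below are made up, and `sample_skewness` is a hypothetical helper. It does not implement the Shapiro-Wilk test; it only shows how a long tail pulls the mean away from the median.

```python
from statistics import mean, median


def sample_skewness(values):
    """Moment-based skewness: 0 for symmetric data, > 0 for a long right tail."""
    m = mean(values)
    n = len(values)
    m2 = sum((v - m) ** 2 for v in values) / n
    m3 = sum((v - m) ** 3 for v in values) / n
    return m3 / m2 ** 1.5


# Made-up durations (arbitrary units) with a long right tail.
durations = [1, 2, 2, 3, 3, 4, 5, 8, 15, 40]
avg = mean(durations)    # 8.3
mid = median(durations)  # 3.5
skew = sample_skewness(durations)  # positive for this right-skewed sample
```

A mean well above the median together with positive skewness points to a right-skewed distribution, consistent with the kind of asymmetry seen in a boxplot.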
b) Both tumour size and duration of having cancer are not normally distributed. For the
correlation test, we will therefore use the non-parametric Spearman rank correlation test.
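As a minimal sketch of what the Spearman test computes, the coefficient is the Pearson correlation of the ranks, with tied values sharing their average rank. The function names and the five data pairs below are assumptions for illustration only; the p-value is not computed here.

```python
from statistics import mean


def average_ranks(values):
    """Rank values from 1 to n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman_rho(x, y):
    """Spearman coefficient: Pearson correlation computed on the ranks."""
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5


# Made-up values: longer duration tends to go with smaller tumour size.
duration = [12, 30, 45, 60, 75]
size = [3.1, 2.0, 2.8, 1.5, 1.9]
rho = spearman_rho(duration, size)  # -0.8 for this toy sample
```

Because only the ranks are used, the coefficient is not affected by the skewed shape of either variable, which is why it suits non-normal data.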
Interpretation:
The non-parametric correlation coefficient (r) is -0.081, with p-value p=0.011. Since p<0.05,
we reject the null hypothesis. There is a significant correlation between duration of having
cancer and tumour size. The correlation is negative but very weak (r= -0.081).
(ii) New tumour size variable
Step 1. Stating hypothesis
Ho: There is no correlation between duration of having cancer and tumour size.
Ha: There is a correlation between duration of having cancer and tumour size.
Set α = 0.05
Step 2: Scatter plot
Interpretation:
The non-parametric correlation coefficient (r) is -0.075, with p-value p=0.011. Since p<0.05,
we reject the null hypothesis. There is a significant correlation between duration of having
cancer and tumour size. The correlation is negative but very weak (r= -0.075).
The conclusion is the same in both cases (a significant, negative, very weak correlation), even
though the r values differ slightly.
The Kruskal-Wallis H test showed a statistically significant difference in duration of having
cancer between the estrogen receptor status groups, χ²(2) = 26.535, p<0.001 (that is, p<0.05).
We reject the null hypothesis and the result is significant. There is a significant
difference in median duration of having cancer between estrogen receptor status groups.
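The H statistic and its p-value can be sketched in Python as follows. This is an illustrative version under stated assumptions: `kruskal_wallis_h` is a hypothetical helper that uses average ranks for ties but omits the tie correction a statistical package would normally apply, and the three toy groups are made up. The p-value helper relies on the fact that a chi-squared distribution with 2 degrees of freedom has the upper-tail probability exp(-x/2), so it applies only to the three-group case.

```python
import math


def kruskal_wallis_h(groups):
    """Kruskal-Wallis H (no tie correction) for a list of samples."""
    pooled = sorted(v for g in groups for v in g)
    n = len(pooled)
    # Mean 1-based rank for each distinct value, so tied values share a rank.
    positions = {}
    for pos, v in enumerate(pooled, start=1):
        positions.setdefault(v, []).append(pos)
    rank = {v: sum(p) / len(p) for v, p in positions.items()}
    total = sum(sum(rank[v] for v in g) ** 2 / len(g) for g in groups)
    return 12 / (n * (n + 1)) * total - 3 * (n + 1)


def chi2_upper_tail_df2(x):
    """P(X > x) for a chi-squared variable with exactly 2 degrees of freedom."""
    return math.exp(-x / 2)


# Made-up, fully separated groups.
h_toy = kruskal_wallis_h([[1, 2, 3], [4, 5, 6], [7, 8, 9]])  # 7.2
p_toy = chi2_upper_tail_df2(h_toy)

# Upper-tail probability for the statistic reported above (26.535, df = 2).
p_reported = chi2_upper_tail_df2(26.535)
```

For the reported statistic, `p_reported` is on the order of 1e-6, which agrees with the stated p<0.001.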
(e) The boxplot helps us to judge whether the medians differ across groups. With
post-hoc analysis, we are able to compare the medians for each pair of groups. Based on the
boxplot, the medians of time differ across the categories of estrogen receptor status. With
P = 0.002 (P<0.05), we reject the null hypothesis.
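The pairwise structure of a post-hoc analysis can be sketched as below. Everything here is an assumption for illustration: the category labels and durations are made up, and the Bonferroni adjustment is one common way to control for multiple comparisons, not necessarily the adjustment used for the result above.

```python
from itertools import combinations
from statistics import median

# Made-up durations for three hypothetical receptor-status categories.
groups = {
    "negative": [5, 7, 9, 12],
    "positive": [10, 14, 18, 25],
    "unknown": [6, 8, 11, 30],
}

alpha = 0.05
pairs = list(combinations(groups, 2))  # every pair of categories: 3 comparisons
bonferroni_alpha = alpha / len(pairs)  # stricter per-comparison threshold

for a, b in pairs:
    print(a, "vs", b, "medians:", median(groups[a]), median(groups[b]))
```

With three categories there are three pairwise comparisons, so under this assumed adjustment each pairwise test would be judged against 0.05/3 rather than 0.05.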
f) Researchers sometimes convert a continuous variable into a categorical variable before
starting the analysis. Reasons include:
1. To reduce the influence of outliers, especially in data that are not normally distributed.
2. When the continuous variable is not significant, it may become significant once the
variable is categorized.
3. To answer research questions about estimating or determining a ratio based on
frequencies. Categorizing allows the researcher to use a non-parametric test such as the
Chi-squared test to assess the association between variables.
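The third reason can be illustrated with a minimal Python sketch: a continuous value is cut into two categories, and a Chi-squared test is then run on a 2x2 table of counts. The cut-off, the category names, and the counts are all hypothetical. The p-value uses the identity that, for 1 degree of freedom, the upper-tail probability equals erfc(sqrt(x/2)).

```python
import math


def categorize(value, cutoff):
    """Turn a continuous value into a two-level category (hypothetical cut-off)."""
    return "large" if value >= cutoff else "small"


def chi_square_2x2(table):
    """Pearson chi-squared statistic and p-value (df = 1) for a 2x2 table of counts."""
    (a, b), (c, d) = table
    n = a + b + c + d
    rows = (a + b, c + d)
    cols = (a + c, b + d)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = rows[i] * cols[j] / n
            stat += (observed - expected) ** 2 / expected
    p_value = math.erfc(math.sqrt(stat / 2))
    return stat, p_value


# Made-up tumour sizes recoded with an assumed cut-off of 2.0.
labels = [categorize(s, 2.0) for s in [0.8, 2.4, 1.5, 3.1]]

# Made-up counts: rows = small/large tumour, columns = two status categories.
stat, p_value = chi_square_2x2([[30, 20], [15, 35]])
```

The trade-off is that categorizing discards information within each category, so the choice of cut-off should be justified before the analysis rather than tuned until a result becomes significant.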