
GROUP 5

Of the total sample (n = 1066), 997 cases (93.5%) have a valid value for pathsize and 69 cases
(6.5%) are missing. The cases with missing values are still needed for inference to the
population, so we decided to replace the missing values with the series mean.
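
The series-mean replacement described above can be sketched in a few lines. This is a minimal illustration in Python, not the procedure the group actually ran; the function name `impute_series_mean` and the toy values are hypothetical.

```python
from statistics import mean

def impute_series_mean(values):
    """Replace missing entries (None) with the mean of the observed entries."""
    observed = [v for v in values if v is not None]
    fill = mean(observed)
    return [fill if v is None else v for v in values]

# Toy data, not the study's pathsize values
pathsize = [1.5, None, 2.0, 2.5, None]
imputed = impute_series_mean(pathsize)
print(imputed)
```

Replacing missing values with the mean keeps the sample mean unchanged but reduces the variance, which is worth keeping in mind when interpreting later tests.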
a) Type of distribution for duration of having cancer
Based on visual inspection, the histogram fails to show a normal curve and the boxplot shows an
asymmetrical tail, which indicates a non-normal distribution. The mean and median also differ.
In the Shapiro-Wilk normality test, the significance value is less than 0.05. Thus, the duration
of having cancer is not normally distributed.
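
Two of the checks above, the mean versus median comparison and the direction of the skew, can be sketched as follows. This is only an illustration on toy numbers: the helper `sample_skewness` is a hypothetical name, and the Shapiro-Wilk test itself is not reproduced here.

```python
from statistics import mean, median

def sample_skewness(values):
    """Moment-based skewness m3 / m2**1.5 (positive means a longer right tail)."""
    n = len(values)
    m = mean(values)
    m2 = sum((v - m) ** 2 for v in values) / n
    m3 = sum((v - m) ** 3 for v in values) / n
    return m3 / m2 ** 1.5

# Toy durations, not the study's data
duration = [1, 2, 2, 3, 3, 4, 20]
summary = {
    "mean": mean(duration),
    "median": median(duration),
    "skewness": sample_skewness(duration),
}
print(summary)
```

A mean well above the median together with a positive skewness points to a right-skewed distribution, which supports choosing a non-parametric test.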

b) Both tumour size and duration of having cancer are not normally distributed. For the
correlation test, we therefore use the non-parametric Spearman rank correlation test.

(i) Tumour size without any changes to the data


Step 1: Stating the hypotheses
Ho: There is no correlation between duration of having cancer and tumour size.
Ha: There is a correlation between duration of having cancer and tumour size.
Set α = 0.05
Step 2: Scatter plot

The relationship between the two variables does not appear to be linear.


Step 3: Checking normality
Neither variable is normally distributed, so we use the Spearman rank correlation test.

Tumour size is not normally distributed and is positively skewed.


Step 4: Perform Spearman rank correlation test.

Interpretation:
The Spearman rank correlation coefficient (r) is -0.081 with p = 0.011. Because p < 0.05, we
reject the null hypothesis: there is a significant correlation between duration of having
cancer and tumour size. The correlation is negative but very weak (r = -0.081).
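
To show what the coefficient measures, here is a minimal sketch of the Spearman rank correlation: rank both variables (tied values share the average rank), then take the Pearson correlation of the ranks. The function names and toy data are hypothetical, and the p-value is not computed here.

```python
from statistics import mean

def average_ranks(values):
    """Rank values from 1..n, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average of the 1-based positions i+1..j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def spearman_r(x, y):
    """Spearman's coefficient: Pearson correlation of the two rank vectors."""
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (sx * sy)

# Toy data, not the study's values
duration = [10, 20, 30, 40, 50]
size = [5.0, 4.0, 4.5, 2.0, 1.0]
r = spearman_r(duration, size)
print(round(r, 3))
```

Because only the ranks enter the calculation, the coefficient does not require normality or a linear relationship, only a monotonic one.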
(ii) New tumour size variable
Step 1: Stating the hypotheses
Ho: There is no correlation between duration of having cancer and tumour size.
Ha: There is a correlation between duration of having cancer and tumour size.
Set α = 0.05
Step 2: Scatter plot

The relationship between the two variables does not appear to be linear.

Step 3: Checking normality


Neither variable is normally distributed, so we use the Spearman rank correlation test.

Tumour size is not normally distributed and is positively skewed.

Step 4: Perform Spearman rank correlation test.

Interpretation:
The Spearman rank correlation coefficient (r) is -0.075 with p = 0.011. Because p < 0.05, we
reject the null hypothesis: there is a significant correlation between duration of having
cancer and tumour size. The correlation is negative but very weak (r = -0.075).
The conclusion is the same in both cases (a significant, negative, very weak correlation), even
though the r values differ slightly.

c) Difference in duration of having cancer among estrogen receptor status groups


The question asks for a comparison of duration of having cancer between the groups.
Step 1: State the hypotheses
Ho: There is no difference in median duration of having cancer between estrogen receptor
status groups.
Ha: There is a difference in median duration of having cancer between estrogen receptor
status groups.
Set α = 0.05
Step 2: Normality test
Estrogen receptor status (er) is a categorical variable with 3 groups in the analysis. The codes
are: 0 = negative, 1 = positive, 2 = undefined, 9999 = undetermined.
The duration of having cancer is a continuous variable but is not normally distributed.
To test for a difference in duration between the groups, we perform a non-parametric test, the
Kruskal-Wallis test.
Step 3: Perform the Kruskal-Wallis test

The Kruskal-Wallis H test showed a statistically significant difference in duration of having
cancer between the estrogen receptor status groups, χ²(2) = 26.535, p < 0.001. Because p < 0.05,
we reject the null hypothesis: there is a significant difference in median duration of having
cancer between estrogen receptor status groups.
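
The H statistic reported above can be sketched as follows. This is a minimal illustration with hypothetical names and toy data, not the group's SPSS output. It pools and ranks all observations, compares the rank sums per group, and applies the usual correction for ties. For three groups, H is referred to a chi-square distribution with 2 degrees of freedom, whose upper-tail probability is exactly exp(-H / 2).

```python
import math

def kruskal_wallis_h(groups):
    """Kruskal-Wallis H statistic with the standard correction for ties."""
    pooled = sorted(v for g in groups for v in g)
    n = len(pooled)
    # Average rank for each distinct value in the pooled sample
    rank_of, tie_term, i = {}, 0, 0
    while i < n:
        j = i
        while j + 1 < n and pooled[j + 1] == pooled[i]:
            j += 1
        t = j - i + 1                      # size of this tie block
        rank_of[pooled[i]] = (i + j) / 2 + 1
        tie_term += t ** 3 - t
        i = j + 1
    h = 12 / (n * (n + 1)) * sum(
        sum(rank_of[v] for v in g) ** 2 / len(g) for g in groups
    ) - 3 * (n + 1)
    return h / (1 - tie_term / (n ** 3 - n))

# Toy durations for three hypothetical receptor-status groups
negative = [1, 2, 3]
positive = [4, 5, 6]
undefined = [7, 8, 9]
h = kruskal_wallis_h([negative, positive, undefined])
# Chi-square approximation with 2 degrees of freedom (three groups only)
p = math.exp(-h / 2)
print(round(h, 3), round(p, 4))
```

Plugging the reported statistic 26.535 into the same expression gives a value on the order of 10^-6, in line with the reported p < 0.001. The chi-square approximation is meant for reasonably large groups, so the toy p-value is for illustration only.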

d) Boxplot of duration of having cancer, based on estrogen receptor status.

e) The boxplot helps us to determine whether the medians differ across groups, and post-hoc
analysis lets us compare the medians for each pair of groups. Based on the boxplot, the
medians of time differ across the categories of estrogen receptor status. With P = 0.002
(P < 0.05), we reject the null hypothesis.
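
The pairwise comparisons mentioned above can be enumerated as in the sketch below. It assumes three groups and a Bonferroni correction for multiple comparisons. The report does not state which correction, if any, was applied, so the adjusted threshold is an assumption, and the group labels are placeholders.

```python
from itertools import combinations

# Placeholder labels for the three groups compared
groups = ["negative", "positive", "undefined"]
pairs = list(combinations(groups, 2))

alpha = 0.05
# Bonferroni (assumed here): split alpha evenly over the pairwise tests
alpha_per_pair = alpha / len(pairs)

for a, b in pairs:
    print(f"{a} vs {b}: compare at alpha = {alpha_per_pair:.4f}")
```

With three groups there are three pairs, so under this assumption each pairwise p-value would be compared against about 0.0167 rather than 0.05.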

f) Researchers may convert a continuous variable into a categorical variable before starting the
analysis. Some reasons are:
1. To reduce the influence of outliers, especially in data that are not normally distributed.
2. A continuous variable that is not significant may become significant once it is
categorized.
3. To answer research questions about estimating or determining ratios based on frequencies.
Categorizing allows the researcher to use a non-parametric test such as the Chi-squared test to
assess the association between variables.
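
As an illustration of the third point, the sketch below dichotomizes a continuous variable and runs a Pearson Chi-squared test on the resulting 2x2 table. The cutoff of 2.0, the variable names and the toy data are all hypothetical. The p-value uses the exact upper tail of the chi-square distribution with 1 degree of freedom, so the helper applies to two binary variables only.

```python
import math

def categorize(values, cutoff):
    """Turn a continuous variable into two categories at a cutoff."""
    return ["high" if v >= cutoff else "low" for v in values]

def chi_square_2x2(rows, cols):
    """Pearson chi-square (no continuity correction) for two binary variables."""
    r_levels, c_levels = sorted(set(rows)), sorted(set(cols))
    n = len(rows)
    chi2 = 0.0
    for r in r_levels:
        for c in c_levels:
            observed = sum(1 for a, b in zip(rows, cols) if a == r and b == c)
            expected = rows.count(r) * cols.count(c) / n
            chi2 += (observed - expected) ** 2 / expected
    # Upper tail of chi-square with 1 degree of freedom
    p = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p

# Toy data; the 2.0 cutoff is hypothetical
size_cm = [0.5, 1.0, 1.5, 1.8, 2.2, 2.5, 3.0, 4.0]
er_status = ["neg", "neg", "neg", "pos", "neg", "pos", "pos", "pos"]
size_cat = categorize(size_cm, cutoff=2.0)
chi2, p = chi_square_2x2(size_cat, er_status)
print(round(chi2, 3), round(p, 4))
```

The toy table has expected counts far too small for the test to be trusted in practice; it only shows the mechanics. Note also that the result depends on where the cutoff is placed, which is one reason categorization should be justified before the analysis.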
