
STATS 105: Lab #2 Students Survey Data
Dr. Akram Almohalwas

1. Exploratory Data Analysis


For this lab, we will explore a VERY exciting dataset. It comes from a survey given to UCLA Stats 10 students in 2008. It asks very critical questions such as gender, hand, eye color, and whether or not they own an iPod. There are other variables as well, so feel free to explore.

A. Please open the data file posted on CCLE, data_survey.txt, and name your data lab2.
B. Inspecting the data: Always check to make sure that your data imported correctly.
C. Displaying the dimensions of the data set:
> dim(lab2)
D. Displaying the variable names in the data set:
> names(lab2)
E. Attaching/detaching data frames in R:
> attach(lab2)
F. Display the first 6 rows of the dataset:
> head(lab2)
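If you want to rehearse the inspection steps before loading the real file, the sketch below builds a tiny stand-in data frame. The column names and values are made up for illustration and are not taken from data_survey.txt. It also assumes the first line of the text holds the variable names (header = TRUE), which may or may not match the real file.

```r
# Toy stand-in for the survey file: names and values are invented.
toy_text <- "gender hand height
female right 64
male left 70
female right 66
male right 71"

# header = TRUE assumes the first line holds the variable names.
toy <- read.table(text = toy_text, header = TRUE)

dim(toy)     # number of rows, then number of columns
names(toy)   # variable names
head(toy)    # first rows (at most 6)

# with() evaluates an expression inside the data frame without attach()
with(toy, table(gender, hand))
```

If the real file has the same layout, something like read.table("data_survey.txt", header = TRUE) should work, but check the result with dim() and head() as in steps B through F.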

2. Descriptive Statistics
Variable classes:
> class(gender)

Displaying categorical data:
> table(gender)
> table(gender, hand)
> prop.table(table(gender, hand), 1)
> barplot(table(gender), col="lightgreen", main="Barplot of Gender")
> barplot(table(gender,hand), beside=TRUE, main="Barplot of Gender")
> barplot(table(gender,hand), beside=FALSE, main="Barplot of Gender")

3. Single sample test of the sample proportion


Usually we wouldn't know the population parameter (the true proportion p). If we did, there really wouldn't be a point to having a hypothesis about p. But for the purpose of this lab we will start out by learning the true population proportion p (the proportion of lefties in our population).
> prop.table(table(hand))

Question 1: What was the proportion of lefties in our population we are examining?

Question 2: Now, suppose I have the hypothesis that the true proportion of lefties in our population is 0.10. Write down this hypothesis formally:
H0 :
Ha :

Question 3: We are now going to take a single random sample of size 400. Record the proportion of lefties in this sample:
> samp = sample(hand, 400, replace=FALSE)
> prop.table(table(samp))

Question 4: Use your sample size (n = 400) and your sample proportion of lefties to construct a 95% confidence interval. Hint: use z* = 1.96 for the critical value.
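As a rough sketch of the arithmetic, the interval is p_hat plus or minus z* times the standard error sqrt(p_hat(1 - p_hat)/n). The value of p_hat below is hypothetical and only there to make the code run; substitute the proportion from your own sample.

```r
n     <- 400    # sample size from Question 3
p_hat <- 0.12   # hypothetical sample proportion, for illustration only
z     <- 1.96   # critical value given in the hint

se <- sqrt(p_hat * (1 - p_hat) / n)   # standard error of the sample proportion
ci <- c(lower = p_hat - z * se, upper = p_hat + z * se)
ci
```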

Question 5: Based on this confidence interval, what is your conclusion about my claim that the true proportion of lefties for the whole population is 0.10?

Question 6: Again using your sample results, calculate the z statistic. Find the P-value. What conclusions do you reach pertaining to the hypothesis? Did you get the same conclusion as when you used the confidence interval to run this test?
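The z statistic compares the sample proportion with the hypothesized value, using the standard error computed under H0. The sketch below assumes a two-sided alternative and again uses a hypothetical p_hat; neither is given in the lab text.

```r
n     <- 400
p_hat <- 0.12   # hypothetical sample proportion, for illustration only
p0    <- 0.10   # hypothesized proportion from Question 2

se0     <- sqrt(p0 * (1 - p0) / n)    # standard error assuming H0 is true
z_stat  <- (p_hat - p0) / se0
p_value <- 2 * pnorm(-abs(z_stat))    # two-sided P-value (an assumption)

z_stat
p_value
```

Note that the test uses p0 in the standard error while the confidence interval uses p_hat, so the two approaches can occasionally disagree near the boundary.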

4. A Single sample test of the mean:


Usually we wouldn't know the population parameter (the true average μ). If we did, there really wouldn't be a point to having a hypothesis about μ. But for the purpose of this lab we'll start out by learning the true population mean μ.
>mean(height)

Question 1: What was the average height for the students who filled out the survey?
μ =

Question 2: Now, suppose I have the hypothesis that the true average height is the answer you got in Question 1. Write down this hypothesis formally:
H0 :
Ha :

Question 3: We are now going to take a single random sample of size 60. Record the mean and standard deviation of this sample:
> samp = sample(height, 60, replace=FALSE)
> mean(samp)
> sd(samp)

x̄ =
s =

Check the distribution of your random sample:
> hist(samp)
> boxplot(samp)
> summary(samp)

Question 4: a) Use your sample size (n = 60), your sample mean, and your sample standard deviation to construct a 95% confidence interval. Hint: use qt(0.975, df = n - 1) to find the critical value.
> qt(0.975, df = 59)
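One way to put the pieces together is a small helper that returns the sample mean plus or minus t* times s/sqrt(n). The helper name ci_t and the simulated sample samp_demo are made up for this sketch; in the lab you would pass your own samp.

```r
set.seed(105)                               # arbitrary seed, for reproducibility only
samp_demo <- rnorm(60, mean = 67, sd = 4)   # simulated stand-in for sampled heights

# ci_t is a hypothetical helper, not part of base R
ci_t <- function(x, level = 0.95) {
  n      <- length(x)
  t_star <- qt(1 - (1 - level) / 2, df = n - 1)   # critical value
  margin <- t_star * sd(x) / sqrt(n)              # margin of error
  c(lower = mean(x) - margin, upper = mean(x) + margin)
}

ci_t(samp_demo)
```

Passing a different level changes only the critical value t*; the mean and standard error stay the same.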

b) Use your sample size (n = 60), your sample mean, and your sample standard deviation to construct a 90% confidence interval.

c) Compare the two confidence intervals. Why do you think they are different?

Question 5: Based on this confidence interval, what is your conclusion about the claim that the true average is your answer to Question 1?

Question 6: Again using your sample results, calculate the t statistic. Find alpha and find a P-value. What conclusions do you reach pertaining to the hypothesis? Did you get the same conclusion as when you used the confidence interval to run this test?
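The t statistic is (sample mean - hypothesized mean) divided by s/sqrt(n), with n - 1 degrees of freedom. The sketch below uses a simulated sample and a hypothetical null value mu0, and assumes a two-sided alternative; the built-in t.test() is used only as a cross-check.

```r
set.seed(105)                               # arbitrary seed, for reproducibility only
samp_demo <- rnorm(60, mean = 67, sd = 4)   # simulated stand-in for sampled heights
mu0 <- 67                                   # hypothetical null value (your answer to Question 1)

n       <- length(samp_demo)
t_stat  <- (mean(samp_demo) - mu0) / (sd(samp_demo) / sqrt(n))
p_value <- 2 * pt(-abs(t_stat), df = n - 1)   # two-sided P-value (an assumption)

# Cross-check against R's built-in one-sample t test
check <- t.test(samp_demo, mu = mu0)
c(by_hand = t_stat, built_in = unname(check$statistic))
```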

5. Test for Goodness of fit, Independence, and homogeneity (Chi-Square Tests)


a) Goodness-of-Fit
Here we are testing for goodness of fit. The data and this solved example are taken from your textbook (Example 3, pages 461-463 of Chapter 10). Look at Table 10.12, which summarizes the distribution of swine flu by age bracket. A sample of 619 was reported, and the expected proportions of swine flu by age are known.
> swine <- c(51,204,250,68,36,10)
> barplot(swine, ylab="frequency", xlab="Swine Flu Distribution by Age")
> expected <- c(49.52,68.09,132.47,146.08,121.32,101.52)
> barplot(expected, ylab="frequency", xlab="Swine Flu Distribution by Age")

Compare the two distributions above. Now let's perform the goodness-of-fit test:
> chisq.test(swine, p=expected/619)

or:
> null.probs <- expected/619
> chisq.test(swine, p=null.probs)
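To see what chisq.test() is doing here, the statistic can be computed by hand as the sum of (observed - expected)^2 / expected, with (number of categories - 1) degrees of freedom. The sketch below reuses the counts from the example above and compares the result with the built-in test.

```r
swine    <- c(51, 204, 250, 68, 36, 10)                       # observed counts
expected <- c(49.52, 68.09, 132.47, 146.08, 121.32, 101.52)   # expected counts

chi_sq  <- sum((swine - expected)^2 / expected)   # chi-squared statistic
df      <- length(swine) - 1                      # degrees of freedom
p_value <- pchisq(chi_sq, df = df, lower.tail = FALSE)   # upper-tail area

# Cross-check against the built-in test
fit <- chisq.test(swine, p = expected / sum(expected))
c(by_hand = chi_sq, built_in = unname(fit$statistic))
```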

b) Test for Independence
Here we are testing whether or not the dominant hand is independent of gender. To do so, we perform a Chi-square test:

Question 9:
> lab2.chi = chisq.test(hand, gender)
> names(lab2.chi)

To see the observed values (i.e. the original data) type:


> lab2.chi$observed

To see the expected values, you can calculate them using the formula, or type:
> lab2.chi$expected

To see the residual values, you can calculate them using the formula, or type:
> lab2.chi$residuals

The residuals calculated are the Pearson residuals, i.e. (observed - expected) / sqrt(expected). You can examine these and easily pick out which are the most important associations (and their direction). You do not actually need to type the full command to see the components of the chi-squared test. After the $ sign you can type a short version, and as long as it is unique it will be interpreted, e.g.

> lab2.chi$obs
> lab2.chi$exp
> lab2.chi$res

Question 10: Do you think that the hand and gender variables are independent? Is your conclusion consistent with what you guessed using the stacked bar chart in part 2?
> barplot(table(gender,hand), beside=FALSE, main="Barplot of Gender")
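The expected counts and Pearson residuals can also be worked out by hand: each expected count is (row total x column total) / grand total. The sketch below uses a hypothetical 2 x 2 table of counts, not the survey data, and checks the hand calculation against the components returned by chisq.test().

```r
# Hypothetical counts, invented for illustration (not from the survey)
obs <- matrix(c(20, 30,
                180, 170),
              nrow = 2, byrow = TRUE,
              dimnames = list(hand = c("left", "right"),
                              gender = c("female", "male")))

# Expected count = row total * column total / grand total
expected_by_hand  <- outer(rowSums(obs), colSums(obs)) / sum(obs)

# Pearson residual = (observed - expected) / sqrt(expected)
residuals_by_hand <- (obs - expected_by_hand) / sqrt(expected_by_hand)

fit <- chisq.test(obs)
expected_by_hand
residuals_by_hand
```

Residuals far from zero point to the cells that contribute most to the chi-squared statistic, and their sign gives the direction.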

Test for Independence using a summary table: Political Affiliation and Music Preference example from your PowerPoint slides:

               Democrat   Republican
Pop               70          52
Classic Rock      34          57
Other             21          16

> data <- matrix(c(70,52,34,57,21,16), ncol=2, byrow=TRUE)
> data
> barplot(data)
> chisq.test(data)

Code for graphing a Chi-Squared Distribution (degrees of freedom = 5):
> x=seq(0,20,length=200)
> y=dchisq(x,df=5)
> y=dchisq(x,5)
> plot(x,y,type="l",lwd=2,col="red")
> x=seq(0,20,length=200)
> y=dchisq(x,5)
> plot(x,y,type="l", lwd=2, col="blue")
> x=seq(0,1.54,length=200)
> y=dchisq(x,5)
> polygon(c(0,x,1.54),c(0,y,0),col="gray")
> pchisq(1.54,df=5)
> qchisq(0.09159182,df=5)
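The last two commands above are inverses of each other: pchisq() turns a cutoff into the area to its left (the gray region), and qchisq() turns that area back into the cutoff. The short sketch below makes that round trip explicit and also shows the upper-tail area, which is the one used for chi-squared P-values. The variable names are chosen here for illustration.

```r
df_demo <- 5      # degrees of freedom used in the plot
x0      <- 1.54   # cutoff used for the shaded region

lower_tail <- pchisq(x0, df = df_demo)                       # area to the left of x0
upper_tail <- pchisq(x0, df = df_demo, lower.tail = FALSE)   # area to the right of x0
back       <- qchisq(lower_tail, df = df_demo)               # recovers the cutoff

c(lower = lower_tail, upper = upper_tail, cutoff = back)
```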

Good Luck
