
INTERNATIONAL UNIVERSITY VNU HCMC

REPORT

FINAL PROJECT OF BUSINESS STATISTICS
Lecturer: Nguyen Bac Huy
Team Members:
1. Lng Tho Nhi_BAFNIU11075
2. Nguyn Tn Pht_BAFNIU11140
3. T Phm Duy Tin_BABAIU11144
4. Nguyn an Vy_BAFNIU11057
5. Trng Th Ngc Tuyt_BAFNIU11080
6. Hunh Ngc Tho Uyn_BAFNIU11127
7. L Ngc Anh Phng_BABAIU11269

CONTENT:
1. Question 1
2. Question 2
3. Question 3
4. Question 4
5. Question 5

Question 1:
A US National Public Transportation survey taken a few years ago in the USA indicated that less than 5% of US citizens use public transportation. Collect secondary data from several websites of the U.S. Department of Transportation, the US Census Bureau, etc. to test this hypothesis from the survey. Write a short essay to explain the data.

Solution:
We consider data from the Bureau of Labor Statistics of the United States Department of Labor for 2008 and 2009:
In Dec. 2008, the sample of 143,338,000 (the number of working people) was chosen.
In Dec. 2009, the sample of 137,792,000 (the number of working people) was chosen.

The report Public Transportation Usage Among U.S. Workers: 2008 and 2009, based on the American Community Survey (ACS), estimates the number of workers who commuted by public transportation in the 50 largest metro areas:
In 2008, 7,186,530 people were found using public transportation to get to work.
In 2009, 6,992,424 people were found using public transportation to get to work.

Let p denote the proportion of people using public transportation. The null and alternative hypotheses are:
H0: p < 5%
H1: p ≥ 5%
We test this hypothesis for two years, 2008 and 2009, as a right-tailed test at the boundary value p0 = 0.05.
Let us begin with the year 2008:
The hypothesized value of the proportion: p0 = 0.05
The sample size: n = 143,338,000
The sample proportion: $\hat{p} = x/n = 7,186,530 / 143,338,000 = 0.05014$

Because we have:
np0 = 0.05 x 143,338,000 = 7,166,900 > 5
n(1-p0) = 0.95 x 143,338,000 = 136,171,100 > 5
we use the z-test, where $\hat{p}$ is the sample proportion:
$z = \dfrac{\hat{p} - p_0}{\sqrt{p_0 (1 - p_0) / n}} = 7.69$

The z statistic is 7.69, so the p-value = 7.365 x 10^-13 % (about 7.4 x 10^-15). Because the p-value < 5%, we reject H0. Therefore, we do not accept that in 2008 less than 5% of US citizens used public transportation.
We continue with the data of 2009:
The hypothesized value of the proportion: p0 = 0.05
The sample size: n = 137,792,000
The sample proportion: $\hat{p} = 0.05007$

Because we have:
np0 = 0.05 x 137,792,000 = 6,889,600 > 5
n(1-p0) = 0.95 x 137,792,000 = 130,902,400 > 5
we use the z-test, where $\hat{p}$ is the sample proportion:
$z = \dfrac{\hat{p} - p_0}{\sqrt{p_0 (1 - p_0) / n}} = 3.77$

The z statistic is 3.77, so the p-value = 8.159 x 10^-3 % (about 8.2 x 10^-5). Because the p-value < 5%, we reject H0. Therefore, we do not accept that in 2009 less than 5% of US citizens used public transportation.
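The z-test above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `one_proportion_z_test` is hypothetical, the test is treated as right-tailed at p0 = 0.05, and the 2008 counts quoted in the text are used without rounding. With the unrounded proportion the statistic comes out a little below the 7.69 obtained from the rounded value 0.05014, which does not change the decision.

```python
import math


def one_proportion_z_test(successes, n, p0):
    """Right-tailed z-test for a single proportion (hypothetical helper)."""
    p_hat = successes / n
    standard_error = math.sqrt(p0 * (1 - p0) / n)
    z = (p_hat - p0) / standard_error
    # Upper-tail probability of the standard normal distribution.
    p_value = 0.5 * math.erfc(z / math.sqrt(2))
    return p_hat, z, p_value


# 2008 figures quoted in the text: public transportation commuters, working people.
p_hat, z, p_value = one_proportion_z_test(7_186_530, 143_338_000, 0.05)
print(f"p_hat = {p_hat:.5f}, z = {z:.2f}, p-value = {p_value:.3e}")
print("reject H0" if p_value < 0.05 else "do not reject H0")
```

The same call with the 2009 figures repeats the second half of the analysis.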

Reference:
Public Transportation Usage Among U.S. Workers: 2008 and 2009, Table 2: Public Transportation Usage for the 50 Largest Metropolitan Statistical Areas: 2008 and 2009.
The Employment Situation: December 2008, Bureau of Labor Statistics, United States Department of Labor, Table A: Major indicators of labor market activity, seasonally adjusted.
The Employment Situation: December 2009, Bureau of Labor Statistics, United States Department of Labor, Table A: Major indicators of labor market activity, seasonally adjusted.

Question 2:
Discuss among your group, select one company, state one dependent variable and more than two independent variables.
a. Collect data. Test the independence among the independent variables.
b. Establish the regression relationship and write down the regression equation.
c. Use the regression equation to estimate a new value of the dependent variable.

Solution:
a/ Collect data. Testing the independence among independent variables.
Kinh Do Food Joint Stock Company (Saigon) is a business operating in the field of food production and processing. How can it meet all of its customers' needs? To address this question, a survey of customer satisfaction was conducted, covering price, quality, forms and related trends. Through this survey, the company can devise new investment strategies that build on its strengths, overcome its shortcomings, attract more customers and strengthen their regard for the company.

Data:
Y: Level of satisfaction (dependent variable)
X1: Price (independent variable)
X2: Quality (independent variable)
X3: Evaluation compared to other milk brands (independent variable)
X4: How often consumers use (independent variable)
X5: Repeated use (independent variable)

Observations (n = 60), listed by variable in respondent order:
Y: 3 4 1 3 3 3 3 3 1 3 3 3 4 3 5 4 4 3 4 3 3 4 3 3 3 5 3 2 3 3 3 4 3 4 3 3 3 1 4 4 4 4 2 4 3 3 4 2 3 4 2 3 4 3 4 1 3 4 3 3
X1: 3 3 1 3 3 3 3 3 1 3 3 3 3 3 3 3 3 3 3 4 3 3 3 3 3 4 3 2 1 3 2 3 4 3 4 3 3 2 3 4 3 4 3 3 3 3 3 3 3 4 3 3 3 2 1 3 3 3 2 3
X2: 3 4 2 3 3 2 3 3 1 4 3 3 4 3 5 4 5 4 4 3 3 5 3 3 3 4 3 3 3 4 4 4 5 3 3 3 3 1 4 4 4 5 2 3 4 4 4 3 5 3 4 3 4 3 5 4 4 3 4 3
X3: 3 5 3 5 3 1 4 4 5 4 4 5 5 3 5 4 4 2 5 5 4 4 3 3 4 4 4 3 2 3 4 4 2 4 1 3 1 3 5 4 5 5 4 4 4 5 5 5 3 4 4 4 2 4 5 3 4 4 4 4
X4: 5 5 3 5 2 5 2 3 5 2 3 5 2 5 3 2 5 5 2 4 5 4 2 3 2 3 2 4 3 4 2 3 5 2 2 5 1 3 2 3 3 3 4 2 3 2 2 5 2 5 3 5 5 2 2 4 5 2 2 5
X5: 4 5 2 4 3 1 4 4 1 4 4 4 5 4 5 4 5 4 5 2 4 5 4 3 4 4 3 3 3 4 4 5 5 4 3 2 3 2 5 5 4 5 3 4 4 4 5 4 4 5 3 5 4 4 3 3 5 3 3 5

Hypothesis testing:
H0: $\beta_1 = \beta_2 = \beta_3 = \beta_4 = \beta_5 = 0$
H1: Not all the $\beta_i$ (i = 1, 2, 3, 4, 5) are zero

ANOVA
             df   SS         MS         F          Significance F
Regression    5   23.01243   4.602486   11.65684   1.16E-07
Residual     54   21.3209    0.394832
Total        59   44.33333

According to the ANOVA table, at the 0.05 level of significance the test statistic value FT = F-ratio = 11.65684 > F critical = 2.3538, so we can reject H0. In conclusion, based on the ANOVA table for the regression model and the hypothesis test, we have enough evidence that there is a regression relationship between the dependent variable Y and the independent variables Xi (i = 1, 2, 3, 4, 5).

Coefficient table:
            Coefficients   Standard Error   t Stat     P-value
Intercept   0.503736       0.542849         0.927948   0.357564
X1          0.298951       0.137257         2.178037   0.03379
X2          0.293209       0.121558         2.412093   0.019289
X3          0.068627       0.08471          0.810139   0.421416
X4          -0.12698       0.065631         -1.93473   0.058269
X5          0.246856       0.119542         2.065025   0.043736
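The entries of the two tables are linked by simple ratios: F is the regression mean square divided by the residual mean square, and each t statistic is a coefficient divided by its standard error. The sketch below only re-derives these ratios from the figures reported above; the variable names are illustrative and not part of the original analysis.

```python
# ANOVA entries for the five-predictor model, as reported in the text.
ss_regression, df_regression = 23.01243, 5
ss_residual, df_residual = 21.3209, 54

ms_regression = ss_regression / df_regression
ms_residual = ss_residual / df_residual
f_ratio = ms_regression / ms_residual

# (coefficient, standard error) pairs from the coefficient table.
coefficients = {
    "Intercept": (0.503736, 0.542849),
    "X1": (0.298951, 0.137257),
    "X2": (0.293209, 0.121558),
    "X3": (0.068627, 0.08471),
    "X4": (-0.12698, 0.065631),
    "X5": (0.246856, 0.119542),
}
t_stats = {name: coef / se for name, (coef, se) in coefficients.items()}

print(f"F = {f_ratio:.5f}")
for name, t in t_stats.items():
    print(f"{name}: t = {t:.4f}")
```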

Regression equation: from the coefficient table, we can set up the regression equation as follows:
Y = 0.503736 + 0.298951 X1 + 0.293209 X2 + 0.068627 X3 - 0.12698 X4 + 0.246856 X5
To test whether the variables of the regression model are significant, we use the p-value. If the p-value of Xi is greater than the level of significance α = 0.05, the test statistic falls into the non-rejection region, Xi is non-significant and we should remove Xi. Based on the coefficient table:
P-value of X3 = 0.421416 > 0.05
P-value of X4 = 0.058269 > 0.05
Thus, X3 and X4 are non-significant and should be removed from the regression equation. In addition:
P-value of X1 = 0.03379 < 0.05
P-value of X2 = 0.019289 < 0.05
P-value of X5 = 0.043736 < 0.05
Thus, X1, X2 and X5 are significant. We should run the regression again with only X1, X2 and X5, which are Price, Quality and Repeated use.

b/ Establish regression relationship, write down the regression equation.
Conduct the regression again.
Data:
Y: Level of satisfaction (dependent variable)
X1: Price (independent variable)
X2: Quality (independent variable)
X3: Repeated use (independent variable; X5 in part a)

Observations (n = 60), listed by variable in respondent order:
Y: 3 4 1 3 3 3 3 3 1 3 3 3 4 3 5 4 4 3 4 3 3 4 3 3 3 5 3 2 3 3 3 4 3 4 3 3 3 1 4 4 4 4 2 4 3 3 4 2 3 4 2 3 4 3 4 1 3 4 3 3
X1: 3 3 1 3 3 3 3 3 1 3 3 3 3 3 3 3 3 3 3 4 3 3 3 3 3 4 3 2 1 3 2 3 4 3 4 3 3 2 3 4 3 4 3 3 3 3 3 3 3 4 3 3 3 2 1 3 3 3 2 3
X2: 3 4 2 3 3 2 3 3 1 4 3 3 4 3 5 4 5 4 4 3 3 5 3 3 3 4 3 3 3 4 4 4 5 3 3 3 3 1 4 4 4 5 2 3 4 4 4 3 5 3 4 3 4 3 5 4 4 3 4 3
X3: 4 5 2 4 3 1 4 4 1 4 4 4 5 4 5 4 5 4 5 2 4 5 4 3 4 4 3 3 3 4 4 5 5 4 3 2 3 2 5 5 4 5 3 4 4 4 5 4 4 5 3 5 4 4 3 3 5 3 3 5

Hypothesis testing:
H0: $\beta_1 = \beta_2 = \beta_3 = 0$
H1: Not all the $\beta_i$ (i = 1, 2, 3) are zero

ANOVA
             df   SS         MS         F          Significance F
Regression    3   21.18356   7.061186   17.08122   5.35E-08
Residual     56   23.14977   0.413389
Total        59   44.33333

According to the ANOVA table, at the 0.05 level of significance the test statistic value FT = F-ratio = 17.08122 > F critical = 2.7395, so we can reject H0. In conclusion, based on the ANOVA table for the regression model and the hypothesis test, we have enough evidence that there is a regression relationship between the dependent variable Y and the independent variables Xi (i = 1, 2, 3).

Coefficient table (multiple regression):
            Coefficients   Standard Error   t Stat     P-value
Intercept   0.295018       0.434343         0.679228   0.499791
X1          0.243586       0.136693         1.781989   0.080173
X2          0.335106       0.121941         2.748094   0.008051
X3          0.262938       0.113466         2.317325   0.024166

Regression equation: from the coefficient table, we can set up the regression equation as follows:
Y = 0.295018 + 0.243586 X1 + 0.335106 X2 + 0.262938 X3
To test whether the variables of the regression model are significant, we use the p-value. If the p-value of Xi is greater than the level of significance α = 0.05, the test statistic falls into the non-rejection region, so we cannot reject H0 at the 0.05 level of significance, Xi is non-significant and we should remove Xi. Based on the coefficient table:
P-value of X1 = 0.080173 > 0.05 (but < 0.10)
P-value of X2 = 0.008051 < 0.05
P-value of X3 = 0.024166 < 0.05
Thus, X2 and X3 are significant at the 0.05 level, while X1 is significant only at the 0.10 level. Keeping X1 in the model, the regression equation is:
Y = 0.295018 + 0.243586 X1 + 0.335106 X2 + 0.262938 X3

c/ Use the regression equation to estimate new value of dependent variable.
Using the values of the independent variables X1 = 2, X2 = 3, X3 = 4:
Y = 0.295018 + 0.243586 X1 + 0.335106 X2 + 0.262938 X3
= 0.295018 + 0.243586 x 2 + 0.335106 x 3 + 0.262938 x 4
= 2.839
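The estimate in part c can be reproduced with a small helper. The function name `predict_satisfaction` and its argument names are illustrative assumptions; the coefficients are the ones reported for the three-variable model.

```python
def predict_satisfaction(price, quality, repeated_use):
    """Evaluate the fitted three-variable regression equation (hypothetical helper)."""
    return (
        0.295018
        + 0.243586 * price
        + 0.335106 * quality
        + 0.262938 * repeated_use
    )


# Values used in part c: X1 = 2, X2 = 3, X3 = 4.
y_hat = predict_satisfaction(2, 3, 4)
print(f"Estimated level of satisfaction: {y_hat:.3f}")
```

Because the survey answers are on a 1 to 5 scale, inputs outside that range would be extrapolation rather than estimation.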


Question 3:
Solve problem 9-18, textbook page 369 (Complete Business Statistics, 7th Ed).

Solution:
             Prototype A   Prototype B   Prototype C
             4420          4230          4110
             4540          4220          4090
             4380          4100          4070
             4550          4300          4160
             4210          4420          4230
             4330          4110          4120
             4400          4230          4000
             4340          4280          4200
             4390          4090          4150
             4510          4320          4220
Total        44070         42300         41350
Mean         4407          4230          4135
Grand mean   4257.333333

We assume independent random samples and normally distributed populations.
H0: all the prototypes have the same average range
H1: not all the prototypes have the same average range
ANOVA table:
$SSTR = \sum_{i=1}^{3} n_i (\bar{x}_i - \bar{\bar{x}})^2 = 381126.6667$
$SSE = \sum_{i=1}^{3} \sum_{j=1}^{n_i} (x_{ij} - \bar{x}_i)^2 = 248460$
$SST = SSTR + SSE = 629586.6667$

Source of Variation   Sum of Squares (SS)   Df   Mean Squares (MS)   F ratio (FT)
Treatment (TR)        381126.6667            2   190563.3333         20.70840377
Error (E)             248460                27   9202.222222
Total (T)             629586.6667           29

Test statistic value: F-ratio = 20.70
At α = 0.05, the critical value is F(2, 27; 0.05) = 3.35.
Because F-ratio > F critical, we reject H0. Based on the ANOVA table and the hypothesis test, we have sufficient evidence that not all three prototypes have the same average range.
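The ANOVA quantities above can be recomputed directly from the thirty range measurements. The following is a minimal sketch in plain Python; the variable names are illustrative, and the critical value 3.35 is simply the one quoted in the text.

```python
# Range measurements for the three prototypes, as listed in the data table.
prototype_ranges = {
    "A": [4420, 4540, 4380, 4550, 4210, 4330, 4400, 4340, 4390, 4510],
    "B": [4230, 4220, 4100, 4300, 4420, 4110, 4230, 4280, 4090, 4320],
    "C": [4110, 4090, 4070, 4160, 4230, 4120, 4000, 4200, 4150, 4220],
}

all_values = [x for values in prototype_ranges.values() for x in values]
grand_mean = sum(all_values) / len(all_values)
group_means = {name: sum(v) / len(v) for name, v in prototype_ranges.items()}

# Between-group (treatment) and within-group (error) sums of squares.
sstr = sum(
    len(v) * (group_means[name] - grand_mean) ** 2
    for name, v in prototype_ranges.items()
)
sse = sum(
    (x - group_means[name]) ** 2
    for name, v in prototype_ranges.items()
    for x in v
)

df_treatment = len(prototype_ranges) - 1
df_error = len(all_values) - len(prototype_ranges)
f_ratio = (sstr / df_treatment) / (sse / df_error)

critical_value = 3.35  # F(2, 27; 0.05), as quoted in the text
print(f"SSTR = {sstr:.4f}, SSE = {sse:.4f}, F = {f_ratio:.4f}")
print("reject H0" if f_ratio > critical_value else "do not reject H0")
```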

Question 4:
We surveyed students of National University HCMC and collected data on their income (money earned from a part-time job or received from their parents) and on the brand of cellphone they bought with that income, among three big companies. Suppose that the students surveyed form a random sample. We test the independence between these two factors, using a level of significance of 5%. The results are below:

Observed counts:
Income of students   Nokia   Sony   Samsung   Total
< 1 million             42     16        38      96
1-3 millions            57     37        55     149
> 3 millions            22     28        37      87
Total                  121     81       130     332

Solution:
H0: The income of students and the number of users of the three companies are independent of each other.
H1: The income of students and the number of users of the three companies are not independent.
Expected counts of data points in the different cells:

Income of students   Nokia    Sony    Samsung   Total
< 1 million          34.99    23.42   37.59        96
1-3 millions         54.3     36.35   58.35       149
> 3 millions         31.71    21.23   34.06        87
Total                121      81      130         332

The chi-square test statistic value for independence is:
$\chi^2_t = \sum_{i=1}^{3} \sum_{j=1}^{3} \dfrac{(O_{ij} - E_{ij})^2}{E_{ij}}$
Degrees of freedom = (r - 1)(c - 1) = (3 - 1)(3 - 1) = 4.
Critical value: $\chi^2_c = \chi^2(4, 0.05) = 9.4877$.
At the 0.05 level of significance, we cannot reject H0 since $\chi^2_t < \chi^2_c$. Based on the chi-square test for independence, we do not have enough evidence to conclude that the income of students and the number of users of the three companies are dependent on each other.
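As a cross-check of the test above, the expected counts and the chi-square statistic can be computed from the observed table alone. This is a sketch with illustrative variable names; the critical value 9.4877 is the one quoted in the text. Under these assumptions the statistic comes out at roughly 9.485, only slightly below the critical value.

```python
# Observed counts; columns are Nokia, Sony, Samsung.
observed = [
    [42, 16, 38],  # income < 1 million
    [57, 37, 55],  # income 1-3 millions
    [22, 28, 37],  # income > 3 millions
]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
grand_total = sum(row_totals)

# Expected count under independence: row total * column total / grand total.
expected = [[r * c / grand_total for c in col_totals] for r in row_totals]

chi_square = sum(
    (o - e) ** 2 / e
    for obs_row, exp_row in zip(observed, expected)
    for o, e in zip(obs_row, exp_row)
)
degrees_of_freedom = (len(observed) - 1) * (len(observed[0]) - 1)

critical_value = 9.4877  # chi-square(4, 0.05), as quoted in the text
reject_h0 = chi_square > critical_value
print(f"chi-square = {chi_square:.4f}, df = {degrees_of_freedom}")
print("reject H0" if reject_h0 else "do not reject H0")
```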


Question 5:
For a random sample of 200 U.S. motorists, the mileages driven last year are presented below.
10221 2209 2796 3571 3806 4110 4402 4500 4669 4720
4993 5090 5327 5640 5801 6208 6723 6829 7326 718
8521 2031 2202 5559 1816 5973 2079 5572 5492 6026
8050 2293 6825 3237 6593 2683 8802 6972 7783 502
9889 6972 5174 5281 402 7941 6271 6069 9555 6817
5816 4104 7990 2102 6873 45 1748 521 3818 6198
3668 6182 5966 3514 2960 7712 6744 4784 5751 7645
6669 4221 3257 2697 4760 5717 6115 5998 2781 3833
6632 2692 7912 4447 3018 4895 5524 4185 8404 7077
5891 7284 7146 4482 7734 1286 3510 4500 6229 167
5889 4193 2829 3511 1378 2686 5349 1654 551 5734
6741 8103 5497 4943 2855 1904 3330 8836 7500 5466
5942 5246 567 4527 5354 7474 7250 2859 7124 7924
3625 4801 7289 980 2963 6674 5011 5245 5653 2910
5672 4173 8943 6699 1514 2307 5679 8840 3420 6197
2846 702 6494 5954 3811 5794 5014 7530 4308 3689
6981 5244 1860 5224 655 5401 10304 4336 4673 4694
2336 7869 6657 4223 5857 3469 5682 5144 7044 4059
6274 10703 8198 5731 10962 5667 3615 6465 9577 4047
9167 4435 1879 5912 7440 5259 4132 2617 3026 5967

a. Guess which theoretical distribution could be the best fit for the data.
b. Use the 0.01 level of significance to determine whether the given data follow the distribution from question a.

Solution:
a/ The chi-square test for goodness of fit can be used to test how well our data support an assumption about the distribution of a population or random variable of interest. Ideally, we know the mean μ and the standard deviation σ of the population or variable. In some cases, however, the values of μ and σ are not given, so we need to estimate them from the data. When this happens, we lose a degree of freedom for each parameter estimated from the data. The degrees of freedom of the chi-square statistic are then df = k - 2 - 1 = k - 3 (instead of k - 1 as before).

b/ Based on the answer to the question above, our guess is that this population follows the bell-shaped (normal) distribution. With the 0.01 level of significance, we determine whether the given data follow this distribution or not. The null and alternative hypotheses are:
H0: the population has a normal distribution
H1: the population is not normally distributed
The chi-square goodness-of-fit test may be applied to testing any hypothesis about the distribution of a population or a random variable. The test may be applied in particular to testing how well an assumption of a normal distribution is supported by a given data set.
We have: n = 200
We divide the range into 6 classes: k = 6
The class boundaries are: 2797.85, 4078.609, 5084.92, 6091.231, 7371.99
These boundaries are symmetric about 5084.92, which is consistent with cut points at $\bar{x} \pm 0.44s$ and $\bar{x} \pm s$ for $\bar{x} = 5084.92$ and $s \approx 2287.07$.
The expected count in each class is E = np
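The goodness-of-fit procedure described above can be sketched as follows. Several things here are assumptions rather than the authors' stated method: the cut points at -1, -0.44, 0, 0.44 and 1 standard deviations and the values 5084.92 and 2287.07 are inferred from the class boundaries, the function names are hypothetical, and the demonstration uses only the first twelve mileages from the data listing, so its output is illustrative only. The full test would pass all 200 values and compare the statistic with the chi-square critical value for df = k - 3 = 3.

```python
import bisect
import math

Z_CUTS = [-1.0, -0.44, 0.0, 0.44, 1.0]  # assumed standardized class boundaries


def normal_cdf(z):
    """Cumulative distribution function of the standard normal."""
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))


def class_probabilities(z_cuts=Z_CUTS):
    """Probability of each class under the normal hypothesis."""
    cdf = [0.0] + [normal_cdf(z) for z in z_cuts] + [1.0]
    return [hi - lo for lo, hi in zip(cdf, cdf[1:])]


def goodness_of_fit(data, mean, sd, z_cuts=Z_CUTS):
    """Return observed counts, expected counts E = n * p and the chi-square statistic."""
    boundaries = [mean + z * sd for z in z_cuts]
    observed = [0] * (len(z_cuts) + 1)
    for x in data:
        observed[bisect.bisect_right(boundaries, x)] += 1
    expected = [len(data) * p for p in class_probabilities(z_cuts)]
    chi_square = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    return observed, expected, chi_square


# Illustration only: the first twelve mileages, not the full sample of 200.
subset = [10221, 2209, 2796, 3571, 3806, 4110, 4402, 4500, 4669, 4720, 4993, 5090]
observed, expected, chi_square = goodness_of_fit(subset, 5084.92, 2287.07)
probabilities = class_probabilities()
print(observed)
print([round(p, 4) for p in probabilities])
```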
