
UECM2263 Applied Statistical Models, January 2016

Assignment Report 1
UTAR, Malaysia

Lecturer: Dr. Chang Yun Fah

PREDICTION ON HOUSE ASKING PRICE IN JINJANG


Sutha Kathiravan1, Keoy Eng Chiew2, Chin Yan3, Sathya Seelan1, Liaw Zhen Liang1

1 AS, Y2S2, Department of Mathematics and Actuarial Sciences, UTAR
2 FM, Y2S1, Department of Mathematics and Actuarial Sciences, UTAR
3 AM, Y2S2, Department of Mathematics and Actuarial Sciences, UTAR

sutha_sky@yahoo.com, +6011-26169767

Abstract: A statistical study was carried out on houses listed for sale in Jinjang, Kuala Lumpur, to determine housing affordability and other information relevant to potential buyers. The findings are intended to help house hunters make well-informed decisions when choosing a house.
Keywords: statistics
I. INTRODUCTION

This project investigates the factors affecting the asking price of houses for sale in Jinjang. Recent reports indicate that, even though the Malaysian economy is slowing and the ringgit is depreciating, house prices continue to rise, making it difficult for Malaysians to own a house in the future [2].
For this project, we identified 13 candidate predictor variables: Area, Housing type, Land area, Built-up area, Tenure type, Number of bedrooms, Number of bathrooms, Furnished, Distance to nearest LRT/KTM/Monorail station, Distance to nearest primary school, Distance to nearest secondary school, Distance to nearest shopping mall, and Distance to nearest mosque, together with one response variable (Asking price of the property). We collected the data from a website (iProperty) and analysed it using R. Area and Land area are not used as predictor variables because they are not related to the response variable. Distance to nearest mosque is also not used because it has too many missing values.
The variables we will be using are defined as follows:

* Y is the Asking Price.
* X1 is the Housing Type.
* X2 is the Built-up Area.
* X3 is the Tenure Type.
* X4 is the Number of Bedrooms.
* X5 is the Number of Bathrooms.
* X6 is the Distance to the nearest LRT/KTM/Monorail station.
* X7 is the Distance to the nearest primary school.
* X8 is the Distance to the nearest secondary school.
* X9 is the Distance to the nearest shopping mall/convenience store.
* X10 is the Furnished status.
The significance level that we will use throughout this project is $\alpha = 0.05$.
II. Methodology

Our study covers houses that were listed for sale in Jinjang, Kuala Lumpur, on iProperty. The population is the houses for sale in Jinjang, and our sample consists of 65 observations.
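First, we propose a multiple linear regression model to explain the relationship between the response variable Y and the predictor variables. Under this model,
$$Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \beta_3 X_3 + \beta_4 X_4 + \beta_5 X_5 + \beta_6 X_6 + \beta_7 X_7 + \beta_8 X_8 + \beta_9 X_9 + \beta_{10} X_{10} + \varepsilon$$
where $\varepsilon$ is the random error term.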
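As a minimal sketch of how a model of this form can be fitted in R, the block below simulates a stand-in data set and fits the ten-predictor model with lm(). The simulated values, the seed and the coefficients used to generate Y are assumptions for illustration only; they are not the iProperty data. Only the object name model.dat follows the report's own code.

```r
# Illustrative only: simulated data standing in for the 65 iProperty listings.
set.seed(2016)
n <- 65
predictors <- paste0("X", 1:10)

# Hypothetical predictor values (standard normal), not the real housing data.
model.dat <- as.data.frame(matrix(rnorm(n * length(predictors)), nrow = n))
names(model.dat) <- predictors

# Hypothetical response: depends on X2 and X4 plus random error.
model.dat$Y <- 500000 + 200 * model.dat$X2 + 80000 * model.dat$X4 +
  rnorm(n, sd = 100000)

# Build the formula Y ~ X1 + ... + X10 and fit it by ordinary least squares.
full.formula <- reformulate(predictors, response = "Y")
full.model <- lm(full.formula, data = model.dat)

# Intercept plus 10 slopes are estimated, leaving n - 11 residual degrees of freedom.
n.coef <- length(coef(full.model))
resid.df <- df.residual(full.model)
print(summary(full.model))
```

With 65 observations and 11 estimated coefficients, the residual degrees of freedom are 65 - 11 = 54, which is the value reported for the full model in Section III.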
III. Analysis and Discussion

Residuals:
      Min       1Q   Median       3Q      Max
  -532028  -101390     2596   135809   427012

Coefficients:
               Estimate  Std. Error  t value  Pr(>|t|)
(Intercept)    79525.93   156095.35    0.509   0.61250
X1             99512.34    59918.18    1.661   0.10255
X2               220.92       53.87    4.101   0.00014 ***
X3           -262283.09    63980.86   -4.099   0.00014 ***
X4             83774.21    16884.73    4.962  7.34e-06 ***
X5            109262.42    45492.11    2.402   0.01979 *
X6            -67520.15    39347.29   -1.716   0.09189 .
X7           -130055.64    84225.01   -1.544   0.12839
X8             31032.78    75666.86    0.410   0.68334
X9             50432.54    49294.02    1.023   0.31082
X10           106441.45   130001.33    0.819   0.41652
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 230400 on 54 degrees of freedom
Multiple R-squared: 0.7909, Adjusted R-squared: 0.7522
F-statistic: 20.43 on 10 and 54 DF, p-value: 5.681e-15

From the summary output for the variables, the fitted regression equation is
$$\hat{Y} = 79525.93 + 99512.34X_1 + 220.92X_2 - 262283.09X_3 + 83774.21X_4 + 109262.42X_5 - 67520.15X_6 - 130055.64X_7 + 31032.78X_8 + 50432.54X_9 + 106441.45X_{10}$$
Here Y is the Asking Price and X1 is the Housing Type; the remaining predictors are as defined in the Introduction.

Model Adequacy Checking

Assumptions:

1. The relationship between the response Y and the predictor variables is linear.
2. The error term, $\varepsilon$, has zero mean.
3. The error term has constant variance.
4. The errors are normally distributed.
5. The errors are uncorrelated.
6. There are no outliers.

The validity of these assumptions should always be doubted, and analysis should be conducted to examine the adequacy of the model. A plot of the residuals against a predictor x examines the linearity of the model, while a plot of the residuals against the predicted values checks the constancy of the error variance. A normal probability plot checks whether the errors are normally distributed. In this assignment we check X2 (Built-up Area) and X4 (Number of Bedrooms), as the other variables are not related.

Non-linearity of Regression Model

Residual VS Built-Up Area
Input
# Simple regression of price on built-up area, then residuals against the predictor
model.reg <- lm(Y ~ X2, data = model.dat)
plot(x = model.dat$X2, y = model.reg$residuals, xlab = "Built-Up Area", ylab = "Residuals",
     main = "Residuals vs. Built-Up Area", col = "red", pch = 19, cex = 1.5,
     panel.first = grid(col = "gray", lty = "dotted"))
abline(h = 0, col = "blue")
Output

Figure 1: The residuals fall within a horizontal band centred around 0, displaying no systematic tendency to be positive or negative. Therefore, a linear regression model is appropriate.
Residual VS Bedrooms
Input
# Simple regression of price on number of bedrooms, then residuals against the predictor
model.reg <- lm(Y ~ X4, data = model.dat)
plot(x = model.dat$X4, y = model.reg$residuals, xlab = "Bedrooms", ylab = "Residuals",
     main = "Residuals vs. Bedrooms", col = "red", pch = 19, cex = 1.5,
     panel.first = grid(col = "gray", lty = "dotted"))
abline(h = 0, col = "blue")
Output

Figure 2: The residuals fall within a horizontal band centred around 0, displaying no systematic tendency to be positive or negative. Therefore, a linear regression model is appropriate.

Non-constancy of Error Variance

Residual VS Built-Up Area
Input
# Residuals against the fitted (predicted) values of the built-up area model
model.reg <- lm(Y ~ X2, data = model.dat)
plot(x = model.reg$fitted.values, y = model.reg$residuals, xlab = "Predicted Values",
     ylab = "Residuals", main = "Residuals vs. Predicted Values", col = "red", pch = 19,
     cex = 1.5, panel.first = grid(col = "gray", lty = "dotted"))
abline(h = 0, col = "blue")
Output

Figure 3: The graph shows that the points are randomly scattered within a horizontal band centred around 0, and no funnel shape is observed. Hence, the constant variance assumption seems to be fulfilled.

Residual VS Bedrooms
Input
# Residuals against the fitted (predicted) values of the bedrooms model
model.reg <- lm(Y ~ X4, data = model.dat)
plot(x = model.reg$fitted.values, y = model.reg$residuals, xlab = "Predicted Values",
     ylab = "Residuals", main = "Residuals vs. Predicted Values", col = "red", pch = 19,
     cex = 1.5, panel.first = grid(col = "gray", lty = "dotted"))
abline(h = 0, col = "blue")
Output

Figure 4: The graph shows that the points are randomly scattered within a horizontal band centred around 0, and no funnel shape is observed. Hence, the constant variance assumption seems to be fulfilled.

Normal Probability Plot

Price vs Built-Up Area
Input
# qqnorm() plots the sample quantiles of the residuals against theoretical normal quantiles
model.reg <- lm(Y ~ X2, data = model.dat)
qqplot <- qqnorm(model.reg$residuals, main = "Normal Probability Plot", xlab = "Built-up Area",
                 ylab = "Price", plot.it = TRUE, col = "blue", pch = 19, cex = 1.5,
                 panel.first = grid(col = "gray", lty = "dotted"))
abline(lm(qqplot$y ~ qqplot$x))


Output

Figure 5: From the graph above, the error terms do not depart substantially from normality, suggesting that the errors conform to the normality assumption initially made. No violation of normality is detected.

According to [1], research has shown that the effect of built-up area on price depends on location. In a rural area the housing price should be lower, but in a city, dwellers tend to favour houses with a large built-up area.

Price vs Bedrooms
Input
# qqnorm() plots the sample quantiles of the residuals against theoretical normal quantiles
model.reg <- lm(Y ~ X4, data = model.dat)
qqplot <- qqnorm(model.reg$residuals, main = "Normal Probability Plot", xlab = "Bedrooms",
                 ylab = "Price", plot.it = TRUE, col = "blue", pch = 19, cex = 1.5,
                 panel.first = grid(col = "gray", lty = "dotted"))
abline(lm(qqplot$y ~ qqplot$x))
Output

Figure 6: From the graph above, the error terms do not depart substantially from normality, suggesting that the errors conform to the normality assumption initially made. No violation of normality is detected.

According to [3], the interior of a house also affects its price, through the number of bathrooms, the number of bedrooms and the housing type, apart from its geographical location.

Coefficient of Determination

No.  model                             R.sq       adj.R.sq
1    x1                                0.1331626  0.1194033
2    x1,x2                             0.5943963  0.5813123
3    x1,x2,x3                          0.6297972  0.6115905
4    x1,x2,x3,x4                       0.7343859  0.7166783
5    x1,x2,x3,x4,x5                    0.7551079  0.7343543
6    x1,x2,x3,x4,x5,x6                 0.7769258  0.7538492
7    x1,x2,x3,x4,x5,x6,x7              0.7831864  0.7565602
8    x1,x2,x3,x4,x5,x6,x7,x8           0.7832118  0.7522420
9    x1,x2,x3,x4,x5,x6,x7,x8,x9        0.7883409  0.7537057
10   x1,x2,x3,x4,x5,x6,x7,x8,x9,x10    0.7909363  0.7522208

Summary of the model with predictors X1 to X7:

Call:
lm(formula = Y ~ X1 + X2 + X3 + X4 + X5 + X6 + X7, data = model.dat)

Residuals:
      Min       1Q   Median       3Q      Max
  -534969   -81367    13282   156080   410508

Coefficients:
               Estimate  Std. Error  t value  Pr(>|t|)
(Intercept)   139279.00   138479.11    1.006  0.318774
X1            102055.71    57921.39    1.762  0.083437 .
X2               218.51       52.48    4.164  0.000107 ***
X3           -247709.72    60273.66   -4.110  0.000128 ***
X4             81540.08    16412.54    4.968   6.5e-06 ***
X5            119661.49    44113.46    2.713  0.008811 **
X6            -77243.87    37898.25   -2.038  0.046180 *
X7            -92439.26    72053.61   -1.283  0.204711
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 228400 on 57 degrees of freedom
Multiple R-squared: 0.7832, Adjusted R-squared: 0.7566
F-statistic: 29.41 on 7 and 57 DF, p-value: < 2.2e-16
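The adjusted R-squared values in the Coefficient of Determination table can be cross-checked from the R-squared values alone. The sketch below assumes the usual definition, adjusted R-squared = 1 - (1 - R-squared)(n - 1)/(n - p - 1), with n = 65 observations and p predictors in each nested model. The helper name adj_r_sq is introduced here for illustration and is not part of the report's code.

```r
# R-squared values as reported in the Coefficient of Determination table,
# for the nested models x1; x1,x2; ...; x1,...,x10.
r.sq <- c(0.1331626, 0.5943963, 0.6297972, 0.7343859, 0.7551079,
          0.7769258, 0.7831864, 0.7832118, 0.7883409, 0.7909363)
reported.adj <- c(0.1194033, 0.5813123, 0.6115905, 0.7166783, 0.7343543,
                  0.7538492, 0.7565602, 0.7522420, 0.7537057, 0.7522208)

n <- 65                # sample size stated in the Methodology
p <- seq_along(r.sq)   # number of predictors in each nested model

# Adjusted R-squared penalises R-squared for the number of predictors used.
adj_r_sq <- function(r2, n, p) 1 - (1 - r2) * (n - 1) / (n - p - 1)

computed.adj <- adj_r_sq(r.sq, n, p)
max.abs.diff <- max(abs(computed.adj - reported.adj))
best.model <- which.max(computed.adj)

print(round(computed.adj, 7))
cat("Largest difference from the reported values:", max.abs.diff, "\n")
cat("Model with the highest adjusted R-squared:", best.model, "\n")
```

R-squared never decreases when a predictor is added, so it cannot be used on its own to compare nested models of different sizes; the adjusted version can fall when an added predictor contributes little, which is why it is used for the comparison here.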
IV. Conclusion

The best model is model 7, because it has the highest adjusted R-squared (0.7566). Therefore, the best model for estimating the asking price of a house is
$$\hat{Y} = 139279 + 102055.71X_1 + 218.51X_2 - 247709.72X_3 + 81540.08X_4 + 119661.49X_5 - 77243.87X_6 - 92439.26X_7$$
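To illustrate how the seven-predictor equation would be used, the sketch below evaluates it for a single house. The coefficients are copied from the R summary of that model; the function name predict_price and the example house are hypothetical, and the numeric codes for housing type (X1) and tenure type (X3), as well as the distance units, are assumptions because the report does not state them.

```r
# Coefficients of the seven-predictor model, copied from the R summary output.
model7.coef <- c(Intercept = 139279.00, X1 = 102055.71, X2 = 218.51,
                 X3 = -247709.72, X4 = 81540.08, X5 = 119661.49,
                 X6 = -77243.87, X7 = -92439.26)

# Predicted asking price for one house; x is a named numeric vector X1..X7.
predict_price <- function(x, beta = model7.coef) {
  vars <- paste0("X", 1:7)
  stopifnot(all(vars %in% names(x)))
  unname(beta["Intercept"] + sum(beta[vars] * x[vars]))
}

# Hypothetical house: the codes for X1 and X3 and the distance units are assumed.
house <- c(X1 = 1, X2 = 1200, X3 = 0, X4 = 3, X5 = 2, X6 = 1.5, X7 = 0.8)
price <- predict_price(house)

# Effect of one extra unit of built-up area with the other predictors held fixed.
bigger <- house
bigger["X2"] <- house["X2"] + 1
extra <- predict_price(bigger) - price

print(price)
print(extra)
```

The difference between the two predictions equals the X2 coefficient, which is the usual reading of a slope in a multiple regression: the change in predicted asking price for a one-unit change in that predictor with the others unchanged.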
The main challenge we faced in this project was ensuring that every predictor had a value corresponding to each observation of the response variable. While collecting and compiling the data, we found that the distance to the nearest mosque had many missing values, which made it difficult to produce the scatterplot. Hence, this predictor had to be dropped.

In future, a similar project should be carried out in which all the predictors have values.
V. Reference

[1] Gallent N., Shucksmith M., Tewdwr-Jones M. (2003). Housing in the European Countryside: Rural Pressure and Policy in Western Europe. Architecture. 35-36
[2] Malaysia's property market slowing sharply. (2016, January 4). Global Property Guide. Retrieved March 23, 2016, from http://www.globalpropertyguide.com/Asia/malaysia/Price-History
[3] Positive and negative impacts on house prices. Rightmove. Retrieved from http://www.rightmove.co.uk/what-affects-houseprices.html
VI. Overall

Overall, in this project we adopted the assumptions of a multiple linear regression model and tested their validity with plots. We checked for non-linearity of the regression model and non-constancy of the error variance using the residual plots for Built-Up Area and for Bedrooms, and we checked normality using the normal probability plots for Price vs. Built-Up Area and Price vs. Bedrooms. The coefficient of determination (R-squared) and the adjusted R-squared were then obtained so that the best model could be selected. We were able to assess the candidate models before choosing the best one. Hence, with the model that we have obtained, we can now estimate the asking price of houses in Jinjang.
