
Introduction to Data Science

Assignment 1
Due Date: 27th March 2017

Question: Fighting with Data (there are 8 parts to this question)
You are given a CSV file with 26,000+ entries (rows) and many columns (features). You
can open this file in Excel, but try not to change it in any way. We will use MATLAB
for everything we need to do to the data.

The coding parts below are also listed under To Do in comments in the MATLAB
code, so you know where to make the changes. However, you also need to write
answers to a few questions; do that in a Word document. Your submission will
consist of both the completed MATLAB code and the Word document.

Part 1)
Load the data of the file into a variable called data. Use the csvread command. If
you use the load command, it will give an error.

Why does the load command give an error? Can you find out what the problem with
the given data is? Be specific to this data set: if there is some column, row, cell, or
format that is causing the error, identify it. (Write the answer in the Word
document.)

You can try things out in the command window, rather than changing the code in
the file.
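As a starting point, a minimal load might look like the following sketch (the filename home_data.csv is taken from Part 2; the row/column offsets are an assumption you will need to check against the actual file):

```matlab
% csvread(filename, R, C) starts reading at row offset R and column
% offset C (both zero-based), which lets you skip a header row if the
% file has one. Adjust the offsets to match what you find in the file.
data = csvread('home_data.csv', 1, 0);
size(data)   % rows x columns of the numeric data
```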

Part 2)
Ideally, data would simply come in and we would hand it to a function to process.
However, life is not that simple. All too often we are dealing with some of the
following issues:

a. Some rows or columns are not needed, are extra, or are just plain garbage.
b. Some columns have too many missing values for them to be included.
c. There are garbage (erroneous) values.

Not all of the above apply to the current data set, but some might. So go ahead
and clean the data. Note that opening home_data.csv in WordPad/Notepad,
rather than Excel, might help in figuring out what is going on.

All of the cleaning you may want to do should be done in MATLAB via code. That is:
DO NOT CHANGE THE CSV FILE. IF YOU WANT TO TAKE OUT SOME ROWS/COLUMNS
ETC, DO IT PROGRAMMATICALLY IN MATLAB.
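A sketch of what programmatic cleaning can look like (which rows or columns to drop depends entirely on what you find; the column index below is a made-up placeholder):

```matlab
% Drop a column (e.g. a text/ID column that csvread cannot handle):
data(:, 1) = [];

% Drop every row that still contains a missing (NaN) value:
data(any(isnan(data), 2), :) = [];
```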

Part 3)
By now you have valid data on which you can work (quite similar to what we had in
data1.txt during the labs). So now you need to decide which features you want to
use. We will assume we are using all the features (other than the last column,
because that is the value we are trying to predict).

So set X to take all the columns (other than the last column), and set y to be the
last column.
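The step above amounts to two slicing operations:

```matlab
X = data(:, 1:end-1);   % all features except the last column
y = data(:, end);       % the last column is the target we predict
```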

But you are welcome to try out other possibilities, like taking only some of the
features and seeing how it changes the answer.

Part 4)
You will next see a part of the code that initializes theta, numbiters, and alpha.

You will note that we are using only 20 iterations. This is because we want to try
various values of alpha, starting from 1.

Plot the errors using alpha = 1, 0.5, 0.25, 0.1. (The code for plotting is already
given.) Which of the alphas are bad? Why? Paste the 4 plots into your Word file.
Now set alpha to your favorite value from above, and increase the number of
iterations from 20 to 200.
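For reference, here is how alpha and the iteration count interact in a plain batch gradient-descent loop. The assignment code already provides this; the sketch below is only illustrative, and it assumes X has been normalized and includes a bias column:

```matlab
alpha = 0.25;              % learning rate under test
numIters = 20;             % increase to 200 once alpha is chosen
m = length(y);
theta = zeros(size(X, 2), 1);
errors = zeros(numIters, 1);
for k = 1:numIters
    theta = theta - (alpha / m) * (X' * (X * theta - y));   % batch update
    errors(k) = (1 / (2 * m)) * sum((X * theta - y) .^ 2);  % half mean squared error
end
plot(1:numIters, errors);  % a bad alpha shows up as a flat or diverging curve
```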

Part 5) Normal Equation


There is some good news though. If the number of features is small (and we treat
even 10,000 as small), then we can avoid doing Gradient Descent for Linear
Regression. We have a closed form formula for the best values of theta. It is:
theta = (X^T X)^(-1) X^T y

Where X is the matrix of features that you are using, and y is the desired output. We
might discuss its derivation in class later. However, right now what you need to know
is that this is the theta that minimizes the error as much as is possible with the
given training data.

Compute theta2 in MATLAB using the above equation. Does your gradient descent
give the same error as the Normal Equation? (If the answers are very different there
is something wrong).
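In MATLAB the formula above is a single line. As a sketch (pinv is used here because X'X can be ill-conditioned; a backslash solve is another common choice):

```matlab
theta2 = pinv(X' * X) * (X' * y);   % closed-form least-squares solution
% Equivalent, and usually preferred numerically:
% theta2 = X \ y;
```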

Part 6)
Look at the values in theta. Which are the two most important features in
determining the house price? Which are the two least important features? (Write
this in your Word file.)

Part 7) Polynomial (or any Non-Linear) Regression


You might think that Linear Regression limits the hypothesis to be a line. For
example, we are looking for theta such that we predict:

y_predicted = x_1 theta_1 + x_2 theta_2 + ... + x_n theta_n

But what if we thought that the relationship was non-linear? For example, what if we
believe that the output should also depend on (x_1)^2? Then we should have:

y_predicted = x_1 theta_1 + x_2 theta_2 + ... + x_n theta_n + x_(n+1) theta_(n+1), where x_(n+1) = (x_1)^2.
Well, the happy news is that Linear Regression can easily adapt to this situation. All
you need is to add one more feature to X, by adding one more column. In the
example above, the new column would be the square of an existing column.

For example, if:

    X = [ 1  2  7
          1  3  9
          1  5  3 ]

and we believe that we should include a new feature which is the square of the 3rd
feature in the matrix, then we have to make:

    X = [ 1  2  7  49
          1  3  9  81
          1  5  3   9 ]

So the new features (whether they are logs, square roots, or squares of an existing
feature) are just added as columns. And now, when we do Linear Regression on
this new X, we will actually be doing Polynomial Regression! Great!
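The toy example above, in MATLAB:

```matlab
X = [1 2 7; 1 3 9; 1 5 3];
X = [X, X(:, 3) .^ 2];   % append the square of the 3rd column
% X is now:
%   1  2  7  49
%   1  3  9  81
%   1  5  3   9
```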

Now open the CSV file in Excel so you can read the titles of the features. Which
features do you think are so important that even a little change to them can affect
the price of the house a lot? How about squaring them and adding those columns to
X?

Do this for a couple of features (that is, add a couple more columns to X before
doing normalization). Does the error reduce somewhat?
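Applied to the real X, this is the same column-append trick. The column indices 3 and 5 below are made-up placeholders; pick the features you identified, and do this before normalization:

```matlab
X = [X, X(:, 3) .^ 2, X(:, 5) .^ 2];   % add squared copies of two features
% ...then re-run normalization and gradient descent / the normal equation.
```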

Part 8) Selling your idea:


Even after the improvements, you will note that the root mean squared error still
looks a bit large. But suppose this was the best we could do. Now we need to sell
this to whoever our boss is.

How would you argue that this error is bearable? What would you say to make it
sound as good as you can?

Some things to consider:

1) What is the average price of a house in the data? How does the size of the
error compare to the average price?
2) How would you work the word outliers into your convincing speech?
3) Are there any other error measures (such as the mean absolute error) that
may make it look better? NOTE: if you are checking another error metric, do
NOT change the computeError function, because our gradient descent works
with mean squared error.
Just compute the new kind of error separately after you have obtained the
final values of theta.
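A sketch of computing an alternative error measure after training, without touching computeError:

```matlab
predictions = X * theta;
mae  = mean(abs(predictions - y));          % mean absolute error
rmse = sqrt(mean((predictions - y) .^ 2));  % RMSE, for comparison
```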

Write your convincing argument (with any numbers that you might have),
in your Word document.