In this demo, we will see how the MATLAB Statistics Toolbox can help us explore and analyze historical automotive performance data, and create a model that represents the data well.
Contents
Setup
Load Data
Exploration
Matrix Plot Visualization
Modeling Fuel Economy
Create model for other combinations
Simulation
Conclusion
Setup
Load Data
Load fuel economy data from year 2000 to 2007 into a dataset array
data = loadData();
Reading in 1 of 7
Warning: Variable names were modified to make them valid MATLAB identifiers.
Reading in 2 of 7
Warning: Variable names were modified to make them valid MATLAB identifiers.
Reading in 3 of 7
Warning: Variable names were modified to make them valid MATLAB identifiers.
Reading in 4 of 7
Warning: Variable names were modified to make them valid MATLAB identifiers.
Reading in 5 of 7
Warning: Variable names were modified to make them valid MATLAB identifiers.
Reading in 6 of 7
Warning: Variable names were modified to make them valid MATLAB identifiers.
Reading in 7 of 7
Warning: Variable names were modified to make them valid MATLAB identifiers.
Done
First, we would like to examine the distribution of the MPG data to get a better understanding of the variability. For this, we will use the Distribution Fitting Tool from the Statistics Toolbox. Let us see how "dfittool" simplifies the task of visualizing and fitting distributions.
dfittool(data.mpg, [], [], 'MPG');
The histogram seems to have multiple peaks. Perhaps we need to break the data up into groups, "highway" and "city".
% close DFITTOOL (custom function)
closeDTool;
dfittool(data.mpg(data.C_H == 'H'), [], [], 'Highway');
dfittool(data.mpg(data.C_H == 'C'), [], [], 'City');
If we look at the "highway" data, the distribution still has multiple peaks. We can group it even further into "cars" and "trucks".
% close DFITTOOL (custom function)
closeDTool;
dfittool(data.mpg(data.C_H == 'H' & data.car_truck == 'C'), [], [], 'Highway Car');
dfittool(data.mpg(data.C_H == 'H' & data.car_truck == 'T'), [], [], 'Highway Truck');
We will try different types of distributions. We can do this by clicking on "New Fit..." from dfittool. "Normal" distribution doesn't seem to fit well. "Logistic" seems to be a better fit. It's not perfect but good enough for our purpose.
myDistFit(data.mpg(data.C_H == 'H' & data.car_truck == 'C'), ...
    data.mpg(data.C_H == 'H' & data.car_truck == 'T'))
xlabel('MPG');
All three variables seem to be negatively correlated with MPG. As expected, highway driving seems to yield better fuel economy than city driving.
dat = data(data.C_H == 'H' & data.car_truck == 'C', :);

% List of potential predictor variables
predictors = {dat.cid, dat.rhp, dat.etw, dat.cmp, dat.axle, dat.n_v};
varNames = {'cid', 'rhp', 'etw', 'cmp', 'axle', 'n_v'};

p = anovan(dat.mpg, predictors, 'continuous', 1:length(predictors), ...
    'varnames', varNames);
The ANOVA table informs us of the sources of variability in the model. Looking at the p-value (last column) tells us whether or not a certain term has a significant effect in the model. We will remove any term that is not significant to the model. In this case, the axle ratio seems to be insignificant. We will rerun ANOVA with up to 3-way interaction terms included in the model. The new model is:
predictors(p > 0.05) = '';
varNames(p > 0.05) = '';
[p2, t, stats, terms] = anovan(dat.mpg, predictors, ...
    'continuous', 1:length(predictors), 'varnames', varNames, 'model', 3);
r_square = 0.8478
The R squared value denotes how much of the variation seen in MPG is explained by the model. Let's visually inspect the goodness of the regression.
myPlot(dat.mpg, s.yhat, s.rsquare, 'Highway-Car');
City Car
modelMPG(data, 'City', 'Car');

r_square = 0.8735
City Truck
modelMPG(data, 'City', 'Truck');

r_square = 0.7936
Simulation
Now that we have a model, the final step is to simulate other scenarios. Let's look at the 2007 HONDA ACCORD.
home;
idx = (dat.yr == '2007' & dat.mfrName == 'HONDA' & dat.carline == 'ACCORD');
hondaAccord = dat(idx, {'yr', 'mfrName', 'carline', 'mpg'})

hondaAccord = 
    yr      mfrName    carline    mpg
    2007    HONDA      ACCORD     43.6
    2007    HONDA      ACCORD     43.1
    2007    HONDA      ACCORD     43.3
    2007    HONDA      ACCORD     36.6
    2007    HONDA      ACCORD     38.5
    2007    HONDA      ACCORD     35.7
    2007    HONDA      ACCORD     38.1
    2007    HONDA      ACCORD     38.2
    2007    HONDA      ACCORD     37.4
Let's compare this to what the model gives us. We will call simMPG to simulate with the appropriate input arguments.
vars = dat(idx, {'cid', 'rhp', 'etw', 'cmp', 'axle', 'n_v'});
hondaAccord_model = simMPG(vars(:, p < 0.05), terms, s);
hondaMPG = [hondaAccord.mpg'; hondaAccord_model'];
fprintf('\n\nModel validation (mpg):\n\n');
fprintf('Actual   Model    Diff\n');
fprintf('%6.2f  %6.2f  %6.2f\n', [hondaMPG; diff(hondaMPG)]);

Model validation (mpg):

Actual   Model    Diff
 43.60   38.90   -4.70
 43.10   38.90   -4.20
 43.30   37.18   -6.12
 36.60   35.06   -1.54
 38.50   35.62   -2.88
 35.70   35.06   -0.64
 38.10   35.62   -2.48
 38.20   34.63   -3.57
 37.40   34.63   -2.77
The model gives similar values. Now, we will simulate the fuel economy for a design where the engine displacement is decreased by 20%.
% Decrease displacement by 20%
vars.cid = vars.cid * 0.8;
hondaAccord_model2 = simMPG(vars(:, p < 0.05), terms, s);

hondaMPG2 = [hondaAccord_model'; hondaAccord_model2'];
fprintf('\n\nModel data (mpg):\n\n');
fprintf('Current   Smaller Disp   Diff   %%Increase\n');
fprintf('%6.2f  %6.2f  %6.2f  %6.2f\n', ...
    [hondaMPG2; diff(hondaMPG2); diff(hondaMPG2)./hondaMPG2(1,:)*100]);

Model data (mpg):

Current   Smaller Disp   Diff   %Increase
  38.90          40.39   1.49        3.83
  38.90          40.39   1.49        3.83
  37.18          38.69   1.50        4.05
  35.06          36.27   1.22        3.47
  35.62          36.97   1.35        3.79
  35.06          36.27   1.22        3.47
  35.62          36.97   1.35        3.79
  34.63          36.00   1.37        3.95
  34.63          36.00   1.37        3.95
Compared to the current configuration, the design with smaller displacement would result in a slightly better fuel economy.
Conclusion
We can now use this model for simulating different scenarios to come up with recommendations for a new automobile design.
warning(warnState);
Contents
Abstract
Data
Preliminary analysis
Pooled comparison: Is the combination therapy better than statin monotherapy?
Effect of Treatment, Statin Dose and Dose by Treatment interaction
Effect of Statin Dose on incremental increase in percentage LDL reduction
Regression analysis: Effect of statin dose on percent LDL C reduction
Secondary Analysis: Consistency of effect across subgroups, age and gender
Abstract
Statins are the most common class of drugs used for treating hyperlipidemia. However, studies have shown that even at their maximum dosage of 80 mg, many patients do not reach the LDL cholesterol goals recommended by the National Cholesterol Education Program Adult Treatment Panel. Combination therapy, in which a second cholesterol-reducing agent that acts via a complementary pathway is coadministered with a statin, is one alternative for achieving higher efficacy at a lower statin dosage. In this example, we test the primary hypothesis that coadministering drug X with a statin is more effective at reducing cholesterol levels than statin monotherapy.

NOTE: The dataset used in this example is purely fictitious. The analysis presented in this example is adapted from the following publication.

Reference: Ballantyne CM, Houri J, Notarbartolo A, Melani L, Lipka LJ, Suresh R, Sun S, LeBeaut AP, Sager PT, Veltri EP; Ezetimibe Study Group. Effect of ezetimibe coadministered with atorvastatin in 628 patients with primary hypercholesterolemia: a prospective, randomized, double-blind trial. Circulation. 2003 May 20;107(19):2409-15.
Data
650 patients were randomly assigned to one of the following 10 treatment groups (65 subjects per group)
Placebo
Drug X (10 mg)
Statin (10, 20, 40 or 80 mg)
Drug X (10 mg) + Statin (10, 20, 40 or 80 mg)
Lipid profile (LDL cholesterol, HDL cholesterol and triglycerides) was measured at baseline (BL) and at 12 weeks after the start of treatment. In addition to the lipid profile, each patient's age, gender and cardiac heart disease (CHD) risk category were also logged at baseline. The data from the study is stored in a Microsoft Excel(R) file. Note that the data could also be imported from other sources, such as text files, any JDBC/ODBC-compliant database, SAS transport files, etc. The columns in the data are as follows:
ID - Patient ID
Group - Treatment group
Dose_A - Dosage of Statin (mg)
Dose_X - Dosage of Drug X (mg)
Age - Patient Age
Gender - Patient Gender
Risk - Patient CHD risk category (1 is high risk, and 3 is low risk)
LDL_BL, HDL_BL & TC_BL - Lipid levels at baseline
LDL_12wks, HDL_12wks & TC_12wks - Lipid levels after treatment
We will import the data into a dataset array that affords better data management and organization.
% Import data from an Excel file ds = dataset('xlsfile', 'Data.xls') ;
Preliminary analysis
Our primary efficacy endpoint is the level of LDL cholesterol. Let us compare the LDL C levels at baseline to the LDL C levels after treatment.
% Use custom scatter plot LDLplot(ds.LDL_BL, ds.LDL_12wk, 50, 'g')
The mean LDL C level at baseline is around 4.2 and the mean level after treatment is around 2.5. So, at least for the data pooled across all the treatment groups, the treatment appears to lower the LDL cholesterol levels.
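The pooled means quoted above can be computed directly from the dataset array. A minimal sketch, using `nanmean` from the Statistics Toolbox to guard against any missing readings:

```matlab
% Mean LDL C before and after treatment, pooled across all groups
meanBL   = nanmean(ds.LDL_BL);    % roughly 4.2 in this data
mean12wk = nanmean(ds.LDL_12wk);  % roughly 2.5 in this data
fprintf('Baseline: %.1f   12 weeks: %.1f\n', meanBL, mean12wk);
```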
% Use a grouped scatter plot figure gscatter(ds.LDL_BL, ds.LDL_12wk, ds.Group)
The grouped plot shows that LDL C levels before the start of treatment have similar means. However, the LDL C levels after treatment differ across treatment groups. The placebo group shows no improvement. Statin monotherapy seems to outperform Drug X monotherapy. There is overlap between the Statin and Statin + X groups; however, the combination treatment does seem to perform better than statin monotherapy. Remember that the "Statin" and "Statin + X" groups are further split based on statin dose. In this example, we will use the percentage change of LDL C from the baseline level as the primary metric of efficacy.
% Calculate the percentage improvement over baseline level ds.Change_LDL = ( ds.LDL_BL - ds.LDL_12wk ) ./ ds.LDL_BL * 100 ;
In the following graph, we can see that:
1. In the "Statin" and "Statin + X" groups, there appears to be a positive linear correlation between percentage improvement and statin dose.
2. Even at the smallest dose of 10 mg, statin monotherapy seems to be better than Drug X monotherapy.
% Visualize effect of treatment and statin dose on percentage LDL reduction
figure
gscatter(ds.ID, ds.Change_LDL, {ds.Group, ds.Dose_S})
legend('Location', 'Best')
We performed a one-tailed hypothesis test to see if the Statin + X group (grp2) is better than the Statin group (grp1). We test against the alternative that the mean LDL change of grp1 (Statin only) is less than the mean LDL change of grp2 (Statin + X). The null hypothesis is rejected (p < 0.01), implying that the grp1 mean is less than the grp2 mean, i.e. the Statin group is less effective at lowering LDL C levels than the Statin + X group. The pooled analysis shows that coadministering drug X with a statin is more effective than statin monotherapy.
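A one-tailed two-sample test like the one described above can be carried out with `ttest2` from the Statistics Toolbox. This is a sketch, not the exact code from the example: the grouping expressions are assumptions based on the column descriptions (`Dose_A` for statin, `Dose_X` for drug X), and the tail is specified with the positional `tail` argument:

```matlab
% Pool the percentage LDL change for the two treatment arms
% (hypothetical grouping logic -- actual group labels depend on the data)
grp1 = ds.Change_LDL(ds.Dose_X == 0 & ds.Dose_A > 0);   % Statin only
grp2 = ds.Change_LDL(ds.Dose_X > 0  & ds.Dose_A > 0);   % Statin + X

% One-tailed two-sample t-test: alternative is mean(grp1) < mean(grp2)
[h, p] = ttest2(grp1, grp2, 0.01, 'left');
```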
From the above table, we can clearly see that the average efficacy of the combination therapy is better than statin monotherapy at all statin dosages. In the plot of the individual means, notice that the percentage reduction in LDL C levels achieved in the low-dose combination therapy group (~50.5%) is comparable to that achieved in the higher-dose statin monotherapy group (~49.4%). Thus combination therapy with Drug X could help patients who cannot tolerate high statin doses.
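The per-dose group means referenced above were presumably obtained with something like `grpstats`. A sketch, assuming the same hypothetical grouping logic as before; the actual example stores these means in a dataset named `ds2`:

```matlab
% Mean percentage LDL reduction per statin dose, split by treatment arm
statinOnly = ds(ds.Dose_X == 0 & ds.Dose_A > 0, :);
combo      = ds(ds.Dose_X > 0  & ds.Dose_A > 0, :);

meansStatin = grpstats(statinOnly.Change_LDL, statinOnly.Dose_A);
meansCombo  = grpstats(combo.Change_LDL, combo.Dose_A);
```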
figure
bar([ds2.Change_LDL_St, ds2.Change_LDL_St_X])
set(gca, 'XTickLabel', [10, 20, 40, 80])
colormap summer
xlabel('Statin Dose Groups (mg)')
ylabel('Percentage reduction of LDL C from Baseline (mmol/L)')
legend('Statin', 'Statin + X')
The regression lines for the Statin and the Statin + X groups run almost parallel. This probably indicates that the mechanisms of action of drug X and statins are independent.
% Fit
[m1, m2] = createFit(x, y, x1, y1)

m1 = 
     Linear model Poly1:
     m1(x) = p1*x + p2
     Coefficients (with 95% confidence bounds):
       p1 = 0.2412  (0.2064, 0.2759)
       p2 = 34.54   (32.94, 36.14)

m2 = 
     Linear model Poly1:
     m2(x) = p1*x + p2
     Coefficients (with 95% confidence bounds):
       p1 = 0.1435  (0.116, 0.1709)
       p2 = 50.79   (49.52, 52.05)
We will convert the continuous age variable into a categorical variable with 2 categories: Age < 65 and Age >= 65.
% Convert age into an ordinal array
ds.AgeGroup = ordinal(ds.Age, {'< 65', '>= 65'}, [], [0 65 100])

% Plot
boxplot(ds.Change_LDL(idx), {ds.Dose_S(idx), ds.AgeGroup(idx)});
Solving Data Management and Analysis Challenges using MATLAB and Statistics Toolbox
Demo file for the Data Management and Statistics Webinar. This demo requires the Statistics Toolbox and was created using MATLAB 7.7 (R2008b). In this demo, we will see how we can take a set of data describing performance and characteristics of various cars, and organize, extract, and visualize useful information for further analysis.
Contents
Automobile Data
Dataset Object
Categorical Arrays
Filtering
Concatenate and Join
Dealing with Missing Data
Clean Up Our Model
Create a New Regression Model
Multivariate Analysis of Variance
Robust Regression
Perform Regression Substitution
Automobile Data
Now let's begin. We'll work with this MAT file which contains some automobile data.
clear; load carbig
whos

  Name            Size      Bytes  Class   Attributes

  Acceleration    406x1      3248  double
  Cylinders       406x1      3248  double
  Displacement    406x1      3248  double
  Horsepower      406x1      3248  double
  MPG             406x1      3248  double
  Model           406x36    29232  char
  Model_Year      406x1      3248  double
  Origin          406x7      5684  char
  Weight          406x1      3248  double
  cyl4            406x5      4060  char
  org             406x7      5684  char
  when            406x5      4060  char
This data set contains information regarding 406 different cars. There are different variables for each piece of information, and each row corresponds to the same car.
Dataset Object
Dataset objects allow you to organize information in a tabular format, with a structure very much like that of a matrix. Each row represents an observation, or "car" in this case, and each column represents a variable, with an appropriate header name.
clc;
cars = dataset(Acceleration, Cylinders, Displacement, Horsepower, ...
    MPG, Model, Model_Year, Origin, Weight)

cars = 
    Acceleration    Cylinders    Displacement    Horsepower    MPG    Model          Model_Year    Origin    Weight
    12              8            307             130           18     [1x36 char]    70            USA       3504
    11.5            8            350             165           15     [1x36 char]    70            USA       3693
    11              8            318             150           18     [1x36 char]    70            USA       3436
    ...

(display of the remaining rows truncated; the dataset contains 406 observations)
The summary function provides basic statistical information for each of the variables included in the dataset object. Notice that there are some missing values for Horsepower and MPG, denoted by NaNs.
clc
summary(cars);

Acceleration: [406x1 double]
    min       1st Q      median     3rd Q      max
    8         13.7000    15.5000    17.2000    24.8000
Cylinders: [406x1 double]
    min    1st Q    median    3rd Q    max
    3      4        4         8        8
Displacement: [406x1 double]
    min    1st Q    median    3rd Q    max
    68     105      151       302      455
Horsepower: [406x1 double]
    min    1st Q      median    3rd Q    max    NaNs
    46     75.5000    95        130      230    6
MPG: [406x1 double]
    min    1st Q      median    3rd Q    max        NaNs
    9      17.5000    23        29       46.6000    8
Model: [406x36 char]
Model_Year: [406x1 double]
    min    1st Q    median    3rd Q    max
    70     73       76        79       82
Origin: [406x7 char]
Weight: [406x1 double]
    min     1st Q    median         3rd Q    max
    1613    2226     2.8225e+003    3620     5140
Categorical Arrays
Notice that some of the variables take on discrete values. For instance, Cylinders and Origin each take on only a small set of unique values:
clc
disp('Cylinders:');
unique(cars(:, 'Cylinders'))
disp('Origin:');
unique(cars(:, 'Origin'))

Cylinders:
ans =
    Cylinders
    3
    4
    5
    6
    8
Origin:
ans =
    Origin
    England
    France
    Germany
    Italy
    Japan
    Sweden
    USA
Categorical arrays provide significant memory savings. We will convert Cylinders to an ordinal array, which contains ordering information. The variable Origin will be converted to a nominal array, which does not store ordering.
clc
Cylinders_cat = ordinal(Cylinders);
Origin_cat = nominal(Origin);
whos Cylinders* Origin*

  Name            Size     Bytes   Class     Attributes
  Cylinders       406x1    3248    double
  Cylinders_cat   406x1    1178    ordinal
  Origin          406x7    5684    char
  Origin_cat      406x1    1366    nominal
Filtering
Dataset objects can be easily filtered by criteria.
For example, we can create a logical array that has ONEs where the origin is Germany and ZEROs where it's not Germany.
germanyMask = cars.Origin == 'Germany'

[406x1 logical output truncated: ones where the origin is Germany, zeros elsewhere]
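With the mask in hand, extracting the matching rows is a single indexing operation. A minimal sketch (the mean computation is an illustrative addition, not part of the original demo):

```matlab
% Keep only the rows of the dataset where the mask is true
germanyCars = cars(germanyMask, :);

% The filtered dataset works like any other, e.g. the average MPG
% of the German cars, ignoring missing values (NaNs):
mpg = germanyCars.MPG;
meanMPG = mean(mpg(~isnan(mpg)));
```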
[Output truncated in transcription: the dataset rows for the German cars, showing Displacement, MPG, Model, Model_Year, Origin, and Weight]
We notice a general trend, but the sheer amount of data prevents us from extracting much useful information. We can use filtering to refine the visualization. Let's extract only the cars made in 1970, 1976, or 1982.
index = cars.Model_Year == 70 | cars.Model_Year == 76 | cars.Model_Year == 82;
filtered = cars(index, :);
Join

Joining allows you to take the data in one dataset array and assign it to the rows of another dataset array, based on matching values in a common key variable.
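As a sketch of the idea, suppose we build a small lookup table mapping each Origin to a Continent and join it onto the cars data. The continent assignments below are illustrative assumptions; note that for join to succeed, the key variable in the lookup must cover every Origin value that appears in cars:

```matlab
% Hypothetical key table: one row per distinct Origin value
lookup = dataset( ...
    {nominal({'England'; 'France'; 'Germany'; 'Italy'; ...
              'Japan'; 'Sweden'; 'USA'}), 'Origin'}, ...
    {nominal({'Europe'; 'Europe'; 'Europe'; 'Europe'; ...
              'Asia'; 'Europe'; 'North America'}), 'Continent'});

% join matches each row of cars to the lookup row with the same
% Origin, appending the Continent variable to every car
cars_all = join(cars, lookup, 'Origin');
```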
clc
tabulate(cars_all.Origin);

      Value   Count   Percent
    England       1     0.20%
     France      18     3.56%
    Germany      48     9.49%
      Italy       9     1.78%
      Japan      94    18.58%
     Sweden      13     2.57%
        USA     323    63.83%
[Output truncated in transcription: rows of the joined dataset, with a new Continent variable (North America, Europe, Asia) matched to each car's Origin]
One way to deal with missing data is to substitute for the missing values. In this case, we will create a regression model that represents the performance measure (MPG) as a function of possible predictor variables (acceleration, cylinders, displacement, horsepower, model year, and weight).
X = [ones(length(cars.MPG),1), cars.Acceleration, double(cars.Cylinders), ...
    cars.Displacement, cars.Horsepower, cars.Model_Year, cars.Weight];
Y = cars.MPG;
[b, bint, r, rint, stats] = regress(Y, X);
Note that cars.Horsepower contains NaNs. The regress function performs listwise deletion, ignoring any row with a NaN in the response or in the predictor variables.
cars.regress = X * b;
fprintf('R-squared: %f\n', stats(1));

R-squared: 0.814178
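With the fitted model, one way to fill in the missing responses is to substitute the model's predictions, as the section's goal suggests. A sketch (note that rows missing a predictor such as Horsepower will still yield NaN predictions):

```matlab
% Rows where the observed MPG is missing
missing = isnan(cars.MPG);

% Substitute the regression prediction for each missing value;
% predictions that are themselves NaN (missing predictors) stay NaN
cars.MPG(missing) = cars.regress(missing);
```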
For cars with low or high MPG, the model seems to underestimate the MPG, while for cars in the middle, the model overestimates the true value.
gname(cars.Model)
We see that Japanese and German cars are quite similar to each other, and very different from English and American cars. Let's add a dummy variable that distinguishes Japanese and German cars, then redo the regression.
carsall.dummy = (carsall.Origin == 'Germany' | carsall.Origin == 'Japan');
X = [ones(length(carsall.MPG),1), carsall.Acceleration, ...
    double(carsall.Cylinders), carsall.Displacement, carsall.Horsepower, ...
    carsall.Model_Year, carsall.Weight, carsall.SW, carsall.Diesel, ...
    carsall.Automatic, carsall.dummy];
Y = carsall.MPG;
[b, bint, r, rint, stats] = regress(Y, X);
carsall.regress = X * b;
% Inspect the residuals once again
residuals2 = carsall.MPG - carsall.regress;
stem(carsall.regress, residuals2)
xlabel('model');
ylabel('actual - model');
gname(carsall.Model)
Robust Regression
We can also perform robust regression to deal with the outliers that may exist in the dataset.
X2 = [carsall.Acceleration, double(carsall.Cylinders), ...
    carsall.Displacement, carsall.Horsepower, carsall.Model_Year, ...
    carsall.Weight, carsall.SW, carsall.Diesel, carsall.Automatic, carsall.dummy];
[robustbeta, stats] = robustfit(X2, Y)
X3 = [ones(length(carsall.MPG),1), X2];
carsall.regress2 = X3 * robustbeta;

robustbeta =
   -4.1872
   -0.2086
   -1.6006
    0.0126
   -0.0166
    0.6444
   -0.0048
    0.1167
   12.4719
   -2.6949
    1.7508

stats =
        ols_s: 2.9910
     robust_s: 2.8012
        mad_s: 2.6344
            s: 2.8477
        resid: [399x1 double]
        rstud: [399x1 double]
           se: [11x1 double]
         covb: [11x11 double]
    coeffcorr: [11x11 double]
            t: [11x1 double]
            p: [11x1 double]
            w: [399x1 double]
            R: [11x11 double]
          dfe: 374
            h: [399x1 double]
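To see how much the robust fit differs from ordinary least squares, we could overlay the two sets of fitted values against the observations. This plot is an illustrative sketch, assuming the carsall.regress and carsall.regress2 variables computed above:

```matlab
% Compare ordinary least-squares and robust fitted values
plot(carsall.MPG, carsall.regress,  'b.');   % OLS fit
hold on
plot(carsall.MPG, carsall.regress2, 'r.');   % robust fit
plot(xlim, xlim, 'k-');                      % reference line y = x
hold off
xlabel('actual MPG');
ylabel('fitted MPG');
legend('least squares', 'robust', 'y = x', 'Location', 'NorthWest');
```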
Biomedical Statistics & Curve Fitting with MATLAB

Data analysis toolboxes: Signal Processing, Image Processing, Wavelets, Statistics, Curve Fitting, Neural Network.

Statistics: probability distributions and fitting, hypothesis testing, multivariate analysis, clustering (hierarchical, K-means).

Curve Fitting: simple linear models (residuals), robust fitting (outlier insensitivity), automatic code generation.
disttool is a GUI for exploring PDFs and CDFs. It starts by showing the CDF of the normal distribution, but we will switch to the more familiar PDF view. There is a variety of distributions to look at; we will choose the gamma distribution. As we change the parameter values, we can see how the PDF changes. randtool is another GUI, for looking at random number generation. We will choose the gamma distribution again and generate one hundred random values. Notice how the histogram mimics the shape of the PDF we saw, but varies as we generate new samples.
If we increase the sample size from 100 to 1,000, the shape is more consistent and smoother.
Now let us try fitting a univariate distribution to some real data. We have data collected on the life spans of 100 fruit flies, which we would like to model by fitting a univariate probability distribution. First, clear the workspace and load the data file into MATLAB.
The data vector, called medflies, contains the measured life spans in days. Let's look at the data. A number of specialized graphics functions are available, such as probability plots and the empirical CDF plot, but we will stick to a simple histogram.

hist(medflies, 0.25:0.5:6.25);

We can control the bin locations with the second argument of the hist function, and change the bar color as well:

set(get(gca, 'Children'), 'FaceColor', [0.9 0.9 0.9]);

From the histogram, the data are skewed to the right, so the normal distribution is probably not a good model. Instead, let us use a gamma distribution, which does have a right skew. The toolbox has fitting functions for a variety of univariate distributions. For example, the function gamfit fits a gamma distribution using maximum likelihood estimation.

[paramEsts, confInts] = gamfit(medflies)

paramEsts =
    4.3672    0.40962

confInts =
    5.4652    0.50245

The gamfit function returns point estimates for the parameters as well as confidence intervals. For example, the maximum likelihood estimate of the shape parameter is about 4.4, and its 95% confidence interval goes from 3.3 to 5.5. We can also compute the mean and variance of the fitted distribution:

[fitMean, fitVar] = gamstat(paramEsts(1), paramEsts(2))

fitMean =
    1.7889

fitVar =
    0.73278
We can judge how well the gamma model fits these data by overlaying its density function on the histogram. Use gampdf to compute the density along the range of the lifespan data.

yfit = gampdf(0:0.1:7, paramEsts(1), paramEsts(2));
hold on
plot(0:0.1:7, yfit*100*0.5, 'r')
hold off

The density is scaled by the sample size (100) times the bin width (0.5) to match the histogram counts. The toolbox also has functions to calculate CDFs and inverse CDFs. The plot indicates that the gamma distribution may not be entirely suitable for these data, but the lack of fit may be an artifact of the small sample size. One way to judge would be to use Monte Carlo simulation to investigate how good a fit we might expect.
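The CDF and inverse CDF functions mentioned above follow the same naming pattern as gampdf. A small sketch using the fitted parameters (the cutoff value and probability level are arbitrary examples, not from the demo):

```matlab
% Probability that a lifespan is at most 2 (in the data's units)
p = gamcdf(2, paramEsts(1), paramEsts(2));

% The 90th percentile of the fitted lifespan distribution
q90 = gaminv(0.90, paramEsts(1), paramEsts(2));
```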
We don't have time to do that here, but the general idea would be to simulate data from the model that we just fit. As we saw earlier, the toolbox has random number generators for a variety of distributions. From the command line it is a simple matter to simulate a random sample of one hundred values from our fitted model, using the gamrnd function and the estimated parameters.

r = gamrnd(paramEsts(1), paramEsts(2), 1, 100)

(long list of numbers)

Another approach we can take is to fit a semi-parametric model to the data.

ySmooth = ksdensity(medflies, 0:0.01:7);

The ksdensity function computes a kernel smooth estimate of the PDF for our data. Overlay that on the histogram too.

hold on
plot(0:0.01:7, ySmooth*100*0.5, 'r')
hold off
legend('Parametric fit', 'Kernel smooth')

The kernel smooth captures the mode and the right tail of the data better than the parametric model. This difference would probably be worth investigating, but we won't pursue it here. We estimated the parameters of our gamma fit using maximum likelihood and computed confidence intervals that make the standard large-sample assumption of asymptotic normality. The bootstrap is a useful tool when large-sample approximations may or may not apply. For example, we can use the bootstrap to estimate standard errors for our parameter estimates instead of relying on the confidence intervals we saw earlier. First we create 500 bootstrap replicates:

b = bootstrp(500, @gamfit, medflies);
b(1:10,:)

ans =
(10x2 array of numbers: the first few of the 500 replicates)

The two columns contain replicates of the shape and scale parameter estimates. If we compute the standard deviations along these columns, we get bootstrap estimates of the standard errors of the actual parameter estimates. The toolbox has a variety of functions to calculate descriptive statistics; we need the function std here.

std(b)

ans =
    ...    0.077319
We can also look at the replicates with a QQ plot, to help determine whether asymptotic normality applies.

qqplot(b(:,2))
Here we plot the second parameter. The bootstrap replicates are a little skewed relative to normal, indicating that the assumption of normality for the actual parameter estimates is probably not valid, so bootstrapping was a good idea in this case.

Comparing Multiple Samples

We use data collected in an experiment to determine how well subjects could distinguish a flickering light from a continuous one. The measured variable was the maximum flicker rate that could be distinguished by each subject. We will investigate how a subject's eye color affects that ability, making use of the toolbox's command-line functions for graphing, descriptive statistics, and hypothesis testing. Load the data:

clear all
load flicker
whos

  Color
  Flicker
  Group
Use boxplot to show the flicker rate data broken down by eye color.

boxplot(Flicker, Color)

Each box summarizes the data for one eye color. For brown-eyed subjects, for example, the red line across the box shows that the median rate is about 25 Hz. The top and bottom edges of the box indicate the upper and lower quartiles for brown-eyed subjects. The whiskers above and below the box show how far the other values extend. One possible outlying value is plotted as a separate point. The other boxes give the same information for the other eye colors. Apparently the rate is larger for green and blue eyes, although there is enough variability within each eye color that the distributions overlap.
We would like to know whether the differences are significant or simply due to random chance. We could compare any two eye colors with a two-sample test, but instead we will use one-way ANOVA to perform an overall significance test of the differences between the three eye color groups all at once.

[p, table, stats] = anova1(Flicker, Color)

The anova1 function returns values on the command line, but it also produces displays: a box plot like before, and an ANOVA table including the F-statistic (13.6), corresponding to a very small p-value of 2.289e-005. Thus the ANOVA confirms what we suspected from the box plot: there are significant differences between eye colors on average. Now we can go on to determine which groups are different from each other, a process known as multiple comparisons.
We can use the multcompare function, supplying it with the output from the anova1 function.

[comparison, means] = multcompare(stats)

The result is an interactive display that we can use to see which groups differ from which others, by comparing interval estimates of the group effects. By clicking on a group, we can see which other groups are significantly different (their intervals are disjoint) and which are not (their intervals overlap). Brown-eyed subjects differ significantly on average from blue-eyed subjects. The green-eyed group has a wide interval, reflecting the small sample size for that eye color. Because we can't make a precise estimate for green, we are not able to claim that the effect for those subjects is significantly different from the effects of the other groups. If there had been more green-eyed subjects, we might have been able to detect a difference.

Multivariate Analysis

Multivariate analysis consists of methods that operate on multiple variables at one time. We use the famous Fisher iris data set.

clear all
load fisheriris
whos

  meas       150x4
  species    150x1

The data consist of 4 measurements taken on 150 specimens of iris flowers from 3 different species. Each flower has measurements taken of its sepal length and width and its petal length and width. (The sepal is the lower, droopy part of the flower.) We also know the species of each specimen. Let us look at a scatter plot of the sepal measurements, with the points color-coded by species.
gscatter(meas(:,1), meas(:,2), species)
xlabel('sepal length'); ylabel('sepal width')

We see a well-separated group at the top left, consisting of the points from the setosa species. Looking only at the sepal measurements, the points from the other two species overlap. However, we haven't used the petal measurements yet; with those, we may still be able to distinguish the species. Let us view all four measurements in a scatter plot matrix.

gplotmatrix(meas, [], species, [], [], [], 'off')

This plot consists of an array of scatter plots, each showing a pair of measurements. The histograms along the diagonal show the distribution of each measurement. We can see that some pairs of variables give separation between species, but it would be best to use all 4 measurements together. The problem is that we can't visualize 4 dimensions without using tricks.
Since the data have a natural grouping by species, let us try cluster analysis. In hierarchical cluster analysis, we compute the dissimilarities between our specimens and then create a linkage tree from those dissimilarities. A variety of dissimilarity measures and linkage types are available; here we will use Euclidean distance and average linkage.

dist = pdist(meas, 'euclidean');
tree = linkage(dist, 'average');
[h, nodeNum] = dendrogram(tree, 20);

The dendrogram function plots the linkage tree. This is a way to visualize the distances between the four-dimensional measurements and thereby look for clusters. Starting from the bottom of the tree, the specimens that are closest together are joined first, then the next closest, until finally all specimens are joined. The y-axis is a measure of the distance between specimens as they are joined. Notice that the final join takes place at a distance of about 4, much higher than the join just below it. This is an indication that there are two well-separated groupings in the data. The clarity of this plot comes from condensing the 150 flower specimens into only 20 leaf nodes of the dendrogram. The second output of dendrogram indicates which specimens fall into which nodes. For example, node 8 consists of versicolor specimens:

species(nodeNum == 8)

ans =
    'versicolor'
    'versicolor'
    'versicolor'
species(ismember(nodeNum, [1 2 15 4]))

ans =
  Columns 1 through 7
    'setosa'  ...

We can confirm that the setosa specimens are the ones that make up the group of nodes on the right. The ismember function here picks out the specimens whose node number is one of a given set of values.

K-means is another clustering method. Instead of producing a clustering hierarchy, K-means partitions the data into a specified number of clusters based on the distances between points. For example, we can find the best partition of the data into 3 clusters.

[clustIdx, clustCtrs] = kmeans(meas, 3, 'dist', 'cos', 'rep', 5);

Here we used the cosine distance. The kmeans function returns a list of cluster indices, one for each specimen, defining which cluster each specimen has been assigned to.
The second output contains the centroids, which in this case are 4-dimensional values representing the average measurements of the specimens in each of the three clusters. You can visualize the clusters with a parallel coordinates plot, in which each specimen is represented as the series of its 4 measurements. The plot in this case is highly specialized, but it can be put together from high- and low-level MATLAB graphics functions. The plot shows the different clusters in different colors. You can see that kmeans partitioned the data based on how much the sepal and petal measurements overlapped. It turns out that this does a pretty good job of distinguishing the species: only 5 specimens are misclassified, and those are the ones shown in black. Not too bad, since we didn't even use the species information, only the measurements. To explicitly incorporate the species information into the analysis, we can use classification, here in the form known as decision trees. The treefit function creates a classification tree, and the treedisp function displays it. A tree is a sequence of binary decisions.
t = treefit(meas, species);
treedisp(t, 'names', {'SL' 'SW' 'PL' 'PW'})

We classify an observation as one of the three species. Each node of the tree defines a split of the data based on one of the 4 measurements. For example, at the top node, petal length (PL) is used to distinguish between the setosa species and the other two. If we click on the two nodes at the second level, we see that the branch on the left contains only one species and the branch on the right contains the other two. A few of the observations are actually misclassified, but that is to be expected. Controls on the tree display allow you to choose the tree size; the toolbox can find an optimal tree size based on cross-validation, but we don't have time for that here.