Regression Using Dataset Arrays

This example shows how to perform linear and stepwise regression analyses using dataset arrays.

Load sample data.

load imports-85

Store predictor and response variables in dataset array.

ds = dataset(X(:,7),X(:,8),X(:,9),X(:,15),'Varnames',...
{'curb_weight','engine_size','bore','price'});

Fit linear regression model.

Fit a linear regression model that explains the price of a car in terms of its curb weight, engine size, and bore.

fitlm(ds,'price~curb_weight+engine_size+bore')
ans = 
Linear regression model:
    price ~ 1 + curb_weight + engine_size + bore

Estimated Coefficients:
                    Estimate        SE         tStat       pValue  
                   __________    _________    _______    __________

    (Intercept)        64.095        3.703     17.309    2.0481e-41
    curb_weight    -0.0086681    0.0011025    -7.8623      2.42e-13
    engine_size     -0.015806     0.013255    -1.1925       0.23452
    bore              -2.6998       1.3489    -2.0015      0.046711


Number of observations: 201, Error degrees of freedom: 197
Root Mean Squared Error: 3.95
R-squared: 0.674,  Adjusted R-Squared: 0.669
F-statistic vs. constant model: 136, p-value = 1.14e-47

The command fitlm(ds) returns the same result because fitlm, by default, assumes that the response variable is in the last column of the dataset array ds and treats all other columns as predictors.
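As a quick check of this default, the following minimal sketch fits the model both ways and compares the coefficient estimates. The variable names mdlFormula and mdlDefault are illustrative choices, not part of the original example, and the code assumes Statistics and Machine Learning Toolbox is available.

```matlab
% Rebuild the dataset array with the response (price) in the last column.
load imports-85
ds = dataset(X(:,7),X(:,8),X(:,9),X(:,15),'Varnames',...
    {'curb_weight','engine_size','bore','price'});

% Fit once with an explicit formula and once with the default behavior.
mdlFormula = fitlm(ds,'price~curb_weight+engine_size+bore');
mdlDefault = fitlm(ds);

% Largest absolute difference between the two sets of estimates.
maxDiff = max(abs(mdlFormula.Coefficients.Estimate - ...
    mdlDefault.Coefficients.Estimate));
disp(maxDiff)

% Name of the variable that the default fit treated as the response.
disp(mdlDefault.ResponseName)
```

A difference at or near zero, together with a response name of price, indicates that the two calls fit the same model.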

Recreate dataset array and repeat analysis.

This time, put the response variable in the first column of the dataset array.

ds = dataset(X(:,15),X(:,7),X(:,8),X(:,9),'Varnames',...
{'price','curb_weight','engine_size','bore'});

Because the response variable is now in the first column of ds, you must specify its location explicitly. Otherwise fitlm, by default, treats the last column, bore, as the response variable. You can specify the response variable using either:

fitlm(ds,'ResponseVar','price');

or

fitlm(ds,'ResponseVar',logical([1 0 0 0]));
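To see the effect of each form, the sketch below fits the model three ways and prints the response name that each fit used. The variable names mdlByName, mdlByMask, and mdlDefault are illustrative and do not appear in the original example.

```matlab
% Dataset array with the response (price) in the first column.
load imports-85
ds = dataset(X(:,15),X(:,7),X(:,8),X(:,9),'Varnames',...
    {'price','curb_weight','engine_size','bore'});

% Response selected by name and by logical index.
mdlByName = fitlm(ds,'ResponseVar','price');
mdlByMask = fitlm(ds,'ResponseVar',logical([1 0 0 0]));

% No response specified: fitlm falls back to the last column.
mdlDefault = fitlm(ds);

disp(mdlByName.ResponseName)
disp(mdlByMask.ResponseName)
disp(mdlDefault.ResponseName)
```

The first two fits should report price as the response, while the default fit reports bore, which is why the response must be named when it is not in the last column.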

Perform stepwise regression.

stepwiselm(ds,'quadratic','lower','price~1',...
'ResponseVar','price')
1. Removing bore^2, FStat = 0.01282, pValue = 0.90997
2. Removing engine_size^2, FStat = 0.078043, pValue = 0.78027
3. Removing curb_weight:bore, FStat = 0.70558, pValue = 0.40195
ans = 
Linear regression model:
    price ~ 1 + curb_weight*engine_size + engine_size*bore + curb_weight^2

Estimated Coefficients:
                                Estimate          SE         tStat       pValue  
                               ___________    __________    _______    __________

    (Intercept)                     131.13        14.273     9.1873    6.2319e-17
    curb_weight                  -0.043315     0.0085114    -5.0891    8.4682e-07
    engine_size                   -0.17102       0.13844    -1.2354       0.21819
    bore                           -12.244         4.999    -2.4493      0.015202
    curb_weight:engine_size    -6.3411e-05    2.6577e-05     -2.386      0.017996
    engine_size:bore              0.092554      0.037263     2.4838      0.013847
    curb_weight^2               8.0836e-06    1.9983e-06     4.0451    7.5432e-05


Number of observations: 201, Error degrees of freedom: 194
Root Mean Squared Error: 3.59
R-squared: 0.735,  Adjusted R-Squared: 0.726
F-statistic vs. constant model: 89.5, p-value = 3.58e-53

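The initial model is the full quadratic model, which contains an intercept, a linear term and a squared term for each predictor, and all pairwise interactions. The smallest model considered is the constant model. Here, stepwiselm uses backward elimination to determine the terms in the model, removing bore^2, engine_size^2, and curb_weight:bore in turn. The final model is price ~ 1 + curb_weight*engine_size + engine_size*bore + curb_weight^2, which corresponds to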
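If you want to inspect the selected terms programmatically rather than read them from the display, one option is to store the fitted model in a variable. The following is a minimal sketch; the variable name mdl is an illustrative choice, and 'Verbose',0 is used only to suppress the step-by-step messages.

```matlab
% Dataset array with the response (price) in the first column.
load imports-85
ds = dataset(X(:,15),X(:,7),X(:,8),X(:,9),'Varnames',...
    {'price','curb_weight','engine_size','bore'});

% Same stepwise fit as in the example, stored in a variable and run quietly.
mdl = stepwiselm(ds,'quadratic','Lower','price~1',...
    'ResponseVar','price','Verbose',0);

% Names of the terms that survived backward elimination.
disp(mdl.CoefficientNames')

% Coefficient count and error degrees of freedom of the final model.
fprintf('%d coefficients, %d error degrees of freedom\n',...
    mdl.NumCoefficients,mdl.DFE)
```

The names and counts printed here can be compared with the coefficient table and the error degrees of freedom in the displayed output.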

$$P = \beta_0 + \beta_C C + \beta_E E + \beta_B B + \beta_{CE} C E + \beta_{EB} E B + \beta_{C^2} C^2 + \epsilon$$

where P is price, C is curb weight, E is engine size, B is bore, each β is the coefficient of the corresponding term in the model, and ϵ is the error term. The final model includes all three main effects, two interaction effects (curb weight with engine size, and engine size with bore), and the second-order term for curb weight.
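To make the equation concrete, the sketch below evaluates it by hand with the rounded estimates from the coefficient table of the stepwise fit. The predictor values C, E, and B are hypothetical values chosen only for illustration, not an observation from the data, and the error term is omitted because this is a point prediction.

```matlab
% Rounded estimates copied from the Estimate column of the stepwise fit.
b0  = 131.13;        % (Intercept)
bC  = -0.043315;     % curb_weight
bE  = -0.17102;      % engine_size
bB  = -12.244;       % bore
bCE = -6.3411e-05;   % curb_weight:engine_size
bEB = 0.092554;      % engine_size:bore
bC2 = 8.0836e-06;    % curb_weight^2

% Hypothetical predictor values (assumptions, not taken from the data set).
C = 2500;   % curb weight
E = 130;    % engine size
B = 3.3;    % bore

% Evaluate the fitted equation term by term.
P = b0 + bC*C + bE*E + bB*B + bCE*C*E + bEB*E*B + bC2*C^2;
disp(P)
```

Because the estimates are rounded to five significant digits, the result can differ slightly from what predict returns for the same inputs on the fitted model.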