Ejemplo Datos sobre cemento de Hald

(Draper & Smith (1998))

El siguiente juego de datos es sobre el endurecimiento de cemento Portland, famoso por su nada fácil modelación.

Variable Descripción
\(x_1\) Cantidad de tricalcio de aluminato, \(3CaO \cdot Al_2O_3\)
\(x_2\) Cantidad de tricalcio de silicato, \(3CaO \cdot SiO_2\)
\(x_3\) Cantidad de tricalcio de aluminio ferrito, \(4CaO \cdot Al_2O_3 \cdot Fe_2O_2\)
\(x_4\) Cantidad de dicalcio de silicato, \(2CaO \cdot SiO_2\)
\(y\) Calor en calorías por gramo de cemento

Lectura de datos y características de los mismos:

hald.data <- read.csv("~/Desktop/EstApl2/Markdown_MGA/RM_tema_6/Hald.dat", sep="")
dim(hald.data)
## [1] 13  5
str(hald.data)
## 'data.frame':    13 obs. of  5 variables:
##  $ x1: int  7 1 11 11 7 11 3 1 2 21 ...
##  $ x2: int  26 29 56 31 52 55 71 31 54 47 ...
##  $ x3: int  6 15 8 8 6 9 17 22 18 4 ...
##  $ x4: int  60 52 20 47 33 22 6 44 22 26 ...
##  $ y : num  78.5 74.3 104.3 87.6 95.9 ...
head(hald.data)
##   x1 x2 x3 x4     y
## 1  7 26  6 60  78.5
## 2  1 29 15 52  74.3
## 3 11 56  8 20 104.3
## 4 11 31  8 47  87.6
## 5  7 52  6 33  95.9
## 6 11 55  9 22 109.2
summary(hald.data)
##        x1               x2              x3              x4    
##  Min.   : 1.000   Min.   :26.00   Min.   : 4.00   Min.   : 6  
##  1st Qu.: 2.000   1st Qu.:31.00   1st Qu.: 8.00   1st Qu.:20  
##  Median : 7.000   Median :52.00   Median : 9.00   Median :26  
##  Mean   : 7.462   Mean   :48.15   Mean   :11.77   Mean   :30  
##  3rd Qu.:11.000   3rd Qu.:56.00   3rd Qu.:17.00   3rd Qu.:44  
##  Max.   :21.000   Max.   :71.00   Max.   :23.00   Max.   :60  
##        y         
##  Min.   : 72.50  
##  1st Qu.: 83.80  
##  Median : 95.90  
##  Mean   : 95.42  
##  3rd Qu.:109.20  
##  Max.   :115.90
## $Stat
##    p q         s         S2     SC_res        R2     R2adj         Cp
## 1  0 1 15.043723 226.313590 2715.76308        NA        NA 442.916687
## 2  1 2 10.726716 115.062432 1265.68675 0.5339480 0.4915797 202.548769
## 3  1 2  9.077126  82.394213  906.33634 0.6662683 0.6359290 142.486407
## 4  1 2 13.278145 176.309134 1939.40047 0.2858727 0.2209521 315.154284
## 5  1 2  8.963902  80.351538  883.86692 0.6745420 0.6449549 138.730833
## 6  2 3  2.406335   5.790448   57.90448 0.9786784 0.9744140   2.678242
## 7  2 3 11.077328 122.707206 1227.07206 0.5481667 0.4578001 198.094653
## 8  2 3  2.734266   7.476211   74.76211 0.9724710 0.9669653   5.495851
## 9  2 3  6.445485  41.544273  415.44273 0.8470254 0.8164305  62.437716
## 10 2 3  9.321374  86.888013  868.88013 0.6800604 0.6160725 138.225920
## 11 2 3  4.192112  17.573800  175.73800 0.9352896 0.9223476  22.373112
## 12 3 4  2.312061   5.345624   48.11061 0.9822847 0.9763796   3.041280
## 13 3 4  2.308745   5.330303   47.97273 0.9823355 0.9764473   3.018233
## 14 3 4  2.376648   5.648458   50.83612 0.9812811 0.9750415   3.496824
## 15 3 4  2.863846   8.201617   73.81455 0.9728200 0.9637599   7.337474
## 16 4 5  2.446008   5.982955   47.86364 0.9823756 0.9735634   5.000000
##          AIC
## 1         NA
## 2  102.41187
## 3   98.07040
## 4  107.95980
## 5   97.74404
## 6   64.31239
## 7  104.00908
## 8   67.63411
## 9   89.92954
## 10  99.52173
## 11  78.74499
## 12  63.90360
## 13  63.86629
## 14  64.61995
## 15  69.46829
## 16  65.83669
## 
## $Coef
##    p q Intercept       x1         x2         x3         x4
## 1  0 1  95.42308       NA         NA         NA         NA
## 2  1 2  81.47934 1.868748         NA         NA         NA
## 3  1 2  57.42368       NA  0.7891248         NA         NA
## 4  1 2 110.20266       NA         NA -1.2557813         NA
## 5  1 2 117.56793       NA         NA         NA -0.7381618
## 6  2 3  52.57735 1.468306  0.6622505         NA         NA
## 7  2 3  72.34899 2.312468         NA  0.4944682         NA
## 8  2 3 103.09738 1.439958         NA         NA -0.6139536
## 9  2 3  72.07467       NA  0.7313296 -1.0083862         NA
## 10 2 3  94.16007       NA  0.3109047         NA -0.4569419
## 11 2 3 131.28241       NA         NA -1.1998512 -0.7246001
## 12 3 4  48.19363 1.695890  0.6569149  0.2500176         NA
## 13 3 4  71.64831 1.451938  0.4161098         NA -0.2365402
## 14 3 4 111.68441 1.051854         NA -0.4100433 -0.6427961
## 15 3 4 203.64196       NA -0.9234160 -1.4479712 -1.5570449
## 16 4 5  62.40537 1.551103  0.5101676  0.1019094 -0.1440610
## 
## attr(,"class")
## [1] "modelSelect" "list"

El paquete MASS de R ofrece la busqueda de “el mejor modelo” lineal (y otras familias) con base en el criterio de Akaike.

La función stepAIC es la utilizada para la selección de modelo.

library(MASS)
modAIC <- stepAIC(lm(y~1,data = hald.data), scope = y~x1+x2+x3+x4,
                  direction = c("both","forward","backward")[2],trace = 3)
## Start:  AIC=71.44
## y ~ 1
## 
##        Df Sum of Sq     RSS    AIC
## + x4    1   1831.90  883.87 58.852
## + x2    1   1809.43  906.34 59.178
## + x1    1   1450.08 1265.69 63.519
## + x3    1    776.36 1939.40 69.067
## <none>              2715.76 71.444
## 
## Step:  AIC=58.85
## y ~ x4
## 
##        Df Sum of Sq    RSS    AIC
## + x1    1    809.10  74.76 28.742
## + x3    1    708.13 175.74 39.853
## <none>              883.87 58.852
## + x2    1     14.99 868.88 60.629
## 
## Step:  AIC=28.74
## y ~ x4 + x1
## 
##        Df Sum of Sq    RSS    AIC
## + x2    1    26.789 47.973 24.974
## + x3    1    23.926 50.836 25.728
## <none>              74.762 28.742
## 
## Step:  AIC=24.97
## y ~ x4 + x1 + x2
## 
##        Df Sum of Sq    RSS    AIC
## <none>              47.973 24.974
## + x3    1   0.10909 47.864 26.944
print(summary(modAIC))
## 
## Call:
## lm(formula = y ~ x4 + x1 + x2, data = hald.data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.0919 -1.8016  0.2562  1.2818  3.8982 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)  71.6483    14.1424   5.066 0.000675
## x4           -0.2365     0.1733  -1.365 0.205395
## x1            1.4519     0.1170  12.410 5.78e-07
## x2            0.4161     0.1856   2.242 0.051687
## 
## Residual standard error: 2.309 on 9 degrees of freedom
## Multiple R-squared:  0.9823, Adjusted R-squared:  0.9764 
## F-statistic: 166.8 on 3 and 9 DF,  p-value: 3.323e-08
print(anova(modAIC))
## Analysis of Variance Table
## 
## Response: y
##           Df  Sum Sq Mean Sq  F value    Pr(>F)
## x4         1 1831.90 1831.90 343.6758 1.771e-08
## x1         1  809.10  809.10 151.7934 6.150e-07
## x2         1   26.79   26.79   5.0259   0.05169
## Residuals  9   47.97    5.33