What data are we going to use?
"Cars93" is a list of cars on sale in the US in 1993. It's real world data made available to the public via the MASS package in R.
Photo: www.caranddriver.com
The data frame contains 93 rows and 27 columns.
More about the data:
Cars were selected at random from among 1993 passenger car models that were listed in both the Consumer Reports issue and the PACE Buying Guide. Pickup trucks and Sport/Utility vehicles were eliminated due to incomplete information in the Consumer Reports source. Duplicate models (e.g., Dodge Shadow and Plymouth Sundance) were listed at most once.
Manufacturer
Manufacturer (factor).
Model
Model (factor).
Type
Type: a factor with levels "Small", "Sporty", "Compact", "Midsize", "Large" and "Van".
Min.Price
Minimum Price (in $1,000): price for a basic version.
Price
Midrange Price (in $1,000): average of Min.Price and Max.Price.
Max.Price
Maximum Price (in $1,000): price for "a premium version".
MPG.city
City MPG (miles per US gallon by EPA rating).
MPG.highway
Highway MPG.
AirBags
Air Bags standard. Factor: none, driver only, or driver & passenger.
DriveTrain
Drive train type: rear wheel, front wheel or 4WD (factor).
Cylinders
Number of cylinders (missing for Mazda RX-7, which has a rotary engine).
EngineSize
Engine size (litres).
Horsepower
Horsepower (maximum).
RPM
RPM (revs per minute at maximum horsepower).
Rev.per.mile
Engine revolutions per mile (in highest gear).
Man.trans.avail
Is a manual transmission version available? (yes or no; factor).
Fuel.tank.capacity
Fuel tank capacity (US gallons).
Passengers
Passenger capacity (persons).
Length
Length (inches).
Wheelbase
Wheelbase (inches).
Width
Width (inches).
Turn.circle
U-turn space (feet).
Rear.seat.room
Rear seat room (inches) (missing for 2-seater vehicles).
Luggage.room
Luggage capacity (cubic feet) (missing for vans).
Weight
Weight (pounds).
Origin
Of non-USA or USA company origins? (factor).
Make
Combination of Manufacturer and Model (character).
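Before modeling, it helps to confirm the dimensions stated above and glance at the variables the regression below will use. A quick inspection sketch (it assumes only that the MASS package is installed):

```r
library(MASS)       # provides the Cars93 data frame
data("Cars93")

dim(Cars93)         # should report 93 rows and 27 columns

# Structure of just the columns used in the model below
str(Cars93[, c("Weight", "MPG.city", "EngineSize", "Horsepower", "Passengers")])

# Range of the response variable
summary(Cars93$Weight)
```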
Goal: fit a multiple linear regression model that predicts a car's Weight from MPG.city, EngineSize, Horsepower, and Passengers, then evaluate how well it predicts on a held-out test set using RMSE.
R Code:
> library(MASS)
> data("Cars93")
> attach(Cars93)
> model <- lm(Weight~MPG.city+EngineSize+Horsepower+Passengers)
> summary(model)
Call:
lm(formula = Weight ~ MPG.city + EngineSize + Horsepower + Passengers)
Residuals:
Min 1Q Median 3Q Max
-370.02 -131.60 10.57 108.59 530.42
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1764.4733 269.9261 6.537 3.97e-09 ***
MPG.city -28.9017 5.7062 -5.065 2.23e-06 ***
EngineSize 147.6183 33.1217 4.457 2.44e-05 ***
Horsepower 4.0593 0.6773 5.994 4.41e-08 ***
Passengers 192.1325 24.4279 7.865 8.72e-12 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 185.9 on 88 degrees of freedom
Multiple R-squared: 0.905, Adjusted R-squared: 0.9007
F-statistic: 209.6 on 4 and 88 DF, p-value: < 2.2e-16
> attributes(model)
$names
[1] "coefficients" "residuals" "effects" "rank" "fitted.values" "assign"
[7] "qr" "df.residual" "xlevels" "call" "terms" "model"
$class
[1] "lm"
> #2 ways to extract an attribute
> model$residuals
1 2 3 4 5 6 7 8
-131.603519 70.863050 116.373195 -74.660985 381.842685 -172.710463 -149.160791 78.066640
9 10 11 12 13 14 15 16
67.971697 -370.023995 -204.236201 -283.872897 11.127103 104.744171 113.387845 -125.194971
17 18 19 20 21 22 23 24
-147.545379 55.893820 -336.613541 67.557578 -182.740661 146.913169 -211.866518 -92.668774
25 26 27 28 29 30 31 32
-86.403312 67.657431 -5.305004 131.592684 -186.866518 -234.581267 -239.688287 -311.636456
33 34 35 36 37 38 39 40
-128.510096 187.089005 108.585239 27.676250 195.647707 102.657543 120.596725 207.524359
41 42 43 44 45 46 47 48
36.632804 395.395322 307.579942 -92.214601 -238.360353 -91.439108 -76.924467 -26.565166
49 50 51 52 53 54 55 56
111.275695 146.037575 58.628476 126.472239 61.096788 -159.705629 -38.457027 73.788676
57 58 59 60 61 62 63 64
10.574073 -94.328498 -4.242766 -60.380141 304.749653 -186.866518 262.268187 -24.695125
65 66 67 68 69 70 71 72
55.331135 426.124044 -10.537597 -90.203358 58.323718 -125.194971 -149.160791 132.570314
73 74 75 76 77 78 79 80
176.374295 -227.152615 104.744171 -39.758925 -124.160791 -250.397483 -46.400689 -7.715720
81 82 83 84 85 86 87 88
-143.640259 172.138413 83.110398 -299.568788 266.777960 88.236721 281.367186 -164.974384
89 90 91 92 93
530.422351 27.621635 -338.653293 64.521438 61.657591
> residuals(model)
1 2 3 4 5 6 7 8
-131.603519 70.863050 116.373195 -74.660985 381.842685 -172.710463 -149.160791 78.066640
9 10 11 12 13 14 15 16
67.971697 -370.023995 -204.236201 -283.872897 11.127103 104.744171 113.387845 -125.194971
17 18 19 20 21 22 23 24
-147.545379 55.893820 -336.613541 67.557578 -182.740661 146.913169 -211.866518 -92.668774
25 26 27 28 29 30 31 32
-86.403312 67.657431 -5.305004 131.592684 -186.866518 -234.581267 -239.688287 -311.636456
33 34 35 36 37 38 39 40
-128.510096 187.089005 108.585239 27.676250 195.647707 102.657543 120.596725 207.524359
41 42 43 44 45 46 47 48
36.632804 395.395322 307.579942 -92.214601 -238.360353 -91.439108 -76.924467 -26.565166
49 50 51 52 53 54 55 56
111.275695 146.037575 58.628476 126.472239 61.096788 -159.705629 -38.457027 73.788676
57 58 59 60 61 62 63 64
10.574073 -94.328498 -4.242766 -60.380141 304.749653 -186.866518 262.268187 -24.695125
65 66 67 68 69 70 71 72
55.331135 426.124044 -10.537597 -90.203358 58.323718 -125.194971 -149.160791 132.570314
73 74 75 76 77 78 79 80
176.374295 -227.152615 104.744171 -39.758925 -124.160791 -250.397483 -46.400689 -7.715720
81 82 83 84 85 86 87 88
-143.640259 172.138413 83.110398 -299.568788 266.777960 88.236721 281.367186 -164.974384
89 90 91 92 93
530.422351 27.621635 -338.653293 64.521438 61.657591
> plot(residuals(model))
> plot(Weight, model$fitted, type="p", col="blue")
> abline(0, 1) #reference line y = x for fitted vs. actual
> #Randomly sample 60% of the row indices of Cars93 for the training set
> trainindex <- sample(1:nrow(Cars93), size=round(nrow(Cars93)*.60))
> train <- Cars93[trainindex,]
> test <- Cars93[-trainindex,]
> attach(train)
> model <- lm(Weight~MPG.city+EngineSize+Horsepower+Passengers)
> summary(model)
Call:
lm(formula = Weight ~ MPG.city + EngineSize + Horsepower + Passengers)
Residuals:
Min 1Q Median 3Q Max
-427.74 -137.13 -10.58 121.31 474.01
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1459.043 353.166 4.131 0.000134 ***
MPG.city -19.732 7.286 -2.708 0.009185 **
EngineSize 201.967 50.050 4.035 0.000183 ***
Horsepower 4.098 1.058 3.873 0.000308 ***
Passengers 182.523 35.668 5.117 4.75e-06 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 192.6 on 51 degrees of freedom
Multiple R-squared: 0.9003, Adjusted R-squared: 0.8925
F-statistic: 115.1 on 4 and 51 DF, p-value: < 2.2e-16
> predict(model)
1 2 3 4 5 6 7 8 9 10 11
2743.276 2889.000 2573.372 3314.338 4047.745 2733.637 2835.799 2479.407 4127.354 3988.672 2795.499
12 13 14 15 16 17 18 19 20 21 22
2809.266 3845.693 3337.968 2468.260 2203.848 3589.101 2773.481 4167.904 3460.915 3034.933 2479.407
23 24 25 26 27 28 29 30 31 32 33
3156.623 2604.810 2391.323 1708.818 4193.964 3247.412 2997.536 2610.730 3136.930 3428.463 3218.895
34 35 36 37 38 39 40 41 42 43 44
3380.547 2997.830 3431.796 3643.438 2910.374 3453.052 2590.788 3532.352 2475.868 2098.178 2289.152
45 46 47 48 49 50 51 52 53 54 55
2081.347 2815.641 2974.892 2801.829 2434.326 3015.200 3449.667 3150.233 2914.642 3625.985 3054.540
56
3128.944
> model
Call:
lm(formula = Weight ~ MPG.city + EngineSize + Horsepower + Passengers)
Coefficients:
(Intercept) MPG.city EngineSize Horsepower Passengers
1459.043 -19.732 201.967 4.098 182.523
> #predict() on the training data gives the same values as the fitted values of the model
> attributes(model)
$names
[1] "coefficients" "residuals" "effects" "rank" "fitted.values" "assign"
[7] "qr" "df.residual" "xlevels" "call" "terms" "model"
$class
[1] "lm"
> fitted.values(model)
1 2 3 4 5 6 7 8 9 10 11
2743.276 2889.000 2573.372 3314.338 4047.745 2733.637 2835.799 2479.407 4127.354 3988.672 2795.499
12 13 14 15 16 17 18 19 20 21 22
2809.266 3845.693 3337.968 2468.260 2203.848 3589.101 2773.481 4167.904 3460.915 3034.933 2479.407
23 24 25 26 27 28 29 30 31 32 33
3156.623 2604.810 2391.323 1708.818 4193.964 3247.412 2997.536 2610.730 3136.930 3428.463 3218.895
34 35 36 37 38 39 40 41 42 43 44
3380.547 2997.830 3431.796 3643.438 2910.374 3453.052 2590.788 3532.352 2475.868 2098.178 2289.152
45 46 47 48 49 50 51 52 53 54 55
2081.347 2815.641 2974.892 2801.829 2434.326 3015.200 3449.667 3150.233 2914.642 3625.985 3054.540
56
3128.944
> fitted(model)
1 2 3 4 5 6 7 8 9 10 11
2743.276 2889.000 2573.372 3314.338 4047.745 2733.637 2835.799 2479.407 4127.354 3988.672 2795.499
12 13 14 15 16 17 18 19 20 21 22
2809.266 3845.693 3337.968 2468.260 2203.848 3589.101 2773.481 4167.904 3460.915 3034.933 2479.407
23 24 25 26 27 28 29 30 31 32 33
3156.623 2604.810 2391.323 1708.818 4193.964 3247.412 2997.536 2610.730 3136.930 3428.463 3218.895
34 35 36 37 38 39 40 41 42 43 44
3380.547 2997.830 3431.796 3643.438 2910.374 3453.052 2590.788 3532.352 2475.868 2098.178 2289.152
45 46 47 48 49 50 51 52 53 54 55
2081.347 2815.641 2974.892 2801.829 2434.326 3015.200 3449.667 3150.233 2914.642 3625.985 3054.540
56
3128.944
> #Input "new" data | |
> test.predict <- predict(model, newdata=test) | |
> #Check predictions | |
> test.actual <- Cars93$Weight[-trainindex] | |
> #To get errors, just subtract the two | |
> errors <- test.actual-test.predict | |
> #Plot to see our erros(residuals) | |
> #Our model was off by as much as 500 pounds | |
> plot(errors) | |
> #Calculate Mean Squared Error to see how well model performed | |
>#RMSE - same units as response variable. The lower, the better | |
>#RMSE is a good measure of how accurately the model predicts the response, | |
>#and is the most important criterion for fit if the main purpose of the model is prediction | |
> testerrors <- sqrt(mean(errors^2)) | |
> testerrors | |
[1] 188.7558 | |
> #Get MSE for original training data as well | |
> trainerrors <- sqrt(mean(model$residuals^2)) | |
> trainerrors | |
[1] 183.8388 |
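Because `sample()` draws a different training set on every run, the coefficients and RMSE values in the transcript above will not reproduce exactly. A sketch of the same workflow with a fixed seed, using the `data` argument instead of `attach()` (the seed value 42 is an arbitrary choice, not from the original session, so the numbers it produces will differ from those shown):

```r
library(MASS)
data("Cars93")

set.seed(42)  # arbitrary seed, chosen only so the split is reproducible
trainindex <- sample(1:nrow(Cars93), size = round(nrow(Cars93) * 0.60))
train <- Cars93[trainindex, ]
test  <- Cars93[-trainindex, ]

# Fit on the training rows; 'data =' keeps the variables scoped to the model
model <- lm(Weight ~ MPG.city + EngineSize + Horsepower + Passengers,
            data = train)

# RMSE in pounds on training and test data; a large gap would suggest overfitting
rmse <- function(actual, predicted) sqrt(mean((actual - predicted)^2))
trainerrors <- rmse(train$Weight, fitted(model))
testerrors  <- rmse(test$Weight, predict(model, newdata = test))
c(train = trainerrors, test = testerrors)
```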