Predicting the Weight of Cars Using Multiple Linear Regression

What data are we going to use?

"Cars93" is a list of cars on sale in the US in 1993. It's real world data made available to the public via the MASS package in R. 


The data frame contains 93 rows and 27 columns.

More about the data:
Cars were selected at random from among 1993 passenger car models that were listed in both the Consumer Reports issue and the PACE Buying Guide. Pickup trucks and Sport/Utility vehicles were eliminated due to incomplete information in the Consumer Reports source. Duplicate models (e.g., Dodge Shadow and Plymouth Sundance) were listed at most once.
Manufacturer
Manufacturer.

Model
Model.

Type
Type: a factor with levels "Small", "Sporty", "Compact", "Midsize", "Large" and "Van".

Min.Price
Minimum Price (in $1,000): price for a basic version.

Price
Midrange Price (in $1,000): average of Min.Price and Max.Price.

Max.Price
Maximum Price (in $1,000): price for a premium version.

MPG.city
City MPG (miles per US gallon by EPA rating).

MPG.highway
Highway MPG.

AirBags
Air Bags standard. Factor: none, driver only, or driver & passenger.

DriveTrain
Drive train type: rear wheel, front wheel or 4WD; (factor).

Cylinders
Number of cylinders (missing for Mazda RX-7, which has a rotary engine).

EngineSize
Engine size (litres).

Horsepower
Horsepower (maximum).

RPM
RPM (revs per minute at maximum horsepower).

Rev.per.mile
Engine revolutions per mile (in highest gear).

Man.trans.avail
Is a manual transmission version available? (yes or no, Factor).

Fuel.tank.capacity
Fuel tank capacity (US gallons).

Passengers
Passenger capacity (persons).

Length
Length (inches).

Wheelbase
Wheelbase (inches).

Width
Width (inches).

Turn.circle
U-turn space (feet).

Rear.seat.room
Rear seat room (inches) (missing for 2-seater vehicles).

Luggage.room
Luggage capacity (cubic feet) (missing for vans).

Weight
Weight (pounds).

Origin
Company origin: USA or non-USA (factor).

Make
Combination of Manufacturer and Model (character).
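
If you want to explore these columns yourself, the data loads straight from the MASS package:

> library(MASS)
> data("Cars93")
> dim(Cars93)
[1] 93 27
> str(Cars93)   #prints each column with its type and first few values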


Goal:

Predict the weight of each car from its city miles per gallon, engine size, horsepower, and passenger capacity.
Short Answer:

R Code:

> library(MASS)
> data("Cars93")
> attach(Cars93)
> model <- lm(Weight~MPG.city+EngineSize+Horsepower+Passengers)
> summary(model)
Call:
lm(formula = Weight ~ MPG.city + EngineSize + Horsepower + Passengers)
Residuals:
Min 1Q Median 3Q Max
-370.02 -131.60 10.57 108.59 530.42
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1764.4733 269.9261 6.537 3.97e-09 ***
MPG.city -28.9017 5.7062 -5.065 2.23e-06 ***
EngineSize 147.6183 33.1217 4.457 2.44e-05 ***
Horsepower 4.0593 0.6773 5.994 4.41e-08 ***
Passengers 192.1325 24.4279 7.865 8.72e-12 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 185.9 on 88 degrees of freedom
Multiple R-squared: 0.905, Adjusted R-squared: 0.9007
F-statistic: 209.6 on 4 and 88 DF, p-value: < 2.2e-16
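> #From the summary above, the fitted equation (coefficients rounded) is:
> #Weight = 1764.47 - 28.90*MPG.city + 147.62*EngineSize + 4.06*Horsepower + 192.13*Passengers
> #As a quick illustration, predict a hypothetical car (values made up, not a row of Cars93)
> new_car <- data.frame(MPG.city=25, EngineSize=2.0, Horsepower=140, Passengers=5)
> new_weight <- predict(model, newdata=new_car)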
> attributes(model)
$names
[1] "coefficients" "residuals" "effects" "rank" "fitted.values" "assign"
[7] "qr" "df.residual" "xlevels" "call" "terms" "model"
$class
[1] "lm"
> #Two ways to extract an attribute (here, the residuals)
> model$residuals
1 2 3 4 5 6 7 8
-131.603519 70.863050 116.373195 -74.660985 381.842685 -172.710463 -149.160791 78.066640
9 10 11 12 13 14 15 16
67.971697 -370.023995 -204.236201 -283.872897 11.127103 104.744171 113.387845 -125.194971
17 18 19 20 21 22 23 24
-147.545379 55.893820 -336.613541 67.557578 -182.740661 146.913169 -211.866518 -92.668774
25 26 27 28 29 30 31 32
-86.403312 67.657431 -5.305004 131.592684 -186.866518 -234.581267 -239.688287 -311.636456
33 34 35 36 37 38 39 40
-128.510096 187.089005 108.585239 27.676250 195.647707 102.657543 120.596725 207.524359
41 42 43 44 45 46 47 48
36.632804 395.395322 307.579942 -92.214601 -238.360353 -91.439108 -76.924467 -26.565166
49 50 51 52 53 54 55 56
111.275695 146.037575 58.628476 126.472239 61.096788 -159.705629 -38.457027 73.788676
57 58 59 60 61 62 63 64
10.574073 -94.328498 -4.242766 -60.380141 304.749653 -186.866518 262.268187 -24.695125
65 66 67 68 69 70 71 72
55.331135 426.124044 -10.537597 -90.203358 58.323718 -125.194971 -149.160791 132.570314
73 74 75 76 77 78 79 80
176.374295 -227.152615 104.744171 -39.758925 -124.160791 -250.397483 -46.400689 -7.715720
81 82 83 84 85 86 87 88
-143.640259 172.138413 83.110398 -299.568788 266.777960 88.236721 281.367186 -164.974384
89 90 91 92 93
530.422351 27.621635 -338.653293 64.521438 61.657591
> residuals(model)
1 2 3 4 5 6 7 8
-131.603519 70.863050 116.373195 -74.660985 381.842685 -172.710463 -149.160791 78.066640
9 10 11 12 13 14 15 16
67.971697 -370.023995 -204.236201 -283.872897 11.127103 104.744171 113.387845 -125.194971
17 18 19 20 21 22 23 24
-147.545379 55.893820 -336.613541 67.557578 -182.740661 146.913169 -211.866518 -92.668774
25 26 27 28 29 30 31 32
-86.403312 67.657431 -5.305004 131.592684 -186.866518 -234.581267 -239.688287 -311.636456
33 34 35 36 37 38 39 40
-128.510096 187.089005 108.585239 27.676250 195.647707 102.657543 120.596725 207.524359
41 42 43 44 45 46 47 48
36.632804 395.395322 307.579942 -92.214601 -238.360353 -91.439108 -76.924467 -26.565166
49 50 51 52 53 54 55 56
111.275695 146.037575 58.628476 126.472239 61.096788 -159.705629 -38.457027 73.788676
57 58 59 60 61 62 63 64
10.574073 -94.328498 -4.242766 -60.380141 304.749653 -186.866518 262.268187 -24.695125
65 66 67 68 69 70 71 72
55.331135 426.124044 -10.537597 -90.203358 58.323718 -125.194971 -149.160791 132.570314
73 74 75 76 77 78 79 80
176.374295 -227.152615 104.744171 -39.758925 -124.160791 -250.397483 -46.400689 -7.715720
81 82 83 84 85 86 87 88
-143.640259 172.138413 83.110398 -299.568788 266.777960 88.236721 281.367186 -164.974384
89 90 91 92 93
530.422351 27.621635 -338.653293 64.521438 61.657591
> plot(residuals(model))
> plot(Weight, model$fitted, type="p", col="blue")
> abline(0, 1)   #reference line where fitted = actual weight
> #Randomly sample 60% of the rows of Cars93 as a training set; round() keeps the row count whole
> trainindex <- sample(1:nrow(Cars93), size=round(nrow(Cars93)*.60))
> train <- Cars93[trainindex,]
> test <- Cars93[-trainindex,]
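> #Note: sample() is random, so this split (and every number below) will change
> #from run to run; calling set.seed() with any fixed value before sample()
> #makes the partition reproducible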
> attach(train)
> model <- lm(Weight~MPG.city+EngineSize+Horsepower+Passengers)
> summary(model)
Call:
lm(formula = Weight ~ MPG.city + EngineSize + Horsepower + Passengers)
Residuals:
Min 1Q Median 3Q Max
-427.74 -137.13 -10.58 121.31 474.01
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1459.043 353.166 4.131 0.000134 ***
MPG.city -19.732 7.286 -2.708 0.009185 **
EngineSize 201.967 50.050 4.035 0.000183 ***
Horsepower 4.098 1.058 3.873 0.000308 ***
Passengers 182.523 35.668 5.117 4.75e-06 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 192.6 on 51 degrees of freedom
Multiple R-squared: 0.9003, Adjusted R-squared: 0.8925
F-statistic: 115.1 on 4 and 51 DF, p-value: < 2.2e-16
> predict(model)
1 2 3 4 5 6 7 8 9 10 11
2743.276 2889.000 2573.372 3314.338 4047.745 2733.637 2835.799 2479.407 4127.354 3988.672 2795.499
12 13 14 15 16 17 18 19 20 21 22
2809.266 3845.693 3337.968 2468.260 2203.848 3589.101 2773.481 4167.904 3460.915 3034.933 2479.407
23 24 25 26 27 28 29 30 31 32 33
3156.623 2604.810 2391.323 1708.818 4193.964 3247.412 2997.536 2610.730 3136.930 3428.463 3218.895
34 35 36 37 38 39 40 41 42 43 44
3380.547 2997.830 3431.796 3643.438 2910.374 3453.052 2590.788 3532.352 2475.868 2098.178 2289.152
45 46 47 48 49 50 51 52 53 54 55
2081.347 2815.641 2974.892 2801.829 2434.326 3015.200 3449.667 3150.233 2914.642 3625.985 3054.540
56
3128.944
> model
Call:
lm(formula = Weight ~ MPG.city + EngineSize + Horsepower + Passengers)
Coefficients:
(Intercept) MPG.city EngineSize Horsepower Passengers
1459.043 -19.732 201.967 4.098 182.523
> #predict(model) above returns the same values as the fitted values of the model
> attributes(model)
$names
[1] "coefficients" "residuals" "effects" "rank" "fitted.values" "assign"
[7] "qr" "df.residual" "xlevels" "call" "terms" "model"
$class
[1] "lm"
> fitted.values(model)
1 2 3 4 5 6 7 8 9 10 11
2743.276 2889.000 2573.372 3314.338 4047.745 2733.637 2835.799 2479.407 4127.354 3988.672 2795.499
12 13 14 15 16 17 18 19 20 21 22
2809.266 3845.693 3337.968 2468.260 2203.848 3589.101 2773.481 4167.904 3460.915 3034.933 2479.407
23 24 25 26 27 28 29 30 31 32 33
3156.623 2604.810 2391.323 1708.818 4193.964 3247.412 2997.536 2610.730 3136.930 3428.463 3218.895
34 35 36 37 38 39 40 41 42 43 44
3380.547 2997.830 3431.796 3643.438 2910.374 3453.052 2590.788 3532.352 2475.868 2098.178 2289.152
45 46 47 48 49 50 51 52 53 54 55
2081.347 2815.641 2974.892 2801.829 2434.326 3015.200 3449.667 3150.233 2914.642 3625.985 3054.540
56
3128.944
> fitted(model)
1 2 3 4 5 6 7 8 9 10 11
2743.276 2889.000 2573.372 3314.338 4047.745 2733.637 2835.799 2479.407 4127.354 3988.672 2795.499
12 13 14 15 16 17 18 19 20 21 22
2809.266 3845.693 3337.968 2468.260 2203.848 3589.101 2773.481 4167.904 3460.915 3034.933 2479.407
23 24 25 26 27 28 29 30 31 32 33
3156.623 2604.810 2391.323 1708.818 4193.964 3247.412 2997.536 2610.730 3136.930 3428.463 3218.895
34 35 36 37 38 39 40 41 42 43 44
3380.547 2997.830 3431.796 3643.438 2910.374 3453.052 2590.788 3532.352 2475.868 2098.178 2289.152
45 46 47 48 49 50 51 52 53 54 55
2081.347 2815.641 2974.892 2801.829 2434.326 3015.200 3449.667 3150.233 2914.642 3625.985 3054.540
56
3128.944
> #Input "new" data
> test.predict <- predict(model, newdata=test)
> #Check predictions
> test.actual <- Cars93$Weight[-trainindex]
> #To get errors, just subtract the two
> errors <- test.actual-test.predict
> #Plot our errors (residuals)
> #Our model was off by as much as 500 pounds
> plot(errors)
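> #max(abs(errors)) would give the single largest miss in pounds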
> #Calculate Root Mean Squared Error (RMSE) to see how well the model performed
> #RMSE is in the same units as the response variable; the lower, the better
> #RMSE is a good measure of how accurately the model predicts the response,
> #and is the most important criterion for fit when the main purpose of the model is prediction
> testerrors <- sqrt(mean(errors^2))
> testerrors
[1] 188.7558
> #Get the RMSE for the original training data as well
> trainerrors <- sqrt(mean(model$residuals^2))
> trainerrors
[1] 183.8388
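
For reference, the whole workflow above can be condensed into one reproducible script. This is only a sketch of the same steps: the seed value and the 60/40 split are arbitrary choices, and passing data= to lm() avoids attach().

library(MASS)
data("Cars93")

set.seed(1)   #arbitrary seed so the random split is repeatable
trainindex <- sample(seq_len(nrow(Cars93)), size = round(nrow(Cars93) * 0.60))
train <- Cars93[trainindex, ]
test  <- Cars93[-trainindex, ]

#Fit the model on the training rows only
model <- lm(Weight ~ MPG.city + EngineSize + Horsepower + Passengers, data = train)
summary(model)

#RMSE on the training data and on the held-out test data
train_rmse <- sqrt(mean(residuals(model)^2))
test_rmse  <- sqrt(mean((test$Weight - predict(model, newdata = test))^2))
c(train = train_rmse, test = test_rmse)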
