- What is a model?
2016-02-25
```r
library(readr)
realestate <- read_csv("http://dicook.github.io/Monash-R/4-Modelling/data/realestate.csv")
str(realestate)
#> Classes 'tbl_df', 'tbl' and 'data.frame': 522 obs. of 19 variables:
#>  $ id          : int  1 2 3 4 5 6 7 8 9 10 ...
#>  $ price       : int  360000 340000 250000 205500 275500 248000 229900 150000 195000 160000 ...
#>  $ sqft        : int  3032 2058 1780 1638 2196 1966 2216 1597 1622 1976 ...
#>  $ bed         : int  4 4 4 4 4 4 3 2 3 3 ...
#>  $ bath        : int  4 2 3 2 3 3 2 1 2 3 ...
#>  $ ac          : chr  "yes" "yes" "yes" "yes" ...
#>  $ cars        : int  2 2 2 2 2 5 2 1 2 1 ...
#>  $ pool        : chr  "no" "no" "no" "no" ...
#>  $ built       : int  1972 1976 1980 1963 1968 1972 1972 1955 1975 1918 ...
#>  $ quality     : chr  "medium" "medium" "medium" "medium" ...
#>  $ style       : int  1 1 1 1 7 1 7 1 1 1 ...
#>  $ lot         : int  22221 22912 21345 17342 21786 18902 18639 22112 14321 32358 ...
#>  $ highway     : chr  "no" "no" "no" "no" ...
#>  $ fitted.m3   : num  12.9 12.4 12.2 12.1 12.5 ...
#>  $ fitted.m4   : num  12.7 12.4 12.3 12.2 12.5 ...
#>  $ lot.discrete: chr  "large" "large" "large" "large" ...
#>  $ resid.m5    : num  -0.0212 0.3448 0.0735 0.1227 0.0938 ...
#>  $ fitted.m5   : num  12.8 12.4 12.4 12.1 12.4 ...
#>  $ resid.m5b   : num  -30705 70168 7858 28684 3620 ...
```
The `realestate` data set consists of observations on the sale price of 522 homes along with numerous characteristics of the home and property. We might be interested in how the price is affected by these characteristics.
```r
library(GGally)
ggpairs(realestate[, 2:7])
```
Fitting a regression model in R is accomplished through the `lm` function:
```r
lm(formula, data, weights, na.action)
```
A formula has the form `y ~ x1 + x2 + x3`, where `y` is the dependent variable (here, price) and `x1`, `x2`, `x3` are the covariates.
```r
lm(price ~ sqft, data=realestate)
#>
#> Call:
#> lm(formula = price ~ sqft, data = realestate)
#>
#> Coefficients:
#> (Intercept)         sqft
#>      -81433          159
```
The estimate for the intercept is the average price of a 0 sqft home … which doesn't make much sense :)
… but the price increases as the square footage increases … that DOES make sense
(1 sq ft = 0.093 sq m or 1 sq m = 10.8 sq ft)
Problem: the variance of the prices increases with the size of the homes.
Solution: use a transformation (square root or log) for the price
Using a log transformation, variability stabilizes (not perfect, but better)
The first model then becomes:
```r
lm(log(price) ~ sqft, data=realestate)
#>
#> Call:
#> lm(formula = log(price) ~ sqft, data = realestate)
#>
#> Coefficients:
#> (Intercept)         sqft
#>   1.128e+01    5.097e-04
```
Models are objects, too, so save the model into a named object. `summary` provides a nicer overview of the model output:
```r
m1 <- lm(log(price) ~ sqft, data=realestate)
summary(m1)
#>
#> Call:
#> lm(formula = log(price) ~ sqft, data = realestate)
#>
#> Residuals:
#>      Min       1Q   Median       3Q      Max
#> -0.76772 -0.14834 -0.01448  0.11921  0.84182
#>
#> Coefficients:
#>              Estimate Std. Error t value Pr(>|t|)
#> (Intercept) 1.128e+01  3.427e-02  329.24   <2e-16 ***
#> sqft        5.097e-04  1.446e-05   35.24   <2e-16 ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Residual standard error: 0.2347 on 520 degrees of freedom
#> Multiple R-squared:  0.7049, Adjusted R-squared:  0.7043
#> F-statistic:  1242 on 1 and 520 DF,  p-value: < 2.2e-16
```
`update` uses an existing model object and allows changes to the effects. The formula `. ~ . + ac` keeps the left hand side (the `.` to the left of the `~`) the same and adds `ac` to the existing right hand side:
```r
m2 <- update(m1, . ~ . + ac)
```
```r
summary(m2)
#>
#> Call:
#> lm(formula = log(price) ~ sqft + ac, data = realestate)
#>
#> Residuals:
#>     Min      1Q  Median      3Q     Max
#> -0.7323 -0.1407 -0.0204  0.1180  0.8196
#>
#> Coefficients:
#>              Estimate Std. Error t value Pr(>|t|)
#> (Intercept) 1.120e+01  3.615e-02  309.88  < 2e-16 ***
#> sqft        4.872e-04  1.457e-05   33.45  < 2e-16 ***
#> acyes       1.589e-01  2.764e-02    5.75 1.52e-08 ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Residual standard error: 0.2278 on 519 degrees of freedom
#> Multiple R-squared:  0.7226, Adjusted R-squared:  0.7215
#> F-statistic: 675.9 on 2 and 519 DF,  p-value: < 2.2e-16
```
R can deal with both categorical and quantitative variables in `lm`. Only the coefficient for houses with AC (`acyes`) is shown; `acno` is used as the baseline and is by default set to 0.
```r
options()$contrasts
#>         unordered           ordered
#> "contr.treatment"      "contr.poly"
```
```r
?contr.treatment
```
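Treatment contrasts use the first factor level as the baseline. To make houses with AC the baseline instead, we could relevel the factor. A minimal sketch (the `ac2` variable and model `m2b` are illustrative, not part of the original analysis):

```r
# Make "yes" the reference level; lm() then reports a coefficient
# `ac2no`, the effect of *not* having AC (illustrative only)
realestate$ac2 <- relevel(factor(realestate$ac), ref = "yes")
m2b <- update(m2, . ~ . - ac + ac2)
coef(m2b)
```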
!!! Beware transformations !!! They sometimes make interpretation tricky.
The log of price is expected to be higher by 1.589e-01 for houses with AC … which is the same as saying that the price of a house with AC is on average exp(1.589e-01) = 1.172221 times that of the same house without AC (i.e. AC leads on average to a 17% increase in price).
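We can let R do this back-transformation directly from the fitted model:

```r
# Pull the AC coefficient out of m2 and back-transform from the
# log scale to a multiplicative price effect
exp(coef(m2)["acyes"])
#>    acyes
#> 1.172221
```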
Is model `m2` actually an improvement over `m1`?
Statistical answer:
```r
anova(m1, m2)
#> Analysis of Variance Table
#>
#> Model 1: log(price) ~ sqft
#> Model 2: log(price) ~ sqft + ac
#>   Res.Df    RSS Df Sum of Sq      F    Pr(>F)
#> 1    520 28.648
#> 2    519 26.932  1    1.7157 33.063 1.524e-08 ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
```
Yes, duh.
```r
qplot(sqft, log(price), colour=ac, data=realestate) + geom_smooth(method="lm")
```
Yes, but there are not that many houses without AC, so `ac` might not matter terribly much. Visualization gives us a more nuanced answer than the test.
Trying to capture the curvature in `log(price)`:
```r
m3 <- lm(log(price) ~ poly(sqft, 2) + ac, data=realestate)
anova(m2, m3)
#> Analysis of Variance Table
#>
#> Model 1: log(price) ~ sqft + ac
#> Model 2: log(price) ~ poly(sqft, 2) + ac
#>   Res.Df    RSS Df Sum of Sq      F    Pr(>F)
#> 1    519 26.932
#> 2    518 25.274  1    1.6584 33.991 9.751e-09 ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
```
```r
realestate$fitted.m3 <- fitted(m3)
ggplot(data=realestate, aes(y=log(price), x=sqft, colour=ac)) +
  geom_point() + geom_line(aes(y=fitted.m3))
```
Extend model `m3` of the `realestate` data some more and try to find at least two effects that significantly affect prices. Visualize the impact those additional effects have on the model.
- `x:y` specifies the interaction between variables `x` and `y`
- `x*y` specifies the main effects `x`, `y` and their interaction
- `(a + b + c)^2` specifies all main effects and pairwise interactions
- `(a + b + c)^2 - b:c` specifies all main effects and pairwise interactions, except the interaction between `b` and `c`
- `-1` removes the term for the intercept
Suppose we have the model:
```r
m5 <- lm(log(price) ~ sqft + bath + built + log(lot), data = realestate)
```
Suppose a new house is going on the market and the seller wants to use this model to get an idea of a reasonable listing price. This house is 2123 sq. ft. with 4 bedrooms, 2 bathrooms, a 20590 sq. ft. lot, and was built in 1963.
```r
new_house <- data.frame(sqft = 2123, bath = 2, built = 1963, lot = 20590)
predict(m5, newdata = new_house, interval = "prediction", level = 0.95) # log(price)!
#>        fit     lwr      upr
#> 1 12.31436 11.9268 12.70192
exp(predict(m5, newdata = new_house, interval = "prediction", level = 0.95))/1000
#>        fit      lwr      upr
#> 1 222.8737 151.2661 328.3794
```
Using linear regression models we inherently make the following assumptions:
- the relationship between the response and the covariates is linear
- the errors are independent, with mean zero and equal variance
- the errors are normally distributed
Also generally want to check for collinearity and influential points
We can assess the assumption of normality using a normal probability plot of the residuals (sometimes called a normal qq plot). Idea of the normal qq plot: plot the sorted residuals against the corresponding quantiles of a standard normal distribution; if the residuals are normally distributed, the points fall along a straight line.
```r
m5 <- lm(log(price) ~ sqft + bath + built + log(lot), data = realestate)
realestate$resid.m5 <- residuals(m5)
realestate$fitted.m5 <- fitted(m5)
ggplot(aes(sample = resid.m5), data=realestate) + geom_qq() + theme(aspect.ratio=1)
```
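To see what `geom_qq` is doing, here is a minimal hand-rolled version (assuming its default of standard normal theoretical quantiles; `ppoints` generates the evenly spaced probabilities):

```r
# Pair the sorted residuals with the corresponding standard normal
# quantiles; points along a straight line suggest normality
n <- nrow(realestate)
qq <- data.frame(theoretical = qnorm(ppoints(n)),
                 sample      = sort(realestate$resid.m5))
qplot(theoretical, sample, data = qq) + theme(aspect.ratio = 1)
```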
The wavy pattern indicates a slightly heavy tail in the residuals.
For comparison, the residuals from the model of the untransformed price:
```r
m5b <- lm(price ~ sqft + bath + built + log(lot), data = realestate)
realestate$resid.m5b <- residuals(m5b)
ggplot(aes(sample = resid.m5b), data=realestate) + geom_qq() + theme(aspect.ratio=1)
```
We can assess the assumptions of equal error variance and linearity using residual plots. Idea of residual plots: plot the residuals against the fitted values; if the model is appropriate, the points scatter randomly around zero with roughly constant spread.
```r
qplot(data=realestate, x=fitted.m5, y=resid.m5) +
  xlab("Fitted Values") + ylab("Residuals") +
  geom_hline(yintercept=0) + geom_smooth()
```
Linearity and equal variance look pretty good
Many R packages are available for specific statistical diagnostic tests. The `lmtest` package contains many of these, such as the Breusch-Pagan test for equal error variance:
```r
library(lmtest)
bptest(m5)
#>
#>  studentized Breusch-Pagan test
#>
#> data:  m5
#> BP = 28.123, df = 4, p-value = 1.178e-05
```
… so there's still heteroskedasticity in those errors … darn!
The Box-Cox method for transformations selects a data transform based on the relationship between the observed mean and variance:
\[ y_i^{(\lambda)} = \begin{cases} \dfrac{y_i^\lambda - 1}{\lambda} & \text{if } \lambda \neq 0, \\[8pt] \ln{(y_i)} & \text{if } \lambda = 0, \end{cases} \]
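As a sketch of what this transformation does in code (a hypothetical helper, not the MASS implementation):

```r
# Box-Cox transform of y for a single lambda; log when lambda is 0
box_cox <- function(y, lambda) {
  if (lambda == 0) log(y) else (y^lambda - 1) / lambda
}
box_cox(c(100, 200, 400), lambda = 0.5)  # behaves like a scaled square root
box_cox(c(100, 200, 400), lambda = 0)    # identical to log()
```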
Use the `boxcox()` function from the MASS package to ballpark a \(\lambda\):
```r
library(MASS)
bc <- boxcox(m5b)
```
```r
str(bc)
#> List of 2
#>  $ x: num [1:100] -2 -1.96 -1.92 -1.88 -1.84 ...
#>  $ y: num [1:100] -1030 -1019 -1008 -998 -988 ...
lambda <- bc$x[which.max(bc$y)]
```
Now re-fit model with transformed price …
```r
m5c <- lm(price^lambda ~ sqft + bath + built + log(lot), data=realestate)
bptest(m5c) # better, but still significant
#>
#>  studentized Breusch-Pagan test
#>
#> data:  m5c
#> BP = 14.856, df = 4, p-value = 0.00501
```
```r
qplot(fitted(m5c), residuals(m5c)) + geom_hline(yintercept=0) + geom_smooth()
```
Interpretation of model coefficients has become even more tricky …
- on the model level: AIC, BIC, \(R^2\), adjusted \(R^2\), …
- on the coefficient level: VIF (Variance Inflation Factors), …
- on the point level: leverage, Cook's distance, …
R can calculate every single one of them … and more …
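For example (a quick sketch; `vif` assumes the `car` package is installed, the rest is base R):

```r
AIC(m5)                   # model level
BIC(m5)
car::vif(m5)              # coefficient level: variance inflation factors
head(cooks.distance(m5))  # point level: influence of single observations
head(hatvalues(m5))       # point level: leverage
```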
Hans Rosling explains this really well in his TED talk
gapminder software
gapminder also makes the data available (in R through the package `gapminder`)
```r
library(gapminder)
qplot(data=gapminder, year, lifeExp, geom="line", group=country)
```
How would you describe this plot?
For this we need the packages `tidyr` and `purrr`, i.e. run the following code:
```r
install.packages("tidyr")
install.packages("purrr")
library(tidyr)
library(purrr)
library(dplyr)  # also needed below, for %>% and mutate()
```
```r
oz <- subset(gapminder, country=="Australia")
head(oz)
#> Source: local data frame [6 x 6]
#>
#>     country continent  year lifeExp      pop gdpPercap
#>      (fctr)    (fctr) (int)   (dbl)    (int)     (dbl)
#> 1 Australia   Oceania  1952   69.12  8691212  10039.60
#> 2 Australia   Oceania  1957   70.33  9712569  10949.65
#> 3 Australia   Oceania  1962   70.93 10794968  12217.23
#> 4 Australia   Oceania  1967   71.10 11872264  14526.12
#> 5 Australia   Oceania  1972   71.93 13177000  16788.63
#> 6 Australia   Oceania  1977   73.49 14074100  18334.20
```
```r
qplot(data=oz, year, lifeExp) + geom_smooth(method="lm")
```
```r
oz.lm <- lm(lifeExp ~ year, data=oz)
oz.lm
#>
#> Call:
#> lm(formula = lifeExp ~ year, data = oz)
#>
#> Coefficients:
#> (Intercept)         year
#>   -376.1163       0.2277
```
The intercept is the estimated life expectancy at 0 BC … let's use 1950 for the first value instead:
```r
gapminder <- gapminder %>% mutate(year = year - 1950)
```
We don't want to subset the data for every country … `nest()` makes a data frame part of another data frame:
```r
by_country <- gapminder %>% group_by(continent, country) %>% nest()
head(by_country)
#> Source: local data frame [6 x 3]
#>
#>   continent     country            data
#>      (fctr)      (fctr)           (chr)
#> 1      Asia Afghanistan <tbl_df [12,4]>
#> 2    Europe     Albania <tbl_df [12,4]>
#> 3    Africa     Algeria <tbl_df [12,4]>
#> 4    Africa      Angola <tbl_df [12,4]>
#> 5  Americas   Argentina <tbl_df [12,4]>
#> 6   Oceania   Australia <tbl_df [12,4]>
```
Each element of the `data` variable in `by_country` is a dataset:
```r
head(by_country$data[[1]])
#> Source: local data frame [6 x 4]
#>
#>    year lifeExp      pop gdpPercap
#>   (dbl)   (dbl)    (int)     (dbl)
#> 1     2  28.801  8425333  779.4453
#> 2     7  30.332  9240934  820.8530
#> 3    12  31.997 10267083  853.1007
#> 4    17  34.020 11537966  836.1971
#> 5    22  36.088 13079460  739.9811
#> 6    27  38.438 14880372  786.1134
lm(lifeExp ~ year, data=by_country$data[[1]])
#>
#> Call:
#> lm(formula = lifeExp ~ year, data = by_country$data[[1]])
#>
#> Coefficients:
#> (Intercept)         year
#>     29.3566       0.2753
```
Now we are using the `map` function from the package `purrr`. `map` allows us to apply a function to each element of a list.
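A toy example before we apply it to the nested data frames (the formula `~ .x^2` is purrr shorthand for `function(x) x^2`):

```r
# Square each element of a plain list; returns list(1, 4, 9)
purrr::map(list(1, 2, 3), ~ .x^2)
```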
```r
by_country$model <- by_country$data %>% purrr::map(~lm(lifeExp ~ year, data=.))
head(by_country)
#> Source: local data frame [6 x 4]
#>
#>   continent     country            data   model
#>      (fctr)      (fctr)           (chr)   (chr)
#> 1      Asia Afghanistan <tbl_df [12,4]> <S3:lm>
#> 2    Europe     Albania <tbl_df [12,4]> <S3:lm>
#> 3    Africa     Algeria <tbl_df [12,4]> <S3:lm>
#> 4    Africa      Angola <tbl_df [12,4]> <S3:lm>
#> 5  Americas   Argentina <tbl_df [12,4]> <S3:lm>
#> 6   Oceania   Australia <tbl_df [12,4]> <S3:lm>
```
The `broom` package allows us to extract values from models on three levels:
- `broom::glance`: model level
- `broom::tidy`: coefficient level
- `broom::augment`: point level
```r
library(broom)
broom::glance(by_country$model[[1]])
#>   r.squared adj.r.squared    sigma statistic      p.value df    logLik
#> 1 0.9477123     0.9424835 1.222788  181.2494 9.835213e-08  2 -18.34693
#>        AIC      BIC deviance df.residual
#> 1 42.69387 44.14859  14.9521          10
broom::tidy(by_country$model[[1]])
#>          term   estimate  std.error statistic      p.value
#> 1 (Intercept) 29.3566375 0.69898128  41.99918 1.404235e-12
#> 2        year  0.2753287 0.02045093  13.46289 9.835213e-08
```
```r
broom::augment(by_country$model[[1]])
#>    lifeExp year  .fitted   .se.fit      .resid       .hat   .sigma
#> 1   28.801    2 29.90729 0.6639995 -1.10629487 0.29487179 1.211813
#> 2   30.332    7 31.28394 0.5799442 -0.95193823 0.22494172 1.237512
#> 3   31.997   12 32.66058 0.5026799 -0.66358159 0.16899767 1.265886
#> 4   34.020   17 34.03722 0.4358337 -0.01722494 0.12703963 1.288917
#> 5   36.088   22 35.41387 0.3848726  0.67413170 0.09906760 1.267003
#> 6   38.438   27 36.79051 0.3566719  1.64748834 0.08508159 1.154002
#> 7   39.854   32 38.16716 0.3566719  1.68684499 0.08508159 1.147076
#> 8   40.822   37 39.54380 0.3848726  1.27820163 0.09906760 1.208243
#> 9   41.674   42 40.92044 0.4358337  0.75355828 0.12703963 1.260583
#> 10  41.763   47 42.29709 0.5026799 -0.53408508 0.16899767 1.274051
#> 11  42.129   52 43.67373 0.5799442 -1.54472844 0.22494172 1.148593
#> 12  43.828   57 45.05037 0.6639995 -1.22237179 0.29487179 1.194109
#>         .cooksd  .std.resid
#> 1  2.427205e-01 -1.07742164
#> 2  1.134714e-01 -0.88428127
#> 3  3.603567e-02 -0.59530844
#> 4  1.653992e-05 -0.01507681
#> 5  1.854831e-02  0.58082792
#> 6  9.225358e-02  1.40857509
#> 7  9.671389e-02  1.44222437
#> 8  6.668277e-02  1.10129103
#> 9  3.165567e-02  0.65958143
#> 10 2.334344e-02 -0.47913530
#> 11 2.987950e-01 -1.43494020
#> 12 2.963271e-01 -1.19046907
```
Extract all countries automatically (hello, `map` again!):
```r
by_country_coefs <- by_country %>% unnest(model %>% purrr::map(broom::tidy))
coefs <- by_country_coefs %>%
  select(country, continent, term, estimate) %>%
  spread(term, estimate)
```
and finally, a visualization:
```r
qplot(`(Intercept)`, year, colour=continent, data=coefs)
```
```r
by_country <- by_country %>% unnest(model %>% purrr::map(broom::glance))
by_country$country <- reorder(by_country$country, by_country$r.squared)
qplot(country, r.squared, colour=continent, data=by_country) +
  theme(axis.text.x=element_text(angle=-90, hjust=0)) +
  scale_colour_brewer(palette="Dark2")
```
```r
country_all <- by_country %>% unnest(data)
qplot(year+1950, lifeExp, data=subset(country_all, r.squared <= 0.45), geom="line") +
  facet_wrap(~country)
```
What do the patterns mean?
```r
qplot(year+1950, lifeExp, data=subset(country_all, between(r.squared, 0.45, 0.75)),
      geom="line") + facet_wrap(~country)
```
Extract the residuals for each of the models and store them in a dataset together with the country and continent information. Plot the residuals across the years and fit a smooth. What does the pattern mean? One possible starting point is sketched below.
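A sketch only (it assumes `by_country` still holds the `model` list-column, and uses `broom::augment`, which returns the residuals in a `.resid` column):

```r
# Point-level results (including residuals) for every country's model
by_country_resid <- by_country %>%
  unnest(model %>% purrr::map(broom::augment))
ggplot(by_country_resid, aes(x = year + 1950, y = .resid)) +
  geom_line(aes(group = country), alpha = 0.3) +  # one line per country
  geom_smooth(se = FALSE)                         # overall smooth
```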
This work is licensed under the Creative Commons Attribution-Noncommercial 3.0 United States License. To view a copy of this license, visit http://creativecommons.org/licenses/by-nc/3.0/us/ or send a letter to Creative Commons, 171 Second Street, Suite 300, San Francisco, California, 94105, USA.