- What is a model?
2016-02-25
```r
library(readr)
realestate <- read_csv("http://dicook.github.io/Monash-R/4-Modelling/data/realestate.csv")
str(realestate)
#> Classes 'tbl_df', 'tbl' and 'data.frame': 522 obs. of 19 variables:
#>  $ id          : int  1 2 3 4 5 6 7 8 9 10 ...
#>  $ price       : int  360000 340000 250000 205500 275500 248000 229900 150000 195000 160000 ...
#>  $ sqft        : int  3032 2058 1780 1638 2196 1966 2216 1597 1622 1976 ...
#>  $ bed         : int  4 4 4 4 4 4 3 2 3 3 ...
#>  $ bath        : int  4 2 3 2 3 3 2 1 2 3 ...
#>  $ ac          : chr  "yes" "yes" "yes" "yes" ...
#>  $ cars        : int  2 2 2 2 2 5 2 1 2 1 ...
#>  $ pool        : chr  "no" "no" "no" "no" ...
#>  $ built       : int  1972 1976 1980 1963 1968 1972 1972 1955 1975 1918 ...
#>  $ quality     : chr  "medium" "medium" "medium" "medium" ...
#>  $ style       : int  1 1 1 1 7 1 7 1 1 1 ...
#>  $ lot         : int  22221 22912 21345 17342 21786 18902 18639 22112 14321 32358 ...
#>  $ highway     : chr  "no" "no" "no" "no" ...
#>  $ fitted.m3   : num  12.9 12.4 12.2 12.1 12.5 ...
#>  $ fitted.m4   : num  12.7 12.4 12.3 12.2 12.5 ...
#>  $ lot.discrete: chr  "large" "large" "large" "large" ...
#>  $ resid.m5    : num  -0.0212 0.3448 0.0735 0.1227 0.0938 ...
#>  $ fitted.m5   : num  12.8 12.4 12.4 12.1 12.4 ...
#>  $ resid.m5b   : num  -30705 70168 7858 28684 3620 ...
```
The `realestate` data set consists of observations on the sale price of 522 homes along with numerous characteristics of the home and property. We might be interested in how the price is affected by these characteristics.
```r
library(GGally)
ggpairs(realestate[, 2:7])
```
Fitting a regression model in R is accomplished through the `lm` function:
```r
lm(formula, data, weights, na.action)
```
A formula has the form `y ~ x1 + x2 + x3`, where `y` is the dependent variable (here, price) and `x1`, `x2`, `x3` are the covariates.
```r
lm(price ~ sqft, data=realestate)
#>
#> Call:
#> lm(formula = price ~ sqft, data = realestate)
#>
#> Coefficients:
#> (Intercept)         sqft
#>      -81433          159
```
The estimate for the intercept is the average price of a 0 sqft home … which doesn't make much sense :)
… but the price increases as the square footage increases … that DOES make sense
(1 sq ft = 0.093 sq m or 1 sq m = 10.8 sq ft)
Problem: the variance of the prices increases with the size of the homes.
Solution: use a transformation (square root or log) for the price
Using a log transformation, variability stabilizes (not perfect, but better)
The first model then becomes:
```r
lm(log(price) ~ sqft, data=realestate)
#>
#> Call:
#> lm(formula = log(price) ~ sqft, data = realestate)
#>
#> Coefficients:
#> (Intercept)         sqft
#>   1.128e+01    5.097e-04
```
Models are objects, too, so save the model into a named object. `summary` provides a nicer overview of the model output:
```r
m1 <- lm(log(price) ~ sqft, data=realestate)
summary(m1)
#>
#> Call:
#> lm(formula = log(price) ~ sqft, data = realestate)
#>
#> Residuals:
#>      Min       1Q   Median       3Q      Max
#> -0.76772 -0.14834 -0.01448  0.11921  0.84182
#>
#> Coefficients:
#>              Estimate Std. Error t value Pr(>|t|)
#> (Intercept) 1.128e+01  3.427e-02  329.24   <2e-16 ***
#> sqft        5.097e-04  1.446e-05   35.24   <2e-16 ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Residual standard error: 0.2347 on 520 degrees of freedom
#> Multiple R-squared:  0.7049, Adjusted R-squared:  0.7043
#> F-statistic:  1242 on 1 and 520 DF,  p-value: < 2.2e-16
```
`update` uses an existing model object and allows changes to the effects. The formula `. ~ . + ac` keeps the left hand side (the `.` to the left of the `~`) the same and adds `ac` to the existing right hand side:
```r
m2 <- update(m1, . ~ . + ac)
```
```r
summary(m2)
#>
#> Call:
#> lm(formula = log(price) ~ sqft + ac, data = realestate)
#>
#> Residuals:
#>     Min      1Q  Median      3Q     Max
#> -0.7323 -0.1407 -0.0204  0.1180  0.8196
#>
#> Coefficients:
#>              Estimate Std. Error t value Pr(>|t|)
#> (Intercept) 1.120e+01  3.615e-02  309.88  < 2e-16 ***
#> sqft        4.872e-04  1.457e-05   33.45  < 2e-16 ***
#> acyes       1.589e-01  2.764e-02    5.75 1.52e-08 ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Residual standard error: 0.2278 on 519 degrees of freedom
#> Multiple R-squared:  0.7226, Adjusted R-squared:  0.7215
#> F-statistic: 675.9 on 2 and 519 DF,  p-value: < 2.2e-16
```
R can deal with both categorical and quantitative variables in `lm`. Only the coefficient for houses with AC (`acyes`) is shown; `acno` is used as the baseline and is by default set to 0.
```r
options()$contrasts
#>         unordered           ordered
#> "contr.treatment"      "contr.poly"
```
```r
?contr.treatment
```
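Treatment contrasts use the first factor level as the baseline. To make houses with AC the baseline instead, we could relevel the factor. A minimal sketch (the `ac2` variable and model `m2b` are illustrative, not part of the original analysis):

```r
# Make "yes" the reference level; lm() then reports a coefficient
# `ac2no`, the effect of *not* having AC (illustrative only)
realestate$ac2 <- relevel(factor(realestate$ac), ref = "yes")
m2b <- update(m2, . ~ . - ac + ac2)
coef(m2b)
```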
!!! Beware transformations !!! They sometimes make interpretation tricky.
The log of price is expected to be higher by 1.589e-01 for houses with AC … which is the same as saying that the price of a house with AC is on average exp(1.589e-01) = 1.172221 times that of the same house without AC (i.e. AC leads on average to a 17% increase in price).
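We can let R do this back-transformation directly from the fitted model:

```r
# Pull the AC coefficient out of m2 and back-transform from the
# log scale to a multiplicative price effect
exp(coef(m2)["acyes"])
#>    acyes
#> 1.172221
```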
Is model `m2` actually an improvement over `m1`?
Statistical answer:
```r
anova(m1, m2)
#> Analysis of Variance Table
#>
#> Model 1: log(price) ~ sqft
#> Model 2: log(price) ~ sqft + ac
#>   Res.Df    RSS Df Sum of Sq      F    Pr(>F)
#> 1    520 28.648
#> 2    519 26.932  1    1.7157 33.063 1.524e-08 ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
```
Yes, duh.
```r
qplot(sqft, log(price), colour=ac, data=realestate) + geom_smooth(method="lm")
```
Yes, but there are not that many houses without AC, so `ac` might not matter terribly much. Visualization gives us a more nuanced answer than the test.
Trying to capture the curvature in `log(price)`:
```r
m3 <- lm(log(price) ~ poly(sqft, 2) + ac, data=realestate)
anova(m2, m3)
#> Analysis of Variance Table
#>
#> Model 1: log(price) ~ sqft + ac
#> Model 2: log(price) ~ poly(sqft, 2) + ac
#>   Res.Df    RSS Df Sum of Sq      F    Pr(>F)
#> 1    519 26.932
#> 2    518 25.274  1    1.6584 33.991 9.751e-09 ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
```
```r
realestate$fitted.m3 <- fitted(m3)
ggplot(data=realestate, aes(y=log(price), x=sqft, colour=ac)) +
  geom_point() + geom_line(aes(y=fitted.m3))
```
Extend model `m3` of the `realestate` data some more and try to find at least two effects that significantly affect prices. Visualize the impact those additional effects have on the model.
- `x:y` specifies the interaction between variables `x` and `y`
- `x*y` specifies the main effects `x`, `y` and their interaction
- `(a + b + c)^2` specifies all main effects and pairwise interactions
- `(a + b + c)^2 - b:c` specifies all main effects and pairwise interactions, except the interaction between `b` and `c`
- `-1` removes the term for the intercept
Suppose we have the model:
```r
m5 <- lm(log(price) ~ sqft + bath + built + log(lot), data = realestate)
```
Suppose a new house is going on the market and the seller wants to use this model to get an idea of a reasonable listing price. This house is 2123 sq. ft. with 4 bedrooms, 2 bathrooms, a 20590 sq. ft. lot, and was built in 1963.
```r
new_house <- data.frame(sqft = 2123, bath = 2, built = 1963, lot = 20590)
predict(m5, newdata = new_house, interval = "prediction", level = 0.95) # log(price)!
#>        fit     lwr      upr
#> 1 12.31436 11.9268 12.70192
exp(predict(m5, newdata = new_house, interval = "prediction", level = 0.95))/1000
#>        fit      lwr      upr
#> 1 222.8737 151.2661 328.3794
```
Using linear regression models we inherently make the following assumptions:
- the relationship between the response and the covariates is linear
- the errors are independent, with mean zero and equal variance
- the errors are normally distributed
Also generally want to check for collinearity and influential points
We can assess the assumption of normality using a normal probability plot of the residuals (sometimes called a normal qq plot). Idea of the normal qq plot: plot the sorted residuals against the corresponding quantiles of a standard normal distribution; if the residuals are normally distributed, the points fall along a straight line.
```r
m5 <- lm(log(price) ~ sqft + bath + built + log(lot), data = realestate)
realestate$resid.m5 <- residuals(m5)
realestate$fitted.m5 <- fitted(m5)
ggplot(aes(sample = resid.m5), data=realestate) + geom_qq() + theme(aspect.ratio=1)
```
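To see what `geom_qq` is doing, here is a minimal hand-rolled version (assuming its default of standard normal theoretical quantiles; `ppoints` generates the evenly spaced probabilities):

```r
# Pair the sorted residuals with the corresponding standard normal
# quantiles; points along a straight line suggest normality
n <- nrow(realestate)
qq <- data.frame(theoretical = qnorm(ppoints(n)),
                 sample      = sort(realestate$resid.m5))
qplot(theoretical, sample, data = qq) + theme(aspect.ratio = 1)
```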
The wavy pattern indicates a slightly heavy tail in the residuals.
For comparison, the residuals from the model of the untransformed price:
```r
m5b <- lm(price ~ sqft + bath + built + log(lot), data = realestate)
realestate$resid.m5b <- residuals(m5b)
ggplot(aes(sample = resid.m5b), data=realestate) + geom_qq() + theme(aspect.ratio=1)
```
We can assess the assumptions of equal error variance and linearity using residual plots. Idea of residual plots: plot the residuals against the fitted values; if the model is appropriate, the points scatter randomly around zero with roughly constant spread.
```r
qplot(data=realestate, x=fitted.m5, y=resid.m5) +
  xlab("Fitted Values") + ylab("Residuals") +
  geom_hline(yintercept=0) + geom_smooth()
```
Linearity and equal variance look pretty good
Many R packages are available for specific statistical diagnostic tests. The `lmtest` package contains many of these, such as the Breusch-Pagan test for equal error variance:
```r
library(lmtest)
bptest(m5)
#>
#>  studentized Breusch-Pagan test
#>
#> data:  m5
#> BP = 28.123, df = 4, p-value = 1.178e-05
```
… so there's still heteroskedasticity in those errors … darn!
The Box-Cox method for transformations selects a data transform based on the relationship between the observed mean and variance:
\[ y_i^{(\lambda)} = \begin{cases} \dfrac{y_i^\lambda - 1}{\lambda} & \text{if } \lambda \neq 0, \\[8pt] \ln{(y_i)} & \text{if } \lambda = 0, \end{cases} \]
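As a sketch of what this transformation does in code (a hypothetical helper, not the MASS implementation):

```r
# Box-Cox transform of y for a single lambda; log when lambda is 0
box_cox <- function(y, lambda) {
  if (lambda == 0) log(y) else (y^lambda - 1) / lambda
}
box_cox(c(100, 200, 400), lambda = 0.5)  # behaves like a scaled square root
box_cox(c(100, 200, 400), lambda = 0)    # identical to log()
```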
Use the `boxcox()` function from the MASS package to ballpark a \(\lambda\):
```r
library(MASS)
bc <- boxcox(m5b)
```
```r
str(bc)
#> List of 2
#>  $ x: num [1:100] -2 -1.96 -1.92 -1.88 -1.84 ...
#>  $ y: num [1:100] -1030 -1019 -1008 -998 -988 ...
lambda <- bc$x[which.max(bc$y)]
```
Now re-fit model with transformed price …
```r
m5c <- lm(price^lambda ~ sqft + bath + built + log(lot), data=realestate)
bptest(m5c) # better, but still significant
#>
#>  studentized Breusch-Pagan test
#>
#> data:  m5c
#> BP = 14.856, df = 4, p-value = 0.00501
```
```r
qplot(fitted(m5c), residuals(m5c)) + geom_hline(yintercept=0) + geom_smooth()
```
Interpretation of model coefficients has become even more tricky …
- on the model level: AIC, BIC, \(R^2\), adjusted \(R^2\), …
- on the coefficient level: VIF (Variance Inflation Factors), …
- on the point level: leverage, Cook's distance, …
R can calculate every single one of them … and more …
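For example (a quick sketch; `vif` assumes the `car` package is installed, the rest is base R):

```r
AIC(m5)                   # model level
BIC(m5)
car::vif(m5)              # coefficient level: variance inflation factors
head(cooks.distance(m5))  # point level: influence of single observations
head(hatvalues(m5))       # point level: leverage
```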
Hans Rosling explains this really well in his TED talk
gapminder software
gapminder also makes the data available (in R through the package `gapminder`)
```r
library(gapminder)
qplot(data=gapminder, year, lifeExp, geom="line", group=country)
```
How would you describe this plot?
For this we need the packages `tidyr` and `purrr`, i.e. run the following code:
```r
install.packages("tidyr")
install.packages("purrr")
library(tidyr)
library(purrr)
library(dplyr)  # also needed below, for %>% and mutate()
```
```r
oz <- subset(gapminder, country=="Australia")
head(oz)
#> Source: local data frame [6 x 6]
#>
#>     country continent  year lifeExp      pop gdpPercap
#>      (fctr)    (fctr) (int)   (dbl)    (int)     (dbl)
#> 1 Australia   Oceania  1952   69.12  8691212  10039.60
#> 2 Australia   Oceania  1957   70.33  9712569  10949.65
#> 3 Australia   Oceania  1962   70.93 10794968  12217.23
#> 4 Australia   Oceania  1967   71.10 11872264  14526.12
#> 5 Australia   Oceania  1972   71.93 13177000  16788.63
#> 6 Australia   Oceania  1977   73.49 14074100  18334.20
```
```r
qplot(data=oz, year, lifeExp) + geom_smooth(method="lm")
```
```r
oz.lm <- lm(lifeExp ~ year, data=oz)
oz.lm
#>
#> Call:
#> lm(formula = lifeExp ~ year, data = oz)
#>
#> Coefficients:
#> (Intercept)         year
#>   -376.1163       0.2277
```
The intercept is the estimated life expectancy at 0 BC … let's use 1950 for the first value instead:
```r
gapminder <- gapminder %>% mutate(year = year - 1950)
```
We don't want to subset the data for every country … `nest()` makes a data frame part of another data frame:
```r
by_country <- gapminder %>% group_by(continent, country) %>% nest()
head(by_country)
#> Source: local data frame [6 x 3]
#>
#>   continent     country            data
#>      (fctr)      (fctr)           (chr)
#> 1      Asia Afghanistan <tbl_df [12,4]>
#> 2    Europe     Albania <tbl_df [12,4]>
#> 3    Africa     Algeria <tbl_df [12,4]>
#> 4    Africa      Angola <tbl_df [12,4]>
#> 5  Americas   Argentina <tbl_df [12,4]>
#> 6   Oceania   Australia <tbl_df [12,4]>
```
Each element of the `data` variable in `by_country` is a dataset:
```r
head(by_country$data[[1]])
#> Source: local data frame [6 x 4]
#>
#>    year lifeExp      pop gdpPercap
#>   (dbl)   (dbl)    (int)     (dbl)
#> 1     2  28.801  8425333  779.4453
#> 2     7  30.332  9240934  820.8530
#> 3    12  31.997 10267083  853.1007
#> 4    17  34.020 11537966  836.1971
#> 5    22  36.088 13079460  739.9811
#> 6    27  38.438 14880372  786.1134
lm(lifeExp ~ year, data=by_country$data[[1]])
#>
#> Call:
#> lm(formula = lifeExp ~ year, data = by_country$data[[1]])
#>
#> Coefficients:
#> (Intercept)         year
#>     29.3566       0.2753
```
Now we are using the `map` function from the package `purrr`. `map` allows us to apply a function to each element of a list.
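A toy example before we apply it to the nested data frames (the formula `~ .x^2` is purrr shorthand for `function(x) x^2`):

```r
# Square each element of a plain list; returns list(1, 4, 9)
purrr::map(list(1, 2, 3), ~ .x^2)
```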
```r
by_country$model <- by_country$data %>% purrr::map(~lm(lifeExp ~ year, data=.))
head(by_country)
#> Source: local data frame [6 x 4]
#>
#>   continent     country            data   model
#>      (fctr)      (fctr)           (chr)   (chr)
#> 1      Asia Afghanistan <tbl_df [12,4]> <S3:lm>
#> 2    Europe     Albania <tbl_df [12,4]> <S3:lm>
#> 3    Africa     Algeria <tbl_df [12,4]> <S3:lm>
#> 4    Africa      Angola <tbl_df [12,4]> <S3:lm>
#> 5  Americas   Argentina <tbl_df [12,4]> <S3:lm>
#> 6   Oceania   Australia <tbl_df [12,4]> <S3:lm>
```
The `broom` package allows us to extract values from models on three levels:
- `broom::glance`: model level
- `broom::tidy`: coefficient level
- `broom::augment`: point level
```r
library(broom)
broom::glance(by_country$model[[1]])
#>   r.squared adj.r.squared    sigma statistic      p.value df    logLik
#> 1 0.9477123     0.9424835 1.222788  181.2494 9.835213e-08  2 -18.34693
#>        AIC      BIC deviance df.residual
#> 1 42.69387 44.14859  14.9521          10
broom::tidy(by_country$model[[1]])
#>          term   estimate  std.error statistic      p.value
#> 1 (Intercept) 29.3566375 0.69898128  41.99918 1.404235e-12
#> 2        year  0.2753287 0.02045093  13.46289 9.835213e-08
```
```r
broom::augment(by_country$model[[1]])
#>    lifeExp year  .fitted   .se.fit      .resid       .hat   .sigma
#> 1   28.801    2 29.90729 0.6639995 -1.10629487 0.29487179 1.211813
#> 2   30.332    7 31.28394 0.5799442 -0.95193823 0.22494172 1.237512
#> 3   31.997   12 32.66058 0.5026799 -0.66358159 0.16899767 1.265886
#> 4   34.020   17 34.03722 0.4358337 -0.01722494 0.12703963 1.288917
#> 5   36.088   22 35.41387 0.3848726  0.67413170 0.09906760 1.267003
#> 6   38.438   27 36.79051 0.3566719  1.64748834 0.08508159 1.154002
#> 7   39.854   32 38.16716 0.3566719  1.68684499 0.08508159 1.147076
#> 8   40.822   37 39.54380 0.3848726  1.27820163 0.09906760 1.208243
#> 9   41.674   42 40.92044 0.4358337  0.75355828 0.12703963 1.260583
#> 10  41.763   47 42.29709 0.5026799 -0.53408508 0.16899767 1.274051
#> 11  42.129   52 43.67373 0.5799442 -1.54472844 0.22494172 1.148593
#> 12  43.828   57 45.05037 0.6639995 -1.22237179 0.29487179 1.194109
#>         .cooksd  .std.resid
#> 1  2.427205e-01 -1.07742164
#> 2  1.134714e-01 -0.88428127
#> 3  3.603567e-02 -0.59530844
#> 4  1.653992e-05 -0.01507681
#> 5  1.854831e-02  0.58082792
#> 6  9.225358e-02  1.40857509
#> 7  9.671389e-02  1.44222437
#> 8  6.668277e-02  1.10129103
#> 9  3.165567e-02  0.65958143
#> 10 2.334344e-02 -0.47913530
#> 11 2.987950e-01 -1.43494020
#> 12 2.963271e-01 -1.19046907
```
Extract all countries automatically (hello, `map` again!):
```r
by_country_coefs <- by_country %>% unnest(model %>% purrr::map(broom::tidy))
coefs <- by_country_coefs %>%
  select(country, continent, term, estimate) %>%
  spread(term, estimate)
```
and finally, a visualization:
```r
qplot(`(Intercept)`, year, colour=continent, data=coefs)
```
```r
by_country <- by_country %>% unnest(model %>% purrr::map(broom::glance))
by_country$country <- reorder(by_country$country, by_country$r.squared)
qplot(country, r.squared, colour=continent, data=by_country) +
  theme(axis.text.x=element_text(angle=-90, hjust=0)) +
  scale_colour_brewer(palette="Dark2")
```
```r
country_all <- by_country %>% unnest(data)
qplot(year+1950, lifeExp, data=subset(country_all, r.squared <= 0.45), geom="line") +
  facet_wrap(~country)
```
What do the patterns mean?
```r
qplot(year+1950, lifeExp, data=subset(country_all, between(r.squared, 0.45, 0.75)),
      geom="line") + facet_wrap(~country)
```
Extract the residuals for each of the models and store them in a dataset together with the country and continent information. Plot the residuals across the years and fit a smooth. What does the pattern mean? One possible starting point is sketched below.
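A sketch only (it assumes `by_country` still holds the `model` list-column, and uses `broom::augment`, which returns the residuals in a `.resid` column):

```r
# Point-level results (including residuals) for every country's model
by_country_resid <- by_country %>%
  unnest(model %>% purrr::map(broom::augment))
ggplot(by_country_resid, aes(x = year + 1950, y = .resid)) +
  geom_line(aes(group = country), alpha = 0.3) +  # one line per country
  geom_smooth(se = FALSE)                         # overall smooth
```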
This work is licensed under the Creative Commons Attribution-Noncommercial 3.0 United States License. To view a copy of this license, visit http://creativecommons.org/licenses/by-nc/3.0/us/ or send a letter to Creative Commons, 171 Second Street, Suite 300, San Francisco, California, 94105, USA.