# Model checking

“perhaps the most important part of applied statistical modelling”

Simon Wood

### Model checking

• Checking $$\neq$$ validation!
• As with detection function, checking is important
• Want to know the model conforms to assumptions
• What assumptions should we check?

• Convergence
• Basis size
• Residuals

## Convergence

### Convergence

• Fitting the GAM involves an optimization
• By default this is REstricted Maximum Likelihood (REML) score
• Sometimes this can go wrong
• R will warn you!

### A model that converges

gam.check(dsm_tw_xy_depth)


Method: REML   Optimizer: outer newton
full convergence after 7 iterations.
Gradient range [-3.468176e-05,1.090937e-05]
(score 374.7249 & scale 4.172176).
Hessian positive definite, eigenvalue range [1.179219,301.267].
Model rank =  39 / 39

Basis dimension (k) checking results. Low p-value (k-index<1) may
indicate that k is too low, especially if edf is close to k'.

k'   edf k-index p-value
s(x,y)   29.00 11.11    0.65  <2e-16 ***
s(Depth)  9.00  3.84    0.81    0.33
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1


### A bad model

Error in while (mean(ldxx/(ldxx + ldss)) > 0.4) { :
missing value where TRUE/FALSE needed
In addition: Warning message:
In sqrt(w) : NaNs produced
Error in while (mean(ldxx/(ldxx + ldss)) > 0.4) { :
missing value where TRUE/FALSE needed


This is rare

## The Folk Theorem of Statistical Computing

“most statistical computational problems are due not to the algorithm being used but rather the model itself”

Andrew Gelman

## Basis size

### Basis size (k)

• Set k per term
• e.g. s(x, k=10) or s(x, y, k=100)
• Penalty removes “extra” wigglyness
• up to a point!
• (But computation is slower with bigger k)

### Checking basis size

gam.check(dsm_x_tw)


Method: REML   Optimizer: outer newton
full convergence after 7 iterations.
Gradient range [-3.08755e-06,4.928064e-07]
(score 409.936 & scale 6.041307).
Hessian positive definite, eigenvalue range [0.7645492,302.127].
Model rank =  10 / 10

Basis dimension (k) checking results. Low p-value (k-index<1) may
indicate that k is too low, especially if edf is close to k'.

k'  edf k-index p-value
s(x) 9.00 4.96    0.76    0.44


### Increasing basis size

dsm_x_tw_k <- dsm(count~s(x, k=20), ddf.obj=df,
segment.data=segs, observation.data=obs,
family=tw())
gam.check(dsm_x_tw_k)


Method: REML   Optimizer: outer newton
full convergence after 7 iterations.
Gradient range [-2.301238e-08,3.930667e-09]
(score 409.9245 & scale 6.033913).
Hessian positive definite, eigenvalue range [0.7678456,302.0336].
Model rank =  20 / 20

Basis dimension (k) checking results. Low p-value (k-index<1) may
indicate that k is too low, especially if edf is close to k'.

k'   edf k-index p-value
s(x) 19.00  5.25    0.76    0.39


### Sometimes basis size isn't the issue...

• Generally, double k and see what happens
• Didn't increase the EDF much here
• Other things can cause low “p-value” and “k-index
• Increasing k can cause problems (nullspace)

### k is a maximum

• (Usually) Don't need to worry about things being too wiggly
• k gives the maximum complexity
• Penalty deals with the rest

## Residuals

### What are residuals?

• Generally residuals = observed value - fitted value
• BUT hard to see patterns in these “raw” residuals
• Need to standardise $$\Rightarrow$$ deviance residuals
• Residual sum of squares $$\Rightarrow$$ linear model
• deviance $$\Rightarrow$$ GAM
• Expect these residuals $$\sim N(0,1)$$