Here is a “solution” for practical 3. As with any data analysis, there is no correct answer, but this shows how I would approach this analysis. The analysis here is conditional on having selected a detection function in the previous exercises; I’ve chosen df_hn and df_hr_ss_size, which we saved previously (see the detection function practical).

Much of the text below is as in the exercise itself, so it should be relatively easy to navigate.

Additional text and code are highlighted using boxes like this.

Load data and packages

library(Distance)
## Loading required package: mrds
## This is mrds 2.1.14
## Built: R 3.2.0; ; 2015-07-30 10:07:19 UTC; unix
library(dsm)
## Loading required package: mgcv
## Loading required package: nlme
## This is mgcv 1.8-7. For overview type 'help("mgcv-package")'.
## This is dsm 2.2.11
## Built: R 3.2.2; ; 2015-10-23 20:20:41 UTC; unix
library(ggplot2)
library(knitr)
library(plyr)
library(reshape2)

Loading the data processed from GIS and the fitted detection function objects from the previous exercises:

load("sperm-data.RData")
load("df-models.RData")

Exploratory analysis

We can do some exploratory analysis by aggregating the counts to each segment and plotting what’s going on.

Don’t worry about understanding what this code is doing at the moment.

# join the observations onto the segments
join_dat <- join(segs, obs, by="Sample.Label", type="full")
# sum up the observations per segment
n <- ddply(join_dat, .(Sample.Label), summarise, n=sum(size), .drop = FALSE) 
# sort the segments by their label
segs_eda <- segs[order(segs$Sample.Label),]
# make a new column for the counts
segs_eda$n <- n$n

# remove the columns we don't need
segs_eda$CentreTime <- NULL
segs_eda$POINT_X <- NULL
segs_eda$POINT_Y <- NULL
segs_eda$segment.area <- NULL
segs_eda$off.set <- NULL
segs_eda$CenterTime <- NULL
segs_eda$Effort <- NULL
segs_eda$Length <- NULL
segs_eda$SegmentID <- NULL
segs_eda$coords.x1 <- NULL
segs_eda$coords.x2 <- NULL

# "melt" the data so we have four columns:
#   Sample.Label, n (number of observations),
#   variable (which variable), value (its value)
segs_eda <- melt(segs_eda, id.vars=c("Sample.Label", "n"))
# try head(segs_eda)

Finally, we can plot histograms of counts for different values of the covariates:

p <- ggplot(segs_eda) +
       geom_histogram(aes(value, weight=n)) +
       facet_wrap(~variable, scales="free") +
       xlab("Covariate value") +
       ylab("Aggregated counts")
print(p)
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
## (message repeated once per facet)

We can also simply plot the counts against the covariates; note the high number of zeros (but still some interesting patterns):

p <- ggplot(segs_eda) +
       geom_point(aes(value, n)) +
       facet_wrap(~variable, scales="free") +
       xlab("Covariate value") +
       ylab("Aggregated counts")
print(p)
## Warning: Removed 868 rows containing missing values (geom_point).
## (warning repeated once per facet)
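The warnings arise because segments with no observations end up with NA counts after the join above, and ggplot2 drops those rows. As a rough check of how many segments that affects, we can tabulate the per-segment counts (a quick sketch using the n data frame built above; NA means no observations on that segment):

# how many segments had no observations at all?
sum(is.na(n$n))
# distribution of the per-segment counts
table(n$n, useNA="ifany")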

These plots give a very rough idea of the relationships we can expect in the model. Note, however, that they don’t take into account interactions between the variables or correlations between the terms, nor do they account for detectability.
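Before fitting, it can also be useful to get a rough feel for collinearity between the candidate covariates. A quick sketch (assuming segs is a plain data frame, as the head() output below suggests; complete cases only):

# pairwise correlations between the candidate environmental covariates
round(cor(segs[, c("Depth", "DistToCAS", "SST", "EKE", "NPP")],
          use="complete.obs"), 2)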

Pre-model fitting

As we did in the previous exercise, we must remove from the spatial data the observations that we excluded when we fitted the detection function: those at distances greater than the truncation distance.

obs <- obs[obs$distance <= df_hn$ddf$meta.data$width,]

Here we’ve used the truncation distance stored in the detection function object, but we could equally have supplied the numeric value directly (which we can find by checking the model’s summary()).
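For instance, printing the stored value shows the numeric truncation distance being used:

# truncation distance stored in the fitted detection function
df_hn$ddf$meta.data$width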

Again, note that if you want to fit DSMs using detection functions with different truncation distances, you’ll need to reload sperm-data.RData and redo the truncation for that detection function.
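As a sketch, switching to df_hr_ss_size (which may have a different truncation distance) would look something like this:

# reload the untruncated data, then truncate using the other
# detection function's stored truncation distance
load("sperm-data.RData")
obs <- obs[obs$distance <= df_hr_ss_size$ddf$meta.data$width, ]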

Our new friend +

We can build a really big model by using + to include all the terms that we want. We can check what’s available to us by using head() to look at the segment table:

head(segs)
##            CenterTime SegmentID   Length  POINT_X  POINT_Y     Depth
## 1 2004/06/24 07:27:04         1 10288.91 214544.0 689074.3  118.5027
## 2 2004/06/24 08:08:04         2 10288.91 222654.3 682781.0  119.4853
## 3 2004/06/24 09:03:18         3 10288.91 230279.9 675473.3  177.2779
## 4 2004/06/24 09:51:27         4 10288.91 239328.9 666646.3  527.9562
## 5 2004/06/24 10:25:39         5 10288.91 246686.5 659459.2  602.6378
## 6 2004/06/24 11:00:22         6 10288.91 254307.0 652547.2 1094.4402
##    DistToCAS      SST          EKE      NPP coords.x1 coords.x2        x
## 1 14468.1533 15.54390 0.0014442616 1908.129  214544.0  689074.3 214544.0
## 2 10262.9648 15.88358 0.0014198086 1889.540  222654.3  682781.0 222654.3
## 3  6900.9829 16.21920 0.0011704842 1842.057  230279.9  675473.3 230279.9
## 4  1055.4124 16.45468 0.0004101589 1823.942  239328.9  666646.3 239328.9
## 5  1112.6293 16.62554 0.0002553244 1721.949  246686.5  659459.2 246686.5
## 6   707.5795 16.83725 0.0006556266 1400.281  254307.0  652547.2 254307.0
##          y   Effort Sample.Label
## 1 689074.3 10288.91            1
## 2 682781.0 10288.91            2
## 3 675473.3 10288.91            3
## 4 666646.3 10288.91            4
## 5 659459.2 10288.91            5
## 6 652547.2 10288.91            6

We can then fit a model with the available covariates in it, each as an s() term.

dsm_nb_xy_ms <- dsm(count~s(x,y, bs="ts") +
                       s(Depth, bs="ts") +
                       s(DistToCAS, bs="ts") +
                       s(SST, bs="ts") +
                       s(EKE, bs="ts") +
                       s(NPP, bs="ts"),
                 df_hn, segs, obs,
                 family=nb(), method="REML")
summary(dsm_nb_xy_ms)
## 
## Family: Negative Binomial(0.114) 
## Link function: log 
## 
## Formula:
## count ~ s(x, y, bs = "ts") + s(Depth, bs = "ts") + s(DistToCAS, 
##     bs = "ts") + s(SST, bs = "ts") + s(EKE, bs = "ts") + s(NPP, 
##     bs = "ts") + offset(off.set)
## 
## Parametric coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -20.7732     0.2295   -90.5   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Approximate significance of smooth terms:
##                    edf Ref.df Chi.sq  p-value    
## s(x,y)       1.8636924     29 19.141 2.90e-05 ***
## s(Depth)     3.4176460      9 46.263 1.65e-11 ***
## s(DistToCAS) 0.0000801      9  0.000   0.9053    
## s(SST)       0.0002076      9  0.000   0.5402    
## s(EKE)       0.8563344      9  5.172   0.0134 *  
## s(NPP)       0.0001018      9  0.000   0.7820    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## R-sq.(adj) =  0.0947   Deviance explained = 39.2%
## -REML = 382.76  Scale est. = 1         n = 949

Notes:

  1. We’re using bs="ts" to use the shrinkage thin plate regression spline. More technical detail on these smooths can be found on their manual page ?smooth.construct.ts.smooth.spec.
  2. We’ve not specified basis complexity (k) at the moment. Note that if you want to specify the same complexity for multiple terms, it’s often easier to make a variable that can then be given as k (for example, setting k1<-15 and then setting k=k1 in the required s() terms); a short sketch of this is given after these notes.
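For example, a shared basis size could be set up as below (a minimal sketch; the value 15 and the model name dsm_k_demo are illustrative only):

k1 <- 15
dsm_k_demo <- dsm(count~s(x, y, bs="ts", k=k1) +
                        s(Depth, bs="ts", k=k1),
                  df_hn, segs, obs,
                  family=nb(), method="REML")
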
Looking at the summary() we can see that several of the terms have been shrunk out of the model. We could use this model to make predictions as it stands, since the shrunken terms have such small EDFs that they will have little effect on the results, but it makes sense to remove them, if only because predictions will then be faster and we may avoid having to build prediction grids for those covariates (more on that later). A sketch of the reduced model is given below.
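As a sketch (the model name is illustrative; the terms kept are those with non-negligible EDFs in the summary above), the reduced model might look like:

dsm_nb_xy_ms2 <- dsm(count~s(x, y, bs="ts") +
                        s(Depth, bs="ts") +
                        s(EKE, bs="ts"),
                     df_hn, segs, obs,
                     family=nb(), method="REML")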

Plot

Let’s plot the smooths from this model:

plot(dsm_nb_xy_ms, pages=1, scale=0)