```{r echo=FALSE}
library(knitr)
library(captioner)

# make a numbering for figures
fig_nums <- captioner(prefix="Figure")
fig_c <- function(p) fig_nums(p, display="cite")
```

# Introduction

The analysis is based on a dataset of observations of pantropical dolphins in the Gulf of Mexico (shipped with Distance 6.0 and later). For convenience the data are bundled in an `R`-friendly format, although all of the code necessary for creating the data from the Distance project files is available [on github](http://github.com/dill/mexico-data). The OBIS-SEAMAP page for the data may be found at the [SEFSC GoMex Oceanic 1996](http://seamap.env.duke.edu/dataset/25) survey page.

The intention here is to highlight the features of the `dsm` package, rather than perform a full analysis of the data. For that reason, some important steps are not fully explored. Some familiarity with [density surface modelling](http://onlinelibrary.wiley.com/doi/10.1111/2041-210X.12105/abstract) is assumed.

This is a `knitr` document. The source for this document contains everything you need to reproduce the analysis given here (aside from the data, which is included with the `dsm` package). The most recent version of this document can be found at [github.com/dill/mexico-data](http://github.com/dill/mexico-data).

The rest of this document is structured as follows: sections 1-3 deal with the data setup (including plotting considerations), section 4 covers exploratory analysis and sections 5-10 cover fitting and assessing models.

# Preamble

Before we start, we load the `dsm` package (and its dependencies) and set some options:

```{r loadlibraries}
library(dsm)
library(ggplot2)

# plotting options
gg.opts <- theme(panel.grid.major=element_blank(),
                 panel.grid.minor=element_blank(),
                 panel.background=element_blank())

# make the results reproducible
set.seed(11123)
```

In order to run this vignette, you'll need to install a few R packages. This can be done via the following call to `install.packages`:

```{r eval=FALSE}
install.packages(c("dsm", "Distance", "knitr", "captioner", "ggplot2", "rgdal",
                   "maptools", "plyr", "tweedie"))
```

# The data

## Observation and segment data

All of the data for this analysis have been nicely pre-formatted and are shipped with `dsm`. Loading those data, we can see that we have four data frames; the first few lines of each are shown below:

```{r loaddata}
data(mexdolphins)
attach(mexdolphins)
```

`segdata` holds the segment data: the transects have already been "chopped" into segments.

```{r head-segdata}
head(segdata)
```

`distdata` holds the distance sampling data that will be used to fit the detection function.

```{r head-distdata}
head(distdata)
```

`obsdata` links the distance data to the segments.

```{r head-obsdata}
head(obsdata)
```

`preddata` holds the prediction grid (which includes all the necessary covariates).

```{r head-preddata}
head(preddata)
```

Typically (i.e. for other datasets) it will be necessary to divide the transects into segments, and to allocate observations to the correct segments using a GIS or other similar package[^MGET], before starting an analysis using `dsm`.

## Shapefiles and converting units

Often data in a spatial analysis come from many different sources. It is important to ensure that the measurements to be used in the analysis are in compatible units, otherwise the resulting estimates will be incorrect or hard to interpret. Having all of our measurements in SI units from the outset removes the need for conversion later, making life much easier.
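When measurements do arrive in mixed units, the conversion is usually a one-line rescaling per column. The sketch below is purely illustrative and is not needed for this dataset; the columns and units used are hypothetical (distances in metres, everything else in kilometres):

```{r unit-conversion, eval=FALSE}
# hypothetical example: perpendicular distances recorded in metres, but segment
# lengths recorded in kilometres -- rescale the distances to match
distdata$distance <- distdata$distance / 1000

# a prediction cell area recorded in m^2 would likewise be rescaled to km^2
preddata$area <- preddata$area / 1000^2
```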
The data are already in the appropriate units (Northings and Eastings: kilometres from a centroid, projected using the [North American Lambert Conformal Conic projection](https://en.wikipedia.org/wiki/Lambert_conformal_conic_projection)).

There is extensive literature about when particular projections of latitude and longitude are appropriate and we highly recommend the reader review this for their particular study area; Bivand *et al* (2013) is a good starting point. The other data frames have already had their measurements appropriately converted. By convention the directions are named `x` and `y`.

Using latitude and longitude when performing spatial smoothing can be problematic when certain smoother bases are used. In particular, when bivariate isotropic bases are used, the fact that latitude and longitude are not isotropic causes problems: moving one degree in one direction is not the same distance as moving one degree in the other.

We give an example of projecting the polygon that defines the survey area (which has simply been read into R using `readShapeSpatial` from a shapefile produced by GIS).

```{r projectsurvey, results='hide', message=FALSE}
library(rgdal)
library(maptools)
library(plyr)

# tell R that the survey.area object is currently in lat/long
proj4string(survey.area) <- CRS("+proj=longlat +datum=WGS84")

# proj 4 string
# using http://spatialreference.org/ref/esri/north-america-lambert-conformal-conic/
lcc_proj4 <- CRS("+proj=lcc +lat_1=20 +lat_2=60 +lat_0=40 +lon_0=-96 +x_0=0 +y_0=0 +ellps=GRS80 +datum=NAD83 +units=m +no_defs ")

# project using LCC
survey.area <- spTransform(survey.area, CRSobj=lcc_proj4)

# simplify the object
survey.area <- data.frame(survey.area@polygons[[1]]@Polygons[[1]]@coords)
names(survey.area) <- c("x", "y")
```

```{r echo=FALSE}
fig_nums(name="areawithtransects", caption="The survey area with transect lines.", display=FALSE)
```

The code below generates `r fig_nums("areawithtransects", display = "cite")`, which shows the survey area with the transect lines overlaid (using data from `segdata`).

```{r areawithtransects, fig.cap="", fig.height=4}
p <- qplot(data=survey.area, x=x, y=y, geom="polygon", fill=I("lightblue"),
           ylab="y", xlab="x", alpha=I(0.7))
p <- p + coord_equal()
p <- p + geom_line(aes(x, y, group=Transect.Label), data=segdata)
p <- p + gg.opts
print(p)
```

`r fig_nums("areawithtransects")`

Also note that since we've projected our prediction grid, the "squares" don't look quite like squares. So for plotting we'll use the polygons that we've saved. These polygons (stored in `pred.polys`) are read from a shapefile created in GIS; the object itself is of class `SpatialPolygons` from the `sp` package. This plotting method makes plotting take a little longer, but avoids gaps and overplotting. `r fig_nums("projection-compare", display="cite")` compares using latitude/longitude with a projection.
```{r echo=FALSE}
fig_nums(name="projection-compare", caption="Comparison between unprojected latitude and longitude (left) and the prediction grid projected using the North American Lambert Conformal Conic projection (right)", display=FALSE)
```

```{r projection-compare, fig.cap="", fig.height=3}
par(mfrow=c(1,2))

# put pred.polys into lat/long
pred_latlong <- spTransform(pred.polys, CRSobj=CRS("+proj=longlat +datum=WGS84"))

# plot latlong
plot(pred_latlong, xlab="Longitude", ylab="Latitude")
axis(1); axis(2); box()

# plot as projected
plot(pred.polys, xlab="Easting", ylab="Northing")
axis(1); axis(2); box()
```

`r fig_nums("projection-compare")`

Tips on plotting polygons are available from [the `ggplot2` wiki](https://github.com/hadley/ggplot2/wiki/plotting-polygon-shapefiles).

Here we define a convenience function to generate an appropriate data structure for `ggplot2` to plot:

```{r ggpoly}
# given the argument fill (the covariate vector to use as the fill) and a name,
# return a geom_polygon object
# fill must be in the same order as the polygon data
grid_plot_obj <- function(fill, name, sp){

  # strip the names from the fill data and build a one-column data.frame
  names(fill) <- NULL
  row.names(fill) <- NULL
  data <- data.frame(fill)
  names(data) <- name

  spdf <- SpatialPolygonsDataFrame(sp, data)
  spdf@data$id <- rownames(spdf@data)
  spdf.points <- fortify(spdf, region="id")
  spdf.df <- join(spdf.points, spdf@data, by="id")

  # seems to store the x/y even when projected as labelled as
  # "long" and "lat"
  spdf.df$x <- spdf.df$long
  spdf.df$y <- spdf.df$lat

  geom_polygon(aes_string(x="x", y="y", fill=name, group="group"), data=spdf.df)
}
```

# Exploratory data analysis

## Distance data

The top panels of `r fig_c("EDA-plots")`, below, show histograms of observed distances and cluster size, while the bottom panels show the relationship between observed distance and observed cluster size, and the relationship between observed distance and Beaufort sea state. The plots show that there is some relationship between cluster size and observed distance (fewer small clusters seem to be seen at larger distances).

```{r echo=FALSE}
fig_nums(name="EDA-plots", caption="Exploratory plot of the distance sampling data. Top row, left to right: histograms of distance and cluster size; bottom row: plot of distance against cluster size and plot of distances against Beaufort sea state.", display=FALSE)
```

The following code generates `r fig_c("EDA-plots")`:

```{r EDA-plots, fig.height=7, fig.width=7, fig.cap="", results="hide"}
par(mfrow=c(2,2))

# histograms
hist(distdata$distance, main="", xlab="Distance (m)")
hist(distdata$size, main="", xlab="Cluster size")

# plots of distance vs. cluster size
plot(distdata$distance, distdata$size, main="", xlab="Distance (m)",
     ylab="Group size", pch=19, cex=0.5, col=gray(0.7))

# lm fit
l.dat <- data.frame(distance=seq(0, 8000, len=1000))
lo <- lm(size~distance, data=distdata)
lines(l.dat$distance, as.vector(predict(lo, l.dat)))

plot(distdata$distance, distdata$beaufort, main="", xlab="Distance (m)",
     ylab="Beaufort sea state", pch=19, cex=0.5, col=gray(0.7))
```

`r fig_nums("EDA-plots")`

## Spatial data

Looking separately at the spatial data without thinking about the distances, we can plot the observed group sizes in space (`r fig_c("spatialEDA")`, below). Circle size indicates the size of the group in the observation. There are rather large areas with no observations, which might cause our variance estimates for abundance to be rather large.
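We can back up that impression with a quick count of how many segments had no sightings attributed to them at all; a minimal sketch, assuming (as in the data shown above) that `Sample.Label` is the column linking `obsdata` to `segdata`:

```{r empty-segments}
# how many segments have no observations attributed to them?
sum(!(segdata$Sample.Label %in% obsdata$Sample.Label))

# ... out of how many segments in total
nrow(segdata)
```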
`r fig_c("spatialEDA")` also shows the depth data, which we will use later as an explanatory covariate in our spatial model.

```{r echo=FALSE}
fig_nums(name="spatialEDA", caption="Plot of depth values over the survey area with transects and observations overlaid. Point size is proportional to the group size for each observation.", display=FALSE)
```

The following code generates `r fig_c("spatialEDA")`:

```{r spatialEDA, fig.cap=""}
p <- ggplot() + grid_plot_obj(preddata$depth, "Depth", pred.polys) + coord_equal()
p <- p + labs(fill="Depth", x="x", y="y", size="Group size")
p <- p + geom_line(aes(x, y, group=Transect.Label), data=segdata)
p <- p + geom_point(aes(x, y, size=size), data=distdata, colour="red", alpha=I(0.7))
p <- p + gg.opts
print(p)
```

`r fig_nums("spatialEDA")`

# Estimating the detection function

We use the `ds` function in the package `Distance` to fit the detection function. (The `Distance` package is intended to make standard distance sampling in `R` relatively straightforward. For a more flexible but more complex alternative, see the function `ddf` in the `mrds` library.)

First, loading the `Distance` library:

```{r loadDistance}
library(Distance)
```

We can then fit a detection function with a hazard-rate key and no adjustment terms:

```{r hrmodel}
detfc.hr.null <- ds(distdata, max(distdata$distance), key="hr", adjustment=NULL)
```

Calling `summary` gives us information about parameter estimates, probability of detection, AIC, etc.:

```{r hrmodelsummary}
summary(detfc.hr.null)
```

```{r echo=FALSE}
fig_nums(name="hr-detfct", caption="Plot of the fitted detection function (left) and goodness of fit plot (right) for the hazard-rate model.", display=FALSE)
```

The following code generates a plot of the fitted detection function (`r fig_c("hr-detfct")`) and quantile-quantile plot:

```{r hr-detfct, fig.cap="", fig.width=9, fig.height=6, results="hide"}
par(mfrow=c(1,2))
plot(detfc.hr.null, showpoints=FALSE, pl.den=0, lwd=2)
ddf.gof(detfc.hr.null$ddf)
```

`r fig_nums("hr-detfct")`

The quantile-quantile plot shows relatively good goodness of fit for the hazard-rate detection function.

## Adding covariates to the detection function

It is common to include covariates in the detection function (so-called Multiple Covariate Distance Sampling or MCDS). In this dataset there are two covariates that were collected on each individual: Beaufort sea state and size. For brevity we fit only a hazard-rate detection function with the sea state included as a factor covariate as follows:

```{r hrcovdf, message=FALSE, cache=TRUE, warning=FALSE}
detfc.hr.beau <- ds(distdata, max(distdata$distance), formula=~as.factor(beaufort),
                    key="hr", adjustment=NULL)
```

Again looking at the `summary`,

```{r hrcovdfsummary}
summary(detfc.hr.beau)
```

Here the detection function with covariates does not give a lower AIC than the model without covariates (`r round(detfc.hr.beau$ddf$criterion,2)` vs. `r round(detfc.hr.null$ddf$criterion,2)` for the hazard-rate model without covariates). Looking back to the bottom-right panel of `r fig_c("EDA-plots")`, we can see there is not a discernible pattern in the plot of Beaufort vs. distance.

For brevity, detection function model selection has been omitted here. In practice we would fit many different forms for the detection function (and select a model based on goodness of fit testing and AIC).

# Fitting a DSM

Before fitting a `dsm` model, the data must be segmented; this consists of chopping up the transects and attributing counts to each of the segments.
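We can get a sense of what segmentation produces by looking at the segments themselves; a minimal sketch, assuming (as in `segdata` above) that the `Effort` column holds each segment's length:

```{r segment-lengths}
# how many segments are there, and how long are they?
nrow(segdata)
summary(segdata$Effort)
```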
As mentioned above, these data have already been segmented, so we can proceed straight to model fitting.

## A simple model

We begin with a very simple model. We assume that the number of individuals in each segment is quasi-Poisson distributed and that its expected value is a smooth function of the segment's spatial coordinates (note that the formula is exactly as one would specify to `gam` in `mgcv`). By setting `group=TRUE`, the abundance of clusters/groups rather than individuals can be estimated (though we ignore this here). Note that we set `method="REML"` to ensure that smooth terms are estimated reliably.

Running the model:

```{r}
dsm.xy <- dsm(count~s(x,y), detfc.hr.null, segdata, obsdata, method="REML")
```

We can then obtain a summary of the fitted model:

```{r}
summary(dsm.xy)
```

The exact interpretation of the model summary results can be found in Wood (2006); here we can see various information about the smooth components fitted and general model statistics. We can use the deviance explained to compare between models[^rsqoffset].

```{r echo=FALSE}
fig_nums(name="visgam1", caption="Plot of the spatial smooth in `dsm.xy`, values are relative abundances. White/yellow indicates high values, red indicates low values.", display=FALSE)
```

We can also get a rough idea of what the smooth of space looks like using `vis.gam`:

```{r visgam1, fig.cap=""}
vis.gam(dsm.xy, plot.type="contour", view=c("x","y"), asp=1, type="response",
        contour.col="black", n.grid=100)
```

`r fig_nums("visgam1")`

The `type="response"` argument ensures that the plot is on the scale of abundance, but the values are relative (as the offsets are set to be their median values). This means that the plot is useful for getting an idea of the general shape of the smooth but cannot be interpreted directly.

## Adding another environmental covariate to the spatial model

The data set also contains a `depth` covariate (which we plotted above). We can include it in the model very simply:

```{r depthmodel}
dsm.xy.depth <- dsm(count~s(x,y,k=10) + s(depth,k=20), detfc.hr.null, segdata, obsdata, method="REML")
summary(dsm.xy.depth)
```

Here we see a drop in deviance explained, so perhaps this model is not as useful as the first. We discuss setting the `k` parameter in [Model checking], below.

Setting `select=TRUE` here (as an argument to `gam`) would impose extra shrinkage terms on each smooth in the model (allowing smooth terms to be removed from the model during fitting; see `?gam` for more information). This is not particularly useful here, so we do not include it. However, when there are many environmental predictors in the model, this can be a good way (along with looking at $p$-values) to perform term selection.

```{r echo=FALSE}
fig_nums(name="dsm.xy.depth-depth", caption="Plot of the smooth of depth in `dsm.xy.depth`. This \"hockey stick\" shaped smooth indicates that there is not much information in depth after 500m.", display=FALSE)
```

Simply calling `plot` on the model object allows us to look at the relationship between depth and the linear predictor (shown in `r fig_c("dsm.xy.depth-depth")`):

```{r dsm.xy.depth-depth, fig.cap=""}
plot(dsm.xy.depth, select=2)
```

`r fig_nums("dsm.xy.depth-depth")`

Omitting the argument `select` in the call to `plot` will plot each of the smooth terms, one at a time.

## Spatial models when there are covariates in the detection function

The code to fit the DSM when there are covariates in the detection function is similar to the other models, above.
However, since the detection function has observation-level covariates, we must estimate the abundance per segment using a Horvitz-Thompson-like estimator before modelling (summing $s_i/\hat{p}_i$ over the observations in each segment, where $s_i$ is the size of group $i$ and $\hat{p}_i$ its estimated detection probability), so we change the response to `abundance.est`:

```{r cache=TRUE}
dsm.est.xy <- dsm(abundance.est~s(x,y), detfc.hr.beau, segdata, obsdata, method="REML")
```

As we can see, the `summary` results are rather similar:

```{r}
summary(dsm.est.xy)
```

```{r echo=FALSE}
fig_nums(name="visgam5", caption="Plot of the spatial smooth in `dsm.est.xy`.", display=FALSE)
```

As is the resulting spatial smooth (though the resulting surface is somewhat "amplified"):

```{r visgam5, fig.cap=""}
vis.gam(dsm.est.xy, plot.type="contour", view=c("x","y"), asp=1, type="response",
        zlim=c(0, 300), contour.col="black", n.grid=100)
```

`r fig_nums("visgam5")`

## Other response distributions

Often the quasi-Poisson distribution doesn't give adequate flexibility and doesn't capture the overdispersion in the response data (see [Model checking] and [Model selection] below), so below we illustrate two additional distributions that can be used with count data. For the models in this section, we'll move back to the `count` response, though the estimated abundance would also work.

### Tweedie

Response distributions other than the quasi-Poisson can be used, for example the Tweedie distribution. The Tweedie distribution is available in `dsm` by setting `family=tw()`.

```{r tweedie-fit}
dsm.xy.tweedie <- dsm(count~s(x,y), detfc.hr.null, segdata, obsdata, family=tw(), method="REML")
summary(dsm.xy.tweedie)
```

### Negative binomial

Though not used here, there are, similarly, two options for the negative binomial distribution: `negbin` and `nb`. The former requires the user to specify a single value of the parameter `theta` or a range of values for it (specified as a vector); the latter estimates the value of `theta` during the model fitting process (and is generally faster). The latter is recommended for most users.

## Other spatial modelling options

There is a large literature on spatial modelling using GAMs, much of which can be harnessed in a DSM context. Here are a few highlights.

### Soap film smoothing

To account for a complex region (e.g., a region that includes peninsulae) we can use the soap film smoother (Wood *et al*, 2008).

To use a soap film smoother for the spatial part of the model we must create a set of knots for the smoother to use. This is easily done using the `make.soapgrid()` function in `dsm`:

```{r soap-knots}
soap.knots <- make.soapgrid(survey.area, c(15,10))
```

where the second argument specifies the number of points (in each direction) in the grid that will be used to create the knots (knots in the grid outside of `survey.area` are removed).

As we saw in the exploratory analysis, some of the transect lines are outside of the survey area. These will cause the soap film smoother to fail, so we remove them:

```{r soap-setup}
x <- segdata$x; y <- segdata$y
onoff <- inSide(x=x, y=y, bnd=as.list(survey.area))
rm(x, y)
segdata.soap <- segdata[onoff, ]
```

Note that the [`soap_checker` script available here](https://github.com/dill/soap_checker) can be useful in ensuring that the boundary, data and knots are in the correct format to use with the soap film smoother.

We can run a model with the `depth` covariate and a spatial (soap film) smooth. Note that the `k` argument now refers to the complexity of the boundary smooth in the soap film, and the complexity of the film is controlled by the knots given in the `xt` argument.
```{r soap-fit, cache=TRUE}
dsm.xy.tweedie.soap <- dsm(count~s(x, y, bs="so", k=15, xt=list(bnd=list(survey.area))) + s(depth),
                           family=tw(), method="REML",
                           detfc.hr.null, segdata.soap, obsdata, knots=soap.knots)
summary(dsm.xy.tweedie.soap)
```

### Correlation structure

We can use a generalized additive mixed model (GAMM; Wood, 2006) to include correlation between the segments within each transect. First we re-code the sample labels and transect labels as numeric variables, then include them in the model as part of the `correlation` argument (segments are nested inside the transects). For the sake of example we use an AR1 (lag 1 autocorrelation) correlation structure (though the correlogram did not indicate we had issues with residual autocorrelation, we show it here for illustrative purposes).

```{r dsm.xy.gamm, cache=TRUE}
segdata$sg.id <- as.numeric(sub("\\d+-", "", segdata$Sample.Label))
segdata$tr.id <- as.numeric(segdata$Transect.Label)
dsm.xy.gamm <- dsm(count~s(x,y), detfc.hr.null, segdata, obsdata, engine="gamm",
                   correlation=corAR1(form=~sg.id|tr.id), method="REML")
```

GAMMs usually take considerably longer to fit than GAMs, so it's usually best to start with a GAM first, selecting smooth terms and the response distribution before moving on to GAMMs.

The object returned is part `lme` (for the random effects) and part `gam` (for the smooth terms). Looking at the `summary()` for the `gam` part of the model:

```{r}
summary(dsm.xy.gamm$gam)
```

Note that the deviance explained is not reported for the GAMM.

More information on `lme` can be found in Pinheiro and Bates (2000) and Wood (2006).

# Model checking

Fitting models is all well and good, but we'd like to confirm that the models we have are reasonable; `dsm` provides some functions for model checking.

## Goodness of fit, residuals etc.

```{r echo=FALSE}
fig_nums(name="dsm.xy-check", caption="Diagnostic plots for `dsm.xy`.", display=FALSE)
```

`r fig_c("dsm.xy-check")` shows diagnostic plots for the DSM with a quasi-Poisson response, using `gam.check`:

```{r dsm.xy-check, fig.cap="", fig.width=6, fig.height=6}
gam.check(dsm.xy)
```

`r fig_nums("dsm.xy-check")`

These show that there is some deviation in the Q-Q plot. The "line" of points in the residuals vs. linear predictor plot corresponds to the zeros in the data.

Note that as well as the plots, `gam.check` also produces information about the model fitting. Of particular interest to us are the last few lines, which tell us about the basis size.

The `k` parameter provided to `s` (and `te`) terms in `dsm` controls the complexity of the smooths in the model. By setting the `k` parameter we specify the largest complexity for that smooth term in the model; as long as this is high enough, we can be sure that there is enough flexibility. In the output from `gam.check` above, we can see that there is a "p-value" calculated for the size of the basis; this can be a good guide as to whether the basis size needs to be increased.

The `?choose.k` manual page from `mgcv` gives further guidance and technical details on this matter.

```{r echo=FALSE}
fig_nums(name="dsm.xy.tweedie-check", caption="Diagnostic plots for `dsm.xy.tweedie`.", display=FALSE)
```

We can look at the same model form but with a Tweedie distribution specified as the response:

```{r dsm.xy.tweedie-check, fig.cap="", fig.width=6, fig.height=6}
gam.check(dsm.xy.tweedie)
```

`r fig_nums("dsm.xy.tweedie-check")`

The Q-Q plot now seems much better (closer to the $y=x$ line).
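As an aside, if the basis-size check above had flagged a problem, one simple response is to refit with a larger `k` and confirm that the effective degrees of freedom and the fit barely change. A minimal sketch (not run here; the value `k=60` and the object name `dsm.xy.bigk` are arbitrary):

```{r biggerk, eval=FALSE}
# refit the simple spatial model with a more generous basis; if the EDF and
# deviance explained are essentially unchanged, the original basis was adequate
dsm.xy.bigk <- dsm(count~s(x, y, k=60), detfc.hr.null, segdata, obsdata,
                   method="REML")
summary(dsm.xy.bigk)
```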
In both plots the histogram of residuals is rather hard to interpret due to the large number of zeros in the data.

Further guidance on interpreting `gam.check` output can be found in Wood (2006).

### Randomised quantile residuals

In the top-right panel of the above `gam.check` plots, the residuals vs. linear predictor plot includes an odd line of points. This is an artifact of the link function, showing the exact zeros in the data. Such artifacts can be misleading and distracting, making it difficult to see whether the residuals show heteroskedasticity.

Randomised quantile residuals (Dunn and Smyth, 1996) avoid this issue by transforming the residuals to be exactly normally distributed. This makes the residuals vs. linear predictor plot much easier to interpret, as it no longer includes the artifacts generated by the link function. These plots can be produced using `rqgam.check` in `dsm`.

```{r echo=FALSE}
fig_nums(name="dsm.xy.tweedie-rqcheck", caption="Randomised quantile residual diagnostic plots for `dsm.xy.tweedie`.", display=FALSE)
```

```{r dsm.xy.tweedie-rqcheck, fig.cap="", fig.width=6, fig.height=6}
rqgam.check(dsm.xy.tweedie)
```

`r fig_nums("dsm.xy.tweedie-rqcheck")`

Here we can see that there is no issue with heteroskedasticity (no increase in spread in the residuals vs. linear predictor plot with increasing values of the linear predictor). One can also plot these residuals against covariate values to check for patterns in the residuals.

Note that in general, plots other than "Resids vs. linear pred." should be interpreted with caution in the output of `rqgam.check`, as the residuals generated are normal by construction (so, for example, the Q-Q plot and histogram of residuals will always look fine).

## Autocorrelation

In all spatial models we need to be aware of the issue of spatial autocorrelation. To check for residual autocorrelation we use the `dsm.cor` function:

```{r echo=FALSE}
fig_nums(name="dsm.xy-cor", caption="Residual autocorrelation in `dsm.xy`.", display=FALSE)
```

```{r dsm.xy-cor, fig.cap="", fig.height=5, fig.width=5}
dsm.cor(dsm.xy, max.lag=10, Segment.Label="Sample.Label")
```

`r fig_nums("dsm.xy-cor")`

The plot is shown in `r fig_c("dsm.xy-cor")`, and appears to have a spike at lag 6, but this occurs at a large enough lag that it is not of practical concern.

Note that for data where there are many sample occasions over time, it may not be appropriate to simply plot the autocorrelogram over the whole data at once, as this may confound spatial and temporal effects. In this case it would be more appropriate to plot an autocorrelogram for each sample occasion (e.g. by year or season).

# Model selection

Assuming that models have "passed" the checks in `gam.check` and `rqgam.check`, and are sufficiently flexible, we may be left with a choice of which model is "best". There are several methods for choosing the best model -- AIC, REML/GCV scores, deviance explained, full cross-validation with test data and so on.

Though this document is not intended to be a full analysis of the pantropical dolphin data, we can create a results table to compare the various models that have been fitted so far.
```{r modelcomp, cache=TRUE}
# make a data.frame to print out
mod_results <- data.frame("Model name" = c("`dsm.xy`", "`dsm.xy.depth`",
                                            "`dsm.xy.tweedie`",
                                            "`dsm.xy.tweedie.soap`",
                                            "`dsm.est.xy`", "`dsm.xy.gamm`"),
                          "Description" = c("Bivariate smooth of location, quasipoisson",
                                            "Bivariate smooth of location, smooth of depth, quasipoisson",
                                            "Bivariate smooth of location, Tweedie",
                                            "Soap film smooth of location, smooth of depth, Tweedie",
                                            "Bivariate smooth of location, quasipoisson, Beaufort covariate in detection function",
                                            "Bivariate smooth of location, quasipoisson, correlation structure"),
                          "Deviance explained" = c(unlist(lapply(list(dsm.xy,
                                                                      dsm.xy.depth,
                                                                      dsm.xy.tweedie,
                                                                      dsm.xy.tweedie.soap,
                                                                      dsm.est.xy),
                                                   function(x){paste0(round(summary(x)$dev.expl*100, 2), "%")})),
                                                   NA))
```

We can then use the resulting `data.frame` to build a table of results using the `kable` function:

```{r results-table, results='asis'}
kable(mod_results, col.names=c("Model name", "Description", "Deviance explained"))
```

Note the `NA` value for the GAMM (`dsm.xy.gamm`); this is because the deviance is not reported for these models.

# Abundance estimation

Once a model has been checked and selected, we can make predictions over the grid and calculate abundance. The offset is stored in the `area` column[^offsetcorrection].

```{r}
dsm.xy.pred <- predict(dsm.xy, preddata, preddata$area)
```

```{r echo=FALSE}
fig_nums(name="dsm.xy-preds", caption="Predicted density surface for `dsm.xy`.", display=FALSE)
```

`r fig_c("dsm.xy-preds")` shows a map of the predicted abundance. We use the `grid_plot_obj` helper function to assign the predictions to grid cells (polygons).

```{r dsm.xy-preds, fig.cap=""}
p <- ggplot() + grid_plot_obj(dsm.xy.pred, "Abundance", pred.polys) + coord_equal() + gg.opts
p <- p + geom_path(aes(x=x, y=y), data=survey.area)
p <- p + labs(fill="Abundance")
print(p)
```

`r fig_nums("dsm.xy-preds")`

We can calculate abundance over the survey area by simply summing these predictions:

```{r dsm.xy-abund}
sum(dsm.xy.pred)
```

We can compare this with a plot of the predictions from the `dsm.xy.depth` model (`r fig_c("dsm.xy.depth-preds")`, code below).

```{r echo=FALSE}
fig_nums(name="dsm.xy.depth-preds", caption="Predicted density surface for `dsm.xy.depth`.", display=FALSE)
```

```{r dsm.xy.depth-preds, fig.cap=""}
dsm.xy.depth.pred <- predict(dsm.xy.depth, preddata, preddata$area)
p <- ggplot() + grid_plot_obj(dsm.xy.depth.pred, "Abundance", pred.polys) + coord_equal() + gg.opts
p <- p + geom_path(aes(x=x, y=y), data=survey.area)
p <- p + labs(fill="Abundance")
print(p)
```

`r fig_nums("dsm.xy.depth-preds")`

We can see that the inclusion of depth in the model has had a noticeable effect on the distribution (note the difference in legend scale between the two plots). We can again look at the total abundance:

```{r dsm.xy.depth-abund}
sum(dsm.xy.depth.pred)
```

Here we see that there is not much change in the abundance, so in terms of abundance alone there is little to choose between the two models. Next we'll look at variance, where we can see bigger differences between the models.

# Variance estimation

Obviously point estimates of abundance are important, but we should also calculate the uncertainty around these abundance estimates. Fortunately `dsm` provides functions to perform these calculations and display the resulting uncertainty estimates.
We can use the approach of Williams *et al* (2011), which allows us to incorporate both detection function uncertainty and spatial model (GAM) uncertainty in our estimates of the variance.

The `dsm.var.prop` function will estimate the variance of the abundance for each element in the list provided in `pred.data`. In our case we wish to obtain an abundance for each of the prediction cells, so we use `split` to chop our data set into list elements to give to `dsm.var.prop`.

```{r dsm.xy-varprop, cache=TRUE}
preddata.varprop <- split(preddata, 1:nrow(preddata))
dsm.xy.varprop <- dsm.var.prop(dsm.xy, pred.data=preddata.varprop, off.set=preddata$area)
```

Calling `summary` will give some information about uncertainty estimation:

```{r}
summary(dsm.xy.varprop)
```

The section titled `Quantiles of differences between fitted model and variance model` can be used to check that the variance model does not have major problems (values much less than 1 indicate no issues).

This method will only work when there are either no covariates in the detection function or when those that are included only vary at the scale of segments (e.g. the Beaufort sea state varies per segment so would be fine to include, but the observer may vary within a transect for multiple-observer surveys, so would not be appropriate to include).

Note that for models where there are covariates at the individual level we cannot calculate the variance via the variance propagation method (`dsm.var.prop`) of Williams *et al* (2011). Instead we can estimate the GAM uncertainty and combine it with the detection function uncertainty via the delta method (`dsm.var.gam`), which simply sums the squared coefficients of variation to get a total coefficient of variation (and therefore assumes that the detection process and spatial process are independent). There are no restrictions on the form of the detection function when using `dsm.var.gam`.

```{r echo=FALSE}
fig_nums(name="dsm.xyvarplot", caption="Plot of the coefficient of variation for the study area with transect lines and observations overlaid. Note the increase in CV away from the transect lines.", display=FALSE)
```

We can also make a plot of the CVs using the following code (`r fig_c("dsm.xyvarplot")`).

```{r dsm.xyvarplot, fig.cap=""}
p <- ggplot() + grid_plot_obj(sqrt(dsm.xy.varprop$pred.var)/unlist(dsm.xy.varprop$pred),
                              "CV", pred.polys) + coord_equal() + gg.opts
p <- p + geom_path(aes(x=x, y=y), data=survey.area)
p <- p + geom_line(aes(x, y, group=Transect.Label), data=segdata)
print(p)
```

`r fig_nums("dsm.xyvarplot")`

We can revisit the model that included both depth and location smooths and observe that the coefficient of variation for that model is larger than that of the model with only the location smooth.

```{r dsm.xy.depth-varprop, cache=TRUE}
dsm.xy.depth.varprop <- dsm.var.prop(dsm.xy.depth, pred.data=preddata.varprop, off.set=preddata$area)
summary(dsm.xy.depth.varprop)
```

# Conclusions

This document has outlined an analysis of spatially-explicit distance sampling data using the `dsm` package. Note that there are many possible models that can be fitted using `dsm` and that the aim here was to show just a few of the options. Results from the models can be rather different, so care must be taken in performing model selection, discrimination and criticism.

# Software

* `Distance` is available at [http://github.com/DistanceDevelopment/Distance](http://github.com/DistanceDevelopment/Distance) as well as on CRAN.
* `dsm` is available at [http://github.com/DistanceDevelopment/dsm](http://github.com/DistanceDevelopment/dsm), as well as on CRAN.

# References

* Bivand, RS, E Pebesma, and V Gomez-Rubio (2013). Applied Spatial Data Analysis with R. Springer Science & Business Media.
* Dunn, PK and GK Smyth (1996). Randomized quantile residuals. Journal of Computational and Graphical Statistics, 5, 236–244.
* Hedley, SL and ST Buckland (2004). Spatial models for line transect sampling. Journal of Agricultural, Biological, and Environmental Statistics, 9, 181–199.
* Miller, DL, ML Burt, EA Rexstad, and L Thomas (2013). Spatial models for distance sampling data: recent developments and future directions. Methods in Ecology and Evolution, 4(11), 1001–1010.
* Pinheiro, JC and DM Bates (2000). Mixed-effects Models in S and S-PLUS. Springer.
* Williams, R, SL Hedley, TA Branch, MV Bravington, AN Zerbini and KP Findlay (2011). Chilean blue whales as a case study to illustrate methods to estimate abundance and evaluate conservation status of rare species. Conservation Biology, 25, 526–535.
* Wood, SN (2006). Generalized Additive Models: an introduction with R. Chapman and Hall/CRC Press.
* Wood, SN, MV Bravington and SL Hedley (2008). Soap film smoothing. Journal of the Royal Statistical Society: Series B, 70(5), 931–955.

[^offsetcorrection]: An earlier version of this vignette incorrectly stated that the areas of the prediction cells were 444km$^2$. This has been corrected. Thanks to Phil Bouchet for pointing this out.

[^MGET]: These operations can be performed in R using the `sp` and `rgeos` packages. It may, however, be easier to perform these operations in a GIS such as ArcGIS -- in which case the MGET Toolbox may be useful.

[^rsqoffset]: Note though that the adjusted $R^2$ for the model is defined as the proportion of variance explained, but the "original" variance used for comparison doesn't include the offset (the area or effective area of the segments). It is therefore not recommended that one directly interpret the $R^2$ value (see the `summary.gam` manual page for further details).