Analysis of double platform data
Distance for Windows exercise
This version of the practical is for those who would like to conduct the analysis in Distance (Thomas et al., 2010). There is a separate version for conducting the analysis in R.
The first part of this practical involves initial analysis of a survey of a known number of golf tees. This is intended mainly to familiarise you with the analysis features in Distance. We will do this during a short in-class break on the first day of class.
The second part of the practical involves more detailed analysis of the golf tee data, and an exploration of the double-platform data structure and project setup.
The third part of the practical involves analysis of the pack-ice seal survey data of Borchers et al. (2006) and Southwell et al. (2007).
Golf Tee Survey – Initial analysis
The Golf Tee Data and Distance Project
The data come from a survey of clusters of golf tees in grass, conducted by undergraduate statistics students at the University of St Andrews. For the purposes of this exercise, we will assume that all the data were collected on one 210 metre long transect line out to a width of 4 metres (both sides), and that this comprises the study region. There were 250 clusters of tees in the study region and 760 individual tees in total.
The population was independently surveyed by two observer teams. The following data were recorded for each detected group: perpendicular distance, cluster size, observer (team 1 or 2), “sex” (males are yellow, females green and golf tees occur in single-sex clusters), and “exposure”. Exposure was a subjective judgement of whether the cluster was substantially obscured by grass (exposure=0) or not (exposure=1). The lengths of grass varied along the transect line, and the grass was also slightly more yellow along one part of the line compared with the rest.
The data are stored in the distance project GolfteesExercise.zip. Open the project within Distance, and click on the Analyses tab. Two Analyses have been set up for you:
- “FI – MR dist” is a model with only perpendicular distance as a covariate in the mark recapture detection function model.
- “FI – MR dist + size + sex + exp” is a model with distance, cluster size, sex and exposure as covariates in the mark recapture detection function model.
Have a look at the Properties of the Model Definition for each of these models, and familiarize yourself with the contents of each tab.
- Both specify trial configuration (see the Method button in the Detection Function tab) – team 2 setting up trials for team 1. That’s the configuration we’ll use throughout this exercise, but for this dataset an independent observer configuration would also be fine and you’re welcome to explore that yourself.
- Both are full independence models (we’ll try point independence models later in the exercise).
- The difference between the two is that they have different MR (mark recapture) models – see the MR Model button in the Detection Function tab. Notice that the names used in the model formulae are not the same as the field names in the Data tab – for example the “Perp distance” field is called “distance” in the formulae. (Recall from the lecture that some fields get renamed when the data are sent out to R for analysis. You can find out what renaming has been done in the Log tab once you’ve run an analysis – we’ll do this in a second.)
Golf Tee Survey Analyses
Estimation of p: distance only
Run the “FI - MR dist” analysis. Once it is run, look in the Analysis Details Log tab, and find the part of the log where Distance tells you how it has renamed the fields. It’s useful to know where this is in case you need reminding of what names to use when specifying formulae. Now look in the Results tab. Does the fit of the model look good?
Here’s what the plots are:
- The plots on the pages headed “Summary 1” and “Summary 2” show the histograms of distances for detections by either, or both, observer. The shaded regions show the number for observer 1 and observer 2, respectively.
- The plot on the page ‘Summary 3’ shows the histograms of distances for duplicates (detected by both observers).
- The plot on page headed ‘Summary 4’ shows the histograms of distances for observer 2. The shaded regions indicate the number of duplicates – for example, the shaded region is the number of clusters in each distance bin that were detected by Observer 1 given that they were also detected by Observer 2 (the “|” symbol in the plot legend means “given that”). [Note that if an ‘io’ configuration had been chosen, there would also be a plot showing the duplicates overlaid onto distances detected by Observer 1; in a ‘trial’ configuration we are only interested in Observer 2 setting up trials.]
- Q-Q plot – this has exactly the same interpretation as a Q-Q plot in single platform analyses.
- The plot on the page headed “Detection probability 1” shows a histogram of Observer 1 detections with the estimated Observer 1 detection function overlaid on it. The dots show the estimated detection probability for all Observer 1 detections.
- The plot on the page headed “Detection probability 2” shows the proportion of Obs 2’s detections that were detected by Obs 1 (also see the “Summary” page). The fitted line is the estimated detection function for Obs 1 (given detection by Obs 2). Dots are estimated detection probabilities for each Obs 1 detection.
Is there evidence of unmodelled heterogeneity? What do these results tell you about the estimates of p (average detection probability) and p(0) (detection probability at 0 distance)?
Estimation of p: distance and other variables
Run “FI - MR dist+size+sex+exp”. Can you explain the differences between the estimates of p and the abundance estimates between the two models? Which model would you use to estimate abundance? To decide, look at the goodness-of-fit test and AIC values from each model.
Golf Tee Survey – Further exploration
Golf Tee Survey Analyses
Specifying new models
The size covariate is the least significant of the covariates in the model “FI – MR dist + size + sex + exp” – its estimate is 0.078 with SE 0.183. Create a new model definition and analysis without this covariate. Does it have a lower AIC?
You can also try some models with interaction terms. For example, you would specify the interaction between sex and exposure as “sex:exposure”. If you also want both sex and exposure in as main effects (which you usually do) then the notation “sex*exposure” is shorthand for “sex + exposure + sex:exposure”. Don’t try too many of these models – leave time for the next part.
Point independence
All the models we have tried so far assume full independence – i.e. that the detections are independent between platforms at all distances. A less restrictive assumption is point independence – that the detections are only independent on the line.
Determine whether a simple point independence model is better than a simple full independence one. Set up a new Analysis and Model Definition, and under Detection Function | Method, choose Trial, Point independence. For the MR (mark-recapture) model, specify distance as the only covariate. You now also need to specify a DS (distance sampling) model. Start with a half-normal key function and constant scale parameter (i.e. no covariates).
Run this model, and compare it with the corresponding full independence model (i.e., the full independence model with the same MR model). Which has the lower AIC? Which has an estimate closer to the known true abundance?
Try a point independence model that has a MR model the same as the MR model from your full independence analyses. Which has the lower AIC and bias? Finally, try some models where you introduce covariates into the scale parameter of the DS model (for example, try sex as a covariate).
What is your final best model? What is the estimate of p and p(0) for this model? Was all this modelling necessary in this instance, given the value of p(0)? How else could you have obtained a robust estimate of abundance?
Golf Tee Survey Data exploration
It’s worth having a look at how the data are stored for a double-observer project. Open the project, and click on the Data tab to see how the data are stored. Notice that each detected cluster appears twice in the observation layer – once for observer 1 and once for observer 2. There is a field “object” which gives a unique number to each cluster and another field “detected” which indicates whether the cluster was seen (=1) or missed (=0) by each observer. There are also fields for the other covariates: perpendicular distance, cluster size, sex and exposure. In general, covariates could also be stored in other data layers (e.g., stratum- or transect-level covariates such as habitat).
Now click on the Survey tab and open the Survey details for “New survey”. Click on Properties to see the survey’s properties. Notice that the observer configuration is set to Double observer. We analysed these data assuming that Observer 2 was generating trials for Observer 1 but not vice versa – i.e., trial configuration, where Observer 1 is the primary and Observer 2 is the tracker. (As noted earlier, the data could also be analyzed in independent observer configuration – you are welcome to try this yourself!) Click on the Data fields tab, and notice that the Object, Observer and Detected roles are all filled by the appropriate fields – this is done by default when you are setting up a new project, if you tell Distance it is a double observer survey in the Setup Project Wizard.
Crabeater Seal Survey
The Crabeater Seal Data and Distance Project
These data come from a helicopter survey of crabeater seals within pack-ice in the Antarctic conducted by the Australian Antarctic Division as part of their pack-ice seals program. The helicopter can only operate within a relatively short distance from the icebreaker, which acts as its base, and the ice-breaker can only go where the pack-ice is thin enough, the aerial transects cannot be randomly located. This means that design-based abundance estimation was not a valid option – abundance was estimated using density surface modelling. In this exercise, we concentrate on detection function estimation.
The data are stored in two distance projects: CrabbieMCDSExercise and CrabbieMRDSExercise. The first contains a multiple-covariate distances analysis of the data (assuming p(0)=1); the second contains mark-recapture distance sampling analyses of the same data.
In addition to distance and cluster size, the following explanatory variables are available for modelling detection probability:
- side: the side of the helicopter from which the seals were seen
- exp: the experience (in survey hours) of the observer
- fatigue: the number of minutes the observer has been on duty on the current flight
- gscat: group size category (1/2/≥3)
- vis: visibility category (poor/good/excellent)
- glare: Yes or No, depending on whether there was glare or not
- ssmi: a measure of ice cover
- altitude: the height of the aircraft in metres
- obsname: individual observer identifier
The distance projects are set up as if the transects were random and survey area was 1,000,000 hectares. This is just a device to get Distance to produce an abundance estimate, which we can treat as a relative index of abundance.
The Crabeater Seal Analyses
The analyses described in Borchers et al. (2006) and Southwell et al. (2007) use the point independence method. Use the analyses in the two projects to decide whether:
- an MCDS analysis would have been adequate (and if so, why), and
- a full-independence MRDS analysis would have been adequate (and if so, why).
The projects contain the following analyses (which have already been run, since running them takes a while):
- MCDS like paper: MCDS model used for the MCDS component of the analysis in the papers. This one is in the CrabbieMCDSExercise project
- PI dist only: Point independence model using only distance. This and the subsequent ones are in the CrabbieMRDSExercise project.
- FI like paper: Full independence model using the MRDS model used in the papers.
- PI as in paper: Point independence model using the same MCDS model and MRDS model as used in the papers.
If you have time, you might want to try other models and see if you can do better than the model “PI as in paper”. (However, see Tip2, below, for a timesaver.)
Tip1: remember in Distance for Windows you queue up multiple models to run one after the other (by clicking the Run button while another analysis is already running – the new model will get added to the queue), so you could always set up a series of candidate models and let them run while you do something else!
Tip2: Because the dataset is large, analyses with lots of covariates take a long time to run – and so the dataset is not very conducive to rapid learning. Hence we’ve set up a data filter that selects just 1 in 5 of the observations (it’s called “trunc 700 folds 1 and 5” – the folds 1 and 5 comes from the fact that there is a column “fold” in the dataset that divides the data into 10 equal subsets (“folds”), so we can select 1/5th of the data through the Data Selection tab of the Data Filter by selecting “fold IN (1, 5)”). If you look in the analysis set “Folds 1 and 5” you’ll see that we’ve run the three models using this subset of the data. You might want to do your further covariate selection using this subset also, just to speed up the runs. Of course, you might well find a different model is selected by AIC compared to that in the paper, because you are now only working with 1/5th of the data