Covariates in the detection function solution

Author

Centre for Research into Ecological and Environmental Modelling
University of St Andrews

Modified

November 2024

Solution

Covariates in the detection function

Covariates in point transect detection functions: Amakihi

library(Distance)
data(amakihi)
hist(amakihi$distance, xlab="Radial distances (m)", main="Amakihi point transect data.")

A truncation distance of 82.5 m was chosen by T. A. Marques et al. (2007).
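As a quick check (a sketch), the proportion of recorded distances lying beyond that truncation distance can be computed directly:

# Proportion of radial distances beyond the 82.5 m truncation distance
mean(amakihi$distance > 82.5, na.rm=TRUE)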

Plots of the covariates were generated. Not surprisingly, MAS and HAS are correlated, so we should be cautious about including them in the same model.

boxplot(amakihi$distance~amakihi$OBs, xlab="Observer", ylab="Distance (m)")
boxplot(amakihi$distance~amakihi$HAS, xlab="Hour", ylab="Distance (m)")
plot(x=amakihi$MAS, y=amakihi$distance, xlab="Minutes after sunrise",
     ylab="Distance (m)", pch=20)
plot(x=amakihi$HAS, y=amakihi$MAS, xlab="Hours after sunrise",
     ylab="Minutes after sunrise", pch=20)

Observer as possible covariate

Hour since sunrise as possible covariate

Minutes since sunrise as possible covariate

Minutes and hours after sunrise plotted together

Some rearrangement of the covariates is needed: relevel simply changes the reference levels of the observer and hour-after-sunrise factor covariates to conform to the analysis as presented in T. A. Marques et al. (2007).

amakihi$HAS <- factor(amakihi$HAS)
amakihi$OBs <- relevel(amakihi$OBs, ref="TKP")
amakihi$HAS <- relevel(amakihi$HAS, ref="5")

The model selected by T. A. Marques et al. (2007) used a hazard rate key function and included observer and minutes after sunrise (treating time as a continuous rather than a discrete covariate); this model is fitted below and the PDF plot is shown.

# Fit model selected by Marques et al (2007)
conv <- convert_units("meter", NULL, "hectare")
amak.hr.obs.mas <- ds(amakihi, transect="point", key="hr", formula=~OBs+MAS, convert_units = conv,
                      truncation=82.5)
plot(amak.hr.obs.mas, showpoints=FALSE, main="Amakihi Observer and Minutes", pdf=TRUE)
sfzero <- data.frame(OBs="SGF", MAS=0)
sf180 <- data.frame(OBs="SGF", MAS=180)
t1zero <- data.frame(OBs="TJS", MAS=0)
t1180 <- data.frame(OBs="TJS", MAS=180)
t2zero <- data.frame(OBs="TKP", MAS=0)
t2180 <- data.frame(OBs="TKP", MAS=180)

add_df_covar_line(amak.hr.obs.mas, data=sfzero, lty=1, lwd=2,col="blue", pdf=TRUE)
add_df_covar_line(amak.hr.obs.mas, data=sf180, lty=2, lwd=2,col="blue", pdf=TRUE)
add_df_covar_line(amak.hr.obs.mas, data=t1zero, lty=1,lwd=2,col="red", pdf=TRUE)
add_df_covar_line(amak.hr.obs.mas, data=t1180, lty=2, lwd=2,col="red", pdf=TRUE)
add_df_covar_line(amak.hr.obs.mas, data=t2zero, lty=1,lwd=2,col="green", pdf=TRUE)
add_df_covar_line(amak.hr.obs.mas, data=t2180, lty=2, lwd=2,col="green", pdf=TRUE)

legend("topright", legend=c("SF, minutes=0",
                            "SF, minutes=180",
                            "TS, minutes=0",
                            "TS, minutes=180",
                            "TP, minutes=0",
                            "TP, minutes=180"),
       title="Covariate combination: observer and minutes",
       lty=rep(c(1,2),times=3), lwd=2, col=rep(c("blue","red","green"), each=2))

Cautionary tale

A second round of model criticism

The model chosen by T. A. Marques et al. (2007) for making inference included minutes after sunrise (MAS) and observer (OBs) as covariates in the detection function. Did that model produce estimates of detection probability that were small for some detections? Recall that, because we are using a Horvitz-Thompson-like estimator, our estimates of abundance are quite sensitive to detections with small values of \(\widehat{P_a(z_i)}\): such detections can have quite large impacts upon population estimates.
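As a reminder of why this matters (a sketch using the standard Horvitz-Thompson-like form for the covered region, ignoring cluster size), the abundance estimate sums the reciprocals of the estimated detection probabilities, so a single detection with \(\widehat{P_a(z_i)} = 0.05\) contributes 20 birds on its own:

\[
\widehat{N} = \sum_{i=1}^{n} \frac{1}{\widehat{P_a(z_i)}}
\]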

A function in recent versions of the Distance package, p_dist_table, permits examination of the distribution of the \(\widehat{P_a(z_i)}\) so we can assess whether this analysis guideline is violated:

consider reducing the truncation distance if more than 5% of the \(\widehat{P_a(z_i)}\) are < 0.2, or if any are less than 0.1

p_dist_table(amak.hr.obs.mas, proportion = TRUE)
Distribution of $P_a(z_i)$ from preferred model when w=82.5

        p   count   proportion
  0 - 0.1       0        0.000
0.1 - 0.2     167        0.134
0.2 - 0.3     106        0.085
0.3 - 0.4     515        0.414
0.4 - 0.5     455        0.366
0.5 - 0.6       0        0.000

This suggests redoing the analysis with stronger truncation. I'll repeat the model fitting using 70 m as the truncation distance (a sketch of the refit is shown below). What do we see when examining the distribution of the \(\widehat{P_a(z_i)}\) with the more stringent truncation?
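A minimal sketch of the refit, assuming the same key function and covariates as before (the object name amak.hr.obs.mas.70 is illustrative):

# Refit the selected model with the truncation distance reduced to 70 m
amak.hr.obs.mas.70 <- ds(amakihi, transect="point", key="hr", formula=~OBs+MAS,
                         convert_units=conv, truncation=70)
# Re-examine the distribution of estimated detection probabilities
p_dist_table(amak.hr.obs.mas.70, proportion=TRUE)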

Distribution of $P_a(z_i)$ from preferred model when w=70

        p   count   proportion
  0 - 0.1       0        0.000
0.1 - 0.2       0        0.000
0.2 - 0.3     207        0.184
0.3 - 0.4      48        0.043
0.4 - 0.5     678        0.603
0.5 - 0.6     191        0.170

Should this much truncation cause concern for point estimates and precision?

Starting model criticism over with a new truncation distance would not be welcome news. Concerns might also creep in that removing data could affect the point estimates and possibly the precision. In this case, moving the truncation distance from 82.5m to 70m removed only 19 detections (1.5%), so it is unlikely that precision was affected.
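A quick check of that claim (a sketch; just counting the raw radial distances):

# Detections falling between the new (70m) and old (82.5m) truncation distances
sum(amakihi$distance > 70 & amakihi$distance <= 82.5, na.rm=TRUE)
# ... as a proportion of the detections retained under the 82.5m truncation
mean(amakihi$distance[amakihi$distance <= 82.5] > 70, na.rm=TRUE)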

Regarding the effect upon the point estimates of amakihi density, let’s take a step back. If you were to examine the point and interval estimates for this data set from a number of detection function models, you would find the estimates are robust to model choice.

Density estimates for amakihi data set under a variety of detection function models and truncation distances.
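One way to build such a comparison is sketched below, assuming the objects fitted earlier are available; this particular model set (half-normal, hazard-rate, and the hazard-rate model with observer and minutes after sunrise) is illustrative rather than the full set compared there:

# Fit two covariate-free alternatives at the same truncation distance
amak.hn <- ds(amakihi, transect="point", key="hn", convert_units=conv, truncation=82.5)
amak.hr <- ds(amakihi, transect="point", key="hr", convert_units=conv, truncation=82.5)
# Density estimates (with confidence limits) from each fitted model
amak.hn$dht$individuals$D
amak.hr$dht$individuals$D
amak.hr.obs.mas$dht$individuals$D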

My impression

For this data set, the effort expended in discriminating between models with one vs two covariates as well as the angst about truncation distances had imperceptible effects upon the survey-specific density estimates. In fact, a model simply using survey replicate as a covariate produced largely the same density estimates. Understanding the temporal dynamics of the amakihi population was not influenced by covariate modelling.

More MCDS with line transects: ETP dolphins (optional)

We have not provided a comprehensive analysis of these data but have highlighted a few general features of these data. A more complete analysis can be found in F. F. C. Marques & Buckland (2003).

data(ETP_Dolphin)
head(ETP_Dolphin, n=3)
  Region.Label Area Sample.Label Effort object distance LnCluster Month
1      Default    0            1      1      1     2.25  6.792344     6
2      Default    0            1      1      2     3.15  7.080868     6
3      Default    0            1      1      3     3.00  5.181784     6
  Beauf.class Cue.type Search.method size Study.Area
1           2        3             0  891    Dolphin
2           2        1             2 1189    Dolphin
3           2        1             2  178    Dolphin
ETP_Dolphin_units[,1:2]
  Variable                Units
1   Effort        nautical mile
2 distance        nautical mile
3     Area square nautical mile

Notice that effort and perpendicular distances are both measured in nautical miles and that density is to be reported in animals per square nautical mile, so the conversion factor in this case is 1 and we do not specify it here.
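Were we to specify it explicitly, it could be built in the same way as before (a sketch, assuming "nautical mile" and "square nautical mile" are among the unit names accepted by convert_units):

# Conversion factor for nautical-mile effort/distances and density per square nautical mile
conv.etp <- convert_units("nautical mile", "nautical mile", "square nautical mile")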

To obtain an overall impression of the data, it is useful to plot a histogram with many intervals.

hist(ETP_Dolphin$distance, nclass=50, xlab="Distance (nm)",
     main="Tropical Pacific dolphin survey perpendicular distances")

The spikes in the histogram suggest that distances have been rounded to zero and possibly other values. To mitigate these problems, the distances could be binned although we do not do so in the analysis below. The distances have already been truncated at 5 nm and so we will not truncate distances further.
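For completeness, a sketch of how binning could be requested in ds (the 0.5 nm cutpoints are purely illustrative and not part of the analysis below):

# Group the distances into 0.5 nm bins to reduce the influence of rounding
etp.hn.binned <- ds(ETP_Dolphin, key="hn", adjustment=NULL,
                    cutpoints=seq(0, 5, by=0.5))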

boxplot(ETP_Dolphin$distance~ETP_Dolphin$Search.method, xlab="Search method", 
        ylab="Distance (nm)")
boxplot(ETP_Dolphin$distance~ETP_Dolphin$Cue.type, xlab="Cue", ylab="Distance (nm)")
boxplot(ETP_Dolphin$distance~ETP_Dolphin$Beauf.class, xlab="Beaufort class", 
        ylab="Distance (nm)")
boxplot(ETP_Dolphin$distance~ETP_Dolphin$Month, xlab="Month", ylab="Distance (nm)")

Search method as possible covariate

Cue type as possible covariate

Sea state as possible covariate

Survey month as possible covariate

To decide whether to fit a half normal or a hazard rate key function, each of these is tried in turn.

etp.hn <- ds(ETP_Dolphin, key="hn", adjustment=NULL)
etp.hr <- ds(ETP_Dolphin, key="hr", adjustment=NULL)
library(kableExtra)   # provides kable_styling() and the %>% pipe
knitr::kable(as.data.frame(AIC(etp.hn, etp.hr))) %>%
    kable_styling(bootstrap_options = "condensed", full_width = F)
           df      AIC
etp.hn      1 3377.489
etp.hr      2 3365.915

The AIC values suggest that the hazard rate key function is preferable to the half normal, and so this will be used as the key function in the MCDS models. Each covariate is introduced in turn.

etp.hr.search <- ds(ETP_Dolphin, key="hr", formula=~factor(Search.method))
etp.hr.cue <- ds(ETP_Dolphin, key="hr", formula=~factor(Cue.type))
etp.hr.bf <- ds(ETP_Dolphin, key="hr", formula=~factor(Beauf.class))
etp.hr.month <- ds(ETP_Dolphin, key="hr", formula=~factor(Month))
knitr::kable(summarize_ds_models(etp.hr, etp.hr.search, etp.hr.cue, etp.hr.bf, etp.hr.month,
                                 output="plain")[,2:7], row.names = FALSE,
             caption="ETP dolphin model selection.", digits=3) %>%
       kable_styling(bootstrap_options = "condensed", full_width = F)  
ETP dolphin model selection.
Key function   Formula                  C-vM $p$-value   Average detectability   se(Average detectability)   Delta AIC
Hazard-rate    ~factor(Search.method)            0.388                   0.584                       0.036       0.000
Hazard-rate    ~factor(Cue.type)                 0.449                   0.587                       0.039      23.314
Hazard-rate    ~factor(Month)                    0.494                   0.593                       0.039      24.264
Hazard-rate    ~1                                0.461                   0.601                       0.038      26.140
Hazard-rate    ~factor(Beauf.class)              0.474                   0.601                       0.039      27.993

Based on the AIC, it seems as though the model including search method was preferable and we could continue the model selection process by looking at models with two covariates. However, before going on it is worth looking at the search method model in more detail. If we look at the detection function parameters for this model:

print(etp.hr.search$ddf)

Distance sampling analysis object

Summary for ds object
Number of observations :  1090 
Distance range         :  0  -  5 
AIC                    :  3339.775 
Optimisation           :  mrds (nlminb) 

Detection function:
 Hazard-rate key function 

Detection function parameters 
Scale coefficient(s): 
                        estimate         se
(Intercept)            0.3355272  0.2072090
factor(Search.method)2 0.3208247  0.2112992
factor(Search.method)3 3.1274315 78.5965032
factor(Search.method)5 0.2430573  0.5378512

Shape coefficient(s):  
              estimate        se
(Intercept) 0.07827832 0.1387482

                        Estimate           SE         CV
Average p              0.5837971   0.03602487 0.06170786
N in covered region 1867.0869397 121.18692980 0.06490696

we see that the estimated scale coefficient for search method 3 is substantially larger than the estimated scale coefficients for other methods. The effect this has on the detection function is clearly seen in the detection function plot.

plot(etp.hr.search, pch=".")

Search method 3 indicates that the detection was made from a helicopter, and this detection function suggests that all dolphin schools out to 5 nm were being detected, so detectability does not decrease as distance increases. One assumption of MCDS is that the detection functions for the different factor levels of a covariate share a common shape (the covariate affecting only the scale), and so it may be worth refitting the models but excluding the observations made from the helicopter.
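A sketch of that follow-up refit, simply dropping the helicopter sightings (Search.method 3) from the flat data file; the object names are illustrative and this is exploratory rather than the analysis of F. F. C. Marques & Buckland (2003):

# Exclude the helicopter detections and refit the search-method model
etp.no.heli <- ETP_Dolphin[ETP_Dolphin$Search.method != 3, ]
etp.hr.search.nh <- ds(etp.no.heli, key="hr", formula=~factor(Search.method))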

Colourful plot noting effect of cue type

Just an example of using the function add_df_covar_line to visually explore the consequences of covariates on the detection function. A regular call to plot() is first used to produce the histogram and average detection function line; subsequent calls to the new function with different values of the covariate of interest complete the plot.

plot(etp.hr.cue, main="ETP dolphin survey", showpoints=FALSE)
add_df_covar_line(etp.hr.cue, data = data.frame(Cue.type=1), 
                  col='red', lwd=2, lty=1)
add_df_covar_line(etp.hr.cue, data = data.frame(Cue.type=2), 
                  col='blue', lwd=2, lty=1)
add_df_covar_line(etp.hr.cue, data = data.frame(Cue.type=3), 
                  col='green', lwd=2, lty=1)
add_df_covar_line(etp.hr.cue, data = data.frame(Cue.type=4), 
                  col='purple', lwd=2, lty=1)
legend("topright", 
       legend=c("Birds","Splashes","Unspecified","Floating objects"),
       col=c("red", "blue", "green", "purple"), lwd=2, title = "Cue type")

Detection function with cue type as covariate.

References

Marques, F. F. C., & Buckland, S. T. (2003). Incorporating covariates into standard line transect analyses. Biometrics, 59, 924–935. https://doi.org/10.1111/j.0006-341x.2003.00107.x
Marques, T. A., Thomas, L., Fancy, S. G., & Buckland, S. T. (2007). Improving estimates of bird density using multiple covariate distance sampling. The Auk, 124, 1229–1243. https://doi.org/10.1642/0004-8038(2007)124[1229:ieobdu]2.0.co;2