Migration of DistWin analysis structure to R

The Windows-based GUI software for conducting distance sampling analysis (Distance for Windows, heretofore DistWin) has been the industry-standard for two decades. Our development work continues to incorporate state-of-the-art analytical methods for software users. Most of that software development takes place in the R language.

The readdst package was created for statisticans and biologists who have existing DistWin projects and wish to port them into the R environment. It was originally intended as an in-house testing tool to enable comparison of analysis results of newly-developed analysis software, with results produced by DistWin. The purpose of the package is to convert data and analyses stored in DistWin projects into data objects and function calls for analysis using R.

This vignette is intended to give a few examples of the capabilities of readdst. Readers of this vignette should have some familiarity with the organisation of DistWin projects along with functions in the mrds library for fitting detection functions.

Note too, readdst can only work with DistWin projects created by DistWin versions 6 or 7.

Structure of DistWin project

DistWin projects carry not only the data collected in the field, but also survey descriptions, data filters, model definitions and analysis results. All of these facets are stored in Access database files created by DistWin. Reading the Access files into R is at the heart of readdst.

Example project files to transform

There is a project that comes with the distribution of DistWin located in My Documents\My Distance Projects\Sample Projects We will use to demonstrate the use of DistWin. In addition, we have taken another project that ships with DistWin, Stratify example, renamed it Vignette-stratify and added further analyses to that project for use in this vignette. This third example ships with the readdst R package and is found in the library folder associated with the readdst package (see below).

Purists among the readers will recognise that spaces in directory names can cause problems for R. If you will be converting DistWin projects located in other locations on a Windows machine, or are working under a different operating system where spaces in pathnames are forbidden, then you may need to make some alterations to the example code provided in this vignette.

Vignette-stratify

Synthetic line transect data, units of measure in non-SI units (effort and perpendicular distance in nautical miles). Animals were detected in groups, so cluster size enters into the calculation of animal density. The project contains a multiplier derived from an external estimate of \(\widehat{g(0)}\), but for purposes of this vignette, the use of the multiplier has been removed from the DistWin analysis.

The study area also contains two types of habitat and the design incorporated habitat information. Consequently, the brief analysis focuses upon whether it is most parsimonious to fit habitat-specific detection functions, or whether a single detection function pooled across habitat types.

Amakihi

This project demonstrates the use of the MRDS (multiple covariate) analysis engine in conjunction with a point transect sampling survey (using conventional SI measurement units). The analyses present in the project represent 16 model results presented in Marques et al. (2007). All analyses employ a data filter specifying an 82.5m truncation distance.

Conversion process

The following code chunk performs some preliminary setup before conducting any conversions. First, if running on a Windows machine, the code checks that the 32-bit version of R is being used. R uses the RODBC package to read Access files; this package is only supported by 32-bit versions of R. On a Mac OS X or Linux/Unix machine, readdst makes use of mdb-tools and the Hmisc R package.

Second, a path to the Sample Projects directory is created for subsequent use when accessing projects that ship with DistWin.

library(readdst)
if (.Platform$OS.type=="windows" && R.Version()$platform!="i386-w64-mingw32") print("32-bit version of R needed to run RODBC")
home.dir <- path.expand("~")
sample.proj.dir <- paste0(home.dir, "/My Distance Projects/Sample Projects/")

The Amakihi project that ships with DistWin and resides in My Documents\My Distance Projects\Sample Projects is converted with these two commands. There are no arguments to the function convert_project() other than the absolute path to the DistWin project (without the .dst file extension).

amakihi.proj.file <- paste0(sample.proj.dir, "Amakihi")
amakihi.proj <- convert_project(amakihi.proj.file)

The Vignette-stratify project that ships with readdst and resides wherever R libraries are stored, is covered with these three commands.

stratify.proj.directory <- system.file("Stratify", package="readdst")
stratify.proj.name <- paste0(stratify.proj.directory, "/Vignette-stratify")
stratify.proj <- convert_project(stratify.proj.name)

Components of converted DistWin projects

Analyses

The object created by convert_project() is a named list. There are as many elements in the list as analyses (run or not run) defined in the DistWin project. The names of the list elements correspond to the names of the analyses:

length(stratify.proj)
[1] 3
names(stratify.proj)
[1] "Half-normal cosine no stratification exact"
[2] "Half-normal cosine strat-specific detfn"   
[3] "Half-normal cosine pooled detfn"           

Each analysis is itself a named list, with list elements describing such things as data filter used and the equivalent call to ddf() from the mrds library to fit the detection function specified in the analysis. Several of these components will be discussed later.

Data for each analysis

Stored with each analysis are the data that were used in the analysis. Recall that DistWin allowed seperate specification of model definitions and data filters, which could be coupled to form an analysis. The data (in two forms) as well as the measurement units for effort, perpendicular (radial) distances and area are all stored in the env element of each analysis list element1.

ls(stratify.proj$'Half-normal cosine no stratification exact'$env)
[1] "data"         "obs.table"    "region.table" "sample.table"
[5] "units"       

Data are stored both in the flatfile format with all levels of the data hierarchy in a single object,

G0 G0.SE Cluster.strat distance size object Study.Area Region.Label Sample.Label
0.8367 0.1738 1 0.10 1 1 Stratify example Ideal Habitat 1
0.8367 0.1738 1 0.22 1 2 Stratify example Ideal Habitat 1
0.8367 0.1738 2 0.16 2 3 Stratify example Ideal Habitat 1

or in the multiple object form with a separate data frame for the region.table, sample.table and obs.table.

Analysis of converted project

The convert_project() function makes the data for a distance sampling analysis available and transforms the syntax of detection function fitting into calls to ddf(). This makes it possible to:

Repeating an analysis

Recall each analysis from a DistWin project is represented as a list element of an object created by convert_project(). The function run_analysis() takes as an argument an analysis and conducts that analysis in R using ddf() to fit a detection function and produce estimates of detection probability and \(\hat{N}\) in the covered region.

library(mrds)  # to access GOF function and dht()
This is mrds 2.1.15
Built: R 3.2.3; ; 2016-03-03 21:45:18 UTC; unix
stratify.reanalyse <- run_analysis(stratify.proj$'Half-normal cosine no stratification exact')
plot(stratify.reanalyse, showpoints=FALSE, pl.den=0)
gof.test <- ddf.gof(stratify.reanalyse, qq=FALSE)
text(1.2,0.9, cex=0.6,
     paste("K-S GOF D=",round(gof.test$dsgof$ks$Dn, 4), "\nP=", round(gof.test$dsgof$ks$p,3)))

summary(stratify.reanalyse)

Summary for ds object 
Number of observations :  90 
Distance range         :  0  -  1.94 
AIC                    :  63.87982 

Detection function:
 Half-normal key function 

Detection function parameters 
Scale coefficient(s):  
              estimate         se
(Intercept) -0.3514199 0.08078866

                       Estimate          SE         CV
Average p             0.4519568  0.03470614 0.07679084
N in covered region 199.1340969 21.80150344 0.10948152

Producing estimates of abundance from the fitted ddf() model requires a call to dht().

stratify.abund <- dht(stratify.reanalyse, 
                      region = stratify.proj$`Half-normal cosine no stratification exact`$env$region.table, 
                      sample = stratify.proj$`Half-normal cosine no stratification exact`$env$sample.table, 
                      obs = stratify.proj$`Half-normal cosine no stratification exact`$env$obs.table)
Region Area CoveredArea Effort n ER se.ER cv.ER mean.size se.mean
Ideal Habitat 85000 1470.52 379 84 0.2216359 0.0668354 0.3015550 2.153846 0.2905123
Marginal Habitat 600000 3918.80 1010 123 0.1217822 0.0523166 0.4295915 2.411765 0.3477394
Total 685000 5389.32 1389 207 0.1490281 0.0412544 0.2768229 2.300000 0.2330121
Label Estimate se cv lcl ucl
Ideal Habitat 10743.12 3343.031 0.3111788 5212.731 22140.91
Marginal Habitat 41668.36 18184.107 0.4364009 16071.287 108034.41
Total 52411.48 18631.076 0.3554770 24130.762 113836.55

The call to dht() shows the use of the hierarchial data type (region, sample and observation tables) available within each analysis.

It would be possible to perform new analyses of data present in DistWin projects, such as fitting a hazard-rate key function to the stratify data set that had only been fitted with half-normal detection functions.

stratify.hazard <- ddf(dsmodel=~cds(key="hr", formula=~1, adj.series="cos", adj.order=2),
                   meta.data=list(width=1.94000005722046,left=0), 
                   control=list(mono=TRUE, mono.strict=TRUE), method="ds", 
                   data=stratify.proj$'Half-normal cosine no stratification exact'$env$data)
hazard.aic <- stratify.hazard$criterion
halfnorm.aic <- stratify.reanalyse$criterion

This hazard rate detection function can also be fitted by an equivalent call to the ds() function found in the Distance package available on CRAN. ds() serves as a wrapper for ddf(). We show both the ddf() (created by run_analysis() in readdst) and ds() so readers can select the approach they desire.

stratify.hazard <- ds(key="hr", formula=~1, adj.series="cos", adj.order=2,
                   truncation=1.94000005722046,
                   data=stratify.proj$'Half-normal cosine no stratification exact'$env$data)

The hazard rate model (with a cosine adjustment) had an AIC of 60.5 while the unadjusted half-normal model had an AIC of 63.9 suggesting we were on firm footing when fitting the half-normal model to the stratify dataset.

Comparing results of DistWin and R analyses

DistWin projects contain not only the analyses but also the results of those analyses. Analysis results are stored in DistWin projects according to a coding scheme to identify the statistics or parameter estimates. Values of statistic and estimates can be extracted from a DistWin project using get_stats(); however this function is not intended for use by users.

Instead, get_stats() is usually called from within the function test_stats(). The test_stats() function takes as its first argument a single analysis from a converted DistWin project. The tasks performed by test_stats() are:

Statistic Distance_value mrds_value Difference Pass
n 90.0000000 90.0000000 0.0000000
parameters 1.0000000 1.0000000 0.0000000
AIC 63.8801003 63.8798192 0.0000044
Chi^2 p 0.1707298 0.9906514 4.8024520
P_a 0.4519590 0.4519568 0.0000050
CV(P_a) 0.0767919 0.0767908 0.0000136
log-likelihood -30.9400501 -30.9399096 0.0000000
K-S p 0.6884868 0.6884929 0.0000089
C-vM p 1.0000000 0.9000000 0.1000000
density 0.0505962 0.0573726 0.1339330
CV(density) 0.2693232 0.2818972 0.0466873
individuals 34658.0000000 39300.2621443 0.1339449
CV(individuals) 0.2693232 0.2818972 0.0466873

The column labeled Difference shows the ratio of the mrds value over the DistWin value. When called from the R console, the final Pass column will possess ticks if the ratio is less than 0.05 to quickly assess the agreement between DistWin and mrds result components.

Note test_stats() can take only a single analysis as an argument, rather than a complete set of analyses that may be embedded within a DistWin project. To test multiple analyses, lapply() could be wrapped around the call to test_stats().

There are a large number of reasons for disagreement between DistWin and R results. Reasons may include differences in:

Migrations not performed by readdst

readdst is a work in progress. As noted, it was developed for in-house testing of developing code against known solutions produced by DistWin. There are many nuances of analyses that can be performed by DistWin that have not been included in readdst conversion feature. The following list is incomplete, but gives a general idea of what is not included in readdst capabilities.

For this reason we recommend using the latest version of readdst from GitHub using the following commands:

install.packages("devtools")
devtools::install_github("DistanceDevelopment/readdst")

  1. Note that environments in R have rather different behaviour to other R objects – if you modify them within a function, you modify the object everywhere (so called “reference semantics”). This can have unexpected consequences when you pass them to your own functions. See http://adv-r.had.co.nz/Environments.html for more information.