Migration of DistWin analysis structure to R

The Windows-based GUI software for conducting distance sampling analysis (Distance for Windows, heretofore DistWin) has been the industry-standard for two decades. Our development work continues to incorporate state-of-the-art analytical methods for software users. Most of that software development takes place in the R language.

The readdst package was created for statisticans and biologists who have existing DistWin projects and wish to port them into the R environment. It was originally intended as an in-house testing tool to enable comparison of analysis results of newly-developed analysis software, with results produced by DistWin. The purpose of the package is to convert data and analyses stored in DistWin projects into data objects and function calls for analysis using R.

This vignette is intended to give a few examples of the capabilities of readdst. Readers of this vignette should have some familiarity with the organisation of DistWin projects along with functions in the mrds library for fitting detection functions.

Note too, readdst can only work with DistWin projects created by DistWin versions 6 or 7.

Structure of `DistWin` project

DistWin projects carry not only the data collected in the field, but also survey descriptions, data filters, model definitions and analysis results. All of these facets are stored in Access database files created by DistWin. Reading the Access files into R is at the heart of readdst.

Survey information includes point or line transect types, whether animals were recorded as individuals or clusters, whether perpendicular distances or radial distances and angles were recorded.
Data filters possess information about sightings to include in analysis (truncation distance, species indicators for multi-species surveys)
Model definitions describe the key function and adjustment term combinations to be fitted as the detection function, whether data will be combined across strata for detection function fitting, etc.
Analysis results are the product of submitting the survey/filter/model combination to the analysis engines in DistWin.

Example project files to transform

There is a project that comes with the distribution of DistWin located in My Documents\My Distance Projects\Sample Projects We will use to demonstrate the use of DistWin. In addition, we have taken another project that ships with DistWin, Stratify example, renamed it Vignette-stratify and added further analyses to that project for use in this vignette. This third example ships with the readdst R package and is found in the library folder associated with the readdst package (see below).

Purists among the readers will recognise that spaces in directory names can cause problems for R. If you will be converting DistWin projects located in other locations on a Windows machine, or are working under a different operating system where spaces in pathnames are forbidden, then you may need to make some alterations to the example code provided in this vignette.

Vignette-stratify

Synthetic line transect data, units of measure in non-SI units (effort and perpendicular distance in nautical miles). Animals were detected in groups, so cluster size enters into the calculation of animal density. The project contains a multiplier derived from an external estimate of \(\widehat{g(0)}\), but for purposes of this vignette, the use of the multiplier has been removed from the DistWin analysis.

The study area also contains two types of habitat and the design incorporated habitat information. Consequently, the brief analysis focuses upon whether it is most parsimonious to fit habitat-specific detection functions, or whether a single detection function pooled across habitat types.

Amakihi

This project demonstrates the use of the MRDS (multiple covariate) analysis engine in conjunction with a point transect sampling survey (using conventional SI measurement units). The analyses present in the project represent 16 model results presented in Marques et al. (2007). All analyses employ a data filter specifying an 82.5m truncation distance.

Conversion process

The following code chunk performs some preliminary setup before conducting any conversions. First, if running on a Windows machine, the code checks that the 32-bit version of R is being used. R uses the RODBC package to read Access files; this package is only supported by 32-bit versions of R. On a Mac OS X or Linux/Unix machine, readdst makes use of mdb-tools and the Hmisc R package.

Second, a path to the Sample Projects directory is created for subsequent use when accessing projects that ship with DistWin.

library(readdst)
if (.Platform$OS.type=="windows" && R.Version()$platform!="i386-w64-mingw32") print("32-bit version of R needed to run RODBC")
home.dir <- path.expand("~")
sample.proj.dir <- paste0(home.dir, "/My Distance Projects/Sample Projects/")

The Amakihi project that ships with DistWin and resides in My Documents\My Distance Projects\Sample Projects is converted with these two commands. There are no arguments to the function convert_project() other than the absolute path to the DistWin project (without the .dst file extension).

amakihi.proj.file <- paste0(sample.proj.dir, "Amakihi")
amakihi.proj <- convert_project(amakihi.proj.file)

The Vignette-stratify project that ships with readdst and resides wherever R libraries are stored, is covered with these three commands.

stratify.proj.directory <- system.file("Stratify", package="readdst")
stratify.proj.name <- paste0(stratify.proj.directory, "/Vignette-stratify")
stratify.proj <- convert_project(stratify.proj.name)

Components of converted `DistWin` projects

Analyses

The object created by convert_project() is a named list. There are as many elements in the list as analyses (run or not run) defined in the DistWin project. The names of the list elements correspond to the names of the analyses:

length(stratify.proj)

[1] 3

names(stratify.proj)

[1] "Half-normal cosine no stratification exact"
[2] "Half-normal cosine strat-specific detfn"   
[3] "Half-normal cosine pooled detfn"

Each analysis is itself a named list, with list elements describing such things as data filter used and the equivalent call to ddf() from the mrds library to fit the detection function specified in the analysis. Several of these components will be discussed later.

Data for each analysis

Stored with each analysis are the data that were used in the analysis. Recall that DistWin allowed seperate specification of model definitions and data filters, which could be coupled to form an analysis. The data (in two forms) as well as the measurement units for effort, perpendicular (radial) distances and area are all stored in the env element of each analysis list element¹.

ls(stratify.proj$'Half-normal cosine no stratification exact'$env)

[1] "data"         "obs.table"    "region.table" "sample.table"
[5] "units"

Data are stored both in the flatfile format with all levels of the data hierarchy in a single object,

G0	G0.SE	Cluster.strat	distance	size	object	Study.Area	Region.Label	Sample.Label
0.8367	0.1738	1	0.10	1	1	Stratify example	Ideal Habitat	1
0.8367	0.1738	1	0.22	1	2	Stratify example	Ideal Habitat	1
0.8367	0.1738	2	0.16	2	3	Stratify example	Ideal Habitat	1

or in the multiple object form with a separate data frame for the region.table, sample.table and obs.table.

Analysis of converted project

The convert_project() function makes the data for a distance sampling analysis available and transforms the syntax of detection function fitting into calls to ddf(). This makes it possible to:

repeat the analyses performed by DistWin in R, and/or
conduct new analyses of data that have been stored in DistWin.

Repeating an analysis

Recall each analysis from a DistWin project is represented as a list element of an object created by convert_project(). The function run_analysis() takes as an argument an analysis and conducts that analysis in R using ddf() to fit a detection function and produce estimates of detection probability and \(\hat{N}\) in the covered region.

library(mrds)  # to access GOF function and dht()

This is mrds 2.1.15
Built: R 3.2.3; ; 2016-03-03 21:45:18 UTC; unix

stratify.reanalyse <- run_analysis(stratify.proj$'Half-normal cosine no stratification exact')
plot(stratify.reanalyse, showpoints=FALSE, pl.den=0)
gof.test <- ddf.gof(stratify.reanalyse, qq=FALSE)
text(1.2,0.9, cex=0.6,
     paste("K-S GOF D=",round(gof.test$dsgof$ks$Dn, 4), "\nP=", round(gof.test$dsgof$ks$p,3)))

summary(stratify.reanalyse)


Summary for ds object 
Number of observations :  90 
Distance range         :  0  -  1.94 
AIC                    :  63.87982 

Detection function:
 Half-normal key function 

Detection function parameters 
Scale coefficient(s):  
              estimate         se
(Intercept) -0.3514199 0.08078866

                       Estimate          SE         CV
Average p             0.4519568  0.03470614 0.07679084
N in covered region 199.1340969 21.80150344 0.10948152

Producing estimates of abundance from the fitted ddf() model requires a call to dht().

stratify.abund <- dht(stratify.reanalyse, 
                      region = stratify.proj$`Half-normal cosine no stratification exact`$env$region.table, 
                      sample = stratify.proj$`Half-normal cosine no stratification exact`$env$sample.table, 
                      obs = stratify.proj$`Half-normal cosine no stratification exact`$env$obs.table)

Region	Area	CoveredArea	Effort	n	ER	se.ER	cv.ER	mean.size	se.mean
Ideal Habitat	85000	1470.52	379	84	0.2216359	0.0668354	0.3015550	2.153846	0.2905123
Marginal Habitat	600000	3918.80	1010	123	0.1217822	0.0523166	0.4295915	2.411765	0.3477394
Total	685000	5389.32	1389	207	0.1490281	0.0412544	0.2768229	2.300000	0.2330121

Label	Estimate	se	cv	lcl	ucl
Ideal Habitat	10743.12	3343.031	0.3111788	5212.731	22140.91
Marginal Habitat	41668.36	18184.107	0.4364009	16071.287	108034.41
Total	52411.48	18631.076	0.3554770	24130.762	113836.55

The call to dht() shows the use of the hierarchial data type (region, sample and observation tables) available within each analysis.

It would be possible to perform new analyses of data present in DistWin projects, such as fitting a hazard-rate key function to the stratify data set that had only been fitted with half-normal detection functions.

stratify.hazard <- ddf(dsmodel=~cds(key="hr", formula=~1, adj.series="cos", adj.order=2),
                   meta.data=list(width=1.94000005722046,left=0), 
                   control=list(mono=TRUE, mono.strict=TRUE), method="ds", 
                   data=stratify.proj$'Half-normal cosine no stratification exact'$env$data)
hazard.aic <- stratify.hazard$criterion
halfnorm.aic <- stratify.reanalyse$criterion

This hazard rate detection function can also be fitted by an equivalent call to the ds() function found in the Distance package available on CRAN. ds() serves as a wrapper for ddf(). We show both the ddf() (created by run_analysis() in readdst) and ds() so readers can select the approach they desire.

stratify.hazard <- ds(key="hr", formula=~1, adj.series="cos", adj.order=2,
                   truncation=1.94000005722046,
                   data=stratify.proj$'Half-normal cosine no stratification exact'$env$data)

The hazard rate model (with a cosine adjustment) had an AIC of 60.5 while the unadjusted half-normal model had an AIC of 63.9 suggesting we were on firm footing when fitting the half-normal model to the stratify dataset.

Comparing results of `DistWin` and R analyses

DistWin projects contain not only the analyses but also the results of those analyses. Analysis results are stored in DistWin projects according to a coding scheme to identify the statistics or parameter estimates. Values of statistic and estimates can be extracted from a DistWin project using get_stats(); however this function is not intended for use by users.

Instead, get_stats() is usually called from within the function test_stats(). The test_stats() function takes as its first argument a single analysis from a converted DistWin project. The tasks performed by test_stats() are:

run the specified analysis in R,
extract the DistWin-generated results from the DistWin project and
compare the two sets of results producing a table showing agreement (or disagreement) between DistWin and R results.

Statistic	Distance_value	mrds_value	Difference	Pass
n	90.0000000	90.0000000	0.0000000	✓
parameters	1.0000000	1.0000000	0.0000000	✓
AIC	63.8801003	63.8798192	0.0000044	✓
Chi^2 p	0.1707298	0.9906514	4.8024520
P_a	0.4519590	0.4519568	0.0000050	✓
CV(P_a)	0.0767919	0.0767908	0.0000136	✓
log-likelihood	-30.9400501	-30.9399096	0.0000000	✓
K-S p	0.6884868	0.6884929	0.0000089	✓
C-vM p	1.0000000	0.9000000	0.1000000
density	0.0505962	0.0573726	0.1339330
CV(density)	0.2693232	0.2818972	0.0466873
individuals	34658.0000000	39300.2621443	0.1339449
CV(individuals)	0.2693232	0.2818972	0.0466873

The column labeled Difference shows the ratio of the mrds value over the DistWin value. When called from the R console, the final Pass column will possess ticks if the ratio is less than 0.05 to quickly assess the agreement between DistWin and mrds result components.

Note test_stats() can take only a single analysis as an argument, rather than a complete set of analyses that may be embedded within a DistWin project. To test multiple analyses, lapply() could be wrapped around the call to test_stats().

There are a large number of reasons for disagreement between DistWin and R results. Reasons may include differences in:

convergence between optimisation engines in the two pieces of software,
treatment of size-biased regression adjustment for objects detected in clusters.

Migrations not performed by `readdst`

readdst is a work in progress. As noted, it was developed for in-house testing of developing code against known solutions produced by DistWin. There are many nuances of analyses that can be performed by DistWin that have not been included in readdst conversion feature. The following list is incomplete, but gives a general idea of what is not included in readdst capabilities.

multipliers
bootstrapping of variances
analyses involving the dsm, mads or DSsim engines
post-stratification.

For this reason we recommend using the latest version of readdst from GitHub using the following commands:

install.packages("devtools")
devtools::install_github("DistanceDevelopment/readdst")

Transforming DistWin projects to the `R` environment

Eric Rexstad and David L. Miller

05 April 2016

Migration of DistWin analysis structure to R

Structure of `DistWin` project

Example project files to transform

Vignette-stratify

Amakihi

Conversion process

Components of converted `DistWin` projects

Analyses

Data for each analysis

Analysis of converted project

Repeating an analysis

Comparing results of `DistWin` and R analyses

Migrations not performed by `readdst`

Transforming DistWin projects to the R environment

Eric Rexstad and David L. Miller

05 April 2016

Migration of DistWin analysis structure to R

Structure of DistWin project

Example project files to transform

Vignette-stratify

Amakihi

Conversion process

Components of converted DistWin projects

Analyses

Data for each analysis

Analysis of converted project

Repeating an analysis

Comparing results of DistWin and R analyses

Migrations not performed by readdst

Transforming DistWin projects to the `R` environment

Structure of `DistWin` project

Components of converted `DistWin` projects

Comparing results of `DistWin` and R analyses

Migrations not performed by `readdst`