# 2 Create and analyse a simulated dataset

The purpose of this session is to refresh our memories of analysing straightforward distance sampling data. We will simulate a simple line transect dataset using the Distance simulation engine, so we know the true population size and detection function, and then analyze the data to see how close we are to the truth. The exercise can be completed using Distance for Windows or `R`

(or both!).

Note that there will be a full session on the simulation capabilities of Distance later in the workshop, so we won’t try to tell you all about them here. Instead, we’ll give you the bare minimum of instructions you need to simulate a dataset, so you can focus on analyzing it.

## 2.1 Exercise using Distance for Windows

This section is for you if you want to use Distance for Windows. If you prefer to use `R`

directly, skip to Exercise using R.

We’ll assume you are familiar with analysis of line transect data using Distance for Windows, but that you have not yet used the simulation engine. Hence the simulation instructions are quite detailed.

### 2.1.1 Simulate a dataset

In Distance for Windows, open the

`Simulation example`

sample project (Choose`File | Open Project...`

and browse to the`Sample Projects`

folder within`My Distance Projects`

; click on the`Simulation example`

project file and click on`Open`

.)Once the project is open, click on the

`Simulations`

tab of the`Project Browser`

, and click on the 2nd simulation, called`Parallel Simulation`

and open the`Simulation Details`

window for this simulation (one way to do this is to choose the menu items`Simulation | Simulation Details`

).Click the

`Properties`

button that is just below the`Run`

button - this opens the`Simulation Properties`

window.In the

`Simulation`

tab, under`Number of repetitions`

, change the number 10 to 1, so it says`Repeat the simulation`

1`times`

.

In the

`Misc.`

tab, tick the option`Save example dataset to file`

Don’t look in any of the other tabs for now – they contain information about the true population size and detection function, and it’ll be more fun if you don’t know what these are until after you’ve analyzed the data!

Click

`OK`

to close the`Simulation Properties`

window. Click`Run`

to run the simulation. After a pause while the simulation is running, the`Results`

tab should go green.

You have now generated a simulated dataset, as a comma-separated text file (a “.csv” file). Have a look for it using the Windows `File Explorer`

. The file will be located in the `Simulation example.dat`

project folder (which in turn will be under `My Distance Projects`

. In the project folder, look in the subfolder `Simulation`

and within that in the subfolder `Simulation2`

. You should see the file `dist.data.2.csv`

- Copy this file into the
`My Distance Projects`

folder, and rename it`SimData.txt`

or some such. Whatever you call it, make sure there is only one dot “.” in its name, otherwise Distance is not going to want to import it (see later in these instructions).

### 2.1.2 Import the data into Distance

Back in

`Distance`

, create a new`Distance project`

in the`My Distance Projects`

folder. (Choose`File | New Project...`

.) Give it whatever name you like (e.g.,`Simulated dataset`

). In the`New Project Setup Wizard`

follow all the default options, except on`Step 4: Measurement Units`

you should specify that Distances are in Meters, Transect lengths are in Meters and Areas are in Square meters. On the last step (`Step 6: Finished`

), choose the option to`Proceed to Data Import Wizard`

. Click on`Finish`

.In the

`Data Import Wizard`

, in`Step 2`

, choose to import the file containing the simulated data (`SimData.txt`

or whatever you called it).In

`Step 4`

, choose Delimiter`Comma`

and tick the option to`Do not import first row`

.In

`Step 5`

, tick the option`Columns are in the same order as they will appear in the data sheet`

- this saves you having to tell`Distance`

which column is which. Click`Next`

and then`Finish`

.Check the data has imported correctly using the

`Data`

tab of the`Project Browser`

. You may need to fix one annoying*bug*in the simulation engine: if there are any transects that have no observations then they don’t get exported to the .csv file. So, check that you have around 20 transects, and that the labels go sequentially from 1 to 20. If not, you’ll have to manually add a transect (with line length 200). If you don’t know how to do this, ask one of the instructors.

### 2.1.3 Analyze the data in Distance

Once you’ve checked the data imported correctly, you can analyze the data in the normal way in Distance. (There should be no need to truncate the data, as it was truncated in the simulation at 3m.)

You might want to try a few detection functions and use AIC to select among them. Check the goodness-of-fit of the final model you select.

Optional (if you know about the advanced variance options): once you’ve selected a detection function model, you might want to think about which of the encounter rate variance options best suits the simulated scenario, which was a set of systematic parallel transects.

After you’ve finished your analysis, note the abundance estimate and confidence interval, and then open up the simulation project again. The true abundance is given in the `Simulation properties`

, `Population`

tab, under `Popultion size`

, `Fixed N`

. How close is your estimate to the true value? Does the confidence interval contain the true value?

The true detection function is given under the `Detectability`

tab. Did you select the true function? If not, why not, and does it matter?

## 2.2 Exercise using R

This section performs the same simulation in `R`

. Of course the actual simulated sample size and distances won’t be the same, as they are different each time the simulation is run, but the scenario is the same.

You’ll need to have the `DSsim`

and `Distance`

libraries installed in the version of R you’re using. Ask an instructor if you don’t know how to do this.

### 2.2.1 Simulate a dataset

We’re going to use the `DSsim`

package to do the simulation. Paste the code below into the RStudio console (or paste them into an `R`

script window and then click `Run`

to source them into the console).

```
library(DSsim)
library(shapefiles)
```

`## Loading required package: foreign`

```
##
## Attaching package: 'shapefiles'
```

```
## The following objects are masked from 'package:foreign':
##
## read.dbf, write.dbf
```

```
#Create study region
region <- make.region(coords = list(list(data.frame(x=c(0,0,500,500),y=c(0,200,200,0)))))
#Create population description
dens <- make.density(region = region, x.space = 8, y.space = 8, constant = c(10))
dens <- add.hotspot(dens, centre = c(100,50), sigma = 100, amplitude = 5)
dens <- add.hotspot(dens, centre = c(250,200), sigma = 50, amplitude = -5)
dens <- add.hotspot(dens, centre = c(300,70), sigma = 150, amplitude = 3)
pop.description <- make.population.description(region.obj = region, density.obj = dens,
N = c(500), fixed.N = TRUE)
#Specify detectability
det <- make.detectability(key.function = 'hr', scale.param = c(1.8), shape.param = c(3), truncation = 3)
#Create design
design <- make.design(transect.type = 'line', design.details = c('Parallel', 'Systematic'),
region.obj = region, spacing = 25)
analysis <- make.ddf.analysis.list(dsmodel = list(~cds(key = 'hn', formula = ~1),
~cds(key = 'hr', formula = ~1)),
method = 'ds', criteria = 'AIC', truncation = 3)
#Create simulation specification
sim <- make.simulation(reps = 1, single.transect.set = FALSE, region.obj = region,
design.obj = design, population.description.obj = pop.description,
detectability.obj = det, ddf.analyses.list = analysis)
#Run a single survey and plot it
survey.res <- create.survey.results(sim,dht.tables=TRUE)
plot(survey.res)
```

```
#Extract the data needed for distance sampling analysis
distance.data <- get.distance.data(survey.res)
region.table <- survey.res@region.table@region.table
sample.table <- survey.res@sample.table@sample.table
obs.table <- survey.res@obs.table@obs.table
```

You are now ready to analyze the data!

### 2.2.2 Detection function modelling

You may not be familiar with the `Distance`

package in `R`

, so the following code walks you through a simple analyses of the data we just generated. If you are familiar with the `Distance`

package, feel free to ignore this and do your own analysis.

The following code illustrates detection function modelling - feel free to try your own detection functions.

```
library(Distance)
hncos.simdata <- ds(distance.data,key="hn",adjustment="cos",truncation=3)
```

`## Starting AIC adjustment term selection.`

`## Fitting half-normal key function`

`## Key only model: not constraining for monotonicity.`

`## AIC= 155.347`

`## Fitting half-normal key function with cosine(2) adjustments`

`## AIC= 157.218`

```
##
## Half-normal key function selected.
```

`## No survey area information supplied, only estimating detection function.`

`unifcos.simdata <- ds(distance.data,key="unif",adjustment="cos",truncation=3)`

`## Starting AIC adjustment term selection.`

`## Fitting uniform key function with cosine(1) adjustments`

`## AIC= 155.346`

`## Fitting uniform key function with cosine(1,2) adjustments`

`## AIC= 157.201`

```
##
## Uniform key function with cosine(1) adjustments selected.
```

`## No survey area information supplied, only estimating detection function.`

`hrpoly.simdata <- ds(distance.data,key="hr",adjustment="poly",truncation=3)`

`## Starting AIC adjustment term selection.`

`## Fitting hazard-rate key function`

`## Key only model: not constraining for monotonicity.`

`## AIC= 159.136`

`## Fitting hazard-rate key function with simple polynomial(2) adjustments`

`## AIC= 158.812`

`## Fitting hazard-rate key function with simple polynomial(2,4) adjustments`

`## AIC= 160.17`

```
##
## Hazard-rate key function with simple polynomial(2) adjustments selected.
```

`## No survey area information supplied, only estimating detection function.`

Compare the models, for example using AIC. One can also plot the fitted functions, and look at goodness of fit – you should certainly do that for the detection function you select. As an illustration, here’s model checking code for the hazard rate function:

`plot(hrpoly.simdata)`

`ds.gof(hrpoly.simdata)`

```
##
## Goodness of fit results for ddf object
##
## Chi-square tests
## [0,0.333] (0.333,0.667] (0.667,1] (1,1.33] (1.33,1.67]
## Observed 24.0000000 13.000000 9.000000 13.0000000 9.0000000
## Expected 21.1957402 17.342874 13.097283 10.0025734 7.7076881
## Chisquare 0.3710119 1.087511 1.281772 0.8982255 0.2166759
## (1.67,2] (2,2.33] (2.33,2.67] (2.67,3] Total
## Observed 8.0000000 5.00000000 1.000000 3.0000000 85.000000
## Expected 5.9138386 4.44450203 3.194986 2.1005139 85.000000
## Chisquare 0.7359128 0.06942915 1.507977 0.3851796 6.553694
##
## P = 0.256 with 5 degrees of freedom
##
## Distance sampling Kolmogorov-Smirnov test
## Test statistic = 0.086215 P = 0.55252
##
## Distance sampling Cramer-von Mises test (unweighted)
## Test statistic = 0.10741 P = 0.54974
```

### 2.2.3 Abundance estimation

Having selected a function, we want to estimate abundance. To do this, we need to pass in the other data tables. Here’s example code, again using the hazard rate detection function.

```
abund<-ds(distance.data,key="hr",adjustment="poly",truncation=3,
region.table=region.table,sample.table=sample.table,obs.table=obs.table)
```

`## Starting AIC adjustment term selection.`

`## Fitting hazard-rate key function`

`## Key only model: not constraining for monotonicity.`

`## AIC= 159.136`

`## Fitting hazard-rate key function with simple polynomial(2) adjustments`

`## AIC= 158.812`

`## Fitting hazard-rate key function with simple polynomial(2,4) adjustments`

`## AIC= 160.17`

```
##
## Hazard-rate key function with simple polynomial(2) adjustments selected.
```

`summary(abund)`

```
##
## Summary for distance analysis
## Number of observations : 85
## Distance range : 0 - 3
##
## Model : Hazard-rate key function with simple polynomial adjustment term of order 2
##
## Strict monotonicity constraints were enforced.
## AIC : 158.8118
##
## Detection function parameters
## Scale coefficient(s):
## estimate se
## (Intercept) -0.1705974 0.5433225
##
## Shape coefficient(s):
## estimate se
## (Intercept) 5.496784e-11 0.7214452
##
## Adjustment term coefficient(s):
## estimate se
## poly, order 2 -0.6998223 0.4492375
##
## Estimate SE CV
## Average p 0.4358497 0.09642113 0.2212256
## N in covered region 195.0213606 45.97619684 0.2357495
##
## Summary statistics:
## Region Area CoveredArea Effort n k ER se.ER cv.ER
## 1 region 1e+05 24000 4000 85 20 0.02125 0.00194919 0.09172659
##
## Abundance:
## Label Estimate se cv lcl ucl df
## 1 Total 812.589 194.8507 0.23979 508.3216 1298.983 99.85814
##
## Density:
## Label Estimate se cv lcl ucl df
## 1 Total 0.00812589 0.001948507 0.23979 0.005083216 0.01298983 99.85814
```

The simulation scenario specified 500 individuals. How close is your estimate to the true value? Does your confidence interval contain the true value?

The true detection function was a hazard rate with no adjustment terms. Did you select the true function? If not, why not, and does it matter?