A robust P-spline approach to closed population capture–recapture models with time dependence and heterogeneity
Introduction
Capture–recapture experiments are a sampling scheme for a population where the primary aim is to estimate the size of the population. They are widely used in ecological statistics in particular and present many challenges to a statistician if the capture probabilities are heterogeneous between individuals. The data from these experiments consist of individual capture histories that record whether an individual has been captured or not captured on each sampling occasion along with individual and environmental covariates but no information is available on the uncaptured individuals. One of the frustrations in the analysis of capture–recapture experiments is that an analyst is well aware of the flexible models available to model probabilities in a logistic regression setting but few of these have been readily available to model the individual capture probabilities. Here, we partially remedy this by implementing the P-spline approach to nonparametric modeling for the generalized linear model (GLM) (McCullagh and Nelder, 1989) arising from the conditional likelihood approach to capture–recapture experiments (Huggins, 1989).
In general, populations may be open or closed, and here we consider closed populations with neither immigration nor emigration over the period of the study. Modern population size estimators for closed populations arose from the seminal works of Darroch (1958), Jolly (1965) and Seber (1965), where capture probabilities are assumed to be the same for each individual but allowed to vary across capture occasions. Further model developments allowed capture probabilities to vary with time, behavioral response or heterogeneity between individuals (Pollock, 1974, Otis et al., 1978). Covariates can be used to explain heterogeneity in the capture probabilities. For example, in closed population models Pollock et al. (1984) and Huggins (1989) related covariates collected on individuals (such as body weight or sex) to capture probabilities and used maximum likelihood or conditional likelihood to estimate model parameters. Furthermore, time effects modeled as factors or as functions of covariates, such as the recorded temperature on the capture occasion, can also be considered.
We are concerned with the model in which capture probabilities may depend on individual covariates and time. A wide variety of approaches have been considered for fitting this model, including martingale methods (Lloyd and Yip, 1991), sample coverage models (Chao et al., 1992), log-linear and latent class models (Agresti, 1994), finite mixture models (Pledger, 2000) and, as mentioned above, the use of individual covariates in GLMs (Huggins, 1989). In practice, the relationship between capture probabilities and covariates can be quite nonlinear, and assuming linearity may lead to model misspecification (Huggins and Hwang, 2007). To overcome this, semi/nonparametric techniques have been introduced to capture–recapture models (Chen and Lloyd, 2000, Zwane and van der Heijden, 2004, Huggins and Hwang, 2007, Hwang and Huggins, 2007). Nonparametric models based on P-splines (Eilers and Marx, 1996) have become popular in applied statistics due to their flexibility and simple setup (Wood, 2006). P-splines are low rank smoothers that use a B-spline basis (de Boor, 2001) with a difference penalty applied to the coefficients of adjacent B-splines. We consider generalized additive models (GAMs) (Hastie and Tibshirani, 1990, Marx and Eilers, 1998, Wood, 2006, Wang et al., 2011), which model the capture probabilities as a known function of a sum of smooth functions of covariates. The approach of Huggins (1989) uses a conditional likelihood to estimate the parameters in the model for the capture probabilities, which yields a GLM, and then a Horvitz–Thompson (H–T) estimator (Horvitz and Thompson, 1952) to estimate the population size. Here, we extend this model using P-splines, rather than the local polynomials of Huggins and Hwang (2007) and Hwang and Huggins (2007).
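To make the P-spline construction concrete, the following is a minimal sketch, not the authors' implementation. It evaluates a B-spline basis on equally spaced knots with the Cox–de Boor recursion and computes a difference penalty on adjacent coefficients. The function names, the knot layout and the smoothing parameter `lam` are illustrative assumptions that do not appear in the paper.

```python
def equally_spaced_knots(lo, hi, n_segments, degree):
    """Knots covering [lo, hi] in n_segments equal steps, extended by
    `degree` extra knots on each side (hypothetical helper)."""
    h = (hi - lo) / n_segments
    return [lo + (k - degree) * h for k in range(n_segments + 2 * degree + 1)]


def bspline_basis(x, knots, degree):
    """Evaluate every B-spline basis function of the given degree at x
    using the Cox-de Boor recursion. Returns len(knots) - degree - 1 values."""
    # Degree 0: indicator functions of the half-open knot intervals.
    basis = [1.0 if knots[i] <= x < knots[i + 1] else 0.0
             for i in range(len(knots) - 1)]
    # Raise the degree one step at a time.
    for d in range(1, degree + 1):
        raised = []
        for i in range(len(knots) - d - 1):
            left = (x - knots[i]) / (knots[i + d] - knots[i])
            right = (knots[i + d + 1] - x) / (knots[i + d + 1] - knots[i + 1])
            raised.append(left * basis[i] + right * basis[i + 1])
        basis = raised
    return basis


def difference_penalty(coef, order, lam):
    """lam times the sum of squared order-th differences of adjacent
    coefficients, the roughness penalty used by P-splines."""
    diffs = [float(c) for c in coef]
    for _ in range(order):
        diffs = [b - a for a, b in zip(diffs, diffs[1:])]
    return lam * sum(d * d for d in diffs)


# Cubic basis on [0, 1] with 5 segments gives 5 + 3 = 8 basis functions.
knots = equally_spaced_knots(0.0, 1.0, n_segments=5, degree=3)
row = bspline_basis(0.37, knots, degree=3)
```

In a fitted model, each row of basis values multiplies a coefficient vector inside the link function, and the penalty is subtracted from the log-likelihood. A second-order penalty leaves coefficients that lie on a straight line unpenalized, so heavy smoothing shrinks the fit toward a linear effect rather than toward a constant.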
We saw in simulations and examples that using the H–T estimator as in Huggins (1989) can result in unrealistically large estimates of the population size. This problem can occur in both parametric and nonparametric estimation when extreme values of the covariate are not smoothed, as they are not close to the remainder of the data. This motivates the development of a robust estimator of the population size. There are two possible approaches to robustifying the population size estimator. The first is to robustify the GLM or GAM equations, and the second is to modify the H–T estimator; of course, one could combine both procedures. We consider the second approach, as the first is an extension of known results for GLMs (Cantoni and Ronchetti, 2001). Moreover, even if we do have robust estimators of the model parameters, the estimated capture probabilities corresponding to outlying covariate values may still be extremely small, resulting in inflated estimates of the population size. That is, the model that is fitted to the bulk of the data may not be appropriate at some extreme covariate values. Because of the form of the H–T estimator, robustifying it is less straightforward than usual, and care must be taken not to introduce excessive bias. Previous robust versions of the H–T estimator have tended to focus on the more common survey sampling applications, where the sampling probabilities are known and the outliers occur in the survey variable (Chambers, 1986, Ghosh, 2008). The situation here is different, as the survey variable is a known constant for each individual and the outliers occur in the sampling probabilities. Thus, we downweight unusual sampling probabilities rather than large residuals in the survey variable.
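The sensitivity described above can be illustrated with a small numerical sketch. It assumes the estimated per-occasion capture probabilities are already available (here they are made-up numbers, not estimates from any data set) and computes the H–T estimate as the sum, over captured individuals, of the inverse probability of being captured at least once. The function names are hypothetical, and the sketch does not implement the robust estimator developed in the paper.

```python
import math


def prob_captured_at_least_once(p_occasions):
    """Probability that an individual is caught on at least one occasion,
    assuming independent occasions: 1 - prod_j (1 - p_j)."""
    return 1.0 - math.prod(1.0 - p for p in p_occasions)


def horvitz_thompson(p_by_individual):
    """H-T type population size estimate: each captured individual
    contributes the inverse of its probability of being captured at all."""
    return sum(1.0 / prob_captured_at_least_once(p) for p in p_by_individual)


# Hypothetical example: 50 captured individuals, 5 occasions,
# each with capture probability 0.3 on every occasion.
typical = [[0.3] * 5 for _ in range(50)]
n_hat_typical = horvitz_thompson(typical)

# Add one captured individual whose estimated probabilities are tiny,
# as might happen at an extreme covariate value.
contaminated = typical + [[0.001] * 5]
n_hat_contaminated = horvitz_thompson(contaminated)

# The single extra individual adds far more than 1 to the estimate.
inflation = n_hat_contaminated - n_hat_typical
```

Under these assumed numbers, the one individual with very small estimated probabilities contributes roughly 200 to the estimate, while each of the other 50 contributes a little over 1. This is the mechanism that motivates downweighting unusual sampling probabilities.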
In Section 2.1 we introduce our notation, and in Section 2.2 we review the conditional likelihood approach of Huggins (1989). In Section 2.3 we apply the GAM approach to the full model and its submodels. We first apply the model to two examples in Section 3.1 (Example 1: Harvest mouse data) and Section 3.2 (Example 2: Mountain Pygmy Possum data), and the problems encountered there motivate the development of our robust population size estimator in Section 4. We revisit the second example in Section 4.3 and, to validate the new approaches, conduct some simulations in Section 5. Some discussion is given in Section 6, and some technical results are in Appendix A (Fitting GAMs), Appendix B (Variances of the robust estimators) and Appendix D (Figures).
Notation
Consider a population of unknown size and a capture–recapture experiment conducted over a fixed number of labeled capture occasions. The population is supposed closed over the course of the experiment. The individuals in the population are assumed to behave independently of each other. For each individual and each occasion, an indicator records whether that individual has been caught on that occasion, and a second indicator takes the value 1 if the individual has been captured at least once and 0 otherwise. Let be the number of
Example 1: Harvest mouse data
We first illustrate the use of GAMs using capture–recapture data collected on the Harvest mouse (Micromys minutus). The experiment was conducted at Wulin Recreation Area in Shei-Pa National Park, Taiwan in the summer of 2008. The mice were trapped using Sherman traps in 5×20 m grids where traps were set 10 meters apart. Over the weekly capture occasions, individuals were captured at least once. Individuals were weighed (g), sexed and hindfoot measurements (mm) were
Robust population size estimates
To develop robust estimators we need to model the contamination. Note that our interest is in the situation where there is contamination on the observed or estimated value of the capture probability. That is, the captures are according to a model with the true probability but we observe some possibly unrelated value rather than a surrogate. This may be an artifact due to poor modeling of the estimated capture probability as a function of the covariate or may arise from errors in measuring the
GAMs
We first examined the performance of GAMs using univariate continuous and categorical covariates. We compared results with parametric linear and quadratic GLMs and the semiparametric local polynomial (LP) models of Hwang and Huggins (2007). Since LP models have not yet been developed for the full model with time effects, we only considered the submodel without them. We followed a simulation design similar to that of Hwang and Huggins (2007), where we considered a population size of with capture occasions and generated the continuous
Discussion
We have demonstrated that GAMs may be relatively easily fitted to capture–recapture data when the capture probabilities are functions of multiple continuous/categorical covariates and time effects. In developing these methods we noted that the H–T estimator can result in unreasonable estimates of the population size in the presence of contamination that can arise from outliers, measurement error or inappropriate modeling. This can occur in parametric and nonparametric models. We developed a
Acknowledgments
The authors would like to thank Dr. Sheng-Hai Wu of the Institute of Statistics, National Chung Hsing University for the use of the Harvest mouse mark–recapture data. They would also like to thank the referees for comments that helped to clarify the exposition.
References (40)
- Marx and Eilers (1998). Direct generalized additive modeling with penalized likelihood. Computational Statistics & Data Analysis.
- et al. (2008). Fitting generalized linear models with unspecified link function: a P-spline approach. Computational Statistics & Data Analysis.
- Zwane and van der Heijden (2004). Semiparametric models for capture–recapture studies with covariates. Computational Statistics & Data Analysis.
- Agresti (1994). Simple capture–recapture models permitting unequal catchability and variable sampling effort. Biometrics.
- Cantoni and Ronchetti (2001). Robust inference for generalized linear models. Journal of the American Statistical Association.
- et al. (1997). Generalized partially linear single-index models. Journal of the American Statistical Association.
- Chambers (1986). Outlier robust finite population estimation. Journal of the American Statistical Association.
- Chao et al. (1992). Estimating population size for capture–recapture data when capture probabilities vary by time and individual animal. Biometrics.
- Chen and Lloyd (2000). A non parametric approach to the analysis of two-stage mark-recapture experiments. Biometrika.
- Darroch (1958). The multiple–recapture census: I. Estimation of a closed population. Biometrika.
- de Boor (2001). A Practical Guide to Splines.
- Eilers and Marx (1996). Flexible smoothing with B-splines and penalties. Statistical Science.
- The influence curve and its role in robust estimation. Journal of the American Statistical Association.
- Robust Statistics: The Approach Based on Influence Functions.
- Hastie and Tibshirani (1990). Generalized Additive Models.
- A review of the ecology and conservation of the mountain pygmy-possum Burramys parvus.
- Adaptive robust procedures: a partial review and some suggestions for future applications and theory. Journal of the American Statistical Association.
- Horvitz and Thompson (1952). A generalization of sampling without replacement from a finite universe. Journal of the American Statistical Association.
- Robust Statistics.