Computing inclusion probabilities for order sampling

doi:10.1016/j.jspi.2005.03.010

Journal of Statistical Planning and Inference

Volume 136, Issue 11, 1 November 2006, Pages 4026-4042

https://doi.org/10.1016/j.jspi.2005.03.010

Abstract

Rosén [1997. J. Statist. Plann. Inference 62, 159–191] introduced order sampling schemes of fixed shape which have inclusion probabilities roughly proportional to given size measures ($π$ps schemes). Three particular cases, where the fixed shape distributions are Pareto, exponential and uniform, respectively, are specially treated. In this paper, we give general algorithms for computing the first- and second-order inclusion probabilities for a general fixed shape order sampling scheme and explicit formulae for the three special cases. Identities are given that can be used to check the accuracy of the numerical results. Examples are included as well as some comments on improving the computational efficiency and accuracy of the algorithms.

Introduction

Rosén (1997a) introduced a new class of fixed-size sampling designs called order sampling, in which each unit i in a population $U = \{1, \dots, N\}$ is associated with a random variable, called an ordering variable. Independent realizations of the ordering variables are taken. The units with the n smallest realized values constitute the sample. By requiring all the ordering variables to be proportional to a common random variable, Rosén introduced the idea of order sampling with fixed distribution shape. Three distribution shapes were specifically investigated, namely the uniform, exponential and Pareto distributions.

Let $y_{i}$ denote the variable of interest. The central inference task is usually the estimation of its population total $τ = \sum_{i = 1}^{N} y_{i}$. Suppose auxiliary information is available on all the units in the population in the form of size measures $x_{i}$; that is, $x_{i}$ is roughly proportional to $y_{i}$. There are several ways of utilizing this auxiliary information. One of them is to use a sampling scheme where the inclusion probabilities are proportional to the size measures. Such a scheme is called a $π$ps scheme. Let $λ_{i} = n x_{i} / \sum_{j = 1}^{N} x_{j}$ denote the target inclusion probability. Rosén (1997b, 2000) showed that, in order sampling with fixed shape, the constants of proportionality can be chosen so that the resultant sampling scheme has inclusion probabilities approximately equal to $λ_{i}$. As $x_{i}$ is only approximately proportional to $y_{i}$, such a scheme is satisfactory and can also be classified as a $π$ps scheme. The three cases where the underlying shape is uniform, exponential and Pareto are called, respectively, uniform OS $π$ps, exponential OS $π$ps and Pareto OS $π$ps schemes.

Rosén (1997a) showed that uniform OS $π$ps was the same as sequential Poisson sampling presented by Ohlsson (1990, 1995) and that exponential OS $π$ps coincided with successive sampling studied by Rosén (1972) and Hájek (1981). Pareto sampling was new and was shown to have an optimal property explained below. Pareto sampling was independently introduced by Rosén (1997a) and Saavedra (1995).

For estimation of $τ$, Rosén suggested the estimator ${\hat{τ}}_{R} = \sum_{i = 1}^{N} \frac{y_{i}}{λ_{i}} I_{i},$ where $I_{i}$ is the sample inclusion indicator ($I_{i} = 1$ if unit i is in the sample and $I_{i} = 0$ otherwise). This avoids the computation of exact inclusion probabilities. Rosén gave the asymptotic variance of ${\hat{τ}}_{R}$ and a variance estimator. He showed that, among all $π$ps schemes of fixed shape, the Pareto scheme minimizes the asymptotic variance. On the other hand, if the inclusion probabilities are known, the estimation of $τ$ can be based on the Horvitz–Thompson (HT) estimator ${\hat{τ}}_{HT} = \sum_{i = 1}^{N} \frac{y_{i}}{π_{i}} I_{i},$ where $π_{i} = \Pr(\text{unit } i \text{ is in the sample})$ is the actual first-order inclusion probability. This is well known to have minimum variance among the class of homogeneous linear unbiased estimators of $τ$. The Sen–Yates–Grundy variance estimator for ${\hat{τ}}_{HT}$ is $\hat{V} ({\hat{τ}}_{HT}) = \frac{1}{2} \sum_{i = 1}^{N} \sum_{j = 1}^{N} \left(\frac{y_{i}}{π_{i}} - \frac{y_{j}}{π_{j}}\right)^{2} \left(\frac{π_{i} π_{j}}{π_{ij}} - 1\right) I_{i} I_{j},$ where $π_{ij} = \Pr(\text{units } i \text{ and } j \text{ are in the sample})$ are the second-order inclusion probabilities.
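The two formulae above translate almost directly into code. The following is a minimal sketch, assuming the inputs are restricted to the sampled units (those with $I_{i} = 1$) and that the first- and second-order inclusion probabilities are already available; the function names `horvitz_thompson` and `syg_variance` are illustrative and do not come from the paper.

```python
def horvitz_thompson(y, pi):
    """HT estimate of the total: sum of y_i / pi_i over the sampled units."""
    return sum(yi / p for yi, p in zip(y, pi))


def syg_variance(y, pi, pi2):
    """Sen-Yates-Grundy variance estimate for a fixed-size design.

    y, pi : values and first-order inclusion probabilities of the sampled units
    pi2   : pi2[i][j] is the second-order inclusion probability of sampled
            units i and j (diagonal entries are never read)
    """
    m = len(y)
    total = 0.0
    for i in range(m):
        for j in range(m):
            if i == j:
                continue  # the i == j term vanishes because the difference is zero
            diff = y[i] / pi[i] - y[j] / pi[j]
            total += diff * diff * (pi[i] * pi[j] / pi2[i][j] - 1.0)
    return 0.5 * total


# Illustration with simple random sampling, N = 4 and n = 2, where
# pi_i = n / N = 1/2 and pi_ij = n (n - 1) / (N (N - 1)) = 1/6.
y_sample = [3.0, 5.0]
pi_sample = [0.5, 0.5]
pi2_sample = [[0.5, 1 / 6], [1 / 6, 0.5]]
tau_hat = horvitz_thompson(y_sample, pi_sample)
v_hat = syg_variance(y_sample, pi_sample, pi2_sample)
```

The simple random sampling case is used here only because its inclusion probabilities are known in closed form; for order sampling the arrays `pi` and `pi2` would have to come from the algorithms described in the later sections.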

Rosén (2000) considered the problem of computing the first-order inclusion probabilities. He obtained explicit formulae for the three fixed shape $π$ps schemes for the particular case where all units in a population have the same size measure except for one odd unit. He also obtained a formula for the Pareto $π$ps scheme when taking a sample of size 1 from a population with all units having distinct size measures. These are very special and limited cases, and Rosén's overall conclusion is that the computation of inclusion probabilities for $π$ps order sampling is exceedingly hard. This gives weight to that paper's main result, which is that $π_{i} \approx λ_{i}$ for large n and N. More precisely, $π_{i} / λ_{i} \to 1$ as n and N tend to infinity. This justifies using the estimator ${\hat{τ}}_{R}$ as a substitute for ${\hat{τ}}_{HT}$. Rosén (2000) showed that the approximation was reasonable if $\min (n, N - n) ⩾ 5$.

It is desirable to have methods to calculate exact inclusion probabilities. Such methods facilitate the theoretical investigation of order sampling and of the approximations. In some practical applications, some subsamples may be quite small and one may wish to compute the exact inclusion probabilities, as the computation is not time-intensive for small sizes. Furthermore, the variance estimator given by Rosén is also based on an asymptotic result. There is as yet no study on how large n needs to be for the variance estimator to be reliable. A method for computing exact first- and second-order inclusion probabilities will facilitate such a study.

Aires (1999) considered the computation of first- and second-order inclusion probabilities for Pareto $π$ps sampling schemes. We propose a different algorithm. It is useful to have different approaches to computing probabilities as they give different insights into the problem. Moreover, one approach may be easier than another to extend to a different situation, such as the selection of coordinated samples. In Section 8, we discuss the mathematical and computational differences between the two methods. Our approach is combinatorial and, in our view, computes more fundamental quantities, which are then combined through combinatorial arguments to build up the required probabilities. The combinatorial arguments will be easier to adapt to a changed situation.

Even though Rosén (1997a) has shown that the Pareto $π$ps scheme is optimal in the sense of minimizing the asymptotic variance of ${\hat{τ}}_{R}$, we have included the computation of first- and second-order inclusion probabilities of all three schemes here for completeness. The exponential and uniform $π$ps schemes are already in the literature (Rosén, 1972; Hájek, 1981; Ohlsson, 1990, 1995). They may have other desirable properties upon further investigation.

In Section 2, we describe the method for computing the first- and second-order inclusion probabilities for a general $π$ps order sampling scheme. Sections 3–5 consider the special cases of Pareto, exponential and uniform OS $π$ps schemes, respectively. Section 6 gives three identities that are satisfied by the inclusion probabilities and some intermediate results of the computation. These can be used to check the plausibility and accuracy of the numerical results. Section 7 gives some examples. In Section 8, we comment on some computational aspects of the formulae for the Pareto $π$ps sampling scheme, and on the differences between our algorithm and Aires’.

Section snippets

General $π$ps order sampling

Let $Q_{i}$ denote the ordering variable associated with unit i in the population $U = \{1, \dots, N\}$. Suppose Q is a random variable with distribution function $F (t)$ and probability density function $f (t)$. By letting $Q_{i} = Q / θ_{i}, i = 1, \dots, N,$ where $θ_{i}$ are constants called intensities, we obtain an order sampling scheme with fixed shape distribution described by $F (t)$. Let $F_{i} (t)$ and $f_{i} (t)$ denote, respectively, the distribution function and probability density function of $Q_{i}$. Then, $F_{i} (t) = F (θ_{i} t)$ and $f_{i} (t) = θ_{i} f (θ_{i} t)$. Rosén
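The construction $Q_{i} = Q / θ_{i}$ followed by keeping the n smallest realizations can be sketched as a sampler. This is an illustration only, assuming each $Q_{i}$ is generated by inverse-transform sampling from an independent copy of $Q$; the names `order_sample`, `inverse_cdf` and `rng` are hypothetical and not from the paper.

```python
import math
import random


def order_sample(theta, n, inverse_cdf, rng=random):
    """Draw one order sample of size n.

    theta       : intensities theta_i, one per population unit
    inverse_cdf : F^{-1} for the common shape distribution F
    Returns the (0-based) indices of the n units with the smallest Q_i.
    """
    # Independent realizations Q_i = Q / theta_i, with Q drawn by inverse transform.
    q = [inverse_cdf(rng.random()) / th for th in theta]
    ranked = sorted(range(len(theta)), key=lambda i: q[i])
    return sorted(ranked[:n])


# Inverse distribution functions of the three shapes treated in the paper:
# Pareto F(t) = t / (1 + t), exponential F(t) = 1 - exp(-t), uniform F(t) = t.
pareto_inv = lambda u: u / (1.0 - u)
exponential_inv = lambda u: -math.log(1.0 - u)
uniform_inv = lambda u: u

sample = order_sample([0.2, 0.5, 1.0, 3.0], 2, pareto_inv, random.Random(0))
```

Repeating such draws many times gives Monte Carlo estimates of the inclusion probabilities, which is a rough cross-check and not the exact method the paper develops.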

Inclusion probabilities for Pareto $π$ps sampling

For a Pareto distribution, $F (t) = t / (1 + t)$, and thus $f_{i} (t) = θ_{i} / (1 + θ_{i} t)^{2}$ and $F_{i} (t) = θ_{i} t / (1 + θ_{i} t)$, where $θ_{i} = F^{- 1} (λ_{i}) = λ_{i} / (1 - λ_{i})$ are the desired intensities. Using (1), we have $P_{1} (r, R) = \int_{0}^{\infty} \frac{θ_{1}}{(1 + θ_{1} t)^{2}} \left(\prod_{j \in R} \frac{θ_{j} t}{1 + θ_{j} t}\right) \left(\prod_{k \in R'} \frac{1}{1 + θ_{k} t}\right) d t = θ_{1} \left(\prod_{j \in R} θ_{j}\right) I_{1} (r),$ where $I_{1} (r) = \int_{0}^{\infty} \frac{t^{r - 1}}{(1 + θ_{1} t) \prod_{j = 1}^{N} (1 + θ_{j} t)} d t$. This integral is independent of R and can be computed using numerical integration. An analytical solution for the special case when all the $θ_{i}$'s are distinct is possible using the method given in Rosén (2000). However, numerical
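The formula for $P_{1} (r, R)$ can be turned into a small numerical sketch. Several assumptions are made here that the snippet does not state: $R$ is taken to range over the subsets of size $r - 1$ of the other $N - 1$ units, so that summing $\prod_{j \in R} θ_{j}$ over $R$ gives an elementary symmetric polynomial; the first-order inclusion probability is taken as $π_{i} = \sum_{r = 1}^{n} P_{i} (r)$; and the integral is evaluated after the substitution $t = u / (1 - u)$, which turns it into $\int_{0}^{1} u^{r - 1} (1 - u)^{N - r} / [(1 - u + θ_{i} u) \prod_{j} (1 - u + θ_{j} u)] \, d u$. Composite Simpson's rule is a choice made for this sketch; the paper's Matlab routines are not shown in the snippet. All function names are illustrative.

```python
from itertools import combinations
from math import prod


def simpson(f, a, b, m=2000):
    """Composite Simpson's rule on [a, b] with an even number m of subintervals."""
    h = (b - a) / m
    total = f(a) + f(b)
    for k in range(1, m):
        total += (4 if k % 2 else 2) * f(a + k * h)
    return total * h / 3


def elementary_symmetric(values, k):
    """Sum, over all size-k subsets R, of the product of the values in R."""
    return sum(prod(c) for c in combinations(values, k))


def pareto_position_probs(lam, n):
    """P[i][r-1] = Pr(unit i has the r-th smallest ordering variable), r = 1..n."""
    N = len(lam)
    theta = [l / (1 - l) for l in lam]  # Pareto intensities lambda / (1 - lambda)
    P = []
    for i in range(N):
        others = theta[:i] + theta[i + 1:]
        row = []
        for r in range(1, n + 1):
            # Integrand of I_i(r) after the substitution t = u / (1 - u).
            def integrand(u, r=r, i=i):
                den = (1 - u + theta[i] * u) * prod(1 - u + th * u for th in theta)
                return u ** (r - 1) * (1 - u) ** (N - r) / den
            integral = simpson(integrand, 0.0, 1.0)
            row.append(theta[i] * elementary_symmetric(others, r - 1) * integral)
        P.append(row)
    return P


def pareto_first_order(lam, n):
    """First-order inclusion probabilities pi_i = sum over r <= n of P_i(r)."""
    return [sum(row) for row in pareto_position_probs(lam, n)]


# Target inclusion probabilities for sizes x = (1, 2, 3, 5, 9) and n = 2.
lam_example = [2 * x / 20 for x in (1, 2, 3, 5, 9)]
pi_example = pareto_first_order(lam_example, 2)
```

Enumerating subsets to form the elementary symmetric polynomial is only practical for small populations; it is used here for transparency, not efficiency. Two quick plausibility checks are that the values $P_{i} (r)$ sum to 1 over $i$ for each $r$, and that the $π_{i}$ sum to $n$.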

Inclusion probabilities for exponential $π$ps sampling

Here, Q has an exponential distribution with distribution function $F (t) = 1 - e^{- t}$. Thus, $f_{i} (t) = θ_{i} e^{- θ_{i} t}$ and $F_{i} (t) = 1 - e^{- θ_{i} t}$, for $i = 1, \dots, N$, where $θ_{i} = F^{- 1} (λ_{i}) = - \log (1 - λ_{i})$ are the intensities. For $V \subseteq U$, $a$ any positive real number and $m$ any positive integer, define $B (V, m, a) = \sum_{R \in V^{m}} \int_{0}^{\infty} e^{- at} \left(\prod_{j \in R} (1 - e^{- θ_{j} t})\right) \left(\prod_{k \in R'} e^{- θ_{k} t}\right) d t,$ where, as before, $V^{m}$ denotes the set of all subsets of V of size m and $R'$ is the complement of R in V. Let $| V | = v$. Then $B (V, m, a) = \sum_{R \in V^{m}} \left(\frac{1}{a + \sum_{k \in R'} θ_{k}} - \sum_{1 : R} \frac{1}{(a + \sum_{k \in R'} θ_{k}) + θ_{j_{1}}} + \sum_{2 : R} \frac{1}{(a + \sum_{k \in R'} θ_{k}) + θ_{j_{1}} + θ_{j_{2}}} + \dots + (- 1)^{m} \frac{1}{a + \sum_{j \in V} θ_{j}}\right)$
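The closed form for $B (V, m, a)$ is an inclusion-exclusion expansion of the product $\prod_{j \in R} (1 - e^{- θ_{j} t})$, integrated term by term. A direct sketch follows, assuming that $\sum_{s : R}$ denotes the sum over all subsets of $R$ of size $s$; the function name `B` mirrors the notation but the code is illustrative, and it enumerates subsets explicitly, so it is suitable only for small $V$.

```python
from itertools import combinations


def B(theta_V, m, a):
    """Closed-form B(V, m, a) for the exponential shape.

    theta_V : intensities theta_j of the units in V
    m       : size of the subsets R
    a       : positive constant in the factor exp(-a t)
    """
    idx = range(len(theta_V))
    total = 0.0
    for R in combinations(idx, m):
        # a plus the intensities of the complement R' of R in V
        base = a + sum(theta_V[k] for k in idx if k not in R)
        # Expand prod_{j in R} (1 - exp(-theta_j t)) over subsets S of R;
        # each term integrates to (-1)^{|S|} / (base + sum_{j in S} theta_j).
        for s in range(m + 1):
            for S in combinations(R, s):
                total += (-1) ** s / (base + sum(theta_V[j] for j in S))
    return total


# Single-unit case: B({theta}, 1, a) = 1/a - 1/(a + theta).
b_single = B([2.0], 1, 1.0)
```

A useful check on such code: for each $t$ the terms $\prod_{j \in R} F_{j} (t) \prod_{k \in R'} (1 - F_{k} (t))$ sum to 1 over all subsets $R$ of $V$, so summing $B (V, m, a)$ over $m = 0, \dots, v$ should return $\int_{0}^{\infty} e^{- at} d t = 1 / a$. The case $m = 0$ is not covered by the definition above and is included in the code only to make this check possible.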

Inclusion probabilities for uniform $π$ps sampling

Here, $f_{i} (t) = θ_{i}$ for $0 ⩽ t ⩽ 1 / θ_{i}$, $F_{i} (t) = θ_{i} t$ for $0 ⩽ t ⩽ 1 / θ_{i}$ and $F_{i} (t) = 1$ for $t > 1 / θ_{i}$, and the intensities are $θ_{i} = λ_{i}$. Since different $Q_{i}$'s have different support, we need to take a slightly different approach. Let $ψ_{1} > ψ_{2} > \dots > ψ_{L}$ be the distinct values of $θ_{1}, \dots, θ_{N}$ with multiplicities $m_{1}, m_{2}, \dots, m_{L}$, respectively. Let $θ_{1} = ψ_{a}$. The support of $Q_{1}$ is $[0, 1 / ψ_{a}]$, which is partitioned by the points $1 / ψ_{1}, 1 / ψ_{2}, \dots, 1 / ψ_{a - 1}$. For $1 ⩽ l ⩽ a$, let $P_{1} (r, l) = \Pr \left(Q_{1} = Q_{(r)} \text{ and } \frac{1}{ψ_{l - 1}} < Q_{1} < \frac{1}{ψ_{l}}\right),$ where $ψ_{0} = \infty$. Then $P_{1} (r) = \sum_{l = 1}^{a} P_{1} (r, l) .$

To compute $P_{1} (r, l)$ , we note

Checking numerical answers

The formulae given in Sections 3–5 are mathematically correct. But numerical computations can give surprises. It is important to have some means of checking the reasonableness and accuracy of the results as human errors are possible and numerical round-off errors can swamp the results.

The probability $P_{i} (r) = \Pr(\text{unit } i \text{ is in the } r\text{th position in the ordering})$ is computed for $i = 1, \dots, N$ and $r = 1, \dots, n$ in the process of computing the first-order inclusion probability $π_{i}$. The identities $\sum_{i = 1}^{N} P_{i} (r) = 1, r = 1, \dots, n$

Numerical examples

We present two numerical examples to illustrate the methods introduced in Sections 3–5. The computations were carried out with computer programs written in Matlab on a PC with a 1 GHz Pentium IV processor and 256 MB RAM, running under Windows 98. The first example was taken from Aires (1999) and corresponds to a population of size $N = 5$ and a sample of size $n = 2$. The measures of size are given by $x = (1, 2, 3, 5, 9)$. The corresponding first-order inclusion probabilities for Pareto (PAR), exponential (EXP), and
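As a starting point for reproducing the first example, the target inclusion probabilities $λ_{i} = n x_{i} / \sum_{j} x_{j}$ and the intensities of the three schemes can be computed directly from the size measures. The sketch below uses only formulae quoted in this text (Pareto $θ_{i} = λ_{i} / (1 - λ_{i})$, exponential $θ_{i} = - \log (1 - λ_{i})$, uniform $θ_{i} = λ_{i}$); the function name `targets_and_intensities` is hypothetical, and the code does not compute the exact inclusion probabilities themselves.

```python
from math import log


def targets_and_intensities(x, n):
    """Target inclusion probabilities and intensities for the three shapes."""
    total = sum(x)
    lam = [n * xi / total for xi in x]
    if any(l >= 1.0 for l in lam):
        # The Pareto and exponential intensities are undefined for lambda_i >= 1.
        raise ValueError("every target inclusion probability must be below 1")
    return {
        "lambda": lam,
        "pareto": [l / (1.0 - l) for l in lam],
        "exponential": [-log(1.0 - l) for l in lam],
        "uniform": list(lam),
    }


# First example: N = 5, n = 2, sizes x = (1, 2, 3, 5, 9).
example = targets_and_intensities([1, 2, 3, 5, 9], 2)
```

For these sizes the targets are $λ = (0.1, 0.2, 0.3, 0.5, 0.9)$, which sum to $n = 2$ as required.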

Comments and conclusion

Of the three order sampling schemes, the Pareto scheme gives inclusion probabilities closest to the target inclusion probabilities $λ_{i}$, and is also shown to have minimum asymptotic variance among all order sampling schemes of fixed shape (Rosén, 1997a). Thus, it is the one most likely to be used in practice if an order sampling scheme is desired. Hence, we have not devoted much effort to improving the Matlab programs for the exponential and uniform schemes. For the Pareto scheme, we make some

References (10)

N. Aires, Comparisons between conditional Poisson sampling and Pareto $π$ps sampling designs, J. Statist. Plann. Inference (2000)

B. Rosén, On sampling with probability proportional to size, J. Statist. Plann. Inference (1997)

B. Rosén, Asymptotic theory for order sampling, J. Statist. Plann. Inference (1997)

B. Rosén, On inclusion probabilities for order $π$ps sampling, J. Statist. Plann. Inference (2000)

N. Aires, Algorithms to find exact inclusion probabilities for conditional Poisson sampling and Pareto $π$ps sampling designs, Methodology and Computing in Applied Probability (1999)

There are more references available in the full text version of this article.

Cited by (3)

Remarks on some misconceptions about unequal probability sampling without replacement
2023, Computer Science Review
Before computer scientists became interested in unequal probability sampling methods, they were widely studied by survey statisticians. We show that sometimes the same sampling methods have been proposed again without reference to existing methods. We also show that methods that are not correct and that were widely discussed in the 1950s are being proposed again. We review the most common errors and misunderstandings about these methods.
Computational aspects of order $π$ps sampling schemes
2007, Computational Statistics and Data Analysis
In an order sampling a finite population of size N has its units ordered by a ranking variable and then, a sample of the first n units is drawn. For order $π$ps sampling, the target inclusion probabilities $λ = (λ_{k})_{k = 1}^{N}$ are computed using a measure of size which is correlated with a variable of interest. The quantities $λ_{k}$, however, are different from the true inclusion probabilities $π_{k}$. Firstly, a new, simple method to compute $π_{k}$ from $λ_{k}$ is presented, and it is used to compute the inclusion probabilities of order $π$ps sampling schemes (uniform, exponential and Pareto). Secondly, given two positively co-ordinated samples drawn with order $π$ps sampling, the joint inclusion probability of a unit in both samples is approximated. This approximation can be used to derive the expected overlap or to construct an estimate of the covariance on these two samples. All presented methods use numerical integration.
Sampling and Estimation from Finite Populations
2020, Sampling and Estimation from Finite Populations
