Computing inclusion probabilities for order sampling
Introduction
Rosén (1997a) introduced a new class of fixed-size sampling designs called order sampling, in which each unit i in a population is associated with a random variable called an ordering variable. Independent realizations of the ordering variables are taken, and the units with the n smallest realized values constitute the sample. By requiring all the ordering variables to be proportional to a common random variable, Rosén introduced the idea of order sampling with fixed distribution shape. Three distribution shapes were specifically investigated, namely the uniform, exponential and Pareto distributions.
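The selection step described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the ordering variable of unit i is written as Q/θ_i for a positive constant θ_i (the usual parametrisation of "proportional to a common random variable"), the common variable Q is drawn by inverting its distribution function, and the names `order_sample` and `INVERSE_CDF` are hypothetical.

```python
import math
import random

# Inverse distribution functions of the three shapes (assumed standard forms):
# uniform on (0, 1), unit exponential, and Pareto with H(t) = t / (1 + t).
INVERSE_CDF = {
    "uniform": lambda u: u,
    "exponential": lambda u: -math.log(1.0 - u),
    "pareto": lambda u: u / (1.0 - u),
}


def order_sample(theta, n, shape="pareto", rng=None):
    """Draw one order sample of size n; returns the sorted unit indices.

    theta[i] > 0 is the (assumed) constant of unit i, so that its ordering
    variable is Q / theta[i] with Q drawn from the chosen shape.
    """
    if rng is None:
        rng = random.Random()
    inverse = INVERSE_CDF[shape]
    # Independent realizations of the ordering variables.
    q = [inverse(rng.random()) / th for th in theta]
    # The units with the n smallest realized values form the sample.
    ranked = sorted(range(len(theta)), key=lambda i: q[i])
    return sorted(ranked[:n])
```

Repeating the draw many times and counting how often each unit is selected gives a simple Monte Carlo estimate of the inclusion probabilities, which is useful as a rough cross-check of exact calculations.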
Let $y = (y_1, \dots, y_N)$ denote the variable of interest. The central inference task is usually the estimation of its population total $\tau = \sum_{i=1}^{N} y_i$. Suppose auxiliary information is available on all the units in the population in the form of size measures $s = (s_1, \dots, s_N)$; that is, $y_i$ is roughly proportional to $s_i$. There are several ways of utilizing this auxiliary information. One of them is to use a sampling scheme where the inclusion probabilities are proportional to the size measures. Such a scheme is called a πps scheme. Let $\lambda_i = n s_i / \sum_{j=1}^{N} s_j$ denote the target inclusion probability of unit i. Rosén (1997b, 2000) showed that, in order sampling with fixed shape, the constants of proportionality can be chosen so that the resultant sampling scheme has inclusion probabilities approximately equal to $\lambda_i$. As $y_i$ is only approximately proportional to $s_i$, such a scheme is satisfactory and can also be classified as a πps scheme. The three cases where the underlying shape is uniform, exponential and Pareto are called, respectively, uniform OSπps, exponential OSπps and Pareto OSπps schemes.
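As a small illustration of how the targets and the constants of proportionality relate, the sketch below computes target inclusion probabilities n s_i / Σ s_j from size measures and converts them to intensities θ_i by inverting the shape distribution at the target, which is the choice usually attributed to Rosén. The function names are hypothetical, and the requirement that every target be strictly below 1 is an assumption of this sketch.

```python
import math


def target_inclusion_probabilities(sizes, n):
    """Targets proportional to size that sum to the sample size n."""
    total = sum(sizes)
    lam = [n * s / total for s in sizes]
    if any(l <= 0.0 or l >= 1.0 for l in lam):
        raise ValueError("every target must lie strictly between 0 and 1")
    return lam


def intensities(lam, shape):
    """Intensity of each unit: the shape's inverse distribution function at lam."""
    if shape == "uniform":
        return list(lam)
    if shape == "exponential":
        return [-math.log(1.0 - l) for l in lam]
    if shape == "pareto":
        return [l / (1.0 - l) for l in lam]
    raise ValueError("unknown shape: " + shape)
```

For sizes 1, 2, 3, 4 and n = 2, the targets are 0.2, 0.4, 0.6, 0.8, and the Pareto intensity of the largest unit is 0.8 / 0.2 = 4.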
Rosén (1997a) showed that uniform OSπps is the same as the sequential Poisson sampling presented by Ohlsson (1990, 1995), and that exponential OSπps coincides with the successive sampling studied by Rosén (1972) and Hájek (1981). Pareto sampling was new and was shown to have an optimal property explained below. Pareto sampling was independently introduced by Rosén (1997a) and Saavedra (1995).
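For estimation of the population total $\tau$, Rosén suggested the estimator $\hat{\tau}_R = \sum_{i=1}^{N} y_i I_i / \lambda_i$, where $\lambda_i$ is the target inclusion probability and $I_i$ is the sample inclusion indicator ($I_i = 1$ if unit i is in the sample and $I_i = 0$ otherwise). This avoids the computation of exact inclusion probabilities. Rosén gave the asymptotic variance of $\hat{\tau}_R$ and a variance estimator. He showed that, among all πps schemes of fixed shape, the Pareto scheme minimizes the asymptotic variance. On the other hand, if the inclusion probabilities are known, the estimation of $\tau$ can be based on the Horvitz–Thompson (HT) estimator $\hat{\tau}_{HT} = \sum_{i=1}^{N} y_i I_i / \pi_i$, where $\pi_i$ is the actual first-order inclusion probability: $\pi_i = P$(unit i is in the sample). This is well known to have minimum variance among the class of homogeneous linear unbiased estimators of $\tau$. The Sen–Yates–Grundy variance estimator for $\hat{\tau}_{HT}$ is $\hat{v}(\hat{\tau}_{HT}) = \sum_{i<j} I_i I_j \frac{\pi_i \pi_j - \pi_{ij}}{\pi_{ij}} \left( \frac{y_i}{\pi_i} - \frac{y_j}{\pi_j} \right)^2$, where $\pi_{ij} = P$(units i and j are in the sample) are the second-order inclusion probabilities.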
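Once first- and second-order inclusion probabilities are available, the Horvitz–Thompson estimator and the Sen–Yates–Grundy variance estimator are straightforward to evaluate. The following is a minimal sketch using the textbook forms of both estimators; the function names and the data layout (lists indexed by unit, a nested list for the second-order probabilities) are assumptions made for illustration.

```python
from itertools import combinations


def horvitz_thompson(y, pi, sample):
    """HT estimate of the population total from the sampled units."""
    return sum(y[i] / pi[i] for i in sample)


def sen_yates_grundy(y, pi, pi2, sample):
    """Sen-Yates-Grundy variance estimate for a fixed-size design.

    pi2[i][j] is the probability that units i and j are both sampled.
    """
    total = 0.0
    for i, j in combinations(sorted(sample), 2):
        weight = (pi[i] * pi[j] - pi2[i][j]) / pi2[i][j]
        total += weight * (y[i] / pi[i] - y[j] / pi[j]) ** 2
    return total
```

As a sanity check, under simple random sampling of n = 2 units from N = 4 (first-order probabilities 1/2, second-order probabilities 1/6), the sample {0, 1} with y = (1, 2, 3, 4) gives an HT estimate of 6 and a variance estimate of 2, which agrees with the usual simple random sampling formula.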
Rosén (2000) considered the problem of computing the first-order inclusion probabilities. He obtained explicit formulae for the three fixed shape πps schemes for the particular case where all units in a population have the same size measure except for one odd unit. He also obtained a formula for the Pareto πps scheme when taking a sample of size 1 from a population with all units having distinct size measures. These are very special and limited cases, and Rosén's overall conclusion is that the computation of inclusion probabilities for πps order sampling is exceedingly hard. This gives weight to that paper's main result, which is that the actual inclusion probability $\pi_i$ is approximately equal to the target $\lambda_i$ for large n and N, the approximation holding in the limit as n and N tend to infinity. This justifies using Rosén's estimator, which is based on the targets $\lambda_i$, as a substitute for the HT estimator. Rosén (2000) also gave a condition under which the approximation is reasonable.
It is desirable to have methods to calculate exact inclusion probabilities, as they facilitate the theoretical investigation of the method and of the approximations. In some practical applications, some subsamples may be quite small and one may wish to compute the exact inclusion probabilities, as the computation is not time-intensive for small sizes. Furthermore, the variance estimator given by Rosén is also based on an asymptotic result, and there is as yet no study of how large n needs to be for the variance estimator to be reliable. A method for computing exact first- and second-order inclusion probabilities will facilitate such a study.
Aires (1999) considered the computation of first- and second-order inclusion probabilities for Pareto πps sampling schemes. We propose a different algorithm. It is useful to have different approaches to computing probabilities, as they give different insights into the problem. Moreover, one approach may be easier than another to extend to a different situation, such as the selection of coordinated samples. In Section 8, we discuss the mathematical and computational differences between the two methods. Our approach is combinatoric and, in our view, computes more fundamental quantities, which are then used with combinatoric arguments to build up the required probabilities. The combinatoric arguments will be easier to adapt to a changed situation.
Even though Rosén (1997a) has shown that the Pareto πps scheme is optimal in the sense of minimizing the asymptotic variance of his estimator of the population total, we have included the computation of first- and second-order inclusion probabilities of all three schemes here for completeness. The exponential and uniform πps schemes are already in the literature (Rosén, 1972; Hájek, 1981; Ohlsson, 1990, 1995). They may prove to have other desirable properties upon further investigation.
In Section 2, we describe the method for computing the first- and second-order inclusion probabilities for a general πps order sampling scheme. Sections 3–5 consider the special cases of Pareto, exponential and uniform OSπps schemes, respectively. Section 6 gives three identities that are satisfied by the inclusion probabilities and some intermediate results of the computation. These can be used to check the plausibility and accuracy of the numerical results. Section 7 gives some examples. In Section 8, we comment on some computational aspects of the formulae for the Pareto πps sampling scheme, and on the differences between our algorithm and that of Aires (1999).
General πps order sampling
Let $Q_i$ denote the ordering variable associated with unit i in the population $U = \{1, \dots, N\}$. Suppose Q is a random variable with distribution function $H$ and probability density function $h$. By letting $Q_i = Q/\theta_i$, where the $\theta_i$ are constants called intensities, we obtain an order sampling scheme with fixed shape distribution described by $H$. Let $F_i$ and $f_i$ denote, respectively, the distribution function and probability density function of $Q_i$. Then $F_i(t) = H(\theta_i t)$ and $f_i(t) = \theta_i h(\theta_i t)$. Rosén
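For orientation, the standard order-statistics argument behind such computations can be written out as follows. This is a sketch under the assumption that the ordering variables are independent and continuous with distribution functions $F_j$ and densities $f_j$; it is not claimed to be the exact form of the paper's equation (1). Unit i occupies position r in the ordering exactly when r - 1 of the other units have smaller ordering variables, and it is sampled when its position is at most n.

```latex
\begin{align*}
P(\text{unit } i \text{ is in position } r)
  &= \int_0^\infty f_i(t)
     \sum_{\substack{R \subseteq U \setminus \{i\} \\ |R| = r-1}}
     \prod_{j \in R} F_j(t)
     \prod_{j \in U \setminus (R \cup \{i\})} \bigl(1 - F_j(t)\bigr)\, dt, \\
\pi_i
  &= \sum_{r=1}^{n} P(\text{unit } i \text{ is in position } r).
\end{align*}
```

The inner sum is the probability that exactly r - 1 of the N - 1 independent events $\{Q_j < t\}$, $j \neq i$, occur, so it can be evaluated without enumerating subsets, for instance by the usual recursion for the distribution of a sum of independent indicators.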
Inclusion probabilities for Pareto πps sampling
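For a Pareto distribution, $H(t) = t/(1+t)$ for $t \ge 0$, and thus $F_i(t) = \theta_i t/(1 + \theta_i t)$ and $f_i(t) = \theta_i/(1 + \theta_i t)^2$, where $\theta_i = \lambda_i/(1 - \lambda_i)$ are the desired intensities and $\lambda_i$ is the target inclusion probability. Using (1), we obtain an integral that is independent of R and can be computed using numerical integration. An analytical solution for the special case when all the $\theta_i$'s are distinct is possible using the method given in Rosén (2000). However, numerical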
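To make the numerical-integration route concrete, here is a minimal pure-Python sketch that computes first-order inclusion probabilities for the Pareto scheme by direct quadrature. It is not the combinatoric algorithm of this paper, nor that of Aires (1999). It assumes the Pareto forms F_i(t) = θ_i t / (1 + θ_i t) with θ_i = λ_i / (1 - λ_i), substitutes u = t / (1 + t) to map the integral onto [0, 1], uses composite Simpson's rule, and evaluates the probability that at most n - 1 of the other units fall below t by the standard recursion for a sum of independent indicators. All function names are hypothetical.

```python
def prob_at_most(probs, k):
    """P(at most k successes) among independent Bernoulli(probs) trials."""
    dist = [1.0]  # dist[m] = P(exactly m successes among trials seen so far)
    for p in probs:
        new = [0.0] * (len(dist) + 1)
        for m, w in enumerate(dist):
            new[m] += w * (1.0 - p)
            new[m + 1] += w * p
        dist = new
    return sum(dist[: k + 1])


def pareto_inclusion_probabilities(lam, n, steps=1000):
    """First-order inclusion probabilities for Pareto order sampling.

    lam: target inclusion probabilities, each strictly between 0 and 1.
    steps: even number of Simpson subintervals on [0, 1].
    """
    theta = [l / (1.0 - l) for l in lam]
    size = len(theta)

    def integrand(i, u):
        # Density of Q_i after the substitution u = t / (1 + t).
        dens = theta[i] / (1.0 - u + theta[i] * u) ** 2
        # P(Q_j < t) for every other unit j, in the same variable u.
        below = [theta[j] * u / (1.0 - u + theta[j] * u)
                 for j in range(size) if j != i]
        # Unit i is sampled when at most n - 1 other units precede it.
        return dens * prob_at_most(below, n - 1)

    h = 1.0 / steps
    result = []
    for i in range(size):
        total = integrand(i, 0.0) + integrand(i, 1.0)
        for k in range(1, steps):
            total += (4 if k % 2 else 2) * integrand(i, k * h)
        result.append(total * h / 3.0)
    return result
```

Two easy checks on the output are that the probabilities sum to n and that equal targets give equal probabilities n/N. This direct approach costs on the order of N^3 operations per quadrature node across all units, so it is only intended for small populations.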
Inclusion probabilities for exponential πps sampling
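Here, Q has an exponential distribution with distribution function $H(t) = 1 - e^{-t}$. Thus, $F_i(t) = 1 - e^{-\theta_i t}$ and $f_i(t) = \theta_i e^{-\theta_i t}$ for $t \ge 0$, where $\theta_i = -\ln(1 - \lambda_i)$ are the intensities and $\lambda_i$ is the target inclusion probability. For a subset V of the population, a any positive real number and m any positive integer, an auxiliary quantity is defined in terms of the subsets R of V of size m and their complements in V. Then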
Inclusion probabilities for uniform πps sampling
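Here, $F_i(t) = 0$ for $t < 0$, $F_i(t) = \theta_i t$ for $0 \le t \le 1/\theta_i$ and $F_i(t) = 1$ for $t > 1/\theta_i$, and the intensities are $\theta_i = \lambda_i$, the target inclusion probabilities. Since different $Q_i$'s have different supports, we need to take a slightly different approach. The distinct values among the intensities are listed together with their multiplicities, and the corresponding points partition the support into subintervals that are treated separately. Then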
To compute , we note
Checking numerical answers
The formulae given in Sections 3–5 are mathematically correct. But numerical computations can give surprises. It is important to have some means of checking the reasonableness and accuracy of the results as human errors are possible and numerical round-off errors can swamp the results.
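One simple way to automate such checks is sketched below. It uses two identities that hold for any fixed-size design with sample size n: the first-order inclusion probabilities sum to n, and for each unit i the second-order probabilities over the other units sum to (n - 1) times its first-order probability. Whether these coincide with the identities used in this paper is an assumption of the sketch, and the function name is hypothetical.

```python
def identity_deviations(pi, pi2, n):
    """Largest absolute deviations from two fixed-size design identities.

    pi[i] is the first-order inclusion probability of unit i and
    pi2[i][j] the second-order inclusion probability of units i and j.
    Returns (deviation of sum(pi) from n,
             worst deviation of sum_j pi2[i][j] from (n - 1) * pi[i]).
    """
    size = len(pi)
    first = abs(sum(pi) - n)
    second = max(
        abs(sum(pi2[i][j] for j in range(size) if j != i) - (n - 1) * pi[i])
        for i in range(size)
    )
    return first, second
```

Both deviations should be at the level of rounding or quadrature error; a noticeably larger value points to a coding mistake or to loss of numerical accuracy.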
The probability $P$(unit i is in the rth position in the ordering) is computed for each unit i and each position r in the process of computing the first-order inclusion probability $\pi_i$. The identities
Numerical examples
We present two numerical examples to illustrate the methods introduced in Sections 3–5. The computations were carried out with computer programs written in Matlab on a PC with a 1 GHz Pentium IV processor and 256 MB RAM, running Windows 98. The first example, with its population size, sample size and measures of size, was taken from Aires (1999). The corresponding first-order inclusion probabilities for Pareto (PAR), exponential (EXP), and
Comments and conclusion
Of the three order sampling schemes, the Pareto scheme gives inclusion probabilities closest to the target inclusion probabilities $\lambda_i$, and is also shown to have minimum asymptotic variance among all order sampling schemes of fixed shape (Rosén, 1997a). Thus, it is the one most likely to be used in practice if an order sampling scheme is desired. Hence, we have not devoted much effort to improving the Matlab programs for the exponential and uniform schemes. For the Pareto scheme, we make some
References (10)
Comparisons between conditional Poisson sampling and Pareto πps sampling designs. J. Statist. Plann. Inference (2000)
On sampling with probability proportional to size. J. Statist. Plann. Inference (1997)
Asymptotic theory for order sampling. J. Statist. Plann. Inference (1997)
On inclusion probabilities for order πps sampling. J. Statist. Plann. Inference (2000)
Algorithms to find exact inclusion probabilities for conditional Poisson sampling and Pareto πps sampling designs. Methodology and Computing in Applied Probability (1999)