Computing inclusion probabilities for order sampling

https://doi.org/10.1016/j.jspi.2005.03.010

Abstract

Rosén [1997. J. Statist. Plann. Inference 62, 159–191] introduced order sampling schemes of fixed shape which have inclusion probabilities roughly proportional to given size measures (πps schemes). Three particular cases, where the fixed shape distributions are Pareto, exponential and uniform, respectively, receive special treatment. In this paper, we give general algorithms for computing the first- and second-order inclusion probabilities for a general fixed shape order sampling scheme and explicit formulae for the three special cases. Identities are given that can be used to check the accuracy of the numerical results. Examples are included as well as some comments on improving the computational efficiency and accuracy of the algorithms.

Introduction

Rosén (1997a) introduced a new class of fixed-size sampling designs called order sampling, in which each unit $i$ in a population $U=\{1,\ldots,N\}$ is associated with a random variable called an ordering variable. Independent realizations of the ordering variables are taken, and the units with the $n$ smallest realized values constitute the sample. By requiring all the ordering variables to be proportional to a common random variable, Rosén introduced the idea of order sampling with fixed distribution shape. Three distribution shapes were specifically investigated, namely the uniform, exponential and Pareto distributions.

Let $y_i$ denote the variable of interest. The central inference task is usually the estimation of its population total $\tau=\sum_{i=1}^{N}y_i$. Suppose auxiliary information is available on all the units in the population in the form of size measures $x_i$; that is, $x_i$ is roughly proportional to $y_i$. There are several ways of utilizing this auxiliary information. One of them is to use a sampling scheme where the inclusion probabilities are proportional to the size measures. Such a scheme is called a πps scheme. Let $\lambda_i=nx_i/\sum_{j=1}^{N}x_j$ denote the target inclusion probability. Rosén (1997b, 2000) showed that, in order sampling with fixed shape, the constants of proportionality can be chosen so that the resultant sampling scheme has inclusion probabilities approximately equal to $\lambda_i$. As $x_i$ is only approximately proportional to $y_i$, such a scheme is satisfactory and can also be classified as a πps scheme. The three cases where the underlying shape is uniform, exponential and Pareto are called, respectively, uniform OSπps, exponential OSπps and Pareto OSπps schemes.
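As a small illustration of the target inclusion probabilities, the sketch below evaluates $\lambda_i=nx_i/\sum_j x_j$ for the size measures x = (1, 2, 3, 5, 9) and n = 2, the population used in the first numerical example. The function name `target_inclusion_probabilities` is a hypothetical helper, and the guard that every $\lambda_i$ stays below 1 is an assumption about how one might validate the input, not something stated in the text.

```python
def target_inclusion_probabilities(x, n):
    """Return lambda_i = n * x_i / sum(x) for size measures x and sample size n."""
    total = sum(x)
    lam = [n * xi / total for xi in x]
    # Assumption: a target probability of 1 or more cannot be realised,
    # so such inputs are rejected rather than silently truncated.
    if any(l >= 1.0 for l in lam):
        raise ValueError("some target inclusion probability is >= 1")
    return lam


lam = target_inclusion_probabilities([1, 2, 3, 5, 9], 2)
print(lam)        # target inclusion probabilities
print(sum(lam))   # by construction these add up to the sample size n
```

The targets sum to n by construction, which is a quick sanity check before any intensities are derived from them.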

Rosén (1997a) showed that uniform OSπps is the same as the sequential Poisson sampling presented by Ohlsson (1990, 1995) and that exponential OSπps coincides with the successive sampling studied by Rosén (1972) and Hájek (1981). Pareto sampling was new and was shown to have an optimal property explained below. Pareto sampling was independently introduced by Rosén (1997a) and Saavedra (1995).

For estimation of $\tau$, Rosén suggested the estimator
$$\hat{\tau}_R=\sum_{i=1}^{N}\frac{y_i}{\lambda_i}I_i,$$
where $I_i$ is the sample inclusion indicator ($I_i=1$ if unit $i$ is in the sample and $I_i=0$ otherwise). This avoids the computation of exact inclusion probabilities. Rosén gave the asymptotic variance of $\hat{\tau}_R$ and a variance estimator. He showed that, among all πps schemes of fixed shape, the Pareto scheme minimizes the asymptotic variance. On the other hand, if the inclusion probabilities are known, the estimation of $\tau$ can be based on the Horvitz–Thompson (HT) estimator
$$\hat{\tau}_{HT}=\sum_{i=1}^{N}\frac{y_i}{\pi_i}I_i,$$
where $\pi_i$ is the actual first-order inclusion probability: $\pi_i=\Pr(\text{unit } i \text{ is in the sample})$. This is well known to have minimum variance among the class of homogeneous linear unbiased estimators of $\tau$. The Sen–Yates–Grundy variance estimator for $\hat{\tau}_{HT}$ is
$$\hat{V}(\hat{\tau}_{HT})=\frac{1}{2}\sum_{i=1}^{N}\sum_{j=1}^{N}\left(\frac{y_i}{\pi_i}-\frac{y_j}{\pi_j}\right)^{2}\left(\frac{\pi_i\pi_j}{\pi_{ij}}-1\right)I_iI_j,$$
where $\pi_{ij}=\Pr(\text{units } i \text{ and } j \text{ are in the sample})$ are the second-order inclusion probabilities.
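The three formulas above translate directly into code. The following is a minimal sketch; the function names are hypothetical, and the numbers in the usage lines are illustrative values for simple random sampling without replacement (N = 4, n = 2, so $\pi_i=1/2$ and $\pi_{ij}=1/6$), chosen only because the result is easy to verify by hand. They are not taken from the paper.

```python
def rosen_estimate(y, lam, sample):
    """Rosen's estimator: sum of y_i / lambda_i over the sampled units."""
    return sum(y[i] / lam[i] for i in sample)


def horvitz_thompson_estimate(y, pi, sample):
    """HT estimator: sum of y_i / pi_i over the sampled units."""
    return sum(y[i] / pi[i] for i in sample)


def sen_yates_grundy_variance(y, pi, pi2, sample):
    """SYG variance estimator; pi2[i][j] is the second-order probability pi_ij."""
    total = 0.0
    for i in sample:
        for j in sample:
            if i == j:
                continue  # the i == j term is zero because the difference vanishes
            d = y[i] / pi[i] - y[j] / pi[j]
            total += d * d * (pi[i] * pi[j] / pi2[i][j] - 1.0)
    return 0.5 * total


# Illustrative data (an assumption): simple random sampling, N = 4, n = 2.
y = [1.0, 2.0, 3.0, 4.0]
pi = [0.5] * 4
pi2 = [[1.0 / 6.0] * 4 for _ in range(4)]
sample = [0, 3]
print(horvitz_thompson_estimate(y, pi, sample))
print(sen_yates_grundy_variance(y, pi, pi2, sample))
```

Only sampled units enter the sums, which plays the role of the indicators $I_i$ and $I_iI_j$ in the formulas.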

Rosén (2000) considered the problem of computing the first-order inclusion probabilities. He obtained explicit formulae for the three fixed shape πps schemes for the particular case where all units in a population have the same size measure except for one odd unit. He also obtained a formula for the Pareto πps scheme when taking a sample of size 1 from a population with all units having distinct size measures. These are very special and limited cases, and Rosén's overall conclusion is that the computation of inclusion probabilities for πps order sampling is exceedingly hard. This gives weight to that paper's main result, which is that $\pi_i\approx\lambda_i$ for large $n$ and $N$. More precisely, $\pi_i/\lambda_i\to 1$ as $n$ and $N$ tend to infinity. This justifies using the estimator $\hat{\tau}_R$ as a substitute for $\hat{\tau}_{HT}$. Rosén (2000) showed that the approximation was reasonable if $\min(n,N-n)\geq 5$.

It is desirable to have methods to calculate exact inclusion probabilities, as they facilitate the theoretical investigation of the method and of the approximations. In some practical applications, some subsamples may be quite small and one may wish to compute the exact inclusion probabilities, as the computation is not time-intensive for small sizes. Furthermore, the variance estimator given by Rosén is also based on an asymptotic result. There is as yet no study on how large $n$ needs to be for the variance estimator to be reliable. A method for computing exact first- and second-order inclusion probabilities will facilitate such a study.

Aires (1999) considered the computation of first- and second-order inclusion probabilities for Pareto πps sampling schemes. We propose a different algorithm. It is useful to have different approaches to computing probabilities, as they give different insights into the problem. Moreover, one approach may be easier than another to extend to a different situation, such as the selection of coordinated samples. In Section 8, we discuss the mathematical and computational differences between the two methods. Our approach is combinatorial and, in our view, computes more fundamental quantities, which are then combined through combinatorial arguments to build up the required probabilities. The combinatorial arguments will be easier to adapt to a changed situation.

Even though Rosén (1997a) has shown that the Pareto πps scheme is optimal in the sense of minimizing the asymptotic variance of $\hat{\tau}_R$, we have included the computation of first- and second-order inclusion probabilities of all three schemes here for completeness. The exponential and uniform πps schemes are already in the literature (Rosén, 1972; Hájek, 1981; Ohlsson, 1990, 1995). Further investigation may show that they have other desirable properties.

In Section 2, we describe the method for computing the first- and second-order inclusion probabilities for a general πps order sampling scheme. Sections 3–5 consider the special cases of Pareto, exponential and uniform OSπps schemes, respectively. Section 6 gives three identities that are satisfied by the inclusion probabilities and some intermediate results of the computation. These can be used to check the plausibility and accuracy of the numerical results. Section 7 gives some examples. In Section 8, we give some comments on some computational aspects of the formulae for the Pareto πps sampling scheme, and the differences between our algorithm and Aires’.

Section snippets

General πps order sampling

Let $Q_i$ denote the ordering variable associated with unit $i$ in the population $U=\{1,\ldots,N\}$. Suppose $Q$ is a random variable with distribution function $F(t)$ and probability density function $f(t)$. By letting $Q_i=Q/\theta_i$, $i=1,\ldots,N$, where $\theta_i$ are constants called intensities, we obtain an order sampling scheme with fixed shape distribution described by $F(t)$. Let $F_i(t)$ and $f_i(t)$ denote, respectively, the distribution function and probability density function of $Q_i$. Then, $F_i(t)=F(\theta_i t)$ and $f_i(t)=\theta_i f(\theta_i t)$. Rosén
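The construction $Q_i=Q/\theta_i$ can be simulated in a few lines. The sketch below draws $Q$ by inverse transform for the three shapes named in the introduction (for the Pareto shape it uses $F(t)=t/(1+t)$, as given in the Pareto section) and keeps the units with the n smallest realized values. The names `SHAPES` and `order_sample` and the intensity values are illustrative assumptions, not part of the paper.

```python
import math
import random

# Inverse distribution functions F^{-1}(u) for the three fixed shapes.
SHAPES = {
    "pareto": lambda u: u / (1.0 - u),          # F(t) = t / (1 + t)
    "exponential": lambda u: -math.log(1.0 - u),  # F(t) = 1 - exp(-t)
    "uniform": lambda u: u,                     # F(t) = t on [0, 1]
}


def order_sample(theta, n, shape, rng):
    """Draw one order sample: indices of the n smallest Q_i = Q / theta_i."""
    inverse_cdf = SHAPES[shape]
    # rng.random() lies in [0, 1), so 1 - u is never zero.
    q = [inverse_cdf(rng.random()) / th for th in theta]
    ranked = sorted(range(len(theta)), key=lambda i: q[i])
    return sorted(ranked[:n])


rng = random.Random(2024)
theta = [0.2, 0.5, 1.0, 2.0, 4.0]  # hypothetical intensities
print(order_sample(theta, 2, "pareto", rng))
```

Units with larger intensities get stochastically smaller ordering variables and are therefore selected more often, which is how the intensities steer the inclusion probabilities.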

Inclusion probabilities for Pareto πps sampling

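For a Pareto distribution, $F(t)=t/(1+t)$, and thus $f_i(t)=\theta_i/(1+\theta_i t)^2$ and $F_i(t)=\theta_i t/(1+\theta_i t)$, where $\theta_i=F^{-1}(\lambda_i)=\lambda_i/(1-\lambda_i)$ are the desired intensities. Using (1), we have
$$P_1(r,R)=\int_0^\infty\frac{\theta_1}{(1+\theta_1 t)^2}\prod_{j\in R}\frac{\theta_j t}{1+\theta_j t}\prod_{k\in\bar{R}}\frac{1}{1+\theta_k t}\,dt=\theta_1\Big(\prod_{j\in R}\theta_j\Big)I_1(r),$$
where
$$I_1(r)=\int_0^\infty\frac{t^{r-1}}{(1+\theta_1 t)\prod_{j=1}^{N}(1+\theta_j t)}\,dt.$$
This integral is independent of $R$ and can be computed using numerical integration. An analytical solution for the special case when all the $\theta_i$'s are distinct is possible using the method given in Rosén (2000). However, numerical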
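The formula above can be turned into a small numerical routine. The sketch below rests on two assumptions that the snippet does not spell out: that $R$ ranges over the sets of $r-1$ other units whose ordering variables fall below that of the unit in question, so that summing over $R$ gives $P_i(r)=\theta_i\,\big(\sum_R\prod_{j\in R}\theta_j\big)\,I_i(r)$, and that $\pi_i=\sum_{r=1}^{n}P_i(r)$. The integral is evaluated with a composite Simpson rule after the substitution $t=u/(1-u)$, which is a choice made here for self-containedness and is not claimed to be the authors' Matlab method. The function name is hypothetical.

```python
from itertools import combinations
from math import prod


def pareto_rank_probabilities(lam, n, steps=2000):
    """Return P with P[i][r-1] = Pr(unit i has the r-th smallest Q), r = 1..n."""
    N = len(lam)
    theta = [l / (1.0 - l) for l in lam]  # Pareto intensities

    def integral(i, r):
        # I_i(r) after substituting t = u / (1 - u): the integrand becomes
        # u^(r-1) (1-u)^(N-r) / ((1-u+theta_i u) prod_j (1-u+theta_j u)) on [0, 1].
        def g(u):
            den = (1.0 - u + theta[i] * u) * prod(1.0 - u + th * u for th in theta)
            return u ** (r - 1) * (1.0 - u) ** (N - r) / den

        h = 1.0 / steps  # steps must be even for Simpson's rule
        s = g(0.0) + g(1.0)
        for k in range(1, steps):
            s += (4.0 if k % 2 else 2.0) * g(k * h)
        return s * h / 3.0

    P = []
    for i in range(N):
        others = [theta[j] for j in range(N) if j != i]
        row = []
        for r in range(1, n + 1):
            # Sum over all sets R of r - 1 other units of prod_{j in R} theta_j.
            e = sum(prod(R) for R in combinations(others, r - 1))
            row.append(theta[i] * e * integral(i, r))
        P.append(row)
    return P


lam = [0.1, 0.2, 0.3, 0.5, 0.9]  # targets for x = (1, 2, 3, 5, 9), n = 2
P = pareto_rank_probabilities(lam, n=2)
pi = [sum(row) for row in P]     # first-order inclusion probabilities
print([round(p, 4) for p in pi])
```

Enumerating all subsets is only practical for small populations; it is used here to stay close to the formula rather than for efficiency.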

Inclusion probabilities for exponential πps sampling

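Here, $Q$ has an exponential distribution with distribution function $F(t)=1-e^{-t}$. Thus, $f_i(t)=\theta_i e^{-\theta_i t}$ and $F_i(t)=1-e^{-\theta_i t}$, for $i=1,\ldots,N$, where $\theta_i=F^{-1}(\lambda_i)=-\log(1-\lambda_i)$ are the intensities. For a subset $V$ of $U$, $a$ any positive real number and $m$ any positive integer, define
$$B(V,m,a)=\sum_{R\in V_m}\int_0^\infty e^{-at}\prod_{j\in R}\left(1-e^{-\theta_j t}\right)\prod_{k\in\bar{R}}e^{-\theta_k t}\,dt,$$
where, as before, $V_m$ denotes the set of all subsets of $V$ of size $m$ and $\bar{R}$ is the complement of $R$ in $V$. Let $|V|=v$. Then
$$B(V,m,a)=\sum_{R\in V_m}\left[\frac{1}{a+\sum_{k\in\bar{R}}\theta_k}-\sum_{1:R}\frac{1}{\left(a+\sum_{k\in\bar{R}}\theta_k\right)+\theta_{j_1}}+\sum_{2:R}\frac{1}{\left(a+\sum_{k\in\bar{R}}\theta_k\right)+\theta_{j_1}+\theta_{j_2}}-\cdots+(-1)^m\frac{1}{a+\sum_{j\in V}\theta_j}\right]$$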
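The closed form for $B(V,m,a)$ follows from expanding $\prod_{j\in R}(1-e^{-\theta_j t})$ by inclusion and exclusion and integrating each exponential term. A minimal sketch of that expansion is given below; the function name `b_exponential` is hypothetical, and allowing $m=0$ is an extension made here purely so that the result can be checked, since summing $B(V,m,a)$ over $m=0,\ldots,v$ integrates $e^{-at}$ alone and must give $1/a$.

```python
from itertools import combinations
from math import log


def b_exponential(theta_v, m, a):
    """B(V, m, a) for the exponential shape; theta_v lists theta_j for j in V."""
    idx = range(len(theta_v))
    total = 0.0
    for R in combinations(idx, m):
        # a plus the intensities of the units in the complement of R within V
        base = a + sum(theta_v[k] for k in idx if k not in R)
        # Inclusion-exclusion over the subsets S of R: each term integrates
        # exp(-(base + sum_{j in S} theta_j) t) over [0, infinity).
        for size in range(m + 1):
            sign = -1.0 if size % 2 else 1.0
            for S in combinations(R, size):
                total += sign / (base + sum(theta_v[j] for j in S))
    return total


lam = [0.1, 0.2, 0.3, 0.5, 0.9]
theta = [-log(1.0 - l) for l in lam]  # exponential intensities
a = 1.5                                # arbitrary positive constant for the check
check = sum(b_exponential(theta, m, a) for m in range(len(theta) + 1))
print(check, 1.0 / a)
```

The cost grows quickly with the size of $V$ because every subset of every $R$ is visited, so this form is meant as a readable reference rather than an efficient implementation.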

Inclusion probabilities for uniform πps sampling

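Here, $f_i(t)=\theta_i$ for $0\le t\le 1/\theta_i$, $F_i(t)=\theta_i t$ for $0\le t\le 1/\theta_i$ and $F_i(t)=1$ for $t>1/\theta_i$, and the intensities are $\theta_i=\lambda_i$. Since different $Q_i$'s have different supports, we need to take a slightly different approach. Let $\psi_1>\psi_2>\cdots>\psi_L$ be the distinct values of $\theta_1,\ldots,\theta_N$ with multiplicities $m_1,m_2,\ldots,m_L$, respectively. Let $\theta_1=\psi_a$. The support of $Q_1$ is $[0,1/\psi_a]$, which is partitioned by the points $1/\psi_1,1/\psi_2,\ldots,1/\psi_{a-1}$. For $1\le l\le a$, let
$$P_1(r,l)=\Pr\left\{Q_1=Q_{(r)}\ \text{and}\ \frac{1}{\psi_{l-1}}<Q_1<\frac{1}{\psi_l}\right\},$$
where $\psi_0=\infty$. Then
$$P_1(r)=\sum_{l=1}^{a}P_1(r,l).$$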

To compute $P_1(r,l)$, we note

Checking numerical answers

The formulae given in Sections 3–5 are mathematically correct. But numerical computations can give surprises. It is important to have some means of checking the reasonableness and accuracy of the results as human errors are possible and numerical round-off errors can swamp the results.

The probability $P_i(r)=\Pr(\text{unit } i \text{ is in the } r\text{th position in the ordering})$ is computed for $i=1,\ldots,N$ and $r=1,\ldots,n$ in the process of computing the first-order inclusion probability $\pi_i$. The identities
$$\sum_{i=1}^{N}P_i(r)=1,\quad r=1,\ldots,n$$
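A check of this kind is easy to automate. The sketch below tests the identity $\sum_i P_i(r)=1$ for each $r$ and, as a derived consequence, that the first-order inclusion probabilities $\pi_i=\sum_{r=1}^{n}P_i(r)$ add up to $n$. The second check is the usual fixed-size-design identity; whether it is among the identities of this section is not visible in the snippet, so it is included here as an assumption. The function name, the tolerance and the equal-size test matrix are all illustrative.

```python
def check_rank_probabilities(P, n, tol=1e-8):
    """P[i][r-1] holds P_i(r). Return (ok, pi) where pi[i] = sum_r P_i(r)."""
    N = len(P)
    # Identity from the text: for each rank r, exactly one unit occupies it.
    columns_ok = all(
        abs(sum(P[i][r] for i in range(N)) - 1.0) <= tol for r in range(n)
    )
    pi = [sum(P[i][:n]) for i in range(N)]
    # Derived checks: the pi_i sum to n and each lies in [0, 1].
    size_ok = abs(sum(pi) - n) <= n * tol
    range_ok = all(-tol <= p <= 1.0 + tol for p in pi)
    return columns_ok and size_ok and range_ok, pi


# With equal size measures every unit is equally likely to take any rank,
# so P_i(r) = 1/N exactly; here N = 4 and n = 2.
P_equal = [[0.25, 0.25] for _ in range(4)]
ok, pi = check_rank_probabilities(P_equal, 2)
print(ok, pi)
```

A failed check points to either a programming error or accumulated round-off, the two risks named at the start of this section.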

Numerical examples

We present two numerical examples to illustrate the methods introduced in Sections 3–5. The computations were carried out with computer programs written in Matlab on a PC with a 1 GHz Pentium IV processor and 256 MB RAM, running under Windows 98. The first example was taken from Aires (1999) and corresponds to a population of size N=5 and a sample of size n=2. The measures of size are given by x=(1,2,3,5,9). The corresponding first-order inclusion probabilities for Pareto (PAR), exponential (EXP), and

Comments and conclusion

Of the three order sampling schemes, the Pareto scheme gives inclusion probabilities closest to the target inclusion probabilities λi, and is also shown to have minimum asymptotic variance among all order sampling schemes of fixed shape (Rosén, 1997a). Thus, it is the one most likely to be used in practice if an order sampling scheme is desired. Hence, we have not devoted much effort to improving the Matlab programs for the exponential and uniform schemes. For the Pareto scheme, we make some
