Minimum Message Length shrinkage estimation

https://doi.org/10.1016/j.spl.2008.12.021

Abstract

This note considers estimation of the mean of a multivariate Gaussian distribution with known variance within the Minimum Message Length (MML) framework. Interestingly, the resulting MML estimator exactly coincides with the positive-part James–Stein estimator under the choice of an uninformative prior. A new approach for estimating parameters and hyperparameters in general hierarchical Bayes models is also presented.

Introduction

This work considers the problem of estimating the mean of a multivariate Gaussian distribution with a known variance given a single data sample. Define $X$ as a random variable distributed according to a multivariate Gaussian density, $X \sim \mathcal{N}_k(\mu, \Sigma)$, with an unknown mean $\mu \in \mathbb{R}^k$ and a known variance $\Sigma = \mathrm{I}_k$. The accuracy, or risk, of an estimator $\hat{\mu}(x)$ of $\mu$ is defined as (Wald, 1971) $R(\mu, \hat{\mu}(x)) = \mathbb{E}_x\!\left[L(\mu, \hat{\mu}(x))\right]$, where $L(\cdot) \geq 0$ is the squared error loss function $L(\mu, \hat{\mu}(x)) = (\hat{\mu}(x) - \mu)'(\hat{\mu}(x) - \mu)$. The task is to find an estimator $\hat{\mu}(x)$ which minimises the risk for all values of $\mu$. Specifically, this work examines the problem of inferring the mean $\mu$ from a single observation $x \in \mathbb{R}^k$ of the random variable $X$.

It is well known that the Uniformly Minimum Variance Unbiased (UMVU) estimator of $\mu$ is the least squares estimate (Lehmann and Casella, 2003), given by $\hat{\mu}_{\mathrm{LS}}(x) = x$. This estimator is minimax under the squared error loss function and is equivalent to the maximum likelihood estimator. Stein (1956) demonstrated that, remarkably, for $k \geq 3$ the least squares estimator is not admissible and is in fact dominated by a large class of minimax estimators. The best known of these dominating estimators is the positive-part James–Stein estimator (James and Stein, 1961) $$\hat{\mu}_{\mathrm{JS}}(x) = \left(1 - \frac{k-2}{x'x}\right)_{\!+} x,$$ where $(\cdot)_+ = \max(0, \cdot)$. Estimators in the James–Stein class tend to shrink towards some origin (in this case zero) and hence are usually referred to as shrinkage estimators. Shrinkage estimators dominate the least squares estimator by trading some increase in bias for a larger decrease in variance. A common method for deriving the James–Stein estimator is the empirical Bayes method (Robbins, 1964), in which the mean $\mu$ is assumed to be distributed as $\mu \sim \mathcal{N}_k(0_k, c\mathrm{I}_k)$ and the hyperparameter $c > 0$ is estimated from the data.
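
The domination of the least squares estimator for $k \geq 3$ is easy to verify numerically. The following Monte Carlo sketch is illustrative only; the dimension, true mean and number of trials are arbitrary choices of ours, not values from the paper.

import numpy as np

rng = np.random.default_rng(0)
k, n_trials = 10, 100_000
mu = np.full(k, 0.5)                       # arbitrary true mean

x = rng.normal(loc=mu, scale=1.0, size=(n_trials, k))   # x ~ N_k(mu, I_k)

# Least squares / maximum likelihood estimate: mu_hat = x
risk_ls = np.mean(np.sum((x - mu) ** 2, axis=1))

# Positive-part James-Stein estimate: (1 - (k-2)/x'x)_+ x
shrink = np.maximum(0.0, 1.0 - (k - 2) / np.sum(x ** 2, axis=1))
mu_js = shrink[:, None] * x
risk_js = np.mean(np.sum((mu_js - mu) ** 2, axis=1))

print(f"risk(LS)  ~ {risk_ls:.3f}")   # close to k = 10
print(f"risk(JS+) ~ {risk_js:.3f}")   # strictly smaller for k >= 3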

This work examines the James–Stein problem within the Minimum Message Length framework (see Section 2). Specifically, we derive MML estimators of μ and c which exactly coincide with the positive-part James–Stein estimator under the choice of an uninformative prior over c (see Section 3). A systematic approach to finding MML estimators for the parameters and hyperparameters in general hierarchical Bayes models is then developed (see Section 4). As a corollary, the new method of hyperparameter estimation appears to provide an information-theoretic basis for hierarchical Bayes estimation. Some examples with multiple hyperparameters are discussed in Section 5. Concluding remarks are given in Section 6.

Section snippets

Inference by Minimum Message Length

Under the Minimum Message Length (MML) principle (Wallace and Boulton, 1968, Wallace, 2005), inference is performed by seeking the model that admits the briefest encoding (or most compression) of a message transmitted from an imaginary sender to an imaginary receiver. The message is transmitted in two parts; the first part, or assertion, states the inferred model, while the second part, or detail, states the data using a codebook based on the model stated in the assertion. This two-part message
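
The snippet above is truncated before stating the approximation it refers to. For orientation, the Wallace–Freeman (MML87) approximation to the total two-part message length for data $x$ and a $d$-dimensional parameter $\theta$ with prior $\pi(\theta)$, negative log-likelihood $l(\theta)$ and Fisher information $J(\theta)$ is usually stated as (a standard form given here as a sketch, not a reproduction of the paper's equation (1)):
$$ I_{87}(x, \theta) \;=\; \underbrace{-\log \pi(\theta) + \tfrac{1}{2}\log\lvert J(\theta)\rvert + \tfrac{d}{2}\log \kappa_d}_{\text{assertion}} \;+\; \underbrace{l(\theta) + \tfrac{d}{2}}_{\text{detail}}, $$
where $\kappa_d$ is the normalised second moment of an optimal quantising lattice in $d$ dimensions (Conway and Sloane, 1998). MML estimates are obtained by minimising $I_{87}(x, \theta)$ over $\theta$.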

James–Stein estimation and Minimum Message Length

Minimum Message Length (MML) shrinkage estimation of the mean of a multivariate normal distribution is now considered. The aim is to apply the Wallace and Freeman approximation (1) to inference of the mean parameter $\mu \in \mathbb{R}^k$ and the hyperparameter $c \in \mathbb{R}$. Recall from Section 2 that MML87 inference requires a negative log-likelihood function, prior densities on all model parameters and the Fisher information matrix. Let $l(\mu)$ denote the negative log-likelihood function of $x \sim \mathcal{N}_k(\mu, c\mathrm{I}_k)$ given $\mu$: $l(\mu) = \frac{1}{2}($ …
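
The display above is cut off. Purely as a standard Gaussian identity, and not as a reproduction of the paper's truncated equation, the negative log-likelihood of a single observation $x \sim \mathcal{N}_k(\mu, c\mathrm{I}_k)$ is
$$ l(\mu) = \frac{1}{2c}(x - \mu)'(x - \mu) + \frac{k}{2}\log(2\pi c), $$
whose Fisher information with respect to $\mu$ is $J_\mu(\mu) = c^{-1}\mathrm{I}_k$, so that $\lvert J_\mu(\mu)\rvert = c^{-k}$ does not depend on $\mu$.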

Message lengths of hierarchical Bayes structures

The previous section suggests a general procedure for estimating parameters in a hierarchical Bayes structure. Given a parametrised probability density $p(x|\theta)$, where $\theta \sim \pi_\theta(\theta|\alpha)$ and $\alpha \sim \pi_\alpha(\alpha)$, first find the message length of $x$ given $\theta$, conditioned on $\alpha$, i.e. $$I_{87}(x, \theta | \alpha) = l(\theta) - \log \pi_\theta(\theta|\alpha) + \frac{1}{2}\log\lvert J_\theta(\theta)\rvert + \text{const}.$$ Next, find the estimates $\hat{\theta}_{87}(x|\alpha)$ that minimise the message length (10). In most cases, it appears necessary to apply the curved prior correction (see Wallace (2005), pp. 236–237) to the MML87
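
The passage is cut off above. As a concrete illustration of the first two steps it describes, the sketch below numerically minimises the conditional message length $I_{87}(x, \mu | c)$ over $\mu$ for a Gaussian mean model with prior $\mu | c \sim \mathcal{N}_k(0_k, c\mathrm{I}_k)$. The model choice, the use of scipy, and the omission of the truncated correction terms and of the subsequent hyperparameter stage are our own simplifications, not the paper's procedure.

import numpy as np
from scipy.optimize import minimize

def conditional_message_length(mu, x, c):
    """I_87(x, mu | c) up to additive constants: negative log-likelihood of
    x ~ N_k(mu, I_k), minus log of the N_k(0, c I_k) prior; the 0.5*log|J(mu)|
    term is constant in mu for this model and is therefore omitted."""
    nll = 0.5 * np.sum((x - mu) ** 2)          # l(mu), constants dropped
    neg_log_prior = 0.5 * np.sum(mu ** 2) / c  # -log pi(mu | c), constants dropped
    return nll + neg_log_prior

def theta_hat_87(x, c):
    """Conditional MML87-style estimate of mu given the hyperparameter c."""
    res = minimize(conditional_message_length, x0=np.zeros_like(x), args=(x, c))
    return res.x

x = np.array([2.0, -1.0, 0.5, 3.0])
print(theta_hat_87(x, c=1.0))   # numerically close to (c/(1+c)) * x = 0.5 * x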

Shrinkage towards a grand mean

The following extension to the basic James–Stein shrinkage estimator was proposed by Lindley in the discussion of Stein (1962) and has been applied to several problems by Efron and Morris (1973, 1975). Lindley suggests that, instead of shrinking towards the origin, one may wish to shrink the parameters towards another point in the parameter space. Under this modification, the parameters $\mu_i$ ($i = 1, \ldots, k$) are assumed to be normally distributed, $\mu_i \sim \mathcal{N}(a, c)$, where both $a \in \mathbb{R}$ and $c \in \mathbb{R}_+$ are unknown parameters
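
The snippet breaks off here. For context, a commonly quoted form of this modification (standard in the literature, e.g. Efron and Morris (1975), and given here only as a sketch rather than the expression derived in the truncated section) shrinks each component towards the grand mean $\bar{x} = \frac{1}{k}\sum_i x_i$:
$$ \hat{\mu}_i(x) = \bar{x} + \left(1 - \frac{k - 3}{\sum_{j=1}^{k}(x_j - \bar{x})^2}\right)_{\!+} (x_i - \bar{x}), \qquad i = 1, \ldots, k, $$
where the constant $k-3$ (rather than $k-2$) reflects the degree of freedom spent estimating the shrinkage target $a$ by $\bar{x}$.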

Conclusion

This work has examined the task of estimating the mean of a multivariate normal distribution with known variance given a single data sample within the Minimum Message Length framework. We considered this problem in a hierarchical Bayes setting where the prior distribution on the mean depends on an unknown hyperparameter that must be estimated from the data. We showed that if the hyperparameter is stated suboptimally, the resulting solution is inferior to the James–Stein estimator. Once the

References (25)

  • J. Rissanen, Modeling by shortest data description, Automatica (1978)
  • J.O. Berger et al., Choice of hierarchical priors: Admissibility in estimation of normal means, The Annals of Statistics (1996)
  • J.H. Conway et al., Sphere Packings, Lattices and Groups (1998)
  • D.L. Dowe et al., Resolving the Neyman–Scott problem by Minimum Message Length, Proc. Computing... (1997)
  • B. Efron et al., Combining possibly related estimation problems, Journal of the Royal Statistical Society (Series B) (1973)
  • B. Efron et al., Data analysis using Stein’s estimator and its generalizations, Journal of the American Statistical Association (1975)
  • G.E. Farr et al., The complexity of strict minimum message length inference, Computer Journal (2002)
  • P.D. Grünwald
  • M.H. Hansen et al., Model selection and the principle of minimum description length, Journal of the American Statistical Association (2001)
  • W. James et al., Estimation with quadratic loss (1961)
  • E.L. Lehmann et al.
  • J.S. Maritz, Empirical Bayes Methods (1970)