Minimum Message Length shrinkage estimation
Introduction
This work considers the problem of estimating the mean of a multivariate Gaussian distribution with known variance given a single data sample. Define $y \in \mathbb{R}^n$ as a random variable distributed according to a multivariate Gaussian density $y \sim N_n(\mu, \sigma^2 I_n)$ with an unknown mean $\mu \in \mathbb{R}^n$ and a known variance $\sigma^2 > 0$. The accuracy, or risk, of an estimator $\hat{\mu}(y)$ of $\mu$ is defined as (Wald, 1971) $R(\hat{\mu}, \mu) = E_y[L(\hat{\mu}(y), \mu)]$, where $L$ is the squared error loss function $L(\hat{\mu}(y), \mu) = \|\hat{\mu}(y) - \mu\|^2 = \sum_{i=1}^{n} (\hat{\mu}_i(y) - \mu_i)^2$. The task is to find an estimator which minimises the risk for all values of $\mu$. Specifically, this work examines the problem of inferring the mean $\mu$ from a single observation of the random variable $y$.
It is well known that the Uniformly Minimum Variance Unbiased (UMVU) estimator of $\mu$ is the least squares estimate (Lehmann and Casella, 2003) given by $\hat{\mu}_{LS}(y) = y$. This estimator is minimax under the squared error loss function and is equivalent to the maximum likelihood estimator. Stein (1956) demonstrated that, remarkably, for $n \geq 3$, the least squares estimator is not admissible and is in fact dominated by a large class of minimax estimators. The best known of these dominating estimators is the positive-part James–Stein estimator (James and Stein, 1961) $$\hat{\mu}_{JS}(y) = \left(1 - \frac{(n-2)\sigma^2}{\|y\|^2}\right)_{+} y,$$ where $(a)_{+} = \max(0, a)$. Estimators in the James–Stein class shrink the least squares estimate towards some origin (in this case zero) and hence are usually referred to as shrinkage estimators. Shrinkage estimators dominate the least squares estimator by trading some increase in bias for a larger decrease in variance. A common method for deriving the James–Stein estimator is the empirical Bayes approach (Robbins, 1964), in which the mean is assumed to be distributed as $\mu \sim N_n(0_n, \tau^2 I_n)$ and the hyperparameter $\tau^2$ is estimated from the data.
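To make the dominance result concrete, here is a minimal sketch (assuming NumPy; the dimension, true mean and Monte Carlo setup are illustrative choices, not taken from the paper) comparing the empirical risks of the least squares and positive-part James–Stein estimators:

```python
import numpy as np

def james_stein_plus(y, sigma2=1.0):
    """Positive-part James-Stein estimator shrinking y towards the origin."""
    n = y.shape[-1]
    shrinkage = 1.0 - (n - 2) * sigma2 / np.sum(y**2, axis=-1, keepdims=True)
    return np.maximum(shrinkage, 0.0) * y

# Illustrative Monte Carlo comparison under squared error loss.
rng = np.random.default_rng(0)
n, trials = 10, 100_000
mu = np.full(n, 0.5)                        # hypothetical true mean
y = rng.normal(mu, 1.0, size=(trials, n))   # one observation per trial

risk_ls = np.mean(np.sum((y - mu)**2, axis=1))                    # approx. n
risk_js = np.mean(np.sum((james_stein_plus(y) - mu)**2, axis=1))  # < n for n >= 3
print(f"LS risk: {risk_ls:.2f}, JS+ risk: {risk_js:.2f}")
```

For $n \geq 3$ the James–Stein risk lies below the constant least squares risk of $n\sigma^2$ for every $\mu$, with the largest gains when $\mu$ is near the shrinkage origin.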
This work examines the James–Stein problem within the Minimum Message Length framework (see Section 2). Specifically, we derive MML estimators of $\mu$ and $\tau^2$ which exactly coincide with the positive-part James–Stein estimator under the choice of an uninformative prior over $\tau^2$ (see Section 3). A systematic approach to finding MML estimators for the parameters and hyperparameters in general hierarchical Bayes models is then developed (see Section 4). As a corollary, the new method of hyperparameter estimation appears to provide an information theoretic basis for hierarchical Bayes estimation. Some examples with multiple hyperparameters are discussed in Section 5. Concluding remarks are given in Section 6.
Section snippets
Inference by Minimum Message Length
Under the Minimum Message Length (MML) principle (Wallace and Boulton, 1968; Wallace, 2005), inference is performed by seeking the model that admits the briefest encoding (or greatest compression) of a message transmitted from an imaginary sender to an imaginary receiver. The message is transmitted in two parts; the first part, or assertion, states the inferred model, while the second part, or detail, states the data using a codebook based on the model stated in the assertion. This two-part message …
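In symbols, the two-part construction can be summarised as follows (a standard restatement of the MML decomposition; the notation $I(\cdot)$ is assumed here rather than quoted from the paper):

```latex
% Two-part message: assertion (model) followed by detail (data given model)
I(y, \theta) = \underbrace{I(\theta)}_{\text{assertion}}
             + \underbrace{I(y \mid \theta)}_{\text{detail}},
\qquad
\hat{\theta}_{\mathrm{MML}}(y) = \operatorname*{arg\,min}_{\theta} \, I(y, \theta).
```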
James–Stein estimation and Minimum Message Length
Minimum Message Length (MML) shrinkage estimation of the mean of a multivariate normal distribution is now considered. The aim is to apply the Wallace–Freeman approximation (1) to inference of the mean parameter $\mu$ and the hyperparameter $\tau^2$. Recall from Section 2 that MML87 inference requires a negative log-likelihood function, prior densities on all model parameters and the Fisher information matrix. Let $L(y \mid \mu)$ denote the negative log-likelihood function of $y$ given $\mu$:

$$L(y \mid \mu) = \frac{n}{2} \log(2\pi\sigma^2) + \frac{\|y - \mu\|^2}{2\sigma^2}.$$
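For reference, the MML87 (Wallace–Freeman) message length applied in this section has the standard form below, stated as a sketch rather than a transcription of the paper's equation (1); here $\pi(\theta)$ is the prior, $J(\theta)$ the Fisher information matrix, $k$ the number of free parameters and $\kappa_k$ the $k$-dimensional lattice quantisation constant:

```latex
% MML87 approximate message length (standard form)
I_{87}(y, \theta) = -\log \pi(\theta)                 % prior on parameters
                  + \tfrac{1}{2} \log |J(\theta)|     % Fisher information term
                  + L(y \mid \theta)                  % negative log-likelihood
                  + \tfrac{k}{2} \bigl(1 + \log \kappa_k\bigr)
```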
Message lengths of hierarchical Bayes structures
The previous section suggests a general procedure for estimating parameters in a hierarchical Bayes structure. Given a parametrised probability density $p(y \mid \theta)$, where $\theta \sim \pi(\theta \mid \alpha)$ and $\alpha \sim \pi(\alpha)$, first find the message length of $\theta$ and $y$ conditioned on the hyperparameter $\alpha$, i.e. $I(y, \theta \mid \alpha)$. Next find the estimates that minimise the message length (10). In most cases, it appears necessary to apply the curved prior correction (see Wallace (2005), pp. 236–237) to the MML87 …
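As a concrete illustration of this two-step recipe, the sketch below fits the hyperparameter by minimising a profiled codelength and then shrinks the mean accordingly. The objective used here is the negative log marginal likelihood, a stand-in for the paper's message length (10) (which also carries Fisher information and curved prior terms); the function names and the SciPy optimiser are illustrative choices:

```python
import numpy as np
from scipy.optimize import minimize_scalar

def neg_log_marginal(tau2, y, sigma2=1.0):
    """Negative log of the marginal density of y when mu ~ N(0, tau2 * I).

    Marginally y ~ N(0, (sigma2 + tau2) * I); this profiled objective
    stands in for the joint codelength I(y, mu | tau2) minimised over mu.
    """
    n = y.size
    v = sigma2 + tau2
    return 0.5 * n * np.log(2 * np.pi * v) + np.sum(y**2) / (2 * v)

def hierarchical_estimate(y, sigma2=1.0):
    """Two-step estimate: fit the hyperparameter, then shrink the mean."""
    res = minimize_scalar(neg_log_marginal, bounds=(0.0, 1e6),
                          args=(y, sigma2), method="bounded")
    tau2 = res.x
    mu_hat = tau2 / (sigma2 + tau2) * y   # posterior mean of mu given tau2
    return mu_hat, tau2
```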
Shrinkage towards a grand mean
The following extension to the basic James–Stein shrinkage estimator was proposed by Lindley in the discussion of Stein (1962) and has been applied to several problems by Efron and Morris (1973, 1975). Lindley suggests that instead of shrinking to the origin, one may wish to shrink the parameters to another point in the parameter space. Under this modification, the parameters $\mu_i$ ($i = 1, \ldots, n$) are assumed to be normally distributed, $\mu_i \sim N(m, \tau^2)$, where both $m$ and $\tau^2$ are unknown parameters …
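A minimal sketch of this grand-mean variant follows, assuming the usual positive-part form with the $(n - 3)$ degrees-of-freedom correction used by Efron and Morris (1975); the truncated snippet does not state the estimator, so the details here are standard rather than quoted:

```python
import numpy as np

def james_stein_grand_mean(y, sigma2=1.0):
    """Positive-part James-Stein shrinkage towards the grand mean of y.

    Uses (n - 3) rather than (n - 2) because the shrinkage point is
    itself estimated from the data; requires n >= 4 for any shrinkage.
    """
    n = y.size
    y_bar = y.mean()
    s = np.sum((y - y_bar)**2)
    shrinkage = max(1.0 - (n - 3) * sigma2 / s, 0.0)
    return y_bar + shrinkage * (y - y_bar)
```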
Conclusion
This work has examined the task of estimating the mean of a multivariate normal distribution with known variance, given a single data sample, within the Minimum Message Length framework. We considered this problem in a hierarchical Bayes setting where the prior distribution on the mean depends on an unknown hyperparameter that must be estimated from the data. We show that if the hyperparameter is stated suboptimally, the resulting solution is inferior to the James–Stein estimator. Once the …
References (25)
- Rissanen, J., 1978. Modeling by shortest data description. Automatica.
- Berger, J.O., Strawderman, W.E., 1996. Choice of hierarchical priors: Admissibility in estimation of normal means. The Annals of Statistics.
- Conway, J.H., Sloane, N.J.A., 1998. Sphere Packings, Lattices and Groups.
- Dowe, D.L., Wallace, C.S., 1997. Resolving the Neyman–Scott problem by Minimum Message Length. In: Proc. Computing...
- Efron, B., Morris, C., 1973. Combining possibly related estimation problems. Journal of the Royal Statistical Society (Series B).
- Efron, B., Morris, C., 1975. Data analysis using Stein's estimator and its generalizations. Journal of the American Statistical Association.
- Farr, G.E., Wallace, C.S., 2002. The complexity of strict minimum message length inference. Computer Journal.
- Hansen, M.H., Yu, B., 2001. Model selection and the principle of minimum description length. Journal of the American Statistical Association.
- James, W., Stein, C., 1961. Estimation with quadratic loss. In: Proc. Fourth Berkeley Symposium on Mathematical Statistics and Probability.
- Empirical Bayes Methods.
Cited by (9)
- MML, Hybrid Bayesian Network Graphical Models, Statistical Consistency, Invariance and Uniqueness. 2011, Philosophy of Statistics (Handbook of the Philosophy of Science, Volume 7).
- MML is not consistent for Neyman–Scott. 2020, IEEE Transactions on Information Theory.
- Approximating message lengths of hierarchical Bayesian models using posterior sampling. 2016, Lecture Notes in Computer Science.
- Minimum message length ridge regression for generalized linear models. 2013, Lecture Notes in Computer Science.