Minimum Message Length shrinkage estimation
Introduction
This work considers the problem of estimating the mean of a multivariate Gaussian distribution with known variance given a single data sample. Define $y \in \mathbb{R}^n$ as a random variable distributed according to a multivariate Gaussian density $y \sim N_n(\mu, \sigma^2 I_n)$ with an unknown mean $\mu \in \mathbb{R}^n$ and a known variance $\sigma^2 > 0$. The accuracy, or risk, of an estimator $\hat{\mu}(y)$ of $\mu$ is defined as (Wald, 1971) $R(\hat{\mu}, \mu) = E_y[L(\hat{\mu}(y), \mu)]$, where $L$ is the squared error loss function $L(\hat{\mu}(y), \mu) = \|\hat{\mu}(y) - \mu\|^2 = \sum_{i=1}^{n} (\hat{\mu}_i(y) - \mu_i)^2$. The task is to find an estimator which minimises the risk for all values of $\mu$. Specifically, this work examines the problem of inferring the mean $\mu$ from a single observation of the random variable $y$.
It is well known that the Uniformly Minimum Variance Unbiased (UMVU) estimator of $\mu$ is the least squares estimate (Lehmann and Casella, 2003) given by $\hat{\mu}_{LS}(y) = y$. This estimator is minimax under the squared error loss function and is equivalent to the maximum likelihood estimator. Stein (1956) demonstrated that, remarkably, for $n \geq 3$, the least squares estimator is not admissible and is in fact dominated by a large class of minimax estimators. The best known of these dominating estimators is the positive-part James–Stein estimator (James and Stein, 1961) $$\hat{\mu}_{JS}(y) = \left(1 - \frac{(n-2)\sigma^2}{\|y\|^2}\right)_{+} y,$$ where $(a)_{+} = \max(0, a)$. Estimators in the James–Stein class shrink the least squares estimate towards some origin (in this case zero) and hence are usually referred to as shrinkage estimators. Shrinkage estimators dominate the least squares estimator by trading some increase in bias for a larger decrease in variance. A common method for deriving the James–Stein estimator is the empirical Bayes approach (Robbins, 1964), in which the mean is assumed to be distributed as $\mu \sim N_n(0_n, \tau^2 I_n)$ and the hyperparameter $\tau^2$ is estimated from the data.
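To make the dominance result concrete, here is a minimal sketch (assuming NumPy; the dimension, true mean and Monte Carlo setup are illustrative choices, not taken from the paper) comparing the empirical risks of the least squares and positive-part James–Stein estimators:

```python
import numpy as np

def james_stein_plus(y, sigma2=1.0):
    """Positive-part James-Stein estimator shrinking y towards the origin."""
    n = y.shape[-1]
    shrinkage = 1.0 - (n - 2) * sigma2 / np.sum(y**2, axis=-1, keepdims=True)
    return np.maximum(shrinkage, 0.0) * y

# Illustrative Monte Carlo comparison under squared error loss.
rng = np.random.default_rng(0)
n, trials = 10, 100_000
mu = np.full(n, 0.5)                        # hypothetical true mean
y = rng.normal(mu, 1.0, size=(trials, n))   # one observation per trial

risk_ls = np.mean(np.sum((y - mu)**2, axis=1))                    # approx. n
risk_js = np.mean(np.sum((james_stein_plus(y) - mu)**2, axis=1))  # < n for n >= 3
print(f"LS risk: {risk_ls:.2f}, JS+ risk: {risk_js:.2f}")
```

For $n \geq 3$ the James–Stein risk lies below the constant least squares risk of $n\sigma^2$ for every $\mu$, with the largest gains when $\mu$ is near the shrinkage origin.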
This work examines the James–Stein problem within the Minimum Message Length framework (see Section 2). Specifically, we derive MML estimators of $\mu$ and $\tau^2$ which exactly coincide with the positive-part James–Stein estimator under the choice of an uninformative prior over $\tau^2$ (see Section 3). A systematic approach to finding MML estimators for the parameters and hyperparameters in general hierarchical Bayes models is then developed (see Section 4). As a corollary, the new method of hyperparameter estimation appears to provide an information theoretic basis for hierarchical Bayes estimation. Some examples with multiple hyperparameters are discussed in Section 5. Concluding remarks are given in Section 6.
Section snippets
Inference by Minimum Message Length
Under the Minimum Message Length (MML) principle (Wallace and Boulton, 1968; Wallace, 2005), inference is performed by seeking the model that admits the briefest encoding (or greatest compression) of a message transmitted from an imaginary sender to an imaginary receiver. The message is transmitted in two parts; the first part, or assertion, states the inferred model, while the second part, or detail, states the data using a codebook based on the model stated in the assertion. This two-part message …
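In symbols, the two-part construction can be summarised as follows (a standard restatement of the MML decomposition; the notation $I(\cdot)$ is assumed here rather than quoted from the paper):

```latex
% Two-part message: assertion (model) followed by detail (data given model)
I(y, \theta) = \underbrace{I(\theta)}_{\text{assertion}}
             + \underbrace{I(y \mid \theta)}_{\text{detail}},
\qquad
\hat{\theta}_{\mathrm{MML}}(y) = \operatorname*{arg\,min}_{\theta} \, I(y, \theta).
```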
James–Stein estimation and Minimum Message Length
Minimum Message Length (MML) shrinkage estimation of the mean of a multivariate normal distribution is now considered. The aim is to apply the Wallace–Freeman approximation (1) to inference of the mean parameter $\mu$ and the hyperparameter $\tau^2$. Recall from Section 2 that MML87 inference requires a negative log-likelihood function, prior densities on all model parameters and the Fisher information matrix. Let $L(y \mid \mu)$ denote the negative log-likelihood function of $y$ given $\mu$:

$$L(y \mid \mu) = \frac{n}{2} \log(2\pi\sigma^2) + \frac{\|y - \mu\|^2}{2\sigma^2}.$$
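For reference, the MML87 (Wallace–Freeman) message length applied in this section has the standard form below, stated as a sketch rather than a transcription of the paper's equation (1); here $\pi(\theta)$ is the prior, $J(\theta)$ the Fisher information matrix, $k$ the number of free parameters and $\kappa_k$ the $k$-dimensional lattice quantisation constant:

```latex
% MML87 approximate message length (standard form)
I_{87}(y, \theta) = -\log \pi(\theta)                 % prior on parameters
                  + \tfrac{1}{2} \log |J(\theta)|     % Fisher information term
                  + L(y \mid \theta)                  % negative log-likelihood
                  + \tfrac{k}{2} \bigl(1 + \log \kappa_k\bigr)
```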
Message lengths of hierarchical Bayes structures
The previous section suggests a general procedure for estimating parameters in a hierarchical Bayes structure. Given a parametrised probability density $p(y \mid \theta)$, where $\theta \sim \pi(\theta \mid \alpha)$ and $\alpha \sim \pi(\alpha)$, first find the message length of $\theta$ and $y$ conditioned on the hyperparameter $\alpha$, i.e. $I(y, \theta \mid \alpha)$. Next find the estimates that minimise the message length (10). In most cases, it appears necessary to apply the curved prior correction (see Wallace (2005), pp. 236–237) to the MML87 …
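As a concrete illustration of this two-step recipe, the sketch below fits the hyperparameter by minimising a profiled codelength and then shrinks the mean accordingly. The objective used here is the negative log marginal likelihood, a stand-in for the paper's message length (10) (which also carries Fisher information and curved prior terms); the function names and the SciPy optimiser are illustrative choices:

```python
import numpy as np
from scipy.optimize import minimize_scalar

def neg_log_marginal(tau2, y, sigma2=1.0):
    """Negative log of the marginal density of y when mu ~ N(0, tau2 * I).

    Marginally y ~ N(0, (sigma2 + tau2) * I); this profiled objective
    stands in for the joint codelength I(y, mu | tau2) minimised over mu.
    """
    n = y.size
    v = sigma2 + tau2
    return 0.5 * n * np.log(2 * np.pi * v) + np.sum(y**2) / (2 * v)

def hierarchical_estimate(y, sigma2=1.0):
    """Two-step estimate: fit the hyperparameter, then shrink the mean."""
    res = minimize_scalar(neg_log_marginal, bounds=(0.0, 1e6),
                          args=(y, sigma2), method="bounded")
    tau2 = res.x
    mu_hat = tau2 / (sigma2 + tau2) * y   # posterior mean of mu given tau2
    return mu_hat, tau2
```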
Shrinkage towards a grand mean
The following extension to the basic James–Stein shrinkage estimator was proposed by Lindley in the discussion of Stein (1962) and has been applied to several problems by Efron and Morris (1973, 1975). Lindley suggests that instead of shrinking to the origin, one may wish to shrink the parameters to another point in the parameter space. Under this modification, the parameters $\mu_i$ ($i = 1, \ldots, n$) are assumed to be normally distributed, $\mu_i \sim N(m, \tau^2)$, where both $m$ and $\tau^2$ are unknown parameters …
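A minimal sketch of this grand-mean variant follows, assuming the usual positive-part form with the $(n - 3)$ degrees-of-freedom correction used by Efron and Morris (1975); the truncated snippet does not state the estimator, so the details here are standard rather than quoted:

```python
import numpy as np

def james_stein_grand_mean(y, sigma2=1.0):
    """Positive-part James-Stein shrinkage towards the grand mean of y.

    Uses (n - 3) rather than (n - 2) because the shrinkage point is
    itself estimated from the data; requires n >= 4 for any shrinkage.
    """
    n = y.size
    y_bar = y.mean()
    s = np.sum((y - y_bar)**2)
    shrinkage = max(1.0 - (n - 3) * sigma2 / s, 0.0)
    return y_bar + shrinkage * (y - y_bar)
```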
Conclusion
This work has examined the task of estimating the mean of a multivariate normal distribution with known variance, given a single data sample, within the Minimum Message Length framework. We considered this problem in a hierarchical Bayes setting where the prior distribution on the mean depends on an unknown hyperparameter that must be estimated from the data. We show that if the hyperparameter is stated suboptimally, the resulting solution is inferior to the James–Stein estimator. Once the …
References (25)
- Rissanen, J., 1978. Modeling by shortest data description. Automatica.
- Berger, J.O., Strawderman, W.E., 1996. Choice of hierarchical priors: Admissibility in estimation of normal means. The Annals of Statistics.
- Conway, J.H., Sloane, N.J.A., 1998. Sphere Packings, Lattices and Groups.
- Dowe, D.L., Wallace, C.S., 1997. Resolving the Neyman–Scott problem by Minimum Message Length. In: Proc. Computing...
- Efron, B., Morris, C., 1973. Combining possibly related estimation problems. Journal of the Royal Statistical Society (Series B).
- Efron, B., Morris, C., 1975. Data analysis using Stein's estimator and its generalizations. Journal of the American Statistical Association.
- Farr, G.E., Wallace, C.S., 2002. The complexity of strict minimum message length inference. Computer Journal.
- Hansen, M.H., Yu, B., 2001. Model selection and the principle of minimum description length. Journal of the American Statistical Association.
- James, W., Stein, C., 1961. Estimation with quadratic loss. In: Proc. Fourth Berkeley Symposium on Mathematical Statistics and Probability.
- Empirical Bayes Methods.
Cited by (9)
- MML, Hybrid Bayesian Network Graphical Models, Statistical Consistency, Invariance and Uniqueness. 2011, Philosophy of Statistics (Handbook of the Philosophy of Science, Volume 7).
- MML is not consistent for Neyman–Scott. 2020, IEEE Transactions on Information Theory.
- Approximating message lengths of hierarchical Bayesian models using posterior sampling. 2016, Lecture Notes in Computer Science.
- Minimum message length ridge regression for generalized linear models. 2013, Lecture Notes in Computer Science.