Tutorial

On the Information Bottleneck Problems: Models, Connections, Applications and Information Theoretic Views

1 Institut d’Électronique et d’Informatique Gaspard-Monge, Université Paris-Est, 77454 Champs-sur-Marne, France
2 Mathematics and Algorithmic Sciences Lab, Paris Research Center, Huawei Technologies France, 92100 Boulogne-Billancourt, France
3 Technion Institute of Technology, Technion City, Haifa 32000, Israel
* Author to whom correspondence should be addressed.
Entropy 2020, 22(2), 151; https://doi.org/10.3390/e22020151
Submission received: 16 October 2019 / Revised: 18 January 2020 / Accepted: 21 January 2020 / Published: 27 January 2020
(This article belongs to the Special Issue Information Theory for Data Communications and Processing)

Abstract: This tutorial paper focuses on the variants of the bottleneck problem taking an information theoretic perspective and discusses practical methods to solve it, as well as its connection to coding and learning aspects. The intimate connections of this setting to remote source-coding under logarithmic loss distortion measure, information combining, common reconstruction, the Wyner–Ahlswede–Korner problem, the efficiency of investment information, as well as generalization, variational inference, representation learning, autoencoders, and others are highlighted. We discuss its extension to the distributed information bottleneck problem with emphasis on the Gaussian model and highlight the basic connections to the uplink Cloud Radio Access Networks (CRAN) with oblivious processing. For this model, the optimal trade-offs between relevance (i.e., information) and complexity (i.e., rates) in the discrete and vector Gaussian frameworks are determined. In the concluding outlook, some interesting problems are mentioned, such as the characterization of the optimal input ("feature") distributions under power limitations maximizing the "relevance" for the Gaussian information bottleneck, under "complexity" constraints.

1. Introduction

A growing body of works focuses on developing learning rules and algorithms using information theoretic approaches (e.g., see [1,2,3,4,5,6] and references therein). Most relevant to this paper is the Information Bottleneck (IB) method of Tishby et al. [1], which seeks the right balance between data fit and generalization by using the mutual information as both a cost function and a regularizer. Specifically, IB formulates the problem of extracting the relevant information that some signal X provides about another one Y that is of interest as that of finding a representation U that is maximally informative about Y (i.e., large mutual information I(U;Y)) while being minimally informative about X (i.e., small mutual information I(U;X)). In the IB framework, I(U;Y) is referred to as the relevance of U and I(U;X) is referred to as the complexity of U, where complexity here is measured by the minimum description length (or rate) at which the observation is compressed. Accordingly, the performance of learning with the IB method and the optimal mapping of the data are found by solving the Lagrangian formulation
$$\mathcal{L}_{\beta}^{\mathrm{IB},*} := \max_{P_{U|X}} \; I(U;Y) - \beta\, I(U;X),$$
where P_{U|X} is a stochastic map that assigns the observation X to a representation U from which Y is inferred, and β is the Lagrange multiplier. Several methods, which we detail below, have been proposed to obtain solutions P_{U|X} to the IB problem in Equation (4) in several scenarios, e.g., when the distribution of the sources (X, Y) is perfectly known or only samples from it are available.
The IB approach, as a method to both characterize performance limits and design mappings, has found remarkable applications in supervised and unsupervised learning problems such as classification, clustering, and prediction. Perhaps key to the analysis and theoretical development of the IB method is its elegant connection with information-theoretic rate-distortion problems, as it is now well known that the IB problem is essentially a remote source coding problem [7,8,9] in which the distortion is measured under logarithmic loss. Recent works show that this connection turns out to be useful for a better understanding of the fundamental limits of learning problems, including the performance of deep neural networks (DNN) [10], the emergence of invariance and disentanglement in DNN [11], the minimization of PAC-Bayesian bounds on the test error [11,12], prediction [13,14], or as a generalization of the evidence lower bound (ELBO) used to train variational auto-encoders [15,16], geometric clustering [17], or extracting the Gaussian "part" of a signal [18], among others. Other, more intriguing, connections also exist with seemingly unrelated problems such as privacy and hypothesis testing [19,20,21] or multiterminal networks with oblivious relays [22,23] and non-binary LDPC code design [24]. More connections with other coding problems, such as the problems of information combining and common reconstruction, the Wyner–Ahlswede–Korner problem, and the efficiency of investment information, are unveiled and discussed in this tutorial paper, together with extensions to the distributed setting.
The abstract viewpoint of IB also seems instrumental to a better understanding of the so-called representation learning [25], an active research area in machine learning that focuses on identifying and disentangling the underlying explanatory factors that are hidden in the observed data, in an attempt to render learning algorithms less dependent on feature engineering. More specifically, one important question, which is often controversial in statistical learning theory, is the choice of a "good" loss function that measures discrepancies between the true values and their estimated fits. There is, however, numerical evidence that models that are trained to maximize mutual information, or equivalently minimize the error's entropy, often outperform ones that are trained using other criteria such as mean-square error (MSE) and higher-order statistics [26,27]. On this aspect, we also mention Fisher's dissertation [28], which investigates the application of information theoretic metrics to blind source separation and subspace projection using Rényi's entropy, as well as what appears to be the first usage of the now popular Parzen windowing estimator of information densities in the context of learning. Although a complete and rigorous justification of the usage of mutual information as a cost function in learning is still awaited, a partial explanation appeared recently in [29], where the authors showed that, under some natural data processing property, Shannon's mutual information uniquely quantifies the reduction of prediction risk due to side information. Along the same line of work, Painsky and Wornell [30] showed that, for binary classification problems, minimizing the logarithmic loss (log-loss) actually minimizes an upper bound to any choice of loss function that is smooth, proper (i.e., unbiased and Fisher consistent), and convex. Perhaps this partially justifies why mutual information (or, equivalently, the corresponding loss function, which is the log-loss fidelity measure) is widely used in learning theory and has already been adopted in many algorithms in practice, such as the infomax criterion [31], the tree-based algorithm of Quinlan [32], or the well known Chow–Liu algorithm [33] for learning tree graphical models, with various applications in genetics [34], image processing [35], computer vision [36], etc. The logarithmic loss measure also plays a central role in the theory of prediction [37] (Ch. 09), where it is often referred to as the self-information loss function, as well as in Bayesian modeling [38], where priors are usually designed to maximize the mutual information between the parameter to be estimated and the observations. The goal of learning, however, is not merely to learn model parameters accurately for previously seen data. Rather, in essence, it is the ability to successfully apply rules that are extracted from previously seen data to characterize new unseen data. This is often captured through the notion of "generalization error". The generalization capability of a learning algorithm hinges on how sensitive the output of the algorithm is to modifications of the input dataset, i.e., its stability [39,40]. In the context of deep learning, it can be seen as a measure of how much the algorithm overfits the model parameters to the seen data. In fact, efficient algorithms should strike a good balance between their ability to fit the training dataset and their ability to generalize well to unseen data.
In statistical learning theory [37], such a dilemma is reflected through the fact that the minimization of the "population risk" (or "test error" in the deep learning literature) amounts to the minimization of the sum of two terms that are generally difficult to minimize simultaneously: the "empirical risk" on the training data and the generalization error. To prevent over-fitting, regularization methods can be employed, which include parameter penalization, noise injection, and averaging over multiple models trained with distinct sample sets. Although it is not yet very well understood how to optimally control model complexity, recent works [41,42] show that the generalization error can be upper-bounded using the mutual information between the input dataset and the output of the algorithm. This result formalizes the intuition that the less information a learning algorithm extracts from the input dataset, the less likely it is to overfit, and it justifies, partly, the use of mutual information also as a regularizer term. The interested reader may refer to [43], where it is shown that regularizing with mutual information alone does not always capture all desirable properties of a latent representation. We also point out that there exists an extensive literature on building optimal estimators of information quantities (e.g., entropy, mutual information), as well as their Matlab/Python implementations, including in the high-dimensional regime (see, e.g., [44,45,46,47,48,49] and references therein).
This paper provides a review of the information bottleneck method, its classical solutions, and recent advances. In addition, we unveil some useful connections with coding problems such as remote source-coding, information combining, common reconstruction, the Wyner–Ahlswede–Korner problem, the efficiency of investment information, and CEO source coding under logarithmic-loss distortion measure, as well as with learning problems such as inference, generalization, and representation learning. Leveraging these connections, we discuss the extension to the distributed information bottleneck problem, with emphasis on its solutions and the Gaussian model, and highlight the basic connections to uplink Cloud Radio Access Networks (CRAN) with oblivious processing. For this model, the optimal trade-offs between relevance and complexity in the discrete and vector Gaussian frameworks are determined. In the concluding outlook, some interesting problems are mentioned, such as the characterization of the optimal input distributions under power limitations maximizing the "relevance" for the Gaussian information bottleneck under "complexity" constraints.

Notation

Throughout, uppercase letters denote random variables, e.g., X; lowercase letters denote their realizations, e.g., x; and calligraphic letters denote sets, e.g., 𝒳. The cardinality of a set is denoted by |𝒳|. For a random variable X with probability mass function (pmf) P_X, we write P_X(x) = p(x), x ∈ 𝒳, for short. Boldface uppercase letters denote vectors or matrices, e.g., X, where the context should make the distinction clear. For random variables (X_1, X_2, …) and a set of integers 𝒦 ⊆ ℕ, X_𝒦 denotes the set of random variables with indices in 𝒦, i.e., X_𝒦 = {X_k : k ∈ 𝒦}. If 𝒦 = ∅, then X_𝒦 = ∅. For k ∈ 𝒦, we let X_{𝒦∖k} = (X_1, …, X_{k−1}, X_{k+1}, …, X_K), and we assume that X_0 = X_{K+1} = ∅. In addition, for zero-mean random vectors X and Y, the quantities Σ_x, Σ_{x,y}, and Σ_{x|y} denote, respectively, the covariance matrix of X, the cross-covariance matrix of (X, Y), and the conditional covariance matrix of X given Y, i.e., Σ_x = E[X X^H], Σ_{x,y} := E[X Y^H], and Σ_{x|y} = Σ_x − Σ_{x,y} Σ_y^{−1} Σ_{y,x}. Finally, for two probability measures P_X and Q_X on the random variable X ∈ 𝒳, the relative entropy or Kullback–Leibler divergence is denoted by D_KL(P_X‖Q_X). That is, if P_X is absolutely continuous with respect to Q_X, written P_X ≪ Q_X (i.e., for every x ∈ 𝒳, if P_X(x) > 0, then Q_X(x) > 0), then D_KL(P_X‖Q_X) = E_{P_X}[log(P_X(X)/Q_X(X))]; otherwise, D_KL(P_X‖Q_X) = ∞.

2. The Information Bottleneck Problem

The Information Bottleneck (IB) method was introduced by Tishby et al. [1] as a method for extracting the information that some variable X provides about another one Y that is of interest, as shown in Figure 1.
Specifically, the IB method consists of finding the stochastic mapping P_{U|X}: 𝒳 → 𝒰 that from an observation X outputs a representation U ∈ 𝒰 that is maximally informative about Y, i.e., large mutual information I(U;Y), while being minimally informative about X, i.e., small mutual information I(U;X). (As such, the usage of Shannon's mutual information seems to be motivated by the intuition that such a measure provides a natural quantitative approach to the questions of meaning, relevance, and common-information, rather than by the solution of a well-posed information-theoretic problem; a connection with source coding under the logarithmic loss measure appeared later on in [50].) The auxiliary random variable U satisfies that U − X − Y is a Markov Chain in this order; that is, the joint distribution of (X, U, Y) satisfies
$$p(x, u, y) = p(x)\, p(y|x)\, p(u|x),$$
and the mapping P_{U|X} is chosen such that U strikes a suitable balance between the degree of relevance of the representation, as measured by the mutual information I(U;Y), and its degree of complexity, as measured by the mutual information I(U;X). In particular, such a U, or effectively the mapping P_{U|X}, can be determined to maximize the IB-Lagrangian defined as
$$\mathcal{L}_{\beta}^{\mathrm{IB}}(P_{U|X}) := I(U;Y) - \beta\, I(U;X)$$
over all mappings P_{U|X} that satisfy U − X − Y, where the trade-off parameter β is a positive Lagrange multiplier associated with the constraint on I(U;Y).
Accordingly, for a given β and source distribution P_{X,Y}, the optimal mapping of the data, denoted by P^{*,β}_{U|X}, is found by solving the IB problem, defined as
$$\mathcal{L}_{\beta}^{\mathrm{IB},*} := \max_{P_{U|X}} \; I(U;Y) - \beta\, I(U;X),$$
over all mappings P_{U|X} that satisfy U − X − Y. It follows from the classical application of Carathéodory's theorem [51] that, without loss of optimality, U can be restricted to satisfy |𝒰| ≤ |𝒳| + 1.
In Section 3, we discuss several methods to obtain solutions P^{*,β}_{U|X} to the IB problem in Equation (4) in several scenarios, e.g., when the distribution of (X, Y) is perfectly known or only samples from it are available.

2.1. The IB Relevance–Complexity Region

The optimization of the IB-Lagrangian in Equation (4) for a given β ≥ 0 and P_{X,Y} results in an optimal mapping P^{*,β}_{U|X} and a relevance–complexity pair (Δ_β, R_β), where Δ_β = I(U_β; Y) and R_β = I(U_β; X) are, respectively, the relevance and the complexity resulting from generating U_β with the solution P^{*,β}_{U|X}. By optimizing over all β ≥ 0, the resulting relevance–complexity pairs (Δ_β, R_β) characterize the boundary of the region of simultaneously achievable relevance–complexity pairs for a distribution P_{X,Y} (see Figure 2). In particular, for a fixed P_{X,Y}, we define this region as the union of relevance–complexity pairs (Δ, R) that satisfy
$$\Delta \leq I(U;Y), \qquad R \geq I(X;U),$$
where the union is over all P_{U|X} such that U − X − Y forms a Markov Chain in this order. Any pair (Δ, R) outside of this region is not simultaneously achievable by any mapping P_{U|X}.

3. Solutions to the Information Bottleneck Problem

As shown in the previous section, the IB problem provides a methodology to design mappings P_{U|X} performing at different relevance–complexity points within the region of feasible (Δ, R) pairs, characterized by the IB relevance–complexity region, by optimizing the IB-Lagrangian in Equation (3) for different values of β. However, in general, this optimization is challenging as it requires the computation of mutual information terms.
In this section, we describe how, for a fixed parameter β, the optimal solution P^{*,β}_{U|X}, or an efficient approximation of it, can be obtained under: (i) particular distributions, e.g., Gaussian and binary symmetric sources; (ii) known general discrete memoryless distributions; and (iii) unknown distributions for which only samples are available.

3.1. Solution for Particular Distributions: Gaussian and Binary Symmetric Sources

In certain cases, when the joint distribution P_{X,Y} is known, e.g., when it is binary symmetric or Gaussian, information theoretic inequalities can be used to optimize the IB-Lagrangian in Equation (4) in closed form.

3.1.1. Binary IB

Let X and Y be a doubly symmetric binary source (DSBS), i.e., (X, Y) ∼ DSBS(p) for some 0 ≤ p ≤ 1/2. (A DSBS is a pair (X, Y) of binary random variables with X ∼ Bern(1/2), Y ∼ Bern(1/2), and X ⊕ Y ∼ Bern(p), where ⊕ is the sum modulo 2. That is, Y is the output of a binary symmetric channel with crossover probability p corresponding to the input X, and X is the output of the same channel with input Y.) Then, it can be shown that the optimal U in Equation (4) is such that (X, U) ∼ DSBS(q) for some 0 ≤ q ≤ 1. Such a U can be obtained with a mapping P_{U|X} such that
$$U = X \oplus Q, \qquad Q \sim \mathrm{Bern}(q) \ \text{independent of}\ X.$$
In this case, straightforward algebra shows that the complexity level is given by
$$I(U;X) = 1 - h_2(q),$$
where, for 0 ≤ x ≤ 1, h_2(x) is the entropy of a Bernoulli(x) source, i.e., h_2(x) = −x log_2(x) − (1−x) log_2(1−x), and the relevance level is given by
$$I(U;Y) = 1 - h_2(p * q),$$
where p * q = p(1−q) + q(1−p). The result extends easily to discrete symmetric mappings Y → X with binary X (one-bit output quantization) and discrete non-binary Y.
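For concreteness, the closed-form expressions above can be evaluated numerically. The following is a minimal sketch (the function names are illustrative, not from any particular library) that traces the binary IB relevance–complexity curve by sweeping the parameter q of the test channel U = X ⊕ Q:

```python
import numpy as np

def h2(x):
    """Binary entropy in bits, with the convention h2(0) = h2(1) = 0."""
    x = np.clip(x, 1e-12, 1 - 1e-12)
    return -x * np.log2(x) - (1 - x) * np.log2(1 - x)

def binary_ib_curve(p, num_points=101):
    """Relevance-complexity pairs for (X, Y) ~ DSBS(p) under U = X xor Q, Q ~ Bern(q)."""
    q = np.linspace(0.0, 0.5, num_points)
    complexity = 1.0 - h2(q)                            # I(U;X) = 1 - h2(q)
    relevance = 1.0 - h2(p * (1 - q) + q * (1 - p))     # I(U;Y) = 1 - h2(p * q)
    return complexity, relevance

# Example: sweep the curve for a DSBS with crossover probability p = 0.1.
I_UX, I_UY = binary_ib_curve(p=0.1)
```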

3.1.2. Vector Gaussian IB

Let (X, Y) ∈ ℂ^{N_x} × ℂ^{N_y} be a pair of jointly Gaussian, zero-mean, complex-valued random vectors, of dimensions N_x > 0 and N_y > 0, respectively. In this case, the optimal solution of the IB-Lagrangian in Equation (3) (i.e., the test channel P_{U|X}) is a noisy linear projection onto a subspace whose dimensionality is determined by the trade-off parameter β. The subspaces are spanned by basis vectors in a manner similar to the well known canonical correlation analysis [52]. For small β, only the vector associated with the dimension carrying the most energy, i.e., corresponding to the largest eigenvalue of a particular Hermitian matrix, is considered in U. As β increases, additional dimensions are added to U through a series of critical points that are similar to structural phase transitions. This process continues until U becomes rich enough to capture all the relevant information about Y that is contained in X. In particular, the boundary of the optimal relevance–complexity region was shown in [53] to be achievable using a test channel P_{U|X} such that (U, X) is Gaussian. Without loss of generality, let
$$\mathbf{U} = \mathbf{A}\mathbf{X} + \boldsymbol{\xi},$$
where A ∈ 𝕄_{N_u,N_x}(ℂ) is an N_u × N_x complex-valued matrix and ξ ∈ ℂ^{N_u} is a Gaussian noise vector that is independent of (X, Y), with zero mean and covariance matrix I_{N_u}. For a given non-negative trade-off parameter β, the matrix A has a number of rows that depends on β and is given by [54] (Theorem 3.1)
$$\mathbf{A} = \begin{cases} \big[\,\mathbf{0}^T; \ldots; \mathbf{0}^T\,\big], & 0 \leq \beta < \beta_1^c \\ \big[\,\alpha_1 \mathbf{v}_1^T; \mathbf{0}^T; \ldots; \mathbf{0}^T\,\big], & \beta_1^c \leq \beta < \beta_2^c \\ \big[\,\alpha_1 \mathbf{v}_1^T; \alpha_2 \mathbf{v}_2^T; \mathbf{0}^T; \ldots; \mathbf{0}^T\,\big], & \beta_2^c \leq \beta < \beta_3^c \\ \quad\vdots & \end{cases}$$
where {v_1^T, v_2^T, …, v_{N_x}^T} are the left eigenvectors of Σ_{x|y} Σ_x^{−1}, sorted by their corresponding ascending eigenvalues λ_1, λ_2, …, λ_{N_x}. Furthermore, for i = 1, …, N_x, the β_i^c = 1/(1 − λ_i) are critical β-values, α_i = \sqrt{(β(1 − λ_i) − 1)/(λ_i r_i)} with r_i = v_i^T Σ_x v_i, 0^T denotes the N_x-dimensional zero vector, and semicolons separate the rows of the matrix. It is interesting to observe that the optimal projection consists of eigenvectors of Σ_{x|y} Σ_x^{−1}, combined in a judicious manner: for values of β that are smaller than β_1^c, reducing complexity is of prime importance, yielding extreme compression U = ξ, i.e., independent noise and no information preservation at all about Y. As β increases, it undergoes a series of critical points {β_i^c}, at each of which a new eigenvector is added to the matrix A, yielding a more complex but richer representation; the rank of A increases accordingly.
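The eigenvalue characterization above translates directly into a numerical recipe. The sketch below assumes real-valued covariance matrices with eigenvalues strictly between 0 and 1, reads α_i as the square root of the displayed ratio, and uses illustrative function and variable names:

```python
import numpy as np

def gaussian_ib_projection(Sigma_x, Sigma_x_given_y, beta):
    """Rows of A in U = A X + xi for the vector Gaussian IB, per the characterization above."""
    M = Sigma_x_given_y @ np.linalg.inv(Sigma_x)
    # Left eigenvectors of M are the right eigenvectors of M^T.
    eigval, eigvec = np.linalg.eig(M.T)
    order = np.argsort(eigval.real)                     # ascending eigenvalues lambda_1 <= ... <= lambda_Nx
    lam, V = eigval.real[order], eigvec.real[:, order]
    rows = []
    for i in range(len(lam)):
        # Dimension i is active only once beta exceeds its critical value beta_i^c = 1 / (1 - lambda_i).
        if lam[i] < 1.0 and beta * (1.0 - lam[i]) > 1.0:
            r_i = float(V[:, i] @ Sigma_x @ V[:, i])
            alpha_i = np.sqrt((beta * (1.0 - lam[i]) - 1.0) / (lam[i] * r_i))
            rows.append(alpha_i * V[:, i])
        else:
            rows.append(np.zeros(Sigma_x.shape[0]))
    return np.vstack(rows)
```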
For the specific case of scalar Gaussian sources, that is, N_x = N_y = 1, e.g., X = √snr · Y + N, where N is standard Gaussian with zero mean and unit variance, the above result simplifies considerably. In this case, let, without loss of generality, the mapping P_{U|X} be given by
$$U = aX + Q,$$
where Q is a zero-mean Gaussian random variable with variance σ_q², independent of X. In this case, for I(U;X) = R, we get
$$I(U;Y) = \frac{1}{2}\log(1 + \mathrm{snr}) - \frac{1}{2}\log\!\left(1 + \mathrm{snr}\,\exp(-2R)\right).$$
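As an illustration, the scalar relevance–complexity function can be evaluated directly from this expression; a small sketch (in nats, since the formula uses natural logarithms):

```python
import numpy as np

def scalar_gaussian_ib(snr, R):
    """Relevance I(U;Y) in nats as a function of the complexity R = I(U;X)."""
    return 0.5 * np.log(1.0 + snr) - 0.5 * np.log(1.0 + snr * np.exp(-2.0 * R))

rates = np.linspace(0.0, 5.0, 100)
relevance = scalar_gaussian_ib(snr=10.0, R=rates)   # saturates at 0.5 * log(1 + snr) as R grows
```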

3.2. Approximations for Generic Distributions

Next, we present an approach to obtain solutions to the information bottleneck problem for generic distributions, both when the joint distribution is known and when it is unknown. The method consists of defining a variational (lower) bound on the IB-Lagrangian, which can be optimized more easily than the IB-Lagrangian directly.

3.2.1. A Variational Bound

Recall the IB goal of finding a representation U of X that is maximally informative about Y while being concise enough (i.e., bounded I ( U ; X ) ). This corresponds to optimizing the IB-Lagrangian
$$\mathcal{L}_{\beta}^{\mathrm{IB}}(P_{U|X}) := I(U;Y) - \beta\, I(U;X),$$
where the maximization is over all stochastic mappings P_{U|X} such that U − X − Y and |𝒰| ≤ |𝒳| + 1. In this section, we show that optimizing Equation (13) is equivalent to optimizing the variational cost
$$\mathcal{L}_{\beta}^{\mathrm{VIB}}(P_{U|X}, Q_{Y|U}, S_U) := \mathrm{E}_{P_{U|X}}\!\big[\log Q_{Y|U}(Y|U)\big] - \beta\, D_{\mathrm{KL}}\big(P_{U|X} \,\|\, S_U\big),$$
where Q_{Y|U}(y|u) is a given stochastic map Q_{Y|U}: 𝒰 → [0, 1] (also referred to as the variational approximation of P_{Y|U}, or decoder), S_U(u): 𝒰 → [0, 1] is a given stochastic map (also referred to as the variational approximation of P_U), and D_KL(P_{U|X}‖S_U) is the relative entropy between P_{U|X} and S_U.
Then, we have the following bound for any valid P_{U|X}, i.e., satisfying the Markov Chain in Equation (2):
$$\mathcal{L}_{\beta}^{\mathrm{IB}}(P_{U|X}) \geq \mathcal{L}_{\beta}^{\mathrm{VIB}}(P_{U|X}, Q_{Y|U}, S_U),$$
where the equality holds when Q_{Y|U} = P_{Y|U} and S_U = P_U, i.e., when the variational approximations coincide with the true distributions.
In the following, we derive the variational bound. Fix P_{U|X} (the encoder) and the variational decoder approximation Q_{Y|U}. The relevance I(U;Y) can be lower-bounded as
$$\begin{aligned}
I(U;Y) &= \int_{u \in \mathcal{U},\, y \in \mathcal{Y}} P_{U,Y}(u,y) \log \frac{P_{Y|U}(y|u)}{P_Y(y)}\; dy\, du \\
&\stackrel{(a)}{=} \int_{u \in \mathcal{U},\, y \in \mathcal{Y}} P_{U,Y}(u,y) \log \frac{Q_{Y|U}(y|u)}{P_Y(y)}\; dy\, du + D\big(P_{Y|U} \,\|\, Q_{Y|U}\big) \\
&\stackrel{(b)}{\geq} \int_{u \in \mathcal{U},\, y \in \mathcal{Y}} P_{U,Y}(u,y) \log \frac{Q_{Y|U}(y|u)}{P_Y(y)}\; dy\, du \\
&= H(Y) + \int_{u \in \mathcal{U},\, y \in \mathcal{Y}} P_{U,Y}(u,y) \log Q_{Y|U}(y|u)\; dy\, du \\
&\stackrel{(c)}{\geq} \int_{u \in \mathcal{U},\, y \in \mathcal{Y}} P_{U,Y}(u,y) \log Q_{Y|U}(y|u)\; dy\, du \\
&\stackrel{(d)}{=} \int_{u \in \mathcal{U},\, x \in \mathcal{X},\, y \in \mathcal{Y}} P_X(x) P_{Y|X}(y|x) P_{U|X}(u|x) \log Q_{Y|U}(y|u)\; dx\, dy\, du,
\end{aligned}$$
where in (a) the term D(P_{Y|U} ‖ Q_{Y|U}) is the conditional relative entropy between P_{Y|U} and Q_{Y|U}, given P_U; (b) holds by the non-negativity of relative entropy; (c) holds by the non-negativity of entropy; and (d) follows using the Markov Chain U − X − Y.
Similarly, let S_U be the given variational approximation of P_U. Then, we get
$$\begin{aligned}
I(U;X) &= \int_{u \in \mathcal{U},\, x \in \mathcal{X}} P_{U,X}(u,x) \log \frac{P_{U|X}(u|x)}{P_U(u)}\; dx\, du \\
&= \int_{u \in \mathcal{U},\, x \in \mathcal{X}} P_{U,X}(u,x) \log \frac{P_{U|X}(u|x)}{S_U(u)}\; dx\, du - D\big(P_U \,\|\, S_U\big) \\
&\leq \int_{u \in \mathcal{U},\, x \in \mathcal{X}} P_{U,X}(u,x) \log \frac{P_{U|X}(u|x)}{S_U(u)}\; dx\, du,
\end{aligned}$$
where the inequality follows since the relative entropy is non-negative.
Combining Equations (21) and (24), we get
$$I(U;Y) - \beta\, I(U;X) \;\geq\; \int_{u \in \mathcal{U},\, x \in \mathcal{X},\, y \in \mathcal{Y}} P_X(x) P_{Y|X}(y|x) P_{U|X}(u|x) \log Q_{Y|U}(y|u)\; dx\, dy\, du \;-\; \beta \int_{u \in \mathcal{U},\, x \in \mathcal{X}} P_{U,X}(u,x) \log \frac{P_{U|X}(u|x)}{S_U(u)}\; dx\, du.$$
The use of the variational bound in Equation (14) instead of the IB-Lagrangian in Equation (13) has some advantages. First, it allows the derivation of alternating algorithms that obtain a solution by optimizing alternately over the encoder and the decoders. Second, it is easier to obtain an empirical estimate of Equation (14) by sampling from: (i) the joint distribution P_{X,Y}; (ii) the encoder P_{U|X}; and (iii) the prior S_U. Additionally, as noted in Equation (15), when evaluated for the optimal decoder Q_{Y|U} and prior S_U, the variational bound becomes tight. This makes it possible to derive algorithms that obtain good approximate solutions to the IB problem, as shown next. Further theoretical implications of this variational bound are discussed in [55].

3.2.2. Known Distributions

Using the variational formulation in Equation (14), when the data model is discrete and the joint distribution P_{X,Y} is known, the IB problem can be solved by using an iterative method that optimizes the variational IB cost function in Equation (14), alternating over the distributions P_{U|X}, Q_{Y|U}, and S_U. In this case, the maximizing distributions P_{U|X}, Q_{Y|U}, and S_U can be efficiently found by an alternating optimization procedure similar to the expectation-maximization (EM) algorithm [56] and the standard Blahut–Arimoto (BA) method [57]. In particular, a solution P_{U|X} to the constrained optimization problem is determined by the following self-consistent equations, for all (u, x, y) ∈ 𝒰 × 𝒳 × 𝒴 [1]:
$$P_{U|X}(u|x) = \frac{P_U(u)}{Z(\beta, x)} \exp\!\Big(-\beta\, D_{\mathrm{KL}}\big(P_{Y|X}(\cdot|x) \,\|\, P_{Y|U}(\cdot|u)\big)\Big),$$
$$P_U(u) = \sum_{x \in \mathcal{X}} P_X(x)\, P_{U|X}(u|x),$$
$$P_{Y|U}(y|u) = \sum_{x \in \mathcal{X}} P_{Y|X}(y|x)\, P_{X|U}(x|u),$$
where P_{X|U}(x|u) = P_{U|X}(u|x) P_X(x)/P_U(u) and Z(β, x) is a normalization term. It is shown in [1] that alternating iterations of these equations converge to a solution of the problem for any initial P_{U|X}. However, in contrast to the standard Blahut–Arimoto algorithm [57,58], which is classically used in the computation of rate-distortion functions of discrete memoryless sources and for which convergence to the optimal solution is guaranteed, convergence here may be to a local optimum only. If β = 0, the optimization is unconstrained and one can set U = ∅, which yields minimal relevance and complexity levels. Increasing the value of β steers towards more accurate and more complex representations, until U = X in the limit of very large (infinite) values of β, for which the relevance reaches its maximal value I(X;Y).
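Concretely, for finite alphabets the self-consistent equations in Equation (26) can be iterated directly. The following is a minimal NumPy sketch (variable names are ours) of such an alternating scheme for a known joint pmf of shape |𝒳| × |𝒴|; as noted above, the iterations may converge only to a local optimum:

```python
import numpy as np

def ib_iterations(p_xy, beta, card_u, num_iter=200, seed=0):
    """Alternating iterations of the self-consistent IB equations for a known discrete joint pmf."""
    rng = np.random.default_rng(seed)
    eps = 1e-12
    p_x = p_xy.sum(axis=1)                               # P_X
    p_y_given_x = p_xy / (p_x[:, None] + eps)            # P_{Y|X}
    p_u_given_x = rng.random((p_xy.shape[0], card_u))    # random initialization of P_{U|X}
    p_u_given_x /= p_u_given_x.sum(axis=1, keepdims=True)
    for _ in range(num_iter):
        p_u = p_x @ p_u_given_x                           # P_U(u) = sum_x P_X(x) P_{U|X}(u|x)
        p_x_given_u = (p_u_given_x * p_x[:, None]) / (p_u[None, :] + eps)
        p_y_given_u = p_x_given_u.T @ p_y_given_x         # P_{Y|U}(y|u) = sum_x P_{Y|X}(y|x) P_{X|U}(x|u)
        # D_KL(P_{Y|X=x} || P_{Y|U=u}) for every pair (x, u)
        kl = np.einsum('xy,xuy->xu', p_y_given_x,
                       np.log((p_y_given_x[:, None, :] + eps) / (p_y_given_u[None, :, :] + eps)))
        p_u_given_x = p_u[None, :] * np.exp(-beta * kl) + eps
        p_u_given_x /= p_u_given_x.sum(axis=1, keepdims=True)  # normalization by Z(beta, x)
    return p_u_given_x, p_y_given_u
```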
For discrete sources with (small) alphabets, the update equations in Equation (26) are relatively easy to compute. However, if the variables X and Y lie in a continuum, solving the equations in Equation (26) is very challenging. In the case in which X and Y are jointly multivariate Gaussian, the problem of finding the optimal representation U is analytically tractable [53] (see also the related works [54,59]), as discussed in Section 3.1.2. Leveraging the optimality of Gaussian mappings P_{U|X}, so as to restrict the optimization of P_{U|X} to Gaussian distributions as in Equation (9), allows reducing the search of update rules to those of the associated parameters, namely covariance matrices. When Y is a deterministic function of X, the IB curve cannot be explored, and other Lagrangians have been proposed to tackle this problem [60].

3.3. Unknown Distributions

The main drawback of the solutions presented thus far for the IB principle is that, with the exception of small-sized discrete (X, Y), for which iterating Equation (26) converges to an (at least local) solution, and jointly Gaussian (X, Y), for which an explicit analytic solution was found, solving Equation (3) is generally computationally costly, especially for high-dimensional data. Another important barrier in solving Equation (3) directly is that IB necessitates knowledge of the joint distribution P_{X,Y}. In this section, we describe a method to provide an approximate solution to the IB problem in the case in which the joint distribution is unknown and only a given training set of N samples {(x_i, y_i)}_{i=1}^N is available.
A major step forward, which widened the range of applications of IB inference to various learning problems, appeared in [48], where the authors used neural networks to parameterize the variational inference lower bound in Equation (14) and showed that its optimization can be done through the classic and widely used stochastic gradient descent (SGD). This method, denoted Variational IB in [48] and detailed below, makes it possible to handle high-dimensional, possibly continuous, data, even in the case in which the distributions are unknown.

3.3.1. Variational IB

The goal of the variational IB when only samples {(x_i, y_i)}_{i=1}^N are available is to solve the IB problem by optimizing an approximation of the cost function. For instance, for a given training set {(x_i, y_i)}_{i=1}^N, the right hand side of Equation (14) can be approximated as
$$\mathcal{L}^{\mathrm{low}} \approx \frac{1}{N} \sum_{i=1}^{N} \int_{u \in \mathcal{U}} \left[ P_{U|X}(u|x_i) \log Q_{Y|U}(y_i|u) - \beta\, P_{U|X}(u|x_i) \log \frac{P_{U|X}(u|x_i)}{S_U(u)} \right] du.$$
However, in general, the direct optimization of this cost is challenging. In the variational IB method, this optimization is done by parameterizing the encoding and decoding distributions P_{U|X}, Q_{Y|U}, and S_U using families of distributions whose parameters are determined by DNNs. This allows us to formulate Equation (14) in terms of the DNN parameters, i.e., their weights, and to optimize it by using the reparameterization trick [15], Monte Carlo sampling, and stochastic gradient descent (SGD)-type algorithms.
Let P_θ(u|x) denote the family of encoding probability distributions P_{U|X} over 𝒰 for each element of 𝒳, parameterized by the output of a DNN f_θ with parameters θ. A common example is the family of multivariate Gaussian distributions [15], which are parameterized by the mean μ_θ and covariance matrix Σ_θ, i.e., γ := (μ_θ, Σ_θ). Given an observation X, the values of (μ_θ(x), Σ_θ(x)) are determined by the output of the DNN f_θ, whose input is X, and the corresponding family member is given by P_θ(u|x) = 𝒩(u; μ_θ(x), Σ_θ(x)). For discrete distributions, a common example is Concrete variables [61] (or Gumbel-Softmax [62]). Some details are given below.
Similarly, for the decoder Q_{Y|U} over 𝒴 for each element of 𝒰, let Q_ψ(y|u) denote the family of distributions parameterized by the output of a DNN f_ψ. Finally, for the prior distribution S_U(u) over 𝒰, we define the family of distributions S_φ(u), which does not depend on a DNN.
By restricting the optimization of the variational IB cost in Equation (14) to the encoder, decoder, and prior within the families of distributions P θ ( u | x ) , Q ψ ( y | u ) , and S φ ( u ) , we get
$$\max_{P_{U|X}} \; \max_{Q_{Y|U},\, S_U} \; \mathcal{L}_{\beta}^{\mathrm{VIB}}(P_{U|X}, Q_{Y|U}, S_U) \;\geq\; \max_{\theta, \phi, \varphi} \; \mathcal{L}_{\beta}^{\mathrm{NN}}(\theta, \phi, \varphi),$$
where θ, ϕ, and φ denote the DNN parameters, e.g., their weights, and the cost in Equation (29) is given by
$$\mathcal{L}_{\beta}^{\mathrm{NN}}(\theta, \phi, \varphi) := \mathrm{E}_{P_{X,Y}}\Big[ \mathrm{E}_{P_{\theta}(U|X)}\big[\log Q_{\phi}(Y|U)\big] - \beta\, D_{\mathrm{KL}}\big(P_{\theta}(U|X) \,\|\, S_{\varphi}(U)\big) \Big].$$
Next, using the training samples {(x_i, y_i)}_{i=1}^N, the DNNs are trained to maximize a Monte Carlo approximation of Equation (29) over θ, ϕ, φ using optimization methods such as SGD or ADAM [63] with backpropagation. However, in general, the direct computation of the gradients of Equation (29) is challenging due to the dependence of the averaging on the encoder P_θ, which makes it hard to approximate the cost by sampling. To circumvent this problem, the reparameterization trick [15] is used to sample from P_θ(U|X). In particular, consider P_θ(U|X) to belong to a parametric family of distributions that can be sampled by first sampling a random variable Z with distribution P_Z(z), z ∈ 𝒵, and then transforming the samples using some function g_θ: 𝒳 × 𝒵 → 𝒰 parameterized by θ, such that U = g_θ(x, Z) ∼ P_θ(U|x). Various parametric families of distributions fall within this class, for both discrete and continuous latent spaces, e.g., the Gumbel-Softmax distributions and the Gaussian distributions. Next, we detail how to sample from both examples:
  • Sampling from Gaussian Latent Spaces: When the latent space is a continuous vector space of dimension D, e.g., 𝒰 = ℝ^D, we can consider multivariate Gaussian parametric encoders with mean μ_θ and covariance Σ_θ, i.e., P_θ(u|x) = 𝒩(u; μ_θ, Σ_θ). To sample U ∼ 𝒩(u; μ_θ(x), Σ_θ(x)), where μ_θ(x) = f_{e,θ}^μ(x) and Σ_θ(x) = f_{e,θ}^Σ(x) are determined as the output of a NN, sample a random variable Z ∼ 𝒩(z; 0, I) i.i.d. and, given the data sample x ∈ 𝒳, generate the j-th sample as
    $$u_j = f_{e,\theta}^{\mu}(x) + f_{e,\theta}^{\Sigma}(x)\, z_j,$$
    where z_j is a sample of Z ∼ 𝒩(0, I), which is an independent Gaussian noise, and f_{e,θ}^μ(x) and f_{e,θ}^Σ(x) are the output values of the NN with weights θ for the given input sample x.
    An example of the resulting DIB architecture to optimize with an encoder, a latent space, and a decoder parameterized by Gaussian distributions is shown in Figure 3.
  • Sampling from a discrete latent space with the Gumbel-Softmax:
    If U is a categorical random variable on the finite set 𝒰 of size D with probabilities π := (π_1, …, π_D), we can encode it as a D-dimensional one-hot vector lying on the corners of the (D−1)-dimensional simplex Δ_{D−1}. In general, cost functions involving sampling from categorical distributions are non-differentiable. Instead, we consider Concrete variables [62] (or Gumbel-Softmax [61]), which are continuous differentiable relaxations of categorical variables on the interior of the simplex and are easy to sample. To sample from a Concrete random variable U ∈ Δ_{D−1} at temperature λ ∈ (0, ∞), with probabilities π ∈ (0, 1)^D, sample G_d ∼ Gumbel(0, 1) i.i.d. (the Gumbel(0, 1) distribution can be sampled by drawing u ∼ Uniform(0, 1) and computing g = −log(−log(u))), and set, for each of the components of U = (U_1, …, U_D),
    $$U_d = \frac{\exp\big((\log \pi_d + G_d)/\lambda\big)}{\sum_{j=1}^{D} \exp\big((\log \pi_j + G_j)/\lambda\big)}, \qquad d = 1, \ldots, D.$$
    We denote by Q_{π,λ}(u|x) the Concrete distribution with parameters (π(x), λ); a minimal sampling sketch is given after this list. When the temperature λ approaches 0, the samples from the Concrete distribution become one-hot and Pr{lim_{λ→0} U_d = 1} = π_d [61]. Note that, for discrete data models, a standard application of Carathéodory's theorem [64] shows that the latent variables U that appear in Equation (3) can be restricted to have a bounded alphabet size.
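As referenced in the item above, here is a minimal sketch of Concrete/Gumbel-Softmax sampling in plain NumPy (names are illustrative):

```python
import numpy as np

def sample_concrete(pi, lam, rng=None):
    """One Concrete (Gumbel-Softmax) sample on the simplex for class probabilities pi at temperature lam."""
    rng = np.random.default_rng() if rng is None else rng
    gumbel = -np.log(-np.log(rng.uniform(size=len(pi))))   # G_d ~ Gumbel(0, 1)
    logits = (np.log(np.asarray(pi)) + gumbel) / lam
    logits -= logits.max()                                  # for numerical stability
    u = np.exp(logits)
    return u / u.sum()

# Example: a 4-class distribution; lowering lam makes the samples nearly one-hot.
u = sample_concrete(pi=[0.1, 0.2, 0.3, 0.4], lam=0.5)
```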
The reparameterization trick transforms the cost function in Equation (29) into one that can be approximated by sampling M independent samples {u_m}_{m=1}^M ∼ P_θ(u|x_i) for each training sample (x_i, y_i), i = 1, …, N, and allows computing estimates of the gradient using backpropagation [15]. Sampling is performed by using u_{i,m} = g_θ(x_i, z_m) with {z_m}_{m=1}^M i.i.d. sampled from P_Z. Altogether, we have the following empirical IB cost for the i-th sample in the training dataset:
$$\mathcal{L}_{\beta,i}^{\mathrm{emp}}(\theta, \phi, \varphi) := \frac{1}{M} \sum_{m=1}^{M} \log Q_{\phi}(y_i \,|\, u_{i,m}) - \beta\, D_{\mathrm{KL}}\big(P_{\theta}(U_i|x_i) \,\|\, Q_{\varphi}(U_i)\big).$$
Note that, for many distributions, e.g., multivariate Gaussian, the divergence D KL ( P θ ( U i | x i ) Q φ ( U i ) ) can be evaluated in closed form. Alternatively, an empirical approximation can be considered.
Finally, we maximize the empirical IB cost over the DNN parameters θ, ϕ, φ as
$$\max_{\theta, \phi, \varphi} \; \frac{1}{N} \sum_{i=1}^{N} \mathcal{L}_{\beta,i}^{\mathrm{emp}}(\theta, \phi, \varphi).$$
By the law of large numbers, for large N and M, we have (1/N) Σ_{i=1}^N 𝓛_{β,i}^{emp}(θ, ϕ, φ) → 𝓛_β^{NN}(θ, ϕ, φ) almost surely. After convergence of the DNN parameters to θ*, ϕ*, φ*, for a new observation X, the representation U can be obtained by sampling from the encoder P_{θ*}(U|X). In addition, note that a soft estimate of the remote source Y can be inferred by sampling from the decoder Q_{ϕ*}(Y|U). The notions of encoder and decoder in the IB problem will become clear from its relationship with lossy source coding in Section 4.1.
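To make the above concrete, the following is a compact PyTorch-style sketch of a variational IB estimator with a Gaussian latent space and a categorical decoder. It is a minimal illustration under stated assumptions (diagonal covariance, standard Gaussian prior for S_φ, one Monte Carlo sample per input); the layer sizes and names are ours, not taken from [48]:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class VariationalIB(nn.Module):
    """Minimal variational IB model: Gaussian encoder P_theta(u|x), categorical decoder Q_phi(y|u)."""
    def __init__(self, x_dim, u_dim, num_classes):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(x_dim, 128), nn.ReLU(), nn.Linear(128, 2 * u_dim))
        self.decoder = nn.Linear(u_dim, num_classes)
        self.u_dim = u_dim

    def forward(self, x):
        stats = self.encoder(x)
        mu, log_var = stats[:, :self.u_dim], stats[:, self.u_dim:]
        u = mu + torch.exp(0.5 * log_var) * torch.randn_like(mu)   # reparameterization trick
        return self.decoder(u), mu, log_var

def vib_loss(logits, y, mu, log_var, beta):
    """One-sample Monte Carlo estimate of the negative cost: cross-entropy + beta * KL(P_theta || N(0, I))."""
    ce = F.cross_entropy(logits, y)                                # -E[log Q_phi(y|u)]
    kl = 0.5 * torch.sum(mu.pow(2) + log_var.exp() - 1.0 - log_var, dim=1).mean()
    return ce + beta * kl

# Illustrative training step with hypothetical data tensors x_batch (float) and y_batch (labels):
# model = VariationalIB(x_dim=784, u_dim=32, num_classes=10)
# opt = torch.optim.Adam(model.parameters(), lr=1e-4)
# logits, mu, log_var = model(x_batch)
# loss = vib_loss(logits, y_batch, mu, log_var, beta=1e-3)
# opt.zero_grad(); loss.backward(); opt.step()
```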

4. Connections to Coding Problems

The IB problem is a one-shot coding problem, in the sense that the operations are performed letter-wise. In this section, we consider the relationship between the IB problem and (asymptotic) coding problems in which the coding operations are performed over blocks of size n, with n assumed to be large and the joint distribution of the data P_{X,Y} in general assumed to be known a priori. The connections between these problems allow extending results from one setup to another and considering generalizations of the classical IB problem to other setups, e.g., as shown in Section 6.

4.1. Indirect Source Coding under Logarithmic Loss

Let us consider the (asymptotic) indirect source coding problem shown in Figure 4, in which Y designates a memoryless remote source and X a noisy version of it that is observed at the encoder.
A sequence of n samples X^n = (X_1, …, X_n) is mapped by an encoder ϕ^{(n)}: 𝒳^n → {1, …, 2^{nR}}, which outputs a message from the set {1, …, 2^{nR}}; that is, the encoder uses at most R bits per sample to describe its observation, and the range of the encoder map is allowed to grow with the size of the input sequence as
$$\log \big\| \phi^{(n)} \big\| \leq nR.$$
This message is mapped by a decoder ψ^{(n)}: {1, …, 2^{nR}} → $\hat{\mathcal{Y}}^n$ to generate a reconstruction Ŷ^n of the source sequence Y^n. As already observed in [50], the IB problem in Equation (3) is essentially equivalent to a remote point-to-point source coding problem in which the distortion between Y^n and Ŷ^n is measured under the logarithmic loss (log-loss) fidelity criterion [65]. That is, rather than just assigning a deterministic value to each sample of the source, the decoder gives an assessment of the degree of confidence or reliability of each estimate. Specifically, given the output description m = ϕ^{(n)}(x^n) of the encoder, the decoder generates a soft estimate ŷ^n of y^n in the form of a probability distribution over 𝒴^n, i.e., ŷ^n = P̂_{Y^n|M}(·). The incurred discrepancy between y^n and the estimate ŷ^n under log-loss for the observation x^n is then given by the per-letter logarithmic loss distortion, which is defined as
$$\ell_{\log}(y, \hat{y}) := \log \frac{1}{\hat{y}(y)},$$
where y ∈ 𝒴 and ŷ ∈ 𝒫(𝒴) designates a probability distribution on 𝒴, and ŷ(y) is the value of that distribution evaluated at the outcome y ∈ 𝒴.
That is, the encoder uses at most R bits per sample to describe its observation to a decoder which is interested in reconstructing the remote source Y n to within an average distortion level D, using a per-letter distortion metric, i.e.,
$$\mathrm{E}\big[\ell_{\log}^{(n)}(Y^n, \hat{Y}^n)\big] \leq D,$$
where the incurred distortion between two sequences Y n and Y ^ n is measured as
$$\ell_{\log}^{(n)}(Y^n, \hat{Y}^n) = \frac{1}{n} \sum_{i=1}^{n} \ell_{\log}(y_i, \hat{y}_i),$$
and the per-letter distortion is given by the logarithmic loss in Equation (53).
The rate distortion region of this model is given by the union of all pairs ( R , D ) that satisfy [7,9]
$$R \geq I(U;X),$$
$$D \geq H(Y|U),$$
where the union is over all auxiliary random variables U such that U − X − Y forms a Markov Chain in this order. Invoking the support lemma [66] (p. 310), it is easy to see that this region is not altered if one restricts U to satisfy |𝒰| ≤ |𝒳| + 1. In addition, using the substitution Δ := H(Y) − D, the region can be written equivalently as the union of all pairs (R, H(Y) − Δ) that satisfy
$$R \geq I(U;X),$$
$$\Delta \leq I(U;Y),$$
where the union is over all U with pmf P_{U|X} that satisfy U − X − Y, with |𝒰| ≤ |𝒳| + 1.
The boundary of this region is equivalent to the one described by the IB principle in Equation (3) if solved for all β, and therefore the IB problem is essentially a remote source coding problem in which the distortion is measured under the logarithmic loss measure. Note that, operationally, the IB problem is equivalent to that of finding an encoder P_{U|X} that maps the observation X to a representation U satisfying the bit rate constraint R and such that U captures enough relevance of Y so that the posterior probability of Y given U satisfies an average distortion constraint.

4.2. Common Reconstruction

Consider the problem of source coding with side information at the decoder, i.e., the well known Wyner–Ziv setting [67], with the distortion measured under logarithmic-loss. Specifically, a memoryless source X is to be conveyed lossily to a decoder that observes a statistically correlated side information Y. The encoder uses R bits per sample to describe its observation to the decoder which wants to reconstruct an estimate of X to within an average distortion level D, where the distortion is evaluated under the log-loss distortion measure. The rate distortion region of this problem is given by the set of all pairs ( R , D ) that satisfy
$$R + D \geq H(X|Y).$$
The optimal coding scheme utilizes standard Wyner–Ziv compression [67] at the encoder, and the decoder map ψ: 𝒰 × 𝒴 → $\hat{\mathcal{X}}$ is given by
$$\psi(U, Y) = \Pr[X = x \,|\, U, Y],$$
for which it is easy to see that
$$\mathrm{E}\big[\ell_{\log}(X, \psi(U, Y))\big] = H(X|U,Y).$$
Now, assume that we constrain the coding in such a manner that the encoder is able to produce an exact copy of the compressed source constructed by the decoder. This requirement, termed the common reconstruction (CR) constraint, was introduced and studied by Steinberg [68] for various source coding models, including the Wyner–Ziv setup, in the context of a "general distortion measure". For the Wyner–Ziv problem under the log-loss measure that is considered in this section, such a CR constraint causes some rate loss because the reproduction rule in Equation (41) is no longer possible. In fact, it is not difficult to see that under the CR constraint the above region reduces to the set of pairs (R, D) that satisfy
$$R \geq I(U;X|Y),$$
$$D \geq H(X|U),$$
for some auxiliary random variable U for which U − X − Y holds. Observe that Equation (43b) is equivalent to I(U;X) ≥ H(X) − D and that, for a given prescribed fidelity level D, the minimum rate is obtained for a description U that achieves the inequality in Equation (43b) with equality, i.e.,
$$R(D) = \min_{P_{U|X} \,:\, I(U;X) = H(X) - D} I(U;X|Y).$$
Because U − X − Y, we have
$$I(U;Y) = I(U;X) - I(U;X|Y).$$
Under the constraint I ( U ; X ) = H ( X ) D , it is easy to see that minimizing I ( U ; X | Y ) amounts to maximizing I ( U ; Y ) , an aspect which bridges the problem at hand with the IB problem.
In the above, the side information Y is used for binning but not for the estimation at the decoder. If the encoder ignores whether Y is present or not at the decoder side, the benefit of binning is reduced—see the Heegard–Berger model with common reconstruction studied in [69,70].

4.3. Information Combining

Consider again the IB problem. Assume one wishes to find the representation U that maximizes the relevance I ( U ; Y ) for a given prescribed complexity level, e.g., I ( U ; X ) = R . For this setup, we have
$$I(X; U, Y) = I(U;X) + I(Y;X) - I(U;Y)$$
$$= R + I(Y;X) - I(U;Y),$$
where the first equality holds since U − X − Y is a Markov Chain. Maximizing I(U;Y) is then equivalent to minimizing I(X; U, Y). This is reminiscent of the problem of information combining [71,72], where X can be interpreted as source information that is conveyed through two channels: the channel P_{Y|X} and the channel P_{U|X}. The outputs of these two channels are conditionally independent given X, and they should be processed in a manner such that, when combined, they preserve as much information as possible about X.

4.4. Wyner–Ahlswede–Korner Problem

Here, the two memoryless sources X and Y are encoded separately at rates R_X and R_Y, respectively. A decoder gets the two compressed streams and aims at recovering Y losslessly. This problem was studied and solved separately by Wyner [73] and by Ahlswede and Körner [74]. For given R_X = R, the minimum rate R_Y that is needed to recover Y losslessly is
$$R_Y(R) = \min_{P_{U|X} \,:\, I(U;X) \leq R} H(Y|U).$$
Thus, we get
$$\max_{P_{U|X} \,:\, I(U;X) \leq R} I(U;Y) = H(Y) - R_Y(R),$$
and therefore, solving the IB problem is equivalent to solving the Wyner–Ahlswede–Korner Problem.

4.5. The Privacy Funnel

Consider again the setting of Figure 4, and let us assume that the pair ( Y , X ) models data that a user possesses and which have the following properties: the data Y are some sensitive (private) data that are not meant to be revealed at all, or else not beyond some level Δ ; and the data X are non-private and are meant to be shared with another user (analyst). Because X and Y are correlated, sharing the non-private data X with the analyst possibly reveals information about Y. For this reason, there is a trade off between the amount of information that the user shares about X and the information that he keeps private about Y. The data X are passed through a randomized mapping ϕ whose purpose is to make U = ϕ ( X ) maximally informative about X while being minimally informative about Y.
The analyst performs an inference attack on the private data Y based on the disclosed information U. Let ℓ: 𝒴 × $\hat{\mathcal{Y}}$ → ℝ̄ be an arbitrary loss function with reconstruction alphabet $\hat{\mathcal{Y}}$ that measures the cost of inferring Y after observing U. Given (X, Y) ∼ P_{X,Y} and under the given loss function ℓ, it is natural to quantify the difference between the prediction losses in predicting Y ∈ 𝒴 prior to and after observing U = ϕ(X). Let
$$C(\ell, P) = \inf_{\hat{y} \in \hat{\mathcal{Y}}} \mathrm{E}_P\big[\ell(Y, \hat{y})\big] - \inf_{\hat{Y}(\phi(X))} \mathrm{E}_P\big[\ell(Y, \hat{Y})\big],$$
where ŷ ∈ $\hat{\mathcal{Y}}$ is deterministic and Ŷ(ϕ(X)) is any measurable function of U = ϕ(X). The quantity C(ℓ, P) quantifies the reduction in the prediction loss under the loss function ℓ that is due to observing U = ϕ(X), i.e., the inference cost gain. In [75] (see also [76]), it is shown that under some mild conditions the inference cost gain C(ℓ, P) as defined by Equation (49) is upper-bounded as
$$C(\ell, P) \leq 2\sqrt{2}\, L \sqrt{I(U;Y)},$$
where L is a constant. The inequality in Equation (50) holds irrespective of the choice of the loss function ℓ, and this justifies the usage of the logarithmic loss function as given by Equation (53) in the context of finding a suitable trade-off between utility and privacy, since
$$I(U;Y) = H(Y) - \inf_{\hat{Y}(U)} \mathrm{E}_P\big[\ell_{\log}(Y, \hat{Y})\big].$$
Under the logarithmic loss function, the design of the mapping U = ϕ(X) should strike the right balance between the utility for inferring the non-private data X, as measured by the mutual information I(U;X), and the privacy metric about the private data Y, as measured by the mutual information I(U;Y).

4.6. Efficiency of Investment Information

Let Y model stock market data and X some correlated information. In [77], Erkip and Cover investigated how the description of the correlated information X improves investment in the stock market Y. Specifically, let Δ(C) denote the maximum increase in growth rate when X is described to the investor at rate C. Erkip and Cover found a single-letter characterization of the incremental growth rate Δ(C). When specialized to the horse race market, this problem is related to the aforementioned source coding with side information of Wyner [73] and Ahlswede–Körner [74], and thus also to the IB problem. The work in [77] provides explicit analytic solutions for two horse race examples, jointly binary and jointly Gaussian horse races.

5. Connections to Inference and Representation Learning

In this section, we consider the connections of the IB problem with learning, inference and generalization, for which, typically, the joint distribution P X , Y of the data is not known and only a set of samples is available.

5.1. Inference Model

Let a measurable variable X ∈ 𝒳 and a target variable Y ∈ 𝒴 with unknown joint distribution P_{X,Y} be given. In the classic problem of statistical learning, one wishes to infer an accurate predictor of the target variable Y ∈ 𝒴 based on observed realizations of X ∈ 𝒳. That is, for a given class ℱ of admissible predictors ψ: 𝒳 → $\hat{\mathcal{Y}}$ and a loss function ℓ: 𝒴 × $\hat{\mathcal{Y}}$ → ℝ that measures discrepancies between true values and their estimated fits, one aims at finding the mapping ψ ∈ ℱ that minimizes the expected (population) risk
$$\mathcal{C}_{P_{X,Y}}(\psi, \ell) = \mathrm{E}_{P_{X,Y}}\big[\ell(Y, \psi(X))\big].$$
An abstract inference model is shown in Figure 5.
The choice of a “good” loss function ( · ) is often controversial in statistical learning theory. There is however numerical evidence that models that are trained to minimize the error’s entropy often outperform ones that are trained using other criteria such as mean-square error (MSE) and higher-order statistics [26,27]. This corresponds to choosing the loss function given by the logarithmic loss, which is defined as
$$\ell_{\log}(y, \hat{y}) := \log \frac{1}{\hat{y}(y)},$$
for y ∈ 𝒴, where ŷ ∈ 𝒫(𝒴) designates a probability distribution on 𝒴 and ŷ(y) is the value of that distribution evaluated at the outcome y ∈ 𝒴. Although a complete and rigorous justification of the usage of the logarithmic loss as a distortion measure in learning is still awaited, a partial explanation appeared recently in [30], where Painsky and Wornell showed that, for binary classification problems, minimizing the logarithmic loss actually minimizes an upper bound to any choice of loss function that is smooth, proper (i.e., unbiased and Fisher consistent), and convex. Along the same line of work, the authors of [29] showed that, under some natural data processing property, Shannon's mutual information uniquely quantifies the reduction of prediction risk due to side information. Perhaps this partially justifies why the logarithmic loss fidelity measure is widely used in learning theory and has already been adopted in many algorithms in practice, such as the infomax criterion [31], the tree-based algorithm of Quinlan [32], or the well known Chow–Liu algorithm [33] for learning tree graphical models, with various applications in genetics [34], image processing [35], computer vision [36], and others. The logarithmic loss measure also plays a central role in the theory of prediction [37] (Ch. 09), where it is often referred to as the self-information loss function, as well as in Bayesian modeling [38], where priors are usually designed to maximize the mutual information between the parameter to be estimated and the observations.
When the joint distribution P_{X,Y} is known, the optimal predictor and the minimum expected (population) risk can be characterized. Let, for every x ∈ 𝒳, ψ(x) = Q(·|x) ∈ 𝒫(𝒴). It is easy to see that
$$\begin{aligned}
\mathrm{E}_{P_{X,Y}}\big[\ell_{\log}(Y, Q)\big] &= \sum_{x \in \mathcal{X},\, y \in \mathcal{Y}} P_{X,Y}(x,y) \log \frac{1}{Q(y|x)} \\
&= \sum_{x \in \mathcal{X},\, y \in \mathcal{Y}} P_{X,Y}(x,y) \log \frac{1}{P_{Y|X}(y|x)} + \sum_{x \in \mathcal{X},\, y \in \mathcal{Y}} P_{X,Y}(x,y) \log \frac{P_{Y|X}(y|x)}{Q(y|x)} \\
&= H(Y|X) + D\big(P_{Y|X} \,\|\, Q\big) \\
&\geq H(Y|X),
\end{aligned}$$
with equality iff the predictor is given by the conditional posterior ψ(x) = P_{Y|X}(·|X = x). That is, the minimum expected (population) risk is given by
$$\min_{\psi} \; \mathcal{C}_{P_{X,Y}}(\psi, \ell_{\log}) = H(Y|X).$$
If the joint distribution P_{X,Y} is unknown, which is most often the case in practice, the population risk as given by Equation (56) cannot be computed directly, and, in the standard approach, one usually resorts to choosing the predictor with minimal risk on a training dataset consisting of n labeled samples {(x_i, y_i)}_{i=1}^n that are drawn independently from the unknown joint distribution P_{X,Y}. In this case, one is interested in optimizing the empirical risk, which, for a set of n i.i.d. samples 𝒟_n := {(x_i, y_i)}_{i=1}^n from P_{X,Y}, is defined as
$$\hat{\mathcal{C}}_{P_{X,Y}}(\psi, \ell, \mathcal{D}_n) = \frac{1}{n} \sum_{i=1}^{n} \ell(y_i, \psi(x_i)).$$
The difference between the empirical and population risks is normally measured in terms of the generalization gap, defined as
$$\mathrm{gen}_{P_{X,Y}}(\psi, \ell, \mathcal{D}_n) := \mathcal{C}_{P_{X,Y}}(\psi, \ell) - \hat{\mathcal{C}}_{P_{X,Y}}(\psi, \ell, \mathcal{D}_n).$$
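As a small illustration of these definitions under the logarithmic loss, the empirical risk and an estimate of the generalization gap can be computed as follows; here predict_proba is a hypothetical soft predictor returning a distribution over labels, standing in for ψ:

```python
import numpy as np

def empirical_log_loss(predict_proba, xs, ys):
    """Empirical risk under the logarithmic loss: (1/n) * sum_i -log q(y_i | x_i)."""
    probs = np.array([predict_proba(x)[y] for x, y in zip(xs, ys)])
    return float(np.mean(-np.log(probs + 1e-12)))

# The population risk is unknown when P_{X,Y} is unknown; in practice it is approximated on held-out
# samples drawn from the same distribution, giving an estimate of the generalization gap:
# gap_estimate = empirical_log_loss(predict_proba, x_test, y_test) \
#              - empirical_log_loss(predict_proba, x_train, y_train)
```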

5.2. Minimum Description Length

One popular approach to reducing the generalization gap is to restrict the set ℱ of admissible predictors to a low-complexity (or constrained-complexity) class to prevent over-fitting. One way to limit the model's complexity is to restrict the range of the prediction function, as shown in Figure 6. This is the so-called minimum description length complexity measure, often used in the learning literature to limit the description length of the weights of neural networks [78]. The connection between the use of minimum description complexity to limit the description length of the input encoding and accuracy is studied in [79]; the corresponding connection with respect to weight complexity and accuracy is given in [11]. Here, the stochastic mapping ϕ: 𝒳 → 𝒰 is a compressor with
$$\log \|\phi\| \leq R,$$
for some prescribed “input-complexity” value R, or equivalently prescribed average description-length.
Minimizing the constrained description length population risk is now equivalent to solving
$$\mathcal{C}^{\mathrm{DLC}}_{P_{X,Y}}(R) = \min_{\phi} \; \mathrm{E}_{P_{X,Y}}\big[\ell_{\log}\big(Y^n, \psi(U^n)\big)\big]$$
$$\text{s.t.} \quad \log \big\|\phi(X^n)\big\| \leq nR.$$
It can be shown that this problem takes its minimum value with the choice of ψ ( U ) = P Y | U and
$$\mathcal{C}^{\mathrm{DLC}}_{P_{X,Y}}(R) = \min_{P_{U|X}} \; H(Y|U) \quad \text{s.t.} \quad R \geq I(U;X).$$
The solution to Equation (61) for different values of R is effectively equivalent to the IB problem in Equation (4). Observe that the right-hand side of Equation (61) is larger for small values of R; it is clear that a good predictor ϕ should strike the right balance between reducing the model's complexity and reducing the error's entropy or, equivalently, maximizing the mutual information I(U;Y) about the target variable Y.

5.3. Generalization and Performance Bounds

The IB problem also appears in the fundamental performance limits of learning. In particular, when P_{X,Y} is unknown and instead n i.i.d. samples from P_{X,Y} are available, the optimization of the empirical risk in Equation (56) leads to a mismatch between the true loss, given by the population risk, and the empirical risk. This gap is measured by the generalization gap in Equation (57). Interestingly, the relationship between the true loss and the empirical loss can be bounded (with high probability) in terms of the IB problem as [80]
$$\mathcal{C}_{P_{X,Y}}(\psi, \ell_{\log}) = \hat{\mathcal{C}}_{P_{X,Y}}(\psi, \ell, \mathcal{D}_n) + \mathrm{gen}_{P_{X,Y}}(\psi, \ell, \mathcal{D}_n) \leq \underbrace{H_{\hat{P}^{(n)}_{X,Y}}(Y|U)}_{\hat{\mathcal{C}}_{P_{X,Y}}(\psi, \ell, \mathcal{D}_n)} + \underbrace{A \sqrt{\frac{I\big(\hat{P}^{(n)}_{X}; P_{U|X}\big) \cdot \log n}{n}} + B\, \frac{\Lambda\big(P_{U|X}, \hat{P}_{Y|U}, P_{\hat{Y}|U}\big)}{\sqrt{n}} + O\!\left(\sqrt{\frac{\log n}{n}}\right)}_{\text{bound on } \mathrm{gen}_{P_{X,Y}}(\psi, \ell, \mathcal{D}_n)},$$
where P̂_{U|X} and P̂_{Y|U} are the empirical encoder and decoder, P_{Ŷ|U} is the optimal decoder, H_{P̂^{(n)}_{X,Y}}(Y|U) and I(P̂^{(n)}_X; P_{U|X}) are the empirical loss and the empirical mutual information resulting from the dataset 𝒟_n, and Λ(P_{U|X}, P̂_{Y|U}, P_{Ŷ|U}) is a function that measures the mismatch between the optimal decoder and the empirical one.
This bound shows explicitly the trade-off between the empirical relevance and the empirical complexity. The pairs of relevance and complexity that are simultaneously achievable are precisely characterized by the IB problem. Therefore, by designing estimators based on the IB problem, as described in Section 3, one can operate at different regimes of performance, complexity, and generalization.
Another interesting connection between learning and the IB method comes from relating the logarithmic loss to common performance metrics in learning:
  • The logarithmic loss gives an upper bound on the probability of miss-classification, i.e., one minus the accuracy (a numerical check is given after this list):
    $$\epsilon_{Y|X}(Q_{\hat{Y}|X}) := 1 - \mathrm{E}_{P_{X,Y}}\big[Q_{\hat{Y}|X}(Y|X)\big] \;\leq\; 1 - \exp\!\Big(-\mathrm{E}_{P_{X,Y}}\big[\ell_{\log}(Y, Q_{\hat{Y}|X})\big]\Big).$$
  • The logarithmic-loss is equivalent to maximum likelihood for large n:
    $$ -\frac{1}{n}\log P_{Y^n|X^n}(y^n|x^n) = -\frac{1}{n}\sum_{i=1}^{n}\log P_{Y|X}(y_i|x_i) \;\xrightarrow[n\to\infty]{}\; \mathbb{E}_{X,Y}\big[-\log P_{Y|X}(Y|X)\big] $$
  • The true distribution P minimizes the expected logarithmic-loss:
    $$ P_{Y|X} = \arg\min_{Q_{\hat{Y}|X}} \mathbb{E}_{P}\Big[\log\frac{1}{Q_{\hat{Y}|X}(Y|X)}\Big] \quad\text{and}\quad \min_{Q_{\hat{Y}|X}} \mathbb{E}\big[\ell_{\log}(Y,Q_{\hat{Y}|X})\big] = H(Y|X) $$
Since for $n \to \infty$ the joint distribution P X Y can be learned perfectly, the link between these common criteria allows the use of the IB-problem to derive asymptotic performance bounds, as well as design criteria, in most learning scenarios of classification, regression, and inference.
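The following small numerical check (the toy joint pmf and the soft predictor Q are illustrative assumptions, not taken from the paper) verifies the first and third properties above on a discrete example.
```python
# Small numerical check (illustrative, toy P_{X,Y} and predictor Q are assumptions):
# (i) 1 - E[Q(Y|X)] <= 1 - exp(-E[log-loss]); (ii) the expected log-loss is minimized
# by Q = P_{Y|X}, where it equals H(Y|X).
import numpy as np

P_xy = np.array([[0.4, 0.1],
                 [0.1, 0.4]])              # joint pmf of (X, Y)
P_x = P_xy.sum(axis=1)
P_y_given_x = P_xy / P_x[:, None]          # true posterior P_{Y|X}

Q = np.array([[0.7, 0.3],                  # an arbitrary soft predictor Q_{Yhat|X}
              [0.4, 0.6]])

def expected_logloss(Q):
    return np.sum(P_xy * (-np.log(Q)))     # E_{P_{X,Y}}[-log Q(Y|X)], in nats

eps = 1.0 - np.sum(P_xy * Q)               # probability of misclassification
bound = 1.0 - np.exp(-expected_logloss(Q))
print(f"misclassification {eps:.3f} <= log-loss bound {bound:.3f}")

H_Y_given_X = expected_logloss(P_y_given_x)
print(f"E[log-loss] at Q: {expected_logloss(Q):.3f} nats >= H(Y|X) = {H_Y_given_X:.3f} nats")
```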

5.4. Representation Learning, Elbo and Autoencoders

The performance of machine learning algorithms depends strongly on the choice of data representation (or features) on which they are applied. For that reason, feature engineering, i.e., the set of all pre-processing operations and transformations applied to data with the aim of making them amenable to effective machine learning, is important. However, because it is both data- and task-dependent, such feature engineering is labor intensive and highlights one of the major weaknesses of current learning algorithms: their inability to extract discriminative information from the data itself rather than from hand-crafted transformations of it. In fact, although it may sometimes appear useful to deploy feature engineering in order to take advantage of human know-how and prior domain knowledge, it is highly desirable to make learning algorithms less dependent on feature engineering in order to make progress towards true artificial intelligence.
Representation learning is a sub-field of learning theory that aims at learning representations of the data that make it easier to extract useful information, possibly without recourse to any feature engineering. That is, the goal is to identify and disentangle the underlying explanatory factors that are hidden in the observed data. In the case of probabilistic models, a good representation is one that captures the posterior distribution of the underlying explanatory factors for the observed input. For related works, the reader may refer, e.g., to the proceedings of the International Conference on Learning Representations (ICLR), see https://iclr.cc/.
The use of Shannon’s mutual information as a measure of similarity is particularly suitable for the purpose of learning a good representation of data [81]. In particular, a popular approach to representation learning is that of autoencoders, in which a neural network architecture is designed with a bottleneck that forces a compressed knowledge representation of the original input; training is performed by optimizing the Evidence Lower Bound (ELBO), given as
$$ \mathcal{L}^{\mathrm{ELBO}}(\theta,\phi,\varphi) := \frac{1}{N}\sum_{i=1}^{N}\Big[\log Q_{\phi}(x_i|u_i) - D_{\mathrm{KL}}\big(P_{\theta}(U_i|x_i)\,\|\,Q_{\varphi}(U_i)\big)\Big], $$
over the neural network parameters θ , ϕ , φ . Note that this is precisely the variational-IB cost in Equation (32) for β = 1 and Y = X , i.e., the IB variational bound when particularized to distributions whose parameters are determined by neural networks. In addition, note that the architecture shown in Figure 3 is the classical neural network architecture for autoencoders, and that it coincides with the variational IB solution resulting from the optimization of the IB-problem in Section 3.3.1. Moreover, Equation (32) provides an operational meaning to the β -VAE cost [82], as a criterion to design estimators on the relevance–complexity plane for different β values, since the β -VAE cost is given as
$$ \mathcal{L}^{\beta\text{-VAE}}(\theta,\phi,\varphi) := \frac{1}{N}\sum_{i=1}^{N}\Big[\log Q_{\phi}(x_i|u_i) - \beta\, D_{\mathrm{KL}}\big(P_{\theta}(U_i|x_i)\,\|\,Q_{\varphi}(U_i)\big)\Big], $$
which coincides with the empirical version of the variational bound found in Equation (32).
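As an illustration of how such a cost is optimized in practice, the following sketch gives a minimal PyTorch implementation of the empirical β -VAE objective above, under the common modeling assumptions of a Gaussian encoder P θ ( U | x ) , a standard Normal prior Q φ ( U ) , and a Bernoulli decoder Q ϕ ( x | u ) ; the layer sizes, the input dimension, and the dummy batch are arbitrary choices.
```python
# Minimal sketch (assumptions: Gaussian encoder, standard-normal prior, Bernoulli decoder,
# 784-dim inputs, 16-dim bottleneck) of the empirical beta-VAE objective above; for
# beta = 1 it reduces to the ELBO.
import torch
import torch.nn as nn
import torch.nn.functional as F

class BetaVAE(nn.Module):
    def __init__(self, x_dim=784, u_dim=16, beta=1.0):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(x_dim, 256), nn.ReLU(), nn.Linear(256, 2 * u_dim))
        self.dec = nn.Sequential(nn.Linear(u_dim, 256), nn.ReLU(), nn.Linear(256, x_dim))
        self.beta = beta

    def forward(self, x):
        mu, logvar = self.enc(x).chunk(2, dim=-1)                   # parameters of P_theta(U|x)
        u = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)     # reparameterization trick
        logits = self.dec(u)                                        # parameters of Q_phi(x|u)
        rec = -F.binary_cross_entropy_with_logits(logits, x, reduction='none').sum(-1)
        # Closed-form KL( N(mu, diag(exp(logvar))) || N(0, I) )
        kl = 0.5 * (mu.pow(2) + logvar.exp() - 1.0 - logvar).sum(-1)
        return -(rec - self.beta * kl).mean()                       # negative objective, minimized

model = BetaVAE(beta=1.0)
x = torch.rand(32, 784)             # a dummy batch standing in for the data samples x_i
loss = model(x)
loss.backward()
print(float(loss))
```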

5.5. Robustness to Adversarial Attacks

Recent advances in deep learning have allowed the design of high-accuracy neural networks. However, it has been observed that the high accuracy of trained neural networks may be compromised under nearly imperceptible changes in the inputs [83,84,85]. The information bottleneck has also found applications in providing methods to improve robustness to adversarial attacks when training models. In particular, Alemi et al. [48] showed that neural networks trained with their variational IB method offer advantages for classification in terms of robustness to adversarial attacks. Recently, alternative strategies for extracting features in supervised learning were proposed in [86] to construct classifiers that are robust to small perturbations in the input space. Robustness is measured in terms of the (statistical) Fisher information, given for two random variables ( Y , Z ) as
$$ \Phi(Z|Y) = \mathbb{E}_{Y,Z}\Big[\big\|\nabla_{y}\log p(Z|Y)\big\|^{2}\Big]. $$
The method in [86] builds upon the idea of the information bottleneck by introducing an additional penalty term that encourages the Fisher information in Equation (64) of the extracted features, parametrized by the inputs, to be small. For this problem, under jointly Gaussian vector sources ( X , Y ) , the optimal representation is also shown to be Gaussian, in line with the results in Section 6.2.1 for the IB without robustness penalty. For general source distributions, a variational method is proposed, similar to the variational IB method in Section 3.3.1. The problem shows connections with the I-MMSE [87], the de Bruijn identity [88,89], the Cramér–Rao inequality [90], and Fano’s inequality [90].
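As a small sanity check of the definition in Equation (64), the following Monte Carlo sketch (which assumes, for illustration only, the scalar Gaussian channel Z = Y + N with N ∼ N ( 0 , σ 2 ) ) estimates Φ ( Z | Y ) and compares it with the closed-form value 1 / σ 2 for this channel.
```python
# Monte Carlo sanity check (illustrative assumption: scalar channel Z = Y + N with
# N ~ N(0, sigma^2)) of the Fisher information in Equation (64). For this channel the
# score is d/dy log p(z|y) = (z - y) / sigma^2, hence Phi(Z|Y) = 1 / sigma^2.
import numpy as np

rng = np.random.default_rng(0)
sigma2 = 0.5
y = rng.normal(size=200_000)                      # Y ~ N(0, 1)
z = y + rng.normal(scale=np.sqrt(sigma2), size=y.shape)

score = (z - y) / sigma2                          # gradient of log p(z|y) w.r.t. y
phi_mc = np.mean(score ** 2)                      # Monte Carlo estimate of Phi(Z|Y)
print(f"MC estimate {phi_mc:.3f}  vs  1/sigma^2 = {1.0 / sigma2:.3f}")
```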

6. Extensions: Distributed Information Bottleneck

Consider now a generalization of the IB problem in which the prediction is to be performed in a distributed manner. The model is shown in Figure 7. Here, the prediction of the target variable Y Y is to be performed on the basis of samples of statistically correlated random variables ( X 1 , , X K ) that are observed each at a distinct predictor. Throughout, we assume that the following Markov Chain holds for all k K : = { 1 , , K } ,
$$ X_k \;\text{--}\; Y \;\text{--}\; X_{\mathcal{K}/k}. $$
The variable Y is a target variable and we seek to characterize how accurately it can be predicted from a measurable random vector ( X 1 , , X K ) when the components of this vector are processed separately, each by a distinct encoder.

6.1. The Relevance–Complexity Region

The distributed IB problem of Figure 7 is studied in [91,92] from an information-theoretic viewpoint. For both discrete memoryless (DM) and memoryless vector Gaussian models, the authors established fundamental limits of learning in terms of optimal trade-offs between relevance and complexity, leveraging the connection between the IB-problem and source coding. The following theorem states the result for the case of discrete memoryless sources.
Theorem 1
([91,92]). The relevance–complexity region $\mathcal{RI}_{\mathrm{DIB}}$ of the distributed learning problem is given by the union of all non-negative tuples $(\Delta, R_1, \ldots, R_K) \in \mathbb{R}_{+}^{K+1}$ that satisfy
$$ \Delta \;\le\; \sum_{k\in\mathcal{S}} \big[R_k - I(X_k;U_k|Y,T)\big] + I(Y;U_{\mathcal{S}^c}|T), \qquad \forall\, \mathcal{S} \subseteq \mathcal{K}, $$
for some joint distribution of the form $P_T\, P_Y \prod_{k=1}^{K} P_{X_k|Y} \prod_{k=1}^{K} P_{U_k|X_k,T}$.
Proof. 
The proof of Theorem 1 can be found in Section 7.1 of [92] and is reproduced in Section 8.1 for completeness. □
For a given joint data distribution P X K , Y , Theorem 1 extends the single encoder IB principle of Tishby in Equation (3) to the distributed learning model with K encoders, which we denote by Distributed Information Bottleneck (DIB) problem. The result characterizes the optimal relevance–complexity trade-off as a region of achievable tuples ( Δ , R 1 , , R K ) in terms of a distributed representation learning problem involving the optimization over K conditional pmfs P U k | X k , T and a pmf P T . The pmfs P U k | X k , T correspond to stochastic encodings of the observation X k to a latent variable, or representation, U k which captures the relevant information of Y in observation X k . Variable T corresponds to a time-sharing among different encoding mappings (see, e.g., [51]). For such encoders, the optimal decoder is implicitly given by the conditional pmf of Y from U 1 , , U K , i.e., P Y | U K , T .
The characterization of the relevance–complexity region can be used to derive a cost function for the D-IB similarly to the IB-Lagrangian in Equation (3). For simplicity, let us consider the problem of maximizing the relevance under a sum-complexity constraint. Let R sum = k = 1 K R k and
$$ \mathcal{RI}_{\mathrm{DIB}}^{\mathrm{sum}} := \Big\{ (\Delta, R_{\mathrm{sum}}) \in \mathbb{R}_{+}^{2} : \exists\, (R_1,\ldots,R_K) \in \mathbb{R}_{+}^{K} \ \text{s.t.}\ \sum_{k=1}^{K} R_k = R_{\mathrm{sum}} \ \text{and}\ (\Delta, R_1, \ldots, R_K) \in \mathcal{RI}_{\mathrm{DIB}} \Big\}. $$
We define the DIB-Lagrangian (under sum-rate) as
$$ \mathcal{L}_{s}(\mathbf{P}) := -H(Y|U_{\mathcal{K}}) - s\sum_{k=1}^{K}\big[H(Y|U_k) + I(X_k;U_k)\big]. $$
The optimization of Equation (67) over the encoders P U k | X k , T yields mappings that operate on the boundary of the relevance–sum-complexity region RI DIB sum . To see this, note that the region RI DIB sum is composed of all pairs $(\Delta, R_{\mathrm{sum}}) \in \mathbb{R}_{+}^{2}$ for which $\Delta \le \Delta(R_{\mathrm{sum}}, P_{X_{\mathcal{K}},Y})$, with
$$ \Delta(R_{\mathrm{sum}}, P_{X_{\mathcal{K}},Y}) = \max_{\mathbf{P}} \; \min\Big\{ I(Y;U_{\mathcal{K}}),\; R_{\mathrm{sum}} - \sum_{k=1}^{K} I(X_k;U_k|Y) \Big\}, $$
where the maximization is over joint distributions that factorize as P Y k = 1 K P X k | Y k = 1 K P U k | X k . The pairs ( Δ , R sum ) that lie on the boundary of RI DIB sum can be characterized as given in the following proposition.
Proposition 1.
For every pair ( Δ , R sum ) R + 2 that lies on the boundary of the region RI DIB sum , there exists a parameter s 0 such that ( Δ , R sum ) = ( Δ s , R s ) , with
$$ \Delta_s = \frac{1}{1+s}\Big[(1+sK)\,H(Y) + s\,R_s + \max_{\mathbf{P}} \mathcal{L}_s(\mathbf{P})\Big], $$
$$ R_s = I(Y;U^{*}_{\mathcal{K}}) + \sum_{k=1}^{K}\big[I(X_k;U^{*}_k) - I(Y;U^{*}_k)\big], $$
where P * is the set of conditional pmfs P = { P U 1 | X 1 , , P U K | X K } that maximize the cost function in Equation (67).
Proof. 
The proof of Proposition 1 can be found in Section 7.3 of [92] and is reproduced here in Section 8.2 for completeness. □
The optimization of the distributed IB cost function in Equation (67) generalizes the centralized Tishby’s information bottleneck formulation in Equation (3) to the distributed learning setting. Note that for K = 1 the optimization in Equation (69) reduces to the single encoder cost in Equation (3) with a multiplier s / ( 1 + s ) .
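To illustrate the quantities entering the DIB-Lagrangian in Equation (67), the following sketch evaluates L s ( P ) for a toy discrete model with K = 2 encoders; the distributions P Y , P X k | Y , the stochastic encoders P U k | X k , and the value of s are assumptions chosen only for illustration.
```python
# Illustrative sketch (toy distributions are assumptions): evaluate the DIB-Lagrangian
# L_s(P) = -H(Y|U_1,U_2) - s * sum_k [H(Y|U_k) + I(X_k;U_k)] for a discrete model
# P_Y P_{X1|Y} P_{X2|Y} and two stochastic encoders P_{U_k|X_k}.
import numpy as np

def H(p):
    p = np.asarray(p).ravel()
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

P_y = np.array([0.5, 0.5])
P_x1_y = np.array([[0.9, 0.1], [0.2, 0.8]])       # rows indexed by y, columns by x1
P_x2_y = np.array([[0.8, 0.2], [0.3, 0.7]])
P_u1_x1 = np.array([[0.95, 0.05], [0.10, 0.90]])  # encoder 1: rows x1, columns u1
P_u2_x2 = np.array([[0.90, 0.10], [0.15, 0.85]])  # encoder 2: rows x2, columns u2

# Joint P(y, x1, x2, u1, u2); axes: 0=y, 1=x1, 2=x2, 3=u1, 4=u2 (U_k -- X_k -- Y Markov chains).
J = (P_y[:, None, None, None, None]
     * P_x1_y[:, :, None, None, None]
     * P_x2_y[:, None, :, None, None]
     * P_u1_x1[None, :, None, :, None]
     * P_u2_x2[None, None, :, None, :])

H_Y_U12 = H(J.sum(axis=(1, 2))) - H(J.sum(axis=(0, 1, 2)))          # H(Y|U1,U2)
H_Y_U1 = H(J.sum(axis=(1, 2, 4))) - H(J.sum(axis=(0, 1, 2, 4)))     # H(Y|U1)
H_Y_U2 = H(J.sum(axis=(1, 2, 3))) - H(J.sum(axis=(0, 1, 2, 3)))     # H(Y|U2)
I_X1U1 = H(J.sum(axis=(0, 2, 3, 4))) + H(J.sum(axis=(0, 1, 2, 4))) - H(J.sum(axis=(0, 2, 4)))
I_X2U2 = H(J.sum(axis=(0, 1, 3, 4))) + H(J.sum(axis=(0, 1, 2, 3))) - H(J.sum(axis=(0, 1, 3)))

s = 0.5
L = -H_Y_U12 - s * ((H_Y_U1 + I_X1U1) + (H_Y_U2 + I_X2U2))
print(f"DIB-Lagrangian L_s(P) = {L:.3f} bits (s = {s})")
```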

6.2. Solutions to the Distributed Information Bottleneck

The methods described in Section 3 can be extended to the distributed information bottleneck case in order to find the mappings P U 1 | X 1 , T , , P U K | X K , T in different scenarios.

6.2.1. Vector Gaussian Model

In this section, we show that for the jointly vector Gaussian data model it is enough to restrict to Gaussian auxiliaries ( U 1 , , U K ) in order to exhaust the entire relevance–complexity region. In addition, we provide an explicit analytical expression of this region. Let ( X 1 , , X K , Y ) be jointly Gaussian random vectors that satisfy the Markov Chain in Equation (83). Without loss of generality, let the target variable be a complex-valued, zero-mean multivariate Gaussian Y C n y with covariance matrix Σ y , i.e., Y CN ( y ; 0 , Σ y ) , and X k C n k given by
X k = H k Y + N k ,
where H k C n k × n y models the linear model connecting Y to the observation at encoder k and N k C n k is the noise vector at encoder k, assumed to be Gaussian with zero-mean, covariance matrix Σ k , and independent from all other noises and Y .
For the vector Gaussian model in Equation (71), the result of Theorem 1, which can be extended to continuous sources using standard techniques, characterizes the relevance–complexity region, which we denote hereafter as RI DIB G . The following theorem gives an explicit characterization of this region and also shows that, in order to exhaust it, it is enough to restrict to no time-sharing, i.e., $T = \emptyset$, and to multivariate Gaussian test channels
$$ \mathbf{U}_k = \mathbf{A}_k \mathbf{X}_k + \mathbf{Z}_k \;\sim\; \mathcal{CN}(\mathbf{u}_k;\, \mathbf{A}_k\mathbf{X}_k, \boldsymbol{\Sigma}_{z,k}), $$
where A k C n k × n k projects X k and Z k is a zero-mean Gaussian noise with covariance Σ z , k .
Theorem 2.
For the vector Gaussian data model, the relevance–complexity region $\mathcal{RI}_{\mathrm{DIB}}^{\mathrm{G}}$ is given by the union of all tuples $(\Delta, R_1, \ldots, R_K)$ that satisfy
$$ \Delta \;\le\; \sum_{k\in\mathcal{S}} \Big[ R_k + \log\Big|\mathbf{I} - \boldsymbol{\Sigma}_k^{1/2}\boldsymbol{\Omega}_k\boldsymbol{\Sigma}_k^{1/2}\Big| \Big] + \log\Big|\mathbf{I} + \sum_{k\in\mathcal{S}^c} \boldsymbol{\Sigma}_{\mathbf{y}}^{1/2}\mathbf{H}_k^{\dagger}\boldsymbol{\Omega}_k\mathbf{H}_k\boldsymbol{\Sigma}_{\mathbf{y}}^{1/2}\Big|, \qquad \forall\, \mathcal{S}\subseteq\mathcal{K}, $$
for some matrices $\mathbf{0} \preceq \boldsymbol{\Omega}_k \preceq \boldsymbol{\Sigma}_k^{-1}$.
Proof. 
The proof of Theorem 2 can be found in Section 7.5 of [92] and is reproduced here in Section 8.4 for completeness. □
Theorem 2 extends the result of [54,93] on the relevance–complexity trade-off characterization of the single-encoder IB problem for jointly Gaussian sources to K encoders. The theorem also shows that the optimal test channels P U k | X k are multivariate Gaussian, as given by Equation (72).
Consider the following symmetric distributed scalar Gaussian setting, in which $Y \sim \mathcal{N}(0,1)$ and
$$ X_1 = \sqrt{\mathrm{snr}}\; Y + N_1, $$
$$ X_2 = \sqrt{\mathrm{snr}}\; Y + N_2, $$
where N 1 and N 2 are standard Gaussian noises with zero mean and unit variance, both independent of Y. In this case, for I ( U 1 ; X 1 ) = R and I ( U 2 ; X 2 ) = R , the optimal relevance is
$$ \Delta(R,\mathrm{snr}) = \frac{1}{2}\log\!\bigg( 1 + 2\,\mathrm{snr}\,\exp(-4R)\Big[\exp(4R) + \mathrm{snr} - \sqrt{\mathrm{snr}^2 + (1+2\,\mathrm{snr})\exp(4R)}\Big] \bigg). $$
An easy upper bound on the relevance can be obtained by assuming that X 1 and X 2 are encoded jointly at rate 2 R , to get
$$ \Delta_{\mathrm{ub}}(R,\mathrm{snr}) = \frac{1}{2}\log(1+2\,\mathrm{snr}) - \frac{1}{2}\log\!\big(1 + 2\,\mathrm{snr}\,\exp(-4R)\big). $$
The reader may notice that, if X 1 and X 2 are encoded independently, an achievable relevance level is given by
$$ \Delta_{\mathrm{lb}}(R,\mathrm{snr}) = \frac{1}{2}\log\!\big(1 + 2\,\mathrm{snr} - \mathrm{snr}\,\exp(-2R)\big) - \frac{1}{2}\log\!\big(1 + \mathrm{snr}\,\exp(-2R)\big). $$
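The three expressions above are easily compared numerically. The following minimal sketch (the snr value and the rate grid are arbitrary choices) evaluates Δ , Δ ub and Δ lb in nats and illustrates that Δ lb ≤ Δ ≤ Δ ub .
```python
# Numerical comparison (illustrative) of the optimal relevance and the two bounds above
# for the symmetric scalar Gaussian example, as functions of the per-encoder rate R (nats).
import numpy as np

def delta_opt(R, snr):
    a = np.exp(4.0 * R)
    inner = a + snr - np.sqrt(snr**2 + (1.0 + 2.0 * snr) * a)
    return 0.5 * np.log(1.0 + 2.0 * snr * np.exp(-4.0 * R) * inner)

def delta_ub(R, snr):
    return 0.5 * np.log(1.0 + 2.0 * snr) - 0.5 * np.log(1.0 + 2.0 * snr * np.exp(-4.0 * R))

def delta_lb(R, snr):
    return (0.5 * np.log(1.0 + 2.0 * snr - snr * np.exp(-2.0 * R))
            - 0.5 * np.log(1.0 + snr * np.exp(-2.0 * R)))

snr = 10.0
for R in [0.25, 0.5, 1.0, 2.0, 4.0]:
    print(f"R={R:4.2f}  lb={delta_lb(R, snr):.3f}  opt={delta_opt(R, snr):.3f}  "
          f"ub={delta_ub(R, snr):.3f}")
```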

6.3. Solutions for Generic Distributions

Next, we present how the distributed information bottleneck can be solved for generic distributions. Similar to the case of single encoder IB-problem, the solutions are based on a variational bound on the DIB-Lagrangian. For simplicity, we look at the D-IB under sum-rate constraint [92].

6.4. A Variational Bound

The optimization of Equation (67) generally requires computing marginal distributions that involve the descriptions U 1 , , U K , which might not be possible in practice. In what follows, we derive a variational lower bound $\mathcal{L}^{\mathrm{VB}}_{s}(\mathbf{P},\mathbf{Q})$ on the DIB cost function $\mathcal{L}_{s}(\mathbf{P})$ in terms of families of stochastic mappings Q Y | U 1 , , U K (a decoder), { Q Y | U k } k = 1 K and priors { Q U k } k = 1 K . For simplicity of notation, we let
Q : = { Q Y | U 1 , , U K , Q Y | U 1 , , Q Y | U K , Q U 1 , , Q U K } .
The variational D-IB cost for the DIB-problem is given by
$$ \mathcal{L}^{\mathrm{VB}}_{s}(\mathbf{P},\mathbf{Q}) := \underbrace{\mathbb{E}\big[\log Q_{Y|U_{\mathcal{K}}}(Y|U_{\mathcal{K}})\big]}_{\text{av. logarithmic loss}} + s\sum_{k=1}^{K}\Big( \underbrace{\mathbb{E}\big[\log Q_{Y|U_k}(Y|U_k)\big] - D_{\mathrm{KL}}\big(P_{U_k|X_k}\,\|\,Q_{U_k}\big)}_{\text{regularizer}} \Big). $$
Lemma 1.
For fixed P , we have
$$ \mathcal{L}_{s}(\mathbf{P}) \;\ge\; \mathcal{L}^{\mathrm{VB}}_{s}(\mathbf{P},\mathbf{Q}), \quad \text{for all pmfs } \mathbf{Q}. $$
In addition, there exists a unique Q that achieves the maximum max Q L s VB ( P , Q ) = L s ( P ) , and is given by, k K ,
$$ Q^{*}_{U_k} = P_{U_k}, \qquad Q^{*}_{Y|U_k} = P_{Y|U_k}, \qquad Q^{*}_{Y|U_1,\ldots,U_K} = P_{Y|U_1,\ldots,U_K}, $$
where the marginals P U k and the conditional marginals P Y | U k and P Y | U 1 , , U K are computed from P .
Proof. 
The proof of Lemma 1 can be found in Section 7.4 of [92] and is reproduced here in Section 8.3 for completeness. □
Then, the optimization in Equation (69) can be written in terms of the variational DIB cost function as follows,
max P L s ( P ) = max P max Q L s VB ( P , Q ) .
The variational DIB cost in Equation (78) is a generalization, to distributed learning with K encoders, of the evidence lower bound (ELBO) of the target variable Y given the representations U 1 , , U K [15]. If Y = ( X 1 , , X K ) , the bound generalizes the ELBO used for VAEs to the setting of K ≥ 2 encoders. In addition, note that Equation (78) also generalizes and provides an operational meaning to the β -VAE cost [82] with β = s / ( 1 + s ) , as a criterion to design estimators on the relevance–complexity plane for different β values.

6.5. Known Memoryless Distributions

When the data model is discrete and the joint distribution P X , Y is known, the DIB problem can be solved by using an iterative method that optimizes the variational IB cost function in Equation (81) by alternating over the distributions P , Q . The optimal encoders and decoders of the D-IB under a sum-rate constraint satisfy the following self-consistent equations,
$$ p(u_k|x_k) = \frac{p(u_k)}{Z(s,x_k)} \exp\!\big(-\psi_s(u_k,x_k)\big), \qquad p(y|u_k) = \sum_{x_k\in\mathcal{X}_k} p(x_k|u_k)\, p(y|x_k), \qquad p(y|u_1,\ldots,u_K) = \sum_{x_{\mathcal{K}}\in\mathcal{X}_{\mathcal{K}}} \frac{p(x_{\mathcal{K}})\, p(u_{\mathcal{K}}|x_{\mathcal{K}})\, p(y|x_{\mathcal{K}})}{p(u_{\mathcal{K}})}, $$
where $Z(s,x_k)$ is a normalization term and $\psi_s(u_k,x_k) := D_{\mathrm{KL}}\big(P_{Y|x_k}\,\|\,Q_{Y|u_k}\big) + \frac{1}{s}\,\mathbb{E}_{U_{\mathcal{K}\setminus k}|x_k}\big[ D_{\mathrm{KL}}\big(P_{Y|U_{\mathcal{K}\setminus k},x_k}\,\|\,Q_{Y|U_{\mathcal{K}\setminus k},u_k}\big)\big]$.
Alternating iterations of these equations converge to a solution for any initialization of p ( u k | x k ) , similarly to the Blahut–Arimoto and EM algorithms.
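For intuition about how such alternating iterations behave, the following sketch implements the classical single-encoder Blahut–Arimoto-type IB iteration that the above equations generalize, i.e., alternating p ( u | x ) ∝ p ( u ) exp ( − β D KL ( P Y | x ‖ P Y | u ) ) with the updates of p ( u ) and p ( y | u ) ; the toy joint distribution, the cardinality of U , and the multiplier β are assumptions made for illustration.
```python
# Minimal sketch (toy P_{X,Y}, |U| and beta are assumptions) of the classical single-encoder
# Blahut-Arimoto-type IB iteration that the self-consistent equations above generalize.
import numpy as np

rng = np.random.default_rng(1)
P_xy = np.array([[0.35, 0.05],
                 [0.05, 0.35],
                 [0.10, 0.10]])              # toy joint pmf, |X| = 3, |Y| = 2
P_x = P_xy.sum(axis=1)
P_y_x = P_xy / P_x[:, None]

nU, beta = 2, 5.0
P_u_x = rng.random((3, nU))
P_u_x /= P_u_x.sum(axis=1, keepdims=True)    # random initialization of p(u|x)

for _ in range(200):
    P_u = P_x @ P_u_x                                          # p(u) = sum_x p(x) p(u|x)
    P_y_u = (P_u_x * P_x[:, None]).T @ P_y_x / P_u[:, None]    # p(y|u) = sum_x p(x|u) p(y|x)
    # KL( p(y|x) || p(y|u) ) for every (x, u) pair
    kl = np.array([[np.sum(P_y_x[x] * np.log(P_y_x[x] / P_y_u[u])) for u in range(nU)]
                   for x in range(3)])
    P_u_x = P_u[None, :] * np.exp(-beta * kl)
    P_u_x /= P_u_x.sum(axis=1, keepdims=True)                  # normalization term Z(beta, x)

print(np.round(P_u_x, 3))                                      # converged encoder p(u|x)
```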

6.5.1. Distributed Variational IB

When the data distribution is unknown and only data samples are available, the variational DIB cost in Equation (81) can be optimized following similar steps as for the variational IB in Section 3.3.1 by parameterizing the encoding and decoding distributions P , Q using a family of distributions whose parameters are determined by DNNs. This allows us to formulate Equation (81) in terms of the DNN parameters, i.e., its weights, and optimize it by using the reparameterization trick [15], Monte Carlo sampling, and stochastic gradient descent (SGD)-type algorithms.
Considering encoders and decoders P , Q parameterized by DNN parameters θ , ϕ , φ , the DIB cost in Equation (81) can be optimized by considering the following empirical Monte Carlo approximation:
$$ \max_{\boldsymbol{\theta},\boldsymbol{\phi},\boldsymbol{\varphi}} \; \frac{1}{n}\sum_{i=1}^{n} \bigg[ \log Q_{\phi_{\mathcal{K}}}(y_i|u_{1,i,j},\ldots,u_{K,i,j}) + s\sum_{k=1}^{K}\Big( \log Q_{\phi_k}(y_i|u_{k,i,j}) - D_{\mathrm{KL}}\big(P_{\theta_k}(U_{k,i}|x_{k,i})\,\|\,Q_{\varphi_k}(U_{k,i})\big) \Big) \bigg], $$
where $u_{k,i,j} = g_{\phi_k}(x_{k,i}, z_{k,j})$ are samples obtained via the reparameterization trick by sampling the K auxiliary random variables $Z_k \sim P_{Z_k}$. The details of the method can be found in [92]. The resulting architecture is shown in Figure 8. This architecture generalizes the autoencoder architecture to the distributed setup with K encoders.
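A minimal sketch of the resulting training objective, under the assumptions of K = 2 encoders, Gaussian encoding distributions with standard Normal priors, categorical decoders, and arbitrary layer sizes, is given below; it assembles the joint decoder term, the per-encoder decoder terms, and the per-encoder KL regularizers as in the Monte Carlo approximation above (one sample per data point).
```python
# Minimal sketch (architecture sizes, K = 2, and Gaussian encoders/priors are assumptions)
# of the empirical distributed variational IB objective: joint decoder over (u_1, u_2),
# per-encoder decoders, and one KL regularizer per encoder.
import torch
import torch.nn as nn
import torch.nn.functional as F

x_dim, u_dim, n_classes, s = 20, 8, 4, 0.5
enc = nn.ModuleList([nn.Linear(x_dim, 2 * u_dim) for _ in range(2)])     # P_{theta_k}(U_k|x_k)
dec_joint = nn.Linear(2 * u_dim, n_classes)                               # Q_{phi_K}(y|u_1,u_2)
dec_k = nn.ModuleList([nn.Linear(u_dim, n_classes) for _ in range(2)])    # Q_{phi_k}(y|u_k)

def dvib_loss(x1, x2, y):
    us, kls = [], []
    for k, xk in enumerate([x1, x2]):
        mu, logvar = enc[k](xk).chunk(2, dim=-1)
        us.append(mu + torch.randn_like(mu) * torch.exp(0.5 * logvar))      # reparameterization
        kls.append(0.5 * (mu.pow(2) + logvar.exp() - 1.0 - logvar).sum(-1)) # KL to N(0, I) prior
    ce_joint = F.cross_entropy(dec_joint(torch.cat(us, dim=-1)), y, reduction='none')
    side = sum(F.cross_entropy(dec_k[k](us[k]), y, reduction='none') + kls[k] for k in range(2))
    return (ce_joint + s * side).mean()        # negative of the objective, to be minimized

x1, x2 = torch.randn(16, x_dim), torch.randn(16, x_dim)
y = torch.randint(0, n_classes, (16,))
loss = dvib_loss(x1, x2, y)
loss.backward()
print(float(loss))
```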

6.6. Connections to Coding Problems and Learning

Similar to the point-to-point IB-problem, the distributed IB problem also has abundant connections with (asymptotic) coding and learning problems.

6.6.1. Distributed Source Coding under Logarithmic Loss

A key element in the proof of the converse part of Theorem 3 is the connection with the Chief Executive Officer (CEO) source coding problem. For the case of K ≥ 2 encoders, while the characterization of the optimal rate-distortion region of this problem for general distortion measures has eluded information theorists for more than four decades, a characterization of the optimal region in the case of the logarithmic loss distortion measure has been provided recently in [65]. A key step in [65] is that the log-loss distortion measure admits a lower bound in the form of the entropy of the source conditioned on the decoder’s input. Leveraging this result, in the converse proof of Theorem 3 we derive a single-letter upper bound on the entropy of the channel inputs conditioned on the indices J K that are sent by the relays, in the absence of knowledge of the codebook indices F L . In addition, the rate region of the vector Gaussian CEO problem under the logarithmic loss distortion measure has been found recently in [94,95].

6.6.2. Cloud RAN

Consider the discrete memoryless (DM) CRAN model shown in Figure 9. In this model, L users communicate with a common destination or central processor (CP) through K relay nodes, where L ≥ 1 and K ≥ 1 . Relay node k, 1 ≤ k ≤ K , is connected to the CP via an error-free finite-rate fronthaul link of capacity C k . In what follows, we let L : = [ 1 : L ] and K : = [ 1 : K ] indicate the set of users and relays, respectively. Similar to Simeone et al. [96], the relay nodes are constrained to operate without knowledge of the users’ codebooks and only know a time-sharing sequence Q n , i.e., a set of time instants at which users switch among different codebooks. The obliviousness of the relay nodes to the actual codebooks of the users is modeled via the notion of randomized encoding [97,98]. That is, users or transmitters select their codebooks at random and the relay nodes are not informed about the currently selected codebooks, while the CP is given such information.
Consider the following class of DM CRANs in which the channel outputs at the relay nodes are independent conditionally on the users’ inputs. That is, for all k K and all i [ 1 : n ] ,
$$ Y_{k,i} \;\text{--}\; X_{\mathcal{L},i} \;\text{--}\; Y_{\mathcal{K}/k,i} $$
forms a Markov Chain in this order.
The following theorem provides a characterization of the capacity region of this class of DM CRAN problem under oblivious relaying.
Theorem 3
([22,23]). For the class of DM CRANs with oblivious relay processing and enabled time-sharing for which Equation (83) holds, the capacity region C ( C K ) is given by the union of all rate tuples ( R 1 , , R L ) which satisfy
$$ \sum_{t\in\mathcal{T}} R_t \;\le\; \sum_{s\in\mathcal{S}} \big[C_s - I(Y_s;U_s|X_{\mathcal{L}},Q)\big] + I(X_{\mathcal{T}};U_{\mathcal{S}^c}|X_{\mathcal{T}^c},Q), $$
for all non-empty subsets $\mathcal{T}\subseteq\mathcal{L}$ and all $\mathcal{S}\subseteq\mathcal{K}$, for some joint measure of the form
$$ p(q)\prod_{l=1}^{L} p(x_l|q)\prod_{k=1}^{K} p(y_k|x_{\mathcal{L}})\prod_{k=1}^{K} p(u_k|y_k,q). $$
The direct part of Theorem 3 can be obtained by a coding scheme in which each relay node compresses its channel output by using Wyner–Ziv binning to exploit the correlation with the channel outputs at the other relays, and forwards the bin index to the CP over its rate-limited link. The CP jointly decodes the compression indices (within the corresponding bins) and the transmitted messages, i.e., Cover-El Gamal compress-and-forward [99] (Theorem 3) with joint decompression and decoding (CF-JD). Alternatively, the rate region of Theorem 3 can also be obtained by a direct application of the noisy network coding (NNC) scheme of [64] (Theorem 1).
The connection between this problem, source coding and the distributed information bottleneck is discussed in [22,23], particularly in the derivation of the converse part of the theorem. Note also the similarity between the resulting capacity region in Theorem 3 and the relevance complexity region of the distributed information bottleneck in Theorem 1, despite the significant differences of the setups.

6.6.3. Distributed Inference, ELBO and Multi-View Learning

In many data analytics problems, data are collected from various sources of information or feature extractors and are intrinsically heterogeneous. For example, an image can be identified by its color or texture features, and a document may contain text and images. Conventional machine learning approaches concatenate all available data into one big row vector (or matrix) on which a suitable algorithm is then applied. Treating different observations as a single source might cause over-fitting and is not physically meaningful, because each group of data may have different statistical properties. Alternatively, one may partition the data into groups according to sample homogeneity and regard each group of data as a separate view. This paradigm, termed multi-view learning [100], has received growing interest, and various algorithms exist, sometimes under names such as co-training [101,102,103,104], multiple kernel learning [104], and subspace learning [105]. By using distinct encoder mappings to represent distinct groups of data, and jointly optimizing over all mappings to remove redundancy, multi-view learning offers a degree of flexibility that is not only desirable in practice but is also likely to result in better learning capability. Indeed, as shown in [106], local learning algorithms produce fewer errors than global ones. Viewing the problem as one of function approximation, the intuition is that it is usually not easy to find a single function that has good predictive properties over the entire data space.
Besides, the distributed learning of Figure 7 clearly finds application in all those scenarios in which learning is performed collaboratively but distinct learners either only access subsets of the entire dataset (e.g., due to physical constraints) or access independent noisy versions of the entire dataset.
In addition, similar to the single-encoder case, the distributed IB also finds applications in establishing fundamental performance limits and in formulating cost functions from an operational point of view. One such example is the generalization of the commonly used ELBO, given in Equation (62), to the setup with K views or observations, as formulated in Equation (78). Similarly, from the formulation of the DIB problem, a natural generalization of classical autoencoders emerges, as given in Figure 8.

7. Outlook

A variant of the bottleneck problem in which the encoder’s output is constrained in terms of its entropy, rather than its mutual information with the encoder’s input as done originally in [1], was considered in [107]. The solution of this problem turns out to be a deterministic encoder map as opposed to the stochastic encoder map that is optimal under the IB framework of Tishby et al. [1], which results in a reduction of the algorithm’s complexity. This idea was then used and extended to the case of available resource (or time) sharing in [108].
In the context of privacy against inference attacks [109], the authors of [75,76] considered a dual of the information bottleneck problem in which $Y \in \mathcal{Y}$ represents some private data that are correlated with the non-private data $X \in \mathcal{X}$. A legitimate receiver (analyst) wishes to infer as much information as possible about the non-private data X but should not be able to infer information about the private data Y. Because X and Y are correlated, sharing the non-private data X with the analyst possibly reveals information about Y. For this reason, there is a trade-off between the amount of information that the user shares about X, as measured by the mutual information I ( U ; X ) , and the information that it keeps private about Y, as measured by the mutual information I ( U ; Y ) , where U = ϕ ( X ) .
Among the interesting problems left unaddressed in this paper is that of characterizing the optimal input distributions under rate-constrained compression at the relays, where, e.g., discrete signaling is already known to sometimes outperform Gaussian signaling for the single-user Gaussian CRAN [97]. It is conjectured that the optimal input distribution is discrete. Other issues relate to extensions to continuous-time filtered Gaussian channels, in parallel to the regular bottleneck problem [108], or to settings in which fronthaul links may not be available at some radio units, a fact that is unknown to the system; that is, the more radio units are connected to the central unit, the higher the rate that can be conveyed over the CRAN uplink [110]. Alternatively, one may consider finding the worst-case noise under given input distributions, e.g., Gaussian, and rate-constrained compression at the relays. Furthermore, there are interesting aspects that address processing constraints on continuous waveforms, e.g., sampling at a given rate [111,112] with focus on remote logarithmic distortion [65], which in turn boils down to the distributed bottleneck problem [91,92]. We also mention finite-sample-size analysis (i.e., finite block length n, which relates to the literature on finite-block-length coding in information theory). Finally, it is interesting to observe that the bottleneck problem leads to interesting questions also when R is not necessarily scaled with the block length n.

8. Proofs

8.1. Proof of Theorem 1

The proof relies on the equivalence of the studied distributed learning problem with the Chief-Executive Officer (CEO) problem under the logarithmic-loss distortion measure, which was studied in [65] (Theorem 10). For the K-encoder CEO problem, let us consider K encoding functions $\phi_k : \mathcal{X}_k^n \rightarrow \mathcal{M}_k^{(n)}$ satisfying $nR_k \ge \log|\phi_k(\mathcal{X}_k^n)|$, and a decoding function $\tilde{\psi} : \mathcal{M}_1^{(n)}\times\cdots\times\mathcal{M}_K^{(n)} \rightarrow \hat{\mathcal{Y}}^n$, which produces a probabilistic estimate of Y from the outputs of the encoders, i.e., $\hat{\mathcal{Y}}$ is the set of distributions on $\mathcal{Y}$. The quality of the estimation is measured in terms of the average log-loss.
Definition 1.
A tuple ( D , R 1 , , R K ) is said to be achievable in the K-encoder CEO problem for P X K , Y for which the Markov Chain in Equation (83) holds, if there exists a length n, encoders ϕ k for k K , and a decoder ψ ˜ , such that
$$ D \;\ge\; \mathbb{E}\left[ \frac{1}{n}\log\frac{1}{\hat{P}_{Y^n|J_{\mathcal{K}}}\big(Y^n\,\big|\,\phi_1(X_1^n),\ldots,\phi_K(X_K^n)\big)} \right], $$
$$ R_k \;\ge\; \frac{1}{n}\log\big|\phi_k(\mathcal{X}_k^n)\big| \quad \text{for all } k\in\mathcal{K}. $$
The rate-distortion region RD CEO is given by the closure of all achievable tuples ( D , R 1 , , R K ) .
The following lemma shows that the minimum average logarithmic loss is the conditional entropy of Y given the descriptions. The result is essentially equivalent to [65] (Lemma 1) and it is provided for completeness.
Lemma 2.
Let us consider P X K , Y and the encoders J k = ϕ k ( X k n ) , k K and the decoder Y ^ n = ψ ˜ ( J K ) . Then,
$$ \mathbb{E}\big[\ell_{\log}(Y^n,\hat{Y}^n)\big] \;\ge\; H(Y^n|J_{\mathcal{K}}), $$
with equality if and only if ψ ˜ ( J K ) = { P Y n | J K ( y n | J K ) } y n Y n .
Proof. 
Let Z : = ( J 1 , , J K ) be the argument of ψ ˜ and P ^ ( y n | z ) be a distribution on Y n . We have for Z = z :
$$\begin{aligned} \mathbb{E}\big[\ell_{\log}(Y^n,\hat{Y}^n)\,\big|\,Z=z\big] &= \sum_{y^n\in\mathcal{Y}^n} P(y^n|z)\log\frac{1}{\hat{P}(y^n|z)} \\ &= \sum_{y^n\in\mathcal{Y}^n} P(y^n|z)\log\frac{P(y^n|z)}{\hat{P}(y^n|z)} + H(Y^n|Z=z) \\ &= D_{\mathrm{KL}}\big(P(y^n|z)\,\|\,\hat{P}(y^n|z)\big) + H(Y^n|Z=z) \\ &\ge H(Y^n|Z=z), \end{aligned}$$
where Equation (91) is due to the non-negativity of the KL divergence, and the equality holds if and only if $\hat{P}(y^n|z) = P(y^n|z)$, where $P(y^n|z) = \Pr\{Y^n = y^n \,|\, Z = z\}$, for all z and $y^n\in\mathcal{Y}^n$. Averaging over Z completes the proof. □
Essentially, Lemma 2 states that minimizing the average log-loss is equivalent to maximizing the relevance, as given by the mutual information $I\big(Y^n;\, \psi(\phi_1(X_1^n),\ldots,\phi_K(X_K^n))\big)$. Formally, the connection between the distributed learning problem under study and the K-encoder CEO problem studied in [65] can be formulated as stated next.
Proposition 2.
A tuple $(\Delta, R_1, \ldots, R_K) \in \mathcal{RI}_{\mathrm{DIB}}$ if and only if $(H(Y)-\Delta, R_1, \ldots, R_K) \in \mathcal{RD}_{\mathrm{CEO}}$.
Proof. 
Let the tuple $(\Delta, R_1, \ldots, R_K) \in \mathcal{RI}_{\mathrm{DIB}}$ be achievable for some encoders $\phi_k$. It follows by Lemma 2 that, by letting the decoding function $\tilde{\psi}(J_{\mathcal{K}}) = \{P_{Y^n|J_{\mathcal{K}}}(y^n|J_{\mathcal{K}})\}$, we have $\mathbb{E}[\ell_{\log}(Y^n,\hat{Y}^n)|J_{\mathcal{K}}] = H(Y^n|J_{\mathcal{K}})$, and hence $(H(Y)-\Delta, R_1, \ldots, R_K) \in \mathcal{RD}_{\mathrm{CEO}}$.
Conversely, assume the tuple $(D, R_1, \ldots, R_K) \in \mathcal{RD}_{\mathrm{CEO}}$ is achievable. It follows by Lemma 2 that $H(Y) - D \le H(Y^n) - H(Y^n|J_{\mathcal{K}}) = I(Y^n;J_{\mathcal{K}})$, which implies $(\Delta, R_1, \ldots, R_K) \in \mathcal{RI}_{\mathrm{DIB}}$ with $\Delta = H(Y) - D$. □
The characterization of rate-distortion region R CEO has been established recently in [65] (Theorem 10). The proof of the theorem is completed by noting that Proposition 2 implies that the result in [65] (Theorem 10) can be applied to characterize the region RI DIB , as given in Theorem 1.

8.2. Proof of Proposition 1

Let $\mathbf{P}^*$ be the maximizer of the optimization in Equation (69). Then,
$$\begin{aligned} (1+s)\Delta_s &= (1+sK)H(Y) + sR_s + \mathcal{L}_s(\mathbf{P}^*) \\ &= (1+sK)H(Y) + sR_s - H(Y|U^*_{\mathcal{K}}) - s\sum_{k=1}^{K}\big[H(Y|U^*_k) + I(X_k;U^*_k)\big] \\ &= (1+sK)H(Y) + sR_s - H(Y|U^*_{\mathcal{K}}) - s\big(R_s - I(Y;U^*_{\mathcal{K}}) + KH(Y)\big) \\ &= (1+s)\,I(Y;U^*_{\mathcal{K}}) \\ &\le (1+s)\,\Delta(R_s, P_{X_{\mathcal{K}},Y}), \end{aligned}$$
where Equation (94) is due to the definition of $\mathcal{L}_s(\mathbf{P})$ in Equation (67); Equation (95) holds since $\sum_{k=1}^{K}\big[I(X_k;U^*_k) + H(Y|U^*_k)\big] = R_s - I(Y;U^*_{\mathcal{K}}) + KH(Y)$ by Equation (70); and Equation (96) follows by using Equation (68).
Conversely, if $\mathbf{P}^*$ is the solution to the maximization in the function $\Delta(R_{\mathrm{sum}}, P_{X_{\mathcal{K}},Y})$ in Equation (68) such that $\Delta(R_{\mathrm{sum}}, P_{X_{\mathcal{K}},Y}) = \Delta_s$, then $\Delta_s \le I(Y;U^*_{\mathcal{K}})$ and $\Delta_s \le R_{\mathrm{sum}} - \sum_{k=1}^{K} I(X_k;U^*_k|Y)$, and we have, for any $s \ge 0$, that
$$\begin{aligned} \Delta(R_{\mathrm{sum}}, P_{X_{\mathcal{K}},Y}) &= \Delta_s \\ &\le \Delta_s - \big(\Delta_s - I(Y;U^*_{\mathcal{K}})\big) - s\Big(\Delta_s - R_{\mathrm{sum}} + \sum_{k=1}^{K} I(X_k;U^*_k|Y)\Big) \\ &= I(Y;U^*_{\mathcal{K}}) - s\Delta_s + sR_{\mathrm{sum}} - s\sum_{k=1}^{K} I(X_k;U^*_k|Y) \\ &= H(Y) - s\Delta_s + sR_{\mathrm{sum}} - H(Y|U^*_{\mathcal{K}}) - s\sum_{k=1}^{K}\big[I(X_k;U^*_k) + H(Y|U^*_k)\big] + sKH(Y) \\ &\le H(Y) - s\Delta_s + sR_{\mathrm{sum}} + \mathcal{L}^*_s + sKH(Y) \\ &= H(Y) - s\Delta_s + sR_{\mathrm{sum}} + sKH(Y) - \big((1+sK)H(Y) + sR_s - (1+s)\Delta_s\big) \\ &= \Delta_s + s(R_{\mathrm{sum}} - R_s), \end{aligned}$$
where in Equation (100) we use that $\sum_{k=1}^{K} I(X_k;U_k|Y) = -KH(Y) + \sum_{k=1}^{K}\big[I(X_k;U_k) + H(Y|U_k)\big]$, which follows from the Markov Chain $U_k \,\text{--}\, X_k \,\text{--}\, Y \,\text{--}\, (X_{\mathcal{K}\setminus k}, U_{\mathcal{K}\setminus k})$; Equation (101) follows since $\mathcal{L}^*_s$ is the maximum over all possible distributions $\mathbf{P}$ (possibly distinct from the $\mathbf{P}^*$ that maximizes $\Delta(R_{\mathrm{sum}}, P_{X_{\mathcal{K}},Y})$); and Equation (102) is due to Equation (69). Finally, Equation (103) is valid for any $R_{\mathrm{sum}} \ge 0$ and $s \ge 0$. Given s, and hence $(\Delta_s, R_s)$, letting $R_{\mathrm{sum}} = R_s$ yields $\Delta(R_s, P_{X_{\mathcal{K}},Y}) \le \Delta_s$. Together with Equation (96), this completes the proof of Proposition 1.

8.3. Proof of Lemma 1

Let, for a given random variable Z and $z \in \mathcal{Z}$, a stochastic mapping $Q_{Y|Z}(\cdot|z)$ be given. It is easy to see that
$$ H(Y|Z) = -\mathbb{E}\big[\log Q_{Y|Z}(Y|Z)\big] - D_{\mathrm{KL}}\big(P_{Y|Z}\,\|\,Q_{Y|Z}\big). $$
In addition, we have
$$ I(X_k;U_k) = H(U_k) - H(U_k|X_k) = D_{\mathrm{KL}}\big(P_{U_k|X_k}\,\|\,Q_{U_k}\big) - D_{\mathrm{KL}}\big(P_{U_k}\,\|\,Q_{U_k}\big). $$
Substituting it into Equation (67), we get
$$ \mathcal{L}_s(\mathbf{P}) = \mathcal{L}^{\mathrm{VB}}_s(\mathbf{P},\mathbf{Q}) + D_{\mathrm{KL}}\big(P_{Y|U_{\mathcal{K}}}\,\|\,Q_{Y|U_{\mathcal{K}}}\big) + s\sum_{k=1}^{K}\Big( D_{\mathrm{KL}}\big(P_{Y|U_k}\,\|\,Q_{Y|U_k}\big) + D_{\mathrm{KL}}\big(P_{U_k}\,\|\,Q_{U_k}\big)\Big) \;\ge\; \mathcal{L}^{\mathrm{VB}}_s(\mathbf{P},\mathbf{Q}), $$
where Equation (108) follows by the non-negativity of relative entropy. In addition, note that the inequality in Equation (108) holds with equality iff Q * is given by Equation (80).

8.4. Proof of Theorem 2

The proof of Theorem 2 relies on deriving an outer bound on the relevance–complexity region, as given by Equation (66), and showing that it is achievable with Gaussian pmfs and without time-sharing. In doing so, we use the technique of [89] (Theorem 8), which relies on the de Bruijn identity and the properties of Fisher information and MMSE.
Lemma 3
([88,89]). Let ( X , Y ) be a pair of random vectors with pmf p ( x , y ) . We have
$$ \log\big|(\pi e)\,\mathbf{J}^{-1}(\mathbf{X}|\mathbf{Y})\big| \;\le\; h(\mathbf{X}|\mathbf{Y}) \;\le\; \log\big|(\pi e)\,\mathrm{mmse}(\mathbf{X}|\mathbf{Y})\big|, $$
where the conditional Fisher information matrix is defined as
$$ \mathbf{J}(\mathbf{X}|\mathbf{Y}) := \mathbb{E}\big[\nabla\log p(\mathbf{X}|\mathbf{Y})\,\nabla\log p(\mathbf{X}|\mathbf{Y})^{\dagger}\big], $$
and the minimum mean square error (MMSE) matrix is
$$ \mathrm{mmse}(\mathbf{X}|\mathbf{Y}) := \mathbb{E}\big[(\mathbf{X} - \mathbb{E}[\mathbf{X}|\mathbf{Y}])(\mathbf{X} - \mathbb{E}[\mathbf{X}|\mathbf{Y}])^{\dagger}\big]. $$
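As a quick numerical illustration of Lemma 3, note that for jointly Gaussian vectors both bounds are tight and coincide with the conditional covariance; the following Monte Carlo sketch (which assumes, for simplicity, real-valued Gaussian vectors and an arbitrary linear observation model) checks that the estimated MMSE matrix matches $\mathbf{S}_x - \mathbf{S}_{xy}\mathbf{S}_y^{-1}\mathbf{S}_{yx} = \mathbf{J}^{-1}(\mathbf{X}|\mathbf{Y})$.
```python
# Monte Carlo sanity check (real-valued Gaussians for simplicity; the linear model and
# noise level are assumptions) that for jointly Gaussian vectors the two sides of
# Lemma 3 coincide: J(X|Y)^{-1} = mmse(X|Y) = S_x - S_xy S_y^{-1} S_yx.
import numpy as np

rng = np.random.default_rng(0)
n_x, n_y, n = 2, 3, 500_000
A = rng.normal(size=(n_y, n_x))                  # observation model Y = A X + N
X = rng.normal(size=(n, n_x))                    # X ~ N(0, I)
Y = X @ A.T + 0.5 * rng.normal(size=(n, n_y))    # N ~ N(0, 0.25 I)

S_x = np.eye(n_x)
S_y = A @ A.T + 0.25 * np.eye(n_y)
S_xy = A.T                                       # Cov(X, Y), shape (n_x, n_y)

cond_cov = S_x - S_xy @ np.linalg.inv(S_y) @ S_xy.T        # = J(X|Y)^{-1} for Gaussians
residual = X - Y @ (S_xy @ np.linalg.inv(S_y)).T           # X - E[X|Y]
mmse_mc = residual.T @ residual / n                        # Monte Carlo mmse(X|Y) matrix

print(np.round(cond_cov, 3))
print(np.round(mmse_mc, 3))                                # should match up to sampling noise
```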
For $t\in\mathcal{T}$ and a fixed $\prod_{k=1}^{K} p(u_k|x_k,t)$, choose $\boldsymbol{\Omega}_{k,t}$, $k = 1,\ldots,K$, satisfying $\mathbf{0} \preceq \boldsymbol{\Omega}_{k,t} \preceq \boldsymbol{\Sigma}_k^{-1}$, such that
$$ \mathrm{mmse}\big(\mathbf{X}_k\,\big|\,\mathbf{Y}, U_{k,t}, t\big) = \boldsymbol{\Sigma}_k - \boldsymbol{\Sigma}_k\boldsymbol{\Omega}_{k,t}\boldsymbol{\Sigma}_k. $$
Note that such $\boldsymbol{\Omega}_{k,t}$ always exists since $\mathbf{0} \preceq \mathrm{mmse}(\mathbf{X}_k|\mathbf{Y},U_{k,t},t) \preceq \boldsymbol{\Sigma}_k$, for all $t\in\mathcal{T}$ and $k\in\mathcal{K}$.
Using Equation (66), we get
$$ I(X_k;U_k|Y,t) \;\ge\; \log|\boldsymbol{\Sigma}_k| - \log\big|\mathrm{mmse}(\mathbf{X}_k|\mathbf{Y},U_{k,t},t)\big| = -\log\big|\mathbf{I} - \boldsymbol{\Sigma}_k^{1/2}\boldsymbol{\Omega}_{k,t}\boldsymbol{\Sigma}_k^{1/2}\big|, $$
where the inequality is due to Lemma 3, and Equation (113) is due to Equation (112).
In addition, we have
$$ I(Y;U_{\mathcal{S}^c,t}|t) \;\le\; \log|\boldsymbol{\Sigma}_{\mathbf{y}}| - \log\big|\mathbf{J}^{-1}(\mathbf{Y}|U_{\mathcal{S}^c,t},t)\big| = \log\Big|\sum_{k\in\mathcal{S}^c} \boldsymbol{\Sigma}_{\mathbf{y}}^{1/2}\mathbf{H}_k^{\dagger}\boldsymbol{\Omega}_{k,t}\mathbf{H}_k\boldsymbol{\Sigma}_{\mathbf{y}}^{1/2} + \mathbf{I}\Big|, $$
where Equation (114) is due to Lemma 3, and Equation (115) is due to the following equality, which relates the MMSE matrix in Equation (112) and the Fisher information, and whose proof follows below,
$$ \mathbf{J}(\mathbf{Y}|U_{\mathcal{S}^c,t},t) = \sum_{k\in\mathcal{S}^c}\mathbf{H}_k^{\dagger}\boldsymbol{\Omega}_{k,t}\mathbf{H}_k + \boldsymbol{\Sigma}_{\mathbf{y}}^{-1}. $$
To show Equation (116), we use the de Bruijn identity to relate the Fisher information with the MMSE, as given in the following lemma, the proof of which can be found in [89].
Lemma 4.
Let $(\mathbf{V}_1,\mathbf{V}_2)$ be a random vector with finite second moments and $\mathbf{N}\sim\mathcal{CN}(\mathbf{0},\boldsymbol{\Sigma}_N)$ independent of $(\mathbf{V}_1,\mathbf{V}_2)$. Then,
$$ \mathrm{mmse}(\mathbf{V}_2|\mathbf{V}_1,\mathbf{V}_2+\mathbf{N}) = \boldsymbol{\Sigma}_N - \boldsymbol{\Sigma}_N\,\mathbf{J}(\mathbf{V}_2+\mathbf{N}|\mathbf{V}_1)\,\boldsymbol{\Sigma}_N. $$
From the MMSE of Gaussian random vectors [51],
$$ \mathbf{Y} = \mathbb{E}[\mathbf{Y}|\mathbf{X}_{\mathcal{S}^c}] + \mathbf{Z}_{\mathcal{S}^c} = \sum_{k\in\mathcal{S}^c}\mathbf{G}_k\mathbf{X}_k + \mathbf{Z}_{\mathcal{S}^c}, $$
where $\mathbf{G}_k = \boldsymbol{\Sigma}_{\mathbf{y}|\mathbf{x}_{\mathcal{S}^c}}\mathbf{H}_k^{\dagger}\boldsymbol{\Sigma}_k^{-1}$ and $\mathbf{Z}_{\mathcal{S}^c}\sim\mathcal{CN}(\mathbf{0},\boldsymbol{\Sigma}_{\mathbf{y}|\mathbf{x}_{\mathcal{S}^c}})$, and
$$ \boldsymbol{\Sigma}_{\mathbf{y}|\mathbf{x}_{\mathcal{S}^c}}^{-1} = \boldsymbol{\Sigma}_{\mathbf{y}}^{-1} + \sum_{k\in\mathcal{S}^c}\mathbf{H}_k^{\dagger}\boldsymbol{\Sigma}_k^{-1}\mathbf{H}_k. $$
Note that $\mathbf{Z}_{\mathcal{S}^c}$ is independent of $\mathbf{X}_{\mathcal{S}^c}$ due to the orthogonality principle of the MMSE and its Gaussian distribution. Hence, it is also independent of $U_{\mathcal{S}^c,t}$.
Thus, we have
$$ \mathrm{mmse}\Big(\sum_{k\in\mathcal{S}^c}\mathbf{G}_k\mathbf{X}_k\,\Big|\,\mathbf{Y},U_{\mathcal{S}^c,t},t\Big) = \sum_{k\in\mathcal{S}^c}\mathbf{G}_k\,\mathrm{mmse}\big(\mathbf{X}_k\,\big|\,\mathbf{Y},U_{\mathcal{S}^c,t},t\big)\,\mathbf{G}_k^{\dagger} = \boldsymbol{\Sigma}_{\mathbf{y}|\mathbf{x}_{\mathcal{S}^c}}\Big(\sum_{k\in\mathcal{S}^c}\mathbf{H}_k^{\dagger}\big(\boldsymbol{\Sigma}_k^{-1} - \boldsymbol{\Omega}_{k,t}\big)\mathbf{H}_k\Big)\boldsymbol{\Sigma}_{\mathbf{y}|\mathbf{x}_{\mathcal{S}^c}}, $$
where Equation (120) follows since the cross terms are zero due to the Markov Chain $(U_{k,t},\mathbf{X}_k) \,\text{--}\, \mathbf{Y} \,\text{--}\, (U_{\mathcal{K}/k,t},\mathbf{X}_{\mathcal{K}/k})$ (see Appendix V of [89]); and Equation (121) follows from Equation (112) and the definition of $\mathbf{G}_k$.
Finally, we have
$$\begin{aligned} \mathbf{J}(\mathbf{Y}|U_{\mathcal{S}^c,t},t) &= \boldsymbol{\Sigma}_{\mathbf{y}|\mathbf{x}_{\mathcal{S}^c}}^{-1} - \boldsymbol{\Sigma}_{\mathbf{y}|\mathbf{x}_{\mathcal{S}^c}}^{-1}\,\mathrm{mmse}\Big(\sum_{k\in\mathcal{S}^c}\mathbf{G}_k\mathbf{X}_k\,\Big|\,\mathbf{Y},U_{\mathcal{S}^c,t},t\Big)\,\boldsymbol{\Sigma}_{\mathbf{y}|\mathbf{x}_{\mathcal{S}^c}}^{-1} \\ &= \boldsymbol{\Sigma}_{\mathbf{y}|\mathbf{x}_{\mathcal{S}^c}}^{-1} - \sum_{k\in\mathcal{S}^c}\mathbf{H}_k^{\dagger}\big(\boldsymbol{\Sigma}_k^{-1} - \boldsymbol{\Omega}_{k,t}\big)\mathbf{H}_k \\ &= \boldsymbol{\Sigma}_{\mathbf{y}}^{-1} + \sum_{k\in\mathcal{S}^c}\mathbf{H}_k^{\dagger}\boldsymbol{\Omega}_{k,t}\mathbf{H}_k, \end{aligned}$$
where Equation (122) is due to Lemma 4; Equation (123) is due to Equation (121); and Equation (124) follows due to Equation (119).
Then, averaging over the time sharing random variable T and letting Ω ¯ k : = t T p ( t ) Ω k , t , we get, using Equation (113),
$$ I(X_k;U_k|Y,T) \;\ge\; -\sum_{t\in\mathcal{T}} p(t)\log\big|\mathbf{I} - \boldsymbol{\Sigma}_k^{1/2}\boldsymbol{\Omega}_{k,t}\boldsymbol{\Sigma}_k^{1/2}\big| \;\ge\; -\log\big|\mathbf{I} - \boldsymbol{\Sigma}_k^{1/2}\bar{\boldsymbol{\Omega}}_{k}\boldsymbol{\Sigma}_k^{1/2}\big|, $$
where Equation (125) follows from the concavity of the log-det function and Jensen’s inequality.
Similarly, using Equation (115) and Jensen’s Inequality, we have
$$ I(Y;U_{\mathcal{S}^c}|T) \;\le\; \log\Big|\sum_{k\in\mathcal{S}^c}\boldsymbol{\Sigma}_{\mathbf{y}}^{1/2}\mathbf{H}_k^{\dagger}\bar{\boldsymbol{\Omega}}_k\mathbf{H}_k\boldsymbol{\Sigma}_{\mathbf{y}}^{1/2} + \mathbf{I}\Big|. $$
The outer bound on $\mathcal{RI}_{\mathrm{DIB}}$ is obtained by substituting Equations (125) and (126) into Equation (66), noting that $\boldsymbol{\Omega}_k = \sum_{t\in\mathcal{T}} p(t)\boldsymbol{\Omega}_{k,t} \preceq \boldsymbol{\Sigma}_k^{-1}$ since $\mathbf{0} \preceq \boldsymbol{\Omega}_{k,t} \preceq \boldsymbol{\Sigma}_k^{-1}$, and taking the union over $\boldsymbol{\Omega}_k$ satisfying $\mathbf{0} \preceq \boldsymbol{\Omega}_k \preceq \boldsymbol{\Sigma}_k^{-1}$.
Finally, the proof is completed by noting that the outer bound is achieved with $T=\emptyset$ and multivariate Gaussian distributions $p^*(\mathbf{u}_k|\mathbf{x}_k,t) = \mathcal{CN}\big(\mathbf{x}_k,\, \boldsymbol{\Sigma}_k^{1/2}(\boldsymbol{\Omega}_k - \mathbf{I})\boldsymbol{\Sigma}_k^{1/2}\big)$.

Author Contributions

A.Z., I.E.-A. and S.S.(S.) equally contributed to the published work. All authors have read and agreed to the published version of the manuscript.

Funding

The work of S. Shamai was supported by the European Union’s Horizon 2020 Research And Innovation Programme, grant agreement No. 694630, and by the WIN consortium via the Israel minister of economy and science.

Acknowledgments

The authors would like to thank the anonymous reviewers for the constructive comments and suggestions, which helped us improve this manuscript.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Tishby, N.; Pereira, F.; Bialek, W. The information bottleneck method. In Proceedings of the Thirty-Seventh Annual Allerton Conference on Communication, Control, and Computing, Allerton House, Monticello, IL, USA, 22–24 September 1999; pp. 368–377. [Google Scholar]
  2. Pratt, W.K. Digital Image Processing; John Willey & Sons Inc.: New York, NY, USA, 1991. [Google Scholar]
  3. Yu, S.; Principe, J.C. Understanding Autoencoders with Information Theoretic Concepts. arXiv 2018, arXiv:1804.00057. [Google Scholar] [CrossRef] [PubMed] [Green Version]
  4. Yu, S.; Jenssen, R.; Principe, J.C. Understanding Convolutional Neural Network Training with Information Theory. arXiv 2018, arXiv:1804.06537. [Google Scholar]
  5. Kong, Y.; Schoenebeck, G. Water from Two Rocks: Maximizing the Mutual Information. arXiv 2018, arXiv:1802.08887. [Google Scholar]
  6. Ugur, Y.; Aguerri, I.E.; Zaidi, A. A generalization of Blahut-Arimoto algorithm to computing rate-distortion regions of multiterminal source coding under logarithmic loss. In Proceedings of the IEEE Information Theory Workshop, ITW, Kaohsiung, Taiwan, 6–10 November 2017. [Google Scholar]
  7. Dobrushin, R.L.; Tsybakov, B.S. Information transmission with additional noise. IRE Trans. Inf. Theory 1962, 85, 293–304. [Google Scholar] [CrossRef]
  8. Witsenhausen, H.S.; Wyner, A.D. A conditional Entropy Bound for a Pair of Discrete Random Variables. IEEE Trans. Inf. Theory 1975, IT-21, 493–501. [Google Scholar] [CrossRef]
  9. Witsenhausen, H.S. Indirect Rate Distortion Problems. IEEE Trans. Inf. Theory 1980, IT-26, 518–521. [Google Scholar] [CrossRef]
  10. Shwartz-Ziv, R.; Tishby, N. Opening the Black Box of Deep Neural Networks via Information. arXiv 2017, arXiv:1703.00810. [Google Scholar]
  11. Achille, A.; Soatto, S. Emergence of Invariance and Disentangling in Deep Representations. arXiv 2017, arXiv:1706.01350. [Google Scholar]
  12. McAllester, D.A. A PAC-Bayesian Tutorial with a Dropout Bound. arXiv 2013, arXiv:1307.2118. [Google Scholar]
  13. Alemi, A.A. Variational Predictive Information Bottleneck. arXiv 2019, arXiv:1910.10831. [Google Scholar]
  14. Mukherjee, S. Machine Learning using the Variational Predictive Information Bottleneck with a Validation Set. arXiv 2019, arXiv:1911.02210. [Google Scholar]
  15. Kingma, D.P.; Welling, M. Auto-Encoding Variational Bayes. arXiv 2013, arXiv:1312.6114. [Google Scholar]
  16. Mukherjee, S. General Information Bottleneck Objectives and their Applications to Machine Learning. arXiv 2019, arXiv:1912.06248. [Google Scholar]
  17. Strouse, D.; Schwab, D.J. The information bottleneck and geometric clustering. Neural Comput. 2019, 31, 596–612. [Google Scholar] [CrossRef] [PubMed]
  18. Painsky, A.; Tishby, N. Gaussian Lower Bound for the Information Bottleneck Limit. J. Mach. Learn. Res. (JMLR) 2018, 18, 7908–7936. [Google Scholar]
  19. Kittichokechai, K.; Caire, G. Privacy-constrained remote source coding. In Proceedings of the 2016 IEEE International Symposium on Information Theory (ISIT), Barcelona, Spain, 10–15 July 2016; pp. 1078–1082. [Google Scholar]
  20. Tian, C.; Chen, J. Successive Refinement for Hypothesis Testing and Lossless One-Helper Problem. IEEE Trans. Inf. Theory 2008, 54, 4666–4681. [Google Scholar] [CrossRef] [Green Version]
  21. Sreekumar, S.; Gündüz, D.; Cohen, A. Distributed Hypothesis Testing Under Privacy Constraints. arXiv 2018, arXiv:1807.02764. [Google Scholar]
  22. Aguerri, I.E.; Zaidi, A.; Caire, G.; Shamai (Shitz), S. On the Capacity of Cloud Radio Access Networks with Oblivious Relaying. In Proceedings of the 2017 IEEE International Symposium on Information Theory (ISIT), Aachen, Germany, 25–30 June 2017; pp. 2068–2072. [Google Scholar]
  23. Aguerri, I.E.; Zaidi, A.; Caire, G.; Shamai (Shitz), S. On the capacity of uplink cloud radio access networks with oblivious relaying. IEEE Trans. Inf. Theory 2019, 65, 4575–4596. [Google Scholar] [CrossRef] [Green Version]
  24. Stark, M.; Bauch, G.; Lewandowsky, J.; Saha, S. Decoding of Non-Binary LDPC Codes Using the Information Bottleneck Method. In Proceedings of the ICC 2019-2019 IEEE International Conference on Communications (ICC), Shanghai, China, 20–24 May 2019; pp. 1–6. [Google Scholar]
  25. Bengio, Y.; Courville, A.; Vincent, P. Representation learning: A review and new perspectives. IEEE Trans. Pattern Anal. Mach. Intell. 2013, 35, 1798–1828. [Google Scholar] [CrossRef]
  26. Erdogmus, D. Information Theoretic Learning: Renyi’s Entropy and Its Applications to Adaptive System Training. Ph.D. Thesis, University of Florida Gainesville, Florida, FL, USA, 2002. [Google Scholar]
  27. Principe, J.C.; Euliano, N.R.; Lefebvre, W.C. Neural and Adaptive Systems: Fundamentals Through Simulations; Wiley: New York, NY, USA, 2000; Volume 672. [Google Scholar]
  28. Fisher, J.W. Nonlinear Extensions to the Minumum Average Correlation Energy Filter; University of Florida: Gainesville, FL, USA, 1997. [Google Scholar]
  29. Jiao, J.; Courtade, T.A.; Venkat, K.; Weissman, T. Justification of logarithmic loss via the benefit of side information. IEEE Trans. Inf. Theory 2015, 61, 5357–5365. [Google Scholar] [CrossRef] [Green Version]
  30. Painsky, A.; Wornell, G.W. On the Universality of the Logistic Loss Function. arXiv 2018, arXiv:1805.03804. [Google Scholar]
  31. Linsker, R. Self-organization in a perceptual network. Computer 1988, 21, 105–117. [Google Scholar] [CrossRef]
  32. Quinlan, J.R. C4. 5: Programs for Machine Learning; Elsevier: Amsterdam, The Netherlands, 2014. [Google Scholar]
  33. Chow, C.; Liu, C. Approximating discrete probability distributions with dependence trees. IEEE Trans. Inf. Theory 1968, 14, 462–467. [Google Scholar] [CrossRef] [Green Version]
  34. Olsen, C.; Meyer, P.E.; Bontempi, G. On the impact of entropy estimation on transcriptional regulatory network inference based on mutual information. EURASIP J. Bioinf. Syst. Biol. 2008, 2009, 308959. [Google Scholar] [CrossRef] [Green Version]
  35. Pluim, J.P.; Maintz, J.A.; Viergever, M.A. Mutual-information-based registration of medical images: A survey. IEEE Trans. Med. Imaging 2003, 22, 986–1004. [Google Scholar] [CrossRef]
  36. Viola, P.; Wells, W.M., III. Alignment by maximization of mutual information. Int. J. Comput. Vis. 1997, 24, 137–154. [Google Scholar] [CrossRef]
  37. Cesa-Bianchi, N.; Lugosi, G. Prediction, Learning and Games; Cambridge University Press: New York, NY, USA, 2006. [Google Scholar]
  38. Lehmann, E.L.; Casella, G. Theory of Point Estimation; Springer Science & Business Media: Berlin, Germany, 2006. [Google Scholar]
  39. Bousquet, O.; Elisseeff, A. Stability and generalization. J. Mach. Learn. Res. 2002, 2, 499–526. [Google Scholar]
  40. Shalev-Shwartz, S.; Shamir, O.; Srebro, N.; Sridharan, K. Learnability, stability and uniform convergence. J. Mach. Learn. Res. 2010, 11, 2635–2670. [Google Scholar]
  41. Xu, A.; Raginsky, M. Information-theoretic analysis of generalization capability of learning algorithms. In Proceedings of the 31st International Conference on Neural Information Processing Systems, Long Beach, CA, USA, 4–9 December 2017; pp. 2521–2530. [Google Scholar]
  42. Russo, D.; Zou, J. How much does your data exploration overfit? Controlling bias via information usage. arXiv 2015, arXiv:1511.05219. [Google Scholar] [CrossRef] [Green Version]
  43. Amjad, R.A.; Geiger, B.C. Learning Representations for Neural Network-Based Classification Using the Information Bottleneck Principle. IEEE Trans. Pattern Anal. Mach. Intell. 2019. [Google Scholar] [CrossRef] [Green Version]
  44. Paninski, L. Estimation of entropy and mutual information. Neural Comput. 2003, 15, 1191–1253. [Google Scholar] [CrossRef] [Green Version]
  45. Jiao, J.; Venkat, K.; Han, Y.; Weissman, T. Minimax estimation of functionals of discrete distributions. IEEE Trans. Inf. Theory 2015, 61, 2835–2885. [Google Scholar] [CrossRef] [PubMed]
  46. Valiant, P.; Valiant, G. Estimating the unseen: improved estimators for entropy and other properties. In Proceedings of the Advances in Neural Information Processing Systems 26 (NIPS 2013), Lake Tahoe, NV, USA, 5–10 December 2013; pp. 2157–2165. [Google Scholar]
  47. Chalk, M.; Marre, O.; Tkacik, G. Relevant sparse codes with variational information bottleneck. arXiv 2016, arXiv:1605.07332. [Google Scholar]
  48. Alemi, A.; Fischer, I.; Dillon, J.; Murphy, K. Deep Variational Information Bottleneck. In Proceedings of the International Conference on Learning Representations, ICLR 2017, Toulon, France, 24–26 April 2017. [Google Scholar]
  49. Achille, A.; Soatto, S. Information Dropout: Learning Optimal Representations Through Noisy Computation. IEEE Trans. Pattern Anal. Mach. Intell. 2018, 40, 2897–2905. [Google Scholar] [CrossRef] [Green Version]
  50. Harremoes, P.; Tishby, N. The information bottleneck revisited or how to choose a good distortion measure. In Proceedings of the 2007 IEEE International Symposium on Information Theory, Nice, France, 24–29 June 2007; pp. 566–570. [Google Scholar]
  51. Gamal, A.E.; Kim, Y.H. Network Information Theory; Cambridge University Press: Cambridge, UK, 2011. [Google Scholar]
  52. Hotelling, H. The most predictable criterion. J. Educ. Psycol. 1935, 26, 139–142. [Google Scholar] [CrossRef]
  53. Globerson, A.; Tishby, N. On the Optimality of the Gaussian Information Bottleneck Curve; Technical Report; Hebrew University: Jerusalem, Israel, 2004. [Google Scholar]
  54. Chechik, G.; Globerson, A.; Tishby, N.; Weiss, Y. Information Bottleneck for Gaussian Variables. J. Mach. Learn. Res. 2005, 6, 165–188. [Google Scholar]
  55. Wieczorek, A.; Roth, V. On the Difference Between the Information Bottleneck and the Deep Information Bottleneck. arXiv 2019, arXiv:1912.13480. [Google Scholar]
  56. Dempster, A.P.; Laird, N.M.; Rubin, D.B. Maximum likelihood from incomplete data via the EM algorithm. J. R. Stat. Soc. Ser. B 1977, 39, 1–38. [Google Scholar]
  57. Blahut, R. Computation of channel capacity and rate-distortion functions. IEEE Trans. Inf. Theory 1972, 18, 460–473. [Google Scholar] [CrossRef] [Green Version]
  58. Arimoto, S. An algorithm for computing the capacity of arbitrary discrete memoryless channels. IEEE Trans. Inf. Theory 1972, IT-18, 12–20. [Google Scholar] [CrossRef] [Green Version]
  59. Winkelbauer, A.; Matz, G. Rate-information-optimal gaussian channel output compression. In Proceedings of the 48th Annual Conference on Information Sciences and Systems (CISS), Princeton, NJ, USA, 19–21 March 2014; pp. 1–5. [Google Scholar]
  60. Gálvez, B.R.; Thobaben, R.; Skoglund, M. The Convex Information Bottleneck Lagrangian. Entropy 2020, 20, 98. [Google Scholar] [CrossRef] [Green Version]
  61. Jang, E.; Gu, S.; Poole, B. Categorical Reparameterization with Gumbel-Softmax. arXiv 2017, arXiv:1611.01144. [Google Scholar]
  62. Maddison, C.J.; Mnih, A.; Teh, Y.W. The Concrete Distribution: A Continuous Relaxation of Discrete Random Variables. arXiv 2016, arXiv:1611.00712. [Google Scholar]
  63. Kingma, D.P.; Ba, J. Adam: A Method for Stochastic Optimization. arXiv 2014, arXiv:1412.6980. [Google Scholar]
  64. Lim, S.H.; Kim, Y.H.; Gamal, A.E.; Chung, S.Y. Noisy Network Coding. IEEE Trans. Inf. Theory 2011, 57, 3132–3152. [Google Scholar] [CrossRef]
  65. Courtade, T.A.; Weissman, T. Multiterminal source coding under logarithmic loss. IEEE Trans. Inf. Theory 2014, 60, 740–761. [Google Scholar] [CrossRef] [Green Version]
  66. Csiszár, I.; Körner, J. Information Theory: Coding Theorems for Discrete Memoryless Systems; Academic Press: London, UK, 1981. [Google Scholar]
  67. Wyner, A.D.; Ziv, J. The rate-distortion function for source coding with side information at the decoder. IEEE Trans. Inf. Theory 1976, 22, 1–10. [Google Scholar] [CrossRef]
  68. Steinberg, Y. Coding and Common Reconstruction. IEEE Trans. Inf. Theory 2009, IT-11, 4995–5010. [Google Scholar] [CrossRef]
  69. Benammar, M.; Zaidi, A. Rate-Distortion of a Heegard-Berger Problem with Common Reconstruction Constraint. In Proceedings of the International Zurich Seminar on Information and Communication, Cambridge, MA, USA, 1–6 July 2016. [Google Scholar]
  70. Benammar, M.; Zaidi, A. Rate-distortion function for a heegard-berger problem with two sources and degraded reconstruction sets. IEEE Trans. Inf. Theory 2016, 62, 5080–5092. [Google Scholar] [CrossRef] [Green Version]
  71. Sutskover, I.; Shamai, S.; Ziv, J. Extremes of Information Combining. IEEE Trans. Inf. Theory 2005, 51, 1313–1325. [Google Scholar] [CrossRef]
  72. Land, I.; Huber, J. Information Combining. Found. Trends Commun. Inf. Theory 2006, 3, 227–230. [Google Scholar] [CrossRef] [Green Version]
  73. Wyner, A.D. On source coding with side information at the decoder. IEEE Trans. Inf. Theory 1975, 21, 294–300. [Google Scholar] [CrossRef]
  74. Ahlswede, R.; Korner, J. Source coding with side information and a converse for degraded broadcast channels. IEEE Trans. Inf. Theory 1975, 21, 629–637. [Google Scholar] [CrossRef]
  75. Makhdoumi, A.; Salamatian, S.; Fawaz, N.; Médard, M. From the information bottleneck to the privacy funnel. In Proceedings of the IEEE Information Theory Workshop, ITW, Hobart, Tasamania, Australia, 2–5 November 2014; pp. 501–505. [Google Scholar]
  76. Asoodeh, S.; Diaz, M.; Alajaji, F.; Linder, T. Information Extraction Under Privacy Constraints. IEEE Trans. Inf. Theory 2019, 65, 1512–1534. [Google Scholar] [CrossRef]
  77. Erkip, E.; Cover, T.M. The efficiency of investment information. IEEE Trans. Inf. Theory 1998, 44, 1026–1040. [Google Scholar] [CrossRef]
  78. Hinton, G.E.; van Camp, D. Keeping the Neural Networks Simple by Minimizing the Description Length of the Weights. In Proceedings of the Sixth Annual Conference on Computational Learning Theory; ACM: New York, NY, USA, 1993; pp. 5–13. [Google Scholar]
  79. Gilad-Bachrach, R.; Navot, A.; Tishby, N. An Information Theoretic Tradeoff between Complexity and Accuracy. In Learning Theory and Kernel Machines; Lecture Notes in Computer Science; Springer: Berlin/Heidelberg, Germany, 2003; pp. 595–609. [Google Scholar]
  80. Vera, M.; Piantanida, P.; Vega, L.R. The Role of Information Complexity and Randomization in Representation Learning. arXiv 2018, arXiv:1802.05355. [Google Scholar]
  81. Huang, S.L.; Makur, A.; Wornell, G.W.; Zheng, L. On Universal Features for High-Dimensional Learning and Inference. arXiv 2019, arXiv:1911.09105. [Google Scholar]
  82. Higgins, I.; Matthey, L.; Pal, A.; Burgess, C.; Glorot, X.; Botvinick, M.; Mohamed, S.; Lerchner, A. β-VAE: Learning basic visual concepts with a constrained variational framework. In Proceedings of the International Conference on Learning Representations, Toulon, France, 24–26 April 2017. [Google Scholar]
  83. Kurakin, A.; Goodfellow, I.; Bengio, S. Adversarial Machine Learning at Scale. arXiv 2016, arXiv:1611.01236. [Google Scholar]
  84. Madry, A.; Makelov, A.; Schmidt, L.; Tsipras, D.; Vladu, A. Towards Deep Learning Models Resistant to Adversarial Attacks. arXiv 2017, arXiv:1706.06083. [Google Scholar]
Figure 1. Information bottleneck problem.
Figure 2. Information bottleneck relevance–complexity region. For a given $\beta$, the solution $P_{U|X}^{*,\beta}$ to the minimization of the IB-Lagrangian in Equation (3) results in a pair $(\Delta_\beta, R_\beta)$ on the boundary of the IB relevance–complexity region (colored in grey). (A minimal numerical sketch of this $\beta$-sweep is given after the figure list.)
Figure 3. Example parametrization of Variational Information Bottleneck using neural networks.
Figure 4. A remote source coding problem.
Figure 5. An abstract inference model for learning.
Figure 6. Inference problem with constrained model's complexity.
Figure 7. A model for distributed, e.g., multi-view, learning.
Figure 8. Example parameterization of the Distributed Variational Information Bottleneck method using neural networks.
Figure 9. CRAN model with oblivious relaying and time-sharing.
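To make the boundary sweep described in the Figure 2 caption concrete, the following is a minimal sketch, not taken from the paper, of the classical iterative (self-consistent) IB updates for a discrete joint distribution. The joint distribution `p_xy`, the function name `ib_updates`, the number of representation values `n_u`, and the convention that a larger $\beta$ places more weight on relevance are all illustrative assumptions; varying $\beta$ then traces an approximate relevance–complexity curve of the kind shown in Figure 2.

```python
# Minimal sketch (not from the paper): classical iterative IB updates for a
# discrete p(x, y), swept over beta to trace a relevance-complexity curve.
# Assumed convention: larger beta favors relevance I(U;Y) over compression.
import numpy as np

def ib_updates(p_xy, n_u, beta, n_iter=200, seed=0):
    """Return (I(U;X), I(U;Y)) for one beta via self-consistent IB updates."""
    rng = np.random.default_rng(seed)
    n_x, n_y = p_xy.shape
    p_x = p_xy.sum(axis=1)                          # marginal p(x)
    p_y_given_x = p_xy / p_x[:, None]               # conditional p(y|x)
    q_u_given_x = rng.dirichlet(np.ones(n_u), size=n_x)  # random encoder init
    for _ in range(n_iter):
        q_u = p_x @ q_u_given_x                     # marginal q(u)
        # decoder q(y|u) induced by the current encoder
        q_y_given_u = (q_u_given_x * p_x[:, None]).T @ p_y_given_x
        q_y_given_u /= q_y_given_u.sum(axis=1, keepdims=True) + 1e-12
        # KL(p(y|x) || q(y|u)) for every (x, u) pair
        kl = np.array([[np.sum(p_y_given_x[x] *
                        np.log((p_y_given_x[x] + 1e-12) /
                               (q_y_given_u[u] + 1e-12)))
                        for u in range(n_u)] for x in range(n_x)])
        # encoder update: q(u|x) proportional to q(u) exp(-beta * KL)
        q_u_given_x = q_u[None, :] * np.exp(-beta * kl)
        q_u_given_x /= q_u_given_x.sum(axis=1, keepdims=True)
    # complexity I(U;X) and relevance I(U;Y) of the resulting encoder
    q_u = p_x @ q_u_given_x
    q_xu = q_u_given_x * p_x[:, None]
    i_ux = np.sum(q_xu * np.log((q_u_given_x + 1e-12) / (q_u[None, :] + 1e-12)))
    q_uy = q_xu.T @ p_y_given_x
    p_y = p_xy.sum(axis=0)
    i_uy = np.sum(q_uy * np.log((q_uy + 1e-12) /
                                (q_u[:, None] * p_y[None, :] + 1e-12)))
    return i_ux, i_uy

# Sweep beta; each run yields one (complexity, relevance) point.
p_xy = np.array([[0.30, 0.10], [0.05, 0.25], [0.10, 0.20]])
for beta in [0.5, 1.0, 2.0, 5.0, 10.0]:
    print(beta, ib_updates(p_xy, n_u=2, beta=beta))
```

As $\beta$ grows, the returned pairs move along the curve toward higher relevance at the price of higher complexity, which is the behavior the boundary in Figure 2 depicts.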