Abstract

There is a new theory of information based on logic. The definition of Shannon entropy as well as the notions of joint, conditional and mutual entropy as defined by Shannon can all be derived by a uniform transformation from the corresponding formulas of logical information theory. Information is first defined in terms of sets of distinctions without using any probability measure. When a probability measure is introduced, the logical entropies are simply the values of the (product) probability measure on the sets of distinctions. The compound notions of joint, conditional and mutual entropies are obtained as the values of the measure, respectively, on the union, difference and intersection of the sets of distinctions. These compound notions of logical entropy satisfy the usual Venn diagram relationships (e.g. inclusion–exclusion formulas) since they are values of a measure (in the sense of measure theory). The uniform transformation into the formulas for Shannon entropy is linear, so it explains the long-noted fact that the Shannon formulas satisfy the Venn diagram relations—as an analogy or mnemonic—since Shannon entropy is not a measure (in the sense of measure theory) on a given set. What is the logic that gives rise to logical information theory? Partitions are dual (in a category-theoretic sense) to subsets, and the logic of partitions was recently developed in a dual/parallel relationship to the Boolean logic of subsets (the latter being usually mis-specified as the special case of ‘propositional logic’). Boole developed logical probability theory as the normalized counting measure on subsets. Similarly the normalized counting measure on partitions is logical entropy—when the partitions are represented as the set of distinctions that is the complement of the equivalence relation for the partition. In this manner, logical information theory provides the set-theoretic and measure-theoretic foundations for information theory. The Shannon theory is then derived by the transformation that replaces the counting of distinctions with the counting of the number of binary partitions (bits) it takes, on average, to make the same distinctions by uniquely encoding the distinct elements—which is why the Shannon theory perfectly dovetails into coding and communications theory.

1 Introduction

This article develops the logical theory of information-as-distinctions. It can be seen as the application of the logic of partitions [15] to information theory. Partitions are dual (in a category-theoretic sense) to subsets. George Boole developed the notion of logical probability [7] as the normalized counting measure on subsets in his logic of subsets. This article develops the normalized counting measure on partitions as the analogous quantitative treatment in the logic of partitions. The resulting measure is a new logical derivation of an old formula measuring diversity and distinctions, e.g., Corrado Gini’s index of mutability or diversity [19], that goes back to the early 20th century. In view of the idea of information as being based on distinctions (see next section), I refer to this logical measure of distinctions as ‘logical entropy’.

This raises the question of the relationship of logical entropy to Claude Shannon’s entropy ([40], [41]). The entropies are closely related since they are both ultimately based on the concept of information-as-distinctions—but they represent two different ways to quantify distinctions. Logical entropy directly counts the distinctions (as defined in partition logic) whereas Shannon entropy, in effect, counts the minimum number of binary partitions (or yes/no questions) it takes, on average, to uniquely determine or designate the distinct entities. Since that gives (in standard examples) a binary code for the distinct entities, the Shannon theory is perfectly adapted for applications to the theory of coding and communications.

The logical theory and the Shannon theory are also related in their compound notions of joint entropy, conditional entropy and mutual information. Logical entropy is a measure in the mathematical sense, so as with any measure, the compound formulas satisfy the usual Venn diagram relationships. The compound notions of Shannon entropy were defined so that they also satisfy similar Venn diagram relationships. However, as various information theorists, principally Lorne Campbell, have noted [9], Shannon entropy is not a measure (outside of the standard example of |$2^{n}$| equiprobable distinct entities where it is the count |$n$| of the number of yes/no questions necessary to uniquely determine or encode the distinct entities)—so one can conclude only that the ‘analogies provide a convenient mnemonic’ [9, p. 112] in terms of the usual Venn diagrams for measures. Campbell wondered if there might be a ‘deeper foundation’ [9, p. 112] to clarify how the Shannon formulas can be defined to satisfy the measure-like relations in spite of not being a measure. That question is addressed in this article by showing that there is a transformation of formulas that transforms each of the logical entropy compound formulas into the corresponding Shannon entropy compound formula, and the transform preserves the Venn diagram relationships that automatically hold for measures. This ‘dit-bit transform’ is heuristically motivated by showing how average counts of distinctions (‘dits’) can be converted into average counts of binary partitions (‘bits’).

Moreover, Campbell remarked that it would be ‘particularly interesting’ and ‘quite significant’ if there was an entropy measure of sets so that joint entropy corresponded to the measure of the union of sets, conditional entropy to the difference of sets, and mutual information to the intersection of sets [9, p. 113]. Logical entropy precisely satisfies those requirements.

2 Logical information as the measure of distinctions

There is now a widespread view that information is fundamentally about differences, distinguishability and distinctions. As Charles H. Bennett, one of the founders of quantum information theory, put it:

So information really is a very useful abstraction. It is the notion of distinguishability abstracted away from what we are distinguishing, or from the carrier of information. [5, p. 155]

This view even has an interesting history. In his book The Information: A History, A Theory, A Flood, James Gleick noted the focus on differences in the work of the 17th-century polymath John Wilkins, a founder of the Royal Society. In |$1641$|, the year before Isaac Newton was born, Wilkins published one of the earliest books on cryptography, Mercury or the Secret and Swift Messenger, which not only pointed out the fundamental role of differences but noted that any (finite) set of different things could be encoded by words in a binary code.

For in the general we must note, That whatever is capable of a competent Difference, perceptible to any Sense, may be a sufficient Means whereby to express the Cogitations. It is more convenient, indeed, that these Differences should be of as great Variety as the Letters of the Alphabet; but it is sufficient if they be but twofold, because Two alone may, with somewhat more Labour and Time, be well enough contrived to express all the rest. [47, Chap. XVII, p. 69]

Wilkins explains that a five-letter binary code would be sufficient to encode the letters of the alphabet since |$2^{5}=32$|.

Thus any two Letters or Numbers, suppose |$A.B$|⁠. being transposed through five Places, will yield Thirty Two Differences, and so consequently will superabundantly serve for the Four and twenty Letters ... .[47, Chap. XVII, p. 69]

As Gleick noted:

Any difference meant a binary choice. Any binary choice began the expressing of cogitations. Here, in this arcane and anonymous treatise of |$1641$|⁠, the essential idea of information theory poked to the surface of human thought, saw its shadow, and disappeared again for [three] hundred years. [20, p. 161]

Thus counting distinctions [12] would seem the right way to measure information,1 and that is the measure which emerges naturally out of partition logic—just as finite logical probability emerges naturally as the measure of counting elements in Boole’s subset logic.

Although usually named after the special case of ‘propositional’ logic, the general case is Boole’s logic of subsets of a universe |$U$| (the special case of |$U=1$| allows the propositional interpretation since the only subsets are |$1$| and |$\emptyset$| standing for truth and falsity). Category theory shows there is a duality between sub-sets and quotient-sets (= partitions = equivalence relations), and that allowed the recent development of the dual logic of partitions ([13], [15]). As indicated in the title of his book, An Investigation of the Laws of Thought on which are founded the Mathematical Theories of Logic and Probabilities [7], Boole also developed the normalized counting measure on subsets of a finite universe |$U$| which was finite logical probability theory. When the same mathematical notion of the normalized counting measure is applied to the partitions on a finite universe set |$U$| (when the partition is represented as the complement of the corresponding equivalence relation on |$U\times U$|⁠) then the result is the formula for logical entropy.

In addition to the philosophy of information literature [4], there is a whole sub-industry in mathematics concerned with different notions of ‘entropy’ or ‘information’ ([2]; see [45] for a recent ‘extensive’ analysis) that is long on formulas and ‘intuitive axioms’ but short on interpretations. Out of that plethora of definitions, logical entropy is the measure (in the technical sense of measure theory) of information that arises out of partition logic just as logical probability theory arises out of subset logic.

The logical notion of information-as-distinctions supports the view that the notion of information is independent of the notion of probability and should be based on finite combinatorics. As Andrey Kolmogorov put it:

Information theory must precede probability theory, and not be based on it. By the very essence of this discipline, the foundations of information theory have a finite combinatorial character. [27, p. 39]

Logical information theory precisely fulfills Kolmogorov’s criterion.2 It starts simply with a set of distinctions defined by a partition on a finite set |$U$|⁠, where a distinction is an ordered pair of elements of |$U$| in distinct blocks of the partition. Thus the ‘finite combinatorial’ object is the set of distinctions (‘ditset’) or information set (‘infoset’) associated with the partition, i.e., the complement in |$U\times U$| of the equivalence relation associated with the partition. To get a quantitative measure of information, any probability distribution on |$U$| defines a product probability measure on |$U\times U$|⁠, and the logical entropy is simply that probability measure of the information set.

3 Duality of subsets and partitions

Logical entropy is to the logic of partitions as logical probability is to the Boolean logic of subsets. Hence we will start with a brief review of the relationship between these two dual forms of mathematical logic.

Modern category theory shows that the concept of a subset dualizes to the concept of a quotient set, equivalence relation, or partition. F. William Lawvere called a subset or, in general, a subobject a ‘part’ and then noted: ‘The dual notion (obtained by reversing the arrows) of ‘part’ is the notion of partition.’ [31, p. 85] That suggests that the Boolean logic of subsets (usually named after the special case of propositions as ‘propositional’ logic) should have a dual logic of partitions ([13], [15]).

A partition |$\pi=\left\{B_{1},...,B_{m}\right\}$| on |$U$| is a set of subsets, called cells or blocks, |$B_{i}$| that are mutually disjoint and jointly exhaustive (|$\cup_{i}B_{i}=U$|). In the duality between subset logic and partition logic, the dual to the notion of an ‘element’ of a subset is the notion of a ‘distinction’ of a partition, where |$\left( u,u^{\prime}\right) \in U\times U$| is a distinction or dit of |$\pi$| if the two elements are in different blocks, i.e., the ‘dits’ of a partition are dual to the ‘its’ (or elements) of a subset. Let |$\operatorname*{dit}\left( \pi\right) \subseteq U\times U$| be the set of distinctions or ditset of |$\pi$|. Thus the information set or infoset associated with a partition |$\pi$| is its ditset |$\operatorname*{dit}\left( \pi\right)$|. Similarly an indistinction or indit of |$\pi$| is a pair |$\left(u,u^{\prime}\right) \in U\times U$| in the same block of |$\pi$|. Let |$\operatorname*{indit}\left(\pi\right) \subseteq U\times U$| be the set of indistinctions or inditset of |$\pi$|. Then |$\operatorname*{indit}\left( \pi\right) $| is the equivalence relation associated with |$\pi$| and |$\operatorname*{dit}\left( \pi\right) =U\times U-\operatorname*{indit} \left( \pi\right) $| is the complementary binary relation that has been called an apartness relation or a partition relation.
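To make the definitions concrete, here is a minimal computational sketch in Python (the universe, the partition and all variable names are illustrative choices of ours, not taken from the text) that computes the ditset and inditset of a partition and checks the counting identity |$\left\vert \operatorname*{dit}\left( \pi\right) \right\vert =\left\vert U\right\vert ^{2}-\sum_{i}\left\vert B_{i}\right\vert ^{2}$| used below:

```python
from itertools import product

# Illustrative universe and partition (names and values are our assumptions).
U = ['a', 'b', 'c', 'd']
blocks = [{'a', 'b'}, {'c'}, {'d'}]          # a partition pi of U into blocks

def indit(blocks):
    """Indistinctions: ordered pairs (u, u') with u and u' in the same block."""
    return {(u, v) for B in blocks for u in B for v in B}

def dit(blocks, U):
    """Distinctions: ordered pairs (u, u') with u and u' in different blocks."""
    return set(product(U, U)) - indit(blocks)

# dit(pi) is the complement of the equivalence relation indit(pi) in U x U,
# and |dit(pi)| = |U|^2 - sum_i |B_i|^2.
assert len(dit(blocks, U)) == len(U)**2 - sum(len(B)**2 for B in blocks)
print(sorted(dit(blocks, U)))
```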

4 Classical subset logic and partition logic

The algebra associated with the subsets |$S\subseteq U$| is the Boolean algebra |$\wp\left( U\right) $| of subsets of |$U$| with the inclusion of elements as the partial order. The corresponding algebra of partitions |$\pi$| on |$U$| is the partition algebra |$\prod\left( U\right) $| defined as follows (a small computational sketch of the join and the refinement order is given after the list):

  • the partial order |$\sigma\preceq\pi$| of partitions |$\sigma=\left\{ C,C^{\prime},...\right\} $| and |$\pi=\left\{ B,B^{\prime },...\right\} $| holds when |$\pi$| refines |$\sigma$| in the sense that for every block |$B\in\pi$| there is a block |$C\in\sigma$| such that |$B\subseteq C$|, or, equivalently, using the element-distinction pairing, the partial order is the inclusion of distinctions: |$\sigma\preceq\pi$| if and only if (iff) |$\operatorname*{dit}\left( \sigma\right) \subseteq\operatorname*{dit}\left( \pi\right) $|;

  • the minimum or bottom partition is the indiscrete partition (or blob) |$\mathbf{0}=\left\{ U\right\} $| with one block consisting of all of |$U$|;

  • the maximum or top partition is the discrete partition |$\mathbf{1}=\left\{ \left\{ u_{j}\right\} \right\} _{j=1,...,n}$| consisting of singleton blocks;

  • the join |$\pi\vee\sigma$| is the partition whose blocks are the non-empty intersections |$B\cap C$| of blocks of |$\pi$| and blocks of |$\sigma$|, or, equivalently, using the element-distinction pairing, |$\operatorname*{dit} \left( \pi\vee\sigma\right) =\operatorname*{dit}\left( \pi\right) \cup\operatorname*{dit}\left( \sigma\right) $|;

  • the meet |$\pi\wedge\sigma$| is the partition whose blocks are the equivalence classes for the equivalence relation generated by: |$u_{j}\sim u_{j^{\prime}}$| if |$u_{j}\in B\in\pi$|, |$u_{j^{\prime}}\in C\in\sigma$|, and |$B\cap C\neq\emptyset$|; and

  • |$\sigma\Rightarrow\pi$| is the implication partition whose blocks are: (1) the singletons |$\left\{ u_{j}\right\} $| for |$u_{j}\in B\in\pi$| if there is a |$C\in\sigma$| such that |$B\subseteq C$|, or (2) just |$B\in\pi$| if there is no |$C\in\sigma$| with |$B\subseteq C$|, so that trivially: |$\sigma\Rightarrow\pi=\mathbf{1}$| iff |$\sigma\preceq\pi$|.3
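As a small sketch under the same illustrative conventions (Python; the universe and the two partitions are our own examples), the join and the refinement order can be implemented directly and checked against their element-distinction characterizations:

```python
from itertools import product

# Illustrative universe and two partitions (our assumptions, not the article's).
U = ['a', 'b', 'c', 'd']
pi    = [{'a', 'b'}, {'c', 'd'}]
sigma = [{'a', 'c'}, {'b', 'd'}]

def dit(blocks, U):
    indit = {(u, v) for B in blocks for u in B for v in B}
    return set(product(U, U)) - indit

def join(pi, sigma):
    """Blocks of the join pi v sigma are the non-empty intersections B ∩ C."""
    return [B & C for B in pi for C in sigma if B & C]

def refines(pi, sigma):
    """True when pi refines sigma (i.e. sigma ⪯ pi): every block of pi
    lies inside some block of sigma."""
    return all(any(B <= C for C in sigma) for B in pi)

# The ditset of the join is the union of the ditsets.
assert dit(join(pi, sigma), U) == dit(pi, U) | dit(sigma, U)
# Refinement is equivalent to inclusion of ditsets.
assert refines(pi, sigma) == (dit(sigma, U) <= dit(pi, U))
# The join refines both pi and sigma.
assert refines(join(pi, sigma), pi) and refines(join(pi, sigma), sigma)
```

The meet is omitted from this sketch since its blocks are the classes of a generated equivalence relation, which is exactly the rst-closure construction discussed below.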

The logical partition operations can also be defined in terms of the corresponding logical operations on subsets. A ditset |$\operatorname*{dit} \left( \pi\right)$| of a partition on |$U$| is a subset of |$U\times U$| of a particular kind, namely the complement of an equivalence relation. An equivalence relation is reflexive, symmetric and transitive. Hence the ditset complement, i.e., a partition relation (or apartness relation), is a subset |$P\subseteq U\times U$| that is:

  • (1) irreflexive (or anti-reflexive), |$P\cap\Delta=\emptyset$| (where |$\Delta=\left\{ \left( u,u\right) :u\in U\right\} $| is the diagonal), i.e., no element |$u\in U$| can be distinguished from itself;

  • (2) symmetric, |$\left( u,u^{\prime}\right) \in P$| implies |$\left(u^{\prime},u\right) \in P$|⁠, i.e., if |$u$| is distinguished from |$u^{\prime}$|⁠, then |$u^{\prime}$| is distinguished from |$u$|⁠; and

  • (3) anti-transitive (or co-transitive), if |$\left( u,u^{\prime\prime}\right) \in P$| then for any |$u^{\prime}\in U$|, |$\left( u,u^{\prime}\right) \in P$| or |$\left( u^{\prime},u^{\prime\prime}\right) \in P$|, i.e., if |$u$| is distinguished from |$u^{\prime\prime}$|, then any other element |$u^{\prime}$| must be distinguished from |$u$| or |$u^{\prime\prime}$|, because if |$u^{\prime}$| were equivalent to both, then by transitivity of equivalence, |$u$| would be equivalent to |$u^{\prime\prime}$|, contrary to their being distinguished.

That is how distinctions work at the logical level, and that is why the ditset of a partition is the ‘probability-free’ notion of an information set or infoset in the logical theory of information-as-distinctions.

Given any subset |$S\subseteq U\times U$|, the reflexive-symmetric-transitive (rst) closure |$\overline{S^{c}}$| of the complement |$S^{c}$| is the smallest equivalence relation containing |$S^{c}$|, so its complement is the largest partition relation contained in |$S$|, which is called the interior |$\operatorname*{int}\left( S\right) $| of |$S$|. This usage is consistent with calling the subsets that equal their rst-closures closed subsets of |$U\times U$| (so closed subsets = equivalence relations) so the complements are the open subsets (= partition relations). However it should be noted that the rst-closure is not a topological closure since the closure of a union is not necessarily the union of the closures, so the ‘open’ subsets do not form a topology on |$U\times U$|.
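A minimal sketch of the interior operation (assuming an illustrative universe and ditset of our own choosing) follows the definition literally: complement the subset, take the rst-closure, and complement again:

```python
from itertools import product

def rst_closure(R, U):
    """Reflexive-symmetric-transitive closure: the smallest equivalence
    relation on U containing the binary relation R."""
    closure = set(R) | {(u, u) for u in U}              # add the diagonal (reflexive)
    closure |= {(v, u) for (u, v) in closure}           # symmetrize
    changed = True
    while changed:                                      # naive transitive closure
        changed = False
        for (a, b) in list(closure):
            for (c, d) in list(closure):
                if b == c and (a, d) not in closure:
                    closure.add((a, d))
                    changed = True
    return closure

def interior(S, U):
    """Largest partition relation contained in S: the complement of the
    rst-closure of the complement of S."""
    full = set(product(U, U))
    return full - rst_closure(full - set(S), U)

# Illustrative check (our example): the interior of a ditset is the ditset itself.
U = ['a', 'b', 'c']
dit_pi = {('a', 'b'), ('b', 'a'), ('a', 'c'), ('c', 'a')}   # ditset of {{a}, {b, c}}
assert interior(dit_pi, U) == dit_pi
```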

The interior operation |$\operatorname*{int}:\wp\left( U\times U\right) \rightarrow\wp\left( U\times U\right) $| provides a universal way to define logical operations on partitions from the corresponding logical subset operations in Boolean logic:
\begin{align} {\rm{apply\,\,the\,\,subset\,\,operation\,\,to\,\,the\,\,ditsets\,\,and\,\,then,\,\,if\,\,necessary,}}\\ {\rm{take\,\,the\,\,interior\,\,to\,\,obtain\,\,the\,\,ditset\,\,of\,\,the\,\,partition\,\,operation.}} \end{align}

Since the same operations can be defined for subsets and partitions, one can interpret a formula |$\Phi\left( \pi,\sigma,...\right) $| either way as a subset or a partition. Given either subsets or partitions of |$U$| substituted for the variables |$\pi$|⁠, |$\sigma$|⁠,..., one can apply, respectively, subset or partition operations to evaluate the whole formula. Since |$\Phi\left(\pi,\sigma,...\right) $| is either a subset or a partition, the corresponding proposition is ‘|$u$| is an element of |$\Phi\left( \pi ,\sigma,...\right)$|’ or ‘|$\left( u,u^{\prime}\right) $| is a distinction of |$\Phi\left( \pi,\sigma,...\right)$|’. And then the definitions of a valid formula are also parallel, namely, no matter what is substituted for the variables, the whole formula evaluates to the top of the algebra. In that case, the subset |$\Phi\left( \pi,\sigma,...\right) $| contains all elements of |$U$|⁠, i.e., |$\Phi\left( \pi,\sigma,...\right) =U$|⁠, or the partition |$\Phi\left( \pi,\sigma,...\right) $| distinguishes all pairs |$\left( u,u^{\prime}\right) $| for distinct elements of |$U$|⁠, i.e., |$\Phi\left( \pi,\sigma,...\right) =\mathbf{1}$|⁠. The parallelism between the dual logics is summarized in the following Table 1.

Table 1. Duality between subset logic and partition logic (each row gives the subset-logic notion / the partition-logic notion):

  • ‘Elements’ (its or dits): elements |$u$| of |$S$| / dits |$\left(u,u^{\prime}\right) $| of |$\pi$|
  • Inclusion of ‘elements’: inclusion |$S\subseteq T$| / refinement |$\operatorname*{dit}\left( \sigma\right)\subseteq\operatorname*{dit}\left( \pi\right) $|
  • Top of order = all ‘elements’: |$U$|, all elements / |$\operatorname*{dit}(\mathbf{1})=U^{2}-\Delta$|, all dits
  • Bottom of order = no ‘elements’: |$\emptyset$|, no elements / |$\operatorname*{dit}(\mathbf{0})=\emptyset$|, no dits
  • Variables in formulas: subsets |$S$| of |$U$| / partitions |$\pi$| on |$U$|
  • Operations |$\vee,\wedge,\Rightarrow,...$|: subset operations / partition operations
  • Formula |$\Phi(x,y,...)$| holds: |$u$| is an element of |$\Phi(S,T,...)$| / |$\left( u,u^{\prime}\right) $| is a dit of |$\Phi(\pi,\sigma,...)$|
  • Valid formula: |$\Phi(S,T,...)=U$| for all |$S,T,...$| / |$\Phi(\pi,\sigma,...)=\mathbf{1}$| for all |$\pi,\sigma,...$|

5 Classical logical probability and logical entropy

George Boole [7] extended his logic of subsets to finite logical probability theory where, in the equiprobable case, the probability of a subset |$S$| (event) of a finite universe set (outcome set or sample space) |$U=\left\{ u_{1},...,u_{n}\right\} $| was the number of elements in |$S$| over the total number of elements: |$\Pr\left( S\right) =\frac{\left\vert S\right\vert }{\left\vert U\right\vert }=\sum_{u_{j}\in S}\frac{1}{\left\vert U\right\vert }$|. Pierre-Simon Laplace’s classical finite probability theory [30] also dealt with the case where the outcomes were assigned real point probabilities |$p=\left\{ p_{1},...,p_{n}\right\} $| so rather than summing the equal probabilities |$\frac{1}{\left\vert U\right\vert }$|, the point probabilities of the elements were summed: |$\Pr\left( S\right) =\sum_{u_{j}\in S}p_{j}=p\left( S\right) $|, where the equiprobable formula is the case |$p_{j}=\frac{1}{\left\vert U\right\vert }$| for |$j=1,...,n$|. The conditional probability of an event |$T\subseteq U$| given an event |$S$| is |$\Pr\left( T|S\right) =\frac{p\left( T\cap S\right) }{p\left( S\right)}$|.
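As a small illustrative sketch (the universe, point probabilities and events are our own assumptions), the logical probability of an event and the conditional probability can be computed directly from these formulas:

```python
# A small sketch of finite logical probability (the point probabilities
# and event names below are our illustrative assumptions).
p = {'u1': 0.1, 'u2': 0.2, 'u3': 0.3, 'u4': 0.4}   # point probabilities summing to 1

def Pr(S, p):
    """Pr(S) = sum of the point probabilities of the elements of S."""
    return sum(p[u] for u in S)

S = {'u1', 'u2'}
T = {'u2', 'u3'}
print(Pr(S, p))                    # Pr(S) = p_1 + p_2
print(Pr(T & S, p) / Pr(S, p))     # conditional probability Pr(T|S) = p(T ∩ S)/p(S)
```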

In his Fubini Lectures [38] (and in his lectures at MIT), Gian-Carlo Rota remarked, in view of the duality between partitions and subsets, that quantitatively the ‘lattice of partitions plays for information the role that the Boolean algebra of subsets plays for size or probability’ [29, p. 30] or, symbolically:
\begin{align} {\rm{information\,\,:\,\,partitions\,\,::\,\,probability\,\,:\,\,subsets}}. \end{align}

Since ‘Probability is a measure on the Boolean algebra of events’ that gives quantitatively the ‘intuitive idea of the size of a set’, we may ask by ‘analogy’ for some measure to capture a property for a partition like ‘what size is to a set.’ Rota goes on to ask:

How shall we be led to such a property? We have already an inkling of what it should be: it should be a measure of information provided by a random variable. Is there a candidate for the measure of the amount of information? [38, p. 67]

Our claim is quite simple: the analogue to the size of a subset is the size of the ditset or information set, the set of distinctions, of a partition.4 The normalized size of a subset is the logical probability of the event, and the normalized size of the ditset of a partition is, in the sense of measure theory, ‘the measure of the amount of information’ in a partition. Thus we define the logical entropy of a partition |$\pi=\left\{ B_{1},...,B_{m}\right\} $|, denoted |$h\left(\pi\right) $|, as the size of the ditset |$\operatorname*{dit}\left( \pi\right) \subseteq U\times U$| normalized by the size of |$U\times U$|:
\begin{align} h\left( \pi\right) =\frac{\left\vert \operatorname*{dit}\left( \pi\right) \right\vert }{\left\vert U\times U\right\vert }=\sum\nolimits_{\left( u_{j} ,u_{k}\right) \in\operatorname*{dit}\left( \pi\right) }\frac{1}{\left\vert U\right\vert }\frac{1}{\left\vert U\right\vert }\\ {\rm{Logical\,\,entropy\,\,of}}\,\,\pi\,\,{\rm{(equiprobable\,\,case)}}. \end{align}
This is just the product probability measure of the equiprobable or uniform probability distribution on |$U$| applied to the information set or ditset |$\operatorname*{dit}\left( \pi\right) $|⁠. The inditset of |$\pi$| is |$\operatorname*{indit}\left( \pi\right) =\cup_{i=1}^{m}\left( B_{i}\times B_{i}\right) $| so where |$p\left( B_{i}\right) =\frac{|B_{i}|}{\left\vert U\right\vert }$| in the equiprobable case, we have:
\begin{align} h\left( \pi\right) =\frac{\left\vert \operatorname*{dit}\left( \pi\right) \right\vert }{\left\vert U\times U\right\vert }=\frac{\left\vert U\times U\right\vert -\sum\nolimits_{i=1}^{m}\left\vert B_{i}\times B_{i}\right\vert}{\left\vert U\times U\right\vert }=1-\sum\nolimits_{i=1}^{m}\left( \frac{\left\vert B_{i}\right\vert }{\left\vert U\right\vert }\right) ^{2}=1-\sum\nolimits_{i=1} ^{m}p\left( B_{i}\right) ^{2}. \end{align}
The corresponding definition for the case of point probabilities |$p=\left\{p_{1},...,p_{n}\right\} $| is to just add up the probabilities of getting a particular distinction:
\begin{align} h_{p}\left( \pi\right) =\sum\nolimits_{\left( u_{j},u_{k}\right) \in \operatorname*{dit}\left( \pi\right) }p_{j}p_{k}\\ {\rm{Logical\,\,entropy\,\,of}}\,\,\pi\,\,{\rm{with\,\,point\,\,probabilities}}\,\,p. \end{align}
Taking |$p\left( B_{i}\right) =\sum_{u_{j}\in B_{i}}p_{j}$|⁠, the logical entropy with point probabilities is:
\begin{align} h_{p}\left( \pi\right) =\sum\nolimits_{\left( u_{j},u_{k}\right) \in \operatorname*{dit}\left( \pi\right) }p_{j}p_{k}=\sum\nolimits_{i\neq i^{\prime}}p\left( B_{i}\right) p\left( B_{i^{\prime}}\right) =2\sum\nolimits_{i<i^{\prime}}p\left( B_{i}\right) p\left( B_{i^{\prime}}\right) =1-\sum\nolimits_{i=1}^{m}p\left( B_{i}\right) ^{2}. \end{align}
Instead of being given a partition |$\pi=\left\{ B_{1},...,B_{m}\right\} $| on |$U$| with point probabilities |$p_{j}$| defining the finite probability distribution of block probabilities |$\left\{ p\left( B_{i}\right) \right\}_{i}$|⁠, one might be given only a finite probability distribution |$p=\left\{p_{1},...,p_{m}\right\} $|⁠. Then substituting |$p_{i}$| for |$p\left(B_{i}\right) $| gives the:5
\begin{align} h\left( p\right) =1-\sum\nolimits_{i=1}^{m}p_{i}^{2}=\sum\nolimits_{i\neq j}p_{i}p_{j}\\ {\rm{Logical\,\,entropy\,\,of\,\,a\,\,finite\,\,probability\,\,distribution}}. \end{align}

Since |$1=\left( \sum_{i=1}^{n}p_{i}\right) ^{2}=\sum_{i}p_{i} ^{2}+\sum_{i\neq j}p_{i}p_{j}$|⁠, we again have the logical entropy |$h\left(p\right) $| as the probability |$\sum_{i\neq j}p_{i}p_{j}$| of drawing a distinction in two independent samplings of the probability distribution |$p$|⁠.

That two-draw probability interpretation follows from the important fact that logical entropy is always the value of a probability measure. The product probability measure on the subsets |$S\subseteq U\times U$| is:
\begin{align} \mu\left( S\right) =\sum\nolimits\left\{ p_{i}p_{j}:\left( u_{i},u_{j}\right) \in S\right\}\\ {\it{Product\,\,measure}}\,\,\rm{on}\,\,U \times U. \end{align}

Then the logical entropy |$h\left( p\right) =\mu\left(\operatorname*{dit}(\mathbf{1}_{U})\right) $| is just the product measure of the information set or ditset |$\operatorname*{dit}\left( \mathbf{1}_{U}\right) =U\times U-\Delta$| of the discrete partition |$\mathbf{1}_{U}$| on |$U$|⁠.
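A short sketch (with an illustrative probability distribution of our own) checks that the product measure of the ditset of the discrete partition reproduces the formula |$h\left( p\right) =1-\sum_{i}p_{i}^{2}$|:

```python
from itertools import product

# Sketch: logical entropy as the product measure of the ditset of the discrete
# partition (the point probabilities are our illustrative assumption).
p = [0.5, 0.25, 0.125, 0.125]
n = len(p)

def mu(S, p):
    """Product probability measure on subsets S of {0,...,n-1} x {0,...,n-1}."""
    return sum(p[i] * p[j] for (i, j) in S)

# Ditset of the discrete partition 1_U: all off-diagonal pairs.
dit_discrete = {(i, j) for i, j in product(range(n), repeat=2) if i != j}

h = mu(dit_discrete, p)
assert abs(h - (1 - sum(q**2 for q in p))) < 1e-9    # h(p) = 1 - sum_i p_i^2
print(h)                                             # 0.65625 for this distribution
```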

There are also parallel ‘element |$\leftrightarrow$| distinction’ probabilistic interpretations:

  • |$\Pr\left( S\right) =\sum_{u_{j}\in S}p_{j}$| is the probability that a single draw from |$U$| gives an element |$u_{j}$| of |$S$|, and

  • |$h_{p}\left( \pi\right) =\mu\left( \operatorname*{dit}\left(\pi\right) \right) =\sum_{\left( u_{j},u_{k}\right) \in\operatorname*{dit}\left( \pi\right) }p_{j}p_{k}$| is the probability that two independent (with replacement) draws from |$U$| give a distinction |$\left( u_{j},u_{k}\right) $| of |$\pi$|.

The duality between logical probabilities and logical entropies based on the parallel roles of ‘its’ (elements of subsets) and ‘dits’ (distinctions of partitions) is summarized in Table 2.

Table 2. Classical logical probability theory and classical logical information theory (each row gives the logical-probability notion / the logical-information notion):

  • ‘Outcomes’: elements |$u\in U$|, |$U$| finite / dits |$\left( u,u^{\prime}\right) \in U\times U$|, |$U$| finite
  • ‘Events’: subsets |$S\subseteq U$| / ditsets |$\operatorname*{dit}\left( \pi\right) \subseteq U\times U$|
  • Equiprobable points: |$\Pr\left( S\right) =\frac{|S|}{\left\vert U\right\vert }$| / |$h\left( \pi\right) =\frac{\left\vert \operatorname*{dit}\left( \pi\right) \right\vert }{\left\vert U\times U\right\vert }$|
  • Point probabilities: |$\Pr\left( S\right) =\sum\left\{ p_{j}:u_{j}\in S\right\} $| / |$h\left( \pi\right) =\sum\left\{ p_{j}p_{k}:\left(u_{j},u_{k}\right) \in\operatorname*{dit}\left( \pi\right) \right\}$|
  • Interpretation: |$\Pr(S)=$| |$1$|-draw probability of an |$S$|-element / |$h\left(\pi\right) =$| |$2$|-draw probability of a |$\pi$|-distinction

This concludes the argument that logical information theory arises out of partition logic just as logical probability theory arises out of subset logic. Now we turn to the formulas of logical information theory and the comparison to the formulas of Shannon information theory.

6 Entropy as a measure of information

For a partition |$\pi=\left\{ B_{1},...,B_{m}\right\} $| with block probabilities |$p\left( B_{i}\right) $| (obtained using equiprobable points or with point probabilities), the Shannon entropy of the partition (using logs to base |$2$|⁠) is:
\begin{align} H\left( \pi\right) =-\sum\nolimits_{i=1}^{m}p\left( B_{i}\right) \log\left(p\left( B_{i}\right) \right) . \end{align}
Or if given a finite probability distribution |$p=\left\{p_{1},...,p_{m}\right\} $|⁠, the Shannon entropy of the probability distribution is:
\begin{align} H\left( p\right) =-\sum\nolimits_{i=1}^{m}p_{i}\log\left( p_{i}\right) . \end{align}

Shannon entropy and the many other suggested ‘entropies’ are routinely called ‘measures’ of information [2]. The formulas for mutual information, joint entropy and conditional entropy are defined so that these Shannon entropies satisfy Venn diagram formulas (e.g. [1, p. 109]; [35, p. 508]) that would follow automatically if Shannon entropy were a measure in the technical sense. As Lorne Campbell put it:

Certain analogies between entropy and measure have been noted by various authors. These analogies provide a convenient mnemonic for the various relations between entropy, conditional entropy, joint entropy, and mutual information. It is interesting to speculate whether these analogies have a deeper foundation. It would seem to be quite significant if entropy did admit an interpretation as the measure of some set. [9, p. 112]

For any finite set |$X$|, a measure |$\mu$| is a function |$\mu:\wp\left( X\right) \rightarrow \mathbb{R}$| such that:

  • (1) |$\mu\left( \emptyset\right) =0$|⁠,

  • (2) for any |$E\subseteq X$|⁠, |$\mu\left( E\right) \geq0$|⁠, and

  • (3) for any disjoint subsets |$E_{1}$| and |$E_{2}$|⁠, |$\mu(E_{1}\cup E_{2})=\mu\left( E_{1}\right) +\mu\left( E_{2}\right)$|⁠.

Considerable effort has been expended to try to find a framework in which Shannon entropy would be a measure in this technical sense and thus would satisfy the desiderata:

that |$H\left( \alpha\right) $| and |$H\left( \beta\right) $| are measures of sets, that |$H\left( \alpha,\beta\right)$| is the measure of their union, that |$I\left(\alpha,\beta\right) $| is the measure of their intersection, and that |$H\left( \alpha|\beta\right)$| is the measure of their difference. The possibility that |$I\left(\alpha ,\beta\right)$| is the entropy of the “intersection” of two partitions is particularly interesting. This “intersection,” if it existed, would presumably contain the information common to the partitions |$\alpha$| and |$\beta$|⁠. [9, p. 113]

But these efforts have not been successful beyond special cases such as |$2^{n}$| equiprobable elements where, as Campbell notes, the Shannon entropy is just the counting measure |$n$| of the minimum number of binary partitions it takes to distinguish all the elements. In general, Shannon entropy is not a measure on a set.6

In contrast, it is ‘quite significant’ that logical entropy is such a measure, namely the normalized counting measure on the ditset |$\operatorname*{dit}(\pi)$| representation of a partition |$\pi$| as a subset of the set |$U\times U$|. Thus all of Campbell’s desiderata are true when:

  • ‘sets’ = ditsets, the set of distinctions of partitions (or, in general, information sets of distinctions), and

  • ‘entropies’ = normalized counting measure of the ditsets (or, in general, product probability measure on the infosets), i.e. the logical entropies.

7 The dit-bit transform

The logical entropy formulas for various compound notions (e.g. conditional entropy, mutual information and joint entropy) stand in certain Venn diagram relationships because logical entropy is a measure. The Shannon entropy formulas for these compound notions, e.g. |$H(\alpha,\beta)=H\left( \alpha\right) +H\left( \beta\right) -I\left( \alpha,\beta\right)$|, are defined so as to satisfy the Venn diagram relationships as if Shannon entropy were a measure, when it is not. How can that be? Perhaps there is some ‘deeper foundation’ [9, p. 112] to explain why the Shannon formulas still satisfy those measure-like Venn diagram relationships.

Indeed, there is such a connection, the dit-bit transform. This transform can be heuristically motivated by considering two ways to treat the standard set |$U_{n}$| of |$n$| elements with the equal probabilities |$p_{0}=\frac{1}{n}$|⁠. In that basic case of an equiprobable set, we can derive the dit-bit connection, and then by using a probabilistic average, we can develop the Shannon entropy, expressed in terms of bits, from the logical entropy, expressed in terms of (normalized) dits, or vice versa.

Given |$U_{n}$| with |$n$| equiprobable elements, the number of dits (of the discrete partition on |$U_{n}$|⁠) is |$n^{2}-n$| so the normalized dit count is:
\begin{align} h\left( p_{0}\right) =h\left( \frac{1}{n}\right) =\frac{n^{2}-n}{n^{2}}=1-\frac{1}{n}=1-p_{0}\,\,{\rm{normalized\,\,dits}}. \end{align}

That is the dit-count or logical measure of the information in a set of |$n$| distinct elements (think of it as the logical entropy of the discrete partition on |$U_{n}$| with equiprobable elements).

But we can also measure the information in the set by the number of binary partitions it takes (on average) to distinguish the elements, and that bit-count is [23]:
\begin{align} H\left( p_{0}\right) =H\left( \frac{1}{n}\right) =\log\left( n\right)=\log\left( \frac{1}{p_{0}}\right)\,\,{\rm{bits}}.\\ {\it{Shannon-Hartley\,\,entropy\,\,for\,\,an\,\,equiprobable\,\,set}}\,\,U\,\,{\rm{of}}\,\,n\,\,{\rm{elements}} \end{align}

The dit-bit connection is that the Shannon-Hartley entropy |$H\left( p_{0}\right) =\log\left( \frac{1}{p_{0}}\right) $| will play the same role in the Shannon formulas that |$h\left( p_{0}\right) =1-p_{0}$| plays in the logical entropy formulas—when both are formulated as probabilistic averages or expectations.

The common thing being measured is an equiprobable |$U_{n}$| where |$n=\frac {1}{p_{0}}$|.7 The dit-count for |$U_{n}$| is |$h\left( p_{0}\right) =1-p_{0}$| and the bit-count for |$U_{n}$| is |$H\left( p_{0}\right) =\log\left(\frac{1}{p_{0}}\right) $|, and the dit-bit transform converts one count into the other. Using this dit-bit transform between the two different ways to quantify the ‘information’ in |$U_{n}$|, each entropy can be developed from the other. Nevertheless, this dit-bit connection should not be interpreted as if it were just converting a length from centimeters to inches or the like. Indeed, the (average) bit-count is a ‘coarser grid’ that loses some information in comparison to the (exact) dit-count, as shown by the analysis (below) of mutual information. There is no bit-count mutual information between independent probability distributions, but there is always dit-count mutual information between two (non-trivial) independent distributions (see the proposition below that non-empty supports always intersect).

We start with the logical entropy of a probability distribution |$p=\left\{p_{1},...,p_{n}\right\}$|⁠:
\begin{align} h\left( p\right) =\sum\nolimits_{i=1}^{n}p_{i}h\left( p_{i}\right) =\sum\nolimits_{i}p_{i}\left(1-p_{i}\right). \end{align}
It is expressed as the probabilistic average of the dit-counts or logical entropies of the sets |$U_{1/p_{i}}$| with |$\frac{1}{p_{i}}$| equiprobable elements. But if we switch to the binary-partition bit-counts of the information content of those same sets |$U_{1/p_{i}}$| of |$\frac{1}{p_{i}}$| equiprobable elements, then the bit-counts are |$H\left( p_{i}\right) =\log\left(\frac{1}{p_{i}}\right)$| and the probabilistic average is the Shannon entropy: |$H\left(p\right)=\sum_{i=1}^{n}p_{i}H\left( p_{i}\right) =\sum_{i}p_{i}\log\left(\frac{1}{p_{i}}\right) $|. Both entropies have the mathematical form of a probabilistic average or expectation:
\begin{align} \sum\nolimits_{i}p_{i}\left(\text{amount of `information' in }U_{1/p_{i}}\right) \end{align}
and differ by using either the dit-count or bit-count conception of information in |$U_{1/p_{i}}$|⁠.
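The following sketch (the distribution is our illustrative assumption) computes both probabilistic averages side by side, using the dit-count |$1-p_{i}$| and the bit-count |$\log\left( \frac{1}{p_{i}}\right) $| for the same sets |$U_{1/p_{i}}$|:

```python
from math import log2

# Sketch of the dit-bit transform: both entropies are probabilistic averages
# of the 'information in U_{1/p_i}', counted either in normalized dits (1 - p_i)
# or in bits (log2(1/p_i)).  The distribution is our illustrative assumption.
p = [0.5, 0.25, 0.125, 0.125]

h = sum(p_i * (1 - p_i) for p_i in p)          # logical entropy  h(p) = sum p_i (1 - p_i)
H = sum(p_i * log2(1 / p_i) for p_i in p)      # Shannon entropy  H(p) = sum p_i log2(1/p_i)

assert abs(h - (1 - sum(p_i**2 for p_i in p))) < 1e-9
print(h, H)                                    # 0.65625 and 1.75 for this distribution
```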

The graph of the dit-bit transform is familiar in information theory from the natural log inequality: |$\ln p_{i}\leq p_{i}-1$|⁠. Taking negatives of both sides gives the graph (Figure 1) of the dit-bit transform for natural logs: |$1-p_{i}\rightsquigarrow\ln\left( \frac{1}{p_{i}}\right) $|⁠.

Fig. 1. The dit-bit transform |$1-p\rightsquigarrow\ln\left(\frac{1}{p}\right)$| (natural logs).

The dit-bit connection carries over to all the compound notions of entropy so that the Shannon notions of conditional entropy, mutual information, cross-entropy and divergence can all be developed from the corresponding notions for logical entropy. Since the logical notions are the values of a probability measure, the compound notions of logical entropy have the usual Venn diagram relationships. And then by the dit-bit transform, those Venn diagram relationships carry over to the compound Shannon formulas since the dit-bit transform preserves sums and differences (i.e. is, in that sense, linear). That is why the Shannon formulas can satisfy the Venn diagram relationships even though Shannon entropy is not a measure.

The logical entropy formula |$h\left( p\right) =\sum_{i}p_{i}\left(1-p_{i}\right) $| (and the corresponding compound formulas) is put into that form of an average or expectation in order to apply the dit-bit transform. But logical entropy is the exact measure of the information set |$S_{i\neq i^{\prime}}=\left\{ \left( i,i^{\prime}\right) :i\neq i^{\prime}\right\}\subseteq\left\{ 1,...,n\right\} \times\left\{ 1,...,n\right\} $| for the product probability measure |$\mu:\wp\left( \left\{ 1,...,n\right\}^{2}\right) \rightarrow\left[ 0,1\right] $| where for |$S\subseteq\left\{1,...,n\right\} ^{2}$|, |$\mu\left( S\right) =\sum\left\{ p_{i}p_{i^{\prime}}:\left( i,i^{\prime}\right) \in S\right\} $|, i.e., |$h\left( p\right)=\mu\left( S_{i\neq i^{\prime}}\right) $|.

8 Information algebras and joint distributions

Consider a joint probability distribution |$\left\{ p\left( x,y\right) \right\} $| on the finite sample space |$X\times Y$| (where to avoid trivialities, assume |$\left\vert X\right\vert ,\left\vert Y\right\vert \geq 2$|⁠), with the marginal distributions |$\left\{ p\left( x\right) \right\} $| and |$\left\{ p\left( y\right) \right\} $| where |$p\left( x\right) =\sum_{y\in Y}p\left( x,y\right) $| and |$p\left( y\right) =\sum_{x\in X}p\left( x,y\right) $|⁠. For notational simplicity, the entropies can be considered as functions of the random variables or of their probability distributions, e.g., |$h\left( \left\{ p\left( x,y\right) \right\} \right) =h\left( X,Y\right) $|⁠, |$h\left( \left\{ p\left( x\right) \right\} \right) =h\left( X\right) $|⁠, and |$h\left( \left\{ p\left(y\right) \right\} \right) =h\left( Y\right) $|⁠. For the joint distribution, we have the:
\begin{align} h\left( X,Y\right) =\sum\nolimits_{x\in X,y\in Y}p\left( x,y\right) \left[1-p\left( x,y\right) \right] =1-\sum\nolimits_{x,y}p\left( x,y\right) ^{2}\\ {\it{Logical\,\,entropy\,\,of\,\,the\,\,joint\,\,distribution}} \end{align}
which is the probability that two samplings of the joint distribution will yield a pair of distinct ordered pairs |$\left(x,y\right) $|, |$\left( x^{\prime},y^{\prime}\right) \in X\times Y$|, i.e., with an |$X$|-distinction |$x\neq x^{\prime}$| or a |$Y$|-distinction |$y\neq y^{\prime}$| (since ordered pairs are distinct if distinct on one or more of the coordinates). The logical entropy notions for the probability distribution |$\left\{ p\left( x,y\right) \right\} $| on |$X\times Y$| are all product probability measures |$\mu\left( S\right) $| of certain subsets |$S\subseteq\left( X\times Y\right) ^{2}$|. These information sets or infosets are defined solely in terms of equations and inequations (the ‘calculus of identity and difference’) independent of any probability distributions.
For the logical entropies defined so far, the infosets are:
\begin{align} S_{X}=\left\{ \left( \left( x,y\right) ,\left( x^{\prime},y^{\prime}\right) \right) :x\neq x^{\prime}\right\} ,\\ h\left( X\right) =\mu\left( S_{X}\right) =1-\sum\nolimits_{x}p\left( x\right)^{2};\\ S_{Y}=\left\{ \left( \left( x,y\right) ,\left( x^{\prime},y^{\prime}\right) \right) :y\neq y^{\prime}\right\} ,\\ h\left( Y\right) =\mu\left( S_{Y}\right) =1-\sum\nolimits_{y}p\left( y\right)^{2};\,\,{\rm{and}}\\ S_{X\vee Y}=\left\{ \left( \left( x,y\right) ,\left( x^{\prime},y^{\prime}\right) \right) :x\neq x^{\prime}\vee y\neq y^{\prime}\right\}=S_{X}\cup S_{Y},\\ h\left( X,Y\right) =\mu\left( S_{X\vee Y}\right) =\mu\left( S_{X}\cup S_{Y}\right) =1-\sum\nolimits_{x,y}p\left( x,y\right) ^{2}. \end{align}
The infosets |$S_{X}$| and |$S_{Y}$|⁠, as well as their complements8
\begin{align} S_{\lnot X}=\left\{ \left( \left( x,y\right) ,\left( x^{\prime },y^{\prime}\right) \right) :x=x^{\prime}\right\} =\left( X\times Y\right) ^{2}-S_{X}\,\,{\rm{and}}\\ S_{\lnot Y}=\left\{ \left( \left( x,y\right) ,\left( x^{\prime },y^{\prime}\right) \right) :y=y^{\prime}\right\} =\left( X\times Y\right) ^{2}-S_{Y}, \end{align}
generate a Boolean subalgebra |$\mathcal{I}\left( X\times Y\right) $| of |$\wp\left( \left( X\times Y\right) \times\left( X\times Y\right) \right) $| which will be called the information algebra of |$X\times Y$|. It is defined independently of any probability measure |$\left\{ p\left(x,y\right) \right\} $| on |$X\times Y$|; any such measure defines the product measure |$\mu$| on |$\left( X\times Y\right) \times\left( X\times Y\right) $|, and the corresponding logical entropies are the values of the product measure on the infosets in |$\mathcal{I}\left( X\times Y\right) $|.
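As a computational sketch (in Python, with an illustrative joint distribution of our own), the infosets |$S_{X}$|, |$S_{Y}$| and |$S_{X\vee Y}=S_{X}\cup S_{Y}$| can be built directly from the inequations and the logical entropies obtained as product measures:

```python
from itertools import product

# Sketch: the infosets S_X, S_Y and the product measure mu on (X x Y)^2 for an
# illustrative joint distribution (the values and names are our assumptions).
X, Y = ['x1', 'x2'], ['y1', 'y2']
p = {('x1', 'y1'): 0.3, ('x1', 'y2'): 0.2, ('x2', 'y1'): 0.1, ('x2', 'y2'): 0.4}

points = list(product(X, Y))                   # sample space X x Y
square = list(product(points, repeat=2))       # (X x Y)^2

S_X = {(u, v) for (u, v) in square if u[0] != v[0]}    # x-coordinates differ
S_Y = {(u, v) for (u, v) in square if u[1] != v[1]}    # y-coordinates differ

def mu(S):
    """Product measure: mu(S) = sum of p(x,y) p(x',y') over the pairs in S."""
    return sum(p[u] * p[v] for (u, v) in S)

h_X, h_Y, h_XY = mu(S_X), mu(S_Y), mu(S_X | S_Y)
assert abs(h_XY - (1 - sum(q**2 for q in p.values()))) < 1e-9   # h(X,Y) = 1 - sum p(x,y)^2
print(h_X, h_Y, h_XY)
```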

The four atoms |$S_{X}\cap S_{Y}=S_{X\wedge Y}$|, |$S_{X}\cap S_{\lnot Y}=S_{X\wedge\lnot Y}$|, |$S_{\lnot X}\cap S_{Y}=S_{\lnot X\wedge Y}$|, and |$S_{\lnot X}\cap S_{\lnot Y}=S_{\lnot X\wedge\lnot Y}$| in the information Boolean algebra are non-empty and correspond to the four rows in the truth table (Table 3) for the two propositions |$x\neq x^{\prime}$| and |$y\neq y^{\prime}$| (and to the four disjoint areas in the Venn diagram of Figure 2).

Fig. 2. |$h\left(X,Y\right) =h\left( X|Y\right) +h\left( Y\right) =h\left( Y|X\right) +h\left( X\right)$|.

Table 3. Truth table for atoms in the information algebra (columns: |$x\neq x^{\prime}$|, |$y\neq y^{\prime}$|, |$X\not \equiv Y$|, |$X\supset Y$|):

  • |$S_{X\wedge Y}$|: T, T, F, T
  • |$S_{X\wedge\lnot Y}$|: T, F, T, F
  • |$S_{Y\wedge\lnot X}$|: F, T, T, T
  • |$S_{\lnot X\wedge\lnot Y}$|: F, F, F, T

For |$n=2$| variables |$X$| and |$Y$|⁠, there are |$2^{\left( 2^{n}\right) }=16$| ways to fill in the T’s and F’s to define all the possible Boolean combinations of the two propositions so there are |$16$| subsets in the information algebra |$\mathcal{I}\left( X\times Y\right)$|⁠. The |$15$| non-empty subsets in |$\mathcal{I}\left( X\times Y\right) $| are defined in disjunctive normal form by the union of the atoms that have a T in their row. For instance, the set |$S_{X\not \equiv Y}$| corresponding to the symmetric difference or inequivalence |$\left( x\neq x^{\prime}\right) \not \equiv \left( y\neq y^{\prime}\right) $| is |$S_{X\not \equiv Y}=S_{X\wedge\lnot Y}\cup S_{Y\wedge\lnot X}=\left( S_{X}-S_{Y}\right) \cup\left( S_{Y} -S_{X}\right) $|⁠.

The information algebra |$\mathcal{I}\left( X\times Y\right) $| is a finite combinatorial structure defined solely in terms of |$X\times Y$| using only the distinctions and indistinctions between the elements of |$X$| and |$Y$|⁠. Any equivalence between Boolean expressions that is a tautology, e.g., |$x\neq x^{\prime}\equiv\left( x\neq x^{\prime}\wedge\lnot\left( y\neq y^{\prime }\right) \right) \vee\left( x\neq x^{\prime}\wedge y\neq y^{\prime}\right) $|⁠, gives a set identity in the information Boolean algebra, e.g., |$S_{X}=\left( S_{X}\cap S_{\lnot Y}\right) \cup\left( S_{X}\cap S_{Y}\right) $|⁠. Since that union is disjoint, any probability distribution on |$X\times Y$| gives the logically necessary identity |$h\left( X\right) =h\left( X|Y\right) +m\left( X,Y\right) $| (see below).

In addition to the logically necessary relationships between the logical entropies, other relationships may hold depending on the particular probability distribution on |$X\times Y$|⁠. Even though all the |$15$| subsets in the information algebra aside from the empty set |$\emptyset$| are always non-empty, some of the logical entropies can still be |$0$|⁠. Indeed, |$h\left( X\right) =0$| iff the marginal distribution on |$X$| has |$p\left( x\right) =1$| for some |$x\in X$|⁠. These more specific relationships will depend not just on the infosets but also on their positive supports (which depend on the probability distribution):
\begin{align} Supp\left( S_{X}\right) =\left\{ \left( \left( x,y\right) ,\left( x^{\prime},y^{\prime}\right) \right) :x\neq x^{\prime},p\left( x,y\right) p\left( x^{\prime},y^{\prime}\right) >0\right\} \subseteq\left( X\times Y\right) ^{2}\\ Supp\left( S_{Y}\right) =\left\{ \left( \left( x,y\right) ,\left( x^{\prime},y^{\prime}\right) \right) :y\neq y^{\prime},p\left( x,y\right) p\left( x^{\prime},y^{\prime}\right) >0\right\} \subseteq\left( X\times Y\right) ^{2}. \end{align}
Now |$Supp\left( S_{X}\right) \subseteq S_{X}$| and |$Supp\left( S_{Y}\right) \subseteq S_{Y}$|⁠, and for the product probability measure |$\mu$| on |$\left(X\times Y\right) ^{2}$|⁠, the sets |$S_{X}-Supp\left( S_{X}\right) $| and |$S_{Y}-Supp\left( S_{Y}\right) $| are of measure |$0$| so:
\begin{align} \mu\left( Supp\left( S_{X}\right) \right) =\mu\left( S_{X}\right) =h\left( X\right)\\ \mu\left( Supp\left( S_{Y}\right) \right) =\mu\left( S_{Y}\right)=h\left( Y\right). \end{align}

Consider |$S_{X\supset Y}=S_{X\wedge Y}\cup S_{Y\wedge\lnot X}\cup S_{\lnot X\wedge\lnot Y}$| and suppose that the probability distribution gives |$\mu\left( S_{X\supset Y}\right) =1$| so that |$\mu\left( S_{X\wedge\lnot Y}\right) =0$|. That means that in a double draw of |$\left( x,y\right) $| and |$\left( x^{\prime},y^{\prime}\right) $|, if |$x\neq x^{\prime}$|, then there is zero probability that |$y=y^{\prime}$|, so |$x\neq x^{\prime}$| implies (probabilistically) |$y\neq y^{\prime}$|. In terms of the Venn diagram, the |$h\left( X\right) $| area is a subset of the |$h\left( Y\right) $| area, i.e., |$Supp\left( S_{X}\right) \subseteq Supp\left( S_{Y}\right) $| in terms of the underlying sets.

9 Conditional entropies

9.1 Logical conditional entropy

All the compound notions for Shannon and logical entropy could be developed using either partitions (with point probabilities) or probability distributions of random variables as the given data. Since the treatment of Shannon entropy is most often in terms of probability distributions, we will stick to that case for both types of entropy. The formula for the compound notion of logical entropy will be developed first, and then the formula for the corresponding Shannon compound entropy will be obtained by the dit-bit transform.

The general idea of a conditional entropy of a random variable |$X$| given a random variable |$Y$| is to measure the information in |$X$| when we take away the information contained in |$Y$|⁠, i.e., the set difference operation in terms of information sets.

For the definition of the conditional entropy |$h\left( X|Y\right) $|⁠, we simply take the product measure of the set of pairs |$\left( x,y\right) $| and |$\left( x^{\prime},y^{\prime}\right) $| that give an |$X$|-distinction but not a |$Y$|-distinction. Hence we use the inequation |$x\neq x^{\prime}$| for the |$X$|-distinction and negate the |$Y$|-distinction |$y\neq y^{\prime}$| to get the infoset that is the difference of the infosets for |$X$| and |$Y$|⁠:
\begin{align} S_{X\wedge\lnot Y}=\left\{ \left( \left( x,y\right) ,\left( x^{\prime },y^{\prime}\right) \right) :x\neq x^{\prime}\wedge y=y^{\prime}\right\} =S_{X}-S_{Y}\,\,{\rm{so}}\\ h\left( X|Y\right) =\mu\left( S_{X\wedge\lnot Y}\right) =\mu\left(S_{X}-S_{Y}\right) . \end{align}
Since |$S_{X\vee Y}$| can be expressed as the disjoint union |$S_{X\vee Y}=S_{X\wedge\lnot Y}\uplus S_{Y}$|⁠, we have for the measure |$\mu$|⁠:
\begin{align} h\left( X,Y\right) =\mu\left( S_{X\vee Y}\right) =\mu\left( S_{X\wedge\lnot Y}\right) +\mu\left( S_{Y}\right) =h\left( X|Y\right) +h\left( Y\right) , \end{align}
which is illustrated in the Venn diagram Figure 2.
In terms of the probabilities:
\begin{align} h\left( X|Y\right) =h\left( X,Y\right) -h\left( Y\right) =\sum\nolimits_{x,y}p\left( x,y\right) \left( 1-p\left( x,y\right) \right) -\sum\nolimits_{y}p\left( y\right) \left( 1-p\left( y\right) \right)\\ =\sum\nolimits_{x,y}p\left( x,y\right) \left[ \left( 1-p\left( x,y\right)\right) -\left( 1-p\left( y\right) \right) \right]\\ {\it{Logical\,\,conditional\,\,entropy\,\,of}}\,\,X\,\,{\it{given}}\,\,Y. \end{align}
Also of interest is the:
\begin{align} d\left( X,Y\right) =h\left( X|Y\right) +h\left( Y|X\right) =\mu\left( S_{X}\not \equiv S_{Y}\right),\\ {\it{Logical\,\,distance\,\,metric}} \end{align}
where |$\not \equiv $| is the inequivalence (symmetric difference) operation on sets. This logical distance is a Hamming-style distance function [34, p. 66] based on the difference between the random variables. Unlike the Kullback–Leibler divergence (see below), this logical distance is a distance metric.
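A short sketch (reusing the illustrative joint distribution from the earlier sketch, which is our own assumption) computes |$h\left( X|Y\right) $| as the measure of the set difference and the logical distance as the measure of the symmetric difference, checking the Venn diagram identities:

```python
from itertools import product

# Sketch: h(X|Y) as the product measure of the set difference S_X - S_Y, and
# the logical distance as the measure of the symmetric difference.
X, Y = ['x1', 'x2'], ['y1', 'y2']
p = {('x1', 'y1'): 0.3, ('x1', 'y2'): 0.2, ('x2', 'y1'): 0.1, ('x2', 'y2'): 0.4}
points = list(product(X, Y))
square = list(product(points, repeat=2))

S_X = {(u, v) for (u, v) in square if u[0] != v[0]}
S_Y = {(u, v) for (u, v) in square if u[1] != v[1]}
mu = lambda S: sum(p[u] * p[v] for (u, v) in S)

h_X_given_Y = mu(S_X - S_Y)                    # h(X|Y) = mu(S_X - S_Y)
h_Y_given_X = mu(S_Y - S_X)                    # h(Y|X) = mu(S_Y - S_X)
d_XY = mu(S_X ^ S_Y)                           # logical distance d(X,Y)

assert abs(mu(S_X | S_Y) - (h_X_given_Y + mu(S_Y))) < 1e-9   # h(X,Y) = h(X|Y) + h(Y)
assert abs(d_XY - (h_X_given_Y + h_Y_given_X)) < 1e-9        # d(X,Y) = h(X|Y) + h(Y|X)
```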

9.2 Shannon conditional entropy

Given the joint distribution |$\left\{ p\left( x,y\right) \right\} $| on |$X\times Y$|⁠, the conditional probability distribution for a specific |$y_{0}\in Y$| is |$p\left( x|y_{0}\right) =\frac{p\left( x,y_{0}\right) }{p\left(y_{0}\right) }$| which has the Shannon entropy: |$H\left( X|y_{0}\right)=\sum_{x}p\left( x|y_{0}\right) \log\left( \frac{1}{p\left( x|y_{0}\right) }\right) $|⁠. Then the Shannon conditional entropy |$H\left(X|Y\right) $| is usually defined as the average of these entropies:
\begin{align} H\left( X|Y\right) =\sum\nolimits_{y}p\left( y\right) \sum\nolimits_{x}\frac{p\left(x,y\right) }{p\left( y\right) }\log\left( \frac{p\left( y\right)}{p\left( x,y\right) }\right) =\sum\nolimits_{x,y}p\left( x,y\right) \log\left(\frac{p\left( y\right) }{p\left( x,y\right) }\right)\\ {\it{Shannon\,\,conditional\,\,entropy\,\,of}}\,\,X\,\,{\it{given}}\,\,Y. \end{align}
All the Shannon notions can be obtained by the dit-bit transform of the corresponding logical notions. Applying the transform |$1-p\rightsquigarrow \log\left( \frac{1}{p}\right) $| to the logical conditional entropy expressed as an average of ‘|$1-p$|’ expressions: |$h\left(X|Y\right) =\sum_{x,y}p\left( x,y\right) \left[ \left( 1-p\left( x,y\right) \right) -\left(1-p\left( y\right) \right) \right]$|⁠, yields the Shannon conditional entropy:
\begin{align} H\left( X|Y\right) =\sum\nolimits_{x,y}p\left( x,y\right) \left[ \log\left( \frac{1}{p\left( x,y\right) }\right) -\log\left( \frac{1}{p\left( y\right) }\right) \right] =\sum\nolimits_{x,y}p\left( x,y\right) \log\left( \frac{p\left( y\right) }{p\left( x,y\right) }\right) . \end{align}
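For the same illustrative distribution used above, the Shannon conditional entropy can be computed either from |$H\left( X,Y\right) -H\left( Y\right) $| or directly as the average of the entropies |$H\left( X|y\right) $|:
\begin{align} H\left( X,Y\right) =\tfrac{1}{2}\log2+\tfrac{1}{4}\log4+\tfrac{1}{4}\log4=\tfrac{3}{2},\,\,H\left( Y\right) =1,\,\,H\left( X|Y\right) =\tfrac{3}{2}-1=\tfrac{1}{2},\\ {\rm{which\,\,agrees\,\,with}}\,\,H\left( X|Y\right) =\tfrac{1}{2}H\left( X|y=0\right) +\tfrac{1}{2}H\left( X|y=1\right) =\tfrac{1}{2}\cdot0+\tfrac{1}{2}\cdot1=\tfrac{1}{2}. \end{align}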

Since the dit-bit transform preserves sums and differences, we will have the same sort of Venn diagram formula for the Shannon entropies and this can be illustrated in the analogous ‘mnemonic’ Venn diagram (Figure 3).

Fig. 3.

|$H\left( X,Y\right)=H\left( X|Y\right) +H\left( Y\right) =H\left( Y|X\right)+H\left(X\right)$|⁠.

10 Mutual information

10.1 Logical mutual information

Intuitively, the mutual logical information |$m\left( X,Y\right) $| in the joint distribution |$\left\{ p\left( x,y\right) \right\} $| is the probability that a sampled pair of pairs |$\left( x,y\right) $| and |$\left(x^{\prime},y^{\prime}\right) $| is distinguished in both coordinates, i.e., a distinction |$x\neq x^{\prime}$| of |$p\left( x\right) $| and a distinction |$y\neq y^{\prime}$| of |$p\left( y\right) $|. In terms of subsets, the subset for the mutual information is the intersection of the infosets for |$X$| and |$Y$|:
\begin{align} S_{X\wedge Y}=S_{X}\cap S_{Y}\,\,{\rm{so}}\,\,m\left( X,Y\right) =\mu\left(S_{X\wedge Y}\right) =\mu\left( S_{X}\cap S_{Y}\right). \end{align}
In terms of disjoint unions of subsets:
\begin{align} S_{X\vee Y}=S_{X\wedge\lnot Y}\uplus S_{Y\wedge\lnot X}\uplus S_{X\wedge Y} \end{align}
so
\begin{align} h\left( X,Y\right) =\mu\left( S_{X\vee Y}\right) =\mu\left(S_{X\wedge\lnot Y}\right) +\mu\left( S_{Y\wedge\lnot X}\right) +\mu\left(S_{X\wedge Y}\right)\\ =h\left( X|Y\right) +h\left( Y|X\right) +m\left( X,Y\right) {\rm{(as\,\,in\,\,Figure\,\,4)}}, \end{align}
or:
\begin{align} m\left( X,Y\right) =h\left( X\right) +h\left( Y\right) -h\left( X,Y\right) . \end{align}
Expanding |$m\left( X,Y\right) =h\left( X\right) +h\left( Y\right) -h\left( X,Y\right) $| in terms of probability averages gives:
\begin{align} m\left( X,Y\right) =\sum\nolimits_{x,y}p\left( x,y\right) \left[ \left[1-p\left( x\right) \right] +\left[ 1-p\left( y\right) \right] -\left[1-p\left( x,y\right) \right] \right]\\ {\it{Logical\,\,mutual\,\,information\,\,in\,\,a\,\,joint\,\,probability\,\,distribution}}. \end{align}
Since |$S_{Y}=S_{Y\wedge\lnot X}\cup S_{Y\wedge X}=\left( S_{Y}-S_{X}\right) \cup\left( S_{Y}\cap S_{X}\right) $| and the union is disjoint, we have the formula:
\begin{align} h\left( Y\right) =h\left( Y|X\right) +m\left( X,Y\right) \end{align}
which can be taken as the basis for a logical analysis of variation (ANOVA) for categorical data. The total variation in |$Y$|⁠, |$h(Y)$|⁠, is equal to the variation in |$Y$| ‘within’ |$X$| (i.e. with no variation in |$X$|⁠), |$h\left( Y|X\right) $|⁠, plus the variation ‘between’ |$Y$| and |$X$| (i.e. variation in both |$X$| and |$Y$|⁠), |$m\left( X,Y\right) $|⁠.
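With the same illustrative distribution as above, this ANOVA-style split can be checked directly:
\begin{align} m\left( X,Y\right) =h\left( X\right) +h\left( Y\right) -h\left( X,Y\right) =\tfrac{3}{8}+\tfrac{1}{2}-\tfrac{5}{8}=\tfrac{1}{4}\,\,{\rm{and}}\,\,h\left( Y|X\right) +m\left( X,Y\right) =\tfrac{1}{4}+\tfrac{1}{4}=\tfrac{1}{2}=h\left( Y\right) . \end{align}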

It is a non-trivial fact that two non-empty partition ditsets always intersect [12]. The same holds for the positive supports of the basic infosets |$S_{X}$| and |$S_{Y}$|⁠.

 
Proposition 1

(Two non-empty supports always intersect) If |$h\left( X\right) h\left( Y\right) >0$|⁠, then |$m\left( X,Y\right) >0$|⁠.

 
Proof.

Assuming |$h\left( X\right) h\left( Y\right) >0$|⁠, the support |$Supp\left( S_{X}\right) $| is non-empty and thus there are two pairs |$\left( x,y\right) $| and |$\left( x^{\prime},y^{\prime}\right) $| such that |$x\neq x^{\prime}$| and |$p\left( x,y\right) p\left(x^{\prime},y^{\prime}\right) >0$|⁠. If |$y\neq y^{\prime}$| then |$\left( \left( x,y\right) ,\left(x^{\prime},y^{\prime}\right) \right) \in Supp\left( S_{Y}\right) $| as well and we are finished, i.e., |$Supp\left( S_{X}\right) \cap Supp\left( S_{Y}\right) \neq\emptyset$|⁠. Hence assume |$y=y^{\prime}$|⁠. Since |$Supp\left( S_{Y}\right) $| is also non-empty and thus |$p\left(y\right) \neq1$|⁠, there is another |$y^{\prime\prime}$| such that for some |$x^{\prime\prime}$|⁠, |$p\left( x^{\prime\prime},y^{\prime\prime}\right) >0$|⁠. Since |$x^{\prime\prime}$| cannot be equal to both |$x$| and |$x^{\prime}$| (by the anti-transitivity of distinctions), at least one of the pairs |$\left( \left( x,y\right) ,\left( x^{\prime\prime},y^{\prime\prime}\right) \right) $| or |$\left( \left( x^{\prime},y\right) ,\left( x^{\prime\prime},y^{\prime\prime}\right) \right) $| is in both |$Supp\left( S_{X}\right) $| and |$Supp\left( S_{Y}\right) $|⁠, and thus the product measure on |$S_{\wedge\left\{ X,Y\right\} }=\left\{ \left( \left( x,y\right) ,\left( x^{\prime },y^{\prime}\right) \right) :x\neq x^{\prime}\wedge y\neq y^{\prime }\right\} $| is positive, i.e., |$m\left( X,Y\right) >0$|⁠. ■

10.2 Shannon mutual information

Applying the dit-bit transform |$1-p\rightsquigarrow\log\left( \frac{1} {p}\right) $| to the logical mutual information formula
\begin{align} m\left( X,Y\right) =\sum\nolimits_{x,y}p\left( x,y\right) \left[ \left[ 1-p\left( x\right) \right] +\left[ 1-p\left( y\right) \right] -\left[ 1-p\left( x,y\right) \right] \right] \end{align}
expressed in terms of probability averages gives the corresponding Shannon notion:
\begin{align} I\left( X,Y\right) =\sum\nolimits_{x,y}p\left( x,y\right) \left[ \left[\log\left( \frac{1}{p\left( x\right) }\right) \right] +\left[\log\left( \frac{1}{p\left( y\right) }\right) \right] -\left[\log\left( \frac{1}{p\left( x,y\right) }\right) \right] \right]\\ =\sum\nolimits_{x,y}p\left( x,y\right) \log\left( \frac{p\left( x,y\right)}{p\left( x\right) p\left( y\right) }\right)\\ {\it{Shannon\,\,mutual\,\,information\,\,in\,\,a\,\,joint\,\,probability\,\,distribution}}. \end{align}
Since the dit-bit transform preserves sums and differences, the logical formulas for the Shannon entropies gives the mnemonic Figure 5:
\begin{align} I\left( X,Y\right) =H\left( X\right) +H\left( Y\right) -H\left( X,Y\right) =H\left( X,Y\right) -H\left( X|Y\right) -H\left( Y|X\right) . \end{align}
Fig. 5.

|$H\left(X,Y\right) =H\left( X|Y\right) +H\left( Y|X\right) +I\left( X,Y\right)$|⁠.

This is the usual Venn diagram for the Shannon entropy notions that needs to be explained—since the Shannon entropies are not measures. Of course, one could just say the relationship holds for the Shannon entropies because that is how they were defined. It may seem a happy accident that the Shannon definitions all satisfy the measure-like Venn diagram formulas, but as one author put it: ‘Shannon carefully contrived for this “accident” to occur’ [39, p. 153]. As noted above, Campbell asked if ‘these analogies have a deeper foundation’ [9, p. 112] and the dit-bit transform answers that question.

11 Independent joint distributions

A joint probability distribution |$\left\{ p\left( x,y\right) \right\} $| on |$X\times Y$| is independent if each value is the product of the marginals: |$p\left( x,y\right) =p\left( x\right) p\left( y\right) $|⁠.

For an independent distribution, the Shannon mutual information
\begin{align} I\left( X,Y\right) =\sum\nolimits_{x\in X,y\in Y}p\left( x,y\right) \log\left( \frac{p\left( x,y\right) }{p\left( x\right) p\left( y\right) }\right) \end{align}
is immediately seen to be zero so we have:
\begin{align} H\left( X,Y\right) =H\left( X\right) +H\left( Y\right)\\ {\rm{Shannon\,\,entropies\,\,for\,\,independent}}\,\,\left\{ p\left( x,y\right) \right\} . \end{align}
For the logical mutual information |$m(X,Y)$|⁠, independence gives:
\begin{align*} m\left( X,Y\right) & = {\textstyle\sum\nolimits_{x,y}} p\left( x,y\right) \left[ 1-p\left( x\right) -p\left( y\right) +p\left( x,y\right) \right] \\ & = {\textstyle\sum\nolimits_{x,y}} p\left( x\right) p\left( y\right) \left[ 1-p\left( x\right) -p\left( y\right) +p\left( x\right) p\left( y\right) \right] \\ & = {\textstyle\sum\nolimits_{x}} p\left( x\right) \left[ 1-p\left( x\right) \right] {\textstyle\sum\nolimits_{y}} p\left( y\right) \left[ 1-p\left( y\right) \right] \\ & =h\left( X\right) h\left( Y\right) \end{align*}
\begin{align} {\rm{Logical\,\,entropies\,\,for\,\,independent}}\,\,\left\{p\left(x,y\right) \right\}. \end{align}

Independence means the joint probability |$p\left( x,y\right) $| can always be separated into |$p\left( x\right) $| times |$p\left( y\right) $|⁠. This carries over to the standard two-draw probability interpretation of logical entropy. Thus independence means that in two draws, the probability |$m\left( X,Y\right) $| of getting distinctions in both |$X$| and |$Y$| is equal to the probability |$h\left( X\right) $| of getting an |$X$|-distinction times the probability |$h\left( Y\right) $| of getting a |$Y$|-distinction. Similarly, Table 4 shows that, under independence, the four atomic areas in Figure 4 can each be expressed as the four possible products of the areas |$\left\{ h\left( X\right) ,1-h\left( X\right) \right\} $| and |$\left\{ h\left( Y\right) ,1-h\left( Y\right) \right\} $| that are defined in terms of one variable.
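As an illustrative independent example (marginals chosen purely for convenience), take |$p\left( x\right) =\left( \frac{1}{2},\frac{1}{2}\right) $| and |$p\left( y\right) =\left( \frac{1}{3},\frac{2}{3}\right) $| with |$p\left( x,y\right) =p\left( x\right) p\left( y\right) $|. Then:
\begin{align} h\left( X\right) =\tfrac{1}{2},\,\,h\left( Y\right) =1-\left( \tfrac{1}{9}+\tfrac{4}{9}\right) =\tfrac{4}{9},\,\,h\left( X,Y\right) =1-\tfrac{1}{2}\cdot\tfrac{5}{9}=\tfrac{13}{18},\\ m\left( X,Y\right) =\tfrac{1}{2}+\tfrac{4}{9}-\tfrac{13}{18}=\tfrac{2}{9}=h\left( X\right) h\left( Y\right) . \end{align}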

Fig. 4.

|$h\left( X,Y\right) =h\left( X|Y\right) +h\left( Y|X\right) +m\left( X,Y\right)$|⁠.

Table 4.

Logical entropy relationships under independence

Atomic areas: |$X$| factor |$\times$| |$Y$| factor
|$m\left( X,Y\right) =h\left( X\right) \times h\left( Y\right)$|
|$h\left( X|Y\right) =h\left( X\right) \times\left[ 1-h\left( Y\right) \right]$|
|$h\left( Y|X\right) =\left[ 1-h\left( X\right) \right] \times h\left( Y\right)$|
|$1-h\left( X,Y\right) =\left[ 1-h\left( X\right) \right] \times\left[ 1-h\left( Y\right) \right]$|

The non-empty-supports-always-intersect proposition shows that |$h\left( X\right) h\left(Y\right) >0$| implies |$m\left( X,Y\right) >0$|⁠, and thus that logical mutual information |$m\left(X,Y\right) $| is still positive for independent distributions when |$h\left( X\right) h\left(Y\right) >0$|⁠, in which case |$m\left( X,Y\right) =h\left( X\right) h\left( Y\right) $|⁠. This is a striking difference between the average bit-count Shannon entropy and the dit-count logical entropy. Aside from the waste case where |$h\left( X\right) h\left( Y\right) =0$|⁠, there are always positive probability mutual distinctions for |$X$| and |$Y$|⁠, and that dit-count information is not recognized by the coarser-grained average bit-count Shannon entropy.

12 Cross-entropies and divergences

Given two probability distributions |$p=\left\{ p_{1},...,p_{n}\right\} $| and |$q=\left\{ q_{1},...,q_{n}\right\} $| on the same sample space |$U=\left\{1,...,n\right\} $|, we can again consider the drawing of a pair of points, but where the first draw is according to |$p$| and the second draw according to |$q$|. The probability that the two drawn points are distinct is then a natural, more general notion of logical entropy:
\begin{align} h\left( p\Vert q\right) =\sum\nolimits_{i}p_{i}(1-q_{i})=1-\sum\nolimits_{i}p_{i}q_{i}\\ {\it{Logical\,\,cross\,\,entropy\,\,of}}\,\,p\,\,{\it{and}}\,\,q \end{align}
which is symmetric. Adding subscripts to indicate which probability measures are being used, the value of the product probability measure |$\mu_{pq}$| on any |$S\subseteq U^{2}$| is |$\mu_{pq}\left( S\right) =\sum\left\{ p_{i}q_{i^{\prime}}:\left( i,i^{\prime}\right) \in S\right\}$|⁠. Thus on the standard information set |$S_{i\neq i^{\prime}}=\left\{ \left(i,i^{\prime}\right) \in U^{2}:i\neq i^{\prime}\right\} =\operatorname*{dit}\left( \mathbf{1}_{U}\right) $|⁠, the value is:
\begin{align} h\left( p||q\right) =\mu_{pq}\left( S_{i\neq i^{\prime}}\right) . \end{align}

The logical cross entropy is the same as the logical entropy when the distributions are the same, i.e., if |$p=q$|⁠, then |$h\left( p\Vert q\right) =h\left( p\right) =\mu_{p}\left( S_{i\neq i^{\prime}}\right) $|⁠.
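For example (illustrative values only), with |$p=\left( \frac{1}{2},\frac{1}{2}\right) $| and |$q=\left( \frac{1}{3},\frac{2}{3}\right) $|:
\begin{align} h\left( p\Vert q\right) =1-\left( \tfrac{1}{2}\cdot\tfrac{1}{3}+\tfrac{1}{2}\cdot\tfrac{2}{3}\right) =\tfrac{1}{2},\,\,{\rm{while}}\,\,h\left( p\right) =\tfrac{1}{2}\,\,{\rm{and}}\,\,h\left( q\right) =\tfrac{4}{9}. \end{align}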

Although the logical cross entropy formula is symmetrical in |$p$| and |$q$|⁠, there are two different ways to express it as an average to apply the dit-bit transform: |$\sum_{i}p_{i}(1-q_{i})$| and |$\sum_{i}q_{i}\left( 1-p_{i}\right) $|⁠. The two transforms are the two asymmetrical versions of Shannon cross entropy:
\begin{align} H\left( p\Vert q\right) =\sum\nolimits_{i}p_{i}\log\left( \frac{1}{q_{i}}\right)\,\,{\rm{and}}\,\,H\left(q||p\right) =\sum\nolimits_{i}q_{i}\log\left( \frac{1}{p_{i}}\right) \end{align}
which are not symmetrical due to the asymmetric role of the logarithm, although if |$p=q$|, then |$H\left( p\Vert p\right) =H\left(p\right) $|. When the logical cross entropy is expressed as an average in a symmetrical way: |$h\left( p||q\right) =\frac{1}{2}\left[ \sum_{i}p_{i}(1-q_{i})+\sum_{i}q_{i}\left( 1-p_{i}\right) \right] $|, then the dit-bit transform yields a symmetrized Shannon cross entropy:
\begin{align} H_{s}\left( p||q\right) =\frac{1}{2}\left[ H\left( p||q\right) +H\left(q||p\right) \right]. \end{align}
The Kullback–Leibler divergence (or relative entropy) |$D\left( p\Vert q\right) =\sum_{i}p_{i}\log\left( \frac{p_{i}}{q_{i}}\right) $| is defined as a ‘measure’ of the distance or divergence between the two distributions where |$D\left( p\Vert q\right) =H\left( p\Vert q\right) -H\left( p\right) $|⁠. A basic result is the:
\begin{align} D\left( p\Vert q\right) \geq0\,\,{\rm{with\,\,equality\,\,if\,\,and\,\,only\,\,if}}\,\,p=q\\ {\it{Information\,\,inequality}}\,\,[10, p. 26]. \end{align}
A symmetrized Kullback–Leibler divergence is:
\begin{align} D_{s}(p||q)=D\left( p||q\right) +D\left( q||p\right) =2H_{s}\left( p||q\right) -\left[ H\left( p\right) +H\left( q\right) \right] . \end{align}
But starting afresh, one might ask: ‘What is the natural notion of distance between two probability distributions |$p=\left\{ p_{1},...,p_{n}\right\} $| and |$q=\left\{q_{1},...,q_{n}\right\} $| that would always be non-negative, and would be zero if and only if they are equal?’ The (Euclidean) distance metric between the two points in |$\mathbb{R}^{n}$| would seem to be the logical answer—so we take that distance squared as the definition of the:9
\begin{align} d\left( p\Vert q\right) = \sum\nolimits_{i}\left( p_{i}-q_{i}\right) ^{2}\\ {\it{Logical\,\,divergence}}\,\,({\rm{or}}\,\,{\it{logical\,\,relative entropy}}) \end{align}
which is symmetric and we trivially have:
\begin{align} d\left( p||q\right) \geq0\,\,{\rm{with\,\,equality\,\,iff}}\,\,p=q\\ {\it{Logical\,\,information\,\,inequality}}. \end{align}
We have component-wise:
\begin{align} 0\leq\left( p_{i}-q_{i}\right) ^{2}=p_{i}^{2}-2p_{i}q_{i}+q_{i} ^{2}=2\left[ \frac{1}{n}-p_{i}q_{i}\right] -\left[ \frac{1}{n}-p_{i} ^{2}\right] -\left[ \frac{1}{n}-q_{i}^{2}\right] \end{align}
so that taking the sum for |$i=1,...,n$| gives:
\begin{align*} d\left( p\Vert q\right) & = {\textstyle\sum\nolimits_{i}} \left( p_{i}-q_{i}\right) ^{2}\\ & =2\left[ 1- {\textstyle\sum\nolimits_{i}} p_{i}q_{i}\right] -\left[ \left( 1- {\textstyle\sum\nolimits_{i}} p_{i}^{2}\right) +\left( 1- {\textstyle\sum\nolimits_{i}} q_{i}^{2}\right) \right] \\ & =2h\left( p\Vert q\right) -\left[ h\left( p\right) +h\left( q\right) \right] \\ & =2\mu_{pq}\left( S_{i\neq i^{\prime}}\right) -\left[ \mu_{p}\left( S_{i\neq i^{\prime}}\right) +\mu_{q}\left( S_{i\neq i^{\prime}}\right) \right]. \end{align*}
\begin{align} {\rm{Logical\,\,divergence}} \end{align}
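Continuing the illustrative pair |$p=\left( \frac{1}{2},\frac{1}{2}\right) $| and |$q=\left( \frac{1}{3},\frac{2}{3}\right) $|, the two routes to the logical divergence agree:
\begin{align} d\left( p\Vert q\right) =\left( \tfrac{1}{2}-\tfrac{1}{3}\right) ^{2}+\left( \tfrac{1}{2}-\tfrac{2}{3}\right) ^{2}=\tfrac{1}{18}=2\cdot\tfrac{1}{2}-\left[ \tfrac{1}{2}+\tfrac{4}{9}\right] =2h\left( p\Vert q\right) -\left[ h\left( p\right) +h\left( q\right) \right] . \end{align}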
Aside from a scale factor, the logical divergence is the same as the Jensen difference [36, p. 25] |$J\left( p,q\right) =h\left( p||q\right) -\frac{h\left(p\right) +h\left( q\right) }{2}$|⁠. Then the information inequality implies that the logical cross-entropy is greater than or equal to the average of the logical entropies, i.e., the non-negativity of the Jensen difference:
\begin{align} h\left( p||q\right) \geq\frac{h\left( p\right) +h\left( q\right) }{2}\,\,{\rm{with\,\,equality\,\,iff}}\,\,p=q. \end{align}
The half-and-half probability distribution |$\frac{p+q}{2}$| that mixes |$p$| and |$q$| has the logical entropy of
\begin{align} h\left( \frac{p+q}{2}\right) =\frac{h\left( p\Vert q\right) }{2} +\frac{h\left( p\right) +h\left( q\right) }{4}=\frac{1}{2}\left[ h\left( p||q\right) +\frac{h\left( p\right) +h\left( q\right) }{2}\right] \end{align}
so that:
\begin{align} h(p||q)\geq h\left( \frac{p+q}{2}\right) \geq\frac{h\left( p\right)+h\left( q\right) }{2}\,\,{\rm{with\,\,equality\,\,iff}}\,\,p=q.\\ {\rm{Mixing\,\,different}}\,\,p\,\,{\rm{and}}\,\,q\,\,{\rm{increases\,\,logical\,\,entropy}}. \end{align}
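With the same illustrative |$p$| and |$q$|, |$\frac{p+q}{2}=\left( \frac{5}{12},\frac{7}{12}\right) $| and the mixing inequalities can be checked numerically:
\begin{align} h\left( \tfrac{p+q}{2}\right) =1-\left( \tfrac{25}{144}+\tfrac{49}{144}\right) =\tfrac{35}{72},\,\,{\rm{so\,\,indeed}}\,\,h\left( p\Vert q\right) =\tfrac{36}{72}\geq\tfrac{35}{72}\geq\tfrac{34}{72}=\tfrac{h\left( p\right) +h\left( q\right) }{2}. \end{align}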
The logical divergence can be expressed in the symmetrical form of averages to apply the dit-bit transform:
\begin{align} d\left( p\Vert q\right) =\left[ \sum\nolimits_{i}p_{i}\left( 1-q_{i}\right) +\sum\nolimits_{i}q_{i}\left( 1-p_{i}\right) \right] -\left[ \left( \sum\nolimits_{i} p_{i}\left( 1-p_{i}\right) \right) +\left( \sum\nolimits_{i}q_{i}\left( 1-q_{i}\right) \right) \right] \end{align}
so the dit-bit transform is:
\begin{align} \left[ \sum\nolimits_{i}p_{i}\log\left( \frac{1}{q_{i}}\right) +\sum\nolimits_{i}q_{i}\log\left( \frac{1}{p_{i}}\right) -\sum\nolimits_{i}p_{i}\log\left( \frac{1}{p_{i}}\right) -\sum\nolimits_{i}q_{i}\log\left( \frac{1}{q_{i}}\right) \right]\\ =\left[ \sum\nolimits_{i}p_{i}\log\left( \frac{p_{i}}{q_{i}}\right) +\sum\nolimits_{i}q_{i}\log\left( \frac{q_{i}}{p_{i}}\right) \right] =D\left( p||q\right)+D\left( q||p\right)\\ =D_{s}\left( p||q\right) . \end{align}

Since the logical divergence |$d\left(p||q\right)$| is symmetrical, it develops via the dit-bit transform to the symmetrized version |$D_{s}\left( p||q\right) $| of the Kullback–Leibler divergence. The logical divergence |$d\left(p||q\right)$| is clearly a distance function (or metric) on probability distributions, but even the symmetrized Kullback–Leibler divergence |$D_{s}\left( p||q\right) $| may fail to satisfy the triangle inequality [11, p. 58] so it is not a distance metric.

13 Summary of formulas and dit-bit transforms

Table 5 summarizes the corresponding Shannon and logical entropy concepts.

Table 5.

Comparisons between Shannon and logical entropy formulas (Shannon entropy vs. logical entropy)

Entropy: |$H(p)=\sum p_{i}\log\left(1/p_{i}\right) $| vs. |$h\left( p\right) =\sum p_{i}\left( 1-p_{i}\right) $|
Mutual Info.: |$I(X,Y)=H\left( X\right) +H\left( Y\right)-H\left( X,Y\right) $| vs. |$m\left( X,Y\right) =h\left( X\right) +h\left( Y\right) -h\left(X,Y\right) $|
Cond. entropy: |$H\left( X|Y\right)=H(X)-I\left( X,Y\right) $| vs. |$h\left( X|Y\right) =h\left( X\right) -m\left( X,Y\right)$|
Independence: |$I\left( X,Y\right) =0$| vs. |$m\left(X,Y\right) =h\left( X\right) h\left( Y\right) $|
Indep. Relations: |$H\left( X|Y\right) =H\left( X\right) $| vs. |$h\left( X|Y\right) =h\left( X\right)\left( 1-h\left( Y\right) \right) $|
Cross entropy: |$H\left( p\Vert q\right) =\sum p_{i}\log\left( 1/q_{i}\right) $| vs. |$h\left( p\Vert q\right) =\sum p_{i}\left( 1-q_{i}\right) $|
Divergence: |$D\left( p\Vert q\right)=\sum_{i}p_{i}\log\left( \frac{p_{i}}{q_{i}}\right) $| vs. |$d\left( p\Vert q\right) =\sum_{i}\left( p_{i}-q_{i}\right) ^{2}$|
Relationships: |$D\left( p\Vert q\right) =H\left( p\Vert q\right) -H\left( p\right) $| vs. |$d\left( p\Vert q\right) =2h\left( p\Vert q\right)-\left[ h\left( p\right) +h\left( q\right)\right] $|
Info. Inequality: |$D\left( p\Vert q\right) \geq 0\text{ with }=\text{ iff }p=q$| vs. |$d\left( p\Vert q\right) \geq0\text{ with }=\text{ iff }p=q$|

Table 6 summarizes the dit-bit transforms.

Table 6.

The dit-bit transform from logical entropy to Shannon entropy

The dit-bit transform: |$1-p_{i}\rightsquigarrow\log\left( \frac{1}{p_{i}}\right) $|
|$h\left( p\right) =\sum_{i}p_{i}\left( 1-p_{i}\right) $| |$\rightsquigarrow$| |$H\left( p\right) =\sum_{i}p_{i}\log\left( 1/p_{i}\right)$|
|$h\left( X|Y\right) =\sum_{x,y}p\left( x,y\right)\left[ \left( 1-p\left( x,y\right) \right) -\left( 1-p\left( y\right) \right) \right]$| |$\rightsquigarrow$| |$H\left( X|Y\right) =\sum_{x,y}p\left( x,y\right) \left[\log\left( \frac{1}{p\left( x,y\right) }\right) -\log\left( \frac{1}{p\left( y\right)}\right) \right] $|
|$m\left( X,Y\right) =\sum_{x,y}p\left( x,y\right) \left[ \left[ 1-p\left( x\right) \right] +\left[ 1-p\left(y\right) \right] -\left[ 1-p\left( x,y\right) \right] \right] $| |$\rightsquigarrow$| |$I(X,Y)=\sum_{x,y}p\left( x,y\right) \left[ \log\left(\frac{1}{p\left( x\right) }\right) +\log\left( \frac{1}{p\left( y\right) }\right)-\log\left( \frac{1}{p\left( x,y\right) }\right) \right] $|
|$h\left( p\Vert q\right) =\frac{1}{2}\left[\sum_{i}p_{i}(1-q_{i})+\sum_{i}q_{i}\left( 1-p_{i}\right) \right] $| |$\rightsquigarrow$| |$H_{s}(p||q)=\frac{1}{2}\left[ \sum_{i}p_{i}\log\left( \frac{1}{q_{i}}\right) +\sum_{i}q_{i}\log\left( \frac{1}{p_{i}}\right) \right] $|
|$d\left( p||q\right) =2h\left( p||q\right)-\left[ \left( \sum_{i}p_{i}\left( 1-p_{i}\right) \right) +\left(\sum_{i}q_{i}\left( 1-q_{i}\right) \right) \right] $| |$\rightsquigarrow$| |$D_{s}\left( p||q\right) =2H_{s}\left(p||q\right) -\left[ \sum_{i}p_{i}\log\left( \frac{1}{p_{i}}\right)+\sum_{i}q_{i}\log\left( \frac{1}{q_{i}}\right) \right] $|

14 Entropies for multivariate joint distributions

Let |$\left\{ p\left( x_{1},...,x_{n}\right) \right\} $| be a probability distribution on |$X_{1}\times\ldots\times X_{n}$| for finite |$X_{i}$|’s. Let |$S$| be a subset of |$\left( X_{1}\times...\times X_{n}\right) ^{2}$| consisting of certain ordered pairs of ordered |$n$|-tuples |$\left( \left( x_{1} ,...,x_{n}\right) ,\left( x_{1}^{\prime},...,x_{n}^{\prime}\right) \right) $| so the product probability measure on |$S$| is:
\begin{align} \mu\left( S\right) =\sum\nolimits\left\{ p\left( x_{1},...,x_{n}\right) p\left( x_{1}^{\prime},...,x_{n}^{\prime}\right) :\left( \left( x_{1} ,...,x_{n}\right) ,\left( x_{1}^{\prime},...,x_{n}^{\prime}\right) \right) \in S\right\} . \end{align}

Then all the logical entropies for this |$n$|-variable case are given as the product measure of certain infosets |$S$|⁠. Let |$I,J\subseteq N$| be subsets of the set of all variables |$N=\left\{ X_{1},...,X_{n}\right\} $| and let |$x=\left( x_{1},...,x_{n}\right) $| and |$x^{\prime}=\left( x_{1} ^{\prime},...,x_{n}^{\prime}\right) $|⁠.

Since two ordered |$n$|-tuples are different if they differ in some coordinate, the joint logical entropy of all the variables is: |$h\left(X_{1},...,X_{n}\right) =\mu\left( S_{\vee N}\right) $| where:
\begin{align} S_{\vee N}=\left\{ \left( x,x^{\prime}\right) :\vee_{i=1}^{n}\left(x_{i}\neq x_{i}^{\prime}\right) \right\} =\cup\left\{ S_{X_{i}}:X_{i}\in N\right\}\,\,{\rm{where}}\\ S_{X_{i}}=S_{x_{i}\neq x_{i}^{\prime}}=\left\{ \left( x,x^{\prime}\right):x_{i}\neq x_{i}^{\prime}\right\} \end{align}
(where |$\vee$| represents the disjunction of statements). For a non-empty |$I\subseteq N$|⁠, the joint logical entropy of the variables in |$I$| could be represented as |$h\left( I\right) =\mu\left( S_{\vee I}\right) $| where:
\begin{align} S_{\vee I}=\left\{ \left( x,x^{\prime}\right) :\vee\left( x_{i}\neq x_{i}^{\prime}\right) \text{ for }X_{i}\in I\right\} =\cup\left\{ S_{X_{i} }:X_{i}\in I\right\} \end{align}
so that |$h\left(X_{1},...,X_{n}\right) =h\left( N\right) $|⁠.

As before, the information algebra |$\mathcal{I}\left( X_{1}\times...\times X_{n}\right) $| is the Boolean subalgebra of |$\wp\left( \left( X_{1} \times...\times X_{n}\right) ^{2}\right) $| generated by the basic infosets |$S_{X_{i}}$| for the variables and their complements |$S_{\lnot X_{i}}$|⁠.

For the conditional logical entropies, let |$I,J\subseteq N$| be two non-empty disjoint subsets of |$N$|⁠. The idea for the conditional entropy |$h\left( I|J\right) $| is to represent the information in the variables |$I$| given by the defining condition: |$\vee\left( x_{i}\neq x_{i}^{\prime}\right) $| for |$X_{i}\in I$|⁠, after taking away the information in the variables |$J$| which is defined by the condition: |$\vee\left( x_{j}\neq x_{j}^{\prime}\right) $| for |$X_{j}\in J$|⁠. ‘After the bar |$|$|’ means ‘negate’ so we negate that condition |$\vee\left( x_{j}\neq x_{j}^{\prime}\right) $| for |$X_{j}\in J$| and add it to the condition for |$I$| to obtain the conditional logical entropy as |$h\left( I|J\right) =h\left( \vee I|\vee J\right) =\mu(S_{\vee I|\vee J})$| (where |$\wedge$| represents the conjunction of statements):
\begin{align} S_{\vee I|\vee J}=\left\{ \left( x,x^{\prime}\right) :\vee\left(x_{i}\neq x_{i}^{\prime}\right) \text{ for }X_{i}\in I\text{ and}\wedge\left( x_{j}=x_{j}^{\prime}\right) \text{ for }X_{j}\in J\right\}\\ =\cup\left\{ S_{X_{i}}:X_{i}\in I\right\} -\cup\left\{ S_{X_{j}}:X_{j}\in J\right\} =S_{\vee I}-S_{\vee J}. \end{align}
The general rule is that the sets satisfying the after-the-bar condition are subtracted from the sets satisfying the before-the-bar condition:
\begin{align} S_{\vee I|\vee J}=\cup\left\{ S_{X_{i}}:X_{i}\in I\right\} -\cup\left\{S_{X_{j}}:X_{j}\in J\right\} =\left\{ \left( x,x^{\prime}\right) :\left(\vee_{I}x_{i}\neq x_{i}^{\prime}\right) \wedge\left( \wedge_{J}x_{j}=x_{j}^{\prime}\right) \right\}\\ S_{\vee I|\wedge J}=\cup\left\{ S_{X_{i}}:X_{i}\in I\right\} -\cap\left\{S_{X_{j}}:X_{j}\in J\right\} =\left\{ \left( x,x^{\prime}\right) :\left(\vee_{I}x_{i}\neq x_{i}^{\prime}\right) \wedge\left( \vee_{J}x_{j}=x_{j}^{\prime}\right) \right\}\\ S_{\wedge I|\vee J}=\cap\left\{ S_{X_{i}}:X_{i}\in I\right\} -\cup\left\{S_{X_{j}}:X_{j}\in J\right\} =\left\{ \left( x,x^{\prime}\right) :\left(\wedge_{I}x_{i}\neq x_{i}^{\prime}\right) \wedge\left( \wedge_{J}x_{j}=x_{j}^{\prime}\right) \right\}\\ S_{\wedge I|\wedge J}=\cap\left\{ S_{X_{i}}:X_{i}\in I\right\}-\cap\left\{ S_{X_{j}}:X_{j}\in J\right\} =\left\{ \left( x,x^{\prime}\right) :\left( \wedge_{I}x_{i}\neq x_{i}^{\prime}\right) \wedge\left(\vee_{J}x_{j}=x_{j}^{\prime}\right) \right\}. \end{align}
For the mutual logical information of a non-empty set of variables |$I$|⁠, |$m\left( I\right) =m\left( \wedge I\right) =\mu\left( S_{\wedge I}\right) $| where:
\begin{align} S_{\wedge I}=\left\{ \left( x,x^{\prime}\right) :\wedge_{I}x_{i}\neq x_{i}^{\prime}\right\} . \end{align}
For the conditional mutual logical information, let |$I,J\subseteq N$| be two non-empty disjoint subsets of |$N$| so that |$m\left( I|J\right) =m\left(\wedge I|\vee J\right) =\mu\left( S_{\wedge I|\vee J}\right) $| where:
\begin{align} S_{\wedge I|\vee J}=\left\{ \left( x,x^{\prime}\right) :\left( \wedge _{I}x_{i}\neq x_{i}^{\prime}\right) \wedge\left( \wedge_{J}x_{j} =x_{j}^{\prime}\right) \right\} . \end{align}
For the logical analysis of variation (ANOVA) of categorical data, the logical entropies in the multivariate case divide up the variation into the natural parts. For instance, suppose that two explanatory variables |$X_{1}$| and |$X_{2}$| affect a third response variable |$Y$| according to a probability distribution |$\left\{ p\left( x_{1},x_{2},y\right) \right\} $| on |$X_{1}\times X_{2}\times Y$|⁠. The logical division in the information sets is:
\begin{align} S_{Y}=S_{Y\wedge\lnot X_{1}\wedge\lnot X_{2}}\cup S_{Y\wedge\lnot X_{1}\wedge X_{2}}\cup S_{Y\wedge X_{1}\wedge\lnot X_{2}}\cup S_{Y\wedge X_{1}\wedge X_{2}}, \end{align}
where |$S_{Y\wedge\lnot X_{1}\wedge\lnot X_{2}}$| represents the variation in |$Y$| when |$X_{1}$| and |$X_{2}$| don’t vary, |$S_{Y\wedge\lnot X_{1}\wedge X_{2}}$| is the variation in |$Y$| when |$X_{1}$| does not vary but |$X_{2}$| does, and so forth. The union is disjoint so the formula for the multivariate logical analysis of variation is:
\begin{align} h\left( Y\right) =h\left( Y|X_{1},X_{2}\right) +m\left( Y,X_{2} |X_{1}\right) +m\left( Y,X_{1}|X_{2}\right) +m\left( Y,X_{1},X_{2}\right) , \end{align}
with the obvious generalization to more explanatory variables |$X_{1},...,X_{n}$|⁠. Figure 6 (with |$X_{1}=X$| and |$X_{2}=Z$|⁠) gives the Venn diagram.
Fig. 6.

Venn diagram for logical entropies.

And finally by expressing the logical entropy formulas as averages, the dit-bit transform will give the corresponding versions of Shannon entropy. Consider an example of a joint distribution |$\left\{ p\left( x,y,z\right) \right\} $| on |$X\times Y\times Z$|⁠. The mutual logical information |$m\left(X,Y,Z\right) =\mu\left( S_{\wedge\left\{ X,Y,Z\right\} }\right) $| where:
\begin{align} S_{\wedge\left\{ X,Y,Z\right\} }=\left\{ \left( \left( x,y,z\right) ,\left( x^{\prime},y^{\prime},z^{\prime}\right) \right) :x\neq x^{\prime }\wedge y\neq y^{\prime}\wedge z\neq z^{\prime}\right\} =S_{X}\cap S_{Y}\cap S_{Z}. \end{align}
From the Venn diagram for |$h\left( X,Y,Z\right) $|⁠, we have (using a variation on the inclusion–exclusion principle)10:
\begin{align} m\left( X,Y,Z\right) =h\left( X\right) +h\left( Y\right) +h\left( Z\right) -h\left( X,Y\right) -h\left( X,Z\right) -h\left( Y,Z\right) +h\left( X,Y,Z\right) . \end{align}
Substituting the averaging formulas for the logical entropies gives:
\begin{align} m\left( X,Y,Z\right) =\sum\nolimits_{x,y,z}p\left( x,y,z\right) \left[ \begin{array} [c]{c} \left[ 1-p\left( x\right) \right] +\left[ 1-p\left( y\right) \right]+\left[ 1-p\left( z\right) \right] \\ -\left[ 1-p\left( x,y\right) \right] -\left[ 1-p\left( x,z\right)\right] -\left[ 1-p\left( y,z\right) \right] +\left[ 1-p\left(x,y,z\right) \right] \end{array} \right] . \end{align}
Then applying the dit-bit transform gives the corresponding formula for the multivariate ‘Shannon’ mutual information:11
\begin{align} I\left( X,Y,Z\right) =\sum\nolimits_{x,y,z}p\left( x,y,z\right) \left[ \begin{array} [c]{c} \log\left( \frac{1}{p\left( x\right) }\right) +\log\left( \frac{1}{p\left( y\right) }\right) +\log\left( \frac{1}{p\left( z\right)}\right) \\ -\log\left( \frac{1}{p\left( x,y\right) }\right) -\log\left( \frac{1}{p\left( x,z\right) }\right) -\log\left( \frac{1}{p\left( y,z\right)}\right) +\log\left( \frac{1}{p\left( x,y,z\right) }\right) \end{array} \right]\\ =\sum\nolimits_{x,y,z}p\left( x,y,z\right) \log\left( \frac{p\left( x,y\right) p\left(x,z\right) p\left( y,z\right) }{p\left( x\right) p\left( y\right) p\left( z\right) p\left(x,y,z\right) }\right) ({\rm{e.g.}}\,\,[17, p. 57]\,\,{\rm{or}}\,\,[1, p. 129]). \end{align}

To emphasize that Venn-like diagrams are only a mnemonic analogy, Norman Abramson gives an example [1, pp. 130–1] where the Shannon mutual information of three variables is negative.12

Consider the joint distribution |$\left\{ p\left( x,y,z\right) \right\} $|on |$X\times Y\times Z$| where |$X=Y=Z=\left\{ 0,1\right\} $|⁠. Suppose two dice are thrown, one after the other. Then |$X=1$| if the first die came up odd, |$Y=1$| if the second die came up odd, and |$Z=1$| if |$X+Y$| is odd [18, Exercise 26, p. 143]. Then the probability distribution is in Table 7.

Table 7.

Abramson’s example giving negative Shannon mutual information |$I\left(X,Y,Z\right)$|

|$X$| |$Y$| |$Z$| |$p(x,y,z)$| |$p(x,y),p(x,z),p\left( y,z\right) $| |$p(x),p(y),p\left( z\right) $|
|$0$| |$0$| |$0$| |$1/4$| |$1/4$| |$1/2$|
|$0$| |$0$| |$1$| |$0$| |$1/4$| |$1/2$|
|$0$| |$1$| |$0$| |$0$| |$1/4$| |$1/2$|
|$0$| |$1$| |$1$| |$1/4$| |$1/4$| |$1/2$|
|$1$| |$0$| |$0$| |$0$| |$1/4$| |$1/2$|
|$1$| |$0$| |$1$| |$1/4$| |$1/4$| |$1/2$|
|$1$| |$1$| |$0$| |$1/4$| |$1/4$| |$1/2$|
|$1$| |$1$| |$1$| |$0$| |$1/4$| |$1/2$|

Since the logical mutual information |$m(X,Y,Z)$| is the measure |$\mu\left(S_{\wedge\left\{ X,Y,Z\right\} }\right) $|⁠, it is always non-negative and in this case is |$0$|⁠:
\begin{align} m\left(X,Y,Z\right) =h\left( X\right) +h\left( Y\right) +h\left( Z\right) -h\left(X,Y\right) -h\left( X,Z\right) -h\left( Y,Z\right) +h\left( X,Y,Z\right)\\ =\frac{1}{2}+\frac{1}{2}+\frac{1}{2}-\frac{3}{4}-\frac{3}{4}-\frac{3}{4}+\frac{3}{4}=\frac{3}{2}-\frac{6}{4}=0. \end{align}

All the simple and compound notions of logical entropy have a direct interpretation as a two-draw probability. The logical mutual information |$m\left( X,Y,Z\right) $| is the probability that in two independent samples of |$X\times Y\times Z$|⁠, the outcomes would differ in all coordinates. This means the two draws would have the form |$\left( x,y,z\right) $| and |$\left(1-x,1-y,1-z\right) $| for the binary variables, but it is easily seen by inspection that |$p\left(x,y,z\right) =0$| or |$p\left( 1-x,1-y,1-z\right) =0$|⁠, so the products of those two probabilities are all |$0$| as computed—and thus there is no three-way overlap. The two-way overlaps are |$m\left( X,Y\right) =h\left( X\right) +h\left( Y\right) -h\left( X,Y\right)=\frac{1}{2}+\frac{1}{2}-\frac{3}{4}=\frac{1}{4}$| or since each pair of variables is independent, |$m\left( X,Y\right) =h\left( X\right) h\left( Y\right)=\frac{1}{2}\times\frac{1}{2}=\frac{1}{4}$|⁠, and similarly for the other pairs of variables. The non-empty-supports-always-intersect result holds for any two variables, but the example shows that there is no necessity in having a three-way overlap, i.e., |$h\left( X\right) h\left( Y\right) h\left( Z\right) >0$| does not imply |$m\left( X,Y,Z\right) >0$|⁠.13

The Venn diagram like formula for |$m(X,Y,Z)$| carries over to |$I\left( X,Y,Z\right) $| by the dit-bit transform, but the transform does not preserve non-negativity. In this case, the ‘area’ |$I(X,Y,Z)$| is negative:
\begin{align} I\left( X,Y,Z\right) =H\left( X\right) +H\left( Y\right) +H\left(Z\right) -H\left( X,Y\right) -H\left( X,Z\right) -H\left( Y,Z\right)+H\left( X,Y,Z\right)\\ =1+1+1-2-2-2+2=3-4=-1. \end{align}

It is unclear how that can be interpreted as the mutual information contained in the three variables or how the corresponding ‘Venn diagram’ (Figure 7) can be anything more than a mnemonic. Indeed, as Imre Csiszar and Janos Körner remark:

The set-function analogy might suggest to introduce further information quantities corresponding to arbitrary Boolean expressions of sets. E.g., the ‘information quantity’ corresponding to |$\mu\left(A\cap B\cap C\right) =\mu\left( A\cap B\right) -\mu\left( \left(A\cap B\right) -C\right)$| would be |$I(X,Y)-I(X,Y|Z)$|⁠; this quantity has, however, no natural intuitive meaning. [11, pp. 53–4]

Fig. 7.

Negative ‘area’ |$I\left(X,Y,Z\right) $| in ‘Venn diagram’.

Of course, all this works perfectly well in logical information theory for the ‘arbitrary Boolean expressions of sets’ in the information algebra |$\mathcal{I}\left( X\times Y\times Z\right)$|⁠, e.g.,
\begin{align} m\left( X,Y,Z\right) =\mu\left( S_{X}\cap S_{Y}\cap S_{Z}\right) =\mu\left( S_{X}\cap S_{Y}\right) -\mu\left( \left( S_{X}\cap S_{Y}\right) -S_{Z}\right) =m\left( X,Y\right) -m\left( X,Y|Z\right), \end{align}
which also is a (two-draw) probability measure and thus always non-negative.

Note how the supposed ‘intuitiveness’ of independent random variables giving disjoint or at least ‘zero overlap’ Venn diagram areas in the two-variable Shannon case comes at the cost of possibly having ‘no natural intuitive meaning’ and negative ‘areas’ in the multivariate case. In probability theory, for a joint probability distribution of |$3$| or more random variables, there is a distinction between the variables being pair-wise independent and being mutually independent. In any counterexample where three variables are pairwise but not mutually independent [18, p. 127], the Venn diagram areas for |$H(X)$|⁠, |$H(Y)$| and |$H(Z)$| have to have pairwise zero overlaps, but since they are not mutually independent, all three areas have a non-zero overlap. The only way that can happen is for the pairwise overlaps such as |$I\left( X,Y\right) =0$| between |$H\left( X\right)$| and |$H\left(Y\right)$| to have a positive part |$I\left( X,Y|Z\right) $| (always non-negative [49, Theorem 2.34, p. 23]) and a negative part |$I(X,Y,Z)$| that add to |$0$| as in Figure 7.

15 Logical entropy and some related notions

The Taylor series for |$\ln(x+1)$| around |$x=0$| is:
\begin{align} \ln(x+1)=\ln(1)+x-\frac{1}{2!}x^{2}+\frac{2}{3!}x^{3}-\ldots=x-\frac{x^{2}}{2}+\frac{x^{3}}{3}-\ldots \end{align}
so substituting |$x=p_{i}-1$| (with |$p_{i}>0$|⁠) gives a version of the Newton–Mercator series:
\begin{align} -\ln\left( p_{i}\right) =\ln\left( \frac{1}{p_{i}}\right) =1-p_{i} +\frac{\left( p_{i}-1\right) ^{2}}{2}-\frac{\left( p_{i}-1\right) ^{3}} {3}+.... \end{align}
Then multiplying by |$p_{i}$| and summing yields:
\begin{align} H_{e}\left( p\right) =-\sum\nolimits_{i}p_{i}\ln\left( p_{i}\right) =\sum\nolimits_{i}p_{i}\left( 1-p_{i}\right) +\sum\nolimits_{i}\frac{p_{i}(p_{i}-1)^{2}}{2}-...\\ =h\left( p\right) +\sum\nolimits_{i}\frac{p_{i}(p_{i}-1)^{2}}{2}-.... \end{align}
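As a numerical illustration (using the uniform distribution on two outcomes, so |$H_{e}\left( p\right) =\ln2\approx0.693$| and |$h\left( p\right) =\frac{1}{2}$|), the first few terms of the series already give a reasonable approximation:
\begin{align} h\left( p\right) +\sum\nolimits_{i}\tfrac{p_{i}\left( p_{i}-1\right) ^{2}}{2}-\sum\nolimits_{i}\tfrac{p_{i}\left( p_{i}-1\right) ^{3}}{3}=\tfrac{1}{2}+\tfrac{1}{8}+\tfrac{1}{24}=\tfrac{2}{3}\approx0.667. \end{align}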

A similar relationship holds in the quantum case between the von Neumann entropy |$S\left( \rho\right) =-\operatorname*{tr}\left[ \rho \ln\left( \rho\right) \right] $| and the quantum logical entropy |$h\left( \rho\right) =\operatorname*{tr}\left[ \rho\left( 1-\rho\right) \right] =1-\operatorname*{tr}\left[ \rho^{2}\right] $|, which is defined by having a density matrix |$\rho$| replace the probability distribution |$p$| and the trace replace the sum.

This relationship between the Shannon/von Neumann entropies and the logical entropies in the classical and quantum cases explains why logical entropy is often presented as a ‘linear’ approximation to the Shannon or von Neumann entropies, since |$1-p_{i}$| is the linear term in the series for |$-\ln\left( p_{i}\right) $| [before the multiplication by |$p_{i}$| makes the term quadratic!]. Indeed, |$h\left( p\right) =1-\sum_{i}p_{i}^{2}$| and its quantum counterpart |$h\left(\rho\right) =1-\operatorname*{tr}\left[ \rho ^{2}\right] $| are even called ‘linear entropy’ (e.g. [8]) even though the formulas are obviously quadratic.14 Another name for the quantum logical entropy found in the literature is ‘mixedness’ [26, p. 5], which at least does not call a quadratic formula ‘linear.’ It is also called ‘impurity’ since the complement |$1-h\left( \rho\right) =\operatorname*{tr}\left[\rho^{2}\right] $| (i.e. the quantum version of Alan Turing’s repeat rate |$\sum_{i}p_{i}^{2}$| [21]) is called the ‘purity.’

Quantum logical entropy is beyond the scope of this article but it might be noted that some quantum information theorists have been using that concept to rederive results previously derived using the von Neumann entropy such as the Klein inequality, concavity, and a Holevo-type bound for Hilbert–Schmidt distance ([42], [43]). Moreover, the logical derivation of the logical entropy formulas using the notion of distinctions gives a certain naturalness to the notion of quantum logical entropy.

We find this framework of partitions and distinction most suitable (at least conceptually) for describing the problems of quantum state discrimination, quantum cryptography and in general, for discussing quantum channel capacity. In these problems, we are basically interested in a distance measure between such sets of states, and this is exactly the kind of knowledge provided by logical entropy ([12], [42]).

There are many older results derived under the misnomer ‘linear entropy’ or derived for the quadratic special case of the Tsallis–Havrda–Charvat entropy ([24], [44], [45]). Those parameterized families of entropy formulas are sometimes criticized for lacking a convincing interpretation, but we have seen that the quadratic case is interpreted simply as a two-draw probability of a ‘dit’ of the partition—just as in the dual case, the normalized counting measure of a subset is the one-draw probability of an ‘it’ in the subset.

In accordance with its quadratic nature, logical entropy is the logical special case of C. R. Rao’s quadratic entropy [36]. Two elements from |$U=\left\{ u_{1},...,u_{n}\right\} $| are either identical or distinct. Gini [19] introduced |$d_{ij}$| as the ‘distance’ between the |$i^{th}$| and |$j^{th}$| elements, where |$d_{ij}=1$| for |$i\not =j$| and |$d_{ii}=0$|—which might be considered the ‘logical distance function’ |$d_{ij}=1-\delta_{ij}$|, so the logical distance is the complement of the Kronecker delta. Since |$1=\left( p_{1}+...+p_{n}\right) \left( p_{1}+...+p_{n}\right) =\sum_{i}p_{i}^{2}+\sum_{i\not =j}p_{i}p_{j}$|, the logical entropy, i.e., Gini’s index of mutability, |$h\left( p\right) =1-\sum_{i}p_{i}^{2}=\sum_{i\not =j}p_{i}p_{j}$|, is the average logical distance between distinct elements. But in 1982, C. R. Rao [36] generalized this to quadratic entropy by allowing other distances |$d_{ij}=d_{ji}$| for |$i\not =j$| (but always |$d_{ii}=0$|) so that |$Q=\sum_{i\not =j}d_{ij}p_{i}p_{j}$| is the average distance between distinct elements from |$U$|.

Rao’s treatment also includes (and generalizes) the natural extension of logical entropy to continuous probability density functions |$f\left( x\right) $| for a random variable |$X$|: |$h\left(X\right) =1-\int f\left( x\right) ^{2}dx$|. It might be noted that the natural extension of Shannon entropy to continuous probability density functions |$f(x)$| through the limit of discrete approximations contains terms |$\log\left( 1/\Delta x_{i}\right) $| that blow up as the mesh size |$\Delta x_{i}$| goes to zero (see [34, pp. 34–38]).15 Hence Shannon entropy in the continuous case is defined not by the limit of the discrete formula but by the analogous formula |$H\left( X\right) =-\int f\left( x\right) \log\left( f\left( x\right) \right) dx$| which, as Robert McEliece notes, ‘is not in any sense a measure of the randomness of |$X$|’ [34, p. 38] in addition to possibly having negative values [46, p. 74].

16 The statistical interpretation of Shannon entropy

Shannon, like Ralph Hartley [23] before him, starts with the question of how much ‘information’ is required to single out a designated element from a set |$U$| of equiprobable elements. Alfréd Rényi formulated this in terms of the search [37] for a hidden designated element like the answer in a Twenty Questions game. But being able to always find the designated element is equivalent to being able to distinguish all elements from one another.

One might quantify ‘information’ as the minimum number of yes-or-no questions in a game of Twenty Questions that it would take in general to distinguish all the possible ‘answers’ (or ‘messages’ in the context of communications). This is readily seen in the standard case where |$\left\vert U\right\vert =n=2^{m}$|⁠, i.e., the size of the set of equiprobable elements is a power of |$2$|⁠. Then following the lead of Wilkins over three centuries earlier, the |$2^{m}$| elements could be encoded using words of length |$m$| in a binary code such as the digits |$\left\{0,1\right\} $| of binary arithmetic (or |$\left\{ A,B\right\} $| in the case of Wilkins). Then an efficient or minimum set of yes-or-no questions needed to single out the hidden element is the set of |$m$| questions:
\begin{align} {\rm{`Is\,\,the}}\,\,j^{th}\,\,{\rm{digit\,\,in\,\,the\,\,binary\,\,code\,\,for\,\,the\,\,hidden\,\,element\,\,a\,\,1?'}} \end{align}
for |$j=1,...,m$|⁠. Each element is distinguished from any other element by their binary codes differing in at least one digit. The information gained in finding the outcome of an equiprobable binary trial, like flipping a fair coin, is what Shannon calls a bit. Hence the information gained in distinguishing all the elements out of |$2^{m}$| equiprobable elements is:
\begin{align} m=\log_{2}\left( 2^{m}\right) =\log_{2}\left( \left\vert U\right\vert \right) =\log_{2}\left(\frac{1}{p_{0}}\right)\,\,{\rm{bits}}, \end{align}
where |$p_{0}=\frac{1}{2^{m}}$| is the probability of any given element (all logs to base |$2$|⁠).16
In the more general case where |$\left\vert U\right\vert =n$| is not a power of |$2$|⁠, Shannon and Hartley extrapolate to the definition of |$H\left(p_{0}\right) $| where |$p_{0}=\frac{1}{n}$| as:
\begin{align} H\left( p_{0}\right) =\log\left( \frac{1}{p_{0}}\right) =\log\left(n\right)\\ {\rm{Shannon-Hartley\,\,entropy\,\,for\,\,an\,\,equiprobable\,\,set}}\,\,U\,\,{\rm{of}}\,\,n\,\,{\rm{elements}}. \end{align}
The Shannon formula then extrapolates further to the case of different probabilities |$p=\left\{ p_{1},...,p_{n}\right\} $| by taking the average:
\begin{align} H\left( p\right) =\sum\nolimits_{i=1}^{n}p_{i}\log\left( \frac{1}{p_{i}}\right).\\ {\rm{Shannon\,\,entropy\,\,for\,\,a\,\,probability\,\,distribution}}\,\,p=\left\{ p_{1},...,p_{n}\right\} \end{align}

How can that extrapolation and averaging be made rigorous to offer a more convincing interpretation? Shannon uses the law of large numbers. Suppose that we have a three-letter alphabet |$\left\{ a,b,c\right\} $| where each letter is equiprobable, |$p_{a}=p_{b}=p_{c}=\frac{1}{3}$|, in a multi-letter message. Then a one-letter or two-letter message cannot be exactly coded with a binary |$0,1$| code with equiprobable |$0$|’s and |$1$|’s. But any probability can be better and better approximated by longer and longer representations in the binary number system. Hence we can consider longer and longer messages of |$N$| letters along with better and better approximations with binary codes. The long-run behaviour of messages |$u_{1}u_{2}...u_{N}$| where |$u_{i}\in\left\{ a,b,c\right\} $| is modelled by the law of large numbers, so that the letter |$a$| on average occurs |$p_{a}N=\frac{1}{3}N$| times, and similarly for |$b$| and |$c$|. Such a message is called typical.

The probability of any one of those typical messages is:
\begin{align} p_{a}^{p_{a}N}p_{b}^{p_{b}N}p_{c}^{p_{c}N}=\left[ p_{a}^{p_{a}}p_{b}^{p_{b} }p_{c}^{p_{c}}\right] ^{N} \end{align}
or, in this case,
\begin{align} \left[ \left( \frac{1}{3}\right) ^{1/3}\left( \frac{1}{3}\right) ^{1/3}\left( \frac{1}{3}\right) ^{1/3}\right] ^{N}=\left( \frac{1} {3}\right) ^{N}. \end{align}

Hence the number of such typical messages is |$3^{N}$|⁠.

If each message was assigned a unique binary code, then the number of |$0,1$|’s in the code would have to be |$X$| where |$2^{X}=3^{N}$| or |$X=\log\left( 3^{N}\right) =N\log\left( 3\right) $|⁠. Hence the number of equiprobable binary questions or bits needed per letter (i.e. to distinguish each letter) of a typical message is:
\begin{align} N\log(3)/N=\log\left( 3\right) =3\times\frac{1}{3}\log\left( \frac{1} {1/3}\right) =H\left( p\right) . \end{align}

This example shows the general pattern.
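The arithmetic of the three-letter example can be verified directly. The short Python sketch below (with an arbitrarily chosen message length N, our own illustration) checks that a typical message has probability |$\left(\frac{1}{3}\right)^{N}$| and that the resulting bit count per letter equals |$\log_{2}\left(3\right)=H\left(p\right)$|.

```python
from math import isclose, log2

p = {'a': 1/3, 'b': 1/3, 'c': 1/3}
N = 12                                   # message length (illustrative choice)

# probability of one typical message: prod_k p_k**(p_k * N) = (1/3)**N
prob_typical = 1.0
for pk in p.values():
    prob_typical *= pk ** (pk * N)
assert isclose(prob_typical, (1/3) ** N)

# number of typical messages ~ 1/prob_typical = 3**N, so bits per letter:
bits_per_letter = log2(1 / prob_typical) / N
H = sum(pk * log2(1 / pk) for pk in p.values())
assert isclose(bits_per_letter, log2(3))
assert isclose(bits_per_letter, H)
```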

In the general case, let |$p=\left\{ p_{1},...,p_{n}\right\} $| be the probabilities over an |$n$|-letter alphabet |$A=\left\{ a_{1},...,a_{n}\right\} $|⁠. In an |$N$|-letter message, the probability of a particular message |$u_{1}u_{2}...u_{N}$| is |$\Pi_{i=1}^{N}\Pr\left( u_{i}\right) $|⁠, where |$u_{i}$| could be any of the symbols in the alphabet, so if |$u_{i}=a_{j}$| then |$\Pr\left( u_{i}\right) =p_{j}$|⁠.

In a typical message, the |$i^{th}$| symbol will occur |$p_{i}N$| times (law of large numbers) so the probability of a typical message is (note change of indices to the letters of the alphabet):
\begin{align} \Pi_{k=1}^{n}p_{k}^{p_{k}N}=\left[ \Pi_{k=1}^{n}p_{k}^{p_{k}}\right] ^{N}. \end{align}
Thus the probability of a typical message is |$P^{N}$| where it is as if each letter in a typical message were equiprobable with probability |$P=\Pi_{k=1}^{n}p_{k}^{p_{k}}$|⁠. No logs have been introduced into the argument yet, so we have an interpretation of the base-free ‘numbers-equivalent’17 (or ‘anti-log’ or exponential) Shannon entropy:
\begin{align} E\left( p\right) =P^{-1}=\Pi_{k=1}^{n}\left( \frac{1}{p_{k}}\right) ^{p_{k}}=2^{H\left( p\right) }, \end{align}
i.e., it is as if each letter in a typical message were being drawn from an alphabet with |$E\left( p\right) =2^{H\left( p\right) }$| equiprobable letters. Hence the number of |$N$|-letter messages from the equiprobable alphabet is |$E\left( p\right) ^{N}$|⁠. The choice of base |$2$| means that assigning a unique binary code to each typical message requires |$X$| bits, where |$2^{X}=E\left( p\right) ^{N}$|⁠, so that:
\begin{align} X=\log\left\{ E\left( p\right) ^{N}\right\} =N\log\left[ E\left( p\right) \right] =NH\left( p\right) . \end{align}

Dividing by the number |$N$| of letters gives the average bit-count interpretation of the Shannon entropy: |$H\left( p\right) =\log\left[E\left( p\right) \right] =\sum_{k=1}^{n}p_{k}\log\left( \frac{1}{p_{k} }\right) $| is the average number of bits necessary to distinguish, i.e., uniquely encode, each letter in a typical message.
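The following Python sketch (the function names numbers_equivalent and shannon_entropy are illustrative, not from the source) checks the identity |$E\left(p\right)=2^{H\left(p\right)}$| on a small distribution and computes the resulting bit count |$NH\left(p\right)$| for a typical message.

```python
from math import log2, prod

def numbers_equivalent(p):
    """E(p) = prod_k (1/p_k)**p_k, the 'as if' number of equiprobable letters."""
    return prod((1 / pk) ** pk for pk in p if pk > 0)

def shannon_entropy(p):
    return sum(pk * log2(1 / pk) for pk in p if pk > 0)

p = [1/2, 1/4, 1/4]
E = numbers_equivalent(p)
H = shannon_entropy(p)
assert abs(E - 2 ** H) < 1e-12     # E(p) = 2**H(p)

N = 1000                           # letters in a typical message (illustrative)
print(N * H)                       # ~ bits to uniquely code a typical message
```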

This result, usually called the Noiseless Coding Theorem, allows us to conceptually relate the logical and Shannon entropies (the dit-bit transform gives the quantitative relationship). In terms of the simplest case for partitions, the Shannon entropy |$H\left( \pi\right) =\sum_{B\in\pi} p_{B}\log_{2}\left( 1/p_{B}\right) $| is a requantification of the logical measure of information |$h\left( \pi\right) =\frac {|\operatorname*{dit}\left( \pi\right) |}{\left\vert U\times U\right\vert }=1-\sum_{B\in\pi}p_{B}^{2}$|⁠. Instead of directly counting the distinctions of |$\pi$|⁠, the idea behind Shannon entropy is to count the (minimum) number of binary partitions needed to make all the distinctions of |$\pi$|⁠. In the special case of |$\pi$| having |$2^{m}$| equiprobable blocks, the number of binary partitions |$\beta_{i}$| needed to make the distinctions |$\operatorname*{dit} \left( \pi\right) $| of |$\pi$| is |$m$|⁠. Represent each block by an |$m$|-digit binary number so the |$i^{th}$| binary partition |$\beta_{i}$| just distinguishes those blocks with |$i^{th}$| digit |$0$| from those with |$i^{th}$| digit |$1$|⁠.18 Thus there are |$m$| binary partitions |$\beta_{i}$| such that |$\vee_{i=1}^{m}\beta_{i}=\pi$| (⁠|$\vee$| is here the partition join) or, equivalently, |$\cup_{i=1}^{m}\operatorname*{dit}\left( \beta_{i}\right) =\operatorname*{dit}\left(\vee_{i=1}^{m}\beta_{i}\right) =\operatorname*{dit}\left( \pi\right) $|⁠. Thus |$m$| is the exact number of binary partitions it takes to make the distinctions of |$\pi$|⁠. In the general case, Shannon gives the above statistical interpretation so that |$H\left( \pi\right) $| is the minimum average number of binary partitions or bits needed to make the distinctions of |$\pi$|⁠.
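The claim that |$m$| binary partitions make all the distinctions of |$\pi$| can be checked mechanically in the simplest setting. The Python sketch below assumes, purely for illustration, that |$\pi$| is the partition on |$U=\left\{0,...,2^{m}-1\right\}$| with one element per block; it builds the digit partitions |$\beta_{i}$| and verifies that the union of their ditsets is |$\operatorname*{dit}\left(\pi\right)$|.

```python
from itertools import product

m = 3
U = range(2 ** m)                        # one element per block, for simplicity

def dits(blocks):
    """Ditset of a partition: ordered pairs of elements in different blocks."""
    label = {u: b for b, block in enumerate(blocks) for u in block}
    return {(u, v) for u, v in product(U, U) if label[u] != label[v]}

# pi: the partition with 2**m (equiprobable) singleton blocks
pi_blocks = [[u] for u in U]
# beta_i: two blocks, elements whose i-th binary digit is 0 vs 1
betas = [[[u for u in U if (u >> i) & 1 == 0],
          [u for u in U if (u >> i) & 1 == 1]] for i in range(m)]

union_of_dits = set().union(*(dits(b) for b in betas))
assert union_of_dits == dits(pi_blocks)  # m binary partitions make all dits of pi
```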

Note the difference in emphasis. Logical information theory is only concerned with counting the distinctions between distinct elements, not with uniquely designating the distinct entities. By requantifying to count the number of binary partitions it takes to make the same distinctions, the emphasis shifts to the length of the binary code necessary to uniquely designate the distinct elements. Thus the Shannon information theory perfectly dovetails into coding theory and is often presented today as the unified theory of information and coding (e.g. [34] or [22]). It is that shift to not only making distinctions but uniquely coding the distinct outcomes that gives the Shannon theory of information, coding and communication such importance in applications.

It might be noted that the Shannon formula is often connected to (and sometimes even identified with) the Boltzmann-Gibbs entropy in statistical mechanics, which was the source of Shannon’s nomenclature. But that connection is only a numerical approximation, not an identity of functional form: the natural logs of factorials in the Boltzmann formula are approximated using the first two terms in the Stirling approximation [14]. Indeed, as pointed out by David J. C. MacKay, one can use the next term in the Stirling approximation to give a ‘more accurate approximation’ [32, p. 2] to the entropy of statistical mechanics, but no one would suggest using such a formula in information theory. While the ‘entropy’ terminology is here to stay in information theory, it is the Shannon Noiseless Coding Theorem, not any numerical approximation to the Boltzmann-Gibbs entropy of statistical mechanics, that gives the basis for interpreting the Shannon formula.
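A small numerical illustration of the point about the Stirling approximation (our own sketch, not taken from the cited sources): for |$n=100$|⁠, the two-term approximation |$n\ln n-n$| of |$\ln\left(n!\right)$| is off by a few units, while adding the next term |$\frac{1}{2}\ln\left(2\pi n\right)$| nearly closes the gap.

```python
from math import lgamma, log, pi

n = 100
exact = lgamma(n + 1)                             # ln(n!)
two_terms = n * log(n) - n                        # first two Stirling terms
three_terms = two_terms + 0.5 * log(2 * pi * n)   # with the next term added

print(exact, two_terms, three_terms)
# The extra term tightens the approximation of ln(n!); only the two-term form
# underlies the numerical Shannon/Boltzmann connection discussed above.
```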

17 Concluding remarks

Logical information theory is based on the notion of information-as-distinctions. It starts with the finite combinatorial information sets which are the ditsets of partitions on a finite |$U$| or the infosets |$S_{X}$| and |$S_{Y}$| associated with a finite |$X\times Y$|—and that calculus of identities and differences is expressed in the information Boolean algebra |$\mathcal{I}\left( U\right)$| or |$\mathcal{I}\left(X\times Y\right)$|⁠. No probabilities are involved in the definition of the information sets of distinctions. But when a probability distribution is defined on |$U$| or on |$X\times Y$|⁠, then the product probability distribution is determined on |$U^{2}$| or |$\left( X\times Y\right)^{2}$| respectively. The quantitative logical entropy of an information set is the value of the product probability measure on the set.

Since conventional information theory has heretofore been focused on the original notion of Shannon entropy (and quantum information theory on the corresponding notion of von Neumann entropy), much of the article has compared the logical entropy notions to the corresponding Shannon entropy notions.

Logical entropy, like logical probability, is a measure, while Shannon entropy is not. The compound Shannon entropy concepts nevertheless satisfy the measure-like Venn diagram relationships that are automatically satisfied by a measure. This can be explained by the dit-bit transform: by putting a logical entropy notion into the proper form as an average of dit-counts, one can replace each dit-count by a bit-count and obtain the corresponding Shannon entropy notion, which shows the deeper relationship behind the Shannon compound entropy concepts.

In sum, the logical theory of information-as-distinctions is the ground level logical theory of information stated first in terms of sets of distinctions and then in terms of two-draw probability measures on the sets. The Shannon information theory is a higher-level theory that requantifies distinctions by counting the minimum number of binary partitions (bits) that are required, on average, to make all the same distinctions, i.e., to encode the distinguished elements—and is thus well-adapted for the theory of coding and communication.

References

[1] Abramson, N. Information Theory and Coding. McGraw-Hill, 1963.

[2] Aczel, J. and Daroczy, Z. On Measures of Information and Their Characterization. Academic Press, 1975.

[3] Adelman, M. A. Comment on the H concentration measure as a numbers-equivalent. Review of Economics and Statistics, 51, 99–101, 1969.

[4] Adriaans, P. and van Benthem, J. (eds). Philosophy of Information. Vol. 8, Handbook of the Philosophy of Science. North-Holland, 2008.

[5] Bennett, C. H. Quantum information: qubits and quantum error correction. International Journal of Theoretical Physics, 42, 153–76, 2003.

[6] Blachman, N. M. A generalization of mutual information. Proceedings of the IRE, 49, 1331–32, 1961.

[7] Boole, G. An Investigation of the Laws of Thought on which are Founded the Mathematical Theories of Logic and Probabilities. Macmillan and Co., 1854.

[8] Buscemi, F., Bordone, P. and Bertoni, A. Linear entropy as an entanglement measure in two-fermion systems. ArXiv.org, 2, 2007.

[9] Campbell, L. L. Entropy as a measure. IEEE Transactions on Information Theory, IT-11, 112–114, 1965.

[10] Cover, T. and Thomas, J. Elements of Information Theory. John Wiley, 1991.

[11] Csiszar, I. and Körner, J. Information Theory: Coding Theorems for Discrete Memoryless Systems. Academic Press, 1981.

[12] Ellerman, D. Counting distinctions: on the conceptual foundations of Shannon’s information theory. Synthese, 168, 119–49, 2009.

[13] Ellerman, D. The logic of partitions: introduction to the dual of the logic of subsets. Review of Symbolic Logic, 3, 287–350, 2010.

[14] Ellerman, D. An introduction to logical entropy and its relation to Shannon entropy. International Journal of Semantic Computing, 7, 121–45, 2013.

[15] Ellerman, D. An introduction to partition logic. Logic Journal of the IGPL, 22, 94–125, 2014.

[16] Fano, R. M. The transmission of information II. In Research Laboratory of Electronics Report 149. MIT Press, 1950.

[17] Fano, R. M. Transmission of Information. MIT Press, 1961.

[18] Feller, W. An Introduction to Probability Theory and Its Applications, Vol. 1, 3rd edn. John Wiley, 1968.

[19] Gini, C. Variabilità e mutabilità. Tipografia di Paolo Cuppini, 1912.

[20] Gleick, J. The Information: A History, A Theory, A Flood. Pantheon, 2011.

[21] Good, I. J. A. M. Turing’s statistical work in World War II. Biometrika, 66, 393–396, 1979.

[22] Hamming, R. W. Coding and Information Theory. Prentice-Hall, 1980.

[23] Hartley, R. V. L. Transmission of information. Bell System Technical Journal, 7, 535–563, 1928.

[24] Havrda, J. and Charvat, F. Quantification methods of classification processes: concept of structural |$\alpha$|-entropy. Kybernetika (Prague), 3, 30–35, 1967.

[25] Hu, G. D. On the amount of information (in Russian). Teoriya Veroyatnostei i Primenen, 4, 447–55, 1962.

[26] Jaeger, G. Quantum Information: An Overview. Springer Science+Business Media, 2007.

[27] Kolmogorov, A. N. Combinatorial foundations of information theory and the calculus of probabilities. Russian Mathematical Surveys, 38, 29–40, 1983.

[28] Kolmogorov, A. N. Three approaches to the definition of the notion of amount of information. In Selected Works of A. N. Kolmogorov: Vol. III, Information Theory and the Theory of Algorithms, A. N. Shiryayev, ed., pp. 184–93. Springer Science+Business Media, 1993.

[29] Kung, J. P. S., Rota, G.-C. and Yan, C. H. Combinatorics: The Rota Way. Cambridge University Press, 2009.

[30] Laplace, P.-S. (1825). Philosophical Essay on Probabilities, A. I. Dale, trans./ed. Springer Verlag, 1995.

[31] Lawvere, F. W. and Rosebrugh, R. Sets for Mathematics. Cambridge University Press, 2003.

[32] MacKay, D. J. C. Information Theory, Inference, and Learning Algorithms. Cambridge University Press, 2003.

[33] McGill, W. J. Multivariate information transmission. Psychometrika, 19, 97–116, 1954.

[34] McEliece, R. J. The Theory of Information and Coding: A Mathematical Framework for Communication (Encyclopedia of Mathematics and Its Applications, Vol. 3). Addison-Wesley, 1977.

[35] Nielsen, M. and Chuang, I. Quantum Computation and Quantum Information. Cambridge University Press, 2000.

[36] Rao, C. R. Diversity and dissimilarity coefficients: a unified approach. Theoretical Population Biology, 21, 24–43, 1982.

[37] Rényi, A. Probability Theory, Laszlo Vekerdi, trans. North-Holland, 1970.

[38] Rota, G.-C. Twelve problems in probability no one likes to bring up. In Algebraic Combinatorics and Computer Science, Henry Crapo and Domenico Senato, eds, pp. 57–93. Springer, 2001.

[39] Rozeboom, W. W. The theory of abstract partials: an introduction. Psychometrika, 33, 133–67, 1968.

[40] Shannon, C. E. A mathematical theory of communication. Bell System Technical Journal, 27, 379–423; 623–56, 1948.

[41] Shannon, C. E. and Weaver, W. The Mathematical Theory of Communication. University of Illinois Press, 1964.

[42] Tamir, B. and Cohen, E. Logical entropy for quantum states. ArXiv.org, December 2014. http://de.arxiv.org/abs/1412.0616v2.

[43] Tamir, B. and Cohen, E. A Holevo-type bound for a Hilbert Schmidt distance measure. Journal of Quantum Information Science, 5, 127–33, 2015.

[44] Tsallis, C. Possible generalization for Boltzmann-Gibbs statistics. Journal of Statistical Physics, 52, 479–87, 1988.

[45] Tsallis, C. Introduction to Nonextensive Statistical Mechanics. Springer Science+Business Media, 2009.

[46] Uffink, J. Measures of Uncertainty and the Uncertainty Principle. PhD Thesis, University of Utrecht, 1990.

[47] Wilkins, J. (1641). Mercury or the Secret and Swift Messenger. John Nicholson, London, 1707.

[48] Yeung, R. W. A new outlook on Shannon’s information measures. IEEE Transactions on Information Theory, 37, 466–74, 1991.

[49] Yeung, R. W. A First Course in Information Theory. Springer Science+Business Media, 2002.

Footnotes

1This article is about what Adriaans and van Benthem call ‘Information B: Probabilistic, information-theoretic, measured quantitatively’, not about ‘Information A: knowledge, logic, what is conveyed in informative answers’ where the connection to philosophy and logic is built-in from the beginning. Likewise, the paper is not about Kolmogorov-style ‘Information C: Algorithmic, code compression, measured quantitatively’ [4, p. 11].

2Kolmogorov had something else in mind such as a combinatorial development of Hartley’s |$\log\left( n\right) $| on a set of |$n$| equiprobable elements [28].

3There is a general method to define operations on partitions corresponding to the Boolean operations on subsets ([13], [15]) but the lattice operations of join and meet, and the implication are sufficient to define a partition algebra |$\prod\left( U\right) $| parallel to the familiar powerset Boolean algebra |$\wp\left( U\right) $|⁠.

4The lattice of partitions on |$U$| is isomorphically represented by the lattice of partition relations or ditsets on |$U\times U$| ([13], [15]), so in that sense, the size of the ditset of a partition is its ‘size’.

5The formula |$1-\sum_{i}p_{i}^{2}$| is quite old as a measure of diversity and goes back at least to Gini’s index of mutability in 1912 [19]. For the long history of the formula, see [12] or [14].

6Perhaps, one should say that Shannon entropy is not the measure of any independently defined set. The fact that the Shannon formulas ‘act like a measure on a set’ can, of course, be formalized by formally associating an (indefinite) ‘set’ with each random variable |$X$| and then defining the measure value on the ‘set’ as |$H\left( X\right) $|⁠. But since there is no independently defined ‘set’ with actual members and this ‘measure’ is defined by the Shannon entropy values (rather than the other way around), nothing is added to the already-known fact that the Shannon entropies act like a measure in the Venn diagram relationships. This formalization exercise seems to have been first carried out by Guo Ding Hu [25] but was also noted by Imre Csiszar and Janos Körner [11], and redeveloped by Raymond Yeung ([48], [49]).

7Note that |$n=1/p_{0}$| need not be an integer. We are following the usual practice in information theory where an implicit ‘on average’ interpretation is assumed since actual ‘binary partitions’ or ‘binary digits’ (or ‘bits’) only come in integral units. The ‘on average’ provisos are justified by the ‘Noiseless Coding Theorem’ covered in the later section on the statistical interpretation of Shannon entropy.

8Note that |$S_{\lnot X}$| and |$S_{\lnot Y}$| intersect in the diagonal |$\Delta\subseteq\left( X\times Y\right) ^{2}$|⁠.

9In a context where the logical distance |$d\left(X,Y\right) =h\left( X|Y\right) +h\left( Y|X\right) $| and the logical divergence |$d\left( p||q\right) $| are both defined, e.g., two partitions |$\pi$| and |$\sigma$| on the same set |$U$|⁠, then the two concepts are the same, i.e., |$d\left( \pi,\sigma\right) =d\left( \pi||\sigma\right) $|⁠.

10The usual version of the inclusion–exclusion principle would be: |$h(X,Y,Z)=h(X)+h\left( Y\right) +h\left( Z\right) -m\left( X,Y\right) -m\left(X,Z\right) -m\left( Y,Z\right) +m\left( X,Y,Z\right) $| but |$m\left( X,Y\right) =h(X)+h\left( Y\right) -h\left( X,Y\right) $| and so forth, so substituting for |$m\left(X,Y\right) $|⁠, |$m\left( X,Z\right) $|⁠, and |$m\left( Y,Z\right) $| gives the formula.

11The multivariate generalization of the ‘Shannon’ mutual information was developed not by Shannon but by William J. McGill [33] and Robert M. Fano ([16], [17]) at MIT in the early 1950s and independently by Nelson M. Blachman [6]. The criterion for it being the ‘correct’ generalization seems to be that it satisfied the generalized Venn diagram formulas that are automatically satisfied by any measure and are thus also obtained from the multivariate logical mutual information using the dit-bit transform—as is done here.

12Fano had earlier noted that, for three or more variables, the mutual information could be negative [17, p. 58].

13The simplest example suffices. There are three non-trivial partitions on a set with three elements and those partitions have no dits in common.

14Sometimes the misnomer ‘linear entropy’ is applied to the rescaled logical entropy |$\frac{n}{n-1}h\left(\pi\right)$|⁠. The maximum value of the logical entropy is |$h(\mathbf{1})=1-\frac{1}{n}=\frac{n-1}{n}$| so the rescaling gives a maximum value of |$1$|⁠. In terms of the partition-logic derivation of the logical entropy formula, this amounts to sampling without replacement and normalizing |$\left\vert \operatorname*{dit}\left( \pi\right) \right\vert $| by the number of possible distinctions |$\left\vert U\times U-\Delta\right\vert =n^{2}-n$| (where |$\Delta=\left\{ \left( u,u\right) :u\in U\right\} $| is the diagonal) instead of |$\left\vert U\times U\right\vert =n^{2}$| since: |$\frac{\left\vert \operatorname*{dit}\left( \pi\right) \right\vert }{\left\vert U\times U-\Delta\right\vert }=\frac{\left\vert \operatorname*{dit}\left(\pi\right) \right\vert }{n\left( n-1\right) }=\frac{n}{n-1}\frac{\left\vert\operatorname*{dit}\left( \pi\right) \right\vert }{n^{2}}=\frac{n}{n-1}h\left( \pi\right) $|⁠.

15For expository purposes, we have restricted the treatment to finite sample spaces |$U$|⁠. For some countably infinite discrete probability distributions, the Shannon entropy blows up to infinity [49, Example 2.46, p. 30], while the countable logical infosets are always well-defined and the logical entropy is always in the half-open interval |$[0,1)$|⁠.

16This is the special case where Campbell [9] noted that Shannon entropy acted as a measure to count that number of binary partitions.

17When an event or outcome has a probability |$p_{i}$|⁠, it is intuitive to think of it as being drawn from a set of |$\frac{1}{p_{i}}$| equiprobable elements (particularly when |$\frac{1}{p_{i}}$| is an integer), so |$\frac{1}{p_{i}}$| is called the numbers-equivalent of the probability |$p_{i}$| [3]. For a development of |$E\left( p\right) $| from scratch, see [12], [14].

18Thus as noted by John Wilkins in 1641, five letter words in a two-letter code would suffice to distinguish |$2^{5}=32$| distinct entities [47].