We applied a deep learning framework to calculate the information shared between two random variables of any dimensionality without discretizing the original data via vector quantization. The quantity we were after is one of the cornerstones of information theory, called mutual information (MI). It characterizes the total degree of dependence between two random variables. In technical terms, the MI represents the decrease in entropy of a random variable $X$ after observing a random variable $Y$, where entropy (specifically the Shannon entropy $S(X)\coloneqq\mathbb{E}[-\log p(X)]$) is a functional that takes in a probability distribution and outputs a real value representing the average number of binary questions required to determine the value of the random variable. Intuitively, entropy characterizes how spread out or clumped the probability mass is over the state space, and MI tells us how much information we gain about $X$ by observing $Y$. Unlike the classical Pearson correlation coefficient, MI captures the total (both linear and nonlinear) degree of dependence by measuring the discrepancy (by means of the KL-divergence) between the baseline uncorrelated state, naturally given by the product of the marginals, and the joint probability distribution. The discrete form of MI is given by

$$ MI(X; Y) \coloneqq \sum_{x\in X}\sum_{y\in Y} p(x,y) \log \left[\frac{p(x,y)}{p(x)p(y)}\right]. $$

One can create a numerical foothold and turn the above calculation into a variational optimization problem by recalling Donsker and Varadhan's dual representation of the KL-divergence:

$$ D_{\mathrm{KL}}(p \| q)=\sup _{T: \Omega \rightarrow \mathbb{R}}\left\{\mathbb{E}_p[T]-\log \left(\mathbb{E}_q\left[e^T\right]\right)\right\}. $$

The idea here is simple: one can measure the difference between two distributions by comparing their outputs when processed through the same function $T$.
This function takes in an object from some space $\Omega$ of unspecified dimension and reduces it to a scalar value in $\mathbb{R}$. What remains is the careful construction of this functional. As with all variational problems, we start with an educated guess and slowly traverse the space of possibilities in a controlled manner until we hit upon something close to the supremum, which gives us a tight lower bound on the MI. We can parameterize the function $T$ with a neural network $T_\theta$,

$$ MI(X ; Y)_n \geqslant \sup _{\theta \in \Theta}\left\{\mathbb{E}_{P_n}\left[T_\theta\right]-\log \left(\mathbb{E}_{P_n^{\dagger}}\left[e^{T_\theta}\right]\right)\right\}, $$

and then use gradient descent to traverse the space of candidate functions quickly. For notational convenience, the joint probability distribution is represented by $P$, while the product of the marginals is given by $P^\dagger$; the subscript $n$ denotes the empirical distributions over $n$ samples. A simple architectural choice for the network is an encoder that compresses the input into a scalar-valued output, i.e., something that induces the desired transformation $\Omega\to\mathbb{R}$. Furthermore, the high expressibility of neural networks makes a tight lower bound attainable in practice. It is also important to note that we can only ever underestimate the MI with this representation. The particular architecture and training schedule that realizes the dual representation is called a Mutual Information Neural Estimator, or MINE network.
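A minimal numerical sketch may help make the bound concrete. Here a hand-picked one-parameter family $T_a(x,y) = a\,x\,y$ stands in for the neural network $T_\theta$, and a scan over $a$ stands in for gradient descent; the correlated Gaussian toy data and the scanned values of $a$ are illustrative assumptions, not the setup used in our work:

```python
import math
import random

random.seed(0)
rho = 0.8
n = 20000

# Joint samples (X, Y) from a bivariate Gaussian with correlation rho.
xs = [random.gauss(0, 1) for _ in range(n)]
ys = [rho * x + math.sqrt(1 - rho ** 2) * random.gauss(0, 1) for x in xs]

# Samples from the product of the marginals P†: shuffle Y to destroy the pairing.
ys_shuffled = ys[:]
random.shuffle(ys_shuffled)

def dv_bound(t):
    """Empirical Donsker-Varadhan bound: E_P[T] - log E_P†[e^T]."""
    e_joint = sum(t(x, y) for x, y in zip(xs, ys)) / n
    e_marginal = sum(math.exp(t(x, y)) for x, y in zip(xs, ys_shuffled)) / n
    return e_joint - math.log(e_marginal)

# Crude "optimization": scan the family T_a(x, y) = a*x*y and keep the best bound.
best = max(dv_bound(lambda x, y, a=a: a * x * y) for a in (0.2, 0.3, 0.4, 0.45))

# For Gaussians the MI is known analytically, so we can check the bound.
true_mi = -0.5 * math.log(1 - rho ** 2)
```

Because the simple family $T_a$ cannot represent the optimal function, the estimate lands strictly below the analytic value, exactly as the dual representation promises: the bound only ever underestimates the MI, and a more expressive $T_\theta$ tightens it.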
The image shows a practical distillation of the above mathematical abstractions. The insert on
the left was taken by one of NASA's small explorer satellites, IRIS, whose mission is to provide
insights into the mechanisms behind the coronal heating problem (a remarkably counterintuitive
phenomenon where the Sun's outer atmosphere, its corona, is orders of magnitude hotter than its
inner atmosphere). The satellite generates a small field-of-view image of the Sun as well as
spectra from different atomic ions. Each pixel in the above image contains multiple spectra
recorded simultaneously, such that each pixel serves as a sample point for multiple random
variables. The random variables in our case are extended line profiles living in
$\mathbb{R}^{240}$, whose shapes are forged via the joint action of radiative transfer and basic
atomic theory. Each line shape encodes to some extent the physical attributes of the solar
atmosphere within a particular temperature range. We can apply the MINE-network to two spectral
lines, Mg II and C II, and derive their MI. The network itself is relatively simple, being
composed of a series of encoders that compress the spectra down to scalar values. The fancy
footwork lies in how the data is sampled. The top channel (above the dotted line) takes Mg II
and C II spectra from the same pixel at the same moment in time and therefore samples from the
joint probability distribution $P$. The bottom channel, on the other hand, takes the spectra
from two different pixels, thereby sampling from the product of the marginals $P^\dagger$ and
providing a baseline uncorrelated state. This baseline is made more accurate if the sampling is
also random across the time domain. The encoder then consumes the spectra and finds a latent
variable representation in $z$. The new representation is then concatenated and further reduced
to a scalar value by a final encoder. It is important to keep in mind that the encoder complex
shares parameters for both the upper and lower sampling channels. The network is then charged
with the task of trying to separate the two scalar value outputs from both channels using any
means possible. Since the lower channel provides by definition the baseline uncorrelated state,
the only way the encoder complex can distinguish between the two channels is to learn all
possible correlations that exist between the lines drawn from the same pixel. The further apart
the network can pull the scalar values, the more information the two variables/lines share.
These scalar values can then be fed into the above dual formula resulting in the MI. We compared
the results for an ensemble of different spectral line-pairs as well as against a discrete
method and found that both methods produce the same results and that the lines that share most
information have similar atomic physics and formation heights. The information quantity between
almost all lines are also enhanced in regions of increased energy deposition, meaning massive
events such as solar flares open up the communication channels at different scale heights. The
confirmation of the results from multiple sources proves that the MINE-network provides a
hands-free way of calculating a critical information-theoretic quantity between any two random
variables of any dimensionality.
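The two-channel sampling scheme described above can be sketched in a few lines. The array shapes and pixel counts here are toy stand-ins for the real raster of $\mathbb{R}^{240}$ line profiles:

```python
import random

random.seed(1)
n_pixels, n_wavelengths = 6, 4  # toy stand-ins for the real raster and R^240 profiles

# Hypothetical spectra: one Mg II and one C II line profile per pixel.
mg_ii = [[random.random() for _ in range(n_wavelengths)] for _ in range(n_pixels)]
c_ii = [[random.random() for _ in range(n_wavelengths)] for _ in range(n_pixels)]

# Top channel: Mg II and C II from the SAME pixel -> samples from the joint P.
joint_batch = [(mg_ii[i], c_ii[i]) for i in range(n_pixels)]

# Bottom channel: C II drawn from a randomly chosen OTHER pixel -> samples from
# the product of the marginals P†, the baseline uncorrelated state.
perm = list(range(n_pixels))
random.shuffle(perm)
marginal_batch = [(mg_ii[i], c_ii[perm[i]]) for i in range(n_pixels)]
```

Both batches are then pushed through the same shared encoder complex; only the pairing differs, which is what lets the network learn to tell joint samples from independent ones.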
One of the major problems in the field of deep learning is model opacity. This was a natural
consequence of ML being raised and nurtured in an industrial environment, where performance
superseded understanding. The deep learning revolution further exacerbated the problem of model
opacity: it was found that instead of relying on the carefully hand-crafted features of
scientists, one could do away with Newton by simply adding more layers to the network.
Furthermore, the fact that complexity is often correlated with higher performance leads to black
boxes, representing an unresolved tension between the values of the scientific method and the
design mode of machine learning models.
The opacity of deep learning models can be reduced via the application of various gradient
attribution methods. These methods can, for example, highlight the discriminative regions used
by a network trained on a binary classification task. To validate our application of these
techniques
before aiming them at a serious research question, we applied them to the toy problem of
distinguishing sunspots from pores. In this case, we know that the distinguishing feature is the
presence of a penumbra (the intermediate grey region between the dark center and the light
surroundings).
We discuss two methods here: the first measures the sensitivity of the output with respect to
the final convolutional layers of a CNN, while the second works at the level of the input
directly via a method borrowed from cooperative game theory.
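The core idea behind the first, gradient-based family is just "how much does the output move when a pixel moves?". A minimal sketch, with a hypothetical two-pixel-sensitive model and central finite differences standing in for the backpropagated gradients a real implementation would use:

```python
import math

def model(img):
    # Hypothetical scalar classifier score: only the center pixel and its
    # right-hand neighbour actually influence the output.
    return math.tanh(2.0 * img[1][1] + 1.0 * img[1][2])

img = [[0.1, 0.2, 0.3],
       [0.4, 0.5, 0.6],
       [0.7, 0.8, 0.9]]

eps = 1e-5
saliency = [[0.0] * 3 for _ in range(3)]
for i in range(3):
    for j in range(3):
        up = [row[:] for row in img]; up[i][j] += eps
        dn = [row[:] for row in img]; dn[i][j] -= eps
        # |d output / d pixel|, approximated by a central difference.
        saliency[i][j] = abs(model(up) - model(dn)) / (2 * eps)
```

The resulting attribution map is nonzero only where the model is actually sensitive, which is precisely the property one exploits to check whether a sunspot classifier is looking at the penumbra.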
We can replace the market $\mathcal{F}_\Theta$ with our black box ML model. The
people $\lambda$ become
individual pixels of our images, and the monetary reward can be mapped to the model's output
prediction in our binary pore/sunspot classification task. In other words, instead of fairly
distributing rewards, we use the Shapley value formalism to fairly distribute the model's
prediction, i.e., to find out which parts of the input are most responsible for said prediction.
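An exact Shapley computation is feasible only for toy inputs, since it averages a feature's marginal contribution over every ordering in which features can be added, but a tiny example shows the mechanics. The three-"pixel" scoring function below is hypothetical, with an interaction term mimicking a penumbra that only matters when both of its pixels are present:

```python
from itertools import permutations

def model(active):
    """Hypothetical black-box score as a function of the set of 'on' pixels."""
    score = 0.0
    if 0 in active and 1 in active:
        score += 3.0  # interaction: the "penumbra" needs both pixels 0 and 1
    if 2 in active:
        score += 1.0
    return score

features = [0, 1, 2]
orders = list(permutations(features))

# Shapley value: each feature's marginal contribution, averaged over all
# possible orders of adding features to the coalition.
shapley = {f: 0.0 for f in features}
for order in orders:
    active = set()
    for f in order:
        before = model(active)
        active.add(f)
        shapley[f] += (model(active) - before) / len(orders)
```

Pixels 0 and 1 split their joint 3.0-point contribution evenly, pixel 2 keeps its solo 1.0, and the values sum to the full prediction, the "efficiency" property that makes the distribution fair. Real applications replace this brute-force average with sampled or approximate variants, since the cost grows factorially with the number of pixels.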