Examining the causal structures of artificial neural networks using information theory: Videos

This post accompanies the paper Examining the causal structures of artificial neural networks by Simon Mattsson, Eric J. Michaud, and Erik Hoel. In the paper, we define the Effective Information ($EI$) between two layers $L_1$ and $L_2$ to be: $$ EI(L_1 \rightarrow L_2) = I(L_1, L_2) \ | \ do(L_1 = H^{\max}). $$ This is the mutual information between the vector of activations output by the neurons in $L_1$ and the vector of activations in $L_2$, where the neurons in $L_1$ take values independently and uniformly in their activation output range.

We can also define the $sensitivity$: $$ Sensitivity(L_1 \rightarrow L_2) = \sum_{(A \in L_1, B \in L_2)} I(t_A, t_B) \ | \ do(A=H^{\max})$$ Sensitivity sums the mutual information between individual neurons $A \in L_1$ and $B \in L_2$, where only $A$ fires (taking values uniformly in its output range) and all other neurons in $L_1$ output 0.

When $sensitivity$ and $EI$ are measured with sufficiently small bins, $EI \leq sensitivity$. We call the difference between $sensitivity$ and $EI$ the $degeneracy$ of the layer: $$ EI = sensitivity - degeneracy$$

For each layer $L_1 \rightarrow L_2$ in a neural network, we can measure $sensitivity$ and $degeneracy$, and visualize how they increase or decrease throughout training in the "causal plane". When a layer moves vertically in the plane, its $sensitivity$ and $degeneracy$ are both increasing while its $EI$ remains constant. Horizontal movement represents changes in $sensitivity$ or $degeneracy$ which do change the $EI$ of the layer.
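To make the definitions above concrete, here is a minimal sketch of how $EI$, $sensitivity$, and $degeneracy$ could be estimated for a single tiny layer by sampling and binning. It is not the code used for the paper. The sigmoid activation, the example weight matrix `W`, the bin count, the sample size, and all function names (`discrete_mi`, `binned`, `encode`, `effective_information`, `sensitivity`) are assumptions made for illustration. The vector-valued estimate also only stays tractable for very small layers, because the number of joint bins grows exponentially with layer width.

```python
import numpy as np

def discrete_mi(u, v):
    """Plug-in estimate (in bits) of I(u; v) for non-negative integer labels."""
    joint = np.zeros((u.max() + 1, v.max() + 1))
    np.add.at(joint, (u, v), 1)          # count co-occurrences of labels
    p_uv = joint / joint.sum()
    p_u = p_uv.sum(axis=1, keepdims=True)
    p_v = p_uv.sum(axis=0, keepdims=True)
    nz = p_uv > 0                        # skip empty cells (0 * log 0 = 0)
    return float(np.sum(p_uv[nz] * np.log2(p_uv[nz] / (p_u @ p_v)[nz])))

def binned(t, bins):
    """Map activations in [0, 1] to integer bin indices 0..bins-1."""
    return np.minimum((t * bins).astype(int), bins - 1)

def encode(T, bins):
    """Collapse each row of binned activations into one integer label."""
    return binned(T, bins) @ (bins ** np.arange(T.shape[1]))

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def effective_information(W, bins=4, samples=200_000, seed=0):
    """I(L1, L2) with every neuron in L1 drawn independently and uniformly."""
    rng = np.random.default_rng(seed)
    x = rng.uniform(0, 1, (samples, W.shape[1]))
    y = sigmoid(x @ W.T)                 # assumed layer: sigmoid(W x), no bias
    return discrete_mi(encode(x, bins), encode(y, bins))

def sensitivity(W, bins=4, samples=200_000, seed=0):
    """Sum of I(t_A, t_B) where only A is uniform and the rest of L1 is 0."""
    rng = np.random.default_rng(seed)
    n_out, n_in = W.shape
    total = 0.0
    for a in range(n_in):
        x = np.zeros((samples, n_in))
        x[:, a] = rng.uniform(0, 1, samples)
        y = sigmoid(x @ W.T)
        for b in range(n_out):
            total += discrete_mi(binned(x[:, a], bins), binned(y[:, b], bins))
    return total

# Hypothetical 2 -> 2 layer, chosen only for illustration.
W = np.array([[4.0, -4.0], [2.0, 3.0]])
ei = effective_information(W)
sens = sensitivity(W)
degeneracy = sens - ei                   # from EI = sensitivity - degeneracy
print(f"EI={ei:.3f} bits, sensitivity={sens:.3f} bits, degeneracy={degeneracy:.3f} bits")
```

With bins as coarse as these, the estimates are rough, and the inequality $EI \leq sensitivity$ is only expected to hold once the bins are sufficiently small.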

Below, we show videos of how the layers of a neural network move and differentiate in the causal plane during training. For networks trained on MNIST, we observe that layers differentiate in the causal plane, and we conjecture that this differentiation reflects that the layers are assuming distinct roles in the structure of the network.

The videos show movement in the causal plane as networks are trained on Iris and on a reduced version of MNIST (with inputs reduced from 28x28 to 5x5, and with only digits 0-4 in the dataset).

25->6->6->6->5 on MNIST

4->5->5->3 on Iris

Some observations:

In the hidden layers, we see a drift towards degeneracy right as the network begins to overfit.
The most movement in the plane occurs during periods when the loss is being reduced the fastest.
We see transitions in behavior as the network begins to overfit.
We see differentiation between the behavior of the first layer, the hidden layers, and the last layer on MNIST.