Information bottleneck in deep learning

cosmos 29th March 2018 at 2:17pm
Deep learning theory Information bottleneck

newer video

Back when you shared this (https://mail.google.com/mail/u/0/#search/quanta+magazine/15ed4a519541bcdd) I had already done some analysis which made me skeptical of many of the claims in Tishby's information bottleneck paper.

Now a recently accepted ICLR 2018 paper presents a detailed theoretical and empirical analysis showing that Tishby's claims do not hold in general: https://openreview.net/forum?id=ry_WPG-A-&noteId=ry_WPG-A-. There is still debate going on, but I think this new paper is basically right.


Information Theory of Deep Learning. Naftali Tishby

Deep Learning and the Information Bottleneck Principle. Performance of the net is given by $D_{IB}=E\left[D[p(y|x)\,||\,p(y|\hat{y})]\right]$ (that is, how similar the distribution of labels $y$ is when given the input $x$ versus the output of the network $\hat{y}(x)$), which can be related, I think, to $I(Y;\hat{Y})$, whose bound (from the Information bottleneck paper) can be used to bound $D_{IB}$ for each given $R=\hat{I}(X;\hat{Y})$ (which represents the amount of compression that the output of the network has performed on the input). It turns out that for finite samples, there is an optimal $R$ which gives a minimum $D_{IB}$.
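To spell out the claimed relation to $I(Y;\hat{Y})$ (my own derivation, assuming the Markov chain $Y - X - \hat{Y}$, i.e. $\hat{Y}$ depends on $Y$ only through $X$):

$$D_{IB} = \sum_{x,\hat{y}} p(x,\hat{y}) \sum_y p(y|x)\log\frac{p(y|x)}{p(y|\hat{y})} = \underbrace{\sum_{x,y} p(x,y)\log\frac{p(y|x)}{p(y)}}_{I(X;Y)} \;-\; \underbrace{\sum_{\hat{y},y} p(\hat{y},y)\log\frac{p(y|\hat{y})}{p(y)}}_{I(Y;\hat{Y})},$$

so $D_{IB} = I(X;Y) - I(Y;\hat{Y})$: since $I(X;Y)$ is fixed by the data distribution, minimizing the distortion is the same as maximizing $I(Y;\hat{Y})$.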

From this analysis it is clear that the empirical input layer of a DNN alone cannot guarantee good generalization even though it contains more information about the target variable Y than the hidden layers, as its representation of the data is too complex. Compression is thus necessary for generalization. In other words, the hidden layers must compress the input in order to reach a point where the worst case generalization error is tolerable.

See more at Information bottleneck

It is interesting how they get Generalization error bounds from information-theoretic quantities. I think the reason these bounds are tighter is that they depend explicitly on the input distribution (needed to calculate $\hat{I}(X;\hat{Y})$), while typical PAC learning bounds only depend on the hypothesis set and focus on the worst case over all possible input distributions (what is called "distribution free"), although the training distribution is fixed to be the same as the test distribution. However, it's intuitive that the input distributions in deep learning are often quite structured and not arbitrary. It is this structure that allows compression, and hence low $\hat{I}(X;\hat{Y})$.

Clearly, there is no reason to believe that current training algorithms for DNNs will reach the optimal point of the IB finite sample bound. However, we do believe that the improved feature detection along the network’s layers corresponds to improvement on the information plane in this direction.

Opening the Black Box of Deep Neural Networks via Information. Uses the Information plane to empirically study DNNs. The Stochastic Gradient Descent (SGD) optimization has two main phases. In the first and shorter phase the layers increase the information on the labels (fitting), while in the second and much longer phase the layers reduce the information on the input (compression phase). We argue that the second phase amounts to a stochastic relaxation (diffusion) that maximizes the conditional entropy of the layers subject to the empirical error constraint.
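For reference, a minimal sketch of the kind of plug-in estimate behind these information-plane plots: bin the (bounded) hidden activations and compute empirical mutual information from the joint counts. The function names and the fixed-width binning here are my own choices, not the paper's exact code.

```python
import numpy as np

def bin_layer(activations, n_bins=30):
    """Discretize a layer's activations (n_samples x n_units) into one symbol per sample."""
    edges = np.linspace(activations.min(), activations.max(), n_bins + 1)
    codes = np.digitize(activations, edges[1:-1])            # bin index for each unit
    _, symbols = np.unique(codes, axis=0, return_inverse=True)
    return symbols                                           # one discrete id per sample

def mutual_information(a, b):
    """Plug-in (empirical) mutual information, in bits, between two discrete variables."""
    joint = np.zeros((a.max() + 1, b.max() + 1))
    np.add.at(joint, (a, b), 1)                              # joint counts
    joint /= joint.sum()
    pa = joint.sum(axis=1, keepdims=True)
    pb = joint.sum(axis=0, keepdims=True)
    nz = joint > 0
    return float(np.sum(joint[nz] * np.log2(joint[nz] / (pa @ pb)[nz])))

# One point on the information plane for a hidden layer T at a given epoch:
#   x_ids: integer id of each distinct training input, y: integer labels,
#   T: the layer's activations on those inputs.
# I_XT = mutual_information(x_ids, bin_layer(T))
# I_TY = mutual_information(bin_layer(T), y)
```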


from email:

I'm still reading his papers to understand it fully, but his theory seems quite interesting. One thing is that his theory seems to be less rigorous than PAC learning theory, for instance. Another thing I still don't get is whether the decrease in mutual information just depends on the layer sizes decreasing. This was asked at the end of his talk, and he replied, but his reply didn't make sense to me.

As far as I can see he doesn't really calculate generalization error bounds. The closest thing he does is bound $|I(Y;T)-\hat{I}(Y;T)|$ (the difference between the empirical and true mutual information between a layer $T$, like the output layer, and the labels $Y$). These bounds rely on the bounds calculated in this paper: http://www.cs.huji.ac.il/labs/learning/Papers/ibgen.pdf

However, after thinking about it, I think there might be a problem with the way he uses the bounds. I think his bounds are derived as follows: given a particular $p(T|X)$ (e.g. a particular stochastic neural network with given weights), the bounds he calculates hold with probability at least $1-\delta$ over samples. That is, if we draw $m$ i.i.d. samples from $p(X)$, then with probability $1-\delta$ the bound will hold. However, this is for a fixed $p(T|X)$. In contrast, the bounds of interest in machine learning must hold uniformly over *all* $p(T|X)$ which your learning algorithm may output. If your algorithm may output any $p(T|X)$ from a family $H$ of possible hypotheses, then the interesting quantity is $P(\text{no } h \in H \text{ has error} > \epsilon)$, and not $P(\text{a particular } h \text{ has error} > \epsilon)$. This difference is what causes the hypothesis class size $|H|$ (or the VC dimension for infinite classes) to appear in the bounds.
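To make this concrete, here is the standard argument (not from his papers) for why the single-hypothesis statement is weaker than the uniform one. For a fixed hypothesis $h$ (here, a fixed $p(T|X)$) with a bounded loss, Hoeffding's inequality over the draw of $m$ i.i.d. samples gives

$$P\big(\,|\mathrm{err}(h) - \widehat{\mathrm{err}}(h)| > \epsilon\,\big) \le 2e^{-2m\epsilon^2},$$

whereas the statement needed for learning must hold simultaneously for every hypothesis the algorithm could return, which for a finite class $H$ costs a union-bound factor of $|H|$:

$$P\big(\,\exists\, h \in H:\ |\mathrm{err}(h) - \widehat{\mathrm{err}}(h)| > \epsilon\,\big) \le 2|H|\, e^{-2m\epsilon^2}.$$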

I think the above means that his bounds are not really relevant for learning. Although I would be happy to be proven wrong.

On the other hand, some of his empirical results are interesting and I'm sure there's stuff to learn from them. However, as of now I'm dubious it's useful for showing why DNNs generalize.

→ More fundamentally, the sample is not statistically independent of $p(T|X)$ if you use the sample itself to obtain $p(T|X)$.

the diffusion phase mostly adds random noise to the weights, and they evolve like Wiener processes, under the training error or label information constraint. Such diffusion processes can be described by a Fokker-Planck equation [see e.g. Risken (1989)], whose stationary distribution maximizes the entropy of the weights distribution, under the training error constraint. That in turn maximizes the conditional entropy, $H(X|T_i)$, or minimizes the mutual information $I(X;T_i) = H(X) - H(X|T_i)$, because the input entropy, $H(X)$, does not change.
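As a sketch of the calculation being invoked (under the simplifying assumption, mine, that the SGD noise is isotropic): modelling the diffusion phase as Langevin dynamics on the training error $E(w)$,

$$dw = -\nabla E(w)\,dt + \sqrt{2\beta^{-1}}\,dB_t,$$

the corresponding Fokker-Planck equation has the Gibbs stationary distribution

$$p_\infty(w) \propto e^{-\beta E(w)},$$

which is exactly the maximum-entropy distribution over weights subject to a constraint on the mean training error $\mathbb{E}[E(w)]$.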

critical slowing down, bifurcations

Moreover, the correlations between the in-weights of different neurons in the same layer, which converge to essentially the same point in the plane, were very small. This indicates that there is a huge number of different networks with essentially optimal performance, and attempts to interpret single weights or even single neurons in such networks can be meaningless. Redundancy!

He doesn't explain how {maximizing the entropy of the weights distribution} results in maximizing $H(X|T_i)$ (or better, $H(T_i)$? since $I(X;T_i) = H(T_i) - H(T_i|X)$ and $H(T_i|X)=0$ for deterministic nets?). For that we need Simplicity bias? to explain that most weight configurations are simple and compress $X$ (low $H(T_i|X)$).

Networks which compress $X$ (especially if they compress $X$ at several layers before the output) are highly constrained. A bias towards them alone would help generalization. Furthermore, these functions are probably all simple functions, I think.

Adding hidden layers dramatically reduces the number of training epochs for good generalization.

Wide layers don't seem to help? They eventually compress the representation as well.

entropy growth is logarithmic in the number of time steps, or the number of steps is exponential in the entropy growth. If there is a potential, or empirical error constraint, this process converges asymptotically to the maximum entropy distribution, which is exponential in the constrained potential or training error. This exponential distribution meets the IB equations Eq. (9), as we saw in section 3.8.

Exponential decrease in the number of iterations with the number of layers.

I found a decrease in the number of iterations with the number of parameters. Hmmmm.. His argument looks good though. Mine perhaps assumes too many things.
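The standard calculation I believe is behind the "logarithmic entropy growth" claim (a sketch, assuming a plain $d$-dimensional Wiener process with diffusion constant $D$): at time $t$ the weights are Gaussian with covariance $Dt\,I$, so the differential entropy is

$$H(t) = \frac{d}{2}\log(2\pi e D t),$$

which grows logarithmically in $t$; inverting, the number of steps needed for a given entropy change $\Delta H$ scales like $t \propto e^{2\Delta H / d}$.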


To describe the function after layer $T$ we need fewer bits, because we only need to describe its value for fewer inputs when the entropy of $T$ is smaller.
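One way to make this counting heuristic precise (my own back-of-the-envelope, not from the papers): the layer has roughly

$$\#\{\text{typical values of } T\} \approx 2^{H(T)}$$

distinct typical representations, so describing, say, a binary function of $T$ costs roughly $2^{H(T)}$ bits, versus roughly $2^{H(X)}$ bits for an arbitrary function of the raw input; lowering $H(T)$ shrinks this description length exponentially.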

compression by noise

bias in matrix map <> compression of input!

Can I model this bias with my simplified process that constructs a sequence which is more likely to have low entropy?