
Information Theory

All content is a summary of this lecture (source). There may be some things that I've misorganized.

Self-information

$I(x) = -\log P(x)$ : as $P(x)$ increases, $I(x)$ decreases.
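
Below is a tiny sketch (my own illustration, not from the lecture) that evaluates $I(x)$ in bits for a couple of probabilities, showing that rarer events carry more information:

```python
import numpy as np

def self_information(p):
    """I(x) = -log2 P(x): information content, in bits, of an event with probability p."""
    return -np.log2(p)

print(self_information(0.5))    # 1.0 bit   (e.g., the outcome of a fair coin flip)
print(self_information(0.01))   # ~6.64 bits: a rarer event is more "surprising"
```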

Shannon entropy

  • $H(x) = E_{x \sim P}[I(x)] = -E_{x \sim P}[\log P(x)]$ : the amount of uncertainty in an entire probability distribution.

  • $H(P)$ : Shannon entropy of a distribution $P$

  • ex 1) when we represent sentences in English, how many bits are needed? The answer is related to this entropy: it is the minimum number of bits per message needed to encode events drawn from $P$.

  • ex 2) $H(P) = -(1-p)\log(1-p) - p\log p$ : this binary entropy has its maximum at $p = 0.5$, because that is the closest to the purely random case. More generally, high entropy means the distribution is close to a uniform distribution.
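
As a quick numerical check of this example, here is a minimal NumPy sketch (function and variable names are my own, not from the lecture) that evaluates the binary entropy on a grid and confirms the maximum is at $p = 0.5$:

```python
import numpy as np

def binary_entropy(p):
    """H(p) = -p*log2(p) - (1-p)*log2(1-p), in bits, with 0*log(0) treated as 0."""
    p = np.asarray(p, dtype=float)
    with np.errstate(divide="ignore", invalid="ignore"):
        terms = (np.where(p > 0, p * np.log2(p), 0.0)
                 + np.where(p < 1, (1 - p) * np.log2(1 - p), 0.0))
    return -terms

ps = np.linspace(0, 1, 101)
H = binary_entropy(ps)
print(ps[np.argmax(H)], H.max())   # 0.5 1.0 -> entropy is maximal for the most uniform case
```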

Kullback-Leibler (KL) divergence

$D_{KL}(P \parallel Q) = E_{x \sim P}\left[\log \frac{P(x)}{Q(x)}\right]$

  • It measures how different $P$ is from $Q$ ($Q$ is the comparison target).

  • The distribution of $x$ is $P$ (the expectation is taken with respect to $P$).

  • $D_{KL}(P \parallel Q)$ gives the extra amount of information needed when $Q$ is used in place of $P$, i.e., a measure of the difference between the two distributions.

  • ex 1) consider the number of extra bits needed when encoding with another language (a code optimized for $Q$ instead of $P$).

  • ex 2) a GAN makes the distribution of $p(x)$ similar to the distribution of $q(x)$.
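
A small numerical sketch of the definition above (the two distributions are arbitrary, chosen only for illustration): it computes $D_{KL}(P \parallel Q)$ for discrete distributions and shows that the divergence is zero when the distributions match and is not symmetric:

```python
import numpy as np

def kl_divergence(p, q):
    """D_KL(P || Q) = sum_x P(x) * log(P(x)/Q(x)), in nats.
    Assumes q > 0 wherever p > 0; terms with P(x) = 0 contribute nothing."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    mask = p > 0
    return np.sum(p[mask] * np.log(p[mask] / q[mask]))

P = np.array([0.7, 0.2, 0.1])   # distribution of x
Q = np.array([0.4, 0.4, 0.2])   # comparison target

print(kl_divergence(P, P))      # 0.0: identical distributions
print(kl_divergence(P, Q))      # ~0.18 nats: extra information needed when Q is used for x ~ P
print(kl_divergence(Q, P))      # a different value: KL divergence is not symmetric
```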

Cross-entropy

It is similar to Shannon entropy, but the two distributions are crossed (the expectation is under $P$ while the log is of $Q$), as shown below.

  • Shannon Entropy : $H(P) = -E_{x \sim P}[\log P(x)]$
  • Cross Entropy : $H(P, Q) = -E_{x \sim P}[\log Q(x)]$


  • Minimizing the cross-entropy with respect to $Q$ is equivalent to minimizing the KL divergence, because $H(P, Q) = H(P) + D_{KL}(P \parallel Q)$ and the $H(P)$ term does not depend on $Q$.

    • When $P$ is the empirical data distribution, this is also maximum likelihood estimation.
  • ex 1) the expected total number of bits when encoding with another language (whereas the KL divergence counts only the extra bits).
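
Here is a short numerical check of the identity $H(P, Q) = H(P) + D_{KL}(P \parallel Q)$ (the distributions are arbitrary, chosen only for illustration):

```python
import numpy as np

def entropy(p):
    """Shannon entropy H(P) in nats."""
    p = np.asarray(p, dtype=float)
    return -np.sum(p[p > 0] * np.log(p[p > 0]))

def cross_entropy(p, q):
    """Cross entropy H(P, Q) = -E_{x~P}[log Q(x)] in nats."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    mask = p > 0
    return -np.sum(p[mask] * np.log(q[mask]))

def kl_divergence(p, q):
    """D_KL(P || Q) in nats."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    mask = p > 0
    return np.sum(p[mask] * np.log(p[mask] / q[mask]))

P = np.array([0.7, 0.2, 0.1])
Q = np.array([0.4, 0.4, 0.2])

# H(P, Q) = H(P) + D_KL(P || Q); only the KL term depends on Q,
# so minimizing cross-entropy over Q minimizes the KL divergence.
print(np.isclose(cross_entropy(P, Q), entropy(P) + kl_divergence(P, Q)))  # True
```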

Maximum likelihood estimator

  • $p_{data}(x)$ : the true but unknown data-generating distribution
  • $p_{model}(x; \theta)$ : a probability distribution parametrized by $\theta$
  • $\theta_{ML} = \arg\max_{\theta} p_{model}(\mathbb{X}; \theta) = \arg\max_{\theta} \prod_{i=1}^{m} p_{model}(x^{(i)}; \theta)$
    • Because the samples are assumed to be independent (i.i.d.), the joint probability factorizes into a product.
  • Taking the log turns the product into a sum, which can be written as an expectation over the empirical distribution: $\theta_{ML} = \arg\max_{\theta} E_{x \sim \hat{p}_{data}}[\log p_{model}(x; \theta)]$. Even though the empirical distribution is obtained from the data, it is somewhat different from the real data-generating distribution, which is why the 'hat' is added to $\hat{p}_{data}$.

  • (Figures omitted here; they are from this source.)

  • Most deep learning models learn in the direction of reducing this KL divergence, which drives the estimated model distribution toward the actual data distribution, as the figures in the source illustrate.
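
Below is a minimal sketch of maximum likelihood estimation (the Bernoulli model and the grid search are my own example, not from the lecture). It maximizes the average log-likelihood over $\theta$, which is the same as minimizing the cross-entropy between the empirical distribution $\hat{p}_{data}$ and $p_{model}$:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.binomial(1, 0.3, size=1000)   # i.i.d. samples; the "true" p = 0.3 is unknown to the model

# Model: p_model(x; theta) = theta^x * (1 - theta)^(1 - x)
thetas = np.linspace(0.01, 0.99, 99)

# Averaging log p_model over the samples has the same argmax as the product form,
# and it equals -H(p_hat_data, p_model), the negative cross-entropy.
log_lik = np.array([np.mean(x * np.log(t) + (1 - x) * np.log(1 - t)) for t in thetas])

theta_ml = thetas[np.argmax(log_lik)]
print(theta_ml, x.mean())   # theta_ML matches the empirical frequency up to the grid resolution
```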

※ Manifold Learning

“You can’t do inference without making assumptions” (MacKay)

Therefore, in machine learning, some assumptions are made by default.

  1. Smoothness assumption

  2. Cluster assumption

  3. Manifold assumption : data can be mapped to a much lower-dimensional space

    ex) Consider a mosquito's trajectory in three dimensions. In reality there can be infinitely many branches, but the problem is much simpler if you think of the trajectory as a string.