
Information Theory

All content is a summary of this lecture (source). There may be some things that I've misorganized.

Self-information

$I(x) = -\log P(x)$ : as $P(x)$ increases, $I(x)$ decreases.
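
Below is a tiny sketch (my own illustration, not from the lecture) that evaluates $I(x)$ in bits for a couple of probabilities, showing that rarer events carry more information:

```python
import numpy as np

def self_information(p):
    """I(x) = -log2 P(x): information content, in bits, of an event with probability p."""
    return -np.log2(p)

print(self_information(0.5))    # 1.0 bit   (e.g., the outcome of a fair coin flip)
print(self_information(0.01))   # ~6.64 bits: a rarer event is more "surprising"
```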

Shannon entropy

  • $H(x) = E_{x \sim P}[I(x)] = -E_{x \sim P}[\log P(x)]$ : the amount of uncertainty in an entire probability distribution.

  • $H(P)$ : Shannon entropy of a distribution $P$

  • ex 1) when we represent sentences in English, how many bits are needed? The answer is related to this entropy: it is the minimum number of bits per message needed to encode events drawn from $P$.

  • ex 2) $H(P) = -(1-p)\log(1-p) - p\log p$ : this binary entropy has its maximum at $p = 0.5$, because that is the closest to the purely random case. More generally, high entropy means the distribution is close to a uniform distribution.
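
As a quick numerical check of this example, here is a minimal NumPy sketch (function and variable names are my own, not from the lecture) that evaluates the binary entropy on a grid and confirms the maximum is at $p = 0.5$:

```python
import numpy as np

def binary_entropy(p):
    """H(p) = -p*log2(p) - (1-p)*log2(1-p), in bits, with 0*log(0) treated as 0."""
    p = np.asarray(p, dtype=float)
    with np.errstate(divide="ignore", invalid="ignore"):
        terms = (np.where(p > 0, p * np.log2(p), 0.0)
                 + np.where(p < 1, (1 - p) * np.log2(1 - p), 0.0))
    return -terms

ps = np.linspace(0, 1, 101)
H = binary_entropy(ps)
print(ps[np.argmax(H)], H.max())   # 0.5 1.0 -> entropy is maximal for the most uniform case
```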

Kullback-Leibler (KL) divergence

$D_{KL}(P \parallel Q) = E_{x \sim P}\left[\log \frac{P(x)}{Q(x)}\right]$

  • It measures how different $P$ is from $Q$ ($Q$ is the comparison target).

  • The distribution of $x$ is $P$ (the expectation is taken with respect to $P$).

  • $D_{KL}(P \parallel Q)$ gives the extra amount of information needed when $Q$ is used in place of $P$, i.e., a measure of the difference between the two distributions.

  • ex 1) consider the number of extra bits needed when encoding with another language (a code optimized for $Q$ instead of $P$).

  • ex 2) a GAN makes the distribution of $p(x)$ similar to the distribution of $q(x)$.
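
A small numerical sketch of the definition above (the two distributions are arbitrary, chosen only for illustration): it computes $D_{KL}(P \parallel Q)$ for discrete distributions and shows that the divergence is zero when the distributions match and is not symmetric:

```python
import numpy as np

def kl_divergence(p, q):
    """D_KL(P || Q) = sum_x P(x) * log(P(x)/Q(x)), in nats.
    Assumes q > 0 wherever p > 0; terms with P(x) = 0 contribute nothing."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    mask = p > 0
    return np.sum(p[mask] * np.log(p[mask] / q[mask]))

P = np.array([0.7, 0.2, 0.1])   # distribution of x
Q = np.array([0.4, 0.4, 0.2])   # comparison target

print(kl_divergence(P, P))      # 0.0: identical distributions
print(kl_divergence(P, Q))      # ~0.18 nats: extra information needed when Q is used for x ~ P
print(kl_divergence(Q, P))      # a different value: KL divergence is not symmetric
```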

Cross-entropy

It is similar to Shannon entropy, but the two distributions are crossed (the expectation is under $P$ while the log is of $Q$), as shown below.

  • Shannon Entropy : $H(P) = -E_{x \sim P}[\log P(x)]$
  • Cross Entropy : $H(P, Q) = -E_{x \sim P}[\log Q(x)]$


  • Minimizing the cross-entropy with respect to $Q$ is equivalent to minimizing the KL divergence, because $H(P, Q) = H(P) + D_{KL}(P \parallel Q)$ and the $H(P)$ term does not depend on $Q$.

    • When $P$ is the empirical data distribution, this is also maximum likelihood estimation.
  • ex 1) the expected total number of bits when encoding with another language (whereas the KL divergence counts only the extra bits).
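
Here is a short numerical check of the identity $H(P, Q) = H(P) + D_{KL}(P \parallel Q)$ (the distributions are arbitrary, chosen only for illustration):

```python
import numpy as np

def entropy(p):
    """Shannon entropy H(P) in nats."""
    p = np.asarray(p, dtype=float)
    return -np.sum(p[p > 0] * np.log(p[p > 0]))

def cross_entropy(p, q):
    """Cross entropy H(P, Q) = -E_{x~P}[log Q(x)] in nats."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    mask = p > 0
    return -np.sum(p[mask] * np.log(q[mask]))

def kl_divergence(p, q):
    """D_KL(P || Q) in nats."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    mask = p > 0
    return np.sum(p[mask] * np.log(p[mask] / q[mask]))

P = np.array([0.7, 0.2, 0.1])
Q = np.array([0.4, 0.4, 0.2])

# H(P, Q) = H(P) + D_KL(P || Q); only the KL term depends on Q,
# so minimizing cross-entropy over Q minimizes the KL divergence.
print(np.isclose(cross_entropy(P, Q), entropy(P) + kl_divergence(P, Q)))  # True
```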

Maximum likelihood estimator

  • $p_{data}(x)$ : the true but unknown data-generating distribution
  • $p_{model}(x; \theta)$ : a probability distribution parametrized by $\theta$
  • $\theta_{ML} = \arg\max_{\theta} p_{model}(\mathbb{X}; \theta) = \arg\max_{\theta} \prod_{i=1}^{m} p_{model}(x^{(i)}; \theta)$
    • Because the samples are assumed to be independent (i.i.d.), the joint probability factorizes into a product.
  • Taking the log turns the product into a sum, which can be written as an expectation over the empirical distribution: $\theta_{ML} = \arg\max_{\theta} E_{x \sim \hat{p}_{data}}[\log p_{model}(x; \theta)]$. Even though the empirical distribution is obtained from the data, it is somewhat different from the real data-generating distribution, which is why the 'hat' is added to $\hat{p}_{data}$.

  • (Figures omitted here; they are from this source.)

  • Most deep learning models learn in the direction of reducing this KL divergence, which drives the estimated model distribution toward the actual data distribution, as the figures in the source illustrate.
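
Below is a minimal sketch of maximum likelihood estimation (the Bernoulli model and the grid search are my own example, not from the lecture). It maximizes the average log-likelihood over $\theta$, which is the same as minimizing the cross-entropy between the empirical distribution $\hat{p}_{data}$ and $p_{model}$:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.binomial(1, 0.3, size=1000)   # i.i.d. samples; the "true" p = 0.3 is unknown to the model

# Model: p_model(x; theta) = theta^x * (1 - theta)^(1 - x)
thetas = np.linspace(0.01, 0.99, 99)

# Averaging log p_model over the samples has the same argmax as the product form,
# and it equals -H(p_hat_data, p_model), the negative cross-entropy.
log_lik = np.array([np.mean(x * np.log(t) + (1 - x) * np.log(1 - t)) for t in thetas])

theta_ml = thetas[np.argmax(log_lik)]
print(theta_ml, x.mean())   # theta_ML matches the empirical frequency up to the grid resolution
```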

※ Manifold Learning

“You can’t do inference without making assumptions” (MacKay)

Therefore, in machine learning, some assumptions are made by default.

  1. Smoothness assumption

  2. Cluster assumption

  3. Manifold assumption : data can be mapped to a much lower-dimensional space

    ex) Consider a mosquito's trajectory in three dimensions. In reality there can be infinitely many branches, but the problem is much simpler if you think of the trajectory as a string.