
I’ll cover this topic by comparing the situation before and after applying the ‘Batch Normalization’ method. The contents are based on this Source.

Before ‘Batch Normalization’

When we don’t apply batch normalization, we have to consider the initialization of the parameters very carefully to achieve high performance.

There are two typical methods for initializing the parameters.

1. Xavier/Glorot initialization

It initializes the weights of the neural network from a distribution with zero mean and a specific variance that depends on the number of input and output units (Var(W) = 2 / (n_in + n_out)).

2. He Normal initialization

The randomized weights depend on the size of the previous layer of neurons: they are drawn from a zero-mean normal distribution with variance 2 / n_in, which works well with ReLU activations.

  • Example
import math
from torch.nn import init

# torch.nn.Linear's default weight initialization method (simplified)
def reset_parameters(self):
    # uniform in [-bound, bound] with bound = 1 / sqrt(fan_in)
    bound = 1 / math.sqrt(self.weight.size(1))
    init.uniform_(self.weight, -bound, bound)
    if self.bias is not None:
        init.uniform_(self.bias, -bound, bound)
# code source : https://stackoverflow.com/questions/49433936/how-to-initialize-weights-in-pytorch

import torch
import torch.nn as nn

# Xavier/Glorot initialization applied to every Linear layer
def init_weights(m):
    if isinstance(m, nn.Linear):
        torch.nn.init.xavier_uniform_(m.weight)  # in-place variant (xavier_uniform is deprecated)
        m.bias.data.fill_(0.01)

net = nn.Sequential(nn.Linear(2, 2), nn.Linear(2, 2))
net.apply(init_weights)
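
The snippets above cover the default and Xavier initializations. As a minimal sketch of He Normal initialization (the helper name init_he_weights is mine, not from the source), PyTorch exposes it as kaiming_normal_:

import torch.nn as nn

def init_he_weights(m):
    # He Normal: zero-mean normal with variance 2 / fan_in, suited to ReLU layers
    if isinstance(m, nn.Linear):
        nn.init.kaiming_normal_(m.weight, nonlinearity='relu')
        if m.bias is not None:
            nn.init.zeros_(m.bias)

net = nn.Sequential(nn.Linear(2, 2), nn.ReLU(), nn.Linear(2, 2))
net.apply(init_he_weights)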

After ‘Batch Normalization’

It makes us ‘relatively’ free from carefully initializing the neural network’s parameters. During training, normalization is performed on each mini-batch. Let’s take a look at what this concept is and what its advantages are.

Concept of ‘Batch Normalization’
  • When training : Normalization is performed on each mini-batch so that the outputs of each layer stay bounded.

  • When testing : If normalization were performed using the statistics of the inputs seen at inference time, the whole point of estimating the input distribution through training would be lost (in other words, it would be wrong!). Therefore, at inference time, normalization is performed with a fixed mean and variance so that the result is deterministic.

    • Then, how do we fix the mean and variance?

      Sol ) Use moving averages of the mini-batch statistics stored in advance. In other words, during training, a moving average of the mean and variance of each sampled mini-batch is accumulated, and these accumulated values are then used at inference time (see the sketch after this list).
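
As a minimal sketch of this behavior in PyTorch (the layer size and the batch statistics below are just assumptions for illustration), train() mode normalizes with the current mini-batch statistics while updating the running averages, and eval() mode normalizes with those fixed running averages:

import torch
import torch.nn as nn

bn = nn.BatchNorm1d(4, momentum=0.1)

bn.train()                              # training mode: normalize with batch statistics
for _ in range(100):
    x = torch.randn(32, 4) * 2.0 + 3.0  # mini-batches with mean ~3, std ~2
    _ = bn(x)                           # updates bn.running_mean / bn.running_var

bn.eval()                               # inference mode: use the fixed running statistics
x_test = torch.randn(8, 4) * 2.0 + 3.0
y = bn(x_test)                          # deterministic given the stored running averages
print(bn.running_mean)                  # roughly 3.0
print(bn.running_var)                   # roughly 4.0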

Advantages of ‘Batch Normalization’

Normalization reduces the chance of getting stuck in a poor local optimum. It achieves this by smoothing the loss surface, so the dents around local minima become shallower.

According to the [Source] paper, the authors argue that training instability is caused by ‘Internal Covariate Shift’: changes in a previous layer’s parameters shift the distribution of inputs to the following layers. ‘Batch Norm’ also mitigates this problem.

Ultimately, this technique provides the advantages below.

  • The learning rate can be increased.
  • Training performance becomes less dependent on the choice of initial weights, so the burden of weight initialization is reduced.
  • It can reduce the risk of over-fitting.
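
As a small illustration of the reduced initialization burden (the architecture and learning rate below are my own assumptions, not from the source), inserting BatchNorm layers after each Linear layer typically lets the default initialization and a relatively large learning rate work out of the box:

import torch
import torch.nn as nn

# sketch: BatchNorm after the hidden Linear layer; the default PyTorch
# initialization is left untouched and a larger learning rate is used
model = nn.Sequential(
    nn.Linear(16, 32),
    nn.BatchNorm1d(32),
    nn.ReLU(),
    nn.Linear(32, 10),
)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)  # larger lr than usual for plain SGD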