
Challenges in ML

To train a machine learning model that generalizes well, we should think about two objectives. (These will be explained in more detail later.)

The first is $E_{test} \approx E_{train}$.

The second is $E_{train} \approx 0$.

But there are some failure cases.

  • If the model is too complicated,

    1. What will happen?

      $\longrightarrow$ Overfitting

      $\longrightarrow$ The model also learns the noise

      $\longrightarrow$ causing high variance! (Even a small change in the data can change the result.)

    2. How to solve?

      (1) Regularization

      (2) Training with more data

  • Else, if the model is not complicated enough,

    1. What will happen?

      $\longrightarrow$ Underfitting

      $\longrightarrow$ High bias (= the model cannot capture the complexity of the data)

    2. How to solve?

      (1) Optimization

      (2) Using a more complex model (both failure modes are illustrated in the sketch below)
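
Both failure modes can be reproduced with a minimal numpy sketch. It is only an illustration under assumed settings: the sine target, the noise level, the sample sizes, and the polynomial degrees are all arbitrary choices. A degree-1 fit underfits (high bias, both errors high), while a degree-15 fit drives $E_{train}$ toward 0 but inflates $E_{test}$ (high variance).

```python
import numpy as np

rng = np.random.default_rng(0)

# Noisy samples from a sine curve, split into train and test sets.
def sample(n):
    x = rng.uniform(0, 1, n)
    y = np.sin(2 * np.pi * x) + rng.normal(0, 0.2, n)
    return x, y

x_tr, y_tr = sample(20)
x_te, y_te = sample(200)

for degree in (1, 3, 15):  # too simple / about right / too complex
    coeffs = np.polyfit(x_tr, y_tr, degree)  # least-squares polynomial fit
    e_train = np.mean((np.polyval(coeffs, x_tr) - y_tr) ** 2)
    e_test = np.mean((np.polyval(coeffs, x_te) - y_te) ** 2)
    print(f"degree={degree:2d}  E_train={e_train:.3f}  E_test={e_test:.3f}")
```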

Two objectives for training an ML model

Ideally, $E_{generalization} = 0$, but this is not achievable in practice.

  1. Therefore, $E_{train} = 0$ becomes our target,
  2. assuming $E_{train} = E_{test}$.
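
Putting the two objectives together makes the logic explicit:

$$E_{test} \;=\; E_{train} + (E_{test} - E_{train}) \;\approx\; 0 + 0 \;=\; 0$$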

Using the above two objectives, we can achieve $E_{generalization} = 0$. However, these two goals have conflicting model preferences, as shown below.

(Figure: the two objectives’ conflicting preferences over model complexity)

Choosing a model

  • Occam’s razor : among competing hypotheses, choose the “simplest” one. (ex) From the inequality below, it can be seen that the larger $N$ is, the smaller the probability of a large generalization gap. It can also be seen that the more complex the model (the larger $d_{VC}$), the higher the probability that the gap will be large (see the numeric sketch below).

    $$P\big[\,\lvert E_{train}(f) - E_{test}(f)\rvert > \epsilon\,\big] \;\le\; 4\,m_f(2N)\,e^{-\frac{1}{8}\epsilon^{2}N}, \qquad m_f(2N) \le (2N)^{d_{VC}} + 1$$

    • $\epsilon$ : a tolerance constant (how large a gap we care about)
    • $N$ : # of training examples
    • $f$ : a model
    • $d_{VC}$ : the VC dimension (complexity) of the model
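
    These trends can be checked numerically. The sketch below uses the polynomial bound on the growth function, $m_f(2N) \le (2N)^{d_{VC}} + 1$, from above; the values of $\epsilon$, $N$, and $d_{VC}$ are arbitrary illustrative choices.

    ```python
    import math

    # VC bound on P(|E_train - E_test| > eps), using the polynomial
    # growth-function bound m_f(2N) <= (2N)^d_VC + 1.
    def vc_bound(n, d_vc, eps=0.1):
        growth = (2 * n) ** d_vc + 1
        return 4 * growth * math.exp(-(eps ** 2) * n / 8)

    for d_vc in (3, 10):
        for n in (1_000, 10_000, 100_000):
            print(f"d_VC={d_vc:2d}  N={n:>7,}  bound={vc_bound(n, d_vc):.3g}")
    ```

    The absolute values are very loose (often vacuous for small $N$), but the trend matches the text: the bound shrinks as $N$ grows and grows quickly with $d_{VC}$.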

Two major approaches to the trade-off

  • Optimization : bias reduction

    (ex) Newton’s method (minimization/maximization)

    $\longrightarrow$ Intuition : where the function is flat (small curvature), the parameter is moved quickly; where it is steep, it is moved little by little.

    (Figure: Newton’s method update step)

    $\longrightarrow$ But it cannot be applied to high-dimensional optimization problems, because $f''$ (the Hessian) cannot be computed and inverted easily.

    $\longrightarrow$ To solve this problem, gradient descent was developed. In this method, $1/f'' \approx \epsilon$; in other words, the curvature information is replaced with a hyperparameter (the learning rate).
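
    As a minimal 1-D sketch of this difference (the quartic $f(x) = x^4 - 3x^2 + x$, the starting point $x = 2$, and the learning rate $0.05$ are arbitrary illustrative choices):

    ```python
    # Minimize f(x) = x^4 - 3x^2 + x. Newton's method scales each step by
    # the local curvature f''(x); gradient descent replaces 1/f'' with a
    # fixed learning rate.
    def fp(x):   # f'(x)
        return 4 * x**3 - 6 * x + 1

    def fpp(x):  # f''(x)
        return 12 * x**2 - 6

    x_newton, x_gd, lr = 2.0, 2.0, 0.05
    for _ in range(20):
        x_newton -= fp(x_newton) / fpp(x_newton)  # curvature-scaled step
        x_gd -= lr * fp(x_gd)                     # fixed-step update

    print(f"Newton:           x = {x_newton:.6f}")
    print(f"Gradient descent: x = {x_gd:.6f}")
    ```

    Both converge to the same local minimum here; the point is that Newton’s step needs $f''$, while gradient descent only needs $f'$ plus a hand-tuned hyperparameter.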

  • Regularization : broadly speaking, a family of methods that reduce the variance of model performance in order to improve generalization.

    (ex) reflecting prior knowledge, dropout, weight decay, norm penalties, ensemble learning, etc.
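
    As a minimal sketch of one of these, the norm penalty (ridge regression / weight decay); the data shapes, noise level, and $\lambda$ values are arbitrary illustrative choices:

    ```python
    import numpy as np

    rng = np.random.default_rng(1)

    # Ridge regression: minimizing ||Xw - y||^2 + lam * ||w||^2 has the
    # closed form w = (X^T X + lam * I)^{-1} X^T y.
    def ridge(X, y, lam):
        d = X.shape[1]
        return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

    X = rng.normal(size=(30, 10))   # few samples, many features
    w_true = np.zeros(10)
    w_true[:2] = 1.0                # only two features truly matter
    y = X @ w_true + rng.normal(0, 0.5, 30)

    for lam in (0.0, 1.0, 10.0):
        w = ridge(X, y, lam)
        print(f"lam={lam:5.1f}  ||w|| = {np.linalg.norm(w):.3f}")
    ```

    Increasing $\lambda$ shrinks $\lVert w \rVert$, trading a little bias for lower variance.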

What is the proper way to develop an ML model?

  • Use a complex model to ensure a higher chance of fitting the data ($E_{train} = 0$).
  • Use regularization and big data to reduce the generalization gap ($E_{test} = E_{train}$).