
Challenges in ML

To train a machine learning model that generalizes well, we should think about two objectives. (These will be explained in more detail later.)

The first is $E_{test} \approx E_{train}$.

The second is $E_{train} \approx 0$.

But there are some failure cases.

  • If the model is too complicated,

    1. What will happen?

      $\longrightarrow$ Overfitting

      $\longrightarrow$ The model also learns the noise

      $\longrightarrow$ causing high variance! (Even a small change in the data can change the result.)

    2. How to solve?

      (1) Regularization

      (2) Training with more data

  • Else, if the model is not complicated enough,

    1. What will happen?

      $\longrightarrow$ Underfitting

      $\longrightarrow$ High bias (= the model cannot capture the complexity of the data)

    2. How to solve?

      (1) Optimization

      (2) Using a more complex model (both failure modes are illustrated in the sketch below)
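
Both failure modes can be reproduced with a minimal numpy sketch. It is only an illustration under assumed settings: the sine target, the noise level, the sample sizes, and the polynomial degrees are all arbitrary choices. A degree-1 fit underfits (high bias, both errors high), while a degree-15 fit drives $E_{train}$ toward 0 but inflates $E_{test}$ (high variance).

```python
import numpy as np

rng = np.random.default_rng(0)

# Noisy samples from a sine curve, split into train and test sets.
def sample(n):
    x = rng.uniform(0, 1, n)
    y = np.sin(2 * np.pi * x) + rng.normal(0, 0.2, n)
    return x, y

x_tr, y_tr = sample(20)
x_te, y_te = sample(200)

for degree in (1, 3, 15):  # too simple / about right / too complex
    coeffs = np.polyfit(x_tr, y_tr, degree)  # least-squares polynomial fit
    e_train = np.mean((np.polyval(coeffs, x_tr) - y_tr) ** 2)
    e_test = np.mean((np.polyval(coeffs, x_te) - y_te) ** 2)
    print(f"degree={degree:2d}  E_train={e_train:.3f}  E_test={e_test:.3f}")
```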

Two objectives for training an ML model

Ideally, $E_{generalization} = 0$, but this is not achievable in practice.

  1. Therefore, $E_{train} = 0$ becomes our target,
  2. assuming $E_{train} = E_{test}$.
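
Putting the two objectives together makes the logic explicit:

$$E_{test} \;=\; E_{train} + (E_{test} - E_{train}) \;\approx\; 0 + 0 \;=\; 0$$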

Using the above two objectives, we can achieve $E_{generalization} = 0$. However, these two goals have conflicting model preferences, as shown below.

(Figure: the two objectives’ conflicting preferences over model complexity)

Choosing a model

  • Occam’s razor : among competing hypotheses, choose the “simplest” one. (ex) From the inequality below, it can be seen that the larger $N$ is, the smaller the probability of a large generalization gap. It can also be seen that the more complex the model (the larger $d_{VC}$), the higher the probability that the gap will be large (see the numeric sketch below).

    $$P\big[\,\lvert E_{train}(f) - E_{test}(f)\rvert > \epsilon\,\big] \;\le\; 4\,m_f(2N)\,e^{-\frac{1}{8}\epsilon^{2}N}, \qquad m_f(2N) \le (2N)^{d_{VC}} + 1$$

    • $\epsilon$ : a tolerance constant (how large a gap we care about)
    • $N$ : # of training examples
    • $f$ : a model
    • $d_{VC}$ : the VC dimension (complexity) of the model
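
    These trends can be checked numerically. The sketch below uses the polynomial bound on the growth function, $m_f(2N) \le (2N)^{d_{VC}} + 1$, from above; the values of $\epsilon$, $N$, and $d_{VC}$ are arbitrary illustrative choices.

    ```python
    import math

    # VC bound on P(|E_train - E_test| > eps), using the polynomial
    # growth-function bound m_f(2N) <= (2N)^d_VC + 1.
    def vc_bound(n, d_vc, eps=0.1):
        growth = (2 * n) ** d_vc + 1
        return 4 * growth * math.exp(-(eps ** 2) * n / 8)

    for d_vc in (3, 10):
        for n in (1_000, 10_000, 100_000):
            print(f"d_VC={d_vc:2d}  N={n:>7,}  bound={vc_bound(n, d_vc):.3g}")
    ```

    The absolute values are very loose (often vacuous for small $N$), but the trend matches the text: the bound shrinks as $N$ grows and grows quickly with $d_{VC}$.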

Two major approaches to the trade-off

  • Optimization : bias reduction

    (ex) Newton’s method (minimization/maximization)

    $\longrightarrow$ Intuition : where the function is flat (small curvature), the parameter is moved quickly; where it is steep, it is moved little by little.

    (Figure: Newton’s method update step)

    $\longrightarrow$ But it cannot be applied to high-dimensional optimization problems, because $f''$ (the Hessian) cannot be computed and inverted easily.

    $\longrightarrow$ To solve this problem, gradient descent was developed. In this method, $1/f'' \approx \epsilon$; in other words, the curvature information is replaced with a hyperparameter (the learning rate).
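
    As a minimal 1-D sketch of this difference (the quartic $f(x) = x^4 - 3x^2 + x$, the starting point $x = 2$, and the learning rate $0.05$ are arbitrary illustrative choices):

    ```python
    # Minimize f(x) = x^4 - 3x^2 + x. Newton's method scales each step by
    # the local curvature f''(x); gradient descent replaces 1/f'' with a
    # fixed learning rate.
    def fp(x):   # f'(x)
        return 4 * x**3 - 6 * x + 1

    def fpp(x):  # f''(x)
        return 12 * x**2 - 6

    x_newton, x_gd, lr = 2.0, 2.0, 0.05
    for _ in range(20):
        x_newton -= fp(x_newton) / fpp(x_newton)  # curvature-scaled step
        x_gd -= lr * fp(x_gd)                     # fixed-step update

    print(f"Newton:           x = {x_newton:.6f}")
    print(f"Gradient descent: x = {x_gd:.6f}")
    ```

    Both converge to the same local minimum here; the point is that Newton’s step needs $f''$, while gradient descent only needs $f'$ plus a hand-tuned hyperparameter.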

  • Regularization : broadly speaking, a family of methods that reduce the variance of model performance in order to improve generalization.

    (ex) reflecting prior knowledge, dropout, weight decay, norm penalties, ensemble learning, etc.
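
    As a minimal sketch of one of these, the norm penalty (ridge regression / weight decay); the data shapes, noise level, and $\lambda$ values are arbitrary illustrative choices:

    ```python
    import numpy as np

    rng = np.random.default_rng(1)

    # Ridge regression: minimizing ||Xw - y||^2 + lam * ||w||^2 has the
    # closed form w = (X^T X + lam * I)^{-1} X^T y.
    def ridge(X, y, lam):
        d = X.shape[1]
        return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

    X = rng.normal(size=(30, 10))   # few samples, many features
    w_true = np.zeros(10)
    w_true[:2] = 1.0                # only two features truly matter
    y = X @ w_true + rng.normal(0, 0.5, 30)

    for lam in (0.0, 1.0, 10.0):
        w = ridge(X, y, lam)
        print(f"lam={lam:5.1f}  ||w|| = {np.linalg.norm(w):.3f}")
    ```

    Increasing $\lambda$ shrinks $\lVert w \rVert$, trading a little bias for lower variance.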

What is the proper way to develop an ML model?

  • Use a complex model to ensure a higher chance of fitting the data ($E_{train} = 0$).
  • Use regularization and big data to reduce the generalization gap ($E_{test} = E_{train}$).