Points to consider when developing a machine learning model
Challenges in ML
To train a machine learning model that generalizes, we should think about two objectives. (These will be explained in more detail later.)
The first one is $E_{test} \approx E_{train}$.
The second one is $E_{train} \approx 0$.
But there are some failure cases:
- If a model is complicated, what will happen?
  $\longrightarrow$ Overfitting
  $\longrightarrow$ The model also learns the noise
  $\longrightarrow$ causing high variance! (Even a small change in the data can change the result; a sketch follows after this list.)
- How to solve it?
  (1) Regularization
  (2) Training on more data
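As a rough illustration of the high-variance failure mode, here is a minimal sketch (hypothetical data and model choice, using numpy's `polyfit`): a high-degree polynomial is fit to two noisy resamples of the same underlying function, and the two fits disagree noticeably, i.e., a small change in the data changes the result.

```python
import numpy as np

rng = np.random.default_rng(0)

def noisy_sample(n=15):
    # Same underlying function, fresh noise each call.
    x = np.linspace(0, 1, n)
    y = np.sin(2 * np.pi * x) + rng.normal(scale=0.2, size=n)
    return x, y

# Fit a degree-12 polynomial (a "complicated" model) to two resamples.
x1, y1 = noisy_sample()
x2, y2 = noisy_sample()
w1 = np.polyfit(x1, y1, deg=12)
w2 = np.polyfit(x2, y2, deg=12)

# Compare the two fitted curves on a common grid: they diverge strongly,
# even though only the noise changed -> high variance (overfitting).
grid = np.linspace(0, 1, 100)
gap = np.max(np.abs(np.polyval(w1, grid) - np.polyval(w2, grid)))
print(f"max disagreement between the two fits: {gap:.2f}")
```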
- Else, if a model is not that complicated, what will happen?
  $\longrightarrow$ Underfitting
  $\longrightarrow$ High bias (= the model cannot reflect the complexity of the data; a sketch follows after this list)
- How to solve it?
  (1) Optimization
  (2) Using a more complex model
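Conversely, a minimal sketch of the high-bias case (again with hypothetical numpy code): a degree-1 model simply cannot represent the sine curve, so the training error stays large no matter how much data is available.

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0, 1, 200)
y = np.sin(2 * np.pi * x) + rng.normal(scale=0.1, size=x.size)

# A straight line (degree 1) is too simple for this data:
w = np.polyfit(x, y, deg=1)
train_err = np.mean((np.polyval(w, x) - y) ** 2)
print(f"training MSE of the linear model: {train_err:.3f}")  # stays far from 0
```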
Two objectives for training an ML model
Ideally, $E_{generalization} = 0$, but that is not achievable directly.
- Therefore, $E_{train} = 0$ is to be our target,
- assuming $E_{train} = E_{test}$.
Using the above two objectives, we can achieve $E_{generalization} = 0$. However, these two goals have conflicting model preferences, as shown below.
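Written out, the argument is a one-line decomposition:

$$E_{test} = E_{train} + \underbrace{\left(E_{test} - E_{train}\right)}_{\text{generalization gap}} \approx 0 + 0 = 0$$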
Choosing a model
- Occam's razor: among competing hypotheses, choose the "simplest" one.
  (ex) From the equation below, it can be seen that the larger $N$ is, the lower the probability of a large generalization gap. Also, the more complex the model (larger $d_{VC}$), the higher the possibility that the gap will increase.
  - $\epsilon$ : a constant (gap tolerance)
  - $N$ : # of training examples
  - $f$ : a model
  - $d_{VC}$ : complexity of a model
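The equation referred to is missing from the note; presumably it is the standard VC generalization bound, reproduced here under that assumption (with $\mathcal{H}$ denoting the hypothesis class the model $f$ is chosen from):

$$P\Big[\,\exists f \in \mathcal{H} : \left|E_{train}(f) - E_{test}(f)\right| > \epsilon\,\Big] \;\le\; 4\,(2N)^{d_{VC}}\, e^{-\frac{1}{8}\epsilon^{2} N}$$

Larger $N$ shrinks the exponential factor, so a large gap becomes unlikely, while larger $d_{VC}$ inflates the polynomial factor.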
Two major methods for the trade-off
- Optimization: bias reduction
  (ex) Newton's method (minimization/maximization)
  $\longrightarrow$ What it reflects: where the curvature is small, the parameter is changed quickly; where the curvature is steep, it is changed little by little.
  $\longrightarrow$ But it cannot be adapted to high-dimensional optimization, because the second derivative $f''$ (the Hessian) cannot be computed and inverted easily.
  $\longrightarrow$ To solve this problem, gradient descent was developed. In this method, $1/f'' \approx \epsilon$; in other words, the curvature information is replaced with a hyperparameter (the learning rate). See the sketch below.
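A minimal 1-D sketch of the two updates (the function and step size are made up for illustration): Newton's method scales the slope by the inverse curvature $1/f''$, while gradient descent replaces $1/f''$ with the fixed constant $\epsilon$.

```python
# Minimize f(x) = x^4 - 3x^2 + 2 in one dimension.
def f_prime(x):
    return 4 * x**3 - 6 * x

def f_double_prime(x):
    return 12 * x**2 - 6

x_newton, x_gd = 2.0, 2.0
eps = 0.02  # learning rate: stands in for 1/f'' in gradient descent

for _ in range(20):
    # Newton: step scaled by the inverse curvature.
    x_newton -= f_prime(x_newton) / f_double_prime(x_newton)
    # Gradient descent: curvature information replaced by the constant eps.
    x_gd -= eps * f_prime(x_gd)

print(f"Newton:           x = {x_newton:.6f}")
print(f"Gradient descent: x = {x_gd:.6f}")
# Both approach the minimizer x = sqrt(3/2) ~ 1.2247; Newton gets there
# in far fewer steps, but needs the second derivative at every iterate.
```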
- Regularization: broadly speaking, a method of reducing the variance of model performance in order to achieve generality.
  (ex) reflecting prior knowledge, dropout, weight decay, norm penalties, ensemble learning, etc. A sketch of one of these follows below.
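As one concrete instance, a hedged sketch of weight decay (an L2 norm penalty) added to plain gradient descent; the data and penalty strength are invented for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)

# Tiny linear-regression problem with more features (10) than examples (8),
# which makes the unregularized solution very sensitive to the data.
X = rng.normal(size=(8, 10))
y = rng.normal(size=8)

def fit(lam, steps=5000, lr=0.01):
    w = np.zeros(10)
    for _ in range(steps):
        # Gradient of the MSE loss plus the gradient of the L2 penalty lam*||w||^2/2.
        grad = X.T @ (X @ w - y) / len(y) + lam * w
        w -= lr * grad
    return w

w_plain = fit(lam=0.0)
w_decay = fit(lam=1.0)
print(f"||w|| without weight decay: {np.linalg.norm(w_plain):.2f}")
print(f"||w|| with weight decay:    {np.linalg.norm(w_decay):.2f}")
# The penalty shrinks the weights, trading a bit of training error
# for lower variance (better generality).
```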
What is the proper way to develop an ML model?
- Use a complex model to ensure a higher chance of fitting the data ($E_{train} = 0$).
- Use regularization and big data to reduce the generalization gap ($E_{test} = E_{train}$).