A Bayesian Take On Model Regularization


In this article, we explore how we can, and do, regularize and control the complexity of the models we learn through Bayesian prior beliefs.

I’m currently reading “How We Learn” by Stanislas Dehaene. First off, I cannot recommend this book enough to anyone interested in learning, teaching, or AI.

One of the main themes of this book is explaining the neurological and psychological bases of why humans are so good at learning things quickly and with great sample-efficiency, i.e. given only a limited amount of experience¹. One of Dehaene’s main arguments for why humans learn so effectively is that we are able to reduce the complexity of the models we formulate of the world. In accordance with the principle of Occam’s Razor², we find the simplest model possible that explains the data we experience, rather than opting for more complicated models. But why do we do this, even from birth¹? One argument is that, contrary to the frequentist view in child psychology (the belief that babies learn solely through their experiences), we are already imparted with prior beliefs about the world when we are born¹.

Even before we first begin to experience the world, our brain is already hardwired with intrinsic knowledge and an incredible ability to learn¹. Photo by Natasha Connell on Unsplash

This notion of simplified model selection has a common name in the field of machine learning: model regularization. In this article, we’ll talk about regularization from a Bayesian perspective.

What’s one way we can control the complexity of the models we learn from observations? We can do this by placing a prior on our distribution of models. Before we show this, let’s briefly go over regularization, in this case, analytic regularization for supervised learning.

Background on Regularization

In machine learning, regularization, or model complexity control, is an essential and common practice to ensure that a model attains high out-of-sample performance, even if the distribution of out-of-sample data (test/validation data) differs significantly from the distribution of in-sample data (training data). In essence, the model must balance having a small empirical loss (how “wrong” it is on the data it is given) with a small regularization loss (how complicated the model is).

In regularization, a model learns to balance between empirical loss (how incorrect its predictions are) and regularization loss (how complex the model is). Photo by Gustavo Torres on Unsplash

In supervised learning, regularization is usually accomplished via L2 (Ridge)⁸, L1 (Lasso)⁷, or combined L2/L1 (ElasticNet)⁹ penalties. For neural networks, there are also techniques such as Dropout³ and Early Stopping⁴. For now, we will focus on analytical regularization techniques, since their Bayesian interpretation is more well-defined. These techniques are summarized below.

Let’s start by defining our dataset and parameters:
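Concretely, assuming N labeled examples with d-dimensional inputs and a weight vector w of matching dimension:

$$\mathcal{D} = \{(\mathbf{x}_i, y_i)\}_{i=1}^{N}, \qquad \mathbf{x}_i \in \mathbb{R}^d, \quad y_i \in \mathbb{R}, \qquad \mathbf{w} \in \mathbb{R}^d$$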

Next, for a given supervised learning problem in which we wish to minimize a loss function (e.g. Mean-Squared Error):
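For instance, assuming a linear model ŷᵢ = wᵀxᵢ, the mean-squared error objective is:

$$\mathcal{L}(\mathbf{w}) = \frac{1}{N}\sum_{i=1}^{N}\left(y_i - \mathbf{w}^\top \mathbf{x}_i\right)^2$$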

Then we have the following objectives for each type of analytical supervised regularization techniques:

  • L2 (Ridge): Penalization of the squared values of the parameters (the L2 norm). Intuitively, this constrains the magnitude of the model’s parameters to be small while minimizing how “wrong” the model is in its predictions.
  • L1 (Lasso): Penalization of the absolute values of the parameters (the L1 norm). Intuitively, this encourages some coefficients to be exactly zero. You may notice that the regularization term below is non-differentiable at zero; this non-differentiability is precisely what drives some coefficients to exactly zero while leaving others non-zero⁵.
  • L2/L1 (ElasticNet): This regularization technique penalizes both the L1 and L2 norms of the parameter vector, resulting in a combination of the regularization results from L1 and L2 regression.
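For the linear model and loss above, these regularized objectives can be written as:

$$\text{Ridge:} \quad \mathbf{w}^* = \arg\min_{\mathbf{w}} \; \sum_{i=1}^{N}\left(y_i - \mathbf{w}^\top \mathbf{x}_i\right)^2 + \lambda \|\mathbf{w}\|_2^2$$

$$\text{Lasso:} \quad \mathbf{w}^* = \arg\min_{\mathbf{w}} \; \sum_{i=1}^{N}\left(y_i - \mathbf{w}^\top \mathbf{x}_i\right)^2 + \lambda \|\mathbf{w}\|_1$$

$$\text{ElasticNet:} \quad \mathbf{w}^* = \arg\min_{\mathbf{w}} \; \sum_{i=1}^{N}\left(y_i - \mathbf{w}^\top \mathbf{x}_i\right)^2 + \lambda_1 \|\mathbf{w}\|_1 + \lambda_2 \|\mathbf{w}\|_2^2$$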

Below is a comparative plot showing the unit balls of the Lasso, ElasticNet, and Ridge penalties. The code to generate this plot can be found in the Appendix.

Plot illustrating the different effects of regularization. Left: Lasso, Middle: ElasticNet, Right: Ridge.

In summary, these regularization techniques accomplish different objectives for controlling the complexity of our models. In the next section, we will derive these regularized objectives by imposing a prior belief (in the form of a probability distribution) on our model parameters, thus directly making the link between prior beliefs and model regularization.

Model Regularization as a Prior Belief on Models

Let’s dive deeper into the probabilistic and optimization theory behind implementing regularization through a prior belief in our model parameters. Specifically, we will demonstrate that:

  1. L2 Regularization (Special Case of Tikhonov Regularization⁶) is achieved via a Multivariate Gaussian Prior
  2. L1 Regularization (LASSO) is achieved via a Multivariate Laplace Prior

We will analyze these claims for regression problems, but they extend to other supervised learning tasks, such as classification, as well. We’ll focus on rigorously presenting the mathematics behind these claims (you can also find these derivations in this document here). Let’s dive in!

L2 (Ridge) Regularization as a Multivariate Gaussian Prior

Intuitively, when using ridge regularization, we create a ridge in the surface/hypersurface learned by the model. Photo by Jeremy Bishop on Unsplash

Suppose we have the same set of observations D and parameter vector w defined in the previous section, which we want to optimize in order to best make predictions on samples from D.

Using the Maximum a Posteriori (MAP) rule, we can show that the mode of the posterior distribution of w (which, for a Gaussian posterior, coincides with its mean) is the solution for ridge regression when we invoke a Gaussian prior distribution on w. We first invoke Bayes’ Rule:
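$$p(\mathbf{w} \mid \mathcal{D}) = \frac{p(\mathcal{D} \mid \mathbf{w})\, p(\mathbf{w})}{p(\mathcal{D})} \propto p(\mathcal{D} \mid \mathbf{w})\, p(\mathbf{w})$$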

We now define our prior and observation model distributions, with the following assumptions:

a. Prior Model (distribution over parameters w):
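We assume a zero-mean isotropic Gaussian prior, writing its variance as τ so that it matches the setting λ = σ²/τ used later:

$$\mathbf{w} \sim \mathcal{N}(\mathbf{0}, \tau \mathbf{I}_d), \qquad p(\mathbf{w}) = \frac{1}{(2\pi\tau)^{d/2}} \exp\!\left(-\frac{\|\mathbf{w}\|_2^2}{2\tau}\right)$$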

b. Observation Model (conditional distribution of observations D conditioned on parameters):
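Assuming the targets are generated by the linear model with additive i.i.d. Gaussian noise of variance σ²:

$$y_i \mid \mathbf{x}_i, \mathbf{w} \sim \mathcal{N}(\mathbf{w}^\top \mathbf{x}_i, \sigma^2), \qquad p(\mathcal{D} \mid \mathbf{w}) = \prod_{i=1}^{N} \frac{1}{\sqrt{2\pi\sigma^2}} \exp\!\left(-\frac{(y_i - \mathbf{w}^\top \mathbf{x}_i)^2}{2\sigma^2}\right)$$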

Now, let’s substitute these expressions into Bayes’ Rule:
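$$p(\mathbf{w} \mid \mathcal{D}) \propto \left[\prod_{i=1}^{N} \frac{1}{\sqrt{2\pi\sigma^2}} \exp\!\left(-\frac{(y_i - \mathbf{w}^\top \mathbf{x}_i)^2}{2\sigma^2}\right)\right] \frac{1}{(2\pi\tau)^{d/2}} \exp\!\left(-\frac{\|\mathbf{w}\|_2^2}{2\tau}\right)$$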

To derive our L2 regularized estimator, we now use the MAP rule and take the negative logarithm of the posterior to transform this product of densities into a sum of terms.

This operation is permissible because the negative logarithm is a strictly monotonic (decreasing) transformation of the likelihood, so maximizing the posterior is equivalent to minimizing its negative logarithm, and both yield the same optimizing argument.

Substituting our posterior distribution into our expression for negative log-likelihood:
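$$-\log p(\mathbf{w} \mid \mathcal{D}) = -\sum_{i=1}^{N} \log p(y_i \mid \mathbf{x}_i, \mathbf{w}) - \log p(\mathbf{w}) + \log p(\mathcal{D})$$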

Since logarithms transform products into sums, we can decompose this logarithm of products into a sum of summation terms that depend on our parameters w, along with constants that do not depend on w:
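$$-\log p(\mathbf{w} \mid \mathcal{D}) = \frac{1}{2\sigma^2}\sum_{i=1}^{N}\left(y_i - \mathbf{w}^\top \mathbf{x}_i\right)^2 + \frac{1}{2\tau}\|\mathbf{w}\|_2^2 + \text{const.}$$

where const. collects every term that does not depend on w.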

Removing terms that don’t depend on our parameters w, multiplying the expression by the constant 2σ², and using duality (the argument that maximizes an objective also minimizes its negation):
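$$\mathbf{w}^* = \arg\min_{\mathbf{w}} \; \sum_{i=1}^{N}\left(y_i - \mathbf{w}^\top \mathbf{x}_i\right)^2 + \frac{\sigma^2}{\tau}\|\mathbf{w}\|_2^2$$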

Setting λ = σ² / τ then shows that our MAP estimator is exactly the estimator obtained via ridge regression (when our data is centered around 0):
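$$\mathbf{w}^*_{\text{MAP}} = \arg\min_{\mathbf{w}} \; \sum_{i=1}^{N}\left(y_i - \mathbf{w}^\top \mathbf{x}_i\right)^2 + \lambda \|\mathbf{w}\|_2^2 = \mathbf{w}^*_{\text{ridge}}$$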

This corresponds exactly to our ridge objective (for regression) above! In closed-form, this yields the normal equations:
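$$\mathbf{w}^* = \left(\mathbf{X}^\top \mathbf{X} + \lambda \mathbf{I}\right)^{-1} \mathbf{X}^\top \mathbf{y}$$

where X denotes the N × d design matrix whose rows are the inputs xᵢ, and y the stacked vector of targets.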

The λI term ensures that the matrix to be inverted, X^T X + λI, is positive definite, and therefore invertible.

Therefore, placing a Multivariate Gaussian prior on our parameters is equivalent to regularizing our parameters with an L2-norm penalty.

L1 (Lasso) Regularization as a Multivariate Laplace Prior

With Lasso regularization, we encourage sparsity, i.e. zeroing out model coefficients that don’t contribute much to reducing empirical loss. Photo by Markus Spiske on Unsplash

We’ll now examine a similar case with a Laplace prior. Suppose again that we have the dataset of observations D and parameter vector w defined above, which we want to optimize in order to best make predictions on samples from D.

Again using the MAP rule, we can show that the mode of the posterior distribution of w is the solution for LASSO regression when we invoke a Laplace prior distribution on w. Bayes’ Rule takes exactly the same form as in the ridge case.

We now define our prior and observation model distributions, with the following assumptions:

a. Prior Model (distribution over parameters w):
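We assume a multivariate Laplace prior with inverse scale α_1, written below in a generalized form with norm exponents p and q so that the hyperparameter-setting step later (p = 1, q = 1) recovers the L1 penalty:

$$p(\mathbf{w}) \propto \exp\!\left(-\alpha_1 \|\mathbf{w}\|_p^q\right), \qquad \text{with } p = q = 1: \quad p(\mathbf{w}) \propto \exp\!\left(-\alpha_1 \|\mathbf{w}\|_1\right)$$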

b. Observation Model (same Gaussian conditional distribution as above).

Repeating the same steps as above, namely:

  1. Writing out the posterior distribution on parameters and optimal parameters w* using the MAP rule.
  2. Deriving the negative log-likelihood function from the MAP objective.
  3. Removing terms from the negative log-likelihood that don’t depend on w.
  4. Appropriately setting the hyperparameters p, q, α_1, and α_2.

Here is the corresponding derivation for Lasso:
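A sketch of this derivation, writing α_2 = 1/(2σ²) to absorb the noise variance of the observation model:

$$
\begin{aligned}
\mathbf{w}^* &= \arg\max_{\mathbf{w}} \; p(\mathcal{D} \mid \mathbf{w})\, p(\mathbf{w}) = \arg\min_{\mathbf{w}} \; -\log p(\mathcal{D} \mid \mathbf{w}) - \log p(\mathbf{w}) \\
&= \arg\min_{\mathbf{w}} \; \alpha_2 \sum_{i=1}^{N}\left(y_i - \mathbf{w}^\top \mathbf{x}_i\right)^2 + \alpha_1 \|\mathbf{w}\|_p^q \\
&= \arg\min_{\mathbf{w}} \; \sum_{i=1}^{N}\left(y_i - \mathbf{w}^\top \mathbf{x}_i\right)^2 + \lambda \|\mathbf{w}\|_1
\end{aligned}
$$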

(Note that in the last step, we set p = 1, q = 1, and λ = α_1 / α_2.)

As before, we have derived our Lasso objective (for regression) as described in the section above!

Why Does This Matter?

Photo by Dmitry Ratushny on Unsplash

Though it may seem like all we did was invoke some tricks with optimization, logarithms, and probability distributions, the significance of the above derivations may be best understood in the context of the aforementioned book “How We Learn”.

If I, a human being, am learning a model of the world, why do I tend toward the simplest model that explains my observations¹? One reason is that our brain forms “prior beliefs” over the models we learn even before we learn them, and we take these beliefs into account when we learn from experience. Rather than learning these models only from experience, we use experience to update our prior beliefs about these models.

By placing a prior belief that the models we learn must be as simple as possible, we are able to control the complexity of the models we learn even before we learn them! This is exactly what we have done with our analytic derivations above: by placing a prior belief on the distribution of our model parameters (i.e. “the model parameters w are normally-distributed”) we are able to directly shape how complex these models are.

Summary

In this article, we introduced the ideas of model complexity and regularization, and how these concepts relate to the idea of prior beliefs. We made the claim that we can control the complexity of a model, i.e. regularize it, by setting a prior belief on our distribution of parameters. We then briefly introduced some common analytical, supervised regularization techniques (Ridge, Lasso, and ElasticNet regression). We then showed how we can derive the objective functions for Ridge and Lasso regularization using Multivariate Gaussian and Laplace prior distributions, respectively. Finally, we talked about why these results are significant not only for machine learning, but for psychology as well.

Thanks for reading 🙂 Please follow me for more articles in reinforcement learning, computer vision, programming, and optimization!

References

[1] Dehaene, Stanislas. How we learn: Why brains learn better than any machine… for now. Penguin, 2020.

[2] Rasmussen, Carl Edward, and Zoubin Ghahramani. “Occam’s razor.” Advances in neural information processing systems (2001): 294–300.

[3] Srivastava, Nitish, et al. “Dropout: a simple way to prevent neural networks from overfitting.” The Journal of Machine Learning Research 15.1 (2014): 1929–1958.

[4] Caruana, Rich, Steve Lawrence, and Lee Giles. “Overfitting in neural nets: Backpropagation, conjugate gradient, and early stopping.” Advances in neural information processing systems (2001): 402–408.

[5] Tibshirani, Robert. “Regression shrinkage and selection via the lasso.” Journal of the Royal Statistical Society: Series B (Methodological) 58.1 (1996): 267–288.

[6] Calvetti, Daniela, and Lothar Reichel. “Tikhonov regularization of large linear problems.” BIT Numerical Mathematics 43.2 (2003): 263–283.

[7] Tibshirani, Robert. “Regression shrinkage and selection via the lasso.” Journal of the Royal Statistical Society: Series B (Methodological) 58.1 (1996): 267–288.

[8] Hoerl, Arthur E., and Robert W. Kennard. “Ridge regression: Biased estimation for nonorthogonal problems.” Technometrics 12.1 (1970): 55–67.

[9] Zou, Hui, and Trevor Hastie. “Regularization and variable selection via the elastic net.” Journal of the Royal Statistical Society: Series B (Statistical Methodology) 67.2 (2005): 301–320.

Appendix

Code to generate plots (adapted from this Stack Overflow post).
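Below is a minimal sketch in Python (using numpy and matplotlib; the equal 50/50 ElasticNet mixing weight is an assumption made for illustration) that draws the unit balls of the Lasso, ElasticNet, and Ridge penalties:

```python
import numpy as np
import matplotlib.pyplot as plt

# Grid over which we evaluate each penalty for a 2D parameter vector w = (w1, w2).
xs = np.linspace(-1.5, 1.5, 400)
ys = np.linspace(-1.5, 1.5, 400)
W1, W2 = np.meshgrid(xs, ys)

# Penalty surfaces.
l1 = np.abs(W1) + np.abs(W2)                                      # Lasso (L1 norm)
l2 = W1**2 + W2**2                                                # Ridge (squared L2 norm)
enet = 0.5 * (np.abs(W1) + np.abs(W2)) + 0.5 * (W1**2 + W2**2)    # ElasticNet (equal mixing)

fig, axes = plt.subplots(1, 3, figsize=(12, 4))
for ax, surface, title in zip(axes, [l1, enet, l2], ["Lasso", "ElasticNet", "Ridge"]):
    # Draw the unit ball {w : penalty(w) = 1} of each regularizer.
    ax.contour(W1, W2, surface, levels=[1.0])
    ax.set_title(title)
    ax.set_xlabel("$w_1$")
    ax.set_ylabel("$w_2$")
    ax.set_aspect("equal")

plt.tight_layout()
plt.show()
```

The contour at level 1.0 traces each penalty’s unit ball: a diamond for Lasso, a circle for Ridge, and a rounded diamond for ElasticNet, which is what produces the characteristic corners (and hence sparse solutions) of the L1 penalty.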
