Gaussian Mixture Model



Original Source Here

We will be discussing the Gaussian Mixture Model.

  1. The GMM is nothing but a generalization of K Means clustering.
  2. Or another way k means clustering is nothing but a specific instance of the GMM with special constraints.

A basic prerequisite to this blog is that one must know about the Gaussian distribution.

The Gaussian is also called the normal distribution by statistics people.

However, the GMM is called the GMM because in this scenario the G stands for Gaussian and it’s not called a normal mixture model

Let’s first start with a basic intuition of what a Gaussian mixture is by considering only a single Gaussian we can begin with a typical example.

Suppose we want to measure all the heights of the students in our classroom.

We assume that this is a Gaussian distribution and therefore in order to fit a Gaussian model to the student heights we have to solve a maximum likelihood problem by the way. We call this maximum likelihood estimation.

So how do we do maximum likelihood estimation or MLE?.

Well let’s start by assuming we’ve collected a bunch of data x1 up to xN these represent the heights, then we can write our likelihood as follows.

It’s the product over each sample i from 1 up to n of the gaussian PDF on each xi and this expression the mean μ and the variance σ² are both parameters that we are trying to find.

The question is how do we use this expression to Find μ and the σ²

The key is in the name. It’s called the maximum likelihood estimation because we want to maximize the likelihood with respect to the parameters μ and σ².

So how do we maximize L?

Well, By calculus.

It says that the maximum of L is located at an extreme point. In other words where the gradient is zero.

Of course, you can check the second derivative to ensure that it’s a max and not a min.

First, we should answer a conceptual question why do we want to maximize L?

Aside from the obvious fact that it appears in the name maximum likelihood estimation. It makes sense because this is the joint probability of seeing all the data we observe given a set of parameters.

That out of all possible parameters μ and the σ² that the true μ and σ² would yield a high probability.

For example, it’s probably not the case that the true mean height of human beings is 1000 feet.

So in probability, it’s customary to first take the log of the likelihood and maximize that instead.

As you know the log is a monotonically increasing function.

So whatever input maximizes the function also maximizes the log of that function.

If you’re not convinced of this pick any numbers A and B and check that if A is greater than B then log A is also greater than log B you will probably not find any numbers where that is not the case.

So why do we want to take the log?

It makes our computation much easier.

So what do we get After taking a log? we get the sum over i from 1 to N of the expression you see here.

Intuitively you know this is easier because the exponential has disappeared.

The next step is to differentiate the log-likelihood with respect to μ and σ², Set the derivatives equal to zero and solve for μ when σ² under those conditions.

After doing the calculations what do we get?

Let’s start with μ

As expected we get that μ is the arithmetic average of all the xi.

Note that when we express the maximum likelihood estimate it’s wanted to put a hat on top to differentiate it from the true value of μ.

Now let’s do the same thing for the variance.

Actually, it’s a little simpler if we introduce a new variable called V and say that this is equal to σ².

So as expected when we work this out we get that V hat or sigma squared hat is equal to the sample variance of x. Note that because we don’t know the true μ we usually replace this with our predicted view which is the new hat we derived earlier.

Let’s suppose that instead of a gaussian we want a more accurate model.

If we look at this distribution of heights it would appear that there are in fact two bumps. We call this a multi-modal distribution.

It’s clear that no single Gaussian could ever fit this curve nicely.

But what if we had multiple Gaussians??

Here’s an example of a mixture of Gaussians.

It’s important that when you mix Gaussians than the mixture proportions sum up to 1. This is because since the GMM must be a valid probability distribution it’s integral from minus infinity to plus infinity must equal 1 and the only way to guarantee this is if the mixture components also sum to one. Notice that we use the symbol π to represent mixture proportions. This is standard notation.

So now that we have a model what do we do with it?

We want to do maximum likelihood to find the parameters of our model. However, unlike the single gaussian case we now have many parameters.

First, we have two parameters for each of our Gaussian is a mean and the variance.
Second, we have a mixture of proportion π for each of our Gaussians.

So in total for each Gaussian in our mixture, we must have three parameters.

If we had K Gaussian we would have three K parameters and that’s assuming the unit variant case

Let’s continue the next step obviously is to write down the likelihood.

And then once we have the likelihood we can take the log, take the derivative and set the derivative to 0 and solve for the parameters.

There is something unusual about this equation.

What is it?

What happens if we try to take the log of this expression on the right usually the log makes things simpler because the log of a product is the sum of logs and so on.

But the problem here is we have a sum inside of the log and there is no rule to simplify the log of a sum. in fact. Long story short there is no closed-form solution to this problem. We cannot in fact set the derivative to zero and solve for the parameters.

Instead, have to use what is called iterative methods and hope that your algorithm converges to a good solution…!

Although we could use a general iterative method such as gradient descent, we will see the more commonly used EM algorithm. EM stands for expectation maximization

It usually takes a lot of work in order to derive other updates which is one of the weaknesses of EM what We’re going to do is start by just looking at the algorithm itself So here’s the algorithm.

Notice that instead of a scalar X now assume we were working with vector observations. That’s why we see a capital σ ()which means the covariance matrix instead of lowercase σ² which just means scalar variance.

So this is now a multivariate Gaussian instead of a scalar Gaussian basically the algorithm proceeds in a loop and inside the loop. We are going to have two steps.

The first step is called the E step or the expectation step it is to calculate the responsibilities called the Gamma(γ). γ is indexed by both i and little k.

γ(i,k) is how much cluster K is responsible for data point I. In other words, how much does data point i belong to Cluster k and it also makes sense that this is the ratio of the probability on top which involves k only and the probability on the bottom which is the sum over all the clusters.

That’s like saying how much do you belong to K out of how much you belong to everyone else?

The second step is called the M step or the maximization step in this step. We simply update all of our parameters specifically πk, μk, and ∑k which are the mixture proportion, the mean and the covariance respectively. We do this for all k clusters.

Now again if we look at these updates we can make sure that they make sense. The mean Update looks like a weighted sample mean and the covariance update looks like a weighted sample covariance. In fact, they are weighted by the responsibilities we found in each step which makes perfect sense.

If i belong to a cluster k more then i should contribute more to calculating the sample mean of cluster k. The same logic applies to the weighted sample covariance. If we look at πk it also makes sense, saying if we sum up all the responsibilities for all points in cluster K and we divide by the total number of points that should give me the proportion of points that belong in cluster K

At this point is worth mentioning what probability this πk actually represent.

Here’s where we get back to this idea of hidden variables or latent variables we actually have a latent variable here called z. z is a categorical variable meaning it doesn’t take on real values; πk is actually the probability that z=k.

In other words, it’s the categorical probability distribution over z.

As it is a Probability distribution, that’s another reason why the π need to sum to 1.

Each individual Gaussian distribution is the density over X given that it belongs to some cluster. In other words, it’s p(x) given z. And thus in order to calculate P(x) we first have to multiply p(x) given z by P(z) and then sum over all possible values of z. This is just marginalization

It’s easy to confuse is πk and γ(i,k).

πk = distribution over Z without any knowledge of any data point x. It is Prior.

So what about γ(i,k)? This is actually a posterior. It represents P(z)= K given xi.

In other words, it’s the probability that the cluster identity is k.

In fact, this is nothing but Bayes rule.

SUMMARY

  1. we started with a basic example measuring heights of students in our class and trying to fit those heights to a Gaussian distribution. We can find the parameters of that Gaussian μ and σ² by using maximum likelihood estimation.
  2. Also, real heights might follow a multi-modal distribution. So if we want a more accurate model we should use a GMM instead of a single Gaussian.
  3. Unfortunately, we discovered that unlike the single Gaussian case the GMM has no closed-form solution from maximizing the likelihood.
  4. Instead, we have to resort to an iterative algorithm called expectation maximization.
  5. Unlike methods like gradient descent, expectation-maximization algorithms are non-trivial to derive and implement.

So for example while you can have a library like tensor flow or PI torch that can do gradient descent automatically no such thing as possible with expectation maximization.

AI/ML

Trending AI/ML Article Identified & Digested via Granola by Ramsey Elbasheer; a Machine-Driven RSS Bot



via WordPress https://ramseyelbasheer.wordpress.com/2021/02/20/gaussian-mixture-model/

Popular posts from this blog

I’m Sorry! Evernote Has A New ‘Home’ Now

Jensen Huang: Racism is one flywheel we must stop

Streamlit — Deploy your app in just a few minutes