Argonauts:Expectation Maximization Algorithm
From Wasteland
Contents |
Sun 19 March 2006
Spatio-temporal coordinates
WMVL, 12.30pm to 2.30pm (exactly)
Attendees (in alphabetical order)
- Ramón Casero Cañas.
- Etienne von Lavante.
- Rohan Loveland.
Minutes
This meeting could be described in many ways, but let's not get foul mouthed. There's no photo, for it'd be a photo of shame of a meeting of shame.
We chose a really bad reference to read about the Expectation Maximization (EM) algorithm
F. Dellaert. The expectation maximization algorithm.College of Computing. Georgia Institute of Technology. Tech Report. Feb 2002.
We did everything wrong: The meeting started late, we attendees didn't focus on the problem, the reference we chose was a disaster, we got locked into details...
Next time we will follow Duda & Hart. O yeah.
Sun 16 April 2006
Spatio-temporal coordinates: WMVL, 11.00am to 1.00pm (exactly)
Attendees (in alphabetical order):
- Ramón Casero Cañas.
- Niranjan Joshi.
- Ingmar Posner.
Minutes
All that went wrong in the previous meeting went fine in this one. OK, we couldn't find a Duda & Hart, but we got pp. 62--73 of
C.M. Bishop. Neural Networks for Pattern Recognition. Clarendon Press, 1995.
Bishop presents the EM-algorithm as a solution applied to finding the parameters of a Gaussian mixture model from a set of incomplete data.
Basically, we want to solve this problem using Maximum Likelihood (ML). So we let the negative log-likelihood be the error function that we want to minimize. To find the minima, we just differentiate for each parameter, and equal to zero. Some parameters have constraints, so a couple of tricks are needed to have an optimization problem with no contraints, but this is not important for us now.
The catch in the ML approach is that we arrive at a set of highly non-linear coupled equations, so we have things like
parameter_0 = f_0(parameters) parameter_1 = f_1(parameters)
where f_0, f_1, ... are highly non-linear. This wouldn't be so if we knew in advance what point corresponds to what distribution.
Here the EM-algorithm comes to our rescue. We tranform the previous set of equations into something like
parameter_0(n) = f_0(parameters(n-1)) parameter_1(n) = f_1(parameters(n-1))
So now the parameters at iteration n depend on the parameters at iteration n-1. As the parameters at iteration n-1 are known values, we no longer have coupled equations.
Of course this only works if we can guarantee that the error function doesn't increase with each iteration. For this, we write
E(n) - E(n-1) <= 0
where E is the error function (the negative log-likelihood). But if we do this, we still have coupled equations, so instead of looking at E(n) - E(n-1) we look at an upper bound of E(n) - E(n-1), that is E(n-1) + Q
E(n) - E(n-1) <= Q <= 0
We obtain Q using Jensen's inequality and taking out all terms that depend on parameters(n-1). Thus we obtain an upper bound of the error function that is
E(n-1) + Q(n)
[See Eq. (2.92) in Bishop].
This is the Expectation (E) step of the algorithm, because it is equivalent to finding the expectation of the likelihood of the set of complete data. Note: what we obtain in this step is a whole function (the upper bound error function). In fact, this step is implicit and not used in the implementation of the algorithm.
The next thing to do is to find the minimum of the upper bound function. We differentiate with respect to the parameters(n), but this time we obtain simple update equations, instead of highly non-linear coupled ones.
These are Eq. (2.96)-(2.98) in Bishop, and represent the Maximization (M) step of the algorithm. It has this name because when the error is minimized the likelihood is maximized.
--

