WARNING: Summaries are generated by a large language model and may be inaccurate. We suggest that you use the synopsis, short and long summaries only as a loose guide to the topics discussed. The model may attribute to the speaker or other participants views that they do not in fact hold. It may also attribute to the speaker views expressed by other participants, or vice versa. The raw transcript (see the bottom of the page) is likely to be a more accurate representation of the seminar content, except for errors at the level of individual words during transcription.

Synopsis


MCMC is a powerful technique used in Bayesian statistics to sample from an intractable posterior. It includes the Metropolis-Hastings and Hamiltonian Monte Carlo (HMC) algorithms, both of which stochastically explore parameter space to find regions of high posterior density. HMC is the more sophisticated of the two, using Hamilton's equations of motion to guide the sampler, which makes it less likely to get stuck at local minima. NUTS is an enhancement of HMC that prevents u-turns, stopping the particle from repeatedly returning to the same region.

Short Summary


MCMC is a class of algorithms used in Bayesian statistics to sample from an intractable posterior; it includes the Metropolis-Hastings algorithm and Hamiltonian Monte Carlo. The posterior is calculated using Bayes' rule, which states that the posterior is equal to the likelihood function multiplied by the prior, divided by the evidence. The evidence is intractable, but the numerator, the likelihood multiplied by the prior, is tractable and can be computed.
Markov chain Monte Carlo (MCMC) is a method for sampling from a posterior distribution. It simulates a Markov chain whose stationary distribution is equal to the posterior, allowing us to estimate the posterior density and other quantities of interest. The Metropolis-Hastings algorithm is one such MCMC algorithm: it starts from a random initial guess and uses a proposal distribution centered at the current position to select a candidate next step, which is accepted with a probability proportional to the model and prior evaluated at the candidate. This acceptance step adds stochasticity to the process and ensures it does more than simply extremize the posterior.
Hamiltonian Monte Carlo is a Markov chain Monte Carlo algorithm that uses Hamilton's equations of motion to explore parameter space. It samples an initial velocity from a normal distribution and then simulates a particle trajectory through parameter space. The total Hamiltonian incorporates the magnitude of the velocity, biasing the sampler away from very high velocities that could fling it out of an already good region of the posterior. At each step, the probability of accepting the new candidate depends on the difference in energy between the old position and the new candidate. This stochasticity ensures that the sampler tends towards regions of low energy, which correspond to good models, and it can converge faster than the Metropolis-Hastings algorithm.
Hamiltonian Monte Carlo is an advanced sampling technique that keeps the sampler in regions where the posterior is dense. It simulates a trajectory through phase space according to the Hamiltonian and is more sophisticated than Metropolis-Hastings, making it less likely to get stuck at local minima. Even so, it should be run multiple times from different starting points with different initial velocities to fully explore the posterior. NUTS is an enhancement of HMC that prevents u-turns, helping the particle escape regions it would otherwise keep returning to.

Long Summary


Markov chain Monte Carlo (MCMC) is used to estimate the posterior in Bayesian statistics. It is a broad class of algorithms for sampling from an analytically intractable posterior. It includes the Metropolis-Hastings algorithm, which is the basic version, and Hamiltonian Monte Carlo, which is the main version used in practice. The setup involves a data set of independent and identically distributed random variables drawn from a true distribution, a class of models, and a prior distribution over the parameters. The posterior is the fundamental object of Bayesian statistics and is calculated using Bayes' rule.
Bayes' rule states that the posterior is equal to the likelihood function multiplied by the prior, divided by the evidence. The evidence is the integral of the likelihood times the prior over the full parameter space, and is usually intractable; as a result the posterior, the adjusted belief based on the data observed, is also usually intractable. However, the numerator, the model multiplied by the prior, is tractable and can be computed.
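To make the tractability point concrete, here is a minimal sketch (not from the talk) of evaluating the numerator for a toy one-parameter setup; the Gaussian model, the Gaussian prior, and all names are illustrative assumptions. Working with logarithms avoids numerical underflow.

```python
import numpy as np

def log_numerator(w, data, prior_std=10.0):
    """Log of (likelihood x prior), the tractable numerator of Bayes' rule.

    Assumed model: data points are i.i.d. N(w, 1).
    Assumed prior: w ~ N(0, prior_std^2).
    """
    log_likelihood = -0.5 * np.sum((data - w) ** 2) - 0.5 * len(data) * np.log(2 * np.pi)
    log_prior = -0.5 * (w / prior_std) ** 2 - 0.5 * np.log(2 * np.pi * prior_std ** 2)
    return log_likelihood + log_prior
```

The evidence, by contrast, would require integrating the exponential of this quantity over the whole parameter space, which is exactly the computation MCMC lets us avoid.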
Markov chain Monte Carlo (MCMC) is a method for sampling from the posterior distribution in order to extract information from it. The mathematical goal is to generate a set of samples such that the expectation of any function over the posterior is approximately equal to the mean of that function over the generated set. MCMC achieves this by simulating a Markov chain, a random walk in which the probability of the next step depends only on the present position and not on past history, whose stationary distribution is equal to the posterior. From the samples we can then calculate the posterior density and other estimates.
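In code, this goal is just a sample mean; a sketch assuming the generated samples are collected in an array:

```python
import numpy as np

def posterior_expectation(f, samples):
    # Approximate E[f(w)] under the posterior by the mean of f over the
    # generated samples; the approximation improves as the sample count grows.
    return np.mean([f(w) for w in samples])

# For example, the posterior mean and variance:
# m = posterior_expectation(lambda w: w, samples)
# v = posterior_expectation(lambda w: w**2, samples) - m**2
```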
The Metropolis-Hastings algorithm is a Markov chain Monte Carlo algorithm which starts with a random initial guess. The Markov step selects a candidate next step from a distribution centered at the current position, for example a Gaussian with variance one. The Monte Carlo step then accepts the candidate with a probability proportional to the model and prior evaluated at the new candidate, adding natural stochasticity to the process. This ensures that the process does more than simply extremize the posterior.
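A minimal sketch of this loop, assuming a log_numerator function like the one sketched above; with a symmetric Gaussian proposal, accepting with probability min(1, ratio of posterior numerators) is the standard Metropolis-Hastings rule, and the test is done in log space:

```python
import numpy as np

def metropolis_hastings(log_post, w0, n_samples, step_std=1.0, seed=0):
    rng = np.random.default_rng(seed)
    w, samples = w0, []
    for _ in range(n_samples):
        # Markov step: propose a candidate from a Gaussian centered at w.
        candidate = w + step_std * rng.normal(size=np.shape(w))
        # Monte Carlo step: accept with probability min(1, numerator ratio).
        if np.log(rng.uniform()) < log_post(candidate) - log_post(w):
            w = candidate
        samples.append(w)
    return np.array(samples)
```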
Hamiltonian Monte Carlo is a method of sampling from a posterior distribution which replaces the Metropolis-Hastings algorithm's Markov step with a particle simulation. This allows for faster exploration of parameter space and helps the sampler get past potential barriers in the model-prior setup. The algorithm begins by sampling an initial velocity from a normal distribution and then simulating a trajectory through parameter space according to Hamilton's equations of motion. The total Hamiltonian incorporates the magnitude of the velocity, biasing the sampler away from very high velocities that could disrupt a good region of the posterior.
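In symbols (a reconstruction from the description above, using the talk's notation of a model p(D_n | w) and prior φ(w)):

```latex
H(w) = -\log\big(p(D_n \mid w)\,\varphi(w)\big), \qquad
H_{\mathrm{tot}}(w, v) = H(w) + \tfrac{1}{2}\lVert v \rVert^{2},
\qquad
\frac{dw}{d\tau} = v, \quad \frac{dv}{d\tau} = -\nabla H(w).
```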
Hamiltonian Monte Carlo is a Markov chain Monte Carlo algorithm that uses the gradient of the Hamiltonian to simulate a trajectory from an initial position and velocity. At each step, the new candidate is accepted with probability min(1, exp(-ΔH)), where ΔH is the difference in total energy between the new candidate and the old position. This stochasticity ensures that the sampler tends towards regions of low energy, which correspond to good models. Hamiltonian Monte Carlo typically converges faster than Metropolis-Hastings.
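A sketch of a single HMC transition under these definitions, using the leapfrog integrator that is standard for HMC; grad_H, the step size eps, and the number of integration steps are assumed inputs and tuning choices, not prescriptions from the talk:

```python
import numpy as np

def hmc_step(H, grad_H, w, eps=0.1, n_steps=20, seed=None):
    rng = np.random.default_rng(seed)
    v0 = rng.normal(size=np.shape(w))     # Hamiltonian step: random initial velocity
    w_new = w + 0.0                       # copy of the current position
    v = v0 - 0.5 * eps * grad_H(w)        # leapfrog: initial half step for velocity
    for _ in range(n_steps):
        w_new = w_new + eps * v           # full step for position
        v = v - eps * grad_H(w_new)       # full step for velocity
    v = v + 0.5 * eps * grad_H(w_new)     # correct the last step back to a half step
    # Total Hamiltonian = potential H(w) + kinetic |v|^2 / 2.
    h_old = H(w) + 0.5 * np.sum(v0 ** 2)
    h_new = H(w_new) + 0.5 * np.sum(v ** 2)
    # Monte Carlo step: accept with probability min(1, exp(-(h_new - h_old))).
    if np.log(rng.uniform()) < h_old - h_new:
        return w_new                      # accept the candidate
    return w                              # reject and stay put
```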
Hamiltonian Monte Carlo is a sampling technique which keeps the sampler in regions where the posterior is dense. It can be thought of as a skate park, with the Hamiltonian giving the shape of the park. An initial velocity is chosen and the particle's motion is simulated until a stopping time is reached; the position at that stopping time is the new candidate. The process is repeated, and the chain's stationary distribution converges to the posterior, allowing us to sample from it accurately.
Hamiltonian Monte Carlo is an algorithm that uses the information in the model and prior to simulate a trajectory through phase space according to the Hamiltonian. It sticks to regions where probability is densely packed according to the posterior, and is more sophisticated than Metropolis-Hastings, which simply takes any old random direction. As a result, Hamiltonian Monte Carlo is less likely to get stuck at local minima.
Hamiltonian Monte Carlo (HMC) is a state-of-the-art sampling algorithm, but it is still susceptible to getting stuck in undesirable regions, particularly those far away from the global maxima of the posterior. To avoid this, HMC should be run multiple times from different starting points, using different initial velocities, to ensure the fullest exploration of the posterior; see the sketch below. NUTS (the No-U-Turn Sampler), an enhancement of HMC, prevents u-turns so that the particle does not keep returning to the same region.
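The multiple-runs advice translates directly into code; a sketch reusing the hypothetical hmc_step from above, with the number of chains and the spread of the starting points as arbitrary illustrative choices:

```python
import numpy as np

def run_hmc_chains(H, grad_H, n_chains=4, n_samples=1000, init_scale=5.0, seed=0):
    rng = np.random.default_rng(seed)
    chains = []
    for _ in range(n_chains):
        w = init_scale * rng.normal()     # dispersed random starting point
        samples = []
        for _ in range(n_samples):
            w = hmc_step(H, grad_H, w)    # fresh random velocity each transition
            samples.append(w)
        chains.append(np.array(samples))
    return chains  # compare or pool the chains to assess posterior coverage
```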

Raw Transcript


my name is liam for those of you who i haven't met one of which is definitely ethan nice to meet you ethan so this morning i am going to give an introduction to markov chain monte carlo which is abbreviated as mcmc which is one of the hallmarks of bayesian statistics and with the catchphrase accessing the posterior so the general idea is that in bayesian statistics the posterior is the main object of interest it carries the information that we are interested in about parameters that are used to model the data that we are trying to model and in general the posterior is analytically intractable and so markov chain monte carlo is the algorithm that is used to estimate the posterior to access the posterior to get the information that we want out of it and in particular to sample from it so it is a very broad class of algorithms and i'm going to explain two of them so the aim of the talk is to explain why mcmc is necessary and secondly what mcmc is in particular the metropolis hastings algorithm which is the sort of the starting point the most basic version and then the step up from that is hamiltonian monte carlo which is the the main one that is used in practice and variants on hamiltonian monte carlo so that is improved metropolis hastings so our setup is as follows we have a data set as edmund talked about of identically and independently distributed random variables of samples from some truth which we denote from the truth q of x some true distribution and we have a class of models denoted by p dn given w such that w is in the parameter space so capital w here is the parameter space little w refers to a parameter which is a subset of r to the d for however many parameters you have being d and so we believe that this class of models is a model for or one of these models is a model for the data that we have gotten in the data set that we have sampled from the truth and so the object is to sorry the objective is to try and assess which of these models is good um and our other part of the setup which i will elaborate on in a little bit is some prior distribution denoted by phi of w on the space of parameters capital w okay so definition i'm wondering if i've got enough space here we'll see here we go so the posterior is sort of the fundamental object of bayesian statistics and edmund has just been writing this up on the other board but let me write it out for you here so by bayes rule the posterior is some probability of a parameter given the data set that we have drawn
from the truth which according to bayes rule is equal to the following the likelihood function so the likelihood that the data has been drawn from a model given the parameter that we're interested in little w so as i was referring to up here the models multiplied by the prior divided by the evidence so this is the posterior this is the likelihood so i should use a different color for this so this is the likelihood think of that as the model this is the prior and this is the evidence which is essentially because it does not depend on a parameter w it is a normalizing factor such that the posterior is actually a probability distribution on the space of parameters so the evidence is equal to an integral over the full space capital w of the numerator up there the likelihood times the prior so the way to think of the posterior is that it is a probability density on the space of parameters which assigns probabilities to different possible parameters according to the likelihood that the data came from that parameter and according to the prior belief that we had about what the space of parameters looked like or what the likelihood of that parameter was so yeah so in as i said before the posterior is essentially the fundamental object of all of bayesian statistics and you can also think of it as an adjusted belief it is how we adjust our prior beliefs based on the data that we have observed so i'll move to the next board yep so i'll just write it up here in the corner again okay so in general the evidence term which recall the evidence is this integral over the whole space of the model or the likelihood times the prior the evidence is highly intractable so there are a few pathological examples where the evidence is tractable when you've got a very small parameter space sorry when you've got a very small dimensionality of the parameter space and when you pick the likelihood and the prior such that they essentially combine in a nice fashion so the key word there is something called a conjugate prior so there are a few pathological examples where the posterior is tractable but in general in most cases that one cares about and in pretty much all of computational bayesian statistics the posterior is not tractable so because the evidence is not because that integral is very difficult to compute the posterior is itself intractable and oh i should say so despite this the um sorry but the numerator there the model times the prior is tractable so it is at least computable for
any given parameter that you care about because generally the model will be something like a normal distribution and potentially even the same thing for the prior w so we can compute the numerator but not the denominator and so markov chain monte carlo uses this fact all right so any questions so far cool so sampling from the posterior so since the posterior contains all of the information that we want about what parameters are so-called good to model the data that we have drawn from the truth um we want to estimate what this posterior is we want to be able to access it which is similar to saying that we want to be we want to sample from it and in doing so we can extract information oh actually i'll write out something first and then i'll say that in a sec so the mathematical goal of markov chain monte carlo or just from to sample from the posterior in general is to generate a set of samples of parameters which i will denote with a superscript k some set of samples such that for any function on the space of parameters any f of w we have the following the expectation of f of w over the posterior is approximately equal to the mean over the generated set of samples that we have obtained from something like markov chain monte carlo and this is obviously as we take an infinite number of samples as capital k goes to infinity so if we can get a set of samples that gives us this relation then we can extract any information that we care about from the posterior so we can calculate things like the posterior density over small subsets of w and we can calculate things like um means and medians and all sorts of estimates to to extract information from the posterior so the idea then of markov chain monte carlo is to simulate a markov chain whose stationary distribution which is another way of saying sort of long-term equilibrium is equal to the posterior that we want so here recall if you haven't heard this before a markov chain is just a random walk on the space of parameters such that the next step only depends on the present on the present position and is independent or in particular does not depend i should say does not depend on past history so if we've got a chain and we're going from here to here to here up to here we've got some random walk a markov chain is saying that the probability of our next step being down here is independent of what happens back here so we don't care about this all that we care about is where we are at the present moment the previous
history does not matter for the random walk for the probability of the next step so with this in mind i'm going to move to the next board this brings us to the first markov chain monte carlo algorithm the metropolis hastings metropolis hastings algorithm so you will always start with some random initial guess but let's just suppose suppose we are at sample w little k that is our little sorry that is our current position the first step of the algorithm is the markov step which is to select a candidate next step from a distribution depending on sorry can someone tell me if they can hear me yep we can just here yep no i can hear you yeah all right great apologies i hit the home button on my ipad and that clearly just exits me from the world so i need to be more careful about that okay so the markov step we are selecting a candidate step from a distribution depending on the current step or the current position and it would be something like the next position could be distributed normally centered at the current position with something like a variance of one so this ensures that our next step doesn't escape too far from the previously accepted sample the second step in the algorithm is the monte carlo step and that is that we choose to accept the candidate sample with a probability proportional to the model and prior that we have chosen proportional to the model evaluated at the new candidate times the prior evaluated at the new candidate so essentially this is where we are able to use our model and prior to tell us whether we expect this new candidate to be good whether it has a high likelihood according to the model and a high likelihood according to the prior or whether it is bad and so i'll explain this a little bit more in the um when i talk about hamiltonian monte carlo but the fact that we choose this with a problem sorry we choose to accept with a probability proportional to this is to say that we are adding some natural stochasticity into this process so that it isn't just all about extremizing finding the extrema of the posterior so if we didn't have this step then all this process is doing is basically just stochastic gradient descent it's picking some random direction in the first step and then it's saying okay is this better or worse than where i was before is it more likely or less likely according to the posterior numerator and if it's more likely then we are more likely to accept it but there is some probability of doing so
because otherwise it would just be minimal well maximizing the posterior the whole time which is clearly not going to give us very good samples it's just going to approach the maximum likelihood estimator so the problems with the metropolis hastings algorithm number one the exploration step is difficult due to potential well due due to sometimes there can be potential barriers in our model prior setup and these potential barriers essentially equate to you can think of them as sort of infinitely high walls that are very difficult to get past so there can be different regions of parameter space where there is uh various density sorry there can be different regions of parameter space where the posterior is dense has some sort of peak and if we can't get there because they are separated by a long distance then uh metropolis hastings struggles because it might only be taking quite small steps second why walk when you can run which is a phrase that i have learnt from a very good blog post that i'll reference at the end of the talk which is to say that the markov chains can be very slow very slow to converge and there's actually more info on offer to us to make a better guess in the markov step so that brings us to the fix which is hamiltonian monte carlo the idea of hamiltonian monte carlo is to replace the markov step with a particle simulation through parameter space w according to hamilton's equations of oops hamilton's equations of motion so um i might move on to the next board so our setup here is that we have a hamiltonian function which is just the negative log of the model and the prior so hamiltonian is such that the posterior is proportional to this model prior setup and we set the hamiltonian to be such that it is the negative log of the model and the prior we also define v to be the velocity of a particle let v be the velocity and then we define the total hamiltonian to also incorporate the magnitude of the velocity so as to bias away from ridiculously high velocities that can seriously disrupt us if we're in a good region of the posterior already so the algorithm proceeds as follows as i said the key idea is that we replace the markov step with a particle simulation so number one the hamiltonian step we start by sampling an initial random velocity according to something like a normal distribution we let tau denote the time in the hamilton's equations and then we simulate a trajectory through w according to hamilton's equations of motion
which look like the following negative the gradient of the hamiltonian that we care about so the key point here or i guess the markovian aspect of this is that at the initial time at tau equals zero we start the starting position of this trajectory is that our previously accepted sample wk and at the randomly drawn velocity v naught we then run the simulation up to some stopping time tau stop that might be fixed or it might be dynamically chosen according to the sort of algorithm that you're using and then we set the candidate to be where the candidate new parameter to be where the particle has stopped in its trajectory number two the second step is again the monte carlo step so this again just adds the natural stochasticity so that it isn't just glorified gradient descent stochastic gradient descent we let p be equal to the minimum of the following one and then the exponential of negative the total hamiltonian of the new candidate minus the total hamiltonian of the old position so what this does is that we are letting our probability be proportional to the energy change or the energy difference in these two steps where we have started versus where we are going what our new candidate is and so we let p be this and then we either set our new can oh sorry we take our new sample to be oops either the new candidate with prob p or the old position that we started at with prob one minus p and so in the first case this would be accepting the new candidate and in the second case it would be rejecting so as i was alluding to before the formula for p here is such that the sampler will tend towards regions of low energy and low energy equates to good models effectively regions of posterior density so the stochasticity in the acceptance is still there so that it isn't just a glorified stochastic gradient descent algorithm all right any questions about that yeah i actually have a question about um convergence yeah you have um like uh i guess i don't remember this but um so you got metropolis hastings and you've got this hamiltonian monte carlo right what are the how do they compare in terms of convergence um which one's faster and by how much so definitely hamiltonian monte carlo is faster to converge i don't know by how much off the top of my head but i will give you um a pictorial sort of intuition for why that is in a second as for convergence in general for other people in the audience that might not know about the convergent stuff
um these change oh sorry these chains their stationary distributions um will converge to the posterior when there are a few criteria that are met so i'm not going to go into the detail but there's something called the detailed balance equation which is sort of a physics inherited phrase which says that the probability of going to some candidate next step is the same as doing that in reverse so there's some symmetry in the space of the parameters or in the chains that we're taking on the space of parameters i should say and then also that the markov chain is ergodic so it is aperiodic each of the states are aperiodic and positive recurrent we know that there is some finite expected time within which the chain will return there um so yes so as for the speed of it i'll go to the first whiteboard once again the general idea with these hamiltonian dynamics oops cool the orb has followed me so the general idea with these hamilton dynamics is that it makes sure that the um sampler is actually sticking to regions where the posterior is uh dense or where the posterior is quite peaked so the analogy that i love to think about which is drawn from this blog post that i'll reference at the end analogy for hmc is that imagine the hamiltonian which is essentially our cost function as a skate park or skate rink and here we can take our model to be oops sorry all the way around we can take the model to be maybe just a gaussian that looks like that and so oh actually apologies let me say that the posterior just for a very simple case where you wouldn't really need hamiltonian monte carlo at all but just suppose that the posterior was a gaussian distribution then the hamiltonian will be just w squared on two so the simple harmonic oscillator so we can imagine that our skate rink is y equals x squared and so what this process is doing is we start at some initial position wk and we choose our random initial velocity which is normally distributed potentially it could be anything but this is essentially picking some random direction to flick a ball through the skatepark so we flick a ball with that initial velocity and then we simulate simulate the dynamics in the rink until some stopping time so what that is saying is that we let the ball roll through the rink something like that and then eventually it might stop here so it literally is a physical simulation and then we've got this as being w at our stopping time this might be our new candidate now what is going on here or the key
difference between this and metropolis hastings is that metropolis hastings just pick some random new position as our new candidate so it it might be as i was saying earlier a gaussian distribution centered at our previous step or at our current step i should say and so it might just pick the new position to be over here that might be our candidate but we can already see that the hamiltonian is much higher at that point and so it's a higher energy and it's therefore going to be a worse region of the posterior a lower probability region of the posterior and so what hamiltonian monte carlo is doing is it uses the information of h of w which it has access to because we can calculate the model and the prior and it simulates a trajectory through phase space according to that hamiltonian so that it actually sticks to regions that are quite densely um or have quite dense probability according to the posterior it doesn't just take any old random direction which is effectively what the metropolis hastings algorithm does so it confines your search for parameters to much better regions of phase space does that help kenneth sorry i was trying to unmute yeah so yeah clearly metropolis hastings is um sorry hamiltonian is better than that yeah that's a good explanation thanks no worries so i will absolutely say that this intuition was not formed by me i will send a blog post in the discord channel that gives a really really great visual insight into what exactly this is doing um i'm not sure if there's a if there's a formal um a formal difference in convergent speed well there's definitely a difference in convergent speed i'm not sure if there's sort of a provable rate difference between them it would be interesting i'm sure there's i mean these algorithms are um very widely studied because they are incredibly useful and uh in some sense what a lot of modern statistics is kind of centered on so i'm sure there is work on that but i'm not familiar with it myself um yeah cool so i think that's probably a good place to stop does anyone else have any questions i i do have a question but uh it's more like a i so basically i can't recall which one has better uh behavior in terms of not getting stuck at local minima i believe it is still hamiltonian monte carlo yes i um so metropolis hastings is really just the the starting point um there's there are potentially versions of metropolis hastings which are more sophisticated than what i outlined that are different to hamiltonian monte
carlo and potentially better in some other ways but in general hamiltonian monte carlo is still um is essentially the most widely used one in fact in my thesis so i might write this down remark one of the state-of-the-art algorithms at the moment is something called no u-turns or nuts and it is an enhancement it enhances hamiltonian monte carlo because it prevents prevents u-turns from happening essentially so if the geometry is such that um uh it might be likely to keep returning to the same spot based on the uh sort of distribution of initial velocities or something then hamiltonian monte carlo can get stuck just in the same region because it keeps sort of flying off and coming back down to the middle so or coming back down to its initial position so nuts prevents u-turns so that the particle doesn't keep returning as far as getting stuck in different regions hamiltonian monte carlo is still susceptible to that but not to the same degree as metropolis hastings i don't believe and in my thesis work i definitely did see some instances where hmc would get stuck in undesirable positions particularly those that were quite far away from what i knew to be the global minima of or the global maxima of the posterior so the global minima of the kullback-leibler divergence so in that sense really the best way of ensuring that you get full access to the posterior is to run hmc many times from different starting points to make sure that you're getting the best range of the posterior that you can the fullest exploration right right thanks for that explanation i i just thought the description you gave just now uh it's a really good um intuition for uh the exploration part but the description of getting confined at a more higher likelihood region seems to be a kind of a local minima focusing kind of description but i i think uh that might be solved by having the velocity term um if the velocity is high it might escape the region or something like that yes exactly exactly so really what you would be doing to try and fix this is um in some sense just fiddling with the kinds of velocities that the ball gets flicked with that the simulation starts with because if you've got a well that is like this and you've got two minima and you're trying to get um you're trying to get samples from both then you might keep getting initial velocities like this that are keeping the ball in this well and not escaping and so you need something that's going to give you an initial velocity