Metropolis Hastings Review

When Randomness became the Leader

Image source

Very Short Introduction

Metropolis Hastings is a MCMC (Markov Chain Monte Carlo) class of sampling algorithms. Its most common usage is optimizing sampling from a posterior distribution when the analytical form is intractable or implausible to sample. This post follows the Statistics and the historical steps that led to the appearance of this algorithm.

Statistics is an inspiring discipline. The idea that each source of data: condo size in NY, annual income in Africa or distribution of sure names in Japan can be handled by using the same set of mathematical tools and provide a coherent corollary must not be taken for granted. As physics uses its rigorous equations to describe the entire real world phenomena, statistics use a fairly small amount of analytical tools to clarify the behavior of every real world data.

One of the leading inference problems in statistics is estimating distribution’s parameters such as identifying the parameters of Gaussian (mean and stdev) is one of the main assignment of statisticians. For using distributions, one must define a probability framework. There are two leading approaches for such frameworks:

· The frequentist

· The Bayesian

The frequentist has no prior knowledge about the data, she sees the probability as a long term frequencies. Each trial is a single experiment in an infinite sequence. She never assigns probability to parameters, i.e. a sample mean is a fixed measurable value. Nevertheless since it is fixed we cannot assign it a probability. In the words of one statistician: “there is no probability to have soup today” since today is not an infinite sequence of events.

Bayesian on the other hand has a belief. The Bayesian has a prior belief about the parameter’s distribution. Data is aggregated in order to update this belief. The belief in the parameters’ distribution is manifested in the prior distribution and the likelihood is the probability to get the data upon this belief. The product of these two functions is proportional to the posterior function. In the world of the Bayes not only the data but the parameters as well are random variables. In contrast to the frequentist, once a data is obtained, it is not a random variable any more.

In a more rigorous way we can write Bayesian models using Bayes formula :

Where H is our the hypothesis over the parameters and D is the data.

  • P(H) is the prior distribution on the parameters which is our assumption about the distribution of the parameters
  • P(D|H) is the likelihood : the probability that given a set of parameters H, we will get the sample D.
  • P(H|D) is the posterior distribution
  • P(D) is the data distribution

The objective is estimating the posterior upon the prior, the likelihood and the data distribution

What is the probability that a basketball player will make the next free shot given his statistics?

Consider a player that made 80 out of 100 attempts. The natural distribution for such problem is Binomial distribution with parameter q. It is known that the likelihood for y throws he made out of n becomes :

Still we need to integrate our prior belief . We assume that 0< q <1 and that it has a Beta distribution with the positive parameters α , β

We have :

Note that

We can write this explicitly

Which is Beta distribution with the parameters α+y ,β+n-y

All we need is just defining the values of α and β (our “probabilistic belief”).

We must notice to the followings:

1. The posterior has beta dist as the prior. We can consider that the parameters have been updated.

2. When the posterior has the same distribution as the prior, we say that the prior is the conjugate of the likelihood, its “prior conjugate”

This example illustrates a practical method to use Bayesian inference. However, as we know from other domains, life are sometimes tedious. Often, when we study “real world” Bayesian problems, we may encounter two obstacles:

  • The normalization term (the denominator) is intractable.
  • We don’t have a nice prior function for the likelihood

In this post we will discuss methods to resolve these obstacles.

Monte Carlo

Monte Carlo (or Monte Carlo simulation ) is a class of algorithms that uses randomness and sampling to solve mathematical problems. Particularly it is used when the function is difficult to be written analytically (there is no closed form). We use these methods for solving problems such as integration or optimization. Since they have been developed, Monte Carlo methods changed the approach of simulations usage. They offer solutions for deterministic problems using probability.

During the 1930s, Fermi studied neutron diffusion problem using random methods. However, he never published. In 1946, the world was working on a new invention: a fast computer. At that time, Stanislaw Ulam, one of Manhattan project veterans, was curious about method for estimating the chances of winning Solitaire. He realized that “abstract thinking” is less beneficial than simply let the new machine simulate this problem and simply count the results. This thought led him to think about physics problems such as neutron diffusion. Together with Von Neumann they became the pioneers of these methods. Mathematically, the first steps they took was converting differential equations into MC friendly form(as said beyond, they use stochastic methods to solve deterministic problems). Since this project was secret, Metropolis (a co-researcher that was leading MANIAC project two years later) gave the code name Monte Carlo (there Ulam’s uncle used to gamble). The appearance of MC created a need for better random generators, thus Von Neumann developed the pseudo generator machine (known as middle square method). These methods have been used in Los Alamos a while later in the invention of the hydrogen bomb.

The appearance of Monte Carlo offered a vast amount of numerical tools to solve mathematical problems. A prominent one was the definite integrals

The objective was to estimate the integral of a function h over a domain X. This problem can be defined as calculating the following average:

where X is a random variable and f a density function

We can use MC estimation

Simply by randomizing the sequence of x’s from f independently

The strong law of large numbers implies that this estimator converges to the analytical average of h under f almost surely

Consider the following integral

It can be written as follow:

Thus using the following estimator:

We have a stochastic tool that makes numerical problems tractable. However, a new dude arrived to the party: For distributions such as Gaussian or uniform we can always find a sampler. But sometimes we are interested in sampling from distributions that our s/w does not have a relevant sampler. In order to emphasize, knowing f does not imply having its sampler. Clearly we need to search for workarounds for such cases.

Recall that cumulative distribution function(CDF) of a random variable X is defined by

The Inverse CDF (inverse transformation algorithm) aims to sample from a distribution when one has its inverse cumulative.

Since the CDF F is a non-decreasing function on the real line, we can define a generalized inverse function G

Consider exponential distribution, we can define

Consider a r.v X, with a distribution f that we cannot sample, where G is given. We can sample as follow:

U ~ Unif(0,1), and X=G(U) we can simulate from f

Consider the interval (a,b).We wish to sample from

X~ N(μ,σ)I(a<X<b) (The truncated normal distribution in this interval).

An additional method to sample from functions that are difficult to sample is the accept-reject method. Here we wish to sample from a target pdf f where f is known up to a constant. Assume that we have an easy way to sample function g (e.g. uniform or Gaussian) which we call proposal function. Furthermore, we assume that there exists:

Ideally we have

We can use the following algorithm for simulating from f

The acceptance rate is then 1/m. The unknown constant does not influence the acceptance rate. In order to see this, we denote it this constant by t

Which implies

The main idea behind this algorithm is the ability to envelop f using the product of g and the unknown constant.

Where the X axis is the samples and Y axis is the quantity u*M*g(x). The output of such algorithm follows the following graph:

Markov property states that in a sequence of experiments, knowing the present makes the past and the future independent. The experiments are memory less. In a more rigorous way:

If P does not depend on t, we say that it is homogeneous.

A classical example for Markov chains is the “drunk man” problem. We wish to estimate the place of a drunk man at time t. Clearly we can measure his entire movement and construct a “full trajectory conditioned”distribution. However, if we know his place at time t-1 we have the complete distribution over the next step. One can imagine that at if at time t-1 we observe backward, the drunk’s location at time t-x, for x integer, is meaningless for predicting time t. The probabilities for the next step are fully determined by the current place.

Traditionally, we assume that the question whether tomorrow will rain or not is Markovian. We expect that knowing today’s weather provides a distribution on tomorrow’s weather. It can be represented using this matrix :

The upper row and the left column refers to rain (i.e. we have 0.9 probability that if today is rainy then tomorrow it will be too, and 0.2 that if today is sunny then tomorrow it will be rainy)

So, we are now familiar with one MC (Monte Carlo) and another MC (Markov Chain). It is a good time to try to combine them together.

Markov Chain Monte Carlo -MCMC

MCMC methods aim to sample from a distribution that we know up to a constant but cannot sample directly. In order to explain Metropolis Hastings, We will present its basic terminology.

Accessible actually means that if we are currently on state u there is a positive probability that we will arrive to state v in a finite number of steps

Communicate- Two state u and v are communicate if u->v and v->u

Namely, two states are communicate if they are accessible in both directions.

Irreducible — If all the states in a chain are communicate, the chain is irreducible .

Recurrent A state v is recurrent if the probability that it returns in a finite steps is 1.

Recurrence — A Markov Chain is called Recurrence if for all its states v are recurrent. If it is not recurrence it is Transient . A recurrent chain that for each state the expected time to return is finite , is called positive recurrent , otherwise it is null recurrent.

Namely, if you start from state j , you will return to j only in time steps which are multiplications of d. If d=1 we say the j is a-periodic

Obviously, if states u, v are communicate their periods are similar.

Ergodicity -A state is ergodic if it is a-periodic and positive recurrent. If all the states in a chain are ergodic then the chain is ergodic

Metropolis Hastings

Metropolis Hastings is considered the very first MCMC algorithm. As you already deduce from the previous sections it aims to sample from an arbitrary distribution that is not “samplable” and uses MCMC properties.

In 1953 Metrpolis, Rosenbluth, Rosenbluth, Teller and Teller (Indeed, too many Jews, spouses and extensive semantic correlation), studied phase transition phenomena. They published a paper Equation of State Calculations by Fast Computing Machines. The purpose of this paper was handling the “Hard disk in a box” problem:

Let’s assume the we have a box that contains substance molecules that cannot overlap (“hard disk”). They studied the ideal gas equation.

The research group used the following methodology:

Temperature (T ) is fixed

Amount of particles (n) is fixed

Gas volume (V) is fixed

They studied the pressure (P) as

Following these assumptions they surrounded each particle with a fixed cube and set a discretization within this cube. The idea was to randomize a new pair of coordinates with in the cube as long as they are valid (no overlap). They called this symmetric step Proposal function (see accept-reject section)

The process was therefore:

Metropolis was the most senior researcher. As I discussed above, he had a track record of working with Ulam and Von Neumann on Monte Carlo. However, there are claims that came (mainly from Rosenbluth but from Teller as well )that they all deserve a credit, (Teller as you all know became famous for another non-neglected invention).

In 1970 Hastings published a paper “Monte Carlo sampling methods using Markov chains and their applications”, there he extended the solution to non-symmetric proposal. For proposal q the final algorithm is the following:

One obvious remark: in MCMC the samples are not independent. Thus rather having a fixed M as in accept-reject, we duplicate the previous sample.


There are several classical examples for M.H. algorithm:

A more organized description of the Bayesian sampler is presented here

The plots beyond show:

1. The data sample

2. The set of Rho’s demonstrated

3. The posterior function

M.H has advantages but some disadvantages as well:

1 Although it works significantly better in high dimension than other accept-reject algorithms (it handles fairly well the curse of conditionality), It has a “burn in” rate, namely it converges slowly and the target is not achieved immediately.

2 Rare events are hardly taken.

3 Samples are correlated

Metropolis Hastings has already improvements:

1. The most famous is Gibbs algorithm that is commonly used in applications such as R.B.M and L.D.A.

2. In Physics and physical Chemistry the stochastic Annealing algorithm is massively used.

3 An interesting improvement of M.H. is the Hamiltonian Markov Chain (H.M.C)

On these topics we will talk in a different article.

This post required some one to challenge me to sit and write it and someone to verify that I am not too incoherent . The former has been made by Shlomo Kashani that made his chore as a ML evangelist, and the latter by savvy Ilana Volfin. They both did a massive proofreading ,so all the failures are their fault.


Markov Chain Monte Carlo in practice \Gilks, Richardson


Interested in theory behind the ML non-linearity stochasticity sampling Bayesian inference & generative models “Tiefe Gedanken sind ewig, daher der größte Spaß”