Bayesian Parameter Estimation with Metropolis MCMC in Haskell

Posted on January 26, 2015

When you have some measurement values at hand, you will quite often want to estimate parameters from the data. In principle you now have the choice between frequentist and Bayesian statistics. I believe that Bayesian statistics has many advantages, especially with respect to (SAR) calibration. This is because parameters are regarded as random variables, described by a probability distribution. After estimation, the uncertainty of this estimate is inherently part of the analysis (in form of the width of the probability distribution), which can then be used in a subsequent uncertainty analysis to determine the measurement uncertainty of a measurement result.

I will leave an introduction to Bayesian data analysis to others1, and concentrate here on the implementation of a simple Monte Carlo Markov chain (MCMC), using the Metropolis algorithm. The Metropolis algorithm is maybe the simplest of a range of Markov chain algorithms, and quite instructive to get your feet wet in the field of Bayesian simulation.

If you want to get serious with MCMC, then there are many full-fledged, efficient MCMC packages implemented in various programming languages. I have had good success with PyMC 2 for Python so far (and I am really looking forward to the much improved interface of PyMC 3, which is not yet officially released). The “new kid on the block” seems to be STAN, whose Python interface is also a pleasure to work with.

But now let’s get started, shall we?

Metropolis Monte Carlo Markov Chain in Haskell

Let’s look at a simple estimation example: linear regression. The idea is that we have measured \(N\) samples \(y_i\), \(i = 1 \dots N\), for given \(x_i\). We want to model the data according to \[y_i = ax_i + b + e_i,\] \[e \sim \mathrm{N}(0, \sigma^2),\] i. e., we want to fit a line through some noisy data. The unknowns are \(a\), \(b\), and \(\sigma\). Alternatively we could have written \[y \sim \mathrm{N}(ax + b, \sigma^2).\]

The concrete example we are going to look at is inspired by a post on an implementation of the Metropolis algorithm in R. Here I want to implement it using Haskell, not only to practice the language, but also to explore how well this compiled language lends itself to numerical simulations like this. Thanks to another blog post, it was a surprisingly pleasant experience.

By the way, the source file of this blog post is a literate Haskell file which can directly be compiled and executed, should you want to try it out for yourself!

Let’s get started with some imports…

> import Control.Monad
> import Control.Monad.State (evalState)
> import Data.Histogram (asList)
> import Data.Histogram.Fill
> import Data.Histogram.Generic (Histogram)
> import Data.List
> import Data.Random
> import Data.Random.Source.PureMT
> import Statistics.Distribution hiding (mean)
> import Statistics.Distribution.Normal
> import Statistics.Distribution.Uniform
> import Text.JSON
> import qualified Data.Histogram as H
> import qualified Data.Vector.Unboxed as V

Our model relates the known variable \(x\) to an output value \(y\), as long as \(a\), and \(b\) are known:

> predict :: Double -> Double-> Double -> Double
> predict a b x = a * x + b

Having defined this, we can generate some noisy test data now. Later on we will try to estimate the true values for \(a\), \(b\), and \(\sigma\) from this test data.

> trueA, trueB, trueSd :: Double
> trueA = 5
> trueB = 0
> trueSd = 10
> nSamples :: Int
> nSamples = 31
> xs :: [Double]
> xs = take nSamples $ [-15.0 ..]
> arbSeed :: Int
> arbSeed = 4022 -- Arbitrary seed for generating random numbers
> ys :: [Double]
> ys = zipWith (+) noise (map (predict trueA trueB) xs)
>   where noise = evalState (replicateM nSamples (sample (Normal 0.0 trueSd)))
>                 (pureMT $ fromIntegral arbSeed)

In order to include plots in this page, I use the C3.js JavaScript chart library, which can read in JSON data. So we export the generated test data as a JSON file:

> exportTestData :: IO ()
> exportTestData = writeFile "test-data.json" . encode . toJSObject $
>                  [("x", xs), ("y", ys)]

This is what our generated test data looks like, ys over xs:

Generated noisy test data

In order to draw from the posterior, we need to determine the likelihood function and the prior. Let’s start with the likelihood. The record level likelihood is given as \[ p(y_i|a, b, \sigma, x_i) = \frac{1}{\sqrt{2\pi}\sigma} \exp \left[ -\frac{(y_i-(ax_i + b))^2}{2\sigma^2} \right], \] i. e., the probability of sample \(y_i\) under a normal distribution with mean \(a x_i + b\) and variance \(\sigma^2\). Each sample \(y_i\) is assumed to be drawn independently, so the model-level likelihood for the complete sample \(y\) is the product of the separate likelihoods, or \[ p(y|a, b, \sigma, x) = \prod_{i=1}^N \frac{1}{\sqrt{2\pi}\sigma} \exp \left[-\frac{(y_i-(ax_i + b))^2}{2\sigma^2} \right]. \]

In numerical implementations one usually does not compute the likelihood, but the log-likelihood in order to avoid problems with small numbers: \[ \log [p(y|a, b, \sigma, x)] = \sum_{i=1}^N \frac{1}{2\sigma} \exp \left[ -\frac{(y_i-(ax_i + b))^2}{2\sigma^2} \right]. \] The location of the maximum of the log-likelihood function remains unchanged.

In Haskell the log-likelihood is:

> loglikelihood :: Double -> Double -> Double -> Double
> loglikelihood a b sd = sum $ zipWith lh xs ys
>   where lh x = logDensity (normalDistr (predict a b x) sd)

As an example, we can plot the log-likelihood \(p(y|a, b=0, \sigma=10, x)\), showing how likely the samples \((x_i, y_i)\), \(i=1 \dots N\), are for varying \(a\), a fixed \(b=0\), and a fixed \(\sigma=10\) (again by exporting the data to a JSON file first):

> exportLikelihoodA :: IO ()
> exportLikelihoodA = writeFile "log-likelihood.json" . encode . toJSObject $
>                     (\as -> [ ("x", as),
>                              ("loglikelihood", map (\a -> loglikelihood a trueB trueSd) as )])
>                     [3, 3.05 .. 7::Double]

Log-likelihood \(p(y|a, b=0, \sigma=10, x)\)

We can see that \(a=5\) is very close to the maximum. The remaining offset is a result of the random nature of our test samples \((x_i, y_i)\).

As for any Bayesian data analysis, we also need a prior. Here we choose rather uninformative priors which have little influence on the estimated parameters:

> logprior :: Double -> Double -> Double -> Double
> logprior a b sd = sum [ap, bp, sdp]
>   where
>     ap  = logDensity (uniformDistr 0 10) a
>     bp  = logDensity (normalDistr 0 5) b
>     sdp = logDensity (uniformDistr 0 30) sd

The (unscaled) log-posterior is now given as the sum of the log-likelihood and the log-prior:

> logposterior :: Double -> Double -> Double -> Double
> logposterior a b sd = loglikelihood a b sd + logprior a b sd

Now we can start to actually implement the Metropolis MCMC. The Metropolis algorithm is a random walk algorithm. Each of the

> nIters :: Int
> nIters = 100000

steps in the random walk only depends on the current state (location in the parameter space) and an acceptance/rejection criterion (which involves random numbers). To go from the current location \((a, b, \sigma)\) to a proposed location \((a + \Delta a, b + \Delta b, \sigma + \Delta\sigma)\), we need to generate random delta values with zero mean. Let’s define a general function for generating a list of random values from a seed:

> normalisedProposals :: Int -> Double -> Int -> [Double]
> normalisedProposals seed sigma nIters =
>   evalState (replicateM nIters (sample (Normal 0.0 sigma)))
>   (pureMT $ fromIntegral seed)

At each step, the Metropolis algorithm needs to decide if the proposed jump is accepted, for which random draws from a standard uniform distribution are required:

> acceptOrRejects :: Int -> Int -> [Double]
> acceptOrRejects seed nIters =
>   evalState (replicateM nIters (sample stdUniform))
>   (pureMT $ fromIntegral seed)

The acceptance probability for accepting the new location over the old location is then:

> acceptanceProb :: Double -> Double -> Double -- ^ Proposal parameters
>                -> Double -> Double -> Double -- ^ Old parameters 
>                -> Double
> acceptanceProb a' b' sd' a b sd =
>     exp ((logposterior a' b' sd') - (logposterior a b sd))

The core of the Metropolis algorithm is the following. We either advance to the proposed location or stay at the current location, depending on the acceptance probability:

> oneStep :: (Double, Double, Double, Int)    -- ^ Current location
>         -> (Double, Double, Double, Double) -- ^ Delta location and acceptance probability
>         -> (Double, Double, Double, Int)    -- ^ New state
> oneStep (a, b, sd, nAccs) (da, db, dsd, acceptOrReject) =
>   if acceptOrReject < acceptanceProb (a + da) (b + db) (sd + dsd) a b sd
>   then (a + da, b + db, sd + dsd, nAccs + 1)
>   else (a, b, sd, nAccs)

The complete Markov chain can be generated from a starting location and a succession of single steps:

> startA, startB, startSd :: Double
> startA = 4  -- ^ All start values are a little off
> startB = 1  --   to show that the algorithm converges
> startSd = 9 --   to the true posterior after burn in
> burnIn :: Int
> burnIn = 1000
> chain :: Int        -- ^ Seed for generation of random numbers
>       -> [ ( Double -- ^ a
>            , Double -- ^ b
>            , Double -- ^ sigma
>            , Int    -- ^ number of accepted samples
>            ) ]
> chain seed =
>   drop burnIn $
>   scanl oneStep (startA, startB, startSd, 0) $
>   zip4 (normalisedProposals seed 0.3 nIters)    
>     (normalisedProposals (seed+999) 1.5 nIters) 
>     (normalisedProposals (seed+1) 2.0 nIters )  
>     (acceptOrRejects (seed+2) nIters)

The promise of the trace data is that it was drawn from the posterior distribution. So by looking at the trace of each parameter, we can directly derive quantities of interest (like an estimate of the expected value, and an estimate of the variance of this expected value). Here I want to generate a histogram from the trace data and export it as a JSON file for plotting. Additionally I want to export the last 1000 samples of the trace data to get a visual clue if consecutive samples are sufficiently uncorrelated.

> hb :: Double -> Double
>    -> HBuilder Double (Data.Histogram.Generic.Histogram V.Vector H.BinD Double)
> hb lower upper = forceDouble -<< mkSimple (H.binD lower numBins upper)
>     where numBins = 121
> -- | Export trace data to JSON file (histogram and last 1000 samples)
> exportTrace :: V.Vector Double -> String -> Double -> Double -> IO ()
> exportTrace trace name lower upper = do
>   let hist = fillBuilderVec (hb lower upper) trace
>   write "histogram" $ [ ("x", V.toList $ H.binsCenters (H.bins hist))
>                       , ("y", V.toList (H.histData hist)) ]
>   write "trace"     $ [("data", V.toList . V.drop (V.length trace - 1000) $ trace)]
>       where write x = writeFile (concat [ x, name, ".json"]) . encode . toJSObject
> consumeResults :: ( [Double], [Double], [Double], [Int] ) -> IO ()
> consumeResults (traceA, traceB, traceSd, lAcc) = do
>   exportTrace (V.fromList traceA)  "A" 3 7
>   exportTrace (V.fromList traceB)  "B" (-6) 6
>   exportTrace (V.fromList traceSd) "Sd" 0 20
>   let nAccepts = last lAcc
>   putStrLn $ concat [ "nIter: "
>                     , show nIters
>                     , "  accepts: "
>                     , show nAccepts
>                     , "  ratio: "
>                     , show (fromIntegral nAccepts / fromIntegral nIters)
>                     ] 
> main = do
>   exportTestData
>   exportLikelihoodA
>   consumeResults . unzip4 . chain $ arbSeed
>   return ()        

And here is a histogram and trace of \(a\):

Histogram of \(a\)

Last 1000 values of trace of \(a\)

The histogram (an approximation of the probability density function of the parameter \(a\)) nicely summarizes the estimation result. As a measure of uncertainty, one often considers the 95 % highest probability density interval (HPDI), but for this demo the analysis shall stop here.

If you look at the trace, you can see that sometimes it remains at the same location for a few draws, and sometimes a new location is accepted right away. This acceptance rate is an important parameter when adjusting the variance of the jumps (the second parameter of our normalisedProposals function). If the variance is too small, then the acceptance ratio will be close to 1.0, but the location will change very slowly so that many random draws are required until the trace had the chance to sufficiently explore the parameter space. On the other hand, if the variance is too large then hardly any new location will be accepted and the trace appears to be stuck. It is difficult to get the variance of the jump proposals right in the general case (and this the reason why so many libraries exist which implement MCMCs). Here I just tuned the values manually until the acceptance ratio was around 0.4.

For completeness, here are the histograms and parts of the traces for the parameters \(b\) and \(\sigma\):

Histogram of \(b\)

Last 1000 values of trace of \(b\)

Histogram of \(\sigma\)

Last 1000 values of trace of \(\sigma\)


Implementing the Metropolis MCMC algorithm in Haskell was a good learning experience, both from the point of view of Bayesian data analysis and for programming in Haskell.

An interesting next step would be to look at more complex statistical models, considering hierarchical or pooled data. This is an area where Bayesian data analysis should excel over many Frequentist approaches because Bayesian models are easier to adapt to more complex situations. Solving these models in practice is then relatively easy thanks to the availability of random walk algorithms like the Metropolis sampler, especially when existing MCMC libraries are used.

  1. Andrew Gelman, John B. Carlin, Hal S. Stern: Bayesian Data Analysis, 3rd ed. CRC Press 2013.