# The Cracked Bassoon

Sam Mathias

I'm a psychoacoustician. I mainly post code for making interesting sounds.

# Signal detection theory with Bayesian inference in PyMC

Signal detection theory (SDT) is a framework for modelling the decision processes of observers in psychological experiments (see Green & Swets, 1966). One of the primary uses of SDT is to transform observed data (counts of trials and responses in an experiment) into psychologically meaningful variables, the most well known of which are the index of sensitivity, $d^\prime$, and the criterion, $c$. Methods for calculating these variables are widely available and generally straightforward to implement, at least for common experimental designs (see Macmillan & Creelman, 2005). The easy-to-use equations found in most textbooks and articles rely on maximum-likelihood (ML) estimation. An alternative to ML estimation is Bayesian inference, which has a number of advantages over traditional methods. This post is intended to be the first in a series exploring the use of Bayesian inference in SDT using python and PyMC. I'm going to start with the simplest thing that can be modelled in SDT — a single observer in a yes-no experiment.

## The SDT model

On each trial in a yes-no experiment, the observer makes an observation ($\psi$). According to SDT, these observations are random Gaussian variables which, depending on the trial, are drawn from one of two probability distributions: a 'noise' distribution (if it's a noise trial) and a 'signal' distribution (if it's a signal trial). The noise distribution has zero mean and unit variance, whereas the signal distribution has $d^\prime$ mean and unit variance. The bigger the value of $d^\prime$, the more distinct the two distributions, and therefore the more 'sensitive' the observer is to the signal. If $d^\prime=0$, the signal and noise are indistinguishable. Thus, $d^\prime$ is the index of sensitivity.

To respond on each trial, the observer must first set a criterion $k$ at some value. They respond 'no' if the observation is less than $k$, and 'yes' if it is greater than $k$. When $k = \frac{d^\prime}{2}$, the observer is said to be 'unbiased'. The distance between the true $k$ and the unbiased $k$ is denoted $c$. Thus, $c$ is the index of response bias. An alternative bias index is $\beta$, the 'likelihood ratio' of the two distributions at the criterion (in practice, one usually works with its logarithm, $\ln\beta$). The proportion of the signal distribution greater than $k$ is known as the hit rate, $h$, and the proportion of the noise distribution greater than $k$ is known as the false-alarm rate, $f$. All these parameters are illustrated in the figure below:

This is known as the equal-variance Gaussian model, and forms the foundation of most SDT analysis. It is fairly trivial to generate synthetic data under this model using python:
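For example, a simulation along these lines (a minimal sketch; the function name and signature are my own, not from any library) draws observations from the two distributions and counts the 'yes' responses:

```python
import numpy as np


def simulate_yes_no(d, c, n_noise, n_signal, seed=None):
    """Simulate a yes-no experiment under the equal-variance Gaussian model.

    Returns the observed hit and false-alarm counts (H, F).
    """
    rng = np.random.RandomState(seed)
    k = d / 2.0 + c  # criterion: unbiased point plus bias
    noise_obs = rng.normal(0.0, 1.0, n_noise)  # noise trials ~ N(0, 1)
    signal_obs = rng.normal(d, 1.0, n_signal)  # signal trials ~ N(d, 1)
    F = int(np.sum(noise_obs > k))  # 'yes' responses on noise trials
    H = int(np.sum(signal_obs > k))  # 'yes' responses on signal trials
    return H, F
```

With a large $d^\prime$, the simulated observer produces many hits and few false alarms, as expected.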

## Maximum-likelihood estimation

Given observed values of $N$, $S$, $H$ and $F$, we can use ML to first estimate $h$ and $f$, then $d^\prime$, $c$ and $\beta$. The hit rate $h$ can be considered the probability of success on a Bernoulli trial where a signal stimulus was presented. Likewise, $f$ is 1 minus the probability of success (a correct rejection) on a noise trial. Thus, $h$ and $f$ are binomial probabilities. Therefore, we can estimate $h$ and $f$ using these equations:

$\widehat{h}=\frac{H}{S} \quad \textrm{and} \quad \widehat{f}=\frac{F}{N}.$

Symbols with ^ above them denote maximum-likelihood estimators. Since all of the other variables in the SDT model can be expressed in terms of $h$ and $f$, we can construct maximum-likelihood estimators for them as well. $\widehat{k}$ is the simplest to derive:

$\begin{eqnarray*}\widehat{f}&=&\Phi\left(-\widehat{k}\right) \\ \widehat{k}&=&-\Phi^{-1}\left(\widehat{f}\right), \end{eqnarray*}$

where $\Phi$ is the cumulative standard normal distribution function and $\Phi^{-1}$ is its inverse (sometimes denoted $z$ or $Z$). Now we can derive $\widehat{d^\prime}$:

$\begin{eqnarray*}&\widehat{h}&=&\Phi\left[-\left(\widehat{k}-\widehat{d^\prime}\right)\right] \\ &\widehat{h}&=&\Phi\left\{-\left[-\Phi^{-1}\left(\widehat{f}\right)-\widehat{d^\prime}\right]\right\} \\ &\Phi^{-1}\left(\widehat{h}\right)&=&\Phi^{-1}\left(\widehat{f}\right)+\widehat{d^\prime} \\ &\widehat{d^\prime}&=&\Phi^{-1}\left(\widehat{h}\right)-\Phi^{-1}\left(\widehat{f}\right)\end{eqnarray*}$

and $\widehat{c}$:

$\begin{eqnarray*}&\widehat{c}&=&\widehat{k} - \frac{\widehat{d^\prime}}{2}\\ &\widehat{c}&=& -\Phi^{-1}\left(\widehat{f}\right) - \frac{\Phi^{-1}\left(\widehat{h}\right)-\Phi^{-1}\left(\widehat{f}\right)}{2}\\ &\widehat{c}&=&-\frac{1}{2}\cdot\left[\Phi^{-1}\left(\widehat{h}\right) + \Phi^{-1}\left(\widehat{f}\right)\right]\end{eqnarray*}$

Finally, through some re-arranging that I've omitted here (but see Macmillan & Creelman, 2005), it can be shown that $\ln\left(\widehat{\beta}\right) = \widehat{d^\prime}\cdot{}\widehat{c}$.

The above equations don't always work, however. Since the inverse cumulative standard normal function yields $\pm\infty$ for values of 0 and 1, ML estimation breaks down if the observer performs very well or very poorly in the experiment (i.e., when the estimates of either $h$ or $f$ are 0 or 1). Such occurrences are not uncommon in psychological experiments, especially when there are relatively few trials. A simple workaround (proposed by Snodgrass & Corwin, 1988) is to add 1 to both $N$ and $S$, and 0.5 to both $H$ and $F$. This method is appealingly simple and intuitive. Whilst some authors have complained about it, it is not entirely without theoretical justification and is not a terribly big deal in most situations (I'll expand on this in a future post).

Again, it is quite simple to implement these ML estimators in python:
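A sketch of such an implementation (the function name is my own), using scipy's `norm.ppf` for $\Phi^{-1}$ and applying the Snodgrass & Corwin correction by default:

```python
import numpy as np
from scipy.stats import norm


def ml_estimates(N, S, H, F, correct=True):
    """Maximum-likelihood estimates of h, f, d', c and ln(beta).

    If `correct` is True, the Snodgrass & Corwin (1988) correction is
    applied, guarding against estimates of h or f of exactly 0 or 1.
    """
    if correct:
        H, F, S, N = H + 0.5, F + 0.5, S + 1.0, N + 1.0
    h = H / float(S)
    f = F / float(N)
    d = norm.ppf(h) - norm.ppf(f)           # d' = Z(h) - Z(f)
    c = -0.5 * (norm.ppf(h) + norm.ppf(f))  # c = -[Z(h) + Z(f)] / 2
    log_beta = d * c                        # ln(beta) = d' * c
    return h, f, d, c, log_beta
```

For instance, 80 hits and 20 false alarms out of 100 trials of each kind yield $\widehat{d^\prime}\approx1.68$ and $\widehat{c}\approx0$.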

## Bayesian inference

Lee (2008) outlines a Bayesian equal-variance Gaussian SDT model for a yes-no experiment with a single observer. I'm going to assume you've read that article (it was freely available last time I checked), and that you are at least familiar with concepts relating to Markov chain Monte Carlo (MCMC) sampling, such as autocorrelation, burn-in, thinning, and so on. Here, I'm going to re-create Lee's Matlab and BUGS code in python using the PyMC package (Patil et al., 2010). I'm using PyMC 2.3.4, which as of writing is the current stable version. It seems that 2.3.3 contains a few serious bugs, and while purportedly much faster, PyMC 3 is currently in alpha.

We start by defining priors on $d^{\prime}$ and $c$. According to Lee (2008), appropriate choices are $d^\prime\sim\mathcal{N}\left(0,\frac{1}{2}\right)$ and $c\sim\mathcal{N}\left(0,2\right)$. Note that in PyMC (as in most Bayesian contexts), the Gaussian distribution is defined in terms of mean and precision ($\tau$), the reciprocal of the variance, rather than mean and variance. In PyMC, these are called stochastic variables and are defined using the special Stochastic class. Since $k$, $\beta$, $h$ and $f$ can all be determined uniquely by $d^{\prime}$ and $c$, we do not assign them priors. Instead, they are defined using the Deterministic class. Finally, $H$ and $F$ are also considered stochastic variables, but unlike $d^\prime$ and $c$, they are observed (i.e., we have some evidence that directly pertains to their values). Below is a 'factory function' that generates an SDT model:
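Something along these lines (a sketch in PyMC 2 syntax rather than Lee's exact code; the function and variable names are my own, and I use scipy's `norm.cdf` for $\Phi$; the `@pymc.deterministic` decorator creates Deterministic objects):

```python
import pymc
from scipy.stats import norm


def make_sdt_model(N, S, H, F):
    """Factory function: equal-variance Gaussian SDT model for one observer."""
    # priors on d' and c; pymc.Normal is parameterised by mean and precision
    d = pymc.Normal('d', mu=0.0, tau=0.5)
    c = pymc.Normal('c', mu=0.0, tau=2.0)

    @pymc.deterministic
    def k(d=d, c=c):
        # criterion: unbiased point plus bias
        return d / 2.0 + c

    @pymc.deterministic
    def h(d=d, k=k):
        # hit rate: area of the signal distribution above k
        return norm.cdf(-(k - d))

    @pymc.deterministic
    def f(k=k):
        # false-alarm rate: area of the noise distribution above k
        return norm.cdf(-k)

    @pymc.deterministic
    def log_beta(d=d, c=c):
        return d * c

    # observed counts are binomial in the hit and false-alarm rates
    H_obs = pymc.Binomial('H', n=S, p=h, value=H, observed=True)
    F_obs = pymc.Binomial('F', n=N, p=f, value=F, observed=True)

    return dict(d=d, c=c, k=k, h=h, f=f, log_beta=log_beta,
                H=H_obs, F=F_obs)
```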

Notice that unlike with ML, we do not need to correct for very good/poor performance. Before fitting the model, we should check the explicit priors on $d^{\prime}$ and $c$, as well as the implicit priors on the other variables:
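One quick way to inspect the implicit priors, independent of PyMC, is to draw samples from the priors on $d^{\prime}$ and $c$ and propagate them through the model's deterministic relations (a numpy/scipy sketch; the function name is my own):

```python
import numpy as np
from scipy.stats import norm


def sample_priors(n=10000, seed=None):
    """Sample the priors on d' and c and derive the implied priors on
    k, h, f and ln(beta)."""
    rng = np.random.RandomState(seed)
    d = rng.normal(0.0, np.sqrt(1 / 0.5), n)  # tau = 0.5 -> sd = sqrt(2)
    c = rng.normal(0.0, np.sqrt(1 / 2.0), n)  # tau = 2 -> sd = sqrt(0.5)
    k = d / 2.0 + c
    h = norm.cdf(-(k - d))
    f = norm.cdf(-k)
    log_beta = d * c
    return dict(d=d, c=c, k=k, h=h, f=f, log_beta=log_beta)
```

Histograms of these samples show, for instance, that the implied priors on $h$ and $f$ cover the whole unit interval.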

As well as performing MCMC, PyMC allows us to calculate maximum a posteriori (MAP) estimates of the parameters. MAP estimation is essentially the same as ML estimation, except that one finds the maximum of the posterior distribution rather than of the likelihood function. MAP estimates provide good starting values for Markov chains, and can reduce the need for very long burn-in periods (the ML estimates would also make good starting values for this particular model). PyMC relies on numerical optimization routines from scipy to obtain MAP estimates; I tend to avoid the default method and use either Powell's method (Powell, 1964) or L-BFGS-B (Zhu et al., 1997) wherever possible. With the starting values set, we can run MCMC and produce posterior plots.
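Assuming the factory function described above returns the model's variables (I call it `make_sdt_model` here, with hypothetical counts), fitting might look something like this:

```python
import pymc

# hypothetical data: 100 trials per class, 80 hits, 20 false alarms
model = make_sdt_model(N=100, S=100, H=80, F=20)

# MAP first, to obtain good starting values for the chains;
# method names follow scipy's optimizers
pymc.MAP(model).fit(method='fmin_powell')

# then sample, discarding a burn-in period and thinning the chains
M = pymc.MCMC(model)
M.sample(iter=25000, burn=5000, thin=2)

# summary statistics and trace/posterior plots
M.summary()
pymc.Matplot.plot(M)
```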

## Testing the model

To test the model, I've synthesised some data sets with different $d^\prime$ values. $c$ was fixed at 0.1, and there were 100 noise trials and 100 signal trials in each simulation. The plot below shows that the ML and Bayesian solutions match up well. It also illustrates one of the key advantages of Bayesian inference: instead of simply giving us point estimates of the most likely parameter values given the data, the Bayesian approach returns samples from the joint posterior distribution and therefore naturally gives us information regarding the uncertainty associated with those values. The shaded region in the figure indicates 95% credible intervals (or highest-density intervals, HDIs).

In the next example, you can see how with more data (i.e., trials in the experiment), uncertainty is reduced. $d^\prime$ was always 1, and $c$ was fixed at 0.

Here are all the functions in one script:

## References

Banner image is The Enigma by Gustave Doré.

• Green, D. M., & Swets J. A. (1966). Signal Detection Theory and Psychophysics. New York: Wiley.
• Lee, M. D. (2008). BayesSDT: Software for Bayesian inference with signal detection theory. Behavior Research Methods, 40 (2), 450–456.
• Macmillan, N. A., & Creelman, C. D. (2005). Detection theory: A user's guide (2nd ed.). New York: Cambridge University Press.
• Patil, A., Huard, D., & Fonnesbeck, C. J. (2010). PyMC: Bayesian stochastic modelling in Python. Journal of Statistical Software, 35 (4), 1–81.
• Snodgrass, J. G., & Corwin, J. (1988). Pragmatics of measuring recognition memory: Applications to dementia and amnesia. Journal of Experimental Psychology: General, 117 (1), 34–50.
• Powell, M. J. D. (1964). An efficient method for finding the minimum of a function of several variables without calculating derivatives. Computer Journal, 7 (2), 155–162.
• Zhu, C., Byrd, R. H., Lu, P., & Nocedal, J. (1997). Algorithm 778: L-BFGS-B: Fortran subroutines for large-scale bound-constrained optimization. ACM Transactions on Mathematical Software, 23 (4), 550–560.