Bayesian signal detection theory, part III: hierarchical generalised linear models
In the first post in this series, I discussed the basic assumptions of signal detection theory (SDT), and implemented a simple Bayesian model that estimated sensitivity and bias for a single observer in a yes-no (YN) experiment. Here, I consider a slightly more realistic situation, where multiple observers participated in the same experiment, and present a considerably more useful model.
To recap, in the previous model, d′ and c were considered unknown parameters of interest, the hit and false-alarm probabilities, θ_h and θ_f, were latent intermediate random variables, the hit and false-alarm counts, H and F, were observed random variables with binomial likelihood functions, and the numbers of signal and noise trials, S and N, were fixed inputs. These probabilistic relationships are summarised in the directed acyclic graph below. The figure follows the conventions of other authors (e.g., Lee & Wagenmakers, 2013): circles represent continuous variables, squares represent discrete variables, double lines represent deterministic variables, and observed variables are shaded gray.
Note that I've changed the model slightly from my first post. I now express normal random variables in terms of mean and variance (μ, σ²), not mean and precision (μ, τ). I've also omitted two variables that were not critical to the model and can be calculated post hoc if necessary.
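To make the deterministic links in the graph concrete, here is a minimal numerical sketch of the equal-variance SDT mapping from d′ and c to the hit and false-alarm probabilities (the function and variable names are my own, for illustration only):

```python
from scipy.stats import norm

def sdt_probs(d, c):
    """Map sensitivity (d') and bias (c) to hit and false-alarm
    probabilities under the equal-variance Gaussian SDT model."""
    theta_h = norm.cdf(d / 2 - c)   # P("yes" | signal trial)
    theta_f = norm.cdf(-d / 2 - c)  # P("yes" | noise trial)
    return theta_h, theta_f

# An observer with zero sensitivity and no bias says "yes" half the
# time on both trial types.
th, tf = sdt_probs(0.0, 0.0)
print(th, tf)
```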
The single-observer model can be extended to situations with more than one observer by simply creating a new set of random variables per observer. This kind of model is summarised in the following figure. This can be described as a non-hierarchical Bayesian model (important for later).
Now suppose that performance in the multiple-observer experiment is influenced by some other variable; for example, the age of the observer. To measure this effect under the traditional frequentist framework, we could enter our maximum-likelihood estimates (MLEs) of d′ and c, calculated in the conventional way (Macmillan & Creelman, 2005), into a linear regression analysis. Alternatively, we could incorporate linear effects on d′ and c into our Bayesian SDT model. The latter approach has some considerable advantages over the former, which I will talk about later.
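For reference, the conventional MLE computation can be sketched in a few lines (a simplified version that ignores the usual corrections for hit or false-alarm rates of exactly 0 or 1):

```python
from scipy.stats import norm

def mle_sdt(hits, false_alarms, n_signal, n_noise):
    """Conventional point estimates of d' and c from response counts
    (see Macmillan & Creelman, 2005)."""
    zh = norm.ppf(hits / n_signal)         # z-transformed hit rate
    zf = norm.ppf(false_alarms / n_noise)  # z-transformed false-alarm rate
    return zh - zf, -(zh + zf) / 2

d, c = mle_sdt(40, 10, 50, 50)
print(d, c)  # d' is about 1.68; c is exactly 0 in this symmetric case
```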
I’m going to call our new model a “Bayesian hierarchical generalised linear signal detection model” (BHGLSDM). That’s a pretty dense name, so let’s unpack it. The “signal detection” bit is obvious. “Generalised linear” refers to generalised linear models. In basic linear regression with one predictor, the dependent variable, y, is assumed to be a normally distributed random variable, whose value is equal to the product of the regression coefficient, β (note: not to be confused with the measurement of bias from SDT!), and the predictor, x, plus some Gaussian random noise, ε: y = βx + ε. Assuming that y is normally distributed, this is the simplest possible general linear model. However, if y is not normally distributed, then the model requires a link function that transforms it into a normal random variable. This kind of model is known as a generalised linear model. For our purposes, we are interested in the effects of a predictor on d′ and c, but have observed H and F, which are binomial random variables. Therefore, we have to place linear models on d′ and c, and link them to H and F. In this way, SDT can be thought of as the link function between the predictor and the observations.
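As a toy illustration of the basic linear model (with a made-up coefficient, and fitting by ordinary least squares rather than anything Bayesian):

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulate y = beta * x + noise, with beta = 0.8 chosen arbitrarily.
x = rng.uniform(0.0, 10.0, size=1000)
y = 0.8 * x + rng.normal(0.0, 1.0, size=1000)

# Ordinary least squares recovers the coefficient.
beta_hat, intercept_hat = np.polyfit(x, y, deg=1)
print(beta_hat)  # close to 0.8
```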
“Bayesian hierarchical” models are Bayesian models with multiple levels of unobserved random variables. In the non-hierarchical multiple-observer SDT model described above, the parameters that described the prior distributions of d′ and c were fixed values. In a hierarchical version of that model, these parameters are random variables themselves, and therefore have their own parameters and priors (also called hyperparameters and hyperpriors, respectively). Such a model would look like this:
Hierarchical models are incredibly useful for psychological research, for at least two reasons. First, uncertainty is propagated throughout the model. In frequentist analyses of experimental data, such as regression or ANOVA, one normally takes a point estimate of performance for each subject in each condition, and by doing so essentially assumes that there is no error in those estimates (or at least that the error is roughly equal across all subjects/conditions). Since a hierarchical model uses random variables instead of point estimates, this is not a concern. Second, hierarchical models allow for a phenomenon called “shrinkage”. All random variables at all levels influence one another, which can cause extreme lower-level variables to be drawn (shrunk) towards the average. The degree of shrinkage depends on the strength of the information at each level: for instance, if we have a lot of observers but few trials per observer in an experiment, we may see lots of shrinkage; if we have lots of data from relatively few observers, there will be less shrinkage. Thus, hierarchical models provide a very nice method for dealing with outliers in data. For a good introductory explanation of hierarchical models, see Allenby, Rossi, and McCulloch (2005).
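The mechanics of shrinkage can be illustrated with the textbook normal–normal case, where the partially pooled estimate is a precision-weighted average of an observer's own estimate and the group mean (the numbers here are arbitrary):

```python
def shrink(estimate, se, group_mean, group_sd):
    """Precision-weighted (partially pooled) estimate: the classic
    normal-normal shrinkage formula."""
    w = (1 / se**2) / (1 / se**2 + 1 / group_sd**2)
    return w * estimate + (1 - w) * group_mean

# A noisy outlier (few trials, large standard error) is pulled
# strongly toward the group mean...
print(shrink(3.0, 1.0, 1.0, 0.5))
# ...whereas a well-measured observer barely moves.
print(shrink(3.0, 0.1, 1.0, 0.5))
```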
Example 1: One continuous predictor
OK, now we will construct a BHGLSDM for a fictitious experiment. Suppose 100 observers participated in the experiment, each completing 50 noise trials and 50 signal trials. For simplicity, we will assume that age has strong linear effects on both d′ and c. In the figure below, I plotted the regression lines, and the observers’ “true” d′ and c values, which are simply their predicted values plus a small amount of Gaussian noise. The figure also shows MLEs of d′ and c based on the simulated data. You'll notice that the estimates are extremely noisy, which is to be expected when each observer contributes just 100 trials. Nevertheless, a traditional ordinary least-squares linear regression (implemented using seaborn) does a decent job of recovering the regression coefficients, also plotted.
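Since the true regression coefficients aren't listed here, the following sketch simulates a comparable data set with placeholder values (the column names and coefficients are my own choices, not necessarily those used for the figures):

```python
import numpy as np
import pandas as pd
from scipy.stats import norm

rng = np.random.default_rng(1)

# Placeholder linear effects of age on d' and c, plus Gaussian noise.
n_obs, n_trials = 100, 50  # 50 signal + 50 noise trials per observer
age = rng.uniform(18, 70, size=n_obs)
d_true = 3.0 - 0.03 * age + rng.normal(0, 0.2, size=n_obs)
c_true = -0.5 + 0.01 * age + rng.normal(0, 0.1, size=n_obs)

# Equal-variance SDT links d' and c to hit/false-alarm probabilities.
theta_h = norm.cdf(d_true / 2 - c_true)
theta_f = norm.cdf(-d_true / 2 - c_true)

data = pd.DataFrame({
    'age': age,
    'H': rng.binomial(n_trials, theta_h),  # hit counts
    'F': rng.binomial(n_trials, theta_f),  # false-alarm counts
    'S': n_trials,                         # signal trials
    'N': n_trials,                         # noise trials
})
print(data.shape)  # (100, 5)
```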
Our BHGLSDM looks like this:
where x_i represents the ith observer’s age. Code for implementing the BHGLSDM is provided below. I've switched to PyMC 3 for this model. Although this package is still in development, it has some advantages over version 2: the syntax is much more intuitive and easier to write, and it contains a new sampling method (the No-U-Turn Sampler, NUTS; Hoffman & Gelman, 2014), which is much faster than previous samplers. In addition to PyMC 3, pandas and patsy are also required. These are both requirements of the glm submodule of PyMC 3, which I borrowed from heavily when writing my code.
Now let’s sample from the model and view the results using the following snippet:
import pandas as pd
import pymc3 as pm
from yesno import *

with pm.Model() as model:
    data = pd.read_csv('data.csv')
    yesno(data, 'd ~ age', 'c ~ age')
    burn = pm.sample(500)[-1]
    traces = pm.sample(5000, start=burn)

pm.traceplot(traces)
The figure below shows the posteriors and trace plots after 5000 samples (500 burn-in). There is still a bit of autocorrelation in the traces, indicating that more samples and some thinning may be required, but the model looks to have converged on the correct answers.
In this example experiment, we had a lot of observers, but relatively few trials. Let’s compare the Bayesian and MLE estimates of d′ and c:
Due to shrinkage, the Bayesian estimates are MUCH closer to the true values!
Example 2: One continuous predictor and one dichotomous predictor
The above code works with arbitrarily complex models. In the following simulated data set, I have included a second, this time dichotomous, predictor variable (group). Moreover, the effect of age differs between groups.
We can fit a BHGLSDM to the new data, and by modifying the patsy design string, include two new coefficients to deal with the additional main effect and the interaction.
yesno(data, 'd ~ age * group', 'c ~ age * group')
Below are the results. The model captures both main effects and the interaction. Again, due to shrinkage, the mean posterior estimates are closer to the true values.
Example 3: Repeated measures
An experiment is said to have repeated measures if each observer contributes data to more than one level of a predictor. The idea is that variation due to differences in the baseline abilities of the observers should be balanced across the levels. In another example experiment, 20 observers completed 600 trials, 200 in each of three different conditions. The plot below shows the true d′ and c values of this data set.
Even without any of the additional noise introduced by simulating the trials, it is difficult to see any differences between the conditions. However, there are actually strong differences if we look at the relationships of the different conditions within the data from each individual:
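A quick numerical illustration of this point, using made-up d′ values with large between-subject baselines and small but consistent condition effects:

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical d' values: 20 observers x 3 conditions, with large
# subject baselines and small condition effects.
n_subj = 20
baseline = rng.normal(1.5, 1.0, size=n_subj)  # subject intercepts
effects = np.array([0.0, 0.3, 0.6])           # condition effects
d = baseline[:, None] + effects[None, :] + rng.normal(0, 0.1, (n_subj, 3))

# Raw condition means are swamped by between-subject variance...
print(d.std(axis=0))
# ...but removing each subject's mean exposes the condition effects.
centred = d - d.mean(axis=1, keepdims=True)
print(centred.mean(axis=0))
```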
The syntax for a purely between-subjects model, where the data from each observer and condition are treated as completely independent, looks like this:
yesno(data, 'd ~ C(condition)', 'c ~ C(condition)')
Predictably, the posterior plots of the regression coefficients from this model are vague:
In order to exploit the fact that we have repeated measures, we can include a separate intercept for each observer, rather than a single intercept for everyone. These intercepts will absorb the inter-observer differences, and allow more precise estimation of the interesting regression coefficients. Thanks to patsy, it is very easy to include these additional intercepts:
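To see what the '0 + C(subj)' term expands to, here is a rough equivalent using pandas dummy coding instead of patsy (just for illustration):

```python
import pandas as pd

# Toy data frame: 3 subjects x 2 conditions. pd.get_dummies produces
# one indicator column per subject, i.e. a separate intercept each.
df = pd.DataFrame({
    'subj': [1, 1, 2, 2, 3, 3],
    'condition': ['a', 'b', 'a', 'b', 'a', 'b'],
})
X = pd.get_dummies(df['subj'], prefix='subj')
print(X.shape)  # (6, 3): one intercept column per subject
```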
yesno(data, 'd ~ 0 + C(subj) + C(condition)', 'c ~ 0 + C(subj) + C(condition)')
Now the posteriors for the condition-specific regression coefficients are tighter:
Banner image is Haystacks at Giverny by Claude Monet.
- Hoffman, M. D., & Gelman, A. (2014). The No-U-Turn Sampler: Adaptively setting path lengths in Hamiltonian Monte Carlo. Journal of Machine Learning Research, 15, 1593–1623.
- Lee, M. D., & Wagenmakers, E. J. (2013). Bayesian cognitive modeling: A practical course. Cambridge: Cambridge University Press.
- Macmillan, N. A., & Creelman, C. D. (2005). Detection theory: A user’s guide (2nd ed.). New York: Cambridge University Press.