# The Cracked Bassoon

Sam Mathias

I'm a psychoacoustician. I mainly post code for making interesting sounds.

13 September 2016 · 1983 words · ~10 minute read

# Bayesian hierarchical model of working-memory capacity

Working memory (WM) is the memory system that stores information over time periods on the order of seconds (see Baddeley & Hitch, 1974; Cowan, 2001). WM is considered to be limited both in terms of duration (how long information is stored) and capacity (how much information is stored). The nature of the WM capacity limitation is a source of current debate, with some researchers arguing for “item-based” storage (WM stores up to a fixed number of discrete perceptual objects; e.g., Luck & Vogel, 1997; Rouder et al., 2008), and others advocating “resource-based” storage (storage is allocated flexibly depending on the stimulus; e.g., Bays & Husain, 2008; Keshvari et al., 2013). However, item-based storage has long been the dominant view, and the average number of items stored in WM, denoted $k$, remains a popular dependent variable in the cognitive sciences.

A simple way of measuring $k$ is to apply either of the formulae proposed by Pashler (1988) and Cowan (2001) to the data from a change-detection task. On each trial, the subject is briefly presented with a stimulus containing several distinct items, and after a few seconds, a second stimulus is presented. The second stimulus contains at least one item from the original stimulus (the “probed” item), which may or may not have changed in some way, and the subject indicates whether a change occurred. The choice of formula depends on whether the second stimulus also contains the remaining items from the original stimulus: if so, Pashler’s formula should be used; if not, Cowan’s should be used.
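For concreteness, the two formulae are $k=M(h-f)$ for Cowan's single-probe design and $k=M(h-f)/(1-f)$ for Pashler's whole-display design, where $M$ is the set size, $h$ is the hit rate, and $f$ is the false-alarm rate. A minimal Python sketch (function names are my own):

```python
def cowan_k(hit_rate, fa_rate, set_size):
    """Cowan's (2001) formula, for single-probe change detection."""
    return set_size * (hit_rate - fa_rate)


def pashler_k(hit_rate, fa_rate, set_size):
    """Pashler's (1988) formula, for whole-display change detection."""
    return set_size * (hit_rate - fa_rate) / (1.0 - fa_rate)
```

Note that, for example, `cowan_k(0.5, 0.6, 4)` returns a negative (and therefore impossible) capacity estimate, and `pashler_k` is undefined when the false-alarm rate is 1.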

These formulae are easy to implement and widely used, but there are a number of problems with them. First, using these formulae, $k$ can only be calculated from trials with the same set size; experiments typically use various set sizes, so estimates of $k$ must be calculated separately for each set size and then combined, rather than calculating one estimate from all the data. Second, since $k$ can never exceed the set size, it is systematically underestimated whenever the set size is smaller than the true $k$. Third, the formulae can yield impossible (negative) values of $k$. Fourth, $k$ cannot be calculated at all when performance is at ceiling or floor.

To remedy these issues, Morey (2011) formalised a Bayesian hierarchical model for the measurement of WM capacity from change-detection tasks. The model largely deals with the problems listed above, and is much more efficient (it recovers parameters more accurately) than traditional approaches when the data are few. I have implemented a version of this model in PyMC3 (Salvatier, Wiecki, & Fonnesbeck, 2016). Since it isn’t exactly the same as the model described by Morey—who provides his own software (Morey & Morey, 2011)—I describe its logic in detail below. My version is a bit simpler than Morey’s, since it doesn’t explicitly model the covariance between parameters, and only applies to the Cowan style of change-detection task. It should be easy to modify the code to apply to Pashler-style tasks, or indeed any other variation, provided the decision rule can be specified.

## The model

In a similar way to signal detection theory (SDT), the model uses the number of hits (correct indication of a stimulus change), $H$, and the number of false alarms (incorrect indication of a stimulus change), $F$. These are binomial random variables with the probability distributions

$\begin{eqnarray}H_{i_j}&\sim&\mathcal{B}\left(h_{i_j},D_{i_j}\right)\\F_{i_j}&\sim&\mathcal{B}\left(f_{i_j},S_{i_j}\right)\textrm{,}\end{eqnarray}$

where $h$ is the hit probability, $D$ is the number of change trials, $f$ is the false-alarm probability, $S$ is the number of no-change trials, $i$ indexes the subject, and $j$ indexes the set size.

Again like SDT, $h$ and $f$ are determined by a decision process incorporating latent variables with clear psychological interpretations. The decision process is different, however. The model assumes that with probability $z$, the subject stores $k$ items from the stimulus in WM. If the probed item is one of these stored items, the subject responds correctly with a probability of 1. If the probed item is not stored, the subject guesses “change” with probability $g$. On the remaining trials ($1-z$), the subject suffers an attentional lapse, storing none of the items in memory, and again guessing “change” with probability $g$ (see Rouder et al., 2008).
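This decision rule is easy to simulate. The following Monte Carlo sketch (my own illustration; the function name and argument order are arbitrary) generates a single simulated response:

```python
import random


def simulate_trial(k, g, z, set_size, change):
    """Simulate one Cowan-style trial under the fixed-capacity decision rule.

    Returns True if the simulated subject responds "change"."""
    if random.random() < z:  # subject attends and stores k items
        stored = min(k, set_size)
        if random.random() < stored / set_size:  # probed item was stored
            return change  # respond correctly
    return random.random() < g  # otherwise guess "change" with probability g
```

Simulating many trials and tallying the responses recovers the hit and false-alarm probabilities given below; for instance, with $z=1$ and $k\ge M$, every change trial produces a hit and no no-change trial produces a false alarm.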

The relationships between the response probabilities ($h$ and $f$) and the latent variables ($k$, $g$, and $z$) can be determined easily by considering all the possible outcomes of a trial and determining their associated probabilities; these are summarised graphically as a probability tree:

Here, $q$ represents the probability that the probed item was stored on a given trial. Assuming all items are equally likely to be stored, $q=1$ if $k\ge{}M$, where $M$ is the set size, and $q=\frac{k}{M}$ if $k<M$. Next, $h$ and $f$ are found by combining the appropriate paths, leading to

$\begin{eqnarray}h_{i_j}&=&\left(1-z_{i}\right)g_{i}+z_{i}q_{i_j}+z_{i}\left(1-q_{i_j}\right)g_i\\f_{i_j}&=&\left(1-z_i\right)g_i+z_i\left(1-q_{i_j}\right)g_i\textrm{,}\end{eqnarray}$

where $q_{i_j}=\min\left(1,\frac{k_i}{M_{i_j}}\right)\textrm{.}$
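These expressions translate directly into code. The helper below (my own sketch) computes $h$ and $f$ for given parameter values:

```python
def response_probs(k, g, z, set_size):
    """Hit and false-alarm probabilities under the fixed-capacity model."""
    q = min(1.0, k / set_size)  # probability that the probed item was stored
    h = (1 - z) * g + z * q + z * (1 - q) * g
    f = (1 - z) * g + z * (1 - q) * g
    return h, f
```

With $z=1$ and $k\ge M$, this gives $h=1$ and $f=0$ (perfect performance); as $k$ approaches 0, both probabilities approach the guessing rate $g$.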

It is generally a good idea to do inference on normally distributed variables. However, $k$ cannot be normal, because negative values of $k$ are not possible; and neither can $g$ nor $z$, because they are probabilities. Therefore, we introduce unconstrained latent variables $\kappa$, $\gamma$, and $\zeta$, and let

$\begin{eqnarray}k_{i}&=&\max\left(\kappa_i, 0\right)\\g_{i}&=&\textrm{logistic}\left(\gamma_i\right)\\z_{i}&=&\textrm{logistic}\left(\zeta_i\right)\textrm{.}\end{eqnarray}$
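In code, the mappings from the unconstrained scale back to the model parameters look like this (a small sketch of my own; the logistic function is $1/(1+e^{-x})$, which maps the real line to $(0,1)$):

```python
import math


def logistic(x):
    """Map the real line to the unit interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def to_params(kappa, gamma, zeta):
    """Map the unconstrained latent variables to k, g, and z."""
    return max(kappa, 0.0), logistic(gamma), logistic(zeta)
```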

These comprise the first level of unknown random variables in the hierarchical model. We could assign them normal prior distributions, as Morey (2011) does. However, to make the model more robust to outliers (see Kruschke, 2012), I have assigned them non-central Student $t$ priors:

$\begin{eqnarray}\kappa_i&\sim&\mathcal{T}\left(\nu^{(\kappa)}, \mu_{i}^{(\kappa)}, \sigma^{(\kappa)} \right)\\ \gamma_i&\sim&\mathcal{T}\left(\nu^{(\gamma)},\mu_{i}^{(\gamma)},\sigma^{(\gamma)}\right)\\\zeta_i&\sim&\mathcal{T}\left(\nu^{(\zeta)},\mu_{i}^{(\zeta)},\sigma^{(\zeta)}\right)\textrm{,}\end{eqnarray}$

where $\nu$ are the degrees-of-freedom parameters, $\mu$ are the mean parameters, and $\sigma$ are the standard-deviation parameters. Notice the $i$ index on $\mu$, and the lack thereof on $\nu$ and $\sigma$. We next place linear models on $\mu$, which we can write in vector form as follows:

$\begin{eqnarray}\vec{\mu}^{(\kappa)}&=&\textbf{X}_{\mu^{(\kappa)}}\vec{\beta}_{\mu^{(\kappa)}}\\\vec{\mu}^{(\gamma)}&=&\textbf{X}_{\mu^{(\gamma)}}\vec{\beta}_{\mu^{(\gamma)}}\\\vec{\mu}^{(\zeta)}&=&\textbf{X}_{\mu^{(\zeta)}}\vec{\beta}_{\mu^{(\zeta)}}\textrm{,}\end{eqnarray}$

where $\textbf{X}$ are subject-by-predictor design matrices and $\vec{\beta}$ are vectors of coefficients, one for each predictor (column) in the design matrix.
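For example, in the simplest case with no covariates, $\textbf{X}$ is a single column of ones and the sole coefficient is the group mean; a between-subjects grouping variable can be added as a dummy-coded column. A pure-Python sketch of the matrix-vector product (names are mine):

```python
def linear_model(X, beta):
    """Compute mu = X @ beta, with X as a list of rows."""
    return [sum(x * b for x, b in zip(row, beta)) for row in X]


# Intercept plus a dummy-coded group effect for four subjects
# (subjects 3 and 4 belong to the second group).
X = [[1, 0],
     [1, 0],
     [1, 1],
     [1, 1]]
mu = linear_model(X, [3.0, 0.5])  # per-subject means: two groups, 3.0 and 3.5
```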

All that’s left to do is assign hyperpriors to the parameters denoted by $\nu$, $\sigma$, and $\beta$. For $\nu$, a shifted-exponential distribution is a good choice:

$\begin{eqnarray}\nu^{(\kappa)}&\sim&\mathcal{EXP}\left(0,\frac{1}{29}\right) + 1\\\nu^{(\gamma)}&\sim&\mathcal{EXP}\left(0,\frac{1}{29}\right) + 1\\\nu^{(\zeta)}&\sim&\mathcal{EXP}\left(0,\frac{1}{29}\right) + 1\textrm{.}\end{eqnarray}$

For $\sigma$, recent papers recommend a half-Cauchy distribution (Gelman, 2006; Polson & Scott, 2012):

$\begin{eqnarray}\sigma^{(\kappa)}&\sim&\mathcal{HC}\left(30\right)\\\sigma^{(\gamma)}&\sim&\mathcal{HC}\left(30\right)\\\sigma^{(\zeta)}&\sim&\mathcal{HC}\left(30\right)\textrm{.}\end{eqnarray}$

Finally, we will assign normal hyperpriors to each of the coefficients:

$\begin{eqnarray}\beta_{{\mu^{(\kappa)}}_m}&\sim&\mathcal{N}\left(0, 100\right)\\\beta_{{\mu^{(\gamma)}}_m}&\sim&\mathcal{N}\left(0, 100\right)\\\beta_{{\mu^{(\zeta)}}_m}&\sim&\mathcal{N}\left(0, 100\right)\textrm{,}\end{eqnarray}$

where $m$ indexes the covariate.

## References

Banner image is La persistencia de la memoria by Salvador Dalí.

• Baddeley, A. D., & Hitch, G. (1974). Working memory. In G.H. Bower (Ed.), The psychology of learning and motivation: Advances in research and theory (Vol. 8, pp. 47–89). New York: Academic Press.
• Bays, P. M., & Husain, M. (2008). Dynamic shifts of limited working memory resources in human vision. Science, 321, 851–854.
• Cowan, N. (2001). The magic number 4 in short-term memory: A reconsideration of mental storage capacity. Behavioral and Brain Sciences, 24, 87–114.
• Gelman, A. (2006). Prior distributions for variance parameters in hierarchical models. Bayesian Analysis, 1, 515–533.
• Keshvari, S., van den Berg, R., & Ma, W. J. (2013). No evidence for an item limit in change detection. PLoS Computational Biology, 9, e1002927.
• Kruschke, J. K. (2012). Bayesian estimation supersedes the t test. Journal of Experimental Psychology: General, 142, 573–603.
• Luck, S. J., & Vogel, E. K. (1997). The capacity of visual working memory for features and conjunctions. Nature, 390, 279–281.
• Morey, R. D. (2011). A Bayesian hierarchical model for the measurement of working memory capacity. Journal of Mathematical Psychology, 55, 8–24.
• Morey, R. D., & Morey, C. C. (2011). WoMMBAT: A user interface for hierarchical Bayesian estimation of working memory capacity. Behavior Research Methods, 43, 1044–1065.
• Pashler, H. (1988). Familiarity and visual change detection. Perception & Psychophysics, 44, 369–378.
• Polson, N. G., & Scott, J. G. (2012). On the half-Cauchy prior for a global scale parameter. Bayesian Analysis, 7, 887–902.
• Rouder, J. N., Morey, R. D., Cowan, N., Zwilling, C. E., Morey, C. C., & Pratte, M. S. (2008). An assessment of fixed-capacity models of visual working memory. Proceedings of the National Academy of Sciences, 105, 5976–5979.
• Salvatier, J., Wiecki, T. V., & Fonnesbeck, C. (2016). Probabilistic programming in Python using PyMC3. PeerJ Computer Science, 2:e55.