1 Introduction

“Science advances by playing twenty questions with nature. The proper tactic is to frame a general question, hopefully binary, that can be attacked experimentally. Having settled that bits-worth, one can proceed to the next. The policy appears optimal — one never risks much, there is feedback from nature at every step, and progress is inevitable. Unfortunately, the questions never seem to be really answered, the strategy does not seem to work.”

— Allen Newell (1973)

Almost fifty years ago, the artificial intelligence pioneer and cognitive psychologist Allen Newell summarized his discontent with the field of psychology with the sentence “you cannot play twenty questions with nature and win”. In a game of “twenty questions”, one player thinks of a person or object and the other player attempts to guess it by asking up to twenty questions, such as “is it a person?” or “is he a man?”. Only questions that require a binary (yes/no) answer are allowed. Newell argued that most (cognitive) psychology research attempts to understand human behavior and cognition in a manner analogous to a game of twenty questions; that is, by repeatedly asking and trying to answer “binary” research questions — such as “Does cognition impact perception?” and “Are emotional facial expressions innate?” — researchers attempt to gradually explore and reduce the space of possible scientific explanations in the hope of ultimately converging on “the right answer”. Almost fifty years after Newell’s twenty questions article, most research in both psychology and cognitive neuroscience still revolves around asking binary research questions about the mind and brain, framed as hypotheses that are evaluated using an ever more sophisticated toolbox of statistical significance tests. Newell, however, believed that in order to gain a fundamental understanding of how the mind and brain work, we need to go beyond asking binary questions and instead investigate human behavior and cognition in all its complexity, using quantitative, predictive models that implement human cognitive capacities and behaviors. I believe that this argument is as relevant today as it was almost fifty years ago.

In this thesis, I explore a different, complementary approach to the traditional methodology of hypothesis testing used in psychology and cognitive neuroscience research. Although this alternative approach has deep roots in psychology and is thus by no means new, the version I advocate and have used in this thesis extends it with ideas and techniques from the rapidly growing field of artificial intelligence and specifically machine learning. As I will describe in more detail in the next section, the crucial difference between the “hypothesis testing approach” and the “predictive approach”, as advocated by Newell, is the way they go about trying to explain and understand a particular cognitive capacity or behavior (Breiman, 2001). Although I believe that both approaches have their merits, I think that the predictive approach may be particularly promising given the increasing availability of large datasets and rapid advances in artificial intelligence and machine learning (Halevy et al., 2009; Yarkoni & Westfall, 2017).

This thesis features research that applies, adapts, and contributes to machine learning techniques and methods in the context of predictive models of behavior and neuroimaging data. Specifically, the chapters in this thesis describe examples of predictive models applied to neuroimaging data (chapter 2) and behavior (chapters 6 and 7), as well as elements that facilitate and enrich the predictive modelling framework, such as the value of making datasets publicly accessible (chapter 4; Adjerid & Kelley, 2018; Poldrack & Gorgolewski, 2014) and a method to aid interpretation of predictive models (chapter 3). Note that the studies contained in this thesis do not all fall squarely within the predictive approach. For example, chapter 5 features a study that revolved around a confirmatory (and preregistered) hypothesis, and chapter 2 describes a study that in fact tests a very specific hypothesis using a predictive model. In what follows, I will argue that the predictive approach represents a useful and promising way of doing research that complements the traditional hypothesis testing approach with respect to their common goal of explaining and gaining understanding of the brain and mind. But first, I will illustrate that these two approaches can be thought of as different inferences from the same underlying model, which will help to identify their relative (dis)advantages later on.

1.1 Inference done differently

Both hypothesis testing and predictive modelling are scientific methods used in psychology and neuroscience to gain understanding of human cognition and behavior. Both approaches share an important common component: a statistical model (Breiman, 2001). Although there are many different definitions and interpretations of the term “model” (Kellen, 2019), in this chapter, I define a statistical model as a quantitative representation of (a part of) a target system (Frigg & Hartmann, 2020). In psychology and cognitive neuroscience, a target system may refer to a specific cognitive capacity (e.g., emotion recognition) or behavior (e.g., instrumental learning; Cummins, 2000; Rooij & Baggio, 2021). Models are used to create a quantitative description, or hypothesis, of how data within a target system may have been generated. Specifically, statistical models describe how one quantity of interest within the target system, \(y\) (the “target variable”), may arise as a function (\(f\)) of one or more other quantities in the target system, \(X\) (the predictor variables or features), often in the presence of noise (\(\epsilon\)):

\[\begin{equation} y = f(X) + \epsilon \end{equation}\]

Put differently, models represent explanations of how variability in a particular aspect of the target system (\(y\)) arises as the result of a set of (causally related) features (Cummins, 2000; Kay, 2017). For example, chapters 6 and 7 describe models that attempt to explain the emotion people see in others’ facial expressions (\(y\)) as a function of a combination of facial movements (\(X_{\mathrm{mov}}\)):

\[\begin{equation} \mathrm{emotion} = f(X_{\mathrm{mov}}) + \epsilon \end{equation}\]

In principle, the function linking the predictors to the target can be any function that maps a vector of numbers (the predictors, \(X_{i}\)) to a single number (the target value, \(y_{i}\)), but almost all statistical tests as well as most predictive models in psychology and cognitive neuroscience use a variant of the general(ized) linear model (GLM; Ivanova et al., 2021; Lindeløv, 2019). A linear model assumes that the target variable (which we assume to be continuous for now) can be expressed as the sum of a set of features (\(X_{1}, X_{2}, \dots , X_{P}\)) weighted by a corresponding set of parameters (\(\beta_{1}, \beta_{2}, \dots , \beta_{P}\)). When the target variable is continuous, the corresponding linear model is more commonly known as a linear regression model1:

\[\begin{equation} f(X_{i}) = \sum_{j=1}^{P}X_{ij} \beta_{j} \end{equation}\]

In linear models, like the linear regression model above, the parameters quantify the strength of the association between each predictor and the target variable. Model parameters are considered unknown and need to be estimated from data. Here, “data” refers to a set of observations of the target variable (\(y\)) and the predictor variables (\(X\)). There are various mathematical techniques to estimate the model parameters, including the well-known (ordinary) least squares analytical solution, iterative gradient-based methods, regularized least squares, and Bayesian parameter inference. These methods differ in how they estimate the parameters (or, in more technical terms, in which objective function they minimize during estimation), but they all return an estimate of the parameters of the model. These estimated parameters are often denoted with a “hat” (^; i.e., \(\hat{\beta}\)) to distinguish them from the true, but unknown, parameters (i.e., \(\beta\)).
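
To make the estimation step concrete, the snippet below sketches how the parameters of a linear regression model could be estimated with the ordinary least squares solution. The data are simulated and all variable names are illustrative; this is a minimal sketch, not the analysis pipeline used in any of the chapters.

```python
# Minimal sketch: simulate data from a linear model and estimate its parameters
# with ordinary least squares (all numbers and names are illustrative).
import numpy as np

rng = np.random.default_rng(seed=0)
n_obs, n_pred = 100, 3                       # number of observations and predictors
beta_true = np.array([0.5, -1.0, 2.0])       # "true" (normally unknown) parameters

X = rng.normal(size=(n_obs, n_pred))         # predictor variables
noise = rng.normal(scale=0.5, size=n_obs)    # the error term, epsilon
y = X @ beta_true + noise                    # target variable: y = f(X) + epsilon

# Ordinary least squares estimate of the parameters (the "hat" values)
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
print(beta_hat)                              # close to, but not exactly, beta_true
```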

After obtaining estimates of the model parameters, the model can be used to make predictions about the value of the target variable (\(y\)) given observations of the predictor variables (\(X\)):

\[\begin{equation} \hat{y}_{i} = \sum_{j=1}^{P}X_{ij} \hat{\beta}_{j} \end{equation}\]

Just as the “hat” distinguishes estimated from true parameters, the “hat” in the equation above distinguishes a prediction (i.e., an estimate of the target variable, \(\hat{y}\)) from the true target value (\(y\)). The model predictions can be compared to the actual target values to evaluate the model’s predictive accuracy (alternatively called “model fit” or simply “accuracy”), which is usually summarized in a single number using metrics such as \(R^{2}\) (also more colloquially known as “explained variance”).
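
Continuing the illustrative example above, predictions and \(R^{2}\) could be computed as follows. Note that here the accuracy is computed on the same data used for estimation; the predictive approach, as discussed below, would typically evaluate accuracy on held-out data.

```python
# Continuing the sketch above: generate predictions from the estimated
# parameters and summarize predictive accuracy as R^2 (explained variance).
y_hat = X @ beta_hat                         # predictions: y_hat = X * beta_hat

ss_res = np.sum((y - y_hat) ** 2)            # residual sum of squares
ss_tot = np.sum((y - y.mean()) ** 2)         # total sum of squares
r_squared = 1 - ss_res / ss_tot
print(f"R^2 = {r_squared:.3f}")              # computed on the estimation data itself
```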

Thus far, the specification of a (linear) model and the estimation of its parameters is common to both the traditional and the predictive approach. The crucial difference between the two approaches, at this point, is what element they treat as unknown and perform inference on. In the hypothesis testing approach, inference is performed on the estimated model parameters while in the predictive approach, inference is performed on the model’s predictive accuracy (Bzdok, 2017).
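
This contrast can be sketched with standard Python tooling. The snippet below reuses the simulated data from the previous sketches and is meant only to illustrate the two inferential targets, not to prescribe a specific analysis; in both cases the model and its estimated parameters are the same, only the quantity on which inference is performed differs.

```python
# Two inferential targets for the same linear model (illustrative sketch,
# reusing the simulated X and y from above).
import statsmodels.api as sm
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Hypothesis testing approach: inference on the estimated parameters,
# e.g., a t-test per parameter against the null hypothesis beta = 0.
ols_fit = sm.OLS(y, sm.add_constant(X)).fit()
print(ols_fit.pvalues)                        # one p-value per parameter (incl. intercept)

# Predictive approach: inference on predictive accuracy,
# e.g., the cross-validated R^2 of the model on held-out data.
cv_r2 = cross_val_score(LinearRegression(), X, y, cv=5, scoring="r2")
print(cv_r2.mean())                           # average out-of-sample R^2 across folds
```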

This difference in their focus of inference is associated with different research cultures, which use statistical models to explain a target system in different ways (Breiman, 2001). In the traditional hypothesis testing approach, the inferences about model parameters are not meant to directly explain (parts of) a target system. Instead, the target system is verbally described and explained by a theory (Kellen, 2019). Explanation of the system occurs via testing hypotheses about very specific aspects of the system that are implied by a theory, often in strictly confirmatory experiments (Wagenmakers et al., 2012). Because such hypothesis-driven studies often use strictly controlled experiments in which the factor(s) of interest are explicitly manipulated, these studies afford causal interpretation of the observed statistical effects (Groot, 1961). For example, if a particular theory about emotion (e.g., basic emotion theory) implies that certain categorical emotions should be universally recognized (Keltner et al., 2019), then statistical tests showing that people across the globe are able to distinguish these emotions above chance level (e.g., Ekman et al., 1969) corroborate this theory. The logic behind this approach is that we gain an increasingly better understanding of the target system if we keep testing hypotheses implied by the corresponding theory. Or, in Newell’s terminology, if we just keep asking nature questions, we will at some point understand it.

Theories play a less significant role in the predictive modelling culture. Although theories may inspire particular classes of models and constrain the space of possible models (Rooij & Baggio, 2021), they do not necessarily represent (a description of) the target system itself. Instead of theories, the predictive approach uses models themselves to both describe and explain a target system (Guest & Martin, 2021). These models can be thought of as algorithmic or mechanistic hypotheses of how a particular cognitive capacity or behavior may emerge (Schyns et al., 2009). For example, the categorical emotion model in chapter 7 represents the mechanistic hypothesis that the capacity of people to infer and recognize emotions from others’ faces occurs through an integration of weighted linear combinations of both facial movements and facial morphological features. Another example is illustrated in Chapter 2, which describes a study in which we hypothesized that the same brain networks associated with emotion experience underlie the capacity for emotion understanding (Oosterwijk et al., 2017). Using a predictive model trained on neural patterns associated with components of emotion experience, we could accurately predict emotion components associated with emotion understanding in others, which suggests that these two processes share a common neural implementation (Peelen & Downing, 2007). Importantly, in the predictive approach, progress in terms of explanation and understanding is not achieved by binary tests of these theory-driven hypotheses, but by the exploration and development of increasingly accurate models of the target system itself (Naselaris et al., 2011).

To be clear, although the research in this thesis often uses techniques and models from machine learning, the predictive approach should not be equated with machine learning. The origins of this approach, at least in the domain of psychology, can be traced back to psychophysics studies in the late nineteenth century. Psychophysics studies aim to develop lawlike models of how stimulus attributes give rise to sensory experiences and rarely feature explicit hypothesis tests of model parameters (Gescheider, 2013). Predictive, computational models also play a central role in the field of cognitive science, in which they are used as formal representations and implementations of cognitive processes (Núñez et al., 2019). While hypothesis testing has dominated much of psychology and cognitive neuroscience outside psychophysics and cognitive science, the predictive approach has become more prominent in both psychology (Yarkoni & Westfall, 2017) and cognitive neuroscience (Varoquaux & Thirion, 2014) in recent years. Machine learning has been particularly influential in cognitive neuroscience, where it was introduced as “pattern analysis” (Norman et al., 2006), but there are many other examples of predictive approaches in psychology and cognitive neuroscience. These include network analysis (Borsboom & Cramer, 2013) and structural equation modelling (Streiner, 2006) in psychology, as well as system identification (Wu et al., 2006), model-based cognitive neuroscience (Forstmann & Wagenmakers, 2015; Turner et al., 2017), and encoding models (Holdgraf et al., 2017; Naselaris et al., 2011) in cognitive neuroscience. Although these approaches differ in the way they construct and apply models, they all emphasize predictive accuracy rather than hypothesis testing.

In sum, although the traditional and predictive approach share a core component — a quantitative model — they differ in what aspect of the model they use for inference. The associated research cultures implement different approaches to explain and gain understanding of a target system. As I will discuss in the next section, the predictive and hypothesis testing approach each have specific advantages and, when used in combination, can compensate for the weaknesses of the other.

1.2 Towards prediction

Scientific models come in many forms and can have many different purposes. In psychology and cognitive neuroscience, researchers use scientific models primarily to explain cognitive capacities and behaviors (Yarkoni & Westfall, 2017). Here, I use the term “explanation” to mean the identification of the causal components of a particular target system (Yarkoni & Westfall, 2017). Specifically, in the hypothesis testing approach, models serve to test the existence of causal components implied by a particular theory. Explanation is, arguably, not the only function of scientific models. Two other functions often attributed to scientific models are prediction and exploration (Cichy & Kaiser, 2019; Gelfert, 2016). In what follows, I will evaluate the models from the hypothesis testing and the predictive approach on these criteria and argue that the two approaches emphasize them differently.

In terms of their ability to explain, models from the hypothesis testing approach are hard to beat. By employing carefully controlled experiments in which usually only a single factor is manipulated, hypothesis tests are able to clearly establish the presence of specific causal components of the target system. Moreover, these models usually contain few variables and parameters and are almost always linear, which makes for easy interpretation of the estimated causal effects. The strict experimental setup and simplicity of the models, however, leave little room for exploration of alternative, possibly better models of the target system or phenomenon. In fact, exploration is often explicitly discouraged in the context of hypothesis testing (Wagenmakers et al., 2012), which forces researchers to set up a completely new study in order to test an alternative model. It is not only exploration that suffers from the emphasis on explanation, but prediction as well. The very fact that most models used for hypothesis testing are designed to investigate and test only a very specific part of the target system results in very simple models that, arguably, cannot capture the complexity of the cognitive capacities and behaviors studied by psychologists and cognitive neuroscientists (Jolly & Chang, 2019; Tosh et al., 2020). The result is that each individual model is usually only able to correctly predict a fraction of the variance of the target variable. In a large sample of psychology studies, Schäfer & Schwarz (2019) found that the median model performance, expressed as the proportion of explained variance of the target variable, was only 12.6%, and was as low as 2.5% for purely confirmatory, preregistered studies.2

A prerequisite for comparing different predictive models is that, ideally, they are trained and evaluated on the same dataset. Using the same dataset to evaluate different models not only facilitates model comparison but also facilitates incremental progress over time. The famous ImageNet dataset used in computer vision provides a striking example of the impact common datasets can have on a field (Deng et al., 2009). Since 2011, the ImageNet dataset has been used in the yearly ImageNet Large Scale Visual Recognition Challenge (Russakovsky et al., 2015), a competition in which researchers can submit object recognition models trained and evaluated on the ImageNet dataset. In 2011, the best performing model achieved 51% accuracy; performance has improved yearly, with the best performing model in the 2021 edition achieving 91%.3 In the past decade, public datasets have emerged in psychology and cognitive neuroscience as well, often motivated by the desire to improve research transparency and reproducibility (Gewin, 2016). However, few have emerged as de facto benchmarks for a given subdomain in the way that ImageNet has for object recognition, which may be because most of these datasets are acquired in strictly controlled experiments that strongly limit the variety of models that can be explored and thus limit their reuse (Naselaris et al., 2021). In cognitive neuroscience, there have been some notable exceptions, including the Natural Scenes Dataset (Allen et al., 2021) and the Naturalistic Neuroimaging Database (Aliko et al., 2020), both aimed at facilitating the development of models of real-world vision. In Chapter 4, I describe our effort to release a large, richly annotated dataset to the public domain (Snoek et al., 2021). This dataset, the Amsterdam Open MRI Collection (AOMIC), comprises a set of multimodal MRI datasets suitable for individual difference analyses. Not only does the variety of data sources (MRI, physiological, demographic, and psychometric data) allow for the development of a wide range of novel models, the collection can also be used to evaluate the generalizability of existing models (see, e.g., Ngo et al., 2021).

The predictive accuracy of models trained on large, observational datasets, however, does not come for free. One major disadvantage of the predictive approach is that the mechanisms its models implement may not correspond to the actual mechanisms underlying human cognition and behavior. In other words, complex models may represent what the philosopher Daniel Dennett called “cognitive wheels”: useful inventions that may solve practical problems but, just like wheels do not occur in nature, do not reflect the true mechanisms underlying human cognitive capacities and behaviors (Dennett, 2006; see also Maas et al., 2021). A famous example of a cognitive wheel is the finding that state-of-the-art object recognition models seem to rely more on the texture than on the shape of objects (Geirhos et al., 2020; Xu et al., 2018), whereas humans show the opposite bias (Baker et al., 2018). Relatedly, models used in artificial intelligence seem to be extremely sensitive to spurious, non-causal relationships (Geirhos et al., 2020). A well-known example of this issue is the observation that a model trained on X-ray data to predict pneumonia diagnoses in fact relied on text annotations included in the X-ray images rather than on the images themselves (Zech et al., 2018). These limitations have led to the critique that using complex predictive models to explain and understand a target system, especially highly non-linear models as are common in many artificial intelligence applications, is like trading in one black box for another (Kay, 2017). Indeed, given the definition of “explanation” as the identification of the causal components of a target system, it is hard to argue that predictive models by themselves explain anything.

It is fair to say that predictive models, by themselves, do not constitute a satisfactory explanation of a target system, but this is not an insurmountable issue. I would argue that the construction and evaluation of a predictive model is only the first step; the second step is to gain insight into the mechanism that is learned by the model (Cichy & Kaiser, 2019). In this second step, the models are treated as concrete representations of the target system that can be manipulated, experimented with, and picked apart in order to gain insight into their mechanisms — not unlike model organisms in animal research (Scholte, 2018). In both the machine learning community and the psychology and cognitive neuroscience community, techniques have been developed to gain insight into the mechanisms of predictive models. One common technique is to selectively manipulate specific model components, such as parameters or intermediate stimulus representations, to test whether these manipulations lead to similar changes in behavior in models and humans (e.g., Seijdel et al., 2020). A related technique is to selectively manipulate the input to the model instead of the model itself. Chapter 3 outlines such a method, which can be used to control for specific stimulus features (“confounds”) in predictive models applied to neuroimaging data (see also Dinga et al., 2020) and thereby prevent models from learning spurious relationships. Another technique to increase the evidence for a “valid” model (rather than a cognitive wheel) is to show that key components of the model have plausible neural correlates (Güçlü & Gerven, 2015; Kriegeskorte et al., 2008; Yamins et al., 2014) or to directly constrain models with neural data (Turner et al., 2017). The underlying idea behind these different techniques is that explanation and understanding of a target system is achieved not by experiments on the target system directly, but by experiments on the models that represent it (Cichy & Kaiser, 2019).
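
As an illustration of the input-manipulation idea, the sketch below shows one common form of confound control: regressing a confound out of every feature before (cross-validated) decoding. This is a simplified illustration on simulated data, not the exact procedure developed in chapter 3 (which, among other things, is careful about how confound regression interacts with cross-validation); all variable names are hypothetical.

```python
# Illustrative sketch of confound control in a decoding analysis: remove the
# variance that a confound explains in each feature before fitting the decoder.
# Simulated data; not the exact procedure from chapter 3.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(seed=1)
n_obs, n_feat = 200, 50
y = rng.integers(0, 2, size=n_obs)                 # e.g., condition labels
confound = y + rng.normal(scale=1.0, size=n_obs)   # a variable correlated with y
# Features that carry confound-related, but no direct condition-related, signal
X = rng.normal(size=(n_obs, n_feat)) + 0.5 * confound[:, None]

# Regress the confound (plus an intercept) out of every feature
C = np.column_stack([np.ones(n_obs), confound])
X_clean = X - C @ np.linalg.lstsq(C, X, rcond=None)[0]

# Cross-validated decoding accuracy before and after confound removal
acc_raw = cross_val_score(LogisticRegression(), X, y, cv=5).mean()
acc_clean = cross_val_score(LogisticRegression(), X_clean, y, cv=5).mean()
print(acc_raw, acc_clean)   # above chance before removal, near chance (0.5) after
```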

Even though some of the weaknesses of the predictive approach can be mitigated, this does not mean that hypothesis testing should be abandoned. I believe that hypothesis testing remains, and will remain, an important tool in psychology and cognitive neuroscience, and that there are plenty of scenarios in which hypothesis testing should in fact be preferred. First, if the goal is not to provide explanations and gain understanding of some target system, but to test an intervention, then hypothesis testing is an appropriate method. For example, if one wants to know whether an educational intervention improves reading skills in children, then running a randomized controlled experiment and the associated hypothesis test is an excellent way to answer this question. Second, hypothesis tests may be useful in answering (binary) questions that challenge central assumptions in a particular research domain or theory. For example, Chapter 5 describes a neuroimaging study that investigated the neural correlates of curiosity for negative information (“morbid curiosity”; Oosterwijk et al., 2020), with the preregistered hypothesis that choosing negative content activates reward-related brain regions. The confirmation of this hypothesis challenges current theories of curiosity, because the most obvious indicator of reward — a pleasurable experience — is missing in curiosity for negative content. This finding may therefore indicate that information is rewarding “in and of itself”. Finally, phenomena established by the hypothesis testing approach may inform and constrain the development of predictive models (Borsboom et al., 2020; Kellen, 2019). This is illustrated in Chapter 7, which describes a model that uses variance in facial morphology to predict the emotions people see in “neutral” faces. The development of this model was inspired by the extensive literature on the associations between factors related to variance in facial morphology (e.g., age, gender, and ethnicity) and the emotional interpretations of static, “neutral” faces (Hess et al., 2009). Although the predictive models were not developed to test specific effects, we nonetheless observed (or “replicated”, if you will) several well-known effects from the emotional expression literature, such as the visual similarity and conceptual confusion between anger and disgust expressions (Jack et al., 2014).

To summarize, I believe that the predictive approach represents a useful addition to the methodological toolbelt of psychologists and cognitive neuroscientists. Given the striking progress in machine learning and artificial intelligence, I think that shifting the focus from explanation to prediction may be a promising avenue for psychology and cognitive neuroscience, but theory and hypothesis testing will remain important to constrain, inform, and test models — an idea that will be revisited in the general discussion.

1.3 Outline of this thesis

Although the chapters of this thesis have already been introduced briefly in the previous sections, I will summarize them here for convenience.

In chapter 2, I describe a study in which we used predictive models applied to functional MRI data, known as “decoding models” in the neuroimaging literature, to test a hypothesis about the shared neural basis of emotion experience and emotion understanding. To remedy the interpretational difficulties inherent to decoding analyses (and predictive models in general), chapter 3 outlines a method we developed to adjust for confounds in decoding analyses, which helps to rule out alternative explanations of the results. Moving away from the focus on predictive models, chapter 4 describes our effort to publish the “Amsterdam Open MRI Collection” (AOMIC), a set of three large, multimodal MRI datasets, and chapter 5 describes a confirmatory, fully preregistered neuroimaging study on a psychological phenomenon called “morbid curiosity”. Finally, the last two chapters return to the use of predictive models, this time in the context of facial expression perception. Chapter 6 outlines a method we developed (“hypothesis kernel analysis”) to formalize verbal hypotheses as quantitative predictive models, which we apply to a specific set of hypotheses about how facial movements relate to categorical emotions. Chapter 7 concludes this thesis with a study that compares predictive models of affective face perception based on static features (i.e., facial morphology) and dynamic features (i.e., facial movements), showing that people integrate both sources of information in their affective inferences and experiences.


  1. In this chapter, we assume for simplicity that the target variable, \(y\), is continuous. The target variable, however, does not need to be continuous; in that case, linear models from the GLM family additionally include an “inverse link function”, \(g^{-1}\), that maps the linear combination of features onto the appropriate domain for the (expected value of the) target variable: \(E[y] = g^{-1}(X\beta)\). For example, for a binary target variable, logistic regression uses the inverse logit (sigmoid) function, \(g^{-1}(z) = 1 / (1 + e^{-z})\).↩︎

  2. Note that the original article by Schäfer & Schwarz (2019) reported effect size, \(r\), instead of “variance explained”, \(R^{2}\). In analyses that are not cross-validated, the latter can be obtained by squaring the former (but see Funder & Ozer, 2019).↩︎

  3. Retrieved from https://paperswithcode.com/sota/image-classification-on-imagenet.↩︎