I recommend you do not use it.
After using the Implicit Relational Assessment Procedure (IRAP) in many publications during my PhD and first postdoc, I began to scrutinize many of the basic claims made about the task. I believe the evidence suggests that the IRAP is not fit for purpose, either as a measure of relational responding or as a measure of implicit attitudes. Due to both properties of the task and the research practices in the community, much of the published IRAP literature is very likely non-replicable and non-credible. Below are some of the key publications highlighting these concerns. I recommend that people do not use the IRAP in their research and instead pursue other lines of research.
The IRAP’s criterion validity is greatly overstated. Vahey et al. (2015) reported a meta-analysis of the IRAP’s clinical criterion validity and has been frequently cited for sample size determination in subsequent IRAP publications, on the basis of its claim that N = 37 is often sufficient. This article demonstrates that there are serious errors and biases in Vahey et al. (2015) at almost every stage of data extraction and analysis. I made the authors of Vahey et al. (2015) aware of some of these errors in 2019, and of more in 2025, but they have declined to correct the original article.
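For context, the N = 37 figure falls out of a standard power analysis when the meta-analytic effect size is taken at face value. Below is a minimal sketch, assuming r ≈ .45 (the headline estimate from Vahey et al., 2015), a two-tailed α = .05, and 80% power; if that estimate is inflated by the errors described above, the required sample size is underestimated accordingly.

```python
from math import atanh, ceil
from scipy.stats import norm

def required_n_for_correlation(r, alpha=0.05, power=0.80):
    """Approximate N needed to detect a correlation r (two-tailed test),
    via the Fisher z transformation, where SE(z) = 1 / sqrt(N - 3)."""
    z_alpha = norm.ppf(1 - alpha / 2)  # two-tailed critical value
    z_beta = norm.ppf(power)          # quantile for the desired power
    z_r = atanh(r)                    # Fisher z of the target effect size
    return ceil(((z_alpha + z_beta) / z_r) ** 2 + 3)

print(required_n_for_correlation(0.45))  # -> 37
print(required_n_for_correlation(0.20))  # -> 194, if the true effect is more modest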
This file-drawer meta-analysis of published and unpublished results (N = 1839) suggests that the IRAP’s internal consistency is poor (α = .49) and its test-retest reliability is very poor (ICC2 = .10). If scores are calculated for individual trial types, as many IRAP proponents argue they should be, both forms of reliability are very poor (α = .27, ICC2 = .18). Low reliability reduces statistical power and replicability, suggesting that many published results may be false positives.
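Spearman’s classic attenuation formula makes the consequences concrete: the observable correlation between two measures is bounded by their reliabilities, r_obs = r_true × √(rel_X × rel_Y). A short sketch using the reliability estimates above; the criterion reliability of .80 is a hypothetical value chosen for illustration.

```python
from math import sqrt

def attenuated_r(r_true, rel_x, rel_y):
    """Spearman's attenuation formula: the correlation observable between
    two measures, given the true-score correlation and their reliabilities."""
    return r_true * sqrt(rel_x * rel_y)

# Hard ceiling on any observed IRAP-criterion correlation at alpha = .49,
# even with a perfect true correlation and a perfectly reliable criterion:
print(round(attenuated_r(1.00, 0.49, 1.00), 2))  # 0.70
# A true r = .45 against a criterion of (hypothetical) reliability .80:
print(round(attenuated_r(0.45, 0.49, 0.80), 2))  # 0.28
# Trial-type scores (alpha = .27) attenuate observed effects further still:
print(round(attenuated_r(0.45, 0.27, 0.80), 2))  # 0.21
```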
Sample sizes in IRAP studies are extremely low – lower than in social psychology prior to the start of the replication crisis – and have not risen meaningfully over time. Very small samples imply low statistical power and contribute to poor replicability.
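To illustrate what such sample sizes buy in power terms, here is a rough sketch using the same Fisher z approximation as above; the n = 30 and the candidate effect sizes are illustrative assumptions, not estimates drawn from the IRAP literature.

```python
from math import atanh, sqrt
from scipy.stats import norm

def power_for_correlation(r, n, alpha=0.05):
    """Approximate two-tailed power to detect a correlation r at sample
    size n, via the Fisher z approximation."""
    z_crit = norm.ppf(1 - alpha / 2)
    return norm.cdf(atanh(r) * sqrt(n - 3) - z_crit)

for r in (0.20, 0.30, 0.45):
    print(f"r = {r:.2f}, n = 30: power = {power_for_correlation(r, 30):.2f}")
# -> 0.18, 0.36, 0.71: well short of the conventional 0.80 in every case.
```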
There is consensus among IRAP researchers that IRAP effects are biased in some way. O’Shea et al. (2016) called this a positive framing bias, and Finn, Barnes-Holmes, et al. (2016, 2018) refer to the generic pattern as the single trial type dominance effect. While there is disagreement about its cause, the presence, replicability, and generalizability of these biases in IRAP effects are apparently uncontroversial. However, the necessary implications of this confound are not fully appreciated. This article uses a large open data set (N = 753) of IRAPs in multiple domains to show (a) that the majority of variance in IRAP effects is attributable to the generic pattern rather than to the stimulus domain being assessed by the IRAP, and (b) that this pattern is observed even when nonsense stimuli are used. Given that the conclusions of many IRAP studies rest on the presence of non-zero IRAP effects, and that these effects are observed regardless of the stimuli employed in the task, many conclusions in the IRAP literature are likely invalid or erroneous, being merely the result of a statistical artifact.
A key rationale for using the IRAP over other, more reliable and valid measures such as the IAT is that its four trial types are supposedly functionally independent of one another, yet little evidence has been presented for this claim. This analysis of 1464 participants across 35 IRAPs in 16 different domains suggests that the IRAP trial types are not independent and are typically correlated with one another.
This article was my bachelor's thesis project, and the results reported in it are heavily p-hacked, with the knowledge of both authors (although without a full understanding, at the time, that p-hacking was problematic). Specifically, we used optional stopping and flexibility in data processing to obtain a significant result. At least one unpublished conceptual replication of this study using the RRT, a similar implicit measure, found null results. I no longer believe the claims in this article are credible.
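To show why optional stopping alone undermines a result like this, here is a minimal simulation under the null hypothesis; the starting n, batch size, and maximum n are hypothetical values for illustration, not those used in the original study.

```python
import numpy as np
from scipy.stats import ttest_1samp

rng = np.random.default_rng(42)

def significant_with_optional_stopping(n_start=20, n_max=100, step=5, alpha=0.05):
    """One simulated 'study' with a true effect of exactly zero: re-test
    after each new batch of participants, stopping as soon as p < alpha."""
    data = rng.normal(size=n_start)
    while True:
        if ttest_1samp(data, 0).pvalue < alpha:
            return True  # a 'significant' result reached by peeking
        if len(data) >= n_max:
            return False
        data = np.concatenate([data, rng.normal(size=step)])

rate = np.mean([significant_with_optional_stopping() for _ in range(2000)])
print(f"False-positive rate with optional stopping: {rate:.2f}")
# Substantially above the nominal .05 of a single fixed-N test.
```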
A blog post by Chad Drake detailing his own negative experiences in dealing with the IRAP and its research culture.