41  Generalisability: effect sizes

A term often used in place of validity, particularly external validity, is generalisability. It is most commonly used to describe whether the results of a field trial can inform how an intervention might work in another context or at scale.

There are few systematic evaluations of the generalisability of applied behavioural science trials. However, research on the generalisability of impact evaluations in international development provides some insight into this question.

Eva Vivalt (2020) reviewed 635 papers containing 15,024 estimates of effect sizes relating to 20 types of interventions in international development. In her paper, she assessed the extent to which the results from a particular intervention could be used to predict the sign or magnitude of the effect of a similar study in another context. This question is effectively an examination of the Type M and Type S errors we discussed earlier.

She found that an inference about a study’s effect based on another similar study will have the correct sign 61% of the time (comparing the median intervention-outcome pair in each study). On magnitude, a naive prediction of the effect size in the new study is likely to be off by about 249%.
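To make these two failure modes concrete, here is a minimal simulation sketch, not Vivalt’s method or data, of what Type S (wrong sign) and Type M (wrong magnitude) errors look like when one noisy study is used to predict another. All distributions and parameter values below are hypothetical illustrations.

```python
import numpy as np

rng = np.random.default_rng(0)
n_pairs = 100_000

# Hypothetical setup: each intervention has an underlying true effect; two
# studies estimate related true effects with sampling noise and genuine
# cross-context variation. None of these parameters come from Vivalt (2020).
true_effect = rng.normal(loc=0.1, scale=0.2, size=n_pairs)
context_shift = rng.normal(scale=0.15, size=n_pairs)
study_a = true_effect + rng.normal(scale=0.1, size=n_pairs)
study_b = true_effect + context_shift + rng.normal(scale=0.1, size=n_pairs)

# Type S analogue: how often does study A's sign match study B's?
sign_agreement = np.mean(np.sign(study_a) == np.sign(study_b))

# Type M analogue: percentage error when naively predicting B's effect with A.
pct_error = np.abs(study_a - study_b) / np.abs(study_b)
median_pct_error = np.median(pct_error) * 100

print(f"Sign agreement: {sign_agreement:.0%}")
print(f"Median percentage error in magnitude: {median_pct_error:.0f}%")
```

Because the parameters are arbitrary, the printed numbers carry no empirical weight; the point is the mechanism, in which sampling noise combined with genuine cross-context variation degrades both sign agreement and magnitude predictions.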

Listen to Eva Vivalt on the 80,000 Hours podcast, or read the transcript.

Another perspective on generalisability comes from analysis by DellaVigna and Linos (2022) of the implementation of behavioural interventions by two “Nudge Units” in the United States. They compared the results from 126 randomised controlled trials run by the Nudge Units to a sample of trials in academic journals. They wrote:

In the Academic Journals papers, the average impact of a nudge is very large—an 8.7 percentage point take-up effect, which is a 33.4% increase over the average control. In the Nudge Units sample, the average impact is still sizable and highly statistically significant, but smaller at 1.4 percentage points, an 8.0% increase. We document three dimensions which can account for the difference between these two estimates: (i) statistical power of the trials; (ii) characteristics of the interventions, such as topic area and behavioral channel; and (iii) selective publication. A meta-analysis model incorporating these dimensions indicates that selective publication in the Academic Journals sample, exacerbated by low statistical power, explains about 70 percent of the difference in effect sizes between the two samples. Different nudge characteristics account for most of the residual difference.
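One way to sanity-check those figures: the percentage-point effect and the relative increase jointly pin down the implied average control-group take-up in each sample. A back-of-the-envelope sketch, using only the numbers quoted above:

```python
# Implied control-group take-up = effect (pp) / relative increase.
# Figures taken from the DellaVigna and Linos (2022) quote above.
academic_effect_pp, academic_rel = 8.7, 0.334
nudge_unit_effect_pp, nudge_unit_rel = 1.4, 0.080

academic_control = academic_effect_pp / academic_rel        # ~26.0 pp
nudge_unit_control = nudge_unit_effect_pp / nudge_unit_rel  # ~17.5 pp

print(f"Implied control take-up, academic journals: {academic_control:.1f} pp")
print(f"Implied control take-up, Nudge Units: {nudge_unit_control:.1f} pp")
```

The implied baselines differ (roughly 26 versus 17.5 percentage points), which is consistent with the quote’s point that the two samples also differ in intervention characteristics, not just in selective publication and statistical power.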