40  Validity

Validity is the extent to which the results of an experiment support a more general conclusion.

There are many forms of validity. Here we briefly define four (Salganik, 2018).

Statistical conclusion validity is the extent to which the statistical analysis of the experiment was done correctly. For example, did the experimenter calculate the p-values correctly?
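
As a toy illustration (simulated data; a Welch t-test via scipy is just one possible analysis), the sketch below recomputes the test statistic and p-value rather than taking the reported values on trust:

```python
# A minimal sketch of checking the statistical analysis behind a reported result.
# The data are simulated and the Welch t-test is only an illustrative choice.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
control = rng.normal(loc=10.0, scale=2.0, size=100)    # simulated control outcomes
treatment = rng.normal(loc=10.5, scale=2.0, size=100)  # simulated treatment outcomes

# Recompute the test statistic and p-value rather than taking them on trust.
t_stat, p_value = stats.ttest_ind(treatment, control, equal_var=False)
print(f"t = {t_stat:.2f}, p = {p_value:.3f}")
```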

Internal validity is the extent to which the experimental treatment is actually responsible for the change in the value of the dependent variable. You are able to link cause and effect while controlling for the effect of outside variables (usually by randomisation). Internal validity also concerns whether the experimental procedures were performed correctly. Internal validity tends to be higher in the lab, although the failure of many lab experiments to replicate shows that high internal validity cannot be taken for granted even there.
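
A minimal sketch of what randomisation buys you, assuming a hypothetical outside variable (age) that the experimenter does not control: under random assignment, the treatment and control groups should look similar on it in expectation.

```python
# A minimal sketch of random assignment, the usual tool for internal validity:
# randomisation balances outside variables (observed and unobserved) in expectation.
import numpy as np

rng = np.random.default_rng(1)
n = 1000
age = rng.normal(40, 12, size=n)        # an "outside" variable we did not control
assign = rng.permutation(n) < n // 2    # random half to treatment, half to control

# With random assignment the groups should look similar on age (and everything else).
print(f"Mean age, treatment: {age[assign].mean():.1f}")
print(f"Mean age, control:   {age[~assign].mean():.1f}")
```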

Construct validity concerns whether the data match the theoretical constructs of interest. If you believe that a social norm triggers someone to pay their tax on time, does your treatment manipulate social norms while holding other constructs (such as prompts) constant?

External validity is the extent to which experimental findings can be generalised beyond the particular experimental setting and participants, such as to other populations, places, and times. Field experiments tend to provide higher external validity than those constrained to the lab.

40.0.1 The validity of lab and field experiments

A core driver of whether a lab or field experiment is more appropriate is whether internal or external validity is more important.

John List and Omar Al-Ubaydli (2014) provide one perspective on this trade-off:

Bob wants to purchase Susan’s mug for $5, but they live far apart and so he will need to send a check. The mug is worth $3 to Susan and $9 to Bob, implying a societal surplus of $9 - $3 = $6 if the transaction occurs. However, if Bob sends the money first, will Susan send the mug? If Susan sends the mug first, will Bob send the money? Signing a legally enforceable contract would facilitate the trade.

However, what if property rights are poorly enforced (e.g., if they live in different countries)? Then their fear may result in them forgoing the trade and the surplus of $6. …

How can we test whether we could expect the trade to occur? Perhaps we could look at the trust game.

This is what Berg et al. (1995) did using the ‘trust game’ – a microcosm of Bob and Susan’s quandary. The sender starts with $10 and the responder starts with $0. The sender can send any amount to the responder, retaining the remainder. The responder receives triple whatever the sender transfers. The responder then decides how to divide the tripled amount between the two, terminating the game. For example, if the sender transfers $4, the sender retains $6 and the responder receives $4 x 3 = $12; finally, the responder can choose to return anywhere between $0 and $12.

The desirable outcome is for the sender to transfer all $10, and for the responder to return at least $10 ($15 under egalitarianism). This mimics Bob and Susan trading the mug. It leaves the two with $30 between them – much more than just $10 with the sender. But the sender may doubt the responder’s trustworthiness. The responder may just choose to pocket whatever the sender transfers since the sender has no legal recourse. Anticipating this, like Bob and Susan, the sender may just decide to avoid transacting, retaining the $10.
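
The payoff arithmetic of the game can be written out directly. The helper below is purely illustrative (it is not from Berg et al.), but it reproduces the numbers in the description above:

```python
# A sketch of the trust game payoffs described above: the sender starts with $10,
# any amount sent is tripled, and the responder decides how much to return.
def trust_game(endowment, sent, returned):
    """Final payoffs for sender and responder given the amounts sent and returned."""
    pot = 3 * sent                        # the transfer is tripled
    sender = endowment - sent + returned
    responder = pot - returned
    return sender, responder

# The example in the text: send $4, so the responder receives $12 to divide.
print(trust_game(10, 4, 6))    # returning $6 leaves ($12, $6)
# Full trust with an egalitarian return: the pair ends with $30 between them.
print(trust_game(10, 10, 15))  # ($15, $15), versus ($10, $0) with no transfer
```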

Berg et al.’s results suggest that such fears were potentially ill-founded. Even when the trust game was played with anonymous strangers, on average, senders would send $5.16, and responders would return $4.66. The authors appealed to altruism to explain the results, i.e., the players feel bad about the other party receiving a low payoff, motivating behaviour closer to what emerges under fully enforced property rights. Subsequent studies (Fehr et al. 1993) have confirmed these results, and interviews reveal that participants’ altruism is their most common stated motivation.

But should this experiment provide comfort to Bob? List and Al-Ubaydli describe a related field experiment.

I went to professional sports memorabilia markets and recruited professional traders to play laboratory trust games. Like the literature, I found a modest positive causal effect of property rights.

I then ran a complementary field experiment. As a sports card enthusiast, I was aware that the market was awash with player cards of different quality (grade), and that the professional sellers were better at discerning grade than your average fan milling around the exhibition. Critically, an individual requesting a high-grade card and receiving a low-grade card was rarely aware at the time of the transaction and, if they found out subsequently, they had no legal recourse. Thus, buying a card required a buyer to trust – like Bob buying the mug – the sender in the trust game. I recruited some archetypal sports fans to go to professional sellers and request a specific card at a specific grade in exchange for a predetermined price, without revealing to the seller that this was an experiment. If traders were completely selfish, then they would return the lowest grade of card whatever price was offered to them, confirming the need to enforce property rights. Alternatively, if the traders behaved ‘altruistically’ like the laboratory experiment participants, then they should offer higher quality when offered higher prices.

The results were pretty grim for anyone who believes in the humanity of sports card dealers – card quality was insensitive to price offers. Note that these were the same traders who had apparently exhibited altruistic tendencies in the preceding laboratory experiment. …

These results painted a bleaker picture of anonymous trade in the absence of property rights. Were Bob and Susan to read the entire literature, what would be the epistemologically ideal way for them to update their beliefs on the causal effect of property rights on trading behaviour? Which results generalise to their setting more accurately – laboratory or field?

40.0.2 The validity of randomised controlled trials

Much of this unit has focussed on how we can use randomised controlled trials to obtain accurate measures of treatment effects, as opposed to delving into how the results could be used. This is effectively a focus on internal validity.
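
For concreteness, the quantity such trials estimate is typically the average treatment effect (ATE). In standard potential-outcomes notation (not tied to any particular study in this unit):

```latex
% Y_i(1) and Y_i(0) are individual i's outcomes with and without treatment.
\mathrm{ATE} = \mathbb{E}\bigl[Y_i(1) - Y_i(0)\bigr]

% Under random assignment, treatment status is independent of the potential
% outcomes, so a simple difference in group means is an unbiased estimator:
\widehat{\mathrm{ATE}} = \bar{Y}_{\mathrm{treatment}} - \bar{Y}_{\mathrm{control}}
```

Randomisation is what licenses the second line; whether that estimate travels to other settings is the external validity question taken up below.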

This unit’s focus is matched in much of the literature about randomised controlled trials. External validity is often an afterthought. Yet even in an article arguing against placing randomised controlled trials on a pedestal, Angus Deaton and Nancy Cartwright (2018) acknowledge that trial results can have value even when they do not generalise in a simple way:

Suppose a trial has (probabilistically) established a result in a specific setting. If ‘the same’ result holds elsewhere, it is said to have external validity. External validity may refer just to the replication of the causal connection or go further and require replication of the magnitude of the ATE. Either way, the result holds—everywhere, or widely, or in some specific elsewhere—or it does not.

This binary concept of external validity is often unhelpful because it asks the results of an RCT to satisfy a condition that is neither necessary nor sufficient for trials to be useful, and so both overstates and understates their value. It directs us toward simple extrapolation—whether the same result holds elsewhere—or simple generalization—it holds universally or at least widely—and away from more complex but equally useful applications of the results. The failure of external validity interpreted as simple generalization or extrapolation says little about the value of the results of the trial.

But the following passage starts to shape their critique:

Establishing causality does nothing in and of itself to guarantee that the causal relation will hold in some new case, let alone in general. Nor does the ability of an ideal RCT to eliminate bias from selection or from omitted variables mean that the resulting ATE from the trial sample will apply anywhere else.