In 1991, researchers in Jamaica studied 129 undernourished children, providing some with weekly home play sessions led by community health aides. The stimulation had substantial effects on the children's development. Similar early stimulation programs were later implemented at national scale in Bangladesh and Colombia, integrated into existing government nutrition and parenting programs. The programs still helped children, but the improvements in child development measures were much smaller: 0.17 and 0.16 standard deviations, compared with 0.8 in Jamaica.
This pattern shows up constantly in development research. A program works brilliantly at small scale. It is expanded to reach millions more people. And then the impact changes: sometimes shrinking, sometimes shifting to different groups or outcomes, occasionally disappearing altogether.
This isn't a simple story about failure. It points to something deeper: we are surprisingly bad at predicting what will happen when we take something that works in one place and try it somewhere else, or at a much larger scale. As a recent SPIA technical note investigates, this is what experts call the problem of scaling (expanding a successful pilot) and external validity (whether results hold in new contexts). Addressing these challenges means making predictions through extrapolation (applying results to new settings) or interpolation (filling gaps within familiar ones). Both, however, are much harder than they sound, for a variety of reasons, as shown below.
When Good Programs Go Bad
Consider this head-scratcher from Kenya: hire a teacher on a fixed-term contract through an international NGO, and students learn significantly more. Have the government hire teachers on the exact same contracts? Zero impact. Same contracts, different implementer, very different outcomes. The culprit? Political opposition to the contract reforms, which changed on-the-ground implementation dynamics.
The pattern repeats elsewhere. "Nudge" interventions, which social scientists adore, aim to subtly influence people's behavior by altering the way options are presented. These interventions appear to work four times better in controlled academic studies than when implemented at scale by nudge units within governments or organizations[1]. Programs that plug into existing government infrastructure generally tend to fare better, but as these examples show, government involvement does not guarantee success.
Why Success Doesn't Always Travel
The reasons for this are frustratingly human. When California tried to replicate Tennessee's successful class-size reduction initiative statewide, it had to hire new teachers rapidly, and many were relatively inexperienced or uncertified. Smaller classes still helped, but not as much as anticipated.
In the same vein, a pilot migration loan program in Bangladesh increased temporary migration by 25-40 percentage points, but the effect fell to 12 percent at scale. Why? The incentives facing loan officers changed. In the small pilot, they had the discretion to give loans to the people who would benefit most from migrating. At scale, they started giving loans to the people most likely to repay, which protected their own returns.
Sometimes the problem is even more peculiar. Researchers simulated what would happen if fertilizer subsidies, tested in small areas of Uganda, were rolled out nationwide. They found that more than 80 percent of households would experience wildly different effects than the original experiment suggested, and a third would see changes of 50 percent or more, in either direction. Why? Because when everyone changes their behavior at once, the economic dynamics are likely to shift.
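To see the intuition, here is a deliberately simple toy model, not the Uganda study's actual simulation: households differ in how much they gain from fertilizer, and widespread adoption is assumed to push the crop price down, so each household's net gain at scale can look very different from its gain in a small pilot. All parameters and functional forms are illustrative assumptions.

```python
# Toy illustration only: not the Uganda study's model. Households differ in how
# much they gain from fertilizer, and widespread adoption is assumed to push
# the crop price down, so at-scale gains diverge from pilot gains.
import numpy as np

rng = np.random.default_rng(0)
n_households = 10_000
yield_response = rng.lognormal(mean=0.0, sigma=0.5, size=n_households)

def crop_price(adoption_share, elasticity=-0.3):
    # Hypothetical relationship: more adoption -> more output -> lower price.
    return (1.0 + adoption_share) ** elasticity

def net_gain(adoption_share, fertilizer_cost=1.0):
    # Each household's net gain from subsidized fertilizer at a given adoption level.
    return yield_response * crop_price(adoption_share) - fertilizer_cost

pilot_gain = net_gain(adoption_share=0.01)   # a few villages adopt; price barely moves
scale_gain = net_gain(adoption_share=0.80)   # national rollout; price falls

relative_change = np.abs(scale_gain - pilot_gain) / np.abs(pilot_gain)
print(f"Households whose effect shifts by 50% or more: {(relative_change >= 0.5).mean():.0%}")
```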
These are scaling and external validity challenges in action: what worked in one context doesn’t automatically translate elsewhere, even with careful replication.
The CGIAR Reality Check
These issues matter enormously for CGIAR, which invests heavily in agricultural research and innovations meant to reach millions of farmers. In the past, when asked to project future benefits, researchers produced optimistic estimates based on expert knowledge and the literature, which regrettably are no substitute for hard evidence about what actually happens at scale. The uncomfortable truth is that we cannot simply multiply localized successes by target populations to derive benefit projections. When interventions scale, impacts and incentives shift, and projections need to account for that.
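To make that arithmetic concrete, the back-of-the-envelope sketch below contrasts a naive projection (pilot effect times target population) with one based on an effect actually observed at scale, using the 0.8 and 0.17 standard-deviation figures from the opening example; the target population number is a made-up placeholder.

```python
# A deliberately simple back-of-the-envelope: the pilot vs. at-scale effect
# sizes (0.8 and 0.17 SD) come from the opening example; the target population
# is a hypothetical placeholder.
pilot_effect_sd = 0.80        # effect size in the small Jamaica pilot
scale_effect_sd = 0.17        # effect size measured in the Bangladesh scale-up
target_children = 5_000_000   # hypothetical reach of a national program

naive_total = pilot_effect_sd * target_children       # "multiply and project"
realistic_total = scale_effect_sd * target_children   # use evidence from scale

print(f"Naive projection overstates aggregate benefits by "
      f"{naive_total / realistic_total:.1f}x")
```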
So what can be done? Two promising methods offer hope. First, AI and machine learning techniques can help identify populations most similar to those in successful pilots, offering a workaround for external validity constraints. Second, remote sensing data can help track climatic and weather conditions, yield variability, productivity, and intervention coverage, particularly in contexts where traditional data are scarce or expensive to collect. Despite their significant potential, these methods warrant caution: data complexity and measurement error remain persistent issues.
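As a rough illustration of the first idea, the sketch below ranks hypothetical candidate districts by how closely their observable characteristics match a pilot sample, using a simple standardized distance. This is not the technical note's method; the districts, covariates, and values are invented, and real applications would use richer data and proper machine-learning models.

```python
# A minimal sketch (not the technical note's method): rank hypothetical
# candidate districts by how closely their observable characteristics match
# a pilot sample. All districts, covariates, and values are invented.
import numpy as np

# Columns: rainfall (mm), average farm size (ha), share of poor households
pilot_sample = np.array([
    [1100, 1.2, 0.45],
    [1050, 1.0, 0.50],
    [1150, 1.4, 0.40],
])

candidate_districts = {
    "District A": [1080, 1.1, 0.48],
    "District B": [ 700, 3.5, 0.20],
    "District C": [1200, 1.3, 0.42],
}

# Standardize covariates with the pilot sample's mean and spread so that no
# single unit (e.g., rainfall in mm) dominates the distance.
mu = pilot_sample.mean(axis=0)
sd = pilot_sample.std(axis=0)

def distance_from_pilot(covariates):
    """Standardized Euclidean distance between a district and the pilot profile."""
    return float(np.linalg.norm((np.array(covariates) - mu) / sd))

for name, covs in sorted(candidate_districts.items(),
                         key=lambda kv: distance_from_pilot(kv[1])):
    print(f"{name}: distance from pilot profile = {distance_from_pilot(covs):.2f}")
```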
The Takeaway
In the face of these challenges, the main takeaway is to be honest about uncertainty. Scaling something up, or trying an intervention someplace new, is akin to re-running an experiment. And experiments have uncertain outcomes. The question, therefore, is not whether to scale successful interventions, but how to be realistic about what to expect when we try to transport their impacts.
Read the Technical Note Here
[1] This gap may be influenced by factors such as publication bias (i.e., academic journals favoring trials with large effects) or the fact that government-run nudges operate in more complex settings.