Most international organizations are past the stage of asking whether to use AI and are now in the harder phase of identifying the best use cases. A 2025 study of over 2,500 humanitarian practitioners found that 70% of respondents had integrated AI into daily or weekly workflows, but fewer than half agreed it had improved organizational efficiency. In addition, four out of five surveyed organizations had no AI policies in place at the time.
Widespread use without organizational clarity could be a manageable problem for low-stakes tasks. However, what happens when the outputs carry evidentiary weight?

For research and independent evaluation teams, the stakes are high. If the AI platform invents a finding, softens a caveat, or subtly re-anchors a study around the wrong theme, the error could end up in a decision-maker's briefing note. A 2026 study found that 13 state-of-the-art LLMs hallucinated citations at rates ranging from 14% to 95%. The broader concerns around LLMs producing plausible but incorrect advice, spreading misinformation, reinforcing biases, and undermining privacy are well known. What is not known is precisely how AI platforms behave when working with independent evaluation evidence.
To start answering that question, we designed a small experiment around one task: producing donor-tailored summaries based on independent evidence. If an AI platform can read a dense study or evaluation and land a summary that's accurate, specific, and contextualized correctly for a given donor, that's one concrete use case. So we picked six documents (SPIA Uganda and Bangladesh country reports, the Genebank and Genetic Innovation evaluations, and the ISDC Megatrends report and the 2025 Portfolio Review). We then asked ChatGPT, Claude, and Microsoft Copilot to produce a donor-tailored summary of each study for four CGIAR donors, based on a human-generated donor brief. The briefs covered publicly available information on the donor’s funding priorities, past investments within CGIAR (by theme and funding modality), and donor language that the AI should aim to mirror. The AI-generated summaries were then scored by a team member on accuracy, specificity, clarity, and donor tailoring.
The main finding was that all three AI platforms could do the job in principle, but no two of them did it the same way.
Claude scored highest overall and led on the two dimensions that mattered most for this task: factual accuracy and donor tailoring. It stayed anchored to the source material and mirrored donor language without drifting or hallucinating.
Copilot came in close behind and stood out on specificity: it was more willing to name programs, numbers, and concrete recommendations.
ChatGPT was the clearest to read and the most polished on the surface, but also the most generic. It scored the lowest overall and required substantive revision on a third of its summaries because of thin donor framing or superficial content.
Two other patterns emerged that are worth flagging. First, SPIA country-level studies scored lower than other evaluations and strategic reviews across all platforms. The methodological complexity and empirical nuances captured in SPIA studies appear to make them harder for LLMs to summarize without flattening. Second, donor briefs mattered more than we expected. Because the briefs relied on publicly available information, which varied in detail from donor to donor, summaries built on the most specific briefs scored consistently higher. This implies that donor briefs themselves are a confound: good AI performance for a given donor tells us as much about the donor's documentation as it does about the AI platform.
Now here is what the pilot did not tell us. First, we did not account for prompt sensitivity, meaning we don't know how much of the ranking would hold under different phrasing, different ordering, or a follow-up round. Second, we picked studies we thought were thematically rich enough to test donor tailoring, which means our findings are partly a statement about the studies we chose. Third, we do not know how much the quality would vary if the inputs (donor briefs) were also AI-generated.
So, is it worth using AI platforms to generate contextualized donor summaries? The answer is a tentative yes, but strictly as a drafting tool and not a publishing one. All three platforms cleared the bar of producing usable draft material, with Claude and Copilot clearing it consistently. But whether it is ‘worth using’ depends on a question this pilot experiment did not answer: whether reviewing an AI draft takes less effort than writing the summary ourselves. If verifying every claim and data point, checking donor framing, and catching the subtle re-anchoring takes as long as writing a contextualized donor summary from scratch, the efficiency case collapses.
The takeaway from this pilot is that the ‘cost’ of AI is not simply the subscription charges, but also the review time and the expertise needed to catch misrepresentations, prompt accurately, and make judgment calls about when AI-generated drafts are contextually sound. Until we price that, ‘using AI’ will remain a vague assertion about surface-level adoption rather than value.
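To make the pricing question concrete, here is a minimal sketch of how the comparison could be set up. Every name and number in it is a hypothetical placeholder chosen for illustration; the pilot did not measure review time, writing time, or staff costs, so nothing below is a finding.

```python
from dataclasses import dataclass


@dataclass
class DraftingCost:
    """Hypothetical per-summary inputs. None of these come from the pilot."""
    prompt_minutes: float            # preparing the donor brief and the prompt
    review_minutes: float            # verifying claims, data points, donor framing
    revision_minutes: float          # rewriting thin or re-anchored passages
    subscription_per_summary: float  # subscription fee spread over summaries


def ai_assisted_cost(c: DraftingCost, hourly_rate: float) -> float:
    """Staff time plus subscription share for one AI-drafted summary."""
    staff_minutes = c.prompt_minutes + c.review_minutes + c.revision_minutes
    return staff_minutes / 60 * hourly_rate + c.subscription_per_summary


def manual_cost(writing_minutes: float, hourly_rate: float) -> float:
    """Staff time for writing the same summary from scratch."""
    return writing_minutes / 60 * hourly_rate


def break_even_review_minutes(
    writing_minutes: float, c: DraftingCost, hourly_rate: float
) -> float:
    """Review time at which AI-assisted drafting costs the same as manual writing.

    Below this value the AI route is cheaper; above it, the efficiency
    case collapses.
    """
    subscription_minutes = c.subscription_per_summary * 60 / hourly_rate
    return (
        writing_minutes
        - c.prompt_minutes
        - c.revision_minutes
        - subscription_minutes
    )


# Illustrative placeholder values only (assumptions, not measurements).
example = DraftingCost(
    prompt_minutes=15,
    review_minutes=60,
    revision_minutes=30,
    subscription_per_summary=2.0,
)
rate = 60.0             # hypothetical staff cost per hour
manual_minutes = 180.0  # hypothetical time to write one summary by hand

print(ai_assisted_cost(example, rate))
print(manual_cost(manual_minutes, rate))
print(break_even_review_minutes(manual_minutes, example, rate))
```

The useful output is the break-even figure rather than either cost on its own: it turns the open question into a threshold that a follow-up study could test by timing reviewers.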
Swetha Ramachandran is the Use of Evidence Senior Officer of the CGIAR Standing Panel on Impact Assessment (SPIA). She is an evaluation practitioner and applied researcher with a keen interest in evidence-based policymaking. Her work focuses on developing approaches to measure and monitor evidence uptake and policy influence.
Thomas Griffin is a former intern of the CGIAR Standing Panel on Impact Assessment (SPIA). He currently works as a legal analyst for GCM Grosvenor at their Chicago office.
