Image of two researchers looking at a paper, surrounded by floating bubbles with logos of different AI LLM tools: Claude, ChatGPT, and Microsoft Copilot
Blog

Can AI Actually Understand Independent Evidence? We Put Three Platforms to Test

Most international organizations are past the stage of asking whether to use AI and are now in the harder phase of identifying the best use cases. A 2025 study of over 2,500 humanitarian practitioners found that 70% of respondents had integrated AI into daily or weekly workflows, but fewer than half agreed it had improved organizational efficiency. In addition, 4 out of 5 surveyed organizations had no AI policies in place at the time.

Widespread use without organizational clarity could be a manageable problem for low-stakes tasks. However, what happens when the outputs carry evidentiary weight? 

For research and independent evaluation teams, the stakes are high. If the AI platform invents a finding, softens a caveat, or subtly re-anchors a study around the wrong theme, the harm could end up in a decision-maker's briefing note. A 2026 study found that 13 state-of-the-art LLMs hallucinated citations at rates ranging from 14% to 95%. The broader concerns around LLMs producing plausible but incorrect advice, spewing misinformation, reinforcing biases, and undermining privacy are well known. What is not known is precisely how AI platforms behave with independent evaluation evidence.

To start answering that question, we designed a small experiment around one task: producing donor-tailored summaries based on independent evidence. If an AI platform can read a dense study or evaluation and land a summary that is accurate, specific, and correctly contextualized for a given donor, that is one concrete use case. So we picked six documents: the SPIA Uganda and Bangladesh country reports, the Genebank and Genetic Innovation evaluations, and the ISDC Megatrends report and 2025 Portfolio Review. We then asked ChatGPT, Claude, and Microsoft Copilot to produce a donor-tailored summary of each document for four CGIAR donors, based on a human-generated donor brief. The briefs covered publicly available information on the donor’s funding priorities, past investments within CGIAR (by theme and funding modality), and donor language that the AI should aim to mirror. A team member then scored the AI-generated summaries on accuracy, specificity, clarity, and donor tailoring.
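The design described above can be sketched as a simple grid, as a rough illustration and not the procedure the team actually used. The donor names are hypothetical placeholders (the post does not name the donors), and the scoring scale and the equal weighting of the four dimensions are assumptions.

```python
from itertools import product
from statistics import mean

# Platform, document, and dimension names are taken from the post.
PLATFORMS = ["ChatGPT", "Claude", "Microsoft Copilot"]
DOCUMENTS = [
    "SPIA Uganda country report",
    "SPIA Bangladesh country report",
    "Genebank evaluation",
    "Genetic Innovation evaluation",
    "ISDC Megatrends report",
    "2025 Portfolio Review",
]
DONORS = ["donor_1", "donor_2", "donor_3", "donor_4"]  # hypothetical placeholders
DIMENSIONS = ["accuracy", "specificity", "clarity", "donor_tailoring"]


def build_grid():
    """Return every (platform, document, donor) combination to be summarized."""
    return list(product(PLATFORMS, DOCUMENTS, DONORS))


def platform_means(scores):
    """Average scores per platform: first across dimensions, then across summaries.

    `scores` maps (platform, document, donor) -> {dimension: numeric score}.
    Equal weighting of the dimensions is an assumption, not the authors' rubric.
    """
    per_platform = {}
    for (platform, _document, _donor), dims in scores.items():
        summary_score = mean(dims[d] for d in DIMENSIONS)
        per_platform.setdefault(platform, []).append(summary_score)
    return {p: mean(values) for p, values in per_platform.items()}


grid = build_grid()
print(len(grid))  # 72 summaries if every combination is run
```

Keeping the document and the donor in each key also makes it possible to average by document type or by donor brief, which is where the two secondary patterns reported below would show up.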

The main finding was that all three AI platforms could theoretically do the job, but no two of them do it the same way.

Claude scored highest overall and led on the two dimensions that mattered most for this task: factual accuracy and donor tailoring. It stayed anchored to the source material and mirrored donor language without drifting or hallucinating. 

Copilot came in close behind and stood out on specificity: it was more willing to name programs, numbers, and concrete recommendations.

ChatGPT was the clearest to read and the most polished on the surface, but also the most generic. It scored the lowest overall and required substantive revision on a third of its summaries because of thin donor framing or superficial content. 

Two other patterns emerged that are worth flagging. First, SPIA country-level studies scored lower than the evaluations and strategic reviews across all platforms. The methodological complexity and empirical nuances captured in SPIA studies appear to make them harder for LLMs to summarize without flattening. Second, donor briefs mattered more than we expected. Because the briefs relied on publicly available information about each donor, they varied in specificity, and summaries built on the most specific briefs scored consistently higher. This implies that the donor briefs themselves are a confound: good AI performance for a given donor tells us as much about the donor's documentation as it does about the AI platform.

Now here is what the pilot did not tell us. First, we did not account for prompt sensitivity, meaning we don't know how much of the ranking would hold under different phrasing, different ordering, or a follow-up round. Second, we picked studies we thought were thematically rich enough to test donor tailoring, which means our findings are partly a statement about the studies we chose. Third, we do not know how much the quality would vary if the inputs (donor briefs) were also AI-generated.

So, is it worth using AI platforms to generate contextualized donor summaries? The answer is a tentative yes, but strictly as a drafting tool and not a publishing one. All three platforms cleared the bar of producing usable draft material, with Claude and Copilot clearing it consistently. But whether it is ‘worth using’ depends on a question this pilot experiment did not answer: whether the review load is lighter than the load of writing the summary ourselves. If verifying every claim and data point, checking donor framing, and catching the subtle re-anchoring takes as long as writing a contextualized donor summary from scratch, the efficiency case collapses.

The takeaway from this pilot is that the ‘cost’ of AI is not simply the subscription charges. It also includes the review time and the expertise needed to catch misrepresentations, write accurate prompts, and make judgment calls about whether AI-generated drafts are contextually sound. Until we price that, ‘using AI’ will remain a vague assertion about surface-level adoption and not value.
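One way to make that pricing concrete is a break-even check, sketched here under stated assumptions. The function names, the parameters, and the idea of spreading the subscription over summaries as a per-summary share are all hypothetical; the pilot did not measure any of these quantities.

```python
def ai_cost_per_summary(prompt_minutes, review_minutes, hourly_rate,
                        subscription_share=0.0):
    """Cost of one AI-drafted summary: staff time plus a share of the subscription.

    `subscription_share` is the subscription cost attributed to one summary
    (an assumed accounting choice). Times are in minutes, rates per hour.
    """
    staff_cost = (prompt_minutes + review_minutes) / 60 * hourly_rate
    return staff_cost + subscription_share


def is_worth_drafting_with_ai(write_minutes, prompt_minutes, review_minutes,
                              hourly_rate, subscription_share=0.0):
    """True when the AI-assisted route is cheaper than writing from scratch."""
    manual_cost = write_minutes / 60 * hourly_rate
    ai_cost = ai_cost_per_summary(prompt_minutes, review_minutes,
                                  hourly_rate, subscription_share)
    return ai_cost < manual_cost
```

With made-up numbers, a summary that takes 120 minutes to write by hand but only 60 minutes to prompt and review passes the check. If prompting and review together take as long as writing, the subscription share alone tips the result against AI, which is the collapse of the efficiency case described above.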

Swetha Ramachandran is the Use of Evidence Senior Officer of the CGIAR Standing Panel on Impact Assessment (SPIA). She is an evaluation practitioner and applied researcher with a keen interest in evidence-based policymaking. Her work focuses on developing approaches to measure and monitor evidence uptake and policy influence.

Thomas Griffin is a former intern of the CGIAR Standing Panel on Impact Assessment (SPIA). He currently works as a legal analyst for GCM Grosvenor at their Chicago office.

Impact SPIA
May 18, 2026

Written by

  • Swetha Ramachandran

    Senior Officer, SPIA Use of Evidence
  • Thomas Griffin

    Former Intern, CGIAR Standing Panel on Impact Assessment (SPIA)
