Case Study

Evaluating a GenAI-powered assistive chatbot for caseworkers

Rigorously tested AI-powered tools can meaningfully improve benefit navigation accuracy, particularly for newer staff. Explore our detailed research report.

Report authors:

  • Michael Chen, PhD, Evaluation Lead, Nava PBC

  • Martelle Esposito, MS, MPH, Director of Partnerships and Advocacy, Nava PBC

  • Eric Giannella, PhD, Better Government Lab, Georgetown University

  • Zhaowen Guo, PhD, Better Government Lab, Georgetown University

  • Jennah Gosciak, MUP, Department of Information Science, Cornell Tech

  • Allison Koenecke, PhD, Department of Information Science, Cornell Tech

Summary

Project overview

Nava Labs developed and evaluated an artificial intelligence (AI)-powered chatbot designed to help caseworkers more effectively assist clients in finding and enrolling in public benefit programs. This first-of-its-kind study aims to narrow the gap of over $228 billion in annual unclaimed benefits by augmenting caseworker capabilities with generative AI (GenAI) technology.

Our solution

The assistive chatbot explains program rules in plain language, uses retrieval-augmented generation to pull information from pre-vetted sources, provides direct citations, supports multilingual translation, and prevents hallucinations through careful guardrails. We built the technology to integrate with existing workflows via an application programming interface (API) and tested the chatbot in real scenarios with nonprofits and government agencies.

Evaluation methodology

Using an implementation science framework, we sought to identify factors that affected uptake of the assistive chatbot in addition to measuring impact outcomes. Researchers conducted a mixed-methods evaluation that included:

  • A randomized controlled trial with 125 caseworkers that examined how being shown AI-generated responses affected the accuracy of answers to hypothetical client questions developed from real experiences.

  • A 14-week, real-world pilot of Benefit Navigator, a web-based tool by Amplifi that helps caseworkers navigate benefit and tax credit programs on behalf of benefit-seeking clients. The pilot included 61 caseworkers across six organizations in Los Angeles County and used a quasi-experimental design comparing intervention and comparison groups.

  • A mixed-methods analysis aggregating qualitative and quantitative findings across multiple data sources, including product logs, surveys, and in-depth interviews.

A middle-aged Black woman sits at a desk working on a computer. Her screen shows secure data information.

Key findings

Accuracy

The chatbot is estimated to improve caseworker accuracy by an average of 40%, with stronger improvements for more difficult client questions.

Acceptability

Throughout the pilot, about 65% of caseworkers with access to the chatbot used it, submitting an average of 14 prompts each. However, usage declined over time and varied by site, highlighting the value of sustained engagement strategies. The Net Promoter Score (NPS), a metric used to assess user experience, was 11, indicating moderate average satisfaction. While 40% of respondents were “promoters,” meaning they would recommend the chatbot to their colleagues, others expressed less enthusiasm, highlighting mixed adoption and opportunities for improvement.
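For readers unfamiliar with the metric, an NPS is computed from individual 0–10 survey responses: the percentage of promoters (scores of 9–10) minus the percentage of detractors (scores of 0–6), with passives (7–8) counting only toward the total. The sketch below is illustrative; the sample scores are hypothetical, not the pilot's actual survey data.

```python
def nps(scores):
    """Net Promoter Score: % promoters (9-10) minus % detractors (0-6).

    Passives (7-8) count toward the total but toward neither bucket.
    """
    promoters = sum(1 for s in scores if s >= 9)
    detractors = sum(1 for s in scores if s <= 6)
    return round(100 * (promoters - detractors) / len(scores))

# Hypothetical responses: 4 promoters, 2 passives, 3 detractors -> NPS of 11
print(nps([10, 10, 9, 9, 8, 7, 6, 5, 3]))
```

Note that an NPS can range from -100 (all detractors) to +100 (all promoters), so a score of 11 sits modestly on the positive side.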

Administrative burden

Results from the pilot showed promising but inconclusive evidence of reduced learning and psychological costs for caseworkers, though these findings lack statistical significance due to sample size limitations and a lower response rate on the endpoint survey.

Accessibility

Chatbot responses averaged a 10th to 12th grade reading level — higher than the recommended 8th grade standard but significantly more accessible than source policy manuals, which require college-level reading ability.
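Grade-level estimates like these are typically computed with readability formulas such as Flesch-Kincaid, which combines average sentence length with average syllables per word. The report does not specify which formula was used, so the sketch below is only an illustrative implementation; its syllable counter is a crude vowel-group heuristic, and production tools use more careful tokenization.

```python
import re

def syllables(word):
    """Rough syllable estimate: count runs of consecutive vowels."""
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def fk_grade(text):
    """Flesch-Kincaid grade level:
    0.39 * (words/sentences) + 11.8 * (syllables/words) - 15.59
    """
    sentences = max(1, len(re.findall(r"[.!?]+", text)))
    words = re.findall(r"[A-Za-z']+", text)
    syl = sum(syllables(w) for w in words)
    return 0.39 * len(words) / sentences + 11.8 * syl / len(words) - 15.59
```

Longer sentences and more polysyllabic words both push the grade level up, which is why dense policy-manual text scores at a college reading level while plain-language rewrites score lower.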

Implementation insights

“Super users” with high engagement and frequent chatbot usage emerged at sites with active participation in training, consistent organizational reinforcement, and peer support. Service delivery models significantly influenced use patterns, with rapid-intake sites showing higher activity than long-term case management programs.

Implications

This evaluation demonstrates that rigorously tested AI-powered tools can meaningfully improve benefit navigation accuracy, particularly for newer staff. However, successful implementation requires intentional support strategies, ongoing engagement, and attention to accessibility. The findings provide a roadmap for scaling AI-powered tools in public benefit contexts while highlighting critical areas for continued refinement.

Read the full report

Check out the full report to learn more about the Nava Labs AI research program, assistive chatbot, chatbot evaluation design, and outcomes.

Explore the report »

About the Nava Labs AI research program

Nava Labs is the philanthropically funded division within Nava Public Benefit Corporation focused on prototyping systems changes for government programs. We research and prototype products, practices, and policies within government programs and advocate for the adoption of what works. This interdisciplinary team leverages Nava’s deep technology delivery experience to identify critical junctures where philanthropy can help accelerate public interest projects and build more trustworthy public institutions.

A first-of-its-kind exploratory project, the Nava Labs AI research program has sought to answer if and how caseworkers can use generative AI powered by large language models (LLMs) to help more eligible people get enrolled in programs like Medicaid and the Supplemental Nutrition Assistance Program (SNAP). These efforts can help distribute the over $228 billion in benefits that go unclaimed annually.

Complex policies and processes can make navigating and enrolling in public benefits programs difficult. As a result, people often seek and receive guidance from caseworkers. However, caseworkers can also struggle to understand and interpret eligibility rules or help families complete lengthy applications. For nonprofits and government agencies, it can be costly to train and difficult to retain skilled caseworkers due to the challenges of the job. Over the years, we have seen many examples of long call center wait times, indicating that demand is likely outpacing supply.

A middle-aged CMS employee points to the screen of a tablet. An older South Asian woman and her young granddaughter follow along.

The Nava Labs approach to developing and testing GenAI solutions that support caseworkers is rooted in human-centered design, iterative agile development, and rigorous program evaluation. The Nava Labs team conducted user research to understand caseworker needs and where GenAI might be an appropriate fit to address them, such as reducing administrative burdens, freeing up more quality time to meet with clients, and lowering the training barrier to helping clients. The team also developed proofs of concept for five GenAI-powered tools and began piloting some of them in real-world settings. The following list outlines each tool and its development phase as of December 2025; this report describes the evaluation methods and results for the Assistive Chatbot pilot.

  • Assistive Chatbot: Retrieves program rules and provides plain-language explanations [Pilot complete]

  • Referral Generator: Suggests local resources and government programs with action plans [Piloting]

  • Form-Filling Assistant: Pulls data from a variety of sources to autocomplete benefits applications [Piloting]

  • Document Analyzer: Verifies that documents meet requirements [Piloting]

  • Call Notes Summarizer: Minimizes note-taking burden and outlines next steps for the client [Proof of concept complete]

About the assistive chatbot

Chatbot functionality

The chatbot aims to make it easier for caseworkers to find credible answers to questions about health and human services program eligibility and enrollment to discuss with their clients in real time. Nava Labs sought to reduce cognitive load for caseworkers, speed up responses, and build their confidence.

Three screenshots of our chatbot showing its capabilities.

The chatbot solution:

  • Leverages a foundation large language model (LLM).

  • Provides plain-language descriptions about program rules.

  • Uses retrieval-augmented generation (RAG) to pull information from pre-vetted sources only.

  • Provides direct source citations from the pre-vetted sources with links to the original source if further exploration is needed.

  • Provides multilingual translation support.

  • Prevents hallucinations with clear guardrails around the chatbot’s scope of knowledge; if a question is out of scope, the chatbot responds “I don’t know the answer” or provides a link to the topic rather than making up a response.
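The retrieve-cite-or-decline pattern in the list above can be sketched as follows. Everything here is a hypothetical stand-in: the source passages, citations, keyword-overlap scoring, and relevance threshold are illustrative only, and a production RAG system would use an embedding index and an LLM to compose answers rather than returning passages verbatim.

```python
# Pre-vetted sources: (citation, text) pairs standing in for policy excerpts.
SOURCES = [
    ("SNAP Policy Manual §2.1",
     "SNAP gross income limits are based on household size and the federal poverty level."),
    ("Medicaid Eligibility Guide §4.3",
     "Medicaid eligibility depends on income, household size, and state expansion status."),
]

def _overlap(question, text):
    """Crude relevance score: count of shared lowercase words (a stand-in
    for embedding similarity in a real retrieval index)."""
    return len(set(question.lower().split()) & set(text.lower().split()))

def answer(question, min_score=2):
    """Return the best-matching source passage with a citation, or decline."""
    citation, text = max(SOURCES, key=lambda s: _overlap(question, s[1]))
    if _overlap(question, text) < min_score:
        # Guardrail: out-of-scope questions get a refusal, not a made-up answer.
        return "I don't know the answer to that."
    return f"{text} [Source: {citation}]"
```

The key property this sketch shows is that every answer either carries a citation back to a pre-vetted source or is an explicit refusal; the model never answers from outside the curated corpus.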

Preparing the chatbot for real-world implementation

We intentionally built the chatbot technology to adapt to different settings and integrate with a range of different workflows through an application programming interface (API).

The team completed several rounds of prototyping and testing to prepare the assistive chatbot for the pilot with nonprofits and government agencies. Rigorous testing ensured the assistive chatbot was ready to use in real-world settings and to collect data on usage and impact. The team iterated on the solution until technical evaluations signaled high chatbot response accuracy and user testing showed promise for addressing caseworker needs. The team also confirmed they met all security, privacy, and infrastructure requirements. They then prepared the assistive chatbot for integration into the workflows of our pilot partner, Amplifi, who embedded the chatbot in their Benefit Navigator tool and ensured a seamless user interface across the tools.

Illustration of a young Black man with a prosthetic left arm. He sits at a table and talks to a group of three people. There are charts and notes on the table and around the room.

Chatbot evaluation design

Read the full report to learn more about the evaluation questions and methods.

Outcomes

1. Measurement and instrument development

  • Outcome 1.1: We effectively adopted validated measures and instrumentation to study the effects of AI-powered tools in a public benefits context.

2. Accuracy

  • Outcome 2.1: Use of the assistive chatbot is estimated to improve caseworkers’ accuracy by an average of 40%.

  • Outcome 2.2: The chatbot’s accuracy improvements were largest for more difficult client questions.

3. Appropriateness

  • Outcome 3.1: The assistive chatbot offered caseworkers reliable and timely information about a wide range of public benefit programs.

4. Acceptability

  • Outcome 4.1: 65% of caseworkers with chatbot access reported real-world use.

  • Outcome 4.2: Several “super users” of the assistive chatbot emerged during the pilot.

  • Outcome 4.3: The Net Promoter Score suggests a fairly successful pilot with opportunities for improvement.

5. Administrative burden

  • Outcome 5.1: There is promising evidence that the assistive chatbot may reduce learning and psychological costs, but results were inconclusive due to small sample size and wide variation in experiences.

6. Accessibility

  • Outcome 6.1: Chatbot responses averaged a 10th to 12th grade reading level.

  • Outcome 6.2: Chatbot responses are more accessible than source policy manual text.

  • Outcome 6.3: Multilingual capabilities are enhanced by caseworkers’ language and cultural translation.

Read the full report for details on the outcomes.

Acknowledgements

The report authors would like to acknowledge members of the Nava Labs team: Alicia Benish, Charlie Cheng, Diana Griffin, Foad Green, Genevieve Gaudet, Kanaiza Imbuye, Kasmin Scott, Kevin Boyer, Ryan Hansz, and Yoom Lam for contributing to the development and implementation of the pilot and evaluation plan; Bob Wilkinson, Chloe Hilles, Greg Jordan-Detamore, and Noam Leead for design and communications support; members of the Amplifi team: Jill Bauman, Brit Gilmore, Karen Van Kirk, and Ryan Fendy, for partnering on the real-world pilot implementation and recruiting caseworkers for the advisory council and pilot participation; and the 12 members of the caseworker advisory council and the over 60 caseworkers who participated in the pilot from various human services agencies and nonprofit organizations in Los Angeles and who provided invaluable insights about the experience using the chatbot.

Partner with us

Let’s talk about what we can build together.

Get in touch