Data engineering is a mature and widely used engineering discipline in the private sector—by making data more collectible, accessible, and usable, organizations can better unlock its potential. Nava brings data engineering approaches and experience into the government space, where better data practices have the potential to greatly impact program outcomes. For the last several years, we have collaborated with multiple vendor and government partners at the federal level to deliver data engineering and modernization efforts at scale.
By prioritizing cloud-based data infrastructure and interoperability via Application Programming Interfaces (APIs) and data pipelines, Nava has helped make data more accessible and useful. From unlocking Medicare claims data that can help providers better care for their patients to establishing data pipelines for COVID-19 tests across the nation, we’ve seen how data engineering has the potential to serve as a powerful tool in the government services space.
In this case study, we’ll explore how building interoperable data pipelines and APIs is key to better data practices, exemplified by our work across multiple federal agencies. You can read more about our work making data more accessible through cloud-based infrastructure in a separate case study.
Data pipelines and interoperability are key tools for government systems looking to ingest, move, and transform data from multiple sources. Our experience working on the Centers for Disease Control and Prevention (CDC) ReportStream project and on several data projects for the Centers for Medicare and Medicaid Services (CMS) demonstrates the value of data engineering practices in government technology.
Nava is helping CDC as they design, build, and operate ReportStream, a data pipeline that aggregates and delivers reportable test results to health departments across the nation. Initially launched in 2020, Pandemic-Ready Interoperability Modernization Effort (PRIME) ReportStream was built by CDC and the U.S. Digital Service to gather COVID-19 data from states and counties. The initiative was born out of the need to unify national health data sharing and analysis so that public health institutions could more quickly and accurately react to the pandemic. It has since expanded to include support for monkeypox data and continues to make it easier for cities, counties, states, and tribal health authorities across the country to share public health data.
For CMS, Nava is supporting the agency as part of their eMedicare Initiative to modernize and improve the way Medicare beneficiaries access care and coverage information. The eMedicare initiative requires integrating CMS’s multi-channel tools—including Medicare.gov, 1-800-Medicare Contact Centers, and outbound outreach tools—with existing Medicare systems. The Beneficiary Experience Data Analytics Platform (BEDAP) project is the data integration layer between different sources that hold important beneficiary data.
Meanwhile, the Quality Payment Program (QPP) is a CMS initiative to improve quality of care by ranking physicians based on aggregate scores. These scores help Medicare patients and caregivers search and compare doctors, clinicians, and groups who are enrolled in Medicare. Nava created and implemented QPP’s Submissions API, which aggregates physician performance data collected across the country.
Building and scaling flexible data pipelines is a key data engineering practice for government services. For CDC, we built an interoperable, cloud-based pipeline that facilitates the streamlined flow of high-quality data to public health departments, enabling them to receive, process, and share data across and between jurisdictions. This pipeline standardizes and transforms data, then routes it from individual hospitals, laboratories, facilities, hospital/laboratory networks, and non-traditional testing sites to the appropriate state, tribal, local, and territorial (STLT) public health departments and/or federal government entities.
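The standardize-then-route flow described above can be sketched in a few lines. This is an illustrative simplification, not ReportStream's actual code: the field names, receiver shapes, and filter logic here are assumptions for the sake of example.

```python
# Hypothetical sketch of a standardize-then-route pipeline step: normalize a
# raw test result into a common internal shape, then fan it out to receivers
# whose jurisdiction and filters match. All names and fields are illustrative.

def standardize(record):
    """Normalize a raw test result into a common internal shape."""
    return {
        "patient_state": record["state"].strip().upper(),
        "test_result": record["result"].strip().lower(),
        "performed_at": record["date"],
    }

def route(record, receivers):
    """Return the receivers that should get this standardized record."""
    return [
        r for r in receivers
        if r["jurisdiction"] == record["patient_state"]
        and record["test_result"] in r["accepted_results"]
    ]

receivers = [
    {"name": "NJ-DOH", "jurisdiction": "NJ", "accepted_results": {"positive", "negative"}},
    {"name": "CA-CDPH", "jurisdiction": "CA", "accepted_results": {"positive"}},
]

record = standardize({"state": " nj ", "result": "Positive", "date": "2022-06-01"})
matched = route(record, receivers)  # matches NJ-DOH only
```

The key property this illustrates is that senders never need to know who receives their data: the pipeline owns both the normalization and the routing rules.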
APIs are another useful data engineering tool, acting as an integration layer that enables different systems to “talk” and transfer data to each other. For CMS, we led the API, infrastructure, and data intake pipeline development for the BEDAP project, helping to integrate data sources such as the National Data Warehouse, Medicare Beneficiary Database, Drug Data Processing System, and others. Our work helped produce the “source of truth,” or single location for aggregated Medicare beneficiary data. Our API offers scalable identity-proofing—a tool for authenticating a beneficiary’s identity—in support of the millions of people who go through the MyMedicare.gov account registration process.
Nava also created and implemented the Submissions API for CMS’s QPP project, the central service that accepted performance data from various entities. Alongside CMS, Nava provided final scores to physicians as well as to administrative websites for some group practices, electronic health record vendors, and registries.
CDC ReportStream has been in production for two years and currently handles COVID-19 and monkeypox data. ReportStream has processed around 40 million reports since it was launched. As of this writing, it has about 50 senders and 70 receivers and covers around 45 states.
Our work with CMS on the BEDAP project enables the agency to manage and utilize terabytes of data, helping them execute targeted outreach campaigns. These campaigns reach several key groups of beneficiaries: they guide those living in areas affected by natural disasters on how to continue using their Medicare coverage during an emergency, help beneficiaries who have recently changed their mailing address access coverage in their new area, and more. Our identity-proofing API for the eMedicare Initiative has enabled Medicare systems to scale up with increased usage of up to 30,000 concurrent users and 3,000-4,000 registrations per day.
Finally, for the QPP project, our submissions API was the central hub for data collection and distribution within the project. To ensure data integrity, it combined granular authorization roles—defining what different entities could see and do—with strict validation before data could be posted into the database.
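The combination of role-based authorization and strict validation can be sketched as below. This is not QPP's actual code; the role names, permission sets, and validation rules are hypothetical stand-ins for the pattern described above.

```python
# Illustrative sketch: a submission is persisted only if the caller's role
# permits the action AND the payload passes validation. Roles, permissions,
# and field rules here are assumptions, not QPP's real ones.

ROLE_PERMISSIONS = {
    "physician": {"submit_own"},
    "registry": {"submit_own", "submit_on_behalf"},
}

def authorize(role, action):
    return action in ROLE_PERMISSIONS.get(role, set())

def validate(submission):
    """Reject submissions missing required fields or with out-of-range scores."""
    errors = []
    if not submission.get("npi"):
        errors.append("missing npi")
    score = submission.get("score")
    if not isinstance(score, (int, float)) or not 0 <= score <= 100:
        errors.append("score out of range")
    return errors

def accept(role, submission, db):
    if not authorize(role, "submit_own"):
        return "forbidden"
    errors = validate(submission)
    if errors:
        return "rejected: " + "; ".join(errors)
    db.append(submission)
    return "accepted"

db = []
ok = accept("physician", {"npi": "1234567890", "score": 85}, db)   # "accepted"
denied = accept("unknown", {"npi": "1234567890", "score": 85}, db)  # "forbidden"
```

Checking authorization before validation means unauthorized callers learn nothing about which payloads would have been valid, which is a common hardening choice for shared intake APIs.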
Building a data pipeline for ReportStream
ReportStream’s data pipeline represents a “foundational service,” which is a core concept of CDC’s North Star Architecture. Built on the Microsoft Azure platform, using Azure Functions, Storage Queues, and Blob Storage to store, process, and route data, ReportStream acts as a hub that connects senders to the public health ecosystem and can respond to common problems. Public health department staff can utilize a user-friendly web portal to access data and monitor their data usage. A robust monitoring system ensures the application’s performance.
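The queue-and-blob pattern behind that architecture can be sketched without any cloud dependencies. In the real system, Azure Functions are triggered by Storage Queue messages that reference report data in Blob Storage; the in-memory stand-ins below are assumptions for illustration only.

```python
# Simplified, dependency-free sketch of the queue-and-blob pattern: each
# queue message names a blob of report data, and a worker "invocation"
# dequeues a message, fetches its blob, and processes it. The dict and
# deque stand in for Blob Storage and a Storage Queue.

from collections import deque

blob_store = {"reports/batch-1.csv": "P001,positive\n"}  # blob name -> contents
queue = deque(["reports/batch-1.csv"])                    # messages name blobs

def process_next(queue, blob_store, processed):
    """One worker invocation: pop a message, load its blob, record the work."""
    if not queue:
        return False
    blob_name = queue.popleft()
    processed.append((blob_name, blob_store[blob_name]))
    return True

processed = []
while process_next(queue, blob_store, processed):
    pass
```

Decoupling the message (small, cheap to queue) from the payload (stored once in blob storage) is what lets this kind of pipeline absorb bursts of incoming reports without losing data.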
ReportStream transforms, routes, cleans, and monitors data. The data pipeline began production using the CSV file format as its internal language for reading and transforming data. In early 2022, the team began diversifying ReportStream’s capabilities and selected the Fast Healthcare Interoperability Resources (FHIR) standard for the next system iteration, upgrading its internal format language. The team’s continuous quality improvement goal is defined as 100 percent send-and-receive completion once receiver filters are applied. To meet this goal, the approach is anchored in aggressive bug resolution backed by automated testing; automated failure recovery, with a streamlined process for manual fixes; and ongoing reporting on transmission metrics.
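The kind of format upgrade described above amounts to mapping each CSV row onto a FHIR resource. The sketch below translates a row into a minimal FHIR R4 Observation-shaped dictionary; the column names and the subset of Observation fields shown are simplified assumptions, not ReportStream's actual mapping.

```python
# Hedged sketch: translate CSV test-result rows into minimal FHIR R4
# Observation-shaped dicts. Column names and field selection are illustrative.

import csv
import io

CSV_DATA = "patient_id,test_code,result,observed\nP001,94558-4,positive,2022-03-01\n"

def row_to_observation(row):
    """Map one CSV row onto a minimal Observation resource."""
    return {
        "resourceType": "Observation",
        "status": "final",
        "code": {"coding": [{"system": "http://loinc.org", "code": row["test_code"]}]},
        "subject": {"reference": f"Patient/{row['patient_id']}"},
        "effectiveDateTime": row["observed"],
        "valueString": row["result"],
    }

observations = [row_to_observation(r) for r in csv.DictReader(io.StringIO(CSV_DATA))]
```

Moving the internal language from CSV to FHIR means the pipeline can carry richer, self-describing structure (coded test types, patient references) instead of relying on every sender and receiver agreeing on column positions.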
For CMS’s BEDAP project, we worked with prime contractor Flexion to build a data pipeline that securely extracts data from Medicare systems through various integration mechanisms used by each system (e.g., SFTP). The pipeline then loads that data into APIs and data warehouses. This flexible approach to data interoperability means that we can successfully integrate with other organizations, even if they don't use the same technology or don't talk about their data in the same way. So far, we’ve integrated with several data sources, including Beneficiary Information in the Cloud, National Data Warehouse, GovDelivery, and Health Plan Management System, with plans to integrate more in the future.
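A per-source extractor pattern is one way to express the flexible integration approach just described: each upstream system gets its own extractor, and the load step is shared. The extractor names and record shapes below are illustrative assumptions, not BEDAP's actual interfaces.

```python
# Minimal sketch of pluggable extraction: one extractor per upstream system
# (an SFTP file drop, a JSON API feed), all yielding a common record shape
# that a shared load step writes into the warehouse. Names are hypothetical.

def extract_sftp_drop(files):
    """Stand-in for pulling pipe-delimited files from an SFTP drop."""
    for line in files:
        bene_id, zip_code = line.strip().split("|")
        yield {"beneficiary_id": bene_id, "zip": zip_code, "source": "sftp"}

def extract_api_feed(payloads):
    """Stand-in for paging through a JSON API feed."""
    for p in payloads:
        yield {"beneficiary_id": p["id"], "zip": p["zip"], "source": "api"}

def load(records, warehouse):
    """Shared load step: upsert each record by beneficiary ID."""
    for rec in records:
        warehouse[rec["beneficiary_id"]] = rec

warehouse = {}
load(extract_sftp_drop(["B001|07030"]), warehouse)
load(extract_api_feed([{"id": "B002", "zip": "94103"}]), warehouse)
```

Because only the extractors know about each source's transport and format, adding a new upstream system means writing one new extractor rather than reworking the pipeline.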
Our work includes building several different APIs that provide different data functions. The Identity Verifications API uses beneficiary data to handle remote identity proofing for the MyMedicare.gov account registration process, avoiding reliance on the sensitive credit history information used by identity proofing providers. The Beneficiary Profiles API provides personalization information, such as prescription drug usage, which is used by Medicare Plan Finder for choosing a plan, by MyMedicare.gov for reviewing claims, and by Compare tools for choosing a provider. We also supported a data warehouse used by data analysts and CMS staff for performing data analysis to uncover program-level insights.
Finally, we helped build an outreach tool that allows the Office of Communications at CMS to create personalized outreach campaigns for beneficiaries. It helps CMS, for example, reach people who are eligible for both Medicaid and Medicare on behalf of the Dual Eligible Office, or help beneficiaries understand their choice of providers and plans.
Nava used a centralized database in our work with QPP because the size of the data—submissions from several million physicians per year—was relatively small, meaning it didn't require the overhead or maintenance of a larger system. Additionally, we mirrored database schema migrations to submission schema adjustments, which preserved backward compatibility: data submitted in the old format remained valid alongside the new. This allowed physicians to continue submitting data in the legacy format while upgrading their systems to support the newer format.
QPP’s final scoring was a “Goldilocks-style problem,” solved over successive iterations. The initial implementation was done on a short timeline using only command-line scripting and the database to capture final scores; the resulting system was difficult to understand and lacked transparency into how scores were computed. So, Nava’s team prototyped a successor system that used Luigi to assemble and recompute scores as necessary. The subsequent system implemented by our teams, alongside the prime contractor, employed Spark for computations with Apache NiFi to specify data pipelines for loading data to compute scores and save them to the database.
Our work with CDC and CMS demonstrates the value of data engineering practices in government technology. Whether building and scaling flexible data pipelines or APIs that enable different systems to transfer data to each other, data interoperability has the potential to connect and unlock public health data across the nation.
Director of Engineering, Growth and Strategy