Choosing the right format for machine-readable data

Government data is leaving SAS behind, and the tools agencies choose to replace it with matter for modern analytics and AI workflows.

For decades, SAS (originally short for Statistical Analysis System) was the default analytics platform across government. Now, that's changing. Licensing costs are significant, open source tools like Python and R now match or exceed SAS for most analytical workflows, and the workforce is increasingly trained on those tools. With support for older SAS releases winding down, organizations face a concrete deadline to evaluate alternatives.

The replacement format agencies choose will dictate more than how government stores data. It will shape who can use the data, what tools they’ll need, how fast they can work, and whether the data fits cleanly into modern analytics and AI workflows.

Nava is currently working with the Centers for Medicare & Medicaid Services (CMS) to transition one key file type from SAS to Apache Parquet, an open source data format. These files, called Provider Specific Files (PSFs), inform how much Medicare pays hospitals, so there's no room for error. Our experience taught us how to evaluate alternatives to SAS and how to implement them in a way that meets analysts' needs and aligns with their workflows.

We also gathered valuable data suggesting that Parquet is a viable replacement for SAS. Parquet is designed for fast loading and analysis, and in our benchmarks it delivered reads 58 times faster than CSV.

Maintaining transparent communication 

The first thing we did was meet with analysts at CMS to understand their workflows and needs. Analysts shared that they need substantial historical data to confirm that their new, non-SAS workflows would produce the same results as their legacy ones. This is especially true for PSFs, which contain records stretching back to the 1980s and whose provider information is constantly updated.

As we generated Parquet-formatted files for new periods, we also made previous files available. We worked with CMS to communicate a clear deprecation timeline, which gives staff advance notice of when CMS will publish the last SAS-formatted files and information on where to find the new formats.

Meeting our stakeholders' needs with historical data and communicating transparently made for a smooth, low-risk transition to Parquet.

Preserving data characteristics

Format migrations can fail if the new format does not preserve the data characteristics that existing workflows depend on. For example, CMS analysts shared that data column structures need to map cleanly to their Python- and R-based workflows.

The early Parquet files we produced were structurally sound, but contained issues with data characteristics — for example, leading zeroes on provider identifiers were dropped and there were subtle differences in how spaces and dates were handled. To address this, one of our software engineers partnered with an analyst at CMS. Our engineer focused on generating correct outputs while the analyst compared results against historical analysis. Together, they conducted tests until the Parquet output aligned with the SAS output. 
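That kind of side-by-side comparison can be automated. Below is a minimal sketch of a parity check in pandas, assuming both the legacy SAS export and the new Parquet output can be loaded as dataframes; the column names and values are illustrative, not the actual PSF layout. Identifier columns are kept as strings so that leading zeroes survive, and `assert_frame_equal` flags dtype, whitespace, and date drift rather than letting it slip through silently.

```python
import pandas as pd
from pandas.testing import assert_frame_equal

# Illustrative records: the same rows as exported from the legacy SAS
# workflow and as read back from the new Parquet file.
legacy = pd.DataFrame({
    "provider_id": ["010001", "050002"],  # leading zeroes intact as strings
    "rate_effective_date": pd.to_datetime(["1984-10-01", "1991-01-01"]),
})
from_parquet = pd.DataFrame({
    "provider_id": ["010001", "050002"],
    "rate_effective_date": pd.to_datetime(["1984-10-01", "1991-01-01"]),
})

# Fails loudly on any dtype, value, or whitespace difference between the two.
assert_frame_equal(legacy, from_parquet, check_dtype=True)
```

In practice a check like this runs over every column of every historical period, so an analyst can point to a concrete, reproducible test rather than a spot check.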

This kind of feedback loop is what separates a format change on paper from one that works in practice. It also builds trust, which is critical when you're asking analysts to abandon processes they've relied on for years.

Preparing for AI-enabled infrastructure

Choosing the right data format to replace SAS helps ensure that future data will be machine-readable without expensive human curation. As government shifts toward AI-enabled infrastructure, machine-readability is paramount. 

CMS selected Parquet because the most widely used open source analytics tools read it natively, making the data immediately available to the ecosystem analysts are moving toward. That machine-readability is critical as government open data increasingly becomes AI-ready infrastructure.

What makes Parquet particularly useful is that each file carries its own data dictionary: column names, column types, and how the data is organized. Anyone using the file, whether a human analyst or an AI tool, can understand what they're looking at without needing a separate guide.

The same is not true for a CSV file, where tools often interpret text fields as numbers, causing leading zeroes to disappear and identification numbers to be confused with measurements. A computer reading a CSV has no way to tell the difference without a human to interpret the file's structure.
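The leading-zero problem is easy to reproduce. In this sketch (with a made-up two-row CSV), pandas' default type inference reads the identifier column as an integer and drops the zero; only a human who knows what the column means can supply the correct type.

```python
import io

import pandas as pd

# A made-up CSV snippet with a zero-padded identifier column.
csv_text = "provider_id,beds\n010001,250\n050002,85\n"

# Default inference treats provider_id as an integer: the leading
# zero silently disappears.
naive = pd.read_csv(io.StringIO(csv_text))
assert naive["provider_id"].iloc[0] == 10001

# The fix requires out-of-band knowledge that the file itself
# cannot carry: a human must declare the column as text.
explicit = pd.read_csv(io.StringIO(csv_text), dtype={"provider_id": str})
assert explicit["provider_id"].iloc[0] == "010001"
```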

Benchmarking put a number on the difference: Parquet delivered reads 58 times faster than CSV. 

Conclusion

When transitioning data formats, it's important to bring analysts and engineers together early so they can understand each other's needs and workflows. Analyst requirements aren't nice-to-haves; they're the difference between a format that's technically valid and one that's actually usable. That collaboration also builds trust between the teams who steward the data and those who rely on it.

Choosing the right format to replace SAS isn't just an immediate investment. It can contribute to positive outcomes years or perhaps even decades down the road. Every dataset published in a machine-readable format is one less dataset that will require human curation. As government adopts the next generation of AI-enabled analytical tools, that compounding investment in the value of public data will matter far more than the format migration that started it.

Written by


Jose Oyola-Sepulveda

Sr. Product Manager

Jose Oyola-Sepulveda is a product leader in federal health data, compliance, and government services. He currently builds data infrastructure behind Medicare claims processing at CMS. Previously, he worked for NIH, Google, and MedStar Health.

Kevin Alarcon

Software Engineer

Kevin Alarcon is a software engineer at Nava and U.S. Marine Corps veteran. Before joining Nava, he gained years of experience as a software engineer in the private sector.
