How to prepare for the unexpected with strong, scalable website architecture

When the Biden Administration announced its federal student loan forgiveness program—an initiative that cancels up to $20,000 of student loan debt for millions of people—borrowers flocked to the StudentAid.gov website to check their eligibility. But just hours after the August announcement, many people encountered issues logging into the site or couldn’t access it at all.

Some were sent to a virtual waiting room that explained the site was experiencing high volumes of visitors. Others who made it through the waiting room were met with another message: “A lot of people are interested in our website. As a result, some pages may take longer to display than usual.”

While outages like this can frustrate users, it’s important to remember that there are civil servants who work tirelessly to keep public services running. Outages are just as frustrating for the operations staff who want to help. At Nava, we know this to be the case from our experience rebuilding another site impacted by a spike in traffic: HealthCare.gov.

In 2013, the Affordable Care Act website launched new functionality that allowed users to apply for Medicaid or CHIP, or to purchase individual private health insurance, generally at reduced rates. HealthCare.gov experienced a similar user overload, and on its first day, only six people were able to sign up for health insurance.

Members of the team that helped re-launch HealthCare.gov after its rocky start went on to found Nava. One of their key contributions was to help prepare the Department of Health and Human Services (HHS), the agency that oversees HealthCare.gov, to adapt to abrupt changes like policy shifts or a spike in site traffic. This process revealed the importance of pre-planning to develop a strong, scalable site architecture that meets users’ needs and can withstand unexpected change.

Of course, building simple, effective, and accessible government services requires more than being able to handle spikes in traffic. As the Department of Education and other government agencies look forward, they have an opportunity to keep pace and even lead the way as policy and technology change. By building digital services in a modular, human-centered way, they can create equitable government services.

For this post, we’ll focus on a core component of building any successful digital service: secure, reliable infrastructure that allows government agencies to adapt to shocks to the system, such as a historic student loan cancellation announcement. Conducting user load tests and gathering key metrics via pilots before a site’s launch can help prevent crashes like those of HealthCare.gov, and these practices offer insight into how the StudentAid.gov crash might have been prevented.

Testing a site’s limits helps prepare for a successful launch

When launching a new website or feature, it’s crucial to ask “How many people will this serve?” Whether the answer is 10 or a billion, estimating your site’s user load is the first step in conducting user-centered research that will help your site launch smoothly.

Once you’ve landed on a number, test whether your site can easily handle the user load you foresee. This is how we approached testing our Scalable Login System (SLS), which provides authentication and account management for millions of people on HealthCare.gov. SLS is a RESTful API service built on Amazon Web Services. As an added experiment, we wanted to see how far the system could scale by running a load test of 1 billion users, 50 times the 20 million accounts that SLS currently handles.

Over the course of an hour, the system sustained a throughput of 7,754 transactions per second with a 90th-percentile response time of 128 ms and zero errors. In other words, our test confirmed that currently available open source software and a solid cloud infrastructure can succeed under intense user loads.
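To make those three numbers concrete, here is a minimal sketch of how throughput, 90th-percentile response time, and error rate could be computed from the raw results of a load test. It is not the tooling we used for SLS; the function names, the input format, and the toy data are all assumptions made for illustration, and the percentile uses the simple nearest-rank method.

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile: the smallest sample with at least pct% of samples at or below it."""
    if not samples:
        raise ValueError("need at least one sample")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def summarize(latencies_ms, error_count, duration_s):
    """Summarize one load-test run.

    latencies_ms: response times of successful requests (hypothetical input format)
    error_count:  number of failed requests
    duration_s:   wall-clock length of the run in seconds
    """
    total = len(latencies_ms) + error_count
    return {
        "throughput_tps": total / duration_s,
        "p90_ms": percentile(latencies_ms, 90),
        "error_rate": error_count / total,
    }


# Toy data, not the SLS results: 100 requests over 10 seconds, no errors.
report = summarize(list(range(1, 101)), error_count=0, duration_s=10)
print(report)

# Arithmetic from the figures above: 7,754 transactions per second for one hour.
print(7754 * 3600)  # 27914400 transactions
```

Reporting a high percentile rather than an average matters because a small share of very slow responses can hide behind a healthy-looking mean.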

Rolling out pilots and collecting metrics are the keys to success

It’s not enough to passively conduct a user load test—while performing your test, it’s important to gather goal-oriented metrics that reveal whether your service will function for the people who use it.

Last year, Nava partnered with California’s Employment Development Department (EDD) to rebuild a web application for people to confirm their status for unemployment benefits during the pandemic. The state’s goal was to confirm claimants’ eligibility and pay out unemployment benefits as quickly as possible. With this goal in mind, we tracked login and completion rates to measure the web application’s efficacy. We found that 93 percent of people who logged in were able to complete the multi-page unemployment certification, confirming that the application worked for end-users.
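As a rough illustration of a completion-rate metric like the one above, the sketch below counts how many sessions that logged in also reached the final step of a form. The session format and the step names ("login", "submit") are hypothetical and are not taken from the EDD application.

```python
def completion_rate(sessions, final_step):
    """Share of logged-in sessions that reached final_step.

    sessions: iterable of sets, each holding the step names one session reached
              (a hypothetical format chosen for this sketch)
    Returns None when no session logged in, so callers can tell "no data" from 0%.
    """
    logged_in = [steps for steps in sessions if "login" in steps]
    if not logged_in:
        return None
    completed = sum(1 for steps in logged_in if final_step in steps)
    return completed / len(logged_in)


# Toy data: three sessions logged in, two of them submitted the form.
sessions = [
    {"login", "page_1", "submit"},
    {"login", "page_1"},
    {"page_1"},  # never logged in, so it is excluded from the denominator
    {"login", "submit"},
]
rate = completion_rate(sessions, final_step="submit")
print(f"{rate:.0%}")  # 67%
```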

But sometimes it isn’t so easy. Sometimes your program runs into bugs or unexpected snags. In order to prevent catastrophe in the event of a hiccup, it’s important to roll your program out in small bites, or pilots.

In California, we rolled out a soft launch of our retroactive certification form. On the first day, we emailed a link to 10,000 people, less than 1 percent of total claimants. On the second day, we sent the link to 100,000 people. Every day we monitored Google Analytics to determine whether the form was performing as expected. Within a few hours on the first day, our metrics revealed that some users were not able to log in. Our team rapidly diagnosed and fixed the issue, all before the form was officially launched.
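A staged rollout like this one can be sketched in a few lines. The example below splits a recipient list into growing batches and flags when the login success rate in a batch drops below a threshold. The function names, the 95 percent threshold, and the choice to treat everyone left over as one final batch are assumptions for illustration, not a description of the system we built.

```python
from itertools import islice


def rollout_batches(recipients, batch_sizes):
    """Yield recipients in pilot batches of the given sizes, then everyone left over.

    Treating the remainder as a single final batch is an assumption of this
    sketch; it stands in for the full launch.
    """
    remaining = iter(recipients)
    for size in batch_sizes:
        batch = list(islice(remaining, size))
        if not batch:
            return
        yield batch
    leftover = list(remaining)
    if leftover:
        yield leftover


def should_pause(login_attempts, login_successes, min_rate=0.95):
    """Return True when the observed login success rate is below min_rate.

    min_rate is a hypothetical threshold; with no attempts yet there is
    nothing to judge, so the rollout continues.
    """
    if login_attempts == 0:
        return False
    return login_successes / login_attempts < min_rate


# Toy recipient list with the pilot sizes from the text: 10,000, then 100,000.
sizes = [len(batch) for batch in rollout_batches(range(250_000), [10_000, 100_000])]
print(sizes)  # [10000, 100000, 140000]
print(should_pause(login_attempts=100, login_successes=80))  # True
```

Checking the metric between batches is the point of the design: a problem found in the first batch affects a small group, and the next batch goes out only after a fix.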

These types of precautions don’t require large teams or billions of dollars—in fact, catching errors before a government program launches can save taxpayers money. The old HealthCare.gov login system, for example, cost $250 million to launch and would have cost another $70 million to stay online. The new SLS cost $4 million to launch and costs less than $1 million per year to stay online.

In the case of government programs that serve millions of people, the importance of planning ahead—and iterating along the way—cannot be overstated. Rolling programs out in small bites, collecting essential metrics, and testing a site’s user load are small steps that can help prevent big issues like the StudentAid.gov crash. Most importantly, developing secure, reliable, and scalable infrastructure helps agencies prepare for unexpected events or changes. Taking these steps can help get essential services—like student loan forgiveness—to those who need them most, and it can build trust in our public institutions.

Special thanks to Zoe Blumenfeld, Sha Hwang, Cyrus Sethna, and Karen Turner for their contributions to this article.

Written by

Kira Leadholm

Senior Editorial Manager

Kira Leadholm is the editorial manager at Nava. Before working at Nava, she held various editorial roles and worked as a reporter at outlets including the Better Government Association, SF Weekly, and the Chicago Reader.

Brendan Neutra

Senior Infrastructure Engineer

Brendan Neutra is a senior infrastructure engineer lead at Nava. Brendan worked on infrastructure for Google before being recruited to help stabilize the HealthCare.gov site in 2014.

Jacqueline Siotto

Project Manager

Jacqueline Siotto is a project manager at Nava.

Haiyan Sui

Director of Engineering, Growth and Strategy

Haiyan Sui is the Director of Engineering, Growth and Strategy at Nava. Previously, Haiyan led the digital product team at the New York City Mayor’s Office for Economic Opportunity.
