
The billion user load test

Client: Centers for Medicare & Medicaid Services

By Brendan Neutra and Aimee Barciauskas

Nava’s Scalable Login System (SLS) provides authentication and account management for users on HealthCare.gov. It is a RESTful API service built on Amazon Web Services. The Nava team wanted to see just how far it could scale by running a load test with a billion users, 50 times the 20 million accounts that SLS currently handles.

Summary

SLS served 7,754 transactions per second for an hour. The 90th-percentile response time was 128 ms, and there were zero errors. The test fleet had roughly nine times the CPU capacity of current production: 70 4-core machines (280 cores) versus 15 2-core machines (30 cores). Application server CPU utilization stayed at a comfortable 50 percent.
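The capacity comparison can be checked with a little arithmetic. The sketch below uses only the machine counts and throughput quoted in this summary; the Fleet struct and the method names are illustrative and are not part of Nava's tooling. It shows that the "nine times" figure corresponds to total cores rather than machine count.

```ruby
# Fleet sizes come from the summary; Fleet and the method names are illustrative.
Fleet = Struct.new(:machines, :cores_per_machine) do
  def total_cores
    machines * cores_per_machine
  end
end

PRODUCTION = Fleet.new(15, 2)  # current production fleet
LOAD_TEST  = Fleet.new(70, 4)  # fleet used for the billion-user test
ACHIEVED_TPS = 7_754           # transactions per second sustained in the test

# Ratio of two quantities, rounded to two decimal places.
def ratio(numerator, denominator)
  (numerator.to_f / denominator).round(2)
end

def machine_ratio
  ratio(LOAD_TEST.machines, PRODUCTION.machines)
end

def core_ratio
  ratio(LOAD_TEST.total_cores, PRODUCTION.total_cores)
end

# Average throughput handled by each application-server core.
def tps_per_core
  ratio(ACHIEVED_TPS, LOAD_TEST.total_cores)
end

puts "machines: #{machine_ratio}x, cores: #{core_ratio}x, tps per core: #{tps_per_core}"
```

By machine count the test fleet is under five times the size of production; by core count it is a little over nine times.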

A graph showing a transaction test.

Nava’s Scalable Login System running 7,754 transactions per second, with a 128 ms response time at the 90th percentile and zero errors

Approach

Tools

Nava has developed its own load testing infrastructure based on Apache JMeter, an industry-standard load testing tool, and ruby-jmeter, a Ruby DSL for writing JMeter test plans. The tests are written in Ruby from reusable components that simulate the HTTP requests of SLS clients. All components and tests are under revision control on GitHub. The tests ran from a distributed load generation “grid” (of only two machines!) with a total of 6,000 worker threads.
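To get a feel for what the grid had to deliver, the following sketch works out the load per generator and per worker thread. The thread count, machine count, and achieved request rate are from this article; the even split of threads between the two machines is an assumption, and the method names are illustrative.

```ruby
# Figures from the article.
TOTAL_THREADS = 6_000   # worker threads across the grid
GENERATORS    = 2       # load generation machines
ACHIEVED_RPS  = 7_754   # requests per second sustained in the test

# Assumption: threads are divided evenly between the generators.
def threads_per_generator(total_threads = TOTAL_THREADS, generators = GENERATORS)
  total_threads / generators
end

# Average request rate each worker thread must sustain.
def rps_per_thread(total_rps = ACHIEVED_RPS, total_threads = TOTAL_THREADS)
  (total_rps.to_f / total_threads).round(2)
end

# Average time between request starts on a single thread, in milliseconds.
def pacing_ms(total_rps = ACHIEVED_RPS, total_threads = TOTAL_THREADS)
  (1000.0 * total_threads / total_rps).round
end

puts "#{threads_per_generator} threads per generator"
puts "#{rps_per_thread} requests per second per thread"
puts "one request roughly every #{pacing_ms} ms per thread"
```

Each thread starts a request about every 774 ms on average, which is far longer than the 128 ms response time reported at the 90th percentile. That suggests the threads were paced or waiting between requests rather than saturated, although the article does not describe the timer settings.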

Architecture

A chart showing load testing for architecture.

The test

We wanted to see how far the current SLS architecture could scale. The current system has over 20 million users in the database. The peak load observed so far, during HealthCare.gov’s Open Enrollment in 2015, was about 150 requests per second on December 14th. The load test simulated key API requests for registering users, logging in, and retrieving user information.

Extrapolating from the current service size and peak usage, we set a goal of a database of one billion users and a request rate of 7,500 per second: 50 times both the current number of users and the peak throughput SLS had seen. While populating the database with one billion users, Brendan posted updates in Slack:

Four Slack messages with updates on the load test.

With the database of one billion users in place, we were ready to run the load test. It sustained 7,754 requests per second for one hour with acceptable latency and zero errors.
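The targets and the result can be tied together with a short calculation. This sketch uses the figures quoted in the article (20 million users, a peak of about 150 requests per second, a 50x scale factor, and 7,754 requests per second for one hour); the constant and method names are illustrative.

```ruby
# Figures from the article.
SCALE_FACTOR  = 50
CURRENT_USERS = 20_000_000
PEAK_RPS      = 150      # approximate observed production peak
SUSTAINED_RPS = 7_754    # rate achieved in the load test
TEST_SECONDS  = 3_600    # one hour

# Scale a production figure up to the load test target.
def scaled(value, factor = SCALE_FACTOR)
  value * factor
end

TARGET_USERS = scaled(CURRENT_USERS)
TARGET_RPS   = scaled(PEAK_RPS)

# How far the sustained rate exceeded the target, as a percentage.
def headroom_percent(achieved = SUSTAINED_RPS, target = TARGET_RPS)
  ((achieved - target) * 100.0 / target).round(1)
end

# Total requests served over the test window at the sustained rate.
def total_requests(rps = SUSTAINED_RPS, seconds = TEST_SECONDS)
  rps * seconds
end

puts "target: #{TARGET_USERS} users at #{TARGET_RPS} requests per second"
puts "achieved rate is #{headroom_percent}% above target"
puts "requests served in one hour: #{total_requests}"
```

At the sustained rate, the hour-long run amounts to roughly 27.9 million requests, about 3.4 percent above the 7,500 per second goal.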

The most time-consuming part of this exercise was populating the database with one billion users. At 2,000 users per second, it still took over a week!
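A quick back-of-the-envelope check shows why population dominated the schedule. The sketch below uses the quoted figures of one billion users and 2,000 users per second; the method names are illustrative, and the seven-day figure is only a lower bound implied by "over a week".

```ruby
TOTAL_USERS     = 1_000_000_000
INSERT_RATE     = 2_000    # users per second, as quoted in the article
SECONDS_PER_DAY = 86_400

# Days needed to insert all users at a constant rate.
def days_to_populate(users = TOTAL_USERS, users_per_second = INSERT_RATE)
  (users.to_f / users_per_second / SECONDS_PER_DAY).round(2)
end

# Average insert rate (users per second) implied by a given duration in days.
def average_rate_for(days, users = TOTAL_USERS)
  (users.to_f / (days * SECONDS_PER_DAY)).floor
end

puts "at a constant #{INSERT_RATE} users/s: #{days_to_populate} days"
puts "a 7-day run implies an average of #{average_rate_for(7)} users/s"
```

A constant 2,000 users per second would finish in just under six days, so a run of more than a week implies an average rate below about 1,650 users per second. The quoted rate is therefore best read as approximate or as a peak, with pauses or slower periods along the way.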

Conclusion

Currently available open source software and commodity cloud infrastructure can, if properly implemented, perform under intense loads. SLS has performed very well for the more than 20 million users currently on HealthCare.gov, and this load test demonstrates that SLS can comfortably accommodate roughly three times the entire American population without any major changes.

Brendan Neutra designed and executed the test and was recently recognized as a 2016 FCW Rising Star for this load testing work. Brendan continues to work on load testing Nava’s systems with the help of Mari Miyachi.
