Case Study

The billion user load test

Nava’s Scalable Login System (SLS) provides authentication and account management for millions of people on HealthCare.gov. To see how far it could scale, we ran a load test with a billion users.

Nava’s Scalable Login System (SLS) provides authentication and account management for users on HealthCare.gov. It is a RESTful API service built on Amazon Web Services. The Nava team wanted to see just how far it could scale by running a load test with a billion users, 50 times the 20 million accounts that SLS currently handles.

Summary

SLS sustained a throughput of 7,754 transactions per second for one hour. The 90th-percentile response time was 128 ms, and there were zero errors. Serving this load took roughly nine times the compute capacity of current production: 70 4-core machines (280 cores) versus 15 2-core machines (30 cores). Application server CPU utilization stayed at a comfortable 50 percent.
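The server comparison above is easiest to check by counting cores rather than machines. The sketch below does that arithmetic with the figures quoted in this summary. Treating a production core and a load-test core as equivalent is an assumption made only for this illustration, and the `Fleet` struct and constant names are hypothetical.

```ruby
# Back-of-envelope capacity comparison using the figures in the summary.
# Assumption: cores are comparable across the two fleets (not stated in the article).
Fleet = Struct.new(:machines, :cores_per_machine) do
  def total_cores
    machines * cores_per_machine
  end
end

PRODUCTION     = Fleet.new(15, 2)  # current production
LOAD_TEST      = Fleet.new(70, 4)  # billion-user load test
THROUGHPUT_TPS = 7_754.0           # sustained transactions per second

# Ratio of one fleet to another, by machine count and by core count.
def machine_ratio(test, prod)
  test.machines.to_f / prod.machines
end

def core_ratio(test, prod)
  test.total_cores.to_f / prod.total_cores
end

puts "machines: #{machine_ratio(LOAD_TEST, PRODUCTION).round(1)}x"  # 4.7x
puts "cores:    #{core_ratio(LOAD_TEST, PRODUCTION).round(1)}x"     # 9.3x
puts "per core: #{(THROUGHPUT_TPS / LOAD_TEST.total_cores).round(1)} tps"  # 27.7 tps
```

By machine count the test fleet is under five times the size of production; the "nine times" figure only appears once cores are counted.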

A chart showing the load test sustaining more than 7,500 transactions per second over a one-hour period.

Nava’s Scalable Login System running 7,754 transactions per second, with a 90th-percentile response time of 128 ms and zero errors.

Approach

Tools

Nava has developed its own load testing infrastructure based on Apache JMeter, an industry-standard load testing tool, and ruby-jmeter, a Ruby DSL for writing JMeter test plans. The tests are written in Ruby from reusable components that simulate SLS client HTTP requests. All components and tests are under revision control on GitHub. The tests ran from a distributed load generation “grid” (of only two machines!) with a total of 6,000 worker threads.
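To get a feel for what 6,000 worker threads means at the throughput reported elsewhere in this case study, Little's Law (concurrency = throughput × time per cycle) gives the average time each thread spends per request cycle. This is a rough sketch: the method names are hypothetical, the 90th-percentile latency is used as a stand-in for the mean, and the article does not say how the threads were paced.

```ruby
# Little's Law for a closed load generator: N = X * R
#   N = concurrent workers, X = throughput, R = average time per request cycle
WORKER_THREADS = 6_000    # from the article
THROUGHPUT_RPS = 7_754.0  # from the article
P90_RESPONSE_S = 0.128    # from the article; 90th percentile, not the mean

# Average seconds each worker spends per request cycle (response + any pacing).
def cycle_time_seconds(workers, throughput)
  workers / throughput
end

# Average requests issued by one worker per second.
def per_thread_rate(workers, throughput)
  throughput / workers
end

cycle = cycle_time_seconds(WORKER_THREADS, THROUGHPUT_RPS)
puts format("cycle per thread: %.3f s", cycle)                                   # 0.774 s
puts format("requests per thread: %.2f /s", per_thread_rate(WORKER_THREADS, THROUGHPUT_RPS))  # 1.29 /s
puts format("share of cycle waiting on SLS: ~%.0f%%", 100 * P90_RESPONSE_S / cycle)           # ~17%
```

Under these assumptions each thread issues a little over one request per second, and most of each cycle is spent outside the SLS response itself, for example in pacing timers or client-side overhead.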

Architecture

A chart with three columns describes the user load architecture: description, current production, and load testing.

The test

We wanted to see how far the current SLS architecture could scale. The current system has over 20 million users in the database. The highest load observed during HealthCare.gov’s 2015 Open Enrollment came on December 14th, at about 150 requests per second. The load test simulated key API requests for registering users, logging in, and getting user information.

Extrapolating from the current service size and peak usage, our goal was a database of one billion users and a request rate of 7,500 per second: 50 times both the current number of users and the peak throughput SLS had seen. While populating the database with one billion users, Brendan posted updates in Slack:

A series of Slack messages from Brendan recounts the experience of watching the load test. They read: "Breezed past Brazil (205M) to 227M this a.m." "257M just breezed past Indonesia (255M) next stop... the US of A!" "Weird thing happened. We crossed 322M a couple hours ago and I kind of lost track. #youhadonejob" And at the bottom of the chain, a text snippet that shows the number of users at 1,000,854,883 is labeled with "The eagle has landed."

Running the test proved to be... uneventful.

With one billion users in the database, the load test was ready to run. We achieved 7,754 requests per second sustained for one hour with acceptable latency and zero errors.

The most time consuming part of this exercise was populating the database with one billion users. At 2,000 users per second, it still took over a week!
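The population time can be sanity-checked with simple arithmetic. The sketch below assumes a perfectly steady insert rate, which the article does not claim; the method and constant names are hypothetical.

```ruby
# How long does it take to create one billion users at a given insert rate?
TARGET_USERS    = 1_000_000_000
INSERT_RATE     = 2_000   # users per second, as quoted in the article
SECONDS_PER_DAY = 86_400

# Days needed to insert `users` at a steady `users_per_second`.
def days_to_populate(users, users_per_second)
  users.to_f / users_per_second / SECONDS_PER_DAY
end

# Average insert rate implied by finishing `users` in `days`.
def average_rate_for(users, days)
  users.to_f / (days * SECONDS_PER_DAY)
end

puts "steady 2,000/s: #{days_to_populate(TARGET_USERS, INSERT_RATE).round(1)} days"  # 5.8 days
puts "7-day average:  #{average_rate_for(TARGET_USERS, 7).round} users/s"            # 1653 users/s
```

At a steady 2,000 users per second the inserts alone would take just under six days. A run of more than a week implies an average rate somewhat below that, or pauses along the way, though the article does not say which.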

Conclusion

Currently available open-source software and commodity cloud infrastructure can, if properly implemented, perform under intense loads. SLS has performed very well for the more than 20 million users currently on HealthCare.gov, and this load test demonstrates that SLS can comfortably accommodate about three times the entire American population without any major changes.

Brendan Neutra, who designed and executed the test, was recognized as a 2016 FCW Rising Star for this load testing work.

Written by


Brendan Neutra

Senior Infrastructure Engineer

Brendan Neutra is a senior infrastructure engineer at Nava. Brendan worked on infrastructure for Google before being recruited to help stabilize the HealthCare.gov site in 2014.

Aimee Barciauskas

Software Engineer

Aimee Barciauskas was a software engineer at Nava. Previously, Aimee worked as an applications and web developer.

Published December 3, 2016
