Preface
ClinSpark is designed to be performant, with the goal of being always fast, regardless of user load.
In properly sized production environments, we expect to see average response times of around 250 ms.
Introduction
We don’t typically engage in performance testing exercises with customers. If this is something you need, please talk to your project manager, who can arrange to have such an exercise scoped and costed.
Experiment
Introduction
High performance has always been a key design goal for ClinSpark. This foundational concern has driven every aspect of the system's design: it permeates the application code, the database design, and the deployment infrastructure. Customers tend to be drawn to ClinSpark because usage of the system has the potential to transform business and clinical operations. But precisely because the system touches so many aspects of the clinical ecosystem, performance problems that emerge over time could be disastrous for customers.
Here we present a testing approach, along with results and analysis, from an exercise performed specifically to address the question of whether ClinSpark performance will degrade over time. System demonstrations have repeatedly shown that ClinSpark is fast with default datasets. This effort answers the question of what happens to performance when large multi-site customers use ClinSpark continuously for years, with the accumulated data from past studies retained in a single database.
A key part of our testing effort entailed creating a series of progressively larger datasets. As described in detail below, each dataset contains a volume of data representing a particular number of studies, ranging from 5 to 2000. A test script was created that simulates a user interacting with a broad coverage area of ClinSpark in an automated, repeatable fashion. A series of tests was then conducted in which this user activity was applied to separate instances of ClinSpark, each initialized with one of the sample databases. Each test recorded minute details about system response times.
A key goal for this test was to identify application functionality that would not perform acceptably as the size of the database grows over time. The primary aim was not to apply a massive volume of requests concurrently; rather, the focus was to apply a relatively constant number of requests against a massively increasing volume of data. Note that while this exercise isn't truly a load test, the request volume applied was still roughly 20 requests per second. This can be observed in detail in the zip attachments; look for the charts on throughput.
Raw data outputs from JMeter are available for the experiment described here but have not been included.
Executive Summary
This exercise objectively demonstrates that ClinSpark maintains a very low response time even with large volumes of accumulated study data in the database.
Average Response Times Stay Small
The aggregate average response time for ClinSpark requests increases linearly with increasing historical study volume. However, even with 2000 studies in the ClinSpark database, the average response time did not exceed 289 milliseconds.
ClinSpark pages may issue multiple requests concurrently using AJAX, in addition to the loading of the primary page, so these results do not correspond one-to-one with page view times. They do mean, however, that the typical user with a typical site network connection should have the majority of pages ready for use with sub-second response times.
Response Time Distributions
The averages, of course, show only a blend of all transaction types. It matters which transactions are fast and which are slow.
The chart below is taken from the Response Time Distribution data collected for each execution of the test. It groups response counts into time ranges and comes from the run with 1000 studies. It shows that the majority of transactions had responses rendered in less than a second, while a few outliers took as long as 10 seconds. Note the color of the bars, which indicates transaction type.
This next chart on response distribution shows a per-transaction view of the same data. This makes it easy to spot the outliers: the slower transactions at the far right of the curve.
The few transactions that exceed the 2-second threshold are not common, high-frequency user operations. The most common user operations fall within the sub-second grouping.
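The grouping itself is produced by JMeter's HTML report, but for readers working from the raw results, the following is a minimal sketch of the same bucketing, assuming a CSV-format JTL file named results.jtl with JMeter's standard elapsed (milliseconds) and label columns.

```python
# Minimal sketch: group JMeter samples into the time-range buckets used in
# the Response Time Distribution chart. Assumes a CSV-format JTL results
# file with JMeter's standard 'elapsed' (ms) and 'label' columns.
import csv
from collections import Counter

BUCKET_MS = 500  # width of each time-range grouping

def response_time_distribution(jtl_path):
    buckets = Counter()
    with open(jtl_path, newline="") as f:
        for row in csv.DictReader(f):
            elapsed = int(row["elapsed"])               # response time in ms
            lower = (elapsed // BUCKET_MS) * BUCKET_MS  # bucket lower bound
            buckets[(lower, row["label"])] += 1         # count per (range, transaction)
    return buckets

if __name__ == "__main__":
    for (lower, label), count in sorted(response_time_distribution("results.jtl").items()):
        print(f"{lower}-{lower + BUCKET_MS - 1} ms  {label}: {count}")
```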
Vertically Scaling the Database
As detailed below, the database used for this series of tests is a memory-optimized "db.r3.2xlarge" instance with 8 virtual CPUs and 61 GB of RAM. This database proved more than adequate, even in the test where over 1 billion rows of test data were present. Our preferred deployment provider, Amazon Web Services, currently offers database servers four times as powerful, with 32 virtual CPUs and 244 GB of RAM, which can be swapped in at any point in the future should an environment require additional performance. This is a key point, because the database is the one part of the architecture that can only be scaled vertically.
Approach
A base study was designed to maximize the variety of data types commonly found in typical clinical trials. Although the protocol is not overly realistic, it establishes a solid foundation on which testing can occur. Once the protocol was built into the system, a manual execution of the study was performed. With the base study results in place, a program was then developed that allowed for copying Y subjects into X studies. During the copy program's execution, database snapshots were taken at different study milestones to allow for future testing: 5, 50, 100, 250, etc. Finally, an automated test script was developed that allowed for capturing objective server metrics for analysis.
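The copy program is internal tooling and is not reproduced here, but its overall shape can be sketched. The snippet below is hypothetical pseudocode in Python form: clone_base_study and snapshot_database are stand-ins for the real implementation, not ClinSpark functions.

```python
# Hypothetical sketch of the data-scaling loop described above.
# clone_base_study() and snapshot_database() stand in for the real internal
# tooling; they are not ClinSpark functions.
MILESTONES = [5, 50, 100, 250]  # ...continuing up to the largest 2000-study dataset

def scale_dataset(clone_base_study, snapshot_database):
    studies_created = 0
    for milestone in MILESTONES:
        while studies_created < milestone:
            studies_created += 1
            # copy the manually executed base study, with its 50 subjects
            # and all associated samples, ECGs, vitals, labs and audit records
            clone_base_study(study_number=studies_created)
        # capture a restorable database snapshot at this milestone for later test runs
        snapshot_database(label=f"{milestone}-studies")
```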
Study Overview
50 subjects per study
10 study events per protocol: Days 1-10
Each study day is identical:
23 PK events
Basic sample path of centrifuge, transfer to two aliquots, freeze
Each transfer tube is associated with a sample container and later associated with a configured shipment
4 ECGs: standard Mortara intervals and interpretation fields exposed in the form
4 vital signs: Diastolic, Systolic, MAP, Rate, and Temperature
1 dosing event
1 PE with three repeating groups
Standard PE form from CDISC CDASH
Hematology, Clinical Chemistry, Urinalysis
Each test to have simulated test results included
Each result to be reviewed in the system
Data Basics
With each study milestone, the copy program increases the number of volunteers, with a final goal of 500,000
Each managed volunteer is randomly assigned 7 medical conditions
Date of birth is randomly generated within the range 12-SEP-1926 to 12-SEP-1996
Height and weight are randomly generated within the ranges 1.3 - 2.2 m and 34 - 181 kg respectively
This level of randomization created a wide spectrum of BMIs (see the sketch after this list)
Address, city, first name, middle name, last name, postal code, email, and phone numbers were randomly obtained using an online data generator
11 contraception types
155 medical conditions
50 volunteer regions (US states)
5 tobacco types
Mock ECG and mock Vitals 'devices'
Each ECG and vitals test leverages this mock interface, and thus the results are identical across each time-point
Each study has 75 appointment windows
900 users
Usernames are 001 to 900
Password hardcoded as [REDACTED]
Each user assigned an 'admin' role
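To make the randomization above concrete, here is an illustrative sketch of generating demographics within the stated ranges. It is not the actual copy program, and the random_volunteer helper shown is hypothetical.

```python
# Illustrative sketch of the demographic randomization described above;
# not the actual copy program. random_volunteer() is a hypothetical helper.
import random
from datetime import date, timedelta

DOB_START, DOB_END = date(1926, 9, 12), date(1996, 9, 12)

def random_volunteer():
    dob = DOB_START + timedelta(days=random.randrange((DOB_END - DOB_START).days + 1))
    height_m = random.uniform(1.3, 2.2)        # metres
    weight_kg = random.uniform(34, 181)        # kilograms
    bmi = weight_kg / height_m ** 2            # the wide BMI spread falls out of these ranges
    conditions = random.sample(range(155), 7)  # 7 of the 155 medical conditions
    return {"dob": dob.isoformat(), "height_m": round(height_m, 2),
            "weight_kg": round(weight_kg, 1), "bmi": round(bmi, 1),
            "medical_condition_ids": conditions}

if __name__ == "__main__":
    print(random_volunteer())
```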
Testing Automation and Execution
A JMeter (http://jmeter.apache.org) test script was developed to simulate users interacting with ClinSpark against varying data loads. The script allowed for specifying URL endpoints and any corresponding data required to complete a given action (e.g. entering volunteer details in order to perform a search). The goal was to interact with each ClinSpark application component while automatically collecting critical performance metrics. Leveraging JMeter kept execution consistent across the tests conducted, and it also allowed for establishing desired levels of concurrency. For the tests performed, the script was configured to run with five concurrent threads, a ramp-up time of 50 seconds, and a loop count of 15. Because the script does not include pause times between sampler invocations, the test simulated a high level of throughput relative to normal system usage. Upon completion of each test, the output from the framework was saved and analyzed. Tests were deliberately executed on a computer system and network outside of the application's deployment infrastructure.
The script was run against differing database sizes as described in the Results section that follows. For each test run, very detailed reports were generated and are available for download. High-level summaries of each test run are found with their corresponding tests in the sections that follow. APDEX (http://www.apdex.org) is a measure of user experience based on user wait times. The configured APDEX satisfaction and tolerance thresholds have been defined as 3 and 5 seconds respectively.
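For reference, the standard APDEX formula is (satisfied samples + tolerating samples / 2) / total samples. The sketch below applies that formula, with the 3- and 5-second thresholds above, to a CSV-format JMeter results file; it is a simplified stand-in for JMeter's own report calculation.

```python
# Minimal sketch of the standard APDEX calculation applied to a CSV-format
# JMeter results file (JMeter's default JTL columns: 'elapsed', 'success').
import csv

SATISFIED_MS = 3000  # satisfaction threshold: 3 seconds
TOLERATED_MS = 5000  # tolerance threshold: 5 seconds

def apdex(jtl_path):
    satisfied = tolerating = total = 0
    with open(jtl_path, newline="") as f:
        for row in csv.DictReader(f):
            total += 1
            if row.get("success", "true").lower() != "true":
                continue  # failed samples count as frustrated
            elapsed = int(row["elapsed"])
            if elapsed <= SATISFIED_MS:
                satisfied += 1
            elif elapsed <= TOLERATED_MS:
                tolerating += 1
    # APDEX = (satisfied + tolerating / 2) / total samples
    return (satisfied + tolerating / 2) / total if total else 0.0

if __name__ == "__main__":
    print(f"APDEX: {apdex('results.jtl'):.3f}")
```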
Pre-Test Environment Warm Up
A freshly restarted database and application environment will always be slower than one that has had a chance to warm up. Warm-up typically consists of the population of database caches and application-level byte-code optimizations. As is best practice, all performance data was captured after a warm-up period, allowing these automatic optimizations to take effect. Note that in production scenarios, database caches persist across database restarts, meaning this warm-up period is needed only for load tests.
Results
Details regarding the inputs and results data are included here for each test execution. All data here is empirical, captured before, during or after test run execution.
The Key Data section shows database row counts. Table definitions can be found in the appendix.
Select visualizations from JMeter are included directly in this document for each run. Full results from JMeter are available but not included here.
Five (5) Studies
APDEX Score: 0.996
Key Data
Table | Row Count |
volunteer | 10,001 |
volunteer_medical_condition | 70,007 |
base_test_result | 144,576 |
item_data | 288,650 |
item_data_sample_audit_record | 346,380 |
item_data_audit_record | 578,053 |
Sum database table rows | 2,800,162 |
Reports
50 Studies
APDEX Score: 0.996
Key Data
Table | Row Count |
volunteer | 50,001 |
volunteer_medical_condition | 350,007 |
base_test_result | 1,440,576 |
item_data | 2,876,150 |
item_data_sample_audit_record | 3,451,380 |
item_data_audit_record | 5,759,803 |
Sum database table rows | 26,945,065 |
Postscript
This testing exercise was executed in 2016, when ClinSpark was being pitched to a prospective customer that was also evaluating another eSource platform. We were aware that this platform, from an established multinational vendor, performed poorly, and we were happy to go head-to-head to demonstrate ClinSpark’s designed-in performance characteristics.
Internally, our name for this competitive endeavour was ‘the bakeoff’.
Our opposition never showed up.