...
The application is designed to be inherently resilient, maximizing availability and minimizing downtime. Much of this resilience is owed to the hosting infrastructure, the Amazon Web Services (AWS) cloud. The design requirements that determine this level of resilience are specified in the Application Infrastructure Architecture document and, to a lesser degree, in the Application Architecture document. These documents are revised as required and are available to customers in the release technical files (via the service desk portal) or upon request.
The application is designed to:
Maintain high availability – single points of failure are eliminated
Have high fault tolerance – application and database servers can fail, but service to the customer must not be interrupted and recovery must be quick
Survive a full data center disaster without data loss or significant inconvenience to a customer
Testing
Recovery following a full data center disaster is designed to be automatic and transparent.
As such, there is no manual disaster recovery procedure or process. Robustness in the face of a disaster is a characteristic of the application and infrastructure architecture. Recovery is designed to complete automatically, potentially before users or support teams even realize that an outage has occurred. Testing simulates outages at multiple layers of the infrastructure and verifies that automated recovery has taken place as designed.
To simulate this disaster scenario, each layer of the infrastructure within a single AWS Availability Zone (AZ) (equivalent to a data center) is forcibly brought down. This is done in two separate exercises. The first targets the Elastic Beanstalk application server layer. The second targets the RDS database master. In both cases, the stimulus is a forced failure of all resources within an AZ, and the verification is an observation of how the infrastructure responds automatically. The system must recover within the expected timeframe, with no support intervention. See also https://foundryhealth.atlassian.net/wiki/spaces/DOCS/pages/3708420496/ClinSpark+Application+SLA?src=search.
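The verification step of these exercises can be pictured as a simple polling loop: after the failure is forced, a health probe is called repeatedly until the service answers again, and the elapsed time is compared with the allowed recovery window. The following Python is a minimal sketch under assumptions, not the actual test tooling; the function name `measure_recovery_seconds`, the probe callable, and the timeout and polling values are all hypothetical.

```python
import time
from typing import Callable, Optional


def measure_recovery_seconds(
    is_healthy: Callable[[], bool],
    timeout_s: float,
    poll_interval_s: float = 5.0,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> Optional[float]:
    """Poll a health probe until it succeeds.

    Returns the elapsed seconds at the first healthy response, or None
    if the service has not recovered once timeout_s has elapsed.
    """
    start = clock()
    while True:
        elapsed = clock() - start
        if is_healthy():
            return elapsed
        if elapsed >= timeout_s:
            # Recovery did not happen within the allowed window.
            return None
        sleep(poll_interval_s)
```

In a real exercise, `is_healthy` might wrap an HTTP request to an application health check, and `timeout_s` would be set from the agreed recovery timeframe. The injectable `clock` and `sleep` parameters let the logic be exercised without waiting in real time.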
Testing occurs annually. Further information regarding our approach to business continuity and disaster recovery is in this article:
Access Controls
All customer PROD MAIN superadmin support accounts are protected via MFA. Reviews of superadmin support accounts across all customer PROD MAIN instances are conducted on a quarterly basis.
...
Once a year, the application undergoes manual penetration testing conducted by an external vendor. The vendor produces a summary of findings, which the product team reviews. Findings are grouped into four classifications aligned with the OWASP Risk Rating Methodology, and each classification determines how a finding is handled:
Critical = Address immediately, via hotfix release or other remediation.
High = Address in the current functional release in development.
Medium = Prioritized into the next functional release.
Low = Added to backlog, to be prioritized into an upcoming release.
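The classification scheme above amounts to a lookup from severity to remediation policy. As an illustration only (the names `Severity`, `REMEDIATION_POLICY`, and `remediation_for` are hypothetical and not part of any product tooling), it could be encoded like this:

```python
from enum import Enum


class Severity(Enum):
    CRITICAL = "Critical"
    HIGH = "High"
    MEDIUM = "Medium"
    LOW = "Low"


# Policy text mirrors the four classifications listed above.
REMEDIATION_POLICY = {
    Severity.CRITICAL: "Address immediately, via hotfix release or other remediation.",
    Severity.HIGH: "Address in the current functional release in development.",
    Severity.MEDIUM: "Prioritized into the next functional release.",
    Severity.LOW: "Added to backlog, to be prioritized into an upcoming release.",
}


def remediation_for(classification: str) -> str:
    """Look up the remediation policy for a classification name (case-insensitive)."""
    severity = Severity(classification.strip().capitalize())
    return REMEDIATION_POLICY[severity]
```

An unknown classification raises `ValueError`, so a finding with a mistyped severity cannot silently fall through without a policy.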
...
All application instances are hosted within an IQVIA AWS account.
Infrastructure as Code
...
All customer data is stored in AWS RDS instances. Application servers do not store any customer data, only configuration. As such, this topic is limited in scope to how RDS supports backups and recovery.
First Line of Defense
RDS provides a high level of data protection by design. All customer PROD instances use RDS instances in a Multi-AZ (Multiple Availability Zone) configuration. The relevant components are shown in this diagram:
...
Application server instances interact with the RDS Master instance at all times. However, each transaction synchronously updates the underlying physical storage, which in Aurora is striped across 3 separate physical Availability Zones. This replication forms the primary line of defense for backups: there is always a complete copy of the application database ready in a separate physical location. When a failure of any sort occurs on the Master database, the infrastructure automatically shifts all application traffic to the Standby, which is promoted to the role of Master with no loss of data.
Second Line of Defense
The second line of defense is snapshots stored in Amazon S3. Database transactions produce comprehensive log records, and these logs are streamed continuously to S3. As documented by Amazon, this stream of backup data is sufficient to restore a completely new replacement instance of the database to a point in time within 5 minutes of when a disaster occurred. The S3 storage is itself configured to be backed up across separate geographic regions.
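To make the 5-minute figure concrete, the following sketch computes the data-loss window for a given disaster time and checks it against that target. It is an illustration under assumptions: the function names are hypothetical, the timestamps are made up, and `latest_restorable_at` stands for whatever timestamp the backup system reports as the most recent restorable point (RDS exposes such a value as `LatestRestorableTime`).

```python
from datetime import datetime, timedelta, timezone

# Target taken from the text: restore to within 5 minutes of the disaster.
RPO_TARGET = timedelta(minutes=5)


def data_loss_window(disaster_at: datetime, latest_restorable_at: datetime) -> timedelta:
    """Return how much recent data would be missing after a point-in-time restore."""
    # A restorable point at or after the disaster means nothing is lost.
    return max(disaster_at - latest_restorable_at, timedelta(0))


def meets_rpo(
    disaster_at: datetime,
    latest_restorable_at: datetime,
    target: timedelta = RPO_TARGET,
) -> bool:
    """Check whether the data-loss window is within the recovery point target."""
    return data_loss_window(disaster_at, latest_restorable_at) <= target


# Example with made-up timestamps: the last restorable point is 3 minutes old.
disaster = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
restorable = disaster - timedelta(minutes=3)
print(meets_rpo(disaster, restorable))  # True
```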
...
In the event that both the Primary and Secondary databases go down, RDS will automatically provision replacements and swap them in within 15 minutes. In addition, new instances can be created manually anywhere in the AWS cloud and loaded with the snapshots stored in S3. This scenario does cause an interruption in service and requires manual intervention. It is highly unlikely, given the high platform availability of a Multi Availability Zone deployment. However, should it occur, IQVIA would follow the procedure for building an environment in another region or Availability Zone, restore the data using the previously referenced procedure, and restore service using the new instances.
...