1 Overview
2 Read Replicas
- 2.1 Latency
- 2.2 Connecting to your Replica
3 Clinical Data Interchange Standards Consortium (CDISC)
4 ClinSpark Data Model
- 4.1 Study Design
  - 4.1.1 Org, Volunteer, Study, Sites, StudySites and Subjects
  - 4.1.2 Form Definition
  - 4.1.3 Activity Plans
  - 4.1.4 Epochs and Cohorts
- 4.2 Clinical Data
  - 4.2.1 Parallel data and definition objects in CDISC
  - 4.2.2 Study Data
  - 4.2.3 Queries
  - 4.2.4 Lab Data
- 4.3 Samples
- 4.4 Volunteers
- 4.5 Users
5 Read Replica Usage
- 5.1 Integration to your Existing Data Warehouse
- 5.2 Ad Hoc SQL Queries
- 5.3 Business Intelligence Tools
  - 5.3.1 Tableau
  - 5.3.2 Tibco Spotfire
  - 5.3.3 AWS Quicksight
  - 5.3.4 Crystal Reports or Similar

Overview

ClinSpark customers have access to a live, read-only copy of their production ClinSpark database, called a Read Replica. This document describes what a Read Replica is from a technical perspective, describes key aspects of the underlying data model so that technical users can better understand how the data is stored, and explores use cases for the Read Replica along with some examples.

Read Replicas

All ClinSpark data is stored within a MySQL-compatible relational database in the AWS Cloud. The Master database is the production database for the ClinSpark instance. This is where all data is written and updated during the use of the ClinSpark application. In addition to other live backup mechanisms for operational use, a special read-only copy of this production data can be made available to customers. This database is a dedicated copy of the production database, solely for customer use. It is not used by running ClinSpark instances in any way. It is read-only, meaning that it does not accept writes, and no usage of this database can impact the Master database in any way.

ClinSpark Read Replicas are standard AWS RDS Aurora Read Replicas. You can find comprehensive documentation from AWS online on these replicas.

Here is a visual depiction of the Read Replica in the ClinSpark AWS production environment:

Note that like all ClinSpark application assets, it lives within private subnets of a Production Virtual Private Cloud (VPC). It is not exposed to the internet. Its data is encrypted at rest, and in transit via SSH.

Latency

How old is the data in the Replica?

As updates are applied to Master, the Replica receives them in near real time. There is a delay, but due to the architecture of AWS RDS Aurora it is typically no greater than 20 milliseconds. In other words, a change made to the production database is typically available in the customer's Read Replica within about 20 ms.

Connecting to your Replica

There are a variety of ways to access the Read Replica. For the steps needed to request access to the replica via service desk ticket, visit this page: Connecting to your Read Replica. This article covers connecting in depth, assuming that access has been granted.

For customers with established relationships with AWS and special connections between AWS and their site's data centers, a variety of approaches exist for providing more direct connections to the replica.

However, for most customers the common way to access the Read Replica is via SSH. For these customers, we provision a special bastion host which has no purpose other than to provide this customer with access to this replica. This host is accessible only via SSH, and only using a customer-owned private SSH key. This host has no access to the environment other than to access this replica.

During onboarding, two DNS names and one set of credentials will be provided. One is the DNS name of your dedicated bastion host, used to access the replica via SSH. The other is the DNS name of the read replica itself within the AWS VPC. For this example, let's pretend these DNS names are as follows:

Customer Bastion Host (public DNS)	customer-replica-bastion.clinspark.com
Customer Read Replica (private DNS)	customer-replica.crs8xf7ezw7g.us-east-1.rds.amazonaws.com
Read Replica Username	<username>
Read Replica Password	<password>

To connect to this replica via the command line, the steps are as follows:

1. Create an SSH key you will use to access the DB through the bastion. If you already have one, that's fine. Here are instructions for doing this and also adding the key to the ssh-agent.
2. Provide IQVIA with the public SSH keys of users who need access to this database. We will place these on the bastion, allowing these users to tunnel into the replica. Open a service ticket with the public key attached and we'll get this installed quickly.
3. Verify you have access by connecting to the bastion and replica as follows from the command line:

$ ssh ec2-user@customer-replica-bastion.clinspark.com
Last login: Fri Dec 16 18:39:49 2016 from nclarkllc03.n.subnet.rcn.com

       __|  __|_  )
       _|  (     /   Amazon Linux AMI
      ___|\___|___|

https://aws.amazon.com/amazon-linux-ami/2016.09-release-notes/
[ec2-user@ip-172-31-17-113 ~]$ mysql -u <username> -h customer-replica.crs8xf7ezw7g.us-east-1.rds.amazonaws.com -p<password>
Warning: Using a password on the command line interface can be insecure.
Welcome to the MySQL monitor.  Commands end with ; or \g.
Your MySQL connection id is 262
Server version: 5.6.22 MySQL Community Server (GPL)

Copyright (c) 2000, 2016, Oracle and/or its affiliates. All rights reserved.

Oracle is a registered trademark of Oracle Corporation and/or its affiliates. Other names may be trademarks of their respective owners.

Type 'help;' or '\h' for help. Type '\c' to clear the current input statement.

mysql>

The above shows a terminal session where the user has first added the private ssh key to the ssh-agent as described in the previous link. The user connects to the bastion, and from there is able to use the MySQL client to interact with the replica directly from the bastion.

You may want to establish an SSH tunnel between your local workstation and the replica through the bastion. This is a common usage pattern; see the documentation for your SSH client of choice for instructions on how to configure it.

Note that in typical usage, an SSH tunnel from a user's workstation to the bastion uses port forwarding, and the SQL tool on the user's desktop connects to the local end of this tunnel.
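As a rough illustration of the port forwarding described above, the following Python sketch builds an OpenSSH command line for a local forward through the bastion. It is only a sketch under stated assumptions: the replica is assumed to listen on the default MySQL port 3306, the local port 3307 is an arbitrary choice, the ec2-user login is taken from the example session above, and the function name build_tunnel_command is hypothetical.

```python
import shlex


def build_tunnel_command(bastion_host, replica_host, local_port=3307,
                         replica_port=3306, user="ec2-user"):
    """Return the argv for an SSH local port forward through the bastion."""
    # -N: do not run a remote command, only forward ports.
    # -L: forward local_port on this machine to replica_host:replica_port,
    #     with the replica name resolved from the bastion's side.
    forward = f"{local_port}:{replica_host}:{replica_port}"
    return ["ssh", "-N", "-L", forward, f"{user}@{bastion_host}"]


if __name__ == "__main__":
    argv = build_tunnel_command(
        "customer-replica-bastion.clinspark.com",
        "customer-replica.crs8xf7ezw7g.us-east-1.rds.amazonaws.com",
    )
    # Print a copy-pasteable command line rather than launching ssh here.
    print(shlex.join(argv))
```

With a tunnel like this running, a SQL tool on the workstation would connect to 127.0.0.1 on the chosen local port, using the replica username and password.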

The above approach is appropriate if only one or two trusted users will access the bastion. It is secure as long as the private keys are secured. Customers with more users will want to find more scalable ways to access the replica data. One option is a persistent SSH tunnel hosted at the customer site and shared by users at that site. AWS also offers a variety of options. See AWS VPNs and Direct Connect, which are two mechanisms to securely connect your sites to your own AWS account. From there a variety of options exist, including AWS PrivateLink to create secure connections between AWS accounts.

It is the customer's responsibility to select and configure any options such as the above. IQVIA will assist by suggesting approaches and making required configurations on our end. But the majority of the setup is in customer-owned infrastructure, and for this the customer is solely responsible.

Clinical Data Interchange Standards Consortium (CDISC)

CDISC is an open, multidisciplinary, non-profit organization committed to the development of industry standards to support the electronic acquisition, exchange, submission, and archiving of clinical trials data and metadata for medical and biopharmaceutical product development. Details about CDISC and ClinSpark are covered here: CDISC

Operational Data Model (CDISC ODM)

Here is how the Clinical Data Interchange Standards Consortium describes the ODM in the introduction of the specification:

The Operational Data Model (ODM) is a vendor neutral, platform independent format for interchange and archive of clinical study data. The model includes the clinical data along with its associated metadata, administrative data, reference data and audit information. All of the information that needs to be shared among different software systems during the setup, operation, analysis, submission or for long term retention as part of an archive is included in the model.
Clinical data management systems vary significantly in the information they store and the rules they enforce. The ODM model has been designed to represent a wide range of study information so as to be compatible with most existing clinical data management systems. Systems that do not have all of the features represented by the ODM model may still be ODM compatible as long as they comply with the conformity rules provided in the section on System Conformity.
The ODM has been designed to be compliant with guidance and regulations published by the FDA for computer systems used in clinical studies. This document is intended to be both the formal specification of the ODM and a user guide for those involved in transferring or archiving of clinical data using the model.

To the maximum extent possible, the ClinSpark data model is based on the CDISC Operational Data Model (ODM) standard. In fact, a significant portion of the database schema for ClinSpark was generated directly from the ODM XML schema. ODM scope is limited to the core entities of clinical study data. The ClinSpark data model includes this but goes far beyond this scope. As such, the CDISC ODM and its related documentation are excellent sources of information for understanding the subset of the ClinSpark data model which overlaps.

The CDISC ODM documentation can be freely downloaded from here. We recommend that anyone intending to work with and understand the ClinSpark data model spend some time getting familiar with the ODM, as it provides valuable insights into both design and intended usage.

Study Data Tabulation Model (CDISC SDTM)

The SDTM defines a standard structure for study data tabulations that are to be submitted as part of a product application to a regulatory authority such as the United States Food and Drug Administration (FDA). The ClinSpark data model is influenced by this specification where applicable. The SDTM specification documentation can be found on the CDISC site.

Laboratory Data Model (CDISC LAB)

The CDISC Laboratory Data Team (LAB) has as its mission the development of a standard model for the acquisition and interchange of laboratory data. The specification and all related documentation can be found on the CDISC LAB homepage. To the extent possible, the ClinSpark data model adheres to the CDISC LAB specification. Model entities which come directly from this specification include:

Base Specimen
Base Battery
Base Result

Additionally the way that ClinSpark handles the concept of Accession comes directly from the CDISC LAB standard, section 3.4.8. Here is the guidance from this section:

Accession data identifies the specimen collection kit and the laboratory from where it came. The convention that seems to be standard for all laboratories is that one accession number identifies one accession used at one subject visit. Central Laboratory ID must always be populated or required.

ClinSpark Data Model

To the extent possible and practical, the ClinSpark data model is based on the CDISC ODM 1.3.2 data model. A few of the benefits of this are:

Standards-based simplifies interoperability. ClinSpark natively produces and accepts CDISC data, simplifying integration with other products and vendors supporting this standard.
Simplified training and transferable knowledge. CDISC ODM is fairly widely used within the industry. This makes our data model relatively easy for new users familiar with ODM to comprehend.
Future-proofing. A high-quality, standards-based data foundation, as opposed to a proprietary one, makes data more future-proof.

The following section highlights a series of key subject areas within the ClinSpark data model. This section covers both CDISC and non-CDISC entities, since this is the nature of the ClinSpark data model.

The entity diagrams and the relationships between them are intended to help readers understand what can be found in tables and also how tables can be joined using SQL. This is not meant to be comprehensive; it should be used in conjunction with the other schema documentation provided.

Note that this information may change over time, but will provide a solid foundation for training purposes. The data model in ClinSpark may change each functional release, to support new capabilities or changes to existing features.

Study Design

The following subject areas are involved with study design.

Note that ClinSpark has made significant extensions to the ODM in a number of areas. Two examples are device connectivity and importation from the volunteer record. There is no provision within ODM for either of these features. To support this, extensions have been made to data within the ItemGroup and ItemRef domains. This allows noting fields which will have their values populated from direct capture from medical devices, or from a volunteer record within the database. Other such extensions exist throughout the ClinSpark data model. As such, domains which originate in the CDISC standard often contain a superset of the data described by CDISC and also additions created by IQVIA.

Org, Volunteer, Study, Sites, StudySites and Subjects

Key	Table	From CDISC?	Notes
1	org	No	An org represents an entity performing clinical research (CRO / CRU). Orgs can have multiple sites that execute studies.
2	study	Yes	This element collects static structural information about an individual study. A study is related to a given clinical trial protocol.
3	site	No	A site is a physical place belonging to an organization. An organization having multiple physical clinical sites will have multiple site rows.
4	study_site	No	An association between a physical site and a study. A study site is different than a physical location. Often, pharma sponsors will specify sites with arbitrary codes and those codes must pass through during data export time. In addition, this domain encapsulates recruitment efforts for a given study / site.
5	volunteer	No	A volunteer is someone who indicates that they are interested in participating in clinical research for the given org.
6	subject	Yes	Someone participating in clinical research within the context of a given study. Creates glue between the volunteer and the participation.

Form Definition

Key	Table	From CDISC?	Notes
1	StudyMetaData	Yes	StudyMetaData (MetadataVersion in CDISC) defines the types of study events, forms, item groups, and items that form the study data. This is basically an aggregation of all CRF design elements for a study.
2	Form	Yes	A Form (FormDef in CDISC) describes a type of form that can occur in a study. A form is basically a container for item groups. This class explicitly excludes the required 'repeating' (Yes|No) attribute from the domain because phase 1 studies are different, and it's likely that most of the forms will repeat as they are related to a study event. At ODM build time, we'll check for those forms that repeat and ensure that we create the repeating attribute properly.
3	ItemGroup	Yes	An ItemGroup (ItemGroupDef in CDISC) describes a type of item group that can occur within a study. It is basically a collection of related items in a given form.
4	Item	Yes	An Item (ItemDef in CDISC) describes a type of item that can occur within a study. Item properties include name, datatype, measurement units, range or codelist restrictions, and several other properties. It basically represents the definition of a piece of data collected.
5	ItemGroupRef	Yes	A reference to a given ItemGroup. This reference can hold data about the association.
6	ItemRef	Yes	A reference to a given Item. This reference can hold data about the association.
7	Method	Yes	ODM representation that allows the value of an Item to be computed. In ODM, this OID can be found on an ItemRef, meaning that the item is computed by this method via invocation of the formal expression.

Activity Plans

An activity plan is a schedule of events for a given cohort. Activity Plans do not appear in CDISC. In ODM, there is the notion of a FormRef. However, this design doesn't fit well with early phase trials where forms are commonly repeated for a given study event (e.g., many PKs in a given day). As such, FormRef is implicitly available by way of Scheduled Activities that are a part of a Segment / Activity Plan.

Key	Table	From CDISC?	Notes
1	Study	Yes	This element collects static structural information about an individual study. A study is related to a given clinical trial protocol.
2	Activity Plan	No	A schedule of events for a given cohort. Plans can be assigned to multiple cohorts. A timed plan must have a reference time in order to properly provide UI feedback as segments and scheduled activities are set. Untimed Activity Plan: reference time must be null; single segment, which must be root and must have offset seconds set to zero. Timed Activity Plan: must have a reference time; can have 1-n segments, always sorted by offset seconds; the reference segment must have offset seconds of zero. The Activity Plan fills the role of the FormRef in the ODM.
3	Segment	Partially	Holds a group of scheduled activities in an activity plan. The segment's offset seconds is essentially the time of the reference event. All scheduled activities are relative to this. Modeled somewhat off of CDISC SDM: "Segments are often seen as the basic building blocks of study design. A segment usually specifies a combination of planned observations and interventions, which may or may not involve treatment, during a period of time."
4	Scheduled Activity	No	Wraps a form, but adds metadata including timing.
5	Form	Yes	A form is basically a container for item groups.
6	Study Event	Yes	A study event represents a given 'visit'. In phase 1 trials this will commonly simply refer to a 'day'. When scheduling forms for a given schedule, the builder must associate the study event. Note: there are common study events that are typically reserved for special events: unscheduled, common (AE, CM), etc
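To make the timed and untimed Activity Plan rules in the table above more concrete, here is a minimal Python sketch that checks them. It is not ClinSpark code: the class names, field names, and error messages are hypothetical, and the reference segment rule is interpreted here as "at least one segment has offset seconds of zero", which is an assumption.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Segment:
    offset_seconds: int
    is_root: bool = False


@dataclass
class ActivityPlan:
    reference_time: Optional[str]  # None means an untimed plan
    segments: List[Segment] = field(default_factory=list)


def validate_plan(plan: ActivityPlan) -> List[str]:
    """Return a list of rule violations; an empty list means the plan passes."""
    errors = []
    if plan.reference_time is None:
        # Untimed: exactly one segment, which is root and has offset zero.
        if len(plan.segments) != 1:
            errors.append("untimed plan must have exactly one segment")
        else:
            seg = plan.segments[0]
            if not seg.is_root:
                errors.append("untimed plan segment must be root")
            if seg.offset_seconds != 0:
                errors.append("untimed plan segment must have offset 0")
    else:
        # Timed: 1-n segments, including a reference segment at offset zero.
        if not plan.segments:
            errors.append("timed plan must have at least one segment")
        elif not any(s.offset_seconds == 0 for s in plan.segments):
            errors.append("timed plan needs a reference segment with offset 0")
    return errors


def sorted_segments(plan: ActivityPlan) -> List[Segment]:
    """Segments are always presented in order of offset seconds."""
    return sorted(plan.segments, key=lambda s: s.offset_seconds)
```

For example, validate_plan(ActivityPlan(None, [Segment(0, is_root=True)])) returns an empty list, while a timed plan whose only segment has a non-zero offset is reported as missing its reference segment.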

Epochs and Cohorts

Key	Table	From CDISC?	Notes
1	epoch	No	An epoch is typically specified in a study protocol and typically signifies a milestone-type event within the trial, e.g., screening, treatment, follow-up, etc.
2	cohort	No	A cohort is a way to break up epochs into different groupings. Protocols will occasionally indicate that epochs should be broken up (perhaps in different trial arms to test different dose levels, etc), or this can just purely be an organizational thing.
3	cohort_assignment	No	Binds a Volunteer as a Subject within a given cohort along with an Activity Plan. We allow assigning different schedules, and we allow for later scheduling a subject.
4	activity_plan	No	A schedule of events for a given cohort.
6	subject	Yes	Someone participating in clinical research within the context of a given study.

Clinical Data

Parallel data and definition objects in CDISC

CDISC ODM defines a pattern for modeling the definition of a clinical data element and the data captured. This pattern has one element describe the definition, and another element record the data for a particular clinical collection of this element type. ClinSpark adheres to this pattern, though as noted in the table below, the suffix "Def" is removed from the definition elements. Here are the examples:

Data Definition Element	Data Capture Element	Notes
Item (ItemDef in CDISC)	ItemData	Represent the definition of a piece of data collected
ItemGroup (ItemGroupDef in CDISC)	ItemGroupData	Aggregates item data
Form (FormDef in CDISC)	FormData	Basically a container for item groups
StudyEvent (StudyEventDef in CDISC)	StudyEventData	Clinical data for a study event (visit) for a given subject

With this in mind, below is a view of how these elements relate to each other.

Note that form_data can have linkage to forms through scheduled activities or unscheduled.

ItemGroupData has an item_group_repeat_key which is used to track repeats.

Key	Table	From CDISC?	Notes
1	form	Yes	Basically a container for item groups.
2	form_data	Yes	Form data represents data collected for a given subject. Instead of storing the scheduled time on this domain, we leverage the relationship to the encapsulated scheduledActivity domain and thus its relationship to the segment. We purposely don't set formRepeatKey in the domain. It is calculated later on when building ODM.
3	item_group	Yes	A collection of related items in a given form
4	item_group_ref	Yes	A given item group within a form definition.
5	item_group_data	Yes	Aggregates item data
6	item	Yes	Represents the definition of a piece of data collected.
7	item_ref	Yes	A given item within an item group definition.
8	item_data	Yes	A piece of collected data
9	study_event	Yes	A study event represents a given 'visit'. In phase 1 trials this will commonly simply refer to a 'day'. When scheduling forms for a given schedule, the builder must associate the study event. Note: there are common study events that are typically reserved for special events: unscheduled, common (AE, CM), etc
10	study_event_data	Yes	Clinical data for a study event (visit) for a given subject

Study Data

Key	Table	From CDISC?	Notes
1	item_data	Yes	A piece of collected data
2	item_group_data	Yes	Aggregates item data
3	form_data	Yes	Form data represents data collected for a given subject. Instead of storing the scheduled time on this domain, we leverage the relationship to the encapsulated scheduledActivity domain and thus its relationship to the segment.
4	study_event_data	Yes	Clinical data for a study event (visit) for a given subject
5	study_event	Yes	A study event represents a given 'visit'. In phase 1 trials this will commonly simply refer to a 'day'. When scheduling forms for a given schedule, the builder must associate the study event. Note: there are common study events that are typically reserved for special events: unscheduled, common (AE, CM), etc
6	subject	Yes	Someone participating in clinical research within the context of a given study.
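As a rough illustration of how the study data tables above might be joined, here is a minimal Python sketch. It uses an in-memory SQLite database as a stand-in for the MySQL replica, and every column name other than item_group_repeat_key (for example subject_id, form_data_id, value, screening_number) is an assumption made for illustration. The foreign-key chain simply follows the ODM nesting; check the schema documentation provided for the real names.

```python
import sqlite3

# In-memory SQLite stands in for the replica; column names are hypothetical.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE subject          (id INTEGER PRIMARY KEY, screening_number TEXT);
CREATE TABLE study_event_data (id INTEGER PRIMARY KEY, subject_id INTEGER);
CREATE TABLE form_data        (id INTEGER PRIMARY KEY, study_event_data_id INTEGER);
CREATE TABLE item_group_data  (id INTEGER PRIMARY KEY, form_data_id INTEGER,
                               item_group_repeat_key INTEGER);
CREATE TABLE item_data        (id INTEGER PRIMARY KEY, item_group_data_id INTEGER,
                               value TEXT);

INSERT INTO subject          VALUES (1, 'S001');
INSERT INTO study_event_data VALUES (10, 1);
INSERT INTO form_data        VALUES (100, 10);
INSERT INTO item_group_data  VALUES (1000, 100, 1), (1001, 100, 2);
INSERT INTO item_data        VALUES (1, 1000, '120'), (2, 1001, '118');
""")

# Walk from each collected value up through its item group, form, and
# study event to the subject it belongs to.
QUERY = """
SELECT s.screening_number, igd.item_group_repeat_key, d.value
FROM item_data d
JOIN item_group_data igd  ON igd.id = d.item_group_data_id
JOIN form_data fd         ON fd.id = igd.form_data_id
JOIN study_event_data sed ON sed.id = fd.study_event_data_id
JOIN subject s            ON s.id = sed.subject_id
ORDER BY d.id
"""

rows = conn.execute(QUERY).fetchall()
for row in rows:
    print(row)
```

The same SELECT shape should carry over to a MySQL client once the real column names are substituted.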

Queries

Queries and Annotations can be seen on ItemData entities like this:

Key	Table	From CDISC?	Notes
1	item_data_query	No	A query related to a piece of item data
2	item_data_query_comment	No	A given comment related to a given query
3	item_data_sample_annotation	No	Comments added directly on the item

Lab Data

Lab data can be reached via ItemData.

Key	Table	From CDISC?	Notes
1	item_data	Yes	A piece of collected data. Note that lab orders and all results are associated to a given Item Data. You will need to join through Item Data when working with lab data.
2	base_specimen	Yes (CDISC LAB)	Modeled off of CDISC Lab. A specimen is collected from a subject and assigned to a given item data instance. There can be multiple batteries (test groups) associated to a given specimen. Combines Accession Level and Base Specimen from spec.
3	base_battery	Yes (CDISC LAB)	A panel related to a specimen - typically this is just a 1 to 1 relationship, meaning there is often 1 base_battery for a single specimen.
4	base_test_result	Yes (CDISC LAB)	Combines CDISC Lab BaseTest and BaseResult. These are the results from the lab.
5	lab_order	No	When specimens are collected, this domain represents that an order is generated. It causes a manifest file to be created (PDF) and potentially a file order to be dumped on to the file system and made available to web services.
6	lab_interface	No	Encapsulates how to send and receive orders and results from a particular safety lab. Sites may have multiple labs, and if so each of these will have their own lab interface instance.
7	study_lab_panel	No	Something that can be ordered from item level
8	specimen_container	No	When setting up samples or labs, users can optionally choose a container.
9	lab_repeat	No	A domain that allows for the management of lab repeat workflows
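The requirement to join through Item Data when working with lab data can be sketched as follows. This is a minimal Python example using in-memory SQLite in place of the MySQL replica. The linking columns (item_data_id, base_specimen_id, base_battery_id) and the result columns (test_code, result_value) are hypothetical names chosen for illustration, as are the sample rows.

```python
import sqlite3

# In-memory SQLite stands in for the replica; column names are hypothetical.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE item_data        (id INTEGER PRIMARY KEY);
CREATE TABLE base_specimen    (id INTEGER PRIMARY KEY, item_data_id INTEGER);
CREATE TABLE base_battery     (id INTEGER PRIMARY KEY, base_specimen_id INTEGER);
CREATE TABLE base_test_result (id INTEGER PRIMARY KEY, base_battery_id INTEGER,
                               test_code TEXT, result_value TEXT);

INSERT INTO item_data        VALUES (1), (2);
INSERT INTO base_specimen    VALUES (10, 1), (11, 2);
INSERT INTO base_battery     VALUES (100, 10), (101, 11);
INSERT INTO base_test_result VALUES (1000, 100, 'GLUC', '5.1');
""")

# Results for one piece of item data: item_data -> specimen -> battery -> result.
RESULTS_FOR_ITEM = """
SELECT d.id, r.test_code, r.result_value
FROM item_data d
JOIN base_specimen sp   ON sp.item_data_id = d.id
JOIN base_battery b     ON b.base_specimen_id = sp.id
JOIN base_test_result r ON r.base_battery_id = b.id
WHERE d.id = ?
"""

# Specimens with no results yet: LEFT JOINs keep specimens that have no
# matching result rows, and COUNT(r.id) ignores the resulting NULLs.
PENDING_SPECIMENS = """
SELECT sp.id
FROM base_specimen sp
LEFT JOIN base_battery b     ON b.base_specimen_id = sp.id
LEFT JOIN base_test_result r ON r.base_battery_id = b.id
GROUP BY sp.id
HAVING COUNT(r.id) = 0
ORDER BY sp.id
"""

results = conn.execute(RESULTS_FOR_ITEM, (1,)).fetchall()
pending = conn.execute(PENDING_SPECIMENS).fetchall()
print(results, pending)
```

The second query is included because a question like "which specimens are still awaiting results" needs outer joins rather than the inner joins used for reading results.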

Samples

Key	Table	From CDISC?	Notes
1	sample_path_step		A step in the path to fulfill a given sample
2	sample_task		Coarse grained description of what needs to be done for a given sample path step; this could be an enum on the class, but having it here gives a bit more flexibility and can encode some further instructions.
3	sample_path		Container for the sample path steps
4	specimen_container		When setting up samples or labs, users can optionally choose a container
5	sample_transfer		A given step may require a transfer of sample material; this domain captures that
6	sample_shipment		Signifies a sample container being removed
7	sample_container		Something that we place sample path step data into
8	sample_container_item		An individual sample that is placed into a container
9	sample_batch_data	No	Allows for doing a step with a group of samples. Furthermore, it allows the steps for the relevant sample path step data objects to span a period of time. Non-ODM.
10	sample_path_step_data	No	For forms that require sample data to be collected, this domain tracks accordingly. Non-ODM.
11	item_data	Yes	All sample data can be joined through item_data_id

Volunteers

Volunteers are not a part of CDISC.

Key	Table	From CDISC?	Notes
1	volunteer	No	Someone who indicates that they are interested in participating in clinical research for the given org.
2	volunteer_medical_condition	No	Associates a given condition to a given volunteer
3	volunteer_note	No	A simple note that can be attached to the volunteer record
4	volunteer_correspondence	No	Represents a call or text to / from a volunteer by way of Twilio
5	volunteer_substance_use	No	We purposely don't track SUOCCUR; it allows us to indicate that the volunteer is not using the substance.
6	recruitment_appointment	No	Allows for a given volunteer to be assigned to a given time slot
7	study_site	No	An association between a study and a site
8	subject	Yes	Someone participating in clinical research within the context of a given study.

Users

Key	Table	From CDISC?	Notes
1	application_user	No	A user in the system
2	study	Yes	In this context, these are study restrictions. When studies appear here, the user is restricted to working only with data from those studies.
3	application_user_role	No	Roles which the user has
4	application_user_role_secure_actions	Yes	Secure Actions are permissions which the role entitles the user to.

Read Replica Usage

There are a wide variety of usage patterns for customer Read Replicas. This is your data, so use it as your business needs require. The following are a few common patterns presented as examples.

Integration to your Existing Data Warehouse

Customers who have existing data warehouses or data-marts may choose to integrate ClinSpark data into these repositories. This can be done using the SSH channel provided.
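One possible shape for such an integration is a batched table copy over a database connection. The sketch below is only an illustration under stated assumptions: SQLite stands in for both the replica and the warehouse, the copy_table helper and the name column on study are hypothetical, and the "?" placeholder style is SQLite's (MySQL drivers typically use a different placeholder style).

```python
import sqlite3


def copy_table(source, target, table, columns, batch_size=500):
    """Copy all rows of `table` from source to target in batches; return the row count."""
    # Table and column names are interpolated into SQL, so they must come
    # from trusted configuration, never from user input.
    col_list = ", ".join(columns)
    placeholders = ", ".join("?" for _ in columns)
    read = source.cursor()
    read.execute(f"SELECT {col_list} FROM {table}")
    copied = 0
    while True:
        rows = read.fetchmany(batch_size)  # bounded memory per batch
        if not rows:
            break
        target.executemany(
            f"INSERT INTO {table} ({col_list}) VALUES ({placeholders})", rows)
        copied += len(rows)
    target.commit()
    return copied


# Stand-in for the read replica, reached through the SSH channel in practice.
source = sqlite3.connect(":memory:")
source.execute("CREATE TABLE study (id INTEGER PRIMARY KEY, name TEXT)")
source.executemany("INSERT INTO study VALUES (?, ?)",
                   [(1, "A"), (2, "B"), (3, "C")])

# Stand-in for the customer warehouse.
target = sqlite3.connect(":memory:")
target.execute("CREATE TABLE study (id INTEGER PRIMARY KEY, name TEXT)")

copied = copy_table(source, target, "study", ["id", "name"], batch_size=2)
print(copied)
```

A real integration would likely add incremental loading and error handling; this sketch only shows the read-in-batches, write-in-batches pattern.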

Ad Hoc SQL Queries

Nearly all SQL tools connect to MySQL databases. So whatever SQL tool you typically use should work with very little configuration. Use the tool that is most familiar to you. If you do not have a tool preference, one option may be the free MySQL Workbench, which is a fairly full featured tool. Using a SQL tool like this is useful for creating queries to answer questions on the fly. It can also be used to generate simple reports, perform data modeling (the above diagrams were created with this tool), etc.

The connection instructions above show how to connect MySQL Workbench to your replica. And there is extensive help online and in the app for using the tool from there.

Business Intelligence Tools

The ClinSpark Read Replica can be used with any Business Intelligence (BI) tool which operates on relational data. BI tools are very popular these days, and there are a wide variety of vendors.

Here is an example of one way that the Read Replica can be connected to a customer-hosted BI tool called Tableau:

Tableau

As shown above, it is easy to connect Tableau to Read Replica data through an SSH tunnel to a local workstation or gateway. The standard Tableau Desktop connection wizard will guide you through the steps to connect from there. Please contact a Tableau representative for more details. They provide consulting services and training.

Here is a presentation on using Tableau with the Read Replica from the 2017 ClinSpark User Group conference.

Tibco Spotfire

Tibco Spotfire can access the Read Replica in the same manner as Tableau.

AWS Quicksight

AWS Quicksight is a lower-cost BI tool than Tableau or Spotfire, and may be adequate depending on the use case. It may be possible to connect a customer AWS Quicksight account to a Replica exposed via SSH. We have limited experience connecting Quicksight accounts to IQVIA RDS replicas. This approach will not work using customer-owned AWS Quicksight accounts. This is a topic the ClinSpark engineering team is still investigating. Customers interested in using AWS Quicksight should reach out via service desk ticket.

Crystal Reports or Similar

Crystal Reports and similar products can all operate on MySQL databases. As such, they can connect to and use the Read Replica. It is possible to create customer-specific reports using these tools.

Documentation

Read Replica Introductory Training