Technical Guide
YES3 Exporter Version 1.0.0, August 2022
Intended Audience
This guide is intended for individuals who will be writing programs to process files generated by the YES3 Exporter. These include data managers and statistical programmers writing programs to import YES3 Exporter data into statistical systems such as SAS and R, as well as application developers seeking to integrate the YES3 Exporter into datamart generation and validation workflows.
The YES3 technical team
The following individuals participate in one aspect or another - usually multiple aspects - of YES3 software development.
Katy Araujo ninja, testing, documentation
Venugopal Bhatia, application developer
Peter Charpentier, lead EM developer
Kraig Eisenman, IT system engineer
Brian Funaro, lead IT system engineer
Mary Geda, program manager, documentation lead
Kaitlin Maciejewski, testing
Baylah Tessier-Sherman, ninja, testing
Janet Truebig, ninja, testing
Sui Tsang, lead ninja, testing lead, documentation
YES3 Exporter file encoding
The YES3 Exporter generates UTF-8 encoded text files. While not required by the UTF-8 specification, for compatibility the byte order mark (BOM) is included. We have found that the BOM is needed for Microsoft products like Excel. Rows are terminated by a new line (0A), but this might be implementation-specific (our hosts run Ubuntu 20.04 and PHP 7.4). It might be best to anticipate a possible CRLF (0D0A) row termination.
YES3 Exporter destinations
As explained in the README, there are two export destinations: download and filesystem.
The download destination offers the options of downloading the exported data, the data dictionary, or a .zip package that includes both (see “The YES3 Exporter Payload,” below).
To configure filesystem exports for a project, filesystem exports must be enabled by a REDCap administrator in the YES3 Exporter system EM configuration settings. Then a host filesystem target folder must be specified in the YES3 Exporter project EM configuration settings. The host filesyetem target folder should be a mount point for a secure institutional fileshare.
A future version of the YES3 Exporter will support daily, automated filesystem exports.
Managing multiple datamarts
Since host filesystem mount points must be configured by ITS system administrators, and probably approved by an Information Security Officer, in most implementations filesystem exports will be reserved for large studies having the resources and a compelling case for automated datamart builds. However, filesystem export targets for large numbers of studies could be facilitated using the following approach:
- Create a single mount point as a folder on a secure, very restricted institutional fileshare.
- Create a subfolder for each study requiring a filesystem export. This will be that study’s filesystem export target - but will not be accessible to that study.
- Build an external application that will sweep through the daily exports, validate them against the REDCap database and if validated, pass each to a folder that is accessible to that export’s study staff.
The YES3 Exporter Payload
When exporting to a host filesystem folder, or downloading to a .zip export package, a “payload” of three files is delivered:
File | Naming convention | Description |
---|---|---|
export information file | [normalized export name]_info_[timestamp].json | information about the project and the export settings |
export data dictionary | [normalized export name]_dd_[timestamp].csv | metadata and data distributions, tailored to the specific export |
export data file | [normalized export name]_data_[timestamp].csv | the exported data |
The export name is “normalized” by converting to lower case and replacing spaces and other non-alphanumeric characters with underscores.
Future versions of the YES3 Exporter will include execution-ready generated code - in SAS, R and possibly other formats - tailored to the specific export. This code will be added to the YES3 Exporter payload. If you are writing code for importing YES3 Exports into datamarts and would like to share tips and techniques with us, please let us know.
How export specifications are identified
The REDCap EM logging feature provides an elegant way to maintain specification backups. In order to track specification backups across EM log entries, each export specification is identified by the EM parameter export_uuid
. This value is an RFC-compliant universally unique identifier, generated by the Javascript function YES3.uuidv4
function in js/yes3.js
.
EM Log records
The YES3 Exporter makes extensive use of External Module logs, both for logging actions such as downloads and filesystem exports and for storing the export specifications and event prefixes. An EM log parameter named log_entry_type
indicates the function of the log record. Below is a listing of each log_entry_type
value and the purpose of the associated EM log record.
The valid entries for log_entry_type, shown below, are found in
defines/defines.php
log_entry_type | EM Log record purpose | includes export_uuid |
---|---|---|
yes3-export-specification | an export specification | yes |
yes3-export-events | event prefixes | no |
yes3-export-log | an event associated with an export or a download | yes |
yes3-export-cron-log | a daily cron summary across all projects and exports | no |
The YES3 Exporter Information file
The YES3 Export Information file is a JSON-encoded object that provides information about the project and the export specification.
Object properties:
Property | Example | Comments |
---|---|---|
host | http |
|
timestamp | 2022-06-25 09:20:22 | |
project_id | 16 | |
project_recordid_field | screenid | |
project_title | YES3 Test Project | |
project_is_longitudinal | 1 | |
project_has_dags | 1 | |
export_name | DCARE outcomes | |
export_layout | h | v =vertical, h =horizontal, r =repeating form |
export_uuid | 8d99f980-fa35-4f3a-9f33-3cf302568a33 | export ID |
export_target_folder | /mnt/exports/dcare/ | for filesystem exports |
path | /mnt/exports/dcare/dcare_outcomes_data_22062592022.csv, /tmp/ys30W6gjh | either the full path of a filesystem export or the temporary file name used during download generation |
bytes_written | 66216 | |
columns | 273 | |
rows | 221 | |
destination | download | download or filesystem |
notification_email | joe_user@trantor.gov | email address for project notifications, defined in the project EM configuration |
username | palooka |
YES3 Exporter Variables
Variable Types
Type | format | comments/associated REDCap metadata |
---|---|---|
INTEGER | REDCap field type slider or REDCap field validation integer |
|
FLOAT | w.d as determined by PHP | REDCap field type calc or REDCap field validation float . 14 significant digits (IEEE 64-bit). |
DATE | yyyy-mm-dd | REDCap field validations date_mdy , date_ymd , date_dmy , date_dmy |
DATETIME | yyyy-mm-dd hh:nn:ss | REDCap field validations datetime_mdy , datetime_ymd , datetime_dmy , datetime_seconds_mdy , datetime_seconds_ymd , datetime_seconds_dmy |
TIME | hh:nn:ss | REDCap field validation time |
NOMINAL | value set | REDCap field types radio , yesno , truefalse , select (dropdown) |
CHECKBOX | value set | REDCap field type checkbox |
TEXT | REDCap field type paragraph/notes , REDCap field type sql or REDCap field type text with no REDCap field validation |
CHECKBOX data values: If a “check all that apply” checkbox has multiple items, they will be returned in the export data as a comma-separated list of checked values. This is equivalent to the
combine_checkbox_values
option of the REDCap::getData() method.
REDCap Attribute variables
Depending on the export layout and study design, an exported dataset may include one or more “REDCap Attribute” variables. These are the intrinsic REDCap attributes that may be associated with an output record: data access group, event and instance. Where possible we have used the same names that are used by the REDCap API. The full list is shown below.
variable name | type | example |
---|---|---|
redcap_data_access_group_id | INTEGER | 31 |
redcap_data_access_group_name | TEXT | utmb |
redcap_event_id | INTEGER | 4159 |
redcap_event_name | TEXT | baseline_arm_1 |
redcap_repeat_instance | INTEGER | 1 |
REDCap Data variables
All exported datasets include REDCap data values of course, and these are stored in REDCap Data variables. For vertical and repeated export layouts, the variable names for REDCap data will be the same as the underlying REDCap field_name. For horizontal layouts the variable names will be composites of the REDCap event prefix and the REDCap field name.
Work with REDCap project designers to ensure that for longitudinbal studies the length of no composite field name exceeds the limits imposed by any statistical packages that will be used to process the exports.
For example, the variable name limit for SAS and STATA is 32 characters, for MATLAB the limit is 63 characters and for R the limit is 64 characters. You should be okay, even considering the event prefixes, if field names adhere to the length warnings broadcast by the REDCap Field Editor.
Export Data Layouts
The YES3 Exporter currently supports three export layouts: Vertical(v
), Horizontal(h
) and Repeating forms(r
).
For all export layouts the first field or ‘variable’ in the exported record will be the recordid, and the variable name will be the recordid field name as defined for the specific project.
Depending on the layout and study design, the recordid variable may be followed by a collection of “REDCap Attribute” variables that indicate the REDCap data access group and event.
The remainder of the exported record is the collection of values for the selected fields, in the order specified. The variable name of each will be either the REDCap field name (vertical and repeating form layouts), or a composite of the REDCap event prefix and the REDCap field name (horizontal layout).
Following are more detailed descriptions of the export layouts.
The Vertical Export Layout
For classic (cross sectional) study designs, the vertical export layout has one row for each record. For longitudinal study designs, the vertical export layout has one row per observed record and event combination.
Repeating forms are not allowed in vertical export layouts.
For a longitudinal study, a vertical layout record will be exported only for events for which data are observed for the record. No empty records will be exported.
Export record specification for a Vertical layout
variable | description | comments |
---|---|---|
[recordId] | The record unique identifier | study-specific variable name |
redcap_data_access_group_id | The numeric DAG id | only present if DAGs are assigned to records |
redcap_data_access_group_name | The unique DAG name | only present if DAGs are assigned to records |
redcap_event_id | The numeric event id | only present for longitudinal studies |
redcap_event_name | The unique REDCap event name | only present for longitudinal studies |
[fieldname] | the first selected field value | variable name is the underlying REDCap field name |
… | ||
[fieldname] | the last selected field value | variable name is the underlying REDCap field name |
The Horizontal Export Layout
Horizontal layouts are allowed only for longitudinal study designs, and have one row for each record.
Repeating forms are not allowed in horizontal export layouts.
A horizontal layout record will have one variable per specified REDCap field and event combination. This can lead to exported datasets having enormous column counts and many empty cells. In general, vertical layouts are more efficient than horizontal layouts. However, for most studies horizontal layouts have the advantage of having one row per analysis unit (record), and are therefore “analysis ready” in many contexts.
Export record specification for a Horizontal layout
variable | description | comments |
---|---|---|
[recordId] | The record unique identifier | study-specific variable name |
redcap_data_access_group_id | The numeric DAG id | only present if DAGs are assigned to records |
redcap_data_access_group_name | The unique DAG name | only present if DAGs are assigned to records |
[composite fieldname] | the first selected field/event value | variable name is [event prefix]_[ REDCap field name] |
… | ||
[composite fieldname] | the last selected field/event value | variable name is [event prefix]_[ REDCap field name] |
The Repeating Form Export Layout
For classic (cross sectional) study designs, the repeating form export layout has one row for each observed record and instance combination.
For longitudinal study designs, the repeating form export layout has one row per observed record, event and instance combination.
Note that only one form is allowed in a repeating form layout. This is because ‘instance’ normally has no meaning across forms. We are considering a new repeating event export layout that would allow multiple forms. Please let us know if the repeating event layout would be useful for your studies.
Export record specification for a Repeating Form layout
variable | description | comments |
---|---|---|
[recordId] | The record unique identifier | study-specific variable name |
redcap_data_access_group_id | The numeric DAG id | only present if DAGs are assigned to records |
redcap_data_access_group_name | The unique DAG name | only present if DAGs are assigned to records |
redcap_event_id | The numeric event id | only present for longitudinal studies |
redcap_event_name | The unique REDCap event name | only present for longitudinal studies |
redcap_repeat_instance | the sequential instance value (1, 2, …) | |
[fieldname] | the first selected field value | variable name is the underlying REDCap field name |
… | ||
[fieldname] | the last selected field value | variable name is the underlying REDCap field name |
The YES3 Exporter Data Dictionary
The YES3 Exporter Data Dictionary includes detailed metadata for every exported variable, as well as data distribution summaries.
One intended use for the YES3 Exporter data dictionary is for external applications to generate and execute fine-tuned data import code. For example, the observed maximum string length can be used to tailor SAS variable attributes to exactly match the exported data. Once a datamart has been rebuilt, an external process can use the data distribution information in the data dictionary to validate the import - or to create and deploy a simple summary data report.
Data distributions summaries
The data distribution summaries stored in the YES3 Exporter data dictionary were originally conceived as an aid to fine-tuned code generation for datamart builds.
Data validations
The distribution summaries can be used to validate a data management process, for example an R or SAS program to import data into a datamart. Be sure to make appropriately “fuzzy” comparisons for distributional parameters like mean and variance. If I (PC) remember correctly, SAS has a bit-level truncation function that’s useful in this regard.
For performance reasons, we don’t include quantiles in the export data dictionaries. There might be other, less expensive calulations that we can add if there is any interest. For example, the geometric mean was once used as a proxy for the median in skewed distributions. In those days CPU cycles cost real money!
Ad hoc reports
We have found that the data distribution summaries are also convenient for data completion reports, study monitoring reports and to check for extreme values. Always: make sure that your intended purpose wouldn’t be achieved more simply and elegantly by the reporting and project portal tools provided by REDCap.
YES3 Exporter Data Dictionary specification
Each row of a YES3 Exporter Data Dictionary matches one column in the exported dataset, in the same order. Below is a table of column specifications.
column name | column description |
---|---|
var_name | The variable name. This will either be a raw REDCap field name, a REDCap Attribute field name or for horizontal layouts, a composite field name based on the event and REDCap field name |
var_label | The label for the variable. It is based on the REDCap field label, but is subject to the truncation and ‘sanitizing’ options of the export specification. |
var_type | The variable’s data type (see “Variable Types”) |
valueset | NOMINAL and CHECKBOX variables only: A JSON-encoded array of value/label pairs. |
origin | Indicates the data source for the variable: redcap = REDCap field, other = other REDCap attribute such as the event, DAG or instance. |
redcap_field_name | the name of the underlying REDCap field, if relevant |
redcap_form_name | the form associated with the underlying REDCap field, if relevant |
redcap_event_id | HORIZONTAL LAYOUTS ONLY: the REDCap event id associated with the variable. |
redcap_event_name | HORIZONTAL LAYOUTS ONLY: the REDCap event name associated with the variable. |
non_missing_count | The count of non-blank values for this variable, as observed in the exported records. |
min_length | For TEXT variables: the minimum string length - after any truncations or sanitations - observed in the exported records. |
max_length | For TEXT variables: the maximum string length - after any truncations or sanitations - observed in the exported records. |
min_value | For numeric and date/time variables: the minimum value observed in the exported records. For date/time variables this is a PHP time value, i.e. the count of seconds since 1/1/1970 00:00:00 |
max_value | For numeric and date/time variables: the maximum value observed in the exported records. For date/time variables this is a PHP time value, i.e. the count of seconds since 1/1/1970 00:00:00 |
sum_of_values | For numeric and date/time variables: the sum of the values observed in the exported records |
sum_of_squared_values | For numeric and date/time variables: the sum of the squared values observed in the exported records |
mean | For numeric and date/time variables: the average of the values observed in the exported records |
standard_deviation | For numeric and date/time variables: the sample standard deviation of the values observed in the exported records |
formatted_min_value | The minimum observed value, formatted as necessary for DATE, DATETIME and TIME variables. |
formatted_max_value | The maximum observed value, formatted as necessary for DATE, DATETIME and TIME variables. |
formatted_mean | The average observed value, formatted as necessary for DATE, DATETIME and TIME variables. |
frequency_table | For NOMINAL variables: a JSON-encoded table of discrete values and their frequency counts. |