home View on GitHub

The YES3 Exporter

User Guide, Technical Guide, Change Log and more

Technical Guide

YES3 Exporter Version 1.0.0, August 2022

Intended Audience

This guide is intended for individuals who will be writing programs to process files generated by the YES3 Exporter. These include data managers and statistical programmers writing programs to import YES3 Exporter data into statistical systems such as SAS and R, as well as application developers seeking to integrate the YES3 Exporter into datamart generation and validation workflows.

The YES3 technical team

The following individuals participate in one aspect or another - usually multiple aspects - of YES3 software development.

Katy Araujo ninja, testing, documentation
Venugopal Bhatia, application developer
Peter Charpentier, lead EM developer
Kraig Eisenman, IT system engineer
Brian Funaro, lead IT system engineer
Mary Geda, program manager, documentation lead
Kaitlin Maciejewski, testing
Baylah Tessier-Sherman, ninja, testing
Janet Truebig, ninja, testing
Sui Tsang, lead ninja, testing lead, documentation

YES3 Exporter file encoding

The YES3 Exporter generates UTF-8 encoded text files. While not required by the UTF-8 specification, for compatibility the byte order mark (BOM) is included. We have found that the BOM is needed for Microsoft products like Excel. Rows are terminated by a new line (0A), but this might be implementation-specific (our hosts run Ubuntu 20.04 and PHP 7.4). It might be best to anticipate a possible CRLF (0D0A) row termination.

YES3 Exporter destinations

As explained in the README, there are two export destinations: download and filesystem.

The download destination offers the options of downloading the exported data, the data dictionary, or a .zip package that includes both (see “The YES3 Exporter Payload,” below).

To configure filesystem exports for a project, filesystem exports must be enabled by a REDCap administrator in the YES3 Exporter system EM configuration settings. Then a host filesystem target folder must be specified in the YES3 Exporter project EM configuration settings. The host filesyetem target folder should be a mount point for a secure institutional fileshare.

A future version of the YES3 Exporter will support daily, automated filesystem exports.

Managing multiple datamarts

Since host filesystem mount points must be configured by ITS system administrators, and probably approved by an Information Security Officer, in most implementations filesystem exports will be reserved for large studies having the resources and a compelling case for automated datamart builds. However, filesystem export targets for large numbers of studies could be facilitated using the following approach:

  1. Create a single mount point as a folder on a secure, very restricted institutional fileshare.
  2. Create a subfolder for each study requiring a filesystem export. This will be that study’s filesystem export target - but will not be accessible to that study.
  3. Build an external application that will sweep through the daily exports, validate them against the REDCap database and if validated, pass each to a folder that is accessible to that export’s study staff.

The YES3 Exporter Payload

When exporting to a host filesystem folder, or downloading to a .zip export package, a “payload” of three files is delivered:

File Naming convention Description
export information file [normalized export name]_info_[timestamp].json information about the project and the export settings
export data dictionary [normalized export name]_dd_[timestamp].csv metadata and data distributions, tailored to the specific export
export data file [normalized export name]_data_[timestamp].csv the exported data

The export name is “normalized” by converting to lower case and replacing spaces and other non-alphanumeric characters with underscores.

Future versions of the YES3 Exporter will include execution-ready generated code - in SAS, R and possibly other formats - tailored to the specific export. This code will be added to the YES3 Exporter payload. If you are writing code for importing YES3 Exports into datamarts and would like to share tips and techniques with us, please let us know.

How export specifications are identified

The REDCap EM logging feature provides an elegant way to maintain specification backups. In order to track specification backups across EM log entries, each export specification is identified by the EM parameter export_uuid. This value is an RFC-compliant universally unique identifier, generated by the Javascript function YES3.uuidv4 function in js/yes3.js.

EM Log records

The YES3 Exporter makes extensive use of External Module logs, both for logging actions such as downloads and filesystem exports and for storing the export specifications and event prefixes. An EM log parameter named log_entry_typeindicates the function of the log record. Below is a listing of each log_entry_type value and the purpose of the associated EM log record.

The valid entries for log_entry_type, shown below, are found in defines/defines.php

log_entry_type EM Log record purpose includes
export_uuid
yes3-export-specification an export specification yes
yes3-export-events event prefixes no
yes3-export-log an event associated with an export or a download yes
yes3-export-cron-log a daily cron summary across all projects and exports no

The YES3 Exporter Information file

The YES3 Export Information file is a JSON-encoded object that provides information about the project and the export specification.

Object properties:

Property Example Comments
host http://redcap@trantor.gov/  
timestamp 2022-06-25 09:20:22  
project_id 16  
project_recordid_field screenid  
project_title YES3 Test Project  
project_is_longitudinal 1  
project_has_dags 1  
export_name DCARE outcomes  
export_layout h v=vertical, h=horizontal, r=repeating form
export_uuid 8d99f980-fa35-4f3a-9f33-3cf302568a33 export ID
export_target_folder /mnt/exports/dcare/ for filesystem exports
path /mnt/exports/dcare/dcare_outcomes_data_22062592022.csv, /tmp/ys30W6gjh either the full path of a filesystem export or the temporary file name used during download generation
bytes_written 66216  
columns 273  
rows 221  
destination download download or filesystem
notification_email joe_user@trantor.gov email address for project notifications, defined in the project EM configuration
username palooka  

YES3 Exporter Variables

Variable Types

Type format comments/associated REDCap metadata
INTEGER   REDCap field type slider or REDCap field validation integer
FLOAT w.d as determined by PHP REDCap field type calc or REDCap field validation float. 14 significant digits (IEEE 64-bit).
DATE yyyy-mm-dd REDCap field validations date_mdy, date_ymd, date_dmy, date_dmy
DATETIME yyyy-mm-dd hh:nn:ss REDCap field validations datetime_mdy, datetime_ymd, datetime_dmy, datetime_seconds_mdy, datetime_seconds_ymd, datetime_seconds_dmy
TIME hh:nn:ss REDCap field validation time
NOMINAL value set REDCap field types radio, yesno, truefalse, select (dropdown)
CHECKBOX value set REDCap field type checkbox
TEXT   REDCap field type paragraph/notes, REDCap field type sql or REDCap field type text with no REDCap field validation

CHECKBOX data values: If a “check all that apply” checkbox has multiple items, they will be returned in the export data as a comma-separated list of checked values. This is equivalent to the combine_checkbox_values option of the REDCap::getData() method.

REDCap Attribute variables

Depending on the export layout and study design, an exported dataset may include one or more “REDCap Attribute” variables. These are the intrinsic REDCap attributes that may be associated with an output record: data access group, event and instance. Where possible we have used the same names that are used by the REDCap API. The full list is shown below.

variable name type example
redcap_data_access_group_id INTEGER 31
redcap_data_access_group_name TEXT utmb
redcap_event_id INTEGER 4159
redcap_event_name TEXT baseline_arm_1
redcap_repeat_instance INTEGER 1

REDCap Data variables

All exported datasets include REDCap data values of course, and these are stored in REDCap Data variables. For vertical and repeated export layouts, the variable names for REDCap data will be the same as the underlying REDCap field_name. For horizontal layouts the variable names will be composites of the REDCap event prefix and the REDCap field name.

Work with REDCap project designers to ensure that for longitudinbal studies the length of no composite field name exceeds the limits imposed by any statistical packages that will be used to process the exports.

For example, the variable name limit for SAS and STATA is 32 characters, for MATLAB the limit is 63 characters and for R the limit is 64 characters. You should be okay, even considering the event prefixes, if field names adhere to the length warnings broadcast by the REDCap Field Editor.

Export Data Layouts

The YES3 Exporter currently supports three export layouts: Vertical(v), Horizontal(h) and Repeating forms(r).

For all export layouts the first field or ‘variable’ in the exported record will be the recordid, and the variable name will be the recordid field name as defined for the specific project.

Depending on the layout and study design, the recordid variable may be followed by a collection of “REDCap Attribute” variables that indicate the REDCap data access group and event.

The remainder of the exported record is the collection of values for the selected fields, in the order specified. The variable name of each will be either the REDCap field name (vertical and repeating form layouts), or a composite of the REDCap event prefix and the REDCap field name (horizontal layout).

Following are more detailed descriptions of the export layouts.

The Vertical Export Layout

For classic (cross sectional) study designs, the vertical export layout has one row for each record. For longitudinal study designs, the vertical export layout has one row per observed record and event combination.

Repeating forms are not allowed in vertical export layouts.

For a longitudinal study, a vertical layout record will be exported only for events for which data are observed for the record. No empty records will be exported.

Export record specification for a Vertical layout

variable description comments
[recordId] The record unique identifier study-specific variable name
redcap_data_access_group_id The numeric DAG id only present if DAGs are assigned to records
redcap_data_access_group_name The unique DAG name only present if DAGs are assigned to records
redcap_event_id The numeric event id only present for longitudinal studies
redcap_event_name The unique REDCap event name only present for longitudinal studies
[fieldname] the first selected field value variable name is the underlying REDCap field name
   
[fieldname] the last selected field value variable name is the underlying REDCap field name

The Horizontal Export Layout

Horizontal layouts are allowed only for longitudinal study designs, and have one row for each record.

Repeating forms are not allowed in horizontal export layouts.

A horizontal layout record will have one variable per specified REDCap field and event combination. This can lead to exported datasets having enormous column counts and many empty cells. In general, vertical layouts are more efficient than horizontal layouts. However, for most studies horizontal layouts have the advantage of having one row per analysis unit (record), and are therefore “analysis ready” in many contexts.

Export record specification for a Horizontal layout

variable description comments
[recordId] The record unique identifier study-specific variable name
redcap_data_access_group_id The numeric DAG id only present if DAGs are assigned to records
redcap_data_access_group_name The unique DAG name only present if DAGs are assigned to records
[composite fieldname] the first selected field/event value variable name is [event prefix]_[ REDCap field name]
   
[composite fieldname] the last selected field/event value variable name is [event prefix]_[ REDCap field name]

The Repeating Form Export Layout

For classic (cross sectional) study designs, the repeating form export layout has one row for each observed record and instance combination.

For longitudinal study designs, the repeating form export layout has one row per observed record, event and instance combination.

Note that only one form is allowed in a repeating form layout. This is because ‘instance’ normally has no meaning across forms. We are considering a new repeating event export layout that would allow multiple forms. Please let us know if the repeating event layout would be useful for your studies.

Export record specification for a Repeating Form layout

variable description comments
[recordId] The record unique identifier study-specific variable name
redcap_data_access_group_id The numeric DAG id only present if DAGs are assigned to records
redcap_data_access_group_name The unique DAG name only present if DAGs are assigned to records
redcap_event_id The numeric event id only present for longitudinal studies
redcap_event_name The unique REDCap event name only present for longitudinal studies
redcap_repeat_instance the sequential instance value (1, 2, …)  
[fieldname] the first selected field value variable name is the underlying REDCap field name
   
[fieldname] the last selected field value variable name is the underlying REDCap field name

The YES3 Exporter Data Dictionary

The YES3 Exporter Data Dictionary includes detailed metadata for every exported variable, as well as data distribution summaries.

One intended use for the YES3 Exporter data dictionary is for external applications to generate and execute fine-tuned data import code. For example, the observed maximum string length can be used to tailor SAS variable attributes to exactly match the exported data. Once a datamart has been rebuilt, an external process can use the data distribution information in the data dictionary to validate the import - or to create and deploy a simple summary data report.

Data distributions summaries

The data distribution summaries stored in the YES3 Exporter data dictionary were originally conceived as an aid to fine-tuned code generation for datamart builds.

Data validations

The distribution summaries can be used to validate a data management process, for example an R or SAS program to import data into a datamart. Be sure to make appropriately “fuzzy” comparisons for distributional parameters like mean and variance. If I (PC) remember correctly, SAS has a bit-level truncation function that’s useful in this regard.

For performance reasons, we don’t include quantiles in the export data dictionaries. There might be other, less expensive calulations that we can add if there is any interest. For example, the geometric mean was once used as a proxy for the median in skewed distributions. In those days CPU cycles cost real money!

Ad hoc reports

We have found that the data distribution summaries are also convenient for data completion reports, study monitoring reports and to check for extreme values. Always: make sure that your intended purpose wouldn’t be achieved more simply and elegantly by the reporting and project portal tools provided by REDCap.

YES3 Exporter Data Dictionary specification

Each row of a YES3 Exporter Data Dictionary matches one column in the exported dataset, in the same order. Below is a table of column specifications.

column name column description
var_name The variable name. This will either be a raw REDCap field name, a REDCap Attribute field name or for horizontal layouts, a composite field name based on the event and REDCap field name
var_label The label for the variable. It is based on the REDCap field label, but is subject to the truncation and ‘sanitizing’ options of the export specification.
var_type The variable’s data type (see “Variable Types”)
valueset NOMINAL and CHECKBOX variables only: A JSON-encoded array of value/label pairs.
origin Indicates the data source for the variable: redcap = REDCap field, other = other REDCap attribute such as the event, DAG or instance.
redcap_field_name the name of the underlying REDCap field, if relevant
redcap_form_name the form associated with the underlying REDCap field, if relevant
redcap_event_id HORIZONTAL LAYOUTS ONLY: the REDCap event id associated with the variable.
redcap_event_name HORIZONTAL LAYOUTS ONLY: the REDCap event name associated with the variable.
non_missing_count The count of non-blank values for this variable, as observed in the exported records.
min_length For TEXT variables: the minimum string length - after any truncations or sanitations - observed in the exported records.
max_length For TEXT variables: the maximum string length - after any truncations or sanitations - observed in the exported records.
min_value For numeric and date/time variables: the minimum value observed in the exported records. For date/time variables this is a PHP time value, i.e. the count of seconds since 1/1/1970 00:00:00
max_value For numeric and date/time variables: the maximum value observed in the exported records. For date/time variables this is a PHP time value, i.e. the count of seconds since 1/1/1970 00:00:00
sum_of_values For numeric and date/time variables: the sum of the values observed in the exported records
sum_of_squared_values For numeric and date/time variables: the sum of the squared values observed in the exported records
mean For numeric and date/time variables: the average of the values observed in the exported records
standard_deviation For numeric and date/time variables: the sample standard deviation of the values observed in the exported records
formatted_min_value The minimum observed value, formatted as necessary for DATE, DATETIME and TIME variables.
formatted_max_value The maximum observed value, formatted as necessary for DATE, DATETIME and TIME variables.
formatted_mean The average observed value, formatted as necessary for DATE, DATETIME and TIME variables.
frequency_table For NOMINAL variables: a JSON-encoded table of discrete values and their frequency counts.