Skip to content

Data Dictionary

The YES3 Exporter II Data Dictionary includes detailed metadata for every exported variable, as well as data distribution summaries.

One intended use for the YES3 Exporter II data dictionary is for external applications to generate and execute fine-tuned data import code. For example, the observed maximum string length can be used to tailor SAS variable attributes to exactly match the exported data. Once a datamart has been rebuilt, an external process can use the data distribution information in the data dictionary to validate the import - or to create and deploy a simple summary data report.

Data distribution summaries

The data distribution summaries stored in the YES3 Exporter II data dictionary were originally conceived as an aid to fine-tuned code generation for datamart builds.

Data validations

The distribution summaries can be used to validate a data management process, for example an R or SAS program to import data into a datamart. Be sure to make appropriately "fuzzy" comparisons for distributional parameters like mean and variance. If I (PC) remember correctly, SAS has a bit-level truncation function that's useful in this regard.

Ad hoc reports

We have found that the data distribution summaries are also convenient for data completion reports, study monitoring reports and to check for extreme values. Always: make sure that your intended purpose wouldn't be achieved more simply and elegantly by the reporting and project portal tools provided by REDCap.

Data Dictionary Specification

Each row of a YES3 Exporter II Data Dictionary matches one column in the exported dataset, in the same order. Below is a table of column specifications.

column name column description
var_name The variable name. This will either be a raw REDCap field name, a REDCap Attribute field name or for horizontal layouts, a composite field name based on the event and REDCap field name
var_label The label for the variable. It is based on the REDCap field label, but is subject to the truncation and 'sanitizing' options of the export specification.
var_type The variable's data type (see "Variable Types")
valueset NOMINAL and CHECKBOX variables only: A JSON-encoded array of value/label pairs.
origin Indicates the data source for the variable: redcap = REDCap field, other = other REDCap attribute such as the event, DAG or instance.
redcap_field_name the name of the underlying REDCap field, if relevant
redcap_form_name the form associated with the underlying REDCap field, if relevant
redcap_event_id HORIZONTAL LAYOUTS ONLY: the REDCap event id associated with the variable.
redcap_event_name HORIZONTAL LAYOUTS ONLY: the REDCap event name associated with the variable.
non_missing_count The count of non-blank values for this variable, as observed in the exported records.
min_length For TEXT variables: the minimum string length - after any truncations or sanitations - observed in the exported records.
max_length For TEXT variables: the maximum string length - after any truncations or sanitations - observed in the exported records.
min_value For numeric and date/time variables: the minimum value observed in the exported records. For date/time variables this is a PHP time value, i.e. the count of seconds since 1/1/1970 00:00:00
max_value For numeric and date/time variables: the maximum value observed in the exported records. For date/time variables this is a PHP time value, i.e. the count of seconds since 1/1/1970 00:00:00
sum_of_values For numeric and date/time variables: the sum of the values observed in the exported records
sum_of_squared_values For numeric and date/time variables: the sum of the squared values observed in the exported records
mean For numeric and date/time variables: the average of the values observed in the exported records
standard_deviation For numeric and date/time variables: the sample standard deviation of the values observed in the exported records
formatted_min_value The minimum observed value, formatted as necessary for DATE, DATETIME and TIME variables.
formatted_max_value The maximum observed value, formatted as necessary for DATE, DATETIME and TIME variables.
formatted_mean The average observed value, formatted as necessary for DATE, DATETIME and TIME variables.
frequency_table For NOMINAL variables: a JSON-encoded table of discrete values and their frequency counts.