Onyx to PDO

This is a discussion document and work in progress.

Detailed below are the assumptions that underlie the import of facts from the BRICCS Study questionnaire into i2b2. The primary focus here is on participant data. All assumptions are open for discussion.

The Target : i2b2 CRC Cell

The data from an Onyx export will eventually end up in the main i2b2 data warehouse, otherwise known as the CRC cell. The CRC cell has eight tables in which to store data:

  1. Observation facts
  2. Patient dimension
  3. Visit dimension
  4. Patient mapping
  5. Event mapping
  6. Concept dimension
  7. Provider dimension
  8. Code lookup

The concept dimension can be considered immaterial to the routine import of Onyx data into i2b2. The assumption here is that the concept dimension is kept in sync with the ontology cell, where the main ontology data is held. Currently, this assumption is partly safeguarded by earlier steps in the process, which update both the ontology cell and the concept dimension before any participant data is imported.

The code lookup is a lookup table. Basically this is a code-to-description mapping (eg, LCBRU : Leicester Cardiovascular Biomedical Research Unit). Currently no attempt is made to populate the lookup table: it is empty. The code lookup table is considered here to be a minor issue.

The provider dimension holds details of physicians or providers at an institution. Every observation relates (or should relate) to a provider; ie: the observation has been made by a provider. 'Provider' and 'observer' could be considered synonyms. Currently no attempt is made to populate the provider dimension: it is empty.

The patient mapping maps external identifiers (eg: s-number or bpt-number) to an internal i2b2 patient identifier. The favoured way of populating this is via the web service that invokes the CRC loader facility. The alternative is to use a manually supplied number. Either way, the content of this table is relatively uncontroversial.
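
The mapping idea can be sketched in a few lines. This is only an in-memory illustration, not the behaviour of the CRC loader or its web service: the class name `PatientMapping`, the method `internal_id` and the way internal numbers are allocated are all assumptions made for the example.

```python
class PatientMapping:
    """Hypothetical in-memory stand-in for the patient mapping table."""

    def __init__(self, first_internal_id=1):
        self._next_id = first_internal_id
        self._by_external = {}  # (source, external id) -> internal id

    def internal_id(self, source, external_id):
        # Reuse the internal id if this external id has been seen before,
        # otherwise allocate the next free number.
        key = (source, external_id)
        if key not in self._by_external:
            self._by_external[key] = self._next_id
            self._next_id += 1
        return self._by_external[key]
```

The point of the sketch is that the same external identifier always resolves to the same internal identifier, however many times it is presented.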

The event mapping (also called the visit mapping or encounter mapping) maps the external identifier of a visit to an internal i2b2 event identifier. The internal event identifier is as uncontroversial as the internal patient identifier (see above). The external identifier, however, is a somewhat hazy idea. What constitutes a unique identifier for an event in the BRICCS Study questionnaire (or, for that matter, in a pathology test)?

That leaves just three tables to consider, and these are the main ones:

The visit dimension holds one row for every visit/encounter/event that a patient is involved in. What is an event with respect to the BRICCS Study questionnaire? Currently, the process creates just one event for each participant in the questionnaire.

The patient dimension holds one row for every patient (ie: for every participant interviewed in the questionnaire).

The observation fact table holds one row for every fact (deemed sensible to include) that has been gleaned about a participant from the questionnaire.

Each observation fact points to a patient and a visit using the unique i2b2 internal patient id and event id: these are part of the observation fact's primary key. All three main tables have dates:

Table               First date    Second date
observation fact    start date    end date
visit dimension     start date    end date
patient dimension   birth date    death date

The start dates for observation fact and visit dimension are mandatory.
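
As a rough sketch of how the three main tables relate, the record types below model the keys and dates described above. The field names follow the usual i2b2 column naming (patient_num, encounter_num, concept_cd and so on) but should be treated as illustrative; the class names and the validation in `__post_init__` are assumptions for the example, not the loader's code.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional


@dataclass(frozen=True)
class ObservationFact:
    patient_num: int      # internal i2b2 patient id (part of the key)
    encounter_num: int    # internal i2b2 event id (part of the key)
    concept_cd: str       # what was observed
    start_date: datetime  # mandatory
    end_date: Optional[datetime] = None

    def __post_init__(self):
        if not isinstance(self.start_date, datetime):
            raise ValueError("an observation fact needs a start date")


@dataclass(frozen=True)
class Visit:
    encounter_num: int
    patient_num: int
    start_date: datetime  # mandatory
    end_date: Optional[datetime] = None

    def __post_init__(self):
        if not isinstance(self.start_date, datetime):
            raise ValueError("a visit needs a start date")


@dataclass(frozen=True)
class Patient:
    patient_num: int
    birth_date: Optional[datetime] = None
    death_date: Optional[datetime] = None
```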

The Source : Onyx Questionnaire

(A description of the questionnaire structure is required here)

Building Observation Facts

The strategy in dealing with Onyx variables is to fall back on accommodating them as either discrete variables (true/false) or enumerated variables (for example: single, married, separated, divorced, widowed, prefer not to answer, don't know). In effect, they are not treated as variables measured on a continuous scale. This is in contrast to pathology tests, where the expectation is a continuous measure of some sort.

The following is the main algorithm for forming observation facts.

For each variable relating to a participant...

Firstly, all of the following are ignored:

  1. All variables from QuestionnaireRun and QuestionnaireMetric.
  2. All variables from Participant except for: ethnicity, age, gender and recruitment type.
  3. Other primary diagnoses, other secondary diagnoses and other symptoms. These are textual note fields which should get folded into the associated "other" observation_fact, but I don't think they currently are. What should be done here?
  4. Symptom onset variables are not turned into facts of their own; they are used instead as the start date for the associated observation_fact.
  5. Patient email 1 and email 2 from the EndContactQuestionnaire.
  6. TubeCode, barcode and prefixCode from UrineSamplesCollection and BloodSamplesCollection.
  7. The following Onyx types: DATA, LOCALE and BINARY
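
Most of this ignore list can be expressed as a simple predicate. The sketch below is an assumption about shape, not the actual import code: the function name `is_ignored`, its three arguments and the e-mail variable names `email1` and `email2` are hypothetical, and the kept Participant names are taken from the special cases listed further down. Items 3 and 4 are left out because they depend on variable names not given here, and `vital_status`, which appears among the later special cases but not in the exceptions above, is not treated as kept.

```python
IGNORED_ENTITIES = {"QuestionnaireRun", "QuestionnaireMetric"}
KEPT_PARTICIPANT_VARIABLES = {"age", "pat_ethnicity", "gender", "recruitmentType"}
SAMPLE_ENTITIES = {"UrineSamplesCollection", "BloodSamplesCollection"}
SAMPLE_CODE_VARIABLES = {"TubeCode", "barcode", "prefixCode"}
IGNORED_ONYX_TYPES = {"DATA", "LOCALE", "BINARY"}
# Hypothetical names for the two patient e-mail variables.
EMAIL_VARIABLES = {"email1", "email2"}


def is_ignored(entity, name, onyx_type):
    """Return True if the variable is dropped before any fact is built."""
    if entity in IGNORED_ENTITIES:
        return True
    if entity == "Participant" and name not in KEPT_PARTICIPANT_VARIABLES:
        return True
    if entity == "EndContactQuestionnaire" and name in EMAIL_VARIABLES:
        return True
    if entity in SAMPLE_ENTITIES and name in SAMPLE_CODE_VARIABLES:
        return True
    # Finally, whole Onyx types that never produce facts.
    return onyx_type in IGNORED_ONYX_TYPES
```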

Secondly, we examine the remainder for BOOLEANS where the answer is TRUE.
Some of these are (already) enumerated types (ie: not generated ones, but ones designed into the questionnaire itself), and are definite observation facts, whilst others have an uncertain status. The only way to decide between the two is to see whether we have ontological data within the refined metadata. If that exists, then we are looking at an enumerated type and the value represents an observation fact. These are processed as enumerations. The other BOOLEANS are reported upon.
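
The BOOLEAN rule can be sketched as follows. The dictionary `ontology_codes` is a hypothetical stand-in for the refined metadata, and the function name and return values are assumptions chosen for illustration.

```python
def process_boolean(name, value, ontology_codes):
    """Classify one BOOLEAN variable.

    ontology_codes is a hypothetical dict mapping variable names to
    concept codes, standing in for the refined metadata.
    """
    if value is not True:
        return None                # only TRUE answers are examined
    code = ontology_codes.get(name)
    if code is not None:
        return ("fact", code)      # ontological data found: enumeration value
    return ("report", name)        # uncertain status: reported, no fact built
```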

Thirdly, non-BOOLEANS where ontological data can be found are examined.
It has to be said at the outset that each of these could be considered a discrete observation fact, but as emphasized above, the overall strategy (at least as a point of departure) has been to accommodate facts as either true/false statements or as enumerations of values.

This is what then happens:

  1. All variables of a type for which an enumeration has been specifically designed are processed. Currently, these types are:

AGE, BEERNUMBER, BICEPS, CIGARETTENUMBER, CIGARNUMBER, DIASTOLICBP, ETHNICITY, HEARTRATE, HEIGHT, HIPS, PIPENUMBER, RELATIVESNUMBER, SMALLNUMBER, SUBSCAPULAR, SUPRAILIAC, SYSTOLICBP, TRICEPS, WAIST, WEIGHT, WINESPIRITNUMBER, YEAR, RECENTTIME.

Note that these are types, and can therefore account for a considerable number of facts (eg: YEAR covers every question where a year can be returned as an answer).

  2. For type DATETIME, if the variable concerns interventions in this clinical episode, we build a non-generated enumeration using the datetime value as the observation start date. Other DATETIME variables are reported upon but no facts are built.

  3. For types DECIMAL, INTEGER and TEXT, variables are reported upon but no facts are built.

  4. For all other types, variables are reported upon but no facts are built.
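
The dispatch for non-BOOLEANS with ontological data might look like the sketch below. The set of types is copied from the list above; the function name, the `is_intervention` flag and the tuple return values are assumptions for the example, and how an enumeration is then turned into a concept code is not shown.

```python
ENUMERATION_TYPES = {
    "AGE", "BEERNUMBER", "BICEPS", "CIGARETTENUMBER", "CIGARNUMBER",
    "DIASTOLICBP", "ETHNICITY", "HEARTRATE", "HEIGHT", "HIPS",
    "PIPENUMBER", "RELATIVESNUMBER", "SMALLNUMBER", "SUBSCAPULAR",
    "SUPRAILIAC", "SYSTOLICBP", "TRICEPS", "WAIST", "WEIGHT",
    "WINESPIRITNUMBER", "YEAR", "RECENTTIME",
}


def process_with_ontology(var_type, value, is_intervention=False):
    """Decide what happens to a non-BOOLEAN variable that has ontology data."""
    if var_type in ENUMERATION_TYPES:
        return ("enumeration", value)
    if var_type == "DATETIME" and is_intervention:
        # The datetime answer becomes the observation start date.
        return ("dated_fact", value)
    # Other DATETIMEs, DECIMAL, INTEGER, TEXT and anything else:
    # reported upon, but no fact is built.
    return ("report", var_type)
```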

Fourthly, non-BOOLEANS where ontological data cannot be found are examined.
The following are special cases within this group where enumerations are built:

  1. Admin.Participant.age of type INTEGER (generated enumeration)
  2. Admin.Participant.pat_ethnicity of type TEXT (generated enumeration)
  3. Admin.Participant.gender of type TEXT (built-in enumeration)
  4. Admin.Participant.recruitmentType of type TEXT (built-in enumeration)
  5. Admin.Participant.vital_status of type TEXT (generated enumeration)

Other variables within this group are reported upon but no facts built.
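
Since these special cases are a fixed list, one way to hold them is a lookup table keyed on variable path and Onyx type. The paths, types and generated/built-in labels come from the list above; the names `SPECIAL_CASES` and `enumeration_kind` are hypothetical.

```python
SPECIAL_CASES = {
    ("Admin.Participant.age", "INTEGER"): "generated",
    ("Admin.Participant.pat_ethnicity", "TEXT"): "generated",
    ("Admin.Participant.gender", "TEXT"): "built-in",
    ("Admin.Participant.recruitmentType", "TEXT"): "built-in",
    ("Admin.Participant.vital_status", "TEXT"): "generated",
}


def enumeration_kind(path, onyx_type):
    """Return "generated" or "built-in", or None if the variable is only reported."""
    return SPECIAL_CASES.get((path, onyx_type))
```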

Lastly, any other variables are reported upon but no facts built.

Last modified on 05/09/15 13:37:17