Importing a PDO - the algorithms examined in detail
This is a work in progress, based on i2b2 version 1.5.5.
The context of this discussion is what happens to a PDO file when you ask the CRC loader to process it using one of i2b2's web services. So much happens here that shedding some light on the detail is important for understanding the process and, by extension, for working out how to format a PDO suitable for importing in the first place.
A lot of this detail can be gleaned from a careful reading of the CRC documentation, particularly those parts of the CRC Design and CRC Messaging PDFs covering the import use cases. The documentation is not easy to follow, though, so the details here have been supplemented by reading the CRC loader code.
The PDO (Patient Data Object)
The outline structure of a PDO matches the star schema of the data mart and is given below:

    <?xml version="1.0" encoding="UTF-8"?>
    <pdo:patient_data xmlns:pdo="http://www.i2b2.org/xsd/hive/pdo/1.1/pdo">
        <!-- patient identifier set -->
        <pdo:pid_set>
            <!-- Identifies a patient in a source system -->
            <pid>... details here ...</pid>
        </pdo:pid_set>
        <!-- event identifier set -->
        <pdo:eid_set>
            <!-- Identifies an event/occurrence/visit in a source system -->
            <eid>... details here ...</eid>
        </pdo:eid_set>
        <pdo:patient_set>
            <!-- Basic patient details -->
            <patient>... details here ...</patient>
        </pdo:patient_set>
        <pdo:event_set>
            <!-- Basic event details -->
            <event>... details here ...</event>
        </pdo:event_set>
        <pdo:concept_set>
            <!-- Basic details of one concept -->
            <concept>... details here ...</concept>
        </pdo:concept_set>
        <pdo:observer_set>
            <!-- Basic observer/provider details -->
            <observer>... details here ...</observer>
        </pdo:observer_set>
        <pdo:observation_set>
            <!-- A single fact concerning one patient -->
            <observation>... details here ...</observation>
        </pdo:observation_set>
    </pdo:patient_data>
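As a quick way to see which sets a given PDO file actually contains, the skeleton can be inspected with a few lines of Python. This is only a sketch: the function name `sets_present` and the cut-down sample document are made up for illustration, and nothing here is part of the CRC loader itself.

```python
import xml.etree.ElementTree as ET

# Namespace taken from the PDO skeleton above.
PDO_NS = "http://www.i2b2.org/xsd/hive/pdo/1.1/pdo"

# A cut-down sample PDO containing only three of the seven sets.
SAMPLE_PDO = """<?xml version="1.0" encoding="UTF-8"?>
<pdo:patient_data xmlns:pdo="http://www.i2b2.org/xsd/hive/pdo/1.1/pdo">
  <pdo:pid_set><pid>... details here ...</pid></pdo:pid_set>
  <pdo:patient_set><patient>... details here ...</patient></pdo:patient_set>
  <pdo:observation_set><observation>... details here ...</observation></pdo:observation_set>
</pdo:patient_data>
"""


def sets_present(pdo_xml):
    """Return (set_name, child_count) pairs in document order."""
    root = ET.fromstring(pdo_xml.encode("utf-8"))
    prefix = "{" + PDO_NS + "}"
    found = []
    for child in root:
        # Only the pdo:*_set elements live in the PDO namespace.
        if child.tag.startswith(prefix):
            found.append((child.tag[len(prefix):], len(child)))
    return found


print(sets_present(SAMPLE_PDO))
```

The namespace check matters because only the `*_set` wrappers carry the `pdo:` prefix; the inner elements (`pid`, `patient`, and so on) are unqualified in the skeleton.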
Order of Processing
The loader processes the PDO in the order displayed in the above XML skeleton, i.e.:
- pid_set
- eid_set
- patient_set
- event_set
- concept_set
- observer_set
- observation_set
Detailed processing will be reviewed later. For the moment, there are two observations I think need to be made:
- I believe that processing the pid_set and eid_set first in the workflow leaves the loader in control of assigning i2b2 internal identifiers to patients and to events. I've made one pass at reading the code for this, and feel reasonably confident. That is, there is no need to manufacture i2b2 internal ids and place them in the PDO beforehand. The source and source identifiers (e.g. BRICCS participant id and/or s-number) can be used as patient identifiers, and the loader will take care of assigning internal ids and mapping them to the external source ids. This is a big gain: the process is transactional, it is database independent, and there are no concurrency problems (multiple processes doing the same thing at the same time). However, although we know what a participant is, we are still somewhat in the dark about events: what is an event in terms of a source system?
- All seven sets or some subset of the seven can be supplied. Even if all seven were supplied, the loader message itself (the web service message that triggers the load process) contains control data which can specify which sets of those present should be processed. The processing will always be done in the above order, even if it has gaps, but see the next section for dependencies.
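The second observation can be sketched in a few lines of Python: whatever sets are present, and whatever subset the control data selects, the survivors are always taken in the fixed order. The function `plan_processing` and its `requested` parameter are hypothetical; how the real loader message expresses the selection is not shown here.

```python
# Fixed processing order, as listed above.
PROCESSING_ORDER = [
    "pid_set", "eid_set", "patient_set", "event_set",
    "concept_set", "observer_set", "observation_set",
]


def plan_processing(present, requested=None):
    """Return the sets to process, in the loader's fixed order.

    present   -- names of the sets found in the PDO
    requested -- names selected by the control data (hypothetical
                 parameter); None means "everything that is present"
    """
    present = set(present)
    selected = present if requested is None else present & set(requested)
    return [name for name in PROCESSING_ORDER if name in selected]


# The order in the PDO or in the request does not matter.
print(plan_processing(["observation_set", "pid_set", "concept_set"]))
print(plan_processing(["pid_set", "concept_set"], requested={"concept_set"}))
```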
Dependencies
Target | Must already have been processed |
---|---|
pid_set, eid_set, observer_set, concept_set | None |
patient_set | pid_set |
event_set | eid_set |
observation_set | pid_set, eid_set, event_set, patient_set |
One thing to note in the above is the relative independence of the concept_set: it depends upon nothing, and nothing depends upon it. What role the concept_set plays, or could play, in a PDO upload is an interesting question. From the dependencies above, it seems you could use the upload facility just to upload concepts to the concept dimension: a PDO with nothing but concepts in it!
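The dependency table can be turned into a small check. The sketch below assumes that "already processed" means "processed earlier in the same run"; whether a set loaded by a previous run also satisfies a dependency is not settled here. The function name `missing_dependencies` is made up for illustration.

```python
# Dependency table from above: target -> sets that must come first.
DEPENDENCIES = {
    "pid_set": set(),
    "eid_set": set(),
    "observer_set": set(),
    "concept_set": set(),
    "patient_set": {"pid_set"},
    "event_set": {"eid_set"},
    "observation_set": {"pid_set", "eid_set", "event_set", "patient_set"},
}


def missing_dependencies(planned):
    """Map each planned set to the prerequisites not processed before it.

    Assumption: only sets earlier in `planned` count as processed.
    """
    done = set()
    problems = {}
    for name in planned:
        missing = DEPENDENCIES[name] - done
        if missing:
            problems[name] = sorted(missing)
        done.add(name)
    return problems


# A concepts-only PDO has no unmet dependencies.
print(missing_dependencies(["concept_set"]))
# Observations without events or patients do.
print(missing_dependencies(["pid_set", "observation_set"]))
```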
Basic Algorithm
Each set has the same overall algorithm applied to it, covered by five separate transactions:
- Set up a status message ("PROCESSING").
- Create a temporary table.
- Load the temporary table with data from the relevant PDO set.
- Merge the temporary table with the real table.
- Issue a status message ("FINISHED" or "WARNING").
Remember that each of the above is a separate transaction. No step can be undone once it has been committed, although presumably step/txn 4 (the merge) is the most critical. If anything goes wrong while a transaction has begun but has not been committed, it is rolled back, the remaining steps are skipped and another status message is raised ("ERROR").
The controlling process decides whether the temporary table is deleted or not. If the appropriate flag is set in the web service message that triggers the loader, the temporary tables are left in place, which is a good debugging feature.
If an error does occur, the overall loading process (of the seven possible subsets) is terminated, so the end state can be uncertain. For example, the process could be terminated after three of the seven subsets had been processed, leaving four unprocessed.
Comments on the steps:
- The "load the temporary table" step basically maps the data within the relevant PDO subset into a temporary table. Nothing too controversial about this (one might think), but there are cardinality issues. I'll take these up when the pid_set processing is considered in detail.
- For each PDO subset, the real work is done in step/txn 4, where the temporary table is merged with the real table. The detailed processing for this is always carried out by a database procedure.
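To make the shape of the five-transaction algorithm concrete, here is a minimal sketch using Python's built-in `sqlite3` module. Everything in it is an assumption made for illustration: the table names (`load_status`, `real_table`, `tmp_<set>`), the two-column layout, and the `keep_temp` flag are invented, and the merge is a plain `INSERT OR REPLACE` rather than the database procedure the real loader calls. The "WARNING" outcome is not modelled.

```python
import sqlite3


def load_set(conn, set_name, rows, keep_temp=False):
    """Apply the five steps to one PDO set; return the final status."""
    temp = "tmp_" + set_name  # hypothetical naming scheme

    def set_status(status):
        # "with conn" commits on success and rolls back on an exception.
        with conn:
            conn.execute(
                "INSERT INTO load_status (set_name, status) VALUES (?, ?)",
                (set_name, status),
            )

    try:
        set_status("PROCESSING")                      # step 1
        with conn:                                    # step 2
            conn.execute(f"CREATE TABLE {temp} (id TEXT, value TEXT)")
        with conn:                                    # step 3
            conn.executemany(f"INSERT INTO {temp} VALUES (?, ?)", rows)
        with conn:                                    # step 4: the merge
            conn.execute(
                f"INSERT OR REPLACE INTO real_table (id, value) "
                f"SELECT id, value FROM {temp}"
            )
        set_status("FINISHED")                        # step 5
        return "FINISHED"
    except sqlite3.Error:
        # The failing transaction has been rolled back; later steps are skipped.
        set_status("ERROR")
        return "ERROR"
    finally:
        if not keep_temp:
            with conn:
                conn.execute(f"DROP TABLE IF EXISTS {temp}")


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE load_status (set_name TEXT, status TEXT)")
conn.execute("CREATE TABLE real_table (id TEXT PRIMARY KEY, value TEXT)")
print(load_set(conn, "pid_set", [("p1", "a"), ("p2", "b")]))
```

Because each step commits separately, a failure in step 3 or 4 leaves the "PROCESSING" message from step 1 in place alongside the later "ERROR" message, which mirrors the point that committed steps cannot be undone.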
Detailed Processing
Patient Identifier Set
Event Identifier Set
Patient Set
Event Set
Concept Set
Observer Set
Observation Set