wiki:i2b2 Onyx Importer

Version 5 (modified by Nick Holden, 12 years ago)

--

i2b2 - importing data from Onyx

Assuming incremental export processes by Onyx, the first import is going to be the most time-consuming: you need to go through the complete process for the first one, and the slowest part is loading the metadata. Once that is underway, start on the second and subsequent Onyx export files while you're waiting for the metadata upload to complete.

Note that there is an important file tracking the incremental ETL processes (including pid and eid numbers) on the BRICCS shared drive, called incrementalexportlog.xls.

From the README in /usr/local/i2b2-procedures-1.0-SNAPSHOT-development:

QUICK START.

============

Assuming you are already root ('sudo su -' if not)...

1. Unzip this package into a convenient place on a server hosting an i2b2 domain.
2. Set the I2B2_PROCEDURES_HOME environment variable and export it:
   # export I2B2_PROCEDURES_HOME=/usr/local/i2b2-procedures-1.0-SNAPSHOT-development
3. Ensure the following environment variables are also set and exported:
   (but these can be set within one of the config files, e.g. config/defaults.sh)
     JAVA_HOME
     ANT_HOME
     JBOSS_HOME
4. If you wish to run procedures from any current directory,
   run the build-symlinks script in the bin/utility directory
   and add the bin/symlinks to your path.
   But go steady; this could wait until later.
5. Review configuration settings within the config directory.
   Basically three files:
   config.properties - DATABASE CONNECTION SETTINGS GO IN HERE
   defaults.sh
   log4j.properties
6. The order of completion (by directories) of procedures:
   i)   data-prep          (regular)
   ii)  project-install    (once)
        NB: The job step update-datasources.sh tries to recover if it fails.
        However, it is good practice to check the JBoss data source files
        for correctness before rerunning this step.
   iii) meta-upload        (once and then whenever required)
   iv)  participant-upload (regular)
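Steps 2 and 3 above might look like this in practice. Only the I2B2_PROCEDURES_HOME value comes from this README; the other three paths are illustrative assumptions for a typical server:

```shell
# Export the variables the procedures need. JAVA_HOME, ANT_HOME and JBOSS_HOME
# below are assumptions -- substitute your real locations, or set them in
# config/defaults.sh instead.
export I2B2_PROCEDURES_HOME=/usr/local/i2b2-procedures-1.0-SNAPSHOT-development
export JAVA_HOME=/usr/lib/jvm/default-java   # assumption
export ANT_HOME=/usr/share/ant               # assumption
export JBOSS_HOME=/opt/jboss                 # assumption
echo "$I2B2_PROCEDURES_HOME"
```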

Notes from Jeff:

Note that there is a parameter in the config/defaults.sh file:

   # Max number of participants to be folded into one PDO xml file:
   BATCH_SIZE=50

If this number is exceeded in any export it does not matter: the procedure simply creates more than one PDO file. Alternatively, you can raise the batch figure to ensure just one file, but this increases memory usage. The PDO file has a particular naming convention, as illustrated below:
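As a quick sketch of the arithmetic, the number of PDO files produced is the participant count divided by BATCH_SIZE, rounded up (the participant count of 110 here is purely illustrative):

```shell
# Ceiling division: files = ceil(participants / BATCH_SIZE).
# The participant count is illustrative, not taken from a real export.
BATCH_SIZE=50
participants=110
files=$(( (participants + BATCH_SIZE - 1) / BATCH_SIZE ))
echo "$files"   # 3 files for 110 participants at BATCH_SIZE=50
```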

onyx-4-20111101-111556704-TEST-DATA-ONLY-pdo.xml

Note the 4 after "onyx-": it indicates how many participants are included in this file. It can help when you need to work out the pid and eid for the next export.

"TEST-DATA-ONLY" is there only when executing A-onyx2pdo-testdata.sh as opposed to A-onyx2pdo.sh. The rest is date/time.
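If you need the participant count programmatically, it can be pulled out of the file name; a minimal sketch using the example name above:

```shell
# The participant count is the second "-"-separated field of the PDO file name.
pdo="onyx-4-20111101-111556704-TEST-DATA-ONLY-pdo.xml"
count=$(echo "$pdo" | cut -d'-' -f2)
echo "$count"   # 4
```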

A-onyx2pdo-testdata.sh mangles dates and does no mapping for the s-number.

First export:
=============
data-prep:
   1-namespace-update.sh
   2-clean-onyx-variables.sh
   3-onyx2metadata.sh
   5-refine-metadata.sh
   6-xslt-refined-2ontcell.sh
   7-xslt-refined-2ontdim.sh
   8-xslt-refined-enum2ontcell.sh
   9-xslt-refined-enum2ontdim.sh
   A-onyx2pdo.sh    or   A-onyx2pdo-testdata.sh  (make sure you record the pid and eid ranges)
   B-xslt-pdo2crc.sh
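The data-prep steps above could be driven by a small wrapper. This is only a sketch: the bin/data-prep layout is an assumption, and the actual invocation is commented out so the loop just prints the planned order.

```shell
# Hypothetical wrapper for the first-export data-prep sequence. The directory
# layout is an assumption; uncomment the invocation line on the real server.
PROC_DIR="${I2B2_PROCEDURES_HOME:-/usr/local/i2b2-procedures-1.0-SNAPSHOT-development}/bin/data-prep"
for s in 1-namespace-update.sh 2-clean-onyx-variables.sh 3-onyx2metadata.sh \
         5-refine-metadata.sh 6-xslt-refined-2ontcell.sh 7-xslt-refined-2ontdim.sh \
         8-xslt-refined-enum2ontcell.sh 9-xslt-refined-enum2ontdim.sh \
         A-onyx2pdo.sh B-xslt-pdo2crc.sh
do
    echo "step: $s"
    # "$PROC_DIR/$s" || { echo "FAILED: $s" >&2; exit 1; }
done
```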

project-install:
   1-project-install.sh
   2-update-datasources.sh

metadata-upload:
   metadata-upload-sql.sh         (Once this is underway, begin on the second onyx export file)

participant-upload:
   participant-upload-sql.sh       (Good idea to make sure this is the first one triggered if working in parallel)

Second and Subsequent Export Files:
===================================
data-prep:
   1-namespace-update.sh
   2-clean-onyx-variables.sh
   3-onyx2metadata.sh
   5-refine-metadata.sh
   A-onyx2pdo.sh    or   A-onyx2pdo-testdata.sh    (make sure you record the pid and eid ranges)
   B-xslt-pdo2crc.sh

participant-upload:
   participant-upload-sql.sh      (DON'T start this until you know the metadata-upload for the first export has completed successfully)

Naming

It's entirely up to you how you name the jobs for each of these: whatever is convenient.

Deleting and re-importing data

When deleting and re-importing Onyx data (which includes the data for creating the patients in the i2b2 patient dimension), it is important to consider the data held against the other ontologies in i2b2. Either everything has to be deleted, which implies having no pathology, PATS or subsequent third-party data until the next time those routines load to i2b2, or the Onyx data has to be selectively deleted without damaging the data held against the other ontologies AND the patients need to be reloaded in the same order, to ensure that patients are allocated the correct i2b2 identifiers.

Database preparation for a complete reload of Onyx data

Assuming all metadata is unchanged.

Need to be sure that the patients are mapped consistently in the re-loading process. General approach is to delete all the data (patient dimension, observation facts, visit dimension) but not touch ANY metadata.

metadata database - NO CHANGES

work database - NO CHANGES

hive - NO CHANGES

pm - NO CHANGES

data DATABASE TABLES:

ARCHIVE_OBSERVATION_FACT

  • Currently empty. Leave.

CODE_LOOKUP

  • Currently empty. Leave.

concept_dimension

  • Pointers to ontology. Leave alone because ontology is not changing.

DATAMART_REPORT

  • Currently empty. Leave.

Encounter_Mapping

  • Currently empty. Leave.

Observation_Fact

  • All observations included, from all ontologies.

Selective delete: delete where Concept_Cd LIKE 'CBO:%'. 19770 rows deleted. There are also four entries where Concept_Cd is empty; the reason is unknown, but delete them as well. 4 rows deleted.

Patient_Dimension

  • All patients

Need to delete all. 110 rows deleted.

Patient_Mapping

  • All patients

Need to delete all. 330 rows deleted.

Provider_Dimension

  • Currently empty. Leave.

QT_*

  • Are related to queries. Leave.

SET_TYPE

  • Index of 'set' types. Leave.

SET_UPLOAD_STATUS

  • Currently empty. Leave.

SOURCE_MASTER

  • Currently empty. Leave.

UPLOAD_STATUS

  • Currently empty. Leave.

Visit_Dimension

  • Only Onyx data is loaded against the visit dimension, one row per patient, built by the import process. DELETE ALL. 110 rows deleted.
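Collected together, the selective deletes above amount to the following SQL. This is a sketch only: table and column names are taken from these notes, and you should run the statements through your own database client, inside a transaction, with a backup in hand.

```shell
# Print the delete statements for the Onyx-only reload described above.
# Whether an "empty" Concept_Cd is stored as NULL or '' depends on the
# database, so the sketch covers both.
sql=$(cat <<'SQL'
DELETE FROM Observation_Fact WHERE Concept_Cd LIKE 'CBO:%';
DELETE FROM Observation_Fact WHERE Concept_Cd IS NULL OR Concept_Cd = '';
DELETE FROM Patient_Dimension;
DELETE FROM Patient_Mapping;
DELETE FROM Visit_Dimension;
SQL
)
echo "$sql"
```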

Checking which Onyx export file(s) to use:

procedures-1.0-SNAPSHOT-development-old alpha used test-BRICCS-20111021090857.zip, generated 8 patients.

procedures-1.0-SNAPSHOT-development alpha used live-BRICCS-20111031180021.zip, generated 53 patients.

procedures-1.0-SNAPSHOT-development beta used live-BRICCS-20111031184824.zip, generated 57 patients.

procedures-development-trac92 used BRICCS-20110106095220.zip, generated 4 patients.

procedures-trac108-SNAPSHOT_development ws-test used live-BRICCS-20111031184824.zip, generated 57 patients.

Do two jobs: alpha and beta, treating both of them as 'second and subsequent uploads'.

alpha: Use /home/nick/onyxexports/live-BRICCS-20111031180021.zip, generate 53 patients.

beta: Use /home/nick/onyxexports/live-BRICCS-20111031184824.zip, generate 57 patients.

This kind of worked, except that the TEST i2b2 on uhlbriccsapp02.xuhl-tr.nhs.uk had 110 patients loaded against pids from 2 to 111, and the above process re-loaded them against pids from 1 to 110. Oops. But either the other data (PATS and pathology) will be re-loaded against the new pids overnight, OR I can re-run this process tomorrow with pid=2 and eid=2 for the first batch.

NOTE: PROCESS WILL BE DIFFERENT IF THE ONTOLOGY ITSELF IS CHANGED IN ANY WAY.
