wiki:i2b2 Onyx Importer

Version 3 (modified by Nick Holden, 12 years ago)

--

i2b2 - importing data from Onyx

Assuming incremental export processes by Onyx, the first import is going to be the most time-consuming: you need to go through the complete process for the first one. The most time-consuming aspect is loading the metadata. Once that is underway, I would start on the second and subsequent Onyx export files while you're waiting for the metadata upload to complete.

Note that there is an important file tracking incremental ETL processes (including pid and eid numbers) on the BRICCS shared drive, called incrementalexportlog.xls.

From the README in /usr/local/i2b2-procedures-1.0-SNAPSHOT-development:

QUICK START
===========

Assuming you are already root ('sudo su -' if not)...

  1. Unzip this package into a convenient place on a server hosting an i2b2 domain.
  2. Set the I2B2_PROCEDURES_HOME environment variable and export it:
     # export I2B2_PROCEDURES_HOME=/usr/local/i2b2-procedures-1.0-SNAPSHOT-development
  3. Ensure the following environment variables are also set and exported (these can instead be set within one of the config files, e.g. config/defaults.sh):

     JAVA_HOME
     ANT_HOME
     JBOSS_HOME
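Steps 2 and 3 can be combined into one small setup snippet. A minimal sketch: only I2B2_PROCEDURES_HOME comes from this document, while the JAVA_HOME, ANT_HOME and JBOSS_HOME paths below are assumptions to be replaced with your own locations.

```shell
# Sketch of the environment setup; all paths except I2B2_PROCEDURES_HOME
# are assumed locations -- substitute your own.
export I2B2_PROCEDURES_HOME=/usr/local/i2b2-procedures-1.0-SNAPSHOT-development
export JAVA_HOME=/usr/lib/jvm/default-java   # assumed JDK location
export ANT_HOME=/usr/share/ant               # assumed Ant location
export JBOSS_HOME=/opt/jboss                 # assumed JBoss location

# Warn about anything still unset before running any procedure:
missing=0
for v in I2B2_PROCEDURES_HOME JAVA_HOME ANT_HOME JBOSS_HOME; do
  eval "val=\$$v"
  if [ -z "$val" ]; then
    echo "WARNING: $v is not set"
    missing=1
  fi
done
```

The warning loop is just a convenience check; the procedures themselves read these variables (or the values set in config/defaults.sh).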

  4. If you wish to run procedures from any current directory, run the build-symlinks script in the bin/utility directory and add bin/symlinks to your path. But go steady; this could wait until later.

  5. Review configuration settings within the config directory. There are basically three files:
     config.properties
     defaults.sh
     log4j.properties

  6. The order of completion (by directories) of procedures:
     i)   data-prep (regular)
     ii)  project-install (once)

NB: The job step update-datasources.sh tries to recover if it fails. However, it is good practice to check the JBoss data source files for correctness before rerunning this step.

     iii) meta-upload (once, and then whenever required)
     iv)  participant-upload (regular)

Notes from Jeff:

Note that there is a parameter in the defaults.sh file:

  # Max number of participants to be folded into one PDO xml file:
  BATCH_SIZE=50

If this number is exceeded in any export, no matter: it will simply create more than one PDO file. Alternatively, you can increase the batch figure to ensure just one file, but this increases memory usage. The PDO file has a particular naming convention, as illustrated below:

onyx-4-20111101-111556704-TEST-DATA-ONLY-pdo.xml

Note the 4 after "onyx-": it indicates how many participants are included in this file. It might help when you need to indicate pid and eid for the next export!
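Since the participant count is embedded in the file name, a one-liner can pull it out, and ceiling division tells you how many PDO files to expect for a given BATCH_SIZE. A sketch (the file name is the example above; none of this is part of the toolkit itself):

```shell
# Extract the participant count from a PDO file name (assumes the count
# is always the second "-"-separated field, as in the example above).
pdo="onyx-4-20111101-111556704-TEST-DATA-ONLY-pdo.xml"
count=$(echo "$pdo" | cut -d- -f2)
echo "participants in this file: $count"   # prints 4

# How many PDO files to expect from one export (ceiling division):
participants=120
BATCH_SIZE=50
files=$(( (participants + BATCH_SIZE - 1) / BATCH_SIZE ))
echo "expected PDO files: $files"          # prints 3
```

Summing the counts across an export's PDO files is one way to keep the incrementalexportlog.xls pid/eid bookkeeping honest.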

"TEST-DATA-ONLY" is there only when executing A-onyx2pdo-testdata.sh as opposed to A-onyx2pdo.sh. The rest is date/time.

A-onyx2pdo-testdata.sh mangles dates and does no mapping for the s-number.

First export
============

data-prep:

  1-namespace-update.sh
  2-clean-onyx-variables.sh
  3-onyx2metadata.sh
  5-refine-metadata.sh
  6-xslt-refined-2ontcell.sh
  7-xslt-refined-2ontdim.sh
  8-xslt-refined-enum2ontcell.sh
  9-xslt-refined-enum2ontdim.sh
  A-onyx2pdo.sh or A-onyx2pdo-testdata.sh (make sure you record the pid and eid ranges)
  B-xslt-pdo2crc.sh
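The data-prep sequence can be walked through with a simple loop. The sketch below only echoes each step so it is safe to run anywhere; the bin/data-prep location in the commented-out line is an assumption about where the scripts live, and A-onyx2pdo-testdata.sh can be substituted for A-onyx2pdo.sh.

```shell
# First-export data-prep steps, in the order given above.
steps="1-namespace-update.sh 2-clean-onyx-variables.sh 3-onyx2metadata.sh \
5-refine-metadata.sh 6-xslt-refined-2ontcell.sh 7-xslt-refined-2ontdim.sh \
8-xslt-refined-enum2ontcell.sh 9-xslt-refined-enum2ontdim.sh \
A-onyx2pdo.sh B-xslt-pdo2crc.sh"

n=0
for s in $steps; do
  n=$((n + 1))
  echo "step $n: $s"
  # To actually run each step, uncomment this (the path is an assumption):
  # "$I2B2_PROCEDURES_HOME/bin/data-prep/$s" || { echo "FAILED: $s"; exit 1; }
done
```

Stopping on the first failure matters here because each step consumes the previous step's output.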

project-install:

  1-project-install.sh
  2-update-datasources.sh

metadata-upload:

metadata-upload-sql.sh (Once this is underway, begin on the second onyx export file)
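Since the metadata upload is the long-running step, one way to free the terminal for the next export file is to run it in the background and log its output. A sketch using plain nohup; the assumption is that you launch it from the directory containing the script:

```shell
# Run the metadata upload in the background so work on the next Onyx
# export can start immediately; tail the log to see when it finishes.
nohup ./metadata-upload-sql.sh > metadata-upload.log 2>&1 &
echo "metadata upload running as PID $!"
```

Remember the ordering constraint below: the second export's participant upload must not start until this upload has completed successfully, so check the log first.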

participant-upload:

participant-upload-sql.sh (if working in parallel, it's a good idea to make sure this is the first one triggered)

Second and Subsequent Export Files
==================================

data-prep:

  1-namespace-update.sh
  2-clean-onyx-variables.sh
  3-onyx2metadata.sh
  5-refine-metadata.sh
  A-onyx2pdo.sh or A-onyx2pdo-testdata.sh (make sure you record the pid and eid ranges)
  B-xslt-pdo2crc.sh

participant-upload:

participant-upload-sql.sh (DON'T start this until you know the metadata-upload for the first export has completed successfully)

Naming
======

It's entirely up to you how you name the jobs for each of these: whatever is convenient.
