wiki:i2b2 Data Import Pathology Procedure v1

i2b2 Data Import Pathology Procedure v1

Tags: i2b2

Version 1 of the i2b3 pathology data integration load uses two SSIS packages, loaded on the UHL Data warehouse server, to move data from the DWPATH database on the UHL data warehouse server to i2b2 databases on the UHLSQLBRICCSDB\UHLBRICCSDB server. One package moves data to the production i2b2 database, the other moves data to the test database.


The two packages are identical and only differ in the destination database and the set of patients that they extract.

Package NameDestination DatabasePatient Set
TestPathology To I2B2.dtsxi2b2_b1_dataTest patients from i2b2_b1_data
ProductionPathology To I2B2 APP03.dtsxi2b2_app03_b1_dataActual patients from i2b2_app03_b1_data


Both packages use the same following three steps which utilise the DWBRICCS database on the UHL data warehouse server and the i2b2ClinDataIntegration database on the I2B2 server as an intermediate databases.

1. Import I2B2 Patients to DWBRICCS

Runs the stored procedure USP_DWH_IMPORT_BRICCS_PATIENTS in the DWBRICCS database on the UHL data warehouse, which copies patients from the destination database on the I2B2 database server to the DWBRICCS database using a linked SQL server.

2. Delete Pathology from I2B2

Runs the stored procedure USP_DWH_DELETE_PATHOLOGY_FROM_I2B2 in the i2b2ClinDataIntegration database, which deletes all pathology data from the Observation_Fact table of the destination i2b2 database. These are identified as all records where the concept_cd begins with 'PAT:'.

3. Insert Pathology into I2B2

Runs the stored procedure USP_DWH_INSERT_PATHOLOGY_I2B2 in the i2b2ClinDataIntegration database, which loads data into the destination i2b2 database from the view UVW_BRICCS_PATHOLOGY_RESULTS in the DWBRICCS database on the UHL data warehouse as a linked server.

Only copies data from after 4th July 2010. This is probably the date that iLab officially went live.

Duplicate Processing

Version one of the data load identifies some records as being duplicates because they have the same patient, sample collection datetime and concept code. When a duplicate is identified it discards the most recent record. This is probably not correct for several reasons:

  1. If there are more that two duplicates, it only discards one record and so there will still be a duplicate.
  2. Common sense and reason 1 suggest that it should be keeping the most recent record.
  3. There may be a better way to identify which record is correct. For example, if the result has been suppressed (result suppression will not solely solve the problem).
  4. Both records may be valid.

Paul Smalley has looked at the duplicates and thinks that most of them are not duplicates as they have a different specimin number. He also says that the received date and time should be unique for them, but that the sample time may be 'unknown' if the sample was not input into an electronic system.

ACTION (RB): Check for duplicates with the same specimin number. Are all duplicates supressed?

ACTION (RB): Check for sample collection date and time population. The time is zero for 492 BRICCS patients' records.

SOLUTION: Would date and time received be a reasonable start date?

SOLUTION: Can we add the specimin number as a modifier? No, modifers are for different aspects of the same observation. So the observation would have a modifer code of 'specimin number', but the actual unique number would be the value. So the record would still be a duplicate.

SOLUTION: Since we are reloading all the data every load. Could we just add a second on for every preceeding duplicate?

SOLUTION: I have just checked the HAEM_RESULTS_FILE for records where the time of WHO_COLLECTION_DATE_TIME is zero. The table also contains varchar columns called Coll_Date_Source and Coll_Time_Source. On the ones that I have checked, the Coll_Date_Source is the same as the date portion of the WHO_COLLECTION_DATE_TIME column. However, the Coll_Time_Source column does contain a valid time. I Will need to check that the dates match and the time is filled in for all records of interest. Also contact Paul Smalley to double check that these columns mean what they appear to mean. I have checked the data and for all instances where the time portion of the WHO_COLLECTION_DATE_TIME is zero, it matches the time in the Coll_Time_Source column. Looks promising. However, I had a thought that the collection time may not mean the time that the sample was collected from the patient, but may mean the time that the sample was collected by the van.

Other advantages of Coll_Date_Source & Coll_Time_Source:

  • They are never NULL.
  • They are always a valid date/time

Paul Smalley has taken a look at the coll_date_source and thinks that it is the datetime when the test was ordered. Or, if it was manually ordered, the datetime that it was received and booked into iLab. Either way, it is the earliest datetime we have for the sample. Just hope there aren't any duplicates!

ACTION (RB): Ask Nick why the Pathology data is only loaded for tests received after 04 July 2010.

ACTION (RB): Change pathology load job to load results where the reference range is not numeric. Check that this works in i2b2.

ACTION (RB): Change pathology load job to load results where the result is '> x' or '< x'. I2b2 has a observation fact column that states whether the result stored is equals, greater than or less than result.

Error: Macro BackLinks(None) failed
'Environment' object has no attribute 'get_db_cnx'

Last modified 8 years ago Last modified on 12/10/15 09:30:47
Note: See TracWiki for help on using the wiki.