wiki:i2b2 Database Structure

Version 29 (modified by jeff.lusted, 14 years ago) ( diff )

--

i2b2 - DB Basics

You can get all of the following from the documentation, but here are pertinent things to where we are...

Structure

This is a very bare basics. At this stage we should be primarily interested in the observation_fact table and its relationship with the concept_dimension table within the CRC Cell. And then the correspondence with any ontology tables within the Ontology Cell. Interest in the patient_dimension, visit_dimension and other of the tables within the CRC are a little further down the line.

There can be a number of ontology tables within the Ontology Cell. We know which are ontology tables by an entry in the table_access table.

The QueryTool uses the ontology tables within the Ontology Cell to format queries for submission against the tables within the CRC Cell.

+-------------CRC Cell-------------+   +-----------Ontology Cell---------+
|                                  |   |                                 |
|        +===================+     |   |     +=====================+     |
|        | concept_dimension |<<<=========>>>|   ontology tables   |>:   | 
|        +=========^=========+     |   |     +==========^==========+ :   |
|                  |               |   |                |            :   |
|                  |               |   |                |            :   |
|                  |               |   |                |            :   |
|        +=========^=========+     |   |     +==========^==========+ :   |
|        | observation_fact  |     |   |     |   table_access      | :   | 
|        +===V===========V===+     |   |     +=====================+ :   |
|            |           |         |   |                             :   |
|            |           |         |   +-----------------------------:---+
| +==========V========+  |         |                                 :
| | patient_dimension |  |         |           +------------+        :
| +===================+  |         |<<<<<<<<<<<| QUERY TOOL |<<<<<<<<:
|                        |         |           +------------+  
|             +==========V======+  |
|             | visit_dimension |  |
|             +=================+  |
+----------------------------------+

Ontology Cell: DDL for Table ONYX

-- ORACLE --
  CREATE TABLE ONYX 
   (	"C_HLEVEL"           NUMBER(22,0)   NOT NULL, 
	"C_FULLNAME"         VARCHAR2(700)  NOT NULL, 
	"C_NAME"             VARCHAR2(2000) NOT NULL, 
	"C_SYNONYM_CD"       CHAR(1)	    NOT NULL, 
	"C_VISUALATTRIBUTES" CHAR(3)	    NOT NULL, 
	"C_TOTALNUM"         NUMBER(22,0)   NULL, 
	"C_BASECODE"         VARCHAR2(50)   NULL, 
	"C_METADATAXML"      CLOB	    NULL, 
	"C_FACTTABLECOLUMN"  VARCHAR2(50)   NOT NULL, 
	"C_TABLENAME"        VARCHAR2(50)   NOT NULL, 
	"C_COLUMNNAME"       VARCHAR2(50)   NOT NULL, 
	"C_COLUMNDATATYPE"   VARCHAR2(50)   NOT NULL, 
	"C_OPERATOR"         VARCHAR2(10)   NOT NULL, 
	"C_DIMCODE"          VARCHAR2(700)  NOT NULL, 
	"C_COMMENT"          CLOB	    NULL, 
	"C_TOOLTIP"          VARCHAR2(900)  NULL, 
	"UPDATE_DATE"        DATE	    NOT NULL, 
	"DOWNLOAD_DATE"      DATE	    NULL, 
	"IMPORT_DATE"        DATE	    NULL, 
	"SOURCESYSTEM_CD"    VARCHAR2(50)   NULL, 
	"VALUETYPE_CD"       VARCHAR2(50)   NULL
   ) ;

These are worth highlighting...
C_HLEVEL: hierarchical level in the tree. 0,1,2 etc
C_FULLNAME: full path within the tree.
C_NAME: in the demo system this is the last part of the path within the tree. Don't understand the size implications between C_FULLNAME and C_NAME.
C_SYNONYM_CD: Y or N regarding a synonym
C_VISUALATTRIBUTES: eg 'LA' would be an active leaf node...

  • 1st character: F = Folder C = Container M = Multiple L = Leaf
  • 2nd character: A = Active I = Inactive H = Hidden
  • 3rd character: E = editable

C_TOTALNUM: Not used, but suspect can be used in the latest version (for performance reasons?).
C_BASECODE: the term that describes the ontological concept.
C_METADATAXML: an optional field to store extra information about the concept in xml format. It is currently used to describe value metadata associated with a lab finding. For further details, see Documents page.Jeff has points to make about this.
C_FACTTABLECOLUMN: name of the column in the CRC cell observation_fact table that holds the concept code.
C_TABLENAME: name of the lookup table within the CRC cell that holds the concept code and concept path. In the top diagram this would be concept_dimension.
C_COLUMNNAME: the name of the column within the table described by C_TABLENAME that holds the concept path. In the top diagram this would be the column within the concept_dimension table, which in our case is called CONCEPT_PATH. See below.
C_COLUMNDATATYPE: either ‘T’ for text or ‘N’ for numeric and describes the datatype of the concept. Don't understand this fully. Jeff has questions about this.
C_OPERATOR: the SQL operator used in the WHERE clause of the SQL query. It is usually either ‘LIKE’ or ‘=’.
C_DIMCODE: the path that lives in the dimension table and maps to the concept path.

CRC Cell: DDL for Table concept_dimension

The following sentences are from the design document...
The concept_dimension table contains one row for each concept. Possible concept types are diagnoses, procedures, medications and lab tests. The structure of the table gives enough flexibility to store virtually any concept type, such as demographics and genetics data.

-- Oracle -- 
CREATE TABLE CONCEPT_DIMENSION ( 
	CONCEPT_CD          VARCHAR2(50)   NOT NULL,
	CONCEPT_PATH        VARCHAR2(700)  NOT NULL,
	NAME_CHAR           VARCHAR2(2000) NULL,
	CONCEPT_BLOB        CLOB           NULL,
	UPDATE_DATE         DATE           NULL,
	DOWNLOAD_DATE       DATE           NULL,
	IMPORT_DATE         DATE           NULL,
	SOURCESYSTEM_CD     VARCHAR2(50)   NULL,
	UPLOAD_ID           NUMBER(38,0)   NULL,
    CONSTRAINT CONCEPT_DIMENSION_PK PRIMARY KEY(CONCEPT_PATH)
) ;	

The design document also states that the concept_dimension table has three required columns:

  1. concept_path: path that delineates the concept’s hierarchy.
  2. concept_cd: code that represents the diagnosis, procedure, or any other coded value.
  3. name_char: name of the concept.

CRC Cell: DDL for Table observation_fact

The fact table of the i2b2 star schema: represents the intersection of the dimension tables. Each row describes one observation about a patient made during a visit. Most queries in the i2b2 database require joining the observation_fact table with one or more dimension tables together.

-- ORACLE --
CREATE TABLE OBSERVATION_FACT (
	ENCOUNTER_NUM    NUMBER(38,0)  NOT NULL,
	PATIENT_NUM      NUMBER(38,0)  NOT NULL,
	CONCEPT_CD       VARCHAR2(50)  NOT NULL,
	PROVIDER_ID      VARCHAR2(50)  NOT NULL,
	START_DATE       DATE          NOT NULL,
	MODIFIER_CD      VARCHAR2(100) NOT NULL,
	VALTYPE_CD       VARCHAR2(50)  NULL,
	TVAL_CHAR        VARCHAR2(255) NULL,
	NVAL_NUM         NUMBER(18,5)  NULL,
	VALUEFLAG_CD     VARCHAR2(50)  NULL,
	QUANTITY_NUM     NUMBER(18,5)  NULL,
	INSTANCE_NUM	 NUMBER(18,0)  NULL,
	UNITS_CD         VARCHAR2(50)  NULL,
	END_DATE         DATE          NULL,
	LOCATION_CD      VARCHAR2(50)  NULL,
	CONFIDENCE_NUM   NUMBER(18,5)  NULL,
	OBSERVATION_BLOB CLOB          NULL,
	UPDATE_DATE      DATE          NULL,
	DOWNLOAD_DATE    DATE          NULL,
	IMPORT_DATE      DATE          NULL,
	SOURCESYSTEM_CD  VARCHAR2(50)  NULL,
	UPLOAD_ID        NUMBER(38,0)  NULL, 
    CONSTRAINT OBSERVATION_FACT_PK PRIMARY KEY(ENCOUNTER_NUM,CONCEPT_CD,PROVIDER_ID,START_DATE,MODIFIER_CD)
) ;

Eventually we will have detailed knowledge and understanding of all of these. But for the moment, here are some details:
Columns UPDATE_DATE, DOWNLOAD_DATE, IMPORT_DATE, SOURCESYSTEM_CD and UPLOAD_ID are administrative/info columns only. Ignore these.

Key Column Name Column Definition Nullable (default=YES)
PK encounter_num Encoded i2b2 patient visit number NO
patient_num Encoded i2b2 patient number NO
PK concept_cd Code for observation of interest NO
PK provider_id Practitioner id or provider id NO
PK start_date Starting date-time of observation (mm/dd/yyyy) NO
modifier_cd Code for modifier of interest (i.e. “ROUTE”,” DOSE”)
note that value columns are often used to hold the amounts such as “100” (mg) or “PO”
YES
instance_num Encoded instance number that allows more that one modifier to be provided for each concept_cd.
Each row will have a different modifier_cd but a similar instance_num.
YES
valType_cd Format of the concept
N = Numeric
T = Text (enums/short messages)
B = Raw Text (notes/reports)
NLP = NLP result text
tval_char Used in conjunction with valType_cd = “T” or “N”...
When valtType_cd = “T”: Stores the text value
When valType_cd = “N”:
E = Equals
NE = Not equal
L = Less Than
LE = Less than and Equal to
G = Greater Than
GE = Greater than and Equal to
nval_num Used in conjunction with valType_cd = “N” to store a numerical value
valueFlag_cd Used in conjunction with valType_cd = “B”, “NLP”, “N”, or “T”...
When valType_cd = “B” or “NLP” it is used to indicate whether or not the blob field is encrypted
( X = Encrypted text in blob field )
When valType_cd = “N” or “T” it is used to flag certain outlying or abnormal values
( H = High, L = Low, A = Abnormal )
quantity_num Quantity of nval
units_cd Units of measurement of nval
end_date The end date-time for the observation
location_cd A location code, such as for a clinic
confidence_num Assessment of accuracy of data
observation_blob Holds any raw or miscellaneous data that exists, often encrypted PHI
Note: See TracWiki for help on using the wiki.