i2b2 AUG 2013


NLP Workshop

  1. UMLS Ontologies and Ontology Resources (Olivier Bodenreider)
  2. Ontology-based De-identification of Clinical Narratives (Finch and McMurry)
  3. Ontology-based Discovery of Disease Activity from the Clinical Record (Lin)
  4. Ontology Normalisation of the Clinical Narrative (Chen)
  5. Ontology Concept Selection (Yu)
  6. Active Learning for Ontology-based Phenotyping (Dligach)
  7. Conclusion

Academic User Group

  1. Genomic Cell (Shawn Murphy and Lori Phillips)
  2. SMART Apps (Wattanasin)
  3. i2b2 Roadmap (Shawn Murphy)
  4. Planning for the future (Kohane)
  5. AUG Community Projects (Murphy)
  6. From Genetic Variants to i2b2 using NoSQL database (Matteo Gabetta - Pavia)
  7. Identifying Normal Patients (G Weber)
  8. Extending i2b2 with the R Statistical Platform
  9. Integrated Data Repository Toolkit (IDRT) and ETL Tools (Sebastian Mate - Erlangen; Christian Bauer - Goettingen)
  10. Other Comments and Things Learnt

i2b2 SHRINE Conference

  1. SHRINE Clinical Trials (CT) Functionality and Roadmap (Shawn Murphy)
  2. SHRINE National Pilot Lessons Learned
  3. SHRINE Ontology Panel
  4. University of California Research Exchange (UC ReX) (Doug Berman)
  5. Preparation for Patient-Centred Research (Ken Mandl)
  6. Case Study: Improve Care Now (Peter Margolis)

NLP Workshop

UMLS Ontologies and Ontology Resources

Presentation showing how UMLS resources can be used with NLP to extract information from free text.

NLP has two stages:

  1. Entity Recognition - Identifying important terms within text
  2. Relationship Extraction - linking entities together

Entity Recognition

Three major problems when identifying entities within a text:

  1. Entities are missed
  2. Entities are partially matched - part of the term is matched but another part is missed leading to incomplete information or context. For example, in the term 'bilateral vestibular' only the second word may be matched.
  3. Ambiguous terms - terms that may have two meanings.

Entities are identified by a combination of normalisation and longest term matching.

Normalisation is the process whereby a term is manipulated to produce a form of words that will match a large number of potential matches. The process involves removing noise words, standardising inflections and derivatives (e.g., remove plural), removing punctuation, converting to lower case, and sorting the words into alphabetical order.
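
As a rough illustration of the normalisation steps above, here is a minimal Python sketch. The noise-word list and the plural-stripping rule are illustrative assumptions, not how the UMLS tools do it; a real system would use a proper lexicon for inflections.

```python
import re

# Hypothetical noise-word list; a real system would use a curated one.
NOISE_WORDS = {"of", "the", "and", "with", "nos"}


def strip_plural(word):
    """Very crude singularisation: drop a trailing 's' (but not 'ss')."""
    if len(word) > 3 and word.endswith("s") and not word.endswith("ss"):
        return word[:-1]
    return word


def normalise(term):
    """Return a normalised key for a term, following the steps in the text."""
    lowered = term.lower()                          # convert to lower case
    no_punct = re.sub(r"[^\w\s]", " ", lowered)     # remove punctuation
    words = [w for w in no_punct.split() if w not in NOISE_WORDS]
    words = [strip_plural(w) for w in words]        # standardise inflections
    return " ".join(sorted(words))                  # alphabetical word order
```

With this sketch, 'Fractures of the Femur' and 'femur, fracture' produce the same key, so either form would match the same dictionary entry.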

To extract the most meaning from the text, an attempt is made to match the term containing the greatest number of words. For example, 'left atrium' is preferred to just 'atrium'.
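
Longest term matching can be sketched as a greedy scan that tries the widest window of words first. This is an assumption about one simple way to do it, not a description of any particular tool; the function name, the `max_len` limit and the tiny dictionary are hypothetical.

```python
def longest_match(tokens, dictionary, max_len=5):
    """Greedy left-to-right matcher preferring the longest dictionary term."""
    matches = []
    i = 0
    while i < len(tokens):
        found_size = 0
        # Try the longest candidate window first, then shrink it.
        for size in range(min(max_len, len(tokens) - i), 0, -1):
            candidate = " ".join(tokens[i:i + size])
            if candidate in dictionary:
                matches.append(candidate)
                found_size = size
                break
        # Skip past the matched words, or move on by one word if no match.
        i += found_size if found_size else 1
    return matches


# Hypothetical dictionary containing both the short and the long term.
TERMS = {"atrium", "left atrium"}
```

Given the tokens of 'enlarged left atrium', the sketch returns only 'left atrium', because the two-word term is tried before the single word.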

Types of Resources useful for Entity Recognition

There are several types of resource:

  1. Lexical resources - lists of terms with variant spellings, derivatives and inflections, associated with the part of speech to which they refer. These can be either general or include specialist medical terms.
  2. Ontologies - set of entities with relationships between the entities.
  3. Terminology resources - set of terms and identifiers used to map a term to an ontology.
  4. Hybrid - A mixture of 1 and 2. They are not strictly speaking ontologies as the relationships may not always be true (e.g., a child may not always be a part of the parent). They are useful for finding terms, but should not be used for aggregation.

Lexical Resources

  1. UMLS Specialist Lexicon - Medical and general English
  2. WordNet - General English
  3. LVG Lexical Variant Generation - specialist tool
  4. BioLexicon - EU project. Not as general. Mainly focused on genes.
  5. BioThesaurus - Focused on proteins and genes.
  6. RxNorm - Drug specific.

Ontological Resources

  1. UMLS Semantic Network

Terminology Resources

  1. MetaThesaurus
    • Groups terms from many ontologies
    • Produces a graph of all the relationships
    • Graph is not acyclic and contains contradictions because it reproduces its source ontologies exactly.
    • Allows mapping between standards.
  2. RxNorm
    • Map between many drug lists.
    • Map between branded and generic drug names.
  3. MetaMap
    • Free with licence agreement
    • Based on UMLS MetaThesaurus.
    • Parses text to find terms.
    • Used in IBM's Watson tool.
    • Terms can be translated between various standards, including Snomed.
    • Copes with term negation and disambiguation.
  4. TerMine
  5. WhatIzIt

Relationship Extraction

  1. SemRep

Orbit Project

The Orbit Project is the Online Registry of Biomedical Informatics Tools.

Ontology-based De-identification of Clinical Narratives

Presentation showing a method to remove Protected Health Information (PHI) from free text fields, using the Apache cTakes lexical annotation tool.

The normal method for de-identifying free text is to train software to recognise personal information. However, the number of training examples available is usually quite small. This team reversed the task by training the software to recognise non-PHI data, using the following:


  1. cTakes
  2. Frequency of term in medical journal articles.
  3. Match terms to ontologies. Diseases (etc.) named after people can be a problem, but a match on a term of more than one word implies that it is not a name. For example, 'Hodgkins Lymphoma' would not match 'Mr Hodgkins'.
  4. Remove items from known PHI lists - presumably the person's name and address, etc.
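
The idea of keeping only what is recognised as non-PHI can be illustrated with a toy rule-based filter. This is not the presenters' method (they trained software rather than writing rules); the function, the two-word window and the word lists are all hypothetical and exist only to show how the signals in the list could interact.

```python
def deidentify(tokens, ontology_terms, journal_vocab, known_phi):
    """Keep tokens recognised as non-PHI and redact everything else."""
    out = []
    i = 0
    while i < len(tokens):
        pair = " ".join(tokens[i:i + 2]).lower()
        word = tokens[i].lower()
        if i + 1 < len(tokens) and pair in ontology_terms:
            # A two-word ontology match is treated as a medical term.
            out.extend(tokens[i:i + 2])
            i += 2
        elif word in known_phi:
            # Items on a known PHI list are always removed.
            out.append("[REDACTED]")
            i += 1
        elif word in journal_vocab:
            # Words common in journal articles are assumed to be safe.
            out.append(tokens[i])
            i += 1
        else:
            # Anything not recognised as non-PHI is removed.
            out.append("[REDACTED]")
            i += 1
    return out
```

With 'hodgkins lymphoma' in the ontology set and 'hodgkins' on the PHI list, the sentence 'Mr Hodgkins has Hodgkins lymphoma' keeps the disease name but redacts the surname.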

Ontology-based Discovery of Disease Activity from the Clinical Record

Presentation of a project to use NLP to find evidence of disease activity and find its temporal relationship to drug events to identify patients as responders or non-responders for genetic analysis.

This talk put forward a method of using 3 data sets when training the software:

  1. Annotated training set
  2. Known set - a pre-annotated set that is used repeatedly to test the software, but not to train it.
  3. Unknown random set - a random sample from a larger set that is used once for testing. The results of the test are manually assessed after the run.
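
A minimal sketch of how the three sets could be drawn, assuming the annotated documents are split between training and the known set, and the unknown set is sampled from the remaining unannotated documents. The function name, the 20% fraction and the sample size are hypothetical.

```python
import random


def split_three_ways(annotated, unannotated, known_fraction=0.2,
                     unknown_size=50, seed=0):
    """Return (training_set, known_set, unknown_set)."""
    rng = random.Random(seed)
    shuffled = list(annotated)
    rng.shuffle(shuffled)
    n_known = int(len(shuffled) * known_fraction)
    known_set = shuffled[:n_known]        # reused for testing, never trained on
    training_set = shuffled[n_known:]     # annotated examples used for training
    pool = list(unannotated)
    # Drawn once; results are reviewed manually after the run.
    unknown_set = rng.sample(pool, min(unknown_size, len(pool)))
    return training_set, known_set, unknown_set
```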

Ontology Normalisation of the Clinical Narrative

Introduction to the Apache cTakes project: a set of tools for parsing text, one of which is a UMLS component, although a custom dictionary can be used instead.

It can be used to extract UMLS concepts (or other codes), and it also extracts who a finding relates to, where on the body it is (for example, knee), and whether it is negated.

Ontology Concept Selection

Presentation of a batch tool for selecting UMLS terms that match local terms. The consensus in the room appeared to be that the UMLS website had similar or better tools.

One topic that was discussed was when terms may safely be aggregated. Three cases were identified where terms may be safely aggregated:

  1. Terms are the same thing.
  2. Terms are closely related and do not need to be disambiguated in this specific case.
  3. True hierarchies, e.g., troponin C, I & T => troponin.
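
The true-hierarchy case can be sketched as a simple roll-up from child terms to their parent. The mapping table here is a hypothetical example built from the troponin case above; terms without a parent are left as they are.

```python
from collections import Counter

# Hypothetical child-to-parent table for a true hierarchy.
PARENT = {
    "troponin C": "troponin",
    "troponin I": "troponin",
    "troponin T": "troponin",
}


def aggregate(observations, parent=PARENT):
    """Count observations after rolling each term up to its parent."""
    return Counter(parent.get(term, term) for term in observations)
```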

Active Learning for Ontology-based Phenotyping

Presentation on a proposed Active Learning method to reduce the number of training examples an algorithm requires for machine learning, as annotation of examples is slow and expensive.

The normal method for training an NLP algorithm is to randomly select a portion of the data to annotate. This method proposes initially annotating a small number of random samples and then selecting subsequent samples to annotate where the algorithm's output for that sample has a low prediction margin.

Prediction Margin = confidence for class A - confidence for class B

In other words, how sure it is that the best answer is correct.

The presentation showed that in general (but not for every example) the Active Learning method needed fewer annotated examples to reach a high level of confidence.
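
The selection step can be sketched as follows, assuming class A and class B are the two classes with the highest confidence for a sample. The function names and the `predict_proba` callable (anything that returns a list of class confidences for one sample) are hypothetical.

```python
def prediction_margin(probabilities):
    """Difference between the two highest class confidences."""
    top, second = sorted(probabilities, reverse=True)[:2]
    return top - second


def select_for_annotation(unlabelled, predict_proba, batch_size=10):
    """Pick the samples the model is least sure about (smallest margin)."""
    scored = [(prediction_margin(predict_proba(x)), i)
              for i, x in enumerate(unlabelled)]
    scored.sort()  # smallest margin first
    return [unlabelled[i] for _, i in scored[:batch_size]]
```

The chosen samples are then annotated, added to the training set, and the model is retrained before the next round of selection.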


Conclusion

These are the conclusions that I (Richard Bramley) drew from the NLP conference.

  1. There are a lot of tools and resources available. The integration of the cTakes tools with UMLS seems especially useful.
  2. Access to clinician time to train NLP algorithms is essential.
  3. The statistical analysis of the results is beyond my current capabilities. I may need some training in this area.

Academic User Group

Genomic Cell

Presentation of the new Genome Cell for i2b2 which uses the following pipeline:

  1. VCF - Variant Call Format (variations from a reference genome)
  2. Annovar
  3. GVF - Genome Variation Format
  4. i2b2
    • Observation (snp - no data)
    • use modifiers to record the snp information

Uses the NCBO genome ontologies.
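
To make the observation-plus-modifiers idea concrete, here is a hedged sketch that turns one VCF data line directly into i2b2-style fact rows, skipping the Annovar and GVF steps. The `SNP:` and `SNP_MOD:` code prefixes are invented for illustration and are not the Genomic Cell's actual coding scheme; only the column names (`patient_num`, `concept_cd`, `modifier_cd`, `tval_char`) and the '@' base modifier follow the standard observation fact table.

```python
def vcf_line_to_facts(line, patient_num):
    """Build one base fact plus one modifier fact per SNP detail."""
    # The first five VCF columns are CHROM, POS, ID, REF and ALT.
    chrom, pos, var_id, ref, alt = line.rstrip("\n").split("\t")[:5]
    concept_cd = f"SNP:{var_id}"  # hypothetical concept code scheme
    base = {
        "patient_num": patient_num,
        "concept_cd": concept_cd,
        "modifier_cd": "@",       # base observation carries no data itself
        "tval_char": None,
    }
    details = {"CHROM": chrom, "POS": pos, "REF": ref, "ALT": alt}
    modifiers = [
        {
            "patient_num": patient_num,
            "concept_cd": concept_cd,
            "modifier_cd": f"SNP_MOD:{name}",  # hypothetical modifier codes
            "tval_char": value,
        }
        for name, value in details.items()
    ]
    return [base] + modifiers
```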

One problem with recording VCF information is that annotations will change over time. This is because:

  • The reference human genome changes regularly (~once per year).
  • New knowledge changes the way genomes are annotated (could be much more frequent).


SMART Apps

Presentation of the SMART plugin framework for i2b2, which allows apps that use individual patient level data to be displayed or utilised within a SMART container, such as i2b2.

SMART apps are registered with the SMART container and can then be selected from an App Store within the SMART container. Many SMART apps can be embedded into a single page using panels. Examples of SMART apps are:

  1. Cardiac Risk Monitor.
  2. Patient details.
  3. Diabetes Risk.
  4. Blood Pressure standardiser.
  5. Medication List.
  6. Procedure List.

Another useful SMART app is the Clinical Trial Matcher. It allows you to enter criteria for participation into a clinical trial. You can then view the matcher for particular patients to see if they match the criteria.

SMART apps may be useful if working with a project for each trial within i2b2. That is, a cohort of patients is identified for a trial and a project is then created for those patients (see i2b2 CT). Researchers for that trial can then see the individual patient details for the patients within that project.


i2b2 Roadmap

Long-term goals for i2b2:

  1. Supporting Cohort Discovery
  2. Supporting Big Data
  3. Plugin Development: SETL - ETL Cell
  4. Continued Development

Supporting Cohort Discovery

Development is underway to increase the workflow capabilities of i2b2 for clinical trials. This is called i2b2 CT. These improvements include:

  1. Better visibility of patient details for users with the correct permissions. This requires the population of the patient mapping table. Patient sets can then be dragged onto a patient list tool that shows all their mapped IDs. Individual patients can then be dragged into user-defined sets that can be used in the same way as query defined patient sets.
  2. Patient sets can be used to create new projects for a clinical trial, where they will have a separate project-specific ID.
  3. Multi-site clinical trial projects can be set up using SHRINE.
  4. Better visibility of patient details using SMART apps.
  5. Individual projects may have a subset of the ontology, though can still use the same observation fact table.
  6. A specific web client to be written to support the i2b2 CT workflow.

Supporting Big Data

This will be based on a flag in the Ontology Cell which will inform the CRC that it must query an external system that will return a patient set.

Plugin Development: SETL - ETL Cell

This new cell allows i2b2 to connect to web services to request specific patient level data. The new cell has two purposes. Firstly, its main purpose appears to be to allow SMART apps to request extra data. Additionally, it can be used to start a bespoke SSIS package to load data from a file in the File Repository cell.

Continued Development

Temporal Query UI Improvements

The definition of a temporal query has been split into two stages.

  1. Define population - normal i2b2 query definition.
  2. Define temporal aspects of the query.

PostgreSQL Support

Initial work to support PostgreSQL has been carried out, but further performance improvements need to be made. For this the team would like help from members of the community familiar with PostgreSQL.

Other Upgrades

  1. Support for JBoss 7.1
  2. Moved to a POJO architecture.
  3. Support of SQL Server 2012.

Planning for the future

Suggestions for community development

  1. Geolocation Cell.
  2. Merging of patients from different i2b2 instances probabilistically.
  3. SMART App for simple NLP.

The Future of i2b2

Funding for the i2b2 project will end in September 2014. i2b2 will still be partially funded by other projects, but there will not be funding for a team to concentrate solely on i2b2. Suggested alternatives are:

  1. Have commercial and community editions.
  2. Increase involvement of the community in support and development.
  3. Kickstarter.

AUG Community Projects

Community projects are hosted through the community website. These include:

  1. mi2b2 - for viewing images.
  2. Time align - A different time line view.
  3. Trends - (developed by Wake Forest) show changes of result set over time.
  4. NCBO Ontology Tools.

They are currently looking for people to help with the development of the web client.

Mention was made of CRC plugins that I need to investigate.

From Genetic Variants to i2b2 using NoSQL database

Various NoSQL databases are available: MongoDB, Cassandra, Apache CouchDB, Hadoop.

For this project they used Apache CouchDB, which is a JSON document store. It uses predefined queries written in JavaScript, held in what is called a Design Document. The Design Document is compiled the first time it is run, so it is usual to run the queries once after deploying them so that they are all precompiled.

Genomic Data Load Workflow

VCF => Annovar => CSV (plus additional data from other systems, such as patient ID) => Parser => JSON => CouchDB <-> i2b2
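
The Parser step of this workflow might look something like the sketch below, which groups CSV rows into one JSON document per patient. The column names and the one-document-per-patient layout are assumptions made for illustration, not the Pavia team's actual design; `_id` is the standard CouchDB document identifier field.

```python
import csv
import io
import json
from collections import defaultdict


def csv_to_patient_documents(csv_text):
    """Group annotated variant rows into one JSON document per patient."""
    variants_by_patient = defaultdict(list)
    for row in csv.DictReader(io.StringIO(csv_text)):
        patient_id = row.pop("patient_id")  # hypothetical column name
        variants_by_patient[patient_id].append(row)
    return [
        json.dumps({"_id": patient_id, "variants": variants}, sort_keys=True)
        for patient_id, variants in sorted(variants_by_patient.items())
    ]


# Hypothetical input in the shape the CSV step might produce.
SAMPLE_CSV = (
    "patient_id,gene,variant\n"
    "P1,GENE_A,VAR_1\n"
    "P1,GENE_B,VAR_2\n"
    "P2,GENE_A,VAR_1\n"
)
```

Each resulting string could then be stored as a document in the database, ready to be indexed by the queries in the Design Document.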

Running Queries

They have created a web client plugin into which a patient set can be dragged. The plugin then counts and lists the patients who have a given variant.

The plugin is due to be released by the end of the year. They need to do some additional testing on how well the plugin scales.

Identifying "Normal" Patients

A presentation on the identification of normal patients for control groups. The process involved identifying criteria that excluded a patient from being normal, such as serious illness, age range, missing details, etc.
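
As a hedged illustration of exclusion by criteria, the sketch below rejects a patient record that has missing details, falls outside an age range, or carries a serious diagnosis. The field names, the age limits and the list of serious conditions are hypothetical and were not given in the presentation.

```python
def is_normal(patient, min_age=18, max_age=65,
              serious=frozenset({"cancer", "heart failure"})):
    """Return True only if no exclusion criterion applies."""
    required = ("age", "sex", "diagnoses")
    if any(patient.get(field) is None for field in required):
        return False  # missing details
    if not min_age <= patient["age"] <= max_age:
        return False  # outside the age range
    if serious & set(patient["diagnoses"]):
        return False  # serious illness recorded
    return True
```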

Extending i2b2 with the R Statistical Platform

Presentation on the use of a set of i2b2 plugins to allow the R statistical platform to be used.

Integrated Data Repository Toolkit (IDRT) and ETL Tools

i2b2 Wizard

Presentation of the i2b2 wizard tool that aids in the installation of i2b2.

Changes for version 2:

  • Different versions of i2b2 can be installed.
  • Works with different DBs (currently only works with Oracle).

ETL Tools

Presentation of the ETL tools using the Talend ETL framework.

Other Comments and Things Learnt

Ontology Item Sub Queries

Ontology items can be defined as a sub query. This can be used for things such as:

  1. Date Calculations
  2. Aggregate Values

CRC Plugins

There was mention made of CRC plugins. I will need to investigate these further.

Ontologies and mapping

Wisdom provided by Matvey Palchuk (Recombinant):

  • Use actual values; the problems of ranges then go away.
  • Get stuff into i2b2 even if the only hierarchy that you can make is splitting things by first letter.
  • Payment data is usually clean data.

i2b2 SHRINE Conference

SHRINE Clinical Trials (CT) Functionality and Roadmap

SHRINE allows a query to be run across multiple i2b2 instances. It is implemented as an i2b2 cell called the SHRINE adapter. This adapter maps concept codes from their local value to the standard values used on the SHRINE network.
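
A minimal sketch of that mapping step, assuming a simple lookup table; all of the codes are made up and this is not the SHRINE adapter's actual implementation. Because several local codes may map to one network code, the table is inverted so that a network query term expands to every matching local code.

```python
from collections import defaultdict

# Hypothetical local-to-network mapping table for one site.
LOCAL_TO_SHRINE = {
    "LOCAL:NA": "SHRINE:SODIUM",
    "LOCAL:SODIUM_POC": "SHRINE:SODIUM",
    "LOCAL:K": "SHRINE:POTASSIUM",
}


def build_reverse_index(local_to_shrine):
    """Invert the mapping: network code -> set of local codes."""
    index = defaultdict(set)
    for local_code, shrine_code in local_to_shrine.items():
        index[shrine_code].add(local_code)
    return index


def to_local_codes(shrine_codes, reverse_index):
    """Translate the network codes in a query into local codes."""
    local_codes = set()
    for code in shrine_codes:
        local_codes |= reverse_index.get(code, set())
    return local_codes
```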


This project uses the i2b2 CT changes that allow individual patient details to be viewed in projects. It extends the idea to allow the projects to span multiple sites.

  1. User creates a query using SHRINE to create a patient set.
  2. This patient set is dragged into the new Authorization Request Module in the web client.
  3. Project is created using the patients when the authorisation has been received.
  4. Specified users with the correct permissions can then view a limited set of data for these patients and run queries based on this patient set.
  5. Users from the patient's originating site with the correct permission can also view the patient PII data using SMART apps.

Clinical Trial SMART app

This SMART app allows the user to enter the criteria for entering a clinical trial. The app then tells the user whether the patient is eligible or, if not, why not.
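
The eligible-or-why-not behaviour can be sketched as a list of named criteria, each paired with a test. This is a guess at the general shape of such a check, not the app's implementation; the function name, the criteria and the patient fields are hypothetical.

```python
def check_eligibility(patient, criteria):
    """Return (eligible, reasons), where reasons names each failed criterion."""
    reasons = [name for name, test in criteria if not test(patient)]
    return (not reasons, reasons)


# Hypothetical trial criteria expressed as (description, test) pairs.
TRIAL_CRITERIA = [
    ("aged 18 or over", lambda p: p["age"] >= 18),
    ("has type 2 diabetes", lambda p: "type 2 diabetes" in p["diagnoses"]),
    ("not on insulin", lambda p: "insulin" not in p["medications"]),
]
```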

Release Stages

The release of the changes will be staged as follows:

  1. Allow managers at each site to view queries run from the SHRINE on their i2b2 instance, including the patient sets returned.
  2. Improved web client UI.
  3. Allow patient sets to be assigned to multi-site projects, which makes a limited data set visible.
  4. Allow selection of individual patients into user generated patient sets.
  5. Enable patient-centric SMART apps showing PII for multi-site projects.

SHRINE National Pilot Lessons Learned

A couple of presentations and a panel to discuss the problems encountered from running the SHRINE national pilot. These included:

  • Authentication problems. Running queries across multiple sites highlighted potential problems with user authentication:
    • How can we be sure that the request has come from the correct organisation? For this each node in the SHRINE had to install an SSL certificate for each other node.
    • How can we be sure that the user making the request is authenticated? Have to trust client's authentication. Could use Shibboleth or OAuth.
  • Connectivity. Network problems meant that quite often one or more of the nodes required for a query was unavailable, causing the query to fail silently. These issues will be addressed in the SHRINE source code.
  • Ontology / Mapping / Semantic Issues. See later.
  • Peer-to-peer issues. Each node connects directly to every other node, so the administration effort rises steeply (roughly with the square of the number of nodes) as nodes are added.
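
The cost of the peer-to-peer arrangement can be worked out directly, assuming every node installs exactly one certificate for each other node as described above. The function names are hypothetical.

```python
def mesh_links(nodes):
    """Number of direct node-to-node connections in a full mesh."""
    return nodes * (nodes - 1) // 2


def mesh_certificate_installs(nodes):
    """Certificates installed in total when each node trusts every other."""
    return nodes * (nodes - 1)
```

For 4 nodes this gives 6 links and 12 certificate installs; for 10 nodes it gives 45 links and 90 installs, so the effort grows with the square of the number of nodes.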

SHRINE Ontology Panel

The sites within a SHRINE do not all have to use the same ontology because there is an intermediate mapping step. Instead, each organisation within the SHRINE has to map its own codes to the standard specified by the SHRINE.

It may, however, be more efficient and easier to create a new instance of i2b2 for the SHRINE with the data mapped to the SHRINE common ontology.

Some of the problems encountered with SHRINE mappings were:

  1. Data aggregated at different levels between the SHRINE and node ontologies can cause problems. If the SHRINE ontology is mapped at a higher level, you may be OK, but if SHRINE is mapped at a lower level, it is impossible to split the node value into SHRINE values.
  2. If the data is split into ranges (e.g., age ranges), it is possible that the ranges for the SHRINE and those for the node do not match. This can be solved by everyone loading the actual value and not grouping things into ranges (e.g., actual age).
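
The second point can be shown with a small sketch: when actual ages are stored, counts can be produced for any set of ranges on demand, so mismatched range boundaries stop being a problem. The function name and the example ranges are hypothetical.

```python
def count_by_range(ages, ranges):
    """Count ages falling in each inclusive (low, high) range."""
    return {
        f"{low}-{high}": sum(low <= age <= high for age in ages)
        for low, high in ranges
    }


# Hypothetical ranges used locally and on the network.
NODE_RANGES = [(0, 17), (18, 34), (35, 64), (65, 120)]
SHRINE_RANGES = [(0, 9), (10, 19), (20, 39), (40, 120)]
```

The same list of actual ages can be passed with either set of ranges, whereas counts already grouped into the node's ranges could not be regrouped into the network's ranges.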

University of California Research Exchange (UC ReX)

Presentation of the use of SHRINE to join data across 2 medical schools in California. The data imported is mainly demographic data, but the top 200 LOINC codes have recently been imported as well.

Preparation for Patient-Centred Research

Social media shows that people are willing to share their data. There is scope for altruistic patients to share their information.

The best way to get things to change is to use Disruptive Technologies. These are usually low level, basic technologies that are cheap, easy to create and easy to use. SMART apps could be such a Disruptive Technology.

Case Study: Improve Care Now


Last modified on 12/09/15 20:04:17