= i2b2 AUG 2013 =

== Program ==
=== NLP Workshop ===

1. [#NLP1 UMLS Ontologies and Ontology Resources] ''(Olivier Bodenreider)''
1. [#NLP2 Ontology-based De-identification of Clinical Narratives] ''(Finch and !McMurry)''
1. [#NLP3 Ontology-based Discovery of Disease Activity from the Clinical Record] ''(Lin)''
1. [#NLP4 Ontology Normalisation of the Clinical Narrative] ''(Chen)''
1. [#NLP5 Ontology Concept Selection] ''(Yu)''
1. [#NLP6 Active Learning for Ontology-based Phenotyping] ''(Dligach)''
1. [#NLP7 Conclusion]

=== Academic User Group ===

1. [#AUG1 Genomic Cell] ''(Shawn Murphy and Lori Philips)''
1. [#AUG2 SMART Apps] ''(Wattanasin)''
1. [#AUG3 i2b2 Roadmap] ''(Shawn Murphy)''
1. [#AUG4 Planning for the future] ''(Kohane)''
1. [#AUG5 From Genetic Variants to i2b2 using NoSQL database] ''(Matteo Gabetta - Pavia)''
1. [#AUG6 Extending i2b2 with the R Statistical Platform]
1. [#AUG7 Integrated Data Repository Toolkit (IDRT) and ETL Tools] ''(Sebastian Mate - Erlangen; Christian Bauer - Goettingen)''

=== i2b2 SHRINE Conference ===

1. [#SHRINE1 SHRINE Clinical Trials (CT) Functionality and Roadmap] ''(Shawn Murphy)''
1. [#SHRINE2 SHRINE National Pilot Lessons Learned]
1. [#SHRINE3 SHRINE Ontology Panel]
1. [#SHRINE4 University of California Research Exchange (UC ReX)] ''(Doug Berman)''
1. [#SHRINE5 Preparation for Patient-Centred Research] ''(Ken Mandl)''
1. [#SHRINE6 Case Study: Improve Care Now] ''(Peter Margolis)''

== NLP Workshop ==

=== [=#NLP1 UMLS Ontologies and Ontology Resources] ===

Presentation showing how UMLS resources can be used with NLP to extract information from free text.

NLP has two stages:

1. Entity Recognition - Identifying important terms within text
1. Relationship Extraction - linking entities together

==== Entity Recognition ====

Three major problems when identifying entities within a text:

1. Entities are missed
1. Entities are partially matched - part of the term is matched but another part is missed, leading to incomplete information or context.  For example, in the term 'bilateral vestibular' only the second word may be matched.
1. Ambiguous terms - terms that may have two meanings.

Entities are identified by a combination of normalisation and longest term matching.

Normalisation is the process of rewriting a term into a canonical form, so that the many surface variants of the same term all produce the same string and therefore match one another.  The process involves removing noise words, standardising inflections and derivatives (e.g., removing plurals), removing punctuation, converting to lower case, and sorting the words into alphabetical order.
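The normalisation steps listed above can be sketched in a few lines of Python.  This is only an illustration of the idea, not the UMLS implementation: the noise-word list and the crude plural-stripping rule are assumptions made for the example.

```python
import string

# Hypothetical noise-word list; real lexical tools ship much larger ones.
NOISE_WORDS = {"of", "the", "and", "in", "with"}


def strip_plural(word):
    """Very crude inflection handling (assumption): drop a trailing 's'."""
    if len(word) > 3 and word.endswith("s") and not word.endswith("ss"):
        return word[:-1]
    return word


def normalise(term):
    """Return a canonical form of a term for dictionary lookup."""
    # Convert to lower case and replace punctuation with spaces.
    table = str.maketrans(string.punctuation, " " * len(string.punctuation))
    words = term.lower().translate(table).split()
    # Remove noise words, standardise inflections, then sort alphabetically.
    words = [strip_plural(w) for w in words if w not in NOISE_WORDS]
    return " ".join(sorted(words))


print(normalise("Fractures of the Femur"))  # femur fracture
print(normalise("femur, fracture"))         # femur fracture
```

Both variants reduce to the same string, so a single dictionary entry can match either of them.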

To extract the most meaning from the text, each term is matched against the longest candidate, i.e., the dictionary entry with the greatest number of matching words.  For example, 'left atrium' is preferred over just 'atrium'.
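Longest term matching can be sketched as a greedy left-to-right scan.  The dictionary contents, the `max_len` limit and the function name are assumptions for illustration; real tools work on normalised terms and much larger vocabularies.

```python
def longest_match(tokens, dictionary, max_len=4):
    """Greedy left-to-right scan that prefers the longest dictionary term."""
    matches = []
    i = 0
    while i < len(tokens):
        # Try the longest candidate starting at position i first.
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            candidate = " ".join(tokens[i:i + n])
            if candidate in dictionary:
                matches.append(candidate)
                i += n
                break
        else:
            i += 1  # no dictionary term starts at this token
    return matches


# Hypothetical dictionary containing both the short and the long term.
DICTIONARY = {"atrium", "left atrium", "enlarged"}

print(longest_match("the left atrium is enlarged".split(), DICTIONARY))
# ['left atrium', 'enlarged']
```

Because the longer candidate is tried first, 'atrium' on its own is only returned when 'left atrium' does not match.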

==== Types of Resources useful for Entity Recognition ====

There are several types of resource:

1. Lexical resources - lists of terms with variant spellings, derivatives and inflections, associated with the part of speech to which they refer.  These can be either general or include specialist medical terms.
1. Ontologies - set of entities with relationships between the entities.
1. Technical resources - set of terms and identifiers used to map a term to an ontology.
1. Hybrid - A mixture of 1 and 2.  They are not strictly speaking ontologies as the relationships may not always be true (e.g., a child may not always be a part of the parent).  They are useful for finding terms, but should not be used for aggregation.

==== Lexical Resources ====

1. [[http://lexsrv3.nlm.nih.gov/Specialist/Home/index.html|UMLS Specialist Lexicon]] - Medical and general English
1. [[http://wordnet.princeton.edu/|WordNet]] - General English
1. [[http://lexsrv2.nlm.nih.gov/LexSysGroup/Projects/lvg/2012/docs/userDoc/tools/lvg.html|LVG Lexical Variant Generation]] - specialist tool
1. [[http://www.ebi.ac.uk/Rebholz-srv/BioLexicon/biolexicon.html|BioLexicon]] - EU project.  Not as general.  Mainly focused on genes.
1. [[http://pir.georgetown.edu/pirwww/iprolink/biothesaurus.shtml|BioThesaurus]] - Focused on proteins and genes.
1. [[http://www.nlm.nih.gov/research/umls/rxnorm/|RxNorm]] - Drug specific.

==== Ontological Resources ====

1. [[http://semanticnetwork.nlm.nih.gov/|UMLS Semantic Network]]

==== Terminology Resources ====

1. [[http://www.nlm.nih.gov/research/umls/knowledge_sources/metathesaurus/index.html|UMLS MetaThesaurus]]
 * Groups terms from many ontologies
 * Produces a graph of all the relationships
 * Graph is not acyclic and contains contradictions ''because'' it reproduces its source ontologies exactly.
 * Allows terms to be mapped between standards.
1. [[http://www.nlm.nih.gov/research/umls/rxnorm/|RxNorm]]
 * Map between many drug lists.
 * Map between branded and generic drug names.
1. [[http://metamap.nlm.nih.gov/|MetaMap]]
 * Free with licence agreement
 * Based on UMLS MetaThesaurus.
 * Parses text to find terms.
 * Used in IBM's Watson tool.
 * Terms can be translated between various standards, including Snomed.
 * Copes with term negation and disambiguation.
1. [[http://www.nactem.ac.uk/software/termine/|TerMine]]
1. [[http://www.ebi.ac.uk/webservices/whatizit/info.jsf|WhatIzIt]]

==== Relationship Extraction ====

1. [[http://skr.nlm.nih.gov/|SemRep]]

==== Orbit Project ====

The [[http://orbit.nlm.nih.gov|Orbit Project]] is the Online Registry of Biomedical Informatics Tools.


=== [=#NLP2 Ontology-based De-identification of Clinical Narratives] ===

Presentation showing a method to remove Protected Health Information (PHI) from free text fields, using the Apache cTakes lexical annotation tool.

The normal method for attempting to de-identify free text is to train software to recognise personal information.  However, the number of training examples available is usually quite small.  This team attempted to reverse the task by training the software to recognise non-PHI data.

Pipeline:

1. cTakes
1. Frequency of term in medical journal articles.
1. Match terms to ontologies.  Diseases and other terms named after people can be a problem, but a match on a term of more than one word implies that it is not a name.  For example, 'Hodgkin lymphoma' would not match 'Mr Hodgkin'.
1. Remove items from known PHI lists - presumably the person's name and address, etc.
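The interaction between the last two pipeline steps can be sketched as follows.  This is my own illustration of the multi-word rule, not the presenters' code: the term list, the PHI list and the `redact` function are all hypothetical, and the sketch only handles two-word ontology terms.

```python
# Hypothetical ontology terms and per-patient PHI list (assumptions).
ONTOLOGY_TERMS = {"hodgkin lymphoma", "crohn disease"}
KNOWN_PHI = {"hodgkin", "smith"}


def redact(text):
    """Replace known PHI tokens unless they sit inside a multi-word ontology term."""
    tokens = text.lower().split()
    protected = [False] * len(tokens)
    # A two-word ontology match marks both tokens as non-PHI.
    for i in range(len(tokens) - 1):
        if " ".join(tokens[i:i + 2]) in ONTOLOGY_TERMS:
            protected[i] = protected[i + 1] = True
    out = []
    for token, safe in zip(tokens, protected):
        out.append(token if safe or token not in KNOWN_PHI else "[PHI]")
    return " ".join(out)


print(redact("Mr Hodgkin has Hodgkin lymphoma"))
# mr [PHI] has hodgkin lymphoma
```

The surname is removed when it stands alone but kept when it forms part of the disease name.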

=== [=#NLP3 Ontology-based Discovery of Disease Activity from the Clinical Record] ===

Presentation of a project to use NLP to find evidence of disease activity and find its temporal relationship to drug events to identify patients as responders or non-responders for genetic analysis.

This talk put forward a method of using 3 data sets when training the software:

1. Annotated training set
1. Known set - a pre-annotated set that is used repeatedly to test the software, but not to train it.
1. Unknown random set - a random sample from a larger set that is used once for testing.  The results of the test are manually assessed after the run.
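A minimal sketch of how the three sets might be drawn from a pool of documents.  The function name, the set sizes and the fixed seed are assumptions; the talk did not specify how the sets were actually selected.

```python
import random


def split_corpus(doc_ids, n_train, n_known, n_unknown, seed=0):
    """Split document ids into training, known and unknown sets without overlap."""
    rng = random.Random(seed)
    shuffled = list(doc_ids)
    rng.shuffle(shuffled)
    train = shuffled[:n_train]                    # annotated, used for training
    known = shuffled[n_train:n_train + n_known]   # annotated, reused for testing only
    remainder = shuffled[n_train + n_known:]
    unknown = rng.sample(remainder, n_unknown)    # used once, assessed by hand afterwards
    return train, known, unknown


train, known, unknown = split_corpus(range(100), n_train=20, n_known=10, n_unknown=5)
print(len(train), len(known), len(unknown))  # 20 10 5
```

Keeping the known set out of training means it can be reused across runs, while a fresh unknown sample guards against tuning the software to the known set.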

=== [=#NLP4 Ontology Normalisation of the Clinical Narrative] ===

Introduction to the [[http://ctakes.apache.org/|Apache cTakes]] project: a set of tools for parsing text, one of which is a UMLS component, although a custom dictionary can also be used.

Can be used to extract UMLS concepts (or other codes), but also extracts '''who''' the patient is, '''where''' on the body the concept applies (for example, knee) and also '''negation'''.

=== [=#NLP5 Ontology Concept Selection] ===

Presentation of a batch tool for selecting UMLS terms that match local terms.  The consensus in the room appeared to be that the UMLS website had similar or better tools.

One topic that was discussed was when terms may safely be aggregated.  Three cases were identified where terms may be safely aggregated:

1. Terms are the same thing.
1. Terms are closely related and do not need to be disambiguated in this specific case.
1. True hierarchies, e.g., troponin C, I & T => troponin.

=== [=#NLP6 Active Learning for Ontology-based Phenotyping] ===

Presentation on a proposed Active Learning method to reduce the number of training examples an algorithm requires for machine learning, as annotation of examples is slow and expensive.

The normal method for training an NLP algorithm is to randomly select a portion of the data to annotate.  This method proposes initially annotating a small number of random samples, and then selecting subsequent samples to annotate where the algorithm's output for that sample has a low ''prediction margin''.

'''Prediction Margin = confidence for class A - confidence for class B'''

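In other words, the margin measures how sure the algorithm is that its best answer is correct.  A low margin marks a sample that the algorithm finds hard to classify.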
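The selection step can be sketched as below.  I have assumed that class A is the most confident class and class B the second most confident; the function names and the example confidences are made up for illustration.

```python
def prediction_margin(confidences):
    """Difference between the two highest class confidences (assumed reading of A and B)."""
    top, second = sorted(confidences, reverse=True)[:2]
    return top - second


def select_for_annotation(predictions, batch_size):
    """Return the ids of the unlabelled samples with the smallest margins."""
    ranked = sorted(predictions, key=lambda sid: prediction_margin(predictions[sid]))
    return ranked[:batch_size]


# Hypothetical classifier output: sample id -> confidence per class.
predictions = {
    "note-1": [0.95, 0.05],
    "note-2": [0.55, 0.45],
    "note-3": [0.70, 0.30],
}

print(select_for_annotation(predictions, 2))  # ['note-2', 'note-3']
```

The samples sent for annotation are the ones the algorithm is least sure about, which is where a new label should be most informative.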

The presentation showed that in general (but not for every example) the Active Learning method needed fewer annotated examples to reach a high level of confidence.

=== [=#NLP7 Conclusion] ===

These are the conclusions that I (Richard Bramley) drew from the NLP conference.

1. There are a lot of tools and resources available.  The integration of the cTakes tools with UMLS seems especially useful.
1. Access to clinician time to train NLP algorithms is '''essential'''.
1. The statistical analysis of the results is beyond my current capabilities.  I may need some training in this area.

== Academic User Group ==

=== [=#AUG1 Genomic Cell] ===

Presentation of the new Genome Cell for i2b2 which uses the following pipeline:

1. VCF - Variant Call Format (variations from a reference genome)
1. [[http://www.openbioinformatics.org/annovar/|Annovar]]
1. GVF - Genome Variation Format
1. i2b2
 * Observation (snp - no data)
 * use modifiers to record the snp information

Uses the NCBO genome ontologies.

One problem with recording VCF information is that annotations will change over time.  This is because:

* The reference human genome changes regularly (~once per year).
* New knowledge changes the way genomes are annotated (could be much more frequent).

=== [=#AUG2 SMART Apps] ===

Presentation of SMART plugin framework for i2b2 that allows apps that use individual patient level data to be displayed or utilised within a SMART container, such as i2b2.

SMART apps are registered with the SMART container and can then be selected from an App Store within the SMART container.  Many SMART apps can be embedded into a single page using panels.  Examples of SMART apps are:
1. Cardiac Risk Monitor.
1. Patient details.
1. Diabetes Risk.
1. Blood Pressure standardiser.
1. Medication List.
1. Procedure List.

Another useful SMART app is the Clinical Trial Matcher.  It allows you to enter criteria for participation into a clinical trial.  You can then view the matcher for particular patients to see if they match the criteria.

SMART apps may be useful when working with a separate project for each trial within i2b2.  That is, a cohort of patients is identified for a trial and a project is then created for those patients (see i2b2 CT).  Researchers for that trial can then see the individual patient details for the patients within that project.

See:
* [[http://www.smarti2b2.org]]
* [[http://www.smartplatforms.org]]



=== [=#AUG3 i2b2 Roadmap] ===
=== [=#AUG4 Planning for the future] ===
=== [=#AUG5 From Genetic Variants to i2b2 using NoSQL database] ===
=== [=#AUG6 Extending i2b2 with the R Statistical Platform] ===
=== [=#AUG7 Integrated Data Repository Toolkit (IDRT) and ETL Tools] ===

== i2b2 SHRINE Conference ==

=== [=#SHRINE1 SHRINE Clinical Trials (CT) Functionality and Roadmap] ===
=== [=#SHRINE2 SHRINE National Pilot Lessons Learned] ===
=== [=#SHRINE3 SHRINE Ontology Panel] ===
=== [=#SHRINE4 University of California Research Exchange (UC ReX)] ===
=== [=#SHRINE5 Preparation for Patient-Centred Research] ===
=== [=#SHRINE6 Case Study: Improve Care Now] ===