Onyx Export and Purge

Sources of Info

The Onyx User Guide has a useful Chapter 12 "Topics for System Administrators" with details of the export and purge functions.

There are other useful links on the OBIBA wiki.

Overview

Exporting data from Onyx means reading data from the Onyx database and writing it to one or more export destinations. Exporting does not delete any data from the Onyx database. Deleting data from the database is done by the purge function. An export destination is a compressed zip file. Participant data and experimental conditions data can be exported. Configuration of data export is done entirely in configuration files, not through the Onyx user interface. Some things that can be configured:

  • Which data is selected for export
  • Directory to which export files are written
  • How many export destinations are defined

System administrators trigger an export via the Onyx web interface. The configuration file controls everything else. It is NOT possible from within the web interface to choose which participants will be exported. Any selective export can only be tailored via the configuration file.

Destination

The Onyx export routine writes data to a specific 'target' directory within the Tomcat working directory. As of 31st October 2011, this differed between the test and live systems on the UHL BRICCS servers.

For the live BRICCS Onyx server, the target is /tmp/tomcat6-temp/target/

For the test BRICCS Onyx server, the target is /var/lib/tomcat6/target/

This was changed so that, from 2012 onwards, the live BRICCS server also outputs to /var/lib/tomcat6/target/, because outputting into the /tmp directory isn't a good idea: when Tomcat is shut down it empties the /tmp/tomcat6-temp directory.

Purging data

Purging data means deleting data from the Onyx database. Only participant data can be purged, not experimental conditions data. Configuration of data purging is done entirely in configuration files, not through the Onyx user interface. As with data export, only a system administrator can trigger a purge, which is done from within the web interface.

Sample Export Config File and Resulting Export Zip File

The following files represent an export of only four participants. The zip file contains a lot of XML. It's worth opening it and just pondering how we might approach this. Virtually everything is captured regarding a participant and the interview process. How much of this do we want in i2b2?

Export destinations file
Resulting export zip file

How much do we want to export and purge?

It looks as if the export config file attached results in almost all data being exported for those participants whose interview status is closed, completed or cancelled.

Some aspects are excluded for reasons which I (Jeff) do not fully comprehend, notably parts of the Consent table.

How much we purge is an open question. Remember that this may affect the reporting tool. I mention this here because I believe the default purge config file we have will result in almost all data being deleted for participants whose interviews do not have an open status.

On the whole it seems sensible to export as much as we can for each participant and then archive the export files; ie: retain them forever. Do we wish to consider encryption, given the idea of retaining files in perpetuity? I hope the answer to this is "No"; it's simply more work.

Why export everything for each participant? Because it gives us more than one bite of the cherry for the import into i2b2 (or any other piece of software). The detail shown in the export file is quite daunting. It is conceivable that if we filtered the export we might get this wrong, or change our minds later. Keeping track of what has and has not been exported could be a nightmare.

Filtering the exported data

All of the exported data is in XML format. Given the large amount of detail exported, we need some way of marshalling this into a somewhat simpler form prior to organizing it for import into i2b2.

The idea is to come up with a programmable, automated process that will act as a first filter applied to all exports.

Whatever process we come up with, it is likely to involve a number of steps, and we are unlikely to get it right first time. The process will involve manual inspection of example files from within an export zip file, together with some programming, to decide what data can be eliminated.

(Since the above was written, the i2b2-procedures project within SVN has brought together a workflow of job steps and programs which form a prototype of this process. The latest is within SVN at svn+ssh://svn.briccs.org.uk/var/local/briccs/svn/repo/i2b2/trunk/i2b2-procedures and is still being worked on.)

The manual inspection is the thinking bit. Don't jump to conclusions on first inspection.

For instance, this is an extract from a Participant's file...

  <variableValue variable="Admin.Action.fromState">
    <value class="sequence" valueType="text" size="40">
      <value valueType="text" order="0"/>
      <value valueType="text" order="1">Ready</value>
      <value valueType="text" order="2">InProgress</value>
      <value valueType="text" order="3">Ready</value>
      <value valueType="text" order="4">InProgress</value>
     
... similar lines removed ...

      <value valueType="text" order="36">Interrupted</value>
      <value valueType="text" order="37">InProgress</value>
      <value valueType="text" order="38">Ready</value>
      <value valueType="text" order="39">InProgress</value>
    </value>
  </variableValue>

It's probable in my judgement that this could be filtered out. But what about:

  <variableValue variable="Admin.StageInstance.user">
    <value class="sequence" valueType="text" size="14">
      <value valueType="text" order="0">JeffLusted</value>
      <value valueType="text" order="1">JeffLusted</value>
      <value valueType="text" order="2">JeffLusted</value>

... similar lines removed ...

      <value valueType="text" order="11">JeffLusted</value>
      <value valueType="text" order="12">JeffLusted</value>
      <value valueType="text" order="13">JeffLusted</value>
    </value>
  </variableValue>
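
Whichever variables we decide can be eliminated, one possible shape for the automated first filter is an XSLT stylesheet run over each participant file in the export. The sketch below is purely illustrative (it is not taken from the i2b2-procedures project, and it assumes the participant files carry no XML namespace, as the extracts above suggest): it copies a file through unchanged except that it drops any variableValue whose variable name begins with 'Admin.Action.'.

    <?xml version="1.0"?>
    <xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

      <!-- Identity transform: copy every element, attribute and text node through unchanged -->
      <xsl:template match="@*|node()">
        <xsl:copy>
          <xsl:apply-templates select="@*|node()"/>
        </xsl:copy>
      </xsl:template>

      <!-- Drop any variableValue whose variable name begins with 'Admin.Action.'
           (an example choice for illustration, not a decision) -->
      <xsl:template match="variableValue[starts-with(@variable, 'Admin.Action.')]"/>

    </xsl:stylesheet>

Run with something like xsltproc filter.xsl participant.xml, this would remove the Admin.Action.fromState block shown above while leaving the Admin.StageInstance.user block untouched; deciding which variables belong in that match list is exactly the manual inspection work described above.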

Experiments with Exclusion

I altered the export-destinations.xml file so that the type="EXCLUDE" scripts were commented out throughout the file. The following is just the first instance of this:

    <valueset entityType="Participant" valueTable="Participants">
      <entities>
        <excludeAll />
        <script type="INCLUDE">
          <javascript><![CDATA[// Include any ValueSet that has 'CLOSED' or 'COMPLETED' or 'CANCELLED' as a value for the 'Participant.Interview.Status' variable
          $('Participants:Admin.Interview.status').any('CLOSED','COMPLETED','CANCELLED')]]></javascript>
        </script>
        <!-- script type="EXCLUDE">
          <javascript><![CDATA[$('Participants:Admin.Interview.exportLog.destination').any('BRICCS.Participants')]]></javascript>
        </script -->
      </entities>
    </valueset>

I then ran an export. It produced a zip file containing all the participants with a completed status on my test system, even those that had previously been exported. So the conclusion is that it is the EXCLUDE condition above that ensures previously exported participants are not exported again. As an aside, there are no participants on my test system with a closed or cancelled status.

I assume that there are a number of ways of achieving the same result using JavaScript. On the OBIBA web site, the equivalent of the above is given by:

    <valueset entityType="Participant" valueTable="Participants">
      <entities>
        <script type="INCLUDE">
          <javascript>
          ...
          </javascript>
        </script>
        <script type="EXCLUDE">
          <javascript><![CDATA[$('Admin.Participant.exported').any('TRUE')]]></javascript>
        </script>
      </entities>
      ...
    </valueset>

The export-destinations.xml configuration file that shipped with the BRICCS questionnaire has the following:

   <valueset entityType="Participant" valueTable="Consent">
      <entities>
        <excludeAll />
        <script type="INCLUDE">
          <javascript><![CDATA[// Include any ValueSet that has 'CLOSED' or 'COMPLETED' or 'CANCELLED' as a value for the 'Participant.Interview.Status' variable
          $('Participants:Admin.Interview.status').any('CLOSED','COMPLETED','CANCELLED')]]></javascript>
        </script>
        <script type="EXCLUDE">
          <javascript><![CDATA[$('Participants:Admin.Interview.exportLog.destination').any('BRICCS.Consent')]]></javascript>
        </script>
      </entities>
      <variables>
        <variableName type="EXCLUDE" match="consent_q1" />
        <variableName type="EXCLUDE" match="consent_q2" />
        <variableName type="EXCLUDE" match="consent_q3" />
        <variableName type="EXCLUDE" match="consent_q4" />
        <variableName type="EXCLUDE" match="consent_q5" />
      </variables>
    </valueset>

I conducted another experiment whereby the above variables section was commented out, and then ran an export. I could detect no difference between the export zip files before and after the change. But note:

  • This is only with very limited test data
  • I'm not sure how many valueTables are associated with consent. Are there separate valueTables for Consent, ManualConsentQuestionnaire and VerbalConsentQuestionnaire? If so, the latter two have been omitted from our configuration file (see the sketch below).
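
If those extra valueTables do exist, the additional valueset entries would presumably mirror the Consent one above. The following is only a sketch: the table name ManualConsentQuestionnaire and the exportLog destination 'BRICCS.ManualConsentQuestionnaire' are guesses following the naming pattern seen elsewhere in the file, and a similar entry would be needed for VerbalConsentQuestionnaire.

    <valueset entityType="Participant" valueTable="ManualConsentQuestionnaire">
      <entities>
        <excludeAll />
        <script type="INCLUDE">
          <javascript><![CDATA[$('Participants:Admin.Interview.status').any('CLOSED','COMPLETED','CANCELLED')]]></javascript>
        </script>
        <script type="EXCLUDE">
          <javascript><![CDATA[$('Participants:Admin.Interview.exportLog.destination').any('BRICCS.ManualConsentQuestionnaire')]]></javascript>
        </script>
      </entities>
    </valueset>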

Some thoughts on the Experiments

As per our existing configuration file, it looks as if we are currently set up to export all participants where the BRICCS questionnaire has been closed, completed or cancelled. We may want to alter this so that only a completed status is taken into account.
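
A minimal sketch of that change, reusing the INCLUDE script from the configuration above but with the status list narrowed to 'COMPLETED' only:

    <script type="INCLUDE">
      <javascript><![CDATA[// Include only interviews with a COMPLETED status
      $('Participants:Admin.Interview.status').any('COMPLETED')]]></javascript>
    </script>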

On my laptop (dual core, 64-bit, with 1GB available to the JVM running Tomcat) and with limited test data, an export of 7 participants with a completed status took 1 minute and 24 seconds overall. On average, each participant added about 1MB to the uncompressed contents of the zip file.

On the whole, I would still like to think we could make small exports, say of 50 participants at a time, even though we may have seven hundred or more completed. This would mean experimenting with the JavaScript conditions. I think it would make for easier testing and debugging.

Another awkwardness is that once a participant is exported, it is only possible (as far as I can see) to look at a limited amount of their data via the Onyx web interface. So validating what is at the end of the Onyx-to-i2b2 trail by manual inspection will be difficult.

On Exporting Small Numbers of Participants Regularly

I remember from a conversation last year that Philippe Laflamme of OBIBA (one of the principal developers of Onyx) recommended using the export facility on a regular basis to export smallish numbers of participants. We haven't, and now (October 2011) have a backlog of over 1000 participants ready and waiting. I think it is imperative we use the export config to stick to Philippe's recommendation nevertheless; ie: smallish, regular exports. The following sections outline how this might be achieved.

But in the meantime here are some things to ponder:

  • For 1000 participants, the unzipped export file will contain 14 subdirectories, each with two control files (one of which is metadata) and one file for each exported participant; and there is one control file for the overall export. Altogether there would be 14 × (1000 + 2) + 1 = 14029 XML files for an export of 1000 participants, probably totalling over a gigabyte in size.
  • The export is triggered by the administrator within the Onyx web application, and is executed by the Onyx web application itself. I suspect that it will take something in the order of 3 hours or more to export 1000 participants. My initial reaction is that the web server is unlikely to survive the memory demands of processing that many xml files to produce one zip file.
  • If, however, we were successful in producing such a huge zip file, processing it would be difficult. Our first export(s) of non-artificial data will be used to bottom out the multi-step pipeline process between Onyx export and i2b2 import. It would be sensible, at least in our first realistic tests with non-artificial data, to keep the process within human bounds. If each execution of the pipeline has individual steps which take hours to finish, and perhaps need retrying, then the development process (which will certainly be required, considering things like changes to the ontology) will be tortuous indeed.

On Filtering using a Date

Further experiments on exporting revealed that the filter illustrated below worked, selecting participants with a status of completed where the conclusion questionnaire end date was some time in January, and where the participant hadn't been exported before (note that month() appears to be zero-based, so January is '0'). All the valueTables within the export config file had the same filter.

    <valueset entityType="Participant" valueTable="Participants">
      <entities>
        <excludeAll />
        <script type="INCLUDE">
          <javascript><![CDATA[// Include any ValueSet that has 'COMPLETED' as a value for the 'Participant.Interview.Status' variable
                               // and the conclusion questionnaire timeEnd is in January
          ($('Participants:Admin.Interview.status').any('COMPLETED')).
             and($('ConclusionQuestionnaire:QuestionnaireRun.timeEnd').month().trim().any('0'))
          ]]></javascript>
        </script>
        <script type="EXCLUDE">
          <javascript><![CDATA[$('Participants:Admin.Interview.exportLog.destination').any('BRICCS.Participants')]]></javascript>
        </script>
      </entities>
    </valueset>

I suggest something along these lines would provide an incremental basis for exporting until the number of participants with completed status has dropped to a manageable level.

    <valueset entityType="Participant" valueTable="Participants">
      <entities>
        <excludeAll />
        <script type="INCLUDE">
          <javascript><![CDATA[// Include any ValueSet that has 'COMPLETED' as a value for the 'Participant.Interview.Status' variable
                               // and the conclusion questionnaire timeEnd is in May or June of 2010
          ($('Participants:Admin.Interview.status').any('COMPLETED')).
             and($('ConclusionQuestionnaire:QuestionnaireRun.timeEnd').year().trim().any('2010')).
             and($('ConclusionQuestionnaire:QuestionnaireRun.timeEnd').month().trim().any('4', '5'))
          ]]></javascript>
        </script>
        <script type="EXCLUDE">
          <javascript><![CDATA[$('Participants:Admin.Interview.exportLog.destination').any('BRICCS.Participants')]]></javascript>
        </script>
      </entities>
    </valueset>

On each export, the month condition above could be incrementally extended to include a further month, which would also pick up any participants from previous months that had since moved to completed status:

  • month().trim().any('4', '5', '6')
  • month().trim().any('4', '5', '6', '7')
  • month().trim().any('4', '5', '6', '7', '8')

and so on

The EXCLUDE ensures those already exported will not be exported again. By the time an export got to include December 2010 it would be time for a rethink.

More details of the strategy for the initial incremental exporting conducted in November 2011 are on the Initial Incremental Export page.

The ongoing management of Onyx export files and their subsequent upload into i2b2 is handled from a spreadsheet (yeah, irony) held on the UHL BRICCS shared drive at \\BRICCS\Data Export from Live Onyx; the spreadsheet with control metadata is incrementalexportlog.xls.

A copy of each Onyx export file is held in the same directory as a zip file.

On Editing the Export Configuration File

The following probably apply to the purge configuration file as well:

  • If you introduce a basic XML error whilst editing the file, the Onyx application will fail to start the next time you restart it.
  • Once the file has been edited, you must stop and start the application in order for the new contents to be picked up. I advise stopping and starting Tomcat itself: using the Tomcat Manager web application to restart just the Onyx application is prone to memory errors (it used to be a weakness).
