CiviCRM Module HSCIC Importer Planning
A proposed process to automatically import and update GP and GP Practice details.
Options
- Import data from HSCIC web site.
- Import the data from the UHL data warehouse.
1. HSCIC
This is the preferred method.
The data files are downloadable from the following page: http://systems.hscic.gov.uk/data/ods/datadownloads/gppractice.
New files are released quarterly, but update files are released monthly (http://systems.hscic.gov.uk/data/ods/datadownloads/monthamend/index_html).
There is a 27 column 'standard' format for the data, as follows:
- epraccur.zip - GP practice current data
- epraccur.csv
- Fields:
- Organisation code
- Practice Name
- National Grouping
- High Level Health Authority
- Address line 1
- Address line 2
- Address line 3
- Address line 4
- Address line 5
- Postcode
- Open date
- Close date
- Status (A = Active, C = Closed, D = Dormant, P = Proposed)
- Sub-type code (B = Allocated to a parent organisation, Z = Not allocated to a parent organisation)
- Parent Organisation code (CCG/PCT etc code)
- Join parent date
- Left parent date
- Telephone number
- Null
- Null
- Null
- Amended record indicator
- Null
- Null
- Null
- Practice Type (0 = Other, 1 = WIC Practice, 2 = OOH Practice, 3 = WIC + OOH Practice, 4 = GP Practice, 5 = Prison prescribing cost centre)
- Null
- ebranchs.zip - Branch surgery data
- ebranchs.csv
- Fields:
- Organisation code (made up of the surgery code plus three digits - 001, 002, etc - to denote a branch surgery)
- Branch surgery Name
- National Grouping
- High Level Health Authority
- Address line 1
- Address line 2
- Address line 3
- Address line 4
- Address line 5
- Postcode
- Open date
- Close date
- Null
- Null
- Parent Organisation code (GP surgery code)
- Join parent date
- Left parent date
- Telephone number
- Null
- Null
- Null
- Amended record indicator
- Null
- Government Office Region Code
- Null
- Null
- Null
- egpcur.zip - GP current data
- egpcur.csv
- Fields:
- G code
- Name (surname space initials)
- National Grouping
- High Level Health Authority
- Address line 1
- Address line 2
- Address line 3
- Address line 4
- Address line 5
- Postcode
- Open date
- Close date
- Status (A = Active, C = Closed, P = Proposed)
- Sub-type code (P = Principal GP / Senior partner, O = Other GP)
- Parent Organisation code (GP surgery code)
- Join parent date
- Left parent date
- Telephone number
- Null
- Null
- Null
- Amended record indicator
- Null
- Current care organisation
- Null
- Null
- Null
- epracmem.zip - Contains current and historical records of membership of practices by GPs.
- epracmem.csv
- Practitioner Code
- Parent Organisation Code
- Parent Organisation Type
- Join Parent Date
- Left Parent Date
- Amended Record Indicator
- Not Amended
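As a minimal sketch of how the 27-column layout could be read, the Python below maps each epraccur.csv row onto the field list above. The snake_case field names and the `read_practices` function are illustrative, not part of any existing module, and the sketch assumes the file has no header row. Columns listed as 'Null' are dropped.

```python
import csv
import io
from typing import Dict, Iterator

# One entry per column of epraccur.csv, in file order.
# None marks a column the field list describes as 'Null'.
EPRACCUR_COLUMNS = [
    "organisation_code", "practice_name", "national_grouping",
    "high_level_health_authority",
    "address_line_1", "address_line_2", "address_line_3",
    "address_line_4", "address_line_5",
    "postcode", "open_date", "close_date", "status", "sub_type_code",
    "parent_organisation_code", "join_parent_date", "left_parent_date",
    "telephone_number", None, None, None,
    "amended_record_indicator", None, None, None,
    "practice_type", None,
]


def read_practices(csv_text: str) -> Iterator[Dict[str, str]]:
    """Yield one dict per practice row, keyed by the names above."""
    for line_number, row in enumerate(csv.reader(io.StringIO(csv_text)), 1):
        if len(row) != len(EPRACCUR_COLUMNS):
            raise ValueError(
                "line %d: expected %d columns, found %d"
                % (line_number, len(EPRACCUR_COLUMNS), len(row))
            )
        yield {
            name: value.strip()
            for name, value in zip(EPRACCUR_COLUMNS, row)
            if name is not None
        }
```

The same approach would apply to ebranchs.csv and egpcur.csv with their own column lists, since all three share the 27-column shape.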
The monthly update file (egpam.zip -> egpam.csv) amalgamates entries in the GP practice and GP formats above into a single file. Updated branch data is in ebranchsam.csv, which is contained in eamendam.zip each month.
Note that addresses are 'unstructured' other than postcode. We could replicate the address matching approach we've implemented for the NIHR BioResource module, but (a) beware of the Google Maps API usage limit and (b) the street address for practices rarely begins with a number, so is less predictable. A reliable source of address data is becoming a higher priority.
Process
- Match each primary practice with a record in CiviCRM - update details, ensure 'main' address matches or is updated.
- Match each branch surgery with an address in CiviCRM of type 'other'.
- Match each GP to a health worker record in CiviCRM, ensure relationship links to correct GP Practice. Include senior partner / principal GP relationship.
How to deal with archive data? Do we care? For the time being, assume we don't care. What matters to us is a currently-viable record of each practice so that we can construct mailing lists, etc., not a historically accurate record of all changes.
This requires an amendment to the CiviCRM object model for Practice addresses - an additional item of custom data for the 'organisation code', which comprises the practice code plus a three-digit identifier. This will be optional - it only has relevance for branch surgery addresses, not main addresses.
See https://api.drupal.org/api/drupal/modules%21system%21system.api.php/function/hook_cron/7 for details on using hook_cron to schedule the work.
To begin with, we should limit our work to loading in GP surgeries, GPs themselves, and addresses / telephone numbers for GP surgeries and branch surgeries. Links to health authorities and other entities would be possible, but are outside of scope for the time being.
Pseudocode
- wget monthly GP + practice amendment file
- compare to last version held
- if the same, delete
- if different, unpack zip file and process csv file
- wget monthly branch amendment file
- compare to last version held
- if the same, delete
- if different, unpack zip file and process csv file
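The steps above could be sketched in Python roughly as follows. This is an illustration under stated assumptions, not the module's implementation: the function names, the use of a SHA-256 digest file to represent the 'last version held', and the UTF-8 encoding are all assumptions. The download URL is passed in by the caller rather than guessed here.

```python
import csv
import hashlib
import io
import urllib.request
import zipfile
from pathlib import Path
from typing import Callable, Iterator, List


def download(url: str) -> bytes:
    """Fetch the whole file into memory (the 'wget' step)."""
    with urllib.request.urlopen(url) as response:
        return response.read()


def rows_from_zip(data: bytes, member: str) -> Iterator[List[str]]:
    """Unpack one CSV member of the zip and yield its rows."""
    with zipfile.ZipFile(io.BytesIO(data)) as archive:
        with archive.open(member) as raw:
            # Assumption: the CSV decodes as UTF-8.
            text = io.TextIOWrapper(raw, encoding="utf-8", newline="")
            yield from csv.reader(text)


def import_amendments(
    url: str,
    member: str,
    hash_file: Path,
    process_row: Callable[[List[str]], None],
    fetch: Callable[[str], bytes] = download,
) -> int:
    """Download, compare with the last version held, and process if changed.

    Returns the number of rows processed (0 when the file is unchanged).
    """
    data = fetch(url)
    digest = hashlib.sha256(data).hexdigest()
    if hash_file.exists() and hash_file.read_text().strip() == digest:
        return 0  # same as the last version held: discard the download
    count = 0
    for row in rows_from_zip(data, member):
        process_row(row)
        count += 1
    # Record the digest only after every row was processed, so that a
    # failed run is retried next time.
    hash_file.write_text(digest)
    return count
```

The function would be called twice per run: once for egpam.csv inside egpam.zip and once for ebranchsam.csv inside eamendam.zip, each with its own digest file.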
Usage of Drupal cron
If we use the Drupal 'easy cron' then all the processing is done within a normal request for a Drupal page. This could lead to a user having to wait an inexplicably long time for a page because we're doing the processing in the background.
We could split the task over several requests, either manually or using queues. However, the actual act of downloading the file could take a long time on its own.
We could therefore use real cron to download the file. Or just use real cron to do the whole process.
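To make the queue option concrete, here is a language-neutral sketch in Python of a time-boxed queue worker: each cron run processes queued rows until a time budget is used up and leaves the rest for the next run. The names `run_for` and `budget_seconds` are illustrative, and this does not use the Drupal queue API itself.

```python
import time
from collections import deque
from typing import Callable, Deque, List


def run_for(
    queue: Deque[List[str]],
    worker: Callable[[List[str]], None],
    budget_seconds: float,
    clock: Callable[[], float] = time.monotonic,
) -> int:
    """Process queued rows until the queue is empty or the budget is spent.

    Returns the number of rows processed; unprocessed rows stay queued.
    """
    deadline = clock() + budget_seconds
    done = 0
    while queue and clock() < deadline:
        worker(queue.popleft())
        done += 1
    return done
```

With this shape, no single page request carries more than `budget_seconds` of import work, although the download itself would still need to happen elsewhere.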
Assumptions in the code
Which GPs are we interested in
Ideally, we would be interested in every GP linked to a practice we are interested in. Pragmatically, we are going to say that we are interested only in GPs whose main practice is one we are interested in.
This assumption is deemed to be safe because:
- Any other practices that the GP is linked to are likely to be in the areas that we are interested in.
- GPs whose main branch is not in our area are less likely to be of interest to us.
When does a GP get deleted
A GP gets deleted when they leave their main branch. Links to other branches are ignored.
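The two assumptions above could be expressed as a single decision function, sketched here in Python. It assumes that the 'Parent Organisation code' in egpcur.csv identifies the GP's main practice and that a non-empty 'Left parent date' means the GP has left it. Both are interpretations of the field list, not confirmed behaviour, and the names `gp_action`, `parent_organisation_code` and `left_parent_date` are illustrative.

```python
from typing import Dict, Set


def gp_action(gp: Dict[str, str], practices_of_interest: Set[str]) -> str:
    """Decide what to do with one GP record: 'ignore', 'delete' or 'upsert'."""
    # Only GPs whose main practice is one we hold are of interest.
    if gp["parent_organisation_code"] not in practices_of_interest:
        return "ignore"
    # Assumption: a non-empty left-parent date means they have left.
    if gp["left_parent_date"].strip():
        return "delete"
    return "upsert"
```

Note that a GP who moves to a practice outside the set would come back as 'ignore' under this sketch, so an importer would probably also need to check GPs already held in CiviCRM against the incoming file.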
2. UHL Data Warehouse
The data is stored in the DWREPO_BASE database. The tables are:
- MF_GP_OCS (GPs)
- MF_GP_PRACTICE_OCS (Practices)
There are other tables with similar names, but I don't know how they differ from the ones above.
It may be possible to use the SHA codes of the GPs to filter the details for the Leicestershire area.
These tables are recreated from scratch on a weekly basis, but the source from which they are created is the quarterly file above.
3. HSCIC to Our Staging Area for use by many systems
This option involves downloading the contents of the HSCIC file to a central location from which we make updates to all of our systems.