This tool provides a similar capability to mysqldump, but with OpenMRS-aware configuration settings aiming to extract a database out of a production environment for development, testing, or demonstration. To accomplish this, it has two primary features:
-
Export a subset of the database. As production databases grow in size, running mysqldump and then sourcing this into your development environment becomes very inefficient. For databases containing hundreds of thousands of patients, and millions of obs, such a process can take hours (or days) to perform. This tool aims to allow for exporting a subset of this data, ensuring that broad representation is accounted for across age ranges, program enrollments, encounters, relationships, etc, but which can be sourced in a matter of minutes.
-
Transform aspects of the database during the export process. Most commonly, this involves de-identification of data, but can involve any modification, removal, or addition of data that is needed during the export process. Examples include obfuscation of data such as person names and addresses, associating random locations with encounter or obs data, or removing existing users and replacing them with a set of test users.
- Clone this project to your workspace, and change into the project root directory
- Build an executable jar by running "mvn clean package"
- Copy the artifact at "./target/databaseexporter-XYZ-jar-with-dependencies.jar" into the directory of your choice
- Create a configuration file that specifies the parameters of the export. See configuration section for details.
- Run the executable jar, specifying the location of the configuration file, as follows:
$java -jar databaseexporter-XYZ-jar-with-dependencies.jar /path/to/configuration/file/created/above
In order to run the databaseexporter, you must specify an appropriate configuration file. This configuration file is expected to be in JSON format, and contains the following settings:
This is where you specify the details of how the tool should connect to the source database from which you would like to export data. The supported attributes are:
- driver: (optional), defaults to "com.mysql.jdbc.Driver"
- url: (required), should be the JDBC connection string to connect to your database
- user: (required), an existing database user that has sufficient privileges to run the operation
- password: (required), the password for this user
Example:
"sourceDatabaseCredentials": {
"url": "jdbc:mysql://localhost:3306/openmrs?autoReconnect=true&useUnicode=true&characterEncoding=UTF-8",
"user": "openmrs",
"password": "openmrs"
}
This allows you to specify the directory in which the tool will write it's output. When run, two files will be produced in this target directory: "export_yyyy_MM_dd_HH_ss.log" and "export_yyyy_MM_dd_HH_ss.sql".
Example:
"targetLocation": "/home/openmrs/exports"
This allows you to specify which tables are included and which are excluded from the export. This allows you fine-grained control over which tables you wish to create and which you want to export data.
The supported attributes are:
- includeSchema: (optional), list of table names for which a "drop table..." and "create table..." statement will be written in the export file. By default, if not specified, all tables will be included.
- excludeSchema: (optional), list of table names for which a "drop table..." and "create table..." statement will not be written in the export file. By default, if not specified, no tables will be included.
- includeData: (optional), list of table names for which contents will be extracted into the export file. By default, if not specified, all tables will be included.
- excludeData: (optional), list of table names for which contents will be extracted into the export file. By default, if not specified, no tables will be included.
The use of the "*" wildcard is supported within all of the attributes.
Example:
"tableFilter": {
"includeSchema": ["*"],
"excludeSchema": ["temp*"],
"includeData": ["*"],
"excludeData": ["hl7_in_*","person_merge_log","concept_word"]
}
Whereas tableFilters (above) allow you to specify which tables you want to export data for altogether, Row Filters are what provide the capability to export only a subset of data across your database. You can specify 0-N filters, which will run in the order that they are listed in your configuration file. Filters are additive, meaning that if 2 Filters operate against the same type of data (Patients, for example), then the results of these two filters will be combined and the union will be included in the export. If a particular table is affected by at least 1 filter, then it's rows will be limited by the results of the filters. If a table is not affected by any of the listed filters, then all of it's data will be included.
When a particular type of data is filtered, this means that only a subset of the rows in that table will be exported. Because of this, any table that has a foreign key relationship with this table will be affected. In this case, one of two things will happen:
-
If the table represents data that depends upon the referenced table to retain it's meaning or purpose, then this table will be filtered to match the parent table. For example, if you filter patients, this will automatically filter patient identifiers, such that only those patient identifiers associated with patients in the export will be included.
-
If the table represents data that does not derive it's meaning from it's relationship with the referenced table, then the foreign key will be replaced. For example, if you filter users, this will not automatically filter all data that references these users as their "creator", "changed_by", "retired_by", or "voided_by". Rather, a randomly selected value from the group of valid, filtered users will be chosen and used as a replacement.
Currently the system supports 3 different types of Row Filters:
A patient filter limits the patients that are exported based on a series of configurable queries that you can specify. These queries include the following:
PatientIdQuery
Include patient data for only those patients with the specified patientIds in the export
Options:
- ids (required): The list of patientIds to include
Example: Export data for only 5 specific patients for testing
{
"@class" : "org.openmrs.contrib.databaseexporter.query.PatientIdQuery",
"ids": [111,222,333,444,555]
}
PatientAgeQuery
Include representative data for up to the specified number of patients in each of the specified age ranges.
Options:
- numberPerAgeRange (optional, defaults to 10): For each listed age range, ensure at least this number of patients are included, if possible
- ageRanges (required): List of age ranges that we want to ensure have adequate representation
Example: Ensure at least 100 patients exist who are between 0-2, 3-10, 11-15, 16-30, 31-60, and 60+ years old
{
"@class" : "org.openmrs.contrib.databaseexporter.query.PatientAgeQuery",
"numberPerAgeRange": 100,
"ageRanges": [
{"maxAge": 2},
{"minAge": 3, "maxAge": 10},
{"minAge": 11, "maxAge": 15},
{"minAge": 16, "maxAge": 30},
{"minAge": 31, "maxAge": 60},
{"minAge": 61}
]
}
PatientEncounterQuery
Include representative encounter data that includes at least the specified number of patients who have encounters by form and encounter type
Options:
- numberPerType (optional, defaults to 10): For each listed encounter type (or all if none are specified), include this number of patients who have encounters of this type
- numberPerForm (optional, defaults to 10): For each listed form (or allif none are specified), include this number of patients who have encounters for this form
- includeRetired (optional, defaults to false): Indicate whether retired forms and encounter types should be included, if none are specifically listed
- limitToTypes (optional): if specified, will limit the filter to only those encounter types
- limitToForms (optional): if specified, will limit the filter to only those forms
- order (optional, options are RANDOM, DATE_ASC, DATE_DESC, NUM_OBS_DESC, defaults to RANDOM). Determines which patients are selected for the given criteria - based on encounter date created, randomly, or favoring those with the most obs
Example: Ensure at least 5 patients with each non-retired encounter type are included, and ensure these are the encounters that have the most number of Obs associated with them for each type
{
"@class" : "org.openmrs.contrib.databaseexporter.query.PatientEncounterQuery",
"numberPerType": 5,
"numberPerForm": 0,
"order": "NUM_OBS_DESC"
}
PatientProgramEnrollmentQuery
Include representative program enrollment data that includes at least the specified number of patients who have enrollments in the specified programs with the specified status
Options:
- numberActivePerProgram (optional, defaults to 10): For each listed program (or all if none are specified), include this number of patients who are currently active in that program
- numberCompletedPerProgram (optional, defaults to 10): For each listed program (or all if none are specified), include this number of patients who are currently completed in that program
- includeRetired (optional, defaults to false): Indicate whether retired programs should be included, if none are specifically listed
- limitToPrograms (optional, defaults to false): if specified, will limit the filter to only those programs
Example: Ensure at least 20 patients with active enrollments and and 50 patients with completed enrollments are included for the programs with programId 1 and 3
{
"@class" : "org.openmrs.contrib.databaseexporter.query.PatientProgramEnrollmentQuery",
"numberActivePerProgram": 20,
"numberCompletedPerProgram": 50,
"limitToPrograms": [1,3]
}
PatientRelationshipQuery
Include representative relationship data that includes at least the specified number of patients who have relationships in each of the specified type
Options:
- numberPerType (optional, defaults to 10): For each listed relationshipType (or all if none are specified), include this number of patients who have this relationship
- includeRetired (optional, defaults to false): Indicate whether retired relationship types should be included, if none are specifically listed
- limitToTypes (optional, defaults to false): if specified, will limit the filter to only those relationship types
Example: Ensure at least 5 patients with each relationship type, including retired types, is included in the export
{
"@class" : "org.openmrs.contrib.databaseexporter.query.PatientRelationshipQuery",
"numberPerType": 5,
"includeRetired": true
}
PatientIdentifierQuery
Include representative identifier data that includes at least the specified number of patients who have identifiers in each of the specified type
Options:
- numberPerType (optional, defaults to 10): For each listed identifierType (or all if none are specified), include this number of patients who have this identifier type
- includeRetired (optional, defaults to false): Indicate whether retired identifier types should be included, if none are specifically listed
- limitToTypes (optional, defaults to false): if specified, will limit the filter to only those identifier types
Example: Ensure at least 100 patients with each non-retired identifier type is included in the export
{
"@class" : "org.openmrs.contrib.databaseexporter.query.PatientIdentifierQuery",
"numberPerType": 100,
"includeRetired": false
}
PatientVitalStatusQuery
Include representative data for both alive and dead patients
Options:
- numberAlive (optional, defaults to 10): Ensure at least this number of alive patients is included
- numberDead (optional, defaults to false): Ensure at least this number of dead patients is included
Example: Ensure at least 20 alive patients and 5 dead patients are included in the export
{
"@class" : "org.openmrs.contrib.databaseexporter.query.PatientVitalStatusQuery",
"numberAlive": 20,
"numberDead": 5
}
A user filter limits the users that are exported based on a series of configurable queries that you can specify. These queries include the following:
UserIdentificationQuery
Include only those users with the given usernames (note: admin and daemon will always be included)
Options:
- userNames (required): The list of users to include
Example: Export only my account
{
"@class" : "org.openmrs.contrib.databaseexporter.query.UserIdentificationQuery",
"userNames": ["mseaton"]
}
A provider filter limits the providers that are exported based on a series of configurable queries that you can specify. These queries include the following:
ProviderIdQuery
Include only those providers with the given primary key provider ids
Options:
- ids (required): The list of provider ids to include
Example: Export just 5 sample providers
{
"@class" : "org.openmrs.contrib.databaseexporter.query.ProviderIdQuery",
"ids": [1,2,3,4,5]
}
Whereas row filters determine what rows of data should be exported, row transforms are able to transform the actual content of each row prior to exporting. For example, if you wanted to replace each patient name with a randomly-selected generic alternative, you would do this by configuring the appropriate transform. Specifying rowTransforms follows the format:
"rowTransforms": [
{
"@class" : "org.openmrs.contrib.databaseexporter.transform.Transform1",
"numericAttribute": numValue,
"stringAttribute": "strValue",
"listAttribute": ["value1", "value2"]
},
{
"@class" : "org.openmrs.contrib.databaseexporter.transform.Transform2",
"numericAttribute": numValue,
"stringAttribute": "strValue",
"listAttribute": ["value1", "value2"]
}
...
]
- Row Transforms run after Row Filters, and only operate upon the rows that have passed through the filter criteria
- You can choose to specify 0-N transforms, which will run in the order that they are listed in your configuration file.
- Transforms can exclude rows altogether, so even if a rowFilter chooses to include a particular row, a rowTransform may later decide to exclude it.
- Transforms can add completely new rows
- Many transforms support "expressions". When you see the word "expression" in the description below, this means that it supports free text, as well as replacement values for any other value within the row in the format "Hello ${another_column_name_in_the_row}"
The list of available filters, and their configuration options, is as follows:
SimpleReplacementTransform
Allows you to specify a collection of tableName.columnName -> replacement pairs, in order to apply a common replacement expression across all values in a particular column
Options:
- replacements (required): This should be a map from tableName.columnName to replacement expression
Example: Replace the obs.value_text column before exporting to ensure no sensitive data is exported that might be contained there
{
"@class" : "org.openmrs.contrib.databaseexporter.transform.SimpleReplacementTransform",
"replacements": {
"obs.value_text": "Value replaced during de-identification ${obs_id}"
}
}
GlobalPropertyTransform
Allows you to specify a simple expression replacement for 1-N global properties
Options:
- replacements (required): This should be a map from propertyName to propertyValue
Example: Change the scheduler password for security reasons, and change any server specific settings
{
"@class" : "org.openmrs.contrib.databaseexporter.transform.GlobalPropertyTransform",
"replacements": {
"scheduler.password": "Admin123",
"metadatasharing.urlPrefix": "http://localhost:8080/openmrs"
}
}
GlobalPropertyTransform
Allows you to specify a simple expression replacement for 1-N global properties
Options:
- replacements (required): This should be a map from propertyName to propertyValue
Example: Change the scheduler password for security reasons, and change any server specific settings
{
"@class" : "org.openmrs.contrib.databaseexporter.transform.GlobalPropertyTransform",
"replacements": {
"scheduler.password": "Admin123",
"metadatasharing.urlPrefix": "http://localhost:8080/openmrs"
}
}
IdentifierTransform
Allows you to transform all identifiers of one or more types with a particular identifier generator. The format for this is as follows:
Options:
- defaultGenerator (optional): If specified, all identifiers not configured with a specific generator will use this on
- replacementGenerators (optional): If specified, is a Map from Identifier Type name to Generator configuration
Example: Configure all identifiers to be replaced by default by their primary key in the patient_identifier table for de-identification purposes. Configure the "OpenMRS ID" to use a Verhoeff-based generator to produce a random replacement that meets validation requirements.
{
"@class": "org.openmrs.contrib.databaseexporter.transform.IdentifierTransform",
"defaultGenerator": {
"@class" : "org.openmrs.contrib.databaseexporter.generator.SimpleReplacementGenerator",
"replacement": "${patient_identifier_id}"
},
"replacementGenerators": {
"OpenMRS ID": {
"@class" : "org.openmrs.contrib.databaseexporter.generator.VerhoeffGenerator",
"firstIdentifierBase": "90100000",
"baseCharacterSet": "0123456789"
}
}
}
The following Generators are available:
** SimpleReplacementGenerator: **
Replaces the identifier with the configured expression replacment value
Options:
- replacement (required): The expression to use for the replacement
Example:
"defaultGenerator": {
"@class" : "org.openmrs.contrib.databaseexporter.generator.SimpleReplacementGenerator",
"replacement": "${patient_identifier_id}"
}
** SequentialGenerator: **
Replaces the identifier with a sequentially allocated value in the configured base
Options:
- baseCharacterSet (required): The character set to use
- firstIdentifierBase (optional): The first identifierBase to start at
- prefix (optional): Any prefix to put on the identifier
- suffix (optional): Any suffix to put on the identifier
Example: Generate a sequential numeric identifier, starting at 5001
"defaultGenerator": {
"@class" : "org.openmrs.contrib.databaseexporter.generator.SequentialGenerator",
"baseCharacterSet": "0123456789",
"firstIdentifierBase": "5001"
}
** VerhoeffGenerator: **
Extension of the SequentialGenerator that appends a Verhoeff-based checkdigit, with a dash separator, to the end of the identifier
Options:
- See Sequential Generator
Note: You should make sure you configure this generator use use a baseCharacterSet of "0123456789"
** LuhnGenerator: **
Extension of the SequentialGenerator that appends a Luhn-based checkdigit, with a configurable separator, to the end of the identifier
Options:
- See Sequential Generator
- checkDigitSeparator (optional): If specified, will use this to divide the undecorated identifier from the check-digit. Defaults to ""
LocationTransform
For each location in the system, this will automatically choose a random address from the configured address resource, and replace the location address components with it. It also allows you to replace the name and description of the locations with configured expressions and optionally scramble what locations are associated with what patients in the underlying data
Options:
- nameReplacement (optional): An expression for replacing the name
- descriptionReplacement (optional): An expression for replacing the description
- addressPath (optional): The location of a resource file containing replacement addresses. There is a resource that is used by default if no custom file specified.
- addressSeparator (optional): The character used to separate address fields in the address replacement resource. Defaults to ","
- scrambleLocationsInData (optional): Defaults to true. If true, this will randomize the locations that are associated with all patient data in the system, to ensure that location patterns cannot be used to identify patient populations in the exported dataset
Example: De-identify the names and addresses of all Locations in the system, but do not scramble how these locations are distributed across the exported patients
{
"@class" : "org.openmrs.contrib.databaseexporter.transform.LocationTransform",
"nameReplacement": "${address1} Health Center",
"descriptionReplacement": "A de-identified health center located at ${address1}",
"scrambleLocationsInData:: false
}
PersonAddressTransform
For each person address in the system, this will automatically choose a random replacement address from the configured address resource
Options:
- addressPath (optional): The location of a resource file containing replacement addresses. There is a resource that is used by default if no custom file specified.
- addressSeparator (optional): The character used to separate address fields in the address replacement resource. Defaults to ","
Example: De-identify all person addresses with the default address resource
{
"@class" : "org.openmrs.contrib.databaseexporter.transform.PersonAddressTransform"
}
RwandaAddressHierarchyTransform
Update the address hierarchy entry table used by the rwandaaddresshierarchy module with the values from the configured address replacement file
Options:
- See PersonAddressTransform for valid options
Example: Set up the address hierarchy entry tables with the default values from the included address replacement configuration
{
"@class": "org.openmrs.contrib.databaseexporter.transform.RwandaAddressHierarchyTransform"
}
PersonNameTransform
For each person name in the system, this will automatically choose a random replacement from the configured name resources
Options:
- maleNamePath (optional): The location of a resource file containing replacement male names. There is a resource that is used by default if no custom file specified.
- femaleNamePath (optional): The location of a resource file containing replacement female names. There is a resource that is used by default if no custom file specified.
- familyNamePath (optional): The location of a resource file containing replacement family names. There is a resource that is used by default if no custom file specified.
- reproducible (optional): Defaults to false. If true, this will pick replacement names in a way that is consistent every time this tool is run on the same database. The names will still be de-identified, but in a predictable way.
Resource files are expected to be a text file that contains one name per line
Example: De-identify all person names with default given names, and user-specified family names
{
"@class" : "org.openmrs.contrib.databaseexporter.transform.PersonNameTransform",
"familyNamePath": "/home/openmrs/export/customizedFamilyNames.txt"
}
ProviderTransform
Provides exactly the same functionality as the PersonName transform, except that it replaces any non-null provider name with a random name taken from the configured name resources
Options:
- See PersonNameTransform for valid options
Example: De-indentify each provider name with defaults from the provided configuration files
{
"@class" : "org.openmrs.contrib.databaseexporter.transform.ProviderTransform"
}
PersonAttributeTransform
Allows you to specify a List of valid replacements for one or more Person Attribute types, which will be randomly applied to all exported person attributes of those types
Options:
- replacements (required): This should be a map from attribute type name to a List of possible replacement values
Example: De-indentify a variety of person attribute values as configured below:
{
"@class" : "org.openmrs.contrib.databaseexporter.transform.PersonAttributeTransform",
"replacements": {
"Race": [""],
"Birthplace": ["MA","NY","NJ","FL"],
"Boarding School": [""],
"Citizenship": ["USA"],
"Father's name": ["Adams","Banks","Ford","Green","Jones","King","Murphy","Price","Smith","Thomas"],
"Mother's Name": ["Bell","Carroll","Ellis","Faye","Kay","Madison","Pearl","Riley","Shelly","Taylor"],
"Health District": ["Middlesex","Suffolk","New York","Burlington","Monmouth","Union","Palm Beach"]
}
}
ScrambleStatesInWorkflowTransform
For a particular program workflow, this transform will randomize what state a user is associated with in that workflow.
Options:
- workflowToScramble (required): This is the primary key id for the program_workflow to scramble for patients
- possibleStates: (optional): A List of valid states to use to scramble the workflow. If not specified, all non-retired states for that workflow will be used
Example: For the HIV Program Treatment Group workflow, scramble these values to that patients cannot be identified based on their assigned treatment group
{
"@class": "org.openmrs.contrib.databaseexporter.transform.ScrambleStatesInWorkflowTransform",
"workflowToScramble": 9,
"possibleStates": [247,248,249]
}
UserTransform
Allows for setting the systemId, username, and password to particular expression replacement values for every user. For any user configured in the usernamesToPreserve property, passwords will also be replaced, but systemId and username will be left intact.
Options:
- systemIdReplacement (optional): An expression used to replace each systemId
- usernameReplacement: (optional): An expression used to replace each username
- passwordReplacement: (optional): An expression used to replace each password
- usernamesToPreserve: (optional): An list of usernames for which the systemId and username replacements should not apply
Example: De-identify all users except for me, and replace everyone's password with a known value
{
"@class": "org.openmrs.contrib.databaseexporter.transform.UserTransform",
"systemIdReplacement": "${userId}",
"usernameReplacement": "user${userId}",
"passwordReplacement": "Test1234",
"usernamesToPreserve": ["mseaton"]
}