

License: GNU General Public License v3.0


SynthEHR (Previously BadMedicine)


Library and CLI for randomly generating medical data like you might get out of an Electronic Health Records (EHR) system. It is intended for generating data for demos and for testing ETL, cohort generation and data management tools.

SynthEHR differs from other random data generators (e.g. Mockaroo, SQL Data Generator) in that the data generated is based on (simple) models built from live EHR datasets collected over more than 30 years in Tayside and Fife (UK). This makes the generated data recognisable from a clinical perspective (codes used, frequency of codes etc.) and representative of the problems (ontology mapping etc.) that data analysts would encounter when working with real medical data.

Generated datasets are not suitable for training AI algorithms etc. (see What is Modelled?).

Rename

As of v2.0.0 BadMedicine was renamed to SynthEHR. Previous versions of the software can be found at nuget.org.

Datasets

The following synthetic datasets can be produced.

Dataset               Description
Demography            Address and patient details as might appear in the CHI register
Biochemistry          Lab test codes as might appear in Sci Store lab system extracts
Prescribing           Prescription data of prescribed drugs
Carotid Artery Scan   Scan results for Carotid Artery
Hospital Admissions   ICD9 and ICD10 codes for admission to hospital
Maternity             Records of births etc

Usage:

SynthEHR is available as a NuGet package for linking as a library.

The standalone CLI (SynthEHR.exe) is available in the releases section of GitHub.

Usage is as follows:

SynthEHR.exe c:\temp\

You can change how much data is produced (e.g. 500 patients, 10000 records per dataset):

SynthEHR.exe c:\temp\ 500 10000

Or run only a single dataset:

SynthEHR.exe c:\omg 5000 200000 -l -d CarotidArteryScan

You can seed the generator (Guids generated will still differ):

SynthEHR.exe c:\omg 5000 200000 -l -d CarotidArteryScan -s 5000
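
The effect of seeding can be illustrated with a minimal, standalone sketch. It does not use SynthEHR itself: `SeedDemo` and `Draw` are hypothetical names, and the assumption is that Guids differ between runs because they do not come from the seeded generator (here they come from `Guid.NewGuid()`).

```csharp
using System;

public static class SeedDemo
{
    // Draws 'count' values from a generator created with the given seed.
    // Two generators built with the same seed yield the same sequence.
    public static int[] Draw(int seed, int count)
    {
        var r = new Random(seed);
        var values = new int[count];
        for (var i = 0; i < count; i++)
            values[i] = r.Next(0, 1000);
        return values;
    }

    public static void Main()
    {
        var first = Draw(5000, 3);
        var second = Draw(5000, 3);

        // Same seed, same values
        Console.WriteLine(string.Join(",", first) == string.Join(",", second)); // True

        // Guid.NewGuid() ignores the seeded generator, so each call differs
        Console.WriteLine(Guid.NewGuid() == Guid.NewGuid()); // False
    }
}
```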

Building

Building requires MSBuild 15 or later (or Visual Studio 2017 or later). You will also need to install the .NET Core 2.2 SDK.

You can build an OS-specific binary by publishing SynthEHR.csproj for the target runtime, for example win-x64:

dotnet publish SynthEHR.csproj -r win-x64 --self-contained
cd .\bin\Debug\netcoreapp2.2\win-x64\

Direct to Database

You can generate data directly into a relational database (instead of onto disk).

To turn this mode on, rename the file SynthEHR.template.yaml to SynthEHR.yaml and provide the connection string for your database, e.g.:

Database:
  # Set to true to drop and recreate tables described in the Template
  DropTables: false
  # The connection string to your database
  ConnectionString: server=(localdb)\MSSQLLocalDB;Integrated Security=true;
  # Your DBMS provider ('MySql', 'PostgreSql','Oracle' or 'MicrosoftSQLServer')
  DatabaseType: MicrosoftSQLServer
  # Database to create/use on the server
  DatabaseName: SynthEHRTestData

Library Usage

You can generate test data for your program yourself by referencing the nuget package:

// Seed the random generator if you always want to produce the same randomisation
var r = new Random(100);

// Create a new person
var person = new Person(r);

// Create test data for that person
var a = new HospitalAdmissionsRecord(person, person.DateOfBirth, r);

Assert.IsNotNull(a.Person.CHI);
Assert.IsNotNull(a.Person.DateOfBirth);
Assert.IsNotNull(a.Person.Address.Line1);
Assert.IsNotNull(a.Person.Address.Postcode);
Assert.IsNotNull(a.AdmissionDate);
Assert.IsNotNull(a.DischargeDate);
Assert.IsNotNull(a.Condition1);

What is Modelled?

Data generated by SynthEHR is driven by aggregate distributions of real health data collected in Tayside (UK). This means that codes appear in the data with frequencies that match the real data. For example, in the Hospital Admissions data we can see that ICD9 codes (denoted by a dash) cease being recorded in ~1997 in favour of ICD10 codes, and that the most common admission conditions are sensible:

[Figure: ICD 9 and ICD 10 codes in Condition1 (the main condition) upon Hospital Admission]
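
As a rough illustration of what "driven by aggregate distributions" can mean, the sketch below picks codes in proportion to their observed counts. It is not SynthEHR's implementation: `WeightedPicker`, the codes "A", "B", "C" and their counts are all made up for the example.

```csharp
using System;
using System.Collections.Generic;

public sealed class WeightedPicker
{
    private readonly string[] _codes;
    private readonly int[] _cumulative; // running total of counts up to each code
    private readonly int _total;

    public WeightedPicker(IReadOnlyList<(string Code, int Count)> counts)
    {
        _codes = new string[counts.Count];
        _cumulative = new int[counts.Count];
        var running = 0;
        for (var i = 0; i < counts.Count; i++)
        {
            if (counts[i].Count < 0)
                throw new ArgumentException("Counts must not be negative");
            running += counts[i].Count;
            _codes[i] = counts[i].Code;
            _cumulative[i] = running;
        }
        if (running == 0)
            throw new ArgumentException("At least one count must be positive");
        _total = running;
    }

    // Returns a code with probability proportional to its count.
    public string Pick(Random r)
    {
        var roll = r.Next(_total); // 0 <= roll < _total
        for (var i = 0; i < _cumulative.Length; i++)
            if (roll < _cumulative[i])
                return _codes[i];
        throw new InvalidOperationException("Unreachable: roll is always below the total");
    }
}

public static class WeightedPickerDemo
{
    public static void Main()
    {
        // Invented counts: "A" should come out roughly 70% of the time
        var picker = new WeightedPicker(new[] { ("A", 70), ("B", 20), ("C", 10) });
        var r = new Random(100);
        var seen = new Dictionary<string, int>();
        for (var i = 0; i < 10000; i++)
        {
            var code = picker.Pick(r);
            seen[code] = seen.TryGetValue(code, out var n) ? n + 1 : 1;
        }
        foreach (var pair in seen)
            Console.WriteLine($"{pair.Key}: {pair.Value}");
    }
}
```

Each record is drawn independently here, so the output reproduces how often each code occurs but nothing about how codes relate to one another.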

What is not Modelled?

No inter-dataset or inter-record randomisation model exists. For example, the following would not be modelled:

  • If a patient is on Drug A they are more likely to also be on Drug B
  • Hospitalisations are more likely to be at the beginning/end of a patient's life
  • Drug A is likely to be given to patients discharged having been treated for condition Y

Contributors

dependabot-preview[bot], dependabot[bot], jas88, jfriel, lgtm-migrator, tznind


badmedicine's Issues

CommandLine arguments are showing help as 'long name'

Looks like the 'command help' bit in the CommandLineParser options is now rendering as the 'long name' of the switch.

Presumably at some point the order or number of constructor args changed on the option attribute, so we need to change to explicitly reference Description or CommandHelp or whatever.


CLI Broken

Describe the bug

Looks like the current CLI is broken, possibly in the same way as BadMedicine.Dicom was (building multiple assemblies on top of each other).

Running the current CLI gives this error:

System.IO.FileNotFoundException: Could not load file or assembly 'Microsoft.Bcl.AsyncInterfaces, Version=1.0.0.0, Culture=neutral, PublicKeyToken=cc7b13ffcd2ddd51'. The system cannot find the file specified.
File name: 'Microsoft.Bcl.AsyncInterfaces, Version=1.0.0.0, Culture=neutral, PublicKeyToken=cc7b13ffcd2ddd51'
   at BadMedicine.Datasets.DataGenerator.EmbeddedCsvToDataTable(Type requestingType, String resourceFileName, DataTable dt)
   at BadMedicine.Datasets.BiochemistryRecord.Initialize() in /home/runner/work/BadMedicine/BadMedicine/BadMedicine.Core/Datasets/BiochemistryRecord.cs:line 105
   at BadMedicine.Datasets.BiochemistryRecord..ctor(Random r) in /home/runner/work/BadMedicine/BadMedicine/BadMedicine.Core/Datasets/BiochemistryRecord.cs:line 68
   at BadMedicine.Datasets.Biochemistry.GenerateTestDataRow(Person p) in /home/runner/work/BadMedicine/BadMedicine/BadMedicine.Core/Datasets/Biochemistry.cs:line 23
   at BadMedicine.Datasets.DataGenerator.GenerateTestDataFile(IPersonCollection cohort, FileInfo target, Int32 numberOfRecords) in /home/runner/work/BadMedicine/BadMedicine/BadMedicine.Core/Datasets/DataGenerator.cs:line 75
   at BadMedicine.Program.RunOptionsAndReturnExitCode(ProgramOptions opts) in /home/runner/work/BadMedicine/BadMedicine/BadMedicine/Program.cs:line 76

Investigate dropping CSVHelper dependency

We are using CSVHelper in core, why?

https://github.com/HicServices/BadMedicine/blob/64a47e3a64bcf6e74181ed045858aac997599e44/BadMedicine.Core/BadMedicine.nuspec#L18

CsvHelper is used in the following methods:

  • EmbeddedCsvToDataTable
    • For reading internal resources e.g. example biochemistry descriptions, prescribable drugs etc. I don't think there are multi line, escaped commas or quotes issues in these resources but I wouldn't want to rule out the possibility

https://github.com/HicServices/BadMedicine/blob/64a47e3a64bcf6e74181ed045858aac997599e44/BadMedicine.Core/Datasets/DataGenerator.cs#L291-L321

  • GenerateTestDataFile
    • For writing output files of synthetic data

https://github.com/HicServices/BadMedicine/blob/64a47e3a64bcf6e74181ed045858aac997599e44/BadMedicine.Core/Datasets/DataGenerator.cs#L48-L81
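
To make the concern about escaped commas and quotes concrete, here is a minimal sketch of what a hand-rolled replacement would have to handle on a single line. `CsvLine.Split` is a hypothetical helper, not part of BadMedicine or CsvHelper, and it deliberately does not cover fields that span several lines.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

public static class CsvLine
{
    // Splits one CSV line, honouring double-quoted fields and "" as an escaped quote.
    public static List<string> Split(string line)
    {
        var fields = new List<string>();
        var current = new StringBuilder();
        var inQuotes = false;

        for (var i = 0; i < line.Length; i++)
        {
            var c = line[i];
            if (inQuotes)
            {
                if (c == '"' && i + 1 < line.Length && line[i + 1] == '"')
                {
                    current.Append('"'); // "" inside quotes is a literal quote
                    i++;
                }
                else if (c == '"')
                    inQuotes = false;    // closing quote
                else
                    current.Append(c);   // commas are plain text inside quotes
            }
            else if (c == '"')
                inQuotes = true;         // opening quote
            else if (c == ',')
            {
                fields.Add(current.ToString());
                current.Clear();
            }
            else
                current.Append(c);
        }

        fields.Add(current.ToString());
        return fields;
    }
}

public static class CsvLineDemo
{
    public static void Main()
    {
        const string line = "a,\"b,c\",d";
        Console.WriteLine(line.Split(',').Length);    // 4: naive split breaks the quoted field
        Console.WriteLine(CsvLine.Split(line).Count); // 3
    }
}
```

If the embedded resources never contain quotes or embedded commas, a plain `string.Split(',')` would be enough for reading them; the sketch shows the extra work needed once that assumption fails.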
