cogstack / cogstack-nifi

Building data processing pipelines for document processing with NLP using Apache NiFi and related services

Home Page: https://hub.docker.com/r/cogstacksystems/cogstack-nifi/

License: Other

Dockerfile 2.76% Groovy 5.80% Python 46.68% Shell 20.84% Makefile 1.35% Jupyter Notebook 21.90% R 0.20% TSQL 0.46%
apache-nifi data-integration data-pipelines elasticsearch electronic-health-records kibana nifi nlp rest

cogstack-nifi's Introduction

Archived

This project is archived and no longer maintained. CogStack-Nifi is the successor to this project and continues to be actively maintained.

Introduction

CogStack is a lightweight, distributed, fault-tolerant database processing architecture and ecosystem, intended to make NLP processing and preprocessing easier in resource-constrained environments. It comprises multiple components; CogStack Pipeline, the one covered in this documentation, has been designed to provide configurable data processing pipelines for working with EHR data. For the moment it mainly uses databases and files as the primary source of EHR data, with the possibility of adding custom data connectors soon. It builds on the Java Spring Batch framework to provide a fully configurable data processing pipeline, with the goal of generating annotated JSON files that can be readily indexed into ElasticSearch, stored as files or pushed back to a database.

Documentation

For the most up-to-date documentation on using CogStack, building it and running it with example deployments, please refer to the official CogStack Confluence page.

Discussion

If you have any questions, reach out to the community Discourse forum.

Quick Start Guide

Introduction

This simple tutorial demonstrates how to get CogStack Pipeline running on a sample electronic health record (EHR) dataset stored initially in an external database. The CogStack ecosystem has been designed to handle both structured and unstructured EHR data efficiently. It shows its strength when working with unstructured data, especially as some input data can be provided as documents in PDF or image formats. For the moment, however, we only show how to run CogStack on a set of structured and free-text EHRs that have already been digitized. The part covering unstructured data in the form of PDF documents, images and other clinical notes, which need to be processed prior to analysis, is covered in the official CogStack Confluence page.

This tutorial is divided into 3 parts:

  1. Getting CogStack (link),
  2. A brief description of how the CogStack pipeline and its ecosystem work (link),
  3. Running CogStack pipeline 'out-of-the-box' using the dataset already preloaded into a sample database (link).

To skip the brief description and get hands-on with running the CogStack pipeline, please head directly to the Running CogStack part.

The main directory with the resources used in this tutorial is available in the CogStack bundle under examples/. This tutorial is based on Example 2; however, there are more examples available to play with.

Getting CogStack

The most convenient way to get the CogStack bundle is to download it directly from the official GitHub repository, either by cloning the source using git:

git clone https://github.com/CogStack/CogStack-Pipeline.git

or by downloading the bundle from the repository's Releases page and decompressing it.

How CogStack works

Data processing workflow

The data processing workflow of the CogStack pipeline is based on the Java Spring Batch framework. Without dwelling too much on technical details, the general idea is that data is read from a predefined data source, passes through a number of processing operations, and the final result is stored in a predefined data sink. CogStack pipeline implements a variety of data processors, readers and writers, with scalability mechanisms, which can be selected in the CogStack job configuration. Although the data can be read from different sources, the most frequently used data sink is ElasticSearch. For more details about the CogStack functionality, please refer to the CogStack Documentation.

[Figure: CogStack pipeline data processing workflow]

In this tutorial we focus only on a simple and very common use case, where the CogStack pipeline reads and processes structured and free-text EHR data from a single PostgreSQL database. The result is then stored in ElasticSearch, where the data can be easily queried in the Kibana dashboard. However, the CogStack pipeline data processing engine also supports multiple data sources -- please see Example 3, which covers such a case.

A sample CogStack ecosystem

The CogStack ecosystem consists of multiple interconnected microservices running together. For ease of use and deployment we use Docker (more specifically, Docker Compose) and provide Compose files for configuring and running the microservices. The selection of running microservices depends mostly on the specification of the EHR data source(s) and on the data extraction and processing requirements.

In this tutorial the CogStack ecosystem is composed of the following microservices:

  • samples-db -- PostgreSQL database loaded with a sample dataset under the name db_samples,
  • cogstack-pipeline -- CogStack data processing pipeline with worker(s),
  • cogstack-job-repo -- PostgreSQL database for storing information about CogStack jobs,
  • elasticsearch-1 -- ElasticSearch search engine (single node) for storing and querying the processed EHR data,
  • kibana -- Kibana data visualization tool for querying the data from ElasticSearch.

Since all the examples share a common configuration for the microservices used, the base Docker Compose file is provided in examples/docker-common/docker-compose.yml. The Docker Compose file with the microservice configuration overridden for this example can be found in examples/example2/docker/docker-compose.override.yml. Both configuration files are automatically used by Docker Compose when deploying CogStack, as will be shown later.
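As an optional sanity check, Docker Compose can print the merged result of the base and override files, which is handy for seeing what will actually be deployed. A minimal sketch, assuming the repository root as the working directory:

# print the merged configuration resulting from the base and override Compose files
docker-compose -f examples/docker-common/docker-compose.yml -f examples/example2/docker/docker-compose.override.yml config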

Sample datasets

The sample dataset used in this tutorial consists of two types of EHR data:

  • Synthetic -- structured, synthetic EHRs generated using the synthea application,
  • Medical reports -- unstructured medical health report documents obtained from MTsamples.

These datasets, although unrelated, are used together to compose a combined dataset.

Full description of these datasets can be found in the official CogStack Confluence page.

Running CogStack platform

Running CogStack pipeline for the first time

For ease of use, CogStack is deployed and run using Docker. However, before starting the CogStack ecosystem for the first time, one needs to obtain the database dump files for the sample data, either by creating them locally or by downloading them from Amazon S3. To download the database dumps, run the following in the main examples/ directory:

bash download_db_dumps.sh

Next, a setup script needs to be run locally to prepare the Docker images and configuration files for the CogStack data processing pipeline. The script is available in the examples/example2/ directory and can be run as:

bash setup.sh

As a result, a temporary directory __deploy/ will be created containing all the necessary artifacts to deploy CogStack.
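As a quick optional check, one can list the generated artifacts; the exact contents depend on the example being deployed:

# inspect the deployment artifacts generated by setup.sh
ls examples/example2/__deploy/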

Docker-based deployment

Next, we can proceed to deploy the CogStack ecosystem using Docker Compose. It will configure and start the microservices based on the provided Compose files:

  • the common base configuration, copied from examples/docker-common/docker-compose.yml,
  • the example-specific configuration, copied from examples/example2/docker/docker-compose.override.yml. Moreover, the PostgreSQL database container comes with a pre-initialized database dump ready to be loaded directly into the database.

In order to run CogStack, type the following in the examples/example2/__deploy/ directory:

docker-compose up

The console will print the status logs of the currently running microservices. For the moment, however, they may not be very informative (sorry, we're working on that!).
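If you prefer to keep the console free, a common alternative (standard Docker Compose usage, not specific to CogStack) is to start the services in detached mode, check their status, and follow the logs of a selected service, e.g. the cogstack-pipeline service listed above:

# start the microservices in the background
docker-compose up -d
# check which containers are running
docker-compose ps
# follow the logs of the data processing pipeline
docker-compose logs -f cogstack-pipeline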

Connecting to the microservices

CogStack ecosystem

The picture below sketches a general idea of how the microservices run and communicate within the sample CogStack ecosystem used in this tutorial.

[Figure: microservices and their communication within the sample CogStack ecosystem]

Assuming that everything is working fine, we should be able to connect to the running microservices. Selected running services (elasticsearch-1 and kibana) have their ports forwarded to the host's localhost.

Kibana and ElasticSearch

The Kibana dashboard used to query the EHRs can be accessed directly in a browser via the URL http://localhost:5601/. The data can be queried using a number of ElasticSearch indices, e.g. sample_observations_view. Usually, each index will correspond to the database view in db_samples (the samples-db PostgreSQL database) from which the data was ingested. However, when entering the Kibana dashboard for the first time, an index pattern needs to be configured in the Kibana management panel -- for more information about its creation, please refer to the official Kibana documentation.

In addition, the ElasticSearch REST endpoint can be accessed via the URL http://localhost:9200/. It can be used to perform manual queries or be used by other external services -- for example, one can list the available indices:

curl 'http://localhost:9200/_cat/indices'

or query one of the available indices -- sample_observations_view:

curl 'http://localhost:9200/sample_observations_view'
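For a slightly more targeted check, the standard ElasticSearch _search endpoint can be queried against the same index. A minimal sketch -- the query term 'blood' below is an arbitrary example and is not guaranteed to match anything in the sample data:

# full-text search across the index, returning up to 5 hits
curl 'http://localhost:9200/sample_observations_view/_search?q=blood&size=5&pretty'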

For more information about possible document querying or modification operations, please refer to the official ElasticSearch documentation.

As a side note, the name of the ElasticSearch node in the Docker Compose files has been set to elasticsearch-1. The -1 suffix emphasizes that for larger-scale deployments, multiple ElasticSearch nodes can be used -- typically, a minimum of 3.

PostgreSQL sample database

Moreover, the PostgreSQL database with the input sample data is exposed directly at localhost:5555. The database name is db_samples, with user test and password test. To connect, one can run:

psql -U 'test' -W -d 'db_samples' -h localhost -p 5555
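Once connected, standard psql commands and SQL can be used to explore the sample data. A minimal sketch, assuming the patients table from the sample dataset described above is present:

# list the tables available in the sample database
psql -U 'test' -W -d 'db_samples' -h localhost -p 5555 -c '\dt'
# peek at a few rows of the patients table (hypothetical table name, based on the sample dataset description)
psql -U 'test' -W -d 'db_samples' -h localhost -p 5555 -c 'SELECT * FROM patients LIMIT 5;'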

Publications

CogStack - Experiences Of Deploying Integrated Information Retrieval And Extraction Services In A Large National Health Service Foundation Trust Hospital, Richard Jackson, Asha Agrawal, Kenneth Lui, Amos Folarin, Honghan Wu, Tudor Groza, Angus Roberts, Genevieve Gorrell, Xingyi Song, Damian Lewsley, Doug Northwood, Clive Stringer, Robert Stewart, Richard Dobson. BMC medical informatics and decision making 18, no. 1 (2018): 47.


cogstack-nifi's People

Contributors

baixiac, kawsarnoor, lrog, sandertan, tomolopolis, vladd-bit


cogstack-nifi's Issues

Any plans to move from OpenDistro to OpenSearch?

It seems the work on OpenDistro has now quite definitively moved to OpenSearch, see e.g.: https://opendistro.github.io/for-elasticsearch/blog/2021/06/forward-to-opensearch/

Do you happen to have any plans to move to OpenSearch? As far as I can tell right now, this should not result in very fundamental problems. Kibana is renamed OpenSearch Dashboards, and still comes as a separate Docker image.

In the short term, the problem for me is that OpenDistro does not support Docker on Apple M1 chips, so I cannot work with it locally anymore. I might try to make the change myself, but I was just wondering if you had any thoughts on the issue.

Thanks!
-Vincent

ElasticSearch reader component

We need to implement functionality to read only the newest documents from ElasticSearch, i.e. those added since the last ingestion.

A NiFi database reader can persist the value of the last record's maximum-value columns (such as the primary key), and hence keep track of new records available to be ingested. However, such an option does not seem to be implemented when reading documents from ElasticSearch.
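Until such a reader exists, one possible workaround (a sketch only, not an implemented feature) is to track the last ingested value externally and filter on it in the ElasticSearch query. The ingest_timestamp field name below is hypothetical and would depend on the actual index mapping:

# ingest_timestamp is a hypothetical field; replace it with the actual max-value field of the index
curl -s -H 'Content-Type: application/json' 'http://localhost:9200/sample_observations_view/_search?pretty' -d '{
  "query": { "range": { "ingest_timestamp": { "gt": "2021-06-01T00:00:00" } } },
  "sort": [ { "ingest_timestamp": "asc" } ],
  "size": 100
}'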

Integration tests for supported workflows

We need to provide integration tests for supported workflows - these include for now:

  1. documents ingestion: DB -> ES
  2. documents ingestion with text extraction from BLOBs: DB -> Tika -> ES
  3. documents ingestion with NLP annotations extraction: DB -> NLP -> ES
  4. combined 2 and 3: DB -> Tika -> NLP -> ES
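As a rough sketch of what a minimal smoke check for workflow 1 (DB -> ES) could look like, assuming the sample deployment from the examples (db_samples exposed on port 5555, data ingested into the sample_observations_view index, and an observations table as in the sample dataset), one could compare record counts at both ends:

# count rows in the source table (observations is assumed from the sample dataset description)
psql -U 'test' -W -d 'db_samples' -h localhost -p 5555 -c 'SELECT count(*) FROM observations;'
# count documents ingested into the corresponding ElasticSearch index
curl -s 'http://localhost:9200/sample_observations_view/_count?pretty'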

Sample DB docker not populated with data on startup

I am deploying CogStack on a Windows 10 machine for testing.

The sample DB, which is supposed to be populated with the following data, is empty, leading to the workflows not being run:

  • patients - structured patient information,
  • encounters - structured encounters information,
  • observations - structured observations information.


Any thoughts?

Support for Different File Data Sources

Currently, the preferred data source is a relational DB with documents included as BLOBs, but there are other possible sources, some of which are less urgently needed at present. I've ranked them below based on how likely I expect them to be encountered.

  1. BLOBs in database
  2. Pointers to file paths on a filesystem (or object store)
  3. Files on a filesystem with metadata, e.g. patientID in the filename or file contents
  4. Object store (e.g. S3 or other) with metadata, e.g. patientID in object metadata labels

nlp response groovy script error

Using the script "parse-anns-from-nlp-response-bulk.groovy" in the NiFi annotation workflow gives an error, as it cannot validate ann_id using "assert ann_id". I had to convert ann_id to a string to make it work:

def ann_id = outAnn[annotation_id_field as String]
assert ann_id.toString()

Suggestions for simplifying Docker Compose

Hi @vladd-bit, in addition to #19 I think there are some more simplifications possible for services.yml that would make it easier to do custom deployments while also making it easy to regularly pull updates from this repository's master branch.

  1. Move env variables to YML files.
    Are there any specific reasons to keep some ENV vars in the docker-compose file, while having other configuration for the same services in the YML / properties files? In the elasticsearch-1, -2 and kibana services there are quite a number of environment variables that can also be specified in the respective YML files. Also, the nifi service contains some env variables that I think can be moved to nifi/conf/nifi.properties (although I've not tested this). Apparently some NiFi properties can only be set using ENV vars: https://stackoverflow.com/a/55266528/4141535.

  2. Create git-tracked -EXAMPLE files for configuration files.
    Just like I suggested earlier with deploy/.env-example (git tracked) and deploy/.env (git ignored), we can use this way of working for the OpenSearch YML and NiFi properties files. Custom deployments can copy the example file and tailor it to their needs. This makes it easy to pull new changes, and maintainers can inspect (e.g. using a diff tool) the differences between the example and the used file to see whether properties were added/changed/deleted. This way of working was quite effective in previous projects I collaborated on (example).

  3. Remove container and network names, and rely on $COMPOSE_PROJECT_NAME (https://docs.docker.com/compose/reference/envvars/#compose_project_name). I documented how to use this in #19 .

  4. There are a lot of commented and uncommented lines regarding ElasticSearch and Kibana mounted security files. Perhaps this can be simplified by using a single ENV var, e.g. $ELASTICSEARCH_SECURITY_DIR, which we can also put in .env-example and point to ../security/es_certificates/opensearch/.

  5. In our deployments we set all host ports in the .env file, outside of the docker-compose file. For example, - ${KIBANA_HOST_PORT}:5601. This makes it easy to switch between local deployments (fine if the port is open) and server deployments (we set the port to 127.0.0.1:5601 and let the reverse proxy on the host machine regulate traffic). What are your thoughts on moving this configuration to .env?

  6. When ports: is set, expose: no longer has any informative meaning, and can be removed (https://stackoverflow.com/a/40801773/4141535). Or do you include them for a different reason?

For a new user doing a new deployment, it would be nice to require the least amount of actions to start the Docker containers. Perhaps creating a .env file from the .env-example and executing docker-compose up is enough. We can configure the .env-example to point to all the other example files, which the user can later change to their deployment-specific configuration files (a rough sketch follows below).
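To make the suggestions above concrete, a rough sketch of what such a .env-example could contain; the variable names are taken from the points above and the values are placeholders, not the project's actual defaults:

# .env-example -- git tracked; copy to .env and adjust per deployment
COMPOSE_PROJECT_NAME=cogstack
ELASTICSEARCH_SECURITY_DIR=../security/es_certificates/opensearch/
KIBANA_HOST_PORT=127.0.0.1:5601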

I'd rather discuss this with you before creating a PR, since your workflows probably depend on the current way of working.

By the way, congratulations on releasing v1.0.0 :)
