cogstack / cogstack-nifi

Building data processing pipelines for document processing with NLP using Apache NiFi and related services

Home Page: https://hub.docker.com/r/cogstacksystems/cogstack-nifi/

License: Other

Dockerfile 2.87% Python 50.11% Shell 22.03% Makefile 1.48% Jupyter Notebook 22.82% R 0.21% TSQL 0.48%
apache-nifi nifi elasticsearch kibana data-pipelines nlp rest electronic-health-records data-integration

cogstack-nifi's Issues

nlp response groovy script error

Using the script "parse-anns-from-nlp-response-bulk.groovy" in the NiFi annotation workflow gives an error because it cannot validate ann_id with "assert ann_id". I had to convert ann_id to a string to make it work:

def ann_id = outAnn[annotation_id_field as String]
assert ann_id.toString()
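
One possible cause, which the report does not confirm, is an annotation ID that arrives as the number 0. Groovy truth treats numeric zero as false, so a bare "assert ann_id" rejects it, while the string "0" is non-empty and passes. Python truthiness behaves the same way for these values, so the effect can be sketched as follows (the function names are hypothetical and this is an illustration, not the script's actual logic):

```python
def passes_truthiness(ann_id):
    """Mimic a bare truthiness assertion: 0 and "" count as false."""
    return bool(ann_id)


def passes_as_string(ann_id):
    """Mimic asserting on the string form: any non-empty string counts as true."""
    return bool(str(ann_id))


if __name__ == "__main__":
    for candidate in (0, 7, "abc"):
        print(repr(candidate), passes_truthiness(candidate), passes_as_string(candidate))
```

If zero is a legitimate ID, an explicit null check may express the intent more directly than relying on truthiness.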

ElasticSearch reader component

We need to implement functionality to read only the documents added to ElasticSearch since the last ingestion.

A NiFi database reader can persist the value of the last record's maximum-value columns (such as the primary key), so it can keep track of new records available to be ingested. However, such an option does not seem to be implemented for reading documents from ElasticSearch.
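
A minimal sketch of how such a watermark could work outside NiFi, assuming each document carries a sortable field (the name "updated_at", the state file and all function names are hypothetical). It only builds the search body using the standard range query and persists the largest value seen; sending the request to ElasticSearch is left out:

```python
import json
from pathlib import Path


def load_watermark(state_file):
    """Return the last persisted maximum value, or None on the first run."""
    path = Path(state_file)
    if not path.exists():
        return None
    return json.loads(path.read_text())["last_value"]


def save_watermark(state_file, value):
    """Persist the maximum value so the next run can resume from it."""
    Path(state_file).write_text(json.dumps({"last_value": value}))


def build_query(field, last_value, size=100):
    """Build a search body selecting only documents newer than last_value."""
    if last_value is None:
        query = {"match_all": {}}  # first run: read everything
    else:
        query = {"range": {field: {"gt": last_value}}}
    return {"size": size, "sort": [{field: "asc"}], "query": query}


def next_watermark(hits, field, last_value):
    """Return the largest value of `field` among hits, or last_value if none."""
    values = [hit["_source"][field] for hit in hits]
    return max(values) if values else last_value
```

Comparing values with max() assumes they sort correctly as stored, for example integers or ISO 8601 timestamps in a single format.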

Integration tests for supported workflows

We need to provide integration tests for the supported workflows. For now these include:

  1. document ingestion: DB -> ES
  2. document ingestion with text extraction from BLOBs: DB -> Tika -> ES
  3. document ingestion with NLP annotation extraction: DB -> NLP -> ES
  4. workflows 2 and 3 combined: DB -> Tika -> NLP -> ES

Sample DB docker not populated with data on startup

I am deploying CogStack on a Windows 10 machine for testing.

The sample DB, which is supposed to be populated with the following data, is empty, so the workflows do not run:
patients - structured patient information,
encounters - structured encounters information,
observations - structured observations information.

Any thoughts?

Support for Different File Data Sources

Currently, the preferred data source is a relational DB with documents stored as BLOBs. There are other possible sources, though some need support less urgently at present. I've ranked them by how likely I expect them to be encountered.

  1. BLOBs in a database
  2. Pointers to file paths on a filesystem (or object store)
  3. Files on a filesystem with metadata, e.g. patientID in the filename or file contents
  4. Object store (e.g. S3 or other) with metadata, e.g. patientID in object metadata labels
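
As an illustration of the third option, here is a minimal sketch that reads a patientID from a filename. The naming convention "<patientID>_<documentID>.<extension>" and every name in the code are assumptions made for this example; real deployments would need their own pattern:

```python
import re
from pathlib import Path

# Hypothetical convention: "<patientID>_<documentID>.<extension>",
# for example "P12345_doc001.pdf".
FILENAME_PATTERN = re.compile(
    r"^(?P<patient_id>[A-Za-z0-9]+)_(?P<document_id>[A-Za-z0-9]+)$"
)


def metadata_from_path(file_path):
    """Extract patient and document IDs from the file name, or return None."""
    stem = Path(file_path).stem  # file name without directory and extension
    match = FILENAME_PATTERN.match(stem)
    if match is None:
        return None
    return {
        "patient_id": match.group("patient_id"),
        "document_id": match.group("document_id"),
        "file_path": str(file_path),
    }


if __name__ == "__main__":
    print(metadata_from_path("docs/P12345_doc001.pdf"))
    print(metadata_from_path("docs/notes.txt"))
```

Files that do not follow the convention return None, so they can be routed to a separate queue instead of being ingested without a patient link.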

Any plans to move from OpenDistro to OpenSearch?

It seems the work on OpenDistro has now quite definitively moved into OpenSearch, see e.g: https://opendistro.github.io/for-elasticsearch/blog/2021/06/forward-to-opensearch/

Do you happen to have any plans to move to OpenSearch? As far as I can tell right now, this should not cause any fundamental problems. Kibana is renamed OpenSearch Dashboards, and still ships as a separate Docker image.

In the short term, the problem for me is that OpenDistro does not support Docker on Apple M1 chips, so I cannot work with it locally anymore. I might try to make the change myself, but I was wondering if you had any thoughts on the issue.

Thanks!
-Vincent

Suggestions for simplifying Docker Compose

Hi @vladd-bit, in addition to #19 I think some more simplifications are possible for services.yml that would make custom deployments easier while keeping it easy to regularly pull updates from this repository's master branch.

  1. Move env variables to YML files.
    Are there any specific reasons to keep some ENV vars in the docker-compose file while other configuration for the same services lives in the YML / properties files? The elasticsearch-1, elasticsearch-2 and kibana services have quite a number of environment variables that can also be specified in the respective YML files. I think the nifi service also contains some env variables that can be moved to nifi/conf/nifi.properties (although I've not tested this). Apparently some NiFi properties can only be set using ENV vars: https://stackoverflow.com/a/55266528/4141535.

  2. Create git-tracked -EXAMPLE files for configuration files
    Just as I suggested earlier with deploy/.env-example (git tracked) and deploy/.env (git ignored), we can apply the same approach to the OpenSearch YML and NiFi properties files. Custom deployments can copy the example file and tailor it to their needs. This makes it easy to pull new changes, and maintainers can inspect (e.g. using a diff tool) the differences between the example file and the file in use to see whether properties were added, changed or deleted. This way of working was quite effective in previous projects I collaborated on.

  3. Remove container and network names, and rely on $COMPOSE_PROJECT_NAME (https://docs.docker.com/compose/reference/envvars/#compose_project_name). I documented how to use this in #19.

  4. There are a lot of commented and uncommented lines regarding ElasticSearch and Kibana mounted security files. Perhaps this can be simplified by using a single ENV var, e.g. $ELASTICSEARCH_SECURITY_DIR, which we can also put in .env-example and point to ../security/es_certificates/opensearch/.

  5. In our deployments we set all host ports in the .env file, outside of the docker-compose file. For example, - ${KIBANA_HOST_PORT}:5601. This makes it easy to switch between local deployments (fine if the port is open) and server deployments (we set the port to 127.0.0.1:5601 and let the reverse proxy on the host machine regulate traffic). What are your thoughts on moving this configuration to .env?

  6. When ports: is set, expose: no longer has any informative meaning, and can be removed (https://stackoverflow.com/a/40801773/4141535). Or do you include them for a different reason?

For a new user doing a new deployment, it would be nice to require as few actions as possible to start the Docker containers. Perhaps creating a .env file from the .env-example and executing docker-compose up is enough. We can configure the .env-example to point to all the other example files, which the user can later replace with their deployment-specific configuration files.
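
The comparison between a tracked example file and the file actually in use (suggestion 2) could also be scripted. Below is a minimal sketch, assuming simple KEY=VALUE lines with no quoting or "export" prefixes; the function names are hypothetical, and the variable names in the usage part come from the suggestions above plus one made-up key:

```python
def parse_env(text):
    """Parse KEY=VALUE lines, skipping blank lines and # comments."""
    values = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip()
    return values


def compare_env(example_text, actual_text):
    """Report keys missing from, added to, or changed in the deployed file."""
    example = parse_env(example_text)
    actual = parse_env(actual_text)
    shared = example.keys() & actual.keys()
    return {
        "missing": sorted(example.keys() - actual.keys()),
        "extra": sorted(actual.keys() - example.keys()),
        "changed": sorted(k for k in shared if example[k] != actual[k]),
    }


if __name__ == "__main__":
    example = "KIBANA_HOST_PORT=5601\nELASTICSEARCH_SECURITY_DIR=../security/es_certificates/opensearch/\n"
    deployed = "# server deployment\nKIBANA_HOST_PORT=127.0.0.1:5601\nCUSTOM_SETTING=1\n"
    print(compare_env(example, deployed))
```

Keys listed under "missing" are the ones a maintainer would need to review after pulling an updated example file.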

I'd rather discuss this with you before creating a PR, since your workflows probably depend on the current way of working.

By the way, congratulations on releasing v1.0.0 :)
