cogstack / cogstack-nifi

Building data processing pipelines for document processing with NLP using Apache NiFi and related services

Home Page: https://hub.docker.com/r/cogstacksystems/cogstack-nifi/

License: Other

Dockerfile 2.87% Python 50.11% Shell 22.03% Makefile 1.48% Jupyter Notebook 22.82% R 0.21% TSQL 0.48%

apache-nifi nifi elasticsearch kibana data-pipelines nlp rest electronic-health-records data-integration

cogstack-nifi's Issues

nlp response groovy script error

Using the script "parse-anns-from-nlp-response-bulk.groovy" in the NiFi annotation workflow raises an error because "assert ann_id" fails to validate ann_id. I had to convert ann_id to a string to make it work:

def ann_id = outAnn[annotation_id_field as String]
assert ann_id.toString()
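
One plausible cause, which the report does not confirm, is a numeric ID of 0: in Groovy, as in Python, a numeric zero evaluates to false in a bare assertion, while the non-empty string "0" evaluates to true. The Python sketch below illustrates that pitfall and an explicit presence check as an alternative. The function `validate_ann_id` and the sample records are hypothetical and not part of the repository's script.

```python
def validate_ann_id(record, id_field):
    """Return the annotation ID, requiring presence rather than truthiness."""
    ann_id = record.get(id_field)
    # Explicit None check: a legitimate ID of 0 must not be rejected.
    if ann_id is None:
        raise ValueError(f"missing annotation id field: {id_field!r}")
    return ann_id


# Hypothetical annotation records, for illustration only.
first = {"id": 0, "text": "aspirin"}
missing = {"text": "aspirin"}

# A bare truthiness check treats the valid ID 0 as a failure.
print(bool(first["id"]))
# Converting to a string makes ID 0 pass...
print(bool(str(first["id"])))
# ...but in Python it also makes a missing ID pass, because str(None) is "None".
print(bool(str(missing.get("id"))))
# The explicit presence check accepts 0 and would reject the missing ID.
print(validate_ann_id(first, "id"))
```

The string conversion is a reasonable workaround, but a presence check states the intent more directly and does not hide missing values.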

ElasticSearch reader component

We need to implement functionality to read only the documents added to ElasticSearch since the last ingestion.

A NiFi database reader can persist the value of the last record's maximum-value columns (such as the primary key), so it can keep track of new records available for ingestion. However, no such option seems to be implemented for reading documents from ElasticSearch.
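
As a rough illustration of the requested behaviour, the Python sketch below persists the highest value seen so far and builds an ElasticSearch range query that asks only for newer documents. It is not a NiFi processor: the field name `updated_at`, the state file name, and the helper functions are assumptions made for the example, and the query is only constructed, not sent to a cluster.

```python
import json
import os
import tempfile


def load_last_value(state_path):
    """Return the persisted maximum value, or None on the first run."""
    if not os.path.exists(state_path):
        return None
    with open(state_path) as fh:
        return json.load(fh)["last_value"]


def save_last_value(state_path, value):
    """Persist the maximum value seen in the latest batch."""
    with open(state_path, "w") as fh:
        json.dump({"last_value": value}, fh)


def build_incremental_query(field, last_value, size=100):
    """Build a search body that returns only documents newer than last_value."""
    if last_value is None:
        query = {"match_all": {}}
    else:
        query = {"range": {field: {"gt": last_value}}}
    # Sorting ascending means the last hit carries the new maximum value.
    return {"size": size, "query": query, "sort": [{field: "asc"}]}


# Hypothetical state location and tracking field.
state_path = os.path.join(tempfile.mkdtemp(), "es_reader_state.json")
first_query = build_incremental_query("updated_at", load_last_value(state_path))

# Stand-in for the hits returned by the first query.
hits = [
    {"updated_at": "2021-01-01T00:00:00"},
    {"updated_at": "2021-01-02T00:00:00"},
]
save_last_value(state_path, hits[-1]["updated_at"])

next_query = build_incremental_query("updated_at", load_last_value(state_path))
print(json.dumps(next_query, indent=2))
```

This mirrors the maximum-value-column idea: the tracked field must increase monotonically for new or updated documents, otherwise documents can be missed.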

Upgrade opendistro/opensearch for log4j vulnerability

You're probably well aware, but it's recommended to upgrade to OpenDistro >= 1.13.3 or OpenSearch >= 1.2.1 ASAP to mitigate the log4j vulnerability (see: the internet).

https://opendistro.github.io/for-elasticsearch/blog/2021/12/update-to-1-13-3/
https://opensearch.org/blog/releases/2021/12/update-to-1-2-1/
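
A deployment check against these thresholds could look like the Python sketch below. It only compares a version string with the minimum versions named in this issue; the `is_patched` helper and the product keys are hypothetical, and it is not a substitute for a real vulnerability scan.

```python
# Minimum versions taken from the recommendation in this issue.
MIN_PATCHED = {
    "opendistro": (1, 13, 3),
    "opensearch": (1, 2, 1),
}


def parse_version(text):
    """Turn a dotted version such as '1.13.3' into a tuple of integers."""
    return tuple(int(part) for part in text.split("."))


def is_patched(product, version):
    """Return True if the version meets the minimum recommended for the product."""
    return parse_version(version) >= MIN_PATCHED[product.lower()]


# Example versions, chosen for illustration.
for product, version in [("OpenDistro", "1.13.2"), ("OpenSearch", "1.2.1")]:
    status = "ok" if is_patched(product, version) else "upgrade needed"
    print(f"{product} {version}: {status}")
```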

Integration tests for supported workflows

We need to provide integration tests for the supported workflows. For now these include:

1. document ingestion: DB -> ES
2. document ingestion with text extraction from BLOBs: DB -> Tika -> ES
3. document ingestion with NLP annotation extraction: DB -> NLP -> ES
4. combined 2 and 3: DB -> Tika -> NLP -> ES

Sample DB docker not populated with data on startup

I am deploying CogStack on a Windows 10 machine for testing.

The sample DB, which is supposed to be populated with the following data, is empty, so the workflows cannot run:
patients - structured patient information,
encounters - structured encounters information,
observations - structured observations information.

Any thoughts?

Support for Different File Data Sources

Currently, the preferred data source is a relational DB with documents stored as BLOBs. There are other possible sources, although some need support less urgently at present. I've ranked them by how likely I expect them to be encountered.

1. BLOBs in database
2. Pointers to file paths on a filesystem (or object store)
3. Files on Filesystem with metadata e.g. patientID in Filename or File contents
4. Object Store (e.g. S3 or other) with metadata e.g. patientID in object metadata labels
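
For the filesystem case (item 3), metadata such as a patient ID would have to be recovered from the file name. The Python sketch below shows one way to do that, assuming a purely hypothetical naming convention of `<patientID>_<documentID>.<extension>`; the issue does not specify any convention, so the pattern and the `metadata_from_path` helper are illustrative only.

```python
import re
from pathlib import PurePosixPath

# Hypothetical convention: <patientID>_<documentID>.<extension>
FILENAME_PATTERN = re.compile(
    r"^(?P<patient_id>[A-Za-z0-9]+)_(?P<document_id>[A-Za-z0-9]+)$"
)


def metadata_from_path(path):
    """Extract metadata from a file path, or return None if the name does not match."""
    file_path = PurePosixPath(path)
    match = FILENAME_PATTERN.match(file_path.stem)
    if match is None:
        return None
    return {
        "patient_id": match.group("patient_id"),
        "document_id": match.group("document_id"),
        "extension": file_path.suffix.lstrip("."),
        "path": str(file_path),
    }


# Example paths, invented for illustration.
print(metadata_from_path("/data/docs/P123_D456.pdf"))
print(metadata_from_path("/data/docs/readme.txt"))
```

Files that do not follow the convention return None, so a pipeline can route them to a separate queue instead of ingesting them without a patient ID.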

Any plans to move from OpenDistro to OpenSearch?

It seems the work on OpenDistro has now quite definitively moved into OpenSearch, see e.g: https://opendistro.github.io/for-elasticsearch/blog/2021/06/forward-to-opensearch/

Do you happen to have any plans to move to OpenSearch? As far as I can tell, this should not cause any fundamental problems. Kibana has been renamed OpenSearch Dashboards and still comes as a separate Docker container.

In the short term, the problem for me is that OpenDistro does not support Docker on Apple M1 chips, so I can no longer work with it locally. I might try to make the change myself, but I was wondering if you had any thoughts on the issue.

Thanks!
-Vincent

Suggestions for simplifying Docker Compose

Hi @vladd-bit, in addition to #19 I think some more simplifications are possible for services.yml that would make custom deployments easier while still making it easy to regularly pull updates from this repository's master branch.

1. Move env variables to YML files.
   Are there any specific reasons to keep some ENV vars in the docker-compose while other configuration for the same services lives in the YML / properties files? The elasticsearch-1, -2 and kibana services have quite a number of environment variables that can also be specified in the respective YML files. ~~Also the nifi service contains some env variables that can be moved to nifi/conf/nifi.properties I think (although I've not tested this).~~ Apparently some NiFi properties can only be set using ENV vars: https://stackoverflow.com/a/55266528/4141535.
2. Create git-tracked -EXAMPLE files for configuration files.
   Just as I suggested earlier with deploy/.env-example (git tracked) and deploy/.env (git ignored), we can work the same way with the OpenSearch YML and NiFi properties files. Custom deployments can copy the example file and tailor it to their needs. This makes it easy to pull new changes, and maintainers can inspect (e.g. using a diff tool) the differences between the example file and the file in use to see whether properties were added, changed or deleted. This way of working was quite effective in previous projects I collaborated on (example).
3. Remove container and network names, and rely on $COMPOSE_PROJECT_NAME (https://docs.docker.com/compose/reference/envvars/#compose_project_name). I documented how to use this in #19.
4. There are a lot of commented and uncommented lines regarding ElasticSearch and Kibana mounted security files. Perhaps this can be simplified by using a single ENV var, e.g. $ELASTICSEARCH_SECURITY_DIR, which we can also put in .env-example and point to ../security/es_certificates/opensearch/.
5. In our deployments we set all host ports in the .env, outside of the docker-compose. For example, - ${KIBANA_HOST_PORT}:5601. This makes it easy to switch between local deployments (fine if the port is open) and server deployments (we set the port to 127.0.0.1:5601 and let the reverse proxy on the host machine regulate traffic). What are your thoughts on moving this configuration to .env?
6. When ports: is set, expose: no longer has any informative meaning and can be removed (https://stackoverflow.com/a/40801773/4141535). Or do you include them for a different reason?
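
The example-file workflow relies on comparing the tracked example with the local copy after each pull. A minimal Python sketch of that comparison is shown here; it only compares variable names, the `compare_env` helper is hypothetical, and the sample file contents reuse the variable names mentioned above with invented values.

```python
def parse_env_keys(text):
    """Collect variable names from .env-style text, skipping comments and blanks."""
    keys = set()
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        keys.add(line.split("=", 1)[0].strip())
    return keys


def compare_env(example_text, local_text):
    """Report variables missing from the local file and variables unknown to the example."""
    example_keys = parse_env_keys(example_text)
    local_keys = parse_env_keys(local_text)
    return {
        "missing": sorted(example_keys - local_keys),
        "extra": sorted(local_keys - example_keys),
    }


# Invented contents standing in for deploy/.env-example and deploy/.env.
example = """# deploy/.env-example
KIBANA_HOST_PORT=5601
ELASTICSEARCH_SECURITY_DIR=../security/es_certificates/opensearch/
COMPOSE_PROJECT_NAME=cogstack
"""
local = """KIBANA_HOST_PORT=127.0.0.1:5601
COMPOSE_PROJECT_NAME=my-deployment
"""

report = compare_env(example, local)
print(report)
```

In this sample the local file lacks ELASTICSEARCH_SECURITY_DIR, which is the kind of newly added property a maintainer would want to notice after pulling updates.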

For a new user doing a new deployment, it would be nice to require as few actions as possible to start the Docker containers. Perhaps creating a .env file from the .env-example and executing docker-compose up would be enough. We can configure the .env-example to point to all the other example files, which the user can later replace with deployment-specific configuration files.

I'd rather discuss this with you before creating a PR, since your workflows probably depend on the current way of working.

By the way, congratulations on releasing v1.0.0 :)
