texta-tk / texta

Terminology EXtraction and Text Analytics (TEXTA) Toolkit

Home Page: https://git.texta.ee/texta/texta-rest

License: GNU General Public License v3.0

Python 99.63% Shell 0.16% Dockerfile 0.22%
ai artificial-intelligence natural-language-processing nlp nlp-machine-learning textanalytics django python

texta's Introduction

TEXTA Toolkit 3

Documentation

https://docs.texta.ee

Wiki

https://git.texta.ee/texta/texta-rest/wikis/home

Notes

Works with Python 3.8

Creating environment:

conda env create -f environment.yaml

Running migrations:

python3 migrate.py

  • This script also creates an admin account with the default username "admin". Use python migrate.py -u {{username}} instead for a custom username of your choice.
  • The password for that admin account is generated automatically and printed to the console. This behaviour can be overridden with the environment variable TEXTA_ADMIN_PASSWORD, in which case the password is set to that variable's value.
  • Running python migrate.py -o overwrites the existing password with the current value of the environment variable TEXTA_ADMIN_PASSWORD.
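As a rough illustration of the password behaviour described above, the override logic might look like the following sketch (the function name and details are hypothetical, not TEXTA's actual code):

```python
import os
import secrets


def resolve_admin_password():
    """Illustrative helper: use TEXTA_ADMIN_PASSWORD if set, else generate one."""
    override = os.getenv("TEXTA_ADMIN_PASSWORD")
    if override:
        return override
    # token_urlsafe(16) yields a ~22-character URL-safe random string
    password = secrets.token_urlsafe(16)
    print("Generated admin password: " + password)
    return password
```

Either way, the resulting value is the password the admin account is created with.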

Running application:

python3 manage.py runserver

celery -A toolkit.taskman worker -l info

Import testing data:

python3 import_test_data.py

Run all tests:

python3 manage.py test

Run tests for specific app:

python3 manage.py test appname (e.g. python3 manage.py test toolkit.neurotagger)

Run performance tests (not run by default as they are slow):

python3 manage.py test toolkit.performance_tests

Building Docker:

docker build -t texta-rest:latest -f docker/Dockerfile .

Running Docker:

docker run -p 8000:8000 texta-rest:latest

Building Docker with GPU support:

docker build -t texta-rest:gpu-latest -f docker/gpu.Dockerfile .

Running Docker with GPU support requires the NVIDIA Container Toolkit to be installed on the host machine: https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/install-guide.html#docker. When the Container Toolkit is installed:

docker run --gpus all -p 8000:8000 texta-rest:gpu-latest

Environment variables

Deploy & Testing variables

  • TEXTA_ENV_FILE - Optional file path for a typical .env file to be loaded into memory for TEXTA Toolkit (Default: None).

  • TEXTA_SECRET_KEY - String key for cryptographic security purposes. ALWAYS SET IN PRODUCTION.

  • TEXTA_CORS_ALLOW_CREDENTIALS - Whether to allow cookies to be included in cross-site HTTP requests (Default: True)

  • TEXTA_CORS_ALLOW_ALL_ORIGINS - Whether to allow requests from all Origins (Default: false)

  • TEXTA_CELERY_USED_QUEUES - Comma separated list of Celery queues you are using for TEXTA Toolkit. No need to touch when running a standard configuration.

  • TEXTA_ELASTIC_VERSION - Must equal the major version number of the main Elasticsearch cluster (Default: 6).

  • TEXTA_DEPLOY_KEY - Used to separate different Toolkit instances for cases where Elasticsearch or the database are shared amongst multiple instances. Best to give this a simple number (Default: 1).

  • TEXTA_ADMIN_PASSWORD - Password of the admin user created on first run.

  • TEXTA_USE_CSRF - Whether to use CSRF protection; kept disabled for integration tests (Default: false).

  • TEXTA_CELERY_ALWAYS_EAGER - Whether to run Celery tasks eagerly (synchronously, without workers); useful for local testing (Default: False).

  • TEXTA_DATA_DIR - Path to the directory in which TEXTA Toolkit saves the models it generates, and the binary model dependencies it needs (Default: data).

  • TEXTA_EXTERNAL_DATA_DIR - Path to the base directory in which 3rd party models (MLP/BERT/etc) are kept (Default: data/models).

  • TEXTA_CACHE_DIR - Path for the cache folder which BERT uses (Default: data/external/.cache).

  • TEXTA_RELATIVE_MODELS_DIR - Relative path of the directory in which all the different types of models are stored (Default: "/data/models").

  • TEXTA_LANGUAGE_CODES - Comma-separated string of Stanza-supported language codes to use for multilingual processing (Default: "").

  • TEXTA_MLP_USE_GPU - Use GPU to speed up MLP (Default: False).

  • TEXTA_MLP_MODEL_DIRECTORY_PATH - Relative path to the directory in which Stanza models are stored under the "stanza" folder (setting this to ./home/texta creates ./home/texta/stanza, which contains a subfolder for every language, e.g. ./home/texta/stanza/et). (Default: "./data/external/mlp").

  • TEXTA_MLP_DEFAULT_LANGUAGE - Language code of the language the MLP module defaults to when it cannot properly detect the language of a document (Default: en).

  • TEXTA_ALLOW_BERT_MODEL_DOWNLOADS - Boolean flag indicating if the users can download additional BERT models. (Default: True).

  • TEXTA_BERT_MODEL_DIRECTORY_PATH - Relative path to the directory in which pretrained and fine-tuned BERT models are stored under the "bert_tagger" folder. Setting this to ./home/texta creates ./home/texta/bert_tagger/pretrained/, which contains a subfolder for every downloaded BERT model (e.g. ./home/texta/bert_tagger/pretrained/bert-base-multilingual-cased), and ./home/texta/bert_tagger/fine_tuned/, which stores fine-tuned BERT models. (Default: "./data/models").

  • TEXTA_NLTK_DATA_DIRECTORY_PATH - Path of the directory where the NLTK library keeps its resources (Default: data/external/nltk).

  • TEXTA_BERT_MODELS - Comma-separated string of pretrained BERT models to download. (Default: "bert-base-multilingual-cased,bert-base-uncased,EMBEDDIA/finest-bert").

  • SKIP_BERT_RESOURCES - If set to "True", skips downloading pretrained BERT models (Default: false).

  • SKIP_MLP_RESOURCES - Whether to skip downloading MLP resources on application boot-up (Default: false).

  • SKIP_NLTK_RESOURCES - Whether to skip downloading NLTK library resources on application boot-up (Default: false).

  • TEXTA_EVALUATOR_MEMORY_BUFFER_GB - The minimum amount of memory (in GB) that should be left free while using the evaluator (Default: 50% of available memory).

  • TEXTA_DATASOURCE_CHOICES - Choices for the index domain field, given as a list of pairs, e.g. [["prefix_name", "display_name"]]. (Default: [["emails", "emails"], ["news articles", "news articles"], ["comments", "comments"], ["court decisions", "court decisions"], ["tweets", "tweets"], ["forum posts", "forum posts"], ["formal documents", "formal documents"], ["other", "other"]])

  • TOOLKIT_PROJECT_DATA_PATH - Path of the directory in which project specific data is kept (Default: data/projects).
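TEXTA_ENV_FILE above points at a typical .env file. As a rough sketch of what loading such a file into the process environment involves (the parser here is illustrative, not TEXTA's actual loader):

```python
import os


def load_env_file(path):
    """Illustrative .env loader: KEY=VALUE lines, '#' comments; existing vars win."""
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if not line or line.startswith("#") or "=" not in line:
                continue
            key, _, value = line.partition("=")
            # setdefault keeps values already present in the real environment
            os.environ.setdefault(key.strip(), value.strip().strip('"'))
```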

External services

  • TEXTA_ES_PREFIX - String used to limit Elasticsearch index access. Only indices matching "{TEXTA_ES_PREFIX}*" will be accessible.
  • TEXTA_ES_URL - URL of the Elasticsearch instance including the protocol, host and port (ex. http://localhost:9200).
  • TEXTA_REDIS_URL - URL of the Redis instance including the protocol, host and port (ex. redis://localhost:6379).

Django specifics

  • TEXTA_CORS_ORIGIN_WHITELIST - Comma-separated string of URLs (NO WHITESPACE) for the CORS whitelist. Needs to include the protocol (ex. http://* or http://*,http://localhost:4200).
  • TEXTA_ALLOWED_HOSTS - Comma-separated string (NO WHITESPACE) of the host/domain names that this Django site can serve (ex. * or *,http://localhost:4200).
  • TEXTA_DEBUG - True/False value indicating whether to run Django in debug mode (Default: true).
  • TEXTA_MAX_UPLOAD - Maximum size (in bytes) of files allowed to be uploaded, validated by Django (Default: 1073741824, i.e. 1 GB).
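Variables like TEXTA_ALLOWED_HOSTS and TEXTA_CORS_ORIGIN_WHITELIST are comma-separated strings; a small illustrative helper (not TEXTA's actual code) for turning them into the lists Django expects:

```python
import os


def env_list(name, default=""):
    """Split a comma-separated environment variable into a list of strings."""
    raw = os.getenv(name, default)
    return [item for item in raw.split(",") if item]


# Hypothetical usage inside a Django settings module:
# ALLOWED_HOSTS = env_list("TEXTA_ALLOWED_HOSTS", "*")
```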

Database credentials

  • DJANGO_DATABASE_ENGINE - https://docs.djangoproject.com/en/3.0/ref/settings/#engine
  • DJANGO_DATABASE_NAME - The name of the database to use. For SQLite, it’s the full path to the database file. When specifying the path, always use forward slashes, even on Windows.
  • DJANGO_DATABASE_USER - The username to use when connecting to the database. Not used with SQLite.
  • DJANGO_DATABASE_PASSWORD - The password to use when connecting to the database. Not used with SQLite.
  • DJANGO_DATABASE_HOST - Which host to use when connecting to the database. An empty string means localhost. Not used with SQLite.
  • DJANGO_DATABASE_PORT - The port to use when connecting to the database. An empty string means the default port. Not used with SQLite.
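These DJANGO_DATABASE_* variables map onto Django's standard DATABASES setting; a hypothetical settings.py fragment (the defaults shown are illustrative, not necessarily TEXTA's):

```python
import os

# Map the DJANGO_DATABASE_* environment variables into Django's DATABASES dict.
DATABASES = {
    "default": {
        "ENGINE": os.getenv("DJANGO_DATABASE_ENGINE", "django.db.backends.sqlite3"),
        "NAME": os.getenv("DJANGO_DATABASE_NAME", "data/db.sqlite3"),
        "USER": os.getenv("DJANGO_DATABASE_USER", ""),
        "PASSWORD": os.getenv("DJANGO_DATABASE_PASSWORD", ""),
        "HOST": os.getenv("DJANGO_DATABASE_HOST", ""),  # empty string means localhost
        "PORT": os.getenv("DJANGO_DATABASE_PORT", ""),  # empty string means default port
    }
}
```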

Docker specific configurations:

  • TEXTA_SHORT_TASK_WORKERS - Number of processes available for short term tasks (Default: 2).
  • TEXTA_LONG_TASK_WORKERS - Number of processes available for long term tasks (Default: 4).
  • TEXTA_MLP_TASK_WORKERS - Number of processes available for MLP based tasks (Default: 2).
  • TEXTA_SHORT_MAX_TASKS - Number of tasks per worker for short term tasks (Default: 10).
  • TEXTA_LONG_MAX_TASKS - Number of tasks per worker for long term tasks (Default: 10).
  • TEXTA_MLP_MAX_TASKS - Number of tasks per worker for MLP based tasks (Default: 10).
  • TEXTA_BEAT_LOG_LEVEL - Log level for Celery beat output within the Docker image (Default: WARNING).
  • TEXTA_CELERY_LOG_LEVEL - Log level for Celery worker output within the Docker image (Default: WARNING).

Extra Elasticsearch connection configurations

Unless you have a specially configured Elasticsearch instance, you can ignore these options.

  • TEXTA_ES_USER - Username to authenticate to a secured Elasticsearch instance.
  • TEXTA_ES_PASSWORD - Password to authenticate to a secured Elasticsearch instance.

https://elasticsearch-py.readthedocs.io/en/6.3.1/connection.html#elasticsearch.Urllib3HttpConnection:

  • TEXTA_ES_USE_SSL
  • TEXTA_ES_VERIFY_CERTS
  • TEXTA_ES_CA_CERT_PATH
  • TEXTA_ES_CLIENT_CERT_PATH
  • TEXTA_ES_CLIENT_KEY_PATH
  • TEXTA_ES_TIMEOUT
  • TEXTA_ES_SNIFF_ON_START
  • TEXTA_ES_SNIFF_ON_FAIL
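Since environment variables are always strings, boolean flags like TEXTA_ES_USE_SSL have to be coerced before being passed to the Elasticsearch client; a hedged sketch of that coercion (the helper name and defaults are illustrative):

```python
import os


def env_bool(name, default=False):
    """Interpret common truthy strings ('true', '1', 'yes') in an env var."""
    return os.getenv(name, str(default)).strip().lower() in {"true", "1", "yes"}


# Example: assembling keyword arguments for the Elasticsearch connection.
es_kwargs = {
    "use_ssl": env_bool("TEXTA_ES_USE_SSL"),
    "verify_certs": env_bool("TEXTA_ES_VERIFY_CERTS", True),
    "sniff_on_start": env_bool("TEXTA_ES_SNIFF_ON_START"),
}
```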

UAA specific configurations

  • TEXTA_USE_UAA - Whether to include UAA authentication with the default authentication (Default: false).

  • TEXTA_UAA_SCOPES - Which scopes should be sent with communication between TEXTA Toolkit and UAA (Default: openid texta.*).

  • TEXTA_UAA_SUPERUSER_SCOPE - Which scope to use for determining whether a UAA user is a superuser (Default: texta.admin).

  • TEXTA_UAA_PROJECT_ADMIN_SCOPE - Which scope to use to specify whether a UAA user has project administrator rights to ANY project available to them (Default: texta.project_admin).

  • TEXTA_UAA_SCOPE_PREFIX - Prefix for determining UAA user access to TEXTA Toolkit. Any user who does not have a scope which matches the pattern "{TEXTA_UAA_SCOPE_PREFIX}.*" will be denied entry to TEXTA Toolkit (Default: texta).

  • TEXTA_UAA_URL - URI for the UAA service (Default: http://localhost:8080).

  • TEXTA_UAA_REDIRECT_URI - URI to which the user will be redirected after a successful UAA login (Default: http://localhost:8000/api/v2/uaa/callback).

  • TEXTA_UAA_FRONT_REDIRECT_URL - Configuration for the front end to determine where Toolkit redirects the user after a successful UAA login (Default: http://localhost:4200/oauth/uaa)

  • TEXTA_UAA_CLIENT_ID - UAA client ID for authenticating the TEXTA Toolkit application with UAA. Must be kept secret.

  • TEXTA_UAA_CLIENT_SECRET - UAA client secret for authenticating the TEXTA Toolkit application with UAA. Must be kept secret.
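The TEXTA_UAA_SCOPE_PREFIX gate described above boils down to a prefix match over the user's scopes; a minimal illustrative sketch (function name is hypothetical, not TEXTA's actual code):

```python
def has_toolkit_access(scopes, prefix="texta"):
    """Illustrative check: does any UAA scope match the '{prefix}.*' pattern?"""
    return any(scope.startswith(prefix + ".") for scope in scopes)
```

A user whose scopes contain, say, "texta.admin" would be admitted, while one with only "openid" would be denied entry.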

texta's People

Contributors

asula, erikjyrmann, githubuser88442, gpaimla, helehh, jussuf, lindafr, mrkkollo, ranetp, rsirel


texta's Issues

update docs: if in docker-compose elastic is looping in start-error

If Elasticsearch dies after start with:

ERROR: [1] bootstrap checks failed
max virtual memory areas vm.max_map_count [65530] is too low, increase to at least [262144]

one should fix it with something like:

sysctl -w vm.max_map_count=262144


git clone https://github.com/texta-tk/texta.git
cd texta/docker/
docker-compose pull
docker-compose up


Linux texta-test-2 4.4.0-141-generic #167-Ubuntu SMP Wed Dec 5 10:40:15 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux
Docker version 18.09.1, build 4c52b90
docker-compose version 1.23.2, build 1110ad01


texta-elastic | [2019-01-16T13:34:13,668][INFO ][o.e.d.DiscoveryModule ] [TEXTA-1] using discovery type [zen]
texta-elastic | [2019-01-16T13:34:14,317][INFO ][o.e.n.Node ] [TEXTA-1] initialized
texta-elastic | [2019-01-16T13:34:14,318][INFO ][o.e.n.Node ] [TEXTA-1] starting ...
texta-elastic | [2019-01-16T13:34:14,512][INFO ][o.e.t.TransportService ] [TEXTA-1] publish_address {192.168.16.2:9300}, bound_addresses {0.0.0.0:9300}
texta-elastic | [2019-01-16T13:34:14,527][INFO ][o.e.b.BootstrapChecks ] [TEXTA-1] bound or publishing to a non-loopback address, enforcing bootstrap checks
texta-elastic | ERROR: [1] bootstrap checks failed
texta-elastic | [1]: max virtual memory areas vm.max_map_count [65530] is too low, increase to at least [262144]
texta-elastic | [2019-01-16T13:34:14,538][INFO ][o.e.n.Node ] [TEXTA-1] stopping ...
texta-elastic | [2019-01-16T13:34:14,612][INFO ][o.e.n.Node ] [TEXTA-1] stopped
texta-elastic | [2019-01-16T13:34:14,612][INFO ][o.e.n.Node ] [TEXTA-1] closing ...
texta-elastic | [2019-01-16T13:34:14,627][INFO ][o.e.n.Node ] [TEXTA-1] closed
texta-elastic exited with code 78

Error training language model

Training language model results in traceback:

  File "   /texta/task_manager/tasks/workers/language_model_worker.py", line 50, in run
    iter=int(num_passes)
  File "   /anaconda3/envs/texta-toolkit/lib/python3.5/site-packages/gensim/models/word2vec.py", line 748, in __init__
    fast_version=FAST_VERSION)
  File "   /anaconda3/envs/texta-toolkit/lib/python3.5/site-packages/gensim/models/base_any2vec.py", line 633, in __init__
    end_alpha=self.min_alpha, compute_loss=compute_loss)
  File "   /anaconda3/envs/texta-toolkit/lib/python3.5/site-packages/gensim/models/word2vec.py", line 856, in train
    queue_factor=queue_factor, report_delay=report_delay, compute_loss=compute_loss, callbacks=callbacks)
  File "   /anaconda3/envs/texta-toolkit/lib/python3.5/site-packages/gensim/models/base_any2vec.py", line 938, in train
    queue_factor=queue_factor, report_delay=report_delay, compute_loss=compute_loss, callbacks=callbacks)
  File "   /anaconda3/envs/texta-toolkit/lib/python3.5/site-packages/gensim/models/base_any2vec.py", line 421, in train
    total_words=total_words, **kwargs)
  File "   /anaconda3/envs/texta-toolkit/lib/python3.5/site-packages/gensim/models/base_any2vec.py", line 1044, in _check_training_sanity
    raise RuntimeError("you must first build vocabulary before training the model")

Debugging reveals it might be caused by data being discarded in EsIterator.
First
response = self.es_m.scroll()
is called; then another scroll is called, overwriting the results obtained previously:
response = self.es_m.scroll(scroll_id=scroll_id)
https://github.com/texta-tk/texta/blob/master/task_manager/tools/data_manager.py#L75-L83

To reproduce: add a dataset with fewer than ES_SCROLL_SIZE rows.
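A correct scroll loop processes each batch before requesting the next page, instead of overwriting the first response. A hedged sketch of the intended pattern (the helper is hypothetical; only the scroll() calls follow the snippet in the issue):

```python
def iterate_scroll(es_m, process_batch):
    """Consume every Elasticsearch scroll page instead of discarding the first one."""
    response = es_m.scroll()                      # first page
    scroll_id = response["_scroll_id"]
    hits = response["hits"]["hits"]
    while hits:
        process_batch(hits)                       # handle this page before scrolling on
        response = es_m.scroll(scroll_id=scroll_id)
        scroll_id = response["_scroll_id"]
        hits = response["hits"]["hits"]
```

With a dataset smaller than ES_SCROLL_SIZE, the first page is the only page, which is exactly the case the overwriting bug loses.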

dictor lib

Hello, FYI,

your project uses the dictor library. There have been updates in the latest dictor version (0.1.1) that remove the eval() function from dictor code for security, as well as other changes (see its readme).

The newest version also has better performance when parsing large JSON lookups.

[Documentation] Update presentation

Should update documentation for better representation of the project.

1 - Project Logo
2 - Requirements
3 - Logo's for companies/entities using TTK

Fact highlight

Searcher doesn't highlight facts if several are listed within one constraint.

Dataset Importer Error

Hi there and thanks for this great initiative!

Unfortunately, I can't import any data.

When trying to do so on http://localhost:8000/dataset_importer/ (there is no explicit link to this page in the interface, BTW), by:

  • choosing simple documents or archives,
  • selecting the appropriate file in Input data,
  • naming the dataset,
  • setting overwrite dataset or not,

the job is correctly submitted but processing does not happen.

The following error being logged:

[25/Jun/2018 15:26:04] "GET /static/base/img/bg.jpg HTTP/1.1" 200 34998
Exception ignored in: <module 'threading' from '/usr/local/lib/python3.5/threading.py'>
Traceback (most recent call last):
  File "/usr/local/lib/python3.5/threading.py", line 1351, in _after_fork
    thread._stop()
TypeError: 'Event' object is not callable
[25/Jun/2018 15:26:19] "POST /dataset_importer/import HTTP/1.1" 200 0
Process Process-6:
Traceback (most recent call last):
  File "/usr/local/lib/python3.5/multiprocessing/process.py", line 252, in _bootstrap
    self.run()
  File "/usr/local/lib/python3.5/multiprocessing/process.py", line 93, in run
    self._target(*self._args, **self._kwargs)
  File "/texta/dataset_importer/importer/importer.py", line 296, in _import_dataset
    parameter_dict['file_path'] = download(parameter_dict['url'], parameter_dict['directory'])
KeyError: 'url'
[25/Jun/2018 15:26:19] "GET /dataset_importer/reload_table HTTP/1.1" 200 4436

django 2.0.2 doesn't seem to be available for python2.7

Hi there,
texta looks like a great package. But when trying to install it in my Python 2.7 virtual env on Ubuntu, pip informs me that there is no Django 2.0.2 for Python 2.7, although this is explicitly required in the requirements.txt.

Would it make sense to try to run it with an earlier version? Or with Python 3.5? Or what am I missing?

Thanks for a hint and best regards,
Stefan
