e3-jsi / elens-miner-system

The microservice architecture for processing, analysing and searching through environmental legal documents.

Home Page: https://jozefstefaninstitute.github.io/eLENS-miner-system/

License: BSD 2-Clause "Simplified" License

Languages: Python 58.28%, CSS 2.35%, JavaScript 0.21%, HTML 37.35%, Shell 0.36%, Batchfile 0.98%, PowerShell 0.47%
Topics: text-embeddings-microservice, document-retrieval-microservice, microservices, python

elens-miner-system's Introduction

eLENS Miner System

(Badges: License, Build Status, Python 3.6, Platform)

The eLENS miner system retrieves, processes and analyzes legal documents and maps them to specific geographical areas.

The system follows the microservice architecture and is written in Python 3. It consists of the following microservices:

  • Document Retrieval. The service responsible for providing documents based on the user's query. It leverages query expansion to improve the query results.

  • Document Similarity. This service calculates the semantic similarity of the documents and can provide a list of the most similar documents to a user-selected one. It integrates state-of-the-art word and document embedding methods to capture the semantic meaning of the documents and uses it to compare them.

  • Text Embeddings. This service is a collection of text embedding methods. For a given text it generates an embedding, which is then used by the previous microservices.

  • Entrypoint. This service is the interface that connects the previous microservices together and is the entry point through which users access them.

Prerequisites

You may want to create a separate virtual environment for each of the microservices, or you can create one for all of them. We advise using virtual environments if you are developing multiple Python projects, since dependencies can clash between projects (suppose one project only supports numpy < 1.0 while another needs numpy == 1.5).

To create a virtual environment, navigate to the desired directory (usually the main folder of the project) and run

python -m venv venv

To activate the virtual environment on Windows, navigate into venv/Scripts and execute activate; on Linux or macOS, run source venv/bin/activate. To deactivate the virtual environment, execute deactivate.
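For example, from the folder where the virtual environment was created:

# windows
venv\Scripts\activate

# linux or mac
source venv/bin/activate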

You can tell that the virtual environment is active when (venv) appears in front of the command prompt.

Each microservice must be run separately. Each service can be used on its own, or you can use the entrypoint microservice that connects all of the microservices together.

What follows is a short description of how to run each microservice. A more detailed description of each microservice can be found in its designated folder.

Text Embeddings Microservice

Currently, only one instance of the text embedding service can be run and connected to the main component, but support for connecting more is planned.

  • Activate the virtual environment if you wish to do so
  • Navigate into the text_embeddings folder
  • Execute
    pip install -r requirements.txt
  • Run
    python -m nltk.downloader all
  • Place a copy of your word2vec or fasttext word embeddings in the data/embeddings folder
  • Navigate back to the base of the text_embeddings folder and run the service with
    # linux or mac
    python -m text_embedding.main start \
           -e production \
           -H localhost \
           -p 4001 \
           -mp (path to the model) \
           -ml (language of the model)
    
    # windows
    python -m text_embedding.main start -e production -H localhost -p 4001 -mp (path to the model) -ml (language of the model)
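
Once the service is running, you can check that it responds by requesting an embedding (the /api/v1/embeddings/create path and its text and language parameters are the ones described in the Usage section below):

curl "http://localhost:4001/api/v1/embeddings/create?text=deforestation&language=en"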

Document Retrieval Microservice

  • Activate the virtual environment if you wish to do so
  • Navigate into the document_retrieval folder
  • Execute
    pip install -r requirements.txt
  • Navigate into the microservice/config folder
  • Create a .env file and define the following variables inside it:
    PROD_PG_DATABASE=
    PROD_PG_USERNAME=
    PROD_PG_PASSWORD=
    PROD_TEXT_EMBEDDING_HOST=
    PROD_TEXT_EMBEDDING_PORT=
    
    DEV_PG_DATABASE=
    DEV_PG_USERNAME=
    DEV_PG_PASSWORD=
    DEV_TEXT_EMBEDDING_HOST=
    DEV_TEXT_EMBEDDING_PORT=
  • Navigate back to the base of the document_retrieval folder and run the service with:
    # linux or mac
    python -m microservice.main start \
           -e production \
           -H localhost \
           -p 4100
    
    # windows
    python -m microservice.main start -e production -H localhost -p 4100

If you want, you can also run the service on a custom host and port.
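For example (the host and port below are hypothetical values):

# hypothetical host and port
python -m microservice.main start -e production -H 0.0.0.0 -p 5100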

Document Similarity Microservice

  • Activate the virtual environment if you wish to do so
  • Navigate into the document_similarity folder
  • Execute
    pip install -r requirements.txt
    
  • Navigate into the microservice/config folder
  • Create a .env file with the following variables:
    PROD_DATABASE_NAME =
    PROD_DATABASE_USER =
    PROD_DATABASE_PASSWORD =
    PROD_TEXT_EMBEDDING_URL =
    
    DEV_DATABASE_NAME =
    DEV_DATABASE_USER =
    DEV_DATABASE_PASSWORD =
    DEV_TEXT_EMBEDDING_URL =
    
  • Set the text embedding URL to http://{HOST}:{PORT}/api/v1/embeddings/create, where HOST and PORT are the values used to run the text embeddings microservice (an example is given after this list)
  • Navigate back into the base of the document_similarity folder and run the service with
    # linux or mac
    python -m microservice.main start \
           -e production \
           -H localhost \
           -p 4200
    
    # windows
    python -m microservice.main start -e production -H localhost -p 4200

You can also use a custom host and port.
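For example, if the text embeddings service from the earlier section runs on localhost:4001, the text embedding URL entries in the .env file would be:

PROD_TEXT_EMBEDDING_URL = http://localhost:4001/api/v1/embeddings/create
DEV_TEXT_EMBEDDING_URL = http://localhost:4001/api/v1/embeddings/create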

Entrypoint

  • Activate the virtual environment if you wish to do so
  • Navigate into the entrypoint folder
  • Run
    pip install -r requirements.txt
    
  • Navigate into the microservice/config folder
  • Create a .env file with the following contents:
    DEV_DATABASE_USER =
    DEV_DATABASE_HOST =
    DEV_DATABASE_PORT =
    DEV_DATABASE_PASSWORD =
    DEV_DATABASE_NAME =
    
    PROD_DATABASE_USER =
    PROD_DATABASE_HOST =
    PROD_DATABASE_PORT =
    PROD_DATABASE_PASSWORD =
    PROD_DATABASE_NAME =
    
  • Navigate back into the entrypoint folder
  • Run the main service with
    # linux or mac
    python -m microservice.main start \
           -e production \
           -H localhost \
           -p 4500
    
    # windows
    python -m microservice.main start -e production -H localhost -p 4500
    However, if you routed the other microservices to different hosts/ports, you can provide these values in the following way:
    # linux or mac
    python -m microservice.main start -H localhost -p 4500 \
      -teh {host of the text embedding microservice} \
      -tep {port of the text embedding microservice} \
      -drh {host of the document retrieval microservice} \
      -drp {port of the document retrieval microservice} \
      -dsh {host of the document similarity microservice} \
      -dsp {port of the document similarity microservice}
    
    # windows
    python -m microservice.main start -H localhost -p 4500 -teh {host of the text embedding microservice} -tep {port of the text embedding microservice} -drh {host of the document retrieval microservice} -drp {port of the document retrieval microservice} -dsh {host of the document similarity microservice} -dsp {port of the document similarity microservice}
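
For example, if the microservices were started with the hosts and ports used earlier in this README (text embeddings on localhost:4001, document retrieval on localhost:4100 and document similarity on localhost:4200), the command would be:

python -m microservice.main start -H localhost -p 4500 \
  -teh localhost -tep 4001 \
  -drh localhost -drp 4100 \
  -dsh localhost -dsp 4200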

Usage

Available endpoints:

  • GET {HOST}:{PORT}/api/v1/documents/search (query params: query, m)

    • query -> your text query
    • m -> number of results

    Example request:

    {BASE_URL}/api/v1/documents/search?query=deforestation&m=10 returns the top 10 documents most relevant to the query "deforestation".

  • GET {HOST}:{PORT}/api/v1/documents/<document_id>/similar (query params: get_k)

    • document_id -> id of the document
    • get_k -> number of results

    Example request:

    {BASE_URL}/api/v1/documents/123/similar?get_k=5 returns the 5 documents most similar to the document with id 123.

  • POST {HOST}:{PORT}/api/v1/documents/<document_id>/similarity_update

    • document_id -> id of the document

    Example request:

    {BASE_URL}/api/v1/documents/123/similarity_update recalculates the similarities of the document with id 123 to the other documents.

  • GET {HOST}:{PORT}/api/v1/embeddings/create (query params: text, language)

    • text -> your text
    • language -> language of the text

    Example request:

    {BASE_URL}/api/v1/embeddings/create?text=ice cream&language=en returns the embedding of the text "ice cream" from the English word embedding model.

  • GET {HOST}:{PORT}/api/v1/documents (query params: document_ids)

    • document_ids -> comma-separated document ids

    Example request:

    {BASE_URL}/api/v1/documents?document_ids=1,3,17 returns the document data for documents with ids 1, 3 and 17.

  • GET {HOST}:{PORT}/api/v1/documents/<document_id>

    • document_id -> id of the document

    Example request:

    {BASE_URL}/api/v1/documents/3 returns the document data for the document with id 3.
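
As a quick sketch, assuming the entrypoint microservice started above is used as BASE_URL (http://localhost:4500), the endpoints can be called with curl, for example:

# search for documents about deforestation (top 10 results)
curl "http://localhost:4500/api/v1/documents/search?query=deforestation&m=10"

# get the 5 documents most similar to document 123
curl "http://localhost:4500/api/v1/documents/123/similar?get_k=5"

# fetch the document data for documents 1, 3 and 17
curl "http://localhost:4500/api/v1/documents?document_ids=1,3,17"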

Acknowledgments

This work is developed by AILab at Jozef Stefan Institute.

The work is supported by the EnviroLENS project, a project that demonstrates and promotes the use of Earth observation as direct evidence for environmental law enforcement, including in a court of law and in related contractual negotiations.

elens-miner-system's People

Contributors

eriknovak, kraljsamo, zivaurbancic, dependabot[bot], sarabrezec

Watchers

Klemen Kenda, James Cloos, Filip Koprivec, M. Besher Massri

elens-miner-system's Issues

[FEATURE] Unit tests for microservices

Is your feature request related to a problem? Please describe.
The code that we write and submit to the repository should be tested. One thing is to test it manually; the other is to have automatic testing of the code. Automatic testing is great for finding bugs and for checking whether new changes break parts of the old code, making writing code less stressful.

Describe the solution you'd like
I suggest each code owner (or someone nice) should write tests for the methods. I suggest we use the pytest library (https://docs.pytest.org/en/latest/), which allows writing tests for Python code.

Afterwards, I will set up a Travis CI build, which will automatically run the unit tests with each PR.

[FEATURE] Add filtering options to document comparison

Is your feature request related to a problem? Please describe.
Providing documents based on their similarity is nice, but it would be beneficial to provide some filtering options.

Describe the solution you'd like
The option to filter the results of the user query based on:

  • environmental tags
  • NUTS tags
  • named entities

These filtering parameters should be optional and should be added to the query (e.g.
/route?text=some text&nuts=dw1,dw3&env_tags=deforestation)

[FEATURE] Provide a nice frontend for the system

Is your feature request related to a problem? Please describe.
A nice frontend would be beneficial for showing how the system works and its capabilities.

Describe the solution you'd like
The frontend would describe:

  • The overview of the system (what it does, the dataset used)
  • The API used for accessing the data and the functionalities
  • (optional) A use-case of the capabilities

It would be developed with a simple framework for website development (jQuery, EJS, etc.). If required, we would use a stronger framework (e.g. React).

[FEATURE] Add filtering options to document retrieval

Is your feature request related to a problem? Please describe.
Providing documents based on the user query is nice, but it would be beneficial to provide some filtering options.

Describe the solution you'd like
The option to filter the results of the user query based on:

  • environmental tags
  • NUTS tags
  • named entities

These filtering parameters should be optional and should be added to the query (e.g.
/route?text=some text&nuts=dw1,dw3&env_tags=deforestation)

[FEATURE] Cleanup and add the data crawling scripts into a separate folder

Additional context
To enable crawling the legal documents, it would be nice to have the crawling scripts present in this repository.

Clean up the scripts so that they can be used by any user (be careful with the path names), store them in a separate folder called crawlers, and add documentation on how to use them to collect the data.

[BUG] Remove static postgres credentials from the code

Describe the bug
Credentials for connecting to PostgreSQL are hardcoded in the entrypoint route file database.py. This is a security issue.

To Reproduce

  1. Go to file entrypoint/microservice/routes/database.py.
  2. Check lines 72-74

Expected behavior
The credentials should be stored inside the entrypoint/microservice/config/.env file and then loaded in the entrypoint/microservice/config/config.py file. In addition, the methodology for accessing the database should be registered as a service (similar to the logging functionality in entrypoint/microservice/config/config_logging.py).

See the following example: https://flask.palletsprojects.com/en/1.1.x/tutorial/database/

[DOC] Incorrect documentation at ENTRYPOINT index page

Description of the missing/incorrect documentation
When starting the microservice from ENTRYPOINT and opening http://localhost:4500/, the index page should include information on how to use all the microservices combined (document_similarity, document_retrieval, text_embedding). Instead, the provided HTML page contains only the documentation for the text embeddings microservice.

Screenshot: captured 2020-01-10, showing the ENTRYPOINT index page.

[DOC] SETUP without the database

If users do not have an active database like ours, they cannot use any of these services, since they either do not have a database or their database does not have the right structure.

Should we add a database description or the database structure to the instructions, or do we not care?

[FEATURE] Implement elastic search

Is your feature request related to a problem? Please describe.
Elasticsearch is a nice service that allows querying relevant documents/records in an efficient way. We would use it for document retrieval.

Describe the solution you'd like

  • Set up the Elasticsearch service on the production machine
  • Populate Elasticsearch with the documents currently in the database
  • Connect Elasticsearch with the document retrieval service
  • Provide scripts and documentation on how to set up and use Elasticsearch (for reproducibility)
