This project forked from lfoppiano/document-qa

Scientific Document Insight Q/A

Home Page: https://lfoppiano-document-qa.hf.space/

License: Apache License 2.0

Python 99.21% Dockerfile 0.79%

title: Scientific Document Insights Q/A
emoji: 📝
colorFrom: yellow
colorTo: pink
sdk: streamlit
sdk_version: 1.27.2
app_file: streamlit_app.py
pinned: false
license: apache-2.0

DocumentIQA: Scientific Document Insights Q/A

Work in progress 👷

https://lfoppiano-document-qa.hf.space/

Introduction

Question answering on scientific documents using LLMs: ChatGPT-3.5-turbo, GPT4, GPT4-Turbo, Mistral-7b-instruct and Zephyr-7b-beta. The Streamlit application demonstrates an implementation of RAG (Retrieval Augmented Generation) on scientific documents that we are developing at NIMS (National Institute for Materials Science) in Tsukuba, Japan. Unlike most similar projects, we focus on scientific articles and extract text from a structured document. We target only the full text, using Grobid, which provides cleaner results than a raw PDF2Text converter (the approach used by most other solutions).

Additionally, this frontend visualises named entities in the LLM responses, highlighting physical quantities and measurements (with grobid-quantities) and material mentions (with grobid-superconductors).

The conversation is kept in a buffered sliding-window memory (the 4 most recent messages), and these messages are injected into the context as "previous messages".
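The sliding-window behaviour described above can be sketched as follows. This is a minimal illustration, not the project's implementation: the `SlidingWindowMemory` class, its method names, and the exact "Previous messages" layout are hypothetical.

```python
from collections import deque


class SlidingWindowMemory:
    """Keep only the most recent `window` messages (hypothetical helper)."""

    def __init__(self, window: int = 4):
        # A deque with maxlen silently drops the oldest item when full.
        self._messages = deque(maxlen=window)

    def add(self, role: str, text: str) -> None:
        self._messages.append((role, text))

    def as_context(self) -> str:
        # Assumed layout: one "role: text" line per remembered message.
        lines = [f"{role}: {text}" for role, text in self._messages]
        return "Previous messages:\n" + "\n".join(lines)


memory = SlidingWindowMemory(window=4)
for i in range(1, 7):
    memory.add("user", f"question {i}")

# Only questions 3 to 6 remain; questions 1 and 2 were evicted.
print(memory.as_context())
```

A bounded `deque` keeps the eviction logic implicit, so the memory never grows beyond the window regardless of how long the conversation runs.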

(The image on the right was generated with https://huggingface.co/spaces/stabilityai/stable-diffusion)

Getting started

  • Select the model+embedding combination you want to use
  • If using gpt3.5-turbo, gpt4 or gpt4-turbo, enter your OpenAI API key.
  • Upload a scientific article as a PDF document. You will see a spinner or loading indicator while the processing is in progress.
  • Once the spinner disappears, you can proceed to ask your questions

Documentation

Context size

Sets the number of blocks from the original document that are used as context when answering. Each block is 250 tokens by default (this can be changed before uploading the first document). With the default settings, each question uses around 1000 tokens.

NOTE: if the chat answers something like "the information is not provided in the given context", changing the context size will likely help.
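As rough arithmetic, the context sent with each question is about the number of blocks times the block size. The sketch below assumes this simple product and ignores the tokens used by the question and the prompt itself; `estimate_context_tokens` is a hypothetical helper. The figure of around 1000 tokens at 250 tokens per block would correspond to 4 blocks, which is an inference rather than a documented default.

```python
def estimate_context_tokens(num_blocks: int, block_size: int = 250) -> int:
    """Approximate tokens of document text injected per question."""
    return num_blocks * block_size


# Compare a few context sizes at the default block size of 250 tokens.
for num_blocks in (2, 4, 8):
    print(num_blocks, "blocks ->", estimate_context_tokens(num_blocks), "tokens")
```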

Chunks size

When uploaded, each document is split into blocks of a fixed size (250 tokens by default). This setting changes the size of those blocks. Smaller blocks produce a smaller context made of more precise sections of the document. Larger blocks produce a larger context that is less tightly focused on the question.
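The splitting step can be illustrated with a minimal sketch. It assumes whitespace-separated words as a stand-in for model tokens and fixed, non-overlapping blocks; the application's actual tokenizer and splitting strategy may differ, and `split_into_blocks` is a hypothetical name.

```python
from typing import List


def split_into_blocks(text: str, block_size: int = 250) -> List[str]:
    """Split text into consecutive blocks of at most `block_size` tokens."""
    # Assumption: whitespace tokens approximate the model's tokens.
    tokens = text.split()
    return [
        " ".join(tokens[start:start + block_size])
        for start in range(0, len(tokens), block_size)
    ]


# A synthetic 600-token document yields two full blocks and one partial block.
document = " ".join(f"w{i}" for i in range(600))
blocks = split_into_blocks(document)
print([len(block.split()) for block in blocks])
```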

Query mode

Selects whether the question is sent to the LLM (Large Language Model) or to the vector storage.

  • LLM (default) enables question/answering related to the document content.
  • Embeddings: the response consists of the raw text from the document that is related to the question (based on the embeddings). This mode helps to investigate why answers are sometimes unsatisfactory or incomplete.
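The difference between the two modes can be sketched with a toy retriever. Everything here is an assumption for illustration: the bag-of-words `embed` function stands in for a real embedding model, the `retrieve` and `answer` helpers are hypothetical, and the LLM call is omitted, so the "LLM" branch only returns the prompt that would be sent.

```python
import math
from collections import Counter
from typing import List


def embed(text: str) -> Counter:
    # Toy "embedding": word counts (a real system uses a neural model).
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[term] * b[term] for term in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def retrieve(question: str, blocks: List[str], k: int = 2) -> List[str]:
    # Rank blocks by similarity to the question and keep the top k.
    query = embed(question)
    ranked = sorted(blocks, key=lambda b: cosine(query, embed(b)), reverse=True)
    return ranked[:k]


def answer(question: str, blocks: List[str], mode: str = "LLM", k: int = 2) -> str:
    passages = retrieve(question, blocks, k)
    if mode == "Embeddings":
        # Embeddings mode: return the retrieved raw text unchanged.
        return "\n".join(passages)
    # LLM mode: the passages become the context of a prompt (model call omitted).
    return "Context:\n" + "\n".join(passages) + f"\n\nQuestion: {question}"


blocks = [
    "the critical temperature is 39 K",
    "samples were annealed in argon",
    "the authors thank the staff",
]
question = "what is the critical temperature"
print(answer(question, blocks, mode="Embeddings", k=1))
```

Inspecting the Embeddings output shows which passages the LLM would have received, which is why this mode is useful for diagnosing weak answers.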

NER (Named Entities Recognition)

This feature is designed for people working with scientific documents in materials science. It runs NER on the response from the LLM to identify mentions of materials and properties (quantities, measurements). It relies on two external services, grobid-quantities and grobid-superconductors.

Troubleshooting

Error: streamlit: Your system has an unsupported version of sqlite3. Chroma requires sqlite3 >= 3.35.0. A solution is available for Linux. For more information, see the details on the Chroma website.
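To check whether the local Python environment meets this requirement, the SQLite library version bundled with the standard `sqlite3` module can be compared against 3.35.0. This is a small diagnostic sketch; the `sqlite_is_supported` helper is a hypothetical name, and it only reports the version without fixing it.

```python
import sqlite3

# Minimum SQLite library version quoted in the error message.
REQUIRED = (3, 35, 0)


def sqlite_is_supported(version_info=sqlite3.sqlite_version_info, required=REQUIRED):
    """Return True if the SQLite library version is at least `required`."""
    return tuple(version_info) >= required


print("sqlite3 library version:", sqlite3.sqlite_version)
print("meets the Chroma requirement:", sqlite_is_supported())
```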

Disclaimer on Data, Security, and Privacy ⚠️

Please read carefully:

  • Avoid uploading sensitive data. We temporarily store text from the uploaded PDF documents only for processing your request, and we disclaim any responsibility for subsequent use or handling of the submitted data by third-party LLMs.
  • Mistral and Zephyr are FREE to use and do not require any API key, but because we rely on the free API endpoint, there is no guarantee that all requests will go through. Use at your own risk.
  • We do not assume responsibility for how the data is used by the LLM API endpoints.

Development notes

To release a new version:

  • bump-my-version bump patch
  • git push --tags

To use docker:

  • docker run lfoppiano/document-insights-qa:{latest_version}

  • docker run lfoppiano/document-insights-qa:latest-develop for the latest development version

To install the library from PyPI:

  • pip install document-qa-engine

Acknowledgement

This project is developed at the National Institute for Materials Science (NIMS) in Japan in collaboration with Guillaume Lambard and the Lambard-ML-Team. Contributed by Pedro Ortiz Suarez, Tomoya Mato. Thanks also to Patrice Lopez, the author of Grobid.
