A toolkit for parsing PDFs for retrieval-augmented generation (RAG) pipelines.
Clone the repo and cd to project root.
- Create and activate a virtual environment. Example (Windows):
python -m venv venv
.\venv\Scripts\activate
- This project uses Poetry for dependency management. Install it, then install the dependencies:
pip install poetry
poetry install
Context:
- The process has two steps: PDF -> MD -> CSV
- PDF and MD files are in the input/{language} folder
- The PDF -> MD conversion is done by an external repo
- This repo contains the code for the MD -> CSV conversion
Step 1: python src/markdown_parser.py
This converts all the MD files in the input/md/ folder to CSV files, which are saved in the output/ folder.
Step 2: python src/csv_preprocessing.py
This further processes the CSV files in the output/ folder and saves the results back to the same folder.
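The two steps above can be chained so that Step 2 only runs if Step 1 succeeds. The following is a minimal sketch, not code from this repo: `run_pipeline`, `STEPS`, and the `runner` parameter are hypothetical names, and it assumes the scripts are launched from the project root with the same interpreter that has the dependencies installed.

```python
import subprocess
import sys
from typing import Callable, List, Sequence

# Script paths taken from Step 1 and Step 2 above.
STEPS = ("src/markdown_parser.py", "src/csv_preprocessing.py")


def run_pipeline(
    steps: Sequence[str] = STEPS,
    runner: Callable[..., object] = subprocess.run,
) -> List[str]:
    """Run each script in order and return the scripts that completed.

    `runner` is injectable (hypothetical design choice) so the function
    can be exercised without actually launching the scripts.
    """
    completed = []
    for script in steps:
        # With subprocess.run, check=True raises CalledProcessError on a
        # non-zero exit code, so later steps do not run after a failure.
        runner([sys.executable, script], check=True)
        completed.append(script)
    return completed
```

Calling `run_pipeline()` from the project root would run both scripts in sequence. Launching the scripts as subprocesses avoids assuming anything about the functions they define internally.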
The idea is to create a single function that covers both of the above steps. Once that function is combined with the PDF -> MD conversion, we can create a Celery task for the API.
Setup steps would be:
docker run -d -p 6379:6379 --name my-redis redis
celery -A worker worker --loglevel=info
uvicorn main:app --reload