Playground for using luigi, PyPDF2 and nltk.
Reads PDF files from a given folder and splits them according to the cosine similarity of the extracted texts. The workload is balanced and monitored by luigi.
Each file will be processed only once, as luigi detects which work has already been done.
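To illustrate the idea of splitting by cosine similarity, here is a minimal sketch that uses only the standard library. It is not the code from this repository: the function names (`tokenize`, `cosine_similarity`, `split_pages`), the plain term-frequency vectors, and the `threshold` value of 0.2 are all assumptions chosen for illustration. It takes already-extracted page texts as input and starts a new group whenever a page is not similar enough to the one before it.

```python
import math
import re
from collections import Counter


def tokenize(text, stop_words=frozenset()):
    """Lowercase the text, keep alphabetic tokens, and drop stop words."""
    return [t for t in re.findall(r"[a-z]+", text.lower()) if t not in stop_words]


def cosine_similarity(a, b):
    """Cosine similarity of two term-frequency Counters (0.0 if either is empty)."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def split_pages(page_texts, stop_words=frozenset(), threshold=0.2):
    """Group consecutive page indices.

    A new group starts when the similarity between a page and the
    previous page falls below `threshold` (a hypothetical default).
    """
    groups = []
    previous = None
    for index, text in enumerate(page_texts):
        vector = Counter(tokenize(text, stop_words))
        if previous is None or cosine_similarity(previous, vector) < threshold:
            groups.append([index])
        else:
            groups[-1].append(index)
        previous = vector
    return groups


if __name__ == "__main__":
    pages = [
        "invoice total amount due payment",
        "invoice payment amount due date",
        "recipe flour sugar butter oven",
    ]
    print(split_pages(pages))
```

In this sketch the stop words would come from a file such as `stopwords.txt` (one word per line), passed in as a set. Comparing only neighbouring pages keeps the example short; other strategies, such as comparing against the running average of the current group, are equally plausible.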
git clone https://github.com/MtnFranke/semantic-pdf-splitter
pip install -r requirements.txt
touch stopwords.txt  # TODO: add the desired stop words to this file, one per line
luigid &
PYTHONPATH='.' luigi --workers 4 --module semantic_pdf_splitter GetFiles --fin /home/user/Documents/PDF/ --fout ./target/
- Fork it!
- Create your feature branch:
git checkout -b my-new-feature
- Commit your changes:
git commit -am 'Add some feature'
- Push to the branch:
git push origin my-new-feature
- Submit a pull request.
- Martin Franke (MtnFranke)
MIT License