erykdarnowski / ts-test-extractor

Simple script for extracting questions, answers and so on from test PDFs (for a subject called TS I have at uni) to a more usable format.

License: MIT License

Python 100.00%

pdf pdf-conversion pdf-converter pdf-extractor pdf-json pdf-txt

ts-test-extractor's Issues

Check if everything works nicely on Windows

Implement proper fix for Polish diacritics in text extraction

Description

The currently implemented quick & dirty fix doesn't fully resolve the issue and in fact introduces a new one: when a word starts with a diacritic, it gets merged with the preceding word.

The proposed solution is to switch from PyPDF2 to fitz from PyMuPDF.
In addition, encoding="utf-8" is needed when writing the out.txt file on Windows.
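
A minimal sketch of what the switch could look like, assuming PyMuPDF is installed (`pip install pymupdf`). The function names `extract_text` and `write_text` are illustrative and not taken from the repo; only `fitz.open` and `page.get_text()` come from PyMuPDF itself.

```python
from pathlib import Path


def extract_text(pdf_path):
    """Return the text of every page in the PDF, joined into one string."""
    # PyMuPDF is imported lazily so write_text() stays usable without it.
    import fitz

    with fitz.open(pdf_path) as doc:
        return "".join(page.get_text() for page in doc)


def write_text(text, out_path="out.txt"):
    """Write extracted text as UTF-8 regardless of the platform default."""
    # Without an explicit encoding, Windows uses a locale-dependent code page
    # that may not be able to represent Polish diacritics.
    Path(out_path).write_text(text, encoding="utf-8")
```

Calling `write_text(extract_text("1.pdf"))` would then produce `out.txt` for a single input file (the file name `1.pdf` is only an example).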

Fix JSON file encoding on Windows

Description

Add the encoding parameter when writing the JSON file, as was done before:
...('ts.json', 'w') as... -> ...('ts.json', 'w', encoding='utf-8')...
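
As a rough illustration of the change, the write could look like the sketch below. The helper name `write_json` and the use of `ensure_ascii=False` and `indent=2` are assumptions, not something the issue states; `ensure_ascii=False` is what makes the explicit encoding matter, since it keeps diacritics as real characters instead of `\uXXXX` escapes.

```python
import json


def write_json(records, path="ts.json"):
    """Dump records to a JSON file that is UTF-8 on every platform."""
    with open(path, "w", encoding="utf-8") as f:
        # ensure_ascii=False (an assumption) writes "ż" rather than "\u017c".
        json.dump(records, f, ensure_ascii=False, indent=2)
```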

Fix no err msg for missing PDF files

Description

Add an error message to the part of the script that extracts plain text from PDF files, shown when no PDF files are found.
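
One possible shape for this check is sketched below. The function name `find_pdfs`, the message wording and the exit code are all hypothetical; the issue only asks for an error message.

```python
import glob
import os
import sys


def find_pdfs(directory="."):
    """Return the PDF paths in a directory, or exit with an error if none exist."""
    pdf_paths = sorted(glob.glob(os.path.join(directory, "*.pdf")))
    if not pdf_paths:
        # Fail loudly instead of silently producing empty output.
        print(f"Error: no PDF files found in '{directory}'.", file=sys.stderr)
        sys.exit(1)
    return pdf_paths
```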

Change the formatting of the `README.md` a little

Description

Actually, I don't really like the all caps.

Update `README.md`

TODO

Info about input files + included example + that they should be numbered
Run instructions
That the regex patterns and formatting fixes are really specific, so this repo won't be very useful to others, because it was made with a specific use case in mind (it will most likely only be useful as inspiration).

Implement data extraction with regex patterns

TODO

*Make sure each expression gets 171 matches

Add clean up
- Titles: /(\s\n){0,2}(^Test\sz\s.*?)(\s\n){2,}/gms (replace)
Add extraction
- Tasks: /(^[0-9]{1,}\.\n?\s?)(?![0-9])(.*?)(?=\n?\s\n[A-Z]\.)/gms (G2)
- Answers (A, B, C): /(^[A-Z]\..*?)(?=\s\n?Odp\.)/gms (G1)
- Correct answers: /(?<=^Odp\.)(.*?)([A-Z])(?=\.?)/gm (G2)
- Answer expls.: /(^Odp\.(\:|\s)\s?[A-Z])(\.?\s+\n?)(.*?)(?=((^[0-9]{1,}\.\n?\s?)(?![0-9]))|(\s\n){3})/gms (G4)
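
The extraction patterns above are written in `/pattern/flags` notation. A sketch of how they might translate to Python's `re` module follows: `m` and `s` map to `re.MULTILINE` and `re.DOTALL`, and `g` maps to `finditer`. The `extract` function, the dictionary keys, the `.strip()` on explanations and the sample text layout (lines ending in a space before the newline) are assumptions made for illustration.

```python
import re

MS = re.MULTILINE | re.DOTALL  # the /ms flags; /g corresponds to finditer()

TASKS = re.compile(r"(^[0-9]{1,}\.\n?\s?)(?![0-9])(.*?)(?=\n?\s\n[A-Z]\.)", MS)
ANSWERS = re.compile(r"(^[A-Z]\..*?)(?=\s\n?Odp\.)", MS)
CORRECT = re.compile(r"(?<=^Odp\.)(.*?)([A-Z])(?=\.?)", re.MULTILINE)
EXPLANATIONS = re.compile(
    r"(^Odp\.(\:|\s)\s?[A-Z])(\.?\s+\n?)(.*?)"
    r"(?=((^[0-9]{1,}\.\n?\s?)(?![0-9]))|(\s\n){3})",
    MS,
)


def extract(text):
    """Apply each pattern and keep the capture group named in the issue."""
    return {
        "tasks": [m.group(2) for m in TASKS.finditer(text)],
        "answers": [m.group(1) for m in ANSWERS.finditer(text)],
        "correct": [m.group(2) for m in CORRECT.finditer(text)],
        # strip() is an addition: group 4 can carry trailing whitespace.
        "explanations": [m.group(4).strip() for m in EXPLANATIONS.finditer(text)],
    }


# Hypothetical input imitating the extracted test text.
SAMPLE = (
    "1. What is 2+2? \n"
    "A. 3 \n"
    "B. 4 \n"
    "Odp. B. Because 2+2=4. \n"
    "2. Capital of Poland? \n"
    "A. Warsaw \n"
    "B. Krakow \n"
    "Odp. A. It is the capital. \n"
    " \n"
    " \n"
)

result = extract(SAMPLE)
```

Comparing `len()` of each list in `result` is then a quick way to confirm that every expression produced the same number of matches.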

Resources

Check that tasks, answers etc. are correctly joined / aligned

Description

Compare the script's output JSON file with the PDFs to make sure that the tasks, answers, correct answers and so on are correctly grouped / joined / aligned. While doing so, also look for any formatting issues that were missed previously.

Add sample PDF file, to show how the script works

Description

Add a PDF file with some sample syntax resembling the tests, to show how the script works.

Implement PDF text extraction

TODO

Add venv instructions to README.md
Implement (for all files at once)
Fix issue with Polish diacritic characters

Resources

Implement output

TODO

Add check for misaligned values (based on lengths)
Add formatting fixes
- Answers (A, B, C): /\s(?=[A-Z]\.)/gm (replace)
Implement writing to JSON file
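
The length-based misalignment check could be sketched roughly as follows. The function name `build_records` and the record keys are hypothetical; the idea is simply to refuse to join the lists when their lengths differ.

```python
def build_records(tasks, answers, correct, explanations):
    """Join the four extracted lists into one record per task."""
    lengths = {
        "tasks": len(tasks),
        "answers": len(answers),
        "correct": len(correct),
        "explanations": len(explanations),
    }
    # Differing lengths mean at least one pattern missed or over-matched.
    if len(set(lengths.values())) != 1:
        raise ValueError(f"Misaligned extraction results: {lengths}")
    return [
        {"task": t, "answers": a, "correct": c, "explanation": e}
        for t, a, c, e in zip(tasks, answers, correct, explanations)
    ]
```

Equal lengths do not prove that the values are paired correctly, so this only catches the obvious failures.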

Resources

Fix colon formatting in tasks

Description

Fix this: ` :` -> `:` (remove the space before the colon).
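
Assuming the fix means dropping stray spaces before a colon, a substitution along these lines would do it. The function name `fix_colons` is illustrative, and only spaces (not newlines) are removed.

```python
import re


def fix_colons(text):
    """Remove any spaces that directly precede a colon."""
    return re.sub(r" +:", ":", text)
```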

Change the `Instructions` part of the `README.md`

Description

For the last time...
Just change Instructions to Usage and remove info about performing setup.

Fix lack of PDF filenames sorting

Description

After messing around with the PDF files, I noticed that the paths found by glob are not guaranteed to be in order.
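
Since the input files are meant to be numbered, one way to get a stable order is to sort the glob results with a natural (numeric-aware) key, so that `10.pdf` comes after `2.pdf`. The helper names below are assumptions; a plain `sorted()` would also fix the ordering if the numbers are zero-padded.

```python
import glob
import os
import re


def natural_key(path):
    """Split a file name into text and integer chunks for numeric-aware sorting."""
    name = os.path.basename(path)
    return [int(part) if part.isdigit() else part.lower()
            for part in re.split(r"(\d+)", name)]


def sorted_pdfs(directory="."):
    """Return the PDF paths in a directory in natural order."""
    return sorted(glob.glob(os.path.join(directory, "*.pdf")), key=natural_key)
```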
