erykdarnowski / ts-test-extractor
Simple script for extracting questions, answers and so on from test PDFs (for a subject called TS I have at uni) to a more usable format.
License: MIT License
The currently implemented quick-and-dirty fix doesn't fully resolve the issue and in fact introduces a new one: when a word starts with a diacritic, it gets merged with the preceding word.
The proposed solution is to switch from PyPDF2 to fitz from PyMuPDF.
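A minimal sketch of what text extraction with fitz could look like. The function name `extract_pdf_text` is hypothetical and not taken from the script, and whether this actually resolves the diacritics issue has not been verified here.

```python
def extract_pdf_text(pdf_path):
    """Return the plain text of every page in the PDF, joined by newlines."""
    # PyMuPDF is a third-party package (pip install PyMuPDF). It is imported
    # lazily here so that the module still loads when it is not installed.
    import fitz

    # fitz.open() returns a Document that can be used as a context manager
    # and iterated page by page; Page.get_text() returns plain text.
    with fitz.open(pdf_path) as doc:
        return "\n".join(page.get_text() for page in doc)
```

Usage would be something like `text = extract_pdf_text("some_test.pdf")`, where the file name is only a placeholder.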
Another thing is the need for encoding="utf-8" when writing the out.txt file on Windows.
Add the encoding parameter when writing the JSON file, as was done before:
...('ts.json', 'w') as... -> ...('ts.json', 'w', encoding='utf-8')...
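For illustration, a sketch of writing the JSON file with an explicit encoding. The helper name `write_json` is hypothetical, and `ensure_ascii=False` is an assumption added here so that diacritics stay readable in the file instead of being escaped.

```python
import json


def write_json(data, path="ts.json"):
    """Write data as JSON using UTF-8, regardless of the platform default."""
    # Without encoding="utf-8", open() falls back to the locale encoding,
    # which on Windows is often not UTF-8.
    with open(path, "w", encoding="utf-8") as f:
        # ensure_ascii=False (assumption) keeps characters such as "ż" as-is.
        json.dump(data, f, ensure_ascii=False, indent=2)
```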
Add an error message to the part of the script that extracts plain text from PDF files, for when there are none.
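One way this could look, assuming "none" means that no PDF files were found in the working directory. The function name `find_pdfs` and the message wording are hypothetical.

```python
import glob
import os
import sys


def find_pdfs(directory="."):
    """Return the PDF paths in directory, or exit with an error if none exist."""
    pdf_paths = glob.glob(os.path.join(directory, "*.pdf"))
    if not pdf_paths:
        # sys.exit() with a string prints it to stderr and exits with status 1.
        sys.exit(f"Error: no PDF files found in '{directory}'.")
    return pdf_paths
```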
Actually, don't really like the all caps.
Make sure each expression gets 171 matches:
/(\s\n){0,2}(^Test\sz\s.*?)(\s\n){2,}/gms (replace)
/(^[0-9]{1,}\.\n?\s?)(?![0-9])(.*?)(?=\n?\s\n[A-Z]\.)/gms (G2)
/(^[A-Z]\..*?)(?=\s\n?Odp\.)/gms (G1)
/(?<=^Odp\.)(.*?)([A-Z])(?=\.?)/gm (G2)
/(^Odp\.(\:|\s)\s?[A-Z])(\.?\s+\n?)(.*?)(?=((^[0-9]{1,}\.\n?\s?)(?![0-9]))|(\s\n){3})/gms (G4)
Basically, compare the output JSON file of the script with the PDFs to make sure that the tasks, answers, correct answers and so on are correctly grouped / joined / aligned. In addition, could also look for formatting issues that were missed previously.
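The match-count check could be automated roughly as follows. This is only a sketch: it ports two of the expressions above to Python's `re` module (flag `m` becomes `re.MULTILINE`, `s` becomes `re.DOTALL`), and the names `PATTERNS`, `count_matches`, `check_counts` and the `SAMPLE` text are made up for illustration, not taken from the script or the real tests.

```python
import re

# Two of the expressions above, ported to Python (assumed equivalent flags).
PATTERNS = {
    # Block of answer options, up to (not including) the "Odp." line.
    "answers": re.compile(r"(^[A-Z]\..*?)(?=\s\n?Odp\.)", re.MULTILINE | re.DOTALL),
    # Correct answer letter following "Odp." (group 2).
    "correct": re.compile(r"(?<=^Odp\.)(.*?)([A-Z])(?=\.?)", re.MULTILINE),
}

# Invented sample text that only resembles the test layout.
SAMPLE = (
    "1. First question \n"
    "A. foo \n"
    "B. bar \n"
    "Odp. A.\n"
    "2. Second question \n"
    "A. baz \n"
    "B. qux \n"
    "Odp. B.\n"
)


def count_matches(text):
    """Return the number of matches of each pattern in text."""
    return {name: sum(1 for _ in p.finditer(text)) for name, p in PATTERNS.items()}


def check_counts(text, expected):
    """Return only the patterns whose match count differs from expected."""
    return {name: n for name, n in count_matches(text).items() if n != expected}
```

Against the real extracted text, `check_counts(text, 171)` returning an empty dict would mean that every ported expression found the expected number of matches.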
Add a PDF file with some sample syntax resembling the tests, to show how the script works.
venv instructions to README.md
/\s(?=[A-Z]\.)/gm (replace)
Fix this : -> :.
For the last time...
Just change Instructions to Usage and remove info about performing setup.
After messing around with the PDF files, I've noticed that when they get found by glob, they aren't necessarily returned in order.
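A possible way to make the order deterministic is to sort the glob results explicitly. The sketch below uses a natural sort so that, for example, `test2.pdf` comes before `test10.pdf`; the function names and the assumption that file names contain numbers are illustrative only.

```python
import glob
import os
import re


def natural_key(path):
    """Sort key that compares digit runs in the file name numerically."""
    name = os.path.basename(path)
    # re.split with a capturing group keeps the digit runs in the result,
    # alternating text and number parts: "test10.pdf" -> ["test", 10, ".pdf"].
    return [int(part) if part.isdigit() else part.lower()
            for part in re.split(r"(\d+)", name)]


def list_pdfs_in_order(directory="."):
    """Return the PDF paths in directory in natural file-name order."""
    return sorted(glob.glob(os.path.join(directory, "*.pdf")), key=natural_key)
```

If the file names contain no numbers, a plain `sorted(glob.glob(...))` would be enough.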