Python homework exercise to pull out words of interest from text files via NLP.

NOTE: See the Releases section for the rendered output of each tagged release (generated by the release Github Action).
From the documents provided:

- Produce a list of the most frequent interesting words.
- Produce a summary table showing where those words appear (sentences and documents), eg.

| Word (Total Occurrences) | Documents | Sentences containing the word |
| --- | --- | --- |
| Philosophy (42) | x,y,z | I don't have time for philosophy<br>Surely this was a touch of fine philosophy; though no doubt he had never heard there was such a thing as that.<br>Still, her pay-as-you-go philosophy implies it.... |
| ... | ... | ... |
- To show off the type of developer I am:
  - Solution design & planning.
  - Listing assumptions & the reasoning for my choices.
  - Documentation (docs, code comments, commit messages).
  - Workflow (CI, tests, concise commits).
- To learn new packages:
  - NLTK - I've only worked on rule-based NLP products built in C++ (Boost Spirit). I recently found out that NLTK is a great Python alternative.
  - Poetry - I've heard great things about Poetry from @mikeymo for years, but have not had a chance to experiment with it (and potentially migrate current Production code away from PEP-517-style packaging (`setup.cfg`/`pyproject.toml`) to Poetry).
    - I had also found that PDM seems to be a good alternative, but it feels like there is an upfront education cost before it would be smooth sailing to use in Production.
    - I previously used Pipenv, but my opinion has soured after a mix of issues: PyPA distancing themselves from Pipenv usage in Production in favour of PEP-517 config, non-OS-agnostic lock files, and dependency graphs breaking easily.
PlantUml design to solve the above problem (See: PlantUml Design (original) & initial design sketch in docs/ for my pre-NLTK investigation):
The current implementation uses Stemming, which is documented as being typically faster than Lemmatization at the cost of accuracy/quality.
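For a feel of that trade-off, here is a minimal comparison using stock NLTK classes (the `PorterStemmer`/`WordNetLemmatizer` choice is an assumption for illustration; the repo's `Normaliser` may use different classes):

```python
# Minimal stemming-vs-lemmatization comparison using stock NLTK classes.
# Illustrative only: the repo's Normaliser may use a different stemmer.
import nltk
from nltk.stem import PorterStemmer, WordNetLemmatizer

nltk.download("wordnet", quiet=True)  # the lemmatizer needs the WordNet corpus

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

for word in ["philosophy", "studies", "running"]:
    # Stemming chops suffixes by rule; the result may not be a real word.
    # Lemmatization looks up the dictionary form; slower, but cleaner output.
    print(f"{word:12} stem={stemmer.stem(word):12} lemma={lemmatizer.lemmatize(word)}")
```

eg. "studies" stems to the non-word "studi" but lemmatizes to "study", which is the accuracy/quality difference mentioned above.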
The current design uses "2 passes" to go from filtered tokens to counts + file/sentence mappings (a sketch of these passes follows the timings below):

- Pass 0: tokenize/filter/stem words from content (`Normaliser`).
- Pass 1: Use `collections.Counter` to get a set of words + counts.
  - Currently counts are done per-sentence, then amalgamated together per-file, and then across all files.
- Pass 2: Loop through `Content` instances to discover the file/sentence mappings for each word.
`python_homework_nlp/test_main.py::TestMain::test_workflow_with_real_docs` takes:

- ~0.7s to run `main.workflow` (`Normaliser` + `Counter`) on my Linux PC with an i7-8700 CPU.
- ~0.96s in CI (eg. Github Action: Python Application #33).
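To make the two counting passes concrete, here is a condensed sketch; the data layout and names are simplified stand-ins rather than the repo's actual `Normaliser`/`Content` API:

```python
# Condensed sketch of the "2-pass" flow above. The data layout and names are
# simplified stand-ins for the repo's Normaliser/Content classes, not its API.
from collections import Counter

# Pass 0 output (assumed shape): per file, a list of sentences, each already
# tokenized, filtered, and stemmed.
files: dict[str, list[list[str]]] = {
    "doc_a.txt": [["philosophi", "time"], ["fine", "philosophi"]],
    "doc_b.txt": [["pay", "philosophi"]],
}

# Pass 1: per-sentence Counters, amalgamated per-file and then into one total.
total: Counter[str] = Counter()
for sentences in files.values():
    for tokens in sentences:
        total += Counter(tokens)

# Counter.most_common(n) cheaply limits Pass 2 to the top-n words.
top_words = [word for word, _ in total.most_common(2)]

# Pass 2: for each word, scan every file and sentence again (the O(n^2) cost).
mappings: dict[str, list[tuple[str, int]]] = {word: [] for word in top_words}
for word in top_words:
    for filename, sentences in files.items():
        for index, tokens in enumerate(sentences):
            if word in tokens:
                mappings[word].append((filename, index))

print(total.most_common(2))    # eg. [('philosophi', 3), ('time', 1)]
print(mappings["philosophi"])  # [('doc_a.txt', 0), ('doc_a.txt', 1), ('doc_b.txt', 0)]
```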
Pros:

- Fast to get the `Word (Total Occurrences)` column in the Exercise section's table.
- Could horizontally scale "Pass 2" by using a task queue system and then amalgamating the results at the end.
- Can specify a number for the top/most-common words output from "Pass 1", ie. save time in "Pass 2".
- Efficiency: "Pass 1" is O(n) for the `collections.Counter`s, with a quick O(n^2) step to amalgamate them into a single `collections.Counter` total.
Cons:

- "Pass 2" is inefficient at getting the `Documents`/`Sentences containing the word` columns in the Exercise section's table, due to doing a full pass of all sentences for each word.
- Efficiency: "Pass 2" is O(n^2): for each word, for each file, for each sentence, find a match.
Alternative Design Ideas:

- Modifications to the above "2-Pass" design:
  - Juggle the current `Counter` ordering. eg. For each file, for each sentence, for each word, find a match.
  - Restructure the code to benefit from caching. eg. `functools.lru_cache`.
- "Single-pass" design, where the file/sentence mapping is recorded at the point of normalising each sentence and a counter is incremented for each word when it is seen again (see the sketch after this list).
  - Would need to flatten the files/sentences structure to avoid O(n^3).
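A minimal sketch of that single-pass idea, assuming the flattened `(filename, tokens)` structure mentioned above (all names hypothetical):

```python
# Sketch of the "single-pass" alternative: counts and file/sentence mappings
# are recorded together while each sentence is normalised, so no second scan
# over the corpus is needed. All names/structures here are hypothetical.
from collections import Counter, defaultdict

# Flattened (filename, sentence_tokens) pairs, avoiding nested O(n^3) loops.
flat_sentences: list[tuple[str, list[str]]] = [
    ("doc_a.txt", ["philosophi", "time"]),
    ("doc_a.txt", ["fine", "philosophi"]),
    ("doc_b.txt", ["pay", "philosophi"]),
]

counts: Counter[str] = Counter()
mappings: defaultdict[str, set[tuple[str, int]]] = defaultdict(set)

for sentence_index, (filename, tokens) in enumerate(flat_sentences):
    for token in tokens:
        counts[token] += 1  # increment the counter on every sighting
        mappings[token].add((filename, sentence_index))  # record where it was seen

print(counts.most_common(1))          # [('philosophi', 3)]
print(sorted(mappings["philosophi"]))
```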
Either:

- `pip install <package.whl>` into a local virtualenv (see the "Releases" section for published wheels), or
- follow the steps in the Contribute section.

You can then run the application via the `app` entrypoint; `app --help` returns the usage.
Example:

```console
$ app
[nltk_data] Downloading package punkt to /path/to/nltk_data...
[nltk_data] Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /path/to/nltk_data...
[nltk_data] Package stopwords is already up-to-date!
Gathering `*.txt` file contents from the root of: test_docs ...
Calling main workflow.
-- Parsing file contents...
-- Counting words against files & sentences...
Render results.
-- Rendering output via: JsonRenderer...
-- Writing Rendered output to: build/output/output.json ...
-- Rendering output via: CsvRenderer...
-- Writing Rendered output to: build/output/output.csv ...
done.
$
```
- Pre-req.: `curl -sSL https://install.python-poetry.org | python3 -` (see Poetry Docs: install).
  - Potentially: explicitly install the git versioning plugin (https://github.com/mtkennerly/poetry-dynamic-versioning): `poetry self add "poetry-dynamic-versioning[plugin]"`.
- Install dependencies: `poetry install`.
- Run tests: `poetry run pytest`.
- Build a wheel: `poetry build`.
- Run the app: `poetry run app`.
NOTE: Explicitly not publishing this package to PyPI!! I don't want to bloat the PyPI namespace with a point-in-time homework piece that won't have ongoing maintenance/support (beyond the initial learning/puzzle-solving stage).
- Poetry: never used it before, but so far it seems to be what Pipenv should have been. No major pain (outside of the py3.10/pytest version bug) so far, and very smooth.
- NLTK: never used it before. Interesting package with adequate reference examples. Had to do a general refresher on NLP terminology (and how it has changed from the terms used in the rule-based NLP I've interacted with in the past). Only scratched the surface so far.
- mypy: I've been pushing type hints usage on past work projects as we came to support those Python versions in Production (we didn't use mypy in anger, due to the tech debt of warnings). Definitely benefited from the mypy warnings catching mistakes around single/double nested lists in function calls.
- PlantUml Proxy: Use the proxy to generate UML on page loads, instead of committing generated images into git.
NOTE: The following is in addition to, or a summary of, the TODO comments intentionally left in the code.
- Additional Renderers. eg. Markdown table, HTML, console.
- Fix bugs:
  - Multiline strings in a single CSV field's syntax.
  - TypeHint an ABC class correctly.
  - NLTK Data download singleton + removal of the `normaliser.py` import side-effect.
  - Bold the special word in the sentence.
- Add tox for matrix building/testing of the application against Python versions locally. NOTE: negated by the matrix building in CI (the `python_app` Github Action).
- Poetry Docs: version: I prefer discovering versions from git tags, instead of forcing a commit/PR to bump the version. I've been using PyPA: setuptools_scm to do this in current projects. Need to investigate how other Poetry users handle versioning.
- Try out and profile the Alternative Design Ideas.
- NLTK: Try out different Stemming/Lemmatizing calls in NLTK to compare speed vs quality of results.
  - Break out the filtering from the `Normaliser` so that the current `Stemming` and future `AlternativeStemming` or `Lemmatization` callers are classes that use the same abstract base class interface, ie. simplify swapping them based on purpose/speed/quality (see the sketch at the end of this section).
- NLTK: investigate best practices for managing NLTK Data in Production code. eg. gathering data dependencies at build-time vs run-time; packing the data into the built wheel or not?
- Use frameworks for the Renderers. eg.
  - `HtmlRenderer` - Django (if I were converting over to also use models instead of `Content` and/or `Counter`'s output dict, negating the on-disk/in-memory hits), or Python-Markdown (to convert markdown to HTML).
  - `ConsoleRenderer` - I know there are TUI/CLI libraries in multiple languages to easily do frames/tables in a console.
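As referenced in the filtering TODO above, a minimal sketch of what that abstract-base-class split could look like (all class and method names here are hypothetical, not the repo's current API):

```python
# One possible shape for the filtering TODO above: split token reduction out
# of Normaliser behind a small ABC so stemming vs lemmatization is swappable.
# All class and method names here are hypothetical, not the repo's current API.
from abc import ABC, abstractmethod

from nltk.stem import PorterStemmer, WordNetLemmatizer


class TokenReducer(ABC):
    """Common interface for anything that reduces tokens to a base form."""

    @abstractmethod
    def reduce(self, token: str) -> str:
        ...


class StemmingReducer(TokenReducer):
    """Fast, rule-based reduction (the current behaviour)."""

    def __init__(self) -> None:
        self._stemmer = PorterStemmer()

    def reduce(self, token: str) -> str:
        return self._stemmer.stem(token)


class LemmatizingReducer(TokenReducer):
    """Slower, dictionary-based reduction (higher quality)."""

    def __init__(self) -> None:
        self._lemmatizer = WordNetLemmatizer()

    def reduce(self, token: str) -> str:
        return self._lemmatizer.lemmatize(token)


def normalise(tokens: list[str], reducer: TokenReducer) -> list[str]:
    # The Normaliser would receive a reducer instead of hard-coding stemming.
    return [reducer.reduce(token) for token in tokens]


print(normalise(["studies", "running"], StemmingReducer()))     # ['studi', 'run']
print(normalise(["studies", "running"], LemmatizingReducer()))  # ['study', 'running']
```

Callers would then pick a reducer based on the purpose/speed/quality trade-off, without the `Normaliser` changing at all.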