Python homework exercise to pull out words of interest from text files via NLP.

NOTE: See the Releases section for the rendered output of each tagged release (generated by the release Github Action).
From the documents provided:

- Produce a list of the most frequent interesting words.
- Produce a summary table showing where those words appear (sentences and documents), eg.

| Word (Total Occurrences) | Documents | Sentences containing the word |
| --- | --- | --- |
| Philosophy (42) | x,y,z | I don't have time for philosophy<br>Surely this was a touch of fine philosophy; though no doubt he had never heard there was such a thing as that.<br>Still, her pay-as-you-go philosophy implies it.... |
| ... | ... | ... |
- To show off the type of developer I am:
  - Solution design & planning.
  - Listing assumptions & the reasoning for my choices.
  - Documentation (docs, code comments, commit messages).
  - Workflow (CI, tests, concise commits).
- To learn new packages:
  - NLTK - I've only worked on rule-based NLP products built in C++ (Boost Spirit). I recently found out that NLTK is a great Python alternative.
  - Poetry - I've heard great things about Poetry from @mikeymo for years, but have not had a chance to experiment with it (and potentially migrate current Production code away from PEP-517-style packaging (`setup.cfg`/`pyproject.toml`) to Poetry).
    - I had also found that PDM seems to be a good alternative, but it feels like there is an upfront education cost before it would be smooth sailing to use in Production.
    - I previously used Pipenv, but my opinion has soured after a mix of issues: PyPA distancing themselves from Pipenv usage in Production in favour of PEP-517 config, non-OS-agnostic lock files, and dependency graphs breaking easily.
PlantUml design to solve the above problem (See: PlantUml Design (original) & initial design sketch in docs/ for my pre-NLTK investigation):
The current implementation uses Stemming, which is documented as being typically faster than Lemmatization at the cost of accuracy/quality.
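For a feel of that trade-off, here is a minimal comparison using stock NLTK classes (the `PorterStemmer`/`WordNetLemmatizer` choice is an assumption for illustration; the repo's `Normaliser` may use different classes):

```python
# Minimal stemming-vs-lemmatization comparison using stock NLTK classes.
# Illustrative only: the repo's Normaliser may use a different stemmer.
import nltk
from nltk.stem import PorterStemmer, WordNetLemmatizer

nltk.download("wordnet", quiet=True)  # the lemmatizer needs the WordNet corpus

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

for word in ["philosophy", "studies", "running"]:
    # Stemming chops suffixes by rule; the result may not be a real word.
    # Lemmatization looks up the dictionary form; slower, but cleaner output.
    print(f"{word:12} stem={stemmer.stem(word):12} lemma={lemmatizer.lemmatize(word)}")
```

eg. "studies" stems to the non-word "studi" but lemmatizes to "study", which is the accuracy/quality difference mentioned above.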
The current design uses "2 passes" to go from filtered tokens to counts + file/sentence mappings (a sketch of these passes follows the timings below):

- Pass 0: tokenize/filter/stem words from content (`Normaliser`).
- Pass 1: Use `collections.Counter` to get a set of words + counts.
  - Currently counts are done per-sentence, then amalgamated together per-file, and then across all files.
- Pass 2: Loop through `Content` instances to discover the file/sentence mappings for each word.
`python_homework_nlp/test_main.py::TestMain::test_workflow_with_real_docs` takes:

- ~0.7s to run `main.workflow` (`Normaliser` + `Counter`) on my Linux PC with an i7-8700 CPU.
- ~0.96s in CI (eg. Github Action: Python Application #33).
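To make the two counting passes concrete, here is a condensed sketch; the data layout and names are simplified stand-ins rather than the repo's actual `Normaliser`/`Content` API:

```python
# Condensed sketch of the "2-pass" flow above. The data layout and names are
# simplified stand-ins for the repo's Normaliser/Content classes, not its API.
from collections import Counter

# Pass 0 output (assumed shape): per file, a list of sentences, each already
# tokenized, filtered, and stemmed.
files: dict[str, list[list[str]]] = {
    "doc_a.txt": [["philosophi", "time"], ["fine", "philosophi"]],
    "doc_b.txt": [["pay", "philosophi"]],
}

# Pass 1: per-sentence Counters, amalgamated per-file and then into one total.
total: Counter[str] = Counter()
for sentences in files.values():
    for tokens in sentences:
        total += Counter(tokens)

# Counter.most_common(n) cheaply limits Pass 2 to the top-n words.
top_words = [word for word, _ in total.most_common(2)]

# Pass 2: for each word, scan every file and sentence again (the O(n^2) cost).
mappings: dict[str, list[tuple[str, int]]] = {word: [] for word in top_words}
for word in top_words:
    for filename, sentences in files.items():
        for index, tokens in enumerate(sentences):
            if word in tokens:
                mappings[word].append((filename, index))

print(total.most_common(2))    # eg. [('philosophi', 3), ('time', 1)]
print(mappings["philosophi"])  # [('doc_a.txt', 0), ('doc_a.txt', 1), ('doc_b.txt', 0)]
```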
Pros:

- Fast to get the `Word (Total Occurrences)` column in the Exercise section's table.
- Could horizontally scale "Pass 2" by using a task queue system and then amalgamating the results at the end.
- Can specify a number for the top/most-common words output from "Pass 1", ie. save time in "Pass 2".
- Efficiency: "Pass 1" is O(n) for the `collections.Counter`s, with a quick O(n^2) step to amalgamate them into a single `collections.Counter` total.
Cons:

- "Pass 2" is inefficient at getting the `Documents`/`Sentences containing the word` columns in the Exercise section's table, due to doing a full pass of all sentences for each word.
- Efficiency: "Pass 2" is O(n^2): for each word, for each file, for each sentence, find a match.
Alternative Design Ideas:

- Modifications to the above "2-Pass" design:
  - Juggle the current `Counter` ordering. eg. For each file, for each sentence, for each word, find a match.
  - Restructure the code to benefit from caching. eg. `functools.lru_cache`.
- "Single-pass" design, where the file/sentence mapping is recorded at the point of normalising each sentence and a counter is incremented for each word when it is seen again (see the sketch after this list).
  - Would need to flatten the files/sentences structure to avoid O(n^3).
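A minimal sketch of that single-pass idea, assuming the flattened `(filename, tokens)` structure mentioned above (all names hypothetical):

```python
# Sketch of the "single-pass" alternative: counts and file/sentence mappings
# are recorded together while each sentence is normalised, so no second scan
# over the corpus is needed. All names/structures here are hypothetical.
from collections import Counter, defaultdict

# Flattened (filename, sentence_tokens) pairs, avoiding nested O(n^3) loops.
flat_sentences: list[tuple[str, list[str]]] = [
    ("doc_a.txt", ["philosophi", "time"]),
    ("doc_a.txt", ["fine", "philosophi"]),
    ("doc_b.txt", ["pay", "philosophi"]),
]

counts: Counter[str] = Counter()
mappings: defaultdict[str, set[tuple[str, int]]] = defaultdict(set)

for sentence_index, (filename, tokens) in enumerate(flat_sentences):
    for token in tokens:
        counts[token] += 1  # increment the counter on every sighting
        mappings[token].add((filename, sentence_index))  # record where it was seen

print(counts.most_common(1))          # [('philosophi', 3)]
print(sorted(mappings["philosophi"]))
```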
Either:

- `pip install <package.whl>` into a local virtualenv (see the "Releases" section for published wheels), or
- follow the steps in the Contribute section.

You can then run the application via the `app` entrypoint; `app --help` returns the usage.
Example:

```console
$ app
[nltk_data] Downloading package punkt to /path/to/nltk_data...
[nltk_data] Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /path/to/nltk_data...
[nltk_data] Package stopwords is already up-to-date!
Gathering `*.txt` file contents from the root of: test_docs ...
Calling main workflow.
-- Parsing file contents...
-- Counting words against files & sentences...
Render results.
-- Rendering output via: JsonRenderer...
-- Writing Rendered output to: build/output/output.json ...
-- Rendering output via: CsvRenderer...
-- Writing Rendered output to: build/output/output.csv ...
done.
$
```
- Pre-req.: `curl -sSL https://install.python-poetry.org | python3 -` (see Poetry Docs: install).
  - Potentially: explicitly install the git versioning plugin (https://github.com/mtkennerly/poetry-dynamic-versioning): `poetry self add "poetry-dynamic-versioning[plugin]"`.
- Install dependencies: `poetry install`.
- Run tests: `poetry run pytest`.
- Build a wheel: `poetry build`.
- Run the app: `poetry run app`.
NOTE: Explicitly not publishing this package to PyPI!! I don't want to bloat the PyPI namespace with a point-in-time homework piece that won't have ongoing maintenance/support (beyond the initial learning/puzzle-solving stage).
- Poetry: never used it before, but so far it seems to be what Pipenv should have been. No major pain (outside of the py3.10/pytest version bug) so far, and very smooth.
- NLTK: never used it before. Interesting package with adequate reference examples. Had to do a general refresher on NLP terminology (and how it has changed from the terms used in the rule-based NLP I've interacted with in the past). Only scratched the surface so far.
- mypy: I've been pushing type hints usage on past work projects as we came to support those Python versions in Production (we didn't use mypy in anger, due to the tech debt of warnings). Definitely benefited from the mypy warnings catching mistakes around single/double nested lists in function calls.
- PlantUml Proxy: Use the proxy to generate UML on page loads, instead of committing generated images into git.
NOTE: The following is in addition to, or a summary of, the TODO comments intentionally left in the code.
- Additional Renderers. eg. Markdown table, HTML, console.
- Fix bugs:
  - Multiline strings in a single CSV field's syntax.
  - TypeHint an ABC class correctly.
  - NLTK Data download singleton + removal of the `normaliser.py` import side-effect.
  - Bold the special word in the sentence.
- Add tox for matrix building/testing of the application against Python versions locally. NOTE: negated by the matrix building in CI (the `python_app` Github Action).
- Poetry Docs: version: I prefer discovering versions from git tags, instead of forcing a commit/PR to bump the version. I've been using PyPA: setuptools_scm to do this in current projects. Need to investigate how other Poetry users handle versioning.
- Try out and profile the Alternative Design Ideas.
- NLTK: Try out different Stemming/Lemmatizing calls in NLTK to compare speed vs quality of results.
  - Break out the filtering from the `Normaliser` so that the current `Stemming` and future `AlternativeStemming` or `Lemmatization` callers are classes that use the same abstract base class interface, ie. simplify swapping them based on purpose/speed/quality (see the sketch at the end of this section).
- NLTK: investigate best practices for managing NLTK Data in Production code. eg. gathering data dependencies at build-time vs run-time; packing the data into the built wheel or not?
- Use frameworks for the Renderers. eg.
  - `HtmlRenderer` - Django (if I were converting over to also use models instead of `Content` and/or `Counter`'s output dict, negating the on-disk/in-memory hits), or Python-Markdown (to convert markdown to HTML).
  - `ConsoleRenderer` - I know there are TUI/CLI libraries in multiple languages to easily do frames/tables in a console.
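As referenced in the filtering TODO above, a minimal sketch of what that abstract-base-class split could look like (all class and method names here are hypothetical, not the repo's current API):

```python
# One possible shape for the filtering TODO above: split token reduction out
# of Normaliser behind a small ABC so stemming vs lemmatization is swappable.
# All class and method names here are hypothetical, not the repo's current API.
from abc import ABC, abstractmethod

from nltk.stem import PorterStemmer, WordNetLemmatizer


class TokenReducer(ABC):
    """Common interface for anything that reduces tokens to a base form."""

    @abstractmethod
    def reduce(self, token: str) -> str:
        ...


class StemmingReducer(TokenReducer):
    """Fast, rule-based reduction (the current behaviour)."""

    def __init__(self) -> None:
        self._stemmer = PorterStemmer()

    def reduce(self, token: str) -> str:
        return self._stemmer.stem(token)


class LemmatizingReducer(TokenReducer):
    """Slower, dictionary-based reduction (higher quality)."""

    def __init__(self) -> None:
        self._lemmatizer = WordNetLemmatizer()

    def reduce(self, token: str) -> str:
        return self._lemmatizer.lemmatize(token)


def normalise(tokens: list[str], reducer: TokenReducer) -> list[str]:
    # The Normaliser would receive a reducer instead of hard-coding stemming.
    return [reducer.reduce(token) for token in tokens]


print(normalise(["studies", "running"], StemmingReducer()))     # ['studi', 'run']
print(normalise(["studies", "running"], LemmatizingReducer()))  # ['study', 'running']
```

Callers would then pick a reducer based on the purpose/speed/quality trade-off, without the `Normaliser` changing at all.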