python_homework_nlp

License: GNU General Public License v3.0

Python homework exercise to pull out words of interest from text files via NLP.

NOTE: See Releases section for the rendered output for each tagged release (generated by the release Github Action).

Exercise

From the documents provided:

  • Produce a list of the most frequent interesting words.

  • Produce a summary table showing where those words appear (sentences and documents), e.g.:

    | Word (Total Occurrences) | Documents | Sentences containing the word |
    | --- | --- | --- |
    | Philosophy (42) | x,y,z | I don't have time for philosophy |
    | | | Surely this was a touch of fine philosophy; though no doubt he had never heard there was such a thing as that. |
    | | | Still, her pay-as-you-go philosophy implies it. |
    | ... | ... | ... |

Personal Aims

  • To show off the type of developer I am:
    • Solution design & planning.
    • List assumptions & reasoning for my choices.
    • Documentation (docs, code comments, commit messages).
    • Workflow (CI, tests, concise commits).
  • To learn new packages:
    • NLTK - I've only worked on rule-based NLP products built in C++ (Boost Spirit). I recently found out that NLTK is a great Python alternative.
    • Poetry - I've heard great things about Poetry from @mikeymo for years, but have not had a chance to experiment with it (and potentially migrate current Production code away from PEP-517-style packaging (setup.cfg/pyproject.toml) to Poetry).
      • I had also found that PDM seems to be a good alternative, but it feels like there is an upfront education cost before it is smooth sailing to use in Production.
      • I previously used Pipenv, but my opinion has soured after a mix of issues: PyPA distancing themselves from Pipenv usage in Production in favour of PEP-517 config, non-OS-agnostic lock files, and dependency graphs breaking easily.

Design

PlantUml design to solve the above problem (See: PlantUml Design (original) & initial design sketch in docs/ for my pre-NLTK investigation):

PlantUml Design (current)

Normaliser

The current implementation uses Stemming, which is documented as typically being faster than Lemmatization at the cost of accuracy/quality.
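The speed/quality trade-off can be illustrated with a toy sketch (this is deliberately not NLTK's `PorterStemmer`/`WordNetLemmatizer`; the function names and lemma table below are made up for illustration):

```python
# Toy illustration of the stemming vs lemmatization trade-off.
# NOT NLTK's implementations - just a sketch of the two approaches.

def toy_stem(word: str) -> str:
    """Cheap rule-based suffix stripping: fast, but can mangle words."""
    for suffix in ("ing", "ies", "es", "s", "ed"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

# Lemmatization consults a vocabulary, so it is slower to build/look up
# but returns real dictionary words.
_LEMMA_TABLE = {"studies": "study", "better": "good", "ran": "run"}

def toy_lemmatize(word: str) -> str:
    return _LEMMA_TABLE.get(word, word)

print(toy_stem("studies"))       # "stud" - fast, but not a real word
print(toy_lemmatize("studies"))  # "study" - dictionary-backed, higher quality
```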

Counter

The current design uses "2 passes" to go from filtered tokens to counts plus file/sentence mappings:

Pros:

  • Fast to get the Word (Total Occurrences) column in the Exercise section's table.
  • Could horizontally scale "Pass 2" by using a task queue system and then amalgamating the results at the end.
  • Can specify a number for the top/most-common words output from "Pass 1", i.e. saving time in "Pass 2".
  • Efficiency: "Pass 1" is O(n) for the collections.Counter, with a quick O(n^2) to get a singular collections.Counter total.

Cons:

  • "Pass 2" is inefficient to get the Documents/Sentences containing the word columns in the Exercise section's table, due to doing a full pass of all sentences for each word.
  • Efficiency: "Pass 2" is: O(n^2). For each word, for each file, for each sentence find a match.
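The two passes can be sketched as follows (the `docs` shape and variable names here are hypothetical; the project's actual data structures may differ):

```python
# Sketch of the "2-pass" Counter design described above.
from collections import Counter

docs = {
    "a.txt": ["i like philosophy", "philosophy is fun"],
    "b.txt": ["cats sleep a lot"],
}

# Pass 1: O(n) total word counts across all sentences.
totals = Counter()
for sentences in docs.values():
    for sentence in sentences:
        totals.update(sentence.split())

# Optionally limit Pass 2 to the top/most-common words.
top_words = [word for word, _ in totals.most_common(2)]

# Pass 2: for each top word, re-scan every file/sentence for matches
# (the inefficient full re-scan called out in the Cons above).
mapping = {
    word: {
        fname: [s for s in sentences if word in s.split()]
        for fname, sentences in docs.items()
    }
    for word in top_words
}
```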

Alternative Design Ideas

  • Modifications to the above "2-Pass" design:
    • Juggle the current Counter loop ordering, e.g. for each file, for each sentence, for each word, find a match.
    • Restructure code to benefit from caching. eg. functools.lru_cache.
  • "Single-pass" where file/sentence mapping is recorded at point of normalising each sentence + increment a counter for each word when seen again.
    • Would need to flatten the files/sentences structure to avoid O(n^3).
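The "single-pass" idea can be sketched like this (again with hypothetical data shapes): counts and file/sentence mappings are both recorded in the same traversal, so there is no per-word re-scan afterwards.

```python
# Sketch of the "single-pass" alternative design described above.
from collections import Counter, defaultdict

docs = {
    "a.txt": ["i like philosophy", "philosophy is fun"],
    "b.txt": ["cats sleep a lot"],
}

counts = Counter()
occurrences = defaultdict(list)  # word -> [(file, sentence), ...]

# One traversal records both the count and the mapping per word.
for fname, sentences in docs.items():
    for sentence in sentences:
        for word in sentence.split():
            counts[word] += 1
            occurrences[word].append((fname, sentence))
```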

Usage

Either:

  • pip install <package.whl> into a local virtualenv (See "Releases" section for published wheels).
  • Follow the steps in the Contribute section.

You can then run the application via the entrypoint:

  • app runs the application with defaults.
  • app --help returns the usage.

Example:

$ app
[nltk_data] Downloading package punkt to /path/to/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /path/to/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
Gathering `*.txt` file contents from the root of: test_docs ...
Calling main workflow.
-- Parsing file contents...
-- Counting words against files & sentences...
Render results.
-- Rendering output via: JsonRenderer...
-- Writing Rendered output to: build/output/output.json ...
-- Rendering output via: CsvRenderer...
-- Writing Rendered output to: build/output/output.csv ...
done.
$

Contribute

  • Pre-req.: curl -sSL https://install.python-poetry.org | python3 - Poetry Docs: install.
  • Optionally: explicitly install the git versioning plugin.
  • Install dependencies: poetry install.
  • Run tests: poetry run pytest.
  • Build Wheel: poetry build.
  • Run app: poetry run app.

NOTE: Explicitly not publishing this package to PyPI! I don't want to bloat the PyPI namespace with a point-in-time homework piece that won't have ongoing maintenance/support (beyond the initial learning/puzzle-solving stage).


Retrospective

Things I've learnt?

  • Poetry: never used it before, but so far seems to be what Pipenv should have been. No major pain (outside of the py3.10/pytest version bug) so far and very smooth.
  • NLTK: never used it before. Interesting package with adequate reference examples. Had to do a general refresher on NLP terminology (and changes) from what terms were used in the rule-based NLP that I've interacted with in the past. Only scratched the surface so far.
  • mypy: I've been pushing type-hint usage on past work projects as we came to support those Python versions in Production (we didn't use mypy in anger, due to the tech debt of warnings). I definitely benefited from the mypy warnings catching mistakes around single/double nested lists in function calls.
  • PlantUml Proxy: Use the proxy to generate UML on page loads, instead of committing generated images into git.

What would I improve (with additional time)?

NOTE: The following is in addition to, or a summary of, the TODO comments intentionally left in the code.

  • Additional Renderers, e.g. Markdown table, HTML, console.
  • Fix bugs:
    • Multiline strings in single CSV field syntax.
    • TypeHint an ABC class correctly.
    • NLTK Data download singleton + remove normaliser.py import side-effect.
    • Bold special word in the sentence.
  • Add tox for matrix building/testing of the application against python versions locally. NOTE: negated by the matrix building in CI (python_app Github Action).
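On the "TypeHint an ABC class correctly" bug above: one conventional shape for a type-hinted abstract renderer interface looks like the sketch below (the class and method names are hypothetical; the project's actual Renderer classes may differ):

```python
# Sketch of type-hinting an ABC renderer interface.
import json
from abc import ABC, abstractmethod

class BaseRenderer(ABC):
    @abstractmethod
    def render(self, results: dict) -> str:
        """Return the rendered output for the given results."""

class JsonRenderer(BaseRenderer):
    def render(self, results: dict) -> str:
        return json.dumps(results)

# Callers can be annotated against the base class, not the concrete type.
renderer: BaseRenderer = JsonRenderer()
```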

What would I change?

  • Poetry Docs: version: I prefer discovering versions from git tags, instead of forcing a commit/PR to bump. I've been using PyPA: setuptools_scm to do this in current projects. I need to investigate how other users of Poetry handle versioning.
  • Try out and profile Alternative Design Ideas.
  • NLTK: Try out different Stemming/Lemmatizing calls in NLTK to compare speed vs quality of results.
    • Break out the filtering from the Normaliser so that the current Stemming and future AlternativeStemming or Lemmatization callers are classes that use the same abstract base class interface. ie. simplify changing them based on purpose/speed/quality.
  • NLTK: investigate best practices for managing NLTK Data in Production code. eg. gathering data dependencies at build-time vs run-time. Packing data into built wheel or not?
  • Using frameworks for the Renderers. eg.
    • HtmlRenderer - Django (if I was converting over to also use models instead of Content and/or Counter's OutputDict, negating on-disk/in-memory hits), Python-Markdown (convert Markdown to HTML).
    • ConsoleRenderer - I know there are TUI/CLI libraries in multiple languages to easily do frames/tables in a console.

