rmariano / compr Goto Github PK

A text compression tool & library

Home Page: https://compr.readthedocs.io/en/latest/?badge=latest

License: MIT License

Python 94.43% Makefile 2.80% Shell 2.78%

python huffman text-compression python-3 algorithms

compr's Introduction

Hi there 👋

I'm Mariano 😃. I'm a software engineer, passionate about the field.

📚 I'm the author of the book Clean code in Python. I also write about software in my blog.

📬 You can reach out to me at LinkedIn.

compr's Issues

Prototype with Google Fire

https://github.com/google/python-fire

Prototype to see if it's a suitable replacement for the command line interface.
Depends upon #25

Document file formats

Generate documentation for the low-level internals of the project.

Format of the file that is compressed, or extracted.
Structure of bytes in the resulting table
Algorithms & Data structures being used in the program
Algorithm complexity

Move code to new cli module

All code related to the command line interface should be moved from __init__ to a new module called, for example, cli.py

Handle multiple files asynchronously

Blocked by

Do the processing in an asynchronous fashion.
Compare performance results against the same benchmark

[optimization] Use actual bit array on processing

The library currently represents each bit of the binary encoding as a byte character, '1' or '0', rather than storing actual bits in an array.

It is not strictly required to port everything to C at this point; doing the optimisation in Python will suffice.

Some alternatives might be:

numpy: https://docs.scipy.org/doc/numpy/reference/generated/numpy.packbits.html
Having integers being shifted with << 1 (dumping in chunks of 64 bits, for instance), etc.
Python bitarray: https://pypi.python.org/pypi/bitarray/

Compare memory utilisation before and after the change.
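
As an illustration of the idea, here is a minimal sketch that packs a string of '0'/'1' characters into real bytes using only the standard library (not numpy or bitarray). The function names `pack_bits` and `unpack_bits` are hypothetical and not part of the project; the sketch also assumes the number of meaningful bits is stored elsewhere, since the last byte is zero-padded.

```python
def pack_bits(bits: str) -> bytes:
    """Pack a string of '0'/'1' characters into real bytes.

    The final byte is padded on the right with zero bits, so the caller
    must keep track of how many bits are meaningful (an assumption of
    this sketch, not something the project is known to do).
    """
    if set(bits) - {"0", "1"}:
        raise ValueError("bits must contain only '0' and '1'")
    padding = -len(bits) % 8  # number of zero bits needed to fill the last byte
    padded = bits + "0" * padding
    return bytes(int(padded[i:i + 8], 2) for i in range(0, len(padded), 8))


def unpack_bits(data: bytes, bit_count: int) -> str:
    """Inverse of pack_bits: recover the first bit_count bits as a string."""
    return "".join(f"{byte:08b}" for byte in data)[:bit_count]
```

With this representation every eight '0'/'1' characters collapse into a single byte, which is the memory difference the comparison above would measure.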

Support multiple files

Ability to compress multiple files, packaging the compression into a single one.

This changes the command line interface: for now, the user has to specify the name of the output file first (a default one will probably not work anymore), followed by the list of files to compress (similar to tar, etc.), like:

pycompress --output <ofilename> [files...]

The compression can be done sequentially; there is no need for parallelism. Any optimisation will be done later on.
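
A possible shape for such a command line, sketched with the standard `argparse` module. This is only an illustration of the proposal: the function name `build_parser` is hypothetical, the option is spelled `--output`, and requiring at least one input file is an assumption of the sketch.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser for: pycompress --output <ofilename> [files...]"""
    parser = argparse.ArgumentParser(prog="pycompress")
    # With several inputs there is no obvious default name, so the
    # output file is mandatory in this sketch.
    parser.add_argument(
        "--output",
        required=True,
        metavar="ofilename",
        help="name of the single resulting compressed file",
    )
    # Assumption: at least one input file must be given.
    parser.add_argument(
        "files",
        nargs="+",
        help="files to compress into the output, processed in order",
    )
    return parser
```

Calling `build_parser().parse_args(["--output", "out.comp", "a.txt", "b.txt"])` yields a namespace with `output` set to the file name and `files` set to the list of inputs, which can then be compressed one after another.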

Document program cli

Parameters for compression & decompress
Examples of invocations

change tests layout

For each compressor/<X>.py there should be a corresponding test file tests/unit/test_<X>.py

Group tests by scenarios.
Separate unit vs. functional tests, and allow running them separately.

Use mmap in files

Depends on (blocked by): #29
Change the underlying implementation for mmap, and compare performance results.

Refactor tests

Move to pytest style of tests (functions with asserts, etc.)
Remove nose dependency
Test in smaller units, each function separately
Add tox (For Python 3.5 and Python 3.6)
make checklist: check for style issues (pylint), syntax, & run tests
Remove randomization in tests
Remove cli-specific tools (subprocess calls to sha256sum for instance).

Create helper for streaming file

A helper object that yields the contents of a file, reading it in chunks of a given buffer size.

Conditions:

A file-like object
Context manager, making sure the file is closed upon completion.
Iterable

pseudocode:

with IterableFile('/tmp/foo/bar') as streamed_file:
    for chunk in streamed_file.stream(buffer_size=1024):
        print(chunk)
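
One way the helper could satisfy the conditions above is sketched below. Only the `IterableFile` name and the `stream(buffer_size=...)` call come from the pseudocode; everything else (binary mode, the private attributes, iterating with the default buffer size) is an assumption of this sketch rather than the project's implementation.

```python
class IterableFile:
    """Context manager that streams a file in fixed-size binary chunks."""

    def __init__(self, path: str) -> None:
        self._path = path
        self._file = None

    def __enter__(self) -> "IterableFile":
        # Assumption: the file is read in binary mode.
        self._file = open(self._path, "rb")
        return self

    def __exit__(self, exc_type, exc_value, traceback) -> None:
        # Always close the file, even if the body of the with block raised.
        self._file.close()

    def stream(self, buffer_size: int = 1024):
        """Yield successive chunks of at most buffer_size bytes."""
        while True:
            chunk = self._file.read(buffer_size)
            if not chunk:
                return
            yield chunk

    def __iter__(self):
        # Iterating the object directly uses the default buffer size.
        return self.stream()
```

Because `stream` is a generator, only one chunk is held in memory at a time, regardless of the size of the file.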

Package project

Create a setup.py that allows the project to be installed as a package for development and installation.

Publish at pypi

Remove 'b' prefix on encoding

Remove FIXME at 72f879f

raise coverage

at least 90%
setup codecov

Add a `--verbose` option

This optional parameter, when selected, should gather information along the process of the main command being performed, and display the results just before the program finishes.
For example, it can collect the time elapsed, the sizes of both files (prior and after the program was called), and the compression/extraction ratio (in %), etc.
This information is rendered on stdout
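
A small sketch of how the collected figures could be rendered. The function name `format_summary` and the output layout are hypothetical, and the ratio is assumed here to mean the resulting size as a percentage of the original size.

```python
def format_summary(original_size: int, result_size: int, elapsed: float) -> str:
    """Format the statistics gathered during a run as a printable report."""
    # Assumption: ratio = resulting size as a percentage of the original size.
    ratio = 100.0 * result_size / original_size if original_size else 0.0
    return (
        f"elapsed: {elapsed:.3f}s\n"
        f"original size: {original_size} bytes\n"
        f"resulting size: {result_size} bytes\n"
        f"ratio: {ratio:.1f}%"
    )
```

The main command would measure the elapsed time and file sizes itself and `print` this string just before exiting, only when `--verbose` was given.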

Improve CI

Code linting for all code
- pylint, flake8, with the most strict controls
- Set max column=79
- Break the build if the linting does not pass
- Set up automated review of common code patterns. Maybe https://github.com/integrations/sideci can help
Tests should ignore the dataset on the run
Check that coverage level did not decrease. Fail if it did.
- setup codecov https://codecov.io/
Automate coverage level report per branch, and PR. Link directly in the project main page and documentation.
Create a checklist target in Makefile, and separate tests from checklist.
Check for security issues and updates automatically. Maybe https://github.com/integrations/src-clr can help

Default file should be placed in local directory

Currently, if no output name is provided for the file being worked on (extraction/compression), the program uses <original-file>.comp as the default. If an absolute path is provided, it still uses that absolute path with the .comp suffix.

A user might have read permissions for the file being worked on (that's all it should take for compression), but not write permissions (for the output file).

The proposal is to change the default to:

`pwd`/`basename <original-file>`.comp

Leaving the resulting file in the current directory, where write permissions are assumed.

Use pathlib: https://docs.python.org/3.5/library/pathlib.html
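
With pathlib, the proposed default could look like the following sketch. The function name `default_output_path` is hypothetical; the `.comp` suffix is the one described in this issue.

```python
from pathlib import Path


def default_output_path(original: str, suffix: str = ".comp") -> Path:
    """Return <current directory>/<basename of original><suffix>."""
    # Path(...).name drops any directory part, like `basename` does.
    return Path.cwd() / (Path(original).name + suffix)
```

For an input such as `/var/data/book.txt`, this returns `book.txt.comp` inside the current working directory rather than next to the original file.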

Setup mypy

Add a new target in Makefile that checks type hinting. If the mypy validation has some issues, the target should fail.
This new target will be part of the checklist, so make checklist should run mypy among other things.

Release version 0.1.0

Tag and sign version with current master at 2017-04-15
Create necessary Makefile targets for building
Build wheel for project
Publish on pypi
Update README

run mypy as part of the CI

Include it as one of the items of the checklist. The build must fail ⚠️ if it does not pass the type hinting checks.

Setup lint checking

make lint should be part of the checklist, and should run linting checks automatically (pycodestyle, pylint, etc.).

If some issues are found on any of the files, it should fail with exit code 1.

New cli option: output directory

Enable the user to indicate an output directory for the file/s that are going to be written.

Parameter must be called --output-dir or -O.

If this parameter is provided, all files will be written inside this directory with the default naming convention.

Update documentation with examples of this use.

Document Project

Generate the documentation for the project, describing the main functions, their parameters, etc.

High-level project information
API documentation: generated from docstrings + adding custom information about each function on the project, modules, how to use, etc.
Python annotations
~~[Low-level file documentation:]~~
- ~~Binary file format, structure, bytes, parsing, etc.~~
- ~~Input and output~~
Make documentation available online (RTD)
Update Readme

Document:

cli:
- Parameters for compression & decompression
- Examples of invocations
Programmatic API

Warn if target file already exists

If the target file already exists (regardless of whether it was user-specified or the default one), warn the user about it and ask for confirmation before continuing with the processing.

This has to be done before any actual processing of the file takes place.

If -f | --force is indicated, assume the output file will be overwritten and do not prompt.
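
The check described above could be sketched as follows. The function name `may_write`, the prompt text, and the injectable `ask` callable (which defaults to the built-in `input`) are all hypothetical choices made for this illustration.

```python
import os
from typing import Callable


def may_write(target: str, force: bool = False,
              ask: Callable[[str], str] = input) -> bool:
    """Return True if it is fine to write to target.

    Must be called before any processing starts. With force=True
    (i.e. -f / --force) the user is never prompted.
    """
    if force or not os.path.exists(target):
        return True
    answer = ask(f"{target} already exists. Overwrite? [y/N] ")
    # Anything other than an explicit yes is treated as a refusal.
    return answer.strip().lower() in ("y", "yes")
```

Passing the prompt function as a parameter keeps the check testable without real user interaction.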

Setup CI

Travis CI for the project.

Prepare build against Python 3.7 and remove 3.5

Add configuration entry in CI to run against Python 3.7
Deprecate Python 3.5 in this project (Python 3.6+ only)

Setup tox

Run tests against the following Python versions:

Update travis CI

Setup performance benchmark

Automatically run performance checks on the platform, so that they can be used to measure the effect of changes, detect regressions, etc. It is recommended to run them as part of the CI along with the unit tests. It should be possible to compare performance across different branches and revisions.

Instrument the code, to support performance testability.

Have a separate target in Makefile.

The benchmark has to include the following relevant metrics (to be reviewed):

Running time (latency) for one "mark" file
Running time for N files (to determine how it scales as more files are added).
CPU load (%, load average, etc.)
Memory usage.
I/O
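
As a starting point for two of these metrics (running time and memory usage), a measurement helper could be sketched with the standard library alone. The name `measure` is hypothetical; CPU load and I/O are not covered here, and `tracemalloc` only sees allocations made through Python's allocators and adds overhead of its own, so the timings are indicative rather than exact.

```python
import time
import tracemalloc


def measure(func, *args, **kwargs):
    """Run func once and return (result, elapsed_seconds, peak_bytes)."""
    tracemalloc.start()
    start = time.perf_counter()
    try:
        result = func(*args, **kwargs)
        elapsed = time.perf_counter() - start
        # get_traced_memory() returns (current, peak) sizes in bytes.
        _, peak = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    return result, elapsed, peak
```

Running the same call over 1, 2, ..., N files and recording the elapsed times gives a first view of how the tool scales with the number of inputs.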
