fables's People

Contributors

andres-endava, andrew-pay, bk-payscale, jbcooley7, nfmcclure, paradise-runner, thomasjohns, wordythebyrd

fables's Issues

nox: black producing non-equivalent code

The latest version of black seems to be on the fritz. There are a ton of duplicate issues related to unstable formatting (https://github.com/psf/black/issues?q=is%3Aissue+is%3Aopen+label%3A%22unstable+formatting%22),
and a pinned issue where most of the conversation is happening: psf/black#1629.

I saw this message when running nox

nox > Running session blacken-3.8.1
nox > Creating virtualenv using python3.8 in /home/salvidor/dev/python/bk-fables/.nox/blacken-3-8-1
nox > pip install black
nox > black fables tests noxfile.py setup.py
error: cannot format /home/bk/dev/python/bk-fables/tests/integration/test_it_parses_files.py: INTERNAL ERROR: Black produced code that is not equivalent to the source.  Please report a bug on https://github.com/psf/black/issues.  This diff might be helpful: /tmp/blk_8ca_wv62.log
Oh no! 💥 💔 💥
21 files left unchanged, 1 file failed to reformat.
nox > Command black fables tests noxfile.py setup.py failed with exit code 123
nox > Session blacken-3.8.1 failed.

sniff_delimiter returns wrong delimiter when csv contains a lot of JSON

I haven't put together an example that reproduces this yet: the real-world csv contains JSON cells of privileged information, each around 26,000 characters, and I haven't made a test file of comparable size.

The expected delimiter is ",", but sniff_delimiter returns ":". If I pass the entire byte stream to csv.Sniffer() it gets the delimiter right, so I guess 1024 * 4 bytes isn't enough to disambiguate the "," from the ":" as possible delimiters with all that JSON in the mix. I'm sure we don't want to pass the entire byte stream every time, and increasing the number of bytes we pass seems like an endless game of whack-a-mole as file sizes get larger.

For our use case, where the delimiter is known beforehand, it might be enough to have parse accept an expected_delimiter argument that gets passed down to the ParseVisitor and on to parse_csv, calling sniff_delimiter only when expected_delimiter isn't specified. This wouldn't solve the problem for users who don't know the delimiter ahead of time, though.
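
A rough sketch of how expected_delimiter could be threaded into parse_csv (the real fables signatures may differ; the sniff_delimiter call and fallback shown here are assumptions):

import pandas as pd


def parse_csv(stream, expected_delimiter=None, **pandas_kwargs):
    # Only sniff when the caller did not supply a delimiter up front.
    # sniff_delimiter: existing fables helper, signature assumed here.
    delimiter = expected_delimiter or sniff_delimiter(stream)
    stream.seek(0)  # rewind after sniffing so pandas reads from the start
    return pd.read_csv(stream, sep=delimiter, **pandas_kwargs)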

3 Testing Failures on Local

When running tests locally on a fresh master build, 3 tests fail without any changes to code.

  1. Two of the failures occur in test_it_detects_file_metadata, where basic.csv and xlsx_with_zip_mimetype.xlsx produce incorrect mimetypes. Confirmed occurring on macOS and Ubuntu per @thomasjohns.
  • application/csv for the csv when it should be text/plain
  • application/vnd.openxmlformats-officedocument.spreadsheetml.sheet for the xlsx file when it should be application/zip
  2. test_node_stream_property_returns_at_byte_0_after_parse fails because node._stream is at byte position 12 instead of 0 after parsing. After some initial testing, this seems to originate in the pd.read_csv call inside parse_csv (fables/parse.py).
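
If the root cause is simply that parse_csv leaves the stream positioned wherever pd.read_csv stopped, one possible fix (purely a sketch; fables' actual parse_csv may be structured differently) is to rewind the stream after parsing:

import pandas as pd


def parse_csv(stream, **pandas_kwargs):
    try:
        return pd.read_csv(stream, **pandas_kwargs)
    finally:
        # Rewind so node._stream is back at byte 0 afterwards,
        # which is what the failing test asserts.
        stream.seek(0)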

Prune `requirements.txt`

I expect there are some unneeded dependencies in there. E.g. pylint is useful during development, but it shouldn't be a required runtime dependency.

It might also be good to break out requirements-test.txt and requirements-dev.txt.
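
One possible split (package names below are illustrative guesses based on tools mentioned in these issues, not the actual contents of fables' requirements files):

# requirements.txt: runtime dependencies only
pandas
python-magic

# requirements-test.txt: runtime deps plus test tooling
-r requirements.txt
pytest

# requirements-dev.txt: everything a contributor needs
-r requirements-test.txt
black
mypy
pylint
nox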

mimetype_from_stream fails on large files

mimetype = magic.from_buffer(mimebytes, mime=True) fails without a stack trace on large files. This is likely due to an excessively large buffer being passed in from mimebytes = stream.read(). I recommend reading only the first 2048 bytes of the file, as suggested by python-magic (https://pypi.org/project/python-magic/); that is enough for correct identification and should also be more performant.
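
Something along these lines (a sketch; the surrounding mimetype_from_stream body and the trailing seek are assumptions):

import magic


def mimetype_from_stream(stream):
    # Read only the first 2048 bytes; python-magic's docs say this is enough
    # for identification and it avoids pulling the whole file into memory.
    mimebytes = stream.read(2048)
    mimetype = magic.from_buffer(mimebytes, mime=True)
    stream.seek(0)  # assumption: rewind so later readers still see the full file
    return mimetype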

Auto-Generate API Docs (Starter)

We'd likely want to use sphinx.

For this issue, the scope is to document fables.detect() and fables.parse() and have that documentation automatically build into an html page.
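
A minimal starting point might look like this (docs layout and extension list are assumptions; sphinx-quickstart generates most of this boilerplate):

# docs/conf.py (sketch): enable autodoc so Sphinx pulls the docstrings
# for fables.detect() and fables.parse() into the built pages.
project = "fables"
extensions = [
    "sphinx.ext.autodoc",
    "sphinx.ext.napoleon",  # assumption: only needed if docstrings use Google/NumPy style
]

An api page containing ".. autofunction:: fables.detect" and ".. autofunction:: fables.parse" directives, built with "make html", would then produce the HTML documentation described above.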

Handle case where fables detects both Xlsx and Zip

On somewhat rare occasions, Excel xlsx files have mimetype == application/zip. If the extension is e.g. xlsx, such a file will match both the Xlsx node type and the Zip node type, and by chance the Zip node type gets detected first, which can result in clients trying to e.g. unzip xlsx files.

To reproduce such a file (this took some experimentation):
pip install openpyxl

import pandas as pd

df = pd.DataFrame(data=[[1, 2], [3, 4]], columns=["a", "b"])
df.to_excel("test.xlsx", engine="openpyxl")
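
One way to handle the collision, sketched here without reference to fables' real node-detection internals (the function and return values below are hypothetical), is to break the tie in favor of Xlsx whenever the extension says xlsx even though libmagic reports application/zip:

def choose_node_type(extension: str, mimetype: str) -> str:
    # Hypothetical tie-break: xlsx files are themselves zip containers, so an
    # "xlsx" extension combined with an application/zip mimetype should resolve
    # to the Xlsx node type rather than Zip.
    if mimetype == "application/zip" and extension == "xlsx":
        return "xlsx"
    if mimetype == "application/zip":
        return "zip"
    return "unknown"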

Add nox support.

https://nox.thea.codes/en/stable/

The developer can type nox and this will run tests, linting, and mypy type checking for python 3.6 and 3.7.

This will be useful when we set up CI later.

This can just include 3.7 depending on whether this or #13 is attempted first.
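
A minimal noxfile.py along those lines (session names and tool choices here are assumptions, not the actual fables noxfile):

import nox

PYTHONS = ["3.6", "3.7"]


@nox.session(python=PYTHONS)
def tests(session):
    session.install("-r", "requirements.txt")
    session.install("pytest")
    session.run("pytest", "tests")


@nox.session(python=PYTHONS)
def lint(session):
    session.install("flake8")
    session.run("flake8", "fables", "tests", "noxfile.py", "setup.py")


@nox.session(python=PYTHONS)
def typecheck(session):
    session.install("mypy")
    session.run("mypy", "fables")

Running nox with no arguments then executes every session against both interpreters.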

Release fables 1.2.0

[ ] Reset the setup.py version to 1.2.0.
[ ] Add pandas 1.0.1 testing to CI.
[ ] Release to PyPI.

Add encoding detection to csv parsing

Currently, the parse_csv function does not specify an encoding to pd.read_csv unless the client provides one via the pandas_kwargs argument. If the client doesn't provide one, pandas falls back to the system default, which is utf-8 on Unix systems and cp1252 on Windows. This isn't ideal and has caused parser errors, e.g. this error from fables running on Linux: UnicodeDecodeError: 'utf-8' codec can't decode byte 0xef in position 2: invalid continuation byte.

@thomasjohns left a TODO in the sniff_delimiter around using chardet or cchardet to auto-detect encoding. I think we should go ahead and make that change, but expand the scope a little further and detect the encoding for passing to pd.read_csv as well.

In testing performance of chardet and cchardet, Andres Sanz found that cchardet is much more performant, but seems to have lower confidence in its detections. So I'd suggest we use cchardet, and accept any encoding suggestion with a confidence over 50%.

Anticipated work to do:

  • Add a detect_encoding function to parse.py, similar to sniff_delimiter, that uses cchardet to detect the encoding, returning the encoding if the confidence is over 50% and raising an error otherwise (see the sketch after this list).
  • Update the signature of sniff_delimiter to accept an encoding argument and remove @thomasjohns' TODO comment.
  • Update parse_csv to call detect_encoding and pass the result to both sniff_delimiter and pd.read_csv; this should respect the encoding passed in pandas_kwargs if present.
  • Add integration tests.
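
A minimal sketch of the detect_encoding function from the first bullet (the exact signature and how the byte sample is obtained are assumptions):

import cchardet


def detect_encoding(sample: bytes, min_confidence: float = 0.5) -> str:
    """Detect the text encoding of a byte sample using cchardet."""
    result = cchardet.detect(sample)  # e.g. {"encoding": "UTF-8", "confidence": 0.99}
    encoding = result.get("encoding")
    confidence = result.get("confidence") or 0.0
    if encoding is None or confidence <= min_confidence:
        raise ValueError(
            f"could not detect encoding with confidence above {min_confidence:.0%}"
        )
    return encoding

parse_csv could then call detect_encoding on the same byte sample it already reads for sniffing and pass the result to pd.read_csv, deferring to an encoding supplied via pandas_kwargs when one is present.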
