fables's People

Contributors

andres-endava, andrew-pay, bk-payscale, jbcooley7, nfmcclure, paradise-runner, thomasjohns, wordythebyrd

fables's Issues

nox: black producing non-equivalent code

The latest version of black seems to be on the fritz. There are a ton of duplicate issues related to unstable formatting (https://github.com/psf/black/issues?q=is%3Aissue+is%3Aopen+label%3A%22unstable+formatting%22),
and a pinned issue where most of the conversation is happening: psf/black#1629.

I saw this message when running nox

nox > Running session blacken-3.8.1
nox > Creating virtualenv using python3.8 in /home/salvidor/dev/python/bk-fables/.nox/blacken-3-8-1
nox > pip install black
nox > black fables tests noxfile.py setup.py
error: cannot format /home/bk/dev/python/bk-fables/tests/integration/test_it_parses_files.py: INTERNAL ERROR: Black produced code that is not equivalent to the source.  Please report a bug on https://github.com/psf/black/issues.  This diff might be helpful: /tmp/blk_8ca_wv62.log
Oh no! 💥 💔 💥
21 files left unchanged, 1 file failed to reformat.
nox > Command black fables tests noxfile.py setup.py failed with exit code 123
nox > Session blacken-3.8.1 failed.

sniff_delimiter returns wrong delimiter when csv contains a lot of JSON

I haven't put together an example that reproduces this yet: the real-world csv contains JSON cells of privileged information, each around 26,000 characters, and I haven't made a test file of comparable size.

The expected delimiter is ",", but sniff_delimiter returns ":". If I pass the entire byte stream to csv.Sniffer() it gets the delimiter right, so I guess 1024 * 4 bytes isn't enough to disambiguate the "," from the ":" as possible delimiters with all that JSON in the mix. I'm sure we don't want to pass the entire byte stream every time, and increasing the number of bytes we pass seems like an endless game of whack-a-mole as file sizes get larger.

For our use case, where the delimiter is known beforehand, it might be enough to have parse accept an expected_delimiter argument that gets passed down to the ParseVisitor and on to parse_csv, calling sniff_delimiter only when expected_delimiter isn't specified. This wouldn't solve the problem for users who don't know the delimiter ahead of time, though.
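
A rough sketch of how expected_delimiter could be threaded into parse_csv (the real fables signatures may differ; the sniff_delimiter call and fallback shown here are assumptions):

import pandas as pd


def parse_csv(stream, expected_delimiter=None, **pandas_kwargs):
    # Only sniff when the caller did not supply a delimiter up front.
    # sniff_delimiter: existing fables helper, signature assumed here.
    delimiter = expected_delimiter or sniff_delimiter(stream)
    stream.seek(0)  # rewind after sniffing so pandas reads from the start
    return pd.read_csv(stream, sep=delimiter, **pandas_kwargs)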

3 Testing Failures on Local

When running tests locally on a fresh master build, 3 tests fail without any changes to code.

  1. Two of the failures occur in test_it_detects_file_metadata, where basic.csv and xlsx_with_zip_mimetype.xlsx produce incorrect mimetypes. Confirmed occurring on macOS and Ubuntu per @thomasjohns.
  • application/csv for the csv when it should be text/plain
  • application/vnd.openxmlformats-officedocument.spreadsheetml.sheet for the xlsx file when it should be application/zip
  2. test_node_stream_property_returns_at_byte_0_after_parse fails because node._stream is at byte position 12 instead of 0 after parsing. After some initial testing, this seems to originate in the pd.read_csv call inside parse_csv (fables/parse.py).
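
If the root cause is simply that parse_csv leaves the stream positioned wherever pd.read_csv stopped, one possible fix (purely a sketch; fables' actual parse_csv may be structured differently) is to rewind the stream after parsing:

import pandas as pd


def parse_csv(stream, **pandas_kwargs):
    try:
        return pd.read_csv(stream, **pandas_kwargs)
    finally:
        # Rewind so node._stream is back at byte 0 afterwards,
        # which is what the failing test asserts.
        stream.seek(0)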

Prune `requirements.txt`

I expect there are some unneeded dependencies in there. E.g. pylint is useful during development, but it shouldn't be a required runtime dependency.

It might also be good to break out requirements-test.txt and requirements-dev.txt.
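
One possible split (package names below are illustrative guesses based on tools mentioned in these issues, not the actual contents of fables' requirements files):

# requirements.txt: runtime dependencies only
pandas
python-magic

# requirements-test.txt: runtime deps plus test tooling
-r requirements.txt
pytest

# requirements-dev.txt: everything a contributor needs
-r requirements-test.txt
black
mypy
pylint
nox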

mimetype_from_stream fails on large files

mimetype = magic.from_buffer(mimebytes, mime=True) fails without a stack trace on large files. This is likely due to an excessively large buffer being passed in from mimebytes = stream.read(). I recommend reading only the first 2048 bytes of the file, as suggested by python-magic (https://pypi.org/project/python-magic/); that is enough for correct identification and should also be more performant.
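
Something along these lines (a sketch; the surrounding mimetype_from_stream body and the trailing seek are assumptions):

import magic


def mimetype_from_stream(stream):
    # Read only the first 2048 bytes; python-magic's docs say this is enough
    # for identification and it avoids pulling the whole file into memory.
    mimebytes = stream.read(2048)
    mimetype = magic.from_buffer(mimebytes, mime=True)
    stream.seek(0)  # assumption: rewind so later readers still see the full file
    return mimetype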

Auto-Generate API Docs (Starter)

We'd likely want to use sphinx.

For this issue, the scope is to document fables.detect() and fables.parse() and have that documentation automatically build into an html page.
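
A minimal starting point might look like this (docs layout and extension list are assumptions; sphinx-quickstart generates most of this boilerplate):

# docs/conf.py (sketch): enable autodoc so Sphinx pulls the docstrings
# for fables.detect() and fables.parse() into the built pages.
project = "fables"
extensions = [
    "sphinx.ext.autodoc",
    "sphinx.ext.napoleon",  # assumption: only needed if docstrings use Google/NumPy style
]

An api page containing ".. autofunction:: fables.detect" and ".. autofunction:: fables.parse" directives, built with "make html", would then produce the HTML documentation described above.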

Handle case where fables detects both Xlsx and Zip

On somewhat rare occasions, Excel xlsx files have mimetype == application/zip. If the extension is e.g. xlsx, such a file will match both the Xlsx node type and the Zip node type, and by chance the Zip node type gets detected first, which can result in clients trying to e.g. unzip xlsx files.

To reproduce such a file (this took some experimentation):
pip install openpyxl

import pandas as pd

df = pd.DataFrame(data=[[1, 2], [3, 4]], columns=["a", "b"])
df.to_excel("test.xlsx", engine="openpyxl")
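
One way to handle the collision, sketched here without reference to fables' real node-detection internals (the function and return values below are hypothetical), is to break the tie in favor of Xlsx whenever the extension says xlsx even though libmagic reports application/zip:

def choose_node_type(extension: str, mimetype: str) -> str:
    # Hypothetical tie-break: xlsx files are themselves zip containers, so an
    # "xlsx" extension combined with an application/zip mimetype should resolve
    # to the Xlsx node type rather than Zip.
    if mimetype == "application/zip" and extension == "xlsx":
        return "xlsx"
    if mimetype == "application/zip":
        return "zip"
    return "unknown"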

Add nox support.

https://nox.thea.codes/en/stable/

The developer can type nox and this will run tests, linting, and mypy type checking for python 3.6 and 3.7.

This will be useful when we set up CI later.

This can just include 3.7 depending on whether this or #13 is attempted first.
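
A minimal noxfile.py along those lines (session names and tool choices here are assumptions, not the actual fables noxfile):

import nox

PYTHONS = ["3.6", "3.7"]


@nox.session(python=PYTHONS)
def tests(session):
    session.install("-r", "requirements.txt")
    session.install("pytest")
    session.run("pytest", "tests")


@nox.session(python=PYTHONS)
def lint(session):
    session.install("flake8")
    session.run("flake8", "fables", "tests", "noxfile.py", "setup.py")


@nox.session(python=PYTHONS)
def typecheck(session):
    session.install("mypy")
    session.run("mypy", "fables")

Running nox with no arguments then executes every session against both interpreters.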

Release fables 1.2.0

[ ] Reset the setup.py version to 1.2.0.
[ ] Add pandas 1.0.1 testing to CI.
[ ] Release to PyPI.

Add encoding detection to csv parsing

Currently, the parse_csv function does not specify an encoding to pd.read_csv unless the client provides one via the pandas_kwargs argument. If the client doesn't provide one, pandas falls back to the system default, which is utf-8 on Unix systems and cp1252 on Windows. This isn't ideal and has caused parser errors, e.g. this error from fables running on Linux: UnicodeDecodeError: 'utf-8' codec can't decode byte 0xef in position 2: invalid continuation byte.

@thomasjohns left a TODO in the sniff_delimiter around using chardet or cchardet to auto-detect encoding. I think we should go ahead and make that change, but expand the scope a little further and detect the encoding for passing to pd.read_csv as well.

In testing performance of chardet and cchardet, Andres Sanz found that cchardet is much more performant, but seems to have lower confidence in its detections. So I'd suggest we use cchardet, and accept any encoding suggestion with a confidence over 50%.

Anticipated work to do:

  • Add a detect_encoding function to parse.py, similar to sniff_delimiter, that uses cchardet to detect the encoding, returning the encoding if the confidence is over 50% and raising an error otherwise (see the sketch after this list).
  • Update the signature of sniff_delimiter to accept an encoding argument and remove @thomasjohns' TODO comment.
  • Update parse_csv to call detect_encoding and pass the result to both sniff_delimiter and pd.read_csv; this should respect the encoding passed in pandas_kwargs if present.
  • Add integration tests.
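
A minimal sketch of the detect_encoding function from the first bullet (the exact signature and how the byte sample is obtained are assumptions):

import cchardet


def detect_encoding(sample: bytes, min_confidence: float = 0.5) -> str:
    """Detect the text encoding of a byte sample using cchardet."""
    result = cchardet.detect(sample)  # e.g. {"encoding": "UTF-8", "confidence": 0.99}
    encoding = result.get("encoding")
    confidence = result.get("confidence") or 0.0
    if encoding is None or confidence <= min_confidence:
        raise ValueError(
            f"could not detect encoding with confidence above {min_confidence:.0%}"
        )
    return encoding

parse_csv could then call detect_encoding on the same byte sample it already reads for sniffing and pass the result to pd.read_csv, deferring to an encoding supplied via pandas_kwargs when one is present.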
