payscale / fables
(F)ile T(ables)
License: Apache License 2.0
The latest version of black seems to be on the fritz. There are a ton of duplicate issues related to unstable formatting (https://github.com/psf/black/issues?q=is%3Aissue+is%3Aopen+label%3A%22unstable+formatting%22), and a pinned issue where most of the conversation is happening: psf/black#1629
I saw this message when running nox
nox > Running session blacken-3.8.1
nox > Creating virtualenv using python3.8 in /home/salvidor/dev/python/bk-fables/.nox/blacken-3-8-1
nox > pip install black
nox > black fables tests noxfile.py setup.py
error: cannot format /home/bk/dev/python/bk-fables/tests/integration/test_it_parses_files.py: INTERNAL ERROR: Black produced code that is not equivalent to the source. Please report a bug on https://github.com/psf/black/issues. This diff might be helpful: /tmp/blk_8ca_wv62.log
Oh no! 💥 💔 💥
21 files left unchanged, 1 file failed to reformat.
nox > Command black fables tests noxfile.py setup.py failed with exit code 123
nox > Session blacken-3.8.1 failed.
I haven't put together an example that reproduces this yet, as the real-world example CSV contains JSON cells of around 26,000 characters each of privileged information, and I haven't made a test file of comparable size yet.
The expected delimiter is `","`, but `sniff_delimiter` returns `":"`. If I pass the entire byte stream to `csv.Sniffer()` it gets the delimiter right, so I guess `1024 * 4` bytes isn't enough to disambiguate the `","` from the `":"` as possible delimiters with all that JSON in the mix. I'm sure we don't want to pass the entire byte stream every time, and increasing the number of bytes we pass seems like an endless game of whack-a-mole as file sizes get larger.
For our use case, where the delimiter is known beforehand, it might be enough to have `parse` accept an `expected_delimiter` argument that gets passed down to the `ParseVisitor` and on to `parse_csv`, calling `sniff_delimiter` only when `expected_delimiter` isn't specified. This wouldn't solve the problem for users who don't have prior knowledge of the delimiter, though.
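A minimal sketch of that pass-through, assuming a heavily simplified `parse_csv` (the real fables signatures differ, and `sniff_delimiter` here is a stand-in built on the stdlib sniffer):

```python
import csv
import io
from typing import List, Optional

def sniff_delimiter(sample: str) -> str:
    # Stand-in for fables' sniffer: delegate to the stdlib Sniffer.
    return csv.Sniffer().sniff(sample).delimiter

def parse_csv(stream: io.StringIO,
              expected_delimiter: Optional[str] = None) -> List[List[str]]:
    sample = stream.read(1024 * 4)  # same 4 KiB window discussed above
    stream.seek(0)
    # Only sniff when the caller didn't supply a known delimiter.
    delimiter = expected_delimiter or sniff_delimiter(sample)
    return list(csv.reader(stream, delimiter=delimiter))

rows = parse_csv(io.StringIO("a,b\n1,2\n"), expected_delimiter=",")
print(rows)  # [['a', 'b'], ['1', '2']]
```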
Add a 3.8 session to the noxfile to test in CI.
Right now we can only support 3.7 because we're using dataclasses for the error types and the parse result type. We can support 3.6 with this backport: https://github.com/ericvsmith/dataclasses
https://github.com/python/black
A few things are involved with `requirements.txt`.
When running tests locally on a fresh master build, 3 tests fail without any changes to code.
`test_it_detects_file_metadata`, where `basic.csv` and `xlsx_with_zip_mimetype.xlsx` produce incorrect mimetypes. Confirmed occurring on macOS and Ubuntu per @thomasjohns:
`application/csv` for the csv when it should be `text/plain`
`application/vnd.openxmlformats-officedocument.spreadsheetml.sheet` for the xlsx file when it should be `application/zip`
`test_node_stream_property_returns_at_byte_0_after_parse` fails when `node._stream` returns byte position 12 instead of 0. After some initial testing, this seems to occur in the `pd.read_csv` call in `parse_csv` in `fables/parse.py`.
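If that's confirmed, the likely fix is to rewind the stream after handing it to the reader. A stdlib-only sketch of the pattern (the plain `read()` here stands in for `pd.read_csv` consuming the stream):

```python
import io

def read_then_rewind(stream):
    # The reader (e.g. pd.read_csv) consumes the stream, leaving the
    # position at EOF; seek back so node._stream reports byte 0 afterwards.
    data = stream.read()
    stream.seek(0)
    return data

buf = io.BytesIO(b"a,b\n1,2\n")
read_then_rewind(buf)
print(buf.tell())  # 0
```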
Pandas 1.0.0 just came out: https://pandas.pydata.org/pandas-docs/version/1.0.0/whatsnew/v1.0.0.html. Fables should switch to pandas 1.0.0 (in `setup.py` and `requirements.txt`), but it should still back-test against pandas 0.25.1. I think we can achieve this with the "parametrizing sessions" feature of nox (https://nox.thea.codes/en/stable/config.html#parametrizing-sessions).
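A sketch of what that could look like in the noxfile, assuming pytest-based tests (the session name and install steps are illustrative, not fables' actual noxfile):

```python
import nox

@nox.session(python=["3.7"])
@nox.parametrize("pandas", ["0.25.1", "1.0.0"])
def tests(session, pandas):
    # One session per pandas version, so CI covers both old and new.
    session.install(f"pandas=={pandas}")
    session.install("-e", ".")
    session.run("pytest")
```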
I expect there are some unneeded dependencies there. E.g. pylint can be used, though it shouldn't be required. It might also be good to break out `requirements-test.txt` and `requirements-dev.txt`.
PayScale will manage the PyPI account.
We will need to update `setup.py`; https://github.com/crwilcox/my-pypi-package/blob/master/setup.py is a good reference.
Eventually we'll want to automatically `twine upload` through CI like https://github.com/crwilcox/my-pypi-package/blob/master/.circleci/config.yml, though that's not in scope for this issue.
Let's enforce a friendly and welcoming community!
When parsing a file and removing data before headers, fables sometimes removes every row but the last row of the file, and returns an empty table having headers corresponding to the data in the last row of the file.
The request is that in situations where this happens, fables records a ParseError rather than returning the empty table.
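A minimal sketch of the proposed guard; the predicate name is hypothetical, and exactly where fables would record the `ParseError` is an open question:

```python
def is_vacuous_table(num_rows: int, num_columns: int) -> bool:
    # The symptom described above: headers survive (taken from the last
    # data row) but every actual data row was dropped.
    return num_rows == 0 and num_columns > 0

# Hypothetical call site inside the parser:
#   if is_vacuous_table(len(df), len(df.columns)):
#       errors.append(ParseError(...))  # instead of returning the empty table
```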
Probably https://github.com/chardet/chardet or https://github.com/PyYoshi/cChardet could be used.
`mimetype = magic.from_buffer(mimebytes, mime=True)` fails without a stacktrace on large files; this is likely due to an excessively large buffer being sent in from `mimebytes = stream.read()`. Recommend using only the first 2048 bytes of the file, as recommended by python-magic (https://pypi.org/project/python-magic/), which ensures correct identification and should be more performant.
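Capping the read is a small change; a sketch assuming the stream is seekable (`MIME_SNIFF_BYTES` and `mime_sample` are invented names):

```python
import io

MIME_SNIFF_BYTES = 2048  # per python-magic's recommendation

def mime_sample(stream) -> bytes:
    # Read only the first 2 KiB for identification, then restore the
    # stream position so later parsing starts from byte 0.
    stream.seek(0)
    sample = stream.read(MIME_SNIFF_BYTES)
    stream.seek(0)
    return sample

# Then: mimetype = magic.from_buffer(mime_sample(stream), mime=True)
```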
Following the progress on https://github.com/socialcopsdev/camelot we might be able to integrate it into fables.
We'd likely want to use Sphinx. For this issue, the scope is to document `fables.detect()` and `fables.parse()` and have that documentation automatically build into an HTML page.
E.g. if the user wants to toggle `keep_default_na` to have "N/A" values parse as the literal string instead of the pandas NA/NaN. These arguments will get passed through to `pd.read_csv` in `visit_Csv` and `pd.read_excel` in `visit_Xls*`.
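For illustration, the `keep_default_na` toggle on `pd.read_csv` itself (this assumes pandas is installed; fables would forward the kwarg rather than call pandas directly):

```python
import io
import pandas as pd

csv_text = "city,rating\nSeattle,N/A\n"

# Default: "N/A" is treated as a missing value and becomes NaN.
df_default = pd.read_csv(io.StringIO(csv_text))

# With keep_default_na=False, "N/A" stays a literal string.
df_literal = pd.read_csv(io.StringIO(csv_text), keep_default_na=False)
print(df_literal.loc[0, "rating"])  # N/A
```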
On somewhat rare occasions, Excel xlsx files have mimetype == `application/zip`. If the extension is e.g. `xlsx`, then the file will match both the `Xlsx` node type and the `Zip` node type, and by chance the `Zip` node type gets detected first, which can result in clients trying to e.g. unzip xlsx files.
To reproduce such a file (this took some experimentation):
pip install openpyxl

import pandas as pd
df = pd.DataFrame(data=[[1, 2], [3, 4]], columns=["a", "b"])
df.to_excel("test.xlsx", engine="openpyxl")
https://github.com/crwilcox/my-pypi-package/blob/master/.circleci/config.yml seems like a good reference.
We will want to check
A convenience property equivalent to `node.mimetype == "application/x-empty"`.
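A minimal sketch, assuming the property is named `is_empty` (the name is an assumption) and lives on the node base class:

```python
class Node:
    def __init__(self, mimetype: str) -> None:
        self.mimetype = mimetype

    @property
    def is_empty(self) -> bool:
        # Convenience wrapper around the mimetype comparison above.
        return self.mimetype == "application/x-empty"

print(Node("application/x-empty").is_empty)  # True
```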
https://nox.thea.codes/en/stable/
The developer can type `nox` and this will run tests, linting, and mypy type checking for Python 3.6 and 3.7.
This will be useful when we set up CI later.
This can just include 3.7 depending on whether this or #13 is attempted first.
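A sketch of a noxfile along those lines (the tool choices, e.g. flake8 for linting, are assumptions, not fables' actual configuration):

```python
import nox

PYTHONS = ["3.6", "3.7"]

@nox.session(python=PYTHONS)
def tests(session):
    session.install("-r", "requirements.txt")
    session.run("pytest")

@nox.session(python=PYTHONS)
def lint(session):
    session.install("flake8")
    session.run("flake8", "fables", "tests")

@nox.session(python=PYTHONS)
def mypy(session):
    session.install("mypy")
    session.run("mypy", "fables")
```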
Provide an argument in `parse` to choose whether to force values as numeric, e.g. `force_numeric=False`.
[ ] Reset the `setup.py` version to be 1.2.0.
[ ] Add pandas 1.0.1 testing to CI
[ ] Release to PyPI.
Using https://coveralls.io/. We'd also like to add the badge to the README.
Currently, the parse_csv function does not specify an encoding to `pd.read_csv` unless one is provided by the client using the `pandas_kwargs` argument. If no encoding is thus provided by the client, pandas will use the system default, which is `utf-8` for Unix systems and `cp1252` for Windows systems. This isn't ideal, and has caused parser errors, e.g. this error from fables running on Linux: `UnicodeDecodeError: 'utf-8' codec can't decode byte 0xef in position 2: invalid continuation byte`.

@thomasjohns left a TODO in `sniff_delimiter` around using `chardet` or `cchardet` to auto-detect encoding. I think we should go ahead and make that change, but expand the scope a little further and detect the encoding for passing to `pd.read_csv` as well.
In testing the performance of `chardet` and `cchardet`, Andres Sanz found that `cchardet` is much more performant, but seems to have lower confidence in its detections. So I'd suggest we use `cchardet` and accept any encoding suggestion with a confidence over 50%.
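A minimal sketch of the proposed helper (the function and error names are assumptions); the threshold logic is split out from the `cchardet` call so it can be exercised without the C extension installed:

```python
class EncodingDetectionError(Exception):
    """Raised when no encoding clears the confidence threshold."""

def choose_encoding(detection, min_confidence=0.5):
    # `detection` has the dict shape returned by cchardet.detect(),
    # e.g. {"encoding": "UTF-8", "confidence": 0.99}.
    encoding = detection.get("encoding")
    confidence = detection.get("confidence") or 0.0
    if encoding and confidence >= min_confidence:
        return encoding
    raise EncodingDetectionError("low-confidence detection: %r" % (detection,))

def detect_encoding(sample, min_confidence=0.5):
    import cchardet  # third-party; assumed installed in fables' environment
    return choose_encoding(cchardet.detect(sample), min_confidence)
```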
Anticipated work to do:
- Adding a `detect_encoding` function, which uses `cchardet` to detect the encoding, returning the encoding if the confidence is over 50% and raising an error otherwise
- Updating `sniff_delimiter` to accept an encoding argument and removing @thomasjohns' TODO comment
- Updating `parse_csv` to call `detect_encoding` and pass the result to both `sniff_delimiter` and `pd.read_csv`. This should respect the encoding passed in `pandas_kwargs` if present.

Fables might be able to detect and parse Excel 2007-2010 Binary Workbooks (`xlsb` files) with https://pypi.org/project/pyxlsb/.