
meza: A Python toolkit for processing tabular data


Index

Introduction | Requirements | Motivation | Hello World | Usage | Interoperability | Installation | Project Structure | Design Principles | Scripts | Contributing | Credits | More Info | License

Introduction

meza is a Python library for reading and processing tabular data. It has a functional programming style API, excels at reading/writing large files, and can process 10+ file types.

With meza, you can

  • Read csv/xls/xlsx/mdb/dbf files, and more!
  • Type cast records (date, float, text...)
  • Process Uñicôdë text
  • Lazily stream files by default [4]
  • and much more...

Requirements

meza has been tested and is known to work on Python 3.7, 3.8, and 3.9; and PyPy3.7.

Optional Dependencies

Function                     Dependency  Installation                 File type / extension
meza.io.read_mdb             mdbtools    sudo port install mdbtools   Microsoft Access / mdb
meza.io.read_html            lxml        pip install lxml             HTML / html
meza.convert.records2array   NumPy       pip install numpy            n/a
meza.convert.records2df      pandas      pip install pandas           n/a


Motivation

Why I built meza

pandas is great, but installing it isn't exactly a walk in the park, and it doesn't play nice with PyPy. I designed meza to be a lightweight, easy to install, less featureful alternative to pandas. I also optimized meza for low memory usage, PyPy compatibility, and functional programming best practices.

Why you should use meza

meza provides a number of benefits over, and differences from, similar libraries such as pandas.

For more detailed information, please check out the FAQ.

Hello World

A simple data processing example is shown below:

First create a simple csv file (in bash)

printf 'col1,col2,col3\nhello,5/4/82,1\none,1/1/15,2\nhappy,7/1/92,3\n' > data.csv

Now we can read the file, manipulate the data a bit, and write the manipulated data back to a new file.

>>> from meza import io, process as pr, convert as cv
>>> from io import open

>>> # Load the csv file
>>> records = io.read_csv('data.csv')

>>> # `records` are iterators over the rows
>>> row = next(records)
>>> row
{'col1': 'hello', 'col2': '5/4/82', 'col3': '1'}

>>> # Let's replace the first row so as not to lose any data
>>> records = pr.prepend(records, row)

# Guess column types. Note: `detect_types` returns a new `records`
# generator since it consumes rows during type detection
>>> records, result = pr.detect_types(records)
>>> {t['id']: t['type'] for t in result['types']}
{'col1': 'text', 'col2': 'date', 'col3': 'int'}

# Now type cast the records. Note: most `meza.process` functions return
# generators, so lets wrap the result in a list to view the data
>>> casted = list(pr.type_cast(records, result['types']))
>>> casted[0]
{'col1': 'hello', 'col2': datetime.date(1982, 5, 4), 'col3': 1}

# Cut out the first column of data and merge the rows to get the max value
# of the remaining columns. Note: since `merge` (by definition) will always
# contain just one row, it is returned as is (not wrapped in a generator)
>>> cut_recs = pr.cut(casted, ['col1'], exclude=True)
>>> merged = pr.merge(cut_recs, pred=bool, op=max)
>>> merged
{'col2': datetime.date(2015, 1, 1), 'col3': 3}

# Now write merged data back to a new csv file.
>>> io.write('out.csv', cv.records2csv(merged))

# View the result
>>> with open('out.csv', encoding='utf-8') as f:
...     f.read()
'col2,col3\n2015-01-01,3\n'

Usage

meza is intended to be used directly as a Python library.

Usage Index

Reading data

meza can read both filepaths and file-like objects. Additionally, all readers return equivalent records iterators, i.e., generators of dictionaries whose keys correspond to the column names.

>>> from io import open, StringIO
>>> from meza import io

"""Read a filepath"""
>>> records = io.read_json('path/to/file.json')

"""Read a file like object and de-duplicate the header"""
>>> f = StringIO('col,col\nhello,world\n')
>>> records = io.read_csv(f, dedupe=True)

"""View the first row"""
>>> next(records)
{'col': 'hello', 'col_2': 'world'}

"""Read the 1st sheet of an xls file object opened in text mode."""
# Also, sanitize the header names by converting them to lowercase and
# replacing whitespace and invalid characters with `_`.
>>> with open('path/to/file.xls', encoding='utf-8') as f:
...     for row in io.read_xls(f, sanitize=True):
...         # do something with the `row`
...         pass

"""Read the 2nd sheet of an xlsx file object opened in binary mode"""
# Note: sheets are zero indexed
>>> with open('path/to/file.xlsx', 'rb') as f:
...     records = io.read_xls(f, encoding='utf-8', sheet=1)
...     first_row = next(records)
...     # do something with the `first_row`

"""Read any recognized file"""
>>> records = io.read('path/to/file.geojson')
>>> f.seek(0)
>>> records = io.read(f, ext='csv', dedupe=True)

Please see readers for a complete list of available readers and recognized file types.

Processing data

Numerical analysis (à la pandas) [1]

In the following example, pandas equivalent methods are preceded by -->.

>>> import itertools as it
>>> import random

>>> from io import StringIO
>>> from meza import io, process as pr, convert as cv, stats

# Create some data in the same structure as what the various `read...`
# functions output
>>> header = ['A', 'B', 'C', 'D']
>>> data = [(random.random() for _ in range(4)) for x in range(7)]
>>> df = [dict(zip(header, d)) for d in data]
>>> df[0]
{'A': 0.53908..., 'B': 0.28919..., 'C': 0.03003..., 'D': 0.65363...}

"""Sort records by the value of column `B` --> df.sort_values(by='B')"""
>>> next(pr.sort(df, 'B'))
{'A': 0.53520..., 'B': 0.06763..., 'C': 0.02351..., 'D': 0.80529...}

"""Select column `A` --> df['A']"""
>>> next(pr.cut(df, ['A']))
{'A': 0.53908170489952006}

"""Select the first three rows of data --> df[0:3]"""
>>> len(list(it.islice(df, 3)))
3

"""Select all data whose value for column `A` is less than 0.5
--> df[df.A < 0.5]
"""
>>> next(pr.tfilter(df, 'A', lambda x: x < 0.5))
{'A': 0.21000..., 'B': 0.25727..., 'C': 0.39719..., 'D': 0.64157...}

# Note: since `aggregate` and `merge` (by definition) return just one row,
# they return them as is (not wrapped in a generator).
"""Calculate the mean of column `A` across all data --> df.mean()['A']"""
>>> pr.aggregate(df, 'A', stats.mean)['A']
0.5410437473067938

"""Calculate the sum of each column across all data --> df.sum()"""
>>> pr.merge(df, pred=bool, op=sum)
{'A': 3.78730..., 'C': 2.82875..., 'B': 3.14195..., 'D': 5.26330...}

Text processing (à la csvkit) [2]

In the following example, csvkit equivalent commands are preceded by -->.

First create a few simple csv files (in bash)

printf 'col_1,col_2,col_3\n1,dill,male\n2,bob,male\n3,jane,female\n' > file1.csv
printf 'col_1,col_2,col_3\n4,tom,male\n5,dick,male\n6,jill,female\n' > file2.csv

Now we can read the files, manipulate the data, convert the manipulated data to json, and write the json back to a new file. Also, note that since all readers return equivalent records iterators, you can use them interchangeably (in place of read_csv) to open any supported file. E.g., read_xls, read_sqlite, etc.

>>> import itertools as it

>>> from meza import io, process as pr, convert as cv

"""Combine the files into one iterator
--> csvstack file1.csv file2.csv
"""
>>> records = io.join('file1.csv', 'file2.csv')
>>> next(records)
{'col_1': '1', 'col_2': 'dill', 'col_3': 'male'}
>>> next(it.islice(records, 4, None))
{'col_1': '6', 'col_2': 'jill', 'col_3': 'female'}

# Now let's create a persistent records list
>>> records = list(io.read_csv('file1.csv'))

"""Sort records by the value of column `col_2`
--> csvsort -c col_2 file1.csv
"""
>>> next(pr.sort(records, 'col_2'))
{'col_1': '2', 'col_2': 'bob', 'col_3': 'male'}

"""Select column `col_2` --> csvcut -c col_2 file1.csv"""
>>> next(pr.cut(records, ['col_2']))
{'col_2': 'dill'}

"""Select all data whose value for column `col_2` contains `jan`
--> csvgrep -c col_2 -m jan file1.csv
"""
>>> next(pr.grep(records, [{'pattern': 'jan'}], ['col_2']))
{'col_1': '3', 'col_2': 'jane', 'col_3': 'female'}

"""Convert a csv file to json --> csvjson -i 4 file1.csv"""
>>> io.write('file.json', cv.records2json(records))

# View the result
>>> with open('file.json', encoding='utf-8') as f:
...     f.read()
'[{"col_1": "1", "col_2": "dill", "col_3": "male"}, {"col_1": "2",
"col_2": "bob", "col_3": "male"}, {"col_1": "3", "col_2": "jane",
"col_3": "female"}]'

Geo processing (à la mapbox) [3]

In the following example, mapbox equivalent commands are preceded by -->.

First create a geojson file (in bash)

echo '{"type": "FeatureCollection","features": [' > file.geojson
echo '{"type": "Feature", "id": 11, "geometry": {"type": "Point", "coordinates": [10, 20]}},' >> file.geojson
echo '{"type": "Feature", "id": 12, "geometry": {"type": "Point", "coordinates": [5, 15]}}]}' >> file.geojson

Now we can open the file, split the data by id, and finally convert the split data to a new geojson file-like object.

>>> from meza import io, process as pr, convert as cv

# Load the geojson file and peek at the results
>>> records, peek = pr.peek(io.read_geojson('file.geojson'))
>>> peek[0]
{'lat': 20, 'type': 'Point', 'lon': 10, 'id': 11}

"""Split the records by feature ``id`` and select the first feature
--> geojsplit -k id file.geojson
"""
>>> splits = pr.split(records, 'id')
>>> feature_records, name = next(splits)
>>> name
11

"""Convert the feature records into a GeoJSON file-like object"""
>>> geojson = cv.records2geojson(feature_records)
>>> geojson.readline()
'{"type": "FeatureCollection", "bbox": [10, 20, 10, 20], "features": '
'[{"type": "Feature", "id": 11, "geometry": {"type": "Point", '
'"coordinates": [10, 20]}, "properties": {"id": 11}}], "crs": {"type": '
'"name", "properties": {"name": "urn:ogc:def:crs:OGC:1.3:CRS84"}}}'

# Note: you can also write back to a file as shown previously
# io.write('file.geojson', geojson)

Writing data

meza can persist records to disk via the following functions:

  • meza.convert.records2csv
  • meza.convert.records2json
  • meza.convert.records2geojson

Each function returns a file-like object that you can write to disk via meza.io.write('/path/to/file', result).

>>> from meza import io, convert as cv
>>> from io import StringIO, open

# First let's create a simple tsv file like object
>>> f = StringIO('col1\tcol2\nhello\tworld\n')
>>> f.seek(0)

# Next create a records list so we can reuse it
>>> records = list(io.read_tsv(f))
>>> records[0]
{'col1': 'hello', 'col2': 'world'}

# Now we're ready to write the records data to file

"""Create a csv file like object"""
>>> cv.records2csv(records).readline()
'col1,col2\n'

"""Create a json file like object"""
>>> cv.records2json(records).readline()
'[{"col1": "hello", "col2": "world"}]'

"""Write back csv to a filepath"""
>>> io.write('file.csv', cv.records2csv(records))
>>> with open('file.csv', encoding='utf-8') as f_in:
...     f_in.read()
'col1,col2\nhello,world\n'

"""Write back json to a filepath"""
>>> io.write('file.json', cv.records2json(records))
>>> with open('file.json', encoding='utf-8') as f_in:
...     f_in.readline()
'[{"col1": "hello", "col2": "world"}]'

Cookbook

Please see the cookbook or ipython notebook for more examples.


Interoperability

meza plays nicely with NumPy and friends out of the box

setup

>>> from meza import process as pr

# First create some records and types. Also, convert the records to a list
# so we can reuse them.
>>> records = [{'a': 'one', 'b': 2}, {'a': 'five', 'b': 10, 'c': 20.1}]
>>> records, result = pr.detect_types(records)
>>> records, types = list(records), result['types']
>>> types
[
    {'type': 'text', 'id': 'a'},
    {'type': 'int', 'id': 'b'},
    {'type': 'float', 'id': 'c'}]

from records to pandas.DataFrame to records

>>> import pandas as pd
>>> from meza import convert as cv

"""Convert the records to a DataFrame"""
>>> df = cv.records2df(records, types)
>>> df
        a   b   c
0   one   2   NaN
1  five  10  20.1
# Alternatively, you can do `pd.DataFrame(records)`

"""Convert the DataFrame back to records"""
>>> next(cv.df2records(df))
{'a': 'one', 'b': 2, 'c': nan}

from records to arrays to records

>>> import numpy as np

>>> from array import array
>>> from meza import convert as cv

"""Convert records to a structured array"""
>>> recarray = cv.records2array(records, types)
>>> recarray
rec.array([('one', 2, nan), ('five', 10, 20.100000381469727)],
          dtype=[('a', 'O'), ('b', '<i4'), ('c', '<f4')])
>>> recarray.b
array([ 2, 10], dtype=int32)

"""Convert records to a native array"""
>>> narray = cv.records2array(records, types, native=True)
>>> narray
[[array('u', 'a'), array('u', 'b'), array('u', 'c')],
[array('u', 'one'), array('u', 'five')],
array('i', [2, 10]),
array('f', [0.0, 20.100000381469727])]

"""Convert a 2-D NumPy array to a records generator"""
>>> data = np.array([[1, 2, 3], [4, 5, 6]], np.int32)
>>> data
array([[1, 2, 3],
       [4, 5, 6]], dtype=int32)
>>> next(cv.array2records(data))
{'column_1': 1, 'column_2': 2, 'column_3': 3}

"""Convert the structured array back to a records generator"""
>>> next(cv.array2records(recarray))
{'a': 'one', 'b': 2, 'c': nan}

"""Convert the native array back to records generator"""
>>> next(cv.array2records(narray, native=True))
{'a': 'one', 'b': 2, 'c': 0.0}

Installation

(You are using a virtualenv, right?)

At the command line, install meza using either pip (recommended)

pip install meza

or easy_install

easy_install meza

Please see the installation doc for more details.

Project Structure

├── CONTRIBUTING.rst
├── LICENSE
├── MANIFEST.in
├── Makefile
├── README.rst
├── data
│   ├── converted/*
│   └── test/*
├── dev-requirements.txt
├── docs
│   ├── AUTHORS.rst
│   ├── CHANGES.rst
│   ├── COOKBOOK.rst
│   ├── FAQ.rst
│   ├── INSTALLATION.rst
│   └── TODO.rst
├── examples
│   ├── usage.ipynb
│   └── usage.py
├── helpers/*
├── manage.py
├── meza
│   ├── __init__.py
│   ├── convert.py
│   ├── dbf.py
│   ├── fntools.py
│   ├── io.py
│   ├── process.py
│   ├── stats.py
│   ├── typetools.py
│   └── unicsv.py
├── optional-requirements.txt
├── py2-requirements.txt
├── requirements.txt
├── setup.cfg
├── setup.py
├── tests
│   ├── __init__.py
│   ├── standard.rc
│   ├── test_fntools.py
│   ├── test_io.py
│   └── test_process.py
└── tox.ini

Design Principles

  • prefer functions over objects
  • provide enough functionality out of the box to easily implement the most common data analysis use cases
  • make conversion between records, arrays, and DataFrames dead simple
  • whenever possible, lazily read objects and stream the result
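These principles show up in meza's records idiom itself: a lazy stream of dicts transformed by plain functions. A minimal stdlib-only sketch of the same style (the function names here are illustrative, not meza's API):

```python
from itertools import islice

def read_records(lines):
    """Lazily turn an iterable of csv-ish lines into records (dicts)."""
    it = iter(lines)
    header = next(it).split(',')
    for line in it:
        yield dict(zip(header, line.split(',')))

def cut(records, fields):
    """Keep only the given fields from each record (cf. meza.process.cut)."""
    for row in records:
        yield {k: v for k, v in row.items() if k in fields}

# Nothing is materialized until we consume the generators
lines = ['a,b,c', '1,2,3', '4,5,6']
first_two = list(islice(cut(read_records(lines), ['a', 'b']), 2))
```

Because every step is a generator, only one row is in memory at a time, which is what makes the large-file and PyPy goals above practical.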

Scripts

meza comes with a built-in task manager, manage.py

Setup

pip install -r dev-requirements.txt

Examples

Run the Python linter and nose tests

manage lint
manage test

Contributing

Please mimic the coding style/conventions used in this repo. If you add new classes or functions, please add the appropriate doc blocks with examples. Also, make sure the python linter and nose tests pass.

Please see the contributing doc for more details.

Credits

Shoutouts to csvkit, messytables, and pandas for heavily inspiring meza.

More Info

License

meza is distributed under the MIT License.


  [1] http://pandas.pydata.org/pandas-docs/stable/10min.html#min

  [2] https://csvkit.readthedocs.org/en/0.9.1/cli.html#processing

  [3] https://github.com/mapbox?utf8=%E2%9C%93&query=geojson

  [4] Notable exceptions are meza.process.group, meza.process.sort, meza.io.read_dbf, meza.io.read_yaml, and meza.io.read_html. These functions read the entire contents into memory up front.

meza's People

Contributors

christian-ensodata, dmwyatt, jaraco, petonic, pyup-bot, randlet, reubano, zenofsahil


meza's Issues

Future in requirements

Hi Reuben, I installed meza in Anaconda py2.7 and needed to install future before using it:

In [1]: from meza import convert as cv
---------------------------------------------------------------------------
ImportError                               Traceback (most recent call last)
<ipython-input-1-fbe11a4e5778> in <module>()
----> 1 from meza import convert as cv

/Users/thadk/anaconda/lib/python2.7/site-packages/meza/convert.py in <module>()
     20
     21 import itertools as it
---> 22 import pygogo as gogo
     23
     24 from os import path as p

/Users/thadk/anaconda/lib/python2.7/site-packages/pygogo/__init__.py in <module>()
     50
     51 from copy import copy
---> 52 from builtins import *
     53 from . import formatters, handlers, utils
     54

ImportError: No module named builtins

Do you think it should be added into requirements.txt for python2?
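If it were added, a PEP 508 environment marker would keep the dependency scoped to Python 2 installs only (a sketch; the exact version pin is up to the maintainer):

```
future ; python_version < "3"
```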

Deprecation warnings in Python 3.8

Here are some of the deprecation warnings emitted when running tests on Python 3.8:

/Users/jaraco/code/reubano/meza/.tox/py38/lib/python3.8/site-packages/manager/main.py:3: DeprecationWarning: the imp module is deprecated in favour of importlib; see the module's documentation for alternative uses
  import imp
/Users/jaraco/code/reubano/meza/.tox/py38/lib/python3.8/site-packages/manager/__init__.py:65: DeprecationWarning: inspect.getargspec() is deprecated since Python 3.0, use inspect.signature() or inspect.getfullargspec()
  self.arg_names, varargs, keywords, defaults = inspect.getargspec(
...
/Users/jaraco/code/reubano/meza/.tox/py38/lib/python3.8/site-packages/nose/plugins/manager.py:418: DeprecationWarning: pkg_resources is deprecated as an API. See https://setuptools.pypa.io/en/latest/pkg_resources.html
  import pkg_resources
/Users/jaraco/code/reubano/meza/.tox/py38/lib/python3.8/site-packages/nose/importer.py:12: DeprecationWarning: the imp module is deprecated in favour of importlib; see the module's documentation for alternative uses
  from imp import find_module, load_module, acquire_lock, release_lock
/Users/jaraco/code/reubano/meza/.tox/py38/lib/python3.8/site-packages/nose/suite.py:106: DeprecationWarning: Using or importing the ABCs from 'collections' instead of from 'collections.abc' is deprecated since Python 3.3, and in 3.10 it will stop working
  if isinstance(tests, collections.Callable) and not is_suite:
#140 Test for reading open files ... /Users/jaraco/code/reubano/meza/tests/test_io.py:379: DeprecationWarning: 'U' mode is deprecated
  f = open(filepath, "rU", newline=None)
...

All but one of these warnings come from nose and manage.py, both of which appear to be abandoned. These deprecation warnings fully break on later Pythons. Let's find replacements for these libraries (I suggest tox and pytest may be sufficient).
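A replacement setup along the suggested lines might look like this (a sketch only; the envlist and test paths would need adjusting to the project):

```ini
[tox]
envlist = py38, py39

[testenv]
deps = pytest
commands = pytest tests
```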

Can't pip install meza latest on Fedora 38 (and 37)

Hi

I'm not a Python expert at all, but I need to install meza on my Fedora laptop. pip install meza fails on Fedora 38 (and 37) with the following message:

> pip install meza
Defaulting to user installation because normal site-packages is not writeable
Collecting meza
  Using cached meza-0.46.0-py2.py3-none-any.whl (56 kB)
Collecting chardet<4.0.0,>=3.0.4
  Using cached chardet-3.0.4-py2.py3-none-any.whl (133 kB)
Collecting python-slugify<2.0.0,>=1.2.5
  Using cached python-slugify-1.2.6.tar.gz (6.8 kB)
  Preparing metadata (setup.py) ... done
Requirement already satisfied: python-dateutil<3.0.0,>=2.7.2 in /usr/lib/python3.11/site-packages (from meza) (2.8.2)
Requirement already satisfied: requests<3.0.0,>=2.18.4 in /usr/lib/python3.11/site-packages (from meza) (2.28.2)
Collecting xlrd<2.0.0,>=1.1.0
  Using cached xlrd-1.2.0-py2.py3-none-any.whl (103 kB)
Collecting dbfread==2.0.4
  Using cached dbfread-2.0.4-py2.py3-none-any.whl (19 kB)
Collecting ijson<3.0.0,>=2.3
  Using cached ijson-2.6.1.tar.gz (29 kB)
  Preparing metadata (setup.py) ... done
Requirement already satisfied: beautifulsoup4<5.0.0,>=4.6.0 in /usr/lib/python3.11/site-packages (from meza) (4.12.2)
Collecting PyYAML<6.0.0,>=4.2b1
  Using cached PyYAML-5.4.1.tar.gz (175 kB)
  Installing build dependencies ... done
  Getting requirements to build wheel ... error
  error: subprocess-exited-with-error
  
  × Getting requirements to build wheel did not run successfully.
  │ exit code: 1
  ╰─> [68 lines of output]
      /var/tmp/pip-build-env-fx45x08n/overlay/lib/python3.11/site-packages/setuptools/config/setupcfg.py:293: _DeprecatedConfig: Deprecated config in `setup.cfg`
      !!
      
              ********************************************************************************
              The license_file parameter is deprecated, use license_files instead.
      
              By 2023-Oct-30, you need to update your project and remove deprecated calls
              or your builds will no longer be supported.
      
              See https://setuptools.pypa.io/en/latest/userguide/declarative_config.html for details.
              ********************************************************************************
      
      !!
        parsed = self.parsers.get(option_name, lambda x: x)(value)
      running egg_info
      writing lib3/PyYAML.egg-info/PKG-INFO
      writing dependency_links to lib3/PyYAML.egg-info/dependency_links.txt
      writing top-level names to lib3/PyYAML.egg-info/top_level.txt
      Traceback (most recent call last):
        File "/usr/lib/python3.11/site-packages/pip/_vendor/pep517/in_process/_in_process.py", line 351, in <module>
          main()
        File "/usr/lib/python3.11/site-packages/pip/_vendor/pep517/in_process/_in_process.py", line 333, in main
          json_out['return_val'] = hook(**hook_input['kwargs'])
                                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
        File "/usr/lib/python3.11/site-packages/pip/_vendor/pep517/in_process/_in_process.py", line 118, in get_requires_for_build_wheel
          return hook(config_settings)
                 ^^^^^^^^^^^^^^^^^^^^^
        File "/var/tmp/pip-build-env-fx45x08n/overlay/lib/python3.11/site-packages/setuptools/build_meta.py", line 355, in get_requires_for_build_wheel
          return self._get_build_requires(config_settings, requirements=['wheel'])
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
        File "/var/tmp/pip-build-env-fx45x08n/overlay/lib/python3.11/site-packages/setuptools/build_meta.py", line 325, in _get_build_requires
          self.run_setup()
        File "/var/tmp/pip-build-env-fx45x08n/overlay/lib/python3.11/site-packages/setuptools/build_meta.py", line 341, in run_setup
          exec(code, locals())
        File "<string>", line 271, in <module>
        File "/var/tmp/pip-build-env-fx45x08n/overlay/lib/python3.11/site-packages/setuptools/__init__.py", line 107, in setup
          return distutils.core.setup(**attrs)
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
        File "/var/tmp/pip-build-env-fx45x08n/overlay/lib/python3.11/site-packages/setuptools/_distutils/core.py", line 185, in setup
          return run_commands(dist)
                 ^^^^^^^^^^^^^^^^^^
        File "/var/tmp/pip-build-env-fx45x08n/overlay/lib/python3.11/site-packages/setuptools/_distutils/core.py", line 201, in run_commands
          dist.run_commands()
        File "/var/tmp/pip-build-env-fx45x08n/overlay/lib/python3.11/site-packages/setuptools/_distutils/dist.py", line 969, in run_commands
          self.run_command(cmd)
        File "/var/tmp/pip-build-env-fx45x08n/overlay/lib/python3.11/site-packages/setuptools/dist.py", line 1233, in run_command
          super().run_command(command)
        File "/var/tmp/pip-build-env-fx45x08n/overlay/lib/python3.11/site-packages/setuptools/_distutils/dist.py", line 988, in run_command
          cmd_obj.run()
        File "/var/tmp/pip-build-env-fx45x08n/overlay/lib/python3.11/site-packages/setuptools/command/egg_info.py", line 319, in run
          self.find_sources()
        File "/var/tmp/pip-build-env-fx45x08n/overlay/lib/python3.11/site-packages/setuptools/command/egg_info.py", line 327, in find_sources
          mm.run()
        File "/var/tmp/pip-build-env-fx45x08n/overlay/lib/python3.11/site-packages/setuptools/command/egg_info.py", line 549, in run
          self.add_defaults()
        File "/var/tmp/pip-build-env-fx45x08n/overlay/lib/python3.11/site-packages/setuptools/command/egg_info.py", line 587, in add_defaults
          sdist.add_defaults(self)
        File "/var/tmp/pip-build-env-fx45x08n/overlay/lib/python3.11/site-packages/setuptools/command/sdist.py", line 113, in add_defaults
          super().add_defaults()
        File "/var/tmp/pip-build-env-fx45x08n/overlay/lib/python3.11/site-packages/setuptools/_distutils/command/sdist.py", line 251, in add_defaults
          self._add_defaults_ext()
        File "/var/tmp/pip-build-env-fx45x08n/overlay/lib/python3.11/site-packages/setuptools/_distutils/command/sdist.py", line 336, in _add_defaults_ext
          self.filelist.extend(build_ext.get_source_files())
                               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
        File "<string>", line 201, in get_source_files
        File "/var/tmp/pip-build-env-fx45x08n/overlay/lib/python3.11/site-packages/setuptools/_distutils/cmd.py", line 107, in __getattr__
          raise AttributeError(attr)
      AttributeError: cython_sources
      [end of output]
  
  note: This error originates from a subprocess, and is likely not a problem with pip.
error: subprocess-exited-with-error

× Getting requirements to build wheel did not run successfully.
│ exit code: 1
╰─> See above for output.

note: This error originates from a subprocess, and is likely not a problem with pip.

Can you please advise?

Writing using different dialects

The documentation provides no examples on how to do this and I cannot find tests that cover this feature. From the code I understand more or less how it should work but I'm not sure. Is there some example somewhere?

type casting assumes 'month first' for ambiguous dates

Right now, the type detection infers date, datetime, or time types without taking into account that 01/02/2002 can be either February the 1st or January the 2nd, depending on whether the date format is DD/MM/YYYY or MM/DD/YYYY.

This might be undecidable in some rare cases, but in general it's possible given enough values to decide between both formats.

One possible way, to handle this in meza is to use a higher level datatype for representing the type of a field to replace the current string representation. For instance:

datetime_type = namedtuple('DateTimeType', ['format'])

Basically, use a representation that takes optional extra information about the type.
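The "given enough values" idea can be sketched with a simple heuristic (illustrative only, not meza's implementation): if any first component exceeds 12 the column must be day-first, if any second component exceeds 12 it must be month-first, and otherwise it stays ambiguous.

```python
def guess_date_order(values, sep='/'):
    """Guess 'DD/MM' vs 'MM/DD' from a sample of 'X/Y/YYYY' strings."""
    day_first = month_first = False
    for value in values:
        first, second, _ = (int(part) for part in value.split(sep))
        if first > 12:
            day_first = True   # e.g. 15/03/2002 can only be day-first
        if second > 12:
            month_first = True  # e.g. 02/28/2002 can only be month-first
    if day_first and not month_first:
        return 'DD/MM/YYYY'
    if month_first and not day_first:
        return 'MM/DD/YYYY'
    return None  # ambiguous (or inconsistent) sample

guess_date_order(['01/02/2002', '15/03/2002'])  # 'DD/MM/YYYY'
```

A richer type representation like the namedtuple above could then carry the inferred format alongside the type name.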

Drop Python 3.7 support

FYI, I think enough time has passed that we can drop 3.7 support. The intent of tox is to support the three most recent Python versions plus the most recent PyPy version. Maybe a separate PR bumping up the supported Python versions.

Originally posted by @reubano in #58 (comment)

Allow process.group to accept a function for `predicate`

current

kwargs = {'aggregator': pr.merge, 'pred': 'B', 'op': sum}
pr.group(records, 'A', **kwargs)

desired

pred = lambda row: row['B']
kwargs = {'aggregator': pr.merge, 'pred': pred, 'op': sum}
pr.group(records, 'A', **kwargs)

[CR #5]

UnicodeDecodeError: 'utf-8' codec can't decode byte 0x80 in position... while reading Microsoft Access .mdb file

I am trying to read Microsoft Access [.mdb] file (created by ChemFinder on Windows), but I am getting
UnicodeDecodeError: 'utf-8' codec... error despite specifying encoding as recovered by meza.io.get_encoding()
to be TIS-620

I would appreciate any suggestions...

Details below:

import meza
fn = 'test.mdb'
enc = meza.io.get_encoding(fn)
print(enc)  # TIS-620
records = meza.io.read_mdb(fn, encoding=enc)
z = list(records)
~/anaconda3/lib/python3.8/site-packages/meza/io.py in read_mdb(filepath, table, **kwargs)
    636     # https://stackoverflow.com/a/17698359/408556
    637     with Popen(['mdb-export', filepath, table], **pkwargs).stdout as pipe:
--> 638         first_line = StringIO(str(pipe.readline()))
    639         names = next(csv.reader(first_line, **kwargs))
    640         uscored = ft.underscorify(names) if sanitize else names

~/anaconda3/lib/python3.8/codecs.py in decode(self, input, final)
    320         # decode input (taking the buffer into account)
    321         data = self.buffer + input
--> 322         (result, consumed) = self._buffer_decode(data, self.errors, final)
    323         # keep undecoded input until the next call
    324         self.buffer = data[consumed:]

UnicodeDecodeError: 'utf-8' codec can't decode byte 0x80 in position 70: invalid start byte

I tried to pass encoding to pkwargs used in Popen (in meza/io.py)

    pkwargs = {'stdout': PIPE, 'bufsize': 1, 'universal_newlines': True}
--> pkwargs['encoding'] = kwargs.get('encoding', None)

    # https://stackoverflow.com/a/2813530/408556
    # https://stackoverflow.com/a/17698359/408556
    with Popen(['mdb-export', filepath, table], **pkwargs).stdout as pipe:

but it does not resolve the issue. With this modification I am getting:

UnicodeDecodeError: 'charmap' codec can't decode byte 0xff in position 327: character maps to <undefined>

Datetimes with all-zero time component (exact midnight) are detected as dates

Datetime values with all-zero time components are detected as dates by process.detect_types. This is because typetools.is_datetime explicitly checks that the time component is not '00:00:00':

has_time = converted and the_time != NULL_TIME

I'd argue that if a time value is present at all, then the value should be treated as a datetime, as this likely better represents the intent of the source. For example, a database report may include a datetime column, but all the values in a particular output happened to occur at midnight. The down-casting behaviour is similar to that of floats #34.

Example Test Case

diff --git a/tests/test_process.py b/tests/test_process.py
index cc538a2..f90556d 100644
--- a/tests/test_process.py
+++ b/tests/test_process.py
@@ -76,6 +76,12 @@ class Test:
         nt.assert_equal(Decimal('0.87'), result['confidence'])
         nt.assert_false(result['accurate'])
 
+    def test_detect_types_datetimes_midnight(self):
+        records = it.repeat({"foo": "2000-01-01 00:00:00"})
+        records, result = pr.detect_types(records)
+
+        nt.assert_equal(result["types"], [{"id": "foo", "type": "datetime"}])
+
     def test_fillempty(self):
         records = [
             {'a': '1', 'b': '27', 'c': ''},

Fails with:

'AssertionError: Lists differ: [{'id': 'foo', 'type': 'date'}] != [{'id': 'foo', 'type': 'datetime'}]

First differing element 0:
{'id': 'foo', 'type': 'date'}
{'id': 'foo', 'type': 'datetime'}

- [{'id': 'foo', 'type': 'date'}]
+ [{'id': 'foo', 'type': 'datetime'}]
?                             ++++

Making the time part non-zero will pass the test.

Potential Solutions

  • Prefer stricter type inference for datetimes by default; e.g. if it has both a date and a time field, it's a datetime.
  • Allow stricter type inference for datetimes as an option e.g. a kwarg to detect_types that is passed down to is_datetime to change the behaviour from "can this only be parsed as a datetime" to "this is a datetime"
  • Any other ideas of course!
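For illustration, the kwarg option could look something like this. This is only a sketch: the `strict` kwarg and the single hard-coded format are hypothetical, not meza's actual typetools.is_datetime.

```python
from datetime import datetime

def is_datetime(value, strict=False):
    """Illustrative only: the `strict` kwarg is a proposal, not meza's API.

    strict=False mirrors the current behaviour (midnight values are
    down-cast to dates); strict=True treats any value carrying an
    explicit time component as a datetime.
    """
    try:
        parsed = datetime.strptime(value, "%Y-%m-%d %H:%M:%S")
    except (TypeError, ValueError):
        return False

    if strict:
        # The source included a time field, so keep it a datetime.
        return True

    # Current behaviour: exact-midnight values are not datetimes.
    return parsed.time() != datetime.min.time()

print(is_datetime("2000-01-01 00:00:00"))               # False (down-cast)
print(is_datetime("2000-01-01 00:00:00", strict=True))  # True
```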

I am happy to implement the changes required after a decision is made on the correct behaviour 😄

ValueError converting zero-value currencies

Type detection raises a ValueError for currencies with a value of zero, e.g. '$0' or '0$'.

>>> import itertools as it
>>> from meza import process as pr
>>> 
>>> records = it.repeat({"money": "$0"})
>>> records, result = pr.detect_types(records)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/meza/meza/process.py", line 333, in detect_types
    for t in tt.guess_type_by_value(record):
  File "/meza/meza/typetools.py", line 172, in guess_type_by_value
    result = type_test(g['func'], g['type'], key, value)
  File "/meza/meza/typetools.py", line 33, in type_test
    passed = test(value)
  File "//meza/meza/fntools.py", line 509, in is_int
    passed = is_numeric(content, thousand_sep, decimal_sep)
  File "/meza/meza/fntools.py", line 489, in is_numeric
    passed = int(content) == 0
ValueError: invalid literal for int() with base 10: '$0'

This is caused by is_numeric casting the original, unstripped content to an int:

passed = int(content) == 0

As far as I can tell, this should only be an issue when the value starts with 0 and the only non-numeric characters are currency symbols. Here is a failing test case for this:

diff --git a/tests/test_fntools.py b/tests/test_fntools.py
index 922bc17..f8cdc75 100644
--- a/tests/test_fntools.py
+++ b/tests/test_fntools.py
@@ -45,6 +45,11 @@ class TestIterStringIO:
         nt.assert_false(ft.is_numeric(None))
         nt.assert_false(ft.is_numeric(''))
 
+    def test_is_numeric_0_currency(self):
+        for sym in ft.CURRENCIES:
+            nt.assert_true(ft.is_numeric(f'0{sym}'))
+            nt.assert_true(ft.is_numeric(f'{sym}0'))
+
     def test_is_int(self):
         nt.assert_false(ft.is_int('5/4/82'))
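One possible direction for a fix, as a sketch only: strip the currency symbols before any numeric cast. This stands in for meza's actual fntools.is_numeric and simplifies away its thousand/decimal separator handling; the CURRENCIES string here is an assumed subset of ft.CURRENCIES.

```python
CURRENCIES = "$£€¥"  # assumed subset; meza defines its own ft.CURRENCIES

def is_numeric(content, currencies=CURRENCIES):
    """Sketch: strip currency symbols before the numeric check."""
    if content is None or str(content) == "":
        return False

    # Strip symbols from both ends and drop thousand separators before
    # any cast, so '$0' and '0$' reduce to plain '0'.
    stripped = str(content).strip(currencies).replace(",", "")

    try:
        float(stripped)
    except ValueError:
        return False

    return True

print(is_numeric("$0"), is_numeric("0$"))  # True True
print(is_numeric("5/4/82"))                # False
```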

Initial Update

Hi 👊

This is my first visit to this fine repo, but it seems you have been working hard to keep all dependencies updated so far.

Once you have closed this issue, I'll create separate pull requests for every update as soon as I find one.

That's it for now!

Happy merging! 🤖

Setup.py develop doesn't work.

My operating system

Description:  Ubuntu 16.04.3 LTS
Release:	     16.04
Codename:   xenial

Steps to re-produce

the same steps in the contribution doc

Error

error in meza setup command: 'tests_require' must be a string or list of strings containing valid project/version requirement specifiers; Unordered types are not allowed

tox fails with pippy is not allowed

 meza main @ tox
py: install_deps> helpers/pippy -r /Users/jaraco/code/reubano/meza/dev-requirements.txt -r /Users/jaraco/code/reubano/meza/requirements.txt
py: failed with /Users/jaraco/code/reubano/meza/helpers/pippy (resolves to /Users/jaraco/code/reubano/meza/helpers/pippy) is not allowed, use allowlist_externals to allow it
  py: FAIL code 1 (0.15 seconds)
  evaluation failed :( (0.20 seconds)

Same as reubano/csv2ofx#102.

Missing dependency "future" on Python 2.7

Hi,

When installing Meza v0.41.1 on Python 2.7, the dependency "future" is not installed (but required).

To reproduce (for instance on Windows but the problem is the same on Linux):

D:\Laurent\Projets\virtualenv>C:\Python27\python.exe -m virtualenv meza
New python executable in D:\Laurent\Projets\virtualenv\meza\Scripts\python.exe
Installing setuptools, pip, wheel...done.

D:\Laurent\Projets\virtualenv>meza\Scripts\activate

(meza) D:\Laurent\Projets\virtualenv>pip --version
pip 19.0.1 from d:\laurent\projets\virtualenv\meza\lib\site-packages\pip (python 2.7)

(meza) D:\Laurent\Projets\virtualenv>pip install meza==0.41.1
[...]

(meza) D:\Laurent\Projets\virtualenv>pip list
Package                       Version
----------------------------- ----------
backports.functools-lru-cache 1.5
beautifulsoup4                4.7.1
certifi                       2018.11.29
chardet                       3.0.4
dbfread                       2.0.4
idna                          2.8
ijson                         2.3
meza                          0.41.1
pip                           19.0.1
pygogo                        0.12.0
python-dateutil               2.7.5
python-slugify                1.2.6
PyYAML                        3.13
requests                      2.21.0
setuptools                    40.7.1
six                           1.12.0
soupsieve                     1.7.3
Unidecode                     1.0.23
urllib3                       1.24.1
wheel                         0.32.3
xlrd                          1.2.0

As you can see, future is missing.

The problem occurs because the wheel metadata is not valid.

If you want to install "future" only for Python 2.7, your requirements should be:

    'future>=0.16.0,<1.0.0; python_version < "3"'

See:

io.py incompatible with PyYAML-6.0

When building meza, this test failure happens:

Doctest: meza.io.read_yaml ... FAIL

======================================================================
FAIL: Doctest: meza.io.read_yaml
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/sw/lib/python3.9/doctest.py", line 2205, in runTest
    raise self.failureException(self.format_failure(new.getvalue()))
AssertionError: Failed doctest test for meza.io.read_yaml
  File "/sw/build.build/meza-py39-0.46.0-1/meza-0.46.0/meza/io.py", line 1256, in read_yaml

----------------------------------------------------------------------
File "/sw/build.build/meza-py39-0.46.0-1/meza-0.46.0/meza/io.py", line 1279, in meza.io.read_yaml
Failed example:
    next(records) == {
        'text': 'Chicago Reader',
        'float': 1.0,
        'datetime': dt(1971, 1, 1, 4, 14),
        'boolean': True,
        'time': '04:14:00',
        'date': date(1971, 1, 1),
        'integer': 40}
Exception raised:
    Traceback (most recent call last):
      File "/sw/lib/python3.9/doctest.py", line 1334, in __run
        exec(compile(example.source, filename, "single",
      File "<doctest meza.io.read_yaml[3]>", line 1, in <module>
        next(records) == {
      File "/sw/build.build/meza-py39-0.46.0-1/meza-0.46.0/meza/io.py", line 551, in read_any
        for line in _read_any(f, reader, args, **kwargs):
      File "/sw/build.build/meza-py39-0.46.0-1/meza-0.46.0/meza/io.py", line 470, in _read_any
        for num, line in enumerate(reader(f, *args, **kwargs)):
    TypeError: load() missing 1 required positional argument: 'Loader'

In meza/meza/io.py, line 1289 (at 370c292):

return read_any(filepath, yaml.load, mode, **kwargs)

changing yaml.load to yaml.safe_load takes care of the problem (I'm not sure in which version yaml.safe_load was introduced).
See yaml/pyyaml#576
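Either form below avoids the missing-Loader TypeError raised by PyYAML 6. This is a generic sketch of the compatible call, not a patch to io.py itself:

```python
from functools import partial

import yaml  # PyYAML

# yaml.safe_load is shorthand for yaml.load(..., Loader=yaml.SafeLoader);
# binding the Loader explicitly also works, and both forms accept a
# single stream argument, so either can be passed to read_any as the
# reader callable.
safe_loader = partial(yaml.load, Loader=yaml.SafeLoader)

doc = "integer: 40\nfloat: 1.0\nboolean: true"
print(safe_loader(doc))
print(yaml.safe_load(doc) == safe_loader(doc))
```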

test failure in test_excel_html_export with io.read_html

Testing meza-0.46.0, I get this error (py38-py310):

Test for reading an html table exported from excel ... FAIL

======================================================================
FAIL: Test for reading an html table exported from excel
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/sw/lib/python3.9/site-packages/nose/case.py", line 197, in runTest
    self.test(*self.arg)
  File "/sw/build.build/meza-py39-0.46.0-1/meza-0.46.0/tests/test_io.py", line 354, in test_excel_html_export
    nt.assert_equal(expected, next(records))
AssertionError: {'sparse_data': 'Iñtërnâtiônàližætiøn', 'so[61 chars]dam'} != {'13_width_75_some_date': '13 class=xl24 al[123 chars]dam'}
- {'some_date': '05/04/82',
+ {'13_width_75_some_date': '13 class=xl24 align=right>05/04/82',
+  '2_width_150_unicode_test': 'Ādam',
-  'some_value': '234',
+  '75_some_value': 'right>234',
?   +++              ++++++

-  'sparse_data': 'Iñtërnâtiônàližætiøn',
?                                       ^

+  '75_sparse_data': 'Iñtërnâtiônàližætiøn'}
?   +++                                    ^

-  'unicode_test': 'Ādam'}

----------------------------------------------------------------------

The output in the AssertionError line seems all mangled with the attributes from the different html table elements sprinkled in.
If I remove the html attributes for the table in data/test/test.htm, then the test passes. I notice that io.read_html uses BeautifulSoup. I have beautifulsoup-4.10.0 and soupsieve-2.3.1 installed.

Windows support for mdb_tools?

Hey Reuben!

Can the mdb_tools part of this library be run on windows without a linux environment? I'm thinking no given the reliance on MDB Tools?

In py3.7+, read_mdb should not use StopIteration

Here is a simple example:

from meza.io import read_mdb
import pandas as pd
process_data = read_mdb(file_path, "ProcessData")
process_data_df = pd.DataFrame(process_data)
process_data.close()

Before Python 3.7, this works OK. From Python 3.7 onward, it throws "RuntimeError: generator raised StopIteration". This is due to PEP 479 (https://peps.python.org/pep-0479/), which changed how StopIteration is handled inside generators.

I tried this with v 0.45.5, which should be py3.7 compatible, and still see the error.
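The general shape of the fix is to catch the inner StopIteration and return. This is a generic sketch of the PEP 479 pattern, not meza's actual read_mdb code:

```python
def broken(rows):
    it = iter(rows)
    while True:
        # Pre-3.7, the StopIteration from next() silently ended the
        # generator; under PEP 479 it is re-raised as RuntimeError.
        yield next(it)

def fixed(rows):
    it = iter(rows)
    while True:
        try:
            yield next(it)
        except StopIteration:
            return  # end the generator explicitly

print(list(fixed([1, 2, 3])))  # [1, 2, 3]

try:
    list(broken([1, 2, 3]))
except RuntimeError as err:
    print(f"RuntimeError: {err}")
```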

Allow `process.detect_types` to match last type instead of the first

Float values with zero as the fractional component, e.g. '0.0', '1.0', '10.00', are detected as int instead of float. This is because they can be parsed as ints according to fntools.is_int. Although the data could be interpreted as integers, given that the source has a decimal place, I would argue that detect_types should not down-cast to an integer. For example, database reports may include data from float/decimal columns which, just by chance, have no fractional component; however, this doesn't mean they shouldn't be treated as floats.

Example Test Case

diff --git a/tests/test_process.py b/tests/test_process.py
index cc538a2..9b720e5 100644
--- a/tests/test_process.py
+++ b/tests/test_process.py
@@ -76,6 +76,12 @@ class Test:
         nt.assert_equal(Decimal('0.87'), result['confidence'])
         nt.assert_false(result['accurate'])
 
+    def test_detect_types_floats_zero_fractional_component(self):
+        records = it.cycle([{"foo": '0.0'}, {"foo": "1.0"}, {"foo": "10.00"}])
+        records, result = pr.detect_types(records)
+
+        nt.assert_equal(result["types"], [{"id": "foo", "type": "float"}])
+
     def test_fillempty(self):
         records = [
             {'a': '1', 'b': '27', 'c': ''},

Fails with:

AssertionError: Lists differ: [{'id': 'foo', 'type': 'int'}] != [{'id': 'foo', 'type': 'float'}]

First differing element 0:
{'id': 'foo', 'type': 'int'}
{'id': 'foo', 'type': 'float'}

- [{'id': 'foo', 'type': 'int'}]
?                         ^^

+ [{'id': 'foo', 'type': 'float'}]
?                         ^^^^

Potential Solutions

  • Prefer stricter type inference for floats by default; e.g. if it has decimal places, it's a float.
  • Allow stricter type inference for floats via an option e.g. a kwarg to detect_types that is passed down to is_int to change the behaviour from "can this be parsed as an int" to "this is definitely an int"
  • Any other ideas of course!
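A sketch of what the kwarg option might look like (the `strict` kwarg and this simplified check are hypothetical, not meza's actual fntools.is_int):

```python
def is_int(value, strict=False):
    """Illustrative only: the `strict` kwarg is a proposal, not meza's API.

    strict=False: "can this be parsed as an int?"  '1.0' -> True
    strict=True:  "is this written as an int?"     '1.0' -> False
    """
    try:
        as_float = float(value)
    except (TypeError, ValueError):
        return False

    if strict:
        # A decimal point in the source means the author meant a float.
        return "." not in str(value)

    return as_float == int(as_float)

print(is_int("1.0"))               # True  (current lenient behaviour)
print(is_int("1.0", strict=True))  # False (kept as float)
print(is_int("10", strict=True))   # True
```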

I am happy to implement the changes required after a decision is made on the correct behaviour 😄

UnicodeDecodeError when reading MDB file

When loading an MDB file, it throws a UnicodeDecodeError immediately upon reading the first record. It can be reproduced with the "PetStore2000.mdb" or "PetStore2002.mdb" sample databases found at this link: https://jerrypost.com/dbbook/mainbook/Databases/Access/Access.html
(I've been able to read the other sample databases at that link with meza successfully)

Example code:

import meza.io
print("Detected encoding:", meza.io.get_encoding('PetStore2002.mdb'))
records = meza.io.read_mdb('PetStore2002.mdb')
print(next(records))

Output:

Detected encoding: None

Traceback (most recent call last):
  File "/Users/me/Library/Application Support/JetBrains/PyCharm2022.3/scratches/scratch_84.py", line 6, in <module>
    print(next(records))
  File "/Users/me/projects/leaks-tools/splitters/venv/lib/python3.8/site-packages/meza/io.py", line 664, in read_mdb
    first_line = StringIO(str(pipe.readline()))
  File "/Library/Developer/CommandLineTools/Library/Frameworks/Python3.framework/Versions/3.8/lib/python3.8/codecs.py", line 322, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xff in position 268: invalid start byte

It's possible this is the same problem described in #39
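As a possible workaround while this is open, the raw mdb-export bytes can be decoded with a fallback encoding before CSV parsing. cp1252 is only a guess, since older Access databases often use Windows code pages rather than UTF-8, and decode_lines is a hypothetical helper, not part of meza:

```python
import csv
import io

def decode_lines(raw, encodings=("utf-8", "cp1252")):
    """Try each encoding in turn; cp1252 is a guess for older Access files."""
    for enc in encodings:
        try:
            return raw.decode(enc)
        except UnicodeDecodeError:
            continue
    return raw.decode("latin-1", errors="replace")  # last resort: never fails

# 0xff is an invalid UTF-8 start byte but maps to 'ÿ' in cp1252,
# mimicking the byte reported in the traceback above.
raw = b"name,species\nFl\xffffy,cat\n"
records = list(csv.DictReader(io.StringIO(decode_lines(raw))))
print(records)
```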

Unexpected warnings from iterators

Hello guys,

I'm getting a plethora of warnings from meza, and they look like this:

/home/simone/hypothesis-csv/.eggs/meza-0.41.1-py3.6.egg/meza/process.py:342: DeprecationWarning: generator 'gen_types' raised StopIteration
  types = list(gen_types(tally))
/home/simone/hypothesis-csv/.eggs/meza-0.41.1-py3.6.egg/meza/process.py:335: DeprecationWarning: generator 'guess_type_by_value' raised StopIteration
  for t in tt.guess_type_by_value(record):
/home/simone/hypothesis-csv/.eggs/meza-0.41.1-py3.6.egg/meza/process.py:342: DeprecationWarning: generator 'gen_types' raised StopIteration
  types = list(gen_types(tally))

The version is 0.41.1. I checked the code and it doesn't actually raise StopIteration, so I have no idea where the warning comes from. Any insight?

requests >=2.10.0?

Hello. I'm trying to install meza with pip-tools, but am getting a dependency constraint resolution error between this package and the awsebcli package. meza calls for requests >= 2.10.0, and awsebcli calls for <=2.9.1.

Is there a specific reason for meza to require >= 2.10.0? Could this requirement be loosened at all, since requests is pretty stable?

Thanks!

Getting error with large mdb file

When I run the following in Python 3 on Ubuntu (it works fine with a small MDB file containing one table):

records = io.read_mdb(db_file_path) # only file path, no file objects
next(records)

I get:

read: Is a directory
Couldn't read first page.
Couldn't open database.

Rename master branch to main

@reubano How do you feel about renaming the main branches of these projects (meza, csv2ofx) to "main"? It's the new default for git and many projects have moved to it. It would be nice for me for these to be consistent with other projects. I can help work through any transitional issues (though in my experience, it's pretty straightforward). As the project owner, you'll either have to grant me access or do it yourself (docs). Let me know how you'd like to proceed.
