
Generate Pandas frames, load and extract data, based on JSON Table Schema descriptors.

License: GNU Lesser General Public License v3.0


tableschema-pandas-py's Introduction

tableschema-pandas-py

Badges: Travis CI, Coveralls, PyPI, GitHub, Gitter

Generate and load Pandas data frames based on Table Schema descriptors.

Features

  • implements tableschema.Storage interface

Contents

Getting Started

Installation

The package uses semantic versioning, which means that major versions could include breaking changes. It's highly recommended to specify a package version range in your setup/requirements file, e.g. package>=1.0,<2.0.

$ pip install tableschema-pandas
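For example, a pin for this package in a requirements file could look like the line below (the range shown is only an illustration of the recommendation above):

# requirements.txt
tableschema-pandas>=1.0,<2.0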

Documentation

# pip install datapackage tableschema-pandas
from datapackage import Package

# Save to Pandas

package = Package('http://data.okfn.org/data/core/country-list/datapackage.json')
storage = package.save(storage='pandas')

print(type(storage['data']))
#  <class 'pandas.core.frame.DataFrame'>

print(storage['data'].head())
#               Name   Code
#  0     Afghanistan   AF
#  1   Åland Islands   AX
#  2         Albania   AL
#  3         Algeria   DZ
#  4  American Samoa   AS

# Load from Pandas

package = Package(storage=storage)
print(package.descriptor)
print(package.resources[0].read())

Storage works as a container for Pandas data frames. You can define a new data frame inside the storage using the storage.create method:

>>> from tableschema_pandas import Storage

>>> storage = Storage()
>>> storage.create('data', {
...     'primaryKey': 'id',
...     'fields': [
...         {'name': 'id', 'type': 'integer'},
...         {'name': 'comment', 'type': 'string'},
...     ]
... })

>>> storage.buckets
['data']

>>> storage['data'].shape
(0, 0)

Use storage.write to populate the data frame with data:

>>> storage.write('data', [(1, 'a'), (2, 'b')])

>>> storage['data']
id comment
1        a
2        b

You can also use tabulator to populate a data frame from an external data file. As you can see, subsequent writes simply append new data on top of the existing rows:

>>> import tabulator

>>> with tabulator.Stream('data/comments.csv', headers=1) as stream:
...     storage.write('data', stream)

>>> storage['data']
id comment
1        a
2        b
1     good

API Reference

Storage

Storage(self, dataframes=None)

Pandas storage

The package implements the Tabular Storage interface (see the full documentation at the link):

Storage

Only the additional API is documented here; a short usage sketch follows the arguments list below.

Arguments

  • dataframes (object[]): list of storage dataframes
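As a quick illustration, here is a minimal sketch combining the default constructor with the standard Tabular Storage methods; describe and read come from the interface linked above rather than from this README, and the bucket name 'data' is just an example:

>>> from tableschema_pandas import Storage

>>> storage = Storage()
>>> storage.create('data', {'fields': [{'name': 'id', 'type': 'integer'}]})
>>> storage.write('data', [[1], [2]])

>>> storage.describe('data')  # the descriptor passed to create
>>> storage.read('data')      # the written rows, e.g. [[1], [2]]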

Contributing

The project follows the Open Knowledge International coding standards.

The recommended way to get started is to create and activate a project virtual environment. To install the package and development dependencies into the active environment:

$ make install

To run tests with linting and coverage:

$ make test

Changelog

Only breaking and the most important changes are described here. The full changelog and documentation for all released versions can be found in the nicely formatted commit history.

v1.1

  • Added support for composite primary keys (loading to pandas)

v1.0

  • Initial driver implementation

tableschema-pandas-py's People

Contributors

roll, sirex, scls19fr, danfowler, pmlandwehr, rflprr


tableschema-pandas-py's Issues

Quickstart and odd data___data

Hello,

After having a look at the Quickstart, I wonder why data___data is available (and necessary) to output one resource as a Pandas DataFrame. This kind of detail should be hidden from the user.

Kind regards

Update to jsontableschema-v0.7

Overview

The driver should be updated to jsontableschema-v0.7 changes.

Tasks

  • fix breaking storage changes
  • add new storage features
  • remove deprecated code

Test issue

Overview

Please replace this line with full information about your idea or problem. If it's a bug, share as much as possible to reproduce it.


Please preserve this line to notify @roll (lead of this repository)

Validating large pandas data frames

Overview

I am looking for an easy way to validate pandas data frames. I have previously found an option that requires converting the data frame into dictionary records. Since I am now working with rather large tables (up to 1.5 M rows), I want to avoid this conversion and validate them directly.

Since this package is about tableschema, I was hoping to find something more specific here but the readme is rather sparse and seems to be more about data packages (which I thought would be the job of this other package).

Am I in the wrong place? Is there some other option that I have missed? Thank you for your insight.


Please preserve this line to notify @roll (lead of this repository)

Error raised from urllib2

I get an error whenever I try this with a valid url (that works through my browser), even with the example given on the blog.

urllib2.HTTPError: HTTP Error 404: Not Found

Having had a bit of a search, it seems I would need to set the User-Agent header on the HTTP request via urllib2.Request. Any idea whether I could do that through datapackage.push_datapackage?
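For reference, setting the User-Agent directly on the raw request with urllib2 looks roughly like the sketch below; whether such a header can be passed through datapackage.push_datapackage is exactly the open question here.

import urllib2  # Python 2 standard library

url = 'http://data.okfn.org/data/core/country-list/datapackage.json'
# Attach a browser-like User-Agent so the server does not reject the request
request = urllib2.Request(url, headers={'User-Agent': 'Mozilla/5.0'})
response = urllib2.urlopen(request)
print(response.read()[:200])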

Error running example code (again)

The example code provided in the README does not work with the current versions.

Versions used

Python==3.6.8
datapackage==1.10.0
tableschema-pandas==1.1.0

How to reproduce

Dependencies

pip install datapackage==1.10.0 tableschema-pandas==1.1.0

Code

>>> import datapackage
>>> data_url = 'http://data.okfn.org/data/core/country-list/datapackage.json'

>>> storage = datapackage.push_datapackage(data_url, 'pandas')
/home/user/project/env/lib/python3.6/site-packages/datapackage/pushpull.py:39: UserWarning: Functions "push/pull_datapackage" are deprecated. Please use "Package" class
  UserWarning)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/user/project/env/lib/python3.6/site-packages/datapackage/pushpull.py", line 51, in push_datapackage
    plugin = import_module('jsontableschema.plugins.%s' % backend)
  File "/usr/lib/python3.6/importlib/__init__.py", line 126, in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
  File "<frozen importlib._bootstrap>", line 994, in _gcd_import
  File "<frozen importlib._bootstrap>", line 971, in _find_and_load
  File "<frozen importlib._bootstrap>", line 941, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 219, in _call_with_frames_removed
  File "<frozen importlib._bootstrap>", line 994, in _gcd_import
  File "<frozen importlib._bootstrap>", line 971, in _find_and_load
  File "<frozen importlib._bootstrap>", line 941, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 219, in _call_with_frames_removed
  File "<frozen importlib._bootstrap>", line 994, in _gcd_import
  File "<frozen importlib._bootstrap>", line 971, in _find_and_load
  File "<frozen importlib._bootstrap>", line 953, in _find_and_load_unlocked
ModuleNotFoundError: No module named 'jsontableschema'

Note: this is not the same issue as #27.

Please preserve this line to notify @roll (lead of this repository)

Use DatetimeIndex when appropriate instead of Index when creating DataFrames

Given the following data and schema (note: "Date" is primaryKey with a type of datetime):

Date                  VIXClose  VIXHigh  VIXLow  VIXOpen
2004-01-05T00:00:00Z     17.49    18.49   17.44    18.45
2004-01-06T00:00:00Z     16.73    17.67   16.19    17.66
"schema": {
  "fields": [
    {
      "name": "Date",
      "type": "datetime",
    },
    {
      "name": "VIXOpen",
      "type": "number",
    },
    {
      "name": "VIXHigh",
      "type": "number",
    },
    {
      "name": "VIXLow",
      "type": "number",
    },
    {
      "name": "VIXClose",
      "type": "number",
    }
  ],
  "primaryKey": "Date"
}

create_data_frame will create a DataFrame with an index (using that primaryKey) that looks like this:

Index([1073001600000000000, 1073260800000000000, 1073347200000000000,
       1073433600000000000, 1073520000000000000, 1073606400000000000,
       1073865600000000000, 1073952000000000000, 1074038400000000000,
       1074124800000000000,
       ...
       1459382400000000000, 1459468800000000000, 1459728000000000000,
       1459814400000000000, 1459900800000000000, 1459987200000000000,
       1460073600000000000, 1460332800000000000, 1460419200000000000,
       1460505600000000000],
      dtype='object', name='Date', length=3091)

When creating a new DataFrame from a table, if the table has a primaryKey with a type of datetime, the index should be created using pandas.DatetimeIndex instead of pandas.Index. pandas.DatetimeIndex will create the expected index from this data:

DatetimeIndex(['2004-01-02', '2004-01-05', '2004-01-06', '2004-01-07',
               '2004-01-08', '2004-01-09', '2004-01-12', '2004-01-13',
               '2004-01-14', '2004-01-15',
               ...
               '2016-03-31', '2016-04-01', '2016-04-04', '2016-04-05',
               '2016-04-06', '2016-04-07', '2016-04-08', '2016-04-11',
               '2016-04-12', '2016-04-13'],
              dtype='datetime64[ns]', name='Date', length=3091, freq=None)
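For illustration only, a minimal sketch of the requested conversion, using two of the nanosecond values from the integer index above (the variable names are made up for this example and are not taken from the driver's code):

import pandas as pd

# Integer index as currently produced (nanoseconds since the Unix epoch)
raw = pd.Index([1073001600000000000, 1073260800000000000], dtype='object', name='Date')

# What the driver could do instead when the primaryKey field has type 'datetime'
converted = pd.DatetimeIndex(pd.to_datetime(raw.astype('int64')), name='Date')
print(converted)
# DatetimeIndex(['2004-01-02', '2004-01-05'], dtype='datetime64[ns]', name='Date', freq=None)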

Current dependency bounds are causing warnings on install

Overview

Currently installing the following dependency list:

install_requires =
  click >= 7.1.2, < 8
  datapackage >= 1.14.1, < 2
  jsontableschema-pandas >= 0.5.0, < 1
  pandas >= 1.0.4, < 2
  rich >= 2.2.3, < 3
  typer >= 0.2.1, < 1

The dependency bounds on the package are causing warnings, which may turn into errors if we let this sit for too long. Here are the warnings I am currently seeing. One is from a sub-dependency, which may need a separate issue. Let's see.

ERROR: jsontableschema 0.10.1 has requirement click<7.0,>=3.3, but you'll have click 7.1.2 which is incompatible.
ERROR: jsontableschema 0.10.1 has requirement jsonschema<3.0,>=2.5, but you'll have jsonschema 3.2.0 which is incompatible.
ERROR: tableschema 1.19.2 has requirement rfc3986>=1.1.0, but you'll have rfc3986 0.4.1 which is incompatible.


Please preserve this line to notify @roll (lead of this repository)

Reading into DataFrame

A question on the scope of this project: I'd like to read a Data Package resource directly into a DataFrame, something like

from_datapackage("un-locode", resource="Country codes")

Or is it always assumed to be used with a "storage"?
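For context, a minimal sketch of the storage-based path documented in the README above, which is currently the closest equivalent (the bucket name follows that example; the bucket name for another package's resource may differ):

from datapackage import Package

package = Package('http://data.okfn.org/data/core/country-list/datapackage.json')
storage = package.save(storage='pandas')
df = storage['data']  # pandas.DataFrame for the package's 'data' resource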

Error running example code

I'm having a problem running the example code in your README.md on Python 3.6.2. I've tried the latest versions of datapackage and jsontableschema-pandas, both from PyPI and from GitHub. The tests for jsontableschema-pandas run OK, however, even for py36 (which I added to tox.ini).

>>> import datapackage
>>> storage = datapackage.push_datapackage('http://data.okfn.org/data/core/country-list/datapackage.json', 'pandas')
/src/datapackage/datapackage/package.py:78: UserWarning: Resource property "url: <url>" is deprecated. Please use "path: [url]" instead (as array).
  UserWarning)
/src/datapackage/datapackage/resource.py:359: UserWarning: Property "resource.table" is deprecated. Please use "resource.iter/read" directly.
  UserWarning)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/src/datapackage/datapackage/pushpull.py", line 52, in push_datapackage
    data = resource.table.iter(keyed=True)
  File "/src/datapackage/datapackage/resource.py", line 361, in table
    return self.__get_table()
  File "/src/datapackage/datapackage/resource.py", line 322, in __get_table
    options = _get_table_options(self.descriptor)
  File "/src/datapackage/datapackage/resource.py", line 463, in _get_table_options
    if not dialect['header']:
KeyError: 'header'
