turicas / rows

A common, beautiful interface to tabular data, no matter the format

License: GNU Lesser General Public License v3.0

Python 82.40% Makefile 0.22% HTML 17.25% Dockerfile 0.13%
python tabular-data convert-data data-science csv excel xlsx xls table data hacktoberfest

rows's Introduction

rows

No matter which format your tabular data is in: rows will import it, automatically detect types and give you high-level Python objects so you can start working with the data instead of trying to parse it. It is also locale- and Unicode-aware. :)

Want to learn more? Read the documentation (or build and browse the docs locally by running make docs-serve after installing requirements-development.txt).

Installation

The easiest way to get your hands dirty is to install rows using pip:

pip install rows

For other ways to install, refer to the Installation section of the documentation.
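
Once installed, a minimal usage sketch (based on the library's documented import_from_csv API and attribute-style row access; "my_data.csv" is a placeholder path):

import rows

# Import a CSV file: column types are detected automatically.
table = rows.import_from_csv("my_data.csv")

# `table.fields` maps field names to the detected field types.
print(table.fields)

# Each row is a high-level Python object (attribute access per column).
for row in table:
    print(row)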

Contribution start guide

The preferred way to start contributing to the project is to create a virtualenv (you can do this using virtualenv, virtualenvwrapper, pyenv or whatever tool you'd like).

Create the virtualenv:

mkvirtualenv rows

Install all plugins' dependencies:

pip install --editable .[all]

Install development dependencies:

pip install -r requirements-development.txt

rows's People

Contributors

arloc, augusto-herrmann, berinhard, cuducos, danieldrumond, diegosouza, disouzaleo, ellisonleao, ericof, gitter-badger, humrochagf, infog, israelst, izabelacborges, jeanferri, jsbueno, kretcheu, marcelometal, marcosvbras, mbaraldiciandt, naanadr, narrowfail, raphaelguim, rhenanbartels, romulocollopy, rossjones, sxslex, tian2992, turicas

rows's Issues

Create Field types to replace converters

Create a better way to express data types and converters, and provide many useful, already-implemented, full-featured fields (we already did this one! It just needs more tests).

  • Create all desired field types
  • Check if every field type will be locale-aware automatically

Create `rows.plugins.utils.prepare_to_export`

It's something like serialize, but it does not actually serialize: it only filters the rows which will be exported and returns high-level Python objects.

Can take some code from export_to_xls.

PDF Plugin

Create an algorithm to automatically extract tables from PDFs (available in text format).
We could use pdftables, but its code is not up-to-date, does not work with Python 3, etc.

Auto discover plugins

Create an automated way to discover installed plugins and give users the ability to create their own plugins and upload them to PyPI without our intervention (like nose does, using setuptools' entry points).
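
A sketch of how entry-point-based discovery could look (the "rows.plugins" group name is hypothetical; importlib.metadata is the stdlib reader for setuptools entry points and the `group=` argument assumes Python 3.10+):

from importlib.metadata import entry_points

def discover_plugins(group="rows.plugins"):
    # Load every plugin registered under the (hypothetical) "rows.plugins"
    # entry-point group, so third-party packages installed from PyPI are
    # picked up without any change to rows itself.
    plugins = {}
    for entry_point in entry_points(group=group):
        # Each entry point maps a name (e.g. "csv") to a callable or module
        # declared in the plugin package's setup.py/pyproject.toml.
        plugins[entry_point.name] = entry_point.load()
    return plugins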

Add currency converter

Convert a string made of a number and a currency symbol to a float or to a Currency type.

I think we should write a simple version of this converter supporting a few frequently used currencies, such as:

  • Dollar
  • Euro
  • Yen
  • Real
  • Bitcoin

And then, eventually, build it seriously as another project.

What do you think?
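
A rough sketch of what such a converter could look like (the symbol table and function are hypothetical; a serious implementation would also reuse rows' locale handling for thousand/decimal separators):

from decimal import Decimal

# Hypothetical mapping of currency symbols/prefixes to ISO-like codes.
CURRENCY_SYMBOLS = {"US$": "USD", "$": "USD", "€": "EUR", "¥": "JPY", "R$": "BRL", "₿": "BTC"}

def deserialize_currency(value):
    # Split a string such as "US$ 1234.56" into (Decimal("1234.56"), "USD").
    # Locale-specific separators are ignored in this sketch.
    value = value.strip()
    for symbol, code in CURRENCY_SYMBOLS.items():
        if value.startswith(symbol):
            number = value[len(symbol):].strip().replace(",", "")
            return Decimal(number), code
    raise ValueError("no known currency symbol in {!r}".format(value))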

Create a website

Should have:

  • General information
  • Links to GitHub (code, issues etc.)
  • Installation instructions (using pip, setup.py and apt-get)
  • Documentation (may use the continuous docs approach)

Filter field names when exporting

Users should be able to export only some fields of a Table. The option may be added to serialize (or actually prepare_to_export -- see #54) so all export_to_* will benefit from it.

Create output converters

Currently, converters work only for converting input (raw) data to native Python types; we need to add support for custom converters to export native types (for example, datetime.date objects will always be exported using the %Y-%m-%d format, but it should be possible to provide an "output converter" that receives the object and returns the converted raw value).
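
A sketch of the idea, assuming a serialize hook on the field class (the DateField base and the hook name/signature below are assumptions, not necessarily the current API):

import rows.fields

class BrazilianDateField(rows.fields.DateField):
    # Hypothetical field with a custom *output* converter: values stay
    # datetime.date objects internally, but are exported as DD/MM/YYYY
    # instead of the default %Y-%m-%d.
    @classmethod
    def serialize(cls, value, *args, **kwargs):
        if value is None:
            return ""
        return value.strftime("%d/%m/%Y")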

Create a `TableList` class

Similar to a "workbook" from xlrd: a collection of Table objects, each one with its own properties (like name) and a link to the TableList ("list" is better than "set" here because order matters).

Could be used in the plugins: XLS, JSON, HTML and maybe others.
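
A minimal sketch of such a class (hypothetical; it only illustrates keeping order while also allowing lookup by table name):

class TableList(list):
    # Hypothetical ordered collection of Table objects, akin to an xlrd
    # workbook: order is preserved (it is a list) and tables can also be
    # looked up by their `name` attribute.
    def __getitem__(self, key):
        if isinstance(key, str):
            for table in self:
                if getattr(table, "name", None) == key:
                    return table
            raise KeyError(key)
        return super().__getitem__(key)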

Implement plugin JSON

The idea is to export an array of objects, where each row is a (JS) object. For example, the file examples/data.csv would be encoded like this:

[
  {
    "username": "turicas", 
    "birthday": "1987-04-29", 
    "id": 1
  }, 
  {
    "username": "another-user", 
    "birthday": "2000-01-01", 
    "id": 2
  }
]
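
A sketch of the export side using only the standard library (the function name mirrors the other export_to_* plugins; field serialization is simplified to str() for non-JSON-native values such as dates):

import json

def export_to_json(table, filename):
    # Encode each row as a (JS) object inside a top-level array, as in the
    # example above. Non-JSON-native values are simply stringified here.
    field_names = list(table.fields.keys())
    data = [{name: getattr(row, name) for name in field_names} for row in table]
    with open(filename, "w", encoding="utf-8") as fobj:
        json.dump(data, fobj, indent=2, default=str)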

Should we represent Table's fields declaration and row instances as a class?

Currently we use two data types to represent something that could be represented in one class. The first is the fields parameter received by import_from_* (which is passed to utils.create_table), like:

UWSGI_FIELDS = OrderedDict([('pid', rows.fields.IntegerField),
                            ('ip', rows.fields.UnicodeField),
                            ('datetime', rows.fields.DatetimeField),
                            ('http_verb', rows.fields.UnicodeField),
                            ('http_path', rows.fields.UnicodeField),
                            ('generation_time', rows.fields.FloatField),
                            ('http_version', rows.fields.FloatField),
                            ('http_status', rows.fields.IntegerField)])

The second is Table.Row (created in Table.__init__), which is a namedtuple containing row data.

We could use an approach similar to ORMs and use a class to define the fields, like Django does. We could start with something like this:

class UwsgiLog(rows.Row):
    pid = rows.fields.IntegerField()
    ip = rows.fields.UnicodeField()
    datetime = rows.fields.DatetimeField()
    http_verb = rows.fields.UnicodeField()
    http_path = rows.fields.UnicodeField()
    generation_time = rows.fields.FloatField()
    http_version = rows.fields.FloatField()
    http_status = rows.fields.IntegerField()    

And the Table rows (returned when we iterate over it) will be instances of UwsgiLog.

Pros:

  • This syntax is more flexible since we can create utility methods inside the class
  • More declarative

Cons:

  • We may not have access to the field order in this case (which is very important) -- see the sketch after this list for one way to preserve it
  • namedtuple is probably faster than any other customized class
  • We'll need to add more complexity to the code
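
On the field-order concern: in Python 3.6+ the class body namespace already preserves definition order, and Django-style creation counters work on older versions too. A hedged sketch of a metaclass collecting fields in declaration order (all names here are hypothetical):

import itertools

class Field:
    # Hypothetical base field: records the order in which instances are
    # created so declaration order can be recovered even on old Pythons.
    _counter = itertools.count()

    def __init__(self):
        self._order = next(Field._counter)

class RowMeta(type):
    def __new__(mcs, name, bases, namespace):
        cls = super().__new__(mcs, name, bases, namespace)
        # Collect declared fields sorted by creation order (on Python 3.6+
        # `namespace` is already ordered, so the sort is a no-op).
        declared = [(attr, value) for attr, value in namespace.items()
                    if isinstance(value, Field)]
        cls._fields = sorted(declared, key=lambda item: item[1]._order)
        return cls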

Note: check if we can integrate this feature with scrapy so it'll be easier to parse data using rows in a scrapy project.

Add option to change row class

The row class returned when iterating over rows.Table could be a dict, a collections.namedtuple or even a customized class representing that data. We need to provide a way for the user to specify this class. Preferably, the user should pass a class factory (think of collections.namedtuple: when you call it, the object returned is a Python class).
We may create an interface for it (instead of just passing dict, for example, we may need to create a RowDict class that does some things).

Another option: use the attrs library.

Possible API: add row_class parameter to import functions, like in:

for book in rows.import_from_csv(csv_path, row_class=dict):
    print(book["title"])

Related to #304.

Note: check if we can integrate this feature with scrapy so it'll be easier to parse data using rows in a scrapy project.

SQLite Plugin

It can be easily implemented based on the MySQL plugin.

Stabilize plugins calling API

Create a more stable API for calling plugins (rows.import_from_X, for example).

  • Define the way we're going to call the plugins (import and export)
  • Define which default parameters every plugin will have (such as lazy, callback etc.)

Design issues

Some decisions need to be made before we declare the API as stable. We can put
here all the questions for discussion (we should answer these questions as soon
as possible since they impact the current implementation and would cause rework
if delayed).

(A) About rows.Table

  • A.1) What about laziness? Should rows.Table always be lazy? Always
    not lazy? Support both? What are the implications? If it's lazy, how do we
    deal with deletion and addition of rows?
  • A.2) How should we handle row filtering? What would be the best API?
    For example: we have a rows.Table with many rows but want to filter some
    rows. Should we provide a special method for this or use Python's built-in
    filter? Using Python's built-in filter would be the more Pythonic way but
    we can optimize some operations on certain plugins if we provide a special
    method (example: filtering on a MySQL-based Table).
  • A.3) What if we want to import everything filtered? It's not a filter
    on a pre-existing rows.Table like in question A.2: it's a filter to be
    executed during importation process so we're going to import only some rows.
  • A.4) We should provide an API to modify the current rows during
    iteration over the Table. The user can specify a custom function that will
    receive a Table.Row object and return a new one (which should be returned
    when iterating over the Table). This way we can deal with the addition of
    new fields and other custom operations on the fly. How should we expose
    this API? This implementation may solve the problem in question A.3.
  • A.5) The default row class is a collections.namedtuple. What is the
    best API to change it? Should the default be another one? If we want an
    object with read-write access and also value access via attributes
    AttrDict would be a good option.
    Should we add metadata to the row instance, like its index on that Table?
    See sqlite3.Row and other Python's DBAPI implementations.
  • A.6) rows' current architecture is good for importing and exporting
    data but is not well suited for working with that data. One of the key facts
    is that we cannot create a Table from a CSV, change some rows' values and
    save it to the same CSV without doing a batch operation. Should we implement
    read-write access? It can add a lot of complication to the implementation
    (not only the Table itself but also the plugins) since we'll need to deal
    with problems like seeking through the rows and saving/flushing partial data
    (not the entire set), among other problems.
  • A.7) As many users will use rows to import-and-export data it'd be
    handy if we have a shortcut (and maybe some optimizations) to do it. If the
    entire Table is lazy we may not need this shortcut because we can iterate
    over one Table (in a lazy way) at the same time we're saving into another.
  • A.8) Should we implement __add__ (so, for example,
    sum([table1, table2, ..., tableN]) will return another Table with all the
    rows -- but only if all tables' field types are the same)? What metadata
    should remain?
  • A.9) Which other operations should be implemented? Join, intersect,
    ...?

(B) About rows.fields

  • B.1) Should field instances (values, actually) be native Python
    objects or custom objects (based on custom classes)? I'm inclined to use
    native Python objects (as is implemented today).

(C) About Plugins

(D) About CLI

  • D.1) Should we implement --query (to query using SQL -- same as
    import-and-filter)?

(E) Other

  • E.1) How to deal with Table collections? Examples: an XLS file can have more than one sheet (each one is a rows.Table itself), an HTML file could contain more than one <table>. See how tablib deals with it.
  • E.2) See sqlite's detect_types.

Upload to PyPI

Hi @turicas, I was surprised this is not on PyPI yet. I know you put a note in the README, but do you think it's still not "good enough" to go to PyPI?

Thanks for the library!

Stabilize rows.Table API

Create a more stable API for the rows.Table class, regarding access to its rows and utility methods.

Support more types

  • Look into other tabular formats and also into SQLAlchemy and Django field types to create a list of possible new types
  • Define which types we're going to support

List of possible types:

Filter field names when importing

Users should be able to import only some fields into a Table. The option should be added to create_table so all import_from_* functions will benefit from it.

Do not assume Table._rows will always be in memory

Currently some operations assume all rows are in memory (such as order_by). We may move all the code to something lazy.
For order_by specifically, we could sort on disk instead of in memory, like csvsort does (a sketch follows the checklist below).

  • Implement LazyTable class
  • Add lazy parameter to rows.plugins.utils.create_table
  • Change all plugins to force (or not) laziness (example: in HTML
    lazy=False, in CSV lazy=True by default)
  • Change default sample size to 1000 (or any other non-arbitrary number)
  • Create documentation about it
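
A rough sketch of the on-disk sort mentioned above, using the classic external merge sort (rows are assumed to be plain tuples, the key is a single column index, and the function names are hypothetical):

import heapq
import pickle
import tempfile
from operator import itemgetter

def _spill(chunk, key):
    # Sort one chunk in memory and write it, row by row, to a temp file.
    chunk.sort(key=key)
    fobj = tempfile.TemporaryFile()
    for row in chunk:
        pickle.dump(row, fobj)
    fobj.seek(0)
    return fobj

def _read_rows(fobj):
    # Yield rows back from a spilled chunk, one at a time.
    while True:
        try:
            yield pickle.load(fobj)
        except EOFError:
            return

def lazy_order_by(row_iterable, key_index, chunk_size=100000):
    # External merge sort: keeps at most `chunk_size` rows in memory.
    key = itemgetter(key_index)
    files, chunk = [], []
    for row in row_iterable:
        chunk.append(row)
        if len(chunk) >= chunk_size:
            files.append(_spill(chunk, key))
            chunk = []
    if chunk:
        files.append(_spill(chunk, key))
    return heapq.merge(*(_read_rows(fobj) for fobj in files), key=key)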

Stabilize plugin creation API

Create a better API for writing plugins, taking into consideration that the library will try to do as much as it can, so the plugin's only job will be to import/export data (rows will automatically deal with importing data in a lazy way or not, for example -- the plugin should only provide a generator in the import function).

ODS Plugin

Since an ODS file is just a ZIP file with an XML file and other metadata files inside (the spreadsheet data actually goes in the XML), we can use lxml (as we're already using it in the HTML plugin) to deal with it.

There are two approaches, actually:
1- Use lxml (maybe slower, but better to maintain and more accurate)
2- Use regular expressions (maybe faster, but not as accurate or easy to maintain)
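
A sketch of the lxml approach (the namespace URIs are the standard OpenDocument ones; cell extraction is simplified to plain text and ignores repeated-cell/column attributes):

import zipfile

from lxml import etree

NAMESPACES = {
    "table": "urn:oasis:names:tc:opendocument:xmlns:table:1.0",
    "text": "urn:oasis:names:tc:opendocument:xmlns:text:1.0",
}

def read_ods(filename):
    # Yield (sheet_name, rows) pairs; each row is a list of cell strings.
    with zipfile.ZipFile(filename) as ods:
        content = etree.fromstring(ods.read("content.xml"))
    for sheet in content.iterfind(".//table:table", NAMESPACES):
        rows = []
        for row in sheet.iterfind("table:table-row", NAMESPACES):
            cells = []
            for cell in row.iterfind("table:table-cell", NAMESPACES):
                # Cell text lives in one or more <text:p> children.
                paragraphs = cell.iterfind("text:p", NAMESPACES)
                cells.append("\n".join(p.xpath("string()") for p in paragraphs))
            rows.append(cells)
        yield sheet.get("{%s}name" % NAMESPACES["table"]), rows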

Should be able to query tables using SQL

The command-line interface should expose an option to query the table the user is accessing.

  • The query language should be SQL;
  • If the table is not in a SQL database, it should convert the table internally to SQLite, then execute the query (in this case, the table name will be table);
  • If the output is not set, it should export the result to text and print it on standard output.

Usage example:

rows --from examples/data.csv --query 'SELECT * FROM table WHERE username == "turicas"'

It should print:

+----+----------+------------+
| id | username |  birthday  |
+----+----------+------------+
|  1 |  turicas | 1987-04-29 |
+----+----------+------------+
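
A sketch of the import-to-SQLite-then-query flow using only the standard library (all columns are created as plain TEXT here, while the real feature would reuse rows' type detection; note that `table` is an SQL keyword, so it has to be quoted in the CREATE/INSERT statements):

import csv
import sqlite3

def query_csv(csv_path, query):
    # Load the CSV into an in-memory SQLite table named "table" and run
    # the query against it.
    connection = sqlite3.connect(":memory:")
    with open(csv_path, newline="", encoding="utf-8") as fobj:
        reader = csv.reader(fobj)
        header = next(reader)
        columns = ", ".join('"{}"'.format(name) for name in header)
        placeholders = ", ".join("?" for _ in header)
        connection.execute('CREATE TABLE "table" ({})'.format(columns))
        connection.executemany('INSERT INTO "table" VALUES ({})'.format(placeholders), reader)
    return connection.execute(query).fetchall()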

Migrate tests

I have written many tests for all available plugins in the outputty library, but we need to migrate them to support the new API (rows).
