turicas / rows

A common, beautiful interface to tabular data, no matter the format

License: GNU Lesser General Public License v3.0

Python 82.40% Makefile 0.22% HTML 17.25% Dockerfile 0.13%
python tabular-data convert-data data-science csv excel xlsx xls table data hacktoberfest

rows's Introduction

rows

No matter which format your tabular data is in: rows will import it, automatically detect types and give you high-level Python objects so you can start working with the data instead of trying to parse it. It is also locale- and Unicode-aware. :)

Want to learn more? Read the documentation (or build and browse the docs locally by running make docs-serve after installing requirements-development.txt).

Installation

The easiest way to get your hands dirty is to install rows using pip:

pip install rows

For other ways to install, refer to the Installation section of the documentation.
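
Once installed, a minimal usage sketch (based on the library's documented import_from_csv API and attribute-style row access; "my_data.csv" is a placeholder path):

import rows

# Import a CSV file: column types are detected automatically.
table = rows.import_from_csv("my_data.csv")

# `table.fields` maps field names to the detected field types.
print(table.fields)

# Each row is a high-level Python object (attribute access per column).
for row in table:
    print(row)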

Contribution start guide

The preferred way to start contributing to the project is to create a virtualenv (you can do this using virtualenv, virtualenvwrapper, pyenv or whatever tool you'd like).

Create the virtualenv:

mkvirtualenv rows

Install all plugins' dependencies:

pip install --editable .[all]

Install development dependencies:

pip install -r requirements-development.txt

rows's People

Contributors

arloc, augusto-herrmann, berinhard, cuducos, danieldrumond, diegosouza, disouzaleo, ellisonleao, ericof, gitter-badger, humrochagf, infog, israelst, izabelacborges, jeanferri, jsbueno, kretcheu, marcelometal, marcosvbras, mbaraldiciandt, naanadr, narrowfail, raphaelguim, rhenanbartels, romulocollopy, rossjones, sxslex, tian2992, turicas

rows's Issues

Create Field types to replace converters

Create a better way to express data types and converters, and provide many useful, already-implemented, full-featured fields (we already did this one! It just needs more tests).

  • Create all desired field types
  • Check if every field type will be locale-aware automatically

Create `rows.plugins.utils.prepare_to_export`

It's something like serialize, but it does not actually serialize: it only filters the rows which will be exported and returns high-level Python objects.

Can take some code from export_to_xls.

PDF Plugin

Create an algorithm to automatically extract tables from PDFs (available in text format).
We could use pdftables, but its code is not up-to-date, does not work with Python 3, etc.

Auto discover plugins

Create an automated way to discover installed plugins and give users the ability to create their own plugins and upload them to PyPI without our intervention (like nose does, using setuptools' entry points).
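
A sketch of how entry-point-based discovery could look (the "rows.plugins" group name is hypothetical; importlib.metadata is the stdlib reader for setuptools entry points and the `group=` argument assumes Python 3.10+):

from importlib.metadata import entry_points

def discover_plugins(group="rows.plugins"):
    # Load every plugin registered under the (hypothetical) "rows.plugins"
    # entry-point group, so third-party packages installed from PyPI are
    # picked up without any change to rows itself.
    plugins = {}
    for entry_point in entry_points(group=group):
        # Each entry point maps a name (e.g. "csv") to a callable or module
        # declared in the plugin package's setup.py/pyproject.toml.
        plugins[entry_point.name] = entry_point.load()
    return plugins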

Add currency converter

Convert a string made of a number and a currency symbol to a float or to a Currency type.

I think we should write a simple version of this converter supporting a few frequently used currencies, such as:

  • Dollar
  • Euro
  • Yen
  • Real
  • Bitcoin

And then, eventually, build it seriously as another project.

What do you think?
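
A rough sketch of what such a converter could look like (the symbol table and function are hypothetical; a serious implementation would also reuse rows' locale handling for thousand/decimal separators):

from decimal import Decimal

# Hypothetical mapping of currency symbols/prefixes to ISO-like codes.
CURRENCY_SYMBOLS = {"US$": "USD", "$": "USD", "€": "EUR", "¥": "JPY", "R$": "BRL", "₿": "BTC"}

def deserialize_currency(value):
    # Split a string such as "US$ 1234.56" into (Decimal("1234.56"), "USD").
    # Locale-specific separators are ignored in this sketch.
    value = value.strip()
    for symbol, code in CURRENCY_SYMBOLS.items():
        if value.startswith(symbol):
            number = value[len(symbol):].strip().replace(",", "")
            return Decimal(number), code
    raise ValueError("no known currency symbol in {!r}".format(value))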

Create a website

Should have:

  • General information
  • Links to GitHub (code, issues etc.)
  • Installation instructions (using pip, setup.py and apt-get)
  • Documentation (may use the continuous docs approach)

Filter field names when exporting

Users should be able to export only some fields of a Table. The option may be added to serialize (or actually prepare_to_export -- see #54) so all export_to_* will benefit from it.

Create output converters

Currently, converters work only for converting input (raw) data to native Python types; we need to add support for custom converters to export native types (for example, datetime.date objects will always be exported using the %Y-%m-%d format, but it should be possible to provide an "output converter" that receives the object and returns the converted raw value).
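
A sketch of the idea, assuming a serialize hook on the field class (the DateField base and the hook name/signature below are assumptions, not necessarily the current API):

import rows.fields

class BrazilianDateField(rows.fields.DateField):
    # Hypothetical field with a custom *output* converter: values stay
    # datetime.date objects internally, but are exported as DD/MM/YYYY
    # instead of the default %Y-%m-%d.
    @classmethod
    def serialize(cls, value, *args, **kwargs):
        if value is None:
            return ""
        return value.strftime("%d/%m/%Y")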

Create a `TableList` class

Similar to a "workbook" from xlrd: a collection of Table objects, each one with its own properties (like name) and a link to the TableList ("list" is better than "set" here because order matters).

Could be used in the plugins: XLS, JSON, HTML and maybe others.
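
A minimal sketch of such a class (hypothetical; it only illustrates keeping order while also allowing lookup by table name):

class TableList(list):
    # Hypothetical ordered collection of Table objects, akin to an xlrd
    # workbook: order is preserved (it is a list) and tables can also be
    # looked up by their `name` attribute.
    def __getitem__(self, key):
        if isinstance(key, str):
            for table in self:
                if getattr(table, "name", None) == key:
                    return table
            raise KeyError(key)
        return super().__getitem__(key)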

Implement plugin JSON

The idea is to export an array of objects, where each row is a (JS) object. For example, the file examples/data.csv would be encoded like this:

[
  {
    "username": "turicas", 
    "birthday": "1987-04-29", 
    "id": 1
  }, 
  {
    "username": "another-user", 
    "birthday": "2000-01-01", 
    "id": 2
  }
]
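
A sketch of the export side using only the standard library (the function name mirrors the other export_to_* plugins; field serialization is simplified to str() for non-JSON-native values such as dates):

import json

def export_to_json(table, filename):
    # Encode each row as a (JS) object inside a top-level array, as in the
    # example above. Non-JSON-native values are simply stringified here.
    field_names = list(table.fields.keys())
    data = [{name: getattr(row, name) for name in field_names} for row in table]
    with open(filename, "w", encoding="utf-8") as fobj:
        json.dump(data, fobj, indent=2, default=str)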

Should we represent Table's fields declaration and row instances as a class?

Currently we use two data types to represent something that could be represented in one class. The first is the fields parameter received by import_from_* (which is passed to utils.create_table), like:

UWSGI_FIELDS = OrderedDict([('pid', rows.fields.IntegerField),
                            ('ip', rows.fields.UnicodeField),
                            ('datetime', rows.fields.DatetimeField),
                            ('http_verb', rows.fields.UnicodeField),
                            ('http_path', rows.fields.UnicodeField),
                            ('generation_time', rows.fields.FloatField),
                            ('http_version', rows.fields.FloatField),
                            ('http_status', rows.fields.IntegerField)])

The second is Table.Row (created in Table.__init__), which is a namedtuple containing row data.

We could use an approach similar to ORMs and use a class to define the fields, like Django does. We could start with something like this:

class UwsgiLog(rows.Row):
    pid = rows.fields.IntegerField()
    ip = rows.fields.UnicodeField()
    datetime = rows.fields.DatetimeField()
    http_verb = rows.fields.UnicodeField()
    http_path = rows.fields.UnicodeField()
    generation_time = rows.fields.FloatField()
    http_version = rows.fields.FloatField()
    http_status = rows.fields.IntegerField()    

And the Table rows (returned when we iterate over it) will be instances of UwsgiLog.

Pros:

  • This syntax is more flexible since we can create utility methods inside the class
  • More declarative

Cons:

  • We may not have access to the field order in this case (which is very important) -- see the sketch after this list for one way to preserve it
  • namedtuple is probably faster than any other customized class
  • We'll need to add more complexity to the code
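
On the field-order concern: in Python 3.6+ the class body namespace already preserves definition order, and Django-style creation counters work on older versions too. A hedged sketch of a metaclass collecting fields in declaration order (all names here are hypothetical):

import itertools

class Field:
    # Hypothetical base field: records the order in which instances are
    # created so declaration order can be recovered even on old Pythons.
    _counter = itertools.count()

    def __init__(self):
        self._order = next(Field._counter)

class RowMeta(type):
    def __new__(mcs, name, bases, namespace):
        cls = super().__new__(mcs, name, bases, namespace)
        # Collect declared fields sorted by creation order (on Python 3.6+
        # `namespace` is already ordered, so the sort is a no-op).
        declared = [(attr, value) for attr, value in namespace.items()
                    if isinstance(value, Field)]
        cls._fields = sorted(declared, key=lambda item: item[1]._order)
        return cls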

Note: check if we can integrate this feature with scrapy so it'll be easier to parse data using rows in a scrapy project.

Add option to change row class

The row class returned when iterating over rows.Table could be a dict, a collections.namedtuple or even a customized class representing that data. We need to provide a way for the user to specify this class. Preferably, the user should pass a class factory (think of collections.namedtuple: when you call it, the object returned is a Python class).
We may create an interface for it (instead of just passing dict, for example, we may need to create a RowDict class that does some things).

Another option: use the attrs library.

Possible API: add row_class parameter to import functions, like in:

for book in rows.import_from_csv(csv_path, row_class=dict):
    print(book["title"])

Related to #304.

Note: check if we can integrate this feature with scrapy so it'll be easier to parse data using rows in a scrapy project.

SQLite Plugin

It can be easily implemented based on the MySQL plugin.

Stabilize plugins calling API

Create a more stable API for calling plugins (rows.import_from_X, for example).

  • Define the way we're going to call the plugins (import and export)
  • Define which default parameters every plugin will have (such as lazy, callback etc.)

Design issues

Some decisions need to be made before we declare the API as stable. We can put
here all the questions for discussion (we should answer these questions as soon
as possible since they impact the current implementation and would cause rework
if delayed).

(A) About rows.Table

  • A.1) What about laziness? Should rows.Table always be lazy? Always
    not lazy? Support both? What are the implications? If it's lazy, how do we
    deal with deletion and addition of rows?
  • A.2) How should we handle row filtering? What would be the best API?
    For example: we have a rows.Table with many rows but want to filter some
    rows. Should we provide a special method for this or use Python's built-in
    filter? Using Python's built-in filter would be the more Pythonic way but
    we can optimize some operations on certain plugins if we provide a special
    method (example: filtering on a MySQL-based Table).
  • A.3) What if we want to import everything filtered? It's not a filter
    on a pre-existing rows.Table like in question A.2: it's a filter to be
    executed during importation process so we're going to import only some rows.
  • A.4) We should provide an API to modify the current rows during
    iteration over the Table. The user can specify a custom function that will
    receive a Table.Row object and return a new one (which should be returned
    when iterating over the Table). This way we can deal with the addition of
    new fields and other custom operations on the fly. How should we expose
    this API? This implementation may solve the problem in question A.3.
  • A.5) The default row class is a collections.namedtuple. What is the
    best API to change it? Should the default be another one? If we want an
    object with read-write access and also value access via attributes
    AttrDict would be a good option.
    Should we add metadata to the row instance, like its index on that Table?
    See sqlite3.Row and other Python's DBAPI implementations.
  • A.6) rows' current architecture is good for importing and exporting
    data but is not well suited for working with that data. One of the key facts
    is that we cannot create a Table from a CSV, change some rows' values and
    save it to the same CSV without doing a batch operation. Should we implement
    read-write access? It can add a lot of complication to the implementation
    (not only the Table itself but also the plugins) since we'll need to deal
    with problems like seeking through the rows and saving/flushing partial data
    (not the entire set), among other problems.
  • A.7) As many users will use rows to import-and-export data it'd be
    handy if we have a shortcut (and maybe some optimizations) to do it. If the
    entire Table is lazy we may not need this shortcut because we can iterate
    over one Table (in a lazy way) at the same time we're saving into another.
  • A.8) Should we implement __add__ (so, for example,
    sum([table1, table2, ..., tableN]) will return another Table with all the
    rows -- but only if all tables' field types are the same)? What metadata
    should remain?
  • A.9) Which other operations should be implemented? Join, intersect,
    ...?

(B) About rows.fields

  • B.1) Should field instances (values, actually) be native Python
    objects or custom objects (based on custom classes)? I'm inclined to use
    native Python objects (as is implemented today).

(C) About Plugins

(D) About CLI

  • D.1) Should we implement --query (to query using SQL -- same as
    import-and-filter)?

(E) Other

  • E.1) How to deal with Table collections? Examples: an XLS file can have more than one sheet (each one is a rows.Table itself), an HTML file could contain more than one <table>. See how tablib deals with it.
  • E.2) See sqlite's detect_types.

Upload to PyPI

Hi @turicas, I was surprised this is not on PyPI yet. I know you put a note in the README, but do you think it's still not "good enough" to go to PyPI?

Thanks for the library!

Stabilize rows.Table API

Create a more stable API for the rows.Table class, regarding access to its rows and utility methods.

Support more types

  • Look into other tabular formats and also into SQLAlchemy and Django field types to create a list of possible new types
  • Define which types we're going to support

List of possible types:

Filter field names when importing

Users should be able to import only some fields into a Table. The option should be added to create_table so all import_from_* functions will benefit from it.

Do not assume Table._rows will always be in memory

Currently some operations assume all rows are in memory (such as order_by). We may move all the code to something lazy.
For order_by specifically, we could sort on disk instead of in memory, like csvsort does (a sketch follows the checklist below).

  • Implement LazyTable class
  • Add lazy parameter to rows.plugins.utils.create_table
  • Change all plugins to force (or not) laziness (example: in HTML
    lazy=False, in CSV lazy=True by default)
  • Change default sample size to 1000 (or any other non-arbitrary number)
  • Create documentation about it
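
A rough sketch of the on-disk sort mentioned above, using the classic external merge sort (rows are assumed to be plain tuples, the key is a single column index, and the function names are hypothetical):

import heapq
import pickle
import tempfile
from operator import itemgetter

def _spill(chunk, key):
    # Sort one chunk in memory and write it, row by row, to a temp file.
    chunk.sort(key=key)
    fobj = tempfile.TemporaryFile()
    for row in chunk:
        pickle.dump(row, fobj)
    fobj.seek(0)
    return fobj

def _read_rows(fobj):
    # Yield rows back from a spilled chunk, one at a time.
    while True:
        try:
            yield pickle.load(fobj)
        except EOFError:
            return

def lazy_order_by(row_iterable, key_index, chunk_size=100000):
    # External merge sort: keeps at most `chunk_size` rows in memory.
    key = itemgetter(key_index)
    files, chunk = [], []
    for row in row_iterable:
        chunk.append(row)
        if len(chunk) >= chunk_size:
            files.append(_spill(chunk, key))
            chunk = []
    if chunk:
        files.append(_spill(chunk, key))
    return heapq.merge(*(_read_rows(fobj) for fobj in files), key=key)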

Stabilize plugin creation API

Create a better API for writing plugins, taking into consideration that the library will try to do as much as it can, so the plugin's only job will be to import/export data (rows will automatically deal with importing data in a lazy way or not, for example -- the plugin should only provide a generator in the import function).

ODS Plugin

Since an ODS file is just a ZIP file with an XML file and other metadata files inside (the spreadsheet data actually goes in the XML), we can use lxml (as we're already using it in the HTML plugin) to deal with it.

There are two approaches, actually:
1- Use lxml (maybe slower, but better to maintain and more accurate)
2- Use regular expressions (maybe faster, but not as accurate or easy to maintain)
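
A sketch of the lxml approach (the namespace URIs are the standard OpenDocument ones; cell extraction is simplified to plain text and ignores repeated-cell/column attributes):

import zipfile

from lxml import etree

NAMESPACES = {
    "table": "urn:oasis:names:tc:opendocument:xmlns:table:1.0",
    "text": "urn:oasis:names:tc:opendocument:xmlns:text:1.0",
}

def read_ods(filename):
    # Yield (sheet_name, rows) pairs; each row is a list of cell strings.
    with zipfile.ZipFile(filename) as ods:
        content = etree.fromstring(ods.read("content.xml"))
    for sheet in content.iterfind(".//table:table", NAMESPACES):
        rows = []
        for row in sheet.iterfind("table:table-row", NAMESPACES):
            cells = []
            for cell in row.iterfind("table:table-cell", NAMESPACES):
                # Cell text lives in one or more <text:p> children.
                paragraphs = cell.iterfind("text:p", NAMESPACES)
                cells.append("\n".join(p.xpath("string()") for p in paragraphs))
            rows.append(cells)
        yield sheet.get("{%s}name" % NAMESPACES["table"]), rows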

Should be able to query tables using SQL

The command-line interface should expose an option to query the table the user is accessing.

  • The query language should be SQL;
  • If the table is not in a SQL database, it should convert the table internally to SQLite, then execute the query (in this case, the table name will be table);
  • If the output is not set, it should export the result to text and print it on standard output.

Usage example:

rows --from examples/data.csv --query 'SELECT * FROM table WHERE username == "turicas"'

It should print:

+----+----------+------------+
| id | username |  birthday  |
+----+----------+------------+
|  1 |  turicas | 1987-04-29 |
+----+----------+------------+
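
A sketch of the import-to-SQLite-then-query flow using only the standard library (all columns are created as plain TEXT here, while the real feature would reuse rows' type detection; note that `table` is an SQL keyword, so it has to be quoted in the CREATE/INSERT statements):

import csv
import sqlite3

def query_csv(csv_path, query):
    # Load the CSV into an in-memory SQLite table named "table" and run
    # the query against it.
    connection = sqlite3.connect(":memory:")
    with open(csv_path, newline="", encoding="utf-8") as fobj:
        reader = csv.reader(fobj)
        header = next(reader)
        columns = ", ".join('"{}"'.format(name) for name in header)
        placeholders = ", ".join("?" for _ in header)
        connection.execute('CREATE TABLE "table" ({})'.format(columns))
        connection.executemany('INSERT INTO "table" VALUES ({})'.format(placeholders), reader)
    return connection.execute(query).fetchall()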

Migrate tests

I have written many tests for all available plugins in the outputty library, but we need to migrate them to support the new API (rows).
