
petl-developers / petl

Python Extract Transform and Load Tables of Data

License: MIT License

Python 97.72% Jupyter Notebook 2.04% Shell 0.24%

petl's People

Contributors

alimanfoo, arturponinski, blais, bmaggard, bmos, dependabot[bot], dhait, dnicolodi, dusktreader, engstrom, fahadsiddiqui, florentx, henryrizzi, hugovk, icenine457, jfitzell, john-dennert, juarezr, miguelosana, mikmara1, mzaeemz, pjakobsen, rogerkwoodley, scardine, thatneat, timgates42, timheb, timhebzf, vilos, wassey16


petl's Issues

function to report progress, e.g., monitor/logger

Proposed to add a function 'progress' or 'monitor' or some similar name which basically wraps a row container, but outputs a message to stderr or stdout (or some other logging output?) every N rows, where N is configurable. E.g.:

>>> p = progress(sometbl, 1000, 'my progress: ')
>>> rowcount(p)
my progress: 1000 rows in 4s (250 rows/second); this batch in 4s (250 rows/second)
my progress: 2000 rows in 8s (250 rows/second); this batch in 4s (250 rows/second)
my progress: 2500 rows in 10s (250 rows/second)
2500

Maybe only makes sense in petl.interactive.
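A minimal generator-based sketch of what such a wrapper might look like (the name, signature, and message format are illustrative only, and the report is simplified relative to the example above):

import sys
import time

def progress(table, batchsize=1000, prefix=''):
    # hypothetical sketch: wrap a row container and report to stderr
    # every `batchsize` rows as they are pulled through
    start = batchstart = time.time()
    for n, row in enumerate(table, 1):
        yield row
        if n % batchsize == 0:
            now = time.time()
            sys.stderr.write('%s%d rows in %.0fs; this batch in %.0fs\n'
                             % (prefix, n, now - start, now - batchstart))
            batchstart = now

Note this sketch returns a one-shot generator; a real implementation would presumably return a re-iterable row container like other petl functions.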

rename cat to concat to avoid clash with unix command

Proposed to rename 'cat' to 'concat' to avoid a clash with the unix 'cat' command when working in the IPython interactive shell (I find I want to use shell cat more often than petl cat, so having to remember to add an initial '!' is annoying and I keep forgetting to do it).

add field function

The extend function is useful, but often you want to add a new field not at the right of existing fields but at the left or at some specific index. This can be done with extend followed by cut, but it would be more convenient to have a single function that behaved like extend but also allowed specification of the index position where the new field should be inserted.

Proposed to add a new function addfield as described above.
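Hypothetical usage of the proposed function, assuming an optional index keyword arg (the signature shown is illustrative):

>>> # insert 'baz' as the first field, with a constant value
>>> t2 = addfield(t1, 'baz', 42, index=0)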

standard methods on row container base class

Proposed to implement support for standard functions like len() on the row container base class, and maybe row access and slicing via subscript notation, so row containers can be used more like lists.
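For illustration, usage might look something like this if implemented (purely hypothetical):

>>> tbl = fromcsv('foo.csv')
>>> len(tbl)   # number of rows
>>> tbl[0]     # first row
>>> tbl[1:4]   # slice of rows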

diff not showing subtractions

Trying out your module, I notice that diff is not showing subtractions when the changed file does not have lines from the original. I assume that if the changed file does not have the same lines as the original file, you should see subtractions. Could you please take a look into this?

Test Code

from petl import *

f1 = fromcsv('list1', delimiter=',')
f2 = fromcsv('list2', delimiter=',')

print '- file 1'
for i in f1:
    print i
print ''
print '- file 2'
for i in f2:
    print i

print ''
print '#'
print '#'
print ''

a, s = diff(f1, f2)
print look(a)
print look(s)

print '#'

a, s = recorddiff(f1, f2)
print look(a)
print look(s)

OUTPUT

$ ./test.py

- file 1
('h1', 'h2')
('test1', 'hello1')
('test2', 'hello2')

- file 2
('h1', 'h2')

+------+------+
| 'h1' | 'h2' |
+======+======+

+------+------+
| 'h1' | 'h2' |
+======+======+

+------+------+
| 'h1' | 'h2' |
+======+======+

+------+------+
| 'h1' | 'h2' |
+======+======+

map fields to strings

All functions involving some field selection should map fields in the header row to field names, i.e., strings, so field selection is consistently by field name everywhere and won't get confused if objects other than strings are used as fields (as long as they can be converted via str()).

tee... functions

Proposed to add functions teecsv, teetext, ..., mirroring the to... functions, which return row containers that wrap an upstream table and write rows out to a file or db as they are pulled through. E.g.:

>>> a = fromcsv('foo.txt')
>>> b = convert(a, 'foo', int)
>>> c = teepickle(b, 'foo.dat')
>>> d = convert(c, 'foo', lambda v: v * 2)
>>> topickle(d, '2foo.dat')

...would create two files: one ('foo.dat') with data from the intermediate point in the transformation pipeline, and one ('2foo.dat') with the final data.

add/prepend/insert/append rows from sequence

It would be convenient to have one or more functions that support addition of rows to an existing table from some other sequence of rows. This includes adding data rows at any index, as well as adding header rows before the actual header row, e.g., to add comments prior to writing to file.

implement mergesort

I propose a new transformation function to combine multiple input tables into one sorted output table.

If the input tables are presorted, then this should be fully streaming, i.e., just implement the merge part of merge sort, to give a memory-efficient way of combining possibly many large presorted inputs into one sorted output. (This is the main motivation for the function.)

If the input tables are not presorted, then each will be sorted individually before combining into a single output.

The proposal is to create a new function mergesort(*tables, **kwargs) taking multiple input tables as positional args, with keyword args including 'key' as the specification of field or fields to sort by, 'presorted' to specify if input tables are already sorted, and 'buffersize' to pass through to sort() if the inputs are not already sorted.

This function could also be used in the implementation of the existing function merge(). I.e., rather than cat() then mergereduce(), we would use mergesort() then mergereduce(), which would make merge() memory efficient for presorted inputs.

Note that mergesort() will need to deal with the possibility of different headers in the input tables. It's proposed that headers are standardised in the same way that cat() does, i.e., find the union of all headers in the inputs, and fill any missing fields.
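Hypothetical usage under the proposed signature:

>>> a = sort(fromcsv('a.csv'), key='foo')
>>> b = sort(fromcsv('b.csv'), key='foo')
>>> m = mergesort(a, b, key='foo', presorted=True)  # fully streaming merge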

properties on row containers

Proposed to implement properties with getters and setters for all row container classes so it's clearer what can be modified post-creation and any changes can be handled properly, e.g., change of key on anything involving a sort.

add column function

In some cases you have a table and some sequence of data values, e.g., a list, and you want to use the sequence of values as a column within the table. Adding/inserting a new column from an existing sequence of values is currently not easily done, and can't be supported by extend or addfield, because those functions need to support adding a constant-valued field, and the constant might itself be a string or some other iterable.

Proposed to add a function addcolumn which takes a table, a field name, a sequence, and an optional field index, and returns a new table with the column integrated at the given position.
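Hypothetical usage of the proposed function:

>>> col = [True, False]
>>> t2 = addcolumn(t1, 'baz', col, index=1)  # integrate as second field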

convenience functions for tab-delimited files

For people who work mostly with tab-delimited files, typing fromcsv('myfile.tsv', delimiter='\t') is a pain. The proposal is to create functions fromtsv, totsv, appendtsv as convenient shorthands for using the ...csv functions with the delimiter='\t' keyword argument.

substring before, substring after

Proposed to add convenience functions 'substringbefore' and 'substringafter' which provide syntactic sugar for splitting values around first occurrence of a delimiter. I.e., calling substringbefore(' ') would return a function equivalent to lambda v: v.partition(' ')[0].
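A minimal sketch of the two helpers, following the partition() equivalence given above:

def substringbefore(delim):
    # everything before the first occurrence of delim
    return lambda v: v.partition(delim)[0]

def substringafter(delim):
    # everything after the first occurrence of delim
    return lambda v: v.partition(delim)[2]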

unpack dict into fields

Proposed to add a transformation function 'unpackdict' which would support unpacking of dict values within a single field into multiple fields, where names of new fields are derived from dict keys and values are corresponding dict values.
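Hypothetical illustration of the proposed transformation:

>>> t1 = [['foo', 'bar'],
...       [1, {'baz': 'a', 'quux': 'b'}]]
>>> t2 = unpackdict(t1, 'bar')
>>> # expected result: [['foo', 'baz', 'quux'], [1, 'a', 'b']]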

complex sort

Proposed to add some convenience function supporting a complex sort, i.e., sorting by multiple fields, some forward and some reverse. This can be done already with a sequence of sorts, but it could be made a bit more intuitive with some convenience function. E.g.:

>>> a = fromcsv('foo.csv')
>>> b = convertnumbers(a)
>>> c = complexsort(b, ('foo', False), ('bar', True))
>>> # equivalent to sort(sort(b, 'bar', reverse=True), 'foo')

The table c would be sorted by field 'foo' ascending then 'bar' descending.

data() return container

Proposed to modify the data() function to return a container, and to implement iterdata, analogous to values and itervalues, so it's easier to iterate over data rows more than once.

add field of row numbers

Proposed to add a function 'index' or 'numberrows' or similar name which inserts a new field as the first field with row numbers. Start index and step could be configurable, defaulting to 1, 1.
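Hypothetical usage, assuming start and step keyword args:

>>> t2 = numberrows(t1, start=1, step=1)  # prepends a field of row numbers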

remove examples.py from distribution

The examples.py file is only there to store the raw commands used to generate docstrings for functions; it shouldn't be part of the distribution, and it has an error in it in 0.9:

Downloading/unpacking petl
Downloading petl-0.9.tar.gz (92Kb): 92Kb downloaded
Running setup.py egg_info for package petl

Installing collected packages: petl
Running setup.py install for petl
SyntaxError: ('non-keyword arg after keyword arg', ('/usr/local/lib/python2.7/dist-packages/petl/examples.py', 1393, None, "actual = mergesort(sort(table1, key='foo'), reverse=True, sort(table2, key='foo', reverse=True), key='foo', reverse=True, presorted=True)\n"))

version field

As a convenient way to check which version of petl is installed, it is proposed to add a petl.version field with the current version number as a string.

nthword

Proposed to add a convenience function for selecting nth word in a string. New function nthword(n) returns lambda s: s.split()[n]. Maybe also add optional arg to specify split characters. Useful in conversion and mapping where you want to extract the nth word from a string.
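A minimal sketch, with the optional split-characters arg mentioned above included as an assumption:

def nthword(n, sep=None):
    # sep=None gives str.split() default whitespace behaviour
    return lambda s: s.split(sep)[n]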

to... functions write to stdout as default if no file name

Proposed to modify all the to... functions such that the file name arg becomes optional, and if not given, default to writing to stdout.

This feature is requested to make it possible to write small reusable command line scripts which can be piped to and from, and to support the petl executable, see also issue #16.

annexvalues function

Proposed to add a function similar to annex (#13) but takes a table and any sequence of objects as args and annexes the sequence to the table as an extra column. Signature would be annexvalues(table, 'newfld', newcol).

partition function

Proposed to add a function 'partition' which is like facet but takes an arbitrary function operating on a row rather than a key field, and returns a dictionary mapping distinct results of the function application to matching tables. E.g., if the function returns True or False, ret[True] gives a table of rows where the function evaluates to True and ret[False] gives the complement.
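Hypothetical illustration of the proposed function:

>>> ret = partition(tbl, lambda row: row['foo'] > 0)
>>> pos = ret[True]    # rows where the function evaluates to True
>>> neg = ret[False]   # the complement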

silent failure when too many files open

Try doing a merge on tables from more than ~200 files and you get an apparently empty table and no error message. There should at least be an error message.

ordered function

Proposed to add a utility function 'ordered' which returns True if a table is ordered by a given key, otherwise False.

simple timer/profiler for iteration over row containers

Proposed to add a function 'watch' or 'clock' which returns a row container wrapping some upstream row container, and which is a no-op (i.e., passes rows through unaltered when iterated over) but times how long each row takes to retrieve from the wrapped container. E.g.:

>>> t2 = clock(t1)
>>> rowcount(t2)
234
>>> t2.time
10.0001
>>> t2.count
234
>>> t2.rate
23.4

Not sure how this will work, given that timing variables would be set in the iterator but are accessed in the above example from the container; maybe OK, but a reset() method would also be needed?

diff could be more efficient, avoid repeated sorting

The diff function could be more efficient if it sorted the inputs then passed them as presorted to the two complements, so sorting is done only once and cached data are reused by the other complement, whichever is iterated over second.

fromtext gzip error

I'm getting the following error when trying to open gzipped files with fromtext...


Traceback (most recent call last):
  File "../../../../common/programs/scripts/mergevars.py", line 37, in <module>
    print repr(look(step6))
  File "/usr/local/lib/python2.7/dist-packages/petl/util.py", line 298, in __repr__
    rows = list(islice(it, *self.sliceargs))
  File "/usr/local/lib/python2.7/dist-packages/petl/transform.py", line 3590, in iterrowreduce
    yield tuple(reducer(key, hybridrows(srcflds, rows, missing)))
  File "/usr/local/lib/python2.7/dist-packages/petl/transform.py", line 3662, in _mergereducer
    for row in rows:
  File "/usr/local/lib/python2.7/dist-packages/petl/util.py", line 2247, in <genexpr>
    return (HybridRow(row, flds, missing) for row in it)
  File "/usr/local/lib/python2.7/dist-packages/petl/transform.py", line 6802, in itermergesort
    for row in shortlistmergesorted(getkey, reverse, *sits):
  File "/usr/local/lib/python2.7/dist-packages/petl/util.py", line 2218, in shortlistmergesorted
    shortlist[nextidx] = iterators[nextidx].next()
  File "/usr/local/lib/python2.7/dist-packages/petl/transform.py", line 6775, in _standardisedata
    for row in it:
  File "/usr/local/lib/python2.7/dist-packages/petl/transform.py", line 924, in iterfieldconvert
    for row in it:
  File "/usr/local/lib/python2.7/dist-packages/petl/transform.py", line 2970, in iterfieldselect
    for row in it:
  File "/usr/local/lib/python2.7/dist-packages/petl/transform.py", line 260, in itercut
    for row in it:
  File "/usr/local/lib/python2.7/dist-packages/petl/transform.py", line 4690, in iterpushheader
    for row in it:
  File "/usr/local/lib/python2.7/dist-packages/petl/io.py", line 425, in __iter__
    for line in f:
  File "/usr/lib/python2.7/gzip.py", line 450, in readline
    c = self.read(readsize)
  File "/usr/lib/python2.7/gzip.py", line 256, in read
    self._read(readsize)
  File "/usr/lib/python2.7/gzip.py", line 307, in _read
    uncompress = self.decompress.decompress(buf)
zlib.error: Error -3 while decompressing: invalid distances set

implement __getattr__ on hybrid row class

Proposed to implement __getattr__ on the hybrid row class to allow use of dot notation where field names are valid Python symbols, e.g., when writing functions for selecting or mapping rows.
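A minimal sketch, assuming each hybrid row carries its field names (class and attribute names here are illustrative only):

class HybridRow(tuple):
    def __new__(cls, row, flds):
        obj = tuple.__new__(cls, row)
        obj._flds = list(flds)
        return obj
    def __getattr__(self, name):
        # only called when normal attribute lookup fails
        try:
            return self[self._flds.index(name)]
        except ValueError:
            raise AttributeError(name)

>>> row = HybridRow((1, 'a'), ('foo', 'bar'))
>>> row.foo
1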

alternative, simpler, signature to aggregate()

Proposed to modify aggregate() to accept a simpler signature in addition to the current signature, to support the case where just one field needs to be aggregated. E.g.:

>>> t2 = aggregate(t1, 'type', 'count', sum)

filecache and memcache

Proposed to add functions filecache and memcache which wrap row containers and cache rows to file or memory as they are pulled through.

hybrid row/record

Currently a number of functions are effectively duplicated, with one version passing on a row (tuple-like) object and the other passing on a record (dict-like) object: e.g., rowselect/recordselect, rowreduce/recordreduce, rangerowreduce/rangerecordreduce, rowmap/recordmap, rowmapmany/recordmapmany, and possibly also lookup/recordlookup and lookupone/recordlookupone. All of these could be simplified into one function per existing pair if we passed through an object like sqlite3.Row, which behaves both as a tuple and a dict, allowing indexing by field position and field name.

The proposal is to use the sqlite3.Row class in the Python standard library, to modify the ...row... versions of all of the existing functions to pass on the sqlite3.Row object instead of a tuple, and to deprecate all of the ...record... versions of the above functions.
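For reference, sqlite3.Row from the standard library already behaves this way:

import sqlite3
conn = sqlite3.connect(':memory:')
conn.row_factory = sqlite3.Row
conn.execute('create table t (foo, bar)')
conn.execute("insert into t values (42, 'spam')")
row = conn.execute('select * from t').fetchone()
row[0]      # 42, indexing by position
row['bar']  # 'spam', indexing by field name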

fold function

The existing aggregate function loads each group of rows into a list to allow for multiple aggregation functions to iterate over the group. This is possibly inefficient, especially for simple functions like sum that can be calculated in a purely iterative fashion a la the Python built-in reduce function.

So the proposal is to add a function to petl called 'fold' which reduces groups of rows under a given key by iterative application of a function to the next value and the previous result, e.g.:

>>> import operator
>>> t1 = [['id', 'count'], [1, 3], [1, 5], [2, 4], [2, 8]]
>>> t2 = fold(t1, 'id', 'count', operator.add)
>>> list(t2)
[['id', 'count'], [1, 8], [2, 12]]

transparent support for gzipped files in from/to text and csv

The proposal is to provide transparent support for gzipped text and csv files in the following way. If the file name passed to fromtext or fromcsv ends in '.gz' then an attempt will be made to gzip decompress the file when reading it. Similarly, if the file name passed to totext, appendtext, tocsv or appendcsv ends in '.gz' then an attempt will be made to gzip compress the file when writing.
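One possible implementation sketch (the helper name is illustrative):

import gzip

def _open(filename, mode):
    # dispatch on extension: transparent gzip (de)compression for .gz files
    if filename.endswith('.gz'):
        return gzip.open(filename, mode)
    return open(filename, mode)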

annex function

Proposed to add a function 'annex' which accepts two or more tables and simply annexes data values from rows at same position in input tables. Where tables are not of the same length, fill with configurable missing value. A bit like an outer join where row number is the implicit key.

add support for other sources eg URL, stdin, to from... functions

Proposed to modify from... functions in a backwards-compatible way to enable extracting data from sources other than files, e.g., URLs, stdin.

The proposal is to modify the signature of the from... functions to support a source keyword arg in place of the file name (which becomes optional). With neither specified, defaults to reading from stdin. E.g.:

fromcsv('foo.csv')  # read from file
fromcsv(source=filesource('foo.csv'))  # equivalent
fromcsv(source=urlsource('http://foo.com/bar'))
fromcsv()  # read from stdin
fromcsv(source=stdinsource())  # equivalent

The source objects would implement open() and cachetag() as methods.
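A rough sketch of what a source object might look like under the proposed protocol (the cachetag strategy shown is an assumption):

import os

class filesource(object):
    def __init__(self, filename):
        self.filename = filename
    def open(self):
        return open(self.filename, 'rb')
    def cachetag(self):
        # assumption: derive a change-detection tag from mtime and size
        s = os.stat(self.filename)
        return (s.st_mtime, s.st_size)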

all transformation functions should only make one pass through the data

Currently some transformation functions make more than one pass through upstream tables, e.g., cast, to collect information prior to finalising the actual transformation. This prevents pipelines in scripts operating on an input stream, e.g., stdin, which is exhausted after a single pass. Should all transformation functions be restricted to only make one pass through the data? Those that need to collect information before running the actual transformation could read rows into a cache.

petl executable

Proposed to implement a 'petl' shell command which takes one positional arg, which is evaluated with all petl.fluent functions in scope. E.g.:

$ cat foo.csv | petl "fromcsv().cut('foo', 'bar').selecteq('foo', 42).topickle()" > foo42.dat

Would require some other modifications to from... and to... functions to support reading from stdin and writing to stdout.

tomediawikitext

Proposed to add a convenience function to generate text formatted as a mediawiki table.

Unicode support in from/to csv

Proposed to add Unicode support to from/to csv and related functions, based on the examples given in the documentation for the Python csv module.
