
petl-developers / petl

Python Extract Transform and Load Tables of Data

License: MIT License

Python 97.72% Jupyter Notebook 2.04% Shell 0.24%

petl's People

Contributors

alimanfoo, arturponinski, blais, bmaggard, bmos, dependabot[bot], dhait, dnicolodi, dusktreader, engstrom, fahadsiddiqui, florentx, henryrizzi, hugovk, icenine457, jfitzell, john-dennert, juarezr, miguelosana, mikmara1, mzaeemz, pjakobsen, rogerkwoodley, scardine, thatneat, timgates42, timheb, timhebzf, vilos, wassey16


petl's Issues

function to report progress, e.g., monitor/logger

Proposed to add a function 'progress' or 'monitor' or some similar name which basically wraps a row container, but outputs a message to stderr or stdout (or some other logging output?) every N rows, where N is configurable. E.g.:

>>> p = progress(sometbl, 1000, 'my progress: ')
>>> rowcount(p)
my progress: 1000 rows in 4s (250 rows/second); this batch in 4s (250 rows/second)
my progress: 2000 rows in 8s (250 rows/second); this batch in 4s (250 rows/second)
my progress: 2500 rows in 10s (250 rows/second)
2500

Maybe only makes sense in petl.interactive.
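A minimal generator-based sketch of what such a wrapper might look like (the name, signature, and message format are illustrative only, and the report is simplified relative to the example above):

import sys
import time

def progress(table, batchsize=1000, prefix=''):
    # hypothetical sketch: wrap a row container and report to stderr
    # every `batchsize` rows as they are pulled through
    start = batchstart = time.time()
    for n, row in enumerate(table, 1):
        yield row
        if n % batchsize == 0:
            now = time.time()
            sys.stderr.write('%s%d rows in %.0fs; this batch in %.0fs\n'
                             % (prefix, n, now - start, now - batchstart))
            batchstart = now

Note this sketch returns a one-shot generator; a real implementation would presumably return a re-iterable row container like other petl functions.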

rename cat to concat to avoid clash with unix command

Proposed to rename 'cat' to 'concat' to avoid a clash with the unix 'cat' command when working in the IPython interactive shell (I find I want to use shell cat more often than petl cat, so having to remember to add an initial '!' is annoying and I keep forgetting to do it).

add field function

The extend function is useful, but often you want to add a new field not at the right of existing fields but at the left or at some specific index. This can be done with extend followed by cut, but it would be more convenient to have a single function that behaved like extend but also allowed specification of the index position where the new field should be inserted.

Proposed to add a new function addfield as described above.
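Hypothetical usage of the proposed function, assuming an optional index keyword arg (the signature shown is illustrative):

>>> # insert 'baz' as the first field, with a constant value
>>> t2 = addfield(t1, 'baz', 42, index=0)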

standard methods on row container base class

Proposed to implement support for standard functions like len() on the row container base class, and maybe row access and slicing via subscript notation, so row containers can be used more like lists.
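For illustration, usage might look something like this if implemented (purely hypothetical):

>>> tbl = fromcsv('foo.csv')
>>> len(tbl)   # number of rows
>>> tbl[0]     # first row
>>> tbl[1:4]   # slice of rows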

diff not showing subtractions

Trying out your module, I notice that diff is not showing subtractions when the changed file does not have lines from the original. I assume that if the changed file does not have the same lines as the original file, you should see subtractions. Could you please take a look into this?

Test Code

from petl import *

f1 = fromcsv('list1', delimiter=',')
f2 = fromcsv('list2', delimiter=',')

print '- file 1'
for i in f1:
    print i
print ''
print '- file 2'
for i in f2:
    print i

print ''
print '#'
print '#'
print ''

a, s = diff(f1, f2)
print look(a)
print look(s)

print '#'

a, s = recorddiff(f1, f2)
print look(a)
print look(s)

OUTPUT

$ ./test.py

- file 1
('h1', 'h2')
('test1', 'hello1')
('test2', 'hello2')

- file 2
('h1', 'h2')

+------+------+
| 'h1' | 'h2' |
+======+======+

+------+------+
| 'h1' | 'h2' |
+======+======+

+------+------+
| 'h1' | 'h2' |
+======+======+

+------+------+
| 'h1' | 'h2' |
+======+======+

map fields to strings

All functions involving some field selection should map fields in the header row to field names, i.e., strings, so field selection is consistently by field name everywhere and won't get confused if objects other than strings are used as fields (as long as they can be converted via str()).

tee... functions

Proposed to add functions teecsv, teetext, ..., mirroring the to... functions, which return row containers that wrap an upstream table and write rows out to a file or db as they are pulled through. E.g.:

>>> a = fromcsv('foo.txt')
>>> b = convert(a, 'foo', int)
>>> c = teepickle(b, 'foo.dat')
>>> d = convert(c, 'foo', lambda v: v * 2)
>>> topickle(d, '2foo.dat')

...would create two files: one ('foo.dat') with data from the intermediate point in the transformation pipeline, and one ('2foo.dat') with the final data.

add/prepend/insert/append rows from sequence

It would be convenient to have one or more functions that support addition of rows to an existing table from some other sequence of rows. This includes adding data rows at any index, as well as adding header rows before the actual header row, e.g., to add comments prior to writing to file.

implement mergesort

I propose a new transformation function to combine multiple input tables into one sorted output table.

If the input tables are presorted, then this should be fully streaming, i.e., just implement the merge part of merge sort, to give a memory-efficient way of combining possibly many large presorted inputs into one sorted output. (This is the main motivation for the function.)

If the input tables are not presorted, then each will be sorted individually before combining into a single output.

The proposal is to create a new function mergesort(*tables, **kwargs) taking multiple input tables as positional args, with keyword args including 'key' as the specification of field or fields to sort by, 'presorted' to specify if input tables are already sorted, and 'buffersize' to pass through to sort() if the inputs are not already sorted.

This function could also be used in the implementation of the existing function merge(). I.e., rather than cat() then mergereduce(), we would use mergesort() then mergereduce(), which would make merge() memory efficient for presorted inputs.

Note that mergesort() will need to deal with the possibility of different headers in the input tables. It's proposed that headers are standardised in the same way that cat() does, i.e., find the union of all headers in the inputs, and fill any missing fields.
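Hypothetical usage under the proposed signature:

>>> a = sort(fromcsv('a.csv'), key='foo')
>>> b = sort(fromcsv('b.csv'), key='foo')
>>> m = mergesort(a, b, key='foo', presorted=True)  # fully streaming merge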

properties on row containers

Proposed to implement properties with getters and setters for all row container classes so it's clearer what can be modified post-creation and any changes can be handled properly, e.g., change of key on anything involving a sort.

add column function

In some cases you have a table and some sequence of data values, e.g., a list, and you want to use the sequence of values as a column within the table. Adding/inserting a new column from an existing sequence of values is currently not easily done, and can't be supported by extend or addfield, because those functions need to support adding a constant-valued field, and the constant might itself be a string or some other iterable.

Proposed to add a function addcolumn which takes a table, a field name, a sequence, and an optional field index, and returns a new table with the column integrated at the given position.
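Hypothetical usage of the proposed function:

>>> col = [True, False]
>>> t2 = addcolumn(t1, 'baz', col, index=1)  # integrate as second field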

convenience functions for tab-delimited files

For people who work mostly with tab-delimited files, typing fromcsv('myfile.tsv', delimiter='\t') is a pain. The proposal is to create functions fromtsv, totsv, appendtsv as convenient shorthands for using the ...csv functions with the delimiter='\t' keyword argument.

substring before, substring after

Proposed to add convenience functions 'substringbefore' and 'substringafter' which provide syntactic sugar for splitting values around first occurrence of a delimiter. I.e., calling substringbefore(' ') would return a function equivalent to lambda v: v.partition(' ')[0].
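A minimal sketch of the two helpers, following the partition() equivalence given above:

def substringbefore(delim):
    # everything before the first occurrence of delim
    return lambda v: v.partition(delim)[0]

def substringafter(delim):
    # everything after the first occurrence of delim
    return lambda v: v.partition(delim)[2]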

unpack dict into fields

Proposed to add a transformation function 'unpackdict' which would support unpacking of dict values within a single field into multiple fields, where names of new fields are derived from dict keys and values are corresponding dict values.
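Hypothetical illustration of the proposed transformation:

>>> t1 = [['foo', 'bar'],
...       [1, {'baz': 'a', 'quux': 'b'}]]
>>> t2 = unpackdict(t1, 'bar')
>>> # expected result: [['foo', 'baz', 'quux'], [1, 'a', 'b']]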

complex sort

Proposed to add some convenience function supporting a complex sort, i.e., sorting by multiple fields, some forward and some reverse. This can be done already with a sequence of sorts, but it could be made a bit more intuitive with some convenience function. E.g.:

>>> a = fromcsv('foo.csv')
>>> b = convertnumbers(a)
>>> c = complexsort(b, ('foo', False), ('bar', True))
>>> # equivalent to sort(sort(b, 'bar', reverse=True), 'foo')

The table c would be sorted by field 'foo' ascending then 'bar' descending.

data() return container

Proposed to modify the data() function to return a container, and to implement iterdata, analogous to values and itervalues, so it's easier to iterate over data rows more than once.

add field of row numbers

Proposed to add a function 'index' or 'numberrows' or similar name which inserts a new field as the first field with row numbers. Start index and step could be configurable, defaulting to 1, 1.
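Hypothetical usage, assuming start and step keyword args:

>>> t2 = numberrows(t1, start=1, step=1)  # prepends a field of row numbers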

remove examples.py from distribution

The examples.py file is only there to store the raw commands used to generate docstrings for functions; it shouldn't be part of the distribution, and it has an error in it in 0.9:

Downloading/unpacking petl
Downloading petl-0.9.tar.gz (92Kb): 92Kb downloaded
Running setup.py egg_info for package petl

Installing collected packages: petl
Running setup.py install for petl
SyntaxError: ('non-keyword arg after keyword arg', ('/usr/local/lib/python2.7/dist-packages/petl/examples.py', 1393, None, "actual = mergesort(sort(table1, key='foo'), reverse=True, sort(table2, key='foo', reverse=True), key='foo', reverse=True, presorted=True)\n"))

version field

As a convenient way to check which version of petl is installed, it is proposed to add a petl.version field with the current version number as a string.

nthword

Proposed to add a convenience function for selecting nth word in a string. New function nthword(n) returns lambda s: s.split()[n]. Maybe also add optional arg to specify split characters. Useful in conversion and mapping where you want to extract the nth word from a string.
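A minimal sketch, with the optional split-characters arg mentioned above included as an assumption:

def nthword(n, sep=None):
    # sep=None gives str.split() default whitespace behaviour
    return lambda s: s.split(sep)[n]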

to... functions write to stdout as default if no file name

Proposed to modify all the to... functions such that the file name arg becomes optional, and if not given, default to writing to stdout.

This feature is requested to make it possible to write small reusable command line scripts which can be piped to and from, and to support the petl executable, see also issue #16.

annexvalues function

Proposed to add a function similar to annex (#13) but takes a table and any sequence of objects as args and annexes the sequence to the table as an extra column. Signature would be annexvalues(table, 'newfld', newcol).

partition function

Proposed to add a function 'partition' which is like facet but takes an arbitrary function operating on a row rather than a key field, and returns a dictionary mapping distinct results of the function application to matching tables. E.g., if the function returns True or False, ret[True] gives a table of rows where the function evaluates to True and ret[False] gives the complement.
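Hypothetical illustration of the proposed function:

>>> ret = partition(tbl, lambda row: row['foo'] > 0)
>>> pos = ret[True]    # rows where the function evaluates to True
>>> neg = ret[False]   # the complement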

silent failure when too many files open

Try doing a merge on tables from more than ~200 files and you get an apparently empty table and no error message. There should at least be an error message.

ordered function

Proposed to add a utility function 'ordered' which returns True if a table is ordered by a given key, otherwise False.

simple timer/profiler for iteration over row containers

Proposed to add a function 'watch' or 'clock' which returns a row container wrapping some upstream row container, and which is a no-op (i.e., passes rows through unaltered when iterated over) but times how long each row takes to retrieve from the wrapped container. E.g.:

>>> t2 = clock(t1)
>>> rowcount(t2)
234
>>> t2.time
10.0001
>>> t2.count
234
>>> t2.rate
23.4

Not sure how this will work, given that timing variables would be set in the iterator but are accessed in the above example from the container; maybe OK, but a reset() method would also be needed?

diff could be more efficient, avoid repeated sorting

The diff function could be more efficient if it sorted the inputs then passed them as presorted to the two complements, so sorting is done only once and cached data are reused by the other complement, whichever is iterated over second.

fromtext gzip error

I'm getting the following error when trying to open gzipped files with fromtext...


Traceback (most recent call last):
  File "../../../../common/programs/scripts/mergevars.py", line 37, in <module>
    print repr(look(step6))
  File "/usr/local/lib/python2.7/dist-packages/petl/util.py", line 298, in __repr__
    rows = list(islice(it, *self.sliceargs))
  File "/usr/local/lib/python2.7/dist-packages/petl/transform.py", line 3590, in iterrowreduce
    yield tuple(reducer(key, hybridrows(srcflds, rows, missing)))
  File "/usr/local/lib/python2.7/dist-packages/petl/transform.py", line 3662, in _mergereducer
    for row in rows:
  File "/usr/local/lib/python2.7/dist-packages/petl/util.py", line 2247, in <genexpr>
    return (HybridRow(row, flds, missing) for row in it)
  File "/usr/local/lib/python2.7/dist-packages/petl/transform.py", line 6802, in itermergesort
    for row in shortlistmergesorted(getkey, reverse, *sits):
  File "/usr/local/lib/python2.7/dist-packages/petl/util.py", line 2218, in shortlistmergesorted
    shortlist[nextidx] = iterators[nextidx].next()
  File "/usr/local/lib/python2.7/dist-packages/petl/transform.py", line 6775, in _standardisedata
    for row in it:
  File "/usr/local/lib/python2.7/dist-packages/petl/transform.py", line 924, in iterfieldconvert
    for row in it:
  File "/usr/local/lib/python2.7/dist-packages/petl/transform.py", line 2970, in iterfieldselect
    for row in it:
  File "/usr/local/lib/python2.7/dist-packages/petl/transform.py", line 260, in itercut
    for row in it:
  File "/usr/local/lib/python2.7/dist-packages/petl/transform.py", line 4690, in iterpushheader
    for row in it:
  File "/usr/local/lib/python2.7/dist-packages/petl/io.py", line 425, in __iter__
    for line in f:
  File "/usr/lib/python2.7/gzip.py", line 450, in readline
    c = self.read(readsize)
  File "/usr/lib/python2.7/gzip.py", line 256, in read
    self._read(readsize)
  File "/usr/lib/python2.7/gzip.py", line 307, in _read
    uncompress = self.decompress.decompress(buf)
zlib.error: Error -3 while decompressing: invalid distances set

implement __getattr__ on hybrid row class

Proposed to implement __getattr__ on the hybrid row class to allow use of dot notation where field names are valid Python symbols, e.g., when writing functions for selecting or mapping rows.
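A minimal sketch, assuming each hybrid row carries its field names (class and attribute names here are illustrative only):

class HybridRow(tuple):
    def __new__(cls, row, flds):
        obj = tuple.__new__(cls, row)
        obj._flds = list(flds)
        return obj
    def __getattr__(self, name):
        # only called when normal attribute lookup fails
        try:
            return self[self._flds.index(name)]
        except ValueError:
            raise AttributeError(name)

>>> row = HybridRow((1, 'a'), ('foo', 'bar'))
>>> row.foo
1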

alternative, simpler, signature to aggregate()

Proposed to modify aggregate() to accept a simpler signature in addition to the current signature, to support the case where just one field needs to be aggregated. E.g.:

>>> t2 = aggregate(t1, 'type', 'count', sum)

filecache and memcache

Proposed to add functions filecache and memcache which wrap row containers and cache rows to file or memory as they are pulled through.

hybrid row/record

Currently a number of functions are effectively duplicated, with one version passing on a row (tuple-like) object and the other passing on a record (dict-like) object: e.g., rowselect/recordselect, rowreduce/recordreduce, rangerowreduce/rangerecordreduce, rowmap/recordmap, rowmapmany/recordmapmany, and possibly also lookup/recordlookup and lookupone/recordlookupone. All of these could be simplified into one function per existing pair if we passed through an object like sqlite3.Row, which behaves both as a tuple and a dict, allowing indexing by field position and field name.

The proposal is to use the sqlite3.Row class in the Python standard library, to modify the ...row... versions of all of the existing functions to pass on the sqlite3.Row object instead of a tuple, and to deprecate all of the ...record... versions of the above functions.
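For reference, sqlite3.Row from the standard library already behaves this way:

import sqlite3
conn = sqlite3.connect(':memory:')
conn.row_factory = sqlite3.Row
conn.execute('create table t (foo, bar)')
conn.execute("insert into t values (42, 'spam')")
row = conn.execute('select * from t').fetchone()
row[0]      # 42, indexing by position
row['bar']  # 'spam', indexing by field name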

fold function

The existing aggregate function loads each group of rows into a list to allow for multiple aggregation functions to iterate over the group. This is possibly inefficient, especially for simple functions like sum that can be calculated in a purely iterative fashion a la the Python built-in reduce function.

So the proposal is to add a function to petl called 'fold' which reduces groups of rows under a given key by iterative application of a function to the next value and the previous result, e.g.:

>>> import operator
>>> t1 = [['id', 'count'], [1, 3], [1, 5], [2, 4], [2, 8]]
>>> t2 = fold(t1, 'id', 'count', operator.add)
>>> list(t2)
[['id', 'count'], [1, 8], [2, 12]]

transparent support for gzipped files in from/to text and csv

The proposal is to provide transparent support for gzipped text and csv files in the following way. If the file name passed to fromtext or fromcsv ends in '.gz' then an attempt will be made to gzip decompress the file when reading it. Similarly, if the file name passed to totext, appendtext, tocsv or appendcsv ends in '.gz' then an attempt will be made to gzip compress the file when writing.
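One possible implementation sketch (the helper name is illustrative):

import gzip

def _open(filename, mode):
    # dispatch on extension: transparent gzip (de)compression for .gz files
    if filename.endswith('.gz'):
        return gzip.open(filename, mode)
    return open(filename, mode)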

annex function

Proposed to add a function 'annex' which accepts two or more tables and simply annexes data values from rows at same position in input tables. Where tables are not of the same length, fill with configurable missing value. A bit like an outer join where row number is the implicit key.

add support for other sources eg URL, stdin, to from... functions

Proposed to modify from... functions in a backwards-compatible way to enable extracting data from sources other than files, e.g., URLs, stdin.

The proposal is to modify the signature of the from... functions to support a source keyword arg in place of the file name (which becomes optional). With neither specified, defaults to reading from stdin. E.g.:

fromcsv('foo.csv')  # read from file
fromcsv(source=filesource('foo.csv'))  # equivalent
fromcsv(source=urlsource('http://foo.com/bar'))
fromcsv()  # read from stdin
fromcsv(source=stdinsource())  # equivalent

The source objects would implement open() and cachetag() as methods.
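A rough sketch of what a source object might look like under the proposed protocol (the cachetag strategy shown is an assumption):

import os

class filesource(object):
    def __init__(self, filename):
        self.filename = filename
    def open(self):
        return open(self.filename, 'rb')
    def cachetag(self):
        # assumption: derive a change-detection tag from mtime and size
        s = os.stat(self.filename)
        return (s.st_mtime, s.st_size)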

all transformation functions should only make one pass through the data

Currently some transformation functions make more than one pass through upstream tables, e.g., cast, to collect information prior to finalising the actual transformation. This prevents pipelines in scripts operating on an input stream, e.g., stdin, which is exhausted after a single pass. Should all transformation functions be restricted to only make one pass through the data? Those that need to collect information before running the actual transformation could read rows into a cache.

petl executable

Proposed to implement a 'petl' shell command which takes one positional arg, which is evaluated with all petl.fluent functions in scope. E.g.:

$ cat foo.csv | petl "fromcsv().cut('foo', 'bar').selecteq('foo', 42).topickle()" > foo42.dat

Would require some other modifications to from... and to... functions to support reading from stdin and writing to stdout.

tomediawikitext

Proposed to add a convenience function to generate text formatted as a mediawiki table.

Unicode support in from/to csv

Proposed to add Unicode support to from/to csv and related functions, based on the examples given in the documentation for the Python csv module.
