johnkerl / miller

Miller is like awk, sed, cut, join, and sort for name-indexed data such as CSV, TSV, and tabular JSON

Home Page: https://miller.readthedocs.io

License: Other

C 0.51% Shell 0.77% Ruby 1.00% D 0.05% Go 96.25% Rust 0.01% Python 0.90% Nim 0.03% Batchfile 0.03% ASL 0.01% Makefile 0.18% Vim Script 0.28%

data-processing data-cleaning csv csv-format streaming-data streaming-algorithms tsv json json-data data-reduction

miller's Introduction

What is Miller?

Miller is like awk, sed, cut, join, and sort for data formats such as CSV, TSV, JSON, JSON Lines, and positionally-indexed.

What can Miller do for me?

With Miller, you get to use named fields without needing to count positional indices, using familiar formats such as CSV, TSV, JSON, JSON Lines, and positionally-indexed. Then, on the fly, you can add new fields which are functions of existing fields, drop fields, sort, aggregate statistically, pretty-print, and more.

Miller operates on key-value-pair data while the familiar Unix tools operate on integer-indexed fields: if the natural data structure for the latter is the array, then Miller's natural data structure is the insertion-ordered hash map.
Miller handles a variety of data formats, including but not limited to the familiar CSV, TSV, and JSON/JSON Lines. (Miller can handle positionally-indexed data too!)
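
The contrast between integer-indexed fields and an insertion-ordered map can be sketched in a few lines of Python. This is only an illustration of the data-structure idea, not Miller's implementation; the field names and values are borrowed from the DKVP sample further down this page, and `x_squared` is a made-up derived field.

```python
# A positional record (the natural shape for awk/cut-style tools):
# the meaning of each value depends on its column number.
positional = ["pan", "pan", 1, 0.3467901443380824]

# A key-value record: values are looked up by name, and Python dicts keep
# insertion order, like the insertion-ordered hash map described above.
record = {"a": "pan", "b": "pan", "i": 1, "x": 0.3467901443380824}

# Adding a derived field appends it after the existing fields.
record["x_squared"] = record["x"] ** 2  # hypothetical new field

# Dropping a field by name needs no index arithmetic.
del record["b"]

field_names = list(record)
print(field_names)  # ['a', 'i', 'x', 'x_squared']
```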

In the above image you can see how Miller embraces the common themes of key-value-pair data in a variety of data formats.

Getting started

Installing

There's a good chance you can get Miller pre-built for your system:

OS	Installation command
Linux	`yum install miller` `apt-get install miller`
Mac	`brew install miller` `port install miller`
Windows	`choco install miller` `winget install Miller.Miller`

See also README-versions.md for a full list of package versions. Note that long-term-support (LTS) releases will likely be on older versions.

Community

Discussion forum: https://github.com/johnkerl/miller/discussions
Feature requests / bug reports: https://github.com/johnkerl/miller/issues
How to contribute: https://miller.readthedocs.io/en/latest/contributing/

Build status

Building from source

First:
- `cd /where/you/want/to/put/the/source`
- `git clone https://github.com/johnkerl/miller`
- `cd miller`
With make:
- To build: `make`. This takes just a few seconds and produces the Miller executable, which is `./mlr` (or `.\mlr.exe` on Windows).
- To run tests: `make check`.
- To install: `make install`. This installs the executable `/usr/local/bin/mlr` and manual page `/usr/local/share/man/man1/mlr.1` (so you can do `man mlr`).
- You can do `./configure --prefix=/some/install/path` before `make install` if you want to install somewhere other than `/usr/local`.
Without make:
- To build: `go build github.com/johnkerl/miller/cmd/mlr`.
- To run tests: `go test github.com/johnkerl/miller/pkg/...` and `mlr regtest`.
- To install: `go install github.com/johnkerl/miller/cmd/mlr` installs to `GOPATH/bin/mlr`.
See also the doc page on building from source.
For more developer information please see README-dev.md.

For developers

License

License: BSD2

Features

Miller is multi-purpose: it's useful for data cleaning, data reduction, statistical reporting, devops, system administration, log-file processing, format conversion, and database-query post-processing.
You can use Miller to snarf and munge log-file data, including selecting out relevant substreams, then produce CSV format and load that into all-in-memory/data-frame utilities for further statistical and/or graphical processing.
Miller complements data-analysis tools such as R, pandas, etc.: you can use Miller to clean and prepare your data. While you can do basic statistics entirely in Miller, its streaming-data feature and single-pass algorithms enable you to reduce very large data sets.
Miller complements SQL databases: you can slice, dice, and reformat data on the client side on its way into or out of a database. You can also reap some of the benefits of databases for quick, setup-free one-off tasks when you just need to query some data in disk files in a hurry.
Miller also goes beyond the classic Unix tools by stepping fully into our modern, no-SQL world: its essential record-heterogeneity property allows Miller to operate on data where records with different schema (field names) are interleaved.
Miller is streaming: most operations need only a single record in memory at a time, rather than ingesting all input before producing any output. For those operations which require deeper retention (sort, tac, stats1), Miller retains only as much data as needed. This means that whenever functionally possible, you can operate on files which are larger than your system’s available RAM, and you can use Miller in tail -f contexts.
Miller is pipe-friendly and interoperates with the Unix toolkit.
Miller's I/O formats include tabular pretty-printing, positionally indexed (Unix-toolkit style), CSV, TSV, JSON, JSON Lines, and others.
Miller does conversion between formats.
Miller's processing is format-aware: e.g. CSV sort and tac keep header lines first.
Miller has high-throughput performance on par with the Unix toolkit.
Miller is written in portable, modern Go, with zero runtime dependencies. You can download or compile a single binary, scp it to a faraway machine, and expect it to work.
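
The streaming and record-heterogeneity points above can be illustrated with Python generators. This is a rough sketch of the idea only, not how Miller is written; `read_dkvp` and `keep_with_field` are hypothetical helper names, and the input lines are taken from the DKVP output shown in the first issue below.

```python
import io
from typing import Dict, Iterable, Iterator


def read_dkvp(lines: Iterable[str]) -> Iterator[Dict[str, str]]:
    """Yield one record per line of key=value,key=value text."""
    for line in lines:
        line = line.rstrip("\n")
        if line:
            yield dict(pair.split("=", 1) for pair in line.split(","))


def keep_with_field(records: Iterable[Dict[str, str]], name: str) -> Iterator[Dict[str, str]]:
    """Streaming filter: only one record is held in memory at a time."""
    for rec in records:
        if name in rec:
            yield rec


# Records with different field names are interleaved in the same stream.
text = "a=1,b=2,c=3\nd=5,e=6,f=7\na=4,b=5,c=6\n"
out = list(keep_with_field(read_dkvp(io.StringIO(text)), "a"))
print(out)
```

A sort, by contrast, cannot emit its first record until it has seen the last one, which is why such operations need to retain more data.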

What people are saying about Miller

Today I discovered Miller—it's like jq but for CSV: https://t.co/pn5Ni241KM

Also, "Miller complements data-analysis tools such as R, pandas, etc.: you can use Miller to clean and prepare your data." @GreatBlueC @nfmcclure
— Adrien Trouillaud (@adrienjt) September 24, 2020

Underappreciated swiss-army command-line chainsaw.

"Miller is like awk, sed, cut, join, and sort for [...] CSV, TSV, and [...] JSON." https://t.co/TrQqSUK3KK
— Dirk Eddelbuettel (@eddelbuettel) February 28, 2017

Miller looks like a great command line tool for working with CSV data. Sed, awk, cut, join all rolled into one: http://t.co/9BBb6VCZ6Y
— Mike Loukides (@mikeloukides) August 16, 2015

Miller is like sed, awk, cut, join, and sort for name-indexed data such as CSV: http://t.co/1zPbfg6B2W - handy tool!
— Ilya Grigorik (@igrigorik) August 22, 2015

Btw, I think Miller is the best CLI tool to deal with CSV. I used to use this when I need to preprocess too big CSVs to load into R (now we have vroom, so such cases might be rare, though...)https://t.co/kUjrSSGJoT
— Hiroaki Yutani (@yutannihilat_en) April 21, 2020

Miller: a *format-aware* data munging tool By @__jo_ker__ to overcome limitations with *line-aware* workshorses like awk, sed et al https://t.co/LCyPkhYvt9

The project website is a fantastic example of good software documentation!!
— Donny Daniel (@dnnydnl) September 9, 2018

Holy holly data swiss army knife batman! How did no one suggest Miller https://t.co/JGQpmRAZLv for solving database cleaning / ETL issues to me before

Congrats to @__jo_ker__ for amazingly intuitive tool for critical data management tasks!#DataScienceandLaw #ComputationalLaw
— James Miller (@japanlawprof) June 12, 2018

🤯@__jo_ker__'s Miller easily reads, transforms, + writes all sorts of tabular data. It's standalone, fast, and built for streaming data (operating on one line at a time, so you can work on files larger than memory).

And the docs are dream. I've been reading them all morning! https://t.co/Be2pGPZK6t
— Benjamin Wolfe (he/him) (@BenjaminWolfe) September 9, 2021

Contributors ✨

Thanks to all the fine people who help make Miller better (emoji key):

Andrea Borruso 🤔 🎨	Shaun Jackman 🤔	Fred Trotter 🤔 🎨	komosa 🤔	jungle-boogie 🤔	Thomas Klausner 🚇	Stephen Kitt 📦
Leah Neukirchen 🤔	Luigi Baldoni 📦	Hiroaki Yutani 🤔	Daniel M. Drucker 🤔	Nikos Alexandris 🤔	kundeng 📦	Victor Sergienko 📦
Adrian Ho 🎨	zachp 📦	David Selassie 🤔	Joel Parker Henderson 🤔	Michel Ace 🤔	Matus Goljer 🤔	Richard Patel 📦
Jakub Podlaha 🎨	Miodrag Milić 📦	Derek Mahar 🤔	spmundi 🤔	Peter Körner 🛡️	rubyFeedback 🤔	rbolsius 📦
awildturtok 🤔	agguser 🤔	jganong 🤔	Fulvio Scapin 🤔	Jordan Torbiak 🤔	Andreas Weber 🤔	vapniks 📦
Zombo 📦	Brian Fulton-Howard 📦	ChCyrill 🤔	Jauder Ho 💻	Paweł Sacawa 🐛	schragge 📖	Jordi 📖 🤔

This project follows the all-contributors specification. Contributions of any kind are welcome!

miller's Issues

test/run reports error on open

Hello,

Tail end of sh test/run:
open: No such file or directory
test/../mlr: could not open "test/input/{a,b,c,d,e,f,g}.csv"

I can run mlr --icsv --odkvp cat test/input/{a,b,c,d,e,f,g}.csv just fine:

a=1,b=2,c=3
a=4,b=5,c=6
d=5,e=6,f=7
a=1,b=2,c=3
a=4,b=5,c=6
a=7,b=8,c=9
h=3,i=4,j=5
m=8,n=9,o=10
a=1,b=2,c=3
a=4,b=5,c=6

Sort by counted-distinct

Hello,

Suppose you have the following data:

% mlr --icsv --opprint count-distinct -f type,state 1.csv 
type       state    count
Debit Sale Approved 53
Debit Sale Declined 13
Sale       Approved 55
Authorize  Approved 18
Sale       Declined 25
Return     Approved 16
Capture    Approved 3
Authorize  Declined 3
Sale       Voided   2
Credit     Approved 1

Probably looks better sorted by type:

% mlr --icsv --opprint count-distinct -f type,state then sort -f type 1.csv 
type       state    count
Authorize  Approved 18
Authorize  Declined 3
Capture    Approved 3
Credit     Approved 1
Debit Sale Approved 53
Debit Sale Declined 13
Return     Approved 16
Sale       Approved 55
Sale       Declined 25
Sale       Voided   2

But can you sort it by the last count column that's created by the count-distinct?

The sort page indicates that you use one of the csv field names:
http://johnkerl.org/miller/doc/reference.html#sort
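
Since count is an ordinary field of the records that count-distinct emits, a later step in the then-chain should be able to sort on it (a numeric descending sort such as sort -nr count is the likely candidate; check the sort reference to confirm). The Python sketch below only illustrates that idea, using the rows from the output above.

```python
# Rows as emitted by count-distinct in the example above.
rows = [
    {"type": "Debit Sale", "state": "Approved", "count": 53},
    {"type": "Debit Sale", "state": "Declined", "count": 13},
    {"type": "Sale", "state": "Approved", "count": 55},
    {"type": "Authorize", "state": "Approved", "count": 18},
    {"type": "Sale", "state": "Declined", "count": 25},
    {"type": "Return", "state": "Approved", "count": 16},
    {"type": "Capture", "state": "Approved", "count": 3},
    {"type": "Authorize", "state": "Declined", "count": 3},
    {"type": "Sale", "state": "Voided", "count": 2},
    {"type": "Credit", "state": "Approved", "count": 1},
]

# The generated "count" field is just another key, so it can drive the sort.
# sorted() is stable, so ties keep their original relative order.
by_count = sorted(rows, key=lambda r: r["count"], reverse=True)

for r in by_count:
    print(f'{r["type"]:<10} {r["state"]:<8} {r["count"]}')
```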

Support double-quoting in DKVP format

This will involve adapting some of the RFC4180 CSV code over to the DKVP-handling code.

Median (p_50) for even number of data points

Minor issue on a terrific tool btw. Thanks!

The median is usually defined as the average of the two middle values when the number of data points is even. P_50 isn't working that way at the moment.
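
The difference can be shown with a small Python sketch. It assumes the percentile in question is a non-interpolating, nearest-rank one; that is an assumption made for illustration, and Miller's actual rule may differ. `nearest_rank_p50` is a hypothetical helper name.

```python
import math
import statistics


def nearest_rank_p50(values):
    """Non-interpolated 50th percentile: always returns one of the inputs."""
    if not values:
        raise ValueError("need at least one value")
    ordered = sorted(values)
    rank = math.ceil(0.5 * len(ordered))  # 1-based nearest rank
    return ordered[rank - 1]


data = [1, 2, 3, 4]
p50 = nearest_rank_p50(data)     # picks an existing data point
med = statistics.median(data)    # averages the two middle values
print(p50, med)
```

For an odd number of points the two definitions agree; they only diverge when the count is even.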

Homebrew

Hi,

Is there a homebrew formula for this by any chance?

Thanks!

segfault with bad syntax

Hello,

mlr --csvex uniq -g ST -c then sort -f -nr count banklist.csv| most resulted in a core dump.

zsh: segmentation fault (core dumped)  mlr --csvex uniq -g ST -c then sort -f -nr count banklist.csv | 
zsh: done                              most

I'm certain this is a result of my bad syntax, but I don't think the core dump is the expected outcome.

This is the file I used: http://catalog.data.gov/dataset/fdic-failed-bank-list

Would you like the core dump file?

Thanks!

Please tag a stable release

Thanks!

Use travis-ci

Split off from #8.

Note: at present I use manual CI [see also http://johnkerl.org/miller/doc/build.html] -- all unit/regression tests run on each build. So this will simply automate that. It looks at present like multi-platform is still beta, but, the more platforms I can get auto-building, the better.

Misc. neatens

I made a focused effort to get RFC4180 CSV, multi-character separators, and autoconfig (the latter mostly thanks to 0-wiz-0) delivered ASAP after the HN release announcement.

Following those are some lower-priority items which should be addressed before they turn into longer-term technical debt:

Make all on-line-help output 80-character clean. Most of it is 120-character but some even wider than that.
Delivery of usage messages should be standardized in all cases (mlr main as well as subcommands): when invoked with -h/--help, print to stdout and exit 0; when invoked due to unacceptable syntax, print to stderr and exit 1.
Replace the two or three remaining manual-test ifdef-mains with unit-test code. (All the rest have already been done.)
Subsubcommands for step/stats1/stats2 should have online help.
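
The usage-message convention in the second item can be sketched as follows. This is Python rather than Miller's C, and `mytool`, `USAGE`, and `run` are hypothetical names used only to show the stdout/exit-0 versus stderr/exit-1 split.

```python
import io
import sys

USAGE = "Usage: mytool [-h|--help] {verb} ...\n"  # hypothetical tool


def run(argv, stdout=sys.stdout, stderr=sys.stderr):
    """Return the process exit code for the given argument list."""
    args = argv[1:]
    if args and args[0] in ("-h", "--help"):
        stdout.write(USAGE)  # help was requested: stdout, success
        return 0
    if not args:
        stderr.write(USAGE)  # unacceptable syntax: stderr, failure
        return 1
    return 0  # a real tool would dispatch to the verb here


out, err = io.StringIO(), io.StringIO()
help_code = run(["mytool", "--help"], out, err)
bad_code = run(["mytool"], out, err)
print(help_code, bad_code)
```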

Does not compile on OSX

brew install ctags lemon

samm-mb ~/git/miller/c % make
ctags -R .
make -C dsls put_dsl_parse.o
lemon put_dsl_parse.y
mv put_dsl_parse.c put_dsl_parse.c.tmp
sed \
            -e 's/ParseTrace/put_dsl_ParseTrace/g' \
            -e 's/ParseTokenName/put_dsl_ParseTokenName/g' \
            -e 's/lemon_parser_alloc/put_dsl_lemon_parser_alloc/g' \
            -e 's/lemon_parser_free/put_dsl_lemon_parser_free/g' \
            -e 's/lemon_parser_parse_token/put_dsl_lemon_parser_parse_token/g' \
            -e 's/yy_destructor/put_dsl_yy_destructor/g' \
        put_dsl_parse.c.tmp > put_dsl_parse.c
rm -f put_dsl_parse.c.tmp
gcc -I.. -O3 -c -std=gnu99 put_dsl_parse.c
put_dsl_parse.c:529:4: error: use of undeclared identifier 'yypParser'; did you mean 'pparser'?
                 ParseARG_FETCH;
                 ^
put_dsl_parse.c:85:40: note: expanded from macro 'ParseARG_FETCH'
#define ParseARG_FETCH sllv_t* pasts = yypParser->pasts
                                       ^
put_dsl_parse.c:521:18: note: 'pparser' declared here
        lemon_parser_t *pparser,          /* The parser to be shifted */
                        ^
put_dsl_parse.c:539:4: error: use of undeclared identifier 'yypParser'; did you mean 'pparser'?
                 ParseARG_STORE; /* Suppress warning about unused %extra_argument var */
                 ^
put_dsl_parse.c:86:24: note: expanded from macro 'ParseARG_STORE'
#define ParseARG_STORE yypParser->pasts = pasts
                       ^
put_dsl_parse.c:521:18: note: 'pparser' declared here
        lemon_parser_t *pparser,          /* The parser to be shifted */
                        ^
put_dsl_parse.c:618:2: error: use of undeclared identifier 'yypParser'; did you mean 'pparser'?
        ParseARG_FETCH;
        ^
put_dsl_parse.c:85:40: note: expanded from macro 'ParseARG_FETCH'
#define ParseARG_FETCH sllv_t* pasts = yypParser->pasts
                                       ^
put_dsl_parse.c:610:18: note: 'pparser' declared here
        lemon_parser_t *pparser,         /* The parser */
                        ^
put_dsl_parse.c:650:89: warning: implicit declaration of function 'yytestcase' is invalid in C99 [-Wimplicit-function-declaration]
      case 6: /* put_dsl_or_term ::= put_dsl_or_term FILTER_DSL_AND put_dsl_and_term */ yytestcase(yyruleno==6);
                                                                                        ^
put_dsl_parse.c:771:2: error: use of undeclared identifier 'yypParser'; did you mean 'pparser'?
        ParseARG_FETCH;
        ^
put_dsl_parse.c:85:40: note: expanded from macro 'ParseARG_FETCH'
#define ParseARG_FETCH sllv_t* pasts = yypParser->pasts
                                       ^
put_dsl_parse.c:770:45: note: 'pparser' declared here
static void yy_parse_failed(lemon_parser_t *pparser) {
                                            ^
put_dsl_parse.c:781:2: error: use of undeclared identifier 'yypParser'; did you mean 'pparser'?
        ParseARG_STORE; /* Suppress warning about unused %extra_argument variable */
        ^
put_dsl_parse.c:86:24: note: expanded from macro 'ParseARG_STORE'
#define ParseARG_STORE yypParser->pasts = pasts
                       ^
put_dsl_parse.c:770:45: note: 'pparser' declared here
static void yy_parse_failed(lemon_parser_t *pparser) {
                                            ^
put_dsl_parse.c:792:2: error: use of undeclared identifier 'yypParser'; did you mean 'pparser'?
        ParseARG_FETCH;
        ^
put_dsl_parse.c:85:40: note: expanded from macro 'ParseARG_FETCH'
#define ParseARG_FETCH sllv_t* pasts = yypParser->pasts
                                       ^
put_dsl_parse.c:788:18: note: 'pparser' declared here
        lemon_parser_t *pparser,           /* The parser */
                        ^
put_dsl_parse.c:798:2: error: use of undeclared identifier 'yypParser'; did you mean 'pparser'?
        ParseARG_STORE; /* Suppress warning about unused %extra_argument variable */
        ^
put_dsl_parse.c:86:24: note: expanded from macro 'ParseARG_STORE'
#define ParseARG_STORE yypParser->pasts = pasts
                       ^
put_dsl_parse.c:788:18: note: 'pparser' declared here
        lemon_parser_t *pparser,           /* The parser */
                        ^
put_dsl_parse.c:805:2: error: use of undeclared identifier 'yypParser'; did you mean 'pparser'?
        ParseARG_FETCH;
        ^
put_dsl_parse.c:85:40: note: expanded from macro 'ParseARG_FETCH'
#define ParseARG_FETCH sllv_t* pasts = yypParser->pasts
                                       ^
put_dsl_parse.c:804:39: note: 'pparser' declared here
static void yy_accept(lemon_parser_t *pparser) {
                                      ^
put_dsl_parse.c:822:2: error: use of undeclared identifier 'yypParser'; did you mean 'pparser'?
        ParseARG_STORE; /* Suppress warning about unused %extra_argument variable */
        ^
put_dsl_parse.c:86:24: note: expanded from macro 'ParseARG_STORE'
#define ParseARG_STORE yypParser->pasts = pasts
                       ^
put_dsl_parse.c:804:39: note: 'pparser' declared here
static void yy_accept(lemon_parser_t *pparser) {
                                      ^
put_dsl_parse.c:870:2: error: use of undeclared identifier 'yypParser'; did you mean 'pparser'?
        ParseARG_STORE;
        ^
put_dsl_parse.c:86:24: note: expanded from macro 'ParseARG_STORE'
#define ParseARG_STORE yypParser->pasts = pasts
                       ^
put_dsl_parse.c:855:18: note: 'pparser' declared here
        lemon_parser_t *pparser;  /* The parser */
                        ^
1 warning and 10 errors generated.
make[1]: *** [put_dsl_parse.o] Error 1
make: *** [dsls] Error 2

Make does not work

When I do make in a fresh git clone of the repo, I get:

./test/../mlr --icsv --odkvp cat test/input/null-fields.csv
./test/../mlr --inidx --odkvp cat test/input/null-fields.nidx
./test/../mlr --idkvp --oxtab cat test/input/missings.dkvp
mlr --icsv --opprint cat test/input/utf8-1.csv
./test/run: 45: ./test/run: mlr: not found
Makefile:116: recipe for target 'reg-test' failed
make[1]: *** [reg-test] Error 127
make[1]: Leaving directory '/home/dubreuia/Documents/program/johnkerl-miller/c'
Makefile:7: recipe for target 'c' failed
make: *** [c] Error 2

I think it is trying to run the tests with mlr, but can't because mlr is not yet installed.

I'm on gnome ubuntu 15.04.

mlr filter function for hours/minutes/seconds to seconds

From an email request.

Example: "01:23:45" -> 5025

and vice versa.
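
A minimal Python sketch of the requested conversion in both directions, to pin down the intended semantics. The function names `hms2sec` and `sec2hms` and the handling of negative values are illustrative assumptions, not Miller's implementation.

```python
def hms2sec(hms: str) -> int:
    """Convert "HH:MM:SS" (also "MM:SS" or "SS") to a number of seconds."""
    sign = -1 if hms.startswith("-") else 1
    parts = [int(p) for p in hms.lstrip("-").split(":")]
    if not 1 <= len(parts) <= 3:
        raise ValueError(f"not an h:m:s value: {hms!r}")
    seconds = 0
    for part in parts:
        seconds = seconds * 60 + part
    return sign * seconds


def sec2hms(seconds: int) -> str:
    """Convert a number of seconds to zero-padded "HH:MM:SS"."""
    sign = "-" if seconds < 0 else ""
    minutes, s = divmod(abs(seconds), 60)
    h, m = divmod(minutes, 60)
    return f"{sign}{h:02d}:{m:02d}:{s:02d}"


print(hms2sec("01:23:45"))  # 5025
print(sec2hms(5025))        # 01:23:45
```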

header data mismatch

Hello,

Obtained this file: http://catalog.data.gov/dataset/fdic-failed-bank-list
Edited header to have it resemble this:
BankName,City,ST,CERT,AcquiringInstitution,ClosingDate,UpdatedDate

And I want to cat listbank.csv| mlr --icsv --opprint count-distinct -f ST,ClosingDate
But I get error: Header-data length mismatch!

The original file did contain ctrl-M characters that you can observe with cat -e banklist.csv but even after removing those characters, I get the same mismatch message.

Any idea how I can get around the mismatch and what it means?

Thanks!

Avoiding timegm improves portability

I tried compiling Miller on AIX and found that the timegm function, which is a GNU extension, is missing from the AIX libc.
If I replace the timegm function, the build completes, although some unit tests still fail (probably another problem).

The timegm man page describes how to write a portable replacement, and avoiding timegm would improve portability. What do you think?
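
For reference, a timezone-independent conversion can be written with integer arithmetic alone. The Python sketch below is not the man-page recipe; it swaps in the well-known days-from-civil day count (often attributed to Howard Hinnant) and compares the result against Python's own `calendar.timegm`. `portable_timegm` is a hypothetical name.

```python
import calendar


def days_from_civil(year, month, day):
    """Days since 1970-01-01 in the proleptic Gregorian calendar."""
    y = year - (1 if month <= 2 else 0)          # treat Jan/Feb as end of previous year
    era = y // 400                               # 400-year cycle (floor division)
    yoe = y - era * 400                          # year of era, 0..399
    mp = month - 3 if month > 2 else month + 9   # March = 0 ... February = 11
    doy = (153 * mp + 2) // 5 + day - 1          # day of year, counted from March 1
    doe = yoe * 365 + yoe // 4 - yoe // 100 + doy
    return era * 146097 + doe - 719468


def portable_timegm(year, month, day, hour=0, minute=0, second=0):
    """UTC broken-down time to seconds since the epoch, without libc timegm."""
    return days_from_civil(year, month, day) * 86400 + hour * 3600 + minute * 60 + second


stamp = portable_timegm(2015, 8, 16, 12, 34, 56)
print(stamp == calendar.timegm((2015, 8, 16, 12, 34, 56)))
```

The same arithmetic carries over to C directly, as long as the divisions are done with floor semantics for negative years.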

build error (Ubuntu 15.04)

Using gcc 4.9.2 I get a build error with the current code (7aead02 2015-08-16):

$ make
make -C c top
make[1]: Entering directory '/home/voj/proj/miller/c'
ctags -R .
make -C dsls put_dsl_parse.o
make[2]: Entering directory '/home/voj/proj/miller/c/dsls'
make[2]: 'put_dsl_parse.o' is up to date.
make[2]: Leaving directory '/home/voj/proj/miller/c/dsls'
make -C dsls put_dsl_lexer.o
make[2]: Entering directory '/home/voj/proj/miller/c/dsls'
flex --prefix=put_dsl_lexer_ --outfile=put_dsl_lexer.c --header-file=put_dsl_lexer.h put_dsl_lexer.l
gcc -I.. -O3 -c -std=gnu99 put_dsl_lexer.c
make[2]: Leaving directory '/home/voj/proj/miller/c/dsls'
make -C dsls put_dsl_wrapper.o
make[2]: Entering directory '/home/voj/proj/miller/c/dsls'
gcc -Wall -I.. -O3 -c -std=gnu99 put_dsl_wrapper.c
put_dsl_wrapper.c: In function ‘put_dsl_parse_inner’:
put_dsl_wrapper.c:24:3: warning: implicit declaration of function ‘put_dsl_lemon_parser_parse_token’ [-Wimplicit-function-declaration]
   parse_code = put_dsl_lemon_parser_parse_token(pvparser, lex_code, plexed_node, pasts);
   ^

I'd enable continuous builds, e.g. at travis-ci, to ensure that each commit at least compiles in a given environment. Right now I don't know whether the error results from the commit or from the build environment.

Regex or equivalent?

Is there a way to use regex or equivalent with Miller?

[enhancement] Wildcards in column names

It would be nice to be able to say something like mlr put '$*bytes = $*_in_bytes + $*_out_bytes'
or mlr cut -x -f '*_in_*,*_out_*'

For my use cases I made some bash wrappers to provide the described behaviour.

The computation cost should be low: the matching only needs to be recomputed when the header changes.
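
The caching idea can be sketched in Python: resolve the wildcard patterns against a header once, and reuse the result until a record with a different set of field names arrives. This is an illustration of the proposal for the cut -x -f case only; `WildcardCut` and the sample field names are hypothetical.

```python
from fnmatch import fnmatchcase


class WildcardCut:
    """Drop fields whose names match any glob pattern (like cut -x -f)."""

    def __init__(self, patterns):
        self.patterns = list(patterns)
        self._cache = {}  # header tuple -> list of field names to keep

    def _keep_for(self, header):
        # The pattern matching runs only the first time a header is seen.
        if header not in self._cache:
            self._cache[header] = [
                name for name in header
                if not any(fnmatchcase(name, p) for p in self.patterns)
            ]
        return self._cache[header]

    def __call__(self, record):
        keep = self._keep_for(tuple(record))
        return {name: record[name] for name in keep}


cut = WildcardCut(["*_in_*", "*_out_*"])
first = cut({"host": "a", "eth0_in_bytes": 1, "eth0_out_bytes": 2})
second = cut({"host": "b", "eth0_in_bytes": 3, "eth0_out_bytes": 4})
print(first, second, len(cut._cache))
```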

cut command not working with this .csv file

Using the latest miller release (v2.1.4.tar.gz, although mlr --help reports v2.1.3.).
The command: mlr --csv cut -f country wbdata.csv produces no output where wbdata.csv contains the following data:

iso2c,country,year,MS.MIL.XPND.GD.ZS,NE.CON.TETC.ZS,UIS.XGDP.FSGOV.FDINSTADM.FFD,NY.GDP.MKTP.KD.ZG,SH.XPD.PUBL.ZS,IE.ICT.TOTL.GD.ZS,IE.PPI.TRAN.CD,GB.XPD.DEFN.CN,GB.XPK.INLD.CN
CM,Cameroon,1980,NA,78.2638039227086,NA,-1.96529166883707,NA,NA,NA,NA,73240002560
CM,Cameroon,1981,NA,79.3593625984257,NA,17.082682248754,NA,NA,NA,NA,153900007420
CM,Cameroon,1982,NA,70.9004012667756,NA,7.51620260553753,NA,NA,NA,NA,234499997700
CM,Cameroon,1983,NA,72.928138593501,NA,6.86683056611426,NA,NA,NA,NA,267699994620
CM,Cameroon,1984,NA,71.5711709053482,NA,7.47457254301733,NA,NA,NA,NA,274299994110
CM,Cameroon,1985,NA,73.2420993013362,NA,8.06316167187164,NA,NA,NA,NA,425999990780
CM,Cameroon,1986,NA,73.9302317635993,NA,6.77166308049775,NA,NA,NA,NA,470999990270
CM,Cameroon,1987,NA,79.2320191357579,NA,-2.14665021191912,NA,NA,NA,NA,694999973890
CM,Cameroon,1988,1.26160386177472,79.0313468274156,NA,-7.82363197506437,NA,NA,NA,NA,283000012800

If I replace country with any other column name I still get no output.
Am I doing something wrong? It seems to work OK with another .csv file I tried, but not this one.

Furthermore, if I try piping the data to miller like this: cat wbdata.csv | mlr --csv cut -f country I get the following error:

fopen: No such file or directory
mlr: Couldn't fopen "-" for read.

Addition of a build system generator

I suggest adopting a higher-level build system than the current small makefile, so that powerful checks for software features become easier.

"Header-data length mismatch!" can and should give file/line context

Infrastructure is already in place via context_t. Just need to connect that information all the way through to error messages in all appropriate places.

Workaround: mlr {format arguments} cat {filename} > foo.txt and then wc -l foo.txt to see all successfully processed records up to the problematic one.
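
To make the request concrete, here is a Python sketch of a checker that reports the file name and line number of the first data line whose field count differs from the header. It uses Python's csv module rather than Miller's context_t, and `check_csv` and `demo.csv` are hypothetical names.

```python
import csv
import io


def check_csv(stream, filename="(stdin)"):
    """Return an error message with file/line context, or None if consistent."""
    reader = csv.reader(stream)
    header = next(reader, None)
    if header is None:
        return None  # empty input: nothing to compare
    for row in reader:
        if len(row) != len(header):
            # reader.line_num counts physical lines read so far.
            return (f"{filename}:{reader.line_num}: header has {len(header)} "
                    f"fields but data line has {len(row)}")
    return None


message = check_csv(io.StringIO("a,b,c\n1,2,3\n4,5\n"), "demo.csv")
print(message)
```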

Allow trailing spaces with --allow-repeat-ifs

In particular, NIDX or DKVP format splitting on space, with trailing spaces. Currently, the error message is Empty key disallowed. Workaround is pre-sedding.

build failing

Hello,

 ./test/../mlr put $z=min($x, $y) ./test/input/minmax.dkvp
  x=1,y=2,z=1.000000
  x=1,y=,z=1.000000
  x=,y=,z=
*** Error code 1

Stop.
make: stopped in /usr/home/sean/bin/miller/c

Full log here:
https://gist.github.com/jungle-boogie/5fe0b5f064a64dcb6dca

I've even removed miller git clone in hopes it was something specific with my clone, but it continues to occur.

RFC-compliant CSV I/O

Split out from #4

DKVP: pair separator in value breaks parsing

Miller has difficulty parsing DKVP when the pair separator character is within the value of the data that it is parsing. To wit:

Source File

a=pan,b=pan,i=1,x=0.3467901443380824,y=0.7268028627434533
a=eks,b=pan,i=2,x=0.7586799647=899636,y=0.5221511083334797
a=wye,b="wy=e",i=3,x=0.20460330576630303,y=0.33831852551664776
a=eks,b=wye,i=4,x==0.38139939387114097,y=0.13418874328430463
a=wye,b=pan,i=5=,x=0.5732889198020006,y=0.8636244699032729

Expected Output of mlr --opprint cat

a   b      i  x                    y
pan pan    1  0.3467901443380824   0.7268028627434533
eks pan    2  0.7586799647=899636  0.5221511083334797
wye "wy=e" 3  0.20460330576630303  0.33831852551664776
eks wye    4  =0.38139939387114097 0.13418874328430463
wye pan    5= 0.5732889198020006   0.8636244699032729

Actual Output of mlr --opprint cat

a   b   i x                   y
pan pan 1 0.3467901443380824  0.7268028627434533
eks pan 2 899636              0.5221511083334797
wye e"  3 0.20460330576630303 0.33831852551664776
eks wye 4 0.38139939387114097 0.13418874328430463
wye pan - 0.5732889198020006  0.8636244699032729

It looks like what miller does is that it reads the key until it finds the first pair separator, and then starts reading the value after the last pair separator.

What it should do, I think, is read the key until it finds the first pair separator, and then assign everything after that first pair separator right up until the field separator is found to be value.
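
The proposed rule (split each field at the first pair separator only) can be sketched in Python as below. It illustrates the suggestion, not Miller's parser; `parse_dkvp_line` is a hypothetical name, and the sketch does not handle a field separator inside a quoted value.

```python
def parse_dkvp_line(line, ifs=",", ips="="):
    """Split on the field separator, then on the FIRST pair separator only."""
    record = {}
    for field in line.split(ifs):
        key, _, value = field.partition(ips)  # everything after the first "=" is the value
        record[key] = value
    return record


rec2 = parse_dkvp_line("a=eks,b=pan,i=2,x=0.7586799647=899636,y=0.5221511083334797")
rec3 = parse_dkvp_line('a=wye,b="wy=e",i=3,x=0.20460330576630303,y=0.33831852551664776')
rec4 = parse_dkvp_line("a=eks,b=wye,i=4,x==0.38139939387114097,y=0.13418874328430463")
rec5 = parse_dkvp_line("a=wye,b=pan,i=5=,x=0.5732889198020006,y=0.8636244699032729")
print(rec2["x"], rec3["b"], rec4["x"], rec5["i"])
```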

Use CC in Makefiles so Travis gcc/clang variants will use gcc/clang, respectively

At present, both the gcc and clang builds at https://travis-ci.org/johnkerl/miller are actually using gcc since that's how Miller's makefiles are written.

OSX Build

I have built miller for OSX against 10.11 (BETA 6) - The binary should 'just work' and can be placed in your path somewhere.

Might be a tide-over until it's available in homebrew.

samm-mbp ~/git/miller/c % file mlr
mlr: Mach-O 64-bit executable x86_64

samm-mbp ~/git/miller/c % md5 mlr
MD5 (mlr) = d4199111ec0cb33365a9fc5a0dd185d6

samm-mbp ~/git/miller/c % md5 mlr.zip
MD5 (mlr.zip) = e2e8fd016ad07279ff771d06e84a27d4

Release download: https://github.com/sammcj/miller/releases/tag/v0.1

make failing

Hello,

On a gnu/linux system, make is failing with a simple make:

make: *** No targets specified and no makefile found.  Stop.

cc --version
gcc-4.8.real (Ubuntu 4.8.4-2ubuntu1~14.04) 4.8.4
Copyright (C) 2013 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

Is there a new method to make mlr now?

CRLF support for file formats other than CSV

Currently, parsing files with CRLF line endings is not supported. (I used the simplest possible DKVP file, cat'ed to mlr put.)
I know that you are now working on CSV-RFC4180 support (which explicitly says that CRLF is valid, and required :( ).

For now, I can suggest the following:
0. At least log a warning if CRLF is encountered.
1. Silently drop CRLFs.

Later, of course, proper support for line endings should be added :)
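
Both interim suggestions fit in a few lines. The Python sketch below warns once and then drops the carriage return; it is only an illustration of the behaviour being asked for, and `normalize_lines` and the warning text are made up.

```python
import io
import sys


def normalize_lines(lines, warn=sys.stderr):
    """Yield lines without their terminators, warning once if CRLF is seen."""
    warned = False
    for line in lines:
        if line.endswith("\r\n"):
            if not warned:
                print("warning: CRLF line endings found; dropping CR", file=warn)
                warned = True
            yield line[:-2]
        else:
            yield line.rstrip("\n")


log = io.StringIO()
cleaned = list(normalize_lines(["a=1,b=2\r\n", "a=3,b=4\r\n", "a=5,b=6\n"], warn=log))
print(cleaned)
```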

BTW. really great tool!

Clean up warnings in Lemon and Lemon-generated code

I don't think anything here is problematic per se, but a nice clean -Werror build in c/dsls/ (as we currently already have in the Miller-per-se code in c/) would be a nice-to-have.

Make resetting of data structure members before memory release configurable

I have noticed that a few function implementations reset members of specific data structures directly before the corresponding memory is released.

The following semantic patch can point out more update candidates.

@show_modification_before_memory_release@
expression value;
identifier element, var;
@@
 <+...
*var->element = value;
 ...+>
 free(var);

I suggest avoiding such extra data modification by default.
What do you think about making it configurable, in case you really need to turn this functionality on again?

Rows aggregation

Hi,
I think it would be nice to be able to aggregate (with any accumulator from stats1) multiple rows with an equal key field into one.
We can already achieve this via mlr filter ... then stats1 ... and a loop in the shell, but that way is cumbersome.

Let me show an example.
Data (the first and last three rows are the same):

a=1,b=2,key=klucz,c=3
a=4,b=4,key=klucz,c=3
a=3,b=4,key=klucz,c=1
a=0,b=0,key=klucz2,c=4
a=2,b=3,key=klucz2,c=3
a=1,b=2,key=klucz2,c=0
a=1,b=2,key=klucz,c=3
a=4,b=4,key=klucz,c=3
a=3,b=4,key=klucz,c=1

Output of mlr aggr -k key -a sum -f a,b,c:

a=8,b=10,key=klucz,c=7
a=3,b=5,key=klucz2,c=7
a=8,b=10,key=klucz,c=7

We can of course use a different name instead of 'aggr', or change the parameter names.
It is open to discussion whether there should be a parameter to aggregate the results further, as mlr step -g does (first and third rows in the example output).
Maybe the field names should then end with _sum, but I skipped that here for brevity.

So, in short, this is just the step command, except that each run of rows is replaced by one row instead of being decorated.
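
The proposed behaviour (collapse each run of consecutive rows that share a key into one summed row) can be sketched with itertools.groupby. This is an illustration of the request, not an existing Miller verb; `aggregate_runs` is a hypothetical name. Note that for the sample data the first run sums a as 1+4+3 = 8.

```python
from itertools import groupby


def aggregate_runs(records, key, fields):
    """Sum the given fields over each run of consecutive records with equal key."""
    for _, run in groupby(records, key=lambda r: r[key]):
        totals = dict.fromkeys(fields, 0)
        first = None
        for rec in run:
            if first is None:
                first = rec
            for f in fields:
                totals[f] += rec[f]
        # Keep the original field order; non-summed fields come from the first row.
        yield {name: totals.get(name, first[name]) for name in first}


block = [
    {"a": 1, "b": 2, "key": "klucz", "c": 3},
    {"a": 4, "b": 4, "key": "klucz", "c": 3},
    {"a": 3, "b": 4, "key": "klucz", "c": 1},
]
block2 = [
    {"a": 0, "b": 0, "key": "klucz2", "c": 4},
    {"a": 2, "b": 3, "key": "klucz2", "c": 3},
    {"a": 1, "b": 2, "key": "klucz2", "c": 0},
]
result = list(aggregate_runs(block + block2 + block, "key", ["a", "b", "c"]))
for row in result:
    print(",".join(f"{k}={v}" for k, v in row.items()))
```

Because groupby only merges adjacent rows, the two klucz runs stay separate, which matches the three-row output in the example.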

Soliciting FAQ material

What I have so far:

No output at all: record separator is CRLF but file data contains LF line endings
Fields not being picked out: field separator doesn't match data, e.g. the file is TSV but FS is comma (default)
mlr put '$y=string($x);$z=$y.$y' gives (error) on numeric data such as x=123 while mlr put '$z=string($x).string($x)' does not
How to handle data in application log files, e.g. 2015-10-08 08:29:09,445 INFO com.company.path.to.ClassName @ [search] various sorts of data {& punctuation} hits=1 status=0 time=2.378

Any other roadblocks/stumpers/head-scratch-moments?

Release tarballs are huge

The tarball for 2.1.1 is almost 60 MB. Perhaps the CSV test files could be generated on demand instead of being included in the repo?

Expression does not evaluate to boolean: got T_NULL

Here's a minimal csv test file (named "miller_test.csv"):

"header1","header2"
"value1","value2"

I'm trying to filter on empty header1 column with:

mlr --csv filter '$header1 != ""' miller_test.csv

And I get:

Expression does not evaluate to boolean: got T_NULL.
*** Error in `mlr': munmap_chunk(): invalid pointer: 0x000000000042915d ***
Aborted (core dumped)

I have a fresh master from 5 minutes ago, freshly compiled, on gnome ubuntu 15.04.

failed build with gcc

Hello,

For the first time in a few days, I tried to build mlr from source after fetching new bits with git pull, but this is the result of building with make:
https://gist.github.com/jungle-boogie/b333580be350c4063e87

make[1]: Leaving directory '/home/sean/bin/miller/c/dsls'
gcc -std=gnu99 -I. -Wall -Werror -O3 *.c cli/*.c lib/*.c containers/*.c stream/*.c input/*.c mapping/*.c     output/*.c ./dsls/put_dsl_parse.o    ./dsls/put_dsl_lexer.o    ./dsls/put_dsl_wrapper.o     ./dsls/filter_dsl_parse.o ./dsls/filter_dsl_lexer.o ./dsls/filter_dsl_wrapper.o -lm -o mlr
mapping/mlr_val.c: In function ‘s_i_sec2hms_func’:
mapping/mlr_val.c:319:12: error: variable ‘sign’ set but not used [-Werror=unused-but-set-variable]
  long long sign = 1LL;
            ^
mapping/mlr_val.c: In function ‘s_f_fsec2hms_func’:
mapping/mlr_val.c:336:12: error: variable ‘sign’ set but not used [-Werror=unused-but-set-variable]
  long long sign = 1LL;
            ^
cc1: all warnings being treated as errors
Makefile:75: recipe for target 'mlr' failed
make: *** [mlr] Error 1

Some double-byte characters cause offsets in --opprint output

Try downloading the small test file at http://pastebin.com/hUvrUBnN and run the following command on it:

cat /tmp/mlr_double-byte_character_test_file.txt | mlr --nidx --fs $'\t' --opprint cat

That should produce the following output:

1     2
191º test
191   test2

As you can see, the test in the first line is offset by one character to the left instead of starting right underneath the 2.

I suspect it's because the º is a double-byte character and it's causing some count to be offset. Sorry, I'm not a C guy and so I didn't even try to go through the code to find what might be causing this. :/

Love the tool though! ;]
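The suspicion above can be illustrated without reading Miller's code. The sketch below assumes (it is not confirmed by the report) that column padding is computed from the UTF-8 byte count rather than the character count. The function names are hypothetical.

```python
def pad_by_bytes(cell, width):
    # Padding derived from the UTF-8 byte length, which is what a
    # byte-counting routine such as C's strlen would report.
    return cell + " " * max(0, width - len(cell.encode("utf-8")))


def pad_by_chars(cell, width):
    # Padding derived from the number of code points instead.
    return cell + " " * max(0, width - len(cell))


for cell in ("191º", "191"):
    # repr() makes the trailing spaces visible.
    print(repr(pad_by_bytes(cell, 6)), repr(pad_by_chars(cell, 6)))
```

With byte-based padding, the cell containing º comes out one column short, because º occupies two bytes in UTF-8 but one column on screen. Counting code points is itself only an approximation: wide and combining characters would need a display-width calculation.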

Improve RFC-CSV read performance

RFC-CSV reads are currently about 70% slower than CSV-lite reads.

Part of this is unavoidable: CSV-lite does not support double-quoting, and in particular embedded CRLFs within fields, but RFC-CSV does. This means the CSV-lite reader can do a fast, buffered getline followed by fast pointer-walks through the line, knowing that any CRLF (or EOF) encountered is certainly a line-terminator. The RFC-CSV reader is char-buffered which is slower, even with getc_unlocked and ring-buffering at the application level.

That said, there are two obvious improvements to make. (1) When reading char-at-a-time via stdio (e.g. for reading stdin), it's of course necessary to dynamically allocate the string data. Yet the heap is getting fragmented by various-length strdups; these should be replaced with power-of-two logic in string_builder_t. (2) When reading char-at-a-time via mmap (which is the case for all non-stdin reads, unless the user specifies mlr --no-mmap) the data are already pointer-backed by mmap and no strdups need to be done at all.

As a necessary implementation detail, the byte_reader_t abstraction, elegant as it is for reducing code duplication, will need to be split into a stdio-using API and a pointer-backed API.
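The unavoidable part of the cost described above comes from embedded line terminators inside double-quoted fields. The following sketch uses Python's standard `csv` module, not Miller's readers, to contrast a line-oriented split with a quote-aware parse; the sample data is made up for illustration.

```python
import csv
import io

# One header line plus one record whose second field contains a CRLF.
data = 'a,b\r\n1,"line one\r\nline two"\r\n'

# Line-oriented approach: every CRLF is assumed to terminate a record.
naive_records = [line.split(",") for line in data.split("\r\n") if line]

# Quote-aware approach: a CRLF inside double quotes stays in the field.
rfc_records = list(csv.reader(io.StringIO(data, newline="")))

print(len(naive_records), "records from the line-oriented split")
print(len(rfc_records), "records from the quote-aware parse")
```

The line-oriented split is fast precisely because it never has to track quote state, which is the trade-off the issue describes.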

Autopull most of the manpage content

Leverage mlr --help and mlr --help-all-functions, via shell scripting and/or via a little bit of C coding.

Completion of error handling

Would you like to add more error handling for return values from functions like the following?

fseek ⇒ Parse
strdup ⇒ lrec_put

Open task for feature requests

Feel free to open a new task, or just add discussion as a comment here.

Optimize read performance for CSV

RFC4180 support in https://github.com/johnkerl/miller/releases/tag/v2.0.0 is significantly slower than other formats. This needs to be optimized.

instrument for afl?

Hello,

Would there be any benefit to running afl against the mlr binary?
http://lcamtuf.coredump.cx/afl/

filter/put don't work with fields that contain spaces

I've got a CSV file where the column headers contain spaces. For example:

 Datetime; Mean Real Power (MW); Mean Reactive Power (MVA)

I'd like to filter on column 2 and use the stats1 verb on it (some cells in column 2 don't have values). But I cannot construct a filter expression that names this second column, i.e., $'Mean Real Power (MW)' or variants of it don't work.

Or perhaps I'm just missing something?

Make RS/FS configurable for CSV

It will still be RFC4180 by default, but:

Option to set RS, e.g. to CR or LF rather than the standard CRLF
Option to set FS, e.g. for TSV

As a necessary side effect, RS/FS/PS for all formats will be able to be multi-character, e.g. you'll be able to use CRLF for DKVP format, which will resolve #19.

supporting double quotes

I love the idea of Miller. It is clearly a needed tool that is missing from the standard unix toolbox.

However, you really cannot say you have a tool that is designed to support csv, without supporting csv.

CSV is a standard file format, and has an RFC: https://tools.ietf.org/html/rfc4180

Not supporting double quotes is the same thing as saying that you do not support csv, since double quotes are central to the way that the standard handles other characters... comma being just one example. Your tool is young enough that supporting the standard now will make later development much simpler. This will prevent the situation years from now where you have a 'normal mode' and a 'standards mode'. If you make the change now you can just have the one correct mode.

You have an ambitious work-list, but I would suggest taking a pause and thinking about how you will support the RFC version of the file format.

People like me (open data advocates) spend a lot of time trying to ensure that organizations that release CSV do so under the standard format, rather than releasing unparsable garbage. Having a tool like yours that supported the standard too would be a huge boon. I could say things like:

"See by using the RFC for your data output, all kinds of open tools will work out of the box on your data... like Miller (link)"

Thank you for working on such a clever tool...

Regards,
-FT

variable eol set but not used.

In the files:

input/lrec_reader_mmap_csv.c
input/lrec_reader_mmap_dkvp.c
input/lrec_reader_mmap_nidx.c
input/lrec_reader_mmap_xtab.c

the eol variable is set but not used. It is only a compiler warning, but with the -Werror option (specifically [-Werror=unused-but-set-variable]) the build fails on Arch Linux until the warning is cleared.

git version : 7aead02

clang support? / freebsd support?

Hello,

How would I go about compiling your program with clang?

clang -v
FreeBSD clang version 3.4.1 (tags/RELEASE_34/dot1-final 208032) 20140512
Target: i386-unknown-freebsd10.2
Thread model: posix
Selected GCC installation:

Thanks,
Sean

mmap reader doesn't correctly handle files with missing final newline

Use --no-mmap as a workaround.
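The issue does not say where the mmap reader goes wrong. As a general illustration only, a scanner over an in-memory buffer has to treat end-of-buffer as a line terminator when the final newline is absent. The sketch below shows that handling; `iter_lines` is a hypothetical name and this is not Miller's reader.

```python
def iter_lines(buf):
    """Yield each line of a bytes buffer without its trailing LF."""
    start = 0
    end = len(buf)
    while start < end:
        eol = buf.find(b"\n", start)
        if eol == -1:
            # No terminator before end-of-buffer: the remaining bytes
            # form the last line, so stop at the end of the buffer.
            eol = end
        yield buf[start:eol]
        start = eol + 1


# The same two records, with and without a final newline.
print(list(iter_lines(b"a=1\nb=2\n")))
print(list(iter_lines(b"a=1\nb=2")))
```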

sum can't compute $

Hello,

Sample data:

% mlr stats1 -a sum -f 1 numbers

Couldn't parse "$456.1" as number.

It's probably impossible to have mlr recognize all currency indicators, so is there a workaround for this?
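One possible workaround is to strip the currency decoration before the data reaches the summing step. The sketch below does this outside Miller and assumes that a leading "$" and thousands commas are the only decoration present. Apart from "$456.1", which comes from the error message above, the sample values are invented.

```python
def parse_amount(text):
    """Convert a string such as "$1,000" to a float."""
    # Assumption for this sketch: only a leading "$" and thousands
    # commas need removing; other currency symbols are not handled.
    cleaned = text.strip().lstrip("$").replace(",", "")
    return float(cleaned)


# "$456.1" is from the report; the other values are illustrative.
values = ["$456.1", "$1,000", "12.5"]
total = sum(parse_amount(v) for v in values)
print(total)
```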

Remove unnecessary null pointer checks

An extra null pointer check is not needed in functions like the following.

Would you like to apply the following semantic patch to find more update candidates?

@Remove_unnecessary_pointer_checks@
expression x;
@@
-if (x)
    free(x);
