harelba / q

10.1K 168.0 414.0 3.13 MB

q - Run SQL directly on delimited files and multi-file sqlite databases

Home Page: http://harelba.github.io/q/

License: GNU General Public License v3.0

Languages: Python 97.14% · Shell 1.08% · Batchfile 0.03% · HTML 0.21% · JavaScript 0.48% · CSS 0.12% · Starlark 0.94%
Topics: python, sql, cli, command-line-tool, textasdata, q, qtextasdata, csv, tsv, command-line

q's Introduction

q - Text as Data

q's purpose is to bring SQL expressive power to the Linux command line and to provide easy access to text as actual data.

q allows the following:

  • Performing SQL-like statements directly on tabular text data, auto-caching the data in order to accelerate additional querying of the same file.
  • Performing SQL statements directly on multi-file sqlite3 databases, without having to merge them or load them into memory.

The following table shows the impact of using caching:

Rows       Columns  File Size  Query time without caching  Query time with caching  Speed improvement
5,000,000  100      4.8GB      4 minutes, 47 seconds       1.92 seconds             x149
1,000,000  100      983MB      50.9 seconds                0.461 seconds            x110
1,000,000  50       477MB      27.1 seconds                0.272 seconds            x99
100,000    100      99MB       5.2 seconds                 0.141 seconds            x36
100,000    50       48MB       2.7 seconds                 0.105 seconds            x25

Note that in the current version, caching is not enabled by default, since the caches take up disk space. Use -C readwrite or -C read to enable it for a query, or set caching_mode in .qrc to change the default.
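
A minimal .qrc sketch for the caching_mode setting mentioned above (the option name comes from the text; the INI-style section layout is an assumption and may differ by q version):

```
[options]
caching_mode=readwrite
```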

q's website is https://harelba.github.io/q/ or https://q.textasdata.wiki. It contains everything you need to download and use q immediately.

Usage Examples

q treats ordinary files as database tables, and supports all SQL constructs, such as WHERE, GROUP BY, JOINs, etc. It supports automatic column name and type detection, and provides full support for multiple character encodings.

Here are some example commands to get the idea:

$ q "SELECT COUNT(*) FROM ./clicks_file.csv WHERE c3 > 32.3"

$ ps -ef | q -H "SELECT UID, COUNT(*) cnt FROM - GROUP BY UID ORDER BY cnt DESC LIMIT 3"

$ q "select count(*) from some_db.sqlite3:::albums a left join another_db.sqlite3:::tracks t on (a.album_id = t.album_id)"

Detailed examples can be found here.

Installation

The new major version 3.1.6 is out, with many significant additions.

Instructions for all OSs are here.

The previous version 2.0.19 can still be downloaded from here.

Contact

Any feedback/suggestions/complaints regarding this tool would be much appreciated. Contributions are most welcome as well, of course.

LinkedIn: Harel Ben Attia

Twitter: @harelba

Email: [email protected]

q on Twitter: #qtextasdata

Patreon: harelba - All the money received is donated to the Center for the Prevention and Treatment of Domestic Violence in my hometown - Ramla, Israel.

q's People

Contributors

bessarabov, bfontaine, cmpt376kor, fil, ggventurini, h5rdly, harelba, heliac2000, imba-tjd, incognito124, jinzhencheng, jkrag, jungle-boogie, mattn, mbrukman, michaeljoseph, serima, shigemk2, streakycobra, swapnilmj, zhanxw

q's Issues

New feature: use header to name columns

I hope I did not miss something in the documentation. It would be nice to be able to use the header with -H to give names to the columns (via that or some other new option).

For example:

$ printf "name,value\na,1\nb,2\na,5\n" > test.csv
$ cat test.csv
name,value
a,1
b,2
a,5

This is already possible:

$ cat test.csv | ./q -d',' "select c1,avg(c2) from - group by c1" -H 1
a,3.0
b,2.0

This would be nice (automatic naming):

$ cat test.csv | ./q -d',' "select name,avg(value) from - group by name" -H 1
a,3.0
b,2.0

Ensure that generated temp tables are uniquely named

Currently, the temp tables created by the tool are named using a random number (one billion possibilities). Although very unlikely, it is possible that temp table names would collide and cause the tool to fail.

The tool should verify that the generated temp table name is unique, and retry the name generation until a unique name is found.

Harel
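
The retry loop described above could be sketched as follows (hypothetical names, not q's actual code):

```python
import random

def unique_temp_table_name(existing_names, max_attempts=100):
    # Keep generating random names until one is not already taken.
    for _ in range(max_attempts):
        candidate = "temp_table_%d" % random.randint(0, 10**9)
        if candidate not in existing_names:
            return candidate
    raise RuntimeError("could not generate a unique temp table name")

taken = {"temp_table_42"}
name = unique_temp_table_name(taken)
print(name)
```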

A dot in the column name results in "no such column" error

Using this simple file as input:

a.id a.name
1 bill
2 bob

The query q -H "select a.id from test_file" results in the following error:

query error: no such column: a.id
Warning - There seems to be a "no such column" error, and -H (header line) exists. Please make sure that you are using the column names from the header line and not the default (cXX) column names

Allow easier tab-delimited output

There doesn't seem to be a simple way to specify tab-delimited output.

E.g., I've tried q -D '\t' and q -D "\t", but both fail to produce tabs.

q -D "$(echo -e '\t')" works (in bash, anyway), but is needlessly complicated.

Since -t is shorthand for -d <tab>, perhaps -T can be shorthand for -D <tab>.
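
As a workaround until such a shorthand exists, a literal tab can be captured in a variable without the echo dance (this only demonstrates the quoting; whether a -T shorthand exists depends on the q version):

```shell
# Portable: command substitution of printf yields a literal tab character.
TAB="$(printf '\t')"
# In bash/zsh/ksh, ANSI-C quoting is equivalent and shorter: TAB=$'\t'

# Show the single byte 09 (tab) to confirm.
printf '%s' "$TAB" | od -An -tx1
```

The variable can then be passed directly, e.g. q -D "$TAB" ... .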

Prevent regexp from failing when field value is null

Currently, throws the following error:

query error: user-defined function raised exception

Workaround is to add a where clause which filters out null values for the relevant column.

Null values can also occur when running relaxed mode (the default) when there are rows with fewer columns than expected. Use -m strict if needed, in order to get an error when column count is not as expected.

Will be fixed in the coming release.
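
The failure and the workaround can be reproduced with plain Python and sqlite3 (a sketch of the mechanism; the `regexp` UDF and table layout here are illustrative, not q's internals):

```python
import re
import sqlite3

def regexp(pattern, value):
    # A naive REGEXP UDF: raises TypeError when value is None (NULL).
    return re.search(pattern, value) is not None

conn = sqlite3.connect(":memory:")
conn.create_function("regexp", 2, regexp)
conn.execute("CREATE TABLE t (c1 TEXT)")
conn.executemany("INSERT INTO t VALUES (?)", [("abc",), (None,), ("abd",)])

# Without the NULL filter this raises sqlite3.OperationalError
# ("user-defined function raised exception").
try:
    conn.execute("SELECT c1 FROM t WHERE c1 REGEXP 'ab.'").fetchall()
except sqlite3.OperationalError as e:
    print("failed:", e)

# Workaround: filter out NULLs first (in practice SQLite short-circuits the AND).
rows = conn.execute(
    "SELECT c1 FROM t WHERE c1 IS NOT NULL AND c1 REGEXP 'ab.'"
).fetchall()
print(rows)
```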

Left joining an empty file can cause problems

Customer's use case:

Example:

File a.csv:

Name;Val
A;1
B;2
C;3

File b.csv:

Name;Val
A;10
B;11

select f1.name , ifnull(f2.val,0) as val from e:\a.csv f1 left join e:\b.csv f2 on f1.name = f2.name

In the example above, the result is:

A;10
B;11
C;0

But if the b.csv file is empty (only the header), the query doesn't produce any result at all instead of

A;0
B;0
C;0
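
For comparison, plain sqlite3 with an empty right-hand table does produce the expected rows (an illustrative sketch, not q's code path):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE a (name TEXT, val INT)")
conn.execute("CREATE TABLE b (name TEXT, val INT)")
conn.executemany("INSERT INTO a VALUES (?,?)",
                 [("A", 1), ("B", 2), ("C", 3)])
# b stays empty: only the "header" (schema) exists, no rows.

rows = conn.execute(
    "SELECT f1.name, IFNULL(f2.val, 0) AS val "
    "FROM a f1 LEFT JOIN b f2 ON f1.name = f2.name"
).fetchall()
print(rows)  # every left row survives, with val defaulted to 0
```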

I hope you could add an option to cope with different column counts

If a column doesn't affect the result, why not just ignore the difference between the specific line and the other, standard lines?

For example, let's say I have a directory which contains a space in its name.

So when you use

ls -l | q "select c5 from - order by c5 desc"

it will throw an error caused by the irrelevant directory name.

rpm: changelog

The RPM changelog is for the history of changes to the spec file, not the project. This is way too much:

  • Mon Mar 03 2014 Harel Ben-Attia [email protected] 1.3.0-1
  • Added column name and type detection (Use -A to see name/type analysis for the specified input)
  • Added support for multiple parsing modes - Relaxed, Strict and Fluffy (old, backward compatible behavior)
  • Fixed tab delimiter parameter problem
  • More improvements to error reporting
  • Added a test suite, in preparation for refactoring
  • #7 - Dynamic column count support
  • #8 - Column name inference from input containing a header row
  • #9 - Automatic column type inference using sample data
  • #30 - Header lines option does nothing

Reuse of previously loaded data

Reusing a file that has already been loaded in the past should be faster. This can be achieved by some form of caching of the loaded data.

process CSV quotes according to folk practices?

According to (hmm…) folk practice ( http://en.wikipedia.org/wiki/Comma-separated_values#Technical_background ), quotes in CSV can be either "double-quoted" "" or backslash-escaped \".

q does seem to do something else with quotes (but what exactly)?

$ cat doublequotes.csv ; q -t "select * from doublequotes.csv"
A   B
"a ""quote"" is escaped so" isn't it
"yeah"  """"""

A   B
a "quote"" is escaped so"   isn't it
yeah    """"
$ cat backquotes.csv; q -t "select * from backquotes.csv"
A   B
"a \"quote\" is escaped so" isn't it
"yeah"  "\"\""

A   B
a \quote\" is escaped so"   isn't it
yeah    \\""
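
For reference, Python's csv module implements the doubling convention described above; a doubled quote inside a quoted field decodes to a single literal quote:

```python
import csv
import io

data = 'A,B\n"a ""quote"" is escaped so","isn\'t it"\n'
rows = list(csv.reader(io.StringIO(data)))
print(rows)
# [['A', 'B'], ['a "quote" is escaped so', "isn't it"]]
```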

Allow working with an External DB

Currently, only an in-memory sqlite database is being used. The tool could be used in more situations if data could be written to an external database.

casting can be simpler

This is not a real issue, but I wanted to share that the "not so clean casting" can be made simpler.
From the readme sample:
q "SELECT c5,c9 FROM mydatafile WHERE CAST(c5 AS INT) > 1000"
Can be made easier to type as:
q "SELECT c5,c9 FROM mydatafile WHERE 0+c5 > 1000"

As far as I know, this is quite common SQL practice with real DBs.
It goes further with ''||cX for casting to string (if needed).
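
The trick works because SQLite's arithmetic coerces text operands to numbers; this can be verified with plain sqlite3:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (c5 TEXT)")
conn.executemany("INSERT INTO t VALUES (?)",
                 [("500",), ("1500",), ("2000",)])

# CAST and the 0+ trick select the same rows.
cast_rows = conn.execute(
    "SELECT c5 FROM t WHERE CAST(c5 AS INT) > 1000").fetchall()
plus_rows = conn.execute(
    "SELECT c5 FROM t WHERE 0+c5 > 1000").fetchall()
print(cast_rows, plus_rows)
```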

Copyright peculiarities

https://github.com/harelba/q/search?q=copyright

In 1988 Python didn't even exist... How is it possible for q to be dated back to that year?

Also, is this software really the FSF's property? Does the FSF even know about it? (The GNU GPL license doesn't imply that software licensed under it is part of the GNU project.)

//EDIT changed the issue title.

RPM: Packaging issues

Vendor and Packager are not used anymore; kill those fields.

The %clean section isn't needed anymore.

'%{__install}' can be replaced with just 'install'. No need to use a macro.

Automatic column type inference using sample data

Currently, column types are not inferred, and the SQL type treatment depends on the expression being used. For example, when using sum(c2), c2 will be treated as a number. This sometimes requires converting types as part of the SQL statement - for example: ls -dltr * | q "select * from - where cast(c5 as int) > 10000". Automatic type inference can fix that.

Last column should allow for spaces?

I love the idea of being able to query the output of ls but there are some practical constraints that I haven't seen addressed. Maybe you have some ideas for the following?

https://gist.github.com/canadaduane/9197079

Piping the above output to q will fail with the following error message:

Encountered a line in an invalid format -:1 - 11 columns instead of 9. Did you make sure to set the correct delimiter?

This is due to the symlink "Projects -> /Users/duane/Dropbox/Projects". Perhaps we could add a flag that allows the last column to scoop up spaces? I realize this wouldn't work if the first line/file happens to have spaces. So, yeah... not a perfect solution yet.
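
A flag that lets the last column "scoop up" spaces amounts to splitting with a bounded field count, which Python's str.split shows directly (illustrative only; not an existing q flag):

```python
line = ("lrwxr-xr-x 1 duane staff 32 Feb 24 12:00 "
        "Projects -> /Users/duane/Dropbox/Projects")
# ls -l has 9 columns; split at most 8 times so the 9th keeps its spaces.
cols = line.split(None, 8)
print(len(cols))   # 9
print(cols[8])     # the symlink column, spaces intact
```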

Add a LICENSE file

Hello,

q looks nice but there's no mention of what license it is under. Can you add a LICENSE file?

Cheers!

Support spaces in tables/files names

I haven't found a way to use files with spaces in their name: quoting with ", ', or ` doesn't work; escaping spaces with a backslash doesn't work either.

expand numeric types when autodetecting column types

Column types are autodetected and handled properly. However, there are some cases where a numeric column may be treated as text, depending on the actual sample data used for autodetecting the column type.

This can be easily resolved by automatically expanding numeric types int->long->float as part of the autodetection.

Harel
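
The widening described above could look like this (a hypothetical sketch, not q's code; Python 3 folds long into int, so the chain here is int→float→text):

```python
def widen(current, value):
    # Try the narrowest type first, widening only when parsing fails.
    order = ["int", "float", "text"]
    for t in order[order.index(current):]:
        try:
            if t == "int":
                int(value)
            elif t == "float":
                float(value)
            return t
        except ValueError:
            continue
    return "text"

col_type = "int"
for v in ["1", "2", "3.5"]:
    col_type = widen(col_type, v)
print(col_type)  # float
```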

Quote values that contain either field or record separators

Please correct me if I'm wrong, but as far as I know, when q outputs CSV, it does not quote fields that contain a field separator or a record separator. This likely implies that it also does not escape quote characters, but I haven't checked that explicitly.
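
For reference, minimal quoting as requested here is what Python's csv.writer does with QUOTE_MINIMAL: fields containing the delimiter, the quote character, or a record separator get quoted, and embedded quotes are doubled:

```python
import csv
import io

buf = io.StringIO()
writer = csv.writer(buf, quoting=csv.QUOTE_MINIMAL, lineterminator="\n")
writer.writerow(["plain", "has,comma", 'has "quote"', "has\nnewline"])
print(buf.getvalue())
# plain,"has,comma","has ""quote""","has
# newline"
```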

Example 1 fails (OS X)

$ ls -ltr * | q "select c1,count(1) from - group by c1"
Traceback (most recent call last):
File "/usr/local/bin/q", line 485, in
table_creator.populate()
File "/usr/local/bin/q", line 379, in populate
self._flush_inserts()
File "/usr/local/bin/q", line 415, in _flush_inserts
self.db.execute_and_fetch(insert_row_stmt)
File "/usr/local/bin/q", line 111, in execute_and_fetch
self.cursor.execute(q)
sqlite3.OperationalError: near ")": syntax error

UTF-8 with BOM files cause column naming issues

 $ cat dailytasks.csv 
"typeid","limit","apcost","date","checkpointId"
"int","int","int","string","string"
"1","2","5","1,2,3,4,5,6,7","3000,3001,3002"
"2","2","5","1,2,3,4,5,6,7","3003,3004,3005"

This is the original content of the file.

 $ q -H -O -d ,  'select * from ./dailytasks.csv where "limit" = 2' 
"typeid",limit,apcost,date,checkpointId
1,2,5,1,2,3,4,5,6,7,3000,3001,3002
2,2,5,1,2,3,4,5,6,7,3003,3004,3005

When I use "limit" in the WHERE clause, it works fine.

 $ q -H -O -d ,  'select * from ./dailytasks.csv where typeid = 1' 
query error: no such column: typeid
Warning - There seems to be a "no such column" error, and -H (header line) exists. Please make sure that you are using the column names from the header line and not the default (cXX) column names

But when I try the first column of the header, typeid, q returns a warning and no result.

I also tried ...

 $ q -H -O -d ,  'select * from ./dailytasks.csv where "typeid" = 1'  
"typeid",limit,apcost,date,checkpointId

 $ q -H -O -d ,  'select * from ./dailytasks.csv where `"typeid"` = 1' 
query error: no such column: "typeid"
Warning - There seems to be a "no such column" error, and -H (header line) exists. Please make sure that you are using the column names from the header line and not the default (cXX) column names

But they all failed.

q accepts EOF twice when reading input from tty

If the input is a pipe, it is not noticeable:

$ cat | q 'SELECT * FROM -'
1 2 3
^D
1 2 3
$

But if reading input directly from the terminal, it stops only on double EOF, ignoring single ones:

$ q 'SELECT * FROM -'
2 3 4
^D
5 4 3
^D^D
2 3 4
5 4 3
$

Dynamic column count support

The tool should support input with varying column count per row. This will allow more uses of the tool for semi-structured input.

Problem with number of columns

Here is a simple example where q seems to fail recognizing the number of columns. Let's create a simple CSV file.

$ printf "a,1,0\nb,2,0\nc,,0\n" > test.csv
$ cat test.csv
a,1,0
b,2,0
c,,0

Now, using the latest version of q:

$ cat test.csv| ./q -d',' "select * from -"
a,1,0,,,,,,,
b,2,0,,,,,,,
c,,0,,,,,,,

My guess is that q is having problems with EOL characters here, but I did not dig into the code. For me, the expected output would be:

$ cat test.csv| ./q -d',' "select * from -"
a,1,0
b,2,0
c,,0

"SELECT 5" shows error in addition to the result

$ q 'SELECT 5'
5
Traceback (most recent call last):
  File "/usr/local/bin/q", line 1098, in <module>
    table_creator.drop_table()
NameError: name 'table_creator' is not defined

(It's the first query I ever tried with q.)

Problem with whitespace delimiter

Let's create a simple CSV file where the delimiter is a whitespace, e.g. " ".

$ printf "a 1 0\nb 2 0\nc  0\n" > test.csv
$ cat test.csv
a 1 0
b 2 0
c  0

Note that for the last row, we have two spaces between the c and the 0.
Now, using the latest version of q:

$ cat test.csv| ./q "select * from -" -D ';' 
a;1;0;;;;;;;
b;2;0;;;;;;;
c;0;;;;;;;;

As we can see, apart from the problem of the extra ; (described in a separate issue, #36), the last row is incorrect. The expected output would be (still showing the extra ;):

$ cat test.csv| ./q "select * from -" -D ';' 
a;1;0;;;;;;;
b;2;0;;;;;;;
c;;0;;;;;;;

Note the two ; between the c and the 0 in the last row, meaning that 0 was indeed a value of the third column, since the delimiter is a whitespace and there were two of them.
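
The two behaviors at stake map onto Python's two string-split modes, which may be where the collapsing comes from (illustrative only):

```python
line = "c  0"
print(line.split(" "))  # ['c', '', '0'] -> empty second column preserved
print(line.split())     # ['c', '0']     -> runs of whitespace collapsed
```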

problem caused by header_skip

try this exampledatafile:

total 368
-rw-rw-r--. 1 zhaokunyao zhaokunyao 1621 Aug 8 16:08 add_fkey_idx.sql
-rw-rw-r--. 1 zhaokunyao zhaokunyao 317 Aug 8 16:08 count.sql
-rw-rw-r--. 1 zhaokunyao zhaokunyao 3105 Aug 8 16:08 create_table.sql
-rw-rw-r--. 1 zhaokunyao zhaokunyao 763 Aug 8 16:08 drop_cons.sql
-rw-rw-r--. 1 zhaokunyao zhaokunyao 0 Aug 8 20:25 exampledatafile

q -H1 "select * from exampledatafile"
Traceback (most recent call last):
File "/bin/q", line 484, in
table_creator.populate()
File "/bin/q", line 374, in populate
self._insert_row(line)
File "/bin/q", line 393, in _insert_row
self._insert_row_i(line)
File "/bin/q", line 403, in _insert_row_i
raise Exception('Encountered a line in an invalid format %s:%s - %s columns instead of %s. Did you make sure to set the correct delimiter?' % (self.current_filename,self.lines_read,len(col_vals),len(self.column_inferer.column_names)))
Exception: Encountered a line in an invalid format exampledatafile:2 - 13 columns instead of 7. Did you make sure to set the correct delimiter?

Header lines option does nothing

The option for ignoring header lines is there; however, it doesn't actually get used at any point. All lines make it into the results.

Empty values should be handled as *NULL* when computing average

Let's create a simple CSV file.

$ printf "a,1,0\nb,2,0\nc,,0\n" > test.csv
$ cat test.csv
a,1,0
b,2,0
c,,0

Now, using the latest version of q:

$ cat test.csv| ./q -d',' "select avg(c2) from -"  
1.0

This output is unexpected to me.
Here we are computing the average of 1, 2 and an empty value. The result given by q is 1, because q seems to treat the empty value as 0, which gives (1 + 2 + 0) / 3 = 1. From a statistical point of view, it makes more sense to treat the empty value as unknown and compute the average as follows: (1 + 2) / 2 = 1.5.

In other SQL engines, NULL values are excluded from the average and not counted as zeros, so the average of 1, 2 and NULL is indeed 1.5.
Converting empty values to NULL would solve the problem here.
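
SQLite itself already excludes NULLs from AVG, which can be checked directly; converting empty strings to NULL before loading would therefore yield 1.5:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (c2)")
conn.executemany("INSERT INTO t VALUES (?)", [(1,), (2,), (None,)])

avg = conn.execute("SELECT AVG(c2) FROM t").fetchone()[0]
print(avg)  # 1.5 -> NULL is excluded from the average
```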

SPEC file for "properly" creating RPMs

Wow, "q", the tool I've always been waiting for! I recall Red Hat's own "squeal", but it is not configurable for arbitrary text files such as CSV.

Your provided Fedora RPM and SRPM both work absolutely fine, but I have managed to modify the SPEC file to build q from the proper, unmodified tarball, and to also create the docs while building the RPM, without preprocessing from the create-rpm script.

Are you interested in this SPEC file? I can branch your project, commit it, and create a pull request.

By the way, I have integrated the RPMs using this SPEC file into my Fedora copr repository:
http://copr.fedoraproject.org/coprs/barsnick/non-fed/

Build report:
http://copr.fedoraproject.org/coprs/barsnick/non-fed/build/27090/

Packages to be found e.g. here (as well as for various other Fedora flavors):
http://copr-be.cloud.fedoraproject.org/results/barsnick/non-fed/fedora-20-x86_64/q-text-as-data-1.4.0-1.fc20.1sunshine/

or via "yum install q-text-as-data" if the repo is enabled.

Better error reporting

Currently, some errors return tracebacks - a proper error message should be shown instead.
