harelba / q

10.1K 168.0 414.0 3.13 MB

q - Run SQL directly on delimited files and multi-file sqlite databases

Home Page: http://harelba.github.io/q/

License: GNU General Public License v3.0

Languages: Python 97.14% · Shell 1.08% · Batchfile 0.03% · HTML 0.21% · JavaScript 0.48% · CSS 0.12% · Starlark 0.94%
Topics: python, sql, cli, command-line-tool, textasdata, q, qtextasdata, csv, tsv, command-line

q's Introduction

q - Text as Data

q's purpose is to bring SQL expressive power to the Linux command line and to provide easy access to text as actual data.

q allows the following:

  • Performing SQL-like statements directly on tabular text data, auto-caching the data in order to accelerate additional querying of the same file.
  • Performing SQL statements directly on multi-file sqlite3 databases, without having to merge them or load them into memory.

The following table shows the impact of using caching:

Rows       Columns  File Size  Query time without caching  Query time with caching  Speed improvement
5,000,000  100      4.8GB      4 minutes, 47 seconds       1.92 seconds             x149
1,000,000  100      983MB      50.9 seconds                0.461 seconds            x110
1,000,000  50       477MB      27.1 seconds                0.272 seconds            x99
100,000    100      99MB       5.2 seconds                 0.141 seconds            x36
100,000    50       48MB       2.7 seconds                 0.105 seconds            x25

Note that in the current version, caching is not enabled by default, since the caches take up disk space. Use -C readwrite or -C read to enable it for a query, or set caching_mode in .qrc to change the default.
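
A minimal .qrc sketch for the caching_mode setting mentioned above (the option name comes from the text; the INI-style section layout is an assumption and may differ by q version):

```
[options]
caching_mode=readwrite
```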

q's website is https://harelba.github.io/q/ or https://q.textasdata.wiki. It contains everything you need to download and use q immediately.

Usage Examples

q treats ordinary files as database tables, and supports all SQL constructs, such as WHERE, GROUP BY, JOINs, etc. It supports automatic column name and type detection, and provides full support for multiple character encodings.

Here are some example commands to get the idea:

$ q "SELECT COUNT(*) FROM ./clicks_file.csv WHERE c3 > 32.3"

$ ps -ef | q -H "SELECT UID, COUNT(*) cnt FROM - GROUP BY UID ORDER BY cnt DESC LIMIT 3"

$ q "select count(*) from some_db.sqlite3:::albums a left join another_db.sqlite3:::tracks t on (a.album_id = t.album_id)"

Detailed examples can be found here.

Installation

The new major version 3.1.6 is out, with many significant additions.

Instructions for all OSs are here.

The previous version 2.0.19 can still be downloaded from here.

Contact

Any feedback/suggestions/complaints regarding this tool would be much appreciated. Contributions are most welcome as well, of course.

LinkedIn: Harel Ben Attia

Twitter: @harelba

Email: [email protected]

q on Twitter: #qtextasdata

Patreon: harelba - All the money received is donated to the Center for the Prevention and Treatment of Domestic Violence in my hometown - Ramla, Israel.

q's People

Contributors

bessarabov, bfontaine, cmpt376kor, fil, ggventurini, h5rdly, harelba, heliac2000, imba-tjd, incognito124, jinzhencheng, jkrag, jungle-boogie, mattn, mbrukman, michaeljoseph, serima, shigemk2, streakycobra, swapnilmj, zhanxw

q's Issues

New feature: use header to name columns

I hope I did not miss something in the documentation. It would be nice to be able to use the header with -H to give names to the columns (via that or some other new option).

For example:

$ printf "name,value\na,1\nb,2\na,5\n" > test.csv
$ cat test.csv
name,value
a,1
b,2
a,5

This is already possible:

$ cat test.csv | ./q -d',' "select c1,avg(c2) from - group by c1" -H 1
a,3.0
b,2.0

This would be nice (automatic naming):

$ cat test.csv | ./q -d',' "select name,avg(value) from - group by name" -H 1
a,3.0
b,2.0

Ensure that generated temp tables are uniquely named

Currently, the temp tables created by the tool are named using a random number (one billion possibilities). Although very unlikely, it is possible that temp table names would collide and cause the tool to fail.

The tool should verify that the generated temp table name is unique, and retry the name generation until a unique name is found.

Harel
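
The retry loop described above could be sketched as follows (hypothetical names, not q's actual code):

```python
import random

def unique_temp_table_name(existing_names, max_attempts=100):
    # Keep generating random names until one is not already taken.
    for _ in range(max_attempts):
        candidate = "temp_table_%d" % random.randint(0, 10**9)
        if candidate not in existing_names:
            return candidate
    raise RuntimeError("could not generate a unique temp table name")

taken = {"temp_table_42"}
name = unique_temp_table_name(taken)
print(name)
```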

A dot in the column name results in "no such column" error

Using this simple file as input:

a.id a.name
1 bill
2 bob

The query q -H "select a.id from test_file" results in the following error:

query error: no such column: a.id
Warning - There seems to be a "no such column" error, and -H (header line) exists. Please make sure that you are using the column names from the header line and not the default (cXX) column names

Allow easier tab-delimited output

There doesn't seem to be a simple way to specify tab-delimited output.

E.g., I've tried q -D '\t' and q -D "\t", but both fail to produce tabs.

q -D "$(echo -e '\t')" works (in bash, anyway), but is needlessly complicated.

Since -t is shorthand for -d <tab>, perhaps -T can be shorthand for -D <tab>.
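
As a workaround until such a shorthand exists, a literal tab can be captured in a variable without the echo dance (this only demonstrates the quoting; whether a -T shorthand exists depends on the q version):

```shell
# Portable: command substitution of printf yields a literal tab character.
TAB="$(printf '\t')"
# In bash/zsh/ksh, ANSI-C quoting is equivalent and shorter: TAB=$'\t'

# Show the single byte 09 (tab) to confirm.
printf '%s' "$TAB" | od -An -tx1
```

The variable can then be passed directly, e.g. q -D "$TAB" ... .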

Prevent regexp from failing when field value is null

Currently, throws the following error:

query error: user-defined function raised exception

Workaround is to add a where clause which filters out null values for the relevant column.

Null values can also occur when running relaxed mode (the default) when there are rows with fewer columns than expected. Use -m strict if needed, in order to get an error when column count is not as expected.

Will be fixed in the coming release.
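
The failure and the workaround can be reproduced with plain Python and sqlite3 (a sketch of the mechanism; the `regexp` UDF and table layout here are illustrative, not q's internals):

```python
import re
import sqlite3

def regexp(pattern, value):
    # A naive REGEXP UDF: raises TypeError when value is None (NULL).
    return re.search(pattern, value) is not None

conn = sqlite3.connect(":memory:")
conn.create_function("regexp", 2, regexp)
conn.execute("CREATE TABLE t (c1 TEXT)")
conn.executemany("INSERT INTO t VALUES (?)", [("abc",), (None,), ("abd",)])

# Without the NULL filter this raises sqlite3.OperationalError
# ("user-defined function raised exception").
try:
    conn.execute("SELECT c1 FROM t WHERE c1 REGEXP 'ab.'").fetchall()
except sqlite3.OperationalError as e:
    print("failed:", e)

# Workaround: filter out NULLs first (in practice SQLite short-circuits the AND).
rows = conn.execute(
    "SELECT c1 FROM t WHERE c1 IS NOT NULL AND c1 REGEXP 'ab.'"
).fetchall()
print(rows)
```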

Left joining an empty file can cause problems

Customer's use case:

Example:

File a.csv:

Name;Val
A;1
B;2
C;3

File b.csv:

Name;Val
A;10
B;11

select f1.name , ifnull(f2.val,0) as val from e:\a.csv f1 left join e:\b.csv f2 on f1.name = f2.name

In the example above, the result is:

A;10
B;11
C;0

But if the b.csv file is empty (only the header), the query doesn't produce any result at all instead of

A;0
B;0
C;0
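
For comparison, plain sqlite3 with an empty right-hand table does produce the expected rows (an illustrative sketch, not q's code path):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE a (name TEXT, val INT)")
conn.execute("CREATE TABLE b (name TEXT, val INT)")
conn.executemany("INSERT INTO a VALUES (?,?)",
                 [("A", 1), ("B", 2), ("C", 3)])
# b stays empty: only the "header" (schema) exists, no rows.

rows = conn.execute(
    "SELECT f1.name, IFNULL(f2.val, 0) AS val "
    "FROM a f1 LEFT JOIN b f2 ON f1.name = f2.name"
).fetchall()
print(rows)  # every left row survives, with val defaulted to 0
```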

I hope you could add an option to cope with different column counts

If a column doesn't affect the result, why not just ignore the difference between the specific line and the other, standard lines?

For example, let's say I have a directory which contains a space in its name.

So when you use

ls -l | q "select c5 from - order by c5 desc"

it will throw an error caused by the irrelevant directory name.

rpm: changelog

The RPM changelog is for the history of changes to the spec file, not the project. This is way too much:

  • Mon Mar 03 2014 Harel Ben-Attia [email protected] 1.3.0-1
  • Added column name and type detection (Use -A to see name/type analysis for the specified input)
  • Added support for multiple parsing modes - Relaxed, Strict and Fluffy (old, backward compatible behavior)
  • Fixed tab delimiter parameter problem
  • More improvements to error reporting
  • Added a test suite, in preparation for refactoring
  • #7 - Dynamic column count support
  • #8 - Column name inference from input containing a header row
  • #9 - Automatic column type inference using sample data
  • #30 - Header lines option does nothing

Reuse of previously loaded data

Reusing a file that has already been loaded in the past should be faster. This can be achieved by some form of caching of the loaded data.

process CSV quotes according to folk practices?

According to (hmm…) folk practice ( http://en.wikipedia.org/wiki/Comma-separated_values#Technical_background ), quotes in CSV can be either "double-quoted" "" or backslash-escaped \".

q does seem to do something else with quotes (but what exactly)?

$ cat doublequotes.csv ; q -t "select * from doublequotes.csv"
A   B
"a ""quote"" is escaped so" isn't it
"yeah"  """"""

A   B
a "quote"" is escaped so"   isn't it
yeah    """"
$ cat backquotes.csv; q -t "select * from backquotes.csv"
A   B
"a \"quote\" is escaped so" isn't it
"yeah"  "\"\""

A   B
a \quote\" is escaped so"   isn't it
yeah    \\""
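
For reference, Python's csv module implements the doubling convention described above; a doubled quote inside a quoted field decodes to a single literal quote:

```python
import csv
import io

data = 'A,B\n"a ""quote"" is escaped so","isn\'t it"\n'
rows = list(csv.reader(io.StringIO(data)))
print(rows)
# [['A', 'B'], ['a "quote" is escaped so', "isn't it"]]
```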

Allow working with an External DB

Currently, only an in-memory sqlite database is being used. The tool could be used in more situations if data could be written to an external database.

casting can be simpler

This is not a real issue, but I wanted to share that the "not so clean casting" can be made simpler.
From the readme sample:
q "SELECT c5,c9 FROM mydatafile WHERE CAST(c5 AS INT) > 1000"
Can be made easier to type as:
q "SELECT c5,c9 FROM mydatafile WHERE 0+c5 > 1000"

As far as I know, this is quite common SQL practice with real DBs.
It goes further with ''||cX for casting to string (if needed).
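
The trick works because SQLite's arithmetic coerces text operands to numbers; this can be verified with plain sqlite3:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (c5 TEXT)")
conn.executemany("INSERT INTO t VALUES (?)",
                 [("500",), ("1500",), ("2000",)])

# CAST and the 0+ trick select the same rows.
cast_rows = conn.execute(
    "SELECT c5 FROM t WHERE CAST(c5 AS INT) > 1000").fetchall()
plus_rows = conn.execute(
    "SELECT c5 FROM t WHERE 0+c5 > 1000").fetchall()
print(cast_rows, plus_rows)
```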

Copyright peculiarities

https://github.com/harelba/q/search?q=copyright

In 1988 Python didn't even exist... How is it possible for q to be dated back to that year?

Also, is this software really the FSF's property? Does the FSF even know about it? (The GNU GPL license doesn't imply that software licensed under it is part of the GNU project.)

//EDIT changed the issue title.

RPM: Packaging issues

Vendor and Packager are not used anymore; kill those fields.

The %clean section isn't needed anymore.

'%{__install}' can be replaced with just 'install'. No need to use a macro.

Automatic column type inference using sample data

Currently, column types are not inferred, and the SQL type treatment depends on the expression being used. For example, when using sum(c2), c2 will be treated as a number. This sometimes requires converting types as part of the SQL statement - for example: ls -dltr * | q "select * from - where cast(c5 as int) > 10000". Automatic type inference can fix that.

Last column should allow for spaces?

I love the idea of being able to query the output of ls but there are some practical constraints that I haven't seen addressed. Maybe you have some ideas for the following?

https://gist.github.com/canadaduane/9197079

Piping the above output to q will fail with the following error message:

Encountered a line in an invalid format -:1 - 11 columns instead of 9. Did you make sure to set the correct delimiter?

This is due to the symlink "Projects -> /Users/duane/Dropbox/Projects". Perhaps we could add a flag that allows the last column to scoop up spaces? I realize this wouldn't work if the first line/file happens to have spaces. So, yeah... not a perfect solution yet.
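
A flag that lets the last column "scoop up" spaces amounts to splitting with a bounded field count, which Python's str.split shows directly (illustrative only; not an existing q flag):

```python
line = ("lrwxr-xr-x 1 duane staff 32 Feb 24 12:00 "
        "Projects -> /Users/duane/Dropbox/Projects")
# ls -l has 9 columns; split at most 8 times so the 9th keeps its spaces.
cols = line.split(None, 8)
print(len(cols))   # 9
print(cols[8])     # the symlink column, spaces intact
```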

Add a LICENSE file

Hello,

q looks nice but there's no mention of what license it is under. Can you add a LICENSE file?

Cheers!

Support spaces in tables/files names

I haven't found a way to use files with spaces in their name: quoting with ", ', or ` doesn't work; escaping spaces with a backslash doesn't work either.

expand numeric types when autodetecting column types

Column types are autodetected and handled properly. However, there are some cases where a numeric column may be treated as text, depending on the actual sample data used for autodetecting the column type.

This can be easily resolved by automatically expanding numeric types int->long->float as part of the autodetection.

Harel
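
The widening described above could look like this (a hypothetical sketch, not q's code; Python 3 folds long into int, so the chain here is int→float→text):

```python
def widen(current, value):
    # Try the narrowest type first, widening only when parsing fails.
    order = ["int", "float", "text"]
    for t in order[order.index(current):]:
        try:
            if t == "int":
                int(value)
            elif t == "float":
                float(value)
            return t
        except ValueError:
            continue
    return "text"

col_type = "int"
for v in ["1", "2", "3.5"]:
    col_type = widen(col_type, v)
print(col_type)  # float
```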

Quote values that contain either field or record separators

Please correct me if I'm wrong, but as far as I know, when q outputs CSV, it does not quote fields that contain a field separator or a record separator. This likely implies that it also does not escape quote characters, but I haven't checked that explicitly.
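
For reference, minimal quoting as requested here is what Python's csv.writer does with QUOTE_MINIMAL: fields containing the delimiter, the quote character, or a record separator get quoted, and embedded quotes are doubled:

```python
import csv
import io

buf = io.StringIO()
writer = csv.writer(buf, quoting=csv.QUOTE_MINIMAL, lineterminator="\n")
writer.writerow(["plain", "has,comma", 'has "quote"', "has\nnewline"])
print(buf.getvalue())
# plain,"has,comma","has ""quote""","has
# newline"
```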

Example 1 fails (OS X)

$ ls -ltr * | q "select c1,count(1) from - group by c1"
Traceback (most recent call last):
File "/usr/local/bin/q", line 485, in
table_creator.populate()
File "/usr/local/bin/q", line 379, in populate
self._flush_inserts()
File "/usr/local/bin/q", line 415, in _flush_inserts
self.db.execute_and_fetch(insert_row_stmt)
File "/usr/local/bin/q", line 111, in execute_and_fetch
self.cursor.execute(q)
sqlite3.OperationalError: near ")": syntax error

UTF-8 with BOM files cause column naming issues

 $ cat dailytasks.csv 
"typeid","limit","apcost","date","checkpointId"
"int","int","int","string","string"
"1","2","5","1,2,3,4,5,6,7","3000,3001,3002"
"2","2","5","1,2,3,4,5,6,7","3003,3004,3005"

This is the original content of the file.

 $ q -H -O -d ,  'select * from ./dailytasks.csv where "limit" = 2' 
"typeid",limit,apcost,date,checkpointId
1,2,5,1,2,3,4,5,6,7,3000,3001,3002
2,2,5,1,2,3,4,5,6,7,3003,3004,3005

When I use "limit" in the WHERE clause, it works fine.

 $ q -H -O -d ,  'select * from ./dailytasks.csv where typeid = 1' 
query error: no such column: typeid
Warning - There seems to be a "no such column" error, and -H (header line) exists. Please make sure that you are using the column names from the header line and not the default (cXX) column names

But when I try the first column of the header, typeid, q returns a warning and no result.

I also tried ...

 $ q -H -O -d ,  'select * from ./dailytasks.csv where "typeid" = 1'  
"typeid",limit,apcost,date,checkpointId

 $ q -H -O -d ,  'select * from ./dailytasks.csv where `"typeid"` = 1' 
query error: no such column: "typeid"
Warning - There seems to be a "no such column" error, and -H (header line) exists. Please make sure that you are using the column names from the header line and not the default (cXX) column names

But they all failed.

q accepts EOF twice when reading input from tty

If the input is a pipe, it is not noticeable:

$ cat | q 'SELECT * FROM -'
1 2 3
^D
1 2 3
$

But if reading input directly from the terminal, it stops only on double EOF, ignoring single ones:

$ q 'SELECT * FROM -'
2 3 4
^D
5 4 3
^D^D
2 3 4
5 4 3
$

Dynamic column count support

The tool should support input with varying column count per row. This will allow more uses of the tool for semi-structured input.

Problem with number of columns

Here is a simple example where q seems to fail recognizing the number of columns. Let's create a simple CSV file.

$ printf "a,1,0\nb,2,0\nc,,0\n" > test.csv
$ cat test.csv
a,1,0
b,2,0
c,,0

Now, using the latest version of q:

$ cat test.csv| ./q -d',' "select * from -"
a,1,0,,,,,,,
b,2,0,,,,,,,
c,,0,,,,,,,

My guess is that q is having problems with EOL characters here, but I did not dig into the code. For me, the expected output would be:

$ cat test.csv| ./q -d',' "select * from -"
a,1,0
b,2,0
c,,0

"SELECT 5" shows error in addition to the result

$ q 'SELECT 5'
5
Traceback (most recent call last):
  File "/usr/local/bin/q", line 1098, in <module>
    table_creator.drop_table()
NameError: name 'table_creator' is not defined

(It's the first query I ever tried with q.)

Problem with whitespace delimiter

Let's create a simple CSV file where the delimiter is a whitespace, e.g. " ".

$ printf "a 1 0\nb 2 0\nc  0\n" > test.csv
$ cat test.csv
a 1 0
b 2 0
c  0

Note that for the last row, we have two spaces between the c and the 0.
Now, using the latest version of q:

$ cat test.csv| ./q "select * from -" -D ';' 
a;1;0;;;;;;;
b;2;0;;;;;;;
c;0;;;;;;;;

As we can see, apart from the problem of the extra ; (described in a separate issue, #36), the last row is incorrect. The expected output would be (still showing the extra ;):

$ cat test.csv| ./q "select * from -" -D ';' 
a;1;0;;;;;;;
b;2;0;;;;;;;
c;;0;;;;;;;

Note the two ; between the c and the 0 in the last row, meaning that 0 was indeed a value of the third column, since the delimiter is a whitespace and there were two of them.
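
The two behaviors at stake map onto Python's two string-split modes, which may be where the collapsing comes from (illustrative only):

```python
line = "c  0"
print(line.split(" "))  # ['c', '', '0'] -> empty second column preserved
print(line.split())     # ['c', '0']     -> runs of whitespace collapsed
```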

problem caused by header_skip

try this exampledatafile:

total 368
-rw-rw-r--. 1 zhaokunyao zhaokunyao 1621 Aug 8 16:08 add_fkey_idx.sql
-rw-rw-r--. 1 zhaokunyao zhaokunyao 317 Aug 8 16:08 count.sql
-rw-rw-r--. 1 zhaokunyao zhaokunyao 3105 Aug 8 16:08 create_table.sql
-rw-rw-r--. 1 zhaokunyao zhaokunyao 763 Aug 8 16:08 drop_cons.sql
-rw-rw-r--. 1 zhaokunyao zhaokunyao 0 Aug 8 20:25 exampledatafile

q -H1 "select * from exampledatafile"
Traceback (most recent call last):
File "/bin/q", line 484, in
table_creator.populate()
File "/bin/q", line 374, in populate
self._insert_row(line)
File "/bin/q", line 393, in _insert_row
self._insert_row_i(line)
File "/bin/q", line 403, in _insert_row_i
raise Exception('Encountered a line in an invalid format %s:%s - %s columns instead of %s. Did you make sure to set the correct delimiter?' % (self.current_filename,self.lines_read,len(col_vals),len(self.column_inferer.column_names)))
Exception: Encountered a line in an invalid format exampledatafile:2 - 13 columns instead of 7. Did you make sure to set the correct delimiter?

Header lines option does nothing

The option for ignoring header lines is there; however, it doesn't actually get used at any point. All lines make it into the results.

Empty values should be handled as *NULL* when computing average

Let's create a simple CSV file.

$ printf "a,1,0\nb,2,0\nc,,0\n" > test.csv
$ cat test.csv
a,1,0
b,2,0
c,,0

Now, using the latest version of q:

$ cat test.csv| ./q -d',' "select avg(c2) from -"  
1.0

This output is unexpected to me.
Here we are computing the average of 1, 2 and an empty value. The result given by q is 1, because q seems to treat the empty value as 0, which gives (1 + 2 + 0) / 3 = 1. From a statistical point of view, it makes more sense to treat the empty value as unknown and compute the average as follows: (1 + 2) / 2 = 1.5.

In other SQL engines, NULL values are excluded from the average and not counted as zeros, so the average of 1, 2 and NULL is indeed 1.5.
Converting empty values to NULL would solve the problem here.
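
SQLite itself already excludes NULLs from AVG, which can be checked directly; converting empty strings to NULL before loading would therefore yield 1.5:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (c2)")
conn.executemany("INSERT INTO t VALUES (?)", [(1,), (2,), (None,)])

avg = conn.execute("SELECT AVG(c2) FROM t").fetchone()[0]
print(avg)  # 1.5 -> NULL is excluded from the average
```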

SPEC file for "properly" creating RPMs

Wow, "q", the tool I've always been waiting for! I recall Red Hat's own "squeal", but it is not configurable for arbitrary text files such as CSV.

Your provided Fedora RPM and SRPM both work absolutely fine, but I have managed to modify the SPEC file to build q from the proper, unmodified tarball, and to also create the docs while building the RPM, without preprocessing from the create-rpm script.

Are you interested in this SPEC file? I can branch your project, commit it, and create a pull request.

By the way, I have integrated the RPMs using this SPEC file into my Fedora copr repository:
http://copr.fedoraproject.org/coprs/barsnick/non-fed/

Build report:
http://copr.fedoraproject.org/coprs/barsnick/non-fed/build/27090/

Packages to be found e.g. here (as well as for various other Fedora flavors):
http://copr-be.cloud.fedoraproject.org/results/barsnick/non-fed/fedora-20-x86_64/q-text-as-data-1.4.0-1.fc20.1sunshine/

or via "yum install q-text-as-data" if the repo is enabled.

Better error reporting

Currently, some errors return tracebacks - a proper error message should be shown instead.
