Dragnet is a tool for analyzing event stream data stored in files. There are three main commands:
- scan: scan over raw data to execute a query
- build: scan over raw data to produce an index for quickly answering predefined queries
- query: search indexes to execute a query
The prototypical use case is analyzing request logs from a production service. The workflow for Dragnet looks like this:
- Predefine a bunch of metrics you care about (like total request count, request count by server instance, request type, and so on).
- When you accumulate new logs (e.g., hourly or daily), you build the index.
- Whenever you want the values of those metrics, you query the index. This might be part of a constantly-updating dashboard, a daily report, or a threshold-based alarm.
- If you want to gather new metrics, you can define them and rebuild.
- If you want to run a complex query just once, you can scan the raw data rather than adding the query as a metric.
This project is still a prototype. The commands and library interfaces may change incompatibly at any time!
Dragnet only supports newline-separated JSON. Try it on the sample data in ./tests/data. Start by defining a new datasource:
$ dn datasource-add my_logs --path=$PWD/tests/data
$ dn datasource-list -v
DATASOURCE LOCATION
my_logs file://home/dap/dragnet/dragnet/tests/data
dataFormat: "json"
Now you can scan the data to count the total number of requests:
$ dn scan my_logs
VALUE
2252
You can also break out counts, e.g., by request method:
$ dn scan -b req.method my_logs
REQ.METHOD VALUE
DELETE 582
GET 556
HEAD 551
PUT 563
You can break out results by more than one field:
$ dn scan -b req.method,res.statusCode my_logs
REQ.METHOD RES.STATUSCODE VALUE
DELETE 200 75
DELETE 204 87
DELETE 400 94
DELETE 404 85
DELETE 499 83
DELETE 500 79
DELETE 503 79
GET 200 77
GET 204 83
GET 400 84
GET 404 74
GET 499 79
GET 500 73
GET 503 86
HEAD 200 71
HEAD 204 85
HEAD 400 66
HEAD 404 77
HEAD 499 88
HEAD 500 88
HEAD 503 76
PUT 200 80
PUT 204 79
PUT 400 83
PUT 404 88
PUT 499 68
PUT 500 83
PUT 503 82
(This is randomly-generated data, which is why you see some combinations that probably don't make sense, like a 200 from a DELETE.)
You can specify multiple fields separated by commas, like above, or using "-b" more than once. This example does the same thing as the previous one:
$ dn scan -b req.method -b res.statusCode my_logs
REQ.METHOD RES.STATUSCODE VALUE
DELETE 200 75
DELETE 204 87
DELETE 400 94
DELETE 404 85
DELETE 499 83
DELETE 500 79
DELETE 503 79
GET 200 77
GET 204 83
GET 400 84
GET 404 74
GET 499 79
GET 500 73
GET 503 86
HEAD 200 71
HEAD 204 85
HEAD 400 66
HEAD 404 77
HEAD 499 88
HEAD 500 88
HEAD 503 76
PUT 200 80
PUT 204 79
PUT 400 83
PUT 404 88
PUT 499 68
PUT 500 83
PUT 503 82
The order of breakdowns matters. If we reverse them, we get different output:
$ dn scan -b res.statusCode,req.method my_logs
RES.STATUSCODE REQ.METHOD VALUE
200 DELETE 75
200 GET 77
200 HEAD 71
200 PUT 80
204 DELETE 87
204 GET 83
204 HEAD 85
204 PUT 79
400 DELETE 94
400 GET 84
400 HEAD 66
400 PUT 83
404 DELETE 85
404 GET 74
404 HEAD 77
404 PUT 88
499 DELETE 83
499 GET 79
499 HEAD 88
499 PUT 68
500 DELETE 79
500 GET 73
500 HEAD 88
500 PUT 83
503 DELETE 79
503 GET 86
503 HEAD 76
503 PUT 82
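Conceptually, a scan with breakdowns counts records per unique tuple of field values, and the order of the breakdowns determines both the column order and the sort order. The Python sketch below illustrates that model on a few made-up records. It is not Dragnet's implementation: the helper names (`pluck`, `scan`) are hypothetical, and it assumes that dotted names like "req.method" refer to nested JSON properties.

```python
from collections import Counter

def pluck(record, dotted_name):
    """Look up a dotted field name such as "req.method" in nested dicts."""
    value = record
    for part in dotted_name.split("."):
        value = value[part]
    return value

def scan(records, breakdowns):
    """Count records for each unique tuple of breakdown values, sorted by tuple."""
    counts = Counter(tuple(pluck(r, b) for b in breakdowns) for r in records)
    return sorted(counts.items())

# Made-up records, shaped like the request logs in the examples above.
records = [
    {"req": {"method": "GET"}, "res": {"statusCode": 200}},
    {"req": {"method": "GET"}, "res": {"statusCode": 200}},
    {"req": {"method": "PUT"}, "res": {"statusCode": 204}},
    {"req": {"method": "GET"}, "res": {"statusCode": 404}},
]

# Same counts, different column and sort order.
by_method_then_status = scan(records, ["req.method", "res.statusCode"])
by_status_then_method = scan(records, ["res.statusCode", "req.method"])

# With no breakdowns, the result is a single total.
total = scan(records, [])
```

With no breakdowns the only key is the empty tuple, which corresponds to the single VALUE column in the first scan example.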
You can filter records using node-krill filter syntax:
$ dn scan -f '{ "eq": [ "req.method", "GET" ] }' my_logs
VALUE
556
and you can combine this with breakdowns, of course:
$ dn scan -f '{ "eq": [ "req.method", "GET" ] }' -b operation my_logs
OPERATION VALUE
getjoberrors 181
getpublicstorage 176
getstorage 199
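To make the filter semantics concrete, here is a minimal Python sketch of how a predicate like the ones above could be evaluated against a record. It handles only the "eq" and "ge" operators that appear in this document, it is not node-krill's implementation, and the function names and the nested-property lookup are assumptions made for illustration. Records that lack a field used by the filter are treated as non-matching and dropped.

```python
_MISSING = object()

def pluck(record, dotted_name):
    """Return the value at a dotted field name, or _MISSING if absent."""
    value = record
    for part in dotted_name.split("."):
        if not isinstance(value, dict) or part not in value:
            return _MISSING
        value = value[part]
    return value

def matches(predicate, record):
    """Evaluate a single-operator predicate such as {"eq": ["req.method", "GET"]}."""
    op, operands = next(iter(predicate.items()))
    field, expected = operands
    actual = pluck(record, field)
    if actual is _MISSING:
        return False  # records missing a filtered field are dropped
    if op == "eq":
        return actual == expected
    if op == "ge":
        return actual >= expected
    raise ValueError("operator not covered by this sketch: " + op)

# Made-up records; the last one has no "req.method" field.
records = [
    {"req": {"method": "GET"}, "res": {"statusCode": 200}},
    {"req": {"method": "PUT"}, "res": {"statusCode": 503}},
    {"res": {"statusCode": 500}},
]

get_count = sum(matches({"eq": ["req.method", "GET"]}, r) for r in records)
failure_count = sum(matches({"ge": ["res.statusCode", 500]}, r) for r in records)
```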
To break down by numeric quantities, it's usually best to aggregate nearby values into buckets. Here's a histogram of the "latency" field from this log:
$ dn scan -b latency[aggr=quantize] my_logs
value ------------- Distribution ------------- count
0 | 0
1 |@@ 113
2 |@@@@@@@@ 449
4 |@@@@@@ 348
8 | 0
16 |@@@@@@@@@@@@ 682
32 | 0
64 |@ 57
128 |@@@ 165
256 | 0
512 | 0
1024 |@@ 136
2048 |@@@@@ 302
4096 | 0
"aggr=quantize" specifies a power-of-two bucketization. You can also do a linear quantization, say with steps of size 200:
$ dn scan -b latency[aggr=lquantize,step=200] my_logs
value ------------- Distribution ------------- count
0 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 1814
200 | 0
400 | 0
600 | 0
800 | 0
1000 | 23
1200 |@ 31
1400 |@ 35
1600 | 18
1800 | 24
2000 |@ 34
2200 |@ 35
2400 | 28
2600 |@ 33
2800 | 18
3000 |@ 34
3200 | 27
3400 |@ 34
3600 | 26
3800 | 25
4000 | 13
4200 | 0
These are modeled after DTrace's aggregating actions. You can combine these with filters and other breakdowns:
$ dn scan -f '{ "eq": [ "req.method", "GET" ] }' \
-b req.method,operation,latency[aggr=quantize] my_logs
GET, getjoberrors
value ------------- Distribution ------------- count
0 | 0
1 |@@ 9
2 |@@@@@@@ 32
4 |@@@@@ 24
8 | 0
16 |@@@@@@@@@@@@@@ 63
32 | 0
64 |@ 5
128 |@@@ 13
256 | 0
512 | 0
1024 |@@@ 13
2048 |@@@@@ 22
4096 | 0
GET, getpublicstorage
value ------------- Distribution ------------- count
0 | 0
1 |@@@ 12
2 |@@@@@@@@ 37
4 |@@@@@@ 28
8 | 0
16 |@@@@@@@@@@@@ 51
32 | 0
64 | 1
128 |@@@@ 17
256 | 0
512 | 0
1024 |@@ 9
2048 |@@@@@ 21
4096 | 0
GET, getstorage
value ------------- Distribution ------------- count
0 | 0
1 |@@ 12
2 |@@@@@@@ 37
4 |@@@@@@ 29
8 | 0
16 |@@@@@@@@@@@@@ 67
32 | 0
64 |@@ 9
128 |@@ 8
256 | 0
512 | 0
1024 |@@ 11
2048 |@@@@@ 26
4096 | 0
If the last field isn't an aggregation, "dn" won't print a histogram, but it will still group nearby values. For example, if we reverse the order of that last example:
$ dn scan -f '{ "eq": [ "req.method", "GET" ] }' \
-b latency[aggr=quantize],req.method,operation my_logs
LATENCY REQ.METHOD OPERATION VALUE
1 GET getjoberrors 9
1 GET getpublicstorage 12
1 GET getstorage 12
2 GET getjoberrors 32
2 GET getpublicstorage 37
2 GET getstorage 37
4 GET getjoberrors 24
4 GET getpublicstorage 28
4 GET getstorage 29
16 GET getjoberrors 63
16 GET getpublicstorage 51
16 GET getstorage 67
64 GET getjoberrors 5
64 GET getpublicstorage 1
64 GET getstorage 9
128 GET getjoberrors 13
128 GET getpublicstorage 17
128 GET getstorage 8
1024 GET getjoberrors 13
1024 GET getpublicstorage 9
1024 GET getstorage 11
2048 GET getjoberrors 22
2048 GET getpublicstorage 21
2048 GET getstorage 26
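The bucketing behind "aggr=quantize" and "aggr=lquantize" can be sketched in a few lines of Python. This is an illustration of the idea, not Dragnet's code: it assumes non-negative values, and it assumes each bucket is labeled by its lower bound, as the histograms above suggest.

```python
from collections import Counter

def quantize(value):
    """Power-of-two bucket: the largest power of two <= value (0 for values below 1)."""
    if value < 1:
        return 0
    return 1 << (int(value).bit_length() - 1)

def lquantize(value, step):
    """Linear bucket: the largest multiple of step <= value."""
    return (value // step) * step

# Made-up latencies, in the same range as the sample data.
latencies = [1, 3, 17, 130, 1500, 2100]

power_of_two = Counter(quantize(v) for v in latencies)
linear = Counter(lquantize(v, 200) for v in latencies)
```

Grouping by the bucket value rather than the raw value is what keeps the number of distinct rows small, whether or not a histogram is printed.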
You can get per-day results by specifying a "date" field and doing a linear quantization with steps of size 86400 (for 86400 seconds per day). When using a "date" field, you have to specify what underlying JSON field should be parsed as a date:
$ dn scan -b timestamp[date,field=time,aggr=lquantize,step=86400] \
-b req.method my_logs
TIMESTAMP REQ.METHOD VALUE
2014-05-01T00:00:00.000Z DELETE 142
2014-05-01T00:00:00.000Z GET 113
2014-05-01T00:00:00.000Z HEAD 125
2014-05-01T00:00:00.000Z PUT 120
2014-05-02T00:00:00.000Z DELETE 133
2014-05-02T00:00:00.000Z GET 120
2014-05-02T00:00:00.000Z HEAD 125
2014-05-02T00:00:00.000Z PUT 122
2014-05-03T00:00:00.000Z DELETE 122
2014-05-03T00:00:00.000Z GET 124
2014-05-03T00:00:00.000Z HEAD 123
2014-05-03T00:00:00.000Z PUT 131
2014-05-04T00:00:00.000Z DELETE 128
2014-05-04T00:00:00.000Z GET 120
2014-05-04T00:00:00.000Z HEAD 127
2014-05-04T00:00:00.000Z PUT 125
2014-05-05T00:00:00.000Z DELETE 55
2014-05-05T00:00:00.000Z GET 79
2014-05-05T00:00:00.000Z HEAD 51
2014-05-05T00:00:00.000Z PUT 65
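A "date" field combined with a linear quantization amounts to parsing the timestamp, converting it to Unix seconds, and rounding down to a multiple of the step. The Python sketch below shows that arithmetic. It assumes timestamps formatted like "2014-05-03T12:34:56.789Z"; the function names are hypothetical and this is not how Dragnet itself is written.

```python
from datetime import datetime, timezone

def bucket_start(iso_time, step):
    """Parse an ISO 8601 UTC timestamp and round down to a multiple of step seconds."""
    parsed = datetime.strptime(iso_time, "%Y-%m-%dT%H:%M:%S.%fZ")
    seconds = int(parsed.replace(tzinfo=timezone.utc).timestamp())
    return (seconds // step) * step

def format_bucket(seconds):
    """Render a bucket's start time the way the tables above show it."""
    moment = datetime.fromtimestamp(seconds, tz=timezone.utc)
    return moment.strftime("%Y-%m-%dT%H:%M:%S.000Z")

sample = "2014-05-03T12:34:56.789Z"
per_day = format_bucket(bucket_start(sample, 86400))   # 86400 seconds per day
per_minute = format_bucket(bucket_start(sample, 60))   # 60 seconds per minute
```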
All of the examples above used a full file scan just to demonstrate the data model. The point of Dragnet is to create indexes that can answer these same queries much faster.
To build indexes, the datasource must have an index-path property, which tells Dragnet where the index should be stored. We'll also need to specify a time field, which Dragnet will use to build per-day indexes by default:
$ dn datasource-update my_logs \
--index-path=$PWD/my_index --time-field=time
$ dn datasource-list -v
DATASOURCE LOCATION
my_logs file://home/dap/dragnet/dragnet/tests/data/
dataFormat: "json"
indexPath: "/home/dap/dragnet/dragnet/my_index"
timeField: "time"
Now we can start adding metrics for the datasource. Let's say we want to be able to quickly get the count of requests per minute, possibly broken out by status code. Let's add this metric:
$ dn metric-add \
-b timestamp[field=time,date,aggr=lquantize,step=60] \
-b res.statusCode my_logs requests_bystatus
Now build the index:
$ dn build my_logs
indexes for "my_logs" built
By default, "dn build" builds daily indexes. You can see the individual files:
$ find my_index -type f
my_index/by_day/2014-05-01.sqlite
my_index/by_day/2014-05-03.sqlite
my_index/by_day/2014-05-05.sqlite
my_index/by_day/2014-05-04.sqlite
my_index/by_day/2014-05-02.sqlite
The indexes are much smaller than the original data, since they contain only enough information to answer the queries.
You can query an index the same way you would scan the original data. Generally, the query will be much faster, since it's not scanning the raw data. Here's a count of all requests:
$ dn query my_logs
VALUE
2250
Or just the server-side failures (status code at least 500):
$ dn query --filter='{ "ge": [ "res.statusCode", 500 ] }' my_logs
VALUE
646
Or the failures by day:
$ dn query --filter='{ "ge": [ "res.statusCode", 500 ] }' \
-b timestamp[date,aggr=lquantize,step=86400] my_logs
value ------------- Distribution ------------- count
2014-05-01T00:00:00.000Z |@@@@@@@@@ 142
2014-05-02T00:00:00.000Z |@@@@@@@@ 132
2014-05-03T00:00:00.000Z |@@@@@@@@@ 144
2014-05-04T00:00:00.000Z |@@@@@@@@@@ 154
2014-05-05T00:00:00.000Z |@@@@@ 74
2014-05-06T00:00:00.000Z | 0
Notice that you define metrics to build the index, but you don't need to query a specific metric. You can query anything that can be fetched from the data that was gathered for those metrics. If you ask for something that's not there, you'll get an error:
$ dn query -b req.method my_logs
dn: index "/home/dap/dragnet/dragnet/my_index/by_day/2014-05-01.sqlite" query:
no metrics available to serve query
But it will work if you add the metric and rebuild the index:
$ dn metric-add -b req.method my_logs my_logs
$ dn build my_logs
indexes for "my_logs" built
$ dn query -b req.method my_logs
REQ.METHOD VALUE
DELETE 580
GET 556
HEAD 551
PUT 563
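One way to picture why the first query failed and the second succeeded: an index stores pre-aggregated counts for the fields of each metric, so a query can only be served if some metric contains every field the query needs. The Python sketch below models that idea under stated assumptions. The subset rule, the function names, and the row layout are all hypothetical simplifications, not Dragnet's actual index format or lookup logic.

```python
from collections import Counter

# Hypothetical metric definitions: metric name -> fields stored in the index.
metrics = {"requests_bystatus": ["timestamp", "res.statusCode"]}

def find_metric(metrics, query_fields):
    """Return the first metric whose fields cover everything the query needs."""
    for name, fields in metrics.items():
        if set(query_fields) <= set(fields):
            return name
    raise LookupError("no metrics available to serve query")

def rollup(rows, fields, query_fields):
    """Sum pre-aggregated counts over the columns the query does not ask for."""
    positions = [fields.index(f) for f in query_fields]
    totals = Counter()
    for key, count in rows:
        totals[tuple(key[p] for p in positions)] += count
    return sorted(totals.items())

# Made-up index rows: ((timestamp bucket in Unix seconds, status code), count).
rows = [((1398902400, 200), 3), ((1398902400, 500), 1), ((1398902460, 200), 2)]

served_by = find_metric(metrics, ["res.statusCode"])
by_status = rollup(rows, metrics[served_by], ["res.statusCode"])
```

Under this model, a query broken down by "req.method" raises the lookup error until a metric that includes "req.method" is added and the index is rebuilt.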
Dragnet supports operating directly on data stored in Joyent's Manta Storage Service. When working with Manta:
- Raw data is read from Manta objects rather than local files.
- Data operations (scanning, indexing, and querying) are executed in Manta compute jobs to avoid copying data out of the object store. Only the final results of scan and query operations are downloaded so they can be printed by the "dn" command.
- As with files, you can use --time-format and --time-field when creating the data source, and then use --before and --after options to prune directories to search when scanning, indexing, or querying. For large datasets, this saves an enormous amount of time just enumerating inputs.
- You're responsible for the cost of storing data and running compute jobs on Manta.
To use Dragnet on Manta, first set up the Manta CLI tools using the Manta "Getting Started" instructions. You need to set MANTA_URL, MANTA_USER, and MANTA_KEY_ID as you would for the rest of the Manta command-line tools. If "mls" works, you're good to go.
For sample data, there's a Manta copy of the test data shipped with Dragnet in /dap/public/dragnet/testdata. You can add it like this:
$ dn datasource-add dragnet_test_manta --backend=manta \
--path=/dap/public/dragnet/testdata --time-field=time
Then you can scan it just as with local data. There's a little more debug output in case you need to dig into the job:
$ dn scan dragnet_test_manta
using existing asset: "/manta/public/dragnet/assets/dragnet-0.0.2.tgz"
submitted job 4a74af91-4b3d-c69b-e607-efe0c2911826
submitted 9 inputs
VALUE
2252
Similarly, you can define metrics, build an index, and query it. To do that, we have to specify an index path, which must be somewhere in Manta where you have write access:
$ dn datasource-update dragnet_test_manta \
--index-path=/$MANTA_USER/stor/myindex
$ dn metric-add \
-b timestamp[date,field=time,aggr=lquantize,step=86400],req.method \
dragnet_test_manta by_method
Now we can build the index:
$ dn build dragnet_test_manta
using existing asset: "/manta/public/dragnet/assets/dragnet-0.0.2.tgz"
submitted job 507242e7-7e76-6ae4-8ef4-cec1f9593909
submitted 9 inputs
indexes for "dragnet_test_manta" built
and query it:
$ dn query dragnet_test_manta
using existing asset: "/manta/public/dragnet/assets/dragnet-0.0.2.tgz"
submitted job 66f20f4f-9d5d-68ae-a860-b4f1fedc9f53
submitted 5 inputs
VALUE
2250
$ dn query -b req.method dragnet_test_manta
using existing asset: "/manta/public/dragnet/assets/dragnet-0.0.2.tgz"
submitted job 39b4e5d1-2449-4529-fcbf-916cb885d979
submitted 5 inputs
REQ.METHOD VALUE
DELETE 580
GET 556
HEAD 551
PUT 563
If you don't already know what "dn" does, you're better off starting with the "Getting Started" section above.
dn scan [--before=END_TIME] [--after=START_TIME] [--filter=FILTER]
[--breakdowns=BREAKDOWN[,...]]
[--raw] [--points] [--counters] [--warnings] [--dry-run]
[--assetroot=ASSET_ROOT] DATASOURCE
Scans all records in a datasource and aggregates the results.
The datasource specifies a backend (local files or Manta), a path to the files, the file format, and a few options describing how data is organized. By default, records must be newline-separated JSON.
The basic operation is counting records. The assumption is that records represent some useful metric (e.g., HTTP requests). You can use --filter to skip records. You can use --breakdowns to break out the results by some field (e.g., HTTP requests by request method).
Options include:
-b | --breakdowns COLUMN[,COLUMN...]
: A list of column definitions by which to break out the results. With no breakdowns specified, the result of a scan is a count of all records scanned (excluding those dropped by the filter). With a breakdown on a column like "req.method" (request method, which is usually a string like "GET" or "PUT"), the result is a count for each value of "req.method" that was found. With a breakdown on two columns, the result is a count for each unique combination of values for those columns (e.g., 15 records with "req.method" equal to "GET" and "res.statusCode" equal to "200"). To avoid exploding the number of results, you can group nearby values of numeric quantities using an aggregation. See the tutorial above for details.
-f | --filter FILTER
: A node-krill (JSON format) predicate to evaluate on each record. Records not matching the filter, as well as records missing fields that are used by the filter, are dropped.
--before END_TIMESTAMP
: Only scan data files containing data before END_TIMESTAMP, and filter out data points after END_TIMESTAMP (exclusive). Requires the datasource to have --time-format so that it can prune input files and --time-field so that it can filter records within each bucket.
--after START_TIMESTAMP
: Only scan data files containing data after START_TIMESTAMP, and filter out data points before START_TIMESTAMP (inclusive). Requires the datasource to have --time-format so that it can prune input files and --time-field so that it can filter records within each bucket.
There are some options you specify when creating the datasource:
--time-format TIME_FORMAT
: Specifies how the names of directories and files under "data_directory" correspond with the timestamps of the data points contained in each file. This is a format string like what strftime(3C) supports, except that only "%Y", "%m", "%d", and "%H" are currently implemented. This is used to prune data that has to be scanned when using --before and --after.
--time-field TIME_FIELD
: Specifies which field contains the timestamp. This is used for --before and --after.
--data-format json | json-skinner
: Specifies the incoming data format. Currently, only newline-separated JSON data ("json") and an internal node-skinner format ("json-skinner") are supported.
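To illustrate how a time format lets a tool skip most of a large directory tree, the Python sketch below expands a format string into the directories that could contain data in a given time range. It is only a model of the pruning idea: the function name, the hour-by-hour enumeration, and the half-open range (start included, end excluded) are assumptions for illustration, not a description of Dragnet's internals.

```python
from datetime import datetime, timedelta

def candidate_paths(root, time_format, after, before):
    """List the paths under root that may hold data in the range [after, before)."""
    paths = []
    current = after.replace(minute=0, second=0, microsecond=0)
    while current < before:
        path = root + "/" + current.strftime(time_format)
        # Consecutive hours often map to the same directory; keep each path once.
        if not paths or paths[-1] != path:
            paths.append(path)
        current += timedelta(hours=1)
    return paths

# Daily layout: three directories cover the first three days of July.
daily = candidate_paths("data_directory", "%Y/%m/%d",
                        datetime(2014, 7, 1), datetime(2014, 7, 4))

# Hourly layout: a three-hour window crossing midnight.
hourly = candidate_paths("data_directory", "%Y/%m/%d/%H",
                         datetime(2014, 7, 1, 22), datetime(2014, 7, 2, 1))
```

Everything outside the returned paths can be ignored without being listed or opened, which is where the time savings on large datasets come from.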
There are a few debugging options:
--counters
: upon completion, show non-zero values of miscellaneous internal counters, which include things like inputs processed at each stage of the pipeline, records filtered out, records with invalid fields, and so on. The names of internal streams, their counters, and the output format are not stable and are subject to change at any time.
--points
: emit data as node-skinner data points rather than human-readable results. node-skinner points are similar to the input data except that they include a "value" field for representing N instances of the same record without replicating the record N times. These points can be used as input to subsequent scans or indexes using --data-format=json-skinner.
--warnings
: as data is scanned, show warnings about records that are dropped. Common reasons include: filtered out by a --filter filter, filtered out by --before or --after, failed to evaluate the --filter (e.g., because a field specified in the filter isn't present), failed to parse a numeric field (e.g., a field with "aggr"), or failed to parse a timestamp field. As with --counters, everything about this option's output is unstable and subject to change at any time.
dn build [--before=END_TIME] [--after=START_TIME]
[--interval=hour|day|all] [--index-config=CONFIG_FILE]
[--dry-run] [--assetroot=ASSET_ROOT]
DATASOURCE
Generate a single index file from a single newline-separated-JSON data file:
dn build --interval=all my_datasource
This generates an index capable of answering all of the metrics you've defined on this datasource.
Generate daily index files (the default):
dn build --interval=day my_datasource
Generate hourly indexes, but only for the first few days of July, assuming data is laid out under "data_directory/YYYY/MM/DD":
dn build --interval=hour --after=2014-07-01 --before=2014-07-04 my_datasource
Options include:
--after START_TIMESTAMP
: Same as "dn scan --after".
--before END_TIMESTAMP
: Same as "dn scan --before".
-i | --interval INTERVAL
: Specifies that indexes should be chunked into files by INTERVAL, which is either "all", "hour" or "day". The default is "day".
Like "scan", this uses several options on the datasource:
--time-format TIME_FORMAT
: See "dn scan".
--time-field TIME_FIELD
: See "dn scan".
--data-format json | json-skinner
: See "dn scan".
To specify the time resolution of a metric, specify your own "timestamp" column with each metric. For example, specifying column timestamp[date,field=time,aggr=lquantize,step=60] adds a field called "timestamp" to the index which is the result of parsing the "time" field in the raw data as an ISO 8601 timestamp and converting that to a Unix timestamp (seconds since the epoch). The result is bucketed by minute (step=60). If you want the resolution to be 10 seconds instead, use step=10.
dn query [--before=END_TIME] [--after=START_TIME] [--filter=FILTER]
[--breakdowns=BREAKDOWN[,...]] [--interval=hour|day|all]
[--raw] [--points] [--counters]
[--dry-run] [--assetroot=ASSET_ROOT]
DATASOURCE
"dn query" is used just like "dn scan", but fetches data from the indexes built by "dn build" rather than scanning the raw data every time. The options are the same as for "dn scan", with the addition of:
--interval all|hour|day
: scan the all-time, hourly, or daily indexes. By default, scans daily indexes.
The --data-format, --time-format, and --time-field properties of the datasource are not used when querying.
Dragnet is currently limited by the maximum size of the V8 heap, and the Manta version uses a 32-bit binary. The limit depends not on the number of input data points, but on the number of unique tuples. If you're just counting records, you can process an arbitrary number of data points. If you're indexing 10 fields, each of which can have 10 different values (all independently), that's 10 billion output tuples, which is more than Dragnet can currently handle.
There's no built-in limit on the number of unique tuples, or the number of allowed values for each field, so it's easy to accidentally exceed this limit by selecting a field that has a lot of different values. When you exceed this limit, the failure mode is not good. The program will usually start running extremely slowly for a while as V8 tries to collect lots of garbage, and eventually the program will crash (hopefully dumping core) with a message about a memory allocation failure.
To deal with this, you have to reduce the number of unique tuples that Dragnet has to keep track of. You can do this in a few ways:
- First, check that you didn't forget to aggregate some numeric fields. If you try to index the timestamp without aggregating, you'll get per-second data, which is likely to produce way too many unique tuples. Aggregate per-minute instead.
- Give up some resolution on numeric fields. Instead of bucketing per 10 seconds, bucket per-minute. Instead of latency in groups of 10 milliseconds, use power-of-two buckets.
- Select fewer columns. In the above example with 10 columns, skipping one column reduces the number of unique data points by a factor of 10. You can do this by configuring more metrics with different fields (e.g., replace a single metric that includes timestamp, request method, and user agent with two that include timestamp and request method and (separately) timestamp and user agent). These aren't exactly equivalent, but it's often sufficient.
- If you only run into this problem while indexing, try indexing less data at once. If you're generating daily indexes, restrict each "build" operation to a day's worth of input data, and run separate operations for each day.
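The arithmetic behind these suggestions is simple: the worst-case number of unique tuples is the product of the number of distinct values of each field. The Python sketch below works through the 10-field example and a split-metric example. The cardinalities in the second part (per-minute timestamps over one day, 4 request methods, 1000 user agents) are made-up numbers chosen only for illustration.

```python
import math

def unique_tuples(cardinalities):
    """Worst-case number of unique tuples for independently varying fields."""
    return math.prod(cardinalities)

# 10 fields with 10 values each, then the same with one column dropped.
ten_fields = unique_tuples([10] * 10)
nine_fields = unique_tuples([10] * 9)

# Hypothetical cardinalities for one day of data.
minutes_per_day = 24 * 60
methods = 4
user_agents = 1000

# One metric with all three fields versus two narrower metrics.
combined = unique_tuples([minutes_per_day, methods, user_agents])
split = (unique_tuples([minutes_per_day, methods]) +
         unique_tuples([minutes_per_day, user_agents]))
```

In this made-up case the two narrower metrics need about a quarter of the tuples of the combined metric, at the cost of no longer being able to break out request method and user agent together.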
See "Memory usage" above.
- If you're using a filter, check that you didn't accidentally filter out the records.
- If you specified a "date" field for a scan or index operation, check that the field is present and parseable as a date.
- If you're aggregating on a numeric field (e.g., you used "aggr=quantize" or "aggr=lquantize"), check that the field is present and actually numeric. Strings representing numbers (like "123") don't count. Records are dropped where Dragnet finds anything but a number where a numeric field is required, so if you configure it wrong, you'll be left with no records.
It may help to check all of these by running with --counters or --warnings. Counters will show how many records make it through each stage of the data processing pipeline, and warnings should print out a warning when records are dropped.
See "Some data is missing". In many of those cases, the problem ends up applying to all records and all of them get dropped.
This is always a bug. It means Node exited before "dn" expected it to, which usually means a missed callback.
While the architecture is designed for scalability, no serious performance work has been done on the implementation. There's currently a lot of startup cost.