mapbox / cardboard

tile indexed geo database interface.

License: ISC License

JavaScript 100.00%

cardboard's Introduction

cardboard

Cardboard is a JavaScript library for managing the storage of GeoJSON features on an AWS backend. It relies on DynamoDB for indexing and small-feature storage, and S3 for large-feature storage. Cardboard provides functions to create, read, update, and delete features, either individually or in batches, as well as simple bounding-box spatial query capabilities.

Installation

npm install cardboard
# or globally
npm install -g cardboard

Configuration

Generate a client by passing the following configuration options to cardboard:

option	required	description
mainTable	X	the name of the DynamoDB table to use
region	X	the region containing the given DynamoDB table
accessKeyId		AWS credentials
secretAccessKey		AWS credentials
sessionToken		AWS credentials
dyno		a pre-configured dyno client to use for DynamoDB interactions

Providing AWS credentials is optional. Cardboard depends on the AWS SDK for JavaScript, and so credentials can be provided in any way supported by that library. See configuring the SDK in Node.js for more configuration options.

If you provide a preconfigured dyno client, you do not need to specify mainTable and region when initializing cardboard.

Example

var Cardboard = require('cardboard');
var cardboard = Cardboard({
    mainTable: 'my-cardboard-table',
    region: 'us-east-1',
});

Creating a Cardboard table

Once you've initialized the client, you can use it to create a table for you:

cardboard.createTable(callback);

You don't have to create the table each time; you can provide the name of a pre-existing table in your configuration options to use that table.

API documentation

See api.md.

Concepts

Datasets

Most cardboard functions require you to specify a dataset. This is a way of grouping sets of features within a single Cardboard table. It is similar in concept to "layers" in many other GIS systems, but there are no restrictions on the types of features that can be associated with each other in a single dataset. Each feature managed by cardboard can only belong to one dataset.

Identifiers

Features within a single dataset must each have a unique id. Cardboard uses a GeoJSON feature's top-level id property to determine and persist the feature's identifier. If you provide a cardboard function with a GeoJSON feature that does not have an id property, it will assign one for you; otherwise, it will use the id that you provide. Be aware that inserting two features with the same id value into a single dataset will result in only the last feature being persisted in cardboard.
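
As a minimal sketch of the behaviour described above, the snippet below models one dataset as an in-memory Map keyed by id. The helpers ensureId and put are hypothetical and are not part of cardboard's API, and crypto.randomBytes only stands in for whatever id generator cardboard actually uses.

```javascript
var crypto = require('crypto');

// Hypothetical helper: keep a caller-supplied id, otherwise assign one.
function ensureId(feature) {
    if (feature.id !== undefined && feature.id !== null) return feature;
    // Assumption: any unique string will do for illustration.
    return Object.assign({}, feature, { id: crypto.randomBytes(16).toString('hex') });
}

// Hypothetical in-memory stand-in for a single dataset.
function put(dataset, feature) {
    var withId = ensureId(feature);
    dataset.set(String(withId.id), withId); // same id: the later write wins
    return withId;
}

var dataset = new Map();
put(dataset, { type: 'Feature', id: 'a', properties: { v: 1 }, geometry: null });
put(dataset, { type: 'Feature', id: 'a', properties: { v: 2 }, geometry: null });
var generated = put(dataset, { type: 'Feature', properties: {}, geometry: null });

console.log(dataset.size);                // 2
console.log(dataset.get('a').properties); // { v: 2 }
```

The second insert with id 'a' replaces the first, which mirrors the "only the last feature is persisted" warning.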

Collections

Whenever dealing with individual GeoJSON features, cardboard will expect or return a GeoJSON object of type Feature. In batch situations, or in any request that returns multiple features, cardboard will expect/return a FeatureCollection.

Precision

Cardboard retains the precision of a feature's coordinates to six decimal places.
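
To get a feel for what six decimal places means in practice, here is a small sketch that rounds every coordinate in a nested GeoJSON coordinates array. The helper roundCoords is hypothetical, and rounding (rather than truncating) is an assumption; the text above does not say which one cardboard does.

```javascript
// Hypothetical helper: round every number in a (possibly nested)
// GeoJSON coordinates array to six decimal places.
function roundCoords(coords) {
    if (typeof coords === 'number') return Math.round(coords * 1e6) / 1e6;
    return coords.map(function(c) { return roundCoords(c); });
}

var rounded = roundCoords([[-106.5234371234, 39.5718224], [1, 1]]);
console.log(rounded); // [ [ -106.523437, 39.571822 ], [ 1, 1 ] ]
```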

cardboard's Issues

Refactor tests

The plan: copy what @mick is doing in dyno.

List layers API

combine sequential cells queries into ranges.

Right now we do a range query for each cell in the s2cover. We can reduce the number of requests in some cases by combining sequential cells into the same range request.
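
A rough sketch of the merging step, assuming cells can be treated as plain integers where "sequential" means differing by one. Real S2 cell ids are 64-bit values and would need a different notion of adjacency, so toRanges is a hypothetical illustration only.

```javascript
// Merge a list of integer cell ids into inclusive [start, end] ranges,
// so that each range can be served by a single range query.
function toRanges(cells) {
    var sorted = cells.slice().sort(function(a, b) { return a - b; });
    var ranges = [];
    sorted.forEach(function(cell) {
        var last = ranges[ranges.length - 1];
        if (last && cell <= last[1] + 1) last[1] = Math.max(last[1], cell);
        else ranges.push([cell, cell]);
    });
    return ranges;
}

var ranges = toRanges([7, 3, 4, 5, 9, 10]);
console.log(ranges); // [ [ 3, 5 ], [ 7, 7 ], [ 9, 10 ] ]
```

Six cells collapse into three range requests here; the saving depends entirely on how many cells in a cover happen to be adjacent.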

generate id

Have cardboard generate a UUID for each feature. Either add the user-supplied id to the dynamo doc and add a global secondary index, or create another entry like we do for featureid.

query types

We now have only a bbox query. Obviously we want a polygon query. Do we want point and line queries? Do they make sense?

Implement post-query filter in Dynamo

@mick I wrote a post-query filter in javascript so that I could pass tests. I know you're working on getting dynamo to do this for us, just breaking out a ticket.

Store multiple data entries at each key level

As discussed with @DennisOSRM - instead of storing data like

cell!CELLID!PRIMARYKEY ⇢ geometry
cell!CELLID!PRIMARYKEY2 ⇢ geometry2

We should store it as

cell!CELLID⇢ geometry, geometry2

Implementation details:

How do we separate chunks of geometry in this scheme?
How do we indicate primary keys for each geometry so that they're quickly unique-able

Performance implications:

Fewer queries, which is good
Updating data will require downloading and uploading the chunk, which will get larger as more features overlap
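
One possible answer to the two implementation questions, sketched under assumptions: each cell record holds an object mapping primary key to geometry, which keeps the chunks separate and makes keys unique by construction. The helper groupByCell and the entry shape { key, geometry } are hypothetical.

```javascript
// Regroup per-feature entries ('cell!CELLID!PRIMARYKEY') into one
// record per cell ('cell!CELLID') holding a primaryKey -> geometry map.
function groupByCell(entries) {
    var grouped = {};
    entries.forEach(function(entry) {
        var parts = entry.key.split('!'); // ['cell', CELLID, PRIMARYKEY...]
        var cellKey = parts[0] + '!' + parts[1];
        var primaryKey = parts.slice(2).join('!');
        grouped[cellKey] = grouped[cellKey] || {};
        grouped[cellKey][primaryKey] = entry.geometry;
    });
    return grouped;
}

var grouped = groupByCell([
    { key: 'cell!abc!one', geometry: { type: 'Point', coordinates: [0, 0] } },
    { key: 'cell!abc!two', geometry: { type: 'Point', coordinates: [1, 1] } },
    { key: 'cell!xyz!one', geometry: { type: 'Point', coordinates: [0, 0] } }
]);
console.log(Object.keys(grouped));             // [ 'cell!abc', 'cell!xyz' ]
console.log(Object.keys(grouped['cell!abc'])); // [ 'one', 'two' ]
```

The update cost noted above is visible here: changing feature 'one' means rewriting every cell record it appears in.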

Return GeoJSON FeatureCollection from get/listing

point indexes are wrong

support paging for list of features in dataset

flat index in s3

started in the s3 branch

The idea is to use a flat index in s3, but to track the contents of each cell in dynamo, to make them easier to update. A hybrid of the s3 branch and master.

size-specific max_cells value

Covering indexes are not covering

delLayer error: 'Too many items requested for the BatchWriteItem call'

Looks like we need to batch our batches.
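
DynamoDB's BatchWriteItem accepts at most 25 put or delete requests per call, so "batching our batches" amounts to chunking. A minimal sketch; chunk is a hypothetical helper and the keys are made up for illustration.

```javascript
// Split an array into consecutive chunks of at most `size` items.
function chunk(items, size) {
    var chunks = [];
    for (var i = 0; i < items.length; i += size) {
        chunks.push(items.slice(i, i + size));
    }
    return chunks;
}

// 60 made-up keys, split into groups that fit one BatchWriteItem call each.
var keys = [];
for (var n = 0; n < 60; n++) keys.push('cell!' + n);

var batches = chunk(keys, 25);
console.log(batches.map(function(b) { return b.length; })); // [ 25, 25, 10 ]
```

Each chunk would then go out as its own BatchWriteItem request, ideally with retries for any UnprocessedItems the service returns.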

Zero features returned when bboxQuery crosses prime meridian

cardboard.bboxQuery() returns an empty set of features if the provided bounding box crosses the prime meridian.

Test case:

var Cardboard = require('cardboard');

var c = new Cardboard({
    region: 'us-east-1',
    table: 'cardboard-staging'
});

c.bboxQuery([ -180, -85.05112877980659, 0, 85.0511287798066 ], '1409021191288.1dfc169f', function(err, data) {
    if (err) return console.error(err);
    console.log('not crossing prime: %d features', data.length);
});

c.bboxQuery([ -180, -85.05112877980659, 1, 85.0511287798066 ], '1409021191288.1dfc169f', function(err, data) {
    if (err) return console.error(err);
    console.log('crossing prime: %d features', data.length);
});

Output:

crossing prime: 0 features
not crossing prime: 47 features

Version 0.4.4. Need to see if simply upgrading will just fix this.

Don't hardcode table name

Looks like the table name is hardcoded in a few spots, like https://github.com/mapbox/cardboard/blob/master/lib/dynamodbadapter.js#L20. We'll want to make this a configuration option, as table names are per account/region.

test against sqlite

dumpGeoJSON should gracefully handle the id index

dumpGeoJSON may just want to dump id! features.

use dynamodb-down

would love to abstract out the crappy dynamodb api https://github.com/mapbox/dynamodb-down

Use AttributesToGet to limit initial query to getting unique cells

For polygons, this would mean that initial queries would return, for instance,

cell!fdjsaklfdjsa!id:1
cell!fdjsaklfdjsa!id:2
cell!fdjsaklfdjsa!id:1

And then we could run de-duplication based on cell ids alone, and then run another query that grabs data with a batchgetitem. This would shoot more queries under the 64KB limit, I reckon.

Remove unneeded getParent calls in bboxQuery

Given a qkey like '2111111' we can get its parent cells by qkey = qkey.slice(0, -1).

@mick there's no "IN" operator or OR conditional for key conditions, is there? Something like cell: {'IN': ['2111111', '211111', '21111',...]} seems ideal. I could be ignorant about Dynamo's constraints.
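
For reference, the slice-based approach to collecting a qkey together with all of its parents looks roughly like this. The function name withParents is hypothetical.

```javascript
// Return a quadkey followed by each of its ancestors, nearest first,
// by repeatedly dropping the last character.
function withParents(qkey) {
    var keys = [];
    for (var k = qkey; k.length > 0; k = k.slice(0, -1)) keys.push(k);
    return keys;
}

var lineage = withParents('2111111');
console.log(lineage);
// [ '2111111', '211111', '21111', '2111', '211', '21', '2' ]
```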

Deduplicate results near the prime meridian

As I said in today's scrum, I'm adding tests near the prime meridian in an issue54 branch. We're getting duplicate features.

I'm thinking deduplication is the immediate fix with work in the near future to avoid duplicate results. @rclark @mick cardboard is where to dedupe, right? It's only dyno between us and AWS from cardboard?

Details below:

# query for line crossing 0 lon
ok 120 inserted
ok 121 {"type":"FeatureCollection","features":[{"type":"Feature","properties":{},"geometry":{"coordinates":[[-1,1],[1,1]],"type":"LineString"},"id":"ci079jsy70012rm2ha1ooft6e"}]}
not ok 122 {"type":"FeatureCollection","features":[{"type":"Feature","properties":{},"geometry":{"coordinates":[[-1,1],[1,1]],"type":"LineString"},"id":"ci079jsy70012rm2ha1ooft6e"},{"type":"Feature","properties":{},"geometry":{"coordinates":[[-1,1],[1,1]],"type":"LineString"},"id":"ci079jsy70012rm2ha1ooft6e"},{"type":"Feature","properties":{},"geometry":{"coordinates":[[-1,1],[1,1]],"type":"LineString"},"id":"ci079jsy70012rm2ha1ooft6e"},{"type":"Feature","properties":{},"geometry":{"coordinates":[[-1,1],[1,1]],"type":"LineString"},"id":"ci079jsy70012rm2ha1ooft6e"}]}
  ---
    file:   /Users/sean/code/cardboard/node_modules/queue-async/queue.js
    line:   46
    column: 21
    stack:
      - getCaller (/Users/sean/code/cardboard/node_modules/tap/lib/tap-assert.js:418:17)
      - assert (/Users/sean/code/cardboard/node_modules/tap/lib/tap-assert.js:21:16)
      - Function.equal (/Users/sean/code/cardboard/node_modules/tap/lib/tap-assert.js:162:10)
      - Test._testAssert [as equal] (/Users/sean/code/cardboard/node_modules/tap/lib/tap-test.js:87:16)
      - /Users/sean/code/cardboard/test/index.js:520:23
      - /Users/sean/code/cardboard/index.js:262:17
      - notify (/Users/sean/code/cardboard/node_modules/queue-async/queue.js:46:21)
      - Object.q.awaitAll (/Users/sean/code/cardboard/node_modules/queue-async/queue.js:68:25)
      - resolveFeatures (/Users/sean/code/cardboard/index.js:323:11)
      - /Users/sean/code/cardboard/index.js:260:13
    found:  4
    wanted: 1
  ...
ok 123 {"type":"FeatureCollection","features":[{"type":"Feature","properties":{},"geometry":{"coordinates":[[-1,1],[1,1]],"type":"LineString"},"id":"ci079jsy70012rm2ha1ooft6e"}]}
not ok 124 {"type":"FeatureCollection","features":[{"type":"Feature","properties":{},"geometry":{"coordinates":[[-1,1],[1,1]],"type":"LineString"},"id":"ci079jsy70012rm2ha1ooft6e"},{"type":"Feature","properties":{},"geometry":{"coordinates":[[-1,1],[1,1]],"type":"LineString"},"id":"ci079jsy70012rm2ha1ooft6e"},{"type":"Feature","properties":{},"geometry":{"coordinates":[[-1,1],[1,1]],"type":"LineString"},"id":"ci079jsy70012rm2ha1ooft6e"},{"type":"Feature","properties":{},"geometry":{"coordinates":[[-1,1],[1,1]],"type":"LineString"},"id":"ci079jsy70012rm2ha1ooft6e"}]}
  ---
    file:   /Users/sean/code/cardboard/node_modules/queue-async/queue.js
    line:   46
    column: 21
    stack:
      - getCaller (/Users/sean/code/cardboard/node_modules/tap/lib/tap-assert.js:418:17)
      - assert (/Users/sean/code/cardboard/node_modules/tap/lib/tap-assert.js:21:16)
      - Function.equal (/Users/sean/code/cardboard/node_modules/tap/lib/tap-assert.js:162:10)
      - Test._testAssert [as equal] (/Users/sean/code/cardboard/node_modules/tap/lib/tap-test.js:87:16)
      - /Users/sean/code/cardboard/test/index.js:520:23
      - /Users/sean/code/cardboard/index.js:262:17
      - notify (/Users/sean/code/cardboard/node_modules/queue-async/queue.js:46:21)
      - Object.q.awaitAll (/Users/sean/code/cardboard/node_modules/queue-async/queue.js:68:25)
      - resolveFeatures (/Users/sean/code/cardboard/index.js:323:11)
      - /Users/sean/code/cardboard/index.js:260:13
    found:  4
    wanted: 1
  ...
ok 125 passed queries
# teardown
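
The immediate fix could be as small as the sketch below, which keeps the first occurrence of each feature id in a FeatureCollection. The function name dedupe is hypothetical, and this is not necessarily how cardboard ended up handling it.

```javascript
// Return a new FeatureCollection containing only the first feature
// seen for each id.
function dedupe(collection) {
    var seen = new Set();
    return {
        type: 'FeatureCollection',
        features: collection.features.filter(function(feature) {
            if (seen.has(feature.id)) return false;
            seen.add(feature.id);
            return true;
        })
    };
}

var line = {
    type: 'Feature',
    properties: {},
    geometry: { type: 'LineString', coordinates: [[-1, 1], [1, 1]] },
    id: 'ci079jsy70012rm2ha1ooft6e'
};
var result = dedupe({ type: 'FeatureCollection', features: [line, line, line, line] });
console.log(result.features.length); // 1
```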

search for 0,0 brings up ghana

reverse point geo searches are way too fuzzy right now.

return feature collections

cardboard.bboxQuery
cardboard.get
cardboard.getBySecondaryId

Should all return valid geojson feature collections.

operation: Delete Feature

Get feature geometry from feature id index
Recompute cover
Issue delete requests for each cell id with batchWriteItem

For S3: same technique, except with s3.deleteObjects
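
A sketch of the last step for DynamoDB, assuming the cover has already been recomputed into a list of cell ids. It builds parameters in the shape the low-level BatchWriteItem call expects. The hash key attribute name id and the cell!CELLID!FEATUREID key layout are assumptions for illustration, and deleteParams is a hypothetical helper.

```javascript
// Build BatchWriteItem parameters that delete one index entry per cell.
// Note: a single call accepts at most 25 requests, so a larger cover
// would have to be split across several calls.
function deleteParams(table, featureId, cells) {
    var requests = cells.map(function(cell) {
        return {
            DeleteRequest: {
                // Assumption: the key is a string attribute named 'id'.
                Key: { id: { S: 'cell!' + cell + '!' + featureId } }
            }
        };
    });
    var params = { RequestItems: {} };
    params.RequestItems[table] = requests;
    return params;
}

var params = deleteParams('my-cardboard-table', 'feature-1', ['2111', '2112']);
console.log(JSON.stringify(params, null, 2));
```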

Cost for deletion:

S3: free
DynamoDB: unclear - is this a read/write capacity unit in any way?

Footnote

If deleting things is onerous in terms of performance or cost, we could defer by using a journal - in the id-keyed record for a feature, we'd record a deleted flag and early-abort any requests / decodes of that feature.

Or: we can defer deletes to a different server. Anyway, need to implement it first.

Eliminate primary key argument in cardboard.insert

Since we are standardizing around geojson and the top-level id property, is there any problem with removing the first argument of cardboard.insert and instead just validating that the feature object contains a top-level id property?

/cc @mick

dynamodb transition

@mick i'm currently looking around in dynamodb land for how we should angle this

intuition would be that simple 'get a ton of exact keys' would be faster than any range queries, but dynamodb charges read units for misses, so that seems inefficient
otherwise, we could use lots of range queries using the Query type

as far as how to abstract this, it's either finishing #10 or writing simple-ish 'wrappers' for each backend, like i've started with dynamodb. not sure if leveldown is a decent abstraction for dynamodb

Page out only items that exceed 64kb limit

@mick as an option 2 for dodging the 64kb limit - only page out items that exceed 64kb limit, otherwise store data in cells.
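
Roughly, the decision could look like the sketch below. It uses the serialized JSON size as a stand-in for the stored item size, which is only an approximation since DynamoDB also counts attribute names; storageFor and the returned labels are hypothetical.

```javascript
var LIMIT = 64 * 1024; // the 64kb item limit discussed above, in bytes

// Decide where a feature's data should live based on its serialized size.
function storageFor(feature) {
    var bytes = Buffer.byteLength(JSON.stringify(feature), 'utf8');
    return bytes > LIMIT ? 's3' : 'dynamodb';
}

var small = { type: 'Feature', properties: {}, geometry: { type: 'Point', coordinates: [0, 0] } };
var large = { type: 'Feature', properties: { blob: 'x'.repeat(70000) }, geometry: null };

console.log(storageFor(small)); // dynamodb
console.log(storageFor(large)); // s3
```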

mock s3?

If we wanted something for the s3 testing similar to dynalite:

https://github.com/MathieuLoutre/mock-aws-s3

expose table setup as command

right now api-data needs to direct-require a file from cardboard in a flaky way

Findability of features at the dateline

Write tests that query for short dateline-crossing linestrings. I've a hunch that there's a lot of undefined behavior here. GeoJSON itself provides no guidance (yet) in this case.

rename layer to dataset

Big things index

As @mick has been noticing, indexing big stuff like countries is tough with our default index levels.

Index big things at a different level than other things
Query this index simultaneously with our normal-sized-things index?

More bboxQuery tests around prime meridian and equator

There are some corner cases to test:

queries that barely touch feature bboxen along edges
queries that barely touch feature bboxen at their corners

Queries get nudged a bit at 0,0 and so that's the spot at which to focus.
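
For writing those tests it helps to pin down what "barely touching" should mean. The sketch below uses an inclusive comparison, so boxes that share only an edge or a corner count as intersecting; whether cardboard should behave this way is exactly what the tests would need to settle. bboxIntersects is a hypothetical helper.

```javascript
// Inclusive bounding-box intersection test.
// Each bbox is [west, south, east, north].
function bboxIntersects(a, b) {
    return a[0] <= b[2] && b[0] <= a[2] && a[1] <= b[3] && b[1] <= a[3];
}

var unit = [0, 0, 1, 1];
console.log(bboxIntersects(unit, [1, 0, 2, 1]));   // true: shared edge
console.log(bboxIntersects(unit, [1, 1, 2, 2]));   // true: shared corner
console.log(bboxIntersects(unit, [-1, -1, 0, 0])); // true: corner at 0,0
console.log(bboxIntersects(unit, [2, 2, 3, 3]));   // false: disjoint
```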

Untested api methods: export and dumpGeoJSON

These methods are not tested and look like they'll fail when they come up against the variety of record types that exist in the database (primary, metadata, spatial index, user index).

use batchWriteItem

Delete layer API

geojsoncover module

https://github.com/mapbox/cardboard/blob/master/lib/geojsoncover.js

This code should be its own module.

"Global secondary index cell does not project [geometryid]"

Query in the cardboard script isn't working for my new table. The fio program below is the Fiona CLI (replacement for ogrinfo).

$ ./cardboard sgillies-shade --export | fio info
endpoint undefined
your table is ready sgillies-shade
{"count": 288, "crs": "+datum=WGS84 +no_defs +proj=longlat", "driver": "GeoJSON", "bounds": [-106.523437, 39.571822, -106.435546, 39.639537], "schema": {"geometry": "Polygon", "properties": {"val": "int", "id": "str"}}}
$ ./cardboard sgillies-shade --query="-107,39,-106,40"
endpoint undefined
your table is ready sgillies-shade
{ [ValidationException: One or more parameter values were invalid: Global secondary index cell does not project [geometryid]]
  message: 'One or more parameter values were invalid: Global secondary index cell does not project [geometryid]',
  code: 'ValidationException',
  time: Thu Sep 18 2014 13:46:44 GMT-0600 (MDT),
  statusCode: 400,
  retryable: false }

Removing 'geometryid' from the query options in cardboard.bboxQuery() doesn't break the tests, @mick, but then I get an empty result collection.

verbose query mode

Flat index mode

This will be a mode that disables merging by modifying min and max level constants to be the same, and replacing range queries with direct GET queries. This will test out @DennisOSRM's idea that avoiding range queries will be faster and simpler than trying to use ranges.

Make "export", "dump", "query" sub-commands of cardboard

The cardboard script is going to be a useful tool for the Satellite team, which uses a lot of bash and Python programs, and I think it's worthwhile to change from cardboard table --export to cardboard export table. It's git-ish, which is a plus, a match for Satellite tools, and also an opportunity for me to learn another side of Node. Not high priority atm, but something I want to track.

Expose dynamoAdapter in module

Right now this is necessary to interact with Cardboard via the nodejs API.

@mick do you think it makes sense to continue down the

databaseAdapter(dbConfig, function(database) {
  new Cardboard(database);
});

path, or should we wrap this sort of asyncness inside?

var cardboard = new Cardboard(dbConfig);

cardboard.on('ready', function() {
  // blahhh
});

{ '0':
   { [TimeoutError: Could not load credentials from any providers]
     message: 'Could not load credentials from any providers',
     code: 'CredentialsError',
     time: Mon May 19 2014 18:35:38 GMT-0400 (EDT),
     originalError:
      { message: 'Connection timed out after 1000ms',
        code: 'TimeoutError',
        time: Mon May 19 2014 18:35:38 GMT-0400 (EDT) },
     _willRetry: false },
  '1': null }

/Users/tmcw/src/cardboard/node_modules/aws-sdk/lib/sequential_executor.js:117
          if (err._hardError) throw err;
                                    ^
TypeError: Object #<Object> has no method 'call'
    at Request.<anonymous> (/Users/tmcw/src/cardboard/node_modules/aws-sdk/lib/request.js:347:20)
    at Request.callListeners (/Users/tmcw/src/cardboard/node_modules/aws-sdk/lib/sequential_executor.js:114:20)
    at Request.emit (/Users/tmcw/src/cardboard/node_modules/aws-sdk/lib/sequential_executor.js:81:10)
    at Request.emit (/Users/tmcw/src/cardboard/node_modules/aws-sdk/lib/request.js:578:14)
    at Request.transition (/Users/tmcw/src/cardboard/node_modules/aws-sdk/lib/request.js:12:12)
    at AcceptorStateMachine.runTo (/Users/tmcw/src/cardboard/node_modules/aws-sdk/lib/state_machine.js:14:12)
    at /Users/tmcw/src/cardboard/node_modules/aws-sdk/lib/state_machine.js:26:10
    at Request.<anonymous> (/Users/tmcw/src/cardboard/node_modules/aws-sdk/lib/request.js:28:9)
    at Request.<anonymous> (/Users/tmcw/src/cardboard/node_modules/aws-sdk/lib/request.js:580:12)
    at Request.callListeners (/Users/tmcw/src/cardboard/node_modules/aws-sdk/lib/sequential_executor.js:90:20)

Any idea what I might be doing wrong in my testing setup? 3a6a754#diff-6015c9f6e4f7700bf6800946c7f61984R3

Benchmarking

/cc @morganherlocker have you used benchmark at all?

How can we test implementations against each other and find the bottlenecks in this implementation?
How should this interact with dynamodb so we can test real world numbers but also not spend a billion buckaroos?