mediachain / aleph Goto Github PK

א: The mediachain universe manipulation engine

License: MIT License

JavaScript 95.42% Shell 2.09% Protocol Buffer 1.09% Python 1.40%

aleph's Introduction

א

The Aleph's diameter was probably little more than an inch, but all space was there, actual and undiminished. Each thing (a mirror's face, let us say) was infinite things, since I distinctly saw it from every angle of the universe¹

Aleph is part of the mediachain project and is an integral component of the Phase II architecture.

Aleph provides two main components. First is a client for the HTTP API exposed by concat, the reference go peer implementation. Second is a lightweight peer in its own right.

For system-wide Mediachain documentation see https://mediachain.github.io/mediachain-docs.

Installation

Aleph requires node 6 or greater, and has primarily been tested with 6.5 and above.

To globally install a release from npm: npm install --global aleph. This will install the aleph command, which provides a remote query command and an interactive REPL for exploring the mediachain network. It also installs the mcclient command, which you can use to control and interact with a concat node.

If you'd prefer to install from the source repository, clone this repo and run npm install, followed by npm link, which will create a mcclient and aleph symlinks that run the latest compiled version of the code. This is very useful during development.

If you don't want mcclient or aleph on your path at all, you can just run npm install and execute ./bin/mcclient.js and ./bin/aleph.js directly instead.

Usage

Installing the aleph package will install two command line tools: mcclient and aleph.

`mcclient`

mcclient is a wrapper around the HTTP API exposed by concat, aleph's heavy-lifting counterpart.

mcclient contains several sub-commands, so the general invocation is mcclient [global-options] <command> [command-options].

At the moment, the only global option is --apiUrl or -p, which sets the location of the remote node's HTTP API. By default, mcclient will attempt to connect to a concat node running on localhost at port 9002, which is concat's default listen address for the HTTP API. If you've configured concat to run on a different port or on a remote machine, use the -p flag to pass in a new URL, e.g. mcclient -p http://localhost:5678 id

Some useful commands include:

id: print the node's peer id and publisher id
status:
- with no arguments, prints the current status (online, offline, or public)
- set the status with e.g. mcclient status online
publish: publish JSON metadata to the node's local store. see mcclient publish --help for more info.
statement: retrieve a statement by its id
query: run a mediachain query against the node's local store.

To see a full list of supported commands, run mcclient --help

`aleph`

While mcclient is a front-end for the golang peer implementation concat, the aleph command provides access to the javascript peer implementation. The two are interoperable, but they do not have feature parity. Most notably, the javascript peer has no local storage for mediachain statements or data objects, so it can't be used to provide mediachain data. However, aleph is useful for interacting with local and remote concat nodes, and for exploring the mediachain network and peer-to-peer architecture.

There are currently two aleph subcommands, aleph repl and aleph query

`aleph repl`

The aleph repl command provides an interactive Read-Eval-Print-Loop for controlling a javascript mediachain node. When you start the repl, a new peer identity is created and stored in the ~/.mediachain/aleph directory. You can override that default with the --identityPath flag.

The repl provides a javascript prompt, and a global node object. This is an instance of the MediachainNode class, which implements the peer-to-peer protocols.

The repl can be used to remotely interact with another node using the mediachain protocol. For example:

א > node.ping('/ip4/54.205.184.122/tcp/9001/p2p/QmeiY2eHMwK92Zt6X4kUUC3MsjMmVb2VnGZ17DhnhRPCEQ')
true

The long string above is a multiaddr, which is a format for representing and combining addresses for multiple network protocols. The string above is for the peer-to-peer node with id QmeiY2eHMwK92Zt6X4kUUC3MsjMmVb2VnGZ17DhnhRPCEQ, located at the IP4 address 54.205.184.122, on tcp port 9001. The /p2p/ protocol identifier is not yet part of the multiaddr standard, but it is in the works as a replacement for the /ipfs/ identifier, and we've adopted it in anticipation of it being integrated into the standard soon.

Let's make our addresses a bit simpler by connecting to the Mediachain Labs directory server:

א > node.setDirectory('/ip4/52.7.126.237/tcp/9000/p2p/QmSdJVceFki4rDbcSrW7JTJZgU9so25Ko7oKHE97mGmkU6')

Now we can just use the peer identifier portion of the address, and the directory will provide us with the full address behind the scenes:

א > node.ping('QmeiY2eHMwK92Zt6X4kUUC3MsjMmVb2VnGZ17DhnhRPCEQ')
true

You can also provide the directory server address at the command line when launching the repl:

$ aleph repl --dir /ip4/52.7.126.237/tcp/9000/p2p/QmSdJVceFki4rDbcSrW7JTJZgU9so25Ko7oKHE97mGmkU6

This will set the directory address on startup, and avoid the need for the node.setDirectory call.

Pairing a remote node

Since the aleph repl is at its most useful when interacting with a remote node, there's built-in support for "pairing" the javascript aleph node to a remote node (most likely a concat node).

To do so, just provide the --remotePeer flag when launching the repl, and give it a multiaddr where the peer is located:

$ aleph repl --remotePeer /ip4/54.205.184.122/tcp/9001/p2p/QmeiY2eHMwK92Zt6X4kUUC3MsjMmVb2VnGZ17DhnhRPCEQ

Now, in addition to the global node object, you also have a remote object that represents the remote peer.

א > remote.query('SELECT * FROM images.dpla LIMIT 1')
[ { value: { simple: [Object] } } ]

The query result is printed at the repl in condensed form. To examine it further, we can use the "magic" _ variable, which holds the result of the last repl command. Assign it to a new variable, and we can interact with it more easily:

var result = _
console.dir(result[0].value.simple)
{ stmt: 
   { id: '4XTTM4K8sqTb7xYviJJcRDJ5W6TpQxMoJ7GtBstTALgh5wzGm:1478267497:1',
     publisher: '4XTTM4K8sqTb7xYviJJcRDJ5W6TpQxMoJ7GtBstTALgh5wzGm',
     namespace: 'images.dpla',
     body: 
      { simple: 
         { object: 'QmeFJSTPKSEiNqebxZvYcduWH8UBmxqNq724gHEQnxV5D1',
           refs: [ 'dpla_1ff6b36174426026847c8f8ca216ffa9' ],
           tags: [],
           deps: [ 'QmYGRQYmWC3BAtTAi88mFb7GVeFsUKGM4nm25SBUB9vfc9' ] } },
     timestamp: 1478267497,
     signature: 
      Buffer [...] } }

To also fetch the data objects associated with each result, use remote.queryWithData instead of remote.query.

`aleph query`

aleph query --remotePeer <peerAddress> <queryString> will execute a query on a remote peer using the peer-to-peer query protocol. This is very similar to the mcclient query command, with the exception that the --remotePeer argument is not optional, since the local aleph node does not have a local datastore.

Mediachain records are composed of two main components: statements and objects. Objects are the core metadata objects that represent a given artwork, person, or other resource. A statement is a mediachain-specific record that contextualizes an object within the mediachain network. For example, the statement includes the namespace in which the record was published, the ids of the publisher, the well-known identifiers attached to the record, etc.

By default, the aleph query command will only show the statements that match the query results. However, it's common that you'll want to see the content of the objects themselves.

Use the --includeData or -i flag to dereference the objects for each statement in the query results. For example:

$ aleph query --remotePeer /ip4/54.205.184.122/tcp/9001/ipfs/QmeiY2eHMwK92Zt6X4kUUC3MsjMmVb2VnGZ17DhnhRPCEQ --includeData 'SELECT * FROM images.dpla LIMIT 1'

{
  "simple": {
    "stmt": {
      "id": "4XTTM4K8sqTb7xYviJJcRDJ5W6TpQxMoJ7GtBstTALgh5wzGm:1478267497:1",
      "publisher": "4XTTM4K8sqTb7xYviJJcRDJ5W6TpQxMoJ7GtBstTALgh5wzGm",
      "namespace": "images.dpla",
      "body": {
        "simple": {
          "object": {
            "key": "QmeFJSTPKSEiNqebxZvYcduWH8UBmxqNq724gHEQnxV5D1",
            "data": {
              "schema": {
                "/": "QmYGRQYmWC3BAtTAi88mFb7GVeFsUKGM4nm25SBUB9vfc9"
              },
              "data": {
                "artist_names": [
                  [
                    "Meredith L. Clausen"
                  ]
                ],
                "aspect_ratio": 0.7166666666666667,
                "attribution": [
                  {
                    "name": "Meredith L. Clausen"
                  }
                ],
                "omitted_for_bevity": "..."
              }
            }
          },
          "refs": [
            "dpla_1ff6b36174426026847c8f8ca216ffa9"
          ],
          "tags": [],
          "deps": [
            "QmYGRQYmWC3BAtTAi88mFb7GVeFsUKGM4nm25SBUB9vfc9"
          ]
        }
      },
      "timestamp": 1478267497,
      "signature": "yDhPpc/RIkW3+sHjl/cB00j3jurqMsDdb/tUyVMUfa6I4EnNiYdSqasxWTiRGtsaT2M/xX++RgRNQQ/97x8IDA=="
    }
  }
}

Development and contribution

Thanks! We welcome all contributions of ideas, bug reports, code, and whatever else you'd like to send our way. Please take a look at our contributing guidelines -- they are very friendly.

Code structure

The code lives in src, and is organized into a few main subdirectories:

client/api contains the RestClient class, which provides a Promise-based wrapper around concat's HTTP API
client/cli contains the code for the mcclient command line app, which uses RestClient for all of its functionality. The cli is powered by the yargs argument parser, and each subcommand is contained in its own module in client/cli/commands
peer contains the javascript implementation of mediachain peer-to-peer nodes. There are two main node types, a DirectoryNode that corresponds to concat's mcdir command, and a MediachainNode that corresponds to concat's mcnode. Both use the LibP2PNode class (defined in src/peer/libp2p_node.js) for low-level peer-to-peer networking.
protobuf contains protocol-buffer definitions for messages exchanged between nodes, and is kept in sync with concat.

aleph's People

Contributors

Stargazers

Watchers

Forkers

digideskio eldk codeaudit haiderny loretoparisi ottodevs zillerium edchainio kustomzone alexanderattar derekzhang79 wallywanghaoxiang globotree dhanatontanahal jatinvinkumar bmxentertainmentcorporation

aleph's Issues

Dial remote peer

Integration testing

Test rt with concat

test harness/dockerize
meaningful RT tests (which?)

lip2p endpoint

Query remote peer by internal ID/multihash

Assume we already know the peer that has the object (it is the aleph), get it directly

HTTP API

At the moment, concat exposes an HTTP api from the mcnode for pinging other nodes, publishing statements and retrieving statements from the local index.

Is this repo the right place to develop an API client? Ideally it would target both the js and go implementations. We could also spin that off into a separate "front end" repo, and keep this one focused on the p2p interactions.

Retrieve internal ID by WKI, namespace

Automatic ssh tunneling for mcclient

It would be sweet to be able to point mcclient at a file with ssh credentials for a remote node (like the file generated by Deploy) and have it create an ssh tunnel for you.

Of course, since this is node, there's a package for that 😄
ssh-tunnel uses a javascript implementation of ssh, so we would be able to use it on Windows without requiring putty to be installed.

Write WKI, namespace mapping

Allow automatically expanding query body

Can we add a flag to automatically expand the body object? That way the results of the CLI command on the image detail in Attribution Engine will be much more exciting!

Current output:

$ mcclient query -r QmeiY2eHMwK92Zt6X4kUUC3MsjMmVb2VnGZ17DhnhRPCEQ 'SELECT * FROM images.* WHERE wki = pexels_60224'

{ id: '4XTTM4K8sqTb7xYviJJcRDJ5W6TpQxMoJ7GtBstTALgh5wzGm:1476977626:4534031',
  publisher: '4XTTM4K8sqTb7xYviJJcRDJ5W6TpQxMoJ7GtBstTALgh5wzGm',
  namespace: 'images.pexels',
  body:
   { Body:
      { Simple:
         { object: 'QmYjqHCiRBF4zbTz3esm9ZVEEk8s7UqSeW1VZMQenAawwZ',
           refs: [ 'pexels_60224' ] } } },
  timestamp: 1476977626,
  signature: '1mQZlZ5ywm2LTQ6TSCK0g6kq8ozOhZdGSrHx6hFVB5sp2GKEdZMkpIWAWlIiIGtdvLCTvxD21ZSE6pJPQGHzBg==' }

Don't require changing metadata object to specify WKI

Moved from mediachain/concat#70

Thinking through WKI specification while trying to write a tutorial for writing MoMA data (credit to @yusefnapora ). It would be great to completely remove all the steps where you modify the original data.

We shouldn't require publishers to change their data to specify a WKI. Currently in our examples we're either adding prefixes to IDs (dpla_abc123) or new object keys (MediachainWKI: "moma:artist:123""from the guide).

It would be elegant if publishers could keep their original metadata object completely untouched, giving nice properties such as keeping the same multihash if they're using the raw object somewhere else (like IPFS).

Can WKI prefixes be stored outside of the object? We already have a unique prefix we can automatically use: the namespace. Is this enough to avoid id collisions since, for example, MoMA uses integer IDs for artists and artworks.

We should also include searching by WKI more prominently in the docs!

ID genrator/flake

If we're accepting "mediachain first" data, we need to be able to hand out IDs. An appropriate technique for this is *flake (after Twitter's snowflake), which uses a combination of timestamp and node ID to make mostly-k-ordered IDs.

Explainer: http://yellerapp.com/posts/2015-02-09-flake-ids.html
Erlang impl: https://github.com/boundary/flake
Clojure impl: https://github.com/maxcountryman/flake
2 Go impls: https://github.com/davidnarayan/go-flake + https://github.com/casualjim/flakeid (not sure if we should use these or handroll)

We have a perfectly good "node id" in the form of peerId, though it's possible to introduce a collision by running multiple nodes with the same identity -- I think this is a degenerate case that we don't care about.

This is more appropriate than v1 UUIDs (MAC address based) or v4 UUIDs (totally random). v3 UUIDs could potentially be used, but I don't think the semantics are quite right.

Persistent configuration for mcclient

As mentioned in #99, I'd like to add a config file for mcclient, so that we can persist things like the api url and SSH tunnel configuration. This should be pretty simple, since the only configuration we want to save is the global args, not anything specific to subcommands.

Cleaner API

Right now extracting the relevant data object from a query response is kind of ugly:

remote.queryWithData('SELECT * FROM images.dpla LIMIT 1').then(r => r[0].simple.stmt.body.simple.object.data)

א > remote.query('SELECT * FROM images.dpla LIMIT 1').then(r => remote.data(r.map(_ => _.value.simple.stmt.body.simple.object)))
[ { key: 'QmeFJSTPKSEiNqebxZvYcduWH8UBmxqNq724gHEQnxV5D1',
    data: <Buffer a2 66 73 63 68 65 6d 61 a1 61 2f 78 2e 51 6d 59 47 52 51 59 6d 57 43 33 42 41 74 54 41 69 38 38 6d 46 62 37 47 56 65 46 73 55 4b 47 4d 34 6e 6d 32 35 ... > } ]
// decode...

This would only get more problematic with other statement types.

From a user perspective, the contract is more like "I'm asking for ID(s) in [a] namespace(s), give me a merged set of of what that namespace knows about it, respecting my trust settings". Given that #17 isn't merged yet (and lots of thinking about merges hasn't happened yet), we can just make the merge function be naive set add.

So this API looks more like

א > remote.magicQuery('SELECT * FROM images.dpla LIMIT 1')

[ { schema: { '/': 'QmYGRQYmWC3BAtTAi88mFb7GVeFsUKGM4nm25SBUB9vfc9' }, // ??
    data:
     { artist_names: [Object], // FIXME: expand this also?
       aspect_ratio: 0.7166666666666667,
       attribution: [Object],
       camera_exif: {},
       date_captured: null,
       date_created_at_source: null,
       date_created_original: null,
       date_source_version: null,
       dedupe_hsh: '3fdfefe0f8381403',
       derived_qualities: [Object],
       description: 'Clausen/Donnelly House',
       keywords: [],
       licenses: [Object],
       location: [Object],
       native_id: 'dpla_http://dp.la/api/items/1ff6b36174426026847c8f8ca216ffa9',
       orientation: null,
       providers_list: [Object],
       sizes: [Object],
       source: [Object],
       source_dataset: 'dpla',
       source_tags: [Object],
       title: [Object],
       transient_info: [Object],
       url_direct: [Object],
       url_shown_at: [Object] } } ]

Questions:

Should we include schema or just object itself? What if all schemas in a result set are the same/different?
What about packing this into a "results" object that has "data_rows" but also some metadata fields?
What about refs? Are we sufficiently "application level" to find and (recursively?) reattach all ref leaves?

Best practices for referencing files

from marionzualo in slack

How can I publish things to the network? this section (https://github.com/mediachain/concat#publishing-statements) has some details on the topic but it seems to focus on metadata, but I could not find any "add file" command that would give me an IPFS style hash that then could be added to the metadata.

Can this happen as part of the data preparation process for now, i.e. we provide an optional out-of-band script that pins files to IPFS and adds references to your metadata.

Accept ECDSA identities/signatures

Support for Ethereum (pre-Metropolis) identities. Depends on libp2p/js-libp2p-crypto#57

[SUMMARY] Aleph core

Core Aleph functionality. Goals:

Accept connections with libp2p

~~We accomplish this through two modes of operation, [superpeer](real aleph node) and [thinclient](empty client)~~

Failure if `idSelector` value is integer

$ mcclient publish --idSelector id museum.tate.artworks TwoWorks.ndjson
Error publishing statements:  { Error: Bad Request
Error: json: cannot unmarshal number into Go value of type string

    at RestError (/Users/denisnazarov/web/rc/aleph/lib/client/api/RestClient.js:31:5)
    at response.text.then.responseBody (/Users/denisnazarov/web/rc/aleph/lib/client/api/RestClient.js:53:17)
    at process._tickCallback (internal/process/next_tick.js:103:7)
  statusCode: 400,
  response:
   Body {
     url: 'http://localhost:9002/publish/museum.tate.artworks',
     status: 400,
     statusText: 'Bad Request',
     headers: Headers { _headers: [Object] },
     ok: false,
     body:
      PassThrough {
        _readableState: [Object],
        readable: false,
        domain: null,
        _events: [Object],
        _eventsCount: 3,
        _maxListeners: undefined,
        _writableState: [Object],
        writable: false,
        allowHalfOpen: true,
        _transformState: [Object] },
     bodyUsed: true,
     size: 0,
     timeout: 0,
     _raw: [ <Buffer 45 72 72 6f 72 3a 20 6a 73 6f 6e 3a 20 63 61 6e 6e 6f 74 20 75 6e 6d 61 72 73 68 61 6c 20 6e 75 6d 62 65 72 20 69 6e 74 6f 20 47 6f 20 76 61 6c 75 65 ... > ],
     _abort: false,
     _bytes: 66 } }
All statements published successfully

Fails if id is not a string
Gives vague error message
Still says "All statements published successfully" even though it failed

Related to mediachain/concat#70

Support / documentation for Windows

At minimum, we should document setting up and using the Windows Subsystem for Linux, but I'd also like to support "real" Windows. This is unfortunately a fairly big hassle, since node-gyp and Windows don't get along as well as they could, and running node on Windows is kind of awkward at the best of times. Or maybe I'm just out of touch with the platform, so things feel harder than they should 😄

Notes on the Linux/windows install:

Enable WSL - all following instructions are run from inside bash
Install dependencies: apt-get install build-essential g++ libssl-dev git
Install node: curl -sL https://deb.nodesource.com/setup_6.x | bash, then apt-get install nodejs
Install aleph: sudo npm install -g aleph (takes a while... it helps to add the -d flag to get verbose output)

Here's what I've been able to piece together so far for the "real Windows" install on a Windows 10 VM:

Install node from the official Windows installer.
Install git for windows.

Install C++ build tools:
Installation method depends on whether you have Visual Studio installed, although hopefully that will be fixed soon, and we can just recommend everyone use the windows-build-tools package.

If you do have Visual Studio, you need to ensure that you've installed the Common Tools for Visual C++. You can either re-run the installer, or open VS and try to create a new C++ project that targets Windows, which will open the installer for you. Make sure the box for the Common Tools for Visual C++ is checked. After the installation, open a Command Prompt or Git Bash and run npm config set msvs_version 2015

If you do not have Visual Studio installed, open a Command Prompt (or Git Bash) as an Administrator and run npm install -g --production windows-build-tools.

try to install aleph:
- global installation seems to choke with a bunch of git errors; this can probably be resolved by running in an elevated shell, but for now I'm just trying to install to a local node_modules dir.
- die trying to install node-webcrypto-ossl
- probably die trying to build jq

Pull Dockerfiles into separate repo

We should probably pull the Dockerfiles for concat into a separate repo, and make them suitable for production use.

Things needed to run in a production container:

data volumes (should have one for statement db, and one for rocks datastore)
ntp needs to be running in the container

Electron app OR Chrome extension (like Metamask)

Wraps aleph in a nice electron shell. Either

connects to remote concat as "surrogate"
embeds local concat as storage

No error if --schemaReference doesn't exist

mcclient publish just exits if the hash of the schema doesn't exist without telling you what went wrong.

Schema versioning and validation

Opening this to discuss and plan the schema semantics, especially versioning and translation.
Below are some notes that describe my current thinking:

Versioning

Schemas are stored as mediachain records in jsonschema format in a well-known namespace (e.g mediachain.schemas or something).
Schemas have human-readble names + semantic versions, which combined form the WKI used to look them up. e.g. SELECT * FROM mediachain.schemas WHERE wki = io.mediachain.indexer/image/jsonschema/1-0-0
A schema is "self-describing", in that it contains a json object that states its name, version, etc.
Examples here use the snowplow self-describing schema format of vendor/schema-name/format/version, where version is a SchemaVer version string
Mediachain records should contain a link to the schema that they conform to.
Link format is up for discussion; actual IPLD-compatible multiaddrs don't support custom protocol codes, so linking directly to a mediachain object blob by multihash would probably need something else.
I think we should probably have a "link object" that has, at minimum, the WKI of the schema + a multihash of the schema blob. It could also have e.g. http links as a fallback.

Possible example of record with schema link:

{
    "schema": 
    {
        "wki": "io.mediachain.indexer/image/jsonschema/1-0-0," 
        "object": "QmF00...", 
        "http": "http://indexer.mediachain.io/schemas/image/jsonschema/1-0-0", 
        "ipfs": {"/": "QmF00"}
    },
    "data":
    {
        "id": "foo",
        "description": "actual record that conforms to schema above"
    }
}

The example above uses the snowplow convention of wrapping the actual payload in a data object, which lets us avoid having to designate a special "schema key" that you're not allowed to use in your actual data payload.

When a record is published, the schema object will be listed in the deps field of the statement, so that when you fetch the record you'll also get its schema. This should ensure that we don't end up with "orphan" records whose schema can't be retrieved. It might also be useful to put the WKI of the schema into the statement's tags field, so that users can filter by objects they know how to work with.

Translation

A "translator" is a jq filter that will massage data from your schema into another.

A translator is published as a mediachain record; it can be a json object with links to the source and destination schemas, plus the jq filter string (and version of jq used!).
translators are "one way", and may be lossy; there's no guarantee that the translated output can be re-translated back into the original schema
However, a translated record should contain a link to the original record and the translator used. This could be another top-level field like schema and data above
In theory, translation can happen "on the fly" at read time, instead of a node republishing translated records. Whether that's a good idea will depend on read volumes, etc.

improve cbor performance

node-cbor is pretty slow 😞, about 20x slower than tinycbor at converting json to cbor.

Improving this probably means using tinycbor or another native code solution, which could go a few different ways:

write a command line app to pipe json through and get CBOR maps out the other end
- don't need to mess with FFI
- need to figure out how to delimit objects in the output
- IPLD transform needs to be in native code for tag handling
- could be used by any language (go, etc)
- can spawn multiple concurrent processes
write a C addon or use node-ffi to wrap a native cbor lib
- FFI is tricky, could potentially crash the node process
- gives us more flexibility (can expose CBOR tags, etc)
- no IPC overhead, but blocks the JS thread unless we get fancy with async code
(crazy option) try to compile tinycbor to webassembly and run node with the --expose-wasm flag 😄

Tool to deterministically compute multihashes/signatures

Thinking about out-of-system flows, I think we need a tool that deterministically computes a statement but does not publish it, rather just outputs to stdout. This will allow the underlying data to be passed around through e.g. smart contracts together with matching signature etc.

publish should throw if `--idFilter` returns null

Had a typo in my idFilter so ended up ingesting a bunch of refs: [ 'null' ]. Shouldn't allow that.

JSON-LD / schema.org

Relates to #54, in that it would be nice to use the schema.org definitions for "mediachain native" records, and they use and RDF / JSON-LD data model.

To briefly recap the arguments in favor of using JSON-LD:

There are a lot of useful definitions at schema.org that would be nice to adopt
It's "just JSON", so we can easily work with it as opaque "dumb JSON"
- This also means it's interoperable with IPLD
It adds a semantic layer beyond what json-schema valdiation provides
- e.g. the schema.org property definitions tell you the purpose of each property, not just what type of data it can contain. You can get something similar if you're diligent about tagging your json-schema properties with description fields, but most people probably won't bother, and it's not easy to share those definitions between multiple schemas.
It's extensible; you can define schemas that inherit the properties from any number of base schemas
There's a lot of tooling to store, process and query data in JSON-LD formats (and / or RDF), which could come in handy down the road when writing an indexer, etc.

There are some issues around versioning:

The schema.org schemas in particular are very permissive; all properties are optional, and any object can have an open-ended set of additional properties
The implication of the open-ended set of properties is that there's no great way to apply semantic versioning to a schema; if you add a new property to the schema, someone could have attached a property with the same name but different semantics to an object with the old version.
There is a schemaVersion property you can use to point to a specific version of a schema. It seems like it's not very widely used, but might be worth encouraging.

How to implement?

For now, we just want to store and validate at a basic structural level.

When accepting JSON-LD objects, tag them with this schema: http://json-ld.org/schemas/jsonld-schema.json
Accept any structurally valid JSON-LD object.
Optional (because full JSON-LD processing can be expensive):
- Bundle some common @contexts (e.g. the schema.org ones) in aleph, and try to validate that, if a field has a known context, its value is in the appropriate range for that property type. So, e.g., if your root object is a CreativeWork, your creator field should be an Organization or a Person.
  - I'm not sure how difficult the above will be... there (surprisingly) doesn't seem to be any off-the-shelf tools for that. Google and Yandex, etc have online tools you can paste markup into, but that doesn't help us much. It seems doable, but maybe not worth the time to implement, at least initially.
- We'd most likely want to disable the default document loader so we're not hitting the network during the validation phase, although maybe we could have it behind a flag. If we do fetch remote contexts, we should definitely cache them for future reference.
Store the records as regular cbor objects.
- Should the @type of the root object be stored in the statement envelope?

Generic write interface/protocol

It seems like we are looking at a few types of events that might trigger a "hey write this thing" event:

(so far)

native mediachain publication announce event
ethereum contract event
0mq

There's a couple of places the object can be specified by these:

literal payload in body (JSON) --> insert
a mediachain path --> merge
out-of-band request --> TLSNotary/curl/etc

They may also optionally contain a few things like proofs of identity, access, or payment (either completed or pending/in escrow)

It seems like we may want a relatively uniform interface for receiving these. Unfortunately, we can't really have a totally consistent message format, because e.g. the 0mq case we're looking at already has one defined. But we can e.g. translate the 0mq message (after running the out-of-band request) to our internal format.

Add IPFS DHT support to aleph peer

Now that concat support's the DHT for peer lookups we don't want aleph to feel left out 😄

aleph-concat pairing

Configure an aleph to have some notion of an associated concat that it's mated to and uses as the respective datastore. This will be useful for #19, #25, etc

Using refs

How does one use refs to other objects?

Example:
https://github.com/MuseumofModernArt/collection has an Artists.json and Artworks.json. Artworks reference objects in Artists via an array of ConstituentID:

{
  "Title": "Ferdinandsbrücke Project, Vienna, Austria, Elevation, preliminary version",
  "Artist": [
    "Otto Wagner"
  ],
  "ConstituentID": [
    6210
  ],
  "ArtistBio": [
    "Austrian, 1841–1918"
  ],
  "Nationality": [
    "Austrian"
  ],
  "BeginDate": [
    1841
  ],
  "EndDate": [
    1918
  ],
  "Gender": [
    "Male"
  ],
  "Date": "1896",
  "Medium": "Ink and cut-and-pasted painted pages on paper",
  "Dimensions": "19 1/8 x 66 1/2\" (48.6 x 168.9 cm)",
  "CreditLine": "Fractional and promised gift of Jo Carole and Ronald S. Lauder",
  "AccessionNumber": "885.1996",
  "Classification": "Architecture",
  "Department": "Architecture & Design",
  "DateAcquired": "1996-04-09",
  "Cataloged": "Y",
  "ObjectID": 2,
  "URL": "http://www.moma.org/collection/works/2",
  "ThumbnailURL": "http://www.moma.org/media/W1siZiIsIjU5NDA1Il0sWyJwIiwiY29udmVydCIsIi1yZXNpemUgMzAweDMwMFx1MDAzZSJdXQ.jpg?sha=137b8455b1ec6167",
  "Height (cm)": 48.6,
  "Width (cm)": 168.9
}

I want to include a reference to the Artist object via the statement's refs array:

is this done via Artist WKI
does it reference an object in another namespace explicitly (e.g. museums.moma.artwork and museums.moma.artist)?
what if the reference is to a statement/object on a remote peer?

Other, probably not solved via refs:

what if an object in museums.moma.artist (museum 1) represents the same person/entity in museums.tate.artist (museum 2) (example)?
- how can one mark them as the same thing
- can i then query both museums for artworks if i know the WKI of an artist from one of them
related artworks: Weeping Woman Tate, Weeping Woman Moma 1, Weeping Woman Moma 2

Oracle interface

Listens on ethereum network, returns payment addresses. Will need to RPC to concat for now.

Using https://github.com/ethereum/web3.js/

Recommended general purpose schemas

For mediachain-first data.

npm

Let's make mcclient npm installable

Show progress when doing a big remote merge

Would be nice to see some log output when merging tens of thousands of records. Maybe log statement IDs?

Retrieve statement by internal ID/multihash

Visual ingest editor

At some point we'll definitely need a visual editor to pick out WKIs, etc. Something like this:

So you'd click/draw a box around a field and a prefix popup would appear. This of course creates all sorts of path expression generation from examples fun.

Batch remote data fetching for remoteQueryWithData

Right now the MediachainNode.remoteQueryWithData opens a new data stream for each statement it receives, since that was the simplest implementation. This is quite inefficient for two reasons:

the protocol handler accepts an array of keys to fetch, so fetching them individually adds protobuf overhead
it's possible to reuse the existing data stream for multiple requests instead of paying the cost to establish a new stream for each request

We can fix the first problem by applying a window to the query result stream and collecting the object ids from each statement in the window, then sending out a batch of object requests for each chunk of query results.

The second will need a little bit of pull-stream trickery to implement. The helper I'm using to read the query result and data object streams will automatically close the stream when it encounters a StreamEnd or StreamError message. This is necessary with pull streams because it will otherwise try to read from the connection forever (or until the other side closes the stream).

However, we'd like to keep the stream open to send a new request. This can probably be accomplished by tweaking the helper to accept an optional callback to invoke on a StreamEnd response instead of just closing the stream; then the consumer can decide whether to close the stream or not.

Write statement to internal ID/multihash

Local store

Probably just postgres, so we can do joins and lookups easily in SQL

No "deps" in publish output

Even though we now have deps, its not in the publish "log" output, just seeing:

statement id: 4XTTM9Y6Sso29BhUFWsNwjRbtmQrTz1oYSPfVNFxMkhLyH7iF:1478226868:129023
[ { object: 'QmTXSkbqGr5Rb6dABrzaKK8xKqaBGLfUsXDBcCtR5cwi8n',
    refs: [ '207152' ],
    tags: [] } ]

even though there is now a schema hash in the dep.

Also what are tags?

README

Add some meat to readme:

INSTALL/usage
limitations/roadmap (very brief)

make cbor size limit configurable

#75 upped the highWaterMark for the cbor encoder stream to 1MB - we should either make this astronomically high, or make it configurable.

The highWaterMark is supposed to be used for backpressure, to let the underlying resource catch up... but the way we're feeding the encoder is entirely synchronous and in-memory, so maybe an absurdly large limit is the way to go.

Gateway mode

Similar to IPFS gateway:

https://ipfs.io/ipfs/QmeQTR7tjdkHZAx8zzH9kiKfv1yj7idyszmr2G66QTfpip

we should have a gateway server mode where we return some nice application/json at /mediachain/Q....

This could also be served directly by a concat itself, but I'm inclined to keep that HTTP API as protected admin/owner only (similar to IPFS daemon)

In-memory DB + datastore

Needed to push to remote concats since push is implemented as reverse merge on concat side

Pull request management

An aleph node that's acting as a "surrogate" for a heavy concat node (i.e. is a trusted publisher) can receive a "Pull Request" through a web API. This PR can be reviewed and published into the remote node, either with no additional signatures (simple store) or with signoff by the operator (corroboration)

How is PR relayed?
How is data received/transmitted?
UI? (likely #25)

Publish support

Publish to in-memory db/datastore. Requires #138

Return valid json from all queries

Right now the pretty print output isn't valid json (missing quotes on keys, not using double quotes). Can we use jq for this internally, now that its a dependency?

"npm run build-jq" fails when building Aleph

Hello, I'm trying to build Aleph on version 6.9.1 of Node on Ubuntu 16.04.1 LTS with release 1.2.1 of Aleph but it fails with the follow error:

[email protected] build-jq /usr/lib/node_modules/aleph
node scripts/build-jq.js

building jq...
err { Error: ./configure --disable-maintainer-mode && make && cp ./jq /usr/lib/node_modules/aleph/node_modules/.bin/jq
Command failed: cp ./jq /usr/lib/node_modules/aleph/node_modules/.bin/jq
cp: cannot create regular file '/usr/lib/node_modules/aleph/node_modules/.bin/jq': Permission denied

at ChildProcess.exithandler (child_process.js:206:12)
at emitTwo (events.js:106:13)
at ChildProcess.emit (events.js:191:7)
at maybeClose (internal/child_process.js:877:16)
at Process.ChildProcess._handle.onexit (internal/child_process.js:226:5)

killed: false,
code: 1,
signal: null,
cmd: 'cp ./jq /usr/lib/node_modules/aleph/node_modules/.bin/jq' }

npm ERR! Linux 4.4.0-47-generic
npm ERR! argv "/usr/bin/nodejs" "/usr/bin/npm" "run" "build-jq"
npm ERR! node v6.9.1
npm ERR! npm v3.10.8
npm ERR! code ELIFECYCLE
npm ERR! [email protected] build-jq: node scripts/build-jq.js
npm ERR! Exit status 1

This also fails as root. Any ideas?

Query remote peer by WKI, namespace

Read API

Now that we have more building blocks in place, let's think about the general read API. We have a number of possibilities for addressing the "source":

specific peer (full multiaddr)
specific peer (peerId)
all peers announcing namespace (via directory)
all peers announcing namespace (via DHT -- though I am thinking maybe directory handles the DHT interaction)
peers announcing and qualified for namespace (non-universe)

And actual query:

naked data query
MCQL query with data
"index scan" MCQL query w/o data

I think for the moment the assumptions we're making are:

peers assumed to hold data necessary to materialize their statements (missing objects in datastore is an error) -- would like to revisit this
queries to performed and merged clientside in aleph (no routing/swaps/etc)