AtomSpace RocksDB Backend

Save and restore AtomSpace contents to a RocksDB database. RocksDB is a single-user, local-host-only, file-backed database, which means that only one AtomSpace can connect to it at any given moment.

In ASCII-art:

 +---------------------+
 |                     |
 |      AtomSpace      |
 |                     |
 +-- StorageNode API --+
 |                     |
 |  RocksStorageNode   |
 |                     |
 +---------------------+
 |       RocksDB       |
 +---------------------+
 |     filesystem      |
 +---------------------+

Each box is a shared library. Library calls go downwards. The StorageNode API is the same for all StorageNodes; the RocksStorageNode is just one of them.
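
Since the API is uniform, a different kind of StorageNode can be dropped in without changing any of the calling code. A minimal sketch, assuming the atomspace-cog module is installed (the hostname is a placeholder):

(use-modules (opencog) (opencog persist) (opencog persist-cog))
(define sto (CogStorageNode "cog://example.com:17001")) ; network, not Rocks
(cog-open sto)
(fetch-atom (Concept "foo"))  ; the same generic calls as for Rocks
(cog-close sto)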

RocksDB (see https://rocksdb.org/) is an "embeddable persistent key-value store for fast storage." The goal of layering the AtomSpace on top of it is to provide fast persistent storage for the AtomSpace. There are several advantages to doing this:

  • RocksDB is file-based, so it is straightforward to make backup copies of datasets, and to share those copies with others. (You don't need to be a DB admin to do this!)
  • RocksDB runs locally, so the overhead of pushing bytes through the network is eliminated. The remaining inefficiencies/bottlenecks have to do with converting between the AtomSpace's natural in-RAM format and the position-independent format that all databases need. (Here, "position-independent" means that the DB format does not contain any C/C++ pointers; all references are managed with local unique IDs.)
  • RocksDB is a "real" database, and so enables the storage of datasets that might not otherwise fit into RAM. This backend does not try to guess what your working set is; it is up to you to load, work with, and save those Atoms that are important to you. The examples demonstrate exactly how that can be done.

This backend, together with the CogServer-based network AtomSpace backend, provides a building block out of which more complex distributed and/or decentralized AtomSpaces can be built.

Status

This is Version 1.5.1. All unit tests pass. It has been used in at least one major project, to process tens of millions of Atoms.

This code is 2x or 3x faster than Postgres on synthetic benchmarks, and has been observed to run 12x faster in a real-world application.

Building and Installing

The build and install of atomspace-rocks follows the same pattern as other AtomSpace projects.

RocksDB is a prerequisite. On Debian/Ubuntu: apt install librocksdb-dev. Then build, install and test:

    cd atomspace-rocks
    mkdir build
    cd build
    cmake ..
    make -j4
    sudo make install
    make check

Example Usage

See the examples directory for details. In brief:

$ guile
scheme@(guile-user)> (use-modules (opencog))
scheme@(guile-user)> (use-modules (opencog persist))
scheme@(guile-user)> (use-modules (opencog persist-rocks))
scheme@(guile-user)> (define sto (RocksStorageNode "rocks:///tmp/foo.rdb/"))
scheme@(guile-user)> (cog-open sto)
scheme@(guile-user)> (load-atomspace)
scheme@(guile-user)> (cog-close sto)

That's it! You've loaded the entire contents of foo.rdb into the AtomSpace! Of course, loading everything is not generally desirable, especially when the file is huge and RAM space is tight. More granular load and store is possible; see the examples directory for details.
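
For instance, single Atoms can be stored and fetched one at a time. A minimal sketch of the granular API, reusing the storage node opened above:

(cog-open sto)
(store-atom (Concept "foo"))          ; write one Atom and its Values
(fetch-atom (Concept "foo"))          ; read its Values back from disk
(fetch-incoming-set (Concept "foo"))  ; pull in every Link that contains it
(cog-close sto)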

Contents

There are two implementations in this repo: a simple one, suitable for users who work with only a single AtomSpace, and a sophisticated one, intended for users who need to work with complex DAGs of AtomSpaces. These are accessed by using either MonoStorageNode or RocksStorageNode; both use the standard StorageNode API.

The implementation of MonoStorageNode is smaller and simpler, and is the easier of the two to understand.

The implementation of RocksStorageNode provides full support for deep stacks (DAGs) of AtomSpaces, layered one on top of another (called "Frames", a name meant to suggest both "Kripke frames" and "stack frames"). An individual frame can be thought of as a change-set: a collection of deltas to the next frame further down in the DAG. A frame inheriting from multiple AtomSpaces contains the set-union of the Atoms in the contributing AtomSpaces. Atoms and Values can be added, changed, and removed in each change-set, without affecting Atoms and Values in deeper frames.
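
A minimal sketch of such a DAG, using the layered-AtomSpace API demonstrated in the examples (the names are illustrative):

(use-modules (opencog))
(define a (AtomSpace))
(define b (AtomSpace))
(define c (AtomSpace a b))      ; frame c inherits from both a and b
(cog-set-atomspace! a) (Concept "from-a")
(cog-set-atomspace! b) (Concept "from-b")
(cog-set-atomspace! c)
(cog-get-atoms 'ConceptNode)    ; the union view: both Concepts should be visible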

Design

This is a minimalistic implementation. There has been no performance tuning; there is only just enough code to make everything work. It does nothing at all fancy or sophisticated with RocksDB, so it might be possible to improve performance and squeeze out some air. However, the code is not sloppy, so it might be hard to make it go faster.

If you are creating a new StorageNode for some other kind of database, the code here would make an excellent starting point. All the hard problems have been solved, and yet the overall design remains fairly simple. All you'd need to do is replace the references to RocksDB with your preferred DB.

atomspace-rocks's Issues

RAM usage insanity.

During learning on a tiny grammar, RAM usage by RocksDB exploded to 90 GBytes. This is ... insane; it should not be more than a few GBytes for this workload. This is 40x greater RAM usage than expected. The 40x number is just like the one in issue #9, and might be curable in the same way...

Alpha-renaming is broken.

Two alpha-equivalent Atoms will be stored as distinct entries. This will result in duplicate, corrupted Values associated with the alpha-equivalent Atoms.

The fix for this seems to involve storing the 64-bit hash and using that to detect possible alpha-equivalence. All Atoms with the same hash would have to be fetched, and then tested for possible alpha-equivalence. The getLink() and getNode() callbacks are problematic, since they don't come with a pre-computed hash, and so have to pay the extra cost of computing it.

FWIW, this problem might affect other back-ends too, e.g. the sql backend, and maybe the cogserver backend? This problem has not been explored...
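
For reference, the in-RAM AtomSpace already merges alpha-equivalent ScopeLinks into a single Atom; it is only the on-disk store that duplicates them. A sketch of the in-RAM behavior:

(use-modules (opencog))
(define p (Lambda (Variable "$x") (Inheritance (Variable "$x") (Concept "thing"))))
(define q (Lambda (Variable "$y") (Inheritance (Variable "$y") (Concept "thing"))))
(equal? p q)   ; => #t -- both definitions resolve to the same Atom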

cog-rocks-print/print_range not working after cog-atomspace-clear

(store-atomspace)
; Clear the local AtomSpace (the Atoms remain on disk, just not in RAM).
(cog-atomspace-clear)

Before (cog-atomspace-clear), cog-rocks-print works as expected.
If I try to inspect the RocksDB after (cog-atomspace-clear), I get:

scheme@(guile-user)> (cog-rocks-print rsn  "a@")
ice-9/boot-9.scm:1685:16: In procedure raise-exception:
In procedure cog-rocks-print: Wrong type (expecting opencog atom): #<Invalid handle>

Entering a new prompt.  Type `,bt' for a backtrace or `,q' to continue.
scheme@(guile-user) [1]> 

As I understand it, rocks should not be affected if you clear the AS in RAM.

Provide summary report of what is stored

The (cog-report-counts) function provides a summary of the Atom types stored in the AtomSpace. A similar function is needed to report what is held in a RocksStorageNode (without actually loading those Atoms, of course).

The simple version of this would be to modify the existing report function to total these up and print them. The fancy version would be to also add this to the generic StorageNode API, making it usable by all backends.
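
For contrast, the existing in-RAM report looks like this; a storage-side analogue (the name below is hypothetical) would report the same alist without loading the Atoms:

(use-modules (opencog))
(Concept "a") (Concept "b")
(List (Concept "a") (Concept "b"))
(cog-report-counts)   ; => ((ConceptNode . 2) (ListLink . 1))
; Hypothetical storage-side analogue; no such call exists yet:
; (cog-rocks-report rsn)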

cog-prt-atomspace doesn't work in union of two atomspaces

In the context of #15, while testing the Multiframe-Example, the following happened:
when I tried to connect the union AtomSpace to a RocksDB and then ran cog-prt-atomspace, I got a segfault:

(cog-set-atomspace! c)
(define rsn (RocksStorageNode "rocks:///home/opcwdir/repodock/ForTestingStuff/foo.rdb"))
(cog-open rsn)
(store-atomspace)
(cog-prt-atomspace)

This also happens if I use cog-rocks-open.
Printing the contents with cog-rocks-print works fine.

Not sure if this is a realistic use case.

Bug or user error when using frames?

Here's some weird, unexpected, confusing behavior w.r.t. frames. It's both a bug and a user error. It is exhibited by the following:

(use-modules (opencog) (opencog exec))
(use-modules (opencog persist))
(use-modules (opencog persist-rocks))

(define a (AtomSpace))
(define b (AtomSpace))
(define c (AtomSpace a b))
(cog-set-atomspace! a)
(Concept "foo")
(Concept "bar")
(Concept "baz")
(Concept "I'm in A")

(cog-set-atomspace! c)
(define rsn (RocksStorageNode "rocks:///tmp/foo"))
(cog-open rsn)

; Oh no! Saving atomspace contents without saving frames first!!
(store-atomspace)
(cog-rocks-print rsn "")

; Now store atomspace b (which should be empty!!)
(store-frames b)

; Hmm looks like all atoms got assigned to atomspace b! Oh no!
(cog-rocks-print rsn "")

If the store-frames is done first, before the store-atomspace, then the correct behavior is seen. It is a user error, since the user should have done the store-frames first. It is also a bug, since store-frames should not have re-assigned the atoms to the wrong AtomSpace.
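
A sketch of the correct ordering for the session above, assuming that calling store-frames on the top frame records the whole DAG:

(store-frames c)    ; declare the DAG a,b -> c first
(store-atomspace)   ; only then store the contents
(cog-rocks-print rsn "")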

Performance insanity!

On small datasets and on synthetic benchmarks, the RocksDB backend is 2x or 3x faster than the Postgres backend. This is not surprising, given the complexity of mapping the AtomSpace to SQL, together with the overhead of client-server communications. However, in a real-world use-case, RocksDB is 9x slower than Postgres. Why? What is going wrong? Is this related to issues #10 and #9? Will fixes to those also fix this?

From the learning project. RocksDB:

(define wsv (make-shape-vec-api))
(define wss (add-pair-stars wsv))
(define wst (batch-transpose wss))
(wst 'mmt-marginals)
...
Stored 40000 of 510993 left-wilds in 101 secs (396 pairs/sec)
Stored 80000 of 510993 left-wilds in 356 secs (112 pairs/sec)
Stored 120000 of 510993 left-wilds in 607 secs (66 pairs/sec)
Stored 160000 of 510993 left-wilds in 865 secs (46 pairs/sec)
Stored 200000 of 510993 left-wilds in 1075 secs (37 pairs/sec)
Stored 240000 of 510993 left-wilds in 1352 secs (30 pairs/sec)
Stored 280000 of 510993 left-wilds in 1591 secs (25 pairs/sec)
Stored 320000 of 510993 left-wilds in 1841 secs (22 pairs/sec)
Stored 360000 of 510993 left-wilds in 2076 secs (19 pairs/sec)
Stored 400000 of 510993 left-wilds in 2296 secs (17 pairs/sec)
Stored 440000 of 510993 left-wilds in 2584 secs (15 pairs/sec)
Stored 480000 of 510993 left-wilds in 2871 secs (14 pairs/sec)
Done storing 510993 left-wilds in 19969 secs

Freakin disaster. That works out to 5.5 hours, or 26 pairs/second. Compare this to exactly the same dataset on Postgres:

Stored 40000 of 510993 left-wilds in 170 secs (235 pairs/sec)
Stored 80000 of 510993 left-wilds in 168 secs (238 pairs/sec)
Stored 120000 of 510993 left-wilds in 188 secs (213 pairs/sec)
Stored 160000 of 510993 left-wilds in 177 secs (226 pairs/sec)
Stored 200000 of 510993 left-wilds in 171 secs (234 pairs/sec)
Stored 240000 of 510993 left-wilds in 165 secs (242 pairs/sec)
Stored 280000 of 510993 left-wilds in 164 secs (244 pairs/sec)
Stored 320000 of 510993 left-wilds in 176 secs (227 pairs/sec)
Stored 360000 of 510993 left-wilds in 163 secs (245 pairs/sec)
Stored 400000 of 510993 left-wilds in 189 secs (212 pairs/sec)
Stored 440000 of 510993 left-wilds in 187 secs (214 pairs/sec)
Stored 480000 of 510993 left-wilds in 185 secs (216 pairs/sec)
Done storing 510993 left-wilds in 2246 secs

Postgres took 37 minutes to store the same dataset. What's the problem, here? How can we fix this?

store encoded s-expressions

S-expressions are OK, but the database could be smaller if they were encoded, e.g. by using an integer instead of the spelled-out type name. Additional compression is possible, but at some point the CPU time of additional compression/decompression outweighs the storage savings. Note also that RocksDB has built-in compression, so this idea is kind-of questionable anyway.
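
A toy illustration of the idea (this is not the actual storage format, and the code table is hypothetical):

(define type-codes '((ConceptNode . 3) (ListLink . 8))) ; hypothetical table
(define (encode-type sym) (assq-ref type-codes sym))
(encode-type 'ConceptNode)   ; => 3, versus the 11-byte spelled-out name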

Provide query "is this Atom being held in this DB?"

The (cog-node TYPE NAME) function allows a user to ask "does a Node with the given TYPE and NAME exist in the AtomSpace?" without actually creating that Node. A similar function is needed for the RocksStorageNode: to ask whether an Atom is held in storage, without actually creating it.

This needs modifications to the generic StorageNode API, so that the question can be asked via the generic API and then trickle down to the specific StorageNode.

In the same vein, other useful functions could be modeled on (cog-incoming-size ATOM) and (cog-incoming-size-by-type ATOM TYPE), reporting on the incoming set of an Atom held in storage without actually fetching that incoming set.
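
For contrast, the existing in-RAM probes are shown below; the storage-side analogues (names hypothetical) do not exist yet:

(cog-node 'ConceptNode "foo")        ; the Node if already present; never creates it
(cog-incoming-size (Concept "foo"))  ; size of the in-RAM incoming set
; Hypothetical storage-side analogues:
; (cog-rocks-node? rsn 'ConceptNode "foo")
; (cog-rocks-incoming-size rsn (Concept "foo"))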

Support handling of unknown atom types

Per the discussion in opencog/atomspace#2787 (comment), if a RocksDB database holds an Atom type that the C++ code does not know about, then that type should be auto-created in the AtomSpace. To make this possible, the RocksDB tables will need to store the Atom type-inheritance hierarchy, which they do not currently do.

Perhaps this could be best achieved by formally defining new atom types in atomese!? (and then just storing the atomese!?)

See also opencog/atomspace#2789

Need to automate compaction on close. 40x too big.

Running the learning kit on a tiny dictionary resulted in this:

$ du -s *rdb
21710584        gram-2.rdb   <<<<<<<<<<< wtf 21GB really?

closing and opening gives:

538360  gram-2.rdb  << half a GB. That's more like it.

which is a 40x compaction. That's sick. That's too big.
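
Until compaction-on-close is automated, the observation above suggests a manual workaround (a sketch; rsn is the open RocksStorageNode):

(cog-close rsn)   ; close ...
(cog-open rsn)    ; ... and reopen, letting RocksDB run its compaction
(cog-close rsn)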

Do not always hide atoms when using frames.

When multiple frames are in use, the current Rocks code will just hide atoms instead of removing them. This results in functionally correct behavior, but is wasteful of storage when the atom could actually be deleted.

The AtomSpace extract code currently provides the correct (reference) behavior: it will either hide an atom when needed, or delete it when possible. It can be used as the final arbiter of whether to delete or not. The backend should follow this advice.

The right way to solve this is to implement pre-delete and post-delete calls in the backend. The AtomSpace calls pre-delete before doing the deletion, so Rocks can gather any info needed for the deletion to happen. Next, the AtomSpace extracts the atom, and then calls the post-delete hook. The hook code should look at what the AtomSpace did: either the absent flag is set on the atom, or the atom is actually gone. If it is only marked absent, then Rocks should also hide the atom; otherwise, Rocks should delete it.
