guokr / simbase

A vector similarity database

License: Apache License 2.0

Shell 0.38% Clojure 0.75% Java 98.87%

simbase's Introduction

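#Guokr front-end components

##Directory structure

./
  build/        build tools
  example/      JavaScript component examples
  js/           JavaScript directory
      G/            G.js library
  skin/         CSS and image directory

##Getting started
Start a server in the current directory, then visit http://127.0.0.1/. Alternatively, install Python, run http.sh, and visit http://127.0.0.1:8000/.

##LICENSE
MIT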

simbase's Issues

Help me with my use case plz

Could you please help me with some starter code for my use case?

I want to store key: sentenceID, value: vector pairs in a vector similarity DB. Examples:
id_1 [0.06284283101558685, 0.046207964420318604, 0.0053909290581941605, ...]
id_2 [0.006631242576986551, 0.08234132081270218, -0.0787612572312355, ...]

Then I want the IDs of the top n vectors most similar to a given vector.
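As a starting point, the request above can be sketched in plain Java as a brute-force, in-memory lookup. This is not Simbase's API: the class name SentenceVectorStore and its methods are hypothetical, and cosine similarity is only an assumed choice of score.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical in-memory store: sentence ID -> vector, with brute-force top-n lookup. */
final class SentenceVectorStore {
    private final Map<String, float[]> vectors = new LinkedHashMap<>();

    void put(String sentenceId, float[] vector) {
        vectors.put(sentenceId, vector.clone());
    }

    /** Cosine similarity; returns 0 when either vector has zero magnitude. */
    static double cosine(float[] a, float[] b) {
        if (a.length != b.length) throw new IllegalArgumentException("dimension mismatch");
        double dot = 0, na = 0, nb = 0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            na += a[i] * a[i];
            nb += b[i] * b[i];
        }
        return (na == 0 || nb == 0) ? 0.0 : dot / Math.sqrt(na * nb);
    }

    /** Returns the IDs of the n stored vectors most similar to the query, best first. */
    List<String> topN(float[] query, int n) {
        List<Map.Entry<String, Double>> scored = new ArrayList<>();
        for (Map.Entry<String, float[]> e : vectors.entrySet()) {
            scored.add(Map.entry(e.getKey(), cosine(query, e.getValue())));
        }
        // Highest similarity first; the scan is O(number of stored vectors) per query.
        scored.sort(Map.Entry.<String, Double>comparingByValue().reversed());
        List<String> ids = new ArrayList<>();
        for (int i = 0; i < Math.min(n, scored.size()); i++) {
            ids.add(scored.get(i).getKey());
        }
        return ids;
    }
}
```

Calling put("id_1", ...) for each sentence and then topN(query, n) returns the n closest IDs. Because the query vector does not have to be stored first, the same scan also works for an arbitrary input vector.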

Support matrix and vector transforming

The API still needs to be discussed.

is only "int" vector id type supported?

I've read some of the simbase code (because you've requested an implementation of a score function using Euclidean distance, which preserves vector magnitude).
I noticed that the vector id type in simbase is the Java int type.
That means the vecid value space is limited to the 32-bit integer range.

How about supporting a "long" or "String" vector id type?
String type => can support any kind of id value, but probably with a much larger memory footprint.
long type => provides a huge integer id value space; uses more memory than a 32-bit int but much less than a String.

In some applications (my case :D), the int id type is not enough.
The database key that will be matched to a vector instance (by vecid) in simbase is a 64-bit long.
(I'm using Titan Graph Database (it's awesome 👍), and graph vertex ids are 64-bit longs.)
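Until wider id types are available, one possible client-side workaround is a side table that assigns a dense int id to each 64-bit key. The sketch below is an assumption for illustration only (LongIdMapper is a hypothetical name, not part of simbase), and it only helps while the number of distinct vectors stays below 2^31.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical side table mapping external 64-bit keys to dense 32-bit vector ids. */
final class LongIdMapper {
    private final Map<Long, Integer> toInt = new HashMap<>();
    private final List<Long> toLong = new ArrayList<>();

    /** Returns the int id for an external key, assigning the next free one on first use. */
    int intIdFor(long externalKey) {
        Integer existing = toInt.get(externalKey);
        if (existing != null) {
            return existing;
        }
        int next = toLong.size();
        toInt.put(externalKey, next);
        toLong.add(externalKey);
        return next;
    }

    /** Translates an int id returned by the vector store back to the external key. */
    long externalKeyFor(int intId) {
        return toLong.get(intId);
    }
}
```

The mapping itself costs extra memory and has to be persisted alongside the vectors, which is part of why native long support would be preferable.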

lein uberjar fails with an error

Compiling simbase
Exception in thread "main" java.io.FileNotFoundException: Could not locate simbase__init.class or simbase.clj on classpath:
at clojure.lang.RT.load(RT.java:443)
at clojure.lang.RT.load(RT.java:411)
at clojure.core$load$fn__5018.invoke(core.clj:5530)
at clojure.core$load.doInvoke(core.clj:5529)
at clojure.lang.RestFn.invoke(RestFn.java:408)
at clojure.core$load_one.invoke(core.clj:5336)
at clojure.core$compile$fn__5023.invoke(core.clj:5541)
at clojure.core$compile.invoke(core.clj:5540)
at user$eval9.invoke(form-init5853325299914114337.clj:1)
at clojure.lang.Compiler.eval(Compiler.java:6619)
at clojure.lang.Compiler.eval(Compiler.java:6609)
at clojure.lang.Compiler.load(Compiler.java:7064)
at clojure.lang.Compiler.loadFile(Compiler.java:7020)
at clojure.main$load_script.invoke(main.clj:294)
at clojure.main$init_opt.invoke(main.clj:299)
at clojure.main$initialize.invoke(main.clj:327)
at clojure.main$null_opt.invoke(main.clj:362)
at clojure.main$main.doInvoke(main.clj:440)
at clojure.lang.RestFn.invoke(RestFn.java:421)
at clojure.lang.Var.invoke(Var.java:419)
at clojure.lang.AFn.applyToHelper(AFn.java:163)
at clojure.lang.Var.applyTo(Var.java:532)
at clojure.main.main(main.java:37)
Compilation failed: Subprocess failed

how to run simbase?

I installed simbase (I'm not sure whether I did it correctly, but there were no errors). I have an ssh connection to the server where I installed simbase, so I can work directly on the server machine. I have no root access, but I can use sudo. When I run bin/start or sudo bin/start, it prints "command not found". Any ideas?

Eliminate command layer by reflection or codegen

The commands can be inferred from the engine. We could introduce annotations and create the commands automatically through reflection or code generation.
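A minimal sketch of the reflection variant, under stated assumptions: the @Command annotation, DemoEngine, and CommandRegistry are hypothetical names invented for illustration, and the real engine's methods and signatures are not shown in this issue.

```java
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;
import java.lang.reflect.InvocationTargetException;
import java.lang.reflect.Method;
import java.util.HashMap;
import java.util.Map;

/** Hypothetical marker for engine methods that should be exposed as commands. */
@Retention(RetentionPolicy.RUNTIME)
@Target(ElementType.METHOD)
@interface Command {
    String value();
}

/** Stand-in engine used only for this sketch. */
final class DemoEngine {
    @Command("ping")
    public String ping() {
        return "pong";
    }

    @Command("echo")
    public String echo(String message) {
        return message;
    }
}

/** Builds the command table by scanning the engine's public methods once at startup. */
final class CommandRegistry {
    private final Object engine;
    private final Map<String, Method> commands = new HashMap<>();

    CommandRegistry(Object engine) {
        this.engine = engine;
        for (Method m : engine.getClass().getMethods()) {
            Command c = m.getAnnotation(Command.class);
            if (c != null) {
                commands.put(c.value(), m);
            }
        }
    }

    Object invoke(String name, Object... args) {
        Method m = commands.get(name);
        if (m == null) {
            throw new IllegalArgumentException("unknown command: " + name);
        }
        try {
            return m.invoke(engine, args);
        } catch (IllegalAccessException | InvocationTargetException e) {
            throw new IllegalStateException("command failed: " + name, e);
        }
    }
}
```

With this shape, adding a command means adding one annotated method to the engine. Argument parsing and type conversion from the wire protocol are left out here and would still need a home.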

Is Euclidean distance not supported?

I'm very happy to see an open source vector database!
Simbase is great for me, thanks :D

I have a question (or maybe a new feature request).
The supported similarity (score) functions are "cosinesq" and "jensenshannon".
The cosine similarity function does not take vector magnitude into account.
But in my application, vector magnitude is meaningful for similar-vector search.
I would like a similarity function using "Euclidean distance" to be supported as well :D
Please give me some guidance. Thanks for your great vector DB :D
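To make the difference concrete, here is a small standalone comparison. It is a sketch with hypothetical names (ScoreComparison), not simbase's scoring code: a vector and a scaled copy of it have cosine similarity 1 but a non-zero Euclidean distance.

```java
/** Hypothetical helper contrasting a magnitude-blind score with a magnitude-aware one. */
final class ScoreComparison {
    private static void checkSameLength(float[] a, float[] b) {
        if (a.length != b.length) {
            throw new IllegalArgumentException("dimension mismatch");
        }
    }

    /** Cosine similarity: depends only on the angle between the vectors. */
    static double cosine(float[] a, float[] b) {
        checkSameLength(a, b);
        double dot = 0, na = 0, nb = 0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            na += a[i] * a[i];
            nb += b[i] * b[i];
        }
        return (na == 0 || nb == 0) ? 0.0 : dot / Math.sqrt(na * nb);
    }

    /** Euclidean distance: grows with any difference, including in magnitude. */
    static double euclidean(float[] a, float[] b) {
        checkSameLength(a, b);
        double sum = 0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            sum += d * d;
        }
        return Math.sqrt(sum);
    }
}
```

For a = (1, 2) and b = (2, 4), the cosine similarity is 1 while the Euclidean distance is the square root of 5. Note that a distance is "smaller is better", so a ranking built on it has to sort in the opposite direction from a similarity score.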

Why not reuse the memory when setting vectors?

@Override
public void set(int vecid, float[] vector) {
    if (indexer.containsKey(vecid)) {
        float[] old = get(vecid);

        if (lengths.get(vecid) != vector.length) {
            remove(vecid);
            add(vecid, vector);
        } else {
            int cursor = indexer.get(vecid);
            for (float val : vector) {
                data.set(cursor, val);
                cursor++;
            }
        }
        if (listening) {
            for (VectorSetListener l : listeners) {
                l.onVectorSetted(this, vecid, old, vector);
            }
        }
    } else {
        add(vecid, vector);
    }
}

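As soon as the lengths are found to differ, the element is removed and then added again as a new element.
But when the new length is smaller than the old length, that space should still be usable; only the length would need to be updated. Why is it done this way?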
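The reuse being proposed could look roughly like the sketch below. It uses a simplified, hypothetical storage layout (SlotVectorStore with explicit per-vector slot capacities), which is an assumption and may not match how simbase actually lays out its data.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

/** Simplified flat store (hypothetical layout): each vector owns a slot in one float array. */
final class SlotVectorStore {
    private float[] data = new float[16];
    private int used = 0;
    private final Map<Integer, Integer> offsets = new HashMap<>();    // vecid -> slot start
    private final Map<Integer, Integer> capacities = new HashMap<>(); // vecid -> slot size
    private final Map<Integer, Integer> lengths = new HashMap<>();    // vecid -> live length

    void set(int vecid, float[] vector) {
        Integer capacity = capacities.get(vecid);
        if (capacity == null || vector.length > capacity) {
            // New vector, or it no longer fits: append a fresh slot (the old one is abandoned).
            ensureCapacity(used + vector.length);
            offsets.put(vecid, used);
            capacities.put(vecid, vector.length);
            used += vector.length;
        }
        // Write into the slot and record the live length; a shorter vector
        // simply leaves the tail of its existing slot unused.
        System.arraycopy(vector, 0, data, offsets.get(vecid), vector.length);
        lengths.put(vecid, vector.length);
    }

    float[] get(int vecid) {
        int start = offsets.get(vecid);
        return Arrays.copyOfRange(data, start, start + lengths.get(vecid));
    }

    /** Number of floats claimed by slots so far, including abandoned ones. */
    int usedFloats() {
        return used;
    }

    private void ensureCapacity(int needed) {
        if (needed > data.length) {
            data = Arrays.copyOf(data, Math.max(needed, data.length * 2));
        }
    }
}
```

The trade-off is that every reader now needs the live length as well as the offset, and the unused tail is never reclaimed. Remove-then-add avoids both at the cost of extra copying, which may be why the original code takes that route.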

In the Basis class, what is the difference between the get method and the all method?

They look more or less the same?

Is "Instant Similarity Query" not supported?

Recommendation queries seem to be allowed only for registered vectors.
I think that when a new vector is registered (vadd command), all comparison calculations are performed (O(n)), and when recommendations are queried, the precalculated results are returned (constant time complexity?).
But in my application, users input arbitrary vectors to the system and expect "instant" recommendation results.
Please give me some guidance~

Enhance Info command for monitoring

Most Redis monitoring tools use the INFO command to get execution status data; we need to implement the same mechanism so that these tools can be reused.
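For reference, Redis INFO replies are plain text: a "# Section" header line followed by key:value lines, each terminated by CRLF. A minimal sketch of producing that shape is below; InfoFormatter is a hypothetical name, and which sections and fields a given monitoring tool actually requires is an assumption that needs checking per tool.

```java
import java.util.Map;

/** Hypothetical formatter producing Redis-style INFO text from section -> (field -> value). */
final class InfoFormatter {
    static String format(Map<String, Map<String, String>> sections) {
        StringBuilder sb = new StringBuilder();
        for (Map.Entry<String, Map<String, String>> section : sections.entrySet()) {
            // Section header, e.g. "# Memory"
            sb.append("# ").append(section.getKey()).append("\r\n");
            for (Map.Entry<String, String> field : section.getValue().entrySet()) {
                // One "key:value" pair per line
                sb.append(field.getKey()).append(':').append(field.getValue()).append("\r\n");
            }
            // Blank line between sections
            sb.append("\r\n");
        }
        return sb.toString();
    }
}
```

Passing insertion-ordered maps (for example LinkedHashMap) keeps the sections and fields in a stable order, which makes the output easier to diff and test.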

Support clustering of vectors

In personalized recommendation scenarios, the set of user profile vectors is usually very large, for example ~10m vectors or more. We take 10m vectors as a baseline, because it is still possible to store all 10m vectors on one physical machine.

10m vectors * 2048 dimensions * 4-byte float ≈ 80 GB of memory
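The estimate can be checked with a few lines of arithmetic. The helper below (MemoryEstimate, a hypothetical name) counts raw float payload only and ignores indexes, object headers, and any precomputed recommendation lists.

```java
/** Hypothetical helper: raw payload size of a dense float vector set. */
final class MemoryEstimate {
    static long rawBytes(long vectors, int dimensions, int bytesPerValue) {
        // multiplyExact throws ArithmeticException instead of silently overflowing.
        return Math.multiplyExact(Math.multiplyExact(vectors, (long) dimensions), (long) bytesPerValue);
    }
}
```

For 10,000,000 vectors of 2048 four-byte floats this gives 81,920,000,000 bytes, that is about 82 GB in decimal units or roughly 76 GiB, in line with the ~80 GB figure above.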

The current solution does not scale to this level, because write latency would be ~30s, which is not acceptable.

One idea: we do not recommend for a single user, but for a cluster of similar users.

Two choices: online KMeans or SimHash?
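To make the SimHash option more tangible, here is a sketch of the random-hyperplane variant, in which each signature bit records on which side of a random hyperplane a vector falls, so vectors pointing in similar directions tend to share a signature and can be grouped into one bucket. The class name SimHashBucketer, the bit count, and the seed handling are assumptions for illustration, not a proposed design.

```java
import java.util.Random;

/** Hypothetical random-hyperplane SimHash: similar directions tend to share a signature. */
final class SimHashBucketer {
    private final int dimensions;
    private final float[][] hyperplanes; // one random direction per signature bit

    SimHashBucketer(int dimensions, int bits, long seed) {
        if (bits < 1 || bits > 32) {
            throw new IllegalArgumentException("bits must be in 1..32");
        }
        this.dimensions = dimensions;
        Random rnd = new Random(seed);
        hyperplanes = new float[bits][dimensions];
        for (float[] plane : hyperplanes) {
            for (int d = 0; d < dimensions; d++) {
                plane[d] = (float) rnd.nextGaussian();
            }
        }
    }

    /** Bit i is 1 when the vector lies on the positive side of hyperplane i. */
    int signature(float[] vector) {
        if (vector.length != dimensions) {
            throw new IllegalArgumentException("dimension mismatch");
        }
        int sig = 0;
        for (int i = 0; i < hyperplanes.length; i++) {
            double dot = 0;
            for (int d = 0; d < dimensions; d++) {
                dot += hyperplanes[i][d] * vector[d];
            }
            if (dot > 0) {
                sig |= 1 << i;
            }
        }
        return sig;
    }
}
```

Unlike online KMeans, this needs no training and no centroid updates, but the signature depends only on direction, so vectors that differ only in magnitude always land in the same bucket, and bucket sizes are not controlled.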

    if (source != target) {
        scoring.onAttached(target.key());
        target.addListener(scoring);

        if (target.type().equals("dense")) {
            for (int id : source.ids()) {
                scoring.onVectorAdded(target, id, target.get(id));
            }
        } else {
            for (int id : target.ids()) {
                scoring.onVectorAdded(target, id, target._get(id));
            }
        }
    }

Suggested fix in the dense branch: for (int id : source.ids()) ==> for (int id : target.ids())

A more sophisticated test framework

Currently the very thin test framework is neither natural nor robust for asynchronous tests; this makes the tests difficult to read and also leads to failures on Travis CI. We should try to solve these problems.
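One common building block for this kind of test is a helper that blocks on an asynchronous result with an explicit timeout, so a missing callback fails the test quickly instead of hanging the build. The sketch below uses only java.util.concurrent; AsyncTestSupport is a hypothetical name and says nothing about how the existing tests are structured.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

/** Hypothetical test helper: wait for an async result, failing fast with a clear message. */
final class AsyncTestSupport {
    static <T> T await(CompletableFuture<T> future, long timeoutMillis) {
        try {
            return future.get(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            throw new AssertionError("callback not invoked within " + timeoutMillis + " ms", e);
        } catch (ExecutionException e) {
            throw new AssertionError("asynchronous operation failed", e.getCause());
        } catch (InterruptedException e) {
            // Restore the interrupt flag before failing the test.
            Thread.currentThread().interrupt();
            throw new AssertionError("interrupted while waiting", e);
        }
    }
}
```

A test would complete the future from the callback under test and then assert on the value returned by await, which keeps the assertion in the test thread and the test body linear.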
