wilsonzlin / edgesearch

Serverless full-text search with Cloudflare Workers, WebAssembly, and Roaring Bitmaps

License: MIT License

C 17.02% TypeScript 32.27% JavaScript 11.59% Rust 38.96% Shell 0.15%

edgesearch's People

Contributors

jmonster, wilsonzlin

edgesearch's Issues

'main' panicked at 'removal of null terminator' with data from README and Linux client

Using the Linux binary client and the data from the README I am getting the following error and I have no idea what I am doing wrong.

Any help would be appreciated :)

documents.txt:

{"title":"Stupid Love","artist":"Lady Gaga","year":2020} \0
{"title":"Don't Start Now","artist":"Dua Lipa","year":2020} \0

document-terms.txt:

title_stupid \0 title_love \0 artist_lady \0 artist_gaga \0 year_2020 \0 \0
title_dont \0 title_start \0 title_now \0 artist_dua \0 artist_lipa \0 year_2020 \0 \0

RUST_BACKTRACE=full ./edgesearch \
  --documents documents.txt \
  --document-terms document-terms.txt \
  --maximum-query-results 20 \
  --output-dir dist/worker/
thread 'main' panicked at 'removal of null terminator', src/data/document_terms.rs:56:21
stack backtrace:
   0:     0x564f916dcdf4 - backtrace::backtrace::libunwind::trace::hc1c4a1d8ad423b97
                               at /cargo/registry/src/github.com-1ecc6299db9ec823/backtrace-0.3.46/src/backtrace/libunwind.rs:86
   1:     0x564f916dcdf4 - backtrace::backtrace::trace_unsynchronized::h82274781060cb056
                               at /cargo/registry/src/github.com-1ecc6299db9ec823/backtrace-0.3.46/src/backtrace/mod.rs:66
   2:     0x564f916dcdf4 - std::sys_common::backtrace::_print_fmt::h2a45d89b653a4da8
                               at src/libstd/sys_common/backtrace.rs:78
   3:     0x564f916dcdf4 - <std::sys_common::backtrace::_print::DisplayBacktrace as core::fmt::Display>::fmt::h41a0a93ab85e6aa1
                               at src/libstd/sys_common/backtrace.rs:59
   4:     0x564f91706c5c - core::fmt::write::hdaea18585065a96d
                               at src/libcore/fmt/mod.rs:1069
   5:     0x564f916da953 - std::io::Write::write_fmt::h0cea70c809005252
                               at src/libstd/io/mod.rs:1504
   6:     0x564f916df945 - std::sys_common::backtrace::_print::hd95f9978cc145ca4
                               at src/libstd/sys_common/backtrace.rs:62
   7:     0x564f916df945 - std::sys_common::backtrace::print::hfb25ca2291be47d0
                               at src/libstd/sys_common/backtrace.rs:49
   8:     0x564f916df945 - std::panicking::default_hook::{{closure}}::h44f76cee5dc8591c
                               at src/libstd/panicking.rs:198
   9:     0x564f916df682 - std::panicking::default_hook::h198e1a712910f1e6
                               at src/libstd/panicking.rs:218
  10:     0x564f916dffa2 - std::panicking::rust_panic_with_hook::hc0b4730bb8013f9d
                               at src/libstd/panicking.rs:511
  11:     0x564f916dfb8b - rust_begin_unwind
                               at src/libstd/panicking.rs:419
  12:     0x564f91705681 - core::panicking::panic_fmt::h1ac71ad045d55416
                               at src/libcore/panicking.rs:111
  13:     0x564f91705413 - core::option::expect_failed::h7baa1c60813ff0e3
                               at src/libcore/option.rs:1260
  14:     0x564f91675228 - <edgesearch::data::document_terms::DocumentTermsReader as core::iter::traits::iterator::Iterator>::next::hd8425f88ccfdcc36
  15:     0x564f9166f584 - edgesearch::build::build::h38305df836e7a53a
  16:     0x564f9166e506 - edgesearch::main::h0cc1efdc0569d412
  17:     0x564f9166ecf3 - std::rt::lang_start::{{closure}}::h052b2793b163c52a
  18:     0x564f916e03e8 - std::rt::lang_start_internal::{{closure}}::h7a212202ff76034b
                               at src/libstd/rt.rs:52
  19:     0x564f916e03e8 - std::panicking::try::do_call::h6d214a73427d759b
                               at src/libstd/panicking.rs:331
  20:     0x564f916e03e8 - std::panicking::try::hc078f0e11721d1cb
                               at src/libstd/panicking.rs:274
  21:     0x564f916e03e8 - std::panic::catch_unwind::hb31c05be30625612
                               at src/libstd/panic.rs:394
  22:     0x564f916e03e8 - std::rt::lang_start_internal::hcf7fb98a775d5af0
                               at src/libstd/rt.rs:51
  23:     0x564f9166e8d2 - main
  24:     0x7f96ecf35cb2 - __libc_start_main
  25:     0x564f91669c9a - _start
  26:                0x0 - <unknown>
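
For anyone reproducing this, here is a minimal Rust sketch of how the two input files could be generated. It assumes that the `\0` in the README sample stands for a single NUL byte (0x00) and that the spaces around it are only there for readability. Both points are assumptions and are not confirmed in this thread. `encode_documents` and `encode_document_terms` are hypothetical helper names, not part of edgesearch.

```rust
/// Lay out documents as raw bytes: each document is followed by one NUL byte.
/// ASSUMPTION: `\0` in the README means the byte 0x00, not the text "\0".
fn encode_documents(docs: &[&str]) -> Vec<u8> {
    let mut out = Vec::new();
    for doc in docs {
        out.extend_from_slice(doc.as_bytes());
        out.push(0); // document terminator
    }
    out
}

/// Lay out terms as raw bytes: each term is followed by one NUL byte, and each
/// document's term list is closed by one additional NUL byte.
fn encode_document_terms(docs_terms: &[Vec<&str>]) -> Vec<u8> {
    let mut out = Vec::new();
    for terms in docs_terms {
        for term in terms {
            out.extend_from_slice(term.as_bytes());
            out.push(0); // term terminator
        }
        out.push(0); // end of this document's terms
    }
    out
}
```

The returned bytes could then be saved with, for example, `std::fs::write("documents.txt", encode_documents(&docs))`. If the files on disk contain the two characters `\` and `0` rather than NUL bytes, that may be worth ruling out as a cause of the panic.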

Issue compiling worker

Update: Turns out this was due to not having clang, lld or wasm-ld. Installing v11 of each seems to have resolved my issue!

Reading document terms (94.02%)...
Processing documents (0%)...
Processing documents (18.42%)...
Processing documents (36.85%)...
Processing documents (55.27%)...
Processing documents (73.7%)...
Processing documents (92.12%)...
There are 11,116 documents with 15,900 terms
1 chunks contain terms
1 chunks contain documents
thread 'main' panicked at 'compile WASM: Os { code: 2, kind: NotFound, message: "No such file or directory" }', src/build/wasm.rs:102:22
stack backtrace:
   0:     0x5569e3143e20 - std::backtrace_rs::backtrace::libunwind::trace::h04d12fdcddff82aa
                               at /rustc/e1884a8e3c3e813aada8254edfa120e85bf5ffca/library/std/src/../../backtrace/src/backtrace/libunwind.rs:100:5
   1:     0x5569e3143e20 - std::backtrace_rs::backtrace::trace_unsynchronized::h1459b974b6fbe5e1
                               at /rustc/e1884a8e3c3e813aada8254edfa120e85bf5ffca/library/std/src/../../backtrace/src/backtrace/mod.rs:66:5
   2:     0x5569e3143e20 - std::sys_common::backtrace::_print_fmt::h9b8396a669123d95
                               at /rustc/e1884a8e3c3e813aada8254edfa120e85bf5ffca/library/std/src/sys_common/backtrace.rs:67:5
   3:     0x5569e3143e20 - <std::sys_common::backtrace::_print::DisplayBacktrace as core::fmt::Display>::fmt::he009dcaaa75eed60
                               at /rustc/e1884a8e3c3e813aada8254edfa120e85bf5ffca/library/std/src/sys_common/backtrace.rs:46:22
   4:     0x5569e31684cc - core::fmt::write::h77b4746b0dea1dd3
                               at /rustc/e1884a8e3c3e813aada8254edfa120e85bf5ffca/library/core/src/fmt/mod.rs:1078:17
   5:     0x5569e3141002 - std::io::Write::write_fmt::heb7e50902e98831c
                               at /rustc/e1884a8e3c3e813aada8254edfa120e85bf5ffca/library/std/src/io/mod.rs:1518:15
   6:     0x5569e3146585 - std::sys_common::backtrace::_print::h2d880c9e69a21be9
                               at /rustc/e1884a8e3c3e813aada8254edfa120e85bf5ffca/library/std/src/sys_common/backtrace.rs:49:5
   7:     0x5569e3146585 - std::sys_common::backtrace::print::h5f02b1bb49f36879
                               at /rustc/e1884a8e3c3e813aada8254edfa120e85bf5ffca/library/std/src/sys_common/backtrace.rs:36:9
   8:     0x5569e3146585 - std::panicking::default_hook::{{closure}}::h658e288a7a809b29
                               at /rustc/e1884a8e3c3e813aada8254edfa120e85bf5ffca/library/std/src/panicking.rs:208:50
   9:     0x5569e3146228 - std::panicking::default_hook::hb52d73f0da9a4bb8
                               at /rustc/e1884a8e3c3e813aada8254edfa120e85bf5ffca/library/std/src/panicking.rs:227:9
  10:     0x5569e3146d21 - std::panicking::rust_panic_with_hook::hfe7e1c684e3e6462
                               at /rustc/e1884a8e3c3e813aada8254edfa120e85bf5ffca/library/std/src/panicking.rs:593:17
  11:     0x5569e3146867 - std::panicking::begin_panic_handler::{{closure}}::h42939e004b32765c
                               at /rustc/e1884a8e3c3e813aada8254edfa120e85bf5ffca/library/std/src/panicking.rs:499:13
  12:     0x5569e31442dc - std::sys_common::backtrace::__rust_end_short_backtrace::h9d2070f7bf9fd56c
                               at /rustc/e1884a8e3c3e813aada8254edfa120e85bf5ffca/library/std/src/sys_common/backtrace.rs:141:18
  13:     0x5569e31467c9 - rust_begin_unwind
                               at /rustc/e1884a8e3c3e813aada8254edfa120e85bf5ffca/library/std/src/panicking.rs:495:5
  14:     0x5569e3166e31 - core::panicking::panic_fmt::ha0bb065d9a260792
                               at /rustc/e1884a8e3c3e813aada8254edfa120e85bf5ffca/library/core/src/panicking.rs:92:14
  15:     0x5569e3166c53 - core::option::expect_none_failed::h7e1dd0a94971eb61
                               at /rustc/e1884a8e3c3e813aada8254edfa120e85bf5ffca/library/core/src/option.rs:1268:5
  16:     0x5569e30d1008 - edgesearch::build::wasm::generate_and_compile_runner_wasm::he796daae589dc04c
  17:     0x5569e30cc5dc - edgesearch::build::build::hf79fefb20336bfa3
  18:     0x5569e30c7a9b - edgesearch::main::heb516d46d23f0e37
  19:     0x5569e30c4b33 - std::sys_common::backtrace::__rust_begin_short_backtrace::h3c4b2ea8815b4622
  20:     0x5569e30c4b09 - std::rt::lang_start::{{closure}}::h005a553ef7350dea
  21:     0x5569e3147147 - core::ops::function::impls::<impl core::ops::function::FnOnce<A> for &F>::call_once::h57e2a071d427b24c
                               at /rustc/e1884a8e3c3e813aada8254edfa120e85bf5ffca/library/core/src/ops/function.rs:259:13
  22:     0x5569e3147147 - std::panicking::try::do_call::h81cbbe0c3b30a28e
                               at /rustc/e1884a8e3c3e813aada8254edfa120e85bf5ffca/library/std/src/panicking.rs:381:40
  23:     0x5569e3147147 - std::panicking::try::hbeeb95b4e1f0a876
                               at /rustc/e1884a8e3c3e813aada8254edfa120e85bf5ffca/library/std/src/panicking.rs:345:19
  24:     0x5569e3147147 - std::panic::catch_unwind::h59c48ccb40a0bf20
                               at /rustc/e1884a8e3c3e813aada8254edfa120e85bf5ffca/library/std/src/panic.rs:396:14
  25:     0x5569e3147147 - std::rt::lang_start_internal::ha53ab63f88fee728
                               at /rustc/e1884a8e3c3e813aada8254edfa120e85bf5ffca/library/std/src/rt.rs:51:25
  26:     0x5569e30c7f12 - main
  27:     0x7ff7c883809b - __libc_start_main
  28:     0x5569e30c47ba - _start
  29:                0x0 - <unknown>
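
Since the update above traces the `NotFound` error to missing tools, a quick pre-flight check along these lines might save a failed build. This is a sketch in Rust, not something edgesearch provides: `find_in_path` and `missing_tools` are hypothetical helpers, the tool names are taken from the update note, and the check assumes a Unix-like system where binaries have no file extension.

```rust
use std::env;
use std::ffi::OsStr;
use std::path::PathBuf;

/// Look for `name` in each directory of a PATH-style string and return the
/// first candidate that exists as a file. The executable bit is not checked.
fn find_in_path(name: &str, path_var: &OsStr) -> Option<PathBuf> {
    env::split_paths(path_var)
        .map(|dir| dir.join(name))
        .find(|candidate| candidate.is_file())
}

/// Return the entries of `tools` that cannot be found on the given PATH.
fn missing_tools<'a>(tools: &[&'a str], path_var: &OsStr) -> Vec<&'a str> {
    tools
        .iter()
        .copied()
        .filter(|tool| find_in_path(tool, path_var).is_none())
        .collect()
}
```

A call such as `missing_tools(&["clang", "lld", "wasm-ld"], &std::env::var_os("PATH").unwrap_or_default())` returns the names that are not on the current PATH. Some distributions install versioned binaries (for example `clang-11`), in which case the unversioned names may still be reported as missing.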

Rust browser client?

Hey! This project looks awesome, but it'd be really useful if there could be a native Rust client that can be compiled to Wasm that could be used instead of edgesearch-client in Rust web apps. This might make use of web-sys and the native browser Fetch API to avoid bringing in any unnecessary libraries like reqwest that would bloat things. If this isn't an antipattern for the project, I'd be happy to submit a PR on this!
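
As a starting point for such a client, the request-building part can be sketched without any dependencies. The URL shape below is inferred from a request logged in another issue on this page (`/search?c=0&t=0_text_bobby`, produced by a `REQUIRE` query for `text_bobby`). Everything else is an assumption: the numeric values for `Contain` and `Exclude`, the meaning of `c` as a continuation offset, and the names `Mode`, `percent_encode` and `search_url`, which are hypothetical and not taken from edgesearch-client.

```rust
/// How a term participates in a query.
/// Only `Require = 0` is visible in the logged URL; the other values are ASSUMED.
#[allow(dead_code)]
#[derive(Clone, Copy)]
enum Mode {
    Require = 0,
    Contain = 1, // assumption
    Exclude = 2, // assumption
}

/// Percent-encode every byte outside the RFC 3986 unreserved set.
fn percent_encode(input: &str) -> String {
    let mut out = String::new();
    for byte in input.bytes() {
        match byte {
            b'A'..=b'Z' | b'a'..=b'z' | b'0'..=b'9' | b'-' | b'_' | b'.' | b'~' => {
                out.push(byte as char)
            }
            _ => out.push_str(&format!("%{:02X}", byte)),
        }
    }
    out
}

/// Build `<base>/search?c=<continuation>&t=<mode>_<term>...` for the given terms.
fn search_url(base: &str, continuation: u32, terms: &[(Mode, &str)]) -> String {
    let mut url = format!("{}/search?c={}", base.trim_end_matches('/'), continuation);
    for (mode, term) in terms {
        url.push_str(&format!("&t={}_{}", *mode as u8, percent_encode(term)));
    }
    url
}
```

The resulting string could then be handed to the browser Fetch API through web-sys, which is the part a PR would still need to add.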

Why not https://crates.io/crates/roaring?

Awesome project!

We are using the above pure Rust library for roaring bitmaps in a Wasm-based application and were wondering why you opted for a C-based implementation.

How large can an index be?

This is some super impressive work. I've experimented with a similar approach but couldn't stay within the CPU limits. I don't have any experience with C, though, and didn't think of using Roaring Bitmaps, which makes a lot more sense!

Have you tested how large an index can be?

I have 20 MB of documents (30k); how large would they become when indexed here?
When I tested with a JS-based search engine, the worker ran out of CPU time at ~8 MB, but searches were still fast enough (my target is < 1 second).

Any plans for relevancy ranking?

For example, say I have one document whose only term is text_你 and another document with the terms text_你 and text_好. The first document should be ranked higher than the second.
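
One way to read this request is as length-normalised scoring: a document scores higher when a larger share of its terms match the query. The Rust sketch below illustrates that reading only. It assumes the query is the single term text_你, and `score` and `rank` are hypothetical names that do not describe anything edgesearch currently does.

```rust
/// Score a document as (number of its terms present in the query) / (its total terms).
fn score(doc_terms: &[&str], query_terms: &[&str]) -> f64 {
    if doc_terms.is_empty() {
        return 0.0;
    }
    let matched = doc_terms
        .iter()
        .filter(|t| query_terms.contains(*t))
        .count();
    matched as f64 / doc_terms.len() as f64
}

/// Return document indices ordered from highest to lowest score.
/// The sort is stable, so documents with equal scores keep their input order.
fn rank(docs: &[Vec<&str>], query_terms: &[&str]) -> Vec<usize> {
    let mut order: Vec<usize> = (0..docs.len()).collect();
    order.sort_by(|&a, &b| {
        score(&docs[b], query_terms)
            .partial_cmp(&score(&docs[a], query_terms))
            .unwrap_or(std::cmp::Ordering::Equal)
    });
    order
}
```

With the two documents from the example, the one-term document scores 1.0 and the two-term document scores 0.5, so the one-term document comes first.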

"The script will never generate a response." ... then it generates a response

As a user, everything appears to work just fine, but when I look at my Cloudflare dashboard I'm seeing an error for all of my requests.

I dug in with wrangler tail and here's a subset of what I saw:

{
    "outcome": "exception",
    "scriptName": null,
    "exceptions": [{
        "name": "Error",
        "message": "The script will never generate a response.",
        "timestamp": 1624986983071
    }],
    "logs": [{
        "message": ["Found containing chunk"],
        "level": "log",
        "timestamp": 1624986981885
    }, {
        "message": ["Fetched chunk from KV"],
        "level": "log",
        "timestamp": 1624986982339
    }, {
        "message": ["Found entry in chunk"],
        "level": "log",
        "timestamp": 1624986982339
    }, {
        "message": ["Bit sets retrieved"],
        "level": "log",
        "timestamp": 1624986982339
    }, {
        "message": ["Query built"],
        "level": "log",
        "timestamp": 1624986982339
    }, {
        "message": ["Query executed"],
        "level": "log",
        "timestamp": 1624986982339
    }, {
        "message": ["Found containing chunk"],
        "level": "log",
        "timestamp": 1624986982339
    }, {
        "message": ["Found entry in chunk"],
        "level": "log",
        "timestamp": 1624986983071
    }, {
        "message": ["Documents fetched"],
        "level": "log",
        "timestamp": 1624986983071
    }],
    "eventTimestamp": 1624986981885,
    "event": {
        "request": {
            "url": "https://gifs-edgesearch.garage.workers.dev/search?c=0&t=0_text_bobby",
            "method": "GET"
        }
    }
}

Meanwhile the client (a browser) received the following (headers removed for brevity):

:status: 200
Content-Type: application/json
Content-Encoding: br
Server: cloudflare

And an idea of what the response data looked like:

{
    "total": 2709,
    "continuation": 50,
    "results": [/* omitted */]
}

And an idea of what the underlying client code looks like:

const t = 'bobby';
const q = new Edgesearch.Query();
q.add(Edgesearch.Mode.REQUIRE, `text_${t}`);
const response = await client.search(q);

Failed to compile WASM

I suspect this is because I'm on an arm64 (Apple Silicon) machine and running an arm64 compiled version of edgesearch. When I attempt to run the provided x86 binary my script aborts with SIGILL (Illegal instruction).

This is happening for me (with the same documents/terms files) on both macOS ARM and x86 FreeBSD, so it's almost definitely user error?

RUST_BACKTRACE=1 edgesearch \
  --documents docs \
  --document-terms terms \
  --maximum-query-results 10 \
  --output-dir out \
  --data-store KV

Processing documents (0%)...
Processing documents (18.34%)...
Processing documents (36.68%)...
Processing documents (55.01%)...
Processing documents (73.35%)...
Processing documents (91.69%)...
There are 349 documents with 945 terms
1 chunks contain terms
1 chunks contain documents

error: unable to create target: 'No available targets are compatible with triple "wasm32-unknown-unknown-wasm"'
1 error generated.
thread 'main' panicked at 'Failed to compile WASM', src/build/wasm.rs:103:9
stack backtrace:
   0: std::panicking::begin_panic
   1: edgesearch::build::wasm::generate_and_compile_runner_wasm
   2: edgesearch::build::build
   3: edgesearch::main

Snippet from docs:

{"timecode":4,"text":"<i>Sensors detect a ship\non an intercept course at warp speed.</i>","at":"11/4.jpg"}\0{"timecode":8,"text":"<i>Should we go to red alert?</i>","at":"11/8.jpg"}\0

Snippet from terms:

t_sensors \0,t_detect \0,t_a \0,t_shipon \0,t_an \0,t_intercept \0,t_course \0,t_at \0,t_warp \0,t_speed \0\0t_should \0,t_we \0,t_go \0,t_to \0,t_red \0,t_alert \0\0

On-demand loading of documents

Nice API for serverless search! I like the use of compressed bit sets.

Looks like the engine loads all the documents (data files to KV) once-only at deploy time. How hard or easy is it to allow runtime document adding/updating/removing (via REST methods), and automatic reindexing?

how to compile?

I'm on Apple Silicon, so I was going to build for arm64, but nbd.
