wilsonzlin / edgesearch
Serverless full-text search with Cloudflare Workers, WebAssembly, and Roaring Bitmaps
License: MIT License
Hi Wilson, today I found your repo. This is a truly amazing project!
In the future, you might want to add keyword highlighting.
Thank you.
Using the Linux binary client and the data from the README I am getting the following error and I have no idea what I am doing wrong.
Any help would be appreciated :)
documents.txt:
{"title":"Stupid Love","artist":"Lady Gaga","year":2020} \0
{"title":"Don't Start Now","artist":"Dua Lipa","year":2020} \0
document-terms.txt:
title_stupid \0 title_love \0 artist_lady \0 artist_gaga \0 year_2020 \0 \0
title_dont \0 title_start \0 title_now \0 artist_dua \0 artist_lipa \0 year_2020 \0 \0
RUST_BACKTRACE=full ./edgesearch --documents documents.txt --document-terms document-terms.txt --maximum-query-results 20 --output-dir dist/worker/
thread 'main' panicked at 'removal of null terminator', src/data/document_terms.rs:56:21
stack backtrace:
0: 0x564f916dcdf4 - backtrace::backtrace::libunwind::trace::hc1c4a1d8ad423b97
at /cargo/registry/src/github.com-1ecc6299db9ec823/backtrace-0.3.46/src/backtrace/libunwind.rs:86
1: 0x564f916dcdf4 - backtrace::backtrace::trace_unsynchronized::h82274781060cb056
at /cargo/registry/src/github.com-1ecc6299db9ec823/backtrace-0.3.46/src/backtrace/mod.rs:66
2: 0x564f916dcdf4 - std::sys_common::backtrace::_print_fmt::h2a45d89b653a4da8
at src/libstd/sys_common/backtrace.rs:78
3: 0x564f916dcdf4 - <std::sys_common::backtrace::_print::DisplayBacktrace as core::fmt::Display>::fmt::h41a0a93ab85e6aa1
at src/libstd/sys_common/backtrace.rs:59
4: 0x564f91706c5c - core::fmt::write::hdaea18585065a96d
at src/libcore/fmt/mod.rs:1069
5: 0x564f916da953 - std::io::Write::write_fmt::h0cea70c809005252
at src/libstd/io/mod.rs:1504
6: 0x564f916df945 - std::sys_common::backtrace::_print::hd95f9978cc145ca4
at src/libstd/sys_common/backtrace.rs:62
7: 0x564f916df945 - std::sys_common::backtrace::print::hfb25ca2291be47d0
at src/libstd/sys_common/backtrace.rs:49
8: 0x564f916df945 - std::panicking::default_hook::{{closure}}::h44f76cee5dc8591c
at src/libstd/panicking.rs:198
9: 0x564f916df682 - std::panicking::default_hook::h198e1a712910f1e6
at src/libstd/panicking.rs:218
10: 0x564f916dffa2 - std::panicking::rust_panic_with_hook::hc0b4730bb8013f9d
at src/libstd/panicking.rs:511
11: 0x564f916dfb8b - rust_begin_unwind
at src/libstd/panicking.rs:419
12: 0x564f91705681 - core::panicking::panic_fmt::h1ac71ad045d55416
at src/libcore/panicking.rs:111
13: 0x564f91705413 - core::option::expect_failed::h7baa1c60813ff0e3
at src/libcore/option.rs:1260
14: 0x564f91675228 - <edgesearch::data::document_terms::DocumentTermsReader as core::iter::traits::iterator::Iterator>::next::hd8425f88ccfdcc36
15: 0x564f9166f584 - edgesearch::build::build::h38305df836e7a53a
16: 0x564f9166e506 - edgesearch::main::h0cc1efdc0569d412
17: 0x564f9166ecf3 - std::rt::lang_start::{{closure}}::h052b2793b163c52a
18: 0x564f916e03e8 - std::rt::lang_start_internal::{{closure}}::h7a212202ff76034b
at src/libstd/rt.rs:52
19: 0x564f916e03e8 - std::panicking::try::do_call::h6d214a73427d759b
at src/libstd/panicking.rs:331
20: 0x564f916e03e8 - std::panicking::try::hc078f0e11721d1cb
at src/libstd/panicking.rs:274
21: 0x564f916e03e8 - std::panic::catch_unwind::hb31c05be30625612
at src/libstd/panic.rs:394
22: 0x564f916e03e8 - std::rt::lang_start_internal::hcf7fb98a775d5af0
at src/libstd/rt.rs:51
23: 0x564f9166e8d2 - main
24: 0x7f96ecf35cb2 - __libc_start_main
25: 0x564f91669c9a - _start
26: 0x0 - <unknown>
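The thread does not record a resolution for this panic. One thing worth ruling out, as an assumption rather than a confirmed diagnosis, is that the files contain the two literal characters `\` and `0` instead of a single NUL byte (0x00), or that the spaces shown around `\0` for readability ended up in the files. The sketch below shows one way the two files could be serialised with real NUL bytes. The helper names `encodeDocuments` and `encodeDocumentTerms` are hypothetical and are not part of edgesearch.

```javascript
// Hypothetical helpers, not part of edgesearch.
// Assumption: "\0" in the README denotes a single NUL byte (0x00), and the
// spaces around it in the README are only there for readability.
const NUL = "\0";

// Each document is followed by one NUL byte.
function encodeDocuments(docs) {
  return docs.map((doc) => JSON.stringify(doc) + NUL).join("");
}

// Each term is followed by one NUL byte, and each document's list of terms
// is closed by one extra NUL byte.
function encodeDocumentTerms(termsPerDoc) {
  return termsPerDoc
    .map((terms) => terms.map((term) => term + NUL).join("") + NUL)
    .join("");
}

const docs = [
  { title: "Stupid Love", artist: "Lady Gaga", year: 2020 },
  { title: "Don't Start Now", artist: "Dua Lipa", year: 2020 },
];
const terms = [
  ["title_stupid", "title_love", "artist_lady", "artist_gaga", "year_2020"],
  ["title_dont", "title_start", "title_now", "artist_dua", "artist_lipa", "year_2020"],
];

const documentsFile = encodeDocuments(docs);
const documentTermsFile = encodeDocumentTerms(terms);
// Both strings can then be written to disk as UTF-8, for example with
// Node's fs.writeFileSync.
```

Comparing a hex dump of hand-written files against files produced this way should show quickly whether the terminators are the problem.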
Hey, FYI the key value limit was recently raised to 25MB, which could improve some performance, right?
https://developers.cloudflare.com/workers/platform/limits#kv
Haven't found anything regarding paging in the code or in the examples so I'm curious if paging search results is available or could be easily added?
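A response sample posted elsewhere in this thread includes `total` and `continuation` fields, which suggests results are returned in pages. How a continuation value is sent back to the worker is not shown here, so the sketch below leaves that to a caller-supplied `fetchPage` function. The names `collectResults` and `fetchPage`, and the rule that a missing or null `continuation` means the last page, are assumptions made for illustration.

```javascript
// Hypothetical paging helper, not part of edgesearch-client.
// Assumptions: each page has the shape { results, continuation }, and a
// null or undefined continuation means there are no further pages.
async function collectResults(fetchPage, maxResults) {
  const all = [];
  let continuation = 0;
  while (continuation != null && all.length < maxResults) {
    const page = await fetchPage(continuation);
    if (page.results.length === 0) break; // guard against looping forever
    all.push(...page.results);
    continuation = page.continuation;
  }
  return all.slice(0, maxResults);
}

// Stand-in for a real search request: serves 120 fake results, 50 at a time.
const fakeResults = Array.from({ length: 120 }, (_, i) => ({ id: i }));
async function fakeFetchPage(continuation) {
  const end = continuation + 50;
  return {
    total: fakeResults.length,
    results: fakeResults.slice(continuation, end),
    continuation: end < fakeResults.length ? end : null,
  };
}
```

With a real client, `fetchPage` would wrap whatever call issues the search for a given continuation value. A UI that shows one page at a time would call it once per page instead of looping.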
Try searching for something like Megaman
Update: Turns out this was due to not having clang, lld or wasm-ld. Installing v11 of each seems to have resolved my issue!
Reading document terms (94.02%)...
Processing documents (0%)...
Processing documents (18.42%)...
Processing documents (36.85%)...
Processing documents (55.27%)...
Processing documents (73.7%)...
Processing documents (92.12%)...
There are 11,116 documents with 15,900 terms
1 chunks contain terms
1 chunks contain documents
thread 'main' panicked at 'compile WASM: Os { code: 2, kind: NotFound, message: "No such file or directory" }', src/build/wasm.rs:102:22
stack backtrace:
0: 0x5569e3143e20 - std::backtrace_rs::backtrace::libunwind::trace::h04d12fdcddff82aa
at /rustc/e1884a8e3c3e813aada8254edfa120e85bf5ffca/library/std/src/../../backtrace/src/backtrace/libunwind.rs:100:5
1: 0x5569e3143e20 - std::backtrace_rs::backtrace::trace_unsynchronized::h1459b974b6fbe5e1
at /rustc/e1884a8e3c3e813aada8254edfa120e85bf5ffca/library/std/src/../../backtrace/src/backtrace/mod.rs:66:5
2: 0x5569e3143e20 - std::sys_common::backtrace::_print_fmt::h9b8396a669123d95
at /rustc/e1884a8e3c3e813aada8254edfa120e85bf5ffca/library/std/src/sys_common/backtrace.rs:67:5
3: 0x5569e3143e20 - <std::sys_common::backtrace::_print::DisplayBacktrace as core::fmt::Display>::fmt::he009dcaaa75eed60
at /rustc/e1884a8e3c3e813aada8254edfa120e85bf5ffca/library/std/src/sys_common/backtrace.rs:46:22
4: 0x5569e31684cc - core::fmt::write::h77b4746b0dea1dd3
at /rustc/e1884a8e3c3e813aada8254edfa120e85bf5ffca/library/core/src/fmt/mod.rs:1078:17
5: 0x5569e3141002 - std::io::Write::write_fmt::heb7e50902e98831c
at /rustc/e1884a8e3c3e813aada8254edfa120e85bf5ffca/library/std/src/io/mod.rs:1518:15
6: 0x5569e3146585 - std::sys_common::backtrace::_print::h2d880c9e69a21be9
at /rustc/e1884a8e3c3e813aada8254edfa120e85bf5ffca/library/std/src/sys_common/backtrace.rs:49:5
7: 0x5569e3146585 - std::sys_common::backtrace::print::h5f02b1bb49f36879
at /rustc/e1884a8e3c3e813aada8254edfa120e85bf5ffca/library/std/src/sys_common/backtrace.rs:36:9
8: 0x5569e3146585 - std::panicking::default_hook::{{closure}}::h658e288a7a809b29
at /rustc/e1884a8e3c3e813aada8254edfa120e85bf5ffca/library/std/src/panicking.rs:208:50
9: 0x5569e3146228 - std::panicking::default_hook::hb52d73f0da9a4bb8
at /rustc/e1884a8e3c3e813aada8254edfa120e85bf5ffca/library/std/src/panicking.rs:227:9
10: 0x5569e3146d21 - std::panicking::rust_panic_with_hook::hfe7e1c684e3e6462
at /rustc/e1884a8e3c3e813aada8254edfa120e85bf5ffca/library/std/src/panicking.rs:593:17
11: 0x5569e3146867 - std::panicking::begin_panic_handler::{{closure}}::h42939e004b32765c
at /rustc/e1884a8e3c3e813aada8254edfa120e85bf5ffca/library/std/src/panicking.rs:499:13
12: 0x5569e31442dc - std::sys_common::backtrace::__rust_end_short_backtrace::h9d2070f7bf9fd56c
at /rustc/e1884a8e3c3e813aada8254edfa120e85bf5ffca/library/std/src/sys_common/backtrace.rs:141:18
13: 0x5569e31467c9 - rust_begin_unwind
at /rustc/e1884a8e3c3e813aada8254edfa120e85bf5ffca/library/std/src/panicking.rs:495:5
14: 0x5569e3166e31 - core::panicking::panic_fmt::ha0bb065d9a260792
at /rustc/e1884a8e3c3e813aada8254edfa120e85bf5ffca/library/core/src/panicking.rs:92:14
15: 0x5569e3166c53 - core::option::expect_none_failed::h7e1dd0a94971eb61
at /rustc/e1884a8e3c3e813aada8254edfa120e85bf5ffca/library/core/src/option.rs:1268:5
16: 0x5569e30d1008 - edgesearch::build::wasm::generate_and_compile_runner_wasm::he796daae589dc04c
17: 0x5569e30cc5dc - edgesearch::build::build::hf79fefb20336bfa3
18: 0x5569e30c7a9b - edgesearch::main::heb516d46d23f0e37
19: 0x5569e30c4b33 - std::sys_common::backtrace::__rust_begin_short_backtrace::h3c4b2ea8815b4622
20: 0x5569e30c4b09 - std::rt::lang_start::{{closure}}::h005a553ef7350dea
21: 0x5569e3147147 - core::ops::function::impls::<impl core::ops::function::FnOnce<A> for &F>::call_once::h57e2a071d427b24c
at /rustc/e1884a8e3c3e813aada8254edfa120e85bf5ffca/library/core/src/ops/function.rs:259:13
22: 0x5569e3147147 - std::panicking::try::do_call::h81cbbe0c3b30a28e
at /rustc/e1884a8e3c3e813aada8254edfa120e85bf5ffca/library/std/src/panicking.rs:381:40
23: 0x5569e3147147 - std::panicking::try::hbeeb95b4e1f0a876
at /rustc/e1884a8e3c3e813aada8254edfa120e85bf5ffca/library/std/src/panicking.rs:345:19
24: 0x5569e3147147 - std::panic::catch_unwind::h59c48ccb40a0bf20
at /rustc/e1884a8e3c3e813aada8254edfa120e85bf5ffca/library/std/src/panic.rs:396:14
25: 0x5569e3147147 - std::rt::lang_start_internal::ha53ab63f88fee728
at /rustc/e1884a8e3c3e813aada8254edfa120e85bf5ffca/library/std/src/rt.rs:51:25
26: 0x5569e30c7f12 - main
27: 0x7ff7c883809b - __libc_start_main
28: 0x5569e30c47ba - _start
29: 0x0 - <unknown>
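Since the update above traces this panic to clang, lld or wasm-ld being absent, a small preflight check before running the build can make the cause obvious. The sketch below is a hypothetical helper, not part of edgesearch. It takes the `PATH` value and a file-existence predicate as parameters so it stays independent of any particular runtime, and it only looks for the exact tool names, so versioned binaries such as `clang-11` would need to be listed explicitly.

```javascript
// Hypothetical preflight check, not part of edgesearch.
// pathValue is the PATH string, delimiter is ":" on Linux and macOS, and
// fileExists is a caller-supplied predicate such as Node's fs.existsSync.
function findMissingTools(tools, pathValue, delimiter, fileExists) {
  const dirs = pathValue.split(delimiter).filter((dir) => dir.length > 0);
  return tools.filter(
    (tool) => !dirs.some((dir) => fileExists(`${dir}/${tool}`)),
  );
}

// Demo with a fake file system in which only clang is installed.
const installed = new Set(["/usr/bin/clang"]);
const missing = findMissingTools(
  ["clang", "lld", "wasm-ld"],
  "/usr/local/bin:/usr/bin",
  ":",
  (file) => installed.has(file),
);
// missing now lists lld and wasm-ld.
```

In Node, the same function could be called with `process.env.PATH`, `path.delimiter` and `fs.existsSync`.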
Hey! This project looks awesome, but it'd be really useful if there could be a native Rust client that can be compiled to Wasm that could be used instead of edgesearch-client in Rust web apps. This might make use of web-sys and the native browser Fetch API to avoid bringing in any unnecessary libraries like reqwest that would bloat things. If this isn't an antipattern for the project, I'd be happy to submit a PR on this!
It would be wonderful if you could offer your module as a Cloudflare Pages Plugin (https://blog.cloudflare.com/cloudflare-pages-plugins/).
This would definitely make things easier for everyone!
And thank you for this wonderful gift <3
Awesome project!
We are using the above pure Rust library for roaring bitmaps in a wasm-based application and were wondering why you opted for a C-based implementation.
This is some super impressive work here, I've experimented with a similar approach but couldn't get within the CPU-limits. I don't have any experience with C though and didn't think of using Roaring Bitmaps, which makes a lot more sense!
Have you tested how large the max Index size can be?
I have 20MB of documents (30k), how large would they become when indexed here?
When I tested with a JS-based search engine, the worker ran out of CPU time at ~8MB, but searches were still fast enough (my target is < 1 second).
Titled.
I.e., say I have one document with a single term, text_你, and another document with the terms text_你 and text_好. For a query on text_你, the first document should be ranked higher than the second.
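One way to read this request is to score each matching document by the share of its own terms that the query matches, so a document consisting only of the queried term outranks a longer one. The sketch below illustrates that idea on plain arrays. It is not how edgesearch orders results, and the function name `rankByMatchedShare` and the `{ id, terms }` document shape are assumptions made for the example.

```javascript
// Hypothetical ranking sketch, not part of edgesearch.
// Score = (number of the document's terms that appear in the query)
//         / (total number of terms in the document).
function rankByMatchedShare(documents, queryTerms) {
  const wanted = new Set(queryTerms);
  return documents
    .map((doc) => {
      const matched = doc.terms.filter((term) => wanted.has(term)).length;
      const score = doc.terms.length === 0 ? 0 : matched / doc.terms.length;
      return { id: doc.id, score };
    })
    .filter((entry) => entry.score > 0) // drop documents with no matching term
    .sort((a, b) => b.score - a.score); // highest share first
}

const ranked = rankByMatchedShare(
  [
    { id: "second", terms: ["text_你", "text_好"] },
    { id: "first", terms: ["text_你"] },
  ],
  ["text_你"],
);
// "first" scores 1 and "second" scores 0.5, so "first" is ranked higher.
```

A scheme like this needs the per-document term count at query time, which is extra data beyond a set of matching document IDs.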
As a user, everything appears to work just fine, but when I look at my Cloudflare dashboard I'm seeing an error for all of my requests.
I dug in with wrangler tail and here's a subset of what I saw:
{
"outcome": "exception",
"scriptName": null,
"exceptions": [{
"name": "Error",
"message": "The script will never generate a response.",
"timestamp": 1624986983071
}],
"logs": [{
"message": ["Found containing chunk"],
"level": "log",
"timestamp": 1624986981885
}, {
"message": ["Fetched chunk from KV"],
"level": "log",
"timestamp": 1624986982339
}, {
"message": ["Found entry in chunk"],
"level": "log",
"timestamp": 1624986982339
}, {
"message": ["Bit sets retrieved"],
"level": "log",
"timestamp": 1624986982339
}, {
"message": ["Query built"],
"level": "log",
"timestamp": 1624986982339
}, {
"message": ["Query executed"],
"level": "log",
"timestamp": 1624986982339
}, {
"message": ["Found containing chunk"],
"level": "log",
"timestamp": 1624986982339
}, {
"message": ["Found entry in chunk"],
"level": "log",
"timestamp": 1624986983071
}, {
"message": ["Documents fetched"],
"level": "log",
"timestamp": 1624986983071
}],
"eventTimestamp": 1624986981885,
"event": {
"request": {
"url": "https://gifs-edgesearch.garage.workers.dev/search?c=0&t=0_text_bobby",
"method": "GET",
}
}
}
Meanwhile the client (a browser) received the following (headers removed for brevity):
:status: 200
Content-Type: application/json
Content-Encoding: br
Server: cloudflare
And an idea of what the response data looked like:
{
"total": 2709,
"continuation": 50,
"results": [/* omitted */]
}
And an idea of what the underlying client code looks like:
const t = 'bobby'
const q = new Edgesearch.Query();
q.add(Edgesearch.Mode.REQUIRE, `text_${t}`)
let response = await client.search(q);
I suspect this is because I'm on an arm64 (Apple Silicon) machine and running an arm64 compiled version of edgesearch. When I attempt to run the provided x86 binary my script aborts with SIGILL (Illegal instruction).
This is happening for me (with the same document/terms file) on macOS ARM and x86 FreeBSD, so it's almost definitely user error?
RUST_BACKTRACE=1 edgesearch \
--documents docs \
--document-terms terms \
--maximum-query-results 10 \
--output-dir out \
--data-store KV
Processing documents (0%)...
Processing documents (18.34%)...
Processing documents (36.68%)...
Processing documents (55.01%)...
Processing documents (73.35%)...
Processing documents (91.69%)...
There are 349 documents with 945 terms
1 chunks contain terms
1 chunks contain documents
error: unable to create target: 'No available targets are compatible with triple "wasm32-unknown-unknown-wasm"'
1 error generated.
thread 'main' panicked at 'Failed to compile WASM', src/build/wasm.rs:103:9
stack backtrace:
0: std::panicking::begin_panic
1: edgesearch::build::wasm::generate_and_compile_runner_wasm
2: edgesearch::build::build
3: edgesearch::main
snippet from docs:
{"timecode":4,"text":"<i>Sensors detect a ship\non an intercept course at warp speed.</i>","at":"11/4.jpg"}�{"timecode":8,"text":"<i>Should we go to red alert?</i>","at":"11/8.jpg"}�
snippet from terms:
t_sensors �,t_detect �,t_a �,t_shipon �,t_an �,t_intercept �,t_course �,t_at �,t_warp �,t_speed ��t_should �,t_we �,t_go �,t_to �,t_red �,t_alert ��
Nice API for serverless search! I like the use of compressed bit sets.
Looks like the engine loads all the documents (data files to KV) once-only at deploy time. How hard or easy is it to allow runtime document adding/updating/removing (via REST methods), and automatic reindexing?
I'm on Apple Silicon, so I was going to build for arm64, but nbd.