I'm Kat!
I'm a Rust, C#, and JavaScript developer working at Microsoft and I do a bunch of open source stuff.
You can also find me on Mastodon as @[email protected], Matrix as @kat:zkat.tech, or Discord as kat#8645.
A high-performance, concurrent, content-addressable disk cache, with support for both sync and async APIs. 💩💵 but for your 🦀
Home Page: https://crates.io/crates/cacache
License: Other
I have a high-level writing and deletion scenario. I write first and delete after a while. Is this project suitable for this scenario?
If we try to remove an index entry or content entry that has already been deleted, cacache throws an io::ErrorKind::NotFound error. This is fine on its own, but when using RemoveOpts and remove_fully, a missing content file will throw an early error and prevent the deletion of the bucket.
A solution would be to simply ignore NotFound errors, since the cache is already in the intended final state.
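A minimal std-only sketch of that idea (the helper name is mine, not a cacache API):

```rust
use std::io::{self, ErrorKind};
use std::path::Path;

// Treat "already deleted" as success, since the file being gone
// is exactly the state we wanted to reach.
fn remove_ignoring_missing(path: &Path) -> io::Result<()> {
    match std::fs::remove_file(path) {
        Err(e) if e.kind() == ErrorKind::NotFound => Ok(()),
        other => other,
    }
}
```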
cacache version: v11.3.0
rustc: 1.69.0
I'd like to use cacache in a performance-sensitive area of my application, but I have found there to be more overhead than I hoped. For instance, on my M2 MacBook Air (2,800 MB/s read SSD) it takes ~90ms to read a 30MB file from cache (sync or async), while reading the file directly from disk takes ~15ms. I see similar performance within my application as well as in microbenchmarks, so I don't believe this is simply bad benchmarking.
I'm not familiar with the caching strategy, so maybe this is expected? Possibly a system(MacOS) specific issue? Bad benchmarking?
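For what it's worth, this is the kind of std-only harness I'd use to sanity-check such numbers (the paths and keys in the commented-out usage are placeholders):

```rust
use std::time::Instant;

// Run a closure, print how long it took, and hand back its result.
fn time_it<T>(label: &str, f: impl FnOnce() -> T) -> T {
    let start = Instant::now();
    let out = f();
    eprintln!("{label}: {:?}", start.elapsed());
    out
}

// Usage sketch (placeholder paths):
// let cached = time_it("cacache read", || cacache::read_sync("./my-cache", "my-key"));
// let direct = time_it("direct read", || std::fs::read("./my-file.bin"));
```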
Hi Kat,
I am currently trying to use your library in a project of mine and tripped over error handling in the crate's documentation. Because the Error struct exposed (and documented) by the crate is not actually used in any of the crate's return types, it became somewhat unclear to me what the preferred approach to error handling with your crate is.
Also, the anyhow error type is not re-exported by your crate directly, so anyone who wants to wrap the error themselves has to add an additional dependency on anyhow.
I would love to send a pull request to fix this issue, though I am not sure what the right approach would be. If you consider this purely a documentation issue I will gladly change the ReadMe and add a paragraph to the error enums documentation.
Oh and thank you for your work, on this library and in general.
Greetings,
Florian
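For reference, one conventional pattern (not what cacache currently does; all names below are hypothetical) is for a crate to expose its own error enum built on std::error::Error, so callers can match on failures without pulling in anyhow:

```rust
use std::fmt;

// Hypothetical crate-local error type; variants are illustrative only.
#[derive(Debug)]
enum CacheError {
    EntryNotFound(String),
    Io(std::io::Error),
}

impl fmt::Display for CacheError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            CacheError::EntryNotFound(key) => write!(f, "no cache entry for key `{key}`"),
            CacheError::Io(e) => write!(f, "cache I/O error: {e}"),
        }
    }
}

impl std::error::Error for CacheError {
    fn source(&self) -> Option<&(dyn std::error::Error + 'static)> {
        match self {
            CacheError::Io(e) => Some(e),
            _ => None,
        }
    }
}

impl From<std::io::Error> for CacheError {
    fn from(e: std::io::Error) -> Self {
        CacheError::Io(e)
    }
}
```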
Add more context to returned failures to make things more debuggable
I'm investigating allowing content to be streamed out of a CAS store, while it's still being ingested.
The rough idea I have is to be able to get a Reader for an open Writer, which reads from the tmpfile, that then somehow switches to the canonical version of the content, once the Writer is closed, and the content hash is known.
This may be a terrible idea 😅
If it did sound reasonable though, I'd be interested in helping to contribute the feature.
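To make the idea concrete, here is a toy std-only model of a reader polling a still-open writer (nothing here is cacache API; a real version would read the tmpfile on disk and switch to the canonical path on close):

```rust
use std::sync::{Arc, Mutex};

// Shared state between one writer and any number of readers.
#[derive(Default)]
struct Shared {
    buf: Vec<u8>,
    closed: bool,
}

#[derive(Clone, Default)]
struct Pipe(Arc<Mutex<Shared>>);

impl Pipe {
    fn write(&self, data: &[u8]) {
        self.0.lock().unwrap().buf.extend_from_slice(data);
    }

    fn close(&self) {
        self.0.lock().unwrap().closed = true;
    }

    // Some(bytes) while data may still arrive; None once closed and drained.
    fn read_from(&self, offset: usize) -> Option<Vec<u8>> {
        let shared = self.0.lock().unwrap();
        if offset < shared.buf.len() {
            Some(shared.buf[offset..].to_vec())
        } else if shared.closed {
            None
        } else {
            Some(Vec::new()) // writer still open, nothing new yet
        }
    }
}
```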
Before releasing 6.0, we should make sure to switch over to the stable version of the async ecosystem, once both the async-std and futures crates have been updated to work with stable Rust and async/await.
This should be in the next day or two!
Among the claims you make are: "Fault tolerance (immune to corruption, partial writes, process races, etc)" and "Consistency guarantees on read and write (full data verification)". These two claims stand out to me, but what I'm curious about also applies to the other claims. What kind of verification have you done, and keep doing, to ensure these claims hold? As Kyle Kingsbury can certainly attest, many distributed systems make claims in a similar vein, but many don't uphold what they promise. I'd appreciate it if you explained the steps you've taken to validate these claims, not only here but also in the README, so future readers can more easily evaluate the trust they place in this software.
According to the npm/cacache rm.entry API:
"By default, this appends a new entry to the index with an integrity of null. If opts.removeFully is set to true then the index file itself will be physically deleted rather than appending a null."
thanks a lot!
Thanks to some feedback from a couple of folks, including @yoshuawuyts, I think it's worth exploring a new API that's more "object-oriented". The following is a sketch, partly based on Yoshua's suggestions, but adapted to more specific needs of cacache features:
struct Cache {
    path: PathBuf,
}

// Assume any necessary AsRefs below. Omitting them for the sake of readability.
impl Cache {
    fn new(path: PathBuf) -> Self;
    async fn open(&self, key: String) -> Result<AsyncGet>;
    async fn insert(&mut self, key: String, value: &[u8]) -> Result<Integrity>;
    async fn get(&self, key: String) -> Result<Option<&[u8]>>;
    async fn get_entry(&self, key: String) -> Result<Option<Entry>>;
    async fn get_by_hash(&self, sri: &Integrity) -> Result<Option<&[u8]>>;
    async fn contains_hash(&self, sri: &Integrity) -> bool;

    // Same as the above, but sync, suffixed by `_sync`
    fn insert_sync(&mut self, key: String, value: &[u8]) -> Result<Integrity>;
    // etc

    // Iterate over all entries
    async fn entries(&self) -> Stream<Entry>;
}

// Stream over all entries.
impl Stream for Entries {
    type Item = Entry;
}

struct Entry;

impl Entry {
    // getters for all fields

    // Similar to ptr::copy_to
    async fn copy_to(&self, dest: &Path) -> io::Result<()>;
}

impl std::hash::Hash for Entry;
When remove_fully is set to true, only the key is deleted, not the value.
I wrote some code to create a random key and value; the resulting cache dir is:
With remove_fully set to false:
$ rg "137-" my-cache/
my-cache/content-v2/sha256/ce/fd/02dcb440266abebb725a81019707a40f0f82bcf6c6f2dff2ca21480eb0a8
1:137-some data
my-cache/index-v5/8b/1e/f744c40e0577aced59824ae6d9dcb05ff399
2:3d32862bb4beb9a748577804b2aa8c473e40b9a322ea6bc05948816421e39c07 {"key":"137-key","integrity":"sha256-DQxf03Cgxhi/gMg4NOVHBt9J/C0SLsja7/IhebII74k=","time":1706346133948,"size":13,"metadata":null,"raw_metadata":null}
3:cf1c942e7d568a97c7259a8d6e35d31958f814d0ce8ec7cb94deb997e6b3ffb9 {"key":"137-key","integrity":null,"time":1706346134097,"size":0,"metadata":null,"raw_metadata":null}
4:9eb741017218ae0a02038e916eeea4bc54667e840c68ed0649deac285ca5864d {"key":"137-key","integrity":"sha256-zv0C3LRAJmq+u3JagQGXB6QPD4K89sby3/LKIUgOsKg=","time":1706346305649,"size":13,"metadata":null,"raw_metadata":null}
5:ea842e6e8787ebc0da77f8c29bd4f867d12adc3fb824d374adeaf6b1a6c7c73e {"key":"137-key","integrity":null,"time":1706346305682,"size":0,"metadata":null,"raw_metadata":null}
With remove_fully set to true:
$ rg "137-" my-cache/
my-cache/content-v2/sha256/ce/fd/02dcb440266abebb725a81019707a40f0f82bcf6c6f2dff2ca21480eb0a8
1:137-some data
When remove_fully is true, I expect the value to be deleted too.
Hi, we are using the crate through reqwest-cache. That crate seems to perform the standard read operation found here. I have an example of the cache reading here that works in the manner the code is written, because of the select and pin: https://github.com/spider-rs/spider/blob/main/examples/cache.rs. If we change the code to remove the select and spawn tasks instead, the subscription will hang forever.
I could post the deadlock here in a code example if needed.
Should the name in the README.md be cacache-rs?
There's a terrifying lack of test coverage on cacache right now. There should be unit tests for at least all the API functions, and preferably also for the internal APIs.
I'm trying to use only the sync API, because I don't need the async one and don't want to pull in an async runtime. I copied the following into my Cargo.toml, as the docs.rs page said:
cacache = { version = "12.0.0", default-features = false, features = ["mmap"] }
And I got the following error log:
error[E0433]: failed to resolve: could not find `async_lib` in the crate root
--> /home/kinire98/.cargo/registry/src/index.crates.io-6f17d22bba15001f/cacache-12.0.0/src/index.rs:426:20
|
426 | crate::async_lib::remove_file(&bucket)
| ^^^^^^^^^ could not find `async_lib` in the crate root
error[E0425]: cannot find function `find_async` in module `index`
--> /home/kinire98/.cargo/registry/src/index.crates.io-6f17d22bba15001f/cacache-12.0.0/src/get.rs:329:37
|
329 | if let Some(entry) = index::find_async(cache, key).await? {
| ^^^^^^^^^^ not found in `index`
|
note: found an item that was configured out
--> /home/kinire98/.cargo/registry/src/index.crates.io-6f17d22bba15001f/cacache-12.0.0/src/index.rs:179:14
|
179 | pub async fn find_async(cache: &Path, key: &str) -> Result<Option<Metadata>> {
| ^^^^^^^^^^
error[E0425]: cannot find function `find_async` in module `index`
--> /home/kinire98/.cargo/registry/src/index.crates.io-6f17d22bba15001f/cacache-12.0.0/src/get.rs:365:37
|
365 | if let Some(entry) = index::find_async(cache, key).await? {
| ^^^^^^^^^^ not found in `index`
|
note: found an item that was configured out
--> /home/kinire98/.cargo/registry/src/index.crates.io-6f17d22bba15001f/cacache-12.0.0/src/index.rs:179:14
|
179 | pub async fn find_async(cache: &Path, key: &str) -> Result<Option<Metadata>> {
| ^^^^^^^^^^
error[E0425]: cannot find function `open_async` in this scope
--> /home/kinire98/.cargo/registry/src/index.crates.io-6f17d22bba15001f/cacache-12.0.0/src/content/read.rs:166:22
|
166 | let mut reader = open_async(cache, sri.clone()).await?;
| ^^^^^^^^^^ not found in this scope
error[E0433]: failed to resolve: use of undeclared type `AsyncReadExt`
--> /home/kinire98/.cargo/registry/src/index.crates.io-6f17d22bba15001f/cacache-12.0.0/src/content/read.rs:169:20
|
169 | let read = AsyncReadExt::read(&mut reader, &mut buf)
| ^^^^^^^^^^^^ use of undeclared type `AsyncReadExt`
error[E0425]: cannot find function `delete_async` in this scope
--> /home/kinire98/.cargo/registry/src/index.crates.io-6f17d22bba15001f/cacache-12.0.0/src/index.rs:423:13
|
423 | delete_async(cache.as_ref(), key.as_ref()).await
| ^^^^^^^^^^^^ not found in this scope
Some errors have detailed explanations: E0425, E0433.
For more information about an error, try `rustc --explain E0425`.
error: could not compile `cacache` (lib) due to 6 previous errors
This is the neofetch information of my PC if it is useful:
OS: EndeavourOS Linux x86_64
Host: GF63 Thin 10SCSR REV:1.0
Kernel: 6.6.10-arch1-1
Shell: bash 5.2.21
CPU: Intel i7-10750H (12) @ 5.000GHz
GPU: Intel CometLake-H GT2 [UHD Graphics]
GPU: NVIDIA GeForce GTX 1650 Ti Mobile
Memory: 4150MiB / 31917MiB
I don't know if I have to change something for it to work. If I just write:
cacache = "12.0.0"
it works normally, but pulls in the whole async runtime.
Use case: I have zip files being added to a cacache instance, and zip is a non-streamable format. The index of a zip is at the end, so it is somewhat inefficient to read everything, then scan the index, then reopen and re-read to extract the actual content (using async_zip).
The documentation of cacache::remove() says this: "Removes an individual index metadata entry. The associated content will be left in the cache."
Is there a safe way to remove the key and the content? Or is that automatically done at some point?
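As far as I can tell it's a two-step affair today. A hedged sketch (error handling elided, and assuming you kept the Integrity returned at write time) might look like:

```rust
// Sketch only: remove the index entry, then the content it pointed at.
fn remove_entry_and_content(
    cache: &str,
    key: &str,
    sri: &cacache::Integrity,
) -> cacache::Result<()> {
    cacache::remove_sync(cache, key)?;      // drops the index entry
    cacache::remove_hash_sync(cache, sri)?; // drops the content blob
    Ok(())
}
```

Note the caveat that another key could still reference the same content hash, in which case removing the blob would break that entry.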
nix is apparently pretty heavy, and it's currently used only for chownr. Update chownr, and then this crate, to use libc directly instead.
A cache read by key now takes about ~30 seconds for my application.
A clue:
❯ sudo du -sh *
[sudo] password for blarsen:
15M content-v2
2.4G index-v5
0 tmp
Usage pattern: write to a small number of keys (<10) every few seconds. On program start, read those keys.
The cache is used to dump state to disk so that it can be read on program start after unclean exit.
The index file for each key is about 280M, with over 1M entries.
It appears that you're keeping the entire history? Is this just for reliability reasons? There doesn't appear to be an API to read older versions of a key. Is there a way to reliably trim history to get my speed back?
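Not a supported cacache operation as far as I know, but as an illustration of the kind of offline compaction that would help, here is a std-only sketch that keeps just the final line of a newline-delimited bucket file. Caveat: if several keys hash into the same bucket, this would drop the other keys' entries too, so treat it purely as a demonstration:

```rust
use std::fs;
use std::io::Write;
use std::path::Path;

// Rewrite a bucket file so only its most recent (last) line survives.
fn keep_last_entry(bucket: &Path) -> std::io::Result<()> {
    let contents = fs::read_to_string(bucket)?;
    if let Some(last) = contents.lines().last() {
        let mut out = fs::File::create(bucket)?;
        writeln!(out, "{last}")?;
    }
    Ok(())
}
```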
The current bucket format is copied directly from what the JavaScript version of cacache does.
I no longer think it's worth trying to preserve compatibility, and the performance of index-related operations is kind of horrendous right now, so I think it's time to explore a new on-disk format for the index buckets.
My current thinking is to use serde more directly, and come up with a better strategy for the generic metadata field as well.
And of course, if there's no actual perf difference, this issue should just be closed, but this is worth exploring anyway.
Hi. Trying to build a project with rustc 1.77.1 and the tokio feature enabled outputs multiple compile errors.
--> /Users/pablo/.cargo/registry/src/index.crates.io-6f17d22bba15001f/futures-io-0.3.30/src/lib.rs:60:12
error[E0308]: mismatched types
--> /Users/pablo/.cargo/registry/src/index.crates.io-6f17d22bba15001f/cacache-13.0.0/src/get.rs:46:9
|
45 | ) -> Poll<tokio::io::Result<()>> {
| --------------------------- expected `Poll<std::result::Result<(), std::io::Error>>` because of return type
46 | Pin::new(&mut self.reader).poll_read(cx, buf)
| ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ expected `Poll<Result<(), Error>>`, found `Poll<Result<usize, Error>>`
|
= note: expected enum `Poll<std::result::Result<(), _>>`
found enum `Poll<std::result::Result<usize, _>>`
error[E0599]: no method named `poll_shutdown` found for struct `Pin<&mut AsyncWriter>` in the current scope
--> /Users/pablo/.cargo/registry/src/index.crates.io-6f17d22bba15001f/cacache-13.0.0/src/put.rs:170:36
|
170 | Pin::new(&mut self.writer).poll_shutdown(cx)
| ^^^^^^^^^^^^^ method not found in `Pin<&mut AsyncWriter>`
|
= help: items from traits can only be used if the trait is implemented and in scope
= note: the following trait defines an item `poll_shutdown`, perhaps you need to implement it:
candidate #1: `tokio::io::AsyncWrite`
As a result I cannot use version 13.0.0 with my project; it still works with 12.0.0. Could you look into resolving it?
Heya! This just came across my GitHub following feed and it looks really interesting, but I must admit I almost ignored it at first because of the about description:
💩💵 but for your 🦀
It's fun, but it's not particularly informative (and honestly, it sounded like a shitpost-as-repo). I'm glad I checked it out, though, because
A high-performance, concurrent, content-addressable disk cache, optimized for async APIs.
is much more interesting as a driveby viewer!
Apologies if it sounds like I'm telling you how to run your repo - I just don't want other people to make the same mistake I almost did 🙂
This project looks promising. However, the lack of examples makes it difficult to get started. Add some examples.
In src/content/read.rs, there are two different buffer sizes, 1024 and 1024*8.
The choice looks kind of random; it doesn't line up with async/sync or hard-link/reflink/copy.
So maybe some values just weren't updated?
It would be great to explain why these numbers were chosen.
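If anyone wants to measure whether the buffer size actually matters, a chunked-read helper like this (std only, nothing cacache-specific) makes it easy to compare 1024 against 8192 on real files:

```rust
use std::fs::File;
use std::io::Read;
use std::path::Path;

// Read a file in fixed-size chunks, mimicking a configurable read buffer.
fn read_with_buf(path: &Path, buf_size: usize) -> std::io::Result<Vec<u8>> {
    let mut file = File::open(path)?;
    let mut out = Vec::new();
    let mut buf = vec![0u8; buf_size];
    loop {
        let n = file.read(&mut buf)?;
        if n == 0 {
            break;
        }
        out.extend_from_slice(&buf[..n]);
    }
    Ok(out)
}
```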
I was wondering whether it would be possible to store a single entry under multiple hashes, by computing multiple hashes at the same time and hard-linking the content in the cache.
In some use cases it makes sense to use one hash over another, because you might want to use the hash outside of the cache, but if I understand correctly, content would be duplicated when using two hashes for the same content.
An Integrity can already contain multiple hashes, but I think the API doesn't offer support to store/calculate multiple hashes.
For copy/reflink/hard-link, there should be key-based and hash-based variants, each in checked and unchecked form, plus their sync versions, so 8 methods for each category.
However, some methods are missing.
It seems reflink_hash_unchecked, hard_link_hash_unchecked, and hard_link_unchecked are not very useful, and hard_link_hash is missing anyway.
It would be really nice to have more thorough benchmark coverage of the various external APIs, much like cacache-js does.
cacache 11.5.2 crashes a program with SIGBUS when working under disk-full conditions on Linux.
Running ubuntu-19.10 with linux-5.3.0-64-generic.
$ rustc --version
rustc 1.69.0 (84c898d65 2023-04-16)
Reproduction:
mkdir /tmp/ram && mount -ttmpfs -osize=5m tmpfs /tmp/ram
fn main() {
    for i in 0..12 {
        println!("{}", i);
        let data: Vec<_> = (0..512 * 1024).map(|_| rand::random::<u8>()).collect();
        println!("{:?}", cacache::write_hash_sync("/tmp/ram/cache", &data));
    }
}
// in Cargo.toml
// ...
// [dependencies]
// cacache = "11.5.2"
// rand = "0.8"
0
Writer::new size=Some(524288)
Ok(Integrity { hashes: [Hash { algorithm: Sha256, digest: "QQ9CVmHX6CzNPkuGFAhp/k8wSkmEVexMp6ARULmLdMM=" }] })
1
Writer::new size=Some(524288)
Ok(Integrity { hashes: [Hash { algorithm: Sha256, digest: "JAI/yZ2LjUfpKko8L4RFV7g7DzNxHvq7jfYhX/9mQ4o=" }] })
[...]
9
Writer::new size=Some(524288)
Bus error (core dumped)
The problem is caused by the optimization where, if the binary blob to cache is no more than 512KB, it is written via mmap. Since the file obtained from tempfile may be sparse, writing to it may result in allocation of more blocks on the fs and failure with SIGBUS (I'm not sure whether this is standard/defined behavior, but it seems legit: what else can the OS do?). The call to std::fs::File::set_len only results in calling the truncate syscall, which does not guarantee file allocation and does not return an error when there is not enough space on the device.
Calling posix_fallocate on the fd of the file obtained from tempfile fixes the issue for me, but I am not sure whether posix_fallocate guarantees file allocation or just happens to work ok.
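A portable (if slower) std-only alternative to posix_fallocate is to actually write zeroes, which forces block allocation and surfaces ENOSPC as a normal io::Error instead of a later SIGBUS. A sketch, not cacache code:

```rust
use std::fs::File;
use std::io::{Seek, SeekFrom, Write};

// Force real block allocation by writing zeroes; set_len alone only
// truncates and can leave the file sparse.
fn allocate_eagerly(file: &mut File, size: u64) -> std::io::Result<()> {
    file.seek(SeekFrom::Start(0))?;
    let block = [0u8; 8192];
    let mut remaining = size;
    while remaining > 0 {
        let n = remaining.min(block.len() as u64) as usize;
        file.write_all(&block[..n])?;
        remaining -= n as u64;
    }
    file.flush()
}
```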
The use case is to be able to stream content into cacache when a key is not readily available.
I'm looking at using cacache to store Rust structs, but cacache uses AsRef<[u8]> for data and AsRef<str> for strings.
There are a lot of ways to turn a struct into a [u8] and back, and a lot of ways of turning a struct key into a str. It's easy to spend a lot of time evaluating the different options when in many cases it'd probably be best just to choose one and move on.
An example in your docs might do wonders here. Presumably you have more insight into the better ways of doing this, so the mechanism used in the example could be presumed a "good" way, even if it isn't the best for every situation.
For example, it seems tempting to use the rust hash mechanism for the key, but that's double hashing and could cause collisions so I imagine that's not recommended.
It's also tempting to use Debug or Display formatting for the key since most structs already have it and it'd probably work well in some situations. Probably not something you should use in an example though because those are sometimes lossy.
Which means likely a serde format for both key and data. But which one? There are so many...
I found this overview. 2 years old so things may have changed, but likely still mostly correct: https://blog.logrocket.com/rust-serialization-whats-ready-for-production-today/
conclusion: json for key and bincode for data?
This was less of an issue report and more of a "thinking aloud" situation. But it may be helpful to others in the same situation, so I'm going to post it anyways. Feel free to close.
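A sketch of that conclusion, assuming serde with serde_json for the key and bincode for the value (the crate choices and type names are mine; the cacache call is commented out):

```rust
use serde::{Deserialize, Serialize};

#[derive(Serialize, Deserialize, Debug, PartialEq)]
struct MyKey {
    name: String,
    version: u32,
}

#[derive(Serialize, Deserialize, Debug, PartialEq)]
struct MyValue {
    payload: Vec<u8>,
}

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // JSON keeps the key readable and stable; bincode keeps the value compact.
    let key = serde_json::to_string(&MyKey { name: "demo".into(), version: 1 })?;
    let value: Vec<u8> = bincode::serialize(&MyValue { payload: vec![1, 2, 3] })?;

    // cacache::write_sync("./my-cache", &key, &value)?;

    let back: MyValue = bincode::deserialize(&value)?;
    assert_eq!(back.payload, vec![1, 2, 3]);
    Ok(())
}
```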
A panic occurs when writing 1MiB or less with write_hash. That is, this works:
let _ = cacache::write_hash("./cache", &[b'a'; 1024 * 1024 + 1]).await;
But this results in a panic:
let _ = cacache::write_hash("./cache", &[b'a'; 1024 * 1024]).await;
thread 'blocking-1' panicked at 'source slice length (1048576) does not match destination slice length (0)', /home/tgnottingham/.cargo/registry/src/github.com-1ecc6299db9ec823/cacache-10.0.1/src/content/write.rs:260:38
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
thread 'async-std/runtime' panicked at 'task has failed', /home/tgnottingham/.cargo/registry/src/github.com-1ecc6299db9ec823/async-task-4.3.0/src/task.rs:426:45
thread 'main' panicked at 'task has failed', /home/tgnottingham/.cargo/registry/src/github.com-1ecc6299db9ec823/async-task-4.3.0/src/task.rs:426:45
The issue doesn't occur when using cacache::write.
Appears to be similar to #32.
Running on Ubuntu 22.04.1, Linux kernel 5.15.0-46-generic, x86_64, cacache 10.0.1.
Under "cache/index-v5/xx/yy/zzz", I found something like "raw_metadata":[0,18,32,112,97,......]"
It is the serialization of Vec, rather than binary like format.
Is this intentional? It looks like not very efficient.
The related reflink lib is quite old and has a bug described in nicokoch/reflink#4, so the same issue happens when I'm using cacache. Would it be a problem for you to bump the dependency on the master branch, like bczhc/rust@e9df2ba?
This appears to happen with any vec <= the memmap size.
thread 'main' panicked at 'source slice length (705) does not match destination slice length (0)', /Users/chris/.cargo/registry/src/github.com-1ecc6299db9ec823/cacache-10.0.0/src/content/write.rs:78:18
That backtrace points at:
impl Write for Writer {
    fn write(&mut self, buf: &[u8]) -> std::io::Result<usize> {
        self.builder.input(buf);
        if let Some(mmap) = &mut self.mmap {
            mmap.copy_from_slice(buf); // <-------------------- this line
            Ok(buf.len())
        } else {
            self.tmpfile.write(buf)
        }
    }

    fn flush(&mut self) -> std::io::Result<()> {
        self.tmpfile.flush()
    }
}
I tested manually lowering the max memmap limit to 0 and it started working again; pinning cacache at v9 seems to pass as well. I noticed this with both sync and async calls of cacache::write_hash.
My machine is:
If there's any other info I can post here to help please let me know, & thanks for cacache!
Add support for saving things using sha3.
The documentation, while fairly "complete", is missing doctests and more detailed explanations of how to use cacache. It would be really helpful to write more examples, plus doctests for all the various API functions on their respective pages.
The failure crate is pretty heavy and requires a bit more manual stuff than desired. There's a newer crate, anyhow, that seems like a very nice alternative and builds on std::error::Error, so it might improve compile times if used.
One thing missing from the README is whether cacache checks for hash collisions, since hashes seem to be the way the content is indexed internally. While they are rare, that's definitely something to be worried about.
So is cacache checking for those, or do we have to check manually ourselves?
I'm looking at different caching libraries right now for my project and this one looks really cool! However, I cannot find any information on cache eviction. Do I need to implement it manually on top of cacache? Have others already done it? (Is this even possible?)
What I need is some simple (access) time eviction, but I think most use cases require bounded size and some LRU/LFU algorithm.
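Since cacache itself doesn't evict, the usual approach is a sweep you run yourself. A generic std-only sketch of time-based eviction over a flat directory (cacache's real layout is nested, and you'd rather drive this from the index metadata, so treat it purely as an illustration):

```rust
use std::fs;
use std::path::Path;
use std::time::{Duration, SystemTime};

// Delete every file in `dir` whose mtime is older than `max_age`;
// returns how many files were removed.
fn evict_older_than(dir: &Path, max_age: Duration) -> std::io::Result<usize> {
    let mut evicted = 0;
    for entry in fs::read_dir(dir)? {
        let entry = entry?;
        let modified = entry.metadata()?.modified()?;
        let age = SystemTime::now()
            .duration_since(modified)
            .unwrap_or_default();
        if age > max_age {
            fs::remove_file(entry.path())?;
            evicted += 1;
        }
    }
    Ok(evicted)
}
```

Bounded-size or LRU/LFU eviction would additionally need to track sizes and access order, which is why most people end up layering a small index of their own on top.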