dirtabase

A build system built on Arkive, providing a higher-level set of verbs (operations), including downloads and command execution, with caching. It's going to be the backbone of the layover package manager.

# Run this command in this repo!
$ dirtabase \
  --import . src fixture \
  --merge \
  --prefix misc \
  --cmd-impure 'find misc -type f | xargs md5sum > sums' \
  --filter '^sums' \
  --export out

Here's what it'd output (this will probably be less chatty in the future):

================================================================
Import
================================================================
 + Can cache? false
 + Is in cache? false
dd45aedac81fb5e08f594bee978c9c6bd74b758f4f458ccd4fe250d271dcf171
8c958951d9f61be6a7b1ec48611710efc3d12ee71f3dc6ac34251afe4a95378e
================================================================
Merge
================================================================
 + Can cache? true
 + Is in cache? true
fe4462adb040549b5e632c4962e9ddfd98cd7f710949a50c137a351547eb170d
================================================================
Prefix
================================================================
 + Can cache? true
 + Is in cache? true
f5587f960dc28e8753f8558f61567cef5ed820ba9a87792d64162aed5fe9f4e0
================================================================
CmdImpure
================================================================
 + Can cache? false
 + Is in cache? false
--- [find misc -type f | xargs md5sum > sums] ---
20b1c125cbbc550603a3bbf5e6dec21802a656bf1f2d23b11011430d94f86b3b
================================================================
Filter
================================================================
 + Can cache? true
 + Is in cache? true
56b34c726418366b10db4cffe4285e04d47fd6f8161b1cda7a4bdc1a302c83e5
================================================================
Export
================================================================
 + Can cache? false
 + Is in cache? false

And you can poke around at the directory you just made!

$ ls out
sums

$ cat out/sums
c2333d995e4dbacab98f9fa37a1201a9  misc/fixture/file_at_root.txt
9d358d667fe119ed3a8a98faeb0de40b  misc/fixture/dir1/dir2/nested.txt
1dba60d0147ca0771b3e2553de7fb2f2  misc/src/context.rs
9156988bafe609194d7aca6adf8a3e70  misc/src/doc.rs
cc255b333228984a0bbccbcf1a16f1d0  misc/src/cli.rs
f18205c6a9877b2e6cb757cfeb266dfc  misc/src/test_tools.rs
9c8a8227ccef3ec678df0723e7621bd8  misc/src/op.rs
74d1290949aca1cd5bc4d3b4128ae99d  misc/src/prelude.rs
b330c35e6816a7895e0d202458d591c0  misc/src/behavior.rs
799a951d84acaad174313a340c730dc6  misc/src/lib.rs
5d6c6c5d29506c037eecc4611afb18ec  misc/src/main.rs
f1bbacd456d6e7695ed60d7c0d6d1901  misc/src/logger.rs

At each step, the interface is a stream of archives passing from one stage of processing to the next. That's the input and output stream format of dirtabase Operators. The cache can actually pick back up after uncacheable steps, because each archive has a full hash of its contents - we can recognize when we've stumbled back into familiar territory.
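
For illustration, here's a minimal sketch of that shape in Rust. Every name in it (Operator, Digest, apply, cacheable) is a hypothetical stand-in, not the real interface in src/op.rs:

// Each archive carries a full content hash; a stream position is just the
// list of digests produced so far. (Hypothetical types for illustration.)
type Digest = [u8; 32];

trait Operator {
    // Consume the stream of archives produced so far, yield the new stream.
    fn apply(&self, input: Vec<Digest>) -> std::io::Result<Vec<Digest>>;

    // Whether (operator, input digests) -> output digests can be served from
    // cache. Even after a non-cacheable step, downstream ops can hit the cache
    // again, because the outputs may land on already-known digests.
    fn cacheable(&self) -> bool;
}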

The biggest missing pieces at this point are sandboxed (pure) commands, and Layover building on top of this tech.

Contributing

This repo is equipped to use devenv.sh, which is pretty easy to get set up. It also integrates nicely with direnv.

# These commands should work after setup!
devenv test
devenv shell

I'm also going to set up building this as a Nix package/flake later.

dirtabase's Issues

Add the CommandImpure op

This is, of course, exposed as --cmd-impure. Support for --cmd is a few significant steps down the road, but --cmd-impure shouldn't be unreasonable to implement: create a tempdir, export the input archive to it, run the given command there, and import the directory back after whatever changes it made.
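
Here's a rough sketch of those steps. Store, Digest, export_to, and import_from are hypothetical placeholders for the real storage API, and the tempfile crate is an assumption:

use std::io;
use std::path::Path;
use std::process::Command;

// Hypothetical scaffolding so the sketch stands alone:
struct Store;
type Digest = String;
fn export_to(_store: &Store, _digest: &Digest, _dir: &Path) -> io::Result<()> { todo!() }
fn import_from(_store: &Store, _dir: &Path) -> io::Result<Digest> { todo!() }

fn cmd_impure(store: &Store, input: &Digest, cmd: &str) -> io::Result<Digest> {
    let dir = tempfile::tempdir()?;               // 1. create a tempdir
    export_to(store, input, dir.path())?;         // 2. export the input archive into it
    let status = Command::new("sh")               // 3. run the given command there
        .arg("-c")
        .arg(cmd)
        .current_dir(dir.path())
        .status()?;
    if !status.success() {
        return Err(io::Error::new(io::ErrorKind::Other, "command failed"));
    }
    import_from(store, dir.path())                // 4. import the changed directory
}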

Detrait Storage

Very ironic after how proud I was of it, but the abstraction is still premature and getting in the way of two tickets: #12 and more importantly #2.

Move archive stream tech under `crate::stream::archive`

Right now these live in a non-standard, oddball location. They should be consolidated under the high-level domain of streaming, and the Archive module should be more humble and low-level, focused on in-memory representation and manipulation.

Next generation of Archive in-memory representation

There's a better approach here that solves a lot of problems at once. An Archive is probably best represented as something of an SOA thing, where we uphold a few internal invariants and allow the file content type to be a generic param. Here's a sketch:

type Path = ...; // Will probably use unix_path crate
struct Entry {
    pub path: Path,
    pub attrs: Attrs,
}

// Cannot be constructed without meeting invariants internally:
//  1. No duplicate paths between self.dirs and self.files
//  2. self.dirs and self.files are each sorted by path
//  3. self.contents[n] corresponds to self.files[n]
//
// These aren't actually hard to guarantee, but they let us chain transformations on the content
// representation, which allows incredibly efficient and safe import/export from a simple DB,
// as well as &'static str use for testing.
struct Archive<C> {
    dirs: Vec<Entry>,
    files: Vec<Entry>,
    contents: Vec<C>,
}

impl<C> Archive<C> {
    // NB: completely reuses existing dirs/files memory without harming it.
    fn map<T, E>(self, mapper: impl Fn((&Entry, C)) -> Result<T, E>) -> Result<Archive<T>, E> {
        let contents = self.files.iter()
            .zip(self.contents)
            .map(mapper)
            .collect::<Result<Vec<T>, E>>()?;
        Ok(Archive { dirs: self.dirs, files: self.files, contents })
    }

    // Same thing but Rayon-enabled (faster, but only applicable sometimes)
    fn par_map<T, E>(self, ...) { todo!() }
}

// Helper format for import/export
pub enum StreamEntry<C> {
  Dir(Path, Attrs),
  File(Path, Attrs, C),
}

// Fast import (starts with up-front sorting in From impl)
let os_src: PathBuf = "./fixture".into();
Archive::<&Path>::from([
        // Would actually be gathered from walking dirs and grabbing metadata
        StreamEntry::Dir("/xyz", at!{}),
        StreamEntry::File("/xyz/hello.txt", at!{}, &os_src),
        StreamEntry::Dir("/xyz/123", at!{}),
    ])
    .map(|(entry, src)| {
        let file_dest = tempfile(); // in store
        // Near instant and skips userspace a lot of the time
        std::fs::copy(src.join(&entry.path), &file_dest)?;
        Ok(file_dest)
    })?
    .par_map(|(entry, tmp)| {
        // Tempfile is now a copy in the store and safely out of harm's way,
        // which we can mmap and hash in a threadpool - more CPU-limited than IO-limited
        let m = memmap::mmap(tmp);
        let digest: Digest = m.into();
        Ok((digest, tmp))
    })?
    .map(|(entry, (digest, tmp))| {
        std::fs::rename(...)?;
        Ok(Triad(TriadFormat::File, Compression::Plain, digest))
    })

There are a lot of reasons this kicks ass, and the looser coupling of the generic type is one of them. But when I talk about efficiency, I'm talking about being able to break complex dances into phases that are easier to reason about. I don't think we need that level of tuning in the short term, but as a final product, dirtabase needs to be competitive with products that break down work in parallel ways, without taking an excessive amount of physical memory to do in-memory operations. That complexity has to exist somewhere. Map chains let us wrangle that complexity.

GZip and XZ compression support

This will require changes to a couple parts of the code and probably inspire some appropriate convenience abstractions. It should also be postponed until more basic functionality is achieved. Still, there's no doubt it's going to be an important feature in the final product.

Prefix op

This could be implemented more universally as a "Replace" regex op. The intent of offering a dedicated Prefix op is that there are some optimizations you can do when you accept the limitation of only changing the starting bytes. But for a start, we can offer a Replace op and have Prefix call into the same code, reserving space for optimization later. That seems like the best call for now rather than prematurely optimizing.
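
Here's a sketch of that shared core using the regex crate; the function names and the archive-paths-as-strings representation are illustrative assumptions:

use regex::Regex;

// The general Replace op: rewrite the first match of `pattern` in each path.
fn replace_paths(paths: Vec<String>, pattern: &str, replacement: &str) -> Vec<String> {
    let re = Regex::new(pattern).expect("invalid pattern");
    paths
        .into_iter()
        .map(|p| re.replace(&p, replacement).into_owned())
        .collect()
}

// Prefix is just Replace with an anchored, empty match at the start of each
// path - exactly the case a starts-with fast path could optimize later.
fn prefix_paths(paths: Vec<String>, prefix: &str) -> Vec<String> {
    replace_paths(paths, "^", format!("{}/", prefix).as_str())
}

For example, prefix_paths(vec!["fixture/file_at_root.txt".into()], "misc") yields "misc/fixture/file_at_root.txt", matching the demo output above.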

URL parsing

I think this needs to happen very separately from actually making use of URLs properly. Existing commands should assert that URLs refer to local files, and fail outside those circumstances, as an intermediate stage of development while we figure out how to act on non-local-dir URLs. But we should have a well-tested suite of URL parsing according to the established rules, which can be used most immediately for the label system.
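
A minimal sketch of that intermediate stage, assuming the url crate (the established parsing rules may differ):

use std::path::PathBuf;
use url::Url;

// Parse eagerly, but only accept URLs that name local files for now.
fn require_local(raw: &str) -> Result<PathBuf, String> {
    let parsed = Url::parse(raw).map_err(|e| format!("bad URL {raw:?}: {e}"))?;
    if parsed.scheme() != "file" {
        return Err(format!("non-local URL not supported yet: {raw}"));
    }
    parsed
        .to_file_path()
        .map_err(|_| format!("file URL has no local path: {raw}"))
}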

Archive normalization

This is a precursor to supporting the Merge op. It'll require really thinking about the storage model of Archives, and what forms are truly equal to each other.
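
As a sketch of one plausible normal form (an assumption, not the settled design), with entries as hypothetical (path, digest) pairs: sort by path and let the last entry win on duplicate paths, so two archives are equal exactly when their normalized entry lists are equal:

use std::collections::BTreeMap;

fn normalize(entries: Vec<(String, String)>) -> Vec<(String, String)> {
    // Collecting into a BTreeMap makes later duplicates overwrite earlier ones
    // ("last wins") and yields entries back in sorted path order.
    entries.into_iter().collect::<BTreeMap<_, _>>().into_iter().collect()
}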

Build caching

Some sort of data storage, with the particulars intentionally encapsulated by the backend, for recording that a particular pure transformation on archive X will result in archive Y. This allows for rapid fast-forwarding through build steps where we already know what the outcome must be.

Some build steps are not, on their own, pure - for example, importing and exporting (and cmd-impure). But the fact that impure build steps produce archives (which are immutable going forward and might happen to be referenced in the cache) allows the cache mechanism to pick back up after impure steps. It will be a while before cmd (the pure variety, happening in a sandbox) is properly designed and implemented, but in the meantime we should be able to get pretty good build performance with cmd-impure, thanks to these cache-resume properties.
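
A sketch of the cache contract this implies - the types are hypothetical, and a real backend could be anything from a HashMap to an on-disk index:

use std::collections::HashMap;

// Key: a pure operation plus the digests of the archives it consumed.
#[derive(Hash, PartialEq, Eq)]
struct CacheKey {
    op: String,          // e.g. "prefix misc"
    inputs: Vec<String>, // input archive digests
}

#[derive(Default)]
struct BuildCache {
    // Value: the digests of the archives the operation produced.
    hits: HashMap<CacheKey, Vec<String>>,
}

impl BuildCache {
    fn lookup(&self, key: &CacheKey) -> Option<&Vec<String>> {
        self.hits.get(key)
    }
    fn record(&mut self, key: CacheKey, outputs: Vec<String>) {
        self.hits.insert(key, outputs);
    }
}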

Merge op

Append together multiple archives in sequence, then normalize the concatenation.
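
Reusing the hypothetical normalize sketch from the normalization issue above, Merge is then little more than concatenation, with later archives winning on colliding paths:

// Concatenate entry lists in argument order, then normalize the result.
fn merge(archives: Vec<Vec<(String, String)>>) -> Vec<(String, String)> {
    normalize(archives.into_iter().flatten().collect())
}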
