Comments (9)
Related to #53 from gitoxide.
In pursuit of better control over pack generation, and to pave the way for improved async integration, I figured an `Iterator` interface would be a good idea. Now it's possible to step through parallel computations.
However, the respective implementation has to expose `unsafe` due to the use of a scoped thread, whose join handle is exposed and can thus be leaked. Depending on where this is exposed, `unsafe` might bubble up even further - after all, anything that holds the `SteppedReduce` can also leak it.
My intuition is to stop bubbling this up beyond `git-features` just to keep it practical, even though technically that's incorrect. What do you think, @joshtriplett?
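For illustration, here is a minimal sketch of the stepping idea: worker threads push partial results into a bounded channel, and the consumer steps through them with a plain `Iterator`. It uses ordinary `'static` threads rather than the scoped threads discussed above, and `stepped_computation` is a made-up name, not `SteppedReduce`'s actual API.

```rust
use std::sync::mpsc::sync_channel;
use std::thread;

// Hypothetical sketch: each worker reduces one chunk and sends its partial
// result; the consumer steps through results one .next() at a time.
fn stepped_computation(chunks: Vec<Vec<u32>>) -> impl Iterator<Item = u32> {
    let (tx, rx) = sync_channel(4); // bounded: a full channel blocks the workers
    for chunk in chunks {
        let tx = tx.clone();
        thread::spawn(move || {
            let _ = tx.send(chunk.iter().sum::<u32>());
        });
    }
    // The original `tx` is dropped here, so the iterator ends once all
    // workers have delivered their result.
    rx.into_iter()
}

fn main() {
    let mut sums: Vec<u32> = stepped_computation(vec![vec![1, 2], vec![3, 4, 5]]).collect();
    sums.sort();
    println!("{:?}", sums); // prints "[3, 12]"
}
```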
It's a CPU-intensive operation; my first instinct would be to run it normally and use unblock or similar to run it on a blocking thread.
Trying to structure the computation so that it happens incrementally seems incredibly painful. And in particular, trying to adapt an operation that happens in a thread to happen incrementally seems like it's incurring all the pain of async without any language support for async.
I would suggest building the initial MVP in a synchronous fashion, on the theory that it can still be run in a background thread and controlled via an async mechanism.
I definitely don't think it's OK to use a scoped thread and hide the unsafe, if the unsafe isn't truly encapsulated to the point that you can't do anything unsound with the interface.
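A minimal sketch of that suggestion, using a std thread and channel as a stand-in for `unblock`/`blocking` (the function name and closure are made up for illustration):

```rust
use std::sync::mpsc;
use std::thread;

// Run a CPU-bound, synchronous operation on a background thread and return
// a receiver the caller (sync, or async via an adapter) can wait on.
// `op` stands in for the actual synchronous pack-generation routine.
fn run_in_background<T, F>(op: F) -> mpsc::Receiver<T>
where
    T: Send + 'static,
    F: FnOnce() -> T + Send + 'static,
{
    let (tx, rx) = mpsc::channel();
    thread::spawn(move || {
        let _ = tx.send(op());
    });
    rx
}

fn main() {
    // Pretend this closure builds a pack; here it just computes a number.
    let rx = run_in_background(|| (1..=10u64).sum::<u64>());
    println!("result: {}", rx.recv().unwrap()); // prints "result: 55"
}
```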
One other thought on that front: compared to the cost of generating a pack, one or two allocations to set up things like channels or Arc will not be an issue.
Thanks for sharing. The main motivator for using scoped threads is to allow standard stack-based operation without any wrapping - it's not at all about allocations, merely about usability and the least surprising behaviour. Truth be told, I cannot currently imagine how traversal will play into 'static threads when `Arc`s are involved, especially along with traits representing an Object (an attempt to allow things like traversing directory trees).
What I take away is the following:
- let's not hide `unsafe` unless it's encapsulated
- let's not make step-wise computation or extreme async friendliness a requirement for the MVP if something like `blocking` would work, too
I hope to overcome my writer's block and just write the missing bits, so I can see through the whole operation and play with the parts more until I find a version of the API that feels right.
The `unsafe` is now legitimately gone thanks to the use of standard 'static threads. I could assure myself that the mechanism still works even with `Arc`s involved, despite being a little more difficult to use at the call site. Callers will need to prepare a little more to start the procedure, which is probably acceptable given how long it runs and how 'important' it is.
> let's not make step-wise computation or extreme async friendliness a requirement for the MVP if something like `blocking` would work, too.
This capability probably doesn't have to be removed just yet, as the machinery itself is exactly the same as the one already used in `in_parallel()`, except that now there is more control at the call site. This comes at the cost of having to deal with `Arc` for the object database, and of course the API now has yet another way to call it. Those who don't need fine-grained control won't get the best experience that way.
However, it's possible to eventually provide a non-static variant of pack generation too, which would work similarly to pack verification (it uses the non-static version of the machinery), by factoring out the parts that are similar.
Another argument for trying hard to make pack generation play well in an async context is certainly that it commonly happens as part of network interactions, like uploading a pack. Right now much of this is a little hypothetical, as actual code to prove it works nicely doesn't exist yet, but I am confident it will work as envisioned.
Finally, since both machineries, static and non-static, are the same at their core, it should always be possible to return to the non-static one at very low cost should everything else fail.
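To illustrate what the `Arc` requirement means at the call site, here is a hedged sketch with a `HashMap` standing in for the object database; `ObjectDb` and `count_known_objects` are hypothetical names, not gix's actual types.

```rust
use std::collections::HashMap;
use std::sync::{mpsc, Arc};
use std::thread;

// Hypothetical stand-in for the object database shared with 'static workers.
type ObjectDb = HashMap<String, Vec<u8>>;

// With 'static threads, workers cannot borrow the database, so the call site
// has to wrap it in an `Arc` and hand each worker its own clone.
fn count_known_objects(db: Arc<ObjectDb>, ids: Vec<String>, workers: usize) -> usize {
    let (tx, rx) = mpsc::channel();
    let chunk_size = ((ids.len() + workers.max(1) - 1) / workers.max(1)).max(1);
    for chunk in ids.chunks(chunk_size) {
        let db = Arc::clone(&db); // one clone per worker: cheap, but visible to callers
        let chunk = chunk.to_vec();
        let tx = tx.clone();
        thread::spawn(move || {
            let _ = tx.send(chunk.iter().filter(|id| db.contains_key(*id)).count());
        });
    }
    drop(tx); // close the channel so the sum below terminates
    rx.into_iter().sum()
}

fn main() {
    let db = Arc::new(ObjectDb::from([("a".to_string(), b"blob".to_vec())]));
    let n = count_known_objects(db, vec!["a".into(), "missing".into()], 2);
    println!("{}", n); // prints "1"
}
```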
On another note: I am also thinking about backpressure and back-communication. Backpressure is already present, as threads will block once the results channel is full. Back-communication should also be possible if the handed-in closures get access to another synchronized channel of sorts that tells them when to deliver the pack entries they have been working on. Such an algorithm would work continuously (probably until it can't meaningfully improve the deltas) until it is told to deliver what's there right now before continuing. Such a message could then be sent the moment somebody actually calls `.next()` on the iterator, which in turn will be based on how fast data can be written to the output (sync or async).
Even though the MVP will not do back-communication, I don't see why it shouldn't be possible to implement it. What's neat is that no matter how the machinery operates, the moment the iterator is dropped it will stop working automatically (potentially with some delay).
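A sketch of how such back-communication could look, with a bounded results channel for backpressure and a control channel driven by `.next()`. All names here are made up for illustration; this is not the actual machinery.

```rust
use std::sync::mpsc::{channel, sync_channel, Receiver, Sender};
use std::thread;

enum Control {
    Deliver,
}

struct SteppedResults {
    results: Receiver<u64>,
    control: Sender<Control>,
}

impl Iterator for SteppedResults {
    type Item = u64;
    fn next(&mut self) -> Option<u64> {
        // Back-communication: ask the worker to hand over what it has now…
        self.control.send(Control::Deliver).ok()?;
        // …and wait for it. The bounded results channel provides backpressure.
        self.results.recv().ok()
    }
}

fn spawn_stepped_worker() -> SteppedResults {
    let (result_tx, results) = sync_channel::<u64>(1);
    let (control, control_rx) = channel();
    thread::spawn(move || {
        let mut quality = 0u64;
        // recv() errors once the iterator (and with it `control`) is dropped,
        // so the worker stops automatically, just as described above.
        while let Ok(Control::Deliver) = control_rx.recv() {
            quality += 1; // stand-in for another round of delta improvement
            if result_tx.send(quality).is_err() {
                return;
            }
        }
    });
    SteppedResults { results, control }
}

fn main() {
    let first_three: Vec<u64> = spawn_stepped_worker().take(3).collect();
    println!("{:?}", first_three); // prints "[1, 2, 3]"
}
```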
The first breakthrough: pack files (base objects only) can now be written from object ids.
About opportunities for performance improvements
@pascalkuthe I have created a quick profile from running `cargo build --release --no-default-features --features max,cache-efficiency-debug --bin gix && /usr/bin/time -lp ./target/release/gix -v free pack create -r ../../../git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux --statistics --thin -e tree-traversal --pack-cache-size-mb 200 --object-cache-size-mb 100 HEAD` (single-threaded), and here is the result:
My takeaways are as follows:
- indeed, in single-threaded mode the hashset performance doesn't seem to be the issue
- a little less than half the time of the counting phase is spent getting objects from the object database
- a lot of time is spent parsing trees, and `memchr` seems particularly hot. It's all about finding a null-byte here, and I wonder if this can be any faster.
- when digging deeper, it shows that many of these ~5s buckets end up spending most of their time in the object database
- and we have 3 seconds in `hashbrown::set::HashSet::insert()`
This is just a quick summary, and right now I am missing a dataset comparing git with `gix` across various repos of different sizes to understand the size of the performance gap in single-threaded mode. From there it might be possible to figure out what to focus on.
While at it, profiling `git` might also be useful, which (I think) I did in the past as well. Unfortunately my memory (as well as my notes about this here) is spotty.
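To make the `memchr` observation concrete, here is a sketch of parsing a single tree entry. Its wire format is `<octal mode> <name>\0<20-byte object id>`; std's `position()` below is a slower stand-in for `memchr`, and this is not gitoxide's actual parser.

```rust
// Parse one git tree entry: "<octal mode> <name>\0<20-byte object id>".
// Returns (mode, name, oid, remaining bytes) on success.
fn parse_tree_entry(data: &[u8]) -> Option<(u32, &[u8], &[u8], &[u8])> {
    let space = data.iter().position(|&b| b == b' ')?;
    let mode = u32::from_str_radix(std::str::from_utf8(&data[..space]).ok()?, 8).ok()?;
    let rest = &data[space + 1..];
    // Finding this NUL is the hot spot from the profile; gitoxide uses memchr here.
    let nul = rest.iter().position(|&b| b == 0)?;
    let name = &rest[..nul];
    let oid = rest.get(nul + 1..nul + 21)?; // raw (binary) SHA-1
    Some((mode, name, oid, &rest[nul + 21..]))
}

fn main() {
    let mut entry = b"100644 README.md\0".to_vec();
    entry.extend_from_slice(&[0xab; 20]); // fake object id
    let (mode, name, _oid, rest) = parse_tree_entry(&entry).unwrap();
    assert_eq!(mode, 0o100644);
    assert_eq!(name, &b"README.md"[..]);
    assert!(rest.is_empty());
    println!("parsed mode {:o} for {:?}", mode, String::from_utf8_lossy(name));
}
```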