I've been experimenting with using non-reproducible systems in build systems like baze

[Feature request] Support custom caching for non-reproducible actions? (buck2, open issue)

silvergasp commented on April 28, 2024

[Feature request] Support custom caching for non-reproducible actions?

from buck2.

Comments (4)

cbarrete commented on April 28, 2024

Is this actually a build problem? Shouldn't your implementation in fuzz.cpp e.g. take the current time and use that as a seed (or even better, take it via a command line flag or environment variable so that it's actually reproducible)? You could quantize that time if you need reproducible execution over a given period of time.

It seems to me that the part that you really care about is not the buck2 build part, but rather the buck2 run one, which isn't cached at all.
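The seed suggestion above could be sketched roughly as follows. This is only an illustration in Python rather than the C++ of fuzz.cpp; the function name `pick_seed` and the environment variable `FUZZ_SEED` are made up for the example and are not part of buck2 or any fuzzing library.

```python
import os
import time


def pick_seed(env=None, now=None, quantum_seconds=3600):
    """Return a fuzzing seed: an explicit override first, else quantized time.

    FUZZ_SEED is a hypothetical variable name chosen for this sketch.
    """
    env = os.environ if env is None else env
    override = env.get("FUZZ_SEED")
    if override is not None:
        # An explicit seed makes a run exactly reproducible.
        return int(override)
    now = time.time() if now is None else now
    # Round down to the start of the current window, so every run inside
    # the same window sees the same seed.
    return int(now // quantum_seconds) * quantum_seconds
```

With the default one-hour quantum, two runs at 01:00:00 and 01:59:59 would share a seed, and the seed changes at 02:00:00.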


silvergasp commented on April 28, 2024

> Is this actually a build problem? Shouldn't your implementation in fuzz.cpp e.g. take the current time and use that as a seed (or even better, take it via a command line flag or environment variable so that it's actually reproducible)? You could quantize that time if you need reproducible execution over a given period of time.

It may not be a build problem so much as a general automation problem, though that might just come down to semantics. I'd be quite happy for this to look something like:

buck2 run_pipeline //:some_pipeline

It's true that you could use a consistent seed for fuzzing and just fuzz for n iterations, and you would get reproducible outputs in that case. However, there are still use cases where it's nice to have a full execution graph (which buck2 provides via DICE) where each node in the execution graph is not necessarily reproducible. A more concrete (though still toy) example of a non-reproducible execution graph might include:

Use 5 different scanners to detect misconfigurations in a website, each outputting its own JSON file.
Take those JSON files and convert them into a markdown file.
Take the markdown file and convert it to a static HTML page for developers to view.

But let's say that the scanners take 40 minutes to run and don't need to be run all that often. Having some caching involved would be great, but you wouldn't want the results to be cached indefinitely (which would be the case with buck2). This execution graph is also non-reproducible by definition, because the website being scanned is outside of your control.

Buck2 solves 90% of this automation problem by handling execution graphs, remote execution and caching etc. I'm aware that this doesn't necessarily fit with the primary goal of buck2 being a build system. But it has enough overlap for me to find it interesting as a generalised declarative automation framework. Does this sound too far out of left field for buck2? I'm aware that this is kind of build-system adjacent.

> It seems to me that the part that you really care about is not the buck2 build part, but rather the buck2 run one, which isn't cached at all.

This is sort of true, although I think what I'm hoping for is something like buck2 run but re-using some of the execution graph semantics in the runtime space.


JakobDegen commented on April 28, 2024

> Never cache

Yeah, so we've talked about adding support for this kind of a thing before, primarily under the name "volatile actions." I think the hypothetical API is that when you call ctx.actions.run, you can specify volatile = True and then your action will get rerun on every command. Obviously this would need to be used with care.

The use-case that we had in mind at the time is better integration with system toolchains; for example, maybe you want to invalidate all your rust library builds when you upgrade your rustc version. You could define a volatile action that prints the rustc version into a file, and then add that as a never-read input to every rustc action.
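As a rough illustration of what such a version-printing step might do when it runs (this is plain Python, not the hypothetical buck2 volatile-action API, and `write_toolchain_stamp` is a name invented for this sketch):

```python
import subprocess
from pathlib import Path


def write_toolchain_stamp(stamp_path, version_cmd=("rustc", "--version")):
    """Run the version command and store its output in a stamp file.

    The stamp's content changes only when the toolchain itself changes,
    so anything that lists the stamp as an input is invalidated exactly
    when the compiler is upgraded.
    """
    result = subprocess.run(
        list(version_cmd), check=True, capture_output=True, text=True
    )
    version = result.stdout.strip()
    Path(stamp_path).write_text(version)
    return version
```

The command is a parameter so the same sketch could stamp other system tools; the default assumes `rustc` is on `PATH`.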

I think the vibe on volatile actions is basically positive. Just needs someone to go and write some code I think.

> Cache expires after N seconds

> Cache expires on cron schedule

These two seem like they could be implemented on top of the first one. You can have a volatile action that prints the current timestamp / 3600 to a file, and then depend on that file from every other action - at the top of the hour, the contents of that file will change and your actions get invalidated.
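A small sketch of why this works, assuming a simplified model in which an action's cache key is a hash of its command, its input digests, and the hour-bucket file's content. The helpers `time_bucket` and `action_key` are invented for the example and do not reflect how buck2 computes action digests.

```python
import hashlib


def time_bucket(now, period_seconds=3600):
    """Content of the volatile file: the current timestamp divided by the period."""
    return int(now // period_seconds)


def action_key(command, input_digests, bucket):
    """Toy cache key: hash of the command, the sorted inputs, and the bucket."""
    h = hashlib.sha256()
    for part in [command, *sorted(input_digests), str(bucket)]:
        h.update(part.encode())
        h.update(b"\0")  # separator so adjacent parts cannot run together
    return h.hexdigest()
```

Within one hour the key is stable, so cached results are reused; once the bucket value changes, the key changes and the action reruns. Note that a result produced at second 3599 is invalidated at second 3600, so the lifetime of a cached result is at most one hour, not exactly one hour.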

I suppose that's not exactly the same as "expire after 1 hour," but it's pretty close. If you don't care about RE (remote execution), then you can actually modify this scheme to use incremental actions and get exactly those semantics: have an action that writes the current timestamp to its output only if it's been more than 1 hour since the timestamp currently written there.
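The incremental variant might behave something like the sketch below: the step can see its own previous output and only rewrites it once the stored timestamp is too old. This is plain Python with an invented helper name, `refresh_stamp`, and says nothing about how buck2 incremental actions are actually declared.

```python
def refresh_stamp(path, now, ttl_seconds=3600):
    """Rewrite the stamp only when the stored timestamp is older than the TTL.

    Returns the timestamp that is in the file after the call.
    """
    try:
        with open(path) as f:
            stored = float(f.read())
    except (FileNotFoundError, ValueError):
        stored = None  # no previous output, or unreadable content
    if stored is not None and now - stored <= ttl_seconds:
        # Still fresh: leave the output untouched so dependents stay cached.
        return stored
    with open(path, "w") as f:
        f.write(repr(float(now)))
    return float(now)
```

Because the stamp changes only when more than `ttl_seconds` have passed since it was last written, dependents expire one hour after their last run rather than at the top of the hour.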

> Lazy cache evaluation i.e. immediately return cached artifact and then update it next time it's used.

This one I'm a bit more hesitant on. My concern though isn't around the caching, but rather around the action execution management. Action executions currently are clearly tied to the lifetime of a single command, i.e. they are executed as part of that command, need to finish before the command can finish, and are cancelled if the command is cancelled. What you're suggesting seems like it would be a deviation from that, which I think is probably hard to do correctly, both in principle and in practice.


thoughtpolice commented on April 28, 2024

I've thought about the fuzzing thing a number of times, and I sort of came to the conclusion that you probably want to fix the seeds in your fuzzing tests and try to have a reasonable amount of them if you expect them to run under buck2 test or whatnot. Actual major-scale runs of fuzzing, e.g. with ClusterFuzz, should probably be done by deploying some other kind of artifact (e.g. an OCI image to be deployed and probed).

But "Volatile actions" are also really useful for a lot of other random things where a program may need to invoke some kind of ambient side effect on the system, which can actually be used to improve the precision of dependency tracking. When combined with early cut-off, a lot of the time they aren't so bad, like this example:

> You could define a volatile action that prints the rustc version into a file, and then add that as a never-read input to every rustc action.

This is actually a great example that I used to do all the time when using Shake (through a feature called "Oracles.") I think it's really important for some cases. For example, let's say a user builds a project with CC=gcc as the compiler, it's just picked up off $PATH. Then they do a global system upgrade to their whole system, getting a new C compiler. If the user then enters the project and tries to build, it won't rebuild anything, because nothing seems to have changed; the build system can't track anything more than the fact it invokes "$CC" to compile objects, and as far as it can tell that command still exists just fine (probably /usr/bin/gcc, so even the path doesn't tell you anything), so there's nothing left to do.
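A toy model of the oracle idea combined with early cut-off, assuming a single always-rerun oracle feeding one expensive step. The class `OracleBuild` is invented for this sketch and is not Shake's or buck2's API.

```python
class OracleBuild:
    """Toy model: one always-rerun oracle feeding one cached compile step."""

    def __init__(self, oracle, compile_step):
        self.oracle = oracle              # cheap, re-run on every build
        self.compile_step = compile_step  # expensive, re-run only when needed
        self.last_oracle_value = None
        self.has_output = False
        self.cached_output = None
        self.compile_runs = 0

    def build(self):
        value = self.oracle()  # e.g. the output of `$CC --version`
        if not self.has_output or value != self.last_oracle_value:
            # The ambient state changed (or this is the first build): recompile.
            self.cached_output = self.compile_step(value)
            self.has_output = True
            self.last_oracle_value = value
            self.compile_runs += 1
        # Otherwise: early cut-off. The oracle ran, but its value is
        # unchanged, so the cached output is still valid.
        return self.cached_output
```

The oracle runs on every build, but the expensive step only reruns when the observed value differs from the one recorded last time, which is what keeps always-rerun nodes cheap in practice.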

In C or C++, this kind of mistake isn't so bad, because they have de-facto stabilized ABIs. This exact case can happen today in Buck2 with system_rust_toolchain and system_cxx_toolchain; but in the case of Rust, this error could cause catastrophic and hard-to-understand build failures. You can upgrade rustc, add a new library, run buck2 build, and now you might end up with rlibs that were cached and compiled previously with an old compiler, and rlibs that were compiled freshly with the upgraded compiler, and you will be lucky if the linker just explodes on them.
