Comments (11)
It appears that the mirrored blobstore configuration does not provide highly available deployments, although it's not clear to me why.
It's because clients such as Bazel are fairly picky about how accurate calls like FindMissingBlobs() are. If you're using features like --remote_download_minimal, then you cannot lose any blobs during the lifetime of the build. If a client calls GetActionResult(), then the objects referenced by the ActionResult must remain present for the remainder of the build.
If you were to have a best effort MirroredBlobAccess, then there is no way you can ensure this constraint is met. With small amounts of network flakiness you may run into situations where a blob only gets written to storage backend A, but a subsequent FindMissingBlobs() is sent to backend B. By enforcing that Write() and FindMissingBlobs() always go to both halves, we ensure that replies remain consistent.
Please refer to this section of ADR#2, which covers this topic in detail:
https://github.com/buildbarn/bb-adrs/blob/master/0002-storage.md#storage-requirements-of-bazel-and-buildbarn
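For reference, a mirrored CAS configuration along these lines might look roughly as follows. This is a hypothetical sketch: the backend addresses are placeholders, replicator settings are omitted, and the exact field names should be checked against the blobstore configuration proto of the bb-storage version you run:

```jsonnet
// Hypothetical sketch of a mirrored CAS configuration. Addresses are
// placeholders; verify field names against your bb-storage version.
blobstore: {
  contentAddressableStorage: {
    mirrored: {
      backendA: { grpc: { address: 'storage-a:8981' } },
      backendB: { grpc: { address: 'storage-b:8981' } },
      // Replicator settings omitted for brevity.
    },
  },
},
```

With this in place, Write() and FindMissingBlobs() are fanned out to both halves, keeping the two backends' replies consistent.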
My goal is to be able to upgrade storage nodes with zero downtime. Without backing from an external store (e.g. S3), I am not seeing an option that would enable zero downtime storage deployments.
Zero downtime? Sure, that's not possible with the current set of tools. Brief downtime? That is most certainly possible with MirroredBlobAccess. And brief downtime is good enough, because you can just set --remote_retries on the client side, so that the build stalls for a bit while the upgrade takes place. You want to pass in that option regardless, as network flakiness may also occur in the path leading up to the build cluster.
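On the client side, this might look like the following .bazelrc fragment (the value is illustrative; pick one appropriate for your maintenance windows):

```
# Illustrative .bazelrc sketch: retry remote calls so that brief storage
# downtime stalls the build instead of failing it.
build --remote_retries=60
```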
I've been operating a setup for the last two years where I alternate rollouts between the A and B halves. Week n, I upgrade the A nodes. Week n+1, I upgrade the B nodes. MirroredBlobAccess has been carefully designed, so that it's fine with one half losing its data. This means that storage nodes don't even need to be persistent. Every time I do an upgrade, I just replace half of the nodes with empty ones and let replication take care of the rest. The last year or so I even stopped doing maintenance announcements, because I know we can do it without people noticing it.
This approach also deals well with unexpected node failures. I just rely on Kubernetes to quickly spin up a replacement pod on some spare node in case node/pod health checks fail. I think that over the last couple of years availability has been around 99.97% (including maintenance), which I think is "HA enough" for a build cluster.
One concern people also have is the cost of LocalBlobAccess on EC2, compared to using an S3 bucket. Sure, it's more expensive per byte, but that's only a fraction of the costs of a Buildbarn cluster. An is4gen.8xlarge storage node would be capable of holding 30 TB of data. It is cheaper than a single r5.24xlarge worker node, and you would only need a few of them relative to the number of worker nodes. Furthermore, because LocalBlobAccess has (in my experience) a far lower latency than S3, you end up making better use of the worker nodes' time. The cost is therefore negligible.
from bb-storage.
Also, I did end up creating bazelbuild/bazel#16058 to add --remote_retry_max_delay. We'll see if we can get it reviewed and landed. :)
Yes I had read that ADR, thank you for writing it.
When churning "A" half of the mirror with empty instances, don't you also run into situations where "A" had received a blob but perhaps "B" did not due to flake? In this setup, we do rely on each half being correct and complete.
If we are implicitly relying on each node having a complete set, we might as well allow clients to continue to receive blobs from storage when one side of the mirror is down. We can still rely on replication to fill in the blobs that "A" missed while it was offline.
Of course, you should not churn both sides of storage without allowing considerable time in between for natural replication, like the week you mentioned.
When churning "A" half of the mirror with empty instances, don't you also run into situations where "A" had received a blob but perhaps "B" did not due to flake? In this setup, we do rely on each half being correct and complete.
Every time FindMissingBlobs() is invoked, it is forwarded to both backends. MirroredBlobAccess compares the responses from both backends, and does the following:
- If a blob is present in A but not in B, it is synced from A to B.
- If a blob is present in B but not in A, it is synced from B to A.
- Only if a blob is absent from both A and B is it reported back to the caller as being absent.
All of this is done in a blocking manner. This means that by the time the caller gets the results, it has the guarantee that data is stored on both replicas.
(Note that GetActionResult() against the AC performs a FindMissingBlobs() against the CAS, using CompletenessCheckingBlobAccess, thereby guaranteeing that objects referenced from AC entries are also present on both replicas.)
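The merge logic described above can be sketched as follows. This is an illustrative Go sketch, not bb-storage's actual implementation; findMissingMirrored and its set-based signature are made up for clarity:

```go
package main

import "fmt"

// findMissingMirrored sketches how mirrored FindMissingBlobs() results
// are merged. missingA/missingB hold the digests that each backend
// reports as absent.
func findMissingMirrored(missingA, missingB map[string]bool) (syncAToB, syncBToA, missingBoth []string) {
	for d := range missingB {
		if !missingA[d] {
			// Present in A, absent in B: replicate A -> B before replying.
			syncAToB = append(syncAToB, d)
		} else {
			// Absent from both halves: only these reach the caller.
			missingBoth = append(missingBoth, d)
		}
	}
	for d := range missingA {
		if !missingB[d] {
			// Present in B, absent in A: replicate B -> A.
			syncBToA = append(syncBToA, d)
		}
	}
	return
}

func main() {
	a := map[string]bool{"x": true, "z": true} // backend A is missing x and z
	b := map[string]bool{"y": true, "z": true} // backend B is missing y and z
	toB, toA, missing := findMissingMirrored(a, b)
	fmt.Println(toB, toA, missing) // [y] [x] [z]
}
```

Only the intersection of the two "missing" sets is reported to the caller; everything else is replicated synchronously first, so the reply is only sent once both halves agree.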
Oh I see, I was under the impression that replication could occur asynchronously, perhaps by the bb_replicator service.
Do you have anecdotal evidence to support a suitable duration for which storage may be unavailable? I'm wondering if Bazel might use up all of its retries while we wait for a single node to come online. Are we relying on the bb-storage network request timing out to give the new node enough time to come online? I do not see any interval or back-off time limit for Bazel when setting --remote_retries.
I was considering using persistent local bb-storage configuration, as opposed to transient storage. I was supposing that it would put less strain on the storage nodes and it would allow a node restart to not require JIT re-population before servicing build jobs. Do you have a suggestion?
Oh I see, I was under the impression that replication could occur asynchronously, perhaps by the bb_replicator service.
It's all synchronous at this point. bb_replicator is just there to deduplicate replication at a global level.
Do you have anecdotal evidence to support a suitable duration for which storage may be unavailable? I'm wondering if Bazel might use up all of its retries while we wait for a single node to come online. Are we relying on the bb-storage network request timing out to give the new node enough time to come online?
This may sound oddly specific, but I have observed that if an EKS node running bb_storage dies, it takes about eight minutes for EKS to detect it, mark the node as down, and evict the pods. So my advice would be to let Bazel retry for at least 15-30 minutes.
I do not see any interval or back-off time limit for Bazel when setting --remote_retries.
Yeah, Bazel is a bit limited in that regard. Bazel's exponential backoff uses an interval of at most five seconds, meaning that if you set --remote_retries=360, then you have the guarantee that it retries for at least 30 minutes (360 × 5 s = 1,800 s). It would have been easier if Bazel had some kind of --remote_retry_duration flag, but it does not. Be sure to file an issue upstream!
I was considering using persistent local bb-storage configuration, as opposed to transient storage. I was supposing that it would put less strain on the storage nodes and it would allow a node restart to not require JIT re-population before servicing build jobs. Do you have a suggestion?
The amount of strain tends to be pretty small. The hot part of the CAS tends to be relatively small, meaning that after a restart you'll only see high loads on the storage nodes for five or so minutes. Making them ephemeral has the advantage that you can just schedule them as Kubernetes 'deployments' instead of 'daemonsets', meaning that there's a higher probability Kubernetes is able to start them for you.
It's all synchronous at this point. bb_replicator is just there to deduplicate replication at a global level.
Interesting, I didn't realize this. If the load on bb-storage is minimal after coming online, do you have a recommendation on when to use a bb_replicator in terms of number of workers?
This may sound oddly specific, but I have observed that if an EKS node running bb_storage dies, it takes about eight minutes for EKS to detect it, mark the node as down, and evict the pods. So my advice would be to let Bazel retry for at least 15-30 minutes.
That is odd. 8 mins sounds very long for k8s to detect an unhealthy node. Do you think the health checks are not tuned properly? Or are you perhaps seeing bb_storage nodes misbehave for a while before their health checks start failing? 30 minutes sounds like a long time to attempt retries; I was hoping to keep this as minimal as possible, perhaps < 2-3 mins. An empty storage node in k8s should start up very fast, especially if the k8s cluster is sized appropriately and does not need a new EC2 instance to come online. I'll play around with this and see what it is in my case.
--remote_retries=360
Wow! Have you observed any worst-case scenarios? For example, if a network call hangs instead of failing quickly (e.g. hangs for 1 min), would the build wait for 6 hours before failing? I am using --keep_going, which makes me worried that Bazel would try to traverse each available edge in the DAG and wait a long time, exhausting its retries before finally failing.
Thanks for the information. Since Buildbarn has many configuration options, it's helpful to understand how you have productionized it, and which path is best to start down.
Related to the concern of making storage HA, I am wondering whether you have attempted to make the scheduler more available or durable. I do realize that Bazel's --remote_retries might be your answer here as well, but it would still be nice if runners did not lose their work-in-progress when a scheduler dies. This is especially painful for long-running actions.
It's all synchronous at this point. bb_replicator is just there to deduplicate replication at a global level.
Interesting, I didn't realize this. If the load on bb-storage is minimal after coming online, do you have a recommendation on when to use a bb_replicator in terms of number of workers?
I think that for any reasonable production setup it's a requirement to set this up. Bazel tends to issue many FindMissingBlobs() calls that have overlapping sets of objects, so even a single Bazel client with a high --jobs setting could trigger an undesirable amount of replication traffic if not deduplicated.
This may sound oddly specific, but I have observed that if an EKS node running bb_storage dies, it takes about eight minutes for EKS to detect it, mark the node as down, and evict the pods. So my advice would be to let Bazel retry for at least 15-30 minutes.
That is odd. 8 mins sounds very long for k8s to detect an unhealthy node. Do you think the health checks are not tuned properly?
Yes, I am fairly certain of that. Unfortunately, EKS doesn't allow you to tune parameters on the API server. So there is no way to tighten node check intervals/timeouts. We have to rely on EC2 terminating the instance, causing a lifecycle hook event to be generated. Only then will EKS start to evict pods.
--remote_retries=360
Wow! Have you observed any worst-case scenarios, like if a network call hangs instead of failing quickly (e.g. hangs for 1 min)? Would the build wait for 6 hours before failing?
Yeah, that sometimes happens. Most of the time we see calls fail quickly, fortunately. Buildbarn retries sparingly, because it simply wants the client to do the retries (so that clients don't observe mysterious hangs). As I said, I think Bazel should just have a --remote_retry_duration or something.
Related to the concern of making storage HA, I am wondering whether you have attempted to make the scheduler more available or durable. I do realize that Bazel's --remote_retries might be your answer here as well, but it would still be nice if runners did not lose their work-in-progress when a scheduler dies. This is especially painful for long-running actions.
Yes, exactly. It can be painful, but at least for my setup, scheduler restarts tend to be rare (once a week, when I do upgrades). I'd rather have some delays now and then than make the scheduler's design more complex (and potentially slower as a result of it). Letting the scheduler be non-persistent has the advantage that it's hard to get it stuck due to corrupted state. If there's a bug that causes it to crash now and then, that's fine. We can just fix it as part of the next release.
@EdSchouten thanks for all the clarifications. I do have a few more questions I'd like to ask. Perhaps I should close this ticket and move to the Buildbarn Slack channel instead...
One point of clarification:
It's all synchronous at this point. bb_replicator is just there to deduplicate replication at a global level.
Is this stating that FindMissingBlobs will block until the bb_replicator has finished replicating all the items? Does this negatively affect performance when a new empty bb_storage CAS comes online?
Also, I'm considering putting the ISCC and AC in Redis, mostly for reliability purposes. I've seen you state that it can take up to a month for the ISCC to start stabilizing, and I'd be very concerned about inadvertently losing this data. A managed Redis service with backups seems like it could be a good option. Are there any drawbacks that would be concerning to you? Do you have regular backups of the ISCC store?
It's all synchronous at this point. bb_replicator is just there to deduplicate replication at a global level.
Is this stating that FindMissingBlobs will block until the bb_replicator has finished replicating all the items? Does this negatively affect performance when a new empty bb_storage CAS comes online?
Yes, it will. But considering that the 'hot set' of the build tends to be relatively small compared to the full size of the CAS, performance should come back relatively quickly. I've had great experiences using the following replicator configuration on the bb_replicator side:
```jsonnet
replicator: { deduplicating: { concurrencyLimiting: {
  base: { 'local': {} },
  maximumConcurrency: 100,
} } },
```
Also, I'm considering putting the ISCC and AC in Redis, mostly for reliability purposes. I've seen you state that it can take up to a month for the ISCC to start stabilizing, and I'd be very concerned about inadvertently losing this data. A managed Redis service with backups seems like it could be a good option. Are there any drawbacks that would be concerning to you?
No drawbacks that I can think of. That should likely work all right.
Do you have regular backups of the ISCC store?
I don't. I mean, it's important enough that I tend to mirror it, but I don't take any offline backups.
I'm going ahead and closing this, considering that it's unlikely that any concrete work items will come out of it. Feel free to open new issues if needed.