Comments (14)
We use only S3 as our AC and CAS (yes, we have thought about moving the AC to faster storage, but we found there's currently no need).
Are there any blockers to moving from S3 to some faster SSD storage? My experience of moving from S3 to SSD block storage was painless.
from bb-storage.
You may also find the following documentation useful:
https://github.com/buildbarn/bb-adrs/blob/master/0002-storage.md
There are no facilities for performing bulk existence queries, meaning it is hard to efficiently implement FindMissing().
Did you ever figure out a workaround for this?
Hi Nathan,
Let me repeat myself by saying that I'm absolutely not against having an S3 backend. S3 is cheap and durable. The only thing that I do object to is having a backend that directly maps the CAS/AC key space to that of a bucket, like CloudBlobAccess did. The reason being that's virtually impossible to get decent performance and Builds without the Bytes support. It should be possible to do something smarter here. Storing WAL logs in S3? For example, consider Thanos. Does Thanos store every time series as an individual object in S3? The answer to that is 'no', for almost exactly the same reasons.
I am currently happy with LocalBlobAccess + MirroredBlobAccess + ShardingBlobAccess. It has worked very well for me, just like it has worked well for @finn-ball and several others. The number of infra failures I see are far less than what you're experiencing, especially if you set health checks aggressively and replace broken storage nodes quickly. That's why I've not invested any time myself to address any of this.
In summary: Bringing back CloudBlobAccess is not going to happen. Proposals & contributions for building something smarter than that are more than welcome.
It is quite rare that any of our infra failures are due to anything with S3 (since we applied our patches to bb-storage).
Here's a bit more information on how we use buildbarn:
- We have a single S3 bucket that holds all objects, with pruning configured to remove objects 30 days after upload.
- We modified buildbarn's (previous) S3 logic so it makes an S3-to-S3 copy of an object if it was last modified more than 10 days ago, resetting its age so the 30-day pruning doesn't delete objects that are still in use.
- We shard our S3 uploads into 1000 different folders in S3. This is because S3 has a limit of ~5000 requests per second per folder (we peak at about 30-40k requests per second, roughly once per day).
- We use existence checking + existence caching which reduces requests to s3 and ensures every object needed for remote execution is touched.
- ~90% of the builds are in CI, under a tightly controlled environment. This is where most remote execution happens.
- ~10% of the usage is by users that build locally. We use a very complicated configuration that causes users to download from the builds CI makes if available; if that fails, it will request from an S3 folder (in the same bucket) that is unique to that user's AWS IAM role. When the user uploads, it will upload only to the user's folder. This prevents a user from poisoning other users' builds. We do allow users to build remotely, but don't encourage it because it has high latency/queue times (which is why we care mostly about using it in CI); in addition, it's unlikely we could ever make it fast enough, since most of our binaries are over 1 gigabyte and network speed is often slower than just building locally [especially since we give all employees 64 core machines].
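The prefix-sharding scheme described above can be sketched roughly like this (the helper name and key layout are hypothetical; the actual patch is not public):

```python
import hashlib

NUM_SHARDS = 1000  # assumed shard count, matching the description above


def shard_key(digest_hash: str, size_bytes: int) -> str:
    """Derive a stable S3 key that spreads objects across prefixes.

    S3 scales request throughput per key prefix, so hashing the CAS
    digest into one of NUM_SHARDS folders spreads a 30-40k req/s burst
    over many prefixes instead of hammering a single one.
    """
    shard = int(hashlib.sha256(digest_hash.encode()).hexdigest(), 16) % NUM_SHARDS
    return f"cas/{shard:03d}/{digest_hash}-{size_bytes}"


print(shard_key("a" * 64, 1234))
```

Because the shard is derived deterministically from the digest, every reader and writer agrees on the object's location without any coordination.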
As a summary, what we really care about is making CI reliable and maintenance free, and scalable, which is why we use S3. Here are some things we tried and why we don't use them:
- LocalBlobAccess - This worked part of the time, but it required additional infrastructure and points of failure. For a short amount of time we used this kind of configuration with a single i3en.24xlarge, but the instance ended up being unable to keep up with the load, causing some connections to time out. Thought was given to load balancing it, but we opted out because of the complexity.
- Redis - This worked fairly well, but it suffers from the same issues S3 has (in that it will eventually evict items [in our configuration]). It did work well for the Action Cache, but for CAS objects it has a limit of 512 MB per object and we have some objects over 10 GB. We thought about a SizeDistinguishing-type configuration, but that doesn't actually solve our problem. We were also worried about the amount of traffic Redis could support. Yes, in theory it can scale well, but our build requests tend to come all at once, meaning we hammer whatever the backing store is extremely hard for a short amount of time; the cost of maintaining a Redis configuration that could auto-scale quickly, plus the need for SizeDistinguishing and the additional complexity, ruled this configuration out. [Side note: Lyft's L5 team did use this configuration, but only for the Action Cache; they used S3 for the CAS, for similar reasons.]
- ICAS - This did seem to solve a bit of our concerns, but by the time it was finished we had patched all the S3 problems we were having.
My intention is not to say that CloudBlobAccess is the solution. My intention is to give insight into how some of your users are using buildbarn.
We do allow users to build remotely, but don't encourage it because it has high latency/queue times (which is why we care mostly about using it in CI); in addition, it's unlikely we could ever make it fast enough, since most of our binaries are over 1 gigabyte and network speed is often slower than just building locally [especially since we give all employees 64 core machines].
An alternative is to build/test everything remotely and use Builds without the Bytes. It's unlikely that people are interested in accessing most of those outputs anyway.
LocalBlobAccess - [...] Thought was given into load balance it, but we opted out of this because of the complexity.
I think that's a bit unreasonable, isn't it? I'm pretty sure that if you used Redis at a similar scale, you would have had to shard it for sure, and apparently that "worked fairly well". Even though the example deployment of Buildbarn in bb-deployments isn't very complete and could really use some love, it does demonstrate how to set up sharding, and it's not a whole lot of effort. Just spin up multiple instances and tie them together using ShardingBlobAccess. Building a setup that at least uses plain sharding shouldn't be hard.
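Plain sharding of the kind bb-deployments demonstrates amounts to little more than routing each digest to a fixed backend. A toy illustration (in-memory stand-ins, not the actual ShardingBlobAccess code):

```python
import hashlib


class Shard:
    """Toy stand-in for one storage backend (e.g. a LocalBlobAccess node)."""

    def __init__(self):
        self.blobs = {}

    def put(self, digest, data):
        self.blobs[digest] = data

    def get(self, digest):
        return self.blobs[digest]


class ShardedStore:
    """Route every digest to a fixed shard, chosen by hashing the digest."""

    def __init__(self, shards):
        self.shards = shards

    def _pick(self, digest):
        h = int(hashlib.sha256(digest.encode()).hexdigest(), 16)
        return self.shards[h % len(self.shards)]

    def put(self, digest, data):
        self._pick(digest).put(digest, data)

    def get(self, digest):
        return self._pick(digest).get(digest)


store = ShardedStore([Shard() for _ in range(4)])
store.put("abc123", b"hello")
print(store.get("abc123"))  # b'hello'
```

Because the shard choice is a pure function of the digest, adding capacity is mostly a matter of spinning up more backends and updating the shard list (the real implementation also has to handle resharding and drained nodes).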
To repeat myself, I'm not against reintroducing S3 support, but it should be backed by a concrete and solid proposal, and someone needs to implement/maintain it. In the 14 months that CloudBlobAccess has been gone, I haven't seen a single proposal, nor have I seen any detailed technical discussions about more efficient ways for Buildbarn to use S3. Like @finn-ball, I would be very interested in having an answer to this question:
There are no facilities for performing bulk existence queries, meaning it is hard to efficiently implement FindMissing().
Did you ever figure out a workaround for this?
Closing this issue, as no significant progress has been made on it for quite some time.
Just to reiterate, I'm not against reintroducing S3 support, but there needs to be some concrete plan here around how things like FindMissingBlobs() can be implemented in a performant way. Happy to see such proposals appear!
@allada do you have a public fork that contains your mentioned S3 backend? I think it would be very interesting to bring back S3 support, especially when used as a slow source of truth with alternative fast pull-through caches available.
@joeljeske kinda, send me an email and I can send the patch/details.
I think it would be very interesting to bring back S3 support, especially when used as a slow source of truth with alternative fast pull-through caches available.
Again, to me it's not clear how that would work. Keep in mind that the speed of a cache is strongly influenced by that of FindMissingBlobs(). Because the S3 bucket would be the source of truth, FindMissingBlobs() calls would need to be sent there. Because S3 doesn't have a decent API for sending bulk existence queries, you wouldn't be able to offer an implementation that works efficiently.
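To make the cost concrete, here is a toy latency model (the numbers are illustrative assumptions, not benchmarks) of why per-object existence checks dominate FindMissingBlobs() when the backend only offers single-key HEAD requests:

```python
# Toy model: an S3-style HEAD is one round trip per key; a batched
# existence query (which S3 does not offer) would be one round trip
# per chunk. HEAD_RTT_MS and MAX_CONCURRENCY are assumed values.

HEAD_RTT_MS = 20       # assumed per-request round-trip time
MAX_CONCURRENCY = 100  # assumed client-side parallel request limit


def find_missing_latency_ms(num_digests: int) -> int:
    """Latency of checking num_digests keys with per-key HEAD requests,
    issued in waves of MAX_CONCURRENCY parallel requests."""
    waves = -(-num_digests // MAX_CONCURRENCY)  # ceiling division
    return waves * HEAD_RTT_MS


# A single action with 10,000 input files:
print(find_missing_latency_ms(10_000))  # 100 waves * 20 ms = 2000 ms
```

Even with generous parallelism, a build that issues FindMissingBlobs() on tens of thousands of digests spends whole seconds on round trips alone, which is exactly the latency complaint below.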
Yes that does make a lot of sense.
Would persistent existence caching help make FindMissing more efficient? Perhaps if used with bucket lifecycle event notifications, we could reliably keep the bucket and the BB existence cache in sync when objects are expired.
The existence cache is only capable of caching positive responses. The issue is that, especially with remote execution, the number of negative responses is also non-negligible, issued by both the clients and the workers.
Long story short, it would save some traffic, but still not enough to make it perform as well as, say, a large-scale LocalBlobAccess setup.
With the solution you're proposing, FindMissingBlobs() may still take seconds to complete, while you generally want it to complete in mere milliseconds to let the build run at a decent pace.
Does anyone have experience with how https://github.com/buchgr/bazel-remote performs with S3? One way to find out would be to configure the backend as a gRPC client pointing to bazel-remote and measure.
Buildbuddy also supports S3 cache backend. Another datapoint for comparison.
@EdSchouten my thinking is to use a new persisted existence cache hosted on a server fronting S3. This would not be the current client-side existence cache. If all requests to S3 go through a new service, then we can accurately cache negative and positive existence. If folks use lifecycle events for expiring objects, we could hook into the lifecycle notifications to mark items as deleted from the existence cache.
Since the existence cache would be persisted, one should not need to scan S3 to populate it on every startup, but perhaps one could optionally repopulate existence from S3 in case of a real failure.
Do you think that would solve the issues you've encountered with latency when scanning for missing blobs?
To scale, one should shard the servers fronting S3.
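A minimal sketch of the proposed server-side cache (all names hypothetical; persistence reduced to an in-memory dict for brevity):

```python
import time


class ExistenceCache:
    """Authoritative existence cache fronting S3.

    Because every upload and every lifecycle deletion is observed by
    this service, it can answer both positive and negative existence
    queries without touching S3 at all.
    """

    def __init__(self):
        self._expiry = {}  # digest -> expiration timestamp (seconds)

    def record_upload(self, digest, ttl_seconds):
        # Called whenever a blob is written through this service.
        self._expiry[digest] = time.time() + ttl_seconds

    def record_lifecycle_deletion(self, digest):
        # Called from a bucket lifecycle event notification.
        self._expiry.pop(digest, None)

    def find_missing(self, digests):
        # Anything unknown or past its expiration counts as missing.
        now = time.time()
        return [d for d in digests if self._expiry.get(d, 0) <= now]


cache = ExistenceCache()
cache.record_upload("blob-a", ttl_seconds=3600)
print(cache.find_missing(["blob-a", "blob-b"]))  # ['blob-b']
```

Storing the expiration timestamp alongside each digest means the cache can answer "missing" for expired objects even before the lifecycle notification arrives.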
Such an approach could work. You don't even need to use bucket lifecycle events, I guess; the existence cache could just store the expiration date of the object as part of its bookkeeping.
But how would a model like this guarantee availability? With what you sketched out, you would have a single EC2 instance managing the existence cache (for a given shard of the data set). How is the existence cache persisted? What happens if the EC2 instance fails? Would a replacement instance only have an older copy of the existence cache at its disposal? How would that keep clients satisfied? They would suddenly see objects go missing in the middle of their builds.