
Bring back S3 support about bb-storage HOT 14 CLOSED

buildbarn avatar buildbarn commented on June 11, 2024
Bring back S3 support

from bb-storage.

Comments (14)

finn-ball avatar finn-ball commented on June 11, 2024

We use only S3 as our AC and CAS (yes, we have thought about moving the AC to faster storage, but we found there's currently no need)

Are there any blockers to moving from S3 to some faster SSD storage? My experience of moving from S3 to SSD block storage was painless.


finn-ball avatar finn-ball commented on June 11, 2024

You may also find the following documentation useful:
https://github.com/buildbarn/bb-adrs/blob/master/0002-storage.md

There are no facilities for performing bulk existence queries, meaning it is hard to efficiently implement FindMissing().

Did you ever figure out a workaround for this?


EdSchouten avatar EdSchouten commented on June 11, 2024

Hi Nathan,

Let me repeat myself by saying that I'm absolutely not against having an S3 backend. S3 is cheap and durable. The only thing I do object to is having a backend that directly maps the CAS/AC key space onto that of a bucket, like CloudBlobAccess did, because that way it's virtually impossible to get decent performance and Builds without the Bytes support. It should be possible to do something smarter here, such as storing write-ahead logs in S3. For example, consider Thanos: does Thanos store every time series as an individual object in S3? The answer is 'no', for almost exactly the same reasons.
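To make the "something smarter" direction concrete, here is a minimal sketch (hypothetical Python, not bb-storage code) of the batching idea: pack many small blobs into one large object plus an offset index, so the bucket holds a few large objects instead of millions of tiny ones, much like Thanos packs time series into blocks. All names are illustrative.

```python
import hashlib
import io


def build_pack(blobs):
    """Pack many small blobs into one byte stream plus an offset index.

    `blobs` maps digest -> bytes. Returns (pack_bytes, index), where
    index maps digest -> (offset, size). In a real backend the pack
    would be uploaded as a single S3 object, and the index kept in a
    small queryable store, enabling bulk existence checks without
    per-object S3 round trips.
    """
    buf = io.BytesIO()
    index = {}
    for digest, data in sorted(blobs.items()):
        index[digest] = (buf.tell(), len(data))
        buf.write(data)
    return buf.getvalue(), index


def read_blob(pack_bytes, index, digest):
    """Extract a single blob from a pack using the index."""
    offset, size = index[digest]
    return pack_bytes[offset:offset + size]
```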

I am currently happy with LocalBlobAccess + MirroredBlobAccess + ShardingBlobAccess. It has worked very well for me, just like it has worked well for @finn-ball and several others. The number of infra failures I see is far lower than what you're experiencing, especially if you set health checks aggressively and replace broken storage nodes quickly. That's why I've not invested any time myself in addressing any of this.

In summary: Bringing back CloudBlobAccess is not going to happen. Proposals & contributions for building something smarter than that are more than welcome.


allada avatar allada commented on June 11, 2024

It is quite rare any of our infra failures are due to anything with S3 (since we applied our patches to bb-storage).

Here's a bit more information on how we use buildbarn:

  • We have a single S3 bucket that holds all objects, with pruning configured to remove objects 30 days after upload.
  • We modified buildbarn's (previous) S3 logic so that it makes an S3->S3 copy of an object once it is more than 10 days past its last modification, which resets its age with respect to the 30-day pruning rule.
  • We shard our S3 uploads into 1000 different folders. This is because S3 limits each prefix to roughly 5000 requests per second (we peak at about 30-40k requests per second roughly once per day).
  • We use existence checking + existence caching, which reduces requests to S3 and ensures every object needed for remote execution is touched.
  • ~90% of the builds happen in CI, in a tightly controlled environment. This is where most remote execution happens.
  • ~10% of the usage is by users building locally. We use a fairly complicated configuration that lets users download from the builds CI makes when available; if that fails, the request falls back to an S3 folder (in the same bucket) that is unique to that user's AWS IAM role. Uploads go only to the user's own folder, which prevents one user from poisoning other users' builds. We do allow users to build remotely, but don't encourage it because of the high latency/queue times (which is why we care mostly about CI); in addition, it's unlikely we could ever make it fast enough, since most of our binaries are over 1 gigabyte and network transfer is often slower than just building locally [especially since we give all employees 64-core machines].
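The folder-sharding scheme in the bullets above can be sketched as a small helper (hypothetical Python; the actual patch is not public here). A blob's digest is already a uniform hash, so its leading hex characters can pick the shard prefix deterministically:

```python
def s3_key_for_digest(digest_hex, num_shards=1000, prefix="cas"):
    """Map a blob digest to a sharded S3 key.

    Uses the leading hex characters of the digest itself as a uniform
    hash, so the same blob always lands in the same shard folder and
    request load spreads evenly across S3 prefixes. The `cas` prefix
    and zero-padded shard format are illustrative choices.
    """
    shard = int(digest_hex[:8], 16) % num_shards
    return f"{prefix}/{shard:04d}/{digest_hex}"
```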

As a summary, what we really care about is making CI reliable, maintenance-free, and scalable, which is why we use S3. Here are some things we tried and why we don't use them:

  • LocalBlobAccess - This worked part of the time, but it required additional infrastructure and points of failure. For a short while we ran this kind of configuration on a single i3en.24xlarge, but the instance could not keep up with the load, causing some connections to time out. We considered load-balancing it, but opted out because of the complexity.
  • Redis - This worked fairly well, but suffers from the same issue S3 has (in that it will eventually evict items [in our configuration]). It did work well for the Action Cache, but for CAS objects it has a limit of 512 MB per object, and we have some objects over 10 GB. We thought about a SizeDistinguishing-type configuration, but that doesn't actually solve our problem. We were also worried about the amount of traffic Redis could support. Yes, in theory it scales well, but our build requests tend to come all at once, meaning we hammer whatever the backing store is extremely hard for a short amount of time. The cost of maintaining a Redis setup that could auto-scale quickly, plus the need for SizeDistinguishing and the additional complexity, ruled this configuration out. [Side note: Lyft's L5 team did use this configuration, but only for the Action Cache; they used S3 for the CAS, for similar reasons.]
  • ICAS - This did address some of our concerns, but by the time it was finished we had patched all the S3 problems we were having.

My intention is not to say that CloudBlobAccess is the solution. My intention is to give insight into how some of your users are using buildbarn.


EdSchouten avatar EdSchouten commented on June 11, 2024

We do allow users to build remotely, but don't encourage it because of the high latency/queue times (which is why we care mostly about CI); in addition, it's unlikely we could ever make it fast enough, since most of our binaries are over 1 gigabyte and network transfer is often slower than just building locally [especially since we give all employees 64-core machines].

An alternative is to build/test everything remotely and use Builds without the Bytes. It's unlikely that people are interested in accessing most of those outputs anyway.

LocalBlobAccess - [...] Thought was given into load balance it, but we opted out of this because of the complexity.

I think that's a bit unreasonable, isn't it? I'm pretty sure that if you used Redis at a similar scale, you would have had to shard it as well, and apparently that "worked fairly well". Even though the example deployment of Buildbarn in bb-deployments isn't very complete and could really use some love, it does demonstrate how to set up sharding, and it's not a whole lot of effort: just spin up multiple instances and tie them together using ShardingBlobAccess. Building a setup that at least uses plain sharding shouldn't be hard.
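The core idea behind ShardingBlobAccess — every client hashes a blob's digest the same way, so all clients agree on which storage instance owns it — can be sketched as follows. This is an illustrative Python sketch only; a real Buildbarn deployment configures sharding declaratively (see bb-deployments), and the backend addresses below are made up:

```python
import hashlib


def pick_shard(digest_hex, backends):
    """Choose a storage backend for a blob, ShardingBlobAccess-style.

    Every client applies the same hash to the digest, so all clients
    deterministically agree on which instance holds a given blob.
    `backends` is a list of addresses (illustrative values only).
    """
    h = int(hashlib.sha256(digest_hex.encode()).hexdigest()[:8], 16)
    return backends[h % len(backends)]
```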

To repeat myself, I'm not against reintroducing S3 support, but it should be backed by a concrete and solid proposal, and someone needs to implement/maintain it. In the 14 months that CloudBlobAccess has been gone, I haven't seen a single proposal, nor have I seen any detailed technical discussions about more efficient ways for Buildbarn to use S3. Like @finn-ball, I would be very interested in having an answer to this question:

There are no facilities for performing bulk existence queries, meaning it is hard to efficiently implement FindMissing().

Did you ever figure out a workaround for this?


EdSchouten avatar EdSchouten commented on June 11, 2024

Closing this issue, as no significant progress has been made on it for quite some time.

Just to reiterate, I'm not against reintroducing S3 support, but there needs to be some concrete plan here around how things like FindMissingBlobs() can be implemented in a performant way. Happy to see such proposals appear!


joeljeske avatar joeljeske commented on June 11, 2024

@allada do you have a public fork that contains your mentioned S3 backend? I think it would be very interesting to bring back S3 support, especially when used as a slow source of truth with alternative fast pull-through caches available.


allada avatar allada commented on June 11, 2024

@joeljeske kinda, send me an email and I can send the patch/details.


EdSchouten avatar EdSchouten commented on June 11, 2024

I think it would be very interesting to bring back S3 support, especially when used as a slow source of truth with alternative fast pull-through caches available.

Again, to me it's not clear how that would work. Keep in mind that the speed of a cache is strongly influenced by that of FindMissingBlobs(). Because the S3 bucket would be the source of truth, FindMissingBlobs() calls would need to be sent there. Because S3 doesn't have a decent API for sending bulk existence queries, you wouldn't be able to offer an implementation that works efficiently.
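A sketch of why this is slow: with only per-object existence checks available (for S3, one HeadObject request per digest), FindMissingBlobs() degenerates into one network round trip per blob, at best hidden behind a thread pool. In the hypothetical sketch below, `blob_exists` stands in for that per-object check:

```python
from concurrent.futures import ThreadPoolExecutor


def find_missing(digests, blob_exists, max_workers=32):
    """Return the digests not present in storage.

    `blob_exists` stands in for a per-object existence check; for S3
    this would be a HeadObject request, i.e. one network round trip
    per digest. Parallelism hides some latency but cannot turn this
    into the single bulk lookup a fast FindMissingBlobs() needs.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        results = pool.map(blob_exists, digests)
    return [d for d, present in zip(digests, results) if not present]
```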


joeljeske avatar joeljeske commented on June 11, 2024

Yes that does make a lot of sense.

Would persistent existence caching help make FindMissing more efficient? Perhaps if combined with bucket lifecycle event notifications, we could reliably keep the bucket and the BB existence cache in sync as objects expire.


EdSchouten avatar EdSchouten commented on June 11, 2024

The existence cache is only capable of caching positive responses. The issue is that, especially with remote execution, the number of negative responses is also non-negligible, issued both by clients and by workers.

Long story short, it would save some traffic, but still not enough to make it perform as well as, say, a large scale LocalBlobAccess setup.

With the solution you're proposing, FindMissingBlobs() may still take seconds to complete, while you generally want it to complete in mere milliseconds to let the build run at a decent pace.


moroten avatar moroten commented on June 11, 2024

Does anyone have experience with how https://github.com/buchgr/bazel-remote performs with S3? One way to find out is to configure the backend as a gRPC client pointing at bazel-remote and measure.


joeljeske avatar joeljeske commented on June 11, 2024

Buildbuddy also supports S3 cache backend. Another datapoint for comparison.

@EdSchouten my thinking is to use a new persisted existence cache held on a server fronting S3. This would not be the current client-side existence cache. If all requests to S3 go through such a service, we can accurately cache both negative and positive existence. If folks use lifecycle rules for expiring objects, we could hook into the lifecycle notifications to mark items as deleted in the existence cache.

Since the existence cache would be persisted, one would not need to rescan S3 to repopulate it on every startup, though perhaps one could optionally repopulate existence from S3 after a real failure.

Do you think that would solve the issues you've encountered with latency when scanning for missing blobs?

To scale, one should shard the servers fronting S3.
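The proposed server-side cache might look something like this minimal sketch (hypothetical names; this shows the shape of the idea, not an implementation): record each upload with its expected lifecycle expiration, and record deletions explicitly, so existence queries never have to touch S3:

```python
import time


class ExistenceCache:
    """Sketch of an existence cache for a service fronting S3.

    Stores a per-digest expiration timestamp, so it can answer both
    positive and negative existence queries locally: an object that
    was never uploaded, was deleted, or has passed its lifecycle
    expiration is reported as missing. Purely illustrative.
    """

    def __init__(self, ttl_seconds):
        self.ttl = ttl_seconds
        self.expiry = {}  # digest -> expiration timestamp

    def record_upload(self, digest, now=None):
        now = time.time() if now is None else now
        self.expiry[digest] = now + self.ttl

    def record_deletion(self, digest):
        # Hook for lifecycle deletion notifications.
        self.expiry.pop(digest, None)

    def exists(self, digest, now=None):
        now = time.time() if now is None else now
        exp = self.expiry.get(digest)
        return exp is not None and exp > now
```

In practice the `expiry` map would have to be persisted and replicated, which is exactly the availability question raised in the next comment.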


EdSchouten avatar EdSchouten commented on June 11, 2024

Such an approach could work. You don't even need to use S3 lifecycle event notifications, I guess: the existence cache could just store the expiration date of each object as part of its bookkeeping.

But how would a model like this guarantee availability? With what you've sketched out, you would have a single EC2 instance managing the existence cache (for a given shard of the data set). How is the existence cache persisted? What happens if the EC2 instance fails? Would a replacement instance only have an older copy of the existence cache at its disposal? How would that keep clients satisfied? They will have lost objects in the middle of a build.

