truelayer / ginepro

117.0 19.0 22.0 69 KB

A client-side gRPC channel implementation for tonic

License: Apache License 2.0

Rust 100.00%

grpc tonic load-balancer

ginepro's Issues

RUSTSEC-2021-0073: Conversion from `prost_types::Timestamp` to `SystemTime` can cause an overflow and panic

Conversion from prost_types::Timestamp to SystemTime can cause an overflow and panic

Details
Package	`prost-types`
Version	`0.7.0`
URL	tokio-rs/prost#438
Date	2021-07-08
Patched versions	`>=0.8.0`

Affected versions of this crate contained a bug in which untrusted input could cause an overflow and panic when converting a Timestamp to SystemTime.

It is recommended to upgrade to prost-types v0.8 and switch the usage of From<Timestamp> for SystemTime to TryFrom<Timestamp> for SystemTime.

See #438 for more information.

See advisory page for additional details.
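
As a rough illustration of why the advisory recommends a fallible conversion, the sketch below implements `TryFrom` with checked arithmetic. It is not the prost-types implementation: `Timestamp` here is a local stand-in with the same two fields, `TimestampOutOfRange` is a hypothetical error type, and out-of-range `nanos` values are rejected rather than normalized, which is a simplification.

```rust
use std::convert::TryFrom;
use std::time::{Duration, SystemTime, UNIX_EPOCH};

/// Local stand-in for `prost_types::Timestamp` (same two fields), so the
/// sketch compiles without any dependencies.
#[derive(Debug, Clone, Copy)]
pub struct Timestamp {
    pub seconds: i64,
    pub nanos: i32,
}

/// Hypothetical error type for values that cannot be converted.
#[derive(Debug, PartialEq, Eq)]
pub struct TimestampOutOfRange;

impl TryFrom<Timestamp> for SystemTime {
    type Error = TimestampOutOfRange;

    fn try_from(ts: Timestamp) -> Result<Self, Self::Error> {
        // Simplification: reject out-of-range nanos instead of normalizing them.
        if !(0..1_000_000_000).contains(&ts.nanos) {
            return Err(TimestampOutOfRange);
        }
        // Checked arithmetic returns `None` on overflow instead of panicking.
        let whole_seconds = if ts.seconds >= 0 {
            UNIX_EPOCH.checked_add(Duration::from_secs(ts.seconds as u64))
        } else {
            UNIX_EPOCH.checked_sub(Duration::from_secs(ts.seconds.unsigned_abs()))
        };
        whole_seconds
            .and_then(|t| t.checked_add(Duration::from_nanos(ts.nanos as u64)))
            .ok_or(TimestampOutOfRange)
    }
}
```

With this shape, untrusted input such as extreme `seconds` values produces either a valid `SystemTime` or an `Err`, depending on the platform's time range, but never a panic.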

ginepro does not load 'balance'

Bug description

ginepro indirectly uses tower::Balance, which uses a best-of-two random load-balancing strategy.

Unfortunately, tonic has no built-in mechanism for determining load, so the load is hard-coded to always be 0.

https://docs.rs/tonic/0.8.3/src/tonic/transport/service/connection.rs.html#114-120

This means requests are distributed purely at random. This is still a balancing technique, but it is known not to be very good.
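
To make the effect concrete, here is a minimal, dependency-free sketch of a best-of-two picker. It is not tower's code: `XorShift64`, `less_loaded`, and `pick_p2c` are hypothetical names, and the tie-breaking rule (first sample wins) is an assumption of this sketch. The point it illustrates is that when every load is 0, the comparison never changes the outcome, so the choice is just the random sample.

```rust
/// Tiny xorshift PRNG so the sketch needs no external crates.
/// The seed must be non-zero. This is not the RNG tower uses.
pub struct XorShift64(pub u64);

impl XorShift64 {
    pub fn next_u64(&mut self) -> u64 {
        let mut x = self.0;
        x ^= x << 13;
        x ^= x >> 7;
        x ^= x << 17;
        self.0 = x;
        x
    }
}

/// Compare two sampled candidates and keep the less loaded one.
/// On a tie the first sample wins (an assumption of this sketch).
pub fn less_loaded(loads: &[u32], first: usize, second: usize) -> usize {
    if loads[second] < loads[first] {
        second
    } else {
        first
    }
}

/// Best-of-two: sample two distinct endpoints at random, then pick by load.
/// Returns `None` when there are no endpoints.
pub fn pick_p2c(rng: &mut XorShift64, loads: &[u32]) -> Option<usize> {
    match loads.len() {
        0 => None,
        1 => Some(0),
        n => {
            let first = (rng.next_u64() % n as u64) as usize;
            let mut second = (rng.next_u64() % (n as u64 - 1)) as usize;
            if second >= first {
                second += 1; // skip `first` so the two samples differ
            }
            Some(less_loaded(loads, first, second))
        }
    }
}
```

With real load values, such as in-flight request counts, `less_loaded(&[5, 1, 3], 0, 1)` prefers endpoint 1. With `&[0, 0, 0]` it always returns the first sample, which is the degenerate behaviour described above.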

Problems updating from 0.5.1 to 0.5.2

I ran into various problems with this upgrade. I'm still trying to understand them and have had to revert for now; I'll provide more detail when I can.

But I think upgrading the tonic dependency from 0.8 to 0.9 should probably have been done in a 0.6.0 release rather than a patch release.

wait for an initial resolution to succeed before fully constructing `LoadBalancedChannel`

Motivations

The service probe loop appears to log and ignore errors while running. This seems fine while the service is running, but at startup, it could be an indication that the ServiceDefinition has an invalid hostname (e.g., has a typo). Some callers might prefer to panic in that situation so deployment systems are immediately aware of a problem instead of just reporting nothing resolved via metrics.

Solution

When constructing a LoadBalancedChannel, the code should wait for an initial resolution of the provided ServiceDefinition to succeed before returning the LoadBalancedChannel. This has the benefit of (1) finding invalid DNS names immediately (and allowing the program to exit immediately); and (2) ensuring that LoadBalancedChannel has an initial non-empty set of endpoints to use before the program enters application code.

Alternatives

An alternative would be for callers to access the current set of endpoints and wait with a timeout until there are one or more endpoints. If no endpoints are resolved by the timeout, then a caller could error exit or alert. I don't know offhand whether Channel supports something like this.
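
The startup check proposed in this issue could look roughly like the following. This is a synchronous, std-only sketch rather than ginepro's API: `initial_resolution` is a hypothetical helper, and the `lookup` closure stands in for whatever DNS lookup the channel would perform. It returns the first non-empty address set, or the last failure once the attempts are exhausted, so a caller can decide to exit.

```rust
use std::io;
use std::net::SocketAddr;
use std::thread;
use std::time::Duration;

/// Hypothetical startup check: call `lookup` up to `attempts` times and
/// return the first non-empty result, or a description of the last failure.
pub fn initial_resolution<F>(
    mut lookup: F,
    attempts: u32,
    delay: Duration,
) -> Result<Vec<SocketAddr>, String>
where
    F: FnMut() -> io::Result<Vec<SocketAddr>>,
{
    let mut last_failure = String::from("no lookup attempted");
    for attempt in 1..=attempts {
        match lookup() {
            // A non-empty set means the channel would have endpoints to use.
            Ok(addrs) if !addrs.is_empty() => return Ok(addrs),
            Ok(_) => last_failure = String::from("lookup returned no addresses"),
            Err(err) => last_failure = err.to_string(),
        }
        // Do not sleep after the final attempt.
        if attempt < attempts {
            thread::sleep(delay);
        }
    }
    Err(last_failure)
}
```

A caller that prefers to fail fast could `expect` on the result during startup, while others could log the error and continue with an empty endpoint set.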

Connection timeouts

Bug description

Symptoms

gRPC requests take too long to time out when Kubernetes network policies are misconfigured.
The connection_timeout_is_not_fatal test takes ~75 seconds to finish.

Well-configured timeouts are important for system stability. Requests which take too long can hog resources and block other work from happening.

Causes

I can see two separate timeout problems:

DNS resolution – when ResolutionStrategy::Lazy is used, there is currently no way to apply a timeout just for DNS resolution. If DNS never resolves, requests never complete.
TCP connection – there is currently no way to set a connection timeout. On my machine, the socket seems to time out after ~75s. Even when we do set a connection timeout, tonic doesn't use it!

Even though we're setting our own fairly short timeouts around the overall request, I've seen some strange behaviour where requests are hanging for a long time. I think there's still something else going on that I don't understand, but I expect addressing the two points above will be generally helpful anyway.

To Reproduce

For the TCP connection timeout, just run the tests. I'll supply a test for lazy DNS resolution timeouts in a separate PR.

Expected behavior

Ability to control timeouts for TCP connections and DNS resolution.

Environment

OS: MacOS
Rust version: rustc 1.65.0 (897e37553 2022-11-02)

Additional context

Solutions

The TCP connection timeout is simpler to solve (though I will admit it took me a long time to find): we just need to set connect_timeout in the right places. First, tonic doesn't respect connect_timeout, which will be fixed by hyperium/tonic#1215. When that is merged, we can create our own connect_timeout option on top of it in #38.
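
For comparison, this is what a bounded TCP connect looks like at the standard-library level. It is only an illustration of the concept, not how tonic or ginepro wire it up; `connect_bounded` is a hypothetical helper name.

```rust
use std::io;
use std::net::{SocketAddr, TcpStream};
use std::time::Duration;

/// Connect with an explicit upper bound instead of relying on the OS default,
/// which can be on the order of a minute or more.
/// Passing a zero `timeout` is rejected by the standard library with an error.
pub fn connect_bounded(addr: SocketAddr, timeout: Duration) -> io::Result<TcpStream> {
    TcpStream::connect_timeout(&addr, timeout)
}
```

When the peer silently drops packets, as a misconfigured network policy can, a call like this returns an error once `timeout` elapses rather than waiting for the socket-level default.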

DNS resolution is harder. There are currently two options:

Lazy resolution seems to be the default, but it also seems very hard to put a timeout around (at least, I can't think of a way!). It's possible to put an overall timeout around every request in the application, but this is 1. overly broad and 2. error-prone. It's very easy to forget or simply not know that it's needed ("why don't the other timeouts take care of it?").
Eager resolution has a timeout option. In practice, using this will probably mean doing DNS resolution on service startup, when the LoadBalancedChannel is created. This might be a good thing, preventing services from successfully starting when DNS would never resolve.

Of the two, I wonder if we should favour Eager resolution, and consider changing the default to this.

However, we might want a third option: Active lazy resolution (for want of a better name). Lazy resolution is currently passive, as in it happens in the background on a schedule. It is never actively called in the request flow, which is why it's hard to put a timeout around. Instead, could we implement something which actively calls probe_once() (with a timeout!) as part of the first request (or alternatively when GrpcServiceProbe.endpoints is empty)? This could give us lazy DNS resolution, but with timeouts.

Scratch that, I took a different approach: tower-rs/tower#715. EDIT: Nope, that hasn't worked out. Back to the drawing board.
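
For reference, the general idea of putting a timeout around a single resolution step, as floated above, can be sketched with std only. This is not ginepro code: `with_timeout` is a hypothetical helper, and it uses a thread and a channel where async code would more likely wrap the future in something like `tokio::time::timeout`.

```rust
use std::sync::mpsc;
use std::thread;
use std::time::Duration;

/// Run `work` on a helper thread and wait at most `timeout` for its result.
/// Returns `None` on timeout (or if `work` panics). After a timeout the
/// helper thread is not cancelled; it finishes on its own in the background.
pub fn with_timeout<T, F>(timeout: Duration, work: F) -> Option<T>
where
    T: Send + 'static,
    F: FnOnce() -> T + Send + 'static,
{
    let (sender, receiver) = mpsc::channel();
    thread::spawn(move || {
        // The receiver may already be gone after a timeout; ignore that error.
        let _ = sender.send(work());
    });
    receiver.recv_timeout(timeout).ok()
}
```

A first request could call a helper like this around one probe when no endpoints are known yet, and fail with a clear error if it returns `None`.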

Consider means of configuring Endpoints

Motivations

tonic's Endpoint type has many configurable parameters, for example keep_alive_interval.

ginepro internally constructs Endpoint instances from socket addresses, and then applies some limited configuration values (like tls and timeout), but otherwise most settings remain the default.

Solution

It's probably not practical or ergonomic to specify every configuration value in ginepro's API; however, it would be useful if the library could accept something like a Fn(SocketAddr) -> Result<Endpoint, SomeError>, so that the user could configure an endpoint while the library handles things like periodic DNS lookups.
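
A minimal sketch of that callback shape follows. Everything in it is hypothetical: `Endpoint` is a local stand-in for tonic's type with a single setting, and `EndpointError` and `EndpointFactory` are illustrative names rather than part of ginepro.

```rust
use std::net::SocketAddr;
use std::time::Duration;

/// Stand-in for tonic's `Endpoint`; only the fields needed for the sketch.
#[derive(Debug, Clone, PartialEq)]
pub struct Endpoint {
    pub uri: String,
    pub keep_alive_interval: Option<Duration>,
}

/// Hypothetical error type returned by the user's callback.
#[derive(Debug, PartialEq)]
pub struct EndpointError(pub String);

/// Holds the user-supplied callback; the library would own the DNS loop.
pub struct EndpointFactory<F> {
    configure: F,
}

impl<F> EndpointFactory<F>
where
    F: Fn(SocketAddr) -> Result<Endpoint, EndpointError>,
{
    pub fn new(configure: F) -> Self {
        Self { configure }
    }

    /// Build one endpoint per resolved address; addresses for which the
    /// callback fails are skipped in this sketch.
    pub fn build_all(&self, addrs: &[SocketAddr]) -> Vec<(SocketAddr, Endpoint)> {
        addrs
            .iter()
            .filter_map(|addr| {
                (self.configure)(*addr)
                    .ok()
                    .map(|endpoint| (*addr, endpoint))
            })
            .collect()
    }
}
```

The caller keeps full control over per-endpoint settings inside the closure, while the library only decides when to call it.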

Unexpected blocking when custom lookup service returns IPv6 addresses

Bug description

If we register a custom lookup service that returns IPv6 addresses, gRPC client requests block forever. After tracing through the source code, I believe the bug is in the build_endpoint method:

fn build_endpoint(&self, ip_address: &SocketAddr) -> Option<Endpoint> {
    let uri = format!(
        "{}://{}:{}",
        self.scheme,
        ip_address.ip(),
        ip_address.port()
    );
    // ...
}

If ip_address is a SocketAddr::V6, the correct pattern is {}://[{}]:{} rather than {}://{}:{}; the unbracketed form fails to build an endpoint. The create_changeset method then never reports any change, because build_endpoint returns None:

changeset.extend(
    add_set
        .into_iter()
        .filter_map(|addr| self.build_endpoint(&addr).map(|endpoint| (addr, endpoint)))
        .map(|(addr, endpoint)| Change::Insert(addr, endpoint)),
);
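
One possible way to avoid the per-family branching is to format the whole `SocketAddr`, since its `Display` implementation already brackets IPv6 addresses. The sketch below is a suggestion, not the fix that was applied in ginepro; `endpoint_uri` and `naive_uri` are hypothetical helper names.

```rust
use std::net::SocketAddr;

/// Build an endpoint URI string. `SocketAddr`'s `Display` impl wraps IPv6
/// addresses in brackets, so IPv4 and IPv6 share one code path.
pub fn endpoint_uri(scheme: &str, addr: &SocketAddr) -> String {
    format!("{}://{}", scheme, addr)
}

/// The pattern from the bug report, kept only for contrast: formatting the
/// IP and port separately loses the brackets an IPv6 authority needs.
pub fn naive_uri(scheme: &str, addr: &SocketAddr) -> String {
    format!("{}://{}:{}", scheme, addr.ip(), addr.port())
}
```

For `[::1]:5000`, `endpoint_uri` yields `http://[::1]:5000`, whereas `naive_uri` yields `http://::1:5000`, which is not a valid authority.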

To Reproduce

Implement a custom lookup service that returns ipv6 addresses and register it.

Expected behavior

A custom lookup service that returns either IPv4 or IPv6 addresses should work correctly.

Environment

Environment independent
