tokio-rs / turmoil

Add hardship to your tests

License: MIT License

Rust 100.00%

turmoil's Introduction

Turmoil

This is very experimental

Add hardship to your tests.

Turmoil is a framework for testing distributed systems. It provides deterministic execution by running multiple concurrent hosts within a single thread. It introduces "hardship" into the system via changes in the simulated network. The network can be controlled manually or with a seeded rng.

Quickstart

Add this to your Cargo.toml.

[dev-dependencies]
turmoil = "0.6"

See crate documentation for simulation setup instructions.

Examples

/tests for TCP and UDP.
gRPC using tonic and hyper.
axum

License

This project is licensed under the MIT license.

Contribution

Unless you explicitly state otherwise, any contribution intentionally submitted for inclusion in turmoil by you, shall be licensed as MIT, without any additional terms or conditions.

turmoil's Issues

Infinitely running.

Hi, I'm not sure this is a bug, but given the code below (I know it's a misuse of TcpListener), the sim runs forever. I expected it to stop within 10 logical seconds, and in far less real-world time.

Maybe it makes sense to add a new simulator config like realworld_duration?

#[test]
fn infinite() -> turmoil::Result {
    use std::time::SystemTime;
    use turmoil::{net::TcpListener, Builder};

    let mut sim = Builder::new().epoch(SystemTime::UNIX_EPOCH).build();
    sim.host("s", || async {
        loop {
            TcpListener::bind("0.0.0.0:80").await?;
        }
    });

    sim.run()
}
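One way to picture the proposed real-world limit is a driver loop that checks a wall-clock budget between ticks. This is only a sketch: `run_with_budget` and its `step` closure are hypothetical stand-ins (the closure plays the role of a single simulation tick and returns `true` when the simulation has finished), not part of turmoil's API.

```rust
use std::time::{Duration, Instant};

/// Drive `step` until it reports completion or the real-world budget is spent.
/// `step` is a hypothetical stand-in for one simulation tick.
/// Returns the number of ticks executed on success.
pub fn run_with_budget(
    mut step: impl FnMut() -> bool,
    budget: Duration,
) -> Result<u64, String> {
    let start = Instant::now();
    let mut ticks = 0u64;
    loop {
        ticks += 1;
        if step() {
            return Ok(ticks);
        }
        // The budget is measured in wall-clock time, not simulated time.
        if start.elapsed() >= budget {
            return Err(format!(
                "simulation exceeded real-world budget of {budget:?} after {ticks} ticks"
            ));
        }
    }
}
```

Note that the budget is only checked between ticks, so a tick that never returns would still hang.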

Tokio `enable_io()` error

I was trying to set up a simulator that runs a closure which binds to localhost, like this:

#[test]
fn test_main() -> Result {
    let mut sim = Builder::new()
        .build();

    sim.client("10.129.11.11", async {
        let (mut sock, addr) = TcpListener::bind((IpAddr::from(Ipv6Addr::UNSPECIFIED), 8080))
            .await?
            .accept()
            .await?;
        sock.write_i32(124).await?;
        Ok(())
    });
    sim.run()
}

And I get this error:

It seems turmoil doesn't call tokio's enable_io() method when initializing the tokio runtime?
Related code in turmoil/src/rt.rs:

fn init() -> (Runtime, LocalSet) {
    let mut builder = tokio::runtime::Builder::new_current_thread();

    #[cfg(tokio_unstable)]
    builder.unhandled_panic(tokio::runtime::UnhandledPanic::ShutdownRuntime);

    let tokio = builder.enable_time().start_paused(true).build().unwrap();

    tokio.block_on(async {
        // Sleep to "round" `Instant::now()` to the closest `ms`
        tokio::time::sleep(Duration::from_millis(1)).await;
    });

    (tokio, new_local())
}

Support holding messages after send

Is your feature request related to a problem? Please describe.
No. This is new functionality.

Describe the solution you'd like
During the simulation I'd like to place a "hold" on the link between two hosts. Any messages sent will remain in the queue while the "hold" is active. At a later time I'd like to remove the "hold", which allows delivery for the queued messages.

This is useful for tests that need to control the ordering of events across multiple hosts.
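The requested behavior can be modeled with a queue that is only drained while no hold is active. The sketch below is a toy model under that assumption; `Link` and its methods are hypothetical names, not turmoil types.

```rust
use std::collections::VecDeque;

/// Hypothetical model of a one-way link between two hosts.
#[derive(Debug, Default)]
pub struct Link {
    held: bool,
    queue: VecDeque<String>,
    delivered: Vec<String>,
}

impl Link {
    pub fn new() -> Self {
        Self::default()
    }

    /// Place a hold: messages sent from now on stay in the queue.
    pub fn hold(&mut self) {
        self.held = true;
    }

    /// Remove the hold and deliver everything queued, in send order.
    pub fn release(&mut self) {
        self.held = false;
        self.flush();
    }

    /// Queue a message; it is delivered immediately unless a hold is active.
    pub fn send(&mut self, msg: &str) {
        self.queue.push_back(msg.to_string());
        if !self.held {
            self.flush();
        }
    }

    pub fn delivered(&self) -> &[String] {
        &self.delivered
    }

    fn flush(&mut self) {
        while let Some(msg) = self.queue.pop_front() {
            self.delivered.push(msg);
        }
    }
}
```

Because held messages keep their send order, a test can release the hold at a chosen point and know exactly which events the receiver sees next.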

Support loopback

Hosts currently may only bind to 0.0.0.0.

https://github.com/tokio-rs/turmoil/blob/main/src/net/tcp/listener.rs#L28

Add support to bind 127.0.0.1 to unblock loopback scenarios. We need to decide how network topology is affected by these changes. For example, it doesn't make sense to allow partitions within a host.

support ephemeral port assignments

I tried the following code:

#[test]
fn ephemeral_port() -> Result {
    let mut sim = Builder::new().build();

    sim.client("client", async {
        let sock = bind_to(0).await?;

        // turmoil should assign a port to the ephemeral range
        assert_ne!(sock.local_addr()?.port(), 0);
        assert!(sock.local_addr()?.port() >= 49152);

        Ok(())
    });

    sim.run()
}

It would be nice to support ephemeral port assignment. This is useful for clients that don't care about the specific port number; they just need a free port.

From https://www.rfc-editor.org/rfc/rfc6335#section-6:

o the System Ports, also known as the Well Known Ports, from 0-1023 (assigned by IANA)

o the User Ports, also known as the Registered Ports, from 1024-49151 (assigned by IANA)

o the Dynamic Ports, also known as the Private or Ephemeral Ports, from 49152-65535 (never assigned)
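A minimal sketch of how port 0 could map onto the dynamic range quoted above. `PortTable` is a hypothetical per-host structure, and the lowest-free-port policy is an assumption chosen to keep the result deterministic.

```rust
use std::collections::BTreeSet;

/// Dynamic (ephemeral) port range from RFC 6335, section 6.
const EPHEMERAL_START: u16 = 49152;
const EPHEMERAL_END: u16 = 65535;

/// Hypothetical per-host table of ports that are currently bound.
#[derive(Debug, Default)]
pub struct PortTable {
    in_use: BTreeSet<u16>,
}

impl PortTable {
    /// Bind `requested`; port 0 means "pick any free ephemeral port".
    /// Returns `None` if the port is taken or the range is exhausted.
    pub fn bind(&mut self, requested: u16) -> Option<u16> {
        let port = if requested == 0 {
            // Lowest free port first, so assignment is deterministic.
            (EPHEMERAL_START..=EPHEMERAL_END).find(|p| !self.in_use.contains(p))?
        } else if self.in_use.contains(&requested) {
            return None;
        } else {
            requested
        };
        self.in_use.insert(port);
        Some(port)
    }

    /// Free a port so it can be handed out again.
    pub fn release(&mut self, port: u16) {
        self.in_use.remove(&port);
    }
}
```

Real network stacks usually randomize the choice; a simulator could do the same with its seeded rng and stay reproducible.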

Look into spans for network tracing

See comments in #48 re: spans.

For TcpStream and UdpSocket, spans might simplify the context needed for each event, i.e. SYN, FIN, etc.

Add warning for blocking tasks that block the sim

Related to #139, we should add a warning that prints when a blocking task is still active in the runtime and is preventing the next tick from happening. This can be done by 1) adding a blocking task count metric to tokio-metrics and 2) spinning up a background thread that checks this metric along with some sort of tick count. The thread starts printing once the tick cannot progress.

Setup CI

At minimum, we should run a full build per PR.

See: https://github.com/jonhoo/rust-ci-conf

Return errors instead of panicking, when sending invalid packets.

Currently turmoil will panic if a packet is sent to an IP address that does not exist,
since this results in an invalid access to the index map in top.rs.

This does not mirror the behavior of tokio or std sockets, and panicking seems too extreme,
especially since some applications may create such sockets, expecting errors instead of
panics.

Therefore it might be advantageous to return errors instead of panicking in World::send_message.

Example

This example will panic.

fn main() -> Result {
    let mut sim = Builder::new().build();
    sim.client("client", async move {
        let _ = net::TcpStream::connect("192.168.30.1:80").await?;
        Ok(())
    });

    sim.run()
}

thread 'main' panicked at 'IndexMap: key not found', ~/dev/turmoil/src/top.rs:221:25
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
Error: JoinError::Panic(Id(1), ...)
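The general shape of the fix is to use a fallible lookup instead of indexing, then convert the miss into an `io::Error`. In this sketch, `Topology` is a hypothetical stand-in for the address index in top.rs, and `ConnectionRefused` is an assumed error kind; which kind best mirrors tokio or std is an open question.

```rust
use std::collections::HashMap;
use std::io;
use std::net::IpAddr;

/// Hypothetical stand-in for the address-to-host index.
pub struct Topology {
    hosts: HashMap<IpAddr, String>,
}

impl Topology {
    pub fn new() -> Self {
        Self { hosts: HashMap::new() }
    }

    pub fn register(&mut self, addr: IpAddr, name: &str) {
        self.hosts.insert(addr, name.to_string());
    }

    /// Look up the destination with `get` rather than indexing, so an
    /// unknown address becomes an `io::Error` instead of a panic.
    pub fn route(&self, dst: IpAddr) -> io::Result<&str> {
        self.hosts.get(&dst).map(String::as_str).ok_or_else(|| {
            // Assumption: the error kind here is illustrative only.
            io::Error::new(
                io::ErrorKind::ConnectionRefused,
                format!("no host registered for {dst}"),
            )
        })
    }
}
```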

Calling `run()` after crashing a host errors

Repro:

#[test]
fn run_after_host_crashes() -> Result {
    let mut sim = Builder::new().build();

    sim.host("h", || async { future::pending().await });

    sim.crash("h");

    sim.run()
}

Fails with:

running 1 test
Error: JoinError::Cancelled(Id(1))
test sim::test::run_after_host_crashes ... FAILED

failures:

failures:
    sim::test::run_after_host_crashes

test result: FAILED. 0 passed; 1 failed; 0 ignored; 0 measured; 10 filtered out; finished in 0.00s

Time and clocks

https://github.com/tokio-rs/turmoil/blob/main/src/host.rs#L69

I think code like this should be replaced with something that sources time off a virtual clock. If so, there are two broad patterns to apply:

Pass the instant 'now' in as a param
Give the host a time source and fetch 'now' from said source
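The two patterns can be sketched side by side. All names here (`VirtualClock`, `Host`, `is_expired`) are hypothetical, and the clock is modeled as simulated time elapsed since the start of the run, which is an assumption rather than turmoil's design.

```rust
use std::cell::Cell;
use std::rc::Rc;
use std::time::Duration;

/// Hypothetical virtual clock: simulated time elapsed since the run started.
/// `Rc<Cell<_>>` is enough because the simulation runs on a single thread.
#[derive(Clone, Default)]
pub struct VirtualClock {
    elapsed: Rc<Cell<Duration>>,
}

impl VirtualClock {
    pub fn now(&self) -> Duration {
        self.elapsed.get()
    }

    /// Only the simulation driver should advance time.
    pub fn advance(&self, by: Duration) {
        self.elapsed.set(self.elapsed.get() + by);
    }
}

/// Pattern 1: the caller passes `now` in as a parameter.
pub fn is_expired(now: Duration, deadline: Duration) -> bool {
    now >= deadline
}

/// Pattern 2: the host holds a time source and fetches `now` from it.
pub struct Host {
    clock: VirtualClock,
    deadline: Duration,
}

impl Host {
    pub fn new(clock: VirtualClock, timeout: Duration) -> Self {
        let deadline = clock.now() + timeout;
        Self { clock, deadline }
    }

    pub fn is_expired(&self) -> bool {
        is_expired(self.clock.now(), self.deadline)
    }
}
```

Pattern 1 keeps functions pure and easy to test; pattern 2 avoids threading `now` through every call at the cost of shared state.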

Add a condensed tracing format

Currently, tracing emits a "pretty-print" JSON format for all events. Some scenarios warrant seeing a more condensed version of the output.

Make the format configurable. Perhaps it could look like this?

src(dot) | dst(dot) | what(send, recv, etc.) | timestamp | ...
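As a rough illustration of the one-line layout above, a `Display` impl can render each event as pipe-separated fields. The `Event` struct and its field types are assumptions made for this sketch; in particular, `src` and `dst` are treated as opaque strings and the timestamp as simulated milliseconds.

```rust
use std::fmt;
use std::time::Duration;

/// Hypothetical event record; field names follow the proposed layout.
pub struct Event<'a> {
    pub src: &'a str,
    pub dst: &'a str,
    pub what: &'a str,
    pub at: Duration,
}

impl fmt::Display for Event<'_> {
    /// Render the event as `src | dst | what | timestamp`.
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        write!(
            f,
            "{} | {} | {} | {}ms",
            self.src,
            self.dst,
            self.what,
            self.at.as_millis()
        )
    }
}
```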

Q&A: How is this different from the original `simulation` project?

Add simulated PRNG

We should support a deterministic PRNG for use in retries, hashmaps, etc. We can accomplish this by providing a deterministic version of RandomState.
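For the hashmap part, std already offers one building block: `BuildHasherDefault<DefaultHasher>` creates every hasher with the same fixed keys, unlike `RandomState`. The aliases and the helper below are hypothetical names for this sketch; note that the hashing algorithm itself may still change between Rust releases.

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashMap;
use std::hash::BuildHasherDefault;

/// Unlike `RandomState`, this builds hashers with fixed keys on every run.
pub type DeterministicState = BuildHasherDefault<DefaultHasher>;

/// A `HashMap` whose iteration order is stable for a given insertion sequence.
pub type DeterministicMap<K, V> = HashMap<K, V, DeterministicState>;

/// Insert the keys and report the order in which the map iterates them.
pub fn iteration_order(keys: &[&str]) -> Vec<String> {
    let mut map: DeterministicMap<String, ()> = DeterministicMap::default();
    for k in keys {
        map.insert(k.to_string(), ());
    }
    map.keys().cloned().collect()
}
```

A seeded PRNG for retries would need separate plumbing, for example handing out values from the rng the simulation was built with.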

Implement additional UDP features

It would be useful to extend the current UDP model with the following features:

Randomized packet corruption/truncation
Randomized packet duplication/retransmission
Randomized packet reordering - this can be accomplished by having some jitter assigned to each packet.
~~Preferring new packets instead of old on full receive buffers - currently we drop new packets on full buffers but this isn't usually what network stacks do or what applications expect.~~ - turns out this is exactly what stacks do - see #128 (comment)
Setting the MTU for a path and being able to drop and/or truncate packets larger than that value
Simulate bufferbloat (i.e. latency increases by some function as the number of packets being buffered increases).
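Two of the items above, per-packet jitter for reordering and an MTU limit, can be sketched with std only. Everything here is hypothetical: `Rng` is a tiny xorshift generator standing in for the simulation's seeded rng, and `deliver` assumes all datagrams are sent at the same instant.

```rust
use std::time::Duration;

/// Tiny seeded generator (an xorshift64 variant) so the sketch needs no crates.
pub struct Rng(u64);

impl Rng {
    pub fn new(seed: u64) -> Self {
        // xorshift state must be non-zero.
        Self(seed.max(1))
    }

    pub fn next_u64(&mut self) -> u64 {
        self.0 ^= self.0 << 13;
        self.0 ^= self.0 >> 7;
        self.0 ^= self.0 << 17;
        self.0
    }
}

/// Hypothetical datagram: a sequence number and a payload length in bytes.
pub struct Datagram {
    pub seq: u32,
    pub len: usize,
}

/// Drop datagrams larger than `mtu`, add random jitter to the base latency
/// of the rest, and return their sequence numbers in arrival order.
pub fn deliver(
    packets: &[Datagram],
    mtu: usize,
    base: Duration,
    max_jitter_ms: u64,
    rng: &mut Rng,
) -> Vec<u32> {
    let mut arrivals: Vec<(Duration, u32)> = packets
        .iter()
        .filter(|p| p.len <= mtu)
        .map(|p| {
            let jitter_ms = rng.next_u64() % max_jitter_ms.saturating_add(1);
            (base + Duration::from_millis(jitter_ms), p.seq)
        })
        .collect();
    // Sort by arrival time; ties fall back to the sequence number.
    arrivals.sort();
    arrivals.into_iter().map(|(_, seq)| seq).collect()
}
```

The same seed always yields the same delivery order, which keeps a failing test reproducible.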

Re-starting a crashed host with bounce panics

Repro:

#[test]
fn restart_host_after_crash() -> Result {
    let mut sim = Builder::new().build();

    sim.host("h", || async { future::pending().await });

    // crash and step to execute the err handling logic
    sim.crash("h");
    sim.step()?;

    // restart and step to ensure the host software runs
    sim.bounce("h");
    sim.step()?;

    Ok(())
}

running 1 test
thread 'sim::test::restart_host_after_crash' panicked at 'missing host', src/sim.rs:143:43
stack backtrace:
   0: rust_begin_unwind
             at /rustc/69f9c33d71c871fc16ac445211281c6e7a340943/library/std/src/panicking.rs:575:5
   1: core::panicking::panic_fmt
             at /rustc/69f9c33d71c871fc16ac445211281c6e7a340943/library/core/src/panicking.rs:65:14
   2: core::panicking::panic_display
             at /rustc/69f9c33d71c871fc16ac445211281c6e7a340943/library/core/src/panicking.rs:139:5
   3: core::panicking::panic_str
             at /rustc/69f9c33d71c871fc16ac445211281c6e7a340943/library/core/src/panicking.rs:123:5
   4: core::option::expect_failed
             at /rustc/69f9c33d71c871fc16ac445211281c6e7a340943/library/core/src/option.rs:1879:5
   5: core::option::Option<T>::expect
             at /rustc/69f9c33d71c871fc16ac445211281c6e7a340943/library/core/src/option.rs:741:21
   6: turmoil::sim::Sim::run_with_hosts
             at ./src/sim.rs:143:22
   7: turmoil::sim::Sim::bounce
             at ./src/sim.rs:125:9
   8: turmoil::sim::test::restart_host_after_crash
             at ./src/sim.rs:606:9
   9: turmoil::sim::test::restart_host_after_crash::{{closure}}
             at ./src/sim.rs:596:5
  10: core::ops::function::FnOnce::call_once
             at /rustc/69f9c33d71c871fc16ac445211281c6e7a340943/library/core/src/ops/function.rs:251:5
  11: core::ops::function::FnOnce::call_once
             at /rustc/69f9c33d71c871fc16ac445211281c6e7a340943/library/core/src/ops/function.rs:251:5

On crash, the rt is removed causing this issue.
https://github.com/tokio-rs/turmoil/blob/main/src/sim.rs#L321-L322

Add support for `TcpStream#peek`

Method: TcpStream#peek

I have run into an issue that requires the use of this method. I would be happy to get started on a fix, but I'm uncertain on the approach to take. The issue I'm running across is (1) that self is immutably referenced:

pub async fn peek(&self, buf: &mut [u8]) -> io::Result<usize>

This was previously mutable and relied on poll_peek.

The other issue (2) is that turmoil currently implements the ReadHalf and WriteHalf using a tokio::mpsc channel. However, it doesn't look like the Receiver has an option to immutably read the internal lock-free list.

So my question is whether we should paper over the ReadHalf with an internal data structure, try to get this implemented in tokio/chan or some other potential solution I'm missing?
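To make the "internal data structure" option concrete, here is a sketch of a staging buffer that sits in front of the channel. `ReadBuffer` is a hypothetical type; it only covers the non-blocking part, and a real `peek(&self)` would still need interior mutability or similar to wait for data that has not arrived yet.

```rust
use std::collections::VecDeque;

/// Hypothetical staging buffer for bytes already received by a read half.
#[derive(Default)]
pub struct ReadBuffer {
    staged: VecDeque<u8>,
}

impl ReadBuffer {
    /// Called when a segment arrives from the (simulated) channel.
    pub fn push(&mut self, bytes: &[u8]) {
        self.staged.extend(bytes.iter().copied());
    }

    /// Copy up to `buf.len()` bytes into `buf` without consuming them.
    pub fn peek(&self, buf: &mut [u8]) -> usize {
        let n = buf.len().min(self.staged.len());
        for (dst, src) in buf.iter_mut().zip(self.staged.iter()) {
            *dst = *src;
        }
        n
    }

    /// Discard up to `n` bytes from the front, as a read would.
    pub fn consume(&mut self, n: usize) {
        let n = n.min(self.staged.len());
        drop(self.staged.drain(..n));
    }
}
```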

Explore state exploration in turmoil

State exploration entails navigating all (or some portion of) the possible states a program can enter during its execution. Model checkers exist (TLA+, P, etc.), but they require building a model that is separate from the actual implementation.

turmoil provides an interesting opportunity where we are running all of the real code, but with a simulated network. The network provides a place to both view states and control state transitions. Can we expose the right APIs to make state exploration possible?

Note that this approach differs from fuzzing the network, which is already possible today.

Question: Reproduce, Random and Time

I read the examples and test code in turmoil but still have some questions:

In the similar project MadSim, there is a "Test Seed" for every run that generates deterministic time and random numbers, so users can use the same seed to get exactly the same result. Can turmoil do something like this, and how?

BTW, I thought Sim::epoch() did this, but I got a different result on every run with the code below:

use rand::SeedableRng;
use std::time::SystemTime;
use turmoil::{Builder, Result};

fn main() {
    println!("Hello, world!");
}

#[test]
fn test_main() -> Result {
    let mut sim = Builder::new()
        .epoch(SystemTime::UNIX_EPOCH)
        .rng(rand::rngs::StdRng::seed_from_u64(10))
        .build();

    sim.client("host", async {
        println!("Hello world!");

        // now() is different on every run, and there seems to be no API in turmoil to mock time.
        println!("now: {:?}", std::time::Instant::now());
        Ok(())
    });
    sim.run()
}

Can turmoil simulate IO other than the network (for example, disk IO)?

Add support for SO_LINGER to `TcpStream`

[Placeholder]

Regex matching throws exception in Pair

When using hold with regular expressions, Pair throws an exception because it expects the two IpAddr to be different.
Here is a minimal test:

#[test]
#[cfg(feature = "regex")]
fn hold_all() -> Result {
    let mut sim = Builder::new().build();

    sim.host("host", || async { future::pending().await });
    sim.client("client", async {
        hold(regex::Regex::new(r".*")?, regex::Regex::new(r".*")?);
        Ok(())
    });

    sim.run()?;
    Ok(())
}

Fails with:

thread 'sim::test::hold_all' panicked at 'assertion failed: `(left != right)`
  left: `192.168.0.1`,
 right: `192.168.0.1`', src/top.rs:35:9
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
thread 'sim::test::hold_all' panicked at 'a spawned task panicked and the LocalSet is configured to shutdown on unhandled panic', /Users/foo/.cargo/registry/src/github.com-1ecc6299db9ec823/tokio-1.26.0/src/task/local.rs:603:17

`spawn_blocking` blocks the sim runtime

main...lucio/spawn-blocking-bug#diff-ace3e8abab9fb7b84efd253a7cea095084172b5cc3431426e3391305a554b152R46

With this example code it's possible to never run the client future, as the server one will hang until all spawn_blocking tasks complete. The real answer here is to not use threads, since they remove determinism. But this is still surprising behavior. The workaround is to use another thread provider, such as a different tokio runtime (where you call spawn_blocking on that) or std::thread.

cc @MarinPostma @mcches

Support async dns resolution

Tokio's ToSocketAddrs (https://docs.rs/tokio/latest/tokio/net/trait.ToSocketAddrs.html) is an async operation under the hood, which presents surface area for DNS to hang, etc.

Error `ConnectionRefused` when binding UdpSocket to `localhost`

This line in the UdpSocket implementation suggests that it can be used with localhost, :: or 0.0.0.0.

But if we try to change the binding address from "unspecified" to "localhost" in the udp tests, they all fail with ConnectionRefused error.

If this is the expected behaviour for the socket, that part of documentation can be seen as somewhat misleading.

Support client and host software errors

Currently, we only have panic to trigger failure during simulation runs. This makes writing both the test and host software a little clunky, as we can't use ? to return on error.

To make the experience better, we can define a dynamic Error type and have both client and hosts supply a future that aligns. On each run() iteration, we can check if any host finishes with an error, end the simulation and return the error.

e.g.

pub type TurmoilResult<T = ()> = std::result::Result<T, Box<dyn std::error::Error>>;

Add simulated disk Io

Is your feature request related to a problem? Please describe.
No. This is new functionality.

Describe the solution you'd like
Hosts have an Io concept today, but it is just network. Add the ability to write/read to/from disk, and have this state persist across host restarts.

Make the tracing sink configurable

Tracing currently accepts a path to a file. Make this more flexible by accepting any Write. Using stdout is useful for short-running tests.

Document determinism guidelines

Turmoil is built on the concept of deterministic execution. Structures such as HashMap initialize with a non-deterministic RandomState. Both the internals of turmoil and applications using it need to buy in.

e.g. HashMap, HashSet, tokio::select!, etc.

Document the guidelines.

Support multiple network interfaces

This refactor aims to introduce the ability of nodes to have multiple addresses
in distinct subnets.

Immediate Goals

Each node should be bound to a unique Ipv4Addr AND a unique Ipv6Addr
All bound addresses should be in a predefined subnet (like 192.168.0.0/16)
The available subnets should be configurable in the Builder
Addresses can be either automatically or manually assigned
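A minimal sketch of what a subnet type with membership checks and automatic assignment could look like, IPv4 only. `Subnet` and its methods are hypothetical, and skipping the network and broadcast addresses during assignment is an assumption.

```rust
use std::net::Ipv4Addr;

/// Hypothetical IPv4 subnet, e.g. 192.168.0.0/16.
#[derive(Clone, Copy, Debug)]
pub struct Subnet {
    base: u32,
    prefix: u8,
}

impl Subnet {
    pub fn new(addr: Ipv4Addr, prefix: u8) -> Self {
        assert!(prefix <= 32, "an IPv4 prefix is at most 32 bits");
        Self { base: u32::from(addr) & Self::mask(prefix), prefix }
    }

    fn mask(prefix: u8) -> u32 {
        // A shift by 32 would overflow, so /0 is handled separately.
        if prefix == 0 { 0 } else { u32::MAX << (32 - u32::from(prefix)) }
    }

    /// Manual assignment can be validated against the subnet.
    pub fn contains(&self, addr: Ipv4Addr) -> bool {
        (u32::from(addr) & Self::mask(self.prefix)) == self.base
    }

    /// Automatic assignment: the n-th host address (1-based), skipping the
    /// network address and the broadcast address.
    pub fn nth(&self, n: u32) -> Option<Ipv4Addr> {
        let size = 1u64 << (32 - u32::from(self.prefix));
        if n == 0 || u64::from(n) >= size - 1 {
            return None;
        }
        Some(Ipv4Addr::from(self.base + n))
    }
}
```

An IPv6 counterpart would follow the same shape with `u128` arithmetic.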

Challenges

I have tried to implement this and have come to the conclusion that some major changes, internally AND externally, would be necessary. Notably:

Nodes, and thus Rt/Host, can no longer be identified by a single IpAddr. The best possible solution would be to identify them by something like a MAC addr, but that would warrant major internal changes
lookup would need to return more than one possible address, thus the public API of ToIpAddr / lookup / lookup_many would need to change. This could be a good moment to introduce API compliance with either std::net or tokio::net
using ToIpAddr for module creation creates problems when statically assigning addresses. Even without this refactoring, nodes with explicitly assigned addresses cannot have human-readable names, since their place is traded for the address assignment. Mixed IP subnets only enlarge this problem. The current API of Sim::client / Sim::host provides no way to explicitly assign both an Ipv4 and an Ipv6 address. In short, the current API cannot support a node with explicitly assigned addresses, let alone a human-readable name, so changes would be necessary.

In my opinion, this amount of change would exceed the scope of one PR, so it might be beneficial to make step-by-step changes to the public API. However, this warrants discussion.

Some related thoughts

While not in the scope of this refactoring, binding sockets to specific addresses may be beneficial
in std/tokio, Ipv6 sockets bound to [::] can receive incoming Ipv4 packets (addresses are mapped to Ipv6), however the reverse is not possible. This seems like a rare edge case, so I do not know whether we should ever support this behaviour
It might prove useful in the future to refrain from hardcoding only two possible addresses in two possible subnets per node. Supporting a set of bound addresses+subnets might be beneficial to a) remodel localhost to use top.rs links or b) add support for multiple interfaces, thus multiple subnets, should that ever be a goal

As a reference, my current test implementation can be found here.
I have closed the corresponding PR #125, since it is already out of date.

Progress

types representing subnets
dns lookups that may return multiple IpAddrs
decoupling dns lookup and dns registration
uuid as Host/Rt identifiers
updated node creation API
subnet configuration in Builder
multiple network interfaces per node (according to subnet configuration)
socket support for binding to specific addresses

Support binding multiple addrs within a host

Listener::bind() was added in #35, but it assumes the sole host's SocketAddr. It's both reasonable to support this (say for loopback or multiple acceptors within a process) and necessary to mirror tokio::net.

Support bouncing a host

Is your feature request related to a problem? Please describe.
No. This is new functionality.

Describe the solution you'd like
Hosts in the simulation are simply futures. During run_until() I'd like to have the ability to "bounce" a host (cancel, join and restart).

Fix AsyncRead impl for TcpStream

Is your feature request related to a problem? Please describe.

Yes, turmoil::net::TcpStream does not behave like tokio::net::TcpStream.

AsyncRead is broken if the supplied buf does not have capacity for the next message.

#[test]
fn read_buf_smaller_than_msg() -> Result {
    let mut sim = Builder::new().build();

    sim.client("server", async {
        let listener = bind().await?;
        let (mut s, _) = listener.accept().await?;

        s.write_u64(1234).await?;

        Ok(())
    });

    sim.client("client", async {
        let mut s = TcpStream::connect(("server", PORT)).await?;

        let mut buf = [0; 1];
        // panic!: buf.len() must fit in remaining()
        let _r = s.read(&mut buf).await?;

        Ok(())
    });

    sim.run()
}

See:

turmoil/src/net/tcp/stream.rs

Line 137 in 2d0fadd

buf.put_slice(bytes.as_ref());

Describe the solution you'd like

Align turmoil with tokio::net.
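The stream semantics being asked for boil down to: copy as many bytes as fit into the caller's buffer and keep the rest for the next read. The sketch below shows only that bookkeeping; `Partial` is a hypothetical helper, not the actual fix in stream.rs.

```rust
/// Hypothetical receive state: one message that may be only partly read.
#[derive(Default)]
pub struct Partial {
    bytes: Vec<u8>,
    pos: usize,
}

impl Partial {
    /// Store the next message received from the peer.
    pub fn load(&mut self, msg: Vec<u8>) {
        self.bytes = msg;
        self.pos = 0;
    }

    /// True once every byte of the current message has been read.
    pub fn is_empty(&self) -> bool {
        self.pos == self.bytes.len()
    }

    /// Copy as much as fits into `buf` and remember where we stopped,
    /// instead of requiring `buf` to hold the whole message.
    pub fn read_into(&mut self, buf: &mut [u8]) -> usize {
        let rest = &self.bytes[self.pos..];
        let n = rest.len().min(buf.len());
        buf[..n].copy_from_slice(&rest[..n]);
        self.pos += n;
        n
    }
}
```

With this bookkeeping, the one-byte buffer from the repro would receive the eight bytes of the u64 over eight reads rather than panicking.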

Fix socket close behavior when there is unread data

[Placeholder]

Loopback is incomplete

The following scenarios exist for client -> server within the same host:

(Only Tcp is shown, but we need to handle it for Udp as well)

// bind | connect

// 0s | 127.0.0.1
// client: local Ok(127.0.0.1:49582), peer Ok(127.0.0.1:1234)
// server: 127.0.0.1:49582, local Ok(127.0.0.1:1234), peer Ok(127.0.0.1:49582)

// 127.0.0.1 | 127.0.0.1
// client: local Ok(127.0.0.1:49622), peer Ok(127.0.0.1:1234)
// server: 127.0.0.1:49622, local Ok(127.0.0.1:1234), peer Ok(127.0.0.1:49622)

// 0s | 192.168.1.42
// client: local Ok(192.168.1.42:49716), peer Ok(192.168.1.42:1234)
// server: 192.168.1.42:49716, local Ok(192.168.1.42:1234), peer Ok(192.168.1.42:49716)

// 127.0.0.1 | 192.168.1.42
// Error: Os { code: 61, kind: ConnectionRefused, message: "Connection refused" }

The first two work as expected, including setting the correct local|peer_addr on each side of the stream. The last two cause panics today due to holes in the stop-gap implementation for loopback.

We need to address this with workarounds and/or include this in the refactor being discussed in #132.
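The four scenarios can be summarized as a single decision function. This sketch encodes only what the table above shows; `connect_same_host` and its parameters are hypothetical, and it returns the server-side local address on success.

```rust
use std::io;
use std::net::{IpAddr, SocketAddr};

/// Resolve a connection attempt between two sockets on the same host.
/// `host` is the host's own address, `bind` is where the listener is bound,
/// and `dst` is the address the client connects to.
pub fn connect_same_host(
    host: IpAddr,
    bind: SocketAddr,
    dst: SocketAddr,
) -> io::Result<SocketAddr> {
    let refused = || io::Error::from(io::ErrorKind::ConnectionRefused);

    if bind.port() != dst.port() {
        return Err(refused());
    }
    // The destination must be loopback or the host's own address.
    if !(dst.ip().is_loopback() || dst.ip() == host) {
        return Err(refused());
    }
    // A listener on the unspecified address accepts on every local address;
    // otherwise the bound address must match the destination exactly.
    if bind.ip().is_unspecified() || bind.ip() == dst.ip() {
        Ok(dst)
    } else {
        Err(refused())
    }
}
```

In each successful case in the table the client's local IP equals the destination IP, so the peer addresses on both sides follow directly from `dst`.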

More interesting examples of testing distributed system building blocks

Description

It would be great to provide a few more complex examples that showcase more of Turmoil's capabilities. The examples should be succinct, with lightweight dependencies, but should functionally test:

Fault tolerance
Scalability to many nodes

Proposal

Food for thought - https://martinfowler.com/articles/patterns-of-distributed-systems/ with a few candidates:

Heartbeat: seems straightforward, can be made more complex with Gossip?
Leader - Follower: some reference implementation here for raft, seems to touch heartbeat, quorum as well - too big?
2PC
Others?? Happy to contribute

Clean up network topology semantics

The simulation has the ability to manually and randomly change network conditions during a run. This was initially designed for the datagram (UDP) APIs, and does not fully translate to streams (TCP), namely dropping messages. The goal of the simulation is not to test that TCP works; rather, it aims to test that applications built over TCP work correctly. These applications lean on the guarantees that TCP provides, i.e. message order.

Currently, one can apply two types of network partitions:

partition: All messages are dropped. Works for datagram. Not supported on established streams, however it works for new connections as we only send one message for the 3-way handshake.

See: https://github.com/tokio-rs/turmoil/blob/main/src/world.rs#L250

hold: Hold all messages "on the network". Works for both modes.

The goal of this issue is to figure out consistent semantics and naming for both networking modes.

Cannot build project with turmoil

I am trying to use turmoil for one of my projects, but it is failing to build.

steps to reproduce:

cargo new test-turmoil
cd test-turmoil
cargo add turmoil
cargo check

yields

error[E0433]: failed to resolve: could not find `UnhandledPanic` in `runtime`
  --> /Users/mpostma/Documents/code/rust/turmoil/src/rt.rs:94:42
   |
94 |         .unhandled_panic(tokio::runtime::UnhandledPanic::ShutdownRuntime)
   |                                          ^^^^^^^^^^^^^^ could not find `UnhandledPanic` in `runtime`

error[E0433]: failed to resolve: could not find `UnhandledPanic` in `runtime`
   --> /Users/mpostma/Documents/code/rust/turmoil/src/rt.rs:108:43
    |
108 |     local.unhandled_panic(tokio::runtime::UnhandledPanic::ShutdownRuntime);
    |                                           ^^^^^^^^^^^^^^ could not find `UnhandledPanic` in `runtime`

error[E0599]: no method named `unhandled_panic` found for mutable reference `&mut tokio::runtime::Builder` in the current scope
  --> /Users/mpostma/Documents/code/rust/turmoil/src/rt.rs:94:10
   |
94 |         .unhandled_panic(tokio::runtime::UnhandledPanic::ShutdownRuntime)
   |          ^^^^^^^^^^^^^^^ method not found in `&mut tokio::runtime::Builder`

error[E0599]: no method named `unhandled_panic` found for struct `LocalSet` in the current scope
   --> /Users/mpostma/Documents/code/rust/turmoil/src/rt.rs:108:11
    |
108 |     local.unhandled_panic(tokio::runtime::UnhandledPanic::ShutdownRuntime);
    |           ^^^^^^^^^^^^^^^ method not found in `LocalSet`

This seems to be caused by the fact that the tokio dependency in the turmoil project is set to 0.19, but unhandled_panic is not part of this version.

I tried to patch the tokio version in turmoil and use the path dependency, but this still does not work.

This is on both macOS and a fresh linux VM.
