
A tensor-aware point-to-point communication primitive for machine learning


tensorpipe's Introduction

TensorPipe

The TensorPipe project provides a tensor-aware channel to transfer rich objects from one process to another while using the fastest transport for the tensors contained therein (e.g., CUDA device-to-device copy).

Getting started

First clone the repository:

$ git clone --recursive https://github.com/pytorch/tensorpipe

Then, build as follows (using ninja instead of make):

$ cd tensorpipe
$ mkdir build
$ cd build
$ cmake ../ -GNinja
$ ninja

You can find test executables in build/tensorpipe/test.

Interface

There are four classes you need to know about:

  • tensorpipe::Context, which keeps track of the global state of the system, such as thread pools, open file descriptors, etc.
  • tensorpipe::Listener, which allows one process to open an entry point for other processes to connect to.
  • tensorpipe::Pipe, the one communication primitive that this entire project is about. You can obtain one either by connecting to the listener of another process or from such a listener when another process connects to it. Once you have a pipe, you can send messages on it, and that's the whole point.
  • tensorpipe::Message, which is the language that pipes read and write in. Pipes are streams of structured messages (not just raw byte buffers), and a message is composed of a "core" payload (memory living on CPU) plus a list of tensors (memory living on any device, like GPUs). A sketch of how these classes fit together follows.
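
A minimal sketch of the setup on both sides. The umbrella header and the method names (listen, accept, connect) are assumptions based on the description above and may not match the actual headers exactly.

#include <memory>
#include <tensorpipe/tensorpipe.h>  // assumed umbrella header

void server() {
  auto context = std::make_shared<tensorpipe::Context>();
  // Open an entry point for other processes to connect to (the address
  // format is transport-specific; a TCP-style URL is assumed here).
  auto listener = context->listen({"tcp://127.0.0.1:5555"});
  listener->accept([](const tensorpipe::Error& error,
                      std::shared_ptr<tensorpipe::Pipe> pipe) {
    if (error) {
      return;
    }
    // The pipe is ready: messages can now be written to and read from it.
  });
}

void client() {
  auto context = std::make_shared<tensorpipe::Context>();
  auto pipe = context->connect("tcp://127.0.0.1:5555");
  // ... write and read messages on the pipe ...
}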

Sending a message from one end of the pipe to the other can be achieved using the write method, which takes a message (with the data to send) and a callback which will be invoked once the sending has completed. This callback will be invoked with an error (if one happened) and with the message.
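
As a sketch (the Message field names are assumptions based on the description above), a write could look like this; the key points are the move into the pipe and the fact that the buffer must stay untouched until the callback runs.

#include <memory>
#include <utility>
#include <vector>

// The payload must outlive the write: it is only safe to touch it again from
// inside the callback.
void sendPayload(std::shared_ptr<tensorpipe::Pipe> pipe,
                 std::vector<char>& payload) {
  tensorpipe::Message message;
  message.data = payload.data();    // assumed field names for the core payload
  message.length = payload.size();
  pipe->write(
      std::move(message),
      [](const tensorpipe::Error& error, tensorpipe::Message message) {
        // The memory behind the message is ours again: free or reuse it here.
        if (error) {
          // Handle/log the error; the transfer did not complete.
        }
      });
  // Do not touch or free the payload at this point: the write may still be
  // in flight.
}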

Receiving a message takes two steps: on an incoming message, first the pipe asks you to provide some memory to hold the message in, and then you ask the pipe to read the data into that memory. In order to do this, first you must register a callback that will be notified for incoming messages. This is performed by calling the readDescriptor method with said callback. The callback will be invoked with a so-called descriptor, which can be seen as a "message skeleton", i.e., a message with no buffers attached to it (they are set to null pointers). The job of this callback is filling in those buffers, either by allocating the required memory or by obtaining it from somewhere else (from a cache, as a slice of a batch that's being assembled, ...). This descriptor also contains some metadata, given by the sender, which can be used to provide allocation hints or any other information that can help the receiver determine where to store the data. Once the message's buffers are ready, you can tell the pipe to go ahead and fill them in with the incoming data by passing the message to the read method, together with a callback which will be called when all the data has been received and stored. As when writing, this callback will be given a (possibly empty) error and the original message. The readDescriptor callback is one-shot, which means that after it fires it "expires" and will not be called again. It must be re-armed for a new event to be received.
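
A sketch of the two-step receive described above, again with assumed field and method names: the readDescriptor callback gets the message skeleton, fills in its buffers, and hands the message back to read. The pipe is captured by value so that it stays alive while the callbacks are pending.

#include <cstdint>
#include <memory>
#include <utility>

void receiveOne(std::shared_ptr<tensorpipe::Pipe> pipe) {
  pipe->readDescriptor(
      [pipe](const tensorpipe::Error& error, tensorpipe::Message message) {
        if (error) {
          return;
        }
        // Fill in the null buffers (here with plain heap allocations; a real
        // application could use a cache, a slice of a batch, a GPU buffer, ...).
        message.data = new uint8_t[message.length];
        for (auto& tensor : message.tensors) {
          tensor.data = new uint8_t[tensor.length];
        }
        pipe->read(
            std::move(message),
            [](const tensorpipe::Error& error, tensorpipe::Message message) {
              // All buffers are now filled (unless error is set) and back
              // under our control; this is where they can be consumed/freed.
            });
      });
}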

When you pass a message to the pipe, to send it or to receive into it, you must not tamper with the underlying memory until the callback has completed, even if the write or read call already returned. (The write and read calls, and all other calls, are non-blocking so that it's easier to schedule asynchronous parallel transfers without having to use threads.) This means you cannot deallocate the memory or alter it in any way, as the pipe may still be reading or modifying it. In other terms, you relinquish control over the memory when you pass a message to the pipe, only to reacquire it once the message is given back to you in the callback. This contract is encoded by the requirement to move the messages into and out of the pipe (using rvalue references). Also, because of this agreement, all callbacks will always be called, even if the pipe is closed or if it errors, in order to give back the memory.

The order in which messages are written to a pipe is preserved when these messages are read on the other side. Moreover, for a given pipe endpoint, the callbacks of the performed operations are executed in the same order that these operations were scheduled, even if the operations are performed asynchronously or out-of-band and thus may overlap or occur out of order. What this means is that if two write operations are scheduled one after the other back-to-back, even if the second one completes before the first one, its callback is delayed until the first one also completes and its callback is invoked. The same applies for reads. All the callbacks of all the pipes in a given context are called from the same per-context thread and thus no two callbacks will occur at the same time. However, different contexts will use different threads and their callbacks may thus overlap.

All the callbacks are invoked with an error reference. This may be "empty", i.e., indicate that no error has in fact occurred; in this case, the error object evaluates to false. In case of an actual error it will instead evaluate to true. When invoked with an error, the remaining arguments of the callback may be meaningless. For the read and write callbacks they will still contain the message that these methods were invoked with, but for the readDescriptor callback the message will be empty or invalid and should not be used.

There is no expectation for the readDescriptor callback to be armed at all times. Similarly, it is not necessary to call the read method immediately after a descriptor has been read. Both these possibilities are by design, in order to allow the user of the pipe to apply some backpressure in case it's receiving messages at a faster rate than it can handle, or for any other reason. This backpressure will be propagated to the lower-level components as far down as possible (e.g., by stopping listening for readability events on the socket file descriptor).

Transports and channels

TensorPipe aims to be "backend-agnostic": it doesn't want to be restricted to a single way of copying data around but wants to be able to choose the fastest medium from a library of backends, based on the circumstances (e.g., are the two processes on the same machine?) and on the available hardware (e.g., are the GPUs connected with NVLink?). TensorPipe strives to have the largest selection of backends, enabling users to implement specific backends for their systems (should the default ones prove limited) and encouraging contributions.

The two processes that are establishing a pipe will automatically negotiate during setup to determine which of the backends they have at their disposal can be used and how well they would perform, in order to choose the best one in a way that is completely transparent to the user.

Backends come in two flavors:

  • Transports are the connections used by the pipes to transfer control messages, and the (smallish) core payloads. They are meant to be lightweight and low-latency. The most basic transport is a simple TCP one, which should work in all scenarios. A more optimized one, for example, is based on a ring buffer allocated in shared memory, which two processes on the same machine can use to communicate by performing just a memory copy, without passing through the kernel.

  • Channels are where the heavy lifting takes place, as they take care of copying the (larger) tensor data. High bandwidths are a requirement. Examples include multiplexing chunks of data across multiple TCP sockets and processes so as to saturate the NIC's bandwidth, or using a CUDA memcpy call to transfer memory from one GPU to another using NVLink.

These different usage patterns promote different design choices when implementing transports and channels, which means the two are not perfectly interchangeable. For example, a TCP-based transport is best implemented using a single connection, whereas a TCP-based channel will benefit from using multiple connections, chunking and multiplexing the payload over them in order to saturate the bandwidth even on the most powerful NICs.

Moreover, the APIs of transports and channels put different constraints on them, which demand and permit different approaches. As a rule of thumb, we require more from the transports: the only out-of-band information they can use is a simple address, which is all they can use to bootstrap the connection, and they need to include some "signaling" capabilities (a write on one side "wakes up" the other side by causing a read). Channels, on the other hand, have much looser requirements: they basically just need to implement a memcpy and, for anything beyond that, they can leverage a transport that the pipe gives to them for support.

License

TensorPipe is BSD licensed, as found in the LICENSE.txt file.


tensorpipe's Issues

Have channel factories return their descriptors as unserialized protobufs

Right now some channels represent their descriptors as protobufs but then serialize them to bytes and return those bytes, which means an extra allocation and an extra copy. We could optimize this a bit in some cases by having them return the protobuf, which we would then embed in the core protobuf and serialize all at once, allowing us to use more performant serializations (like when using shm, where we could write directly to the target ringbuffer).

rpc_init fails with runtime error using tensorpipe backend on torch 1.6

terminate called after throwing an instance of 'std::runtime_error'
what():  In initFromLoop at /pytorch/third_party/tensorpipe/tensorpipe/transport/uv/listener.cc:126 "rv < 0: invalid argument"

This is the traceback Python gives when passing the tensorpipe backend to rpc_init. Does it have something to do with libuv?

And mandatory - I'm using Arch (sorry, but I had to)

strerror isn't thread-safe

The strerror function (which returns the human-readable message for an error number) isn't thread-safe. In practice, on Linux the race only occurs for invalid error numbers, because in that case strerror creates a dynamic error message (something like Unknown error N) in a process-wide buffer, so different threads could race on it.

This is highly unlikely to pose a problem for us, but it's a nice-to-fix thing. This issue is to remember it.

Find out what's the oldest version of CMake we can support

I just tried to build TP on my laptop running a recent version of Arch Linux and got an error that I only had CMake 3.16.3 whereas 3.17 was required. I was going to update but I first tried just lowering the required version and everything worked. We should not require the very latest version unless there's a real reason.

Fix deadlock in Basic/ChannelTest.FactoryIsNotJoined

Accept callback is never called in Basic/ChannelTest.FactoryIsNotJoined on OSX. This is the reason why ctest -R Basic/ChannelTest.FactoryIsNotJoined/0 hangs.

Edit: The issue seems to be that the connection is being destroyed before being established, hence the accept callback is never fired.

Minimal example:

TEST_P(TransportTest, UVHangs) {
  testConnection(
      [&](std::shared_ptr<transport::Connection> conn) {
        std::cerr << "foo" << std::endl;
      },
      [&](std::shared_ptr<transport::Connection> conn) {
        std::cerr << "bar" << std::endl;
      });
}

Make SHM support arbitrarily large libnop objects

In #58 and #97 we created a fastpath in SHM to serialize and deserialize protobufs directly to and from the ringbuffer. However, due to a limitation in protobuf's zero copy streams, we cannot do so when the serialized protobuf would be larger than the whole ringbuffer. So, at the moment, we fail in those cases. This will typically not be a problem, as under normal operation the protobufs we send are tiny (a few hundred bytes) and the buffer is huge (2MiB). However, a careless user could put large data in a message's metadata, which would be embedded in the protobuf and cause a failure. It would be nice if we could now re-implement a slowpath in SHM that first serializes those large protobufs (and only those!) to a temporary buffer and then chunks that buffer over many writes.

Bundle Python bindings in single shared library

Currently when building the Python bindings we end up with a lot of separate shared libraries:

$ ls venv/lib/python3.6/site-packages/
easy-install.pth                              libtensorpipe_uv.so                           pkg_resources-0.0.0.dist-info/
easy_install.py                               libuv.so                                      __pycache__/
libtensorpipe_basic.so                        libuv.so.1                                    pytensorpipe.cpython-36m-x86_64-linux-gnu.so
libtensorpipe_proto_channel.so                libuv.so.1.0.0                                setuptools/
libtensorpipe_proto.so                        pip/                                          setuptools-39.0.1.dist-info/
libtensorpipe_shm.so                          pip-9.0.1.dist-info/                          tensorpipe-0.0.0.egg-info/
libtensorpipe.so                              pkg_resources/

Basically, it seems our internal CMake target structure is replicated as .so files.

This has already proven to cause problems. For example, I'm getting this:

>>> import pytensorpipe
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ImportError: libtensorpipe_basic.so: cannot open shared object file: No such file or directory

I'm not sure what the exact cause of that problem is, but I suspect it may be related to the use of separate .so files. Perhaps we should try to stick all of TensorPipe (+libuv) into a single .so file.

short read: got 2147479552 bytes while expecting to read 2560000000 bytes

I used RPC+tensorpipe to train a network with a layer of size 250k * 2560 * sizeof(float) = 2560000000:

[W tensorpipe_agent.cpp:756] RPC agent for worker8 encountered error when reading incoming request from master: short read: got 2147479552 bytes while expecting to read 2560000000 bytes

Does RPC have any limitation on the number of bytes to send/receive?

stubs/libcuda.so is loaded instead of libcuda.so

On my machine, tensorpipe loads /usr/lib/x86_64-linux-gnu/stubs/libcuda.so instead of /usr/lib/x86_64-linux-gnu/libcuda.so. The stubs libcuda.so is just a stub and does not have any (correct) function implementations, resulting in the exception In create at /home/a/nethack/moorpc/src/tensorpipe/tensorpipe/common/cuda_lib.h:105 "res != CUDA_SUCCESS"

Changing tensorpipe from loading "libcuda.so" to "libcuda.so.1" fixes/works around the issue.

The same issue occurs on my devfair, where it loads /public/apps/cuda/10.1/lib64/stubs/libcuda.so. (pytorch has already loaded /lib/x86_64-linux-gnu/libcuda.so.1)
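
A sketch of the proposed workaround: prefer the versioned soname (which the stubs directory does not ship) and only then fall back to the unversioned name.

#include <dlfcn.h>

void* loadCudaDriver() {
  // The real driver's soname is libcuda.so.1; the stubs directory only
  // provides the unversioned libcuda.so.
  void* handle = dlopen("libcuda.so.1", RTLD_LAZY | RTLD_GLOBAL);
  if (handle == nullptr) {
    handle = dlopen("libcuda.so", RTLD_LAZY | RTLD_GLOBAL);
  }
  return handle;
}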

Use TP_CUDA_CHECK in tests

Currently we are using EXPECT_EQ(cudaSuccess, cudaFoo()), which works but leads to unhelpful error messages.
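
For reference, one possible shape for such a macro (a sketch, not the actual TP_CUDA_CHECK definition): it reports the failing expression, the file and line, and the CUDA error string, instead of a bare value mismatch.

#include <cuda_runtime.h>
#include <sstream>
#include <stdexcept>

#define TP_CUDA_CHECK_SKETCH(op)                                   \
  do {                                                             \
    cudaError_t err = (op);                                        \
    if (err != cudaSuccess) {                                      \
      std::ostringstream oss;                                      \
      oss << #op << " failed at " << __FILE__ << ":" << __LINE__   \
          << ": " << cudaGetErrorString(err);                      \
      throw std::runtime_error(oss.str());                         \
    }                                                              \
  } while (0)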

Make pipes use only one channel of each type

Currently the two endpoints of a pipe open one instance of all the channel types they have in common. However, due to a strict priority order between them, they will inevitably end up using only one of them (for each type: CPU, CUDA, ...) leaving the other ones unused. We should avoid opening them in the first place. This may also simplify a bit the content of the libnop messages.

Support SHM ringbuffer chunking writes larger than the buffer's size

We have by default a 2MB buffer, which is plenty for the small control messages we expect to send on it. However, to support some corner cases (like messages with millions of tensors, or the SHM transport being used as a channel using the basic channel factory) we may want to implement the ability for writes larger than the buffer's size to be split into chunks which are written separately. This should be transparent to the transport's user.
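
A sketch of the idea, with an illustrative ringbuffer interface (not the actual classes): a write larger than the buffer is split into pieces that each fit, and the pieces are written one at a time as the peer drains the buffer.

#include <algorithm>
#include <cstddef>
#include <cstdint>

// RB is an illustrative interface: capacity() returns the ringbuffer size and
// write() completes (or is deferred) once the peer has freed enough space.
template <typename RB>
void writeChunked(RB& ringbuffer, const uint8_t* data, std::size_t length) {
  std::size_t offset = 0;
  while (offset < length) {
    std::size_t chunk = std::min(length - offset, ringbuffer.capacity());
    ringbuffer.write(data + offset, chunk);
    offset += chunk;
  }
}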

Detach UV's address discovery utilities from the context's loop

The UV context offers methods to resolve the IP address of the machine's hostname. These require access to libuv's loop, in order to perform the getaddrinfo call and to open and bind a few sockets to verify that the IP address works. This results in an odd setup as these methods eschew the async behavior of the rest of the transport (e.g., when they close the handles, they're supposed to wait until the next loop iteration for this to complete, but they don't).

In fact, these utilities could simply create their own ephemeral loop, and make sure to close and clean it up before they return. This would make them completely independent from the context. (It may make the functions a bit slower, as for example they need to open and then close their own epoll fd, but that should be negligible as they're supposed to only be called once at program startup).

A challenge with doing this is disentangling the libuv helpers (in uv.h and uv.cc) from the Loop class, so that they can be used with these ephemeral loops too. For example, they could just take a pointer to uv_loop_t. However, they do also have tons of checks for loop_.inLoop(). Those could be replaced with checks on a generic DeferredExecutor, although in turn it requires the ephemeral loop to implement the DeferredExecutor interface.

One more advantage of this change is that the UV context would then not contain anything in addition to the abstract context. This would allow us to use a boilerplate for its definition (together with something like a factory function), further reducing code duplication. The utility functions could at that point be moved to their own files, and the UV context impl would become just a thin wrapper around the Loop class and could be merged with it.
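
A sketch of the ephemeral-loop idea (error handling and IPv6 omitted): passing a null callback makes uv_getaddrinfo run synchronously, and the loop is created and torn down entirely inside the helper, independent of the transport context.

#include <uv.h>
#include <string>

std::string lookupHostnameAddress(const std::string& hostname) {
  uv_loop_t loop;
  uv_loop_init(&loop);

  uv_getaddrinfo_t request;
  // A null callback makes the call synchronous; the result ends up in
  // request.addrinfo.
  int rv = uv_getaddrinfo(&loop, &request, /*cb=*/nullptr, hostname.c_str(),
                          /*service=*/nullptr, /*hints=*/nullptr);

  std::string address;
  if (rv == 0 && request.addrinfo != nullptr) {
    char buf[64];
    if (request.addrinfo->ai_family == AF_INET) {
      uv_ip4_name(reinterpret_cast<sockaddr_in*>(request.addrinfo->ai_addr),
                  buf, sizeof(buf));
      address = buf;
    }
    uv_freeaddrinfo(request.addrinfo);
  }

  uv_run(&loop, UV_RUN_DEFAULT);  // drain anything still pending
  uv_loop_close(&loop);
  return address;
}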

Does it make sense to make channels ordered?

At the moment channels are designed to be unordered: the order in which calls to send and recv are performed on the two sides doesn't have to match. However, in practice they always do, since each channel is used by only one pipe, and that pipe is ordered, and it calls send/recv in the order in which messages go over the pipe, and in the order in which tensors appear in those messages.

Making the channels ordered would lead to some simplifications. For example, the "basic" channel is basic because it only uses the connection provided to it by the pipe, but isn't so basic in how it uses it: sending a tensor requires the receiver to request it and only then does the sender send it. This means that it takes 3 hops rather than 1 to get the data across (sending the descriptor, requiring the data, sending the data). A more basic version of the basic channel would have the sender just immediately write the tensor on the connection and the receiver immediately read it. Since connections are ordered though this only works if channels are ordered too. Ordering channels would also simplify bookkeeping, as instead of having to look up the sequence number of the request/reply in its collection, each endpoint could just assume that it's the first one.

Note that we could distinguish two levels of "order": the order in which methods are called, and the order in which callbacks are fired. By that I mean that if the channel is intrinsically unordered, an operation that is requested later could complete earlier than another one. The jury is still out on whether we should wait to fire the callbacks and fire them in order, or whether we should fire them as soon as possible. The pipe is robust to this anyway.

Use consistent naming convention for type aliases

We have a lot of shorthands of the form using alias = long complicated type. One chief example is the callback types. The issue is that we've been inconsistent with the names of the aliases: half of them use a snake_case style (e.g., read_callback_fn), whereas others use a CamelCase style (e.g., TReadCallback). There's also the fact that one has a trailing _fn whereas the other one doesn't (we should drop it). And finally there's the fact that the leading T typically denotes a template type parameter, and thus is misleading in aliases. Hence the convention we should converge on is probably just ReadCallback.
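
For example (with Error and Message standing in, forward-declared, for the actual callback argument types):

#include <functional>

class Error;    // stand-ins for the tensorpipe types
class Message;

// Proposed convention: CamelCase, no leading T, no _fn suffix.
using ReadCallback = std::function<void(const Error&, Message)>;

// Instead of the current mix of styles:
//   using read_callback_fn = std::function<void(const Error&, Message)>;
//   using TReadCallback = std::function<void(const Error&, Message)>;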

Optimize callbacks of UV event loop

The libuv library isn't thread-safe so we make sure to perform all the operations on it from inside the event loop. For that we use the defer function to which we provide a callback. For safety at the moment we wrap these callbacks in the usual runIfAlive boilerplate, but that's not really necessary. Thanks to the "leak" trick (objects wrapping libuv resources hold a shared_ptr to themselves, creating a reference cycle) we're guaranteed that the object will be alive until it's closed (since it's unleaked in the close callback). The loop calls the deferred functions in the order they were scheduled, so we're guaranteed that if the handle isn't closing at the time we're scheduling the callback then it won't be closed by the time the callback is executed and thus will still be alive.

This deduction is far from trivial and thus quite error prone, which is why the code for now adopts the safer runIfAlive alternative. However, as these calls are in the hot path it would be very valuable to optimize them. We should do so as soon as we have proper tests to detect regressions.
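
For context, the "leak" trick in miniature (an illustrative sketch, not the actual class): the object holds a shared_ptr to itself, forming a deliberate reference cycle that is only broken from the close callback, so the object is guaranteed to stay alive until the underlying libuv handle has been closed.

#include <memory>

class UvHandle : public std::enable_shared_from_this<UvHandle> {
 public:
  void leak() {
    self_ = shared_from_this();  // keep ourselves alive unconditionally
  }

  // Called from libuv's close callback, on the event loop thread.
  void onClosed() {
    self_.reset();  // "unleak": drop the self-reference, allow destruction
  }

 private:
  std::shared_ptr<UvHandle> self_;
};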

Channel factory test failure

See https://app.circleci.com/jobs/github/pytorch/tensorpipe/1380

Running main() from /root/project/third_party/googletest/googletest/src/gtest_main.cc
[==========] Running 3 tests from 1 test suite.
[----------] Global test environment set-up.
[----------] 3 tests from ChannelFactory
[ RUN      ] ChannelFactory.Name
[       OK ] ChannelFactory.Name (0 ms)
[ RUN      ] ChannelFactory.DomainDescriptor
[       OK ] ChannelFactory.DomainDescriptor (0 ms)
[ RUN      ] ChannelFactory.CreateChannel
[tensorpipe warning: In operator() at /root/project/tensorpipe/test/channel/channel_test.cc:125] EOF: end of file
terminate called after throwing an instance of 'std::runtime_error'
  what():  In bufferToType at /root/project/tensorpipe/test/channel/channel_test.cc:35 "Expected len(0) == sizeof(ret)(1)"
/bin/bash: line 8:  5308 Aborted                 (core dumped) build/tensorpipe/test/channel/channel_test

TSAN job failing on CircleCI in libuv code

https://app.circleci.com/pipelines/github/pytorch/tensorpipe/956/workflows/97acedc6-b2e9-449a-b010-65a4ead3b670/jobs/6759

That job failed because of a flaky TSAN data race, or actually three of them, all involving libuv and all looking similar: one thread was doing a uv_loop_init and the other one was doing a uv_tcp_init or _bind. However, from the thread numbers and the stack trace of their creation, it looks like these two operations were performed by two different UV contexts. It's weird that they would race.

See here for the traces:

https://gist.github.com/lw/fec763ca140ec9eb42a5991d694f3588

Reusing tokens in the SHM reactor may not be a good idea

The SHM reactor consists of:

  • a map from integers (known as tokens) to functions (let's call them reactions)
  • a shared-memory ringbuffer whose producers are other processes that push tokens, and whose consumer is the reactor loop, which runs the corresponding reactions.

At the moment the reactor tries to reuse and "compact" the tokens: each new reaction takes the smallest unused token (just like the kernel with file descriptors).

I've begun to wonder what happens when one reaction is unregistered and another one registered with the same token, and there are still instances of that token in the ringbuffer, intended for the old reaction. Unless I'm missing some safeguard, they should now be triggering the new reaction. This may pose problems unless all reactions are designed to be resilient to spurious wakeups.

Include context ID in thread names

Threads spawned by contexts (transports, channels, ...) are given a name so that they can be more easily identified in GDB when debugging. The names currently are of the form TP_UV_loop or TP_SHM_reactor. If we have multiple UV contexts, however, it will be impossible to distinguish which one the thread belongs to. This will become important for multiplexing channels.

We already have a system to assign unique IDs to every object, so we could include a context's ID in the names of its threads. Sounds easy, but there are two tricky aspects:

  • The contexts don't know their ID when they are created; it is given to them later with the setId method. So, when they receive their ID, they need to update their threads' names. With pthreads this is easy to do on Linux, but impossible on OSX, because there a thread's name can only be changed by the thread itself. It may be fine to only support Linux, but it would be nice for this to work everywhere. For that, we'd have to defer the renaming so that it executes on the target thread. This is often easy, but not always possible: for example, the SHM loop doesn't offer a way to defer work to it (all the work is deferred to the reactor).
  • The name of a thread is limited to 16 chars by pthread (in fact 15 chars, as the last one has to be \0). Currently the ID of most contexts is already beyond that, as it contains the PID and then a "hierarchy", e.g., 1234567:c0.ch_mp_uv.42 could be the ID of the 42nd UV context used by a multiplexing channel of the first core context created in process 1234567. We need to find a shorter format in order to fit it in 15 chars. For one, the PID is probably useless, as threads are already scoped within a process.

Note that a bunch of utilities around setting thread names can be found here: https://github.com/facebook/folly/blob/4ad9455e2d38a0d267d0f6db474060f96bd659f4/folly/system/ThreadName.cpp
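
A sketch of the Linux side (glibc's pthread_setname_np rejects names longer than 15 characters plus the terminating NUL; on OSX the function can only rename the calling thread):

#include <pthread.h>  // build with -D_GNU_SOURCE so pthread_setname_np is declared
#include <string>

void setThreadName(pthread_t thread, std::string name) {
  // Keep within the 15-character limit, e.g. by dropping the leading part
  // (the PID prefix) rather than the more informative trailing hierarchy.
  if (name.size() > 15) {
    name = name.substr(name.size() - 15);
  }
  pthread_setname_np(thread, name.c_str());  // Linux-only signature
}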

add_library cannot create ALIAS target "uv::uv" because target "PkgConfig::uv" does not already exist.

An error I only saw on CI so far:

CMake Error at third_party/tensorpipe/cmake/Finduv.cmake:33 (add_library):
  add_library cannot create ALIAS target "uv::uv" because target
  "PkgConfig::uv" does not already exist.

For those with access, details at https://app.circleci.com/pipelines/github/fairinternal/postman/223/workflows/0be48842-04db-405c-8571-18b63f544d3d/jobs/420/steps. Full error at https://gist.github.com/heiner/498a4e6224c50d6de43059f0fedb17a4

Use weak symbols with dlopen

Example:

tensorpipe/common/cuda.h

#include <cuda.h>
#include <dlfcn.h>

CUresult __attribute__((weak)) cuGetErrorName(CUresult, const char **);

void* loadCuda() {
    return dlopen("libcuda.so", RTLD_LAZY | RTLD_GLOBAL);
}

void unloadCuda(void *handle) {
    dlclose(handle);
}

usage:

#include <iostream>

int main() {
    void* handle = loadCuda();
    const char* pstr;
    cuGetErrorName(CUDA_SUCCESS, &pstr);  // intended to resolve at runtime, after dlopen
    std::cout << pstr << std::endl;
    unloadCuda(handle);
    return 0;
}

"errorCouldn't get list of InfiniBand devices: ibv_get_device_list: Unknown error -38"

With pytorch 1.7.1, I was able to successfully initialize the RPC context.

With pytorch nightly (1.9.0.dev20210223),

  File "/home/jeremy/PycharmProjects/hearthstone_battlegrounds/hearthstone/training/pytorch/worker/distributed/worker_pool.py", line 54, in __init__
    rpc.init_rpc(INFERENCE_PROCESS_NAME, rank=0, world_size=num_workers+1)
  File "/home/jeremy/PycharmProjects/hearthstone_battlegrounds/venv/lib/python3.9/site-packages/torch/distributed/rpc/__init__.py", line 194, in init_rpc
    _init_rpc_backend(backend, store, name, rank, world_size, rpc_backend_options)
  File "/home/jeremy/PycharmProjects/hearthstone_battlegrounds/venv/lib/python3.9/site-packages/torch/distributed/rpc/__init__.py", line 230, in _init_rpc_backend
    rpc_agent = backend_registry.init_backend(
  File "/home/jeremy/PycharmProjects/hearthstone_battlegrounds/venv/lib/python3.9/site-packages/torch/distributed/rpc/backend_registry.py", line 99, in init_backend
    return backend.value.init_backend_handler(*args, **kwargs)
  File "/home/jeremy/PycharmProjects/hearthstone_battlegrounds/venv/lib/python3.9/site-packages/torch/distributed/rpc/backend_registry.py", line 278, in _tensorpipe_init_backend_handler
    api._init_rpc_states(agent)
  File "/home/jeremy/PycharmProjects/hearthstone_battlegrounds/venv/lib/python3.9/site-packages/torch/distributed/rpc/api.py", line 116, in _init_rpc_states
    _set_and_start_rpc_agent(agent)
RuntimeError: In create at /pytorch/third_party/tensorpipe/tensorpipe/transport/ibv/context_impl.cc:55 "errorCouldn't get list of InfiniBand devices: ibv_get_device_list: Unknown error -38"

Cuda is working correctly otherwise, e.g. I can run

>>> torch.tensor([1,2], device=torch.device('cuda'))

successfully.

Fix auto-detection of CMA support

Our CircleCI tests explicitly disable CMA because it doesn't seem to work. I found out that it might be due to a security measure adopted by Ubuntu which only allows ptrace from a process to one of its descendants. There are ways to detect that and even to counter it, and we should adopt them. In general, we should improve the CMA detection. Here are some references:

Permission to read from or write to another process is governed by a
ptrace access mode PTRACE_MODE_ATTACH_REALCREDS check; see ptrace(2).

man 2 process_vm_readv

See Ptrace access mode checking and /proc/sys/kernel/yama/ptrace_scope in man 2 ptrace.

See the kernel docs about Yama (i.e., Ubuntu's security measure).

See PR_SET_PTRACER in man 2 prctl.
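
A sketch of what the detection could look like: read Yama's ptrace_scope (a non-zero value means an unrelated process cannot attach, and hence CMA will be denied) and, on the target side, explicitly whitelist the peer with PR_SET_PTRACER.

#include <sys/prctl.h>
#include <sys/types.h>
#include <fstream>

// Returns false when Yama is configured so that an unrelated process cannot
// ptrace (and hence cannot process_vm_readv/writev) this one.
bool cmaLikelyAllowed() {
  std::ifstream file("/proc/sys/kernel/yama/ptrace_scope");
  int scope = 0;
  if (file >> scope) {
    return scope == 0;
  }
  return true;  // Yama absent: the classic ptrace permission check applies
}

// Counter-measure on the target side: allow a specific peer to attach.
void allowPeerToAttach(pid_t peerPid) {
  prctl(PR_SET_PTRACER, static_cast<unsigned long>(peerPid), 0, 0, 0);
}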

Merge utils/ subdir into common/

There's no strong reason for keeping them separate. I think we used to do that because the files in utils were copied from other repos, whereas the ones in common were "original", but we've now modified the former to a point where they're unrecognizable. Moreover, we've added dependencies between the two (e.g., Segment, in util, depends on Socket, in common; and RingBufferReadOperation, in common, depends on RingBuffer, in util). So in fact we have a cyclical dependency, which only works because of a lucky include order. We should just put everything in common to make everything clearer and simpler.

Implement "memcpy" transport and channel, for intra-process communication

Some prospective users have been asking whether TensorPipe will support pipes between different threads of the same program. There's no reason not to support this, and its implementation is very straightforward (basically a few memcpys, some global state and a bunch of synchronization primitives).

[W tensorpipe_agent.cpp:641] RPC agent for Test0 encountered error when sending outgoing request #0 to Test1: ECONNREFUSED: connection refused

Consider the attached program, run it on two machines, e.g.

machine1% python hello.py -o machine1 -r 0 -t
machine2% python hello.py -o machine1 -r 1 -t

Omitting the "-t" (i.e. don't use tensorpipe) everything works fine and rank1 prints

hello from the other side
got t = tensor([4.2842e+20, 4.5806e-41, 4.2842e+20, 4.5806e-41, 0.0000e+00])

With "-t", (i.e. use tensorpipe) rank 1 hangs and doesn't respond to ^C, and rank 0 prints

dist init r=0, world=2, host=learnfair1228
going to try init_process_group: tcp://learnfair1228:10638
got nccl, now barrier 0
got nccl, now rpc 0
going to try init_rpc tcp://learnfair1228:10639
got rpc 0
[W tensorpipe_agent.cpp:641] RPC agent for Test0 encountered error when sending outgoing request #0 to Test1: ECONNREFUSED: connection refused
Traceback (most recent call last):
  File "hello.py", line 63, in <module>
    torch.distributed.rpc.rpc_sync("Test1", rpc_send, args=(t,))
  File "/private/home/tbirch/.conda/envs/torch160/lib/python3.7/site-packages/torch/distributed/rpc/api.py", line 81, in wrapper
    return func(*args, **kwargs)
  File "/private/home/tbirch/.conda/envs/torch160/lib/python3.7/site-packages/torch/distributed/rpc/api.py", line 752, in rpc_sync
    return fut.wait()
RuntimeError: ECONNREFUSED: connection refused

The attached hello.py:

import torch
import argparse
import os
from torch.distributed import rpc

parser = argparse.ArgumentParser(description="benchmark")
parser.add_argument("--rank", "-r", type=int, help="rank base", default=0)
parser.add_argument("--host", "-o", type=str, default=None, help="hostname")
parser.add_argument("-t",  action='store_true', default=False, help="use tensorpipe")

args = parser.parse_args()

hostname = args.host
world_size = 2
rank = args.rank

if hostname is None:
    hostname = "localhost"
print(f"dist init r={rank}, world={world_size}, host={hostname}")
os.environ["MASTER_ADDR"] = hostname
os.environ["MASTER_PORT"] = "10638"
os.environ["WORLD_SIZE"] = str(world_size)
os.environ["RANK"] = str(rank)
os.environ["GLOO_SOCKET_IFNAME"] = "enp1s0f0"
if torch.__version__ == "1.6.0":
    init_method = f"tcp://{os.environ['MASTER_ADDR']}:{os.environ['MASTER_PORT']}"
    print(f"going to try init_process_group: {init_method}")
    torch.distributed.init_process_group(backend="nccl", rank=rank, world_size=world_size, init_method=init_method)
    print(f"got nccl, now barrier {rank}")
    torch.distributed.barrier()
    print(f"got nccl, now rpc {rank}")
    os.environ["MASTER_ADDR"] = hostname
    os.environ["MASTER_PORT"] = "10639"
    init_method = f"tcp://{os.environ['MASTER_ADDR']}:{os.environ['MASTER_PORT']}"
    print(f"going to try init_rpc {init_method}")
    if args.t:
        rpc.init_rpc(
            f"Test{rank}",
            rank=rank,
            world_size=world_size,
            backend=rpc.BackendType.TENSORPIPE,
            rpc_backend_options=rpc.TensorPipeRpcBackendOptions(
                init_method=init_method
            ),
        )
    else:
        rpc.init_rpc(
            f"Test{rank}",
            rank=rank,
            world_size=world_size,
        )
    print(f"got rpc {rank}")
else:
    rpc.init_rpc(f"Test{rank}", rank=rank, world_size=world_size)
    print(f"got rpc {rank}")

def rpc_send(t):
    print(f"hello from the other side")
    print(f"got t = {t}")

t = torch.Tensor(5)
if rank == 0:
    torch.distributed.rpc.rpc_sync("Test1", rpc_send, args=(t,))

torch.distributed.barrier()

Consider using a non-copyable function wrapper

On several occasions we'd like to capture non-copyable objects in a lambda's closure but can't, because the lambda gets encapsulated in a std::function which must be copyable. This happens for example when we allocate a buffer for a write operation, where we want that buffer to remain alive for as long as the write is going on and be destroyed as soon as it's finished: a natural solution is to have the write callback capture a unique_ptr to said buffer. We can't do that, so we resort to a shared_ptr.

This is not a big problem, but since we never make use of the copyability of callbacks we'd be better off without this requirement. I found out today that this may be possible by rolling our own function wrapper! The Folly library has what we need, see its function class (doc here, code here). I'm not saying to depend on Folly, nor to copy-paste that file (it depends on other Folly parts), but it's proof that what we need exists.

Note that we can do this at a later time without breaking any backwards compatibility, because the function wrapper we'd migrate to is looser than the current one and thus anything that works now will keep working (in fact, folly::Function can wrap a std::function).
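
The problem in miniature: the lambda itself can capture a unique_ptr, but storing it in a std::function does not compile, so today the code falls back to a shared_ptr.

#include <functional>
#include <memory>
#include <vector>

std::function<void()> makeWriteCallback() {
  auto buffer = std::make_unique<std::vector<char>>(1024);

  // Does NOT compile: std::function requires a copyable callable.
  //   return [buf = std::move(buffer)]() { /* use *buf */ };

  // Workaround used today: promote to shared_ptr, paying for the control
  // block and atomic refcount even though there is only ever one owner.
  std::shared_ptr<std::vector<char>> shared = std::move(buffer);
  return [shared]() { /* use *shared */ };
}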

Have SHM transport optimize protobuf serialization by writing to ringbuffer directly

In D19723175 we changed the interface of transport connections to allow short-cutting the writing of protobufs. For now these are just helper functions that serialize to memory and then call the standard write. One instance where we could use the shortcut to optimize is with the SHM transport, which could serialize directly to the ringbuffer, thus saving one allocation and one memory copy. This will benefit latency.

Using tensorpipe to write distributed applications

I'd like to leverage tensorpipe to perform RPC in a distributed setting (C++). Since I couldn't find any test for multi-device use of tensorpipe, I looked at how pytorch uses tensorpipe in the backend. I found this file to be relevant to my search (https://github.com/pytorch/pytorch/blob/master/torch/csrc/distributed/rpc/tensorpipe_agent.cpp), but it is still not clear to me how the connection between different nodes occurs, as no information (IP) about the nodes appears to be considered. My understanding is that, depending on the world size, worker name(s) and id(s) are stored and accessed through the registry (https://github.com/pytorch/pytorch/blob/master/c10/util/Registry.h). But I don't understand how the registry knows about the different nodes.

  1. Is it possible to use standalone tensorpipe to do rpc between nodes? Or does it require more from the pytorch codebase?
  2. Is there any example/test with tensorpipe used in distributed setting?

The distributed training examples that I found are all in Python, but I'd like to use C++.

Thank you.

Tests hang on MacOS

Following the README on my MacBook and then running tensorpipe/test/tensorpipe_test hangs at [ RUN ] Basic/ChannelTest.contextIsNotJoined/0 (pretty reliably).

Running with TP_VERBOSE_LOGGING=9 typically does not hang, which seems to indicate some kind of race?

I managed to get it to hang with verbose logging and got the backtraces of the threads with lldb. Outputs in this gist: https://gist.github.com/heiner/eda54649104500ec23ddfa887a276619

My system:
OS: Mac OSX 10.14.6
GCC version: Could not collect
CMake version: version 3.17.2
Apple clang version 11.0.0 (clang-1100.0.33.12)
Target: x86_64-apple-darwin18.7.0

Test `UVTransportContextTest.LookupHostnameAddress` fails.

[----------] 2 tests from Uv/UVTransportContextTest
[ RUN      ] Uv/UVTransportContextTest.LookupHostnameAddress/0
/home/beauby/Code/tensorpipe/tensorpipe/test/transport/uv/context_test.cc:33: Failure
Value of: error
  Actual: true
Expected: false
EAI_NONAME: unknown node or service
/home/beauby/Code/tensorpipe/tensorpipe/test/transport/uv/context_test.cc:34: Failure
Expected: (addr) != (""), actual: "" vs ""
[  FAILED  ] Uv/UVTransportContextTest.LookupHostnameAddress/0, where GetParam() = 0x55c41dac5d20 (25 ms)
[ RUN      ] Uv/UVTransportContextTest.LookupInterfaceAddress/0
[       OK ] Uv/UVTransportContextTest.LookupInterfaceAddress/0 (0 ms)
[----------] 2 tests from Uv/UVTransportContextTest (25 ms total)

Try to use huge pages to store ringbuffers

Latency-sensitive applications, like the transports, that use ringbuffers as intermediate storage for the data (think SHM, or InfiniBand), might benefit from using huge pages to avoid hitting as many page faults and interrupts. Huge pages are however a bit finicky: we thought we were supporting them but it turned out we were passing the flags incorrectly; and doing that the right way raises an error if huge pages are disabled/unavailable. Having been unable to get huge pages to work, I don't know how much performance benefit they can bring and whether they're worth investing in some graceful fallback logic.
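
For reference, a sketch of the "graceful fallback" approach with anonymous memory (the real ringbuffers live in shared memory, so the actual flags would differ): try MAP_HUGETLB first and fall back to regular pages when the kernel refuses.

#include <sys/mman.h>
#include <cstddef>

void* allocateBuffer(std::size_t size) {
  // Note: with MAP_HUGETLB, size must be a multiple of the huge page size.
  void* ptr = mmap(nullptr, size, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
  if (ptr == MAP_FAILED) {
    // Huge pages disabled or exhausted: fall back to regular pages.
    ptr = mmap(nullptr, size, PROT_READ | PROT_WRITE,
               MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
  }
  return ptr == MAP_FAILED ? nullptr : ptr;
}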
