tinygrad / open-gpu-kernel-modules

This project forked from nvidia/open-gpu-kernel-modules


NVIDIA Linux open GPU with P2P support

License: Other

Shell 0.34% C++ 1.83% Python 0.03% C 97.65% Makefile 0.16%

open-gpu-kernel-modules's Introduction


tinygrad: For something between PyTorch and karpathy/micrograd. Maintained by tiny corp.



This may not be the best deep learning framework, but it is a deep learning framework.

Due to its extreme simplicity, it aims to be the easiest framework to add new accelerators to, with support for both inference and training. If XLA is CISC, tinygrad is RISC.

tinygrad is still alpha software, but we raised some money to make it good. Someday, we will tape out chips.

Features

LLaMA and Stable Diffusion

tinygrad can run LLaMA and Stable Diffusion!
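For example, from a source checkout you can try the bundled example scripts (a hedged sketch; the scripts and their weight handling may change between versions, and LLaMA weights may need to be provided separately):

python3 examples/stable_diffusion.py
python3 examples/llama.py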

Laziness

Try a matmul. See how, despite the style, it is fused into one kernel with the power of laziness.

DEBUG=3 python3 -c "from tinygrad import Tensor;
N = 1024; a, b = Tensor.rand(N, N), Tensor.rand(N, N);
c = (a.reshape(N, 1, N) * b.T.reshape(1, N, N)).sum(axis=2);
print((c.numpy() - (a.numpy() @ b.numpy())).mean())"

And we can change DEBUG to 4 to see the generated code.
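To get a feel for what laziness buys you, here is a minimal sketch (assuming the current Tensor API; exactly where realization happens may vary between versions): operations only record a graph, and nothing runs until you ask for the data.

from tinygrad import Tensor

a, b = Tensor.rand(64, 64), Tensor.rand(64, 64)
c = ((a + b).relu() * 2).sum()  # nothing has executed yet: the ops are only recorded
print(c.numpy())                # realization happens here, ideally as one fused kernel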

Neural networks

As it turns out, 90% of what you need for neural networks is a decent autograd/tensor library. Throw in an optimizer, a data loader, and some compute, and you have all you need.

from tinygrad import Tensor, nn

class LinearNet:
  def __init__(self):
    self.l1 = Tensor.kaiming_uniform(784, 128)
    self.l2 = Tensor.kaiming_uniform(128, 10)
  def __call__(self, x:Tensor) -> Tensor:
    return x.flatten(1).dot(self.l1).relu().dot(self.l2)

model = LinearNet()
optim = nn.optim.Adam([model.l1, model.l2], lr=0.001)

x, y = Tensor.rand(4, 1, 28, 28), Tensor([2,4,3,7])  # replace with real mnist dataloader

with Tensor.train():
  for i in range(10):
    optim.zero_grad()
    loss = model(x).sparse_categorical_crossentropy(y).backward()
    optim.step()
    print(i, loss.item())

See examples/beautiful_mnist.py for the full version that gets 98% in ~5 seconds.
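As a hedged sketch of evaluating the model above (this mirrors what examples/beautiful_mnist.py does; x and y are the placeholder tensors from the snippet, not real MNIST data):

acc = (model(x).argmax(axis=1) == y).mean()  # fraction of correct predictions
print(f"accuracy: {acc.item()*100:.2f}%")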

Accelerators

tinygrad already supports numerous accelerators (see the docs for the current list), and it is easy to add more! Your accelerator of choice only needs to support a total of ~25 low level ops.
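To check which backend tinygrad picked on your machine, here is a minimal sketch (backend names such as "CUDA", "METAL", or "CLANG" vary by machine and tinygrad version):

from tinygrad import Tensor, Device

print(Device.DEFAULT)                               # the accelerator tinygrad selected by default
t = Tensor([1.0, 2.0, 3.0], device=Device.DEFAULT)  # tensors can also be pinned to a backend by name
print(t.device, t.numpy())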

Installation

The current recommended way to install tinygrad is from source.

From source

git clone https://github.com/tinygrad/tinygrad.git
cd tinygrad
python3 -m pip install -e .

Direct (master)

python3 -m pip install git+https://github.com/tinygrad/tinygrad.git
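Either way, a quick smoke test (a minimal sketch; the printed values are random) is to create a small tensor and realize it:

python3 -c "from tinygrad import Tensor; print(Tensor.rand(2, 2).numpy())"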

Documentation

Documentation along with a quick start guide can be found on the docs website built from the docs/ directory.

Quick example comparing to PyTorch

from tinygrad import Tensor

x = Tensor.eye(3, requires_grad=True)
y = Tensor([[2.0,0,-2.0]], requires_grad=True)
z = y.matmul(x).sum()
z.backward()

print(x.grad.numpy())  # dz/dx
print(y.grad.numpy())  # dz/dy

The same thing but in PyTorch:

import torch

x = torch.eye(3, requires_grad=True)
y = torch.tensor([[2.0,0,-2.0]], requires_grad=True)
z = y.matmul(x).sum()
z.backward()

print(x.grad.numpy())  # dz/dx
print(y.grad.numpy())  # dz/dy
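In both cases you should see the same gradients: dz/dx is [[2., 2., 2.], [0., 0., 0.], [-2., -2., -2.]] (each row is the corresponding entry of y), and dz/dy is [[1., 1., 1.]] (the row sums of the identity).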

Contributing

There has been a lot of interest in tinygrad lately. Following these guidelines will help your PR get accepted.

We'll start with what will get your PR closed with a pointer to this section:

  • No code golf! While low line count is a guiding light of this project, anything that remotely looks like code golf will be closed. The true goal is reducing complexity and increasing readability, and deleting \ns does nothing to help with that.
  • All docs and whitespace changes will be closed unless you are a well-known contributor. The people writing the docs should be those who know the codebase the absolute best. People who have not demonstrated that shouldn't be messing with docs. Whitespace changes are both useless and carry a risk of introducing bugs.
  • Anything you claim is a "speedup" must be benchmarked. In general, the goal is simplicity, so even if your PR makes things marginally faster, you have to consider the tradeoff with maintainability and readability.
  • In general, the code outside the core tinygrad/ folder is not well tested, so unless the current code there is broken, you shouldn't be changing it.
  • If your PR looks "complex", is a big diff, or adds lots of lines, it won't be reviewed or merged. Consider breaking it up into smaller PRs that are individually clear wins. A common pattern I see is prerequisite refactors before adding new functionality. If you can (cleanly) refactor to the point that the feature is a 3 line change, this is great, and something easy for us to review.

Now, what we want:

  • Bug fixes (with a regression test) are great! This library isn't 1.0 yet, so if you stumble upon a bug, fix it, write a test, and submit a PR, this is valuable work.
  • Solving bounties! tinygrad offers cash bounties for certain improvements to the library. All new code should be high quality and well tested.
  • Features. However, if you are adding a feature, consider the line tradeoff. If it's 3 lines, there's less of a bar of usefulness it has to meet over something that's 30 or 300 lines. All features must have regression tests. In general with no other constraints, your feature's API should match torch or numpy.
  • Refactors that are clear wins. In general, if your refactor isn't a clear win it will be closed. But some refactors are amazing! Think about readability in a deep core sense. A whitespace change or moving a few functions around is useless, but if you realize that two 100 line functions can actually use the same 110 line function with arguments while also improving readability, this is a big win. Refactors should pass process replay.
  • Tests/fuzzers. If you can add tests that are non-brittle, they are welcome. We have some fuzzers in here too, and there's a plethora of bugs that can be found with them and by improving them. Finding bugs, even writing broken tests (that should pass) with @unittest.expectedFailure is great. This is how we make progress.
  • Dead code removal from core tinygrad/ folder. We don't care about the code in extra, but removing dead code from the core library is great. Less for new people to read and be confused by.

Running tests

You should install the pre-commit hooks with pre-commit install. This will run the linter, mypy, and a subset of the tests on every commit.

For more examples on how to run the full test suite please refer to the CI workflow.

Some examples of running tests locally:

python3 -m pip install -e '.[testing]'  # install extra deps for testing
python3 test/test_ops.py                # just the ops tests
python3 -m pytest test/                 # whole test suite

Process replay tests

Process replay compares your PR's generated kernels against master. If your PR is a refactor or speedup without any expected behavior change, it should include [run_process_replay] in the PR title (example). Note that you should keep your branch up-to-date with master.

open-gpu-kernel-modules's People

Contributors

alcaparra, aritger, bigswag420, fffedo, geohot, gregjhogan, joshua-ashton, keroeslux, mmaneetsingh, nitepone, niv, par2020, realastolfo, thebeanogamer, trickydmitriy, wozeparrot


open-gpu-kernel-modules's Issues

Getting ~40GB/s instead of 50+ in the example. Curious why

NVIDIA Open GPU Kernel Modules Version

550.54.15-p2p default

Please confirm this issue does not happen with the proprietary driver (of the same version). This issue tracker is only for bugs specific to the open kernel driver.

  • I confirm that this does not happen with the proprietary driver package.

Operating System and Version

TUXEDO OS 2 (Ubuntu 22.04 Fork)

Kernel Release

6.5.0-10022-tuxedo

Please confirm you are running a stable release kernel (e.g. not a -rc). We do not accept bug reports for unreleased kernels.

  • I am running on a stable kernel release.

Hardware: GPU

3x RTX 4090

Describe the bug

P2P is working, but I get ~40 GB/s.

I'm curious why this is; do I need to overclock?

All slots are running PCIe Gen 4 x16.

Bidirectional P2P=Disabled Bandwidth Matrix (GB/s)
   D\D     0      1      2 
     0 918.31  25.76  25.91 
     1  25.95 923.67  25.60 
     2  26.15  25.68 923.74 
Bidirectional P2P=Enabled Bandwidth Matrix (GB/s)
   D\D     0      1      2 
     0 919.56  41.38  41.37 
     1  41.38 923.19  41.37 
     2  41.37  41.36 921.56 
 [./simpleP2P] - Starting...
Checking for multiple GPUs...
CUDA-capable device count: 3

Checking GPU(s) for support of peer to peer memory access...
> Peer access from NVIDIA GeForce RTX 4090 (GPU0) -> NVIDIA GeForce RTX 4090 (GPU1) : Yes
> Peer access from NVIDIA GeForce RTX 4090 (GPU0) -> NVIDIA GeForce RTX 4090 (GPU2) : Yes
> Peer access from NVIDIA GeForce RTX 4090 (GPU1) -> NVIDIA GeForce RTX 4090 (GPU0) : Yes
> Peer access from NVIDIA GeForce RTX 4090 (GPU1) -> NVIDIA GeForce RTX 4090 (GPU2) : Yes
> Peer access from NVIDIA GeForce RTX 4090 (GPU2) -> NVIDIA GeForce RTX 4090 (GPU0) : Yes
> Peer access from NVIDIA GeForce RTX 4090 (GPU2) -> NVIDIA GeForce RTX 4090 (GPU1) : Yes
Enabling peer access between GPU0 and GPU1...
Allocating buffers (64MB on GPU0, GPU1 and CPU Host)...
Creating event handles...
cudaMemcpyPeer / cudaMemcpy between GPU0 and GPU1: 21.06GB/s
Preparing host buffer and memcpy to GPU0...
Run kernel on GPU1, taking source data from GPU0 and writing to GPU1...
Run kernel on GPU0, taking source data from GPU1 and writing to GPU0...
Copy data back to host from GPU0 and verify results...
Disabling peer access...
Shutting down...
Test passed

PS: Thanks for this driver, tinygrad team 🙌

To Reproduce

Run ./p2pBandwidthLatencyTest

Bug Incidence

Always

nvidia-bug-report.log.gz

None

More Info

No response

Getting RuntimeError: CUDA error: an illegal memory access was encountered with 3090s

NVIDIA Open GPU Kernel Modules Version

this one

Please confirm this issue does not happen with the proprietary driver (of the same version). This issue tracker is only for bugs specific to the open kernel driver.

  • I confirm that this does not happen with the proprietary driver package.

Operating System and Version

Ubuntu 22.04.4 LTS

Kernel Release

6.5.0-27-generic

Please confirm you are running a stable release kernel (e.g. not a -rc). We do not accept bug reports for unreleased kernels.

  • I am running on a stable kernel release.

Hardware: GPU

all (4x 3090)

Describe the bug

I installed this driver, and torch.cuda.can_device_access_peer(a, b) returns True for all GPUs.

I get the following error when textgenwebui tries to load a model:

RuntimeError: CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

Aphrodite also crashes when loading any model.

To Reproduce

I installed this driver on Ubuntu.

Bug Incidence

Always

nvidia-bug-report.log.gz

nvidia-bug-report.log.gz

More Info

No response

'mapping of buffer object failed' error

NVIDIA Open GPU Kernel Modules Version

550.54.15

Please confirm this issue does not happen with the proprietary driver (of the same version). This issue tracker is only for bugs specific to the open kernel driver.

  • I confirm that this does not happen with the proprietary driver package.

Operating System and Version

ubuntu 23.10

Kernel Release

Linux ai5 6.5.0-28-generic

Please confirm you are running a stable release kernel (e.g. not a -rc). We do not accept bug reports for unreleased kernels.

  • I am running on a stable kernel release.

Hardware: GPU

8x4090

Describe the bug

It seems to detect P2P but fails on the test.

./p2pBandwidthLatencyTest
[P2P (Peer-to-Peer) GPU Bandwidth Latency Test]
Device: 0, NVIDIA GeForce RTX 4090, pciBusID: 1, pciDeviceID: 0, pciDomainID:0
Device: 1, NVIDIA GeForce RTX 4090, pciBusID: 2, pciDeviceID: 0, pciDomainID:0
Device: 2, NVIDIA GeForce RTX 4090, pciBusID: 3, pciDeviceID: 0, pciDomainID:0
Device: 3, NVIDIA GeForce RTX 4090, pciBusID: 2c, pciDeviceID: 0, pciDomainID:0
Device: 4, NVIDIA GeForce RTX 4090, pciBusID: 41, pciDeviceID: 0, pciDomainID:0
Device: 5, NVIDIA GeForce RTX 4090, pciBusID: 42, pciDeviceID: 0, pciDomainID:0
Device: 6, NVIDIA GeForce RTX 4090, pciBusID: 61, pciDeviceID: 0, pciDomainID:0
Device: 7, NVIDIA GeForce RTX 4090, pciBusID: 62, pciDeviceID: 0, pciDomainID:0
Device=0 CAN Access Peer Device=1
Device=0 CAN Access Peer Device=2
Device=0 CAN Access Peer Device=3
Device=0 CAN Access Peer Device=4
Device=0 CAN Access Peer Device=5
Device=0 CAN Access Peer Device=6
Device=0 CAN Access Peer Device=7
Device=1 CAN Access Peer Device=0
Device=1 CAN Access Peer Device=2
Device=1 CAN Access Peer Device=3
Device=1 CAN Access Peer Device=4
Device=1 CAN Access Peer Device=5
Device=1 CAN Access Peer Device=6
Device=1 CAN Access Peer Device=7
Device=2 CAN Access Peer Device=0
Device=2 CAN Access Peer Device=1
Device=2 CAN Access Peer Device=3
Device=2 CAN Access Peer Device=4
Device=2 CAN Access Peer Device=5
Device=2 CAN Access Peer Device=6
Device=2 CAN Access Peer Device=7
Device=3 CAN Access Peer Device=0
Device=3 CAN Access Peer Device=1
Device=3 CAN Access Peer Device=2
Device=3 CAN Access Peer Device=4
Device=3 CAN Access Peer Device=5
Device=3 CAN Access Peer Device=6
Device=3 CAN Access Peer Device=7
Device=4 CAN Access Peer Device=0
Device=4 CAN Access Peer Device=1
Device=4 CAN Access Peer Device=2
Device=4 CAN Access Peer Device=3
Device=4 CAN Access Peer Device=5
Device=4 CAN Access Peer Device=6
Device=4 CAN Access Peer Device=7
Device=5 CAN Access Peer Device=0
Device=5 CAN Access Peer Device=1
Device=5 CAN Access Peer Device=2
Device=5 CAN Access Peer Device=3
Device=5 CAN Access Peer Device=4
Device=5 CAN Access Peer Device=6
Device=5 CAN Access Peer Device=7
Device=6 CAN Access Peer Device=0
Device=6 CAN Access Peer Device=1
Device=6 CAN Access Peer Device=2
Device=6 CAN Access Peer Device=3
Device=6 CAN Access Peer Device=4
Device=6 CAN Access Peer Device=5
Device=6 CAN Access Peer Device=7
Device=7 CAN Access Peer Device=0
Device=7 CAN Access Peer Device=1
Device=7 CAN Access Peer Device=2
Device=7 CAN Access Peer Device=3
Device=7 CAN Access Peer Device=4
Device=7 CAN Access Peer Device=5
Device=7 CAN Access Peer Device=6

***NOTE: In case a device doesn't have P2P access to other one, it falls back to normal memcopy procedure.
So you can see lesser Bandwidth (GB/s) and unstable Latency (us) in those cases.

P2P Connectivity Matrix
D\D 0 1 2 3 4 5 6 7
0 1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1 1
2 1 1 1 1 1 1 1 1
3 1 1 1 1 1 1 1 1
4 1 1 1 1 1 1 1 1
5 1 1 1 1 1 1 1 1
6 1 1 1 1 1 1 1 1
7 1 1 1 1 1 1 1 1
Unidirectional P2P=Disabled Bandwidth Matrix (GB/s)
D\D 0 1 2 3 4 5 6 7
0 913.74 6.60 6.60 10.94 10.77 10.75 10.73 10.90
1 6.51 924.56 6.23 6.51 6.51 6.51 6.51 6.51
2 6.50 6.32 923.87 6.51 6.39 6.48 6.51 6.51
3 10.87 6.61 6.60 923.60 10.81 10.77 10.58 10.80
4 9.92 6.52 6.61 10.91 924.56 10.89 10.69 10.89
5 10.91 6.61 6.60 10.95 10.91 922.92 9.02 9.59
6 10.93 6.61 6.60 10.92 10.93 10.92 921.98 10.67
7 10.90 6.61 6.60 10.94 10.92 10.88 10.53 924.01
Unidirectional P2P=Enabled Bandwidth (P2P Writes) Matrix (GB/s)
Cuda failure p2pBandwidthLatencyTest.cu:189: 'mapping of buffer object failed'

To Reproduce

After installing it, P2P seems to be detected as active, but the test breaks.

(same p2pBandwidthLatencyTest output as in "Describe the bug" above)

Bug Incidence

Always

nvidia-bug-report.log.gz

(same p2pBandwidthLatencyTest output as above)

More Info

No response

Small BAR Size Support?

NVIDIA Open GPU Kernel Modules Version

550.90.07

Operating System and Version

Ubuntu 22.04

Kernel Release

6.8.9

Hardware: GPU

4090

Describe the bug

We have tested the modified kernel module on two systems: an Intel desktop (with the full BAR = 32 GB) and an AMD server (without Resizable BAR; BAR = 256 MB / 512 MB).

On the Intel system, with the full 32 GB BAR size for the two 4090s, the NCCL/P2P tests pass with the modded driver.

However, on the AMD server, where the BIOS doesn't support Resizable BAR, nvidia-smi only shows 256 MB and 512 MB BAR sizes for the two 4090s. On this server, even with the modded NVIDIA driver, the NCCL/P2P tests fail. The AMD server also has lots of PCIe devices, so it may be running out of PCIe address space to assign the large 32 GB BARs that the 4090 supports.

So my question is: does the current P2P+4090 code only work if the BAR size is >= the full 4090 VRAM size? Thank you!

Low performance when running over NVLink

NVIDIA Open GPU Kernel Modules Version

Comparing with NVIDIA commit 12933b2

Please confirm this issue does not happen with the proprietary driver (of the same version). This issue tracker is only for bugs specific to the open kernel driver.

  • I confirm that this does not happen with the proprietary driver package.

Operating System and Version

Ubuntu 22.04.4 LTS

Kernel Release

5.15.0-102-generic

Please confirm you are running a stable release kernel (e.g. not a -rc). We do not accept bug reports for unreleased kernels.

  • I am running on a stable kernel release.

Hardware: GPU

NVIDIA GeForce RTX 3090

Describe the bug

Thank you for this project! It seems to be working well on 3090s. However, NVLink seems to underperform with this fork.

In the results below, the variation in the performance of PCIe GPUs is caused by differing PCIe versions and lanes. GPUs 2 and 3 are connected via NVLink (4 lanes, 56.25GB/s theoretical unidirectional performance). They are also connected via PCIe Gen 4 x8 (25GB/s theoretical unidirectional performance).

Running p2pBandwidthLatencyTest with this fork:

Bidirectional P2P=Disabled Bandwidth Matrix (GB/s)
   D\D     0      1      2      3      4      5      6
     0 837.80  11.33  11.67  10.49  15.66  11.40  11.11
     1  11.37 812.92   8.92   8.93  11.38   8.94  11.40
     2  11.23   8.94 838.70   8.97  11.14   8.98  11.27
     3  11.20   8.90   8.91 838.00  11.12   8.92  11.25
     4  15.48  11.35  11.57  11.55 838.93  11.39  16.07
     5  11.34   8.90   8.95   8.93  11.38 838.03  11.31
     6  15.86  11.39  10.57  11.67  16.05  10.95 838.48

Bidirectional P2P=Enabled Bandwidth Matrix (GB/s)
   D\D     0      1      2      3      4      5      6
     0 837.13  25.42  26.00  25.98  50.82  25.46  51.20
     1  25.45 838.25  25.40  25.50  25.46  25.52  25.45
     2  25.95  25.45 837.58  17.27  25.99  25.46  25.99
     3  25.99  25.50  17.04 835.34  25.99  25.46  25.99
     4  50.18  25.46  26.00  25.98 838.25  25.42  51.21
     5  25.46  25.57  25.41  25.51  25.38 837.35  25.47
     6  50.20  25.46  25.99  25.98  51.22  25.47 839.83

With the original open-source driver:

Bidirectional P2P=Disabled Bandwidth Matrix (GB/s)
   D\D     0      1      2      3      4      5      6
     0 837.13  11.36  11.66  10.49  15.65  11.37  15.92
     1  11.43 830.23   8.88   8.92  11.41   8.95  11.38
     2  11.18   8.93 837.80   8.97  11.13   8.99  11.26
     3  11.21   8.91   8.91 839.60  11.13   8.91  11.26
     4  15.51  11.38  11.56  11.57 838.70  11.41  16.01
     5  11.34   8.97   8.93   8.94  11.35 838.67  11.28
     6  15.86  11.35  11.66  11.68  11.66  11.27 838.03
Bidirectional P2P=Enabled Bandwidth Matrix (GB/s)
   D\D     0      1      2      3      4      5      6
     0 837.56  11.35  11.66  11.65  15.60  11.32  15.92
     1  11.42 838.66   8.94   8.94  11.37   8.94  11.38
     2  11.21   8.94 838.70 101.69  11.14   8.94  11.26
     3  11.19   8.97 101.91 837.80  11.11   8.92  11.26
     4  15.50  11.37  11.57  11.57 838.48  11.37  15.84
     5  11.31   8.95   8.93   8.94  11.33 838.03  11.28
     6  15.80  11.35  11.70  10.43  16.07  11.28 838.93

We can see that the p2p driver improves performance as expected on PCIe with this fork (e.g. 15.80 GB/s -> 50.20 GB/s). However the NVLink performance (GPUs 2 and 3) decreases from ~100 GB/s to ~17 GB/s.

To Reproduce

Run p2pBandwidthLatencyTest and compare with the original open-source driver.

Bug Incidence

Always

nvidia-bug-report.log.gz

nvidia-bug-report.log.gz

More Info

No response

simpleP2P fails and Vulkan programs cannot allocate memory

NVIDIA Open GPU Kernel Modules Version

550.90.07-p2p

Please confirm this issue does not happen with the proprietary driver (of the same version). This issue tracker is only for bugs specific to the open kernel driver.

  • I confirm that this does not happen with the proprietary driver package.

Operating System and Version

Arch Linux

Kernel Release

6.9.4-arch1-1

Please confirm you are running a stable release kernel (e.g. not a -rc). We do not accept bug reports for unreleased kernels.

  • I am running on a stable kernel release.

Hardware: GPU

2x NVIDIA GeForce RTX 4090

Describe the bug

I tried the modified driver to get P2P running on my two 4090s, but it doesn't work properly. With this driver, applications that use Vulkan also stop working, crashing when trying to allocate memory (vkAllocateMemory returns VK_ERROR_OUT_OF_DEVICE_MEMORY).

To Reproduce

Here's my output from the cuda sample simpleP2P:

Checking for multiple GPUs...
CUDA-capable device count: 2

Checking GPU(s) for support of peer to peer memory access...
> Peer access from NVIDIA GeForce RTX 4090 (GPU0) -> NVIDIA GeForce RTX 4090 (GPU1) : Yes
> Peer access from NVIDIA GeForce RTX 4090 (GPU1) -> NVIDIA GeForce RTX 4090 (GPU0) : Yes
Enabling peer access between GPU0 and GPU1...
Allocating buffers (64MB on GPU0, GPU1 and CPU Host)...
Creating event handles...
cudaMemcpyPeer / cudaMemcpy between GPU0 and GPU1: 12.21GB/s
Preparing host buffer and memcpy to GPU0...
Run kernel on GPU1, taking source data from GPU0 and writing to GPU1...
Run kernel on GPU0, taking source data from GPU1 and writing to GPU0...
Copy data back to host from GPU0 and verify results...
Verification error @ element 1: val = 1.000000, ref = 4.000000
Verification error @ element 2: val = 2.000000, ref = 8.000000
Verification error @ element 3: val = 3.000000, ref = 12.000000
Verification error @ element 4: val = 4.000000, ref = 16.000000
Verification error @ element 5: val = 5.000000, ref = 20.000000
Verification error @ element 6: val = 6.000000, ref = 24.000000
Verification error @ element 7: val = 7.000000, ref = 28.000000
Verification error @ element 8: val = 8.000000, ref = 32.000000
Verification error @ element 9: val = 9.000000, ref = 36.000000
Verification error @ element 10: val = 10.000000, ref = 40.000000
Verification error @ element 11: val = 11.000000, ref = 44.000000
Verification error @ element 12: val = 12.000000, ref = 48.000000
Disabling peer access...
Shutting down...
Test failed!

Here is the output of p2pBandwidthLatencyTest (for this I had to downgrade CUDA from 12.5.0 to 12.4.1; otherwise it would error out, complaining about an incompatible PTX version when running the delay kernel in the sample):

[P2P (Peer-to-Peer) GPU Bandwidth Latency Test]
Device: 0, NVIDIA GeForce RTX 4090, pciBusID: 1, pciDeviceID: 0, pciDomainID:0
Device: 1, NVIDIA GeForce RTX 4090, pciBusID: 2, pciDeviceID: 0, pciDomainID:0
Device=0 CAN Access Peer Device=1
Device=1 CAN Access Peer Device=0

***NOTE: In case a device doesn't have P2P access to other one, it falls back to normal memcopy procedure.
So you can see lesser Bandwidth (GB/s) and unstable Latency (us) in those cases.

P2P Connectivity Matrix
     D\D     0     1
     0       1     1
     1       1     1
Unidirectional P2P=Disabled Bandwidth Matrix (GB/s)
   D\D     0      1 
     0 905.80  11.95 
     1  12.02 922.37 
Unidirectional P2P=Enabled Bandwidth (P2P Writes) Matrix (GB/s)
   D\D     0      1 
     0 837.35  13.18 
     1  13.18 939.00 
Bidirectional P2P=Disabled Bandwidth Matrix (GB/s)
   D\D     0      1 
     0 882.52  16.74 
     1  16.81 923.64 
Bidirectional P2P=Enabled Bandwidth Matrix (GB/s)
   D\D     0      1 
     0 846.51  25.89 
     1  25.93 920.97 
P2P=Disabled Latency Matrix (us)
   GPU     0      1 
     0   1.40  10.63 
     1  11.24   1.43 

   CPU     0      1 
     0   1.35   4.21 
     1   4.27   1.36 
P2P=Enabled Latency (P2P Writes) Matrix (us)
   GPU     0      1 
     0   1.40   0.92 
     1   0.91   1.40 

   CPU     0      1 
     0   1.39   1.10 
     1   1.19   1.36 

NOTE: The CUDA Samples are not meant for performance measurements. Results may vary when GPU Boost is enabled.

And as mentioned, Vulkan applications cannot allocate (at least device-local) memory anymore, e.g. vkcube:

Selected GPU 0: NVIDIA GeForce RTX 4090, type: DiscreteGpu
[1]    5876 segmentation fault (core dumped)  vkcube

or vkgears:

Failed to allocate memory for the depth image

Bug Incidence

Always

nvidia-bug-report.log.gz

nvidia-bug-report.log.gz

More Info

I made sure that the IOMMU is disabled, both via the kernel parameters amd_iommu=off iommu=off and in the BIOS, and large BAR support is there as well:

% nvidia-smi -q | grep -i bar -A 3
    BAR1 Memory Usage
        Total                             : 32768 MiB
        Used                              : 24205 MiB
        Free                              : 8563 MiB
--
    BAR1 Memory Usage
        Total                             : 32768 MiB
        Used                              : 24212 MiB
        Free                              : 8556 MiB

I'm wondering why the BAR1 memory used is almost the full 24 GB that the GPUs have, and whether that is the reason Vulkan applications cannot allocate memory.

GDS support?

NVIDIA Open GPU Kernel Modules Version

NONE

Please confirm this issue does not happen with the proprietary driver (of the same version). This issue tracker is only for bugs specific to the open kernel driver.

  • I confirm that this does not happen with the proprietary driver package.

Operating System and Version

None

Kernel Release

None

Please confirm you are running a stable release kernel (e.g. not a -rc). We do not accept bug reports for unreleased kernels.

  • I am running on a stable kernel release.

Hardware: GPU

None

Describe the bug

Howdy! Thank you so much for this work!
Kind of a stupid question: could we use the same hack for GDS (GPUDirect Storage) support, for weights offloading?
Thanks!

To Reproduce

None

Bug Incidence

Once

nvidia-bug-report.log.gz

None

More Info

No response
