
nccl.jl's People

Contributors

avik-pal, dependabot[bot], github-actions[bot], juliatagbot, kshyatt, maleadt, ranocha, simonbyrne


nccl.jl's Issues

TagBot trigger issue

This issue is used to trigger TagBot; feel free to unsubscribe.

If you haven't already, you should update your TagBot.yml to include issue comment triggers.
Please see this post on Discourse for instructions and more details.

If you'd like me to do this for you, comment "TagBot fix" on this issue.
I'll open a PR within a few hours; please be patient!

Complex number wrapper

NCCL does not support complex numbers directly and does not plan to (see issue). Are we willing to add a wrapper to NCCL.jl to make using complex numbers more convenient? Alternatively, the wrapper could live in a higher-level package (e.g. Lux.jl; see issue). I am happy to start working on this but would like some feedback if possible. My primary motivation is using neural networks with complex-valued weights, and this feature would greatly simplify things.
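For reference, a minimal sketch of the reinterpret-based approach (the helper name allreduce_complex! is hypothetical, and the Allreduce! signature shown is an assumption, not confirmed NCCL.jl API):

    using CUDA, NCCL

    # Hypothetical wrapper, not part of NCCL.jl: a Complex{T} array of length N is
    # bit-compatible with a T array of length 2N, so NCCL's real-valued collectives
    # can reduce it. Only valid for ops that act elementwise on the real and
    # imaginary parts independently, such as sum.
    function allreduce_complex!(sendbuf::CuArray{Complex{T}}, recvbuf::CuArray{Complex{T}},
                                comm::NCCL.Communicator) where {T<:Union{Float32,Float64}}
        sendview = reinterpret(T, sendbuf)    # real view of the same device memory
        recvview = reinterpret(T, recvbuf)
        NCCL.Allreduce!(sendview, recvview, +, comm)  # assumed (sendbuf, recvbuf, op, comm) signature
        return recvbuf
    end

This only helps for reductions like sum or average, where reducing the real and imaginary parts separately gives the right answer; products of complex values cannot be expressed this way and would remain unsupported.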

Tests assume 2 GPUs

The tests should probably also work on a single-GPU machine (we can force some of the CI jobs onto a multi-GPU system to ensure coverage), and also on cyclops, which has 8 devices.
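A minimal sketch of a device-count-agnostic setup, assuming NCCL.Communicators returns one Communicator per visible device (the testset name is illustrative):

    using CUDA, NCCL, Test

    ndev = length(CUDA.devices())
    @testset "collectives across $ndev visible device(s)" begin
        comms = NCCL.Communicators(CUDA.devices())  # one communicator per visible GPU (assumed)
        @test length(comms) == ndev
    end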

status

Hi @maleadt, could you please let me know the status of the NCCL.jl package? I would like to contribute and help finish it.
Thank you very much!

Best
Raj

Tests failed

Do I need to configure anything to make the tests pass?

(This is a fresh installation based on the pytorch:24.01-py3 image.)

     Testing Running tests...
┌ Info: CUDA information:
│ CUDA runtime 12.5, artifact installation
│ CUDA driver 12.5
│ NVIDIA driver 535.161.8, originally for CUDA 12.2
│
│ CUDA libraries:
│ - CUBLAS: 12.5.3
│ - CURAND: 10.3.6
│ - CUFFT: 11.2.3
│ - CUSOLVER: 11.6.3
│ - CUSPARSE: 12.5.1
│ - CUPTI: 2024.2.1 (API 23.0.0)
│ - NVML: 12.0.0+535.161.8
│
│ Julia packages:
│ - CUDA: 5.4.3
│ - CUDA_Driver_jll: 0.9.1+1
│ - CUDA_Runtime_jll: 0.14.1+0
│
│ Toolchain:
│ - Julia: 1.10.4
│ - LLVM: 15.0.7
│
│ 8 devices:
│   0: NVIDIA H100 80GB HBM3 (sm_90, 79.106 GiB / 79.647 GiB available)
│   1: NVIDIA H100 80GB HBM3 (sm_90, 79.106 GiB / 79.647 GiB available)
│   2: NVIDIA H100 80GB HBM3 (sm_90, 79.106 GiB / 79.647 GiB available)
│   3: NVIDIA H100 80GB HBM3 (sm_90, 79.106 GiB / 79.647 GiB available)
│   4: NVIDIA H100 80GB HBM3 (sm_90, 79.106 GiB / 79.647 GiB available)
│   5: NVIDIA H100 80GB HBM3 (sm_90, 79.106 GiB / 79.647 GiB available)
│   6: NVIDIA H100 80GB HBM3 (sm_90, 79.106 GiB / 79.647 GiB available)
└   7: NVIDIA H100 80GB HBM3 (sm_90, 79.106 GiB / 79.647 GiB available)
[ Info: NCCL version: 2.19.4
Communicator: Error During Test at /....../NCCL.jl/test/runtests.jl:11
  Got exception outside of a @test
  NCCLError(code ncclUnhandledCudaError, a call to a CUDA function failed)
  Stacktrace:
    [1] check
      @ /....../NCCL.jl/src/libnccl.jl:17 [inlined]
    [2] ncclCommInitAll
      @ ~/.julia/packages/CUDA/Tl08O/lib/utils/call.jl:34 [inlined]
    [3] Communicators(deviceids::Vector{Int32})
      @ NCCL /....../NCCL.jl/src/communicator.jl:70
    [4] Communicators(devices::CUDA.DeviceIterator)
      @ NCCL /....../NCCL.jl/src/communicator.jl:80
    [5] macro expansion
      @ /....../NCCL.jl/test/runtests.jl:13 [inlined]
    [6] macro expansion
      @ ~/.julia/juliaup/julia-1.10.4+0.x64.linux.gnu/share/julia/stdlib/v1.10/Test/src/Test.jl:1577 [inlined]
    [7] macro expansion
      @ /....../NCCL.jl/test/runtests.jl:13 [inlined]
    [8] macro expansion
      @ ~/.julia/juliaup/julia-1.10.4+0.x64.linux.gnu/share/julia/stdlib/v1.10/Test/src/Test.jl:1577 [inlined]
    [9] top-level scope
      @ /....../NCCL.jl/test/runtests.jl:11
   [10] include(fname::String)
      @ Base.MainInclude ./client.jl:489
   [11] top-level scope
      @ none:6
   [12] eval
      @ ./boot.jl:385 [inlined]
   [13] exec_options(opts::Base.JLOptions)
      @ Base ./client.jl:291
   [14] _start()
      @ Base ./client.jl:552
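The failure occurs in ncclCommInitAll when runtests.jl:13 builds communicators for all visible devices. A diagnostic sketch to reproduce that single call outside the test harness, under the assumption that communicator creation itself is what fails on this machine (NCCL_DEBUG is a standard NCCL environment variable):

    # Run in a fresh Julia session inside the same container.
    ENV["NCCL_DEBUG"] = "INFO"   # ask libnccl to log the underlying CUDA error

    using CUDA, NCCL

    CUDA.versioninfo()                            # confirm the driver/runtime Julia sees
    comms = NCCL.Communicators(CUDA.devices())    # the call that raises ncclUnhandledCudaError above

If this minimal call also raises ncclUnhandledCudaError, the problem lies in communicator creation (driver or container configuration) rather than in the collective tests themselves.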

Register?

It seems like NCCL.jl has been functional for a month now. Is there a plan or timeline to register it?
