container-jupyter-nvidia's People

Contributors

e-carlin, robnagler

container-jupyter-nvidia's Issues

tensorflow libcudnn version mismatch

For some reason, tensorflow 2.8.0 is being installed in the image even though we specify 2.3.1:

tensorflow (master)$ docker run -it --rm radiasoft/jupyter-nvidia:prod /bin/bash -c '/home/vagrant/.pyenv/shims/pip list | grep "tensorflow\s"'
tensorflow                      2.8.0

Sirepo doesn't have this problem, so it is probably something specific to this image:

tensorflow (master)$ docker run -it --rm radiasoft/sirepo:prod /bin/bash -c '/home/vagrant/.pyenv/shims/pip list | grep "tensorflow\s"'
tensorflow                      2.3.1

We need to be careful that all of the versions (TensorFlow, CUDA, cuDNN) work with each other: https://www.tensorflow.org/install/source#gpu
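
As a quick sanity check inside the image, a minimal sketch (assuming TensorFlow 2.3+, where tf.sysconfig.get_build_info() exists) that prints the installed TensorFlow version and the CUDA/cuDNN versions it was built against, for comparison with the table linked above:

# Minimal sketch: report the TensorFlow version and the CUDA/cuDNN versions it
# was built against, to compare with the tested-configurations table.
# Assumes TensorFlow 2.3+; older releases do not have tf.sysconfig.get_build_info().
import tensorflow as tf

build = tf.sysconfig.get_build_info()
print("tensorflow:", tf.__version__)
print("built against CUDA:", build.get("cuda_version"))
print("built against cuDNN:", build.get("cudnn_version"))

This can be run the same way as the pip check above, via docker run against /home/vagrant/.pyenv/shims/python.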

`gpu-elegant` results not matching `elegant`

After some experimentation, we're finding that the output of gpu-elegant seems to be wrong for certain configurations: the particles don't seem to make it through the initial beamline elements, and the details of what is going wrong are subtle. We're not quite sure where it is failing yet, but @cchall has some examples (see attached). TL;DR: gpu-elegant is losing particles.

@cchall built some test files using a csbend component that highlight some of this.

I've rebuilt the newest version of elegant on gpu-jupyter (2021.4) and it shows the same behavior; the default install is 2021.1. I'm currently awaiting account activation for the elegant users' forum to post about the bug.

There's also a local build issue with elegant under CUDA 11 and GCC 11 that is known from other projects, likely due to a C++ standard (c++14 vs c++17) versioning issue. I'm debugging that myself by building gpu-elegant locally on another system, to test whether the problem is with the version in our container or with the code itself. I still need to build with c++17 instead, which requires changing some flags.

The (draft) build process generally follows the procedure in a Google doc that is currently only available to RadiaSoft personnel: elegant build notes

All commands were run as $ <binary> tracking.ele > <logfile>.log 2>&1

2021.1 non-GPU elegant (ele.log):

tracking 4000 particles
3 May 22 16:42:38: This step establishes energy profile vs s (fiducial beam).
3 May 22 16:42:38: Rf phases/references reset.
4000 particles present after pass 0
...
Adding OCT_K after (null)
Adding OCT_K after (null)
Adding OCT_K after (null)
Adding OCT_K after (null)
...
4000 particles present after pass 4        
4000 particles transmitted, total effort of 16000 particle-turns
33776880 multipole kicks done

2021.1 gpu-elegant (gpu-standard-ele.log):

tracking 4000 particles
3 May 22 16:41:48: This step establishes energy profile vs s (fiducial beam).
3 May 22 16:41:48: Rf phases/references reset.
4000 particles present after pass 0 
...
0 particles present after pass 4        
0 particles transmitted, total effort of 0 particle-turns
18720 multipole kicks done

2021.4 gpu-elegant:

tracking 4000 particles
3 May 22 16:40:46: This step establishes energy profile vs s (fiducial beam).
3 May 22 16:40:46: Rf phases/references reset.
4000 particles present after pass 0
...
0 particles present after pass 4        
Post-tracking output completed.
Tracking step completed   ET:     00:00:01 CP:    1.43 BIO:0 DIO:0 PF:0 MEM:4838467
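
To compare runs at a glance, here is a minimal sketch (the log names are the ones above; the parsing is an assumption based on the excerpts) that pulls the per-pass particle counts out of each log:

# Minimal sketch: extract "<N> particles present after pass <M>" counts from the
# elegant / gpu-elegant logs so the runs can be compared side by side.
import re

PASS_RE = re.compile(r"(\d+) particles present after pass (\d+)")

def particles_per_pass(path):
    counts = {}
    with open(path) as f:
        for line in f:
            m = PASS_RE.search(line)
            if m:
                counts[int(m.group(2))] = int(m.group(1))
    return counts

for log in ("ele.log", "gpu-standard-ele.log"):
    print(log, particles_per_pass(log))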

And an additional fun compilation message:

/home/vagrant/jupyter/oag/apps/src/epics/extensions/include/mdb.h:599: warning: "PI" redefined
  599 | #define PI 3.141592653589793
      | 
In file included from gpu_lsc.cu:1:
/home/vagrant/jupyter/oag/apps/src/epics/extensions/include/constants.h:35: note: this is the location of the previous definition
   35 | #define PI   3.141592653589793238462643

tensorflow: failed call to cuInit: CUDA_ERROR_UNKNOWN: unknown error

2019-11-19 00:50:49.396719: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcuda.so.1
2019-11-19 00:50:49.398230: E tensorflow/stream_executor/cuda/cuda_driver.cc:318] failed call to cuInit: CUDA_ERROR_UNKNOWN: unknown error
2019-11-19 00:50:49.398264: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:169] retrieving CUDA diagnostic information for host: [snip]
2019-11-19 00:50:49.398273: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:176] hostname: [snip]
2019-11-19 00:50:49.398337: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:200] libcuda reported version is: 418.87.1
2019-11-19 00:50:49.398357: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:204] kernel reported version is: 418.87.1
2019-11-19 00:50:49.398362: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:310] kernel version seems to match DSO: 418.87.1

Tried to follow existing Dockerfiles, with things like:

LD_LIBRARY_PATH=/usr/local/cuda/lib64/stubs:/usr/local/cuda/extras/CUPTI/lib64:/usr/local/cuda/lib64:$LD_LIBRARY_PATH python -c 'import tensorflow; print(tensorflow.config.experimental.list_physical_devices())'

Some say the problem is the stubs; removed those.

Trying with a clean beamsim.
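
For reference, a minimal diagnostic sketch to run inside the container. The checks are assumptions about common causes of cuInit: CUDA_ERROR_UNKNOWN (missing /dev/nvidia* device nodes, or the container not being started with GPU access), not a confirmed fix for this image:

# Minimal diagnostic sketch for cuInit: CUDA_ERROR_UNKNOWN inside the container.
# Checks that the NVIDIA device nodes are present and that nvidia-smi runs
# before asking TensorFlow for GPUs.
import glob
import subprocess

nodes = glob.glob("/dev/nvidia*")
print("NVIDIA device nodes:", nodes or "none (container may lack GPU access)")

try:
    print(subprocess.run(["nvidia-smi"], capture_output=True, text=True).stdout)
except FileNotFoundError:
    print("nvidia-smi not found in PATH")

import tensorflow as tf
print("GPUs visible to TensorFlow:", tf.config.experimental.list_physical_devices("GPU"))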

disable gpuElegant

Getting compile errors on Fedora 36 (f36):

/usr/include/c++/12/bits/c++0x_warning.h:32:2: error: #error This file requires compiler and library support for the ISO C++ 2011 standard. This support must be enabled with the -std=c++11 or -std=gnu++11 compiler options.
   32 | #error This file requires compiler and library support \

Reached out to the team and no one is using it.

GPU cannot be found

Project: nuline

Using tensorflow 2.4.0

Cannot find the GPU on the server. It looks to be searching on localhost for a GPU defined at /device/GPU:<number>. It looks as if only the CPU is supported, but I've confirmed that tensorflow can see all 4 GPUs. The complete error output is:

InvalidArgumentError                      Traceback (most recent call last)
<timed exec> in <module>

~/.pyenv/versions/py3/lib/python3.7/site-packages/keras/utils/traceback_utils.py in error_handler(*args, **kwargs)
     65     except Exception as e:  # pylint: disable=broad-except
     66       filtered_tb = _process_traceback_frames(e.__traceback__)
---> 67       raise e.with_traceback(filtered_tb) from None
     68     finally:
     69       del filtered_tb

~/.pyenv/versions/py3/lib/python3.7/site-packages/tensorflow/python/framework/ops.py in raise_from_not_ok_status(e, name)

InvalidArgumentError: Could not satisfy device specification '/job:localhost/replica:0/task:0/device:GPU:3'. enable_soft_placement=0. Supported device types [CPU]. All available devices [/job:localhost/replica:0/task:0/device:GPU:0, /job:localhost/replica:0/task:0/device:GPU:1, /job:localhost/replica:0/task:0/device:GPU:2, /job:localhost/replica:0/task:0/device:GPU:3, /job:localhost/replica:0/task:0/device:CPU:0]. [Op:RangeDataset]
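
For what it's worth, RangeDataset (the op in the error) is a tf.data op that only has a CPU kernel, so pinning it to /device:GPU:3 with soft placement disabled fails even though all four GPUs are visible. A minimal sketch of the usual workarounds (enable soft placement, and keep dataset construction out of the GPU device scope; the dataset below is a made-up placeholder):

# Minimal sketch: tf.data ops such as RangeDataset only have CPU kernels, so
# placing them on a GPU with soft placement disabled raises the error above.
import tensorflow as tf

# Option 1: let ops without a GPU kernel fall back to the CPU.
tf.config.set_soft_device_placement(True)

# Option 2: build the dataset outside the GPU scope and only put compute on the GPU.
dataset = tf.data.Dataset.range(1000).batch(32)  # placeholder data; stays on the CPU

with tf.device("/device:GPU:3"):  # assumes GPU 3 exists, as in the error above
    for batch in dataset.take(1):
        print(tf.reduce_sum(batch))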

Hypre on V100

Interested in testing hypre on our V100, related to hypre-space/hypre#551

The hypre build script for Perlmutter is:

#!/bin/bash
set -e

_VERSION=2.23.0
_PACKAGE='hypre'
_MAKE_JOBS=10

# This is using the default Perlmutter module set as of 11/21
module load cudatoolkit

git clone https://github.com/hypre-space/hypre.git
cd hypre
git checkout tags/v${_VERSION}
cd ..

_INSTALL="${HOME}/Perlmutter/hypre/${NERSC_HOST}/${_VERSION}"
echo "export HYPREHOME=${_INSTALL}" >> module.sh

cd ${_PACKAGE}/src

export FC=ftn
export CC=cc
export CXX=CC
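# NOTE (assumption): SM 80 targets Perlmutter's A100s; a V100 build would need HYPRE_CUDA_SM=70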
export HYPRE_CUDA_SM=80

_COMPFLAGS="-O3"

export FFLAGS="${_COMPFLAGS}"
export F77FLAGS="${_COMPFLAGS}"
export LDFLAGS="${_OPENMP}"
export CXXFLAGS="${_COMPFLAGS}"
export CFLAGS="${_COMPFLAGS}"
_PREFIX="${_INSTALL}"
./configure \
    --prefix="${_PREFIX}" \
    --with-MPI \
    --with-cuda \
    --enable-unified-memory
make -j${_MAKE_JOBS} test
make install
cp config.log ${_INSTALL}

where the compiler wrapper commands are aliased to nvfortran, nvcc, and nvc++ based on the set of loaded modules.
