container-jupyter-nvidia's People

Contributors

e-carlin, robnagler

container-jupyter-nvidia's Issues

tensorflow libcudnn version mismatch

For some reason, tensorflow 2.8.0 is being installed in the image even though we specify 2.3.1:

tensorflow (master)$ docker run -it --rm radiasoft/jupyter-nvidia:prod /bin/bash -c '/home/vagrant/.pyenv/shims/pip list | grep "tensorflow\s"'
tensorflow                      2.8.0

Sirepo doesn't have this problem, so it is probably something specific to this image:

tensorflow (master)$ docker run -it --rm radiasoft/sirepo:prod /bin/bash -c '/home/vagrant/.pyenv/shims/pip list | grep "tensorflow\s"'
tensorflow                      2.3.1

We need to be careful that all of the versions (TensorFlow, CUDA, cuDNN) work with each other: https://www.tensorflow.org/install/source#gpu
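
As a quick sanity check inside the image, a minimal sketch (assuming TensorFlow 2.3+, where tf.sysconfig.get_build_info() exists) that prints the installed TensorFlow version and the CUDA/cuDNN versions it was built against, for comparison with the table linked above:

# Minimal sketch: report the TensorFlow version and the CUDA/cuDNN versions it
# was built against, to compare with the tested-configurations table.
# Assumes TensorFlow 2.3+; older releases do not have tf.sysconfig.get_build_info().
import tensorflow as tf

build = tf.sysconfig.get_build_info()
print("tensorflow:", tf.__version__)
print("built against CUDA:", build.get("cuda_version"))
print("built against cuDNN:", build.get("cudnn_version"))

This can be run the same way as the pip check above, via docker run against /home/vagrant/.pyenv/shims/python.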

`gpu-elegant` results not matching `elegant`

After some experimentation, we're finding that the output of gpu-elegant seems to be wrong for certain configurations: the particles don't seem to make it through the initial beamline elements, and the details of what is going wrong are subtle. We're not quite sure where it is failing yet, but @cchall has some examples (see attached). TL;DR: gpu-elegant is losing particles.

@cchall built some test files using a csbend component that highlight some of this.

I've rebuilt the newest version of elegant on gpu-jupyter (2021.4) and it shows the same behavior; the default install is 2021.1. I'm currently awaiting account activation for the elegant users' forum to post about the bug.

There's also a local build issue with elegant under CUDA 11 and GCC 11 that is known from other projects, likely due to a C++ standard (c++14 vs c++17) versioning issue. I'm debugging that myself by building gpu-elegant locally on another system, to test whether the problem is with the version in our container or with the code itself. I still need to build with c++17 instead, which requires changing some flags.

The (draft) build process generally follows the procedure in a Google doc that is currently only available to RadiaSoft personnel: elegant build notes

All commands were run as $ <binary> tracking.ele > <logfile>.log 2>&1

2021.1 non-GPU elegant (ele.log):

tracking 4000 particles
3 May 22 16:42:38: This step establishes energy profile vs s (fiducial beam).
3 May 22 16:42:38: Rf phases/references reset.
4000 particles present after pass 0
...
Adding OCT_K after (null)
Adding OCT_K after (null)
Adding OCT_K after (null)
Adding OCT_K after (null)
...
4000 particles present after pass 4        
4000 particles transmitted, total effort of 16000 particle-turns
33776880 multipole kicks done

2021.1 gpu-elegant (gpu-standard-ele.log):

tracking 4000 particles
3 May 22 16:41:48: This step establishes energy profile vs s (fiducial beam).
3 May 22 16:41:48: Rf phases/references reset.
4000 particles present after pass 0 
...
0 particles present after pass 4        
0 particles transmitted, total effort of 0 particle-turns
18720 multipole kicks done

2021.4 gpu-elegant:

tracking 4000 particles
3 May 22 16:40:46: This step establishes energy profile vs s (fiducial beam).
3 May 22 16:40:46: Rf phases/references reset.
4000 particles present after pass 0
...
0 particles present after pass 4        
Post-tracking output completed.
Tracking step completed   ET:     00:00:01 CP:    1.43 BIO:0 DIO:0 PF:0 MEM:4838467
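
To compare runs at a glance, here is a minimal sketch (the log names are the ones above; the parsing is an assumption based on the excerpts) that pulls the per-pass particle counts out of each log:

# Minimal sketch: extract "<N> particles present after pass <M>" counts from the
# elegant / gpu-elegant logs so the runs can be compared side by side.
import re

PASS_RE = re.compile(r"(\d+) particles present after pass (\d+)")

def particles_per_pass(path):
    counts = {}
    with open(path) as f:
        for line in f:
            m = PASS_RE.search(line)
            if m:
                counts[int(m.group(2))] = int(m.group(1))
    return counts

for log in ("ele.log", "gpu-standard-ele.log"):
    print(log, particles_per_pass(log))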

And an additional fun compilation message:

/home/vagrant/jupyter/oag/apps/src/epics/extensions/include/mdb.h:599: warning: "PI" redefined
  599 | #define PI 3.141592653589793
      | 
In file included from gpu_lsc.cu:1:
/home/vagrant/jupyter/oag/apps/src/epics/extensions/include/constants.h:35: note: this is the location of the previous definition
   35 | #define PI   3.141592653589793238462643

tensorflow: failed call to cuInit: CUDA_ERROR_UNKNOWN: unknown error

2019-11-19 00:50:49.396719: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcuda.so.1
2019-11-19 00:50:49.398230: E tensorflow/stream_executor/cuda/cuda_driver.cc:318] failed call to cuInit: CUDA_ERROR_UNKNOWN: unknown error
2019-11-19 00:50:49.398264: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:169] retrieving CUDA diagnostic information for host: [snip]
2019-11-19 00:50:49.398273: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:176] hostname: [snip]
2019-11-19 00:50:49.398337: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:200] libcuda reported version is: 418.87.1
2019-11-19 00:50:49.398357: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:204] kernel reported version is: 418.87.1
2019-11-19 00:50:49.398362: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:310] kernel version seems to match DSO: 418.87.1

Tried to follow existing Dockerfiles, with things like:

LD_LIBRARY_PATH=/usr/local/cuda/lib64/stubs:/usr/local/cuda/extras/CUPTI/lib64:/usr/local/cuda/lib64:$LD_LIBRARY_PATH python -c 'import tensorflow; print(tensorflow.config.experimental.list_physical_devices())'

Some say the problem is the stubs; removed those.

Trying with a clean beamsim.
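
For reference, a minimal diagnostic sketch to run inside the container. The checks are assumptions about common causes of cuInit: CUDA_ERROR_UNKNOWN (missing /dev/nvidia* device nodes, or the container not being started with GPU access), not a confirmed fix for this image:

# Minimal diagnostic sketch for cuInit: CUDA_ERROR_UNKNOWN inside the container.
# Checks that the NVIDIA device nodes are present and that nvidia-smi runs
# before asking TensorFlow for GPUs.
import glob
import subprocess

nodes = glob.glob("/dev/nvidia*")
print("NVIDIA device nodes:", nodes or "none (container may lack GPU access)")

try:
    print(subprocess.run(["nvidia-smi"], capture_output=True, text=True).stdout)
except FileNotFoundError:
    print("nvidia-smi not found in PATH")

import tensorflow as tf
print("GPUs visible to TensorFlow:", tf.config.experimental.list_physical_devices("GPU"))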

disable gpuElegant

Getting compile errors on Fedora 36 (f36):

/usr/include/c++/12/bits/c++0x_warning.h:32:2: error: #error This file requires compiler and library support for the ISO C++ 2011 standard. This support must be enabled with the -std=c++11 or -std=gnu++11 compiler options.
   32 | #error This file requires compiler and library support \

Reached out to the team and no one is using it.

GPU cannot be found

Project: nuline

Using tensorflow 2.4.0

Cannot find the GPU on the server. It looks to be searching on localhost for a GPU defined at /device/GPU:<number>. It looks as if only the CPU is supported, but I've confirmed that tensorflow can see all 4 GPUs. The complete error output is:

InvalidArgumentError                      Traceback (most recent call last)
<timed exec> in <module>

~/.pyenv/versions/py3/lib/python3.7/site-packages/keras/utils/traceback_utils.py in error_handler(*args, **kwargs)
     65     except Exception as e:  # pylint: disable=broad-except
     66       filtered_tb = _process_traceback_frames(e.__traceback__)
---> 67       raise e.with_traceback(filtered_tb) from None
     68     finally:
     69       del filtered_tb

~/.pyenv/versions/py3/lib/python3.7/site-packages/tensorflow/python/framework/ops.py in raise_from_not_ok_status(e, name)

InvalidArgumentError: Could not satisfy device specification '/job:localhost/replica:0/task:0/device:GPU:3'. enable_soft_placement=0. Supported device types [CPU]. All available devices [/job:localhost/replica:0/task:0/device:GPU:0, /job:localhost/replica:0/task:0/device:GPU:1, /job:localhost/replica:0/task:0/device:GPU:2, /job:localhost/replica:0/task:0/device:GPU:3, /job:localhost/replica:0/task:0/device:CPU:0]. [Op:RangeDataset]
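
For what it's worth, RangeDataset (the op in the error) is a tf.data op that only has a CPU kernel, so pinning it to /device:GPU:3 with soft placement disabled fails even though all four GPUs are visible. A minimal sketch of the usual workarounds (enable soft placement, and keep dataset construction out of the GPU device scope; the dataset below is a made-up placeholder):

# Minimal sketch: tf.data ops such as RangeDataset only have CPU kernels, so
# placing them on a GPU with soft placement disabled raises the error above.
import tensorflow as tf

# Option 1: let ops without a GPU kernel fall back to the CPU.
tf.config.set_soft_device_placement(True)

# Option 2: build the dataset outside the GPU scope and only put compute on the GPU.
dataset = tf.data.Dataset.range(1000).batch(32)  # placeholder data; stays on the CPU

with tf.device("/device:GPU:3"):  # assumes GPU 3 exists, as in the error above
    for batch in dataset.take(1):
        print(tf.reduce_sum(batch))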

Hypre on V100

Interested in testing hypre on our V100, related to hypre-space/hypre#551

The hypre build script for Perlmutter is:

#!/bin/bash
set -e

_VERSION=2.23.0
_PACKAGE='hypre'
_MAKE_JOBS=10

# This is using the default Perlmutter module set as of 11/21
module load cudatoolkit

git clone https://github.com/hypre-space/hypre.git
cd hypre
git checkout tags/v${_VERSION}
cd ..

_INSTALL="${HOME}/Perlmutter/hypre/${NERSC_HOST}/${_VERSION}"
echo "export HYPREHOME=${_INSTALL}" >> module.sh

cd ${_PACKAGE}/src

export FC=ftn
export CC=cc
export CXX=CC
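# NOTE (assumption): SM 80 targets Perlmutter's A100s; a V100 build would need HYPRE_CUDA_SM=70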
export HYPRE_CUDA_SM=80

_COMPFLAGS="-O3"

export FFLAGS="${_COMPFLAGS}"
export F77FLAGS="${_COMPFLAGS}"
export LDFLAGS="${_OPENMP}"
export CXXFLAGS="${_COMPFLAGS}"
export CFLAGS="${_COMPFLAGS}"
_PREFIX="${_INSTALL}"
./configure \
    --prefix="${_PREFIX}" \
    --with-MPI \
    --with-cuda \
    --enable-unified-memory
make -j${_MAKE_JOBS} test
make install
cp config.log ${_INSTALL}

where the compiler wrapper commands are aliased to nvfortran, nvcc, and nvc++ based on the set of loaded modules.
