radiasoft / container-jupyter-nvidia
License: Apache License 2.0
For some reason tensorflow 2.8.0 is being installed in the image, even though we specify 2.3.1:
tensorflow (master)$ docker run -it --rm radiasoft/jupyter-nvidia:prod /bin/bash -c '/home/vagrant/.pyenv/shims/pip list | grep "tensorflow\s"'
tensorflow 2.8.0
Sirepo doesn't have this problem, so it is probably something specific to this image:
tensorflow (master)$ docker run -it --rm radiasoft/sirepo:prod /bin/bash -c '/home/vagrant/.pyenv/shims/pip list | grep "tensorflow\s"'
tensorflow 2.3.1
We need to be careful that all versions work with each other: https://www.tensorflow.org/install/source#gpu
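A minimal guard for the image build (the function name and the injectable version lookup are my own, not part of the container) could compare the installed TensorFlow version against the pin and fail loudly on drift:

```python
# Sketch: fail the image build if a dependency silently upgraded past its pin.
from importlib import metadata


def check_pin(dist, expected, version_of=metadata.version):
    """Raise if the installed version of `dist` differs from `expected`."""
    installed = version_of(dist)
    if installed != expected:
        raise RuntimeError(f"{dist} pin violated: {installed} != {expected}")
    return installed
```

Called near the end of the Dockerfile, e.g. `python -c '...; check_pin("tensorflow", "2.3.1")'`, this would have caught the 2.8.0 upgrade at build time rather than at run time.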
After some experimentation, we're finding that gpu-elegant output for certain configurations seems to be failing. In particular, the particles don't seem to make it through the initial beamline elements, and the details of what is going wrong are subtle. We're not quite sure where it is failing yet, but @cchall has some examples (see attached). TL;DR: elegant is losing particles.
@cchall built some test files using a csbend component that highlight some of this.
I've rebuilt the newest version of elegant (2021.4) on gpu-jupyter and it shows the same behavior; the default install is 2021.1. I'm currently awaiting account activation for the elegant users' forum to post about the bug.
There's also a local build issue with elegant and CUDA 11 w/ GCC 11 that is known on other projects, likely due to a C++ standard versioning issue (c++14 vs. c++17). I'm debugging that myself so I can build gpu-elegant locally on another system and test whether the problem is the version in our container or the code itself. I still need to build with c++17 instead, which requires changing some flags.
The (draft) build process generally follows the procedure in this Google doc, which is currently only available to RadiaSoft personnel: elegant build notes
All commands were run as $ <binary> tracking.ele > <logfile>.log 2>&1
2021.1 non-GPU elegant (ele.log):
tracking 4000 particles
3 May 22 16:42:38: This step establishes energy profile vs s (fiducial beam).
3 May 22 16:42:38: Rf phases/references reset.
4000 particles present after pass 0
...
Adding OCT_K after (null)
Adding OCT_K after (null)
Adding OCT_K after (null)
Adding OCT_K after (null)
...
4000 particles present after pass 4
4000 particles transmitted, total effort of 16000 particle-turns
33776880 multipole kicks done
2021.1 gpu-elegant (gpu-standard-ele.log):
tracking 4000 particles
3 May 22 16:41:48: This step establishes energy profile vs s (fiducial beam).
3 May 22 16:41:48: Rf phases/references reset.
4000 particles present after pass 0
...
0 particles present after pass 4
0 particles transmitted, total effort of 0 particle-turns
18720 multipole kicks done
2021.4 gpu-elegant:
tracking 4000 particles
3 May 22 16:40:46: This step establishes energy profile vs s (fiducial beam).
3 May 22 16:40:46: Rf phases/references reset.
4000 particles present after pass 0
...
0 particles present after pass 4
Post-tracking output completed.
Tracking step completed ET: 00:00:01 CP: 1.43 BIO:0 DIO:0 PF:0 MEM:4838467
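To diff CPU and GPU runs programmatically instead of by eye, a small helper (assuming the log format shown in the excerpts above) can pull the particle count per pass out of each elegant log:

```python
# Sketch: extract "<N> particles present after pass <M>" lines from an
# elegant log so two runs can be compared pass by pass.
import re

PASS_RE = re.compile(r"^(\d+) particles present after pass (\d+)", re.M)


def particles_per_pass(log_text):
    """Map pass number -> particles surviving, from elegant stdout."""
    return {int(p): int(n) for n, p in PASS_RE.findall(log_text)}


# Tiny samples mirroring the excerpts above:
cpu = particles_per_pass("4000 particles present after pass 0\n"
                         "4000 particles present after pass 4\n")
gpu = particles_per_pass("4000 particles present after pass 0\n"
                         "0 particles present after pass 4\n")
first_loss = min((p for p in gpu if gpu[p] < cpu.get(p, 0)), default=None)
print(first_loss)  # → 4, the first recorded pass where the GPU run diverges
```

With full logs this narrows down which pass (and, with `-watch` style output, which element) first drops particles.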
And an additional fun compilation message:
/home/vagrant/jupyter/oag/apps/src/epics/extensions/include/mdb.h:599: warning: "PI" redefined
599 | #define PI 3.141592653589793
|
In file included from gpu_lsc.cu:1:
/home/vagrant/jupyter/oag/apps/src/epics/extensions/include/constants.h:35: note: this is the location of the previous definition
35 | #define PI 3.141592653589793238462643
Maybe the rpm adds them now.
2019-11-19 00:50:49.396719: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcuda.so.1
2019-11-19 00:50:49.398230: E tensorflow/stream_executor/cuda/cuda_driver.cc:318] failed call to cuInit: CUDA_ERROR_UNKNOWN: unknown error
2019-11-19 00:50:49.398264: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:169] retrieving CUDA diagnostic information for host: [snip]
2019-11-19 00:50:49.398273: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:176] hostname: [snip]
2019-11-19 00:50:49.398337: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:200] libcuda reported version is: 418.87.1
2019-11-19 00:50:49.398357: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:204] kernel reported version is: 418.87.1
2019-11-19 00:50:49.398362: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:310] kernel version seems to match DSO: 418.87.1
Tried to follow dockerfiles with things like:
LD_LIBRARY_PATH=/usr/local/cuda/lib64/stubs:/usr/local/cuda/extras/CUPTI/lib64:/usr/local/cuda/lib64:$LD_LIBRARY_PATH python -c 'import tensorflow; print(tensorflow.config.experimental.list_physical_devices())'
Some say it is the stubs; removed those.
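The CUDA_ERROR_UNKNOWN above is reported by cuInit, so one way to separate a driver/stub problem from a TensorFlow packaging problem is to call cuInit directly via ctypes, bypassing TensorFlow entirely (a diagnostic sketch, not part of the image):

```python
# Sketch: probe the CUDA driver directly; if cuInit fails here too, the
# problem is the driver/stub setup, not TensorFlow.
import ctypes


def probe_cuinit():
    """Return a short status string describing what cuInit(0) reports."""
    try:
        libcuda = ctypes.CDLL("libcuda.so.1")
    except OSError:
        return "libcuda.so.1 not found (no driver visible in this environment)"
    result = libcuda.cuInit(0)  # CUresult; 0 == CUDA_SUCCESS
    if result == 0:
        return "cuInit OK"
    return f"cuInit failed with CUDA error {result}"


print(probe_cuinit())
```

If this finds the stub library on LD_LIBRARY_PATH instead of the real driver, cuInit fails the same way TensorFlow does, which would confirm the stubs theory.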
Trying with a clean beamsim.
Getting compile errors on f36:
/usr/include/c++/12/bits/c++0x_warning.h:32:2: error: #error This file requires compiler and library support for the ISO C++ 2011 standard. This support must be enabled with the -std=c++11 or -std=gnu++11 compiler options.
32 | #error This file requires compiler and library support \
Reached out to the team and no one is using it.
@j-edelen wants the julia change from https://github.com/radiasoft/container-beamsim-jupyter/pull/98
Project: nuline
Using tensorflow 2.4.0.
TensorFlow cannot find the GPU on the server. It looks to be searching for localhost when the GPU is defined at /device/GPU:<number>. It looks as if only the CPU is supported, but I've confirmed that tensorflow can see all 4 GPUs. The complete error output is:
InvalidArgumentError Traceback (most recent call last)
<timed exec> in <module>
~/.pyenv/versions/py3/lib/python3.7/site-packages/keras/utils/traceback_utils.py in error_handler(*args, **kwargs)
65 except Exception as e: # pylint: disable=broad-except
66 filtered_tb = _process_traceback_frames(e.__traceback__)
---> 67 raise e.with_traceback(filtered_tb) from None
68 finally:
69 del filtered_tb
~/.pyenv/versions/py3/lib/python3.7/site-packages/tensorflow/python/framework/ops.py in raise_from_not_ok_status(e, name)
InvalidArgumentError: Could not satisfy device specification '/job:localhost/replica:0/task:0/device:GPU:3'. enable_soft_placement=0. Supported device types [CPU]. All available devices [/job:localhost/replica:0/task:0/device:GPU:0, /job:localhost/replica:0/task:0/device:GPU:1, /job:localhost/replica:0/task:0/device:GPU:2, /job:localhost/replica:0/task:0/device:GPU:3, /job:localhost/replica:0/task:0/device:CPU:0]. [Op:RangeDataset]
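One note on the "searching for localhost" reading: the `/job:localhost` prefix in this error is TensorFlow's default single-process job name, not a network hostname lookup; the short form `/GPU:3` and the full spec name the same device. A small stdlib-only parser (my own, for illustration) makes the two forms comparable:

```python
# Sketch: normalize TF device strings, full ("/job:.../device:GPU:3") or
# short ("/GPU:3"), to a (device_type, index) pair.
import re

DEVICE_RE = re.compile(
    r"(?:/job:(?P<job>\w+)/replica:(?P<replica>\d+)/task:(?P<task>\d+))?"
    r"/(?:device:)?(?P<type>[A-Za-z_]+):(?P<index>\d+)$"
)


def parse_device(spec):
    """Return (device_type, index) from a TF device string."""
    m = DEVICE_RE.search(spec)
    if not m:
        raise ValueError(f"not a device spec: {spec}")
    return m.group("type"), int(m.group("index"))


print(parse_device("/job:localhost/replica:0/task:0/device:GPU:3"))  # → ('GPU', 3)
```

Note the error itself says `Supported device types [CPU]` with `enable_soft_placement=0`: the op (`RangeDataset`) only has a CPU kernel, and strict placement refuses to fall back, even though all 4 GPUs are visible.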
Having access to gpu-elegant could be useful for middl
Interested in testing hypre on our V100, related to hypre-space/hypre#551
The hypre build script for Perlmutter is:
#!/bin/bash
set -e

_VERSION=2.23.0
_PACKAGE='hypre'
_MAKE_JOBS=10

# This is using the default Perlmutter module set as of 11/21
module load cudatoolkit

git clone https://github.com/hypre-space/hypre.git
cd hypre
git checkout tags/v${_VERSION}
cd ..

_INSTALL="${HOME}/Perlmutter/hypre/${NERSC_HOST}/${_VERSION}"
echo "export HYPREHOME=${_INSTALL}" >> module.sh

cd ${_PACKAGE}/src
export FC=ftn
export CC=cc
export CXX=CC
export HYPRE_CUDA_SM=80

_COMPFLAGS="-O3"
export FFLAGS="${_COMPFLAGS}"
export F77FLAGS="${_COMPFLAGS}"
export LDFLAGS="${_OPENMP}"
export CXXFLAGS="${_COMPFLAGS}"
export CFLAGS="${_COMPFLAGS}"

_PREFIX="${_INSTALL}"
./configure \
    --prefix="${_PREFIX}" \
    --with-MPI \
    --with-cuda \
    --enable-unified-memory
make -j${_MAKE_JOBS} test
make install
cp config.log ${_INSTALL}
where the compiler commands are aliased to nvfortran, nvcc, and nvc++ based on the set of loaded modules.
CUDA repos for Fedora 32 do not support 10-1: radiasoft/download#131