Comments (9)
It looks like the CUDA backend has similar but not identical problems to the ISPC backend:
$ env CFLAGS="-I$CUDA_HOME/include -O -L$CUDA_HOME/lib64" futhark-aarch64 cuda badness.fut && echo 30 | ./badness
Warning: device compute capability is 9.0, but newest supported by Futhark is 8.7.
[100i64, 101i64, 102i64, 103i64, 104i64, 105i64, 106i64, 107i64, 108i64, 109i64, 110i64]
$ env CFLAGS="-I$CUDA_HOME/include -O -L$CUDA_HOME/lib64" futhark-aarch64 cuda badness.fut && echo 31 | ./badness
Warning: device compute capability is 9.0, but newest supported by Futhark is 8.7.
[100i64, 101i64, 102i64, 103i64, 104i64, 105i64, 106i64, 107i64, 108i64, 109i64, 110i64]
$ env CFLAGS="-I$CUDA_HOME/include -O -L$CUDA_HOME/lib64" futhark-aarch64 cuda badness.fut && echo 32 | ./badness
Warning: device compute capability is 9.0, but newest supported by Futhark is 8.7.
./badness: badness.c:7153: CUDA call
cuCtxSynchronize()
failed with error code 700 (an illegal memory access was encountered)
This is on early Grace Hopper hardware (ARM CPU + H100 GPU).
from futhark.
Well! That is not great. And on the AMD RX7900 at home, running your program is a very efficient way to shut down the graphical display.
Fortunately, I don't think this is difficult to fix. It's probably an artifact of the old 32-bit size handling, which still lurks in some calculations in the code generator, but the foundations are 64-bit clean. I will take a look.
from futhark.
The OpenCL backend works, so it's likely a problem in the single pass scan, which is used for the CUDA and HIP backends.
The multicore backend also works, so the ISPC error is due to something ISPC-specific.
from futhark.
For the GPU backends, this might actually just be an OOM error. Filtering is surprisingly memory-expensive. With the GPU backends, n=29 requires 12GiB of memory. Presumably, n=30 would require 24GiB, n=31 48GiB, and n=32 96GiB, the last of which is beyond even what an H100 possesses. It's just a coincidence that this is somewhat close to the 32-bit barrier.
The reason for the memory usage is as follows.
- Creating the array to be filtered: 8n bytes.
- Creating a true/false mask of booleans: n bytes.
- Offset array: 8n bytes.
- Creating an output array into which the filter result will be put: 8n bytes.
The mask array is fused with the scan producing the offset array, and so doesn't take any memory. I suppose there is no reason for the output array to be so large, however - I think it is only because our `filter` is actually implemented as `partition`.
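As a rough check of the figures above, the breakdown can be turned into a few lines of arithmetic. This is only a sketch: it assumes the program's input `n` is the base-2 logarithm of the element count (which is what makes n=29 come out at 12GiB), that elements and offsets are 8 bytes each, and that the fused mask costs nothing. The function name `filter_memory_bytes` is made up for illustration.

```python
GIB = 1024 ** 3

def filter_memory_bytes(log2_elems, elem_bytes=8):
    """Estimate GPU memory needed to filter 2**log2_elems 8-byte values.

    Assumption: the input number is the exponent, not the element count.
    """
    elems = 2 ** log2_elems
    input_array = elem_bytes * elems  # the array to be filtered
    offsets = 8 * elems               # one 64-bit offset per element
    output = elem_bytes * elems       # full-size output array
    # The boolean mask is fused into the scan, so it is not counted here.
    return input_array + offsets + output

for n in (29, 30, 31, 32):
    print(f"n={n}: {filter_memory_bytes(n) // GIB} GiB")
```

Under these assumptions the loop reports 12, 24, 48 and 96 GiB, matching the estimates given earlier in this comment.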
We should handle GPU OOM better. This has been on my list for a while, but the GPU APIs make it surprisingly difficult to do robustly.
The ISPC error is probably a real 32-bit issue however, and I think I remember why: ISPC is very slow when asked to do 64-bit index arithmetic, so we reasoned nobody would want to use it with such large arrays anyway.
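To illustrate what a 32-bit issue of this kind looks like in general, here is a small sketch of how element indices and byte offsets wrap when squeezed into a signed 32-bit integer. It does not reproduce the ISPC backend's actual code generation, and which calculation overflows there is not established in this thread; the helper `as_i32` is hypothetical and simply mimics truncation to 32 bits.

```python
import ctypes

def as_i32(x):
    """Truncate a Python int to a signed 32-bit value (two's complement)."""
    return ctypes.c_int32(x).value

for log2_elems in (30, 31, 32):
    last_index = 2 ** log2_elems - 1
    byte_offset = last_index * 8  # offset of the last 8-byte element
    print(log2_elems, as_i32(last_index), as_i32(byte_offset))
```

The largest index that survives truncation is 2^31 - 1; anything beyond that wraps to a negative or otherwise wrong value, and byte offsets for 8-byte elements wrap well before the index itself does.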
from futhark.
I didn't realize that a literal range would still require full-size array construction. Thanks for explaining where all the memory is going.
I just checked, and my Grace Hopper node has 80GiB of GPU memory and 512GiB of CPU memory. I assume this implies that Futhark is not using Unified Memory. That would be a nice option for the CUDA backend if you're looking for more work. 😀 I've heard that the penalty for GPUs accessing CPU memory is a lot lower on Grace Hopper than on previous systems, but I haven't yet tested that myself.
from futhark.
I am in fact considering just enabling unified memory unconditionally on the CUDA backend. I did some experiments recently, and it doesn't seem to cause any overhead for the cases where you stay within GPU memory anyway (and it lets you finish execution when you don't).
from futhark.
@athas Would you add unified memory to the HIP backend as well then?
from futhark.
If it performs similarly, I don't see why not. But it would also be easy to make a configuration option indicating which kind of memory you prefer, as the allocation APIs are pretty much identical either way.
In the longer term, this would also make some operations more efficient (such as directly indexing Futhark arrays from the host), but just making oversize allocations possible would be an easy start.
from futhark.