ut-parla / parla.py

A Python based programming system for heterogeneous computing

License: Other

Python 79.69% CMake 2.53% Shell 0.81% C++ 8.74% Makefile 0.62% C 4.46% Cython 0.94% Cuda 2.21%
heterogeneous-computing python scientific-computing

parla.py's Introduction

______          _                 ┌─────────┐┌──────┐
| ___ \        | |                │ Task A  ││Task B│
| |_/ /_ _ _ __| | __ _           └┬───────┬┘└────┬─┘
|  __/ _` | '__| |/ _` |          ┌▽─────┐┌▽─────┐│  
| | | (_| | |  | | (_| |          │Task D││Task C││  
\_|  \__,_|_|  |_|\__,_|          └┬─────┘└┬─────┘│  
                                  ┌▽─────┐┌▽──────▽┐ 
                                  └──────┘└────────┘ 

Introduction

Parla is a task-parallel programming library for Python. Parla targets the orchestration of heterogeneous (CPU+GPU) workloads on a single shared-memory machine. We provide features for resource management, task variants, and automated scheduling of data movement between devices.

We design for gradual adoption, allowing users to easily port sequential code for parallel execution.

The Parla runtime is multi-threaded but single-process to utilize a shared address space. In practice this means that the main compute workload within each task must release the CPython Global Interpreter Lock (GIL) to achieve parallel speedup.

Note: Parla is not designed with workflow management in mind and does not currently support features for fault-tolerance or checkpointing.

Installation

Parla is currently distributed from this repository as a Python module.

Parla 0.2 requires Python>=3.7, numpy, cupy, and psutil and can be installed as follows:

conda install -c conda-forge numpy cupy psutil   # or: pip install numpy cupy psutil
git clone https://github.com/ut-parla/Parla.py.git
cd Parla.py
pip install .

To test your installation, try running

python tutorial/0_hello_world/hello.py

This should print

Hello, World!

We recommend working through the tutorial as a starting point for learning Parla!

Example Usage

Parla tasks are launched in an indexed namespace (the 'TaskSpace') and capture variables from the local scope through the task body's closure.

Basic usage can be seen below:

from parla import Parla
from parla.cpu import cpu
from parla.cuda import gpu
from parla.tasks import spawn, TaskSpace

with Parla():
    T = TaskSpace("Example Space")

    for i in range(4):
        @spawn(T[i], placement=cpu)
        def tasks_A():
            print(f"We run first on the CPU. I am task {i}", flush=True)

    @spawn(T[4], dependencies=[T[0:4]], placement=gpu)
    def task_B():
        print("I run second on any GPU", flush=True)

Example Mini-Apps

The examples have a wider set of dependencies.

Running all requires: scipy, numba, pexpect, mkl, mkl-service, and Cython.

To get the full set of examples (BLR, N-Body, and synthetic graphs) initialize the submodules:

git submodule update --init --recursive --remote

Specific installation and run instructions for each of these submodules can be found in their directories.

The test suite over them (reproducing the results in the SC'22 paper) can be launched as:

python examples/launcher.py --figures <list of figures to reproduce>

Acknowledgements

This software is based upon work supported by the Department of Energy, National Nuclear Security Administration under Award Number DE-NA0003969.

How to Cite Parla.py

Please cite the following reference.

@inproceedings{lee2022parla,
    author = {H. Lee and W. Ruys and Y. Yan and S. Stephens and B. You and H. Fingler and I. Henriksen and A. Peters and M. Burtscher and M. Gligoric and K. Schulz and K. Pingali and C. J. Rossbach and M. Erez and G. Biros},
    title = {Parla: A Python Orchestration System for Heterogeneous Architectures},
    year = {2022},
    booktitle = {Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis},
    series = {SC'22}
}

parla.py's People

Contributors

ag1548, arthurp, bozhiyou, hfingler, insertinterestingnamehere, nicelhc13, sestephens73, wlruys, yinengy


parla.py's Issues

Type aliases

We use type hints extensively. Cross-module type references complicate the import chain: some imports exist only for certain Parla types and unnecessarily create module dependencies. Type aliases may help to create a clear hierarchy and concise semantics; e.g. PlacementSource = Union[Architecture, Device, Task, TaskID, Any] means that a placement can be determined from objects of any of these types.
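
A minimal sketch of such an alias; the import paths below are assumptions about where these types currently live:

from typing import Any, Union

from parla.device import Architecture, Device   # assumed locations
from parla.task_runtime import Task
from parla.tasks import TaskID

# Anything a placement can be derived from.
PlacementSource = Union[Architecture, Device, Task, TaskID, Any]

def resolve_placement(source: PlacementSource) -> Device:
    """Hypothetical helper illustrating how the alias reads in a signature."""
    ...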

Tasks can't modify nonlocal variables

When a nonlocal variable is modified within a task, the variable's value doesn't actually change outside the scope of the task. As an example:

from parla import Parla
from parla.cpu import cpu
from parla.cuda import gpu
from parla.tasks import *

async def foo():
    a = 1
    print(a)

    @spawn()
    def t1():
        nonlocal a
        print(a)
        a = 2
        print(a)

    await t1
    print(a)

if __name__ == "__main__":
    with Parla():
        @spawn()
        async def bar():
            await foo()

I would expect this to print

1
1
2
2

but instead, it prints

1
1
2
1

Relatively aggressive logging may affect performance

The system has quite a lot of logging in there from various debugging efforts. Most of the time this is fine because it's at the debug level so it doesn't show up. If you think this might be affecting you, profile it. Don't just assume this is an issue. If you are sure logging is an issue, search for logger. in the code base. There is a fair amount of logging in task_runtime.py and other core files.

Overhead, cost, and mitigation:

  • Information collection (cost: can be high): use logger.isEnabledFor to check whether logging is enabled before collecting information and making the log call.
  • String formatting (cost: can be high): use the %-style formatting built into the logger functions; this formatting only happens if the log entry is actually printed. Sadly, f-strings and str.format used in log calls end up being run regardless of the log level.
  • The function call and log level check (cost: very low): create a _debug_logging flag in the module and check it before making the log call; this check will be faster. In extreme cases, comment out the logging statement.
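
A minimal sketch of the first two mitigations (the logger name, run_task, and the attribute access are illustrative, not the existing code):

import logging

logger = logging.getLogger("parla.task_runtime")  # illustrative logger name

def run_task(task):
    # Only gather expensive debug information when DEBUG is actually enabled.
    if logger.isEnabledFor(logging.DEBUG):
        # %-style arguments are formatted lazily, only if the record is emitted.
        logger.debug("Running task %r (dependencies: %r)", task, task.dependencies)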

VECs don't forward a module's `__dict__` attribute

This is preventing our ARPACK demo from working. The fix is probably to override __getattribute__, but doing that for module objects will require some messing around with the underlying C data structures.

Scheduler gives unnecessary warnings

The scheduler gives unnecessary warnings when trying to assign tasks to processors. Also, the print format on these warning messages needs to be cleaned up.
Simple repro on Frontera:

cd Parla.py/benchmarks/qr_factorization
python qr_parla.py -r 1600000 -c 1000 -b 100000 -i 1 -p gpu -g 1

Once the GPU memory fills up, the scheduler warns that the rest of the tasks may not be able to be scheduled even though they definitely can once previous tasks complete.

It may be worthwhile to have two separate warnings: one that warns that a task can't be scheduled right now due to lack of resources, and a separate warning for when a task won't ever be able to be scheduled (e.g. because it requires a GPU and more memory than any GPU has due to the memory argument in spawn).

Global Variables Aren't Properly Captured By Spawn

The spawn decorator currently rewrites the closure of the function it's passed in order to change the scoping semantics (see the separated_body = type(body)(...) construction in the spawn implementation). Although the current comments call that a "hack", it is actually an integral part of how Parla functions and uses only supported APIs. We need to double down on it for semantic consistency. The current handling only applies this rewriting to nonlocal variables; we need to do it for globals too. Closures should acquire objects and not allow name lookups to be redirected through an outer scope.
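
A small illustration of the gap described above (values are hypothetical): globals are looked up when the task runs, not captured when it is spawned.

x = 1

@spawn()
def t():
    # Prints 2, not 1: the global name `x` is resolved at execution time
    # instead of being captured into the task's closure at spawn time.
    print(x)

x = 2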

Persistent memory location

I am not sure if I am using the persistent memory parameter properly, but my code doesn't work as I expected.

This is part of the multi-GPU blocked Cholesky code and its usage of the persistent memory parameter: here

I would like to pre-define/allocate persistent memory for each device before starting the loop.

This call allocates memory for each GPU and appends it to gpu_arrs. Therefore, the first dimension of gpu_arrs indexes the memory for each GPU (gpu_arrs[0] -> gpu0, gpu_arrs[1] -> gpu1, ...).

Then, I pass gpu_arrs to reserve_persistent_memory().

My expectation is that each GPU is aware of the data it owns
(e.g. gpu0 knows that gpu_arrs[0].nbytes reside in its memory).

But Parla allocates all the memory on the current device.
For example, my use case allocates all memory on gpu0.
This is because reserve_persistent_memory() first makes a reference to or copy of the source data
and then uses the device information of the newly created/referenced array.
Link

As a result, this degrades performance.

For tests, my temporary solution is to replace view.device with amount.device.

I was wondering whether this is the right use case.

Parla task scheduler may limit scaling due to lock usage and polling

The current Parla runtime scheduler uses several locks and some unoptimized queues in task dispatch. This may be limiting scaling since it can induce delays between the end of one kernel and the start of the next. The runtime also has a polling element in which the poll action acquires locks. This may well be creating contention on those locks.

This will need detailed profiling and will probably only be fully fixed once we move our scheduler out of Python to avoid the GIL.

NumPy Can't Automatically Recognize NumPy Arrays From Other Loaded Copies of NumPy

Currently we have to shuttle data handoffs between VECs through the memoryview type since numpy isn't able to copy data out of a duplicate copy of itself even though it ostensibly supports PEP 3118 and __array_function__. This seems unintentional, and is the kind of thing that we need to raise upstream, but pinning down the details will take some time.
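
A sketch of the shuttling pattern described above; array_from_other_vec stands in for an ndarray created by a different loaded copy of NumPy:

import numpy as np

# Hand the data across as a memoryview so the local NumPy rebuilds an array
# from the raw buffer instead of having to recognize the foreign ndarray type.
buf = memoryview(array_from_other_vec)
local_array = np.asarray(buf)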

Storage Size Computations Could Detect Duplicates

While it's generally an unreasonable pain to detect arbitrary overlap of numpy array objects, the storage_size helper routine could potentially check for duplicate objects. Do we want to do this?
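
A minimal sketch of what such a duplicate check could look like in a storage_size-style helper (not the existing implementation):

def storage_size(*arrays):
    # Count each distinct array object once, keyed on object identity.
    seen = set()
    total = 0
    for a in arrays:
        if id(a) not in seen:
            seen.add(id(a))
            total += a.nbytes
    return total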

VECs should be optional

Currently VECs interfere with our hardware topology layer because they have to be imported before numpy, but some other parts of the code may need to use numpy. Really all of the following should be true:

  • Parla should be usable without the modified glibc and forwarding libraries if a user doesn't use VECs.
  • Trying to use VECs without the supporting libraries should raise an informative error.
  • VECs should be usable if the supporting libraries are loaded correctly.

Avoid VEC Cache Clearing Hacks By Fixing Upstream

Our hacks to clear internal CPython caches to get VECs to work both seem like they may be addressing issues that could arguably be fixed upstream.

One is working around an ancient cache of shared object handles at https://github.com/python/cpython/blob/2d2af320d94afc6561e8f8adf174c9d3fd9065bc/Python/dynload_shlib.c#L51-L55. That looks like an old optimization that may not even be applicable with more modern libc implementations.

Another is working around the fact that the cache of module-spec objects for compiled extensions isn't cleared properly when a user runs importlib.invalidate_caches. See https://github.com/python/cpython/blob/dff1ad509051f7e07e77d1e3ec83314d53fb1118/Python/import.c#L440-L454 for the cache this refers to and https://github.com/ut-parla/Parla.py/blob/master/parla/multiload.py#L74-L79 for our hack to get at it so we can swap out entries when working in different VECs. While I'm not 100% sure how to fix this upstream, this does go against the documented behavior of the import system. invalidate_caches is supposed to invalidate all internal caches, and IIUC, that applies to this cache. A similar issue https://bugs.python.org/issue33169 was fixed semi-recently.

In both cases we'll need to make the case for fixes upstream, but in both cases I think we could eventually divest ourselves of some of the more evil hacks involved in getting VECs working. For now we just need to figure out how to get the ball rolling on fixes upstream. Even an acknowledgment that a fix belongs in CPython would help us with our story since these would become temporary workarounds for bugs upstream rather than insane implementation details.

Forwarding Libraries Are Too Invasive

Currently setting attributes on libraries and adding/removing stuff from sys.modules takes up an absolutely awful amount of time when a library lazy-loads its submodules. This is even worse when modules are lazy-imported inside routine calls. To at least some extent lazy-loading is going to be slow until we can fix #10, but based off the profiling data from @dialecticDolt it sounds like packing and unpacking stuff from sys.modules is the biggest performance bottleneck right now.

Confirm CuPy's Device Management Interface Works For Numba

Flagging this more as a potential issue where we need to go through and check that things actually work as they should. Numba has an interface for managing device contexts. We should make sure that swapping the device using cupy's context managers actually works for numba kernels. The current device is a thread-local variable inside the cuda runtime, and numba appears to be using that (https://github.com/numba/numba/blob/b1258c82c0ed2845fce52508f439e29f729384a9/numba/cuda/cudadrv/driver.py#L411), but there's a lot of potential for something wrong to happen here, so we should confirm that it really works as it should.
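
A minimal check sketch, assuming at least two GPUs; whether the assertion holds is exactly what needs confirming:

import cupy
from numba import cuda

with cupy.cuda.Device(1):
    # Does Numba observe the device switch made through CuPy's context manager?
    assert cuda.get_current_device().id == 1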

Buggy Cross-GPU Synchronization?

Currently the cholesky app shows some nondeterministic errors. Our current theory is that there's some kind of issue with the way we map tasks onto streams in the multi-device setting.

Imports for VECs Can't Happen in Threads

The following segfaults, but it should work fine:

from threading import Thread
from parla.multiload import multiload_contexts

if __name__ == '__main__':
    def f():
        with multiload_contexts[1]:
            import numpy
    t = Thread(target=f)
    t.start()
    t.join()

VCUs have no defined meaning and need guidelines in general and a definition for GPUs specifically

The default value for vcus is currently 1 for GPUs. Previously it was set to attrs["MultiProcessorCount"], which ought to be equivalent to the number of SMs on the GPU.

The value is set here in Parla.py/parla/cupy.py.

We need to have a discussion on which would be preferable. 1 is simpler, but SM count would be more expressive and potentially useful for nodes with multiple different types of GPUs.

Allocating Numpy Arrays in Pinned Host Memory

This probably needs to be addressed upstream, but it affects our apps like QR and Barnes Hut that use the pattern of copy to GPU, compute, send the data back in each task. I'm documenting it now and we can try discussing upstream later. We'd like to allow host-to-device and device-to-host transfers to happen fully asynchronously, but cuda can only do that for memory with pinned pages. It looks like someone has already worked around this to some extent in the chainer source code. We should follow that example. See https://stackoverflow.com/a/47492027.
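
A sketch along the lines of the Stack Overflow recipe linked above: allocate pinned host memory through CuPy and wrap it in a NumPy array so host-device copies can run asynchronously.

import numpy as np
import cupy

def pin_array(array):
    # Allocate page-locked (pinned) host memory, view it as a NumPy array with
    # the same shape and dtype, then copy the data in.
    mem = cupy.cuda.alloc_pinned_memory(array.nbytes)
    pinned = np.frombuffer(mem, array.dtype, array.size).reshape(array.shape)
    pinned[...] = array
    return pinned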

Hard-to-reproduce segmentation fault with Parla cleanup

I'm occasionally seeing a segmentation fault with the QR factorization app that occurs with Parla (non-VEC related). I haven't been able to reproduce it myself but it's come up before and it came up recently when one of my teammates was testing his Parla install using the app. He ran on the Maverick2 gtx node and said the bug only happened the first time he ran it (I'm not sure whether that's a coincidence or not). The command was python qr_parla.py run in the Parla.py/benchmarks/qr_factorization directory. The output is as follows:

(base) ~/Parallelism-Locality/project/Parla.py/benchmarks/qr_factorization$ python qr_parla.py
%**********************************************************************************************%

Config: rows=5000 cols=100 block_size=500 iterations=1 warmup=0 threads=16 ngpus=4 placement=gpu check_result=False csv=False
--- ITERATION 0 ---
t1
Num GPU tasks: 10
H2D: 0.0742948055267334
CPU kernels: 0
GPU kernels: 26.731510162353516
D2H: 0.005108356475830078
Total: 2.2886359691619873

t2
Total: 1.2995290756225586

t3
Num GPU tasks: 10
H2D: 0.0652766227722168
CPU kernels: 0
GPU kernels: 0.007777690887451172
D2H: 0.008157730102539062
Total: 0.015547752380371094

Full run total: 3.606827974319458

%**********************************************************************************************%

Segmentation fault (core dumped)

Note that the segmentation fault occurs after the main program has completed, presumably when Parla is cleaning up its resources.

Not setting memory argument can cause crashes

When creating tasks with @spawn, if the memory argument isn't set, the scheduler seems to assume the task will actually take no memory. Thus, tasks can fill up a device and cause it to crash. I recommend that by default if neither memory nor vcus are set, only one task can run per device (particularly for GPUs).

Simple repro on Frontera:

cd Parla.py/benchmarks/qr_factorization

Replace this line

@spawn(taskid=T1[i], placement=PLACEMENT, memory=T1_MEMORY)

with

@spawn(taskid=T1[i], placement=PLACEMENT)

then run

python qr_parla.py -r 1600000 -c 1000 -b 100000 -i 1 -p gpu -g 1

Allow Completely Distinct Sets of Libraries in VECs

Title says it all. Currently we restrict VECs to only allow importing multiple copies of the same library all at once. That's a purely artificial restriction. The only reason this hasn't happened yet is because of the tight time crunch we hit with the last VEC deadline.

Scheduler Awareness of Persistent Data

The scheduler needs to be aware of persistent data on each device, otherwise our occupancy model can't account for tasks that need to create some data and then leave it there or for the case where some persistent data exists on each device throughout an entire application. The easiest short-term fix is to just use a context manager to explicitly reserve space. The long-term fix is probably an array wrapper class, but that's significantly more work, so we can't do that for the current paper deadline.

Poor Scaling In Matmul and Exp Demos

This is likely some kind of performance bug. Both the matmul and exp demos are really simple cases where Parla ought to perform much better than it currently does. Right now both apps show near-ideal scaling up to two devices and then plateau after that.

Sequential Execution for Debugging

I was just discussing sequential semantics in the paper draft and remembered that, although this is well-defined, we don't actually have any kind of execution mode that actually makes it happen right now. Basically all we need for sequential tasking runs is to have some flag somewhere (probably in the Parla context manager?) that makes it so that tasks execute as soon as they are created instead of running asynchronously.
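
A hypothetical shape for such a switch; sequential is not an existing Parla parameter, just an illustration of the idea:

with Parla(sequential=True):   # hypothetical debugging flag
    @spawn()
    def t():
        # With the flag set, the body would run to completion right here at
        # spawn time instead of being handed off to the scheduler.
        ...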

Fail Fast When An Exception Is Thrown

Currently parla has the unfortunate tendency to keep running if an exception has already been thrown. This prevents keyboard interrupts and can easily lead to massive error logs. If an exception has been thrown, task execution should stop immediately and the error should be reported.

Can't load MKL into a VEC

MKL doesn't load correctly into a VEC yet. All our current examples have had to run with OpenBLAS because of this. When we try to load MKL we get a dynamic linker error claiming that omp_num_threads can't be found, but it's not clear why there would be any problems with that.

Reserving memory for an automatically partitioned object will not behave as desired

If you call reserve_persistent_memory on an object that is partitioned over devices for automatic data movement, rather than reserving memory for the object on its device context, the automatic mapper will copy the object to the current device context, then reserve memory on the current context. Desired behavior would be leaving the object on its own device and reserving memory on that context. We need better support for this. Current workaround is to call reserve_persistent_memory with an integer argument of the size of the object, and specify its device with the device argument.
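
A sketch of the workaround described above; the call signature follows that description, and NGPUS and partition are hypothetical names:

# Reserve by size with an explicit device so nothing is copied to the
# current device context.
for i in range(NGPUS):
    reserve_persistent_memory(partition[i].nbytes, device=gpu(i))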

Separate examples, benchmarks, and integration tests

We should have examples, benchmarks, and integration-test versions of all our current examples. Ideally the integration test would be a wrapper around the example, so that these integration tests also serve as tests of the documentation examples themselves (to make sure they don't bit-rot).
The benchmark versions will need to be separate so we can perform optimizations that are not appropriate in examples. Benchmarking the examples themselves would be useful (using a wrapper, as for testing), but these would not be the "top performance" versions; they would just track the performance of simple versions of the algorithms.

Readable Tracebacks

When an error is thrown from within a task, it'd be nice if there were some way for us to get the traceback to show where the task was spawned (as if the error were raised from a code block). This may have performance penalties, so maybe we just make it an option?

clone_here could not maintain major order

Matrices can be stored in column-major or row-major order.
However, clone_here() always copies arrays in row-major order by default.
IMO this is problematic, since users may intentionally declare their arrays in column-major order
(for example, cuBLAS requires column-major (Fortran) order).

I think clone_here() or copy() should preserve the storage order whenever they can.
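
An illustration of the concern, assuming a NumPy input; preserving the layout noted below is the property a fix should maintain:

import numpy as np

A = np.asfortranarray(np.random.rand(4, 4))   # column-major on purpose
B = clone_here(A)
# Desired: B keeps Fortran (column-major) order so cuBLAS-style kernels see
# the layout the user chose; today the copy comes back row-major by default.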

clone_here May Swap Current Device

Currently clone_here may switch out the current device unintentionally; this came up with an older version of the matmul demo. This is NOT desired behavior and needs fixing.

Resource Specifications Should Be Per-Architecture

Currently the end user specifies resource use for a task independent of the architecture used. This doesn't make practical sense since if something runs on a CPU it may use an entirely different set of resources than if it runs on a GPU.

It's also unclear how to specify resources for a task that wants to utilize multiple devices.

Main Memory Unnecessarily Associated With Cores

Currently the devices for the individual CPU cores are each set up to have a dedicated piece of main memory that can be reserved by tasks. This may unnecessarily prevent tasks that need more memory from running at all. It'll likely take some refactoring in the scheduler to make this work right, but this limitation should be removed. This is not even remotely necessary for the upcoming deadline. I'm just documenting it for later.

"Virtual Compute Unit" should be documentated and potentially renamed

Virtual Compute Unit (VCU) sounds like a virtualized environment and is very confusing when compared to VEX (Virtual EXecution environment). We should carefully document VCUs since we are using them now. We should consider renaming them "Abstract Compute Units" ("ACUs") since that implies that they don't represent any concrete resources.

Segmentation Fault with VECs when multithreaded

In the QR-factorization example, we spawn multiple threads for calling into VECs. Each VEC then calls into its own NumPy thread pool to do a qr factorization on its block. A segmentation fault occurs during this process. It's more likely to happen when more threads calling VECs are used.

def VEC_qr(A):
    # Acquire lock
    VEC_id = VEC_q.get()
    mystring = ['|' for x in range(MAX_WORKERS)]
    mystring[VEC_id] = 'x'
    print(mystring)
    with VECs[VEC_id]:
        Q, R = np.linalg.qr(fixarr(A))
    mystring = ['|' for x in range(MAX_WORKERS)]
    mystring[VEC_id] = 'o'
    print(mystring)
    # Release Lock
    VEC_q.task_done()
    VEC_q.put(VEC_id)
    return Q, R

Round-Robin Scheduling of Ready Tasks

@hfingler noted in the N-body demo that letting the scheduler pick which GPU to use for each task instead of specifying them manually was actually harmful in that case. Our current best-guess for why that would be is that the scheduler may be putting too much on one device and not enough on the others if the tasks don't saturate the memory. It'd be nice to avoid that if possible.

CuPy loses track of resource handles

Every so often I randomly see an error when doing automatic data movement, like so:

A = np.random.rand(NROWS, NCOLS)
# nblocks and NGPUS are integers
mapper = LDeviceSequenceBlocked(nblocks, placement=[gpu(block % NGPUS) for block in range(nblocks)])
A_dev = mapper.partition_tensor(A)

and I get the following error

Exception in task
Traceback (most recent call last):
  File "/home1/07999/stephens/Parla.py/parla/task_runtime.py", line 283, in run
    task_state = self._state.func(self, *self._state.args)
  File "/home1/07999/stephens/Parla.py/parla/tasks.py", line 300, in _task_callback
    new_task_info = body.send(in_value)
  File "temp.py", line 481, in test_tsqr_blocked
    A_dev = mapper.partition_tensor(A)
  File "/home1/07999/stephens/Parla.py/parla/ldevice.py", line 144, in partition_tensor
    return self.partition(lambda i: data[self.slice(i, n, overlap=overlap), ...],
  File "/home1/07999/stephens/Parla.py/parla/ldevice.py", line 131, in partition
    return PartitionedTensor([data(i, memory=self.memory(i, kind=memory_kind), device=self.device(i))
  File "/home1/07999/stephens/Parla.py/parla/ldevice.py", line 131, in <listcomp>
    return PartitionedTensor([data(i, memory=self.memory(i, kind=memory_kind), device=self.device(i))
  File "/home1/07999/stephens/Parla.py/parla/ldevice.py", line 344, in wrapper
    return memory(data(*args))
  File "/home1/07999/stephens/Parla.py/parla/cuda.py", line 75, in __call__
    return cupy.asarray(target)
  File "/home1/07999/stephens/miniconda3/envs/parla/lib/python3.8/site-packages/cupy/_creation/from_data.py", line 66, in asarray
    return core.array(a, dtype, False, order)
  File "cupy/core/core.pyx", line 2004, in cupy.core.core.array
  File "cupy/core/core.pyx", line 2083, in cupy.core.core.array
  File "cupy/core/core.pyx", line 2170, in cupy.core.core._send_object_to_gpu
  File "cupy/cuda/stream.pyx", line 245, in cupy.cuda.stream.BaseStream.record
  File "cupy_backends/cuda/api/runtime.pyx", line 854, in cupy_backends.cuda.api.runtime.eventRecord
  File "cupy_backends/cuda/api/runtime.pyx", line 247, in cupy_backends.cuda.api.runtime.check_status
cupy_backends.cuda.api.runtime.CUDARuntimeError: cudaErrorInvalidResourceHandle: invalid resource handle

RTLD_GLOBAL Loads Don't Reliably work in VECs

Continuation of #9. MKL is fine now due to changes in how it's normally loaded, but numba is still broken, and I'm not sure what's going on with OpenMP generally. Numba fails when, through various layers of code, this call is executed: https://github.com/llvm/llvm-project/blob/d480f968ad8b56d3ee4a6b6df5532d485b0ad01e/llvm/lib/Support/Unix/DynamicLibrary.inc#L28. I'm not clear on why yet. Weirdly enough, IIRC, MKL was failing with a symbol not found error with the logic used at https://github.com/IntelPython/mkl-service/blob/master/mkl/__init__.py#L38, not a crash.

Some import mechanisms bypass a customized builtins.__import__

Although we can currently load CUDA into a VEC, we can't currently load cupy. The error shows up as an assertion failure in our custom import code. I've done some work on diagnosing this, and here's what I think is going on.

Some Python C API routines bypass a customized builtins.__import__. In particular, the PyImport_ImportModuleLevelObject does, and that's one of the C API routines used to implement cimports in Cython. This only hits us with cupy because our import override currently only cares about detecting imports that are the first to load any new submodule of a given base-level module. Our existing examples work fine because, in the wild, it's rare for a cimport to be that kind of first import while also being implemented as an API call that would bypass our modified __import__. IIRC in this case the bad import is when cupy.cuda.device triggers the first load of cupy_backends.cuda.libs.cublas. It's not the first load of stuff from cupy_backends, but prior imports from that module have already completed and no new import of anything in cupy_backends is already in-progress to catch the changes to sys.modules that result from the lazy import of cupy_backends.cuda.libs.cublas. See https://github.com/cupy/cupy/blob/890e40cfd29c2ea37d52fbbef3d2e7d7ceb105d7/cupy/cuda/device.pyx#L8 for the culprit.

There are a few ways to hack this particular case to work in the short term if we need to (e.g., having an import of cupy also observe changes to cupy_backends), but I'd prefer to actually fix the problem. As I see it, there are two problems here:

  • Cython doesn't reliably respect overrides to __import__ with their cimport machinery.
  • The PyImport_* routines (other than PyImport_Import and things that call it) bypass our current overrides.

The first bullet point needs to be fixed upstream and will only partially fix the problems we're having with our modified import not always getting called, but I suspect taking care of that would be good enough for everything we actually need to make the demos work right. This is also a fix that I suspect the Cython devs will be happy to have. I've started working on a patch for this.

The fix for the second bullet point is to set up overrides for PyImport_* (other than PyImport_Import and things that call it) that allow us to observe arbitrary calls to those functions. This will be more of a hassle to set up, but it's still doable. It's what's required to fully address this issue. In particular, we'll have to modify our stub library generation scripts so that they're aware of any overrides for stuff in libpython. There's also some subtlety with interpreter initialization order where, if the builtin import hasn't been changed yet, nothing special should happen.

Related to this issue: importlib.import_module and _frozen_importlib._gcd_import also bypass our __import__ override. Those interfaces aren't frequently used in library code, but it'd probably be worth overriding them too. Most of the work in our import override is done via a context manager so overriding these additional interfaces isn't hard.

Segfaults in get_nprocs

We've been seeing mysterious segfaults in get_nprocs when threads are used together with VECs. The exact conditions that trigger this aren't known since lots of things still appear to work fine.

With the ARPACK demo this does show up, but only if many copies are used (e.g. one ARPACK copy per core, so increase the limit then run 24 copies or something). I most recently saw it there when massively oversubscribed, though, since I wasn't setting OMP_NUM_THREADS there yet. I wasn't able to get an informative backtrace beyond seeing get_nprocs at the bottom of it.

@hfingler saw segfaults like this several times when debugging the Galois/VECs demo. Here are two backtraces that we saw:

0x7f1db871385f: (killpg+0x40)                                                                                                                                                        
 (killpg+0x40)                                                         
0x7f1db871385f:0x7f1db871385f: (killpg+0x40)                                                                     
0x7f1db871385f: (get_nprocs+0x11f)                                                                               
0x7f1db869defb: (get_nprocs+0x11f)                                                                                                             
 (get_nprocs+0x11f)                                                                                              
0x7f1db869defb: (get_nprocs+0x11f)                                                                                                                                                            
0x7f1db869defb:0x7f1db869defb: (arena_get2.part.4+0x19b)
0x7f1db86a0dc9: (arena_get2.part.4+0x19b)                                                                                                                                      
 (arena_get2.part.4+0x19b)                                                                                                                                                                    
0x7f1db86a0dc9:0x7f1db86a0dc9: (arena_get2.part.4+0x19b)                                                     
0x7f1db86a0dc9: (tcache_init.part.6+0xb9)                                                                                                                                             
0x7f1db86a1b9e: (tcache_init.part.6+0xb9)                                               
 (tcache_init.part.6+0xb9)                                                                             
0x7f1db86a1b9e:0x7f1db86a1b9e: (tcache_init.part.6+0xb9)                                                                             
0x7f1db86a1b9e: (__libc_malloc+0xde) 

Another one:

0x7ff896bc5850: (handler+0x28)
0x7ff896bc5850: (killpg+0x40)
0x7ff896c8685f:----- Galois setting # threads to 24
Galois: load_file:304 0x7ff880002680
Reading from file: inputs/r4-2e26.gr
 (killpg+0x40)
0x7ff896c8685f: (get_nprocs+0x11f)
0x7ff896c10efb: (get_nprocs+0x11f)
0x7ff896c10efb: (arena_get2.part.4+0x19b)
0x7ff896c13dc9: (arena_get2.part.4+0x19b)
0x7ff896c13dc9: (tcache_init.part.6+0xb9)
 (tcache_init.part.6+0xb9)
0x7ff896c14b9e:0x7ff896c14b9e: (__libc_malloc+0xde)
 (__libc_malloc+0xde)
0x7ff897e952f5:0x7ff897e952f5: (tls_get_addr_tail+0x165)
 (tls_get_addr_tail+0x165)
0x7ff897e9ae08:0x7ff897e9ae08: (__tls_get_addr+0x38)
 (__tls_get_addr+0x38)
0x7ff88b30a422:0x7ff88b30a422: (_ZTHN6galois9substrate10ThreadPool6my_boxE+0x14)
 (_ZTHN6galois9substrate10ThreadPool6my_boxE+0x14)
0x7ff88b2db545:0x7ff88b2db545: (_ZTWN6galois9substrate10ThreadPool6my_boxE+0x9)

@sestephens73 at one point saw this one as well when working on the matmul demo (I'm not sure what the workaround there was to avoid it):

0x7fa9598e2188: (handler+0x28)
0x7fa95c966400: (killpg+0x40)
0x7fa95bf7837f: (get_nprocs+0x11f)
0x7fa95bf02aab: (arena_get2.part.4+0x19b)

VECs don't work in interactive Python (REPL)

The Python interpreter segfaults at startup when the modified glibc and forwarding libraries for VECs are used and the interpreter is run in interactive mode. Running scripts works fine. It's currently unclear what's causing this.

Parla components import order errors

Some import orders cause errors, for example:

from parla import Parla
from parla.cpu import *
from parla.tasks import *

Works, but the following does not:

from parla import Parla
from parla.tasks import *
from parla.cpu import *
Traceback (most recent call last):
  File "./bin/run_2d.py", line 8, in <module>
    from barneshut.implementations import SimpleBarnesHut, ProcessPoolBarnesHut, AsyncBarnesHut, ParlaBarnesHut
  File "/home/hfingler/parla/barnes-hut/2d-barnes-hut-python/barneshut/implementations/__init__.py", line 4, in <module>
    from .barneshut.parla       import ParlaBarnesHut
  File "/home/hfingler/parla/barnes-hut/2d-barnes-hut-python/barneshut/implementations/barneshut/parla.py", line 5, in <module>
    from parla.tasks import *
  File "/home/hfingler/parla/.bh/lib/python3.7/site-packages/parla/tasks.py", line 22, in <module>
    from parla import task_runtime, array
  File "/home/hfingler/parla/.bh/lib/python3.7/site-packages/parla/array.py", line 10, in <module>
    from parla.tasks import get_current_devices
ImportError: cannot import name 'get_current_devices' from 'parla.tasks' (/home/hfingler/parla/.bh/lib/python3.7/site-packages/parla/tasks.py)

Vague module semantics

Examples:

  • class Task is currently in module task_runtime while we have module tasks.
  • function get_current_devices() is currently in module tasks while all it does is to call task_runtime.get_devices().
  • functions get_placement_for_any/set/value(...) are currently in module tasks. We don't have a placement module for now but have module device.

These may confuse users when importing objects. Classes/functions may need relocation and imports everywhere may be affected accordingly. We may also need new modules.

Integration Testing

Post-deadline we need to set up CI tests ASAP. Hopefully there's some kind of cuda emulator we can use to be able to test the benchmark apps too.
