Weave, a state-of-the-art multithreading runtime


"Good artists borrow, great artists steal." -- Pablo Picasso

Weave (codenamed "Project Picasso") is a multithreading runtime for the Nim programming language.

It is continuously tested on Linux, MacOS and Windows for the following CPU architectures: x86, x86_64 and ARM64 with the C and C++ backends.

Weave aims to provide a composable, high-performance, ultra-low overhead and fine-grained parallel runtime that frees developers from the common worries of "are my tasks big enough to be parallelized?", "what should be my grain size?", "what if the time they take is completely unknown or different?" or "is parallel-for worth it if it's just a matrix addition? On what CPUs? What if it's exponentiation?".

Thorough benchmarks track Weave's performance against industry-standard runtimes in C/C++/Cilk on both task parallelism and data parallelism, with a variety of workloads:

  • Compute-bound
  • Memory-bound
  • Load Balancing
  • Runtime-overhead bound (i.e. trillions of tasks in a couple milliseconds)
  • Nested parallelism

Benchmarks are drawn from recursive tree algorithms, finance, linear algebra, High Performance Computing and game simulations. In particular, Weave displays 3x to 10x less overhead than Intel TBB and GCC OpenMP on overhead-bound benchmarks.

At the implementation level, Weave's unique feature is that it is based on message passing instead of traditional work-stealing with shared-memory deques.

⚠️ Disclaimer:

Only 1 out of 2 complex synchronization primitives was formally verified to be deadlock-free. They were not submitted to an additional data race detection tool to ensure proper implementation.

Furthermore, worker threads are state machines and were not formally verified either.

Weave does limit synchronization to simple SPSC and MPSC channels only, which greatly reduces the potential bug surface.

Installation

Weave can be simply installed with

nimble install weave

or for the devel version

nimble install weave@#master

Weave requires at least Nim v1.2.0

Changelog

The latest changes are available in the changelog.md file.

Demos

A raytracing demo is available, head over to demos/raytracing.


API

Task parallelism

Weave provides a simple API based on spawn/sync which works like async/await for IO-based futures.

The traditional parallel recursive Fibonacci would be written like this:

import weave

proc fib(n: int): int =
  # int64 on x86-64
  if n < 2:
    return n

  let x = spawn fib(n-1)
  let y = fib(n-2)

  result = sync(x) + y

proc main() =
  var n = 20

  init(Weave)
  let f = fib(n)
  exit(Weave)

  echo f

main()

Data parallelism

Weave provides nestable parallel for loops.

A nested matrix transposition would be written like this:

import weave

func initialize(buffer: ptr UncheckedArray[float32], len: int) =
  for i in 0 ..< len:
    buffer[i] = i.float32

proc transpose(M, N: int, bufIn, bufOut: ptr UncheckedArray[float32]) =
  ## Transpose a MxN matrix into a NxM matrix with nested for loops

  parallelFor j in 0 ..< N:
    captures: {M, N, bufIn, bufOut}
    parallelFor i in 0 ..< M:
      captures: {j, M, N, bufIn, bufOut}
      bufOut[j*M+i] = bufIn[i*N+j]

proc main() =
  let M = 200
  let N = 2000

  let input = newSeq[float32](M*N)
  # We can't work with seq directly as it's managed by GC, take a ptr to the buffer.
  let bufIn = cast[ptr UncheckedArray[float32]](input[0].unsafeAddr)
  bufIn.initialize(M*N)

  var output = newSeq[float32](N*M)
  let bufOut = cast[ptr UncheckedArray[float32]](output[0].addr)

  init(Weave)
  transpose(M, N, bufIn, bufOut)
  exit(Weave)

main()

Strided loops

You might want to use loops with a non-unit stride. This can be done with the following syntax:

import weave

init(Weave)

# expandMacros:
parallelForStrided i in 0 ..< 100, stride = 30:
  parallelForStrided j in 0 ..< 200, stride = 60:
    captures: {i}
    log("Matrix[%d, %d] (thread %d)\n", i, j, myID())

exit(Weave)

Complete list

We separate the list depending on the threading context

Root thread

The root thread is the thread that started the Weave runtime. It has special privileges.

  • init(Weave), exit(Weave) to start and stop the runtime. Forgetting this will give you nil pointer exceptions on spawn.
    The thread that calls init will become the root thread.
  • syncRoot(Weave) is a global barrier. The root thread will not continue beyond this point until all tasks in the runtime are finished.
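
As a minimal sketch (not taken from the Weave documentation) combining these root-thread calls with a made-up task:

import weave

proc hello(i: int) =
  echo "task ", i

proc main() =
  init(Weave)        # this thread becomes the root thread
  for i in 0 ..< 4:
    spawn hello(i)   # spawning before init(Weave) would crash with a nil pointer
  syncRoot(Weave)    # global barrier: wait until all tasks are finished
  exit(Weave)        # stop the runtime

main()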

Weave worker thread

A worker thread is automatically created per (logical) core on the machine. The root thread is also a worker thread. Worker threads are tuned to maximize throughput of computational tasks.

  • spawn fnCall(args) which spawns a function that may run on another thread and gives you an awaitable Flowvar handle.

  • newFlowEvent, trigger, spawnOnEvent and spawnOnEvents (experimental) to delay a task until some dependencies are met. This allows expressing precise data dependencies and producer-consumer relationships.

  • sync(Flowvar) will await a Flowvar and block until you receive a result.

  • isReady(Flowvar) will check if sync will actually block or return the result immediately.

  • syncScope is a scope barrier. The thread will not move beyond the scope until all tasks and parallel loops spawned and their descendants are finished. syncScope is composable, it can be called by any thread, it can be nested. It has the syntax of a block statement:

    syncScope():
      parallelFor i in 0 ..< N:
        captures: {a, b}
        parallelFor j in 0 ..< N:
          captures: {i, a, b}
      spawn foo()

    In this example, the thread encountering syncScope will create all the tasks for parallel loop i, will spawn foo() and then wait at the end of the scope. A thread blocked at the end of its scope is not idle: it still helps process all existing work and any work created by the current tasks.

  • parallelFor, parallelForStrided, parallelForStaged, parallelForStagedStrided are described above and in the experimental section.

  • loadBalance(Weave) gives the runtime the opportunity to distribute work. Insert this within long computations as, due to Weave's design, it's the busy workers that are also in charge of load balancing. This is done automatically when using parallelFor.

  • isSpawned(Flowvar) allows you to build speculative algorithms where a task is spawned only if certain conditions are valid. See the nqueens benchmark for an example.

  • getThreadId(Weave) returns a unique thread ID. The thread ID is in the range 0 ..< number of threads.

The max number of worker threads can be configured with the environment variable WEAVE_NUM_THREADS and defaults to your number of logical cores (including HyperThreading). Weave uses Nim's countProcessors() from std/cpuinfo.
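
For illustration, here is a small sketch (not from the Weave repository) combining a few of the worker-thread calls listed above:

import weave

proc sumRange(a, b: int): int =
  ## A deliberately long sequential computation that periodically
  ## gives the runtime a chance to distribute pending work.
  for i in a ..< b:
    result += i
    if (i and 0xFFFF) == 0:
      loadBalance(Weave)   # answer pending steal requests; does not block

proc main() =
  init(Weave)
  let half1 = spawn sumRange(0, 50_000_000)
  let half2 = sumRange(50_000_000, 100_000_000)
  if not half1.isReady():
    echo "thread ", getThreadId(Weave), " is still waiting for the other half"
  echo sync(half1) + half2
  exit(Weave)

main()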

Foreign thread & Background service (experimental)

Weave can also be run as a background service and process jobs similar to the Executor concept in C++. Jobs will be processed in FIFO order.

Experimental: The distinction between spawn/sync on a Weave thread and submit/waitFor on a foreign thread may be removed in the future.

A background service can be started with either:

  • thr.runInBackground(Weave)
  • or thr.runInBackground(Weave, signalShutdown: ptr Atomic[bool])

with thr an uninitialized Thread[void] or Thread[ptr Atomic[bool]]

Then the foreign thread should call:

  • setupSubmitterThread(Weave): Configure a thread so that it can send jobs to a background Weave service.
  • waitUntilReady(Weave): Block the foreign thread until the Weave runtime is ready to accept jobs.

and for shutdown:

  • teardownSubmitterThread(Weave): Cleanup Weave resources allocated on the thread.

Once setup, a foreign thread can submit jobs via:

  • submit fnCall(args) which submits a function to the Weave runtime and gives you an awaitable Pending handle.
  • newFlowEvent, trigger, submitOnEvent and submitOnEvents (experimental) to delay a task until some dependencies are met. This allows expressing precise data dependencies and producer-consumer relationships.
  • waitFor(Pending) which awaits a Pending job's result and blocks the current thread until it is available.
  • isReady(Pending) will check if waitFor will actually block or return the result immediately.
  • isSubmitted(job) allows you to build speculative algorithms where a job is submitted only if certain conditions are valid.

Within a job, tasks can be spawned and parallel for constructs can be used.
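
As a hedged sketch of a foreign thread driving a background Weave service with the calls documented above (compute is a placeholder job; this is not lifted from the Weave test suite):

import weave
import std/atomics

proc compute(x: int): int = x * x   # placeholder job

proc main() =
  var shutdown: Atomic[bool]
  shutdown.store(false)

  # Start the Weave runtime as a background service.
  var thr: Thread[ptr Atomic[bool]]
  thr.runInBackground(Weave, shutdown.addr)

  # The current (foreign) thread becomes a submitter.
  setupSubmitterThread(Weave)
  waitUntilReady(Weave)

  let job = submit compute(7)
  echo waitFor(job)                 # 49

  # Shutdown: release submitter resources, signal the service and join it.
  teardownSubmitterThread(Weave)
  shutdown.store(true)
  thr.joinThread()

main()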

If runInBackground() does not provide fine enough control, a Weave background event loop can be customized using the following primitives:

  • at a very low-level:
    • The root thread primitives: init(Weave) and exit(Weave)
    • processAllandTryPark(Weave): Process all pending jobs and try sleeping. The sleep may fail to avoid deadlocks if a job is submitted concurrently. This should be used in a while true event loop.
  • at a medium level:
    • runForever(Weave): Start a never-ending event loop that processes all pending jobs and sleep until new work arrives.
    • runUntil(Weave, signalShutdown: ptr Atomic[bool]): Start an event-loop that quits on signal.

For example:

proc runUntil*(_: typedesc[Weave], signal: ptr Atomic[bool]) =
  ## Start a Weave event loop until signal is true on the current thread.
  ## It wakes-up on job submission, handles multithreaded load balancing,
  ## help process tasks
  ## and spin down when there is no work anymore.
  preCondition: not signal.isNil
  while not signal[].load(moRelaxed):
    processAllandTryPark(Weave)
  syncRoot(Weave)

proc runInBackground*(
       _: typedesc[Weave],
       signalShutdown: ptr Atomic[bool]
     ): Thread[ptr Atomic[bool]] =
  ## Start the Weave runtime on a background thread.
  ## It wakes-up on job submissions, handles multithreaded load balancing,
  ## help process tasks
  ## and spin down when there is no work anymore.
  proc eventLoop(shutdown: ptr Atomic[bool]) {.thread.} =
    init(Weave)
    Weave.runUntil(shutdown)
    exit(Weave)
  result.createThread(eventLoop, signalShutdown)

Platforms supported

Weave supports all platforms with pthread and Windows. Missing pthread functionality may be emulated or unused. For example on MacOS, the pthread implementation does not expose barrier functionality or affinity settings.

C++ compilation

The syncScope feature will not compile correctly in C++ mode if it is used in a for loop. Upstream: nim-lang/Nim#14118

Windows 32-bit

Windows 32-bit targets cannot use the MinGW compiler as it is missing support for EnterSynchronizationBarrier. MSVC should work instead.

Resource-restricted devices

Weave uses a flexible and efficient memory subsystem that has been optimized for a wide range of hardware: low-power Raspberry Pis, phones, laptops, desktops and 30+ core workstations. It currently assumes by default that at least 16KB are available on your hardware for a memory pool and that this memory pool can grow as needed. This can be tuned with -d:WV_MemArenaSize=2048 to have the base pool use 2KB, for example. The pool size should be a multiple of 256 bytes. PRs to improve support of very restricted devices are welcome.

Backoff mechanism

A Backoff mechanism is enabled by default. It allows workers with no tasks to sleep instead of spinning aimlessly and burning CPU cycles.

It can be disabled with -d:WV_Backoff=off.

Weave using all CPUs

Weave multithreading is cooperative: idle threads send steal requests instead of actively stealing from other workers' queues. This is called "work-requesting" in the literature, as opposed to "work-stealing".

This means that a thread sleeping or stuck in a long computation may starve other threads, which will spin, burning CPU cycles.

  • Don't sleep or block a thread, as this blocks Weave's scheduler. This is similar to async/await libraries.
  • If you really need to sleep or block the root thread, make sure to empty all the tasks beforehand with syncRoot(Weave) in the root thread. The child threads will be put to sleep until new tasks are spawned.
  • The loadBalance(Weave) call can be used in the middle of heavy computations to force the worker to answer steal requests. This is done automatically in parallelFor loops. loadBalance(Weave) is a very fast call that makes a worker thread check its queue and dispatch its pending tasks to others. It does not block.

We call the root thread the thread that called init(Weave).

Experimental features

Experimental features might see API and/or implementation changes.

For example, both parallelForStaged and parallelReduce allow reductions, but parallelForStaged is more flexible; it however requires explicit use of locks and/or atomics.

LazyFlowvars may be enabled by default for certain sizes, or if escape analysis becomes possible, or if we prevent Flowvars from escaping their scope.

Data parallelism (experimental features)

Awaitable loop

Loops can be awaited. Awaitable loops return a normal Flowvar.

This blocks the thread that spawned the parallel loop from continuing until the loop is resolved. The thread does not stay idle and will steal and run other tasks while being blocked.

Calling sync on the awaitable loop Flowvar will return true for the last thread to exit the loop and false for the others.

  • Due to dynamic load-balancing, an unknown amount of threads will execute the loop.
  • It's the thread that spawned the loop task that will always be the last thread to exit. The false value is only internal to Weave.

⚠️ This is not a barrier: if that loop spawns tasks (including via a nested loop) and exits, the thread will continue, it will not wait for the grandchildren tasks to be finished. Use a syncScope section to wait on all tasks and descendants including grandchildren.

import weave

init(Weave)

# expandMacros:
parallelFor i in 0 ..< 10:
  awaitable: iLoop
  echo "iteration: ", i

let wasLastThread = sync(iLoop)
echo wasLastThread

exit(Weave)

Parallel For Staged

Weave provides a parallelForStaged construct with support for a thread-local prologue and epilogue.

A parallel sum would look like this:

import weave

proc sumReduce(n: int): int =
  let res = result.addr # For mutation we need to capture the address.

  parallelForStaged i in 0 .. n:
    captures: {res}
    awaitable: iLoop
    prologue:
      var localSum = 0
    loop:
      localSum += i
    epilogue:
      echo "Thread ", getThreadID(Weave), ": localsum = ", localSum
      res[].atomicInc(localSum)

  let wasLastThread = sync(iLoop)

init(Weave)
let sum1M = sumReduce(1000000)
echo "Sum reduce(0..1000000): ", sum1M
doAssert sum1M == 500_000_500_000
exit(Weave)

parallelForStagedStrided is also provided.

Parallel Reduction

Weave provides a parallel reduction construct that avoids having to use explicit synchronization like atomics or locks, and instead uses Weave's sync(Flowvar) under the hood.

Syntax is the following:

import weave

proc sumReduce(n: int): int =
  var waitableSum: Flowvar[int]

  # expandMacros:
  parallelReduceImpl i in 0 .. n, stride = 1:
    reduce(waitableSum):
      prologue:
        var localSum = 0
      fold:
        localSum += i
      merge(remoteSum):
        localSum += sync(remoteSum)
      return localSum

  result = sync(waitableSum)

init(Weave)
let sum1M = sumReduce(1000000)
echo "Sum reduce(0..1000000): ", sum1M
doAssert sum1M == 500_000_500_000
exit(Weave)

In the future, waitableSum will probably not be required to be declared beforehand. Or parallel reduce might be removed to keep only parallelForStaged.

Dataflow parallelism

Dataflow parallelism allows expressing fine-grained data dependencies between tasks. Concretely a task is delayed until all its dependencies are met and once met, it is triggered immediately.

This allows precise specification of data producer-consumer relationships.

In contrast, classic task parallelism can only express control-flow dependencies (i.e. parent-child function calls relationships) and classic tasks are eagerly scheduled.

In the literature, it is also called:

  • Stream parallelism
  • Pipeline parallelism
  • Graph parallelism
  • Data-driven task parallelism

Tagged experimental as the API and its implementation are unique compared to other libraries/language-extensions. Feedback welcome.

No specific ordering is required between calling the event producer and its consumer(s).

Dependencies are expressed by a handle called FlowEvent. A flow event can express either a single dependency, initialized with newFlowEvent(), or dependencies on parallel-for loop iterations, initialized with newFlowEvent(start, exclusiveStop, stride).

To await a single event, pass it to spawnOnEvent or the parallelFor invocation. To await an iteration, pass a tuple:

  • (FlowEvent, 0) to await precisely and only iteration 0. This works with both spawnOnEvent and parallelFor (via a dependsOnEvent statement)
  • (FlowEvent, loop_index_variable) to await a whole iteration range. For example
    parallelFor i in 0 ..< n:
      dependsOnEvent: (e, i) # Each "i" will independently depend on its matching event
      body
    This only works with parallelFor. The FlowEvent iteration domain and the parallelFor domain must be the same. As soon as a subset of the event's iterations is triggered, the corresponding parallelFor tasks will be scheduled.

Delayed computation with single dependencies

import weave
import std/os # for sleep

proc echoA(eA: FlowEvent) =
  echo "Display A, sleep 1s, create parallel streams 1 and 2"
  sleep(1000)
  eA.trigger()

proc echoB1(eB1: FlowEvent) =
  echo "Display B1, sleep 1s"
  sleep(1000)
  eB1.trigger()

proc echoB2() =
  echo "Display B2, exit stream"

proc echoC1() =
  echo "Display C1, exit stream"

proc main() =
  echo "Dataflow parallelism with single dependency"
  init(Weave)
  let eA = newFlowEvent()
  let eB1 = newFlowEvent()
  spawnOnEvent eB1, echoC1()
  spawnOnEvent eA, echoB2()
  spawnOnEvent eA, echoB1(eB1)
  spawn echoA(eA)
  exit(Weave)

main()

Delayed computation with multiple dependencies

import weave
import std/os # for sleep

proc echoA(eA: FlowEvent) =
  echo "Display A, sleep 1s, create parallel streams 1 and 2"
  sleep(1000)
  eA.trigger()

proc echoB1(eB1: FlowEvent) =
  echo "Display B1, sleep 1s"
  sleep(1000)
  eB1.trigger()

proc echoB2(eB2: FlowEvent) =
  echo "Display B2, no sleep"
  eB2.trigger()

proc echoC12() =
  echo "Display C12, exit stream"

proc main() =
  echo "Dataflow parallelism with multiple dependencies"
  init(Weave)
  let eA = newFlowEvent()
  let eB1 = newFlowEvent()
  let eB2 = newFlowEvent()
  spawnOnEvents eB1, eB2, echoC12()
  spawnOnEvent eA, echoB2(eB2)
  spawnOnEvent eA, echoB1(eB1)
  spawn echoA(eA)
  exit(Weave)

main()

Delayed loop computation

You can combine data parallelism and dataflow parallelism.

Currently parallel loops only support one dependency (single, fixed iteration or range iteration).

Here is an example with a range-iteration dependency. Note: when sleeping, threads are unresponsive, meaning a sleeping thread cannot schedule other ready tasks.

import weave
import std/os # for sleep

proc main() =
  init(Weave)

  let eA = newFlowEvent(0, 10, 1)
  let pB = newFlowEvent(0, 10, 1)

  parallelFor i in 0 ..< 10:
    captures: {eA}
    sleep(i * 10)
    eA.trigger(i)
    echo "Step A - stream ", i, " at ", i * 10, " ms"

  parallelFor i in 0 ..< 10:
    dependsOn: (eA, i)
    captures: {pB}
    sleep(i * 10)
    pB.trigger(i)
    echo "Step B - stream ", i, " at ", 2 * i * 10, " ms"

  parallelFor i in 0 ..< 10:
    dependsOn: (pB, i)
    sleep(i * 10)
    echo "Step C - stream ", i, " at ", 3 * i * 10, " ms"

  exit(Weave)

main()

Lazy Allocation of Flowvars

Flowvars can be lazily allocated. This reduces overhead by at least 2x on very fine-grained tasks like Fibonacci or depth-first search that may spawn trillions of tasks in less than a couple hundred milliseconds. This can be enabled with -d:WV_LazyFlowvar.

⚠️ This only works for Flowvars of a size up to your machine word size (int64, float64, pointer on 64-bit machines).
⚠️ Flowvars cannot be returned in that mode; you will at best trigger stack-smashing protection or crash.

Limitations

Weave has not been tested with GC-ed types. Pass a pointer around or use Nim channels which are GC-aware. If it works, a heads-up would be valuable.

This might improve with Nim ARC/newruntime.

Statistics

Curious minds can access the low-level runtime statistics with the flag -d:WV_metrics, which will give you information on the number of tasks executed, steal requests sent, etc.

Very curious minds can also enable high resolution timers with -d:WV_metrics -d:WV_profile -d:CpuFreqMhz=3000 assuming you have a 3GHz CPU.

The timers will give you in this order:

Time spent running tasks, Time spent recv/send steal requests, Time spent recv/send tasks, Time spent caching tasks, Time spent idle, Total

Tuning

A number of configuration options are available in weave/config.nim.

In particular:

  • -d:WV_StealAdaptativeInterval=25 defines the number of steal requests after which thieves reevaluate their steal strategy (steal one task or steal half the victim's tasks). Default: 25
  • -d:WV_StealEarly=0 allows workers to steal early, when only WV_StealEarly tasks are left in their queue. Default: don't steal early
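
For example, a hypothetical compile command combining some of the flags mentioned in this README (the program name is a placeholder):

nim c -d:danger --threads:on -d:WV_StealEarly=2 -d:WV_metrics myprogram.nim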

Unique features

Weave provides a unique scheduler with the following properties:

  • Message-passing based: unlike alternative work-stealing schedulers, this means that Weave is usable on any architecture where message queues, channels or locks are available, not only atomics. Architectures without atomics include distributed clusters and non-cache-coherent processors like the Cell Broadband Engine (of the PS3) that favors Direct Memory Access (DMA), the many-core mesh Tile CPU from Mellanox (EzChip/Tilera) with 64 to 100 ARM cores, the network-on-chip (NOC) CPU Epiphany V from Adapteva with 1024 cores, or the research CPU Intel SCC.
  • Scalable: as the number of cores in computers grows steadily, developers need to find new avenues of parallelism to exploit them. Unfortunately, existing frameworks require computations to take at least 10000 cycles (Intel TBB), which corresponds to 3.33 µs on a 3 GHz CPU, to amortize the cost of scheduling. This burdens developers with questions of grain size, heuristics on distributing parallel loops for the common case, and mis-scheduling on recursive tree algorithms with potentially very low compute-intensive leaves.
    • Weave uses an adaptative work-stealing scheduler that adapts its stealing strategy depending on each core's load and the intensity of tasks. Small tasks will be packaged into chunks to amortize scheduling overhead.
    • Weave also uses an adaptative lazy loop-splitting strategy. Loops will only be split when needed. There is no partitioning or grain-size issue, and no need to estimate whether the workload is memory-bound or compute-bound; see PyTorch's OpenMP woes on parallel map.
    • Weave aims at efficient multicore scaling for very fine-grained tasks, starting from the 2000-cycle range upward (0.67 µs at 3 GHz).
  • Fast and low-overhead: while the number of cores has been growing steadily, many programs are now hitting the limit of memory bandwidth and require tuning allocators, cache lines and CPU caches. Enormous care has been given to optimizing Weave to keep it very low-overhead. Weave uses efficient memory allocation and caches to avoid stressing the system allocator and prevent memory fragmentation. Soon, a thread-safe caching system that can release memory to the OS will be added to prevent reserving memory for a long time.
  • Ergonomic and composable: Weave's API is based on futures, similar to async/await for concurrency. The task dependency graph is implicitly built when awaiting a result. An OpenMP syntax is planned.

The "Project Picasso" RFC is available for discussion in Nim RFC #160 or in the (potentially outdated) picasso_RFC.md file

Research

Weave is based on the research by Andreas Prell. You can read his PhD Thesis or access his C implementation.

Several enhancements were built into Weave, in particular:

  • Memory management was carefully studied to allow releasing memory to the OS while still providing very high performance and solving the decades-old cactus stack problem. The solution, coupling a threadsafe memory pool with a lookaside buffer, is inspired by Microsoft's Mimalloc and Snmalloc, a message-passing based allocator (also by Microsoft). Details are provided in the multiple Markdown files in the memory folder.
  • The channels were reworked to not use locks. In particular the MPSC channel (Multi-Producer Single-Consumer) supports batching for both producers and consumers without any lock.

License

Licensed and distributed under either of

  • Apache License, Version 2.0
  • MIT License

at your option. These files may not be copied, modified, or distributed except according to those terms.

weave's Issues

Silly bug on splitHalf / splitGuided

So while introducing support for loop strides, I also broke splitHalf.
AFAIK splitAdaptative is working fine but it's an untested part of the runtime.

The splitting bugs should be fixed with an anti-regression added.
This is completely self-contained in the loop-splitting file.

Offending code:

func splitHalf*(task: Task): int {.inline.} =
  ## Split loop iteration range in half
  task.cur + ((task.stop - task.cur + task.stride-1) div task.stride) shr 1

Test case:

func splitHalfBuggy*(cur, stop, stride: int): int {.inline.} =
  ## Split loop iteration range in half
  cur + ((stop - cur + stride-1) div stride) shr 1

echo splitHalfBuggy(32, 128, 32) # 33 <---- the caller only keeps a single iteration

# fixed
func splitHalf*(cur, stop, stride: int): int {.inline.} =
  ## Split loop iteration range in half
  cur + (stop - cur) shr 1

echo splitHalf(32, 128, 32) # 80

SplitHalf is fairly easy to fix. Below is an explanation of splitGuided, splitAdaptative and splitAdaptativeDelegated so that proper tests can be written.

splitGuided

Split-guided is similar to OpenMP's guided schedule. Assuming N iterations and P workers, you first deal thieves work chunks of size N/P. When the iterations left are fewer than N/P, you deal exponentially decreasing work chunks.

func splitGuided*(task: Task): int {.inline.} =
  ## Split iteration range based on the number of workers
  let stepsLeft = (task.stop - task.cur + task.stride-1) div task.stride
  preCondition: stepsLeft > 0
  {.noSideEffect.}:
    let numWorkers = workforce()
  let chunk = max(((task.stop - task.start + task.stride-1) div task.stride) div numWorkers, 1)
  if stepsLeft <= chunk:
    return task.splitHalf()
  return roundPrevMultipleOf(task.stop - chunk*task.stride, task.stride)

splitAdaptative

SplitAdaptative is described here p120: https://epub.uni-bayreuth.de/2990/

In practice, if a victim is at iteration 19 of a [0, 100) task, we have task.cur = 20 (task.cur is the next splittable iteration, so you don't give up your current work), see
https://github.com/mratsim/weave/blob/5d9017239ca9792cc37e3995f422f86ac57043ab/weave/parallel_for.nim#L24-L49

Assuming we have approximately 7 thieves (due to concurrency we only have a lower bound) plus the victim, we want to distribute 10 iterations to each.
But the split is done one thief at a time in a loop, so the algorithm does:

task.cur = 20, task.stop = 100, thieves = 7 => split at 90
task.cur = 20, task.stop = 90, thieves = 6 => split at 80
task.cur = 20, task.stop = 80, thieves = 5 => split at 70
task.cur = 20, task.stop = 70, thieves = 4 => split at 60
task.cur = 20, task.stop = 60, thieves = 3 => split at 50
task.cur = 20, task.stop = 50, thieves = 2 => split at 40
task.cur = 20, task.stop = 40, thieves = 1 => split at 30
No thieves: we do [20, 30)

And if there is only one thief, it is equivalent to split half.
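
The arithmetic behind this trace can be reproduced with a small standalone sketch (this is not Weave's implementation, just the round-by-round split computation):

func splitPoint(cur, stop, thieves: int): int =
  ## Give the tail of the remaining range to one thief so that the
  ## remaining thieves and the victim get roughly equal shares.
  stop - (stop - cur) div (thieves + 1)

var stop = 100
let cur = 20
var thieves = 7
while thieves > 0:
  let split = splitPoint(cur, stop, thieves)
  echo "task.cur = ", cur, ", task.stop = ", stop, ", thieves = ", thieves, " => split at ", split
  stop = split     # the thief receives [split, old stop), the victim keeps [cur, split)
  dec thieves
echo "No thieves: we do [", cur, ", ", stop, ")"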

splitAdaptativeDelegated

When Weave is compiled with Backoff, workers that backed off from stealing are sleeping and cannot respond to steal requests.
They have a parent that will check their steal requests queue and their children's on their behalf.

When a parent has a loop task with which it can wake up a child worker, it can't just do splitAdaptative, because of the following scenario, assuming we have a leftChild sleeping with 6 steal requests (7 thieves total):

task.cur = 20, task.stop = 100, leftsubtreeThieves = 7 => split at 90
# oops the leftChild is woken up, we can't check its thief queue anymore
task.cur = 20, task.stop = 90, leftsubtreeThieves = 0 => we are left with work imbalance
# And now there is communication overhead because the left child cannot satisfy all steal requests of its tree.

So the parent sends enough work for the whole subtree before waking the left child, which will do the same, avoiding latency and reducing the number of messages to log(n) tasks instead of many recirculated steal requests.

Note that the parent has its own thieves and also another child so it needs to keep enough for them as well.

Support for gc:arc

We need a test-case with a gc:arc based type that is created within a task and sent to the caller (i.e. it escapes its creating thread).

[Glibc] Condition variable lost wakeups

So I'm in conflict:

  • On the red corner, formal verification, model checking, axioms, proofs and the foundation of our comprehension of our universe.
  • On the blue corner, glibc, an industry standard which should have a reference implementation of pthreads.

As an arbiter: the OSX C standard library.

Those are example logs of unfortunate fates of my runtime:


Plenty of people on Stack Overflow suggested that you should use locks, that you should unlock before signaling, or that you should unlock after signaling.
However the evidence seems damning: glibc sits in the box of the accused. OSX does not exhibit the same behaviour. Musl does, though. Formal verification explored almost 10 million possible state interleavings and did not find a deadlock or livelock in my event notifier.

So what does glibc have for its defence?

While waiting for a fix that may never come especially in distros that upgrade very slowly, let's review our options:

  • Ask everyone to switch to Mac (yeah no, ...)
  • Don't back off: saving power would be nice still.
  • Use exponential backoff, log-log-iterated backoff or some of the backoff techniques used for Wi-Fi/Bluetooth explained in the backoff readme: https://github.com/mratsim/weave/blob/1a458832/weave/channels/event_notifiers_and_backoff.md. Acceptable, but nanosleep is not a posix standard, sleeping in microseconds is too long and also "approximative", it does not completely sleep the threads so there is still some power use remaining, and lastly it increases latency.
  • Try one of the wakeup ceremony mentioned in there: https://stackoverflow.com/a/9918764 (apparently used in the linux kernel)
/* atomic op outside of mutex, and then: */

pthread_mutex_lock(&m);
pthread_mutex_unlock(&m);

pthread_cond_signal(&c);
  • or condition signaling via mutex unlock? https://news.ycombinator.com/item?id=11894100 (but in the screenshots I did add locks everywhere)
  • last resort: implementing my own futexes and condition variables from scratch on top of Linux, Mac and Windows primitives ...

Support destructor-based seq/strings

This is more exploratory, but being able to return seq/string would be a huge boon.

The main unknown is:

  • do we pass around ownership of a buffer? This requires it to be allocated in shared memory.
  • or do we copy it into a channel? This restricts the size to Weave's max closure size.

Implement backoff mechanism

Currently the workers spin at full throttle when they have no tasks, wasting CPU.

Backoff has been implemented in the original C implementation via exponential sleep or condition variable: aprell/tasking-2.0@9d6f46b

However, exponential backoff has throughput issues if the tasking is bursty. An alternative would be the Robust Exponential Backoff presented in this paper, https://arxiv.org/abs/1402.5207, which targets lowering Wi-Fi power consumption but still provides guarantees on Wi-Fi packet delivery.

Latency-optimized / job priorities / soft real-time parallel scheduling

Most work-stealing schedulers work in a LIFO manner: the worker works on the task it just enqueued. The main reason is that this maximizes locality (the data needed for the just-enqueued task is probably hot in cache).
As such, those schedulers optimize throughput (doing all the work as fast as possible) but are fundamentally unfair, as the first task enqueued might not be done as soon as possible but only when there is nothing else to do. This is also how Weave works.

In many cases, for example:

  • (soft) realtime audio-processing or video-processing
  • game engines
  • services where FIFO is expected, for example a service that processes a stream of images, or services where users post tasks and expect the first one posted to be the first one scheduled.

we want to optimize latency:

  1. Assume that for optimizing latency, the early tasks scheduled are those that are logically needed first, i.e. FIFO scheduling
  2. We might want to support job priorities.

There are several papers on soft real-time schedulers (i.e. "Earliest Deadline First" scheduling).

However it seems relatively straightforward to have a latency optimized Weave switch.

FIFO scheduling

Instead of popping the last task enqueued from the deque, we can just pop the first task enqueued.
By default Weave adds from the front

myWorker().deque.addFirst task

and pops from the front

weave/weave/scheduler.nim, lines 137 to 144 in bf2ec2f:

proc nextTask*(childTask: static bool): Task {.inline.} =
  # TODO: rewrite as a finite state machine
  profile(enq_deq_task):
    if childTask:
      result = myWorker().deque.popFirstIfChild(myTask())
    else:
      result = myWorker().deque.popFirst()

We can just pop from the back instead.
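
As a standalone illustration with std/deques (not Weave's internal deque), the two policies simply pick opposite ends of the same queue:

import std/deques

var tasks = initDeque[string]()
for t in ["first enqueued", "second enqueued", "third enqueued"]:
  tasks.addFirst(t)          # like Weave, add tasks at the front

echo tasks.popFirst()        # "third enqueued": LIFO, throughput/locality oriented
echo tasks.popLast()         # "first enqueued": FIFO, latency oriented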

Job priorities

Job priorities are important for certain workloads, for example game engines.

Supporting priorities in Weave should just require adding a per-thread priority queue for priority tasks (and keeping the deque for best-effort tasks). No need to solve the complex lock-free concurrent priority queue problem (and the associated thread-safe memory reclamation) when using a message-passing based runtime ✌️.
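
A rough sketch of that idea with the standard library (the Task object and its fields here are made up for illustration, not Weave's task type):

import std/[heapqueue, deques]

type Task = object
  priority: int    # lower number = more urgent, purely for this sketch
  name: string

proc `<`(a, b: Task): bool = a.priority < b.priority

var priorityTasks = initHeapQueue[Task]()   # per-thread queue for priority tasks
var bestEffort = initDeque[Task]()          # the usual deque for best-effort tasks

priorityTasks.push Task(priority: 0, name: "mix audio block")
priorityTasks.push Task(priority: 3, name: "stream asset")
bestEffort.addFirst Task(priority: 9, name: "background stats")

# Scheduling policy sketch: drain priority work first, then fall back to the deque.
let next =
  if priorityTasks.len > 0: priorityTasks.pop()
  else: bestEffort.popFirst()
echo next.name   # "mix audio block"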

Distributed computing

Naive distributed computing requires the following things:

  • A SPSC channel
  • A MPSC channel
  • Finding peers

Thanks to our message-passing based design, we should be able to reuse a large part of the code, though some hierarchical work-stealing probably needs to be introduced.

See the thesis for MPI-based channels.

Further distributed channel alternatives could be ZeroMQ, see this presentation: http://irpf90.ups-tlse.fr/files/oslo_zmq.pdf,
and Nanomsg, which is MIT-licensed (https://github.com/nanomsg/nng), see the write-up: https://nanomsg.github.io/nng/RATIONALE.html

Parallel reduction: performance or termination detection issue

A simple log-sum-exp on 256x10 tensors is 8x slower than sequential with eager flowvars, and with lazy flowvars it has to be killed (workers idling but the runtime not terminating).

See: aprell/tasking-2.0#3

Suspicions:

  • Bug in the memory subsystem? But why wouldn't that get triggered on Black and Scholes or matrix transposition, which are also for-loops and may do lazy loop splitting and release memory from remote threads.
  • The linked lists that link reduction dependencies may be causing issues.
  • Maybe we are doing N times the work?
  • Or we have a termination detection issue

Metrics and perf profile

with (in micro-seconds)

Timer, WorkerID, timer_run_task, timer_send_recv_req, timer_send_recv_task, timer_enq_deq_task, timer_idle, total
Sanity check, logSumExp(1..<10) should be 9.4585514 (numpy logsumexp): 9.458551406860352


--------------------------------------------------------------------------
Scheduler:                                    Sequential
Benchmark:                                    Log-Sum-Exp (Machine Learning) 
Threads:                                      1
datasetSize:                                  20000
batchSize:                                    256
# of full batches:                            78
# of image labels:                            10
Text vocabulary size:                         1000
--------------------------------------------------------------------------
Dataset:                                      256x10
Time(ms):                                     65.479
Max RSS (KB):                                 97496
Runtime RSS (KB):                             0
# of page faults:                             0
Logsumexp:                                    994.2670288085938
--------------------------------------------------------------------------
Scheduler:                                    Weave (eager flowvars)
Benchmark:                                    Log-Sum-Exp (Machine Learning) 
Threads:                                      1
datasetSize:                                  20000
batchSize:                                    256
# of full batches:                            78
# of image labels:                            10
Text vocabulary size:                         1000
--------------------------------------------------------------------------
Dataset:                                      256x10
Time(ms):                                     308.294
Max RSS (KB):                                 97496
Runtime RSS (KB):                             0
# of page faults:                             342
Logsumexp:                                    994.2674560546875

+========================================+
|  Per-worker statistics                 |
+========================================+
  / use -d:WV_profile for high-res timers /  
Worker  3: 33 steal requests sent
Worker 10: 36 steal requests sent
Worker  3: 75 steal requests handled
Worker 10: 65 steal requests handled
Worker 29: 33 steal requests sent
Worker 27: 39 steal requests sent
Worker  8: 37 steal requests sent
Worker  8: 72 steal requests handled
Worker  8: 3421 steal requests declined
Worker  8: 5390 tasks executed
Worker  8: 72 tasks sent
Worker 10: 3436 steal requests declined
Worker  3: 3812 steal requests declined
Worker 10: 4988 tasks executed
Worker  6: 31 steal requests sent
Worker 22: 33 steal requests sent
Worker 22: 0 steal requests handled
Worker 22: 2576 steal requests declined
Worker 22: 16986 tasks executed
Worker 22: 0 tasks sent
Worker 22: 0 tasks split
Worker 22: 100.00 % steal-one
Worker 22: 0.00 % steal-half
Timer,22,70.254,6998.741,3451.635,0.622,17846.474,28367.727
Worker 17: 38 steal requests sent
Worker 17: 38 steal requests handled
Worker 17: 2784 steal requests declined
Worker 31: 34 steal requests sent
Worker 31: 0 steal requests handled
Worker 31: 2052 steal requests declined
Worker 26: 38 steal requests sent
Worker 19: 34 steal requests sent
Worker 19: 0 steal requests handled
Worker 19: 2713 steal requests declined
Worker 19: 23280 tasks executed
Worker 19: 0 tasks sent
Worker 19: 0 tasks split
Worker 19: 100.00 % steal-one
Worker 19: 0.00 % steal-half
Timer,19,131.721,7056.739,3433.744,1.068,17743.030,28366.303
Worker 28: 36 steal requests sent
Worker 28: 0 steal requests handled
Worker 28: 2537 steal requests declined
Worker 28: 24649 tasks executed
Worker 28: 0 tasks sent
Worker 28: 0 tasks split
Worker 28: 100.00 % steal-one
Worker 28: 0.00 % steal-half
Timer,28,114.383,7038.537,3418.580,0.956,17730.621,28303.077
Worker 32: 34 steal requests sent
Worker 32: 0 steal requests handled
Worker 32: 2082 steal requests declined
Worker 32: 9079 tasks executed
Worker 32: 0 tasks sent
Worker 32: 0 tasks split
Worker 32: 100.00 % steal-one
Worker 32: 0.00 % steal-half
Timer,32,56.943,6957.737,3399.172,0.462,17981.132,28395.446
Worker 13: 37 steal requests sent
Worker 13: 73 steal requests handled
Worker 13: 3006 steal requests declined
Worker 13: 8252 tasks executed
Worker 13: 73 tasks sent
Worker 13: 73 tasks split
Worker 13: 100.00 % steal-one
Worker 13: 0.00 % steal-half
Timer,13,147.470,7099.723,3438.656,2.372,17740.460,28428.681
Worker  8: 72 tasks split
Worker  8: 100.00 % steal-one
Worker 30: 37 steal requests sent
Worker 30: 0 steal requests handled
Worker 30: 2444 steal requests declined
Worker 30: 23325 tasks executed
Worker 30: 0 tasks sent
Worker 30: 0 tasks split
Worker 30: 100.00 % steal-one
Worker 30: 0.00 % steal-half
Timer,30,94.372,7038.752,3446.618,0.785,17766.724,28347.251
Worker 16: 38 steal requests sent
Worker 16: 72 steal requests handled
Worker 16: 2655 steal requests declined
Worker 16: 3708 tasks executed
Worker 16: 72 tasks sent
Worker 16: 72 tasks split
Worker 16: 100.00 % steal-one
Worker 16: 0.00 % steal-half
Timer,16,86.647,8105.503,3231.686,2.798,18223.436,29650.070
Worker  9: 34 steal requests sent
Worker  9: 64 steal requests handled
Worker  9: 3274 steal requests declined
Worker  9: 25721 tasks executed
Worker  9: 64 tasks sent
Worker  9: 64 tasks split
Worker  9: 100.00 % steal-one
Worker  9: 0.00 % steal-half
Timer,9,249.162,7092.431,3436.262,2.427,17734.208,28514.490
Worker  7: 40 steal requests sent
Worker  7: 74 steal requests handled
Worker  7: 3191 steal requests declined
Worker  7: 5327 tasks executed
Worker  7: 74 tasks sent
Worker  7: 74 tasks split
Worker  7: 100.00 % steal-one
Worker  7: 0.00 % steal-half
Timer,7,138.054,6861.554,3482.237,3.287,17315.058,27800.189
Worker 24: 33 steal requests sent
Worker 24: 0 steal requests handled
Worker 24: 2418 steal requests declined
Worker 24: 25174 tasks executed
Worker 24: 0 tasks sent
Worker 24: 0 tasks split
Worker 24: 100.00 % steal-one
Worker 24: 0.00 % steal-half
Timer,24,99.328,7045.619,3415.327,0.815,17725.872,28286.961
Worker  3: 26724 tasks executed
Worker  3: 75 tasks sent
Worker  3: 75 tasks split
Worker  3: 100.00 % steal-one
Worker 15: 38 steal requests sent
Worker 15: 66 steal requests handled
Worker 15: 2646 steal requests declined
Worker 15: 2726 tasks executed
Worker  6: 70 steal requests handled
Worker  6: 3555 steal requests declined
Worker  6: 26274 tasks executed
Worker  6: 70 tasks sent
Worker  6: 70 tasks split
Worker  6: 100.00 % steal-one
Worker  6: 0.00 % steal-half
Timer,6,196.350,6996.878,3429.454,2.007,17719.904,28344.594
Worker 14: 35 steal requests sent
Worker 14: 68 steal requests handled
Worker 14: 3072 steal requests declined
Worker 14: 4599 tasks executed
Worker 14: 68 tasks sent
Worker 14: 68 tasks split
Worker 14: 100.00 % steal-one
Worker 14: 0.00 % steal-half
Timer,14,162.134,7116.283,3409.806,2.232,17722.834,28413.289
Worker  1: 20 steal requests sent
Worker  1: 60 steal requests handled
Worker  1: 4363 steal requests declined
Worker  1: 51445 tasks executed
Worker  1: 60 tasks sent
Worker  1: 60 tasks split
Worker  1: 100.00 % steal-one
Worker  1: 0.00 % steal-half
Timer,1,299.332,7030.355,3396.620,4.260,17656.248,28386.814
Worker 27: 0 steal requests handled
Worker 27: 2303 steal requests declined
Worker 27: 17857 tasks executed
Worker 27: 0 tasks sent
Worker 27: 0 tasks split
Worker 27: 100.00 % steal-one
Worker 27: 0.00 % steal-half
Timer,27,71.999,7034.148,3428.589,0.547,17731.467,28266.751
Worker  0: 1 steal requests sent
Worker  0: 40 steal requests handled
Worker  0: 5045 steal requests declined
Worker  0: 2473159 tasks executed
Worker  0: 40 tasks sent
Worker  0: 10 tasks split
Worker  0: 100.00 % steal-one
Worker  0: 0.00 % steal-half
Timer,0,7428.066,5852.005,2049.953,7.784,11052.187,26389.995
Worker 21: 34 steal requests sent
Worker 21: 0 steal requests handled
Worker 21: 2869 steal requests declined
Worker 21: 23934 tasks executed
Worker 21: 0 tasks sent
Worker 21: 0 tasks split
Worker 21: 100.00 % steal-one
Worker 21: 0.00 % steal-half
Timer,21,141.978,7028.810,3428.964,1.293,17726.939,28327.984
Worker 31: 8281 tasks executed
Worker 31: 0 tasks sent
Worker 31: 0 tasks split
Worker 31: 100.00 % steal-one
Worker 31: 0.00 % steal-half
Timer,31,42.433,7075.626,3462.089,0.381,17755.089,28335.618
Worker 12: 36 steal requests sent
Worker 12: 74 steal requests handled
Worker 12: 3042 steal requests declined
Worker 12: 7046 tasks executed
Worker 12: 74 tasks sent
Worker 12: 74 tasks split
Worker 12: 100.00 % steal-one
Worker 12: 0.00 % steal-half
Timer,12,163.312,7059.514,3404.781,2.375,17686.628,28316.610
Worker 25: 38 steal requests sent
Worker 25: 0 steal requests handled
Worker 25: 2506 steal requests declined
Worker 25: 28919 tasks executed
Worker 25: 0 tasks sent
Worker 25: 0 tasks split
Worker 25: 100.00 % steal-one
Worker 25: 0.00 % steal-half
Timer,25,111.899,7063.668,3431.609,0.838,17721.555,28329.568
Worker  3: 0.00 % steal-half
Timer,3,228.089,7106.528,3406.020,3.824,17689.052,28433.513
Worker 15: 66 tasks sent
Worker 15: 66 tasks split
Worker 15: 100.00 % steal-one
Worker 15: 0.00 % steal-half
Timer,15,80.058,7134.656,3520.893,1.772,17707.194,28444.574
Worker 35: 39 steal requests sent
Worker 35: 0 steal requests handled
Worker 35: 2323 steal requests declined
Worker 35: 12824 tasks executed
Worker 35: 0 tasks sent
Worker 35: 0 tasks split
Worker 35: 100.00 % steal-one
Worker 35: 0.00 % steal-half
Timer,35,90.430,7100.583,3427.469,0.761,17772.149,28391.392
Worker  2: 21 steal requests sent
Worker  2: 60 steal requests handled
Worker  2: 4195 steal requests declined
Worker  2: 54829 tasks executed
Worker  2: 60 tasks sent
Worker  2: 60 tasks split
Worker  2: 100.00 % steal-one
Worker  2: 0.00 % steal-half
Timer,2,14854.437,7780.487,3028.948,2.399,15690.848,41357.119
Worker 18: 36 steal requests sent
Worker 18: 0 steal requests handled
Worker 18: 2885 steal requests declined
Worker 18: 25927 tasks executed
Worker 18: 0 tasks sent
Worker 18: 0 tasks split
Worker 18: 100.00 % steal-one
Worker 18: 0.00 % steal-half
Timer,18,140.243,7082.420,3414.857,1.215,17653.338,28292.073
Worker 26: 0 steal requests handled
Worker 26: 2429 steal requests declined
Worker 26: 20008 tasks executed
Worker  4: 29 steal requests sent
Worker  4: 67 steal requests handled
Worker  4: 3860 steal requests declined
Worker  4: 28989 tasks executed
Worker  4: 67 tasks sent
Worker  4: 67 tasks split
Worker  4: 100.00 % steal-one
Worker  4: 0.00 % steal-half
Timer,4,250.399,6385.547,4857.116,3.763,18666.364,30163.189
Worker 33: 37 steal requests sent
Worker 33: 0 steal requests handled
Worker 33: 2157 steal requests declined
Worker 20: 32 steal requests sent
Worker 20: 0 steal requests handled
Worker 11: 34 steal requests sent
Worker 11: 63 steal requests handled
Worker 11: 3035 steal requests declined
Worker 11: 7328 tasks executed
Worker 10: 65 tasks sent
Worker 10: 65 tasks split
Worker 10: 100.00 % steal-one
Worker 10: 0.00 % steal-half
Timer,10,195.228,7119.545,3418.841,2.004,17733.731,28469.349
Worker 17: 8194 tasks executed
Worker 26: 0 tasks sent
Worker 26: 0 tasks split
Worker 26: 100.00 % steal-one
Worker 26: 0.00 % steal-half
Timer,26,96.546,7063.857,3403.231,0.741,17724.032,28288.407
Worker 20: 2558 steal requests declined
Worker 20: 19716 tasks executed
Worker 20: 0 tasks sent
Worker 20: 0 tasks split
Worker  8: 0.00 % steal-half
Timer,8,192.811,7093.305,3423.944,3.574,17710.355,28423.989
Worker  5: 31 steal requests sent
Worker  5: 68 steal requests handled
Worker  5: 3545 steal requests declined
Worker 34: 37 steal requests sent
Worker 34: 0 steal requests handled
Worker 17: 38 tasks sent
Worker 17: 38 tasks split
Worker 17: 100.00 % steal-one
Worker 17: 0.00 % steal-half
Timer,17,99.339,7057.996,3440.329,1.072,17737.308,28336.044
Worker  5: 30440 tasks executed
Worker  5: 68 tasks sent
Worker  5: 68 tasks split
Worker 34: 2159 steal requests declined
Worker  5: 100.00 % steal-one
Worker 29: 0 steal requests handled
Worker 29: 2449 steal requests declined
Worker 29: 20583 tasks executed
Worker 29: 0 tasks sent
Worker 29: 0 tasks split
Worker 29: 100.00 % steal-one
Worker 29: 0.00 % steal-half
Timer,29,98.517,7055.290,3405.382,0.825,17710.874,28270.887
Worker 11: 63 tasks sent
Worker 11: 63 tasks split
Worker 11: 100.00 % steal-one
Worker 11: 0.00 % steal-half
Timer,11,14759.737,7823.362,3037.788,2.207,15743.915,41367.009
Worker  5: 0.00 % steal-half
Timer,5,14806.778,12367.326,3971.878,2.673,18949.675,50098.329
Worker 20: 100.00 % steal-one
Worker 20: 0.00 % steal-half
Worker 34: 11318 tasks executed
Worker 34: 0 tasks sent
Worker 34: 0 tasks split
Worker 34: 100.00 % steal-one
Worker 34: 0.00 % steal-half
Timer,34,64.704,7063.165,3421.925,0.553,17726.712,28277.060
Timer,20,104.570,7041.677,3411.791,0.815,17711.852,28270.705
Worker 33: 11421 tasks executed
Worker 33: 0 tasks sent
Worker 33: 0 tasks split
Worker 33: 100.00 % steal-one
Worker 33: 0.00 % steal-half
Timer,33,58.339,7072.869,3413.858,0.470,17742.986,28288.522
Worker 23: 32 steal requests sent
Worker 23: 0 steal requests handled
Worker 23: 2379 steal requests declined
Worker 23: 21580 tasks executed
Worker 23: 0 tasks sent
Worker 23: 0 tasks split
Worker 23: 100.00 % steal-one
Worker 23: 0.00 % steal-half
Timer,23,83.116,2767.684,9312.722,0.729,24085.002,36249.252
+========================================+

[Mempool] Trying to allocate from a full arena

https://travis-ci.com/mratsim/weave/jobs/271436702#L414-L426

========================================================================================
Running [] weave/memory/memory_pools.nim
========================================================================================
Single-threaded: System alloc for 100 blocks: 0.7871 s
Single-threaded: Pool   alloc for 100 blocks: 0.4178 s
Multi-threaded: System alloc: 0.0178 s
fatal.nim(39)            sysFatal
Error: unhandled exception: contracts.nim(86, 15) `arena.meta.used < arena.blocks.len`
    Contract violated for pre-condition at memory_pools.nim:359
        arena.meta.used < arena.blocks.len
    The following values are contrary to expectations:
        62 < 62  [Worker N/A]
 [AssertionError]

func allocBlock(arena: var Arena): ptr MemBlock {.inline.} =
  ## Allocate from an arena
  preCondition: not arena.meta.free.isNil
  preCondition: arena.meta.used < arena.blocks.len
  arena.meta.used += 1
  result = arena.meta.free
  unpoisonMemRegion(result, WV_MemBlockSize)
  # The following acts as prefetching for the block that we are returning as well
  arena.meta.free = cast[ptr MemBlock](result.next.load(moRelaxed))
  postCondition: arena.meta.used in 0 .. arena.blocks.len

[Testing] Concurrency: Race detection / Model Checking / Formal Verification

While Weave is currently doing a very good job at restricting shared writable state to channels, and specifically trySend and tryRecv routines, we need tooling and tests to detect races and concurrent heisenbugs.

Unfortunately it is apparently an NP-hard problem. The issue is that, to ease debugging, we need to reliably trigger the bug to ensure it is fixed. However thread interleaving is non-deterministic and we can't ask people to use our own deterministic fork of Windows, Linux or Mac.

Also, allocating and freeing memory willy-nilly from lots of threads might overwhelm the allocator in use or trigger bugs (if we use the Nim allocator), so it's probably better to never free memory during testing, to avoid allocator bugs/slowness (b8ac8d6).

There are a couple of approaches that can be taken, with varying degrees of impracticality.

Sanitizers

This category only requires recompiling with extra flags, sometimes just --debugger:native.
While they can detect the presence of bugs, I don't think they can prove the absence of one.

valgrind --tool=helgrind build/mybinary

http://valgrind.org/docs/manual/hg-manual.html

POSIX-only; with --debugger:native it will mention the Nim lines that are potentially racy.
It slows down the code a lot.

Also there is some noise on memset and memcpy (not sure if it also happens if you never free memory).

LLVM ThreadSanitizer

Compile with clang with -fsanitize=thread
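
A hedged sketch of how that could be invoked for a Nim test program (standard Nim flags; the file name is a placeholder):

nim c --cc:clang --threads:on --passC:"-fsanitize=thread" --passL:"-fsanitize=thread" tests/mytest.nim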

Libraries

Libraries can exhaustively test all kinds of thread interleavings given a test scenario, say an MPSC queue with 2 producers and 1 consumer. We however have to assume that being bug-free for 2 producers means being bug-free for N (which has been proved for some MPSC queue implementations).

Important: the library used should ideally support the C++11 memory model and/or ensure correctness on weak memory model architectures (i.e. everything that is not x86, like ARM, PowerPC or MIPS).

Relacy Race Detector

We can switch Nim atomics to Relacy atomics via templates and recompile.

Chess

Microsoft, Windows-only

Landslide

https://github.com/bblum/landslide
https://www.pdl.cmu.edu/Landslide/index.shtml

See extensive PhD Thesis at the bottom

Do-it-yourself

The blog post for MultithreadedTC, a verifier for the JVM, explains in depth how it is architected: an internal metronome clock syncs all threads and then, at synchronization points, tests all combinations of thread interleaving.

http://www.cs.umd.edu/projects/PL/multithreadedtc/overview.html

Model Checking and Formal verification

This is heavyweight and requires either using a foreign language or a lot of annotation and constraints in the source code, in exchange it provides mathematical guarantees of correctness:

VCC

Annotate C code and it will be passed to Z3

Iris

TLA+

Spin

Resources

load balancing on backed off child wake-up.

For Weave compiled with Backoff and StealAdaptative

Similar to #76, which wakes up child workers with enough loop iterations to satisfy their whole subtree (explained in depth in #89), it might be better to send a lot of tasks when waking up child workers, overriding their stealOne/stealHalf request.

Windows TLS Emulation is extremely slow

Overhead-bound benchmarks like Fibonacci and Depth-First Search are significantly slower on Windows than Linux and Mac.

Config: i9-9980XE 18 cores, 36 threads, with 4.1GHz all core Turbo

On Fibonacci in particular, the default eager futures take 14 s under Windows while they take 370 ms under Linux, for a whopping 30x+ slowdown.
Lazy futures allocated via alloca take 800 ms, while they take 180 ms under Linux.

This points to a memory allocator issue.

Memory-bound benchmarks (transpose) and CPU-bound benchmarks (Black-Scholes) seem to behave somewhat similarly to Linux.

Similar issues:

Low priority, as we probably can't do anything more than what we have now in our memory subsystem. It's doubtful that even using Mimalloc on Windows (just for Weave) would help, as our memory pool is based on the same techniques. Lastly, Fibonacci is an extreme case with a computation load of 1 cycle, while Weave targets being efficient at 2000 cycles.

TODO: benchmark Cilk and TBB to make sure we are not missing something.

MPSC count can become negative

Even though the count is done on the consumer side, which should always underestimate the real count, the estimated number of enqueued items in the MPSC channel can be negative.

See #48 (comment) and CI https://dev.azure.com/numforge/Weave/_build/results?buildId=25&view=logs

This is not blocking as this count is only informative and used for adaptative stealing and for the memory pool to trigger batch reception of memory.

This is something I actually noticed in the past, but it seems to happen rarely:

func peek*(chan: var ChannelMpscUnboundedBatch): int32 {.inline.} =
  ## Estimates the number of items pending in the channel
  ## - If called by the consumer the true number might be more
  ##   due to producers adding items concurrently.
  ## - If called by a producer the true number is undefined
  ##   as other producers also add items concurrently and
  ##   the consumer removes them concurrently.
  ##
  ## This is a non-locking operation.
  result = int32 chan.count.load(moAcquire)
  # For the consumer it's always positive or zero
  postCondition: result >= 0 # TODO somehow it can be -1

Windows support

Windows is only missing barriers; unlike on macOS, we can reuse the OS API for them.
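
For reference, a minimal sketch of wrapping the Win32 synchronization barrier API (available since Windows 8 via synchapi.h/windows.h); the type and function names are the Windows API's own, the Nim declarations themselves are illustrative:

when defined(windows):
  type
    SynchronizationBarrier* {.importc: "SYNCHRONIZATION_BARRIER",
                              header: "<windows.h>".} = object

  proc InitializeSynchronizationBarrier*(
         lpBarrier: var SynchronizationBarrier,
         lTotalThreads, lSpinCount: int32): int32
       {.importc, stdcall, header: "<windows.h>".}
    ## Returns non-zero on success (WINBOOL).

  proc EnterSynchronizationBarrier*(
         lpBarrier: var SynchronizationBarrier,
         dwFlags: uint32): int32
       {.importc, stdcall, header: "<windows.h>".}
    ## Blocks until lTotalThreads threads have entered the barrier.

  proc DeleteSynchronizationBarrier*(
         lpBarrier: var SynchronizationBarrier): int32
       {.importc, stdcall, header: "<windows.h>".}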

[Benchmarks] Windows benchmarking

Related to #2.

#59 introduces Windows support, but the benchmarks currently rely on Unix-only time and memory measurement tooling. This prevents benchmarking on Windows.

Architecture diagram(s)

In the scheduler, workers alternatively become:

  • workers
  • victims
  • thieves
  • sharing work
  • terminating

With state transitions triggered by:

  • receiving steal requests
  • running out of tasks
  • encountering a barrier
  • the managed worker state

I am pretty sure we could model a thread-local worker as a reasonably sized state machine.
This would help:

  • documentation
  • debugging: a quick table would help to know, depending on the state at the entrance of the proc and the event/message received, what should happen.
  • exhaustivity: making sure that we have handlers for all events
  • verification of the code

Potentially in the future we could have a code generator to generate those nested ifs:

proc decline(req: sink StealRequest) =
  ## Pass steal request to another worker
  ## or the manager if it's our own that came back
  preCondition: req.retry <= PI_MaxRetriesPerSteal

  req.retry += 1
  profile(send_recv_req):
    incCounter(stealDeclined)

  if req.thiefID == myID():
    # No one had jobs to steal
    ascertain: req.victims.isEmpty()
    ascertain: req.retry == PI_MaxRetriesPerSteal
    if req.state == Stealing and myWorker().leftIsWaiting and myWorker().rightIsWaiting:
      when PI_MaxConcurrentStealPerWorker == 1:
        # When there is only one concurrent steal request allowed, it's always the last.
        lastStealAttempt(req)
      else:
        # Is this the last theft attempt allowed per steal request?
        # - if so: lastStealAttempt special case (termination if lead thread, sleep if worker)
        # - if not: drop it and wait until we receive work or all out steal requests failed.
        if myThefts().outstanding == PI_MaxConcurrentStealPerWorker and
            myTodoBoxes().len == PI_MaxConcurrentStealPerWorker - 1:
          # "PI_MaxConcurrentStealPerWorker - 1" steal requests have been dropped
          # as evidenced by the corresponding channel "address boxes" being recycled
          ascertain: myThefts().dropped == PI_MaxConcurrentStealPerWorker - 1
          lastStealAttempt(req)
        else:
          drop(req)
    else:
      # Our own request but we still have work, so we reset it and recirculate.
      # This can only happen if workers are allowed to steal before finishing their tasks.
      when PI_StealEarly > 0:
        req.retry = 0
        req.victims.init(workforce)
        req.victims.clear(myID())
        req.findVictimAndSteal()
      else: # No-op in "-d:danger"
        postCondition: PI_StealEarly > 0 # Force an error
  else: # Not our own request
    req.findVictimAndSteal()

Though it's probably overkill given the number of states: it requires testing the macros and makes it trickier to go through the stack traces.

Perf regression on SPC bench

Seems like it's time to implement the full benchmark suite.

There is a significant performance regression on the SPC benchmark compared to the original PoC, probably due to the new memory subsystem in #24.
I suspect it's the release of tasks freed by remote threads back to the OS that requires more tuning.

Refactor the bitset data structure

The current bitset data structure is not satisfactory:
the backend is a uint32, limiting victim selection to 32 cores, and even using uint64 is not enough. Instead the maximum should be a compile-time constant.

A default of 256 should be enough as it was Nim's default, it's also the limit of Windows,
and it should keep the size of a steal request under a cache-line size.
The buffer should use multiple limbs of uint32 or uint64, with the number of limbs derived at compile time from the maximum number of workers.
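
A minimal sketch of the multi-limb backend (`WV_MaxWorkers` is a hypothetical compile-time define used here for illustration, not an existing Weave constant):

const
  WV_MaxWorkers {.intdefine.} = 256          # configurable with -d:WV_MaxWorkers=N
  LimbBits = 64
  NumLimbs = (WV_MaxWorkers + LimbBits - 1) div LimbBits

type
  VictimSet = object
    limbs: array[NumLimbs, uint64]           # 256 workers -> 4 limbs = 32 bytes

func incl(s: var VictimSet, id: int) {.inline.} =
  s.limbs[id div LimbBits] = s.limbs[id div LimbBits] or (1'u64 shl (id mod LimbBits))

func excl(s: var VictimSet, id: int) {.inline.} =
  s.limbs[id div LimbBits] = s.limbs[id div LimbBits] and not(1'u64 shl (id mod LimbBits))

func contains(s: VictimSet, id: int): bool {.inline.} =
  (s.limbs[id div LimbBits] and (1'u64 shl (id mod LimbBits))) != 0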

Some parts of the bitset should be optimized as well:

  • random number generator: rand_r from stdlib.h is fast but does not have good statistical properties,
    and those might be important to benefit from the theoretical properties of randomized victim selection.
    splitMix64 seems to be a nice and simple choice. xoroshiro128+ (the default Nim RNG) is faster but fails BigCrush if only the lower bits are used. We could use xoshiro256++ though, which is even faster.
  • to restrict the range of the RNG output to the number of cores, a simple mod is used. This does not give a uniform distribution if the number of cores does not divide the RNG range, i.e. if the number of cores is not a power of 2. Rejection sampling would correct that, but it may be overkill (see the sketch after this list).
  • random victim selection: when fewer than half of the cores are potential victims, instead of picking a victim directly we can negate the distribution. The fast path with 3 retries currently implemented should still be used over 70% of the time.
    Note that some runtimes do a random victim selection and then increment in a loop until they find an actual victim. This can be unbalanced if there is a large gap, for example with victims 0, 1, 7, 8: 7 has a disproportionate probability of getting picked, increasing the gap further.
  • uncompressing the bitset into an array faster: assuming we support 256 victims in our bitset, we need to iterate on it quickly.
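
For the modulo-bias point above, a rejection-sampling sketch (`rng` stands for whichever 64-bit generator ends up being chosen; it is passed in explicitly here only to keep the example self-contained):

proc uniform(rng: proc(): uint64, numVictims: uint64): uint64 =
  ## Unbiased equivalent of `rng() mod numVictims`:
  ## reject draws from the top, partial bucket of the 64-bit range.
  ## The loop runs exactly once in the overwhelming majority of cases.
  assert numVictims > 0
  let limit = uint64.high - (uint64.high mod numVictims)  # a multiple of numVictims
  while true:
    let x = rng()
    if x < limit:
      return x mod numVictims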

Setup continuous integration

Besides the usual x86, Linux and macOS suspects:

  • Travis for ARM64
  • Azure Pipelines for Windows (when #37 is solved)

Travis in particular allows 6 cores for ARM and only 2 cores for X86.

SparseSet: signed/unsigned/isEmpty/contains issue

Extracted from #19


It's probably linked to a signed/unsigned conversion, but trying to isolate the bug makes it disappear. Reproducing it might require an int conversion via generics in a file that doesn't directly import sparsesets, or something like that.

Symptoms:

(screenshots of the failures omitted)

Note that even with asserts off we still get issues. Some thief/victim IDs or the sparse set length appear negative if converted to "int", or normal (in the 250 range) if kept as "Setuint".

I don't know why it doesn't happen on master.

Nested for-loops: Expressing read-after-write and write-after-write dependencies

From commits:

I am almost ready to port a state-of-the-art BLAS to Weave and compete with OpenBLAS and MKL; however parallelizing 2 nested loops with read-after-write dependencies causes issues.

Analysis

The current barrier is Master thread only and is only suitable for the root task

weave/weave/runtime.nim

Lines 88 to 96 in 7802daf

proc sync*(_: type Weave) =
  ## Global barrier for the Picasso runtime
  ## This is only valid in the root task
  Worker: return

  debugTermination:
    log(">>> Worker %2d enters barrier <<<\n", myID())

  preCondition: myTask().isRootTask()

Unfortunately, in nested parallel loops normal workers can also reach it; they will not be stopped and will create more tasks or continue on their current one even though dependencies are not resolved.

Potential solutions

Extending the barrier

Extend the current barrier to work with worker threads in nested situations. A typical workload with nested barriers should be added to the bench suite to test the behaviour with a known, testable workload.

Providing static loop scheduling

Providing static loop scheduling by eagerly splitting the work may (?) prevent one thread from running away and creating tasks whose dependencies are not resolved. This needs more thinking as I have trouble considering all the scenarios.

Create a waitable nested-for loops iterations

Weave already provides very fine-grained synchronization primitives with futures: we do not need to wait for threads but just for the iterations that do the packing work we depend on. A potential syntax would be:

    ...

    # ###################################
    # 3. for ic = 0,...,m−1 in steps of mc
    parallelFor icb in 0 ..< tiles.ic_num_tasks:
      captures: {pc, tiles, nc, kc, alpha, beta, vA, vC, M}
      waitable: icForLoop

      ...

      sync(icForLoop) # Somehow ensure that the iterations we care about were done

      # #####################################
      # 4. for jr = 0,...,nc−1 in steps of nr
      parallelForStrided jr in 0 ..< nc, stride = NR:
        captures: {mc, nc, kc, alpha, packA, packB, beta, mcncC}

This would create a dummy future called icForLoop that could be waited on by calls that are nested further.

Unknowns:

  • Multiple threads might try to complete the future: should we create a separate type, or do we allow sharing futures?
  • Not too sure how that fits with lazy splitting: assume you wait on icForLoop for iterations 0..<100 but you actually only require the 0..<10 chunk to continue working, how do you specify that? The waitable should maybe store a range.

Task graphs

A couple of other frameworks express this as task graphs.

However the make_edge/precede syntax is verbose and it doesn't seem to address the issue of fine-grained loops.

I believe it's better to express emerging dependencies via data dependencies, i.e. futures and waitable ranges.

The OpenMP task dependencies approach could be suitable and made cleaner; from the OpenMP 4.5 doc:
(screenshot of the OpenMP 4.5 documentation omitted)

Or CppSs (https://gitlab.com/szs/CppSs, https://www.thinkmind.org/download.php?articleid=infocomp_2013_2_30_10112, https://arxiv.org/abs/1502.07608):
(screenshot of the CppSs documentation omitted)

Or Kaapi (https://tel.archives-ouvertes.fr/tel-01151787v1/document):

(screenshot from the Kaapi thesis omitted)

Create a polyhedral compiler

Well, the whole point of Weave is to ease creating a linear algebra compiler, so ...

Code explanation

The way the code is structured is the following (from BLIS paper [2]):
(figure from the BLIS paper [2] omitted)

For an MxN = MxK * KxN product,
assuming a shared-memory arch with private L1-L2 caches per core and a shared L3,
tiling for CPU caches and register blocking is done the following way:

  • nc chosen to fit in L3 cache (4096 bytes), not done in Laser
  • kc and mc chosen to fit in half the L2 cache
  • kc and mc should avoid page-fault by not fitting in TLB
  • kc and nr chosen to fit in half the L1 cache
  • kc as large as possible to amortize mr*nr updates
  • This is similar to maximizing the area of a rectangle while minimizing the perimeter
    mc*kc/(2mc+2kc) with mc*kc < K
  • mr and nr are chosen depending on the number of registers and their width
    and so are runtime CPU dependent
  • Using the wrong parameters or register size can drop the performance by 30%
| Loop around microkernel | Index | Length | Stride | Dimension | Dependencies | Notes |
|---|---|---|---|---|---|---|
| 5th loop | jc | N | nc | N | | |
| 4th loop | pc | K | kc | K | Pack panels of B: kc*nc by blocks of nr | Panel packing: Parallelization with Master barrier. Loop: Difficult to parallelize as K is the reduction dimension so parallelizing K requires handling conflicting writes |
| 3rd loop | ic | M | mc | M | Pack panels of A: kc*mc by blocks of mr | Panel packing: Parallelized. Loop: Normally parallelized but missing a way to express dependency or a worker barrier |
| 2nd loop | jr | nc | nr | N | Slice A into micropanel kc*nr | Parallelized with master barrier |
| 1st loop | ir | mc | mr | M | Slice B into micropanel kc*mr | |
| Microkernel | k,i,j | kc,mr,nr | 1 | K, M, N | Read the micropanel to do the mr*nr update | Vectorized |
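
To make the loop nest concrete, here is a schematic and deliberately naive Nim version (no packing, no vectorized micro-kernel, no parallelism); `gemmTiled` and its tile parameters are illustrative, not Laser's or Weave's actual code:

proc gemmTiled(M, N, K: int; A, B: seq[float32]; C: var seq[float32];
               mc, nc, kc, mr, nr: int) =
  ## C[M,N] += A[M,K] * B[K,N], row-major. Loop structure only.
  for jc in countup(0, N-1, nc):                 # 5th loop (N dimension)
    let jcEnd = min(jc + nc, N)
    for pc in countup(0, K-1, kc):               # 4th loop (K): pack the B panel here
      let pcEnd = min(pc + kc, K)
      for ic in countup(0, M-1, mc):             # 3rd loop (M): pack the A panel here
        let icEnd = min(ic + mc, M)
        for jr in countup(jc, jcEnd-1, nr):      # 2nd loop
          for ir in countup(ic, icEnd-1, mr):    # 1st loop
            # micro-kernel: update an mr*nr block of C over kc iterations
            for k in pc ..< pcEnd:
              for i in ir ..< min(ir + mr, icEnd):
                for j in jr ..< min(jr + nr, jcEnd):
                  C[i*N + j] += A[i*K + k] * B[k*N + j]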

References

[1] Anatomy of High-Performance Matrix Multiplication (Revised)
Kazushige Goto, Robert A. Van de Geijn
- http://www.cs.utexas.edu/~flame/pubs/GotoTOMS_revision.pdf

[2] Anatomy of High-Performance Many-Threaded Matrix Multiplication
Smith et al
- http://www.cs.utexas.edu/users/flame/pubs/blis3_ipdps14.pdf

[3] Automating the Last-Mile for High Performance Dense Linear Algebra
Veras et al
- https://arxiv.org/pdf/1611.08035.pdf

[4] GEMM: From Pure C to SSE Optimized Micro Kernels
Michael Lehn
- http://apfel.mathematik.uni-ulm.de/~lehn/sghpc/gemm/index.html

Laser wiki - GEMM optimization resources
- https://github.com/numforge/laser/wiki/GEMM-optimization-resources

Research on IO-bound tasks

Weave / Project Picasso focuses on CPU-bound tasks, i.e. tasks that are non-blocking and where you can throw more CPU at the problem to get your result faster.

For IO-bound tasks the idea was to defer to specialized libraries like asyncdispatch and Chronos that use OS primitives (epoll/IOCP/kqueue) to handle IO efficiently.

However even for compute bound tasks we will have to deal with IO latencies for example in a distributed system or cluster. So we need a solution to do useful work in the downtime without blocking a whole thread.

That means:

  • either playing well with asyncdispatch/Chronos (do we run an event loop per thread or only one event loop ...)
  • or having deeper integration, which will probably be better for people that need both IO and compute, but people needing only one or the other pay an extra tax. The library would also get significantly more complex, harder to maintain, and would have many more platform-specific code paths, or even CPU-specific ones in the case of coroutines with stack and register manipulation.

Research

  • Reduced I/O latencies with Futures

    Kyle Singer, Kunal Agrawal, I-Ting Angelina Lee

    https://arxiv.org/abs/1906.08239

    The paper explores coupling a Cilk-like workstealing runtime with a IO runtime based on Linux epoll and eventfd.

  • A practical solution to the Cactus Stack Problem

    Chaoran Yang, John Mellor-Crummey

    http://chaoran.me/assets/pdf/ws-spaa16.pdf

    Fibril: https://github.com/chaoran/fibril

    While not explicitly mentioning async I/O, the paper and the corresponding Fibril library are using
    coroutines/fibers-like tasks to achieve fast and extremely low overhead context switching.
    Coroutines are very efficient building blocks for async IO.

    For reference, the overhead is measured by fibonacci(40), which spawns hundreds of millions of tasks. Fibril achieves 130ms, Staccato 180ms, Weave 165-200ms depending on memory management tradeoffs; more established runtimes have much more overhead: TBB 600ms~1s, Clang OpenMP ~2s, Julia Partr ~8s, and HPX and GCC OpenMP cannot handle fib(40).

Implementations

C++ Target

#94 introduces pledges for dataflow graphs.
It uses atomics in a union, which leads to bad codegen with the C++ backend at the moment.

Upstream: nim-lang/Nim#13062

Benchmarks TODO

Important benchmarks

  • Framework overhead via fibonacci
  • Unbalanced Tree Search
  • GEMM / Matrix Multiply vs OpenBLAS and MKL
  • Binary size overhead when runtime is not compiled in
  • Space overhead at runtime vs serial code
  • Returning memory to the OS on long-running processes
  • PARSEC benchmark suite: https://parsec.cs.princeton.edu/
  • NAS Parallel Benchmarks from the NASA Advanced Supercomputing: https://www.nas.nasa.gov/publications/npb.html

Instrumentation, tutorials, examples

  • topology: hyperthreading siblings, NUMA
  • measuring performance, core usage, latencies, cache misses, view assembly:
    • perf
    • Intel VTune
    • Apple Instruments
  • bloaty for binary size
  • perf c2c for measuring cache contention / false sharing
  • helgrind for locking

Requires changing the internals:

  • coz for causal profiling and bottleneck detection
  • relacy for race detection

Stretch goals

  • Other common benchmarks (nqueens, nbodies, LU, heat, qsort, bouncing producer-consumer, ...)
  • Porting michi (a ~550-line Go bot with parallel Monte-Carlo Tree Search, written in Python) to Nim (https://github.com/pasky/michi) and benchmarking against the C and Go implementations.

Reduce padding in the MPSC Channel + introduce a no count flag

Padding

The MPSC channel padding is very memory hungry:

ChannelMpscUnboundedBatch*[T: Enqueueable] = object
  ## Lockless multi-producer single-consumer channel
  ##
  ## Properties:
  ## - Lockless
  ## - Wait-free for producers
  ## - Consumer can be blocked by producers when they swap
  ##   the tail, the tail can grow but the consumer sees it as nil
  ##   until it's published all at once.
  ## - Unbounded
  ## - Intrusive List based
  ## - Keep an approximate count on enqueued
  ## - Support batching on both the producers and consumer side

  # TODO: pass this through Relacy and Valgrind/Helgrind
  #       to make sure there are no bugs
  #       on arch with relaxed memory models

  # Accessed by all
  count{.align: WV_CacheLinePadding.}: Atomic[int]
  # Producers and consumer slow-path
  back{.align: WV_CacheLinePadding.}: Atomic[pointer] # Workaround generic atomics bug: https://github.com/nim-lang/Nim/issues/12695
  # Consumer only - front is a dummy node
  front{.align: WV_CacheLinePadding.}: typeof(default(T)[])

WV_CacheLinePadding is 2x the cache-line size = 128 bytes, which means 384 bytes are taken. The value of 128 was chosen because Intel CPUs prefetch cache lines in pairs; Facebook's Folly also did in-depth experiments to come up with this value.

This was OK when the MPSC channel was used in a fixed manner for incoming steal requests and incoming freed memory from remote threads; however the dataflow parallelism protocol described in #92 (comment) requires allocating ephemeral MPSC channels.
If an application relies exclusively on dataflow graph parallelism, it will incur a huge memory overhead, as the memory pool only allocates 256-byte blocks.

As a compromise (hopefully) between cache invalidation prevention and memory usage, the data could be reorganized the following way:

  ChannelMpscUnboundedBatch*[T: Enqueueable] = object
    # Producers and consumer slow-path
    back{.align: WV_CacheLinePadding div 2.}: Atomic[pointer]
    # Accessed by all
    count{.align: WV_CacheLinePadding div 2.}: Atomic[int]
    # Consumer only - front is a dummy node
    front{.align: WV_CacheLinePadding div 2.}: typeof(default(T)[])

Padding is now 64 bytes for a total of 128 + sizeof(T) bytes taken, well within the memory pool block size of 256 bytes; it can be made intrusive to another data structure with 256 - 128 - sizeof(T) bytes left for metadata to save on allocations.

In terms of cache conflicts, front/back are still 2 cache lines apart and there was cache invalidation on count anyway.

The ordering (producer fields then consumer fields) assumes that:

  • if the MPSC channel is used intrusively, it's the first field of the datatype
  • the consumer is the likely owner and updater of that type.
    This may or may not be true.

Count

The count is needed for steal requests to approximate (give a lower bound on) the number of thieves in steal adaptative mode.
It is needed for remotely freed memory so that the memory pool gets a lower bound on the number of memory blocks that can be collected back into the memory arena.

However in the dataflow parallelism protocol it is not needed to keep track of the enqueued task count. Similarly, for the non-adaptative steal it is not needed to keep track of the number of steal requests.
Atomic increments/decrements are taxing as they force cache-line traffic between cores. The MPSC channel should therefore have an optional count.
Note that in the case of an optional count, the padding between back and front should go back to 2x cache lines.
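
A minimal sketch of what an optional count could look like, using a static generic parameter (the `keepCount` parameter is hypothetical, not the current Weave API; `Enqueueable`, `WV_CacheLinePadding`, `Atomic` and `moRelaxed` are assumed to come from Weave's existing internals):

type
  ChannelMpscUnboundedBatch*[T: Enqueueable; keepCount: static bool] = object
    # Producers and consumer slow-path
    back{.align: WV_CacheLinePadding div 2.}: Atomic[pointer]
    when keepCount:
      # Accessed by all; the field disappears entirely when counting is disabled
      # (in that case the back/front padding should go back to 2x cache lines).
      count{.align: WV_CacheLinePadding div 2.}: Atomic[int]
    # Consumer only - front is a dummy node
    front{.align: WV_CacheLinePadding div 2.}: typeof(default(T)[])

proc incrCount[T; keepCount: static bool](
       chan: var ChannelMpscUnboundedBatch[T, keepCount], n: int) {.inline.} =
  ## The whole body compiles to nothing when counting is disabled.
  when keepCount:
    discard chan.count.fetchAdd(n, moRelaxed)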

Status

  • Reduce Padding
  • Introduce a nocount flag

AddressSanitizer test suite

It would be nice to run AddressSanitizer in CI following #78 and #79

Currently parallel_for and parallel_for_staged work

There are a couple of limitations:

  • Nim empty strings trigger a global-buffer-overflow report in ASAN that pollutes the Weave output on fibonacci and nqueens for example.
  • The parallel_tasks file triggers a use-after-poison for reasons to be investigated.

The MPSC channel has a deadlock

The MPSC channel for StealRequest has a deadlock

Reproducible with 5 workers on Fibonacci.

It might also be worth moving to a lock-free design, even though the lock-based design is not a bottleneck.

Restrict the use of globals in functions

Ideally only the main scheduler functions access globals, besides the metrics.

While there are plenty of "globals considered harmful" articles, the main motivation here is that globals make the library much harder to test.

Scalable fuzzing and property-based testing would require independent, ideally side-effect free components that can be tested independently and in parallel.

[GCC] bad codegen / stack corruption crash on innocent code

And since we are witch-hunting the C tooling, let's report this wonderful GCC codegen bug or stack corruption (Clang doesn't crash). 100% reproducible:

(screenshots of the crash omitted)

I don't know what optimizations GCC tries to pull here but it doesn't work.
The fix is to assign pool.last = arena in a different scope from pool.first = arena

func append(pool: var TLPoolAllocator, arena: ptr Arena) {.inline.} =
  preCondition: arena.next.isNil

  debugMem:
    log("Pool 0x%.08x - TID %d - append Arena 0x%.08x\n",
      pool.addr, pool.threadID, arena)

  if pool.numArenas == 0:
    ascertain: pool.first.isNil
    ascertain: pool.last.isNil
    pool.first = arena
  else:
    arena.prev = pool.last
    pool.last.next = arena

  pool.last = arena
  pool.numArenas += 1
  arena.allocator = pool.addr

Calling an init/task/exit twice creates worker tree problem

Stacktrace when duplicating the main call in async.nim:

Sanity check 1: Printing 123456 654321 in parallel
123456 - SUCCESS
654321 - SUCCESS
Sanity check 2: fib(20)
/home/beta/Programming/Nim/weave/weave/async.nim(199) async
/home/beta/Programming/Nim/weave/weave/async.nim(192) main2
/home/beta/Programming/Nim/weave/weave/async.nim(142) async_fib
/home/beta/Programming/Nim/weave/weave/scheduler.nim(358) schedule
/home/beta/Programming/Nim/weave/weave/instrumentation/contracts.nim(74) shareWork
/home/beta/.choosenim/toolchains/nim-#devel/lib/system/assertions.nim(27) failedAssertImpl
/home/beta/.choosenim/toolchains/nim-#devel/lib/system/assertions.nim(20) raiseAssert
/home/beta/.choosenim/toolchains/nim-#devel/lib/system/fatal.nim(39) sysFatal
Error: unhandled exception: /home/beta/Programming/Nim/weave/weave/instrumentation/contracts.nim(74, 13) `
req.thiefID == myWorker().left or req.thiefID == myWorker.right` 
    Contract violated for transient condition at victims.nim:348
        req.thiefID == myWorker().left or req.thiefID == myWorker.right
    The following values are contrary to expectations:
        27 == 1 or 27 == 2
 [AssertionError]
Error: execution of an external program failed: '/home/beta/Programming/Nim/weave/build/parfor '

[Load Balancing - Backoff] Waking up children with loop tasks

From #68 (comment)

Another interesting log. When a child backs off, the parent will wake it up and send it work when it finds some (i.e. a lifeline, see Saraswat et al., Lifeline-based Global Load Balancing, http://www.cs.columbia.edu/~martha/courses/4130/au12/p201-saraswat.pdf).

This is what happens in this log: the worker even tries to split tasks for its grandchildren but ultimately only sends work to its direct child.

Worker  9: has 1 steal requests
Worker  9: found 6 steal requests addressed to its child 19 and grandchildren
Worker  9: 14 steps left (start: 0, current: 256, stop: 2000, stride: 128, 7 thieves)
Worker  9: Sending [1792, 2000) to worker 19
Worker  9: sending 1 tasks (task.fn 0x121e0a6c) to Worker 19
Worker  9: Continuing with [256, 1792)
Worker  9: waking up child 19
Matrix[1024, 128] (thread 9)
Matrix[1024, 256] (thread 9)
Matrix[1024, 384] (thread 9)
Matrix[1024, 512] (thread 9)
Matrix[1024, 640] (thread 9)
Matrix[1024, 768] (thread 9)
Matrix[1024, 896] (thread 9)
Matrix[1024, 1024] (thread 9)
Matrix[1024, 1152] (thread 9)
Matrix[1024, 1280] (thread 9)
Matrix[1024, 1408] (thread 9)
Matrix[1024, 1536] (thread 9)
Matrix[1024, 1664] (thread 9)
Worker 19: received a task with function address 0x121e0a6c (Channel 0x1c000b60)
Worker 19: running task.fn 0x121e0a6c
Matrix[1024, 1792] (thread 19)
Worker  9: sending own steal request to  0 (Channel 0x1333ce00)
Worker 15: sends state passively WAITING to its parent worker 7
Matrix[1024, 1920] (thread 19)
Worker 19: sending own steal request to  9 (Channel 0x1333d700)

Looking into the code, in all state machines we have shareWork followed by handleThieves:

weave/weave/victims.nim

Lines 249 to 305 in a1d862b

proc distributeWork(req: sink StealRequest): bool =
  ## Handle incoming steal request
  ## Returns true if we found work
  ## false otherwise

  # Send independent task(s) if possible
  if not myWorker().deque.isEmpty():
    req.dispatchElseDecline()
    return true
    # TODO - the control flow is entangled here
    #        since we have a non-empty deque we will never take
    #        the branch that leads to termination
    #        and would logically return true

  # Otherwise try to split the current one
  if myTask().isSplittable():
    if req.thiefID != myID():
      myTask().splitAndSend(req)
      return true
    else:
      req.forget()
      return false

  if req.state == Waiting:
    # Only children can send us a failed state.
    # Request should be saved by caller and
    # worker tree updates should be done by caller as well
    # TODO: disantangle control-flow and sink the request
    postCondition: req.thiefID == myWorker().left or req.thiefID == myWorker().right
  else:
    decline(req)
  return false

proc shareWork*() {.inline.} =
  ## Distribute work to all the idle children workers
  ## if we can
  while not myWorker().workSharingRequests.isEmpty():
    # Only dequeue if we find work
    let req = myWorker().workSharingRequests.peek()
    ascertain: req.thiefID == myWorker().left or req.thiefID == myWorker.right
    if distributeWork(req): # Shouldn't this need a copy?
      if req.thiefID == myWorker().left:
        ascertain: myWorker().leftIsWaiting
        myWorker().leftIsWaiting = false
      else:
        ascertain: myWorker().rightIsWaiting
        myWorker().rightIsWaiting = false
      Backoff:
        wakeup(req.thiefID)

      # Now we can dequeue as we found work
      # We cannot access the steal request anymore or
      # we would have a race with the child worker recycling it.
      discard myWorker().workSharingRequests.dequeue()
    else:
      break

shareWork will wake up a thread if distributeWork is successful. distributeWork will first look for plain tasks and then for the current splittable task. When evaluating the amount of work to send, since the child is sleeping, it takes into account the whole subtree and so only sends a small amount of tasks to the child itself if plenty of steal requests are pending.

Normally the remainder should be sent in handleThieves, but by then the child is awake, so that worker's subtree is not checked again and the tasks are never sent, leading to load imbalance.

Rewrite the state transitions as a Finite State Machine

While rewriting the proof-of-concept to separate the system into different components, it has become increasingly clear that each worker could be remodeled as a Finite State Machine (FSM), at least a Mealy machine that would be event-driven (receiving a task/steal request), and maybe a pushdown automaton with a memory of 1 parent state, as there is very strong commonality between:

  • scheduling loop
  • the barrier
  • forcing futures

weave/weave/scheduler.nim

Lines 86 to 142 in 0045f6c

proc schedulingLoop() =
  ## Each worker thread execute this loop over and over
  while not localCtx.signaledTerminate:
    # Global state is intentionally minimized,
    # It only contains the communication channels and read-only environment variables
    # There is still the global barrier to ensure the runtime starts or stops only
    # when all threads are ready.

    # 1. Private task deque
    debug: log("Worker %d: schedloop 1 - task from local deque\n", myID())
    while (let task = nextTask(childTask = false); not task.isNil):
      # Prio is: children, then thieves then us
      ascertain: not task.fn.isNil
      profile(run_task):
        run(task)
      profile(enq_deq_task):
        # The memory is reused but not zero-ed
        localCtx.taskCache.add(task)

    # 2. Run out-of-task, become a thief
    debug: log("Worker %d: schedloop 2 - becoming a thief\n", myID())
    trySteal(isOutOfTasks = true)
    ascertain: myThefts().outstanding > 0

    var task: Task
    profile(idle):
      while not recv(task, isOutOfTasks = true):
        ascertain: myWorker().deque.isEmpty()
        ascertain: myThefts().outstanding > 0
        declineAll()

    # 3. We stole some task(s)
    ascertain: not task.fn.isNil
    debug: log("Worker %d: schedloop 3 - stoled tasks\n", myID())

    let loot = task.batch
    if loot > 1:
      # Add everything
      myWorker().deque.addListFirst(task, loot)
      # And then only use the last
      task = myWorker().deque.popFirst()

    StealAdaptative:
      myThefts().recentThefts += 1

    # 4. Share loot with children
    debug: log("Worker %d: schedloop 4 - sharing work\n", myID())
    shareWork()

    # 5. Work on what is left
    debug: log("Worker %d: schedloop 5 - working on leftover\n", myID())
    profile(run_task):
      run(task)
    profile(enq_deq_task):
      # The memory is reused but not zero-ed
      localCtx.taskCache.add(task)

weave/weave/scheduler.nim

Lines 193 to 268 in 0045f6c

proc forceFuture*[T](fv: Flowvar[T], parentResult: var T) =
  ## Eagerly complete an awaited FlowVar
  let thisTask = myTask() # Only for ascertain

  block CompleteFuture:
    # Almost duplicate of schedulingLoop and sync() barrier
    if isFutReady():
      break CompleteFuture

    ## 1. Process all the children of the current tasks (and only them)
    debug: log("Worker %d: forcefut 1 - task from local deque\n", myID())
    while (let task = nextTask(childTask = true); not task.isNil):
      profile(run_task):
        run(task)
      profile(enq_deq_task):
        localCtx.taskCache.add(task)
      if isFutReady():
        break CompleteFuture
    # ascertain: myTask() == thisTask # need to be able to print tasks TODO

    # 2. Run out-of-task, become a thief and help other threads
    #    to reach children faster
    debug: log("Worker %d: forcefut 2 - becoming a thief\n", myID())
    while not isFutReady():
      trySteal(isOutOfTasks = false)
      var task: Task
      profile(idle):
        while not recv(task, isOutOfTasks = false):
          # We might inadvertently remove our own steal request in
          # dispatchTasks so resteal
          profile_stop(idle)
          trySteal(isOutOfTasks = false)
          # If someone wants our non-child tasks, let's oblige
          var req: StealRequest
          while recv(req):
            dispatchTasks(req)
          profile_start(idle)
          if isFutReady():
            profile_stop(idle)
            break CompleteFuture

      # 3. We stole some task(s)
      ascertain: not task.fn.isNil
      debug: log("Worker %d: forcefut 3 - stoled tasks\n", myID())

      let loot = task.batch
      if loot > 1:
        profile(enq_deq_task):
          # Add everything
          myWorker().deque.addListFirst(task, loot)
          # And then only use the last
          task = myWorker().deque.popFirst()

      StealAdaptative:
        myThefts().recentThefts += 1

      # Share loot with children workers
      debug: log("Worker %d: forcefut 4 - sharing work\n", myID())
      shareWork()

      # Run the rest
      profile(run_task):
        run(task)
      profile(enq_deq_task):
        # The memory is reused but not zero-ed
        localCtx.taskCache.add(task)

  LazyFV:
    # Cleanup the lazy flowvar if allocated or copy directly into result
    if not fv.lazyFV.hasChannel:
      ascertain: fv.lazyFV.isReady
      copyMem(parentResult.addr, fv.lazyFV.lazyChan.buf.addr, sizeof(parentResult))
    else:
      ascertain: not fv.lazyFV.lazyChan.chan.isNil
      fv.lazyFV.lazyChan.chan.delete()

weave/weave/runtime.nim

Lines 93 to 173 in 0045f6c

proc sync*(_: type Runtime) =
  ## Global barrier for the Picasso runtime
  ## This is only valid in the root task
  Worker: return

  debugTermination:
    log(">>> Worker %d enters barrier <<<\n", myID())

  preCondition: myTask().isRootTask()

  block EmptyLocalQueue:
    ## Empty all the tasks and before leaving the barrier
    while true:
      debug: log("Worker %d: globalsync 1 - task from local deque\n", myID())
      while (let task = nextTask(childTask = false); not task.isNil):
        # TODO: duplicate schedulingLoop
        profile(run_task):
          run(task)
        profile(enq_deq_task):
          # The memory is reused but not zero-ed
          localCtx.taskCache.add(task)

      if workforce() == 1:
        localCtx.runtimeIsQuiescent = true
        break EmptyLocalQueue

      if localCtx.runtimeIsQuiescent:
        break EmptyLocalQueue

      # 2. Run out-of-task, become a thief and help other threads
      #    to reach the barrier faster
      debug: log("Worker %d: globalsync 2 - becoming a thief\n", myID())
      trySteal(isOutOfTasks = true)
      ascertain: myThefts().outstanding > 0

      var task: Task
      profile(idle):
        while not recv(task, isOutOfTasks = true):
          ascertain: myWorker().deque.isEmpty()
          ascertain: myThefts().outstanding > 0
          declineAll()
          if localCtx.runtimeIsQuiescent:
            # Goto breaks profiling, but the runtime is still idle
            break EmptyLocalQueue

      # 3. We stole some task(s)
      debug: log("Worker %d: globalsync 3 - stoled tasks\n", myID())
      ascertain: not task.fn.isNil

      let loot = task.batch
      if loot > 1:
        profile(enq_deq_task):
          # Add everything
          myWorker().deque.addListFirst(task, loot)
          # And then only use the last
          task = myWorker().deque.popFirst()

      StealAdaptative:
        myThefts().recentThefts += 1

      # 4. Share loot with children
      debug: log("Worker %d: globalsync 4 - sharing work\n", myID())
      shareWork()

      # 5. Work on what is left
      debug: log("Worker %d: globalsync 5 - working on leftover\n", myID())
      profile(run_task):
        run(task)
      profile(enq_deq_task):
        # The memory is reused but not zero-ed
        localCtx.taskCache.add(task)

      # Restart the loop

  # Execution continues but the runtime is quiescent until new tasks
  # are created
  postCondition: localCtx.runtimeIsQuiescent

  debugTermination:
    log(">>> Worker %d leaves barrier <<<\n", myID())

Furthermore, assuming we implement a declarative DSL to generate a worker state machine, the program should be easier to extend and maintain, and control flow should be easier to follow as well.

This relies on #3.

An additional benefit is better formal verification, beyond what just using Message Passing / Communicating Sequential Processes brings.

FSMs have been used extensively on embedded devices and lots of tooling is available to analyze them. Also, there is apparently a whole branch of research on Communicating Finite State Machines that analyzes and proves properties of FSMs communicating over channels.

Low-level wise, as this would probably be a very hot path and the state->transition of an FSM actually maps semantically to gotos, dispatch could use the hidden {.goto.} pragma. Computed gotos are probably not needed because we don't need to "compute" the goto from data the way a VM computes the goto from the bytecode.
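
As a rough illustration of the dispatch such a DSL could generate (the state and event names below are made up for the sketch, not Weave's actual states):

type
  WorkerState = enum
    Working, Stealing, Sharing, Terminating
  WorkerEvent = enum
    evTaskFound, evOutOfTasks, evStealRequest, evBarrier, evTerminate

func step(state: WorkerState, ev: WorkerEvent): WorkerState =
  ## One transition of a worker modeled as a Mealy machine.
  ## A macro would emit this dispatch (possibly with {.goto.})
  ## from a declarative transition table.
  if ev == evTerminate:
    return Terminating
  result = state
  case state
  of Working:
    if ev == evOutOfTasks: result = Stealing
    elif ev == evStealRequest: result = Sharing
  of Stealing:
    if ev == evTaskFound: result = Working
  of Sharing:
    if ev == evOutOfTasks: result = Stealing
    elif ev == evTaskFound: result = Working
  of Terminating:
    discard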

Get Thread ID in a portable way

Currently the threadID is leaking into all datastructures.

It seems like "obvious facilities" do not fit:

  • pthread_self returns a pthread handle
  • gettid is Linux only
  • syscall(__NR_gettid) is Linux only (and involves a syscall)
  • pthread_getthreadid_np is BSD only
  • getThreadID is Windows-only

This is necessary to decouple the memory management and allow upstreaming it.
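
One portable alternative (a sketch, not what Weave currently does) is to hand out small dense IDs from a process-wide atomic counter and cache them in a threadvar:

import std/atomics

var nextThreadID: Atomic[int32]                 # process-wide, starts at 0
var cachedThreadID {.threadvar.}: int32
var threadIDAssigned {.threadvar.}: bool

proc portableThreadID*(): int32 =
  ## Lazily assigns a small, dense, portable ID to the calling thread.
  ## Works on any platform with threads, at the cost of one branch per call.
  if not threadIDAssigned:
    cachedThreadID = nextThreadID.fetchAdd(1, moRelaxed)
    threadIDAssigned = true
  cachedThreadID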

LookAside list:

# Stack pointer
top: T
# Mempool doesn't provide the proper free yet
freeFn*: proc(threadID: int32, t: T) {.nimcall, gcsafe.}
threadID*: int32 # TODO, memory pool abstraction leaking
# Adaptative freeing
count: int
recentAsk: int
# "closure" - This points to the proc + env currently registered in the allocator
# It is nil-ed on destruction of the lookaside list.
#
registeredAt: ptr tuple[onHeartbeat: proc(env: pointer) {.nimcall.}, env: pointer]

Mempool free requires the caller to supply its ThreadID

proc recycle*[T](myThreadID: int32, p: ptr T) {.gcsafe.} =
  ## Returns a memory block to its memory pool.
  ##
  ## This is thread-safe, any thread can call it.
  ## It must indicate its ID.
  ## A fast path is used if it's the ID of the borrowing thread,
  ## otherwise a slow path will be used.
  ##
  ## If the thread owning the pool was exited before this
  ## block was returned, the main thread should now
  ## have ownership of the related arenas and can deallocate them.

  # TODO: sink ptr T - parsing bug to raise
  #       similar to https://github.com/nim-lang/Nim/issues/12091
  preCondition: not p.isNil

  let p = cast[ptr MemBlock](p)

  # Find the owning arena
  let arena = p.getArena()

  if myThreadID == arena.meta.threadID.load(moRelaxed):
    # thread-local free
    if arena.meta.localFree.isNil:
      p.next.store(nil, moRelaxed)
      arena.meta.localFree = p
    else:
      arena.meta.localFree.prepend(p)
    arena.meta.used -= 1
    if unlikely(arena.isUnused()):
      # If an arena is unused, we can try releasing it immediately
      arena.allocator[].considerRelease(arena)
  else:
    # remote arena
    let remoteRecycled = arena.meta.remoteFree.trySend(p)
    postCondition: remoteRecycled

References
https://docs.microsoft.com/en-us/windows/win32/api/processthreadsapi/nf-processthreadsapi-getthreadid
https://stackoverflow.com/questions/21091000/how-to-get-thread-id-of-a-pthread-in-linux-c-program

Backoff deadlock with low number of threads

There seems to be a race condition here:

weave/weave/thieves.nim

Lines 187 to 204 in 42b0f80

proc lastStealAttemptFailure*(req: sink StealRequest) =
  ## If it's the last theft attempt per emitted steal requests
  ## - if we are the lead thread, we know that every other threads are idle/waiting for work
  ##   but there is none --> termination
  ## - if we are a worker thread, we message our parent and
  ##   passively wait for it to send us work or tell us to shutdown.
  if myID() == LeaderID:
    detectTermination()
    forget(req)
  else:
    req.state = Waiting
    debugTermination:
      log("Worker %2d: sends state passively WAITING to its parent worker %d\n", myID(), myWorker().parent)
    sendShare(req)
    ascertain: not myWorker().isWaiting
    myWorker().isWaiting = true
    myParking().wait() # Thread is blocked here until woken up.

The child sends the steal request and then goes idle.

But what if the parent checks the steal request and sends a wake-up signal to the child before it has time to actually go to sleep, for example to signal termination:

weave/weave/signals.nim

Lines 42 to 56 in 42b0f80

proc signalTerminate*(_: pointer) =
  preCondition: not localCtx.signaledTerminate

  # 1. Terminating means everyone ran out of tasks
  #    so their cache for task channels should be full
  #    if there were sufficiently more tasks than workers
  # 2. Since they have an unique parent, no one else sent them a signal (checked in asyncSignal)
  if myWorker().left != Not_a_worker:
    # Send the terminate signal
    asyncSignal(signalTerminate, globalCtx.com.tasks[myWorker().left].access(0))
    # Wake the worker up so that it can process the terminate signal
    wakeup(myWorker().left)
  if myWorker().right != Not_a_worker:
    asyncSignal(signalTerminate, globalCtx.com.tasks[myWorker().right].access(0))
    wakeup(myWorker().right)

The parent then exits and is deadlocked at the exit barrier, while the child worker is deadlocked sleeping forever.

Implement the parallel loop API and logic

The API will stay WIP; ideally we can replicate the Nim OpenMP API.

However this requires capturing the environment from a macro; apparently there is a magic called liftLocal that may be able to do that. Alternatively we could use the "owner" macro, but that would require some acrobatics as it can only be used on a proc, so we would need to pack everything into a proc and then call "owner".

In the meantime we can implement the following API, which requires the developer to wrap their code in a proc that begins with a magic forEach template:

proc display_range() =
  forEach(i):
    log("%d (thread %d)\n", i, ID)
  log("Thread %d - SUCCESS\n", ID)

proc main() =
  tasking_init()
  async_for 0..100, display_range()
  tasking_barrier()
  tasking_exit()

main()

Evaluate if StealRequest could be pointer object

Having steal requests as pointer objects with ownership passed over channels, instead of deep copies, might reduce the overhead of the framework:

  • Lock-free list-based MPSC queues have a wealth of literature; this would avoid having to debug #6 (though lock-free debugging is hard as well)
  • A bitset for 256 victims takes 32 bytes; with the rest of the data, the StealRequest size becomes quite big.
  • We could use a more efficient set suitable for random picking, but that would use much more space (513 bytes instead of 32 for 256 max workers, see #5 (comment))

One useful property is that steal request destruction only happens in the creating worker, and since each steal request maps to a unique task channel, we would know, when the task is sent back, which steal request to destroy. Workers can only dispose of their own steal requests.

This brings cheaper copies in the MPSC channel and an easier lock-free implementation, at the cost of a guaranteed cache miss when reading the steal request and potentially increased latency on NUMA.
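
For illustration only (the field names below are made up for the sketch, not Weave's actual layout), the pointer-based variant could look like:

import std/atomics

type
  StealRequestObj = object
    next: Atomic[pointer]      # intrusive link for a list-based lockless MPSC queue
    thiefID: int32             # who to send the task(s) back to
    retry: int32
    state: uint8               # Working / Stealing / Waiting
    victims: array[32, byte]   # bitset for up to 256 potential victims
  StealRequest = ptr StealRequestObj

The intrusive next field means the same allocation serves both as the message and as the queue link, so sending a request is a pointer exchange rather than a copy of the whole object.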

BLAS matmul vs Laser, Intel MKL, MKL-DNN, OpenBLAS and co

Somewhat related to #31 and very probably related to #35

By giving Laser (OpenMP-based) the time to get the CPU to full speed, it can reach 2.23 TFlops on my CPU (see below; note that nimsuggest was running at 100% during the measurement).

Backend:                        Laser (Pure Nim) + OpenMP
Type:                           float32
A matrix shape:                 (M: 1920, N: 1920)
B matrix shape:                 (M: 1920, N: 1920)
Output shape:                   (M: 1920, N: 1920)
Required number of operations: 14155.776 millions
Required bytes:                   29.491 MB
Arithmetic intensity:            480.000 FLOP/byte
Theoretical peak single-core:    224.000 GFLOP/s
Theoretical peak multi:         4032.000 GFLOP/s

Laser production implementation
Collected 300 samples in 2245 ms
Average time: 6.327 ms
Stddev  time: 2.929 ms
Min     time: 5.000 ms
Max     time: 48.000 ms
Perf:         2237.478 GFLOP/s

Intel MKL is at 2.85 TFlops

Backend:                        Intel MKL
Type:                           float32
A matrix shape:                 (M: 1920, N: 1920)
B matrix shape:                 (M: 1920, N: 1920)
Output shape:                   (M: 1920, N: 1920)
Required number of operations: 14155.776 millions
Required bytes:                   29.491 MB
Arithmetic intensity:            480.000 FLOP/byte
Theoretical peak single-core:    224.000 GFLOP/s
Theoretical peak multi:         4032.000 GFLOP/s

Intel MKL benchmark
Collected 300 samples in 1842 ms
Average time: 4.970 ms
Stddev  time: 3.956 ms
Min     time: 4.000 ms
Max     time: 70.000 ms
Perf:         2848.245 GFLOP/s

OpenBLAS is at 1.52 TFlops

Backend:                        OpenBLAS
Type:                           float32
A matrix shape:                 (M: 1920, N: 1920)
B matrix shape:                 (M: 1920, N: 1920)
Output shape:                   (M: 1920, N: 1920)
Required number of operations: 14155.776 millions
Required bytes:                   29.491 MB
Arithmetic intensity:            480.000 FLOP/byte
Theoretical peak single-core:    224.000 GFLOP/s
Theoretical peak multi:         4032.000 GFLOP/s

OpenBLAS benchmark
Collected 300 samples in 3124 ms
Average time: 9.300 ms
Stddev  time: 2.637 ms
Min     time: 9.000 ms
Max     time: 40.000 ms
Perf:         1522.126 GFLOP/s

Unfortunately Weave is stuck at 0.6 TFlops.

Backend:                        Weave (Pure Nim)
Type:                           float32
A matrix shape:                 (M: 1920, N: 1920)
B matrix shape:                 (M: 1920, N: 1920)
Output shape:                   (M: 1920, N: 1920)
Required number of operations: 14155.776 millions
Required bytes:                   29.491 MB
Arithmetic intensity:            480.000 FLOP/byte
Theoretical peak single-core:    224.000 GFLOP/s
Theoretical peak multi:         4032.000 GFLOP/s

Weave implementation
Collected 300 samples in 7686 ms
Average time: 24.553 ms
Stddev  time: 4.426 ms
Min     time: 17.000 ms
Max     time: 60.000 ms
Perf:         576.532 GFLOP/s

First look

(profiler screenshot omitted)

On the right, more than half of the instructions for Weave are not floating-point instructions.
Also it seems like the CPU frequency is at 2.7GHz in the case of Weave while Laser reaches 3.6GHz. My BIOS settings are 4.1GHz all-core turbo for normal code, 4.0GHz for AVX2 and 3.5GHz for AVX512 code.

Call stack

(call-stack screenshot omitted)

Looking into the call stack, it seems like we are paying the price of dynamic scheduling :/ as the time spent in the GEMM kernel itself is similar.

Conclusion

We have very high overhead for coarse-grained parallelism where OpenMP shines. I suspect this is also the reason behind #35.

Suspicions

Additionally, I suspect the runtime is polluting the L1 cache with runtime data while GEMM has been carefully tuned to take advantage of the L1 and L2 caches. Using smaller tiles might help (similarly, avoiding hyperthreading, which has the same polluting effect, would probably help as well).
