Weave, a state-of-the-art multithreading runtime


"Good artists borrow, great artists steal." -- Pablo Picasso

Weave (codenamed "Project Picasso") is a multithreading runtime for the Nim programming language.

It is continuously tested on Linux, MacOS and Windows for the following CPU architectures: x86, x86_64 and ARM64 with the C and C++ backends.

Weave aims to provide a composable, high-performance, ultra-low overhead and fine-grained parallel runtime that frees developers from the common worries of "are my tasks big enough to be parallelized?", "what should be my grain size?", "what if the time they take is completely unknown or different?" or "is parallel-for worth it if it's just a matrix addition? On what CPUs? What if it's exponentiation?".

Thorough benchmarks track Weave's performance against industry-standard runtimes in C/C++/Cilk on both task parallelism and data parallelism, with a variety of workloads:

  • Compute-bound
  • Memory-bound
  • Load Balancing
  • Runtime-overhead bound (i.e. trillions of tasks in a couple milliseconds)
  • Nested parallelism

Benchmarks are drawn from recursive tree algorithms, finance, linear algebra, High Performance Computing and game simulations. In particular, Weave displays 3x to 10x less overhead than Intel TBB and GCC OpenMP on overhead-bound benchmarks.

At the implementation level, Weave's unique feature is that it is based on message passing instead of traditional work-stealing with shared-memory deques.

⚠️ Disclaimer:

Only 1 out of 2 complex synchronization primitives was formally verified to be deadlock-free. They were not submitted to an additional data race detection tool to ensure proper implementation.

Furthermore, worker threads are state machines and were not formally verified either.

Weave does limit synchronization to simple SPSC and MPSC channels only, which greatly reduces the potential bug surface.

Installation

Weave can be simply installed with

nimble install weave

or for the devel version

nimble install weave@#master

Weave requires at least Nim v1.2.0

Changelog

The latest changes are available in the changelog.md file.

Demos

A raytracing demo is available, head over to demos/raytracing.


API

Task parallelism

Weave provides a simple API based on spawn/sync which works like async/await for IO-based futures.

The traditional parallel recursive Fibonacci would be written like this:

import weave

proc fib(n: int): int =
  # int64 on x86-64
  if n < 2:
    return n

  let x = spawn fib(n-1)
  let y = fib(n-2)

  result = sync(x) + y

proc main() =
  var n = 20

  init(Weave)
  let f = fib(n)
  exit(Weave)

  echo f

main()

Data parallelism

Weave provides nestable parallel for loops.

A nested matrix transposition would be written like this:

import weave

func initialize(buffer: ptr UncheckedArray[float32], len: int) =
  for i in 0 ..< len:
    buffer[i] = i.float32

proc transpose(M, N: int, bufIn, bufOut: ptr UncheckedArray[float32]) =
  ## Transpose a MxN matrix into a NxM matrix with nested for loops

  parallelFor j in 0 ..< N:
    captures: {M, N, bufIn, bufOut}
    parallelFor i in 0 ..< M:
      captures: {j, M, N, bufIn, bufOut}
      bufOut[j*M+i] = bufIn[i*N+j]

proc main() =
  let M = 200
  let N = 2000

  let input = newSeq[float32](M*N)
  # We can't work with seq directly as it's managed by GC, take a ptr to the buffer.
  let bufIn = cast[ptr UncheckedArray[float32]](input[0].unsafeAddr)
  bufIn.initialize(M*N)

  var output = newSeq[float32](N*M)
  let bufOut = cast[ptr UncheckedArray[float32]](output[0].addr)

  init(Weave)
  transpose(M, N, bufIn, bufOut)
  exit(Weave)

main()

Strided loops

You might want to use loops with a non-unit stride. This can be done with the following syntax:

import weave

init(Weave)

# expandMacros:
parallelForStrided i in 0 ..< 100, stride = 30:
  parallelForStrided j in 0 ..< 200, stride = 60:
    captures: {i}
    log("Matrix[%d, %d] (thread %d)\n", i, j, myID())

exit(Weave)

Complete list

We separate the list depending on the threading context

Root thread

The root thread is the thread that started the Weave runtime. It has special privileges.

  • init(Weave), exit(Weave) to start and stop the runtime. Forgetting this will give you nil pointer exceptions on spawn.
    The thread that calls init will become the root thread.
  • syncRoot(Weave) is a global barrier. The root thread will not continue beyond this point until all tasks in the runtime are finished.
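
As a minimal sketch (not taken from the Weave documentation) combining these root-thread calls with a made-up task:

import weave

proc hello(i: int) =
  echo "task ", i

proc main() =
  init(Weave)        # this thread becomes the root thread
  for i in 0 ..< 4:
    spawn hello(i)   # spawning before init(Weave) would crash with a nil pointer
  syncRoot(Weave)    # global barrier: wait until all tasks are finished
  exit(Weave)        # stop the runtime

main()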

Weave worker thread

A worker thread is automatically created per (logical) core on the machine. The root thread is also a worker thread. Worker threads are tuned to maximize throughput of computational tasks.

  • spawn fnCall(args) which spawns a function that may run on another thread and gives you an awaitable Flowvar handle.

  • newFlowEvent, trigger, spawnOnEvent and spawnOnEvents (experimental) to delay a task until some dependencies are met. This allows expressing precise data dependencies and producer-consumer relationships.

  • sync(Flowvar) will await a Flowvar and block until you receive a result.

  • isReady(Flowvar) will check if sync will actually block or return the result immediately.

  • syncScope is a scope barrier. The thread will not move beyond the scope until all tasks and parallel loops spawned and their descendants are finished. syncScope is composable, it can be called by any thread, it can be nested. It has the syntax of a block statement:

    syncScope():
      parallelFor i in 0 ..< N:
        captures: {a, b}
        parallelFor j in 0 ..< N:
          captures: {i, a, b}
      spawn foo()

    In this example, the thread encountering syncScope will create all the tasks for parallel loop i, will spawn foo() and then wait at the end of the scope. A thread blocked at the end of its scope is not idle: it still helps process all existing work and any work created by the current tasks.

  • parallelFor, parallelForStrided, parallelForStaged, parallelForStagedStrided are described above and in the experimental section.

  • loadBalance(Weave) gives the runtime the opportunity to distribute work. Insert this within long computations as, due to Weave's design, it's the busy workers that are also in charge of load balancing. This is done automatically when using parallelFor.

  • isSpawned(Flowvar) allows you to build speculative algorithms where a task is spawned only if certain conditions are valid. See the nqueens benchmark for an example.

  • getThreadId(Weave) returns a unique thread ID. The thread ID is in the range 0 ..< number of threads.

The max number of worker threads can be configured with the environment variable WEAVE_NUM_THREADS and defaults to your number of logical cores (including HyperThreading). Weave uses Nim's countProcessors() from std/cpuinfo.
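
For illustration, here is a small sketch (not from the Weave repository) combining a few of the worker-thread calls listed above:

import weave

proc sumRange(a, b: int): int =
  ## A deliberately long sequential computation that periodically
  ## gives the runtime a chance to distribute pending work.
  for i in a ..< b:
    result += i
    if (i and 0xFFFF) == 0:
      loadBalance(Weave)   # answer pending steal requests; does not block

proc main() =
  init(Weave)
  let half1 = spawn sumRange(0, 50_000_000)
  let half2 = sumRange(50_000_000, 100_000_000)
  if not half1.isReady():
    echo "thread ", getThreadId(Weave), " is still waiting for the other half"
  echo sync(half1) + half2
  exit(Weave)

main()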

Foreign thread & Background service (experimental)

Weave can also be run as a background service and process jobs similar to the Executor concept in C++. Jobs will be processed in FIFO order.

Experimental: The distinction between spawn/sync on a Weave thread and submit/waitFor on a foreign thread may be removed in the future.

A background service can be started with either:

  • thr.runInBackground(Weave)
  • or thr.runInBackground(Weave, signalShutdown: ptr Atomic[bool])

with thr an uninitialized Thread[void] or Thread[ptr Atomic[bool]]

Then the foreign thread should call:

  • setupSubmitterThread(Weave): Configure a thread so that it can send jobs to a background Weave service.
  • waitUntilReady(Weave): Block the foreign thread until the Weave runtime is ready to accept jobs.

and for shutdown:

  • teardownSubmitterThread(Weave): Cleanup Weave resources allocated on the thread.

Once setup, a foreign thread can submit jobs via:

  • submit fnCall(args) which submits a function to the Weave runtime and gives you an awaitable Pending handle.
  • newFlowEvent, trigger, submitOnEvent and submitOnEvents (experimental) to delay a task until some dependencies are met. This allows expressing precise data dependencies and producer-consumer relationships.
  • waitFor(Pending) which awaits a Pending job's result and blocks the current thread until it is available.
  • isReady(Pending) will check if waitFor will actually block or return the result immediately.
  • isSubmitted(job) allows you to build speculative algorithms where a job is submitted only if certain conditions are valid.

Within a job, tasks can be spawned and parallel for constructs can be used.
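
As a hedged sketch of a foreign thread driving a background Weave service with the calls documented above (compute is a placeholder job; this is not lifted from the Weave test suite):

import weave
import std/atomics

proc compute(x: int): int = x * x   # placeholder job

proc main() =
  var shutdown: Atomic[bool]
  shutdown.store(false)

  # Start the Weave runtime as a background service.
  var thr: Thread[ptr Atomic[bool]]
  thr.runInBackground(Weave, shutdown.addr)

  # The current (foreign) thread becomes a submitter.
  setupSubmitterThread(Weave)
  waitUntilReady(Weave)

  let job = submit compute(7)
  echo waitFor(job)                 # 49

  # Shutdown: release submitter resources, signal the service and join it.
  teardownSubmitterThread(Weave)
  shutdown.store(true)
  thr.joinThread()

main()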

If runInBackground() does not provide fine enough control, a Weave background event loop can be customized using the following primitives:

  • at a very low-level:
    • The root thread primitives: init(Weave) and exit(Weave)
    • processAllandTryPark(Weave): Process all pending jobs and try sleeping. The sleep may fail to avoid deadlocks if a job is submitted concurrently. This should be used in a while true event loop.
  • at a medium level:
    • runForever(Weave): Start a never-ending event loop that processes all pending jobs and sleep until new work arrives.
    • runUntil(Weave, signalShutdown: ptr Atomic[bool]): Start an event-loop that quits on signal.

For example:

proc runUntil*(_: typedesc[Weave], signal: ptr Atomic[bool]) =
  ## Start a Weave event loop until signal is true on the current thread.
  ## It wakes-up on job submission, handles multithreaded load balancing,
  ## help process tasks
  ## and spin down when there is no work anymore.
  preCondition: not signal.isNil
  while not signal[].load(moRelaxed):
    processAllandTryPark(Weave)
  syncRoot(Weave)

proc runInBackground*(
       _: typedesc[Weave],
       signalShutdown: ptr Atomic[bool]
     ): Thread[ptr Atomic[bool]] =
  ## Start the Weave runtime on a background thread.
  ## It wakes-up on job submissions, handles multithreaded load balancing,
  ## help process tasks
  ## and spin down when there is no work anymore.
  proc eventLoop(shutdown: ptr Atomic[bool]) {.thread.} =
    init(Weave)
    Weave.runUntil(shutdown)
    exit(Weave)
  result.createThread(eventLoop, signalShutdown)

Platforms supported

Weave supports all platforms with pthread and Windows. Missing pthread functionality may be emulated or unused. For example on MacOS, the pthread implementation does not expose barrier functionality or affinity settings.

C++ compilation

The syncScope feature will not compile correctly in C++ mode if it is used in a for loop. Upstream: nim-lang/Nim#14118

Windows 32-bit

Windows 32-bit targets cannot use the MinGW compiler as it is missing support for EnterSynchronizationBarrier. MSVC should work instead.

Resource-restricted devices

Weave uses a flexible and efficient memory subsystem that has been optimized for a wide range of hardware: low-power Raspberry Pis, phones, laptops, desktops and 30+ core workstations. It currently assumes by default that at least 16KB are available on your hardware for a memory pool and that this memory pool can grow as needed. This can be tuned with -d:WV_MemArenaSize=2048 to have the base pool use 2KB, for example. The pool size should be a multiple of 256 bytes. PRs to improve support of very restricted devices are welcome.

Backoff mechanism

A Backoff mechanism is enabled by default. It allows workers with no tasks to sleep instead of spinning aimlessly and burning CPU cycles.

It can be disabled with -d:WV_Backoff=off.

Weave using all CPUs

Weave multithreading is cooperative: idle threads send steal requests instead of actively stealing from other workers' queues. This is called "work-requesting" in the literature, as opposed to "work-stealing".

This means that a thread sleeping or stuck in a long computation may starve other threads, which will spin, burning CPU cycles.

  • Don't sleep or block a thread, as this blocks Weave's scheduler. This is similar to async/await libraries.
  • If you really need to sleep or block the root thread, make sure to empty all the tasks beforehand with syncRoot(Weave) in the root thread. The child threads will be put to sleep until new tasks are spawned.
  • The loadBalance(Weave) call can be used in the middle of heavy computations to force the worker to answer steal requests. This is done automatically in parallelFor loops. loadBalance(Weave) is a very fast call that makes a worker thread check its queue and dispatch its pending tasks to others. It does not block.

We call the root thread the thread that called init(Weave).

Experimental features

Experimental features might see API and/or implementation changes.

For example, both parallelForStaged and parallelReduce allow reductions, but parallelForStaged is more flexible; it however requires explicit use of locks and/or atomics.

LazyFlowvars may be enabled by default for certain sizes, or if escape analysis becomes possible, or if we prevent Flowvars from escaping their scope.

Data parallelism (experimental features)

Awaitable loop

Loops can be awaited. Awaitable loops return a normal Flowvar.

This blocks the thread that spawned the parallel loop from continuing until the loop is resolved. The thread does not stay idle and will steal and run other tasks while being blocked.

Calling sync on the awaitable loop Flowvar will return true for the last thread to exit the loop and false for the others.

  • Due to dynamic load-balancing, an unknown amount of threads will execute the loop.
  • It's the thread that spawned the loop task that will always be the last thread to exit. The false value is only internal to Weave.

⚠️ This is not a barrier: if that loop spawns tasks (including via a nested loop) and exits, the thread will continue, it will not wait for the grandchildren tasks to be finished. Use a syncScope section to wait on all tasks and descendants including grandchildren.

import weave

init(Weave)

# expandMacros:
parallelFor i in 0 ..< 10:
  awaitable: iLoop
  echo "iteration: ", i

let wasLastThread = sync(iLoop)
echo wasLastThread

exit(Weave)

Parallel For Staged

Weave provides a parallelForStaged construct with support for a thread-local prologue and epilogue.

A parallel sum would look like this:

import weave

proc sumReduce(n: int): int =
  let res = result.addr # For mutation we need to capture the address.

  parallelForStaged i in 0 .. n:
    captures: {res}
    awaitable: iLoop
    prologue:
      var localSum = 0
    loop:
      localSum += i
    epilogue:
      echo "Thread ", getThreadID(Weave), ": localsum = ", localSum
      res[].atomicInc(localSum)

  let wasLastThread = sync(iLoop)

init(Weave)
let sum1M = sumReduce(1000000)
echo "Sum reduce(0..1000000): ", sum1M
doAssert sum1M == 500_000_500_000
exit(Weave)

parallelForStagedStrided is also provided.

Parallel Reduction

Weave provides a parallel reduction construct that avoids having to use explicit synchronization like atomics or locks, and instead uses Weave's sync(Flowvar) under the hood.

Syntax is the following:

import weave

proc sumReduce(n: int): int =
  var waitableSum: Flowvar[int]

  # expandMacros:
  parallelReduceImpl i in 0 .. n, stride = 1:
    reduce(waitableSum):
      prologue:
        var localSum = 0
      fold:
        localSum += i
      merge(remoteSum):
        localSum += sync(remoteSum)
      return localSum

  result = sync(waitableSum)

init(Weave)
let sum1M = sumReduce(1000000)
echo "Sum reduce(0..1000000): ", sum1M
doAssert sum1M == 500_000_500_000
exit(Weave)

In the future, waitableSum will probably not be required to be declared beforehand. Or parallel reduce might be removed to keep only parallelForStaged.

Dataflow parallelism

Dataflow parallelism allows expressing fine-grained data dependencies between tasks. Concretely a task is delayed until all its dependencies are met and once met, it is triggered immediately.

This allows precise specification of data producer-consumer relationships.

In contrast, classic task parallelism can only express control-flow dependencies (i.e. parent-child function calls relationships) and classic tasks are eagerly scheduled.

In the literature, it is also called:

  • Stream parallelism
  • Pipeline parallelism
  • Graph parallelism
  • Data-driven task parallelism

Tagged experimental as the API and its implementation are unique compared to other libraries/language-extensions. Feedback welcome.

No specific ordering is required between calling the event producer and its consumer(s).

Dependencies are expressed by a handle called FlowEvent. A flow event can express either a single dependency, initialized with newFlowEvent(), or dependencies on parallel-for loop iterations, initialized with newFlowEvent(start, exclusiveStop, stride).

To await a single event, pass it to spawnOnEvent or the parallelFor invocation. To await an iteration, pass a tuple:

  • (FlowEvent, 0) to await precisely and only iteration 0. This works with both spawnOnEvent and parallelFor (via a dependsOnEvent statement)
  • (FlowEvent, loop_index_variable) to await a whole iteration range. For example
    parallelFor i in 0 ..< n:
      dependsOnEvent: (e, i) # Each "i" will independently depend on its matching event
      body
    This only works with parallelFor. The FlowEvent iteration domain and the parallelFor domain must be the same. As soon as a subset of the event's iterations is triggered, the corresponding parallelFor tasks will be scheduled.

Delayed computation with single dependencies

import weave
import std/os # for sleep

proc echoA(eA: FlowEvent) =
  echo "Display A, sleep 1s, create parallel streams 1 and 2"
  sleep(1000)
  eA.trigger()

proc echoB1(eB1: FlowEvent) =
  echo "Display B1, sleep 1s"
  sleep(1000)
  eB1.trigger()

proc echoB2() =
  echo "Display B2, exit stream"

proc echoC1() =
  echo "Display C1, exit stream"

proc main() =
  echo "Dataflow parallelism with single dependency"
  init(Weave)
  let eA = newFlowEvent()
  let eB1 = newFlowEvent()
  spawnOnEvent eB1, echoC1()
  spawnOnEvent eA, echoB2()
  spawnOnEvent eA, echoB1(eB1)
  spawn echoA(eA)
  exit(Weave)

main()

Delayed computation with multiple dependencies

import weave
import std/os # for sleep

proc echoA(eA: FlowEvent) =
  echo "Display A, sleep 1s, create parallel streams 1 and 2"
  sleep(1000)
  eA.trigger()

proc echoB1(eB1: FlowEvent) =
  echo "Display B1, sleep 1s"
  sleep(1000)
  eB1.trigger()

proc echoB2(eB2: FlowEvent) =
  echo "Display B2, no sleep"
  eB2.trigger()

proc echoC12() =
  echo "Display C12, exit stream"

proc main() =
  echo "Dataflow parallelism with multiple dependencies"
  init(Weave)
  let eA = newFlowEvent()
  let eB1 = newFlowEvent()
  let eB2 = newFlowEvent()
  spawnOnEvents eB1, eB2, echoC12()
  spawnOnEvent eA, echoB2(eB2)
  spawnOnEvent eA, echoB1(eB1)
  spawn echoA(eA)
  exit(Weave)

main()

Delayed loop computation

You can combine data parallelism and dataflow parallelism.

Currently parallel loops only support one dependency (single, fixed iteration or range iteration).

Here is an example with a range-iteration dependency. Note: when sleeping, threads are unresponsive, meaning a sleeping thread cannot schedule other ready tasks.

import weave
import std/os # for sleep

proc main() =
  init(Weave)

  let eA = newFlowEvent(0, 10, 1)
  let pB = newFlowEvent(0, 10, 1)

  parallelFor i in 0 ..< 10:
    captures: {eA}
    sleep(i * 10)
    eA.trigger(i)
    echo "Step A - stream ", i, " at ", i * 10, " ms"

  parallelFor i in 0 ..< 10:
    dependsOn: (eA, i)
    captures: {pB}
    sleep(i * 10)
    pB.trigger(i)
    echo "Step B - stream ", i, " at ", 2 * i * 10, " ms"

  parallelFor i in 0 ..< 10:
    dependsOn: (pB, i)
    sleep(i * 10)
    echo "Step C - stream ", i, " at ", 3 * i * 10, " ms"

  exit(Weave)

main()

Lazy Allocation of Flowvars

Flowvars can be lazily allocated. This reduces overhead by at least 2x on very fine-grained tasks like Fibonacci or depth-first search that may spawn trillions of tasks in less than a couple hundred milliseconds. This can be enabled with -d:WV_LazyFlowvar.

⚠️ This only works for Flowvars of a size up to your machine word size (int64, float64, pointer on 64-bit machines).
⚠️ Flowvars cannot be returned in that mode; you will at best trigger stack-smashing protection or crash.

Limitations

Weave has not been tested with GC-ed types. Pass a pointer around or use Nim channels which are GC-aware. If it works, a heads-up would be valuable.

This might improve with Nim ARC/newruntime.

Statistics

Curious minds can access the low-level runtime statistics with the flag -d:WV_metrics, which will give you information on the number of tasks executed, steal requests sent, etc.

Very curious minds can also enable high resolution timers with -d:WV_metrics -d:WV_profile -d:CpuFreqMhz=3000 assuming you have a 3GHz CPU.

The timers will give you in this order:

Time spent running tasks, Time spent recv/send steal requests, Time spent recv/send tasks, Time spent caching tasks, Time spent idle, Total

Tuning

A number of configuration options are available in weave/config.nim.

In particular:

  • -d:WV_StealAdaptativeInterval=25 defines the number of steal requests after which thieves reevaluate their steal strategy (steal one task or steal half the victim's tasks). Default: 25
  • -d:WV_StealEarly=0 allows workers to steal early, when only WV_StealEarly tasks are left in their queue. Default: don't steal early
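
For example, a hypothetical compile command combining some of the flags mentioned in this README (the program name is a placeholder):

nim c -d:danger --threads:on -d:WV_StealEarly=2 -d:WV_metrics myprogram.nim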

Unique features

Weave provides a unique scheduler with the following properties:

  • Message-passing based: unlike alternative work-stealing schedulers, this means that Weave is usable on any architecture where message queues, channels or locks are available, not only atomics. Architectures without atomics include distributed clusters and non-cache-coherent processors like the Cell Broadband Engine (of the PS3) that favors Direct Memory Access (DMA), the many-core mesh Tile CPU from Mellanox (EzChip/Tilera) with 64 to 100 ARM cores, the network-on-chip (NOC) CPU Epiphany V from Adapteva with 1024 cores, or the research CPU Intel SCC.
  • Scalable: as the number of cores in computers grows steadily, developers need to find new avenues of parallelism to exploit them. Unfortunately, existing frameworks require computations to take at least 10000 cycles (Intel TBB), which corresponds to 3.33 µs on a 3 GHz CPU, to amortize the cost of scheduling. This burdens developers with questions of grain size, heuristics on distributing parallel loops for the common case, and mis-scheduling on recursive tree algorithms with potentially very low compute-intensive leaves.
    • Weave uses an adaptative work-stealing scheduler that adapts its stealing strategy depending on each core's load and the intensity of tasks. Small tasks will be packaged into chunks to amortize scheduling overhead.
    • Weave also uses an adaptative lazy loop-splitting strategy. Loops will only be split when needed. There is no partitioning or grain-size issue, and no need to estimate whether the workload is memory-bound or compute-bound; see PyTorch's OpenMP woes on parallel map.
    • Weave aims at efficient multicore scaling for very fine-grained tasks, starting from the 2000-cycle range upward (0.67 µs at 3 GHz).
  • Fast and low-overhead: while the number of cores has been growing steadily, many programs are now hitting the limit of memory bandwidth and require tuning allocators, cache lines and CPU caches. Enormous care has been given to optimizing Weave to keep it very low-overhead. Weave uses efficient memory allocation and caches to avoid stressing the system allocator and prevent memory fragmentation. Soon, a thread-safe caching system that can release memory to the OS will be added to prevent reserving memory for a long time.
  • Ergonomic and composable: Weave's API is based on futures, similar to async/await for concurrency. The task dependency graph is implicitly built when awaiting a result. An OpenMP syntax is planned.

The "Project Picasso" RFC is available for discussion in Nim RFC #160 or in the (potentially outdated) picasso_RFC.md file

Research

Weave is based on the research by Andreas Prell. You can read his PhD Thesis or access his C implementation.

Several enhancements were built into Weave, in particular:

  • Memory management was carefully studied to allow releasing memory to the OS while still providing very high performance and solving the decades-old cactus stack problem. The solution, coupling a threadsafe memory pool with a lookaside buffer, is inspired by Microsoft's Mimalloc and Snmalloc, a message-passing based allocator (also by Microsoft). Details are provided in the multiple Markdown files in the memory folder.
  • The channels were reworked to not use locks. In particular the MPSC channel (Multi-Producer Single-Consumer) supports batching for both producers and consumers without any lock.

License

Licensed and distributed under either of

  • Apache License, Version 2.0
  • MIT License

at your option. These files may not be copied, modified, or distributed except according to those terms.

weave's Issues

Silly bug on splitHalf / splitGuided

So while introducing support for loop strides, I also broke splitHalf.
AFAIK splitAdaptative is working fine but it's an untested part of the runtime.

The splitting bugs should be fixed with an anti-regression added.
This is completely self-contained in the loop-splitting file.

Offending code:

func splitHalf*(task: Task): int {.inline.} =
  ## Split loop iteration range in half
  task.cur + ((task.stop - task.cur + task.stride-1) div task.stride) shr 1

Test case:

func splitHalfBuggy*(cur, stop, stride: int): int {.inline.} =
  ## Split loop iteration range in half
  cur + ((stop - cur + stride-1) div stride) shr 1

echo splitHalfBuggy(32, 128, 32) # 33 <---- the caller only keeps a single iteration

# fixed
func splitHalf*(cur, stop, stride: int): int {.inline.} =
  ## Split loop iteration range in half
  cur + (stop - cur) shr 1

echo splitHalf(32, 128, 32) # 80

SplitHalf is fairly easy to fix. Below is an explanation of splitGuided, splitAdaptative and splitAdaptativeDelegated so that proper tests can be written.

splitGuided

Split-guided is similar to OpenMP's guided schedule. Assuming N iterations and P workers, you first deal thieves work chunks of size N/P. When the iterations left are fewer than N/P, you deal exponentially decreasing work chunks.

func splitGuided*(task: Task): int {.inline.} =
  ## Split iteration range based on the number of workers
  let stepsLeft = (task.stop - task.cur + task.stride-1) div task.stride
  preCondition: stepsLeft > 0
  {.noSideEffect.}:
    let numWorkers = workforce()
  let chunk = max(((task.stop - task.start + task.stride-1) div task.stride) div numWorkers, 1)
  if stepsLeft <= chunk:
    return task.splitHalf()
  return roundPrevMultipleOf(task.stop - chunk*task.stride, task.stride)

splitAdaptative

SplitAdaptative is described here p120: https://epub.uni-bayreuth.de/2990/

In practice, if a victim is at iteration 19 of a [0, 100) task, we have task.cur = 20 (task.cur is the next splittable iteration, so you don't give up your current work), see
https://github.com/mratsim/weave/blob/5d9017239ca9792cc37e3995f422f86ac57043ab/weave/parallel_for.nim#L24-L49

Assuming we have approximately 7 thieves (due to concurrency we only have a lower bound) plus the victim, we want to distribute 10 iterations to each.
But the split is done one thief at a time in a loop, so the algorithm does:

task.cur = 20, task.stop = 100, thieves = 7 => split at 90
task.cur = 20, task.stop = 90, thieves = 6 => split at 80
task.cur = 20, task.stop = 80, thieves = 5 => split at 70
task.cur = 20, task.stop = 70, thieves = 4 => split at 60
task.cur = 20, task.stop = 60, thieves = 3 => split at 50
task.cur = 20, task.stop = 50, thieves = 2 => split at 40
task.cur = 20, task.stop = 40, thieves = 1 => split at 30
No thieves: we do [20, 30)

And if there is only one thief, it is equivalent to split half.
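
The arithmetic behind this trace can be reproduced with a small standalone sketch (this is not Weave's implementation, just the round-by-round split computation):

func splitPoint(cur, stop, thieves: int): int =
  ## Give the tail of the remaining range to one thief so that the
  ## remaining thieves and the victim get roughly equal shares.
  stop - (stop - cur) div (thieves + 1)

var stop = 100
let cur = 20
var thieves = 7
while thieves > 0:
  let split = splitPoint(cur, stop, thieves)
  echo "task.cur = ", cur, ", task.stop = ", stop, ", thieves = ", thieves, " => split at ", split
  stop = split     # the thief receives [split, old stop), the victim keeps [cur, split)
  dec thieves
echo "No thieves: we do [", cur, ", ", stop, ")"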

splitAdaptativeDelegated

When Weave is compiled with Backoff, workers that backed off from stealing are sleeping and cannot respond to steal requests.
They have a parent that will check their steal requests queue and their children's on their behalf.

When a parent has a loop task with which it can wake up a child worker, it can't just do splitAdaptative, because of the following scenario, assuming we have a leftChild sleeping with 6 steal requests (7 thieves total):

task.cur = 20, task.stop = 100, leftsubtreeThieves = 7 => split at 90
# oops the leftChild is woken up, we can't check its thief queue anymore
task.cur = 20, task.stop = 90, leftsubtreeThieves = 0 => we are left with work imbalance
# And now there is communication overhead because the left child cannot satisfy all steal requests of its tree.

So the parent sends enough work for the whole subtree before waking the left child, which will do the same, avoiding latency and reducing the number of messages to log(n) tasks instead of many recirculated steal requests.

Note that the parent has its own thieves and also another child so it needs to keep enough for them as well.

Support for gc:arc

We need a test-case with a gc:arc based type that is created within a task and sent to the caller (i.e. it escapes its creating thread).

[Glibc] Condition variable lost wakeups

So I'm in conflict:

  • On the red corner, formal verification, model checking, axioms, proofs and the foundation of our comprehension of our universe.
  • On the blue corner, glibc, an industry standard which should have a reference implementation of pthreads.

As an arbiter: the OSX C standard library.

Those are example logs of unfortunate fates of my runtime:


Plenty of people on Stack Overflow suggested that you should use locks, that you should unlock before signaling, or that you should unlock after signaling.
However the evidence seems damning: glibc sits in the box of the accused. OSX does not exhibit the same behaviour. Musl does, though. Formal verification explored almost 10 million possible state interleavings and did not find a deadlock or livelock in my event notifier.

So what does glibc have for its defence?

While waiting for a fix that may never come especially in distros that upgrade very slowly, let's review our options:

  • Ask everyone to switch to Mac (yeah no, ...)
  • Don't back off: saving power would be nice still.
  • Use exponential backoff, log-log-iterated backoff or some of the backoff techniques used for Wi-Fi/Bluetooth explained in the backoff readme: https://github.com/mratsim/weave/blob/1a458832/weave/channels/event_notifiers_and_backoff.md. Acceptable, but nanosleep is not a posix standard, sleeping in microseconds is too long and also "approximative", it does not completely sleep the threads so there is still some power use remaining, and lastly it increases latency.
  • Try one of the wakeup ceremony mentioned in there: https://stackoverflow.com/a/9918764 (apparently used in the linux kernel)
/* atomic op outside of mutex, and then: */

pthread_mutex_lock(&m);
pthread_mutex_unlock(&m);

pthread_cond_signal(&c);
  • or condition signaling via mutex unlock? https://news.ycombinator.com/item?id=11894100 (but in the screenshots I did add locks everywhere)
  • last resort: implementing my own futexes and condition variables from scratch on top of Linux, Mac and Windows primitives ...

Support destructor-based seq/strings

This is more exploratory, but being able to return seq/string would be a huge boon.

The main unknown is:

  • do we pass around ownership of a buffer? This requires it to be allocated in shared memory.
  • or do we copy it into a channel? This restricts the size to Weave's max closure size.

Implement backoff mechanism

Currently the workers spin at full throttle when they have no tasks, wasting CPU.

Backoff has been implemented in the original C implementation via exponential sleep or condition variable: aprell/tasking-2.0@9d6f46b

However, exponential backoff has throughput issues if the tasking is bursty. An alternative would be the Robust Exponential Backoff presented in this paper, https://arxiv.org/abs/1402.5207, which targets lowering Wi-Fi power consumption but still provides guarantees on Wi-Fi packet delivery.

Latency-optimized / job priorities / soft real-time parallel scheduling

Most work-stealing schedulers work in a LIFO manner: the worker works on the task it just enqueued. The main reason is that this maximizes locality (the data needed for the just-enqueued task is probably hot in cache).
As such, those schedulers optimize throughput (doing all the work as fast as possible) but are fundamentally unfair, as the first task enqueued might not be done as soon as possible but only when there is nothing else to do. This is also how Weave works.

In many cases, for example:

  • (soft) realtime audio-processing or video-processing
  • game engines
  • services where FIFO is expected, for example a service that processes a stream of images, or services where users post tasks and expect the first one posted to be the first one scheduled.

we want to optimize latency:

  1. Assume that for optimizing latency, the early tasks scheduled are those that are logically needed first, i.e. FIFO scheduling
  2. We might want to support job priorities.

There are several papers on soft real-time schedulers (i.e. "Earliest Deadline First" scheduling).

However it seems relatively straightforward to have a latency optimized Weave switch.

FIFO scheduling

Instead of popping the last task enqueued from the deque, we can just pop the first task enqueued.
By default Weave adds from the front

myWorker().deque.addFirst task

and pops from the front

weave/weave/scheduler.nim, lines 137 to 144 in bf2ec2f:

proc nextTask*(childTask: static bool): Task {.inline.} =
  # TODO: rewrite as a finite state machine
  profile(enq_deq_task):
    if childTask:
      result = myWorker().deque.popFirstIfChild(myTask())
    else:
      result = myWorker().deque.popFirst()

We can just pop from the back instead.
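
As a standalone illustration with std/deques (not Weave's internal deque), the two policies simply pick opposite ends of the same queue:

import std/deques

var tasks = initDeque[string]()
for t in ["first enqueued", "second enqueued", "third enqueued"]:
  tasks.addFirst(t)          # like Weave, add tasks at the front

echo tasks.popFirst()        # "third enqueued": LIFO, throughput/locality oriented
echo tasks.popLast()         # "first enqueued": FIFO, latency oriented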

Job priorities

Job priorities are important for certain workloads, for example game engines.

Supporting priorities in Weave should just require adding a per-thread priority queue for priority tasks (and keeping the deque for best-effort tasks). No need to solve the complex lock-free concurrent priority queue problem (and the associated thread-safe memory reclamation) when using a message-passing based runtime ✌️.
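
A rough sketch of that idea with the standard library (the Task object and its fields here are made up for illustration, not Weave's task type):

import std/[heapqueue, deques]

type Task = object
  priority: int    # lower number = more urgent, purely for this sketch
  name: string

proc `<`(a, b: Task): bool = a.priority < b.priority

var priorityTasks = initHeapQueue[Task]()   # per-thread queue for priority tasks
var bestEffort = initDeque[Task]()          # the usual deque for best-effort tasks

priorityTasks.push Task(priority: 0, name: "mix audio block")
priorityTasks.push Task(priority: 3, name: "stream asset")
bestEffort.addFirst Task(priority: 9, name: "background stats")

# Scheduling policy sketch: drain priority work first, then fall back to the deque.
let next =
  if priorityTasks.len > 0: priorityTasks.pop()
  else: bestEffort.popFirst()
echo next.name   # "mix audio block"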

Distributed computing

Naive distributed computing requires the following things:

  • A SPSC channel
  • A MPSC channel
  • Finding peers

Thanks to our message-passing based design, we should be able to reuse a large part of the code, though some hierarchical work-stealing probably needs to be introduced.

See the thesis for MPI-based channels.

Further distributed channel alternatives could be ZeroMQ, see this presentation: http://irpf90.ups-tlse.fr/files/oslo_zmq.pdf,
and Nanomsg, which is MIT-licensed (https://github.com/nanomsg/nng), see the write-up: https://nanomsg.github.io/nng/RATIONALE.html

Parallel reduction: performance or termination detection issue

A simple log-sum-exp on 256x10 tensors is 8x slower than sequential with eager flowvars, and with lazy flowvars it has to be killed (workers idling but the runtime not terminating).

See: aprell/tasking-2.0#3

Suspicions:

  • Bug in the memory subsystem? But why wouldn't that get triggered on Black and Scholes or matrix transposition, which are also for-loops and may do lazy loop splitting and release memory from remote threads.
  • The linked lists that link reduction dependencies may be causing issues.
  • Maybe we are doing N times the work?
  • Or we have a termination detection issue

Metrics and perf profile

with (in micro-seconds)

Timer, WorkerID, timer_run_task, timer_send_recv_req, timer_send_recv_task, timer_enq_deq_task, timer_idle, total
Sanity check, logSumExp(1..<10) should be 9.4585514 (numpy logsumexp): 9.458551406860352


--------------------------------------------------------------------------
Scheduler:                                    Sequential
Benchmark:                                    Log-Sum-Exp (Machine Learning) 
Threads:                                      1
datasetSize:                                  20000
batchSize:                                    256
# of full batches:                            78
# of image labels:                            10
Text vocabulary size:                         1000
--------------------------------------------------------------------------
Dataset:                                      256x10
Time(ms):                                     65.479
Max RSS (KB):                                 97496
Runtime RSS (KB):                             0
# of page faults:                             0
Logsumexp:                                    994.2670288085938
--------------------------------------------------------------------------
Scheduler:                                    Weave (eager flowvars)
Benchmark:                                    Log-Sum-Exp (Machine Learning) 
Threads:                                      1
datasetSize:                                  20000
batchSize:                                    256
# of full batches:                            78
# of image labels:                            10
Text vocabulary size:                         1000
--------------------------------------------------------------------------
Dataset:                                      256x10
Time(ms):                                     308.294
Max RSS (KB):                                 97496
Runtime RSS (KB):                             0
# of page faults:                             342
Logsumexp:                                    994.2674560546875

+========================================+
|  Per-worker statistics                 |
+========================================+
  / use -d:WV_profile for high-res timers /  
Worker  3: 33 steal requests sent
Worker 10: 36 steal requests sent
Worker  3: 75 steal requests handled
Worker 10: 65 steal requests handled
Worker 29: 33 steal requests sent
Worker 27: 39 steal requests sent
Worker  8: 37 steal requests sent
Worker  8: 72 steal requests handled
Worker  8: 3421 steal requests declined
Worker  8: 5390 tasks executed
Worker  8: 72 tasks sent
Worker 10: 3436 steal requests declined
Worker  3: 3812 steal requests declined
Worker 10: 4988 tasks executed
Worker  6: 31 steal requests sent
Worker 22: 33 steal requests sent
Worker 22: 0 steal requests handled
Worker 22: 2576 steal requests declined
Worker 22: 16986 tasks executed
Worker 22: 0 tasks sent
Worker 22: 0 tasks split
Worker 22: 100.00 % steal-one
Worker 22: 0.00 % steal-half
Timer,22,70.254,6998.741,3451.635,0.622,17846.474,28367.727
Worker 17: 38 steal requests sent
Worker 17: 38 steal requests handled
Worker 17: 2784 steal requests declined
Worker 31: 34 steal requests sent
Worker 31: 0 steal requests handled
Worker 31: 2052 steal requests declined
Worker 26: 38 steal requests sent
Worker 19: 34 steal requests sent
Worker 19: 0 steal requests handled
Worker 19: 2713 steal requests declined
Worker 19: 23280 tasks executed
Worker 19: 0 tasks sent
Worker 19: 0 tasks split
Worker 19: 100.00 % steal-one
Worker 19: 0.00 % steal-half
Timer,19,131.721,7056.739,3433.744,1.068,17743.030,28366.303
Worker 28: 36 steal requests sent
Worker 28: 0 steal requests handled
Worker 28: 2537 steal requests declined
Worker 28: 24649 tasks executed
Worker 28: 0 tasks sent
Worker 28: 0 tasks split
Worker 28: 100.00 % steal-one
Worker 28: 0.00 % steal-half
Timer,28,114.383,7038.537,3418.580,0.956,17730.621,28303.077
Worker 32: 34 steal requests sent
Worker 32: 0 steal requests handled
Worker 32: 2082 steal requests declined
Worker 32: 9079 tasks executed
Worker 32: 0 tasks sent
Worker 32: 0 tasks split
Worker 32: 100.00 % steal-one
Worker 32: 0.00 % steal-half
Timer,32,56.943,6957.737,3399.172,0.462,17981.132,28395.446
Worker 13: 37 steal requests sent
Worker 13: 73 steal requests handled
Worker 13: 3006 steal requests declined
Worker 13: 8252 tasks executed
Worker 13: 73 tasks sent
Worker 13: 73 tasks split
Worker 13: 100.00 % steal-one
Worker 13: 0.00 % steal-half
Timer,13,147.470,7099.723,3438.656,2.372,17740.460,28428.681
Worker  8: 72 tasks split
Worker  8: 100.00 % steal-one
Worker 30: 37 steal requests sent
Worker 30: 0 steal requests handled
Worker 30: 2444 steal requests declined
Worker 30: 23325 tasks executed
Worker 30: 0 tasks sent
Worker 30: 0 tasks split
Worker 30: 100.00 % steal-one
Worker 30: 0.00 % steal-half
Timer,30,94.372,7038.752,3446.618,0.785,17766.724,28347.251
Worker 16: 38 steal requests sent
Worker 16: 72 steal requests handled
Worker 16: 2655 steal requests declined
Worker 16: 3708 tasks executed
Worker 16: 72 tasks sent
Worker 16: 72 tasks split
Worker 16: 100.00 % steal-one
Worker 16: 0.00 % steal-half
Timer,16,86.647,8105.503,3231.686,2.798,18223.436,29650.070
Worker  9: 34 steal requests sent
Worker  9: 64 steal requests handled
Worker  9: 3274 steal requests declined
Worker  9: 25721 tasks executed
Worker  9: 64 tasks sent
Worker  9: 64 tasks split
Worker  9: 100.00 % steal-one
Worker  9: 0.00 % steal-half
Timer,9,249.162,7092.431,3436.262,2.427,17734.208,28514.490
Worker  7: 40 steal requests sent
Worker  7: 74 steal requests handled
Worker  7: 3191 steal requests declined
Worker  7: 5327 tasks executed
Worker  7: 74 tasks sent
Worker  7: 74 tasks split
Worker  7: 100.00 % steal-one
Worker  7: 0.00 % steal-half
Timer,7,138.054,6861.554,3482.237,3.287,17315.058,27800.189
Worker 24: 33 steal requests sent
Worker 24: 0 steal requests handled
Worker 24: 2418 steal requests declined
Worker 24: 25174 tasks executed
Worker 24: 0 tasks sent
Worker 24: 0 tasks split
Worker 24: 100.00 % steal-one
Worker 24: 0.00 % steal-half
Timer,24,99.328,7045.619,3415.327,0.815,17725.872,28286.961
Worker  3: 26724 tasks executed
Worker  3: 75 tasks sent
Worker  3: 75 tasks split
Worker  3: 100.00 % steal-one
Worker 15: 38 steal requests sent
Worker 15: 66 steal requests handled
Worker 15: 2646 steal requests declined
Worker 15: 2726 tasks executed
Worker  6: 70 steal requests handled
Worker  6: 3555 steal requests declined
Worker  6: 26274 tasks executed
Worker  6: 70 tasks sent
Worker  6: 70 tasks split
Worker  6: 100.00 % steal-one
Worker  6: 0.00 % steal-half
Timer,6,196.350,6996.878,3429.454,2.007,17719.904,28344.594
Worker 14: 35 steal requests sent
Worker 14: 68 steal requests handled
Worker 14: 3072 steal requests declined
Worker 14: 4599 tasks executed
Worker 14: 68 tasks sent
Worker 14: 68 tasks split
Worker 14: 100.00 % steal-one
Worker 14: 0.00 % steal-half
Timer,14,162.134,7116.283,3409.806,2.232,17722.834,28413.289
Worker  1: 20 steal requests sent
Worker  1: 60 steal requests handled
Worker  1: 4363 steal requests declined
Worker  1: 51445 tasks executed
Worker  1: 60 tasks sent
Worker  1: 60 tasks split
Worker  1: 100.00 % steal-one
Worker  1: 0.00 % steal-half
Timer,1,299.332,7030.355,3396.620,4.260,17656.248,28386.814
Worker 27: 0 steal requests handled
Worker 27: 2303 steal requests declined
Worker 27: 17857 tasks executed
Worker 27: 0 tasks sent
Worker 27: 0 tasks split
Worker 27: 100.00 % steal-one
Worker 27: 0.00 % steal-half
Timer,27,71.999,7034.148,3428.589,0.547,17731.467,28266.751
Worker  0: 1 steal requests sent
Worker  0: 40 steal requests handled
Worker  0: 5045 steal requests declined
Worker  0: 2473159 tasks executed
Worker  0: 40 tasks sent
Worker  0: 10 tasks split
Worker  0: 100.00 % steal-one
Worker  0: 0.00 % steal-half
Timer,0,7428.066,5852.005,2049.953,7.784,11052.187,26389.995
Worker 21: 34 steal requests sent
Worker 21: 0 steal requests handled
Worker 21: 2869 steal requests declined
Worker 21: 23934 tasks executed
Worker 21: 0 tasks sent
Worker 21: 0 tasks split
Worker 21: 100.00 % steal-one
Worker 21: 0.00 % steal-half
Timer,21,141.978,7028.810,3428.964,1.293,17726.939,28327.984
Worker 31: 8281 tasks executed
Worker 31: 0 tasks sent
Worker 31: 0 tasks split
Worker 31: 100.00 % steal-one
Worker 31: 0.00 % steal-half
Timer,31,42.433,7075.626,3462.089,0.381,17755.089,28335.618
Worker 12: 36 steal requests sent
Worker 12: 74 steal requests handled
Worker 12: 3042 steal requests declined
Worker 12: 7046 tasks executed
Worker 12: 74 tasks sent
Worker 12: 74 tasks split
Worker 12: 100.00 % steal-one
Worker 12: 0.00 % steal-half
Timer,12,163.312,7059.514,3404.781,2.375,17686.628,28316.610
Worker 25: 38 steal requests sent
Worker 25: 0 steal requests handled
Worker 25: 2506 steal requests declined
Worker 25: 28919 tasks executed
Worker 25: 0 tasks sent
Worker 25: 0 tasks split
Worker 25: 100.00 % steal-one
Worker 25: 0.00 % steal-half
Timer,25,111.899,7063.668,3431.609,0.838,17721.555,28329.568
Worker  3: 0.00 % steal-half
Timer,3,228.089,7106.528,3406.020,3.824,17689.052,28433.513
Worker 15: 66 tasks sent
Worker 15: 66 tasks split
Worker 15: 100.00 % steal-one
Worker 15: 0.00 % steal-half
Timer,15,80.058,7134.656,3520.893,1.772,17707.194,28444.574
Worker 35: 39 steal requests sent
Worker 35: 0 steal requests handled
Worker 35: 2323 steal requests declined
Worker 35: 12824 tasks executed
Worker 35: 0 tasks sent
Worker 35: 0 tasks split
Worker 35: 100.00 % steal-one
Worker 35: 0.00 % steal-half
Timer,35,90.430,7100.583,3427.469,0.761,17772.149,28391.392
Worker  2: 21 steal requests sent
Worker  2: 60 steal requests handled
Worker  2: 4195 steal requests declined
Worker  2: 54829 tasks executed
Worker  2: 60 tasks sent
Worker  2: 60 tasks split
Worker  2: 100.00 % steal-one
Worker  2: 0.00 % steal-half
Timer,2,14854.437,7780.487,3028.948,2.399,15690.848,41357.119
Worker 18: 36 steal requests sent
Worker 18: 0 steal requests handled
Worker 18: 2885 steal requests declined
Worker 18: 25927 tasks executed
Worker 18: 0 tasks sent
Worker 18: 0 tasks split
Worker 18: 100.00 % steal-one
Worker 18: 0.00 % steal-half
Timer,18,140.243,7082.420,3414.857,1.215,17653.338,28292.073
Worker 26: 0 steal requests handled
Worker 26: 2429 steal requests declined
Worker 26: 20008 tasks executed
Worker  4: 29 steal requests sent
Worker  4: 67 steal requests handled
Worker  4: 3860 steal requests declined
Worker  4: 28989 tasks executed
Worker  4: 67 tasks sent
Worker  4: 67 tasks split
Worker  4: 100.00 % steal-one
Worker  4: 0.00 % steal-half
Timer,4,250.399,6385.547,4857.116,3.763,18666.364,30163.189
Worker 33: 37 steal requests sent
Worker 33: 0 steal requests handled
Worker 33: 2157 steal requests declined
Worker 20: 32 steal requests sent
Worker 20: 0 steal requests handled
Worker 11: 34 steal requests sent
Worker 11: 63 steal requests handled
Worker 11: 3035 steal requests declined
Worker 11: 7328 tasks executed
Worker 10: 65 tasks sent
Worker 10: 65 tasks split
Worker 10: 100.00 % steal-one
Worker 10: 0.00 % steal-half
Timer,10,195.228,7119.545,3418.841,2.004,17733.731,28469.349
Worker 17: 8194 tasks executed
Worker 26: 0 tasks sent
Worker 26: 0 tasks split
Worker 26: 100.00 % steal-one
Worker 26: 0.00 % steal-half
Timer,26,96.546,7063.857,3403.231,0.741,17724.032,28288.407
Worker 20: 2558 steal requests declined
Worker 20: 19716 tasks executed
Worker 20: 0 tasks sent
Worker 20: 0 tasks split
Worker  8: 0.00 % steal-half
Timer,8,192.811,7093.305,3423.944,3.574,17710.355,28423.989
Worker  5: 31 steal requests sent
Worker  5: 68 steal requests handled
Worker  5: 3545 steal requests declined
Worker 34: 37 steal requests sent
Worker 34: 0 steal requests handled
Worker 17: 38 tasks sent
Worker 17: 38 tasks split
Worker 17: 100.00 % steal-one
Worker 17: 0.00 % steal-half
Timer,17,99.339,7057.996,3440.329,1.072,17737.308,28336.044
Worker  5: 30440 tasks executed
Worker  5: 68 tasks sent
Worker  5: 68 tasks split
Worker 34: 2159 steal requests declined
Worker  5: 100.00 % steal-one
Worker 29: 0 steal requests handled
Worker 29: 2449 steal requests declined
Worker 29: 20583 tasks executed
Worker 29: 0 tasks sent
Worker 29: 0 tasks split
Worker 29: 100.00 % steal-one
Worker 29: 0.00 % steal-half
Timer,29,98.517,7055.290,3405.382,0.825,17710.874,28270.887
Worker 11: 63 tasks sent
Worker 11: 63 tasks split
Worker 11: 100.00 % steal-one
Worker 11: 0.00 % steal-half
Timer,11,14759.737,7823.362,3037.788,2.207,15743.915,41367.009
Worker  5: 0.00 % steal-half
Timer,5,14806.778,12367.326,3971.878,2.673,18949.675,50098.329
Worker 20: 100.00 % steal-one
Worker 20: 0.00 % steal-half
Worker 34: 11318 tasks executed
Worker 34: 0 tasks sent
Worker 34: 0 tasks split
Worker 34: 100.00 % steal-one
Worker 34: 0.00 % steal-half
Timer,34,64.704,7063.165,3421.925,0.553,17726.712,28277.060
Timer,20,104.570,7041.677,3411.791,0.815,17711.852,28270.705
Worker 33: 11421 tasks executed
Worker 33: 0 tasks sent
Worker 33: 0 tasks split
Worker 33: 100.00 % steal-one
Worker 33: 0.00 % steal-half
Timer,33,58.339,7072.869,3413.858,0.470,17742.986,28288.522
Worker 23: 32 steal requests sent
Worker 23: 0 steal requests handled
Worker 23: 2379 steal requests declined
Worker 23: 21580 tasks executed
Worker 23: 0 tasks sent
Worker 23: 0 tasks split
Worker 23: 100.00 % steal-one
Worker 23: 0.00 % steal-half
Timer,23,83.116,2767.684,9312.722,0.729,24085.002,36249.252
+========================================+

[Mempool] Trying to allocate from a full arena

https://travis-ci.com/mratsim/weave/jobs/271436702#L414-L426

========================================================================================
Running [] weave/memory/memory_pools.nim
========================================================================================
Single-threaded: System alloc for 100 blocks: 0.7871 s
Single-threaded: Pool   alloc for 100 blocks: 0.4178 s
Multi-threaded: System alloc: 0.0178 s
fatal.nim(39)            sysFatal
Error: unhandled exception: contracts.nim(86, 15) `arena.meta.used < arena.blocks.len`
    Contract violated for pre-condition at memory_pools.nim:359
        arena.meta.used < arena.blocks.len
    The following values are contrary to expectations:
        62 < 62  [Worker N/A]
 [AssertionError]

func allocBlock(arena: var Arena): ptr MemBlock {.inline.} =
  ## Allocate from an arena
  preCondition: not arena.meta.free.isNil
  preCondition: arena.meta.used < arena.blocks.len
  arena.meta.used += 1
  result = arena.meta.free
  unpoisonMemRegion(result, WV_MemBlockSize)
  # The following acts as prefetching for the block that we are returning as well
  arena.meta.free = cast[ptr MemBlock](result.next.load(moRelaxed))
  postCondition: arena.meta.used in 0 .. arena.blocks.len

[Testing] Concurrency: Race detection / Model Checking / Formal Verification

While Weave is currently doing a very good job at restricting shared writable state to channels, and specifically trySend and tryRecv routines, we need tooling and tests to detect races and concurrent heisenbugs.

Unfortunately it is apparently an NP-hard problem. The issue is that, to ease debugging, we need to reliably trigger the bug to ensure it is fixed. However thread interleaving is non-deterministic and we can't ask people to use our own deterministic fork of Windows, Linux or Mac.

Also, allocating and freeing memory willy-nilly from lots of threads might overwhelm the allocator in use or trigger bugs (if we use the Nim allocator), so it's probably better to never free memory during testing, to avoid allocator bugs/slowness (b8ac8d6).

There are a couple of approaches that can be taken, with varying degrees of impracticality.

Sanitizers

This category only requires recompiling with extra flags, sometimes just --debugger:native.
While they can detect the presence of bugs, I don't think they can prove the absence of one.

valgrind --tool=helgrind build/mybinary

http://valgrind.org/docs/manual/hg-manual.html

POSIX-only; with --debugger:native it will mention the Nim lines that are potentially racy.
It slows down the code a lot.

Also there is some noise on memset and memcpy (not sure if it also happens if you never free memory).

LLVM ThreadSanitizer

Compile with clang with -fsanitize=thread
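
A hedged sketch of how that could be invoked for a Nim test program (standard Nim flags; the file name is a placeholder):

nim c --cc:clang --threads:on --passC:"-fsanitize=thread" --passL:"-fsanitize=thread" tests/mytest.nim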

Libraries

Libraries can exhaustively test all kinds of thread interleavings given a test scenario, say an MPSC queue with 2 producers and 1 consumer. We however have to assume that being bug-free for 2 producers means being bug-free for N (which has been proved for some MPSC queue implementations).

Important: the library used should ideally support the C++11 memory model and/or ensure correctness on weak memory model architectures (i.e. everything that is not x86, like ARM, PowerPC or MIPS).

Relacy Race Detector

We can switch Nim atomics to Relacy atomics via templates and recompile.

Chess

Microsoft, Windows-only

Landslide

https://github.com/bblum/landslide
https://www.pdl.cmu.edu/Landslide/index.shtml

See extensive PhD Thesis at the bottom

Do-it-yourself

The blog post for MultithreadedTC, a verifier for the JVM, explains in depth how it is architected: an internal metronome clock syncs all threads and then, at synchronization points, tests all combinations of thread interleaving.

http://www.cs.umd.edu/projects/PL/multithreadedtc/overview.html

Model Checking and Formal verification

This is heavyweight and requires either using a foreign language or a lot of annotation and constraints in the source code, in exchange it provides mathematical guarantees of correctness:

VCC

Annotate C code and it will be passed to Z3

Iris

TLA+

Spin

Resources

load balancing on backed off child wake-up.

For Weave compiled with Backoff and StealAdaptative

Similar to #76, which wakes up child workers with enough loop iterations to satisfy their whole subtree (explained in depth in #89), it might be better to send a lot of tasks when waking up child workers, overriding their stealOne/stealHalf request.

Windows TLS Emulation is extremely slow

Overhead-bound benchmarks like Fibonacci and Depth-First Search are significantly slower on Windows than Linux and Mac.

Config: i9-9980XE 18 cores, 36 threads, with 4.1GHz all core Turbo

On Fibonacci in particular, the default eager futures take 14 s under Windows while they take 370 ms under Linux, for a whopping 30x+ slowdown.
Lazy futures allocated via alloca take 800 ms, while they take 180 ms under Linux.

This points to a memory allocator issue.

Memory-bound benchmarks (transpose) and CPU-bound benchmarks (Black-Scholes) seem to behave somewhat similarly to Linux.

Similar issues:

Low priority, as we probably can't do anything more than what we have now in our memory subsystem. It's doubtful that even using Mimalloc on Windows (just for Weave) would help, as our memory pool is based on the same techniques. Lastly, Fibonacci is an extreme case with a computation load of 1 cycle, while Weave targets being efficient at 2000 cycles.

TODO: benchmark Cilk and TBB to make sure we are not missing something.

MPSC count can become negative

Even though the count is done on the consumer side, which should always underestimate the real count, the estimated number of enqueued items in the MPSC channel can be negative.

See #48 (comment) and CI https://dev.azure.com/numforge/Weave/_build/results?buildId=25&view=logs

This is not blocking as this count is only informative and used for adaptative stealing and for the memory pool to trigger batch reception of memory.

This is something I actually noticed in the past, but it seems to happen rarely:

func peek*(chan: var ChannelMpscUnboundedBatch): int32 {.inline.} =
  ## Estimates the number of items pending in the channel
  ## - If called by the consumer the true number might be more
  ##   due to producers adding items concurrently.
  ## - If called by a producer the true number is undefined
  ##   as other producers also add items concurrently and
  ##   the consumer removes them concurrently.
  ##
  ## This is a non-locking operation.
  result = int32 chan.count.load(moAcquire)
  # For the consumer it's always positive or zero
  postCondition: result >= 0 # TODO somehow it can be -1

Windows support

Windows is only missing barriers; unlike on macOS, we can reuse the OS API for them.
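
For reference, a minimal sketch of wrapping the Win32 synchronization barrier API (available since Windows 8 via synchapi.h/windows.h); the type and function names are the Windows API's own, the Nim declarations themselves are illustrative:

when defined(windows):
  type
    SynchronizationBarrier* {.importc: "SYNCHRONIZATION_BARRIER",
                              header: "<windows.h>".} = object

  proc InitializeSynchronizationBarrier*(
         lpBarrier: var SynchronizationBarrier,
         lTotalThreads, lSpinCount: int32): int32
       {.importc, stdcall, header: "<windows.h>".}
    ## Returns non-zero on success (WINBOOL).

  proc EnterSynchronizationBarrier*(
         lpBarrier: var SynchronizationBarrier,
         dwFlags: uint32): int32
       {.importc, stdcall, header: "<windows.h>".}
    ## Blocks until lTotalThreads threads have entered the barrier.

  proc DeleteSynchronizationBarrier*(
         lpBarrier: var SynchronizationBarrier): int32
       {.importc, stdcall, header: "<windows.h>".}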

[Benchmarks] Windows benchmarking

Related to #2.

#59 introduces Windows support, but the benchmarks currently rely on Unix-only time and memory measurement tooling. This prevents benchmarking on Windows.

Architecture diagram(s)

In the scheduler, workers alternatively become:

  • workers
  • victims
  • thieves
  • sharing work
  • terminating

With state transitions triggered by:

  • receiving steal requests
  • running out of tasks
  • encountering a barrier
  • the managed worker state

I am pretty sure we could model a thread-local worker as a reasonably sized state machine.
This would help:

  • documentation
  • debugging: a quick table would help to know, depending on the state at the entrance of the proc and the event/message received, what should happen.
  • exhaustivity: making sure that we have handlers for all events
  • verification of the code

Potentially in the future we could have a code generator to generate those nested ifs:

proc decline(req: sink StealRequest) =
  ## Pass steal request to another worker
  ## or the manager if it's our own that came back
  preCondition: req.retry <= PI_MaxRetriesPerSteal

  req.retry += 1
  profile(send_recv_req):
    incCounter(stealDeclined)

  if req.thiefID == myID():
    # No one had jobs to steal
    ascertain: req.victims.isEmpty()
    ascertain: req.retry == PI_MaxRetriesPerSteal
    if req.state == Stealing and myWorker().leftIsWaiting and myWorker().rightIsWaiting:
      when PI_MaxConcurrentStealPerWorker == 1:
        # When there is only one concurrent steal request allowed, it's always the last.
        lastStealAttempt(req)
      else:
        # Is this the last theft attempt allowed per steal request?
        # - if so: lastStealAttempt special case (termination if lead thread, sleep if worker)
        # - if not: drop it and wait until we receive work or all out steal requests failed.
        if myThefts().outstanding == PI_MaxConcurrentStealPerWorker and
            myTodoBoxes().len == PI_MaxConcurrentStealPerWorker - 1:
          # "PI_MaxConcurrentStealPerWorker - 1" steal requests have been dropped
          # as evidenced by the corresponding channel "address boxes" being recycled
          ascertain: myThefts().dropped == PI_MaxConcurrentStealPerWorker - 1
          lastStealAttempt(req)
        else:
          drop(req)
    else:
      # Our own request but we still have work, so we reset it and recirculate.
      # This can only happen if workers are allowed to steal before finishing their tasks.
      when PI_StealEarly > 0:
        req.retry = 0
        req.victims.init(workforce)
        req.victims.clear(myID())
        req.findVictimAndSteal()
      else: # No-op in "-d:danger"
        postCondition: PI_StealEarly > 0 # Force an error
  else: # Not our own request
    req.findVictimAndSteal()

Though it's probably overkill given the number of states: it requires testing the macros and makes it trickier to go through the stack traces.

Perf regression on SPC bench

Seems like it's time to implement the full benchmark suite.

There is a significant performance regression on the SPC benchmark compared to the original PoC, probably due to the new memory subsystem in #24.
I suspect it's the release of tasks freed by remote threads back to the OS that requires more tuning.

Refactor the bitset data structure

The current bitset data structure is not satisfactory:
the backend is a uint32, limiting victim selection to 32 cores, and even using uint64 is not enough. Instead the maximum should be a compile-time constant.

A default of 256 should be enough as it was Nim's default, it's also the limit of Windows,
and it should keep the size of a steal request under a cache-line size.
The buffer should use multiple limbs of uint32 or uint64, with the number of limbs derived at compile time from the maximum number of workers.
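
A minimal sketch of the multi-limb backend (`WV_MaxWorkers` is a hypothetical compile-time define used here for illustration, not an existing Weave constant):

const
  WV_MaxWorkers {.intdefine.} = 256          # configurable with -d:WV_MaxWorkers=N
  LimbBits = 64
  NumLimbs = (WV_MaxWorkers + LimbBits - 1) div LimbBits

type
  VictimSet = object
    limbs: array[NumLimbs, uint64]           # 256 workers -> 4 limbs = 32 bytes

func incl(s: var VictimSet, id: int) {.inline.} =
  s.limbs[id div LimbBits] = s.limbs[id div LimbBits] or (1'u64 shl (id mod LimbBits))

func excl(s: var VictimSet, id: int) {.inline.} =
  s.limbs[id div LimbBits] = s.limbs[id div LimbBits] and not(1'u64 shl (id mod LimbBits))

func contains(s: VictimSet, id: int): bool {.inline.} =
  (s.limbs[id div LimbBits] and (1'u64 shl (id mod LimbBits))) != 0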

Some parts of the bitset should be optimized as well:

  • random number generator: rand_r from stdlib.h is fast but does not have good statistical properties,
    and those might be important to benefit from the theoretical properties of randomized victim selection.
    splitMix64 seems to be a nice and simple choice. xoroshiro128+ (the default Nim RNG) is faster but fails BigCrush if only the lower bits are used. We could use xoshiro256++ though, which is even faster.
  • to restrict the range of the RNG output to the number of cores, a simple mod is used. This does not give a uniform distribution if the number of cores does not divide the RNG range, i.e. if the number of cores is not a power of 2. Rejection sampling would correct that, but it may be overkill (see the sketch after this list).
  • random victim selection: when fewer than half of the cores are potential victims, instead of picking a victim directly we can negate the distribution. The fast path with 3 retries currently implemented should still be used over 70% of the time.
    Note that some runtimes do a random victim selection and then increment in a loop until they find an actual victim. This can be unbalanced if there is a large gap, for example with victims 0, 1, 7, 8: 7 has a disproportionate probability of getting picked, increasing the gap further.
  • uncompressing the bitset into an array faster: assuming we support 256 victims in our bitset, we need to iterate on it quickly.
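
For the modulo-bias point above, a rejection-sampling sketch (`rng` stands for whichever 64-bit generator ends up being chosen; it is passed in explicitly here only to keep the example self-contained):

proc uniform(rng: proc(): uint64, numVictims: uint64): uint64 =
  ## Unbiased equivalent of `rng() mod numVictims`:
  ## reject draws from the top, partial bucket of the 64-bit range.
  ## The loop runs exactly once in the overwhelming majority of cases.
  assert numVictims > 0
  let limit = uint64.high - (uint64.high mod numVictims)  # a multiple of numVictims
  while true:
    let x = rng()
    if x < limit:
      return x mod numVictims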

Setup continuous integration

Besides the usual x86, Linux and macOS suspects:

  • Travis for ARM64
  • Azure Pipelines for Windows (when #37 is solved)

Travis in particular allows 6 cores for ARM and only 2 cores for X86.

SparseSet: signed/unsigned/isEmpty/contains issue

Extracted from #19


It's probably linked to a signed/unsigned conversion, but trying to isolate the bug makes it disappear. Reproducing it might require an int conversion via generics in a file that doesn't directly import sparsesets, or something like that.

Symptoms:

(screenshots of the failures omitted)

Note that even with asserts off we still get issues. Some thief/victim IDs or the sparse set length appear negative if converted to "int", or normal (in the 250 range) if kept as "Setuint".

I don't know why it doesn't happen on master.

Nested for-loops: Expressing read-after-write and write-after-write dependencies

From commits:

I am almost ready to port a state-of-the-art BLAS to Weave and compete with OpenBLAS and MKL; however parallelizing 2 nested loops with read-after-write dependencies causes issues.

Analysis

The current barrier is Master thread only and is only suitable for the root task

weave/weave/runtime.nim

Lines 88 to 96 in 7802daf

proc sync*(_: type Weave) =
  ## Global barrier for the Picasso runtime
  ## This is only valid in the root task
  Worker: return

  debugTermination:
    log(">>> Worker %2d enters barrier <<<\n", myID())

  preCondition: myTask().isRootTask()

Unfortunately, in nested parallel loops normal workers can also reach it; they will not be stopped and will create more tasks or continue on their current one even though dependencies are not resolved.

Potential solutions

Extending the barrier

Extend the current barrier to work with worker threads in nested situations. A typical workload with nested barriers should be added to the bench suite to test the behaviour with a known, testable workload.

Providing static loop scheduling

Providing static loop scheduling by eagerly splitting the work may (?) prevent one thread from running away and creating tasks whose dependencies are not resolved. This needs more thinking as I have trouble considering all the scenarios.

Create a waitable nested-for loops iterations

Weave already provides very fine-grained synchronization primitives with futures: we do not need to wait for threads but just for the iterations that do the packing work we depend on. A potential syntax would be:

    ...

    # ###################################
    # 3. for ic = 0,...,m−1 in steps of mc
    parallelFor icb in 0 ..< tiles.ic_num_tasks:
      captures: {pc, tiles, nc, kc, alpha, beta, vA, vC, M}
      waitable: icForLoop

      ...

      sync(icForLoop) # Somehow ensure that the iterations we care about were done

      # #####################################
      # 4. for jr = 0,...,nc−1 in steps of nr
      parallelForStrided jr in 0 ..< nc, stride = NR:
        captures: {mc, nc, kc, alpha, packA, packB, beta, mcncC}

This would create a dummy future called icForLoop that could be waited on by calls that are nested further.

Unknowns:

  • Multiple threads might try to complete the future: should we create a separate type, or do we allow sharing futures?
  • Not too sure how that fits with lazy splitting: assume you wait on icForLoop for iterations 0..<100 but you actually only require the 0..<10 chunk to continue working, how do you specify that? The waitable should maybe store a range.

Task graphs

A couple of other frameworks express this as task graphs.

However the make_edge/precede syntax is verbose and it doesn't seem to address the issue of fine-grained loops.

I believe it's better to express emerging dependencies via data dependencies, i.e. futures and waitable ranges.

The OpenMP task dependencies approach could be suitable and made cleaner; from the OpenMP 4.5 doc:
(screenshot of the OpenMP 4.5 documentation omitted)

Or CppSs (https://gitlab.com/szs/CppSs, https://www.thinkmind.org/download.php?articleid=infocomp_2013_2_30_10112, https://arxiv.org/abs/1502.07608):
(screenshot of the CppSs documentation omitted)

Or Kaapi (https://tel.archives-ouvertes.fr/tel-01151787v1/document):

(screenshot from the Kaapi thesis omitted)

Create a polyhedral compiler

Well, the whole point of Weave is to ease creating a linear algebra compiler, so ...

Code explanation

The way the code is structured is the following (from BLIS paper [2]):
(figure from the BLIS paper [2] omitted)

For an MxN = MxK * KxN product,
assuming a shared-memory arch with private L1-L2 caches per core and a shared L3,
tiling for CPU caches and register blocking is done the following way:

  • nc chosen to fit in L3 cache (4096 bytes), not done in Laser
  • kc and mc chosen to fit in half the L2 cache
  • kc and mc should avoid page-fault by not fitting in TLB
  • kc and nr chosen to fit in half the L1 cache
  • kc as large as possible to amortize mr*nr updates
  • This is similar to maximizing the area of a rectangle while minimizing the perimeter
    mc*kc/(2mc+2kc) with mc*kc < K
  • mr and nr are chosen depending on the number of registers and their width
    and so are runtime CPU dependent
  • Using the wrong parameters or register size can drop the performance by 30%
| Loop around microkernel | Index | Length | Stride | Dimension | Dependencies | Notes |
|---|---|---|---|---|---|---|
| 5th loop | jc | N | nc | N | | |
| 4th loop | pc | K | kc | K | Pack panels of B: kc*nc by blocks of nr | Panel packing: Parallelization with Master barrier. Loop: Difficult to parallelize as K is the reduction dimension so parallelizing K requires handling conflicting writes |
| 3rd loop | ic | M | mc | M | Pack panels of A: kc*mc by blocks of mr | Panel packing: Parallelized. Loop: Normally parallelized but missing a way to express dependency or a worker barrier |
| 2nd loop | jr | nc | nr | N | Slice A into micropanel kc*nr | Parallelized with master barrier |
| 1st loop | ir | mc | mr | M | Slice B into micropanel kc*mr | |
| Microkernel | k,i,j | kc,mr,nr | 1 | K, M, N | Read the micropanel to do the mr*nr update | Vectorized |
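
To make the loop nest concrete, here is a schematic and deliberately naive Nim version (no packing, no vectorized micro-kernel, no parallelism); `gemmTiled` and its tile parameters are illustrative, not Laser's or Weave's actual code:

proc gemmTiled(M, N, K: int; A, B: seq[float32]; C: var seq[float32];
               mc, nc, kc, mr, nr: int) =
  ## C[M,N] += A[M,K] * B[K,N], row-major. Loop structure only.
  for jc in countup(0, N-1, nc):                 # 5th loop (N dimension)
    let jcEnd = min(jc + nc, N)
    for pc in countup(0, K-1, kc):               # 4th loop (K): pack the B panel here
      let pcEnd = min(pc + kc, K)
      for ic in countup(0, M-1, mc):             # 3rd loop (M): pack the A panel here
        let icEnd = min(ic + mc, M)
        for jr in countup(jc, jcEnd-1, nr):      # 2nd loop
          for ir in countup(ic, icEnd-1, mr):    # 1st loop
            # micro-kernel: update an mr*nr block of C over kc iterations
            for k in pc ..< pcEnd:
              for i in ir ..< min(ir + mr, icEnd):
                for j in jr ..< min(jr + nr, jcEnd):
                  C[i*N + j] += A[i*K + k] * B[k*N + j]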

References

[1] Anatomy of High-Performance Matrix Multiplication (Revised)
Kazushige Goto, Robert A. Van de Geijn
- http://www.cs.utexas.edu/~flame/pubs/GotoTOMS_revision.pdf

[2] Anatomy of High-Performance Many-Threaded Matrix Multiplication
Smith et al
- http://www.cs.utexas.edu/users/flame/pubs/blis3_ipdps14.pdf

[3] Automating the Last-Mile for High Performance Dense Linear Algebra
Veras et al
- https://arxiv.org/pdf/1611.08035.pdf

[4] GEMM: From Pure C to SSE Optimized Micro Kernels
Michael Lehn
- http://apfel.mathematik.uni-ulm.de/~lehn/sghpc/gemm/index.html

Laser wiki - GEMM optimization resources
- https://github.com/numforge/laser/wiki/GEMM-optimization-resources

Research on IO-bound tasks

Weave / Project Picasso focuses on CPU-bound tasks, i.e. tasks that are non-blocking and where you can throw more CPU at the problem to get your result faster.

For IO-bound tasks the idea was to defer to specialized libraries like asyncdispatch and Chronos that use OS primitives (epoll/IOCP/kqueue) to handle IO efficiently.

However even for compute bound tasks we will have to deal with IO latencies for example in a distributed system or cluster. So we need a solution to do useful work in the downtime without blocking a whole thread.

That means:

  • either playing well with asyncdispatch/Chronos (do we run an event loop per thread or only one event loop ...)
  • or having deeper integration, which will probably be better for people that need both IO and compute, but people needing only one or the other pay an extra tax. The library would also get significantly more complex, harder to maintain, and would have many more platform-specific code paths, or even CPU-specific ones in the case of coroutines with stack and register manipulation.

Research

  • Reduced I/O latencies with Futures

    Kyle Singer, Kunal Agrawal, I-Ting Angelina Lee

    https://arxiv.org/abs/1906.08239

    The paper explores coupling a Cilk-like workstealing runtime with a IO runtime based on Linux epoll and eventfd.

  • A practical solution to the Cactus Stack Problem

    Chaoran Yang, John Mellor-Crummey

    http://chaoran.me/assets/pdf/ws-spaa16.pdf

    Fibril: https://github.com/chaoran/fibril

    While not explicitly mentioning async I/O, the paper and the corresponding Fibril library are using
    coroutines/fibers-like tasks to achieve fast and extremely low overhead context switching.
    Coroutines are very efficient building blocks for async IO.

    For reference, the overhead is measured by fibonacci(40), which spawns hundreds of millions of tasks. Fibril achieves 130ms, Staccato 180ms, Weave 165-200ms depending on memory management tradeoffs; more established runtimes have much more overhead: TBB 600ms~1s, Clang OpenMP ~2s, Julia Partr ~8s, and HPX and GCC OpenMP cannot handle fib(40).

Implementations

C++ Target

#94 introduces pledges for dataflow graphs.
It uses atomics in a union, which leads to bad codegen with the C++ backend at the moment.

Upstream: nim-lang/Nim#13062

Benchmarks TODO

Important benchmarks

  • Framework overhead via fibonacci
  • Unbalanced Tree Search
  • GEMM / Matrix Multiply vs OpenBLAS and MKL
  • Binary size overhead when runtime is not compiled in
  • Space overhead at runtime vs serial code
  • Returning memory to the OS on long-running processes
  • PARSEC benchmark suite: https://parsec.cs.princeton.edu/
  • NAS Parallel Benchmarks from the NASA Advanced Supercomputing: https://www.nas.nasa.gov/publications/npb.html

Instrumentation, tutorials, examples

  • topology: hyperthreading siblings, NUMA
  • measuring performance, core usage, latencies, cache misses, view assembly:
    • perf
    • Intel VTune
    • Apple Instruments
  • bloaty for binary size
  • perf c2c for measuring cache contention / false sharing
  • helgrind for locking

Requires changing the internals:

  • coz for causal profiling and bottleneck detection
  • relacy for race detection

Stretch goals

  • Other common benchmarks (nqueens, nbodies, LU, heat, qsort, bouncing producer-consumer, ...)
  • Porting michi (a ~550-line Go bot with parallel Monte-Carlo Tree Search, written in Python) to Nim (https://github.com/pasky/michi) and benchmarking against the C and Go implementations.

Reduce padding in the MPSC Channel + introduce a no count flag

Padding

The MPSC channel padding is very memory hungry:

ChannelMpscUnboundedBatch*[T: Enqueueable] = object
  ## Lockless multi-producer single-consumer channel
  ##
  ## Properties:
  ## - Lockless
  ## - Wait-free for producers
  ## - Consumer can be blocked by producers when they swap
  ##   the tail, the tail can grow but the consumer sees it as nil
  ##   until it's published all at once.
  ## - Unbounded
  ## - Intrusive List based
  ## - Keep an approximate count on enqueued
  ## - Support batching on both the producers and consumer side

  # TODO: pass this through Relacy and Valgrind/Helgrind
  #       to make sure there are no bugs
  #       on arch with relaxed memory models

  # Accessed by all
  count{.align: WV_CacheLinePadding.}: Atomic[int]
  # Producers and consumer slow-path
  back{.align: WV_CacheLinePadding.}: Atomic[pointer] # Workaround generic atomics bug: https://github.com/nim-lang/Nim/issues/12695
  # Consumer only - front is a dummy node
  front{.align: WV_CacheLinePadding.}: typeof(default(T)[])

WV_CacheLinePadding is 2x the cache-line size = 128 bytes, which means 384 bytes are taken. The value of 128 was chosen because Intel CPUs prefetch cache lines in pairs; Facebook's Folly also did in-depth experiments to come up with this value.

This was OK when the MPSC channel was used in a fixed manner for incoming steal requests and incoming freed memory from remote threads; however the dataflow parallelism protocol described in #92 (comment) requires allocating ephemeral MPSC channels.
If an application relies exclusively on dataflow graph parallelism, it will incur a huge memory overhead, as the memory pool only allocates 256-byte blocks.

As a compromise (hopefully) between cache invalidation prevention and memory usage, the data could be reorganized the following way:

  ChannelMpscUnboundedBatch*[T: Enqueueable] = object
    # Producers and consumer slow-path
    back{.align: WV_CacheLinePadding div 2.}: Atomic[pointer]
    # Accessed by all
    count{.align: WV_CacheLinePadding div 2.}: Atomic[int]
    # Consumer only - front is a dummy node
    front{.align: WV_CacheLinePadding div 2.}: typeof(default(T)[])

Padding is now 64 bytes for a total of 128 + sizeof(T) bytes taken, well within the memory pool block size of 256 bytes; it can be made intrusive to another data structure with 256 - 128 - sizeof(T) bytes left for metadata to save on allocations.

In terms of cache conflicts, front/back are still 2 cache lines apart and there was cache invalidation on count anyway.

The ordering (producer fields then consumer fields) assumes that:

  • if the MPSC channel is used intrusively, it's the first field of the datatype
  • the consumer is the likely owner and updater of that type.
    This may or may not be true.

Count

The count is needed for steal requests to approximate (give a lower bound on) the number of thieves in steal adaptative mode.
It is needed for remotely freed memory so that the memory pool gets a lower bound on the number of memory blocks that can be collected back into the memory arena.

However in the dataflow parallelism protocol it is not needed to keep track of the enqueued task count. Similarly, for the non-adaptative steal it is not needed to keep track of the number of steal requests.
Atomic increments/decrements are taxing as they force cache-line traffic between cores. The MPSC channel should therefore have an optional count.
Note that in the case of an optional count, the padding between back and front should go back to 2x cache lines.
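
A minimal sketch of what an optional count could look like, using a static generic parameter (the `keepCount` parameter is hypothetical, not the current Weave API; `Enqueueable`, `WV_CacheLinePadding`, `Atomic` and `moRelaxed` are assumed to come from Weave's existing internals):

type
  ChannelMpscUnboundedBatch*[T: Enqueueable; keepCount: static bool] = object
    # Producers and consumer slow-path
    back{.align: WV_CacheLinePadding div 2.}: Atomic[pointer]
    when keepCount:
      # Accessed by all; the field disappears entirely when counting is disabled
      # (in that case the back/front padding should go back to 2x cache lines).
      count{.align: WV_CacheLinePadding div 2.}: Atomic[int]
    # Consumer only - front is a dummy node
    front{.align: WV_CacheLinePadding div 2.}: typeof(default(T)[])

proc incrCount[T; keepCount: static bool](
       chan: var ChannelMpscUnboundedBatch[T, keepCount], n: int) {.inline.} =
  ## The whole body compiles to nothing when counting is disabled.
  when keepCount:
    discard chan.count.fetchAdd(n, moRelaxed)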

Status

  • Reduce Padding
  • Introduce a nocount flag

AddressSanitizer test suite

It would be nice to run AddressSanitizer in CI following #78 and #79

Currently parallel_for and parallel_for_staged work

There are a couple of limitations:

  • Nim empty strings trigger a global-buffer-overflow report in ASAN that pollutes the Weave output on fibonacci and nqueens for example.
  • The parallel_tasks file triggers a use-after-poison for reasons to be investigated.

The MPSC channel has a deadlock

The MPSC channel for StealRequest has a deadlock

Reproducible with 5 workers on Fibonacci.

It might also be worth moving to a lock-free design, even though the lock-based design is not a bottleneck.

Restrict the use of globals in functions

Ideally only the main scheduler functions access globals, besides the metrics.

While there are plenty of "globals considered harmful" articles, the main motivation here is that globals make the library much harder to test.

Scalable fuzzing and property-based testing would require independent, ideally side-effect free components that can be tested independently and in parallel.

[GCC] bad codegen / stack corruption crash on innocent code

And since we are witch-hunting the C tooling, let's report this wonderful GCC codegen bug or stack corruption (Clang doesn't crash). 100% reproducible:

(screenshots of the crash omitted)

I don't know what optimizations GCC tries to pull here but it doesn't work.
The fix is to assign pool.last = arena in a different scope from pool.first = arena

func append(pool: var TLPoolAllocator, arena: ptr Arena) {.inline.} =
  preCondition: arena.next.isNil

  debugMem:
    log("Pool 0x%.08x - TID %d - append Arena 0x%.08x\n",
      pool.addr, pool.threadID, arena)

  if pool.numArenas == 0:
    ascertain: pool.first.isNil
    ascertain: pool.last.isNil
    pool.first = arena
  else:
    arena.prev = pool.last
    pool.last.next = arena

  pool.last = arena
  pool.numArenas += 1
  arena.allocator = pool.addr

Calling an init/task/exit twice creates worker tree problem

Stacktrace when duplicating the main call in async.nim:

Sanity check 1: Printing 123456 654321 in parallel
123456 - SUCCESS
654321 - SUCCESS
Sanity check 2: fib(20)
/home/beta/Programming/Nim/weave/weave/async.nim(199) async
/home/beta/Programming/Nim/weave/weave/async.nim(192) main2
/home/beta/Programming/Nim/weave/weave/async.nim(142) async_fib
/home/beta/Programming/Nim/weave/weave/scheduler.nim(358) schedule
/home/beta/Programming/Nim/weave/weave/instrumentation/contracts.nim(74) shareWork
/home/beta/.choosenim/toolchains/nim-#devel/lib/system/assertions.nim(27) failedAssertImpl
/home/beta/.choosenim/toolchains/nim-#devel/lib/system/assertions.nim(20) raiseAssert
/home/beta/.choosenim/toolchains/nim-#devel/lib/system/fatal.nim(39) sysFatal
Error: unhandled exception: /home/beta/Programming/Nim/weave/weave/instrumentation/contracts.nim(74, 13) `
req.thiefID == myWorker().left or req.thiefID == myWorker.right` 
    Contract violated for transient condition at victims.nim:348
        req.thiefID == myWorker().left or req.thiefID == myWorker.right
    The following values are contrary to expectations:
        27 == 1 or 27 == 2
 [AssertionError]
Error: execution of an external program failed: '/home/beta/Programming/Nim/weave/build/parfor '

[Load Balancing - Backoff] Waking up children with loop tasks

From #68 (comment)

Another interesting log. When a child backs off, the parent will wake it up and send it work when it finds some (i.e. a lifeline, see Saraswat et al., Lifeline-based Global Load Balancing, http://www.cs.columbia.edu/~martha/courses/4130/au12/p201-saraswat.pdf).

This is what happens in this log: the worker even tries to split tasks for its grandchildren but ultimately only sends work to its direct child.

Worker  9: has 1 steal requests
Worker  9: found 6 steal requests addressed to its child 19 and grandchildren
Worker  9: 14 steps left (start: 0, current: 256, stop: 2000, stride: 128, 7 thieves)
Worker  9: Sending [1792, 2000) to worker 19
Worker  9: sending 1 tasks (task.fn 0x121e0a6c) to Worker 19
Worker  9: Continuing with [256, 1792)
Worker  9: waking up child 19
Matrix[1024, 128] (thread 9)
Matrix[1024, 256] (thread 9)
Matrix[1024, 384] (thread 9)
Matrix[1024, 512] (thread 9)
Matrix[1024, 640] (thread 9)
Matrix[1024, 768] (thread 9)
Matrix[1024, 896] (thread 9)
Matrix[1024, 1024] (thread 9)
Matrix[1024, 1152] (thread 9)
Matrix[1024, 1280] (thread 9)
Matrix[1024, 1408] (thread 9)
Matrix[1024, 1536] (thread 9)
Matrix[1024, 1664] (thread 9)
Worker 19: received a task with function address 0x121e0a6c (Channel 0x1c000b60)
Worker 19: running task.fn 0x121e0a6c
Matrix[1024, 1792] (thread 19)
Worker  9: sending own steal request to  0 (Channel 0x1333ce00)
Worker 15: sends state passively WAITING to its parent worker 7
Matrix[1024, 1920] (thread 19)
Worker 19: sending own steal request to  9 (Channel 0x1333d700)

Looking into the code, in all state machines we have shareWork followed by handleThieves:

weave/weave/victims.nim

Lines 249 to 305 in a1d862b

proc distributeWork(req: sink StealRequest): bool =
  ## Handle incoming steal request
  ## Returns true if we found work
  ## false otherwise

  # Send independent task(s) if possible
  if not myWorker().deque.isEmpty():
    req.dispatchElseDecline()
    return true
    # TODO - the control flow is entangled here
    #        since we have a non-empty deque we will never take
    #        the branch that leads to termination
    #        and would logically return true

  # Otherwise try to split the current one
  if myTask().isSplittable():
    if req.thiefID != myID():
      myTask().splitAndSend(req)
      return true
    else:
      req.forget()
      return false

  if req.state == Waiting:
    # Only children can send us a failed state.
    # Request should be saved by caller and
    # worker tree updates should be done by caller as well
    # TODO: disantangle control-flow and sink the request
    postCondition: req.thiefID == myWorker().left or req.thiefID == myWorker().right
  else:
    decline(req)
  return false

proc shareWork*() {.inline.} =
  ## Distribute work to all the idle children workers
  ## if we can
  while not myWorker().workSharingRequests.isEmpty():
    # Only dequeue if we find work
    let req = myWorker().workSharingRequests.peek()
    ascertain: req.thiefID == myWorker().left or req.thiefID == myWorker.right
    if distributeWork(req): # Shouldn't this need a copy?
      if req.thiefID == myWorker().left:
        ascertain: myWorker().leftIsWaiting
        myWorker().leftIsWaiting = false
      else:
        ascertain: myWorker().rightIsWaiting
        myWorker().rightIsWaiting = false
      Backoff:
        wakeup(req.thiefID)

      # Now we can dequeue as we found work
      # We cannot access the steal request anymore or
      # we would have a race with the child worker recycling it.
      discard myWorker().workSharingRequests.dequeue()
    else:
      break

shareWork will wake up a thread if distributeWork is successful. distributeWork will first look for plain tasks and then for the current splittable task. When evaluating the amount of work to send, since the child is sleeping, it takes into account the whole subtree and so only sends a small amount of tasks to the child itself if plenty of steal requests are pending.

Normally the remainder should be sent in handleThieves, but by then the child is awake, so that worker's subtree is not checked again and the tasks are never sent, leading to load imbalance.

Rewrite the state transitions as a Finite State Machine

While rewriting the proof-of-concept to separate the system into different components, it has become increasingly clear that each worker could be remodeled as a Finite State Machine (FSM), at least a Mealy machine that would be event-driven (receiving a task/steal request), and maybe a pushdown automaton with a memory of 1 parent state, as there is very strong commonality between:

  • scheduling loop
  • the barrier
  • forcing futures

weave/weave/scheduler.nim

Lines 86 to 142 in 0045f6c

proc schedulingLoop() =
  ## Each worker thread execute this loop over and over
  while not localCtx.signaledTerminate:
    # Global state is intentionally minimized,
    # It only contains the communication channels and read-only environment variables
    # There is still the global barrier to ensure the runtime starts or stops only
    # when all threads are ready.

    # 1. Private task deque
    debug: log("Worker %d: schedloop 1 - task from local deque\n", myID())
    while (let task = nextTask(childTask = false); not task.isNil):
      # Prio is: children, then thieves then us
      ascertain: not task.fn.isNil
      profile(run_task):
        run(task)
      profile(enq_deq_task):
        # The memory is reused but not zero-ed
        localCtx.taskCache.add(task)

    # 2. Run out-of-task, become a thief
    debug: log("Worker %d: schedloop 2 - becoming a thief\n", myID())
    trySteal(isOutOfTasks = true)
    ascertain: myThefts().outstanding > 0

    var task: Task
    profile(idle):
      while not recv(task, isOutOfTasks = true):
        ascertain: myWorker().deque.isEmpty()
        ascertain: myThefts().outstanding > 0
        declineAll()

    # 3. We stole some task(s)
    ascertain: not task.fn.isNil
    debug: log("Worker %d: schedloop 3 - stoled tasks\n", myID())

    let loot = task.batch
    if loot > 1:
      # Add everything
      myWorker().deque.addListFirst(task, loot)
      # And then only use the last
      task = myWorker().deque.popFirst()

    StealAdaptative:
      myThefts().recentThefts += 1

    # 4. Share loot with children
    debug: log("Worker %d: schedloop 4 - sharing work\n", myID())
    shareWork()

    # 5. Work on what is left
    debug: log("Worker %d: schedloop 5 - working on leftover\n", myID())
    profile(run_task):
      run(task)
    profile(enq_deq_task):
      # The memory is reused but not zero-ed
      localCtx.taskCache.add(task)

weave/weave/scheduler.nim

Lines 193 to 268 in 0045f6c

proc forceFuture*[T](fv: Flowvar[T], parentResult: var T) =
  ## Eagerly complete an awaited FlowVar
  let thisTask = myTask() # Only for ascertain

  block CompleteFuture:
    # Almost duplicate of schedulingLoop and sync() barrier
    if isFutReady():
      break CompleteFuture

    ## 1. Process all the children of the current tasks (and only them)
    debug: log("Worker %d: forcefut 1 - task from local deque\n", myID())
    while (let task = nextTask(childTask = true); not task.isNil):
      profile(run_task):
        run(task)
      profile(enq_deq_task):
        localCtx.taskCache.add(task)
      if isFutReady():
        break CompleteFuture
    # ascertain: myTask() == thisTask # need to be able to print tasks TODO

    # 2. Run out-of-task, become a thief and help other threads
    #    to reach children faster
    debug: log("Worker %d: forcefut 2 - becoming a thief\n", myID())
    while not isFutReady():
      trySteal(isOutOfTasks = false)
      var task: Task
      profile(idle):
        while not recv(task, isOutOfTasks = false):
          # We might inadvertently remove our own steal request in
          # dispatchTasks so resteal
          profile_stop(idle)
          trySteal(isOutOfTasks = false)
          # If someone wants our non-child tasks, let's oblige
          var req: StealRequest
          while recv(req):
            dispatchTasks(req)
          profile_start(idle)
          if isFutReady():
            profile_stop(idle)
            break CompleteFuture

      # 3. We stole some task(s)
      ascertain: not task.fn.isNil
      debug: log("Worker %d: forcefut 3 - stoled tasks\n", myID())

      let loot = task.batch
      if loot > 1:
        profile(enq_deq_task):
          # Add everything
          myWorker().deque.addListFirst(task, loot)
          # And then only use the last
          task = myWorker().deque.popFirst()

      StealAdaptative:
        myThefts().recentThefts += 1

      # Share loot with children workers
      debug: log("Worker %d: forcefut 4 - sharing work\n", myID())
      shareWork()

      # Run the rest
      profile(run_task):
        run(task)
      profile(enq_deq_task):
        # The memory is reused but not zero-ed
        localCtx.taskCache.add(task)

  LazyFV:
    # Cleanup the lazy flowvar if allocated or copy directly into result
    if not fv.lazyFV.hasChannel:
      ascertain: fv.lazyFV.isReady
      copyMem(parentResult.addr, fv.lazyFV.lazyChan.buf.addr, sizeof(parentResult))
    else:
      ascertain: not fv.lazyFV.lazyChan.chan.isNil
      fv.lazyFV.lazyChan.chan.delete()

weave/weave/runtime.nim

Lines 93 to 173 in 0045f6c

proc sync*(_: type Runtime) =
  ## Global barrier for the Picasso runtime
  ## This is only valid in the root task
  Worker: return

  debugTermination:
    log(">>> Worker %d enters barrier <<<\n", myID())

  preCondition: myTask().isRootTask()

  block EmptyLocalQueue:
    ## Empty all the tasks and before leaving the barrier
    while true:
      debug: log("Worker %d: globalsync 1 - task from local deque\n", myID())
      while (let task = nextTask(childTask = false); not task.isNil):
        # TODO: duplicate schedulingLoop
        profile(run_task):
          run(task)
        profile(enq_deq_task):
          # The memory is reused but not zero-ed
          localCtx.taskCache.add(task)

      if workforce() == 1:
        localCtx.runtimeIsQuiescent = true
        break EmptyLocalQueue

      if localCtx.runtimeIsQuiescent:
        break EmptyLocalQueue

      # 2. Run out-of-task, become a thief and help other threads
      #    to reach the barrier faster
      debug: log("Worker %d: globalsync 2 - becoming a thief\n", myID())
      trySteal(isOutOfTasks = true)
      ascertain: myThefts().outstanding > 0

      var task: Task
      profile(idle):
        while not recv(task, isOutOfTasks = true):
          ascertain: myWorker().deque.isEmpty()
          ascertain: myThefts().outstanding > 0
          declineAll()
          if localCtx.runtimeIsQuiescent:
            # Goto breaks profiling, but the runtime is still idle
            break EmptyLocalQueue

      # 3. We stole some task(s)
      debug: log("Worker %d: globalsync 3 - stoled tasks\n", myID())
      ascertain: not task.fn.isNil

      let loot = task.batch
      if loot > 1:
        profile(enq_deq_task):
          # Add everything
          myWorker().deque.addListFirst(task, loot)
          # And then only use the last
          task = myWorker().deque.popFirst()

      StealAdaptative:
        myThefts().recentThefts += 1

      # 4. Share loot with children
      debug: log("Worker %d: globalsync 4 - sharing work\n", myID())
      shareWork()

      # 5. Work on what is left
      debug: log("Worker %d: globalsync 5 - working on leftover\n", myID())
      profile(run_task):
        run(task)
      profile(enq_deq_task):
        # The memory is reused but not zero-ed
        localCtx.taskCache.add(task)

      # Restart the loop

  # Execution continues but the runtime is quiescent until new tasks
  # are created
  postCondition: localCtx.runtimeIsQuiescent

  debugTermination:
    log(">>> Worker %d leaves barrier <<<\n", myID())

Furthermore, assuming we implement a declarative DSL to generate a worker state machine, the program should be easier to extend and maintain, and control flow should be easier to follow as well.

This relies on #3.

An additional benefit is better formal verification, beyond what just using Message Passing / Communicating Sequential Processes brings.

FSMs have been used extensively on embedded devices and lots of tooling is available to analyze them. Also, there is apparently a whole branch of research on Communicating Finite State Machines that analyzes and proves properties of FSMs communicating over channels.

Low-level wise, as this would probably be a very hot path and the state->transition of an FSM actually maps semantically to gotos, dispatch could use the hidden {.goto.} pragma. Computed gotos are probably not needed because we don't need to "compute" the goto from data the way a VM computes the goto from the bytecode.
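
As a rough illustration of the dispatch such a DSL could generate (the state and event names below are made up for the sketch, not Weave's actual states):

type
  WorkerState = enum
    Working, Stealing, Sharing, Terminating
  WorkerEvent = enum
    evTaskFound, evOutOfTasks, evStealRequest, evBarrier, evTerminate

func step(state: WorkerState, ev: WorkerEvent): WorkerState =
  ## One transition of a worker modeled as a Mealy machine.
  ## A macro would emit this dispatch (possibly with {.goto.})
  ## from a declarative transition table.
  if ev == evTerminate:
    return Terminating
  result = state
  case state
  of Working:
    if ev == evOutOfTasks: result = Stealing
    elif ev == evStealRequest: result = Sharing
  of Stealing:
    if ev == evTaskFound: result = Working
  of Sharing:
    if ev == evOutOfTasks: result = Stealing
    elif ev == evTaskFound: result = Working
  of Terminating:
    discard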

Get Thread ID in a portable way

Currently the threadID is leaking into all datastructures.

It seems like "obvious facilities" do not fit:

  • pthread_self returns a pthread handle
  • gettid is Linux only
  • syscall(__NR_gettid) is Linux only (and involves a syscall)
  • pthread_getthreadid_np is BSD only
  • getThreadID is Windows-only

This is necessary to decouple the memory management and allow upstreaming it.
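
One portable alternative (a sketch, not what Weave currently does) is to hand out small dense IDs from a process-wide atomic counter and cache them in a threadvar:

import std/atomics

var nextThreadID: Atomic[int32]                 # process-wide, starts at 0
var cachedThreadID {.threadvar.}: int32
var threadIDAssigned {.threadvar.}: bool

proc portableThreadID*(): int32 =
  ## Lazily assigns a small, dense, portable ID to the calling thread.
  ## Works on any platform with threads, at the cost of one branch per call.
  if not threadIDAssigned:
    cachedThreadID = nextThreadID.fetchAdd(1, moRelaxed)
    threadIDAssigned = true
  cachedThreadID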

LookAside list:

# Stack pointer
top: T
# Mempool doesn't provide the proper free yet
freeFn*: proc(threadID: int32, t: T) {.nimcall, gcsafe.}
threadID*: int32 # TODO, memory pool abstraction leaking
# Adaptative freeing
count: int
recentAsk: int
# "closure" - This points to the proc + env currently registered in the allocator
# It is nil-ed on destruction of the lookaside list.
#
registeredAt: ptr tuple[onHeartbeat: proc(env: pointer) {.nimcall.}, env: pointer]

Mempool free requires the caller to supply its ThreadID

proc recycle*[T](myThreadID: int32, p: ptr T) {.gcsafe.} =
  ## Returns a memory block to its memory pool.
  ##
  ## This is thread-safe, any thread can call it.
  ## It must indicate its ID.
  ## A fast path is used if it's the ID of the borrowing thread,
  ## otherwise a slow path will be used.
  ##
  ## If the thread owning the pool was exited before this
  ## block was returned, the main thread should now
  ## have ownership of the related arenas and can deallocate them.

  # TODO: sink ptr T - parsing bug to raise
  #       similar to https://github.com/nim-lang/Nim/issues/12091
  preCondition: not p.isNil

  let p = cast[ptr MemBlock](p)

  # Find the owning arena
  let arena = p.getArena()

  if myThreadID == arena.meta.threadID.load(moRelaxed):
    # thread-local free
    if arena.meta.localFree.isNil:
      p.next.store(nil, moRelaxed)
      arena.meta.localFree = p
    else:
      arena.meta.localFree.prepend(p)
    arena.meta.used -= 1
    if unlikely(arena.isUnused()):
      # If an arena is unused, we can try releasing it immediately
      arena.allocator[].considerRelease(arena)
  else:
    # remote arena
    let remoteRecycled = arena.meta.remoteFree.trySend(p)
    postCondition: remoteRecycled

References
https://docs.microsoft.com/en-us/windows/win32/api/processthreadsapi/nf-processthreadsapi-getthreadid
https://stackoverflow.com/questions/21091000/how-to-get-thread-id-of-a-pthread-in-linux-c-program

Backoff deadlock with low number of threads

There seems to be a race condition here:

weave/weave/thieves.nim

Lines 187 to 204 in 42b0f80

proc lastStealAttemptFailure*(req: sink StealRequest) =
  ## If it's the last theft attempt per emitted steal requests
  ## - if we are the lead thread, we know that every other threads are idle/waiting for work
  ##   but there is none --> termination
  ## - if we are a worker thread, we message our parent and
  ##   passively wait for it to send us work or tell us to shutdown.
  if myID() == LeaderID:
    detectTermination()
    forget(req)
  else:
    req.state = Waiting
    debugTermination:
      log("Worker %2d: sends state passively WAITING to its parent worker %d\n", myID(), myWorker().parent)
    sendShare(req)
    ascertain: not myWorker().isWaiting
    myWorker().isWaiting = true
    myParking().wait() # Thread is blocked here until woken up.

The child sends the steal request and then goes idle.

But what if the parent checks the steal request and sends a wake-up signal to the child before it has time to actually go to sleep, for example to signal termination:

weave/weave/signals.nim

Lines 42 to 56 in 42b0f80

proc signalTerminate*(_: pointer) =
  preCondition: not localCtx.signaledTerminate

  # 1. Terminating means everyone ran out of tasks
  #    so their cache for task channels should be full
  #    if there were sufficiently more tasks than workers
  # 2. Since they have an unique parent, no one else sent them a signal (checked in asyncSignal)
  if myWorker().left != Not_a_worker:
    # Send the terminate signal
    asyncSignal(signalTerminate, globalCtx.com.tasks[myWorker().left].access(0))
    # Wake the worker up so that it can process the terminate signal
    wakeup(myWorker().left)
  if myWorker().right != Not_a_worker:
    asyncSignal(signalTerminate, globalCtx.com.tasks[myWorker().right].access(0))
    wakeup(myWorker().right)

The parent then exits and is deadlocked at the exit barrier, while the child worker is deadlocked sleeping forever.

Implement the parallel loop API and logic

The API will stay WIP; ideally we can replicate the Nim OpenMP API.

However this requires capturing the environment from a macro; apparently there is a magic called liftLocal that may be able to do that. Alternatively we could use the "owner" macro, but that would require some acrobatics as it can only be used on a proc, so we would need to pack everything into a proc and then call "owner".

In the meantime we can implement the following API, which requires the developer to wrap their code in a proc that begins with a magic forEach template:

proc display_range() =
  forEach(i):
    log("%d (thread %d)\n", i, ID)
  log("Thread %d - SUCCESS\n", ID)

proc main() =
  tasking_init()
  async_for 0..100, display_range()
  tasking_barrier()
  tasking_exit()

main()

Evaluate if StealRequest could be pointer object

Having steal requests as pointer objects with ownership passed over channels, instead of deep copies, might reduce the overhead of the framework:

  • Lock-free list-based MPSC queues have a wealth of literature; this would avoid having to debug #6 (though lock-free debugging is hard as well)
  • A bitset for 256 victims takes 32 bytes; with the rest of the data, the StealRequest size becomes quite big.
  • We could use a more efficient set suitable for random picking, but that would use much more space (513 bytes instead of 32 for 256 max workers, see #5 (comment))

One useful property is that steal request destruction only happens in the creating worker, and since each steal request maps to a unique task channel, we would know, when the task is sent back, which steal request to destroy. Workers can only dispose of their own steal requests.

This brings cheaper copies in the MPSC channel and an easier lock-free implementation, at the cost of a guaranteed cache miss when reading the steal request and potentially increased latency on NUMA.
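
For illustration only (the field names below are made up for the sketch, not Weave's actual layout), the pointer-based variant could look like:

import std/atomics

type
  StealRequestObj = object
    next: Atomic[pointer]      # intrusive link for a list-based lockless MPSC queue
    thiefID: int32             # who to send the task(s) back to
    retry: int32
    state: uint8               # Working / Stealing / Waiting
    victims: array[32, byte]   # bitset for up to 256 potential victims
  StealRequest = ptr StealRequestObj

The intrusive next field means the same allocation serves both as the message and as the queue link, so sending a request is a pointer exchange rather than a copy of the whole object.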

BLAS matmul vs Laser, Intel MKL, MKL-DNN, OpenBLAS and co

Somewhat related to #31 and very probably related to #35

By giving Laser (OpenMP-based) the time to get the CPU to full speed, it can reach 2.23 TFlops on my CPU (see below; note that nimsuggest was running at 100% during the measurement).

Backend:                        Laser (Pure Nim) + OpenMP
Type:                           float32
A matrix shape:                 (M: 1920, N: 1920)
B matrix shape:                 (M: 1920, N: 1920)
Output shape:                   (M: 1920, N: 1920)
Required number of operations: 14155.776 millions
Required bytes:                   29.491 MB
Arithmetic intensity:            480.000 FLOP/byte
Theoretical peak single-core:    224.000 GFLOP/s
Theoretical peak multi:         4032.000 GFLOP/s

Laser production implementation
Collected 300 samples in 2245 ms
Average time: 6.327 ms
Stddev  time: 2.929 ms
Min     time: 5.000 ms
Max     time: 48.000 ms
Perf:         2237.478 GFLOP/s

Intel MKL is at 2.85 TFlops

Backend:                        Intel MKL
Type:                           float32
A matrix shape:                 (M: 1920, N: 1920)
B matrix shape:                 (M: 1920, N: 1920)
Output shape:                   (M: 1920, N: 1920)
Required number of operations: 14155.776 millions
Required bytes:                   29.491 MB
Arithmetic intensity:            480.000 FLOP/byte
Theoretical peak single-core:    224.000 GFLOP/s
Theoretical peak multi:         4032.000 GFLOP/s

Intel MKL benchmark
Collected 300 samples in 1842 ms
Average time: 4.970 ms
Stddev  time: 3.956 ms
Min     time: 4.000 ms
Max     time: 70.000 ms
Perf:         2848.245 GFLOP/s

OpenBLAS is at 1.52 TFlops

Backend:                        OpenBLAS
Type:                           float32
A matrix shape:                 (M: 1920, N: 1920)
B matrix shape:                 (M: 1920, N: 1920)
Output shape:                   (M: 1920, N: 1920)
Required number of operations: 14155.776 millions
Required bytes:                   29.491 MB
Arithmetic intensity:            480.000 FLOP/byte
Theoretical peak single-core:    224.000 GFLOP/s
Theoretical peak multi:         4032.000 GFLOP/s

OpenBLAS benchmark
Collected 300 samples in 3124 ms
Average time: 9.300 ms
Stddev  time: 2.637 ms
Min     time: 9.000 ms
Max     time: 40.000 ms
Perf:         1522.126 GFLOP/s

Unfortunately Weave is stuck at 0.6 TFlops.

Backend:                        Weave (Pure Nim)
Type:                           float32
A matrix shape:                 (M: 1920, N: 1920)
B matrix shape:                 (M: 1920, N: 1920)
Output shape:                   (M: 1920, N: 1920)
Required number of operations: 14155.776 millions
Required bytes:                   29.491 MB
Arithmetic intensity:            480.000 FLOP/byte
Theoretical peak single-core:    224.000 GFLOP/s
Theoretical peak multi:         4032.000 GFLOP/s

Weave implementation
Collected 300 samples in 7686 ms
Average time: 24.553 ms
Stddev  time: 4.426 ms
Min     time: 17.000 ms
Max     time: 60.000 ms
Perf:         576.532 GFLOP/s

First look

(profiler screenshot omitted)

On the right, more than half of the instructions for Weave are not floating-point instructions.
Also it seems like the CPU frequency is at 2.7GHz in the case of Weave while Laser reaches 3.6GHz. My BIOS settings are 4.1GHz all-core turbo for normal code, 4.0GHz for AVX2 and 3.5GHz for AVX512 code.

Call stack

(call-stack screenshot omitted)

Looking into the call stack, it seems like we are paying the price of dynamic scheduling :/ as the time spent in the GEMM kernel itself is similar.

Conclusion

We have very high overhead for coarse-grained parallelism where OpenMP shines. I suspect this is also the reason behind #35.

Suspicions

Additionally, I suspect the runtime is polluting the L1 cache with runtime data while GEMM has been carefully tuned to take advantage of the L1 and L2 caches. Using smaller tiles might help (similarly, avoiding hyperthreading, which has the same polluting effect, would probably help as well).
